Windows PerfCounters and Powershell - CPU perf data

So far, I have talked of WMI, CIM, WQL, System.Diagnostics.PerformanceCounterCategory, and perf-counter data organization and flavour. Now it's time to take a closer look at some performance counters I deem important for my use-case.
Note: To list the available Counter sets for the Get-Counter command:
Get-Counter -ListSet * | Sort-Object CounterSetName | Format-Table CounterSetName

Basic concepts:

I will now introduce the basic concepts of Processor, Core and CPU to help you follow the text. Let us use this convention:
  • "Processor" is a piece of hardware you connect to a slot on the motherboard.
  • "Physical Core" is a physical computing unit built into the "Processor".
  • "Virtual Core" is a virtual computing unit built on top of a "Physical Core" (i.e. when HT is ON).
  • "CPU" is a computing unit inside the "Processor", either physical or virtual.

Putting concepts to work

Now let's calculate the number of CPUs for my laptop:
PS > ((Get-CimInstance -Namespace root/CIMV2 -ClassName CIM_Processor).NumberOfLogicalProcessors | Measure-Object -Sum).Sum


Note: Many other counters fail on some HW configurations and/or OS versions! Be sure to check.
Note: HT is ON on my dual-core laptop and no cores are parked, so to get the number of Physical cores:
PS > ((Get-CimInstance -Namespace root/CIMV2 -ClassName CIM_Processor).NumberOfCores | Measure-Object -Sum).Sum


Note: There are many ways to collect this info:
PS > (Get-CimInstance Win32_ComputerSystem).NumberOfLogicalProcessors
PS > ((New-Object Diagnostics.PerformanceCounterCategory("Processor Information")).GetInstanceNames() | ?{$_ -match "^\d+,\d+$"} | Measure-Object).Count
Note: The RegEx matches "Number,Number" instances only (see the previous blog about instances). Note that Measure-Object -Sum would choke on these string instance names, so we only count them.

This is not obvious on a box with a single NUMA node/Slot, but depending on the RegEx the count might cover only the CPUs of one Slot rather than the whole system.
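To see how the CPUs spread across Slots, a minimal sketch that groups the "Processor Information" instance names by their Slot/NUMA node number (the "slot,cpu" naming follows the RegEx above):

```powershell
$cat = New-Object Diagnostics.PerformanceCounterCategory("Processor Information")
$cat.GetInstanceNames() |
    Where-Object { $_ -match '^\d+,\d+$' } |      # keep only "slot,cpu" instances
    Group-Object { ($_ -split ',')[0] } |         # group by the slot/NUMA node number
    ForEach-Object { "Slot $($_.Name): $($_.Count) CPUs" }
```

On a single-slot machine this prints one line; on a multi-slot box you get a per-slot CPU count.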

Before starting on Counters, let me stress that measurements at the system, process and thread level in Windows are based on a sampling methodology, so the gathered data is subject to typical sampling errors like:
  • accumulating a "sufficient" number of sample observations to be able to make a reliable statistical inference, i.e. the sample size
  • ensuring that there are no systemic sources of sampling error that cause results to be under- or over-sampled, as I will demonstrate shortly.

As of W2K8, the trend is shifting towards event-driven measurement of CPU utilization which, although saner and more accurate, poses its own set of challenges (say, clock drift across multiprocessor cores when they are not resynchronized periodically, and so on). To compensate for drift, the new PerfMon/ResMon work by measuring CPU load in real time using event-oriented measurement data gathered by the OS Scheduler each time a context switch occurs.
A context switch occurs in Windows whenever the processor switches its execution context to run a different thread (see more below). Context switches also occur as a result of high priority Interrupt Service Routines (ISRs) as well as the Deferred Procedure Calls (DPCs) that ISRs schedule to complete the interrupt processing. Starting in Windows 6 (Vista/2008), the OS Scheduler began issuing RDTSC instructions to get the internal processor clock each time a context switch occurs. I will talk of context switching and DPC counters in a short while. For more details please see this excellent blog post.

System CPU counters:

The first counter I want to talk about is Processor Queue Length. A Linux user will immediately observe that there is no "System load" counter on Windows. This is because Windows is thread based as opposed to Linux, which is process based. This simply means that, in Windows, a thread is the basic unit of execution (and thus the basis for collecting usage statistics too) and a process acts as a container for threads. As simple as it may seem, this actually poses a lot of challenges, since one has to aggregate data about running processes from Thread counters and work their way up. I will talk about this in detail in the final blog. So, the counter mimicking the Linux "System load" best is, IMO, Processor Queue Length:
PS > Get-Counter '\System\Processor Queue Length'

Timestamp                 CounterSamples                                                      
---------                 --------------                                                      
23.10.15. 10:34:10        \\server_name\system\processor queue length : 1                                          
However, this is slooooow (although subsequent calls return much faster):
PS > Measure-Command { Get-Counter '\System\Processor Queue Length' }

TotalSeconds      : 4.2961321

PS > Measure-Command { Get-Counter '\System\Processor Queue Length' }

TotalSeconds      : 1.007445
So, as described in the previous blog, I use the System.Diagnostics.PerformanceCounter class to fetch this value:
PS > Measure-Command { New-Object Diagnostics.PerformanceCounter("System", "Processor Queue Length")}

TotalSeconds      : 2.0006457

PS > Measure-Command { New-Object Diagnostics.PerformanceCounter("System", "Processor Queue Length")}

TotalSeconds      : 0.000643
Now, put this into a variable and simply call NextValue():
PS > $System_ProcQL = New-Object Diagnostics.PerformanceCounter("System", "Processor Queue Length")
PS > $System_ProcQL.NextValue()
PS > $System_ProcQL.NextValue()
The value obtained is for all of the CPUs, so you need the number of CPUs as a divider to obtain the per-CPU value:
$SystemLoad = $System_ProcQL.NextValue() / $totCPU
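Putting the pieces above together, a minimal sketch (the divider comes from the CIM query shown earlier):

```powershell
# Total number of logical CPUs, summed across Slots
$totCPU = ((Get-CimInstance -Namespace root/CIMV2 -ClassName CIM_Processor).NumberOfLogicalProcessors |
           Measure-Object -Sum).Sum

# Instantaneous counter: a single NextValue() call is enough
$System_ProcQL = New-Object Diagnostics.PerformanceCounter("System", "Processor Queue Length")

# Normalize the queue length by the number of CPUs, analogous to Linux "load per CPU"
$SystemLoad = $System_ProcQL.NextValue() / $totCPU
"{0:N2}" -f $SystemLoad
```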

In-depth description of System Counters important for my use-case:

COUNTER: System\Processor Queue Length
TYPE: Instantaneous
USAGE: New-Object Diagnostics.PerformanceCounter("System", "Processor Queue Length") / ((Get-CimInstance -Namespace root/CIMV2 -ClassName CIM_Processor).NumberOfLogicalProcessors | Measure-Object -Sum).Sum
MEANING: Number of threads per CPU that are ready for execution but can't get CPU cycles for whatever reason, thus waiting in the OS Scheduler queue. Since Windows has a single Scheduler queue, I divide this value by the total number of computation units (i.e. CPUs). The actual mechanics is that, when the Counter value is requested, a measurement function traverses the Scheduler Ready Queue and counts the number of threads waiting for an available CPU.
GOTCHA: Even on an idle system there can be a significant number of threads running on a schedule that can bump this number very high. Say you have a 4-CPU box and processes fetching values for 100 counters, 10 samples every 1 second. All of these sample requests will lie sleeping for 1 second (so the Processor Queue Length value will be low) and then all will wake up at the same timer event (clock interrupt), causing Processor Queue Length to spike although there is no real load on the system. It's even worse if your threads are of high priority, as they will get executed sooner than user threads, pushing the Processor Queue Length number very high. This leads to a disproportionate number of Ready Threads waiting for cycles, even (or especially) when the processor itself is not very busy overall. So tip 1 would be to check whether the CPUs are really busy or not.
THRESHOLD: Given the above, it is hard to tell what the threshold value is, but most people seem to agree it's a "sustained value of 2 or more" with CPU utilization of 85%+. This combination tells us we can benefit from adding more CPUs.

COUNTER: System\Context Switches/sec
TYPE: Rate (difference counter, reported per second)
$System_CSpS = New-Object Diagnostics.PerformanceCounter("System", "Context Switches/sec")

MEANING: Context switching happens when a higher-priority thread pre-empts a lower-priority thread that is currently running, or when a running thread blocks. High levels of context switching can occur when many threads share the same priority level. This often indicates that there are too many threads competing for the processors on the system. If you do not see much processor utilization and you see very low levels of context switching, it could indicate that threads are blocked.
GOTCHA: The number obtained is system-wide! To report the total number of context switches generated per second by all threads, use the Thread(_Total)\Context Switches/sec counter (Category(Instance)\Counter):
New-Object Diagnostics.PerformanceCounter("Thread", "Context Switches/sec", "_Total")
THRESHOLD: Context switching rates in excess of 15,000 per second per CPU are generally considered high. The remedy would be to reduce the number of threads and queue more work at the application level. This causes less context switching, and less context switching is good for reducing CPU load.
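Since this is a rate counter, the first NextValue() call only primes the counter and returns 0; you need two reads over an interval. A sketch of the two-sample pattern (the 1-second interval is an arbitrary choice, pick what fits your polling loop):

```powershell
$System_CSpS = New-Object Diagnostics.PerformanceCounter("System", "Context Switches/sec")

$null = $System_CSpS.NextValue()      # first call just records the raw value
Start-Sleep -Seconds 1                # let the counter accumulate over a sample interval
$csPerSec = $System_CSpS.NextValue()  # difference over the interval, already per second

"Context switches/sec: {0:N0}" -f $csPerSec
```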

In-depth description of CPU Counters important for my use-case:

Note: The "Processor Information" category, besides the overall _Total, has instances per Slot/NUMA node (0,_Total; n,_Total), while the "Processor" category gives just one _Total for all CPUs as defined above.
Gotcha: On single-slot machines, the "Processor" category will give info for all the CPUs, while on machines with multiple slots it will give info on just the Physical cores :-/
Thus, if InstanceName is _Total, both categories yield the same value.

COUNTER: Processor Information(_Total)\% Processor Time, Processor(_Total)\% Processor Time
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_PT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% Processor Time")
$PI_PT.InstanceName = $InstanceName
$null = $PI_PT.NextValue()
Get-Counter -Counter "\Processor Information(_Total)\% Processor Time"
Get-Counter -Counter "\Processor(_Total)\% Processor Time"

MEANING: Primary indicator of CPU activity. High values may not necessarily be bad. However, if other processor-related counters, such as Processor\% Privileged Time or System\Processor Queue Length, are increasing linearly, high CPU utilization may be worth investigating.
GOTCHA: If this counter is around the threshold value, starting new processes will only increase the Processor Queue Length value, but the work done will remain the same. Look at some more counters that I'm about to describe in relation to this one.
THRESHOLD: Folks seem to agree on ~85%. Low CPU utilization with a sustained Processor Queue Length value of 2 or higher indicates that requests for CPU time arrive randomly and threads demand irregular amounts of time from the CPU. This means that processor power is not the bottleneck and that the application threading logic should be improved.
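The $null = $PI_PT.NextValue() priming call above matters: sampled percentage counters need two reads, and the first one has no interval to compute a percentage over. A sketch that samples _Total utilization over one second (interval length is an assumption):

```powershell
$PI_PT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% Processor Time", "_Total")

$null = $PI_PT.NextValue()   # prime: the first read always returns 0
Start-Sleep -Seconds 1       # sample interval
$cpuPct = $PI_PT.NextValue() # percentage over the elapsed interval

"CPU: {0:N1} %" -f $cpuPct
```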

COUNTER: Processor Information(_Total)\% Privileged Time
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_PPT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% Privileged Time")
$PI_PPT.InstanceName = $InstanceName
$null = $PI_PPT.NextValue()

MEANING: Counter indicates the percentage of non-idle CPU time spent in privileged mode, i.e. calls to OS functions (file or network I/O, memory allocation...). Basically, this is unrestricted mode allowing direct access to hardware and all memory.
THRESHOLD: Folks seem to agree on consistently being over 75%.

COUNTER: Processor Information(_Total)\% User Time
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_PUT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% User Time")
$PI_PUT.InstanceName = $InstanceName
$null = $PI_PUT.NextValue()

MEANING: Percentage of non-idle CPU time spent in user mode. User mode is a restricted processing mode designed for applications, environment subsystems, and integral subsystems.
GOTCHA: Processor Information(_Total)\% Privileged Time + Processor Information(_Total)\% User Time = Processor Information(_Total)\% Processor Time.
THRESHOLD: Depends on previous two counters.
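The identity above can be sanity-checked by sampling all three counters over the same interval (a sketch; small differences are expected since the reads are not atomic):

```powershell
$names = "% Privileged Time", "% User Time", "% Processor Time"
$counters = $names | ForEach-Object {
    New-Object Diagnostics.PerformanceCounter("Processor Information", $_, "_Total")
}
$counters | ForEach-Object { $null = $_.NextValue() }  # prime all three sampled counters
Start-Sleep -Seconds 1
$priv, $user, $proc = $counters | ForEach-Object { $_.NextValue() }

"Privileged {0:N1} + User {1:N1} = {2:N1} (reported: {3:N1})" -f $priv, $user, ($priv + $user), $proc
```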

COUNTER: Processor Information(_Total)\% Idle Time
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_PIT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% Idle Time")
$PI_PIT.InstanceName = $InstanceName
$null = $PI_PIT.NextValue()

MEANING: The counter indicates the percentage of time the OS idle thread was consuming cycles. On Windows, there is a special Kernel thread that consumes cycles when the CPU is idling. Counting the cycles consumed by this thread gives the Idle CPU time.
GOTCHA: Processor Information(_Total)\% Processor Time + Processor Information(_Total)\% Idle Time = 100%

COUNTER: Processor Information(_Total)\% Priority Time
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_PPRIOT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% Priority Time")
$PI_PPRIOT.InstanceName = $InstanceName
$null = $PI_PPRIOT.NextValue()

MEANING: CPU utilization by high priority threads.
GOTCHA: The Kernel scheduler can, on occasion, wake up low-priority threads that have been sleeping for a "long" time, assigning them many more time slices on the CPU than one would expect given their (low) priority. This, in turn, blocks high-priority threads from execution, which is never the expected behaviour. I would look at this value in relation to Context Switches/sec to determine what's going on.

COUNTER: Processor Information\Interrupts/sec
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_INTPS = New-Object Diagnostics.PerformanceCounter("Processor Information", "Interrupts/sec")
$PI_INTPS.InstanceName = $InstanceName
$null = $PI_INTPS.NextValue()

MEANING: Number of hardware interrupts per second. This value is the indicator of the activity of devices that generate interrupts, such as network adapters.
GOTCHA: See next counter.

COUNTER: Processor Information\% Interrupt Time
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_PINTT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% Interrupt Time")
$PI_PINTT.InstanceName = $InstanceName
$null = $PI_PINTT.NextValue()

MEANING: The value indicates the percentage of time CPUs spend receiving and servicing hardware interrupts. This value is an indirect indicator of the activity of devices that generate interrupts, such as network adapters.
GOTCHA: A sharp increase in Processor Information\Interrupts/sec and Processor Information\% Interrupt Time indicates potential hardware problems.

COUNTER: Processor Information\DPCs Queued/sec
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_DPCQPS = New-Object Diagnostics.PerformanceCounter("Processor Information", "DPCs Queued/sec")
$PI_DPCQPS.InstanceName = $InstanceName
$null = $PI_DPCQPS.NextValue()

MEANING: Overall rate at which deferred procedure calls ("SW interrupts") are added to the processor's DPC queue. This property measures the rate at which DPCs are added to the queue, not the number of DPCs in the queue.
GOTCHA: This is NOT the number of SW interrupts in the queue!

COUNTER: Processor Information\% DPC Time
TYPE: Sample, Instance
$InstanceName = "_Total"
$PI_PDPCT = New-Object Diagnostics.PerformanceCounter("Processor Information", "% DPC Time")
$PI_PDPCT.InstanceName = $InstanceName
$null = $PI_PDPCT.NextValue()

MEANING: Percentage of time that the processor spent receiving and servicing deferred procedure calls (SW interrupts) during the sample interval. DPCs are counted separately and are not a component of the interrupt counters.
GOTCHA: This property is a component of PercentPrivilegedTime because DPCs are executed in privileged mode.

Other useful counters I would look into in case of trouble are C1/C2/C3 Transitions/sec. There is a huge penalty waking a CPU up from the C3 low-power state to the C2 low-power state, and a considerable penalty transitioning from C2 to C1. So if the box is choking and CPUs are idling, look here. And make sure the ParkingStatus for each CPU is 0 ;-)
Example: (Physical) CPU 9 in Slot 7 was asleep:
PS > GCim Win32_PerfFormattedData_Counters_ProcessorInformation
Name                        : 7,9
AverageIdleTime             : 100
C3TransitionsPersec         : 64
ClockInterruptsPersec       : 64
IdleBreakEventsPersec       : 64
InterruptsPersec            : 64
PercentC3Time               : 99
Basically, only processing timer events.

There are also combinations of counters that can point out problems, like Processor\% DPC Time, % Interrupt Time and % Privileged Time. If Interrupt Time and DPC Time are a large portion of Privileged Time, the kernel is spending a significant amount of time processing (most likely) I/O requests. In some cases performance can be improved by configuring interrupt and DPC affinity to a small number of CPUs on a multiprocessor system, which improves cache locality. In other cases, it works best to distribute the interrupts and DPCs among many CPUs, so as to keep the interrupt and DPC activity from becoming a bottleneck.
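A quick way to eyeball that ratio (a sketch using the "Processor Information" category; the 1-second sample interval is arbitrary):

```powershell
$names = "% Privileged Time", "% Interrupt Time", "% DPC Time"
$c = $names | ForEach-Object {
    New-Object Diagnostics.PerformanceCounter("Processor Information", $_, "_Total")
}
$c | ForEach-Object { $null = $_.NextValue() }  # prime the sampled counters
Start-Sleep -Seconds 1
$priv, $intr, $dpc = $c | ForEach-Object { $_.NextValue() }

if ($priv -gt 0) {
    # If this share is large, the kernel is mostly servicing interrupts and DPCs (likely I/O)
    "Interrupt+DPC share of Privileged Time: {0:P0}" -f (($intr + $dpc) / $priv)
}
```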

In the next blog I will cover Memory performance counters.

In this series:
BLOG 1: PerfCounters infrastructure
BLOG 2: PerfCounters Raw vs. Formatted values
BLOG 3: PerfCounters, fetching the values
BLOG 4: PerfCounters, CPU perf data
BLOG 5: PerfCounters, Memory perf data
BLOG 6: PerfCounters, Disk/IO perf data
BLOG 7: PerfCounters, Network and Contention perf data
