2015-12-07

Windows PerfCounters and PowerShell - Disk & IO perf data

This post is the hardest for me to write, as I generally pay little attention to disks. When they prove too slow, I replace them with faster ones. So now I am writing this on a laptop with two SSDs. That said, the disk subsystem can be a major system performance bottleneck, and so there are numerous counters covering this area (Get-CimClass *disk* | Select CimClassName). I would also like to draw your attention to an old yet excellent article, Top Six FAQs on Windows 2000 Disk Performance, if you're interested in the subject.

Disk counters:

Note: Microsoft recommends that "when attempting to analyse disk performance bottlenecks, you should always use physical disk counters. However, if you use software RAID, you should use logical disk counters. As for Logical Disk and Physical Disk Counters, the same values are available in each of these counter objects. Logical disk data is tracked by the volume manager(s), and physical disk data is tracked by the partition manager."

The one I look at most is Disk Queue Length, which comes in two flavours: Average and Current.
COUNTER: Win32_PerfFormattedData_PerfDisk_PhysicalDisk\AvgDiskQueueLength (AvgDiskReadQueueLength)
TYPE: Sample, Instance
USAGE:
PS > Get-CimInstance Win32_PerfFormattedData_PerfDisk_PhysicalDisk | Where {$_.Name -eq '_Total'} |
 Select AvgDiskQueueLength, CurrentDiskQueueLength | FL

AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0
MEANING: Average number of read and write requests that were queued and waiting for the selected disk during the sample interval, including requests in service. Since I used the "_Total" instance, I need to divide the value by the number of physical disks on the system. PerfMon shows this value per logical disk.
PS > Get-CimInstance Win32_PerfFormattedData_PerfDisk_LogicalDisk |
 Select Name, AvgDiskQueueLength, CurrentDiskQueueLength | FL

Name                   : HarddiskVolume1 #Boot image on Physical disk 1
AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0

Name                   : C: #Boot partition on Physical disk 1
AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0

Name                   : D: #Partition on Physical disk 1
AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0

Name                   : E: #Partition on Physical disk 2
AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0

Name                   : G: #Partition on Physical disk 2
AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0

Name                   : _Total
AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0 
GOTCHA: Since both "pending" and "in service" requests are counted, this counter might overstate the activity.
THRESHOLD: If more than 2 requests are continuously waiting on a single disk, the disk might be a bottleneck. To analyse queue length data further, use its components: AvgDiskReadQueueLength and AvgDiskWriteQueueLength.
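
As a rough illustration of the threshold check, the sketch below looks at each disk instance and flags any whose average queue length exceeds 2. The function name Test-DiskQueueLength and the OverThreshold property are hypothetical, made up for this example. A single sample cannot show that requests are waiting continuously, so treat a flag as a hint to keep watching rather than as a verdict.

```powershell
# Hypothetical helper (not a built-in cmdlet): flags disk instances whose
# average queue length exceeds a threshold (2 by default, as in THRESHOLD).
function Test-DiskQueueLength {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object] $Disk,
        [double] $Threshold = 2
    )
    process {
        # Skip the aggregate instance; its value would first have to be
        # divided by the number of disks, as noted in MEANING.
        if ($Disk.Name -eq '_Total') { return }

        [pscustomobject]@{
            Name               = $Disk.Name
            AvgDiskQueueLength = $Disk.AvgDiskQueueLength
            OverThreshold      = $Disk.AvgDiskQueueLength -gt $Threshold
        }
    }
}
```

On a live system the per-disk instances can be piped straight in:
PS > Get-CimInstance Win32_PerfFormattedData_PerfDisk_PhysicalDisk | Test-DiskQueueLength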

COUNTER: Win32_PerfFormattedData_PerfDisk_PhysicalDisk\CurrentDiskQueueLength
TYPE: Instantaneous, Instance
USAGE:
PS > Get-CimInstance Win32_PerfFormattedData_PerfDisk_PhysicalDisk | Where {$_.Name -eq '_Total'} |
 Select AvgDiskQueueLength, CurrentDiskQueueLength | FL

AvgDiskQueueLength     : 0
CurrentDiskQueueLength : 0
MEANING: Number of requests outstanding on the disk at the time the performance data is collected. It includes requests being serviced at the time of data collection. The value represents an instantaneous length, not an average over a time interval. Multispindle disk devices can have multiple requests active at one time, but other concurrent requests await service. This property may reflect a transitory high or low queue length. If the disk drive has a sustained load, the value will be consistently high. Requests experience delays proportional to the length of the queue minus the number of spindles on the disks. This difference should average less than two for good performance.
GOTCHA:
THRESHOLD: 2 requests in the queue for a prolonged period of time on a single disk (spindle).
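
Because the value is instantaneous, one reading says little; the "queue length minus spindles" rule only makes sense over several samples. Here is a minimal sketch of that averaging. Get-AverageExcessQueue is a hypothetical name, and clamping negative differences to zero is my own assumption (an idle spindle does not give a negative delay).

```powershell
# Hypothetical helper: averages (queue length - spindle count) over a set
# of samples. Negative differences are clamped to 0 (an assumption).
function Get-AverageExcessQueue {
    param(
        [Parameter(Mandatory)]
        [double[]] $QueueSample,
        [int] $SpindleCount = 1
    )
    $excess = foreach ($q in $QueueSample) {
        [math]::Max(0, $q - $SpindleCount)
    }
    ($excess | Measure-Object -Average).Average
}

# Example on a live system (one sample per second for 10 seconds):
# $samples = 1..10 | ForEach-Object {
#     (Get-CimInstance Win32_PerfFormattedData_PerfDisk_PhysicalDisk `
#         -Filter "Name='_Total'").CurrentDiskQueueLength
#     Start-Sleep -Seconds 1
# }
# Get-AverageExcessQueue -QueueSample $samples -SpindleCount 2
```

A result that stays at 2 or above across repeated runs would match the threshold described here.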

Inner workings of measurement collection:

Values are mostly derived by the diskperf filter driver, which provides disk performance statistics. Diskperf is a layer of software sitting in the disk driver stack. As I/O request packets (IRPs) pass through this layer, diskperf keeps track of the time each I/O starts and the time it finishes. On the way to the device, diskperf records a timestamp for the IRP. On the way back from the device, the completion time is recorded. The difference is the duration of the I/O request. Averaged over the collection interval, this becomes Avg. Disk sec/Transfer, a direct measure of disk response time from the point of view of the device driver. Diskperf also maintains byte counts and separate counters for reads and writes, at both the Logical and Physical disk level, allowing Avg. Disk sec/Transfer to be broken out into reads and writes. This layer does add to latency, but not significantly (up to 5%). Now that we know the mechanics, back to PhysicalDisk\Avg. Disk Queue Length and why queued and in-service requests are counted together.
So the AvgDiskQueueLength counter is useful for gathering concurrency data, including data bursts and peak loads. These values represent the number of requests in flight below the driver taking the statistics. This means the requests are not necessarily queued but could actually be in service, or completed and on the way back up the path. Possible in-flight locations include the following:
  • SCSIport or Storport queue
  • OEM driver queue
  • Disk controller queue
  • Hard disk queue
  • Actively receiving from a hard disk
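
To make the timestamp bookkeeping described above more concrete, here is a toy model of it. This is not how diskperf is implemented; the request objects, their Started/Completed properties (in seconds) and both function names are hypothetical, invented for illustration only.

```powershell
# Toy model: each request carries a start and a completion timestamp,
# as if recorded on the way down to and back up from the device.

# Average of (Completed - Started), i.e. an "Avg. Disk sec/Transfer".
function Get-AvgSecPerTransfer {
    param([Parameter(Mandatory)] [object[]] $Request)
    $durations = foreach ($r in $Request) { $r.Completed - $r.Started }
    ($durations | Measure-Object -Average).Average
}

# Requests that have started but not yet completed at time $At. They are
# counted as "in flight" whether they are queued or actually in service.
function Get-InFlightCount {
    param(
        [Parameter(Mandatory)] [object[]] $Request,
        [Parameter(Mandatory)] [double] $At
    )
    @($Request | Where-Object {
        $_.Started -le $At -and $_.Completed -gt $At
    }).Count
}

# Made-up sample data: three overlapping requests.
$requests = @(
    [pscustomobject]@{ Started = 0.000; Completed = 0.010 },
    [pscustomobject]@{ Started = 0.002; Completed = 0.008 },
    [pscustomobject]@{ Started = 0.005; Completed = 0.025 }
)
Get-AvgSecPerTransfer -Request $requests
Get-InFlightCount -Request $requests -At 0.006
```

The second function shows why the queue length counters cannot tell "waiting" from "being serviced": from this layer's point of view both are simply requests that have not come back yet.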

Brief account of some other counters:

COUNTER: PhysicalDisk\Disk Writes/sec
MEANING: This counter indicates the rate of write operations on the disk.
THRESHOLD: Depends on manufacturer’s specifications.

COUNTER: PhysicalDisk\Split IO/sec
MEANING: Reports the rate at which the operating system divides I/O requests to the disk into multiple requests. A split I/O request might occur if the program requests data in a size that is too large to fit into a single request or if the disk is fragmented. Factors that influence the size of an I/O request can include application design, the file system, or drivers. A high rate of split I/O might not, in itself, represent a problem. However, on single-disk systems, a high rate for this counter tends to indicate disk fragmentation.
More info on MSDN.

Disk and Memory:

Because memory is cached to disk as physical memory becomes limited, make sure that you have a sufficient amount of memory available. When memory is scarce, more pages are written to disk, resulting in increased disk activity. Also, make sure to set the paging file to an appropriate size. Additional disk memory cache will help offset peaks in disk I/O requests. However, it should be noted that a large disk memory cache seldom solves the problem of not having enough spindles, and having enough spindles can negate the need for a large disk memory cache.

COUNTER: PhysicalDisk\Avg. Disk sec/Transfer
MEANING: This counter indicates the time, in seconds, of the average disk transfer. This may indicate a large amount of disk fragmentation, slow disks, or disk failures.
GOTCHA: Multiply the values of the Physical Disk\Avg. Disk sec/Transfer and Memory\Pages/sec counters. If the product of these counters exceeds 0.1, paging is taking more than 10% of disk access time, so you need more physical memory available.
THRESHOLD: Should not be more than 18 milliseconds.

COUNTER: Memory\Pages/sec
MEANING: This counter indicates the rate at which pages are read from or written to disk to resolve hard page faults. Multiply the values of the Physical Disk\Avg. Disk sec/Transfer and Memory\Pages/sec performance counters. If the product of these values exceeds 0.1, paging is utilizing more than 10 percent of disk access time, which indicates that insufficient physical memory is available.
GOTCHA: A high value for this performance counter could indicate excessive paging, which will increase disk I/O. If this occurs, consider adding physical memory to reduce disk I/O and increase performance.
THRESHOLD: A sustained value of more than 5 indicates a bottleneck.
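
The product rule mentioned for both counters is simple arithmetic, so here is a small sketch of it. Get-PagingDiskShare and its output property names are hypothetical, and the input values in the example are made up.

```powershell
# Hypothetical helper: multiplies Avg. Disk sec/Transfer by Pages/sec.
# A product above 0.1 means paging takes more than 10% of disk access time.
function Get-PagingDiskShare {
    param(
        [Parameter(Mandatory)] [double] $AvgDiskSecPerTransfer,
        [Parameter(Mandatory)] [double] $PagesPerSec
    )
    $share = $AvgDiskSecPerTransfer * $PagesPerSec
    [pscustomobject]@{
        PagingShare     = $share
        NeedsMoreMemory = $share -gt 0.1
    }
}

# Worked example with made-up values: 12 ms per transfer at 10 pages/sec
# gives 0.012 * 10 = 0.12, i.e. about 12% of disk access time.
Get-PagingDiskShare -AvgDiskSecPerTransfer 0.012 -PagesPerSec 10
```

On a live English-language system, the two inputs could presumably be read with Get-Counter from '\PhysicalDisk(_Total)\Avg. Disk sec/Transfer' and '\Memory\Pages/sec', taking the CookedValue of each sample.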


Next I will talk briefly about other counter categories such as Network and Processes.

In this series:
BLOG 1: PerfCounters infrastructure
BLOG 2: PerfCounters Raw vs. Formatted values
BLOG 3: PerfCounters, fetching the values
BLOG 4: PerfCounters, CPU perf data
BLOG 5: PerfCounters, Memory perf data
BLOG 6: PerfCounters, Disk/IO perf data
BLOG 7: PerfCounters, Network and Contention perf data
