As noted in one of my previous blog posts, I will use the following terminology:
- "Processor" is a piece of hardware you connect to a socket on the motherboard.
- "Physical Core" is a physical computing unit built into the "Processor".
- "Virtual Core" is a virtual computing unit built on top of "Physical Core" (i.e. HT is ON).
- "CPU" is a computing unit inside the "Processor", either physical or virtual.
After a series of blogs on Windows performance counters, and after releasing the sysb.ps1 testing/benchmarking framework version 0.9RC (dbt2-0.37.50.10), I set out to eliminate some unknowns from the testing. First to tackle was the kernel scheduler, in an effort to run processes, from inside the PowerShell script, on a controlled subset of CPUs, much like TASKSET does on Linux. It is also worth noting that proximity occasionally pays off: you can get up to 20% better results when the workload is distributed perfectly. However, this is hard to achieve, so I'm mostly after consistency in the test environment.
This posed quite a few challenges: knowing the details of the hardware and the NUMA node assignments, finding and evaluating the various ways of controlling CPU pinning, and calculating the CPU affinity mask for more than 64 CPUs.
One interesting challenge was to calculate the CPU indexes for MySQL Cluster thread config.
As a first step, I had to find out as much as possible about my hardware.
Know your hardware:
PS> Get-CimInstance Win32_BIOS
SMBIOSBIOSVersion : 11018100
Manufacturer      : American Megatrends Inc.
Name              : Default System BIOS
SerialNumber      : 1207FMA00C
Version           : SUN - 20151001

PS> Get-CimInstance Win32_ComputerSystem | FL *
Status                    : OK
Name                      : HEL01
Roles                     : {LM_Workstation, LM_Server, NT, Server_NT}
AutomaticManagedPagefile  : False
DomainRole                : 3
HypervisorPresent         : False
Manufacturer              : Oracle Corporation
Model                     : Sun Fire X4800
NetworkServerModeEnabled  : True
NumberOfLogicalProcessors : 96
NumberOfProcessors        : 8
PartOfDomain              : True
SystemType                : x64-based PC
TotalPhysicalMemory       : 549746266112

PS> Get-CimInstance Win32_ComputerSystemProcessor | FL *
GroupComponent        : Win32_ComputerSystem (Name = "HEL01")
PartComponent         : Win32_Processor (DeviceID = "CPU0")
CimClass              : root/cimv2:Win32_ComputerSystemProcessor
CimInstanceProperties : {GroupComponent, PartComponent}
...
PartComponent         : Win32_Processor (DeviceID = "CPU1")
PartComponent         : Win32_Processor (DeviceID = "CPU2")
PartComponent         : Win32_Processor (DeviceID = "CPU3")
PartComponent         : Win32_Processor (DeviceID = "CPU4")
PartComponent         : Win32_Processor (DeviceID = "CPU5")
PartComponent         : Win32_Processor (DeviceID = "CPU6")
PartComponent         : Win32_Processor (DeviceID = "CPU7")

PS> Get-CimInstance Win32_PerfFormattedData_PerfOS_NUMANodeMemory
Name                      : 0
AvailableMBytes           : 64530
FreeAndZeroPageListMBytes : 63989
StandbyListMBytes         : 541
TotalMBytes               : 65526
...
Name                      : 7
AvailableMBytes           : 64600
FreeAndZeroPageListMBytes : 64387
StandbyListMBytes         : 213
TotalMBytes               : 65536

PS> Get-CimInstance Win32_SystemSlot
SlotDesignation : EM00 PCIExp
Tag             : System Slot 0
SupportsHotPlug : True
Status          : OK
Shared          : True
PMESignal       : True
MaxDataWidth    : 8
...
SlotDesignation : EM01 PCIExp
Tag             : System Slot 1
SlotDesignation : EM30 PCIExp
Tag             : System Slot 2
SlotDesignation : EM31 PCIExp
Tag             : System Slot 3
SlotDesignation : EM10 PCIExp
Tag             : System Slot 4
SlotDesignation : EM11 PCIExp
Tag             : System Slot 5
SlotDesignation : EM20 PCIExp
Tag             : System Slot 6
SlotDesignation : EM21 PCIExp
Tag             : System Slot 7

PS> Get-CimInstance Win32_PerfFormattedData_Counters_ProcessorInformation
Name                        : 0,0
PercentofMaximumFrequency   : 100
PercentPerformanceLimit     : 100
PercentProcessorPerformance : 69
ProcessorFrequency          : 2001
...
Name                        : 0,11
---
Name                        : 7,0
PercentofMaximumFrequency   : 100
PercentPerformanceLimit     : 100
PercentProcessorPerformance : 72
ProcessorFrequency          : 2001
...
Name                        : 7,11

Or, in short, my test box has 2 Processor groups with 48 CPUs each. This makes for a maximum CPU affinity mask of 281474976710655d (or 111111111111111111111111111111111111111111111111b). The total number of CPUs is 96; the total number of sockets and NUMA nodes is 8.
Note: There are exactly 48 "1"s in the maximum CPU affinity mask, which is the number of CPUs in each Processor group. This implies you can only set the process affinity mask on a per-Processor-group basis, not machine-wide! This limitation is caused by the CPU affinity mask being 64 bits long.
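The 48-bit figure is easy to re-derive. Here is a minimal sketch, assuming a hypothetical helper named Get-FullGroupMask (it is not part of sysb.ps1) that sets one bit for each CPU in a Processor group:

```powershell
# Hypothetical helper: affinity mask with the lowest $CpuCount bits set.
# Limited to 63 bits so the result stays a positive 64-bit integer.
function Get-FullGroupMask {
    param([ValidateRange(1, 63)][int]$CpuCount)
    # Cast to [long] first: a 32-bit int cannot hold a shift by 48.
    return (([long]1) -shl $CpuCount) - 1
}

$mask = Get-FullGroupMask -CpuCount 48
$mask                              # decimal form, as reported by ProcessorAffinity
[Convert]::ToString($mask, 2)      # binary form, one '1' per CPU in the group
```

For my 48 CPUs per group this should reproduce the decimal and binary values quoted above.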
Groups, NUMA nodes etc. assignments are not chiseled in stone. Please see MSDN for details on how to manipulate these settings.
Once done playing with WMI, you can turn to coreinfo from Sysinternals suite as it's extremely informative:
Intel(R) Xeon(R) CPU E7540 @ 2.00GHz
Intel64 Family 6 Model 46 Stepping 6, GenuineIntel
Microcode signature: 00000009
HTT         *   Hyperthreading enabled
HYPERVISOR  -   Hypervisor is present
VMX         *   Supports Intel hardware-assisted virtualization
SVM         -   Supports AMD hardware-assisted virtualization
X64         *   Supports 64-bit mode
SMX         -   Supports Intel trusted execution
SKINIT      -   Supports AMD SKINIT
...

It is important to notice that, in my configuration, sockets map to NUMA nodes 1:1:
Logical Processor to Socket Map:                  Logical Processor to NUMA Node Map:
Socket 0:                                         NUMA Node 0:
************------------------------------------  ************------------------------------------
------------------------------------------------  ------------------------------------------------
Socket 1:                                         NUMA Node 1:
------------------------------------------------  ------------------------------------------------
************------------------------------------  ************------------------------------------
Socket 2:                                         NUMA Node 2:
------------************------------------------  ------------************------------------------
------------------------------------------------  ------------------------------------------------
Socket 3:                                         NUMA Node 3:
------------------------------------------------  ------------------------------------------------
------------************------------------------  ------------************------------------------
Socket 4:                                         NUMA Node 4:
------------------------************------------  ------------------------************------------
------------------------------------------------  ------------------------------------------------
Socket 5:                                         NUMA Node 5:
------------------------------------------------  ------------------------------------------------
------------------------************------------  ------------------------************------------
Socket 6:                                         NUMA Node 6:
------------------------------------************  ------------------------------------************
------------------------------------------------  ------------------------------------------------
Socket 7:                                         NUMA Node 7:
------------------------------------------------  ------------------------------------------------
------------------------------------************  ------------------------------------************

so I can use Processor/Socket/NUMA node as though they are synonyms.
Also, notice that the even-numbered NUMA nodes/sockets (0, 2, 4 and 6) are in Processor group 0 while the odd-numbered ones are in Processor group 1. Here is what CPU utilization looks like in the Task Manager Performance tab when just Processor group 0 is used:
[Figure: Task Manager Performance tab with only Processor group 0 in use.]

Logical Processor to Group Map:
Group 0:                                          Group 1:
************************************************  ------------------------------------------------
------------------------------------------------  ************************************************

Note: Coreinfo provides NUMA node latency too:
Approximate Cross-NUMA Node Access Cost (relative to fastest):
     00  01  02  03  04  05  06  07
00: 1.4 1.7 2.1 1.7 1.7 2.1 2.2 2.1
01: 1.7 1.4 1.7 2.1 2.1 1.7 2.0 1.3
02: 2.1 1.7 1.4 1.7 2.1 2.1 1.6 1.2
03: 1.8 2.1 1.7 1.4 2.1 2.1 2.0 1.1
04: 1.7 2.1 2.1 2.1 1.4 1.7 1.7 1.4
05: 2.1 1.7 2.1 2.1 1.7 1.4 2.0 1.0
06: 2.1 2.1 1.7 2.1 1.7 2.1 1.4 1.3
07: 2.1 2.1 2.1 1.7 2.1 1.7 1.6 1.0
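The socket and group maps from coreinfo boil down to simple arithmetic. The sketch below is an illustration for this particular box only (8 NUMA nodes, 12 CPUs each, even nodes in group 0 and odd nodes in group 1). The function name Get-NumaNodePlacement is made up for the example, and the Group value is informational since the process affinity API gives no way to request it:

```powershell
# Sketch for the layout coreinfo reports on this box only:
# 8 NUMA nodes, 12 CPUs each, even nodes in group 0, odd nodes in group 1.
function Get-NumaNodePlacement {
    param([ValidateRange(0, 7)][int]$Node)
    $cpusPerNode = 12
    $group    = $Node % 2                        # even -> group 0, odd -> group 1
    $firstCpu = ($Node -shr 1) * $cpusPerNode    # first CPU index inside that group
    # 12 consecutive bits, shifted to the node's position inside the group.
    $nodeMask = ((([long]1) -shl $cpusPerNode) - 1) -shl $firstCpu
    [pscustomobject]@{
        Node     = $Node
        Group    = $group
        FirstCpu = $firstCpu
        Mask     = $nodeMask
    }
}

# List the placement of every NUMA node.
0..7 | ForEach-Object { Get-NumaNodePlacement -Node $_ }
```

On a machine with a different group or node layout the constants would have to be taken from coreinfo again.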
The software:
The primary tool used is the sysb.ps1 PowerShell script, version 1.0 (not available for download atm). Version 0.9x RC is available for download and is placed in the dbt2-0.37.50.10.tar.gz\dbt2-0.37.50.10.tar\dbt2-0.37.50.10\windows_scripts\sysb-script\ directory.
OS details:
PS:518 [HEL01]> Get-CimInstance Win32_OperatingSystem | FL *
Status                                    : OK
Name                                      : Microsoft Windows Server 2012 R2 Standard
FreePhysicalMemory                        : 528660256
FreeSpaceInPagingFiles                    : 8388608
FreeVirtualMemory                         : 537242324
Distributed                               : False
MaxNumberOfProcesses                      : 4294967295
MaxProcessMemorySize                      : 137438953344
OSType                                    : 18
SizeStoredInPagingFiles                   : 8388608
TotalSwapSpaceSize                        :
TotalVirtualMemorySize                    : 545250196
TotalVisibleMemorySize                    : 536861588
Version                                   : 6.3.9600
BootDevice                                : \Device\HarddiskVolume1
BuildNumber                               : 9600
BuildType                                 : Multiprocessor Free
CodeSet                                   : 1252
DataExecutionPrevention_32BitApplications : True
DataExecutionPrevention_Available         : True
DataExecutionPrevention_Drivers           : True
DataExecutionPrevention_SupportPolicy     : 3
Debug                                     : False
ForegroundApplicationBoost                : 2
LargeSystemCache                          :
Manufacturer                              : Microsoft Corporation
OperatingSystemSKU                        : 7
OSArchitecture                            : 64-bit
PAEEnabled                                :
ServicePackMajorVersion                   : 0
ServicePackMinorVersion                   : 0
So how does Windows work?
A Process is just a container for the threads doing the work, providing you with a fancy name, PID etc. This effectively means you cannot calculate "System load" like on Linux. This also explains why there is no ProcessorGroup member attached to the Process class while there is one for Threads. It also causes all sorts of problems regarding CPU utilization, as described in previous blogs here and here.
A Processor group is a collection of up to 64 CPUs, as explained here and here.
A Thread is the basic unit of execution. Setting the Thread affinity will influence the Process class and dictate what you can do with it. There is a great paper on this that you can download from MSDN to figure it out. The focus of this blog is on scripting.
Know the OS pitfalls:
The setup: I have a script acting as a testing/benchmarking framework. The script controls the way processes are launched, collects data from running processes and generally helps me do part of my job of identifying performance issues and testing solutions.
The problem: Windows is a thread-based OS and I cannot control the threads inside a binary from within the script.
Next, the .NET System.Diagnostics.Process class does not expose the Processor group. This means there is no way to control the Processor group and thus no way to guarantee the kernel scheduler will start all of your processes inside the Processor group you want :-/ I consider this a bug and not merely a deficiency in Windows because of the following scenario:
"ProcessA" is pinned, by the scheduler, to Processor group 0 with the ability to run on all CPUs within that group.
"ProcessB" is pinned, by the scheduler, to Processor group 1 with the ability to run on all CPUs within that group.
The ProcessorAffinity member of the System.Diagnostics.Process class is the same in both cases!
$procA = Get-Process -Name ProcessA
$procA.ProcessorAffinity
281474976710655 #For my 48 CPUs in each Processor group.
$procB = Get-Process -Name ProcessB
$procB.ProcessorAffinity
281474976710655 #For my 48 CPUs in each Processor group.
This leads you to believe that both processes run in the same Processor group, which might not be true, as the information is ambiguous. I have set up mysqld to run on the first NUMA node and part of the second (12 + 8 CPUs). At the same time, Sysbench is pinned to NUMA node 0, last 4 CPUs. When the scheduler decides to run mysqld on Processor group 1, the CPU load distribution is like this:
[Figure: CPU load distribution. NUMA #0, last 4 CPUs lit up by Sysbench. NUMA #1 and part of 3, lit up by mysqld.]
A subsequent run uses the same(!) Process.ProcessorAffinity for mysqld, but this time the scheduler decides to run mysqld on Processor group 0:
[Figure: CPU load distribution. NUMA #0, last 4 CPUs lit up by Sysbench and mysqld. NUMA #2 in part lit up by mysqld.]
It is obvious that the latter case will most likely produce much lower results, since mysqld is competing with Sysbench (on the last 4 CPUs of NUMA node 0) and with Windows (first 2 CPUs of NUMA node 0). This is indicative of 2 things:
a) Microsoft rushed the solution for big boxes (> 64 CPUs); it is not mature, nor will it scale.
b) You cannot trust the kernel scheduler to do the right thing on its own, as it has no clue what your next move will be.
I might add here that even the display in Task manager lacks the ability to display CPU load per ProcessorGroup...
Before you send me to RTFM and do this the "proper" way, please notice that the CPU usage pattern for NUMA nodes 5 and 7 is the same in both runs. This is because our Cluster knows how to pin threads to CPUs "properly". Alas, I do not think this is possible from PowerShell.
Also notice the lack of a ProcessorGroup member in the System.Diagnostics.Process class. I expected at least a ProcessorGroup getter (if not a complete getter/setter) so I can abort the run if the scheduler makes a choice I'm not happy with.
The last problem to mention is late binding of Affinity mask :-/. The code might look like this:
$sb_psi = New-Object System.Diagnostics.ProcessStartInfo
$sb_psi.CreateNoWindow = $true
$sb_psi.UseShellExecute = $false
$sb_psi.RedirectStandardOutput = $true
$sb_psi.RedirectStandardError = $true
$sb_psi.FileName = Join-Path $PathToSB 'sysbench.exe'
$sb_psi.Arguments = "$sbArgList"
$sb_process = New-Object System.Diagnostics.Process
$sb_process.StartInfo = $sb_psi
#The process starts running here, on whatever CPUs the scheduler picks:
[void]$sb_process.Start()
#Only now can you set the Affinity mask:
$sb_process.ProcessorAffinity = $SCRIPT:SP_BENCHMARK_CPU
$sb_process.WaitForExit()
IMO, Process.ProcessorAffinity should move to System.Diagnostics.ProcessStartInfo.
I can't help but wonder what will happen if Intel decides to release a single processor with 64+ CPUs.
What are our options in Powershell then?
Essentially, you can use 3 techniques to start a process in PowerShell and bind it to CPUs, but you have to bear in mind that this is not what Microsoft expects you to do, so each approach has its pros and cons:
1) Using START in cmd.exe (start /HIGH /NODE 2 /AFFINITY 0x4096 /B /WAIT E:\test\...\sysbench.exe --test=oltp...)
Settings:
sysbench.conf:
BENCHMARK_NODE=5
BENCHMARK_CPU="111100000000" # Xeon E7540 has 12 CPUs per socket so I'm running on LAST 4 (9,10,11 and 12).
These options allow the user to run Sysbench on a certain NUMA node as well as on certain CPUs within that NUMA node.
autobench.conf:
SERVER_NUMA_NODE=3
SERVER_CPU="111111111" #(Or, 000111111111) Running on first 9 CPUs.
It is not necessary to set the CPUs to run on if you're running on an entire dedicated NUMA node.
Pros:
- Works.
Cons:
- The process you're starting is not the expected one (say, benchmark) but rather cmd.exe START.
- Cumbersome.
- Not really the "PowerShell way".
- The process is bound to just one NUMA node, which is fine if it's not hungry for more CPU power.
2) Using .NET System.Diagnostics.Process (PS, C#):
$process = Start-Process E:\test\mysql-cluster-7.5.0-winx64\bin\mysqld.exe -ArgumentList "--standalone --console --initialize-insecure" -WindowStyle Hidden -PassThru -RedirectStandardOutput e:\test\stdout.txt -RedirectStandardError e:\test\stderr.txt
#The process must still be running when the mask is set, so wait only afterwards:
$process.ProcessorAffinity = 70368739983360
$process.WaitForExit()

The affinity mask means mysqld runs on NUMA node 7, 5 and part of 3 (0-based index) IF ProcessorGroup is set to 1 by the kernel scheduler:
001111111111111111111111110000000000000000000000 = 70368739983360
|___________________48 CPUs____________________|
|__________||__________||__________||__________|
  NUMA #7     NUMA #5     NUMA #3     NUMA #1

Settings:
Autobench.conf:
SP_SERVER_CPU=70368739983360
Sysbench.conf:
SP_BENCHMARK_CPU=211106232532992 #Run on NUMA node 7, last 2 CPUs, 110000000000000000000000000000000000000000000000b
Pros:
- The real "PowerShell way" of doing things.
- The process can span more than 1 NUMA node.
- Good control of the process (HasExited, ExitTime, Kill, ID (PID) ...).
Cons:
- Late binding, i.e. the process has to be up and running for you to pin it to CPUs. This presents a problem with processes that start running immediately.
- No way to control the Processor group, meaning there is no way to guarantee the kernel scheduler will start all of your processes inside the desired Processor group.
Note: Using -PassThru ensures you get a Process object. Otherwise, the Start-Process cmdlet has no output. Alternatively, you can start the process and then use Get-Process -Name... to accomplish the same.
Not available in Powershell AFAIK but important to understand if using MySQL Cluster:
3) Hook the threads to CPUs. Since this is not available from the "outside", I will use the Cluster code to do the work for me:
config.ini
----------
NoOfFragmentLogParts=10
ThreadConfig=ldm={count=10,cpubind=88-91,100-105},tc={count=4,cpubind=94-95,106-107},send={count=2,cpubind=92-93},recv={count=2,cpubind=98,99},main={count=1,cpubind=109},rep={count=1,cpubind=109}

sysbench.conf
-------------
#NUMA node to run sysbench on.
BENCHMARK_NODE=0 #Zero based index.
#CPUs inside selected NUMA node to run sysbench on.
BENCHMARK_CPU="111100000000"
               000000001111
               |__________|
               |_12 CPUs__|
                 NUMA #0
               CPU0   CPU11

autobench.conf
--------------
SP_SERVER_CPU=1048575

The affinity mask means mysqld runs on NUMA node 7, 5 and part of 3 (0-based index) IF ProcessorGroup is 1:
001111111111111111111111110000000000000000000000 = 70368739983360d
|___________________48 CPUs____________________|
|__________||__________||__________||__________|
  NUMA #7     NUMA #5     NUMA #3     NUMA #1

The test image shows:
000000000000000000000000000011111111111111111111 = 1048575d
|___________________48 CPUs____________________|
|__________||__________||__________||__________|
  NUMA #7     NUMA #5     NUMA #3     NUMA #1
How you calculate the ProcessorAffinity mask for a process depends on the function accepting the input.
1) For cmd.exe START, the actual number passed is in HEX notation. The binary mask is composed so that the highest index CPU comes first:
BENCHMARK_CPU="111100000000"
               000000001111
               |__________|
               |_12 CPUs__|
                 NUMA #0
               CPU0   CPU11

It is more convenient to provide the mask in binary, so I convert the setting to a hex value inside the script.
The NUMA node to run on is specified as decimal integer.
If you have provided the NUMA node # for the process to run on, not specifying ProcessorAffinity mask means "run on all CPUs within specified node".
If you provide the wrong mask, process will fail to start. For example, I have 12 CPUs per NUMA node (socket) so providing the mask like "11111111111000" will fail.
The approach works only on one NUMA node.
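The binary-to-hex step can be sketched as follows. The post does not show how sysb.ps1 performs the conversion, so this is only one possible way; the function name ConvertTo-StartAffinity and the $CpusPerNode parameter are assumptions made for the example:

```powershell
# Hypothetical helper: turn a per-node binary mask string into the hex digits
# for the /AFFINITY argument of cmd.exe START.
function ConvertTo-StartAffinity {
    param(
        [Parameter(Mandatory)][string]$BinaryMask,
        [int]$CpusPerNode = 12          # Xeon E7540: 12 CPUs per NUMA node
    )
    if ($BinaryMask -notmatch '^[01]+$') {
        throw "Mask '$BinaryMask' may contain only 0 and 1."
    }
    # Reject masks wider than the node up front, as such a process fails to start.
    if ($BinaryMask.Length -gt $CpusPerNode) {
        throw "Mask '$BinaryMask' is wider than $CpusPerNode CPUs."
    }
    # The leftmost character is the highest CPU index, so a base-2 parse is enough.
    $value = [Convert]::ToInt64($BinaryMask, 2)
    return '{0:X}' -f $value
}

ConvertTo-StartAffinity -BinaryMask '111100000000'   # last 4 CPUs of the node
ConvertTo-StartAffinity -BinaryMask '111111111'      # first 9 CPUs of the node
```

Checking the width before launching gives a clearer error than a process that silently fails to start.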
2) Start-Process (System.Diagnostics.Process) expects a decimal integer for the mask. The rightmost "1" indicates usage of CPU #0 within the Processor group assigned by the kernel scheduler in round-robin manner.
000000000000000000000000000011111111111111111111 = 1048575d
|___________________48 CPUs____________________|
|__________||__________||__________||__________|
  NUMA #7     NUMA #5     NUMA #3     NUMA #1
or, should the scheduler pick Processor group 0:
  NUMA #6     NUMA #4     NUMA #2     NUMA #0

Start-Process takes (and returns) a decimal value for ProcessorAffinity.
It uses late binding so Process has to be up and running before assigning Affinity mask to it.
You have no control over ProcessorGroup meaning Kernel scheduler is free to pick any NUMA node in Round-Robin fashion.
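The decimal masks used in this post can be derived from a CPU range instead of being typed by hand. This is a minimal sketch, assuming a hypothetical helper named New-AffinityMask; it only builds the number and cannot influence which Processor group the scheduler picks:

```powershell
# Hypothetical helper: decimal affinity mask for $Count consecutive CPUs,
# starting at $FirstCpu (0-based) inside whichever Processor group gets used.
function New-AffinityMask {
    param(
        [ValidateRange(0, 62)][int]$FirstCpu,
        [ValidateRange(1, 63)][int]$Count
    )
    if ($FirstCpu + $Count -gt 63) {
        throw 'Range does not fit into a positive 64-bit mask.'
    }
    # $Count low bits set, then moved up so the lowest one lands on $FirstCpu.
    return (((([long]1) -shl $Count) - 1) -shl $FirstCpu)
}

$serverMask = New-AffinityMask -FirstCpu 0 -Count 20     # mysqld: first 20 CPUs of the group
$benchMask  = New-AffinityMask -FirstCpu 46 -Count 2     # sysbench: last 2 CPUs of a 48-CPU group

# Print the masks the way the diagrams do: 48 positions, CPU #0 rightmost.
[Convert]::ToString($serverMask, 2).PadLeft(48, '0')
[Convert]::ToString($benchMask, 2).PadLeft(48, '0')
```

The resulting values can be assigned to ProcessorAffinity once the process is running.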
3) Doing things "properly" (binding threads to CPUs). Or, how to calculate ThreadConfig for MySQL Cluster:
ThreadConfig=ldm={count=10,cpubind=88-91,100-105},tc={count=4,cpubind=94-95,106-107},send={count=2,cpubind=92-93},recv={count=2,cpubind=98,99},main={count=1,cpubind=109},rep={count=1,cpubind=109}
The line above shows CPU indexes above the total number of CPUs available on my test system (2x48=96). This has to do with the maximum capacity of a Processor group, which is 64. The designer of this functionality treats each Processor group found on the system as full, meaning it occupies 64 places in the CPU index. This makes sense if you are going from a box with 48 CPUs per group (like mine) to a box with 64 CPUs per group, as your ThreadConfig line will continue to work as expected. However, it requires some math to arrive at the CPU indexes:
Processor group 0
CPU#0                                     CPU#47          CPU#63
|                  AVAILABLE                   |    RESERV     |
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRRRRRRRRRRRRRRR

Processor group 1
CPU#64                                   CPU#111         CPU#127
|                  AVAILABLE                   |    RESERV     |
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXRRRRRRRRRRRRRRRR

Now my ThreadConfig line makes sense.
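The index math can be written down in a few lines. This is a sketch of the arithmetic only, with function names invented for the example (they are not MySQL Cluster or sysb.ps1 functions); it assumes every Processor group is counted as 64 slots, as described above:

```powershell
# Sketch of the index arithmetic: every Processor group is counted as if it
# held 64 CPUs, regardless of how many it really has.
function ConvertTo-ClusterCpuIndex {
    param([int]$Group, [int]$CpuInGroup)
    $groupCapacity = 64
    return $Group * $groupCapacity + $CpuInGroup
}

function ConvertFrom-ClusterCpuIndex {
    param([int]$Index)
    $groupCapacity = 64
    [pscustomobject]@{
        Group      = [int][math]::Floor($Index / $groupCapacity)
        CpuInGroup = $Index % $groupCapacity
    }
}

# CPU 24 of Processor group 1 as it would be written in cpubind.
ConvertTo-ClusterCpuIndex -Group 1 -CpuInGroup 24

# Decode the lowest and highest indexes from my ThreadConfig line.
88, 109 | ForEach-Object { ConvertFrom-ClusterCpuIndex -Index $_ }
```

On my box, CPU 24 of Processor group 1 is the first CPU of NUMA node 5, which matches the load pattern seen on NUMA nodes 5 and 7.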
Conclusion:
o Windows uses the notion of a Processor group. Machines with 64 or fewer CPUs have 1 Processor group, thus your application runs exactly as before.
o Bug 1: The affinity mask is only 64 bits wide, so there is no way to have a continuous index of CPUs inside a big box such as mine.
o .NET System.Diagnostics.Process has no get/set of the Processor group. I expected at least a getter, i.e. a member of System.Diagnostics.Process disclosing this information.
o Bug 2: The information on the CPU affinity mask obtained from .NET System.Diagnostics.Process is ambiguous.
o Bugs 1 + 2 lead to bug 3: I found no complete way to script pinning to individual CPUs.
o Feature request 1: .NET System.Diagnostics.Process allows only for late binding of the affinity mask. Move Affinity to .NET System.Diagnostics.ProcessStartInfo.
o Feature request 2, consolidate: The various approaches taken by Microsoft seem uncoordinated and incomplete. Even the START command requires a decimal number for the NUMA node index and a hexadecimal number for the affinity mask. cmd.exe START and the creation of thread objects allow for early binding of the CPU affinity mask, while .NET System.Diagnostics.Process allows only late binding. And so on.
o Feature request 3, give us a TASKSET counterpart: Given all of the above, it is impossible to script a replacement for Linux TASKSET.
o What will happen once single processors with more than 64 CPUs are available?
o MySQL Cluster counts CPUs as if every existing Processor group is complete (has 64 CPUs).