CMG: VMware vSphere Performance Analysis
John Paul, Managed Services, R&D
February 17, 2012
Copyright © 2012 Siemens Medical Solutions USA, Inc. All rights reserved.

Acknowledgments and Presentation Goal
The material in this presentation was pulled from a variety of sources, some of which were graciously provided by VMware and Intel staff members. I acknowledge and thank the VMware and Intel staff for their permission to use their material in this presentation.
This presentation reviews the basics of performance analysis for the virtual infrastructure, with detailed information on the tools and counters used. It presents a series of examples of how the performance counters report different types of resource consumption, with a focus on the key counters to observe. The performance counters do change, and we are not going to go over all of them. ESXTOP and RESXTOP will be used interchangeably since both tools effectively provide the same counters. The screen shots in this presentation have their colors inverted for readability. The presentation shows screen shots from both vSphere 4 and 5 since both are actively in use.

Introductory Comments

Trends to Consider – Hardware
Intel Strategies – Intel necessarily had to move to a horizontal, multicore strategy due to the physical restrictions of what could be done on the current technology base.
This resulted in:
• An increasing number of cores per socket
• Stabilization of processor speed (clock speeds are no longer climbing with each generation and are in fact lower for some newer models)
• A focus on new architectures that allow for more efficient movement of data between memory and the processors and between external sources and the processors, plus larger and faster caches associated with different components
OEM Strategies – As Intel moved down this path, system OEMs assembled the Intel (and AMD) components in different ways (such as multi-socket) to differentiate their offerings. It is important to understand the timing of the Intel architectural releases, the OEM implementation of those releases, and the operating system vendors' use of those features in their code. They aren't always aligned.

Trends to Consider – Hardware-Assisted Virtualization
Intel VT-x and AMD-V – These provide two forms of CPU operation (root and non-root), allowing the virtualization hypervisor to be less intrusive during workloads. Hardware virtualization with a Virtual Machine Monitor (VMM), versus binary translation, resulted in a substantial performance improvement.
Intel EPT and AMD RVI – Memory management unit (MMU) virtualization supports extended page tables (EPT), which eliminated the need for ESX to maintain shadow page tables.
Intel VT-d and AMD-Vi – I/O virtualization assist allows virtual machines direct access to hardware I/O devices, such as network cards and storage controllers (HBAs).
Trends to Consider – Software
VMware/Hypervisor Strategies – Horizontal scalability at the hardware layer requires comparable scalability in the hypervisor:
• NUMA support required hypervisor scheduler changes (Wide NUMA)
• Larger-CPU and larger-RAM virtual machines
• Efficiency while running larger, more complex workloads
• Abstraction model versus consolidation model for some workloads
• Performance guarantees for the Core Four
• Federation of management tools and resource pools

Scheduler
vCPU – The vCPU is an aggregation of the time the scheduler allocates to the workload. It time-slices each core based upon the type of configuration, and it constantly changes which core the VM runs on unless affinity is used.
SMP Lazy Scheduling – The scheduler continues to evolve, using lazy scheduling to launch individual vCPUs and then having the others "catch up" if CPU skewing occurs. Note that this has improved across releases.
SMP – Note that SMP effectiveness is NOT linear; it depends upon the workload. It is very important to load test your workloads on your hardware to measure the efficiency of SMP. We have found that the higher the number of SMP vCPUs, the lower the efficiency.

Resource Pools – These are really a way to group the amount of resources allocated to a specific workload, or to a grouping of workloads across VMs and hosts.
Single Unit of Work – It is easy to miss the fact that resource "sharing" really does not affect the single unit of work. The Distributed Resource Scheduler (DRS) moves VMs across ESX hosts where there may be more resources.
Perfmon Counters – The inclusion of ESX counters (via VMware Tools) in the Windows Perfmon counters is helpful for overall and higher-level performance analysis. Many counters are not yet exposed in Perfmon.
Hyper-threading – The pre-Nehalem Intel architectures had some problems with hyper-threading efficiency, causing many people to turn hyper-threading off. The Nehalem architecture appears to have corrected those problems, and vSphere is now hyper-threading aware. You need to understand how hyper-threading works so you know how to interpret the VMware tools (such as ESXTOP and ESXPlot): the logical cores are shown as equals even though they do not have equal capacity.
Microsoft Hyper-V – While we are not going to dive into Hyper-V (or Xen), the basic principles of the hypervisors are the same, though the implementations are quite different. There are good reasons why VMware leads the market in enterprise virtualization implementations.

Introductory Comments – Decision Points
WHAT should we change? – VMware continues to expose more performance-changing settings. One of the key questions to answer is whether you should take the default settings for operational simplicity or fine-tune for the best possible performance.
NUMA Awareness and Control – Should NUMA control be turned over to the guest operating system?
ESXTOP versus RESXTOP – Though both work, using ESXTOP on the actual host requires SSH to be enabled, which may violate security guidelines.
vSphere Architecture

VMware ESX Architecture
[Architecture diagram: a guest (TCP/IP stack, guest file system) running over a monitor, over the VMkernel (scheduler, memory allocator, virtual NIC, virtual SCSI, virtual switch, file system, NIC and I/O drivers), over the physical hardware]
• The monitor supports BT (Binary Translation), HW (Hardware assist), and PV (Paravirtualization)
• CPU is controlled by the scheduler and virtualized by the monitor
• Memory is allocated by the VMkernel and virtualized by the monitor
• Network and I/O devices are emulated and proxied through native device drivers

Performance Analysis Basics

Key Reference Documents
vSphere Resource Management (EN-000591-01)
Performance Best Practices for VMware vSphere 5.0 (EN-000005-04)

Types of Resources – The Core Four (Plus One)
Though the Core Four resources exist at both the ESX host and virtual machine levels, they are not the same in how they are instantiated and reported against.
• CPU – processor cycles (vertical), multi-processing (horizontal)
• Memory – allocation and sharing
• Disk (a.k.a. storage) – throughput, size, latencies, queuing
• Network – throughput, latencies, queuing
Though all resources are limited, ESX handles them differently. CPU is strictly scheduled; memory is adjusted and reclaimed (more fluid) if based on shares; disk and network are fixed-bandwidth (except for queue depths) resources.
The fifth Core Four resource is virtualization overhead!
vSphere Components in a Context You Are Used To
World – The smallest schedulable component for vSphere. Similar to a process in Windows or a thread in other operating systems.
Groups – A collection of ESX worlds, often associated with a virtual server or a common set of functions, such as: Idle, System, Helper, Drivers, Vmotion, Console.

The Five Contexts of Virtualization and Which Tools to Use for Each
[Diagram: five nested contexts – physical machine, virtual machine, ESX host machine, ESX host farm/cluster, and ESX host complex – each built from the Core Four (CPU, memory, disk, NIC) in physical or virtual form on Intel hardware. Use PerfMon inside the guest operating system, ESXTOP at the ESX host level, and Virtual Center at the farm/cluster level. Remember the virtual context.]

Types of Performance Counters (v4)
Static – Counters that don't change during runtime, for example MEMSZ (memsize), adapter queue depth, and VM name. The static counters are informational and may not be essential during performance problem analysis.
Dynamic – Counters that are computed dynamically, for example CPU load average and memory over-commitment load average.
Calculated – Some are calculated from the delta between two successive snapshots, with the refresh interval (-d) determining the time between snapshots. For example:
%CPU used = (CPU used time at snapshot 2 - CPU used time at snapshot 1) / time elapsed between snapshots

A Review of the Basic Performance Analysis Approach
Identify the virtual context of the reported performance problem
• Where is the problem being seen? ("When I do this here, I get that")
• How is the problem being quantified? ("My function is 25% slower")
• Apply a reasonability check ("Has something changed from the status quo?")
Monitor the performance from within that virtual context
• View the performance counters in the same context as the problem
• Look at the ESX cluster level performance counters
• Look for atypical behavior ("Is the amount of resources consumed characteristic of this particular application or task for the server processing tier?")
• Look for repeat offenders! This happens often.
Expand the performance monitoring to each virtual context as needed
• Are other workloads influencing the virtual context of this particular application and causing a shortage of a particular resource?
• Consider how a shortage is instantiated for each of the Core Four resources
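The snapshot-delta calculation described for calculated counters can be sketched in a few lines of Python. This is purely illustrative; the function name and values are this sketch's own, not esxtop output:

```python
def pct_cpu_used(used_t1, used_t2, elapsed):
    """%CPU used between two snapshots.

    used_t1, used_t2: cumulative CPU-used time (seconds) at snapshots 1 and 2.
    elapsed: wall-clock seconds between snapshots (the -d refresh interval).
    """
    return 100.0 * (used_t2 - used_t1) / elapsed

# A VM that accumulated 2.5 s of CPU time over a 5 s refresh interval
# was using 50% of one CPU; values above 100% indicate multiple busy vCPUs.
print(pct_cpu_used(100.0, 102.5, 5.0))  # 50.0
```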
Resource Control Revisited – CPU Example
Reservation (Guarantees)
• Minimum service level guarantee (in MHz)
• When the system is overcommitted it is still the target
• Needs to pass admission control for start-up
Shares (Share the Resources)
• CPU entitlement is directly proportional to the VM's shares and depends on the total number of shares issued
• An abstract number; only the ratio matters
Limit
• Absolute upper bound on CPU entitlement (in MHz)
• Applies even when the system is not overcommitted

Tools

Key Reference Documents
vSphere Monitoring and Performance (EN-000620-01), Chapter 7 – Performance Monitoring Utilities: resxtop and esxtop
Esxtop for Advanced Users (VSP1999, VMworld 2011)

Load Generators, Data Gatherers, Data Analyzers
Load Generators
• IOMeter – www.iometer.org
• Consume – Windows SDK
• SQLIOSIM – http://support.microsoft.com/?id=231619
Data Gatherers
• ESXTOP
• Virtual Center
• Vscsistats
Data Analyzers
• ESXTOP (interactive or batch mode)
• Windows Perfmon/System Monitor
• ESXPLOT

A Comparison of ESXTOP and the vSphere Client
• vC gives a graphical view of both real-time and trend consumption
• vC combines real-time reporting with short-term (1 hour) trending
• vC can report on the virtual machine, ESX host, or ESX cluster
• vC has performance overview charts in vSphere 4 and 5
• vC is limited to 2 unit types at a time for certain views
• ESXTOP allows more concurrent performance counters to be shown
• ESXTOP has a higher system overhead to run
• ESXTOP can sample down to a 2 second sampling period
• ESXTOP gives a detailed view of each of the Core Four
Recommendation – Use vC to get a general view of system performance but use ESXTOP for detailed problem analysis.
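As a rough intuition for how reservation, shares, and limit interact, here is a toy model in Python. It is a simplification for illustration only, not the actual ESX scheduler's entitlement or admission-control algorithm:

```python
def cpu_entitlement(shares, total_shares, reservation, limit, available_mhz):
    """Toy model: a share-proportional slice of the host's MHz,
    floored by the reservation and capped by the limit (all in MHz)."""
    proportional = available_mhz * shares / total_shares
    return min(max(proportional, reservation), limit)

# 1000 of 4000 total shares on a host distributing 8000 MHz:
# the proportional slice is 2000 MHz, which already sits between the
# 500 MHz reservation and the 2500 MHz limit.
print(cpu_entitlement(1000, 4000, 500, 2500, 8000))  # 2000.0
```

Note how only the ratio of shares matters, while reservation and limit are absolute MHz bounds, mirroring the bullets above.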
An Introduction to ESXTOP/RESXTOP
Launched through the vSphere Management Assistant (vMA), the CLI, or via an SSH session with the ESX host (ESXTOP)
Screens (version 5):
c: cpu (default)
d: disk adapter
h: help
i: interrupts
m: memory
n: network
p: power management
u: disk device
v: disk VM
Output can be piped to a file and then imported into System Monitor/ESXPLOT.
Horizontal and vertical screen resolution limits the number of fields and entities that can be viewed, so choose your fields wisely.
Some of the rollups and counters may be confusing to the casual user.

ESXTOP: New Counters in vSphere 5.0
World, VM count, vCPU count (CPU screen)
%VMWait (%Wait - %Idle, CPU screen)
CPU clock frequency in different P-states (power management screen)
Failed disk IOs (disk adapter screen)
• FCMDs/s – failed commands per second
• FReads/s – failed reads per second
• FMBRD/s – failed megabyte reads per second
• FMBWR/s – failed megabyte writes per second
• FRESV/s – failed reservations per second
VAAI: block deletion operations (disk adapter screen) – same counters as failed disk IOs above
Low-latency swap (host cache – disk screen)
• LLSWR/s – swap-in rate from host cache
• LLSWW/s – swap-out rate to host cache

ESXTOP: Help Screen (v5)
[Screen shot]

ESXTOP: CPU Screen (v5)
[Screen shot showing time and uptime; some fields hidden from the view]
New counters:
• Worlds = worlds, VMs, vCPU totals
• ID = ID
• GID = world group identifier
• NWLD = number of worlds
ESXTOP: CPU Screen (v4)
Press the 'e' key to expand groups.
• In the rolled-up view some stats are cumulative across all the worlds in the group
• The expanded view gives a breakdown per world
• A VM group consists of mks (mouse, keyboard, screen), vcpu, and vmx worlds; SMP VMs have additional vcpu and vmm worlds
• vmm0, vmm1 = virtual machine monitors for vCPU0 and vCPU1 respectively

ESXTOP CPU Screen (v5): Many New Worlds
[Screen shot: new processes using little/no CPU resource]

ESXTOP CPU Screen (v5): Virtual Machines Only (using the V command)
[Screen shot: a value >= 1 means overload]

ESXTOP CPU Screen (v5): Virtual Machines Only, Expanded
[Screen shot]

ESXTOP: CPU Screen (v4)
PCPU = physical CPU/core
CCPU = console CPU (CPU 0)
Press the 'f' key to choose fields.

ESXTOP: CPU Screen (v5)
Core usage is now shown.
PCPU = physical CPU (changed field)
CORE = core CPU (new field)

Idle State on Test Bed (CPU View, v4)
[Screen shot: ESXTOP virtual machine view]

Idle State on Test Bed – GID 32 Expanded (v4)
[Screen shot: the expanded GID shows five worlds; the cumulative wait % includes idle; compare against the rolled-up GID's total idle %]

ESXTOP: Memory Screen (v4)
[Screen shot: COS, PCI hole, and VMKMEM regions of physical memory (PMEM)]
VMKMEM – memory managed by the VMkernel
COSMEM – memory used by the Service Console
Possible memory states: high, soft, hard, and low.

ESXTOP: Memory Screen (v5)
[Screen shot: NUMA stats; changed and new fields]

ESXTOP: Memory Screen (4.0)
[Screen shot: swapping activity in the Service Console and the VMkernel]
Counter definitions:
• SZTGT = size target
• SWTGT = swap target
• SWCUR = currently swapped
• MEMCTL = balloon driver
• SWR/s = swap reads/sec
• SWW/s = swap writes/sec
How to read them:
• SZTGT: determined by reservation, limit, and memory shares
• SWCUR = 0: no swapping in the past
• SWTGT = 0: no swapping pressure
• SWR/s, SWW/s = 0: no swapping activity currently

ESXTOP: Disk Adapter Screen (v4)
Host bus adapters (HBAs) – includes SCSI, iSCSI, RAID, and FC-HBA adapters
Latency stats from the device, the kernel, and the guest:
• DAVG/cmd – average latency (ms) from the device (LUN)
• KAVG/cmd – average latency (ms) in the VMkernel
• GAVG/cmd – average latency (ms) in the guest

ESXTOP: Disk Device Screen (v4)
LUNs are listed in C:T:L format (Controller:Target:LUN)

ESXTOP: Disk VM Screen (v4)
[Screen shot: running VMs]

ESXTOP: Network Screen (v4)
[Screen shot: Service Console NIC, virtual NICs, physical NIC]
• PKTTX/s – packets transmitted/sec
• PKTRX/s – packets received/sec
• MbTX/s – transmit throughput in Mbits/sec
• MbRX/s – receive throughput in Mbits/sec
• PORT ID – every entity is attached to a port on the virtual switch
• DNAME – the switch the port belongs to
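The memory-screen rules of thumb (SWCUR, SWTGT, SWR/s, SWW/s) can be expressed as a simple classifier. The counter names match esxtop's memory screen, but the function, thresholds, and labels are this sketch's own:

```python
def swap_status(swcur, swtgt, swr_s, sww_s):
    """Classify a VM's swap state from four esxtop memory counters."""
    if swr_s > 0 or sww_s > 0:
        return "actively swapping now"       # worst case: live swap traffic
    if swtgt > 0:
        return "under swap pressure"         # ESX wants to swap this VM
    if swcur > 0:
        return "swapped in the past, idle now"
    return "no swapping"

print(swap_status(0, 0, 0, 0))      # no swapping
print(swap_status(512, 0, 0, 0))    # swapped in the past, idle now
```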
A Brief Introduction to the vSphere Client
Screens – CPU, Disk, Management Agent, Memory, Network, System
vCenter collects performance metrics from the hosts that it manages and aggregates the data using a consolidation algorithm. The algorithm is optimized to keep the database size constant over time.
vCenter does not display many counters on the trend/history screens.
ESXTOP defaults to a 5-second sampling rate, while vCenter defaults to a 20-second rate.
Default statistics collection periods, samples, and how long they are stored:

Interval    Interval Length   Number of Samples   Interval Period
Real-time   20 seconds        180                 1 hour
Per day     5 minutes         288                 1 day
Per week    30 minutes        336                 1 week
Per month   2 hours           360                 1 month
Per year    1 day             365                 1 year

vSphere Client – CPU Screen (v4)
[Screen shot, with callouts for changing settings and changing screens]

vSphere Client – Disk Screen (v4)
[Screen shot]

vSphere Client – Performance Overview Chart (v4)
Performance overview charts help to quickly identify bottlenecks and isolate root causes of issues.

Analyzing Performance from Inside a VM
VM performance counter integration with Perfmon:
• Access key host statistics from inside the guest OS
• View "accurate" CPU utilization alongside observed CPU utilization
• Third parties can instrument their agents to access these counters using WMI
• Integrated with VMware Tools
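The collection intervals in the table are internally consistent: samples times interval length equals the retention period, which is easy to verify (a month is treated as 30 days here):

```python
# (interval length in seconds, number of samples, retention period in seconds)
intervals = {
    "real-time": (20, 180, 3600),            # 1 hour
    "per day":   (300, 288, 86400),          # 1 day
    "per week":  (1800, 336, 604800),        # 1 week
    "per month": (7200, 360, 30 * 86400),    # 1 month (30 days)
    "per year":  (86400, 365, 365 * 86400),  # 1 year
}

for name, (length, samples, period) in intervals.items():
    # each rollup's samples exactly cover its stated retention period
    assert length * samples == period, name
print("all rollup intervals cover their stated periods")
```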
Summarized Performance Charts (v4)
• Quickly identify bottlenecks and isolate root causes
• Side-by-side performance charts in a single view
• Correlation and drill-down capabilities
• Richer set of performance metrics
[Screen shot: key metrics displayed, aggregated usage]

A Brief Introduction to ESXPlot
• Launched on a Windows workstation
• Imports data from a .csv file
• Allows an in-depth analysis of an ESXTOP batch file session
Capture data using ESXTOP batch mode from root over an SSH session:
esxtop -a -b > exampleout.csv (for verbose capture)
Then transfer the file to the Windows workstation using WinSCP.

ESXPlot
[Screen shot]

ESXPlot Field Expansion: CPU
[Screen shot]

ESXPlot Field Expansion: Physical Disk
[Screen shot]
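An esxtop batch file is an ordinary Perfmon-style CSV (one timestamp column followed by one column per counter), so it can be sanity-checked with the standard csv module before loading it into ESXPlot or System Monitor. The embedded sample below is a made-up miniature stand-in for a real capture, which would have thousands of columns:

```python
import csv
import io

# A tiny stand-in for "exampleout.csv"; the header shape mimics the
# Perfmon/esxtop batch format but the host name and values are invented.
sample = io.StringIO(
    '"(PDH-CSV 4.0)","\\\\host\\Physical Cpu(_Total)\\% Util Time"\n'
    '"02/17/2012 10:00:00","25.0"\n'
    '"02/17/2012 10:00:05","75.0"\n'
)

reader = csv.reader(sample)
header = next(reader)                          # column names
values = [float(row[1]) for row in reader]     # one counter's samples
print(sum(values) / len(values))               # average utilization: 50.0
```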
Top Performance Counters to Use for Initial Problem Determination

Physical/Virtual Machine
CPU (queuing)
• Average physical CPU utilization
• Peak physical CPU utilization
• CPU time
• Processor queue length
Memory (swapping)
• Average memory usage
• Peak memory usage
• Page faults
• Page fault delta*
Disk (latency)
• Split IO/sec
• Disk read queue length
• Disk write queue length
• Average disk sector transfer time
Network (queuing/errors)
• Total packets/second
• Bytes received/second
• Bytes sent/second
• Output queue length

ESX Host
CPU (queuing)
• PCPU%
• %SYS
• %RDY
• Average physical CPU utilization
• Peak physical CPU utilization
• Physical CPU load average
Memory (swapping)
• State (memory state)
• SWTGT (swap target)
• SWCUR (swap current)
• SWR/s (swap read/sec)
• SWW/s (swap write/sec)
• Consumed
• Active (working set)
• Swapused (instantaneous swap)
• Swapin (cumulative swap in)
• Swapout (cumulative swap out)
• VMmemctl (balloon memory)
Disk (latency, queuing)
• DiskReadLatency
• DiskWriteLatency
• CMDS/s (commands/sec)
• Bytes transferred/received/sec
• Disk bus resets
• ABRTS/s (aborts/sec)
• SPLTCMD/s (I/O split cmds/sec)
Network (queuing/errors)
• %DRPTX (packets dropped – TX)
• %DRPRX (packets dropped – RX)
• MbTX/s (Mb transferred/sec – TX)
• MbRX/s (Mb transferred/sec – RX)

CPU

Performance Counters in Action
CPU – Understanding PCPU versus VCPU
It is important to separate the physical CPU (PCPU) resources of the ESX host from the virtual CPU (VCPU) resources that ESX presents to the virtual machine.
PCPU – The ESX host's processor resources are exposed only to ESX. The virtual machines are not aware of, and cannot report on, those physical resources.
VCPU – ESX effectively assembles one or more virtual CPUs for each virtual machine from the physical machine's processors/cores, based upon the type of resource allocation (e.g., shares, guarantees, minimums).
Scheduling – The virtual machine is scheduled to run inside its VCPU(s), with the virtual machine's reporting mechanism (such as W2K's System Monitor) reporting on the virtual machine's allocated VCPU(s) and remaining Core Four resources.

CPU – Key Question and Considerations
Is there a lack of CPU resources for the VCPU(s) of the virtual machine or for the PCPU(s) of the ESX host?
• Allocation – The CPU allocation for a specific workload can be constrained by the resource settings: the number of CPUs, the amount of shares, or limits. The key field at the virtual machine level is CPU queuing, and at the ESX level it is Ready to Run (%RDY in ESXTOP).
• Capacity – The virtual machine's CPU can be constrained by a lack of sufficient capacity at the ESX host level, as evidenced by PCPU/LCPU utilization.
• Contention – The specific workload may be constrained by the consumption of other workloads operating outside of their typical patterns.
• SMP CPU Skewing – The movement towards lazy scheduling of SMP CPUs can cause delays if one CPU gets too far "ahead" of the others. Look for a higher %CSTP (co-schedule pending).

CPU State Times and Accounting
Accounting: USED = RUN + SYS - OVRLP

High CPU Within One Virtual Machine Caused by Affinity (ESXTOP v4)
[Screen shot: the physical CPU is fully used; one virtual CPU is fully used]
High CPU Within One Virtual Machine (Affinity) (vCenter v4)
[Screen shots: view of the ESX host and view of the VM]

SMP Implementation WITHOUT CPU Constraints – ESXTOP v4
[Screen shot: 4 physical CPUs fully used; one 2-vCPU SMP VM; 4 virtual CPUs fully used; Ready to Run is acceptable]

SMP Implementation WITHOUT CPU Constraints – vC v4
[Screen shot]

SMP Implementation with Mild CPU Constraints (v4)
[Screen shot: 4 physical CPUs fully used; one 2-vCPU SMP VM (7 NWLD); 4 virtual CPUs heavily used; Ready to Run indicates problems]

SMP Implementation with Severe CPU Constraints (v4)
[Screen shots: 4 physical CPUs fully used; two 2-vCPU SMP VMs (7 NWLD); 4 virtual CPUs fully used; Ready to Run indicates severe problems]

CPU Usage – Without Core Sharing
The ESX scheduler tries to avoid sharing the same core.
[Screen shot]

CPU Usage – With Core Sharing
[Screen shot]
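When reading %RDY in the screens above, remember that esxtop reports it summed across a group's worlds, so it must be normalized per vCPU before judging it. A common informal guideline (an assumption of this sketch, not stated in the deck) is that sustained per-vCPU ready time above roughly 10% signals CPU contention:

```python
def rdy_per_vcpu(group_rdy_pct, n_vcpus):
    """Normalize a VM group's summed %RDY to a per-vCPU figure."""
    return group_rdy_pct / n_vcpus

def cpu_contended(group_rdy_pct, n_vcpus, threshold=10.0):
    """Flag contention when per-vCPU ready time exceeds the threshold."""
    return rdy_per_vcpu(group_rdy_pct, n_vcpus) > threshold

# A 4-vCPU VM showing 60% summed ready time averages 15% per vCPU: contended.
print(cpu_contended(60.0, 4))   # True
# The same VM at 20% summed (5% per vCPU) would be considered acceptable.
print(cpu_contended(20.0, 4))   # False
```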
Introduction to the Intel® QuickPath Interconnect
Intel® QuickPath Interconnect is:
• A cache-coherent, high-speed, packet-based, point-to-point interconnect used in Intel's next-generation microprocessors (starting in 2H'08)
• A narrow physical link containing 20 lanes; two uni-directional links complete a QuickPath Interconnect port
• A provider of high-bandwidth, low-latency connections between processors and between processors and the chipset
• Capable of a maximum data rate of 6.4 GT/s; at 2 bytes/T in 2 directions, that yields 25.6 GB/s per port
[Diagram: a four-socket platform with four processors, each with its own memory interface, connected to each other and to the I/O chipsets by Intel® QuickPath Interconnect links]
Interconnect performance for Intel's next-generation microarchitectures.

Intel Topologies
• Lynnfield: no links
• Intel® Core™ i7: 1 full-width link
• Nehalem-EP: 2 full-width links (2S example: two CPUs and two IOHs)
• Nehalem-EX: 4 full-width links (4S example: four fully connected CPUs plus IOHs)
• Intel® Itanium® Processor (Tukwila): 4 full-width links, 2 half-width links
Different numbers of links for different platforms.
Intel: QPI Performance Considerations
• There is not always a direct correlation between processor performance and interconnect latency/bandwidth
• What is important is that the interconnect performs well enough not to limit processor performance
Max theoretical bandwidth:
• A max of 16 bits (2 bytes) of "real" data is sent across a full-width link on each clock edge
• Double-pumped bus with a max initial frequency of 3.2 GHz
• 2 bytes/transfer * 2 transfers/cycle * 3.2 GHz = 12.8 GB/s
• With Intel® QuickPath Interconnect at 6.4 GT/s, that translates to 25.6 GB/s across two simultaneous unidirectional links
Max bandwidth with packet overhead:
• The typical data transaction is a 64-byte cache line
• A typical packet has a header flit, which requires 4 phits to transmit across the link
• The data payload takes 32 phits to transfer (64 bytes at 2 bytes/phit)
• With the CRC sent inline with the data, a data packet requires 4 phits for the header plus 32 phits of payload
• With Intel® QuickPath Interconnect at 6.4 GT/s, a 64B cache line transfers in 5.6 ns, which translates to 22.8 GB/s across two simultaneous unidirectional links

Memory

Memory – Separating Machine and Guest Memory
It is important to note that some statistics refer to guest physical memory while others refer to machine memory. "Guest physical memory" is the virtual-hardware physical memory presented to the VM. "Machine memory" is the actual physical RAM in the ESX host.
In the figure below, two VMs are running on an ESX host, where each block represents 4 KB of memory and each color represents a different set of data on a block. Inside each VM, the guest OS maps virtual memory to its physical memory. The ESX kernel maps guest physical memory to machine memory. Due to ESX page sharing technology, guest physical pages with the same content can be mapped to the same machine page.
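The QPI bandwidth figures quoted on the interconnect slides can be reproduced with simple arithmetic:

```python
# Raw link rate: 2 bytes/transfer * 2 transfers/cycle * 3.2 GHz,
# i.e. 2 bytes at 6.4 GT/s, per direction.
per_direction_gbs = 2 * 2 * 3.2                 # 12.8 GB/s
both_directions_gbs = 2 * per_direction_gbs     # 25.6 GB/s per port

# Effective rate for a 64-byte cache line: 32 payload phits + 4 header phits.
payload_phits, header_phits = 32, 4
efficiency = payload_phits / (payload_phits + header_phits)   # 32/36
effective_gbs = both_directions_gbs * efficiency              # ~22.8 GB/s

# Time to move one cache line across the link: 36 phits at 6.4 GT/s.
transfer_ns = (payload_phits + header_phits) / 6.4            # ~5.6 ns

print(per_direction_gbs, both_directions_gbs,
      round(effective_gbs, 1), round(transfer_ns, 1))
```

This matches the slide's 12.8 GB/s per direction, 25.6 GB/s per port, 22.8 GB/s effective, and 5.6 ns per cache line.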
A Brief Look at Ballooning
• The W2K balloon driver is located in VMware Tools
• ESX sets a balloon target for each workload at start-up and as workloads are introduced/removed
• The balloon driver expands its memory consumption, requiring the virtual machine's operating system to reclaim memory based on its own algorithms
• Ballooning routinely takes 10-20 minutes to reach the target
• The returned memory is then available for ESX to use
Key ballooning fields:
• SZTGT: determined by reservation, limit, and memory shares
• SWCUR = 0: no swapping in the past
• SWTGT = 0: no swapping pressure
• SWR/s, SWW/s = 0: no swapping activity currently

Memory Interleaving Basics
What is it? A process where memory is stored in a non-contiguous form to optimize access performance and efficiency. Interleaving is usually done at cache-line granularity.
Why do it?
• Increase bandwidth by allowing multiple memory accesses at once
• Reduce hot spots, since memory is spread out over a wider area
• Support NUMA (Non-Uniform Memory Access) based OSes/applications – a memory organization with different access times for different sections of memory, due to the memory being located in different places; the data for an application is concentrated in the memory of the same socket
[Diagram: three memory channels per socket feeding the Tylersburg-DP IOH and the system memory map]

Non-NUMA (UMA)
Uniform Memory Access (UMA): addresses are interleaved across memory nodes by cache line. Accesses may or may not have to cross the QPI link.
[Diagram: socket 0 and socket 1 memory, three DDR3 channels each, interleaved in the Tylersburg-DP system memory map]
Uniform Memory Access lacks tuning for optimal performance.
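The two address-to-node mappings just described, cache-line interleaving versus contiguous per-socket blocks, can be contrasted in a toy model (the functions and node sizes are illustrative only):

```python
CACHE_LINE = 64  # bytes

def uma_node(addr, n_nodes=2):
    """UMA/interleaved: consecutive cache lines alternate across nodes."""
    return (addr // CACHE_LINE) % n_nodes

def numa_node(addr, node_size):
    """NUMA: each node owns one contiguous block of the address space."""
    return addr // node_size

# Adjacent cache lines land on different sockets under interleaving...
print(uma_node(0), uma_node(64), uma_node(128))        # 0 1 0
# ...but stay local to one socket under NUMA (4 GB per node here).
print(numa_node(0, 4 << 30), numa_node(64, 4 << 30))   # 0 0
```

This is why interleaving spreads load (and hot spots) while NUMA rewards keeping a thread's data on its own socket.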
NUMA
Non-Uniform Memory Access (NUMA): addresses are not interleaved across memory nodes by cache line. Each CPU has direct access to a contiguous block of memory.
(Figure: Tylersburg-EP system memory map, Socket 0 and Socket 1 memory, each socket with three local DDR3 channels)
Thread affinity benefits from locally attached memory

Memory
ESX Memory Sharing - The "Water Bed Effect"
ESX handles memory shares on an ESX host and across an ESX cluster with a result similar to a single water bed, or a room full of water beds, depending upon the action and the memory allocation type:
Initial ESX boot (i.e., "lying down on the water bed") - ESX sets a target working size for each virtual machine, based upon the memory allocations or shares, and uses ballooning to pare back the initial allocations until those targets are reached (if possible).
Steady state (i.e., "minor position changes") - The host settles into a steady state with small adjustments made to the memory allocation targets. Memory "ripples" occur during steady state, with the amplitude dependent upon the workload characteristics and consumption by the virtual machines.
New event (i.e., "second person on the bed") - The host receives additional workload via a newly started virtual machine, or VMotion moves a virtual machine to the host through a manual step, maintenance mode, or DRS. ESX pares back the target working size of that virtual machine while the other virtual machines lose CPU cycles that are directed to the new workload.
Large event (i.e., "jumping across water beds") - The cluster has a major event that causes a substantial movement of workloads to or between multiple hosts. Each host has to reach a steady state, or DRS determines that a workload is not a good candidate for its current host and moves it to another host that has reached a steady state with available capacity. Maintenance mode is another major event.
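The share-driven rebalancing behind the "water bed" behavior can be illustrated with a toy proportional-share calculation. This is only a sketch: real ESX targets also factor in reservations, limits, and the idle-memory tax, and converge gradually via ballooning rather than instantly.

```python
# Toy proportional-share model: each VM's memory target is its fraction
# of the total shares. Adding a VM "presses down the water bed" for all.
def share_targets(total_mb, shares):
    s = sum(shares)
    return [total_mb * x / s for x in shares]

# Three equal-share VMs on a host with 6144 MB available for VMs:
print(share_targets(6144, [1000, 1000, 1000]))        # each targets 2048 MB
# A fourth equal-share VM arrives; every target shrinks:
print(share_targets(6144, [1000, 1000, 1000, 1000]))  # each targets 1536 MB
```

The "ripples" on the slide correspond to these targets being continually recomputed as shares and working sets change, with ballooning doing the slow physical adjustment toward each new target.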
Memory
Memory – Key Question and Considerations
Is the memory allocation for each workload optimum to prevent swapping at the virtual machine level, yet low enough not to constrain other workloads or the ESX host?
HA/DRS/Maintenance Mode Regularity - How often do the workloads in the cluster get moved between hosts? Each movement impacts the receiving (negatively) and sending (positively) hosts, with maintenance mode causing a rolling wave of impact across the cluster, depending upon the timing.
Allocation Type - Each of the allocation types has its drawbacks, so tread carefully when choosing one. One size seldom is right for all needs.
Capacity/Swapping - A virtual machine can be constrained by a lack of sufficient capacity at the ESX host level. Look for regular swapping at the ESX host level as an indicator of a memory capacity issue, but watch for memory leaks that artificially force a memory shortage.

Memory
Idle State on Test Bed – Memory View V4

Memory
Memory View at Steady State of 3 Virtual Machines – Memory Shares V4
(Screen annotations: most memory is not reserved; one virtual machine just powered on; these VMs are at memory steady state; no VM swapping or swap targets)

Memory
Ballooning and Swapping in Progress – Memory View V4
(Screen annotations: possible memory states are high, soft, hard, and low; different size targets due to different amounts of up time; ballooning in effect; mild swapping)

Memory
Memory Reservations – Effect on New Loads V4
What Size Virtual Machine with Reserved Memory Can Be Started?
(Screen annotations: 6 GB of "free" physical memory due to memory sharing over 20 minutes; 666 MB of unreserved memory; three VMs, each with 2 GB of reserved memory; cannot start a fourth virtual machine with >512 MB of reserved memory; a fourth virtual machine with 512 MB of reserved memory started)

Memory
Memory Shares – Effect on New Loads V4
(Screen annotations: three VMs with 2 GB allocations; 5.9 GB of "free" physical memory; 6 GB of unreserved memory; a fourth virtual machine with a 2 GB memory allocation started successfully)

Memory
Virtual Machine with Memory Greater Than a Single NUMA Node V5
(Screen annotations: remote NUMA access; local NUMA access; % local NUMA access)

Memory
Wide-NUMA Support in V5 – 1 vCPU, 1 NUMA Node

Memory
Wide-NUMA Support in V5 – 8 vCPUs, 2 NUMA Nodes

Power Management

Power Management
Power Management Screen V5

Power Management
Impact of Power States on CPU

Power Management
Power Management Impact on CPU V5

Storage

Storage Considerations
Storage – Key Question and Considerations
Is the bandwidth and configuration of the storage subsystem sufficient to meet the desired latency (a.k.a. response time) for the target workloads? If the latency target is not being met, further analysis may be very time consuming.
Storage frame specifications refer to the aggregate bandwidth of the frame or its components, not the single-path capacity of those components.
Queuing - Queuing can happen at any point along the storage path, but it is not necessarily a bad thing if the latency meets requirements.
Storage Path Configuration and Capacity - It is critical to know the configuration of the storage path and the capacity of each component along that path. The number of active vmkernel commands must be less than or equal to the maximum queue depth of any of the storage path components while processing the target storage workload.

Storage Considerations
Storage – Aggregate versus Single Paths
Storage frame specifications refer to the aggregate bandwidth of the frame or its components, not the single-path capacity of those components*
DMX Message Bandwidth: 4-6.4 GB/s
DMX Data Bandwidth: 32-128 GB/s
Global Memory: 32-512 GB
Concurrent Memory Transfers: 16-32 (4 per Global Memory Director)
Performance measurement for storage is all about individual paths and the performance of the components contained in each path
(* Source – EMC Symmetrix DMX-4 Specification Sheet, c1166-dmx4-ss.pdf)

Storage Considerations
Storage – More Questions
Virtual Machines per LUN - The number of outstanding active vmkernel commands per virtual machine, times the number of virtual machines on a specific LUN, must be less than the queue depth of that adapter
How fast can the individual disk drive process a request?
Based upon the block size and type of I/O (sequential read, sequential write, random read, random write), what type of configuration (RAID, number of physical spindles, cache) is required to match the I/O characteristics and workload demands for average and peak throughput?
Can the network storage (SAN frame) handle the I/O rate down each path and aggregated across the internal bus, frame adapters, and front-end processors?
To answer these questions we need to better understand the underlying design, considerations, and basics of the storage subsystem

Storage Considerations
Back-end Storage Design Considerations
Capacity - What is the storage capacity needed for this workload/cluster?
Disk drive size (e.g., 144 GB, 300 GB)
Number of disk drives needed within a single logical unit (LUN)
IOPS Rate - How many I/Os per second are required at the needed latency?
Number of physical spindles per LUN
Impact of sharing physical disk drives between LUNs
Configuration (e.g., cache) and speed of the disk drive
Availability - How many disk drives or storage components can fail at one time?
Type of RAID chosen; number of parity drives per grouping
Amount of redundancy built into the storage solution
Cost - Delivered cost per byte at the required speed and availability
Many options are available for each design consideration; the cumulative requirements for capacity, IOPS rate, and availability often dictate the final choice for each component and the overall solution

Storage Considerations
Storage from the Ground Up – Basic Definitions: Mechanical Drives
Disk Latency - The average time it takes for the requested sector to rotate under the read/write head after a completed seek
5400 RPM (5.5 ms), 7200 RPM (4.2 ms), 10,000 RPM (3 ms), 15,000 RPM (2 ms)
Average disk latency = 1/2 * rotation time
Throughput (MB/sec) = (Outstanding I/Os / latency (msec)) * block size (KB)
Seek Time - The time it takes for the read/write head to find the physical location of the requested data on the disk
Average seek time: 8-10 ms
Access Time - The total time it takes to locate the data on the drive(s).
This includes seek time, latency, settle time, and command processing overhead time.
Host Transfer Rate - The speed at which the host can transfer data across the disk interface.

Storage Considerations
Network Storage Components That Can Affect Performance/Availability
Size and use of cache (i.e., % dedicated to reads versus writes)
Number of independent internal data paths and buses
Number of front-end interfaces and processors
Types of interfaces supported (e.g., Fibre Channel and iSCSI)
Number and type of physical disk drives available
MetaLUN Expansion
MetaLUNs allow for the aggregation of LUNs
The system typically re-stripes data when a MetaLUN is changed
Some performance degradation occurs during re-striping
Storage Virtualization
Aggregation of storage arrays behind a presented mount point/LUN
Movements between disk drives and tiers are controlled by the storage management layer
Changes to the physical drives and configuration may be transient and severe

Case Studies - Storage
Test Bed Idle State – Device Adapter View V4
(Screen annotations: average device latency per command; storage adapter maximum queue length; world maximum queue length; LUN maximum queue length)

Case Studies - Storage
Moderate Load on Two Virtual Machines V4
Commands are queued, BUT there is acceptable latency from the disk subsystem

Case Studies - Storage
Heavier Load on Two Virtual Machines
Commands are queued and are exceeding maximum queue lengths, BUT virtual machine latency is consistently above 20 ms, so performance could start to be an issue
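The rotational-latency figures quoted in the mechanical-drive basics can be reproduced directly from the "half a rotation" rule. A quick check in code:

```python
# Check of the per-RPM rotational latency values quoted in the
# mechanical-drive slide: average latency is half a rotation.
def avg_rotational_latency_ms(rpm):
    # One rotation takes (60 / rpm) seconds; average latency is half that, in ms
    return (60.0 / rpm) * 1000 / 2

for rpm in (5400, 7200, 10000, 15000):
    print(rpm, round(avg_rotational_latency_ms(rpm), 1))
# 5400 -> 5.6, 7200 -> 4.2, 10000 -> 3.0, 15000 -> 2.0 (the slide rounds 5.56 to 5.5)
```

These are floor values: seek time (8-10 ms average) and command overhead are added on top, which is why even a 15,000 RPM drive cannot get much below ~10 ms per random I/O without caching.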
Case Studies - Storage
Heavy Load on Four Virtual Machines
Commands are queued and are exceeding maximum queue lengths, AND virtual machine latency is consistently above 60 ms for some VMs, so performance will be an issue

Case Studies - Storage
Artificial Constraints on Storage - A Problem with the Disk Subsystem
(Screen annotations: good throughput with low device latency; bad throughput with high device latency - cache disabled)

Understanding the Disk Counters and Latencies

Understanding Disk I/O Queuing

Storage Considerations
SAN Storage Infrastructure – Areas to Watch/Consider
(Diagram labels: HBA speed; fiber bandwidth; FA CPU speed; disk response; RAID configuration; number of spindles in the LUN/array; block size; ISL; FC switch; director; disk speeds; world queue length; LUN queue length; storage adapter queue length; cache size/type)

Storage Considerations
Storage Queuing – The Key Throttle Points
(Diagram: the world queue lengths (WQLEN) of VMs 1-4 on the ESX host feed the LUN queue length (LQLEN) and the HBA execution throttle, and then the HBA and the storage area network)

Storage Considerations
Storage I/O – The Key Throttle Point Definitions
Storage Adapter Queue Length (AQLEN)
The number of outstanding vmkernel active commands that the adapter is configured to support. This is not settable; it is a parameter passed from the adapter to the kernel.
LUN Queue Length (LQLEN)
The maximum number of permitted outstanding vmkernel active commands to a LUN.
(This would be the queue depth setting for an HBA.) This is set in the storage adapter configuration via the command line.
World Queue Length (WQLEN) - VMware recommends not to change this!
The maximum number of permitted outstanding vmkernel active requests to a LUN from any single virtual machine (min: 1, max: 256, default: 32)
Configuration -> Advanced Settings -> Disk -> Disk.SchedNumReqOutstanding
Execution Throttle (this is not a displayed counter)
The maximum number of permitted outstanding vmkernel active commands that can be executed on any one HBA port (min: 1, max: 256, default: ~16, depending on vendor). This is set in the HBA driver configuration.

Storage Considerations
Queue Length Rules of Thumb
For a lightly loaded system, the average queue length should be less than 1 per spindle, with occasional spikes up to 10. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 0.6 per spindle, and less than 0.3 per spindle above a RAID-5 controller.
For a heavily loaded system that is not saturated, the average queue length should be less than 2.5 per spindle, with infrequent spikes up to 20. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 1.5 per spindle, and less than 1 per spindle above a RAID-5 controller.

Closing Thoughts
Know the key counters to look at for each type of resource
Be careful about which type of resource allocation technique you use for CPU and RAM. One size may NOT fit all.
Consider the impact of events such as maintenance on the performance of a cluster
Set up a simple test bed where you can create simple loads to become familiar with the various performance counters and tools
Compare your test bed analysis and performance counters with the development and production clusters
Know your storage subsystem's components and configuration, given the large impact they can have on overall performance
Take the time to learn how the various components of the virtual infrastructure work together

John Paul – johnathan.paul@siemens.com