ESX Performance Troubleshooting
VMware Technical Support, Broomfield, Colorado
Confidential © 2009 VMware Inc. All rights reserved

What is slow performance?
• What does slow performance mean?
  - Application responds slowly – latency
  - Application takes longer to do a job – throughput
  - Both are related to time
• Interpretation varies widely
  - Slower than expectation
  - Throughput is low
  - Latency is high
  - Throughput and latency are fine, but excessive resources are used (efficiency)
• What are high latency, low throughput, and excessive resource usage?
  - These are subjective and relative

Bandwidth, Throughput, Goodput, Latency
• Bandwidth vs. Throughput
  - Higher bandwidth does not guarantee higher throughput
  - Low bandwidth is a bottleneck for higher throughput
• Throughput vs. Goodput
  - Higher throughput does not mean higher goodput
  - Low throughput is indicative of lower goodput
  - Efficiency = Goodput / Bandwidth
• Throughput vs. Latency
  - Low latency does not guarantee higher throughput, and vice versa
  - Throughput or latency alone can dominate performance
(Diagram: relationship between bandwidth, latency, goodput and throughput)

How to measure performance?
• Higher throughput does not necessarily mean higher performance – goodput could be low
• Throughput is easy to measure, but goodput is not
• How do we measure performance?
  - Performance is actually never measured directly
  - We can only quantify the different metrics that affect performance. These metrics describe the state of CPU, memory, disk and network.

Performance Metrics
• CPU – Throughput: MIPS (%used); Goodput: useful instructions; Latency: instruction latency (cache latency, cache misses)
• Memory – Throughput: MB/s; Goodput: useful data; Latency: nanoseconds
• Storage – Throughput: MB/s, IOPS; Goodput: useful data; Latency: seek time
• Networking – Throughput: MB/s, I/Os per second; Goodput: useful traffic; Latency: microseconds

Hardware and Performance
CPU
• Processor architecture: Intel Xeon, AMD Opteron
• Processor cache – L1, L2, L3, TLB
• Hyperthreading
• NUMA

Hardware and Performance
Processor Architecture
• Clock speeds from one architecture are not comparable with another
  - P-III outperforms P4 on a clock-by-clock basis
  - Opteron outperforms P4 on a clock-by-clock basis
• A higher clock speed is not always beneficial
  - A bigger cache or a better architecture may outperform higher clock speeds
• Processor–memory communication is often the performance bottleneck
  - The processor wastes hundreds of instruction cycles while waiting on memory access
  - Caching alleviates this issue

Hardware and Performance
Processor Cache
• Cache reduces memory access latency
• A bigger cache increases the cache hit probability
• Why not build a bigger cache?
  - Expensive
  - Cache access latency increases with cache size
• Cache is built in stages – L1, L2, L3 – with varying access latency
• ESX benefits from larger cache sizes
• L3 cache seems to boost performance of networking workloads
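The cache trade-off above (a bigger cache hits more often but takes longer to access) can be made concrete with the standard average-memory-access-time formula. A minimal Python sketch, using assumed latencies and miss rates rather than figures for any particular processor:

    # AMAT = hit_time + miss_rate * miss_penalty (all in CPU cycles).
    # The numbers below are illustrative assumptions, not vendor data.

    def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
        """Average cycles per memory access for a single cache level."""
        return hit_time_cycles + miss_rate * miss_penalty_cycles

    # A small, fast cache with a higher miss rate...
    small_cache = amat(hit_time_cycles=3, miss_rate=0.10, miss_penalty_cycles=200)
    # ...versus a bigger, slower cache with a lower miss rate.
    big_cache = amat(hit_time_cycles=10, miss_rate=0.02, miss_penalty_cycles=200)

    print(f"small cache: {small_cache:.1f} cycles/access")  # 23.0
    print(f"big cache:   {big_cache:.1f} cycles/access")    # 14.0

Here the bigger cache wins despite its slower hit time, which is the same reasoning behind "a bigger cache or a better architecture may outperform higher clock speeds".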
Hardware and Performance
TLB – Translation Lookaside Buffer
• Every running process needs virtual address (VA) to physical address (PA) translation
• Historically this translation was done entirely from page tables in memory
• Since memory access is significantly slower, and a process needs access to this table on every context switch, the TLB was introduced
• The TLB is hardware circuitry that caches VA-to-PA mappings
• When a VA is not present in the TLB, a miss occurs and the mapping has to be loaded into the TLB from the page tables (load latency)
• Application performance depends on effective use of the TLB
• The TLB is flushed during a context switch

Hardware and Performance
Hyperthreading
• Introduced with the Pentium 4 and Xeon processors
• Allows simultaneous execution of two threads on a single processor
• HT maintains separate architectural state for each logical processor but shares the underlying processor resources such as execution units and cache
• HT strives to improve throughput by taking advantage of processor stalls on the other logical processor
• HT performance can be worse than uniprocessor (non-HT) performance if the threads have a high cache hit rate (more than 50%)

Hardware and Performance
Multicores
• Cores have their own L1 cache
• The L2 cache is shared between cores
• Cache coherency is relatively faster compared to SMP systems
• Performance scaling is the same as for SMP systems

Hardware and Performance
NUMA
• Memory contention increases as the number of processors increases
• NUMA alleviates memory contention by localizing memory per processor

Hardware and Performance – Memory
Node Interleaving
• Opteron processors support two types of memory access – NUMA and node interleaving mode
• Node interleaving mode alternates memory pages between processor nodes so that memory latencies are made uniform. This can offer performance improvements to systems that are not NUMA aware.
• On single-core Opteron systems a NUMA node contains only one core.
• An SMP VM on ESX running on a single-core Opteron system will have to access memory across the NUMA boundary, so SMP VMs may benefit from node interleaving.
• On dual-core Opteron systems a single NUMA node has two cores, so NUMA mode can be turned on.

Hardware and Performance – I/O devices
I/O Devices
• PCI-E, PCI-X, PCI
  - PCI at 66 MHz – 533 MB/s
  - PCI-X at 133 MHz – 1066 MB/s
  - PCI-X at 266 MHz – 2133 MB/s
  - PCI-E bandwidth depends on the number of lanes; each lane adds 250 MB/s, so x16 lanes give 4 GB/s
• PCI bus saturation – dual-port, quad-port devices
  - In the PCI protocol the bus bandwidth is shared by all the devices on the bus; only one device can communicate at a time
  - PCI-E allows parallel, full-duplex transmission with the use of lanes
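As a rough sanity check on the bus figures above, the sketch below compares the aggregate line rate of a multi-port adapter against the shared bandwidth of the bus it sits on. The adapter mix and the assumption of sustained line-rate traffic are hypothetical; PCI-E is modeled with the 250 MB/s-per-lane figure quoted above.

    # Will a multi-port adapter saturate its bus? Illustrative check only.
    BUS_MB_S = {
        "PCI 66MHz": 533,
        "PCI-X 133MHz": 1066,
        "PCI-X 266MHz": 2133,
    }

    def pcie_mb_s(lanes):
        """PCI-E bandwidth per direction, at 250 MB/s per lane."""
        return 250 * lanes

    def saturates(ports, mb_s_per_port, bus_mb_s):
        """True if the combined port traffic exceeds the shared bus bandwidth."""
        return ports * mb_s_per_port > bus_mb_s

    # Quad-port gigabit NIC, ~125 MB/s per port in one direction, on plain PCI:
    print(saturates(4, 125, BUS_MB_S["PCI 66MHz"]))   # False -- 500 vs 533, barely fits
    # The same card pushing full duplex (~250 MB/s per port):
    print(saturates(4, 250, BUS_MB_S["PCI 66MHz"]))   # True -- the shared bus is the bottleneck
    # A x4 PCI-E slot, where lanes are dedicated and full duplex:
    print(pcie_mb_s(4))                               # 1000 MB/s per direction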
Hardware and Performance – I/O Devices
SCSI
• Ultra3/Ultra160 SCSI – 160 MB/s
• Ultra320 SCSI – 320 MB/s
• SAS 3 Gbps – 300 MB/s, full duplex
FC
• Speed constrained by the medium and laser wavelength
• Link speeds: 1G FC – 200 MB/s, 2G – 400 MB/s, 4G – 800 MB/s, 8G – 1600 MB/s

ESX Architecture – Performance Perspective

ESX Architecture – Performance Perspective
CPU Virtualization – Virtual Machine Monitor
• ESX does not trap and emulate every instruction; the x86 architecture does not allow this
• System calls and faults are trapped by the monitor
• Guest code runs in one of three contexts:
  - Direct execution
  - Monitor code (fault handling)
  - Binary translation (BT – non-virtualizable instructions)
• BT behaves much like a JIT compiler
• Previously translated code fragments are stored in the translation cache and reused – saves translation overhead

ESX Architecture – Performance Implications
Virtual Machine Monitor – Performance Implications
• Programs that don't fault or invoke system calls run at near-native speed – e.g. gzip
• Micro-benchmarks that do nothing but invoke system calls will incur nothing but monitor overhead
• Translation overhead varies with different privileged instructions; the translation cache offsets some of the overhead
• Applications will have varying amounts of monitor overhead depending on their call stack profile
• The call stack profile of an application can vary depending on its workload, errors and other factors
• It is hard to generalize monitor overheads for any workload. Monitor overheads measured for an application are strictly applicable only to identical test conditions.

ESX Architecture – Performance Perspective
Memory Virtualization
• Modern OSes set up page tables for each running process. The x86 paging hardware (TLB) caches VA–PA mappings.
• Page table shadowing – an additional level of indirection
  - The VMM maintains PA–MA mappings in a shadow table
  - Allows the guest to use the x86 paging hardware with the shadow table
• MMU updates
  - The VMM write-protects the shadow page tables (trace)
  - When the guest updates a page table, the monitor kicks in (page fault) and keeps the shadow page table consistent with the physical page table
• Hidden page faults
  - Trace faults are hidden from the guest OS – monitor overhead. Hidden page faults are similar to TLB misses in native environments.

ESX Architecture – Performance Perspective
Page table shadowing (diagram)

ESX Architecture – Performance Implications
Context Switches
• On native hardware the TLB is flushed during a context switch; the newly switched-in process incurs a TLB miss on its first memory access
• The VMM caches page table entries (PTEs) across context switches (caching MMU) and tries to keep the shadow PTEs consistent with the physical PTEs
• If there are lots of processes running in the guest and they context switch frequently, the VMM may run out of page-table caching. Workload=terminalservices (vmx option) increases this cache size.
Process Creation
• Every new process requires new page-table mappings, so MMU updates are frequent
• Shell scripts that spawn many commands can cause MMU overhead

ESX Architecture – Performance Perspective
I/O Path (diagram)

ESX Architecture – Performance Perspective
I/O Virtualization
• I/O devices are not virtualizable and are therefore emulated for the guest OS
• The VMkernel handles storage and networking devices directly, as they are performance critical in server environments; CD-ROM and floppy devices are handled by the service console
• I/O is interrupt driven and therefore incurs monitor overhead
  - All I/O goes through the VMkernel and involves a context switch from the VMM to the VMkernel
• The latency of a networking device is lower, so delays due to context switches can hamper throughput
• The VMkernel fields I/O interrupts and delivers them to the correct VM. From ESX 2.1, the VMkernel delivers the interrupts to the idle processor.
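The effect of per-I/O context switches on throughput can be illustrated with a simple model: if every I/O pays a fixed virtualization cost on top of its native service time, the achievable I/O rate (and hence throughput) drops, and small, frequent I/Os suffer the most. The numbers below are invented for illustration and assume one I/O in flight at a time; they are not measured ESX overheads.

    # Throughput when each I/O pays a fixed extra cost (illustrative model).
    # Assumes fully serialized I/O: one request outstanding at a time.

    def iops(native_service_us, overhead_us):
        """I/Os per second when each I/O takes its service time plus a fixed overhead."""
        return 1_000_000 / (native_service_us + overhead_us)

    def throughput_mb_s(io_size_kb, native_service_us, overhead_us):
        return iops(native_service_us, overhead_us) * io_size_kb / 1024

    # Small, network-style I/Os feel the per-I/O cost...
    print(throughput_mb_s(io_size_kb=1.5, native_service_us=50, overhead_us=20))   # ~21 vs ~29 MB/s without overhead
    # ...large, storage-style I/Os barely notice it.
    print(throughput_mb_s(io_size_kb=64, native_service_us=5000, overhead_us=20))  # ~12.4 vs ~12.5 MB/s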
ESX Architecture – Performance Perspective
Virtual Networking
• Virtual NICs
  - Queue buffers can overflow if the packet tx/rx rate is high and the VM is not scheduled frequently
  - VMs are scheduled when they have packets for delivery
  - Idle VMs still receive broadcast frames, which wastes CPU resources
  - Guest speed/duplex settings are irrelevant
• Virtual switches don't learn MAC addresses
  - VMs register their MAC addresses, so the virtual switch knows the location of each MAC
• VMnics
  - Listen for the MAC addresses registered by the VMs
  - Layer 2 broadcast frames are passed up

ESX Architecture – Performance Perspective
NIC Teaming
• Teaming only provides outbound load balancing
• NICs with different capabilities can be teamed; the least common capability in the bond is used
• Out-MAC mode scales with the number of VMs/virtual NICs; traffic from a single virtual NIC is never load balanced
• Out-IP mode scales with the number of unique TCP/IP sessions
• Incoming traffic can arrive on the same NIC; link aggregation on the physical switches provides inbound load balancing
• Packet reflections can cause performance hits in the guest OS (no empirical data available)
• We fail back when the link comes alive again; performance can be affected if the link flaps

ESX Architecture – Performance Perspective
vmxnet Optimizations
• vmxnet handles clusters of packets at once – reduces context switches and interrupts
• Clustering kicks in only when the packet receive/transmit rate is high
• vmxnet shares a memory area with the VMkernel – reduces copying overhead
• vmxnet can take advantage of TCP checksum and segmentation offloading (TSO)
• NIC morphing allows loading the vmxnet driver for a vlance virtual device; it probes a new register of the vlance device
• Performance of a NIC-morphed vlance device is the same as that of the vmxnet virtual device

ESX Architecture – Performance Perspective
SCSI Performance
• Queue depth determines SCSI throughput. When a queue is full, SCSI I/Os are blocked, limiting effective throughput.
• Stages of queuing: BusLogic/LSILogic -> VMkernel queue -> VMkernel driver queue depth -> device firmware queue -> queue depth of the LUN
• Sched.numrequestOutstanding – number of outstanding I/O commands per VM – see KB 1269
• The BusLogic driver in Windows limits the queue depth to 1 – see KB 1890
• Registry settings are available for maximizing the queue depth of the LSILogic adapter (Maximum Number of Concurrent I/Os)
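The queue-depth point above is essentially Little's law: with a given device latency, the number of I/Os that can be kept outstanding caps the achievable IOPS. A minimal sketch with assumed latency and queue-depth values, not ESX measurements:

    # Little's law: sustainable IOPS ~= outstanding I/Os / per-I/O latency.

    def max_iops(queue_depth, latency_ms):
        """Upper bound on IOPS with 'queue_depth' I/Os kept in flight."""
        return queue_depth / (latency_ms / 1000.0)

    # A LUN with ~10 ms service time:
    print(max_iops(queue_depth=1, latency_ms=10))    # 100  -- the Windows BusLogic queue-depth-1 case
    print(max_iops(queue_depth=32, latency_ms=10))   # 3200 -- with a healthier queue depth

This is why the Windows BusLogic queue depth of 1 (KB 1890) is so damaging: the ceiling collapses no matter how fast the array behind it is.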
ESX Architecture – Performance Perspective
VMFS
• Uses larger block sizes (1 MB default)
  - A larger block size reduces metadata size – the metadata is completely cached in memory
  - Near-native speed is possible because the metadata overhead is removed
  - Fewer I/O operations; improves read-ahead cache hits for sequential reads
• Spanning
  - Data fills the next LUN sequentially after the first overflows. There is no striping, so spanning does not offer performance improvements.
• Distributed access
  - Multiple ESX hosts can access the VMFS volume; only one ESX host updates the metadata at a time

ESX Architecture – Performance Perspective
VMFS
• Volume locking
  - Metadata updates are performed through a locking mechanism
  - A SCSI reservation is used to lock the volume
  - Do not confuse this locking with the file-level locks implemented in the VMFS volume for the different access modes
• SCSI reservations
  - A SCSI reservation blocks all I/O operations until the lock is released by the owner
  - A SCSI reservation is usually held for a very short time and released as soon as the update is performed
  - A SCSI reservation conflict happens when a reservation is attempted on a volume that is already locked. This usually happens when multiple ESX hosts contend for metadata updates.

ESX Architecture – Performance Perspective
VMFS
• Contention for metadata updates
  - Redo log updates from multiple ESX hosts
  - Template deployment with redo log activity
  - Anything that changes/modifies file permissions on every ESX host
• VMFS 3.0 uses a new volume locking mechanism that significantly reduces the number of SCSI reservations used

ESX Architecture – Performance Perspective
Service Console
• The service console can share interrupt resources with the VMkernel. Shared interrupt lines reduce the performance of I/O devices – KB 1290.
• MKS is handled in the service console in ESX 2.x, and its performance is determined by the resources available in the COS
• The default min CPU allocation is 8% and may not be sufficient if there are lots of VMs running
• Memory recommendations for the service console do not account for memory that will be used by agents
• Scalability of VMs is limited by the COS in ESX 2.x. ESX 3.x avoids this problem with VMkernel userworlds.

Understanding ESX Resource Management & Over-Commitment

ESX Resource Management
Scheduling
• Only one VCPU runs on a CPU at any time
• The scheduler tries to run the VM on the same CPU as much as possible
• The scheduler can move VMs to other processors when it has to meet the CPU demands of the VM
Co-scheduling
• SMP VMs are co-scheduled, i.e. all the VCPUs run on their own PCPUs/LCPUs simultaneously
• Co-scheduling facilitates synchronization/communication between processors, as in the case of a spinlock wait between CPUs
• The scheduler can run one VCPU without the other for a short period of time (1.5 ms)
• The guest could halt a co-scheduled CPU it is not using, but Windows doesn't seem to halt the CPU – this wastes CPU cycles

ESX Resource Management
NUMA Scheduling
• The scheduler tries to keep a world within the same NUMA node so that cross-NUMA migrations are fewer
• If a VM's memory pages are split between NUMA nodes, the memory scheduler slowly migrates the VM's pages to the local node. Over time the system becomes completely NUMA balanced.
• On NUMA architectures, CPU utilization per NUMA node gives a better idea of CPU contention
• When evaluating %ready, factor in the CPU contention within the same NUMA node
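Because contention happens within a NUMA node, averaging CPU utilization across the whole host can hide a saturated node. The sketch below aggregates per-PCPU utilization by node; the node layout and the utilization figures are invented for illustration, not esxtop output.

    # Per-NUMA-node CPU utilization from per-PCPU samples (illustrative data).

    pcpu_used = {
        0: [95, 90, 88, 93],   # node 0: nearly saturated
        1: [10, 15, 12, 8],    # node 1: mostly idle
    }

    node_util = {node: sum(u) / len(u) for node, u in pcpu_used.items()}
    host_util = sum(sum(u) for u in pcpu_used.values()) / sum(len(u) for u in pcpu_used.values())

    print(node_util)   # {0: 91.5, 1: 11.25}
    print(host_util)   # ~51% overall -- looks moderate, yet VMs placed on node 0 will accrue ready time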
ESX Resource Management
Hyperthreading
• Hyperthreading support was added in ESX 2.1 and is recommended
• Hyperthreading increases the scheduler's flexibility, especially when running SMP VMs alongside UP VMs
• A VM scheduled on an LCPU is charged only half the "package seconds"
• The scheduler tries to avoid scheduling an SMP VM onto the logical CPUs of the same package
• A high-priority VM may be scheduled onto a package with one of its LCPUs halted – this prevents other running worlds from using the same package

ESX Resource Management
HTSharing
• Controls hyperthreading behavior for individual VMs
• htsharing=any
  - Virtual CPUs can be scheduled on any LCPUs. The most flexible option for the scheduler.
• htsharing=none
  - Excludes sharing of LCPUs with other VMs. A VM with this option gets a full package or does not get scheduled. Essentially this excludes the VM from using logical CPUs (useful for the security paranoid). Use this option if an application in the VM is known to perform poorly with HT.
• htsharing=internal
  - Applies to SMP VMs only. Same as none, but allows the VCPUs of the same VM to share a package. Best of both worlds for SMP VMs. For UP VMs this translates to none.

ESX Resource Management
HT Quarantining
• ESX uses P4 performance counters to constantly evaluate the HT performance of running worlds
• If a VM appears to interact badly with HT, the VM is automatically placed into quarantining mode (i.e. htsharing is set to none)
• If the bad events disappear, the VM is automatically pulled back out of quarantining mode
• Quarantining is completely transparent

ESX Resource Management
CPU Affinity
• Defines a subset of LCPUs/PCPUs that a world can run on
• Useful to:
  - Partition a server between departments
  - Troubleshoot system reliability issues
  - Manually set NUMA affinity in ESX 1.5.x
  - Help applications that benefit from cache affinity
• Caveats
  - Worlds that don't have affinity can run on any CPU, so they have a better chance of getting scheduled
  - Affinity reduces the scheduler's ability to maintain fairness – min CPU guarantees may not be possible under some circumstances
  - NUMA optimizations (page migrations) are excluded for VMs that have CPU affinity (manual memory affinity can be enforced)
  - SMP VMs should not be pinned to LCPUs
  - Disallows VMotion operations

ESX Resource Management
Proportional Shares
• Shares are used only when there is resource contention
• Unused shares (shares of a halting/idling VM) are partitioned across the active VMs
• In ESX 2.x shares operate on a flat namespace
• Changing the shares of one world affects the effective CPU cycles received by other running worlds
• If VMs use a different share scale, then the shares of the other worlds should be changed to the same scale
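Under contention, shares translate into a fraction of the available CPU, and the shares of idle VMs effectively drop out, as noted above. A minimal sketch of that arithmetic with made-up share values; it ignores min/max settings and the resource-pool hierarchy introduced in ESX 3.0.

    # CPU entitlement from proportional shares under contention (illustrative).

    def entitlements(shares_by_vm, active_vms, total_mhz):
        """Split total_mhz across the *active* VMs in proportion to their shares."""
        active = {vm: shares_by_vm[vm] for vm in active_vms}
        total_shares = sum(active.values())
        return {vm: total_mhz * s / total_shares for vm, s in active.items()}

    shares = {"web": 2000, "db": 4000, "batch": 1000}

    # All three VMs are busy:
    print(entitlements(shares, ["web", "db", "batch"], total_mhz=6000))  # web ~1714, db ~3429, batch ~857
    # 'batch' goes idle, so its shares no longer count:
    print(entitlements(shares, ["web", "db"], total_mhz=6000))           # web 2000, db 4000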
ESX Resource Management
Minimum CPU
• Guarantees CPU resources when the VM requests them
• Unused resources are not wasted; they are given to other worlds that require them
• Setting min CPU to 100% (200% in the case of SMP) ensures that the VM is not bound by CPU resource limits
• Using min CPU is favored over using CPU affinity or proportional shares
• Admission control verifies whether the min CPU can be guaranteed when the VM is powered on or VMotioned

ESX Resource Management
Demystifying "Ready" Time
• A powered-on VM can be either running, halted or in the ready state
• Ready time is the time a VM spends on the run queue waiting to be scheduled
• Ready time accrues when more than one world wants to run at the same time on the same CPU
  - PCPU/VCPU over-commitment with CPU-intensive workloads
  - Scheduler constraints – CPU affinity settings
• Higher ready time increases response times and job completion times
• Total accrued ready time is not useful
  - A VM could have accrued ready time during its runtime without incurring performance loss (for example during boot)
• %ready = ready time accrual rate

ESX Resource Management
Demystifying "Ready" Time
• There are no universally good/bad values for %ready. It depends on the priority of the VMs – latency-sensitive applications may tolerate little or no ready time.
• Ready time can be reduced by increasing the priority of the VM
  - Allocate more shares, set a min CPU, remove CPU affinity

ESX Resource Management
Unexplained "Ready" Time
• If a VM accrues ready time while there are enough CPU resources, it is called "unexplained ready time"
• There is a belief in the field that such a thing actually exists – hard to prove or disprove
• It is very hard to determine whether CPU resources are available when ready time accrues:
  - CPU utilization is not a good indicator of CPU contention
  - Burstiness is very hard to determine
  - NUMA boundaries – all VMs may contend within the same NUMA node
  - Misunderstanding of how the scheduler works

ESX Resource Management
Resource Management in ESX 3.0
• Resource pools
  - Extend the hierarchy. Shares operate within the resource pool domain.
• MHz
  - Resource allocations are absolute, based on clock cycles. Percentage-based allocations can vary with processor speed.
• Clusters
  - Aggregate resources from multiple ESX hosts

Resource Over-Commitment
CPU Over-Commitment
• Scheduling: too many things to do!
  - Symptoms: high %ready
  - Judicious use of SMP
• CPU utilization: too much to do!
  - Symptoms: 100% CPU
  - Things to watch:
    - Misbehaving applications inside the guest
    - Do not rely on guest CPU utilization – halting issues, timer interrupts
    - Some applications/services seem to impact guest halting behavior. No longer tied to SMP HALs.

Resource Over-Commitment
CPU Over-Commitment
• Higher CPU utilization does not necessarily mean lower performance
  - The application's progress is not affected by higher CPU utilization alone
  - However, if the higher CPU utilization is due to monitor overheads, it may impact performance by increasing latency
  - When there is no headroom (100% CPU), performance degrades
• 100% CPU utilization and %ready are almost identical in effect – both delay application progress
• CPU over-commitment can lead to other performance problems:
  - Dropped network packets
  - Poor I/O throughput
  - Higher latency, poor response times
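Since %ready is defined above as an accrual rate, it is worth being explicit about the arithmetic: the ready time added during a sampling interval divided by the interval length. The counter values below are invented; this is not parsing real esxtop or /proc output.

    # %ready = ready time accrued during the sample interval / interval length.

    def pct_ready(ready_ms_start, ready_ms_end, interval_ms):
        """Ready-time accrual rate over one sampling interval, as a percentage."""
        return 100.0 * (ready_ms_end - ready_ms_start) / interval_ms

    # Invented counters: the VM accrued 900 ms of ready time over a 5-second sample.
    print(pct_ready(ready_ms_start=12_000, ready_ms_end=12_900, interval_ms=5_000))  # 18.0

A cumulative ready-time total, by contrast, says little on its own: the accrual could have happened long ago, for example during boot.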
Resource Over-Commitment
Memory Over-Commitment
• Guest swapping – Warning
  - The guest page faults while swapping. Performance is affected both by the guest swapping itself and by the monitor overhead of handling the page faults.
  - Additional disk I/O
• Ballooning – Serious
• VMkernel swapping – Critical
• COS swapping – Critical
  - The VMX process could stall and affect the progress of the VM
  - The VMX process could be a victim of the kernel killing processes under memory pressure
  - The COS requires additional CPU cycles for handling frequent page faults and disk I/O
• Memory shares determine the rate of ballooning/swapping

Resource Over-Commitment
Memory Over-Commitment
• Ballooning
  - Ballooning/swapping stalls the processor and increases delay
  - Windows VMs touch all allocated memory pages during boot; memory pages touched by the guest can be reclaimed only by ballooning
  - Linux guests touch memory pages on demand; ballooning kicks in only when the guest is under complete memory pressure
  - Ballooning can be avoided by setting min=max
  - /proc/vmware/sched/mem
    - size <> sizetgt indicates memory pressure
    - mctl > mctlgt – ballooning out (giving away pages)
    - mctl < mctlgt – ballooning in (taking in pages)
  - Memory shares affect the ballooning rate

Resource Over-Commitment
Memory Over-Commitment
• VMkernel swapping
  - Processor stalls due to VMkernel swapping are more expensive than ballooning (due to disk I/O)
  - Do not confuse this with:
    - Swap reservation: swap is always reserved for the worst case; if min <> max, reservation = max – min
    - Total swapped pages: only current swap I/O affects performance
  - /proc/vmware/sched/mem-verbose
    - swpd – total pages swapped
    - swapin, swapout – swap I/O activity
  - SCSI I/O delays during VMkernel swapping could result in system reliability issues

Resource Over-Commitment
I/O Bottlenecks
• PCI bus saturation
• Target device saturation
  - It is easy to saturate storage arrays if the topology is not designed for proper load distribution
• Packet drops
  - Effective throughput drops
  - Retransmissions can cause congestion
  - The window size scales down in the case of TCP
• Latency affects throughput
  - TCP is very sensitive to latency and packet drops
• Broadcast traffic
  - Multicast and broadcast traffic is sent to all VMs
• Keep an eye on packets/sec and IOPS, not just bandwidth consumption
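The sensitivity of TCP to latency and drops can be illustrated with the well-known Mathis approximation, throughput ≈ MSS / (RTT * sqrt(loss)). This is a textbook model rather than anything derived here, and the RTT and loss values are assumptions chosen for illustration.

    # Mathis et al. approximation of steady-state TCP throughput (illustrative).
    import math

    def tcp_throughput_mb_s(mss_bytes, rtt_ms, loss_rate):
        """Rough upper bound for a single TCP stream: MSS / (RTT * sqrt(p))."""
        bytes_per_sec = mss_bytes / ((rtt_ms / 1000.0) * math.sqrt(loss_rate))
        return bytes_per_sec / 1_000_000

    print(tcp_throughput_mb_s(1460, rtt_ms=0.5, loss_rate=0.0001))  # ~292 MB/s
    print(tcp_throughput_mb_s(1460, rtt_ms=5.0, loss_rate=0.001))   # ~9 MB/s

Even modest increases in latency or drop rate, for example from an overcommitted host delaying a VM or dropping packets, cut the achievable throughput sharply.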
ESX Performance – Application Performance Issues

ESX Performance – Application Issues
Before we begin
• From the VM's perspective, a running application is just an x86 workload
• Any application performance tuning that makes the application run more efficiently will help
• Application performance can vary between versions
  - A new version could be more or less efficient
  - Tuning recommendations could change
• Application behavior can change based on its configuration
• Application performance tuning requires intimate knowledge of how the application behaves
• Nobody at VMware specializes in application performance tuning
  - Vendors should optimize their software with the understanding that the hardware resources could be shared with other operating systems
  - TAP program – SpringSource (a unit of VMware) provides developer support for API scripting

ESX Performance – Application Issues
Citrix
• Roughly 50-60% monitor overhead – takes 50-60% more CPU cycles than on the native machine
• The maximum-number-of-users limit is hit when the CPU is maxed out – roughly 50% of the users that would be seen in a native environment in an apples-to-apples comparison
• Citrix logon delays
  - These can happen even on native machines when roaming profiles are configured
  - Refer to the Citrix and Microsoft KB articles
  - Monitor overhead can introduce logon delays
• Workarounds
  - Disable COM ports, set workload=terminalservices, disable unused apps, scale horizontally
• ESX 3.0 improves Citrix performance – roughly 70-80% of native performance

ESX Performance – Application Issues
Database performance
• Scales well with vSMP – recommended
  - Exception: Pervasive SQL – not optimized for SMP
• Two key parameters for database workloads
  - Response time – transaction logs
  - CPU utilization
• Understanding SQL performance is complex. Most enterprise databases run some sort of query optimizer that changes the SQL engine parameters dynamically.
  - Performance will vary between runs. Typically benchmarking is done after priming the database.
• Memory is a key resource. SQL performance can vary a lot depending on the available memory.

ESX Performance – Application Issues
Lotus Domino Server
• One of the better-performing workloads – 80-90% of direct_exec
• CPU and I/O intensive
• Scalability issues – not a good idea to run all Domino servers on the same ESX server

ESX Performance – Application Issues
16-bit applications
• 16-bit applications on Windows NT/2000 and above run in a sandboxed virtual machine
• 16-bit apps depend on segmentation – possible monitor overhead
• Some 16-bit apps seem to spin in an idle loop instead of halting the CPU
  - Consumes excessive CPU cycles
• No performance studies done yet
  - No compelling application

ESX Performance – Application Issues
Netperf – throughput
• Maximum throughput is bound by a variety of parameters
  - Available bandwidth, TCP window size, available CPU cycles
• A VM incurs additional CPU overhead for I/O
• CPU utilization for networking varies with
  - Socket buffer size and MTU – affect the number of I/O operations performed
  - Driver – vmxnet consumes fewer CPU cycles
  - Offloading features – depending on the driver settings and NIC capabilities
• For most applications, throughput is not the bottleneck
  - Measuring throughput and improving it may not resolve the underlying performance issue

ESX Performance – Application Issues
Netperf – latency
• Latency plays an important role for many applications
• Latency can increase when
  - There are too many VMs to schedule
  - The VM is CPU bound
  - Packets are dropped and then retransmitted

ESX Performance – Application Issues
Compiler workloads
• MMU intensive: lots of new processes are created, context switched, and destroyed
• An SMP VM may hurt performance
  - Many compiler workloads are not optimized for SMP
  - Process threads could ping-pong between the VCPUs
• Workarounds
  - Disable NPTL
  - Try UP (don't forget to change the HAL)
  - Workload=terminalservices might help

ESX Performance Forensics

ESX Performance Forensics
Troubleshooting Methodology
• Understand the problem
  - Pay attention to all the symptoms
  - Pay less attention to subjective metrics
• Know the mechanics of the application
  - Find out how the application works
  - What resources it uses, and how it interacts with the rest of the system
• Identify the key bottleneck
  - Look for clues in the data and see whether they can be related to the symptoms
  - Eliminate CPU, disk I/O, networking I/O and memory bottlenecks by running tests
• Running the right test is critical
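The elimination step above can be written down as a crude triage order. The metric names and thresholds below are invented for illustration – the point is the order of the questions, not the numbers.

    # Crude bottleneck triage mirroring the elimination order above (illustrative).

    def triage(metrics):
        """Return areas worth investigating, most suspicious first."""
        suspects = []
        if metrics.get("pct_ready", 0) > 10 or metrics.get("pct_used", 0) > 90:
            suspects.append("CPU: scheduling contention or saturation")
        if metrics.get("balloon_mb", 0) > 0 or metrics.get("swap_io", 0) > 0:
            suspects.append("memory: ballooning / VMkernel swapping")
        if metrics.get("queue_full", 0) > 0:
            suspects.append("disk: queue depth / array saturation")
        if metrics.get("dropped_pkts", 0) > 0:
            suspects.append("network: drops, duplex, NIC saturation")
        return suspects or ["no obvious host-level bottleneck; look inside the guest/application"]

    print(triage({"pct_ready": 22, "dropped_pkts": 140}))
    # ['CPU: scheduling contention or saturation', 'network: drops, duplex, NIC saturation']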
ESX Performance Forensics
Isolating memory bottlenecks
• Ballooning
• Swapping
• Guest MMU overheads

ESX Performance Forensics
Isolating networking bottlenecks
• Speed/duplex settings
• Link state flapping
• NIC saturation / load balancing
• Packet drops
• Rx/Tx queue overflow

ESX Performance Forensics
Isolating disk I/O bottlenecks
• Queue depth
• Path thrashing
• LUN thrashing

ESX Performance Forensics
Isolating CPU bottlenecks
• CPU utilization
• CPU scheduling contention
• Guest CPU usage
• Monitor overhead

ESX Performance Forensics
Isolating monitor overhead
• Procedures for release builds
  - Collect performance snapshots
• Monitor components

ESX Performance Forensics
Collecting performance snapshots
• Duration
• Delay
• Proc nodes
• Running esxtop on performance snapshots

ESX Performance Forensics
Collecting benchmarking numbers
• Client-side benchmarks
• Running benchmarks inside the guest

ESX Performance Troubleshooting – Summary
Key points
• Address real performance issues. A lot of time can be wasted spinning wheels on theoretical benchmarking studies.
• Real performance issues can easily be described by the end user who uses the application
• There is no magical configuration parameter that will solve all performance problems
• ESX performance problems are resolved by
  - Re-architecting the deployment
  - Tuning the application
  - Applying workarounds to circumvent bad workloads
  - Moving to a newer version that addresses a known problem
• Understanding architecture is the key
  - Understanding both the ESX and the application architecture is essential to resolving performance problems

Questions?

Reference links
http://www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf
http://www.vmware.com/resources/techresources/10041
http://www.vmware.com/resources/techresources/10054
http://www.vmware.com/resources/techresources/10066
http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf
http://www.vmware.com/pdf/RVI_performance.pdf
http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf
http://www.vmware.com/files/pdf/perf-vsphere-fault_tolerance.pdf