Performance Management

Iwan ‘e1’ Rahabok
VCAP-DCD, TOGAF Certified, vExpert
Staff SE (Strategic Accounts) & CTO Ambassador
e1@vmware.com | 9119-9226 | Linkedin.com/in/e1ang | Twitter: e1_ang
https://www.facebook.com/groups/vmware.users/
http://virtual-red-dot.info
© 2014 VMware Inc. All rights reserved.

Warm-up exercise

You got an email from the app team, saying the main Intranet application was slow.
• The email arrived 1 hour ago. It stated that the application was slow for 1 hour, and was OK after that.
• So it was slow between 1 and 2 hours ago, but is OK now.
• You did a check. Everything has indeed been OK in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM.
• You are not familiar with the application. You do not know which apps run on each VM, as you have no access to the Guest OS.
• Your environment: 1 VC, 4 clusters, 30 hosts, 500 VMs, 40 datastores, 1 midrange array, 10 GE, iSCSI storage.

Test your vSphere knowledge! How do you approach this with just vSphere? What do you do?
A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE.
B: No sweat, you’re VCDX + CCIE + ITIL Master. You were born for this.
C: SMS your wife, “Honey, I’m staying overnight at the datacenter.”
D: Take blood pressure medicine so it won’t shoot up.
E: Buy the app team a very nice dinner, and tell them to keep quiet.

Performance: How do you know it’s optimised?
• What do you measure?
  – Utilisation?
    • Utilisation of 100% means it’s performing…?
    • Utilisation of 5% means it’s performing…?
    • Utilisation of 50% means it’s performing…? Really?
  – Something else?
    • What is that something else?

To understand this “something else”, we need to go back to fundamentals.

CONFIDENTIAL

What do we care about at each layer?

VM layer:
1. We care if it is being served well by the platform. Other VMs are irrelevant from the VM Owner’s point of view. Make sure it is not contending for resources.
2. We check if it is sized properly.
If too small, increase its configuration. If too big, right-size it for better performance.

SDDC (platform) layer:
1. We care if it is serving everyone well. Make sure there is no contention for resources among all the VMs in the platform.
2. We check overall utilisation. Too low, and we are not investing wisely in hardware. Too high, and we need to buy more hardware.

Take Away: Contention and Utilisation
• Unlike a physical DC, in virtual infrastructure…
  – we use Contention, not Utilisation, for Performance Management
  – we use Utilisation (short range) for Performance Management
  – we use Utilisation (long range) for Capacity Management
• Contention is how you measure that the platform is performing well.
• Sounds good! But how do you measure “Contention”?

Performance: The counters
• What counters prove that it is optimised?
  – You need a technical fact to assure yourself
    • Either that, or take a sleeping pill at night
  – You need a technical fact to show to your customers
    • Your SLA must be based on something concrete, not subject to interpretation or “feeling of the day”
  – If you can’t prove it, how does anyone know it is optimised? ;-)

Optimized Infrastructure Performance*
• CPU
• RAM
• Storage
• Network
* While keeping Cost in mind

How a VM gets its resource
[Figure labels: Provisioned, Limit, Contention, Entitlement, Usage, Demand, Reservation, 0]

VM CPU: The 4 States

VM CPU: What do you monitor?
• Contention
  – Ready (ms)?
  – Co-Stop (ms)?
  – Latency (%)?
  – Max Limited (ms)?
  – Overlap (ms)?
  – Swap Wait (ms)?
• Utilisation
  – Used (ms)?
  – Usage (%)?
  – Demand (MHz)?

Quiz Time! What’s the difference between Average, Summation and Latest? How does the timeline impact the value?

VM CPU: What you should monitor

vCenter Operations
• Contention
  – Contention (%)
• Utilisation
  – Workload (%)

vCenter
• Contention
  – Latency (%)
  – Max Limited (if applicable)
• Utilisation
  – Usage (%)
  – Demand (MHz)

Discussion Time! What should the value be for an optimized environment?
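A hint for the Average / Summation / Latest quiz: a summation counter such as Ready (ms) accumulates over the sampling interval, so the same number of milliseconds means something very different on a real-time chart than on a past-day chart. The sketch below converts a summation value into a percentage of the interval. The interval table reflects vCenter's default chart intervals as commonly documented. The function name and the per-vCPU division are illustrative assumptions, not something the deck defines.

```python
# Default sampling interval (seconds) behind each vCenter chart range.
# Assumption: default statistics intervals; verify against your own vCenter.
CHART_INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def summation_ms_to_percent(value_ms: float, interval_seconds: int, vcpus: int = 1) -> float:
    """Express a summation counter (ms accumulated in one sample) as a
    percentage of the sampling interval, averaged per vCPU.

    Assumption: the VM-level value is the total across all vCPUs, so we
    divide by the vCPU count to get a per-vCPU figure.
    """
    if interval_seconds <= 0 or vcpus <= 0:
        raise ValueError("interval_seconds and vcpus must be positive")
    return value_ms * 100.0 / (interval_seconds * 1000.0 * vcpus)


if __name__ == "__main__":
    ready_ms = 2000  # hypothetical CPU Ready reading
    for chart, seconds in CHART_INTERVAL_SECONDS.items():
        pct = summation_ms_to_percent(ready_ms, seconds)
        print(f"{chart:>10}: {ready_ms} ms Ready = {pct:.3f}% of the interval")
```

With these numbers, 2000 ms of Ready is 10% of a 20-second real-time sample but well under 1% of a 5-minute sample, which is why a raw millisecond value cannot be judged without knowing the timeline.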
One more thing…
• The hypervisor does not have visibility inside the Guest OS.
• There is 1 particular CPU counter that you should get. It tells you that there is not enough CPU to meet demand.
• vRealize Operations (via Hyperic) does not collect this counter.
• Which counter is that?

Enough about CPU. Let’s move to RAM!

Quiz Time!
• Which of the following sentences are True:
  – Ballooning is bad. If you see a VM has balloon, that VM has a memory performance problem.
  – Ballooning happens before Compression, which happens before Swapping. If you see a VM has Compressed memory but no Ballooned memory, then vCenter is buggy, or your eyes are just tired.
  – If all the VMs in the ESXi host have a low Usage counter, then the ESXi Usage must also be low.
  – Turn on Large Page, and there goes all your TPS.
  – To check if a VM has memory contention, check its CPU Swap Wait counter.
  – Why are all the questions difficult?!
• Answer
  – Ballooning indicates the ESXi host has memory pressure. It does not mean the VM has a memory performance issue.
  – Pages remain compressed or swapped if they are not accessed.
  – The Usage counter is different in VM and ESXi! In VM, it is Active. In ESXi, it is Consumed. This is due to the 2-level memory concept.
  – Yes, unless your ESXi is under heavy memory constraint.

2 levels of Memory Hierarchy
[Figure labels: OS, Hypervisor]
• New hierarchy in VMware’s memory overcommit technology
• Transparent Page Sharing
• Ballooning
• Memory Compression
• Swap to Host Cache (SSD)
• Disk swapping
• Decompression is sub-ms, compared to swap (15-20 ms)!

vSphere Memory Management
• 2 types of Memory Management
  – Guest OS level
    • Balloon
  – Hypervisor level
    • TPS
    • Compression, Swap to disk, Swap to cache (SSD)

Volunteer Time! Explain Balloon, TPS, Compression.

VM RAM: What do you monitor?
• Contention
  – Swapped?
  – Balloon?
  – Compressed?
  – Latency?
  – CPU Swap Wait?
• Utilisation
  – Active?
  – Usage?
  – Consumed?
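The quiz answer about the Usage counter (Active at the VM level, Consumed at the ESXi level) can be illustrated with a little arithmetic. This is a minimal sketch under stated assumptions: every VM name, size and reading is hypothetical, and the host figure ignores VM overhead and VMkernel memory for simplicity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VmMemory:
    name: str
    configured_mb: int  # memory configured for the VM
    active_mb: int      # pages the guest touched recently
    consumed_mb: int    # host physical memory mapped to the VM


def vm_usage_percent(vm: VmMemory) -> float:
    """VM-level Usage, per the deck: based on Active memory."""
    return vm.active_mb * 100.0 / vm.configured_mb


def host_usage_percent(vms: List[VmMemory], host_total_mb: int) -> float:
    """Host-level Usage, per the deck: based on Consumed memory.
    Simplification: overhead and VMkernel memory are ignored."""
    return sum(vm.consumed_mb for vm in vms) * 100.0 / host_total_mb


# Hypothetical host with 64 GB of RAM running four 16 GB VMs.
VMS = [VmMemory(f"vm{i}", 16384, 1638, 14000) for i in range(1, 5)]
HOST_TOTAL_MB = 65536

if __name__ == "__main__":
    for vm in VMS:
        print(f"{vm.name}: Usage {vm_usage_percent(vm):.1f}%")
    print(f"host: Usage {host_usage_percent(VMS, HOST_TOTAL_MB):.1f}%")
```

Every VM reports roughly 10% Usage while the host reports over 85%, so "all VMs are low, therefore the host is low" does not hold.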
VM RAM: What you should monitor

vCenter Operations
• Contention
  – RAM Contention (%)
• Utilisation
  – Workload (%)
  – Consumed (KB)

vCenter
• Contention
  – Latency (%)
  – CPU Swap Wait (ms)
• Utilisation
  – Usage (%)
  – Consumed (KB)

Discussion Time! What should the value be for an optimized environment?

One more thing…
• The hypervisor does not have visibility inside the Guest OS.
• There is 1 particular RAM counter that you should get. It tells you that there is not enough RAM to meet demand.
• Which counter is that?
• You can monitor Guest OS paging activity by separating the page file into its own vmdk.
  – You can then use vC Ops to analyse the pattern.

Enough about RAM. Let’s move to Storage!

Quiz Time!
• Which of the following sentences are True:
  – The latency counter is (Write Latency + Read Latency) / 2.
  – If you have RDM, vCenter does not track the latency.
  – If the VM virtual disk counter shows 1000 IOPS, but the VM datastore counter shows 2x the IOPS, something is seriously wrong. Time to call your TAM!
  – If all your VMs are experiencing high latency, the first thing you do is check the VMkernel queue.
• Answer
  – It is not. It takes into account the number of commands issued. It’s a weighted average.
  – It only tracks the latency of the latest sample. It does not include other data during the collection period.
  – Check for snapshots. Snapshot IOPS is transparent to the virtual disk.
  – The first thing you do is check the physical device queue and your storage array. VMkernel queue latency rarely exceeds 1 ms.

VM Storage: Where and what do you monitor?
[Figure labels: Virtual Disk, Datastore, Disk]

VM Storage: where to monitor
• For vmdk, use Datastore metric groups.
• For RDM, use Disk metric groups.
• The Disk metric group is naturally not relevant for NFS (files).
[Figure: a VM with Disk 1, Disk 2 and Disk 3 (vDisks scsi0:0, scsi0:1, scsi0:2) mapped to a VMFS datastore, an NFS datastore and an RDM, with the underlying Disks]

VM Storage: What do you monitor?
vCenter Operations
• Contention
  – Latency (ms)
• Utilisation
  – Commands per second
  – Usage (KBps)
  – Workload (%)

vCenter
• Contention
  – Latency (ms)
• Utilisation
  – Commands Issued
  – Usage (KBps)

VM Network
• Contention
  – Dropped packets
  – Packet retransmits
• Utilisation
  – Network throughput
• Limitations
  – We cannot monitor latency (e.g. between source and destination)

Different Tiers, Different Optimization
• Business Logic:
  – Tier 1 is optimised for Performance and Availability
  – Tier 3 is optimised for Cost
• Do you allow a Tier 1 VM on Tier 3 Storage?
  – Or do you map the Compute Tier to the Storage Tier?
• What distinguishes Tier 1 from Tier 3?
  – Availability
  – Performance
  – Monitoring
  – Cost!

Tiering: Considerations
• Compute
  – No of spare hosts
  – No of hosts
  – Consolidation Ratio (VM:Host)
  – vCPU:pCPU Oversubscribed
  – vRAM:pRAM Oversubscribed
  – Clustering (e.g. VCS)
• Storage
  – IOPS per VM
  – Latency
• Monitoring
  – Application availability monitoring (e.g. AppHA)
  – Application performance monitoring (e.g. vC Ops Enterprise)
• Availability
  – Automated DR (SRM)
  – RPO
  – RTO

3-Tier Offering: Example

                                      Tier 1      Tier 2      Tier 3
No of spare hosts                     2           1           1
No of hosts                           6           8           10
Consolidation Ratio (VM:Host)         10:1        20:1        40:1
vCPU:pCPU Oversubscribed              n/a         2.0x        4.0x
vRAM:pRAM Oversubscribed              n/a         1.5x        2.0x
IOPS per VM                           400         200         100
Latency                               <10 ms      15-20 ms    20-25 ms
Clustering (e.g. VCS)                 Yes         Yes         No
Application monitoring (e.g. AppHA)   Yes         Yes         No
Apps                                  Yes         Yes         No
Automated DR (SRM)                    Yes         Yes         Yes
RPO                                   5 minutes   1-2 hours   2-8 hours
RTO                                   1 hour      <2 hours    <4 hours

Demystifying “Peak”
• There are 2 types of “Peak”
  – Peak across time
  – Peak across objects
• Impacts
  – Peak across time can be too high if the burst is high
    • VM is low for 24 hours, bursts to 100% for 5 minutes, and you get 100% reported.
  – Peak across time can be lower if the number of member objects is high.
    • Peak of a cluster in the past 1 day is 70%.
      That means at least 1 host was >70%.
  – Peak across objects can be too high if the load is unbalanced
    • Happens when cluster utilisation is not high enough to trigger DRS or Storage DRS

Sample SLA and Internal Threshold

SLA
                   Tier 1   Tier 2   Tier 3
CPU Contention     1%       3%       13%
RAM Contention     0%       5%       10%
Disk Latency       10 ms    20 ms    30 ms

Internal Threshold
                   Tier 1   Tier 2   Tier 3
CPU Contention     0.5%     2%       10%
RAM Contention     0%       2%       8%
Disk Latency       10 ms    15 ms    20 ms

SLA only applies to the VM. The VM owner does not care about the underlying platform.

Where to monitor at the Platform level?
• Compute
  – Host?
  – Cluster?
  – Datacenter?
  – vCenter?
• Storage
  – Host?
  – Cluster?
  – Datastore?
  – Datastore Cluster?
  – Datacenter?
  – vCenter?
• Network
  – Standard Switch and port group? :-)
  – Host?
  – Distributed Switch?
  – Distributed Port Group?

Where to monitor

Not here
• Compute
  – Host
• Storage
  – Host
  – Cluster
• Network
  – Host

Monitor these
• Compute
  – Cluster
  – Datacenter
• Storage
  – Datastore
  – Datastore Cluster
• Network
  – Distributed Switch
  – Distributed Port Group

DRS (and Storage DRS) will balance the cluster.

QoS in a shared environment
• QoS is mandatory in a shared environment
• Areas to control
  – Compute
  – Network
  – Storage
• CPU and RAM
  – Shares
  – Reservation
  – Limit?
  – Resource Pool?
• Storage I/O Control
• Network I/O Control

QoS: Compute
• When not to use Resource Pool?
• When to use Resource Pool?
• What’s the impact of Reservation?
  – HA Slot Size, unless you use %
  – Boot time
  – Ability to oversubscribe. You cannot go beyond 100% reservation.
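Going back to the “Demystifying Peak” slide, the two kinds of peak can be sketched in a few lines. This is only an illustration: the host names and CPU samples are hypothetical, and it assumes the cluster value at each sample is the plain average of its member hosts, which the deck does not state.

```python
from typing import Dict, List


def cluster_series(hosts: Dict[str, List[float]]) -> List[float]:
    """Roll hosts up into a cluster series by averaging each sample.
    Assumption: the cluster value is the plain average of its hosts."""
    return [sum(sample) / len(sample) for sample in zip(*hosts.values())]


def peak_across_time(series: List[float]) -> float:
    """Highest value one object reached during the period."""
    return max(series)


def peak_across_objects(hosts: Dict[str, List[float]], index: int) -> float:
    """Highest value among all member objects at a single sample."""
    return max(values[index] for values in hosts.values())


# Hypothetical CPU utilisation (%) for three hosts over four samples.
HOST_CPU = {
    "esx01": [40, 50, 95, 45],
    "esx02": [30, 35, 60, 30],
    "esx03": [20, 25, 55, 30],
}

cluster_peak = peak_across_time(cluster_series(HOST_CPU))
busiest_host_peak = max(peak_across_time(v) for v in HOST_CPU.values())

if __name__ == "__main__":
    print(f"Cluster peak across time: {cluster_peak:.0f}%")
    print(f"Busiest host peak:        {busiest_host_peak:.0f}%")
```

Here the cluster peaks at 70% while one host reached 95% at the same sample: averaging across members hides the hot spot, and an unbalanced load makes the per-object peak look worse than the cluster as a whole.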
QoS: Storage
• A single VM can hog storage throughput
  – Just need to run IOmeter
  – Unfairly penalizes VMs on hosts with high consolidation ratios
• Existing resource management only works for VMs on the same host
• SIOC calculates datastore latency to identify contention
  – Latency is a normalized average across VMs
  – IO size and IOPS included

[Figure: Without Storage IO Control. Actual disk resources utilized by each VM are not in the correct ratio. Two ESX Servers, each with device queue depth 24; storage array queue split 38% / 12% / 50%. Storage congested without SIOC: latency is unbounded.]

QoS: Storage
• SIOC enforces fairness when datastore latency crosses the threshold
  – Dynamic threshold setting
  – Fairness enforced by limiting VMs’ access to queue slots
• What’s the limitation?
  – No inter-datastore awareness
  – Does not work on RDM
  – Non-VM workload not included
• Work with your Storage team.
  – Auto-tiering array is supported

[Figure: With Storage IO Control. Actual disk resources utilized by each VM are in the correct ratio, even across ESX hosts. VM A 1500 shares, VM B 500 shares, VM C 500 shares; device queue depths 24 and 6; storage array queue split 60% / 20% / 20%. Storage controlled with SIOC: latency is controlled, storage queue throttled.]

Key Takeaways
• Optimization in SDDC has a lot more components than we normally think
• Contention is 1st. Utilisation is 2nd.
• SLA is at VM level, not Infrastructure level.
• Peak can be too low or too high.
• Anything else?
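The share values in the “With Storage IO Control” figure (VM A 1500, VM B 500, VM C 500) line up with the 60% / 20% / 20% split of the storage array queue. The sketch below shows only that proportional-share arithmetic. It is not SIOC's actual algorithm, which works by throttling per-host device queue depth, and the function name is made up for illustration.

```python
from typing import Dict


def proportional_split(shares: Dict[str, int]) -> Dict[str, float]:
    """Return each VM's percentage of a contended resource,
    proportional to its shares. Illustrative arithmetic only."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total shares must be positive")
    return {vm: value * 100.0 / total for vm, value in shares.items()}


# Share values taken from the deck's figure.
DISK_SHARES = {"VM A": 1500, "VM B": 500, "VM C": 500}
SPLIT = proportional_split(DISK_SHARES)

if __name__ == "__main__":
    for vm, pct in SPLIT.items():
        print(f"{vm}: {DISK_SHARES[vm]} shares = {pct:.0f}% of the array queue")
```

Shares only matter under contention: the deck notes that SIOC steps in once datastore latency crosses the threshold, so below that point a VM can use more than its proportional split.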