VMware vCenter Operations Manager
Karoly Szalai, Technical Support Engineer
CCNP, VCP 3/4/5, VCAP4-DCA
© 2009 VMware Inc. All rights reserved

Agenda
What is vCOPs and why is it good for me?
An example scenario
Counters and badges

Managing Performance/Capacity in vSphere: the basics
What is vCOPs? Is this just another monitoring system? Boring! We already have the best (Nagios, Zabbix, HP OpenView, etc.)
No, it’s more than just a monitoring system!

Is it healthy?
• Is every VM and ESX host performing well? CPU, RAM, Network, Disk?
• Are they behaving as expected?
• Any fault on any component?

Is it enough?
• Enough CPU, RAM, Network, Disk? Future risk?
• Time remaining?
• Capacity remaining?
• Where are the “stress points” in time?

Is it optimised?
• Which VMs need adjustment?
• What are my key ratios?
• How much can I claim back from “fat” VMs?
• How many more VMs can I add without impacting performance?

vCOPs is built to complement vCenter
Is it healthy = Health
• Workload
• Anomalies
• Faults
Is it enough = Risk
• Time remaining
• Capacity remaining
• Stress period
Is it optimised = Efficiency
• What can we reclaim?
• Density, key ratio!
Daily update at midnight!

Bird’s-eye view
This is a small environment:
1 vCenter
1 datacenter
2 clusters
4 hosts
9 VMs (including powered-off ones)
2 datastores

Visibility across vCenters

Everyday task: performance troubleshooting
You got an email from the app team saying the main intranet application was slow.
• The email arrived 1 hour ago. It stated that the application was slow for about 1 hour and was ok after that.
• (So it was slow between 1 and 2 hours ago, but it’s ok now. Helpful, isn’t it?)
• You just checked. Everything has indeed been ok for the past hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM.
• You are not familiar with the applications.
You don’t know which apps run on each VM, as you have no access to the guest OS.
• Your environment: 1 VC, 4 clusters, 35 hosts, 500+ VMs, 30 datastores, 1 midrange array, 10GE FCoE
How do you approach and solve this with just vCenter? What do you do?
A: Smile, as this will be a nice challenge for your TAM/BCS/MCS engineer.
B: No sweat, you’re a VCDX, CCIE and ITIL master, and you can fix your storage firmware with a hex editor. You were born for this.
C: Send a text: “Honey, this evening is cancelled, I got a better offer.”
D: Buy the app team a dinner and tell them to keep quiet.

Everyday task: performance troubleshooting
The minimum you need to prove
• The performance problem is not caused by your infrastructure, i.e. not by your VMware environment
• Infrastructure: VMware + Storage + Network
• Application: VM + App inside the VMs
What you should be able to prove
• For each VM, the following was ok during the incident: CPU, RAM, Disk, Network
• The shared infrastructure was also healthy: ESXi, datastore, overall platform
Ideally you can prove
• Show the exact application-level counters that are slow, together with the underlying infrastructure-level counter that caused it = Root Cause Analysis

Challenge 1: details are lost after 1 hour
The first problem is that vCenter stores only 1 hour’s worth of data in depth. After an hour, a lot of details are no longer available!
• In the real-time performance view we have info for 2 cores + 16 different counters.
• In the past-day stats we have only the CPU info of the VM, and only 6 counters!
• A typical ESX host has 12–24 cores. What if the problem is with vSMP?

Challenge 1: details are lost after 1 hour
Memory counters: <1 hour vs. >1 hour
Disk counters: <1 hour vs. >1 hour

In the meantime in vCOPs

Challenge 2: vSphere and applications
Here is the second challenge: vSphere has no application-awareness!
You have little idea about:
• Which 10 VMs make up the application
• What services are running on each VM
The only thing you can do is group them via a vApp, like in vCOPs:

In the meantime in vCOPs
Same application
• Health is 89, so it’s good
• It has been good for the past 6 hours
• The app consists of 4 components: distribution, analysis, collection and presentation
• We know there are only 2 VMs, so you’re getting app-level data here!
• You can double-click on each metric to dig deeper, but full HD resolution is recommended
• You can configure your tab as you like.

Another plus is Infrastructure Navigator
Infrastructure Navigator (VIN) is a separate component in vCOPs (Enterprise edition or higher).
VIN can answer the following questions:
• How many VMs make up this application?
• What services are running on each VM?
• Who is talking to whom? Using which ports?
• Which VMs are protected with DR? You can even tell which SRM protection group and SRM protection plan are involved.
VIN requires vCenter 5, as it relies on the web client (new UI standards).

Analysing data in vCenter can be hard or misleading
Hey! There is an alarm for high memory usage! It has been above 90% for more than 5 minutes! THIS IS BAD! WE NEED TO BUY MORE RAM! NOW!

Analysing data in vCenter can be hard or misleading
Let’s check the performance data in vCenter! Here is a common example of why a deep understanding of vSphere makes a big difference.
As we can see, this host needs more RAM, doesn’t it? It has been using 92% for more than a day.

In the meantime in vCOPs
Configured memory: 16,383 MB
Demand: 5,574 MB (36% of Usable)
Usage: 15,147 MB (98% of Usable)
Usable: about 15,430 MB
Normal demand: 4,672 – 8,843 MB
Plenty of headroom! This just saved us from a costly RAM upgrade project!

Counters and badges
A vCenter farm with only 50 ESXi hosts and 500 VMs will have more than 10,000 counters!
• It is impossible to look at them all, so let vCOPs analyse them.
vCenter presents raw counters
• e.g. what does a Ready time of 1500 in the Real Time chart mean?
Is a value of 2000 in the Real Time chart better than a value of 75000 in the Daily chart?
A single counter can be misleading
• Low CPU usage does not mean the VM is getting the CPU it needs if there is a limit, contention or co-stop.
• Disk performance is measured with different counters at multiple layers (VM, kernel, physical)
Different counters have different units
• GHz, %, MB, kbps, IO/s, ms
• This makes analysis even more complex

Derived counters
• Standardise the scale to 0 – 100. Can be >100 if demand is unmet.
• 1 universal unit minimises the “translation” in our head
• Is memory usage of 90% at ESXi level good or bad?
• Is 300 IOPS good or bad for datastore XYZ?
• Universal. Apply to CPU, RAM, Disk, Net, etc.
• Counters are derived using sophisticated formulas, not just aggregated. For the same counter, different objects use different formulas.

Thresholds: vCOPs does it differently
vCenter sets static thresholds, which can be misleading
• During peak time, it is common for a VM to reach high utilisation
• A static threshold will generate alerts when it should not
• vSphere admins quickly learn to ignore them, defeating the purpose of the alert to begin with
• During non-peak time, it might be abnormal for a VM to reach even 50% utilisation
• A static threshold will not generate alerts when it should
vCenter only sets a high threshold
• Do you have any threshold for when CPU or RAM utilisation drops below 5%?
• A drop in the IOPS of an entire storage array might be a sign of a terrible day ahead
• It will not alert when:
  • Utilisation drops from 75% to 1% when it should not
  • Utilisation changes from 5% to 75% when it should not
• We need to plot both the upper and lower range!
Each VM differs. The same VM differs depending on day/time
• Intelligence is required to analyse each metric and its expected “normal” behaviour

Dynamic threshold & alerts
vCenter Operations uses dynamic thresholds
• They are dynamic and personalized down to the individual metric.
• They vary from object to object. 1000 VMs will each have their own thresholds.
• They vary from time to time. The same CPU Usage counter has a different threshold at different times. This caters for peaks. See the chart below.
• They vary from metric to metric. On an ESX host with 12 cores, each core can have its own CPU Usage threshold.
• You can set hard thresholds if you need to.
  • This needs the Enterprise edition. It comes with no static thresholds defined.
  • Steps: http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
Notice that the range varies in size

Badges – Health
Answers complex questions like:
• How is the entire virtual data center doing?
• For every cluster, host and datastore, what is its health?
Health is the current operational state
• It represents what is wrong now and should be addressed within 1 day. Thus Health needs to be scored such that if it’s red, it really needs attention.
Weather Map
• Simple way to check that the entire farm is healthy
• Shows the health of all parent and child objects
• Each square can be a VM, ESX host, datastore, cluster, datacenter or vCenter

Value      Explanation
75 – 100   Normal behaviour.
50 – 75    The object is experiencing some problems.
25 – 50    The object might have serious problems. Check, and take action as soon as possible.
0 – 25     The object is either not functioning properly or will stop functioning soon.

Badges – Workload
Answers complex questions like:
• For every object, how is Demand vs Supply?
• For every single VM, is it CPU/Memory/Disk/Network bound?
• Is any VM not getting what it is entitled to or requires?
• What’s the normal workload range for every object in our vDC?
Workload is not utilisation or usage
• It is more accurate than utilisation, as it takes more factors into account than just utilisation
Workload = (Demand/Entitlement)
• Entitlement is dynamic. It is affected by shares, limits, etc.
• Demand ≠ Usage
• Usage may mean passive usage (the RAM page is there but there is no write/read at all)
• Score is Max(CPU, RAM, Disk IO, Net IO)

Value      Explanation
0 – 80     Workload is not high.
80 – 90    The object is experiencing some high resource workloads.
90 – 95    Workload on the object is approaching its capacity in ≥1 areas.
>95        Workload on the object is at or over its capacity in ≥1 areas.

Badges – Anomalies
Answers complex questions like:
• Is our vDC doing as usual? Are there any unexpected changes (as we have a dynamic environment)?
• Which VMs, ESX hosts, clusters, datastores, etc. are behaving abnormally?
• … and exactly which counters are the culprits?
Identifying metric abnormalities
• It needs to learn the dynamic ranges of “Normal” for each metric, so give it >3 cycles per metric
• A month-end job means it needs 3 months
• The normal range changes after configuration or application changes
Anomalies score
• A high number of anomalies:
  • Usually an indication of a problem
  • Demand change
  • Application team changed code/app
• KPI (Key Performance Indicator) metrics impact the anomalies more than non-KPI metrics

Value      Explanation
0 – 50     Normal anomaly range.
50 – 75    The score exceeds the normal range.
75 – 90    The score is very high.
> 90       Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.

Badges – Faults
Answers complex questions like:
• What faults do we experience in our vDC?
• For every object, what faults does it have?
Specific knowledge of which vCenter events
• Which events affect Availability and Performance of which object?
• Pulled from active vCenter events
• Example:
  • Loss of redundancy in NICs or HBAs
  • Memory checksum errors
  • HA failover problems
• Each fault has a default score
• Highest individual Fault Score drives the Fault object score
Best Practices
• Do not change the Fault Threshold
• Use the Alerts View to manage Faults. You can filter it to show just Faults.

Value      Explanation
0 – 25     No fault is registered on the object.
25 – 50    Faults of low importance happen on the object.
50 – 75    Faults of high importance happen on the object.
> 75       Faults of critical importance happen on the object.

Badges – Risk
Answers complex questions like:
• Do we have risk from performance or capacity in our vDC? If yes, where is it and how serious is it?
• Which objects are at risk? What is the specific risk?
Risk Score takes into account
• Time Remaining
• Capacity Remaining
• Stress
Risk is an early warning system
• Identifies potential problems that could eventually hurt performance
• The Risk Chart shows the Risk score over the last 7 days, giving a view of the trend

Value      Explanation
0 – 50     No problems are expected in the future.
50 – 75    There is a low chance of future problems or a potential problem might occur in the far future.
75 – 100   There is a chance of a more serious problem or a problem might occur in the medium-term future.
100        The chances of a serious future problem are high or a problem might occur in the near future.

Badges – Time remaining
Answers complex questions like:
• How much time do we have before we need to buy more servers, storage or network capacity, i.e. before performance starts to degrade or we run out of capacity?
• For every cluster, VM, datastore, how much time do we have?
Measures time remaining before each resource type reaches its capacity
• CPU
• Memory
• Disk (IOPS & Space)
• Network I/O
Early warning of upcoming provisioning needs
• Based on the Score Provisioning (SP) buffer. Default value is 30 days.
• Set in the “Capacity & Time Remaining” section

Value      Time remaining
50 – 100   > 2x SP Buffer (60 days)
25 – 50    < 2x SP Buffer
<25        Near SP Buffer
0          < SP Buffer (30 days)

Badges – Capacity remaining
Answers complex questions like:
• How many more VMs can we add without impacting performance or using up capacity?
• For every cluster, VM, datastore, which components (CPU, RAM, Disk, Network) would run out first?
Early warning system
• A low score of 1 means you still have >30 days.
• Measures how many more VMs can be placed on the object
Percentage of Total VM “Slots” Remaining
• Based on the average size of the VMs on the object (e.g. VM profile)
• Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.
From the table, notice that the value is not linear
• It is also not the same as the Time Remaining threshold.
• A value of 30 means >120 days for capacity but around 40 days for time.

Value      Capacity remaining
>10        >120 days
5 – 10     60 – 120 days
2 – 5      30 – 60 days
1          <30 days

Capacity remaining calculation
Determine capacity constraint resources
Deployed or Powered On VMs
• Powered-off VMs only use disk space resources
• Powered-on VMs use ALL 4 resources
Calculation example:
• The limit is 40 more VMs
• We have 9 deployed VMs
• 40/(40+9) ≈ 81.6%
You can drill down to see details
• You can check all 9 components as shown on the right
• This helps to answer the question of which component has how many days or VMs left
• Summary = min (all 9 components)

Badges – Stress
Answers complex questions like:
• In our vDC, do we have stress points or periods? How bad are they?
• For every cluster, VM, datastore, which ones are experiencing stress and how bad is it?
Measures long-term or chronic workload (6 weeks)
• The chart shows a weekly breakdown of Stress for each day/hour, averaged over the last 6 weeks
• Workloads > 70% = “Stressed”
• The threshold is configurable, as per the screenshot below

Value      Explanation
0 – 1      Normal score. No action needed.
1 – 5      Some of the object resources are not enough to meet the demands.
5 – 30     The object is experiencing regular resource shortage.
>30        Most of the resources on the object are constantly insufficient. The object might stop functioning properly.
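
As a rough illustration of how a Stress score could be derived from workload samples, the Python sketch below assumes the score is the area of workload above the stress line, divided by the area of the stress zone (the band between the stress line and 100), times 100. The function name, the equal-width samples and this reading of the stress zone are assumptions made for illustration, not the actual vCOPs implementation.

```python
from typing import Sequence


def stress_score(
    workload: Sequence[float],
    stress_line: float = 70.0,
    capacity: float = 100.0,
) -> float:
    """Return the share of the stress zone covered by workload, as a percentage.

    Assumption: samples are equally spaced in time, so each one contributes
    a rectangle of equal width to the area calculation.
    """
    if not workload:
        raise ValueError("workload must contain at least one sample")
    if capacity <= stress_line:
        raise ValueError("capacity must be greater than stress_line")

    # Area of workload above the stress line (zero for samples below it).
    stress_area = sum(max(sample - stress_line, 0.0) for sample in workload)

    # Area of the whole stress zone: the band between the stress line and capacity.
    stress_zone = (capacity - stress_line) * len(workload)

    return stress_area / stress_zone * 100.0


if __name__ == "__main__":
    hourly_workload = [60, 70, 80, 100]  # hypothetical samples
    print(round(stress_score(hourly_workload), 1))  # 33.3
```

Because the workload itself can exceed 100, a score computed this way can exceed 100 as well. A single sample of 130, for example, covers twice the stress zone.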
Stress Calculation
[Chart: Workload Line on a 0 – 100 axis with the Stress Line at 70; the Stress Zone above it is marked 12%]
Stress Score is a % and is based on the area of Workload above the “Stress Line” threshold compared to the total capacity of the object
• Stress Score = (Stress area / Stress Zone) * 100
• But the max value can be > 100%, as the workload can be >100.
Example
• Stress Line is 70% Workload
• 12% of the area is above the 70% threshold
• Stress Score is 12

Badges – Efficiency
Answers complex questions like:
• Are there optimization opportunities in our vDC?
• How well do we do in terms of VM provisioning? Do we get it right?
Efficiency Score factors
• Reclaimable waste
• Density ratio
Graph depicts VMs by percent
• Optimal – optimally provisioned VMs
• Waste – over-provisioned VMs
• Stress – under-provisioned VMs
  • Not used in the Efficiency calculation (see Risk)

Value      Explanation
>25        The efficiency is good. The resource use on the selected object is optimal.
10 – 25    The efficiency is good, but can be improved. Some resources are not fully used.
0 – 10     The resources on the selected object are not used in the most optimal way.
0          The efficiency is bad. Many resources are wasted.

Badges – Reclaimable waste
Answers complex questions like:
• Did we over-provision the VMs in terms of CPU, RAM and Disk? If yes, what’s the degree of over-provisioning?
• For every cluster, VM, datastore, what can we reclaim?
It identifies the amount of reclaimable resources
• CPU
• Memory
• Disk
Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
• Waste Score = Max(CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
• Disk calculation can also include old snapshots and templates

Value      Explanation
0 – 50     No resources are wasted on the selected object.
50 – 75    Some resources can be used better.
75 – 100   Many resources are underused.
100        Most of the resources on the selected object are wasted.

Badges – Density
Answers complex questions like:
• How high can we push our consolidation ratio before we experience performance problems?
• Now that’s a million-dollar question!
• For every datacenter, cluster, ESXi host, what are our key ratios and how much headroom do we have?
Contrasts Actual vs Ideal Density
• Identify optimal resource deployment before contention occurs
• Ideal is based on demand, not simple configuration.
• High Density is good. 100 is not too high.

Value      Explanation
>25        Good consolidation.
10 – 25    Some resources are not fully consolidated.
0 – 10     The consolidation for many resources is low.
0          The resource consolidation is extremely low.

Badge thresholds
There are 2 different sets of thresholds: VM and Infra (ESXi, Cluster, Datastore, etc.)
Notice that a major badge has different thresholds from its minor badges.
Even “similar” badges have different thresholds. Notice that Time remaining and Capacity remaining have very different thresholds.

Using badges together
Workload High & Anomalies Low & Stress High
• Workload – Object is running hot. Potentially starving for resources
• Anomalies – Normal behavior for this timeframe
• Stress – Object is often running under high Workload.
→ Add resources

Workload High & Anomalies Low & Stress Low
• Workload – Object is running hot. Potentially starving for resources
• Anomalies – Normal behavior for this timeframe
• Stress – Object usually has enough resources
→ Not likely a big problem… a cyclical workload spike?

Workload High & Anomalies High
• Workload – Object is running hot. Potentially starving for resources
• Anomalies – Abnormal behavior for this timeframe
→ Something is amiss! Immediate attention.
If there are Alerts and Faults too, then it is a sign of a major issue.

… at the end
This is not all! We are just scratching the surface.
• Heat map / Cold map: 2-dimensional chart, a great way to show a lot of info on 1 screen about all clusters/hosts/VMs
• Planning: gives visibility for the next 6 months. CPU/memory demand, Disk I/O, Network I/O
• Alerts: normal vs smart alerts
• Smart alerts rely on the advanced analytics instead of simple raw counters.
Not static, but based on Dynamic Thresholds. Can notify via SNMP, SMTP or file.
• Performance chart!
• Capacity management
  • Historical utilization trends, resources requested vs. needed, how many VMs fit in my farm?
  • Forecast: when will I run out of capacity? What if I add/remove/reconfigure capacity?
• Change events correlated with performance: enable operations to quickly understand and resolve performance issues

Questions?