Performance Management in the Virtual World
Singapore, Q1 2013

Document Information
This deck is part of a series.
• Part 1 is “Management in the Virtual World: a technical introduction”
  • http://communities.vmware.com/docs/DOC-17841
• Part 2 is “Resource Management in the Virtual World”
  • http://communities.vmware.com/docs/DOC-17417
• Part 3 is “Performance Management in the Virtual World”
  • http://communities.vmware.com/docs/DOC-22034
• Part 4 is “Capacity Management in the Virtual World”
  • http://communities.vmware.com/docs/DOC-21791
• Part 5 is “Chargeback in the Virtual World”
  • http://communities.vmware.com/docs/DOC-18593
• Part 6 is “Configuration Management in the Virtual World”
  • To be written
Related documents
• DR 2.0: a new school of thought
  • http://communities.vmware.com/docs/DOC-19992
• Sample Designs for vSphere
  • http://communities.vmware.com/docs/DOC-19627
This is very long and technical material. Use the Section feature to see how it is organised. Use the speaker notes.

Authors & Audience
Iwan ‘e1’ Rahabok
VCAP-DCD, TOGAF Certified
Staff SE, Strategic Accounts, VMware
e1@vmware.com | Linkedin.com/in/e1ang
Co-author wanted, needed, appreciated.
This presentation is created for VMware Administrators. It assumes knowledge of vSphere and vCenter Operations.

VM CPU: The 4 States

How a VM gets its resource
[Diagram labels: Provisioned, Limit, Entitlement, Contention, Usage, Demand, Reservation, 0]

Contention: Derived Metric
[Diagram labels: CPU Co-Stop (millisecond), CPU Latency (%), CPU Contention (%); RAM Balloon (KB), RAM Zipped (KB), RAM Swap (KB), RAM Contention (%)]

What do we care at each layer?
VM
• CPU: Utilisation (%) of each vCore, Run-Queue
• RAM: Utilisation (%), Swap (%), Ballooning (%)
• Disk: Utilisation (Mbps), IOPS, Latency (ms)
• Network: Utilisation (Mbps), Packet drop (qty)
SDDC
• CPU: Utilisation (%) of each pCore, Latency (%), Co-Stop
• RAM: Utilisation (%), Ballooning (%)
• Disk: Utilisation (Gbps), IOPS, Latency (ms)
• Network: Utilisation (Gbps), Packet drop

The 2 sides of Performance
• Performance Troubleshooting: Fixing. Specific issue. Now.
• Performance Management: Preventing. Big Picture. Future.

Performance Troubleshooting: a day in the life…
You got an email from the app team, saying the main Intranet application was slow.
• The email was sent 1 hour ago. It stated that the application was slow for 1 hour, and was ok after that.
• So it was slow between 1-2 hours ago, but is ok now.
• You did a check. Everything has indeed been ok in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM.
• You are not familiar with the application. You do not know what apps run on each VM as you have no access to the Guest OS.
• Your environment: 1 VC, 4 clusters, 30 hosts, 300 VMs, 20 datastores, 1 midrange array, 10 GE
Test your vSphere knowledge! How do you solve/approach this with just vSphere?
What do you do?
A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE
B: No sweat, you’re VCDX + CCIE + ITIL Master. You’re born for this.
C: SMS your wife, “Honey, I’m staying overnight at the datacenter”
D: Take blood pressure medicine so it won’t shoot up.
E: Buy the app team a very nice dinner, and tell them to keep quiet.

Performance Troubleshooting: a day in the life…
The minimum you need to prove
• The performance problem is not caused by your infrastructure, or at least not by your VMware.
  • Infrastructure = VMware + Storage + Network
  • Application = VM + App inside the VM
What you need to prove
• For each of the 10 VMs, the following was ok between 1-2 hours ago: CPU, RAM, Disk, Network
• To strengthen the above, prove that:
  • The shared infrastructure was also healthy: relevant ESX, relevant Datastore
  • The overall platform was also healthy.
  • No relevant faults happened 1-2 hours ago.
What challenges do you face in vSphere to do the above?
• Group discussion: what limitations do you face, if vCenter + vMA + PowerGUI + RVTools is all you have?
The ideal you need to prove
• Give the list of ports (that the 10 VMs use) to the network team to ensure the firewall is not dropping them.
• Show the exact application-level counters that are slow, with the underlying infrastructure-level counters that caused it. In other words, application-specific + root-cause analysis.

Performance Troubleshooting: Approach
Is the VM configured with enough resource?
• CPU: low run queue
• RAM: no swapping
• Network: no dropped packets
• Disk: latency below SLA
Is the platform a bottleneck?
• CPU Latency + CPU Co-Stop is below SLA. CPU is not waiting for Memory or Disk.
• RAM: 0 swapping, 0 Balloon, 0 compression
• Disk: KAVG + DAVG is below SLA. 0 abort, 0 reset.
• Network: no dropped packets
What is the VM & platform utilisation?
• VM utilisation
  • CPU, RAM, Disk, Net
• Platform utilisation
  • CPU, RAM, Disk, Net

VM: is it configured with enough resource?
All these counters should be below SLA. SLA for Tier 1 differs from Tier 3.
How will you prove that the VM’s Demand is being met?
CPU
• Run-Queue within the Guest
• Contention
• Latency
• Co-Stop
RAM
• Swapping within the Guest
• Contention
• Balloon
• Compression
• Swap
Disk
• Read Latency & Write Latency
Network
• Dropped packets for both sending and receiving.

Infra: Is the platform a bottleneck?
All these counters should be below SLA.
CPU
• Contention
• Co-Stop
• Latency
• Demand
RAM
• Contention
• Balloon
• Compression
• Swap
Disk
• DAVG: Read Latency & Write Latency
• KAVG: Kernel Latency
Network
• Dropped packets for both sending and receiving.
SLA for Tier 1 differs from Tier 3.
How will you prove that your Infra is coping well?

Demo
Custom Dashboards
• Dashboard 1: Performance Troubleshooting
• Dashboard 2: Utilisation
• Dashboard 3: Generic Performance troubleshooting
• Custom Dashboard creation
  • Demo of how the above dashboards were created.
Application dependency
• Provided by vCenter Infrastructure Navigator

Custom Dashboard: Performance Troubleshooting

Custom Dashboard: Utilisation

Performance Management: Approach
Is it healthy?
• Every VM & ESX performing well? CPU, RAM, Network, Disk?
• Are they behaving as expected?
• Any fault on any component?
Is it enough?
• Enough CPU, RAM, Network, Disk? Future risk?
• Capacity remaining?
• Where are the “Stress points” in time?
• Time remaining?
Is it optimised?
• Which VMs need adjustment?
• What are my key ratios?
• How much can I claim back from “fat” VMs?
• How many more VMs can I put without impacting performance?

Direct Mapping by vCenter Operations
Is it healthy = Health
• Workload
• Anomalies
• Faults
Is it enough = Risk
• Time remaining
• Capacity remaining
• Stress period
Is it optimised = Efficiency
• What can we reclaim?
• Density. Key ratios for management
Daily update at midnight
Goes beyond Performance
• Capacity
• Compliance
• Application Dependency

Exercise 1: Big Picture
Your CIO wants a dashboard that shows the entire environment (VMs, compute, storage, spanning many physical DCs) in 1 easy-to-understand chart. He wants to see:
• the higher the utilisation, the bigger the object
• the less healthy the object, the brighter its color (e.g. red)
• the more time it has in terms of capacity, the further it is on the horizon.
• the more oversized the object, the higher it is on the chart.
Example 2
• He wants to see just the VMs from the Production Datacenter
• the worse the compliance with the company standard, the bigger the object
• faults should be highlighted in color. The worse the fault, the brighter the color of the object
• the more time it has in terms of capacity, the further it is on the horizon.
• the more abnormal the behaviour of the object, the higher it is on the chart.

Answer

Exercise 2: Big Picture
You are in charge of a large virtual platform spanning 10 datacenters.
• It has 10,000 VMs, 600 ESXi, 500 datastores.
• At 90% virtualised, your platform is by far the largest in the company.
The CIO wants a dashboard that shows her the health of the entire infrastructure at a glance.
• She wants information to be color coded. Green, Yellow, Amber, Red.
• She wants to know the following
  • Utilisation, which must cover CPU, RAM, Disk, Network.
  • Issues, such as faults
  • Abnormal behaviour
How will you show the above?

Managing >10,000 virtual objects

Visibility across vCenters

Demo
vSphere UI
• Dashboard
• Scoreboard
• Main Tabs
• Configuration

Counters and Badges
Counters
• A vCenter farm with 100 VMs and 10 ESX will have >50,000 counters!
  • It is not humanly possible to look at them, let alone analyse them.
• vCenter presents raw counters
  • e.g. What does a Ready Time of 1500 in the Real Time chart mean? Is a value of 2000 in the Real Time chart better than a value of 75000 in the Daily chart?
  • e.g. Is memory.usage at 90% at ESXi level good or bad?
  • e.g. Is IOPS of 300 good or bad for datastore XYZ?
• A single counter can be misleading
  • e.g. Low CPU usage does not mean the VM is getting the CPU, if there is Limit, Contention and Co-Stop.
  • e.g. To see disk performance, we need to see multiple counters at multiple layers (VM, kernel, physical)
• Different counters have different units
  • GHz, %, MB, kbps, ops/sec, ms
  • This makes analysis even more complex
Badges
• Standardises the scale into 0 – 100.
• 1 universal unit. Minimises the “translation” in our head.
• Can be >100 if demand is unmet

Derived Counters
Universal. Apply to CPU, RAM, Disk, Net, etc.
Counters are derived using sophisticated formulas, not just aggregated.
For the same counter, different objects use different formulas.
Samples of Derived Metric: Health
Health Score of an Object = MAX (Abnormal Workload, Faults)
• Abnormal Workload per Metric = Geometric Mean (MAX (Abnormality (Capacity/Entitlement), Abnormality (Demand/Usage)), Workload)
• Abnormal Workload per Object = Score Aggregation (Abnormal Workload per Metric)
• Fault depends on the object as every object is different:
  • Cluster = HA Issues = MAX (HA Insufficient Failover Resources, HA Failover In Progress, HA Cannot Find Master)
  • Host = MAX (Hardware Issues, HA Issues)
    • Hardware Issues = MAX (Network Issues, Storage Issues, Compute Issues, CIM Issues)
      • Network Issues = MAX (Network, DVPort, VMNic)
        • Network = Max_of_all_instances (Network Device)
        • DVPort = Max_of_all_instances (DVPort Device)
        • VMNic = Max_of_all_instances (VMNic Device)
      • Storage Issues = MAX (Storage, SCSI, VMFS heartbeat, NFS server, CIM Storage)
        • Storage = Max_of_all_instances (Storage Device)
        • SCSI = Max_of_all_instances (SCSI Device)
        • VMFS heartbeat = Max_of_all_instances (VMFS heartbeat Device)
        • NFS server = Max_of_all_instances (NFS server Device)
      • Compute Issues = MAX (Error, PCIe)
      • CIM Issues = MAX (Processor, Memory, Fan, Voltage, Temperature, Power, System Board, Battery, Other Health, IPMI, BMC)
    • HA Issues = HA Host Status
  • VM = MAX (FT Issues, HA Issues)

Threshold: a shift in mindset needed
vCenter sets “static” thresholds, which can be misleading
• During peak, it is common for a VM to reach high utilisation.
  • A static threshold will generate alerts when it should not.
  • vSphere admins quickly learn to ignore them, defeating the purpose of the alert to begin with.
• During non-peak, it might be abnormal for a VM to reach even 50% utilisation.
  • A static threshold will not generate alerts when it should.
vCenter only sets a high threshold
• Do you set a static threshold for when CPU or RAM utilisation drops below 5%?
• A drop in IOPS across the entire storage array might be a sign of a terrible day ahead.
• It will not alert when these happen:
  • Utilisation drops from 75% to 1% when it should not.
  • Utilisation changes from 5% to 70% when it should not.
• We need to plot both the upper range and the lower range
But each VM differs. And the same VM differs depending on day/time…
• Intelligence is required to analyse each metric and its expected “normal” behaviour.

Dynamic threshold & alerts
vCenter Operations uses dynamic thresholds
• It is dynamic and personalised down to the individual metric.
• Varies from object to object. 1000 VMs will each have their own threshold.
• Varies from time to time. The same CPU Usage counter has a different threshold at different times. This caters for peaks. See the chart below.
• Varies from metric to metric. On an ESX with 12 cores, each core has its own CPU Usage threshold.
• You can fix hard thresholds if you need to.
  • This needs Enterprise edition. It comes with no static threshold defined.
  • Steps: http://virtual-red-dot.blogspot.com/2012/01/vcenter-operations-5-hard-threshold.html
Notice the range varies in size.

Dynamic Threshold Analysis
DT analysis runs nightly
• New dynamic thresholds are computed for each metric
Data categorization
• Tries to identify the stat as linear, multinomial, step function, etc.
• If one of those matches, that DT function is used
Otherwise: competition
• Sigma: assumes hourly cycles
• CCPD: tries to find normal cycles
• ACPD: tries to find abnormal cycles
DT Scoring
• Winner is assigned based on metric trending accuracy
The same metric may get a different DT function on a different day.
[Diagram labels, in reading order: For each metric; Data Categorization; Linear DT, Multinomial DT, Sparse Sigma DT, Step Function DT, Quantile Sigma DT, CCPD, ACPD; DT Scoring; Dynamic Thresholds]

Dynamic Threshold: Algorithm
[Equation image not recoverable from the extracted text. The surviving fragments show Dirichlet distributions over probabilities p_{i,j}, with 0 ≤ p_{i,j} ≤ 1, and the marginal distribution of the i-th row of J for i = 1, ..., m.]
It is pretty difficult for a human to beat the computer in analysis of the data.
The above is one of the many algorithms applied by vCenter Operations.

Analytics
7 different analytics areas.
For the DT feature, there are 8 algorithms.
Only in Enterprise Edition.
These advanced features create Smart Alerts.

Discussion Point
• Raw Counters vs Derived Counters
• Dynamic Threshold vs Static Threshold

Performance Management: FAQ
• How is the entire virtual data center doing? What’s the degree of its health?
• For every cluster, host, datastore, what’s their health?
• For every single VM, is it CPU/Memory/Disk/Network bound?
• Is any VM not getting what it is entitled to?
• What’s the normal workload range for every object in our vDC?
• Is our vDC doing business as usual today? Or is today a turbulent day with lots of unexpected changes?
• Which VMs, ESX, clusters, datastores, etc. are behaving abnormally?
  • …. and exactly which counters are the culprits?
• What faults do we experience in our vDC?
• In our vDC, do we have stress points or periods? How bad is it?
  • For every cluster, VM, datastore, which ones are experiencing stress and how bad is it?
• How much time do we have before we need to buy more servers, storage, network, before performance starts to degrade or we run out of capacity?
  • For every cluster, VM, datastore, how much time do we have?
• How many more VMs can we put without impacting performance or using up capacity?
  • For every cluster, VM, datastore, which components (CPU, RAM, Disk, Network) would run out first?
• What’s the degree of our compliance?

Badge – Health
Answer complex questions like:
• How is the entire virtual data center doing? What’s the degree of its health?
• For every cluster, host, datastore, what’s their health?
Health is a current Operational State.
• It represents what is wrong now that should be addressed within 1 day. Thus Health needs to be scored such that if it is red, then it really needs attention.
Weather Map
• Simple way to check that the entire farm is healthy
• For a child object, it is replaced with Health Trend
• Shows Health of all parent and child objects
• Each square can be a VM, ESX, datastore, cluster, datacenter, vCenter.
Value: Explanation
• 75 – 100: Normal behaviour
• 50 – 75: The object experiences some problems.
• 25 – 50: The object might have serious problems. Check and take action as soon as possible.
• 0 – 25: The object is either not functioning properly or will stop functioning soon.
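
To make the Health table above concrete, here is a minimal sketch that maps a Health score to its band and lists the objects that need attention, the way you would scan a Weather Map. The function names and the sample object names are hypothetical, not part of vCenter Operations. The table's ranges share their end points (75 appears in two rows), so assigning a boundary value to the healthier band is an assumption.

```python
def health_band(score: float) -> str:
    """Return the Health table explanation for a 0-100 score.

    Assumption: a score exactly on a boundary (25, 50, 75) belongs to
    the healthier band, because the table's ranges overlap at the edges.
    """
    if not 0 <= score <= 100:
        raise ValueError("Health score must be between 0 and 100")
    if score >= 75:
        return "Normal behaviour"
    if score >= 50:
        return "Some problems"
    if score >= 25:
        return "Might have serious problems"
    return "Not functioning properly or will stop functioning soon"


def needs_attention(scores: dict[str, float], below: float = 50.0) -> list[str]:
    """List object names with Health below `below`, worst first."""
    unhealthy = [name for name, score in scores.items() if score < below]
    return sorted(unhealthy, key=lambda name: scores[name])


if __name__ == "__main__":
    # Hypothetical objects and scores, for illustration only.
    farm = {"vm-web01": 92, "vm-db01": 41, "esx-07": 18, "datastore-03": 77}
    for name in needs_attention(farm):
        print(name, farm[name], health_band(farm[name]))
```

The cut-off of 50 in `needs_attention` is a choice made for this sketch; it simply selects the two lowest bands of the table.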
Badge – Workload
Answer complex questions like:
• For every object, how is Demand vs Supply?
• For every single VM, is it CPU/Memory/Disk/Network bound?
• Is any VM not getting what it is entitled to?
• What’s the normal workload range for every object in our vDC?
Workload is not utilisation or usage
• More accurate than utilisation, as it takes into account more factors than just utilisation.
Workload = (Demand/Entitlement)
• Entitlement is dynamic. Affected by shares, limit, etc.
• Demand ≠ Usage.
  • Usage may mean passive usage. E.g. the RAM page is there but there is no write/read.
• Score is Max (CPU, RAM, Disk IO, Net IO)
  • To bring up the attention
Value: Explanation
• 0 – 80: Workload is not high.
• 80 – 90: The object is experiencing some high resource workloads.
• 90 – 95: Workload on the object is approaching its capacity in ≥1 area.
• >95: Workload on the object is at or over its capacity in ≥1 area.

Badge – Anomalies
Answer complex questions like:
• Is our vDC doing business as usual today? Or is today a turbulent day with lots of unexpected changes?
• Which VMs, ESX, clusters, datastores, etc. are behaving abnormally?
  • …. and exactly which counters are the culprits?
Identifying metric abnormalities
• It needs to learn the dynamic ranges of “Normal” for each metric, so give it >3 cycles per metric.
  • A month-end job means it needs 3 months.
• The normal range changes after configuration or application changes.
Anomalies score
• Usually an indication of a problem
• Demand change
• A high number of anomalies:
  • Application team changed code/app
• KPI metrics impact the Anomalies score more than non-KPI metrics.
Value: Explanation
• 0 – 50: Normal Anomaly range
• 50 – 75: The score exceeds the normal range.
• 75 – 90: The score is very high.
• > 90: Most of the metrics are beyond their thresholds. This object might not be working properly or will stop working soon.

This virtual DC spans multiple vCenters. vCenter Ops shows all the counters that are behaving abnormally.
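
Returning to the Workload badge: the two rules stated on that slide (Workload = Demand/Entitlement, and Score = Max of CPU, RAM, Disk IO, Net IO) can be sketched as below. This is only an illustration of those two rules, not the actual vCenter Operations calculation. The `Resource` class, the function names, the sample numbers and the handling of values that fall exactly on a band boundary are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    demand: float       # what the object wants, in the resource's own unit
    entitlement: float  # what it is entitled to, same unit, must be > 0


def workload_score(resources: dict[str, Resource]) -> tuple[float, str]:
    """Return (score, bound resource) using Workload = Demand / Entitlement.

    The score is the maximum across resources, so it can exceed 100
    when demand is higher than entitlement (unmet demand).
    """
    per_resource = {
        name: 100.0 * r.demand / r.entitlement for name, r in resources.items()
    }
    bound = max(per_resource, key=per_resource.get)
    return per_resource[bound], bound


def workload_band(score: float) -> str:
    """Map a score to the Workload table (boundary handling is assumed)."""
    if score > 95:
        return "At or over capacity"
    if score >= 90:
        return "Approaching capacity"
    if score >= 80:
        return "Some high resource workloads"
    return "Workload is not high"


if __name__ == "__main__":
    # Hypothetical VM: CPU in GHz, RAM in GB, Disk IO in IOPS, Net IO in Mbps.
    vm = {
        "CPU": Resource(demand=2.25, entitlement=2.5),
        "RAM": Resource(demand=3.0, entitlement=8.0),
        "Disk IO": Resource(demand=50.0, entitlement=200.0),
        "Net IO": Resource(demand=10.0, entitlement=1000.0),
    }
    score, bound = workload_score(vm)
    print(f"Workload {score:.0f}, bound by {bound}: {workload_band(score)}")
```

Taking the maximum rather than an average is what "brings up the attention": one starved resource is enough to raise the score, and the function also reports which resource the object is bound by.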
Badge – Faults
Answer complex questions like:
• What faults do we experience in our vDC?
• For every object, what faults does it have?
Specific knowledge of which vCenter Events
• Which events affect Availability and Performance of which object?
• Pulled from active vCenter events
• Example:
  • Loss of redundancy in NICs or HBAs
  • Memory checksum errors
  • HA failover problems
• Each fault has a default score (e.g. 25, 50, 75, 100)
• The highest individual Fault Score drives the object’s Fault Score
Best Practices:
• Do not change the Faults Threshold
• Use Alerts View to manage Faults. Filter it to show just Faults.
Value: Explanation
• 0 – 25: No fault is registered on the object.
• 25 – 50: Faults of low importance happen on the object.
• 50 – 75: Faults of high importance happen on the object.
• > 75: Faults of critical importance happen on the object.

Badge – Risk
Answer complex questions like:
• Do we have risk from performance and capacity in our vDC? If yes, where is it and can you quantify the seriousness?
• Which objects are at risk? What is the specific risk?
Risk Score takes into account
• Time Remaining
• Capacity Remaining
• Stress
Risk is an early warning system.
• Identifies potential problems that could eventually hurt performance
• The Risk Chart shows the Risk score over the last 7 days, giving a view of the trend.
Value: Explanation
• 0 – 50: No problems are expected in the future.
• 50 – 75: There is a low chance of future problems, or a potential problem might occur in the far future.
• 75 – 100: There is a chance of a more serious problem, or a problem might occur in the medium-term future.
• 100: The chances of a serious future problem are high, or a problem might occur in the near future.

Badge – Time Remaining
Answer complex questions like:
• How much time do we have before we need to buy more servers, storage, network, before performance starts to degrade or we run out of capacity?
• For every cluster, VM, datastore, how much time do we have?
Measures time remaining before each resource type reaches its capacity
• CPU
• Memory
• Disk (IOPS & Space)
• Network I/O
Early warning of upcoming provisioning needs
• Based on Score Provisioning (SP) buffer. Default value is 30 days.
• Set in “Capacity & Time Remaining” section
Value: Time remaining
• 50 – 100: > 2x SP Buffer (60 days)
• 25 – 50: < 2x SP Buffer
• <25: Near SP Buffer
• 0: < SP Buffer (30 days)

Badge – Capacity Remaining
Answer complex questions like:
• How many more VMs can we put without impacting performance or using up capacity?
• For every cluster, VM, datastore, which components (CPU, RAM, Disk, Network) would run out first?
Early warning system
• A low score of 1 means you still have >30 days.
• Measures how many more VMs can be placed on the object
Percentage of Total VM “Slots” Remaining
• Based on the average size of the VM on the object (e.g. VM profile)
• Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.
[Screenshot callout: 333 more VMs correlates to 77% Capacity Remaining for this object]
From the table, notice the value is not linear
• It is also not the same as the Time Remaining threshold.
• A value of 30 means >120 days for capacity but around 40 days for time.
Value: Capacity remaining
• >10: >120 days
• 5 – 10: 60 – 120 days
• 0 – 5: 30 – 60 days
• 0: <30 days

Capacity Remaining Calculation
Determine Capacity Constraint Resource
Deployed or Powered On VMs
• Powered Off VMs only use disk space resources
• Powered On VMs use ALL of the 4 resources
Calculation Example Shown:
• Limiting Resource is Disk Space, with 333 VMs available
• Use the Deployed VM number of 99 to do the calculation for percentage space remaining
• Determine Capacity Remaining
  • 333 / (333 + 99) = 77%

Capacity and Time details
You can drill down to see details
• You can check the 9 components, as shown on the right.
• This helps answer the question of which components have how many days or VMs left!
• Summary = Min (all 9 components)

Badge – Stress
Answer complex questions like:
• In our vDC, do we have stress points or periods? How bad is it?
• For every cluster, VM, datastore, which ones are experiencing stress and how bad is it?
Measures long-term or chronic workload (6 weeks)
• Chart shows a weekly breakdown of Stress for each day/hour, averaged over the last 6 weeks
• Workloads > 70% = “Stressed”
• Threshold is configurable, as per the screenshot below
Value: Explanation
• 0 – 1: Normal score. No action needed.
• 1 – 5: Some of the object’s resources are not enough to meet the demands.
• 5 – 30: The object is experiencing regular resource shortage.
• >30: Most of the resources on the object are constantly insufficient. The object might stop functioning properly.

Stress Calculation
[Chart: Workload Line over 6 weeks on a 0 to 100 scale. The Stress Zone is the band between 70 and 100. 12% of it is covered by workload.]
Stress Score is a % and is based on the area of Workload above the “Stress Line” threshold compared to the Total Capacity of the object
• Stress Score = (Stress area / Stress Zone) * 100
• But the max value can be > 100% as the workload can be >100.
Example
• Stress Line is 70% Workload
• 12% of the area is above the 70% threshold
• Stress Score is 12

Badge – Efficiency
Answer complex questions like:
• Are there optimization opportunities in our vDC?
• How well do we do in terms of VM provisioning? Do we get them right?
Efficiency Score factors
• Reclaimable waste
• Density ratio
Graph Depicts VMs by Percent
• Optimal – Optimally Provisioned VMs
• Waste – Over Provisioned VMs
• Stress – Under Provisioned VMs
  • Not used in Efficiency Calculation (see Risk)
Three Resources Considered
• CPU
• Memory
• Disk Space
Note: VMs can appear in both Stress and Waste.
Value: Explanation
• >25: The efficiency is good. The resource use on the selected object is optimal.
• 10 – 25: The efficiency is good, but can be improved. Some resources are not fully used.
• 0 – 10: The resources on the selected object are not used in the most optimal way.
• 0: The efficiency is bad. Many resources are wasted.

Badge – Reclaimable Waste
Answer complex questions like:
• Did we over-provision the VMs in terms of CPU, RAM and Disk? If yes, what’s the degree of over-provisioning?
• For every cluster, VM, datastore, what can we reclaim?
It identifies the amount of reclaimable resources
• CPU
• Memory
• Disk
Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
• Waste Score = Max (CPU Waste Score, RAM Waste Score, Disk Space Waste Score)
• Disk calculation can also include old snapshots and templates
Value: Explanation
• 0 – 50: No resources are wasted on the selected object.
• 50 – 75: Some resources can be used better.
• 75 – 100: Many resources are underused.
• 100: Most of the resources on the selected object are wasted.

Badge – Density
Answer complex questions like:
• How high can we push our consolidation ratio before we experience performance problems?
  • Now that’s a million dollar question!
• For every datacenter, cluster, ESXi, what are our key ratios and how much headroom do we have?
Contrasts Actual vs Ideal Density
• Identify Optimal Resource Deployment Before Contention Occurs
• Ideal is based on demand, not simple configuration.
• High Density is good. 100 is not too high.
Value: Explanation
• >25: Good consolidation
• 10 – 25: Some resources are not fully consolidated
• 0 – 10: The consolidation for many resources is low
• 0: The resource consolidation is extremely low.

Badge Thresholds
Disable Color Threshold by Clicking the Level Off

Using badges together
Workload High & Anomalies Low & Stress High: add resources
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object is often running under high Workload.
Workload High & Anomalies Low & Stress Low: not likely a big problem… a cyclical workload spike?
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object usually has enough resources
Workload High & Anomalies High: something is amiss! Immediate attention.
• Workload – Object is Running Hot. Potentially Starving for Resources
• Anomalies – Abnormal behavior for this timeframe
• If there are Alerts and Faults too, then it is a sign of a major issue

Discussion Point
• Is Badge the way to go?
• Are these the right 12 badges?
• What other badges do you need?

Heat Map
A great way to show a lot of information on 1 screen. A heat map can quickly highlight information, as it presents relative information. It is good for relative comparison among VMs.
Built-in heat maps
• Basic:
  • Storage: space, IO
  • CPU
  • RAM
  • Network
• Advanced (or composite)
  • Health
  • Workload
  • Capacity
Custom heat map or cold map
• Since we can change the color, we can actually create a cold map.
• In a cold map, the bigger the size, the colder (less utilised) it is. The bluer it is, the less utilised it is.
• Hence it focuses on Waste
A heat map is a 2-dimensional chart, so it takes 2 parameters. You cannot choose >2 data. For example, you cannot show the following at the same time:
• IOPS, Latency and Throughput. Also, these 3 have different units so it’s hard to combine them using a Super Metric.
• ESX, VM and Datastore.

Storage: Datastore, VM, Workload & latency
Since all the datastores are on the same array, how do we quickly tell the relative workload generated by every one of them?
• This answers: which datastores are heavily loaded?
For each of these datastores, how do we know the relative workload generated by each VM?
• This answers: which VMs dominate within a datastore?
For every VM, how do we know its performance is reasonable?
• This answers: which VM has a storage bottleneck?
How do we show all the above data in one page, without the need to show a lot of numbers?
• And we still want to be able to drill down to each VM and datastore.

Each square is a VM. They are grouped by datastore.
Bigger square: bigger throughput
Color: latency.

Storage: Throughput & Latency at cluster level
Which cluster is generating high storage workload?
Are they getting the SLA they asked for? What’s the latency? The cluster owner wants to know that his entire cluster is getting <10 ms latency.
We expect these X, Y, Z clusters to be doing little work. Can we prove this?
Basically, this is the same concept as the previous slide, but looking from the cluster point of view, as Cluster & Datastore have a Many-to-Many relationship.

Storage: Throughput & Latency at cluster level

Storage: Throughput & Latency at host level

Storage: Throughput & Latency at VM level
Can we show at VM level now?
That’s why you need a 24” monitor

Storage: Space (GB) & Latency
Is there any big VM that is not getting the SLA we agreed on?

Storage: Datastore space contention
Do we have space contention at any of the datastores? If yes, how bad is the contention?
• While we use thin provisioning at vSphere level (and thick at array level), we still have space risk from snapshots, vRAM increase, new VMs, new vDisks, storage vMotion, storage DRS, etc.
• The higher the contention, the brighter the color.
Are we running low on capacity in those datastores with high contention?
This requires a custom heat map.
We can do a variant of this heat map.

Storage: Space contention
We use thin provisioning
This is a variant of the previous slide’s heat map.
In this variant, we answer the question: are the datastores of the same standard size?

CPU: Contention vs Usage at cluster level
Which clusters are doing the most work? Which are not doing much? How is the CPU workload on every cluster?
For each of those clusters, can we see if there is CPU contention?

CPU: Contention vs Usage at host level
Same questions as the previous slide, but for hosts.
We can expect some “drill down” in this heat map

CPU: Contention vs Usage at VM level
Can we show at VM level now?
That’s why you need a 24” full HD monitor

VM Health
Heat Map 1: Current Health
• Are all the VMs healthy? Especially those VMs which have high workload!
• Which VMs are experiencing problems?
• Are more demanding VMs less healthy?
• Can we see this by cluster? By host?
Heat Map 2: Future Health
• Will all the VMs be okay in future (30 days)? Need to check CPU, RAM, Disk IO, Disk Space and Network for every single VM!
• For those VMs which are not ok, can we be specific on which value will run out first? Can we “drill down” to an individual VM?

VM: color by health, size by workload

VM: color by capacity, size by workload
This is now showing the future projection.
We can see that the VM vCenter 5 is red. Its capacity will run out within 30 days. So we click on it to drill down.

Drill down to specific VM
The screenshot below shows vCenter 5. We can see that it will need more vCPU as it will max out in 10 days.
We can go as far as 6 months. This is good enough as you should not buy hardware >6 months in advance. It makes sense in the physical world as it’s fixed, but is unwise in the virtual world.

Discussion Point
• Which heat maps are useful for you?
• What other heat maps or cold maps do you need?

Monitoring the big workload
You have convinced your CIO to virtualise the remaining 50% of the servers.
Your CIO needs you to prove, supported by performance charts, that the platform has served every VM well, meeting the SLA in the past 1 quarter.
• Tier 1 cluster SLA: 2% CPU Contention, 0 RAM Contention, 10 ms disk latency, 0 dropped packets.
• Tier 2 cluster SLA: 4% CPU Contention, 5% RAM Contention, 20 ms disk latency, 0 dropped packets.
• Tier 3 cluster SLA: 6% CPU Contention, 10% RAM Contention, 30 ms disk latency, 0 dropped packets.
You have 500 VMs on 50 ESXi, 8 clusters, 40 datastores, 5 RDMs.
You must prove that:
• Not a single Tier 1 VM has >2% CPU Contention in the past 1 quarter. The underlying ESXi also has <2% CPU contention.
• Not a single Tier 1 VM has >10 ms disk latency in the past 1 quarter. The underlying ESXi also has <10 ms disk latency.
• Etc., for each Tier and each component (CPU, RAM, Disk, Net)
What kind of charts do you need to show?

Super Metrics

See you 1 April!
VMware office
© 2010 VMware Inc. All rights reserved

Demo
Super Metric
• Editor
• Package
• Attaching to object

Discussion Point
Think of super metrics that you need. Explain why and how you will need them.

Implementation Approach
Steps: Define who needs what, Create Super Metrics, Create Applications, Create Tags, Create Heat Maps, Create Dashboards
Begin with the end in mind
• Every Super Metric must serve a particular role
  • Role, not individual. A person can & will have many heat maps/dashboards.
• Decide if you need the following non-standard info
  • Application-level & Guest-OS-level info
  • Info from physical machines (UNIX, X64, etc.)
  • Info from physical storage and network (switch, FW, router, etc.)
Think in terms of application
• A great way to complement vSphere, as vCenter does not have this object.

Who needs to see what
Roles, from top to bottom:
• CIO or CTO
• Group Head, e.g. Head of Infra, Head of Apps
• Dept Head, e.g. Head of Storage, Head of Server, Head of Network, Head of Databases
• Admin/Architect, e.g. Storage Admin, Network Admin, App Owner, VM Owner
Towards the CIO or CTO end:
• Simple Dashboard. Big picture. Tends to be application focused.
• No absolute data. Normalised to 0-100.
• Focus on long term. Averaged data. A 30-minute spike will not show up. Updated daily.
Towards the Admin/Architect end:
• Rich Dashboard. Ideally Full HD screen. Specific info.
• Absolute data + Normalised Data.
• Focus on short term. Actual data. A 5-minute spike will be visible. Updated every 2 minutes.

Who needs to see what (samples)
Role: Info presented
• CIO: Health of overall IT in the past 1 month. Health of key applications in the past 1 month.
• CTO: As above, but with more technical content, and tailored to him.
Head of Applications Health of all key apps in the past 1 month, with the ability to do 1 level drill down for each app. Capacity projection for all key apps. Head of Infrastructure Health of Storage Health of Network: max drop packets for entire infra Health of Servers (VMware and Physical) Health of VM Head of Storage A higher level, simpler dashboard than Storage Admin Head of Network Max througput across entire infrastructure Max drop packets VMware Team An App Owner 78 The infra is providing each of the VMs in my App with the resources it needs Designing Super Metric Leverage existing derived metrics Leverage Objects that vCenter cannot provide performance data • Application, Resource Pool, Folder, Location, can now have performance counters Minimise static alert. Know what a good range for the end result Build a simple table to avoid super metric sprawl and duplicating existing metrics • Below is an example, showing 2 Super Metrics. Name VM SLA Infra SLA 79 Purpose Target Role Formula Shows that a VM gets the resources it wants from infrastructure based on the defined SLA. VM Owner VM SLA = 100% - Max (CPU, RAM, Disk, Network) CPU = CPU Contention %. RAM = RAM ballooning %. Disk = % above threshold latency. Network = Packet Drop %. Show that the underlying infra has the resources for all the VMs on it VMware Admin Tier 1 Disk SLA is 10 ms. Tier 2 Disk SLA is 20 ms. Tier 3 Disk SLA is 30 ms. Infra SLA = 100% - Max (Host Cluster, Datastore Cluster) Good Range >99% (Tier 1 cluster) >97 (Tier 2 cluster) >95% (Tier 3 cluster) Custom Heat Map or Cold Map Component Heat Map Cold Map Least utilised VM: size by vCPU count, color by RAM + CPU usage (a Super Metric) CPU Resource pool: size by CPU utilisation, RAM Most RAM intensive VMs, grouped by ESX. Size by RAM utilisation, color by health Disk Most disk intensive VMs, grouped by ESX. 
Size by disk utilisation, color by health Least utilised disk: size by GB, color by % of free Network Most network intensive VMs, grouped by ESX. Size by network utilisation, color by health Most idle VMs, grouped by host Capacity VMs with file system that will run out soon. Color by % left, size by GB left. Health VM health, grouped by cluster. Color by health, size by workload. Design consideration • Use Super Metric so the info is richer. • Group VMs by 1 consistent hierarchy only. If you group by cluster, it won’t make sense to further group by datastore as 1 datastore can spans multiple cluster. 80 Demo vSphere UI • Operations • Details • Cluster • Host • VM • Datastore • Events • All Metrics • VC Ops Chart vs vCenter chart Custom UI 81 Thank you © 2010 VMware Inc. All rights reserved vCenter Operations presents datastore with all the details. It also show the estimated max IOPS! 83 Storage in vCenter Operations Automatic learning of storage performance. Calculating both Demand and Normal rate. 84 vSphere 5 Performance Chart (fat client) Can only choose 1 component at a time. e.g. cannot show CPU and RAM at the same time. 85 vSphere 5 Performance Chart (fat client) Can only show 1 chart at a time. Hence can only show 2 units at a time. 86 vCenter Operation charts Can show >1 charts at a time. Can combine/split charts. Can show different data type from different objects. Line is color coded, showing when threshold is breached. 87 CPU counters 88 89 90 91 Demand: Derived Metric The chart below shows Demand in action. I generated IOPS which on a local datastore, resulting in spike in latency (read latency when up from 3 ms to 60 ms. Demand correspondingly go up from 4 to 100! 92 Cluster Overview 93 94 95
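
A rough sketch of the VM SLA super metric from the Designing Super Metric slide, written in plain Python rather than in the vCenter Operations super metric editor. All function names and sample values here are hypothetical. It also assumes that “Disk = % above threshold latency” means the percentage of latency samples above the tier’s Disk SLA, which is only one possible reading of the slide.

```python
# Hypothetical sketch; names and sample data are illustrative only and are
# not part of vCenter Operations.

# Good range per tier, from the slide: >99% (Tier 1), >97% (Tier 2), >95% (Tier 3).
GOOD_RANGE = {1: 99.0, 2: 97.0, 3: 95.0}

# Disk SLA per tier in milliseconds, from the slide.
DISK_SLA_MS = {1: 10, 2: 20, 3: 30}


def disk_pct_over_threshold(latency_samples_ms, threshold_ms):
    """Percentage of latency samples above the threshold (assumed reading)."""
    if not latency_samples_ms:
        return 0.0
    over = sum(1 for sample in latency_samples_ms if sample > threshold_ms)
    return 100.0 * over / len(latency_samples_ms)


def vm_sla(cpu_contention_pct, ram_balloon_pct, disk_pct, packet_drop_pct):
    """VM SLA = 100% - Max(CPU, RAM, Disk, Network), all inputs in percent."""
    worst = max(cpu_contention_pct, ram_balloon_pct, disk_pct, packet_drop_pct)
    return 100.0 - worst


def in_good_range(sla_pct, tier):
    """True when the SLA value is above the good range for the tier."""
    return sla_pct > GOOD_RANGE[tier]


if __name__ == "__main__":
    # A made-up Tier 1 VM: one of four disk latency samples exceeds 10 ms.
    disk = disk_pct_over_threshold([4, 6, 12, 8], DISK_SLA_MS[1])
    sla = vm_sla(cpu_contention_pct=1.5, ram_balloon_pct=0.0,
                 disk_pct=disk, packet_drop_pct=0.0)
    print(sla, in_good_range(sla, tier=1))
```

Taking the maximum means a single bad component drags the whole SLA down, which fits the goal of showing whether the VM got what it wanted from the infrastructure.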