Performance and Capacity with Analytics
Dan Kimball – Cloud Infrastructure Architect - VMware
© 2009 VMware Inc. All rights reserved

Agenda
• Introduction
• What is Analytics?
• Real-world examples
• 3rd generation monitoring with analytics
• Success stories
• Bringing it all together
• Closing remarks and Q&A

What is Analytics?
Analytics is the application of computer technology, operational research, and statistics to solve problems in business and industry. A simple definition of analytics is "the science of analysis". A practical definition, however, would be that analytics is the process of developing optimal or realistic decision recommendations based on insights derived through the application of statistical models and analysis against existing and/or simulated future data.
Source: Wikipedia - http://en.wikipedia.org/wiki/Analytics

Real-world examples of Analytics
• Clinical decision support systems: Experts use predictive analysis in health care primarily to determine which patients are at risk of developing certain conditions, like diabetes, asthma, heart disease, and other lifetime illnesses.
• Customer retention: With the number of competing services available, businesses need to focus efforts on maintaining continuous consumer satisfaction, rewarding consumer loyalty and minimizing customer attrition.
• Fraud detection: Fraud is a big problem for many businesses and can be of various types: inaccurate credit applications, fraudulent transactions (both offline and online), identity theft and false insurance claims.
• Risk management: Risk management techniques aim to predict, and benefit from, a future scenario. The Capital asset pricing model (CAP-M) "predicts" the best portfolio to maximize return.
• Underwriting: Many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk.
For example, auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver.

“1st Generation” Tools, Up/down… Floods of alerts
1st Generation - Event-Centric, Hard-Threshold Based
[Slide graphic: four DATA FEEDS labels above an alert console listing. Column headings are not shown on the slide.]

3/4/08 16:45 | Host 1 | processingTimeServ | The Processing Time Service Level on process… | n/a | n/a | n/a
3/4/08 16:45 | Host 1 | Processor_Table 0 | Processor 0 is at 87.0%. A CPU Bottleneck is….. | n/a | 0 | Windows_System
3/4/08 16:44 | Host 2 | System_Table | The number of hardware interrupts per second… | n/a | 0 | Windows_System
3/4/08 16:30 | Host 2 | Processor_Table 1 | Processor 1 is at 84.0%. A CPU Bottleneck is …. | n/a | 0 | Windows_System
3/4/08 16:25 | n/a | responseTimeServ… | The Response Time Service Level on Toadwor.. | n/a | n/a | n/a
3/4/08 16:20 | n/a | processingTimeServ.. | The Processing Time Service Level on Prospec.. | n/a | n/a | n/a
3/4/08 16:08 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD A CPU Hog has been detected | n/a | OraSF | Oracle
3/4/08 16:08 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD SQL with high I/O has been de.. | n/a | OraSF | Oracle
3/4/08 14:40 | n/a | responseTimeServ… | The Response Time Service Level on Siebel Sa.. | n/a | n/a | n/a
3/4/08 14:20 | n/a | processingTimeServ.. | The Processing Time Service Level on Siebel S. | n/a | n/a | n/a
3/4/08 14:39 | Host 3 | Top_CPU_Table | Process ‘siebsh.exe(svc-siebel, 6780)’: is cons.. | n/a | 0 | Windows_System
3/4/08 14:39 | Host 3 | Top_CPU_Table | Process ‘siebsh.exe(svc-siebel, 7940)’: is cons.. | n/a | 0 | Windows_System
3/4/08 14:15 | n/a | responseTimeServ… | The Response Time Service Level on Toadwor.. | n/a | n/a | n/a
3/4/08 14:15 | n/a | processingTimeServ.. | The Processing Time Service Level on Prospec.. | n/a | n/a | n/a
3/4/08 13:55 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD A CPU Hog has been detected | n/a | OraSF | Oracle
(The first eight rows then repeat on the slide.)

“2nd Generation” Tools, don’t handle change > false positives
2nd Generation - Rudimentary Baselining, Rules/Templates, Charting
[Slide graphic: the same alert listing as on the 1st Generation slide.]

3rd generation monitoring with analytics – It’s here!
Dan Kimball – Cloud Infrastructure Architect - COE - VMware

Real-Time Performance Management
3rd Generation – Holistic, Real Time Analytics
• Flexible INTEGRATION to many data sources
• Enterprise SCALABILITY
• Patented performance ANALYTICS
• Powerful information DASHBOARDS
I can put all my monitoring tools to good use and get better performance analytics.

Smart Alert™ - Using Analytics to understand abnormalities across the application
• User Experience (e.g., HP RUM, etc.)
• Business Application
• App Data (e.g., Hyperic, SCOM)
• vCenter (Private/Public Cloud)
• Storage (EMC, NetApp, IBM)
• Network Data (e.g., Ionix IPAM/PM, etc.)
Smart Alert Generation (“When”)
[Slide graphic: formula overlay, largely unrecoverable from the extraction. The legible fragments are: a joint density over probabilities p_{i,j}; the constraints that the p_{i,j} sum to 1 with 0 \le p_{i,j} \le 1; the Gamma function \Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt; and the statement that the marginal distribution of the i-th row of J is Dirichlet, with one parameter set for i = 1, ..., m-1 and another for i = m.]

Future State – Evolution of Learning and Predictive Analysis

Body signals:
• Muscular, Respiration, Skeletal, Heart Rate, Cardio Vascular, Nervous, Temperature
My brain is understanding the health of my body. Should I do anything?
Your Brain Understands Context:
• If my heart rate and temperature are increasing I should go to the hospital
• If I’m tired, rest more
• If I tire easily, start exercising!

Enterprise metrics:
• Monitoring Server O/S Metrics – CPU, RAM, Disk, I/O, etc.
• Monitoring App Layer Metric – JVM, DB Connections, etc.
• Monitoring UserEx Metrics
• Monitoring Business Metrics
vCenter Operations is understanding the health of my enterprise by analyzing millions of measurements. Should I do anything?
vCenter Operations Understands Context:
• Act based on urgency of emerging problems
• Act based on real-time performance dashboards
• Act based on long term correlations and trends

Data Agnostic Approach to Data Collection
Accepts any time series data (examples)
• Server OS
• Server App layer (e.g., IIS, Oracle, WebSphere, etc.)
• Network
• Storage
• User Experience
• Transactional
• Business Data
• Change Events
Minimal Required Fields (4)
• Object Name, Metric Name, Value, Timestamp
Data Extraction - *not* an analytic question
• No rules/templates to Write and Maintain
• No thresholds or KPIs to figure out

Learn Normal Behavior and Identify Abnormalities
Chart legend:
• GRAY BAR: Upper and Lower band of Dynamic Threshold – “Normal”
• BLUE LINE: Metric’s Current Value
• RED BAR: Breached Dynamic Threshold – “Abnormal”
Key points:
• Doesn’t assume IT data has a normal bell-shaped distribution
• Sophisticated Analytics – 9 different algorithms working together
• Learns your dynamic ranges of “Normal” without templates
• Learns patterns of behavior and identifies abnormalities

Understanding Progressive Change
[Slide graphic labels: Standard Build, Actual Build, New Build, 80,000 CIs]
Type: Planned, Controlled
• Updates and fixes
• Infrastructure changes
• Component patches
Type: Unplanned, Uncontrolled
• User Changes
• Unapproved Admin Change
• Exploits
• Shadow IT
• Origin: End Users, Developers, Suppliers

Use Cases

The Role of Operations Management
• Ensure and Restore Service Levels (Reactive): Problem, Slow performance, Rollback change, Config issue
• Optimize for Efficiency and Cost (Proactive): Maintenance, Utilization / forecast, Orchestrate changes, Reclaim capacity

Business benefits delivered by 3rd generation monitoring
vCenter Operations Management Suite: Comprehensive Visibility, Intelligent Automation, Proactive Management
• Higher QoS
• Fewer Incidents
• Tool Consolidation
• Compliance
• Faster MTTR
• Improved Collaboration
• Resource Utilization
• …
Customer quotes (TUI Infotec, Maximus, Kaiser Permanente):
• “Troubleshooting time reduced by 50%”
• “Notified the storage team before they were even aware of an issue.”
• “We’ll be able to reduce our monitoring tools from over 300 to about 30.”

Customer Success: IT Operations
Solve performance issues before end-users are affected and reduce total alerts
Before and After:
• Learn Normal: 400 critical alerts/hour before; 20 alerts/MONTH after
• Smart Alerting: End-user complaints alerted IT to the problem before; 3 hours advance warning w/root cause after
• Root Cause of slowdown: End-users impacted (avg. 2 hours/outage) before; NO end-user impact after
• Staffing: 12 Level-2 engineers on bridge call to address problem before; 1 Level-2 Engineer and 1 DBA to address problems after

Bringing it all together

Focused Solutions
• Performance and Capacity analytics with root cause analysis
• Configuration, Change, Compliance Management with Patching
• Application Dependency Mapping

Change Events Correlated with Health and Performance

Deeper performance and capacity management for the Cloud
Overview (Service Owner)
• Gain performance and capacity management across the Enterprise
• Cover every silo of the environment
• Break down the silos in the org.
• Reduce overall MTTR/MTTI
• Keep an eye on your cloud service providers
• Reclaim precious compute resources
• Gain unprecedented visibility into how your infrastructure behaves

Performance and Capacity for VDI
Overview
• End-to-end monitoring of infrastructure
• Included PCoIP performance monitoring
• Desktop, Pool and User Contexts
• Self-Learning performance analytics
• Automated alerts
• Remediation guidance
Benefits
• Get to root cause quickly; Reduce MTTI
• Respond proactively before support calls
• Remediate quickly and accurately
• Improve resource utilization by identifying over-provisioned hardware and tracking down bottlenecks

Thank you for your time!
Additional reading material:
• Quantifying Information Data Loss through Data Aggregation: http://www.vmware.com/files/pdf/vcenter/VMware-vCenter-Operations-Quantifying-Information-Loss-DataAggregation-WP-EN.pdf
• How Normal is Your Data: http://www.vmware.com/files/pdf/vcenter/VMware-vCenter-Operations-How-Normal-Is-Your-Data-WP-EN.pdf
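
Appendix sketch: learning "normal" from four-field time series data
The "Data Agnostic Approach" and "Learn Normal Behavior" slides can be illustrated with a small example. This is only a sketch under stated assumptions, not the product's patented analytics: the deck mentions nine algorithms working together and describes none of them, so a plain per-hour percentile band is swapped in as a stand-in. The names Sample, learn_bands and is_abnormal, the metric name cpu_pct, the 5th/95th percentile choice and the 80% hard threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import quantiles


@dataclass(frozen=True)
class Sample:
    """One measurement with the four minimal fields named on the slide."""
    object_name: str
    metric_name: str
    value: float
    timestamp: datetime


def learn_bands(history, low_pct=5, high_pct=95):
    """Learn a (lower, upper) band per hour of day from empirical percentiles.

    Percentiles are used because they make no bell-curve assumption.
    This is an illustrative stand-in, not the vendor's algorithm.
    """
    by_hour = {}
    for s in history:
        by_hour.setdefault(s.timestamp.hour, []).append(s.value)
    bands = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # too little data to learn a band for this hour
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        bands[hour] = (cuts[low_pct - 1], cuts[high_pct - 1])
    return bands


def is_abnormal(sample, bands):
    """True when the sample falls outside the learned band for its hour."""
    band = bands.get(sample.timestamp.hour)
    if band is None:
        return False  # no learned "normal" yet, so stay quiet
    lower, upper = band
    return sample.value < lower or sample.value > upper


if __name__ == "__main__":
    # Hypothetical history: a busy nightly job at 03:00, a quiet host at 14:00.
    start = datetime(2008, 3, 4)
    history = [
        Sample("Host 1", "cpu_pct", base + day % 5,
               start + timedelta(days=day, hours=hour))
        for day in range(14)
        for hour, base in ((3, 85.0), (14, 20.0))
    ]
    bands = learn_bands(history)
    hard_threshold = 80.0  # hypothetical 1st-generation fixed limit
    for hour in (3, 14):
        s = Sample("Host 1", "cpu_pct", 87.0, datetime(2008, 3, 18, hour))
        print(hour, "hard:", s.value > hard_threshold,
              "dynamic:", is_abnormal(s, bands))
```

With this toy history, a fixed limit treats 87% CPU the same at any hour, while the learned band only flags it at an hour where such a value has not been seen before. The band also catches values that are unusually low, which a single upper limit cannot do.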