VMware: Performance and Capacity with Analytics
Dan Kimball – Cloud Infrastructure Architect - VMware
© 2009 VMware Inc. All rights reserved
Agenda
• Introduction
• What is Analytics?
• Real-world examples
• 3rd generation monitoring with analytics
• Success stories
• Bringing it all together
• Closing remarks and Q&A
What is Analytics?
Analytics is the application of computer technology, operational research, and
statistics to solve problems in business and industry.
A simple definition of analytics is "the science of analysis". A practical definition,
however, would be that analytics is the process of developing optimal or realistic
decision recommendations based on insights derived through the application of
statistical models and analysis against existing and/or simulated future data.
Source: Wikipedia - http://en.wikipedia.org/wiki/Analytics
Real-world examples of Analytics
• Clinical decision support systems: Experts use predictive analysis in health care primarily to determine which patients are at risk of developing certain conditions, like diabetes, asthma, heart disease, and other lifetime illnesses.
• Customer retention: With the number of competing services available, businesses need to focus efforts on maintaining continuous consumer satisfaction, rewarding consumer loyalty and minimizing customer attrition.
• Fraud detection: Fraud is a big problem for many businesses and can be of various types: inaccurate credit applications, fraudulent transactions (both offline and online), identity thefts and false insurance claims.
• Risk management: When employing risk management techniques, the goal is always to predict and benefit from a future scenario. The capital asset pricing model (CAPM) "predicts" the best portfolio to maximize return.
• Underwriting: Many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk. For example, auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver.
“1st Generation” Tools, Up/down… Floods of alerts
1st Generation - Event-Centric, Hard-Threshold Based
[Figure: four DATA FEEDS]
3/4/08 16:45 | Host 1 | processingTimeServ | The Processing Time Service Level on process… | n/a | n/a | n/a
3/4/08 16:45 | Host 1 | Processor_Table 0 | Processor 0 is at 87.0%. A CPU Bottleneck is….. | n/a | 0 | Windows_System
3/4/08 16:44 | Host 2 | System_Table | The number of hardware interrupts per second… | n/a | 0 | Windows_System
3/4/08 16:30 | Host 2 | Processor_Table 1 | Processor 1 is at 84.0%. A CPU Bottleneck is …. | n/a | 0 | Windows_System
3/4/08 16:25 | n/a | responseTimeServ… | The Response Time Service Level on Toadwor.. | n/a | n/a | n/a
3/4/08 16:20 | n/a | processingTimeServ.. | The Processing Time Service Level on Prospec.. | n/a | n/a | n/a
3/4/08 16:08 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD A CPU Hog has been detected | n/a | OraSF | Oracle
3/4/08 16:08 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD SQL with high I/O has been de.. | n/a | OraSF | Oracle
3/4/08 14:40 | n/a | responseTimeServ… | The Response Time Service Level on Siebel Sa.. | n/a | n/a | n/a
3/4/08 14:20 | n/a | processingTimeServ.. | The Processing Time Service Level on Siebel S. | n/a | n/a | n/a
3/4/08 14:39 | Host 3 | Top_CPU_Table | Process ‘siebsh.exe(svc-siebel, 6780)’: is cons.. | n/a | 0 | Windows_System
3/4/08 14:39 | Host 3 | Top_CPU_Table | Process ‘siebsh.exe(svc-siebel, 7940)’: is cons.. | n/a | 0 | Windows_System
3/4/08 14:15 | n/a | responseTimeServ… | The Response Time Service Level on Toadwor.. | n/a | n/a | n/a
3/4/08 14:15 | n/a | processingTimeServ.. | The Processing Time Service Level on Prospec.. | n/a | n/a | n/a
3/4/08 13:55 | Host 1 | Ora_Sql_Hogs_Alert | Oracle: SFPRD A CPU Hog has been detected | n/a | OraSF | Oracle
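The alert listing above is what a hard-threshold check produces: every sample over a fixed limit becomes an alert, regardless of whether that level is normal for the host. A minimal sketch of that behavior follows. The 80% limit and the sample values for Host 3 are assumptions for illustration; the slide only shows alerts firing at 84.0% and 87.0% CPU and does not state the configured threshold.

```python
# Hypothetical fixed limit; the source does not state the real threshold.
CPU_LIMIT_PCT = 80.0


def hard_threshold_alerts(samples, limit=CPU_LIMIT_PCT):
    """Return one alert string per (host, cpu_pct) sample above the fixed limit."""
    return [
        f"{host}: CPU at {pct:.1f}%"
        for host, pct in samples
        if pct > limit
    ]


# Two readings taken from the slide plus one invented quiet host.
samples = [("Host 1", 87.0), ("Host 2", 84.0), ("Host 3", 42.0)]
alerts = hard_threshold_alerts(samples)
for alert in alerts:
    print(alert)
```

Because the limit never adapts, a host that routinely runs at 85% during a nightly batch job would raise the same alert every night, which is one way such tools end up flooding operators.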
“2nd Generation” Tools: don’t handle change, leading to false positives
2nd Generation - Rudimentary Baselining, Rules/Templates, Charting
[Table: the same 3/4/08 alert listing (13:55 to 16:45) shown for the 1st-generation tools]
3rd generation monitoring with analytics – It’s here!
Real-Time Performance Management
3rd Generation – Holistic, Real Time Analytics
• Flexible INTEGRATION to many data sources
• Enterprise SCALABILITY
• Patented performance ANALYTICS
• Powerful information DASHBOARDS
“I can put all my monitoring tools to good use and get better performance analytics.”
Smart Alert™ - Using Analytics to understand abnormalities across the application
Layers and data sources shown:
• User Experience (e.g., HP RUM, etc.)
• Business Application
• App Data (e.g., Hyperic, SCOM)
• vCenter (Private/Public Cloud)
• Storage (EMC, NetApp, IBM)
• Network Data (e.g., Ionix IPAM/PM, etc.)
SMART ALERT: Smart Alert Generation (“When”)
[Background graphic: Dirichlet-distribution formulas, including the joint density of P_{1,1}, P_{1,2}, ..., P_{m,m}, the constraints \sum p_{i,j} = 1 and 0 \le p_{i,j} \le 1, the gamma function \Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt, and the marginal distribution of the i-th row of J for i = 1, ..., m-1 and for i = m.]
Future State – Evolution of Learning and Predictive Analysis
The body: Muscular, Respiration, Skeletal, Heart Rate, Cardio Vascular, Nervous, Temperature
My brain is understanding the health of my body. Should I do anything?
Your Brain Understands Context:
• If my heart rate and temperature are increasing I should go to the hospital
• If I’m tired, rest more
• If I tire easily, start exercising!
The enterprise:
• Monitoring Server O/S Metrics – CPU, RAM, Disk, I/O, etc.
• Monitoring App Layer Metric – JVM, DB Connections, etc.
• Monitoring UserEx Metrics
• Monitoring Business Metrics
vCenter Operations is understanding the health of my enterprise by analyzing millions of measurements. Should I do anything?
vCenter Operations Understands Context:
• Act based on urgency of emerging problems
• Act based on real-time performance dashboards
• Act based on long term correlations and trends
Data Agnostic Approach to Data Collection
• Accepts any time series data (examples)
  • Server OS
  • Server App layer (e.g., IIS, Oracle, WebSphere, etc.)
  • Network
  • Storage
  • User Experience
  • Transactional
  • Business Data
  • Change Events
• Minimal Required Fields (4)
  • Object Name, Metric Name, Value, Timestamp
• Data Extraction - *not* an analytic question
  • No rules/templates to write and maintain
  • No thresholds or KPIs to figure out
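The four required fields can be modeled as a single record type, which is what makes the approach data agnostic: an OS counter and a business figure have the same shape. The sketch below is illustrative only. The class name `DataPoint`, the helper `parse_point`, and the comma-separated input layout are assumptions, not the product's actual ingestion format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataPoint:
    """One time-series sample: the four minimal fields named on the slide."""
    object_name: str
    metric_name: str
    value: float
    timestamp: datetime


def parse_point(line: str) -> DataPoint:
    """Parse 'object, metric, value, ISO-8601 timestamp' (a hypothetical layout)."""
    object_name, metric_name, value, ts = (field.strip() for field in line.split(","))
    return DataPoint(object_name, metric_name, float(value), datetime.fromisoformat(ts))


# An OS metric and a business metric share the same record type.
cpu = parse_point("Host 1, cpu_pct, 87.0, 2008-03-04T16:45:00")
orders = parse_point("Web Store, orders_per_min, 312, 2008-03-04T16:45:00")
print(cpu)
print(orders)
```

Nothing in the record says which values are good or bad; that judgment is left to the analytics rather than to hand-written rules.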
Learn Normal Behavior and Identify Abnormalities
• GRAY BAR: Upper and Lower band of Dynamic Threshold (“Normal”)
• BLUE LINE: Metric’s Current Value
• RED BAR: Breached Dynamic Threshold (“Abnormal”)
• Doesn’t assume IT data has a normal bell-shaped distribution
• Sophisticated Analytics – 9 different algorithms working together
• Learns your dynamic ranges of “Normal” without templates
• Learns patterns of behavior and identifies abnormalities
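One simple way to get a dynamic band without assuming a bell-shaped distribution is to take empirical percentiles of past values, learned separately for each hour of the day. The sketch below illustrates only that idea. It is not one of the nine algorithms the slide mentions (none of which are described in the source), and the 5th/95th percentile cut-offs and sample data are assumptions.

```python
from collections import defaultdict
from statistics import quantiles


def dynamic_band(history, lower_pct=5, upper_pct=95):
    """Return (low, high) from empirical percentiles; no distribution is assumed."""
    cuts = quantiles(history, n=100, method="inclusive")  # 99 percentile cut points
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def learn_bands_by_hour(samples, **kwargs):
    """Learn one band per hour of day from (hour, value) samples."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: dynamic_band(values, **kwargs) for hour, values in by_hour.items()}


def is_abnormal(value, band):
    """True when the value falls outside the learned band (the 'red bar' case)."""
    low, high = band
    return value < low or value > high


# Invented history: busy at 09:00 (60-80% CPU), quiet at 03:00 (5-15% CPU).
history = [(9, v) for v in range(60, 81)] + [(3, v) for v in range(5, 16)]
bands = learn_bands_by_hour(history)
print(is_abnormal(70, bands[9]))  # 70% at 09:00
print(is_abnormal(70, bands[3]))  # 70% at 03:00
```

The same 70% reading is inside the band at 09:00 and outside it at 03:00, which is the behavior a single hard threshold cannot express.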
Understanding Progressive Change
[Diagram labels: Standard Build, Actual Build, New Build; 80,000 CIs]
Type: Planned, Controlled
• Updates and fixes
• Infrastructure changes
• Component patches
Type: Unplanned, Uncontrolled
• User Changes
• Unapproved Admin Change
• Exploits
• Shadow IT
• Origin: End Users, Developers, Suppliers
Use Cases
The Role of Operations Management
Reactive: Ensure and Restore Service Levels
• Slow performance
• Problem
• Rollback change
• Config issue
Proactive: Optimize for Efficiency and Cost
• Utilization / forecast
• Maintenance
• Orchestrate changes
• Reclaim capacity
Business benefits delivered by 3rd generation monitoring
vCenter Operations Management Suite
• Comprehensive Visibility
• Intelligent Automation
• Proactive Management
Benefits:
• Higher QoS
• Fewer Incidents
• Tool Consolidation
• Compliance
• Faster MTTR
• Improved Collaboration
• Resource Utilization
• …
Customer quotes (TUI Infotec, Maximus, Kaiser Permanente):
• “Troubleshooting time reduced by 50%”
• “Notified the storage team before they were even aware of an issue.”
• “We’ll be able to reduce our monitoring tools from over 300 to about 30.”
Customer Success: IT Operations
Solve performance issues before end-users are affected and reduce total alerts
Capabilities: Learn Normal, Smart Alerting, Root Cause
Before
• 400 critical alerts/hour
• End-user complaints alerted IT to the problem
• End-users impacted (avg. 2 hours/outage)
• 12 Level-2 engineers on bridge call to address problem
After
• 20 alerts/MONTH
• 3 hours advance warning of slowdown w/root cause
• NO end-user impact
• 1 Level-2 Engineer and 1 DBA to address problems
Bringing it all together
Focused Solutions
• Performance and Capacity analytics with root cause analysis
• Configuration, Change, Compliance Management with Patching
• Application Dependency Mapping
Change Events Correlated with Health and Performance
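Correlating change events with health can be pictured as a time-window lookup: when a metric turns abnormal, list the changes recorded shortly before it. The sketch below is a simplified illustration, not the product's correlation method. The function name `changes_near`, the one-hour window, and the sample events are all assumptions.

```python
from datetime import datetime, timedelta


def changes_near(anomaly_time, change_events, window=timedelta(hours=1)):
    """Return (time, description) change events within `window` before the anomaly."""
    return [
        (when, what)
        for when, what in change_events
        if timedelta(0) <= anomaly_time - when <= window
    ]


# Invented change log for illustration.
change_events = [
    (datetime(2008, 3, 4, 9, 0), "config edit"),
    (datetime(2008, 3, 4, 15, 50), "patch applied"),
    (datetime(2008, 3, 4, 17, 0), "service restart"),
]
anomaly_time = datetime(2008, 3, 4, 16, 8)
suspects = changes_near(anomaly_time, change_events)
print(suspects)
```

Only the change made 18 minutes before the anomaly is returned; the earlier edit falls outside the window and the later restart happened after the anomaly began.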
Deeper performance and capacity management for the Cloud
Overview
Service Owner
• Gain performance and capacity management across the Enterprise
  • Cover every silo of the environment
  • Break down the silos in the org.
  • Reduce overall MTTR/MTTI
  • Keep an eye on your cloud service providers
  • Reclaim precious compute resources
  • Gain unprecedented visibility into how your infrastructure behaves
Performance and Capacity for VDI
Overview
• End-to-end monitoring of infrastructure
• Included PCoIP performance monitoring
• Desktop, Pool and User Contexts
• Self-Learning performance analytics
• Automated alerts
• Remediation guidance
Benefits
• Get to root cause quickly; Reduce MTTI
• Respond proactively before support calls
• Remediate quickly and accurately
• Improve resource utilization by identifying over-provisioned hardware and tracking down bottlenecks
Thank you for your time!
Additional reading material:
Quantifying Information Data Loss through Data Aggregation
http://www.vmware.com/files/pdf/vcenter/VMware-vCenter-Operations-Quantifying-Information-Loss-DataAggregation-WP-EN.pdf
How Normal is Your Data:
http://www.vmware.com/files/pdf/vcenter/VMware-vCenter-Operations-How-Normal-Is-Your-Data-WP-EN.pdf