The State-Space Approach to Self-Management of Enterprise Systems

Vibhore Kumar, Karsten Schwan
Subu Iyer*, Yuan Chen*, Akhil Sahai*
Georgia Institute of Technology
Hewlett-Packard Labs*
Outline
• Motivation: Enterprise Complexity
• Issues
• Solution Overview
• Policy-Driven Self-Management
• Dynamic SLA Decomposition
• Results
• Future Work
Enterprise Complexity: Some Facts
• From a survey conducted by Forrester Research
• Enterprises now devote 80% of their overall IT budget to maintenance and ongoing operations
• More than half of the 347 participating companies used at least 3 database vendors
• A major banking-industry client had 18 different travel and expense systems in the organization
• The very existence of a title such as “VP of IT Governance” says a lot about the state of enterprise IT infrastructure
The Complexity Wall
“If we don’t get a handle on complexity, it will stop the
expansion”
- Paul Horn, Senior Vice President, IBM Research
“Our enterprise customers are working with enormous
complexity”
- Dick Lampman, Former Director, HP Labs
The Complexity Wall @ Worldspan
• Worldspan, one of our industry collaborators, provides services to the travel industry
• One of their airline ticket pricing/availability services is hosted on a farm of 1400 servers
• In 2006 alone, they processed around 9.6 billion messages
• Highly varying request rates and request type mix
• Several behaviors of their system are not well understood
  • Effects of Ticket Geography
  • Effects of Cache Refresh Time
  • Effects of Time of Day …
To Handle The Complexity…
• One must enable self-management of complex enterprise infrastructures driven by high-level goals
Enterprise Self-Management: The Hurdles
• Enterprise systems are too big
  • The problem of Scale
• It is tough to relate high-level goals to low-level actions
  • The problem of Complex System Modeling
• The operating environment is very dynamic
  • The problem of Dynamism
• Administrators find it hard to trust black-box solutions
  • The problem of Trust & Tractability
Solution Overview: System State-Space
[Figure: an enterprise system with monitored system variables and monitored component variables]
System State Space V = (v1, v2, v3, …, vn)
• Variables of Interest Vø ⊂ V, e.g. Response-Time, QoI
• Controllable Variables Vα ⊂ V, e.g. Allocated-Servers, Memory
• The aim is to establish a relation between Vø and Vα under current operating conditions
Simple Automated Operation
• SLO: “Response Time < 10msec”
• Event: SLO Violation
• Condition: Bandwidth=90Mbps, Request Rate=30
• Action: set Allocated Servers to 3
• The function maps Vα → Vø given V – (Vα ∪ Vø)
[Figure: table of observed system states with columns Allocated Servers (Vα), Bandwidth, Request Rate, and Response Time (Vø)]
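The event-condition-action rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `Policy` class, its field names, and the dictionary keys are hypothetical, and the starting value of 1 allocated server is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, float]


@dataclass
class Policy:
    """Event-condition-action rule; all names here are illustrative."""
    event: Callable[[State], bool]      # e.g. an SLO violation
    condition: Callable[[State], bool]  # operating conditions the rule covers
    action: Dict[str, float]            # values assigned to controllable variables

    def apply(self, state: State) -> State:
        """Return a new state with the action applied if event and condition hold."""
        if self.event(state) and self.condition(state):
            return {**state, **self.action}
        return dict(state)


# SLO "Response Time < 10msec" is violated when response time is >= 10.
slo_policy = Policy(
    event=lambda s: s["response_time_ms"] >= 10,
    condition=lambda s: s["bandwidth_mbps"] == 90 and s["request_rate"] == 30,
    action={"allocated_servers": 3},
)

# Assumed starting point: 1 allocated server, response time of 12.
current = {"allocated_servers": 1, "bandwidth_mbps": 90,
           "request_rate": 30, "response_time_ms": 12}
updated = slo_policy.apply(current)
```

Applying the policy only sets the controllable variable. The response time in `updated` is unchanged until the system is observed again.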
Solution Overview: The Function
• Learn from observed system states
• But there are problems
  • Different behavior in different sub-spaces
  • Large state space, |V| ≈ 10² to 10³
[Figure: observed system states (v1, v2, …, vn) as input to machine learning, with sub-spaces labeled CPU Bottleneck and Network Bottleneck]
Solution Overview: The Function
• We decided to model the system using multiple µ-models: {µ-model 1, µ-model 2, …, µ-model n}
• We intelligently partition the set of observed system states (v1, v2, …, vn)
  • partitions exhibit homogeneous behavior
  • partitions have a reduced number of relevant variables
[Figure: partitioned system states, each partition with a reduced number of relevant variables in a µ-model]
• Partitioning & µ-Modeling solve two problems!
  • The problem of Scale
  • The problem of Complex System Modeling
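One way to picture the partitioning step is the sketch below. The slides do not say which partitioning technique is used, so plain k-means stands in for it here, and "relevant variables" are approximated as those that vary within a partition. Both choices, the variable names, and the sample data are assumptions made for illustration.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def kmeans_labels(points: Sequence[Sequence[float]],
                  initial_centroids: Sequence[Sequence[float]],
                  iterations: int = 10) -> List[int]:
    """Lloyd's k-means; returns the partition index assigned to each point."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(pt, centroids[k])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def relevant_variables(names: Sequence[str],
                       points: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       min_variance: float = 1e-6) -> Dict[int, List[str]]:
    """Per partition, keep only the variables that actually vary (crude proxy)."""
    result: Dict[int, List[str]] = {}
    for k in sorted(set(labels)):
        members = [pt for pt, lab in zip(points, labels) if lab == k]
        result[k] = [name for name, col in zip(names, zip(*members))
                     if pvariance(col) > min_variance]
    return result


# Invented observations: three CPU-bound states, then three network-bound states.
names = ["cpu_util", "net_util"]
states = [(0.90, 0.20), (0.95, 0.20), (0.85, 0.20),
          (0.30, 0.90), (0.30, 0.95), (0.30, 0.85)]
labels = kmeans_labels(states, initial_centroids=[states[0], states[3]])
relevant = relevant_variables(names, states, labels)
```

Each partition ends up with a single relevant variable in this toy data, which is the effect the slide describes: a µ-model built per partition has fewer variables to consider than a model of the whole state space.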
Solution Overview: µ-Models
• We use a Tree Augmented Naïve Bayes (TAN) classifier to build µ-models
• The model returns the following probability: γ = Pr(Vα | Vdesired)
• Find the assignment of values to variables in Vα that maximizes the probability of moving the system to the desired state
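The maximization step can be illustrated with a much simpler estimator than a TAN classifier. The sketch below estimates Pr(Vα | Vdesired) by counting observed states and returns the most probable assignment. The counting estimator, the function name, and the observation data are all assumptions for illustration and are not the model used in the talk.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]


def best_assignment(observations: List[State],
                    desired: Callable[[State], bool],
                    variable: str) -> Tuple[Optional[float], float]:
    """Return (value, gamma): the most frequent value of `variable` among
    observations in the desired state, and its empirical probability."""
    matching = [obs[variable] for obs in observations if desired(obs)]
    if not matching:
        return None, 0.0
    value, count = Counter(matching).most_common(1)[0]
    return value, count / len(matching)


# Invented observations for one operating condition (bandwidth 90, rate 30).
observations = [
    {"allocated_servers": 1, "bandwidth": 90, "request_rate": 30, "response_time": 12},
    {"allocated_servers": 2, "bandwidth": 90, "request_rate": 30, "response_time": 9},
    {"allocated_servers": 3, "bandwidth": 90, "request_rate": 30, "response_time": 9},
    {"allocated_servers": 3, "bandwidth": 90, "request_rate": 30, "response_time": 8},
    {"allocated_servers": 3, "bandwidth": 90, "request_rate": 30, "response_time": 9},
]


def desired(state: State) -> bool:
    """Desired state: same operating condition with the SLO met."""
    return (state["bandwidth"] == 90 and state["request_rate"] == 30
            and state["response_time"] < 10)


value, gamma = best_assignment(observations, desired, "allocated_servers")
```

With this data, four observations meet the SLO and three of them used 3 servers, so the sketch proposes 3 servers with γ = 0.75.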
Solution Approach: Dynamism
• As the system keeps running, more system states are generated, which can be incorporated into the µ-models
• µ-models are easier to update than monolithic system models
• A µ-model update can result in
  • Policy Invalidation
  • Policy Adaptation
  • New Policies
• This addresses the problem of Dynamism
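A rough sketch of how an incrementally updated µ-model could invalidate and adapt a policy follows. The `MicroModel` class, its success-rate bookkeeping, and the 0.7 validity threshold are hypothetical stand-ins and do not reflect the update procedure used in the actual system.

```python
from collections import defaultdict
from typing import Dict, Optional


class MicroModel:
    """Toy µ-model: tracks how often each server allocation met the SLO."""

    def __init__(self) -> None:
        self.met: Dict[int, int] = defaultdict(int)
        self.total: Dict[int, int] = defaultdict(int)

    def update(self, servers: int, slo_met: bool) -> None:
        """Incorporate one newly observed system state."""
        self.total[servers] += 1
        if slo_met:
            self.met[servers] += 1

    def success_rate(self, servers: int) -> float:
        seen = self.total.get(servers, 0)
        return self.met.get(servers, 0) / seen if seen else 0.0

    def best_allocation(self) -> Optional[int]:
        """Allocation with the highest observed success rate, if any."""
        if not self.total:
            return None
        return max(self.total, key=self.success_rate)


def policy_is_valid(model: MicroModel, servers: int, threshold: float = 0.7) -> bool:
    """A policy stays valid while its action still succeeds often enough."""
    return model.success_rate(servers) >= threshold


model = MicroModel()
for _ in range(4):                 # early observations: 3 servers suffice
    model.update(3, slo_met=True)
valid_before = policy_is_valid(model, 3)

for _ in range(4):                 # load changes: 3 servers no longer suffice
    model.update(3, slo_met=False)
for _ in range(3):                 # 4 servers now meet the SLO
    model.update(4, slo_met=True)
valid_after = policy_is_valid(model, 3)
adapted = model.best_allocation()
```

The old policy ("set Allocated Servers to 3") is invalidated once its success rate drops to 0.5, and the adapted policy would use 4 servers.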
Solution Approach: Tractability & Trust
• Each self-management action that assigns values to variables in Vα is associated with a probability γ = Pr(Vα | V – Vø)
• An action is taken only when γ > γthreshold
• The threshold can be used to fine-tune self-management
• TANs can be easily understood by administrators
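The threshold rule is simple enough to show directly. In this minimal sketch the function name and the default threshold of 0.7 are assumptions. Only the strict comparison γ > γthreshold comes from the slide.

```python
from typing import Dict, Optional


def gate_action(action: Dict[str, float],
                gamma: float,
                gamma_threshold: float = 0.7) -> Optional[Dict[str, float]]:
    """Return the action only if its probability strictly exceeds the threshold."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be a probability in [0, 1]")
    return action if gamma > gamma_threshold else None


proposed = {"allocated_servers": 3}
taken = gate_action(proposed, gamma=0.75)            # above threshold: act
skipped = gate_action(proposed, gamma=0.75,
                      gamma_threshold=0.8)           # stricter threshold: do nothing
```

Raising the threshold makes self-management more conservative, which is one way an administrator could tune how much to trust automated actions.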
Policy-Driven Self-Management
• SLO: “Response Time < 10msec”
• Event: SLO Violation
• Condition: Bandwidth=90Mbps, Request Rate=30
• Action: set Allocated Servers to 3
• Current State: (90,30,12)
• Goal State: (90,30,9)
• Given the goal state (90,30,9), find the µ-model to use
• Evaluate c : Pr(c | 90,30,9) = max(Pr(ci | 90,30,9)), ci ∈ V
[Figure: table of system states with columns Allocated Servers, Bandwidth, Request Rate, and Response Time]
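The slide does not say how the µ-model for a goal state is found. One plausible reading is a nearest-partition lookup, sketched below. The centroid values, the partition names, and the use of unnormalized Euclidean distance are all assumptions.

```python
import math
from typing import Dict, Tuple

# State vectors are (bandwidth, request rate, response time), as on the slide.
StateVec = Tuple[float, float, float]

# Hypothetical partition centroids, one per µ-model.
centroids: Dict[str, StateVec] = {
    "cpu_bottleneck": (90.0, 30.0, 10.0),
    "network_bottleneck": (20.0, 30.0, 40.0),
}


def select_micro_model(goal: StateVec, centroids: Dict[str, StateVec]) -> str:
    """Pick the µ-model whose partition centroid is closest to the goal state."""
    return min(centroids, key=lambda name: math.dist(goal, centroids[name]))


chosen = select_micro_model((90.0, 30.0, 9.0), centroids)
```

In practice the variables would need to be normalized before computing distances, since bandwidth and response time are on different scales.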
Dynamic SLA Decomposition
• Problem: To determine sub-SLAs for components that lead to SLA conformance
• Sub-SLAs can be thought of as per-component ranges of values for controllable variables
[Figure: a System-Level SLA decomposed into component sub-SLAs SLA1 through SLA5]
• If each component adheres to its sub-SLA, then the SLA is not violated:
  conformance(SLA1, SLA2, …, SLAn) ⇒ conformance(System SLA)
• Our techniques can handle SLA decomposition
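Treating sub-SLAs as per-component ranges of controllable variables suggests a simple conformance check, sketched below. The component names, variable names, and ranges are invented for illustration. How the ranges are derived from the system-level SLA is the decomposition problem itself and is not shown.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # inclusive (low, high) bounds

# Hypothetical sub-SLAs: allowed ranges for each component's controllable variables.
sub_slas: Dict[str, Dict[str, Range]] = {
    "web_tier": {"allocated_servers": (2, 4)},
    "db_tier": {"memory_gb": (8, 16)},
}


def component_conforms(settings: Dict[str, float], sub_sla: Dict[str, Range]) -> bool:
    """True if every constrained variable is present and within its range."""
    return all(var in settings and low <= settings[var] <= high
               for var, (low, high) in sub_sla.items())


def all_conform(system: Dict[str, Dict[str, float]],
                sub_slas: Dict[str, Dict[str, Range]]) -> bool:
    """conformance(SLA1, ..., SLAn): every component meets its sub-SLA."""
    return all(component_conforms(system.get(name, {}), sla)
               for name, sla in sub_slas.items())


ok = all_conform({"web_tier": {"allocated_servers": 3},
                  "db_tier": {"memory_gb": 8}}, sub_slas)
bad = all_conform({"web_tier": {"allocated_servers": 1},
                   "db_tier": {"memory_gb": 8}}, sub_slas)
```

If the decomposition is sound, `all_conform` returning `True` implies the system-level SLA holds. The converse need not be true.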
Experimental Results: SOA Simulator
[Figure: two panels, “Without Self-Management” and “With Self-Management”]
Experimental Results: RUBiS over VMs
[Figure: two panels, “Without Self-Management” and “With Self-Management”, annotated with “Database Perturbation” and “Partition Change”]
Conclusions & Future Work
• Our techniques are applicable to a variety of enterprise systems
• In our experiments, the techniques proved to be scalable and accurate
• Monitoring overheads can be reduced by taking inputs about relevant variables from the state-space partitions
• Future work: design and implement techniques that can proactively avoid SLA violations
Thank You!
References
[1] V. Kumar, K. Schwan, S. Iyer, Y. Chen, A. Sahai. The state-space approach to SLA-based management. In submission to NOMS 2008.
[2] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. iManage: Policy-Driven Self-Management for Enterprise-Scale Systems. Middleware 2007.
[3] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. Enabling Policy-Driven Self-Management for Enterprise Systems. PBAC 2007, in conjunction with ICAC 2007.
[4] V. Kumar, et al. Implementing Diverse Messaging Models with Self-Managing Properties using IFLOW. ICAC 2006.