The State-Space Approach to Self-Management of Enterprise Systems
Vibhore Kumar, Karsten Schwan, Subu Iyer*, Yuan Chen*, Akhil Sahai*
Georgia Institute of Technology, *Hewlett-Packard Labs

Outline
• Motivation: Enterprise Complexity Issues
• Solution Overview
• Policy-Driven Self-Management
• Dynamic SLA Decomposition
• Results
• Future Work

Enterprise Complexity: Some Facts
From a survey conducted by Forrester Research:
• Enterprises now devote 80% of their overall IT budget to maintenance and ongoing operations
• More than half of the 347 participating companies used at least 3 database vendors
• A major banking-industry client had 18 different travel and expense systems in the organization
• “VP of IT Governance”: the title alone says a lot about the state of enterprise IT infrastructure

The Complexity Wall
• “If we don’t get a handle on complexity, it will stop the expansion” - Paul Horn, Senior Vice President, IBM Research
• “Our enterprise customers are working with enormous complexity” - Dick Lampman, Former Director, HP Labs

The Complexity Wall @ Worldspan
• Worldspan, one of our industry collaborators, provides services to the travel industry
• One of their airline ticket pricing/availability services is hosted on a farm of 1400 servers
• In 2006 alone, they processed around 9.6 billion messages
• Highly varying request rates and request type mix
• Several behaviors of their system are not well understood
    • Effects of Ticket Geography
    • Effects of Cache Refresh Time
    • Effects of Time of Day
    • …

To Handle The Complexity…
One must enable self-management of complex enterprise infrastructures driven by high-level goals

Enterprise Self-Management: The Hurdles
• Enterprise systems are too big
    • The problem of Scale
• It is tough to relate high-level goals to low-level actions
    • The problem of Complex System Modeling
• The operating environment is very dynamic
    • The problem of Dynamism
• Administrators find it hard to trust black-box solutions
    • The problem of Trust & Tractability

Solution Overview: System State-Space
[Figure: Enterprise System; Monitored System Variables; Monitored Component Variables; System State Space]
System State Space: V = (v1, v2, v3, …, vn)
• Variables of Interest: Vø ⊆ V, e.g. Response-Time, QoI
• Controllable Variables: Vα ⊆ V, e.g. Allocated-Servers, Memory
The aim is to establish a relation between Vø and Vα under current operating conditions

Simple Automated Operation
SLO: “Response Time < 10 msec”
Event: SLO Violation
Condition: Bandwidth = 90 Mbps, Request Rate = 30
Action: set Allocated Servers to 3
Mapping to learn: Vα → Vø given V – (Vα ∪ Vø)
[Figure: table of system states with columns Allocated Servers (Vα), Bandwidth, Request Rate, Response Time (Vø)]

Solution Overview: The Function
• Learn from observed system states
• But there are problems
    • Different behavior in different sub-spaces
    • Large state space, |V| ≈ 10^2 to 10^3
[Figure: observed system states over v1, v2, …, vn passed to machine learning; regions labeled CPU Bottleneck and Network Bottleneck]

Solution Overview: The Function
• We decided to model the system using multiple µ-models: {µ1, µ2, …, µn}
• We intelligently partition the set of observed system states
    • Partitions exhibit homogeneous behavior
    • Partitions have a reduced number of relevant variables
[Figure: observed system states over v1, v2, …, vn split into partitions; label: Reduced Number of Relevant Variables in a µ-model]
Partitioning & µ-Modeling solve two problems!
• The problem of Scale
• The problem of Complex System Modeling

Solution Overview: µ-Models
• We use a Tree Augmented Naïve Bayes (TAN) classifier to build µ-models
• The model returns the following probability: γ = Pr(Vα | Vdesired)
• Find the assignment of values to variables in Vα that maximizes the probability of moving the system to the desired state

Solution Approach: Dynamism
• As the system keeps running, more system states are generated, which can be incorporated into the µ-models
• µ-models are easier to update than monolithic system models
• As a result of a µ-model update:
    • Policy Invalidation
    • Policy Adaptation
    • New Policies can Result
This addresses the problem of Dynamism

Solution Approach: Tractability & Trust
• Each self-management action that assigns values to variables in Vα is associated with a probability γ = Pr(Vα | V – Vø)
• An action is taken only when γ > γthreshold
• This can be used to fine-tune self-management
• TANs can be easily understood by administrators

Outline
• Motivation: Enterprise Complexity Issues
• Solution Overview
• Policy-Driven Self-Management
• Dynamic SLA Decomposition
• Results
• Future Work

Policy-Driven Self-Management
SLO: “Response Time < 10 msec”
Event: SLO Violation
Condition: Bandwidth = 90 Mbps, Request Rate = 30
Action: set Allocated Servers to 3
Current State: (90, 30, 12)
Goal State: (90, 30, 9)
• Given the goal state (90, 30, 9), find the µ-model to use
• Evaluate c : Pr(c | 90, 30, 9) = max(Pr(ci | 90, 30, 9)), ci ∈ Vα
[Figure: table of system states with columns Allocated Servers, Bandwidth, Request Rate, Response Time]

Dynamic SLA Decomposition
• Problem: to determine sub-SLAs for components that lead to SLA conformance
• Sub-SLAs can be thought of as per-component ranges of values for controllable variables
• If each component adheres to its sub-SLA, then the system-level SLA is not violated:
  conformance(SLA1, SLA2, …, SLAn) ⇒ conformance(System SLA)
• Our techniques can handle SLA decomposition
[Figure: System-Level SLA decomposed into SLA1, SLA2, SLA3, SLA4, SLA5]

Experimental Results: SOA Simulator
[Figure: two panels, Without Self-Management and With Self-Management]
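
The action-selection step from the Policy-Driven Self-Management slide (evaluate Pr(ci | goal state) for each candidate ci, take the maximum, and act only when γ > γthreshold) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it replaces the TAN classifier with plain frequency counting over observed states, it skips the step of choosing a µ-model for the partition, and the data set and the names `OBSERVED`, `action_probabilities`, and `choose_action` are all hypothetical.

```python
from collections import Counter

# Hypothetical observed system states, each a tuple of
# (allocated_servers, bandwidth_mbps, request_rate, response_time_ms).
# allocated_servers is the controllable variable; the rest form the
# state that the action is conditioned on.
OBSERVED = [
    (1, 90, 30, 12),
    (1, 90, 30, 12),
    (2, 90, 30, 9),
    (3, 90, 30, 9),
    (3, 90, 30, 9),
    (3, 90, 30, 8),
    (3, 60, 30, 9),
]


def action_probabilities(observed, goal):
    """Estimate Pr(c | goal) for each allocated-servers value c.

    Assumption: simple counting over exact matches stands in for the
    TAN-based estimate described on the slides.
    """
    matches = [state[0] for state in observed if state[1:] == goal]
    counts = Counter(matches)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: n / total for c, n in counts.items()}


def choose_action(observed, goal, gamma_threshold):
    """Return (value, gamma) for the most probable action, or None.

    None is returned when the goal state was never observed or when
    the best gamma does not exceed gamma_threshold.
    """
    probs = action_probabilities(observed, goal)
    if not probs:
        return None
    best, gamma = max(probs.items(), key=lambda item: item[1])
    return (best, gamma) if gamma > gamma_threshold else None


if __name__ == "__main__":
    goal_state = (90, 30, 9)  # bandwidth, request rate, target response time
    print(choose_action(OBSERVED, goal_state, gamma_threshold=0.5))
```

Raising `gamma_threshold` makes the sketch more conservative: with the hypothetical data above, a threshold of 0.9 suppresses the action entirely, which mirrors the slide's point that the threshold can be used to fine-tune self-management.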

Experimental Results: RUBiS over VMs
[Figure: panels Without Self-Management and With Self-Management; annotations: Database Perturbation, Partition Change]

Conclusions & Future Work
• Our techniques are applicable to a variety of enterprise systems
• In our experiments the techniques have proven to be very scalable and accurate
• Monitoring overheads can be reduced by taking inputs about relevant variables from the state-space partitions
• Future work: design and implement techniques that can proactively avoid SLA violations

Thank You!

References
[1] V. Kumar, K. Schwan, S. Iyer, Y. Chen, A. Sahai. The State-Space Approach to SLA-Based Management. In submission to NOMS 2008.
[2] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. iManage: Policy-Driven Self-Management for Enterprise-Scale Systems. Middleware 2007.
[3] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. Enabling Policy-Driven Self-Management for Enterprise Systems. PBAC 2007, in conjunction with ICAC 2007.
[4] V. Kumar, et al. Implementing Diverse Messaging Models with Self-Managing Properties using IFLOW. ICAC 2006.