Self-* Systems
CSE 598B
Instructor: Bhuvan Urgaonkar
Fall 2005

Introduction
Bhuvan Urgaonkar
  – Assistant Professor, CSE
  – Ph.D., Univ. of Mass., Amherst
Research interests
  – Distributed systems, operating systems, computer networking, modeling of systems
Office: 338D, Email: bhuvan@cse.psu.edu
Office hours and class timings
  – Undecided as of now; we will figure this out at the end of the class
  – If in doubt: just walk in anytime!
Students’ turn to introduce themselves

Self-* systems
Self-*: a regular expression
  – But not quite
  – No self-destroying systems
Three themes
  – Self-tuning systems
  – Self-healing systems
  – Self-stabilizing systems
Course Web page:
  – http://www.cse.psu.edu/~bhuvan/teaching/fall05/self-star.html
To do: Set up a course mailing list

Self-tuning systems
Systems that can adapt their behavior to dynamically changing external influences on their own
[Diagram labels: Desired Trajectory; Friction, Turbulence; Guidance Model; Thrust Parameters; Rocket Thrusters; Actual Trajectory]

Internet applications
Proliferation of Internet applications
[Figure labels: auction site, online game, online retail store]
Growing significance in personal, business affairs
Focus: Internet server applications

Hosting platforms
Data Centers
  – Clusters of servers
  – Storage devices
  – High-speed interconnect
Hosting platforms:
  – Rent resources to third-party applications
  – Performance guarantees in return for revenue
Benefits:
  – Applications: don’t need to maintain their own infrastructure
      • Rent server resources, possibly on demand
  – Platform provider: generates revenue by renting resources

Goals of a hosting platform
Meet service-level agreements
  – Satisfy application performance guarantees
      • E.g., average response time, throughput
Maximize revenue
  – E.g., maximize the number of hosted applications
Question: How should a hosting platform manage its resources to meet these goals?

Challenge: dynamic workloads
Multi-time-scale variations
  – Time-of-day, hour-of-day
Overloads
  – E.g., flash crowds
User threshold for response time: 8-10 s
[Figures: workload plots; axes include arrivals per min, request rate (req/min), time (days), and time (hours)]
Key issue: How to provide good response time under varying workloads?

Self-tuning systems
A self-tuning hosting platform
[Diagram labels: Application Performance Goals; Dynamic Workloads; Resource Inference Model; Resource Shares; Resource Schedulers; Actual Performance]

Dynamic provisioning
[Diagram: Monitor workload, then compute current/future demand, then adjust allocation]
Key idea: increase or decrease allocated servers to handle workload fluctuations
  – Monitor incoming workload
  – Compute current or future demand
  – Match number of allocated servers to demand

Dynamic provisioning at multiple time-scales
Predictive provisioning
  – Certain Internet workload patterns can be predicted
      • E.g., time-of-day effects, increased workload during Thanksgiving
  – Design a good application model
  – Provision using the model at a time-scale of hours or days
Reactive provisioning
  – Applications may see unpredictable fluctuations
      • E.g., increased workload to news sites after an earthquake
  – Detect such anomalies and react fast (minutes)
Question: How to put these together?
  – When to invoke the predictor and the reactor?

Self-healing systems
Systems that continue to operate on their own despite faults or failures
Distinction between faults and failures
  – Fault: A sysadmin sets a small concurrency limit for a Web server
  – Failure: debris from an external fuel tank is thought to have struck Columbia's left wing in 2003.
Failure/fault handling capability built into the system
  – Graceful degradation
We will study classic literature in fault tolerance and papers that apply these principles to modern distributed systems

Self-stabilizing systems
Guaranteed to converge to a desired behavior from any initial state if left alone
Why should one have interest in self-stabilizing algorithms?
  – Their applicability to distributed systems
  – Recovering from faults of a space shuttle. Faults may cause malfunction for a while. Using a self-stabilizing algorithm for its control will cause an automatic recovery and enable the shuttle to continue its task

What is a self-stabilizing algorithm?
This question will be answered using the “Stabilizing Orchestra” example
The Problem:
  – The conductor is unable to participate; harmony is achieved by players listening to their neighboring players
  – Windy evening: the wind can turn some pages in the score, and the players may not notice the change

The “Stabilizing Orchestra” Example
Our Goal: To guarantee that harmony is achieved at some point following the last undesired page turn
Imagine that the drummer notices that the violin player next to him is on a different page … (solutions and their problems):
  1. The drummer turns to his neighbor's new page. What if the violin player noticed the difference as well?
  2. Both the drummer and the violin player start from the beginning. What if the player next to the violin player notices the change only after the other two have synchronized?

The Self-Stabilizing Solution
Every player joins whichever player, among himself and his neighbors, is playing the earliest page
Note that the score has a bounded length. What happens if a player goes to the first page of the score before harmony is achieved?
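
The join-the-earliest-page rule above can be sketched as a small simulation. This is only an illustrative sketch under assumptions that are not on the slides: players sit in a line, everyone acts in synchronous rounds, each player plays on by one page per round, and the score is long enough that nobody wraps around to the first page (so the wrap-around question raised above is deliberately not modeled). The names `step` and `rounds_to_harmony` are hypothetical.

```python
def step(pages):
    """One synchronous round (assumed model): each player joins the earliest
    page among itself and its immediate neighbours, then plays on one page."""
    n = len(pages)
    new_pages = []
    for i in range(n):
        lo = max(0, i - 1)   # left neighbour, if any
        hi = min(n, i + 2)   # right neighbour, if any (slice end is exclusive)
        new_pages.append(min(pages[lo:hi]) + 1)
    return new_pages


def rounds_to_harmony(pages):
    """Repeat rounds until all players are on the same page.

    Returns (number of rounds taken, final pages). Wrap-around at the end
    of the score is ignored in this sketch.
    """
    rounds = 0
    while len(set(pages)) > 1:
        pages = step(pages)
        rounds += 1
    return rounds, pages
```

For example, `rounds_to_harmony([5, 9, 2, 7])` returns `(2, [4, 4, 4, 4])`: the earliest page spreads one neighbor per round, so under these assumptions a line of n players agrees within n - 1 rounds once the wind stops.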
In every long enough period in which the wind does not turn a page, the orchestra resumes playing in synchrony

Discussion: Overlaps and distinctions
Self-tuning vs self-healing vs self-stabilizing systems
Proactive vs reactive

Crosscutting goals and challenges
Removing costly and error-prone humans from administering complex systems
Learning from the past
Modeling systems to render them amenable to analysis
Understanding how robust a system is
  – Robust = predictable behavior, graceful degradation
  – Equivalent: Figuring out how to make a system robust

Introspection!
Everyone gives an example of a self-* aspect from their research/experience
  – Arjun: e-commerce applications
  – Amitayu: dynamic allocation of servers in a farm
  – Ross: Ross’s sensor n/w
  – Huajing: information ret/feedback
  – Young: fault handling by duplication
  – Krishna: activity migration in a multiprocessor

Goals of the course
Understand classic literature
Identify theory and systems issues/tools common across these diverse domains
  – Statistical learning, control theory, measurement techniques, data analysis, fault tolerance, modeling
      • I will try to have some guest lectures
Learn to appreciate how theory translates into and compares with practice
Critically evaluate papers and present them, use these in research

Some administrative details …

Grading policy
Paper presentations: 30%
Class participation and discussion: 15%
  – Let’s have lots of heated discussions
  – Don’t be shy!
Paper evaluations due before class: 15%
  – A conference-style evaluation form
Semester-long project: 30%
  – May be replaced by a term paper
  – Apply ideas to your research, master’s thesis
Final exam: 10%
  – Take-home exam

Expected course-load
No intentions of stressing you out!
Round-robin presentation policy
  – Number of presentations will depend on how many students enroll
  – Red teams: To make sure you come prepared
  – We DON’T want bad presentations!
Mid-term and final presentations for students doing projects
End-of-semester take-home exam
  – Goal: Find out what we learnt in the course

Presentations
Prepare a talk about 45 minutes long
Rest of the class for discussions
  – We will accept or reject papers at the end of each class
Red team
  – Each presenter will practice his/her talk with the assigned red team before the class
  – You are welcome to talk to me, discuss slides, ask for help understanding the paper before presenting it
Use the PowerPoint template on the course page
We will try to become good speakers and reviewers!

Paper evaluations
Due the midnight before the class
I will put up an evaluation format that you will adhere to
  – No long essays needed
  – Be critical, read the papers carefully
I will anonymize evaluations and put them up after the class so all can read them
Acceptable: txt, pdf

Course project
Not compulsory
You may work in groups of up to 2 students
You may replace it with a term paper
  – Survey of additional reading material
Project may be
  – A theoretical exercise
  – Implementation-based
  – A thought experiment
Reports and term papers due at the end of the semester

Final exam
Day-long take-home exam
For students doing projects, I will design questions related to their project
For students doing a survey, I will design questions based on their survey report

Miscellaneous
Please register soon so the course can be offered
  – At least 5 students need to take the course
Let’s figure out course timings suitable to all
Random thoughts
  – Would you like to solve puzzles?
  – Would you like to have discussions on systems research in general, hot areas, top conferences …?
  – Would you like to take turns as scribes?
Hope: We will learn a lot and have lots of fun in this course

Questions or comments?