Reliability and State Machines in an Advanced Network Testbed
Mac Newbold
School of Computing
University of Utah
MS Thesis Defense
April 5, 2004
Advisor: Prof. Jay Lepreau

Distributed Systems
• Distributed Systems are complex
  – Many components
  – Distributed across multiple systems
• Component failures are relatively common
  – But should not cause system breakdown
• “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”
  – Leslie Lamport, quoted in CACM, June 1992

April 5th, 2004 Mac Newbold - MS Thesis Defense 2

Our Context: Emulab
• Emulab is an advanced network testbed
• Complex time- and space-shared system
• System dynamically reconfigures nodes and network links to create “experiments”
• Key architectural feature: Central Database
  – System uses DB for storage, communication
• Complex system with many different scripts and programs on clients and servers

Emulab Background
• First prototype in April 2000 (10 nodes)
• In production since Oct. 2000 (40 nodes)
• Early versions weren’t perfect
  – Reliability problems
  – Experiments of limited size
  – Inefficient use of resources
• Problem is becoming harder
  – 200 nodes, 400 remote, 2000 virtual nodes

Four Key Challenges
Emulab requirements:
• Reliability
• Scalability
• Performance and Efficiency
• Generality and Portability

1. Reliability
• Complex systems are hard to make reliable
• Many sources of unreliability:
  – Hardware – commodity PCs as nodes
  – Software – misconfiguration, bugs, etc.
  – Humans – can interrupt at any time
• More complexity and more parts mean higher chance that something is broken at any given time

2. Scalability
• Almost everything a testbed provides is harder to provide at a larger scale
• Larger scale requires more resources
• If throughput doesn’t increase, things slow down at larger scale
• Increased load adversely affects reliability
• Practical scalability limited by reliability, performance and efficiency

3. Performance and Efficiency
• One direct requirement on performance:
  – Emulab is used in an interactive style
    • System tasks must complete in “a few minutes”
• Indirect requirements:
  – Scalability requirement places high demands on many system components
  – Maximize efficient resource utilization
    • As many users/experiments as possible in the shortest time with the fewest resources

4. Generality and Portability
• Workload Generality
  – Wide variety of research, teaching, development, and testing activities
  – Good support for minimally and non-instrumented client OS’s and devices
• Model generality
  – Evolving system
  – New types of network devices
• Portable and non-intrusive client software

Summary of Challenges
• Three challenges are closely related
  – Fourth is a constraint more than a challenge
• Reliability is key
• Failure rates directly impact scalability, performance, and efficiency
• Generality and portability requirements constrain any solution used to address the challenges

Thesis Statement
Enhancing Emulab with a flexible framework based on state machines provides better monitoring and control, and improves the reliability, scalability, performance, efficiency, and generality of the system.

Outline
• Introduction, Background, and Challenges
• Thesis Statement
• State Machines in Emulab
  – Interactions, Control Models, stated
  – Node Boot State Machines
• Results
• Related and Future Work
• Conclusion

State Machines
• Also called Finite State Machines (FSMs), Finite State Automata (FSAs), or Statecharts in UML
• Well-known model
• Simple
• Explicit model
• Rich and flexible
• Easy to understand and visualize

Example State Machine
• Three main parts:
  • States
  • Transitions – Directional
  • Events – Associated with transitions – labels
• Stored in database – diagrams generated automatically from DB

State Machines in Emulab
• Each state machine has a “type”
  – Currently three: node boot, node allocation, and experiment status
• Multiple machines allowed within a type
  – Only in one state in one machine of a type
• States can have “timeouts” with actions
  – Timer starts when state is entered
• State “triggers”
  – Entry Actions

Direct Interaction
• Within a type, can take a transition from a state in one machine (or “mode”) to a state in another machine of that type
  – Known as “mode transition” in Emulab
  – Similar to hierarchical state machines
• Highlights similarities/symmetries
  – Most machines are variations of another
• Improved code reuse

Direct Interaction Example
(diagram only)

Models of State Machine Control
• Centralized monitoring & control: stated
  – State changes submitted, checked for correctness, applicable actions performed
  – Daemon tracks timeouts
  – Used for Node Boot state machines
• Distributed management of state machine
  – No central service enforcing correctness
  – No dependency on central service
  – Timeouts harder to implement

The stated State Daemon
• Listens for events continuously
• State transitions cause database updates
• Invalid transitions cause notifications
• Timeouts, timeout actions, triggers configurable in DB for each state
• Caching – only writer of node boot states
• Modular design – dispatch events to proper action handlers

Node Boot State Machines
• Nodes in Emulab self-configure
• Monitored via state machines – stated
• “Normal” node boot machine
• Variations – “Minimal”
• Reloading node disks

Node Self-Configuration
• Nodes send state events during booting to allow progress to be monitored
• “Global knowledge” inside state daemon
  – Better decisions about recovery steps
• Finer granularity gives more information for recovery, allows for shorter timeouts
• Each OS image is associated with a “mode” (state machine) that describes its behavior

“Normal” Node Boot Machine
• Start in SHUTDOWN
• DHCP, start OS booting
• When Emulab-specific configuration begins, enter TBSETUP
• ISUP when finished
• In case of failure, can retry from SHUTDOWN

Variations of Node Boot
• Example: MINIMAL
• For OS images with little or no special Emulab support
• ISUP generated by stated if necessary
  – Immediate or ping
• SilentReboot allowed in this mode

Reloading Node Disks
• Mode transition into RELOAD / SHUTDOWN
• RELOADDONE transitions into mode for newly-installed OS image

Reloading and Mode Transitions
(diagram only)

Experiment Status State Machine
• Uses distributed model
  – Stored in database, but not strictly enforced
• Documents life-cycle
• Restricts user interruption
  – Reduces a source of errors
• Can queue, activate, modify, restart, swap, or terminate an expt.

Node Allocation State Machine
• Distributed control model
• Diagram documents the way the states are used by the program, but not currently enforced
• Either reloads nodes with a custom image, or reboots them as members of the experiment

Results: Context
• Emulab in production 1 year before state machines were added
• In production 3 years since first stated
  – 650 users, 150 projects, 75 institutions
    • 19 papers, top venues
  – Over 155,000 nodes allocated in nearly 10,000 experiment instances
  – 13 classes at 10 universities
• Emulab SW on 6 more testbeds, 4 planned

Results
• Anecdotal: (others in thesis)
  – Reliability/Performance: Preventing race conditions
  – Generality: Graceful handling of custom OS images
  – Generality: New node types
• Experiment:
  – Reliability/Scalability: Improved timeout mechanisms

Reliability/Performance: Preventing Race Conditions
• Expt. ends, nodes move to holding expt., get reloaded, then freed while they boot
• Problem: Getting allocated while booting
• Node appears unresponsive, gets forcefully power cycled, corrupts FS on disk
• Solution: don’t free immediately
• Add trigger on next ISUP for a node that finishes booting, that frees it when booted

Generality: Graceful Handling of Custom OS Images
• Users create custom OS images
• Emulab client software is optional
• Problem: Nodes don’t send state events
• Solution: “Minimal” state machine
• SHUTDOWN: maybe on server, optional
• BOOTING: server side, trigger checks ISUP
• ISUP: either node sends, or generated when pingable, or generated immediately

Generality: New Node Types
• Emulab is always growing and changing
• State machine model and our framework are flexible to provide graceful evolution
• We’ve added 5 new node types
  – IXPs, wide-area, PlanetLab, vnodes, sim-nodes
• Mostly used existing machines
• 2 new machines, slight variations
• 1 change to stated to add a new trigger

Reliability/Scalability: Improved Timeout Mechanisms
• Before: reboot node, wait for it to boot
  – Static, 7 minute timeout
    • Pragmatic – minimizes false positives/negatives
    • Avg. 4 min., but max. error-free boot is 15 min.
    • 11 minute delay is too long
• Improved: state machine monitoring
  – Fine-grained, context-sensitive timeouts
  – Faster error detection
  – Better monitoring and control

Reliability/Scalability: Improved Timeout Mechanisms
• Experiment: Measure expt. swap-in time, with and without the improvements
  – Synthetic but plausible scenario
    • One node, loads an RPM (8 min. install)
    • Node reboots, timeout during RPM install
      – Reboots again, timeout again, mark node dead
      – Try twice per swap-in, 3 swap-in attempts
      – Total failure in 45 min., 3 nodes “dead”

Reliability/Scalability: Improved Timeout Mechanisms
• With state machines:
  – Timeouts: SHUTDOWN 2 min, BOOTING 3 min, TBSETUP 10 min
• Node reboots, enters BOOTING
• 1 minute: Enters TBSETUP
• 9 minutes: Enters ISUP, expt. ready
• Succeeds, with no dead nodes or retries
• Cut time from 45 min. to 9 min. (80%)

Limitations and Issues
• stated is critical infrastructure
  – Another single point of failure
  – More system complexity, new bugs, complicated debugging
  – Potential for scaling problems (none seen yet)
• Simple heuristics for error detection
  – Send mail for invalid transitions

Summary of Results
• Explicit model requires careful thought
  – Improves design and implementation
• Visualization makes it easier to understand
• Faster and more accurate error detection
• Better reliability helps scalability/efficiency
  – Bigger expts. possible, less overhead per expt.
• Flexibility for evolution, workload generality

Related Work
• “Standard” Finite State Automata – basics
  – Timed Automata – have global clock
• Message Sequence Charts (MSCs)
  – “Scenarios” – hierarchy, like modes/machines
• UML Statecharts
  – States have entry actions – “triggers”
  – Hierarchical states – similar to modes
  – Can model Emulab’s timeouts

Future Work
• Further developing distributed control
  – Add monitoring, timeouts, triggers
• Better heuristics for error detection
  – Only flag clustered or related errors
• Implement more ideas from other systems
  – UML’s exit actions, guarded transitions, etc.
• Move code into database – i.e. triggers
  – Easier to modify, framework code vs. machine

Conclusion
Enhancing Emulab with a flexible framework based on state machines provides better monitoring and control, and improves the reliability, scalability, performance, efficiency, and generality of the system.

Bonus Slides

Demonstrating Improvement
• Currently: programs have their own retry and timeout mechanisms for node reboots
  – No knowledge of progress, just completion
  – Can cause failures by forcing a reboot, which can damage file systems on node disk
• “New Way”: stated handles timeout and retry during rebooting
  – Implemented, not installed
  – Knows if progress is being made
  – Programs simply wait for ISUP or failure event

Demonstrating Improvement (cont’d)
• These failures directly hurt reliability
  – Node failure can cause experiment setup to fail
• Significant impact on scaling, performance, efficiency
• Maximum experiment size is limited by node failure rate
  – Failures make things take longer
  – A slower system means less efficient use of resources

Demonstrating Improvement (cont’d)
• Compare current vs. new: failure rate, time to completion, etc.
• Test data; one of:
  – Historical experiments
  – Artificially high load
  – Fault injection, e.g. reboots
• Why new way should help:
  – Better knowledge for intelligent recovery
    • Know when to wait longer and when to retry
  – Shorter timeouts allow for early error detection

Future Work: Modeling Indirect Interaction
• Occur between machines of different types
  – Due to external relationships between the entities tracked by each type of machine
• Examples:
  – Same entity may be tracked in two different types of machine
    • Nodes are in Boot and Allocation machines
  – Other relationship between entities
    • Nodes may be “owned” or allocated to an experiment – links Expt. Status and Node machines

Question and Answer

Conclusion
Enhancing Emulab with a flexible framework based on state machines provides better monitoring and control, and improves the reliability, scalability, performance, efficiency, and generality of the system.
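
Supplementary sketch: Per-State Timeouts
The timeout comparison from the “Improved Timeout Mechanisms” slides can be illustrated with a small Python sketch. It is not Emulab’s stated code: the function names, the event format, and the exact transition table are assumptions made for illustration (the slides name the states but the diagrams are not reproduced here). Only the timeout values (SHUTDOWN 2 min, BOOTING 3 min, TBSETUP 10 min), the static 7 minute timeout, and the 1 minute / 9 minute timeline come from the slides.

```python
# Hypothetical sketch; names and structure are illustrative, not Emulab's code.

# Per-state timeouts in minutes, as listed on the slide.
TIMEOUTS = {"SHUTDOWN": 2, "BOOTING": 3, "TBSETUP": 10}

# ASSUMED transition table for a "normal" node boot machine:
# SHUTDOWN -> BOOTING -> TBSETUP -> ISUP, with retry via SHUTDOWN.
TRANSITIONS = {
    "SHUTDOWN": {"BOOTING"},
    "BOOTING": {"TBSETUP", "SHUTDOWN"},
    "TBSETUP": {"ISUP", "SHUTDOWN"},
    "ISUP": {"SHUTDOWN"},
}


def check_boot(events, timeouts=TIMEOUTS, transitions=TRANSITIONS):
    """Replay (minute, state) events in time order.

    The first event is the state the node starts in. Returns
    ("ok", minute) when ISUP is reached, ("timeout", state) when a state
    is held longer than its limit, ("invalid", (old, new)) for a
    transition not in the table, or ("pending", state) otherwise.
    """
    entered_at, state = events[0]
    for minute, new_state in events[1:]:
        limit = timeouts.get(state)
        # The timer starts when the state is entered.
        if limit is not None and minute - entered_at > limit:
            return ("timeout", state)
        if new_state not in transitions.get(state, set()):
            return ("invalid", (state, new_state))
        entered_at, state = minute, new_state
        if state == "ISUP":
            return ("ok", minute)
    return ("pending", state)


def static_check(boot_minutes, timeout=7):
    """The older approach: one fixed timeout for the whole boot."""
    return "ok" if boot_minutes <= timeout else "timeout"


# Timeline from the slide: BOOTING at 0, TBSETUP at 1 min, ISUP at 9 min.
slide_run = [(0, "BOOTING"), (1, "TBSETUP"), (9, "ISUP")]
fine_grained = check_boot(slide_run)   # 8 min in TBSETUP is under its 10 min limit
coarse = static_check(9)               # 9 min total exceeds the static 7 min limit

# Reduction quoted on the slide: 45 min down to 9 min.
reduction_percent = (45 - 9) * 100 // 45
```

With these inputs the fine-grained check accepts the boot while the static check reports a timeout, which is the contrast the slides draw. Keeping the limits in a table keyed by state mirrors the slide’s point that timeouts are configured per state rather than hard-coded in each program.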