Civilian Worms: Ensuring Reliability in an Unreliable Environment
Sanjeev R. Kulkarni, University of Wisconsin-Madison, sanjeevk@cs.wisc.edu
Joint work with Sambavi Muthukrishnan

Outline
- Motivation and Goals
- Civilian Worms
- Master-Worker Model
- Leader Election
- Forward Progress
- Correctness
- Parallel Applications

What's happening today
- Move towards clusters
- Resource managers, e.g. Condor
- Dynamic environment

Motivation
- Large parallel/standalone applications
- Non-dedicated resources
- Unreliable commodity clusters, e.g. a Condor environment
- Machines can disappear at any time: hardware failures, network failures, security attacks!

What's available
- Parallel platforms: MPI, PVM
- MPI-1: machines can't go away!
- MPI-2: any takers? Shoot the master!
- Condor: shoot the Central Manager!

Goal
- Bottleneck-free infrastructure in an unreliable environment
- Ensure "normal termination" of applications
- Users submit their jobs and get e-mail upon completion!

Focus of this talk
- Approaches for reliability
- Standalone applications: a monitor framework (worms!), replication
- Parallel applications: future work!

Worms are here again!
- Usual worms: self-replicating, hard to detect and kill
- Civilian worms: controlled replication, spread legally, monitor applications

Desired Monitoring System
(diagram: each computation C is monitored by a worm W)

Issues
- Management of worms
- Forward progress
- Distributed state detection: very hard
- Checkpointing
- Correctness

Management Models
- Master-Worker: simple, effective, our choice!
- Symmetric: difficult to manage the model itself!

Our Implementation Model
(diagram: a master worm plus worker worms W, each worker paired with a computation C)

Worm States
- Master: maintains the state of all the worm segments, listens on a particular socket, respawns failed worm segments
- Worker: periodically pings the master, starts the encapsulated process if instructed
- Leader Election: invokes the LE algorithm to elect a new master
- Note: independent of application state

Leader Election: the woes begin!
- Master goes down
- Detection: a worker's ping times out (timeout value), or a worker gets an LE message
- Action: the worker goes into the LE state

LE algorithm
- Each worm segment is given an ID (only the master gives out ids)
- Workers broadcast their ids
- The worker with the lowest id wins

Brief Skeleton (a code sketch of this rule follows the convergence results below)
- While in LE: broadcast an LE message with your id; set min = your id
- On getting an LE message with id i: if i >= min, ignore it; else min = i
- min is the new master

LE in action (1): the master M0 goes down, leaving workers W1 and W2
LE in action (2): L1 and L2 send out LE messages (LE,1 and LE,2)
LE in action (3): L1 gets LE,2 and ignores it; L2 gets LE,1 and sends COORD_ACK
LE in action (4): M1 sends COORD to W2 and spawns W0

Implementation Problems
- Too many cases, many of them unclear
- Time to converge
- Timeout values
- Network partition

What happens if?
- The master is still up? If the incoming id < its own id, it goes into LE mode; else it sends back a COORD message
- The next master in line goes down? Timeout on COORD message receipt
- A late COORD_ACK? Send a KILL message

More Bizarre Cases
- Multiple masters? The master broadcasts its id periodically; the conflict is resolved by the lowest-id method
- No master? The workers will time out soon!

Test-Bed
- 64 dual-processor 550 MHz P-III nodes
- Linux 2.2.12, 2 GB RAM
- Fast interconnect: 100 Mbps
- Master-worker communication via UDP

A Stress Test for LE
- Workers ping every second
- Kill n/4 workers
- After 1 sec, kill the master
- After 0.5 sec, kill the master in line
- Kill n/4 workers again
- Measure convergence

Convergence
(graph: convergence time in seconds, 0 to 35, vs. cluster sizes of 2, 4, 8, and 16 nodes)
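The sketch below illustrates the "lowest id wins" rule from the Brief Skeleton slide. It is a minimal illustration, not the actual implementation: the names (enter_le_state, handle_le_message, i_am_new_master, worm_id, le_min, broadcast_le) are invented, and the UDP broadcast transport, timeouts, and COORD/COORD_ACK exchange are omitted.

```c
/* Minimal sketch of the leader-election decision rule (invented names;
 * transport and timeouts omitted). Each segment broadcasts its id, keeps
 * the smallest id it has seen, and the segment whose own id equals that
 * minimum becomes the new master. */
#include <stdio.h>

static int worm_id;  /* id assigned to this segment by the previous master */
static int le_min;   /* lowest id seen so far during this election         */

/* Entering the LE state: remember our own id and broadcast it. */
void enter_le_state(int my_id) {
    worm_id = my_id;
    le_min  = my_id;
    /* broadcast_le(worm_id);  -- UDP broadcast, not shown */
}

/* Handle one incoming LE message while in the LE state. */
void handle_le_message(int incoming_id) {
    if (incoming_id >= le_min)
        return;               /* sender cannot win; ignore */
    le_min = incoming_id;     /* better candidate seen     */
}

/* After the election timeout: the segment holding the minimum id sends
 * COORD to the others and respawns the failed segment. */
int i_am_new_master(void) {
    return worm_id == le_min;
}

int main(void) {
    enter_le_state(2);        /* e.g. the segment with id 2 enters LE */
    handle_le_message(1);     /* segment 1 is also electing           */
    printf("new master? %s\n", i_am_new_master() ? "yes" : "no");
    return 0;
}
```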
Forward Progress
- Why? MTTF < application time
- Solutions: checkpointing at the application level or the process level; restart from the checkpoint image!

Checkpoint
- Address space: the Condor checkpoint library rewrites object files and writes a checkpoint to a file on SIGUSR2
- Files: assumes a common file system

Correctness: File Access
- Read-only: no problems
- Writes: possible inconsistency if multiple processes access the file; inconsistency across checkpoints?
- Need a new file access algorithm
- Solution: individual versions

File Access Algorithm
- On open, if this is the first open: for a read, do nothing; for a write, create a local copy and set up a mapping
- On later opens: if a mapping exists, access the mapped file; if opening for write, create a local copy and set up a mapping
- On close: preserve the mapping

File Access (cont.)
- Commit point: on completion of the computation
- Checkpoint: includes the mapped files

Being More Fancy
- Security attacks: the civilian-to-military transition
- Hide yourself from ps
- Re-fork periodically to avoid detection

Conclusion
- LE is VERY HARD: don't take it on as a course project!
- Does our system work? 16 nodes: yes; 32 nodes: no
- Quite reliable

Future Direction
- Robustness
- Extension to parallel programs: rewrite send/recv calls; routing issues
- Scalability issues? A hierarchical design?

References
- F. B. Cohen, "A Case for Benevolent Viruses", http://www.all.net/books/integ/goodvcase.html
- M. Litzkow and M. Solomon, "Supporting Checkpointing and Process Migration outside the UNIX Kernel", Usenix Conference Proceedings, San Francisco, CA, January 1992.
- Gurdip Singh, "Leader Election in Complete Networks", PODC '92.

Implementation Architecture
(diagram: the worm wraps the computation with a Communicator, Dispatcher, Dequeuer, and Checkpointer; queue operations include Remove, Checkpoint, Prepend, and Append)

Parallel Programs
- Communication: connectivity across failures
- Rewrite send/recv socket calls
- Limitations of the Master-Worker model? Not really!

Communication
- Checkpoint markers
- Buffer all data between checkpoint markers (a sketch follows below)
- Help from the master in rerouting
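The sketch below illustrates the checkpoint-marker buffering idea from the Communication slide. It is an assumed interpretation, not the talk's implementation: the names (worm_send, on_checkpoint_marker, replay_for, real_send, forward_via_master) are invented, and the rewritten socket calls and master-assisted rerouting are only indicated in comments.

```c
/* Minimal sketch of buffering data between checkpoint markers (invented
 * names): every message sent since the last marker is logged so it can be
 * replayed, with the master's help, if the receiver rolls back to its
 * last checkpoint. */
#include <stdlib.h>
#include <string.h>

#define MAX_PENDING 1024

struct pending { int dest; size_t len; char *data; };

static struct pending log_buf[MAX_PENDING];
static int n_pending = 0;

/* Wrapped send: copy the payload into the in-memory log before sending. */
void worm_send(int dest, const void *data, size_t len) {
    if (n_pending >= MAX_PENDING)
        return;                       /* log full; real code would flush */
    struct pending *p = &log_buf[n_pending++];
    p->dest = dest;
    p->len  = len;
    p->data = malloc(len);
    memcpy(p->data, data, len);
    /* real_send(dest, data, len);  -- rewritten socket call, not shown */
}

/* A checkpoint marker covers all earlier messages, so the log is cleared. */
void on_checkpoint_marker(void) {
    for (int i = 0; i < n_pending; i++)
        free(log_buf[i].data);
    n_pending = 0;
}

/* On a peer failure, replay the logged messages for that peer through the
 * master, which reroutes them to the respawned segment. */
void replay_for(int dest) {
    for (int i = 0; i < n_pending; i++) {
        if (log_buf[i].dest == dest) {
            /* forward_via_master(dest, log_buf[i].data, log_buf[i].len); */
        }
    }
}

int main(void) {
    char msg[] = "partial result";
    worm_send(3, msg, sizeof msg);    /* send to a hypothetical segment 3 */
    on_checkpoint_marker();           /* marker reached: log discarded    */
    return 0;
}
```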