Civilian Worms: Ensuring Reliability in an Unreliable Environment

Sanjeev R. Kulkarni
University of Wisconsin-Madison
sanjeevk@cs.wisc.edu
Joint work with Sambavi Muthukrishnan

Outline
- Motivation and Goals
- Civilian Worms
- Master-Worker Model
  - Leader Election
  - Forward Progress
  - Correctness
- Parallel Applications

What's happening today
- Move towards clusters
- Resource managers
  - e.g. Condor
- Dynamic environment

Motivation
- Large parallel/standalone applications
- Non-dedicated resources
- Unreliable commodity clusters (e.g. a Condor environment)
  - Machines can disappear at any time
  - Hardware failures
  - Network failures
  - Security attacks!

What's available
- Parallel platforms
  - MPI
    - MPI-1: machines can't go away!
    - MPI-2: any takers?
  - PVM
    - Shoot the master!
- Condor
  - Shoot the Central Manager!

Goal
- Bottleneck-free infrastructure in an unreliable environment
- Ensure "normal termination" of applications
- Users submit their jobs
- Get e-mail upon completion!

Focus of this talk
- Approaches for reliability
  - Standalone applications
    - Monitor framework (worms!)
    - Replication
  - Parallel applications
    - Future work!

Worms are here again!
- Usual worms
  - Self-replicating
  - Hard to detect and kill
- Civilian worms
  - Controlled replication
  - Spread legally!
  - Monitor applications

Desired Monitoring System
[Diagram: worm segments (W) attached to computations (C); W = worm, C = computation]

Issues
- Management of worms
- Forward progress
  - Distributed state detection: very hard
  - Checkpointing
- Correctness

Management Models
- Master-Worker
  - Simple
  - Effective
  - Our choice!
- Symmetric
  - Difficult to manage the model itself!

Our Implementation Model
[Diagram: a master worm segment overseeing several worker segments, each worker paired with a computation; W = worm, C = computation]

Worm States
- Master
  - Maintains the state of all the worm segments
  - Listens on a particular socket
  - Respawns failed worm segments
- Worker
  - Periodically pings the master
  - Starts the encapsulated process if instructed
- Leader Election
  - Invokes the LE algorithm to elect a new master
Note: worm state is independent of application state

Leader Election
- The woes begin!
- Master goes down
  - Detection
    - Worker ping times out (governed by a timeout value)
    - Worker gets an LE message
  - Action
    - Worker goes into the LE state (see the failure-detection sketch below)

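A minimal sketch of the detection step in Python (the talk gives no code; the UDP payload, address, and timeout values here are illustrative, not the actual wire protocol):

```python
import socket
import time

def worker_ping_loop(master_addr, interval=1.0, timeout=3.0):
    """Ping the master every `interval` seconds; return when it looks dead."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    while True:
        sock.sendto(b"PING", master_addr)
        try:
            sock.recvfrom(64)        # any reply means the master is up
        except socket.timeout:
            return "ENTER_LE"        # ping timed out: go into the LE state
        time.sleep(interval)
```
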
LE algorithm
- Each worm segment is given an ID
  - Only the master assigns IDs
- Workers broadcast their IDs
- The worker with the lowest ID wins

Brief Skeleton
- While in LE:
  - Broadcast an LE message with your id
  - Set min = your id
- On getting an LE message with id i:
  - If i >= min, ignore
  - Else min = i
- min is the new master

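The skeleton above, transcribed into a Python sketch; the broadcast/receive callables stand in for the real UDP layer and are assumptions of this sketch:

```python
def elect(my_id, broadcast, receive):
    """Run one LE round; return the lowest id seen (the new master)."""
    broadcast(("LE", my_id))        # bcast LE message with your id
    smallest = my_id                # set min = your id
    for (kind, i) in receive():     # receive() yields messages until the
        if kind != "LE":            # election round times out
            continue
        if i >= smallest:           # if i >= min: ignore
            continue
        smallest = i                # else: min = i
    return smallest                 # min is the new master
```

A segment whose own id survives as min declares itself master and sends COORD to the rest, as the walkthrough below shows.
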
LE in action (1)
[Diagram: master M0 with workers W1 and W2; the master goes down!]

LE in action (2)
[Diagram: segments L1 and L2, now in LE state, send out LE messages (LE,1 and LE,2)]

LE in action (3)
[Diagram: L1 gets LE,2 and ignores it; L2 gets LE,1 and sends COORD_ACK]

LE in action (4)
[Diagram: new master M1 sends COORD to W2 and spawns a replacement worker segment]

Implementation Problems
- Too many cases
- Many unclear cases
- Time to converge
  - Timeout values
  - Network partition

What happens if?
- Master still up?
  - Incoming id < self id => it goes into LE mode
  - Else => it sends back a COORD message
- Next master in line goes down?
  - Timeout on COORD message receipt
- Late COORD_ACK?
  - Sends a KILL message

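A hedged sketch of the first case: how a still-live master might react to a stray LE message. The handler names are placeholders, only the decision rule comes from the slide:

```python
def master_on_le(self_id, incoming_id, send_coord, enter_le):
    """React to an LE message while still holding mastership."""
    if incoming_id < self_id:
        enter_le()                # a lower id exists: go into LE mode
    else:
        send_coord(incoming_id)   # reassert mastership with a COORD message
```
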
More Bizarre cases
- Multiple masters?
  - Master broadcasts its id periodically
  - Conflict is resolved using the lowest-id method
- No master?
  - Workers will time out soon!

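A minimal sketch of that lowest-id tie-break; the heartbeat delivery is assumed, only the resolution rule is from the slide:

```python
def on_master_heartbeat(self_id, peer_id):
    """Resolve a multiple-master conflict: the lowest id keeps the role."""
    return "worker" if peer_id < self_id else "master"
```
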
Test-Bed
- 64 dual-processor 550 MHz P-III nodes
- Linux 2.2.12
- 2 GB RAM
- Fast interconnect: 100 Mbps
- Master-worker communication via UDP

A Stress Test for LE
- Test
  - Worker pings every second
  - Kill n/4 workers
  - After 1 sec, kill the master
  - After 0.5 sec, kill the next master in line
  - Kill n/4 workers again

Convergence
[Graph "Convergence Graph": convergence time in seconds (y-axis, 0 to 35) vs. cluster size (x-axis: 2, 4, 8, 16)]

Forward Progress
- Why?
  - MTTF < application running time
- Solutions
  - Checkpointing
    - Application level
    - Process level
  - Start from the checkpoint image!

Checkpoint
- Address space
  - Condor checkpoint library
    - Rewrites object files
    - Writes a checkpoint to a file on SIGUSR2 (sketch below)
- Files
  - Assumption: a common file system

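A sketch of how a worm segment could drive this from the outside, assuming only what the slide states: the Condor checkpoint library writes the image when the process receives SIGUSR2. The pid and interval are placeholders:

```python
import os
import signal
import time

def periodic_checkpoint(child_pid, interval_secs=300):
    """Periodically ask the encapsulated process to checkpoint itself."""
    while True:
        time.sleep(interval_secs)
        # The checkpoint library inside the child catches SIGUSR2 and
        # writes the checkpoint image to a file.
        os.kill(child_pid, signal.SIGUSR2)
```
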
Correctness
- File access
  - Read-only: no problems
  - Writes
    - Possible inconsistency if multiple processes access
    - Inconsistency across checkpoints?
  - Need a new file-access algorithm

Solution: Individual Versions
- File Access Algorithm (see the sketch below)
  - On open:
    - If first open:
      - Read: nothing
      - Write: create a local copy and set a mapping
    - Else:
      - If mapped, access the mapped file
      - If write: create a local copy and set a mapping
  - On close:
    - Preserve the mapping

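A Python sketch of this copy-on-write scheme, assuming the common file system above; the mapping table and the /tmp naming are illustrative, and the real system additionally preserves the mapping across checkpoints:

```python
import os
import shutil

_mapping = {}   # original path -> this segment's private copy

def worm_open(path, mode="r"):
    """Open a file through the individual-versions algorithm."""
    if path in _mapping:                        # already mapped: use the copy
        return open(_mapping[path], mode)
    if any(flag in mode for flag in "wa+"):     # first write: make a version
        local = f"/tmp/worm_{os.getpid()}_{os.path.basename(path)}"
        if os.path.exists(path):
            shutil.copy(path, local)            # copy, then write locally
        _mapping[path] = local
        return open(local, mode)
    return open(path, mode)                     # first read-only open: as-is
```
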
File Access cont.
- Commit point
  - On completion of the computation (sketch below)
- Checkpoint
  - Includes the mapped files

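A companion sketch of the commit point, reusing the hypothetical _mapping table from the previous sketch:

```python
import shutil

_mapping = {}   # original path -> private copy, filled in by worm_open()

def commit():
    """Publish each private version over its original at the commit point."""
    for original, local in _mapping.items():
        shutil.copy(local, original)
    _mapping.clear()
```
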
Being more Fancy
- Security attacks
- Civilian-to-military transition
  - Hide yourself from ps
  - Re-fork periodically to avoid detection

Conclusion
- LE is VERY HARD
  - Don't take it on as a course project!
- Does our system work?
  - 16 nodes: YES
  - 32 nodes: NO
- Quite reliable

Future Direction
- Robustness
- Extension to parallel programs
  - Rewrite send/recv calls
  - Routing issues
- Scalability issues?
  - A hierarchical design?

References
- Cohen, F. B., "A Case for Benevolent Viruses", http://www.all.net/books/integ/goodvcase.html
- M. Litzkow and M. Solomon, "Supporting Checkpointing and Process Migration outside the UNIX Kernel", Usenix Conference Proceedings, San Francisco, CA, January 1992.
- Gurdip Singh, "Leader Election in Complete Networks", PODC '92.

Implementation Arch.
[Diagram: worm components (Communicator, Dispatcher, Dequeuer, Checkpointer) around the Computation; operations shown: remove, checkpoint, prepend, append]

Parallel Programs
- Communication
  - Connectivity across failures
  - Rewrite send/recv socket calls
- Limitations of the Master-Worker model?
  - Not really!

Communication
- Checkpoint markers
  - Buffer all data between checkpoint markers (see the sketch below)
- Help of the master in rerouting

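Since this is future work in the talk, the following is only a speculative sketch of buffering between checkpoint markers; all names are illustrative:

```python
class MarkedChannel:
    """Buffers outgoing data between checkpoint markers for replay."""

    def __init__(self, raw_send):
        self.raw_send = raw_send
        self.since_marker = []          # data sent since the last marker

    def send(self, msg):
        self.since_marker.append(msg)   # keep a replay copy
        self.raw_send(msg)

    def on_peer_marker(self):
        self.since_marker.clear()       # peer checkpointed: safe to drop

    def replay(self):
        for msg in self.since_marker:   # peer restarted from its checkpoint
            self.raw_send(msg)
```
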