Worms for Reliable Computation

Sambavi Muthukrishnan and Sanjeev R. Kulkarni
Computer Sciences Department
University of Wisconsin – Madison
{sambavi, sanjeevk}@cs.wisc.edu

May 3, 2000

Abstract

We describe the use of network worms to achieve reliable computation on top of a distributed environment. The idea is to exploit the self-replication property of a worm in a network to ensure that a computation completes reliably in the face of machine failures and/or a concerted attack against the completion of the job. We present a design model for constructing a worm and use it to build a general framework that encapsulates any job and ensures its termination with a high degree of reliability. We use the master-worker model to construct the worm. The worm replicates the encapsulated computation on nodes in the given computing environment up to a maximum replication count. To ensure forward progress of the computation, especially under a round-robin attack on the computation replicas, we use a checkpointing and restart mechanism. Further, to allow encapsulated jobs to write to files, we use file remapping: the output files of the job are mapped to individual copies for each computation replica, and when one of the replicas completes the job, its output files are remapped back to the original output files of the computation. To be able to terminate the worm at any time, we provide a kill switch/file; whenever the kill switch is in place, the worm automatically kills itself on all the nodes in the computation pool. We also employ several defense mechanisms to maintain a minimal presence in the system.

Keywords: reliability, checkpointing, master-worker, replication, leader election

1 Introduction

General-purpose problem solving is rarely an issue for computers today. Reliability, however, is particularly critical to large applications, and ensuring it in a single-machine environment is difficult because one cannot guard against machine crashes and hence abnormal termination of the job. Given a distributed environment as the base, the number of computers available for problem solving increases. By replicating the job on multiple machines in the environment, some degree of reliability can be ensured, since at least one of the replicas can be taken to termination. Taking this one step further, by checkpointing and migrating the job, it can be protected even against round-robin failure of the machines in the environment.

One way to approach the problem of ensuring that a job completes reliably in an unreliable computing environment is to build the desired features into the application directly. This job rests with the programmer of the application, who integrates the desired features of replication, checkpointing and process migration into the application program. The advantage of this approach is that the programmer can tune these features to work well with the application. Another approach is to build a general framework/interface that takes up the work of distributing the computation, replicating it to ensure reliability, and taking the computation to termination. This general framework may be used to encapsulate any job and add the desired properties of reliability and distribution to it. The framework must present as general an interface as possible for jobs to use. Though this approach may not be optimal with respect to a particular application, it has the advantages of flexibility and generality.
It takes the burden of this work away from the application programmer, and the use of such a framework keeps the application code easily maintainable.

From a study of the worm paradigm of distributed computation, worms appear to lend themselves naturally to constructing this kind of framework for reliable computing. Several properties of worms make them suitable for this purpose. A worm is basically a self-replicating program that tries to survive and keep computing in the most hostile of environments. The fundamental property of a worm is that it propagates itself across machines over a network, automatically repairing parts that were damaged or destroyed. The worm requires no user assistance to spread across a network: it identifies new nodes to spread to, transfers a copy of itself onto those machines and starts running there.

The use of self-replicating programs for beneficial purposes is not new. In fact, John von Neumann, one of the pioneers of the computer age, described reliable self-replicating programs in the 1940s [1]. The first worm programs were actually designed to facilitate better usage of a network. Five worms were developed at the Xerox Palo Alto Research Center, each designed to perform helpful tasks around the network. Some were quite simple, such as the town crier worm, which simply traveled throughout the network posting announcements. Others were quite clever and complex, such as the vampire worm. This worm was idle during the day, but at night it would take advantage of the largely idle computers and apply them to complex tasks that needed the extra processing power. At dawn, it would save the work it had done so far and become idle again, waiting for the next evening [2].

In this work, we extend the idea of using worms for reliable computing one step further. We have designed and developed a worm framework that provides a general infrastructure in which we can encapsulate any job and ensure its reliable completion. Our work aims at using the self-replication property of a worm to carry a job to completion. The idea is that if a job is encapsulated in a worm, the worm provides the framework for spreading it on to multiple machines. All that the worm has to do, in addition to replicating itself on every node in the specified domain, is to replicate the job on a specified maximum number of machines. Further, the worm must ensure forward progress of the job even when worm segments (along with the copy of the job they are running) go down. Thus packaging a computation as a worm ensures that the computation will complete with high probability even in the midst of system crashes and/or malicious users.

The main features of our implementation are:

Beneficial application of worms. Our implementation of a general framework for replicated distributed computing applies the general concept of worms for a constructive purpose. It makes use of the self-replication property of a worm to ensure reliable computation in the face of machine failures and/or concerted attack against the completion of the computation.

Master-Worker model. The design model for our worm is the master-worker model. The master maintains all the state information for the worm. When a worker goes down, the master is responsible for initiating suitable action. To handle the death of the master we use a leader election algorithm.
Encapsulated job replication. The encapsulated computation is replicated on multiple nodes. The number of nodes on which to replicate the job is specified by a maximum replication count. Once any of these replicated copies (which we call computation replicas) terminates, the worm terminates.

Checkpointing. To ensure forward progress of the computation amidst continuous failures, it is essential that the computation is checkpointed at periodic intervals and restarted from the checkpointed image instead of from the start. We have added support for this feature in our implementation.

File mapping. Since the same computation runs on several machines, special care has to be taken in handling files that are written by the computation. When the computation is replicated, we map its output files to individual copies. When one of the computation replicas terminates, it shifts its temporary output to the originally specified output files.

Termination using a kill switch. For terminating the worm at any point during the computation, a kill switch is used. This is simply a file whose presence indicates that the worm should kill itself.

The rest of this paper is organized as follows. Section 2 presents our architecture for a worm framework. The following two sections address the issues of start up, replication and recovery (using a leader election mechanism) of the worm. Section 5 explains how a computation can be encapsulated into our worm framework. The issue of termination is dealt with in Section 6. We then describe some implementation details in Section 7. Section 8 addresses some of the self-defense mechanisms that we have adopted to minimize the visibility of our worm. The paper ends with conclusions and some ideas on future work.

2 Architecture

In this section, we present our design strategy for building a worm. The main issue to address in designing the architecture for a worm is how to ensure that the segments of the worm running on different nodes are aware of each other and cooperate with one another. Such cooperation is necessary to identify and repair damaged segments of the worm.

Design models for constructing worms

There are two basic strategies for building worms: the master-worker model and the symmetric model. In the master-worker model one of the processes (designated as the master) takes on the mantle of coordinating system resources and the other segments. This process has the responsibility of spawning new worm segments (which we call workers) and coordinating among the various segments. All the workers simply keep in touch with the master; the worker processes need not be aware of each other. The master alone maintains state information and has the responsibility of repairing damaged segments. In the symmetric model, all the nodes where the worm is started up communicate and cooperate to achieve the goals of self-perpetuation and reliability. All the worm segments have equal responsibility and behave identically. Any worm segment can take up the task of repairing damaged worm segments, so there is no delegation of responsibility to some subset of the segments. Essentially this model involves distributed state maintenance and distributed coordination.

The main advantage of the master-worker model is that the handling of system events is greatly simplified since all functionality is centralized at the master. The only form of communication needed is between each worker and the master.
This communication helps the master identify the death of workers and, likewise, lets workers identify the death of the master. For instance, no coordination is needed to prevent two worm segments from re-spawning a worm segment on the same node. Such a case has to be handled in the symmetric model, since multiple segments may simultaneously decide to repair damaged worm segments. The main problem with the master-worker model is scalability: when the number of nodes increases, the master may become a bottleneck since it has to communicate with all the workers. The advantage of the symmetric method is the absence of a central point of failure. Since all worm segments are equal, the failure of one does not affect the others. From the point of view of complexity, however, this method requires complex distributed coordination algorithms for handling system activities like repair.

Our design model

We have chosen the master-worker model because it greatly simplifies resource management tasks like spawning new worm processes and termination. The problem of scalability appears in both models, owing to the higher message exchange in the symmetric model and the centralized information in the master-worker model. The problem of the master being a central point of failure can be partially alleviated by going through a leader election process to select a new master. Further, since we envision the use of our worm for encapsulating any general computation to provide reliability, we believe that it will be used in closed computing environments like the cluster that we are using for our implementation, and so we need not scale to thousands of machines.

The design model of our worm is indicated in Figure 1. The worm segment at one node acts as the master for the other worm segments. This is the node that controls replication and identifies failure of any of the workers. When the master first starts up, it spawns worker worms on all the other nodes specified in a configuration file. It then identifies a set of nodes in this pool (up to a maximum replication count) and starts up the encapsulated computation on those nodes. Further, it periodically exchanges messages with each of the workers to ensure that the workers are alive and to indicate its own existence. Whenever it identifies that a worker has gone down, it starts up another copy of the worm on the same node. The main idea is to ensure that a copy of the worm is always resident on every machine specified in the configuration file (whether or not that worm segment is running the encapsulated computation at its site).

Figure 1: The master-worker model for designing worms (W = worm segment, C = encapsulated computation).

Obviously, there is no assurance that the master is always up, so one has to account for the failure of the master too. When the master goes down, one of the remaining processes is elected as the master and the computation continues. This process of leader election is explained in Section 4.

States in our Model

There are thus three modes in which a worm segment can exist. The first is the Master mode, in which the worm segment assumes responsibility for the functioning of the entire distributed worm program. The next is the Worker mode, in which all the segment does is keep in touch with the Master and, if instructed to, run the encapsulated computation at its node. The communication between the Master and Workers is limited to ping messages periodically sent by each worker to the master indicating its existence, and the acknowledgement messages sent by the master in response. Thus the state in our model is soft state that has to be periodically refreshed by new messages. If the master does not hear from a worker for a sufficient period of time, it infers that the worker is dead. Likewise, lack of acknowledgements for a long time indicates the death of the master.
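As a rough illustration, both liveness checks reduce to comparing a last-heard timestamp against a timeout; the sketch below is ours (the names and the timeout value are assumptions, not taken from the implementation):

    #include <stdbool.h>
    #include <time.h>

    #define PEER_TIMEOUT_SECS 30    /* assumed value; the real timeout is a tunable parameter */

    /* Soft state kept for a peer, refreshed on every IAA (ping) or IAA_ACK message. */
    struct peer_state {
        time_t last_heard;          /* time the last message arrived from the peer */
    };

    /* Called whenever a message is received from the peer. */
    static void refresh_peer(struct peer_state *p) {
        p->last_heard = time(NULL);
    }

    /* Periodic check: the master runs it once per worker, a worker runs it for
     * its master.  A true result triggers recovery (respawning the worker, or
     * entering leader election if the silent peer is the master). */
    static bool peer_presumed_dead(const struct peer_state *p) {
        return difftime(time(NULL), p->last_heard) > PEER_TIMEOUT_SECS;
    }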
When a worker infers the death of the master from the timeout on acknowledgement messages, it goes into the third state, the Leader Election (LE) state. In this state, the worker exchanges messages with other workers to decide on a new master. Since the worker has no information about the other workers, this exchange is done via a broadcast on the entire computation pool. Potentially, several workers can go into Leader Election mode at the same time. The leader election algorithm that we employ guarantees that the election of the new master takes place in a coordinated manner; in other words, at the end of the algorithm's execution all the worm segments will have arrived at the same conclusion regarding the new master. Since only the master process has exact state information and the workers know nothing about the state, there is the problem of building state information when a new master is elected. This is carried out as part of the leader election algorithm.

3 Start up and replication

In accordance with our model, start up involves starting the master. At startup, the worm is given a configuration file listing the machines on which to execute. This list serves as the computation pool. The master then replicates the worm on all nodes in the computation pool by transferring a copy of itself to each node and starting it in worker mode. The replication count N (specified at start up) indicates the number of computation replicas to be maintained at any time. The master starts the computation on its own machine and tells N - 1 other worms to do so. Thus all the machines in the pool have a worm process running, but only a few machines execute the computation. It is assumed that the master process gets enough time to start up at least one worker worm.
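The start-up logic can be summarized by the following sketch. The configuration structure and helper functions are hypothetical names of ours; the actual implementation transfers the worm binary to each node and starts it in worker mode over the network:

    /* Sketch of master start-up under an already-parsed configuration.
     * All helpers below are hypothetical names, not the real API. */
    struct config {
        int    n_nodes;            /* size of the computation pool                */
        char **nodes;              /* host names listed in the configuration file */
        int    replication_count;  /* N: number of computation replicas to keep   */
    };

    int  spawn_worker(const char *host);              /* copy worm binary, start in worker mode */
    void instruct_run_computation(const char *host);  /* tell a worker to run the job           */
    void start_local_computation(void);               /* run the job at the master's own node   */

    static void master_startup(const struct config *cfg) {
        int replicas = 1;              /* the master itself runs one replica */
        start_local_computation();
        for (int i = 0; i < cfg->n_nodes; i++) {
            if (spawn_worker(cfg->nodes[i]) != 0)
                continue;              /* unreachable node: the master retries it later */
            if (replicas < cfg->replication_count) {
                instruct_run_computation(cfg->nodes[i]);
                replicas++;
            }
        }
    }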
4 Recovery and Leader Election

If the master does not receive a ping message from a worker for a while, it assumes that the worm process has died on that machine and immediately tries to restart the process there. In addition, if the dead worm was also running a computation replica, the master chooses a different running worm and asks it to run the computation. If a worker does not receive an acknowledgement message from the master for a while, it goes into leader election mode and steps through the leader election algorithm described below.

Leader election

The leader election mode is basically a transient mode that aids in the election of a new master. At the end of this mode, every worm segment will have assumed one of the two roles of master and worker, with only one master among all the worm segments. We achieve this through a distributed leader election algorithm. Our algorithm is a variant of the Bully election algorithm, which works as follows. When a process notices that the master is no longer responding to requests, it initiates an election. The process sends an ELECTION message to all processes with lower numbers. If no one responds, this process wins the election and becomes the coordinator (master). If a process with a lower number responds, that process takes over by sending an OK message to this process. The lower-numbered process then holds an election (if it is not already holding one). Eventually all processes except one give up, and that process becomes the master after a specified time period [3].

In our model, the process number corresponds to a unique identifier associated with each worm segment; the master generates this identifier and hands it to the worker when it spawns it off. However, the Bully algorithm cannot be used as such, because it assumes that every process knows which processes have lower numbers than its own. In our model, the worker worm segment initiating an election is not aware of other worm segments or their identifiers. Therefore we have incorporated some variations into the Bully algorithm to account for this lack of state information.

The actual working of our leader election algorithm is as follows. Whenever a worker notices that the master is down, it first marks itself as the current estimated coordinator and then broadcasts a LE message, with its identifier included, to all nodes in the computation pool. It also sets a timeout upon the completion of which it can assume the role of the coordinator, and it starts building up a state table in case it becomes the master.

If a worker that has not yet shifted to leader election mode receives this LE message, it compares the incoming identifier with the identifier of its coordinator. The reason for this check is that the coordinator may not have actually gone down. Until a worker identifies that its coordinator is down, there is no need for it to shift to leader election mode (unless the leader election has been initiated by a worm segment with a lower identifier than its current coordinator). If the coordinator is actually down, this worm segment will also soon notice the loss of contact with the coordinator and shift to leader election mode within a specified time period.

When a segment that has already moved into leader election mode independently receives a LE message, it compares the incoming identifier with its current (estimated) coordinator identifier. If the incoming identifier is lower, it accepts that worm segment as the current estimated coordinator and sends it a COORD_ACK message with a piggybacked value indicating whether it is running the encapsulated process. It also sets a timeout within which it should get a response from the other segment; if it times out, this segment rebroadcasts a LE message. This timeout covers the case where the segment that broadcast the LE message itself goes down before it assumes the role of coordinator. If the current coordinator identifier that this worm segment is aware of is lower than the incoming identifier, this segment sends a LE message carrying its current (estimated) coordinator identifier back to the worm segment from which the LE message came. This reply LE message indicates to the other segment that there is a segment more eligible to become the coordinator.

It may happen that the master is still up and the leader election was initiated unnecessarily. In such a case, the master should be able to respond appropriately. When the master receives a LE message, it shifts to leader election mode if the incoming identifier is lower than its own. If the master has the lower identifier, it simply sends a COORD message to the worm segment from which the LE message came, indicating that it is the master. When a segment in leader election mode receives a COORD message, it accepts the worm segment that sent the message as coordinator if that segment has a lower identifier than its current coordinator identifier; in this case, it shifts to worker mode. If this segment still identifies itself as the coordinator even after receiving the COORD message, it broadcasts a COORD message on the computation pool. On receipt of a COORD_ACK message, a segment in leader election mode simply makes an entry in the state table it is building up, if it is still the current estimated coordinator. Building up the state table in this way avoids the overhead of constructing it after a particular segment has assumed the role of the master.

The overall algorithm ensures that in the end only one segment identifies itself as the coordinator, and after a specified timeout during which its current estimated coordinator value does not change, that segment assumes the role of the master. Thus the worm segment with the lowest identifier becomes the next master. Once a segment has assumed the role of the master, there may still be some segments that have not yet succeeded in indicating their presence to the new master during the leader election process, so the master may not have an entry for them in its state table. In such a case, the new master will receive a late COORD_ACK or ping message from such a worker. These workers are simply killed by the master if it has already respawned a worm segment on that node as a worker.

The entire leader election process is illustrated in Figure 2 for a simplified case of a 3-node computation pool with just two possible broadcast ports: the master M0 goes down; workers 1 and 2 broadcast LE messages and enter leader election mode; worker 2 receives LE,1 and replies with COORD_ACK, while worker 1 receives LE,2 and ignores it; worker 1 then sends COORD to 2, assumes the master role as M1, and spawns a new worker W3 on the failed master's node.

Figure 2: Leader election (number of nodes = 3, number of broadcast ports = 2; M = Master mode, W = Worker mode, L = Leader Election mode, subscripts denote segment identifiers).
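The core of the per-message decision is a comparison of identifiers. The following sketch is our condensation of the rules above (the names are invented; state-table bookkeeping, timeouts and the piggybacked fields are omitted):

    /* Sketch of how a segment already in leader-election mode reacts to an
     * incoming LE message.  Lower identifiers win; my_estimate starts out as
     * the segment's own identifier. */
    enum msg_type { MSG_LE, MSG_COORD, MSG_COORD_ACK };

    struct le_state {
        unsigned my_id;        /* this segment's identifier                        */
        unsigned my_estimate;  /* identifier of the current estimated coordinator  */
    };

    void send_to(unsigned dest_id, enum msg_type type, unsigned id);  /* hypothetical */

    static void on_le_message(struct le_state *s, unsigned sender_id) {
        if (sender_id < s->my_estimate) {
            /* The sender is more eligible: adopt it as the estimated coordinator
             * and acknowledge (the real message also reports whether the
             * encapsulated computation is running here). */
            s->my_estimate = sender_id;
            send_to(sender_id, MSG_COORD_ACK, s->my_id);
        } else {
            /* A better candidate is already known: tell the sender about it. */
            send_to(sender_id, MSG_LE, s->my_estimate);
        }
    }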
5 Forward progress and checkpointing

Though replication ensures reliability, there is no forward progress if the computation starts from the beginning every time it is started up. A round-robin attack on the worm process on all machines would then ensure that the computation never terminates, even though at least one worm instance was running all the time. Therefore, to ensure forward progress, there has to be a mechanism to save the state of the computation periodically and to restart the computation from that saved state in the case of failure.

This saving of state can be done at two levels. The first is at the application level, where the application itself writes its current state to stable storage, from which it can restart in case of any failure. We can provide an API definition for the application developer to be able to checkpoint the process and migrate it to another node for restart. The functions in the API are implemented by the application developer and hence are application specific.
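The paper does not define this API, so the following is only a sketch of what it might look like, using the two function names mentioned later in this section:

    /* Hypothetical application-level checkpointing interface; the signatures
     * are ours and only illustrate the idea. */

    /* Implemented by the application: serialize enough state to resume later.
     * Returns 0 on success. */
    int write_checkpoint(const char *path);

    /* Implemented by the application: restore state from an earlier checkpoint.
     * Returns nonzero if no usable checkpoint exists and the job must start over. */
    int read_checkpoint(const char *path);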
With such an API, the computation at each node where the worm is running periodically checkpoints itself and stores the checkpoint on stable storage. If the worm dies on a machine, the worm segment managing that computation replica can then restart the computation from the last checkpoint, so the computation will eventually complete. The main advantage of this approach is that the checkpoint written by the application is architecture/OS independent. When restarting the worm from the checkpoint, we need not worry about starting the process on the same architecture/OS platform on which it ran earlier; all we need is the right application executable for each architecture in the computation pool. Thus we can shift the computation from one machine to another without requiring all the machines in the pool to be homogeneous. But this solution is undesirable for two reasons. Firstly, it is not transparent to the application developer: the user has to be aware of the API and must write the read_checkpoint and write_checkpoint functions, whereas we desire to put no such burden on the programmer who writes applications for our worm. Secondly, this approach will not work for already existing object files that do not have checkpointing capability.

The second approach, the one that we are using, is to do process-level checkpointing and migration. The idea is to save the process image. We are using the Condor checkpointing library with some modifications. The Condor checkpointing library vectors system calls into its own system call definitions. It provides a hook into the program so that when the program receives a SIGUSR2 signal it writes the process image into a file; it does so by changing the stack pointer to point to a temporary stack and executing the write from that stack [5]. This checkpoint file can be saved by the worm, and the computation can be restarted from that checkpoint on failure.

Several interesting issues arise when we execute the same computation on different machines. The most interesting is file handling. All the different replicas of the computation must be able to access all the files that they need. If the files are local, then we need some mechanism by which non-local processes can access these files. One way of doing this would be to have a representative process running on the machine where the file exists; computations running on other machines could have their file system calls relinked using the Condor library to contact the representative. This approach has two disadvantages. Firstly, we need information as to where the file exists. Secondly, if the representative goes down (because it failed, someone killed it, or the machine on which it was running went down), all other processes cannot make any progress. The second approach would be to ship all the files that the computation accesses when the process is started. But this requires that the process knows beforehand which files it wants to access, and the files may be too large to be shipped. The third alternative is to require that all file accesses be in some shared file space. We assume a shared file space.

Another issue in file handling is what to do if the computation replicas try to write to a file. If more than one process writes to the same file, the result is undefined as per UNIX file semantics, and the semantics of shared file systems vary across systems: AFS ensures an update on close, while NFS follows the UNIX semantics more closely and does an update on write. So we need to do some intelligent remapping of files. The approach that we use is as follows. Whenever the computation tries to open a file in write mode, we remap it to a different file name. This mapping between the actual name and the mapped name is stored for future reference, and all writes go to the mapped file. To be sure that the mappings of a file by two different processes on two different machines are different, the mapping function is based on the hostname of the executing process. Thus even if the computation is moved around several machines, all the files will still be visible, since they are in shared space. We have integrated this approach by modifying the Condor checkpointing library. On termination of the computation at some node, we move the mapped files at that node back to their original files.
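A minimal sketch of the remapping idea is shown below. The real logic lives inside the modified Condor library's open() wrapper; the shadow-name format and the helper names here are assumptions of ours:

    #include <stdio.h>
    #include <unistd.h>

    /* Divert a file opened for writing to a per-host shadow name in the same
     * (shared) directory. */
    static void remap_for_write(const char *orig, char *mapped, size_t len) {
        char host[256] = "unknown";
        gethostname(host, sizeof host - 1);
        /* e.g. "results.dat" becomes "results.dat.wormtmp.c2-042" (suffix format assumed) */
        snprintf(mapped, len, "%s.wormtmp.%s", orig, host);
    }

    /* On normal termination of this replica, each mapped file is renamed back
     * to the name the application originally asked for. */
    static int restore_mapping(const char *orig, const char *mapped) {
        return rename(mapped, orig);
    }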
To be able to use the checkpointing routines, the object files of the computation have to be linked with the modified Condor checkpointing library. A worm process running the computation periodically checkpoints its computation replica by sending the SIGUSR2 signal to the computation. Whenever a worm detects that its computation process has died, it restarts it from its last checkpoint. This ensures some forward progress of the computation. When the master wants to start a worker and have it run the computation, it supplies its own copy of the checkpoint file. There is no guarantee that the master's checkpoint file is the latest one, but we deemed that having the master obtain the latest checkpoint file could impose heavy overhead.

6 Termination

The worm can terminate in two ways.

1. A computation replica terminates: When the worm that manages this computation replica sees that the computation has terminated normally, it sends a COMP_TERM message to the master and also broadcasts this message on the computation pool. The other worms, on receiving this message, terminate their computation replica (if any) and then terminate themselves.

2. The kill switch file is detected: At the start, the worm is provided the name of a file in the shared space to use as a kill switch. Each worm segment checks periodically for the existence of this file. On detecting it, the worm segment terminates itself.

7 Implementation details

We have implemented our general framework for reliable computation on a cluster of 64 dual-processor SMP nodes running Linux 2.2.12-22. Each of the nodes is a 550 MHz Pentium III with 2 GB RAM.

Components of our Implementation

Figure 3 illustrates the major components of our implementation. As shown in the figure, the main entities of the implementation are the Communicator, the Dequeuer, the Dispatcher and the Checkpointer. All of these entities are implemented as separate threads using the pthread library implementation in Linux [7].

1. Communicator: This component handles all network communication. The Communicator binds to two UDP sockets and listens for messages on them. One socket is used for unicast communication (the communication between the master and the worker worms); the other is used for listening to broadcast messages on the computation pool. We do not use a fixed port for the unicast channel. When the master starts up a worker, it supplies its own unicast socket port as one of the arguments. The worker, upon binding its own unicast port, piggybacks this information onto the ping message that it sends to the master. Thus the master learns the worker's unicast channel endpoint on receipt of the first ping message from that worker. The broadcast socket port, however, is fixed: workers do not communicate among themselves and only the master learns each worker's endpoints, so broadcast messages must go to a port that is known in advance. This can be a potential drawback of the implementation, since anyone could bind to that port and prevent our worms from doing so. To prevent this we use a set of n port numbers as potential broadcast channel endpoints. If the worm process cannot bind to a particular port, it tries to bind to the next one; while broadcasting, we broadcast to all the possible broadcast ports. We use UDP as the underlying transport protocol. We preferred UDP to TCP for two reasons: first, we do not need persistent connections between the master and the workers; second, our algorithms do not require guaranteed delivery, and we can tolerate one or two packet losses. Upon receiving a message, the Communicator simply appends it to an internal message queue to be picked up later by the Dequeuer. An important point to note is that the Communicator thread itself does not act on the messages; it is thus free to listen for further messages.
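The broadcast-port fallback might be implemented roughly as follows (the port numbers are invented for illustration):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Candidate broadcast ports tried in order; the first free one is used.
     * These numbers are placeholders, not the ports used by the implementation. */
    static const unsigned short broadcast_ports[] = { 47001, 47002, 47003, 47004 };

    static int bind_broadcast_socket(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        for (unsigned i = 0; i < sizeof broadcast_ports / sizeof broadcast_ports[0]; i++) {
            addr.sin_port = htons(broadcast_ports[i]);
            if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == 0)
                return fd;              /* bound to this candidate port */
        }
        close(fd);
        return -1;                      /* every candidate port was taken */
    }

Trying a small fixed candidate set keeps the ports well known to all segments while making it harder for an adversary to deny the worm its broadcast channel by occupying a single port.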
2. Dispatcher: Our implementation uses several timers and corresponding timeouts. One way to handle these timeouts would be to spawn a thread for each timeout that wakes up at the appropriate time. Instead, we have centralized this functionality in the Dispatcher. Whenever any thread wants a task to be executed after a certain time, it supplies the task and the time to execute it to the Dispatcher. The Dispatcher keeps the list of functions to be executed sorted by the time at which each function needs to be called, and fires these tasks at the required time. However, if a dispatched task involves access to any global structures, the Dispatcher simply prepends a message to the message queue, which will be acted upon later by the Dequeuer. This avoids race conditions, as both the Dequeuer and the Dispatcher execute concurrently.

Figure 3: Implementation framework. The Communicator appends incoming messages to the message queue, the Dispatcher prepends deferred tasks to it, the Dequeuer removes entries and acts on them, and the Checkpointer checkpoints the computation.

3. Dequeuer: This thread acts as the serializer. It removes a packet from the head of the message queue and acts upon it. This is therefore the thread that does most of the work of the worm. The reason we chose to have a serializer was to avoid the need to lock and unlock state variables.

4. Checkpointer: The Checkpointer keeps track of the computation replica (if any) that is executing on the local machine. This thread is started only when it is decided to run a copy of the computation on this machine. It starts the computation replica (from the checkpoint file if one is provided) and periodically examines the status of the replica. The Checkpointer checkpoints the computation if it is alive. If the computation replica is dead, the Checkpointer tries to determine why it died. If it died because of a signal, the worm assumes that someone tried to kill it and restarts the process from the last checkpoint. If, on the other hand, it exited abnormally on its own, the Checkpointer assumes that there is some problem with either the checkpoint file or the executable, so it asks the master to supply a fresh copy of the computation executable and the latest checkpoint. This transfer of files is done over a TCP connection between the master and the worker. The worker sets up a persistent TCP server socket at startup, and this TCP port information is sent to the master piggybacked on the ping messages. When the master decides to supply the files to the worker, it connects to this port and sends them. On getting the required files, the worker restarts the computation from that checkpoint.
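The Checkpointer's periodic pass can be summarized as follows. The helper names are ours, and the way the exit status is obtained is deliberately abstracted away (Section 8 explains why a plain waitpid cannot be used):

    #include <signal.h>
    #include <sys/types.h>

    enum job_status { JOB_ALIVE, JOB_KILLED_BY_SIGNAL, JOB_FAILED, JOB_FINISHED };

    /* Hypothetical helpers standing in for the real mechanisms described in the text. */
    enum job_status query_job_status(void);
    void restart_from_checkpoint(const char *ckpt_file);
    void request_fresh_binary_and_checkpoint(void);
    void report_completion_to_master(void);            /* sends COMP_TERM */

    static void checkpointer_tick(pid_t job_pid, const char *ckpt_file) {
        switch (query_job_status()) {
        case JOB_ALIVE:
            kill(job_pid, SIGUSR2);               /* ask the Condor hook to dump a checkpoint */
            break;
        case JOB_KILLED_BY_SIGNAL:
            restart_from_checkpoint(ckpt_file);   /* assume someone tried to kill the job */
            break;
        case JOB_FAILED:
            /* the checkpoint file or the executable may be bad: fetch fresh copies */
            request_fresh_binary_and_checkpoint();
            break;
        case JOB_FINISHED:
            report_completion_to_master();
            break;
        }
    }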
Message types and packet structure

The types of messages used in our implementation are listed in Table 1.

IAA          Indicates to the master that the worker is still alive.
IAA_ACK      Response to IAA; indicates to the worker that the master is still alive.
KILL         Instructs a worm segment to kill itself.
LE           Indicates a shift to Leader Election mode by a worm segment.
COORD        Used by a worm segment to indicate that it has taken up the role of the master.
COORD_ACK    Sent by a worm segment in response to a LE message to indicate that it has accepted the destination worm segment as the master.
COMP_TERM    Indicates completion of the computation at a node.

Table 1: Message types and their use.

The packet structure we have used for these message types is shown in Figure 4.

Source worm segment identifier | Message type | Data (32-byte character stream)

Figure 4: Packet structure.

The Data field carries information piggybacked on the messages:

- in COORD_ACK messages, whether the encapsulated job is running at the source node, which helps the potential new master build state information during leader election itself;
- in IAA_ACK messages sent by the master to a worker, an indication to the worker worm segment to start running the encapsulated job;
- in IAA messages from workers to the master, information about the worker's state (whether it has acquired the computation binary, whether it is running the encapsulated computation, and its TCP port number for receiving binaries and checkpoint dumps).

State information

The state information maintained in the master mode is indicated in Figure 5. The master maintains a table of state information with one entry per worker. The first field in the record indicates the validity of the entry. The worker identifier is a unique value assigned to each worker when it is spawned; it uniquely identifies a worker worm segment and is used in the leader election process. The UDP port number and machine address allow the master to communicate with the worker. The Last IAA time field records when an IAA message was last received from the worker. The running field specifies whether the computation is running on the node on which the worker segment resides. The last two fields are used in the self-defense process. All the binaries (the worm binary and the computation binary) and checkpoint dumps are retained by the worm segments in memory in case they are wiped out on disk. The TCP port number is the port on which the worker listens for receiving these binaries and dumps. Executable received indicates whether the worker worm segment has received the computation binary from the master; if it is false, the master will have to spawn a thread to connect to the worker at the given TCP port number and transfer the binary.

Valid | Worker identifier | UDP port number | Machine IP address | Last IAA time | Running | Executable received | TCP port number

Figure 5: Information maintained in the state table at the master.
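Transcribed into a C sketch, the wire format of Figure 4 and a row of the state table of Figure 5 might look as follows (only the 32-byte data area has a width given in the paper; the other field types are assumptions):

    #include <netinet/in.h>
    #include <stdint.h>
    #include <time.h>

    /* Wire format sketched from Figure 4. */
    struct worm_packet {
        uint32_t source_id;    /* identifier of the sending worm segment        */
        uint32_t type;         /* IAA, IAA_ACK, KILL, LE, COORD, COORD_ACK, ... */
        char     data[32];     /* piggybacked information (see text)            */
    };

    /* One row of the master's state table, sketched from Figure 5. */
    struct worker_entry {
        int            valid;
        uint32_t       worker_id;            /* assigned when the worker is spawned       */
        uint16_t       udp_port;             /* worker's unicast endpoint                 */
        struct in_addr machine_addr;
        time_t         last_iaa_time;        /* when the last IAA (ping) arrived          */
        int            running;              /* computation running at that node?         */
        int            executable_received;  /* has it received the computation binary?   */
        uint16_t       tcp_port;             /* for shipping binaries and checkpoint dumps */
    };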
8 Defense mechanisms

One of the important goals of our project is for the worm to survive despite concerted attempts to kill it. Replicating the worm on every machine in the computation pool and respawning worm segments when they die is one way of ensuring this. The other is to keep our worm process from being visible to the rest of the system. The techniques we use for this include:

- zeroing out the argument vector at start-up and masquerading as some other process, which keeps the worm from being visible in the 'ps -ax' output;
- keeping all the required files, binaries and checkpoint dumps in memory, so that even if these are wiped out on disk the worm can continue work by transferring copies of them to the other nodes (to keep these files in memory we simply keep them open throughout the lifetime of the worm segment);
- maintaining minimum presence by unlinking all files after opening them;
- periodically reforking the worm process: the parent exits while the child continues to run, which effectively changes the process id of the worm (a sketch appears at the end of this section). By repeating this at regular short intervals, we ensure that someone trying to kill the worm cannot do so unless the cost of our forking exceeds the cost of their searching for and killing the worm process.

To study the cost of forking on Linux in our implementation environment, we performed a simple timing measurement using the gettimeofday system call, which has a resolution of 1 microsecond. We found the cost of a single fork to be 290 microseconds.

The computation/job is started from the worm process by a fork and exec combination. The status of the computation could then be determined with the waitpid system call in order to decide whether to restart or checkpoint the job. However, this method fails when the worm process changes its pid by reforking, since waitpid depends on parent-child semantics. To get around this problem, we remove the need for a parent-child relationship between the job and the worm process by using a message queue to communicate the status of the job to the worm process.
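The refork itself is small; a minimal sketch (ours, not the project's code) is:

    #include <sys/types.h>
    #include <unistd.h>

    /* Periodically called so the worm keeps changing its process id: the parent
     * exits immediately and the child carries on as the worm.  Because this
     * breaks the parent-child link to the job, the job's exit status is reported
     * through a message queue rather than waitpid, as described above. */
    static void refork(void) {
        pid_t child = fork();
        if (child < 0)
            return;        /* fork failed; simply try again at the next interval */
        if (child > 0)
            _exit(0);      /* parent disappears */
        /* child: continue running the worm under a new pid */
    }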
9 Conclusions and future work

We have described a general framework for building worms and for using this framework to encapsulate any computation so that it completes successfully with a high degree of reliability. We employ a master-worker model in which one worm segment is designated as the master worm and deals with repairing failed worm segments. In case the master goes down, the workers execute a leader election algorithm to elect a new master. The worms periodically checkpoint the replicated computation and restart it from the checkpoint, which enables the computation to make forward progress despite failures.

Our work can be extended in several ways. In the current version we do not support the use of sockets by the computation. Socket support can be added to our general framework as follows. In our master-worker model, we can designate the master as the granting machine: whenever a computation replica wants to send a message, it contacts the master with the message. Messages are tagged with a sequence number. If the master has already granted permission to send that message to some other replica, it ignores the message; otherwise the master sends the message. The receive operation on a socket can be handled similarly. Since the master can go down at any time, the socket that it uses has to be migrated to the new master; a lot of work has been done on socket migration [6]. Further, we can remove the limitation that all computation files be in shared space by replicating files across machines and making the worms running on those nodes the representatives for those files.

Further study is needed to explore the scalability of our current framework. On a large computation pool with several thousand machines, having one master may not scale well, since the master can become a bottleneck. Having a hierarchy of masters, where a central master controls a group of masters who in turn control sets of workers, is a potential solution in such an environment.

References

[1] F. B. Cohen, 'A Case for Benevolent Viruses', http://www.all.net/books/integ/goodvcase.html
[2] C. Schmidt and T. Darby, 'The What, Why, and How of the 1988 Internet Worm', http://www.software.com.pl/newarchive/misc/Worm/darbyt/pages/worm.html
[3] A. S. Tanenbaum, 'Distributed Operating Systems', Prentice-Hall, 1995.
[4] M. J. Bach, 'The Design of the UNIX Operating System', Prentice-Hall, 1986.
[5] M. Litzkow and M. Solomon, 'Supporting Checkpointing and Process Migration Outside the UNIX Kernel', in Usenix Conference Proceedings, San Francisco, CA, January 1992, pages 283-290.
[6] V. C. Zandy, B. P. Miller and M. Livny, 'Process Hijacking', http://www.cs.wisc.edu/paradyn/papers/hijack.pdf
[7] IEEE, Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) [C Language], IEEE/ANSI Std 1003.1, 1996 Edition.