Distributed Deadlock Detection Gilbert K. Cheung December 2004 Abstract – Without distributed shared memory, distributed systems are prone to deadlocks. Deadlock is a result of some uncontrolled sequence of release and request of resources among processes in a distributed system. This survey paper presents some system models and deadlock handling techniques to deal with the problem. Selected algorithms are also presented to see how distributed deadlocks can be detected. Index Terms – Distributed system, deadlock detection, wait-for-graph I. INTRODUCTION When computers start to work together, interesting problems arise. As there is no shared memory, deadlock is one of those problems. A primary motivation for using distributed systems is the possibility of resource sharing [5]. A process makes requests or release control of a resource in an unknown order in execution as a priori. In terms of locking mechanism, a process holds a lock of a resource when the process is controlling the resource. As a process can request and release resources in any order not prior known, deadlocks may occur in the system. A deadlock occurs when a set of processes are waiting for the lock of some resources held by other processes in the same set. In this paper, we are going to present several algorithms that handle deadlocks in a distributed system. A deadlock can be resolved by aborting one or more processes in the deadlocked-set and restart that process such that its previous state is resumed [2]. A process is aborted when all of the resources it is holding is released, and withdraw all the resource requests it has made. The rest of the paper is organized as follows. The system model for the deadlock detection in distributed system will be discussed in Section II. Section III contains the handling methods of deadlocks in a distributed system. We will show some selected algorithms for deadlock detection in Section IV. Discussion of deadlock detection is presented in Section V. Finally, Section VI concludes the paper. II. SYSTEM MODELS A distributed system is a network that consists of a set of sites inter-connected by communication links. Within a site, there are processes and resources. Resources are handled by a resource controller. In the system, multiple processes and resources can coexist in the same site. When they are not in the same site, they communicate with each other by messages. We assume that the message communication links are reliable and follows first-in-firstout ordering rule. The reliability and ordering of the communication can be easily implemented by using combinations of retransmissions, acknowledgements and sequence numbers. There is no guaranteed maximum time limit for message deliveries but process Pa can assert that the message sent by Pa to Pb will be delivered to Pb eventually. We further assume that processes aborts, failures and malicious behaviors are non-existing. At any given time, a process can be in a state either idle or executing. A process is idle when it is waiting for some resource or message that is occupied or not sent by other process. A process is executing when it is not idle. 1 Resource and Deadlock model Two deadlock models are interested in researches: Resource model and communication model [3]. We will discuss both models. Resource Model In resource model, processes make requests to acquire some resources, such as data objects in distributed database. Processes can simultaneously wait for several resources and cannot proceed until it acquired all the locks of those resources. A set of processes is resource-deadlocked if all processes in the set are waiting for some other processes in the set to release the lock of some resources [3]. The system model in [1] divided the communication message into 2 categories: computation message and control message. A computation message is any message that is sent because of the execution of an application process. It can either be REQUEST, REPLY, or CANCEL message. REQUEST message is sent to a resource controller when a process wants to access to that resource. REPLY message is sent to a requesting process when the resource controller has determined that the resource is ready to be access by that process. A CANCEL message is sent to a requested resource controller when the process decides to cancel the request made to access that resource earlier. A resource requested by a process is said to be in the dependent set of the process. A control message is any message that is sent because of the execution of the deadlock detection algorithm. We will discuss about the algorithms about deadlock detection in the following the resource model presented in [2]: 1) Resources are reusable 2) Resources are not duplicated 3) No two process can access a resource at the same time A process Pj is said to be dependent on another process Pk if there exists a sequence of processes P j, Pi(1), Pi(2), Pi(3),…, Pk, where each process in the sequence is idle and each process (except the first) in the sequence holds a resource for which the previous process in the sequence is waiting. Pj is said to be locally dependent on Pk if all the processes in the sequence is located in the same site [3]. Communication Model In communication model, the resources the processes are acquiring are messages. Blocked processes are those that are waiting for several arrivals of messages at the same time. A process can be unblocked when a subset of those messages it is waiting is arrived. A nonempty set of processes is communication-deadlocked if all processes in the set are permanently idle. A process is permanently idle if it never receives a message form any process in its dependent set [3]. However, permanent idle cannot be detected by timeout since we cannot say that a process A and B are deadlock on each other if B eventually send a message to A as the time B sends a message is unknown. In [3], Chandy, Misra and Haas defined a set of process S is deadlocked if: 1) All processes in S are idle. 2) The dependent set of every process in S is a subset of S. 3) There are no messages in transit between processes S. As all processes are idle and dependent on some processes in the same set, there can be no message transits within the set. Without message transits and all processes are dependent on some processes in the same set, all processes are idle. AND, OR, and P-out-of-Q Models A process changes its state from executing to idle when it waits for a reply for a sent REQUEST to a resource controller or waits for some messages. Two resource request models are presented to describe the behavior: AND model and OR model, in the work by Chandy, Misra, and Haas [3]. In AND model, an idle process changes its state to executing when all requests to its dependent set are replied. In OR model, an idle process 2 changes its state to executing when any request to its dependent set is replied. In particular, resource deadlocked processes are modeled by AND and communication deadlocks are modeled by OR. A P-out-of-Q model is a generalization of these two models. A process can change its state from idle to executing when it has received p REPLYs out of q REQUESTs sent, where p < q. In other words, AND model is generalized to P-out-of-Q model by setting p equals q, and OR model is generalized by setting p to 1. Graph-Representation of Deadlocks With the dynamic state of deadlocks during runtime, an effective data structure to represent a deadlock in a system becomes more important. The situation of deadlocks can be represented and visualized by Wait-ForGraphs (WFGs) [5]. A WFG is a directed graph (n, l) consists of n nodes and l edges. An edge is presented as below [6]: x P1 P2 where P1 is blocked and waiting for a resource x held by P 2. The edge represents a wait-for relationship. In the resource model, this edge does not mean that P 1 is communicating with P2 directly. P1 is communicating with resource control of x which determines the resource locks to processes. A deadlock occurs if and only if there is a cycle exists in the WFG of the system [5]. In the following example, P1, P2 and P3 are deadlocked as P1 P2 P3 P1. P1 P2 P3 III. DEADLOCK HANDLING There are 3 different approaches [5] to handle the possibility of deadlocks in a distributed system: deadlock prevention, deadlock avoidance, and deadlock detection. Deadlock prevention Deadlock can be prevented when all processes begin their execution after they have acquired all the resources that they need. Processes can assert that when they begin their execution, all resources are already controlled by them exclusively. This approach has a number of drawbacks [5]. First, it kills the concurrency, and thus the primary motivation of distributed systems [5]. For example, if there is only one resource in the system and all process will eventually need it, using the deadlock prevention approach, there can only be one process executing at any given time. Second, it is a potential deadlock source during the phase of resource acquiring. For instance, when a process P1 and P2 make requests to acquire resources R1 and R2 at the same time, the resource controller of R1 may establish the lock to P1 while resource controller of R2 establish the lock to P2. P1 and P2 are deadlocked in this phase as they are waiting for resource controller of R 2 and R1 to reply their requests respectively. Deadlock Avoidance Deadlock avoidance is an approach to deadlocks handling that a resource is granted to a process if the resulting global system state is safe for the lock to be granted [5]. Problems in this approach is that 1) every resources and processes have to maintain a big table of global state such that they can decide if a lock should be granted or not when it is requested. 2) There can be at most one process or resource controller in the system to decide if the global state is safe or not at any given time. If there is more than one process or resource controller is making the decision, they may give a false conclusion of the global state where they all found the global state as safe but the 3 net global state may not be safe. In [7], Wojcik, B. E. and Wojcik, Z. M. pointed out that it is impractical in needing some knowledge a priori that is not available in a distributed organization and resulting in very high contentions for communication channels. Deadlock Detection In deadlock detection, there is no control of how and when the processes should acquire locks to resources. The probe or query computation is a deadlock detection sequence of messages, separated from the underlying computation. The detection algorithm thus can be run concurrently with the computation. Any circular waits (hence cycles in WFG of the system) is only the necessary condition for deadlock in the communication model [7]. The algorithm then resolves the deadlock detection by using its deadlock resolution algorithm. Types of Deadlock Detections There are three types of deadlock detections: 1. Centralized deadlock detection 2. Distributed deadlock detection 3. Hierarchical deadlock detection Centralized Deadlock Detection In this type of detection, a designate site, called the control site, has the responsibility of constructing the global WFG of the system [5]. Cycles are searched by this site and resolved by the control site. It is conceptually simple to implement such a system. Since the control site has the full picture of the system, optimal decisions can be made. However, this approach suffers from drawbacks that centralized deadlock control site is a single point of failure. As the system consists of a larger number of sites and processes, the centralized control site has to serve a larger number of processes, message traffics in that site will be increased and the computational load of the cycle search algorithm will be increased. These affect the performance of the system as well as the stability of it. Distributed Deadlock Detection In distributed deadlock detection, processes are responsible to detect the deadlock by themselves [5]. They utilize control messages between the processes to detect deadlocks. This type of detection enjoys the concurrency of the algorithm as well as the tolerance to process failures. However, this type of detection also suffers from a number of drawbacks. First, as the messages between processes are asynchronous and the system is dynamic, a distributed algorithm solving the problem is hard to implement and design correctly. Second, this type of algorithms is not as efficient as the centralized type because no processes can have the full picture of the WFG. Third, all processes need to run the deadlock detection algorithm continuously and concurrently with the underlying computation. While it is possible that the algorithm would use a small portion of the computational resource, it is a performance leak. Hierarchical Deadlock Detection In hierarchical deadlock detection, sites are arranged into clusters hierarchically [5], where sites detect deadlocks that involve only its descendant sites. Hierarchical algorithms tend to get the best out of the two types of deadlock detection algorithms presented above. There is no single point of failure and sites are not going to be overloaded with the deadlock detection algorithm when it is unnecessary for deadlock detection. This kind of algorithm make uses of the access patterns of the system in order to design the hierarchy of the clusters so that deadlocks are as localized in a cluster as possible. This is one of the biggest challenges in implementing such kind of detection algorithm. In the following section, we are going to discuss some algorithms in distributed deadlock detection. 4 IV. DISTRIBUTED DEADLOCK DETECTION ALGORITHMS Obermarck’s Algorithm Obermarck’s algorithm in deadlock detection is a path-pushing algorithm [2], [4]. Path-pushing algorithms are those that information of the global WFG is distributed in the form of paths. It was developed for the distributed database system R* of the IBM Corporation. In a distributed database system, the computation is done in a set of participating sites. A transaction is an abstraction for the application processing performed to take the database from one consistent state to another consistent state in a way that it can be viewed as atomic [4]. Transactions involve agents in different sites. Therefore, given a site, there can be multiple agents in the site working on different or the same transaction. When the first transaction is waiting for the second transaction and the second transaction is waiting for the first transaction at the same time, the system is said to be deadlocked. In [4], the author used the term Transaction Wait-For-Graph (TWFG) to stress the difference between transaction model and communication model. However, the mechanism of the graph is roughly the same as WFG. Each transaction is represented by a globally unique identifier. Since each site consists of a number of agents that belongs to different transactions, the global TWFG is split into smaller parts when viewed by each site. In a given site S, the algorithm uses a non-existing virtual agent EXTERNAL (“EX”) to denote external sites that are not S. Only one T1 T2 edge is created in the TWFG no matter how many times transaction T1 is waiting for transaction T2. The algorithm at each site builds and analyzes a directed TWFG, where the nodes represent the transaction agents and the edges denote the wait-for relationships. When an agent is dependent (waiting for) some agents in external site, it is denotes by an edge to “EX”. Similarly, when an agent is determined that an agent in external sites is waiting for itself, it is represented by an edge from “EX” to that agent. For example, in a TWFG “EX” T1 T2 “EX” denotes a potential global deadlock that involves some external sites, where there is an agent in external sites is waiting for T 1, and T2 is waiting for some agent in external sites. The deadlock detectors are roughly synchronized when they exchange control messages. The control messages are in the form of strings. Strings can be thought as partial TWFG that a site S1 sends to site S2 when some agents in S1 are waiting for some agents in S2. For the previous example of TWFG, the site will send the string “EX”, TRANS1, TRANS2 to the site where T 2 is waiting for. Elementary cycles are cycles that do not involve EXTERNAL. These cycles are detected by each site in step 5 of the algorithm. When a cycle is detected, the site will choose a victim in the cycle such that it will be removed. Algorithm For each site S, 1) Build a wait-for-graph using the transaction-to-transaction wait-for relationships. 2) Receive any strings of nodes transmitted from other sites and add them into the wait-for-graph. a. For each transaction identified in the string, create a node of the TWFG if none exists at this site. b. For each transaction identified in the string, starting with the first, create an edge to the node representing the next transaction in the string. 3) Create wait-for edges from EXTERNAL to each node representing a transaction’s agent that is expected to receive on a communication link. 4) Create wait-for edges to EXTERNAL from each node representing a transaction’s agent that is expected to send on a communication link. 5) Analyze the resulting graph, listing all elementary cycles. 6) Select a victim to break each cycle that does not contain the node External. As each victim is chosen for a given cycle, remove all cycles that include the victim. a. Site must remember the transaction identifier of the victim such that it can discard strings received involves the victim. b. If the victim transaction has an agent at this site, then the fact that the transaction was chosen as a victim must be transmitted to each site known to contain an agent of the victim transaction. Otherwise, the site has to transmit the fact to each site that sends a string containing the victim’s identifier to S. 5 7) Examine each remaining cycle that contains the node External. If the transaction identifier of the node External is waiting for is greater than the node that waits for External, then a. Transform the cycle into a string, which starts with “EX” and terminates with a node identifier that identify the node waiting for External in the site. b. Send the string to each site which the terminating node in the string is waiting for. T1 T5 T5 T1 S1: EX T1 T5 EX S2: EX T1 T5 EX False Deadlocks False deadlocks are deadlocks that are detected but do not really exist in the system. The algorithm assumed that the state of the system is static and strings are propagated to other sites to determine a deadlock. However, it is obvious that the state of a distributed system can be rather dynamic. A local TWFG may not be a real picture of the global state. As shown in the above figure, suppose a deadlock detection algorithm determined that T1 T5 exists in site S1. At the time the detection algorithm is querying the T 2, T3 and T4, T5 can release its locks and T1 waits is finished. However, the detection algorithm will not know about that and sends the string to other site S 2. When S2, with TWFG T1 T5, received such a string, it will determine that a deadlock is existing T 1 T5 T1, while it is not necessarily true. Based on the fact that real deadlocks will persist until broken, when a deadlock cycle is detected, validation of the transaction wait-for relationships in the cycle can be done. If the relationship returns to the starting node after following the wait-for edges in the cycle, a deadlock can be concluded. Because of step 6 in the algorithm, the transactions are totally ordered. This reduces the number of messages transfers and decreases the deadlock detection overhead [2]. This also ensure that there will be exactly one transaction in each cycle detects the deadlock. Chandy-Misra Algorithm In Chandy-Misra algorithm for deadlock detection, two algorithms are given. First algorithm is for AND models, and second algorithm is for OR models. 1) Algorithm for AND model (Resource Model) This algorithm is an example of edge-chasing algorithm [2]. For an edge-chasing algorithm, the existence of a cycle in a WFG can be detected by propagating special control messages called probes along the edges of the wait-for-graph [8]. Probes are concerned with the deadlock detection and are distinct from other computation messages. When the initiator init receives a probe that is originated from itself, it can be determined that there is a cycle in the graph and thus, a deadlock exists. A probe is a triple [3] in the form of (i, j, k) denoting that Pi is the initiator of the probe, and Pj is the sender of the probe and Pk is the receiver of it where Pj and Pk are not in the same site. Pj sends the probe to Pk when the following conditions exist: a. Pj is idle, b. Pj is waiting for Pk, c. Pj has determined that Pi is dependent on Pj. 6 Pk can either accept or discard the probe. P k accepts the probe if and only if: a. Pk is also idle and is waiting for some other processes, b. Pk did not know that Pi is depending on it, c. Pk now knows that Pi is depending on it, otherwise, Pk discards the probe. Coloring the edges To prove the correctness of the algorithm, coloring of edges is used [8]. gray if Pi has sent a request to Pj that Pj has not yet received; black if Pj has received a request from Pi but has not yet sent a grant message to P i; white if Pj has sent a grant message to Pi but Pi has not yet received it. Gray and black edges are called dark edges. When a dark cycle exists, we can see that the cycle will persist. Algorithm Each resource controller maintains an array of dependent k for each process P k, where dependentk(i) is true only if Pk’s site knows that Pi is dependent on it (i.e., Pi Pk). Default: dependentk(i) = false for all k and i. Initiation of probe by idle process P i: if Pi is locally dependent on itself then declare that Pi is deadlocked else for all Pa, Pb such that a. Pi is locally dependent on Pa, and b. Pa is waiting for Pb, and c. Pa, Pb are on different sites. send probe (i, a, b) On receiving probe (i, j, k) for Pk: if a. Pk is idle, and b. dependentk(i) = false, and c. Pk has not replied to all requests of Pj, then dependentk(i) = true if k == i then declare that Pi is deadlocked else for all Pa, Pb such that a. Pk is locally dependent on Pa, and b. Pa is waiting for Pb, and c. Pa, Pb are on different sites. send probe (i, a, b) When a process Pk in the site becomes executing: set dependentk(i) = false for all i 7 2) Algorithm for OR model (Communication Model) This algorithm is designed for the communication model, where the process can change from idle to executing when it receives some number of replies for the requests sent. This algorithm is an example of diffusion computation based algorithm [8]. The base idea of a diffusion computation based algorithm is that it is activated by a process that suspects a deadlock. Query and reply messages are used to detect the activity of the processes in the dependent set. Query messages are in the form of (i, m, j, k) which denotes that it is the mth query sent by initiator process Pi, and the query message is sent by process P j to process Pk. A reply message is also in the form of (i, m, j, k). When the initiator Pi initiates the deadlock detection, it sends query message to all processes in its dependent set. When an active process received a query message, it discards it. When an idle process P k receives a query message (i, m, j, k), and if the message is received for the first time, called engaging query, then Pk forwards the query message to its dependent set. These properties follow [3]: 1. 2. If the query message received by Pk is not an engaging query, and Pk is still in idle, it immediately replies the message to Pj. If process Pi is deadlocked when it initiates its mth query, then it will receive reply (i, m, j, i) corresponding to every query (i, m, i, j) it sent. If the initiator process Pi received reply (i, m, j, i) corresponding to every query (i, m, i, j) that it sent, then it can declare itself as deadlocked. In the following algorithm, each process has to maintain these four arrays: a. latest(i) the largest sequence number m in any query (i, m, j, k) sent or received by P k. Initially, latest(i) = 0 for every i. b. engager(i) i ≠ k, the identity of the process which casued latest(i) to be set to its current value m by sending Pk the message query (i, m, j, k). c. num(i) total number of messages of the form query (i, m, k, j) sent by P k, minus total number of messages of the form reply (i, m, j, k) received by Pk, where m = latest(i). num(i) = 0 means that Pk has received replies to all queries that P k sent. d. wait(i) a boolean variable that is true if and only if P k has been idle continuously since latest(i) was last updated. Initially, wait(i) = false for all i. Algorithm Initiation of query by an idle process P i: latest(i) = latest(i) + 1 wait(i) = true send query (i, latest(i), i, j) to all processes Pj in Pi’s dependent set num(i) = the number of processes in Pi’s dependent set For all processes Pk: if executing, wait(i) = false for all i discard all query messages On receiving a query message (i, m, j, k) sent from Pj to Pk: if m > latest(i) then latest(i) = m engager(i) = j wait(i) = true for all proceses Pr in Pk’s dependent set send query(i, m, k, r) num(i) = number of processes in Pk’s dependent set else if (wait(i) == true) and (m == latest(i)) 8 then send reply (i, m, k, j) to Pj On receiving a reply message (i, m, r, k) for Pk: if (m == latest(i)) and (wait(i) == true) num(i) = num(i) – 1 if num(i) == 0 then if i == k then declare Pk is deadlocked else send reply (i, m, k, j) to Pj where j = engager(i) To illustrate the two cases where the replies are sent, following example is constructed. 5 1 3 4 6 7 2 8 Example In the example, WFG is given as above. At node 3 and node 6, replies are sent by to 2 when they receive a query message that they have received before. It indicates that there is a cycle in the WFG and node 3 and node 6 are participants of the cycles. Therefore, a reply is sent when a query is received with wait(i) == true and latest(i) == m. At node 2, when the number of queries sent equal to the number of the replies received, it sends back a reply to its engager. In this example, it is shown that node 2 sends a reply to node 1 when it receives 2 replies from node 3 and node 6, indicating that there are cycles in its dependent set and thus deadlock is unpreventable in this situation. Therefore, when node 1 receives a reply from node 2, it will declare it is deadlocked. Kshemkalyani-Singhal Algorithm Kshemkalyani-Singhal algorithm [1] is an example of global state detection based algorithm [2]. The works in this area are largely based on results by Chandy and Lamport [8]. The key to global state is that we are going to find a consistent global state without freezing of the underlying computation. Consistent global state is a state that is possible to exist in the system. Chandy and Lamport show how to obtain a consistent global state of the computation by propagating markers along the links of the system. A maker separates the messages in the links into those to be included in the snapshot (i.e. channel state or process state) from those not to be recorded in the snapshot. It acts as delimiters for the messages in the channels so that the channel state recorded by the process at the receiving end of the channel satisfies the condition that if a message m is sent by i to j, if sending m is not in the consistent state, receive event of m in j should not in the consistent state and m should not in the channel state [9]. After finding such a consistent global state, the algorithm The algorithm consists of a single phase. The single phase comprises a diffusion of messages outward from the initiator process along the edges of the WFG (outward sweep), and then the echoing of diffusion messages inward to the initiator process (inward sweep). During the outward sweep of messages, the system records a 9 snapshot which each process on receiving the message records its local state. During the inward sweep of messages, the system reduces the snapshot in a way that it can simulate the possible future of the system in terms of unblocking events [1]. System Model In this algorithm, a distributed system consists of nodes. P-out-of-Q model is used so that a node in a WFG is reduced if P out of the Q requests on which it is blocked can be granted during the reduction process. Therefore, at the end of the process, those nodes that are not reduced are deadlocked. In this system, every two nodes are connected logically and there is no shared memory in the system so that every communication between the nodes is done by sending and receiving messages. A node i has the following local variables to record its state: ti: t_blocki: ini: outi: pi: wi: waiti: integer, current local time integer, the time at which node I was last blocked set of integer, set of nodes whose requests are outstanding at node i set of integer, set of nodes for which node I is waiting since t_block i integer, the number of replies required for unblocking real, weight for termination of deadlock detection algorithm detection boolean, records the current status [2] Send REQUEST - Executed by process i when it blocks for a p i-out-of-qi requests. Pi and qi depend on the application and pi <qi. set Pi by the application for each node j of qi nodes on which i are depending on, do outi = outi {j} send REQUEST(i) to j t_blocki=ti Receive REQUEST - Execute by process i when it receives a REQUEST made by process k ini = ini {k} Send REPLY - Execute by process i when it replies to a request by process k ini = ini – {k} Receive REPLY(j) - Execute by process i when it receives a reply from process j to its corresponding request. A reply to an outdated request can be identified by the timestamp of the outdated request on the reply and is discarded outi = outi – {j} pi = pi -1 if pi == 0 then t_blocki = 0 for every j in outi, send CANCEL(i) to j outi = 0 Receive CANCEL(k) - Executed by process i when it receives a CANCEL message from process k ini = ini – {k} When a process i is blocked, it initiates the deadlock detection algorithm with initiator = i. When there is more than one processes are blocked, it is possible to have multiple instances of the algorithm are running concurrently. Each invocation of the deadlock detection algorithm is treated independently and is identified by 10 the initiator’s process ID and the timestamp. Therefore, it is necessary for every node to maintain a snapshot initiated by other nodes. They need to maintain the latest snapshot only because if there is another call for the snapshot initiate from the same process i, that means that between the time process i initiated the first snapshot and the time process i initiated the second snapshot, process i is not idle and thus not deadlocked. Therefore, previous snapshots can be discarded. Basic Idea of the Algorithm We will describe the algorithm here. The detection algorithm makes use of three types of control messages: FLOOD, ECHO and SHORT. FLOOD messages are sent by the initiator to the processes in its dependent set in order to do the outward sweep of messages. Every process on receiving the first FLOOD message records its local state. If the process receiving the FLOOD message is idle, it will forward the FLOOD message to every process in its dependent set. If the process receiving the FLOOD message is not idle, it will initiate the reduction process of the algorithm by sending ECHO message back to the sender of the FLOOD message. On receiving an ECHO message, process i determines itself if it is deadlocked. If not, it will send ECHO message along the way it received the FLOOD message. These are done concurrently and there is no exact synchronization marker between the snapshot process and the reduction process in the algorithm. Reduction process of the algorithm works in the way that a node representing a process is reduced if it receives pi-out-of-qi ECHOs where qi is the number of FLOOD messages sent. It simulates the fact that the process will be unblocked as pi-out-of-qi processes in its dependent set sent / forwarded the ECHOs message to it, meaning that they are not blocked. Therefore, any node that is not reduced suffers from deadlock and those that are reduced are not deadlocked. The initiator declares itself as deadlocked if it is not reduced at the time of the termination of the algorithm. WFG reduction can begin at a non-leaf node before recording of the WFG is completed. This happens when ECHOs arrive to process i before process i sends all the FLOOD messages to processes in its dependent set. Therefore, the two activities of the algorithm are done concurrently in a single phase where serialization is not needed. Termination Detection To determine whether it is deadlocked or not, the initiator process requires a detection scheme for the termination of the deadlock detection algorithm. It is done by using weights and SHORT messages. For the initiator process, the weight is 1 when the algorithm is initiated. Weight is a real number from 0 to 1 inclusively. Whenever the process is sending FLOOD message along the paths to the processes of its dependent set, it gives a portion of the weight it has to those processes. In particular, if |DS(i)| denotes the number of processes in the dependent set DS(i) of process i, when process i sends a FLOOD message to those processes is DS(i), it attaches weight 1/|DS(i)| to every FLOOD messages. When the initiator receives returned weights equal to the weights it has been given out, it can determine that the algorithm has been terminated. On receiving subsequent FLOOD messages to process i, process i does not forward the FLOOD message to processes in its DS(i). Rather, it returns the weight attached in the subsequent FLOOD message back to the initiator using SHORT message. When a FLOOD message arrived at leaf node of the WFG, which represents a process i that is not idle (i.e. |DS(i)| = 0), the weight is returned by attaching it in the ECHO messages. When a non-leaf node receives returned weight through the ECHO message, and if it determines itself as unblocked by that ECHO message, it return the received weights by distributing it among the ECHOs that are sent by that node along the incoming edges in its WFG snapshot. When a non-leaf node receives a returned weight through the ECHO message and it does not unblock itself, it returns the weight back to the initiator through SHORT message. By this, the sum of weights in the system is always equal to 1. Algorithm Variables The variables used in the algorithms for process i: init: the initiator process identifier 11 As noted before, since it is possible to have multiple instances of the algorithm running concurrently, and the process needs to maintain the snapshot of the latest received initiation of the algorithm from each process, each node maintains LS where LS: array [1…n] of records where n is the number of processes in the system. A record in LS contains fields: LS[init].out: set of integers, nodes on which process i is waiting in the snapshot LS[init].in: set of integers, nodes on which waiting for process i LS[init].t: integer, the time when init initiated snapshot, = 0 initially LS[init].s: boolean, local blocked state as seen by the snapshot, = false initially LS[init].p: integer, value of pi as seen in the snapshot Messages: FLOOD(a, b, t, w): a is the sender of the message, b is the initiator process, t is the time the algorithm is initiated, and w is the weight attached in this message ECHO(a, b, t, w): a is the sender of the message, b is the initiator process, t is the time the algorithm is initiated, and w is the weight attached in this message SHORT(b, t, w): b is the initiator process, t is the time the algorithm is initiated, and w is the weight attached in this message Procedures executed by process i To initiate a snapshot: init = i wi = 0 LS[init].t = ti LS[init].out = out(i) LS[init].s = true LS[init].in = 0 LS[init].p = pi send FLOOD(i, i, ti, 1/|out(i)|) to each process j in out(i) On Receiving a FLOOD message (j, init, t_init, w) from j: if (LS[init].t < t_init) and (j is in in(i)) then LS[init].t = t_init LS[init].out = out(i) LS[init].s = waiti LS[init].in = {j} if waiti == true then LS[init].p = pi send FLOOD(i, init, t_init, w/|out(i)|) to each process k in out(i) else LS[init].p = 0 send ECHO(i, init, t_init, w) to j LS[init].in = LS[init].in – {j} else if (LS[init].t < t_init) and (j is not in in(i)) then send ECHO(i, init, t_init, w) to j else 12 if (LS[init].t == t_init) and (j is not in in(i)) then send ECHO(i, init, t_init, w) to j else if (LS[init].t == t_init) and (j is in in(i)) then if LS[init].s == false then send ECHO(i, init, t_init, w) to j else LS[init].in = LS[init].in {j} send SHORT(init, t_init, w) to init else if LS[init].t > t_init then discard the FLOOD message On Receiving ECHO message (j, init, t_init, w) from j: if LS[init].t > t_init then discard the ECHO message else if LS[init].t < t_init then discard the ECHO message else if LS[init].t == t_init then LS[init].out = LS[init].out – {j} if LS[init].s == false then send SHORT(init, t_init, w) to init else LS[init].p = LS[init].p – 1 if LS[init].p == 0 then LS[init].s = false if init == i then declare process i is not deadlocked exit send ECHO(i, init, t_init, w/|LS[init].in|) to all process k in LS[init].in else send SHORT(init, t_init, w) to init On Receiving SHORT message (init, t_init, w) if t_init < t_blocki then discard the SHORT message if t_init > t_blocki then 13 discard the SHORT message if t_init == t_blocki then if LS[init].s == false then discard the SHORT message else wi = wi + w if wi == 1 then declare that Pi is deadlocked abort V. DISCUSSION In distributed system that with no global shared memory and communication are all done by sending and receiving messages, it is a nontrivial task to design a correct algorithm to solve the problem of deadlocks. Algorithms are prone to errors because of the possibility of conclusion of an inconsistent state graph of the system through outdated messages [5]. Proof of Correctness The large number of errors in published algorithms addressing the problem of distributed deadlock detection shows the need for proof of correctness of the algorithms. There are two important conditions for correctness [5]: 1. 2. The algorithm must detect a deadlock if and only if that exists in the system. No false deadlocks. Deadlock Resolution A deadlock is resolved by aborting at least one process involved in the deadlock cycle [2]. The resolution faces two problems [2]: 1. 2. The process that detected a deadlock does not know all processes that are deadlocked. Concurrent deadlock resolutions can lead to unexpected results. Processes in a cycle in WFG initiate the deadlock detection and resolutions, and may lead to multiple processes reduction, which is inefficiently and thus undesirable. There are algorithms that use priority of processes to deal with the second problem shown above. Only the process with the highest priority in the cycle detects deadlocks and only the process with the lowest priority if being reduced. This does not imply any performance gain in terms of the choice of the process being reduced, but it solves the second problem. A process to be aborted will do the following: 1. The process must be aborted; all the resources held by it must be released. The state of the resources released must be restored to some reusable state for other processes. The resource controller must ensure that these released resources are going to be granted to some processes in the deadlocked cycle. 2. All deadlock information about the process reduced must be cleaned so that the processes will not assume its existence in the wait-for-graphs. 14 False Deadlocks Deadlock is a problem that cannot be easily detected and resolved correctly [2]. There are many published algorithms that claims to detect deadlocks but are vulnerable to false deadlocks. False deadlocks are deadlocks that detected by the algorithm but do not exist in the system. Deadlock persists until it is being broken. We can view that as a static problem. However, when we try to reduce the problem by deadlock resolution, the problem becomes dynamic. Suppose two deadlock detection algorithms are running concurrently initiated by different process in the system. If two detection algorithms detected 2 cycles which shares some edges and both algorithms try to resolve the cycles by break some edges in the cycles detected, false deadlock is likely to be detected when the first algorithm resolved the cycle, and the second algorithm still thinks there is a cycle in the system. Performance It is suggested that the performance analysis of the algorithms published are not receiving enough attention from the authors [2]. Algorithm performances that are measured in terms of number of messages exchanged are not reflecting the real performance. Singhal and Shivaratri [2] suggested that performance of algorithms should be measured in deadlock persistence time. It is the length of time of a particular deadlock persists in the system, given that a deadlock detection and resolution algorithm is running. There is a tradeoff between message traffic and deadlock persistence time. Other than the communication overheads and deadlock persistence time, measures such as space requirements in each process and the time complexity for the algorithm to run in each process should also be considered. There are many other factors that may affect the performance of the deadlock detection algorithm such as the requestrelease pattern of the system, the arrangement of the processes and so on. How these factors interact with the performance of the algorithm remains an open research topic. VI. CONCLUSION Distributed deadlock detection is an interesting problem in distributed systems. It involves both static and dynamic properties and a generalized proof of correctness is yet to be found. In this paper, we have discussed about ways to handle deadlocks in a distributed system. These are deadlock prevention, deadlock avoidance and deadlock detection. While it is possible to do prevention and avoidance for deadlocks, the required overhead is too large in a way that it is impractical to implement [2]. Deadlock detection is a popular research topic in this area. In deadlock detections, it can be done via centralized control site, distributed processes, or special arranged hierarchy of clusters. The algorithms we have discussed solve the problem of deadlock in two phases: 1) it constructs the WFG of the system, and 2), it searches for the cycles. Due to the lack of globally shared memory, the design of the algorithms is difficult because sites may report the existence of a global cycle after seeing segments of the cycle at different instants, even though all the segments never existed simultaneously. We have discussed four types of distributed deadlock detection algorithms, path-pushing, edge-chasing, diffusion, and global state algorithms. In path-pushing algorithms, wait-for paths are sent between deadlock detectors to construct the WFG. In edge-chasing algorithms, probes are used to be sent along the edges of the WFG. If a probe is received by its initiator, a cycle is detected. In diffusion algorithm, QUERYs and REPLYs are sent along the paths of the WFG. If an initiator receives all REPLYs for the QUERYs sent, it can declare itself as deadlocked. In global state detection algorithm, processes record their local snapshot when they received the QUERY message from the initiator. The global snapshot is reduced when processes that are not idle reply ECHOs. The initiator distributes a weight to its dependent set. When the weights it distributed are all returned, it 15 can determine that the algorithm is terminated. At that point of time, if it is not yet being reduced, it declares itself as deadlocked. While these algorithms present some of the most innovative ways to deal with the problem, it is by no means the only types of the algorithms studied. Though there is a large number of algorithms are published for the deadlock detection, there are areas that these papers are not yet addressed. Future research should focus on the performance and correctness of the proofs of the algorithms. 16 REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. 9. Kshemkalyani, A. D., and Singhal, M., “Efficient Detection and Resolution of Generalized Distributed Deadlocks,” IEEE Trans. on Software Engineering, January 1994. Singhal, M. and Shivaratri, N. G., “Distributed Deadlock Detection,” in Advanced Concepts in Operating Systems, McGraw-Hill, Inc., 1994, pp. 151-177. Chandy, K. M., Misra, J., and Haas, L. M., “Distributed Deadlock Detection,” ACM Trans. on Computer Systems, May 1983. Obermarck, R., “Distributed Deadlock Detection Algorithm,” ACM Trans. on Database Systems, June 1982. Singhal, M., “Deadlock Detection in Distributed Systems,” IEEE Computer, November 1989. Choudhary, A. N., Kohler, W. H., and Stankovic, J. A., “A Modified Priority Based Probe Algorithm for Distributed Deadlock Detection and Resolution,” IEEE Trans. on Software Engineering, January 1989. Wojcik, B. E., and Wojcik, Z. M., “Sufficient Condition for a Communication Deadlock and Distributed Deadlock Detection,” IEEE Trans. on Software Engineering, December 1989. Knapp, E., “Deadlock Detection in Distributed Databases,” ACM Computing Surveys, December 1987. Kshemkalyani, A. D., Raynal, M., Singhal, M., “An Introduction to Snapshot Algorithms in Distributed Computing,” Distributed System Engineering, 1995. 17