IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 5, MAY 1992

Implementation of On-Line Distributed System-Level Diagnosis Theory

Ronald P. Bianchini, Jr., Member, IEEE, and Richard W. Buskens, Student Member, IEEE

Abstract- There has been significant theoretical research in the area of system-level diagnosis. This paper documents the first practical application and implementation of on-line distributed system-level diagnosis theory. Proven distributed diagnosis algorithms are shown to be impractical in real systems due to high resource requirements. A new distributed system-level diagnosis algorithm, called Adaptive DSD, is shown to minimize network resources and has resulted in a practical implementation. Adaptive DSD assumes a distributed network, in which network nodes can test other nodes and determine them to be faulty or fault-free. Tests are issued from each node adaptively, and depend on the fault situation of the network. Test result reports are generated from test results and forwarded between nodes in the network. Adaptive DSD is proven correct in that each fault-free node reaches an accurate independent diagnosis of the fault conditions of the remaining nodes. No restriction is placed on the number of faulty nodes; any fault situation with any number of faulty nodes is diagnosed correctly. The Adaptive DSD algorithm is implemented and currently monitors over 200 workstations in the Electrical and Computer Engineering Department at Carnegie Mellon University. The algorithm has executed continuously for the past year, even though no single workstation has remained fault-free over that period.
Key results of this paper include: an overview of previous distributed system-level diagnosis algorithms, the specification of a new adaptive distributed system-level diagnosis algorithm, its comparison to previous centralized adaptive and distributed nonadaptive schemes, its application to an actual distributed network environment, and the experimentation within that environment.

Index Terms- Adaptive diagnosis, diagnosis algorithms, distributed computer networks, self-diagnosable systems, system-level diagnosis.

I. INTRODUCTION

CONSIDER a network of distributed nodes in an interconnect network. Each node is assigned a fault state, s_a, such that s_a ∈ {fault-free(0), faulty(1)}. Interconnect, or link, faults are not considered in this model. The fault situation of the network is the set of node fault states. Nodes perform tests of other nodes; a test of node n_b by n_a is identified t_ab such that t_ab ∈ {fault-free(0), faulty(1)}. The syndrome of the network is the set of all test results. Diagnosis is the determination of the fault situation of a network given its syndrome. Fig. 1 presents an example network. Node labels identify the fault state of each node. An arc from n_a to n_b represents a test performed by n_a of n_b and is labeled with the test result.

Fig. 1. Example diagnosis network.

The fault model of the network characterizes the outcome of test results given the fault status of the nodes involved in the tests. This work assumes the "symmetric invalidation" fault model of system diagnosis [12]. In this model, the outcome of a test performed by a fault-free node is accurate and equals the fault state of the node being tested. Tests performed by faulty nodes are inaccurate and results of such tests may be arbitrary.

Classical system-level diagnosis research [12], [4] assumes a central observer that performs diagnosis. This observer accurately receives all test results and uses the results to determine the fault situation. The work presented herein assumes a distributed model, such that each node performs its own local diagnosis of the network. In general it is not practical for every node to perform diagnosis by testing every other node in the network. Thus, each node performs tests of only a subset of the nodes and receives test result reports from other nodes about nodes that it does not test. Report validation is required since faulty nodes can distribute inaccurate test result reports. Typically, report validation requires additional testing. Fig. 2 illustrates an example of utilizing test result reports to perform diagnosis. In the figure, nodes n_k and n_l are faulty; all other nodes are fault-free. Node n_i tests both n_j and n_k and determines n_j to be fault-free and n_k to be faulty. Since n_i determines n_j to be fault-free, n_i can utilize test result reports from n_j. Thus, n_i can correctly diagnose n_l as fault-free and n_m as faulty without directly testing those nodes. Since n_i determines n_k to be faulty, n_i cannot diagnose the state of n_n.

Fig. 2. Forward fault-free paths from n_i.

System-level diagnosis research was introduced by Preparata, Metze, and Chien in [12]. They defined t-diagnosability as the ability to diagnose a fault situation with t or fewer faults given the syndrome of a network. They proved the following necessary condition given a fixed inter-node testing assignment: If a network is t-diagnosable then every node must be tested by at least t other nodes. Hakimi and Amin [4] proved that this condition is sufficient if no two nodes test each other.

Manuscript received July 8, 1991; revised December 4, 1991. The authors are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213. IEEE Log Number 9108213.
Hakimi and Nakajima [15] showed that the number of tests required for diagnosis could be reduced by dynamically adapting the testing assignment based on the fault situation. Their work assumed the presence of a central observer to determine which inter-node tests were necessary for diagnosis. There has been a large body of further theoretical developments [7], including the diagnosability of new failure modes [14] and research in distributed diagnosis algorithms [8], [9], [6], [1], [2].

The remainder of this paper is outlined as follows. Section II provides an overview of two important distributed system-level diagnosis algorithms: NEW-SELF and EVENT-SELF, and describes major drawbacks of implementing the algorithms. In Section III, the Adaptive DSD algorithm is introduced and implementation enhancements are given. A performance comparison between the major distributed diagnosis algorithms is made in Section IV. Section V describes an implementation of Adaptive DSD, and provides experimental results based on that implementation. Concluding remarks are given in Section VI.

II. DISTRIBUTED SYSTEM-LEVEL DIAGNOSIS

A. The NEW-SELF Algorithm

On-line distributed diagnosis algorithms are given in [8], [9], [6], [1], and [2]. The SELF distributed diagnosis algorithm was presented by Hosseini, Kuhl, and Reddy [8]. In that work it is assumed that the maximum number of faulty nodes is bounded by a predefined limit, t, and there is a fixed testing assignment such that a node is responsible for testing a fixed set of neighboring nodes. Fault-free nodes forward test result reports to neighboring nodes; reports reach nonneighboring nodes through intermediate nodes. No assumption is made about faulty nodes, which may distribute erroneous test result reports. Each node independently determines a diagnosis of the network utilizing the test result reports that it generates and receives. The NEW-SELF on-line distributed diagnosis algorithm is presented in [6].
The algorithm assumes a fixed inter-node testing assignment and is executed on-line, permitting node failure and repair. In the NEW-SELF algorithm each node tests its neighboring nodes and generates a test result report for each test result. The report is stored locally, overwriting previous reports concerning the tested node, and is subsequently forwarded to all testers of the testing node. The algorithm ensures the accuracy of test result reports by restricting the forwarding of these reports to occur between fault-free nodes. A node only accepts information from other nodes that it tests and determines to be fault-free. As evident from this specification, valid test result reports are forwarded between fault-free nodes in the reverse direction of tests performed by the nodes. The following testing and report validation scheme is utilized:

1) n_i tests n_j as fault-free,
2) n_i receives test result reports from n_j,
3) n_i tests n_j as fault-free,
4) n_i assumes the diagnostic information received in Step 2) is valid.

This scheme assumes that a node cannot fail and then recover from that failure in an undetected fashion during the interval between two tests by another node. For correct diagnosis, the NEW-SELF algorithm requires that every fault-free node receives all test result reports generated by all other fault-free nodes. It is proven in [6] that this condition is satisfied if each node is tested by at least t + 1 other nodes. This is shown since a fault-free node is guaranteed to be forwarding test result reports to at least one other fault-free node if it is tested by t + 1 nodes, where the number of node failures is restricted to t or fewer.

Although provably correct, the NEW-SELF algorithm has considerable drawbacks for implementation on a large number of distributed workstations. Consider a network of N nodes that is required to be t-diagnosable. The algorithm requires at least N(t + 1) tests since each node is tested by at least t + 1 other nodes.
The number of messages required to transfer the test result reports is N^2(t + 1)^2. This is shown since each node generates one test result report for each of the t + 1 nodes it tests, resulting in N(t + 1) reports. Every node subsequently forwards all of the messages that it receives to its t + 1 testers. The number of messages required for algorithm execution is considerable, even for small networks. For example, a network of N = 8 nodes with t = 2 diagnosability requires 576 messages.

B. The EVENT-SELF Algorithm

The EVENT-SELF algorithm [1] extended the NEW-SELF algorithm by addressing the resource limitations of actual distributed networks. This algorithm utilized "event driven" forwarding of test result reports to lower the number of messages required by the algorithm. Test result reports are only forwarded by a node if they differ from reports already stored at the node. In this manner, only reports that signify a new fault event in the network get forwarded. It is proven in [1] that there are only two situations when a node must forward test result reports to its testers. The first is when a differing test result report is received by the node. This is a report indicating that a fault event has occurred, representing a node becoming either faulty or fault-free. In this case, the report is forwarded to all of the testers of the reporting node. The second case occurs when a node diagnoses one of its testers as faulty, and then receives a report that the tester is fault-free. In this situation, the node must forward all of its test result reports to that tester, to ensure that the tester has all current test result reports and can perform its own diagnosis.

Using this approach the EVENT-SELF algorithm message count is significantly reduced from that of NEW-SELF. In steady state, only tests are performed and no test result reports are forwarded in the network.
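The NEW-SELF resource counts quoted above are easy to check numerically. The sketch below (function names are ours) evaluates the two formulas from Section II-A for the worked example:

```python
# Back-of-the-envelope check of the NEW-SELF resource counts: each node is
# tested by t+1 others, and every report is relayed through t+1 testers.
def new_self_tests(n, t):
    """Number of tests: N(t + 1)."""
    return n * (t + 1)

def new_self_messages(n, t):
    """Number of report messages: N^2 (t + 1)^2."""
    return (n * (t + 1)) ** 2

print(new_self_tests(8, 2))     # 24 tests
print(new_self_messages(8, 2))  # 576 messages, matching the example in the text
```

The quadratic growth in both N and t is what makes the fixed-assignment scheme impractical at the scale of hundreds of workstations.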
Additional messages are required only when a node changes state. This is denoted by Δf, the change in the number of faulty nodes, f. For a t-diagnosable network the number of messages required for test result reports is reduced to Δf N t^2.

A significant improvement in diagnosis latency is also possible using EVENT-SELF. Diagnosis latency is the time from the detection of a fault event to the time when all nodes correctly diagnose the event. The diagnosis latency of the SELF algorithms corresponds to the time required to forward the test result reports corresponding to the fault event to all nodes. In NEW-SELF, when new messages arrive at node n_i from n_j, validation of the messages must wait until another test of n_j is accomplished. In EVENT-SELF, validation tests for messages received from n_j are initiated as soon as the messages arrive. Using this scheme, message validation time can be significantly less than a testing period.

Fig. 3. Limited diagnosability.

Fig. 4. Data structure maintained at node n_2.

C. Drawbacks of the SELF Algorithms

There are two significant drawbacks to the SELF algorithms. The first drawback is illustrated in Fig. 3 and concerns limited diagnosability. For any nonadaptive diagnosis algorithm, only a limited number of node failures can be diagnosed. For a t-diagnosable network, correct diagnosis is guaranteed for t or fewer faults. The two faulted nodes, shown shaded in the figure, result in test result reports that are not forwarded to all fault-free nodes.

The second major drawback of the SELF algorithms concerns redundancy, in terms of both inter-node testing and report forwarding. For t-diagnosable systems, each node must be tested by at least t + 1 other nodes. Ideally, each node need be tested by only one fault-free node to ensure correct diagnosis; thus all but one of the t + 1 tests are redundant. Also, since each node is tested by at least t + 1 nodes, test result reports are forwarded along redundant paths.
Ideally, only one forwarding path is required.

III. ADAPTIVE DISTRIBUTED SYSTEM-LEVEL DIAGNOSIS

The Adaptive DSD algorithm differs considerably from the SELF algorithms in that the testing assignment is adaptive and determined by the fault situation. Node failures and repairs are considered; link failures are not. The Adaptive DSD algorithm further differs from the SELF algorithms in that the number of nodes in the fault set is not bounded. The remaining fault-free nodes correctly diagnose the fault states of all nodes in the system.

A. Algorithm Specification

An example of the data structure required by the Adaptive DSD algorithm is shown in Fig. 4. The array TESTED-UP_x is maintained at each node n_x. TESTED-UP_x contains N elements, indexed by node identifier, i, as TESTED-UP_x[i], for 0 <= i <= N - 1. Each element of TESTED-UP_x contains a node identifier. The entry TESTED-UP_x[i] = j indicates that n_x has received diagnostic information from a fault-free node specifying that n_i has tested n_j and found it to be fault-free. Fig. 4 shows the TESTED-UP_2 array maintained at n_2 for an eight node system with n_1, n_4, n_5 faulty. Note that "x" represents an entry that is arbitrary.

The Adaptive DSD algorithm executes at each node by first identifying another unique fault-free node and then updating local diagnostic information with information received from that node. Functionally, this is accomplished as follows. List the nodes in sequential order, as (n_0, n_1, ..., n_{N-1}). Node n_x identifies the next sequential fault-free node in the list, sequentially testing consecutive nodes n_{x+1}, n_{x+2}, etc., until a fault-free node is found. Diagnostic information received from the tested fault-free node is assumed to be valid and is utilized to update local information. All addition is modulo N so that the last fault-free node in the ordered list identifies the first fault-free node in the list.

The Adaptive DSD algorithm is given in Fig. 5. Each node n_x executes the algorithm at predefined testing intervals. Instructions 1 and 2 identify n_y as the first fault-free node after n_x in the ordered node list. The test at Step 2.3 evaluates to "fault-free" if n_y has remained fault-free since the last test by n_x, including the period required for n_y to forward TESTED-UP_y in Step 2.2. This ensures that the diagnostic information included in TESTED-UP_y received at Step 2.2 is accurate. Instructions 3 and 4 update local diagnostic information dependent on both the fault-free test of n_y and the diagnostic information received from n_y. Instruction 3 asserts TESTED-UP_x[x] = y, specifying that n_x has tested n_y and determined it to be fault-free. In Instruction 4, all other elements of TESTED-UP_x are updated to the values of TESTED-UP_y. Thus, the diagnostic information contained in the TESTED-UP arrays is forwarded between nodes in the reverse direction of tests. In this example, the information is forwarded from n_y to n_x. Note that Step 4.1 prevents a node from replacing diagnostic information that it determines through normal testing procedures with information that it receives from other fault-free nodes.

/* ADAPTIVE_DSD                                          */
/* The following is executed at each n_x, 0 <= x <= N-1, */
/* at predefined testing intervals.                      */
1.     y = x;
2.     repeat {
2.1.       y = (y + 1) mod N;
2.2.       request n_y to forward TESTED-UP_y to n_x;
2.3.   } until (n_x tests n_y as "fault-free");
3.     TESTED-UP_x[x] = y;
4.     for i = 0 to N - 1
4.1.       if (i != x)
4.1.1.         TESTED-UP_x[i] = TESTED-UP_y[i];

Fig. 5. The Adaptive DSD algorithm.

Fig. 6. Example system and test set.

Since n_x continues testing nodes in Step 2 until a fault-free node is found, the test set is dependent on the fault situation. The test set of an example system of eight nodes is shown in Fig. 6.
In the example, n_1, n_4, and n_5 are faulty; all other nodes are fault-free. The Adaptive DSD algorithm specifies that a node sequentially tests consecutive nodes until a fault-free node is identified. For example, n_0 tests n_1, finds it to be faulty and continues testing. Subsequently, n_0 tests node n_2, finds it to be fault-free and stops testing. Node n_2 finds n_3 to be fault-free and stops testing immediately. Node n_3 must test three nodes before it tests a fault-free node. The TESTED-UP_2 array maintained at n_2 for this example is shown in Fig. 4.

Diagnosis is accomplished at any node n_x by following the fault-free paths from n_x to other fault-free nodes. The Diagnose algorithm to be executed by a node n_x is given in Fig. 7. The algorithm uses the information stored in TESTED-UP_x to diagnose the system. Its results are stored in an array, STATE_x, where STATE_x[i] represents the diagnosed state of node n_i. For correct diagnosis, STATE_x[i] must equal the actual fault state of node n_i for all i. It is proven in Section III-B that, after execution of Adaptive DSD, entries of TESTED-UP_x corresponding to fault-free nodes are the same at each fault-free node. The Diagnose algorithm utilizes these fault-free entries of TESTED-UP_x and operates as follows (refer to Fig. 7). Initially, all nodes are identified as faulty in Step 1. In Step 2, node_pointer is set to x, the identifier of the node executing Diagnose. Step 3 of the algorithm traverses the forward fault-free paths in the test set, labeling each of the nodes visited as fault-free. This is accomplished by setting STATE_x[node_pointer] to fault-free and then setting node_pointer to TESTED-UP_x[node_pointer], which identifies the next sequential fault-free node in the system. Step 3 is repeated until node_pointer is set to every fault-free node and returns to x. The correctness proof of the Adaptive DSD algorithm is given in Section III-B.
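The testing loop of Fig. 5 and the Diagnose procedure of Fig. 7 can be exercised together in a small simulation. The following is a sketch under our own assumptions (a synchronous round harness, an oracle set of faulty nodes so that tests by fault-free nodes are accurate, and Python names of our choosing), using the eight-node example with n_1, n_4, n_5 faulty:

```python
# Sketch: simulate Adaptive DSD testing rounds (Fig. 5) and Diagnose (Fig. 7)
# for the 8-node example fault situation {n1, n4, n5}.
N = 8
FAULT_FREE, FAULTY = 0, 1
faulty = {1, 4, 5}                       # true fault situation (test oracle)

def adaptive_dsd_round(x, tested_up):
    """One execution at node x: test successors until one is fault-free."""
    y = x
    while True:
        y = (y + 1) % N                  # Step 2.1: next node, modulo N
        if y not in faulty:              # Step 2.3: test evaluates "fault-free"
            break
    tested_up[x][x] = y                  # Step 3: n_x has tested n_y fault-free
    for i in range(N):
        if i != x:                       # Step 4.1: keep locally obtained entry
            tested_up[x][i] = tested_up[y][i]

def diagnose(x, tested_up_x):
    """Fig. 7: walk the fault-free cycle recorded in TESTED-UP_x."""
    state = [FAULTY] * N                 # Step 1: assume every node faulty
    p = x                                # Step 2: start at the local node
    while True:                          # Step 3: follow fault-free arcs
        state[p] = FAULT_FREE
        p = tested_up_x[p]
        if p == x:
            return state

tested_up = [[None] * N for _ in range(N)]
fault_free = [x for x in range(N) if x not in faulty]
for _ in range(N):                       # N testing rounds suffice to converge
    for x in fault_free:
        adaptive_dsd_round(x, tested_up)

print(tested_up[2][2])            # 3: n2 tests n3 and stops immediately
print(diagnose(2, tested_up[2]))  # [0, 1, 0, 0, 1, 1, 0, 0]
```

The resulting test set matches Fig. 6 (n_0 tests n_1 then n_2; n_3 tests n_4, n_5, n_6), and every fault-free node reaches the same, correct diagnosis.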
The key steps of the algorithm proof show that the tests performed by the fault-free nodes form a directed cycle among all the fault-free nodes, after the Adaptive DSD algorithm is executed at least once at every node. It is then shown that the entries of the TESTED-UP array corresponding to the tests performed by fault-free nodes are consistent at all fault-free nodes after further execution of the Adaptive DSD algorithm. The Diagnose algorithm utilizes only the fault-free entries of TESTED-UP_x for diagnosis and thus operates correctly at all fault-free nodes.

/* DIAGNOSE                                              */
/* The following is executed at each n_x, 0 <= x <= N-1, */
/* when n_x desires diagnosis of the system.             */
1.     for i = 0 to N - 1
1.1.       STATE_x[i] = faulty;
2.     node_pointer = x;
3.     repeat {
3.1.       STATE_x[node_pointer] = fault-free;
3.2.       node_pointer = TESTED-UP_x[node_pointer];
3.3.   } until (node_pointer == x);

Fig. 7. The Diagnose algorithm.

B. Correctness Proof

The correctness proof of the Adaptive DSD algorithm utilizes algorithm "testing rounds." A testing round of Adaptive DSD is defined as the period of time such that Adaptive DSD executes at least once on every fault-free node in the system. Correctness of the Adaptive DSD algorithm is proven in three steps. In the first step, it is proven that, after a single testing round, there exists a directed path from any fault-free node to any other fault-free node in the testing graph T(S). Second, it is proven that fault-free entries of the TESTED-UP arrays are the same at each fault-free node after a fixed number of testing rounds following a fault event. Finally, it is proven that the Diagnose algorithm will correctly diagnose the state of all nodes in the system.

Theorem 1: Given diagnosis system S = (V(S), E(S), T(S)), fault situation F(S), and one testing round of Adaptive DSD, T(S) will contain a directed path from any fault-free node in V(S) to any other fault-free node in V(S).
Proof (by contradiction): Choose two fault-free nodes, n_x and n_z, such that there does not exist a directed path from n_x to n_z and (z - x) is minimized for all such n_x and n_z. Identify the largest y, y < z, such that n_y is fault-free and there exists a path from n_x to n_y. Refer to Fig. 8. By definition of Adaptive DSD, after a single testing round, n_y must have tested one fault-free node, n_a. Since the largest y was chosen such that y < z, a must be greater than z. By definition of the algorithm, n_y must have tested n_z before testing n_a and must have found it to be faulty. This results in a contradiction since n_z is selected as fault-free.

Fig. 8. Testing paths in T(S).

An interesting result of Theorem 1 is that T(S) contains a directed cycle of the fault-free nodes.

Corollary 1.1: Given diagnosis system S = (V(S), E(S), T(S)), fault situation F(S), and one testing round of Adaptive DSD, then T(S) will contain a directed cycle, consisting of every fault-free node of V(S).

Proof: By Theorem 1, there exists a directed path between any two fault-free nodes in T(S). In addition, by algorithm specification, there is only a single arc from each fault-free node to one other node in T(S). A directed cycle is the only graph structure that satisfies both of these conditions.

Theorem 2: Given diagnosis system S = (V(S), E(S), T(S)), fault situation F(S), N testing rounds of execution of Adaptive DSD, and fault-free nodes n_x and n_y such that n_x tests n_y. Then, for all fault-free n_i, TESTED-UP_i[x] = y.

Proof: Choose an arbitrary fault-free node n_x. By algorithm specification, TESTED-UP_x[x] = y after one testing round. After two testing rounds, node n_w, the fault-free node that tests n_x, will have tested n_x and received its TESTED-UP_x array. Thus, after two testing rounds, TESTED-UP_w[x] = TESTED-UP_x[x] = y. This step iterates as each fault-free node in the directed cycle identified in Corollary 1.1 receives TESTED-UP_x[x]. Note that information flows backwards around the cycle. The longest path around the cycle contains N nodes. Thus, after no more than N testing rounds all fault-free nodes n_i receive TESTED-UP_x[x], and for all i, TESTED-UP_i[x] = y.

Thus, it is shown using Theorem 2 that the TESTED-UP_i[x] entries corresponding to fault-free n_x are the same at every fault-free node, n_i.

Theorem 3: Given diagnosis system S = (V(S), E(S), T(S)), fault situation F(S), and N testing rounds of Adaptive DSD. Then, the Diagnose algorithm, executed at any fault-free node, will correctly determine F(S).

Proof: Choose a node n_x to execute the Diagnose algorithm. Initially, the state of every node is set to faulty in Step 1. Step 2 identifies n_x as the first node to be determined fault-free by setting node_pointer = x. Step 3 identifies n_node_pointer as fault-free and updates node_pointer to another fault-free node, specifically, node_pointer = TESTED-UP_x[node_pointer]. By Corollary 1.1, all fault-free nodes are contained in a directed cycle in T(S). By Theorem 2, node_pointer is updated, in Step 3.2, from its current value to the next consecutive fault-free node in the directed cycle of T(S). Since each fault-free node in a directed cycle uniquely identifies one other fault-free node, each fault-free node is identified by node_pointer exactly once in Step 3. The execution of the Diagnose algorithm at n_x requires the fault-free entries of TESTED-UP_x. By Theorem 2, these entries are proven to be consistent at all fault-free nodes. Thus, the execution of the Diagnose algorithm will yield the same result when executed at any fault-free node.

The Adaptive DSD algorithm, as presented, is optimal in terms of the total number of tests required, since each node is tested by at most one other node. To reduce other algorithm resource requirements and increase implementation performance, algorithm enhancements are employed.

C. Transient Behavior

As shown in Section III-B, the Adaptive DSD algorithm yields provably correct diagnosis after a "convergence period" following a fault event. However, correct diagnosis is not guaranteed during this period. The problem occurs when faulty nodes are repaired and become fault-free. The newly repaired node requires finite time to identify a single fault-free node. Before a fault-free node is identified, the other fault-free nodes utilize old test result reports received from that node, resulting in incorrect diagnosis. This situation can be identified by a break in the testing cycle, and is aggravated in actual systems where newly repaired nodes require appreciable time to identify a fault-free node.

Fig. 9 illustrates a node repair sequence that exhibits incorrect transient diagnosis. Node n_3 is faulty in Fig. 9(a), requiring n_2 to test n_3 and n_4. Node n_2 detects that n_3 is repaired in Fig. 9(b) and begins testing only n_3. However, if n_3 has not yet tested n_4 then TESTED-UP_3 is invalid. This causes a break in the testing cycle. Since the Diagnose algorithm follows fault-free paths in the testing cycle it will determine an incorrect diagnosis of the fault situation. In Fig. 9(c), n_3 determines n_4 to be fault-free, restoring the testing cycle. Henceforth, the Diagnose algorithm correctly diagnoses the fault situation.

Fig. 9. Possible event sequence for repaired n_3.

Transient incorrect diagnosis is avoided by requiring additional temporary testing in the Adaptive DSD algorithm. As a natural consequence of the Adaptive DSD algorithm, node n_2 will test both nodes n_3 and n_4 as long as n_3 is faulty. For proper diagnosis, n_2 must continue to test n_4, even after it has determined that n_3 has become fault-free. Only once n_3 has identified n_4 as fault-free can n_2 stop testing n_4. The additional temporary testing overhead incurred is the testing required for n_3 to identify n_4 as fault-free and report this information to n_2 before n_2 stops testing n_4. If this procedure is followed, then incorrect transient diagnosis does not occur.

D. Information Updating

Although the Adaptive DSD algorithm is optimal in terms of test count, it requires more than the minimum number of diagnostic messages. This is because a node will generate a test result report for each test result it obtains. For nodes whose state remains the same over several tests, duplicate reports are generated and forwarded. This is wasteful and unnecessary. A time stamping scheme like that presented in [1] is employed to permit nodes to transfer new diagnosis information only during Step 2.2 of Adaptive DSD. The total message count is minimized using this scheme since each node receives a single message for every change in TESTED-UP.

E. Event Driven Information

This enhancement addresses the diagnosis latency of the algorithm and assumes the information updating enhancement. When a new diagnostic message arrives at n_x, n_x stores the message in TESTED-UP_x. At this time, n_x can determine correct diagnosis. The new information is not forwarded until a request for the information arrives from another node. However, if n_x can identify the node that the message will be forwarded to, it can forward the message when it arrives. This scheme is termed Event Driven since information is forwarded when the event occurs rather than by explicit requests. The number of test result reports remains the same as the information updating scheme, but the diagnosis latency is reduced.

Fig. 10. Different report forwarding schemes.

Fig. 11. Asymmetric information forwarding of faulted n_1.

Fig. 12. Example multicast operation.
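The information-updating idea of Section III-D can be sketched as a timestamped merge: an entry is transferred only if it is newer than what the receiver already holds. The per-entry (value, stamp) representation and all names below are our assumptions, not the paper's data structures:

```python
# Sketch of timestamped information updating: forward only TESTED-UP
# entries that are newer than those the requesting node already holds.
def merge_updates(local, remote):
    """Merge remote {node: (entry, stamp)} pairs into local; keep the newest.
    Returns the number of entries actually transferred."""
    transferred = 0
    for i, (entry, stamp) in remote.items():
        if i not in local or stamp > local[i][1]:
            local[i] = (entry, stamp)
            transferred += 1
    return transferred

local = {0: (2, 5), 2: (3, 4)}
remote = {0: (2, 5), 2: (3, 7), 6: (7, 1)}
print(merge_updates(local, remote))  # 2 entries transferred (stale one skipped)
print(local[2])                      # (3, 7): newer report replaces the old one
```

In steady state every remote stamp matches the local one, so nothing is transferred, which is exactly the duplicate-suppression effect the enhancement is after.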
F. Asymmetric Report Forwarding

Asymmetric report forwarding further reduces diagnosis latency by forwarding diagnosis information along redundant communication paths, different from those utilized for testing. Three different information forwarding schemes are illustrated in Fig. 10 for the event of n_0 detecting n_1 as faulty. Tests are identified by shaded arcs and test result reports are forwarded along solid arcs. Fig. 10(a) illustrates symmetric forwarding, where test result reports are forwarded only in the reverse direction of tests. This scheme requires the lowest number of test result reports forwarded from each node and has the highest diagnosis latency. The forwarding scheme utilized by the SELF algorithms is illustrated in Fig. 10(b). Each node forwards test result reports to t other nodes. The scheme illustrated in Fig. 10(c) requires a high message count at n_0 but has the minimum diagnosis latency of one message delay.

The asymmetric report forwarding scheme utilized in the final implementation of Adaptive DSD is illustrated in Fig. 11. Using this scheme, n_0 forwards the test result report to n_4 and n_7. Nodes n_4 and n_7 each forward the report to two additional nodes. In this implementation, the reports forwarded along the solid arcs require only two arcs to reach n_2 versus six arcs for symmetric forwarding. The structure represented by the solid arcs is a balanced binary tree, with longest path log2 N. A binary tree is chosen as the forwarding structure since it requires each node to forward only one additional report, yet reduces the path length a report must travel from N to log2 N.

G. Multicast Information Forwarding

Fig. 10(c) represents the ideal forwarding scheme for minimum diagnosis latency. This forwarding scheme is not practical in real systems since it requires n_0 to forward a test result report to all other fault-free nodes. For common interconnection buses, such as Ethernet, multicasting can be used to reduce the overhead associated with forwarding a message to all nodes.
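The path-length trade-off among the forwarding schemes of Figs. 10 and 11 can be made concrete with a small sketch (function names are ours; the hop counts follow the text: up to N - 1 arcs for symmetric forwarding, about log2 N for the binary tree, one for full fan-out):

```python
# Sketch: worst-case report path length, in arc hops, for each forwarding
# scheme discussed in the text.
import math

def symmetric_hops(n):
    """Reverse-of-test forwarding: around the whole cycle."""
    return n - 1

def tree_hops(n):
    """Balanced binary-tree forwarding: longest root-to-leaf path."""
    return math.ceil(math.log2(n))

def fanout_hops(n):
    """Ideal full fan-out (Fig. 10(c)): one message delay."""
    return 1

print(symmetric_hops(8), tree_hops(8), fanout_hops(8))  # 7 3 1
```

At the scale of the actual implementation (on the order of 200 nodes), the tree cuts the worst-case path from roughly 200 hops to 8, for the cost of one extra outgoing report per node.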
Multicasting allows one source to broadcast a single message, on a shared communication bus, to many destinations. Fig. 12 illustrates a multicast operation. Multicasting can be accomplished on an Ethernet using IP level [11] protocols. By using a multicast approach, each node still receives a single message for every fault event, but the number of messages placed on the communication bus is substantially reduced.

In Ethernet networks, the presence of routers must be considered. A router permits the interconnection of multiple Ethernet networks, or subnets. Multicast messages on an Ethernet are not forwarded through routers. Hence, a special procedure is required to ensure that multicasted test result reports are forwarded to all nodes in the network, regardless of the presence of routers, and that the reports are properly verified. In the first step of the procedure, the original report is multicasted by the originator of the report. In the second step, each node that received the multicasted report forwards the report symmetrically to the next subsequent node. Nodes that receive the symmetrically forwarded report without a multicast message must be located on a different subnet. These nodes re-execute the given multicast procedure to distribute the new message on their subnet with the minimum possible diagnosis latency.

The presented multicast procedure is illustrated in Fig. 13. Using this procedure, a total of N + S test result reports are forwarded, where N is the total number of nodes in the system and S is the number of subnets. This is shown, since each node forwards one message symmetrically, plus one node in each subnet multicasts a message. Since two message delays are required to distribute the diagnostic message per subnet, the total diagnosis latency is a function of 2S message delays.

Fig. 13. Example multicast operation with subnets.
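The N + S message count and 2S-delay latency of the multicast procedure follow directly from the argument above, and can be written down as a sketch (the 200-node, 4-subnet configuration is an illustrative choice of ours, not a figure from the paper):

```python
# Sketch: resource bounds for the subnet-aware multicast procedure.
def multicast_messages(n_nodes, n_subnets):
    """Each node sends one symmetric report, plus one multicast per subnet."""
    return n_nodes + n_subnets

def multicast_latency_delays(n_subnets):
    """Two message delays per subnet to distribute a diagnostic message."""
    return 2 * n_subnets

print(multicast_messages(200, 4))   # 204 reports per fault event
print(multicast_latency_delays(4))  # 8 message delays
```

Compare this with symmetric forwarding alone, where a fault event still costs N reports but the latency grows with N rather than with the (much smaller) number of subnets.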
IV. DISTRIBUTED DIAGNOSIS ALGORITHM COMPARISON

TABLE I
ALGORITHM DIAGNOSABILITY AND TEST COUNT

(a) Diagnosability
    SELF Algorithms (all information forwarding schemes):  t
    Adaptive DSD:                                          N - 1

(b) Testing Count Per Testing Round
    SELF Algorithms (all information forwarding schemes):  N(t + 1)
    Adaptive DSD:                                          N

TABLE II
ALGORITHM MESSAGE COUNT (per testing round)

    SELF Algorithms (all information forwarding schemes):  …
    Adaptive DSD, All Information:                         N²
    Adaptive DSD, Information Updating:                    N·Δf
    Adaptive DSD, Event Driven:                            N·Δf
    Adaptive DSD, Asymmetric Forwarding:                   1.5N·Δf
    Adaptive DSD, Multicast Forwarding:                    (N + S)·Δf

Table I(a) shows the diagnosability of the algorithms discussed. Algorithm diagnosability is the maximum number of faulty nodes that are permitted for the algorithm to guarantee correct diagnosis. The SELF algorithms are t-diagnosable, a function of the predefined fixed testing topology required. The testing topology of the Adaptive DSD algorithms varies based on the fault situation; algorithm diagnosability is N - 1 for these algorithms.

Table I(b) shows the number of tests required by each algorithm. The SELF algorithms require N(t + 1) tests since each node must be tested by t + 1 other nodes. The Adaptive DSD algorithms require N tests. Since every node of any distributed diagnosis system must be tested by one of the fault-free nodes, N is the minimum number of tests possible. Thus, Adaptive DSD is optimal in terms of the number of tests required.

Table II identifies the number of messages that contain test result reports. In the SELF algorithms, each message contains the triple [A, B, C], where A, B, and C are node identifiers. The Adaptive DSD algorithm requires that each TESTED_UP array be forwarded in a testing round. Thus, N messages of size N are required, recorded as N² in Table II. The message counts of the event driven and information updating schemes are functions of the number of fault events. Adaptive DSD with information updating forwards each of the Δf fault events to each node; thus the total message count is N·Δf. This message count is optimal, since each node must receive at least one message for every fault event, and it is the same for Event Driven Adaptive DSD. The asymmetric forwarding algorithm requires 1.5N·Δf messages since it forwards diagnosis information along redundant paths. The multicast algorithm requires (N + S)·Δf messages since each fault event is first multicasted to all nodes and then verified between subsequent node pairs.

Table III identifies the diagnosis latency of each algorithm. The diagnosis latency is the time required for all fault-free nodes in the diagnosis system to reach a correct diagnosis after a fault event. As proven in Section III-B, Adaptive DSD requires N report forwarding delays to distribute new test result reports to every node. Thus, the diagnosis latency is N·Tt, where Tt represents the time of a testing interval. The SELF algorithms require N/(t + 1) testing intervals, since there are multiple paths between nodes in the test set, including paths of length N/(t + 1); the test result reports therefore require less time to be forwarded to all nodes in the system.

The event driven algorithms have significantly reduced diagnosis latency. In the nonevent driven algorithms, test result reports arrive at a node and are not forwarded until the reports are requested during the next testing interval. In the event driven schemes, the node receiving the report immediately validates the message, then forwards it to subsequent nodes. Thus, the report is forwarded after the time required for a fault-free test, Ttest, which is significantly less than a testing cycle in our implementation. The asymmetric adaptive algorithm further reduces diagnosis latency by utilizing redundant shorter paths, the longest of which contains log2 N nodes. The multicast algorithm requires 2S·Ttest latency since the original fault event message is multicasted to all nodes on each subnet simultaneously.

TABLE III
ALGORITHM DIAGNOSIS LATENCY

    SELF Algorithms (all information forwarding schemes):  (N/(t + 1))·Tt
    Adaptive DSD, All Information:                         N·Tt
    Adaptive DSD, Information Updating:                    N·Tt
    Adaptive DSD, Event Driven:                            N·Ttest
    Adaptive DSD, Asymmetric Forwarding:                   (log2 N)·Ttest
    Adaptive DSD, Multicast Forwarding:                    2S·Ttest
V. IMPLEMENTATION AND EXPERIMENTATION

A. Implementation

Adaptive DSD has been running in the CMU ECE department since November 1990 on various workstations using the Ultrix operating system, including VAX and DEC 3100 RISC workstations. The algorithm consists of approximately 3000 lines of C code, written in modular format to make it easily portable. The network interface for this implementation uses the Berkeley socket interface [13] and presently supports Ethernet IP/UDP protocols [3], [11]. Appropriate modifications to the network module will allow the program to run on any system that has a C compiler.

Adaptive DSD is implemented as a modular, event-driven program. A configuration file is read by each workstation at startup that identifies the complete list of workstations participating in system diagnosis, as well as specifying a number of tuning parameters. Algorithm tuning parameters include the maximum number of forward tests in a single test interval, various timeout values, and flags that enable or disable algorithm options. An activity scheduler plays a significant role in the implementation by permitting events such as workstation tests, packet retransmissions, and other timeouts to be scheduled for execution at a specified time. As with EVENT-SELF, the workstation test is implemented as a separate program that is spawned as a subprocess to test several of the hardware facilities of the workstation.

Workstations participating in system diagnosis are initially sorted by Internet host address. Since this number is unique to each workstation, all workstations generate identical sorted lists. Testing occurs in the forward direction of the sorted list; i.e., each workstation tests those workstations that follow it in the sorted list, modulo the number of workstations. Information forwarding occurs in the reverse direction, or backwards in the sorted list. Due to Internet standard subnet routing [10], workstations with numerically similar host addresses are located on a single subnet. The sorted arrangement of workstations tends to minimize the load on routers and bridges as a result of inter-subnet communication.

B. Experimentation

Experimentation of the Adaptive DSD algorithm on the CMU ECE network focused on algorithm communication overhead, in terms of average packet count, and diagnosis latency, measured in seconds. The following figures graph the communication overhead as a function of experiment elapsed time. In addition, important events are marked, including fault occurrence and diagnosis. The first figure illustrates the execution of the Adaptive DSD algorithm with symmetric information forwarding; the subsequent two figures illustrate the performance of the Adaptive DSD algorithm with asymmetric forwarding. In each experiment, the diagnosis system consists of 60 nodes and the algorithm executes with a 30 s test interval. Every node performs its own data collection for the packet counts shown in the figures, which are collected at 10 s intervals throughout each experiment.

Experiments 1 and 2 demonstrate the difference between symmetric and asymmetric forwarding. See Figs. 14 and 15. Both experiments involve the failure and subsequent recovery of a single node. Symmetric forwarding is utilized in Experiment 1 and asymmetric forwarding is utilized in Experiment 2. At 60 s during Experiment 1, a single node in the network fails. The faulted node is detected at 110 s, after it is tested and a test timeout period occurs. After 110 s the fault information is forwarded to the remaining fault-free nodes. Since diagnosis information is validated by testing, the fault information will reach the farthest node from the failure only after all nodes between it and the fault are tested and found to be fault-free. Thus, at time 510 s, the node farthest from the fault receives the information indicating the node failure. This results in an overall diagnosis latency of 450 s. At 960 s the faulty node is repaired. The newly recovered node immediately performs forward tests up to the limit of five, as specified in the configuration file. This causes the sharp increase in packet count at time 960 s. At time 970 s, the recovered node is detected. This information is propagated backward through the path of fault-free nodes until it reaches the fault-free node farthest from the recovered node, at 1430 s. Correct diagnosis is achieved within 460 s. After 1430 s the packet counts return to nominal levels.

Fig. 14. Experiment 1 on a 60 node testing network.

Fig. 15. Experiment 2 on a 60 node testing network.

As shown in Experiment 1, the diagnosis latency of Adaptive DSD with symmetric forwarding is a linear function of the number of system nodes and can be significant for large systems. Experiment 2, shown in Fig. 15, illustrates the same experiment with asymmetric forwarding. The diagnosis latency is significantly reduced: it is 60 s for asymmetric forwarding versus 400 s for symmetric forwarding. The same diagnostic information is forwarded, except that it is forwarded closer in time to the fault event. This results in a higher peak message count with shorter duration. The remaining experiments utilize asymmetric forwarding to provide reduced diagnosis latencies.

Fig. 16 illustrates one advantage of Adaptive DSD over both of the SELF algorithms: the ability to correctly diagnose the state of a network under the presence of many faults. In Experiment 3, 50 of the 60 nodes experience simultaneous failures at 60 s. The average packet count initially reduces significantly since 50 nodes cease transmitting messages. The first faulty node is detected at 90 s, and the remaining fault-free nodes re-establish a cycle in the test set. At this time, complete diagnostic information is forwarded among these nodes. After the 360 s diagnosis latency, the packet counts reduce to their new nominal values. At time 960 s, one of the 50 failed nodes returns to the network. The usual recovery detection occurs, and diagnostic information is exchanged. After 90 s, complete diagnosis among the fault-free nodes is established.

Fig. 17 compares message counts for Adaptive DSD to those for the SELF algorithms for a single failure and subsequent recovery. Due to the high number of diagnostic messages generated by the NEW-SELF algorithm and the available network bandwidth, the diagnosis system is limited to twenty nodes. The algorithms executed in Experiment 4, shown in Fig. 17, use the same configuration parameters as the first three experiments: 30 s test interval, packet bundling, asymmetric forwarding, and a maximum of t = 5 forward tests per test interval. Adaptive DSD has lower communication overhead and reduced diagnosis latency. This is verified in Table III. Observed message counts reflect those calculated in Table II.

VI. CONCLUSION

The Adaptive DSD algorithm has been specified and implemented.
Fig. 16. Experiment 3 on a 60 node testing network.

Fig. 17. Experiment 4 on a 20 node testing network.

Unlike previous distributed system-level diagnosis algorithms, the testing assignment is adaptive and varies during algorithm execution. The testing assignment adapts locally at each node, yet the algorithm is provably globally correct. Diagnosability of the Adaptive DSD algorithm is optimal since correct diagnosis is guaranteed for any set of node failures. In addition, the number of tests performed is optimal, since each node is tested by a single fault-free node. Using symmetric forwarding, each node receives a single test result report per fault event; hence the number of test result reports is optimal. Symmetric forwarding suffers from the longest possible diagnosis latency, however, since a test result report is forwarded between every fault-free node before it reaches the last node. A direct tradeoff is presented that permits improvements in diagnosis latency while requiring additional reports to be forwarded. For special interconnection networks, multicasting can be used to reduce both the diagnosis latency and the number of test result reports forwarded.

Adaptive DSD has been running on various workstations of the CMU ECE department since November 1990. Previous nonadaptive versions have been running at Carnegie Mellon since April 1989. Since its inception at Carnegie Mellon, greater reliance has been placed on the DSD system by system administrators. The current system is used to diagnose faulty workstations within seconds of failure.
In addition, the system is used to determine the cause of failures during periods of increased fault activity. Current research focuses on methods of distributed system-level diagnosis for arbitrary network interconnection topologies, and on other features such as handling link failures and the dynamic entry and exit of nodes into and out of the diagnosis system.

REFERENCES

[1] R. P. Bianchini, Jr., K. Goodwin, and D. S. Nydick, "Practical application and implementation of distributed system-level diagnosis theory," in Proc. Twentieth Int. Symp. Fault-Tolerant Comput., IEEE, June 1990, pp. 332-339.
[2] R. P. Bianchini, Jr. and R. Buskens, "An adaptive distributed system-level diagnosis algorithm and its implementation," in Proc. Twenty-First Int. Symp. Fault-Tolerant Comput., IEEE, June 1991.
[3] The Ethernet: A Local Area Network. Data Link Layer and Physical Layer Specification, 2.0 edition, Digital Equipment Corp., Intel Corp., Xerox Corp., 1982.
[4] S. L. Hakimi and A. T. Amin, "Characterization of connection assignment of diagnosable systems," IEEE Trans. Comput., vol. C-23, Jan. 1974.
[5] S. L. Hakimi and E. F. Schmeichel, "An adaptive algorithm for system level diagnosis," J. Algorithms, vol. 5, June 1984.
[6] S. H. Hosseini, J. G. Kuhl, and S. M. Reddy, "A diagnosis algorithm for distributed computing systems with dynamic failure and repair," IEEE Trans. Comput., vol. C-33, pp. 223-233, Mar. 1984.
[7] E. Kreutzer and S. L. Hakimi, "System-level fault diagnosis: A survey," Euromicro J., vol. 20, no. 4-5, pp. 323-330, May 1987.
[8] J. G. Kuhl and S. M. Reddy, "Distributed fault-tolerance for large multiprocessor systems," in Proc. 7th Annu. Symp. Comput. Architecture, IEEE, May 1980, pp. 23-30.
[9] ——, "Fault-diagnosis in fully distributed systems," in Proc. 11th Int. Conf. Fault-Tolerant Comput., IEEE, June 1981, pp. 100-105.
[10] J. C. Mogul and J. B. Postel, "Internet standard subnetting procedure," RFC 950, Aug. 1985.
[11] J. B. Postel, "Internet protocol," RFC 791, Sept. 1981.
[12] F. P. Preparata, G. Metze, and R. T. Chien, "On the connection assignment problem of diagnosable systems," IEEE Trans. Electron. Comput., vol. EC-16, pp. 848-854, Dec. 1967.
[13] UNIX Programmer's Manual: Socket, The University of California at Berkeley, 1986.
[14] C.-L. Yang and G. M. Masson, "Hybrid fault diagnosability with unreliable communication links," in Proc. Fault-Tolerant Comput. Syst., IEEE, July 1986, pp. 226-231.
[15] S. L. Hakimi and K. Nakajima, "On adaptive system diagnosis," IEEE Trans. Comput., vol. C-33, pp. 234-240, Mar. 1984.
[16] F. J. Meyer and D. K. Pradhan, "Dynamic testing strategy for distributed systems," IEEE Trans. Comput., vol. C-38, pp. 356-365, Mar. 1989.

Ronald P. Bianchini, Jr. (S'80-M'88) was born in Brooklyn, NY, on April 29, 1962. He received the B.S. degree in electrical engineering from the Massachusetts Institute of Technology in 1983 and the M.S. and Ph.D. degrees in electrical and computer engineering from Carnegie Mellon University in 1985 and 1989, respectively. He aided the New York University Ultracomputer project during the summer of 1983 in the area of wireability. He consulted for AT&T Bell Laboratories during the summers of 1990 and 1991 in the application of a fault diagnosis system to AT&T research networks. Currently, he is an Assistant Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA. He directs research groups in the study of fault diagnosis in distributed systems and the design of telecommunication switching architectures. His research interests include system-level diagnosis, distributed computer systems, telecommunication switching, and computer architecture. Dr. Bianchini is a member of the IEEE Computer Society and the Association for Computing Machinery, and was nominated to the Eta Kappa Nu Honor Society in 1983.

Richard W. Buskens (S'85) received the B.S. degree in computer engineering and the M.S. degree in computer science from the University of Manitoba, Canada. He is currently a Ph.D. student in the Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA. His research interests include applied graph theory, computer networks, and parallel and distributed algorithms.