Reliability Evaluation Techniques Reliability refers to the degree of tolerance against errors and component failures in a system. A reliable system prevents loss of information even in the event of component failures. The multiplicity of storage devices and processors in a distributed system allows the maintenance of multiple copies of critical information within the system and the execution of important computations redundantly to protect them against catastrophic failures. E.g. if one of the processors fails, the computation can be successfully completed at the other processor, or if one of the storage devices fails, the information can still be retrieved from the other storage device. Distributed computing systems have a potential for higher reliability since a few parts of the system can be down without interrupting the jobs of the users who are using the other parts of the system. This is in contrast to the centralized system, where a failure of the central processor results in the entire system shutting down. E.g. if a workstation of a distributed computing system that is based on the workstation-server model fails, only the user of that workstation is affected. The advantage of higher reliability is an important reason for use of distributed computing systems for mission-critical applications whose failure may be disastrous. Ways to represent reliability There are two ways to represent reliability: 1. Network point of view 2. Application point of view Network Point of View Given the link and node reliability for a specific network topology, determine the probability of the network becoming partitioned. There are two popular parameters used to evaluate network reliability: 1. Source-terminal reliability (STR) This is the probability that a path exists between a given source node, s, and a given destination node, d. 2. Computer Network Reliability (CNR) This is the probability that a path exists between any source node and any destination node (i.e. the probability that any node in the network can communicate with any other node in the network). Application Point of View Given the link and processing node reliabilities, and the distribution of programs as well as data files, determine the probability of successfully completing a specific application. A popular parameter used to evaluate reliability from the application execution standpoint is: Distributed Program Reliability (DPR) This is the probability that a program can successfully execute in a distributed computing system (i.e. the program can always access the data files it needs to successfully execute). Computing the STR Consider the network topology given below: 2 X1 1 X 2 X2 0.85 0.8 X 2 0.8 X3 0.9 X4 3 0.9 X5 4 X 2 Assumptions: X 1. Each link is full-duplex 2 when working. 2. Each link can is either in a working state or in a failed state. 3. The reliability of every possible link is known. 4. The nodes are perfectly reliable (this assumption is for mathematical convenience only. The failure of a node can be easily modeled by assuming the failure of all links that connect to the node that has failed). Step 1: Determine all possible paths from source node 1 to destination node 3. The possible paths (in increasing order of cardinality) are: X1 X4, X2 X5, X1 X3 X5, and X2 X3 X4. These are depicted below in order: X1 X4 X2 X5 X1 X3 X5 X2 X3 X4 Step 2: Reliability Expression Generation by making the terms disjoint: Term 1: X1 X4 P1 P4 Term 2: XI X4 (these are both not present in the second path X2 X5) Z(X1 X4) = X1’ + X1 X4’ X2 X5 (X1’ + X1 X4’) P2 P5 (Q1 + P1 Q4) Term 3: X4 + X2 Z(X4 + X2) = X4’ X2’ X1 X3 X5 X4’ X2’ P1 P3 P5 Q4 Q2 Term 4: X1 + X5 + X1 X5 X1 (1 + X5) + X5 Z(X1 + X5) = X1’ X5’ X2 X3 X4 X1’ X5’ P2 P3 P4 Q1 Q5 X1 + X5 Hence, STR = P1 P4 + P2 P5 (Q1 + P1 Q4) + P1 P3 P5 Q4 Q2 + P2 P3 P4 Q1 Q5 STR = 0.85 * 0.8 + 0.9 * 0.9 (0.15 + 0.85 * 0.2) + 0.85 * 0.8 * 0.9 * 0.2 * 0.1 + 0.9 * 0.8 * 0.8 * 0.15 * 0.1 = 0.96008 The Zeta Operator X1, X2, X3 are logical variables. Z(X1) = X1’ Z(X1 + X2) = Z(X1) Z(X2) = X1’ X2’ Z(X1 X2) = Z(X1) + X1 Z(X2) = X1’ + X1 X2’ Z(X1 + X2 + X3) = Z(X1) Z(X2) Z(X3) = X1’ X2’ X3’ Z(X1 X2 X3) = Z(X1) + X1 Z(X2) + X1 X2 Z(X3) = X1’ + X1 X2’ + X1 X2 X3’ Computing the CNR Consider the network topology given below: 2 X1 1 X 2 X2 0.85 0.8 X 2 0.8 X3 0.9 X4 3 0.9 X5 4 X 2 X Step 1: 2 Determine all minimum spanning trees (MST) for the topology given. A MST is defined as the tree that connects every node in the network with no cycle in them. For e.g. X1 X3 X4 is a MST but X1 X2 X3 X5 is not. The MSTs (in increasing order of cardinality) are: MST1 = X1 X3 X4 MST2 = X1 X3 X5 MST3 = X2 X3 X4 MST4 = X2 X3 X5 MST5 = X1 X2 X5 MST6 = X1 X2 X4 MST7 = X2 X4 X5 MST8 = X1 X4 X5 Step 2: Make the MSTs disjoint. Term1 Pr[MST1] = P1 P3 P4 Term2 X4 Z(X4) = X4’ X1 X3 X5 X4’ P1 P3 P5 Q4 Pr[MST2] = P1 P3 P5 Q4 Term3 X1 + X1 X5 X1 (1 + X5) X1 Z(X1) = X1’ X2 X3 X4 X1’ P2 P3 P4 Q1 Pr[MST3] = P2 P3 P4 Q1 Term4 X1 X4 + X1 + X4 X1(X4 + 1) + X4 X1 + X4 Z(X1 + X4) = X1’ X4’ X2 X3 X5 X1’ X4’ P2 P3 P5 Q1 Q4 Pr[MST4] = P2 P3 P5 Q1 Q4 Term5 X3 X4 + X3 + X3 X4 + X3 X3 (X4 + 1 + X4 + 1) X3 Z(X3) = X3’ X1 X2 X5 X3’ P1 P2 P5 Q3 Pr[MST5] = P1 P2 P5 Q3 Term6 X3 + X3 X5 + X3 + X3 X5 + X5 X3 (1 + X5 + 1 + X5) + X5 X3 + X5 Z(X3 + X5) X3’ X5’ X1 X2 X4 X3’ X5’ P1 P2 P4 Q3 Q5 Pr[MST6] = P1 P2 P4 Q3 Q5 Term7 X1 X3 + X1 X3 + X3 + X3 + X1 + X1 X1 (X3 + X3 + 1 + 1) + X3 + X3 X1 + X3 Z(X1 + X3) X1’ X3’ X1 X4 X5 X1’ X3’ P2 P4 P5 Q1 Q3 Pr[MST7] = P2 P4 P5 Q1 Q3 Term8 X3 + X3 + X2 X3 + X2 X3 + X2 + X2 + X2 X3 (1 + 1 + X2 + X2) + X2 + X2 + X2 X3 + X2 Z(X3 + X2) X3’ X2’ X1 X4 X5 X3’ X2’ P1 P4 P5 Q2 Q3 Pr[MST8] = P1 P4 P5 Q2 Q3 Hence, 8 CNR = Pr [MSTj] j=1 Computing the DPR Consider the topology of the DCS. The distribution of files is shown in curly braces. Program P is executed on node 1 and the files required for its execution are f1, f2, f3. The link reliabilities are as shown. {f1, f2} X2 2 X1 {f1, f5} 4 0.85 0.9 X7 0.95 0.8 0.8 1 X X3 2 {f2, f5} X4 X6 0.9 0.9 3 {f3, f4} 0.9 X5 6 {f1, f4} X8 5 {f3, f6} Step 1: Determine all minimum file spanning trees (MFST) for the topology given. The MFSTs (in increasing order of cardinality) are: MFST1 = X1 X3 MFST2 = X1 X4 MFST3 = X3 X4 MFST4 = X3 X5 X6 MFST5 = X1 X2 X6 MFST6 = X3 X5 X8 X7 MFST7 = X1 X2 X7 X8 Term 1 Pr[MFST1] = P1 P3 Term 2 Z(X3) X3’ Pr[MFST2] = P1 P4 Q3 Term 3 X1 + X1 X1 Z(X1) X1’ Pr[MFST3] = P3 P4 Q1 Term 4 X1 + X1 X4 + X4 X1 (1 + X4) + X4 Z(X1 + X4) X1’ X4’ Pr[MFST4] = P3 P5 P6 Q1 Q4 X1 + X4 Term 5 X3 + X4 + X3 X4 + X3 X5 X3(1 + X4 + X5) + X4 Z(X3 + X4) X3’ X4’ Pr[MFST5] = P1 P2 P6 Q3 Q4 Term 6 X1 + X1 X4 + X4 + X6 + X1 X2 X6 X1 (1 + X4 + X2 X6) + X4 + X6 X1 + X4 + X6 Z(X1 + X4 + X6) X1’ X4’ X6’ Pr[MFST6] = P3 P5 P7 P8 Q1 Q4 Q6 Term 7 X3 + X4 + X3 X5 X6 + X6 + X3 X5 X3 (1 + X5 X6 + X5) + X4 + X6 X3 + X4 + X6 Z(X3 + X4 + X6) X3’ X4’ X6’ Pr[MFST7] = P1 P2 P7 P8 Q3 Q4 Q6 Hence, 7 DPR = Pr [MFSTj] j=1 X3 + X4 Applicability of CNR to network design One of the considerations in the design of computer networks is the reliability of between any pair of nodes and the maximum permissible cost. Given the location of various nodes (routers) of the network, the maximum permissible cost of the installing the links, and the possible position of the links, an algorithm by K. K. Aggarwal is used to determine an optimal network topology, which maximizes the CNR. Network Design Algorithm is as follow: Steps: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. Find all the MSTs. Generate a cost matrix, Sc, with its rows corresponding to each spanning tree. For each row, if branch i exists in the MST then initialize it with its cost, else zero. Generate a reliability matrix, Sr, with its rows corresponding to each spanning tree. For each row if branch i exists in the MST then initialize it with its reliability, else one. Determine the cost of each MST from matrix Sc and generate a new matrix A. Determine the reliability of each MST from matrix Sr and generate a new matrix X. Determine the ratio of MST reliability and MST cost and generate a new matrix D, i.e., D = X / A. Choose the MST that has the highest ratio and satisfies the cost. Remove the links that are in the MST of step 7. For each remaining branch compute the increment in the reliability if that branch is added. Find the ratio of incremental reliability, R, with the cost of the link (i.e. D = R / cost of link). Choose the one with the highest ratio (i.e. highest value of D) and satisfy the remaining cost. Remove this link from consideration and repeat steps 9 and 10 until all links are exhausted or the cost is exceeded. Example of the network design problem Link cost Link reliability a 2 0.9 b 5 0.7 c 3 0.8 d 6 0.6 e 4 0.9 f 3 0.8 g 4 0.7 h 3 0.8 Maximum permissible cost = 21 units The nodes to be connected are as follows. The algorithm will determine which links are to be maintained. d 2 4 a g 1 c 6 f Xb 2 h 3 5 e Step 1: All the possible MSTs are abdeh, bcdeh, acdeh, bdefh, adefh, abefh, bdegh, adegh, abdgh, abdeg, abegh, bcefg, bcdfh, cdfha, acefh, bcegh, bcdeg, bcdgh, acegh, acdeg, acdgh, bdefg, adefg, abefg, abdfg, bcefg, abcfg, acefg, acdfg. Steps 2 and 4 a b c d e f g h 2 0 2 0 2 2 2 0 2 2 2 2 0 0 2 2 0 Sc = 0 0 2 2 2 0 2 2 2 0 0 2 2 5 5 0 5 0 5 5 5 0 5 5 5 5 5 0 0 5 5 5 0 0 0 5 0 5 5 5 5 0 0 0 3 3 0 0 0 0 0 0 0 0 0 3 3 3 3 3 3 3 3 3 3 0 0 0 0 3 3 3 3 6 6 6 6 6 6 0 6 6 6 6 0 0 6 6 0 0 6 6 0 6 6 6 6 0 6 0 6 0 6 4 4 4 4 4 0 4 4 4 0 4 4 4 0 0 4 4 4 0 4 4 0 4 4 4 0 4 0 4 0 0 0 0 3 3 3 3 0 0 0 0 0 3 3 3 3 0 0 0 0 0 0 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 4 4 4 4 4 0 0 0 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 0 3 3 3 3 3 3 0 3 3 0 3 0 0 0 0 0 0 0 0 A= 20 21 18 21 18 19 17 22 19 20 21 18 18 20 17 15 19 22 21 16 19 18 22 19 18 20 19 21 16 18 Steps 3 and 5 a 0.9 1 0.9 1 0.9 0.9 0.9 1 0.9 0.9 0.9 0.9 1 1 0.9 0.9 1 Sr = 1 1 0.9 0.9 0.9 1 0.9 0.9 0.9 1 1 0.9 0.9 b c d e f g h 0.7 0.7 1 0.7 1 0.9 0.7 0.7 1 0.7 0.7 0.7 0.7 0.7 0.7 1 0.7 0.7 0.7 1 1 1 0.7 1 0.7 0.7 0.7 0.7 1 1 1 0.8 0.8 1 1 1 1 1 1 1 1 1 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1 1 1 1 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 1 0.6 0.6 0.6 0.6 1 1 0.6 0.6 1 1 0.6 0.6 1 0.6 0.6 0.6 0.6 1 0.6 1 0.6 1 0.6 0.9 0.9 0.9 0.9 1 0.9 0.9 0.9 0.9 1 0.9 0.9 0.9 1 1 0.9 0.9 0.9 1 0.9 0.9 1 0.9 0.9 0.9 1 0.9 1 0.9 1 1 1 1 0.8 0.8 0.8 0.8 1 1 1 1 1 0.8 0.8 0.8 0.8 1 1 1 1 1 1 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1 1 1 1 1 1 1 0.7 0.7 0.7 0.7 0.7 1 1 1 1 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1 0.8 0.8 0.8 0.8 0.8 0.8 1 0.8 0.8 1 0.8 1 1 1 1 1 1 1 1 X= 0.27216 0.24192 0.31104 0.24192 0.31104 0.24192 0.36288 0.21168 0.27216 0.21168 0.23814 0.31752 0.32256 0.21506 0.27648 0.41472 0.28224 0.21168 0.18816 0.36288 0.27216 0.24192 0.21168 0.27216 0.31752 0.21168 0.28224 0.18816 0.36288 0.24192 Step 6 D = X/A = 0.01361 0.01152 0.01728 0.01152 0.01728 0.01273 0.02134 0.00962 0.01432 0.01058 0.01134 0.01764 0.01792 0.01075 0.01626 0.02765 0.01485 0.00962 0.00896 0.02268 0.01432 0.01344 0.00962 0.01432 0.01764 0.01058 0.01485 0.00896 0.02268 0.01344 max Step 7 Max[D] = 0.02765 for i = 16 A[16] = 15 (cost) Thus branches acefh are connected as shown. Figure 1 2 4 a 1 c 6 f X 2 h 3 5 e Step 8 The branches to be considered (in increasing order of cost) are g, b, d. We have a cost of 21-15 = 6 units to work with. Step 9 Insert the branch g in the topology shown in figure 1. The augmented network will have spanning trees acefh, acefg, acehg. 2 4 a g 1 c 6 f X 2 h 3 5 e The overall reliability of this network, R = Pa Pc Pe Pf Ph + Pa Pc Pe Pf Pg Qh + Pa Pc Pe Ph Pg Qf R = 0.55987 Therefore, R(1) = 0.55987 – 0.41472 = 0.14515 C(1) = 4 (cost increase, since cost of link g is 4) D(1) = R(1) / C(1) = 0.03628 Next, insert branch b in Figure 1. The augmented network is shown below. It has the spanning trees acefh, abefh, bcefh. 2 4 a 1 c 6 f Xb 2 h 3 5 e The overall reliability, R = Pa Pc Pe Pf Ph + Pa Pb Pe Pf Ph Qc + Pb Pc Pe Pf Ph Qa R = 0.51555 Therefore, R(2) = 0.51555 – 0.41472 = 0.10483 C(2) = 5 (cost increase, since cost of link b is 5) D(2) = R(2) / C(2) = 0.02096 Next, insert branch d in Figure 1. The augmented network is shown below. It has the spanning trees acefh, adefh, acdfh, acdeh. d 2 4 a 1 c 6 f X 2 h 3 5 e The overall reliability, R = Pa Pc Pe Pf Ph + Pa Pd Pe Pf Ph Qc + Pa Pc Pd Pf Ph Qe + Pa Pc Pd Pe Ph Qf R = 0.56678 Therefore, R(3) = 0.56678 – 0.41472 = 0.15206 C(3) = 6 (cost increase, since cost of link d is 6) D(3) = R(3) / C(3) = 0.02534 Since D(1) is the greatest we add the branch g permanently to Figure 1. The total cost used up is now 15 + 4 = 19. We are left with resources of 21-19 = 2 to work with. This is insufficient to add either links b or d. Hence we stop. If we had the resources we would go back to step 8. The final topology is: 2 4 a g 1 c 6 f X 2 h 3 5 e Applicability of DPR to the file allocation problem The reliability of executing various applications in a DCS depends on the topology of the DCS, the program locations, and the allocation of files on the DCS. If the topology is fixed, then the overall reliability of the DCS depends mainly on the file allocation on the DCS. The file allocation problem is formulated in terms of cost factors and constraints, the objective being to allocate files such that the cost factor(s) are minimized/maximized and all the constraints are satisfied. The DPR is used to allocate files on a DCS such that the DPR is maximized. The problem is formulated as: Given: DCS topology, link reliabilities, program location, files needed for program execution. Objective function: Maximize DPR Constraints: NN xij = NFi ; for each file Fi ; 1 <= i <= FN j=1 where, NN = total number of nodes on DCS FN = total number of files on the DCS NFi = total number of copies of file Fi allowed on the DCS = 1; if file Fi is allocated to node Nj xij = = 0; otherwise If there are n processing nodes and m data files in the DCS, then the total number of possible assignments is nm . Thus the optimal allocation of files on the processing nodes is a problem of exponential complexity. Hence, various heuristic algorithms such as the Genetic Algorithm are used to solve this problem.