Loss of Data Probability in Case of Multiple Failures

We assume we have $N$ RAMCloud masters. Each RAMCloud master partitions its data into $S$ segments, which together form that master's group. We denote a segment by $M_{i,j}$, where $i$ is the master index and $j$ is the segment index. Let $M_1, M_2, \ldots, M_N$ be the master groups, with $M_i = \{M_{i,1}, M_{i,2}, \ldots, M_{i,S}\}$ for $i = 1, 2, \ldots, N$. We assume each machine acts both as a master and as a backup. Let $B_1, B_2, \ldots, B_N$ be the $N$ RAMCloud backup groups. We make two assumptions:

1. Each master randomly and uniformly distributes one copy of each of its segments to each of two different backup groups (i.e. each segment is stored in two different backup groups). Each segment is placed independently of all other segments, including those of other masters.
2. A backup group cannot hold a segment that belongs to the master with the same index.

Given the simultaneous failure of two different backups $(B_i, B_j)$, we would like to know the probability that both backups hold a copy of the same segment. We write this event as $B_i \cap B_j$; in other words, we would like to compute:

$$\Pr[B_i \cap B_j]$$

Using the assumption that each segment is distributed independently between two different backups:

$$\Pr[B_i \cap B_j] = 1 - \prod_{k=1,\,l=1}^{k=N,\,l=S} \left(1 - \Pr[M_{k,l} \in B_i \cap B_j]\right)$$

Using assumption 2:

$$\Pr[M_{i,l} \notin B_i \cap B_j] = \Pr[M_{j,l} \notin B_i \cap B_j] = 1, \quad \forall l$$

A single segment $M_{k,l}$ is distributed in two copies. Without loss of generality, we name the first copy $M'_{k,l}$ and the second copy $M''_{k,l}$. We use the Law of Total Probability, assuming $k \neq i, j$. Only the cases in which the first copy lands on $B_i$ or on $B_j$ contribute, and by symmetry they contribute equally:

$$\Pr[M_{k,l} \in B_i \cap B_j] = \Pr[M_{k,l} \in B_i \cap B_j \mid M'_{k,l} \in B_i]\,\Pr[M'_{k,l} \in B_i] + \Pr[M_{k,l} \in B_i \cap B_j \mid M'_{k,l} \in B_j]\,\Pr[M'_{k,l} \in B_j]$$

$$= 2\,\Pr[M''_{k,l} \in B_j \mid M'_{k,l} \in B_i]\,\Pr[M'_{k,l} \in B_i]$$

From assumption 2, we know that the first copy of a segment has $N-1$ possible candidates. The second copy has $N-2$ candidates, since it cannot be sent to the same backup as the first copy. Therefore:

$$\Pr[M'_{k,l} \in B_i] = \frac{1}{N-1}, \qquad \Pr[M''_{k,l} \in B_j \mid M'_{k,l} \in B_i] = \frac{1}{N-2}$$

Therefore,

$$\Pr[M_{k,l} \in B_i \cap B_j] = \frac{2}{(N-1)(N-2)}$$
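The per-segment probability $\frac{2}{(N-1)(N-2)}$ can be cross-checked by brute force for small $N$. The following is a minimal sketch, not part of the original analysis: the function name `enumerate_segment_probability` and the zero-based machine indices are illustrative assumptions. It enumerates every ordered placement of the two copies that assumptions 1 and 2 allow and counts the placements that put one copy on each of the two failed backups.

```python
from fractions import Fraction
from itertools import permutations


def enumerate_segment_probability(n, k, i, j):
    """Exact probability that a segment of master k has one copy on
    backup i and the other on backup j, found by enumeration.

    Machines are numbered 0..n-1 (an illustrative convention).
    """
    if len({i, j, k}) != 3:
        raise ValueError("i, j and k must be three distinct machines")
    # Assumption 2: master k may not place a copy on backup k.
    candidates = [b for b in range(n) if b != k]
    # Assumption 1: the two copies go to two different backups, uniformly.
    # permutations() yields every ordered (first copy, second copy) pair.
    placements = list(permutations(candidates, 2))
    hits = sum(1 for first, second in placements if {first, second} == {i, j})
    return Fraction(hits, len(placements))


if __name__ == "__main__":
    n = 6
    print(enumerate_segment_probability(n, 0, 1, 2))  # enumerated value
    print(Fraction(2, (n - 1) * (n - 2)))             # closed form
```

Both printed values agree for any small `n` tried this way, since exactly two of the $(N-1)(N-2)$ ordered placements hit the pair $(B_i, B_j)$.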
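Each master has $S$ segments, and two of the masters ($i$ and $j$) cannot have a segment on both failed backups. This leaves $S(N-2)$ segments, each of which lands on both $B_i$ and $B_j$ with probability $\frac{2}{(N-1)(N-2)}$. Therefore:

$$\Pr[B_i \cap B_j] = 1 - \prod_{\substack{k=1,\,l=1 \\ k \neq i,j}}^{k=N,\,l=S} \left(1 - \frac{2}{(N-1)(N-2)}\right) = 1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S(N-2)}$$

In order to calculate the loss of data in a general sense, we need to take into account the simultaneous loss of two specific backups and of the master whose data is held by those two backups. We write $B_i \cap B_j \cap M_k$ for the event that some segment of master $k$ has both of its copies on $B_i$ and $B_j$. Assuming that three different machines $i$, $j$, $k$ fail, we would now like to know whether any segment has been lost. We should find:

$$P_{fail} = \Pr[(B_i \cap B_j \cap M_k) \cup (B_i \cap B_k \cap M_j) \cup (B_j \cap B_k \cap M_i)]$$

Using the Inclusion-Exclusion Principle, and since the three events concern the segments of three different masters and are therefore independent (assumption 1):

$$\begin{aligned}
P_{fail} ={}& \Pr[B_i \cap B_j \cap M_k] + \Pr[B_i \cap B_k \cap M_j] + \Pr[B_j \cap B_k \cap M_i] \\
&- \Pr[B_i \cap B_j \cap M_k]\Pr[B_i \cap B_k \cap M_j] - \Pr[B_i \cap B_j \cap M_k]\Pr[B_j \cap B_k \cap M_i] - \Pr[B_i \cap B_k \cap M_j]\Pr[B_j \cap B_k \cap M_i] \\
&+ \Pr[B_i \cap B_j \cap M_k]\Pr[B_i \cap B_k \cap M_j]\Pr[B_j \cap B_k \cap M_i] \\
={}& 3\Pr[B_i \cap B_j \cap M_k] - 3\Pr[B_i \cap B_j \cap M_k]^2 + \Pr[B_i \cap B_j \cap M_k]^3
\end{aligned}$$

As we did before ($i$, $j$, $k$ are different from each other):

$$\Pr[B_i \cap B_j \cap M_k] = 1 - \prod_{l=1}^{S} \left(1 - \Pr[M_{k,l} \in B_i \cap B_j]\right) = 1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S}$$

Therefore, we get:

$$P_{fail} = 3\left[1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S}\right] - 3\left[1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S}\right]^2 + \left[1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S}\right]^3$$

Rate of Failures

Let $F$ be the average yearly failure rate of a given machine. Let $Y$ be the number of days in a year. Let $X_i \sim \mathrm{Poiss}\left(\frac{F}{Y} t\right)$ be the number of failures of machine $i$ over a time interval $t$. Let $N$ be the number of machines. Let $T$ be the recovery time for each machine. We want to compute the average number of simultaneous failures of all machines in an interval of a year. We make two assumptions:

1. All the machines fail independently of each other.
2. If a single machine fails, there is a time slot of $2T$ in which the failures of other machines count as simultaneous failures.

Let $X = \sum_{i=1}^{N} X_i \sim \mathrm{Poiss}\left(\frac{NF}{Y} t\right)$ be the number of failures of all machines over a time interval $t$. We define $g = \frac{NF}{Y}$ as the global rate of failures.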
Given a failure, the probability that $k$ additional machines fail within the $2T$ failure time slot is:

$$\Pr[X = k, t = 2T] = \frac{(2Tg)^k e^{-2Tg}}{k!}$$

We first compute $E_1$, the average number of failures per year in which exactly one server fails, i.e. no other server fails within its time slot. This is identical to the ALOHA network packet rate formula:

$$E_1 = gY \Pr[X = 0, t = 2T] = gY e^{-2Tg}$$

Now we compute $E_2$, the average number of failures per year in which exactly two servers crash simultaneously. To count simultaneous failures, we only take a time slot of $T$ after the first failure. This way, we don't double-count a single simultaneous failure of two machines. We assume that failures are sparse, and that there is an order of magnitude difference between the number of instances in which two servers fail simultaneously and the number of instances in which three servers fail simultaneously:

$$E_2 = gY \Pr[X = 1, t = T] = gY\,(Tg)\,e^{-Tg}$$

We can calculate $E_3$ in the same fashion, again assuming sparse failures:

$$E_3 = gY \Pr[X = 2, t = T] = gY\,\frac{(Tg)^2}{2}\,e^{-Tg}$$

Overall Number of Expected Failures per Year

Number of total failures (1 master and 2 backups):

$$E_{Total} = E_3 \cdot P_{fail} = gY\,\frac{(Tg)^2}{2}\,e^{-Tg} \left\{ 3\left[1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S}\right] - 3\left[1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S}\right]^2 + \left[1 - \left(1 - \frac{2}{(N-1)(N-2)}\right)^{S}\right]^3 \right\}$$

Some numbers

Number of total failures per year (average of 5 failures per year per machine):
8000 segments per machine, 10,000 machines: $E_{Total} = 0.00012$
8000 segments per machine, 100,000 machines: $E_{Total} = 0.001167$
8000 segments per machine, 1,000,000 machines: $E_{Total} = 0.01587$

Probability of total loss:
8000 segments per machine, 1000 machines: $P_{fail} = 0.047$
8000 segments per machine, 10,000 machines: $P_{fail} = 0.001584$
8000 segments per machine, 100,000 machines: $P_{fail} = 0.00000048$

Probability of disk loss:
8000 segments per machine, 10,000 machines: $\Pr[B_i \cap B_j] = 0.798$
8000 segments per machine, 100,000 machines: $\Pr[B_i \cap B_j] = 0.147$
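The closed-form expressions for $\Pr[B_i \cap B_j]$, $P_{fail}$, $E_1$, $E_2$, $E_3$ and $E_{Total}$ are straightforward to evaluate numerically. The sketch below is one possible way to do so, under stated assumptions: all function names are illustrative, $Y$ is taken to be 365, and the recovery time $T$ is expressed in days so that $Tg$ is dimensionless. The text does not state the value of $T$ behind its $E_{Total}$ figures, so the value used in the example run is purely hypothetical.

```python
import math

DAYS_PER_YEAR = 365.0  # Y in the text; 365 is an assumed value


def _per_segment_probability(n):
    """2 / ((N-1)(N-2)): one segment has its copies on both given backups."""
    if n < 4:
        raise ValueError("this sketch requires at least 4 machines")
    return 2.0 / ((n - 1) * (n - 2))


def pair_overlap_probability(n, s):
    """Pr[B_i ∩ B_j] = 1 - (1 - q)^(S(N-2))."""
    q = _per_segment_probability(n)
    # log1p/expm1 preserve precision when q is very small.
    return -math.expm1(s * (n - 2) * math.log1p(-q))


def triple_loss_probability(n, s):
    """P_fail = 3p - 3p^2 + p^3 with p = 1 - (1 - q)^S."""
    q = _per_segment_probability(n)
    p = -math.expm1(s * math.log1p(-q))
    return 3 * p - 3 * p**2 + p**3


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson variable with this mean."""
    return mean**k * math.exp(-mean) / math.factorial(k)


def expected_events_per_year(n, f, t_days, servers):
    """E_1, E_2 or E_3: yearly events in which `servers` machines fail together.

    f is the yearly failure rate per machine, t_days the recovery time T.
    """
    g = n * f / DAYS_PER_YEAR  # global failure rate per day
    if servers == 1:
        # No other failure within the full 2T slot.
        return g * DAYS_PER_YEAR * poisson_pmf(0, 2 * t_days * g)
    # servers - 1 further failures within T after the first one.
    return g * DAYS_PER_YEAR * poisson_pmf(servers - 1, t_days * g)


def expected_data_loss_per_year(n, s, f, t_days):
    """E_Total = E_3 * P_fail."""
    return expected_events_per_year(n, f, t_days, 3) * triple_loss_probability(n, s)


if __name__ == "__main__":
    print(pair_overlap_probability(10_000, 8000))
    print(triple_loss_probability(1000, 8000))
    # Hypothetical recovery time of 2 seconds, converted to days.
    print(expected_data_loss_per_year(10_000, 8000, 5.0, 2.0 / 86_400))
```

With $S = 8000$, `pair_overlap_probability(10_000, 8000)` evaluates to about 0.798 and `triple_loss_probability(1000, 8000)` to about 0.047, in line with the corresponding figures listed above.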