Paper Summary #3 - Submitted by Asaf Cidon

advertisement
Loss of Data Probability in Case of Multiple Failures
We assume we have N RAMCloud masters. Each RAMCloud master partitions its data
into S segments. These segments are members of the master group. We represent these
segments as M i , j , where i is the master index, and j is the segment index. Let
M 1 , M 2 , ..., M N be master groups, and M i  M i ,1 , M i , 2 , ..., M i ,S , i  1,2, ..., N .
We assume each master acts both as a master and as a backup. Let B1 , B2 , ..., BN be N
RAMCloud backup groups.
We make three assumptions:
1. Each master randomly and uniformly distributes one copy of each of its segments
to two different backup groups (i.e. each partition is stored in two different
backup groups). Each segment distribution is independent of other masters.
2. A backup group cannot hold a segment that belongs to a master with the same
master index.
Given the simultaneous failure of two different backups ( Bi , B j ), we would like to know
the probability of whether both of the backups keep at least one copy of a same partition.
In other words, we would like to compute: Pr  Bi  B j   
Using the assumption that each segment is distributed independently between two
different backups: Pr Bi  B j    
k  N ,l  S

k ,l 1
Pr M k ,l  Bi  B j 
Using constraint 2: Pr M i ,l  Bi  B j   Pr M j ,l  Bi  B j   1, l .
A single segment, M i ,k is distributed in two copies. Without loss of generality, we name
the first copy M i',k , and the second copy M i'',k .
We use the Law of Total Probability, assuming k  i, j :

Pr M
 
 B  Pr M

B 
Pr M k ,l  Bi  B j   Pr M k ,l  Bi  B j | M i',k  Bi  Pr M i',k  Bi 

k ,l
 Bi  B j | M i',k
 
 2  Pr M i'',k  Bi | M i',k  Bi  Pr M i',k  Bi

j
'
i ,k
j
From assumption 2, we know that the first copy of a segment has n-1 possible candidates.
The second copy of a segment has n-2 candidates, since it cannot be sent to the same
backup as the first copy. Therefore:


Pr M i',k  Bi 
1
N 1


Pr M i'',k  Bi | M i',k  Bi 
1
N 2
Therefore, Pr M k ,l  Bi  B j  
2
.
N  1N  2
We have S segments per master, and there are two masters that cannot hold the same
segments as the two failed backups. Therefore:
Pr B  B
i
   
k  N ,l  S
j
k ,l 1
k i , j

 

2
2
1 
  1 

  N  1 N  2    N  1 N  2 


2

Pr Bi  B j     1  1 
 N  1N  2 
S  N 2 
S  N 2 
In order to calculate the loss of data in a general sense, we need to take into account the
simultaneous loss of two specific backups and the loss of a master that holds the same
data that is held by the two failed backups.
Assuming that we have a loss of three different masters, we would now like to know
whether any segments have been compromised. We should find:
Pfail  Pr Bi  B j  M k     Bi  Bk  M j     B j  Bk  M i   
Using the Inclusion-Exclusion Principle and assumption 2:
Pr Bi  B j  M k     Bi  Bk  M j     B j  Bk  M i    
Pr Bi  B j  M k     Pr Bi  Bk  M j     Pr B j  Bk  M i    
Pr Bi  B j  M k    Pr Bi  Bk  M j    
Pr Bi  B j  M k    Pr B j  Bk  M i    
Pr Bi  Bk  M j    Pr B j  Bk  M i    
Pr Bi  B j  M k    Pr Bi  Bk  M j    Pr B j  Bk  M i    
3  Pr Bi  B j  M k     3  Pr Bi  B j  M k     Pr Bi  B j  M k   
2
As we did before (i,j,k are different from each other):
Pr Bi  B j  M k      Pr M k ,l  Bi  B j 
S
l 1


2

Pr Bi  B j  M k     1 
 N  1N  2 
S
3
Therefore, we get:
2
Pfail
S
S
S
 
   
   
 
2
2
2
   31  1 
   1  1 
 
 31  1 
   N  1 N  2      N  1 N  2       N  1 N  2  
3
Rate of Failures
Let F be the average yearly failure rate of a given machine. Let Y be the number of days
F 
in a year. Let X i ~ Poiss  , t  be the number of failures in a machine over a time
Y 
interval t. Let N be the number of machines. Let T be the recovery time for each machine.
We want to compute the average number of simultaneous failures of all machines in an
interval of a year. We make several assumptions:
1. All the machines fail independently of each other
2. If a single machine fails, there is a time-slot of 2T in which if other machines fail,
the failures count as simultaneous failures
N
 NF 
, t  be the number of failures of all machines over a time
Let X   X i ~ Poiss 
 Y 
i 1

interval t. We define g 
NF
as the global rate of failures. Given a failure, the
Y
probability that k additional machines will fail at the 2T failure time slot is:
Pr  X  k , t  2T  
2Tg k e 2Tg
k!
We first compute E1 the average number of failures per year where exactly one server
fails simultaneously. This is identical to the Aloha network package rate formula:
E1  gY Pr X  0, t  2T   gYe 2Tg
Now we compute E2 the average number of failures where exactly two servers
simultaneously crash per year. In order to count simultaneous failure, we only take a time
slot of T after the first failure. This way, we don’t double-count a single failure of two
machines. We assume that failures are sparse, and that there is an order of magnitude
difference between the number of instances where two servers failure simultaneously,
and the number of instances where three servers fail simultaneously:
E2  gY Pr X  1, t  T   gY Tg e Tg
We can calculate E3 in the same fashion, again assuming sparse failures:
E3  gY Pr  X  2, t  T   gY
Tg 2 e Tg
2
Overall Number of Expected Failures per Year
Number of total failures (1 master and 2 backups):
ETotal  E3  Pfail 

2

Tg  e Tg   
gY
31  1 
2

 





N  1N  2 
2
S
S 2
S 
  
   
 
2
2
   1  1 
  
  31  1 
    N  1 N  2     N  1 N  2   

Some numbers
Number of total failures per year (average of 5 failures per year per machine):
8000 segments per machine, 10,000 machines: ETotal  0.00012
8000 segments per machine, 100,000 machines: ETotal  0.001167
8000 segments per machine, 1,000,000 machines: ETotal  0.01587
Probability of total loss:
8000 segments per machine, 1000 machines: Pfail  0.047
8000 segments per machine, 10,000 machines: Pfail  0.001584
8000 segments per machine, 100,000 machines: Pfail  0.00000048
Probability of disk loss:
8000 segments per machine, 10,000 machines: Pr Bi  B j     0.798
8000 segments per machine, 100,000 machines: Pr Bi  B j     0.147
Download