Product Reliability

Chris Nabavi BSc SMIEEE
© 2006 PCE Systems Ltd
Reliability
• Reliability is the probability that a piece of equipment will operate for a specified period of time under the working conditions for which it was designed
The “Bathtub” Curve
[Figure: the bathtub curve — failure rate (failures per hour) versus time, showing the infant mortality, operational life, and end-of-life phases]
Operational Strategy
• 1. Run the equipment without traffic until the infant mortality period has passed (the burn-in period)
• 2. Use the equipment during the operational life period
• 3. Retire or replace the equipment before the end-of-life period
Failure Rate
• This is a statistical measure, applicable to a large number of samples
• The failure rate, λ, is the number of failures per unit time, divided by the number of items in the test
• λ is constant during the operational phase
• λ is often expressed in % per 1000 hours, or in FITs (failures in 10^9 hours)
Mean Time Between Failures (MTBF)
• This is a statistical measure, applicable to a large number of samples
• The MTBF, θ, is the average time between failures, times the number of items in the test
• MTBF = 1 / failure rate, i.e. θ = 1 / λ
Measured MTBF and Failure Rate
• A manufacturer tests 3000 light bulbs for 300 hours and observes 5 failures
• Note: we don't know the average time between failures from this test, because they have not all failed! But approximately:
• MTBF = 3000 × 300 / 5 = 180,000 hours
• Failure rate = 0.556 % per 1000 hours
• This measured MTBF is an underestimate of the true MTBF
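The same arithmetic, written out as a quick Python check of the figures quoted above:

```python
# The light-bulb test from the slide: 3000 bulbs, 300 hours, 5 failures.
bulbs, hours, failures = 3000, 300, 5

device_hours = bulbs * hours              # total time on test
mtbf = device_hours / failures            # approximate MTBF
failure_rate = failures / device_hours    # lambda, per hour

print(f"MTBF         = {mtbf:,.0f} hours")                          # 180,000
print(f"Failure rate = {failure_rate * 1e5:.3f} % per 1000 hours")  # 0.556
```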
MTBF and End of Life
• MTBF is a measure of quality and has nothing to do with the expected lifetime
• To visualise this, think of a candle. After three hours, the wax will all be used up and it will have reached its end of life. This is its expected lifetime.
• However, a quality candle (higher MTBF) will be less likely to fizzle out halfway down. If we light a new candle just as each old one runs out of wax, the mean time between being unexpectedly plunged into darkness is the MTBF
Failure Rate (Graphical Representation)
[Figure: failure rate (failures per hour) versus time during the operational phase; the failure rate is the size of the gap between the curve and the time axis]
Example: Typical Hard Disc
• Rated or expected life = 5 years
• Guaranteed life = 3 years
• MTBF = 1,000,000 hours (approx. 114 years)
• Modern hard discs are fairly reliable, but being mechanical, they wear out after a few years
Disc Replacement Strategy
• Observation: The expected life is much less than the MTBF, and discs are the "weak link" in the system
• Conclusion: Replace the discs just before they wear out, under a preventative maintenance program
Example MTBF Figures
ITEM                            MTBF
N. American Power Utility       2 months!
Router                          10 years
Uninterruptible Power Supply    11 years
File Server                     14 years
Ethernet Hub                    120 years
Transistor                      30,000 years
Resistor                        100,000 years
Operational Life Phase
• Reliability theory only works in the operational life phase, where the failure rates are constant
• With this proviso, the maths is well established and closely related to statistics
• There is a large amount of statistical theory concerned with sampling procedures, aimed at estimating the MTBF of components
• From now on, we are only concerned with the operational phase
Probability of Survival
• The probability of survival to time t is p = e^(−λt)
[Figure: exponential survival curve, probability of survival versus time; it falls from 1 and passes through 0.37 (= 1/e) at t = MTBF, θ]
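A minimal sketch of the survival formula above, evaluated at a few multiples of the MTBF; the MTBF figure is illustrative, and λ is taken as 1/MTBF as on the earlier slides.

```python
import math

mtbf = 100_000                 # illustrative MTBF in hours
lam = 1 / mtbf                 # constant failure rate during operational life

def p_survival(t):
    """Probability that a single item survives to time t."""
    return math.exp(-lam * t)

for multiple in (0.1, 0.5, 1, 2, 3):
    t = multiple * mtbf
    print(f"t = {multiple:>3} x MTBF : p = {p_survival(t):.3f}")
# At t = MTBF the survival probability is 1/e, about 0.37, as on the curve
```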
Non-Redundant System Reliability
• For a system S, made up of components A, B, C, etc.:
• 1 / MTBF_S = 1 / MTBF_A + 1 / MTBF_B + 1 / MTBF_C + etc.
• or λ_S = λ_A + λ_B + λ_C + etc.
• These formulae are used to calculate the MTBF or failure rate of equipment, using published tables covering everything from a soldered joint to a disc sub-system
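A small sketch of the series (non-redundant) formula above; the component names and MTBF figures are made up purely for illustration.

```python
# Series-system reliability: failure rates add, so 1/MTBF_S = sum of 1/MTBF_i.
# The component MTBFs below are illustrative only.
component_mtbfs = {          # hours
    "power supply": 200_000,
    "disc":         1_000_000,
    "mainboard":    500_000,
}

system_failure_rate = sum(1 / m for m in component_mtbfs.values())  # lambda_S
system_mtbf = 1 / system_failure_rate                                # MTBF_S

print(f"System failure rate: {system_failure_rate:.2e} per hour")
print(f"System MTBF:         {system_mtbf:,.0f} hours")
```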
Mean Time To Repair (MTTR)
• The formulae discussed earlier assume zero maintenance, i.e. if a device breaks down, it is not fixed
• Often, it is important to know the probability of fixing a broken system within a given time T
• For this we need to know the MTTR, which is worked out by examining all the steps involved and the failure modes
• The probability of fixing the broken system within time T can then be predicted using exponentials similar to those seen already
Operational Readiness
• Operational readiness is the probability that a system will be ready to fulfil its function when called upon
• E.g. the probability that an email sent at a random time will get through
• Operational readiness = MTBF / (MTBF + MTTR)
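A one-line calculation of the formula above; the MTBF and MTTR figures are illustrative, not taken from the slides.

```python
# Operational readiness = MTBF / (MTBF + MTTR).
# Illustrative figures: MTBF of 50,000 hours, MTTR of 4 hours.
mtbf = 50_000
mttr = 4

readiness = mtbf / (mtbf + mttr)
print(f"Operational readiness = {readiness:.5f}")   # about 0.99992
```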
Active Redundancy
[Diagram: Device 1 and Device 2 in parallel; either device can do the job]
Active Redundancy Calculations
 MTBFS = 1 / 1 + 1 / 2 - 1 / (1 + 2 )
 Probability of survival = e-1t + e-2t - e-1t . e-2t

For an active redundancy system, S made from two identical sub-systems, A
MTBFS = 1.5 x MTBFA
 Note: The failure rate is no longer constant with time
19
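A sketch of the active-redundancy formulae above for two identical sub-systems; the failure rate used is an illustrative assumption.

```python
import math

lam1 = lam2 = 1e-5           # illustrative, identical failure rates (per hour)

# MTBF of the redundant pair (formula from the slide)
mtbf_s = 1/lam1 + 1/lam2 - 1/(lam1 + lam2)
print(f"MTBF_S = {mtbf_s:,.0f} hours")        # 1.5 x MTBF_A = 150,000 here

def p_survival(t):
    """Probability that at least one of the two devices is still working."""
    return math.exp(-lam1*t) + math.exp(-lam2*t) - math.exp(-(lam1 + lam2)*t)

print(f"p(survive to 100,000 h) = {p_survival(100_000):.3f}")   # about 0.600
```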
Passive Redundancy
[Diagram: Device 1 in service, Device 2 on standby and not normally powered; when Device 1 fails, the system switches over to Device 2]
Passive Redundancy Calculations
• For a passive redundancy system S made from two identical sub-systems A, and ignoring the reliability of the switch-over system:
• Probability of survival = e^(−λ_A·t) × (1 + λ_A·t)
• MTBF_S = 2 × MTBF_A
• Note: The failure rate is no longer constant with time
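The corresponding sketch for the passive case, using the same illustrative failure rate and assuming an ideal switch-over, as the slide does.

```python
import math

lam_a = 1e-5                 # illustrative failure rate of each sub-system
mtbf_a = 1 / lam_a

def p_survival(t):
    """Survival of a standby pair with an ideal switch-over."""
    return math.exp(-lam_a * t) * (1 + lam_a * t)

print(f"MTBF_S = {2 * mtbf_a:,.0f} hours")                      # 2 x MTBF_A
print(f"p(survive to 100,000 h) = {p_survival(100_000):.3f}")   # about 0.736
```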
Error Detection and Correction
• There are two trivially simple ways to guard against errors
• Send the information twice: then at the receiver, if the two copies differ, we have detected an error
• Send the information three times: then at the receiver, accept the majority verdict to correct an error
• But we can do better than this ...
The Hamming (7,4) code
Information bits   Parity check bits
0 1 0 0            1 0 1
0 0 1 0            1 1 0
1 0 0 0            0 1 1
0 0 0 1            1 1 1

• On the slide, the parity check bits are shown in pink and the information bits in green
• 16 codes can be obtained by adding any rows mod 2
• All 16 codes have a Hamming distance of 3 or more, so the code can correct a single error
• The Golay (23,12) code has 12 information bits and 11 parity check bits and can correct 3 errors
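To make the encoding and correction concrete, here is a minimal Python sketch of a (7,4) Hamming code. It uses the standard systematic generator matrix (information bits first, parity bits last), which may be ordered differently from the colour-coded table on the slide, and it corrects a single error by nearest-codeword decoding rather than syndrome decoding.

```python
# Hypothetical sketch of Hamming (7,4) encoding and single-error correction.
# G is the standard systematic generator matrix [I4 | P]; the slide's table
# may order the columns differently.

import itertools

G = [  # 4 generator rows; first 4 bits information, last 3 parity
    [1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
]

def encode(data):
    """Add generator rows mod 2 for each '1' data bit (as on the slide)."""
    word = [0] * 7
    for bit, row in zip(data, G):
        if bit:
            word = [w ^ r for w, r in zip(word, row)]
    return word

# All 16 codewords, as stated on the slide
codewords = [encode(list(d)) for d in itertools.product([0, 1], repeat=4)]

def correct(received):
    """Nearest-codeword decoding: fixes any single-bit error."""
    return min(codewords,
               key=lambda c: sum(a != b for a, b in zip(c, received)))

msg = [1, 0, 1, 1]
sent = encode(msg)
received = sent.copy()
received[2] ^= 1                        # flip one bit in transit
assert correct(received) == sent        # the single error is corrected
print(sent, received, correct(received))
```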
Redundant Array of Independent Discs
• There are 5 RAID levels:
  1. Mirrored discs
  2. Hamming code error correction
  3. Single check disc per group
  4. Independent read and write
  5. Spread data and parity over all discs
• In each case, disc errors are corrected; the differences are largely in the system performance
Effect of Non-Maintenance
• Consider a file server with 6 discs in an array
• The probability of getting a disc failure in a 6-disc RAID array in one year is about 5%, which is fairly high. Assume that this happens and the problem is left unfixed for a further 3 weeks
• The probability of another disc failing in this time is about 0.25%. If this happens too, you lose the server!
• The odds of this happening in any year are 5% of 0.25%, or 1 in 8000, divided by the number of RAIDs
Effect of Improving the MTTR
• If the MTTR of the RAID had been 3 hours instead of 3 weeks, the odds are somewhat different:
• The probability of the first failure is still 5%. But now the probability of getting a second failure in the ensuing 3 hours is 0.0003% instead of 0.25%
• So the odds of losing the RAID in any year improve to 1 in 6,666,675 from the previous 1 in 8000
• Moral of the story: Fix the First Fault Fast
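A quick check of the odds quoted on the last two slides, taking the slide's probabilities at face value:

```python
# Probabilities quoted on the slides (per RAID array, per year)
p_first = 0.05                 # probability of a first disc failure in a year

p_second_3_weeks = 0.0025      # second failure within a 3-week repair window
p_second_3_hours = 0.000003    # second failure within a 3-hour repair window

for label, p_second in [("3-week MTTR", p_second_3_weeks),
                        ("3-hour MTTR", p_second_3_hours)]:
    p_loss = p_first * p_second
    print(f"{label}: p(lose the array) = {p_loss:.2e} "
          f"(about 1 in {1 / p_loss:,.0f})")
```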