Failure Trends in a Large Disk Drive Population

Authors: Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
Presented by Vinuthna & Arjun
Motivation
• Over 90% of all new information is estimated to be stored on magnetic media
• Most of that data is stored on hard disk drives (HDDs)
• Study failure patterns and the key factors that affect drive lifetime
• Analyze the correlation between failures and parameters that are believed to impact the life of HDDs
• Why? To enable better design and maintenance of storage systems
Previous studies
• Mostly accelerated aging experiments, which are poor predictors of field failures
• Moderate population sizes
• Statistics based on returned units from warranty databases
• No insight into what actually happened to the drive during operation
Our study
• Large study examining hard drives in Google’s infrastructure: more than 100,000 disk drives
• Not only is the disk population large, but the study is also deep and detailed, from an end user's point of view
• Why? Manufacturers quote failure rates below 2%, but end users experience much higher failure rates
• Some studies say that 20-30% of drives that fail in the field are found to have no problem by the manufacturer
System Health Infrastructure
• Collection layer: collects data from each server and dumps it into a repository
• Storage is based on Bigtable, which is built on GFS. Data is organized in 2D cells, with a third dimension for time versions
• The database holds the complete history of environment, error, configuration and repair events
• A lightweight daemon runs on each machine and reports to the collectors
• Large-scale analysis is done with MapReduce
• The framework handles the computation, so the user focuses on the algorithm
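The map-then-reduce style of analysis described above can be sketched as follows. This is a minimal single-process illustration of the two phases, not Google's MapReduce API; the record fields (`model`, `failed`), the function names and the sample records are all hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    """Emit one (drive model, failed flag) pair per drive record."""
    for rec in records:
        yield rec["model"], 1 if rec["failed"] else 0

def reduce_phase(pairs):
    """Group values by key and sum them: failures per drive model."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Made-up records, for illustration only.
records = [
    {"model": "A", "failed": True},
    {"model": "A", "failed": False},
    {"model": "B", "failed": True},
    {"model": "A", "failed": True},
]
failures_per_model = reduce_phase(map_phase(records))
```

The user writes only the two small functions; in a real deployment the framework would take care of distributing them over the repository.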
Some other info
• Data collected over nine months
• Mix of HDDs of different ages, manufacturers and models
• Failure information mined from repair databases going back up to five years
• We monitor temperature, activity levels and SMART parameters
• Results are not affected by population mix
Results
• Utilization
• Previous notion – high duty cycles affect disk
drives negatively
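Utilization vs. AFR (annualized failure rate)
• Higher utilization means more failures only in the infant mortality stage and at end of life
• After the first year, the AFR of high-utilization drives is only moderately higher than that of low-utilization drives
• How is this possible? Survival of the fittest: drives that are vulnerable to high utilization fail early. Also, the previously reported correlation was based on accelerated life tests, which mainly model early life; the same trend is seen here for young drives
• Conclusion: utilization has a much weaker correlation to failure than previously assumed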
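As a rough illustration of how an AFR per utilization bucket could be computed, the sketch below divides failures by drive-years of observed operation. The slides do not spell out the exact formula, so this definition, the function name and the sample counts are assumptions.

```python
def annualized_failure_rate(failures, drive_days):
    """AFR in percent, assuming AFR = failures per drive-year observed."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * failures / drive_years

# Made-up counts per utilization bucket, for illustration only.
buckets = {
    "low": {"failures": 3, "drive_days": 36500},
    "high": {"failures": 6, "drive_days": 36500},
}
afr_by_bucket = {
    name: annualized_failure_rate(b["failures"], b["drive_days"])
    for name, b in buckets.items()
}
```

Normalizing by drive-years rather than by drive count keeps buckets comparable when drives are observed for different lengths of time.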
Temperature
• Previous belief: a temperature change of 15 °C can double the failure rate
• Temperature distribution (PDF): failure rate does not increase with temperature. In fact, lower temperatures may have a higher failure rate
• Age vs. AFR: flat failure rate for mid-range temperatures, modest increase for low temperatures
• High temperature is not associated with a high failure rate, except for old drives
• Conclusion: within a moderate temperature range, temperature is not a strong factor for failure rate
SMART Data Analysis
• Some signals are more relevant to disk failures than others
• Parameters
– Scan errors
– Reallocation counts
– Offline Reallocations
– Probational counts
– Miscellaneous signals
Scan errors
• Errors that are reported when drives scan the
disk surface in the background
• Indicative of surface defects
• Consistent impact on AFR
• After their first scan error, drives are 39 times more likely to fail within 60 days than drives with no scan errors
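A "times more likely to fail" figure of this kind is a ratio of two failure probabilities. The sketch below shows one way such a ratio could be computed; the representation of each drive as days until failure (`None` if it survived), the 60-day default window and the sample data are assumptions for illustration and do not reproduce the study's data.

```python
def failure_probability(days_to_failure, window_days=60):
    """Fraction of drives in a group that failed within window_days."""
    if not days_to_failure:
        raise ValueError("empty group")
    failed = sum(1 for d in days_to_failure
                 if d is not None and d <= window_days)
    return failed / len(days_to_failure)

def relative_risk(with_error, without_error, window_days=60):
    """Failure probability of the error group over that of the baseline."""
    p_err = failure_probability(with_error, window_days)
    p_ok = failure_probability(without_error, window_days)
    if p_ok == 0:
        raise ZeroDivisionError("no failures in the baseline group")
    return p_err / p_ok

# Made-up groups: days until failure, None means the drive survived.
with_error = [10, 45, None, None]       # 2 of 4 failed within 60 days
without_error = [30] + [None] * 9       # 1 of 10 failed within 60 days
risk = relative_risk(with_error, without_error)
```

The same function applies unchanged to the other SMART signals; only the grouping of drives differs.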
Reallocation Counts
• Represent the number of times a faulty sector is remapped to a new physical sector
• Consistent impact on AFR
• After their first reallocation, drives are 14 times more likely to fail within 60 days
Offline reallocations
• Subset of reallocation counts
• Reallocated sectors found during background
scrubbing
• Survival probability is worse than for total reallocations
• After their first offline reallocation, drives are 21 times more likely to fail within 60 days
Probational counts
• Sectors are on ‘probation’ until they fail
permanently or work without problems
• After their first probational count, drives are 16 times more likely to fail within 60 days
• Critical threshold is 1
Miscellaneous signals
• Seek errors
• CRC errors
• Power cycles
• Calibration retries
• Spin retries
• Power-on hours
• Vibration
Conclusion
• Larger population size used compared to
previous studies
• No consistent pattern of higher failure rates for high temperatures or utilization levels
• SMART parameters are well correlated with
failure probabilities
• Prediction models based only on SMART parameters are limited in accuracy
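One way to check how far a SMART-only predictor gets is to measure what share of failed drives it would have flagged. The sketch below flags a drive when any of the four signals discussed earlier reaches a threshold, then computes that share (recall). The field names, the use of threshold 1 for every signal and the sample records are assumptions for illustration and do not reproduce the study's data.

```python
SIGNALS = ("scan_errors", "reallocations",
           "offline_reallocations", "probational")

def is_flagged(drive, threshold=1):
    """True if any monitored SMART count reaches the threshold."""
    return any(drive.get(s, 0) >= threshold for s in SIGNALS)

def recall(drives):
    """Share of failed drives that the SMART-only rule flagged."""
    failed = [d for d in drives if d["failed"]]
    if not failed:
        raise ValueError("no failed drives")
    return sum(1 for d in failed if is_flagged(d)) / len(failed)

# Made-up records: missing counts are treated as zero.
drives = [
    {"failed": True, "scan_errors": 2},
    {"failed": True},
    {"failed": True, "probational": 1},
    {"failed": True},
    {"failed": False, "reallocations": 3},
    {"failed": False},
]
flagged_share = recall(drives)
```

Failed drives that never showed any of these signals cannot be caught by such a rule, which is one reason a SMART-only model has limited accuracy.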