Engineering the right accelerated life tests for reliability
qualification: customer use conditions vs. industry
standards based approaches
Presenter: Sudarshan Rangaraj (rsudarsh@lab126.com)
Hardware Reliability Manager – Amazon Lab126, Sunnyvale CA
Based largely on papers authored at IRPS, IITC, ECTC and review of literature
Acknowledgements: current and former colleagues at Intel and Amazon Lab126
Motivation and Relevance
• Industry standards (e.g. JEDEC, AEC, MIL) provide qualification criteria such as:
– HAST: 130C / 85% RH / 96 hours, 0 fails / 45 tested
– Bake: 150C for 1000 hours
• Blanket qualification criteria without knowledge of product use conditions
(UC) can be undesirable:
– Over-design, extra cost for reliability margin most customers will not use
– Field failures: negative to user experience and company brand
• Goal of reliability engineering:
– Start with the customer
– Use field intelligence to develop UC models, compare them to standards
– Strive to meet the higher bar: reliability can be a marketing advantage!
Advantages of standards based testing
• Allows suppliers and their customers to speak a common language
• Helps overcome differences in reliability certification methodology, helps
clarify expectations
• Guarantees a consistent reliability bar
• Valuable in well established industries
Importance of understanding usage conditions
• A robust reliability qualification process protects the customer i.e. ensures
sufficient reliability while optimizing cost for the manufacturer
• Three elements of robust reliability engineering:
1. Quantified understanding of customer usage patterns and use conditions
2. Well designed accelerated life tests
3. Acceleration models (of sufficiently high confidence) that link the two
• Pitfalls of not making an accurate link between stress and use conditions
– Over-design leading to added cost and impact to bottom line
– Under-design → high customer returns, poor experience → erodes brand
Talk outline
• Overview of common failure mechanisms in IC components
• Analysis of field use condition data: review of one example
• Contrast use-condition-based qualification with standards-based
qualification using 2 case studies
1. Moisture and voltage bias induced failures in IC components
2. Temperature cycling failures in IC components
IC component – package stack-up
[Figure: labeled images of the chip and package stack-up. Images from proceedings of IITC 2013]
• Silicon substrate
• Devices: front-end
• Metals/via: back-end with ultra low-k ILD
• Metals/via: far-backend with polymer ILD
• Bumps: C4 with Cu – Pb-free solder
• Package: metals/via
Some common failure modes in IC components and
associated extreme use conditions
Reliability failure mechanism (numbered) and associated extreme use condition (–):
1. Front end: transistor gate dielectric reliability
2. Backend: dielectric breakdown
3. Backend & bumps: electromigration
– Mechanisms 1 to 3: high power states at high voltage, frequency, temperature and current
4. Backend: stress voiding
– Sustained operation at high temperatures
5. Moisture ingress: de-lamination, electro-chemical corrosion, metal migration, pop-corning etc.
– Low power modes like OFF/Stand-by
– High humidity and temperature ambient conditions e.g. 25C 80% RH
6. Temperature cycling: cracking and de-lamination
– Repeated cold temperature exposures when part may be OFF
– Power cycles when part is ON
• Dominant failure modes for an IC used in a server, cell-phone and a
wearable device will be very different because usage is different!
Chip operating states
• OFF mode: chip and package at ambient T, ambient RH at part surface
• STAND-BY mode: ambient T + self-heating (~10C) from few “always ON” IO pins
• ON state: chip at high T, low RH at the part surface
[Figure: effective RH vs. temperature at the part surface. OFF state: low T, high RH; STAND-BY: higher T, lower RH; ON state: high T, low RH]
OFF and STAND-BY modes are critical states for moisture absorption
into chip/package: highest RH at part surface
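The drop in surface RH with self-heating can be illustrated with a short calculation. This is a minimal sketch, not the presenter's model: it assumes the water vapor partial pressure at the part surface equals the ambient value, and it uses the Magnus approximation for saturation vapor pressure. The 50C self-heating used for the ON state is a hypothetical value chosen only for illustration; the ~10C STAND-BY figure and the 25C 80% RH ambient come from the slides.

```python
import math


def saturation_vapor_pressure_hpa(temp_c):
    """Magnus approximation for saturation vapor pressure over water (hPa)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def surface_rh_percent(ambient_temp_c, ambient_rh_percent, self_heating_c):
    """RH at a surface that is self_heating_c warmer than the ambient air.

    Assumption: the vapor partial pressure at the surface is the ambient
    one, so only the saturation pressure changes with temperature.
    """
    vapor_pressure = (ambient_rh_percent / 100.0
                      * saturation_vapor_pressure_hpa(ambient_temp_c))
    surface_temp_c = ambient_temp_c + self_heating_c
    return 100.0 * vapor_pressure / saturation_vapor_pressure_hpa(surface_temp_c)


# Ambient condition from the slides: 25C, 80% RH
off_rh = surface_rh_percent(25.0, 80.0, 0.0)       # OFF: no self-heating
standby_rh = surface_rh_percent(25.0, 80.0, 10.0)  # STAND-BY: ~10C self-heating
on_rh = surface_rh_percent(25.0, 80.0, 50.0)       # ON: 50C is a hypothetical value

for state, rh in (("OFF", off_rh), ("STAND-BY", standby_rh), ("ON", on_rh)):
    print(f"{state:9s} surface RH = {rh:5.1f}%")
```

Under these assumptions the ordering matches the slide: the OFF state sees the full ambient RH, STAND-BY sees a lower value, and the ON state sees the lowest.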
Use conditions by product segment: risk from moisture
Segments are listed in order of increasing moisture risk.
• Servers, high performance computing & high-end desktop
– ON time as fraction of product lifetime: very large
– OFF/STAND-BY events, durations: very few events of short duration
– Ambient environments: controlled T, RH in data centers and server farms
• Desktop enterprise
– ON time as fraction of product lifetime: lower
– OFF/STAND-BY events, durations: sizeable
– Ambient environments: indoor T, RH
• Mobile - laptop
– ON time as fraction of product lifetime: lower
– OFF/STAND-BY events, durations: sizeable number of longer duration events
– Ambient environments: some outdoor T, RH exposure; worse in hot humid GEOs
• Ultra-mobile: tablet, smartphone
– ON time as fraction of product lifetime: lower
– OFF/STAND-BY events, durations: sizeable number of longer duration events
– Ambient environments: often outdoor T, RH; worse in hot humid GEOs
• Wearables/IoT
– A new set of applications, still being understood?
Events leading to moisture exposure
• Packaging/assembly operations on the factory floor
• Customer warehouses during storage
• Customer factories during surface mount
• Usage by end customer especially in hot + humid locations
Failure modes due to moisture and temperature cycling
[Figure: package blistering and cracking between copper traces after surface mount on to system motherboard, a.k.a. “pop-corning”; the blister is labeled in the image [Literature]]
[Figure: edge de-lamination after temp-cycle B (125 to -55C) on very early 22nm silicon process. Proceedings of ECTC 2013]
Moisture diffusion under a 25C 80% RH ambient exposure
Finite element modeling
[Figure: panels for moisture diffusion through the underfill and through the package (PKG), each shown at 7 days and 50 days]
[Figure: C/CSAT vs. time (days) at 25C 80% RH, for package and chip]
• Under sustained exposure, moisture is confined to the edge 1mm of chip/package
• Consistent with empirical failure observations
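The time dependence of moisture penetration can be illustrated with a much simpler model than the finite element analysis on this slide. The sketch below assumes 1D Fickian diffusion into a semi-infinite solid whose surface is held at saturation, for which C/CSAT = erfc(x / (2·sqrt(D·t))). The diffusivity `D_ASSUMED` is a hypothetical placeholder, not a value from the talk, so the numbers are not expected to reproduce the slide's results.

```python
import math

SECONDS_PER_DAY = 86400.0
D_ASSUMED = 5.0e-14  # m^2/s, hypothetical diffusivity chosen for illustration only


def relative_concentration(depth_m, time_s, diffusivity_m2_s):
    """C/Csat at a given depth for 1D diffusion into a semi-infinite solid.

    Assumes the exposed surface sits at saturation from t = 0 onward.
    """
    if depth_m == 0.0:
        return 1.0
    return math.erfc(depth_m / (2.0 * math.sqrt(diffusivity_m2_s * time_s)))


# Exposure times taken from the slides: 7 days (nominal) and 50 days (worst case)
profiles = {}
for days in (7, 50):
    time_s = days * SECONDS_PER_DAY
    profiles[days] = {
        depth_mm: relative_concentration(depth_mm * 1e-3, time_s, D_ASSUMED)
        for depth_mm in (0.0, 0.25, 0.5, 1.0, 2.0)
    }

for days, profile in profiles.items():
    row = ", ".join(f"{d} mm: {c:.3f}" for d, c in profile.items())
    print(f"{days:2d} days -> C/Csat at {row}")
```

Because penetration depth scales with sqrt(D·t), going from 7 to 50 days deepens the wetted zone by only about sqrt(50/7) ≈ 2.7 times in this simplified picture.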
Mining use conditions: data collection and analysis
• Customer profile data from ~2000 worldwide laptop users for one year
• OFF (shutdown), STAND-BY and HIBERNATE times recorded → data used
to generate distributions
Format of user data:
User ID    OFF time           STAND-BY time
1          {-, -, -, …}       {-, -, -, …}
2          {-, -, -, …}       {-, -, -, …}
3          {-, -, -, …}       {-, -, -, …}
…
2123       {-, -, -, …}       {-, -, -, …}
• Distributions combining all data from all users
• Distribution of Max{OFF time} and Max{STAND-BY time} per user
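The two kinds of distribution described above can be sketched in a few lines. The records below are hypothetical toy values in hours, not the study data, and the nearest-rank percentile is an assumed convention (the slides do not say which one was used).

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical toy records: user ID -> list of OFF durations in hours
off_times_by_user = {
    1: [8.0, 10.5, 60.0],
    2: [1.0, 7.5, 9.0, 12.0],
    3: [14.0, 200.0],
}

# Distribution 1: pool every recorded duration from every user
pooled = [t for times in off_times_by_user.values() for t in times]

# Distribution 2: keep only the longest duration seen for each user
max_per_user = [max(times) for times in off_times_by_user.values()]

print("pooled median (h):      ", percentile(pooled, 50))
print("pooled 99th pct (h):    ", percentile(pooled, 99))
print("max-per-user median (h):", percentile(max_per_user, 50))
```

The pooled distribution describes a typical event, while the max-per-user distribution describes the worst event each user ever produced, which is why it yields the more conservative exposure time.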
Moisture exposure in use condition: user data
[Figure: all data from 2000 users, non-S0 duration distribution. Cumulative probability vs. time (hours). 99th percentile → 4 days; 99.5th percentile → 7 days]
[Figure: Max {OFF time}, i.e. 100th percentile per user. Cumulative probability vs. time (days). 95th percentile → 50 days]
Standby/Off times: Nominal = 7 days, Worst case = 50 days
Conservative ambient condition: 25C 80% RH. 20% of cities in the world
experience this for 5% of the year, i.e. a 95th percentile condition from surveys
Phenomenological Acceleration Model for dominant
moisture induced chip – package failure modes
Peck’s law fits empirically observed HAST fails
Variable       Range used in study    Acceleration factor
Temperature    85 – 130C              Ea = 0.71 eV (90% CL lower bound)
RH             65 – 85%               n = 4 (best estimate)
Voltage (V)    1.2 – 3.3V             m = 0.5 (best estimate), Vt = 1.4V
• Temperature – strongest variable
• Relative humidity and voltage – relatively weaker effects
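The relative strength of the temperature and humidity terms can be checked with the commonly cited temperature-humidity form of Peck's model, AF = (RH_stress / RH_use)^n · exp[(Ea/k)(1/T_use − 1/T_stress)]. This is a sketch under that assumption. The voltage term is left out because the text gives m and Vt but not the functional form, so the result should not be expected to reproduce the stress times tabulated later in the deck.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5


def peck_acceleration_factor(stress_temp_c, stress_rh, use_temp_c, use_rh,
                             ea_ev=0.71, rh_exponent=4.0):
    """Temperature-humidity acceleration factor (voltage term omitted).

    ea_ev and rh_exponent default to the values quoted on the slide.
    """
    thermal = math.exp(
        ea_ev / BOLTZMANN_EV_PER_K
        * (1.0 / (use_temp_c + 273.15) - 1.0 / (stress_temp_c + 273.15))
    )
    humidity = (stress_rh / use_rh) ** rh_exponent
    return thermal * humidity


# Effect of sweeping each variable across the range used in the study
thermal_only = peck_acceleration_factor(130.0, 85.0, 85.0, 85.0)    # 85C -> 130C
humidity_only = peck_acceleration_factor(130.0, 85.0, 130.0, 65.0)  # 65% -> 85% RH

# 130C / 85% RH stress against the 25C / 80% RH use condition
af_hast = peck_acceleration_factor(130.0, 85.0, 25.0, 80.0)

print(f"temperature sweep factor: {thermal_only:.1f}")
print(f"humidity sweep factor:    {humidity_only:.1f}")
print(f"130C/85% vs 25C/80%:      {af_hast:.0f}")
```

With the slide's Ea and n, sweeping temperature across the study range changes the acceleration factor several times more than sweeping RH, consistent with temperature being the strongest variable.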
Accelerated life testing: failure rate data for a “typical”
failure mode
[Figure: probability plot (fitted Arrhenius, fitted Ln) for start readout. Lognormal, arbitrary censoring, ML estimates. Axes: percent vs. time to failure]
Table of Statistics
Temp (C)    RH (%)    Loc        Scale       AD*
85          85        7.11990    0.658992    25.184
110         85        5.50666    0.658992    14.916
130         65        5.92739    0.658992    55.509
130         85        4.36013    0.658992    10.373
[Figure: relation plot (Temp vs MTTF). Axes: MTTF (Hr) vs. Temp (C), with the use condition (UC) marked at 25C. Fitted lines: EA-1 (Ea=0.44), EA-AVG (Ea=0.71), EA-2 (Ea=1.1)]
• Thermal acceleration differs between the 130 – 110C and 110 – 85C ranges
• Epoxy glass transition is ~120C; moisture diffusion is over-accelerated above 120C
• Stressing is recommended below the glass transition of packaging polymers; T < Tg is
what is relevant for use conditions anyway
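As a worked example, an activation energy can be backed out of the lognormal location parameters (Loc) in the table of statistics on this slide, since Loc is the natural log of the median time to failure. The sketch assumes the table rows pair up as 85C/85% RH, 110C/85% RH, 130C/65% RH and 130C/85% RH, which is how the extracted columns line up but is not stated explicitly.

```python
BOLTZMANN_EV_PER_K = 8.617333e-5

# (temperature in C, RH in %) -> lognormal Loc, pairing assumed from the table
loc_by_condition = {
    (85, 85): 7.11990,
    (110, 85): 5.50666,
    (130, 65): 5.92739,
    (130, 85): 4.36013,
}


def activation_energy_ev(temp_low_c, loc_low, temp_high_c, loc_high):
    """Arrhenius Ea from two Loc values measured at the same RH.

    ln(t50) = const + Ea / (k T), so Ea = k * d(Loc) / d(1/T).
    """
    inverse_t_diff = (1.0 / (temp_low_c + 273.15)
                      - 1.0 / (temp_high_c + 273.15))
    return BOLTZMANN_EV_PER_K * (loc_low - loc_high) / inverse_t_diff


# Both segments are evaluated at 85% RH so that only temperature changes
ea_85_to_110 = activation_energy_ev(85, loc_by_condition[(85, 85)],
                                    110, loc_by_condition[(110, 85)])
ea_110_to_130 = activation_energy_ev(110, loc_by_condition[(110, 85)],
                                     130, loc_by_condition[(130, 85)])

print(f"Ea from 85C to 110C:  {ea_85_to_110:.2f} eV")
print(f"Ea from 110C to 130C: {ea_110_to_130:.2f} eV")
```

Under the assumed pairing, both segments give essentially the same value, close to 0.76 eV. That is what a single fitted Arrhenius slope would produce, so these Loc values appear to be fitted estimates rather than independent per-temperature fits, and they do not by themselves show the range-dependent slopes (Ea=0.44 and Ea=1.1) drawn on the relation plot.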
HAST stress durations: use conditions vs. JEDEC JESD22-A110
standard requirements
Columns: (A) Nominal: stress time equivalent to 7 days at 25C 80% RH (hrs); (B) Worst case: stress time equivalent to 50 days at 25C 80% RH (hrs); (C) JEDEC Std.: JESD22-A110 equivalent readout (hrs)
Stress condition    (A)     (B)     (C)
130C 85% RH         <1      5.7     96
110C 85% RH         2.5     18      264
85C 85% RH          17      121     1000
• Conservative worst case (50 days @ 25C 80% RH): JEDEC requirements are more than
8 times higher than use condition based requirements
• Intel uses a “test to fail” approach during process development. These gating
readouts go beyond use condition based requirements
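The margin between the JEDEC readouts and the worst-case use condition requirement follows directly from the stress-time table. The short check below uses only the tabulated hours; the dictionary layout is an illustrative choice.

```python
# Stress condition -> (worst-case use-condition equivalent hrs, JEDEC readout hrs)
stress_hours = {
    "130C 85% RH": (5.7, 96.0),
    "110C 85% RH": (18.0, 264.0),
    "85C 85% RH": (121.0, 1000.0),
}

# Ratio of the JEDEC readout to the worst-case (50 days at 25C 80% RH) equivalent
jedec_margin = {
    condition: jedec / worst_case
    for condition, (worst_case, jedec) in stress_hours.items()
}

for condition, ratio in jedec_margin.items():
    print(f"{condition}: JEDEC readout is {ratio:.1f}x the worst-case requirement")
```

The smallest ratio comes from the 85C 85% RH row, and the ratio grows at the hotter stress conditions.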
Some thoughts about temperature cycling
JEDEC standard for temp-cycle
Most common: TCB 125 to -55C,
700 cycles
• Having to demonstrate reliability down to -55 or -65C may require a trade-off
between reliability and performance/yield
– Dielectric constant (electrical performance) vs. fracture toughness
– Epoxy flow characteristics vs. fracture toughness
Some examples of cold-side effects: material response
• Crack driving energy (F.E. modeling) rises sharply below -20C
• Measured strain-to-fail drops 2X from 25C to -55C for passivation polymer
• Solder fracture toughness drops precipitously below -25C [Literature]
If T < -25C is not relevant to the use condition of the component, then by
using TCB for qualification we might be solving problems that are not
relevant to customer usage
Risk of over or under-assessing field reliability
[Figure: number of cycles at various operating ΔT [Tmax − Tmin] equivalent to TCB 700 cycles (JEDEC standard), with regions marked for desktop & servers, an example use condition requirement, and highly mobile devices]
A simple temp-cycle model (Coffin-Manson):
$N_{f1}/N_{f2} = (\Delta T_2/\Delta T_1)^n$
• For an always-ON server in a controlled environment, TCB 700 cycles may be overkill
– No cold exposures, -55C is not relevant
– At ΔT of 50C, TCB 700 represents 10 – 50 cycles/day for 5 years
• For a part that may be used in a COMMS application with outdoor exposures in Alaska
and a 10 year life requirement, TCB 700 under-assesses field reliability
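The equivalence between TCB 700 and a milder field cycle can be sketched with the Coffin-Manson relation above. The exponent n is not given in the slides; the values 2.5 and 3.8 below are illustrative assumptions, chosen because they land near the 10 to 50 cycles/day range quoted for a ΔT of 50C.

```python
TCB_CYCLES = 700
TCB_DELTA_T_C = 125.0 - (-55.0)  # TCB swing from -55C to 125C
USE_DELTA_T_C = 50.0             # field temperature swing from the slide
LIFE_DAYS = 5 * 365              # 5 year life from the slide


def equivalent_use_cycles(stress_cycles, stress_delta_t, use_delta_t, exponent):
    """Coffin-Manson: N_use / N_stress = (dT_stress / dT_use) ** n."""
    return stress_cycles * (stress_delta_t / use_delta_t) ** exponent


def cycles_per_day(exponent):
    """Field cycles per day over the product life that TCB 700 stands for."""
    total = equivalent_use_cycles(TCB_CYCLES, TCB_DELTA_T_C,
                                  USE_DELTA_T_C, exponent)
    return total / LIFE_DAYS


# The exponents are assumptions for illustration, not values from the talk
for n in (2.5, 3.8):
    print(f"n = {n}: TCB 700 is about {cycles_per_day(n):.1f} cycles/day "
          f"at a 50C swing for 5 years")
```

The strong dependence on n shows why the choice of exponent, and therefore of the dominant failure mechanism, matters as much as the choice of stress condition.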
Key messages
• Important to pick stress conditions that are relevant to worst case usage to
avoid artifacts not relevant to worst case use, e.g. embrittlement
• Standards offer a guideline or starting point. Qualification plans should be
based on knowledge of use conditions
• Limiting failure modes in the components that comprise a system will likely be
very different for various applications; standards don’t directly address that