Engineering the right accelerated life tests for reliability qualification: customer use conditions vs. industry standards based approaches

Presenter: Sudarshan Rangaraj (rsudarsh@lab126.com)
Hardware Reliability Manager – Amazon Lab126, Sunnyvale CA
Based largely on papers authored at IRPS, IITC, ECTC and review of literature
Acknowledgements: current and former colleagues at Intel and Amazon Lab126

Motivation and Relevance
• Industry standards, e.g. JEDEC, AEC, MIL, provide qualification criteria, e.g.
  – HAST: 130C/85% RH, 96 hours, 0 fails / 45 tested
  – Bake: 150C, 1000 hours
• Blanket qualification criteria applied without knowledge of product use conditions (UC) can be undesirable:
  – Over-design: extra cost for reliability margin most customers will not use
  – Field failures: negative for user experience and company brand
• Goal of reliability engineering:
  – Start with the customer
  – Use field intelligence to develop UC models, compare them to standards
  – Strive to meet the higher bar; reliability can be a marketing advantage!

Advantages of standards based testing
• Allows suppliers and their customers to speak a common language
• Helps overcome differences in reliability certification methodology, helps clarify expectations
• Guarantees a consistent reliability bar
• Valuable in well established industries

Importance of understanding usage conditions
• A robust reliability qualification process protects the customer, i.e. ensures sufficient reliability while optimizing cost for the manufacturer
• Three elements of robust reliability engineering:
  1. Quantified understanding of customer usage patterns and use conditions
  2. Well designed accelerated life tests
  3.
Acceleration models (of sufficiently high confidence) that link the two
• Pitfalls of not making an accurate link between stress and use conditions:
  – Over-design, leading to added cost and impact to the bottom line
  – Under-design, leading to high customer returns; poor experience erodes the brand

Talk outline
• Overview of common failure mechanisms in IC components
• Analysis of field use condition data: review one example
• Contrast use condition knowledge based qualification with standards based qualification using 2 case studies:
  1. Moisture and voltage bias induced failures in IC components
  2. Temperature cycling failures in IC components

IC component – package stack-up
[Figure: cross-section images from proceedings of IITC 2013, with the following layers labeled]
• Silicon substrate
• Devices: front-end
• Metals/via: back-end with ultra low-k ILD
• Metals/via: far-backend with polymer ILD
• Bumps: C4 with Cu – Pb-free solder
• Package: metals/via

Some common failure modes in IC components and associated extreme use conditions
Reliability failure mechanism, followed by extreme use condition(s):
1. Front end: transistor gate dielectric reliability
   - High power states at high voltage, frequency, temperature and current
2. Backend: dielectric breakdown
3. Backend & bumps: electromigration
4. Backend: stress voiding
   - Sustained operation at high temperatures
5. Moisture ingress: de-lamination, electro-chemical corrosion, metal migration, pop-corning etc.
   - Low power modes like OFF/STAND-BY
   - High humidity and temperature ambient conditions, e.g. 25C 80% RH
6. Temperature cycling: cracking and de-lamination
   - Repeated cold temperature exposures when part may be OFF
   - Power cycles when part is ON
• Dominant failure modes for an IC used in a server, a cell phone and a wearable device will be very different because usage is different!

Chip operating states
• OFF mode: chip and package at ambient T, ambient RH at part surface
• STAND-BY mode: ambient T + self-heating (~10C) from few “always ON” IO pins
• ON state: chip at high T, low RH at the part surface
[Figure: effective RH vs. temperature at the part surface. OFF state: low T, high RH; STAND-BY: higher T, lower RH; ON state: high T, low RH]
• OFF and STAND-BY modes are critical states for moisture absorption into chip/package: highest RH at part surface

Use conditions by product segment: risk from moisture
[Arrow alongside table: increasing moisture risk]
Market segment | ON time as fraction of product lifetime | OFF/STAND-BY events, durations | Ambient environments
Servers, high performance computing & high end desktop | Very large | Very few events of short duration | Controlled T, RH in data centers and server farms
Desktop enterprise | Lower | Sizeable | Indoor T, RH
Mobile: laptop | Lower | Sizeable number of longer duration events | Some outdoor T, RH exposure; worse in hot humid GEOs
Ultra-mobile: tablet, smartphone | Lower | Sizeable number of longer duration events | Often outdoor T, RH; worse in hot humid GEOs
Wearables/IoT | A new set of applications, still being understood

Events leading to moisture exposure
• Packaging/assembly operations (factory floor)
• Customer warehouses during storage
• Customer factories during surface mount
• Usage by end customer, especially in hot + humid locations

Failure modes due to moisture and temperature cycling
• Package blistering and cracking between copper traces after surface mount on to system motherboard, a.k.a.
“pop-corning” [Literature]
[Figure: blister]
• Edge de-lamination after temp-cycle B (125 to -55C) on very early 22nm silicon process (Proceedings of ECTC 2013)

Moisture diffusion under a 25C 80% RH ambient exposure
[Figure: finite element modeling of moisture concentration (C/CSAT) vs. time (days) at 25C 80% RH, through the underfill and through the package (PKG); chip and package contours shown at 7 days and 50 days]
• Under sustained exposure, moisture is confined to the edge 1mm of the chip/package
• Consistent with empirical failure observations

Mining use conditions: data collection and analysis
• Customer profile data from ~2000 worldwide laptop users for one year
• OFF (shutdown), STAND-BY and HIBERNATE times recorded; data used to generate distributions
Format of user data:
User ID | OFF time | STAND-BY time
1 | {-, -, -, …} | {-, -, -, …}
2 | {-, -, -, …} | {-, -, -, …}
3 | {-, -, -, …} | {-, -, -, …}
…
2123 | {-, -, -, …} | {-, -, -, …}
• Distributions combining all data from all users
• Distribution of Max{OFF time} and Max{STAND-BY time} per user

Moisture exposure in use condition: user data
[Figure: cumulative probability plots. Panel "All data from 2000 users: non-S0 duration distribution", cumulative probability vs. time (hours). Panel "Max {OFF time}, i.e. 100th %tile per user", cumulative probability vs. time (days). Annotations: 99th percentile 4 days; 99.5th percentile 7 days; 95th percentile 50 days]
• Standby/Off times: Nominal = 7 days, Worst case = 50 days
• Conservative ambient condition: 25C 80% RH; 20% of cities in the world experience this for 5% of the year, i.e.
a 95th percentile condition from surveys

Phenomenological Acceleration Model for dominant moisture induced chip – package failure modes
• Peck’s law fits empirically observed HAST fails
Variable | Range used in study | Acceleration factor
Temperature | 85 – 130C | Ea = 0.71 eV (90% CL lower bound)
RH | 65 – 85% | n = 4 (best estimate)
Voltage (V) | 1.2 – 3.3V | m = 0.5 (best estimate), Vt = 1.4V
• Temperature: strongest variable
• Relative humidity and voltage: relatively weaker effects

Accelerated life testing: failure rate data for a “typical” failure mode
[Figure: probability plot (fitted Arrhenius, fitted Ln) for start readout, UC: 25C; lognormal, arbitrary censoring, ML estimates; percent vs. time to failure]
Table of statistics:
Temp (C) | RH (%) | Loc | Scale | AD*
85 | 85 | 7.11990 | 0.658992 | 25.184
110 | 85 | 5.50666 | 0.658992 | 14.916
130 | 65 | 5.92739 | 0.658992 | 55.509
130 | 85 | 4.36013 | 0.658992 | 10.373
[Figure: relation plot, MTTF (hr) vs. temp (C), with lines labeled Ea = 1.1 (EA-2), Ea = 0.71 (EA-AVG) and Ea = 0.44 (EA-1)]
• Thermal acceleration differs between the 130 – 110C and 110 – 85C ranges
• Epoxy glass transition is ~120C; moisture diffusion is over-accelerated above 120C
• Stressing is recommended below the glass transition of the packaging polymers; T < TG is what is relevant for the use condition anyway

HAST stress durations: use conditions vs. JEDEC JESD22-A110 standard requirements
Stress condition | Nominal: stress time equivalent to 7 days at 25C 80% RH (hrs) | Worst case: stress time equivalent to 50 days at 25C 80% RH (hrs) | JEDEC Std.: JESD22-A110 equivalent readout (hrs)
130C 85% RH | <1 | 5.7 | 96
110C 85% RH | 2.5 | 18 | 264
85C 85% RH | 17 | 121 | 1000
• Conservative worst case (50 days @ 25C 80% RH): JEDEC requirements are more than 8 times the use condition based requirements
• Intel uses a “test to fail” approach during process development.
These gating readouts go beyond use condition based requirements

Some thoughts about temperature cycling
• JEDEC standard for temp-cycle; most common: TCB, 125 to -55C, 700 cycles
• Having to demonstrate reliability down to -55 or -65C may need a trade-off between reliability and performance/yield:
  – Dielectric constant (electrical performance) vs. fracture toughness
  – Epoxy flow characteristics vs. fracture toughness

Some examples of cold-side effects: material response
• Crack driving energy (F.E. modeling) rises sharply below -20C
• Measured strain-to-fail drops 2X from 25C to -55C for passivation polymer
• Solder fracture toughness drops precipitously below -25C [Literature]
• If T < -25C is not relevant for the use condition of the component, then by using TCB for qual. we might be solving problems not relevant to customer usage

Risk of over or under-assessing field reliability
[Figure: number of cycles at various operating DT equivalent to TCB 700 cycles (JEDEC standard), plotted against [Tmax-Tmin]; labels for desktop & servers, highly mobile devices and an example use condition requirement]
• A simple temp-cycle model (Coffin-Manson): $N_{f1}/N_{f2} = (\Delta T_2/\Delta T_1)^n$, where DT denotes the temperature swing $\Delta T$ = [Tmax-Tmin]
• For an always ON server in a controlled environment, TCB 700 cycles may be overkill:
  – No cold exposures; -55C is not relevant
  – At DT of 50C, TCB 700 represents 10 – 50 cycles/day for 5 years
• For a part that may get used in a COMMS application with outdoor exposures in Alaska and a 10 year life requirement, TCB 700 under-assesses field reliability

Key messages
• Important to pick stress conditions that are relevant to worst case usage, to avoid artifacts not relevant to worst case use, e.g. embrittlement
• Standards offer a guideline or starting point. Qualification plans should be based on knowledge of use conditions
• Limiting failure modes in the components that comprise a system will likely be very different for various applications; standards don’t directly address that
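
Worked example (a sketch, not part of the original slides): the Coffin-Manson relation from the temperature cycling slides can be turned into a small calculation of how many field cycles at a given DT are equivalent to TCB 700 cycles. The function name `equivalent_cycles` and the exponent values tried below are assumptions for illustration only; the slides do not state which exponent n was used.

```python
def equivalent_cycles(stress_cycles, stress_dt, use_dt, n):
    """Field cycles at swing use_dt equivalent to stress_cycles at swing stress_dt.

    Coffin-Manson: N_use / N_stress = (DT_stress / DT_use) ** n
    """
    if stress_dt <= 0 or use_dt <= 0:
        raise ValueError("temperature swings must be positive")
    return stress_cycles * (stress_dt / use_dt) ** n


TCB_CYCLES = 700        # from the slides: TCB 700 cycles
TCB_DT = 125 - (-55)    # TCB runs 125 to -55C, i.e. a 180C swing
USE_DT = 50             # example field swing from the slides
LIFE_DAYS = 5 * 365     # 5 year life, leap days ignored

# ASSUMPTION: the exponent n is not given in the slides; these are trial values.
for n in (2.5, 3.0, 3.5):
    cycles = equivalent_cycles(TCB_CYCLES, TCB_DT, USE_DT, n)
    per_day = cycles / LIFE_DAYS
    print(f"n={n}: {cycles:,.0f} field cycles, about {per_day:.1f} per day over 5 years")
```

With these assumed exponents, the result lands in the same order of magnitude as the 10 – 50 cycles/day quoted for a DT of 50C. The strong sensitivity to n is one reason the exponent should come from failure data for the specific mechanism rather than from a default value.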