Recent Challenges: Soft Errors

Soft Errors
• SEU (single-event upset):
  − Ionizing radiation corrupts stored data
• Cause:
  − Radioactive impurities in device packages
  − Recently: cosmic radiation
• Scaling worsens SEU:
  1. Voltage scaling + reduced node capacitances
     − lower the charge threshold necessary to corrupt the data
  2. Greater level of integration
     − increases the likelihood that soft errors will affect the device

SEU
• Sources:
  − Configuration memory
  − Flip-flops
  − Memory blocks
  − Combinational circuits (transient error → permanent)

SEU in Configuration Memory
• SEU in configuration bits (SRAM-based):
  − In Virtex FPGAs, ~91% of the bits sensitive to soft errors are configuration bits
  − Flash- and antifuse-based FPGAs do not suffer from this
  − Any change to the configuration memory may alter the functionality
  − Errors persist until the FPGA is reprogrammed

SEU Mitigation Techniques
• Mitigation techniques:
  1. Circuit and technology level:
     − Adding metal capacitors to nodes in the memory increases the amount of charge necessary to cause an SEU
  2. System level:
     − Ensures that the system can detect errors and recover
     − Devices regularly verify their configuration memory by comparing the current values with the desired configuration state using cyclic redundancy checks (Altera Stratix III)
  3. User level:
     a) TMR (triple modular redundancy):
        − Replicating a design three times and voting among the outputs
     b) Reducing the design's sensitivity to soft errors by careful selection of the resources used

Circuit Level
• [Ebrahimi]: Reduce the number of SRAM cells in a switch box (6 → 5), (6 → 4)
  [Figure: switch box with N, E, S, W sides and its SRAM-controlled switches]

User Design Level
• Care bits [Golshan07]: Only a subset of the configuration bits affects the design when hit by an SEU.
• Example: resource A is used for net A
  − The A-B SRAM bit is not a care bit if B is not used by other nets.
  − The A-C SRAM bit is a care bit (a change to ‘1’ hurts net A).
  − The A-D SRAM bit is not a care bit (w.r.t. net A) if D is not used.

User Design Level
• Soft Error Routing Problem [Golshan07]: Given a routing graph and a set of multi-terminal nets, route each net with the least care-cost, where the care-cost is the number of routing care bits.
• Experiments: 14% reduction in the number of care bits
  − ~80% of soft errors in the FPGA: configuration memory [Kuon07]

Recent Challenges: Process Variation

Process Variation Sources
[Figure: Leff (×10^-7) plotted over wafer X and wafer Y position; IBM, Intel and TSMC]

Variations
• Change in variation over the years
  − ILD: inter-layer dielectric
• Variation from mean value
  − Gate oxides are so thin that a change of one atom can cause a 25 percent difference in substrate current.
  − EE Times (04/11/2006)

Statistical Description
• The combined set of underlying deterministic and random contributions is lumped into a combined “random” statistical description.
• The distribution (mean and variance) of L over the devices on one wafer can differ from the distribution over the devices within a single die.

Inter-die vs. Intra-die Variations
[Figure: Leff with inter-die (global) correlation and intra-die (spatial) correlation]
• Figures are courtesy of IBM, Intel and TSMC

Impact of Variation
• Importance of variation:
  − Timing violations
  − Yield loss
• Process variations can cause up to 2000% variation in leakage current and 30% variation in frequency in 180nm CMOS
  − Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A., De, V. Parameter Variations and Impact on Circuits and Microarchitecture. In Proc. of DAC (2003), 338-342.
  [Figure: die-to-die frequency variation]

Variation in FPGA
• Binning:
• Historically: most of the variation is between dies
  − FPGA manufacturers test the speed of each FPGA after manufacturing and bin each device according to its speed.
  − Higher speeds: more expensive
  − Unacceptable leakage power: discard the device
• More recently: significant within-die variation
  − Cannot be leveraged in the same manner
  − Operating speeds must be reduced to maintain functionality
  − 90nm: speed reduction of 5.7%
  − 22nm: speed reduction of 22.4%

Solutions
• Architectural solutions:
  1. Select the logic block architecture parameters to minimize this variation
     − LUT size is particularly important [Wong05]
     − LUT size = 4: highest leakage yield
     − LUT size = 7: highest timing yield
     − LUT size = 5: maximum combined leakage and timing yield
  2. Adaptively compensate for any variation through body biasing [Nabaa06]:
     − Slow blocks: apply a body bias that decreases Vt, which increases the block’s speed
     − Fast blocks: increase the threshold voltage, which reduces leakage power
     − Experiments:
       − Area penalty: 1%–2%
       − Delay variability reduction: 30%
       − Leakage variability reduction: 78%

Solutions
• CAD level:
  1. Statistical static timing analysis (SSTA) in FPGA CAD tools
     − Improves delays by avoiding the margins that are necessary for traditional STA
  2. Testing multiple logically equivalent configurations of the FPGA to find one that is functional at the desired speed [Sedcole07]
  3. Generating critical paths that will be more robust in the face of variation [Matsumoto07]

Inter-die vs. Intra-die Variations
• P: a structural or electrical parameter, e.g. W, tox, Vth, channel mobility, coupling capacitances, line resistances
  − P0 = nominal design value
  − ΔPintradie = intra-die variation (within a given chip)
  − ΔPinterdie = inter-die variation (from one chip to another)
  − ΔPe = remaining “random” or unexplained variation

Corner Analysis
• PRCA (Process Corner Analysis):
• Takes
  1. nominal values of the process parameters
  2. a delta for each parameter by which it varies
• Finds
  − performance as max and min values
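The PRCA procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `corner_analysis`, the toy `rc_delay` performance model, and the parameter names and values are hypothetical and do not come from the slides or from any CAD tool. The sketch evaluates the performance at every combination of nominal ± delta and reports the minimum and maximum.

```python
from itertools import product


def corner_analysis(nominal, delta, performance):
    """Evaluate `performance` at every process corner and return (min, max).

    nominal:     dict of parameter name -> nominal value
    delta:       dict of parameter name -> deviation from the nominal value
    performance: function taking a dict of parameter values (hypothetical model)
    """
    names = list(nominal)
    results = []
    # A corner sets each parameter to nominal - delta or nominal + delta,
    # so n parameters give 2**n corners.
    for signs in product((-1, 1), repeat=len(names)):
        corner = {n: nominal[n] + s * delta[n] for n, s in zip(names, signs)}
        results.append(performance(corner))
    return min(results), max(results)


def rc_delay(p):
    """Toy performance model (an assumption): delay proportional to R * C."""
    return p["R"] * p["C"]


# Hypothetical values, chosen as integers so the arithmetic is exact.
best, worst = corner_analysis({"R": 100, "C": 20}, {"R": 10, "C": 2}, rc_delay)
print(best, worst)  # prints: 1620 2420
```

Because only the corners are evaluated, the sketch cannot see an extreme that lies strictly inside a parameter's range, and the cost doubles with every added parameter.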
• Pros: simple
• Cons: conservative, inaccurate

Corner Analysis
[Figure: interconnect cross-section over metal layers M1 to M3, showing width W (Wmin to Wmax), thickness T (Tmin to Tmax), height H (Hmin to Hmax), and capacitance Cg]
• PRCA shortcoming: process corners are assumed to coincide with performance corners.
  − In fact, the best-case corner may depend not on Pmin or Pmax for a particular interconnect parameter but on a value within that range.

SSTA

Solutions
• CAD level:
  2. Testing multiple logically equivalent configurations of the FPGA to find one that is functional at the desired speed [Sedcole07]

References
• [Kuon07] Kuon, Tessier, “FPGA Architecture: Survey and Challenges,” Foundations and Trends in Electronic Design Automation, Vol. 2, No. 2 (2007) 135–253.
• [Lin07] Y. Lin and L. He, “Device and Architecture Concurrent Optimization for FPGA Transient Soft Error Rate,” in ICCAD, 2007.
• [Golshan07] S. Golshan and E. Bozorgzadeh, “Single-event-upset (SEU) awareness in FPGA routing,” in DAC ’07.
• [Xilinx] www.xilinx.com
• [Altera] www.altera.com
• [Wong05] H.-Y. Wong, L. Cheng, Y. Lin, and L. He, “FPGA device and architecture evaluation considering process variations,” in ICCAD, 2005.
• [Nabaa06] G. Nabaa, N. Azizi, and F. N. Najm, “An adaptive FPGA architecture with process variation compensation and reduced leakage,” in DAC, 2006.
• [Sedcole07] P. Sedcole and P. Y. K. Cheung, “Parametric yield in FPGAs due to within-die delay variations: A quantitative analysis,” in FPGA, 2007.