IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 44, NO. 1, JANUARY 2009 49

Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance

Keith A. Bowman, Member, IEEE, James W. Tschanz, Member, IEEE, Nam Sung Kim, Janice C. Lee, Chris B. Wilkerson, Shih-Lien L. Lu, Member, IEEE, Tanay Karnik, Senior Member, IEEE, and Vivek K. De, Senior Member, IEEE

Abstract—A 65 nm resilient circuit test-chip is implemented with timing-error detection and recovery circuits to eliminate the clock frequency guardband from dynamic supply voltage ($V_{CC}$) and temperature variations as well as to exploit path-activation probabilities for maximizing throughput. Two error-detection sequential (EDS) circuits are introduced to preserve the timing-error detection capability of previous EDS designs while lowering clock energy and removing datapath metastability. One EDS circuit is a dynamic transition detector with a time-borrowing datapath latch (TDTB). The other EDS circuit is a double-sampling static design with a time-borrowing datapath latch (DSTB). In comparison to previous EDS designs, TDTB and DSTB redirect the highly complex metastability problem from both the datapath and error path to only the error path, enabling a drastic simplification in managing metastability. From a survey of various EDS circuit options, TDTB represents the lowest clock energy EDS circuit known; DSTB represents the lowest clock energy static-EDS circuit with SER protection known. Error-recovery circuits are introduced to replay failing instructions at lower clock frequency to guarantee correct functionality. Relative to conventional circuits, test-chip measurements demonstrate that resilient circuits enable either 25%–32% throughput gain at equal $V_{CC}$ or at least 17% $V_{CC}$ reduction at equal throughput, corresponding to 31%–37% total power reduction.
Index Terms—Dynamic variations, error correction, error detection, error-detection sequential, error recovery, instruction replay, parameter variations, resilient circuits, resilient design, supply voltage droop, temperature variation, timing errors, variation tolerance.

Manuscript received March 31, 2008; revised July 09, 2008. Current version published December 24, 2008. The authors are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: keith.a.bowman@intel.com). Digital Object Identifier 10.1109/JSSC.2008.2007148

I. INTRODUCTION

INTEGRATED circuits have always been vulnerable to dynamic variations in supply voltage ($V_{CC}$) and temperature. Abrupt changes in die-level switching activity induce large current transients in the power delivery system, resulting in $V_{CC}$ droop and overshoot fluctuations. The magnitude and duration of $V_{CC}$ droops and overshoots depend on the interaction of capacitive and inductive parasitics at the board, package, and die levels with changes in current demand [1]. Temperature variations depend on workload, environmental conditions, and the heat-removal capability of the package. Conventional microprocessor performance reduces as $V_{CC}$ decreases or as temperature increases. Consequently, the maximum clock frequency ($F_{CLK}$) of a microprocessor is traditionally determined based on maximum $V_{CC}$ droop and temperature specifications. Since typical usage patterns usually operate at nominal $V_{CC}$ and temperature, these infrequent dynamic variations severely limit $F_{CLK}$ as compared to the potential higher clock frequency at nominal $V_{CC}$ and temperature. Therefore, an inherent $F_{CLK}$ guardband is required at nominal $V_{CC}$ and temperature to ensure correct functionality in the presence of dynamic $V_{CC}$ and temperature variations.

On-die $V_{CC}$ and temperature variation sensors coupled with adaptive circuit techniques have been demonstrated to adjust $F_{CLK}$, $V_{CC}$, or body bias in response to changes in $V_{CC}$ and temperature [2]–[4]. These schemes reduce the $F_{CLK}$ guardband from slow-changing $V_{CC}$ and temperature variations, resulting in higher average $F_{CLK}$. Alternatively, the average $F_{CLK}$ benefits may be converted to lower average power by decreasing $V_{CC}$. The disadvantages of on-die sensors and adaptive approaches include the inability to respond to fast-changing variations such as high-frequency $V_{CC}$ droops [1]. Furthermore, sensors and adaptive circuits require substantial calibration time per die, leading to increased testing costs. Although sensors may be tuned during test to reduce the delay mismatch between sensors and critical paths from within-die variations, an $F_{CLK}$ guardband is still necessary to ensure coverage across a wide range of $V_{CC}$ and temperature conditions as well as for transistor aging.

Error-detection sequential (EDS) circuits have been proposed to monitor timing faults for on-line testing of digital circuits in the presence of environmental influences and reliability concerns [5], [6]. EDS circuits have also been employed to detect transient delay errors resulting from single-event upsets induced by alpha particles and cosmic radiation [7]. The combination of timing-error detection and error-recovery circuits [7]–[9] enables the microprocessor to operate at an $F_{CLK}$ determined by nominal $V_{CC}$ and temperature. When dynamic variations induce a timing error, the error is detected and corrected to maintain proper logic functionality. Since additional clock cycles are required for error recovery, instructions per cycle (IPC) reduce as errors occur. Assuming infrequent timing errors, the IPC reduction is relatively small as compared to the $F_{CLK}$ gains from removing the guardband for dynamic $V_{CC}$ and temperature variations, resulting in higher overall system throughput. In addition, further $F_{CLK}$ benefits are possible by exploiting path-activation probabilities [7]–[9]. If the slowest paths on the die are infrequently activated, the $F_{CLK}$ may increase higher than the critical path operating frequency. When these critical paths are activated, the timing error is detected and corrected.
As an alternative to performance benefits, the $F_{CLK}$ gains may be traded-off for lower power by reducing $V_{CC}$ [8], [9].

0018-9200/$25.00 © 2008 IEEE

Authorized licensed use limited to: North Carolina State University. Downloaded on January 24, 2009 at 12:44 from IEEE Xplore. Restrictions apply.

While some of the potential benefits of timing-error detection and correction circuits have been highlighted in previous work [5]–[9], several significant challenges have also been identified with recent implementations. In previous EDS circuits [5]–[9], the error-detection capability is achieved at a large cost in clock energy since a redundant storage element is needed. This sequential clock energy overhead is a major concern in microprocessor designs since sequential clock energy is a large fraction of the overall total dynamic energy. In addition, these circuits are susceptible to datapath metastability, requiring substantial design overhead [9].

In a previous error-recovery design [8], [9], a single-cycle error-recovery circuit has been implemented to ensure forward progress for late arriving signals. In this approach, a 2-to-1 MUX logic gate is placed in each datapath prior to the receiving flip-flop to select between the normal datapath value and the correct value from the previous cycle. This additional logic gate increases the delay of all paths and increases total power. The error signals from each EDS circuit in a pipeline are combined through an OR tree to generate a restore signal, which is then routed to the select line of each MUX in the pipeline. This design imposes significant timing restrictions on the error-signal propagation delay and requires additional area for interconnect routing tracks.

Furthermore, the previous error-recovery design [8], [9] employs an off-chip $V_{CC}$ controller system to adjust $V_{CC}$ based on monitored error rates. Although increasing $V_{CC}$ is generally an effective tuning knob to reduce critical path delays, the applicability of this controller is highly complex for error recovery due to the following: (i) Path-delay sensitivity to $V_{CC}$ depends on the operating $V_{CC}$, (ii) Path-delay sensitivity to $V_{CC}$ differs from path to path, (iii) Increasing $V_{CC}$ to speed up a critical path also speeds up min-delay paths, (iv) Maximum $V_{CC}$ is defined by reliability constraints which could limit error recovery in a high-performance mode, and (v) Changing $V_{CC}$ is relatively slow.

In this paper, the concept of timing-error detection and correction circuits in previous work [5]–[9] is extended and implemented in a prototype [10] in 65 nm technology [11] to explore the effectiveness of resilient circuits in eliminating the $F_{CLK}$ guardband from dynamic $V_{CC}$ and temperature variations as well as exploiting path-activation probabilities to maximize throughput. In Section II, EDS circuits for dynamic variation tolerance are reviewed and two EDS circuits [10] are introduced to retain the error-detection capability in previous work [5]–[9] while lowering clock energy and eliminating datapath metastability. Next in Section III, the test-chip architecture is described along with the error-recovery design. In contrast to the single-cycle adaptive-$V_{CC}$ error-recovery approach [8], [9], a multi-cycle instruction-replay error-recovery circuit is implemented with adaptive $F_{CLK}$ to significantly reduce design overhead at a small penalty in error-recovery cycles [10]. In Section IV, test-chip measurements are presented. Section V concludes by summarizing the key results and recommendations.

II. ERROR-DETECTION SEQUENTIAL (EDS) CIRCUITS

A. EDS Circuit Overview

The basic concept of timing-error detection circuits for dynamic variation tolerance is described in Fig. 1. A conventional path with master-slave flip-flops (MSFF) is provided in Fig. 1(a) along with conceptual timing diagrams in Fig.
1(b), illustrating the arrival times of the input data (D) to the receiving flip-flop during worst-case dynamic variations and nominal conditions. In the presence of worst-case dynamic variations, the input data to the receiving flip-flop must arrive a setup time prior to the rising clock edge to ensure correct functionality. In comparison, the input data for the same path arrives much earlier during nominal conditions. The difference between the input data arrival times for these two cases represents the effective timing guardband required for dynamic variations.

A resilient path is created by replacing the receiving MSFF of the conventional path with an EDS circuit as described in Fig. 1(c). The conceptual timing diagram in Fig. 1(d) illustrates late arriving input data. The EDS circuit in Fig. 1(c) and Fig. 2(a) is a simplified Razor flip-flop (RFF) [5]–[9], where the metastability detector is omitted. The RFF double samples input data with a datapath flip-flop on the rising clock edge and a shadow latch on the falling clock edge. The flip-flop and latch outputs are compared with an XOR gate to produce an error signal (ERROR). If input data transitions late as described in Fig. 1(d), flip-flop and latch outputs differ, resulting in a logic-high error signal. The error signal is handled at the microarchitecture level to enable error recovery. Since the resilient circuit can detect and correct late arriving data, the timing guardband for dynamic variations in the conventional design can be removed, allowing the resilient circuit to operate at a higher $F_{CLK}$.

In comparison to an MSFF, the RFF error-detection capability is attained at a cost in clock energy and area since an additional latch is needed. Although the increase in area at the flip-flop level appears large, the overall area penalty at the microprocessor level is expected to be relatively small (on the order of 1%).
For scan flip-flops, the additional latch can be shared with scan circuitry to further reduce the area overhead. In contrast, the clock energy overhead at the flip-flop level significantly impacts the total dynamic energy of a microprocessor since sequential clock energy represents a large portion of the overall total dynamic energy. Moreover, a well-tuned microprocessor has a substantial number of critical paths, requiring a large fraction of the total sequentials to be protected with the timing-error detection capability.

A critical issue for RFF is the susceptibility to datapath metastability. If the input data to the flip-flop arrives slightly after the setup time prior to a rising clock edge, the output of the datapath flip-flop can become metastable. In this scenario, the flip-flop requires an indefinite amount of time to resolve the output to a valid logic value, corresponding to a clock-to-output (CLK-to-Q) delay push-out. During metastability, the CLK-to-Q delay push-out exponentially depends on the relationship between the setup time and the input data arrival time [12]–[14]. Since the flip-flop output feeds an error path, which is described further in Section III, and multiple fan-out datapaths, the CLK-to-Q delay push-out from a metastable output can affect the error path differently from one of the fan-out datapaths such that an undetected error occurs. Since the mean time between failures (MTBF) must satisfy aggressive microprocessor targets, a metastability detector is required for RFF, resulting in substantial design overhead in both clock energy and area [9]. In Fig. 2(b) and (c), two EDS circuits are introduced to retain the RFF error-detection feature while lowering clock energy and eliminating datapath metastability [10].

Fig. 1. (a) Conventional path design with (b) conceptual timing diagrams for worst-case dynamic variations and nominal conditions. (c) Resilient path design [5]–[9] with (d) conceptual timing diagram for late arriving input data.

For the three EDS circuits in Fig. 2, the high clock phase defines the error-detection window ($T_W$) as illustrated in Fig. 1(d). The maximum path delay (max-delay) constraint in the presence of worst-case dynamic conditions is defined as

$$D_{MAX} \le T_{CYCLE} + T_W - T_{SETUP\text{-}FALL} \qquad (1)$$

where $D_{MAX}$ is the maximum path delay, including clock skew and jitter delay, $T_{CYCLE}$ is the cycle time ($1/F_{CLK}$), and $T_{SETUP\text{-}FALL}$ is the setup time based on the falling clock edge. The minimum path delay (min-delay) constraint during worst-case dynamic conditions is calculated as

$$D_{MIN} \ge T_W + T_{HOLD\text{-}FALL} \qquad (2)$$

where $D_{MIN}$ is the minimum path delay, accounting for clock skew and jitter delay, and $T_{HOLD\text{-}FALL}$ is the hold time based on the falling clock edge. The max-delay and min-delay constraints in (1) and (2) only apply to paths with an EDS circuit as the receiving sequential. For a target $T_W$, min-delay requirements are satisfied in pre-silicon design by buffer insertion. As $T_W$ increases, the number of buffers increases, leading to larger power. From (1) and (2), the fundamental trade-off in timing-error detection circuits is max-delay versus min-delay. As $T_W$ increases, $T_{CYCLE}$ may decrease to enable a higher $F_{CLK}$ while satisfying the max-delay constraint in (1) at a cost of increased min-delay penalty in (2). For microprocessors with deep pipelines (i.e., small number of logic stages between sequentials), this trade-off may not be advantageous due to the stringent min-delay requirements. In recent technology generations, however, the microarchitecture for microprocessors has moved towards shallow pipelines (i.e., large number of logic stages between sequentials) to improve energy efficiency [15], [16].
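The max-delay versus min-delay trade-off can be sketched numerically. The snippet below is a minimal illustration, assuming the constraints take the form "max path delay ≤ cycle time + error-detection window − falling-edge setup time" and "min path delay ≥ error-detection window + falling-edge hold time", which is how the prose defines them. The class name, function names, and all numeric values are hypothetical and are not taken from the test-chip.

```python
from dataclasses import dataclass


@dataclass
class EdsPathTiming:
    """Timing parameters for a path ending in an EDS circuit (all in ps).

    Field names are illustrative assumptions, not the paper's notation.
    """
    t_cycle_ps: float        # clock cycle time
    t_window_ps: float       # error-detection window (high clock phase)
    t_setup_fall_ps: float   # setup time relative to the falling clock edge
    t_hold_fall_ps: float    # hold time relative to the falling clock edge

    def max_delay_ok(self, d_max_ps: float) -> bool:
        # Late data is still detectable if it settles before the falling edge.
        return d_max_ps <= self.t_cycle_ps + self.t_window_ps - self.t_setup_fall_ps

    def min_delay_ok(self, d_min_ps: float) -> bool:
        # Fast data from the next cycle must not land inside the window.
        return d_min_ps >= self.t_window_ps + self.t_hold_fall_ps


def min_cycle_time_ps(d_max_ps: float, t_window_ps: float,
                      t_setup_fall_ps: float) -> float:
    """Smallest cycle time that still meets the assumed max-delay constraint."""
    return d_max_ps - t_window_ps + t_setup_fall_ps


# Hypothetical numbers: widening the window shortens the allowed cycle time
# but raises the minimum path delay that buffer insertion must guarantee.
timing = EdsPathTiming(t_cycle_ps=400.0, t_window_ps=100.0,
                       t_setup_fall_ps=20.0, t_hold_fall_ps=15.0)
```

With these made-up values, a 480 ps path just meets the max-delay check, while any path faster than 115 ps would need buffering to pass the min-delay check.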
Microprocessors with shallow pipelines greatly relax the min-delay requirements as compared to a deep pipeline design, enabling a more effective trade-off of max-delay improvement for min-delay penalty. As additional protection from min-delay violations, the high clock phase (i.e., the error-detection window $T_W$) may be tuned with a duty-cycle control circuit. The test-chip contains a scan-tunable duty-cycle control circuit to adjust $T_W$. The duty-cycle control circuit, which is discussed further in Section III-C, maintains a constant high clock phase delay at low and high $F_{CLK}$.

Fig. 2. Error-detection sequential circuits: (a) Razor flip-flop (RFF) [5]–[9], (b) transition detector with time borrowing (TDTB), and (c) double sampling with time borrowing (DSTB). CLK is duty-cycle controlled to satisfy min-delay requirements.

Fig. 3. Simulated timing diagrams for (a) TDTB and (b) DSTB to demonstrate error generation from late arriving input data.

B. Transition Detector With Time Borrowing (TDTB)

In Fig. 2(b), the first proposed EDS circuit is a transition detector with a time-borrowing latch (TDTB). The TDTB EDS circuit operation is demonstrated through a simulated timing diagram in Fig. 3(a). The transition detector monitors input data (D) transitions during the high clock phase. As input data transitions, a pulse is always generated at the XOR output. During the low clock phase, the output of the dynamic gate pre-charges and the pulse does not affect the error signal (ERROR) as described in Fig. 3(a). If input data arrives late, CLK is logically-high and the pulse discharges the output node voltage of the dynamic gate, thus transitioning ERROR to a logic-high as illustrated in Fig. 3(a).
As CLK transitions to a logic-low, the dynamic gate output pre-charges, and consequently, ERROR transitions to a logic-low. As discussed further in Section III-B, ERROR is propagated to a set-dominant latch (SDL), where the SDL output remains logically-high while the dynamic transition detector pre-charges during the low clock phase. The SDL is transparent during the high clock phase and only allows high transitions during the low clock phase. Since min-delay paths are designed with sufficient margin as described in (2), the master latch of a datapath flip-flop is unnecessary. The datapath latch is identical to a pulse-latch, resulting in lower clock energy and eliminating datapath metastability during a rising clock edge. Datapath metastability does not occur on the falling clock edge since the max-delay constraint in (1) is satisfied. Although TDTB employs a datapath latch, path timing constraints are still based on a flip-flop design with an error-detection window as illustrated in Fig. 1 and modeled in (1). The purpose of the transparency window in the datapath latch is to eliminate datapath metastability while detecting timing errors. When input data arrives late, an error signal is generated even though the input data traverses to the latch output. The error signal ensures that late arriving data from the path in the current pipeline stage does not affect the max-delay constraint in (1) for adjoining fan-out paths in subsequent pipeline stages. If ample max-delay margin is available for the adjoining paths in the subsequent pipeline stage, then a pulse-latch may replace the TDTB EDS circuit at the current pipeline stage. This would enable traditional time borrowing between the path in the current pipeline stage and the adjoining paths in the subsequent pipeline stage. Although datapath metastability is removed in TDTB, the transition-detector output can become metastable. 
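The TDTB detection rule described in this subsection can be sketched as an idealized behavioral model. The snippet assumes that times are measured in picoseconds relative to the rising clock edge, that an input transition raises ERROR only when it falls inside the high clock phase, and that the setup time is the minimum D-to-CLK delay before the rising edge at which no error is generated. It ignores the XOR pulse width, within-die variation, and metastability, and the function names are hypothetical.

```python
from typing import Iterable, Sequence


def tdtb_error(transition_times_ps: Iterable[float],
               t_window_ps: float,
               t_setup_ps: float = 0.0) -> bool:
    """Return True if any input transition is flagged as late.

    A transition at time t (relative to the rising clock edge, negative
    means before the edge) is flagged when it occurs later than the setup
    point and before the end of the high clock phase.
    """
    return any(-t_setup_ps < t < t_window_ps for t in transition_times_ps)


def pipeline_stage_error(per_circuit_transitions: Sequence[Iterable[float]],
                         t_window_ps: float,
                         t_setup_ps: float = 0.0) -> bool:
    """OR the ERROR outputs of every TDTB circuit in one pipeline stage."""
    return any(tdtb_error(times, t_window_ps, t_setup_ps)
               for times in per_circuit_transitions)
```

For example, with a 100 ps window, `tdtb_error([30.0], 100.0)` flags a transition that arrives 30 ps after the rising edge, while `tdtb_error([-10.0], 100.0)` does not flag data that arrived 10 ps before the edge.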
For metastability to occur on the transition-detector output, the input data must arrive within a tight metastability window (about 1 ps in a 65 nm technology [11]), starting slightly after the setup time prior to a rising clock edge. For EDS circuits with a datapath latch, the setup time ($T_{SETUP}$) is defined as the minimum data-to-clock (D-to-CLK) delay prior to a rising clock edge such that an error signal is not generated (i.e., an error signal indicates that the input data transition did not satisfy $T_{SETUP}$). In TDTB, $T_{SETUP}$ is determined by the transition detector, which is designed for a $T_{SETUP}$ of 0 ps at nominal process conditions. If input data arrives slightly after $T_{SETUP}$, then the pulse generated at the XOR output transitions to a logic-low while the transition-detector clock transitions to a logic-high, potentially resulting in a metastable output.

The transition-detector output feeds the error path. The error path for TDTB, which is described in more detail in Section III-B and illustrated in Fig. 5, consists of an OR tree of error signals from each TDTB EDS circuit in the pipeline stage. The OR-tree output feeds an SDL, and the SDL output is the final error signal (FINAL ERROR). The SDL maintains the logic-high value for FINAL ERROR when the dynamic transition detector pre-charges during the low clock phase. FINAL ERROR is an input to an MSFF, and the MSFF output is the pipeline-error signal (PIPELINE ERROR). The condition in which the input data for TDTB arrives slightly after $T_{SETUP}$ defines the boundary of a timing failure.
Since the datapath latch transparency allows input data to continue to the next pipeline stage, PIPELINE ERROR can be either a logic-high, resulting in error recovery, or a logic-low, resulting in no error recovery, and still maintain correct functionality. With this unique characteristic, the error path behaves similarly to a traditional synchronizer circuit [12]–[14] with the exception of having combinational logic between the sequentials. For a metastable error signal in TDTB to result in an undetected error, the following sequence of events must occur: (i) ERROR becomes metastable from input data arriving slightly after the TDTB setup time; (ii) ERROR starts resolving to a logic-high while the dynamic transition detector begins pre-charging as clock transitions to a logic-low, resulting in a degraded low-to-high-to-low pulse; (iii) The degraded pulse propagates through the OR tree to induce metastability at the SDL output (FINAL ERROR); (iv) FINAL ERROR resolves slightly after the MSFF setup time such that PIPELINE ERROR becomes metastable; (v) PIPELINE ERROR resolves after a specific time based on microarchitectural control signals to induce a failure. In Section III-D, an MTBF model for this sequence of events is presented for a microprocessor.

C. Double Sampling With Time Borrowing (DSTB)

In Fig. 2(c), the second proposed EDS circuit is double sampling with a time-borrowing latch (DSTB), which is similar to TDTB except a shadow flip-flop replaces the transition detector. In Fig. 3(b), a simulated timing diagram demonstrates the DSTB EDS circuit operation. DSTB double samples input data similar to RFF and compares datapath latch and shadow flip-flop outputs to generate an error signal while retaining the time-borrowing feature of TDTB to eliminate datapath metastability. The DSTB clock energy overhead is slightly lower than for RFF since a datapath sequential is typically sized larger than a minimum-sized shadow sequential.
The DSTB clock energy savings improve when compared to an RFF with a metastability detector at the output [9].

As described earlier for TDTB, the DSTB error signal can become metastable. In contrast to TDTB, the DSTB error path does not contain an SDL. The error path for DSTB, which is described in Section III-B, is an OR tree of error signals from each DSTB EDS circuit in the pipeline stage. The OR-tree output (FINAL ERROR) feeds an MSFF, and the MSFF output is PIPELINE ERROR. For a metastable error signal in DSTB to result in an undetected error, the following sequence of events must occur: (i) ERROR becomes metastable from input data arriving slightly after the DSTB setup time; (ii) ERROR resolves within a precise time window to propagate through the OR tree and transition FINAL ERROR barely after the MSFF setup time such that PIPELINE ERROR becomes metastable; (iii) PIPELINE ERROR resolves at a particular time based on microarchitectural control signals to induce a failure. In Section III-D, the MTBF from this sequence of events is modeled for a microprocessor.

D. Impact of Eliminating Datapath Metastability on Mean Time Between Failures (MTBF)

Although additional microarchitectural information, which is presented in Section III-B, is required to calculate the MTBF from metastability for a microprocessor, an analysis of one EDS circuit that feeds an error path and multiple fan-out datapaths highlights the benefits from eliminating datapath metastability in TDTB and DSTB as compared to RFF. Since an actual failure cannot be modeled in TDTB and DSTB without considering additional microarchitectural details, a mean time between potential failures (MTBPF) is first described for the single EDS circuit analysis of the error path only. Then, the MTBF from datapath metastability in RFF is modeled. Since the error path for DSTB and RFF is the same, the MTBPF from error-signal metastability is initially presented for DSTB and RFF. Then, the MTBPF for TDTB is described.
For DSTB and RFF, the mean time between inducing metastability on PIPELINE ERROR from one EDS circuit in a pipeline stage is calculated as [12]–[14]:

(3)

The terms in (3) are the EDS circuit metastability resolution time, the EDS circuit resolution time constant, the metastability window for the EDS circuit, and the frequency of input data transitions to the EDS circuit that are capable of generating a metastable condition. The resolution time depends on the clock frequency and the error-path delay; the resolution time constant and the metastability window are determined from the EDS design. In a resilient design, the frequency of metastability-capable input data transitions is a function of the clock frequency, the delay and path-activation probability for paths that transition input data, and the frequency of dynamic variations that induce late arriving input data. The resolution time required to propagate a signal through the error path such that FINAL ERROR transitions precisely after the MSFF setup time, which is the beginning of the MSFF metastability window, is calculated as

(4)

The terms in (4) are the error-path delay, including the minimum CLK-to-ERROR delay for late arriving input data, and the setup time for the MSFF. The additional CLK-to-ERROR delay from metastability is represented by the resolution time.

Since the TDTB error path contains an SDL, the mean time between inducing metastability on PIPELINE ERROR from one TDTB in a pipeline stage is calculated as [13]

(5)

The terms in (5) are the TDTB and SDL metastability resolution times, the TDTB and SDL metastability resolution time constants, and the metastability window for TDTB. As discussed in Section II-B, the TDTB resolution time is the metastable CLK-to-ERROR push-out delay for ERROR to start resolving to a logic-high while the dynamic transition detector begins pre-charging as clock transitions to a logic-low.
This event results in a degraded low-to-high-to-low pulse that propagates through the OR tree to induce metastability at the SDL output (FINAL ERROR). The SDL resolution time required to transition FINAL ERROR precisely after the MSFF setup time is calculated as

(6)

Since the error-path delay in (6) contains the SDL delay, the error-path delay for TDTB is longer than for DSTB and RFF. For DSTB, RFF, and SDL, the resolution time constants and metastability window are extracted from flip-flop D-to-Q delay simulations across a range of D-to-CLK delays [14]. For TDTB, transition detector CLK-to-ERROR delays are simulated for a range of D-to-CLK delays to extract the resolution time constant and metastability window. For TDTB, DSTB, RFF, and SDL, the metastability resolution time constants are approximately 5.75 ps in a 65 nm technology [11] with $V_{CC} = 1.2$ V, corresponding to less than half (0.43) of a fan-out of four (FO4) inverter delay. Since the resolution time constants for TDTB and SDL are equal, and from the relation between (4) and (6), (5) simplifies to (3). Thus, the mean time between inducing metastability on PIPELINE ERROR from one EDS circuit in a pipeline stage is modeled in (3) and (4) for TDTB, DSTB, and RFF, where the circuit-specific parameters are separately calculated for TDTB, DSTB, and RFF. Based on simulations in a 65 nm technology [11] with $V_{CC} = 1.2$ V, the metastability window is 1 ps for TDTB and 5 ps for DSTB and RFF; the error-path delay is 100 ps (7.50 FO4 delay) for TDTB and 85 ps (6.38 FO4 delay) for DSTB and RFF; the setup time is 5 ps for MSFF.

In contrast to TDTB and DSTB, RFF is susceptible to datapath metastability, where an undetected error can be modeled by analyzing one RFF that feeds an error path and multiple fan-out datapaths without considering additional microarchitectural details. The MTBF for one RFF in a pipeline stage is calculated as [12]–[14]

(7)

Similar to previous descriptions, the terms in (7) represent the RFF resolution time, resolution time constant, and metastability window, respectively.
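To give a feel for why resolution time dominates these models, the sketch below uses the generic textbook synchronizer expression, MTBF = exp(t_r / τ) / (T_w · f_clk · f_data). This form is an assumption drawn from standard metastability analysis and is not claimed to match (3), (5), or (7) term for term. The function and parameter names are hypothetical; only the 5.75 ps resolution time constant comes from the text.

```python
import math


def synchronizer_mtbf_s(t_resolve_s: float, tau_s: float, t_window_s: float,
                        f_clk_hz: float, f_data_hz: float) -> float:
    """Textbook synchronizer MTBF in seconds (assumed generic form).

    t_resolve_s: time available for a metastable node to resolve
    tau_s:       metastability resolution time constant
    t_window_s:  metastability window of the sampling element
    f_clk_hz:    sampling clock frequency
    f_data_hz:   rate of data transitions that can land in the window
    """
    return math.exp(t_resolve_s / tau_s) / (t_window_s * f_clk_hz * f_data_hz)


def mtbf_gain(extra_resolve_s: float, tau_s: float) -> float:
    """Factor by which MTBF grows when resolution time grows by extra_resolve_s."""
    return math.exp(extra_resolve_s / tau_s)


TAU_S = 5.75e-12  # resolution time constant quoted in the text for 65 nm

# Each additional tau of resolution time multiplies the MTBF by e, so a
# few tens of picoseconds of slack change the MTBF by orders of magnitude.
gain_per_tau = mtbf_gain(TAU_S, TAU_S)
gain_per_10_tau = mtbf_gain(10 * TAU_S, TAU_S)
```

Under this assumed form, the design levers are the time left for resolution (set by cycle time and path delay) and the time constant of the sampling element, which is why shortening the available resolution time lowers the MTBF exponentially.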
Although the datapath flip-flop output may resolve to the same logic value as the shadow latch output after a metastable event in RFF, a number of flip-flop output resolution times exist such that an undetected error occurs in at least one of the multiple fan-out datapaths. For this condition, the RFF resolution time must satisfy two timing constraints. The first timing constraint requires that the datapath flip-flop output resolves in time to ensure that FINAL ERROR is sampled as a logic-low by the MSFF, which is defined as

(8)

If the RFF resolution time satisfies the timing constraint in (8), an error is not detected in the current pipeline stage. The second timing constraint is based on two dependencies for each fan-out datapath in the subsequent pipeline stage: (i) Path delay and (ii) Whether the path contains an RFF or an MSFF as the receiving sequential. If a path in the subsequent pipeline stage contains an RFF as the receiving sequential, then an undetected error occurs in this path if the RFF resolution time plus the path delay exceeds the max-delay constraint in (1) as

(9)

The path-delay term in (9) represents the delay for any of the multiple fan-out datapaths in the subsequent pipeline stage with an RFF as the receiving sequential. If an MSFF is the receiving sequential, the constraint in (9) would change by removing the error-detection window and replacing the falling-edge setup time with the MSFF setup time. For simplifying the discussion, this case is not considered. The CLK-to-Q delay for input data arriving at the setup time of the datapath flip-flop is included in the path delay, and the RFF resolution time represents the CLK-to-Q delay push-out from metastability. Substituting the resolution time from (8) into (9), a finite probability of an undetected error exists for paths in the subsequent pipeline stage with delays ranging from to . From (2) and assuming , the path delay range is approximated as to . As a practical example with , corresponding to , and , an undetected error can occur in a datapath with a delay ranging from to , where a large number of datapaths is expected to fall within this range.
Reviewing (7)–(9), the RFF MTBF is primarily determined by the RFF resolution time and resolution time constant, where the resolution time constant is determined from the datapath flip-flop design. As the resolution time reduces, the MTBF exponentially decreases. As the fan-out path delay increases, a smaller resolution time is required to satisfy (9). Thus, the MTBF for one RFF is dominated by the longest path delays from the multiple fan-out datapaths in the subsequent pipeline stage. Comparing (7)–(9) for datapath metastability in RFF to (3) and (4) for error metastability in TDTB, DSTB, and RFF, the RFF MTBF from datapath metastability is exponentially lower than the MTBPF from error-signal metastability, where MTBF for RFF is primarily determined by . As an example with , , and only considering one of the multiple fan-out datapaths with , then for TDTB and DSTB is ~ larger than for RFF. Moreover, additional microarchitectural factors are required to calculate the MTBF for TDTB and DSTB, which is exponentially larger than the MTBPF as described in Section III-D. Although other scenarios exist in which RFF metastability can result in an undetected error, the case presented in (7)–(9) represents one of the dominant failure conditions to determine MTBF.

In summary, a salient feature of TDTB and DSTB is moving the highly complex RFF metastability issue from both the datapath and error path to only the error path. Since the error path is vastly simpler than the datapath for metastability, the potential risk of metastability affecting logic functionality exponentially reduces.

Fig. 4. Advantages and disadvantages of the three error-detection sequential circuits in Fig. 2. T and L denote typical and large sequentials. Clock energy and delay are simulated in a 65 nm technology [11] with $V_{CC} = 1.2$ V.

E. Advantages and Disadvantages of RFF, TDTB, and DSTB

In addition to datapath metastability, the key trade-offs of the three EDS circuits in Fig.
2 are listed in Fig. 4, including simulated clock energy and delay comparisons to an MSFF and a pulse-latch. The sequential clock energy overhead of RFF and DSTB as compared to a typical (T) MSFF is 36% and 34%, respectively; TDTB actually reduces clock energy by 9%. As the datapath sequential size increases, the clock energy overhead for RFF and DSTB reduces since the additional clock energy from the error-detection circuitry is amortized by the larger datapath sequential; TDTB clock energy savings increase. As described earlier in (2), resilient designs must satisfy min-delay constraints. As a comparison, a conventional design could replace MSFFs with pulse-latches to lower clocking energy while satisfying an increased min-delay penalty. Relative to a typical pulse-latch, the clock energy overhead for RFF, TDTB, and DSTB is 145%, 64%, and 143%, respectively. For a larger pulse-latch, the clock energy overhead reduces. When a metastability detector is included at the output of the RFF [9], the clock energy overhead for RFF increases.

The RFF minimum D-to-Q delay is larger than the MSFF minimum D-to-Q delay due to the additional capacitance at the input and output nodes. TDTB and DSTB minimum D-to-Q delays are faster than the MSFF minimum D-to-Q delay since a latch is in the datapath, but slower than the pulse-latch minimum D-to-Q delay due to the extra node capacitances. Although the RFF D-to-Q delay is 30% larger than the TDTB and DSTB D-to-Q delays, the impact of this delay difference is only 2% when compared to the target cycle time.

The design complexity of RFF and DSTB is lower than for TDTB since the dynamic transition detector is highly sensitive to within-die (WID) process variations. In designing the pulse width for the transition detector, the key trade-off is versus a sufficient pulse width to detect late arriving data in the presence of WID variations. As the pulse width increases, the tolerance of the dynamic transition detector to WID variations improves at a cost of a larger .

RFF and DSTB also provide SER protection for the datapath sequential during the entire clock cycle. In comparison, TDTB offers SER protection for the datapath sequential during the high clock phase and only a portion of the low clock phase.

When introducing a new circuit technique into a production microprocessor, the ability to turn the functionality off is highly desirable in case unforeseen complexities arise. With RFF, the error detection capability could be turned off and operated at a low F_CLK with a 50% duty cycle. With TDTB and DSTB, a duty-cycle control would always be required for both high and low F_CLK.

Recently, a Razor II EDS circuit has been proposed [17], which contains features similar to TDTB. Razor II and TDTB EDS circuits both utilize a datapath latch and dynamically detect late data transitions. The key difference between these two EDS circuits is that TDTB detects late transitions at the latch input where data transitions are only monitored during the high clock phase; Razor II detects late transitions at the latch pass gate output, denoted as N [17], where data transitions are continuously monitored except for a window of time [17] after the rising clock edge. In Razor II, the transition detector is suspended for this window after the rising clock edge to guarantee that early input data transitions do not induce a timing error. The transition detector is suspended by generating a negative clock pulse to the dynamic transition detector. The negative-clock-pulse width, which sets this window, must be larger than the CLK-to-N delay with a zero setup time plus the data-pulse width to the transition detector.
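The Razor II pulse-width requirement stated above can be written as a simple check. The following is a minimal sketch, not the Razor II design procedure from [17]: all function names and the numbers in the demo are hypothetical, the guardband parameter is an assumed allowance for WID variation, and the detection window is assumed to equal the high clock phase minus the negative-clock-pulse width.

```python
def min_negative_pulse_width_ps(clk_to_n_ps: float, data_pulse_ps: float,
                                guardband_ps: float = 0.0) -> float:
    """Lower bound (ps) on the negative-clock-pulse width.

    Bound = CLK-to-N delay (zero setup time) + data-pulse width,
    plus an assumed, optional guardband for WID variation.
    """
    return clk_to_n_ps + data_pulse_ps + guardband_ps


def pulse_width_ok(negative_pulse_ps: float, clk_to_n_ps: float,
                   data_pulse_ps: float, guardband_ps: float = 0.0) -> bool:
    """True when the negative clock pulse strictly exceeds the bound."""
    bound = min_negative_pulse_width_ps(clk_to_n_ps, data_pulse_ps, guardband_ps)
    return negative_pulse_ps > bound


def remaining_detection_window_ps(high_phase_ps: float,
                                  negative_pulse_ps: float) -> float:
    """Detection window left after suspending the detector (assumed model)."""
    if negative_pulse_ps > high_phase_ps:
        raise ValueError("negative pulse cannot exceed the high clock phase")
    return high_phase_ps - negative_pulse_ps


if __name__ == "__main__":
    # Hypothetical values in picoseconds, chosen only for illustration.
    bound = min_negative_pulse_width_ps(clk_to_n_ps=40, data_pulse_ps=30,
                                        guardband_ps=10)
    print("minimum negative pulse width:", bound)
    print("90 ps pulse acceptable:", pulse_width_ok(90, 40, 30, 10))
    print("remaining window:", remaining_detection_window_ps(200, 90))
```

Under this model, any guardband added to the negative pulse directly reduces the time available for detecting late-arriving data.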
Relative to TDTB, additional clock transistors (e.g., inverters and NAND gate) are required in Razor II to generate the negative clock pulse, resulting in larger clock energy for Razor II. Furthermore, WID process variations amplify the complex timing in Razor II, requiring a sufficient delay guardband on the negative-clock-pulse width. This delay guardband results in extra clock inverters, and consequently, larger clock energy overhead. The clock energy overhead comparison between Razor II and DSTB depends on the additional clock energy required to generate the negative clock pulse for Razor II versus the additional clock energy in the shadow MSFF for DSTB. The D-to-Q delays for TDTB, DSTB, and Razor II appear relatively similar.

The advantages of the Razor II EDS circuit include: (i) Significantly improving metastability management as compared to RFF, which is similar to TDTB and DSTB, (ii) Fast D-to-Q delay, and (iii) SER protection for the datapath sequential during the entire clock cycle. The disadvantages include: (i) Complex timing requirements for the negative clock pulse and the data pulse to the dynamic transition detector to ensure proper functionality, (ii) Additional clock energy required to generate the negative clock pulse, and (iii) Error detection window shrinks by the negative-clock-pulse width, resulting in less opportunity to detect late arriving input data for a target min-delay constraint as defined by the high clock phase. In comparing TDTB and DSTB across a variety of known EDS circuit options, including Razor II [17], TDTB appears to be the lowest clock energy EDS circuit; DSTB appears to be the lowest clock energy static-EDS circuit with SER protection [10].

III. ERROR-RECOVERY CIRCUITS AND ARCHITECTURE

A. Test-Chip Architecture

The three EDS circuits in Fig. 2 and a conventional MSFF are implemented in four separate 3-stage pipeline circuit blocks to imitate a microprocessor as described in Fig. 5. The unlabeled sequentials in Fig.
5 represent MSFFs. The 32-entry input buffer consists of memory, an address counter, and control logic. Input buffer data and control signals are scanned into memory. Loop instructions are available to operate the test-chip for long durations. The input buffer drives data and control signals to each 3-stage pipeline circuit block. Each pipeline stage in the circuit block is 64 bits wide and contains a variety of path types, including inverter chains, NAND chains, NOR chains, XOR chains, NAND-NOR combination chains, inverter-MUX combination chains, long-distance repeater-based interconnect chains, cross-coupling paths, and multiple-input-switching paths. Transistor widths and number of logic stages vary from path to path. Transistor widths range from minimum-size to 40X minimum-size and the number of logic stages per path ranges from 8 to 28. For some paths, the number of logic stages is controlled via select bits to analyze different path delay histograms.

A stabilization pipeline stage [8], [9] follows the 3-stage pipeline circuit block to accommodate the 1-cycle latency for propagating error signals in the 3rd stage. This stabilization pipeline stage ensures that instructions are error-free before committing the state to the output buffer in the next pipeline stage. The output buffer consists of a 64-bit input linear-feedback-shift register (LFSR) that compresses output data into a unique signature to validate functionality.

B. Instruction-Replay Error-Recovery Circuits

In the 3-stage pipeline circuit block, error signals from each EDS circuit per pipeline stage are combined via an OR tree to generate a single error signal (FINAL ERROR) [8], [9]. For the two static-EDS circuits (RFF and DSTB), the OR-tree output directly feeds an MSFF. For the dynamic-EDS circuit (TDTB), the OR-tree output feeds an SDL, where the schematic is provided in Fig. 6. The SDL output is FINAL ERROR, which is an input to an MSFF. The SDL in Fig.
6 is transparent during the high clock phase and only allows high transitions during the low clock phase. If FINAL ERROR transitions to a logic-high, the SDL maintains the logic-high value for FINAL ERROR when the dynamic-EDS circuit pre-charges during the low clock phase. For all three EDS circuits, the output of the MSFF represents the pipeline-error signal (PIPELINE ERROR).

In contrast to the previous single-cycle error-recovery design [8], [9], a multi-cycle error-recovery circuit is implemented to reduce design overhead [10], [17]. As illustrated in Fig. 5, the three pipeline-error signals are propagated to the input buffer in one cycle to replay the failed instruction and pipelined to the output buffer to invalidate erroneous data. Input buffer control logic determines the appropriate instruction to replay based on the three pipeline-error signals. In a microprocessor, the instruction replay circuits could leverage the existing replay design to recover from a branch misprediction [18]. If a pipeline-error signal transitions to a logic-high, the input buffer signals the clock divider to halve F_CLK while maintaining a constant high clock phase delay for min-delay protection. Reducing F_CLK in half ensures correct operation during replay even if dynamic variations persist. After the replayed instruction finishes, the input buffer sends a reset signal to validate output data and signals the clock divider to resume at target F_CLK. Since F_CLK is halved for all but one of the recovery cycles, the number of actual (effective) cycles for recovery is 6 (11), 7 (13), and 8 (15), corresponding to timing errors in the 1st, 2nd, and 3rd pipeline stages, respectively. Since the number of recovery cycles linearly increases with the number of pipeline stages, the average error-recovery penalty for a microprocessor is expected to linearly increase as compared to the test-chip implementation. If dynamic variations persist for long durations, an adaptive off-die clock generator adjusts the nominal operating F_CLK.
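The actual and effective recovery-cycle counts quoted above follow from simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes that each cycle run at half F_CLK costs two target-frequency cycles while exactly one recovery cycle runs at the target F_CLK, as the text states.

```python
def effective_recovery_cycles(actual_cycles: int, full_speed_cycles: int = 1) -> int:
    """Convert actual recovery cycles into target-frequency (effective) cycles.

    Cycles run at half frequency each last two target-frequency periods;
    the remaining full-speed cycles last one period each.
    """
    if actual_cycles < full_speed_cycles:
        raise ValueError("actual_cycles must be >= full_speed_cycles")
    half_speed_cycles = actual_cycles - full_speed_cycles
    return 2 * half_speed_cycles + full_speed_cycles


# Actual recovery cycles reported for errors in pipeline stages 1, 2, and 3.
ACTUAL_CYCLES_BY_STAGE = {1: 6, 2: 7, 3: 8}

if __name__ == "__main__":
    for stage, actual in ACTUAL_CYCLES_BY_STAGE.items():
        print(f"stage {stage}: actual={actual}, "
              f"effective={effective_recovery_cycles(actual)}")
```

With these assumptions the function reproduces the 11, 13, and 15 effective cycles given in the text for the three pipeline stages.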
Fig. 5. Instruction-replay error-recovery design for one error-detection sequential (EDS) circuit. Set-dominant latch (SDL) is only used for TDTB.

Fig. 6. Set-dominant latch (SDL) circuit schematic.

C. Clock Divider and Duty-Cycle Control Circuits

The clock divider and duty-cycle control circuits are presented in Fig. 7(a) along with a conceptual timing diagram in Fig. 7(b). An off-die signal generator with a differential pulse-splitter creates differential inputs CLKIN and CLKIN#. The HALF FREQUENCY input is controlled by the input buffer as described in Fig. 5. The CLK output is distributed throughout the test-chip. CLKIN and CLKIN# are inputs to a differential amplifier that generates an intermediate clock signal. This intermediate clock signal and the output of the negative edge-triggered MSFF in Fig. 7(a) are inputs to a logic-AND gate to produce the clock divider output (CLK0). When the HALF FREQUENCY input is a logic-low, the output of the negative edge-triggered MSFF remains a logic-high, thus CLK0 and CLKIN have the same frequency. When the HALF FREQUENCY input is asserted, the output of the negative edge-triggered MSFF toggles every other cycle, enabling the clock divider circuit to skip every other high phase of CLKIN as illustrated in Fig. 7(b). The duty-cycle control is performed with a logical-AND of CLK0 and a delayed CLK0# (i.e., inversion of CLK0) with CLK as the output. The delayed CLK0# determines the CLK high phase delay, as controlled via scan bits. With this duty-cycle control circuit, the CLK high phase delay remains constant at both high and low F_CLK values, which is essential for min-delay protection.

D. Mean Time Between Failures (MTBF) From Error-Signal Metastability
As described previously in Section II, the error signals for TDTB and DSTB can become metastable when input data arrives slightly after the setup time prior to a rising clock edge. Recalling the unique characteristic of these signals, correct functionality is guaranteed as long as the pipeline-error signal (PIPELINE ERROR) is either a logic-high or a logic-low. Based on this behavior, the error path as described in Fig. 5 is similar to a traditional synchronizer circuit with the exception of having combinational logic in the middle of the sequentials. As modeled in (3) and (4), the extra logic between the sequentials exponentially reduces the MTBF [12]–[14]. As a counter benefit, the OR-tree logic increases the MTBF from metastability since PIPELINE ERROR can only become metastable when an individual EDS error signal is metastable while all other EDS error signals in the same pipeline stage remain logic-low. In other words, a metastable EDS error signal cannot affect the OR-tree output if any other EDS error signal in the pipeline stage is logic-high. Consequently, the mean time between inducing metastability on PIPELINE ERROR from a number of EDS circuits in the same pipeline stage does not decrease linearly with that number, since the probability of simultaneous errors from multiple EDS circuits increases with the number of EDS circuits.

Fig. 7. (a) Clock divider and duty-cycle control circuit schematics and (b) conceptual timing diagrams.

An undetected error occurs from error-signal metastability if PIPELINE ERROR becomes metastable and resolves at a time such that different logic values are captured at the INPUT BUFFER and the OUTPUT BUFFER. Based on this condition, the metastability resolution time for the MSFF driving PIPELINE ERROR is computed as (10)
In (10), the path-delay term is the longest pipeline-error path delay propagating to either the INPUT BUFFER or the OUTPUT BUFFER. From Fig. 5, the pipeline-error path delay from the driving MSFF to the INPUT BUFFER represents the longest pipeline-error path delay for the 1st, 2nd, and 3rd pipeline stages, thus determining this term. The setup-time term is the setup time for an MSFF contained in the INPUT BUFFER. In a 65 nm technology [11], the resolution time from (10) for the 3rd pipeline stage is 125 ps (9.37 FO4 delays) at the evaluated V_CC and F_CLK. If the resolution time is larger than (10), then a metastable PIPELINE ERROR can result in an actual failure. If the resolution time is less than (10), then correct functionality is maintained. For a microprocessor implementation, PIPELINE ERROR could be sent to the INPUT BUFFER over multiple cycles by inserting additional registers, resulting in a lower pipeline-error path delay, and thus exponentially increasing MTBF with a linear penalty in error-recovery cycles.

An approximation of the MTBF from error-signal metastability in a microprocessor is calculated as (11). In (11), one parameter represents the number of pipeline stages in a microprocessor, and another is the average number of EDS circuits in a pipeline stage. A third parameter is the resolution time constant for the MSFF driving PIPELINE ERROR. From simulations in a 65 nm technology [11], this resolution time constant is 5.75 ps (0.43 FO4 delay). Note that a more detailed MTBF model requires an analysis of and for each pipeline stage of a microprocessor. Observing (3) and (11), the primary parameters that determine the MTBF from error-signal metastability are , , , and . Assuming extreme worst-case estimates for , , , and from (3) for modeling , the MTBF from error-signal metastability is over larger than microprocessor MTBF targets for soft-error rate (SER) in a 65 nm technology.

E. Test-Chip Characteristics

The resilient circuit test-chip micrograph and characteristics are provided in Fig. 8. The clock circuitry, input buffer (IB), and each 3-stage pipeline circuit block with an output buffer are highlighted.
On-chip noise injectors are implemented to induce V_CC droop events, where magnitude and duration are controlled via scan bits. The test-chip is manufactured in a 65 nm logic technology [11] with a target V_CC of 1.2 V. The test-chip die area is m and the number of transistors is 493K. Silicon measurements are collected using a high-frequency membrane probe card on a bare die in a 300 mm probe station.

Fig. 8. Resilient circuit test-chip micrograph and characteristics. Measurements collected with a high-frequency membrane probe card on a bare die in a 300 mm probe station.

F. Resilient Circuit Demonstration

A timing-error detection and recovery measurement is demonstrated in Fig. 9 based on an oscilloscope capture of test-chip output signals. CLK is the clock signal distributed throughout the test-chip; DATA is 1 bit of data sent from the input buffer to the 3-stage pipeline circuit block; PIPELINE ERROR# is the logic-NOR output of the three pipeline-error signals; and VALID OUTPUT is the output buffer valid signal. In the demonstration with no timing errors, DATA transitions from a logic-high to a logic-low for 4 cycles and then transitions to a logic-high. In the demonstration with timing errors, the high-to-low transition from DATA induces a timing error at the 3rd pipeline stage: (a) Error is detected by the EDS circuit and propagated to the pipeline-error signal to lower PIPELINE ERROR# for one cycle; (b) VALID OUTPUT transitions low to invalidate erroneous data; (c) Pipeline-error signal triggers the input buffer to raise the HALF FREQUENCY signal in Figs. 5 and 7 to halve F_CLK; (d) Input buffer replays the instruction, resulting in a high-to-low DATA transition; (e) Once the instruction finishes, the input buffer resets the VALID OUTPUT signal to allow output data to reach the LFSR; (f) Finally, input buffer lowers the HALF FREQUENCY signal to resume at target F_CLK. In Fig. 9, note that the CLK high phase delay remains constant at both high and low F_CLK for min-delay protection. In Fig. 10, an oscilloscope capture of V_CC, PIPELINE ERROR#, and VALID OUTPUT signals demonstrates the measured error detection and recovery capability for sporadic V_CC droops.

Fig. 9. Measured timing-error detection and recovery demonstration. Data from input buffer arrives late at 3rd pipeline stage; (a) Detect error; (b) Invalidate output data; (c) Halve F_CLK; (d) Replay instruction; (e) Validate output data; and (f) Resume target F_CLK.

Fig. 10. Measured error detection and recovery demonstration for sporadic V_CC droops.

IV. TEST-CHIP MEASUREMENTS

Imitating a microprocessor, instruction kernels are executed on the test-chip to compare a resilient design with EDS circuits to a conventional design with MSFFs. The benefits from resilient circuits depend on the path-activation probabilities as determined by the instruction kernels. Although previous research has investigated node-activity probabilities for a microprocessor [19], node-activity probabilities do not directly translate into path-activation probabilities. Since path-activation probabilities have not been rigorously explored for a microprocessor, there is uncertainty about the dependency of path-activation probability on path delay. Since slow paths typically contain more logic depth than fast paths, the probability of activating slow paths is intuitively expected to be less than the activation probability for fast paths in general. In addressing this issue, two sets of instruction kernels are selected to induce the path-activation examples presented in Fig. 11, which attempt to represent practical approximations of favorable (example #1) and unfavorable (example #2) scenarios.
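One way to picture a favorable and an unfavorable path-activation scenario is with a small numerical model. The sketch below is an assumption-laden illustration, not a reconstruction of Fig. 11: the function names, the path delays, and the decay constant are all hypothetical. The favorable case is modeled as an activation probability that decays exponentially from fast to slow paths, and the unfavorable case as an equal probability for all paths.

```python
import math


def activation_probabilities(path_delays_ps, decay_per_ps=None):
    """Return one normalized activation probability per path.

    decay_per_ps=None gives equal probability for every path (unfavorable case).
    decay_per_ps > 0 gives a probability that decays exponentially with
    path delay relative to the fastest path (favorable case).
    """
    if not path_delays_ps:
        raise ValueError("at least one path delay is required")
    if decay_per_ps is None:
        weights = [1.0] * len(path_delays_ps)
    else:
        fastest = min(path_delays_ps)
        weights = [math.exp(-decay_per_ps * (d - fastest)) for d in path_delays_ps]
    total = sum(weights)
    return [w / total for w in weights]


def slow_path_activation(path_delays_ps, probabilities, threshold_ps):
    """Total activation probability of paths slower than threshold_ps."""
    return sum(p for d, p in zip(path_delays_ps, probabilities) if d > threshold_ps)


if __name__ == "__main__":
    delays = [200, 250, 300, 350]  # hypothetical path delays in ps
    favorable = activation_probabilities(delays, decay_per_ps=0.02)
    unfavorable = activation_probabilities(delays)
    print("favorable slow-path share:", slow_path_activation(delays, favorable, 275))
    print("unfavorable slow-path share:", slow_path_activation(delays, unfavorable, 275))
```

In this model the favorable case places a smaller share of activations on slow paths, which is the property a resilient design can exploit when the clock frequency is raised past the slowest path delays.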
In path-activation example #1, critical paths are activated less frequently than non-critical paths, where the activation probability exponentially reduces from fast to slow paths. Path-activation example #2 applies an equal activation probability for all paths.

Fig. 11. Path-activation probability versus path delay for two path-activation examples.

In Fig. 12, throughput (TP) and error rate are measured for the TDTB EDS circuit versus F_CLK for the two path-activation examples in Fig. 11. For a given F_CLK, error rate is governed by the path histogram, as dictated by design optimization, as well as path-activation probabilities and environmental variations, as determined by workloads. A range of V_CC droop magnitudes and durations are inserted based on data from a recent microprocessor along with assumptions on droop-inducing events. The worst-case V_CC droop magnitude is 10% and the worst-case temperature is 110 °C. Nominal V_CC is 1.2 V and the nominal operating temperature is assumed to be 60 °C. In Fig. 12(a), throughput increases linearly as F_CLK increases with no errors. Once errors occur, instructions per cycle (IPC) reduce as a function of error rate and recovery time. Since droop events are assumed infrequent, throughput gains continue as F_CLK increases into the V_CC and temperature guardband region. When F_CLK reaches 3020 MHz, the first path failure occurs under nominal conditions, resulting in a sharp error rate increase. Since the path-activation probability for example #1 is low for slow paths, further throughput gains are achieved at higher F_CLK. The maximum throughput of 3.17 billion instructions per second (BIPS) corresponds to a 3200 MHz F_CLK. Increasing F_CLK further leads to a larger error rate, where IPC reduction outweighs F_CLK gains. Due to the high path-activation probability for slow paths in example #2, the resilient design cannot exploit the path-activation probabilities in Fig. 12(b), limiting the maximum throughput to 3.01 BIPS. In Fig. 12(a) and (b), the maximum throughput to guarantee correct functionality in the presence of worst-case dynamic V_CC and temperature variations for the conventional design with MSFFs is 2.4 BIPS, corresponding to an F_CLK of 2400 MHz. From Fig. 12, a resilient design enables 25% throughput gain over a conventional design by eliminating the F_CLK guardband from dynamic V_CC and temperature variations and an additional 7% throughput increase from exploiting the path-activation probabilities for example #1.

In Fig. 13, the maximum throughputs for resilient and conventional designs are measured versus V_CC. Since a worst-case V_CC droop of 10% is applied in these measurements, the maximum droop reduces in absolute value as the nominal V_CC reduces. For the resilient design, the two path-activation examples in Fig. 11 are evaluated. Depending on the path-activation example in Fig. 13, the throughput for the resilient design at a reduced V_CC is greater than or equal to the throughput of the conventional design at the target V_CC. Thus, measured data indicates that resilient circuits enable either 25%–32% throughput gain at equal V_CC or at least a 17% V_CC reduction at equal throughput [10] for a target V_CC of 1.2 V in a 65 nm technology. For equal throughput, the measured total power reduction from resilient circuits is 37% and 31% for path-activation example #1 and example #2, respectively. The total power reduction for a resilient microprocessor will differ from the test-chip depending on the fraction of total sequentials protected with EDS circuits and the amount of buffer insertion required to correct min-delay paths.

Fig. 12. Measured throughput (TP) and error rate for a resilient design with TDTB EDS circuits versus clock frequency (F_CLK) for path-activation (a) example #1 and (b) example #2.

Fig.
13. Measured maximum throughput (TP) for a resilient design with TDTB EDS circuits and a conventional design with MSFFs versus supply voltage.

From the data in Fig. 13, the measured throughput gain of a resilient design relative to a conventional design is plotted versus V_CC in Fig. 14. Measurements demonstrate a larger throughput gain as V_CC reduces, resulting from the increased path-delay sensitivity to V_CC. From Fig. 14, the throughput gain from a resilient design increases from 25%–32% at the target V_CC to 34%–40% at a lower V_CC.

Fig. 14. Measured throughput gain of a resilient design with TDTB EDS circuits relative to a conventional design with MSFFs versus supply voltage.

In Fig. 15, the measured throughput gain in Fig. 13 is mapped to a measured power reduction by lowering the V_CC of the resilient design to achieve equal throughput for the resilient and conventional designs. The change in throughput across the x-axis directly corresponds to a change in nominal V_CC (i.e., higher throughput corresponds to higher V_CC). As throughput reduces from 2.5 BIPS to 1.25 BIPS, measured data reveals that the power benefits decrease from 33%–38% to 18%–21%. Although the throughput gain from resilient circuits increases with lower V_CC as demonstrated in Fig. 14, the power-to-throughput sensitivity decreases as V_CC reduces. As an example for throughput at 2.5 BIPS, raising V_CC to achieve a 1% throughput gain incurs a 3% total power increase. When throughput equals 1.25 BIPS, however, raising V_CC to achieve a 1% throughput gain only results in a 1.8% total power increase. Thus, any circuit technique that maps linear performance gains to power reduction via reducing V_CC will have decreased power savings as V_CC reduces.

Fig. 15. Measured power reduction of a resilient design with TDTB EDS circuits relative to a conventional design with MSFFs versus throughput.

Recommendations to further enhance the performance and energy benefits of resilient circuits are offered. First, since dynamic V_CC and temperature variations may persist for long durations, resilient circuits in conjunction with on-die variation sensors and adaptive schemes [2]–[4] could improve the efficiency of adjusting the nominal operating F_CLK. Second, additional sources of dynamic variation that could be mitigated include: (i) Transistor aging, (ii) Cross-coupling capacitance, and (iii) Multiple-input switching (MIS). For transistor aging, a conventional design typically applies an upfront guardband based on lifetime aging projections. In comparison, resilient circuits with on-die variation sensors and adaptive techniques could detect and recover from aging-related delay faults and adjust the nominal F_CLK based on monitoring the error rate or on-die aging sensors to allow a smooth performance reduction over the microprocessor lifetime [20]. For cross-coupling and MIS events, conventional pre-silicon design methodologies often apply pessimistic assumptions for these dynamic variations, although these events may rarely or never occur. These pessimistic assumptions lead to an over-design for the majority of paths, and consequently, high power. Allowing resilient circuits to detect and correct late transitions from infrequent cross-coupling and MIS events, less stringent cross-coupling and MIS conditions could be applied in timing analysis, resulting in power savings. Third, timing optimization with resilient circuits presents an exciting opportunity. In conventional timing optimization, all paths in the path-delay histogram must satisfy a target cycle time. For resilient circuits, timing optimization would be based on combining the path-delay histogram with path-activation probabilities. As an example, infrequently activated paths could have a relaxed timing constraint as compared to frequently activated paths to reduce power. In quantifying the cost associated with resilient circuits, a rigorous analysis is warranted for the additional power overhead from min-delay buffer insertion versus the error-detection window for a recent microprocessor. In addition, realistic path-activation probabilities should be evaluated across a variety of workloads.

V. CONCLUSION

Timing-error detection and recovery circuits are implemented in a 65 nm resilient circuit test-chip to eliminate the clock frequency guardband from dynamic supply voltage and temperature variations as well as to exploit path-activation probabilities for maximizing throughput. Two error-detection sequential (EDS) circuits are introduced to lower the clock energy and to remove the datapath metastability in existing EDS designs. One EDS circuit is a dynamic transition detector with a time-borrowing datapath latch (TDTB). The other EDS circuit is a double-sampling static design with a time-borrowing datapath latch (DSTB). A salient feature of TDTB and DSTB relative to previous EDS designs is moving the highly complex metastability issue from both the datapath and error path to only the error path, thus drastically simplifying metastability management. Based on a survey of various known EDS circuit options, TDTB is the lowest clock energy EDS circuit; DSTB is the lowest clock energy static-EDS circuit with SER protection. Error-recovery circuits are proposed to replay failing instructions at half F_CLK to ensure correct functionality. Silicon measurements indicate that resilient circuits enable either 25%–32% throughput gain at equal V_CC or at least 17% V_CC reduction at equal throughput, corresponding to 31%–37% total power reduction.
Future opportunities to further enhance performance and energy efficiency include: (i) Combining resilient circuits with on-die variation sensors and adaptive schemes, (ii) Mitigating delay faults induced from transistor aging, cross-coupling capacitance, and multiple-input switching, and (iii) Optimizing resilient circuit designs by coupling the path-delay histogram with path-activation probabilities.

ACKNOWLEDGMENT

The authors express sincere appreciation to Pavan Karidi for mask design; David Jenkins, David Finan, and Saurabh Dighe for simulation support; Paolo Aseron and Trang Nguyen for lab measurement support; Charles Dike for metastability calculation guidance; Sriram Vangal for helpful advice; and Matthew Haycock for encouragement.

REFERENCES

[1] A. Muhtaroglu, G. Taylor, and T. R. Arabi, “On-die droop detector for analog sensing of power supply noise,” IEEE J. Solid-State Circuits, pp. 651–660, Apr. 2004.
[2] T. Fischer, J. Desai, B. Doyle, S. Naffziger, and B. Patella, “A 90-nm variable frequency clock system for a power-managed Itanium architecture processor,” IEEE J. Solid-State Circuits, pp. 218–228, Jan. 2006.
[3] R. McGowen et al., “Power and temperature control on a 90-nm Itanium family processor,” IEEE J. Solid-State Circuits, pp. 229–237, Jan. 2006.
[4] J. Tschanz et al., “Adaptive frequency and biasing techniques for tolerance to dynamic temperature-voltage variations and aging,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 292–293.
[5] P. Franco and E. J. McCluskey, “Delay testing of digital circuits by output waveform analysis,” in Proc. IEEE Int. Test Conf., Oct. 1991, pp. 798–807.
[6] P. Franco and E. J. McCluskey, “On-line testing of digital circuits,” in Proc. IEEE VLSI Test Symp., Apr. 1994, pp. 167–173.
[7] M.
Nicolaidis, “Time redundancy based soft-error tolerance to rescue nanometer technologies,” in Proc. IEEE VLSI Test Symp., Apr. 1999, pp. 86–94.
[8] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO-36), Dec. 2003, pp. 7–18.
[9] S. Das et al., “A self-tuning DVS processor using delay-error detection and correction,” IEEE J. Solid-State Circuits, pp. 792–804, Apr. 2006.
[10] K. A. Bowman et al., “Energy-efficient and metastability-immune timing-error detection and instruction-replay-based recovery circuits for dynamic-variation tolerance,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2008, pp. 402–403.
[11] P. Bai et al., “A 65 nm logic technology featuring 35 nm gate lengths, enhanced channel strain, 8 Cu interconnect layers, low-k ILD and 0.57 µm² SRAM cell,” in IEEE IEDM Tech. Dig., Dec. 2004, pp. 657–660.
[12] H. J. M. Veendrick, “The behavior of flip-flops used as synchronizers and prediction of their failure rate,” IEEE J. Solid-State Circuits, pp. 169–176, Apr. 1980.
[13] C. L. Portmann and T. H. Y. Meng, “Metastability in CMOS library elements in reduced supply and technology scaled applications,” IEEE J. Solid-State Circuits, pp. 39–46, Jan. 1995.
[14] C. Dike and E. Burton, “Miller and noise effects in a synchronizing flip-flop,” IEEE J. Solid-State Circuits, pp. 849–855, Jun. 1999.
[15] V. Srinivasan et al., “Optimizing pipelines for power and performance,” in Proc. Int. Symp. Microarchitecture (MICRO-35), Nov. 2002, pp. 333–344.
[16] A. Hartstein and T. R. Puzak, “The optimum pipeline depth considering both power and performance,” ACM Trans. Arch. and Code Opt. (TACO), pp. 369–388, Dec. 2004.
[17] D. Blaauw et al., “Razor II: In situ error detection and correction for PVT and SER tolerance,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2008, pp. 400–401.
[18] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 2nd ed.
San Francisco, CA: Morgan Kaufmann, 1996.
[19] H. L. Yeager, M. J. Patyra, R. Reyes, and K. A. Bowman, “Microprocessor power optimization through multi-performance device insertion,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2004, pp. 334–337.
[20] M. Agarwal, B. C. Paul, M. Zhang, and S. Mitra, “Circuit failure prediction and its application to transistor aging,” in Proc. IEEE VLSI Test Symp., May 2007, pp. 277–286.

Keith A. Bowman (S’97–M’02) received the B.S. degree in electrical engineering from North Carolina State University, Raleigh, NC, in 1994 and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, GA, in 1995 and 2001, respectively. He is currently a Staff Research Scientist in the Circuit Research Lab (CRL) at Intel Corporation in Hillsboro, OR. From 2001 to 2004, he worked as a Senior Computer-Aided Design (CAD) Engineer in the Technology-CAD Division at Intel in Hillsboro, OR, where he developed and supported statistical-based models, methodologies, and software tools to predict microprocessor performance and power variability. Since joining CRL in 2004, his research has focused on the development of circuit design solutions to mitigate the impact of parameter variations on microprocessor performance and power.

James W. Tschanz (M’99) received the B.S. degree in computer engineering and the M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign, in 1997 and 1999, respectively. Since 1999, he has been a circuits researcher with the Intel Circuit Research Lab in Hillsboro, OR. His research interests include low-power digital circuits, design techniques, and methods for tolerating parameter variations. He also taught VLSI design for seven years as an adjunct faculty member at the Oregon Graduate Institute in Beaverton, OR.

Nam Sung Kim received the B.S. and M.S.
degrees in electrical engineering from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in 1997 and 2000, respectively, and the Ph.D. degree in computer science and engineering from the University of Michigan-Ann Arbor in 2004. He is currently an Assistant Professor at the University of Wisconsin-Madison. He was with Intel Microprocessor Technology Lab as a senior research scientist from 2004 to 2008. He has authored more than 30 technical papers in refereed international conferences and journals. His research interests span low-power and high-performance circuits, circuit-microarchitecture co-designs, CAD algorithms, and biomimetic microsystems. Dr. Kim was a recipient of the award at the 2001 IEEE Design Automation Conference (DAC) Student Design Contest and the best paper award at the 2003 IEEE International Conference on Microarchitecture (MICRO) in San Diego, CA, for his work on the low-power and robust microarchitecture. He was also a recipient of an Intel Fellowship in 2002.

Janice C. Lee received the B.Sc. degree in engineering physics from Queen’s University in Canada, and the S.M. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA. In 2006, she joined Intel Corporation as a research scientist in the Circuit Research Lab at Hillsboro, OR. Her Ph.D. research was in the area of quantum computing using superconducting circuits, and she was a National Science Foundation graduate fellow. At Intel, her interests include next-generation-transistor research using carbon nanotubes, and resilient-circuit research for dynamic variation tolerance.

Chris B.
Wilkerson graduated from Carnegie Mellon University with the Master’s degree in 1996. He has published a number of papers on microarchitectural topics including value prediction, branch prediction, cache organization, runahead, and advanced speculative execution. Recently, he has focused on low-power design, including microarchitectural mechanisms to enable low-voltage operation for microprocessors.

Shih-Lien L. Lu (M’90) received the B.S. degree in electrical engineering and computer science (EECS) from the University of California at Berkeley, and the M.S. and Ph.D. degrees, both in computer science and engineering, from the University of California at Los Angeles (UCLA). From 1984 to 1991, he worked on the MOSIS project at USC/ISI, which provides VLSI fabrication services to the research and education community. He joined the faculty of the Department of Electrical and Computer Engineering at Oregon State University (OSU), Corvallis, OR, in 1991. While at OSU, Dr. Lu received the College of Engineering Carter Award for outstanding and inspirational teaching in 1995 and the CoE/ECE Engelbrecht Young Faculty Award in 1996. He joined Intel in 1999 and has been with Microprocessor Technology Labs since then. Currently, he is a Principal Engineer and leads the Oregon Microarchitecture Lab. He is a member of the IEEE and the IEEE Computer Society.

Vivek K. De (SM’07) received the Bachelor’s degree in electrical engineering from the Indian Institute of Technology, Madras, India, in 1985 and the Master’s degree in electrical engineering from Duke University, Durham, NC, in 1986. He received the Ph.D. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1992. He is an Intel Fellow and Director of Circuit Technology Research in the Corporate Technology Group. He joined Intel in 1996 as a staff engineer in the Circuits Research Lab (CRL) in Hillsboro, OR.
Since that time he has led research teams in CRL focused on developing advanced circuits and design techniques for low-power and high-performance processors. In his current role, he provides strategic direction for future circuit technologies and is responsible for aligning CRL’s circuit research with technology scaling challenges. Prior to joining Intel, he was engaged in semiconductor devices and circuits research at Rensselaer Polytechnic Institute and Georgia Institute of Technology, and was a visiting researcher at Texas Instruments. He has published 167 technical papers in refereed conferences and journals, and six book chapters on low-power circuits. He holds 154 patents, with 40 more patents filed (pending). Dr. De received an Intel Achievement Award for his contributions to a novel integrated voltage regulator technology.

Tanay Karnik (M’88–SM’04) received the Ph.D. degree in computer engineering from the University of Illinois at Urbana-Champaign in 1995. From 1995 to 1999, he worked in the Strategic CAD Lab at Intel. Since March 1999, he has led the power delivery, soft error rate, and low-power circuits research in the Circuits Research, Intel Labs, where he is Principal Engineer and manager of low-power circuits research. His research interests are in the areas of variation tolerance, power delivery, soft errors, and physical design. He has published over 40 technical papers and has 41 issued and 36 pending patents in these areas. Dr. Karnik received an Intel Achievement Award for the pioneering work on integrated power delivery. He has presented several invited talks and tutorials, and has served on five Ph.D. students’ committees. He was a member of DAC, ICCAD, ICICDT, and ISQED program committees and JSSC, TCAD, TVLSI, and TCAS review committees. He was General Chair of ISQED’08 and ICICDT’08.