IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 44, NO. 1, JANUARY 2009
Energy-Efficient and Metastability-Immune Resilient
Circuits for Dynamic Variation Tolerance
Keith A. Bowman, Member, IEEE, James W. Tschanz, Member, IEEE, Nam Sung Kim, Janice C. Lee,
Chris B. Wilkerson, Shih-Lien L. Lu, Member, IEEE, Tanay Karnik, Senior Member, IEEE, and
Vivek K. De, Senior Member, IEEE
Abstract—A 65 nm resilient circuit test-chip is implemented
with timing-error detection and recovery circuits to eliminate the
clock frequency guardband from dynamic supply voltage (VCC)
and temperature variations as well as to exploit path-activation
probabilities for maximizing throughput. Two error-detection sequential (EDS) circuits are introduced to preserve the timing-error
detection capability of previous EDS designs while lowering clock
energy and removing datapath metastability. One EDS circuit is a
dynamic transition detector with a time-borrowing datapath latch
(TDTB). The other EDS circuit is a double-sampling static design
with a time-borrowing datapath latch (DSTB). In comparison to
previous EDS designs, TDTB and DSTB redirect the highly complex metastability problem from both the datapath and error path
to only the error path, enabling a drastic simplification in managing metastability. From a survey of various EDS circuit options,
TDTB represents the lowest clock energy EDS circuit known;
DSTB represents the lowest clock energy static-EDS circuit with
SER protection known. Error-recovery circuits are introduced to
replay failing instructions at lower clock frequency to guarantee
correct functionality. Relative to conventional circuits, test-chip
measurements demonstrate that resilient circuits enable either
25%–32% throughput gain at equal VCC or at least 17% VCC
reduction at equal throughput, corresponding to 31%–37% total
power reduction.
Index Terms—Dynamic variations, error correction, error
detection, error-detection sequential, error recovery, instruction
replay, parameter variations, resilient circuits, resilient design,
supply voltage droop, temperature variation, timing errors, variation tolerance.
I. INTRODUCTION
INTEGRATED circuits have always been vulnerable to dynamic variations in supply voltage (VCC) and temperature.
Abrupt changes in die-level switching activity induce large current transients in the power delivery system, resulting in VCC
droop and overshoot fluctuations. The magnitude and duration
of VCC droops and overshoots depend on the interaction of
capacitive and inductive parasitics at the board, package, and
die levels with changes in current demand [1]. Temperature
variations depend on workload, environmental conditions,
and the heat-removal capability of the package. Conventional
microprocessor performance reduces as VCC decreases or as
temperature increases. Consequently, the maximum clock frequency
of a microprocessor is traditionally determined
based on maximum VCC droop and temperature specifications.
Manuscript received March 31, 2008; revised July 09, 2008. Current version
published December 24, 2008.
The authors are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail:
keith.a.bowman@intel.com).
Digital Object Identifier 10.1109/JSSC.2008.2007148
Since typical usage patterns usually operate at nominal VCC
and temperature, these infrequent dynamic variations severely
limit the maximum clock frequency as compared to the potential
higher clock frequency at nominal VCC and temperature. Therefore,
an inherent clock frequency guardband is required at nominal VCC and
temperature to ensure correct functionality in the presence
of dynamic VCC and temperature variations.
On-die VCC and temperature variation sensors coupled
with adaptive circuit techniques have been demonstrated to
adjust clock frequency, VCC, or body bias in response to changes in
VCC and temperature [2]–[4]. These schemes reduce the
clock frequency guardband from slow-changing VCC and temperature
variations, resulting in higher average clock frequency. Alternatively,
the average clock frequency benefits may be converted to lower average
power by decreasing VCC. The disadvantages of on-die sensors
and adaptive approaches include the inability to respond to
fast-changing variations such as high-frequency VCC droops
[1]. Furthermore, sensors and adaptive circuits require substantial calibration time per die, leading to increased testing
costs. Although sensors may be tuned during test to reduce
the delay mismatch between sensors and critical paths from
within-die variations, a clock frequency guardband is still necessary to
ensure coverage across a wide range of VCC and temperature
conditions as well as for transistor aging.
Error-detection sequential (EDS) circuits have been proposed
to monitor timing faults for on-line testing of digital circuits
within the presence of environmental influences and reliability
concerns [5], [6]. EDS circuits have also been employed to detect transient delay errors resulting from single-event upsets induced by alpha particles and cosmic radiation [7]. The combination of timing-error detection and error-recovery circuits
[7]–[9] enables the microprocessor to operate at a clock frequency
determined by nominal VCC and temperature. When dynamic variations
induce a timing error, the error is detected and corrected to
maintain proper logic functionality. Since additional clock cycles are required for error recovery, instructions per cycle (IPC)
decrease as errors occur. Assuming infrequent timing errors, the
IPC reduction is relatively small as compared to the clock frequency gains
from removing the guardband for dynamic VCC
and temperature variations, resulting in higher overall system throughput.
In addition, further clock frequency benefits are possible by exploiting
path-activation probabilities [7]–[9]. If the slowest paths on the
die are infrequently activated, the clock frequency may be raised above
the critical-path operating frequency. When these critical
paths are activated, the timing error is detected and corrected.
As an alternative to performance benefits, the clock frequency gains may
be traded off for lower power by reducing VCC [8], [9].
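The throughput argument above can be illustrated with a minimal sketch. It assumes that throughput scales as clock frequency divided by cycles per instruction and that each timing error costs a fixed number of recovery cycles. The function name and every number below are hypothetical and chosen only for illustration; they are not test-chip measurements.

```python
def relative_throughput(freq_gain, errors_per_instruction, recovery_cycles,
                        base_cpi=1.0):
    """Throughput of a resilient design relative to a guardbanded baseline.

    Assumption (not from the paper): every timing error adds a fixed number
    of recovery cycles, and the guardbanded baseline never errs.
    """
    # Average cycles per instruction, including error-recovery overhead.
    cpi = base_cpi + errors_per_instruction * recovery_cycles
    # Throughput ~ frequency / CPI; the baseline runs at frequency 1.0.
    return (1.0 + freq_gain) * base_cpi / cpi


# Hypothetical operating point: 20% higher clock frequency, one timing
# error per 10,000 instructions, 20 recovery cycles per error.
gain = relative_throughput(0.20, 1e-4, 20)
print(f"relative throughput: {gain:.4f}")
```

With infrequent errors the recovery term barely moves the CPI, so nearly all of the frequency gain survives as throughput gain. As the error rate rises, the recovery cycles eventually cancel the benefit.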
0018-9200/$25.00 © 2008 IEEE
While some of the potential benefits of timing-error detection
and correction circuits have been highlighted in previous work
[5]–[9], several significant challenges have also been identified
with recent implementations. In previous EDS circuits [5]–[9],
the error-detection capability is achieved at a large cost in clock
energy since a redundant storage element is needed. This sequential clock energy overhead is a major concern in microprocessor designs since sequential clock energy is a large fraction
of the overall total dynamic energy. In addition, these circuits
are susceptible to datapath metastability, requiring substantial
design overhead [9].
In a previous error-recovery design [8], [9], a single-cycle
error-recovery circuit has been implemented to ensure forward
progress for late arriving signals. In this approach, a 2-to-1 MUX
logic gate is placed in each datapath prior to the receiving flip-flop
to select between the normal datapath value and the correct value
from the previous cycle. This additional logic gate increases the
delay of all paths and increases total power. The error signals
from each EDS circuit in a pipeline are combined through an
OR tree to generate a restore signal, which is then routed to the
select line of each MUX in the pipeline. This design imposes
significant timing restrictions on the error-signal propagation
delay and requires additional area for interconnect routing tracks.
Furthermore, the previous error-recovery design [8], [9] employs
an off-chip VCC controller system to adjust VCC based on
monitored error rates. Although increasing VCC is generally an
effective tuning knob to reduce critical path delays, the applicability of this controller is highly complex for error recovery due
to the following: (i) Path-delay sensitivity to VCC depends on the
operating VCC, (ii) Path-delay sensitivity to VCC differs from
path to path, (iii) Increasing VCC to speed up a critical path also
speeds up min-delay paths, (iv) Maximum VCC is defined by
reliability constraints which could limit error recovery in a
high-performance mode, and (v) Changing VCC is relatively slow.
In this paper, the concept of timing-error detection and
correction circuits in previous work [5]–[9] is extended and
implemented in a prototype [10] in 65 nm technology [11] to
explore the effectiveness of resilient circuits in eliminating the
clock frequency guardband from dynamic VCC and temperature variations
as well as exploiting path-activation probabilities to maximize
throughput. In Section II, EDS circuits for dynamic variation
tolerance are reviewed and two EDS circuits [10] are introduced to retain the error-detection capability in previous work
[5]–[9] while lowering clock energy and eliminating datapath
metastability. Next in Section III, the test-chip architecture is
described along with the error-recovery design. In contrast to
the single-cycle adaptive-VCC error-recovery approach [8], [9],
a multi-cycle instruction-replay error-recovery circuit is
implemented with adaptive clock frequency to significantly reduce design
overhead at a small penalty in error-recovery cycles [10]. In
Section IV, test-chip measurements are presented. Section V
concludes by summarizing the key results and recommendations.
II. ERROR-DETECTION SEQUENTIAL (EDS) CIRCUITS
A. EDS Circuit Overview
The basic concept of timing-error detection circuits for dynamic variation tolerance is described in Fig. 1. A conventional
path with master-slave flip-flops (MSFF) is provided in Fig. 1(a)
along with conceptual timing diagrams in Fig. 1(b), illustrating
the arrival times of the input data (D) to the receiving flip-flop
during worst-case dynamic variations and nominal conditions.
Within the presence of worst-case dynamic variations, the input
data to the receiving flip-flop must arrive a setup time prior to the
rising clock edge to ensure correct functionality. In comparison,
the input data for the same path arrives much earlier during nominal conditions. The difference between the input data arrival
times for these two cases represents the effective timing guardband required for dynamic variations. A resilient path is created
by replacing the receiving MSFF of the conventional path with
an EDS circuit as described in Fig. 1(c). The conceptual timing
diagram in Fig. 1(d) illustrates late arriving input data. The EDS
circuit in Fig. 1(c) and Fig. 2(a) is a simplified Razor flip-flop
(RFF) [5]–[9], where the metastability detector is omitted. The
RFF double samples input data with a datapath flip-flop on the
rising clock edge and a shadow latch on the falling clock edge.
The flip-flop and latch outputs are compared with an XOR gate
to produce an error signal (ERROR). If input data transitions
late as described in Fig. 1(d), flip-flop and latch outputs differ,
resulting in a logic-high error signal. The error signal is handled
at the microarchitecture level to enable error recovery. Since
the resilient circuit can detect and correct late arriving data, the
timing guardband for dynamic variations in the conventional design can be removed, allowing the resilient circuit to operate at
a higher clock frequency.
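The double-sampling behavior of the simplified RFF can be sketched as a small behavioral model. This is an idealized illustration: it ignores setup and hold times, the metastability detector, and clock skew, and it assumes a single data transition per cycle. The function and parameter names are hypothetical.

```python
def razor_ff_cycle(d_old, d_new, data_arrival, clk_rise, clk_fall):
    """Return (flip_flop_q, shadow_latch_q, error) for one clock cycle.

    D changes from d_old to d_new at time data_arrival. The datapath
    flip-flop samples D on the rising edge; the shadow latch holds the
    value of D seen at the falling edge. All values are 0 or 1.
    """
    def d_at(t):
        # Value on the D input at time t (ideal step at data_arrival).
        return d_new if t >= data_arrival else d_old

    ff_q = d_at(clk_rise)        # datapath flip-flop sample
    shadow_q = d_at(clk_fall)    # shadow latch sample
    error = ff_q ^ shadow_q      # XOR comparison produces ERROR
    return ff_q, shadow_q, error


print(razor_ff_cycle(0, 1, 0.9, 1.0, 1.5))  # data arrives before the rising edge
print(razor_ff_cycle(0, 1, 1.2, 1.0, 1.5))  # data arrives during the high phase
```

In the late case the flip-flop holds the stale value while the shadow latch holds the correct one, which is why the error must be handled by a recovery mechanism. Data arriving after the falling edge is not flagged in this model, so the maximum path delay has to be bounded.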
In comparison to an MSFF, the RFF error-detection capability is attained at a cost in clock energy and area since an
additional latch is needed. Although the increase in area at the
flip-flop level appears large, the overall area penalty at the microprocessor level is expected to be relatively small ( 1%). For
scan flip-flops, the additional latch can be shared with scan circuitry to further reduce the area overhead. In contrast, the clock
energy overhead at the flip-flop level significantly impacts the
total dynamic energy of a microprocessor since sequential clock
energy represents a large portion of the overall total dynamic
energy. Moreover, a well-tuned microprocessor has a substantial number of critical paths, requiring a large fraction of the
total sequentials to be protected with the timing-error detection
capability.
A critical issue for RFF is the susceptibility to datapath
metastability. If the input data to the flip-flop arrives slightly
after the setup time prior to a rising clock edge, the output
of the datapath flip-flop can become metastable. In this scenario, the flip-flop requires an indefinite amount of time to
resolve the output to a valid logic value, corresponding to a
clock-to-output (CLK-to-Q) delay push-out. During metastability, the CLK-to-Q delay push-out exponentially depends on
the relationship between the setup time and the input data arrival
time [12]–[14]. Since the flip-flop output feeds an error path,
which is described further in Section III, and multiple fan-out
datapaths, the CLK-to-Q delay push-out from a metastable
output can affect the error path differently from one of the
fan-out datapaths such that an undetected error occurs. Since
the mean time between failures (MTBF) must satisfy aggressive
microprocessor targets, a metastability detector is required for
RFF, resulting in substantial design overhead in both clock
Fig. 1. (a) Conventional path design with (b) conceptual timing diagrams for worst-case dynamic variations and nominal conditions. (c) Resilient path design
[5]–[9] with (d) conceptual timing diagram for late arriving input data.
energy and area [9]. In Fig. 2(b) and (c), two EDS circuits
are introduced to retain the RFF error-detection feature while
lowering clock energy and eliminating datapath metastability
[10].
For the three EDS circuits in Fig. 2, the high clock phase defines the error-detection window ($T_W$) as illustrated in Fig. 1(d).
The maximum path delay (max-delay) constraint within the
presence of worst-case dynamic conditions for max-delay is
defined as
$t_{max} \le T_{cycle} + T_W - t_{setup}$ (1)
where $t_{max}$ is the maximum path delay, including clock skew
and jitter delay, $T_{cycle}$ is the cycle time (the reciprocal of the clock frequency), and
$t_{setup}$ is the setup time based on the falling clock
edge. The minimum path delay (min-delay) constraint during
worst-case dynamic conditions for min-delay is calculated as
$t_{min} \ge T_W + t_{hold}$ (2)
where $t_{min}$ is the minimum path delay, accounting for clock skew and
jitter delay, and $t_{hold}$ is the hold time based on the falling
clock edge. The max-delay and min-delay constraints in (1) and
(2) only apply to paths with an EDS circuit as the receiving
sequential. For a target $T_W$, min-delay requirements are satisfied
in pre-silicon design by buffer insertion. As $T_W$ increases,
the number of buffers increases, leading to larger power. From
(1) and (2), the fundamental trade-off in timing-error detection
circuits is max-delay versus min-delay. As $T_W$ increases,
$T_{cycle}$ may decrease to enable a higher clock frequency while satisfying
the max-delay constraint in (1) at a cost of increased min-delay
penalty in (2). For microprocessors with deep pipelines (i.e.,
small number of logic stages between sequentials), this trade-off
may not be advantageous due to the stringent min-delay requirements. In recent technology generations, however, the microarchitecture for microprocessors has moved towards shallow
pipelines (i.e., large number of logic stages between sequentials)
to improve energy efficiency [15], [16]. Microprocessors with
shallow pipelines greatly relax the min-delay requirements as
compared to a deep pipeline design, enabling a more effective
trade-off of max-delay improvement for min-delay penalty. As
additional protection from min-delay violations, the high clock
phase (i.e., $T_W$) may be tuned with a duty-cycle control circuit.
The test-chip contains a scan-tunable duty-cycle control circuit
to adjust the error-detection window. The duty-cycle control circuit, which is discussed
further in Section III-C, maintains a constant high clock phase
delay at low and high clock frequencies.
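The max-delay versus min-delay trade-off can be made concrete with a short sketch. It assumes one plausible reading of the definitions above: late data must settle a setup time before the falling clock edge of the capturing cycle, and early data must not change until a hold time after that falling edge. The function names and the picosecond values are hypothetical.

```python
def check_eds_path(t_max, t_min, t_cycle, t_window, t_setup, t_hold):
    """Check an EDS-terminated path against assumed timing constraints.

    Assumed forms (an interpretation, not quoted from the paper):
      max-delay: t_max <= t_cycle + t_window - t_setup
      min-delay: t_min >= t_window + t_hold
    where t_window is the high clock phase (error-detection window).
    """
    max_ok = t_max <= t_cycle + t_window - t_setup
    min_ok = t_min >= t_window + t_hold
    return max_ok, min_ok


def min_delay_padding(t_min, t_window, t_hold):
    """Extra delay (e.g., from buffer insertion) needed on a short path."""
    return max(0, t_window + t_hold - t_min)


# Hypothetical path, all times in picoseconds.
print(check_eds_path(1100, 250, 1000, 200, 20, 10))  # 200 ps window
print(check_eds_path(1100, 250, 1000, 300, 20, 10))  # wider 300 ps window
print(min_delay_padding(250, 300, 10))
```

Widening the window relaxes the max-delay budget but pushes the same short path into a min-delay violation that has to be padded, which is the trade-off the text describes.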
Fig. 3. Simulated timing diagrams for (a) TDTB and (b) DSTB to demonstrate
error generation from late arriving input data.
Fig. 2. Error-detection sequential circuits: (a) Razor flip-flop (RFF) [5]–[9],
(b) transition detector with time borrowing (TDTB), and (c) double sampling
with time borrowing (DSTB). CLK is duty-cycle controlled to satisfy min-delay
requirements.
B. Transition Detector With Time Borrowing (TDTB)
In Fig. 2(b), the first proposed EDS circuit is a transition detector with a time-borrowing latch (TDTB). The TDTB EDS
circuit operation is demonstrated through a simulated timing diagram in Fig. 3(a). The transition detector monitors input data
(D) transitions during the high clock phase. As input data transitions, a pulse is always generated at the XOR output. During the
low clock phase, the output of the dynamic gate pre-charges and
the pulse does not affect the error signal (ERROR) as described
in Fig. 3(a). If input data arrives late, CLK is logically-high and
the pulse discharges the output node voltage of the dynamic
gate, thus transitioning ERROR to a logic-high as illustrated
in Fig. 3(a). As CLK transitions to a logic-low, the dynamic
gate output pre-charges, and consequently, ERROR transitions
to a logic-low. As discussed further in Section III-B, ERROR
is propagated to a set-dominant latch (SDL), where the SDL
output remains logically-high while the dynamic transition detector pre-charges during the low clock phase. The SDL is transparent during the high clock phase and only allows high transitions during the low clock phase. Since min-delay paths are
designed with sufficient margin as described in (2), the master
latch of a datapath flip-flop is unnecessary. The datapath latch
is identical to a pulse-latch, resulting in lower clock energy and
eliminating datapath metastability during a rising clock edge.
Datapath metastability does not occur on the falling clock edge
since the max-delay constraint in (1) is satisfied.
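A behavioral sketch of the TDTB operation described above follows. It is an idealized model that treats the transition detector as flagging any change on D during the high clock phase and treats the datapath latch as transparent during that phase. Pulse widths, pre-charge timing, and metastability are not modeled, and all names are hypothetical.

```python
def tdtb_cycle(d_initial, transitions, clk_rise, clk_fall):
    """Return (latch_q, error) for one clock cycle of an idealized TDTB.

    transitions: iterable of (time, new_value) events on the D input.
    The latch output follows D until the falling edge and then holds.
    ERROR is raised if D changes while the clock is high.
    """
    d = d_initial
    error = False
    for time, value in sorted(transitions):
        if time >= clk_fall:
            break  # latch is opaque; this change belongs to a later cycle
        if value != d:
            d = value
            if time >= clk_rise:
                error = True  # transition inside the error-detection window
    return d, error


print(tdtb_cycle(0, [(0.8, 1)], 1.0, 1.5))  # on-time data
print(tdtb_cycle(0, [(1.2, 1)], 1.0, 1.5))  # late data, still passes to Q
```

In the late case the correct value still reaches the latch output while the error is flagged, which is the time-borrowing behavior that distinguishes TDTB from a flip-flop based design.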
Although TDTB employs a datapath latch, path timing constraints are still based on a flip-flop design with an error-detection window as illustrated in Fig. 1 and modeled in (1). The
purpose of the transparency window in the datapath latch is to
eliminate datapath metastability while detecting timing errors.
When input data arrives late, an error signal is generated even
though the input data traverses to the latch output. The error
signal ensures that late arriving data from the path in the current
pipeline stage does not affect the max-delay constraint in (1) for
adjoining fan-out paths in subsequent pipeline stages. If ample
max-delay margin is available for the adjoining paths in the subsequent pipeline stage, then a pulse-latch may replace the TDTB
EDS circuit at the current pipeline stage. This would enable traditional time borrowing between the path in the current pipeline
stage and the adjoining paths in the subsequent pipeline stage.
Although datapath metastability is removed in TDTB, the
transition-detector output can become metastable. For metastability to occur on the transition-detector output, the input data
must arrive within a tight metastability window ( 1 ps in a
65 nm technology [11]), starting slightly after the setup time
prior to a rising clock edge. For EDS circuits
with a datapath latch, the setup time is defined as the minimum
data-to-clock (D-to-CLK) delay prior to a rising clock edge such
that an error signal is not generated (i.e., an error signal indicates
that the input data transition did not satisfy the setup time).
In TDTB, the setup time is determined by the transition detector,
which is designed for a setup time of 0 ps at nominal process
conditions. If input data arrives slightly after the setup time, then
the pulse generated at the XOR output transitions to a logic-low
while the transition-detector clock transitions to a logic-high,
potentially resulting in a metastable output. The transition-detector output feeds the error path.
The error path for TDTB, which is described in more detail
in Section III-B and illustrated in Fig. 5, consists of an OR tree
of error signals from each TDTB EDS circuit in the pipeline
stage. The OR-tree output feeds an SDL, and the SDL output is
the final error signal (FINAL ERROR). The SDL maintains the
logic-high value for FINAL ERROR when the dynamic transition detector pre-charges during the low clock phase. FINAL
ERROR is an input to an MSFF, and the MSFF output is the
pipeline-error signal (PIPELINE ERROR).
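The TDTB error path described above can be sketched as a small phase-by-phase model. This is a simplification under stated assumptions: signals are ideal booleans, the OR tree has no delay, and the class and method names are hypothetical.

```python
class TdtbErrorPath:
    """Idealized OR tree -> set-dominant latch (SDL) -> MSFF error path."""

    def __init__(self):
        self.final_error = False     # SDL output (FINAL ERROR)
        self.pipeline_error = False  # MSFF output (PIPELINE ERROR)

    def rising_edge(self):
        # The MSFF samples FINAL ERROR produced in the previous cycle.
        self.pipeline_error = self.final_error

    def high_phase(self, error_signals):
        # The SDL is transparent: FINAL ERROR follows the OR of all ERRORs.
        self.final_error = any(error_signals)

    def low_phase(self):
        # The dynamic detectors pre-charge, so every ERROR returns low.
        # The SDL only allows high transitions here and keeps a high value.
        or_tree_output = False
        self.final_error = self.final_error or or_tree_output


path = TdtbErrorPath()
path.rising_edge()
path.high_phase([False, True, False])  # one TDTB flags late data
path.low_phase()
path.rising_edge()
print(path.pipeline_error)
```

The model shows why the SDL is needed: without it, the pre-charge in the low phase would clear the error before the MSFF could sample it on the next rising edge.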
The condition in which the input data for TDTB arrives
slightly after the setup time defines the boundary of a timing
failure. Since the datapath latch transparency allows input data
to continue to the next pipeline stage, PIPELINE ERROR
can be either a logic-high, resulting in error recovery, or a
logic-low, resulting in no error recovery, and still maintain correct functionality. With this unique characteristic, the error path
behaves similar to a traditional synchronizer circuit [12]–[14]
with the exception of having combinational logic between the
sequentials. For a metastable error signal in TDTB to result
in an undetected error, the following sequence of events must
occur: (i) ERROR becomes metastable from input data arriving
slightly after the TDTB setup time; (ii) ERROR starts resolving
to a logic-high while the dynamic transition detector
begins pre-charging as clock transitions to a logic-low, resulting
in a degraded low-to-high-to-low pulse; (iii) The degraded pulse
propagates through the OR tree to induce metastability at the
SDL output (FINAL ERROR); (iv) FINAL ERROR resolves
slightly after the MSFF setup time such that PIPELINE
ERROR becomes metastable; (v) PIPELINE ERROR resolves
after a specific time based on microarchitectural control signals
to induce a failure. In Section III-D, an MTBF model for this
sequence of events is presented for a microprocessor.
C. Double Sampling With Time Borrowing (DSTB)
In Fig. 2(c), the second proposed EDS circuit is double sampling with a time-borrowing latch (DSTB), which is similar
to TDTB except a shadow flip-flop replaces the transition detector. In Fig. 3(b), a simulated timing diagram demonstrates
the DSTB EDS circuit operation. DSTB double samples input
data similar to RFF and compares datapath latch and shadow
flip-flop outputs to generate an error signal while retaining the
time-borrowing feature of TDTB to eliminate datapath metastability. The DSTB clock energy overhead is slightly lower than
for RFF since a datapath sequential is typically sized larger than
a minimum-sized shadow sequential. The DSTB clock energy
savings improve when compared to an RFF with a metastability
detector at the output [9].
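For comparison with the RFF, DSTB can be sketched with the roles of the two storage elements swapped: the latch sits in the datapath and the flip-flop is the shadow element. The model below is idealized, with a single data transition per cycle and no setup, hold, or metastability effects. The names are hypothetical.

```python
def dstb_cycle(d_old, d_new, data_arrival, clk_rise, clk_fall):
    """Return (datapath_latch_q, error) for one cycle of an idealized DSTB.

    The shadow flip-flop samples D on the rising edge; the datapath latch
    is transparent while the clock is high and holds D at the falling edge.
    """
    def d_at(t):
        # Value on the D input at time t (ideal step at data_arrival).
        return d_new if t >= data_arrival else d_old

    shadow_ff_q = d_at(clk_rise)
    latch_q = d_at(clk_fall)
    error = latch_q != shadow_ff_q  # comparison of latch and shadow outputs
    return latch_q, error


print(dstb_cycle(0, 1, 0.9, 1.0, 1.5))  # on-time data
print(dstb_cycle(0, 1, 1.2, 1.0, 1.5))  # late data: flagged, yet Q is correct
```

Unlike the RFF, the datapath output in the late case already carries the new value, so the stale sample exists only in the shadow flip-flop that feeds the error path.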
As described earlier for TDTB, the DSTB error signal can
become metastable. In contrast to TDTB, the DSTB error path
does not contain an SDL. The error path for DSTB, which is
described in Section III-B, is an OR tree of error signals from
each DSTB EDS circuit in the pipeline stage. The OR-tree
output (FINAL ERROR) feeds an MSFF, and the MSFF output
is PIPELINE ERROR. For a metastable error signal in DSTB
to result in an undetected error, the following sequence of
events must occur: (i) ERROR becomes metastable from input
data arriving slightly after the DSTB setup time; (ii) ERROR
resolves within a precise time window to propagate through
the OR tree and transition FINAL ERROR barely after the
MSFF setup time such that PIPELINE ERROR becomes
metastable; (iii) PIPELINE ERROR resolves at a particular
time based on microarchitectural control signals to induce a
failure. In Section III-D, the MTBF from this sequence of
events is modeled for a microprocessor.
D. Impact of Eliminating Datapath Metastability on Mean
Time Between Failures (MTBF)
Although additional microarchitectural information, which is
presented in Section III-B, is required to calculate the MTBF
from metastability for a microprocessor, an analysis of one EDS
circuit that feeds an error path and multiple fan-out datapaths
highlights the benefits from eliminating datapath metastability
in TDTB and DSTB as compared to RFF. Since an actual failure
cannot be modeled in TDTB and DSTB without considering additional microarchitectural details, a mean time between potential failures (MTBPF) is first described for the single EDS circuit analysis of the error path only. Then, the MTBF from datapath metastability in RFF is modeled.
Since the error path for DSTB and RFF is the same, the
MTBPF from error-signal metastability is initially presented
for DSTB and RFF. Then, the MTBPF for TDTB is described.
For DSTB and RFF, the mean time between inducing metastability on PIPELINE ERROR from one EDS circuit in a pipeline
stage is calculated as [12]–[14]:
(3)
and
are the EDS circuit metastability resolution
is the
time and resolution time constant, respectively.
metastability window for the EDS circuit.
depends on
and the error-path delay;
and
are deteris the frequency of input data
mined from the EDS design.
transitions to the EDS circuit that are capable of generating a
is a function of
metastable condition. In a resilient design,
, the delay and path-activation probability for paths that
transition input data, and the frequency of dynamic variations
that induce late arriving input data. The resolution time required
to propagate a signal through the error path such that FINAL
ERROR transitions precisely after the MSFF setup time, which
is the beginning of the MSFF metastability window, is calculated as
(4)
is the error-path delay, including the minimum
CLK-to-ERROR delay for late arriving input data. The additional CLK-to-ERROR delay from metastability is represented
.
is the setup time for the MSFF.
by
Since the TDTB error path contains an SDL, the mean time
between inducing metastability on PIPELINE ERROR from one
TDTB in a pipeline stage is calculated as [13]
(5)
and
are the TDTB and SDL metastability resolution times, respectively;
and
are the TDTB
and SDL metastability resolution time constants, respectively.
is the metastability window for TDTB. As discussed
is the metastable CLK-to-ERROR
in Section II-B,
push-out delay for ERROR to start resolving to a logic-high
while the dynamic transition detector begins pre-charging as
clock transitions to a logic-low. This event results in a degraded
low-to-high-to-low pulse that propagates through the OR tree to
induce metastability at the SDL output (FINAL ERROR). The
SDL resolution time required to transition FINAL ERROR precisely after the MSFF setup time is calculated as
(6)
Since
in (6) contains the SDL delay,
for TDTB is longer than
for DSTB and RFF.
For DSTB, RFF, and SDL, the resolution time constants
and metastability window
are extracted from flip-flop D-to-Q delay simulations across a range
of D-to-CLK delays [14]. For TDTB, transition detector
CLK-to-ERROR delays are simulated for a range of D-to-CLK
delays to extract
and
. For TDTB, DSTB,
RFF, and SDL, the metastability resolution time constants
are approximately 5.75 ps in a 65 nm technology [11] with
, corresponding to less than half (0.43) of a
fan-out of four (FO4) inverter delay. Since the resolution time
constants for TDTB and SDL are equal and from (4) and (6)
, (5) simplifies to (3). Thus,
the mean time between inducing metastability on PIPELINE
ERROR from one EDS circuit in a pipeline stage is modeled
and
in (3) and (4) for TDTB, DSTB, and RFF, where
are separately calculated for TDTB, DSTB, and
RFF. Based on simulations in a 65 nm technology [11] with
,
is 1 ps for TDTB and 5 ps for DSTB
is 100 ps (7.50 FO4 delay) for TDTB
and RFF;
and 85 ps (6.38 FO4 delay) for DSTB and RFF;
is 5 ps for MSFF.
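The exponential role of the resolution time can be illustrated with the classic synchronizer relation from the metastability literature, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The sketch below evaluates that textbook form only and is not a reconstruction of (3) or (5). The 5.75 ps time constant and the 0.43 FO4 ratio are taken from the text; the metastability window, clock frequency, data transition rate, and resolution times are hypothetical.

```python
import math

TAU = 5.75e-12  # resolution time constant quoted in the text, in seconds


def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """Textbook synchronizer MTBF in seconds (generic form, assumed here)."""
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


def fo4_delay_from_tau(tau, tau_in_fo4=0.43):
    """FO4 inverter delay implied by tau being 0.43 of an FO4 delay."""
    return tau / tau_in_fo4


# Hypothetical operating point: 1 ps window, 3 GHz clock, 1e8 risky
# data transitions per second, and two candidate resolution times.
short = synchronizer_mtbf(100e-12, TAU, 1e-12, 3e9, 1e8)
longer = synchronizer_mtbf(150e-12, TAU, 1e-12, 3e9, 1e8)
print(f"MTBF ratio for 50 ps more resolution time: {longer / short:.3g}")
print(f"implied FO4 delay: {fo4_delay_from_tau(TAU) * 1e12:.1f} ps")
```

Because the resolution time appears in the exponent, an extra 50 ps at this time constant lengthens the mean time between events by a factor of several thousand. The same exponential dependence underlies the comparisons in this section.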
In contrast to TDTB and DSTB, RFF is susceptible to datapath metastability, where an undetected error can be modeled by
analyzing one RFF that feeds an error path and multiple fan-out
datapaths without considering additional microarchitectural details. The MTBF for one RFF in a pipeline stage is calculated as
[12]–[14]
(7)
Similar to previous descriptions,
,
, and
represent the RFF resolution time, resolution time constant,
and metastability window, respectively. Although the datapath
flip-flop output may resolve to the same logic value as the
shadow latch output after a metastable event in RFF, a number
exists such that an
of flip-flop output resolution times
undetected error occurs in at least one of the multiple fan-out
must satisfy two timing
datapaths. For this condition,
timing constraint requires that
constraints. The first
the datapath flip-flop output resolves in time to ensure that
FINAL ERROR is sampled as a logic-low by the MSFF, which
is defined as
(8)
If
satisfies the timing constraint in (8), an error is not detiming
tected in the current pipeline stage. The second
constraint is based on two dependencies for each fan-out datapath in the subsequent pipeline stage: (i) Path delay and (ii)
Whether the path contains an RFF or an MSFF as the receiving
sequential. If a path in the subsequent pipeline stage contains an
RFF as the receiving sequential, then an undetected error occurs
plus the path delay exceeds the max-delay
in this path if
constraint in (1) as
(9)
represents the delay for any of the multiple fan-out
datapaths in the subsequent pipeline stage with an RFF as
the receiving sequential. If an MSFF is the receiving seconstraint in (9) would change by
quential, the
and replacing
with the MSFF
removing
. For simplifying the discussion, this case is
not considered. The CLK-to-Q delay for input data arriving
at the setup time of the datapath flip-flop is included in
, and
represents the CLK-to-Q delay push-out
from (8) into (9),
from metastability. Substituting
a finite probability of an undetected error exists for paths
in the subsequent pipeline stage with delays ranging from
to
. From (2) and assuming
, the path delay range is approximated as
to
. As a practical example with
, corresponding to
,
and
, an undetected error
can occur in a datapath with a delay ranging from
to
, where a large number of datapaths is expected
to fall within this range.
is primarily deReviewing (7)–(9),
termined by
and
, where
is determined
Fig. 4. Advantages and disadvantages of the three error-detection sequential circuits in Fig. 2. T and L denote typical and large sequentials. Clock energy and
delay are simulated in a 65 nm technology [11] with VCC = 1.2 V.
from the datapath flip-flop design. As
reduces,
exponentially decreases. As
increases, a smaller
is required to satisfy (9). Thus,
the MTBF for one RFF is dominated by the longest path
delays from the multiple fan-out datapaths in the subsequent
pipeline stage. Comparing (7)–(9) for datapath metastability
in RFF to (3) and (4) for error metastability in TDTB, DSTB,
is exponentially lower than
and RFF,
, where MTBF for RFF is primarily
determined by
. As an example with
,
, and
only considering one of the multiple fan-out datapaths with
, then
for
TDTB and DSTB is ~
larger than
for RFF. Moreover, additional microarchitectural factors are
required to calculate the MTBF for TDTB and DSTB, which
is exponentially larger than the MTBPF as described in
Section III-D. Although other scenarios exist in
which RFF metastability can result in an undetected error,
the case presented in (7)–(9) represents one of the dominant
failure conditions to determine MTBF. In summary, a salient
feature of TDTB and DSTB is moving the highly complex RFF
metastability issue from both the datapath and error path to
only the error path. Since the error path is vastly simpler than
the datapath for metastability, the potential risk of metastability
affecting logic functionality exponentially reduces.
E. Advantages and Disadvantages of RFF, TDTB, and DSTB
In addition to datapath metastability, the key trade-offs of
the three EDS circuits in Fig. 2 are listed in Fig. 4, including
simulated clock energy and delay comparisons to an MSFF
and a pulse-latch. The sequential clock energy overhead of
RFF and DSTB as compared to a typical (T) MSFF is 36%
and 34%, respectively; TDTB actually reduces clock energy by
9%. As the datapath sequential size increases, the clock energy
overhead for RFF and DSTB reduces since the additional clock
energy from the error-detection circuitry is amortized by the
larger datapath sequential; TDTB clock energy savings increase. As described earlier in (2), resilient designs must satisfy
min-delay constraints. As a comparison, a conventional design
could replace MSFFs with pulse-latches to lower clocking energy while satisfying an increased min-delay penalty. Relative
to a typical pulse-latch, the clock energy overhead for RFF,
TDTB, and DSTB is 145%, 64%, and 143%, respectively. For
a larger pulse-latch, the clock energy overhead reduces. When
a metastability detector is included at the output of the RFF [9],
the clock energy overhead for RFF increases.
The RFF minimum D-to-Q delay is larger than the MSFF
minimum D-to-Q delay due to the additional capacitance at the
input and output nodes. TDTB and DSTB minimum D-to-Q delays are faster than the MSFF minimum D-to-Q delay since a
latch is in the datapath, but slower than the pulse-latch minimum
D-to-Q delay due to the extra node capacitances. Although the
RFF D-to-Q delay is 30% larger than the TDTB and DSTB
D-to-Q delays, the impact of this delay difference is only 2%
when compared to the target cycle time.
The design complexity of RFF and DSTB is lower than for
TDTB since the dynamic transition detector is highly sensitive
to within-die (WID) process variations. In designing the pulse
width for the transition detector, the key trade-off is
versus a sufficient pulse width to detect late arriving data within
the presence of WID variations. As the pulse width increases,
the tolerance of the dynamic transition detector to WID variations improves at a cost of a larger
. RFF and DSTB also provide SER protection for the datapath sequential during
the entire clock cycle. In comparison, TDTB offers SER protection for the datapath sequential during the high clock phase
and only a portion of the low clock phase. When introducing a new circuit technique into a production microprocessor, the ability to turn the functionality off is highly desirable in case unforeseen complexities arise. With RFF, the error detection capability could be turned off and operated at a low FCLK with a 50% duty cycle. With TDTB and DSTB, a duty-cycle control would always be required for both high and low FCLK.
Recently, a Razor II EDS circuit has been proposed [17],
which contains features similar to TDTB. Razor II and TDTB
EDS circuits both utilize a datapath latch and dynamically detect late data transitions. The key difference between these two
EDS circuits is that TDTB detects late transitions at the latch
input where data transitions are only monitored during the high
clock phase; Razor II detects late transitions at the latch pass
gate output, denoted as N [17], where data transitions are continuously monitored except for a window of time, defined as
[17], after the rising clock edge. In Razor II, the transition detector is suspended for this window after the rising clock edge to guarantee that early input data transitions do not induce a timing error. The transition detector is suspended by generating a negative clock pulse to the dynamic transition detector. This window, the negative-clock-pulse width, must be larger than the
CLK-to-N delay with a zero setup time plus the data-pulse width
to the transition detector. Relative to TDTB, additional clock
transistors (e.g., inverters and NAND gate) are required in Razor
II to generate the negative clock pulse, resulting in larger clock
energy for Razor II. Furthermore, WID process variations amplify the complex timing in Razor II, requiring a sufficient delay
guardband on the negative-clock-pulse width. This delay guardband results in extra clock inverters, and consequently, larger
clock energy overhead. The clock energy overhead comparison
between Razor II and DSTB depends on the additional clock
energy required to generate the negative clock pulse for Razor
II versus the additional clock energy in the shadow MSFF for
DSTB. The D-to-Q delays for TDTB, DSTB, and Razor II appear relatively similar. The advantages of the Razor II EDS circuit include: (i) Significantly improving metastability management as compared to RFF, which is similar to TDTB and DSTB,
(ii) Fast D-to-Q delay, and (iii) SER protection for the datapath
sequential during the entire clock cycle. The disadvantages include: (i) Complex timing requirements for the negative clock
pulse and the data pulse to the dynamic transition detector to ensure proper functionality, (ii) Additional clock energy required
to generate the negative clock pulse, and (iii) Error detection
window shrinks by the negative-clock-pulse width, resulting in
less opportunity to detect late arriving input data for a target
min-delay constraint as defined by the high clock phase.
In comparing TDTB and DSTB across a variety of known
EDS circuit options, including Razor II [17], TDTB appears to
be the lowest clock energy EDS circuit; DSTB appears to be the
lowest clock energy static-EDS circuit with SER protection [10].
III. ERROR-RECOVERY CIRCUITS AND ARCHITECTURE
A. Test-Chip Architecture
The three EDS circuits in Fig. 2 and a conventional MSFF are
implemented in four separate 3-stage pipeline circuit blocks to
imitate a microprocessor as described in Fig. 5. The unlabeled
sequentials in Fig. 5 represent MSFFs. The 32-entry input
buffer consists of memory, an address counter, and control
logic. Input buffer data and control signals are scanned into
memory. Loop instructions are available to operate the test-chip
for long durations. The input buffer drives data and control signals to each 3-stage pipeline circuit block. Each pipeline stage
in the circuit block is 64-bits wide and contains a variety of path
types, including inverter chains, NAND chains, NOR chains,
XOR chains, NAND-NOR combination chains, inverter-MUX
combination chains, long-distance repeater-based interconnect
chains, cross-coupling paths, and multiple-input-switching
paths. Transistor widths and number of logic stages vary from
path to path. Transistor widths range from minimum-size to
40X minimum-size and the number of logic stages per path
ranges from 8 to 28. For some paths, the number of logic
stages is controlled via select bits to analyze different path
delay histograms. A stabilization pipeline stage [8], [9] follows the 3-stage pipeline circuit block to accommodate the
1-cycle latency for propagating error signals in the 3rd stage.
This stabilization pipeline stage ensures that instructions are
error-free before committing the state to the output buffer in the
next pipeline stage. The output buffer consists of a 64-bit input
linear-feedback-shift register (LFSR) that compresses output
data into a unique signature to validate functionality.
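As a rough illustration of how a multiple-input LFSR can compress a stream of 64-bit output words into one signature, the following Python sketch may help. The feedback taps (`TAPS`), the zero seed, and the function names are hypothetical choices made for this example only; the text does not specify the polynomial or structure used on the test-chip.

```python
MASK64 = (1 << 64) - 1
TAPS = 0x1B  # hypothetical feedback taps; the test-chip polynomial is not given


def misr_step(state: int, data: int) -> int:
    """Advance a 64-bit multiple-input signature register by one word."""
    msb = (state >> 63) & 1
    state = (state << 1) & MASK64
    if msb:
        state ^= TAPS  # apply feedback when the bit shifted out is 1
    return state ^ (data & MASK64)  # fold the new 64-bit word into the state


def signature(words, seed: int = 0) -> int:
    """Compress a sequence of 64-bit words into a single 64-bit signature."""
    state = seed
    for word in words:
        state = misr_step(state, word)
    return state


if __name__ == "__main__":
    good = signature([1, 2, 3])
    bad = signature([1, 2, 7])  # same stream with one corrupted word
    print(hex(good), hex(bad), good != bad)
```

In this sketch, comparing the final signature against a known-good value is enough to flag a corrupted output stream without reading out every word.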
B. Instruction-Replay Error-Recovery Circuits
In the 3-stage pipeline circuit block, error signals from each
EDS circuit per pipeline stage are combined via an OR tree to
generate a single error signal (FINAL ERROR) [8], [9]. For the
two static-EDS circuits (RFF and DSTB), the OR-tree output
directly feeds an MSFF. For the dynamic-EDS circuit (TDTB),
the OR-tree output feeds an SDL, where the schematic is provided in Fig. 6. The SDL output is FINAL ERROR, which is
an input to an MSFF. The SDL in Fig. 6 is transparent during
the high clock phase and only allows high transitions during the
low clock phase. If FINAL ERROR transitions to a logic-high,
the SDL maintains the logic-high value for FINAL ERROR
when the dynamic-EDS circuit pre-charges during the low clock
phase. For all three EDS circuits, the output of the MSFF represents the pipeline-error signal (PIPELINE ERROR).
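The SDL behavior described above can be captured in a small behavioral model. This is only a sketch of the stated function (transparent while the clock is high, accepting only rising transitions while the clock is low); the class and method names are invented for illustration, and the model says nothing about the transistor-level schematic in Fig. 6.

```python
class SetDominantLatch:
    """Behavioral model of the set-dominant latch (SDL) described in the text."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, clk: int, d: int) -> int:
        if clk == 1:
            # High clock phase: the latch is transparent.
            self.q = d
        elif d == 1:
            # Low clock phase: only a transition to logic-high is allowed.
            self.q = 1
        # Low clock phase with d == 0: hold the previous value.
        return self.q


if __name__ == "__main__":
    sdl = SetDominantLatch()
    # An error is flagged in the high phase, then the dynamic OR-tree
    # pre-charges (returns low) in the low phase; the SDL keeps the error.
    print(sdl.update(clk=1, d=1), sdl.update(clk=0, d=0))
```

The second call models the pre-charge of the dynamic-EDS circuit during the low clock phase, where FINAL ERROR must stay high for the MSFF to capture it.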
In contrast to the previous single-cycle error-recovery design
[8], [9], a multi-cycle error-recovery circuit is implemented to
reduce design overhead [10], [17]. As illustrated in Fig. 5, the
three pipeline-error signals are propagated to the input buffer in
one cycle to replay the failed instruction and pipelined to the
output buffer to invalidate erroneous data. Input buffer control
logic determines the appropriate instruction to replay based on
the three pipeline-error signals. In a microprocessor, the instruction replay circuits could leverage the existing replay design to
recover from a branch misprediction [18]. If a pipeline-error signal transitions to a logic-high, the input buffer signals the clock divider to halve FCLK while maintaining a constant high clock phase delay for min-delay protection. Reducing FCLK in half ensures correct operation during replay even if dynamic variations persist. After the replayed instruction finishes, the input buffer sends a reset signal to validate output data and signals the clock divider to resume at target FCLK. Since FCLK is halved for all but one of the recovery cycles, the number of actual (effective) cycles for recovery is 6 (11), 7 (13), and 8 (15) corresponding to timing errors in the 1st, 2nd, and 3rd pipeline stages. Since the number of recovery cycles linearly increases with the number of pipeline stages, the average error-recovery penalty for a microprocessor is expected to linearly increase as compared to the test-chip implementation. If dynamic variations persist for long durations, an adaptive off-die clock generator adjusts the nominal operating FCLK.
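The relation between actual and effective recovery cycles quoted above can be checked with a few lines of Python. In this sketch the expression `5 + stage` is simply fitted to the three reported values and is an assumption of the example, not a formula from the text; the effective count follows from the statement that all but one recovery cycle run at half frequency, so each of those cycles costs two target-frequency cycles.

```python
def recovery_cycles(stage: int) -> tuple[int, int]:
    """Return (actual, effective) recovery cycles for an error in a pipeline stage.

    actual:    cycles spent in recovery (5 + stage is fitted to the reported data).
    effective: the same interval measured in target-frequency cycles, assuming
               all but one recovery cycle run at half the target frequency.
    """
    actual = 5 + stage
    effective = 1 + 2 * (actual - 1)
    return actual, effective


if __name__ == "__main__":
    for stage in (1, 2, 3):
        actual, effective = recovery_cycles(stage)
        print(f"stage {stage}: {actual} actual cycles, {effective} effective cycles")
```

For stages 1 to 3 this reproduces the reported pairs 6 (11), 7 (13), and 8 (15).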
Fig. 5. Instruction-replay error-recovery design for one error-detection sequential (EDS) circuit. Set-dominant latch (SDL) is only used for TDTB.
Fig. 6. Set-dominant latch (SDL) circuit schematic.
C. Clock Divider and Duty-Cycle Control Circuits
The clock divider and duty-cycle control circuits are presented in Fig. 7(a) along with a conceptual timing diagram in Fig. 7(b). An off-die signal generator with a differential pulse-splitter creates differential inputs CLKIN and CLKIN#. The HALF FREQUENCY input is controlled by the input buffer as described in Fig. 5. The CLK output is distributed throughout the test-chip. CLKIN and CLKIN# are inputs to a differential amplifier that generates an intermediate clock signal. This intermediate clock signal and the output of the negative edge-triggered MSFF in Fig. 7(a) are inputs to a logic-AND gate to produce the clock divider output (CLK0). When the HALF FREQUENCY input is a logic-low, the output of the negative edge-triggered MSFF remains a logic-high, thus CLK0 and CLKIN have the same frequency. When the HALF FREQUENCY input is asserted, the output of the negative edge-triggered MSFF toggles every other cycle, enabling the clock divider circuit to skip every other high phase of CLKIN as illustrated in Fig. 7(b). The duty-cycle control is performed with a logical-AND of CLK0 and a delayed CLK0# (i.e., inversion of CLK0) with CLK as the output. The delayed CLK0# determines the CLK high phase delay, as controlled via scan bits. With this duty-cycle control circuit, the CLK high phase delay remains constant at both high and low FCLK values, which is essential for min-delay protection.
D. Mean Time Between Failures (MTBF) From Error-Signal Metastability
As described previously in Section II, the error signals for
TDTB and DSTB can become metastable when input data
arrives slightly after the setup time prior to a rising clock edge.
Recalling the unique characteristic of these signals, correct
functionality is guaranteed as long as the pipeline-error signal
(PIPELINE ERROR) is either a logic-high or a logic-low.
Based on this behavior, the error path as described in Fig. 5 is
similar to a traditional synchronizer circuit with the exception
of having combinational logic in the middle of the sequentials. As modeled in (3) and (4), the extra logic between the
sequentials exponentially reduces
[12]–[14]. As a counter benefit, the OR-tree logic increases
the MTBPF from metastability since PIPELINE ERROR
can only become metastable when an individual EDS error
signal is metastable while all other EDS error signals in the
same pipeline stage remain logically-low. In other words, a metastable EDS error signal cannot affect the OR-tree output if any other EDS error signal in the pipeline stage is logically-high. The mean time between inducing metastability on PIPELINE ERROR from a number of EDS circuits in the same pipeline stage therefore does not decrease linearly with the number of EDS circuits, since the probability of simultaneous errors from multiple EDS circuits increases with that number.
An undetected error occurs from error-signal metastability
if PIPELINE ERROR becomes metastable and resolves at a
time such that different logic values are captured at the INPUT
BUFFER and the OUTPUT BUFFER. Based on this condition, the metastability resolution time for the MSFF driving
PIPELINE ERROR is computed as
(10)
Fig. 7. (a) Clock divider and duty-cycle control circuit schematics and (b) conceptual timing diagrams.
is the longest pipeline-error path delay
propagating to either the INPUT BUFFER or the OUTPUT
BUFFER. From Fig. 5, the pipeline-error path delay from the
driving MSFF to the INPUT BUFFER represents the longest
pipeline-error path delay for the 1st, 2nd, and 3rd pipeline
.
stages, thus determining
is the setup time for an MSFF contained in the INPUT
BUFFER. In a 65 nm technology [11] with
,
for the 3rd pipeline-stage is 125 ps
for an
). If
(9.37 FO4 delay or
is larger than (10), then a metastable PIPELINE
ERROR can result in an actual failure. If
is less
than (10), then correct functionality is maintained. For a microprocessor implementation, PIPELINE ERROR could be
sent to the INPUT BUFFER over multiple cycles by inserting
,
additional registers, resulting in a lower
and thus exponentially increasing MTBF with a linear penalty
in error-recovery cycles. An approximation of the MTBF from
error-signal metastability in a microprocessor is calculated as
(11)
represents the number of pipeline stages in
a microprocessor.
is the average number of EDS
is the
circuits in a pipeline stage. The parameter
resolution time constant for the MSFF driving PIPELINE
ERROR. From simulations in a 65 nm technology [11] with
,
is 5.75 ps (0.43 FO4 delay). Note that
a more detailed MTBF model requires an analysis of
and
for each pipeline stage of a microprocessor. Observing (3) and (11), the primary parameters
that determine the MTBF from error-signal metastability are
,
,
, and
. Assuming extreme
,
,
, and
worst-case estimates for
from (3),
for modeling
larger
the MTBF from error-signal metastability is over
than microprocessor MTBF targets for soft-error rate (SER) in
a 65 nm technology.
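To give a feel for the exponential dependence on resolution time discussed in this section, the sketch below uses the textbook synchronizer relation in which MTBF scales as exp(t/tau), where t is the available resolution time and tau is the resolution time constant. This generic relation is an assumption of the example and is not the paper's equation (11); only the 5.75 ps time constant and the FO4 conversion (125 ps corresponds to 9.37 FO4) are taken from the text.

```python
import math

TAU_PS = 5.75            # resolution time constant quoted in the text
FO4_PS = 125.0 / 9.37    # FO4 delay implied by "125 ps (9.37 FO4 delay)"


def mtbf_gain(extra_resolution_ps: float, tau_ps: float = TAU_PS) -> float:
    """Factor by which MTBF grows when extra resolution time is available.

    Assumes the generic synchronizer model MTBF ~ exp(t / tau).
    """
    return math.exp(extra_resolution_ps / tau_ps)


if __name__ == "__main__":
    # One additional FO4 delay of resolution time gives roughly a 10x gain.
    print(f"+1 FO4: x{mtbf_gain(FO4_PS):.1f}")
    # One additional time constant gives a gain of e.
    print(f"+1 tau: x{mtbf_gain(TAU_PS):.3f}")
```

Under this model, each added register stage on the PIPELINE ERROR path buys resolution time linearly while the MTBF improves exponentially, which matches the trade-off stated above.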
E. Test-Chip Characteristics
The resilient circuit test-chip micrograph and characteristics
are provided in Fig. 8. The clock circuitry, input buffer (IB),
and each 3-stage pipeline circuit block with an output buffer
are highlighted. On-chip noise injectors are implemented to induce VCC droop events, where magnitude and duration are controlled via scan bits. The test-chip is manufactured in a 65 nm logic technology [11] with a target VCC of 1.2 V. The test-chip die area is
m and the number of transistors is 493K. Silicon measurements are collected using a high-frequency membrane probe card on a bare die in a 300 mm probe station.
F. Resilient Circuit Demonstration
A timing-error detection and recovery measurement is
demonstrated in Fig. 9 based on an oscilloscope capture of
test-chip output signals. CLK is the clock signal distributed
throughout the test-chip; DATA is 1 bit of data sent from the
input buffer to the 3-stage pipeline circuit block; PIPELINE
ERROR# is the logic-NOR output of the three pipeline-error
signals; and VALID OUTPUT is the output buffer valid signal.
In the demonstration with no timing errors, DATA transitions
from a logic-high to a logic-low for 4 cycles and then transitions
to a logic-high. In the demonstration with timing errors, the
high-to-low transition from DATA induces a timing error at
the 3rd pipeline stage; (a) Error is detected by the EDS circuit
and propagated to the pipeline-error signal to lower PIPELINE
ERROR# for one cycle; (b) VALID OUTPUT transitions low to
invalidate erroneous data; (c) Pipeline-error signal triggers the
input buffer to raise the HALF FREQUENCY signal in Figs. 5 and 7 to halve FCLK; (d) Input buffer replays the instruction, resulting in a high-to-low DATA transition; (e) Once the instruction finishes, the input buffer resets the VALID OUTPUT signal to allow output data to reach the LFSR; (f) Finally, input buffer lowers the HALF FREQUENCY signal to resume at target FCLK. In Fig. 9, note that the CLK high phase delay remains constant at both high and low FCLK for min-delay protection. In Fig. 10, an oscilloscope capture of VCC, PIPELINE ERROR#, and VALID OUTPUT signals demonstrates the measured error detection and recovery capability for sporadic VCC droops.
Fig. 8. Resilient circuit test-chip micrograph and characteristics. Measurements collected with a high-frequency membrane probe card on a bare die in a 300 mm probe station.
Fig. 9. Measured timing-error detection and recovery demonstration. Data from input buffer arrives late at 3rd pipeline stage; (a) Detect error; (b) Invalidate output data; (c) Halve FCLK; (d) Replay instruction; (e) Validate output data; and (f) Resume target FCLK.
Fig. 10. Measured error detection and recovery demonstration for sporadic VCC droops.
IV. TEST-CHIP MEASUREMENTS
Imitating a microprocessor, instruction kernels are executed
on the test-chip to compare a resilient design with EDS circuits
to a conventional design with MSFFs. The benefits from resilient circuits depend on the path-activation probabilities as determined by the instruction kernels. Although previous research
has investigated node-activity probabilities for a microprocessor
[19], node-activity probabilities do not directly translate into
path-activation probabilities. Since path-activation probabilities
have not been rigorously explored for a microprocessor, there
is uncertainty about the dependency of path-activation probability on path delay. Since slow paths typically contain more
logic depth than fast paths, the probability of activating slow
paths is intuitively expected to be less than the activation probability for fast paths in general. In addressing this issue, two sets
of instruction kernels are selected to induce the path-activation
examples presented in Fig. 11, which attempt to represent practical approximations of favorable (example #1) and unfavorable
(example #2) scenarios. In path-activation example #1, critical
paths are activated less frequently than non-critical paths, where
the activation probability exponentially reduces from fast to slow
paths. Path-activation example #2 applies an equal activation
probability for all paths.
Fig. 11. Path-activation probability versus path delay for two path-activation examples.
In Fig. 12, throughput (TP) and error rate are measured for the TDTB EDS circuit versus FCLK for the two path-activation examples in Fig. 11. For a given FCLK, error rate is governed by the path histogram, as dictated by design optimization, as well as path-activation probabilities and environmental variations, as determined by workloads. A range of VCC droop magnitudes and durations are inserted based on data from a recent microprocessor along with assumptions on VCC droop-inducing events. The worst-case VCC droop magnitude is 10% and the worst-case temperature is 110 °C. Nominal VCC is 1.2 V and the nominal operating temperature is assumed to be 60 °C. In Fig. 12(a), throughput increases linearly as FCLK increases with no errors. Once errors occur, instructions per cycle (IPC) reduce as a function of error rate and recovery time. Since VCC droop events are assumed infrequent, throughput gains continue as FCLK increases into the VCC and temperature guardband region. When FCLK reaches 3020 MHz, the first path failure
occurs under nominal conditions, resulting in a sharp error rate increase. Since the path-activation probability for example #1 is low for slow paths, further throughput gains are achieved at higher FCLK. The maximum throughput of 3.17 billion instructions per second (BIPS) corresponds to a 3200 MHz FCLK. Increasing FCLK further leads to a larger error rate, where IPC reduction outweighs FCLK gains. Due to the high path-activation probability for slow paths in example #2, the resilient design cannot exploit the path-activation probabilities in Fig. 12(b), limiting the maximum throughput to 3.01 BIPS. In Fig. 12(a) and (b), the maximum throughput to guarantee correct functionality within the presence of worst-case dynamic VCC and temperature variations for the conventional design with MSFFs is 2.4 BIPS, corresponding to an FCLK of 2400 MHz. From Fig. 12, a resilient design enables 25% throughput gain over a conventional design by eliminating the FCLK guardband from dynamic VCC and temperature variations and an additional 7% throughput increase from exploiting the path-activation probabilities for example #1.
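The throughput figures in this paragraph can be cross-checked with simple arithmetic. The sketch below assumes throughput equals FCLK times IPC and that the conventional design runs error-free at one instruction per cycle (consistent with 2.4 BIPS at 2400 MHz); the function names are illustrative only.

```python
def throughput_bips(fclk_mhz: float, ipc: float) -> float:
    """Throughput in billions of instructions per second."""
    return fclk_mhz * ipc / 1000.0


def gain_percent(resilient_bips: float, conventional_bips: float) -> float:
    """Throughput gain of the resilient design over the conventional design."""
    return (resilient_bips / conventional_bips - 1.0) * 100.0


CONVENTIONAL_BIPS = throughput_bips(2400.0, 1.0)  # 2.4 BIPS with MSFFs

# IPC implied by the reported optimum for example #1 (3.17 BIPS at 3200 MHz).
implied_ipc = 3.17 / 3.2

gain_example1 = gain_percent(3.17, CONVENTIONAL_BIPS)
gain_example2 = gain_percent(3.01, CONVENTIONAL_BIPS)

if __name__ == "__main__":
    print(f"implied IPC at 3200 MHz: {implied_ipc:.3f}")
    print(f"example #1 gain: {gain_example1:.1f}%")
    print(f"example #2 gain: {gain_example2:.1f}%")
```

The two gains round to 32% and 25%, so the additional 7% for example #1 is the difference attributable to exploiting path-activation probabilities, and the implied IPC of about 0.99 shows that the error-recovery penalty at the optimum is small.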
In Fig. 13, the maximum throughputs for resilient and conventional designs are measured versus VCC. Since a worst-case VCC droop of 10% is applied in these measurements, the maximum VCC droop reduces in absolute value as the nominal VCC reduces. For the resilient design, the two path-activation examples in Fig. 11 are evaluated. Depending on the path-activation example in Fig. 13, the throughput for the resilient design with a reduced VCC is greater than or equal to the throughput of the conventional design with the target VCC. Thus, measured data indicates that resilient circuits enable either 25%–32% throughput gain at equal VCC or at least a 17% VCC reduction at equal throughput [10] for a target VCC of 1.2 V in a 65 nm technology. For equal throughput, the measured total power reduction from resilient circuits is 37% and 31% for path-activation example #1 and example #2, respectively. The total power reduction for a resilient microprocessor will differ from the test-chip depending on the fraction of total sequentials protected with EDS circuits and the amount of buffer insertion required to correct min-delay paths.
Fig. 12. Measured throughput (TP) and error rate for a resilient design with TDTB EDS circuits versus clock frequency (FCLK) for path-activation (a) example #1 and (b) example #2.
Fig. 13. Measured maximum throughput (TP) for a resilient design with TDTB
EDS circuits and a conventional design with MSFFs versus supply voltage.
From the data in Fig. 13, the measured throughput gain of a resilient design relative to a conventional design is plotted versus VCC in Fig. 14. Measurements demonstrate a larger throughput gain as VCC reduces, resulting from the increased path-delay sensitivity to VCC. From Fig. 14, the throughput
gain from a resilient design increases from 25%–32% for
to 34%–40% for
.
In Fig. 15, the measured throughput gain in Fig. 13 is mapped to a measured power reduction by lowering the VCC of the resilient design to achieve equal throughput for the resilient and conventional designs. The change in throughput across the x-axis directly corresponds to a change in nominal VCC (i.e., higher throughput corresponds to higher VCC). As throughput reduces from 2.5 BIPS to 1.25 BIPS, measured data reveals that the power benefits decrease from 33%–38% to 18%–21%. Although the throughput gain from resilient circuits increases with lower VCC as demonstrated in Fig. 14, the power-to-throughput sensitivity decreases as VCC reduces. As an example for throughput at 2.5 BIPS, raising VCC to achieve a 1% throughput gain incurs a 3% total power increase. When throughput equals 1.25 BIPS, however, raising VCC to achieve a 1% throughput gain only results in a 1.8% total power increase. Thus, any circuit technique that maps linear performance gains versus VCC to power reduction via reducing VCC will have decreased power savings as VCC reduces.
Fig. 14. Measured throughput gain of a resilient design with TDTB EDS circuits relative to a conventional design with MSFFs versus supply voltage.
Fig. 15. Measured power reduction of a resilient design with TDTB EDS circuits relative to a conventional design with MSFFs versus throughput.
Recommendations to further enhance the performance and energy benefits of resilient circuits are offered. First, since dynamic VCC and temperature variations may persist for long durations, resilient circuits in conjunction with on-die variation sensors and adaptive-FCLK schemes [2]–[4] could improve the efficiency of adjusting the nominal operating FCLK. Second, additional sources of dynamic variation that could be mitigated include: (i) Transistor aging, (ii) Cross-coupling capacitance, and (iii) Multiple-input switching (MIS). For transistor aging, a conventional design typically applies an upfront FCLK guardband based on lifetime aging projections. In comparison, resilient circuits with on-die variation sensors and adaptive-FCLK techniques could detect and recover from aging-related delay faults and adjust nominal FCLK based on monitoring the error rate or on-die aging sensors to allow a smooth performance reduction over the microprocessor lifetime [20]. For cross-coupling and MIS events, conventional pre-silicon design methodologies often apply pessimistic assumptions for these dynamic variations, although these events may rarely or never occur. These pessimistic assumptions lead to an over-design for the majority of paths, and consequently, high power. By allowing resilient circuits to detect and correct late transitions from infrequent cross-coupling and MIS events, less stringent cross-coupling and MIS conditions could be applied in timing analysis, resulting in power savings. Third, timing optimization with resilient circuits presents an exciting opportunity. In conventional timing optimization, all paths in the path-delay histogram must satisfy a target cycle time. For resilient circuits, timing optimization would be based on combining the path-delay histogram with path-activation probabilities. As an example, infrequently activated paths could have a relaxed timing constraint as compared to frequently activated paths to reduce power. In quantifying the cost associated with resilient circuits, a rigorous analysis is warranted for the additional power overhead from min-delay buffer insertion versus the error-detection window for a recent microprocessor. In addition, realistic path-activation probabilities should be evaluated across a variety of workloads.
V. CONCLUSION
Timing-error detection and recovery circuits are implemented in a 65 nm resilient circuit test-chip to eliminate the clock frequency (FCLK) guardband from dynamic supply voltage (VCC) and temperature variations as well as to exploit path-activation probabilities for maximizing throughput. Two error-detection sequential (EDS) circuits are introduced to lower the clock energy and to remove the datapath metastability in existing EDS designs. One EDS circuit is a dynamic transition detector with a time-borrowing datapath latch (TDTB). The other EDS circuit is a double-sampling static design with a time-borrowing datapath latch (DSTB). A salient feature of TDTB and DSTB relative to previous EDS designs is moving the highly complex metastability issue from both the datapath and error path to only the error path, thus drastically simplifying metastability management. Based on a survey of various known EDS circuit options, TDTB is the lowest clock energy EDS circuit; DSTB is the lowest clock energy static-EDS circuit with SER protection. Error-recovery circuits are proposed to replay failing instructions at half FCLK to ensure correct functionality. Silicon measurements indicate that resilient circuits enable either 25%–32% throughput gain at equal VCC or at least 17% VCC reduction at equal throughput, corresponding to 31%–37% total power reduction. Future opportunities to further enhance performance and energy efficiency include: (i) Combining resilient circuits with on-die variation sensors and adaptive-FCLK schemes, (ii) Mitigating delay faults induced from transistor aging, cross-coupling capacitance, and multiple-input switching, and (iii) Optimizing resilient circuit designs by coupling the path-delay histogram with path-activation probabilities.
ACKNOWLEDGMENT
The authors express sincere appreciation to Pavan Karidi for
mask design; David Jenkins, David Finan, and Saurabh Dighe
for simulation support; Paolo Aseron and Trang Nguyen for lab
measurement support; Charles Dike for metastability calculation guidance; Sriram Vangal for helpful advice; and Matthew
Haycock for encouragement.
REFERENCES
[1] A. Muhtaroglu, G. Taylor, and T. R. Arabi, “On-die droop detector for
analog sensing of power supply noise,” IEEE J. Solid-State Circuits,
pp. 651–660, Apr. 2004.
[2] T. Fischer, J. Desai, B. Doyle, S. Naffziger, and B. Patella, “A 90-nm variable frequency clock system for a power-managed Itanium architecture processor,” IEEE J. Solid-State Circuits, pp. 218–228, Jan. 2006.
[3] R. McGowen et al., “Power and temperature control on a 90-nm Itanium family processor,” IEEE J. Solid-State Circuits, pp. 229–237, Jan. 2006.
[4] J. Tschanz et al., “Adaptive frequency and biasing techniques for tolerance to dynamic temperature-voltage variations and aging,” in IEEE
ISSCC Dig. Tech. Papers, Feb. 2007, pp. 292–293.
[5] P. Franco and E. J. McCluskey, “Delay testing of digital circuits by
output waveform analysis,” in Proc. IEEE Int. Test Conf., Oct. 1991,
pp. 798–807.
[6] P. Franco and E. J. McCluskey, “On-line testing of digital circuits,” in
Proc. IEEE VLSI Test Symp., Apr. 1994, pp. 167–173.
[7] M. Nicolaidis, “Time redundancy based soft-error tolerance to rescue
nanometer technologies,” in Proc. IEEE VLSI Test Symp., Apr. 1999,
pp. 86–94.
[8] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level
timing speculation,” in Proc. IEEE/ACM Int. Symp. Microarchitecture
(MICRO-36), Dec. 2003, pp. 7–18.
[9] S. Das et al., “A self-tuning DVS processor using delay-error
detection and correction,” IEEE J. Solid-State Circuits, pp. 792–804,
Apr. 2006.
[10] K. A. Bowman et al., “Energy-efficient and metastability-immune
timing-error detection and instruction-replay-based recovery circuits
for dynamic-variation tolerance,” in IEEE ISSCC Dig. Tech. Papers,
Feb. 2008, pp. 402–403.
[11] P. Bai et al., “A 65 nm logic technology featuring 35 nm gate
lengths, enhanced channel strain, 8 Cu interconnect layers, low-k
ILD and 0.57 µm² SRAM cell,” in IEEE IEDM Tech. Dig., Dec.
2004, pp. 657–660.
[12] H. J. M. Veendrick, “The behavior of flip-flops used as synchronizers
and prediction of their failure rate,” IEEE J. Solid-State Circuits, pp.
169–176, Apr. 1980.
[13] C. L. Portmann and T. H. Y. Meng, “Metastability in CMOS library
elements in reduced supply and technology scaled applications,” IEEE
J. Solid-State Circuits, pp. 39–46, Jan. 1995.
[14] C. Dike and E. Burton, “Miller and noise effects in a synchronizing
flip-flop,” IEEE J. Solid-State Circuits, pp. 849–855, Jun. 1999.
[15] V. Srinivasan et al., “Optimizing pipelines for power and performance,” in Proc. Int. Symp. Microarchitecture (MICRO-35), Nov.
2002, pp. 333–344.
[16] A. Hartstein and T. R. Puzak, “The optimum pipeline depth considering both power and performance,” ACM Trans. Arch. and Code Opt.
(TACO), pp. 369–388, Dec. 2004.
[17] D. Blaauw et al., “Razor II: In situ error detection and correction for
PVT and SER Tolerance,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2008,
pp. 400–401.
[18] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative
Approach, 2nd ed. San Francisco, CA: Morgan Kaufmann, 1996.
[19] H. L. Yeager, M. J. Patyra, R. Reyes, and K. A. Bowman, “Microprocessor power optimization through multi-performance device insertion,” in IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2004, pp.
334–337.
[20] M. Agarwal, B. C. Paul, M. Zhang, and S. Mitra, “Circuit failure prediction and its application to transistor aging,” in Proc. IEEE VLSI Test
Symp., May 2007, pp. 277–286.
Keith A. Bowman (S’97–M’02) received the B.S.
degree in electrical engineering from North Carolina
State University, Raleigh, NC, in 1994 and the M.S.
and Ph.D. degrees in electrical engineering from
the Georgia Institute of Technology, Atlanta, GA, in
1995 and 2001, respectively.
He is currently a Staff Research Scientist in the
Circuit Research Lab (CRL) at Intel Corporation in
Hillsboro, OR. From 2001 to 2004, he worked as a
Senior Computer-Aided Design (CAD) Engineer in
the Technology-CAD Division at Intel in Hillsboro,
OR, where he developed and supported statistical-based models, methodologies, and software tools to predict microprocessor performance and power variability. Since joining CRL in 2004, his research has focused on the development
of circuit design solutions to mitigate the impact of parameter variations on microprocessor performance and power.
James W. Tschanz (M’99) received the B.S. degree
in computer engineering and the M.S. degree in electrical engineering from the University of Illinois at
Urbana-Champaign, in 1997 and 1999, respectively.
Since 1999, he has been a circuits researcher with
the Intel Circuit Research Lab in Hillsboro, OR. His
research interests include low-power digital circuits,
design techniques, and methods for tolerating parameter variations. He also taught VLSI design for seven
years as an adjunct faculty member at the Oregon
Graduate Institute in Beaverton, OR.
Nam Sung Kim received the B.S. and M.S. degrees
in electrical engineering from Korea Advanced Institute of Science and Technology, Daejeon, Korea, in
1997 and 2000, respectively, and the Ph.D. degree in
computer science and engineering from the University of Michigan-Ann Arbor in 2004.
He is currently an Assistant Professor at the
University of Wisconsin-Madison. He was with Intel
Microprocessor Technology Lab as a senior research
scientist from 2004 to 2008. He has authored more
than 30 technical papers in refereed international
conferences and journals. His research interests span low power and high
performance circuits, circuit-microarchitecture co-designs, CAD algorithms,
and biomimetic microsystems.
Dr. Kim was a recipient of the award at the 2001 IEEE Design Automation
Conference (DAC) Student Design Contest and the best paper award at the 2003
IEEE International Conference on Microarchitecture (MICRO) in San Diego,
CA, for his work on low-power and robust microarchitecture. He was also a
recipient of an Intel Fellowship in 2002.
Janice C. Lee received the B.Sc. degree in engineering physics from Queen’s University in
Canada, and the S.M. and Ph.D. degrees in electrical
engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA.
In 2006, she joined Intel Corporation as a research
scientist in the Circuit Research Lab at Hillsboro, OR.
Her Ph.D. research was in the area of quantum
computing using superconducting circuits, and she was
a National Science Foundation graduate fellow. At
Intel, her interests include next-generation-transistor
research using carbon nanotubes, and resilient-circuit research for dynamic variation tolerance.
Chris B. Wilkerson received the Master's degree from Carnegie
Mellon University in 1996.
He has published papers on a range
of microarchitectural topics, including value prediction, branch prediction, cache organization, runahead,
and advanced speculative execution. Recently, he has
focused on low-power design, including microarchitectural mechanisms to enable low-voltage operation
for microprocessors.
Shih-Lien L. Lu (M’90) received the B.S. degree in
electrical engineering and computer science (EECS)
from the University of California at Berkeley, and the
M.S. and Ph.D. degrees, both in computer science
and engineering, from the University of California at
Los Angeles (UCLA).
From 1984 to 1991, he worked on the MOSIS project at USC/ISI,
which provides VLSI fabrication services to the research and
education community. He
joined the faculty of the Department of Electrical and
Computer Engineering at Oregon State University
(OSU), Corvallis, OR, in 1991. While at OSU, Dr. Lu received the College
of Engineering Carter Award for outstanding and inspirational teaching in
1995 and the CoE/ECE Engelbrecht Young Faculty Award in 1996. He joined
Intel in 1999 and has been with Microprocessor Technology Labs since then.
Currently, he is a Principal Engineer and leads the Oregon Microarchitecture
Lab. He is a member of the IEEE and the IEEE Computer Society.
Vivek K. De (SM’07) received the Bachelor’s degree
in electrical engineering from the Indian Institute of
Technology, Madras, India, in 1985 and the Master’s
degree in electrical engineering from Duke University, Durham, NC, in 1986. He received the Ph.D. degree in electrical engineering from Rensselaer Polytechnic Institute, Troy, NY, in 1992.
He is an Intel Fellow and Director of Circuit Technology Research in the Corporate Technology Group.
He joined Intel in 1996 as a staff engineer in the Circuits Research Lab (CRL) in Hillsboro, OR. Since
that time he has led research teams in CRL focused on developing advanced
circuits and design techniques for low-power and high-performance processors.
In his current role, he provides strategic direction for future circuit technologies
and is responsible for aligning CRL’s circuit research with technology scaling
challenges. Prior to joining Intel, he was engaged in semiconductor devices and
circuits research at Rensselaer Polytechnic Institute and Georgia Institute of
Technology, and was a visiting researcher at Texas Instruments. He has published 167 technical papers in refereed conferences and journals, and six book
chapters on low power circuits. He holds 154 patents, with 40 more patents filed
(pending).
Dr. De received an Intel Achievement Award for his contributions to a novel
integrated voltage regulator technology.
Tanay Karnik (M’88–SM’04) received the Ph.D.
degree in computer engineering from the University
of Illinois at Urbana-Champaign in 1995.
From 1995 to 1999, he worked in the Strategic
CAD Lab at Intel. Since March 1999, he has led
the power delivery, soft error rate, and low power
circuits research in the Circuits Research, Intel Labs,
where he is Principal Engineer and manager of low
power circuits research. His research interests are in
the areas of variation tolerance, power delivery, soft
errors and physical design. He has published over 40
technical papers and has 41 issued and 36 pending patents in these areas.
Dr. Karnik received an Intel Achievement Award for the pioneering work
on integrated power delivery. He has presented several invited talks and tutorials, and has served on five Ph.D. students’ committees. He was a member
of DAC, ICCAD, ICICDT and ISQED program committees and JSSC, TCAD,
TVLSI, and TCAS review committees. He was General Chair of ISQED’08 and
ICICDT’08.