PDF version - ARM Infocenter

advertisement
17-1
A Self-Tuning DVS Processor Using Delay-Error Detection and Correction
Shidhartha Das, Sanjay Pant, David Roberts, Seokwoo Lee, David Blaauw, Todd Austin, Trevor Mudge, Krisztian Flautner*
Department of EECS, University of Michigan, Ann Arbor, MI
*ARM Inc., Cambridge, UK
Abstract
In this paper, we present the implementation and silicon
measurements results of a 64bit processor fabricated in 0.18µm
technology . The processor employs a delay-error detection and
correction scheme called Razor to eliminate voltage safety margins
and scale voltage 120mV below the first failure point. It achieves
44% energy savings over the worst case operating conditions for a
0.1% targeted error rate at a fixed frequency of 120MHz.
1. Introduction
Recently, we proposed a new voltage management concept for
Dynamic Voltage Scaled (DVS) processors, called Razor [1], along
with initial simulation based results. In this paper we present the
first silicon implementation of a Razor design. We discuss the
circuit structures used in this new implementation and present
silicon measurements for 33 tested dies. The chip implements a
subset of the Alpha instruction set and was fabricated with
MOSIS[7] in tsmc 0.18µm technology.
Traditional DVS techniques [2-6] use a delay chain or a lookup
table to determine the minimum voltage necessary for error-free
operation at a particular frequency. Hence they require voltage
margins to ensure correct operation over process variation,
temperature fluctuations and voltage drop. In contrast, the Razor
processor uses a delay-error tolerant flip-flop on critical paths to
detect when voltage is scaled to the point of first failure for a given
frequency. Voltage control is based on the observed error rate and
power savings are achieved by 1) eliminating the above margins
under nominal operating and silicon conditions and 2) scaling
voltage 120mV below the first failure point to achieve a 0.1%
targeted error rate. The total measured energy savings over the
worst case was 44% at 120MHz under nominal conditions.
Figure 1a shows the delay-error tolerant Razor flip-flop in concept.
The standard positive edge triggered DFF is augmented with a
shadow latch which samples at the negative clock edge. Timing
errors are detected by comparing the main flip-flop data with that
of the shadow latch. An additional detector flags the occurrence of
metastability at the main flip-flop output. Error signals of
individual RFFs are OR-ed together to generate the pipeline restore
signal which overwrites the shadow latch data into the errant flipflop. A distributed pipeline recovery mechanism [1] is implemented
to recover correct pipeline state (figure 1b). The minimum allowed
supply voltage is set to ensure that the shadow latch data is
guaranteed correct and can be used for error recovery. The duration
of the positive clock phase, when the shadow latch is transparent,
determines the sampling delay of the shadow latch. The hold time
constraint imposed by the shadow latch was met by inserting delay
buffers.
The rest of the paper is organized as follows. Section 2 describes
the transistor level design of the Razor flip-flop and the
metastability detector. Section 3 describes Razor processor
implementation details and measurement results are presented in
Section 4. The Razor energy savings are quantified in Section 5.
We give the details on Razor voltage control in Section 6 and draw
our conclusions in Section 7.
2. Razor Flip-Flop Circuit Level Implementation
Figure 2 shows the transistor level schematic of the RFF. The error
comparator evaluates in the negative phase when the data latched
by the slave differs from the shadow. The metastability detector
which shares the dynamic node Err_dyn with the comparator
evaluates in the positive phase of the clock when the slave output
could become metastable. Thus, the RFF error signal is flagged
when either evaluate. This, in turn, evaluates the dynamic gate to
generate the restore signal by OR-ing error signals of individual
RFFs. The restore overwrites the master with the shadow lat ch data
such that the slave gets the correct data at the next positive edge. In
this positive phase, it also disables the shadow to protect state. The
rbar_latched signal precharges the Err_dyn node for the next errant
cycle. Compared to a regular DFF of the same drive strength and
delay, the RFF consumes 22% extra (60fJ/49fJ) energy when
sampled data is static and 65% extra (205fJ/124fJ) energy when
sampled data switches. However, in the processor only 207 flipflops out of 2388 flip-flops, or 9%, could become critical and
needed to be RFFs. Hence, the net Razor power overhead,
including the delay buffer power for short paths, was computed to
be 3% of nominal chip power.
Q
Logic Stage
L1
IF
PC
Local MetaDetector
Fail
Shadow
Latch
RAZOR FF
clk
comparator
Error
Figure 1a. Abstract View of the Razor Flip-Flop
258
4-900784-01-X
err bubble
EX
err bubble
MEM
err bubble
err
WB
reg/mem)
(
bubble
restore
Global Meta Detector
Restore
ID
Stabilizer FF
Main
Flip -Flop
Razor FF
D
Razor FF
0
1
Razor FF
Logic Stage
L1
Razor FF
clk
Flush
Control
flushID
restore
restore
flushID
flushID
restore
flushID
Figure 1b. Distributed Pipeline Recovery Mechanism
2005 Symposium on VLSI Circuits Digest of Technical Papers
Authorized licensed use limited to: Arm Limited. Downloaded on October 15, 2009 at 08:44 from IEEE Xplore. Restrictions apply.
Clk
Clk_n
Master
Clk_p
Restore
Restore_p
Restore_n
Slave
Clk_p
Clk_n
Restore_p
Q
QS
D_in
Q_bar
G1
P-skewed
Clk_n
Clk_n
Clk_p
FFP1 FFP2
Clk_n
Clk_p
Rbar_latched
Error 0
N-skewed
Err_dyn
Restore_n
Shadow
Clk_n
Clk_p
Error
Comparator
Figure 2a. Razor Flip-Flop Circuit Schematic
Detection Band
Driver G1
1.6
Meta Comparator
1.2
0.8
Ambiguous Band
0.4
FFN1 FFN2
N-skewed FFs
Qbar
Clk_p
Latch2
Clk_n
restore
Clk_p
Metastability
Detector
Figure 2b. Restore Generation Circuitry
The metastability detector consists of p- and n-skewed inverters
which switch to opposite power rails under a meta-stable input
voltage. The detector evaluates when input node QS can be
ambiguously interpreted by its fan-out, inverter G1 and the error
comparator. The DC transfer curve (figure 3) of inverter G1, the
error comparator and the metastability detector show that the
“detection” band is contained well within the ambiguously
interpreted voltage band. Table 1 gives the error detection and
ambiguous interpretation bands for different corners.
The probability that metastability propagates through the error
detection logic and causes metastability of the restore signal itself
was computed to be below 2e-30. Such an event is flagged by the
fail signal generated using double skewed flip-flops. In the rare
event of a fail, the pipeline is flushed and the supply voltage is
immediately increased. During 4 months of chip testing, this event
was never detected.
3. Processor Implementation Details
The die photograph of the Razor processor and its details are shown
in figure 4. To verify correct operation, the dcache/register file
contents were scanned and compared with a PC emulating the same
code. A 64b register records the number of errant cycles and was
sampled to compute the error rate. An internal clock unit generates
an asymmetric clock with a range between 60 MHz to 400 MHz in
steps of 20MHz. The shadow latch sampling delay, defined by the
positive clock phase, is configurable from 0ps to 3.5ns in steps of
500ps. The clock unit has a separate voltage domain that is not
voltage scaled. Energy savings from Razor DVS were measured at
140 and 120MHz for 33 chips from 2 fabrication runs.
3.3mm
Error Comparator
0.0
0.4
0.8
1.2
1.6
Voltage of Node QS
2.0
Corner
Proc
VDD
TEMP
Ambiguous
Band
Slow
1.2V
85C
0.57-0.60
Detection
Band
0.53-0.64
Typ.
1.2V
40C
0.52-0.58
0.48-0.61
Fast
1.2V
27C
0.48-0.56
0.40-0.61
Slow
1.8V
85C
0.77-0.87
0.67-0.93
Typ.
1.8V
40C
0.71-0.83
0.65-0.90
Fast
1.8V
27C
0.64-0.81
0.58-0.89
Table 1. Metastability Detector Corner Analysis
RF
Icache
Figure 3. DC Transfer Characteristics
WB
IF
ID
EX
3.6mm
0.0
MEM
V_OUT
Error 63
Clk_n
Clk_p
Q
Clk_n
Error
Clk_n
Fail
Latch1
Restore_p
Restore_p
P-skewed FFs
Clk_n
Dcache
Figure 4. Die Photograph of the Chip
2005 Symposium on VLSI Circuits Digest of Technical Papers
Authorized licensed use limited to: Arm Limited. Downloaded on October 15, 2009 at 08:44 from IEEE Xplore. Restrictions apply.
259
Technology Node
0.18µm
Max. Clock Frequency
140MHz
DVS Supply Voltage Range
1.2-1.8V
Total Number of Transistors
1.58million
Die Size
3.3mm*3.6mm
Measured Chip Power at 1.8V
130mW
Icache Size
8KB
Dcache Size
8KB
2801
207
Number of Delay Buffers Added
2388
Point of First Failure
120MHz
27C
Error Free Operation (Simulation Results)
Standard FF Energy (Static/Switching)
49fJ/124fJ
RFF Energy (Static/Switching)
60fJ/205fJ
% Total Chip Power Overhead
2.9%
Error Correction and Recovery Overhead
Table 2. Processor Implementation Details
4. Measurement Results
Figure 5 shows the error rates and energy gains versus supply
voltage at 120 and 140 MHz for two chips. Energy at a particular
voltage is normalized with respect to the energy at point of first
failure. For all plotted points, correct program execution with Razor
error correction was verified.
From the figure, we note that the error rate at the point of first
failure is very low because only a few of the critical paths fail to
meet the setup requirements. As voltage is scaled further into the
sub-critical regime the error rate increases exponentially. The IPC
penalty due to the error recovery cycles is negligible for error rates
below 0.1%. Under such low error rates, the recovery overhead
energy is also negligible and the total processor energy shows a
quadratic reduction with the supply voltage. At error rates
exceeding 0.1%, the recovery energy rapidly starts to dominate,
Chip 1
10
0.1
0.90
1E-6
0.85
1E-7
0.80
Voltage (in Volts )
870pJ
89.7mW
740pJ
119.4mW
990pJ
99.6mW
830pJ
1.64
Figure 6 shows the distribution of the first failure voltage for the 33
measured chips. The first failure voltage for chips 1 and 2 in figure
5 are 1.63V and 1.72V, respectively and hence represent typical
and worst case process conditions. The scatter plot shows the
correlation between the first failure voltage and the 0.1% error rate
voltage. The relative “flatness” of the linear fit indicates less
sensitivity to process variation when running at a 0.1% error rate
than at point of first failure. The distribution of energy savings
from running at 0.1% error rate at 120MHz and 27C is shown for
all chips and ranges from 5% to 23%. The measured error rates at
different operating temperatures shows 80mV shift in the point of
first failure from 1.46V to 1.54V for a temperature increase from
45 to 95C as shown in figure 7.
5. Razor Energy Savings
The bar graph in figure 8 shows the energy for the two chips in
figure 4 when operating at 120MHz and 45C. The first set of bars
0.1
140MHz 0.95
1E-5
1.60
104.5mW
Chip2
1.10
1.00
1.56
Chip1
1
1E-3
1.52
Instruction
(Power/IPC/
Freq)
1.15
1.05
1E-8
1.48
Power
10
0.01
1E-4
Energy per
Instruction
(Power/IPC/
Freq)
1.20
Normalized Energy
Percentage Error Rate
1
120MHz
Energy per
Power
Table 3. Error Rate and Energy/Instruction at Point of First
Failure and Point of 0.1% Error Rate
260fJ
Percentage Error Rate
Energy of a RFF per error event
Point of 0.1% Error
Rate
Chip 2
1.20
1.15
1.10
1.05
0.01
1.00
1E-3
0.95
120MHz
1E-4
0.90
1E-5
140MHz
1E-6
0.85
0.80
1E-7
1E-8
Normalized Energy
Total Number of Flip -Flops
Total Number of Razor Flip-Flops
offsetting the quadratic savings due to voltage scaling. For the
measured chips, the energy optimal error rate fell at approximately
0.1%. Table 3 shows the measured power at the point of first
failure and the energy per instruction for both the chips at the point
of first failure and at the point of 0.1% error rate. At 120MHz, chip
1 consumes 104.5mW at the first failure point and 89.7mW at an
optimal 0.1% error rate, leading to 15% energy savings with
negligible IPC hit. The energy gains for chip 2 are 18%. These
gains are in addition to the energy saved by eliminating voltage
margins.
0.75
1.52
1.56
1.60
1.64
1.68
Voltage (in Volts)
1.72
0.70
1.76
Figure 5. Error Rate and Normalized Energy Measurements for Chips 1 and 2
260
4-900784-01-X
2005 Symposium on VLSI Circuits Digest of Technical Papers
Authorized licensed use limited to: Arm Limited. Downloaded on October 15, 2009 at 08:44 from IEEE Xplore. Restrictions apply.
Lot0
Lot1
14
120MHz
T=27C
12
1.8
Chips
Linear Fit
y=0.78685x + 0.22117
1.7
10
8
1.6
6
1.5
4
2
0
1.48 1.56 1.64 1.72 1.80
Voltage at 0.1%Error Rate
16
Lot 0
Lot 1
Number of Chips
Number of Chips
12
11
10
9
8
7
6
5
4
3
2
1
0
1.4
0
5
10
15
20
25
30
1.4
Percentage Savings
Normalized Energy Savings over First Failure Point
at 0.1% Error Rate
Voltage
Point of First Failure
1.5
1.6
1.7
Voltage at First Failure
1.8
45C
65C
95C
1E-5
95C
65C
1E-6
1E-7
1E-8
1E-9
1.40
1.44
1.48
Voltage
Total energy gains for chip 1 (71mW, 44%) and chip 2 (63mW,
39%) are comparable because greater process margin in chip 1
(100mV greater) is compensated by increased savings for chip 2
when scaling below first failure point. The distribution of the total
energy savings over the worst case at 120 and 140MHz for all 33
chips is shown in figure 9 and shows an average savings of
approximately 50% for 120MHz and 45% for 140MHz.
Measured Power (in mW)
140
120
160.5mW 162.8mW
27.7mW
27.3mW
180mV
180mV
11.3mW
80mV
temp
17.3mW
130mV
Process
100
104.5
mW
4.2mW
30mV
Process
8
6
4
2
0
41
46
51
56
61
Percentage Savings
6. Razor Voltage Control
Figure 10 shows the voltage controller, implemented in software
that regulates the supply voltage by reacting to error rates. The
controller samples the error register and adjusts the supply voltage
to achieve a targeted error rate. The response of the voltage
controller for a 0.5% targeted error rate for a test code with
alternating high and low error rate phases is shown. The controller
settles at 1.52V at high error rate phases and at 1.45V at low error
rate phases.
16
14
120MHz
27C
12
10
8
6
4
2
0
100 200 300 400 500 600
1.80
1.75
1.70
1.65
1.60
1.55
1.50
1.45
1.40
1.35
Runtime Samples
7. Conclusion
In this paper, we present a self tuning DVS processor using delay
error tolerant flip-flops. We obtained 44% energy savings by
eliminating voltage margins and operating at a 0.1% error rate.
119.4mW
119.4
mW
104.5
mW
80
chip1 chip2
Measured Power
with supply,
temperature
and process
margins
140MHz
10
Figure 10. Razor Voltage Controller Response
104.5mW
119.4
mW
12
39 44 49 54 59 64
0
Power
Supply
Integrity
11.5mW
80mV
temp
14
Figure 9. Distribution of Net Energy Savings
shows the energy when Razor is turned off and worst-case margins
are added to ensure correct operation. For chip 1, 80mV
temperature margin, 130mV process margin (compared to the
worst-case chip out of the 33 chips) and an estimated 180mV
power supply margin (10% nominal Vdd) were added. The second
and third sets of bars show the energy when operating with Razor
at first point of failure and at 0.1% error rate.
Power
Supply
Integrity
Lot0
Lot1
16
Percentage Savings
1.52
Figure 7. Temperature Dependence of Error Rate
160
120MHz
Percentage Error Rate
1E-4
Lot0
Lot1
Voltage Output of Controller
45C
1E-3
11
10
9
8
7
6
5
4
3
2
1
0
Number of Chips
0.1
0.01
Number of Chips
Percentage Error Rate
Figure 6. Distribution of Point of First Failure, Point of 0.1% Error Rate and Normalized Energy across 33 measured chips
chip1 chip2
Power with Razor
DVS
when Operating
at Point
of First Failure
99.6mW
89.7mW
89.7
mW
chip1
99.6
mW
chip2
Power with Razor
DVS
when Operating at
Point
of 0.1% Error Rate
Figure 8. Razor Energy Savings @120Mhz
References
[1] D. Ernst, et. al., pp.7-18, MICRO 36.
[2] K. Suzuki, et. al., pp. 587-590, CICC 97
[3] S. Sakiyama, et. al., pp. 99-100, Symposium on VLSI circuits,
1997
[4] T. Burd, et. al., pp.294-295, ISSCC, 2000.
[5] S. Akui, et. al., pp.64-65, ISSCC 2004.
[6] Nowka, et. al., pp.340-341, ISSCC 2002
[7] www.mosis.org
2005 Symposium on VLSI Circuits Digest of Technical Papers
Authorized licensed use limited to: Arm Limited. Downloaded on October 15, 2009 at 08:44 from IEEE Xplore. Restrictions apply.
261
Download