21st International Conference on VLSI Design
Exploiting Variable Cycle Transmission for Energy-Efficient On-Chip
Interconnect Design
T. Venkata Kalyan, IIIT Hyderabad, Hyderabad-500032, India, kalyan tv@research.iiit.ac.in
Madhu Mutyam, IIT Madras, Chennai-600036, India, madhu@cs.iitm.ernet.in
P. Vijaya Sankara Rao, IIT Kharagpur, Kharagpur-721302, India, vijaysankar@ece.iitkgp.ernet.in
1063-9667/08 $25.00 © 2008 IEEE
DOI 10.1109/VLSI.2008.15

Abstract

As on-chip interconnects in deep-submicron designs contribute to the system-wide power consumption, minimizing interconnect power consumption has become one of the important design issues in deep-submicron technologies. As transition activity mainly determines the interconnect power consumption, several bus encoding techniques have been proposed to minimize this activity.

Unlike the existing low-power or energy-efficient bus encoding techniques, in this paper we propose a scheme that exploits both dynamic voltage scaling and variable cycle transmission mechanisms to minimize on-chip interconnect energy consumption. We transmit data using the variable cycle transmission method and, based on the delay savings achieved through this method at regular intervals, scale the voltage and frequency to obtain significant energy savings. Using our technique for a 5mm interconnect wire, we achieved energy savings of 30% and 45% over the base case in the address bus and data bus, respectively. Our technique also reduces the energy-delay-product by 34% and 52% for the address bus and data bus, respectively.

1. Introduction

With the scaling of process technology, system-wide power consumption is increasing. One of the contributors to the system-wide power consumption is the on-chip interconnect. Data transmission on the interconnect causes rail-to-rail voltage swing and charging/discharging of capacitance, which in turn result in dynamic (or switching) power consumption. As switching activity mainly determines the power consumption, several low-power or energy-efficient bus encoding techniques have been proposed in the literature to minimize it.

In the coupling-driven bus encoding technique for on-chip buses [7], the data sequences that are closely placed are transformed to minimize the coupling effects, which in turn achieves power savings. To ensure data integrity, the authors proposed to use either extra control lines or extra clock cycles. A technique to minimize power consumption due to coupling transitions is proposed in [15], which modifies the transition profiles to reduce switching energy by 50%. An area- and energy-efficient coding technique is proposed in [17] to obtain nearly 46% reduction in delay along with 10% reduction in energy, but it requires 48 wires to encode 32-bit data. An adaptive bus encoding technique using weighted code mapping and the delayed bus technique is proposed in [3] to achieve significant reduction in power consumption.

In all the above-mentioned techniques, a fixed clock period is considered for data transmission. But data can also be transmitted using variable clock periods. As data transition patterns determine the necessary delay for data transmission, we can fix the necessary delay for each data transition pattern and transmit data with the delay corresponding to its worst-case transition pattern. A technique based on this idea, called the variable cycle transmission (VCT) technique, is proposed in [8]. It is shown that the VCT technique can achieve significant delay savings.

As system-wide power consumption is one of the critical issues in the VLSI community, several techniques have been proposed to minimize it. One of the well-known techniques for reducing system-wide dynamic power consumption is dynamic voltage scaling (DVS) [5]. The DVS technique exploits the quadratic dependency between supply voltage and power consumption, and the linear relationship between clock frequency and supply voltage, to achieve significant dynamic power savings.

The application of the DVS technique to on-chip interconnect is explored in [6]. By considering interconnect designs based on a double sampling latch, which detects and corrects timing errors, the DVS technique is applied to recover the available slack, which results in good power savings. The supply voltage is scaled for power reduction while the frequency is kept constant. Voltage scaling at constant frequency can result in timing errors. Errors are detected and corrected using the double sampling latch, and the voltage scaling is controlled by the error recovery rate.

Motivated by the fact that the interconnection network consumes a significant portion of the system-wide power budget, the application of the DVS technique to interconnection network links is explored in [13]. Power minimization is achieved by adjusting both the frequency and the voltage of the links.

In this paper, we propose an interconnect design on which data is transmitted using variable clock periods. Delay savings obtained by transmitting data using variable clock periods are exploited for applying the DVS technique. As part of the application of the DVS technique, we scale both supply voltage and frequency. We validate the effectiveness of our technique by focusing on the L1 cache address/data buses of a microprocessor using the SPEC CPU2000 benchmark suite and show that for a 5mm interconnect wire our technique achieves 30% and 45% energy savings over the base case in the address bus and data bus, respectively. In addition to the energy savings, our technique reduces the energy-delay-product by 34% and 52% for the address bus and data bus, respectively.

Crosstalk class   Relative delay on the middle wire (×C_L R_T)   Transition patterns
1                 0                                              x − y
2                 1                                              ↑↑↑, ↓↓↓
3                 1 + λ                                          −↑↑, ↑↑−, −↓↓, ↓↓−
4                 1 + 2λ                                         −↑−, −↓−, ↓↓↑, ↑↓↓, ↑↑↓, ↓↑↑
5                 1 + 3λ                                         −↑↓, −↓↑, ↓↑−, ↑↓−
6                 1 + 4λ                                         ↓↑↓, ↑↓↑
Table 1. Crosstalk classes (here $\lambda = C_I / C_L$; x, y : {a → b | a, b ∈ {0, 1}}).

2. Prerequisites

We first review the effects of voltage scaling on dynamic power consumption and delay. Each transition of a digital circuit consumes power because of the charging and discharging of the circuit's capacitance. The dynamic power consumption ($P_{dynamic}$) is expressed as

$$P_{dynamic} \propto V_{DD}^2 f \qquad (1)$$

where $V_{DD}$ is the supply voltage and $f$ is the clock frequency. It is clear from Equation (1) that the power consumption can be reduced quadratically by reducing $V_{DD}$, but supply voltage reduction increases the propagation delay. The propagation delay ($\tau$) of a CMOS transistor [12] is expressed as

$$\tau \propto \frac{V_{DD}}{(V_G - V_T)^{\alpha}} \qquad (2)$$

where $V_T$ is the threshold voltage, $V_G$ is the input gate voltage, and $1 \le \alpha \le 2$. In deep-submicron designs, the value of $\alpha$ is nearly 1.3. The clock frequency is restricted by the propagation delay, so it has to be reduced to tolerate the increased propagation delay.

Energy consumption per data transmission for an interconnect, which includes both the self capacitance ($C_L$) and the coupling capacitance between two adjacent lines ($C_I$), is given by [10]

$$E_{bus} = (\alpha_s C_L + \alpha_c C_I) V_{DD}^2 \qquad (3)$$

where $\alpha_s$ and $\alpha_c$ denote the rates at which each capacitance is switched. While $\alpha_s$ represents the self transitions, $\alpha_c$ is related to the coupling transitions on the interconnect. We know from Equation (3) that voltage scaling can significantly reduce the energy consumption.

We now review an analytical model for propagation delay in deep-submicron buses. Assuming an n-bit parallel bus in a single metal layer, we model a deep-submicron bus as a distributed RC network with coupling capacitance between adjacent wires. The delay of wire $l$ ($1 < l < n$) of the bus is given by [16]

$$T_l = C_L R_T \left[ (1 + 2\lambda) \Delta_l^2 - \lambda \Delta_l (\Delta_{l-1} + \Delta_{l+1}) \right] \qquad (4)$$

where $R_T$ is the total resistance, $\lambda$ is the ratio of the inter-wire capacitance ($C_I$) to the wire-to-substrate capacitance ($C_L$), and $\Delta_l$ is the transition occurring on line $l$. $\Delta_l$ is equal to 1 (or ↑) for a 0-to-1 transition, −1 (or ↓) for a 1-to-0 transition, and 0 (or −) for no transition.

As data transition patterns determine the propagation delay, they are classified into six different crosstalk classes [8, 16] based on the relative delay of a wire w.r.t. its adjacent wires. Table 1 shows the different crosstalk classes.

3. Our Approach

The basic idea of our approach is to transmit data using variable clock periods, as proposed in the VCT technique [8], and to exploit the delay savings obtained through the VCT technique to apply the DVS technique for significant power savings. As part of the application of the DVS technique, we scale both supply voltage and frequency.

Figure 1. Basic mechanism used in our approach.

Figure 1 shows the basic mechanism used in our approach. The shaded portion in the figure represents the implementation of the VCT technique. In general, data is transmitted using a fixed clock period, which is at least the delay of crosstalk class 6 (refer to Table 1). Instead of transmitting data using the fixed worst-case crosstalk class delay, we analyze the crosstalk class of the next data w.r.t. the present data and transmit the next data using only the necessary delay. To determine the crosstalk class of the next data w.r.t. the present data on the bus, we use a Crosstalk Class Analyzer [8] (as shown in Figure 1). To support variable clock periods for data transmission, we consider $x$ as the unit clock period such that

$$
\begin{aligned}
x &\ge C_L R_T \\
2x &\ge C_L R_T (1 + \lambda) \\
3x &\ge C_L R_T (1 + 2\lambda) \\
4x &\ge C_L R_T (1 + 3\lambda) \\
5x &\ge C_L R_T (1 + 4\lambda)
\end{aligned}
$$
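The delay model of Equation (4), the class boundaries of Table 1, and the constraints on the unit clock period can be checked numerically. The following is a minimal sketch rather than the authors' implementation: the function names are hypothetical, delays are expressed in units of $C_L R_T$, and the value of $\lambda$ used in the usage note is an illustrative assumption, not a figure from the paper.

```python
from fractions import Fraction


def relative_delay(d_prev, d_mid, d_next, lam):
    """Equation (4) divided by C_L * R_T.

    Each argument d_* is the transition on one wire:
    +1 for a rising edge, -1 for a falling edge, 0 for no transition.
    """
    return (1 + 2 * lam) * d_mid ** 2 - lam * d_mid * (d_prev + d_next)


def crosstalk_class(d_prev, d_mid, d_next):
    """Map a (left, middle, right) transition pattern to its class in Table 1."""
    if d_mid == 0:
        return 1  # middle wire is quiet: relative delay 0
    # Coefficient of lambda in Equation (4); it ranges from 0 to 4.
    k = 2 - d_mid * (d_prev + d_next)
    return k + 2  # classes 2..6 have relative delay 1 + k * lambda


def min_unit_period(lam):
    """Smallest x (in units of C_L * R_T) with k * x >= 1 + (k - 1) * lam, k = 1..5."""
    lam = Fraction(lam)
    return max((1 + (k - 1) * lam) / k for k in range(1, 6))
```

For example, with an assumed $\lambda = 3$, `min_unit_period(3)` evaluates to 13/5, so the constraint on $5x$ is the binding one, and the pattern ↓↑↓ is reported as class 6 with relative delay 13.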
Hence, the unit clock period becomes $C_L R_T (1 + 4\lambda)/5$ (the last constraint is the tightest one whenever $\lambda \ge 1$). Based on the crosstalk class of the next data w.r.t. the present data, we consider the delay as an integer multiple of the unit clock period. As variable clock periods are used for data transmission, we use an extra interconnect wire (for the Ready signal) to indicate the availability of data at the receiver side. The extra wire is separated from the actual data by a shield wire so that the Ready signal takes a single cycle to reach the receiver. The delay due to the Crosstalk Class Analyzer is overlapped with the propagation delay of the Ready signal [8]. Thus, we use two extra wires (i.e., the Ready signal and a shield wire), and hence the VCT technique requires 34 wires for a 32-bit bus.

Data transmission using variable clock periods can result in significant delay savings. It is shown in [8] that the VCT technique achieves 31.5% delay savings in the case of the L1 cache data bus for on-chip data transmission. The significant delay savings achieved by the VCT technique are due to the fact that on-chip data exhibit a high percentage of lower crosstalk class transitions (refer to Table 1), as shown in Figure 2.

Figure 2. Distribution of transition patterns.

Benchmark   Address      Data        Benchmark   Address      Data
Bzip        104103584    31711265    Crafty      160591861    45039063
Eon         180520910    48659025    Gap         134847111    37827729
Gcc         186621216    56015716    Gzip        181144637    40342073
Mcf         117417966    32797679    Parser      201532006    51105161
Perlbmk     129748403    36285808    Twolf       122384930    31389418
Vortex      114818866    42302045    Vpr         227654678    37423636
Applu       102778392    33394619    Apsi        102578114    28801754
Art         109836092    36995126    Fma3d       131143197    19278215
Galgel      106010236    5657714     Lucas       99952797     18938412
Mesa        130750176    33263659    Mgrid       101415579    32568604
Swim        100232496    24084939    Wupwise     99954963     8159589

Table 2. Number of 32-bit data items considered in different benchmarks.

Parameter   W (nm)   S (nm)   T (nm)   H (nm)   Dielectric constant
Value       205      205      430.5    398.5    3.3

Table 3. Device parameters for 90nm technology nodes based on the ITRS 2004 edition.

Method   # of wires   Codec area overhead (µm²)   Codec energy overhead (pJ)
Base     32           0                           0
VCT      34           3603.5                      0.458

Table 4. Codec overhead summary.

As our main focus is on energy minimization rather than delay minimization, we exploit the delay savings obtained through the VCT mechanism for energy reduction.

To apply the DVS technique for energy minimization, the delay savings are measured at regular intervals, and based on the delay savings obtained so far we scale the voltage. As voltage scaling alters the propagation delay, we also scale the frequency. As shown in Figure 1, we consider a Delay Counter which consists of a Global Counter and a Local Counter. The Global Counter maintains the number of clock cycles that have been saved so far. The Local Counter maintains the number of clock cycles that have been saved within the current sampling interval. We consider the sampling period as 15000 cycles. The Crosstalk Class Analyzer provides values 3, 4, 5, and 6 for crosstalk classes 1-3, 4, 5, and 6, respectively, to the Local Counter. The Local Counter, upon receiving the value $i$ from the Crosstalk Class Analyzer, increments its count by $(6 - i)$, where $i \in \{3, 4, 5, 6\}$, and resets its value after every 15000 cycles. The Delay Counter updates the Global Counter value every 15000 cycles using the following formula:

$$N_g = N_g + N_l - \left\lceil \frac{(x' - x) N_l}{x} \right\rceil \qquad (5)$$

where $N_g$ and $N_l$ are the values of the Global Counter and Local Counter, respectively, $x$ is the original unit clock period (as defined earlier in the section), and $x'$ is the new unit clock period obtained due to frequency scaling.
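The counter bookkeeping described above, together with Equation (5), can be sketched as follows. This is an illustrative model under stated assumptions, not the authors' hardware: the class name `DelayCounter`, its methods, and the helper `analyzer_value` are hypothetical, and the caller is assumed to invoke `end_of_interval` once per 15000-cycle sampling period.

```python
import math


def analyzer_value(crosstalk_class):
    """Value the Crosstalk Class Analyzer reports: 3 for classes 1-3, else the class itself."""
    if crosstalk_class not in range(1, 7):
        raise ValueError("crosstalk class must be in 1..6")
    return max(3, crosstalk_class)


class DelayCounter:
    """Hypothetical model of the Delay Counter (Global Counter plus Local Counter)."""

    def __init__(self):
        self.global_count = 0  # N_g: cycles saved so far
        self.local_count = 0   # N_l: cycles saved in the current sampling interval

    def record_transfer(self, value):
        """Add (6 - i) saved cycles for an analyzer value i in {3, 4, 5, 6}."""
        if value not in (3, 4, 5, 6):
            raise ValueError("analyzer value must be 3, 4, 5 or 6")
        self.local_count += 6 - value

    def end_of_interval(self, x_orig, x_new):
        """Apply Equation (5) at the end of a sampling interval, then reset N_l.

        x_orig is the original unit clock period x and x_new is the scaled
        period x'; both must use the same time unit.
        """
        penalty = math.ceil((x_new - x_orig) * self.local_count / x_orig)
        self.global_count += self.local_count - penalty
        self.local_count = 0
        return self.global_count
```

As a usage example, five transfers of classes 1, 4, 6, 5, and 2 save 3 + 2 + 0 + 1 + 3 = 9 cycles; at the original clock period the Global Counter therefore becomes 9. If the next interval again saves 9 cycles while the unit period is stretched by an assumed 25%, Equation (5) subtracts ⌈2.25⌉ = 3 cycles and the Global Counter becomes 15.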
If the Global Counter value is more than $\delta_1$ and the energy savings due to scaling the voltage by 0.1 V are more than the energy overhead due to the voltage regulator (given by Equation (6)), we scale down the voltage by 0.1 V. On the other hand, if the Global Counter value is less than $\delta_2$, we scale up the voltage by 0.1 V. We consider the upper and lower limits for the supply voltage as 1.2 V and 0.8 V, respectively, and $\delta_1$ and $\delta_2$ as 4999 and 0, respectively. We adjust the clock frequency in accordance with the supply voltage scaling. On-chip voltage regulators take a few µs to change from one voltage to another [11]. We assume that the original frequency of the bus is 1.5 GHz, that the voltage regulator takes 1 µs to adjust the voltage by 10 mV [6], and that $t_{bus}$ (15) cycles are needed for a frequency transition [13]. Thus, the voltage regulator takes 15000 cycles to adjust the voltage by 0.1 V (10 µs at 1.5 GHz). As we use a greedy method when applying the DVS technique (i.e., voltage and frequency are scaled based on the clock cycles saved so far), there is no need for a prediction technique to estimate the bus transition patterns, even though the voltage regulator takes 15000 cycles to adjust the voltage by 0.1 V. Although frequency transitions are much faster than voltage transitions, the bus is disabled during a frequency transition in order to avoid timing uncertainty while the receiver is tracking the input clock.

Energy overhead due to a transition from initial voltage $V_1$ to final voltage $V_2$ is calculated by [18]

$$E_{VR}^{(V_1,V_2)} = (1 - \eta) C \left| V_2^2 - V_1^2 \right| \qquad (6)$$

where $C$ is the filter capacitance of the power supply regulator and $\eta$ is the power efficiency. In our experimental setup, we assume $C$ as 5 µF and $\eta$ to be 94%. It is clear from Equation (6) that $E_{VR}^{(V_1,V_2)} = E_{VR}^{(V_2,V_1)}$. Note that the energy overhead calculated using Equation (6) is included in the calculation of overall energy savings.

4. Experimental Validation

We validate our technique by simulating 22 SPEC2000 CPU benchmarks [2] using the SimpleScalar 3.0 simulator [4]. For each benchmark, we fast forward 100 million instructions and then simulate the next 300 million instructions. Table 2 shows the benchmark-wise number of 32-bit data items transmitted on the L1 cache address/data bus.

We first discuss the energy, area, and latency overhead due to the extra circuitry used to implement our technique. As we transmit data using the VCT mechanism, we design the codecs used in the VCT technique [8] in Verilog and synthesize them using the Synopsys Design Compiler with a 90nm TSMC technology library. The Berkeley interconnect model [1] is used to calculate the ground capacitance and coupling capacitance of the interconnect. In our experimental results, we consider the metal layer 4 wire parameters shown in Table 3 and a wire length of 5mm. Energy and area overheads of the VCT codecs, along with the base case, are shown in Table 4. Note that there is no latency overhead due to the Crosstalk Class Analyzer used in the VCT technique, as its latency is overlapped with the propagation delay of the Ready signal [8], and the delay associated with performing activities related to voltage regulation is overlapped with the data transmission delay.

We now discuss the effect of voltage scaling on energy consumption. Energy consumption per data transmission in the base case is given by

$$E_{base} = (\alpha_s C_L + \alpha_c C_I) V_{DD}^2 \qquad (7)$$

where $\alpha_s$ and $\alpha_c$ represent the average self and coupling transitions, respectively.

Energy consumption per data transmission in our technique is given by

$$E_{VCT+DVS} = \frac{\sum_{i=1}^{n} (\alpha_{s_i} C_L + \alpha_{c_i} C_I) V_i^2}{n} + E_{OH} \qquad (8)$$

where $E_{OH}$ is the energy overhead due to the voltage regulator and the VCT circuitry, $n$ is the number of samples considered, $\alpha_{s_i}$ and $\alpha_{c_i}$ represent the average self and coupling transitions, respectively, and $V_i$ is the voltage at the $i$th sample period. As the voltage may change in each sample period depending on the Global Counter value, we consider different voltages ($V_i$) and use the average number of self ($\alpha_{s_i}$) and coupling ($\alpha_{c_i}$) transitions that occurred during the $i$th sample period to calculate the energy consumption during that sample period. The total number of self (coupling) transitions during the entire execution is equal to the sum of the self (coupling) transitions over all sample periods, i.e., $\alpha_s = \sum_{i=1}^{n} \alpha_{s_i}$ and $\alpha_c = \sum_{i=1}^{n} \alpha_{c_i}$.

Energy overhead due to the voltage regulator and the VCT circuitry is given by

$$E_{OH} = \frac{\sum_{i=0.8}^{1.1} n_{(i,i+0.1)} E_{VR}^{(i,i+0.1)}}{m} + t \times E_{Codec} \qquad (9)$$

where $n_{(i,j)}$ is the number of times the voltage is changed from $i$ to $j$, $m$ is the total number of data items transmitted, $E_{Codec}$ is the energy overhead due to the VCT circuitry (as shown in Table 4), and $t$ is a multiplicative factor used to account for the energy overhead of the comparator circuit. Note that we use a comparator circuit in our experiments to check whether or not the energy savings due to voltage scaling exceed the energy overhead due to voltage regulation. In our experiments we consider $t$ as 1.1.

Using the above equations, we now give a formula to calculate the percentage of energy savings ($E_{save}$) obtained through our technique w.r.t. the base case:

$$E_{save} = \left(1 - \frac{E_{VCT+DVS}}{E_{base}}\right) \times 100 \qquad (10)$$

Figure 3 shows the benchmark-wise energy savings obtained through our technique w.r.t. the base case. Energy savings are almost uniform across different benchmarks in the address bus case, ranging from 27.53% to 31.66% with an average value of 30%. The variation in energy savings across benchmarks is small because data transition pattern behavior is almost uniform across all benchmarks (refer to Figure 2). Energy savings in the data bus case are also almost uniform across different benchmarks, except for the benchmarks "Applu" and "Galgel". For "Applu" and "Galgel", our technique achieves 14% and 34% energy savings, respectively, while for all other benchmarks it achieves nearly 45% energy savings.

Figure 3. Energy savings of our technique w.r.t. the base case.

As voltage and frequency scaling can affect data transmission delay, we now discuss the delay savings achieved by our technique w.r.t. the base case. We calculate the percentage of delay savings ($d_{save}$) obtained through our technique by using the following formula:

$$d_{save} = \left(1 - \frac{\sum_{i=1}^{n} k_i - \sum_{i=0.8}^{1.1} n_{(i,i+0.1)} t_{bus}}{k}\right) \times 100 \qquad (11)$$

where $k$ is the total number of cycles required for data transmission in the base case and $k_i$ is the total number of cycles required during the $i$th sample period for data transmission in our technique. In Equation (11), the overhead due to bus disabling, $t_{bus}$, is considered for every frequency transition.

Figure 4 shows the benchmark-wise delay savings obtained through our technique w.r.t. the base case. Delay savings in the address bus are within the range of 8.50% to 10.64% with an average value of 9.08%, while in the data bus they range from −0.12% to 28.63% with an average value of 13.69%.

Figure 4. Delay savings of our technique w.r.t. the base case.

We now consider the normalized energy-delay-product (EDP) of our technique w.r.t. the base case, as shown in Figure 5. It is clear from the figure that our technique reduces the EDP by 34% and 52% for the address and data buses, respectively. Note that it is easy to show that the EDP of our technique is less than that of the VCT technique. For instance, the VCT technique achieves 31.5% delay savings
in the data bus [8] and incurs some energy penalty (due to the codec energy overhead of the VCT technique) as compared to the base case, and hence its normalized EDP is nearly 30%. The sensitivity analysis with a wire length of 4mm yielded energy savings of 23.68% and 42.70% and EDP reduction of 30.6% and 50.5% for the address and data bus, respectively.

Figure 5. Energy-delay-product of our technique w.r.t. the base case.

5. Conclusion

In this paper, by exploiting variable cycle transmission and DVS mechanisms, we proposed a novel technique for energy-efficient on-chip interconnect design. As part of the application of the DVS technique, we scaled both supply voltage and frequency. The delay savings provided by the variable cycle transmission mechanism are exploited when the voltage scaling technique is applied, so that we obtain significant energy savings as well as delay savings without impacting the throughput. We validated the effectiveness of our technique by focusing on the L1 cache address/data buses of a microprocessor using the SPEC CPU2000 benchmark suite and showed that for a 5mm interconnect wire our technique achieves 30% and 45% energy savings over the base case in the address bus and data bus, respectively. We also demonstrated that our technique reduces the energy-delay-product significantly as compared to the base case.

References

[1] Berkeley predictive technology model. http://www-device.eecs.berkeley.edu/~ptm/interconnect.html
[2] SPEC CPU2000 Benchmark. http://www.spec.org
[3] A.R. Brahmbhatt, J. Zhang, Q. Wu, and Q. Qiu. "Low-power bus encoding using an adaptive hybrid algorithm". DAC, 2006, pp. 987-990.
[4] D.C. Burger and T.M. Austin. The SimpleScalar tool-set, version 2.0. Technical Report 1342, Department of Computer Science, UW, 1997.
[5] A. Chandrakasan, S. Sheng, and R. Brodersen. "Low-power CMOS digital design". JSSC, 27(4), 1992, pp. 473-484.
[6] H. Kaul, D. Sylvester, D. Blaauw, T. Mudge, and T. Austin. "DVS for on-chip bus designs based on timing error correction". DATE, 2005, pp. 80-85.
[7] K.-W. Kim, K.-H. Baek, N. Shanbhag, C.L. Liu, and S.M. Kang. "Coupling-driven signal encoding scheme for low-power interface design". ICCAD, 2000, pp. 317-321.
[8] L. Li, N. Vijaykrishnan, M. Kandemir, and M.J. Irwin. "A Crosstalk Aware Interconnect with Variable Cycle Transmission". DATE, 2004, pp. 102-107.
[9] D. Liu et al. "Power consumption estimation in CMOS VLSI chips". IEEE JSSC, 1994, 26, pp. 663-670.
[10] L. Macchiarulo, E. Macii, and M. Poncino. "Low-Energy Encoding for Deep-Submicron Address Buses". ISLPED, 2001, pp. 176-181.
[11] M. Meijer, J. Pineda de Gyvez, and R. Otten. "On-Chip Digital Power Supply Control for System-on-Chip Applications". ISLPED, 2005, pp. 311-314.
[12] T. Sakurai and A.R. Newton. "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas". IEEE JSSC, 1990, 25(2), pp. 584-594.
[13] L. Shang, L.S. Peh, and N.K. Jha. "Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks". HPCA, 2003, pp. 91-102.
[14] P. Sotiriadis and A. Chandrakasan. "Low Power Bus Coding Techniques Considering Inter-wire Capacitances". CICC, 2000, pp. 507-510.
[15] P. Sotiriadis and A. Chandrakasan. "Bus energy minimization by transition pattern coding (TPC) in deep submicron technologies". ICCAD, 2000, pp. 322-327.
[16] P. Sotiriadis and A. Chandrakasan. "Reducing Bus Delay in Sub-micron Technology using Coding". ASPDAC, 2001, pp. 109-114.
[17] S.R. Sridhara, A. Ahmed, and N.R. Shanbhag. "Area and Energy-efficient Crosstalk Avoidance Codes for On-chip Buses". ICCD, 2004, pp. 12-17.
[18] A. Stratakos. High-efficiency low-voltage DC-DC conversion for portable applications. Ph.D. Thesis, University of California, Berkeley, 1998.