Simultaneous Peak Temperature and Average Power Minimization

advertisement
2009 22nd International Conference on VLSI Design
Simultaneous Peak Temperature and Average Power Minimization during
Behavioral Synthesis
Vyas Krishnan and Srinivas Katkoori
Department of Computer Science & Engineering, University of South Florida, Tampa, USA
{krishnan, katkoori}@cse.usf.edu
thermal issues, thus creating a need for temperatureaware low-power synthesis.
Different functional units in a circuit can have
different switching activity rates, leading to uneven
power dissipation among the various logic blocks on
the chip. Due to low thermal conductivity of silicon,
the rate of lateral heat propagation in a chip is slow,
causing localized heating to occur much faster than
chip-wide heating. This can cause uneven temperature
distribution on the chip, creating on-chip thermal
gradients, and “thermal hot spots” due localized areas
of high power densities.
Traditional low-power behavioral synthesis
techniques do not take on-chip thermal distributions
into account during synthesis. With power densities
and thermal gradients projected to increase rapidly in
future technologies [1][2], there is a need for
temperature-aware low-power design techniques that
directly target the spatial nature of on-chip temperature
distributions. This makes temperature-awareness an
essential part of low-power behavioral synthesis.
Elevated temperatures have several adverse effects
on a circuit. Higher temperatures adversely impact
performance due to decreased transistor switching
speeds and increased wire resistance, which can lead to
timing violations [2]-[4]. Leakage power in CMOS
circuits increases exponentially with temperature[5].
This coupling between static power and temperature
can lead to thermal runways if not adequately managed
[2]. Higher temperatures also cause reliability
problems due to electro-migration in wires, and
thermal-stress related gate-oxide breakdown in
transistors [4]. In addition, chip packaging costs
increase with higher thermal budgets.
Temperature-aware
high-level
synthesis
is
challenging because of the multi-objective nature of he
problem, with several conflicting objectives –
throughput rate, power, peak temperature, and chip
area. Figure 1 illustrates the different factors that affect
power dissipation and temperature of an integrated
circuit. Task scheduling and resource binding in high-
Abstract
With continuous CMOS scaling, and increasing
operating frequencies, power and thermal concerns
have become critical design issues in current and
future high-performance integrated circuits. Elevated
chip
temperatures
adversely
impact
circuit
performance and reliability. On-chip thermal gradients
can lead to unpredictable clock skew variations and
timing failures. Chip temperatures are influenced by
design decisions at the behavioral and physicalsynthesis levels. Existing low-power design techniques
cannot adequately address thermal issues since their
optimization objectives fail to capture the spatial
nature of on-chip thermal gradients. We present an
algorithm for thermally-aware low-power behavioral
synthesis that concurrently minimizes average power
and peak chip temperature. Our algorithm uses
accurate floorplan-based temperature estimates to
guide behavioral synthesis. Compared to traditional
low-power synthesis, our method reduces peak
temperatures by as much as 23%, with less than 10%
overhead in chip area.
1. Introduction and Motivation
Steady scaling of CMOS process technologies over
the past three decades have enabled feature sizes to
shrink continuously, allowing the integration of
millions of transistors in modern VLSI chips.
However, with decreasing feature sizes and increasing
transistor counts, power density in VLSI circuits has
increased dramatically. Since the heat generated by a
VLSI circuit is proportional to its power density, the
corresponding rise in on-chip temperatures adversely
impacts reliability, circuit performance, and cooling
costs. According to ITRS [1], thermal management is
projected as one of the most challenging issues in the
design of future high-performance integrated circuits.
Power-aware design alone fails to adequately address
1063-9667/09 $25.00 © 2009 IEEE
DOI 10.1109/VLSI.Design.2009.78
419
Authorized licensed use limited to: University of South Florida. Downloaded on February 11, 2009 at 23:54 from IEEE Xplore. Restrictions apply.
second stage that improves on this solution to optimize
average power dissipation while minimizing peak chip
temperature. To estimate the on-chip thermal
distributions, a thermal analysis of the floorplan is
performed. Using a multi-stage annealing approach
significantly reduces synthesis run-times by avoiding
expensive thermal analysis during early stages of
search, when solutions are far from optimal. We use
ISAC [7], an accurate architectural-level thermal
modeling tool to estimate on-chip thermal distributions
and module temperatures. ISAC uses the power traces
of functional units and their floorplan locations, to
compute module temperatures. The use of thermal
modeling during design space exploration allows
weeding out potential designs that have a high
probability of encountering thermal problems.
The remainder of the paper is organized as follows.
Section 2 gives an overview of related work. In Section
3, we discuss our proposed temperature-aware lowpower high-level synthesis approach. Section 4
presents our experimental results, and Section 5
concludes the paper.
Figure 1. Factors affecting total power and onchip temperature during behavioral synthesis
level synthesis significantly impacts the static and
dynamic power of datapath functional units, and hence
their temperature. An increase in temperature increases
sub-threshold leakage, leading to further increases in
temperature.
Floorplanning also significantly affects on-chip
thermal distribution. The temperature of a functional
unit is determined not only by its power density but
also by the temperature of other functional units in its
vicinity. Changing functional unit positions on a
floorplan to balance power density and thereby reduce
chip temperatures may increase chip area. Conversely,
placing highly connected functional units bound to
timing critical tasks, close to each other to minimize
wire delay and power, could lead to localized areas of
high chip temperatures, especially if the modules also
have high switching activity.
An effective temperature-aware design technique
must tightly couple floorplanning and high-level
synthesis, with some form of feedback from thermal
analysis. In this paper, we propose an integrated
approach to temperature-aware high-level synthesis
that simultaneously performs temperature-aware
scheduling, binding, and floorplanning, using feedback
from an accurate and fast thermal simulation, to guide
synthesis decisions.
To perform a multi-objective optimization of delay,
chip area, power dissipation, and peak temperature, we
use a two-stage simulated annealing algorithm, that
uses different optimization objectives in each stage. In
particular, the first stage of the algorithm uses a fast
annealing strategy to optimize for power, throughput,
and chip area, and provide a good starting point for the
2. Related Work
Several researchers have addressed the problem of
thermal modeling and temperature-aware design.
Micro-architecture level thermal models were proposed
in [8]. The authors of [8, 9] consider thermal effects
during processor-based micro-architectural design
space exploration. Thermal issues have also been
considered during physical synthesis [10]-[12].
A number of behavioral synthesis works have
appeared in the literature addressing average power
during datapath synthesis. However, there are few
research works addressing thermal issues during highlevel synthesis. In high-level synthesis, researchers
have considered thermal effects during resource
binding [13]-[15] and scheduling [16]. Gu et al., [17]
use floorplanning and voltage islands to alleviate
thermal-hotspot formation during high-level synthesis.
Our work differs from these in several respects. Our
approach tightly integrates high-level and physicallevel synthesis steps at all stages of synthesis, to
provide accurate estimates of thermal distribution. The
works in [13] and [14] do not use floorplan-level
information in their temperature-minimization
algorithm, and therefore does not account for lateral
heat flow between modules. The authors of [15]
separate the floorplanning and binding steps during
thermal optimization. The authors of [16] use a series
of scheduling and binding moves in an attempt to
minimize temperature, without changing the floorplan.
420
Authorized licensed use limited to: University of South Florida. Downloaded on February 11, 2009 at 23:54 from IEEE Xplore. Restrictions apply.
switching power for different task and variable
bindings. A simulated annealing based search algorithm
is then used to concurrently perform task scheduling,
resource binding, floorplanning, and module power
minimization. Module temperature profiles are then
determined through a thermal analysis of the solutions.
We use ISAC [7], a fast and accurate temperature
analysis tool to perform a static thermal analysis of
solutions examined by our algorithm. ISAC takes a
floorplan and module power traces as inputs, and
computes the module temperatures. Our synthesis
algorithm concurrently minimizes delay, power, area,
and peak module temperatures.
3.1. Two-stage Simulated Annealing
To address the multi-objective nature of the
temperature-aware low-power synthesis problem, we
use a two-stage simulated annealing algorithm, where
different optimization objectives, moves, and cooling
rates are used in different stages of annealing. The
objectives optimized in each annealing stage are:
Figure 2. Temperature-aware behavioral
synthesis algorithm
3. Temperature-Aware Low-Power HighLevel Synthesis
• Stage-I (high-temperature annealing): A Layout-
aware low-power high-level synthesis, where the
optimization objectives are power, schedule-length,
and chip area.
In this section, we give an overview of the proposed
technique, a resource-constrained high-level synthesis
(HLS) algorithm that considers the impact of task
scheduling, resource binding, and floorplanning, on
average power and on-chip temperature distribution.
Our low-power synthesis method uses a two-stage
simulated annealing algorithm (SA) to concurrently
perform the tasks of scheduling, resource binding,
floorplanning, and thermal analysis, yielding solutions
that have lower peak module temperatures than
traditional temperature-unaware low-power high-level
synthesis techniques.
Figure 2 illustrates the main steps used in our
approach. First, the input dataflow graph (DFG), is
simulated with typical input traces to create input
switching power tables for each resource type in the
target datapath circuit. Task schedules, resource
bindings, power models, an RTL design library,
floorplanner, and a thermal model of the IC package are
then used to evaluate the IC temperature profile, power,
area, and performance of designs synthesized by our
algorithm.
The main algorithm comprises of an HLS synthesis
system tightly integrated with an incremental
floorplanner and a thermal analysis tool. The inputs to
the synthesis algorithm are (1) a behavioral description
of the design in the form of a dataflow graph, (2) an
RTL module library, and (3) profile-based input
switching tables for the RTL modules modeling their
• Stage-II
(low-temperature
annealing):
A
Temperature-aware
layout-driven
low-power
synthesis, where the optimization objectives are
peak module temperature, power, and chip area.
In first stage of annealing, we perform a floorplandriven low-power high-level synthesis, given the
resource constraints for the datapath. The second stage
is a low-temperature annealing stage, that takes the best
low-power solution returned by the first stage, and uses
a set of thermally-aware SA moves to minimize peak
module temperature and create an even thermal
distribution across the chip.
There are two main reasons for adopting a two-stage
annealing strategy in our algorithm. First, our
experiments indicate that for the problem addressed in
this paper, using a single weighted-sum of all objectives
during the search process often yields solutions far from
the Pareto-optimal front. However, a two-stage
annealing algorithm is able to sample the multiobjective solution space more efficiently, consistently
yielding nearly optimal solutions. Secondly, use of a
two-stage annealing approach significantly reduces runtimes by avoiding expensive thermal analysis of
solutions in earlier stages of search, when they are far
from optimal.
421
Authorized licensed use limited to: University of South Florida. Downloaded on February 11, 2009 at 23:54 from IEEE Xplore. Restrictions apply.
• Swap the module bindings of two DFG tasks.
3.2. Solution Encoding
The first two move operations change a task's
priority in the node-list used by the list scheduler,
resulting in different schedules of possibly different
schedule-lengths. The remaining two moves randomly
change a DFG node's or variable's resource binding,
with goal of minimizing interconnect and multiplexer
costs.
The solution encoding used by our two-stage SA
consists of two parts – (1) a set of DFG node-lists (or
variable-lists) specifying the tasks (or variables) bound
to each datapath resource (datapath unit or register), and
(2) a sequence pair [18] (S1, S2) representing the
relative location of datapath resources on a chip
floorplan. SA neighborhood search moves that explore
the HLS search space of task schedules and resource
bindings perturb the node-list and variable-list portions
of a solution encoding. A node-list associated with an
RTL module represents node priorities for DFG tasks
bound to the module. Node priorities are used by a
modified list scheduling algorithm to schedule tasks in
the DFG. The search space of chip floorplans is
explored via SA moves defined on the sequence pair.
Solution cost functions are evaluated by using
algorithms described in Section 3.4.
3.3.3. HLS Moves Used in Stage-II of the SA. The
schedule length of the best solution found by the twostage SA is used as a constraint in Stage-II of annealing.
The HLS moves defined in Stage-II are designed to
explore different schedules and bindings, without
changing the length of these schedules (schedule length
invariant moves). Five neighborhood search moves are
defined:
•
•
•
•
•
3.3. Neighborhood Moves
Simulated Annealing moves are defined in both, the
HLS search space (task schedules, module and register
bindings), and the layout search space (chip floorplans).
All these three moves are designed to preserve the
schedule-length, while exploring a variety of schedules
and bindings. The first of these moves changes the
time-step when a task is executed, without changing its
resource binding. The next two moves change the
resource bindings of tasks, and also possibly the timestep when they are executed. The remaining two moves
change the bindings of variables to registers. The main
objective of Stage-II is to simultaneously minimize
peak module temperature and average power.
3.3.1. Floorplan Moves. Five SA moves, similar to the
ones defined in [6], are defined on the sequence pair
representing a solution's floorplan:
•
•
•
•
•
Shift a task within its mobility range,
Migrate a task to another compatible module,
Swap module bindings of two compatible tasks,
Migrate a variable to another compatible register,
Swap register bindings of two compatible
variables.
Rotate datapath module,
Shift datapath module in S1 string,
Shift datapath module in S2 string,
Swap two datapath modules in S1 string,
Swap two datapath modules in S1 string.
These moves are used in both stages of annealing used
in our approach.
3.3.4. Thermally aware Floorplan Moves. To
improve on-chip temperature distributions, our
algorithm also incorporates floorplan-level thermal
optimization moves. In solutions where functional units
with high power densities are physically close to each
other, thermal hotspots can occur. To mitigate this
situation, a thermal-aware swap operation is defined on
the sequence pair of a floorplan. Under this operation,
the functional units are sorted in order of increasing
temperature, and the positions of a randomly chosen
high-temperature functional unit in either the S1 or S2
sequence pair is exchanged with that of a lowtemperature functional unit. This SA move allows more
even thermal distribution over the chip, and prevents
high-temperature modules from clustering in one area,
leading to thermal hotspots. These moves are used
during Stage-II of annealing.
3.3.2. HLS Moves Used in Stage-I of the SA. Four SA
moves are defined to operate on the node-lists
associated respectively with computational units in the
solution datapath. These moves are designed to explore
different task schedules and bindings encoded by the
node-lists. These moves are termed as schedule-length
variant moves, because they could result in task
schedules with differing schedule lengths. The main
objective of these moves is to find a solution that
minimizes the total switching power among datapath
functional units and interconnects, while at the same
time meeting the user-specified delay constraint on the
schedule. The SA moves used are:
• Change a DFG task priority in a module node-list,
• Swap two DFG tasks in a module's node-list,
• Change a DFG tasks module binding,
422
Authorized licensed use limited to: University of South Florida. Downloaded on February 11, 2009 at 23:54 from IEEE Xplore. Restrictions apply.
3.4. Cost Function Evaluation
4. Experimental Results
Figure 3 illustrates the main steps used to decode a
solution encoding and evaluate the cost function. The
The proposed algorithm was tested on a Linuxbased workstation using a 1.86GHz Intel CoreDuo
processor with 2GB memory. The overall flow used in
our experiments is shown in Figure 2. Our experiments
were performed on a set of several reallife large-sized benchmarks drawn from MediaBench
suite [19]. Each of these benchmarks was specified as a
DFG capturing the behavioral description of the
architecture to be synthesized. The DFGs used in
experiments range in size from 51 nodes to 333 nodes,
and were obtained from [17]. The RTL resource set
used in our experiments comprised of multipliers,
ALUs, registers, and multiplexers synthesized using a
TSMC 180nm technology library.
For our thermal analysis, we assume that each chip
is attached to a copper heat sink using forced air
cooling. Heat dissipates from the silicon die, through
the cooling package to the ambient environment, and
through the package to the printed circuit board. We
assume an ambient temperature of 45oC and a silicon
thickness of 200μm.
In our experiments, the objective was to minimize
the peak temperature among the functional units in a
datapath during low-power behavioral synthesis. Our
temperature-aware low-power synthesis method was
compared with two other temperature-unaware lowpower behavioral synthesis methods:
Figure 3. Solution cost function evaluation
node-list associated with each functional unit is used
by a list scheduler to schedule tasks in the DFG. Once
a feasible schedule is determined, we use the classical
left-edge algorithm to allocate and bind registers in the
datapath. The sequence-pair maintained by each
solution encoding is used to pack the datapath units
into a floorplan. The DFG schedule and the profiled
switching probability data are used to compute module
switching power. The power traces for functional unit,
together with the chip floorplan, is then used by ISAC
to perform a static thermal analysis to determine
module temperatures and chip thermal profiles.
The cost functions used in Stages I and II differ in
their objective functions. In Stage-I, we use the
following cost function:
• Method-A: A low-power floorplan-aware synthesis
methodology that minimizes average power,
• Method-B: A low-power floorplan-aware synthesis
methodology that minimizes peak module power.
Method-A is an SA-based layout-driven low-power
high-level synthesis algorithm that tightly integrates a
floorplanner within the HLS synthesis loop. The
optimization function used in Method-A minimizes the
average power and throughput, together with the
traditional floorplanning objectives of chip area and
total wire length. Method-B is similar to Method-A
except that it minimizes the peak power consumption
of the datapath units, instead of average power.
Comparing our algorithm with these low-power
synthesis techniques allows us to highlight the
advantages of a temperature-driven synthesis technique
over a low-power design methodology. A thermal
analysis is performed on the solutions found by these
methods, and the peak module temperatures are
compared with the solutions found by our technique.
The intuition behind using Method-A is that lowering
the total power dissipation in a circuit would hopefully
lower overall on-chip power density and hence the onchip temperatures. The intuition behind Method-B is
α*D+β*P+γ*A
where, D is the normalized value of latency (schedulelength), P is the normalized value of average power,
and A is the normalized chip area. The terms are
normalized with respect to the best schedule-length,
power, and chip-area values seen during search, which
are dynamically updated during search. In our
experiments, we set α = 250, β = 125, and γ = 125. In
Stage-II, where module temperatures of a low-power
datapath are minimized, we use the following objective
function:
θ*T+λ*P+φ*A
Here, T refers to normalized peak module temperature
in the datapath, while P and A have the same meaning
as in Stage-I. In our experiments, we set θ = 250, λ =
125, and φ = 125.
423
Authorized licensed use limited to: University of South Florida. Downloaded on February 11, 2009 at 23:54 from IEEE Xplore. Restrictions apply.
the largest benchmark. In addition, the difference
between the average power of solutions found by our
method and a traditional low-power synthesis method
is less than 1% on average, for all the benchmarks
tested.
that by constraining peak module power dissipation,
one could mitigate the formation of on-chip thermalhotspots.
Table 1 compares the peak chip temperatures and
average power using our temperature-aware technique
and the low-power methods A and B. The peak
temperature and average power values are averaged
over ten independent SA runs using different random
number seeds. For all the benchmarks, our technique
found solutions with lower peak temperatures when
5. Conclusions
Traditional low-power behavioral synthesis
methods cannot sufficiently address non-uniform
temperature distributions in an integrated circuit.
Unless a synthesis methodology accounts for lateral
thermal diffusion between modules on a floorplan,
only minimizing peak module power or average power
alone is inadequate in
minimizing
peak chip
temperature. In this paper, we present an integrated
temperature-aware high-level synthesis technique that
tightly couples the HLS tasks of scheduling and
binding, with physical-level estimates from an
incremental floorplanner, to concurrently optimize
power and peak chip temperature. Our experimental
results indicate that compared to conventional powerminimization approaches to HLS, our synthesis
technique reduces peak module temperatures by an
average of 15%, with less than 1% difference in
average power. These improvements in peak
temperatures are achieved with less than 7% average
increase in chip area over power-optimized designs.
Table 1. Comparison of Peak Chip
Temperature and Average Power
Benchmark
ID
Name
Proposed
Max
T
(C)
Method-A
Power
(W)
Max
T
(C)
Method-B
Power
(W)
Max
T
(C)
Power
(W)
0.386
1
h2v2-smooth_downsample
69.4
0.367
79.9
0.364
79.3
2
feedback_points
76.1
0.928
89.5
0.921
91.4
1.002
3
collapse_pyr
76.6
0.852
86.4
0.847
97.1
0.919
4
write_bmp_header
75.6
0.539
78.9
0.541
82.6
0.585
5
interpolate_aux
78.4
2.142
86.7
2.135
93.2
2.469
6
matrix_mult
80.7
2.270
89.8
2.263
94.7
2.355
7
idct_col
74.3
1.654
87.2
1.655
89.3
1.732
8
jpeg_idct_fast
72.4
1.370
91.3
1.371
92.6
1.459
9
jpeg_fdct_islow
79.8
1.415
95.8
1.418
97.1
1.544
10
smooth_color_z_triangle
71.2
1.022
88.6
1.019
91.9
1.114
11
invert_matrix
88.4
3.638
102
3.671
103.7
3.715
76.6
1.473
88.7
1.382
92.1
1.571
Average
6. References
[1]
[2]
compared to a temperature-unaware low-power
synthesis technique, highlighting the advantages of a
temperature-aware technique over temperatureunaware low-power methods. The average peak chip
temperature for the MediaBench benchmarks using our
method was 76.6 C. Using Method-A, the average peak
temperature was 88.7 C, while using Method-B
resulted in an average peak temperature of 92.1 C.
Our temperature-aware synthesis methodology is
able to achieve significant reductions in peak
temperature over power minimizing methods A and B.
Peak temperature improvements over Method-A
averaged 13.5%, with a maximum improvement of up
to of 20.7%. The average peak temperature
improvement over Method-B was 16.7%, with a
maximum improvement of 22.5%. Unless a synthesis
methodology accounts for lateral thermal diffusion
between modules on a floorplan, only minimizing peak
module power or average power alone is inadequate in
minimizing peak chip temperature.
The improvements in peak temperatures using our
method are achieved with only a modest area overhead
that averages to less than 10% for the tested
benchmarks, with a peak area overhead of 12.6% for
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
International Technology Roadmap for Semiconductors, http://public.itrs.net
K. Banerjee, S.-C. Lin, and N. Srivastava, “Electrothermal engineering in the
nanometer era: from devices and interconnects to circuits and systems”, ASP-DAC
2006, pp. 223-230.
A.H. Ajami et. al., “Analysis of non-uniform temperature-dependent interconnect
performance in high-performance ICs,” DAC 2001, pp. 567-572.
M. Pedram and S. Nazarian, “Thermal modeling, analysis, and management in VLSI
circuits: Principles and Methods,” Proc. IEEE, August 2006, pp.1487-1501.
K. Roy, S. Mukhopadhyay, and H. Meimand, “Leakage current mechanisms and
leakage reduction techniques in deep-submicron CMOS circuits,” Proc. IEEE, 2003,
vol. 91, pp.305-327.
H. Murata et a., “VLSI module placement based on rectangular packing by sequence
pair,” IEEE Trans. CAD, Dec. 1996.
Y. Yang, Z. Gu, C. Zhu, R.P. Dick, and Li Shang, “ISAC: Integrated space and time
adaptive chip-package thermal analysis,” IEEE Trans. CAD, vol 26, no.1, 2007.
L. Skandron, K. Sankaranarayanan, S. Veluswamy, and D. Tarjan, “Temperatureaware microarchitecture modeling and implementation,” ACM TACO 2004.
M. Healy et al., “Microarchitectural floorplanning under performance and thermal
tradeoff,” DATE 2007.
B. Goplen and S. Sapatnekar, “Efficient thermal placement of standard cells in 3D
ICs using a force-directed approach,” ICCAD 2003.
J. Cong, J. Wei, and Y. Zhang, “A thermal-driven floorplanning algorithm for 3D
ICs,” ICCAD 2004.
A. Gupta et. al., “Thermal aware global routing for VLSI chips for enhanced
reliability,” ISQED 2008.
R. Mukherjee, S.O.Memik, and G. Memik, “Temperature-aware resource allocation
and binding in high-level synthesis,” DAC 2005.
R. Mukherjee, S.O.Memik, and G. Memik, “Peak temperature control and leakage
reduction during binding in high-level synthesis,” ISLPED 2005.
P. Lim and T. Kim, “Thermal aware high-level synthesis based on network flow
method,” CODES + ISSS 2006.
R. Mukherjee and S.O. Memik, “An integrated approach to thermal management in
high-level synthesis,” IEEE Trans. VLSI, November 2006.
http://express.ece.ucsb.edu/benchmark/
424
Authorized licensed use limited to: University of South Florida. Downloaded on February 11, 2009 at 23:54 from IEEE Xplore. Restrictions apply.
Download