Reducing On-Chip Power Consumption via Data Value Prediction

Lisa Zorn (LMZ@EECS.Berkeley.EDU)
Vickie Chan (CHANV@cory.EECS.Berkeley.EDU)
Alan Chou (ALCHOU@cory.EECS.Berkeley.EDU)
CS 252 - Graduate Computer Architecture, Fall 2000
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
ABSTRACT
As minimum transistor sizes continue to drop below two microns and their respective source voltages are reduced to well under two volts, the power consumption of comparable digital circuits continues to fall. However, the power dissipated by on-chip metal interconnect is not scaling down as quickly as the transistors themselves, a situation roughly analogous to the "Less' Law" gap between memory and processor speed. This study attempts to reduce the power impact of buses via compression schemes adapted from data value prediction techniques, as well as schemes related to branch prediction and a closer examination of the data driven across buses. Finally, the benefits of the proposed compression techniques are evaluated via simulation of microprocessor bus activity. Two of the schemes in this study show promising results: a stride extension scheme and an opcode-indexed value caching scheme.
1. Introduction
Though the power consumption of comparable digital circuits is progressively falling as minimum transistor sizes continue to drop below two microns and their respective source voltages are reduced to well under two volts, the power dissipated by metal interconnect is becoming a larger issue. In essence, the power dissipated by interconnect is rapidly becoming more important than that dissipated by the transistors themselves: the power dissipated by bus interconnect fails to scale down as quickly as transistor properties do, a situation roughly analogous to the "Less' Law" gap between memory and processor speed. Indeed, power dissipation (heat) is already a significant issue in current microprocessors, which require progressively complex and expensive cooling strategies with each new generation of more highly integrated, denser digital circuitry.
This study consists of two parts: a limit study and a practical study. For both parts, we used various compression schemes adapted from data value prediction techniques to investigate methods of reducing the power dissipated by various buses. The basic architecture of this wire interconnect power reduction scheme consists of a pair of specialized data compressor/decompressor modules located at the two ends of a compressed communication bus, which is the exclusive bus used for the communication of compression information (see Figure 1, below).
Fig. 1 Specialized bus data value (de)compressor pair and the compressed communication bus.
The transition activity of several buses of a simulated microprocessor will be analyzed in terms of power dissipated, which will be measured as the overall number of wire interconnect transitions occurring for a particular process. Given the inherent assumption that bus wire interconnect power dissipation will soon outweigh the power dissipated by transistor computations, we can neglect transistor power dissipation and assess any power improvements by comparing the overall number of wire transitions required for a given sequence of operations. Thus, for a given simulation, the primary performance measure of an encoder-decoder system (hereafter the "transcoder" system) will be the number of transitions that occur on the bus lines with transcoder compression divided by the number of transitions that would have occurred on the bus lines without any such system: the percentage of the original transitions that are actually driven.
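Stated as a formula (the symbols here are our notation, restating the definition above):

\[ \text{transition percentage} = \frac{T_{\text{coded}}}{T_{\text{uncoded}}} \times 100\% \]

where \(T_{\text{coded}}\) counts every transition with the transcoder in place (data transitions plus any encoding overhead) and \(T_{\text{uncoded}}\) counts the transitions the raw, unencoded bus would require; lower is better.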
Additionally, for the practical study, we can obtain a more accurate performance metric by measuring the actual power used by a bus with a (simplified) transcoder system as a percentage of the actual power used by the bus without one. This figure will depend heavily on the choice of buses examined, since the power used by a bus is proportional to the square of its length.
Section 2 will describe the compression and simulation techniques used for the limit study portion of this investigation, and Section 3 will describe its results. Section 4 will describe the practical study portion of this investigation. Section 5 will provide concluding remarks and recommendations.
2. Limit Study Methodology
The limit study consisted of using a processor simulator (SimpleScalar) to generate various bus traces, together with a transcoder program that simulated the compression of the trace information and calculated its performance using the transition percentage metric described above.
2.1 SimpleScalar
The processor simulator used was a modified version of the SimpleScalar tool set (www.simplescalar.org). The particular version used was an out-of-order execution processor; it was modified so that various bus traces were output to the transcoder. This study focused on simulations of the bus trace data from three buses of interest. The first was the common data bus ("CDBData"); this bus was connected to all functional units of the processor, which used it to broadcast their computed results (see Figure 2 for the transcoder configuration). The second was the register commit bus ("RegCommit"); the re-order buffer used this bus to send final non-speculative values to the register file. The advantage of using these two buses for our study was that we could pull out the corresponding opcode and PC for each value sent over the bus and analyze their utility in aiding the compression; this aspect will be discussed further when we describe the transcoder. Finally, we ran some simulations on the memory address bus ("MemAddr"), which connects the processor with main memory. Each of these buses was simulated with 32 lines. A subset of the SPEC95 benchmarks was used in this study.
Fig. 2 Transcoder configuration for CDBData bus.
2.2 Transcoder
Sections 2.2.1-2.2.4 describe the baseline transcoder system; this work was done prior to our investigation by Victor Wen. Sections 2.2.5-2.2.7 describe our extensions to the prior work.
2.2.1 Finite State Machine
At the top-most level, the transcoder chooses between its various predictors via a finite state machine. For each bus input that the transcoder must encode, the system checks which predictor can send the input with the fewest transitions. This count includes the possible overhead of a transition to indicate a state change, should the predictor that sends this input differ from the predictor that sent the previous input. In tallying the total transitions made using the transcoder system, we differentiate between data transitions and state-change overhead transitions. Additionally, the transcoder universally encodes a no-change input (i.e., when the current input is the same as the previous input) as zero transitions; thus, any other value must be encoded by at least one transition. A description of each of the predictors used in this study follows, after a brief sketch of this selection logic.
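As a minimal illustration of the top-level selection loop (the cost()/update() predictor interface and all names are our assumptions, not the actual transcoder code):

    def transitions(prev_bus, new_bus):
        """Wire transitions between successive bus states = Hamming distance;
        predictor cost() implementations would count transitions this way."""
        return bin(prev_bus ^ new_bus).count("1")

    class Transcoder:
        def __init__(self, predictors):
            self.predictors = predictors   # e.g. [no_pred, stride_pred, ctx_pred]
            self.current = 0               # index of the predictor used last
            self.prev = None               # previous bus input

        def encode(self, value):
            """Return the cheapest transition count for sending `value`."""
            if value == self.prev:
                return 0                   # a repeated input always costs zero
            best_i, best_cost = self.current, None
            for i, p in enumerate(self.predictors):
                cost = p.cost(value)       # transitions this predictor would need
                if cost is None:
                    continue               # this predictor cannot encode the value
                if i != self.current:
                    cost += 1              # overhead transition for a state change
                if best_cost is None or cost < best_cost:
                    best_i, best_cost = i, cost
            for p in self.predictors:
                p.update(value)            # every predictor observes the stream
            self.current, self.prev = best_i, value
            return best_cost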
2.2.2 No Predictor
The most basic "predictor" is not really a predictor at all, because we must be able to send data across even when we have no cues for prediction. Nonetheless, the no predictor still attempts to minimize transitions by including a finite state machine of its own. Here, each state corresponds to a particular simple representation of the data: the XOR of the input with the previous input, the negation of the input, the direct input, and the negation of the XOR. Thus, even when we cannot intelligently encode the data, we at least use the most compact simple representation.
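A sketch of those four encodings (a 32-bit bus is assumed; the function name is ours):

    MASK = 0xFFFFFFFF  # models a 32-bit bus

    def no_predictor_cost(prev_input, prev_bus, value):
        """Fewest transitions needed to drive `value` via a simple encoding."""
        candidates = [
            value,                         # direct input
            value ^ prev_input,            # XOR with the previous input
            ~value & MASK,                 # negation of the input
            ~(value ^ prev_input) & MASK,  # negation of the XOR
        ]
        # Each candidate is the pattern actually driven; its cost is the
        # Hamming distance from whatever the bus lines currently hold.
        return min(bin(c ^ prev_bus).count("1") for c in candidates)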
2.2.3 Stride Predictor
The stride predictor caches a table of strides; given an input, it calculates the stride from the previous input and searches its cache of strides. If there is a hit, the stride predictor can represent the input by representing the stride. Since the transcoder that is decoding the bus also updates its stride cache with every piece of received data, the two caches stay identical, and the data can be encoded as an index into the stride table.
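A minimal sketch under assumptions about the structure (an LRU stride cache; the names are ours):

    class StridePredictor:
        def __init__(self, size=16):
            self.strides = []     # most-recently-used strides first
            self.size = size      # capacity of the stride cache
            self.last = 0         # previous input value

        def lookup(self, value):
            """Return a stride-table index if the stride hits, else None."""
            stride = value - self.last
            return self.strides.index(stride) if stride in self.strides else None

        def update(self, value):
            stride = value - self.last
            if stride in self.strides:
                self.strides.remove(stride)
            self.strides.insert(0, stride)   # MRU insertion
            del self.strides[self.size:]     # evict least-recently-used stride
            self.last = value

Because encoder and decoder run the same update() on the same value stream, an index transmitted by one side always names the same stride on the other.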
2.2.4 Context Based Predictor
The context based predictor works similarly to the stride predictor, except that it looks for and caches value contexts up to a given length instead of simple strides.
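The baseline's context mechanism is not detailed here; as a rough, assumed sketch, a fixed-order finite-context predictor in the style of Sazeides and Smith [7] might look like:

    class ContextPredictor:
        def __init__(self, order=3, size=1024):
            self.order = order      # context length (number of previous values)
            self.size = size        # prediction-table capacity
            self.history = []       # the last `order` values seen
            self.table = {}         # context key -> value that followed it

        def predict(self):
            key = hash(tuple(self.history)) % self.size
            return self.table.get(key)   # None when this context is new

        def update(self, value):
            key = hash(tuple(self.history)) % self.size
            self.table[key] = value      # remember what followed this context
            self.history = (self.history + [value])[-self.order:]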
2.2.5 Stride Predictor Extension
We observed that on certain buses, multiple stride-based streams are sent in an interleaved fashion. If two such streams have different strides, they will not hit in the context based predictor (unless they are of short duration and restart simultaneously). They will also miss in the stride predictor, because the difference between any one value and the previous one keeps changing due to the unequal strides of the two streams. We therefore modified the stride-based predictor to cache more than one last value and to calculate a stride from each; it then searches the stride cache for each of those candidate strides and updates the cache with each of them. This method can predict such interleaved stride streams at little cost in hardware, as sketched below.
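A sketch of the extension (num_last corresponds to the numLastValuesCached parameter used in Section 3.1; the rest of the structure is our assumption):

    class StridePredictorN:
        def __init__(self, size=16, num_last=2):
            self.strides = []            # most-recently-used strides first
            self.size = size
            self.lasts = [0] * num_last  # the last N values, newest first

        def lookup(self, value):
            for last in self.lasts:
                stride = value - last
                if stride in self.strides:
                    return self.strides.index(stride)  # hit: encode table index
            return None                                # no candidate stride hit

        def update(self, value):
            for last in self.lasts:
                stride = value - last
                if stride in self.strides:
                    self.strides.remove(stride)
                self.strides.insert(0, stride)         # MRU insertion
            del self.strides[self.size:]
            self.lasts = [value] + self.lasts[:-1]     # shift in the new value

For example, with two interleaved streams a0, b0, a1, b1, ... where the a-stream advances by 4 and the b-stream by 8, the input a1 still hits: one of the cached last values is a0, so the candidate stride a1 - a0 = 4 is found in the table.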
2.2.6 PC-indexed Last Value Caching Extension
Observations in [7] indicate that the number of values generated by any given static instruction is limited; for example, Sazeides and Smith observe that "the majority, >= 50%, of dynamic instructions correspond to static instructions that generate fewer than 64 values." Using this idea, we incorporated a new state, the PC-indexed value cache state, which uses the PC to index into a cache of entries with variable associativity. For each entry, we cache a fixed number of previously transmitted inputs. Replacement of PC-indexed entries, as well as replacement of inputs within each entry, is done using an LRU policy. The idea of this scheme is that PC indexing will help us cache the most useful data. Using the PC to index into the cache requires that the PC be communicated to the decoding transcoder as well, and the PC itself must be encoded somehow; strictly, the PC encoding should therefore be counted in our transition tallies. However, since this is a limit study, we were interested first in learning how much benefit the PC could possibly give us if we could communicate it without cost; including the PC encoding in our performance measure is beyond the scope of this paper.
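A sketch of such a cache, using the parameters from Section 3.2 (64 total entries, 2-way set associative, 4 values per entry); the class name and layout are our illustration:

    class IndexedValueCache:
        def __init__(self, sets=32, ways=2, values_per_entry=4):
            self.sets, self.ways, self.n = sets, ways, values_per_entry
            self.cache = [[] for _ in range(sets)]  # per set: LRU list of entries

        def lookup(self, key, value):
            """Return a small index if `value` is cached under `key`, else None."""
            s = key % self.sets
            for tag, values in self.cache[s]:
                if tag == key and value in values:
                    return values.index(value)
            return None

        def update(self, key, value):
            s = key % self.sets
            entries = self.cache[s]
            for i, (tag, values) in enumerate(entries):
                if tag == key:
                    if value in values:
                        values.remove(value)
                    values.insert(0, value)            # LRU among cached values
                    del values[self.n:]
                    entries.insert(0, entries.pop(i))  # LRU among entries
                    return
            entries.insert(0, (key, [value]))          # new entry for this key
            del entries[self.ways:]                    # evict the set's LRU entry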
2.2.7 Opcode-indexed Last Value Caching Extension
The opcode extension we included is exactly analogous to the PC extension above, except that opcodes are used to index entries in the cache. Again, each entry contains a fixed number of previously transmitted inputs. Because of the nature of opcodes, it was most logical to make the opcode cache fully associative and smaller (studies have shown that relatively few opcodes are actually used with high frequency over the course of a program); in exchange, we cached a larger number of inputs per opcode.
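The IndexedValueCache sketch from the previous section covers this variant as well: key by opcode and collapse the table into a single fully associative set, matching the OP16 and OP32 configurations evaluated in Section 3.3:

    op16 = IndexedValueCache(sets=1, ways=16, values_per_entry=16)
    op32 = IndexedValueCache(sets=1, ways=16, values_per_entry=32)
    # used as op16.lookup(opcode, value) / op16.update(opcode, value)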
Fig. 3 Transition Reduction for CDBData Bus: percent of uncoded transitions (0-70%) for the original, stride2, PC4, OP16, and OP32 schemes across the benchmarks applu, fpppp, su2cor, swim, turb3d, wave5, compress, and ijpeg.
Fig. 4 Transition Reduction for RegCommit Bus: percent of uncoded transitions (0-80%) for the original, stride2, PC4, OP16, and OP32 schemes across the same benchmarks.
3. Limit Study Results
3.1 Stride Extension Performance
The stride extension (with numLastValuesCached = 2) was tested on a subset of the SPEC95 benchmarks, and the results were positive. A comparison of the baseline transcoder with the stride-extension-enhanced transcoder showed a further decrease in transitions of up to 4 percent for the CDBData and RegCommit data buses (Figures 3 and 4), and of up to 25 percent on the MemAddr bus (Figure 5). This is not a surprising result, since memory address streams are naturally rich in regular strides, and it raises the possibility of optimizing a transcoder for the particular bus it serves. To further analyze the stride extension's performance, we look at the breakdown of prediction states used in the MemAddr simulations (Figure 6). The figure indicates exactly what we would hope: the stride extension does not reduce context-based predictor hits, but rather takes away wholly from the no-predictor hits. Considering that the stride extension is so low cost (in this case, we cache only one more 32-bit value than the baseline) and that it appears to be universally helpful, this simple extension is recommended.
Fig. 5 Transition Reduction for MemAddr Bus: percent of uncoded transitions (0-80%) for the original and stride2 schemes across the same benchmarks.
3.2 PC-indexed Last Value Caching Extension Performance
The PC-indexed last value caching extension was run with 2-way set associativity and 64 total entries; each entry cached 4 bus values. Figures 3 and 4 show the performance of these simulations for the CDBData and RegCommit data buses. The results for these parameters do not seem encouraging. However, given the extremely limited nature of our simulations, this extension should be studied further with different parameters.
3.3 Opcode-indexed Last Value Caching Extension Performance
The opcode-indexed last value caching extension was run with a fully associative 16-entry table. Two simulations were performed: in OP16, each entry cached 16 bus values, and in OP32, each entry cached 32. Figures 3 and 4 show that both schemes provide some performance improvement, although for many benchmarks the difference between them is minimal. Figure 7 confirms this by showing that the indexed value cache (ICV) state captures some hits beyond those of the baseline predictors. The results for these parameters are encouraging, but again, this extension should be studied further.
Fig. 6 Encoding Breakdown for MemAddr Bus: percentage of inputs handled by each predictor state (SN, CB, NO) for each benchmark, baseline vs. stride extension (.str2).
Fig. 7 Encoding Breakdown for CDBData Bus: percentage of inputs handled by each predictor state (ICV, CB, SN, NO) for each benchmark, baseline vs. the OP16 and OP32 configurations.
4. Notes on a Practical Study
A practical study was attempted to study the actual hardware needed to implement some
simple data predictors; this would be particularly interesting because no prior studies
have been done in this area. These simulations would provide information on how much
power a specific transcoder uses, and they would help determine exactly which buses it
would be effective for. In the worst case, the power dissipated by the compressor and
the bus lines should not exceed the power dissipated if the compressor was not used.
The study was attempted using tools provided by Synopsys, such as Power Mill.
However, due to lack of time, completed simulations are not available to include in this
report.
5. Concluding Remarks and Recommendations
The results of this study were encouraging but not conclusive. In particular, the stride extension is an inexpensive, effective addition to the transcoder system, while the opcode extension shows much promise. Several variations on this study would provide interesting data.
First, it would be interesting to change the structure of the FSM. Instead of incorporating the indexing predictors into the existing transcoder FSM, it would be worthwhile to build a higher-level predictor that uses an indexed value-caching system as one predictor and the baseline transcoder as the other. In that way we could attempt to discover the maximum effectiveness of the indexed value-caching system.
Additionally, studies should be made that vary the other parameters of the program, such as the baseline transcoder cache tables. Since the indexed value-caching system is fairly hardware intensive, the baseline transcoder might prove comparably effective given the same cache sizes.
6. Acknowledgements
We would like to thank Victor Wen for his original work and help on the transcoder and
Mark Whitney for his assistance with and notes on the SimpleScalar processor simulator.
7. References
[1] Doug Joseph and Dirk Grunwald; "Prefetching Using Markov Predictors," in Proceedings of the International Symposium on Computer Architecture (ISCA '97), Denver, Colorado, June 1997.
[2] Dean M. Tullsen and John S. Seng; "Storageless Value Prediction Using Prior Register Values," in Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
[3] Doug Burger, Stefanos Kaxiras, and James R. Goodman; "DataScalar Architectures," in Proceedings of the International Symposium on Computer Architecture (ISCA '97), Denver, Colorado, June 1997.
[4] Youtao Zhang, Jun Yang, and Rajiv Gupta; "Frequent Value Locality and Value-Centric Data Cache Design," in Proceedings of ASPLOS-IX, Cambridge, MA, 2000.
[5] Kai Wang and Manoj Franklin; "Highly Accurate Data Value Prediction Using Hybrid Predictors," IEEE, pp. 281-290, 1997.
[6] Pedro Marcuello, Jordi Tubella, and Antonio González; "Value Prediction for Speculative Multithreaded Architectures."
[7] Yiannakis Sazeides and James E. Smith; "The Predictability of Data Values," in Proceedings of MICRO-30, Research Triangle Park, NC, December 1-3, 1997.
[8] Brad Calder, Glenn Reinman, and Dean M. Tullsen; "Selective Value Prediction," in Proceedings of the 26th International Symposium on Computer Architecture, pp. 1-11, May 1999.
[9] A. N. Eden and T. Mudge; "The YAGS Branch Prediction Scheme," in Proceedings of the 31st International Symposium on Microarchitecture, pp. 69-77, December 1998.