Reducing On-Chip Power Consumption via Data Value Prediction

Lisa Zorn (LMZ@EECS.Berkeley.EDU)
Vickie Chan (CHANV@cory.EECS.Berkeley.EDU)
Alan Chou (ALCHOU@cory.EECS.Berkeley.EDU)

CS 252 - Graduate Computer Architecture, Fall 2000
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley

ABSTRACT

As minimum transistor sizes continue to drop below two microns and their respective source voltages are reduced to well under two volts, the power consumption of comparable digital circuits continues to fall. However, the power dissipated by on-chip metal interconnect is not scaling down as quickly as the transistors themselves, a situation roughly analogous to the "Less's Law" gap between memory and processor speed. This study attempts to reduce the power impact of buses via compression schemes adapted from data value prediction techniques, as well as schemes related to branch prediction and a closer examination of the data driven across buses. Finally, the benefits of this proposed compression technique are evaluated via simulation of microprocessor bus activity. Two of the schemes in this study show promising results: a stride extension scheme and an opcode-indexed value caching scheme.

1. Introduction

Though the power consumption of comparable digital circuits is progressively falling as minimum transistor sizes continue to drop below two microns and their respective source voltages are reduced to well under two volts, the power dissipated by metal interconnect is becoming a larger issue. In essence, the power dissipated by interconnect is rapidly becoming more important than that dissipated by the transistors themselves; the power dissipated by bus interconnect fails to scale down as quickly as transistor properties do, a situation roughly analogous to the "Less's Law" gap between memory and processor speed. Indeed, power dissipation (heat) is already a significant issue in current microprocessors, which require progressively complex and expensive cooling strategies with each new generation of more highly integrated, denser digital circuitry.

This study is comprised of two parts: a limit study and a practical study. For both parts, we used various compression schemes adapted from data value prediction techniques to investigate methods of reducing the power dissipated by various buses. The basic architecture of this wire interconnect power reduction scheme consists of a pair of specialized data compressor/decompressor modules located at the two ends of a compressed communications bus, which is used exclusively to communicate the compressed information (see Figure 1, below).

Fig. 1 Specialized bus data value (de)compressor pair and their compressed communication bus.

The transition activity of several buses of a simulated microprocessor will be analyzed in terms of power dissipated, which will be measured as the overall number of wire interconnect transitions occurring for a particular process. Given the inherent assumption that bus wire interconnect power dissipation will soon outweigh the power dissipated via transistor computations, we can neglect transistor power dissipation and assess any power improvements via a comparison of the resulting overall number of wire transitions required for a given sequence of operations.
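To make the transition-count metric concrete, the following minimal sketch (our own Python illustration, not code from the simulator) counts the transitions a sequence of 32-bit words causes on an uncoded bus; each word contributes the number of bit positions in which it differs from its predecessor (the Hamming distance of the XOR).

    def count_transitions(trace, width=32):
        """Total wire transitions caused by driving `trace` onto a bus.

        A line toggles whenever its bit differs between consecutive
        words, so each word contributes the popcount of its XOR with
        the previous word.
        """
        mask = (1 << width) - 1
        transitions = 0
        prev = 0  # assume all lines start at logic 0
        for word in trace:
            transitions += bin((word ^ prev) & mask).count("1")
            prev = word
        return transitions

The figure of merit used throughout the limit study is then the ratio count_transitions(encoded trace) / count_transitions(raw trace), i.e., the percentage of the original transitions still driven.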
Thus, for a given simulation, the primary performance measure of an encoder-decoder system (hereafter the "transcoder" system) will be the number of transitions that occur on the bus lines with the transcoder compression divided by the number of transitions that would have occurred on the bus lines without any such system; that is, the percentage of the original transitions that are still being driven. Additionally, for the practical study, we will be able to measure a more accurate metric of performance by measuring the actual power used by a bus with a (simplified) transcoder system as a percentage of the actual power used by the bus without one. This will be heavily dependent on the choice of buses examined, since the power used by the bus scales with its length.

Section 2 describes the compression and simulation techniques used for the limit study portion of this investigation, and Section 3 describes its results. Section 4 describes the practical study portion of this investigation. Section 5 provides concluding remarks and recommendations.

2. Limit Study Methodology

The limit study consisted of using a processor simulator (SimpleScalar) to generate various bus traces, together with a transcoder program which simulated the compression of the trace information and calculated its performance using the transition percentage metric described above.

2.1 SimpleScalar

The processor simulator used was a modified version of the SimpleScalar tool set (www.simplescalar.org). The particular version used was an out-of-order execution processor; it was modified so that various bus traces were output to the transcoder. Most notably, this study focused on simulations of the bus trace data from three buses of interest. The first bus was the common data bus ("CDBData"); this bus is connected to all functional units of the processor and is used by them to broadcast their computed results (see Figure 2 for the transcoder configuration). The second bus examined was the register commit bus ("RegCommit"); the re-order buffer uses this bus to send final non-speculative values to the register file. The advantage of using these two buses for our study was that we could pull out the corresponding opcode and PC for each value sent over the bus and analyze their utility in aiding the compression; this aspect will be discussed further when we describe the transcoder. Finally, we ran some simulations on the memory address bus ("MemAddr"), which connects the processor with main memory. Each of these buses was simulated with 32 lines. A subset of the SPEC95 benchmarks was included in this study.

Fig. 2 Transcoder configuration for CDBData bus.

2.2 Transcoder

Sections 2.2.1-2.2.4 describe the baseline transcoder system; this work was done prior to our investigation by Victor Wen. Sections 2.2.5-2.2.7 describe our extensions to the prior work.

2.2.1 Finite State Machine

At the top-most level, the transcoder chooses between its various predictors via a finite state machine. For each bus input that the transcoder must transmit, the system checks which predictor is capable of sending the input with the fewest transitions. This count includes the possible overhead of a transition to indicate a state change, should the predictor that sends this input differ from the predictor that sent the previous input.

In tallying the total transitions made using the transcoder system, we differentiate between the data transitions and the state-change overhead transitions. Additionally, the transcoder universally encodes a no-change input (i.e., when the current input is the same as the previous input) as zero transitions; thus, any other value must be encoded by at least one transition. A sketch of this selection policy appears below, followed by a description of each of the predictors used in this study.
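As a rough sketch of the selection policy (our own Python illustration; the cost/update predictor interface is an assumption for exposition, not the actual transcoder interface), each predictor reports how many transitions it would need to encode the current input, one extra transition is charged for switching predictors, and every predictor is updated on every value so that the encoder's and decoder's tables stay in lockstep.

    def encode_trace(trace, predictors):
        """Tally transitions when, for each input, the transcoder uses
        the predictor that can send that input most cheaply."""
        total = 0
        current = None  # predictor that sent the previous input
        for value in trace:
            # Cost of each candidate, plus one overhead transition if
            # choosing it would signal a predictor (state) change.
            def cost(p):
                return p.cost(value) + (0 if p is current else 1)
            best = min(predictors, key=cost)
            total += cost(best)
            current = best
            # Both transcoders update every predictor on every value,
            # keeping the decoder's tables identical to the encoder's.
            for p in predictors:
                p.update(value)
        return total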
2.2.2 No Predictor

The most basic "predictor" is not really a predictor at all, because we must be able to send data across even when we have no cues for predicting. Nonetheless, the no predictor still attempts to minimize transitions by including a finite state machine of its own. Here, each state corresponds to a particular simple representation of the data: the XOR of the input with the previous input, the negation of the input, the direct input, and the negation of the XOR of the input with the previous input. Thus, even when we cannot intelligently encode the data, we will at least use the most compact of these simple representations.

2.2.3 Stride Predictor

The stride predictor simply caches a table of strides: given an input, it calculates the stride from the previous input and then searches its cache of strides. If there is a hit, the stride predictor can represent the input by representing the stride. Since the transcoder that is decoding the bus also updates its stride cache with every piece of received data, the two caches are identical, and the data can be encoded as an index into the stride table.

2.2.4 Context Based Predictor

The context based predictor works similarly to the stride predictor, except that it looks for and caches contexts up to a given length instead of simple strides.

2.2.5 Stride Predictor Extension

The observation we made with respect to the baseline described above was that on certain buses, multiple stride-based streams are sent in an interleaved fashion. If two such stride-based streams have different strides, they will not hit in the context based predictor (unless they are of short duration and restart simultaneously). Additionally, they will miss in the stride predictor, because the difference between any one value and the previous one keeps changing due to the unequal strides of the two streams. We therefore modified the stride-based predictor to cache more than one last value and calculate a stride from each. It then searches the stride cache for each one of those strides, and also updates the stride cache with each. This method can now predict such interleaved stride-streams with little cost in hardware, as the sketch below illustrates.
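The following is a minimal sketch of the extension (our own Python illustration; the fully associative stride table, its LRU replacement, and names such as num_last_values are assumptions for exposition). With two last values cached, each interleaved stream's true stride is eventually recomputed against a value from the same stream, so the streams hit where a single last value would always miss.

    class StrideExtension:
        """Stride predictor caching the last N values (N=2 gives the
        'stride2' configuration evaluated in Section 3.1)."""

        def __init__(self, num_last_values=2, table_size=16):
            self.num_last_values = num_last_values
            self.table_size = table_size
            self.last_values = []  # most recent inputs, oldest first
            self.strides = []      # stride table, oldest first (LRU)

        def hits(self, value):
            """True if any candidate stride is already in the table."""
            return any(value - lv in self.strides for lv in self.last_values)

        def update(self, value):
            # Insert one candidate stride per cached last value, LRU style.
            for lv in self.last_values:
                stride = value - lv
                if stride in self.strides:
                    self.strides.remove(stride)
                self.strides.append(stride)
                if len(self.strides) > self.table_size:
                    self.strides.pop(0)
            self.last_values.append(value)
            if len(self.last_values) > self.num_last_values:
                self.last_values.pop(0)

For example, on the interleaved trace 0, 100, 4, 112, 8, 124, ... (one stream of stride 4, one of stride 12), the strides 4 and 12 both enter the table after two values of each stream, and every subsequent value hits.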
2.2.6 PC-indexed Last Value Caching Extension

In [7], Sazeides and Smith present observations indicating that the number of values generated by any given static instruction is limited; for example, they observe that "the majority, >= 50%, of dynamic instructions correspond to static instructions that generate fewer than 64 values." Using this idea, we incorporated a new state, the PC-indexed value cache state, which uses the PC to index into a cache of entries with variable associativity. For each entry, we cache a fixed number of inputs that were transmitted. Replacement of PC-indexed entries, as well as replacement of inputs within each entry, is done using an LRU policy. The idea of this scheme is that the PC indexing will help us cache the most useful data.

Using the PC to index into the cache requires that the PC be communicated to the decoding transcoder as well, and the PC itself must be encoded somehow; strictly speaking, the PC encoding should therefore be counted in our transition counts. However, since this is a limit study, we were interested first in learning how much benefit the PC could possibly give us if we could communicate it without cost, so including the PC encoding in our performance measure is beyond the scope of this paper.

2.2.7 Opcode-indexed Last Value Caching Extension

The opcode extension we included is exactly analogous to the PC extension described above, except that opcodes are used to index the entries in our cache. Again, each entry contains a fixed number of previously transmitted inputs. Because of the nature of opcodes, it was most logical to make the opcode cache fully associative and smaller (studies have shown that relatively few opcodes are actually used with high frequency over the course of a program); accordingly, we cached a higher number of inputs per opcode. A sketch of this indexed value caching follows.
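The sketch below illustrates the idea (our own Python approximation; the key would be the opcode here, or the PC for the extension of Section 2.2.6, and the parameter defaults match the OP16 configuration). A hit means the bus value can be represented by a small entry/slot index rather than by the full 32-bit word.

    class IndexedValueCache:
        """Last-value cache indexed by opcode (or PC), fully associative
        over keys, with LRU replacement of keys and of values within
        each entry."""

        def __init__(self, num_entries=16, values_per_entry=16):
            self.num_entries = num_entries
            self.values_per_entry = values_per_entry
            self.entries = {}  # key -> values, least recently used first

        def lookup(self, key, value):
            """Slot index of `value` under `key`, or None on a miss."""
            values = self.entries.get(key, [])
            return values.index(value) if value in values else None

        def update(self, key, value):
            # Evict the least recently updated key if a new key would
            # overflow the table.
            if key not in self.entries and len(self.entries) >= self.num_entries:
                del self.entries[next(iter(self.entries))]
            values = self.entries.pop(key, [])  # re-inserted as most recent
            if value in values:
                values.remove(value)
            values.append(value)
            if len(values) > self.values_per_entry:
                values.pop(0)  # drop the least recently seen value
            self.entries[key] = values

(Python 3.7+ dictionaries preserve insertion order, so removing and re-inserting a key on each update keeps the table ordered by recency.)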
3. Limit Study Results

3.1 Stride Extension Performance

The stride extension (with numLastValuesCached=2) was tested on a subset of the SPEC95 benchmarks, and the results were positive. A comparison of the performance of the baseline transcoder with the stride-extension enhanced transcoder showed a further decrease in transitions of up to 4 percent for the CDBData and RegCommit buses (Figures 3 and 4), and up to a 25 percent decrease in transitions on the MemAddr bus (Figure 5). Given the strided nature of memory address streams, this is not a surprising result, and it raises the possibility of optimizing transcoders for the bus that the system is serving.

Fig. 3 Transition Reduction for CDBData Bus (percent of uncoded transitions for the original, stride2, PC4, OP16, and OP32 configurations on the benchmarks applu, fpppp, su2cor, swim, turb3d, wave5, compress, and ijpeg).

Fig. 4 Transition Reduction for RegCommit Bus (same configurations and benchmarks as Figure 3).

To further analyze the stride extension performance, we look at the breakdown of prediction states used in the MemAddr simulations (Figure 6). The figure indicates exactly what we would hope: the stride extension does not reduce context-based predictor hits, but rather takes away wholly from the no-predictor hits. Considering that the stride extension is so low cost (in this case, we are only caching one more 32-bit value than the baseline) and that it seems to be universally helpful, this simple extension is recommended.

Fig. 5 Transition Reduction for MemAddr Bus (percent of uncoded transitions for the original and stride2 configurations on the same benchmarks).

3.2 PC-indexed Last Value Caching Extension Performance

The PC-indexed last value caching extension was run with 2-way set associativity and 64 total entries; each entry cached 4 bus values. Figures 3 and 4 show the performance of these simulations for the CDBData and RegCommit buses. The results for these parameters do not seem encouraging. However, it is advisable that this extension be studied further using different parameters, because of the extremely limited nature of our simulations.

3.3 Opcode-indexed Last Value Caching Extension

The opcode-indexed last value caching extension was run with a fully associative 16-entry table. Two simulations were performed: in OP16, each entry cached 16 bus values, and in OP32, each entry cached 32. Figures 3 and 4 show that both of these schemes give us some performance improvement, although for many benchmarks the difference between them is minimal. Figure 7 confirms this by indicating that the extensions achieve more hits than the baseline original. The results for these parameters are encouraging, but again, it is advisable that this extension be studied further.

Fig. 6 Encoding Breakdown for MemAddr Bus (fraction of inputs encoded in each predictor state, SN, CB, and NO, for each benchmark and its stride2 variant).

Fig. 7 Encoding Breakdown for CDBData Bus (fraction of inputs encoded in each predictor state, ICV, CB, SN, and NO, for each benchmark in its baseline, OP16, and OP32 configurations).

4. Notes on a Practical Study

A practical study was attempted in order to examine the actual hardware needed to implement some simple data predictors; this would be particularly interesting because no prior studies have been done in this area. These simulations would provide information on how much power a specific transcoder uses, and they would help determine exactly which buses it would be effective for. In the worst case, the power dissipated by the compressor and the bus lines together should not exceed the power dissipated when the compressor is not used. The study was attempted using tools provided by Synopsys, such as PowerMill. However, due to lack of time, completed simulations are not available to include in this report.

5. Concluding Remarks and Recommendations

The results of this study were encouraging but not conclusive. In particular, the stride extension is an inexpensive, effective addition to the transcoder system, while the opcode extension shows much promise. Several variations on this study would provide interesting data. First, it would be interesting to change the structure of the FSM: instead of incorporating the indexing predictors into the existing transcoder FSM, it would be worthwhile to build a higher-level predictor that uses an indexed value-caching system as one predictor and the baseline transcoder as the other. In that way, we could attempt to discover the maximum effectiveness of the indexed value-caching system. Additionally, studies should be made which vary the other parameters of the program, such as the sizes of the baseline transcoder cache tables. Since the indexed value-caching system is fairly hardware intensive, the baseline transcoder might prove comparably effective given the same cache sizes.

6. Acknowledgements

We would like to thank Victor Wen for his original work on, and help with, the transcoder, and Mark Whitney for his assistance with and notes on the SimpleScalar processor simulator.

7. References

[1] Doug Joseph and Dirk Grunwald; "Prefetching using Markov Predictors," in Proceedings of the International Symposium on Computer Architecture (ISCA '97), Denver, Colorado, June 1997.

[2] Dean M. Tullsen and John S. Seng; "Storageless Value Prediction Using Prior Register Values," in Proceedings of the 26th International Symposium on Computer Architecture, May 1999.
[3] Doug Burger, Stefanos Kaxiras, and James R. Goodman; "DataScalar Architectures," in Proceedings of the International Symposium on Computer Architecture (ISCA '97), Denver, Colorado, June 1997.

[4] Youtao Zhang, Jun Yang, and Rajiv Gupta; "Frequent Value Locality and Value-Centric Data Cache Design," in Proceedings of ASPLOS-IX, Cambridge, Massachusetts, 2000.

[5] Kai Wang and Manoj Franklin; "Highly Accurate Data Value Prediction using Hybrid Predictors," IEEE, pp. 281-290, 1997.

[6] Pedro Marcuello, Jordi Tubella, and Antonio González; "Value Prediction for Speculative Multithreaded Architectures."

[7] Yiannakis Sazeides and James E. Smith; "The Predictability of Data Values," in IEEE Proceedings of MICRO-30, Research Triangle Park, North Carolina, December 1-3, 1997.

[8] Brad Calder, Glenn Reinman, and Dean M. Tullsen; "Selective Value Prediction," in Proceedings of the 26th International Symposium on Computer Architecture, pp. 1-11, May 1999.

[9] A. N. Eden and T. Mudge; "The YAGS Branch Prediction Scheme," in Proceedings of the 31st International Symposium on Microarchitecture, pp. 69-77, December 1998.