Compressed Instruction Cache to Increase Instruction Cache Memory

Nicholas Meloche, David Lautenschlager, and Prashanth Jadardanan
Dept. of Computer Science and Engineering, Washington University, Saint Louis, MO 63130
nrm1@cec.wustl.edu, Dcl1@cec.wustl.edu, ptj1@cec.wustl.edu

1 Abstract

Latencies in instruction fetch cycles slow the average throughput of a processor. In this paper we investigate compression techniques that reduce the amount of memory used by a set of instructions. In turn, these techniques decrease the average number of fetch-cycle cache misses, resulting in lower average fetch latencies. The basic technique involves compressing Reduced Instruction Set Computer (RISC) instructions in software at compile time and decompressing them in hardware at runtime. We propose a compile-time software application that encodes the RISC instructions and a real-time hardware decoder that decodes the resulting encoded instructions. We show that, on average, our technique compresses instruction sets to about 85% of their original size.

2 Introduction

Instructions within a RISC architecture intrinsically have a high probability of being similar to one another. This is because the architecture design, by its nature, is simple. A simple design implies a small set of instruction types, which in turn gives reason for optimism about compressibility.

Traditionally, computer architects have addressed fetch latency in a number of ways. Very Large Instruction Word (VLIW) processors increase the speed of fetch cycles, but they also increase code size. Other approaches, which convert RISC instructions into Complex Instruction Set Computer (CISC) style instructions, encode RISC instructions based on their interactions with neighboring instructions. This technique optimizes code locally, but it does not work well at the scale of an entire file.

In this paper, we propose decompressing pre-compressed code at runtime. This both reduces effective program size and increases average fetch-cycle speed. Our basic technique for compression involves compressing common bit strings while allowing uncommon bit strings to remain uncompressed, or even to expand. We show that the space saved by compressing the common cases outweighs the lack of compression in the uncommon cases.

To decode the instructions, a hardware module is added to the processor in an unobtrusive manner. This module decodes the compressed instructions before they reach the processor core. The module is also capable of reconfiguring itself for a different decoding algorithm, which allows each file to be encoded differently. By having a different code-to-encoding mapping for each file, we are able to optimize for the content of each file rather than rely on previously benchmarked compression.

This research focused on creating a synthesizable VHDL model of the decoder that interfaces with the LEON2 processor. The LEON2 we integrated with is verified by Sun Microsystems and is licensed under the GNU license. By choosing this processor, we show that our approach can be integrated with existing processors.

Another intended consequence of compressing the instructions is a reduction in cache misses. Since more instructions can be loaded into the cache at any given time, the likelihood of a cache miss decreases.
To magnify this benefit, we point out that the number of fetch cycles along the entire memory hierarchy will be reduced.

One setting where this research is applicable is simple real-time systems. We say "simple" because these systems are commonly short on cache, and their data processing speeds are relatively slow. Due to the smaller cache sizes, memory misses are frequent, and valuable clock cycles are spent retrieving data. By keeping more instructions close to the processor at any given time, we can increase processing speed.

3 Motivation and Related Work

3.1 Motivation

A clear advantage of compressing instructions before storing them in memory is the savings in file size. By keeping the file compressed until the processor needs to process it, we minimize the amount of data flowing between the processor and memory.

3.2 Related Work

Research has been done predominantly to encode files that will be sent to processors specific to the encoded files. Our research focuses on creating a module that any RISC processor may use with only minimal rework of the processor's internal functionality and without jeopardizing its architecture.

Hardware decoding mechanisms have been previously defined, and they vary based on function. Decompression in hardware has been achieved, but under the assumption that the software was able to come up with a compression algorithm [1] [2] [4] [9]. In other instances, software compression was proven feasible [3] [5] [6] [8] [10], but whether hardware could be developed based on the encoding mechanism was not explored. The one case we found that combined software encoding with hardware decoding outlined how it could be done, but only analyzed the algorithm and did not implement it [7]. We used the general concepts of these papers to formulate a software and hardware combination whose parts complement each other and still serve the general goal of compression.

4 Compression Techniques

To compress instruction code, we needed an algorithm that was easily expressible, so that the hardware could be made aware of it without much overhead. Huffman encoding is a simple encoding algorithm that takes the relative frequency of codes into consideration. Since our encoding is done immediately after the executable is created, we can read through the file and calculate relative frequencies painlessly.

4.1 Modified Huffman Encoding

By creating a Huffman Tree, common codes can be expressed in fewer bits, whereas uncommon codes are given a longer length. Huffman encoding takes a fixed-length code and converts it into a variable-length code of up to 2^n bits, where n is the original bit length of the code. We chose to encode 16-bit values, because instruction words are easily split in this manner.

At this point a problem arises: how to implement the Huffman Tree. The full Huffman tree would require a large amount of resources to implement in hardware. (The encoding in software is trivial no matter the original or final code size, due to the iterative nature of Huffman encoding.) To account for this hardware limitation, we modified the Huffman tree. Instead of allowing the codes to expand to a possible bit length of 2^n, which would be 65,536 bits in our case, we cut the final encoded bit size to a maximum length of 8. Allowing encoded values longer than 8 bits caused the software encoding time to explode, while restricting encoded values to fewer than 8 bits caused a clear lack of compression in exchange for the time saved. By choosing a maximum of 8 bits, up to 128 of the 16-bit values can be encoded in a timely manner.
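To make the frequency-gathering step concrete, the sketch below counts how often each 16-bit value occurs in an instruction image. The function name and file handling are illustrative assumptions rather than our encoder's actual code, and big-endian byte order is assumed, as on the SPARC-based LEON2.

// Hypothetical frequency-counting pass over an instruction image (illustrative).
#include <cstdint>
#include <fstream>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Count occurrences of each 16-bit halfword; the most frequent values become
// the candidates for short Huffman codes.
std::map<uint16_t, uint32_t> countHalfwordFrequencies(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                      std::istreambuf_iterator<char>());

    std::map<uint16_t, uint32_t> freq;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Assemble one big-endian 16-bit value from two consecutive bytes.
        uint16_t value = static_cast<uint16_t>((bytes[i] << 8) | bytes[i + 1]);
        ++freq[value];
    }
    return freq;
}

The most frequent values from such a count are the candidates placed into the tree during the iterative building procedure described below.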
To account for "chopping" the Huffman Tree, any code that cannot be represented in 8 bits falls into an uncompressed category. The uncompressed category consists of the 8-bit uncompressed code immediately followed by the 16-bit original code, for a total of 24 bits. At first glance, expanding a 16-bit value to 150% of its original size would be frowned upon, but due to the nature of Huffman encoding this happens only for the most unlikely codes. The fact that the encoded values for the most common cases are short offsets the losses in the uncompressed case.

One unforeseen problem was deciding which codes fall into the uncompressed case. To exemplify the problem, consider a Huffman encoding in which all of the encoded values lie deeper than 8 bits. One cannot simply "chop" the tree, because every leaf would land in the uncompressed case. What we chose to do is iteratively build Huffman Trees until we find one that is too deep. We start with the 8 most common code values based on relative frequency. We then combine all other code values into the uncompressed case, which we give a frequency rank of 0, and build a Huffman Tree. By doing this we guarantee that the uncompressed case sinks to the bottom of the tree. We then analyze the Huffman Tree's depth. If the depth is not greater than 8, we do another round of tree building with the 9 most common code values, once again placing all other code values into the uncompressed case. This process of moving the most common of the remaining code values out of the uncompressed case and into the tree continues until the depth limit of 8 is breached. When this occurs, we back up one tree, and that tree contains our final encoding.

By modifying the Huffman Tree in this way, we make the decoding in hardware easier. The encoded values are at most 8 bits long, with the special uncompressed case of an 8-bit code plus the original 16 bits (24 bits in total).

5 Encoder Implementation

5.1 Encoder Functionality

The encoder is implemented as a C++ application. It is capable of reading in the file, calculating the relative frequencies of codes, and writing the output file. To build the Huffman Tree, the encoder reads through the file 16 bits at a time and calculates the frequency of these values. The encoder is then able to encode the file and write the output. To convey the configuration of the Huffman Tree to the decoder, a header is applied to the beginning of each file. The first word of the header states how many leaves are to be defined. The subsequent header words contain a mapping between each encoded value and its original code, along with a flag indicating whether that encoding is the uncompressed case.

5.2 Jump and Branch Resolution

Jumps and branches in the code create unique situations in the decode phase that must be handled in the encode phase. With multiple codes now fitting in the space that originally held a single word, jump and branch targets move and can end up offset into the middle of the new words. To account for this, we force every target address to occur at the beginning of a new word. This negates the need for an offset.
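As an illustration of this alignment rule, the following is a minimal sketch of how an encoder's output stage might force a branch target onto a 32-bit word boundary. The BitWriter type and its members are hypothetical stand-ins, not our encoder's actual code; the leading 0 bit that marks the rest of a word as wasted space corresponds to the valid-bit check described in Section 6.2.

// Hypothetical output stage for the encoder (illustrative only).
#include <cstdint>
#include <vector>

struct BitWriter {
    std::vector<uint32_t> words;  // packed encoded output, 32 bits per word
    uint32_t current = 0;         // word currently being filled, MSB first
    int bitsUsed = 0;             // bits already occupied in 'current'

    // Append the low n bits of 'value'; this sketch assumes n fits in the
    // space remaining in the current word.
    void putBits(uint32_t value, int n) {
        current |= value << (32 - bitsUsed - n);
        bitsUsed += n;
        if (bitsUsed == 32) { words.push_back(current); current = 0; bitsUsed = 0; }
    }

    // Called before emitting a jump or branch target: a 0 bit marks the rest
    // of the current word as wasted space, and the partially filled word is
    // flushed so the target starts at the beginning of a new 32-bit word.
    void alignToWordBoundary() {
        if (bitsUsed == 0) return;      // target already lands on a boundary
        putBits(0, 1);                  // 0 = remainder of this word is padding
        if (bitsUsed != 0) {            // flush whatever is left of the word
            words.push_back(current);
            current = 0;
            bitsUsed = 0;
        }
    }
};

With every target aligned this way, the encoder never needs to record a bit offset for a branch destination, only a word address.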
The relative target movements can then be accounted for by making a pass through the encoded instruction set, decompressing the jump instructions, updating their targets, and recompressing them where possible. The net loss from this is negligible for the files we have tested. A more effective encoder could be created with the help of a compiler to resolve jump and branch targets; for the purposes of this research project, it was sufficient to make the encoder a standalone program.

5.3 Encoder Flow

The encoder is a three-pass application. On the first pass, the file is analyzed, jump instruction locations are noted, and a Huffman Tree is created based on the relative frequencies of the original code. The second pass writes an encoded image into memory and keeps track of where the jump targets have moved. The third pass writes the encoded instruction set out to the final file.

6 Decoder Implementation

The core of the decoder is two Content Addressable Memories (CAMs), one for each lookup. Both CAMs are loaded with identical copies of the decode algorithm. Each address in the CAM contains three values. The first value is a flag indicating whether the code represents a 16-bit value or is flagging an uncompressed case. The second value is the length of the code. The third is the 16-bit value represented by the code. For codes less than 7 bits long, all addresses starting with the bits of the code are loaded with the same values.

6.1 Decoder Organization

The decoder operates in three modes:

No_Decode – Each 32-bit fetch from memory is passed to the Instruction Fetch logic unchanged.

Algorithm_Load – The header block on the code in memory is processed to load the decode algorithm for the code that follows.

Decode – Memory is decoded, and reconstructed 32-bit instructions are passed to the Instruction Fetch logic.

[Figure: decoder datapath, consisting of the input data register, PC increment logic, shift and shift-16 logic, two 128 x 20 TCAM lookups, and output multiplexers producing the decoded 32-bit instruction.]

6.2 Decode Mode Implementation

Since the instructions are encoded 16 bits at a time, two code lookups are required each instruction fetch cycle to generate the output instruction. After each lookup, the used encoded bits are shifted out of the input data register.

The most significant bit of the input is checked to determine whether the remainder of the 32-bit input word is valid or wasted space. If the bit is 0, the remainder of the 32-bit input word is discarded and a NOP is output. If the bit is set, the next 7 bits are used as an address for the CAM. The input data is shifted left by the code length value returned by the CAM. A multiplexer then selects either the CAM output value or the next 16 bits of the input data, depending on the value of the compressed flag returned by the CAM. The input data is then shifted by 16 bits if those bits were selected by the multiplexer. The lookup-and-shift operation is repeated with the second CAM, using the shifted input data from the first lookup and shift. The result is a 32-bit output instruction and the remaining bits of the input data, which are returned to the input data register for use on the next fetch cycle.

Additional logic was added to fetch input data as required to ensure that the input data register always starts a fetch cycle with at least 48 bits. 48 bits are required to ensure that a complete output instruction can be produced even with the worst-case encoding. If fewer than 48 bits are available, NOPs are produced while further fetches occur until the requirement is met.
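To make the decode datapath easier to follow, the sketch below is a rough C++ software model of a single lookup-and-shift step. It is a sketch under assumptions, not the VHDL implementation: the CamEntry layout, the function name, and the use of a plain 128-entry array in place of the 128 x 20 TCAM are all illustrative, and the wasted-space/NOP handling and refill logic are omitted.

// Hypothetical software model of one lookup-and-shift step (not the VHDL design).
#include <array>
#include <cstdint>

struct CamEntry {
    bool     uncompressed;  // true: entry flags the uncompressed case
    uint8_t  codeLength;    // total bits occupied by the encoded symbol
    uint16_t value;         // reconstructed 16-bit value for compressed entries
};

// Consume one encoded symbol from the top of a 64-bit shift register and
// return the reconstructed 16-bit halfword. Entries for codes shorter than
// 7 bits are assumed to be replicated across all matching addresses, so a
// plain array lookup behaves like the ternary CAM.
uint16_t decodeHalfword(uint64_t& shiftReg, int& bitsAvail,
                        const std::array<CamEntry, 128>& cam)
{
    // Skip the leading valid bit and use the next 7 bits as the CAM address.
    uint8_t addr = static_cast<uint8_t>((shiftReg >> 56) & 0x7F);
    const CamEntry& e = cam[addr];

    // Shift the encoded symbol out of the input data register.
    shiftReg <<= e.codeLength;
    bitsAvail -= e.codeLength;

    if (!e.uncompressed)
        return e.value;                              // common, compressed case

    // Uncompressed case: the original 16 bits follow the 8-bit code.
    uint16_t literal = static_cast<uint16_t>(shiftReg >> 48);
    shiftReg <<= 16;
    bitsAvail -= 16;
    return literal;
}

Two calls per fetch cycle, one per CAM, yield the two halfwords of the output instruction, and the caller would refill shiftReg from memory whenever bitsAvail falls below 48, mirroring the refill rule above.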
7 Results

The encoding application was run on seven different instruction memory images provided as test code with the LEON2 processor. The results are summarized in the table below. We saw a 5-15% reduction in instruction memory image size due to our compression.

The VHDL code was added to the LEON2 code and simulated using MicroSim. A test program was successfully executed in simulation and showed approximately a 7% improvement in execution time. However, the program was smaller than the LEON2 cache, so much of the benefit from compression was not realized.

8 Conclusion

This research has shown that there are realistic and significant savings in both code size and execution speed to be gained by compressing instruction memory. Furthermore, the VHDL model that implements the design is fully synthesizable. With further research and optimization, even greater gains should be realized.

References

[1] J. Liu, N. Mahapatra, K. Sundearesan, S. Dangeti, and B. Venkatrao. Memory System Compression and Its Benefits.
[2] P. Pujara and A. Aggarwal. Restrictive Compression Techniques to Increase Level 1 Cache Capacity.
[3] M. Hampton and M. Zhang. Cool Code Compression for Hot RISC.
[4] H. Pan and K. Asanović. Heads and Tails: A Variable-Length Instruction Format Supporting Parallel Fetch and Decode.
[5] A. Alameldeen and D. Wood. Adaptive Cache Compression for High-Performance Processors. 31st Annual International Symposium on Computer Architecture (ISCA-31).
[6] E. Nikolova, D. Mulvaney, V. Chouliaras, and J. Nunez-Yanez. A Compression/Decompression Scheme for Embedded Systems Code. Electronic Systems and Control Division Research 2003, p. 36.
[7] R. Castro, A. Lago, and D. Silva. Adaptive Compressed Caching: Design and Implementation.
[8] Y. Jin and R. Chen. Instruction Cache Compression for Embedded Systems.
[9] P. Wilson, S. Kaplan, and Y. Smaragdakis. The Case for Compressed Caching in Virtual Memory Systems. Proceedings of the USENIX Annual Technical Conference, Monterey, California, USA, June 6-11, 1999.
[10] H. Pan. High Performance, Variable-Length Instruction Encodings.