Compressed Instruction Cache to Increase Instruction Cache Memory
Nicholas Meloche
Dept. of Computer Science and Engineering
Washington University
Saint Louis, MO 63130
nrm1@cec.wustl.edu

David Lautenschlager
Dept. of Computer Science and Engineering
Washington University
Saint Louis, MO 63130
Dcl1@cec.wustl.edu

Prashanth Jadardanan
Dept. of Computer Science and Engineering
Washington University
Saint Louis, MO 63130
ptj1@cec.wustl.edu
1 Abstract
Latencies in instruction fetch cycles
slow the average throughput of a
processor. In this paper we investigate
compression techniques to reduce the
amount of memory used by a set of
instructions. In turn, these techniques
will decrease the average number of
fetch-cycle cache misses, resulting in
lower average fetch latencies. The basic technique involves compressing Reduced Instruction Set Computer (RISC) instructions in software at compile time and decompressing them in hardware at runtime. We propose a compile-time software application that encodes the RISC instructions, and a real-time hardware decoder that decodes the resulting encoded instructions. We show that, on average, our technique compresses instruction sets to about 85% of their original size.
2 Introduction
Instructions of similar RISC architectures intrinsically have a high probability of being similar to one another. This is because the architecture design, by its nature, is simple. This implies a small set of instruction types, which in turn lends optimism toward the ability to compress.
Traditionally, computer architects have
addressed this issue in a number of
ways. Very Long Instruction Word (VLIW) processors increase the speed of fetch cycles, but they also increase code size. Other approaches convert RISC instructions into Complex Instruction Set Computer (CISC) type instructions, encoding RISC instructions based on their interactions with neighboring instructions. This technique optimizes code locally, but it does not work well at the scale of an entire file.
In this paper, we propose decompressing pre-compressed code at runtime. This will both reduce effective program size and increase average fetch-cycle speed. Our basic technique for compression compresses common bit strings while allowing uncommon bit strings to remain uncompressed, or even expand. We show that the amount of space saved by compressing the common cases outweighs the lack of compression in the uncommon cases.
To decode the instructions, a hardware
module is added to the processor in an
unobtrusive manner. This module decodes the compressed instructions before they reach the processor's instruction fetch logic. The module is also capable of reconfiguring itself for a different decoding algorithm, which allows each file to be encoded differently. By giving each file its own code-to-instruction mapping, we can optimize for the content of that file rather than relying on a compression scheme tuned to previously benchmarked code.
This research focused on creating a synthesizable VHDL model of the decoder that interfaces with the LEON2 processor. The LEON2 core we integrated with is verified by Sun Microsystems and is distributed under a GNU license. By choosing this processor, we are able to show that our decoder can be integrated with existing processors.
Another intended consequence of compressing the instructions is a reduction in memory misses. Since more instructions can be loaded into the cache at a given time, the likelihood of a cache miss is lower. This benefit is compounded along the entire memory hierarchy, since the number of fetch cycles at every level is reduced.
One setting where this research would be applicable is simple real-time systems. We say “simple” because these systems are commonly short on cache, and their data processing speeds are relatively slow. Due to the smaller cache sizes, memory misses are frequent, and valuable clock cycles are spent retrieving data. By keeping more instructions close to the processor at any given time, we can increase effective speed.
3 Motivation and Related Work
3.1 Motivation
Research has been done predominantly to encode files that will be sent to processors specific to the encoded files. Our research focuses on creating a module that any RISC processor may use with only minimal rework of the processor's internal functionality and without jeopardizing its architecture.
A clear advantage of compressing
instructions before storing them into
memory is the savings in the size of the
file. By keeping the file compressed
until the processor needs to process it,
we minimize the amount of data flow
between the processor and memory.
3.2 Related Work
Hardware decoding mechanisms have been previously defined, and they vary based on function. Decompression in hardware has been achieved, but with the assumption that the software could supply a compression algorithm [1] [2] [4] [9]. In other instances, software compression was shown to be feasible [3] [5] [6] [8] [10], but whether hardware could be developed around the encoding mechanism was not explored. The one case we found that combined software encoding with hardware decoding outlined how it could be done, but only analyzed the algorithm and did not implement it [7].
We used the general concepts of these papers to formulate a software and hardware combination in which the two complement each other while still adhering to the general notion of compression.
4 Compression Techniques
To compress instruction code, we needed an algorithm that is easily expressible, so that the hardware can be made aware of it without much overhead. Huffman encoding is a simple algorithm that takes the relative frequency of codes into account. Since our encoding is performed immediately after the executable is created, we have the ability to read through the file and calculate relative frequencies painlessly.
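As a rough sketch of this first step, frequency counting over 16-bit values might look like the following (the file-reading details and the big-endian byte pairing are assumptions for illustration, not our actual encoder code):

    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <iterator>
    #include <map>
    #include <string>
    #include <vector>

    // Count how often each 16-bit pattern occurs in an instruction image.
    // These relative frequencies drive the Huffman tree construction.
    std::map<uint16_t, std::size_t> countFrequencies(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                         std::istreambuf_iterator<char>());
        std::map<uint16_t, std::size_t> freq;
        for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
            uint16_t word = static_cast<uint16_t>((bytes[i] << 8) | bytes[i + 1]);
            ++freq[word];              // big-endian byte pairing assumed here
        }
        return freq;
    }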
4.1 Modified Huffman Encoding
By creating a Huffman tree, common codes can be expressed in fewer bits, whereas uncommon codes are given longer encodings. Huffman encoding takes a fixed-length code and converts it into a variable-length code of up to 2^n bits, where n is the original bit length of the code. We chose to encode 16-bit values because instruction words split easily at this granularity.
At this point, the problem of how to implement the Huffman tree arises. The Huffman tree would require a large amount of resources to implement in hardware. (The encoding in software is trivial regardless of the original or final code size, due to the iterative nature of Huffman encoding.) To account for this hardware limitation, we modified the Huffman tree. Instead of allowing the codes to expand to a possible bit length of 2^n, which would be 65536 bits in our case, we capped the encoded length at a maximum of 8 bits. Allowing encoded values longer than 8 bits caused the software encoding time to explode, while capping them below 8 bits sacrificed a clear amount of compression in exchange for the time saved. Choosing a maximum of 8 bits allows up to 128 16-bit values to be encoded in a timely manner.
To account for “chopping” the Huffman tree, any code that could not be represented in 8 bits falls into an uncompressed category. The uncompressed category consists of the 8-bit uncompressed escape code immediately followed by the 16-bit original code, for a total of 24 bits. At first glance, expanding a code to 150% of its original size seems undesirable, but due to the nature of Huffman encoding, this happens only for the most unlikely codes. The very short codes assigned to the most common cases offset the losses for the uncompressed case.
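As a purely illustrative trade-off calculation (the percentages here are made up, not measured): if 90% of the 16-bit values in a file compressed to codes averaging 5 bits and the remaining 10% expanded to the 24-bit escape form, the expected encoded length would be 0.9 × 5 + 0.1 × 24 = 6.9 bits per original 16 bits. Real instruction streams are far less skewed than this; the measured results in Section 7 show overall savings of 5-15%.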
One unforeseen problem was deciding which codes fall into the uncompressed case. To illustrate the problem, consider a Huffman encoding in which all of the encoded values lie deeper than 8 bits. One cannot simply “chop” the tree, because every leaf would land in the uncompressed case. We chose to iteratively build Huffman trees until we found one that becomes too deep. We start with the 8 most common code values, ranked by relative frequency. All other code values are combined into the uncompressed case, which is given a frequency rank of 0, and a Huffman tree is built. This guarantees that the uncompressed case sinks to the bottom of the tree. We then analyze the tree's depth. If the depth is not greater than 8, we do another round of tree building with the 9 most common code values, again placing all other values in the uncompressed case. This process of moving the most common of the remaining values out of the uncompressed case and into the tree continues until the depth of 8 is breached. When this occurs, we back up one tree, and that tree contains our final encoding.
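A minimal software sketch of this loop is shown below. It models the tree only as far as needed to measure its depth; the function names and data structures are illustrative, not our encoder's actual code.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <map>
    #include <queue>
    #include <utility>
    #include <vector>

    // Build a Huffman tree over the given leaf weights plus one weight-0 escape
    // leaf (the uncompressed case) and return the maximum code depth.
    int huffmanMaxDepth(std::vector<std::size_t> weights) {
        weights.push_back(0);                          // uncompressed case, frequency rank 0
        using Node = std::pair<std::size_t, int>;      // (subtree weight, deepest leaf below)
        std::priority_queue<Node, std::vector<Node>, std::greater<Node>> pq;
        for (std::size_t w : weights) pq.push({w, 0});
        while (pq.size() > 1) {
            Node a = pq.top(); pq.pop();
            Node b = pq.top(); pq.pop();
            pq.push({a.first + b.first, std::max(a.second, b.second) + 1});
        }
        return pq.top().second;
    }

    // Grow the set of directly encoded values one at a time, starting from the
    // 8 most frequent, until the tree would breach a depth of 8; the last tree
    // that fits defines the final encoding. Everything else uses the escape.
    std::vector<uint16_t> chooseEncodedValues(const std::map<uint16_t, std::size_t>& freq) {
        std::vector<std::pair<std::size_t, uint16_t>> byFreq;
        for (const auto& kv : freq) byFreq.push_back({kv.second, kv.first});
        std::sort(byFreq.rbegin(), byFreq.rend());     // descending frequency

        std::vector<uint16_t> chosen;
        for (std::size_t n = 8; n <= byFreq.size(); ++n) {
            std::vector<std::size_t> weights;
            for (std::size_t i = 0; i < n; ++i) weights.push_back(byFreq[i].first);
            if (huffmanMaxDepth(weights) > 8) break;   // too deep: back up one tree
            chosen.clear();
            for (std::size_t i = 0; i < n; ++i) chosen.push_back(byFreq[i].second);
        }
        return chosen;
    }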
By modifying the Huffman tree, we make the decoding in hardware easier. An encoded value can be at most 8 bits long, with the special uncompressed case of 8 bits followed by the 16 original bits (24 bits in total).
5 Encoder Implementation
5.1 Encoder Functionality
The encoder is implemented in a C++
application. The encoder is capable of
reading in the file, calculating relative
frequencies of codes, and printing to the
output file.
To build the Huffman Tree, the encoder
reads through the file 16 bits at a time,
and calculates the frequency of these
sets. The encoder is then able to encode
and write the output to the file.
To specify the configuration of the Huffman tree to the decoder, a header is applied to the beginning of each file. The first word of the header states how many leaves are going to be defined. The subsequent header words each contain a mapping between an encoded value and the original code, along with a flag indicating whether that encoding is the uncompressed case.
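For illustration only, the header could be emitted as sketched below. The exact field widths and word packing used by our encoder are not specified here, so the layout in this sketch is an assumption.

    #include <cstdint>
    #include <ostream>
    #include <vector>

    // One header mapping entry. Field widths and bit positions are illustrative.
    struct HeaderEntry {
        uint8_t  encodedValue;    // the Huffman code, at most 8 bits
        uint8_t  codeLength;      // number of bits in the code
        bool     isUncompressed;  // true for the escape (uncompressed) entry
        uint16_t originalCode;    // the 16-bit value this code expands to
    };

    // Emit the header the decoder later loads: first the leaf count, then one
    // packed word per mapping entry. Host byte order is used here; a real
    // encoder would fix the endianness explicitly.
    void writeHeader(std::ostream& out, const std::vector<HeaderEntry>& entries) {
        uint32_t count = static_cast<uint32_t>(entries.size());
        out.write(reinterpret_cast<const char*>(&count), sizeof(count));
        for (const HeaderEntry& e : entries) {
            uint32_t word = (static_cast<uint32_t>(e.encodedValue) << 24) |
                            (static_cast<uint32_t>(e.codeLength)   << 19) |
                            (static_cast<uint32_t>(e.isUncompressed ? 1 : 0) << 18) |
                            e.originalCode;
            out.write(reinterpret_cast<const char*>(&word), sizeof(word));
        }
    }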
5.2 Jump and Branch Resolution
Jumps and branches in the code create unique situations in the decode phase which must be handled in the encode phase. With multiple codes now fitting in the space that originally held a single word, jump and branch targets move and are potentially offset into the middle of new words. To account for this, we force every target address to fall at the beginning of a new word, which removes the need for an offset. The movement of relative targets is then accounted for by passing through the encoded instruction set, decompressing the jump instructions, and recompressing them where possible. The net loss from this is negligible for the files we have tested. A more effective encoder could be created alongside a compiler to help resolve jump and branch targets; for the purposes of this research project, a standalone program was sufficient.
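A simplified model of the offset bookkeeping is sketched below. It assumes 4-byte instruction words and ignores architecture-specific branch encodings and delay slots, so it illustrates only the address-remapping idea, not our encoder's implementation.

    #include <cstdint>
    #include <map>

    // newAddrOf maps each original instruction address to its (word-aligned)
    // address in the encoded image. Because every branch/jump target was forced
    // to start a new word, the remapped target address is always well defined.
    int32_t patchRelativeBranch(uint32_t branchOldAddr, int32_t oldOffsetWords,
                                const std::map<uint32_t, uint32_t>& newAddrOf) {
        uint32_t oldTarget = static_cast<uint32_t>(
            static_cast<int64_t>(branchOldAddr) + 4 * static_cast<int64_t>(oldOffsetWords));
        uint32_t newBranch = newAddrOf.at(branchOldAddr);
        uint32_t newTarget = newAddrOf.at(oldTarget);
        return (static_cast<int32_t>(newTarget) - static_cast<int32_t>(newBranch)) / 4;
    }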
5.3 Encoder Flow
The encoder is a three-pass application.
On the first pass the file is analyzed,
jump instruction locations are noted, and
a Huffman Tree is created based on
relative frequencies of the original code.
The second pass writes the encoded output into memory and keeps track of where the jump targets have moved. The third
pass writes the encoded instruction set
out to the final file.
6 Decoder Implementation
6.1 Decoder organization
The decoder operates in three modes:
No_Decode – Each 32-bit fetch from memory is passed to the Instruction Fetch logic unchanged.
Algorithm_Load – The header block on the code in memory is processed to load the decode algorithm for the code that follows.
Decode – Memory is decoded and the reconstructed 32-bit instructions are passed to the Instruction Fetch logic.
6.2 Decode mode implementation
Since the instructions are encoded 16 bits at a time, two code lookups are required each instruction fetch cycle to generate the output code. After each lookup, the used encoded bits are shifted out of the input data register.
The core of the decoder is two Content Addressable Memories (CAMs), one for each lookup. Both CAMs are loaded with identical copies of the decode algorithm. Each address in the CAM contains three values. The first value is a flag, which indicates whether the entry represents a 16-bit value or flags an uncompressed case. The second value is the length of the code. The third is the 16-bit value represented by the code. For codes less than 7 bits long, all addresses starting with the bits of the code are loaded with the same values.
[Figure: Decoder block diagram – the input data register feeds PC increment logic, shift logic, and two 128 x 20 RAM/TCAM lookup stages; multiplexers select the two 16-bit halves of the decoded instruction.]
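As a software illustration of the CAM contents (not the VHDL itself), each of the 128 addressable entries can be modeled as shown below. To keep the sketch simple it assumes codes of at most 7 lookup bits, so the replication rule for short codes can be shown directly.

    #include <array>
    #include <cstdint>

    // Software model of one decoder CAM. Each addressable entry carries the
    // three fields described above: a compressed/uncompressed flag, the code
    // length, and the reconstructed 16-bit value.
    struct CamEntry {
        bool     uncompressed;  // true: escape entry, the next 16 input bits are literal
        uint8_t  codeLength;    // encoded bits consumed by this entry
        uint16_t value;         // decoded 16-bit value (unused for the escape case)
    };

    using CamTable = std::array<CamEntry, 128>;

    // Load one code. Codes shorter than the 7 lookup bits are replicated into
    // every address whose leading bits match the code, so a single fixed-width
    // lookup always returns a valid entry.
    void loadCode(CamTable& cam, uint8_t code, uint8_t length,
                  bool uncompressed, uint16_t value) {
        const int dontCare = 7 - length;               // low-order address bits free to vary
        const uint8_t base = static_cast<uint8_t>(code << dontCare);
        for (int i = 0; i < (1 << dontCare); ++i) {
            cam[base | i] = CamEntry{uncompressed, length, value};
        }
    }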
The most significant bit of the input is checked to determine if the remainder of the 32-bit input word is valid or wasted space. If the bit is 0, the remainder of the 32-bit input word is discarded and a NOP is output. If the bit is set, the next 7 bits are used as an address for the CAM.
The input data is shifted left by the code
length value returned by the CAM. A
multiplexer then selects either the CAM
output value or the next 16 bits of the
input data depending on the value of the
compressed flag returned by the CAM.
The input data is then shifted 16 bits if
those bits were selected by the
multiplexer.
The lookup and shift operation is
repeated with the second CAM using the
shifted input data from the first lookup
and shift. The result is a 32-bit output
instruction and the remaining bits of the
input data, which are returned to the
input data register for use on the next
fetch cycle.
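The lookup-shift-mux sequence can be illustrated with the same software model. This is a sketch, not the VHDL: the 64-bit buffer, the single shared table standing in for the two identical CAMs, and the omission of the per-word valid-bit check are all simplifications.

    #include <cstdint>

    // One lookup-shift step: consume encoded bits from the most significant end
    // of 'bits' (count = number of valid bits) and produce one 16-bit half of
    // the output instruction, using the CamTable model sketched above.
    uint16_t decodeHalf(const CamTable& cam, uint64_t& bits, int& count) {
        uint8_t addr = static_cast<uint8_t>(bits >> 57) & 0x7F;   // top 7 bits address the CAM
        const CamEntry& e = cam[addr];
        bits <<= e.codeLength;                                    // shift out the consumed code
        count -= e.codeLength;
        if (e.uncompressed) {                                     // escape: next 16 bits are literal
            uint16_t literal = static_cast<uint16_t>(bits >> 48);
            bits <<= 16;
            count -= 16;
            return literal;
        }
        return e.value;                                           // common case: value from the CAM
    }

    // A full fetch performs two lookups to rebuild one 32-bit instruction. The
    // hardware requires at least 48 buffered bits first, covering the worst case
    // of two escape codes each followed by a 16-bit literal.
    uint32_t decodeInstruction(const CamTable& cam, uint64_t& bits, int& count) {
        uint16_t hi = decodeHalf(cam, bits, count);
        uint16_t lo = decodeHalf(cam, bits, count);
        return (static_cast<uint32_t>(hi) << 16) | lo;
    }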
Additional logic was added to fetch input data as required, to ensure that the input data register always starts a fetch cycle with at least 48 bits. 48 bits are required to ensure that a complete output instruction can be produced even with worst-case encoding. If fewer than 48 bits are available, NOPs are produced while further fetches occur until the requirement is met.
7 Results
The encoding application was run on 7
different instruction memory images
provided as test code with the LEON2
processor. The results are summarized in
the table below. We saw a 5-15%
reduction in the instruction memory
image size due to our compression.
The VHDL code was added to the
LEON2 code and simulated using
MicroSim.
A test program was successfully executed in simulation and showed approximately a 7% improvement in execution time. However, the program was smaller than the LEON2 cache, so much of the benefit from compression was not realized.
8 Conclusion
This research has shown that there are
realistic and significant savings in both
code size and execution speed to be
gained by compression of instruction
memory. Furthermore, the VHDL model that implements the design is fully synthesizable. With further research and optimization, even greater gains should be realized.
References
[1] J. Liu, N. Mahapatra, K.
Sundearesan, S. Dangeti, and B.
Venkatrao. Memory System
Compression and Its Benefits
[2] P. Pujara, A. Aggarwal. Restrictive
Compression Techniques to Increase
Level 1 Cache Capacity
[3] M. Hampton, M. Zhang. Cool Code Compression for Hot RISC.
[4] H. Pan, K. Asanović. Heads And
Tails: A Variable-Length Instruction
Format Supporting Parallel Fetch and
Decode
[5] A. Alameldeen, D. Wood. Adaptive Cache Compression for High-Performance Processors. 31st Annual International Symposium on Computer Architecture (ISCA-31).
[6] E. Nikolova, D. Mulvaney, V. Chouliaras, J. Nunez-Yanez. A Compression/Decompression Scheme for Embedded Systems Code. Electronic Systems and Control Division Research 2003, p. 36.
[7] R. Castro, A. Lago, D. Silva.
Adaptive Compressed Caching: Design
and Implementation.
[8] Y. Jin, R. Chen. Instruction Cache
Compression for Embedded Systems.
[9] P. Wilson, S. Kaplan, Y.
Smaragdakis. The Case For Compressed
Caching In Virtual Memory Systems.
Proceedings of the USENIX Annual
Technical Conference. Monterey,
California, USA, June 6-11, 1999.
[10] H. Pan. High Performance, Variable
Length Instruction Encodings.