Emmanuel Sanchez
Gerardo Gomez-Martinez
Implementation of 4-way Associative Data Cache
Purpose of Project
The purpose of this project was to gain a better understanding of the operation and
implementation of cache memory, as well as the effect it has on the overall
performance of a system. Memory systems are usually a bottleneck for computers, and
in this project, we focused on improving the performance of our pipelined processor by
adding a cache. Our goal was to implement a 4-way associative data cache and to
analyze the performance increases over the non-cached version of the MIPS processor.
Implementation
The design was implemented on the MIPS pipelined processor. The data cache was
implemented using a 242-bit by 16-entry array, where 16 is the number of entries that
can be held in each of the cache's ways. This number therefore determines the number
of set bits used in the address (log2(16) = 4 bits). The 242 bits include 32 bits of data,
26 bits of tag, 1 valid bit, and 1 dirty bit for each of the 4 ways, along with 2 bits
required for the LRU implementation. Main memory was implemented using a 32-bit by
128-entry array, meaning it can hold 128 words. Data in the cache and main memory is
word aligned.
The figure below shows the main structure of our 4-way associative data cache.
Figure 1
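
To make the address breakdown concrete, the following C sketch splits a 32-bit address into the fields described above (26 tag bits, 4 set bits, 2 byte-offset bits). The function and variable names are our own and are only meant to illustrate the field widths.

#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit byte address into the cache's fields:
   bits [1:0]   byte offset (words are aligned, so these are unused)
   bits [5:2]   set index   (log2(16) = 4 bits)
   bits [31:6]  tag         (26 bits) */
static void split_address(uint32_t addr)
{
    unsigned byte_offset = addr & 0x3;          /* 2 bits  */
    unsigned set_index   = (addr >> 2) & 0xF;   /* 4 bits  */
    unsigned tag         = addr >> 6;           /* 26 bits */
    printf("addr=0x%08x tag=0x%07x set=%u offset=%u\n",
           (unsigned)addr, tag, set_index, byte_offset);
}

int main(void)
{
    split_address(0x00000040);   /* example: tag 1, set 0, offset 0 */
    return 0;
}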
Operation
On reset, the data cache is initialized to an empty state and it becomes populated as
the processor executes memory operations. When a “load word” instruction is
encountered, the processor first checks if the required data is in the cache by checking
the valid bit and comparing the tags. If the data is found, the cache issues a hit signal
and the data is obtained from the cache. However, if the data is not found, the hit signal
is deasserted and the cache controller interprets it as a cache miss. In this case, the
controller stalls the processor’s pipeline while the data is retrieved from main memory
and written into the cache. Cache hits allow memory reads to occur in one clock cycle,
and cache misses have a penalty of six clock cycles. When a “store word” instruction is
executed, the data word is mapped to a location in the cache according to its target
address. Before updating the cache with the new data, the corresponding set is
checked for any available slots. All valid bits are checked within the set and the new
data is written to the first slot that has a valid bit of 0. In the case where there are no
available slots in a particular set (all valid bits are 1), the processor evicts one of the
least recently used data words to make room for the new data. Because this is a 4-way
cache, there are 2 bits in each set that keep track of the most recently used data word.
Evictions require a write to main memory and therefore they have a latency penalty of
six clock cycles. Main memory is only updated when a cache eviction occurs and not
every time a “store word” is encountered.
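
The hit check described above can be summarized with a short behavioral model in C. This is only an illustrative sketch of the comparison logic (the structure and function names are ours), not the hardware description itself.

#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 4
#define NUM_SETS 16

/* One way of one set: valid bit, dirty bit, 26-bit tag, 32-bit data word. */
struct cache_way {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint32_t data;
};

/* cache[set][way], mirroring the 16-set, 4-way organization. */
static struct cache_way cache[NUM_SETS][NUM_WAYS];

/* A hit requires some way of the selected set to be valid and to hold a
   tag equal to the tag bits of the address. */
static bool lookup(uint32_t addr, uint32_t *data_out)
{
    uint32_t set = (addr >> 2) & 0xF;
    uint32_t tag = addr >> 6;

    for (int way = 0; way < NUM_WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *data_out = cache[set][way].data;
            return true;    /* hit: data is served in one cycle */
        }
    }
    return false;           /* miss: the controller stalls and goes to main memory */
}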
Memory Latency
Initially, there was no real difference in our project between the latency of accessing
main memory and that of accessing the cache, because both memories are small and sit
on the same chip. In order to simulate the memory latency that occurs in real
systems where memory is found outside the processor, we decided to create an
artificial delay. We did so by creating a signal that could stall our entire processor's
pipeline for a specific number of clock cycles. We implemented a counter that was in
charge of asserting that stall signal on every access to main memory (whether it was a
read or a write) for exactly six clock cycles before allowing the processor to continue
executing.
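
A rough software model of that counter is sketched below, assuming the six-cycle penalty described above; the signal and function names are illustrative and do not come from our design files.

#define MISS_PENALTY 6          /* clock cycles per main-memory access */

static int stall_counter = 0;   /* remaining stall cycles */
static int stall = 0;           /* 1 while the pipeline is frozen */

/* Called once per clock cycle; start_access is 1 on the cycle in which a
   main-memory read or write begins. */
static void latency_counter(int start_access)
{
    if (start_access && stall_counter == 0)
        stall_counter = MISS_PENALTY;

    if (stall_counter > 0) {
        stall = 1;
        stall_counter--;
    } else {
        stall = 0;              /* pipeline resumes */
    }
}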
Cache Write Policy
During the early stages of our project, we handled memory writes by writing to both the
cache and main memory (write-through). This made it easier to verify that our design
was behaving as expected. Implementing a write-back policy was
among the final stages of our project. To achieve this, we included a dirty bit for each
data entry in the cache. When a memory write is performed, only the cache is updated
and the corresponding dirty bit is set, indicating that the entry has been modified.
Whenever two data words map to the same cache slot, the dirty bit of the data currently
residing in the cache is checked. If the dirty bit is not set, the old data is simply
overwritten by the new data and the dirty bit is set. If, however, the dirty bit is set, the
old data has to be evicted and written to main memory before the new data can
overwrite it. Using a write-back policy improved the performance of the cache because
main memory is only updated when an eviction is necessary and not every time a
“store word” is encountered.
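
Continuing the behavioral model from the Operation section, the dirty-bit handling on a write can be sketched as follows; write_to_main_memory is a stand-in for the real memory interface, not an actual function in our design.

/* Stand-in for the main-memory write interface (the 128-word memory model
   is not shown here). */
static void write_to_main_memory(uint32_t tag, uint32_t set, uint32_t data)
{
    (void)tag; (void)set; (void)data;
}

/* Place a word into (set, way). If the entry currently there is valid,
   dirty, and holds a different tag, it is first written back to main
   memory (a six-cycle eviction); otherwise it is simply overwritten. */
static void write_word(uint32_t set, uint32_t way,
                       uint32_t tag, uint32_t data)
{
    struct cache_way *line = &cache[set][way];

    if (line->valid && line->dirty && line->tag != tag)
        write_to_main_memory(line->tag, set, line->data);

    line->valid = true;
    line->dirty = true;    /* write-back: main memory is now stale */
    line->tag   = tag;
    line->data  = data;
}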
LRU Implementation
Each set in the cache accommodates two bits (labeled ‘U’ in Figure 1) that specify
which way contains the most recently used data. When a conflict miss occurs, the design
randomly selects one of the three least recently used data words and evicts it. This is
not a true LRU implementation, in which the least recently used data is always evicted,
but we believe its performance comes close. A true LRU implementation would require
more hardware and a more complex eviction algorithm, and the cost of such a design
would most likely not be worth the small performance increase.
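
Continuing the same model, the victim selection can be sketched as shown below; the mru_way array stands in for the two ‘U’ bits of each set, and rand() stands in for whatever pseudo-random choice the hardware makes.

#include <stdlib.h>

static int mru_way[NUM_SETS];   /* models the two 'U' bits: most recently used way per set */

/* On a conflict miss (all four ways valid), pick a victim: any way except
   the most recently used one, chosen pseudo-randomly among the other three. */
static int choose_victim(uint32_t set)
{
    int victim;
    do {
        victim = rand() % NUM_WAYS;
    } while (victim == mru_way[set]);
    return victim;
}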
Cache Controller
In order to add the cache to the design, it was necessary to implement a cache
controller. The cache controller is in charge of managing all signals that are essential to
the operation of the cache. Among the most important ones are: the signal that will stall
the processor, the signal that will enable writing to main memory, and the signal that will
enable writing to the cache. In order to simulate the latency of accessing the main
memory, the cache controller contains a counter that ensures the pipeline is stalled for
six clock cycles any time there is a need to read or write to main memory.
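
As a rough illustration only, the controller's three key outputs for a single access might be modeled as follows. This is a simplified, combinational view with made-up names; the real controller is clocked hardware and also drives the latency counter described above.

struct controller_outputs {
    int stall;               /* freeze the pipeline                     */
    int mem_write_enable;    /* write an evicted word to main memory    */
    int cache_write_enable;  /* write a fetched or stored word to cache */
};

/* Simplified view of the controller's decisions for a single access. */
static struct controller_outputs control(int hit, int is_store, int victim_dirty)
{
    struct controller_outputs out = {0, 0, 0};

    if (hit) {
        out.cache_write_enable = is_store;    /* write hit: update cache only   */
    } else {
        out.stall = 1;                        /* held six cycles by the counter */
        out.mem_write_enable = victim_dirty;  /* a dirty victim must be evicted */
        out.cache_write_enable = 1;           /* allocate the new word          */
    }
    return out;
}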
Debugging and Testing
Prior to implementing the 4-way data cache, we began by making a direct mapped
cache. Initially, we tested the cache by loading all memories (instruction memory, data
cache, main memory) with initialization files. During the first stage of testing, we
performed only read operations. Then, we made a simple program consisting of only
“load words” and verified the data from the cache was correctly loaded into registers.
The next step was to take care of the “store words,” which we did by using memory
write through, for simplicity. After having a functional direct mapped cache, we
proceeded to transform it into a 4-way associative cache. Once again, we began testing
by initializing the cache with a file, and performing only “load words.”
Simulation Results and Performance Analysis
For this project, we compared the performance of the old processor with no cache to the
processor with a 4-way data cache. In order to simulate the performance of a processor
with no cache, we modified the cache logic so it would always output a miss signal. This
way, data would always be retrieved from main memory and the memory latency would
be apparent. The performance increase was calculated after running the following
program. The program consists of two identical loops that take each element from an
array of size 10, scale it by a factor of 4, and store the new value in a second array.
Program (MIPS assembly and C equivalent listings):
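
Based on the description above, the C equivalent amounts to roughly the following sketch (our reconstruction rather than the original listing; array and variable names are illustrative).

#define SIZE 10

int src[SIZE];   /* source array of 10 elements          */
int dst[SIZE];   /* destination array for scaled values  */

int main(void)
{
    /* First loop: each "load word" misses, so the cache fills up. */
    for (int i = 0; i < SIZE; i++)
        dst[i] = src[i] * 4;

    /* Second, identical loop: the same data is now in the cache. */
    for (int i = 0; i < SIZE; i++)
        dst[i] = src[i] * 4;

    return 0;
}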
When executing the “load word” instructions in the first loop, both processors (cached and non-cached) stall the pipeline for 6 cycles. In the cached processor, this happens because
the required data is not initially in the cache. However, after running the first loop, the
data cache fills up and the second loop executes significantly faster.
Simulation Results:
Non-Cached Processor:
Cached Processor:
Performance Calculations:
CPI Old (no cache) = 163 cycles / 94 instructions = 1.734
CPI New (4-way cache) = 103 cycles / 94 instructions = 1.096
New is faster by 1.734 / 1.096 = 1.58
This performance analysis is specific to the program above. Obviously, every program will
have its own data requirements and level of cache use. For example, a program that makes
repeated use of the same data will greatly benefit from having a cache, while a program
that constantly requires different data may not yield the same performance.
Conclusion
Before working on this project, we had a very limited understanding of the structure and
operation of cache memory. Things such as tags, sets, blocks, and associativity were
covered in previous CPE classes, but we never understood them well enough. After getting
the chance to actually build a cache memory and integrate it into the MIPS processor, we
feel much more confident about our knowledge of the topic. This was a very interesting and
useful assignment.