Emmanuel Sanchez
Gerardo Gomez-Martinez

Implementation of 4-way Associative Data Cache

Purpose of Project

The purpose of this project was to gain a better understanding of the operation and implementation of cache memory, as well as the effect it has on the overall performance of a system. Memory systems are usually a bottleneck for computers, and in this project we focused on improving the performance of our pipelined processor by adding a cache. Our goal was to implement a 4-way associative data cache and to analyze the performance increase over the non-cached version of the MIPS processor.

Implementation

The design was implemented on the MIPS pipelined processor. The data cache was implemented as a 16-entry by 242-bit array, where 16 is the number of entries that can be held in each of the cache's ways (i.e., the number of sets). This number therefore determines the number of set bits used in the address (log2(16) = 4 bits). The 242 bits per set consist of 32 bits of data, 26 bits of tag, 1 valid bit, and 1 dirty bit for each of the 4 ways, plus 2 bits required for the LRU implementation. Main memory was implemented as a 32-bit by 128-entry array, meaning it can hold 128 words. Data in the cache and main memory is word aligned. The figure below shows the main structure of our 4-way associative data cache.

Figure 1

Operation

On reset, the data cache is initialized to an empty state, and it becomes populated as the processor executes memory operations. When a "load word" instruction is encountered, the processor first checks whether the required data is in the cache by checking the valid bits and comparing the tags. If the data is found, the cache asserts a hit signal and the data is obtained from the cache. If the data is not found, the hit signal is deasserted and the cache controller interprets this as a cache miss. In that case, the controller stalls the processor's pipeline while the data is retrieved from main memory and written into the cache. Cache hits allow memory reads to complete in one clock cycle, while cache misses carry a penalty of six clock cycles.

When a "store word" instruction is executed, the data word is mapped to a location in the cache according to its target address. Before updating the cache with the new data, the corresponding set is checked for any available slots: all valid bits within the set are checked, and the new data is written to the first slot whose valid bit is 0. When there are no available slots in the set (all valid bits are 1), the processor evicts one of the least recently used data words to make room for the new data. Because this is a 4-way cache, each set contains 2 bits that keep track of the most recently used data word. Evictions require a write to main memory and therefore carry a latency penalty of six clock cycles. Main memory is only updated when a cache eviction occurs, not every time a "store word" is encountered.

Memory Latency

Initially, there was no real difference in our project between the latency of accessing main memory and that of accessing the cache, because both are small and can be placed on the same chip. In order to simulate the memory latency that occurs in real systems, where memory sits outside the processor, we decided to create an artificial delay. We did so by creating a signal that could stall our entire processor's pipeline for a specific number of clock cycles.
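The report does not include our HDL source, but the idea behind this artificial delay can be illustrated with a small behavioral sketch. The Python model below is purely illustrative (names such as MemoryStallCounter and MISS_PENALTY are ours for exposition, not taken from the actual design): it holds a stall signal asserted for six clock cycles whenever a main-memory access begins.

```python
# Illustrative behavioral model of the memory-latency stall counter.
# Names and structure are assumptions for exposition; the actual design
# was an HDL module inside the pipelined MIPS processor.

MISS_PENALTY = 6  # clock cycles charged for every main-memory access

class MemoryStallCounter:
    def __init__(self):
        self.remaining = 0  # cycles left before the pipeline may resume

    def start_access(self):
        """Called when a read or write to main memory begins."""
        self.remaining = MISS_PENALTY

    def tick(self):
        """Advance one clock cycle; return True while the pipeline must stall."""
        if self.remaining > 0:
            self.remaining -= 1
            return True   # stall signal asserted
        return False      # stall signal deasserted, pipeline resumes

# Example: a single main-memory access stalls the pipeline for six cycles.
counter = MemoryStallCounter()
counter.start_access()
stalled_cycles = sum(counter.tick() for _ in range(10))
assert stalled_cycles == MISS_PENALTY
```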
In the actual design, we implemented a counter in charge of asserting that stall signal on every access to main memory (whether a read or a write) for exactly six clock cycles before allowing the processor to continue executing.

Cache Write Policy

During the early stages of our project, we handled memory writes by writing to both the cache and main memory (write-through). This made it easier to verify that our design was behaving as expected. Implementing a write-back policy was among the final stages of our project. To achieve this, we included a dirty bit for each data entry in the cache. When a memory write is performed, only the cache is updated and the corresponding dirty bit is set, indicating that the entry has been modified. Whenever two data words map to the same cache slot, the dirty bit of the data currently residing in the cache is checked. If the dirty bit is not set, the old data is simply overwritten by the new data and the dirty bit is set. If, however, the dirty bit is set, the old data has to be evicted and written to main memory before the new data can overwrite it. Using the write-back policy improved the performance of the cache because main memory is only updated when an eviction is necessary, not every time a "store word" is encountered.

LRU Implementation

Each set in the cache holds two bits (labeled 'U' in Figure 1) that specify which way contains the most recently used data. When a conflict miss occurs, the design randomly selects one of the three ways that are not the most recently used and evicts it. This is not a true LRU implementation, where the least recently used data is always evicted, but we believe its performance comes close. A true LRU implementation would require more hardware and a more complex eviction algorithm, and the cost of such a design would most likely not be worth the small performance increase.

Cache Controller

In order to add the cache to the design, it was necessary to implement a cache controller. The cache controller manages all the signals that are essential to the operation of the cache. Among the most important are the signal that stalls the processor, the signal that enables writing to main memory, and the signal that enables writing to the cache. In order to simulate the latency of accessing main memory, the cache controller contains a counter that ensures the pipeline is stalled for six clock cycles any time there is a need to read or write to main memory.

Debugging and Testing

Prior to implementing the 4-way data cache, we began by building a direct-mapped cache. Initially, we tested the cache by loading all memories (instruction memory, data cache, main memory) with initialization files. During the first stage of testing, we performed only read operations: we wrote a simple program consisting of only "load word" instructions and verified that the data from the cache was correctly loaded into registers. The next step was to handle "store word" instructions, which we did using write-through for simplicity. After having a functional direct-mapped cache, we proceeded to transform it into a 4-way associative cache. Once again, we began testing by initializing the cache with a file and performing only "load word" instructions.
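Our report does not include the test code itself, but the behavior we were checking during these tests can be summarized in a small software model. The sketch below is purely illustrative Python (class and field names are ours, and the real design is an HDL module, not software): it models a 4-way set-associative cache with 16 sets, first-free-way allocation, write-back with dirty bits, and the approximate LRU eviction described above, where the victim is picked at random from the three ways that are not the most recently used.

```python
import random

NUM_SETS = 16      # 4 index bits
NUM_WAYS = 4
WORD_BYTES = 4     # word-aligned data, 2 byte-offset bits

class CacheLine:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = 0
        self.data = 0

class FourWayCache:
    """Behavioral model: 4-way set-associative, write-back, approximate LRU."""
    def __init__(self, main_memory):
        self.mem = main_memory  # list of words, indexed by word address
        self.sets = [[CacheLine() for _ in range(NUM_WAYS)] for _ in range(NUM_SETS)]
        self.mru = [0] * NUM_SETS  # the 2-bit "U" field: most recently used way per set

    def _split(self, addr):
        word_addr = addr // WORD_BYTES
        return word_addr % NUM_SETS, word_addr // NUM_SETS  # (set index, tag)

    def _find_way(self, index, tag):
        for way, line in enumerate(self.sets[index]):
            if line.valid and line.tag == tag:
                return way
        return None

    def _choose_victim(self, index):
        # First free slot if one exists; otherwise a random way that is not the MRU.
        for way, line in enumerate(self.sets[index]):
            if not line.valid:
                return way
        return random.choice([w for w in range(NUM_WAYS) if w != self.mru[index]])

    def _fill(self, index, tag, data):
        way = self._choose_victim(index)
        line = self.sets[index][way]
        if line.valid and line.dirty:
            # Eviction: write the dirty word back to main memory.
            self.mem[line.tag * NUM_SETS + index] = line.data
        line.valid, line.dirty, line.tag, line.data = True, False, tag, data
        return way

    def load(self, addr):
        index, tag = self._split(addr)
        way = self._find_way(index, tag)
        if way is None:  # miss: fetch the word from main memory
            way = self._fill(index, tag, self.mem[addr // WORD_BYTES])
        self.mru[index] = way
        return self.sets[index][way].data

    def store(self, addr, value):
        index, tag = self._split(addr)
        way = self._find_way(index, tag)
        if way is None:
            way = self._fill(index, tag, value)
        line = self.sets[index][way]
        line.data, line.dirty = value, True  # write-back: memory updated only on eviction
        self.mru[index] = way
```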
Simulation Results and Performance Analysis

For this project, we compared the performance of the old processor with no cache to the processor with a 4-way data cache. In order to simulate the performance of a processor with no cache, we modified the cache logic so it would always output a miss signal. This way, data is always retrieved from main memory and the memory latency becomes apparent. The performance increase was calculated after running the following program, which consists of two identical loops that take each element from an array of size 10, scale it by a factor of 4, and store the new value in a second array.

Program (MIPS assembly with C equivalent):

When executing the "load word" in the first loop, both processors (cached and non-cached) stall the pipeline for 6 cycles. In the cached processor, this happens because the required data is not initially in the cache. However, after running the first loop, the data cache fills up and the second loop executes significantly faster.

Simulation Results:

Non-cached processor:

Cached processor:

Performance Calculations:

CPI_old (no cache)    = 163 cycles / 94 instructions = 1.734
CPI_new (4-way cache) = 103 cycles / 94 instructions = 1.096

Speedup = CPI_old / CPI_new = 1.734 / 1.096 = 1.58

The new processor is therefore faster by a factor of 1.58. This performance analysis is specific to the program above; every program will have its own data requirements and level of cache use. For example, a program that makes repeated use of the same data will greatly benefit from having a cache, while a program that constantly requires different data may not see the same improvement.

Conclusion

Before working on this project, we had a very limited understanding of the structure and operation of cache memory. Concepts such as tags, sets, blocks, and associativity were covered in previous CPE classes, but we never understood them well enough. After getting the chance to actually build a cache memory and integrate it into the MIPS processor, we feel much more confident about our knowledge of the topic. This was a very interesting and useful assignment.
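For completeness, the performance figures above can be re-checked with a few lines of Python. This is only a convenience calculation added here for reference; the cycle and instruction counts are the ones obtained from our simulations.

```python
# Re-derive the CPI values and speedup from the simulation counts reported above.
instructions = 94          # instructions executed by the test program
cycles_no_cache = 163      # cycles when every access goes to main memory
cycles_with_cache = 103    # cycles with the 4-way data cache

cpi_old = cycles_no_cache / instructions     # 1.734
cpi_new = cycles_with_cache / instructions   # 1.096
speedup = cpi_old / cpi_new                  # 1.58

print(f"CPI old = {cpi_old:.3f}, CPI new = {cpi_new:.3f}, speedup = {speedup:.2f}")
```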