International Journal of Research in Computer and Communication Technology, Vol 4, Issue 3, March 2015. ISSN (Online) 2278-5841, ISSN (Print) 2320-5156

Design and Functional Verification of Four Way Set Associative Cache Controller

Bavishi Pooja, Dept. of VLSI & Embedded System Design, GTU PG School, Ahmedabad, India (bavishipooja123@gmail.com)
Mr. Santosh S Jagtap, Wipro, Pune, India (sansjagtap@gmail.com)

Abstract— This paper describes the design of a cache controller that handles a 32 Kbyte, four-way set associative cache with an 8-word block size. A cache controller is a device used to sequence the reads and writes of the cache storage array. Most modern microprocessors are designed with multi-core architectures, which lead to massive cache data traffic. Exploiting the temporal and spatial locality of accesses to the cache addresses this problem, but it requires a controller capable of handling a large number of ways and a large block size. The design was developed in Verilog and is simulated using the Questasim software. Using the same software, a test bench was constructed to test the functionality of the controller.

Keywords—Cache memory, Four Way Set Associative cache, pLRU

I. INTRODUCTION

Throughout the last decades, digital electronics technology has become more advanced. As time goes on, this advancement has made computers and other electronic hardware, such as mobile phones, PDAs and many other gadgets, smaller, faster and cheaper to produce. Most of these devices use a microprocessor as the brain that controls their operation. The performance gap between the processor and memory is a major bottleneck in microprocessor design: in today's high performance systems, a memory request may take hundreds of cycles to complete. This gap motivates continued improvements in cache efficiency.

Nowadays, making faster microprocessors is the main concern, and one of the important components inside the microprocessor is the cache controller. As microprocessor speed increases rapidly, designing a much faster cache becomes very important [12][13]. The cache controller needs to be fast enough to deal with the massive data transfer between the cache, the memory and the processor. Increasing the size of the cache can increase cache performance, but there is a trade-off: the cache access time grows as the size increases [7]. Nevertheless, most caches benefit greatly from a larger cache size [4].

Closest to the CPU is the cache memory. Cache memory is fast but quite small; it is used to store small amounts of data that have been accessed recently and are likely to be accessed again soon. Data is stored here in blocks, each containing a number of words. To keep track of which blocks are currently stored in the cache, and how they relate to the rest of the memory, the cache controller stores identifiers for the blocks currently held in the cache. These include the index, tag, valid and dirty bits associated with a whole block of data. Using these identifiers, the cache controller can respond to read and write requests issued by the CPU, by reading and writing data in specific blocks, or by fetching or writing out whole blocks to the larger, slower main memory. Figure 1 shows a block diagram of a simple memory hierarchy consisting of the CPU, the cache (including the cache controller and the small, fast memory used for data storage), the main memory controller and main memory proper.

Figure 1: Block Diagram of Memory Hierarchy
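As a concrete illustration of these per-line identifiers, the following is a minimal Verilog sketch, not the authors' RTL, of the storage for one way of the cache described in this paper. The module and port names are illustrative assumptions; the parameter values follow Section II, and the 64-bit word width follows from the 64-byte, 8-word block.

```verilog
// One way of the cache: each line stores a valid bit and a tag
// alongside its block of data. No dirty bit is kept here, because the
// design writes through to main memory (Section III.C).
module cache_way #(
  parameter SETS   = 128,  // 32 KB / (4 ways x 64-byte blocks)
  parameter TAG_W  = 19,   // 32-bit address - 7 index bits - 6 offset bits
  parameter LINE_W = 512   // 8 words x 64 bits = one 64-byte block
)(
  input  wire              clk,
  input  wire              wr_en,    // allocate or update the indexed line
  input  wire [6:0]        set_idx,  // the 7 set-index bits
  input  wire [TAG_W-1:0]  tag_in,
  input  wire [LINE_W-1:0] line_in,
  output wire              valid_out,
  output wire [TAG_W-1:0]  tag_out,
  output wire [LINE_W-1:0] line_out
);
  reg              valid [0:SETS-1];
  reg [TAG_W-1:0]  tags  [0:SETS-1];
  reg [LINE_W-1:0] data  [0:SETS-1];

  integer i;
  initial for (i = 0; i < SETS; i = i + 1) valid[i] = 1'b0;

  always @(posedge clk)
    if (wr_en) begin
      valid[set_idx] <= 1'b1;
      tags [set_idx] <= tag_in;
      data [set_idx] <= line_in;
    end

  // The identifiers of the indexed line are exposed for tag comparison.
  assign valid_out = valid[set_idx];
  assign tag_out   = tags[set_idx];
  assign line_out  = data[set_idx];
endmodule
```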
II. THE CACHE

The cache is the fastest memory available on the market, due to its small size and architecture. It is the highest level in the memory hierarchy tree, but it is also the most expensive.

A. The Architecture

A data request carries an address specifying the location of the requested data. Each cache-line sized chunk of data from the lower level can be placed into only one set, and which set it can be placed into depends on its address. This mapping between addresses and sets must have an easy, fast implementation; the fastest implementation uses just a portion of the address to select the set. When this is done, a request address is broken up into three parts:

An offset part identifies a particular location within a cache line. Here the offset is 6 bits.

A set part identifies the set that contains the requested data; here it is 7 bits. It is also called the index.

A tag part must be saved in each cache line along with its data, to distinguish the different addresses that could be placed in the set. Here the tag is 19 bits.

Figure 2 shows the architecture of the cache designed for this particular project. The cache implemented here has a capacity of 32 Kbyte. It is a four-way set associative cache with a block size of 64 bytes. Each way can hold 522 bits on each line, and each way contains 128 sets (addressed by the 7 index bits).

Figure 2: Proposed Four Way Set Associative Cache Implementation

III. THE CACHE CONTROLLER

The cache controller is a device that controls the data transfer between the cache, the main memory and the microprocessor. When the microprocessor sends an address to request data, the cache controller checks for the data inside the cache. If the data is available, the cache controller sends it to the processor. If the data is not present in the cache, the cache controller fetches it from main memory and sends it to the microprocessor as well as to the cache [4]. Figure 3 shows the detailed diagram of the proposed design, which consists of basic modules such as the hit/miss logic, the pLRU replacement policy, the buffers, the cache memory and the main memory.

Figure 3: Proposed Cache Controller Design

A. Hit Miss Logic

The address of the word that the CPU is currently referencing is stored in the Address Latch. The middle 7 bits of the address are the set index bits, and they are connected to all the RAM memories. The core of the hit/miss logic is a parallel comparator, which simultaneously compares the stored address tags of the cache lines with the tag part of the Address Latch and outputs the hit/miss signal and the line select signal. Only valid cache lines are involved in the tag comparison. If one of the tags stored in the memory is the same as that in the Address Latch and the valid bit of the cache line is set, there is a cache hit and the hit counter is incremented. If no matching address tag is found, it is a cache miss, and the miss counter is incremented [7].
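A minimal Verilog sketch of this logic is given below (illustrative, not the authors' RTL): the latched address is sliced into its tag, index and offset fields, the four stored tags are compared in parallel, and an encoder produces the line select signal. The port names and the flattened four-tag bus are assumptions.

```verilog
// Hit/miss logic for a four-way set: parallel tag comparators plus
// an encoder for the line select signal. A way hits only if its stored
// tag matches the latched tag AND its valid bit is set.
module hit_miss_logic (
  input  wire [31:0] addr_latch,  // address held in the Address Latch
  input  wire [75:0] way_tags,    // four 19-bit stored tags, concatenated
  input  wire [3:0]  way_valid,   // valid bit of the indexed line per way
  output wire [18:0] tag,         // addr_latch[31:13]
  output wire [6:0]  set_idx,     // addr_latch[12:6], selects the set
  output wire [5:0]  offset,      // addr_latch[5:0], byte within the line
  output wire        hit,
  output wire [1:0]  hit_way      // line select signal for the data mux
);
  assign tag     = addr_latch[31:13];
  assign set_idx = addr_latch[12:6];
  assign offset  = addr_latch[5:0];

  wire [3:0] match;
  genvar w;
  generate
    for (w = 0; w < 4; w = w + 1) begin : cmp
      assign match[w] = way_valid[w] && (way_tags[w*19 +: 19] == tag);
    end
  endgenerate

  assign hit = |match;  // at most one way should match at a time

  // Encoder: identify the matching way for the data multiplexer.
  assign hit_way = match[1] ? 2'd1 :
                   match[2] ? 2'd2 :
                   match[3] ? 2'd3 : 2'd0;
endmodule
```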
B. Read Policy

When a read hit occurs, the processor reads directly from the cache memory. To read the cache, all the ways are enabled, and the set select address decides which set is used. The tag address provided by the CPU is compared by the comparator inside each way. If the same tag is stored inside the cache, it is a hit; however, the stored tag must also be valid, for if it is not valid there is no data for that entry and hence no hit. The data multiplexer selects the way that asserted the hit; this multiplexer is steered by the select signal provided by the hit encoder. There should be only one hit at a time, since each way should hold a different tag within a set. If there is no hit, no data is provided; if there is a hit, the output is taken from the data multiplexer. When a read miss occurs, the processor reads from main memory: the data is provided to the processor from main memory and is simultaneously written to the cache.

C. Write Policy

The write policy is write-through with no write allocate. Under the write-through policy, a hit writes to both the cache and main memory. Because of no write allocate, a miss updates the block in main memory without bringing that block into the cache. Subsequent writes to the block still update main memory, because the write-through policy is employed; thus some time is saved by not bringing the block into the cache on a miss, where it would apparently be of no use anyway [6][7].

D. Buffers

1) Write buffer: A write buffer is a very small, fast FIFO memory buffer that temporarily holds data that the processor would normally write to main memory. In a system without a write buffer, the processor writes directly to main memory; in a system with a write buffer, data is written at high speed to the FIFO and then emptied into the slower main memory. The write buffer reduces the time the processor spends writing small blocks of sequential data to main memory.

2) Line fill buffers (LFBs): These buffers capture line fill data from main memory, waiting for a complete line before writing it to the cache memory. The buffer is filled with data so that an entire cache line can be allocated to the cache at once. Line fill buffers speed up line replacement: the alternative, without a line buffer, is to hold the processor in a wait state until the entire line has been refilled after a cache miss, and only after the complete refill is the processor allowed to continue. Using a line fill buffer, the missed word is fed to both the line buffer and the processor simultaneously. A three-bit counter is provided: with the placement of data in each slot the counter increments from 0, and once all 8 words of the buffer are filled, the whole content of the line fill buffer is transferred to the cache line on the next clock cycle (a sketch of this mechanism follows this list).

3) Line read buffers (LRBs): These buffers hold a line from the cache in case of a cache hit. It takes a single clock cycle to transfer data from the cache memory to the line read buffer.

4) Address buffer: This holds the addresses coming from the processor. Here the address buffer can store 8 addresses, as an AXI interface is used. After all 8 addresses are filled, an interrupt is generated.
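The following is a minimal Verilog sketch, not the authors' RTL, of the line fill buffer mechanism described in item 2 above: one word arrives per cycle from main memory, a three-bit counter tracks the filled slots, and after the eighth word the assembled line is handed to the cache. The signal names and the 64-bit word width are illustrative assumptions.

```verilog
// Line fill buffer: assembles eight 64-bit refill words into one
// 512-bit cache line. The missed word could be bypassed to the CPU
// in parallel with being captured here.
module line_fill_buffer (
  input  wire         clk,
  input  wire         rst,
  input  wire         word_valid,  // a refill word is on mem_word this cycle
  input  wire [63:0]  mem_word,    // word arriving from main memory
  output reg  [511:0] line_out,    // assembled cache line
  output reg          line_ready   // pulses when line_out may go to the cache
);
  reg [2:0] count;  // counts the words 0..7 as they are captured

  always @(posedge clk) begin
    if (rst) begin
      count      <= 3'd0;
      line_ready <= 1'b0;
    end else begin
      line_ready <= 1'b0;
      if (word_valid) begin
        line_out[count*64 +: 64] <= mem_word;  // place the word in its slot
        if (count == 3'd7) begin
          // Eighth word captured: transfer the whole line next cycle.
          line_ready <= 1'b1;
          count      <= 3'd0;
        end else begin
          count <= count + 3'd1;
        end
      end
    end
  end
endmodule
```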
E. Pseudo Least Recently Used (pLRU) Replacement Algorithm

LRU keeps all cache entries in the order of their last reference time. To keep a strict LRU order, a relatively large number of bits is needed: for a four-way associative cache, there are 4! = 24 different permutations of use orders, so keeping track of them requires ceil(log2(24)) = 5 bits. Hence the space required to implement LRU is large, and LRU is also expensive in terms of speed and hardware; its main disadvantage is the need to remember the order in which all N lines were last accessed. To save space, Pseudo-LRU replacement algorithms have been proposed, in which the LRU order of the cache lines is kept only approximately. For example, in the pLRU tree, only the just-referenced line is accurately recorded, and the order of the other cache lines is not precise. At the cost of this precision, Pseudo-LRU replacement algorithms need fewer bits for the replacement decision [2].

1) Tree-based Pseudo LRU (pLRUt): This binary tree approximation of the LRU algorithm requires N-1 bits in an N-way associative cache. Hence a four-way set associative cache requires 3 bits, fewer than the 5 bits of true LRU [10]. To understand the tree shown in Figure 4, refer to the definitions below:

• Each bit represents one branch point in a binary decision tree.
• Let 1 represent that the left side has been referenced more recently than the right side, and 0 vice versa [9].

Figure 4 shows the pLRU tree used for the replacement of lines, that is, for deciding which line to replace when all the cache lines are filled and a new cache line is to be written.

Figure 4: pLRU-tree Cache Replacement Algorithm
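Below is a minimal Verilog sketch of the three-bit tree bookkeeping for one set, following the convention above (1 means the left side was referenced more recently). The module and signal names are illustrative, and the assignment of bits to tree nodes is one possible choice rather than necessarily the authors'.

```verilog
// Tree pLRU for four ways. Bit 2 is the root (left pair = ways 0-1
// vs right pair = ways 2-3); bit 1 orders ways 0/1; bit 0 orders ways 2/3.
module plru_tree_4way (
  input  wire       clk,
  input  wire       rst,
  input  wire       access,     // a way was just referenced (hit or fill)
  input  wire [1:0] used_way,   // which way was referenced
  output reg  [2:0] lru_bits,
  output wire [1:0] victim_way  // line to replace when the set is full
);
  // Update: mark the path to the referenced way as most recently used.
  always @(posedge clk) begin
    if (rst)
      lru_bits <= 3'b000;
    else if (access) begin
      lru_bits[2] <= ~used_way[1];         // 1 if the left pair was touched
      if (used_way[1] == 1'b0)
        lru_bits[1] <= ~used_way[0];       // 1 if way 0 was touched
      else
        lru_bits[0] <= ~used_way[0];       // 1 if way 2 was touched
    end
  end

  // Victim selection: walk toward the LESS recently used side of each
  // node. If the left pair is newer (bit 2 = 1), evict from the right
  // pair, choosing the way its bit marks as older, and vice versa.
  assign victim_way = lru_bits[2] ? {1'b1, lru_bits[0]}
                                  : {1'b0, lru_bits[1]};
endmodule
```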
F. Finite State Machine Diagram

Figure 5 shows the state diagram of the finite state machine used in the cache controller. At the beginning, the controller waits for an instruction to read or write (load or store) data. If the instruction executed is a load, the controller begins a set of steps. First, it performs a comparison: if the address tag is equal to the cache tag, a read hit occurs; the value from the cache is used, the controller waits until the transaction is completed, and it returns to the idle state. If a read miss occurs, the data is read from main memory; the controller waits until the process is finished and then returns to the beginning. This is slower than a read hit, because the access time of the external memory is greater than that of the internal RAMs. If the instruction received is a store, it is likewise necessary to compare the tags. If the address tag is equal to the cache tag, a write hit occurs and the data is written into the cache and into the main memory. If a write miss occurs, the data is written into the main memory [4].

Figure 5: FSM Diagram of Cache Controller (states: IDLE, READ, READ MISS, READ MEM, READ DATA, WRITE, WRITE HIT, WRITE MISS, WRITE CACHE, WRITE MEM, WRITE DATA, with wait-for-memory loops on the memory accesses)

• IDLE: No memory access underway.
• READ: Read access initiated by the processor; the cache is checked during this state. If it hits, the access is satisfied from the cache during this cycle and control returns to the IDLE state at the next transition. If it misses, transition to the READMISS state to initiate a main memory access.
• READMISS: Initiate the memory access following a read miss. Transition to the READMEM state.
• READMEM: Main memory read in progress. Remain in this state while the memory is being read, then transition to the READDATA state.
• READDATA: Data available from the main memory read. Write this data into the cache line and use it to satisfy the original processor read request.
• WRITE: Write access initiated by the processor. If the cache hits, transition to the WRITEHIT state; if it misses, transition to the WRITEMISS state.
• WRITEHIT: The cache has been hit on a write operation. Complete the write to the cache and initiate the write-through to main memory, via the WRITE CACHE and WRITEMEM states.
• WRITEMISS: The cache has been missed on a write operation. Initiate the write-through to main memory; with no write allocate, the missed block is not loaded into the cache. Next state: WRITEMEM.
• WRITEMEM: Main memory write in progress. Wait while the memory is being written, then transition to the WRITEDATA state.
• WRITE CACHE: On a write hit, the data is written to the cache memory, since the write-through policy is being used.
• WRITEDATA: Last cycle of the main memory write. Assert the Ready signal to the processor to indicate completion of the write.
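To summarize the state list, here is a minimal Verilog skeleton of the next-state logic (illustrative, not the authors' RTL). The diagram lets WRITEHIT proceed through both WRITE CACHE and WRITEMEM; the sketch below linearizes this as WRITEHIT, then WRITECACHE, then WRITEMEM. The input signal names (cpu_req, cpu_we, hit, mem_done) are assumptions.

```verilog
// Next-state logic of the cache controller FSM. Datapath control
// outputs (cache write enables, memory strobes, Ready) are omitted.
module cache_fsm (
  input  wire       clk,
  input  wire       rst,
  input  wire       cpu_req,   // processor issues a request
  input  wire       cpu_we,    // 1 = store, 0 = load
  input  wire       hit,       // from the hit/miss logic
  input  wire       mem_done,  // main memory access finished
  output reg  [3:0] state
);
  localparam IDLE       = 4'd0,  READ      = 4'd1,  READMISS = 4'd2,
             READMEM    = 4'd3,  READDATA  = 4'd4,  WRITE    = 4'd5,
             WRITEHIT   = 4'd6,  WRITEMISS = 4'd7,  WRITEMEM = 4'd8,
             WRITECACHE = 4'd9,  WRITEDATA = 4'd10;

  always @(posedge clk) begin
    if (rst)
      state <= IDLE;
    else case (state)
      IDLE:       state <= !cpu_req ? IDLE : (cpu_we ? WRITE : READ);
      READ:       state <= hit ? IDLE : READMISS; // hit satisfied this cycle
      READMISS:   state <= READMEM;               // start main memory read
      READMEM:    state <= mem_done ? READDATA : READMEM;
      READDATA:   state <= IDLE;                  // fill line, return data
      WRITE:      state <= hit ? WRITEHIT : WRITEMISS;
      WRITEHIT:   state <= WRITECACHE;            // write cache, then memory
      WRITECACHE: state <= WRITEMEM;
      WRITEMISS:  state <= WRITEMEM;              // no write allocate
      WRITEMEM:   state <= mem_done ? WRITEDATA : WRITEMEM;
      WRITEDATA:  state <= IDLE;                  // assert Ready here
      default:    state <= IDLE;
    endcase
  end
endmodule
```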
IV. SIMULATION RESULTS

Simulations are shown for the entire cache controller design (including the whole cache and the memory), simulated using the Questasim 10.0b software.

Figure 6 shows the result for a cache write miss first, so the data is written to main memory. When the same address is then given with read enable high, a cache read miss occurs, as the data is not yet available in the cache memory; the data is written to the cache memory while, at the output, the data is read from main memory. Giving the same address yet again shows a cache read hit, and the output for the requested address is obtained from the cache memory. However many times the same address is requested, the data is served from the cache memory without going to main memory, which increases speed.

Figure 6: Simulation Result 1 of Cache Write Miss and Cache Read Miss and Hit

Figure 7 shows the cache write hit scenario. Once data has been written to the cache, the tag of the requested address is stored. If the same tag address arrives again, a write hit happens, as shown, and the new data is overwritten in the cache memory without being written back to main memory.

Figure 7: Simulation Result 2 of Cache Write Hit

Figure 8 shows the pLRU replacement policy in use. First, all the cache lines are filled. If a new address then arrives, a cache line has to be overwritten, and according to the value of the LRU bits the least recently used line is overwritten. Here we have depicted that, since a four-way set associative cache is used, all four lines in the set are replaced with new data on a cache write hit.

Figure 8: Simulation Result 3 of pLRU Replacement Algorithm Implementation

V. CONCLUSION

Based on the simulation results, it can be concluded that the design functions successfully. Furthermore, the design has been shown to be implementable in real life at a higher specification. The cache controller finds the requested address in the cache memory and gives as output the data of that particular cache line; if the data is not available in the cache memory, it fetches the data from main memory and stores it in the cache memory. The cache controller thereby also tracks the miss rate of the cache memory.

REFERENCES

[1] Vipin S. Bhure, Dinesh Padole, "Design of Cache Controller for Multi-core Systems Using Multilevel Scheduling Method", 2012 Fifth International Conference on Emerging Trends in Engineering and Technology.
[2] Hussein Al-Zoubi, Aleksandar Milenkovic, Milena Milenkovic, "Performance Evaluation of Cache Replacement Policies for the SPEC CPU2000 Benchmark Suite".
[3] Ruchi Rastogi Bani, Saraju P. Mohanty, Elias Kougianos, Garima Thakral, "Design of a Reconfigurable Embedded Data Cache", 2010 International Symposium on Electronic System Design.
[4] Siti Lailatul Mohd Hassan, Mohd Naqib Johari, Azilah Saparon, Ili Shairah Abd Halim, A'zraa Afhzan Ab Rahim, "Multi-Sized Output Cache Controllers", 2013 International Conference on Technology, Informatics, Management, Engineering & Environment (TIME-E 2013), Bandung, Indonesia, June 23-26, 2013.
[5] Ben Cohen, Srinivasan Venkataramanan, Ajeetha Kumari, Lisa Piper, "Experiencing Checkers for a Cache Controller Design".
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann.
[7] Jim Handy, The Cache Memory Book.
[8] A. Agarwal and S. Pudar, "Column-associative caches: A technique for reducing the miss rate of direct-mapped caches", in Proceedings of the International Symposium on Computer Architecture, 1993, pp. 179-180.
[9] S. Jiang, X. Zhang, "Making LRU Friendly to Weak Locality Workloads: A Novel Replacement Algorithm to Improve Buffer Cache Performance", IEEE Transactions on Computers, Vol. 54, No. 8, Aug. 2005.
[10] Andreas Abel and Jan Reineke, "Reverse Engineering of Cache Replacement Policies in Intel Microprocessors and Their Evaluation", Department of Computer Science, Saarland University, Saarbrucken, Germany, 2014 IEEE.
[11] Yogesh S. Watile, A. S. Khobragade, "FPGA Implementation of Cache Memory", International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 3, Issue 3, May-Jun 2013, pp. 283-286.
[12] Roy W. Badeau, "A 100-MHz Macropipelined VAX Microprocessor", IEEE Journal of Solid-State Circuits, Vol. 27, No. 11, November 1992.
[13] Daniel W. Dobberpuhl, "A 200-MHz 64-bit Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, Vol. 27, No. 11, November 1992.