International Journal of Research in Computer and Communication Technology, Vol 4, Issue 3, March 2015
ISSN (Online) 2278-5841, ISSN (Print) 2320-5156
Design and Functional Verification of Four Way Set Associative Cache Controller
Bavishi Pooja
Dept. of VLSI & Embedded System Design
GTU PG School,
Ahmedabad, India
bavishipooja123@gmail.com
Mr. Santosh S Jagtap
Wipro
Pune, India
sansjagtap@gmail.com
Abstract— This paper describes the design of a cache controller for a 32 Kbyte, four-way set associative cache with an 8-word block size. A cache controller is a device used to sequence the reads and writes of the cache storage array. Most modern microprocessors use multi-core architectures, which generate massive cache data traffic. Exploiting the temporal and spatial locality of references to the cache mitigates this problem, but it requires a controller capable of handling a large number of ways and a large block size. The design was developed in Verilog and simulated using the Questasim software; with the same software, a test bench was constructed to verify the functionality of the controller.
Keywords—Cache memory, Four Way Set Associative cache, pLRU

I. INTRODUCTION
Throughout the last decades, digital electronics technology has become more advanced. As time goes on, this advancement has made computers and other electronic hardware, such as mobile phones, PDAs, and many other gadgets, smaller, faster, and cheaper to produce. Most of these devices use a microprocessor as the brain that controls their operation. The performance gap between the processor and memory is a major bottleneck in microprocessor design: in today's high-performance systems, a memory request may take hundreds of cycles to complete. This performance gap motivates continued improvements in cache efficiency. Nowadays, making a faster microprocessor is the main concern, and one of the important components inside the microprocessor is the cache controller. As microprocessor speeds increase, designing a much faster cache becomes very important [12] [13]. The cache controller needs to be fast enough to deal with the massive data transfer between the cache, the memory, and the processor. Increasing the size of the cache can increase cache performance, but there is a trade-off: the cache access time increases with its size [7]. Nevertheless, most caches benefit greatly from a larger cache size [4].
Closest to the CPU is the cache memory. Cache memory is fast but quite small; it is used to store small amounts of data that have been accessed recently and are likely to be accessed again soon. Data is stored here in blocks, each containing a number of words. To keep track of which blocks are currently stored in the cache, and how they relate to the rest of the memory, the cache controller stores identifiers for the blocks currently held in the cache. These include the index, tag, valid and dirty bits associated with a whole block of data. Using these identifiers, the cache controller can respond to read and write requests issued by the CPU, by reading and writing data in specific blocks, or by fetching or writing out whole blocks to the larger, slower main memory.
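As an illustration of this per-line bookkeeping, the sketch below declares the state held by one way of such a cache in Verilog; the module boundary and signal names are our assumptions, not taken from the implemented RTL, and the widths follow the geometry adopted in Section II (128 sets, 19-bit tags, 64-byte blocks). The index is not stored: it selects the entry within each array.

// Sketch of the per-line identifiers for one way of the cache.
module cache_way (
    input  wire         clk,
    input  wire [6:0]   index,    // set select
    input  wire         we,       // write a new line into this way
    input  wire [18:0]  wtag,
    input  wire [511:0] wdata,
    output wire [18:0]  rtag,     // tag read out for comparison
    output wire         rvalid,
    output wire [511:0] rdata
);
    reg [18:0]  tag   [0:127];    // block identifier within the set
    reg         valid [0:127];    // line currently holds valid data
    reg         dirty [0:127];    // line modified since it was fetched
    reg [511:0] data  [0:127];    // the 64-byte data block

    assign rtag   = tag[index];
    assign rvalid = valid[index];
    assign rdata  = data[index];

    always @(posedge clk) begin
        if (we) begin
            tag[index]   <= wtag;
            valid[index] <= 1'b1;
            dirty[index] <= 1'b0;  // freshly fetched line is clean
            data[index]  <= wdata;
        end
    end
endmodule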
Figure 1 shows a block diagram for a simple memory
hierarchy consisting of CPU, Cache (including the Cache
controller and the small, fast memory used for data
storage), Main Memory Controller and Main Memory
proper.
Figure 1: Block Diagram of Memory Hierarchy
II. THE CACHE
The cache is the fastest memory available on the market, owing to its small size and architecture. It occupies the highest level of the memory hierarchy tree, but it is also the most expensive memory of all.
A. The Architecture
A data request has an address specifying the location of the requested data. Each cache-line sized chunk of data from the lower level can only be placed into one set; the set it can be placed into depends on its address. This mapping between addresses and sets must have an easy, fast implementation. The fastest implementation uses just a portion of the address to select the set. When this is done, a request address is broken up into three parts:
• An offset part identifies a particular location within a cache line. Here the offset is 6 bits.
• A set part identifies the set that contains the requested data; here it is 7 bits. It is also called the index.
• A tag part must be saved in each cache line along with its data, to distinguish the different addresses that could be placed in the set. Here the tag is 19 bits.
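A minimal Verilog sketch of this address split is given below; the module and signal names are illustrative assumptions, while the field widths (19 + 7 + 6 = 32) are those listed above.

// Splitting a 32-bit request address into tag/index/offset for a
// 32 KB, 4-way set associative cache with 64-byte lines.
module addr_split (
    input  wire [31:0] addr,    // request address from the CPU
    output wire [18:0] tag,     // compared against the stored tags
    output wire [6:0]  index,   // selects one of 128 sets
    output wire [5:0]  offset   // byte offset within the 64-byte line
);
    assign tag    = addr[31:13];
    assign index  = addr[12:6];
    assign offset = addr[5:0];
endmodule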
Figure 2 shows the architecture of the cache designed for this particular project. The cache implemented here has a capacity of 32 Kbyte. It is a four-way set associative cache with a block size of 64 bytes, so there are 32 Kbyte / (4 ways × 64 bytes) = 128 sets, with four ways per set. Each way can hold 522 bits in each line.
Figure 2: Proposed Four Way Set Associative Cache Implementation
III. THE CACHE CONTROLLER
The cache controller is a device that controls the data transfer between the cache, main memory, and the microprocessor. When the microprocessor sends an address to request data, the cache controller checks for the data inside the cache. If the data is available, the cache controller sends it to the processor.
If the data is not present in the cache, the cache controller fetches it from main memory and sends it to the microprocessor as well as to the cache [4].
Figure 3 shows the detailed diagram of the proposed design, which consists of basic modules such as the hit/miss logic, the pLRU replacement policy, the buffers, the cache memory, and the main memory.
Figure 3: Proposed Cache Controller Design
A. Hit Miss Logic
The address of the word that the CPU is currently referencing is stored in the Address Latch. The middle 7 bits of the address are the set index bits, which are connected to all the RAM memories. The core of the hit/miss logic is a parallel comparator, which simultaneously compares the stored address tags of the cache lines with the tag part of the Address Latch and outputs the hit/miss signal and the line select signal. Only valid cache lines are involved in the tag comparison. If one of the tags stored in the memory is the same as that in the Address Latch and the valid bit of the cache line is set, there is a cache hit and the hit counter is incremented. If no matching address tag is found, it is a cache miss; in such a case, the miss counter is incremented [7].
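The parallel comparison can be sketched in Verilog as follows; the module and signal names are our assumptions, and the hit/miss counters are omitted for brevity. The one-hot way_hit vector serves as the line select signal.

// Parallel tag comparison for one set: the four stored tags are
// compared simultaneously; only valid lines participate.
module hit_logic (
    input  wire [18:0] addr_tag,                // tag from the Address Latch
    input  wire [18:0] tag0, tag1, tag2, tag3,  // stored tags of the 4 ways
    input  wire [3:0]  valid,                   // valid bit of each way
    output wire [3:0]  way_hit,                 // one-hot hit per way
    output wire        hit                      // any way matched
);
    assign way_hit[0] = valid[0] & (tag0 == addr_tag);
    assign way_hit[1] = valid[1] & (tag1 == addr_tag);
    assign way_hit[2] = valid[2] & (tag2 == addr_tag);
    assign way_hit[3] = valid[3] & (tag3 == addr_tag);
    assign hit = |way_hit;                      // miss when no way matches
endmodule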
B. Read Policy
When a read hit occurs, the processor reads directly from the cache memory. To read the cache, all the ways are enabled, and the set select address decides which set is used. The tag address provided by the CPU is compared by a comparator inside each way. If the same tag is stored inside the cache, it is a hit; however, the tag register must also be valid, and if it is not valid, there is no data for that register and no hit. The data multiplexer selects the way which asserted the hit; these multiplexers select the data via the select value provided by the hit encoder. There should be only one hit at a time, since each way should hold a different tag. If there is no hit, no data is provided. If there is a hit, the output is taken from the data multiplexer. When a read miss occurs, the processor reads from main memory: the data is provided to the processor from main memory, and at the same time the main memory data is written into the cache.
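A sketch of this hit-encoder-driven data selection follows, assuming 64-bit data words and illustrative signal names; here the hit encoder and the data multiplexer are collapsed into a single case statement over the one-hot hit vector.

// Way-select multiplexer: steers the matching way's data word
// to the processor; with no hit, no data is provided.
module read_mux (
    input  wire [3:0]  way_hit,   // one-hot, from the hit/miss logic
    input  wire [63:0] data0, data1, data2, data3,
    output reg  [63:0] rdata
);
    always @(*) begin
        case (way_hit)
            4'b0001: rdata = data0;
            4'b0010: rdata = data1;
            4'b0100: rdata = data2;
            4'b1000: rdata = data3;
            default: rdata = 64'd0;  // no hit: drive a benign default
        endcase
    end
endmodule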
C. Write Policy
The write policy is write-through with no write allocate. Applying the write-through policy, a write hit writes to both the cache and main memory. Because of no write allocate, a write miss updates the block in main memory without bringing that block into the cache. Subsequent writes to the block will update main memory, because the write-through policy is employed; so some time is saved by not bringing the block into the cache on a miss, since doing so appears useless anyway [6] [7].
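The enable logic implied by this policy can be sketched as follows (signal names are assumptions): every write goes to main memory, while the cache is updated only when the line is already present.

// Write-through with no write allocate, as combinational enables.
module write_policy (
    input  wire write_req,  // CPU write request
    input  wire hit,        // from the hit/miss logic
    output wire cache_we,   // write enable for the cache data array
    output wire mem_we      // write enable (via write buffer) to main memory
);
    assign cache_we = write_req & hit;  // no write allocate on a miss
    assign mem_we   = write_req;        // write through on hit and miss
endmodule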
D. Buffers
1) Write buffer
A write buffer is a very small, fast FIFO memory that temporarily holds data that the processor would normally write to main memory. In a system without a write buffer, the processor writes directly to main memory. In a system with a write buffer, data is written at high speed into the FIFO and then emptied into the slower main memory. The write buffer thus reduces the processor time taken to write small blocks of sequential data to main memory.
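A minimal synchronous FIFO sketch of such a write buffer is shown below; the depth of 8 entries and the port names are assumptions, not taken from the paper's design.

// 8-entry write-buffer FIFO: the processor pushes pending writes,
// the memory controller drains them at its own pace.
module write_buffer #(
    parameter WIDTH = 32
) (
    input  wire             clk,
    input  wire             rst,
    input  wire             push,   // processor pushes a pending write
    input  wire             pop,    // memory controller drains one entry
    input  wire [WIDTH-1:0] din,
    output reg  [WIDTH-1:0] dout,
    output wire             full,
    output wire             empty
);
    reg [WIDTH-1:0] mem [0:7];      // FIFO storage
    reg [2:0] wr_ptr, rd_ptr;       // pointers wrap naturally at 8
    reg [3:0] count;                // occupancy, 0..8

    assign full  = (count == 4'd8);
    assign empty = (count == 4'd0);

    always @(posedge clk) begin
        if (rst) begin
            wr_ptr <= 3'd0;
            rd_ptr <= 3'd0;
            count  <= 4'd0;
        end else begin
            if (push && !full) begin
                mem[wr_ptr] <= din;
                wr_ptr      <= wr_ptr + 3'd1;
            end
            if (pop && !empty) begin
                dout   <= mem[rd_ptr];
                rd_ptr <= rd_ptr + 3'd1;
            end
            case ({push && !full, pop && !empty})
                2'b10:   count <= count + 4'd1;  // push only
                2'b01:   count <= count - 4'd1;  // pop only
                default: count <= count;         // both or neither
            endcase
        end
    end
endmodule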
2) Line Fill buffers (LFBs)
These buffers capture line fill data from main memory, waiting for a complete line before writing it to the cache memory. A buffer is filled with data so that an entire cache line can be allocated in the cache. Line fill buffers speed up line replacement. The alternative, without a line buffer, is to hold the processor in a wait state until the entire line has been refilled after a cache miss, allowing the processor to continue only after the complete refill. Using a line fill buffer, the missed word is fed to both the line buffer and the processor simultaneously.
A three-bit counter is placed so that when all 8 entries of the buffer are filled, the data is transferred to the cache line. With the placement of data in each entry, the counter increments from 0; once the eighth word has been captured, the whole content of the line fill buffer is transferred on the next clock cycle.
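The counter behaviour just described might be sketched as follows; the names are illustrative, and line_ready pulses for one cycle once the eighth word has been captured.

// Line-fill counter: eight words arrive from main memory; after the
// eighth, line_ready signals that the full line can be written to cache.
module line_fill_counter (
    input  wire clk,
    input  wire rst,
    input  wire word_valid,   // a word from main memory is being latched
    output reg  line_ready    // all 8 words captured
);
    reg [2:0] count;          // counts words 0..7

    always @(posedge clk) begin
        if (rst) begin
            count      <= 3'd0;
            line_ready <= 1'b0;
        end else begin
            line_ready <= 1'b0;
            if (word_valid) begin
                if (count == 3'd7) begin
                    line_ready <= 1'b1;  // 8th word: line complete
                    count      <= 3'd0;
                end else begin
                    count <= count + 3'd1;
                end
            end
        end
    end
endmodule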
3) Line Read buffers (LRBs)
These buffers hold a line from the cache in the case of a cache hit. It takes a single clock cycle to transfer data from the cache memory to the line read buffer.
4) Address Buffer
It holds the addresses coming from the processor. Here the address buffer can store 8 addresses, as an AXI interface is used. After all 8 address slots are filled, an interrupt is generated.
E. Pseudo Least Recently Used (pLRU) Replacement
Algorithm
LRU keeps all cache entries in the order of their last reference time. To keep a strict LRU order, a relatively large number of bits is needed. For example, for a four-way associative cache there are 4! = 24 different permutations of use orders, so ⌈log2(24)⌉ = 5 bits are required to keep track of them. Hence the space required to implement LRU is large, and it is also expensive in terms of speed and hardware. The main disadvantage is that we need to remember the order in which all N lines were last accessed.
To save space, pseudo-LRU replacement algorithms have been proposed. In pseudo-LRU replacement algorithms, the LRU order of the cache lines is kept only approximately. For example, in the PLRU-tree scheme, only the just-referenced line is accurately recorded, and the order of the other cache lines is not precise. At the cost of this precision, pseudo-LRU replacement algorithms need fewer bits for replacement decision making [2].
1) Tree-based Pseudo LRU (pLRU-t)
This binary tree approximation of the LRU algorithm requires N-1 bits in an N-way associative cache. Hence a four-way set associative cache requires 3 bits, fewer than the 5 bits of true LRU [10]. The algorithm is shown in the figure below; to understand the diagram, refer to these definitions:
• Each bit represents one branch point in a binary decision tree.
• Let 1 represent that the left side has been referenced more recently than the right side, and 0 vice versa [9].
Figure 4 shows the pLRU tree used for the replacement of lines, that is, for deciding which line to replace when all cache lines are filled and a new cache line is to be written.
Figure 4: pLRU-tree Cache Replacement Algorithm
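For a four-way set, the three pLRU bits can be maintained and consumed as in the sketch below. The bit ordering and signal names are our assumptions, using the convention above that 1 means the left side was referenced more recently; the victim is found by walking the tree away from the recently used sides.

// Tree-based pseudo-LRU state for one 4-way set.
module plru4 (
    input  wire       clk,
    input  wire       rst,
    input  wire       access,    // a line in this set was referenced
    input  wire [1:0] way_used,  // which way (0..3) was referenced
    output wire [1:0] victim     // way to replace on the next miss
);
    // b[2] is the root (left pair = ways 0/1 vs right pair = ways 2/3);
    // b[1] orders ways 0/1, b[0] orders ways 2/3.
    reg [2:0] b;

    always @(posedge clk) begin
        if (rst)
            b <= 3'b000;
        else if (access) begin
            b[2] <= ~way_used[1];      // 1: left pair referenced more recently
            if (way_used[1] == 1'b0)
                b[1] <= ~way_used[0];  // 1: way 0 more recent than way 1
            else
                b[0] <= ~way_used[0];  // 1: way 2 more recent than way 3
        end
    end

    // Follow the "less recently used" side at each branch point.
    assign victim = { b[2], b[2] ? b[0] : b[1] };
endmodule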
F. Finite State Machine Diagram
Figure 5 shows the state diagram of the finite state machine used in the cache controller. At the beginning, the controller waits for an instruction to read or write (load or store) data. If the instruction executed is a load, the controller begins a set of steps. First, it has to do a comparison: if the address tag is equal to the cache tag, a read hit occurs, the value from the cache is used, and the controller waits until the transaction is completed before returning to the idle state. If a read miss occurs, the data is read from main memory; the controller waits until the process is finished and returns to the beginning. This is a slower process than a read hit, because the access time of the external memory is greater than that of the internal RAMs. If the instruction received is a store, it is also necessary to compare the tags. If the address tag is equal to the cache tag, a write hit occurs and the data is written into the cache and into the main memory. If a write miss occurs, the data is written into the main memory [4].
• IDLE: No memory access underway.
• READ: Read access initiated by the processor; the cache is checked during this state. If hit, the access is satisfied from the cache during this cycle and control returns to the IDLE state at the next transition. If miss, transition to the READMISS state to initiate the main memory access.
• READMISS: Initiate the memory access following a read miss. Transition to the READMEM state.
• READMEM: Main memory read in progress. Remain in this state while the memory is being read, then transition to the READDATA state.
• READDATA: Data is available from the main memory read. Write this data into the cache line and use it to satisfy the original processor read request.
• WRITE: Write access initiated by the processor. If the cache is hit, transition to the WRITEHIT state; if miss, transition to the WRITEMISS state.
• WRITEHIT: The cache has been hit on a write operation. Complete the write to the cache and initiate the write-through to main memory. Transition to the WRITEMEM state and to the WRITECACHE state.
• WRITEMISS: The cache has been missed on a write operation. Initiate the write-through to main memory (the block is not loaded into the cache, as no write allocate is used). Next state: WRITEMEM.
• WRITEMEM: Main memory write in progress. Wait while the memory is being written, then transition to the WRITEDATA state.
• WRITECACHE: On a write hit, the data is written to the cache memory, as the write-through policy is being used.
• WRITEDATA: Last cycle of the main memory write. Assert the Ready signal to the processor to indicate completion of the write.
Figure 5: FSM Diagram of Cache Controller
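A condensed Verilog skeleton of this state machine is given below. The handshake signals (mem_ready, proc_ready) are assumptions, the actual cache and memory write enables would be driven as in the write-policy sketch of Section III-C, and the parallel WRITECACHE state is folded into WRITEHIT for brevity.

// Condensed cache-controller FSM following the state list above.
module cache_fsm (
    input  wire clk,
    input  wire rst,
    input  wire read_req,   // load request from the processor
    input  wire write_req,  // store request from the processor
    input  wire hit,        // from the hit/miss logic
    input  wire mem_ready,  // main-memory handshake (assumed)
    output reg  proc_ready  // pulses when the access completes
);
    localparam IDLE      = 4'd0, READ      = 4'd1, READMISS = 4'd2,
               READMEM   = 4'd3, READDATA  = 4'd4, WRITE    = 4'd5,
               WRITEHIT  = 4'd6, WRITEMISS = 4'd7, WRITEMEM = 4'd8,
               WRITEDATA = 4'd9;

    reg [3:0] state;

    always @(posedge clk) begin
        if (rst) begin
            state      <= IDLE;
            proc_ready <= 1'b0;
        end else begin
            proc_ready <= 1'b0;
            case (state)
                IDLE:      if (read_req)       state <= READ;
                           else if (write_req) state <= WRITE;
                READ:      if (hit) begin      // satisfied from the cache
                               proc_ready <= 1'b1;
                               state      <= IDLE;
                           end else
                               state <= READMISS;
                READMISS:  state <= READMEM;   // start the main memory read
                READMEM:   if (mem_ready) state <= READDATA;
                READDATA:  begin               // line written to cache; reply
                               proc_ready <= 1'b1;
                               state      <= IDLE;
                           end
                WRITE:     state <= hit ? WRITEHIT : WRITEMISS;
                WRITEHIT:  state <= WRITEMEM;  // cache written; write-through starts
                WRITEMISS: state <= WRITEMEM;  // no write allocate: memory only
                WRITEMEM:  if (mem_ready) state <= WRITEDATA;
                WRITEDATA: begin               // last cycle of the memory write
                               proc_ready <= 1'b1;
                               state      <= IDLE;
                           end
                default:   state <= IDLE;
            endcase
        end
    end
endmodule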
V. SIMULATION RESULTS
Simulations are shown for the entire cache controller design (including the whole cache and the memory), run using the Questasim 10.0b software.
Figure 6 shows the result for a cache write miss at first, so the data is written to main memory. When the same address is then given with the read enable high, a cache read miss occurs, as the data is not yet available in the cache memory; the data is written into the cache memory, and the output data is read from main memory. Next, giving the same address again shows a cache read hit, and the output for the requested address is obtained from the cache memory. However many times the same address is requested, the data will be available from the cache memory without going to main memory, increasing speed.
Figure 6: Simulation Result 1 of Cache Write Miss and Cache Read Miss and Hit
Figure 7 shows the cache write hit scenario. Once data has been written to the cache, the cache stores the tag of the requested address. If the same tag address arrives again, a write hit occurs, as shown in the figure, and the new data is overwritten in the cache memory without writing back to main memory.
Figure 7: Simulation Result 2 of Cache Write Hit
Figure 8 shows the pLRU replacement policy in use. First, all cache lines are filled. Now, if a new address arrives, a cache line has to be overwritten; according to the value of the LRU bits, the least recently used line is overwritten. Here we have depicted that, as a four-way set associative cache is used, all four lines in the set are replaced with new data as writes to the cache occur.
Figure 8: Simulation Result 3 of pLRU Replacement Algorithm Implementation
VI. CONCLUSION
Based on the results of the simulations, it can be concluded that the design functions successfully. Furthermore, the design has been shown to be implementable in real-life systems using higher specifications. Hence, we can say that the cache controller finds the requested address in the cache memory and gives as output the data of that particular cache line. If the data is not available in the cache memory, it fetches the data from main memory and stores it in the cache memory. The cache controller thereby also tracks the miss rate of the cache memory.
REFERENCES
[1] Vipin S. Bhure and Dinesh Padole, "Design of Cache Controller for Multi-core Systems Using Multilevel Scheduling Method", 2012 Fifth International Conference on Emerging Trends in Engineering and Technology.
[2] Hussein Al-Zoubi, Aleksandar Milenkovic, and Milena Milenkovic, "Performance Evaluation of Cache Replacement Policies for the SPEC CPU2000 Benchmark Suite".
[3] Ruchi Rastogi Bani, Saraju P. Mohanty, Elias Kougianos, and Garima Thakral, "Design of a Reconfigurable Embedded Data Cache", 2010 International Symposium on Electronic System Design.
[4] Siti Lailatul Mohd Hassan, Mohd Naqib Johari, Azilah Saparon, Ili Shairah Abd Halim, and A'zraa Afhzan Ab Rahim, "Multi-Sized Output Cache Controllers", 2013 International Conference on Technology, Informatics, Management, Engineering & Environment (TIME-E 2013), Bandung, Indonesia, June 23-26, 2013.
[5] Ben Cohen, Srinivasan Venkataramanan, Ajeetha Kumari, and Lisa Piper, "Experiencing for Checkers a Cache Controller Design".
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers.
[7] Jim Handy, The Cache Memory Book.
[8] A. Agarwal and S. Pudar, "Column-associative caches: A technique for reducing the miss rate of direct-mapped caches", in Proceedings of the International Symposium on Computer Architecture, 1993, pp. 179-180.
[9] S. Jiang and X. Zhang, "Making LRU Friendly to Weak Locality Workloads: A Novel Replacement Algorithm to Improve Buffer Cache Performance", IEEE Transactions on Computers, Vol. 54, No. 8, Aug. 2005.
[10] Andreas Abel and Jan Reineke, "Reverse Engineering of Cache Replacement Policies in Intel Microprocessors and Their Evaluation", Department of Computer Science, Saarland University, Saarbrucken, Germany, 2014 IEEE.
[11] Yogesh S. Watile and A. S. Khobragade, "FPGA Implementation of Cache Memory", International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 3, Issue 3, May-Jun 2013, pp. 283-286.
[12] Roy W. Badeau, "A 100-MHz Macropipelined VAX Microprocessor", IEEE Journal of Solid-State Circuits, Vol. 27, No. 11, November 1992.
[13] Daniel W. Dobberpuhl, "A 200-MHz 64-bit Dual-Issue CMOS Microprocessor", IEEE Journal of Solid-State Circuits, Vol. 27, No. 11, November 1992.