Chapter 12
Caches
Optimization Technique in Embedded System (ARM)
LiangAlei@SJTU.edu.cn, 2008 April
ARM 2007 liangalei@sjtu.edu.cn
Introduction
• Cache
  – A small, fast array of memory placed between the
    processor core and main memory that stores portions of
    recently referenced main memory.
  – The word cache is a French word meaning “a concealed
    place for storage”.
• Write Buffer
  – Often used with a cache; a very small
    first-in-first-out (FIFO) memory placed between the
    processor core and main memory.
  – The purpose of a write buffer is to free the processor
    core and cache memory from the slow write time
    associated with writing to main memory.
12.1 The Memory Hierarchy and Cache Memory
Memory Hierarchy
Cache & Write Buffer
Overview
• Cache, Write Buffer
  – N-way set associative caches
  – The ARM940T’s cache
  – Testing a cache
• Cache Clean & Flush (or Invalidate)
• Cache Lockdown
12.1.1 Caches and Memory Management Units
• Virtual vs. physical caches
  – Virtual cache: ARM7 ~ ARM10, StrongARM, XScale
  – Physical cache: ARM11
• Locality of reference
  – The cache exploits locality of reference in both
    time and space.
    » If the reference is close in time, it is called temporal
      locality.
    » If it is close in address proximity, it is called
      spatial locality.
Logical and Physical Caches
12.2 Cache Architecture
Two Bus Architecture
• ARM uses two bus architectures in its cached
  cores: the Von Neumann and the Harvard.
  – In processor cores using the Von Neumann architecture,
    a single cache is used for both instructions and
    data. This type is known as a unified cache.
  – The Harvard architecture has separate instruction and
    data buses to improve overall system performance. This
    type of cache is known as a split cache.
12.2.1 Basic Architecture of a Cache Memory
• Three main parts in a cache: Directory
store, Status information, and Data section
Cache Organization
• Directory
– The cache must know where the information stored in a
cache line originates from in main memory.
– The directory entry is known as a cache-tag.
• Data section
  – stores the data read from main memory.
  – The “size of a cache” is defined as the actual code
    and data the cache can store from main memory.
• Status bits
  – The two common status bits are: valid and dirty.
12.2.2 Basic Operations of a Cache Controller
• The Cache Controller
– is hardware that copies code and data from main
memory to cache memory automatically.
– The cache controller intercepts read and write memory
requests before passing them on to the memory
controller.
– It processes a request by dividing the address of the
request into three fields: tag, set index, data index.
Access Cache
• First, use the “set index” to locate the cache line within
  cache memory.
  – Check the “valid” bit of the line;
  – Compare the stored cache-tag to the “tag” field of the address.
  – It’s a cache “hit” if both the status check and the comparison
    succeed, or a “miss” if either fails.
• On a cache miss,
  – the controller copies an entire cache line from main memory
    and provides the requested code or data to the processor.
• On a cache hit,
  – the controller supplies the code or data directly from cache
    to the processor (the word is selected by the “data index”).
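The lookup above can be sketched in C. A minimal software model of a direct-mapped lookup; the geometry (16-byte lines, 64 lines) is illustrative, not that of any particular ARM core:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 16u   /* assumed line size */
#define NUM_LINES  64u   /* assumed number of cache lines */

typedef struct {
    bool     valid;              /* status bit: line holds real data   */
    uint32_t tag;                /* cache-tag from the directory store */
    uint8_t  data[LINE_BYTES];   /* data section of the line           */
} CacheLine;

/* The three fields the controller derives from an address. */
static uint32_t data_index(uint32_t addr) { return addr % LINE_BYTES; }
static uint32_t set_index(uint32_t addr)  { return (addr / LINE_BYTES) % NUM_LINES; }
static uint32_t tag_of(uint32_t addr)     { return addr / (LINE_BYTES * NUM_LINES); }

/* Hit if the selected line is valid AND its cache-tag matches. */
static bool is_hit(const CacheLine cache[], uint32_t addr)
{
    const CacheLine *line = &cache[set_index(addr)];
    return line->valid && line->tag == tag_of(addr);
}
```

On a miss, a real controller would fill `cache[set_index(addr)]` with the whole line from main memory and update its tag and valid bit.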
12.2.3 Direct Mapped Cache
• In a direct-mapped cache
– each addressed location in main memory maps to a single
location in cache memory.
» Since main memory is much larger than the cache
memory, there are many addresses in main memory that
map to the same single location in cache memory.
• Data streaming
  – During a cache line fill, the cache controller may forward the
    data being loaded to the core at the same time as it copies it
    into the cache.
• Thrashing
  – Direct-mapped caches are subject to high levels of thrashing
    – a software battle for the same location in cache memory.
  – The result of thrashing is the repeated loading and eviction of
    a cache line.
12.2.4 Set Associativity
• Some caches include an additional feature
  to reduce the frequency of thrashing
  – This structural design feature divides
    the cache memory into smaller equal units, called ways.
  – Cache lines with the same set index are said to be
    in the same set.
• The set of cache lines pointed to by the set
  index is set associative.
  – The cache-tag field becomes larger, and the set index smaller.
12.2.4.1 Increasing Set Associativity
• As the associativity of a cache controller
  goes up, the probability of thrashing goes
  down.
  – The ideal would be to maximize the set associativity of a
    cache by designing it so any main memory location maps to
    any cache line; this is known as a fully associative cache.
  – However, as the associativity increases, so does the
    complexity of the hardware that supports it.
  – One method hardware designers use to increase the set
    associativity of a cache is a content addressable
    memory (CAM).
CAM
• A CAM works in the opposite way to a RAM
  – Where a RAM produces data when given an address
    value, a CAM produces an address if a given data value
    exists in the memory.
• Using a CAM allows many more cache-tags to
  be compared simultaneously.
  – A CAM uses a set of comparators to compare the input
    tag address with the cache-tag stored in each valid
    cache line.
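The RAM-vs-CAM contrast can be sketched in C. A sequential loop stands in for the hardware's parallel comparators; the entry count is illustrative:

```c
#include <stdint.h>

#define NUM_ENTRIES 64   /* assumed: one CAM entry per way */

typedef struct {
    int      valid;   /* only valid entries participate in the compare */
    uint32_t tag;     /* cache-tag stored in this entry */
} CamEntry;

/* RAM: address in, data out.  CAM: data in, address out.
 * Returns the index (the "address") of the matching entry, or -1
 * if the tag is nowhere in the memory (a miss). */
static int cam_lookup(const CamEntry cam[], uint32_t input_tag)
{
    for (int i = 0; i < NUM_ENTRIES; i++)   /* hardware does this in parallel */
        if (cam[i].valid && cam[i].tag == input_tag)
            return i;
    return -1;
}
```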
CAM in ARM920T/940T
• Using a CAM to locate cache-tags is the design choice ARM made
  in their ARM920T and ARM940T processor cores.
  – The caches in the ARM920T and ARM940T are 64-way set associative.
  – The tag portion of a requested address is used as an input to the
    four CAMs that simultaneously compare the input tag with all
    cache-tags stored in the 64 ways.
12.2.5 Write Buffers
• Write buffer
  – A small, fast FIFO holding data that the processor would normally
    write to main memory. It reduces the processor time taken to write
    small blocks of sequential data to main memory.
  – The efficiency of the write buffer depends on the ratio of main
    memory writes to the number of instructions executed.
    » Over a given time interval, if the number of writes to main
      memory is low or sufficiently spaced between other processing
      instructions, the write buffer will rarely fill.
  – The write buffer also improves cache performance during cache line
    evictions.
• While data is in the write buffer, it is not available to be read.
  – This is one of the reasons that the FIFO depth is usually quite small.
• Write merging (or coalescing) (ARM10)
  – A new write merges with a buffered one if it is to the same address;
  – or if the new data and old data fit into the same memory block.
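The two merge conditions can be written as a small C predicate. A sketch, assuming a 16-byte merge granularity (the real granularity is implementation defined):

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_BYTES 16u   /* assumed merge granularity */

typedef struct {
    bool     used;    /* entry currently holds a pending write */
    uint32_t addr;    /* address of the buffered write */
    uint32_t value;   /* data waiting to go to main memory */
} WbEntry;

/* A new write can merge with a buffered entry if it hits the same
 * address (the new value simply replaces the old) or falls in the
 * same memory block (the writes coalesce into one transfer). */
static bool can_merge(const WbEntry *e, uint32_t addr)
{
    if (!e->used)
        return false;
    if (e->addr == addr)
        return true;                                        /* same address */
    return (e->addr / BLOCK_BYTES) == (addr / BLOCK_BYTES); /* same block   */
}
```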
12.3 Cache Policy
Three Cache Policies
• There are three policies that determine the
operation of a cache
– Write Policy, Replacement Policy, and Allocation Policy
12.3.1 Writeback vs. Writethrough
• Writethrough
  – The cache controller writes to both the cache and main memory
    when there is a cache hit on a write, ensuring that the
    cache and main memory stay coherent at all times, but this is
    slower than writeback.
• Writeback
  – The cache controller writes only to valid cache data memory, and
    not to main memory. Consequently, valid cache lines and
    main memory may contain different data.
  – The line data is written back to main memory when the line is
    evicted.
  – This requires one or more dirty bits per line.
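The two write policies side by side, as a minimal C sketch (one line, one word of backing memory; names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid, dirty;
    uint32_t data;
} Line;

/* Writethrough: update cache AND main memory on a write hit,
 * so the two always stay coherent. */
static void write_through(Line *line, uint32_t *mainmem, uint32_t v)
{
    line->data = v;
    *mainmem   = v;
}

/* Writeback: update only the cache and mark the line dirty;
 * main memory is updated later, on eviction. */
static void write_back(Line *line, uint32_t v)
{
    line->data  = v;
    line->dirty = true;
}

/* On eviction, a dirty line's data must be written out first. */
static void evict(Line *line, uint32_t *mainmem)
{
    if (line->valid && line->dirty)
        *mainmem = line->data;
    line->valid = false;
    line->dirty = false;
}
```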
12.3.2 Cache Line Replacement Policy
• Replacement policy
  – The strategy implemented in a cache controller to select the
    next victim.
• ARM cached cores support two replacement policies
  – round-robin: sequential increment of a victim counter.
  – pseudorandom
    » Uses a non-sequentially incrementing victim counter. (When the
      counter reaches a maximum value, it is reset to a defined
      base value.)
• Most ARM cores support both policies
  – The round-robin replacement policy has great predictability,
    which is desirable in an embedded system.
  – However, a round-robin replacement policy is subject to large
    changes in performance given small changes in memory access.
    » To show this change in performance, see Example 12.1.
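The two victim counters can be sketched in C. The round-robin step is exactly as described; the pseudorandom step below is an arbitrary illustration of "non-sequential increment, wrap on overflow", not the hardware's actual sequence:

```c
#define NUM_WAYS 64u   /* assumed, as in the ARM940T */

/* Round-robin: the victim counter simply increments and wraps. */
static unsigned rr_next(unsigned victim)
{
    return (victim + 1u) % NUM_WAYS;
}

/* Pseudorandom: a non-sequential step; on overflow the counter is
 * brought back into range ("reset to a defined base value").
 * The constants here are purely illustrative. */
static unsigned prand_next(unsigned victim)
{
    victim = victim * 5u + 13u;
    if (victim >= NUM_WAYS)
        victim %= NUM_WAYS;
    return victim;
}
```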
case: ARM940T’s Cache
• L1 Cache
  – 4KB total size
  – 64 ways
  – 16 bytes per line (so 4 sets per way)
• Address breakdown, MVA = (PID + VA):
  – bits [31:6] = TAG
  – bits [5:4]  = set index (idx)
  – bits [3:0]  = byte select (BS)
• Lookup (figure): the tag is compared against all 64 ways of the
  indexed set in parallel (64:1 compare & select); a 16:1 selector
  then picks the addressed byte within the 16-byte line.
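The field boundaries on this slide can be checked with a small C sketch (geometry as given above; the helper names are illustrative):

```c
#include <stdint.h>

/* ARM940T geometry: 4KB cache, 64 ways, 16-byte lines. */
enum { LINE_BYTES = 16, NUM_WAYS = 64, CACHE_BYTES = 4096 };
enum { NUM_SETS = CACHE_BYTES / (NUM_WAYS * LINE_BYTES) };  /* 4 sets */

/* MVA bit fields from the slide: BS = [3:0], idx = [5:4], TAG = [31:6]. */
static uint32_t bs(uint32_t mva)  { return mva & 0xFu; }
static uint32_t idx(uint32_t mva) { return (mva >> 4) & 0x3u; }
static uint32_t tag(uint32_t mva) { return mva >> 6; }
```

Note that adding 0x40 to an address changes the tag but not idx, which is exactly the stride Example 12.1 below uses to keep hitting the same set.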
Example 12.1
• readSet reads one word from each of numset lines that all share the
  same set index, and repeats the whole pass times times (ADS inline
  assembly; are the C locals used inside __asm allocated to registers?)

    int readSet(unsigned int times, unsigned int numset)
    {
        int setcount, value;
        volatile int *newstart;
        volatile int *start = (volatile int *)0x20000;  // why is it 0x20000 ?

        __asm
        {
        timesloop:
            MOV     newstart, start
            MOV     setcount, numset
        setloop:
            LDR     value, [newstart, #0]
            ADD     newstart, newstart, #0x40   ; 0x40: keep idx unchanged
            SUBS    setcount, setcount, #1
            BNE     setloop
            SUBS    times, times, #1
            BNE     timesloop
        }
        return value;
    }

• Test with numset = 64 or 65 (figure: way occupancy, entries are
  line/pass for ways 00..63):
  – numset = 64: one pass allocates lines 0..63 into ways 00..63 of the
    set; every later pass re-reads the same 64 lines, so after the first
    pass all accesses hit.
  – numset = 65 in round-robin (RR): line 64 evicts way 00 (holding
    line 0), the next pass’s line 0 then evicts way 01, and so on – each
    read evicts a line that is needed again before the victim pointer
    comes back around, so every access misses.
Test Result
• Given (in ADS1.2)
  – times = 0x10000 (2^16), 50MHz core, 100ns (non-seq), 50ns (seq)
• Results
  – Round-robin, test size 64:  0.51 seconds
  – Random,      test size 64:  0.51 seconds
  – Round-robin, test size 65:  2.56 seconds
  – Random,      test size 65:  0.58 seconds
• Analysis
  – numset=64: Ta = 2^16*Texe + 64*Tm = 510,000,000 ns
  – numset=65: Tb = 2^16*(Texe + 65*Tm) = 2,560,000,000 ns
  – So (2^16*65 − 64)*Tm = Tb − Ta = 2,050,000,000 ns
  – Thus Tm ≈ 481 ns per line fill; at 4 words per line, about 120 ns
    per word access.
• This is an extreme example,
  – but it does show the difference between using a round-robin policy
    and a random replacement policy.
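The arithmetic in the analysis can be checked directly. A small C helper (function name is illustrative) that backs out the line-fill time from the two measurements:

```c
/* Back out the line-fill time Tm from the two measurements:
 *   Ta = 2^16 * Texe + 64 * Tm      (numset = 64: only cold misses)
 *   Tb = 2^16 * (Texe + 65 * Tm)    (numset = 65: every read misses)
 * Subtracting cancels the per-pass execution time Texe:
 *   Tb - Ta = (2^16 * 65 - 64) * Tm */
static double line_fill_ns(double passes, double Ta_ns, double Tb_ns)
{
    return (Tb_ns - Ta_ns) / (passes * 65.0 - 64.0);
}
```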
12.4 CP15 & Cache
• The cache is controlled via CP15 registers
• Primary CP15 registers
  – c7, c9: control the setup and operation of the cache
• Secondary CP15 registers
  – CP15:c7 registers
    » Write-only; used to clean (write dirty lines back) and to
      flush (just invalidate) the cache.
  – CP15:c9 registers
    » Define the victim pointer base address.
Coprocessor Instructions
• Instruction format
  – MRC|MCR cp, opcode1, Rd, Cn, Cm, opcode2
    » opcode1, opcode2: executed by the coprocessor
    » Cn: major register
    » Cm: minor register
• Example
  – MRC p15, 0, r1, c1, c0, 0
12.5 CP15:c7 – Clean/Flush Cache
• CP15:c7 is a write-only register
  – Sometimes also used for prefetch buffers and the BTB;
  – Instruction format
    » MCR p15, 0, <Rd>, c7, <CRm>, <op2>
  – CRm
    » 5=Flush I-Cache, 6=Flush D-Cache, 7=Flush Both
    » 10=Clean D-Cache, 11=Clean Unified
    » 14=Clean&Invalidate D-Cache, 15=Clean&Invalidate Unified
  – op2 (index method)
    » 0 = SBZ (whole cache)
    » 1 = MVA
    » 2 = Set/Index
    » 3 = Test-Clean (only for ARM926EJ-S/1026EJ-S)
Cache Operations
• Clean/Write-back/Copy-back
  – applies to a write-back D-cache;
• Flush/Invalidate
  – just invalidates line(s); does not include cleaning;
• Prefetch
  – loads memory line(s) into the cache;
• Drain Write Buffer
  – stops the ARM from executing further until the write buffer is empty;
• Wait for Interrupt
  – puts the ARM into a low-power state until an interrupt occurs;
• Prefetch buffer
  – IMPLEMENTATION DEFINED;
• Branch Target Cache
  – IMPLEMENTATION DEFINED;
• Data
  – the value in <Rd>
Self-Modifying Code (SMC)
• The cache may also need cleaning or flushing before
  the execution of self-modifying code in a split cache.
• The need to clean or flush arises from two possible
  conditions
  – First, the self-modifying code may be held only in the D-cache
    and therefore be unavailable to load from main memory as an
    instruction.
  – Second, existing instructions in the I-cache may mask new
    instructions that have already been written to main memory.
• So, after self-modifying code has been written
  – the D-cache should be cleaned, so the new code is present in
    main memory;
  – the I-cache should be flushed (invalidated), to prevent stale
    instructions from hitting in the cache.
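The required ordering, clean D-cache before invalidate I-cache, can be sketched in plain C. The maintenance hooks below are hypothetical stand-ins for the CP15:c7 operations (an operation log replaces the real hardware effect so the ordering can be checked):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the CP15:c7 clean/invalidate operations;
 * they only record the order in which they were issued. */
enum { OP_CLEAN_DCACHE = 1, OP_INVALIDATE_ICACHE = 2 };

static int    log_ops[4];
static size_t log_n;

static void clean_dcache_range(void *addr, size_t len)
{
    (void)addr; (void)len;
    log_ops[log_n++] = OP_CLEAN_DCACHE;
}

static void invalidate_icache_range(void *addr, size_t len)
{
    (void)addr; (void)len;
    log_ops[log_n++] = OP_INVALIDATE_ICACHE;
}

/* After writing new instructions: clean the D-cache first (push the
 * new code out to main memory), THEN invalidate the I-cache (discard
 * any stale copies).  The reverse order could refetch stale code. */
static void sync_code(void *addr, size_t len)
{
    clean_dcache_range(addr, len);
    invalidate_icache_range(addr, len);
}
```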
12.5.1 Flushing ARM Cached Cores
• For example, flushing D-Cache
– MCR p15, 0, Rd, c7, c6, 0
• Note
– Rd should be zero.
12.5.2 Cleaning ARM Cached Cores
• To clean a cache is to issue commands that
force the cache controller to write all dirty
D-cache lines out to main memory.
12.5.3 Cleaning the D-Cache
• There are three methods used to clean the
  D-Cache
  – Clean a given line via the c7 register format (c7f) {way & set};
  – the Test-Clean instruction;
  – clean a dedicated block/line via MVA.
Notes: Cache Parameters
• Cache parameters
  – CSIZE: size of the cache = 2^CSIZE bytes
  – CLINE: size of a line = 2^CLINE bytes
  – NWAY: number of ways
• Command fields (c7f, c9f)
  – I7SET: ‘SET’ field offset in CP15:c7
  – I7WAY: ‘WAY’ field offset in CP15:c7
  – I9WAY: ‘WAY’ field offset in CP15:c9
• And two others can be calculated from these
  – SWAY: bytes per way
  – NSET: lines per way
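The two derived parameters follow directly from the primary ones; a sketch (the test values assume the ARM940T: CSIZE = 12 for 4KB, CLINE = 4 for 16-byte lines, NWAY = 64):

```c
/* SWAY = bytes per way = cache size / number of ways */
static unsigned sway(unsigned csize, unsigned nway)
{
    return (1u << csize) / nway;
}

/* NSET = lines per way = SWAY / line size */
static unsigned nset(unsigned csize, unsigned cline, unsigned nway)
{
    return sway(csize, nway) >> cline;
}
```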
12.5.4 Index line via {way, set}
List of c7 format (c7f)
in {Way, Set}
    c7f     RN 0                            ; cp15:c7 register format

            MACRO
            CACHECLEANBYWAY $op
            MOV     c7f, #0                     ; create c7 format
    5       IF      "$op" = "Dclean"
            MCR     p15, 0, c7f, c7, c10, 2     ; clean D-cline
            ENDIF
            IF      "$op" = "Dcleanflush"
            MCR     p15, 0, c7f, c7, c14, 2     ; cleanflush D-cline
            ENDIF
            ADD     c7f, c7f, #1<<I7SET         ; +1 set index
            TST     c7f, #1<<(NSET+I7SET)       ; test set index overflow
            BEQ     %BT5
            BIC     c7f, c7f, #1<<(NSET+I7SET)  ; clear set index overflow
            ADDS    c7f, c7f, #1<<I7WAY         ; +1 victim pointer (way)
            BCC     %BT5                        ; loop until way overflows
            MEND

    cleanDCache
            CACHECLEANBYWAY Dclean
            MOV     pc, lr

    cleanFlushDCache
            CACHECLEANBYWAY Dcleanflush
            MOV     pc, lr

    cleanFlushCache
            CACHECLEANBYWAY Dcleanflush
            MCR     p15, 0, r0, c7, c5, 0       ; flush I-cache
            MOV     pc, lr
12.5.5 Cleaning the D-Cache using the Test-Clean command
• Test-Clean searches for the first “dirty” cache line and cleans
  it by transferring its contents to main memory.

    /* ARM926EJ-S, ARM1026EJ-S */
    cleanDCache
        MCR     p15, 0, r15, c7, c10, 3   ; test & clean one dirty line
        BNE     cleanDCache               ; loop while Z flag clear (dirty lines remain)
        MOV     pc, lr

  – Note: it is r15 (the pc) that is written in the MCR.
12.5.6 Cleaning the D-Cache in Intel XScale and StrongARM
• The Intel XScale and StrongARM processors use a
  third method to clean their D-cache.
  – A command allocates a line in the D-cache without doing a line
    fill: it sets the valid bit and fills the directory entry with
    the cache-tag provided in the <Rd> register.
  – No data is transferred from main memory; thus the data is
    not initialized until it is written by the processor.
12.5.7 Invalidate & Clean a line via MVA
• CP15:c7 register format: MVA(31:5), SBZ(4:0)
• Example: flush (invalidate) the I-cache lines covering (addr, size)

    addr    RN 0
        BIC     addr, addr, #(1 << CLINE) - 1   ; align addr to a line boundary
        MOV     nl, size, lsr #CLINE            ; nl = number of lines; CLINE = log2(line size)
    10  MCR     p15, 0, addr, c7, c5, 1         ; invalidate I-cache line @ addr
        ADD     addr, addr, #1 << CLINE
        SUBS    nl, nl, #1
        BNE     %BT10
Commands to Flush & Clean a single line via MVA or PA
c7f: clean & flush a line
12.6 Cache Lockdown
Cache Lockdown
• Cache lockdown is a feature that enables a
  program to load time-critical code and data
  into the cache and mark it as exempt from
  eviction.
  – Lock unit: a WAY (rather than a line or any other unit)
  – Code that is a candidate for locking:
    » the vector table (IVT), interrupt handlers (ISRs), critical algorithms
  – Data that is a candidate for locking:
    » frequently used global variables
• Q: How?
Methods of Cache Lockdown
12.6.1 Procedures for Lockdown
• Example

    int globalData[16];
    unsigned int *vectortable = (unsigned int *)0x0;
    int vectorCodeSize = 212;   /* IVT + IRQ handler */

    state = disable_interrupt();
    enableCache();
    flushCache();
    wayIndex = lockDcache(globalData, sizeof(globalData));
    wayIndex = lockIcache(vectortable, vectorCodeSize);
    enable_interrupt(state);
12.6.2 Lockdown base
• By “victim counter”
  – The victim counter is reset to the “victim reset value” when it
    increments beyond the number of ways in the core.
  – In the round-robin method, the entry (way) to be evicted is the
    one pointed to by the victim counter.
  – Either the code or the data cache can be controlled by the victim
    pointer, and content can be prefetched into the cache.
Lock-down Commands
• Commands that lock data in cache by referencing its
way.
How to lock down
• Principles for cache lock down
– 1. Ensure that no processor exceptions can occur during the execution
of this procedure (by disabling interrupts). If for some reason this is
not possible, all code and data used by any exception handlers that can
get called must be treated as code and data used by this procedure
for the purpose of steps 2 and 3.
– 2. If an instruction cache or a unified cache is being locked down,
ensure that all the code executed by this procedure is in an uncachable
area of memory.
– 3. If a data cache or a unified cache is being locked down, ensure
that all data used by the following code is in an uncachable area of
memory, apart from the data which is to be locked down.
– 4. Ensure that the data/instruction that are to be locked down are in
a cachable area of memory.
– 5. Ensure that the data/instruction that are to be locked down are not
already in the cache, using cache clean and/or invalidate instructions
as appropriate.
• Ref: ARM Architecture Reference Manual, DDI 0100E, p. B5-20
How to Lockdown (1)
• First, ensure the code to be locked is not already
  in the cache.
  – e.g., invalidate the D- or I-cache
    » MCR p15, 0, Rd, c7, c5, 0   ; invalidate ICache
    » MCR p15, 0, Rd, c7, c5, 1   ; invalidate ICache line via MVA
    » MCR p15, 0, Rd, c7, c6, 0   ; invalidate DCache
    » MCR p15, 0, Rd, c7, c6, 1   ; invalidate DCache line via MVA
  – or clean them (for a writeback DCache)
    » MCR p15, 0, Rd, c7, c10, 1  ; clean DCache line via MVA
    » MCR p15, 0, Rd, c7, c10, 2  ; clean DCache line via index
How to lockdown (2)
• Then, set the victim counter
  – Write CP15:c9 to force the victim pointer to a specific way.
  – Use either the C9F_a format (without the L bit) or C9F_b (with the L bit).
• Finally, load the code/data into the cache with a software routine.
  » NOTE: it is important that the code and data to be locked in
    cache do not already exist elsewhere in the cache. Also, enable
    the MMU so that any TLB miss while loading instructions or data
    causes a page table walk.
  – For data, use LDR;
  – For instructions, use the prefetch I-cache line operation
    (CP15:c7, CRm = c13)
12.7 Cache & Software Performance
• Here are a few simple rules to help write
  code that takes advantage of the cache
  architecture.
  – Use the cache and write buffer to improve the average memory
    access time (AMAT).
  – Be careful NOT to cache or write-buffer memory-mapped I/O;
  – Spatial locality: organize commonly used data in order,
    and within one cache line where possible;
  – Make the code as small as possible;
  – Searching a linked list degrades cache performance;
  – The cache is not the only factor that affects performance (see
    Chapters 5 and 6).
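The spatial-locality and linked-list rules can be illustrated together. A sketch (the struct and 16-byte line size are assumptions, not from the text): keeping the hot fields adjacent, within one line, and traversing an array in order means one line fill serves several consecutive accesses, whereas chasing pointers through a list scatters them.

```c
#include <stdint.h>

/* Hot data packed to exactly 16 bytes (one assumed cache line),
 * so one line fill brings in every field at once. */
struct TaskHot {
    uint32_t state;
    uint32_t priority;
    uint32_t ticks;
    uint32_t next_idx;   /* an index rather than a pointer: the tasks
                            live in an array, traversed in order */
};

/* Walking the array touches memory sequentially (spatial locality);
 * a pointer-linked list of the same records would not. */
static uint32_t sum_ticks(const struct TaskHot t[], unsigned n)
{
    uint32_t sum = 0;
    for (unsigned i = 0; i < n; i++)
        sum += t[i].ticks;
    return sum;
}
```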
Summary of cache
• The cache is a layer between the core and main memory.
• The write buffer is a FIFO between the core and main memory.
• Locality of reference (temporal and spatial)
• Line: the unit of cache operations
• Virtual vs. physical caches
• Cache clean, flush, and lockdown