Princess Sumaya University for Technology
Computer Architecture
Dr. Esam Al_Qaralleh
Memory Hierarchy Design
5.1 Introduction
Memory Hierarchy Design
Motivated by the principle of locality - A 90/10
type of rule
 Take advantage of 2 forms of locality
• Spatial - nearby references are likely
• Temporal - same reference is likely soon
Also motivated by cost/performance structures
 Smaller hardware is faster: SRAM, DRAM, Disk, Tape
 Access vs. bandwidth variations
 Fast memory is more expensive
Goal – Provide a memory system with cost
almost as low as the cheapest level and speed
almost as fast as the fastest level
Memory Relevance in Computer Design?
A computer’s performance is given by the number of
instructions executed per time unit
The time for executing an instruction depends on:
 The ALU speed (i.e., the data-path cycle duration)
 The time it takes for each instruction to load/store its
operands/result from/into the memory (in brief, the time to
access memory)
The processing speed (CPU speed) grows faster than
the memory speed. As a result, the CPU speed cannot
be fully exploited. This speed gap leads to an
unbalanced system!
DRAM/CPU Gap
CPU performance improves at 55%/year
 In 1996 it was a phenomenal 18% per month
DRAM performance has improved at only 7% per year
Levels in A Typical Memory
Hierarchy
Sample Memory Hierarchy
Unit of Transfer / Addressable Unit
 Unit of Transfer: Number of bits read from, or written
into memory at a time
 Internal : usually governed by data bus width
 External : usually a block of words, e.g., 512 or more
 Addressable unit: smallest location which can be
uniquely addressed
 Internal : word or byte
 External : device dependent e.g. a disk “cluster”
Access Method
Sequential
 Data is stored in records, access is in linear sequence
(tape)
Direct
 Data blocks have a unique and direct access, data
within block is sequential (disk)
Random
 Data has unique and direct access (ram)
Associative
 Data retrieved based on (partial) match rather than
address (cache)
Sequential Access Method
Start at the beginning – read through in
order
Access time depends on location of data
and previous location
e.g., tape
Direct Access Method
Individual blocks have unique address
Access is by jumping to vicinity plus
sequential search (or waiting! e.g. waiting
for disk to rotate)
Access time depends on target location and
previous location
e.g., disk
Random Access Method
Individual addresses identify specific
locations
Access time independent of location
or previous access
e.g., RAM
Associative Access Method
Data is located by a comparison with the contents of
a portion of the store (Content Addressable Memory)
Access time is independent of
location or previous access
e.g. Cache
Registers
Used for storing data (e.g., 16 words of 32 bits each)
Register files
 Multiple read/write ports possible
 Example: a 32-bit wide by 16-entry deep, dual-port
parallel read, single-port parallel write register file
Implementing Registers Using Logic Gates
Flip-flops
 Simple SR latch:
Truth table of the SR latch (from the figure):
S R | Q  Q'
1 1 | Q  Q'  (hold)
1 0 | 0  1
0 1 | 1  0
0 0 | x  x  (not allowed)
 Flip-flops
• JK, D, T
• Clocked
• Master-slave (edge-triggered)
Implementing Registers in CMOS
Direct gate implementation too costly
 A master-slave JK flip-flop uses 38 CMOS
transistors
Directly implement in transistors
 Example: clocked SR FF
Internal Organization of Memory
Chips
Figure 5.2. Organization of bit cells in a memory chip: a 16 x 8 array of flip-flop cells; address lines A0–A3 feed an address decoder that drives word lines W0–W15, and each data bit b7–b0 connects through a Sense/Write circuit to the data input/output lines, under R/W and CS control.
A Memory Chip
Organization of a 1K x 1 memory chip: the 10-bit address is split into a 5-bit row address (a 5-bit decoder selects one of the word lines W0–W31 of a 32 x 32 memory cell array) and a 5-bit column address (driving a 32-to-1 output multiplexer and input demultiplexer), with Sense/Write circuitry, R/W and CS controls, and a single data input/output line.
Static Memories
 The circuits are capable of retaining their state as long
as power is applied.
A static RAM cell: two cross-coupled inverters latch the state at nodes X and Y; transistors T1 and T2 connect X and Y to the bit lines b and b' when the word line is asserted.
Static Memories
Figure 5.5. An example of a CMOS memory cell: transistors T1–T4 form the two cross-coupled inverters (T3 and T4 connect to Vsupply), and T5 and T6 connect nodes X and Y to the bit lines b and b' under word-line control.
CMOS cell: low power consumption
Asynchronous DRAMs
 Static RAMs are fast, but they take more chip area and are more
expensive.
 Dynamic RAMs (DRAMs) are cheap and area-efficient, but they cannot
retain their state indefinitely – they need to be refreshed periodically.
A single-transistor dynamic memory cell: an access transistor T connects a storage capacitor C to the bit line, gated by the word line.
A Dynamic Memory Chip
Internal organization of a 2M x 8 dynamic memory chip: the address pins are multiplexed; RAS (Row Address Strobe) latches the row address A20-9 into the row address latch and row decoder for a 4096 x (512 x 8) cell array, while CAS (Column Address Strobe) latches the column address A8-0 into the column address latch and column decoder, which selects data lines D7–D0 through the Sense/Write circuits under CS and R/W control.
5.2 Review of the
ABCs of Caches
36 Basic Terms on Caches
cache, fully associative, write allocate, virtual memory, dirty bit, unified cache,
memory stall cycles, block offset, misses per instruction, direct mapped, write back,
block, valid bit, data cache, locality, block address, hit time, address trace,
write through, cache miss, set, instruction cache, page fault, random placement,
average memory access time, miss rate, index field, cache hit, n-way set associative,
no-write allocate, page, least-recently used, write buffer, miss penalty, tag field, write stall
Cache
The first level of the memory hierarchy
encountered once the address leaves the CPU
 Persistent mismatch between CPU and main-memory
speeds
 Exploit the principle of locality by providing a small,
fast memory between CPU and main memory -- the
cache memory
The term cache is now applied whenever buffering is
employed to reuse commonly occurring items
(e.g., file caches)
Caching – copying information into faster
storage system
 Main memory can be viewed as a cache for
secondary storage
General Hierarchy Concepts
 At each level - block concept is present (block is the
caching unit)
 Block size may vary depending on level
• Amortize longer access by bringing in larger chunk
• Works if locality principle is true
 Hit - access where the block is present; the hit rate is the fraction of accesses that hit
 Miss - access where the block is absent (present only in lower levels); the miss rate is the fraction that miss
 Mirroring and consistency
 Data residing in a higher level is a subset of the data in the lower level
 Changes at a higher level must be reflected down - at some point
• The policy for when is the consistency mechanism
 Addressing
 Whatever the organization you have to know how to get at it!
 Address checking and protection
Physical Address Structure
Key is that you want different block sizes
at different levels
Latency and Bandwidth
The time required for the cache miss depends
on both latency and bandwidth of the memory
(or lower level)
Latency determines the time to retrieve the first
word of the block
Bandwidth determines the time to retrieve the
rest of this block
A cache miss is handled by hardware and
causes processors following in-order execution
to pause or stall until the data are available
Predicting Memory Access Times
On a hit: simple access time to the cache
On a miss: access time + miss penalty
 Miss penalty = access time of lower + block transfer
time
 Block transfer time depends on
• Block size - bigger blocks mean longer transfers
• Bandwidth between the two levels of memory
– Bandwidth usually dominated by the slower memory and the
bus protocol
Performance
 Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
 Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty
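To make the two formulas concrete, here is a minimal C sketch (my own illustration, not code from the slides) that plugs a hit time, miss rate, miss penalty, instruction count, and references per instruction into the AMAT and memory-stall-cycle expressions; the sample numbers are the ones used in the Section 5.3 example.

#include <stdio.h>

/* Average memory access time, in clock cycles. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

/* Memory stall cycles = IC * references per instruction * miss rate * miss penalty. */
static double memory_stall_cycles(double ic, double refs_per_instr,
                                  double miss_rate, double miss_penalty) {
    return ic * refs_per_instr * miss_rate * miss_penalty;
}

int main(void) {
    double hit_time = 1.0, miss_rate = 0.02, miss_penalty = 50.0;  /* 2% misses, 50 CC penalty */
    double ic = 1e6, refs_per_instr = 1.33;                        /* 1.33 references/instruction */

    printf("AMAT = %.2f CC\n", amat(hit_time, miss_rate, miss_penalty));
    printf("memory stall cycles = %.0f\n",
           memory_stall_cycles(ic, refs_per_instr, miss_rate, miss_penalty));
    return 0;
}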
Block Sizes, Miss Rates & Penalties,
Accesses
Headaches of Memory Hierarchies
CPU never knows for sure if an access will hit
How deep will a miss be - i.e., the miss penalty
 If short then the CPU just waits
 If long then probably best to work on something
else – task switch
• Implies that the amount can be predicted with reasonable
accuracy
• Task switch better be fast or productivity/efficiency will suffer
Implies some new needs
 More hardware accounting
 Software readable accounting information (address
trace)
Four Standard Questions
Block Placement
 Where can a block be placed in the upper
level?
Block Identification
 How is a block found if it is in the upper level?
Block Replacement
 Which block should be replaced on a miss?
Write Strategy
 What happens on a write?
Answer the four questions for the first level of the memory hierarchy
Block Placement Options
Direct Mapped
 (Block address) MOD (# of cache blocks)
Fully Associative
 Can be placed anywhere
Set Associative
 Set is a group of n blocks -- each block is called a
way
 Block is first mapped onto a set: (Block address)
MOD (# of cache sets)
 Then placed anywhere in that set
Most caches are direct mapped, or 2- or 4-way
set associative
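The placement rules can be sketched in a few lines of C (an illustration with assumed cache parameters, not from the slides): the same block address maps to exactly one frame in a direct-mapped cache, to one set (any way within it) in a set-associative cache, and to any frame in a fully associative cache.

#include <stdio.h>

/* Direct mapped: the only candidate frame = (block address) MOD (# of cache blocks). */
static unsigned direct_mapped_frame(unsigned block_addr, unsigned num_blocks) {
    return block_addr % num_blocks;
}

/* n-way set associative: candidate set = (block address) MOD (# of cache sets);
   the block may then go into any of the n ways of that set. */
static unsigned set_assoc_set(unsigned block_addr, unsigned num_sets) {
    return block_addr % num_sets;
}

int main(void) {
    unsigned block_addr = 12345;            /* arbitrary example block address */
    unsigned num_blocks = 128;              /* assumed cache with 128 block frames */
    unsigned ways = 4;                      /* assumed 4-way set associative */
    unsigned num_sets = num_blocks / ways;  /* 32 sets */

    printf("direct mapped   -> frame %u of %u\n",
           direct_mapped_frame(block_addr, num_blocks), num_blocks);
    printf("4-way set assoc -> set %u (any of %u ways)\n",
           set_assoc_set(block_addr, num_sets), ways);
    /* Fully associative: the block may be placed in any of the 128 frames. */
    return 0;
}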
Block Placement Options (Cont.)
Block Identification
Each cache block carries tags
Address Tags: which block am I?
Many memory blocks may
map to the same cache
block
 Physical address now: address tag## set index##
block offset
 Note relationship of block size, cache size, and tag
size
 The smaller the set tag the cheaper it is to find
 Status Tags: what state is the block in?
 valid, dirty, etc.
Physical address = r + m + n bits:
r-bit address tag | m-bit set index (2^m addressable sets in the cache) | n-bit block offset (2^n bytes per block)
Block Identification (Cont.)
Physical address = r + m + n bits: r-bit address tag | m-bit set index (2^m addressable sets in the cache) | n-bit block offset (2^n bytes per block)
• Caches have an address tag on each block frame that gives
the block address.
• A valid bit says whether or not this entry contains a valid
address.
• The block frame address can be divided into the tag field and
the index field.
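A small C sketch of block identification under assumed field widths (64-byte blocks, 128 sets, direct mapped; my own illustration, not the text's code): split the physical address into tag, index, and offset, then compare the stored address tag and check the valid bit.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6                        /* n: 2^6 = 64-byte blocks (assumed) */
#define INDEX_BITS  7                        /* m: 2^7 = 128 sets (assumed)       */
#define NUM_SETS    (1u << INDEX_BITS)

struct frame {                               /* one direct-mapped block frame */
    int      valid;                          /* status tag: valid bit */
    uint32_t tag;                            /* address tag */
};

static struct frame cache[NUM_SETS];

/* Hit if the indexed frame is valid and its address tag matches. */
static int is_hit(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);   /* set index field */
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);       /* tag field       */
    return cache[index].valid && cache[index].tag == tag;
}

int main(void) {
    uint32_t addr = 0x12345678;
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    cache[index].valid = 1;                  /* install the block so the lookup hits */
    cache[index].tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("hit? %d\n", is_hit(addr));       /* prints 1 */
    return 0;
}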
Block Replacement
Random: just pick one and chuck it
 Simple hash game played on target block frame
address
 Some use truly random
• But lack of reproducibility is a problem at debug time
LRU - least recently used
 Need to keep time since each block was last
accessed
• Expensive if number of blocks is large due to global compare
• Hence an approximation is often used: a use-bit tag and LFU
FIFO
(Direct-mapped placement leaves only one choice for replacement)
Data Cache Misses Per 1000
Instructions
64-byte blocks on an Alpha, using 10 SPEC2000 benchmarks
Short Summaries from the Previous
Figure
Higher associativity is better for small caches
2- or 4-way associative performs similarly to 8-way
associative for larger caches
Larger cache size is better
LRU is the best for small block sizes
Random works fine for large caches
FIFO outperforms random in smaller caches
Little difference between LRU and random for
larger caches
Improving Cache Performance
MIPS mix is 10% stores and 37% loads
 Writes are about 10%/(100%+10%+37%) = 7% of
overall memory traffic, and 10%/(10%+37%)=21% of
data cache traffic
Make the common case fast
 Implies optimizing caches for reads
Read optimizations
 Block can be read concurrent with tag comparison
 On a hit the read information is passed on
 On a miss - discard the block and start the miss access
Write optimizations
 Can't modify the cache until after the tag check - hence writes take longer
Write Options
 Write through: write posted to cache line and through to next lower
level
 Incurs write stall (use an intermediate write buffer to reduce the stall)
 Write back
 Only write to cache not to lower level
 Implies that cache and main memory are now inconsistent
• Mark the line with a dirty bit
• If this block is replaced and dirty then write it back
 Pros and cons - both are useful
 Write through
• No write on read miss, simpler to implement, no inconsistency with main
memory
 Write back
• Uses less main memory bandwidth, write times independent of main
memory speeds
• Multiple writes within a block require only one write to the main memory
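As a rough sketch of the two write policies (hypothetical helper names, not the text's code): write through forwards every store to the next level, while write back only marks the line dirty and defers the memory write until the line is evicted.

#include <stdint.h>
#include <string.h>

struct line {                                /* hypothetical cache line for illustration */
    int      valid, dirty;
    uint32_t tag;
    uint8_t  data[64];
};

/* Assumed stand-in for writing a whole block to the next lower level. */
static void write_block_to_memory(const struct line *l) { (void)l; }

/* Write through: update the cache line and post the write to the next level
   (in practice buffered by a write buffer to reduce the write stall). */
static void write_through(struct line *l, unsigned offset, uint8_t value) {
    l->data[offset] = value;
    write_block_to_memory(l);
}

/* Write back: update only the cache and mark the line dirty;
   main memory is now inconsistent with the cache. */
static void write_back(struct line *l, unsigned offset, uint8_t value) {
    l->data[offset] = value;
    l->dirty = 1;
}

/* On replacement, a dirty line must be written back before the frame is reused. */
static void evict(struct line *l) {
    if (l->valid && l->dirty)
        write_block_to_memory(l);
    l->valid = l->dirty = 0;
}

int main(void) {
    struct line l;
    memset(&l, 0, sizeof l);
    l.valid = 1;
    write_through(&l, 0, 0xAB);
    write_back(&l, 1, 0xCD);
    evict(&l);                               /* dirty, so it is written back first */
    return 0;
}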
5.3 Cache
Performance
Cache Performance
Cache Performance Example
 Each instruction takes 2 clock cycles (ignoring memory
stalls)
 Cache miss penalty – 50 clock cycles
 Miss rate = 2%
 Average of 1.33 memory references per instruction
• Ideal – IC * 2 * cycle-time
• With cache – IC * (2 + 1.33 * 2% * 50) * cycle-time = IC * 3.33 * cycle-time
• No cache – IC * (2 + 1.33 * 100% * 50) * cycle-time = IC * 68.5 * cycle-time
• The importance of cache for CPUs with lower CPI and higher clock
rates is greater – Amdahl’s Law
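A minimal C check of the arithmetic above, using the slide's parameters:

#include <stdio.h>

int main(void) {
    double base_cpi = 2.0, refs_per_instr = 1.33;
    double miss_rate = 0.02, miss_penalty = 50.0;

    double cpi_ideal      = base_cpi;                                        /* perfect memory */
    double cpi_with_cache = base_cpi + refs_per_instr * miss_rate * miss_penalty;
    double cpi_no_cache   = base_cpi + refs_per_instr * 1.0 * miss_penalty;  /* every reference misses */

    printf("ideal %.2f, with cache %.2f, no cache %.2f\n",
           cpi_ideal, cpi_with_cache, cpi_no_cache);      /* 2.00, 3.33, 68.50 */
    return 0;
}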
Average Memory Access Time VS
CPU Time
 Compare two different cache organizations
 Miss rate – direct-mapped (1.4%), 2-way associative (1.0%)
 Clock-cycle-time – direct-mapped (2.0ns), 2-way associative
(2.2ns)
 CPI with a perfect cache – 2.0; average memory references
per instruction – 1.3; miss penalty – 70ns; hit time – 1 CC
• Average Memory Access Time (Hit time + Miss_rate * Miss_penalty)
• AMAT(Direct) = 1 * 2 + (1.4% * 70) = 2.98ns
• AMAT(2-way) = 1 * 2.2 + (1.0% * 70) = 2.90ns
• CPU Time
• CPU(Direct) = IC * (2 * 2 + 1.3 * 1.4% * 70) = 5.27 * IC
• CPU(2-way) = IC * (2 * 2.2 + 1.3 * 1.0% * 70) = 5.31 * IC
Since CPU time is our bottom-line evaluation, and since direct mapped is
simpler to build, the preferred cache is direct mapped in this example
Unified and Split Cache
 Unified – 32KB cache; Split – 16KB IC and 16KB DC
 Hit time – 1 clock cycle; miss penalty – 100 clock cycles
 Load/Store hit takes 1 extra clock cycle for the unified cache
 36% load/store – references to cache: 74% instruction, 26% data
• Miss rate(16KB instruction) = 3.82/1000/1.0 = 0.004
Miss rate (16KB data) = 40.9/1000/0.36 = 0.114
• Miss rate for split cache – (74%*0.004) + (26%*0.114) = 0.0324
Miss rate for unified cache – 43.3/1000/(1+0.36) = 0.0318
• Average-memory-access-time = % inst * (hit-time + inst-miss-rate *
miss-penalty) + % data * (hit-time + data-miss-rate * miss-penalty)
• AMAT(Split) = 74% * (1 + 0.004 * 100) + 26% * (1 + 0.114 * 100) = 4.24
• AMAT(Unified) = 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318* 100)
= 4.44
Improving Cache Performance
Average-memory-access-time = Hit-time + Miss-rate * Miss-penalty
Strategies for improving cache
performance
 Reducing the miss penalty
 Reducing the miss rate
 Reducing the miss penalty or miss rate via
parallelism
 Reducing the time to hit in the cache
5.4
Reducing Cache
Miss Penalty
Techniques for Reducing Miss
Penalty
Multilevel Caches (the most important)
Critical Word First and Early Restart
Giving Priority to Read Misses over Writes
Merging Write Buffer
Victim Caches
Multi-Level Caches
Probably the best miss-penalty reduction
Performance measurement for 2-level
caches
 AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
 Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
 AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
Multi-Level Caches (Cont.)
Definitions:
 Local miss rate: misses in this cache divided by the
total number of memory accesses to this cache (Miss-rate-L2)
 Global miss rate: misses in this cache divided by the
total number of memory accesses generated by CPU
(Miss-rate-L1 x Miss-rate-L2)
 Global Miss Rate is what matters
Advantages:
 Capacity misses in L1 end up with a significant
penalty reduction since they likely will get supplied
from L2
• No need to go to main memory
 Conflict misses in L1 similarly will get supplied by L2
Miss Rate Example
Suppose that in 1000 memory references
there are 40 misses in the first-level cache
and 20 misses in the second-level cache
 Miss rate for the first-level cache = 40/1000
(4%)
 Local miss rate for the second-level cache =
20/40 (50%)
 Global miss rate for the second-level cache =
20/1000 (2%)
Miss Rate Example (Cont.)
Assume miss-penalty-L2 is 100 CC, hit-time-L2
is 10 CC, hit-time-L1 is 1 CC, and there are 1.5 memory
references per instruction. What are the average
memory access time and the average stall cycles
per instruction? Ignore the impact of writes.
 AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 +
Miss-rate-L2 * Miss-penalty-L2) = 1 + 4% * (10 + 50%
* 100) = 3.4 CC
 Average memory stalls per instruction = Misses-per-instruction-L1 * Hit-time-L2 + Misses-per-instruction-L2 * Miss-penalty-L2
= (40 * 1.5 / 1000) * 10 + (20 * 1.5 / 1000) * 100 = 3.6 CC
• Or (3.4 – 1.0) * 1.5 = 3.6 CC
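A short C check of the two-level figures above, using the local and global miss-rate definitions from the previous slides:

#include <stdio.h>

int main(void) {
    double refs = 1000.0, misses_l1 = 40.0, misses_l2 = 20.0;   /* from the example */
    double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0;     /* in clock cycles  */
    double refs_per_instr = 1.5;

    double miss_rate_l1        = misses_l1 / refs;              /* 4%  */
    double local_miss_rate_l2  = misses_l2 / misses_l1;         /* 50% */
    double global_miss_rate_l2 = misses_l2 / refs;              /* 2%  */

    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * penalty_l2);
    double stalls_per_instr =
        (misses_l1 * refs_per_instr / refs) * hit_l2 +
        (misses_l2 * refs_per_instr / refs) * penalty_l2;

    printf("AMAT = %.1f CC, stalls/instr = %.1f CC, global L2 miss rate = %.0f%%\n",
           amat, stalls_per_instr, global_miss_rate_l2 * 100.0);   /* 3.4, 3.6, 2% */
    return 0;
}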
Critical Word First and Early
Restart
 Do not wait for full block to be loaded before restarting
CPU
 Critical Word First – request the missed word first from memory
and send it to the CPU as soon as it arrives; let the CPU
continue execution while filling the rest of the words in the block.
Also called wrapped fetch and requested word first
 Early restart -- as soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue execution
 Benefits of critical word first and early restart depend on
 Block size: generally useful only in large blocks
 Likelihood of another access to the portion of the block that has
not yet been fetched
• Spatial locality problem: the program tends to want the next sequential
word, so it is not clear whether there is a benefit
Giving Priority to Read Misses Over
Writes
 In write through, write buffers complicate memory access
in that they might hold the updated value of location
needed on a read miss
 RAW conflicts with main memory reads on cache misses
 If a read miss waits until the write buffer is empty, the read
miss penalty increases (by 50% on the old MIPS 1000 with a
4-word buffer)
 Check write buffer contents before read, and if no
conflicts, let the memory access continue
 Write Back?
 Read miss replacing dirty block
 Normal: Write dirty block to memory, and then do the read
 Instead copy the dirty block to a write buffer, then do the read,
and then do the write
 The CPU stalls less since it restarts as soon as the read is done
Merging Write Buffer
An entry of the write buffer often contains multiple words,
but a write often involves only a single word
 Without write merging, a single-word write occupies a whole entry
Write merging: check to see if the address of a
new data matches the address of a valid write
buffer entry. If so, the new data are combined
with that entry
Advantage
 Multi-word writes are usually faster than single-word
writes
 Reduce the stalls due to the write buffer being full
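A rough C sketch of the merging check (hypothetical structure and sizes, for illustration only): on each buffered store, scan the valid entries for a matching block address and fold the new word into that entry; otherwise allocate a free entry, and stall if the buffer is full.

#include <stdint.h>
#include <string.h>

#define ENTRIES 4
#define WORDS_PER_ENTRY 4                      /* assumed 4-word-wide entries */

struct wb_entry {
    int      valid;
    uint32_t block_addr;                       /* address of the aligned block */
    uint32_t data[WORDS_PER_ENTRY];
    uint8_t  word_valid[WORDS_PER_ENTRY];      /* which words hold new data */
};

static struct wb_entry write_buffer[ENTRIES];

/* Returns 1 if merged or newly allocated, 0 if the buffer is full (stall). */
static int buffer_write(uint32_t addr, uint32_t value) {
    uint32_t block = addr / (4 * WORDS_PER_ENTRY);
    unsigned word  = (addr / 4) % WORDS_PER_ENTRY;

    for (int i = 0; i < ENTRIES; i++)          /* try to merge with a valid entry */
        if (write_buffer[i].valid && write_buffer[i].block_addr == block) {
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = 1;
            return 1;
        }
    for (int i = 0; i < ENTRIES; i++)          /* otherwise allocate a free entry */
        if (!write_buffer[i].valid) {
            memset(&write_buffer[i], 0, sizeof write_buffer[i]);
            write_buffer[i].valid = 1;
            write_buffer[i].block_addr = block;
            write_buffer[i].data[word] = value;
            write_buffer[i].word_valid[word] = 1;
            return 1;
        }
    return 0;                                  /* full: the CPU would stall here */
}

int main(void) {
    buffer_write(100, 1);                      /* allocates an entry         */
    buffer_write(104, 2);                      /* merges into the same entry */
    return 0;
}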
Write-Merging Illustration
Victim Caches
 Remember what was just discarded in case it is needed
again
 Add small fully associative cache (called victim cache)
between the cache and the refill path
 Contain only blocks discarded from a cache because of a miss
 Are checked on a miss to see if they have the desired data
before going to the next lower-level of memory
• If yes, swap the victim block and cache block
 Addressing both victim and regular cache at the same time
• The penalty will not increase
 Jouppi (DEC SRC) shows miss reduction of 20 - 95%
 For a 4KB direct mapped cache with 1-5 victim blocks
Victim Cache Organization
5.5 Reducing Miss
Rate
Classify Cache Misses - 3 C’s
Compulsory – independent of cache size
 First access to a block – no choice but to load it
 Also called cold-start or first-reference misses
Capacity – decreases as cache size increases
 Cache cannot contain all the blocks needed during
execution, so blocks are discarded and later
retrieved
Conflict (Collision) – decreases as associativity
increases
 Side effect of set-associative or direct mapping
 A block may be discarded and later retrieved if too
many blocks map to the same cache block
Techniques for Reducing Miss Rate
Larger Block Size
Larger Caches
Higher Associativity
Way Prediction Caches
Compiler optimizations
Larger Block Sizes
Obvious advantages: reduce compulsory
misses
 Reason is due to spatial locality
Obvious disadvantage
 Higher miss penalty: larger block takes longer
to move
 May increase conflict misses and capacity
misses if the cache is small
Don’t let increase in miss penalty outweigh the decrease in miss rate
Large Caches
Help with both conflict and capacity
misses
May need longer hit time AND/OR higher
HW cost
Popular in off-chip caches
Higher Associativity
8-way set associative is for practical purposes
as effective in reducing misses as fully
associative
2:1 rule of thumb
 A 2-way set-associative cache of size N/2 has about the same
miss rate as a direct-mapped cache of size N (holds for cache
sizes < 128 KB)
Greater associativity comes at the cost of
increased hit time
 Lengthen the clock cycle
 Hill [1988] suggested hit time for 2-way vs. 1-way:
external cache +10%, internal + 2%
Effect of Higher Associativity for
AMAT
Clock-cycle-time (2-way) = 1.10 * Clock-cycle-time (1-way)
Clock-cycle-time (4-way) = 1.12 * Clock-cycle-time (1-way)
Clock-cycle-time (8-way) = 1.14 * Clock-cycle-time (1-way)
Way Prediction
Extra bits are kept in cache to predict the way, or
block within the set of the next cache access
Multiplexor is set early to select the desired
block, and only a single tag comparison is
performed that clock cycle
A miss results in checking the other blocks for
matches in subsequent clock cycles
Alpha 21264 uses way prediction in its 2-way
set-associative instruction cache. Simulation
using SPEC95 suggested way prediction
accuracy is in excess of 85%
Compiler Optimization for Code
Code can easily be rearranged without affecting
correctness
Reordering the procedures of a program might
reduce instruction miss rates by reducing conflict
misses
McFarling's observation using profiling
information [1988]
 Reduce miss by 50% for a 2KB direct-mapped
instruction cache with 4-byte blocks, and by 75% in
an 8KB cache
 Optimized programs on a direct-mapped cache
missed less than unoptimized ones on an 8-way set-associative cache of the same size
Compiler Optimization for Data
Idea – improve the spatial and temporal locality
of the data
Lots of options
 Array merging – Allocate arrays so that paired
operands show up in same cache block
 Loop interchange – Exchange inner and outer loop
order to improve cache performance
 Loop fusion – For independent loops accessing the
same data, fuse these loops into a single aggregate
loop
 Blocking – Do as much as possible on a sub-block of data
before moving on (see the sketch after this list)
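As an illustration of blocking (a sketch I have added, not taken from the slides), the classic example is a tiled matrix multiply: the inner loops work on B-wide strips of y and B x B sub-blocks of z so the data being reused stays resident in the cache.

#include <stdio.h>

#define N 256                                 /* assumed matrix dimension (a multiple of B) */
#define B 32                                  /* assumed blocking factor */

static double x[N][N], y[N][N], z[N][N];

/* Blocked (tiled) version of x = x + y * z: the j and k loops are limited to
   B-wide blocks so the touched parts of y and z stay resident in the cache. */
static void blocked_matmul(void) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];   /* reuse a strip of y and a B x B block of z */
                    x[i][j] += r;                 /* accumulate the partial products */
                }
}

int main(void) {
    y[0][0] = 2.0; z[0][0] = 3.0;
    blocked_matmul();
    printf("%g\n", x[0][0]);                  /* 6 */
    return 0;
}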
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves
spatial locality
Loop Interchange Example
/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory
every 100 words; improves spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

The two loops perform different computations on the same data, so fuse them into one loop.
2 misses per access to a & c vs. one miss per access;
improves temporal locality
5.7 Reducing Hit
Time
Reducing Hit Time
Hit time is critical because it affects the
clock cycle time
 On many machines, cache access time limits
the clock cycle rate
A fast hit time is multiplied in importance
beyond the average memory access time
formula because it helps everything
 Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
• Miss-penalty is clock-cycle dependent
Techniques for Reducing Hit Time
Small and Simple Caches
Avoid Address Translation during Indexing
of the Cache
Pipelined Cache Access
Trace Caches
Small and Simple Caches
A time-consuming portion of a cache hit: use the
index portion to read the tag and then compare it
to the address
Small caches – smaller hardware is faster
 Keep the L1 cache small enough to fit on the same
chip as CPU
 Keep the tags on-chip, and the data off-chip for L2
caches
Simple caches – direct-mapped caches
 Trade a higher miss rate for a faster hit time
• Small direct mapped misses more often than small
associative caches
• But simpler structure makes the hit go faster
Pipelining Writes for Fast Write
Hits – Pipelined Cache
 Write hits usually take longer than read hits
 Tag must be checked before writing the data
 Pipelines the write
 2 stages – Tag Check and Update Cache (can be more in
practice)
 Current write tag check & previous write cache update
 Result
 Looks like a write happens on every cycle
 Cycle time can stay short since the real write is spread over the two stages
 Mostly works if the CPU is not dependent on data from a write
• Spot any problems if read and write ordering is not preserved by the
memory system?
 Reads play no part in this pipeline since they already
operate in parallel with the tag check
Cache Optimization Summary
5.9 Main Memory
Main Memory -- 3 important issues
 Capacity
 Latency
 Access time: time between a read is requested and the word
arrives
 Cycle time: min time between requests to memory (> access
time)
• Memory needs the address lines to be stable between accesses
 By addressing big chunks - like an entire cache block (amortizes
the latency)
 Critical to cache performance when the miss goes to main memory
 Bandwidth -- # of bytes read or written per unit time
 Affects the time it takes to transfer the block
Example of Memory Latency and
Bandwidth
Consider
 4 cycles to send the address
 56 cycles per word of access
 4 cycles to transmit the data
Hence, if main memory is organized by word
 64 cycles have to be spent for every word we want to
access
Given a cache line of 4 words (8 bytes per word)
 256 cycles is the miss penalty
 Memory bandwidth = 1/8 byte per clock cycle (4 * 8
/256)
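A tiny C check of this calculation, using the slide's cycle counts:

#include <stdio.h>

int main(void) {
    double addr_cycles = 4, access_cycles = 56, transfer_cycles = 4;
    double words_per_block = 4, bytes_per_word = 8;

    double cycles_per_word = addr_cycles + access_cycles + transfer_cycles;      /* 64  */
    double miss_penalty    = words_per_block * cycles_per_word;                  /* 256 */
    double bandwidth       = words_per_block * bytes_per_word / miss_penalty;    /* 1/8 byte per CC */

    printf("miss penalty = %.0f CC, bandwidth = %.3f bytes/CC\n", miss_penalty, bandwidth);
    return 0;
}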
Improving Main Memory
Performance
Simple:
 CPU, Cache, Bus, Memory same width (32 or
64 bits)
Wide:
 CPU/Mux 1 word; Mux/Cache, Bus, Memory
N words (Alpha: 64 bits & 256 bits;
UltraSPARC: 512)
Interleaved:
 CPU, Cache, Bus 1 word; Memory N modules
(e.g., 4 modules); example is word interleaved
3 Examples of Bus Width, Memory Width, and
Memory Interleaving to Achieve Memory Bandwidth
Wider Main Memory
 Doubling or quadrupling the width of the cache or
memory will double or quadruple the memory
bandwidth
 Miss penalty is reduced correspondingly
 Cost and Drawback
 More cost on memory bus
 The multiplexer between the cache and the CPU may be on the
critical path (the CPU still accesses the cache one word at a time)
• Multiplexors can be put between L1 and L2
 The design of error correction becomes more complicated
• If only a portion of the block is updated, all other portions must be
read for calculating the new error correction code
 Since main memory is traditionally expandable by the customer,
the minimum increment is doubled or quadrupled
Simple Interleaved Memory
 Memory chips are organized into banks to read or write
multiple words at a time, rather than a single word
 Share address lines with a memory controller
 Keep the memory bus the same but make it run faster
 Take advantage of the potential memory bandwidth of all DRAM
banks
 The banks are often one word wide
 Good for accessing consecutive memory location
 Miss penalty of 4 + 56 + 4 * 4 or 76 CC (0.4 bytes per
CC)
Bank_# = address MOD #_of_banks
Address_within_bank = Floor(Address / #_of_banks)
Interleaving factor = #_of_banks (usually a power of 2)
Organization of Four-way Interleaved Memory
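The bank-mapping formulas can be written directly in C; the sketch below (my own illustration, assuming 4 banks) also recomputes the interleaved miss penalty of 4 + 56 + 4*4 = 76 CC quoted above.

#include <stdio.h>

#define NUM_BANKS 4                           /* interleaving factor (a power of 2) */

static unsigned bank_of(unsigned addr)          { return addr % NUM_BANKS; }
static unsigned addr_within_bank(unsigned addr) { return addr / NUM_BANKS; }

int main(void) {
    /* Consecutive word addresses rotate through the banks. */
    for (unsigned word = 0; word < 8; word++)
        printf("word %u -> bank %u, offset %u\n",
               word, bank_of(word), addr_within_bank(word));

    /* Interleaved miss penalty: one address + one access time, then the banks
       deliver their words back to back over the shared bus. */
    double addr_c = 4, access_c = 56, transfer_c = 4, words = 4;
    double penalty = addr_c + access_c + words * transfer_c;       /* 76 CC */
    printf("interleaved miss penalty = %.0f CC (%.2f bytes/CC)\n",
           penalty, words * 8 / penalty);
    return 0;
}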
What Can Interleaving and a Wide
Memory Buy?
 Block size = 1, 2, or 4 words; miss rate = 3%, 2%, 1.2% respectively
 Memory bus width = 1 word; memory accesses per instruction = 1.2
 Cache miss penalty = 64 cycles (as above)
 Average cycles per instruction (ignoring cache misses) = 2
 CPI = 2 + (1.2 * 3% * 64) = 4.3 (1-word block)
Block size = 2 words
• 64-bit bus and memory, no interleaving: CPI = 2 + (1.2 * 2% * 2 * 64) = 5.07
• 64-bit bus and memory, interleaving: CPI = 2 + (1.2 * 2% * (4 + 56 + 2*4)) = 3.63
• 128-bit bus and memory, no interleaving: CPI = 2 + (1.2 * 2% * 1 * 64) = 3.54
Block size = 4 words
• 64-bit bus and memory, no interleaving: CPI = 2 + (1.2 * 1.2% * 4 * 64) = 5.69
• 64-bit bus and memory, interleaving: CPI = 2 + (1.2 * 1.2% * (4 + 56 + 4*4)) = 3.09
• 128-bit bus and memory, no interleaving: CPI = 2 + (1.2 * 1.2% * 2 * 64) = 3.84
Simple Interleaved Memory (Cont.)
Interleaved memory is logically a wide memory,
except that accesses to bank are staged over
time to share bus
How many banks should be included?
 More than the number of CC needed to access a word in a bank
• Goal: deliver information from a new bank each clock for
sequential accesses, so no waiting
Disadvantages
 Having many banks is expensive when chips grow larger and
fewer chips are used
• 512MB RAM
– 256 chips of 4M*4 bits → 16 banks of 16 chips
– 16 chips of 64M*4 bits → only 1 bank
 More difficulty in main memory expansion (like wider
memory)
5.9 Memory
Technology
DRAM Technology
 Semiconductor Dynamic Random Access Memory
 Emphasis on cost per bit and capacity
 Multiplexed address lines – cutting the # of address pins in
half
 Row access strobe (RAS) first, then column access strobe (CAS)
 Memory as a 2D matrix – rows go to a buffer
 Subsequent CAS selects subrow
 Use only a single transistor to store a bit
 Reading that bit can destroy the information
 Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
• Keep refreshing time less than 5% of the total time
 DRAM capacity is 4 to 8 times that of SRAM
DRAM Technology (Cont.)
DIMM: Dual inline memory module
 DRAM chips are commonly sold on small boards
called DIMMs
 DIMMs typically contain 4 to 16 DRAMs
Slowing down in DRAM capacity growth
 Four times the capacity every three years, for more
than 20 years
 New chips only double capacity every two years, since
1998
DRAM performance is growing at a slower rate
 RAS (related to latency): 5% per year
 CAS (related to bandwidth): 10%+ per year
RAS improvement
A performance improvement in RAS of about 5% per year
SRAM Technology
Cache uses SRAM: Static Random Access
Memory
SRAM uses six transistors per bit to prevent the
information from being disturbed when read –
no need to refresh
 SRAM needs only minimal power to retain the charge
in standby mode – good for embedded
applications
 No difference between access time and cycle time for
SRAM
Emphasis on speed and capacity
 SRAM address lines are not multiplexed
SRAM speed is 8 to 16x that of DRAM
ROM and Flash
Embedded processor memory
Read-only memory (ROM)
 Programmed at the time of manufacture
 Only a single transistor per bit to represent 1 or 0
 Used for the embedded program and for constants
 Nonvolatile and indestructible
Flash memory:
 Nonvolatile but allow the memory to be modified
 Reads at almost DRAM speeds, but writes 10 to 100
times slower
 DRAM capacity per chip and MB per dollar is about 4
to 8 times greater than flash
Improving Memory Performance in a
Standard DRAM Chip
 Fast page mode: timing signals that allow repeated
accesses to the row buffer without another row access time
 Synchronous RAM (SDRAM): add a clock signal to
DRAM interface, so that the repeated transfer would not
bear overhead to synchronize with the controller
 Asynchronous DRAM involves overhead to sync with controller
 Peak speed per memory module 800—1200MB/sec in 2001
 Double data rate (DDR): transfer data on both the rising
edge and falling edge of DRAM clock signal
 Peak speed per memory module 1600—2400MB/sec in 2001
5.10 Virtual
Memory
Virtual Memory
Virtual memory divides physical memory into
blocks (called page or segment) and allocates
them to different processes
With virtual memory, the CPU produces virtual
addresses that are translated by a combination
of HW and SW to physical addresses, which
access main memory. The process is called
memory mapping or address translation
Today, the two memory-hierarchy levels
controlled by virtual memory are DRAMs and
magnetic disks
Example of Virtual to Physical
Address Mapping
Mapping by a
page table
Address Translation Hardware for
Paging
Physical address = frame number f (l − n bits) ## frame offset d (n bits)
Page table when some pages are
not in main memory…
Virtual Memory (Cont.)
 Permits applications to grow bigger than main memory
size
 Helps with multiple process management
 Each process gets its own chunk of memory
 Permits protection of one process's chunks from another
 Mapping of multiple chunks onto shared physical memory
 Mapping also facilitates relocation (a program can run in any
memory location, and can be moved during execution)
 Application and CPU run in virtual space (logical memory, 0 –
max)
 Mapping onto physical space is invisible to the application
 Cache vs. VM
 Block becomes a page or segment
 Miss becomes a page or address fault
Typical Page Parameters
Cache vs. VM Differences
Replacement
 Cache miss handled by hardware
 Page fault usually handled by OS
Addresses
 VM space is determined by the address size of the
CPU
 Cache space is independent of the CPU address size
Lower level memory
 For caches - the main memory is not shared by
something else
 For VM - most of the disk contains the file system
• The file system is addressed differently - usually in I/O space
• VM lower level is usually called SWAP space
2 VM Styles - Paged or Segmented?
 Virtual systems can be categorized into two classes: pages (fixed-size
blocks), and segments (variable-size blocks)
Page vs. Segment:
 Words per address: Page – one; Segment – two (segment and offset)
 Programmer visible? Page – invisible to the application programmer; Segment – may be visible to the application programmer
 Replacing a block: Page – trivial (all blocks are the same size); Segment – hard (must find a contiguous, variable-size, unused portion of main memory)
 Memory use inefficiency: Page – internal fragmentation (unused portion of page); Segment – external fragmentation (unused pieces of main memory)
 Efficient disk traffic: Page – yes (adjust page size to balance access time and transfer time); Segment – not always (small segments may transfer just a few bytes)
Virtual Memory – The Same 4
Questions
Block Placement
 Choice: lower miss rates and complex placement or
vice versa
• Miss penalty is huge, so choose a low miss rate – place the block
anywhere
• Similar to fully associative cache model
Block Identification - both use additional data
structure
 Fixed size pages - use a page table
 Variable sized segments - segment table
Virtual Memory – The Same 4
Questions (Cont.)
Block Replacement -- LRU is the best
 However true LRU is a bit complex – so use
approximation
• Page table contains a use tag, and on access the use tag is
set
• OS checks them every so often - records what it sees in a
data structure - then clears them all
• On a miss the OS decides who has been used the least and
replace that one
Write Strategy -- always write back
 Due to the access time to the disk, write through is
silly
 Use a dirty bit to only write back pages that have
been modified
Techniques for Fast Address
Translation
 Page table is kept in main memory (kernel memory)
 Each process has a page table
 Every data/instruction access requires two memory
accesses
 One for the page table and one for the data/instruction
 Can be solved by the use of a special fast-lookup hardware
cache called associative registers or translation look-aside
buffers (TLBs)
 If locality applies then cache the recent translation
 TLB = translation look-aside buffer
 TLB entry: virtual page number, physical page number, protection
bit, use bit, dirty bit
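A simplified C sketch of the translation path (hypothetical structures and sizes, only to illustrate the idea): look up the virtual page number in the TLB first; on a TLB miss, consult the per-process page table and refill the TLB, then append the page offset.

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS   12                        /* assumed 4 KB pages          */
#define TLB_ENTRIES 64                        /* assumed small TLB           */
#define NUM_PAGES   1024                      /* assumed tiny virtual space  */

struct tlb_entry { int valid; uint32_t vpn, pfn; };

static struct tlb_entry tlb[TLB_ENTRIES];
static uint32_t page_table[NUM_PAGES];        /* VPN -> PFN, one table per process */

/* Translate a virtual address to a physical address. */
static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    uint32_t slot   = vpn % TLB_ENTRIES;      /* direct-mapped here for brevity;
                                                 real TLBs are fully associative */
    if (!(tlb[slot].valid && tlb[slot].vpn == vpn)) {
        /* TLB miss: one extra memory access to walk the page table, then refill. */
        tlb[slot].valid = 1;
        tlb[slot].vpn   = vpn;
        tlb[slot].pfn   = page_table[vpn];
    }
    return (tlb[slot].pfn << PAGE_BITS) | offset;
}

int main(void) {
    page_table[3] = 42;                       /* map virtual page 3 to physical frame 42 */
    printf("0x%x\n", (unsigned)translate((3u << PAGE_BITS) | 0x123));   /* 0x2a123 */
    return 0;
}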
TLB = Translation Look-aside
Buffer
The TLB must be on chip; otherwise it is
worthless
 Fully associative – parallel search
Typical TLBs
 Hit time – 1 cycle
 Miss penalty – 10 to 30 cycles
 Miss rate – 0.1% to 2%
 TLB size – 32 B to 8 KB
Paging Hardware with TLB
TLB of Alpha 21264
Address Space Number: process
ID to prevent context switch
A total of 128 TLB entries
Page Size – An Architectural Choice
Large pages are good:
 Reduces page table size
 Amortizes the long disk access
 If spatial locality is good then hit rate will
improve
 Reduces the number of TLB misses
Large pages are bad:
 More internal fragmentation
• If everything is random each structure’s last page
is only half full
 Process start up time takes longer