ECE7995: Presentation
CPU Cache Prefetching
Timing Evaluations of Hardware Implementation
Ravikiran Channagire & Ramandeep Buttar
TOPICS
Introduction
Previous research and its open issues.
Architectural Model
Processor, Cache, Buffer, Memory Bus, Main Memory, etc.
Methodology
The workload.
The Baseline System
MCPI, Relative MCPI and other terms.
Effects of System Resources on Cache Prefetching
Comparison.
System Design
The OptiPrefetch system.
Conclusion
Introduction
Terms used in this paper –
Cache Miss – Address not found in the cache.
Partial Cache Miss – Miss on an address already issued to memory by the prefetching unit (data not yet returned).
True Cache Miss – Miss on an address with no outstanding prefetch.
True Miss Ratio – Ratio of true cache misses to cache references.
Partial Miss Ratio – Ratio of partial cache misses to cache references.
Total Miss Ratio – Sum of the true miss ratio and the partial miss ratio.
Issued Prefetch – Prefetch sent to the prefetch address buffer by the prefetching unit.
Lifetime of a Prefetch – Time from when the prefetch is sent to memory until its data is loaded into the cache.
Useful Prefetch – Prefetched address later referenced by the processor.
Useless Prefetch – Prefetched address never referenced by the processor.
Aborted Prefetch – Issued prefetch discarded in the prefetch address buffer.
Prefetch Ratio – Ratio of issued prefetches to cache references.
Success Ratio – Ratio of the total number of useful prefetches to the total number of issued prefetches.
Global Success Ratio – Fraction of cache misses avoided or partially avoided.
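These ratios can be written compactly (the formulas only restate the bullets above; the notation is ours, not the paper's):

\[
\text{True Miss Ratio} = \frac{\#\,\text{true cache misses}}{\#\,\text{cache references}}, \qquad
\text{Partial Miss Ratio} = \frac{\#\,\text{partial cache misses}}{\#\,\text{cache references}}
\]
\[
\text{Total Miss Ratio} = \text{True Miss Ratio} + \text{Partial Miss Ratio}
\]
\[
\text{Prefetch Ratio} = \frac{\#\,\text{issued prefetches}}{\#\,\text{cache references}}, \qquad
\text{Success Ratio} = \frac{\#\,\text{useful prefetches}}{\#\,\text{issued prefetches}}
\]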
Prefetching: effective in reducing the cache miss ratio.
Doesn't always improve CPU performance – why?
Need to determine when and if hardware prefetching is useful.
How to improve performance?
Double ported address array.
Double ported or fully buffered data array.
Wide bus – split and non-split operations.
Prefetching can be improved with a significant investment in extra hardware.
Why cache memories?
Factors affecting cache performance.
Block size, cache size, associativity, algorithms used.
Hardware Cache Prefetching: dedicated hardware, no software support.
Which block to prefetch? – The simplest choice is the next sequential block.
When to prefetch?
Simple prefetch algorithms (the first three are sketched in code below):
Always prefetch.
Prefetch on misses.
Tagged prefetch.
Threaded prefetching.
Bi-directional prefetching.
Number of cache blocks to prefetch – fixed or variable?
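A minimal sketch of how the first three policies decide when to issue a prefetch (illustrative C, not the paper's simulator; the type names and the next-sequential-block target are our assumptions):

#include <stdbool.h>

typedef enum { ALWAYS, ON_MISS, TAGGED } policy_t;

typedef struct {
    bool tag_bit;   /* tagged prefetch: false while the block has only
                       been prefetched, true once demand-referenced */
} line_t;

/* Return true if the policy issues a prefetch for the next sequential
   block on this cache reference. */
static bool should_prefetch(policy_t p, bool hit, line_t *line)
{
    switch (p) {
    case ALWAYS:                /* prefetch on every cache reference */
        return true;
    case ON_MISS:               /* prefetch only when the reference misses */
        return !hit;
    case TAGGED:                /* prefetch on a miss, or on the first
                                   demand reference to a prefetched block */
        if (!hit)
            return true;
        if (!line->tag_bit) {
            line->tag_bit = true;
            return true;
        }
        return false;
    }
    return false;
}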
Disadvantages of Cache Prefetching –
Increase in memory traffic because of prefetches that are never referenced.
Cache pollution – useless prefetches displacing useful blocks; most harmful when cache sizes are small and block sizes are large.
Factors degrading performance even when the miss ratio is decreased –
Address tag array busy due to prefetch lookups.
Cache data array busy due to prefetch loads and replacements.
Memory bus busy due to prefetch address transfers and data fetches.
Memory system busy due to prefetch fetches and replacements.
Architectural Model
Processor:
Five stages in the pipeline:
Instruction fetch.
Instruction decode.
Read or write to memory.
ALU computation.
Register file update.
Cache: Single or Double Ported
Instruction Cache
Data Cache
Write Buffer:
Prefetching Units:
One for each cache.
Receives information such as cache miss/hit, instruction type, and branch target address.
What if the buffer is full? – The issued prefetch is discarded (aborted).
Memory Bus
Split and Non-split transactions
Bus Arbitrator
Main Memory
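A rough outline of the modeled components as data structures may help fix the picture (purely illustrative C; all names, sizes, and fields are our assumptions, not the paper's simulator):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int      ports;            /* 1 = single ported, 2 = double ported */
    uint32_t size, block_size; /* capacity and block size in bytes */
    int      assoc;            /* associativity */
} cache_t;

typedef struct {
    uint32_t addr[8];          /* pending prefetch addresses */
    int      count;            /* when full, new prefetches are
                                  discarded, i.e. aborted */
} prefetch_buf_t;

typedef struct {
    cache_t        icache, dcache;   /* instruction and data caches */
    prefetch_buf_t ipf_buf, dpf_buf; /* one prefetching unit per cache */
    bool           split_bus;        /* split vs. non-split transactions */
    int            bus_width;        /* data bus width in bytes */
    int            mem_banks;        /* interleaved main memory banks */
} system_t;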
Methodology
Methodology used:
25 commonly used real programs from 5 workload categories
Computer Aided Design (CAD)
Compiler Related Tools (COMP)
Floating Point Intensive Applications (FP)
Text Processing Programs (TEXT)
Unix Utilities (UNIX)
The Baseline System
Default System Configuration for the Baseline System
Cycles Per Instruction contributed by Memory accesses (MCPI)
Why is MCPI preferred over the cache miss ratio or memory access time?
It covers every memory-related aspect of performance.
It excludes aspects of performance that cannot be affected by a cache prefetching strategy, e.g., the efficiency of instruction pipelining.
Relative MCPI – MCPI of the prefetching system divided by the MCPI of the corresponding baseline system; should be smaller than 1 for prefetching to pay off.
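In symbols (our notation; the relative form matches how it is used in the results slides later):

\[
\text{MCPI} = \frac{\text{CPU cycles spent on memory accesses}}{\text{instructions executed}}, \qquad
\text{Relative MCPI} = \frac{\text{MCPI}_{\text{with prefetching}}}{\text{MCPI}_{\text{baseline}}}
\]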
Relative and Absolute MCPI
A cache prefetching strategy is very sensitive to the type of program the processor is running.
CPU Stall Breakdown
Instruction and Data Cache Miss Ratios
True Miss Ratio – Ratio of the total number of true cache misses to the total number of processor references to the cache.
Partial Miss Ratio – Ratio of the total number of partial cache misses to the total number of processor references to the cache.
Total Miss Ratio – Sum of the true miss ratio and the partial miss ratio.
Success Ratio and Global Success Ratio
Success Ratio – Ratio of the total number of useful prefetches to the total number of prefetches issued.
Global Success Ratio (GSR) – Fraction of cache misses that are avoided or partially avoided.
Average Distribution of Prefetches.
Useful Prefetches
Useless Prefetches
Aborted Prefetches.
The major limitations that reduce the effectiveness of cache prefetching are conflicts and delays in accessing the caches, the data bus, and the main memory.
Ideal System Characteristics
Ideal Cache
Special access port to the tag array for prefetch lookups.
Special access port to the data array for prefetch loads.
No need to buffer the prefetched blocks.
Ideal Data Bus
Private bus connecting the prefetching unit and main memory.
Ideal Main Memory
Dual ported; accessing the memory banks takes zero time.
Effects of System Resources on Cache Prefetching
Effect of different design parameters:
Single vs. Double Ported Cache Tag Arrays
Single vs. Double Ported Cache Data Arrays and Buffering
Cache Size
Cache Block Size
Cache Associativity
Split vs. Non-split Bus Transactions
Bus Width
Memory Latency
Number of Memory Banks
Bus Traffic
Prefetch Lookahead Distance
Single vs. Double Ported Cache Tag Arrays
A double ported cache tag array gives the best results with the 'bi-dir' strategy, followed by the 'always' and 'thread' strategies.
If a prefetch strategy looks up the cache tag arrays frequently, extra access ports to the tag array are vital.
Single vs. Double Ported Cache Data Arrays and Buffering
Far less important than the porting of the cache tag arrays, because prefetch strategies use the data arrays less frequently.
With a single ported data array, relative MCPI > 1; it drops with a double ported array.
Conflicts over the cache port vanish with a double ported data array – no stalls.
Buffered data array
Drawback – adding an extra port to the cache data array is very costly.
More practical solution – provide some buffering for the data arrays.
Result – with a single ported, buffered data array, the relative MCPI decreases by 0.26 on average, almost as good as when the cache data arrays are double ported.
Cache Size
The performance of a prefetch strategy improves as the cache size increases.
In large caches, a prefetched block resides in the cache for a longer period.
The bigger the cache, the fewer the cache misses; hence, prefetches are less likely to interfere with normal cache operation.
Cache Block Size
With 16 or 32 byte cache blocks, most prefetching strategies perform better than no prefetching.
MCPI increases with block size – the result of more cache port conflicts.
Cache Associativity
Relative MCPI decreases by 0.07 on average when the cache changes from direct mapped to two-way set associative.
MCPI remains almost constant as cache associativity increases further.
Split vs. Non-split Bus Transactions
Relative MCPI decreases by 14% on average when bus transactions change from non-split to split.
Reason – most aborted prefetches become useful prefetches when the data bus supports split transactions.
Bus Width
As the bus width increases, the relative MCPI begins to fall below one for the base system.
Reason – there are fewer cache port conflicts.
Assumption – the cache data port must be as wide as the data bus.
Memory Latency (CPU Cycles)
Relative MCPI decreases as the memory latency increases from 8 to 64 processor cycles, but it starts to rise as latency increases further.
Two reasons account for these U-shaped curves:
Fewer cache port conflicts.
More data bus conflicts.
Number of Memory Banks (bus transactions are non-split)
Relative MCPI decreases by 0.18 on average when the number of memory banks increases from 1 to 4.
Multiple prefetches can proceed in parallel when there are more memory banks.
Bus Traffic – Bus Utilization by DMA (%)
As DMA traffic increases, relative MCPI converges to 1 because less and less bus bandwidth is available to send prefetch requests to the main memory.
In the baseline system, heavier bus traffic helps to reduce the number of undesirable prefetches and, hence, relative MCPI decreases.
Prefetch Lookahead Distance (in blocks)
Better performance can be achieved by prefetching block p+LA instead of block p.
p – the original block address requested by the prefetching strategy; LA – the lookahead distance in blocks.
For all strategies except thread, relative MCPI rises with increasing lookahead distance.
Reason – as LA increases, the benefit of spatial locality diminishes.
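As a one-line sketch in the same illustrative C as above (names assumed), the lookahead simply offsets the prefetch target:

#include <stdint.h>

/* The strategy would fetch block p; with lookahead LA it issues
   block p + LA instead (LA measured in cache blocks). */
static uint32_t prefetch_target(uint32_t p, uint32_t la)
{
    return p + la;
}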
The System Design
Worst and Best Values for Each System Parameter in Terms of Prefetching Effectiveness
When a system parameter is changed from its worst value to its best value, the improvement in cache prefetching performance ranges from 7% to as high as 90%.
Relative and Absolute MCPI for the OptiPrefetch System
All prefetching strategies perform better than no prefetching.
The relative MCPI for all strategies averages 0.65 – prefetching reduces MCPI by 35% on average relative to the corresponding baseline system.
The OptiPrefetch system favors aggressive strategies like always and tag, which issue lots of prefetches.
Stall Breakdown in % for the OptiPrefetch System
There are no conflicts over the cache data ports, as the data arrays are dual ported.
With 16 memory banks available, conflicts over memory banks occur very rarely.
Average Distribution of Prefetches
Due to the high bus bandwidth, most prefetch requests can be sent to the main memory successfully, and almost no prefetches are aborted.
Baseline System
OptiPrefetch System
Possible System Designs
Prefetching improves performance in all systems except systems D and G, where the data bus bandwidth is too small.
Performance of Cache Prefetching in Some Possible Systems
Conclusion
Prefetching can reduce average memory latency, provided the system has appropriately designed hardware.
For a cache to prefetch effectively:
Cache tag arrays should be double ported.
Data arrays should be either double ported or buffered.
The cache should be at least two-way set associative.
Prefetching is most effective when:
The cache size is large.
The block size is small.
The memory bus is wide.
The bus supports split transactions.
The main memory is interleaved.