ECE7995: Presentation
CPU Cache Prefetching: Timing Evaluations of Hardware Implementation
Ravikiran Channagire & Ramandeep Buttar

TOPICS
- Introduction
- Previous research and its issues
- Architectural model: processor, cache, buffer, memory bus, main memory, etc.
- Methodology: the workload
- The baseline system: MCPI, relative MCPI, and other terms
- Effects of system resources on cache prefetching: comparison
- System design: the OptiPrefetch system
- Conclusion

Introduction
Terms used in this paper:
- Cache Miss – address not found in the cache
- Partial Cache Miss – a miss on an address the prefetching unit has already issued and sent to memory, but whose data has not yet returned
- True Cache Miss – any other cache miss, i.e., one with no prefetch outstanding
- True Miss Ratio – ratio of true cache misses to cache references
- Partial Miss Ratio – ratio of partial cache misses to cache references
- Total Miss Ratio – sum of the true miss ratio and the partial miss ratio
- Issued Prefetch – a prefetch sent to the prefetch address buffer by the prefetching unit
- Lifetime of a Prefetch – the time from when a prefetch is sent to memory until its data is loaded
- Useful Prefetch – a prefetched address that the processor later references
- Useless Prefetch – a prefetched address that the processor never references
- Aborted Prefetch – an issued prefetch discarded in the prefetch address buffer
- Prefetch Ratio – ratio of issued prefetches to cache references
- Success Ratio – ratio of the total number of useful prefetches to the total number of issued prefetches
- Global Success Ratio – fraction of cache misses avoided or partially avoided

Prefetching
- Effective in reducing the cache miss ratio, yet it doesn't always improve CPU performance – why?
- Need to determine when, and whether, hardware prefetching is useful.

How to improve performance?
- Double ported address array.
- Double ported or fully buffered data array.
- Wide bus – split and non-split operations.
- Prefetching can be improved with a significant investment in extra hardware.

Why cache memories?
- Factors affecting cache performance: block size, cache size, associativity, algorithms used.

Hardware cache prefetching: dedicated hardware without software support
- Which block to prefetch? The simplest choice is the next sequential block.
- When to prefetch?

Simple prefetch algorithms (the prefetch-unit sketch below uses one of them):
- Always prefetch.
- Prefetch on misses.
- Tagged prefetch.
- Threaded prefetching.
- Bi-directional prefetching.
- Number of cache blocks to prefetch – fixed or variable?

Disadvantages of cache prefetching
- Increase in memory traffic because of prefetched blocks that are never referenced.
- Cache pollution – useless prefetched blocks displacing useful ones; most hazardous when cache sizes are small and block sizes are large.

Factors degrading performance even when the miss ratio is decreased
- Address tag array busy due to prefetch lookups.
- Cache data array busy due to prefetch loads and replacements.
- Memory bus busy due to prefetch address transfers and data fetches.
- Memory system busy due to prefetch fetches and replacements.

Architectural Model
Processor: five stages in the pipeline
- Instruction fetch.
- Instruction decode.
- Read or write to memory.
- ALU computation.
- Register file update.

Cache: single or double ported
- Instruction cache.
- Data cache.

Write buffer

Prefetching units: one for each cache
- Receives information such as cache miss, cache hit, instruction type, and branch target address.
- What if the buffer is full? (see the sketch below)
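To make the terms concrete, here is a minimal Python sketch of a per-cache prefetching unit that watches cache events and fills a small prefetch address buffer. The structure, names, buffer depth, and the prefetch-on-miss policy are assumptions for illustration, not the paper's design; when the buffer fills, one plausible policy is to discard the oldest entry, which corresponds to an "aborted prefetch" in the paper's terminology.

    # Hypothetical sketch of a per-cache prefetching unit, not the paper's design.
    # It watches cache accesses ("prefetch on miss" policy) and pushes sequential
    # next-block addresses into a bounded prefetch address buffer; when the buffer
    # is full, the oldest entry is discarded -- an aborted prefetch.
    from collections import deque

    BUFFER_SIZE = 4  # assumed buffer depth

    class PrefetchUnit:
        def __init__(self):
            self.buffer = deque()   # prefetch address buffer
            self.aborted = 0        # prefetches discarded from the buffer

        def observe(self, block: int, hit: bool):
            """Called by the cache on every access."""
            if hit:
                return
            if len(self.buffer) >= BUFFER_SIZE:
                self.buffer.popleft()       # buffer full: drop the oldest entry
                self.aborted += 1           # ...counted as an aborted prefetch
            self.buffer.append(block + 1)   # issue a prefetch for the next block

        def next_request(self):
            """The bus side pops the next prefetch address when the bus is free."""
            return self.buffer.popleft() if self.buffer else None

    unit = PrefetchUnit()
    for blk in (10, 20, 30, 40, 50, 60):    # six misses against a four-entry buffer
        unit.observe(blk, hit=False)
    print(list(unit.buffer), "aborted:", unit.aborted)  # [31, 41, 51, 61] aborted: 2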
Memory bus
- Split and non-split transactions.
- Bus arbitrator.

Main memory

Methodology
Methodology used: 25 commonly used real programs from 5 workload categories
- Computer Aided Design (CAD)
- Compiler Related Tools (COMP)
- Floating Point Intensive Applications (FP)
- Text Processing Programs (TEXT)
- Unix Utilities (UNIX)

The Baseline System
Default system configuration for the baseline system.

Cycles per instruction contributed by memory accesses (MCPI)
Why is MCPI preferred over the cache miss ratio or memory access time?
- It covers every aspect of performance.
- It excludes aspects of performance that cannot be affected by a cache prefetching strategy, e.g., the efficiency of instruction pipelining.
- Relative MCPI – MCPI with prefetching divided by the baseline MCPI; it should be smaller than 1.

Relative and absolute MCPI
- A cache prefetching strategy is very sensitive to the type of program the processor is running.

CPU stall breakdown

Instruction and data cache miss ratios
- True Miss Ratio – ratio of the total number of true cache misses to the total number of processor references to the cache.
- Partial Miss Ratio – ratio of the total number of partial cache misses to the total number of processor references to the cache.
- Total Miss Ratio – sum of the true miss ratio and the partial miss ratio.

Success ratio and global success ratio (recapped in code at the end of this part)
- Success Ratio – ratio of the total number of useful prefetches to the total number of prefetches issued.
- GSR – fraction of cache misses which are avoided or partially avoided.

Average distribution of prefetches
- Useful prefetches.
- Useless prefetches.
- Aborted prefetches.

The major limitations that reduce the effectiveness of cache prefetching are conflicts and delays in accessing the caches, the data bus, and the main memory.

Ideal system characteristics
Ideal cache
- Special access port to the tag array for prefetch lookups.
- Special access port to the data array for prefetch loads.
- No need to buffer the prefetched blocks.
Ideal data bus
- Private bus connecting the prefetching unit and main memory.
Ideal main memory
- Dual ported.
- Takes zero time to access the memory banks.

Effects of System Resources on Cache Prefetching
Effect of different design parameters:
- Single vs. double ported cache tag arrays
- Single vs. double ported cache data arrays and buffering
- Cache size
- Cache block size
- Cache associativity
- Split vs. non-split bus transactions
- Bus width
- Memory latency
- Number of memory banks
- Bus traffic
- Prefetch look-ahead distance

Single vs. double ported cache tag arrays
- A double ported cache tag array gives the best results, first with the 'bi-dir' strategy and then with the 'always' and 'thread' strategies.
- If a prefetch strategy looks up the cache tag arrays frequently, extra access ports to the tag array are vital.

Single vs. double ported cache data arrays and buffering
- Far less important than the tag-array ports because prefetch strategies use the data arrays less frequently.
- With a single ported data array the relative MCPI is above 1, but it falls with a double ported array.
- Conflicts over the cache port vanish with a double ported data array – no stalls.
Buffered data array
- Drawback: adding an extra port to the cache data array is very costly.
- A more practical solution: provide some buffering for the data arrays.
- Result: with a single ported, buffered data array the relative MCPI decreases by 0.26 on average – almost as good as when the cache data arrays are double ported.

Cache size
- The performance of a prefetch strategy improves as the cache size increases.
- In large caches, a prefetched block resides in the cache for a longer period.
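As promised above, a short recap of the evaluation metrics as code. The counter values are hypothetical, chosen only to show how MCPI, relative MCPI, and the miss and success ratios defined in this section are computed from raw simulation counters:

    # Evaluation metrics from this section, computed from raw event counters.
    # All counter values below are hypothetical, for illustration only.
    references          = 1_000_000  # processor references to the cache
    true_misses         = 40_000     # misses with no prefetch outstanding
    partial_misses      = 10_000     # misses whose block was prefetched but has not yet arrived
    issued_prefetches   = 120_000    # prefetches issued by the prefetching unit
    useful_prefetches   = 60_000     # issued prefetches later referenced by the processor
    instructions        = 800_000    # instructions executed
    memory_stall_cycles = 320_000    # CPU cycles contributed by memory accesses
    baseline_mcpi       = 0.55       # MCPI of the same system without prefetching

    true_miss_ratio    = true_misses / references
    partial_miss_ratio = partial_misses / references
    total_miss_ratio   = true_miss_ratio + partial_miss_ratio

    prefetch_ratio = issued_prefetches / references
    success_ratio  = useful_prefetches / issued_prefetches

    mcpi          = memory_stall_cycles / instructions  # cycles per instruction from memory
    relative_mcpi = mcpi / baseline_mcpi                 # below 1 means prefetching helped

    print(f"total miss ratio = {total_miss_ratio:.3f}, prefetch ratio = {prefetch_ratio:.2f}")
    print(f"success ratio = {success_ratio:.2f}, MCPI = {mcpi:.2f}, relative MCPI = {relative_mcpi:.2f}")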
- The bigger the cache, the fewer the cache misses; hence, prefetches are less likely to interfere with normal cache operation.

Cache block size
- With 16- or 32-byte cache blocks, most prefetching strategies perform better than no prefetching at all.
- MCPI increases with increasing block size – a result of more cache port conflicts.

Cache associativity
- Relative MCPI decreases by 0.07 on average when the cache changes from direct mapped to two-way set associative.
- MCPI remains almost constant as cache associativity increases further.

Split vs. non-split bus transactions
- Relative MCPI decreases by 14% on average when the bus transaction changes from non-split to split.
- Reason: most aborted prefetches become useful prefetches when the data bus supports split transactions.

Bus width
- As the bus width increases, the relative MCPI begins to fall below one for the base system.
- Reason: there are fewer cache port conflicts.
- Assumption: the cache data port must be as wide as the data bus.

Memory latency (CPU cycles)
- Relative MCPI decreases as the memory latency increases from 8 to 64 processor cycles, but it starts to rise as latency increases further.
- Two reasons account for these U-shaped curves: fewer cache port conflicts, but more data bus conflicts.

Number of memory banks (with non-split bus transactions)
- Relative MCPI decreases by 0.18 on average when the number of memory banks increases from 1 to 4.
- Multiple prefetches can proceed in parallel when there are more memory banks.

Bus traffic – bus utilization by DMA (%)
- As traffic increases, the relative MCPI converges to 1 because less and less bus bandwidth is available to send prefetch requests to the main memory.
- In the baseline system, however, heavier bus traffic helps to reduce the number of undesirable prefetches, and hence the relative MCPI decreases.

Prefetch look-ahead distance (in blocks)
- Better performance can be achieved by prefetching block p + LA instead of p, where p is the original block address requested by the prefetching strategy and LA is the look-ahead distance in blocks (see the toy example before the Conclusion).
- For all strategies except 'thread', the relative MCPI rises with increasing look-ahead distance.
- Reason: as LA increases, the benefit of spatial locality diminishes.

The System Design
Worst and best values for each system parameter in terms of prefetching effectiveness
- When a system parameter is changed from its worst value to its best value, the improvement in the performance of cache prefetching ranges from 7% to as high as 90%.

Relative and absolute MCPI for the OptiPrefetch system
- All prefetching strategies perform better than no prefetching.
- The relative MCPI for all strategies averages 0.65 – prefetching reduces MCPI by 35% on average relative to the corresponding baseline system.
- The OptiPrefetch system favors aggressive strategies like 'always' and 'tag', which issue many prefetches.

Stall breakdown in % for the OptiPrefetch system
- There are no conflicts over the cache data ports because the data arrays are dual ported.
- With 16 memory banks available, conflicts over memory banks occur very rarely.

Average distribution of prefetches: baseline system vs. OptiPrefetch system
- Thanks to the high bus bandwidth, most prefetch requests can be sent to the main memory successfully and almost no prefetches are aborted.

Possible system designs
- Prefetching improves performance in all systems except systems D and G, where the data bus bandwidth is too small.
- Performance of cache prefetching in some possible systems.
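The toy example referred to in the look-ahead slide: a deliberately simplified model, not taken from the paper, of why the success ratio (and with it the relative MCPI) degrades once LA grows beyond the span of spatial locality. The run length and the one-prefetch-per-reference policy are assumptions made for illustration.

    # Toy model of prefetch look-ahead distance. References sweep short
    # sequential runs of RUN_LEN blocks before jumping elsewhere; a prefetch
    # of block p + LA is useful only if that block still lies inside the run.
    RUN_LEN = 4  # blocks referenced sequentially before the stream jumps away

    def success_ratio(la: int) -> float:
        # On the reference to offset o within a run, the prefetcher issues
        # block o + la; that prefetch is useful only if o + la < RUN_LEN.
        useful = sum(1 for o in range(RUN_LEN) if o + la < RUN_LEN)
        return useful / RUN_LEN

    for la in (1, 2, 4, 8):
        print(f"LA = {la}: success ratio = {success_ratio(la):.2f}")
    # LA = 1 -> 0.75, LA = 2 -> 0.50, LA = 4 and beyond -> 0.00:
    # past the locality span, every prefetch is useless.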
Conclusion
Prefetching can reduce average memory latency, provided the system has appropriately designed hardware.

For a cache to prefetch effectively:
- Cache tag arrays should be double ported.
- Data arrays should be either double ported or buffered.
- The cache should be at least two-way set associative.

Prefetching is most effective when:
- The cache size is large.
- The block size is small.
- The memory bus is wide.
- The bus supports split transactions.
- The main memory is interleaved.