Improving Cache Performance
by Exploiting Read-Write Disparity
Samira Khan, Alaa R. Alameldeen, Chris Wilkerson,
Onur Mutlu, and Daniel A. Jiménez
Summary
• Read misses are more critical than write misses
– Read misses can stall the processor; writes are not on the critical path
• Problem: Cache management does not exploit read-write disparity
• Goal: Design a cache that favors reads over writes to improve performance
– Lines that are only written to are less critical
– Prioritize lines that service read requests
• Key observation: Applications differ in their read reuse behavior in clean and dirty lines
• Idea: Read-Write Partitioning
– Dynamically partition the cache between clean and dirty lines
– Protect the partition that has more read hits
• Improves performance over three recent mechanisms
2
Outline
• Motivation
• Reuse Behavior of Dirty Lines
• Read-Write Partitioning
• Results
• Conclusion
3
Motivation
• Read and write misses are not equally critical
• Read misses are more critical than write misses
• Read misses can stall the processor
• Writes are not on the critical path
[Timeline: read misses (Rd A, Rd C) stall the processor, while the write (Wr B) is buffered and written back off the critical path]
Cache management does not exploit the disparity between read and write requests
4
Key Idea
• Favor reads over writes in the cache
• Differentiate between lines that are read and lines that are only written to
• The cache should protect lines that serve read requests
• Lines that are only written to are less critical
• Improve performance by maximizing read hits
• An example: in the access stream Rd A, Rd B, Wr B, Rd D, Wr C, lines A and D are read-only, line B is read and written, and line C is write-only
5
An Example
[Animation: with LRU replacement, each iteration of the stream Rd A, Wr B, Rd B, Wr C, Rd D incurs 2 read stalls (Rd A and Rd D miss); a read-biased replacement policy evicts the write-only line instead, so Rd A hits and only 1 stall remains per iteration, saving cycles]
Evicting lines that are only written to can improve performance
Dirty lines are treated differently depending on read requests
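To make the read-biased idea concrete, here is a minimal sketch (Python, with illustrative names; not the paper's exact policy) of an LRU-ordered cache set that prefers to evict lines that have never served a read:

class CacheLine:
    def __init__(self, tag):
        self.tag = tag
        self.served_read = False     # set once the line serves a read request
        self.dirty = False           # set on any write

class ReadBiasedSet:
    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.lines = []              # index 0 = LRU, last = MRU

    def _find(self, tag):
        for line in self.lines:
            if line.tag == tag:
                return line
        return None

    def access(self, tag, is_write):
        line = self._find(tag)
        hit = line is not None
        if not hit:
            if len(self.lines) == self.num_ways:
                self._evict()
            line = CacheLine(tag)
        else:
            self.lines.remove(line)
        self.lines.append(line)      # promote to MRU
        if is_write:
            line.dirty = True
        else:
            line.served_read = True
        return hit

    def _evict(self):
        # Prefer the least-recently-used line that never served a read
        # (a write-only line); fall back to plain LRU otherwise.
        for i, line in enumerate(self.lines):
            if not line.served_read:
                del self.lines[i]
                return
        del self.lines[0]

On the stream in the example above, such a policy keeps A in the cache and evicts the write-only line instead, avoiding one of the two read stalls per iteration.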
6
Outline
• Motivation
• Reuse Behavior of Dirty Lines
• Read-Write Partitioning
• Results
• Conclusion
7
Reuse Behavior of Dirty Lines
• Not all dirty lines are the same
• Write-only Lines
• Do not receive read requests, can be evicted
• Read-Write Lines
• Receive read requests, should be kept in the cache
Evicting write-only lines provides more space
for read lines and can improve performance
8
Reuse Behavior of Dirty Lines
[Chart: percentage of LLC cache lines that are dirty write-only vs. dirty read-write for each SPEC CPU2006 benchmark (400.perlbench through 483.xalancbmk)]
Applications have different reuse behavior in dirty lines
On average, 37.4% of lines are write-only and 9.4% of lines are both read and written
9
Outline
• Motivation
• Reuse Behavior of Dirty Lines
• Read-Write Partitioning
• Results
• Conclusion
10
Read-Write Partitioning
• Goal: Exploit different read reuse behavior in dirty lines to maximize the number of read hits
• Observation:
– Some applications have more reads to clean lines
– Other applications have more reads to dirty lines
• Read-Write Partitioning:
– Dynamically partitions the cache into clean and dirty lines
– Evicts lines from the partition that has less read reuse
Improves performance by protecting lines with more read reuse
11
Read-Write Partitioning
[Charts: number of reads to clean vs. dirty lines over 500M instructions, normalized to reads to clean lines at 100M instructions, for soplex and xalancbmk]
Applications have significantly different read reuse behavior in clean and dirty lines
12
Read-Write Partitioning
• Utilize the disparity in read reuse in clean and dirty lines
• Partition the cache into clean and dirty lines
• Predict the partition size that maximizes read hits
• Maintain the partition through replacement
– DIP [Qureshi et al. 2007] selects the victim within the partition
[Diagram: each cache set is divided into dirty lines and clean lines; with a predicted best partition size of 3, the victim is replaced from the dirty partition]
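A minimal sketch of partition-maintaining victim selection (illustrative Python; the function name and the plain-LRU fallback are assumptions, and the real mechanism uses DIP within the partition as noted above):

def select_victim(dirty_flags, target_dirty):
    """dirty_flags lists the lines of one set in LRU-to-MRU order
    (True = dirty). target_dirty is the predicted best dirty-partition size.
    Returns the index of the way to evict."""
    dirty_count = sum(dirty_flags)
    # If the dirty partition is over its predicted size, shrink it;
    # otherwise take the victim from the clean partition.
    evict_dirty = dirty_count > target_dirty
    for i, is_dirty in enumerate(dirty_flags):   # index 0 = LRU position
        if is_dirty == evict_dirty:
            return i
    return 0                                     # fallback: plain LRU victim

# Example: 5 dirty lines in an 8-way set with a target of 3 ->
# the LRU-most dirty line (way 1) is evicted.
print(select_victim([False, True, True, False, True, True, True, False], target_dirty=3))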
13
Predicting Partition Size
• Predicts partition size using sampled shadow tags
– Based on utility-based partitioning [Qureshi et al. 2006]
• Counts the number of read hits in clean and dirty lines
• Picks the partition (x, associativity – x) that maximizes the number of read hits
[Diagram: sampled sets keep shadow tags for dirty and clean lines ordered from MRU to LRU, with counters recording read hits at each recency position; the split with the maximum number of read hits is selected]
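A minimal sketch of the partition-size search (illustrative Python; it assumes the per-position read-hit counters have already been collected from the sampled shadow tags, indexed by stack distance with MRU = 0):

def predict_dirty_ways(dirty_hits, clean_hits, associativity):
    """dirty_hits[d] / clean_hits[d]: read hits observed at stack distance d
    in the dirty / clean shadow-tag stacks. Returns the dirty-partition size x
    (the clean partition gets associativity - x ways) that maximizes read hits."""
    best_x, best_hits = 0, -1
    for x in range(associativity + 1):
        hits = sum(dirty_hits[:x]) + sum(clean_hits[:associativity - x])
        if hits > best_hits:
            best_x, best_hits = x, hits
    return best_x

# Example: when most read hits land in clean lines, the predictor shrinks
# the dirty partition toward zero ways.
print(predict_dirty_ways(dirty_hits=[1] * 16, clean_hits=[10] * 16, associativity=16))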
14
Outline
• Motivation
• Reuse Behavior of Dirty Lines
• Read-Write Partitioning
• Results
• Conclusion
15
Methodology
• CMP$im x86 cycle-accurate simulator [Jaleel et al. 2008]
• 4MB, 16-way set-associative LLC
• 32KB I+D L1, 256KB L2
• 200-cycle DRAM access time
• 550M representative instructions
• Benchmarks:
– 10 memory-intensive SPEC benchmarks
– 35 multi-programmed applications
16
Comparison Points
• DIP, RRIP: Insertion Policy [Qureshi et al. 2007, Jaleel et al. 2010]
– Avoid thrashing and cache pollution
• Dynamically insert lines at different stack positions
– Low overhead
– Do not differentiate between read-write accesses
• SUP+: Single-Use Reference Predictor [Piquet et al. 2007]
– Avoids cache pollution
• Bypasses lines that do not receive re-references
– High accuracy
– Does not differentiate between read-write accesses
• Does not bypass write-only lines
– High storage overhead, needs PC in LLC
17
Comparison Points:
Read Reference Predictor (RRP)
• A new predictor inspired by prior works [Tyson et al. 1995, Piquet et al. 2007]
• Identifies read and write-only lines by the allocating PC
– Bypasses write-only lines
• Writebacks are not associated with any PC
[Diagram: over time, PC P issues Rd A followed by writebacks of A, while PC Q issues only writebacks of A; with the allocating PC carried from L1, P is marked as the PC that allocates a line that is never read again, whereas writebacks alone have no allocating PC]
Associates the PC in L1 and passes the PC through L2 to the LLC: high storage overhead
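A minimal sketch of RRP-style training (illustrative Python; the real predictor must carry the allocating PC from L1 through L2 to the LLC, which is where the storage overhead comes from):

class ReadReferencePredictor:
    """One-bit-per-PC table: predicts whether fills from a given allocating
    PC produce lines that are never read again (write-only)."""
    def __init__(self):
        self.write_only_pc = {}            # allocating PC -> predicted write-only

    def should_bypass(self, alloc_pc):
        # Bypass the fill if lines from this PC were never read in the past.
        return self.write_only_pc.get(alloc_pc, False)

    def train_on_evict(self, alloc_pc, served_read):
        # On eviction, mark the PC as write-only if the line never served a read.
        self.write_only_pc[alloc_pc] = not served_read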
18
Single Core Performance
[Chart: speedup vs. baseline LRU for DIP, RRIP, SUP+, RRP (48.4KB overhead), and RWP (2.6KB overhead)]
Differentiating read vs. write-only lines improves performance over recent mechanisms
RWP performs within 3.4% of RRP, but requires 18X less storage overhead
19
4 Core Performance
[Chart: speedup vs. baseline LRU for DIP, RRIP, SUP+, RRP, and RWP on 4-core workloads, grouped by the number of memory-intensive applications in the workload (none through 4); annotated gains of +8% and +4.5%]
Differentiating read vs. write-only lines improves performance over recent mechanisms
More benefit when more applications are memory intensive
20
Average Memory Traffic
[Chart: memory traffic split into writebacks and misses as a percentage of the baseline, for Base (15% writeback, 85% miss) and RWP (17% writeback, 66% miss)]
Increases writeback traffic by 2.5%,
but reduces overall memory traffic by 16%
21
Dirty Partition Sizes
[Chart: natural vs. predicted dirty partition size for each SPEC CPU2006 benchmark]
Partition size varies significantly for some benchmarks
22
Dirty Partition Sizes
[Chart: natural vs. predicted dirty partition size for each SPEC CPU2006 benchmark]
Partition size varies significantly during runtime for some benchmarks
23
Outline
• Motivation
• Reuse Behavior of Dirty Lines
• Read-Write Partitioning
• Results
• Conclusion
24
Conclusion
• Problem: Cache management does not exploit read-write disparity
• Goal: Design a cache that favors read requests over write requests to improve performance
– Lines that are only written to are less critical
– Protect lines that serve read requests
• Key observation: Applications differ in their read reuse behavior in clean and dirty lines
• Idea: Read-Write Partitioning
– Dynamically partition the cache into clean and dirty lines
– Protect the partition that has more read hits
• Results: Improves performance over three recent mechanisms
25
Thank you
26
Improving Cache Performance
by Exploiting Read-Write Disparity
Samira Khan, Alaa R. Alameldeen, Chris Wilkerson,
Onur Mutlu, and Daniel A. Jiménez
Extra Slides
28
Read Intensive
• Most accesses are reads, only a few writes
• 483.xalancbmk: Extensible Stylesheet Language Transformations (XSLT) processor
[Diagram: the XSLT processor takes XML input and XSLT code and produces HTML/XML/TXT output]
• 92% of accesses are reads
• 99% of write accesses are stack operations
29
Read Intensive
[Diagram: logical view of the stack for the elementAt function (push %ebp; mov %esp, %ebp; push %esi; …; pop %esi; pop %ebp; ret); stack lines are written and written back, with regions marked written-only and read-and-written]
1.8% of lines in the LLC are write-only, 0.38% of lines are read and written
30
Write Intensive
• Generates a huge amount of intermediate data
• 456.hmmer: Searches a protein sequence database
[Diagram: hmmer matches a database against a hidden Markov model of multiple sequence alignments to produce the best-ranked matches]
• Viterbi algorithm, uses dynamic programming
• Only 0.4% of writes are from stack operations
31
Write Intensive
[Diagram: the 2D dynamic-programming table is mostly written and written back; only a small region is read and written]
92% of lines in the LLC are write-only, 2.68% of lines are read and written
32
Read-Write Intensive
• Iteratively reads and writes a huge amount of data
• 450.soplex: Linear programming solver
[Diagram: soplex takes inequalities and an objective function and computes the minimum/maximum of the objective function]
• Simplex algorithm, iterates over a matrix
• Different operations over the entire matrix at each iteration
33
Read-Write Intensive
[Diagram: within the matrix, the pivot column and pivot row are read and written, while other regions are written only and written back]
19% of lines in the LLC are write-only, 10% of lines are read and written
34
Read Reuse of Clean-Dirty Lines in LLC
[Chart: percentage of LLC cache lines classified as clean non-read, dirty non-read, clean read, and dirty read for each SPEC CPU2006 benchmark]
On average, 37.4% of blocks are dirty non-read and 42% of blocks are clean non-read
35
Single Core Performance
[Chart: speedup vs. baseline LRU for DIP, RRIP, SUP+, RRP (48.4KB overhead), and RWP (2.6KB overhead) over all SPEC CPU2006 benchmarks]
5% speedup over the whole SPEC CPU2006 benchmark suite
36
Speedup
[Chart: per-benchmark speedup vs. baseline LRU for DIP, RRIP, SUP+, RRP, and RWP]
On average, 14.6% speedup over baseline
5% speedup over the whole SPEC CPU2006 benchmark suite
Writeback Traffic to Memory
[Chart: per-benchmark writeback traffic of RRP and RWP, normalized to LRU]
Increases writeback traffic by 17% over the whole SPEC CPU2006 benchmark suite
38
4-Core Speedup
[Chart: weighted speedup of RRP and RWP, normalized to the baseline, across the 4-core workloads]
39
Change of IPC with Static Partitioning
[Charts: IPC of xalancbmk and soplex as a function of the number of ways statically assigned to dirty blocks (0 to 16)]