Hyesoon Kim
TLP-Aware Cache Management Policy (HPCA-18)

2/25 Outline
| Introduction
| Background
| TAP (TLP-Aware Cache Management Policy)
  Core sampling
  Cache block lifetime normalization
  TAP-UCP and TAP-RRIP
| Evaluation Methodology
| Evaluation Results
| Conclusion

3/25 Background
| Combining GPU cores with conventional CMPs is a trend.
  Intel's Sandy Bridge, AMD's Fusion, NVIDIA's Denver Project
| Various resources are shared between CPU and GPU cores:
  LLC, on-chip interconnect, memory controller, and DRAM.
| The shared cache is one of the most important of these resources.

4/25 Prior Work
| Many researchers have proposed cache mechanisms:
  Dynamic cache partitioning: Suh+ [HPCA'02], Kim+ [PACT'04], Qureshi+ [MICRO'06]
  Dynamic cache insertion policies: Qureshi+ [ISCA'07], Jaleel+ [PACT'08, ISCA'10], Wu+ [MICRO'11, MICRO'11]
  Many other mechanisms
| All of these mechanisms target CMPs.
| They may not be directly applicable to CPU-GPU heterogeneous architectures, because CPU and GPU cores have different characteristics.

5/25 GPU Characteristics
| SIMD execution, massive threading, lack of speculative execution, ...
| GPU cores run an order of magnitude more threads:
  CPU: 1-4-way SMT; GPU: 10s of active threads in a core.
| GPU cores therefore have higher TLP (Thread-Level Parallelism) than CPU cores.
| TLP has a significant impact on how caching affects application performance.

6/25 Application Types
[Figure: MPKI and CPI vs. cache size for three application types - compute intensive, cache friendly or thrashing, and TLP dominant. With high TLP, CPI stays flat even as MPKI drops; this TLP-dominant type is hardly found in CPU applications.]

7/25 Limitation of Cache-Oriented Metrics
[Figure: a cache-friendly application and a TLP-dominant application show identical MPKI curves but different CPI curves as cache size grows.]
| Cache-oriented metrics cannot differentiate the two types; they cannot recognize the effect of TLP.
| We need to directly monitor the performance effect of caching.
8/25 Core Sampling: Mechanism
| Core sampling runs sampled GPU cores with different cache policies:
  POL1: bypassing the LLC (no L3)
  POL2: MRU insertion policy in the LLC
  All other GPU cores are followers.
[Figure: CPU and GPU cores with private L1s sharing the last-level cache and DRAM; two GPU cores are sampled with POL1 and POL2, the rest follow.]

9/25 Core Sampling: Decision
| The core sampling controller measures the performance difference between the sampled cores:
  Collect performance samples IPC1 (POL1) and IPC2 (POL2).
  Calculate the performance delta ∆(IPC1, IPC2).
  If ∆ > threshold: cache-friendly - caching improves performance.
  Otherwise: not cache-friendly - caching does not affect performance.

10/25 Core Sampling: Two Types
[Figure: for a cache-friendly application, POL1 (bypassing LLC) and POL2 (MRU insertion) yield ∆ > threshold; for a TLP-dominant application, ∆ < threshold, so it is classified as not cache-friendly.]

11/25 Core Sampling: Summary
| Cores run different LLC policies to identify the effect of the last-level cache.
| Main goal: finding cache-friendly GPGPU applications.
| Why core sampling is viable:
  SPMD (Single Program, Multiple Data) model - each GPU core runs the same program.
  GPGPU applications usually behave symmetrically across GPU cores.
  Performance variance between GPU cores is very small.
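The core-sampling decision above can be sketched in a few lines. This is a minimal illustration, not the paper's hardware implementation: the function name, the relative-delta formulation, and the 5% threshold are assumptions for the example.

```python
def core_sampling_decision(ipc_pol1, ipc_pol2, threshold=0.05):
    """Classify a GPGPU application as cache-friendly or not.

    ipc_pol1: IPC of the sampled core that bypasses the LLC (POL1).
    ipc_pol2: IPC of the sampled core using MRU insertion (POL2).
    threshold: relative IPC delta above which caching is deemed helpful
               (the 5% value is an assumption, not taken from the paper).
    """
    delta = (ipc_pol2 - ipc_pol1) / ipc_pol1  # performance delta from caching
    return "cache-friendly" if delta > threshold else "not cache-friendly"

# A TLP-dominant kernel: LLC caching barely changes IPC.
print(core_sampling_decision(1.00, 1.01))   # not cache-friendly
# A cache-friendly kernel: MRU insertion clearly helps.
print(core_sampling_decision(1.00, 1.30))   # cache-friendly
```

Because all follower cores run the same SPMD program as the sampled cores, the controller can attribute the IPC difference to the LLC policy alone.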
12/25 GPGPU Cache Access Characteristics
| GPU cores have higher TLP (Thread-Level Parallelism) than CPU cores.
| GPU cores generate an order of magnitude more cache accesses.
| GPUs have higher tolerance for cache misses due to TLP: they keep generating cache accesses from other threads without stalling.
| SIMD execution: one SIMD instruction can generate multiple memory requests.

13/25 Access Rate Comparison
[Figure: on a cache miss, a CPU thread stalls and issues fewer cache accesses, while GPU threads continue without stalls and issue many more. Measured requests per 1000 cycles: under 100 for a 1-core CPU versus over 500 for a 6-core GPU.]

14/25 Why Frequent GPGPU Accesses Are Problematic
| GPGPU applications cause severe interference, e.g., under the base LRU replacement policy.
| The performance impact of caching differs between applications:
  Perf. Penalty_GPU(cache miss) =? Perf. Penalty_CPU(cache miss)
| We have to consider the different degrees of cache accesses.
| We propose Cache Block Lifetime Normalization.
15/25 Cache Block Lifetime Normalization
| Simple monitoring mechanism:
  Monitor the cache access rate difference between CPU and GPGPU applications with per-type access counters, and periodically calculate the ratio
  r = GPU_counter / CPU_counter
  if r > threshold: XSRATIO = r
  else: XSRATIO = 1
| XSRATIO provides hints about access rate differences to the proposed TAP mechanisms.

16/25 TAP Overview
| TAP combines two components:
  Core sampling - to find cache-friendly applications.
  Cache block lifetime normalization - to consider different degrees of cache accesses.
| Applied to two base policies:
  UCP (Utility-based Cache Partitioning) -> TAP-UCP (this talk)
  RRIP (Re-Reference Interval Prediction) -> TAP-RRIP (in the paper)

17/25 TAP-UCP
| UCP [Qureshi and Patt, MICRO-2006]:
  Per application, an ATD (LRU stack) and per-way hit counters feed a partitioning algorithm.
| TAP extensions:
  Core sampling sets a UCP-Mask register: UCP-Mask = 1 if the GPGPU application is not cache-friendly; if UCP-Mask == 1, assign it only one way.
  Cache block lifetime normalization: divide the GPU hit counters by the XSRATIO register value to balance cache space.

18/25 TAP-UCP Case 1: Non-Cache-Friendly
| UCP computes marginal utility: how many more hits are expected if N ways are given to an application.
  Example hit counters (MRU to LRU) - CPU: 16, 3, 8, 20, 5, 8, 3, 2; GPU: 32, 6, 16, 40, 10, 16, 6, 4.
| Because the GPU's hit counters are inflated by its higher access rate, UCP assigns more ways to the GPU (1 CPU : 7 GPU).
| Core sampling reports ∆ < threshold: caching has little effect on GPGPU performance.
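The XSRATIO monitor described on slide 15 reduces to a periodic counter comparison. A minimal sketch, assuming illustrative counter values and a threshold of 2.0 (the paper only says r is compared against a threshold; the exact value and function name here are assumptions):

```python
def update_xsratio(gpu_counter, cpu_counter, threshold=2.0):
    """Derive the XSRATIO hint for the current monitoring period.

    gpu_counter / cpu_counter: LLC access counts for GPGPU and CPU
    applications since the last period. If the GPU's access rate is
    much higher, XSRATIO records the ratio; otherwise it stays at 1.
    """
    r = gpu_counter / cpu_counter
    return r if r > threshold else 1.0

# GPU issues 10x the CPU's accesses -> scale GPU hit counters down by 10.
print(update_xsratio(gpu_counter=5000, cpu_counter=500))  # 10.0
# Similar access rates -> no normalization needed (XSRATIO = 1).
print(update_xsratio(gpu_counter=600, cpu_counter=500))   # 1.0
```

Both counters would be reset at the end of each period so the ratio tracks phase changes in the running applications.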
| TAP-UCP therefore assigns only one way to the GPGPU application: 7 CPU : 1 GPU instead of UCP's 1 CPU : 7 GPU.

19/25 TAP-UCP Case 2: Cache-Friendly
| Core sampling reports ∆ > threshold: the GPGPU application is cache-friendly.
| Lifetime normalization divides the GPU hit counters by XSRATIO (= 2 in this example):
  GPU hit counters 32, 6, 16, 40, 10, 16, 6, 4 become 16, 3, 8, 20, 5, 8, 3, 2.
| UCP alone would assign 1 CPU : 7 GPU ways; TAP-UCP balances the final partition to 4 CPU : 4 GPU, giving the CPU more ways.

20/25 Outline
| Introduction | Background | TAP (core sampling, cache block lifetime normalization, TAP-UCP) | Evaluation Methodology | Evaluation Results | Conclusion

21/25 Evaluation Methodology
| MacSim simulator (http://code.google.com/p/macsim) [GT]
  Trace-driven timing simulator; x86 + PTX instructions.
| Configuration:
  CPU (1-4 cores): OOO, 4-wide, private L1/L2
  GPU (6 cores): 16 SIMD width, private L1
  LLC: 32-way 8MB shared (base: LRU)
  DRAM: DDR3-1333, 41.6 GB/s bandwidth, FR-FCFS
| Workloads:
  CPU: SPEC 2006; GPGPU: CUDA SDK, Parboil, Rodinia, ERCBench
  1-CPU (1 CPU + 1 GPU): 152 workloads
  2-CPU (2 CPUs + 1 GPU): 150 workloads
  4-CPU (4 CPUs + 1 GPU): 75 workloads
  Stream-CPU (Stream CPU + 1 GPU): 25 workloads

22/25 Evaluation Results
[Figure: speedup over LRU - TAP-UCP achieves 11% over LRU; TAP-RRIP achieves 12%.]
| UCP is effective with thrashing GPGPU applications, but less effective with cache-sensitive ones.
| RRIP is generally less effective on heterogeneous workloads.
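The two TAP-UCP cases above can be reproduced with the slide's example counters. This sketch is a simplification: for two applications, an exhaustive search over way splits (maximizing total expected hits, where an app's hits with n ways is the sum of its first n per-way hit counters) stands in for UCP's actual lookahead partitioning algorithm, and the function name is an assumption.

```python
def partition(cpu_hits, gpu_hits, total_ways, xsratio=1.0, ucp_mask=0):
    """Return (cpu_ways, gpu_ways) under a simplified TAP-UCP policy."""
    # Core sampling hint: a non-cache-friendly GPGPU app gets only one way.
    if ucp_mask == 1:
        return total_ways - 1, 1
    # Lifetime normalization: scale down the GPU's inflated hit counters.
    gpu_hits = [h / xsratio for h in gpu_hits]
    prefix = lambda hits, n: sum(hits[:n])  # expected hits with n ways
    best = max(range(1, total_ways),
               key=lambda c: prefix(cpu_hits, c) + prefix(gpu_hits, total_ways - c))
    return best, total_ways - best

cpu = [16, 3, 8, 20, 5, 8, 3, 2]           # per-way hit counters, MRU to LRU
gpu = [32, 6, 16, 40, 10, 16, 6, 4]        # 2x the CPU's access rate

print(partition(cpu, gpu, 8))               # (1, 7): plain UCP favors the GPU
print(partition(cpu, gpu, 8, xsratio=2.0))  # (4, 4): TAP-UCP balances the ways
print(partition(cpu, gpu, 8, ucp_mask=1))   # (7, 1): not cache-friendly
```

With XSRATIO = 2 the normalized GPU counters equal the CPU's, so the search lands on the balanced 4:4 split from slide 19; without normalization the doubled GPU counters pull the split to 1:7, as in slide 18.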
23/25 Case Study: Sphinx3 + Stencil
| Stencil is a TLP-dominant GPGPU application.
| MPKI (TAP vs. previous):
  CPU MPKI: significant decrease; GPGPU MPKI: considerable increase; overall MPKI: increased.
| Performance (speedup over LRU):
  CPU: huge improvement; GPU: no change; overall: huge improvement.

24/25 Scalability with More CPU Applications
[Figure: speedup over LRU for UCP, TAP-UCP, RRIP, and TAP-RRIP - 1 CPU + 1 GPGPU: 11% (TAP-UCP) and 12% (TAP-RRIP); 2 CPUs + 1 GPGPU: 12.5% and 14%; 4 CPUs + 1 GPGPU: 17.5% and 24%.]
| TAP mechanisms show higher benefits with more CPU applications.

25/25 Conclusion
| CPU-GPU heterogeneous architecture is a popular trend, which makes the resource sharing problem more significant.
| We propose TAP for CPU-GPU heterogeneous architectures - the first proposal to consider this resource sharing problem.
| We introduce a core sampling technique that runs sampled GPU cores with different policies to identify cache-friendliness.
| Both TAP mechanisms improve system performance significantly:
  TAP-UCP: 11% over LRU and 5% over UCP.
  TAP-RRIP: 12% over LRU and 9% over RRIP.