Reuse Distance-Based Cache Hint Selection
Kristof Beyls and Erik D’Hollander
Ghent University
Euro-Par’02, 30 Aug 2002

Introduction
• Cache Hints: why and what
• Reuse distance
• Reuse distance-based cache hint selection: how
• Experiments and results

Cache Hints: Why
• Anti-law of Moore: the processor-memory speed gap doubles every 2 years.
• To bridge the gap, efforts must be combined at
  – algorithm level
  – compiler level
  – hardware level
[Figure label: Cache hints]

Cache Hints: What
• Cache control instructions are emerging in a number of architectures.
• HP-PlayDoh EPIC architecture provides 2 kinds of cache hints.
[Figure: a plain LD compared with LD_C2_C3; each panel shows the CPU, the L1, L2 and L3 caches, and the MRU and LRU positions within a cache.]

Cache Hints: What (2)
• 2 kinds of cache hints, as in LD_C2_C3:
  – source cache hint (C2): indicates the cache level where the data is expected.
    • used by the compiler to know the real latency of the load
  – target cache hint (C3): indicates the cache level where the data should be kept.
    • used by the hardware to adapt the replacement policy
• Question: how to select appropriate cache hints? (reuse distance)

Reuse Distance: Definition
• Reuse pair
• Reuse distance of reuse pair
• Forward reuse distance
• Backward reuse distance
[Figure: example access trace over A, B, C, D, D, A, F, annotated with reuse distances of 0 and 3.]

Reuse Distance: Properties
• Backward reuse distance > cache size ⇒ cache miss in a fully associative LRU cache
• Forward reuse distance > cache size ⇒ data will not be retained in a fully associative LRU cache

Reuse-distance based cache hint selection...
  Reuse Distance (RD)    Cache Hint
  RD < L1                C1
  L1 <= RD < L2          C2
  L2 <= RD < L3          C3
  L3 <= RD               C4
(L1, L2, L3 denote the cache sizes.)

Differentiating between source and target hints
• Source cache specifier: based on backward reuse distance (BRD)
• Target cache specifier: based on forward reuse distance (FRD)
• e.g.: with L1 = 256 lines, L2 = 8K lines and L3 = 64K lines, an access with BRD = 13 and FRD = 5000 gets LD_C1_C2 (BRD < L1 gives source hint C1; L1 <= FRD < L2 gives target hint C2).

Memory access vs. Memory instruction
• Cache hints are applied to instructions.
• Instructions can generate multiple accesses (e.g. in a loop), with different reuse distances.
• Only 1 cache hint can be selected for all the accesses from the same instruction.

Example of cache hint per instruction problem
  for i := 1 to 100
    A( i ) = ...
• LD_C1_C1 or LD_C4_C4 ????
[Figure: accesses A(1) ... A(8) and the cumulative reuse distance distribution of the instruction, with 25% and 75% marks and the C1 and C4 ranges on the reuse distance axis.]

Cumulative reuse distance distribution to cache hint
[Figure: cumulative reuse distance distribution (0% to 100%, with marks at 5%, 40%, 45% and 90%) against reuse distance, with the cache sizes L1, L2 and L3 on the axis; the hint C2 is indicated.]

Strategy
• Instrument the program to obtain the memory access stream.
• Profile to obtain the reuse distance distribution.
• Generate cache hints.
• Execute the optimized program.

Implementation
• Open64 compiler for Itanium (IA-64)
  – source cache hints: in the instruction scheduler
    • only visible in the internal compiler representation of instructions.
  – target cache hints: in the assembly output
    • shows up in assembly code, e.g. ld.nta r34 = [r47]
• Programs from Olden and Spec95fp.

Execution times
[Figure: normalized execution time after cache hint selection (y-axis from 60% to 100%) for swim, tomcatv, applu, wave5 and mgrid (Spec95fp), for tsp, em3d, health, mst, perimeter, power, treeadd, bisort and bh (Olden), and for the average.]

Results
• In Olden (pointer chasing), speedup comes mainly from target cache hints (better replacement policy): 4% on average.
• In Spec95fp (numerical loops), speedup results mainly from source cache hints (better latency hiding through instruction scheduling): 10% on average.

Conclusion
• Reuse distance is independent of cache parameters such as size or associativity.
  – good metric for optimizations targeting multiple cache levels.
• As such, it is an appropriate measure to base cache hint selection on.
• The implementation in an EPIC compiler resulted in 7% speedup on average with a maximum of 36%.
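
Sketch: from reuse distances to cache hints
The selection scheme in the slides can be illustrated with a small Python sketch. It is not the authors' Open64 implementation. It assumes that the reuse distance of a reuse pair is the number of distinct cache lines accessed between its two accesses, and that an access without an earlier (or later) reuse has infinite distance. All function names are hypothetical. The per-instruction rule (pick the smallest cache level that covers at least a `threshold` fraction of the accesses) and its default of 0.9 are assumptions; the slides show a 90% mark in a figure, but its exact role is not recoverable here.

```python
from bisect import bisect_right

INF = float("inf")  # assumed distance for an access with no earlier/later reuse


def backward_reuse_distances(trace):
    """Per access: distinct lines touched since the previous access to the same line."""
    last_seen = {}
    distances = []
    for i, line in enumerate(trace):
        if line in last_seen:
            between = trace[last_seen[line] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(INF)
        last_seen[line] = i
    return distances


def forward_reuse_distances(trace):
    """Per access: distinct lines touched until the next access to the same line."""
    # The forward distance of an access is the backward distance of its next
    # reuse, so the reversed trace can be fed through the same routine.
    return backward_reuse_distances(trace[::-1])[::-1]


def hint_for_distance(rd, cache_sizes):
    """Map one reuse distance to C1..C4 using the table from the slides."""
    # bisect_right counts the cache sizes that are <= rd:
    # RD < L1 -> C1, L1 <= RD < L2 -> C2, L2 <= RD < L3 -> C3, L3 <= RD -> C4.
    return f"C{bisect_right(cache_sizes, rd) + 1}"


def load_hint(brd, frd, cache_sizes):
    """Source hint from the backward distance, target hint from the forward one."""
    source = hint_for_distance(brd, cache_sizes)
    target = hint_for_distance(frd, cache_sizes)
    return f"LD_{source}_{target}"


def select_hint(distances, cache_sizes, threshold=0.9):
    """One hint for all accesses of an instruction (assumed threshold rule)."""
    if not distances:
        raise ValueError("need at least one profiled access")
    for level, size in enumerate(cache_sizes, start=1):
        covered = sum(1 for d in distances if d < size)
        if covered / len(distances) >= threshold:
            return f"C{level}"
    return f"C{len(cache_sizes) + 1}"


if __name__ == "__main__":
    sizes = [256, 8 * 1024, 64 * 1024]  # L1, L2, L3 in cache lines, as in the slides
    print(load_hint(13, 5000, sizes))   # the BRD = 13, FRD = 5000 example
    print(backward_reuse_distances(list("ABCDDAF")))
```

With the cache sizes from the slides, `load_hint(13, 5000, sizes)` reproduces the LD_C1_C2 example. For an instruction whose accesses are 75% short-distance and 25% long-distance, `select_hint` returns C4 at a 0.9 threshold and C1 at a 0.5 threshold, which shows how strongly the result depends on that assumed parameter.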