Increasing TLB Reach by Exploiting Clustering in Page Translations
Binh Pham§, Abhishek Bhattacharjee§, Yasuko Eckertǂ, Gabriel H. Lohǂ
§Rutgers University   ǂAMD Research
HPCA 2014

Slide 2: Address Translation Overview
[Figure: translation pipeline — address generation produces a VA, which is looked up in the TLB; on a miss, the page table walker traverses the four-level x86 page table in memory (four PTE references) to obtain the PA before the cache access proceeds.]
VA: Virtual Address, PA: Physical Address, PTE: Page Table Entry

Slide 3: Address Translation Performance Impact
• Address translation performance overhead: 10-15%
  – Clark & Emer [Trans. on Comp. Sys. 1985]
  – Talluri & Hill [ASPLOS 1994]
  – Barr, Cox & Rixner [ISCA 2011]
• Emerging software trends
  – Virtualization: up to 89% overhead [Bhargava et al., ASPLOS 2008]
  – Big-memory workloads: up to 50% overhead [Basu et al., ISCA 2013]
• Emerging hardware trends
  – LLC-capacity-to-TLB-capacity ratios are increasing
  – Manycore/hyperthreading increases TLB and LLC PTE stress

Slide 4: TLB Miss Elimination Approaches
• Increasing TLB size? Costly in latency and power
• Increasing TLB reach
  – Using large pages
  – Using "CoLT: Coalesced Large-Reach TLBs" (Pham et al., MICRO 2012)

Slide 5: Contiguous Locality in Page Tables
[Figure: page table mapping virtual pages 0-7 (0000-0111) to physical pages 8, 9, 10, 12, 13, 17, 16, 18 (01000, 01001, 01010, 01100, 01101, 10001, 10000, 10010) — sequential groups, "holes", and out-of-order (OoO) singletons are marked.]
CoLT TLB entries for this page table:
  Base Virt. Page | Base Phys. Page | Len
  0               | 8               | 3
  3               | 12              | 2
  5               | 17              | 1
  6               | 16              | 1
  7               | 18              | 1

Slide 6: Clustered Locality in Page Tables
[Figure: the same page table, with translations grouped by shared address MSBs into clustered groups rather than strictly sequential runs.]
Clustered TLB entries for this page table:
  Virt. Page MSBs | Phys. Page MSBs | Len
  0               | 01              | 5
  0               | 10              | 3
• Clustered locality can deal with "holes" between PTEs
• Clustered locality does NOT care about the PTEs' order

Slide 7: Spatial Locality Characterization
[Figure: the same page table, plus a plot for xalancbmk — fraction of PTEs falling in the same group vs. group length (1 to 256), for contiguous grouping and clustered spatial region sizes Cluster2-Cluster5.]
PTE distribution for the example page table:
  Clust. Group Len | % PTEs
  3                | 37.5%
  5                | 62.5%
• Clustered locality is abundant and surpasses contiguous locality
• Clustered locality increases with clustered spatial region size

Slide 8: Outline
• How do we exploit clustered locality in hardware?
• How much can our design improve performance?
• Conclusion

Slide 9: Clustered TLB: Miss and Fill
[Figure: on a miss, the page table walker fetches a 64 B cacheline holding eight 8 B PTEs; coalescing logic turns it into one clustered TLB entry — a base VPN tag VA(47:15), a shared base PA(51:15), and eight sub-entries each holding a valid bit and the low PPN bits PA(14:12).]

Slide 10: Clustered TLB Look Up
[Figure: lookup example — VPN 3 (0 011) is split into a base VPN (0) and low bits VPN(2:0) = 011; the clustered entry with Base V = 0, Base P = 01 matches, sub-entry 011 supplies the low PPN bits 100, and concatenating Base P with them yields PPN = 12 (01100). Sub-entry slots 5-7 are invalid (holes in the cluster).]
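The clustered-TLB lookup on the slide above can be sketched in software. This is a minimal illustrative model, not the paper's hardware: it assumes 4 KB pages and eight sub-entries per clustered entry (so the low three VPN bits index the sub-entry array), and the name `ClusteredTLBEntry` is invented for the sketch.

```python
CLUSTER_BITS = 3  # 8 sub-entries per clustered entry (one 64 B cacheline of PTEs)


class ClusteredTLBEntry:
    """One clustered TLB entry: a base-VPN tag, shared PPN MSBs, and
    per-slot low PPN bits for each page of the cluster."""

    def __init__(self, base_vpn, base_ppn):
        self.base_vpn = base_vpn  # VPN with its low CLUSTER_BITS dropped (the tag)
        self.base_ppn = base_ppn  # PPN MSBs shared by every page in the cluster
        # One slot per low-VPN value; None marks an invalid sub-entry (a "hole").
        self.sub = [None] * (1 << CLUSTER_BITS)

    def lookup(self, vpn):
        # Tag match on the VPN MSBs, then index a sub-entry with the VPN LSBs.
        if vpn >> CLUSTER_BITS != self.base_vpn:
            return None  # tag miss
        low = self.sub[vpn & ((1 << CLUSTER_BITS) - 1)]
        if low is None:
            return None  # hole inside the cluster: sub-entry miss
        # Concatenate the shared PPN MSBs with this slot's low PPN bits.
        return (self.base_ppn << CLUSTER_BITS) | low


# The deck's example: virtual pages 0-4 map to physical pages 8, 9, 10, 12, 13,
# which all share PPN MSBs 01, so one entry covers them despite the hole at 11.
entry = ClusteredTLBEntry(base_vpn=0, base_ppn=0b01)
for slot, low in [(0, 0b000), (1, 0b001), (2, 0b010), (3, 0b100), (4, 0b101)]:
    entry.sub[slot] = low
```

With this setup, `entry.lookup(3)` reproduces the slide's walkthrough: base VPN 0 matches, sub-entry 011 yields low bits 100, and the concatenation gives PPN 12 (01100).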
Slide 11: Multi-Granular TLB Design
[Figure: organization — an L1-TLB backed by a multi-granular L2 level that pairs a clustered TLB with a conventional C0 TLB. On a fill, the page-walk cacheline of 8 B PTEs passes through the coalescing logic: groups of length >= Θ become clustered-TLB entries (base VPN, base PPN, sub-entries), while shorter groups become C0 entries (VPN, PPN). A lookup probes both structures; either a clustered hit or a C0 hit yields the PPN.]

Slide 12: Methodology
• Workloads: SPEC CPU2006, CloudSuite, Server
• Full-system simulation
  – Baseline: 64-entry L1 ITLB, 64-entry L1 DTLB, 512-entry L2 TLB
  – Roughly equal hardware budget for baseline, CoLT, and MG-TLB
  Configuration | Structures        | Hardware cost
  Baseline      | L1-TLB + L2-TLB   | 4.6 KB
  CoLT          | L1-TLB + CoLT-TLB | 5.0 KB
  MG-TLB        | L1-TLB + MG-TLB   | 4.8 KB

Slide 13: Miss Elimination
[Figure: per-workload TLB misses eliminated by CoLT vs. MG-TLB (y-axis -20% to 100%); CoLT degrades to -123% on its worst workload.]
• The best design gives a 7% performance improvement on average

Slide 14: Insert to Clustered TLB or C0 TLB?
[Figure: misses eliminated for thresholds θ = 1 through θ = 4 (y-axis -20% to 100%); worst-case degradations reach -51% and -218% at the extreme settings.]
• Θ = 2 gives the best performance: a group of length ≥ 2 (e.g., a Cluster3 group) fills one clustered-TLB entry, while singletons fill C0 entries

Slide 15: Prefetching versus Capacity Benefit
[Figure: misses eliminated for Baseline + Prefetch (down to -32%), Lazy en-MG-TLB, and en-MG-TLB (down to -21%).]
• MG-TLB combines prefetching and capacity benefits to get the best performance

Slide 16: Conclusion
• We observe a more general form of locality (clustered locality) in page translations
• The multi-granular TLB eliminates nearly half of all TLB misses
• Our approach requires no OS modification and provides robust performance gains

Slide 17: Thanks for listening! Questions?
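As a software companion to the fill path the deck describes (coalescing a page-walk cacheline, then steering groups by the Θ threshold), here is an illustrative sketch. It is not the paper's RTL: `coalesce_cacheline` and its return shape are invented for the example, and Θ = 2 follows the deck's stated best setting.

```python
CLUSTER_BITS = 3  # 8 PTEs per 64 B page-table cacheline
THETA = 2         # groups with >= THETA translations go to the clustered TLB


def coalesce_cacheline(vpn, ptes):
    """Group the eight PTEs of one page-table cacheline by shared PPN MSBs.

    vpn:  the virtual page number that triggered the walk.
    ptes: eight PPNs (or None for unmapped pages) for VPNs base..base+7.
    Returns (clustered, singletons):
      clustered  - {ppn_msbs: {slot: low_ppn_bits}} for groups of length >= THETA
                   (each becomes one clustered-TLB entry; the base VPN is shared
                   by the whole cacheline, so it is implicit here);
      singletons - {vpn: ppn} for translations left to the conventional C0 TLB.
    """
    base_vpn = vpn >> CLUSTER_BITS
    low_mask = (1 << CLUSTER_BITS) - 1

    # Bucket the valid PTEs by their PPN MSBs; order within a bucket is irrelevant,
    # and holes (None) simply leave their slot invalid.
    groups = {}
    for slot, ppn in enumerate(ptes):
        if ppn is None:
            continue
        groups.setdefault(ppn >> CLUSTER_BITS, {})[slot] = ppn & low_mask

    clustered = {k: v for k, v in groups.items() if len(v) >= THETA}
    singletons = {(base_vpn << CLUSTER_BITS) | slot: (k << CLUSTER_BITS) | low
                  for k, v in groups.items() if len(v) < THETA
                  for slot, low in v.items()}
    return clustered, singletons
```

Running it on the deck's example page table (VPNs 0-7 mapping to PPNs 8, 9, 10, 12, 13, 17, 16, 18) yields the two clustered groups from the clustered-locality slide: one of length 5 under PPN MSBs 01 and one of length 3 under PPN MSBs 10, with no C0 singletons.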