Increasing TLB Reach by Exploiting
Clustering in Page Translations
Binh Pham§, Abhishek Bhattacharjee§, Yasuko Eckertǂ, Gabriel H. Lohǂ
§Rutgers University    ǂAMD Research

HPCA 2014
Address Translation Overview
[Diagram: address generation produces a VA, which the TLB translates into a PA
used for the cache access; on a TLB miss, the page table walker traverses the
x86 four-level page table in memory, making four PTE references before the
translation completes, adding to the total address translation time.]
VA: Virtual Address
PA: Physical Address
PTE: Page Table Entry
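The sketch below is a minimal Python illustration (not the authors' code) of the four-level walk pictured above; the dictionary-based page table, function name, and constants are assumptions for the example.

# Minimal sketch of the x86-style four-level walk the diagram depicts:
# a TLB miss costs four dependent PTE lookups in memory.
PAGE_SHIFT = 12          # 4 KB pages
LEVEL_BITS = 9           # 512 entries per table; 4 levels cover VA bits 47:12

def walk_page_table(root, vpn):
    """Return the physical page number for `vpn`, or None if unmapped.

    `root` is the top-level table; each table maps a 9-bit index either to the
    next-level table (levels 3..1) or to a PPN (level 0)."""
    table = root
    for level in (3, 2, 1, 0):
        index = (vpn >> (level * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
        entry = table.get(index)          # one memory reference per level
        if entry is None:
            return None                   # page fault
        table = entry                     # at level 0 this is the PPN
    return table

# Tiny example: map VPN 0 -> PPN 8 and translate VA 0x123.
root = {0: {0: {0: {0: 8}}}}
vpn, offset = 0x123 >> PAGE_SHIFT, 0x123 & ((1 << PAGE_SHIFT) - 1)
ppn = walk_page_table(root, vpn)
print(hex((ppn << PAGE_SHIFT) | offset))  # -> 0x8123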
Address Translation Performance Impact
• Address translation performance overhead – 10-15%
– Clark & Emer [Trans. On Comp. Sys. 1985]
– Talluri & Hill [ASPLOS 1994]
– Barr, Cox & Rixner [ISCA 2011]
• Emerging software trends
– Virtualization – up to 89% overhead [Bhargava et al., ASPLOS 2008]
– Big Memory workloads – up to 50% overhead [Basu et al., ISCA 2013]
• Emerging hardware trends
– LLC capacity to TLB capacity ratios increasing
– Manycore/hyperthreading increases TLB and LLC PTE stress
TLB Miss Elimination Approaches
• Increasing TLB size?
– Increases access latency
– Increases power
• Increasing TLB reach
– Using large pages
– Using “CoLT: Coalesced Large-Reach TLBs” (Pham et al., MICRO 2012)
Contiguous Locality in Page Tables
Page Table (Virtual Page → Physical Page):
  0 (0000) → 8  (01000)
  1 (0001) → 9  (01001)
  2 (0010) → 10 (01010)
  3 (0011) → 12 (01100)   ← hole (physical page 11 is skipped)
  4 (0100) → 13 (01101)
  5 (0101) → 17 (10001)   ← OoO (out-of-order physical pages)
  6 (0110) → 16 (10000)
  7 (0111) → 18 (10010)

CoLT TLB (Base Virt. Page, Base Phys. Page, Len):
  (0, 8, 3)    sequential group
  (3, 12, 2)   sequential group
  (5, 17, 1)   singleton
  (6, 16, 1)   singleton
  (7, 18, 1)   singleton
Clustered Locality in Page Tables
Page Table (Virtual Page → Physical Page, MSBs shown separated):
  0 (0 000) → 8  (01 000)
  1 (0 001) → 9  (01 001)
  2 (0 010) → 10 (01 010)
  3 (0 011) → 12 (01 100)   ← hole
  4 (0 100) → 13 (01 101)
  5 (0 101) → 17 (10 001)   ← OoO
  6 (0 110) → 16 (10 000)
  7 (0 111) → 18 (10 010)

Clustered Groups (Virt. Page MSBs, Phys. Page MSBs, Len):
  (0, 01, 5)
  (0, 10, 3)
• Clustered locality tolerates “holes” between PTEs
• Clustered locality does NOT depend on the order of the PTEs
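A minimal sketch (illustrative) of how clustered groups like those above can be identified from a page table: PTEs are grouped by the most-significant bits of their virtual and physical page numbers, ignoring holes and ordering. The names and region size are assumptions.

# Count how many PTEs of each aligned virtual region land in the same aligned
# physical region, regardless of order or holes.
from collections import defaultdict

def clustered_groups(page_table, region_bits=3):
    """page_table: dict VPN -> PPN.  Returns {(virt_msbs, phys_msbs): count}
    where the MSBs are the page numbers with the low `region_bits` stripped."""
    groups = defaultdict(int)
    for vpn, ppn in page_table.items():
        groups[(vpn >> region_bits, ppn >> region_bits)] += 1
    return dict(groups)

# The slide's example: virtual pages 0-7 mapping to physical pages 8..18.
pt = {0: 8, 1: 9, 2: 10, 3: 12, 4: 13, 5: 17, 6: 16, 7: 18}
print(clustered_groups(pt))
# -> {(0, 1): 5, (0, 2): 3}   i.e. (virt MSBs 0, phys MSBs 01) holds 5 PTEs,
#                                  (virt MSBs 0, phys MSBs 10) holds 3 PTEs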
Spatial Locality Characterization
Page Table: the same example as before (virtual pages 0-7 mapping to physical
pages 8, 9, 10, 12, 13, 17, 16, 18).

PTEs Distribution:
  Clust. Group Len    % PTEs
  3                   37.5 %
  5                   62.5 %

[Chart (xalancbmk): fraction of PTEs residing in the same group (0.0-1.0)
versus group length (1 to 256), for Contiguous grouping and for Clustered
grouping with spatial region sizes 2-5 (Cluster2-Cluster5).]
• Clustered locality is abundant and surpasses contiguous locality
• Clustered locality increases with clustered spatial region size
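The characterization above can be reproduced offline from a page-table dump; the sketch below (assuming a dict mapping VPN to PPN, with illustrative names) computes the fraction of PTEs that fall into clustered groups of each length, matching the 62.5% / 37.5% split of the example.

# Fraction of all PTEs that live in a clustered group of each observed length,
# for a chosen spatial region size.
from collections import Counter, defaultdict

def group_size_distribution(page_table, region_bits):
    groups = defaultdict(int)
    for vpn, ppn in page_table.items():
        groups[(vpn >> region_bits, ppn >> region_bits)] += 1
    total = len(page_table)
    return {length: length * count / total
            for length, count in Counter(groups.values()).items()}

pt = {0: 8, 1: 9, 2: 10, 3: 12, 4: 13, 5: 17, 6: 16, 7: 18}
print(group_size_distribution(pt, region_bits=3))
# -> {5: 0.625, 3: 0.375}, matching the 62.5% / 37.5% split on the slide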
Outline
• How do we exploit clustered locality in hardware?
• How much can our design improve performance?
• Conclusion
Clustered TLB: Miss and Fill
[Diagram: on a miss, the page table walker fetches a 64 B cacheline holding
eight 8 B PTEs from the page table; coalescing logic scans the cacheline and
fills one clustered TLB entry.  The entry holds a virtual tag (VA(47:20) in
the diagram), a base physical region (PA(51:15)), and sub-entries, each with a
valid bit (V) and the low physical bits PA(14:12).  For the example page table,
virtual pages 0-4 (physical pages 8, 9, 10, 12, 13, all in region 01xxx)
become valid sub-entries 000, 001, 010, 100, 101 of a single entry with base
virtual page 0 and base physical region 01, while pages 5-7 (region 10xxx)
remain invalid in that entry.]
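A minimal sketch (illustrative, not the paper's coalescing logic) of the fill path described above: given the PTEs brought in by the walker, pack the translations that share the missing page's physical region into one clustered entry of sub-entries. REGION_BITS and the function name are assumptions for an 8-page cluster.

REGION_BITS = 3  # 8 translations per clustered entry in this example

def fill_clustered_entry(page_table, miss_vpn):
    """Build a clustered entry for the aligned 8-page virtual region containing
    miss_vpn.  Returns (base_vpn_msbs, base_ppn_msbs, sub_entries), where
    sub_entries[i] is the low PPN bits of page (base + i), or None if invalid."""
    base_vpn_msbs = miss_vpn >> REGION_BITS
    base_ppn_msbs = page_table[miss_vpn] >> REGION_BITS
    sub_entries = []
    for offset in range(1 << REGION_BITS):
        ppn = page_table.get((base_vpn_msbs << REGION_BITS) | offset)
        if ppn is not None and (ppn >> REGION_BITS) == base_ppn_msbs:
            sub_entries.append(ppn & ((1 << REGION_BITS) - 1))  # valid sub-entry
        else:
            sub_entries.append(None)                            # invalid (X)
    return base_vpn_msbs, base_ppn_msbs, sub_entries

pt = {0: 8, 1: 9, 2: 10, 3: 12, 4: 13, 5: 17, 6: 16, 7: 18}
print(fill_clustered_entry(pt, miss_vpn=3))
# -> (0, 1, [0, 1, 2, 4, 5, None, None, None])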
Clustered TLB Look Up
[Diagram: the lookup splits the VPN into base bits and the low offset VPN(2:0).
Example: Base VPN = 0, VPN(2:0) = 011.  The base bits tag-match (=?) a
clustered TLB entry with Base V = 0 and Base P = 01 whose sub-entries are
000, 001, 010, 100, 101, X, X, X.  The offset 011 selects sub-entry 100; it is
valid, so the sub-entry hit is raised and the PPN is formed by concatenating
the base physical bits with the sub-entry's lower bits: 01 ++ 100 = 01100,
i.e. PPN = 12.]
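A matching sketch (illustrative) of the lookup path on a clustered entry like the one filled above: tag-match the upper VPN bits, index a sub-entry with VPN(2:0), and on a valid sub-entry concatenate the base physical bits with the sub-entry bits to form the PPN.

REGION_BITS = 3

def lookup_clustered(entry, vpn):
    """entry = (base_vpn_msbs, base_ppn_msbs, sub_entries); returns PPN or None."""
    base_vpn_msbs, base_ppn_msbs, sub_entries = entry
    if (vpn >> REGION_BITS) != base_vpn_msbs:
        return None                                    # tag mismatch
    low = sub_entries[vpn & ((1 << REGION_BITS) - 1)]  # select sub-entry
    if low is None:
        return None                                    # sub-entry invalid
    return (base_ppn_msbs << REGION_BITS) | low        # concat -> full PPN

entry = (0, 1, [0, 1, 2, 4, 5, None, None, None])
print(lookup_clustered(entry, vpn=3))   # -> 12, i.e. 01100 as on the slide
print(lookup_clustered(entry, vpn=5))   # -> None, page 5 maps outside region 01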
Multi Granular TLB Design
[Diagram: in the multi-granular (MG) TLB, the L2 level holds two structures, a
clustered TLB and a small conventional C0 TLB.  An L1-TLB miss sends the VPN
to both; a hit in either returns the PPN.  On a miss in both, the page table
walker returns a cacheline of 8 B PTEs, the coalescing logic builds a
clustered-TLB entry (Base VPN, Base PPN, sub-entries), and the entry is
inserted into the clustered TLB if its cluster length is >= Θ (Y), or into the
C0 TLB otherwise (N).]
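A minimal sketch (illustrative; only Θ and the structure names come from the slide) of the insertion decision: entries whose cluster length reaches Θ go to the clustered TLB, while shorter ones are stored as plain (VPN, PPN) pairs in the C0 TLB.

THETA = 2  # a later slide reports Θ = 2 gives the best performance

def insert_mg_tlb(entry, clustered_tlb, c0_tlb, theta=THETA):
    base_vpn_msbs, base_ppn_msbs, sub_entries = entry
    length = sum(1 for s in sub_entries if s is not None)  # cluster length
    if length >= theta:
        clustered_tlb.append(entry)         # one entry covers up to 8 pages
    else:
        # A short cluster would waste a wide clustered entry; keep it as a
        # plain (VPN, PPN) pair in the C0 TLB instead.
        offset = next(i for i, s in enumerate(sub_entries) if s is not None)
        vpn = (base_vpn_msbs << 3) | offset
        ppn = (base_ppn_msbs << 3) | sub_entries[offset]
        c0_tlb.append((vpn, ppn))

clustered_tlb, c0_tlb = [], []
insert_mg_tlb((0, 1, [0, 1, 2, 4, 5, None, None, None]), clustered_tlb, c0_tlb)
insert_mg_tlb((0, 2, [None, None, None, None, None, 1, None, None]),
              clustered_tlb, c0_tlb)       # a singleton: VPN 5 -> PPN 17
print(len(clustered_tlb), c0_tlb)          # -> 1 [(5, 17)]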
Methodology
• Workloads: SPEC CPU2006, CloudSuite, and server workloads
• Full System Simulation:
– Baseline: 64-entry L1 ITLB, 64-entry L1 DTLB, 512-entry L2 TLB
– Roughly equal hardware for baseline, CoLT, and MG-TLB
Hardware Cost:
  Baseline:  L1-TLB + L2-TLB      4.6 KB
  CoLT:      L1-TLB + CoLT-TLB    5.0 KB
  MG-TLB:    L1-TLB + MG-TLB      4.8 KB
Miss Elimination
[Chart: percentage of TLB misses eliminated per workload (y-axis -20% to 100%)
for CoLT and MG-TLB; one off-scale bar reaches -123%.]
The best design gives a 7% performance improvement on average
Insert to Clustered TLB or C0 TLB?
[Chart: percentage of TLB misses eliminated (y-axis -20% to 100%) for insertion
thresholds θ = 1, 2, 3, 4; off-scale bars reach -51% and -218%.]
Θ = 2 gives the best performance.
[Diagram: example allocation showing two C0 entries and one Cluster3 entry.]
Prefetching versus Capacity Benefit
[Chart: percentage of TLB misses eliminated (y-axis -20% to 100%) for
Baseline + Prefetch, Lazy en-MG-TLB, and en-MG-TLB; off-scale bars reach -32%
and -21%.]
MG-TLB combines prefetching and capacity benefits to achieve the best performance
Conclusion
• We observe a more general type of locality (clustered locality) in page translations
• Multi-granular TLB
– Eliminates nearly half of TLB misses
• Our approach requires no OS modifications and provides robust performance gains
Thanks for listening!
Questions?