Revisiting Virtual Memory

Arkaprava Basu
Committee:
Remzi H. Arpaci-Dusseau
Mark D. Hill (Advisor)
Mikko H. Lipasti
Michael M. Swift (Advisor)
David A. Wood
“Virtual memory was invented in a time of scarcity. Is it still a good idea?”
--- Charles Thacker, 2010 Turing Award Lecture
Virtual Memory Refresher
Process 1
Virtual Address Space
Core
Physical Memory
Cache
TLB (Translation Lookaside Buffer)
Process 2
Page Table
2
Thesis
Time to Revisit Virtual Memory
Management
• Change in Usage: Memory sizes grew a million-fold
– Workloads access lots of memory with low locality
• Change in Constraint: Energy Dissipation
– TLB is Energy Hungry
3
Memory capacity for $10,000*
[Log-scale chart, 1980–2010: memory capacity per $10,000 grows from megabytes toward terabytes]
Commercial servers with 4TB memory
Big data needs to access terabytes of data at low latency
*Inflation-adjusted 2011 USD, from: jcmit.com
4
TLB is Less Effective
• TLB sizes hardly scaled

Year  Processor     L1-DTLB entries
1999  Pentium III   72
2001  Pentium 4     64
2008  Nehalem       96
2012  Ivy Bridge    100
• Low access locality of server workloads
[Ramcloud’10, Nanostore’11]
Memory size ↑ + TLB size flat + Low locality
⇒ TLB miss latency overhead ↑
5
Energy Dissipation is Key Constraint
TLB is energy hungry: 13%*
* From Sodani's (Intel) MICRO 2011 keynote
• TLB shows up as hotspot
• TLB latency hiding forces energy-hungry L1 cache
6
Three Pieces of Work
Process 1
Virtual Address Space
Core
Physical Memory
L1 Cache
Process 2
TLB
Page Table
7
1 Performance
Process 1
Virtual Address Space
Core
Physical Memory
Direct Segment
(ISCA’13)
TLB
Eliminates 99% of DTLB misses
L1 Cache
Process 2
Execution time overhead due to page table walks on TLB misses
Page Table
8
2 Energy Dissipation
Process 1
Virtual Address Space
Physical Memory
Core
Opportunistic Virtual Caching (ISCA’12)
L1 Cache
Eliminates 20% of on-chip memory dynamic energy
TLB
Process 2
Energy dissipation due to TLB and L1 cache lookup
Page Table
9
3 Partitioning TLB Resources
Process 1
Virtual Address Space
Core
Physical Memory
Merged-Associative TLB
Process 2
Avoids partitioning TLB resources
TLB
L1 Cache
Overheads due to multiple page sizes in TLB
Page Table
10
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
– Latency overhead of TLB misses
– Analysis: How Big Memory Workloads Use Memory
– Design: Direct Segment
– Evaluation
– Summary
2• Opportunistic Virtual Caching
3• Merged-Associative TLB
11
Experimental Setup
• Experiments on Intel Xeon (Sandy Bridge) x86-64
– Page sizes: 4KB (Default), 2MB, 1GB
L1 DTLB: 4KB: 64 entry, 4-way; 2MB: 32 entry, 4-way; 1GB: 4 entry, fully assoc.
L2 DTLB: 512 entry, 4-way
• 96GB installed physical memory
• Methodology: Use hardware performance counter
12
Big Memory Workloads
[Bar chart: percentage of execution cycles spent servicing DTLB misses for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS under 4KB, 2MB, and 1GB pages, with a Direct Segment series; y-axis capped at 35, off-chart bars labeled 51.1 and 83.1]
13
Execution Time Overhead: TLB Misses
[Bar chart: percentage of execution cycles wasted on DTLB misses, same workloads and page sizes; off-chart bars labeled 51.1 and 83.1]
14
Execution Time Overhead: TLB Misses
Significant overhead of paged virtual memory
Worse with TBs of memory now or in future?
[Bar chart repeated; off-chart bars labeled 51.1, 83.1, and 51.3]
17
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
– Latency overhead of TLB misses
– Analysis: How Big Memory Workloads Use Memory
– Design: Direct Segment
– Evaluation
– Summary
2• Opportunistic Virtual Caching
3• Merged-Associative TLB
18
How is Paged Virtual Memory used?
An example: memcached servers
In-memory
Hash table
Network state
Client
memcached
server # n
Key X
Value Y
19
Big Memory Workloads’ Use of Paging
Paged VM Feature          Our Analysis                                   Implication
Swapping                  ~0 swapping                                    Not essential
Per-page protection       ~99% pages read-write                          Overkill
Fragmentation reduction   Little OS-visible fragmentation (next slide)   Per-page (re)allocation less important
20
Memory Allocation Over Time
[Line chart: allocated memory (GB) over time (seconds) for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS, including a warm-up phase]
Most of the memory allocated early
21
Where is Paged Virtual Memory Needed?
Paging Valuable
Paging Not Needed
*
VA
Dynamically allocated
Heap region
Code Constants
Shared Memory
Mapped Files Stack
Guard Pages
Paged VM not needed for MOST memory
* Not to scale
22
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
– Latency overhead of TLB misses
– Analysis: How Big Memory Workloads Use Memory
– Design: Direct Segment
– Evaluation
– Summary
2• Opportunistic Virtual Caching
3• Merged-Associative TLB
23
Idea: Two Types of Address Translation
A
Conventional paging
• All features of paging
• All cost of address translation
B
Simple address translation
• NO paging features
• NO TLB miss
• OS/Application decides where to use which
[=> Paging features where needed]
24
Hardware: Direct Segment
1 Conventional Paging
BASE
2 Direct Segment
LIMIT
VA
OFFSET
PA
Why Direct Segment?
• Matches big memory workload needs
• NO TLB lookups => NO TLB Misses
25
H/W: Translation with Direct Segment
[V47V46……………………V13V12]
[V11……V0]
LIMIT<?
BASE ≥?
DTLB
Lookup
Paging Ignored
HIT/MISS
Y
OFFSET
MISS
Page-Table
Walker
* NOT to scale
[P40P39………….P13P12]
[P11……P0]
26
H/W: Translation with Direct Segment
[V47V46……………………V13V12]
BASE ≥?
[V11……V0]
LIMIT<?
Direct Segment
Ignored
N
DTLB
Lookup
HIT
OFFSET
HIT/MISS
MISS
Page-Table
Walker
[P40P39………….P13P12]
[P11……P0]
27
S/W: 1 Setup Direct Segment Registers
• Calculate register values for processes
– BASE = Start VA of Direct Segment
– LIMIT = End VA of Direct Segment
– OFFSET = BASE – Start PA of Direct Segment
• Save and restore register values
BASE
LIMIT
VA2
VA1
OFFSET
PA
28
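The register setup and the translation check on the preceding slides can be sketched as a small model. This is illustrative only (not the hardware); the register convention follows the slide, where OFFSET = BASE − start PA, so a direct-segment hit translates as PA = VA − OFFSET, and the example addresses are made up.

```python
# Software model of direct-segment setup and translation (illustrative).
# Register semantics per the slide: OFFSET = BASE - start PA of the segment.

def setup_registers(seg_start_va, seg_end_va, seg_start_pa):
    base = seg_start_va
    limit = seg_end_va
    offset = base - seg_start_pa       # slide's definition of OFFSET
    return base, limit, offset

def translate(va, base, limit, offset):
    if base <= va < limit:             # direct-segment hit: no TLB lookup
        return va - offset
    return None                        # outside the segment: DTLB / page walk

base, limit, offset = setup_registers(0x10000000, 0x50000000, 0x2000000)
assert translate(0x10000000, base, limit, offset) == 0x2000000
assert translate(0x10001234, base, limit, offset) == 0x2001234
assert translate(0x50000000, base, limit, offset) is None  # >= LIMIT: paged
```

Because the check is two comparisons and an add, a hit needs no TLB state at all, which is why the design eliminates misses for segment-backed memory.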
S/W: 2 Provision Physical Memory
• Create contiguous physical memory
– Reserve at startup
• Big memory workloads cognizant of memory needs
• e.g., memcached’s object cache size
– Memory compaction
• Latency insignificant for long running jobs
– 10GB of contiguous memory in < 3 sec
– 1% speedup => 25 mins break even for 50GB compaction
29
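The break-even claim above follows from simple arithmetic, assuming compaction throughput scales linearly from the measured bound of 10GB in under 3 seconds:

```python
# Back-of-envelope check of the compaction break-even claim, assuming
# throughput scales linearly from the measured 10 GB in under 3 s.
compaction_s = 3.0 * (50 / 10)      # one-time cost to compact 50 GB: ~15 s
speedup = 0.01                      # 1% execution-time improvement
break_even_s = compaction_s / speedup
assert round(break_even_s / 60) == 25   # ~25 minutes, matching the slide
```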
S/W: 3 Abstraction for Direct Segment
• Primary Region
– Contiguous VIRTUAL address range not needing paging
– Hopefully backed by Direct Segment
– But all/part can use base/large/huge pages
VA
PA
• What is allocated in the primary region?
– All anonymous read-write memory allocations
– Or only on explicit request (e.g., mmap flag)
30
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
– Latency overhead of TLB misses
– Analysis: How Big Memory Workloads Use Memory
– Design: Direct Segment
– Evaluation
– Summary
2• Opportunistic Virtual Caching
3• Merged-Associative TLB
31
Methodology
• Primary region implemented in Linux 2.6.32
• Estimate performance of non-existent direct-segment
– Get fraction of TLB misses to direct-segment memory
– Estimate performance gain with linear model
• Prototype simplifications (design more general)
– One process uses direct segment
– Reserve physical memory at start up
– Allocate r/w anonymous memory to primary region
32
Execution Time Overhead: TLB Misses
Lower is better
[Bar chart: percentage of execution cycles wasted for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS with 4KB, 2MB, and 1GB pages and Direct Segment; off-chart bars labeled 51.1, 83.1, and 51.3]
33
Execution Time Overhead: TLB Misses
Lower is better
[Same chart with Direct Segment results labeled: wasted cycles fall to between ~0 and 0.49 across the workloads]
34
Execution Time Overhead: TLB Misses
Lower is better
[Same chart, annotated with the fraction of DTLB misses that fall in the Direct Segment: 92.4% for one workload and 99.9% for the others]
35
Summary: 1 Performance
• Big memory workloads
– Incur high TLB miss costs
– Paging not needed for almost all memory
• Our proposal: Direct Segment
– Paged virtual memory where needed
– Segmentation (NO TLB miss) where possible
36
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
Short Path
(10 slides)
Long Path
(22 slides)
3• Merged-Associative TLB
37
TLB is Energy Hungry
13%
* From Sodani’s /Intel’s MICRO 2011 Keynote
38
TLB Latency Hiding Constrains L1 Associativity
• Virtually Indexed Physically Tagged L1 Caches
– L1 Assoc. >= Cache Size/Page Size; e.g., 32KB/4KB => 8-way
VA46 ………………… VA0
L1 Cache (32KB, 8 way)
1 (a).
TLB
2.
1 (b).
Way0
Way4
Way7
Page offset
Tag matching logic
39
Why not Virtual Caching?
Read-Write Synonyms
ISA Compatibility
(e.g., x86 page table walker)
Energy dissipation was less
important?
VA46 ………………… VA0
L1 Cache (32KB, 4/8 way)
TLB
1.
Way0
Way4
Way7
2.
Tag matching logic
40
Observations
• Synonym usage rare: 0–9% of pages (mostly read-only)
• Page permission changes infrequent
• OS already knows where synonyms possible
41
Idea
• Opportunistic Virtual Caching
– Energy-efficient virtual caching as dynamic
optimization
– Default to Physical Caching when needed
• Mechanism
– Hardware exposes both Virtual and Physical caching
• Decided by a high-order virtual address bit
– OS lays out memory allocation accordingly
42
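The mechanism on this slide can be sketched as a toy lookup model (not the real microarchitecture): a high-order VA bit, set by the OS when it lays out the allocation, decides whether the L1 is accessed with the virtual or the physical address. The bit position (47) and the function names are illustrative.

```python
# Toy model of the OVC decision: bit 47 of the VA selects virtual caching
# (skip the TLB for the L1 lookup) or default physical caching.

VA47 = 1 << 47

def l1_lookup(va, translate):
    if va & VA47:                  # OS placed region in virtual-caching space
        tag = va                   # L1 is looked up with the virtual address
        accessed_tlb = False       # no TLB energy spent on this access
    else:
        tag = translate(va)        # physical caching: translate first
        accessed_tlb = True
    return tag, accessed_tlb

identity = lambda va: va           # stand-in for a real page translation
assert l1_lookup(VA47 | 0x2000, identity) == (VA47 | 0x2000, False)
assert l1_lookup(0x2000, identity) == (0x2000, True)
```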
Physical Caching in Opportunistic Virtual Caching
Virtual Addr.
0 = VA47 VA46 ……………… VA0
NO
L1 Cache (32 KB, 8Way)
ovc_enable
(Default Off)
TLB
Way0
Way4
Way7
Tag matching logic
43
Virtual Caching in Opportunistic Virtual Caching
Virtual Addr.
1 = VA47 VA46 ……………… VA0
Permission bits
ASID (V/P) TAG
Physical
Tag
DATA
Yes
ovc_enable
L1 Cache (32 KB, 4-way/8-way)
TLB
Way0
Way4
Way7
Tag matching logic
44
Dynamic Energy Savings?
[Bar chart: percentage of the on-chip memory subsystem's dynamic energy for Physical Caching (VIPT) vs. Opportunistic Virtual Caching, broken into TLB, L1, and L2+L3 energy]
On average ~20% of the on-chip memory
subsystem’s dynamic energy is reduced
45
Overheads
• No significant performance overhead
– Average performance degradation 0.017%
• State overhead
– 27.5KB extra state for ~9.25 MB cache hierarchy (< 0.3%)
– < 1% static power overhead
No significant state or static power overheads but
significant dynamic power savings
46
Summary: 2 Energy Dissipation
• TLB energy dissipation non-negligible
• TLB latency hiding makes L1 energy worse
• Our Proposal: Opportunistic Virtual Caching
– Virtual Caching for energy savings
– Physical caching for compatibility
47
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
3• Merged-Associative TLB
– Motivation
– Background: TLB mechanisms
– Merged-Associative TLB
– Evaluation
48
Motivation
• Processors support multiple page sizes
– Slow, energy- and area-hungry fully associative TLB
– Static partitioning of resources with set-associative TLB
• Goal: Support multiple page sizes in a set-associative TLB w/o static partitioning
49
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
3• Merged-Associative TLB
– Motivation
– Background: TLB mechanisms
– Merged-Associative TLB
– Evaluation
50
Two TLB Mechanisms
1• Fully Associative TLB
2• Set-associative TLB (Split TLB design)
51
1
Fully Associative (FA) TLB
AE01ED11F
Content Addressable Memory
4KB pages
FFEA001A
Page Frame Number
Virtual Page Number
AE01ED11F E00
Random Access Memory
52
1
Fully Associative (FA) TLB
Virtual Page Number Page Offset
AE01ED11F
Content Addressable Memory
4KB pages
FFEA001A
Page Frame Number
Virtual Page Number
AE01ED11F E00
Random Access Memory
53
1
Fully Associative (FA) TLB
Virtual Page Number Page Offset
2MB pages
AE01ED11F
FFEA001A
BB01ED0XX
XX
FFEA0000
Content Addressable Memory
Page Frame Number
Virtual Page Number
BB01ED04E 0D0
Random Access Memory
54
FA TLB is slow, area- and energy-hungry
FA = Fully Associative; 4-SA = 4-way Set Associative

                            64 Entries       128 Entries      256 Entries
                            FA      4-SA     FA      4-SA     FA      4-SA
Access time (ns)            0.39    0.14     0.47    0.15     0.67    0.17
Dynamic access energy (nJ)  0.008   0.003    0.016   0.004    0.031   0.006
Static power (mW)           3.87    1.72     7.57    2.69     14.37   4.81

A fully associative TLB is ~3X–4X slower than a set-associative TLB.
55
FA TLB is slow, area- and energy-hungry
Each access to a fully associative TLB spends 2.5X–6X more dynamic energy.
56
FA TLB is slow, area- and energy-hungry
A fully associative TLB costs 2X–3X more static power.
57
2
Set Associative (SA) TLB
Virtual Page Number Page Offset
AE01ED11F E00
4KB pages
AE01ED11C
8-entry 4-way set associative
*Physical Frame number not shown
58
2
Set Associative (SA) TLB
2 bits of index
AE01ED11F E00
4KB pages
AE01ED11C
8-entry 4-way set associative
59
2
Set Associative (SA) TLB
Virtual Page Number Page Offset
BB01ED04E 0D0
What bits to use for
indexing?
2MB pages
BB01ED0
AE01ED11C
8-entry 4-way set associative
60
2
Set Associative (SA) TLB
BB01ED04E 0D0
Index bit can not be
part of page offset
2MB pages
BB01ED0
AE01ED11C
8-entry 4-way set associative
Challenge: Page size is unknown before translation
61
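The indexing challenge above can be made concrete: the set-index bits sit just above the page offset, so the offset width (12 bits for 4KB, 21 bits for 2MB) changes which bits index the TLB. A tiny illustrative model, using the slide's 4KB example address and a hypothetical 4-set TLB:

```python
# Why a set-associative TLB cannot be indexed before the page size is known:
# the index bits are taken just above the page offset, whose width depends
# on the page size. Illustrative 4-set TLB.

SETS = 4

def tlb_index(va, page_offset_bits):
    return (va >> page_offset_bits) % SETS

va = 0xAE01ED11FE00
assert tlb_index(va, 12) == 3   # indexed as a 4KB page (12-bit offset)
assert tlb_index(va, 21) == 0   # indexed as a 2MB page: a different set
```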
2
Split TLB Design
Solution:
• Separate sub-TLBs for each page size
• All TLBs looked up in parallel
TLB for 2MB pages
TLB for 4KB pages
62
2
Split TLB Design
AE01ED11F E00
AE01ED11C
TLB for 2MB pages
TLB for 4KB pages
63
2
Split TLB Design
BB01ED04E 0D0
BB01ED0
TLB for 2MB pages
TLB for 4KB pages
64
2
Split TLB Design
In practice, larger page sizes get fewer entries.
For example, Intel's Sandy Bridge has 64 entries for 4KB pages,
32 entries for 2MB pages, and 4 entries for 1GB pages.
TLB for 2MB pages
TLB for 4KB pages
65
Drawbacks of Split-TLB design
• Anomalous TLB behavior
– Larger page sizes can lead to more TLB misses

NPB:CG TLB misses per 1K memory references:
4KB: 279.5   2MB: 42.1   1GB: 130.7
“differences in TLB structure make predicting how many huge
pages can be used and still be of benefit problematic” --Linux
Weekly News
• TLB resource underutilization
66
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
3• Merged-Associative TLB
– Motivation
– Background: TLB mechanisms
– Merged-Associative TLB
– Evaluation
67
Goal of Merged-Associative TLB
• A single set-associative TLB for all page sizes
– NO static partitioning of resources
• NO anomalous TLB behavior
• TLB resource aggregation
– NO fully associative TLB
• Faster, area- efficient, energy-efficient
– Backward compatible with split-TLB design
68
Idea
• OS partitions Virtual address space
– Each partition holds mappings for single page size
– Virtual address hints the page size
• Hardware logically merges sub-TLBs
– Virtual address to interpret page size
– NO static partitioning of TLB resources
69
S/W: Address Space Partitioning
4KB
4KB
2MB
2MB
1GB
4KB
Virtual Address Space
70
Split TLB => Merged-Associative TLB
AE01ED11F E00
AE01ED11C
TLB for 2MB pages
TLB for 4KB pages
71
Split TLB => Merged-Associative TLB
Merged-Associative TLB
72
Split TLB => Merged-Associative TLB
Merged-Associative TLB
73
Split TLB => Merged-Associative TLB
AE01ED11F E00
4K
IDX2MB
IDX4KB
Merged-Associative TLB
74
Split TLB => Merged-Associative TLB
AE01ED11F E00
4K
IDX2MB
IDX4KB
IDX4KB
*TAG match not shown
Merged-Associative TLB
75
Split TLB => Merged-Associative TLB
BB01EDF4E E00
2M
IDX2MB
IDX4KB
Merged-Associative TLB
76
Split TLB => Merged-Associative TLB
BB01EDF4E E00
2M
IDX2MB
IDX4KB
IDX2MB
*TAG match not shown
Merged-Associative TLB
77
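The lookup shown on the preceding slides can be sketched in software. This is an illustrative model with a hypothetical partition layout: the OS dedicates virtual-address ranges to one page size each, so the high VA bits reveal the page size, which selects the index function over a single set-associative array.

```python
# Sketch of a merged-associative lookup: the VA's partition determines the
# page size, which determines the index bits. Layout below is hypothetical.

PAGE_BITS = {"4K": 12, "2M": 21, "1G": 30}
SETS = 4

def page_size_hint(va, partitions):
    for (lo, hi), size in partitions.items():
        if lo <= va < hi:
            return size
    return None                       # unknown size: revert to split-TLB path

def merged_index(va, partitions):
    size = page_size_hint(va, partitions)
    return (va >> PAGE_BITS[size]) % SETS

layout = {
    (0x000000000000, 0xC00000000000): "4K",   # hypothetical 4KB partition
    (0xC00000000000, 0x1000000000000): "2M",  # hypothetical 2MB partition
}
assert page_size_hint(0xAE01ED11FE00, layout) == "4K"
assert merged_index(0xAE01ED11FE00, layout) == 3
assert page_size_hint(0x1000000000000, layout) is None  # split-TLB fallback
```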
Backward Compatibility
• “Unknown Page size” for a region is allowed
– Reverts back to split-TLB design
• Runs unmodified OS
• Works under dynamic page size promotion
• No benefits over split-TLB design
78
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
3• Merged-Associative TLB
– Motivation
– Background: TLB mechanisms
– Merged-Associative TLB
– Evaluation
79
Methodology
• TLB simulator written in PIN
– Collects TLB miss rates
• Workloads
– Graph analytics, memcached, MySQL, NAS
80
TLB Configurations
• Split-TLB design (Intel’s Sandy Bridge)
– 4KB pages: 64 entry, 4-way set associative
– 2MB pages: 32 entry, 4-way set associative
– 1GB pages: 4-entry, fully associative
• Fully associative design
– 64 entry and 96 entry TLB
• Merged-Associative TLB
– 96 entry (64 + 32)
• All configurations: 512 entry 4-way set-associative L2-DTLB
81
Evaluation
• Avoid TLB miss behavior anomaly ?
• Reduce TLB miss rates?
82
Anomalous TLB behavior
NPB:CG TLB misses per 1K memory references:

             4KB     2MB    1GB
Split TLB    279.5   42.1   130.7
Merged TLB   282.6   0      0
Merged-associative TLB avoids anomalous behavior with large
pages
83
TLB Misses Per 1K Accesses: 4KB pages

            Split-TLB   FA-TLB (64)   FA-TLB (96)   Merged-TLB
graph500    207.9       207.9         207.7         207.9
memcached   4.4         4.36          4.42          4.38
MySQL       4.31        3.88          4.05          3.63
NPB:CG      282.1       281.37        284.29        282.1
NPB:BT      5.77        6.63          5.67          5.50
Merged-associative TLB does not improve miss rates
84
TLB Misses Per 1K Accesses: 2MB pages

            Split-TLB   FA-TLB (64)   FA-TLB (96)   Merged-TLB
graph500    60.13       51.94         39.92         33.52
memcached   3.97        4.05          4.04          4.08
MySQL       2.89        3.73          3.48          3.95
NPB:CG      0.68        0.0017        0.0018        0.035
NPB:BT      2.94        3.13          3.12          3.19

Merged-associative TLB improves miss rates in one case
85
TLB Misses Per 1K Accesses: 1GB pages

            Split-TLB   FA-TLB (64)   FA-TLB (96)   Merged-TLB
graph500    6.04        0             0             0
memcached   3.57        3.02          2.72          2.76
MySQL       4.21        2.79          2.92          3.23
NPB:CG      109.791     0             0             0
NPB:BT      1.28991     0             0             0
Merged-associative TLB reduces miss rates in two cases
86
Summary: 3 Partitioning TLB Resources
• Problem: static partitioning of TLB resources
• Our Proposal: Merged-Associative TLB
– Partition virtual address space for page sizes
– Logically aggregate partitioned TLB resources
87
Other Works
• Caches and Cache coherence
– FreshCache: Statically and Dynamically Exploiting Dataless
Ways (ICCD’2013)
– CMP Directory Coherence: One Granularity Does Not Fit All
(UW-CS-TR1798, 2013)
– Scavenger: A New Last Level Cache Architecture with
Global Block Priority (MICRO’07)
• Parallel program debugging
– Karma: Scalable Deterministic Record-Replay (ICS’11)
88
Summary
1• Performance overhead of TLB misses
– Direct Segments
2• Energy overhead of address translation
– Opportunistic Virtual Caching
3• Partitioning TLB resources
– Merged-Associative TLB
89
90
Roadmap (OVC long path)
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
– Why is TLB Energy Hungry?
– Physical Caching vs. Virtual Caching
– Opportunities for Virtual Caching
– Mechanisms for OVC
– Evaluation
– Summary
3• Merged-Associative TLB
91
TLB is Energy Hungry
13%
* From Sodani’s /Intel’s MICRO 2011 Keynote
92
Why is TLB Energy Hungry?
• TLB looked up on every cache access
– ALL blocks cached with physical address
– Each access needs address translation
• TLB lookup latency in critical path
– Fast and thus energy hungry transistors
– Content Addressable Memory
93
TLB Latency Hiding Constrains L1 Associativity
• Virtually Indexed Physically Tagged L1 Caches
– L1 Assoc. >= Cache Size/Page Size; e.g., 32KB/4KB => 8-way
TLB
Page offset
VA46 ………………… VA0
L1 Cache (32KB, 8 way)
Way0
Way4
Way7
Tag matching logic
94
TLB Latency Hiding Makes L1 Energy Worse
Relative Performance         4-way   8-way   16-way
Parsec                       1       1.002   1.002
Commercial                   1       1.002   1.004

Relative L1 Access Energy    4-way   8-way   16-way
Read energy                  1       1.309   1.858
Write energy                 1       1.111   1.296
Substantial energy impact dominates negligible performance
benefit of increased associativity
*Methods: CACTI, full system simulation
95
Why Not Virtual Caching?
• Cache ALL blocks under virtual address
– Saves TLB lookup on L1 cache hits
– L1 Cache associativity not constrained
– Read-Write Synonyms
• e.g, V1 -> P0 <- V2
– Incompatibility with commercial ISAs
• e.g., x86 ‘s hardware page table walker
• Energy dissipation was less important?
Best of Virtual and Physical Caching?
96
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
– Why is TLB Energy Hungry?
– Physical Caching vs. Virtual Caching
– Opportunities for Virtual Caching
– Mechanisms for OVC
– Evaluation
– Summary
3• Merged-Associative TLB
97
How Frequent are Synonyms?
[Table: static synonym pages and dynamic accesses to synonyms for Parsec applications (canneal, fluidanimate, facesim, streamcluster, swaptions, x264) and commercial applications (bind, firefox, memcached, specjbb); static synonym pages range from 0% to ~9% per application, dynamic accesses to synonyms from 0% to 26%, and synonym pages are mostly read-only]
Synonyms occur, but conflicting use rare;
confined to small region.
98
Identify Synonyms at Allocation?
Protection flags
• Process address space
divided into regions
• Synonym possibility
indicated by protection flags
System V Shared Memory
Stack: r-w-
Sys V: r-w-s
Heap: r-w-
Constants: r---
Code: r-x-
(r → read, w → write, s → shared, x → execute)
Possible to separate memory regions with and without read-write synonyms
99
Idea
• Opportunistic Virtual Caching
– Use energy-efficient virtual caching opportunistically
– Default to Physical Caching when needed
100
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
– Why is TLB Energy Hungry?
– Physical Caching vs. Virtual Caching
– Opportunities for Virtual Caching
– Mechanisms for OVC
– Evaluation
– Summary
3• Merged-Associative TLB
101
Role of the H/W and the OS
• Hardware allows Virtual or Physical Caching
• OS decides when to use Virtual Caching
• OS responsible for correctness
102
Physical Caching in Opportunistic Virtual Caching
Virtual Addr.
0 = VA47 VA46 ……………… VA0
NO
L1 Cache (32 KB, 8Way)
ovc_enable
(Default Off)
TLB
Way0
Way4
Way7
Tag matching logic
103
Virtual Caching in Opportunistic Virtual Caching
Virtual Addr.
1 = VA47 VA46 ……………… VA0
Permission bits
ASID (V/P) TAG
Physical
Tag
DATA
Yes
ovc_enable
L1 Cache (32 KB, 4-way/8-way)
TLB
Way0
Way4
Way7
Tag matching logic
104
Operating System Mechanisms
• Memory allocations from two partitions
– Separate partitions for virtual and physical caching
• e.g., VA47= 0 =>Physical Caching and VA47= 1 => Virtual Caching
– Protection flags determine which partition to use
• Operating System responsible for correctness
– Wrong classification possible, but rare
• e.g., User makes a region shared after allocation
– Cache flush on possible violation
105
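The OS-side placement policy above can be sketched as a small decision function. This is a hypothetical policy sketch, not kernel code: the flag names and the VA47 partition bit are assumptions matching the slide's example.

```python
# Sketch of the OS placement decision for OVC: regions whose protection
# flags admit read-write sharing (possible synonyms) default to the
# physical-caching partition; everything else is virtually cached.

VA47 = 1 << 47   # hypothetical partition bit, per the slide's example

def choose_partition(readable, writable, shared):
    if writable and shared:        # read-write synonym possible
        return "physical"          # always-correct default path
    return "virtual"               # safe to cache by virtual address

def place(va, readable, writable, shared):
    if choose_partition(readable, writable, shared) == "virtual":
        return va | VA47
    return va & ~VA47

assert choose_partition(True, True, True) == "physical"   # Sys V shm, r-w-s
assert choose_partition(True, True, False) == "virtual"   # private heap
assert choose_partition(True, False, True) == "virtual"   # read-only shared
assert place(0x1000, True, True, False) == VA47 | 0x1000
```

Misclassification is handled as the slide says: rare, and repaired with a cache flush rather than prevented up front.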
Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
– Why is TLB Energy Hungry?
– Physical Caching vs. Virtual Caching
– Opportunities for Virtual Caching
– Mechanisms for OVC
– Evaluation
– Summary
3• Merged-Associative TLB
106
Methodology
• Modification to Linux kernel (2.6.28-4)
• Hardware changes simulated in the gem5 full-system simulator
• Energy numbers from CACTI
107
Configuration
Cores: 4 cores, in-order, x86-64 ISA
TLBs: L1 DTLB/ITLB 64 entries, fully associative; L2 TLB 512 entries, 4-way set associative
Private caches: 32 KB 8-way I/D L1 cache; 256 KB 8-way L2 per core
Shared cache: 8 MB, 16-way L3 cache
Memory: 4 GB, 300-cycle round trip
108
Dynamic Energy Savings?
[Bar chart: percentage of the on-chip memory subsystem's dynamic energy for Physical Caching (VIPT) vs. Opportunistic Virtual Caching, broken into TLB, L1, and L2+L3 energy]
On average ~20% of the on-chip memory
subsystem’s dynamic energy is reduced
109
Overheads
• No significant performance overhead
– Average performance degradation 0.017%
• State overhead
– 27.5KB extra state for ~9.25 MB cache hierarchy (< 0.3%)
– < 1% static power overhead
No significant state or static power overheads but
significant dynamic power savings
110
Summary: 2 Energy Dissipation
• TLB energy dissipation non-negligible
• TLB latency hiding makes L1 energy worse
• Our Proposal: Opportunistic Virtual Caching
– Virtual Caching for energy savings
– Physical caching for compatibility
111
BACKUP
112
Why Not Large Pages?
• Fundamentally Not Scalable
– Newer page sizes, larger TLB with memory growth
– Continual changes to PT-Walker, OS, application
• TLBs Need Locality
– Increasing reach does not necessarily reduce misses
• Large Pages Need to be Aligned
– Significant opportunity can be lost [COLT, MICRO'12]
• Fixed Sparse ISA-defined sizes
– Dictated by Page Table structure
– In x86-64, 4KB, 2MB, 1GB
113
Address Translation in Different
ISA/machines
ISA/Machine       Address Translation
Multics           Segmentation on top of Paging
Burroughs B5000   Segmentation
UltraSPARC        Paging
x86 (32-bit)      Segmentation on top of Paging
ARM               Paging
PowerPC           Segmentation on top of Paging
Alpha             Paging
x86-64            Paging only (mostly)
Direct Segment:
(1) NOT on top of paging.
(2) NOT to replace paging.
(3) NO two-dimensional address space; keeps linear address space.
114
Direct Segment Methodology
• Convert TLB misses to page faults
– TLB entries made incoherent with memory-resident copies
– Set reserved bit in memory-resident copy
• On page fault, check whether it falls in the DS
• Deduct the cycles due to TLB misses proportionally
115
Direct Segment (DS) in Cloud?
• Currently DS suitable for enterprise workloads
– Less suitable when many short jobs come and go
• Memory usage needs to be predictable to
enable performance guarantees
– Same memory usage predictions can be used to
create DS
116
How to handle faulty pages?
• OS can map a page-frame with permanent
fault to a non-faulty new page-frame
(Possible) Solutions:
• Revert part or all of direct segment memory
• Memory controller (MC) remaps faulty pages
– Only small number of faulty pages
– List of faulty re-mapped pages in MC
– OS need NOT know about faults
– Finer grain remapping (instead of page-grain)
117
Direct Segment + OVC possible?
VA47
VA47 = 1
&&
OVC_ENB =1?
OVC L1 Cache
low-associativity
lookup with VA
YES
L1 Cache
Hit/ Miss?
MISS
NO
Lookup DTLB
hierarchy with
VPN
VPN ≥ BASE
&&
VPN < LIMIT
?
YES
OFFSET
+
VPN
potential
page walk
HIT
Cancel
OVC_ENB
DTLB
Hit/ Miss?
MISS
Walk the
page table
Concatenate PFN
with Page offset
Complete
Cache lookup
118
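The flowchart above can be linearized as a sketch (illustrative model, not the hardware): an OVC hit needs neither the TLB nor the direct-segment check; otherwise the direct-segment range test short-circuits the DTLB and page walk. The model compares VPNs against BASE/LIMIT and uses the register convention from the earlier S/W slide (OFFSET = BASE − start PA, so PFN = VPN − OFFSET); both are simplifications.

```python
# Combined Direct Segment + OVC lookup, linearized from the flowchart.

def combined_lookup(va, ovc_enb, l1_virtual_hit, base, limit, offset, tlb):
    # 1. OVC path: partition bit set and line present under the virtual tag
    if ovc_enb and (va >> 47) & 1 and l1_virtual_hit(va):
        return ("l1-hit", None)            # no translation needed at all
    vpn = va >> 12
    # 2. Direct-segment range check: PFN = VPN - OFFSET on a hit
    if base <= vpn < limit:
        return ("pa", vpn - offset)
    # 3. Conventional DTLB, then page walk on a miss
    if vpn in tlb:
        return ("pa", tlb[vpn])
    return ("page-walk", None)

base, limit = 0x100, 0x200
offset = base - 0x50                       # segment starts at PFN 0x50
result = combined_lookup(0x150 << 12, False, lambda v: False,
                         base, limit, offset, {})
assert result == ("pa", 0xA0)              # 0x150 - 0xB0
```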
OVC
119
How Coherence is Maintained?
CASE 1. Coherence reply due to own request
0x10ab16e10
L1 Cache
(VA)
0x10ab16e10
(VA)
Miss
TLB
0x00fc10d10
0x10ab16e10
To lower level caches
(PA)
MSHR
0x00fc10d10
(PA)
120
How Coherence is maintained?
CASE 2. Coherence request due to other controller
L1 Cache
Physical Tag array
0x00fc10d10
(PA)
121
Alternative techniques for address
differentiation?
• Range register(s) for Virtual Caching
0xfff10000100
(VA)
0xfff10000000
Lower bound
≥
≤
0xfff10f00000
Upper bound
1= Use Virtual Caching
0= Use Physical Caching
122
What about static power?
• Around 42–45% of the total on-chip memory subsystem power
⇒ OVC can save around 12% of the total on-chip memory subsystem's power
123
What is the breakup of TLB lookup savings?

Application     L1 Data TLB   L1 Instr. TLB
canneal         72.253        99.986
facesim         96.787        99.999
fluidanimate    99.363        99.999
streamcluster   95.083        99.994
swaptions       99.028        99.989
x264            95.287        99.304
specjbb         91.887        99.192
memcached       94.580        98.605
bind            97.090        98.310
Mean            93.484        99.486
124