Revisiting Virtual Memory
Arkaprava Basu
Committee: Remzi H. Arpaci-Dusseau, Mark D. Hill (Advisor), Mikko H. Lipasti, Michael M. Swift (Advisor), David A. Wood

“Virtual Memory was invented in a time of scarcity. Is it still a good idea?”
--- Charles Thacker, 2010 Turing Award Lecture

Virtual Memory Refresher
[Figure: the virtual address spaces of Process 1 and Process 2, the page table, physical memory, and a core with its cache and TLB (Translation Lookaside Buffer).]

Thesis
Time to Revisit Virtual Memory Management
• Change in usage: memory size has grown a million times
  – Access lots of memory, low locality
• Change in constraint: energy dissipation
  – TLB is energy hungry

Memory capacity for $10,000*
[Figure: memory size purchasable for $10,000, 1980 to 2010, log-scale y-axis running from MB through GB to TB. Annotations: commercial servers with 4TB memory; big data needs to access terabytes of data at low latency.]
*Inflation-adjusted 2011 USD, from: jcmit.com

TLB is Less Effective
• TLB sizes hardly scaled
  Year | L1-DTLB entries
  1999 | 72 (Pent. III)
  2001 | 64 (Pent. 4)
  2008 | 96 (Nehalem)
  2012 | 100 (Ivy Bridge)
• Low access locality of server workloads [Ramcloud’10, Nanostore’11]
Memory size + TLB size + low locality => TLB miss latency overhead

Energy Dissipation is Key Constraint
[Figure: TLB highlighted at 13%*]
TLB is energy hungry
* From Sodani’s/Intel’s MICRO 2011 Keynote
• TLB shows up as hotspot
• TLB latency hiding forces energy-hungry L1 cache

Three Pieces of Work
[Figure: the refresher diagram (Process 1 and Process 2 virtual address spaces, core, TLB, L1 cache, page table, physical memory), annotated in turn with each piece of work below.]
1 Performance
  – Problem: execution time overhead due to page table walks on TLB misses
  – Direct Segment (ISCA’13): eliminates 99% of DTLB misses
2 Energy Dissipation
  – Problem: energy dissipation due to TLB and L1 cache lookup
  – Opportunistic Virtual Caching (ISCA’12): eliminates 20% of on-chip memory dynamic energy
3 Partitioning TLB resources
  – Problem: overheads due to multiple page sizes in TLB
  – Merged-Associative TLB: avoid partitioning TLB resources

Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
  – Latency overhead of TLB misses
  – Analysis: How Big Memory Workloads
  – Design: Direct Segment
  – Evaluation
  – Summary
2• Opportunistic Virtual Caching
3• Merged-Associative TLB

Experimental Setup
• Experiments on Intel Xeon (Sandy Bridge) x86-64
  – Page sizes: 4KB (Default), 2MB, 1GB
  TLB | 4KB | 2MB | 1GB
  L1 DTLB | 64 entry, 4-way | 32 entry, 4-way | 4 entry, fully assoc.
  L2 DTLB | 512 entry, 4-way
• 96GB installed physical memory
• Methodology: use hardware performance counters

Big Memory Workloads
[Figure: percentage of execution cycles spent on servicing DTLB misses (y-axis 0 to 35) for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS; legend: 4KB, 2MB, 1GB, Direct Segment; off-scale bars labeled 51.1 and 83.1.]

Execution Time Overhead: TLB Misses
[Figure: the same chart built up over several slides, y-axis labeled percentage of execution cycles wasted; off-scale bars labeled 51.1, 83.1, and 51.3.]
• Significant overhead of paged virtual memory
• Worse with TBs of memory now or in future?

How is Paged Virtual Memory used?
An example: memcached servers
[Figure: a client and memcached server #n, which holds an in-memory hash table (Key X, Value Y) and network state.]

Big Memory Workloads’ Use of Paging
Paged VM Feature | Our Analysis | Implication
Swapping | ~0 swapping | Not essential
Per-page protection | ~99% pages read-write | Overkill
Fragmentation reduction | Little OS-visible fragmentation (next slide) | Per-page (re)allocation less important

Memory Allocation Over Time
[Figure: allocated memory (in GB, 0 to 90) over time (in seconds, 0 to 1500) for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS, with the warm-up phase marked.]
Most of the memory is allocated early

Where Paged Virtual Memory Needed?
[Figure*: virtual address space (VA) layout with regions for the dynamically allocated heap, code, constants, shared memory, mapped files, stack, and guard pages; legend: paging valuable / paging not needed.]
Paged VM not needed for MOST memory
* Not to scale

Idea: Two Types of Address Translation
A Conventional paging
  • All features of paging
  • All cost of address translation
B Simple address translation
  • NO paging features
  • NO TLB miss
• OS/Application decides where to use which [=> Paging features where needed]

Hardware: Direct Segment
1 Conventional Paging
2 Direct Segment
[Figure: the VA range between BASE and LIMIT maps to PA through OFFSET.]
Why Direct Segment?
• Matches big memory workload needs
• NO TLB lookups => NO TLB misses

H/W: Translation with Direct Segment
[Figure*: virtual address bits [V47 … V12] [V11 … V0]; the upper bits are checked against BASE (≥?) and LIMIT (<?) alongside the DTLB lookup; the output is physical address bits [P40 … P12] [P11 … P0].]
• Address inside the direct segment (Y): OFFSET is applied; paging is ignored
• Address outside the direct segment (N): direct segment is ignored; DTLB lookup (HIT/MISS); on a MISS the page-table walker is used
* NOT to scale

S/W: 1 Setup Direct Segment Registers
• Calculate register values for processes
  – BASE = Start VA of Direct Segment
  – LIMIT = End VA of Direct Segment
  – OFFSET = BASE – Start PA of Direct Segment
• Save and restore register values
[Figure: the VA range from VA1 to VA2, bounded by BASE and LIMIT, maps to PA through OFFSET.]

S/W: 2 Provision Physical Memory
• Create contiguous physical memory
  – Reserve at startup
    • Big memory workloads are cognizant of memory needs
    • e.g., memcached’s object cache size
  – Memory compaction
    • Latency insignificant for long running jobs
      – 10GB of contiguous memory in < 3 sec
      – 1% speedup => 25 mins break even for 50GB compaction

S/W: 3 Abstraction for Direct Segment
• Primary Region
  – Contiguous VIRTUAL address range not needing paging
  – Hopefully backed by Direct Segment
  – But all/part can use base/large/huge pages
  [Figure: VA to PA mapping of the primary region.]
• What is allocated in the primary region?
  – All anonymous read-write memory allocations
  – Or only on explicit request (e.g., mmap flag)

Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
  – Latency overhead of TLB misses
  – Analysis: How Big Memory Workloads
  – Design: Direct Segment
  – Evaluation
  – Summary
2• Opportunistic Virtual Caching
3• Merged-Associative TLB

Methodology
• Primary region implemented in Linux 2.6.32
• Estimate performance of non-existent direct-segment hardware
  – Get fraction of TLB misses to direct-segment memory
  – Estimate performance gain with linear model
• Prototype simplifications (design more general)
  – One process uses direct segment
  – Reserve physical memory at start up
  – Allocate r/w anonymous memory to primary region

Execution Time Overhead: TLB Misses
[Figure: percentage of execution cycles wasted (y-axis 0 to 35, lower is better) for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS with 4KB, 2MB, and 1GB pages and with Direct Segment; off-scale bars labeled 51.1, 83.1, and 51.3. Direct Segment bar labels, as extracted: 0.01, ~0, ~0, 0.48, 0.01, 0.49. “Misses” in Direct Segment, as extracted: 99.9%, 92.4%, 99.9%, 99.9%, 99.9%, 99.9%.]

Summary: 1 Performance
• Big memory workloads
  – Incur high TLB miss cost
  – Paging not needed for almost all memory
• Our proposal: Direct Segment
  – Paged virtual memory where needed
  – Segmentation (NO TLB miss) where possible

Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
  – Short Path (10 slides)
  – Long Path (22 slides)
3• Merged-Associative TLB

TLB is Energy Hungry
[Figure: TLB highlighted at 13%*]
* From Sodani’s/Intel’s MICRO 2011 Keynote

TLB Latency Hiding Constrains L1 Associativity
• Virtually Indexed Physically Tagged L1 Caches
  – L1 Assoc. >= Cache Size/Page Size; e.g., 32KB/4KB => 8-way
[Figure: address VA46 … VA0; in step 1 the TLB lookup (a) and the L1 cache (32KB, 8 way; Way0 … Way7) access indexed by the page offset (b) proceed together; step 2 is the tag matching logic.]

Why not Virtual Caching?
• Read-Write Synonyms
• ISA Compatibility (e.g., x86 page table walker)
• Energy dissipation was less important?
[Figure: address VA46 … VA0, L1 Cache (32KB, 4/8 way; Way0 … Way7), TLB, and tag matching logic, with steps numbered 1 and 2.]

Observations
• Synonym usage rare: 0-9% of pages (mostly read-only)
• Page permission changes infrequent
• OS already knows where synonyms are possible

Idea
• Opportunistic Virtual Caching
  – Energy-efficient virtual caching as dynamic optimization
  – Default to Physical Caching when needed
• Mechanism
  – Hardware exposes both Virtual and Physical caching
    • Decided based on high order virtual address bit
  – OS lays out memory allocation accordingly

Physical Caching in Opportunistic Virtual Caching
[Figure: virtual address VA47 VA46 … VA0 with VA47 = 0 (virtual caching: NO); ovc_enable (default off); TLB; L1 Cache (32 KB, 8-way; Way0 … Way7); tag matching logic.]

Virtual Caching in Opportunistic Virtual Caching
[Figure: virtual address VA47 VA46 … VA0 with VA47 = 1 and ovc_enable set (virtual caching: Yes); L1 Cache (32 KB, 4-way/8-way; Way0 … Way7) whose lines hold permission bits, ASID, (V/P) TAG, physical tag, and DATA; TLB; tag matching logic.]

Dynamic Energy Savings?
[Figure: % of on-chip memory subsystem's dynamic energy (y-axis 0 to 100) for Physical Caching (VIPT) vs. Opportunistic Virtual Caching, broken down into TLB energy, L1 energy, and L2+L3 energy.]
On average, ~20% of the on-chip memory subsystem’s dynamic energy is eliminated

Overheads
• No significant performance overhead
  – Average performance degradation 0.017%
• State overhead
  – 27.5KB extra state for ~9.25 MB cache hierarchy (< 0.3%)
  – < 1% static power overhead
No significant state or static power overheads, but significant dynamic power savings

Summary: 2 Energy Dissipation
• TLB energy dissipation non-negligible
• TLB latency hiding makes L1 energy worse
• Our Proposal: Opportunistic Virtual Caching
  – Virtual Caching for energy savings
  – Physical caching for compatibility

Roadmap
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
3• Merged-Associative TLB
  – Motivation
  – Background: TLB mechanisms
  – Merged-Associative TLB
  – Evaluation

Motivation
• Processors support multiple page sizes
  – Slow, energy- and area-hungry fully associative TLB
  – Static partitioning of resources with set-associative TLB
• Goal: Support multiple page sizes in a set-associative TLB w/o static partitioning

Two TLB Mechanisms
1• Fully Associative TLB
2• Set-associative TLB (Split TLB design)

1 Fully Associative (FA) TLB
[Figure, 4KB pages: the address splits into virtual page number AE01ED11F and page offset E00; the VPN is matched against the content addressable memory (entry AE01ED11F), which selects page frame number FFEA001A in the random access memory.]
[Figure, 2MB pages: the address splits into BB01ED04E and 0D0; it matches the content addressable memory entry BB01ED0XX (low-order bits shown as X), which selects page frame number FFEA0000.]

FA TLB is slow, area- and energy-hungry
FA = Fully Associative, 4-SA = 4-way Set Associative
Metric | 64 entries, FA | 64 entries, 4-SA | 128 entries, FA | 128 entries, 4-SA | 256 entries, FA | 256 entries, 4-SA
Access time (ns) | 0.39 | 0.14 | 0.47 | 0.15 | 0.67 | 0.17
Dynamic access energy (nJ) | 0.008 | 0.003 | 0.016 | 0.004 | 0.031 | 0.006
Static power (mW) | 3.87 | 1.72 | 7.57 | 2.69 | 14.37 | 4.81
• A fully associative TLB is ~3X – 4X slower than a set-associative TLB.
• Each access to a fully associative TLB spends 2.5X – 6X more dynamic energy.
• A fully associative TLB costs 2-3X more static power.
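The fully associative lookup illustrated on the FA TLB slides can be sketched in software as follows. This is only an illustration, not the hardware design: the class and method names (`FullyAssociativeTlb`, `insert`, `lookup`) are hypothetical, FIFO eviction is chosen purely for brevity, and the example addresses assume the slides' frame numbers are 4KB-granular. The point it shows is that each entry carries its own page size, so any entry can hold a 4KB or a 2MB mapping, at the cost of comparing against every entry.

```python
from dataclasses import dataclass
from typing import List, Optional

# log2 of each supported page size
PAGE_SHIFT = {"4KB": 12, "2MB": 21, "1GB": 30}


@dataclass
class TlbEntry:
    vpn: int    # virtual page number at this entry's page size
    pfn: int    # physical frame number at this entry's page size
    shift: int  # log2(page size); low bits below this are ignored in the match


class FullyAssociativeTlb:
    """Toy model of a fully associative TLB holding mixed page sizes."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[TlbEntry] = []

    def insert(self, va: int, pa: int, page_size: str) -> None:
        shift = PAGE_SHIFT[page_size]
        if len(self.entries) == self.capacity:
            self.entries.pop(0)  # FIFO eviction (assumption, for simplicity)
        self.entries.append(TlbEntry(va >> shift, pa >> shift, shift))

    def lookup(self, va: int) -> Optional[int]:
        # A CAM compares all entries in parallel; the loop models that search.
        for entry in self.entries:
            if (va >> entry.shift) == entry.vpn:
                offset = va & ((1 << entry.shift) - 1)
                return (entry.pfn << entry.shift) | offset
        return None  # TLB miss


tlb = FullyAssociativeTlb(capacity=64)
tlb.insert(0xAE01ED11F000, 0xFFEA001A000, "4KB")  # 4KB example from the slides
tlb.insert(0xBB01ED000000, 0xFFEA0000000, "2MB")  # 2MB example from the slides
print(hex(tlb.lookup(0xAE01ED11FE00)))
print(hex(tlb.lookup(0xBB01ED04E0D0)))
```

Because the page size is stored per entry, no index bits have to be chosen before the page size is known, which is exactly the property a set-associative TLB lacks.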
2 Set Associative (SA) TLB
[Figure*, 4KB pages: the address splits into virtual page number AE01ED11F and page offset E00; 2 bits of index select the set in an 8-entry 4-way set associative TLB; stored tag AE01ED11C.]
*Physical frame number not shown
[Figure, 2MB pages: the address splits into BB01ED04E and 0D0; stored tag BB01ED0; same 8-entry 4-way set associative TLB.]
• What bits to use for indexing?
• Index bits cannot be part of the page offset
Challenge: Page size is unknown before translation

2 Split TLB Design
Solution:
• Separate sub-TLBs for each page size
• All TLBs looked up in parallel
[Figure: a TLB for 2MB pages and a TLB for 4KB pages; the address AE01ED11F E00 hits tag AE01ED11C in the 4KB TLB, and the address BB01ED04E 0D0 hits tag BB01ED0 in the 2MB TLB.]
In practice, larger page sizes get fewer entries. For example, Intel’s Sandy Bridge has 64 entries for 4KB pages, 32 entries for 2MB pages, and 4 entries for 1GB pages.

Drawbacks of Split-TLB design
• Anomalous TLB behavior
  – Larger page sizes can lead to more TLB misses
  TLB misses per 1K memory references
  Workload | 4KB | 2MB | 1GB
  NPB:CG | 279.5 | 42.1 | 130.7
  “differences in TLB structure make predicting how many huge pages can be used and still be of benefit problematic” --Linux Weekly News
• TLB resource underutilization

Goal of Merged-Associative TLB
• A single set associative TLB for all page sizes
  – NO static partitioning of resources
    • NO anomalous TLB behavior
    • TLB resource aggregation
  – NO fully associative TLB
    • Faster, area-efficient, energy-efficient
  – Backward compatible with split-TLB design

Idea
• OS partitions virtual address space
  – Each partition holds mappings for a single page size
  – Virtual address hints the page size
• Hardware logically merges sub-TLBs
  – Uses the virtual address to interpret the page size
  – NO static partitioning of TLB resources

S/W: Address Space Partitioning
[Figure: the virtual address space divided into partitions by page size: 4KB, 4KB, 2MB, 2MB, 1GB, 4KB.]

Split TLB => Merged-Associative TLB
[Figure*: the TLB for 2MB pages and the TLB for 4KB pages are merged into one Merged-Associative TLB. The address AE01ED11F E00 falls in a 4K partition, so of the two candidate indexes (IDX2MB, IDX4KB) IDX4KB is used. The address BB01EDF4E E00 falls in a 2M partition, so IDX2MB is used.]
*TAG match not shown

Backward Compatibility
• “Unknown Page size” for a region is allowed
  – Reverts back to split-TLB design
    • Runs unmodified OS
    • Works under dynamic page size promotion
    • No benefits over split-TLB design

Methodology
• TLB simulator written in PIN
  – Collects TLB miss rates
• Workloads
  – Graph analytics, memcached, MySQL, NAS

TLB Configurations
• Split-TLB design (Intel’s Sandy Bridge)
  – 4KB pages: 64 entry, 4-way set associative
  – 2MB pages: 32 entry, 4-way set associative
  – 1GB pages: 4-entry, fully associative
• Fully associative design
  – 64 entry and 96 entry TLB
• Merged-Associative TLB
  – 96 entry (64 + 32)
• All configurations: 512 entry 4-way set-associative L2-DTLB

Evaluation
• Avoid TLB miss behavior anomaly?
• Reduce TLB miss rates?

Anomalous TLB behavior
TLB misses per 1K memory references, NPB:CG
Page size | Split TLB | Merged TLB
4KB | 279.5 | 282.6
2MB | 42.1 | 0
1GB | 130.7 | 0
Merged-associative TLB avoids anomalous behavior with large pages

TLB Misses Per 1K Accesses: 4KB pages
Workload | Split-TLB | FA-TLB (64 entry) | FA-TLB (96 entry) | Merged TLB
graph500 | 207.9 | 207.9 | 207.7 | 207.9
memcached | 4.4 | 4.36 | 4.42 | 4.38
MySQL | 4.31 | 3.88 | 4.05 | 3.63
NPB:CG | 282.1 | 281.37 | 284.29 | 282.1
NPB:BT | 5.77 | 6.63 | 5.67 | 5.50
Merged-associative TLB does not improve miss rates

TLB Misses Per 1K Accesses: 2MB pages
Workload | Split-TLB | FA-TLB (64 entry) | FA-TLB (96 entry) | Merged TLB
graph500 | 60.13 | 51.94 | 39.92 | 33.52
memcached | 3.97 | 4.05 | 4.04 | 4.08
MySQL | 2.89 | 3.73 | 3.48 | 3.95
NPB:CG | 0.68 | 0.0017 | 0.0018 | 0.035
NPB:BT | 2.94 | 3.13 | 3.12 | 3.19
Merged-associative TLB improves miss rates on one occasion

TLB Misses Per 1K Accesses: 1GB pages
Workload | Split-TLB | FA-TLB (64 entry) | FA-TLB (96 entry) | Merged TLB
graph500 | 6.04 | 0 | 0 | 0
memcached | 3.57 | 3.02 | 2.72 | 2.76
MySQL | 4.21 | 2.79 | 2.92 | 3.23
NPB:CG | 109.791 | 0 | 0 | 0
NPB:BT | 1.28991 | 0 | 0 | 0
Merged-associative TLB reduces miss rates in two cases

Summary: 3 Partitioning TLB Resources
• Static partitioning of TLB resources
• Our Proposal: Merged-Associative TLB
  – Partition virtual address space for page sizes
  – Logically aggregate partitioned TLB resources

Other Works
• Caches and Cache coherence
  – FreshCache: Statically and Dynamically Exploiting Dataless Ways (ICCD’2013)
  – CMP Directory Coherence: One Granularity Does Not Fit All (UW-CS-TR1798, 2013)
  – Scavenger: A New Last Level Cache Architecture with Global Block Priority (MICRO’07)
• Parallel program debugging
  – Karma: Scalable Deterministic Record-Replay (ICS’11)

Summary
1• Performance overhead of TLB misses
  – Direct Segments
2• Energy overhead of address translation
  – Opportunistic Virtual Caching
3• Partitioning TLB resources
  – Merged-Associative TLB

Roadmap (OVC long path)
• Why Revisit Virtual Memory?
1• Direct Segment
2• Opportunistic Virtual Caching (OVC)
  – Why is TLB Energy Hungry
  – Physical Caching Vs. Virtual Caching
  – Opportunities for Virtual Caching
  – Mechanisms for OVC
  – Evaluation
  – Summary
3• Merged-Associative TLB

Why is TLB Energy Hungry?
• TLB looked up on every cache access
  – ALL blocks cached with physical address
  – Each access needs address translation
• TLB lookup latency in critical path
  – Fast and thus energy hungry transistors
  – Content Addressable Memory

TLB Latency Hiding Makes L1 Energy Worse
Relative L1 access (dynamic) energy
Access | 4-way | 8-way | 16-way
Read energy | 1 | 1.309 | 1.858
Write energy | 1 | 1.111 | 1.296
Relative performance
Workloads | 4-way | 8-way | 16-way
Parsec | 1 | 1.002 | 1.002
Commercial | 1 | 1.002 | 1.004
Substantial energy impact dominates negligible performance benefit of increased associativity
*Methods: CACTI, full system simulation

Why Not Virtual Caching?
• Cache ALL blocks under virtual address
  – Saves TLB lookup on L1 cache hits
  – L1 cache associativity not constrained
  – Read-Write Synonyms
    • e.g., V1 -> P0 <- V2
  – Incompatibility with commercial ISAs
    • e.g., x86’s hardware page table walker
  ? Energy dissipation was less important?
Best of Virtual and Physical Caching?

How Frequent are Synonyms?
Suite | Application | Static synonym pages | Dynamic accesses to synonyms | Read-only
Parsec | canneal | 0.06% | 0% |
Parsec | fluidanimate | 0.28% | 0% |
Parsec | facesim | 0.00% | 0% |
Parsec | streamcluster | 0.23% | 0.01% |
Parsec | swaptions | 5.90% | 26% | 100%
Parsec | x264 | 1.40% | 1% |
Commercial | bind | 0.01% | 0.16% |
Commercial | firefox | 9% | 13% | 95%
Commercial | memcached | 0.01% | 0% |
Commercial | specjbb | 1% | 2% |
Synonyms occur, but conflicting use is rare and confined to a small region.

Identify Synonyms at Allocation?
• Process address space divided into regions
• Synonym possibility indicated by protection flags
Region | Protection flags
Stack | r-w
System V Shared Memory | r-w-s
Heap | r-w
Constants | r
Code | r-x
r -> read, w -> write, s -> shared, x -> execute
Possible to separate memory regions with and without read-write synonyms

Idea
• Opportunistic Virtual Caching
  – Use energy-efficient virtual caching opportunistically
  – Default to Physical Caching when needed

Role of the H/W and the OS
• Hardware allows Virtual or Physical Caching
• OS decides when to use Virtual Caching
• OS responsible for correctness

Operating System Mechanisms
• Memory allocations from two partitions
  – Separate partitions for virtual and physical caching
    • e.g., VA47 = 0 => Physical Caching and VA47 = 1 => Virtual Caching
  – Protection flags determine which partition to use
• Operating System responsible for correctness
  – Wrong classification possible, but rare
    • e.g., User makes a region shared after allocation
  – Cache flush on possible violation

Methodology
• Modification to Linux kernel (2.6.28-4)
• Hardware changes simulated in gem5 full-system simulator
• Energy numbers from CACTI

Configuration
Cores | 4 cores, in-order, x86-64 ISA
TLBs | L1 DTLB/ITLB: 64 entries, fully associative; L2 TLB: 512 entries, 4-way set associative
Private Caches | 32 KB 8-way I/D-L1 cache; 256 KB 8-way L2 per core
Shared Caches | 8MB, 16-way L3 cache
Memory | 4 GB memory, 300 cycles round trip

Dynamic Energy Savings?
[Figure: % of on-chip memory subsystem's dynamic energy (y-axis 0 to 100) for Physical Caching (VIPT) vs. Opportunistic Virtual Caching, broken down into TLB energy, L1 energy, and L2+L3 energy.]
On average, ~20% of the on-chip memory subsystem’s dynamic energy is eliminated

BACKUP

Why Not Large Pages?
• Fundamentally not scalable
  – Newer page sizes, larger TLB with memory growth
  – Continual changes to PT-Walker, OS, application
• TLBs need locality
  – Increasing reach does not necessarily reduce misses
• Large pages need to be aligned
  – Significant opportunity can be lost [COLT, MICRO’12]
• Fixed, sparse ISA-defined sizes
  – Dictated by page table structure
  – In x86-64: 4KB, 2MB, 1GB

Address Translation in Different ISA/machines
ISA/Machine | Address Translation
Multics | Segmentation on top of Paging
Burroughs B5000 | Segmentation
UltraSPARC | Paging
X86 (32 bit) | Segmentation on top of Paging
ARM | Paging
PowerPC | Segmentation on top of Paging
Alpha | Paging
X86-64 | Paging only (mostly)
Direct Segment: (1) NOT on top of paging. (2) NOT to replace paging. (3) NO two-dimensional address space; keeps linear address space.

Direct Segment Methodology
• Convert TLB misses to page faults
  – TLB entries incoherent with memory-resident copies
  – Set reserved bit in memory-resident copy
• On page fault, check whether it falls in DS
• Deduct the cycles due to TLB misses proportionally

Direct Segment (DS) in Cloud?
• Currently DS suitable for enterprise workloads
  – Less suitable when many short jobs come and go
• Memory usage needs to be predictable to enable performance guarantees
  – Same memory usage predictions can be used to create DS

How to handle faulty pages?
• OS can map a page-frame with permanent fault to a non-faulty new page-frame
(Possible) Solutions:
• Revert part or all of direct segment memory
• Memory controller (MC) remaps faulty pages
  – Only small number of faulty pages
  – List of faulty re-mapped pages in MC
  – OS need NOT know about faults
  – Finer grain remapping (instead of page-grain)

Direct Segment + OVC possible?
[Flowchart, in order:]
• VA47 = 1 && OVC_ENB = 1?
  – YES: OVC L1 cache low-associativity lookup with VA; L1 cache hit/miss?
  – NO, or L1 cache MISS: look up DTLB hierarchy with VPN
• BASE <= VPN && VPN < LIMIT?
  – YES: OFFSET + VPN; cancel potential page walk
• DTLB hit/miss?
  – MISS: walk the page table
• Concatenate PFN with page offset
• Complete cache lookup

OVC

How Coherence is Maintained?
CASE 1. Coherence reply due to own request
[Figure: the L1 cache (VA) misses on 0x10ab16e10 (VA); the TLB translates it to 0x00fc10d10 (PA), which is sent to the lower level caches; the MSHR records 0x10ab16e10 together with 0x00fc10d10 (PA).]
CASE 2. Coherence request due to other controller
[Figure: the L1 cache physical tag array is looked up with 0x00fc10d10 (PA).]

Alternative techniques for address differentiation?
• Range register(s) for Virtual Caching
[Figure: 0xfff10000100 (VA) is compared against the lower bound 0xfff10000000 (≥) and the upper bound 0xfff10f00000 (≤); 1 = use virtual caching, 0 = use physical caching.]

What about static power?
• Around 42-45% of the total on-chip memory subsystem power
OVC can save around 12% of the total on-chip memory subsystem’s power

What is the breakdown of TLB lookup savings?
Workload | L1 Data TLB | L1 Instr. TLB
canneal | 72.253 | 99.986
facesim | 96.787 | 99.999
fluidanimate | 99.363 | 99.999
streamcluster | 95.083 | 99.994
swaptions | 99.028 | 99.989
x264 | 95.287 | 99.304
specjbb | 91.887 | 99.192
memcached | 94.580 | 98.605
bind | 97.090 | 98.310
Mean | 93.484 | 99.486
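The combined lookup in the "Direct Segment + OVC possible?" backup slide can be sketched as below. This is a minimal software model under stated assumptions, not the proposed hardware: `DirectSegmentRegs`, `use_virtual_cache`, and `translate` are hypothetical names, plain dictionaries stand in for the DTLB hierarchy and the page table, only 4KB pages are modeled, and OFFSET is taken at page granularity so that PFN = VPN + OFFSET as the flowchart writes it (the register-setup slide states the subtraction the other way round, in which case the sign flips).

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PAGE_SHIFT = 12  # 4KB base pages only (simplification)


@dataclass
class DirectSegmentRegs:
    base: int    # first virtual page number inside the direct segment
    limit: int   # first virtual page number past its end
    offset: int  # added to the VPN to obtain the PFN (sign convention assumed)


def use_virtual_cache(va: int, ovc_enable: bool) -> bool:
    """OVC check: virtual caching is used only if VA47 = 1 and OVC is enabled."""
    return ovc_enable and ((va >> 47) & 1) == 1


def translate(va: int, regs: DirectSegmentRegs,
              dtlb: Dict[int, int], page_table: Dict[int, int]) -> Tuple[int, str]:
    """Return (physical address, path taken) for one access."""
    vpn = va >> PAGE_SHIFT
    page_offset = va & ((1 << PAGE_SHIFT) - 1)
    if regs.base <= vpn < regs.limit:
        # Inside the direct segment: no TLB entry needed, no page walk.
        pfn, path = vpn + regs.offset, "direct-segment"
    elif vpn in dtlb:
        pfn, path = dtlb[vpn], "dtlb-hit"
    else:
        # A missing page-table entry raises KeyError, standing in for a fault.
        pfn = page_table[vpn]
        dtlb[vpn] = pfn  # refill the DTLB after the walk
        path = "page-walk"
    # Concatenate the PFN with the page offset.
    return (pfn << PAGE_SHIFT) | page_offset, path


regs = DirectSegmentRegs(base=0x100, limit=0x200, offset=0x1000)
dtlb: Dict[int, int] = {}
page_table = {0x300: 0x42}
print(translate(0x150ABC, regs, dtlb, page_table))
print(translate(0x300123, regs, dtlb, page_table))
print(translate(0x300123, regs, dtlb, page_table))
```

In the flowchart the translation is only reached when `use_virtual_cache` is false or the virtually tagged L1 lookup misses; the sketch keeps the two decisions as separate functions so either mechanism can be exercised alone.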