Efficient Virtual Memory for Big Memory Servers
U Wisc and HP Labs, ISCA'13
Architecture Reading Club Summer'13

Key points
• Big-memory workloads: Memcached, databases, graph analysis
• Analysis shows:
    • TLB misses can account for up to 51% of execution time
    • The rich features of paged VM are not needed by most applications
• Proposal: Direct Segments
    • Paged VM as usual where needed
    • Segmentation where possible
• For big-memory workloads, this eliminates 99% of data TLB misses

Main Memory Mgmt Trends
• Physical memory has grown from a few MB to a few GB, and now to several TB
• Over the same period, the size of the DTLB has stayed nearly unchanged (entries):
    • Pentium III: 72
    • Pentium 4: 64
    • Nehalem: 96
    • Ivy Bridge: 100
• Workloads were also better behaved in the past (higher locality)
• So: higher memory capacity + constant TLB size + misbehaving apps = more TLB misses

So how bad is it really?
[Figure: bar chart of the percentage of execution cycles wasted for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS. Legend: 4KB, 2MB, 1GB, Direct Segment. The axis runs from 0 to 35; bars that exceed it are labeled 51.1, 83.1, and 51.3.]

Main Features of Paged VM
Feature                  Analysis                               Verdict
Swapping                 No swapping                            Not required
Per-page access perms    99% of pages are read-write            Overkill
Fragmentation mgmt.      Very little OS-visible fragmentation   Per-page reallocation is not important

Main Memory Allocation
[Figure]

Paged VM: why is it needed?
• Shared memory regions for inter-process communication
• Code regions protected by per-page R/W permissions
• Copy-on-write uses per-page R/W permissions to implement fork lazily
• Guard pages at the end of thread stacks
[Figure: virtual address space layout. Paging valuable: code, constants, shared memory, mapped files, stack, guard pages. Paging not needed*: dynamically allocated heap region.]

Direct Segments
• Hybrid of paged and segmented memory (side by side, not one on top of the other)
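The hybrid scheme can be illustrated with a small sketch. It is a toy software model, not the hardware design: it assumes the three direct-segment registers BASE, LIMIT, and OFFSET, takes OFFSET = BASE - start PA as the sign convention (so a physical address is VA - OFFSET; the opposite convention, PA = VA + OFFSET, is equivalent), and treats LIMIT as exclusive. The function name, the dictionary standing in for the TLB and page walk, and all addresses are hypothetical.

```python
PAGE_SIZE = 4096  # 4KB base pages


def translate(va, base, limit, offset, page_table):
    """Translate a virtual address; return (physical_address, path)."""
    if base <= va < limit:
        # Inside the direct segment: one range check and one subtraction,
        # with no TLB lookup or page walk involved.
        return va - offset, "direct segment"
    # Outside the segment: ordinary paging. The dict maps virtual page
    # numbers to physical frame numbers and stands in for TLB + page walk.
    vpn, page_offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault at {va:#x}")
    return page_table[vpn] * PAGE_SIZE + page_offset, "paged"


# Hypothetical layout: a 256MB direct segment plus one conventionally mapped page.
BASE = 0x1000_0000
LIMIT = 0x2000_0000
START_PA = 0x4000_0000
OFFSET = BASE - START_PA       # assumed sign convention, see lead-in
PAGE_TABLE = {0x5: 0x9}        # virtual page 5 -> physical frame 9

for va in (0x1000_0123, 0x5010):
    pa, path = translate(va, BASE, LIMIT, OFFSET, PAGE_TABLE)
    print(f"{va:#x} -> {pa:#x} via {path}")
```

The point of the sketch is that addresses inside the segment never consult the page table at all, which is why TLB misses for that region disappear, while everything outside the segment keeps the full paging semantics.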
Address Translation
[Figure]

OS Support: Handling Physical Memory
• Set up the Direct Segment registers
    • BASE = start VA of the direct segment
    • LIMIT = end VA of the direct segment
    • OFFSET = BASE - start PA of the direct segment
    • Save and restore the register values as part of process metadata on a context switch
• Create a contiguous physical memory region
    • Reserve at startup: big-memory apps know their memory requirement at startup
    • Memory compaction: the latency is insignificant for long-running jobs

OS Support: Handling Virtual Memory
• Primary regions
    • The abstraction presented to the application
    • Contiguous virtual address space backed by a direct segment
• What goes in the primary region
    • Dynamically allocated R/W memory
    • The application can indicate what it needs to put in the primary region
• The primary region is sized very large, so that it can accommodate all of physical memory if need be
    • 64-bit virtual addressing supports 128TB of virtual memory, so VA space practically never runs out

Evaluation
• Methodology
    • Implement primary regions in the kernel
    • Count the TLB misses that would have been served by the (not yet existing) direct-segment hardware
        • x86 uses a hardware page-table walker, so the authors trap every TLB miss by tricking the system into treating the in-memory PTE as invalid
        • In the fault handler:
            • Touch the page at the faulting address, which loads its translation into the TLB
            • Mark the PTE invalid again

Results
[Figure: bar chart of the percentage of execution cycles wasted for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS. Legend: 4KB, 2MB, 1GB, Direct Segment. The axis runs from 0 to 35; bars that exceed it are labeled 51.1, 83.1, and 51.3. The Direct Segment bars are labeled 0.01, ~0, 0.48, ~0, 0.01, and 0.49.]

Results
[Figure]

Why not large pages?
• Huge pages do not scale automatically
    • Growing memory requires new page sizes and/or more TLB entries
• TLBs depend on access locality
• Page sizes are fixed by the ISA and sparse (e.g., 4KB, 2MB, 1GB)
    • Pages must be aligned at page-size boundaries
• Multiple page sizes introduce TLB design tradeoffs
    • Fully associative vs. set-associative designs

Virtual Memory Basics
[Figure: Process 1 and Process 2 virtual address spaces; a core with its cache and TLB (Translation Lookaside Buffer); the page table; physical memory.]
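The scaling argument against large pages can be made concrete with a rough TLB-reach calculation (reach = entries x page size). The sketch below uses the DTLB entry counts quoted earlier in the deck and the three x86 page sizes. It makes the optimistic assumption that every entry can hold a page of any size, which real TLBs generally do not allow, and the 1TB memory size is a hypothetical example.

```python
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40

# DTLB entry counts as quoted in the deck.
DTLB_ENTRIES = {"Pentium III": 72, "Pentium 4": 64, "Nehalem": 96, "Ivy Bridge": 100}
PAGE_SIZES = {"4KB": 4 * KB, "2MB": 2 * MB, "1GB": 1 * GB}


def tlb_reach(entries, page_size):
    """Bytes of memory the TLB can map at once (optimistic upper bound)."""
    return entries * page_size


def coverage(entries, page_size, memory_bytes):
    """Fraction of memory covered by the TLB, capped at 1.0."""
    return min(1.0, tlb_reach(entries, page_size) / memory_bytes)


MEMORY = 1 * TB  # hypothetical big-memory server

for cpu, entries in DTLB_ENTRIES.items():
    for name, size in PAGE_SIZES.items():
        pct = 100 * coverage(entries, size, MEMORY)
        print(f"{cpu:12s} {name}: covers {pct:.5f}% of 1TB")
```

Even under this generous assumption, a roughly constant entry count means coverage only improves by moving to a larger page size, and each doubling of memory halves it again, which is the sense in which huge pages do not scale automatically.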