21st Century Computer Architecture
Mark D. Hill, Univ. of Wisconsin-Madison
2/2014 @ HPCA, PPoPP, & CGO

1. Whitepaper Briefing
2. Impact & Computing Community Consortium (CCC)
3. E.g., Efficient Virtual Memory for Big Memory Servers


21st Century Computer Architecture
A CCC community white paper
http://cra.org/ccc/docs/init/21stcenturyarchitecturewhitepaper.pdf

• Participants & Process
• Information & Communication Technology's Impact
• Semiconductor Technology's Challenges
• Computer Architecture's Future
• Pre-Competitive Research Justified


White Paper Participants

Sarita Adve, U Illinois *; David H. Albonesi, Cornell U; David Brooks, Harvard U; Luis Ceze, U Washington *; Sandhya Dwarkadas, U Rochester; Joel Emer, Intel/MIT; Babak Falsafi, EPFL; Antonio Gonzalez, Intel/UPC; Mark D. Hill, U Wisconsin *,**; Mary Jane Irwin, Penn State U *; David Kaeli, Northeastern U *; Stephen W. Keckler, NVIDIA/U Texas; Christos Kozyrakis, Stanford U; Alvin Lebeck, Duke U; Milo Martin, U Pennsylvania; José F. Martínez, Cornell U; Margaret Martonosi, Princeton U *; Kunle Olukotun, Stanford U; Mark Oskin, U Washington; Li-Shiuan Peh, M.I.T.; Milos Prvulovic, Georgia Tech; Steven K. Reinhardt, AMD; Michael Schulte, AMD/U Wisconsin; Simha Sethumadhavan, Columbia U; Guri Sohi, U Wisconsin; Daniel Sorin, Duke U; Josep Torrellas, U Illinois *; Thomas F. Wenisch, U Michigan *; David Wood, U Wisconsin *; Katherine Yelick, UC Berkeley/LBNL *

"*" contributed prose; "**" effort coordinator
Thanks to CCC's Erwin Gianchandani & Ed Lazowska for guidance and to Jim Larus & Jeannette Wing for feedback


White Paper Process

• Late March 2012
o CCC contacts coordinator & forms group
• April 2012
o Brainstorm (meetings/online doc)
o Read related docs (PCAST, NRC Game Over, ACAR1/2, …)
o Use online doc for intro & outline, then write sections in parallel
o Rotate authors to revise sections
• May 2012
o Brainstorm list of researchers in/out of computer architecture
o Solicit researcher feedback/endorsement
o Do distributed revision & redo of intro
o Release May 25 to CCC & via email

Kudos to the participants for executing on a tight timetable


20th Century ICT Set Up

• Information & Communication Technology (ICT) has changed our world
o <long list omitted>
• Required innovations in algorithms, applications, programming languages, …, & system software
• Key (invisible) enablers: (cost-)performance gains
o Semiconductor technology ("Moore's Law")
o Computer architecture (~80x per Danowitz et al.)


Enablers: Technology + Architecture

[Figure: performance gains attributed to technology vs. architecture; Danowitz et al., CACM 04/2012, Figure 1]


21st Century ICT Promises More

• Data-centric personalized health care
• Human network analysis
• Computation-driven scientific discovery
• Much more: known & unknown


21st Century App Characteristics

• BIG DATA
• SECURE/PRIVATE
• ALWAYS ONLINE

Whither the enablers of future (cost-)performance gains?


Technology's Challenges 1/2 (Late 20th Century → The New Reality)

• Moore's Law: 2× transistors/chip → Transistor count still 2× BUT…
• Dennard Scaling: ~constant power/chip → Gone. Can't repeatedly double power/chip


Classic CMOS Dennard Scaling: the Science behind Moore's Law
Source: Future of Computing Performance: Game Over or Next Level?, National Academy Press, 2011 (Finding 2); National Research Council (NRC), Computer Science and Telecommunications Board (CSTB.org)

Scaling: Voltage V/a; Oxide tOX/a
Results: Power/ckt 1/a²; Power density ~constant


Post-classic CMOS Dennard Scaling

                Classic Dennard   Post-Dennard reality
Voltage:        V/a               V (no longer scales)
Oxide:          tOX/a             tOX/a
Power/ckt:      1/a²              1
Power density:  ~constant         a²

So what now: chips w/ higher power (no), smaller chips (?), dark silicon (?), or other (?)
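To see the arithmetic in this table concretely, here is a small C sketch (an illustrative model added here, not taken from the white paper or the NRC report). It assumes dynamic power per circuit scales as C·V²·f, with capacitance shrinking and frequency rising with feature size; the only difference between the two regimes is whether voltage scales.

```c
/* Dennard-scaling arithmetic sketch (illustrative model of the table above,
 * not a device simulation). Each generation shrinks linear dimensions by 1/a
 * with a = sqrt(2), so transistors per unit area double (Moore's Law).
 * Dynamic power per circuit ~ C * V^2 * f, with C ~ 1/a and f ~ a. Classic
 * Dennard scaling also shrinks V by 1/a; post-Dennard, V stays flat. */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double a = sqrt(2.0);                    /* linear shrink factor */
    /* Per-generation multipliers for power per circuit: C * V^2 * f. */
    const double p_classic = (1.0 / a) * (1.0 / (a * a)) * a;  /* = 1/a^2 */
    const double p_post    = (1.0 / a) * 1.0 * a;              /* = 1     */
    double pd_classic = 1.0, pd_post = 1.0;  /* power density vs. gen 0 */

    printf("gen   power density (classic)   power density (post-Dennard)\n");
    for (int gen = 0; gen <= 5; gen++) {
        printf("%3d   %10.2fx                %10.2fx\n",
               gen, pd_classic, pd_post);
        /* Next generation: each circuit's power scales by p_*, and there
         * are a^2 = 2x more circuits per unit area. */
        pd_classic *= p_classic * a * a;
        pd_post    *= p_post * a * a;
    }
    return 0;
}
```

Classic scaling holds power density flat; with voltage stuck, density compounds by a² = 2× per generation, which is the arithmetic behind the "dark silicon" option above.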
Technology's Challenges 2/2 (Late 20th Century → The New Reality)

• Moore's Law: 2× transistors/chip → Transistor count still 2× BUT…
• Dennard Scaling: ~constant power/chip → Gone. Can't repeatedly double power/chip
• Modest (hidden) transistor unreliability → Increasing transistor unreliability that can't be hidden
• Focus on computation over communication → Communication (energy) more expensive than computation
• One-time costs amortized via mass market → One-time costs much worse & demand for specialized platforms

How should architects step up as technology falters?


"Timeline" from DARPA ISAT

[Figure: system capability (log scale) vs. decades, 1980s–2050s, with a "fallow period" ahead as technology scaling slows]
Source: Advancing Computer Systems without Technology Progress, ISAT Outbrief (http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf), Mark D. Hill and Christos Kozyrakis, DARPA/ISAT Workshop, March 26-27, 2012


21st Century Comp Architecture (20th Century → 21st Century)

• Single-chip in stand-alone computer → Architecture as Infrastructure: spanning sensors to clouds
• Performance via invisible instruction-level parallelism → Performance + security, privacy, availability, programmability, …; Energy First ● Parallelism ● Specialization ● Cross-layer design
• Predictable technologies (CMOS, DRAM, & disks) → New technologies (non-volatile memory, near-threshold, 3D, photonics, …); Rethink: memory & storage, reliability, communication
• Cross-cutting: break current layers with new interfaces


Available for free: search on "Synthesis Lectures on Computer Architecture"
What Research Exactly?

• Research areas in the white paper (& backup slides)
1. Architecture as Infrastructure: Spanning Sensors to Clouds
2. Energy First
3. Technology Impacts on Architecture
4. Cross-Cutting Issues & Interfaces
• Much more research to be developed by future PIs!
• Example from our own work in the last part of this talk [ISCA 2013]
o Cloud workloads (e.g., memcached) use vast memory (100 GB to TBs), wasting up to 50% of execution time
o A cross-cutting OS-HW change eliminates this waste


Pre-Competitive Research Justified

• Retain the (cost-)performance enabler of the ICT revolution
• Successful companies cannot do this by themselves
o Lack the needed long-term focus
o Don't want to pay for what benefits all
o Resist transcending the interfaces that define their products
• Corroborated by
o Future of Computing Performance: Game Over or Next Level?, National Academy Press, 2011
o DARPA/ISAT Workshop Advancing Computer Systems without Technology Progress, outbrief at http://www.cs.wisc.edu/~markhill/papers/isat2012_ACSWTP.pdf


21st Century Computer Architecture
Mark D. Hill, Univ. of Wisconsin-Madison
2/2014 @ HPCA, PPoPP, & CGO

1. Whitepaper Briefing
2. Impact & Computing Community Consortium (CCC)
3. E.g., Efficient Virtual Memory for Big Memory Servers

$15M NSF XPS 2/2013 + $15M for 2/2014


Computing Community Consortium
http://cra.org/ccc

[Diagram: the CCC sits between the CS research community, research funders, and research beneficiaries]


The CCC Council

Leadership: Susan Graham, UC Berkeley (Chair); Greg Hager, Johns Hopkins (Vice Chair); Ed Lazowska, U. Washington (Past Chair); Ann Drobnis, Director; Kenneth Hines, Program Associate; Andy Bernat, CRA Executive Director

Terms ending 6/2016: Randy Bryant, CMU; Limor Fix, Intel; Mark Hill, U. Wisconsin, Madison; Tal Rabin, IBM Research; Daniela Rus, MIT; Ross Whitaker, Univ. Utah

Terms ending 6/2015: Liz Bradley, Univ. Colorado; Sue Davidson, Univ. Pennsylvania; Joe Evans, Univ. Kansas; Ran Libeskind-Hadas, Harvey Mudd; Elizabeth Mynatt, Georgia Tech; Shashi Shekhar, Univ. Minnesota

Terms ending 6/2014: Deborah Crawford, Drexel; Anita Jones, Univ. Virginia; Fred Schneider, Cornell; Bob Sproull, Sun Labs Oracle (ret.); Josep Torrellas, Univ. Illinois; Chris Johnson, Univ. Utah

Also listed: Stephanie Forrest, Univ. New Mexico; Robin Murphy, Texas A&M; John King, Univ. Michigan; Dave Waltz, Columbia; Karen Sutherland, Augsburg College; Bill Feiereisen, LANL; Dick Karp, UC Berkeley; Greg Andrews, Univ. Arizona;
Frans Kaashoek, MIT; Dave Kaeli, Northeastern; Andrew McCallum, UMass; Peter Lee, Carnegie Mellon
http://cra.org/ccc


A Multitude of Activities

• Community-initiated visioning:
o Workshops to discuss "out-of-the-box" ideas
o Challenges & Visions tracks at conferences
• Outreach to the White House & funding agencies:
o Outputs of visioning activities
o Short reports to inform policy makers
o Task forces: Health IT, Sustainability IT, Data Analytics
• Public relations efforts:
o Library of Congress symposia
o Research "Highlight of the Week"
o CCC Blog [http://cccblog.org/]
• Nurturing the next generation of leaders:
o Computing Innovation Fellows
o Project "Landmark Contributions by Students"
o Leadership in Science Policy Institute
http://cra.org/ccc


Catalyzing: Visioning Activities

• Over 20 workshops to date
• More than 1,500 participants
• Free & Open Source Software
http://cra.org/ccc


Catalyzing and Enabling: Architecture (2010–2013)

• 2010: Josep Torrellas, UIUC
• 2010: Mark Oskin, Washington
• 2012: Mark Hill, Wisconsin
http://cra.org/ccc


CCC & You

• CCC
o Chair Susan Graham; ~20 Council members
o Standing committee of the Computing Research Association (CRA)
o Established via NSF cooperative agreement
• You
o Be a rain-maker & give forward: white papers, visioning, and more
http://cra.org/ccc


21st Century Computer Architecture
Mark D. Hill, Univ. of Wisconsin-Madison
2/2014 @ HPCA, PPoPP, & CGO

1. Whitepaper Briefing
2. Impact & Computing Community Consortium (CCC)
3. E.g., Efficient Virtual Memory for Big Memory Servers


A View of Computer Layers

Problem
Algorithm
Application
Middleware / Compiler
Operating System
Instruction Set Architecture
Microarchitecture
Logic Design
Transistors, etc.

(The example ahead punches a small hole through these layers.)


Efficient Virtual Memory for Big Memory Servers
Arkaprava Basu, Jayneel Gandhi, Jichuan Chang*, Mark D. Hill, Michael M. Swift (* HP Labs) [ISCA 2013]

Q: "Virtual Memory was invented in a time of scarcity. Is it still a good idea?"
--- Charles Thacker, 2010 Turing Award Lecture
A: As we see it, OFTEN but not ALWAYS.


Executive Summary

• Big-memory workloads are important
– graph analysis, memcached, databases
• Our analysis:
– TLB misses burn up to 51% of execution cycles
– Paging is not needed for almost all of their memory
• Our proposal: Direct Segments
– Paged virtual memory where needed
– Segmentation (no TLB miss) where possible
• A direct segment often eliminates 99% of DTLB misses


Virtual Memory Refresher

[Diagram: each process's virtual address space maps to physical memory; on every reference the core consults the TLB (Translation Lookaside Buffer) and, on a miss, the page table]
Challenge: TLB misses waste execution time


Memory Capacity for $10,000

[Chart: memory size purchasable for $10,000 (inflation-adjusted 2011 USD; data: jcmit.com), 1980–2010, growing from MBs toward TBs; commercial servers now ship with 4TB of memory]
Big data needs to access terabytes of data at low latency


TLB is Less Effective

• TLB sizes have hardly scaled

Year:             1999            2001          2008          2012
L1-DTLB entries:  72 (Pent. III)  64 (Pent. 4)  96 (Nehalem)  100 (Ivy Bridge)

• Low access locality of server workloads [RAMCloud'10, Nanostore'11]
• Growing memory size + flat TLB size + low locality ⇒ growing TLB miss latency overhead


Experimental Setup

• Experiments on an Intel Xeon (Sandy Bridge), x86-64
– Page sizes: 4KB (default), 2MB, 1GB

           4 KB              2 MB             1 GB
L1 DTLB    64-entry, 4-way   32-entry, 4-way  4-entry, fully assoc.
L2 DTLB    512-entry, 4-way  —                —

• 96GB installed physical memory
• Methodology: use hardware performance counters
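To put those DTLB sizes in perspective, here is a back-of-envelope C sketch (added here; not the paper's methodology) computing TLB reach, the memory covered by the cached translations in the table above, against the 96GB test machine:

```c
/* TLB reach vs. memory size, using the Sandy Bridge DTLB configuration from
 * the setup slide above (a back-of-envelope sketch, not from the paper). */
#include <stdio.h>

int main(void) {
    const double mem = 96.0 * (1ULL << 30);              /* 96 GB installed */
    const struct { const char *cfg; double entries, page_bytes; } tlb[] = {
        { "L1 DTLB, 4KB pages",  64.0, 4.0 * (1 << 10) },
        { "L1 DTLB, 2MB pages",  32.0, 2.0 * (1 << 20) },
        { "L1 DTLB, 1GB pages",   4.0, 1.0 * (1 << 30) },
        { "L2 DTLB, 4KB pages", 512.0, 4.0 * (1 << 10) },
    };
    for (int i = 0; i < 4; i++) {
        double reach = tlb[i].entries * tlb[i].page_bytes; /* bytes covered */
        printf("%-20s reach = %10.2f MB = %.5f%% of 96 GB\n",
               tlb[i].cfg, reach / (1 << 20), 100.0 * reach / mem);
    }
    return 0;
}
```

Even the 4-entry 1GB-page DTLB reaches only about 4% of this machine's memory, and with 4KB pages the reach is a few megabytes out of 96GB, so low-locality workloads miss continually.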
Execution Time Overhead: TLB Misses (lower is better)

[Bar chart, built up over several slides: percentage of execution cycles spent servicing DTLB misses for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS under 4KB, 2MB, and 1GB pages, and with a Direct Segment. The worst bars run off the 35% scale: 51.1%, 83.1%, and 51.3%. With a Direct Segment the overhead is ~0–0.49% (0.01, ~0, 0.48, ~0, 0.01, 0.49).]

• Significant overhead of paged virtual memory
• Worse with TBs of memory, now or in the future?


Roadmap

• Introduction and Motivation
• Analysis: Big-Memory Workloads
• Design: Direct Segment
• Evaluation
• Summary


How is Paged Virtual Memory Used? An Example: memcached Servers

[Diagram: a memcached server holding an in-memory hash table (key X → value Y) plus network state, serving clients]


Big-Memory Workloads' Use of Paging

Paged VM feature         | Our analysis                                 | Implication
Swapping                 | ~0 swapping                                  | Not essential
Per-page protection      | ~99% of pages read-write                     | Overkill
Fragmentation reduction  | Little OS-visible fragmentation (next chart) | Per-page (re)allocation less important


Memory Allocation Over Time

[Chart: allocated memory (GB, 0–90) vs. time (seconds, 0–1500) for graph500, memcached, MySQL, NPB:BT, NPB:CG, and GUPS; allocations plateau after a warm-up of roughly 25 minutes]
Most of the memory is allocated early


Where is Paged Virtual Memory Needed?

• Paging valuable: code, constants, shared memory, memory-mapped files, stack, guard pages
• Paging not needed: the dynamically allocated heap region, which is most of the address space (slide diagram not to scale)
Paged VM is not needed for MOST memory


Roadmap → Design: Direct Segment – Idea – Hardware – Software


Idea: Two Types of Address Translation

A. Conventional paging
• All features of paging
• All costs of address translation
B. Simple address translation
• Protection but NO (easy) swapping
• NO TLB miss
The OS/application decides where to use which [⇒ paging features where needed]


Hardware: Direct Segment

• Three per-process registers: BASE, LIMIT, OFFSET
• Virtual addresses in [BASE, LIMIT) translate via the direct segment (PA = VA + OFFSET); all others use conventional paging
• Why a direct segment? It matches big-memory workload needs, and NO TLB lookups ⇒ NO TLB misses


H/W: Translation with Direct Segment

For a 48-bit virtual address [V47…V12 | V11…V0], hardware checks VA ≥ BASE and VA < LIMIT in parallel with the DTLB lookup:
• Inside the segment: the physical address [P40…P12 | P11…P0] comes from adding OFFSET; the paging path is ignored
• Outside the segment: the direct segment is ignored; a DTLB hit supplies the translation, and a DTLB miss invokes the page-table walker
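A minimal C sketch of the translation logic just described (one reading of the slides; register semantics follow the paper, with hardware detail such as permissions checks omitted):

```c
#include <stdint.h>
#include <stdio.h>

/* Per-process direct-segment registers (saved/restored on context switch). */
typedef struct {
    uint64_t base;    /* start VA of the direct segment          */
    uint64_t limit;   /* end   VA of the direct segment          */
    uint64_t offset;  /* start PA of the segment minus base      */
} ds_regs;

/* Stand-in for the conventional DTLB lookup + page-table walk. */
static uint64_t paged_translate(uint64_t va) {
    return va;  /* identity-mapped stub, just for illustration */
}

/* One reference: if BASE <= VA < LIMIT, the physical address comes straight
 * from the registers and the paging path is ignored: no TLB lookup, so no
 * TLB miss. Otherwise the normal DTLB/walker path runs. */
static uint64_t translate(const ds_regs *ds, uint64_t va) {
    if (va >= ds->base && va < ds->limit)
        return va + ds->offset;            /* direct-segment hit */
    return paged_translate(va);            /* DTLB; walker on miss */
}

int main(void) {
    /* Illustrative values: a 4 GB direct segment at VA 0x100000000 backed by
     * contiguous physical memory starting at PA 0x40000000. The unsigned
     * subtraction may wrap; modular addition in translate() makes it safe. */
    ds_regs ds = { 0x100000000ULL, 0x200000000ULL,
                   0x40000000ULL - 0x100000000ULL };
    printf("inside : PA = 0x%llx\n",
           (unsigned long long)translate(&ds, 0x100000040ULL));
    printf("outside: PA = 0x%llx\n",
           (unsigned long long)translate(&ds, 0x2000ULL));
    return 0;
}
```

In hardware the range check and the DTLB lookup proceed in parallel and the losing path is ignored; the sketch serializes them only for clarity.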
S/W ① Setup Direct Segment Registers

• Calculate register values per process
– BASE = start VA of the direct segment
– LIMIT = end VA of the direct segment
– OFFSET = start PA of the direct segment − BASE
• Save and restore the register values on context switch


S/W ② Provision Physical Memory

• Create contiguous physical memory
– Reserve at startup: big-memory workloads are cognizant of their memory needs (e.g., memcached's object-cache size)
– Or use memory compaction: latency is insignificant for long-running jobs
• 10GB of contiguous memory in < 3 sec
• At a 1% speedup, a 50GB compaction breaks even in 25 minutes


S/W ③ Abstraction for Direct Segment

• Primary region
– A contiguous range of VIRTUAL addresses not needing paging
– Hopefully backed by a direct segment
– But all or part can fall back to base/large/huge pages
• What is allocated in the primary region?
– All anonymous read-write memory allocations
– Or only on explicit request (e.g., an mmap flag; a hypothetical sketch follows the summary below)


Segmentation Not New

ISA/Machine      | Address Translation
Multics          | Segmentation on top of paging
Burroughs B5000  | Segmentation without paging
UltraSPARC       | Paging
x86 (32-bit)     | Segmentation on top of paging
ARM              | Paging
PowerPC          | Segmentation on top of paging
Alpha            | Paging
x86-64           | Paging only (mostly)

Direct Segment: NOT on top of paging. NOT a replacement for paging. NO two-dimensional address space: it keeps the linear address space.


Roadmap → Evaluation – Methodology – Results


Methodology

• Primary region implemented in Linux 2.6.32
• Estimate the performance of the (not yet built) direct-segment hardware
– Measure the fraction of TLB misses that fall in direct-segment memory
– Estimate the performance gain with a linear model
• Prototype simplifications (the design is more general)
– One process uses the direct segment
– Physical memory reserved at startup
– All read-write anonymous memory allocated to the primary region


Execution Time Overhead: TLB Misses (lower is better)

[Bar chart, built up over several slides: same benchmarks and page sizes as before. With a Direct Segment, overhead drops to 0.01%, ~0%, 0.48%, ~0%, 0.01%, and 0.49%, because 92.4–99.9% of former TLB misses now fall ("miss") in the direct segment.]


(Some) Limitations

• Does not (yet) work with virtual machines
– Can be extended, but memory overcommit is challenging
• Less suitable for sparse virtual address spaces
• One direct segment
– Our workloads did not justify more


Summary

• Big-memory workloads
– Incur high TLB miss costs
– Don't need paging for almost all of their memory
• Our proposal: Direct Segment
– Paged virtual memory where needed
– Segmentation (NO TLB miss) where possible
• Bonus: whither memory-referencing energy?
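To make the explicit-request interface above concrete, here is a hypothetical C sketch of asking for a primary region. MAP_PRIMARY is an illustrative flag name invented for this sketch (no such flag exists in mainline Linux); the paper's Linux 2.6.32 prototype instead steered all of one process's anonymous read-write allocations into the primary region.

```c
#include <sys/mman.h>
#include <stdio.h>

/* MAP_PRIMARY is a hypothetical flag invented for this sketch; no such flag
 * exists in mainline Linux. Defined as 0 so the example still runs as a
 * plain anonymous mapping. */
#ifndef MAP_PRIMARY
#define MAP_PRIMARY 0
#endif

int main(void) {
    size_t sz = (size_t)64 << 30;   /* e.g., a 64 GB memcached object cache */
    /* Request anonymous read-write memory in the primary region: contiguous
     * virtual addresses the OS would try to back with the direct segment
     * (any part it cannot may still use base/large/huge pages). */
    void *cache = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_PRIMARY, -1, 0);
    if (cache == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("primary-region candidate at %p (%zu bytes)\n", cache, sz);
    return 0;
}
```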
Bonus: Whither Memory-Referencing Energy?


A 2nd Virtual Memory Problem: Energy

• The TLB is energy hungry: ~13% (from Sodani's/Intel's MICRO 2011 keynote)
• To save it, we ask: where is the energy spent?


Standard Physical Cache

• TLB accessed on every memory reference, in parallel with the L1 cache
• Enables the physical-address hit/miss check
• Hides TLB latency
• BUT DOES NOT HIDE TLB ENERGY!


Old Idea: Virtual Cache

• TLB accessed only after an L1 cache miss (1–5% of references)
• Greatly reduces TLB access energy
• But breaks compatibility, e.g., read-write synonyms
• OUR ANALYSIS: read-write synonyms are rare


Opportunistic Virtual Cache

• A new hardware cache [ISCA 2012]
– Caches almost all blocks with virtual addresses
– Caches blocks with physical addresses only where needed for compatibility
• The OS decides (cross-layer design)
• Saves energy
– Avoids most TLB lookups
– The L1 cache can use lower associativity (subtle)


Dynamic Energy Savings?

[Bar chart: percentage of the on-chip memory subsystem's dynamic energy (TLB, L1, and L2+L3) for physical (VIPT) caching vs. opportunistic virtual caching]
On average, ~20% of the on-chip memory subsystem's dynamic energy is saved
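To illustrate the opportunistic policy, here is a toy C model (a paraphrase of the idea, not the ISCA 2012 implementation). It assumes the OS can flag the rare pages that need physical addressing (e.g., read-write synonyms); all other L1 lookups use virtual tags, so a hit never touches the TLB.

```c
/* Toy model of the Opportunistic Virtual Cache idea: tag the L1 with virtual
 * addresses for ordinary pages so a hit needs no TLB access; fall back to
 * TLB translation only on L1 misses or for OS-flagged "physical" pages. */
#include <stdint.h>
#include <stdio.h>

#define LINES 64
#define LINE_SHIFT 6                     /* 64-byte cache lines */
static uint64_t tag[LINES];              /* virtual tags; 0 = invalid */
static long tlb_accesses, references;

static int os_wants_physical(uint64_t va) {
    (void)va;
    return 0;  /* the OS flags r/w-synonym pages; rare, so none in this toy */
}

static void ovc_access(uint64_t va) {
    references++;
    uint64_t line = (va >> LINE_SHIFT) % LINES, t = va >> LINE_SHIFT;
    if (!os_wants_physical(va) && tag[line] == t)
        return;                          /* virtual L1 hit: TLB untouched */
    tlb_accesses++;                      /* L1 miss or flagged page: translate */
    tag[line] = t;
}

int main(void) {
    for (int rep = 0; rep < 100; rep++)  /* a loop with locality */
        for (uint64_t va = 0x1000; va < 0x1000 + 64 * 8; va += 8)
            ovc_access(va);
    printf("%ld references, %ld TLB accesses (%.1f%%)\n",
           references, tlb_accesses, 100.0 * tlb_accesses / references);
    return 0;
}
```

On a loop with locality almost every reference hits virtually (the toy reports 8 TLB accesses out of 6400 references), which is where the TLB energy saving comes from.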