Remember Memory?
Mark D. Hill, Univ. of Wisconsin-Madison
5/2016 @ David A. Patterson Celebration (slides dated 7/26/2016)

I. Million-fold Memory Growth & Virtual Memory
II. General-Purpose GPUs & Memory Consistency
III. Non-Volatile Memory’s Fusing Memory & Storage
No Change in 30 Years?

I. Million-fold Memory Growth
[Chart: memory capacity for $10,000 (inflation-adjusted 2011 USD, from jcmit.com), 1980–2010, growing from MB to TB]
• Commercial servers ship with 16 TB of memory
• Interactive services need to access TBs of data at low latency

How is Paged Virtual Memory Used?
E.g., memcached servers: each server holds an in-memory hash table (Key X → Value Y) plus per-client network state.
• But TLB sizes hardly scaled:

Year             1999            2008          2012              2015
L1-DTLB entries  72 (Pent. III)  96 (Nehalem)  100 (Ivy Bridge)  100 (Broadwell)

Execution Time Overhead: TLB Misses [ISCA 2013]
[Chart: percentage of execution cycles wasted on TLB misses with 4KB, 2MB, and 1GB pages and with a Direct Segment, for GUPS, NPB:CG, NPB:BT, MySQL, memcached, and graph500; the worst 4KB overheads reach 51.1%, 83.1%, and 51.3%]

Q: “Virtual Memory was invented in a time of scarcity. Is it still a good idea?”
--- Charles Thacker, 2010 Turing Award Lecture
A: As we see it, OFTEN but not ALWAYS.

A View of Computer Layers
Problem
Algorithm
Application
Middleware / Compiler
Operating System
Microarchitecture
Logic Design
Transistors, etc.
(“Punch thru” the layers)
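The mismatch above is easy to quantify: multiply TLB entries by page size to get "TLB reach" and compare it with a terabyte-scale working set. A minimal sketch, using the ~100-entry L1 DTLB from the table; the 1 TB working set is an illustrative assumption, not a number from the talk:

```python
# Back-of-the-envelope TLB reach: entries * page size, vs. a working set.
def tlb_reach_bytes(entries, page_bytes):
    """Bytes of memory covered by a fully utilized TLB."""
    return entries * page_bytes

ENTRIES = 100  # L1 DTLB entries, roughly flat 2008-2015 per the table above
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40

for name, page in [("4KB", 4 * KB), ("2MB", 2 * MB), ("1GB", GB)]:
    reach = tlb_reach_bytes(ENTRIES, page)
    frac = reach / TB  # assumed 1 TB in-memory working set
    print(f"{name} pages: reach = {reach / MB:.2f} MiB "
          f"({frac:.6%} of a 1 TB working set)")
```

With 4 KB pages, 100 entries cover only 400 KiB of a terabyte, which is consistent with the chart's TLB-miss overheads exceeding 50% of execution cycles on some workloads.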
(small) Instruction Set Architecture (the thin layer between the OS and the microarchitecture)
See 21st Century Computer Architecture [CCC 2012]

Bypass Paging (Often)
1. Conventional paging: guard pages, copy-on-write, mapped files
2. Direct Segment: heap without swapping; any VA with BASE ≤ VA < LIMIT translates to PA = VA + OFFSET
Direct Segment [ISCA 2013], but more-general ideas now

Execution Time Overhead: TLB Misses [ISCA 2013]
[Chart repeated: with a Direct Segment, TLB-miss overhead falls to roughly ~0, 0.48, ~0, 0.01, 0.49, and 0.01 percent of execution cycles across GUPS, NPB:CG, NPB:BT, MySQL, memcached, and graph500]
Non-Volatile Memory to explode address space & sharing?

II. Graphics Processing Units (GPUs)
• GPUs = Throughput
• Hierarchical “scoped” programming model
  [OpenCL execution hierarchy: Grid (Dimensions X, Y, Z) → Work-group → Sub-group (hardware-specific size) → Work-item]
• Share memory to expand viable programs
  – Rich data structures (w/o copying)
  – “Pointer is a pointer”
  – Coherence? Scopes?

GPU Memory Hierarchy = Throughput
[Diagram: LLC Directory / Memory, with a GPU L2 over per-CU L1s (CU0…CU15) and a CPU L2 over per-core L1s (CPU0, CPU1)]
• Poor match: CPU coherence w/ writeback caches
• Coherence is the means; memory consistency is the end

Sequential Consistency (SC)
Thread 1: R1 = TOS; R2 = R1 – 1; TOS = R2; Data1 = *R1
Thread 2: R1 = TOS; R2 = R1 – 1; TOS = R2; Data2 = *R1
SC interleaves both threads into a Total Memory Order, yet this unsynchronized pop still races: both threads can read the same TOS and pop the same element.

Sequential Consistency (SC) w/ Locks
Thread 1: Lock(Stack); R1 = TOS; R2 = R1 – 1; TOS = R2; Data1 = *R1; Unlock(Stack)
Thread 2: Lock(Stack); R1 = TOS; R2 = R1 – 1; TOS = R2; Data2 = *R1; Unlock(Stack)

SC for Data Race Free
With every shared access protected by the lock, the program above is data-race-free, so even a relaxed implementation must make execution appear sequentially consistent.

CPU History & GPU Future
• CPUs: 3 decades!
  – SC [ToC 1979]
  – SC for Data Race Free [ISCA 1990]
  – SC for DRF Java/C++ [PLDI 2005] [PLDI 2008]
GPUs faster?
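The BASE/LIMIT/OFFSET check behind Direct Segments can be sketched in a few lines. A toy model under my own assumptions (register values, a dictionary standing in for a page table, 4 KB pages); the real design [ISCA 2013] does this check in hardware, in parallel with the TLB lookup:

```python
# Sketch of Direct Segment translation: one contiguous virtual range
# [BASE, LIMIT) maps to physical memory by adding OFFSET, bypassing the
# TLB entirely; all other addresses fall back to conventional paging.
BASE, LIMIT, OFFSET = 0x1000_0000, 0x9000_0000, 0x2000_0000  # illustrative

def translate(va, page_table):
    if BASE <= va < LIMIT:             # direct-segment hit: no TLB needed
        return va + OFFSET
    vpn, page_off = va >> 12, va & 0xFFF   # conventional 4 KB paging
    return (page_table[vpn] << 12) | page_off

# Usage: heap accesses inside [BASE, LIMIT) can never take a TLB miss.
pt = {0x0: 0x5}                         # toy page table: VPN 0 -> PFN 5
print(hex(translate(0x1000_0040, pt)))  # direct segment: 0x30000040
print(hex(translate(0x0040, pt)))       # paged: 0x5040
```

The design choice is the point of the slide: the heap (the TB-scale part of a memcached server) keeps paging's flexibility only where it is used, and skips translation overhead everywhere else.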
GPU Memory Hierarchy = Throughput
[Diagram as above: LLC Directory / Memory, GPU L2 over per-CU L1s, CPU L2 over per-core L1s]
• GPU has “scopes”: nearer in is faster

CPU History & GPU Future
• CPUs: 3 decades!
  – SC [ToC 1979]
  – SC for Data Race Free [ISCA 1990]
  – SC for DRF Java/C++ [PLDI 2005] [PLDI 2008]
• GPUs faster?
  – SC for Heterogeneous Race Free [ASPLOS 2014]
  – No data races & synchronization of “enough” scope
  – In Heterogeneous System Architecture [2015]
Whither System on a Chip w/ many accelerators?

III. Non-Volatile Memory (NVM)
Compute / Memory / Storage: convergence or hype?
Power off by (a) surprise or (b) design

III(a) Power Off by Surprise (Crash)
STORE value = 0xC02
STORE valid = 1
With a write-back cache in front of non-volatile memory, the two stores can drain to NVM in either order, regardless of the Total Memory Order: a crash can leave NVM with valid = 1 but the old value 0xDEADBEEF.

Seek Consistent Durable State on Crash
In what persistency order do stores become durable? Persistency Model [Pelley et al., ISCA 2014]:
– Strict persistency: as strong as the (relaxed) memory model
– Relaxed persistency: even weaker

More Persistency Work Needed
• Industry not there yet; e.g.:
  – “If PCOMMIT is executed after a store to a persistent memory range is accepted to memory, the store becomes persistent when the PCOMMIT becomes globally visible.”
  – “While all store-to-memory operations are eventually accepted to memory, the following items specify the actions software can take to ensure that they are accepted: non-temporal stores to write-back (WB) memory and all stores to uncacheable (UC), write-combining (WC), and write-through (WT) memory are accepted to memory as soon as they are globally visible.”
  – “If, after an ordinary store to write-back (WB) memory becomes globally visible, CLFLUSH, CLFLUSHOPT, or CLWB is executed for the same cache line as the store, the store is accepted to memory when the CLFLUSH, CLFLUSHOPT, or CLWB execution itself becomes globally visible.”
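The value/valid hazard above can be made concrete with a toy model in which any subset of unflushed stores may have drained to NVM at the moment of a crash. This is my own simplified sketch, not from the talk: `flush_after` stands in for a CLWB-plus-fence that makes earlier stores durable before later ones issue, and all names and initial NVM contents are illustrative.

```python
import itertools

# Toy crash model: stores land in a write-back cache; on a crash, any
# subset of unflushed stores may (or may not) have reached NVM.
def crash_states(stores, flush_after=None):
    """Return the set of possible (value, valid) NVM states after a crash.
    stores: ordered (addr, value) pairs in program order.
    flush_after: index before which stores are guaranteed durable
    (models CLWB + fence); None means nothing was flushed."""
    durable = stores[:flush_after] if flush_after is not None else []
    pending = stores[len(durable):]
    states = set()
    for r in range(len(pending) + 1):
        for subset in itertools.combinations(pending, r):
            nvm = {"value": 0xDEADBEEF, "valid": 0}  # stale pre-crash contents
            for addr, val in list(durable) + list(subset):
                nvm[addr] = val
            states.add((nvm["value"], nvm["valid"]))
    return states

stores = [("value", 0xC02), ("valid", 1)]
# Without ordering, a crash can expose valid=1 with the stale value.
assert (0xDEADBEEF, 1) in crash_states(stores)
# Flushing 'value' before storing 'valid' rules that state out.
assert (0xDEADBEEF, 1) not in crash_states(stores, flush_after=1)
```

Strict vs. relaxed persistency then amounts to how many of these pending-subset states the model admits, and the "HW that orders, not gratuitously flushes" point is that the second assertion needs only ordering between the two stores, not an eager flush of everything.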
• IMHO, need:
  – Deeper & more formal models (e.g., happens-before)
  – Better understanding of application durable state
  – Implement HW that orders, not gratuitously flushes

III(b) Power Off by Design (Prediction)
[Figure: execution timeline with idle periods]
• Greatly improve energy-efficiency
• Especially when (briefly) doing nothing
• Needs work: circuits, architecture, system SW
• Or advice Patterson never gave me…. Do Nothing Well!

Summary
I. Million-fold Memory Growth & Virtual Memory
II. General-Purpose GPUs & Memory Consistency
III. Non-Volatile Memory’s Fusing Memory & Storage

Backup