HETEROGENEOUS-RACE-FREE MEMORY MODELS
DEREK R. HOWER, BLAKE A. HECHTMAN, BRADFORD M. BECKMANN, BENEDICT R. GASTER, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD
ASPLOS 3/4/2014

EXECUTIVE SUMMARY
From the CPU world: SC for Data-race-free (SC for DRF)
‒ Relaxed HW, precise semantics, sanity-preserving
‒ SC for DRF: all synchronization is global
GPUs use scoped synchronization
‒ SC for DRF will have unpalatable performance loss. Eat the performance for the sake of sanity?
SC for Heterogeneous-race-free (SC for HRF)
‒ Reclaim performance, maintain sanity
Two specific models:
                       HRF-direct              HRF-indirect
Allowed Relaxations    Tomorrow’s potential    Today’s implementations
Target Workloads       Today’s regular         Tomorrow’s irregular
Case study: GPGPU Task Sharing Runtime
‒ HRF-indirect provides up to 10% performance improvement

OUTLINE
DRF, SCOPES, AND GPU HARDWARE
HRF-direct vs. HRF-indirect
Case Study: GPGPU Task Runtime

DATA-RACE-FREE MEMORY MODELS: A BRIEF HISTORY
Sequential Consistency (SC) [1979]: as if system is a multitasking uniprocessor
‒ Easy to understand
‒ Hard to build
Most real systems more relaxed (e.g., TSO [1991])
t1                t2
i1: ST A = 1      i3: ST B = 1
i2: X = LD B      i4: Y = LD A
Impossible in SC: X = Y = 0
Possible in TSO: X = Y = 0
Relaxations: hard to understand!
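The two-thread example above can be checked by brute force. The following is a minimal sketch, not taken from the talk: it enumerates every interleaving of i1 to i4, once with stores reaching memory immediately (SC) and once with each store allowed to wait in a per-thread store buffer (a simplified stand-in for TSO). The names `State`, `explore`, and `outcomes` are illustrative assumptions.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical model of the example: t1 runs i1, i2 and t2 runs i3, i4.
struct State {
    int mem_a = 0, mem_b = 0;           // shared memory, initially 0
    int pc[2] = {0, 0};                 // per thread: 0 = store next, 1 = load next, 2 = done
    bool buffered[2] = {false, false};  // TSO only: store still waiting in the store buffer
    int x = -1, y = -1;                 // load results (-1 = not loaded yet)
};

using Outcomes = std::set<std::pair<int, int>>;  // set of observed (X, Y) pairs

// Depth-first search over every possible next step of either thread.
void explore(State s, bool tso, Outcomes& out) {
    bool stepped = false;
    for (int t = 0; t < 2; ++t) {
        if (s.pc[t] == 0) {  // i1 / i3: store 1 to A (t1) or B (t2)
            State n = s;
            n.pc[t] = 1;
            if (tso) n.buffered[t] = true;          // store sits in the buffer
            else (t == 0 ? n.mem_a : n.mem_b) = 1;  // SC: store reaches memory at once
            explore(n, tso, out);
            stepped = true;
        } else if (s.pc[t] == 1) {  // i2 / i4: load the location the other thread writes
            State n = s;
            n.pc[t] = 2;
            if (t == 0) n.x = s.mem_b; else n.y = s.mem_a;
            explore(n, tso, out);
            stepped = true;
        }
        if (s.buffered[t]) {  // TSO: the buffered store drains to memory at any later point
            State n = s;
            n.buffered[t] = false;
            (t == 0 ? n.mem_a : n.mem_b) = 1;
            explore(n, tso, out);
            stepped = true;
        }
    }
    if (!stepped) {  // both threads done and both buffers empty
        assert(s.x >= 0 && s.y >= 0);
        out.insert(std::make_pair(s.x, s.y));
    }
}

// All (X, Y) results reachable under SC (tso = false) or the store-buffer model (tso = true).
Outcomes outcomes(bool tso) {
    Outcomes out;
    explore(State{}, tso, out);
    return out;
}
```

Calling `outcomes(false)` never yields the pair (0, 0), while `outcomes(true)` does, because both loads can run while both stores are still buffered.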
DATA-RACE-FREE MEMORY MODELS: A BRIEF HISTORY (CONTINUED)
SC for DRF [1990]: SC for all programs that are data-race-free
‒ Easy to understand, easy to build
‒ Hard to really optimize
‒ Relaxed model: permits any reordering that could only be observed with a race
[Figure: threads t1 to t4. ST A = 1; Release, then Acquire; X = LD A; Release, then Acquire; Y = LD A. The figure also shows ST B = 2 and Z = LD B.]
Data-race-free: synchronized, no simultaneous accesses
Must be SC: X = Y = 1
C++11 memory model [2008]: SC for DRF…for most
‒ SC for DRF for most users (default semantics)
‒ Relaxed for DRF for experts using explicitly ordered atomics (not SC; not our focus)
OpenCL memory model [2014]: SC for …??
‒ What is a race? Not the same as C++
‒ Need to understand a heterogeneous race!

HETEROGENEOUS SYNCHRONIZATION
In GPU platforms, the HW hierarchy is exposed
‒ Wavefront, work-group, device, system
Languages provide scoped synchronization
‒ OpenCL 2.0:
  ‒ flag.store(1, memory_order_seq_cst, memory_scope_work_group)
‒ CUDA:
  ‒ __threadfence{_block, _system}, __syncthreads
Why?
[Figure: OpenCL Execution Hierarchy. Grid, work-group, sub-group (hardware-specific size), and work-item, laid out along Dimensions X, Y, and Z.]

ALL SYNCHRONIZATION NOT CREATED EQUAL
[Figure: work-items WI1 to WI4 with write buffers, two L1 caches, and a shared L2.]
Reasonable GPU memory system: write-combining caches
‒ No read-for-ownership coherence
‒ Synchronize through flush/invalidate
Scopes have different costs:
‒ Synchronize w/ work-group (compute unit):
  ‒ e.g., write buffer flush/invalidate
‒ Synchronize w/ grid/device (GPU):
  ‒ e.g., L1 cache flush/invalidate

PROBLEM #1
SC for DRF: All synchronization is global
Work-group WG1
  wi1: ST A = 1; Release
  wi2: Acquire; X = LD A; Release
Work-group WG2
  wi3: Acquire; Y = LD A
  wi4: (no operations shown)
Cost label in the figure: L1 Cache Flush

PROBLEM #2
Need performance: just add scopes to SC for DRF?
‒ What happens when actors use different scopes?
Work-group WG1
  wi1: ST A = 1; Release_WG1 (write buffer flush)
  wi2: Acquire_WG1; X = LD A; Release_DEVICE (L1 cache flush)
Work-group WG2
  wi3: Acquire_DEVICE; Y = LD A (Y = ?)
  wi4: (no operations shown)
SC for DRF: No “simultaneous” accesses ⇒ race-free
SC for HRF: No “simultaneous” accesses ⇒ ??

IS EXAMPLE A RACE?
Work-group WG1
  wi1: ST A = 1; Release_WG1
  wi2: Acquire_WG1; X = LD A (1); Release_DEVICE
Work-group WG2
  wi3: Acquire_DEVICE; Y = LD A
  wi4: (no operations shown)
Option 1: Yes ⇒ HRF-direct
Option 2: No ⇒ HRF-indirect

HRF-direct
Correct synchronization: communicating actors use exact same scope
‒ Including all stops in transitive chain
Example is a race: Y undefined
‒ wi1 and wi3 communicate but use different scopes (Release_WG1 vs. Acquire_DEVICE)
The fix: wi1-wi2 use DEVICE scope

HRF-indirect
Correct synchronization:
‒ All paired synchronization uses exact same scope
‒ Transitive chains OK
Example is not a race: Y = 1
‒ Paired synchronization with same scope (Release_WG1/Acquire_WG1, Release_DEVICE/Acquire_DEVICE)
‒ Transitive chain through wi2

CHOOSE WISELY
                            HRF-direct                   HRF-indirect
Allowed HW Relaxations      Tomorrow’s potential         Today’s implementations
Target Workloads            Today’s regular workloads    Tomorrow’s irregular workloads
Scope Layout Flexibility    Heterogeneous                Hierarchical

CASE STUDY: TASK SHARING RUNTIME
[Figure: workers (sub-groups of work-items) take tasks from a hierarchy of queues: sub-group queues, work-group queues, and an NDRange (kernel) queue.]
Producer does not know scope of eventual consumer
HRF-direct:
‒ All queue sync is global
HRF-indirect:
‒ Sync uses smallest scope in common case
‒ Transitive chain formed by donor

CASE STUDY – TASK SHARING RUNTIME RESULTS
HRF-indirect: up to 10% performance improvement w/ irregular parallelism
[Figure: bar chart of performance normalized to HRF-direct (y-axis 0.95 to 1.15) for HRF-direct and HRF-indirect on input sets uts_t1, uts_t2, uts_t4, and uts_t5.]

SUMMARY
SC for DRF: Great for CPUs, doesn’t support scoped synchronization
SC for HRF: Define behavior with scoped synchronization
Proposing two specific models:
‒ HRF-direct
  ‒ Conflicts separated by synchronization of identical scope
  + Easier to define/understand
  + Permits more future HW optimizations
  − Prohibits some SW opts in current hardware
‒ HRF-indirect
  ‒ Relaxes identical requirement of HRF-direct:
  ‒ Scope Transitivity: A sync B, B sync C ⇒ A sync C
  + More accurate description of current hardware capabilities
  + Has some SW benefits (e.g., is more composable)
  − May limit future HW opts
In the paper:
− Formal definitions
− Sharp corners
− More dimensions for HRF options

Questions?
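The two race definitions summarized above can be sketched as a small checker. This is a simplified illustration, not the paper's formal definition: it assumes every release synchronizes with every later acquire of the same scope (and, for work-group scope, the same work-group), and all type and function names (`Event`, `has_race`, `slide_example`) are hypothetical. HRF-direct accepts an ordering only if it can be built from synchronization of one scope at a time; HRF-indirect lets edges of different scopes chain together.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class Kind { Store, Load, Release, Acquire };
enum class Scope { None, WorkGroup, Device };
enum class Model { HrfDirect, HrfIndirect };

struct Event {
    int wi;            // work-item id
    int wg;            // work-group id
    Kind kind;
    Scope scope;       // Scope::None for loads and stores
    std::string addr;  // empty for releases and acquires
};

using Matrix = std::vector<std::vector<bool>>;

static bool is_sync(const Event& e) {
    return e.kind == Kind::Release || e.kind == Kind::Acquire;
}

// Transitive closure of program order plus the sync edges of the enabled scopes.
// The trace `t` is listed in execution order.
static Matrix happens_before(const std::vector<Event>& t, bool use_wg, bool use_dev) {
    const std::size_t n = t.size();
    Matrix hb(n, std::vector<bool>(n, false));
    for (std::size_t i = 0; i < n; ++i) {
        assert(is_sync(t[i]) == (t[i].scope != Scope::None));  // only sync ops carry a scope
        for (std::size_t j = i + 1; j < n; ++j) {
            if (t[i].wi == t[j].wi) { hb[i][j] = true; continue; }  // program order
            bool paired = t[i].kind == Kind::Release && t[j].kind == Kind::Acquire &&
                          t[i].scope == t[j].scope;  // pairs must use the exact same scope
            if (!paired) continue;
            if (t[i].scope == Scope::WorkGroup) hb[i][j] = use_wg && t[i].wg == t[j].wg;
            else hb[i][j] = use_dev;
        }
    }
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (hb[i][k] && hb[k][j]) hb[i][j] = true;
    return hb;
}

// True if some pair of conflicting accesses is not ordered under the chosen model.
bool has_race(const std::vector<Event>& t, Model m) {
    const Matrix all = happens_before(t, true, true);   // scopes may chain (indirect)
    const Matrix wg = happens_before(t, true, false);   // work-group scope only
    const Matrix dev = happens_before(t, false, true);  // device scope only
    for (std::size_t i = 0; i < t.size(); ++i) {
        for (std::size_t j = i + 1; j < t.size(); ++j) {
            bool conflict = !is_sync(t[i]) && !is_sync(t[j]) && t[i].addr == t[j].addr &&
                            t[i].wi != t[j].wi &&
                            (t[i].kind == Kind::Store || t[j].kind == Kind::Store);
            if (!conflict) continue;
            bool ordered = (m == Model::HrfIndirect) ? all[i][j] : (wg[i][j] || dev[i][j]);
            if (!ordered) return true;
        }
    }
    return false;
}

// The wi1/wi2/wi3 example; `first_pair` is the scope used between wi1 and wi2.
std::vector<Event> slide_example(Scope first_pair) {
    std::vector<Event> t = {
        {1, 1, Kind::Store,   Scope::None,   "A"},
        {1, 1, Kind::Release, first_pair,    ""},
        {2, 1, Kind::Acquire, first_pair,    ""},
        {2, 1, Kind::Load,    Scope::None,   "A"},
        {2, 1, Kind::Release, Scope::Device, ""},
        {3, 2, Kind::Acquire, Scope::Device, ""},
        {3, 2, Kind::Load,    Scope::None,   "A"},
    };
    return t;
}
```

With `slide_example(Scope::WorkGroup)`, `has_race` reports a race under `Model::HrfDirect` and none under `Model::HrfIndirect`. Switching the first pair to `Scope::Device` (the fix from the HRF-direct slide) removes the race under both models.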

Backup

WORKLOADS
[Figure: flexible memory models and irregular workloads.]

CHOOSE WISELY
Model Complexity
‒ HRF-direct: Simpler. Scope selection easy: smallest common subscope of any possible communicator
‒ HRF-indirect: Harder. Scope selection harder: smallest common subscope of communicator synchronized without any third-party interactions
Implementation Flexibility
‒ HRF-direct: Very high. Compatible w/ current GPUs. Optimizations possible: e.g., selective flush on release
‒ HRF-indirect: OK for current GPUs. Allowed executions are superset of HRF-direct: will not permit as many optimizations
[Figure: work-items sharing an L1 cache, today vs. tomorrow.]
Performance on current HW
‒ HRF-direct: Slow sync. with irregular parallelism
Work-group WG1
  wi1: ST X = 1; Release_WG1 / Release_DEVICE
  wi2: Acquire_WG1 / Acquire_DEVICE; LD X (1); Release_DEVICE
Work-group WG2
  wi3: Acquire_DEVICE; LD X (??)
  wi4: (no operations shown)

CHOOSE WISELY
                               HRF-direct                               HRF-indirect
Implementation Flexibility     Very high                                OK for current GPUs
Model Complexity               Simpler                                  Harder
Performance on current HW      Slow sync. with irregular parallelism    Fast sync. with irregular parallelism
HRF-direct: good for current workloads & regular parallelism
HRF-indirect: better for irregular parallelism
Work-group WG1
  wi1: ST X = 1; Release_WG1
  wi2: Acquire_WG1; LD X (1); Release_DEVICE
Work-group WG2
  wi3: Acquire_DEVICE; LD X (1)
  wi4: (no operations shown)

CASE STUDY – TASK SHARING RUNTIME DETAILS
Runtime hides the OpenCL execution model
‒ Application only defines independent tasks
‒ Runtime uses persistent threads
The same function assigned to a wavefront
‒ Grouped together using “taskfronts”
[Figure: tasks grouped into taskfronts, which are placed in a queue.]
Synchronization occurs when tasks are enqueued/dequeued
‒ Enqueuer/producer does not know eventual consumer
‒ HRF-direct: must always use device/kernel scope synchronization
‒ HRF-indirect: only use device scope synchronization for kernel donations/consumption
Evaluation: Unbalanced Tree Search (UTS) synthetic workload
‒ Traversal of unbalanced graph whose topology is determined dynamically
‒ 4 different input sets

LD X = 1: CURRENT GPU HARDWARE
Current GPU: write-combining cache hierarchy
‒ WG release: flush stores from coalescer
‒ WG acquire: stall until coalescer is empty
‒ Device release: flush all dirty locations in L1 cache
‒ Device acquire: invalidate all valid locations in L1 cache
[Figure: wavefronts WF1 to WF4 with coalescers, two L1 caches, and a shared L2. X = 1 is shown at WF1, in both L1 caches, and in L2.]
wf1: ST X = 1; Release_S12
wf2: Acquire_S12; LD X (1); Release_SGlobal
wf3: Acquire_SGlobal; LD X (??)
wf4: (no operations shown)
wf3 sees (1)

LD X = ??: OPTIMIZED GPU
Optimized GPU: per-wavefront L1 cache management
‒ WG release: flush stores from coalescer
‒ WG acquire: stall until coalescer is empty
‒ Global release: flush locations written by releasing WF in L1
‒ Global acquire: invalidate locations read by acquiring WF in L1
[Figure: same hierarchy. X = 1 is shown at WF1 and in one L1 cache; X = 0 is shown in the other L1 cache and in L2.]
wf1: ST X = 1; Release_S12
wf2: Acquire_S12; LD X (1); Release_SGlobal
wf3: Acquire_SGlobal; LD X (??)
wf4: (no operations shown)
wf3 sees (0…or 1)

TWO HRF DEFINITIONS
Which scenario should be allowed?
‒ Can programmers assume transitivity?
  ‒ Permitting the “Current GPU Hardware” scenario
‒ …or must producers and consumers use the same scope?
  ‒ Permitting the “Optimized GPU” scenario
Our notation:
‒ HRF-direct (permit optimized GPU)
  ‒ Requires communicating actors to synchronize using the same scope
  ‒ Communication using different scopes is explicitly undefined
‒ HRF-indirect (permit transitivity on current GPU)
  ‒ Extends HRF-direct to support transitive communication using different scopes
  ‒ Allows indirect communication using a third party
Both models require that direct synchronization use the same matching scope
‒ i.e., an acq/rel pair using scopes that are subset/superset is undefined

HRF MODELS IMPLICATION ON PROGRAMMERS
What value will wf3 see?
‒ HRF-direct: final LD X forms a race (inexact scopes between wf1-wf3)
  ‒ Undefined behavior: don’t try!!
‒ HRF-indirect: No race (scope transitivity)
  ‒ SC behavior: wf3 loads (1)
Consequences:
‒ HRF-direct:
  ‒ Must use global scope w/o future sync. knowledge: example is slower on existing HW
‒ HRF-indirect:
  ‒ Can use local scope w/o future sync. knowledge: example is faster on existing HW
  ‒ Will NOT work with potentially optimized future GPU
  ‒ Transitivity better for function composition, etc.
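The difference between the “Current GPU Hardware” and “Optimized GPU” scenarios can be reproduced with a toy cache model. This is a rough sketch under stated assumptions, not a description of any real GPU: wavefronts 0 and 1 (wf1, wf2) share one L1, wavefronts 2 and 3 share the other, nothing is ever evicted, and an acquire invalidates only clean lines. The names `Gpu`, `Hw`, and `run_example` are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

enum class Hw { Current, Optimized };

struct Line { int value; bool dirty; int writer; };  // writer: wavefront that wrote the data

struct Gpu {
    Hw hw;
    std::map<std::string, int> l2;            // missing entries read as 0
    std::map<std::string, Line> l1[2];        // write-combining L1 caches
    std::map<std::string, int> coalescer[4];  // per-wavefront write buffer
    std::set<std::string> read_set[4];        // locations each wavefront read through L1

    explicit Gpu(Hw h) : hw(h) {}
    static int l1_of(int wf) { return wf / 2; }

    void store(int wf, const std::string& a, int v) { coalescer[wf][a] = v; }

    void flush_coalescer(int wf) {
        for (const auto& kv : coalescer[wf]) l1[l1_of(wf)][kv.first] = Line{kv.second, true, wf};
        coalescer[wf].clear();
    }

    void release_wg(int wf) { flush_coalescer(wf); }
    void acquire_wg(int wf) { flush_coalescer(wf); }  // "stall until empty" modeled as a drain

    void release_global(int wf) {
        flush_coalescer(wf);
        for (auto& kv : l1[l1_of(wf)]) {
            Line& line = kv.second;
            // Current: flush every dirty line. Optimized: only lines this wavefront wrote.
            bool selected = hw == Hw::Current || line.writer == wf;
            if (line.dirty && selected) { l2[kv.first] = line.value; line.dirty = false; }
        }
    }

    void acquire_global(int wf) {
        auto& cache = l1[l1_of(wf)];
        for (auto it = cache.begin(); it != cache.end();) {
            // Current: invalidate every clean line. Optimized: only lines this wavefront read.
            bool selected = hw == Hw::Current || read_set[wf].count(it->first) > 0;
            if (!it->second.dirty && selected) it = cache.erase(it); else ++it;
        }
        read_set[wf].clear();
    }

    int load(int wf, const std::string& a) {
        auto buffered = coalescer[wf].find(a);
        if (buffered != coalescer[wf].end()) return buffered->second;
        auto& cache = l1[l1_of(wf)];
        auto hit = cache.find(a);
        if (hit == cache.end()) {  // L1 miss: fill from L2
            auto m = l2.find(a);
            int v = (m == l2.end()) ? 0 : m->second;
            hit = cache.emplace(a, Line{v, false, -1}).first;
        }
        read_set[wf].insert(a);
        return hit->second.value;
    }
};

// Runs the wf1/wf2/wf3 example; returns (value seen by wf2, value seen by wf3).
std::pair<int, int> run_example(Hw hw) {
    Gpu gpu(hw);
    gpu.store(0, "X", 1);           // wf1: ST X = 1
    gpu.release_wg(0);              // wf1: Release_S12
    gpu.acquire_wg(1);              // wf2: Acquire_S12
    int wf2_sees = gpu.load(1, "X");
    assert(wf2_sees == 1);          // wf2 shares wf1's L1 in this model
    gpu.release_global(1);          // wf2: Release_SGlobal
    gpu.acquire_global(2);          // wf3: Acquire_SGlobal
    int wf3_sees = gpu.load(2, "X");
    return std::make_pair(wf2_sees, wf3_sees);
}
```

In this model `run_example(Hw::Current)` gives wf3 the value 1, because wf2's global release flushes wf1's dirty line along with everything else. `run_example(Hw::Optimized)` gives wf3 the value 0, because the line written by wf1 never leaves the first L1. Real hardware could still return 1 in the optimized case (for example after an eviction), which is why the slide says “0…or 1”.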

HRF Design Space
[Figure: HRF design space; the only recoverable label is “Others”.]

PROGRAMMABLE PIPELINE EXAMPLE
[Figure: three pipeline stages. Stage 1 and Stage 2 communicate within Scope 1-2 (L1-2), Stage 2 and Stage 3 within Scope 2-3 (L2-3), and Scope Global covers L1/L2/DRAM.]
[Figure: threads t1 to t4 under two L1 caches and a shared L2, with scopes S12, S34, and SGlobal.]

ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved.