HETEROGENEOUS-RACE-FREE
MEMORY MODELS
DEREK R. HOWER, BLAKE A. HECHTMAN,
BRADFORD M. BECKMANN, BENEDICT R. GASTER,
MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD
ASPLOS 3/4/2014
EXECUTIVE SUMMARY
 From the CPU world: SC for Data-race-free (SC for DRF)
‒ Relaxed HW, precise semantics, sanity-preserving
 SC for DRF: all synchronization is global, but GPUs use scoped synchronization
‒ SC for DRF alone would incur an unpalatable performance loss
 Accept the performance loss for the sake of sanity?
 SC for Heterogeneous-race-free (SC for HRF)
‒ Reclaim performance, maintain sanity
 Two specific models:
                      HRF-direct              HRF-indirect
Allowed Relaxations   Tomorrow’s potential    Today’s implementations
Target Workloads      Today’s regular         Tomorrow’s irregular
 Case study: GPGPU Task Sharing Runtime
‒ HRF-indirect provides up to 10% performance improvement
2 | HETEROGENEOUS-RACE-FREE MEMORY MODELS| MARCH 4, 2014
OUTLINE
DRF, SCOPES, AND GPU HARDWARE
HRF-direct vs. HRF-indirect
Case Study: GPGPU Task Runtime
DATA-RACE-FREE MEMORY MODELS
A BRIEF HISTORY
 Sequential Consistency (SC) [1979]: as if system is a multitasking uniprocessor
‒ Easy to understand
‒ Hard to build → most real systems are more relaxed (e.g., TSO [1991])
t1                t2
i1: ST A = 1      i3: ST B = 1
i2: X = LD B      i4: Y = LD A
Impossible in SC: X = Y = 0
Possible in TSO:  X = Y = 0
Relaxations: hard to understand!
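The litmus test above can be checked by brute force. The following is a minimal sketch, not a model of any real processor: it assumes a simplified TSO in which each thread has a store buffer and a store reaches memory at a separate `DRAIN` event. The helper names (`interleavings`, `thread_orders`, `outcomes`) are hypothetical.

```python
# Store-buffering litmus test from the slide.
# Each op is (thread, kind, address, value-or-destination-register).
T1 = [("t1", "ST", "A", 1), ("t1", "LD", "B", "X")]
T2 = [("t2", "ST", "B", 1), ("t2", "LD", "A", "Y")]


def interleavings(a, b):
    """Yield every merge of lists a and b that keeps the order within each."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(order):
    """Execute one global order; a store is invisible to others until DRAIN."""
    mem = {"A": 0, "B": 0}
    buf = {"t1": {}, "t2": {}}
    regs = {}
    for thread, kind, addr, arg in order:
        if kind == "ST":
            buf[thread][addr] = arg
        elif kind == "DRAIN":
            mem[addr] = buf[thread].pop(addr)
        else:  # LD: forward from the thread's own buffer, else read memory
            regs[arg] = buf[thread].get(addr, mem[addr])
    return regs["X"], regs["Y"]


def thread_orders(ops, relaxed):
    """SC: the store drains before the load. Relaxed: it may drain after."""
    st, ld = ops
    drain = (st[0], "DRAIN", st[2], st[3])
    orders = [[st, drain, ld]]
    if relaxed:
        orders.append([st, ld, drain])
    return orders


def outcomes(relaxed):
    """Set of reachable (X, Y) results."""
    results = set()
    for o1 in thread_orders(T1, relaxed):
        for o2 in thread_orders(T2, relaxed):
            for order in interleavings(o1, o2):
                results.add(run(order))
    return results


print("SC :", sorted(outcomes(relaxed=False)))
print("TSO:", sorted(outcomes(relaxed=True)))
```

Under these assumptions the SC run never produces X = Y = 0, while the store-buffer run does, because both loads can execute while both stores are still buffered.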
 SC for DRF [1990]: SC for all programs that are data-race-free
‒ Easy to understand, easy to build
‒ Hard to really optimize
‒ Relaxed model: permits any reordering that could only be observed with a race
[Figure: four threads t1 to t4; the figure also shows ST B = 2 and Z = LD B]
t1: ST A = 1; Release
t2: Acquire; X = LD A; Release
t3: Acquire; Y = LD A
Data-race-free: Synchronized, no simultaneous accesses
Must be SC → X = Y = 1
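The "synchronized, no simultaneous accesses" condition can be sketched as a happens-before check over one execution trace. This is an illustration under simplifying assumptions, not the formal SC for DRF definition: every release is treated as synchronizing with every later acquire of the same synchronization variable, and the variable names `s1` and `s2` are hypothetical (the slide does not name them).

```python
def transitive_closure(edges, n):
    """Return all (i, j) reachable through the given edges over n events."""
    reach = [[(i, j) in edges for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return {(i, j) for i in range(n) for j in range(n) if reach[i][j]}


def races(trace):
    """Find conflicting data accesses that are not ordered by happens-before.

    trace is a list of (thread, op, name) in execution order, where op is
    "ST" or "LD" on a data location, or "REL" or "ACQ" on a sync variable.
    """
    n = len(trace)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            ti, opi, xi = trace[i]
            tj, opj, xj = trace[j]
            if ti == tj:
                edges.add((i, j))          # program order
            elif opi == "REL" and opj == "ACQ" and xi == xj:
                edges.add((i, j))          # release synchronizes with acquire
    hb = transitive_closure(edges, n)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            ti, opi, xi = trace[i]
            tj, opj, xj = trace[j]
            is_data = opi in ("ST", "LD") and opj in ("ST", "LD")
            if is_data and ti != tj and xi == xj and "ST" in (opi, opj):
                if (i, j) not in hb:
                    found.append((i, j))
    return found


synchronized = [
    ("t1", "ST", "A"), ("t1", "REL", "s1"),
    ("t2", "ACQ", "s1"), ("t2", "LD", "A"), ("t2", "REL", "s2"),
    ("t3", "ACQ", "s2"), ("t3", "LD", "A"),
]
unsynchronized = [("t1", "ST", "A"), ("t3", "LD", "A")]

print(races(synchronized))     # no unordered conflicts
print(races(unsynchronized))   # the store and the load race
```

With the release/acquire chain in place, both loads of A are ordered after the store, which is why the slide can conclude X = Y = 1.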
 C++11 memory model [2008]: SC for DRF…for most
‒ SC for DRF for most users (default semantics)
‒ Relaxed for DRF for experts using explicitly ordered atomics (not SC): not our focus
 OpenCL memory model [2014]: SC for …??
‒ What is a race? Not the same as C++
‒ Need to understand a heterogeneous race!
HETEROGENEOUS SYNCHRONIZATION
 In GPU platform, HW hierarchy is exposed
‒ Wavefront, work-group, device, system
 Languages provide scoped synchronization
‒ OpenCL 2.0:
‒ flag.store(1, memory_order_seq_cst, memory_scope_work_group)
‒ CUDA:
‒ __threadfence{_block, _system},
‒ __syncthreads
 Why?
[Figure: OpenCL Execution Hierarchy. A grid (Dimensions X, Y, Z) is made of work-groups; a work-group (Dimensions X, Y, Z) is made of sub-groups (hardware-specific size) of work-items]
ALL SYNCHRONIZATION NOT CREATED EQUAL
[Figure: work-items WI1 to WI4 with write buffers, two L1 caches, one shared L2]
 Reasonable GPU memory system: Write combining caches
‒ No read-for-ownership coherence
‒ Synchronize through flush/invalidate
 Scopes have different costs:
‒ Synchronize w/ Work-group (Compute Unit):
‒ e.g., write buffer flush/invalidate
‒ Synchronize w/ Grid/device (GPU):
‒ e.g., L1 cache flush/invalidate
PROBLEM #1
 SC for DRF: All synchronization is global
[Figure: Work-group WG1 = {wi1, wi2}, Work-group WG2 = {wi3, wi4}; threads t1 to t4 run wi1 to wi4]
wi1: ST A = 1; Release
wi2: Acquire; X = LD A; Release
wi3: Acquire; Y = LD A
‒ Each release/acquire pair costs an L1 cache flush
PROBLEM #2
 Need performance: just add scopes to SC for DRF?
‒ What happens when actors use different scopes?
[Figure: Work-group WG1 = {wi1, wi2}, Work-group WG2 = {wi3, wi4}]
wi1: ST A = 1; Release_WG1 (write buffer flush)
wi2: Acquire_WG1; X = LD A; Release_DEVICE (L1 cache flush)
wi3: Acquire_DEVICE; Y = LD A → Y = ?
SC for DRF: No “simultaneous” accesses → Race-free
SC for HRF: No “simultaneous” accesses → ??
IS EXAMPLE A RACE?
[Figure: Work-group WG1 = {wi1, wi2}, Work-group WG2 = {wi3, wi4}]
wi1: ST A = 1; Release_WG1
wi2: Acquire_WG1; X = LD A (1); Release_DEVICE
wi3: Acquire_DEVICE; Y = LD A
Option 1: Yes → HRF-direct
Option 2: No → HRF-indirect
HRF-direct
 Correct synchronization: communicating actors use exact same scope
‒ Including all stops in transitive chain
 Example is a race: Y undefined
[Figure: Work-group WG1 = {wi1, wi2}, Work-group WG2 = {wi3, wi4}]
wi1: ST A = 1; Release_WG1
wi2: Acquire_WG1; X = LD A; Release_DEVICE
wi3: Acquire_DEVICE; Y = LD A
‒ wi1, wi3 communicate
‒ Use different scope
 The fix: wi1 and wi2 also use DEVICE scope
wi1: ST A = 1; Release_DEVICE
wi2: Acquire_DEVICE; X = LD A; Release_DEVICE
wi3: Acquire_DEVICE; Y = LD A
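The HRF-direct rule can be sketched as a race check in which ordering is only built from synchronization of one scope at a time, so a chain that switches scope does not order its endpoints. This is one reading of the slide, not the paper's formal definition; the trace encoding and the names `hrf_direct_races`, `example`, and `fixed` are hypothetical, and scope membership of each work-item is not checked.

```python
def transitive_closure(edges, n):
    """Return all (i, j) reachable through the given edges over n events."""
    reach = [[(i, j) in edges for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return {(i, j) for i in range(n) for j in range(n) if reach[i][j]}


def hrf_direct_races(trace):
    """trace: list of (actor, op, name); op is ST/LD (data) or REL/ACQ (scope)."""
    n = len(trace)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    po = {(i, j) for i, j in pairs if trace[i][0] == trace[j][0]}
    scopes = {name for _, op, name in trace if op in ("REL", "ACQ")}
    ordered = set()
    for scope in scopes:
        # A release only pairs with a later acquire of the exact same scope.
        sw = {(i, j) for i, j in pairs
              if trace[i][1:] == ("REL", scope) and trace[j][1:] == ("ACQ", scope)}
        # Chains are closed within one scope only: no mixing across scopes.
        ordered |= transitive_closure(po | sw, n)
    found = []
    for i, j in pairs:
        (ai, opi, xi), (aj, opj, xj) = trace[i], trace[j]
        is_data = opi in ("ST", "LD") and opj in ("ST", "LD")
        if is_data and ai != aj and xi == xj and "ST" in (opi, opj):
            if (i, j) not in ordered:
                found.append((i, j))
    return found


example = [
    ("wi1", "ST", "A"), ("wi1", "REL", "WG1"),
    ("wi2", "ACQ", "WG1"), ("wi2", "LD", "A"), ("wi2", "REL", "DEVICE"),
    ("wi3", "ACQ", "DEVICE"), ("wi3", "LD", "A"),
]
fixed = [
    ("wi1", "ST", "A"), ("wi1", "REL", "DEVICE"),
    ("wi2", "ACQ", "DEVICE"), ("wi2", "LD", "A"), ("wi2", "REL", "DEVICE"),
    ("wi3", "ACQ", "DEVICE"), ("wi3", "LD", "A"),
]

print(hrf_direct_races(example))  # wi1's store races with wi3's load
print(hrf_direct_races(fixed))    # no race once every stop uses DEVICE
```

In the original example only the store by wi1 (event 0) and the load by wi3 (event 6) are unordered; the load by wi2 is fine because wi1 and wi2 synchronize with the same scope.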
HRF-indirect
 Correct synchronization:
‒ All paired synchronization uses exact same scope
‒ Transitive chains OK
 Example is not a race: Y = 1
[Figure: Work-group WG1 = {wi1, wi2}, Work-group WG2 = {wi3, wi4}]
wi1: ST A = 1; Release_WG1
wi2: Acquire_WG1; X = LD A; Release_DEVICE
wi3: Acquire_DEVICE; Y = LD A
‒ Paired synchronization with same scope (wi1-wi2 use WG1, wi2-wi3 use DEVICE)
‒ Transitive chain through wi2
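The difference between the two models can be sketched with one flag: both pair a release only with an acquire of the same scope, but HRF-indirect additionally closes the resulting order across scopes. This is an illustrative reading of the slides, not the paper's formal definition; the encoding and the names `races` and `example` are hypothetical, and scope membership is not checked.

```python
def transitive_closure(edges, n):
    """Return all (i, j) reachable through the given edges over n events."""
    reach = [[(i, j) in edges for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return {(i, j) for i in range(n) for j in range(n) if reach[i][j]}


def races(trace, indirect):
    """trace: list of (actor, op, name); op is ST/LD (data) or REL/ACQ (scope)."""
    n = len(trace)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    po = {(i, j) for i, j in pairs if trace[i][0] == trace[j][0]}
    scopes = {name for _, op, name in trace if op in ("REL", "ACQ")}
    ordered = set()
    for scope in scopes:
        # Paired synchronization must use the exact same scope in both models.
        sw = {(i, j) for i, j in pairs
              if trace[i][1:] == ("REL", scope) and trace[j][1:] == ("ACQ", scope)}
        ordered |= transitive_closure(po | sw, n)
    if indirect:
        # HRF-indirect: chains may pass through a third party across scopes.
        ordered = transitive_closure(ordered, n)
    found = []
    for i, j in pairs:
        (ai, opi, xi), (aj, opj, xj) = trace[i], trace[j]
        is_data = opi in ("ST", "LD") and opj in ("ST", "LD")
        if is_data and ai != aj and xi == xj and "ST" in (opi, opj):
            if (i, j) not in ordered:
                found.append((i, j))
    return found


example = [
    ("wi1", "ST", "A"), ("wi1", "REL", "WG1"),
    ("wi2", "ACQ", "WG1"), ("wi2", "LD", "A"), ("wi2", "REL", "DEVICE"),
    ("wi3", "ACQ", "DEVICE"), ("wi3", "LD", "A"),
]

print("HRF-direct  :", races(example, indirect=False))
print("HRF-indirect:", races(example, indirect=True))
```

Under HRF-indirect the store by wi1 is ordered before the load by wi3 through wi2, so the example is race-free and the load must return 1.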
CHOOSE WISELY
                           HRF-direct                   HRF-indirect
Allowed HW Relaxations     Tomorrow’s potential         Today’s implementations
Target Workloads           Today’s regular workloads    Tomorrow’s irregular workloads
Scope Layout Flexibility   Heterogeneous                Hierarchical
CASE STUDY: TASK SHARING RUNTIME
HRF-direct:
- All Queue sync is global
- Producer does not know eventual consumer
HRF-indirect:
- Sync uses smallest scope in common case
- Transitive chain formed by donator
- Producer does not know scope of eventual consumer
[Figure: workers (work-items grouped into sub-groups) take tasks from a hierarchy of queues: one Sub-group Queue per sub-group, one Work-group Queue per work-group, and one NDRange (Kernel) Queue]
CASE STUDY – TASK SHARING RUNTIME
RESULTS
[Figure: bar chart of performance normalized to HRF-direct (y-axis 0.95 to 1.15), HRF-direct vs. HRF-indirect, for input sets uts_t1, uts_t2, uts_t4, uts_t5]
HRF-indirect: up to 10% performance improvement w/ irregular parallelism
SUMMARY
 SC for DRF: Great for CPUs, doesn’t support scoped synchronization
 SC for HRF: Define behavior with scoped synchronization
 Proposing two specific models:
‒ HRF-direct
‒ Conflicts separated by synchronization of identical scope
+ Easier to define/understand
+ Permits more future HW optimizations
− Prohibits some SW opts in current hardware
‒ HRF-indirect
‒ Relaxes the identical-scope requirement of HRF-direct:
‒ Scope Transitivity: A sync B, B sync C → A sync C
+ More accurate description of current hardware capabilities
+ Has some SW benefits (e.g., is more composable)
− May limit future HW opts
 In the paper:
− Formal definitions
− Sharp corners
− More dimensions for HRF options
Questions?
Backup
WORKLOADS
[Figure: flexible memory models, irregular workloads]
CHOOSE WISELY
Model Complexity
‒ HRF-direct: simpler. Scope selection easy: smallest common subscope of any possible communicator
‒ HRF-indirect: harder. Scope selection harder: smallest common subscope of communicator synchronized without any third-party interactions
Implementation Flexibility
‒ Both are compatible w/ current GPUs
‒ HRF-direct: very high. Optimizations possible, e.g., selective flush on release
‒ HRF-indirect: OK for current GPUs. Allowed executions are a superset of HRF-direct: will not permit as many optimizations
[Figure: work-items sharing an L1 cache, today vs. tomorrow]
Performance on current HW
‒ HRF-direct: slow sync. with irregular parallelism. Good for current workloads & regular parallelism
‒ HRF-indirect: fast sync. with irregular parallelism. Better for irregular parallelism
Example (Work-group WG1 = {wi1, wi2}, Work-group WG2 = {wi3, wi4}):
wi1: ST X = 1; Release_WG1
wi2: Acquire_WG1; LD X (1); Release_DEVICE
wi3: Acquire_DEVICE; LD X
‒ HRF-direct: final LD X (??) unless wi1 and wi2 use Release_DEVICE / Acquire_DEVICE
‒ HRF-indirect: final LD X (1)
CASE STUDY – TASK SHARING RUNTIME
DETAILS
 Runtime hides the OpenCL execution model
‒ Application only defines independent tasks
‒ Runtime uses persistent threads
 The same function assigned to a wavefront
‒ Grouped together using “taskfronts”
[Figure: tasks grouped into taskfronts, which are held in a queue]
 Synchronization occurs when tasks are enqueued/dequeued
‒ Enqueuer/producer does not know eventual consumer
‒ HRF-direct: must always use device/kernel scope synchronization
‒ HRF-indirect: only use device scope synchronization for kernel donations/consumption
 Evaluation: Unbalanced Tree Search (UTS) synthetic workload
‒ Traversal of unbalanced graph whose topology is determined dynamically
‒ 4 different input sets
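The scope choice described above can be sketched as a small lookup. This is an illustration only: the queue levels follow the case-study figure (sub-group, work-group, NDRange/kernel queues), but the mapping from a queue level to a scope name, and the names `QUEUE_SCOPE`, `sync_scope`, and `device_syncs`, are assumptions rather than the runtime's actual code.

```python
# Hypothetical mapping from the queue a task is placed in to the smallest
# scope that covers every worker able to dequeue from that queue.
QUEUE_SCOPE = {
    "sub_group": "sub_group",
    "work_group": "work_group",
    "ndrange": "device",
}


def sync_scope(queue_level, model):
    """Scope used for the release/acquire around one enqueue or dequeue."""
    if queue_level not in QUEUE_SCOPE:
        raise ValueError(f"unknown queue level: {queue_level}")
    if model == "HRF-direct":
        # The producer cannot know the eventual consumer, so it must always
        # use the scope that covers any possible consumer.
        return "device"
    if model == "HRF-indirect":
        # Smallest scope in the common case; a later donation to a wider
        # queue forms the transitive chain.
        return QUEUE_SCOPE[queue_level]
    raise ValueError(f"unknown model: {model}")


def device_syncs(operations, model):
    """Count how many queue operations need device-scope synchronization."""
    return sum(1 for level in operations if sync_scope(level, model) == "device")


# A made-up sequence of queue operations, mostly local.
ops = ["sub_group"] * 6 + ["work_group"] * 3 + ["ndrange"]
print("HRF-direct  :", device_syncs(ops, "HRF-direct"))
print("HRF-indirect:", device_syncs(ops, "HRF-indirect"))
```

The operation mix in `ops` is invented for illustration; it only shows that under HRF-indirect the expensive device-scope synchronization is confined to kernel-level donations and consumption.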
LD X = 1: CURRENT GPU HARDWARE
 Current GPU: write-combining cache hierarchy
‒ WG release → flush stores from coalescer
‒ WG acquire → stall until coalescer is empty
‒ Device release → flush all dirty locations in L1 cache
‒ Device acquire → invalidate all valid locations in L1 cache
[Figure: WF1 to WF4 with a coalescer, two L1 caches, and L2; X = 1 moves from WF1 into its L1 and then into L2]
wf1: ST X = 1; Release_S12
wf2: Acquire_S12; LD X (1); Release_SGlobal
wf3: Acquire_SGlobal; LD X (??)
Result: wf3 sees (1)
LD X = ??: OPTIMIZED GPU
 Optimized GPU: per-wavefront L1 cache management
‒ WG release → flush stores from coalescer
‒ WG acquire → stall until coalescer is empty
‒ Global release → flush locations written by releasing WF in L1
‒ Global acquire → invalidate locations read by acquiring WF in L1
[Figure: WF1 to WF4 with a coalescer, two L1 caches, and L2; X = 1 stays in the L1 of WF1 while L2 and the other L1 hold X = 0]
wf1: ST X = 1; Release_S12
wf2: Acquire_S12; LD X (1); Release_SGlobal
wf3: Acquire_SGlobal; LD X (??)
Result: wf3 sees (0…or 1)
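The two hardware scenarios can be compared with a toy cache model. This is a sketch under strong assumptions, not a description of a real GPU: execution is sequential, wf1 and wf2 share one compute unit (scope S12) and wf3 and wf4 share another, and "stall until coalescer is empty" is modelled by draining the coalescer. The class `GPU` and function `run` are hypothetical names.

```python
class GPU:
    def __init__(self, optimized):
        self.optimized = optimized              # per-wavefront L1 management?
        self.cu_of = {"wf1": 0, "wf2": 0, "wf3": 1, "wf4": 1}
        self.coalescer = [[], []]               # pending stores per compute unit
        self.l1 = [{}, {}]                      # per compute unit: addr -> line
        self.l2 = {}                            # shared; missing addresses read 0

    def store(self, wf, addr, value):
        self.coalescer[self.cu_of[wf]].append((wf, addr, value))

    def wg_release(self, wf):
        """WG release: flush stores from the coalescer into the L1."""
        cu = self.cu_of[wf]
        for writer, addr, value in self.coalescer[cu]:
            line = self.l1[cu].setdefault(addr, {"readers": set()})
            line.update(value=value, dirty=True, writer=writer)
        self.coalescer[cu].clear()

    def wg_acquire(self, wf):
        """WG acquire: wait for the coalescer to empty (modelled as a drain)."""
        self.wg_release(wf)

    def load(self, wf, addr):
        cu = self.cu_of[wf]
        if addr not in self.l1[cu]:             # miss: fill from L2
            self.l1[cu][addr] = {"value": self.l2.get(addr, 0), "dirty": False,
                                 "writer": None, "readers": set()}
        line = self.l1[cu][addr]
        line["readers"].add(wf)
        return line["value"]

    def device_release(self, wf):
        """Current: flush all dirty lines. Optimized: only lines wf wrote."""
        self.wg_release(wf)
        for addr, line in self.l1[self.cu_of[wf]].items():
            if line["dirty"] and (not self.optimized or line["writer"] == wf):
                self.l2[addr] = line["value"]
                line["dirty"] = False

    def device_acquire(self, wf):
        """Current: invalidate all clean lines. Optimized: only lines wf read."""
        self.wg_acquire(wf)
        l1 = self.l1[self.cu_of[wf]]
        for addr in list(l1):
            line = l1[addr]
            if not line["dirty"] and (not self.optimized or wf in line["readers"]):
                del l1[addr]


def run(optimized):
    gpu = GPU(optimized)
    gpu.store("wf1", "X", 1)
    gpu.wg_release("wf1")                       # Release_S12
    gpu.wg_acquire("wf2")                       # Acquire_S12
    seen_by_wf2 = gpu.load("wf2", "X")
    gpu.device_release("wf2")                   # Release_SGlobal
    gpu.device_acquire("wf3")                   # Acquire_SGlobal
    seen_by_wf3 = gpu.load("wf3", "X")
    return seen_by_wf2, seen_by_wf3


print("current  :", run(optimized=False))
print("optimized:", run(optimized=True))
```

In this model the optimized release by wf2 skips the line written by wf1, so X = 1 never reaches L2 and wf3 reads the stale 0; the current-hardware policy flushes every dirty line and wf3 reads 1.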
TWO HRF DEFINITIONS
 Which scenario should be allowed?
‒ Can programmers assume transitivity?
‒ Permitting the “Current GPU Hardware” scenario
‒ …or must producers and consumers use the same scope?
‒ Permitting the “Optimized GPU” scenario
 Our notation:
‒ HRF-direct (permit optimized GPU)
‒ Requires communicating actors to synchronize using the same scope
‒ Communication using different scopes is explicitly undefined
‒ HRF-indirect (permit transitivity on current GPU)
‒ Extends HRF-direct to support transitive communication using different scopes
‒ Allows indirect communication using a third party
 Both models require that direct synchronization use matching scopes
‒ i.e., an acq/rel pair whose scopes are only a subset/superset of each other is undefined
HRF MODELS: IMPLICATIONS FOR PROGRAMMERS
wf1: ST X = 1; Release_S12
wf2: Acquire_S12; LD X (1); Release_SGlobal
wf3: Acquire_SGlobal; LD X (??)
 What value will wf3 see?
‒ HRF-direct: final LD X forms a race (inexact scopes between wf1-wf3)
‒ Undefined behavior → don’t try!!
‒ HRF-indirect: No race (scope transitivity)
‒ SC behavior → wf3 loads (1)
 Consequences:
‒ HRF-direct:
‒ Must use global scope w/o future sync. knowledge → example is slower on existing HW
‒ HRF-indirect:
‒ Can use local scope w/o future sync. knowledge → example faster on existing HW
‒ Will NOT work with potentially optimized future GPU
‒ Transitivity better for function composition, etc.
HRF Design Space
Others
PROGRAMMABLE PIPELINE EXAMPLE
[Figure: pipeline of Stage 1, Stage 2, Stage 3 with scopes Scope 1-2 and Scope 2-3 (caches L1-2 and L2-3) inside Scope Global (L1/L2/DRAM)]
[Figure: threads t1 to t4 under two L1 caches and a shared L2, with scopes S12, S34, and SGlobal]
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to
time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO
EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance
Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.