Acoherent
Shared Memory
Derek R. Hower
Ph.D. Defense
July 16, 2012
Executive Summary
[Figure: the same two-processor cache hierarchy seen through two lenses: a coherent view that hides the caches behind a simple abstraction, and an acoherent view (e.g., for a GPU) that exposes checkout (CO) / checkin (CI) to private storage]
Coherent View:
- Complex implementation
- Hides caches (bad?!)
- High overhead
Acoherent View:
- Simple implementation
- Abstracts caches
- Low overhead
2
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
3
Trends
 Energy Matters
 Dark Silicon/Mobile/Datacenter
 < 50% of processor powered by 20241
 Complexity Matters
 Lower barrier to entry for accelerators
 Area Matters
 New tech nodes are not cheaper [2]
 Memory: may be difficult to turn off
 e.g., S-NUCA
 Compatibility doesn’t matter
 Vertical integration is the new black
[1] Esmaeilzadeh et al., ISCA 2011
[2] ExtremeTech, 2012
4
We must change, and we can change.
The Problem With Coherence
 Wrong abstraction
 Optimized for fine-grained, share-everything
• Programs aren’t!
 Makes SW isolation hard
 Hypothesis: SW will want control over data placement
 Impedes HW specialization
 Does your multicore ASIC need a coherence controller?
 Coherent GPUs?
 Efficiency problems
 Directories take space/broadcasts take energy
• e.g., 14% of cache is dedicated to the directory on a 4-core die [1]
[1] Stackhouse et al., ISSCC 2008
5
Rethinking Coherence: Goals
 Maintain programmer sanity
 Keep shared memory
 Minimal compatibility change
 Expose hardware capabilities
 Let SW guide memory management -> semantics
 Simple hardware
 Lower cost of entry for accelerators
 Solution: Acoherent Shared Memory
6
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
7
ASM Model Basics
 Replace black box with simple hierarchy
 Still flat, linear address space
 SW gets private storage
 Manage with CVS-like checkout/checkin
[Figure: two processors, each with private storage, connected to a shared level via checkout (CO) and checkin (CI) paths]
8
Checkout/Checkin
 Checkout: pull data into private storage
 Checkin: publish local updates globally
 Checkout/Checkin are not synchronization primitives
 - Closer to a FENCE
 Granularity?
(An illustrative C sketch of checkout/checkin around a shared segment follows.)
9
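As a concrete illustration (not from the defense itself): the checkout/checkin calls below are hypothetical C bindings for the model's operations, and ordinary synchronization, here a pthread barrier, still decides who runs first, since CO/CI are not locks.

#include <pthread.h>

extern void checkout(void *segment);     /* hypothetical CO binding */
extern void checkin(void *segment);      /* hypothetical CI binding */

int shared[1024];                        /* lives in an acoherent segment */
pthread_barrier_t sync_point;            /* initialized elsewhere with count 2 */

void *writer(void *arg) {
    checkout(shared);                    /* pull the segment into private storage */
    for (int i = 0; i < 1024; i++)
        shared[i] = i;
    checkin(shared);                     /* publish local updates globally */
    pthread_barrier_wait(&sync_point);   /* hand off to the reader */
    return 0;
}

void *reader(void *arg) {
    pthread_barrier_wait(&sync_point);   /* wait for the writer's checkin */
    checkout(shared);                    /* pull the published data in */
    long sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += shared[i];
    checkin(shared);
    return (void *)sum;
}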
Segments
 Compromise: Memory Segments
– Linear partition of address space
– CO/CI one segment at a time
 Observation: Programs are already segmented
 Can re-use the existing layout (Stack, Heap, Data, BSS, Code)
[Figure: typical process memory layout; heap/data segments are the typical CO/CI granularity in existing C code]
10
Segment Types
 Not all memory wants/needs acoherence
 Segment types give different “views”
 Communicate semantic information to HW
 Available types: Private, Coherent-RW, Coherent-RO (shared, read-only), Acoherent, Acoherent Device
[Figure: segment types overlaid on a typical memory layout; Stack: Private; Heap/Data/BSS: Acoherent (shared); Code: Coherent-RO]
(An illustrative setup sketch follows.)
11
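Purely for illustration, software might label segments with a small setup call like the one below; the API, type names, and linker symbols are assumptions, not the ASM interface (and the case study later notes most of this assignment is automatic):

/* Hypothetical segment-type labels and setup call, for illustration only. */
enum seg_type { SEG_PRIVATE, SEG_COHERENT_RW, SEG_COHERENT_RO, SEG_ACOHERENT };

extern void set_segment_type(void *base, unsigned long len, enum seg_type type);

/* Assumed linker-provided segment bounds. */
extern char __stack_base[], __text_base[], __heap_base[];
extern unsigned long __stack_len, __text_len, __heap_len;

void configure_segments(void) {
    set_segment_type(__stack_base, __stack_len, SEG_PRIVATE);      /* stack: private */
    set_segment_type(__text_base,  __text_len,  SEG_COHERENT_RO);  /* code: shared, read-only */
    set_segment_type(__heap_base,  __heap_len,  SEG_ACOHERENT);    /* heap/data: acoherent */
}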
Managing Finite Resources
 Model so far is strong acoherence
 Likely requires prohibitive HW resources
 Also define weak acoherence and best-effort acoherence
 Still useful to software/hardware
 Weak acoherence:
 Data visible early (before checkin)
 (Synchronized => not a problem)
 Best-effort acoherence:
 Spontaneous checkouts at any time
 • + SW notification
 All-or-nothing
 (Hybrid runtimes => not a problem)
12
Case Study: pthreads
Task: convert existing pthreads code to ASM

Step 1: Assign segments
 Automatic: stack in a private segment; text in a coherent-RO segment; globals/heap (shared_data) in an acoherent segment
 Runtime/library: synchronization state in a coherent-RW segment

Step 2: Checkout/Checkin
 CI/CO the global/heap segments at synchronization (the communication point)
 Automatic: the library performs the checkin/checkout around the barrier

Application code works as is:

pthread_barrier_t barrier;
char* shared_data;                      /* global: acoherent segment */

void* worker(void* arg) {
  ...
  while (/* work remains */) {
    /* <split work> */
    /* <do work> */
    pthread_barrier_wait(&barrier);
  }
}

int main(int argc, char* argv[]) {
  int i, j, k;
  pthread_t sib;
  shared_data = malloc(PROBLEM_SIZE);
  pthread_barrier_init(&barrier, NULL, 2);
  pthread_create(&sib, NULL, worker, (void*) 1);
  worker((void*) 0);
  ...
  pthread_join(sib, NULL);
}

Library code handles CO/CI:

int pthread_barrier_init(...) {
  ...
  _barrier = coherent_malloc(sizeof(int));   /* synchronization in coherent-RW segment */
  return 0;
}

int pthread_barrier_wait(...) {
  ...
  checkin(heap, data);      /* publish local updates */
  /* <barrier> */
  checkout(heap, data);     /* pull in others' updates */
  ...
}
13
Memory Consistency Model
Option 1: The Details (6 slides + really ugly equations)
Option 2: The Highlights (2 slides)
14
Memory Consistency Model
 Defined in style of SPARC TSO/RMO
 Memory Order: Total order of memory ops
• Restricted by consistency model
 Processor Order: Local dependencies
 Value of load: defined via memory + processor order
15
Weak Acoherence
1. Define Memory Order

(a) $L_i^S\langle a\rangle <_p L_i^S\langle a\rangle \Rightarrow L_i^S\langle a\rangle <_m L_i^S\langle a\rangle$   # Load -> Load to same address
(b) $L_i^S\langle a\rangle <_p S_i^S\langle a\rangle \Rightarrow L_i^S\langle a\rangle <_m S_i^S\langle a\rangle$   # Load -> Store to same address
(c) $S_i^S\langle a\rangle <_p S_i^S\langle a\rangle \Rightarrow S_i^S\langle a\rangle <_m S_i^S\langle a\rangle$   # Store -> Store to same address
    (a)-(c): same as TSO, etc.
(d) $S_i^S <_p CI_i^S \wedge CI_i^S <_m CO_j^S \wedge CO_j^S <_p L_j^S \Rightarrow S_i^S <_m L_j^S$   # Paired CI-CO act as a distributed fence
(e) $CX <_p CX' \Rightarrow CX <_m CX'$, where $CX, CX' \in \{CO, CI\}$   # CI/CO -> CI/CO: total order of CO/CI

2. Define legal value of loads

$value(L_i^S\langle a\rangle) = value\big(\max_{<_m}\{\, S^S\langle a\rangle \mid S^S\langle a\rangle <_m L_i^S\langle a\rangle \ \text{or}\ S^S\langle a\rangle <_p L_i^S\langle a\rangle \,\}\big)$
16
Strong Acoherence
1. Define Memory Order

(a)-(e): same as weak acoherence (Load -> Load, Load -> Store, and Store -> Store to the same address; paired CI-CO act as a distributed fence; total order of CO/CI)

(f) # Store not visible until CI (normally $S_i^S <_p CI_i^S \Rightarrow CI_i^S <_m S_i^S$)
$S_i^S <_p \mathrm{next}_p(CI^S, S_i^S) <_p \mathrm{next}_p(CO^S, S_i^S) \ \Rightarrow\ \mathrm{next}_p(CI^S, S_i^S) <_m S_i^S$

(g) # Stores can be clobbered (can "lose" data if a checkout intervenes before the next checkin)
$S_i^S <_p CO^S <_p \mathrm{next}_p(CI^S, S_i^S) \ \Rightarrow\ S_i^S <_m \max_p(CO^S, S_i^S)$

2. Define legal value of loads

$value(L_i^S\langle a\rangle) = value\big(\max_{<_p}\{\, S_i^S\langle a\rangle \mid \max_p(CO^S, L_i^S\langle a\rangle) <_p S_i^S\langle a\rangle <_p L_i^S\langle a\rangle \,\}\big)$
or, if no such $S_i^S\langle a\rangle$ exists,
$value\big(\max_{<_m}\{\, S^S\langle a\rangle \mid \max_p(CO^S, S^S\langle a\rangle) <_m S^S\langle a\rangle <_m L_i^S\langle a\rangle \,\}\big)$
17
Other Segment Types
 Coherent
 Like weak, but:
• Loads implicitly paired with (atomic) CO
• Stores implicitly paired with (atomic) CI
 SC w.r.t. each other
 Private
 Like weak
18
Analysis
 CO/CI not atomic
 Subtleties:

(a) Lazy checkout            Initially, A = 0
  Thread 0: 00: CHECKOUT; 01: R0 = A
  Thread 1: 10: A = 1;     11: CHECKIN
  Strong: R0 = 0 or 1      Weak: R0 = 0 or 1

(b) Isolation                Initially, A = 0
  Thread 0: 02: CHECKOUT; 03: R0 = A; 04: R1 = A
  Thread 1: 12: A = 1;     13: CHECKIN
  Strong: R0 = 0, R1 = 0   Weak: R0 = 0, R1 = 0 or 1

(c) Leaky stores             Initially, A = 0
  Thread 0: 05: CHECKOUT; 06: R0 = A
  Thread 1: 14: A = 1
  Strong: R0 = 0           Weak: R0 = 0 or 1
19
ASM = SC for DRF
 ASM = SC for lossless and properly paired (LL+PP) programs
 Lossless:
 No clobbering checkouts
 i.e., $\forall S_i^S\langle a\rangle$: if $\exists\, CO_i^S$ with $S_i^S\langle a\rangle <_p CO_i^S$, then $\exists\, CI_i^S$ with $S_i^S\langle a\rangle <_p CI_i^S <_p CO_i^S$
 Properly Paired:
 All conflicting store -> load pairs separated by CI/CO
 i.e., $\forall L_i^S\langle a\rangle, S_j^S\langle a\rangle$ with $value(L_i^S\langle a\rangle) = value(S_j^S\langle a\rangle)$, $i \neq j$: $\exists\, CO_i^S, CI_j^S$ with $S_j^S\langle a\rangle <_p CI_j^S <_m CO_i^S <_p L_i^S\langle a\rangle$
 Proof sketch:
 LL+PP executions are defined by CO/CI order and program order only
 CO/CI order and program order are the same in ASM and SC
20
CO/CI Semantics
 CO/CI like fence
 Lazy checkouts
 Non-atomic, non-blocking checkins
• Updates can interleave
Example: lazy checkout          Initially, A = 0
  Thread 0: 00: CHECKOUT; 01: R0 = A
  Thread 1: 10: A = 1;     11: CHECKIN
  Finally: R0 = 0 or 1

Example: interleaved checkins   Initially, A = 0, B = 0
  Thread 0: 00: A = 1;  01: B = 2;  02: CHECKIN
  Thread 1: 10: A = 10; 11: B = 20; 12: CHECKIN
  Finally, any combination of: A = 1 or 10, B = 2 or 20
21
Consistency Highlights
 Coherent accesses have implicit CO/CI
 CO/CI are totally ordered
 Transitivity hides non-atomicity
 Sequentially consistent for data-race-free
 Lossless & Properly Paired
[Figure: lock hand-off between Thread 0 and Thread 1. Thread 0 stores the critical data (ST critical), checks in the critical segment, releases the lock (ST/STsync lock), and checks in the lock segment. Thread 1 checks out the lock segment, acquires the lock (LL/SC or LD/ST on the lock), checks in the lock segment, checks out the critical segment, and loads the critical data (LD critical). Transitivity through the totally ordered CO/CI hides the non-atomicity.]
(A C-level sketch of this hand-off follows.)
22
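A C-level sketch of the hand-off under stated assumptions: the lock lives in a coherent-RW segment (a pthread mutex stands in for it), the critical data lives in an acoherent segment, and checkout/checkin are hypothetical bindings for the model's operations.

#include <pthread.h>

extern void checkout(void *segment);   /* hypothetical CO/CI bindings */
extern void checkin(void *segment);

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* coherent-RW segment: implicit CO/CI */
int critical[64];                                   /* acoherent segment */

void producer(void) {
    pthread_mutex_lock(&lock);
    checkout(critical);
    critical[0] = 42;            /* ST critical */
    checkin(critical);           /* CI critical_segment before the release */
    pthread_mutex_unlock(&lock); /* coherent release store */
}

int consumer(void) {
    pthread_mutex_lock(&lock);   /* coherent acquire in the lock segment */
    checkout(critical);          /* CO critical_segment after the acquire */
    int v = critical[0];         /* LD critical */
    checkin(critical);
    pthread_mutex_unlock(&lock);
    return v;                    /* if the consumer runs after the producer, the
                                    total CO/CI order makes the store visible */
}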
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
23
ASM-CMP Overview
 Based on MIPS
 + special insns, e.g., checkout, checkin
 Uses segments, no paging
 • Maintains flat address space
 Coherence protocol -> Acoherence Engine
 DMA for caches
 • Selectively move data
 (Skipping the details)
24
Baseline
[Figure: baseline tile with a core, L1I/L1D, L2, and a switch; memory controllers on the chip edges]
25
Segment Types
[Figure: how segment types map onto the ASM-CMP hierarchy. Acoherent: per-core L1s with acoherence engines (AE) that checkout/checkin against an exclusive L2. Coherent-RW: shared through a non-inclusive L2. Private: held in the local L1 only.]
26
Acoherence Engine
 Three main responsibilities:
 Checkout: invalidate all segment data (lazy flash invalidate)
 Checkin: write back all dirty segment data (track the write set with a decoupled metastate cache)
 Order: detect CI-CO pairs (timestamp based)
 FSM like coherence, but few races, no directory
27
Decoupled Metastate Cache
 In all L1 caches
 Decouple metastate from data
 Quick access to aggregate state
 Track valid/dirty (V/D) per segment
 Checkout: XOR global/segment valid state
 Checkin: walk the segment's dirty state
 (A sketch of one possible encoding follows.)
28
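One plausible reading of the "XOR global/segment valid" mechanism, as a minimal C sketch; the sizes, field names, and one-bit epoch encoding are assumptions for illustration, not the ASM-CMP hardware.

#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 512          /* hypothetical L1 geometry */
#define NUM_SEGMENTS 8

/* Metastate kept apart from the data array for quick aggregate access. */
struct metastate {
    uint8_t line_epoch[NUM_LINES];   /* epoch the line was filled in */
    uint8_t seg_of_line[NUM_LINES];  /* which segment each line belongs to */
    bool    dirty[NUM_LINES];
    uint8_t seg_epoch[NUM_SEGMENTS]; /* current epoch per segment */
};

/* A line is valid only if its epoch matches its segment's current epoch. */
static bool line_valid(const struct metastate *m, int line) {
    return m->line_epoch[line] == m->seg_epoch[m->seg_of_line[line]];
}

/* Checkout: lazy flash invalidate; flipping one bit invalidates the whole
 * segment (a real design must scrub stale lines before the 1-bit epoch wraps). */
static void checkout_invalidate(struct metastate *m, int seg) {
    m->seg_epoch[seg] ^= 1;
}

/* Checkin: walk only this segment's dirty, valid lines and write them back. */
static void checkin_writeback(struct metastate *m, int seg, void (*writeback)(int line)) {
    for (int line = 0; line < NUM_LINES; line++) {
        if (m->seg_of_line[line] == seg && m->dirty[line] && line_valid(m, line)) {
            writeback(line);
            m->dirty[line] = false;
        }
    }
}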
Order
 Need to:
1. Determine if a CI precedes a CO
2. Delay load after CO if previous CI hasn’t completed
 Timestamp algorithm (per segment):
 Two-phase CO/CI:
 1. Acquire a timestamp; invalidate (CO) or flush (CI)
 2. Wait for previous CO/CI to complete
 Implemented in firmware (see the sketch below)
29
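A minimal sketch of the two-phase, per-segment timestamp ordering using C11 atomics; the structure and function names are hypothetical, and the real mechanism lives in acoherence-engine firmware.

#include <stdatomic.h>

/* Hypothetical per-segment ordering state. */
struct seg_order {
    atomic_uint next_ts;       /* timestamp handed to each CO/CI on this segment */
    atomic_uint completed_ts;  /* highest timestamp whose CO/CI has completed */
};

static void seg_order_init(struct seg_order *s) {
    atomic_init(&s->next_ts, 1);
    atomic_init(&s->completed_ts, 0);
}

/* Phase 1 of a checkin: acquire a timestamp, then flush dirty data. */
static unsigned checkin_begin(struct seg_order *s) {
    unsigned ts = atomic_fetch_add(&s->next_ts, 1);
    /* ... walk dirty state and write back ... */
    return ts;
}

/* Phase 2 of a checkin: wait for earlier CO/CI, then mark this one complete. */
static void checkin_end(struct seg_order *s, unsigned ts) {
    while (atomic_load(&s->completed_ts) != ts - 1)
        ;  /* an earlier CO/CI is still in flight */
    atomic_store(&s->completed_ts, ts);
}

/* A checkout is symmetric: acquire a timestamp, invalidate, then delay any
 * load that follows the CO until every CI with a smaller timestamp has
 * completed; this is how "does a CI precede this CO?" is resolved. */
static void checkout_ordered(struct seg_order *s) {
    unsigned ts = atomic_fetch_add(&s->next_ts, 1);
    /* ... lazy flash invalidate the segment ... */
    while (atomic_load(&s->completed_ts) != ts - 1)
        ;  /* delay subsequent loads until earlier CI/CO complete */
    atomic_store(&s->completed_ts, ts);
}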
Multiple Writer Support
 Keep per-byte dirty bitmask in L1s
 Allows multiple writers with false sharing
 12.5% larger L1 cache
 Bitmask accompanies data to L2 (merge sketch below)
30
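A sketch of what the per-byte bitmask buys at the L2, assuming 64-byte lines; the structure and merge routine are illustrative, not the ASM-CMP RTL.

#include <stdint.h>

#define LINE_BYTES 64   /* assumed cache line size (64 dirty bits = 12.5% overhead) */

/* One L1 line carries data plus a per-byte dirty bitmask. */
struct l1_line {
    uint8_t  data[LINE_BYTES];
    uint64_t dirty_mask;        /* bit i set => byte i was written locally */
};

/* On checkin, the bitmask accompanies the data to the L2, which merges only
 * the dirty bytes; two writers that falsely share a line therefore update
 * disjoint bytes without clobbering each other's stores. */
static void l2_merge(uint8_t l2_line[LINE_BYTES], const struct l1_line *wb) {
    for (int i = 0; i < LINE_BYTES; i++) {
        if (wb->dirty_mask & (1ULL << i))
            l2_line[i] = wb->data[i];
    }
}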
Simple?
[Figure: a directory protocol's REQ/FWD/RESP message exchanges between the L1s, L2, and directory; the forwarded requests are the source of races and complexity]
31
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
32
Methodology
 Simulation-based
 Enhanced-User Mode
 Workloads:
 Class-1: SPLASH
 Class-2: Task-Q
 Three memory modules
 ASM-CMP
 CC from gem5-Ruby
• MESI (Inclusive)
• MOESI (Non-inclusive)
33
Performance
[Chart: runtime normalized to MOESI for moesi, mesi, and asm across the workloads. Performance is comparable overall; the slowdowns come from false sharing / checking out too much and from migratory sharing.]
34
Perfect Checkout
[Chart: runtime normalized to the ASM baseline, comparing asm_base against asm_ideal (perfect checkout).]
35
Energy
[Chart: energy normalized to MOESI, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb. ASM uses less energy at the same performance.]
36
Checkout Characteristics
[Chart: distribution of checkout size (# blocks invalidated) as a % of checkouts, and the corresponding % of checkout invalidations, for the Class-1 workloads (fft, fmm, lu, mp3d, ocean, radix, water; barnes elided). Checkouts are usually small but can be large (> 25% of the L1); most checkout invalidations affect dead blocks.]
37
Checkin Characteristics
[Chart: distribution of checkin size (# blocks written back) as a % of checkins for the Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water). Checkins are usually small but can be large (> 25% of the L1); checkin latency is hidden.]
38
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions/Other Work
39
Conclusions
 Going forward:
 HW designs must find efficiency
 SW will want to see caches/control placement
 ASM: viable alternative to coherent shared memory
 Semantic cooperation between HW/SW
 ASM-CMP: build components w/o coherence engine
 Make custom integration easier
 Practically:
 Will the next x86 core use ASM? No
 Will a heterogeneous accelerator? Maybe
40
Related Work
 ASM Model
 ASM-CMP
 Alternatives/Detractors
Skip
41
Related Work – ASM Model
 Relaxed consistency models
 Release Consistency (ISCA 1990)
• Acquire/Release ≈ CO/CI
 DRF-0 (ISCA 1990), DRF-1 (PDS 1993)
• SC for DRF
 Weak ordering (ISCA 1986)
 Semantic Segmentation
 Cohesion (ISCA 2010)
 Entry consistency (CMU-TR 1991)
42
Related Work – ASM-CMP
 Rigel: IEEE Micro 2011
 Differentiates coherent/incoherent
 Treadmarks: ISCA 1992
 Twinning and diffing
43
Related Work - Alternatives
 Reduce directory overhead
 Cuckoo directory (HPCA 2011)
 Tagless directory (MICRO 2009, PACT 2011)
 WayPoint (PACT 2010)
 Region coherence (IEEE Micro 2006)
 SW-controlled coherence (…)
 Simplify coherence design
 Denovo (PACT 2011)
 Coherence is here to stay
 CACM 2012
44
Future Work
 ASM Model
 ASM Implementations
 ASM Software
Skip
45
Future Work – ASM Model
1. Use CO/CI for synchronization
 Return timestamp with CO/CI
 Blocking CO
2. Only guarantee transitivity across coherent
accesses
 Would eliminate need for timestamps
3. Hierarchical ASM
 Expose multiple levels of abstracted caches
4. Interaction with coherent shared memory
 Acoherent/coherent components in the same system
46
Future Work: ASM Implementation
 ASM-CMP
1. Optimize empty checkout/checkin
2. Non-speculative support for strong acoherence
• e.g., HW copy-on-write support on eviction
• Use ASM as foundation for TM/Determinism/etc
3. Low overhead byte-diffing
• False sharing is rare/pattern reuse is common
4. More segment control
• Non-contiguous
• Remap-able
 Other
1. Multi-socket support
2. Use ASM to simplify traditional coherence
• Private/shared
47
Future Work – ASM Software
1. Message passing on ASM
 More efficient than coherence (think: migratory)
2. Software speculation
 Use working memory for isolation
3. Programming language integration
 CO/CI first-class operations
 Work already exists:
• Worlds (ECOOP 2011), Revisions (OOPSLA 2010), PGAS
48
Previous Work
 Rerun: ISCA 2008 and CACM 2009
 Race recorder for deterministic replay
 vs. state of the art:
• SAME logging performance, > 10x state reduction
 Calvin: HPCA 2011
 Coherence for deterministic execution
• i.e., zero-log-size deterministic replay
 Selective determinism to match program requirements
 Hobbes: WoDet 2011
 Strong acoherence in SW runtime
49
Phew!
Backup Slides
What I would do differently
 Focus on more specific target system
 Stop building new infrastructure!
 Why did I?
• gem5 wasn’t ready
• Started more radical/not clear it would have helped
 Step back more often
 Easy to get sucked into details that usually don't matter
 Functional specification of consistency -> yuck!
52
Case Study: Cilk
 Work-stealing task queue
 Distributed design
53
ASM Segments Benefit
[Charts: runtime and energy normalized to MOESI, illustrating the benefit of ASM's segments; MOESI configurations moesi_tlb_0, moesi_tlb_32, and moesi_tlb_64 are shown, with energy broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb.]
54
History
1980: CPU Era (everything is general purpose)
2000: Multicore Era ("Moore" of the same)
2010: Dark Era (??)
55
Navigating the Darkness
 Solution #1: Wait for CMOS replacement
 Don’t hold your breath
 Solution #2: Rethink everything
 Deep integration
 HW specialization/heterogeneity
 Efficiency
 Take compatibility off its pedestal
 (Coherence?)
56
Rethinking Coherence: Why Now?
 Dennard Scaling is over; Moore’s Law continues:
 Need efficient components/reduced waste
 Heterogeneity/Specialization
 Different memory access patterns
 Multicore ASICs
 Important workloads don’t use it
 Compatibility not a show stopper
 Mobile -> fast design cycles, controlled SW stacks
 Datacenter -> economy of scale in single location
 Missing opportunities
57
Case Study 2: Software Speculation
Task: convert to ASM; SW can use memory in new ways

begin_speculation() {
  <copy state>
  checkout(...)          /* use private storage */
  <setup>
}

end_speculation() {
  if (success) {
    <free copies>
    checkin(...)         /* checkin: commit updates */
  } else {
    abort_speculation();
  }
}

abort_speculation() {
  <revert to copy>
  checkout(...)          /* multiple checkouts: "forget" updates */
  <cleanup>
}

(A usage sketch follows.)
58
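A minimal usage sketch of the routines above; the extern declarations and the speculative body are placeholders, not part of the defense.

extern void begin_speculation(void);
extern void end_speculation(void);

/* Updates land in checked-out (private) storage; end_speculation either
 * commits them with a checkin or aborts and reverts to the saved copy. */
void speculative_fill(int *array, int n) {
    begin_speculation();
    for (int i = 0; i < n; i++)
        array[i] = i * i;
    end_speculation();
}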
New Software Potential
 Evaluate ability to write speculation software
 Microbenchmark:
 Fill array with speculative data, then commit
 Vary size of array
[Chart: normalized runtime of ASM vs. MESI as the number of blocks in the isolation region grows from 16 to 128K.]
59
Using Weak Acoherence
global array;            // weak acoherent segment

func producer(...)
  checkout(array);
  array[0] = x;          // may become globally visible early (before checkin)!
  array[1] = y;
  checkin(array);
  signal(consumer);
end func

func consumer(...)
  waitfor(producer);
  checkout(array);
  ...
  checkin(array);
end func

Synchronization hides the early visibility: synchronized -> early visibility OK
60
Using Best-Effort Acoherence
begin_tx
  checkout(array)
  array[0] = x
  checkout(array)      <- spontaneous checkout: Exception!
  array[1] = y
  checkin(array)
end_tx

SW handles the resource limitation (all-or-nothing); see the handler sketch below
61
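One way software might respond to the best-effort notification, as a hedged sketch; the handler name, its registration, and the checkout/checkin bindings are all assumptions, not an ASM-CMP API.

#include <setjmp.h>

extern void checkout(void *segment);   /* hypothetical CO/CI bindings */
extern void checkin(void *segment);

static jmp_buf tx_restart;

/* Hypothetical notification entry point: hardware performed a spontaneous
 * checkout mid-transaction, so the working copy is gone (all-or-nothing). */
void on_spontaneous_checkout(void) {
    longjmp(tx_restart, 1);            /* unwind and retry the transaction */
}

void run_tx(int *array, int x, int y) {
    setjmp(tx_restart);                /* restart point after an abort */
    checkout(array);
    array[0] = x;
    array[1] = y;
    checkin(array);                    /* commits only if no spontaneous checkout fired */
}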
Simulator Design
 Two Goals
 Functionally evaluate ASM system
• programming model, kernel management
 Performance comparison to CMP
 Enhanced User Mode simulator
 Emulate non-timing critical components (e.g., disks)
 Simulate the rest (e.g., virtual memory)
62
Qualitative Data
 Is ASM a reasonable model?
 YES
 Almost no changes to application software
• Unsynchronized flags
• Stack sharing
 Functioning Kernel, same tricks
• Heavier use of coherent segments
63
Three Questions
1. How can software select view?
2. Which view to use?
3. How to manage CO/CI?
[Figure: the hardware layout (cores, private caches, LLC, DRAM) alongside the three software views it can present: acoherent, private, and coherent]
64
ASM-CMP Segments
 Uses true memory segments
 e.g., all pointers are long (segment + offset)
 BUT, address space still appears flat!
 Long Pointer Propagation
 Segment pointers propagate through datapath
 Add lp/sp (long-pointer load/store) instructions + register sidecars (sketch below)
 Languages/SW remain segment-oblivious
65
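A rough picture of the long-pointer representation implied by the memcpy example on the next slide; the struct and field names are illustrative, not the actual register-file layout.

#include <stdint.h>

/* Illustrative only: a "long pointer" as ASM-CMP might represent it,
 * a segment pointer plus an offset. */
struct long_ptr {
    uint32_t seg;     /* segment pointer (carried in a register sidecar) */
    uint32_t off;     /* offset within the segment */
};

/* Ordinary pointer arithmetic touches only the offset; the segment part
 * simply propagates through the datapath, so languages and software remain
 * segment-oblivious while the address space still appears flat. */
static struct long_ptr lp_add(struct long_ptr p, uint32_t bytes) {
    p.off += bytes;
    return p;
}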
ASM-CMP Segments
Segment pointers propagate with the datapath; pointers are long (segment + offset)

memcpy(dst, src, len):

      lp   $t0, 0(dst)     ; load long pointer to dst
      lp   $t1, 0(src)     ; load long pointer to src
      mov  $t2, $a2        ; cnt <- len
loop:
      beqz $t2, exit
      lb   $t3, 0($t1)     ; ld src
      sb   $t3, 0($t0)     ; st dst
      addi $t0, $t0, 1     ; inc. dst
      addi $t1, $t1, 1     ; inc. src
      subi $t2, $t2, 1     ; dec. cnt
      b    loop
exit:

[Figure: register file with a segment-pointer sidecar next to each offset; the ALU computes only offset+1, and the segment propagates unchanged from src/dst through the loads, stores, and adds]
66
The Problem
[Figure: the hardware layout (cores, private caches, LLC, DRAM) versus the software view (flat coherent shared memory)]
67
The Problem
[Figure: the mapping from software view to hardware layout is hardware policy, and software can't change it!]
68
All Data Are Created Equal?
Assume: CMP MESI protocol, inclusive LLC

Location := 1;

[Figure: after the store, where does Location live? Private cache, LLC, and DRAM are all marked "?"]
69
Missed Opportunities
Assume: CMP MESI protocol, inclusive LLC

begin_tx
  cpLocation := Location;
  Location := 1;
end_tx

SW makes a redundant copy
[Figure: CMP with private caches, inclusive LLC, and DRAM]
70
All Data Are NOT Created Equal
Assume: CMP MESI protocol, inclusive LLC

func foo()
  var Location;
  Location := 1;        // private data

[Figure: the private variable still occupies the inclusive LLC and DRAM: wasting space]
71
ASM-1 Hardware
[Figure: ASM-1 tile: 32KB L1 with a per-line bitmask, acoherence engine (AE), 256KB L2 with bitmask, 8MB L3]
72
Baseline
[Figure: 16-core baseline (P0-P15); in-order, single-threaded cores, each with a private L1 and L2; 8 L3 banks (L3_0-L3_7); ring interconnect]
73
Storage Overhead
[Chart: storage overhead vs. # cores for ASM-1 and for MESI with 1-, 2-, and 3-level directories. ASM-1 needs no indirection; more directory indirection means longer latency.]
74
Rethinking Coherence: Why Now?
 Dennard Scaling is over; Moore’s Law continues
 Need scalable, energy efficient components
 Accelerators are here
 How should they see memory?
 Shared-little workloads in important markets
75
All Data Are NOT Created Equal
Assume: CMP MESI protocol, inclusive LLC

func CUDAKernel(...)
  ...
  Location := 1;

[Figure: a CMP with an attached GPU; it is not clear accelerators want or need coherence]
76