Remember Memory?
Mark D. Hill, Univ. of Wisconsin-Madison
5/2016 @ David A. Patterson Celebration
I. Million-fold Memory Growth & Virtual Memory
II. General-Purpose GPUs & Memory Consistency
III. Non-Volatile Memory’s Fusing Memory & Storage
No change in 30 years?
I. Million-fold Memory Growth
Memory capacity for $10,000*
[Chart: memory purchasable for $10,000 vs. year (1980-2010), rising roughly a million-fold on a log scale from MBs to TBs. Annotations: commercial servers now ship with 16 TB of memory; interactive services need to access TBs of data at low latency.]
*Inflation-adjusted 2011 USD, from: jcmit.com
How is Paged Virtual Memory used?
E.g.: memcached servers
[Figure: a client sends Key X to memcached server #n, which looks it up in an in-memory hash table (plus network state) and returns Value Y]
• But TLB sizes hardly scaled
Year                L1-DTLB entries
1999 (Pentium III)   72
2008 (Nehalem)       96
2012 (Ivy Bridge)   100
2015 (Broadwell)    100
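Why ~100 entries fall short: TLB reach is entries times page size. A hedged back-of-the-envelope sketch in C++, using the Broadwell row above and the standard x86 page sizes:

#include <cstdio>

int main() {
    const long long entries = 100;  // L1-DTLB entries (Broadwell row above)
    const long long page[] = {4LL << 10, 2LL << 20, 1LL << 30};  // 4 KB, 2 MB, 1 GB
    const char* name[] = {"4 KB", "2 MB", "1 GB"};
    for (int i = 0; i < 3; ++i) {
        double reach_mb = double(entries * page[i]) / (1 << 20);
        std::printf("%s pages: TLB reach = %.1f MB\n", name[i], reach_mb);
    }
    // 4 KB pages: ~0.4 MB of reach -- against the TBs such servers hold.
    return 0;
}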
Execution Time Overhead: TLB Misses [ISCA 2013]
[Chart: percentage of execution cycles wasted on TLB misses for GUPS, NPB:CG, NPB:BT, MySQL, memcached, and graph500, with 4 KB, 2 MB, and 1 GB pages vs. a Direct Segment; 4 KB overheads run as high as 51.1%, 83.1%, and 51.3%.]

Q: "Virtual Memory was invented in a time of scarcity. Is it still a good idea?"
--- Charles Thacker, 2010 Turing Award Lecture

A: As we see it, OFTEN but not ALWAYS.
A View of Computer Layers
[Figure: layered stack, with only small "punch-thru" crossing layers]
Problem
Algorithm
Application
Middleware / Compiler
Operating System
Instruction Set Architecture
Microarchitecture
Logic Design
Transistors, etc.
See 21st Century Computer Architecture [CCC 2012]
Bypass Paging (Often)
1. Conventional paging: guard pages, copy-on-write (COW), mapped files
2. Direct Segment: heap w/o swapping; if BASE <= VA < LIMIT, then PA = VA + OFFSET
Direct Segment [ISCA 2013], but more-general ideas now
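A minimal sketch of that translation rule, assuming the paper's BASE/LIMIT/OFFSET registers; a caller falls back to the conventional page walk when the check misses:

#include <cstdint>
#include <optional>

struct DirectSegment {
    uint64_t base, limit, offset;   // BASE, LIMIT, OFFSET registers

    // Any VA in [BASE, LIMIT) translates with simple arithmetic,
    // bypassing the TLB; other VAs use conventional paging.
    std::optional<uint64_t> translate(uint64_t va) const {
        if (va >= base && va < limit)
            return va + offset;     // PA = VA + OFFSET
        return std::nullopt;        // caller performs the normal page walk
    }
};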
Execution Time Overhead: TLB Misses [ISCA 2013]
[Chart repeated, now with Direct Segment bars: overhead falls to roughly 0-0.49% of execution cycles across all six workloads, vs. up to 83.1% with 4 KB pages.]
Non-Volatile Memory to explode address space & sharing?
II. Graphics Processing Units (GPUs)
• GPUs = Throughput
• Hierarchical "scoped" programming model
• Share memory to expand viable programs
  – Rich data structures (w/o copying)
  – "Pointer is a pointer"
  – Coherence? Scopes?
[Figure: OpenCL execution hierarchy: a Grid (Dimensions X/Y/Z) of Work-groups; each Work-group contains Sub-groups (hardware-specific size) of Work-items]
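How that hierarchy names a thread, as a hedged host-side sketch: a work-item's global ID is, in effect, its group ID times the work-group size plus its local ID (shown in one dimension; the sub-group width here is an assumed hardware value):

#include <cstdio>

int main() {
    const int group_size = 256;    // work-group size, chosen by the programmer
    const int sub_group  = 32;     // hardware-specific sub-group width (assumed)
    for (int g = 0; g < 2; ++g)                    // two work-groups
        for (int l = 0; l < group_size; l += 64) { // a few work-items each
            int global_id = g * group_size + l;    // ~ OpenCL get_global_id(0)
            std::printf("group %d local %3d -> global %3d (sub-group %d)\n",
                        g, l, global_id, l / sub_group);
        }
    return 0;
}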
GPU Memory Hierarchy = Throughput
[Figure: one chip. GPU side: CU0...CU15, each with an L1, sharing a GPU L2. CPU side: CPU0 and CPU1, each with an L1, sharing a CPU L2. Both meet at a shared LLC with Directory / Memory below]
• Poor match: CPU coherence w/ write-back caches
• Coherence is a means; memory consistency is the end
Sequential Consistency (SC)
Thread 1:
  R1 = TOS
  R2 = R1 - 1
  TOS = R2
  Data1 = *R1

Thread 2:
  R1 = TOS
  R2 = R1 - 1
  TOS = R2
  Data2 = *R1

Total Memory Order
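Even under SC the interleaving is unconstrained: both threads can read the same TOS and pop the same element. A hedged C++ sketch (seq_cst atomics keep the program well-defined; names like racy_pop are mine):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> tos{2};            // index of the top of stack
int stack_data[3] = {10, 20, 30};

int racy_pop() {
    int r1 = tos.load();            // R1 = TOS  (both threads may read 2)
    tos.store(r1 - 1);              // TOS = R1 - 1
    return stack_data[r1];          // Data = *R1
}

int main() {
    int d1 = 0, d2 = 0;
    std::thread t1([&] { d1 = racy_pop(); });
    std::thread t2([&] { d2 = racy_pop(); });
    t1.join(); t2.join();
    std::printf("%d %d\n", d1, d2); // can print "30 30": both popped the top
    return 0;
}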
Sequential Consistency (SC) w/ Locks
Thread 1:
  Lock(Stack)
  R1 = TOS
  R2 = R1 - 1
  TOS = R2
  Data1 = *R1
  Unlock(Stack)

Thread 2:
  Lock(Stack)
  R1 = TOS
  R2 = R1 - 1
  TOS = R2
  Data2 = *R1
  Unlock(Stack)
SC for Data Race Free
Thread 1:
  Lock(Stack)
  R1 = TOS
  R2 = R1 - 1
  TOS = R2
  Data1 = *R1
  Unlock(Stack)

Thread 2:
  Lock(Stack)
  R1 = TOS
  R2 = R1 - 1
  TOS = R2
  Data2 = *R1
  Unlock(Stack)
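C++ adopted exactly this contract: if every pair of conflicting accesses is ordered by synchronization, the program is data-race-free and the implementation guarantees SC behavior. A minimal sketch of the locked pop, with std::mutex standing in for Lock(Stack)/Unlock(Stack):

#include <cstdio>
#include <mutex>
#include <thread>

std::mutex stack_lock;              // Lock(Stack) / Unlock(Stack)
int tos = 2;
int stack_data[3] = {10, 20, 30};

int pop() {
    std::lock_guard<std::mutex> g(stack_lock);
    int r1 = tos;                   // R1 = TOS
    tos = r1 - 1;                   // TOS = R1 - 1
    return stack_data[r1];          // Data = *R1
}

int main() {
    int d1 = 0, d2 = 0;
    std::thread t1([&] { d1 = pop(); });
    std::thread t2([&] { d2 = pop(); });
    t1.join(); t2.join();
    std::printf("%d %d\n", d1, d2); // always two distinct elements
    return 0;
}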
CPU History & GPU Future
• CPUs – 3 decades!
  – SC [ToC 1979]
  – SC for Data Race Free [ISCA 1990] [ISCA 1990]
  – SC for DRF Java/C++ [PLDI 2005] [PLDI 2008]
GPUs Faster?
GPU Memory Hierarchy = Throughput
[Figure repeated: the integrated CPU-GPU cache hierarchy above]
• GPU has "scopes" – nearer is faster
CPU History & GPU Future
• CPUs (3 decades, as above)
• GPUs
  – SC for Heterogeneous Race Free [ASPLOS 2014]
  – No data races & synchronization of "enough" scope
  – In Heterogeneous System Architecture [2015]
Whither System on a Chip w/ many accelerators?
III. Non-Volatile Memory (NVM)
[Figure: Compute | Memory | Storage, with Memory and Storage converging]
• Convergence/hype
• Power off by (a) surprise or (b) design
III(a) Power Off by Surprise (Crash)
STORE value = 0xC02
STORE valid = 1
[Figure: both stores enter the total memory order and land in a write-back cache, which holds value=0xC02, valid=1; the cache may write 'valid' back first, so after a crash the Non-Volatile Memory can hold valid=1 while value is still 0xDEADBEEF]
Seek Consistent Durable State on Crash
[Figure: a persistency order alongside the total memory order]
Persistency model [Pelley et al., ISCA 2014]:
– Strict persistency: as strong as the (relaxed) memory model
– Relaxed persistency: even weaker
More Persistency Work Needed
• Industry not there yet:
  "If PCOMMIT is executed after a store to a persistent memory range is accepted to memory, the store becomes persistent when the PCOMMIT becomes globally visible."
  "While all store-to-memory operations are eventually accepted to memory, the following items specify the actions software can take to ensure that they are accepted: Non-temporal stores to write-back (WB) memory and all stores to uncacheable (UC), write-combining (WC), and write-through (WT) memory are accepted to memory as soon as they are globally visible. If, after an ordinary store to write-back (WB) memory becomes globally visible, CLFLUSH, CLFLUSHOPT, or CLWB is executed for the same cache line as the store, the store is accepted to memory when the CLFLUSH, CLFLUSHOPT, or CLWB execution itself becomes globally visible."
• IMHO, need:
  – Deeper & more formal models (e.g., happens-before)
  – Better understanding of app durable state
  – HW that orders, rather than gratuitously flushes (today's recipe, sketched below, does the opposite)
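The software recipe the quote describes, as a hedged C++ sketch (Intel CLWB/SFENCE intrinsics; compile with -mclwb; the Record layout and its mapping onto persistent memory are assumed). It repairs the slide III(a) hazard by persisting value strictly before valid:

#include <immintrin.h>   // _mm_clwb, _mm_sfence
#include <cstdint>

struct Record {
    alignas(64) uint64_t value;  // separate cache lines, so each CLWB
    alignas(64) uint64_t valid;  // writes back exactly one field
};

// A crash may lose the whole update, but can never expose
// valid == 1 alongside a garbage (0xDEADBEEF) value.
void persist_record(Record* r, uint64_t v) {
    r->value = v;            // STORE value = 0xC02
    _mm_clwb(&r->value);     // push the value line toward NVM
    _mm_sfence();            // order: value accepted before the flag store
    r->valid = 1;            // STORE valid = 1
    _mm_clwb(&r->valid);
    _mm_sfence();            // flag line on its way before we return
}

Note this is exactly the gratuitous flushing the last bullet objects to: software flushes whole lines and stalls on fences merely to express an ordering constraint.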
III(b) Power off by Design (Prediction)
[Figure: activity timeline with idle gaps]
• Greatly improve energy efficiency
• Especially when (briefly) doing nothing
• Needs work: circuits, architecture, system SW
• Or: advice Patterson never gave me....
Do Nothing Well!
Summary
I. Million-fold Memory Growth & Virtual Memory
II. General-Purpose GPUs & Memory Consistency
III. Non-Volatile Memory’s Fusing Memory & Storage