Remote Store Programming
Henry Hoffmann David Wentzlaff Anant Agarwal
High Performance Embedded Computing Workshop
September 2009
Remote Store Programming Outline
• Introduction/Motivation
• Key Features of RSP
– Usability
– Performance
• Evaluation
– Methodology
– Results
• Conclusion
Multicore requires innovation to balance usability and performance
Examples: Intel – 4 cores; Sun – 8 cores; IBM – 9 cores; Cavium – 16 cores; Raw – 16 cores; Tilera – 64 cores
• Parallel programming is becoming ubiquitous
– Parallel programming is no longer the domain of select experts
– Balancing ease-of-use and performance is more important than ever
Existing programming models do not combine usability and performance

[Figure: Under threads with cache coherence, Core 0 and Core 1 share data through a globally shared cache; under DMA, a DMA engine copies data between the cores' local memories.]

Threads with cache coherence (CC):
• Familiar loads and stores; fine-grained communication is easy
• No control over locality

Direct Memory Access (DMA):
• Programmer has complete control over locality
• Requires an additional software API; DMA transactions are hard to schedule

Remote Store Programming (RSP) can combine usability with performance.
RSP combines the usability of Threads with the performance of DMA

[Figure: Threads communicate through a globally shared cache; DMA copies data between local memories; under RSP, Core 0's stores route directly into Core 1's local cache, where Core 1 loads them.]

Implications:
1. Usability
– Familiar loads and stores
– Fine-grained, one-sided communication
– Can develop high-performance programs in a shorter amount of time
2. Performance
– Software controls locality
– Always access physically close memory and minimize load latency
Remote Store Programming Outline
• Introduction/Motivation
• Key Features of RSP
– Usability
– Performance
• Evaluation
– Methodology
– Results
• Conclusion
RSP captures the usability of threads and shared memory

[Figure: Memory 0 and Memory 1 are private by default; Process 1 allocates a remotely writable buffer and grants Process 0 write access to it.]

    Process 0 (Core 0):              Process 1 (Core 1):
    buffer = map_remote(4,1);        buffer = remote_write_alloc(4,0);
    *buffer = 42;                    barrier();
    barrier();                       print(*buffer);

• Process Model
– Each process has private memory by default
– A process can grant write access to remote processes
• Communication
– Processes communicate by storing to remotely writable memory
• Synchronization
– Supports test-and-set, compare-and-swap, etc.
– We assume higher-level primitives like barrier
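As a concrete illustration, here is a minimal C sketch of the example above, assuming an RSP runtime that provides the map_remote, remote_write_alloc, barrier, and my_id primitives used in these slides (the rsp.h header name and the int typing of the buffer are illustrative assumptions):

    #include <stdio.h>
    #include "rsp.h"   /* hypothetical header for the RSP primitives above */

    void example(void)
    {
        int pid = my_id();            /* process rank, as in the FFT example */
        volatile int *buffer;

        if (pid == 0) {
            /* Map 4 bytes of remotely writable memory owned by process 1 */
            buffer = map_remote(4, 1);
            *buffer = 42;             /* remote store: lands in process 1's memory */
            barrier();                /* make the store visible before it is read */
        } else {
            /* Allocate 4 bytes and grant write access to process 0 */
            buffer = remote_write_alloc(4, 0);
            barrier();
            printf("%d\n", *buffer);  /* local load of the communicated value */
        }
    }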
RSP emphasizes locality for performance on large-scale multicores

[Figure: Core 0 is a producer; Core 1 is a consumer. Data is transferred from Core 0's registers over the network into Core 1's cache, so all of Core 1's loads are guaranteed to be local and minimum latency.]

Steps:
1. Core 1 gives Core 0 write permission for buffer
2. Core 0 receives write permission from Core 1 for buffer
3. Core 0's store instructions target remote memory on the consumer (buffer)
4. Core 1's standard load instructions are used to read data from cache

Performance benefits:
• Locality: cores access close memories; load latency is minimized
• Fine-grained communication: complete overlap of communication and computation
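The four steps map naturally onto a streaming producer/consumer pair. Below is a hedged C sketch against the same assumed runtime; the completion flag, the size N, and the compute/use helpers are illustrative assumptions (a real implementation would also rely on the platform's memory-ordering guarantees for the flag store):

    #include "rsp.h"   /* hypothetical RSP runtime header, as before */

    #define N 1024

    extern int  compute(int i);    /* application work (illustrative) */
    extern void use(int value);    /* application consumption (illustrative) */

    void producer(void)   /* process 0 */
    {
        /* Steps 1-2: the consumer granted write permission; map its memory */
        volatile int *buf  = map_remote(N * sizeof(int), 1);
        volatile int *done = map_remote(sizeof(int), 1);

        for (int i = 0; i < N; i++)
            buf[i] = compute(i);   /* Step 3: stores stream into core 1's cache,
                                      overlapping communication with computation */
        *done = 1;                 /* signal completion with one more remote store */
    }

    void consumer(void)   /* process 1 */
    {
        volatile int *buf  = remote_write_alloc(N * sizeof(int), 0);
        volatile int *done = remote_write_alloc(sizeof(int), 0);

        while (*done == 0)         /* spin on a local cache line until signaled */
            ;
        for (int i = 0; i < N; i++)
            use(buf[i]);           /* Step 4: all loads are local, minimum latency */
    }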
2D FFT example illustrates performance and usability

Threads (globally shared cache):

    Matrix A, C;
    shared Matrix B;
    pid = my_id();
    init(A);
    row_fft(B, A, pid);
    barrier();
    col_fft(C, B, pid);

Easiest to write; fine-grain; no locality.

DMA (local memories):

    Matrix A, B, B2, C;
    pid = my_id();
    init(A);
    row_fft(B, A, pid);
    DMA_move(B2, B);
    col_fft(C, B2, pid);

Hard to write; coarse-grain; high locality.

RSP (remote stores into local caches):

    Matrix A, C;
    Matrix B = alloc_rsp();
    pid = my_id();
    init(A);
    row_fft(B, A, pid);
    barrier();
    col_fft(C, B, pid);

Easy to write; fine-grain; high locality.

Note that the RSP version is structurally identical to the threaded version; only the allocation of B changes, so the stores in row_fft stream into the consumers' caches and every load in col_fft is local.

For more detail see: Hoffmann, Wentzlaff, Agarwal. Remote Store Programming: Mechanisms and Performance. Technical Report MIT-CSAIL-TR-2009-017, May 2009.
RSP requires incremental hardware support

[Figure: Core 0's stores travel from its register file over the network into Core 1's local cache, where Core 1's loads read them.]

• RSP requires only incremental additional hardware
– In a processor supporting cache-coherent shared memory:
• An additional memory allocation mode for remotely writable memory
• Do not update the local cache when writing to remote memory
– In a processor without cache-coherent shared memory:
• Memory allocation for remotely writable memory
• Write misses on remote cores forward the miss to the allocating core
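A sketch of how such an allocation mode might surface in software on a cache-coherent machine; this is a guess at an interface, not the TILEPro64's actual API, and alloc_homed is a hypothetical wrapper over a homed-allocation page attribute:

    #include <stddef.h>

    /* Hypothetical platform hook: allocate memory homed in the cache of
     * home_core, with writes from other cores not allocating in the
     * writer's local cache -- the two mode changes listed above. */
    extern void *alloc_homed(size_t bytes, int home_core);
    extern int   my_id(void);

    /* Emulated remote_write_alloc: the consumer homes the buffer in its
     * own cache, so a producer's stores travel over the network into the
     * consumer's cache, matching RSP's behavior. */
    void *remote_write_alloc_emulated(size_t bytes, int writer)
    {
        (void)writer;  /* coherence already permits remote writes;
                          only the homing of the buffer matters here */
        return alloc_homed(bytes, my_id());
    }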
Remote Store Programming Outline
• Introduction/Motivation
• Key Differentiators of RSP
– Usability
– Performance
• Evaluation
– Methodology
– Results
• Conclusion
RSP, Threads, and DMA are compared on the TILEPro64 processor

1. Emulate RSP on TILEPro64
– TILEPro64 supports cache coherence and DMA
– Additional memory allocation modes let us emulate RSP
2. Implement embedded benchmarks with RSP
– Matrix Transpose, Matrix Multiply, 2D FFT, Bitonic Sort, Convolution, Floyd-Steinberg Error Diffusion, H.264 Encode, Histogram
3. Compare performance to cache coherence and DMA
– Compare speedup and load latency
Speedups of RSP and cache-coherent benchmarks

[Figure: Speedup vs. number of cores (2 to 64) for RSP and CC implementations of eight benchmarks: Bitonic Sort, 2D FFT, Error Diffusion, Convolution, Image Histogram, H.264, Matrix Transpose, and Matrix Multiply. The 2D FFT panel peaks at a speedup of 80.6.]
RSP outperforms threading with cache coherence for large numbers of cores

[Figure: Speedup of RSP versus cache coherence for selected benchmarks (higher is better), from 2 to 64 cores: Transpose, Convolution, FFT, Matrix Multiply, Error Diffusion, Histogram, H.264, Bitonic Sort.]

• For large numbers of cores, RSP provides significant benefit
• Five RSP apps are 1.5x faster than CC at 64 cores
• RSP FFT is more than 3x faster than CC at 64 cores
RSP's emphasis on locality results in low load latency

[Figure: Average load latency of RSP relative to CC for selected benchmarks (lower is better), from 2 to 64 cores: Transpose, Convolution, FFT, Matrix Multiply, Error Diffusion, Histogram, H.264, Bitonic Sort.]
Comparison of RSP, Threading, and DMA for two applications

[Figure: Performance normalized to shared memory (cache coherence, DMA, RSP) for H.264 Encode and 2D FFT, from 2 to 64 cores.]

• RSP is faster than Threading and DMA due to its emphasis on locality and fine-grained communication
Remote Store Programming Outline
• Introduction/Motivation
• Key Differentiators of RSP
– Usability
– Performance
– Hardware Requirements
• Evaluation
– Methodology
– Results
• Conclusion
Conclusion

• This talk presented Remote Store Programming
– A programming model for multicore
– Uses familiar communication mechanisms
– Achieves high performance through locality and fine-grained communication
– Requires incremental hardware support
• Conclusions:
– Threads and shared memory provide good performance and are easy to use
– For large numbers of cores, RSP can outperform threading because of greater locality
– RSP is easier to use than DMA and slightly harder than Threads
• Anticipated use for RSP:
– RSP can supplement threads and cache-coherent shared memory on multicore
• Most code uses standard threading and shared-memory techniques
• Performance-critical sections or applications use RSP for additional performance
• A gentle slope to programming multicore