Remote Store Programming
Henry Hoffmann, David Wentzlaff, Anant Agarwal
High Performance Embedded Computing Workshop, September 2009

Outline
• Introduction/Motivation
• Key Features of RSP
  – Usability
  – Performance
• Evaluation
  – Methodology
  – Results
• Conclusion

Multicore requires innovation to balance usability and performance

[Figure: contemporary multicore processors — Cavium (16 cores), Sun (8 cores), IBM (9 cores), Tilera (64 cores), Intel (4 cores), Raw (16 cores)]

• Parallel programming is becoming ubiquitous
  – Parallel programming is no longer the domain of select experts
  – Balancing ease of use and performance is more important than ever

Existing programming models do not combine usability and performance

[Figure: threads with cache coherence (CC) — cores load and store data through a globally shared cache; Direct Memory Access (DMA) — a DMA engine copies data between per-core local memories]

• Threads with cache coherence (CC)
  – Usability: familiar loads and stores; fine-grained communication is easy
  – Performance: no control over locality
• Direct Memory Access (DMA)
  – Usability: requires an additional software API; DMA transactions are hard to schedule
  – Performance: the programmer has complete control over locality
• Remote Store Programming (RSP) can combine usability with performance

RSP combines the usability of Threads with the performance of DMA

[Figure: in RSP, a producer's store instructions deposit data directly into the consumer's local cache; Threads share a globally shared cache; DMA engines copy between local memories]

• Usability
  – Familiar loads and stores
  – Fine-grained, one-sided communication
  – Can develop high-performance programs in a shorter amount of time
• Performance
  – Software controls locality
  – Cores always access physically close memory, minimizing load latency

RSP captures the usability of threads and shared memory

[Figure: each process's memory is private by default; Process 1 exposes a remotely writable buffer that Process 0 stores into]

• Process model
  – Each process has private memory by default
  – A process can grant write access to remote processes
• Communication
  – Processes communicate by storing to remotely writable memory
• Synchronization
  – Supports test-and-set, compare-and-swap, etc.
  – We assume higher-level primitives like barrier

  Process 0 (Core 0):
    buffer = map_remote(4,1);
    *buffer = 42;
    barrier();

  Process 1 (Core 1):
    buffer = remote_write_alloc(4,0);
    barrier();
    print(*buffer);

RSP emphasizes locality for performance on large-scale multicores

[Figure: Core 0 is a producer — data moves from Core 0's registers across the network into Core 1's cache; Core 1 is a consumer — all of its loads are guaranteed to be local, with minimum latency]

1. Core 1 gives Core 0 write permission for buffer
2. Core 0 receives write permission from Core 1 for buffer
3. Core 0's store instructions target remote memory on the consumer (buffer)
4. Core 1's standard load instructions read the data from its cache

Performance benefit   Description
Locality              Cores access physically close memories; load latency is minimized
Fine-grained comm.    Complete overlap of communication and computation
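To make the example above concrete, the following is a minimal C sketch of the same producer/consumer exchange. The RSP calls (map_remote, remote_write_alloc, barrier) come from the slide's pseudocode; the signatures declared here are assumptions for illustration, not a real library.

    #include <stdio.h>

    /* Assumed RSP primitives, following the slide's pseudocode:
     *   remote_write_alloc(size, writer) -- allocate local memory that
     *     core `writer` may store into (grants write permission).
     *   map_remote(size, owner) -- map the remotely writable buffer
     *     allocated by core `owner`, so stores target its cache.
     *   barrier() -- the higher-level synchronization the model assumes. */
    extern void *remote_write_alloc(int size, int writer_core);
    extern void *map_remote(int size, int owner_core);
    extern void  barrier(void);
    extern int   my_id(void);

    void example(void)
    {
        if (my_id() == 0) {
            /* Producer: ordinary stores travel over the network
             * directly into Core 1's cache. */
            int *buffer = map_remote(sizeof(int), 1);
            *buffer = 42;
            barrier();                 /* make the write visible */
        } else {
            /* Consumer: grants Core 0 write permission, then reads
             * with ordinary loads that hit its local cache. */
            int *buffer = remote_write_alloc(sizeof(int), 0);
            barrier();
            printf("%d\n", *buffer);   /* prints 42 */
        }
    }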
2D FFT example illustrates performance and usability

Threads (easiest to write; fine-grain; no locality):
    Matrix A, C;
    shared Matrix B;
    pid = my_id();
    init(A);
    row_fft(B, A, pid);
    barrier();
    col_fft(C, B, pid);

DMA (hard to write; coarse-grain; high locality):
    Matrix A, B, B2, C;
    pid = my_id();
    init(A);
    row_fft(B, A, pid);
    DMA_move(B2, B);
    col_fft(C, B2, pid);

RSP (easy to write; fine-grain; high locality):
    Matrix A, C;
    Matrix B = alloc_rsp();
    pid = my_id();
    init(A);
    row_fft(B, A, pid);
    barrier();
    col_fft(C, B, pid);

For more detail see: Hoffmann, Wentzlaff, Agarwal. Remote Store Programming: Mechanisms and Performance. Technical Report MIT-CSAIL-TR-2009-017, May 2009.

RSP requires incremental hardware support

[Figure: Core 0's stores travel over the network into Core 1's local cache; Core 1 reads the buffer with local loads]

• RSP requires incremental additional hardware
  – In a processor supporting cache-coherent shared memory:
    • An additional memory allocation mode for remotely writable memory
    • Do not update the local cache when writing to remote memory
  – In a processor without cache-coherent shared memory:
    • Memory allocation for remotely writable memory
    • Write misses on remote cores forward the miss to the allocating core

RSP, Threads, and DMA are compared on the TILEPro64 processor

• Emulate RSP on TILEPro64
  – TILEPro64 supports cache coherence and DMA
  – Additional memory allocation modes let us emulate RSP (see the allocation sketch after the next figure)
• Implement embedded benchmarks with RSP
  – Matrix Transpose
  – Matrix Multiply
  – Two-dimensional FFT
  – Bitonic Sort
  – Convolution
  – Floyd-Steinberg Error Diffusion
  – H.264 Encode
  – Histogram
• Compare performance to cache coherence and DMA
  – Compare speedup and load latency

Speedups of RSP and cache-coherent benchmarks

[Figure: eight panels — Bitonic Sort, 2D FFT, Error Diffusion, Convolution, Image Histogram, Matrix Multiply, Matrix Transpose, and H.264 — plotting speedup (0–64) of RSP and CC versus core count (2–64; H.264 up to 40); the 2D FFT panel is annotated with an off-scale peak of 80.6]
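The emulation hinges on choosing which tile's cache "homes" a buffer at allocation time. Below is a minimal sketch of how a remotely writable RSP buffer might be allocated with Tilera's TMC library; the names and signatures (tmc_alloc_set_home, tmc_alloc_map) are recalled from Tilera's documentation and should be treated as assumptions, and this is not necessarily the exact recipe the authors used.

    #include <stddef.h>
    #include <tmc/alloc.h>   /* Tilera MDE allocation API (assumed) */

    /* Allocate a buffer whose cache lines are homed on the consumer's
     * tile. A producer's stores to this buffer are sent to the
     * consumer's cache instead of allocating in the producer's own
     * cache, approximating an RSP remote store; the consumer's loads
     * then hit its local cache. */
    void *alloc_rsp_buffer(size_t bytes, int consumer_cpu)
    {
        tmc_alloc_t alloc = TMC_ALLOC_INIT;
        tmc_alloc_set_home(&alloc, consumer_cpu); /* home cache = consumer */
        return tmc_alloc_map(&alloc, bytes);      /* NULL on failure */
    }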
RSP outperforms threading with cache coherence for large numbers of cores

[Figure: speedup of RSP versus Cache Coherence (speed of RSP / speed of CC, 0–3.5; higher is better) across 2–64 cores for Transpose, Convolution, FFT, Matrix Multiply, Error Diffusion, Histogram, H.264, and Bitonic Sort]

• For large numbers of cores, RSP provides significant benefit
• Five RSP apps are 1.5x faster than CC on 64 cores
• RSP FFT is more than 3x faster than CC on 64 cores

RSP's emphasis on locality results in low load latency

[Figure: average load latency of RSP relative to CC (0–3.5; lower is better) across 2–64 cores for the same eight benchmarks]

Comparison of RSP, Threading, and DMA for two applications

[Figure: performance normalized to shared memory (0–3.5) for Cache Coherence, DMA, and RSP — H.264 Encode on 2–40 cores and 2D FFT on 2–64 cores]

• RSP is faster than Threading and DMA due to its emphasis on locality and fine-grained communication

Conclusion

• This talk presented Remote Store Programming
  – A programming model for multicore
  – Uses familiar communication mechanisms
  – Achieves high performance through locality and fine-grained communication
  – Requires incremental hardware support
• Conclusions
  – Threads and shared memory give good performance and are easy to use
  – For large numbers of cores, RSP can outperform threading because of its greater locality
  – RSP is easier to use than DMA and slightly harder than Threads
• Anticipated use for RSP
  – Can supplement threads and cache-coherent shared memory on multicore (a sketch follows below)
  – Most code uses standard threading and shared-memory techniques
  – Performance-critical sections of code or applications use RSP for additional performance
  – A gentle slope to programming multicore
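To illustrate that gentle slope, here is a hypothetical sketch of a worker in which everything except one hot producer/consumer stage uses ordinary threading and shared memory. alloc_rsp_buffer, barrier, and my_id are the same assumed primitives as in the earlier sketches; the application-specific functions are placeholders, not code from the talk.

    #include <stddef.h>

    extern void *alloc_rsp_buffer(size_t bytes, int consumer_cpu);
    extern void  barrier(void);
    extern int   my_id(void);

    /* Placeholder application phases (assumed). */
    extern void setup_shared_state(void);         /* ordinary threads + shared memory */
    extern void produce(double *out, int pid);    /* hot stage: fine-grained output */
    extern void consume(const double *in, int pid);
    extern void report_results(void);             /* ordinary shared memory again */

    #define N 4096

    void worker(void)
    {
        int pid = my_id();
        setup_shared_state();                     /* standard threading code */

        /* Performance-critical section: core 0 streams results straight
         * into core 1's cache with ordinary store instructions. */
        double *buf = alloc_rsp_buffer(N * sizeof(double), 1);
        if (pid == 0) produce(buf, pid);
        barrier();
        if (pid == 1) consume(buf, pid);

        report_results();                         /* back to shared memory */
    }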