R-Stream: Enabling efficient development of portable, high-performance, parallel applications

Peter Mattson, Allen Leung, Ken Mackenzie,
Eric Schweitz, Peter Szilagyi, Richard Lethin
{mattson, leunga, kenmac, schweitz, szilagyi, lethin}@reservoir.com
R-Stream Compiler
Reservoir Labs Inc.
Goals and Benefits
Portable, consistently high-performance compiler for
streaming applications on morphable architectures
Relative to fixed-function hardware, R-Stream and a morphable architecture yield:
• Cost savings
• Dynamic mission changes
• Shorter time to deployment
• In-mission adaptation
• Multipurpose hardware
• Multi-sensor
• More advanced algorithms
• Less costly bug fixes and upgrades
Streaming Compilation
• Streaming is a pattern in efficient implementations of computation- and data-intensive applications
  – Three key characteristics:
    • Processing is largely parallel
    • Data access patterns are apparent
    • Control is high-level, steady, simple
  – Caused by the intersection of the application domain and architecture limits:
    • Processing is largely parallel: applications process, simulate, or render physical systems, so spatially and/or temporally distributed data are independent, while VLIW, SIMD, and multiprocessor architectures demand parallelism
    • Data access patterns are apparent: the data are continuous and N-dimensional in space/time, while local memories are small
    • Control is high-level, steady, and simple: applications have few tasks and few unpredictable events, while handling unpredictability in hardware is expensive
  (Venn diagram: streaming lies at the intersection of the application domain and architecture limits.)
• Streaming languages and architectures designed for this pattern are emerging:
  – Streaming languages enforce the pattern, making parallelism and data-flow apparent
  – Streaming architectures provide maximum parallel resources and minimize control overhead by exposing resources to the compiler
• R-Stream uses these to expand the scope of optimization and resource choreography:
  – Streaming languages avoid limits on compiler analysis (e.g. aliasing), enabling top-down optimizations
  – Streaming architectures allow the compiler to lay out all computation, data, and communication
(Figure: the scope of optimization grows from a single loop nest (CSE etc.) through multiple loop nests to the whole program, while the choreographed resources grow from ALUs and registers through cache and local memory to the multiprocessor and routing.)
Implementation Plan
• Spiral development
  – Helps to ensure eventual results
  – Enables early usage by partners
• Three major releases:
  – Release 1.0: functionality, but no performance optimization
  – Release 2.0: common-case, phase-ordered performance optimization, static morphing
  – Release 3.0: general, unified performance optimization, dynamic morphing
(Gantt chart: a quarterly schedule spanning Q2 2003 through Q4 2005 with Release 1.0, 2.0, and 3.0 milestones; the front-end, streaming IR, mapping, and code generation components each mature from non-existent through discardable draft and draft to final.)
Compiler Flow
Brook source (the front end also accepts StreamIt). Kernels perform computations on streams; this streamAdd kernel computes a pair-wise sum:

void kernel streamAdd(stream float s1, stream float s2,
                      out stream float s3) {
  s3 = s1 + s2;          // represents s3.push(s1.pop() + s2.pop())
}

void kernel weightedSum(stream float image_in[3][3],
                        out stream float image_out) {
  image_out = 1.0 / 9.0 *
      (image_in[0][0] + image_in[0][1] + image_in[0][2] +
       image_in[1][0] + image_in[1][1] + image_in[1][2] +
       image_in[2][0] + image_in[2][1] + image_in[2][2]);
}

int brookMain() {
  float image[100][100][3][3];
  float imageOut[100][100];
  stream float st[3][3];    // a stream is 1D, but its elements can be
  stream float stOut;       // arrays: this is a stream of 3x3 arrays
  stream float stOutDouble;
  …
  streamRead(st, image, 0, 99, 0, 99);    // read from array into stream
  weightedSum(st, stOut);
  streamAdd(stOut, stOut, stOutDouble);   // use streamAdd to double the stream
  streamWrite(imageOut, stOutDouble, 0, 99, 0, 99);
  …
}
Front end

Mapper (simple 1.0 algorithm shown, applied to a simple example):
A PCA machine model describes the target architecture: here, processors Proc0 and Proc1 with threads and caches, memories Mem0 and Mem1, and a DMA engine DMA2.
0. Extract graph: the streaming IR yields a graph of kernels (weightedSum, splitter, streamAdd) connected by streams.
1. Unfold graph: replicate kernels (weightedSum1/weightedSum2, splitter1/splitter2, streamAdd1/streamAdd2) to expose data parallelism.
2. Cluster based on edge/node weights.
3. Convert communication: insert DMA transfers (DMA1–DMA4) on stream edges between clusters.
4. Schedule and assign storage: Proc0 runs weightedSum1 and weightedSum2; Proc1 runs splitter1, splitter2, streamAdd1, and streamAdd2; DMA2 performs transfers DMA1–DMA4 through GlobalMem.
Code generation

Streaming Virtual Machine code (streams mapped to memory, the whole application strip-mined, kernels mapped to processors):

streamInitRAM(&ts_1_0, MEM1, 0, 512, sizeof(float[3][3]), STREAM_ONE_EOS);
streamInitRAM(&ts_1_18432, MEM1, 18432, 512, sizeof(float), STREAM_ONE_EOS);
...
int hlc_loop = 0;
for (; hlc_loop < 10000; hlc_loop += 512) {
  int hlc_smc = min(10000 - hlc_loop, 512);
  int hlc_seteos = hlc_loop + hlc_smc == 10000;
  kernelWait(&mv7.HLC_kernel);
  moveInit(&mv7, DMA2, &streamIn, &ts_1_0, hlc_smc, hlc_seteos);
  kernelAddDependence(&mv7.HLC_kernel, &mv8.HLC_kernel);
  kernelRun(&mv7.HLC_kernel);
  kernelWait(&weightedSum_2.HLC_kernel);
  /* weightedSum_2 */
  HLC_INIT_weightedSum(&weightedSum_2, PROC1, (Block*)0, &ts_1_0, &ts_1_18432);
  kernelAddDependence(&weightedSum_2.HLC_kernel, &streamAdd_5.HLC_kernel);
  kernelRun(&weightedSum_2.HLC_kernel);
  kernelWait(&mv6.HLC_kernel);
  /* mv6 */
  moveInit(&mv6, DMA2, &streamIn, &ts_0_0, hlc_smc, hlc_seteos);
  kernelAddDependence(&mv6.HLC_kernel, &mv7.HLC_kernel);
  kernelRun(&mv6.HLC_kernel);
  kernelWait(&splitter_4.HLC_kernel);
  /* splitter_4 */
  HLC_INIT_splitter(&splitter_4, PROC1, (Block*)20480, &ts_1_18432, &ts_1_20480, &ts_1_22528);
  kernelAddDependence(&splitter_4.HLC_kernel, &weightedSum_2.HLC_kernel);
  kernelRun(&splitter_4.HLC_kernel);
  kernelWait(&streamAdd_5.HLC_kernel);
  /* streamAdd_5 */
  HLC_INIT_streamAdd(&streamAdd_5, PROC1, (Block*)24576, &ts_1_20480, &ts_1_22528, &ts_1_24576);
  kernelAddDependence(&streamAdd_5.HLC_kernel, &splitter_4.HLC_kernel);
  kernelRun(&streamAdd_5.HLC_kernel);
  ...
}
...

Low-level compiler for target architecture

Binary Executable
Technologies
Streaming Intermediate Representation (IR)
• Represents an application within the compiler as a series of kernels or loop nests connected by streams or arrays
• Enables the compiler to directly analyze and optimize streaming applications
• Makes task, pipeline, and data parallelism readily apparent:
(Figure: kernels or loop nests A–D connected by streams or arrays; replicated copies of a kernel show data parallelism, independent kernels show task parallelism, and producer-consumer chains show pipeline parallelism.)
Top-down, unified, streaming IR mapping
• Mapping will include top-down optimizations, such as software-pipelining an entire application to cover communication latency
• Mapping of computation, data, and communication layout will be performed by a single, unified pass for greater efficiency
• Mapping will exploit all degrees of parallelism exposed by the streaming IR:
(Figure: space-time schedules of kernels A–D under data-parallel, task-parallel, pipelined, and combined mappings; edges in the streaming IR become communication or storage.)
Compiler-driven architecture morphing
• Morphable architectures reconfigure high-level resources such as processors, memories, and networks to meet application demands
• The compiler must choose a configuration and map to that configuration; this expands the compiler mapping space from computation→processor and data→memory to include the configuration itself
• The compiler can employ one or multiple configurations:
  1. Static morphing: pick one configuration and mapping for the whole application
  2. Dynamic morphing: pick multiple configurations and mappings for different parts of the application, and orchestrate morphing between them
(Figure: space-time diagrams contrasting a single configuration for the whole application with reconfiguration between application phases.)