A peek at some of the math behind R-Stream

The R-Stream High-Level Transformation Tool:
State of the Art and Objectives Within the UHPC Program
N. Vasilache, R. Lethin
Government Purpose Rights
Purchase Order Number: N/A
Agreement No.: HR001‐10‐3‐0007
Contractor Name: Intel Corporation
Contractor Address: 2111 NE 25th Ave M/S JF2‐60, Hillsboro, OR 97124
Expiration Date: None
The Government’s rights to use, modify, reproduce, release, perform, display, or disclose this technical data are restricted by
paragraphs B (1), (3) and (6) of Article VIII as incorporated within the above purchase order and Agreement. No restrictions apply
after the expiration date shown above. Any reproduction of the software or portions thereof marked with this legend must also
reproduce the markings. The following entities, their respective successors and assigns, shall possess the right to exercise said
property rights, as if they were the Government, on behalf of the Government: University of Delaware – www.udel.edu;
ETI International – www.etinternational.com; Intel Corporation – www.intel.com; Reservoir Labs – www.reservoir.com; University
of California – San Diego – www.ucsd.edu; University of Illinois at Urbana-Champaign – www.illinois.edu.
Reservoir Labs
Copyright © 2010 Reservoir Labs, Inc. All Rights Reserved.
Use or disclosure of data contained on this slide is subject to the restrictions on the title page of this presentation.
This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted
as representing the official policies, either expressed or implied, of the U.S. Government.
Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results
Power Efficiency Driving Architectures
[Diagram: a heterogeneous node built from GPPs, SIMD units, FPGAs, DMA engines, and distributed memories, annotated with the architectural trends below.]
• Heterogeneous processing (GPP, SIMD, FPGA)
• SIMD
• DMA
• Distributed local memories
• Explicitly managed architecture
• Bandwidth starved
• NUMA
• Multiple spatial dimensions
• Hierarchical (including board, chassis, cabinet)
• Multiple execution models
• Mixed parallelism types
Computation Choreography
• Expressing it in the program:
  – Annotations and pragma dialects for C
  – Chapel subset (UHPC in progress with UIUC)
  – CnC subset (UHPC in progress with Intel)
• Generating it:
  – Explicitly (e.g., new languages like CUDA, target-specific)
  – Implicitly (UHPC in progress: libraries, runtime abstractions, CnC)
• But before expressing it, how can programmers find it? Not our focus.
• Manual constructive procedures, art, sweat, time
  – Artisans get complete control over every detail
• Fully automatic
  – Operations research problems and (advanced) autotuning
  – Faster, and sometimes better, than a human
Program Transformations Specification
[Figure: the iteration space of a statement S(i,j), drawn in (i, j) coordinates and mapped by a schedule θ : Z² → Z² onto time coordinates (t1, t2).]
• Schedule maps iterations to multi-dimensional time: θ : Z² → Z² for the statement S(i,j) in the figure (illustrated below)
  – A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space (UHPC in progress, partially done)
• Layout maps data elements to multi-dimensional space (UHPC in progress)
• Hierarchical by design; tiling serves separation of concerns
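As a purely illustrative instance of these three maps for the statement S(i,j) above — the coefficients are invented for exposition and are not R-Stream's choices — one could write:

\[
\theta_S(i,j) = \begin{pmatrix} i + j \\ j \end{pmatrix} \quad \text{(schedule: skewed multi-dimensional time)}, \qquad
\pi_S(i,j) = i \bmod P \quad \text{(placement: iteration on processor } i \bmod P\text{)},
\]
\[
\lambda_A(i,j) = i \cdot N + j \quad \text{(layout: row-major address of } A[i][j] \text{ for an } N\text{-column array)}.
\]

A schedule such as θ_S is feasible when, for every dependence from iteration x to iteration y, θ_S(y) − θ_S(x) is lexicographically non-negative.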
Polyhedral Slogans
• Parametric imperfect loop nests (see the example after this list)
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs
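To ground the first slogan, here is a minimal, invented example of a parametric imperfect loop nest with affine static control — the kind of input the polyhedral representation captures exactly:

/* Illustrative only: a parametric (in N), imperfectly nested loop with
   affine bounds and accesses; statements sit at two different depths. */
void tri_matvec(int N, double A[N][N], const double b[N], double y[N])
{
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;                      /* statement S1 at depth 1 */
        for (int j = 0; j <= i; j++)     /* triangular (affine) bound */
            y[i] += A[i][j] * b[j];      /* statement S2 at depth 2 */
    }
}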
R-Stream Blueprint
[Block diagram of the R-Stream toolchain.] The EDG C front end (and, under UHPC, a CnC / Chapel front end in progress) parses the input into a scalar representation. Raising lifts the relevant regions into the extended (polyhedral) representation, on which the Polyhedral Mapper operates under the guidance of a machine model; lowering returns the result to the scalar representation. A pretty printer then emits the mapped code as CUDA, C with annotations, pthreads, high-level CnC, low-level C, and other targets.
Mapping Process for Explicitly Managed Memories
1. Scheduling: parallelism, locality, tilability (driven by the dependences)
2. Task formation: coarse-grain atomic tasks; master/slave side operations
3. Placement: assign tasks to blocks/threads
Followed by:
• Local / global data layout optimization
• Multi-buffering (explicitly managed; a generic sketch follows below)
• Synchronization (barriers)
• Bulk communications
• Thread generation → master/slave
• Target-specific optimizations
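The multi-buffering and bulk-communication steps above amount to the classic double-buffered pipeline. The sketch below is a generic, hand-written illustration (not R-Stream-generated code); memcpy stands in for whatever asynchronous DMA or bulk-transfer primitive the target exposes, and process_tiles / TILE are invented names.

#include <string.h>

#define TILE 256

/* Generic double-buffering skeleton: fetch tile t+1 while computing on
   tile t.  With a real explicitly managed memory, the memcpy would be an
   asynchronous DMA issue, and the computation would wait on its completion. */
void process_tiles(const double *global, double *out, int ntiles)
{
    double buf[2][TILE];                             /* two local scratch buffers */
    memcpy(buf[0], global, TILE * sizeof(double));   /* prologue: fetch tile 0 */
    for (int t = 0; t < ntiles; t++) {
        int cur = t & 1, nxt = 1 - cur;
        if (t + 1 < ntiles)                          /* prefetch the next tile */
            memcpy(buf[nxt], global + (t + 1) * TILE, TILE * sizeof(double));
        for (int i = 0; i < TILE; i++)               /* compute on the current tile */
            out[t * TILE + i] = 2.0 * buf[cur][i];
    }
}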
Model for Scheduling Trades 3 Objectives Jointly
[Tradeoff diagram.] The scheduling model balances three objectives at once:
• More locality — fewer global memory accesses
• More parallelism — sufficient occupancy
• Better effective bandwidth — memory contiguity
The levers traded against each other are loop fusion, loop fission, and successive-thread contiguity; a schematic textbook formulation follows below.
Patent pending
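For background only — this is the standard affine-scheduling setting from the polyhedral literature, sketched here to make the slide concrete, and not the patented joint formulation — each statement S gets an affine schedule, and the three objectives become linear constraints and costs over its coefficients:

\[
\theta_S(\mathbf{x}) = \mathbf{c}_S \cdot \mathbf{x} + d_S , \qquad
\theta_T(\mathbf{x}_T) - \theta_S(\mathbf{x}_S) \ge 0
\quad \forall (\mathbf{x}_S, \mathbf{x}_T) \in \mathcal{P}_e
\quad \text{(every dependence } e : S \to T \text{ is preserved)},
\]
\[
\min \;\; \sum_e w_e\, \delta_e
\;-\; \lambda \cdot (\#\,\text{parallel dimensions})
\;-\; \mu \cdot (\#\,\text{contiguous accesses}),
\qquad \text{with} \quad
\theta_T(\mathbf{x}_T) - \theta_S(\mathbf{x}_S) \le \delta_e .
\]

Small dependence distances δ_e favor fusion and locality, zero distances expose parallel dimensions, and the contiguity term rewards successive-thread / stride-1 access; the weights w_e, λ, μ are illustrative placeholders.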
Inside the R-Stream Mapper
Optimization modules are engineered to expose advanced “knobs” used by the auto-tuner. All modules operate on the extended GDG representation and are orchestrated by a Tactics Module:
• Parallelization
• Locality optimization
• Tiling
• Memory promotion
• Placement
• Communication generation
• Synchronization generation
• Layout optimization
• Polyhedral scanning (Jolylib, …)
• …
Optimization Across BLAS Calls
/* Optimization with BLAS — numerous cache misses */
for loop {                       /* outer loop(s) */
    ...
    BLAS call 1                  /* retrieves data Z from disk */
    ...                          /* stores Z back to disk */
    BLAS call 2                  /* retrieves Z from disk again!!! */
    ...
    BLAS call n
    ...
}

vs.

/* Global optimization */
doall loop {                     /* outer loop(s): can parallelize */
    ...
    for loop {
        ...
        [read from Z]
        ...
        [write to Z]             /* loop fusion can improve locality */
        ...
        [read from Z]
    }
    ...
}

→ Global optimization exposes better parallelism and locality (significant speedups); a concrete fusion example follows below.
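As a small, invented illustration (not the radar code) of what optimizing across calls buys, compare two BLAS-1-style passes over the same vector Z written separately versus fused into a single pass; the function names are placeholders.

/* Separate passes: Z and y are streamed through the cache twice. */
void axpy_then_dot(int n, double a, const double *Z, double *y, double *s)
{
    for (int i = 0; i < n; i++) y[i] += a * Z[i];    /* pass 1 over Z, y */
    double acc = 0.0;
    for (int i = 0; i < n; i++) acc += Z[i] * y[i];  /* pass 2 over Z, y */
    *s = acc;
}

/* Fused: one pass, each Z[i] and y[i] is touched while still hot. */
void axpy_dot_fused(int n, double a, const double *Z, double *y, double *s)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++) {
        y[i] += a * Z[i];
        acc  += Z[i] * y[i];
    }
    *s = acc;
}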
Reservoir Labs
Copyright © 2010 Reservoir Labs, Inc. All Rights Reserved.
Use or disclosure of data contained on this slide is subject to the restrictions on the title page of this presentation.
This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted
as representing the official policies, either expressed or implied, of the U.S. Government.
Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results
Codelets from an HLC Perspective
• Codelets have:
  – Fine granularity
  – Explicit communication
  – Point-to-point and other kinds of synchronization
• They can utilize scheduling and dependence information hints
• They should also use hints on the placement of data and computation
• They work from local scratch-pad memories
• A good match for UHPC hardware; allows good control for energy, resilience, etc.
UHPC from an HLC Perspective
• Energy: must minimize data motion/communication
• Near-threshold voltage: must find even more parallelism
• Resilience: synergy needed with new checkpointing/recovery models
• Self-awareness: dynamic, distributed feedback and regulation
Another Observation
• But programming directly in codelets is impractical:
• Exposing machine details is a good thing, but we don’t want programmers to manage them.
  – Too complicated: getting it done, getting it right, getting it fast. (Complexity = parallelism × locality × resilience …)
• Writing directly in codelets would also over-specify the program, bake it to one machine, and defeat portability.
• The role of the HLC is to take high-level abstractions from the programmer:
  – sequential code,
  – Chapel, CnC,
  – data-parallel idioms,
  – math language
• … and to perform optimization to the various levels of the target hardware hierarchy.
Based on R-Stream Technology
[Diagram: the UHPC high-level compiler builds on existing R-Stream technology and extends it with new capabilities, mapping high-level programs down to codelets.]
• Inputs: sequential C; Chapel, CnC, math, and data-parallel idioms — i.e., high-level programming for codelets.
• Capabilities shown (existing and new): energy; locality optimization; explicit communication generation; mapping to accelerators; hierarchical barriers; deep hierarchical scheduling; point-to-point synchronization; data placement optimizations; more parallelism; exact dependences; imperfect loop nests; dynamic schedules and placements; resilience; emitting scheduling and placement hints; emitting interaction sets; ABFT support; memory reuse optimization; checkpointing optimization.
Goal: Generating CnC
• Assume a mapping from CnC to codelets.
• Advantages of CnC:
  – More succinct expression of parallelism (the skewing problem)
  – Adaptable parallelism and load balancing
  – High-level representation of data-parallel idioms: CnC helps solve the irregular, idiomatic part of the problem, and R-Stream can target optimizations across irregular idioms
  – Easy to test the generated code for correctness and to execute it efficiently on x86 / clusters
Goal: Synergy with CnC
• Represent CnC action-attribute graphs explicitly in R-Stream:
• Benefit from optimization across multiple CnC steps
• Explore the tradeoff between fusing steps and running them in parallel:
  – Fused steps reduce the runtime overhead
  – And also the memory footprint
• Generate many semantically equivalent versions and explore the design-space tradeoffs
  – R-Stream’s auto-tuning mode will help a lot here
Goal: Synergy with Chapel and UIUC
• Extensions to blackboxing:
  – User interface; can represent any program
  – Even supports linking with precompiled code
• Integrate user-specific data distributions within R-Stream:
  – HTAs
  – Locales
• Find the right abstraction:
  – The goal is for R-Stream to understand the abstraction and make good mapping decisions, not to replace the user’s choices
• Iterative, feedback-directed design:
  – Language / transformation tool
  – Transformation tool / runtime
  – Language / runtime
Goal: Pragmatic Approach
• Support multiple kinds of placement:
  – Explicit / implicit; virtual / physical; linear / cyclic / block-cyclic / general (the block-cyclic mapping is spelled out after this list)
• Build on R-Stream’s current over-provisioning for performance:
  – Originally built for CUDA performance
  – The concepts extend to any architecture with dynamic scheduling decisions
  – Has implications on locality/communication granularity
  – Examine the implications on power
• Use advanced auto-tuning features for design-space exploration
• Explore which modes perform best with CnC:
  – Dependent on how over-provisioning is implemented
  – Over-provisioning may have implications on memory persistence
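For reference, the block-cyclic placement named above is the standard one (a textbook definition, not an R-Stream-specific formula): with block size b over P processors, index i is placed as

\[
\mathrm{owner}(i) = \left\lfloor \frac{i}{b} \right\rfloor \bmod P ,
\qquad
\mathrm{local}(i) = \left\lfloor \frac{i}{bP} \right\rfloor b + (i \bmod b) .
\]

Setting b = 1 recovers a cyclic placement, and b = ⌈N/P⌉ recovers a linear (block) placement of N elements.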
Goal: HLC Support for Challenge Applications
• Go beyond loop nest optimizations:
  – Chapel / data-parallel support
  – CnC attribute-action graph optimization
• SAR: new locality transformations demonstrated speedups on the linear flight path (reported to DARPA)
• MD: exploring HLC optimization of neutral-territory methods
• Graph: high-level approaches to optimizing graph algorithms and increasing locality; a new lock-free data-parallel algorithm for BFS
• Chess, hydrodynamics: TBD
Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results
CSLC-LMS (Mapping Across Function/Library Calls)
• Configuration 1 (MKL): radar code calling Intel MKL
• Configuration 2 (low-level compilers): radar code compiled directly with GCC / ICC
• Configuration 3 (R-Stream): radar code mapped by R-Stream into optimized radar code, then compiled with GCC / ICC
• Main comparisons:
  – R-Stream High-Level C Compiler 3.1.2
  – Intel MKL 10.2.1
  – Dual quad-core E5405 Xeon processors (8 cores total), 9 GB memory, 8 threads
RTM (Exploiting Over-Provisioning for Performance)
A 25-point, 8th-order-in-space stencil (X, Y and the coefficients C0..C4 are assumed to be defined elsewhere, e.g. as macros):

void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
            int pX, int pY, int pZ) {
  double temp;
  int i, j, k;
  for (k = 4; k < pZ-4; k++) {
    for (j = 4; j < pY-4; j++) {
      for (i = 4; i < pX-4; i++) {
        /* 8th-order central differences along k, j and i */
        temp = C0 *  U2[k][j][i] +
               C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                     U2[k][j-1][i] + U2[k][j+1][i] +
                     U2[k][j][i-1] + U2[k][j][i+1]) +
               C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                     U2[k][j-2][i] + U2[k][j+2][i] +
                     U2[k][j][i-2] + U2[k][j][i+2]) +
               C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                     U2[k][j-3][i] + U2[k][j+3][i] +
                     U2[k][j][i-3] + U2[k][j][i+3]) +
               C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                     U2[k][j-4][i] + U2[k][j+4][i] +
                     U2[k][j][i-4] + U2[k][j][i+4]);
        /* leapfrog time update */
        U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
      }
    }
  }
}
RTM (Exploiting Over-Provisioning for Performance)
• A 3-D discretized wave-equation kernel with a single time iteration
• Run on an NVIDIA GTX 480
• Double-precision 256³ problem
• High performance comes from over-provisioning space exploration and from explicit optimization of register rotation and shared-memory reuse (a register-rotation sketch follows below)
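To make “register rotation” concrete, here is a hand-written, CPU-side sketch of the idea applied to the vertical (k-direction) terms of the RTM stencil above: each (i, j) column keeps a 9-deep window of U2 values in scalars and rotates it as k advances, so vertical neighbors are reused instead of reloaded. This is an illustration of the technique, not R-Stream’s generated GPU code; X, Y and C0..C4 are assumed to be defined as for RTM_3D.

void RTM_3D_rotated(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                    int pX, int pY, int pZ) {
  int i, j, k, m;
  for (j = 4; j < pY-4; j++) {
    for (i = 4; i < pX-4; i++) {
      double w[9];                       /* window holding U2[k-4..k+4][j][i] */
      for (m = 0; m < 9; m++)            /* preload the window for k = 4 */
        w[m] = U2[m][j][i];
      for (k = 4; k < pZ-4; k++) {
        double temp =
          C0 *  w[4] +
          C1 * (w[3] + w[5] + U2[k][j-1][i] + U2[k][j+1][i] +
                              U2[k][j][i-1] + U2[k][j][i+1]) +
          C2 * (w[2] + w[6] + U2[k][j-2][i] + U2[k][j+2][i] +
                              U2[k][j][i-2] + U2[k][j][i+2]) +
          C3 * (w[1] + w[7] + U2[k][j-3][i] + U2[k][j+3][i] +
                              U2[k][j][i-3] + U2[k][j][i+3]) +
          C4 * (w[0] + w[8] + U2[k][j-4][i] + U2[k][j+4][i] +
                              U2[k][j][i-4] + U2[k][j][i+4]);
        U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
        for (m = 0; m < 8; m++)          /* rotate the window for the next k */
          w[m] = w[m+1];
        if (k + 5 < pZ)                  /* load the new top plane */
          w[8] = U2[k+5][j][i];
      }
    }
  }
}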
R-Stream to CnC Proof of Concept
• Examined the feasibility and benefits of automatic coordination-language (CnC) generation from R-Stream:
  – on a 4-D, in-place stencil kernel application
  – the coarse-grained parallelism is pipelined (i.e., wavefronts of parallel tasks) and is representative of other streaming kernels (a generic wavefront sketch follows below)
  – R-Stream generates a non-trivial OpenMP version
  – this OpenMP version was manually transformed into CnC code
  – the process is completely automatable
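For illustration, the wavefront (pipelined) structure mentioned above looks, in its simplest 2-D tiled form, like the following generic C/OpenMP sketch — invented here to show the pattern, not the actual R-Stream-generated OpenMP for the 4-D kernel; process_tile, NTI and NTJ are placeholder names.

#include <omp.h>

#define NTI 8   /* tiles along i (illustrative) */
#define NTJ 8   /* tiles along j (illustrative) */

/* Stand-in for one coarse-grain task; a real kernel would read the
   results of tiles (ti-1, tj) and (ti, tj-1) here. */
static void process_tile(int ti, int tj) { (void)ti; (void)tj; }

void wavefront(void)
{
    /* Tiles on anti-diagonal d = ti + tj depend only on tiles on
       diagonal d - 1, so each diagonal is a parallel wavefront. */
    for (int d = 0; d < NTI + NTJ - 1; d++) {
        #pragma omp parallel for
        for (int ti = 0; ti < NTI; ti++) {
            int tj = d - ti;
            if (tj >= 0 && tj < NTJ)
                process_tile(ti, tj);
        }
    }
}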
Conclusion
• R-Stream simplifies software development and maintenance
• It does this by automatically parallelizing loop code
• While optimizing for data locality, coalescing, communications
reuse, etc.
• Many exciting developments within UHPC