The R-Stream High-Level Transformation Tool: State of the Art and Objectives Within the UHPC Program
N. Vasilache, R. Lethin

Government Purpose Rights
• Purchase Order Number: N/A
• Agreement No.: HR001‐10‐3‐0007
• Contractor Name: Intel Corporation
• Contractor Address: 2111 NE 25th Ave M/S JF2‐60, Hillsboro, OR 97124
• Expiration Date: None

The Government’s rights to use, modify, reproduce, release, perform, display, or disclose this technical data are restricted by paragraphs B (1),(3) and (6) of Article VIII as incorporated within the above purchase order and Agreement. No restrictions apply after the expiration date shown above. Any reproduction of the software or portions thereof marked with this legend must also reproduce the markings.

The following entities, their respective successors and assigns, shall possess the right to exercise said property rights, as if they were the Government, on behalf of the Government:
• University of Delaware – www.udel.edu
• ET International – www.etinternational.com
• Intel Corporation – www.intel.com
• Reservoir Labs – www.reservoir.com
• University of California – San Diego – www.ucsd.edu
• University of Illinois at Urbana-Champaign – www.illinois.edu

Reservoir Labs. Copyright © 2010 Reservoir Labs, Inc. All Rights Reserved. Use or disclosure of data contained on this slide is subject to the restrictions on the title page of this presentation. This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results

Power Efficiency Driving Architectures
• Heterogeneous Processing
• Distributed Local Memories
• Explicitly Managed Architecture
• Bandwidth Starved
• Multiple Spatial Dimensions
• Hierarchical (including board, chassis, cabinet)
• Multiple Execution Models
• Mixed Parallelism Types
[Figure: architecture diagram with blocks labeled GPP, SIMD, FPGA, DMA, Memory and NUMA]

Computation Choreography
• Expressing it in the program:
  – Annotations and pragma dialects for C
  – Chapel subset (UHPC in progress with UIUC)
  – CnC subset (UHPC in progress with Intel)
• Generating it:
  – Explicitly (e.g., new languages like CUDA, target specific)
  – Implicitly (UHPC in progress: libraries, runtime abstractions, CnC)
• But before expressing it, how can programmers find it?
• Manual constructive procedures, art, sweat, time (not our focus)
  – Artisans get complete control over every detail
• Fully-automatic
  – Operations research problems and (advanced) autotuning
  – Faster, sometimes better, than a human

Program Transformations Specification
[Figure: iteration space of a statement S(i,j), with axes i and j, mapped to time axes t1 and t2 by a schedule of type $\mathbb{Z}^2 \to \mathbb{Z}^2$]
• Schedule maps iterations to multi-dimensional time:
  – A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space:
  – UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space:
  – UHPC in progress
• Hierarchical by design, tiling serves separation of concerns

Polyhedral Slogans
• Parametric imperfect loop nests
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs
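
The statement that a feasible schedule preserves dependences can be illustrated with a small check. The sketch below assumes a 2-D iteration space, uniform dependence distance vectors, and a schedule given as a 2x2 integer matrix; under those assumptions, the schedule is feasible when it maps every distance vector to a lexicographically positive time vector. The function names are hypothetical and are not part of R-Stream.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the n-element vector v is lexicographically positive,
 * i.e. its first nonzero component is positive. */
static bool lex_positive(const int *v, size_t n) {
    for (size_t k = 0; k < n; k++) {
        if (v[k] > 0) return true;
        if (v[k] < 0) return false;
    }
    return false; /* all zero: source and target would run at the same time */
}

/* Assumed model: theta is a 2x2 schedule matrix and each deps[d] is a
 * uniform dependence distance (di, dj). The schedule is feasible if
 * theta * deps[d] is lexicographically positive for every dependence. */
bool schedule_is_legal(const int theta[2][2], const int (*deps)[2],
                       size_t ndeps) {
    assert(ndeps == 0 || deps != NULL);
    for (size_t d = 0; d < ndeps; d++) {
        int t[2];
        for (int r = 0; r < 2; r++)
            t[r] = theta[r][0] * deps[d][0] + theta[r][1] * deps[d][1];
        if (!lex_positive(t, 2)) return false;
    }
    return true;
}
```

With dependences (1,0) and (0,1), the identity and a skewed schedule pass this check, while reversing the outer loop fails it. With a dependence (1,-1), loop interchange fails it.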

R-Stream Blueprint
[Figure: block diagram of the compiler with components labeled Machine Model, Polyhedral Mapper, Raising, Lowering, EDG C Front End, CnC / Chapel Front End (UHPC in progress), Scalar Representation, Extended Representation, Pretty Printer (CUDA, C+annotations, pthreads …), High-Level C, Low-Level C, CnC]

Mapping Process for Explicitly Managed Memories
[Figure label: Dependencies]
1- Scheduling: parallelism, locality, tilability
2- Task formation:
  - Coarse-grain atomic tasks
  - Master/slave side operations
3- Placement: assign tasks to blocks/threads
- Local / global data layout optimization
- Multi-buffering (explicitly managed)
- Synchronization (barriers)
- Bulk communications
- Thread generation -> master/slave
- Target-specific optimizations

Model for Scheduling Trades 3 Objectives Jointly
[Figure: trade-off diagram with labels Loop Fusion (+ successive thread contiguity), Loop Fission, Memory Contiguity (+ successive thread contiguity), More Locality, More Parallelism, Fewer Global Memory Accesses, Sufficient Occupancy, Better Effective Bandwidth]
Patent pending

Inside the R-Stream Mapper
• Optimization modules engineered to expose advanced “knobs” used by auto-tuner
[Figure: mapper structure with labels Extended GDG representation, Tactics Module, Parallelization, Locality Optimization, Tiling, Memory Promotion, Sync Generation, Placement, Comm. Generation, Layout Optimization, Polyhedral Scanning, …, Jolylib, …]

Optimization Across BLAS Calls

    /* Optimization with BLAS */
    for loop {              <- outer loop(s)
      …
      BLAS call 1           <- retrieve data Z from disk, store data Z back to disk
      …
      BLAS call 2           <- retrieve data Z from disk !!!
      …
      BLAS call n
      …
    }
    Numerous cache misses

VS.

    /* Global Optimization */
    doall loop {            <- can parallelize outer loop(s)
      …
      for loop {            <- loop fusion can improve locality
        …
        [read from Z]
        …
        [write to Z]
        …
        [read from Z]
      }
      …
    }

→ Global optimization exposes better parallelism and locality (significant speedups)
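
The contrast drawn on the BLAS slide can be sketched in a few lines of C. The example assumes two hypothetical library-style routines, scale_inplace and add_inplace, which stand in for separate BLAS calls and each sweep the whole array Z. The fused version makes a single pass, so each element of Z is read and written once while it is still in cache or a register. This only illustrates the idea and is not R-Stream output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for two separate library calls on Z. */
static void scale_inplace(double *z, double a, size_t n) {
    for (size_t i = 0; i < n; i++) z[i] = a * z[i];
}

static void add_inplace(double *z, const double *y, size_t n) {
    for (size_t i = 0; i < n; i++) z[i] = z[i] + y[i];
}

/* Library-call style: Z is traversed twice, once per call. */
void scale_then_add_two_pass(double *z, const double *y, double a, size_t n) {
    assert(n == 0 || (z != NULL && y != NULL));
    scale_inplace(z, a, n);
    add_inplace(z, y, n);
}

/* Globally optimized style: both operations fused into one traversal. */
void scale_then_add_fused(double *z, const double *y, double a, size_t n) {
    assert(n == 0 || (z != NULL && y != NULL));
    for (size_t i = 0; i < n; i++) {
        double t = a * z[i];   /* first "call", kept in a register */
        z[i] = t + y[i];       /* second "call", no reload of Z */
    }
}
```

Both functions compute the same values. The difference lies in how many times Z travels through the memory hierarchy, which is the effect the slide attributes to optimizing across calls.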

Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results

Codelets From a HLC perspective
• Codelets have:
  – Fine granularity
  – Explicit communication
  – Point to point, other kinds of synchronization
• Can utilize scheduling and dependence information hints
• Should also use placement of data and computation hints
• Work from local scratch pad memories
• Good match for UHPC hardware, allows good control for energy, resilience, etc.

UHPC from HLC perspective
• Energy
  – must minimize data motion/communication
• Near Threshold Voltage
  – must find even more parallelism
• Resilience
  – synergy needed with new checkpointing/recovery models
• Self awareness
  – dynamic distributed feedback and regulation

Another Observation
• But programming directly in codelets is impractical:
  • Exposing machine details is a good thing, but don’t want programmers to manage them.
    – Too complicated: getting it done, getting it right, getting it fast. (Complexity = parallelism x locality x resilience…)
  • Writing directly in codelets will also over-specify the program, bake to one machine, and defeat portability
• Role of HLC is to take high level abstractions from programmer
  – sequential code,
  – Chapel, CnC,
  – data-parallel idioms,
  – math language
• Perform optimization to various levels of the target hardware hierarchy

Based on R-Stream Technology
[Figure: table of Existing versus New capabilities, grouped by goal]
• Energy: Locality Opt; Explicit Comm Gen; Map to accelerators; Hierarchical barriers; Deep hierarchical scheduling; Point to point sync; Data placement opts
• More parallelism: Exact dependence; Imperfect loops; Dynamic schedules and placements
• Resilience: ABFT support; Memory reuse opt; Checkpointing opt
• For Codelets: Emit scheduling and placement hints; Emit interaction sets
• High level programming: Sequential C; Chapel, CnC, Math, Data Parallel Idioms

Goal: Generating CnC
• Assume a mapping from CnC -> Codelets
• Advantages of CnC
  • More succinct expression of parallelism (the skewing problem)
  • Adaptable parallelism and load balancing
  • High-level representation of data parallel idioms
    – CnC helps solve the irregular, idiomatic part of the problem
    – R-Stream can target optimizations across irregular idioms
• Easy to test for correctness of generated code and execute efficiently on x86 / clusters.

Goal: Synergy with CnC
• Represent CnC action-attribute graphs explicitly in R-Stream:
  • Benefit from optimization across multiple CnC steps
  • Explore tradeoff between fusing steps and running them in parallel:
    – Fused steps reduce the runtime overhead
    – And also the memory footprint
• Generate many semantically equivalent versions and explore the design space tradeoffs
  – R-Stream auto-tuning mode will help a lot here

Goal: Synergy with Chapel and UIUC
• Extensions to blackboxing:
  – User interface, can represent any program
  – Supports even linking with precompiled code
• Integrate user-specific data distributions within R-Stream
  – HTAs
  – Locales
  – Find the right abstraction
  – The goal is for R-Stream to understand the abstraction and make good mapping decisions, not to replace the user choices
• Iterative, feedback-directed design
  – Language / transformation tool
  – Transformation tool / Runtime
  – Language / Runtime

Goal: Pragmatic Approach
• Support multiple kinds of placement:
  – Explicit / implicit; virtual / physical; linear / cyclic / block cyclic / general
• Build on R-Stream’s current over-provisioning for performance:
  – Originally built for CUDA performance
  – Concepts extend to any architecture with dynamic scheduling decisions
  – Has implications on locality/communication granularity
  – Examine implications on power
• Use advanced auto-tuning features for design space exploration
• Explore which modes perform best with CnC:
  – Dependent on how over-provisioning is implemented
  – Over-provisioning (may) have implications on memory persistence:
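
The placement kinds named on the Pragmatic Approach slide can be made concrete with one-dimensional owner functions. The sketch below assumes n iterations numbered from 0 and p processing elements. It shows cyclic and block-cyclic placement as listed on the slide, plus a contiguous block placement for comparison, which is an assumption added here. All function names are hypothetical and do not reflect R-Stream's interface.

```c
#include <assert.h>
#include <stddef.h>

/* Block placement (assumed for comparison): n iterations are split into
 * p contiguous chunks of ceil(n / p) iterations each. */
size_t place_block(size_t i, size_t n, size_t p) {
    assert(p > 0 && i < n);
    size_t chunk = (n + p - 1) / p;
    return i / chunk;
}

/* Cyclic placement: iteration i goes to processing element i mod p. */
size_t place_cyclic(size_t i, size_t p) {
    assert(p > 0);
    return i % p;
}

/* Block-cyclic placement: blocks of b consecutive iterations are dealt
 * out to the p processing elements in round-robin order. */
size_t place_block_cyclic(size_t i, size_t b, size_t p) {
    assert(b > 0 && p > 0);
    return (i / b) % p;
}
```

For n = 10 and p = 4, iteration 9 lands on element 3 under block placement, on element 1 under cyclic placement, and on element 0 under block-cyclic placement with b = 2. The choice changes which iterations share a local memory, which is why placement interacts with locality and communication granularity.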

Goal: HLC support for Challenge Applications
• Go beyond loop nest optimizations
  – Chapel / data-parallel support
  – CnC attribute action graph optimization
• SAR
  – New locality transformations demonstrated speedups on linear flight path (reported to DARPA)
• MD
  – Exploring HLC optimization to neutral territory methods
• Graph
  – High level approaches to optimizing graph algorithms and increasing locality, new lock-free data-parallel algorithm for BFS
• Chess, Hydrodynamics
  – TBD.

Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results

CSLC-LMS (Mapping Across Function/Library Calls)
• Configuration 1: MKL (radar code with MKL calls)
• Configuration 2: Low-level compilers (radar code compiled with GCC / ICC)
• Configuration 3: R-Stream (radar code → R-Stream → optimized radar code → GCC / ICC)
• Main comparisons:
  – R-Stream High-Level C Compiler 3.1.2
  – Intel MKL 10.2.1
• Dual quad-core E5405 Xeon processors (8 cores total), 9GB memory
[Figure label: 8 thr]

CSLC-LMS (Mapping Across Function/Library Calls)
[Figure]

RTM (Exploiting Over-Provisioning for Performance)
• 25-point 8th order (in space) stencil

    /* X, Y and the coefficients C0..C4 are assumed to be defined elsewhere. */
    void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
                int pX, int pY, int pZ) {
      double temp;
      int i, j, k;
      /* Skip a 4-point halo on every side of the grid. */
      for (k=4; k<pZ-4; k++) {
        for (j=4; j<pY-4; j++) {
          for (i=4; i<pX-4; i++) {
            temp = C0 * U2[k][j][i] +
              C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                    U2[k][j-1][i] + U2[k][j+1][i] +
                    U2[k][j][i-1] + U2[k][j][i+1]) +
              C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                    U2[k][j-2][i] + U2[k][j+2][i] +
                    U2[k][j][i-2] + U2[k][j][i+2]) +
              C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                    U2[k][j-3][i] + U2[k][j+3][i] +
                    U2[k][j][i-3] + U2[k][j][i+3]) +
              C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                    U2[k][j-4][i] + U2[k][j+4][i] +
                    U2[k][j][i-4] + U2[k][j][i+4]);
            /* Update the wavefield in place. */
            U1[k][j][i] = 2.0 * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
          }
        }
      }
    }

RTM (Exploiting Over-Provisioning for Performance)
• 3D discretized wave equation kernel with single time iteration
• Run on NVIDIA GTX 480
• Double Precision 256^3 Problem
• High performance from over-provisioning space exploration and explicit optimization of register rotation and shared memory reuse
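
The register rotation mentioned on the RTM slide can be illustrated on a reduced problem. The sketch below assumes a one-dimensional, radius-4 symmetric stencil, which corresponds to a single axis of the 25-point kernel. The rotating version keeps the nine values it needs in a small local window and loads only one new value per step, instead of re-reading all nine from memory. This is a hand-written illustration with hypothetical names, not code generated by R-Stream.

```c
#include <assert.h>
#include <stddef.h>

#define R 4 /* stencil radius along the swept axis */

/* Reference version: every output re-reads its 2R+1 inputs from memory. */
void stencil_direct(const double *in, double *out, size_t n,
                    const double c[R + 1]) {
    assert(in != NULL && out != NULL && c != NULL);
    for (size_t k = R; k + R < n; k++) {
        double acc = c[0] * in[k];
        for (int r = 1; r <= R; r++)
            acc += c[r] * (in[k - r] + in[k + r]);
        out[k] = acc;
    }
}

/* Rotating version: w[] holds in[k-R .. k+R]; each step shifts the window
 * by one and loads a single new element. */
void stencil_rotating(const double *in, double *out, size_t n,
                      const double c[R + 1]) {
    assert(in != NULL && out != NULL && c != NULL);
    if (n < 2 * R + 1) return;          /* no interior point to compute */
    double w[2 * R + 1];
    for (int s = 0; s < 2 * R + 1; s++) /* window centered at k = R */
        w[s] = in[s];
    for (size_t k = R; k + R < n; k++) {
        double acc = c[0] * w[R];
        for (int r = 1; r <= R; r++)
            acc += c[r] * (w[R - r] + w[R + r]);
        out[k] = acc;
        if (k + R + 1 < n) {            /* rotate: drop oldest, load next */
            for (int s = 0; s < 2 * R; s++) w[s] = w[s + 1];
            w[2 * R] = in[k + R + 1];
        }
    }
}
```

With R fixed at compile time, a compiler can typically unroll the window loops and keep w[] in registers. In the 3-D kernel the same idea would apply along the axis each thread sweeps, under the assumption that the remaining neighbors are served from shared memory.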

R-Stream to CnC Proof of Concept
• Examined feasibility and benefits of automatic coordination language (CnC) generation from R-Stream:
  – on 4-D stencil, in-place, kernel application
  – coarse grained parallelism is pipelined (i.e. wavefronts of parallel tasks) and representative of other streaming kernels
  – R-Stream generates a non-trivial OpenMP version
  – Manually transform this OpenMP version to CnC code
  – Process completely automatable

R-Stream to CnC Proof of Concept
[Figure]

Conclusion
• R-Stream simplifies software development and maintenance
  – Does this by automatically parallelizing loop code
  – While optimizing for data locality, coalescing, communications reuse, etc.
• Many exciting developments within UHPC