An Introduction to the Open64 Compiler
Guang R. Gao (CAPSL, University of Delaware)
Xiaomi An (CAPSL, University of Delaware)
Courtesy of Fred Chow

Outline
• Background and motivation
• Part I: An overview of the Open64 compiler infrastructure and design principles
• Part II: Using Open64 in compiler research and development

4/8/2015 Open64 Tutorial - An Introduction

What Is the Original Open64 (Pro64)?
• A suite of optimizing compiler tools for Intel IA-64 and x86-64 on Linux systems
• C, C++, and Fortran90/95 compilers
• Conforms to the IA-64 and x86-64 Linux ABI and API standards
• Open to all researchers and developers in the community

Historical Perspectives (courtesy of Fred Chow)
• 1980-83: Stanford RISC compiler research
• 1987: MIPS Ucode Compiler (R2000); global optimization under -O2
• 1989: Cydrome Cydra5 Compiler; software pipelining
• 1991: MIPS Ucode Compiler (R4000); loop optimization under -O3
• 1994: SGI Ragnarok Compiler (R8000); floating-point performance
• 1997: SGI MIPSpro Compiler (R10000); incorporating Stanford SUIF and Rice IPA work
• 2000: Pro64/Open64 Compiler (Itanium)

Who Might Want to Use Open64?
• Researchers: test new compiler analysis and optimization algorithms
• Developers: retarget to another architecture or system
• Educators: a compiler teaching platform

Who Is Using Open64 - from the ACM CGO 2008 Open64 Workshop attendee list (April 6, 2008, Boston)
• 12 companies: Google, NVIDIA, IBM, HP, Qualcomm, AMD, Tilera, PathScale, SimpLight, Absoft, Coherent Logix, STMicroelectronics
• 9 educational institutions: USC, Rice, University of Delaware, University of Houston, University of Illinois, Tsinghua, Fudan, ENS Lyon, NRCC
(The full attendee table of names, affiliations, and locations is omitted here.)

Vision and Status of Open64 Today
• Best viewed as GCC with an alternative backend, with great potential to reclaim its place as the best compiler in the world
• The technology incorporates the top compiler optimization research of the 1990s
• It has regained momentum in the last three years thanks to PathScale's and HP's investment in robustness and performance
• Targets x86 and Itanium in the public repository; ARM, MIPS, PowerPC, and several signal-processing CPUs in private branches

Overview of Open64 Infrastructure
• Logical compilation model and component flow
• WHIRL Intermediate Representation
• Very High Optimizer (VHO)
• Inter-Procedural Analysis (IPA)
• Loop Nest Optimizer (LNO) and parallelization
• Global optimization (WOPT)
• Code generation (CG)

Component flow: Front End → Very High Optimizer → Interprocedural Analysis and Optimization → Loop Nest Optimization and Parallelization → Global (Scalar) Optimization (the middle end) → Code Generation (the backend), all built around a good IR.

Front Ends
• C front end based on gcc
• C++ front end based on g++
• Fortran90/95 front end from MIPSpro

Semantic Level of IR
At a higher level (closer to the source program):
• More kinds of constructs
• Shorter code sequences
• More program information present
• Hierarchical constructs
• Many optimizations cannot be performed
At a lower level (closer to machine instructions):
• Fewer kinds of constructs
• Longer code sequences
• Less program information
• Flat constructs
• All optimizations can be performed

Compilation Flow (diagram)

Very High WHIRL Optimizer
Lowers VH WHIRL to High WHIRL while performing optimizations. The first part deals with common language constructs:
• Bit-field optimizations
• Short-circuit boolean expressions
• Switch statement optimization
• Simple if-conversion
• Assignments of small structs: lower struct copies to assignments of individual fields
• Conversion of code-sequence patterns to intrinsics (e.g., saturated subtract, abs())
• Other pattern-based optimizations (e.g., max, min)

Roles of IPA
The only optimization component operating at program scope:
• Analysis: collects information from the entire program
• Optimization: performs optimizations across procedure boundaries
• Depends on later phases for full optimization effects
• Supplies cross-file information for later optimization phases

IPA Flow (diagram)

IPA Main Stage
Analysis:
– alias analysis
– array sections
– code layout
Optimization:
– inlining
– cloning
– dead function and variable elimination
– constant propagation

Loop Nest Optimizations
1. Transformations for the data cache
2. Transformations that help other optimizations
3. Vectorization and parallelization

LNO Transformations for the Data Cache
• Cache blocking: transform loops to work on sub-matrices that fit in cache
• Loop interchange
• Array padding: reduce cache conflicts
• Prefetch generation: hide the long latency of cache-miss references
• Loop fusion
• Loop fission

LNO Transformations that Help Other Optimizations
• Scalar expansion / array expansion: reduce inter-loop dependencies, enable parallelization
• Scalar variable renaming: fewer constraints on register allocation
• Array scalarization: improves register allocation
• Hoisting of messy loop bounds
• Outer loop unrolling
• Array substitution (forward and backward)
• Loop unswitching
• IF hoisting
• Inter-iteration CSE

LNO Parallelization
• SIMD code generation: highly dependent on the SIMD instructions of the target
• Generation of vector intrinsics: based on the library functions available
• Automatic parallelization: leverages the OpenMP support in the rest of the backend

Global Optimization Phase
• SSA is the unifying technology
• Open64 extensions to SSA technology:
  – Representing aliases and indirect memory operations (Chow et al., CC '96)
  – Integrated partial redundancy elimination (Chow et al., PLDI '97; Kennedy et al., CC '98, TOPLAS '99)
  – Support for speculative code motion
  – Register promotion via load and store placement (Lo et al., PLDI '98)

WOPT Overview
• Works at function scope
• Builds the control flow graph
• Performs alias analysis
• Represents the program in SSA form
• SSA-based optimization algorithms
• Cooperation among multiple phases to achieve the final effects
• Phase order designed to maximize effectiveness
• Separated into Preopt and Mainopt
  – Preopt serves as a pre-optimizing front end for LNO and IPA (in High WHIRL)
  – Provides use-def info to LNO and IPA
  – Provides alias info to CG

Optimizations Performed
Pre-optimizer:
• Goto conversion
• Loop normalization
• Induction variable canonicalization
• Dead store elimination
• Copy propagation
• Dead code elimination
• Alias analysis (flow-free and flow-sensitive)
• Computes def-use chains for LNO and IPA
• Passes alias info to CG
Main optimizer:
• Partial redundancy elimination based on the SSAPRE framework:
  – global common subexpression elimination
  – loop-invariant code motion
  – strength reduction
  – linear function test replacement
• Value-number-based full redundancy elimination
• Induction variable elimination
• Register promotion
• Bitwise dead store elimination

Feedback
Used throughout the compiler:
• Instrumentation can be added at any stage (VHO, LNO, WOPT, CG)
• Explicit instrumentation data is incorporated where inserted
• Instrumentation data is maintained and checked for consistency through program transformations.
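The partial redundancy elimination performed by the main optimizer can be made concrete with a small source-level sketch. This illustrates the general SSAPRE idea only, not Open64 internals; the function names are invented for the example.

```cpp
#include <cassert>

// Before PRE: "a * b" is computed on the taken path and again after the
// join, so the second multiply is partially redundant.
int sum_before_pre(int a, int b, bool p) {
    int x = 0;
    if (p)
        x = a * b;      // expression available on this path only
    return x + a * b;   // recomputed here: partially redundant
}

// After a PRE-style transformation: the computation is inserted on the
// path where it was missing and saved in a temporary, so the expression
// is fully available at the join and the redundant multiply disappears.
int sum_after_pre(int a, int b, bool p) {
    int x = 0;
    int t;
    if (p) {
        t = a * b;      // original computation, saved in a temp
        x = t;
    } else {
        t = a * b;      // inserted computation on the "missing" path
    }
    return x + t;       // reuse of t: no recomputation at the join
}
```

In Open64 this insertion and reuse is done on the SSA form itself (the SSAPRE framework), and the same machinery drives strength reduction and loop-invariant code motion.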
Smooth Information Flow into the Backend in Pro64
WHIRL → WHIRL-to-TOP lowering → CGIR → extended basic block optimization → control flow optimization → hyperblock formation → critical path reduction → inner loop optimization → software pipelining → IGLS (scheduling) → GRA/LRA (register allocation) → code emission → executable. Information from the front end (alias, structure, etc.) flows through to the backend.

Software Pipelining vs. Normal Scheduling
• Is the inner loop a SWP-amenable candidate?
  – Yes: inner loop processing, then software pipelining; on success, proceed to code emission
  – No, or software pipelining fails or is not profitable: fall back to IGLS and GRA/LRA, then code emission

Code Generation Intermediate Representation (CGIR)
• TOPs (Target Operations) are "quads"
• Operands/results are TNs
• Basic block nodes in the control flow graph
• Load/store architecture
• Supports predication
• Flags on TOPs (copy ops, integer add, load, etc.)
• Flags on operands (TNs)

From WHIRL to CGIR, Cont'd
Information passed:
– alias information
– loop information
– symbol table and maps

The Target Information Table (TARG_INFO)
Objective:
• Parameterized description of a target machine and system architecture
• Separates architecture details from the compiler's algorithms
• Minimizes compiler changes when targeting a new architecture

WHIRL SSA: A New Optimization Infrastructure for Open64
Parallel Processing Institute, Fudan University, Shanghai, China
Global Delivery China Center, Hewlett-Packard, Shanghai, China

Goal
• A better "DU manager"
  – Factored use-def chains: reduced traversal overhead
  – Keeps alias information: handles both direct and indirect accesses
  – Eliminates incomplete DU/UD chains
• Easy to use
  – STL-style iterator to traverse the DU/UD chains
• A flexible infrastructure
  – Available from High WHIRL to Low WHIRL
  – Lightweight, demand-driven
  – Precise and updatable
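The "STL-style iterator" goal above can be sketched as follows. All names here (UseNode, DuChain, add_use, count_uses) are hypothetical stand-ins, not the actual WHIRL SSA API; the point is only the shape of the interface: a definition exposes its uses through begin()/end(), so clients traverse a DU chain with ordinary iterator idioms instead of ad-hoc chain-walking code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a use site (e.g., a WHIRL node id).
struct UseNode {
    int wn_id;
};

// Hypothetical stand-in for one definition's DU chain: the "DU manager"
// would hand one of these out per definition.
class DuChain {
    std::vector<UseNode> uses_;
public:
    void add_use(int wn_id) { uses_.push_back(UseNode{wn_id}); }
    // STL-style iteration over all uses of this definition.
    std::vector<UseNode>::const_iterator begin() const { return uses_.begin(); }
    std::vector<UseNode>::const_iterator end() const { return uses_.end(); }
};

// With begin()/end() in place, a plain range-for visits every use:
inline int count_uses(const DuChain& chain) {
    int n = 0;
    for (const UseNode& u : chain) {
        (void)u;  // a real client would inspect the use site here
        ++n;
    }
    return n;
}
```

The same iterator shape works regardless of how the chain is stored internally, which is what lets the infrastructure stay lightweight and demand-driven behind a stable client interface.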
PHI Placement
• For SCF (structured control flow), φ nodes are mapped onto the root WN
• For GOTO-LABEL, φ nodes are placed on the LABEL
(Diagram: φ placement examples for (a) IF, where a3 ← φ(a1, a2) is attached to the IF node; (b) WHILE_DO and (c) DO_WHILE, where a2 ← φ(a1, a3) is attached to the loop node; and (d) DO_LOOP, where the induction variable's I2 ← φ(I1, I3) is attached to the DO_LOOP node.)

Thank you!