The OpenUH Compiler: A Community Resource
Barbara Chapman
University of Houston
March 2007
High Performance Computing and Tools Group
http://www.cs.uh.edu/~hpctools

Agenda
• OpenUH compiler
• OpenMP language extensions
• Compiler – tools interactions
• Compiler cost modeling

OpenUH: A Reference OpenMP Compiler
• Based on Open64; integrates features from other major branches: Pathscale, ORC, UPC, …
• Complete support for OpenMP 2.5 in C/C++ and Fortran
• Freely available and open source
• Stable, portable
• Modularized and complete optimization framework
• Available on most Linux/Unix platforms

OpenUH: A Reference OpenMP Compiler
• Facilitates research and development, for us as well as for the HPC community
• Testbed for new language features, new compiler transformations, and interactions with a variety of programming tools
• Currently installed at Cobalt@NCSA and Columbia@NASA (Cobalt: 2x512 processors, Columbia: 20x512 processors)

The Open64 Compiler Suite
• An optimizing compiler suite for C/C++ and Fortran 77/90 on Linux/IA-64 systems
• Open-sourced by SGI from the Pro64 compiler
• State-of-the-art intra- and interprocedural analysis and optimizations
• 5 levels of a uniform IR (WHIRL), with IR-to-source "translators": whirl2c and whirl2f
• Used for research and commercial purposes: Intel, HP, QLogic, STMicroelectronics, UPC, CAF, U Delaware, Tsinghua, Minnesota, …

Major Modules in Open64
[Diagram: the gfec, gfecc and f90 front ends produce Very High WHIRL (.B/.I files); either the inliner or IPA (local IPA plus main IPA, -IPA) is applied, along with LNO at -O3, at the higher WHIRL levels; OpenMP lowering (-mp) and Fortran 90 I/O lowering happen on the way down to Mid WHIRL; WHIRL2C/WHIRL2F can emit .w2c.c/.w2f.f source (plus .w2c.h, only for OpenMP); the main optimizer (-O2/-O3) and CG work on Mid and Low WHIRL, while -O0 simply lowers all the way (-phase:w=off).]

OpenUH Compiler Infrastructure
[Diagram: source code with OpenMP directives flows through the Open64-based front ends (C/C++, Fortran 90, OpenMP), IPA (interprocedural analyzer), OMP_PRELOWER (OpenMP preprocessing), LNO (loop nest optimizer), LOWER_MP (OpenMP transformation) and WOPT (global scalar optimizer); CG generates code for Itanium, while WHIRL2C and WHIRL2F provide IR-to-source output for non-Itanium platforms, which a native compiler then builds from source code with runtime library calls; object files are linked against a portable OpenMP runtime library to produce executables.]

OpenMP Implementation in OpenUH
• Front ends: parse OpenMP pragmas
• OMP_PRELOWER: preprocessing, semantic checking
• LOWER_MP: generation of microtasks for parallel regions, insertion of runtime calls, variable handling, …
• Runtime library: support for thread manipulation, implements the user-level routines, monitoring environment

OpenMP Code

    int main(void)
    {
      int a, b, c;
    #pragma omp parallel private(c)
      do_sth(a, b, c);
      return 0;
    }

Translation

    _INT32 main()
    {
      int a, b, c;
      /* microtask */
      void __ompregion_main1()
      {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; accesses to the
           private variable are substituted */
        do_sth(a, b, __mplocal_c);
      }
      …
      /* OpenMP runtime calls */
      __ompc_fork(&__ompregion_main1);
      …
    }

The runtime is based on ORC work performed by Tsinghua University.

Multicore Complexity
• AMD dual-core, IBM Power4, Sun T-1 (Niagara), Cell processor
• Resources (L2 cache, memory bandwidth): shared or separate
• Each core: single-threaded or multithreaded, complex or simplified
• Individual cores: symmetric or asymmetric (heterogeneous)

Is OpenMP Ready for Multicore?
• Designed for medium-scale SMPs: fewer than 100 threads
• One-team-for-all scheme for work sharing and synchronization: simple but not flexible
• Some difficulties using OpenMP on these platforms (see the sketch below):
  – Determining the optimal number of threads
  – Binding threads to the right processor cores
  – Finding a good scheduling policy and chunk size
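As a point of reference for the difficulties listed above, the following is a minimal sketch of the knobs that standard OpenMP (as of 2.5) gives the user for thread count, scheduling policy and chunk size; the array size and the particular schedule are illustrative choices, not taken from the slides. Thread-to-core binding has no portable OpenMP interface at this point and is left to platform-specific mechanisms.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];
        int i;

        /* Thread count: can be set per run via OMP_NUM_THREADS,
           or explicitly from the code. */
        omp_set_num_threads(4);

        /* Scheduling policy and chunk size: fixed in the directive here;
           schedule(runtime) would defer the choice to OMP_SCHEDULE. */
        #pragma omp parallel for schedule(dynamic, 1000)
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("ran with up to %d threads\n", omp_get_max_threads());
        return 0;
    }

Finding the right values for these parameters on a given multicore system is exactly the tuning burden the extensions and cost modeling discussed below aim to reduce.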
Challenges Posed By New Architectures
• We may want sibling threads to share in a workload on a multicore, but we may want SMT threads to do different things
• Hierarchical and hybrid parallelism
• Diversity in the kind and extent of resource sharing, with potential for thread contention
  – Clusters, SMPs, CMPs (multicores), SMT (simultaneous multithreading), …
  – ALU/FP units, cache, MCU, data path, memory bandwidth
• Homogeneous or heterogeneous
• Deeper memory hierarchy
• Size and scale: will many codes have multiple levels of parallelism?

Subteams of Threads

    for (j=0; j<ProcessingNum; j++) {
    #pragma omp for on threads (2: omp_get_num_threads()-1)
      for (k=0; k<M; k++) {   // on threads in the subteam
        ...
        processing();
      }   // barrier involves the subteam only
    }

• MPI provides for the definition of groups of pre-existing processes
• Why not allow worksharing among groups (or subteams) of pre-existing threads?
• Logical machine description, and a mapping of threads to it
• Or simple "spread" or "keep together" notations

Case Study: A Seismic Code
• Kingdom Suite from Seismic Micro Technology
• Goal: create OpenMP code for an SMP with hyperthreading enabled

    // This loop is parallel
    for (i=0; i<N; i++) {
      ReadFromFile(i,...);
      for (j=0; j<ProcessingNum; j++)
        for (k=0; k<M; k++) {
          process_data();   // involves several different seismic functions
        }
      WriteResultsToFile(i);
    }

Parallel Seismic Kernel V1

    for (j=0; j<ProcessingNum; j++) {
    #pragma omp for schedule(dynamic)
      for (k=0; k<M; k++) {
        processing();   // user-configurable functions
      }   // here is the barrier
    }   // end of j-loop

[Timeline: load data, process data, save data. The omp for implicit barrier causes the computation threads to wait for the I/O threads to complete.]

Subteams of Threads

    for (j=0; j<ProcessingNum; j++) {
    #pragma omp for on threads (2: omp_get_num_threads()-1)
      for (k=0; k<M; k++) {   // on threads in the subteam
        ...
        processing();
      }   // barrier involves the subteam only
    }

• A parallel loop does not incur the overheads of nested parallelism
• But we need to avoid the global barrier early in the loop's execution
• One way to do this is to restrict loop execution to a subset of the team of executing threads

Parallel Seismic Code V2

    Loadline(nStartLine,...);   // preload the first line of data
    #pragma omp parallel
    {
      for (int iLineIndex=nStartLine; iLineIndex <= nEndLine; iLineIndex++) {
    #pragma omp single nowait onthread(0)
        {   // load the next line of data, NO WAIT!
          Loadline(iLineIndex+1,...);
        }
        for (j=0; j<iNumTraces; j++)
    #pragma omp for schedule(dynamic) onthread(2: omp_get_num_threads()-1)
          for (k=0; k<iNumSamples; k++)
            processing();
    #pragma omp barrier
    #pragma omp single nowait onthread(1)
        {
          SaveLine(iLineIndex);
        }
      }
    }

[Timeline: thread 0 loads data, threads 2..N-1 process data, thread 1 saves data, overlapping I/O with computation.]

OpenMP Scalability: Thread Subteam
• Thread subteam: the original thread team is divided into several subteams, each of which can work simultaneously
• Advantages:
  – A flexible worksharing/synchronization extension
  – Low overhead because of static partitioning
  – Facilitates thread-core mapping for better data locality and less resource contention
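Since the onthread / subteam syntax above is a proposed extension rather than standard OpenMP, the sketch below shows one way the same load/process/save overlap could be approximated in stock OpenMP 2.5 by partitioning work by thread id manually; the routine names (load_line, process_trace, save_line) and bounds are illustrative placeholders, not the actual seismic code.

    #include <omp.h>

    /* Hypothetical application routines standing in for the seismic kernels. */
    void load_line(int line);
    void save_line(int line);
    void process_trace(int line, int k);

    /* Assumes line nStart has been preloaded before the call, as in V2 above. */
    void pipeline(int nStart, int nEnd, int nSamples)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            /* With more than 2 threads, threads 2..N-1 compute; with 1 or 2
               threads everything degrades to (mostly) sequential behavior. */
            int workers   = nthreads > 2 ? nthreads - 2 : nthreads;
            int wid       = nthreads > 2 ? tid - 2      : tid;
            int is_worker = nthreads > 2 ? (tid >= 2)   : 1;

            for (int line = nStart; line <= nEnd; line++) {
                if (tid == 0 && line + 1 <= nEnd)
                    load_line(line + 1);          /* thread 0: prefetch next line */

                if (is_worker) {
                    /* manual static partition of the k loop over the "subteam" */
                    int chunk = (nSamples + workers - 1) / workers;
                    int lo    = wid * chunk;
                    int hi    = lo + chunk < nSamples ? lo + chunk : nSamples;
                    for (int k = lo; k < hi; k++)
                        process_trace(line, k);
                }

                #pragma omp barrier               /* line done, prefetch done */

                if (tid == 1 || nthreads == 1)
                    save_line(line);              /* thread 1: write results */
            }
        }
    }

The subteam implementation in OpenUH, shown next, performs this bookkeeping in the compiler and runtime, so the user keeps ordinary worksharing directives.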
Implementation in OpenUH

    void *threadsubteam;
    __ompv_gtid_s1 = __ompc_get_local_thread_num();
    __ompc_subteam_create(&idSet1, &threadsubteam);
    /* threads not in the subteam skip the later work */
    if (!__ompc_is_in_idset(__ompv_gtid_s1, &idSet1))
      goto L111;
    __ompc_static_init_4(__ompv_gtid_s1, … &__do_stride, 1, 1, &threadsubteam);
    for (__mplocal_i = __do_lower; __mplocal_i <= __do_upper; __mplocal_i = __mplocal_i + 1) {
      .........   // omp for body
    }
    __ompc_barrier(&threadsubteam);   /* barrier at the subteam only */
    L111:   /* a label inserted as the boundary between two worksharing bodies */
    __ompv_gtid_s1 = __ompc_get_local_thread_num();
    mpsp_status = __ompc_single(__ompv_gtid_s1);
    if (mpsp_status == 1) {
      j = omp_get_thread_num();   // omp single body
      printf("I am the one: %d\n", j);
    }
    __ompc_end_single(__ompv_gtid_s1);
    __ompc_barrier(NULL);   /* barrier at the default team */

• Tree-structured team and subteams in the runtime library
• Threads not in a subteam skip the work in the compiler translation
• Global thread IDs are converted into local IDs for loop scheduling
• Implicit barriers only affect the threads in a subteam

BT-MZ Performance with Subteams
[Chart: BT-MZ performance with subteams. Platform: Columbia@NASA.]

OpenMP 3.0 and Beyond
• Major thrust for the 3.0 spec. supports non-traditional loop parallelism
• Ideas on support for multicore / higher levels of scalability:
  – Extend nested parallelism by binding threads in advance (dynamic thread creation/cancellation has high overhead; without binding, data locality between parallel regions executed by different threads is poor)
  – Describe the structure of the threads used in a computation: map to a logical machine, or group
  – Explicit data migration
  – Subteams of threads
  – Control over the default behavior of idle threads

What About The Tools?
• Typically hard work to use, steep learning curve
• Low-level interaction with the user
• Tuning may be a fragmentary effort and may require multiple tools
• Often not integrated with each other, let alone with the compiler
• Can we improve the tools' results, reduce user effort and help the compiler if they interact?

Exporting Program Information
[Diagram: the front end (CFG_IPL), IPL, IPA-Link, LNO and WOPT/CG export the control flow graph, call graph, data dependence and array section information, plus feedback data, into a program information database; the Dragon tool browser reads this database alongside the executable and can render results via VCG to .vcg, .ps and .bmp. Static and dynamic program information is exported.]

Productivity: Integrated Development Environment
[Diagram: a development environment for MPI/OpenMP built around a common program database interface. OpenUH performs program analyses, produces a selectively instrumented executable and a high-level representation, and accepts performance feedback for static/feedback optimizations; KOJAK consumes low-level trace data from the executing application, Perfsuite provides runtime monitoring, and TAU provides runtime information and sampling; a high-level profile/performance problem analyzer and the Dragon program analysis results answer queries for application information. Example application: a fluid dynamics code. http://www.cs.uh.edu/~copper, NSF CCF-0444468.]

Cascade Results
• The offending critical region was rewritten (courtesy of R. Morgan, NASA Ames)

Tuning Environment
• Using OpenUH's selective instrumentation, combined with its internal cost model for procedures and its internal call graph, we find procedures with a high amount of work that are called infrequently and lie within a certain call-path level.
• Using our instrumented OpenMP runtime we can monitor parallel regions.
[Diagram: compiler and runtime components for selective instrumentation analysis.]
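To make the selective-instrumentation idea concrete, here is a minimal sketch of the general shape of compiler-inserted probes around a selected procedure and its parallel region; the probe names (perf_region_enter/perf_region_exit) are placeholders for illustration and are not the actual OpenUH runtime or TAU interface.

    #include <stdio.h>

    /* Hypothetical monitoring hooks; a real build would map these onto the
       instrumented OpenMP runtime or a tool interface. */
    static void perf_region_enter(const char *name) { printf("enter %s\n", name); }
    static void perf_region_exit (const char *name) { printf("exit  %s\n", name); }

    void solve_step(double *u, int n)
    {
        /* Probes are inserted only for procedures the cost model flags as
           worth measuring: high work, low call frequency, within a bounded
           call-path depth. */
        perf_region_enter("solve_step");

        #pragma omp parallel
        {
            perf_region_enter("solve_step.parallel_region");  /* per-thread timing */
            #pragma omp for
            for (int i = 0; i < n; i++)
                u[i] *= 0.5;
            perf_region_exit("solve_step.parallel_region");
        }

        perf_region_exit("solve_step");
    }

Limiting the probes to such procedures keeps the measurement overhead low while still exposing the parallel regions that matter, as in the case study below.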
A Performance Problem: Specification
• GenIDLEST: a real-world scientific simulation code
  – Solves the incompressible Navier-Stokes and energy equations
  – MPI and OpenMP versions
• Platform: SGI Altix 3700
  – Two distributed shared memory systems, each with 512 Intel Itanium 2 processors
  – Thread count: 8
• The problem: the OpenMP version is slower than MPI

Timings of the Diff_coeff Subroutine
[Chart: timings of the OpenMP and MPI versions.] We find that a single procedure is responsible for 20% of the time and that it is 9 times slower than MPI!

Performance Analysis: Procedure Timings
• Comparing the metrics between OpenMP and MPI using KOJAK performance algebra, we find large numbers of exceptions, flushes, cache misses and pipeline stalls
• Some loops are 27 times slower in OpenMP than in MPI; these loops contain large amounts of stalling due to remote memory accesses to the shared heap

Pseudocode of the Problem Procedure

    procedure diff_coeff() {
      allocation of arrays to the heap by the master thread
      initialization of shared arrays
      PARALLEL REGION
      {
        loop in parallel over lower_bound[my thread id], upper_bound[my thread id]
        computation on my portion of the shared arrays
        …
      }
    }

• The lower and upper bounds of the computational loops are shared, and are stored within the same memory page and cache line
• Delays in remote memory accesses are probable causes of the exceptions that cause processor flushes

Solution: Privatization
[Chart: stall cycle breakdown for the non-privatized (NP) and privatized (P) versions of diff_coeff, and the NP-P difference — front-end flushes, FLP units, instruction miss stalls, branch mispredictions and D-cache stalls, in cycles.]

OpenMP Privatized Version
• Privatizing the arrays improved the performance of the whole program by 30% and gave a speedup of 10 for the problem procedure
• This procedure now takes only 5% of the total time
• Processor stalls are reduced significantly

OpenMP Platform-awareness: Cost Modeling
• Cost modeling: estimating the cost, mostly the time, of executing a program (or a portion of it) on a given system (or a component of it), using compilers, runtime systems, performance tools, etc.
• An OpenMP cost model is critical for:
  – OpenMP compiler optimizations
  – Adaptive OpenMP runtime support
  – Load balancing in hybrid MPI/OpenMP
  – Targeting OpenMP to new architectures: multicore
  – Complementing empirical search

Example Usage of Cost Modeling
[Chart: performance (MFLOPS) of an OpenMP program on 1 to 128 threads.]

    DO K2 = 1, M, B
      DO J2 = 1, M, B
        DO I = 1, M
          DO K1 = K2, MIN(K2+B-1,M)
            DO J1 = J2, MIN(J2+B-1,M)
              Z(J1,I) = Z(J1,I) + X(K1,I) * Y(J1,K1)

• Case 1: What is the optimal tile size B for a loop tiling transformation? (Cache size, miss penalties, loop overhead, …)
• Case 2: What is the maximum number of threads for parallel execution without performance degradation? (Parallel overhead, ratio of parallelizable work to total work, system capacities, …; a small worked example follows.)
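As an illustration of Case 2 (not taken from the slides), even a crude fork-join model makes the thread-count question concrete. Let f be the parallelizable fraction of the sequential time T_seq, and let the fork/join overhead grow linearly with the thread count N at c time units per thread:

    T(N) = c N + f T_seq / N + (1 - f) T_seq

Minimizing over N gives N* = sqrt(f T_seq / c). With deliberately made-up numbers T_seq = 10 ms, f = 0.9 and c = 0.05 ms per thread, N* = sqrt(9 / 0.05) ≈ 13, so under this model running on more than about 13 threads would begin to degrade performance. The OpenUH model described next layers worksharing, scheduling, load-imbalance and cache terms on top of this basic picture.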
Usage of OpenMP Cost Modeling
[Diagram: cost modeling combines application features (computation requirements, memory references, parallel overheads), architectural profiles (processor, cache, topology) and the OpenMP implementation (compiler, runtime library) to determine parameters for executing OpenMP applications on CMT platforms: number of threads, thread-core mapping, scheduling policy, chunk size.]

Modeling OpenMP
• Previous models:
  T_parallel_region = T_fork + T_worksharing + T_join
  T_worksharing = T_sequential / N_threads
• Our model aims to consider much more:
  – Multiple worksharing and synchronization portions in a parallel region
  – Scheduling policy and chunk size
  – Load imbalance
  – Cache impact for multiple threads on multiple processors
  – …

Modeling OpenMP Parallel Regions
• A parallel region can encompass several worksharing and synchronization portions
• The sum of the longest execution times of all threads between each pair of synchronization points dominates the final execution time: load imbalance
[Diagram: a parallel region executed by the master thread and its team.]

Modeling OpenMP Worksharing
• Worksharing has overhead because of the repeated dispatching of work chunks
• Schedule (type, chunkSize) …
[Diagram: chunks dispatched to thread i over time (cycles).]

Implementation in OpenUH
• Based on the existing cost models used in loop optimization
  – They only work for perfectly nested loops: those permitting arbitrary transformations
  – Used to guide conventional loop transformations: unrolling, tiling, interchange
  – Used to help auto-parallelization: justification, which level, interchange
• Cost models: processor model, parallel model, cache model
  – Computational resource cost, machine cost, cache cost, operation cost, issue cost, memory reference cost, TLB cost, loop overhead, dependency latency cost, parallel overhead, register spilling cost, reduction cost

Cost Model Extensions
• Added a new compiler phase that traverses the IR to conduct modeling
• Works on OpenMP regions instead of perfectly nested loops
• Enhancements to model OpenMP details, reusing the processor and cache models for processor and cache cycles
• Modeling load imbalance: using max(thread_i_exe)
• Modeling scheduling: adding a lightweight scheduler to the model
• Reading an environment variable for the desired number of threads during modeling (so this is currently fixed)
• (A sketch of how such a model can be evaluated appears after the results below.)

Experiment
• Machine: Cobalt at NCSA (National Center for Supercomputing Applications)
  – 32-processor SGI Altix 3700, 1.5 GHz Itanium 2 with 6 MB L3 cache, 256 GB memory
• Benchmark: OpenMP version of a classic matrix-matrix multiplication (MMM) code
  – i, k, j loop order; 3 double-precision matrix sizes: 500, 1000, 1500
• OpenUH compiler: -O3, -mp
• Cycle measuring tools: pfmon, perfsuite

    #pragma omp parallel for private(i, j, k)
    for (i = 0; i < N; i++)
      for (k = 0; k < K; k++)
        for (j = 0; j < M; j++)
          c[i][j] = c[i][j] + a[i][k] * b[k][j];

Results
• Efficiency = Modeling_Time / Compilation_Time x 100% = 0.079s / 6.33s = 1.25%
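To show how the load-imbalance and scheduling terms described under "Cost Model Extensions" can be evaluated, here is a minimal sketch of a static-schedule model of the kind that would produce the modeled curves compared below; the constants and function names are illustrative placeholders, not the actual OpenUH model code.

    #include <stdio.h>

    /* Minimal static-schedule cost model: iterations are dealt out in chunks
       round-robin, each thread's time is the sum of its chunks plus a per-chunk
       dispatch cost, and the modeled region time is the maximum over threads
       (load imbalance) plus a fork/join cost. All constants are made up. */
    static double model_static_region(long iters, double cycles_per_iter,
                                      int nthreads, long chunk)
    {
        const double dispatch_cycles  = 200.0;    /* per-chunk scheduling cost */
        const double fork_join_cycles = 10000.0;  /* parallel region overhead  */
        double max_thread_cycles = 0.0;

        for (int t = 0; t < nthreads; t++) {
            double cycles = 0.0;
            /* chunk number c goes to thread c % nthreads */
            for (long start = (long)t * chunk; start < iters;
                 start += (long)nthreads * chunk) {
                long this_chunk = start + chunk <= iters ? chunk : iters - start;
                cycles += dispatch_cycles + this_chunk * cycles_per_iter;
            }
            if (cycles > max_thread_cycles)
                max_thread_cycles = cycles;       /* longest thread dominates */
        }
        return fork_join_cycles + max_thread_cycles;
    }

    int main(void)
    {
        /* e.g. the outer i loop of a 1000x1000 MMM, assuming ~2e6 cycles per
           outer iteration (an invented per-iteration figure) */
        for (int n = 1; n <= 8; n++)
            printf("%d threads: %.3g cycles\n",
                   n, model_static_region(1000, 2.0e6, n, 10));
        return 0;
    }

Sweeping the chunk argument instead of the thread count gives the kind of chunk-size study shown in the relative-accuracy chart below.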
Modeling vs. Measurement
[Chart: modeled vs. measured CPU cycles for matrix sizes 500, 1000 and 1500 on 1 to 8 threads.]
• Measured data show irregular fluctuations, especially for the smaller dataset with larger numbers of threads: 10^8 cycles at 1.5 GHz is less than 0.1 second, so system-level noise from thread management is relatively large
• Overestimation for the 500x500 array from 1 to 5 threads, underestimation for all the rest: the model makes optimistic assumptions about resource utilization
• The more threads, the greater the underestimation: the model lacks contention models for cache, memory and bus

Relative Accuracy: Modeling Different Chunk Sizes for Static Scheduling
[Chart: modeled static scheduling vs. measured static, dynamic and guided scheduling, in CPU cycles (billions), for chunk sizes from 1 to 1000 with 4 threads and a 1000x1000 matrix; small chunks show excessive scheduling overheads, large chunks show load imbalance.]
• The model successfully captured the trend of the measured results

Cost Model
• A detailed cost model could be used to recompile program regions that perform poorly, possibly with a focus on improving a specific aspect of the code
• Current models in OpenUH are inaccurate
  – Most often they accurately predict trends
  – They fail to account for resource contention; this will be critical for modeling multicore platforms
• What level of accuracy should we be aiming for?

Summary
• The challenge of multicores demands "simple" parallel programming models; there is very much to explore in this regard
• Compiler technology has advanced and public-domain software has become fairly robust
• Many opportunities for exploiting this to improve:
  – Languages
  – Compiler implementations
  – Runtime systems
  – OS interactions
  – Tool behavior
  – …