The OpenUH Compiler: A Community Resource
Barbara Chapman
University of Houston
March 2007
High Performance Computing and Tools Group
http://www.cs.uh.edu/~hpctools
Agenda

• OpenUH compiler
• OpenMP language extensions
• Compiler – Tools interactions
• Compiler cost modeling
OpenUH: A Reference OpenMP Compiler

• Based on Open64
  • Integrates features from other major branches: Pathscale, ORC, UPC, …
• Complete support for OpenMP 2.5 in C/C++ and Fortran
• Freely available and open source
• Stable, portable
• Modularized and complete optimization framework
• Available on most Linux/Unix platforms
OpenUH: A Reference OpenMP Compiler

• Facilitates research and development
  • For us as well as for the HPC community
  • Testbed for new language features
  • New compiler transformations
  • Interactions with a variety of programming tools
• Currently installed on Cobalt@NCSA and Columbia@NASA
  • Cobalt: 2x512 processors; Columbia: 20x512 processors
The Open64 Compiler Suite

• An optimizing compiler suite for C/C++ and Fortran 77/90 on Linux/IA-64 systems
• Open-sourced by SGI from the Pro64 compiler
• State-of-the-art intra- and interprocedural analysis and optimizations
• Five levels of uniform IR (WHIRL), with IR-to-source "translators": whirl2c and whirl2f
• Used for research and commercial purposes: Intel, HP, QLogic, STMicroelectronics, UPC, CAF, U Delaware, Tsinghua, Minnesota, …
Major Modules in Open64

[Diagram: the compilation pipeline. The frontends gfec, gfecc and f90 emit .B/.I files in Very High WHIRL; from there the code takes either the standalone inliner path or the -IPA path (local IPA followed by main IPA). LNO runs at -O3 on High WHIRL, followed by I/O lowering (f90 only) and MP lowering (-mp). WHIRL2C/WHIRL2F can emit .w2c.c/.w2c.h/.w2f.f source (only for OpenMP). The IR is then lowered to Mid and Low WHIRL, where the main optimizer runs at -O2/-O3 before CG; at -O0 (-phase:w=off) everything is simply lowered.]
OpenUH Compiler Infrastructure

[Diagram: source code with OpenMP directives enters the Open64-based pipeline: frontends (C/C++, Fortran 90, OpenMP) → IPA (interprocedural analyzer) → OMP_PRELOWER (OpenMP preprocessing) → LNO (loop nest optimizer) → LOWER_MP (transformation of OpenMP) → WOPT (global scalar optimizer) → either WHIRL2C/WHIRL2F (IR-to-source for non-Itanium targets, whose output source with runtime library calls is compiled by a native compiler) or CG (code generator for Itanium). Object files are linked against a portable OpenMP runtime library to produce executables.]
OpenMP Implementation in OpenUH

• Frontends: parse OpenMP pragmas
• OMP_PRELOWER: preprocessing, semantic checking
• LOWER_MP: generation of microtasks for parallel regions, insertion of runtime calls, variable handling, …
• Runtime library: support for thread manipulation, implements user-level routines, monitoring environment
OpenMP Code

int main(void)
{
  int a, b, c;
#pragma omp parallel private(c)
  do_sth(a, b, c);
  return 0;
}

Translation

_INT32 main()
{
  int a, b, c;
  /* microtask */
  void __ompregion_main1()
  {
    _INT32 __mplocal_c;
    /* shared variables are kept intact; accesses to the
       private variable are substituted */
    do_sth(a, b, __mplocal_c);
  }
  …
  /* OpenMP runtime call */
  __ompc_fork(&__ompregion_main1);
  …
}

Runtime based on ORC work performed by Tsinghua University.
Multicore Complexity

• Examples: AMD dual-core, IBM Power4, Sun T1 (Niagara), Cell processor
• Resources (L2 cache, memory bandwidth): shared or separate
• Each core: single-threaded or multithreaded, complex or simplified
• Individual cores: symmetric or asymmetric (heterogeneous)
Is OpenMP Ready for Multicore?

• Is OpenMP ready?
  • Designed for medium-scale SMPs: <100 threads
  • One-team-for-all scheme for worksharing and synchronization: simple but not flexible
• Some difficulties using OpenMP on these platforms (see the sketch after this list):
  • Determining the optimal number of threads
  • Binding threads to the right processor cores
  • Finding a good scheduling policy and chunk size
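As a concrete illustration of the first and third difficulties, the minimal sketch below (not from the original slides; the thread count 4 and chunk size 64 are arbitrary assumptions) shows the knobs OpenMP 2.5 gives the programmer: the number of threads and the loop schedule. There is no standard OpenMP 2.5 mechanism for binding a thread to a particular core.

#include <omp.h>

void compute(double *a, int n)
{
    int i;

    /* Assumption: 4 threads and a dynamic schedule with chunk size 64 happen
       to suit this loop on this machine; finding such values is exactly the
       tuning burden described above. */
    omp_set_num_threads(4);

#pragma omp parallel for schedule(dynamic, 64)
    for (i = 0; i < n; i++)
        a[i] = a[i] * 2.0;
}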
Challenges Posed By New Architectures

We may want sibling threads to share in a workload on a multicore, but we may want SMT threads to do different things.

• Hierarchical and hybrid parallelism
  • Clusters, SMPs, CMPs (multicores), SMT (simultaneous multithreading), …
• Diversity in the kind and extent of resource sharing; potential for thread contention
  • ALU/FP units, cache, MCU, data path, memory bandwidth
• Homogeneous or heterogeneous
• Deeper memory hierarchy
• Size and scale
• Will many codes have multiple levels of parallelism?
Subteams of Threads

for (j = 0; j < ProcessingNum; j++) {
  #pragma omp for on threads(2 : omp_get_num_threads()-1)
  for (k = 0; k < M; k++) {
    // on threads in subteam
    ... processing();
  }
  // barrier involves subteam only
}

• MPI provides for the definition of groups of pre-existing processes
• Why not allow worksharing among groups (or subteams) of pre-existing threads?
• Logical machine description, and mapping of threads to it
• Or simple "spread" or "keep together" notations (a hand-coded approximation is sketched below)
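Without the extension, a similar effect can be hand-coded by partitioning iterations over an explicit subset of thread IDs. The sketch below is only an approximation (not the proposed compiler translation, and processing() stands in for the user's kernel): threads 2 .. nthreads-1 share the k loop among themselves.

#include <omp.h>

void processing(void);   /* placeholder for the user's seismic kernel */

void subteam_by_hand(int M, int ProcessingNum)
{
#pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int first    = 2;                  /* subteam = threads 2 .. nthreads-1 */
        int size     = nthreads - first;
        int j, k;

        for (j = 0; j < ProcessingNum; j++) {
            if (size > 0 && tid >= first) {
                /* cyclically partition the k loop over the subteam only */
                for (k = tid - first; k < M; k += size)
                    processing();
            }
            /* a subteam-only barrier would be needed here; OpenMP 2.5 only
               provides a full-team barrier, which is exactly the limitation
               the proposed extension removes */
        }
    }
}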
Case Study: A Seismic Code

// The outer i loop is parallel
for (i = 0; i < N; i++) {
  ReadFromFile(i, ...);
  for (j = 0; j < ProcessingNum; j++)
    for (k = 0; k < M; k++) {
      process_data(); // involves several different seismic functions
    }
  WriteResultsToFile(i);
}

• Kingdom Suite from Seismic Micro-Technology
• Goal: create OpenMP code for an SMP with hyperthreading enabled
Parallel Seismic Kernel V1

for (j = 0; j < ProcessingNum; j++) {
  #pragma omp for schedule(dynamic)
  for (k = 0; k < M; k++) {
    processing(); // user-configurable functions
  }
  // here is the barrier
} // end of j-loop

[Timeline diagram: Load Data, Process Data and Save Data phases per iteration.]
The omp for implicit barrier causes the computation threads to wait for the I/O threads to complete.
Subteams of Threads

for (j = 0; j < ProcessingNum; j++) {
  #pragma omp for on threads(2 : omp_get_num_threads()-1)
  for (k = 0; k < M; k++) {
    // on threads in subteam
    ... processing();
  }
  // barrier involves subteam only
}

• A parallel loop does not incur the overheads of nested parallelism
• But we need to avoid the global barrier early in the loop's execution
• One way to do this is to restrict loop execution to a subset of the team of executing threads
Parallel Seismic Code V2

Loadline(nStartLine, ...); // preload the first line of data
#pragma omp parallel
{
  for (int iLineIndex = nStartLine; iLineIndex <= nEndLine; iLineIndex++)
  {
    #pragma omp single nowait onthread(0)
    { // load the next line of data, NO WAIT!
      Loadline(iLineIndex+1, ...);
    }
    for (j = 0; j < iNumTraces; j++)
      #pragma omp for schedule(dynamic) onthread(2 : omp_get_num_threads()-1)
      for (k = 0; k < iNumSamples; k++)
        processing();
    #pragma omp barrier
    #pragma omp single nowait onthread(1)
    {
      SaveLine(iLineIndex);
    }
  }
}

[Timeline diagram: with subteams, the Load Data, Process Data and Save Data phases of successive lines overlap.]
OpenMP Scalability: Thread Subteam

Thread subteam: the original thread team is divided into several subteams, each of which can work simultaneously.

• Advantages
  • Flexible worksharing/synchronization extension
  • Low overhead because of static partitioning
  • Facilitates thread-core mapping for better data locality and less resource contention (a binding sketch follows this list)
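Thread-core mapping itself is outside the OpenMP 2.5 specification; on Linux it can be done per thread with sched_setaffinity. The sketch below is a minimal illustration, not part of the subteam proposal, and the identity mapping of thread i to core i is an assumption about the machine layout.

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

/* Bind each OpenMP thread to the core with the same index. */
void bind_threads_to_cores(void)
{
#pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(omp_get_thread_num(), &mask);        /* thread i -> core i (assumed) */
        sched_setaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */
    }
}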
Implementation in OpenUH

…
void *threadsubteam;
__ompv_gtid_s1 = __ompc_get_local_thread_num();
__ompc_subteam_create(&idSet1, &threadsubteam);
/* threads not in the subteam skip the later work */
if (!__ompc_is_in_idset(__ompv_gtid_s1, &idSet1)) goto L111;
__ompc_static_init_4(__ompv_gtid_s1, … &__do_stride, 1, 1, &threadsubteam);
for (__mplocal_i = __do_lower; __mplocal_i <= __do_upper; __mplocal_i = __mplocal_i + 1)
{
  .........
  // omp for
}
__ompc_barrier(&threadsubteam); /* barrier at subteam only */
L111: /* a label marks the boundary between the two worksharing bodies */
__ompv_gtid_s1 = __ompc_get_local_thread_num();
mpsp_status = __ompc_single(__ompv_gtid_s1);
if (mpsp_status == 1)
{
  j = omp_get_thread_num();
  // omp single
  printf("I am the one: %d\n", j);
}
__ompc_end_single(__ompv_gtid_s1);
__ompc_barrier(NULL); /* barrier at the default team */

• Tree-structured team and subteams in the runtime library
• Threads not in a subteam skip the work in the compiler translation
• Global thread IDs are converted into local IDs for loop scheduling
• Implicit barriers only affect threads in a subteam
BT-MZ Performance with Subteams
Platform: Columbia@NASA
OpenMP 3.0 and Beyond

• Major thrust for the 3.0 spec. supports non-traditional loop parallelism
• Ideas on support for multicore / higher levels of scalability
  • Extend nested parallelism by binding threads in advance (nested parallelism as it exists today is sketched below)
    • High overhead of dynamic thread creation/cancellation
    • Poor data locality between parallel regions executed by different threads without binding
  • Describe the structure of the threads used in the computation
    • Map to a logical machine, or group
  • Explicit data migration
  • Subteams of threads
  • Control over the default behavior of idle threads
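For reference, the sketch below shows nested parallelism as it already exists (a minimal illustration, not one of the proposed extensions; the thread counts 2 and 4 are arbitrary). Each encounter of the inner region may fork a fresh inner team, which is the overhead and locality problem the binding proposals target.

#include <omp.h>

void nested_example(void)
{
    omp_set_nested(1);                  /* allow nested parallel regions */

#pragma omp parallel num_threads(2)
    {
        /* Every outer thread forks an inner team on each encounter of this
           region; without binding, the inner threads may land on different
           cores each time, hurting data locality. */
#pragma omp parallel num_threads(4)
        {
            /* work shared among 2 x 4 threads */
        }
    }
}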
What About The Tools?

• Typically hard work to use; steep learning curve
• Low-level interaction with the user
• Tuning may be a fragmentary effort
  • May require multiple tools
  • Often not integrated with each other, let alone with the compiler
• Can we improve tools' results, reduce user effort and help the compiler if they interact?
Exporting Program Information

[Diagram: the compiler phases (Front End, IPL/CFG_IPL, IPA-Link, LNO, WOPT/CG) export static and dynamic (feedback) program information — control flow graph, call graph, data dependences, array sections — to a program information database read by the Dragon executable. The Dragon tool browser renders results via VCG to .vcg, .ps and .bmp output.]
Productivity: Integrated Development Environment

[Diagram: a development environment for MPI/OpenMP built around a common program database interface. OpenUH performs program analyses on a high-level representation and static/feedback optimizations of the application source (a fluid dynamics application), producing a selectively instrumented executable. KOJAK, Perfsuite runtime monitoring and TAU collect low-level trace data and runtime/sampling information from the executing application; a high-level profile / performance problem analyzer feeds performance analysis results and feedback back through the database, alongside Dragon program analysis results and queries for application information.]

http://www.cs.uh.edu/~copper
NSF CCF-0444468
Cascade Results

[Screenshot: the offending critical region was rewritten. Courtesy of R. Morgan, NASA Ames.]
Tuning Environment

• Using OpenUH's selective instrumentation combined with its internal cost model for procedures and its internal call graph, we find procedures with a high amount of work that are called infrequently and lie within a certain call-path depth.
• Using our instrumented OpenMP runtime we can monitor parallel regions.

Compiler and Runtime Components

[Diagram: selective instrumentation analysis.]
A Performance Problem: Specification

• GenIDLEST
  • Real-world scientific simulation code
  • Solves the incompressible Navier-Stokes and energy equations
  • MPI and OpenMP versions
• Platform
  • SGI Altix 3700: two distributed shared memory systems, each with 512 Intel Itanium 2 processors
  • Thread count: 8
• The problem: the OpenMP version is slower than MPI
Timings of the diff_coeff Subroutine

[Charts: procedure timings for the OpenMP version and the MPI version.]
We find that a single procedure is responsible for 20% of the time and that it is 9 times slower than MPI!
Performance Analysis

Comparing the metrics between OpenMP and MPI using KOJAK performance algebra, we find large numbers of:
• Exceptions
• Flushes
• Cache misses
• Pipeline stalls

Some loops are 27 times slower in OpenMP than in MPI. These loops contain large amounts of stalling due to remote memory accesses to the shared heap.
Pseudocode of the Problem Procedure

procedure diff_coeff()
{
  allocation of arrays to the heap by the master thread
  initialization of shared arrays
  PARALLEL REGION
  {
    loop in parallel over lower_bound[my thread id], upper_bound[my thread id]
    computation on my portion of the shared arrays
    …
  }
}

• The lower and upper bounds of the computational loops are shared, and stored within the same memory page and cache line
• Delays in remote memory accesses are probable causes of the exceptions causing processor flushes
Solution: Privatization

[Chart: stall cycle breakdown for the non-privatized (NP) and privatized (P) versions of diff_coeff, and their difference (NP-P): front-end flushes, FLP units, instruction miss stalls, branch mispredictions, D-cache stalls, total cycles.]
OpenMP Privatized Version

• Privatizing the arrays improved the performance of the whole program by 30% and resulted in a speedup of 10 for the problem procedure (a sketch of the change follows)
• Now this procedure only takes 5% of the total time
• Processor stalls are reduced significantly
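A minimal sketch of the kind of change involved (illustrative only; the array and variable names are invented, not GenIDLEST's): instead of repeatedly reading loop bounds from shared arrays that sit on one heap page, each thread copies its bounds into private locals before the computational loop.

#include <omp.h>

void diff_coeff_like(int *lower_bound, int *upper_bound, double *shared_data)
{
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* Private copies of the bounds: the hot loop no longer makes repeated
           remote accesses to the shared bound arrays. */
        int my_lo = lower_bound[tid];
        int my_hi = upper_bound[tid];
        int i;

        for (i = my_lo; i <= my_hi; i++)
            shared_data[i] = shared_data[i] * 2.0;   /* stand-in computation */
    }
}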
OpenMP Platform-awareness: Cost Modeling

• Cost modeling: estimating the cost, mostly the time, of executing a program (or a portion of it) on a given system (or a component of it) using compilers, runtime systems, performance tools, etc.
• An OpenMP cost model is critical for:
  • OpenMP compiler optimizations
  • Adaptive OpenMP runtime support
  • Load balancing in hybrid MPI/OpenMP
  • Targeting OpenMP to new architectures: multicore
  • Complementing empirical search
Example Usage of Cost Modeling

[Chart: performance (MFLOPS) of an OpenMP program on 1 to 128 threads.]

DO K2 = 1, M, B
  DO J2 = 1, M, B
    DO I = 1, M
      DO K1 = K2, MIN(K2+B-1,M)
        DO J1 = J2, MIN(J2+B-1,M)
          Z(J1,I) = Z(J1,I) + X(K1,I) * Y(J1,K1)

• Case 1: What is the optimal tile size B for a loop tiling transformation?
  • Cache size, miss penalties, loop overhead, …
• Case 2: What is the maximum number of threads for parallel execution without performance degradation?
  • Parallel overhead, ratio of parallelizable work to total work, system capacities, …
Usage of OpenMP Cost Modeling

[Diagram: the cost model takes application features (computation requirements, memory references, parallel overheads), architectural profiles (processor, cache, topology) and the OpenMP implementation (OpenMP compiler, OpenMP runtime library) as inputs, and determines the parameters for OpenMP execution on CMT platforms: number of threads, thread-core mapping, scheduling policy and chunk size.]
Modeling OpenMP

• Previous models:
  • T_parallel_region = T_fork + T_worksharing + T_join
  • T_worksharing = T_sequential / N_threads
• Our model aims to consider much more:
  • Multiple worksharing and synchronization portions in a parallel region
  • Scheduling policy
  • Chunk size
  • Load imbalance
  • Cache impact for multiple threads on multiple processors
  • …
Modeling OpenMP Parallel Regions

• A parallel region can encompass several worksharing and synchronization portions
• The sum of the longest execution time of all threads between each pair of synchronization points dominates the final execution time: load imbalance

[Diagram: a parallel region forked from the master thread, with several worksharing portions separated by synchronization points.]
Modeling OpenMP Worksharing

• Worksharing has overhead because of the multiple dispatching of work chunks, governed by schedule(type, chunkSize) (a simplified model sketch follows)

[Diagram: time (cycles) spent by thread i across successive chunk dispatches.]
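The extended model described on these slides can be summarized in code form. The sketch below is a simplified illustration under stated assumptions (the function name, parameters and overhead constants are placeholders, not OpenUH's actual model): a region's time is the fork/join overhead plus, for every interval between synchronization points, the slowest thread's time (load imbalance) including per-chunk dispatch overhead.

/* work_per_thread is an n_intervals x n_threads matrix, flattened row-major. */
double model_parallel_region(const double *work_per_thread,
                             int n_intervals, int n_threads,
                             double t_fork, double t_join,
                             double t_dispatch, int chunks_per_thread)
{
    double total = t_fork + t_join;
    for (int s = 0; s < n_intervals; s++) {
        double slowest = 0.0;
        for (int t = 0; t < n_threads; t++) {
            double ti = work_per_thread[s * n_threads + t]
                      + t_dispatch * chunks_per_thread;   /* scheduling overhead */
            if (ti > slowest) slowest = ti;               /* load imbalance: max over threads */
        }
        total += slowest;                                 /* each interval ends at a sync point */
    }
    return total;
}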
Implementation in OpenUH

• Based on the existing cost models used in loop optimization
  • These only work for perfectly nested loops: those permitting arbitrary transformations
  • Used to guide conventional loop transformations: unrolling, tiling, interchanging
  • Used to help auto-parallelization: whether to parallelize, at which level, with interchanging

[Diagram: cost model hierarchy. Processor model: computational resource cost (operation cost, issue cost, dependency latency cost, register spilling cost) and loop overhead. Parallel model: parallel overhead, reduction cost. Cache model: cache cost, mem_ref cost, TLB cost. Together these give the machine cost.]
Cost Model Extensions

• Added a new phase to the compiler that traverses the IR to conduct modeling
  • Works on OpenMP regions instead of perfectly nested loops
• Enhancements to model OpenMP details
  • Reuses the processor and cache models for processor and cache cycles
  • Models load imbalance: using max(thread_i_exe)
  • Models scheduling: adds a lightweight scheduler to the model
  • Reads an environment variable for the desired number of threads during modeling, so this is currently fixed (see the small sketch below)
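A minimal sketch of that last point. OMP_NUM_THREADS is the standard variable name; whether OpenUH's modeling phase reads this exact variable is an assumption, and the fallback to 1 is invented for illustration.

#include <stdlib.h>

/* Read the thread count to model from the environment; fall back to 1. */
static int modeled_thread_count(void)
{
    const char *s = getenv("OMP_NUM_THREADS");  /* assumed variable name */
    int n = s ? atoi(s) : 0;
    return (n > 0) ? n : 1;
}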
Experiment

• Machine: Cobalt at NCSA (National Center for Supercomputing Applications)
  • 32-processor SGI Altix 3700
  • 1.5 GHz Itanium 2 with 6 MB L3 cache
  • 256 GB memory
• Benchmark: OpenMP version of a classic matrix-matrix multiplication (MMM) code
  • i, k, j loop order
  • 3 different double-precision floating-point matrix sizes: 500, 1000, 1500
• OpenUH compiler: -O3, -mp
• Cycle measuring tools: pfmon, perfsuite

#pragma omp parallel for private(i, j, k)
for (i = 0; i < N; i++)
  for (k = 0; k < K; k++)
    for (j = 0; j < M; j++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
Results

• Efficiency = Modeling_Time / Compilation_Time x 100% = 0.079 s / 6.33 s = 1.25%

[Chart: modeled vs. measured CPU cycles for matrix sizes 500, 1000 and 1500 on 1 to 8 threads.]

• Measured data show irregular fluctuation, especially for the smaller dataset with larger numbers of threads
  • 10^8 cycles @ 1.5 GHz is < 0.1 second, so system-level noise from thread management is relatively large
• Overestimation for the 500x500 array from 1 to 5 threads, underestimation for all the rest
  • Optimistic assumptions about resource utilization
  • More threads, more underestimation: lack of contention models for cache, memory and bus
Relative Accuracy: Modeling Different Chunk Sizes for Static Scheduling

[Chart: modeled vs. measured CPU cycles (billions) for OpenMP scheduling with chunk sizes 1, 10, 100, 250, 500 and 1000; 4 threads, matrix size 1000x1000; curves for static modeling, static measuring, dynamic measuring and guided measuring. Small chunk sizes show excessive scheduling overheads, large ones show load imbalance.]

• The model successfully captured the trend of the measured results
Cost Model

• A detailed cost model could be used to recompile program regions that perform poorly
  • Possibly with a focus on improving a specific aspect of the code
• The current models in OpenUH are inaccurate
  • Most often they accurately predict trends
  • They fail to account for resource contention
• Accuracy will be critical for modeling multicore platforms
• What level of accuracy should we be aiming for?
Summary

• The challenge of multicore demands "simple" parallel programming models
  • There is very much to explore in this regard
• Compiler technology has advanced and public-domain software has become fairly robust
• Many opportunities for exploiting this to improve
  • Languages
  • Compiler implementations
  • Runtime systems
  • OS interactions
  • Tool behavior
  • …
Download