PGAS-Stanford

Compilation Technology for
Computational Science
Kathy Yelick
Lawrence Berkeley National Laboratory
and UC Berkeley
Joint work with
The Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea,
K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen
The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell,
P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome
PGAS Languages
1
Kathy Yelick
Parallelism Everywhere
• Single processor Moore’s Law effect is ending
• Power density limitations; device physics below 90nm
• Multicore is becoming the norm
• AMD, IBM, Intel, Sun all offering multicore
• Number of cores per chip likely to increase with density
• Fundamental software change
• Parallelism is exposed to software
• Performance is no longer solely a hardware problem
• What has the HPC community learned?
• Caveat: Scale and applications differ
Kathy Yelick, 3
Phillip Colella’s “Seven dwarfs”
High-end simulation in the physical sciences = 7 methods:
1. Structured Grids (including Adaptive Mesh Refinement)
2. Unstructured Grids
3. Spectral Methods (FFTs, etc.)
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo Simulation
• Add 4 for embedded; covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Comb. logic
11. Finite State Machine
Note: Data sizes (8 bit to 32 bit) and types (integer, character)
differ, but algorithms the same
Games/Entertainment close to scientific computing
Slide source: Phillip Colella, 2004 and Dave Patterson, 2006
Kathy Yelick, 4
Parallel Programming Models
• Parallel software is still an unsolved problem!
• Most parallel programs are written using either:
• Message passing with a SPMD model
• for scientific applications; scales easily
• Shared memory with threads in OpenMP, POSIX Threads, or Java
• non-scientific applications; easier to program
• Partitioned Global Address Space (PGAS) languages aim to offer all 3 features:
• Productivity: easy to understand and use
• Performance: primary requirement in HPC
• Portability: must run everywhere
Kathy Yelick, 5
Partitioned Global Address Space
• Global address space: any thread/process may
directly read/write data allocated by another
• Partitioned: data is designated as local (near) or
global (possibly far); programmer controls layout
[Figure: a global address space spanning threads p0 … pn; each thread holds shared objects (x, y) and global pointers (g) visible to all threads, plus private local pointers (l) on its own stack.]
By default:
• Object heaps are shared
• Program stacks are private
• Three current languages: UPC, CAF, and Titanium
• Emphasis in this talk on UPC & Titanium (based on Java)
Kathy Yelick, 6
PGAS Language Overview
• Many common concepts, although specifics differ
• Consistent with base language
• Both private and shared data
• int x[10]; and shared int y[10];
• Support for distributed data structures
• Distributed arrays; local and global pointers/references
• One-sided shared-memory communication
• Simple assignment statements: x[i] = y[i]; or t = *p;
• Bulk operations: memcpy in UPC, array ops in Titanium and CAF
• Synchronization
• Global barriers, locks, memory fences
• Collective Communication, IO libraries, etc.
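A minimal UPC-flavored sketch of these concepts (not from the talk; sizes and layout are illustrative choices): private and shared data, one-sided access through plain assignments and pointers-to-shared, a bulk get, and a barrier.

#include <upc.h>
#include <string.h>

#define N 10

int x[N];                    /* private: each thread has its own copy */
shared int y[N * THREADS];   /* shared: elements distributed cyclically over threads */
shared int *p;               /* pointer-to-shared: may refer to remote data */

int main(void) {
    int t[N];

    /* One-sided communication as plain assignment: if an element of y
       lives on another thread, the runtime issues a remote get. */
    for (int i = 0; i < N; i++)
        x[i] = y[i];

    /* Dereferencing a pointer-to-shared is likewise one-sided. */
    p = &y[1];
    int v = *p;
    (void)v;

    /* Bulk operation: fetch N shared ints starting at y[0] in one transfer. */
    upc_memget(t, &y[0], N * sizeof(int));

    /* Global synchronization. */
    upc_barrier;
    return 0;
}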
Kathy Yelick, 7
Example: Titanium Arrays
• Ti Arrays created using Domains; indexed using Points:
double [3d] gridA = new double [[0,0,0]:[10,10,10]];
• Eliminates some loop bound errors using foreach
foreach (p in gridA.domain())
gridA[p] = gridA[p]*c + gridB[p];
• Rich domain calculus allows slicing, subarrays, transposes, and other operations without data copies
• Array copy operations automatically work on intersection
data[neighborPos].copy(mydata);
[Figure: patches mydata and data[neighborPos] overlap; the copy fills data[neighborPos]'s ghost cells from mydata's interior ("restrict"-ed, non-ghost) cells, i.e., only the intersection is copied.]
Kathy Yelick, 8
Productivity: Line Count Comparison
[Chart: lines of code (0 to 2000) for NPB-CG, NPB-EP, NPB-FT, NPB-IS, and NPB-MG, comparing Fortran, C, MPI+Fortran, CAF, UPC, and Titanium versions.]
• Comparison of NAS Parallel Benchmarks
• UPC version has modest programming effort relative to C
• Titanium even more compact, especially for MG, which uses multi-d
arrays
• Caveat: Titanium FT uses a user-defined Complex type and cross-language support to call FFTW for the serial 1D FFTs
UPC results from Tarek El-Ghazawi et al.; CAF from Chamberlain et al.;
Titanium joint with Kaushik Datta & Dan Bonachea
Kathy Yelick, 9
Case Study 1: Block-Structured AMR
• Adaptive Mesh Refinement
(AMR) is challenging
• Irregular data accesses and
control from boundaries
• Mixed global/local view is useful
Titanium AMR benchmarks available
AMR Titanium work by Tong Wen and Philip Colella
Kathy Yelick, 10
AMR in Titanium
C++/Fortran/MPI AMR:
• Chombo package from LBNL
• Bulk-synchronous comm: pack boundary data between procs
Titanium AMR:
• Entirely in Titanium
• Finer-grained communication: no explicit pack/unpack code, automated in the runtime system
Code size in lines:
                      C++/Fortran/MPI    Titanium
AMR data structures   35000              2000
AMR operations        6500               1200
Elliptic PDE solver   4200*              1500
10X reduction in lines of code!
* Somewhat more functionality in PDE part of Chombo code
Work by Tong Wen and Philip Colella; Communication optimizations joint with Jimmy Su Kathy Yelick, 11
Performance of Titanium AMR
[Chart: speedup (up to ~80) vs. number of processors (16, 28, 36, 56, 112) for Titanium and Chombo AMR; the two curves show comparable performance.]
• Serial: Titanium is within a few % of C++/F; sometimes faster!
• Parallel: Titanium scaling is comparable with generic optimizations
- additional optimizations (namely overlap) not yet implemented
Kathy Yelick, 12
Immersed Boundary Simulation in Titanium
• Modeling elastic structures in an
incompressible fluid.
• Blood flow in the heart, blood clotting,
inner ear, embryo growth, and many more
• Complicated parallelization
• Particle/Mesh method
• “Particles” connected into materials
[Chart: time per timestep (seconds, 0 to 100) vs. number of processors (1 to 128) for Power3/SP 256^3, Power3/SP 512^3, and P4/Myrinet 512^2 x 256.]
Joint work with Ed Givelberg, Armando Solar-Lezama
Code size in lines: Fortran ~8000, Titanium ~4000
Kathy Yelick, 13
High Performance
Strategy for acceptance of a new language
• Within HPC: Make it run faster than anything else
Approaches to high performance
• Language support for performance:
• Allow programmers sufficient control over resources for tuning
• Non-blocking data transfers, cross-language calls, etc.
• Control over layout, load balancing, and synchronization
• Compiler optimizations reduce need for hand tuning
• Automate non-blocking memory operations, relaxed memory,…
• Productivity gains through parallel analysis and optimizations
• Runtime support exposes best possible performance
• Berkeley UPC and Titanium use GASNet communication layer
• Dynamic optimizations based on runtime information
Kathy Yelick, 14
One-Sided vs Two-Sided
[Figure: a one-sided put message carries a destination address plus the data payload and can be deposited directly into memory by the network interface; a two-sided message carries a message id plus the data payload and must be matched against a receive by the host CPU before it can be placed in memory.]
• A one-sided put/get message can be handled directly by a network
interface with RDMA support
• Avoid interrupting the CPU or storing data from CPU (preposts)
• A two-sided message must be matched with a receive to identify the memory address where the data should be put
• Offloaded to Network Interface in networks like Quadrics
• Need to download match tables to interface (from host)
Kathy Yelick, 15
Performance Advantage of One-Sided
Communication: GASNet vs MPI
[Chart: flood bandwidth (MB/s, 0 to 900) vs. transfer size from 10 bytes to 10 MB, GASNet non-blocking put vs. MPI; an inset shows relative bandwidth (GASNet/MPI) ranging from about 1.0 to 2.4.]
• Opteron/InfiniBand (Jacquard at NERSC):
• GASNet’s vapi-conduit and OSU MPI 0.9.5 MVAPICH
• Half-power point (N½) differs by one order of magnitude
Joint work with Paul Hargrove and Dan Bonachea
Kathy Yelick, 16
GASNet: Portability and High-Performance
8-byte Roundtrip Latency
[Chart: 8-byte roundtrip latency in microseconds (lower is better), MPI ping-pong vs. GASNet put+sync, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; measured values range from about 4.5 to 24.2 µs, with GASNet lower on every platform.]
GASNet better for latency across machines
Joint work with UPC Group; GASNet design by Dan Bonachea
Kathy Yelick, 17
GASNet: Portability and High-Performance
Flood Bandwidth for 2MB messages
[Chart: flood bandwidth for 2 MB messages as a percentage of hardware peak (absolute MB/s labeled on the bars), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed.]
GASNet at least as high (comparable) for large messages
Joint work with UPC Group; GASNet design by Dan Bonachea
Kathy Yelick, 18
GASNet: Portability and High-Performance
Flood Bandwidth for 4KB messages
[Chart: flood bandwidth for 4 KB messages as a percentage of hardware peak (absolute MB/s labeled on the bars), MPI vs. GASNet, on the same six platforms; GASNet is substantially higher on most of them.]
GASNet excels at mid-range sizes: important for overlap
Joint work with UPC Group; GASNet design by Dan Bonachea
Kathy Yelick, 19
Case Study 2: NAS FT
• Performance of Exchange (Alltoall) is critical
• 1D FFTs in each dimension, 3 phases
• Transpose after first 2 for locality
• Bisection bandwidth-limited
• Problem as #procs grows
• Three approaches:
• Exchange:
• wait for 2nd dim FFTs to finish, send 1
message per processor pair
• Slab:
• wait for chunk of rows destined for 1
proc, send when ready
• Pencil:
• send each row as it completes
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Kathy Yelick, 20
Overlapping Communication
• Goal: make use of “all the wires all the time”
• Schedule communication to avoid network backup
• Trade-off: overhead vs. overlap
• Exchange has fewest messages, less message overhead
• Slabs and pencils have more overlap; pencils the most
• Example: Class D problem on 256 processors (data per message):
• Exchange (all data at once): 512 KB
• Slabs (contiguous rows that go to one processor): 64 KB
• Pencils (single row): 16 KB
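A sketch of the pencil idea in UPC: start a non-blocking put for each row as soon as its 1-D FFT finishes, so communication for row r overlaps the computation of row r+1. The async calls are Berkeley UPC extensions whose names and header I give from memory, and the data decomposition is simplified, so treat this as an illustration rather than the benchmark's actual code.

#include <upc.h>
#include <bupc_extensions.h>   /* Berkeley UPC extensions; header name assumed */

#define NROWS  256
#define ROWLEN 512

/* Blocked layout: row r (one block) has affinity to thread r % THREADS.
 * Destination offsets are simplified; a real transpose indexes by both
 * source and destination thread. */
shared [ROWLEN] double gbuf[NROWS * ROWLEN * THREADS];
double lrows[NROWS][ROWLEN];

void fft_row(double *row, int n);   /* stand-in for a serial 1-D FFT (e.g., via FFTW) */

void send_pencils(void) {
    bupc_handle_t h[NROWS];
    for (int r = 0; r < NROWS; r++) {
        fft_row(lrows[r], ROWLEN);                      /* compute one pencil */
        h[r] = bupc_memput_async(&gbuf[r * ROWLEN],     /* lands on thread r % THREADS */
                                 lrows[r],
                                 ROWLEN * sizeof(double));
        /* the put for row r now proceeds while the next row is computed */
    }
    for (int r = 0; r < NROWS; r++)
        bupc_waitsync(h[r]);                            /* complete all puts */
    upc_barrier;
}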
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Kathy Yelick, 21
NAS FT Variants Performance Summary
Best MFlop rates for all NAS FT Benchmark versions
[Chart: best MFlop/s per thread (0 to 1000) for the NAS FT benchmark, comparing best NAS Fortran/MPI, best MPI, and best UPC, on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; a 0.5 TFlop/s aggregate level is marked.]
• Slab is always best for MPI; small message cost too high
• Pencil is always best for UPC; more overlap
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Kathy Yelick, 22
Case Study 3: LU Factorization
• Direct methods have complicated dependencies
• Especially with pivoting (unpredictable communication)
• Especially for sparse matrices (dependence graph with
holes)
• LU Factorization in UPC
• Use overlap ideas and multithreading to mask latency
• Multithreaded: UPC threads + user threads + threaded BLAS
• Panel factorization: Including pivoting
• Update to a block of U
• Trailing submatrix updates
• Status:
• Dense LU done: HPL-compliant
• Sparse version underway
Joint work with Parry Husbands
Kathy Yelick, 23
UPC HPL Performance
[Charts: Linpack performance (GFlop/s), MPI/HPL vs. UPC, on X1/64, X1/128, an Opteron cluster (Opt/64), and an Altix (Alt/32).]
• MPI HPL numbers from the HPCC database
• Large scaling: 2.2 TFlop/s on 512p, 4.4 TFlop/s on 1024p (Thunder)
• Comparison to ScaLAPACK on an Altix, a 2 x 4 process grid
• ScaLAPACK (block size 64) 25.25 GFlop/s (tried several block sizes)
• UPC LU (block size 256) - 33.60 GFlop/s, (block size 64) - 26.47 GFlop/s
• n = 32000 on a 4x4 process grid
• ScaLAPACK - 43.34 GFlop/s (block size = 64)
• UPC - 70.26 Gflop/s (block size = 200)
Kathy Yelick, 24
Joint work with Parry Husbands
Automating Support for Optimizations
• The previous examples are hand-optimized
• Non-blocking put/get on distributed memory
• Relaxed memory consistency on shared memory
• What analyses are needed to optimize parallel
codes?
• Concurrency analysis: determine which blocks of code
could run in parallel
• Alias analysis: determine which variables could access the
same location
• Synchronization analysis: align matching barriers, locks…
• Locality analysis: when is a general (global) pointer used only locally (so it can be converted to a cheaper local pointer)
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 25
Reordering in Parallel Programs
In parallel programs, a reordering can change the
semantics even if no local dependencies exist.
Initially, flag = data = 0
Original program:
T1: data = 1; flag = 1
T2: f = flag;  d = data
After reordering:
T1: flag = 1; data = 1
T2: f = flag;  d = data
{f == 1, d == 0} is possible after reordering; not in original
Compiler, runtime, and hardware can produce such reorderings
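The same pattern written out in plain C with pthreads, as a sketch to make the hazard concrete (not code from the talk). The shared variables are deliberately non-atomic, so this is technically a data race; without fences or atomics, the compiler and the hardware are both free to reorder the stores or the loads.

#include <pthread.h>
#include <stdio.h>

int data = 0, flag = 0;             /* plain shared variables: a data race, for illustration */

static void *t1(void *arg) {        /* producer */
    data = 1;                       /* (1) */
    flag = 1;                       /* (2): may be moved before (1) */
    return NULL;
}

static void *t2(void *arg) {        /* consumer */
    int f = flag;                   /* (3) */
    int d = data;                   /* (4): may be satisfied before (3) */
    printf("f=%d d=%d\n", f, d);    /* without fences, f==1 && d==0 can be observed */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}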
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 26
Memory Models
• Sequential consistency: a reordering is
illegal if it can be observed by another
thread
• Relaxed consistency: reordering may be
observed, but local dependencies and
synchronization preserved (roughly)
• Titanium, Java, & UPC are not sequentially
consistent
• Perceived cost of enforcing it is too high
• For Titanium and UPC, network latency is the cost
• For Java, memory fences and lost code transformations are the cost
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 27
Conflicts
• Reordering of an access is observable
only if it conflicts with some other access:
• The accesses can be to the same memory location
• At least one access is a write
• The accesses can run concurrently
T1: data = 1; flag = 1
T2: f = flag;  d = data
(the two accesses to data conflict with each other, as do the two accesses to flag)
• Fences (compiler and hardware) need to
be inserted around accesses that conflict
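One way a compiler could enforce the needed order, sketched with C11 fences from stdatomic.h; the fence placement follows the rule above, but this is my illustration, not the code the Titanium compiler actually generates. In real code the accesses themselves would be atomic (or strict); the point here is only where the fences go.

#include <stdatomic.h>

int data, flag;                                 /* the conflicting shared locations */

void producer(void) {
    data = 1;
    atomic_thread_fence(memory_order_seq_cst);  /* keep the data write before the flag write */
    flag = 1;
}

void consumer(int *f, int *d) {
    *f = flag;
    atomic_thread_fence(memory_order_seq_cst);  /* keep the flag read before the data read */
    *d = data;
}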
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 29
Sequential Consistency in Titanium
• Minimize number of fences – allow same
optimizations as relaxed model
• Concurrency analysis identifies
concurrent accesses
• Relies on Titanium’s textual barriers and single-valued expressions
• Alias analysis identifies accesses to the
same location
• Relies on SPMD nature of Titanium
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 30
Barrier Alignment
• Many parallel languages make no attempt
to ensure that barriers line up
• Example code that is legal but will deadlock:
if (Ti.thisProc() % 2 == 0)
Ti.barrier(); // even ID threads
else
; // odd ID threads
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 31
Structural Correctness
• Aiken and Gay introduced structural
correctness (POPL’98)
• Ensures that every thread executes the same
number of barriers
• Example of structurally correct code:
if (Ti.thisProc() % 2 == 0)
Ti.barrier(); // even ID threads
else
Ti.barrier(); // odd ID threads
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 32
Textual Barrier Alignment
• Titanium has textual barriers: all threads
must execute the same textual sequence
of barriers
• Stronger guarantee than structural correctness –
this example is illegal:
if (Ti.thisProc() % 2 == 0)
Ti.barrier(); // even ID threads
else
Ti.barrier(); // odd ID threads
• Single-valued expressions used to enforce
textual barriers
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 33
Single-Valued Expressions
• A single-valued expression has the same
value on all threads when evaluated
• Example: Ti.numProcs() > 1
• All threads guaranteed to take the same
branch of a conditional guarded by a
single-valued expression
• Only single-valued conditionals may have barriers
• Example of legal barrier use:
if (Ti.numProcs() > 1)
Ti.barrier(); // multiple threads
else
; // only one thread total
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 34
Concurrency Analysis
• Graph generated from program as follows:
• Node added for each code segment between
barriers and single-valued conditionals
• Edges added to represent control flow between
segments
// code segment 1
if ([single])
  // code segment 2
else
  // code segment 3
// code segment 4
Ti.barrier()
// code segment 5
[Graph: nodes 1 through 5 for the five segments; edges 1→2, 1→3, 2→4, 3→4, and a barrier edge from 4 to 5.]
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 35
Concurrency Analysis (II)
• Two accesses can run concurrently if:
• They are in the same node, or
• One access’s node is reachable from the other
access’s node without hitting a barrier
• Algorithm: remove barrier edges, do DFS
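A compact sketch of that algorithm in C: drop the barrier edges from the segment graph and mark reachability with a DFS. The adjacency-matrix representation is my own simplification; the real analysis works over Titanium's intermediate representation.

#include <string.h>

#define MAXN 64

/* adj[u][v] != 0 means control can flow from segment u to segment v;
 * barrier_edge[u][v] marks edges that cross a barrier. */
int nseg;
int adj[MAXN][MAXN];
int barrier_edge[MAXN][MAXN];
int reach[MAXN][MAXN];   /* reach[u][v]: v reachable from u without crossing a barrier */

static void dfs(int start, int u) {
    for (int v = 0; v < nseg; v++) {
        if (adj[u][v] && !barrier_edge[u][v] && !reach[start][v]) {
            reach[start][v] = 1;
            dfs(start, v);
        }
    }
}

void compute_reachability(void) {
    memset(reach, 0, sizeof reach);
    for (int s = 0; s < nseg; s++)
        dfs(s, s);
}

/* Two accesses may run concurrently if they are in the same segment,
 * or one segment is reachable from the other without hitting a barrier. */
int may_run_concurrently(int seg_a, int seg_b) {
    return seg_a == seg_b || reach[seg_a][seg_b] || reach[seg_b][seg_a];
}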
[Figure: the segment graph with the barrier edge removed, and the resulting table of concurrent segments: segments 1 through 4 may run concurrently with one another, except that 2 and 3 (the two branches of the single-valued conditional) are mutually exclusive, and segment 5, which follows the barrier, is concurrent only with itself.]
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 36
Alias Analysis
• Allocation sites correspond to abstract
locations (a-locs)
• All explicit and implicit program variables have points-to sets
• A-locs are typed and have points-to sets
for each field of the corresponding type
• Arrays have a single points-to set for all indices
• Analysis is flow- and context-insensitive
• Experimental call-site sensitive version – doesn’t
seem to help much
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 37
Thread-Aware Alias Analysis
• Two types of abstract locations: local and
remote
• Local locations reside in local thread’s memory
• Remote locations reside on another thread
• Exploits SPMD property
• Results are a summary over all threads
• Independent of the number of threads at runtime
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 38
Alias Analysis: Allocation
• Creates new local abstract location
• Result of allocation must reside in local memory
class Foo {
Object z;
}
static void bar() {
L1: Foo a = new Foo();
Foo b = broadcast a from 0;
Foo c = a;
L2: a.z = new Object();
}
Joint work with Amir Kamil and Jimmy Su
A-locs: 1 (from the allocation at L1), 2 (from the allocation at L2)
Points-to sets: a, b, c (built up by the rules on the following slides)
Kathy Yelick, 39
Alias Analysis: Assignment
• Copies source abstract locations into
points-to set of target
class Foo {
Object z;
}
static void bar() {
L1: Foo a = new Foo();
Foo b = broadcast a from 0;
Foo c = a;
L2: a.z = new Object();
}
Joint work with Amir Kamil and Jimmy Su
A-locs: 1, 2
Points-to sets: a → {1}, b → { }, c → {1}, 1.z → {2}
Kathy Yelick, 40
Alias Analysis: Broadcast
• Produces both local and remote versions
of source abstract location
• Remote a-loc points to remote analog of what
local a-loc points to
class Foo {
Object z;
}
static void bar() {
L1: Foo a = new Foo();
Foo b = broadcast a from 0;
Foo c = a;
L2: a.z = new Object();
}
Joint work with Amir Kamil and Jimmy Su
A-locs: 1, 2, 1r
Points-to sets: a → {1}, b → {1, 1r}, c → {1}, 1.z → {2}, 1r.z → {2r}
Kathy Yelick, 41
Aliasing Results
• Two variables A and B may alias if:
  ∃x ∈ pointsTo(A) such that x ∈ pointsTo(B)
• Two variables A and B may alias across threads if:
  ∃x ∈ pointsTo(A) such that R(x) ∈ pointsTo(B)
  (where R(x) is the remote counterpart of x)
Joint work with Amir Kamil and Jimmy Su
Points-to sets: a → {1}, b → {1, 1r}, c → {1}
May alias [across threads]:
  a: b, c  [b]
  b: a, c  [a, c]
  c: a, b  [b]
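The two rules above, written out as a small C sketch over explicit points-to sets; the set representation and helper names are illustrative, not the analysis's actual data structures.

#include <stdbool.h>
#include <stddef.h>

/* An abstract location (a-loc) and its remote counterpart. */
typedef struct ALoc {
    int id;
    struct ALoc *remote;          /* R(x): remote analog of x, or NULL */
} ALoc;

typedef struct {                  /* points-to set of a variable */
    ALoc **elems;
    size_t n;
} PointsTo;

static bool contains(const PointsTo *s, const ALoc *x) {
    for (size_t i = 0; i < s->n; i++)
        if (s->elems[i] == x) return true;
    return false;
}

/* A and B may alias if some a-loc x is in both points-to sets. */
bool may_alias(const PointsTo *a, const PointsTo *b) {
    for (size_t i = 0; i < a->n; i++)
        if (contains(b, a->elems[i])) return true;
    return false;
}

/* A and B may alias across threads if R(x) of some x in pointsTo(A)
 * appears in pointsTo(B). */
bool may_alias_across_threads(const PointsTo *a, const PointsTo *b) {
    for (size_t i = 0; i < a->n; i++)
        if (a->elems[i]->remote && contains(b, a->elems[i]->remote))
            return true;
    return false;
}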
Kathy Yelick, 42
Benchmarks
Benchmark     Lines¹   Description
pi            56       Monte Carlo integration
demv          122      Dense matrix-vector multiply
sample-sort   321      Parallel sort
lu-fact       420      Dense linear algebra
3d-fft        614      Fourier transform
gsrb          1090     Computational fluid dynamics kernel
gsrb*         1099     Slightly modified version of gsrb
spmv          1493     Sparse matrix-vector multiply
gas           8841     Hyperbolic solver for gas dynamics
¹ Line counts do not include the reachable portion of the 37,000-line Titanium/Java 1.0 libraries
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 43
Analysis Levels
• We tested analyses of varying levels of
precision
Analysis            Description
naïve               All heap accesses
sharing             All shared accesses
concur              Concurrency analysis + type-based AA
concur/saa          Concurrency analysis + sequential AA
concur/taa          Concurrency analysis + thread-aware AA
concur/taa/cycle    Concurrency analysis + thread-aware AA + cycle detection
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 44
Static (Logical) Fences
[Chart: static fence removal, as a percentage relative to naïve, for pi, demv, sample-sort, lu-fact, 3d-fft, gsrb, gsrb*, spmv, and gas under the naïve, sharing, concur, concur/saa, concur/taa, and concur/taa/cycle analyses; more removal is better.]
Percentages are for number of static fences reduced over naive
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 45
Dynamic (Executed) Fences
[Chart: dynamic (executed) fence removal, as a percentage relative to naïve, for the same nine benchmarks under the same six analyses; more removal is better.]
Percentages are for number of dynamic fences reduced over naive
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 46
Dynamic Fences: gsrb
• gsrb relies on dynamic locality checks
• slight modification to remove checks (gsrb*)
greatly increases precision of analysis
[Chart: dynamic fence removal (percentage relative to naïve) for gsrb vs. gsrb* under the naïve, sharing, concur, concur/saa, concur/taa, and concur/taa/cycle analyses.]
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 47
Two Example Optimizations
• Consider two optimizations for GAS
languages
1. Overlap bulk memory copies
2. Communication aggregation for irregular array accesses (i.e., a[b[i]])
• Both optimizations reorder accesses, so
sequential consistency can inhibit them
• Both are addressing network
performance, so potential payoff is high
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 48
Array Copies in Titanium
• Array copy operations are commonly used
dst.copy(src);
• Content in the domain intersection of the two arrays is copied from src to dst
• Communication (possibly with packing)
required if arrays reside on different threads
• Processor blocks until the operation is
complete.
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 49
Non-Blocking Array Copy Optimization
• Automatically convert blocking array copies
into non-blocking array copies
• Push sync as far down the instruction stream as
possible to allow overlap with computation
• Interprocedural: syncs can be moved
across method boundaries
• Optimization reorders memory accesses –
may be illegal under sequential consistency
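A before/after sketch of the effect in UPC-style C, again using the Berkeley UPC async extension names as an assumption (Titanium's own runtime interface differs, but the sync motion is the same idea):

#include <upc.h>
#include <bupc_extensions.h>   /* Berkeley UPC extensions; header name assumed */

#define N 1024

/* Blocked layout so that dst[N*t .. N*t+N-1] has affinity to thread t. */
shared [N] double dst[N * THREADS];
double src[N];

void independent_work(void);   /* computation that touches neither dst nor src */

void overlapped_update(void) {
    int peer = (MYTHREAD + 1) % THREADS;

    /* Blocking form would be:
     *   upc_memput(&dst[N * peer], src, N * sizeof(double));
     * and the thread would stall here until the transfer finished. */
    bupc_handle_t h = bupc_memput_async(&dst[N * peer], src, N * sizeof(double));

    independent_work();        /* overlaps with the transfer */

    bupc_waitsync(h);          /* sync pushed as far down as the dependences allow */
}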
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 50
Communication Aggregation on Irregular
Array Accesses (Inspector/Executor)
• A loop containing indirect array accesses is
split into phases
• Inspector examines loop and computes reference
targets
• Required remote data gathered in a bulk operation
• Executor uses data to perform actual computation
// Before:
for (...) {
  a[i] = remote[b[i]];
  // other accesses
}

// After:
schd = inspect(remote, b);
tmp = get(remote, schd);
for (...) {
  a[i] = tmp[i];
  // other accesses
}
• Can be illegal under sequential consistency
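A slightly fuller, self-contained sketch of the transformation in UPC-style C. The inspect/get helpers in the slide are stand-ins; the scheme below only eliminates repeated fetches, whereas a production inspector would also sort the remaining indices by owning thread and coalesce them into bulk transfers.

#include <upc.h>
#include <stdlib.h>

#define N 1000
shared double remote_arr[N * THREADS];   /* indirectly indexed shared array */
double a[N];
int b[N];                                /* index vector; assume values in [0, N) */

void inspect_then_execute(void) {
    double *cache = malloc(N * sizeof(double));
    char   *have  = calloc(N, 1);

    /* Inspector: record which remote elements the loop will touch and
     * fetch each distinct one exactly once. */
    for (int i = 0; i < N; i++) {
        int j = b[i];
        if (!have[j]) {
            upc_memget(&cache[j], &remote_arr[j], sizeof(double));
            have[j] = 1;
        }
    }

    /* Executor: the original loop body, now reading only local data. */
    for (int i = 0; i < N; i++)
        a[i] = cache[b[i]];

    free(have);
    free(cache);
}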
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 51
Relaxed + SC with 3 Analyses
• We tested performance using analyses of
varying levels of precision
Name                Description
relaxed             Uses Titanium’s relaxed memory model
naïve               Sequential consistency; fences around every heap access
sharing             Sequential consistency; fences around every shared heap access
concur/taa/cycle    Sequential consistency; uses our most aggressive analysis
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 52
Dense Matrix Vector Multiply
[Chart: speedup (0 to 2) of dense matrix-vector multiply on 1, 2, 4, 8, and 16 processors under the relaxed, naïve, sharing, and concur/taa/cycle configurations.]
• Non-blocking array copy optimization applied
• Strongest analysis is necessary: other SC
implementations suffer relative to relaxed
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 53
Sparse Matrix Vector Multiply
[Chart: speedup (0 to 100) of sparse matrix-vector multiply on 1, 2, 4, 8, and 16 processors under the relaxed, naïve, sharing, and concur/taa/cycle configurations.]
• Inspector/executor optimization applied
• Strongest analysis is again necessary and sufficient
Joint work with Amir Kamil and Jimmy Su
Kathy Yelick, 54
Portability of Titanium and UPC
• Titanium and the Berkeley UPC translator use a similar model
• Source-to-source translator (generate ISO C)
• Runtime layer implements global pointers, etc
• Common communication layer (GASNet)
Also used by gcc/upc
• Both run on most PCs, SMPs, clusters & supercomputers
• Supported operating systems:
• Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX
• UPC translator somewhat less portable: we provide an HTTP-based compile server
• Supported CPUs:
• x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron
• GASNet communication:
• Myrinet GM, Quadrics Elan, Mellanox Infiniband VAPI, IBM LAPI, Cray X1, SGI Altix,
Cray/SGI SHMEM, and (for portability) MPI and UDP
• Specific supercomputer platforms:
• HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000
• Underway: Cray XT3, BG/L (both run over MPI)
• Can be mixed with MPI, C/C++, Fortran
Joint work with Titanium and UPC groups
Kathy Yelick, 55
Portability of PGAS Languages
Other compilers also exist for PGAS Languages
• UPC
• Gcc/UPC by Intrepid: runs on GASNet
• HP UPC for AlphaServers, clusters, …
• MTU UPC uses HP compiler on MPI (source to source)
• Cray UPC
• Co-Array Fortran:
• Cray CAF Compiler: X1, X1E
• Rice CAF Compiler (on ARMCI or GASNet), John Mellor-Crummey
• Source to source
• Processors: Pentium, Itanium2, Alpha, MIPS
• Networks: Myrinet, Quadrics, Altix, Origin, Ethernet
• OS: Linux32 RedHat, IRIX, Tru64
NB: source-to-source requires cooperation by backend compilers
Kathy Yelick, 56
Summary
• PGAS languages offer productivity advantage
• Order of magnitude in line counts for grid-based code in Titanium
• Push decisions about packing/not into runtime for portability
(advantage of language with translator vs. library approach)
• Significant work in compiler can make programming easier
• PGAS languages offer performance advantages
• Good match to RDMA support in networks
• Smaller messages may be faster:
• make better use of network: postpone bisection bandwidth pain
• can also prevent cache thrashing for packing
• Have locality advantages that may help even SMPs
• Source-to-source translation
• The way to ubiquity
• Complement highly tuned machine-specific compilers
Kathy Yelick, 57
End of Slides
PGAS Languages
58
Kathy Yelick
Productizing BUPC
• Recent Berkeley UPC release
• Support full 1.2 language spec
• Supports collectives (tuning ongoing); memory model compliance
• Supports UPC I/O (naïve reference implementation)
• Large effort in quality assurance and robustness
• Test suite: 600+ tests run nightly on 20+ platform configs
• Tests correct compilation & execution of UPC and GASNet
• >30,000 UPC compilations and >20,000 UPC test runs per night
• Online reporting of results & hookup with bug database
• Test suite infrastructure extended to support any UPC compiler
• now running nightly with GCC/UPC + UPCR
• also support HP-UPC, Cray UPC, …
• Online bug reporting database
• Over 1,100 reports since Jan 03
• > 90% fixed (excl. enhancement requests)
Kathy Yelick, 59
Benchmarking
• Next few UPC and MPI application benchmarks
use the following systems
• Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2GHz
• InfiniBand: IB Mellanox Cougar 4X HCA, Opteron 2.2GHz
• Elan3: Quadrics QsNet1, Alpha 1GHz
• Elan4: Quadrics QsNet2, Itanium2 1.4GHz
Kathy Yelick, 61
PGAS Languages: Key to High Performance
One way to gain acceptance of a new language
• Make it run faster than anything else
Keys to high performance
• Parallelism:
• Scaling the number of processors
• Maximize single node performance
• Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
• Avoid (unnecessary) communication cost
• Latency, bandwidth, overhead
• Avoid unnecessary delays due to dependencies
• Load balance
• Pipeline algorithmic dependencies
Kathy Yelick, 62
Hardware Latency
• Network latency is not expected to improve significantly
• Overlapping communication automatically (Chen)
• Overlapping manually in the UPC applications (Husbands, Welcome,
Bell, Nishtala)
• Language support for overlap (Bonachea)
Kathy Yelick, 63
Effective Latency
Communication wait time from other factors
• Algorithmic dependencies
• Use finer-grained parallelism, pipeline tasks (Husbands)
• Communication bandwidth bottleneck
• Message time is: Latency + 1/Bandwidth * Size
• Too much aggregation hurts: wait for bandwidth term
• De-aggregation optimization: automatic (Iancu);
• Bisection bandwidth bottlenecks
• Spread communication throughout the computation (Bell)
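A worked example with invented, illustrative numbers (not measurements from the talk): take 10 µs of combined latency and overhead and 500 MB/s of bandwidth. One fully aggregated 512 KB message costs roughly 10 + 1049 ≈ 1059 µs, almost all of it in the bandwidth term, while each of the thirty-two 16 KB messages it could be split into costs only about 10 + 33 ≈ 43 µs and can be issued early and spread across the computation; the trade only pays off if per-message overhead stays low.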
Kathy Yelick, 64
Fine-grained UPC vs. Bulk-Synch MPI
• How to waste money on supercomputers
• Pack all communication into single message (spend
memory bandwidth)
• Save all communication until the last one is ready (add
effective latency)
• Send all at once (spend bisection bandwidth)
• Or, to use what you have efficiently:
• Avoid long wait times: send early and often
• Use “all the wires, all the time”
• This requires having low overhead!
Kathy Yelick, 65
What You Won’t Hear Much About
• Compiler/runtime/gasnet bug fixes, performance
tuning, testing, …
• >13,000 e-mail messages regarding cvs checkins
• Nightly regression testing
• 25 platforms, 3 compilers (head, opt-branch, gcc-upc),
• Bug reporting
• 1177 bug reports, 1027 fixed
• Release scheduled for later this summer
• Beta is available
• Process significantly streamlined
Kathy Yelick, 66
Take-Home Messages
• Titanium offers tremendous gains in productivity
• High level domain-specific array abstractions
• Titanium is being used for real applications
• Not just toy problems
• Titanium and UPC are both highly portable
• Run on essentially any machine
• Rigorously tested and supported
• PGAS Languages are Faster than two-sided MPI
• Better match to most HPC networks
• Berkeley UPC and Titanium benchmarks
• Designed from scratch with the one-sided PGAS model
• Focus on 2 scalability challenges: AMR and Sparse LU
Kathy Yelick, 67
Titanium Background
• Based on Java, a cleaner C++
• Classes, automatic memory management, etc.
• Compiled to C and then machine code, no JVM
• Same parallelism model as UPC and CAF
• SPMD parallelism
• Dynamic Java threads are not supported
• Optimizing compiler
• Analyzes global synchronization
• Optimizes pointers, communication, memory
Kathy Yelick, 68
Do these Features Yield Productivity?
[Charts: MG and CG line-count comparisons (productive lines of code, split into computation, communication, and declarations) for Fortran w/MPI vs. Titanium, with Titanium substantially smaller in every category; plus Class D speedup plots (MG on Opteron/InfiniBand relative to the best 32-processor case, CG on G5/InfiniBand relative to the best 64-processor case) comparing Fortran w/MPI and Titanium against linear speedup.]
Joint work with Kaushik Datta, Dan Bonachea
Kathy Yelick, 69
GASNet/X1 Performance
[Charts: single-word put per-message gap and single-word get latency (microseconds) vs. message size (1 byte to 2 KB, with RMW, scalar, vector, and bcopy() regimes marked) on the Cray X1, comparing Shmem, GASNet, and MPI.]
• GASNet/X1 improves small message performance over shmem and MPI
• Leverages global pointers on X1
• Highlights advantage of languages vs. library approach
Joint work with Christian Bell, Wei Chen and Dan Bonachea
Kathy Yelick, 70
High Level Optimizations in Titanium
• Irregular communication can be expensive
• “Best” strategy differs by data size/distribution and machine
parameters
• E.g., packing, sending bounding boxes, or fine-grained accesses
• Use of runtime optimizations
• Inspector-executor
• Performance on sparse matrix-vector multiply
• Results: the best strategy differs even within one machine on a single matrix (~50% better)
[Chart: Itanium/Myrinet speedup comparison; average and maximum speedup of the Titanium version relative to the Aztec (MPI) version on 1 to 16 processors, plotted for matrices 1 through 22, roughly in the 1 to 1.6 range.]
Joint work with Jimmy Su
Kathy Yelick, 71
Source to Source Strategy
• Source-to-source translation strategy
• Tremendous portability advantage
• Still can perform significant optimizations
• Use of “restrict” pointers in C
• Understand Multi-D array
indexing (Intel/Itanium issue)
• Support for pragmas like
IVDEP
• Robust vectorizers (X1, SSE, NEC, …)
[Chart: performance ratio of C to UPC for the 24 Livermore Loops on a single Cray X1 node; ratios are mostly below 4, with one outlier at 48x.]
• On machines with integrated shared-memory hardware, the generated code also needs access to shared-memory operations
Joint work with Jimmy Su
Kathy Yelick, 72