Runtime Optimization of Application Level Communication Patterns
Edgar Gabriel and Shuo Huang
Department of Computer Science
University of Houston
gabriel@cs.uh.edu
HIPS 2007 Long Beach
Edgar Gabriel
Motivation
Finite Difference code on a PC cluster using IB and GE interconnects
Execution time for 200 iterations of the solver on 32 processes/processors
[Figure: bar chart of execution time [sec] (axis 0 to 30) for the fcfs, fcfs-pack, ordered, and overlap implementations at problem sizes 128x128x64 and 128x128x128, each over IB and TCP]
How to implement the required
communication pattern efficiently?
• Dependence on platform
– Some functionality only supported (efficiently) on certain platforms or with certain network interconnects
• Dependence on MPI library
– Does the MPI library support all available methods?
– Efficiency in overlapping communication and computation
– Quality of the support for user defined data-types
• Dependence on application
– Problem size
– Ratio of communication to computation
• Problem: How can an (average) user understand the
myriad of implementation options and their impact on the
performance of the application?
• (Honest) Answer: no way
– Abstract interfaces for application level communication operations required (ADCL)
– Statistical tools required to detect correlations between parameters and application performance
ADCL - Adaptive Data and
Communication Library
• Goals:
– Provide abstract interfaces for often occurring application
level communication patterns
• Collective operations
• Not covered by the MPI specification
– Provide a wide variety of implementation possibilities and
decision routines which choose the fastest available
implementation (at runtime)
• Not replacing MPI, but add-on functionality
– Uses many features of MPI
ADCL terminology
ADCL object: Functionality
• Attribute: Abstraction for a characteristic of an implementation, represented by the set of its possible values
• Attribute-set: Group of attributes
• Function: Implementation of a particular operation, optionally including an attribute-set and values
• Function-set: Set of functions providing the same functionality; all of them have to have the same attribute-set
• Vector: Abstraction for a multi-dimensional data object
• Topology: Abstraction for a process topology
• Request: Handle for a tuple of <topology, vector, function-set>
Code sample
ADCL_Vector   vec;
ADCL_Topology topo;
ADCL_Request  request;

/* Generate a 2-D process topology */
MPI_Cart_create (comm, 2, cart_dims, periods, 0, &cart_comm);
ADCL_Topology_create (cart_comm, &topo);

/* Register a 2D vector with ADCL */
ADCL_Vector_register (ndims, vec_dims, HALO_WIDTH, MPI_DOUBLE, vector, &vec);

/* Match process topology, data item and function-set */
ADCL_Request_create (vec, topo, ADCL_FNCTSET_NEIGHBORHOOD, &request);

for (i = 0; i < NIT; i++) {
    /* Main application loop */
    ADCL_Request_start (request);
}
Runtime selection logic: brute force
search (I)
[Figure: implementations no. 1 to 7 are evaluated one after another, followed by the phase "Using the fastest implementation for the rest of the application"]
Runtime selection logic: brute force
search (II)
• Test each function of a given function set a given
number of times
– Store the execution time for each execution per process
• Filter the list of execution times in order to exclude
outliers
• Determine the avg. execution time per function i and
process j
• Determine the max. execution time for function i across
all processes
$f_i^{\max} = \max\left(f_i^{j}\right), \quad j = 0 \ldots nprocs - 1$
– Requires communication (e.g. MPI_Allreduce)
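The local part of these steps (filtering and averaging on one process) could be sketched as follows. The slides do not say which filter is used, so the rule here is purely an assumption for illustration: every sample larger than `factor` times the smallest sample is treated as an outlier. The function name `filtered_mean` and its parameters are hypothetical and not part of ADCL.

```c
#include <assert.h>
#include <stddef.h>

/* Average the execution times of one function on one process after
 * dropping outliers.
 * ASSUMPTION: the outlier rule (sample > factor * minimum) is an
 * illustrative choice; the slides do not specify the filter.
 * Times are expected to be positive. */
double filtered_mean(const double *t, size_t n, double factor)
{
    assert(t != NULL && n > 0 && factor >= 1.0);

    /* Find the smallest measurement. */
    double min = t[0];
    for (size_t k = 1; k < n; k++)
        if (t[k] < min)
            min = t[k];

    /* Sum only the samples that pass the filter. */
    double sum = 0.0;
    size_t kept = 0;
    for (size_t k = 0; k < n; k++) {
        if (t[k] <= factor * min) {
            sum += t[k];
            kept++;
        }
    }
    /* kept >= 1: the minimum itself always passes when factor >= 1. */
    return sum / (double)kept;
}
```

In an MPI program, the resulting per-process averages would then be combined across processes, for example with MPI_Allreduce and the MPI_MAX operation, to obtain the maximum for each function.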
Runtime selection logic: brute force
search (III)
• Determine the function with the minimal max. execution
time across all processes
$f_{winner} = \min\left(f_i^{\max}\right), \quad i = 0 \ldots nfuncs - 1$
• Use this function for the rest of the application lifetime
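The min-of-max rule can be illustrated with a small serial sketch. It assumes that the per-process averages of all functions have already been gathered into one row-major array; in ADCL itself the maximum across processes requires communication, so this is not how the library does it. The name `select_winner` and the array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* avg[i * nprocs + j] is the average execution time of function i on
 * process j (hypothetical layout). Returns the index of the function
 * whose maximum time across all processes is smallest. */
size_t select_winner(const double *avg, size_t nfuncs, size_t nprocs)
{
    assert(avg != NULL && nfuncs > 0 && nprocs > 0);

    size_t winner = 0;
    double best = 0.0;
    for (size_t i = 0; i < nfuncs; i++) {
        /* Maximum time of function i over all processes. */
        double fmax = avg[i * nprocs];
        for (size_t j = 1; j < nprocs; j++)
            if (avg[i * nprocs + j] > fmax)
                fmax = avg[i * nprocs + j];

        /* Keep the function with the smallest maximum. */
        if (i == 0 || fmax < best) {
            best = fmax;
            winner = i;
        }
    }
    return winner;
}
```

Using the maximum rather than the mean across processes reflects that the slowest process determines how long the communication step takes.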
Runtime selection logic:
performance hypothesis (I)
• Assumptions:
– every implementation can be characterized by a set of
attributes, which impact its performance, e.g. for
neighborhood communication
• Communication pattern/degree
• Handling of non-contiguous data
• Data transfer primitive
• Overlapping communication and computation
– The fastest implementation will also have the optimal
values for these attributes
Runtime selection logic:
performance hypothesis (II)
• Approach: determine the optimal value for an attribute by
comparing the execution time of functions differing in
only a single attribute
                       Function a   Function b   Function c
Value for attribute 1  1            2            3
Value for attribute 2  X            X            X
Value for attribute 3  Y            Y            Y
Value for attribute 4  z            z            z
– E.g. if function c had the lowest execution time across all
processes:
• Hypothesis: value 3 optimal for attribute 1
• Confidence value in this hypothesis: 1
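A minimal sketch of forming such a hypothesis, assuming each function is described by a fixed-size array of integer attribute values together with its measured time. The type `func_t`, the constant `NATTR`, and both function names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NATTR 4            /* attributes per function (illustrative) */

typedef struct {
    int    attr[NATTR];    /* attribute values of one implementation */
    double time;           /* max. execution time across all processes */
} func_t;

/* Return 1 if a and b differ in attribute k and agree in all others. */
int differ_only_in(const func_t *a, const func_t *b, size_t k)
{
    assert(a != NULL && b != NULL && k < NATTR);
    for (size_t m = 0; m < NATTR; m++) {
        int same = (a->attr[m] == b->attr[m]);
        if (m == k ? same : !same)
            return 0;
    }
    return 1;
}

/* Given n functions that differ from the first one only in attribute k,
 * return the value of attribute k of the fastest function. */
int hypothesize(const func_t *f, size_t n, size_t k)
{
    assert(f != NULL && n > 0 && k < NATTR);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        assert(differ_only_in(&f[0], &f[i], k));
        if (f[i].time < f[best].time)
            best = i;
    }
    return f[best].attr[k];
}
```

With three functions whose first attribute takes the values 1, 2 and 3 and whose remaining attributes are identical, `hypothesize` returns 3 when the third function is the fastest, matching the example on the slide.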
Runtime selection logic:
performance hypothesis (III)
• Evaluate a different set of functions differing in one other
attribute, e.g.
                       Function c   Function d   Function e
Value for attribute 1  1            2            3
Value for attribute 2  X+1          X+1          X+1
Value for attribute 3  Y            Y            Y
Value for attribute 4  z            z            z
– If this set of measurements leads to the same optimal value for attribute 1:
• Increase the confidence value for this hypothesis by 1
– Else decrease the confidence value by 1
Runtime selection logic:
performance hypothesis (IV)
• If the confidence value for an attribute reaches a given
threshold
– Remove all functions not having the required value for this
attribute from the Function-set
• If the values for one or more attributes do not converge towards a single value, this algorithm degenerates to the brute force search
• Advantage: potentially fewer functions have to be
evaluated to determine the winner
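The confidence bookkeeping and the pruning step could look roughly like the sketch below. The types and function names are hypothetical. One detail is an assumption not covered by the slides: when the confidence falls back to 0, the next observed value is adopted as the new hypothesis.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int value;        /* hypothesized best value of one attribute */
    int confidence;   /* 0 means there is no hypothesis yet */
} hypothesis_t;

/* Fold the best value observed in one measurement round into h.
 * ASSUMPTION: at confidence 0 the observed value replaces the old one. */
void update_hypothesis(hypothesis_t *h, int observed)
{
    assert(h != NULL);
    if (h->confidence == 0) {
        h->value = observed;
        h->confidence = 1;
    } else if (h->value == observed) {
        h->confidence++;
    } else {
        h->confidence--;
    }
}

/* attr_value[i] is the value of the attribute for function i and
 * active[i] marks whether function i is still in the function-set.
 * Once the confidence reaches the threshold, functions with a different
 * value are deactivated. Returns the number of remaining functions. */
size_t prune(const int *attr_value, int *active, size_t nfuncs,
             const hypothesis_t *h, int threshold)
{
    assert(attr_value != NULL && active != NULL && h != NULL);
    size_t remaining = 0;
    for (size_t i = 0; i < nfuncs; i++) {
        if (h->confidence >= threshold && attr_value[i] != h->value)
            active[i] = 0;
        if (active[i])
            remaining++;
    }
    return remaining;
}
```

If no attribute ever reaches the threshold, `prune` removes nothing and every function still has to be measured, which corresponds to the brute force search.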
Currently available implementations for
neighborhood communication
Name                  Comm. pattern  Handling of non-cont. data  Data transfer primitive
IsendIrecv_aao        aao            ddt                         MPI_Isend/Irecv/Waitall
IsendIrecv_pair       pair           ddt                         MPI_Isend/Irecv/Waitall
SendIrecv_aao         aao            ddt                         MPI_Send/Irecv/Waitall
SendIrecv_pair        pair           ddt                         MPI_Send/Irecv/Wait
IsendIrecv_aao_pack   aao            ddt                         MPI_Isend/Irecv/Waitall
IsendIrecv_pair_pack  pair           Pack/unpack                 MPI_Isend/Irecv/Waitall
SendIrecv_aao_pack    aao            ddt                         MPI_Send/Irecv/Waitall
SendIrecv_pair_pack   pair           Pack/unpack                 MPI_Send/Irecv/Wait
SendRecv_pair         pair           ddt                         MPI_Send/Recv
Sendrecv_pair         pair           ddt                         MPI_Send/Recv
SendRecv_pair_pack    pair           Pack/unpack                 MPI_Send/Recv
Sendrecv_pair_pack    pair           Pack/unpack                 MPI_Send/Recv
WinfencePut_aao       aao            ddt                         MPI_Put/MPI_Win_fence
WinfenceGet_aao       aao            ddt                         MPI_Get/MPI_Win_fence
PostStartPut_aao      aao            ddt                         MPI_Put/MPI_Win_post/start
PostStartGet_aao      aao            ddt                         MPI_Get/MPI_Win_post/start
WinfencePut_pair      pair           ddt                         MPI_Put/MPI_Win_fence
WinfenceGet_pair      pair           ddt                         MPI_Get/MPI_Win_fence
PostStartPut_pair     pair           ddt                         MPI_Put/MPI_Win_post/start
PostStartGet_pair     pair           ddt                         MPI_Get/MPI_Win_post/start
Performance results (I)
InfiniBand 32 processes small problem size
[Figure: bar chart of execution time [sec], axis range 10.4 to 12.4, one bar per neighborhood communication implementation plus the brute force and performance hypothesis selection logics]
Performance results (II)
InfiniBand 32 processes large problem size
[Figure: bar chart of execution time [sec], axis range 72.5 to 77.5, one bar per neighborhood communication implementation plus the brute force and performance hypothesis selection logics]
Performance results (III)
TCP over Fast Ethernet 32 processes small problem size
[Figure: bar chart of execution time [sec], axis range 0 to 400, one bar per neighborhood communication implementation plus the brute force and performance hypothesis selection logics]
Performance results (IV)
TCP over Fast Ethernet 32 processes large problem size
[Figure: bar chart of execution time [sec], axis range 0 to 450, one bar per neighborhood communication implementation plus the brute force and performance hypothesis selection logics]
Limitations of ADCL
• Reproducibility of measurements is a challenging topic, even on dedicated compute nodes
– Hyper-threading
– Processor frequency scaling
• Network often shared between multiple jobs
• Hierarchical networks
– Process placement by the batch scheduler
• Performance hypothesis
– Attributes should not be correlated
• Users have to modify their code
– How much longer will we have to deal with MPI?
Advantages of ADCL
• Provides close to optimal performance in many scenarios
• Simplifies the development of parallel code for many applications
• Simplifies the development of adaptive parallel code
• Currently ongoing work:
– Improving (nearly) all components of ADCL
• Data filtering
• Increase the parameter space and the set of implementations
• Experiment with other runtime selection algorithms
– Historic learning, game theory, genetic algorithms
– Integration with a CFD solver in cooperation with Dr. Garbey