Static Dataflow: Compiling Global Control into Local Control

Pritish Jetley, Laxmikant V. Kalé
Department of Computer Science
University of Illinois at Urbana-Champaign
pjetley2@illinois.edu
©Pritish Jetley
The Need for Abstractions
● Traditional programming models don't provide the right frameworks for complicated Science & Engineering applications
  – Modularity
  – Separation of concerns
  – Programming productivity
Modularity in MPI
● A must call B & C (no order)
● In MPI, must serialize calls to different modules
● Or, insert cross-module wildcard receives
Images courtesy and ©David Kunzman
Charm++
● Application composed of collections of objects
  – Collections = arrays
● Object-based virtualization: adaptive overlap
● Communication = Asynch. method invocation
  – Methods cannot be preempted
  – Scheduler picks message and invokes on target
● Array-like syntax for addressing
  – array1(17).f();
  – array2(F(x), G(z)).g();
  – thisProxy(thisIndex).h();
● Load balancing, communication optimization, etc.
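The messaging model on this slide can be imitated in a few lines of plain Python. This is only a toy sketch: `Scheduler`, `Counter`, `send` and `demo` are hypothetical names chosen for illustration and are not part of the Charm++ API. It shows that an invocation merely enqueues a message, and that the scheduler later picks messages one at a time and runs each method to completion.

```python
from collections import deque

class Scheduler:
    """Toy message-driven scheduler (hypothetical, not the Charm++ runtime)."""
    def __init__(self):
        self.queue = deque()

    def send(self, target, method, *args):
        # Asynchronous invocation: enqueue the message and return immediately.
        self.queue.append((target, method, args))

    def run(self):
        # Pick one message at a time; each method runs to completion.
        while self.queue:
            target, method, args = self.queue.popleft()
            getattr(target, method)(*args)

class Counter:
    """One element of a toy object array."""
    def __init__(self, sched, index, log):
        self.sched, self.index, self.log = sched, index, log

    def f(self, hops):
        self.log.append((self.index, hops))
        if hops > 0:
            # Rough analogue of thisProxy(thisIndex).f(): a message to self.
            self.sched.send(self, "f", hops - 1)

def demo():
    sched, log = Scheduler(), []
    array1 = [Counter(sched, i, log) for i in range(3)]
    sched.send(array1[2], "f", 1)
    sched.send(array1[0], "f", 0)
    assert log == []  # nothing has executed yet: sends are asynchronous
    sched.run()
    return log
```

In this sketch the queue is strictly first-in, first-out; a real runtime makes no such promise, which is the subject of the later slide on non-determinism.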
Modularity in Charm++
● Many objects/processor
● Scheduler sends messages to appropriate recipients
● Idle time of one overlapped with computation of other
Images courtesy and ©David Kunzman
However...
● Reactive specification of Charm++ programs
  – Hard to follow global control/data flow
● Non-determinism in message delivery
  – Hard to reason about/debug programs

entry void call(){
  A[x].fun_1();   // asynchronous send
  A[x].fun_2();   // asynchronous send; may be delivered before fun_1
}
entry void fun_1(){
  var = 2;
}
entry void fun_2(){
  var = 3;        // final value of var depends on delivery order
}
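A minimal Python sketch of why the snippet above is hard to reason about: it enumerates both delivery orders of the two messages and collects the final value of `var`. The class `Obj` and the function `possible_outcomes` are hypothetical stand-ins for the Charm++ object, not generated or library code.

```python
from itertools import permutations

class Obj:
    """Stand-in for the object A[x] on the slide."""
    def __init__(self):
        self.var = None

    def fun_1(self):
        self.var = 2

    def fun_2(self):
        self.var = 3

def possible_outcomes():
    """Final values of var over every delivery order of the two messages."""
    outcomes = set()
    for order in permutations(["fun_1", "fun_2"]):
        a = Obj()
        for name in order:
            getattr(a, name)()
        outcomes.add(a.var)
    return outcomes
```

Two possible final values for a two-line program is exactly the non-determinism the slide refers to.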
Can we do better?
● Most Science/Engineering applications follow certain patterns of computation and communication
● What is common among the following applications?
  – Matrix mult.
  – Jacobi
  – FFT
  – Unstructured Mesh Computations
  – Cutoff-Based Molecular Dynamics
● Common thread: static communication pattern
Static Dataflow
● Static patterns of communication
● Objects produce and consume data
Jacobi in Charisma
foreach x,y in J
  (lb[x,y],rb[x,y],tb[x,y],bb[x,y]) ← J[x,y].prodBorders();
  J[x,y].consume(lb[x+1,y],rb[x-1,y],tb[x,y+1],bb[x,y-1]);
end-foreach
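The index expressions in the consume statement fully determine who talks to whom. The Python sketch below resolves, for one object J[x,y], which producers supply each consumed border. The function name `consumed_from` and the treatment of the grid boundary (indices outside the grid simply have no producer) are assumptions made for illustration; the slide does not say how boundaries are handled.

```python
def consumed_from(x, y, nx, ny):
    """Producers whose borders J[x,y].consume reads, per the index expressions."""
    wanted = {
        "lb": (x + 1, y),
        "rb": (x - 1, y),
        "tb": (x, y + 1),
        "bb": (x, y - 1),
    }
    # Assumption: an index outside the nx-by-ny grid has no producer.
    return {name: idx for name, idx in wanted.items()
            if 0 <= idx[0] < nx and 0 <= idx[1] < ny}
```

For example, `consumed_from(1, 1, 3, 3)` lists four neighbours, while the corner `consumed_from(0, 0, 3, 3)` lists two. The result depends only on indices, never on run-time data, which is what makes the pattern static.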
Charisma Semantics
● foreach statements execute across object arrays
  – Have associated methods
● Objects produce and consume parameters
● Statements executed on individual objects in program order
Data Dependences
● A::f() produces p[]
● f() has embedded produce() function
● B::h() consumes p[]
● Indices decide dependences
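One way to picture the produce/consume relationship is a parameter store keyed by index: a consumer fires as soon as the element it names has been produced, regardless of which side arrives first. The sketch below is a hedged illustration in Python; `ParamSpace`, its methods and `demo` are hypothetical names, not the Charisma runtime's interface.

```python
class ParamSpace:
    """Hypothetical store for a produced parameter array such as p[]."""
    def __init__(self):
        self.values = {}
        self.waiting = {}  # index -> callbacks waiting for that element

    def produce(self, index, value):
        # Called from inside the producer method, e.g. A[i].f().
        self.values[index] = value
        for callback in self.waiting.pop(index, []):
            callback(value)

    def consume(self, index, callback):
        # The consumer names the element it needs by index.
        if index in self.values:
            callback(self.values[index])
        else:
            self.waiting.setdefault(index, []).append(callback)

def demo():
    p, log = ParamSpace(), []
    p.consume(0, lambda v: log.append(("B[1].h", v)))  # consumer arrives first
    before = list(log)
    p.produce(0, 42)  # A[0].f() produces p[0]; the waiting consumer now runs
    p.produce(1, 7)   # produced before anyone asks for it
    p.consume(1, lambda v: log.append(("B[2].h", v)))  # fires immediately
    return before, log
```

Here the pairing of producer and consumer comes entirely from the indices 0 and 1, matching the last bullet of the slide.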
Program Order
● B[x].g() executes before B[x].h()
● But B[x].g() concurrent with B[y].h() if x ≠ y
Ensuring Determinism
● Determinism = Data dependences + Program order
● Data dependences enforce causal order on statements across objects
● Program order removes non-determinism within objects due to message-reordering
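The effect of program order can be sketched in Python: messages for one object are buffered on arrival and executed only when their statement is next in program order, so every arrival order yields the same state. The statement names `g` and `h`, the toy update rules and the function names are assumptions for illustration only.

```python
from itertools import permutations

PROGRAM_ORDER = ["g", "h"]  # B[x].g() must run before B[x].h()

def run_in_program_order(arrivals):
    """Execute buffered messages for one object in program order."""
    state = {"var": 0}
    methods = {
        "g": lambda v: state.update(var=v),
        "h": lambda v: state.update(var=state["var"] * 10 + v),
    }
    buffered, next_stmt = {}, 0
    for name, value in arrivals:
        buffered[name] = value
        # Drain every statement whose input is now available, in order.
        while next_stmt < len(PROGRAM_ORDER) and PROGRAM_ORDER[next_stmt] in buffered:
            stmt = PROGRAM_ORDER[next_stmt]
            methods[stmt](buffered.pop(stmt))
            next_stmt += 1
    return state["var"]

def outcomes():
    """Final states over every arrival order of the two messages."""
    msgs = [("g", 2), ("h", 3)]
    return {run_in_program_order(order) for order in permutations(msgs)}
```

Unlike the reactive example earlier in the deck, the set of outcomes here has a single element.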
Implementing Semantics
● Barrier after every for loop?
● Does it work here?
● No, need barrier after each statement!
  – Too much parallel overhead
Programs are Distributed DAGs
[Figure: DAG of method invocations across iterations I-1 and I; nodes f(A,I-1), f(C,I-1), g(B,I-1), h(B,I-1), f(A,I), g1(B,I), g2(B,I), f(C,I)]
Translation Strategy
● Use Charm++ for performance & productivity
● Translate Charisma's global control and data flows into local behavior of Charm++ objects
● Instead of translating to Charm++ code, generate local DAGs specified in SDAG
  – Abstract target
  – Efficient implementation
  – Easier to write compiler
From Global to Local Flows (I)
● Generate unique targets
● Project global control flow onto objects
[Figure: local DAGs obtained by projection. a) DAG_B: g1(B,I-1), g2(B,I-1), g1(B,I), g2(B,I); b) DAG_A: f(A,I-1), f(A,I); c) DAG_C: f(C,I-1), f(C,I)]
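Projection can be sketched as grouping a global statement sequence by object while preserving order. The Python below is illustrative only: `project` and `demo` are hypothetical names, and the global ordering used in `demo` (A.f, B.g1, B.g2, C.f in each iteration) is an assumption, since the figure's edges did not survive extraction.

```python
def project(global_flow):
    """Split a global statement sequence into one local sequence per object."""
    local = {}
    for obj, method, iteration in global_flow:
        local.setdefault(obj, []).append((method, iteration))
    return local

def demo():
    # Assumed global order within each iteration: A.f, B.g1, B.g2, C.f.
    statements = [("A", "f"), ("B", "g1"), ("B", "g2"), ("C", "f")]
    flow = [(obj, m, it) for it in (0, 1) for obj, m in statements]
    return project(flow)
```

Each resulting list is the local control flow of one object; the cross-object ordering that projection discards has to be restored by the message sends described on the next slide.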
From Global to Local Flows (II)
● Generate asynch. message sends for data dependences
● Generated code sets reference numbers to ensure match between sender and receiver iterations
[Figure: local DAGs of A, B and C connected by message sends; nodes f(A,I-1), f(A,I), f(C,I-1), f(C,I), h(B,I-1), g1(B,I), g2(B,I)]
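The role of reference numbers can be sketched as follows: each message carries the iteration it belongs to, and the receiver holds back any message whose tag does not match its current iteration. This is a simplified Python illustration, not the code the compiler generates; `Receiver`, `deliver` and `demo` are hypothetical names.

```python
class Receiver:
    """Hypothetical consumer that matches messages to iterations by tag."""
    def __init__(self):
        self.iteration = 0
        self.pending = {}   # reference number -> payload
        self.consumed = []

    def deliver(self, refnum, payload):
        self.pending[refnum] = payload
        # Consume only the message tagged with the current iteration.
        while self.iteration in self.pending:
            self.consumed.append((self.iteration, self.pending.pop(self.iteration)))
            self.iteration += 1

def demo():
    r = Receiver()
    r.deliver(1, "from iteration 1")  # early arrival is held back
    held = list(r.consumed)
    r.deliver(0, "from iteration 0")  # unblocks iteration 0, then iteration 1
    return held, r.consumed
```

A fast sender that runs one iteration ahead therefore cannot corrupt a slower receiver's current iteration.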
Performance Comparisons
● Compare code generated by previous and new versions of Charisma compiler
  – CTC: Charisma to Charm++
  – CTS: Charisma to SDAG
● CTS eliminates barriers at end of for loops
● Similar CTC implementation would have required significantly more construction effort
3D FFT
foreach x in planes1
  (pencildata[x,*]) <- planes1[x].fft1d();
end-foreach
foreach y in planes2
  planes2[y].fft2d(pencildata[*,y]);
end-foreach
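The `[x,*]` and `[*,y]` index patterns describe a transpose: every producer contributes one piece to every consumer. The Python sketch below shows only that communication pattern, with string labels in place of FFT data; `transpose_exchange` is a hypothetical name and no transform is actually computed.

```python
def transpose_exchange(n):
    """Map each consumer y to the entries it receives from every producer x."""
    # Producer x publishes the slice pencildata[x, *].
    pencildata = {(x, y): f"plane{x}->pencil{y}"
                  for x in range(n) for y in range(n)}
    # Consumer y reads the slice pencildata[*, y]: one entry per producer.
    return {y: [pencildata[(x, y)] for x in range(n)] for y in range(n)}
```

With `n = 2`, consumer 0 receives one entry from plane 0 and one from plane 1, which is the all-to-all exchange between the two FFT phases.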
Cannon Matrix Multiplication
for I = 1 to (N/T)
  foreach x,y in M
    (A[x,y], B[x,y]) <- M[x,y].prodTiles();
    workers[x,y].mult(A[x+1, y], B[x, y+1]);
  end-foreach
end-for
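For readers unfamiliar with the shift pattern, here is the textbook form of Cannon's algorithm in plain Python, with one matrix element per grid position (tile size 1). It is not a translation of the Charisma code above: the initial skew, the wraparound indexing and the shift directions follow the standard description of the algorithm and are assumptions as far as the slide is concerned.

```python
def cannon_matmul(A, B):
    """Textbook Cannon multiplication of two n-by-n matrices (lists of lists)."""
    n = len(A)
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every grid position multiplies the pair it currently holds.
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Shift A left by one position and B up by one, with wraparound.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

Each position only ever receives data from a fixed neighbour, so the communication pattern is again static and known before the program runs.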
Five-Point Jacobi Relaxation
for I = 1 to 100
  foreach i,j in J
    (lb[i,j],rb[i,j],tb[i,j],bb[i,j]) ← J[i,j].prodBorders();
    J[i,j].compute(lb[i+1,j],rb[i-1,j],tb[i,j-1],bb[i,j+1]);
  end-foreach
end-for
Conclusion
● Benefits of translating Charisma to SDAG
  – Less impedance mismatch
    ● Compiler easier to write
  – Existing dependence satisfaction, loop tagging frameworks
  – Performance gain (!)