Towards Automated Characterization of the Data Movement Complexity of Affine Programs

V. Elango†, F. Rastello§, L-N. Pouchet†, J. Ramanujam‡, P. Sadayappan†
†The Ohio State University   ‡Louisiana State University   §INRIA

HiPEAC'15, January 19, 2015
Although a CDAG is derived from analysis of dependences between instances of statements executed
by a sequential program, it abstracts away that sequential schedule of operations and only imposes
an essential partial order captured by the data dependences between the operation instances. Control
dependences in the computation need not be represented since the goal is to capture the inherent data
locality characteristics based on the set of operations that actually transpired during an execution of the
program.
o CDAG abstraction (arbitrary CDAGs):
  § To compute a vertex, its predecessor vertices must hold their values in fast memory.
  § With limited fast memory, computed values may need to be temporarily stored in slow memory and reloaded.
o Inherent data movement complexity (data movement cost) of a CDAG: the minimal number of loads plus stores over all possible valid schedules of the CDAG; it also depends on the fast-memory (cache) size S. (A minimal sketch of recording such a CDAG from an execution follows below.)

[Fig. 5: CDAG for the Gauss-Seidel code in Fig. 2 (N = 6). Fig. 6: Convex partition of the CDAG for the code in Fig. 2 for N = 10. Input vertices are shown in black; all other vertices represent operations performed.]
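To make the CDAG abstraction concrete, here is a minimal sketch (not the authors' tool) of recording a CDAG while executing a simple 1D recurrence: one vertex per dynamic operation, with edges from the vertices that produced the values it reads. The recurrence, the array size and the "last writer" bookkeeping are illustrative assumptions.

/* Sketch: build a CDAG from a sequential execution of a[i] = a[i-1] + a[i]. */
#include <stdio.h>

#define N 8
#define MAXV 64

typedef struct {
    int pred[2];   /* producer vertices of the two operands; -1 means program input */
} Vertex;

static Vertex cdag[MAXV];
static int nv = 0;

static int add_vertex(int p0, int p1) {
    cdag[nv].pred[0] = p0;
    cdag[nv].pred[1] = p1;
    return nv++;
}

int main(void) {
    int last_writer[N];
    for (int i = 0; i < N; i++) last_writer[i] = -1;   /* -1: value is a program input */

    /* "Execute" the recurrence; each dynamic instance becomes a CDAG vertex
     * whose predecessors are the vertices that last wrote a[i-1] and a[i]. */
    for (int i = 1; i < N; i++) {
        int v = add_vertex(last_writer[i-1], last_writer[i]);
        last_writer[i] = v;
    }

    for (int v = 0; v < nv; v++)
        printf("vertex %d: predecessors %d %d\n", v, cdag[v].pred[0], cdag[v].pred[1]);
    return 0;
}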
o Question: What is the lowest achievable data movement cost among all semantically equivalent versions of a computation? There is no known effective solution for computing this minimum possible data movement cost.
o Instead, we develop upper bounds on the minimum cost (can we do better than the given code?) and lower bounds on the minimum cost (how do we know when no further improvement is possible?).

The key idea behind the work presented in this article is to perform analysis on the CDAG of a computation, attempting to find a different order of execution of the operations that can improve the reuse-distance profile compared to that of the given program's sequential execution trace. If this analysis reveals a significantly improved reuse-distance profile, it suggests that suitable source code transformations have the potential to enhance data locality. On the other hand, if the analysis is unable to improve the reuse-distance profile of the code, it is likely that it is already as well optimized for data locality as possible.

The dynamic analysis involves the following steps:
(1) Generate a sequential execution trace of a program.
(2) Form a CDAG from the execution trace.
(3) Perform a multi-level convex partitioning of the CDAG, which is then used to change the schedule of operations of the CDAG from the original order in the given input code. A convex partitioning of a CDAG is analogous to tiling the iteration space of a regular nested loop computation. Multi-level convex partitioning is analogous to multi-level cache-oblivious blocking.
(4) Perform standard reuse-distance analysis of the reordered trace after multi-level convex partitioning (see the sketch below).
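As a concrete illustration of step (4), the following is a minimal C sketch of reuse-distance computation over an address trace. It is not the authors' implementation; the toy trace and the quadratic scan are illustrative assumptions (production tools use tree-based data structures for efficiency).

/* Sketch: reuse distance of an access = number of distinct addresses touched
 * since the previous access to the same address (-1 for a cold access). */
#include <stdio.h>

static long reuse_distance(const long *trace, long t) {
    long prev = -1;
    for (long k = t - 1; k >= 0; k--)              /* find the previous use */
        if (trace[k] == trace[t]) { prev = k; break; }
    if (prev < 0) return -1;                       /* cold (first-time) access */
    long distinct = 0;
    for (long k = prev + 1; k < t; k++) {          /* count distinct addresses in between */
        int seen = 0;
        for (long j = prev + 1; j < k; j++)
            if (trace[j] == trace[k]) { seen = 1; break; }
        if (!seen) distinct++;
    }
    return distinct;
}

int main(void) {
    long trace[] = {0, 1, 2, 0, 1, 3, 0};          /* toy address trace */
    long n = sizeof(trace) / sizeof(trace[0]);
    for (long t = 0; t < n; t++)
        printf("access %ld -> reuse distance %ld\n", t, reuse_distance(trace, t));
    return 0;
}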
Finally, Fig. 6 shows the convex partitioning of the CDAG corresponding to the code in Fig. 2. After such a partitioning, the execution order of the vertices is reordered so that the convex partitions are executed in some valid order (corresponding to a topological sort of a coarse-grained inter-partition dependence graph), with the vertices within a partition being executed in the same relative order as in the original sequential execution. Details are presented in the next section.

3. CONVEX PARTITIONING OF CDAG
In this section, we provide details on our algorithm for convex partitioning of CDAGs, which is at the heart of our proposed dynamic analysis. In the case of loops, numerous efforts have attempted to optimize data locality by applying loop transformations, in particular involving loop tiling and loop fusion [Irigoin and Triolet 1988; Wolf and Lam 1991; Kennedy and McKinley 1993; Bondhugula et al. 2008]. Tiling for locality attempts to group points in an iteration space of a loop into smaller blocks (tiles), allowing reuse (thereby reducing reuse distance) in multiple directions when the block fits in fast memory.

Lower Bounds: Geometric Reasoning with Data Footprints
o Divide the execution trace into segments, each performing S load/stores (3 segments in the example; example due to Jim Demmel):

for (i=0; i<N; i++)
  for (j=0; j<N; j++)
    if (i != j)
      force[i] += func(pos[i], pos[j]);

o Within each segment, the number of distinct elements of pos[] accessed is at most 2S (up to S already resident in the scratchpad when the segment starts and up to S explicitly loaded).
o For this code, the projection of iteration point Stmt(i,j) onto the i-axis maps to data element pos[i]; similarly for the j-axis. Projections of iteration-space points thus correspond to the data footprint, so the maximum number of distinct elements pos[i] or pos[j] read in any segment is <= 2S.
o 2D Loomis-Whitney inequality: the number of points in a set is bounded by the product of the numbers of its projected points on the coordinate axes, |E| <= |E_i| * |E_j|. More generally, for E a subset of the d-dimensional euclidean space and φ_1, ..., φ_d the orthogonal projections onto the coordinate hyperplanes,
    |E| <= ∏_{j=1}^{d} |φ_j(E)|^{1/(d-1)}.
  Such geometric inequalities bound the maximum number of operations for a given number of data moves.
o By Loomis-Whitney, the maximum number of iteration points in any segment is |P| <= 2S * 2S = 4S^2.
o Minimum number of segments >= N^2 / (4S^2); each segment (except possibly the last) performs S load/stores.
o Therefore #load/stores >= (N^2/(4S^2) - 1) * S = Ω(N^2/S). (This chain of inequalities is written out below.)
o Prior work uses the Loomis-Whitney inequality and its generalization (the Hölder-Brascamp-Lieb inequality) to derive such lower bounds for linear-algebra-like computations.
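The chain of inequalities behind the Ω(N^2/S) bound can be written out explicitly. The LaTeX block below is a worked restatement of the bullets above, not additional material from the paper.

% Worked restatement of the segment-counting argument.
\begin{align*}
  |P| &\le |\phi_i(P)|\cdot|\phi_j(P)| \le (2S)(2S) = 4S^2
      && \text{(Loomis--Whitney; footprint at most $2S$ per axis)}\\
  \#\text{segments} &\ge \frac{N(N-1)}{4S^2} \approx \frac{N^2}{4S^2}
      && \text{(each segment covers at most $4S^2$ iteration points)}\\
  \#\text{load/stores} &\ge S\left(\frac{N^2}{4S^2}-1\right) = \Omega\!\left(\frac{N^2}{S}\right)
      && \text{(every segment but the last moves exactly $S$ words).}
\end{align*}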
Geometric Reasoning with Data Footprints: Limitations
o Cannot handle multi-statement programs: computations with very different data movement requirements but the same array access footprint receive the same lower bound.
o Cannot model the effect of data dependences: dependences may impose constraints that lead to a higher data movement cost than footprint analysis reveals, and semantics-preserving loop transformations can change the lower bound.
o Example: for the 1D Jacobi code below, footprint-based geometric analysis cannot derive the known lower bound of Ω(NT/S).

    for (i=0; i<N; i++)
S1:   A[i] = I[i];
    for (t=1; t<T; t++) {
      for (i=1; i<N-1; i++)
S2:     B[i] = A[i-1]+A[i]+A[i+1];
      for (i=1; i<N-1; i++)
S3:     A[i] = B[i];
    }

o Example (loop distribution): in the loop nest
    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        for (k=0; k<N; k++) {
          C[i][j] += 1;
          A[i][k] += 1;
          B[k][j] += 1;
        }
  the statements have the same access functions as matrix multiplication (C[i][j] += A[i][k]*B[k][j]), so a purely footprint-based analysis gives the same result, LB = Ω(N^3/√S). Yet the semantically equivalent code after loop distribution,
    for (i,j,k) C[i][j] += 1;
    for (i,j,k) A[i][k] += 1;
    for (i,j,k) B[k][j] += 1;
  has a different I/O lower bound: LB = Ω(N^2).

Contributions [POPL 2015]:
o Affine computations (such as matrix multiplication and the Jacobi 1D stencil above) can be represented as (unions of) Z-polyhedra:
  § Space: the d-dimensional integer lattice Z^d.
  § Points: each dynamic instance of a statement.
  § Arrows: true data dependences.
o Apply geometric reasoning on Z-polyhedra to bound |P| (a small sketch of writing such a domain with ISL follows below).
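As a concrete example of the representation, the set of dynamic instances of statement S2 of the Jacobi code can be written as a parametric Z-polyhedron with the isl library. The short C program below is a hedged sketch (not part of the authors' tool chain), assuming isl is installed and linked with -lisl.

/* Sketch: the iteration domain of S2 as a Z-polyhedron, using isl. */
#include <stdio.h>
#include <stdlib.h>
#include <isl/ctx.h>
#include <isl/set.h>

int main(void) {
    isl_ctx *ctx = isl_ctx_alloc();

    /* One integer point per dynamic instance of S2 in the Jacobi 1D example. */
    isl_set *dom_S2 = isl_set_read_from_str(ctx,
        "[T,N] -> { S2[t,i] : 1 <= t < T and 1 <= i < N-1 }");

    char *str = isl_set_to_str(dom_S2);
    printf("D_S2 = %s\n", str);
    free(str);

    isl_set_free(dom_S2);
    isl_ctx_free(ctx);
    return 0;
}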
Prior Work: Data Movement Lower Bounds
o Hong & Kung (1981), S-partitioning: establishes a strong relation between (1) the data movement cost of any schedule of a CDAG and (2) the number of vertex sets in a "2S-partition" of the CDAG.
  § Changes the reasoning from all valid schedules to all valid 2S-partitions of the graph.
  § (+) Generality.
  § (-) Manual, CDAG-specific reasoning => a challenge to automate.
o Geometric approaches for linear-algebra-like algorithms:
  § Irony et al. (2004) and Ballard et al. (2011): geometric approach based on the Loomis-Whitney geometric inequality.
  § Christ et al. (2013): automation, based on the generalized geometric Hölder-Brascamp-Lieb (HBL) inequality.
  § (+) Automated asymptotic parametric lower bound expressions, e.g., Ω(N^3/sqrt(S)) for NxN matrix multiplication.
  § (-) Restricted computational model: weakness of bounds, or inapplicability.

5.3 Automated I/O Lower Bound Computation
We present a static analysis algorithm for automated derivation of expressions for parametric asymptotic I/O lower bounds for affine programs. We use two illustrative examples to explain the various steps in the algorithm before providing the detailed pseudo-code for the full algorithm.

Our work: a static analysis that automates parametric asymptotic I/O lower bounds for affine codes, using the CDAG model and an I/O lower bounding technique based on graph partitioning, valid for executions without re-computation.
CDAG Lower Bounds: Hong & Kung S-Partitioning

Definition (S^NR-partitioning; recomputation prohibited). Given a CDAG C, an S^NR-partitioning of C is a collection of h subsets V_1, ..., V_h of V \ I such that:
  P1: for all i != j, V_i ∩ V_j = ∅, and the union of the V_i is V \ I;
  P2: there is no cyclic dependence between subsets;
  P3: for all i, |In(V_i)| <= S;
  P4: for all i, |Out(V_i)| <= S;
where In(V_i) is the set of vertices of V \ V_i that have at least one successor in V_i, and Out(V_i) is the set of vertices of V_i that either belong to the output set O or have at least one successor outside of V_i.

[Figure: an S-partition of an example CDAG into vertex sets VS1, VS2, VS3, each with its loads, FLOPs and stores.]

Illustrative example 1: Consider the following Jacobi 1D stencil computation (the size of its CDAG is tallied in the sketch after the code).

Parameters: N, T
Inputs    : I[N]
Outputs   : A[N]

      for (i=0; i<N; i++)
S1:     A[i] = I[i];
      for (t=1; t<T; t++) {
        for (i=1; i<N-1; i++)
S2:       B[i] = A[i-1] + A[i] + A[i+1];
        for (i=1; i<N-1; i++)
S3:       A[i] = B[i];
      }
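The statement domains of this code determine the size of its CDAG directly. The following tiny C sketch (illustrative values of N and T; not from the paper) tallies the operation vertices per statement.

/* Sketch: number of CDAG operation vertices for the Jacobi 1D code above. */
#include <stdio.h>

int main(void) {
    long N = 10, T = 5;               /* illustrative problem sizes */
    long vS1 = N;                     /* S1[i]  : 0 <= i < N               */
    long vS2 = (T - 1) * (N - 2);     /* S2[t,i]: 1 <= t < T, 1 <= i < N-1 */
    long vS3 = (T - 1) * (N - 2);     /* S3[t,i]: same domain as S2        */
    printf("|V| = %ld operation vertices (plus %ld input vertices I[i])\n",
           vS1 + vS2 + vS3, N);
    return 0;
}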
Lower Bounds for CDAGs: Geometric Reasoning
o How can |VS_i| be upper-bounded for any valid 2S-partition?
o Key idea: use a geometric inequality, but relate iteration points to In(VS_i) rather than to the data footprint.
  § Find the rays corresponding to dependence chains; these define the projections.
  § The projected points of VS_i must be a subset of In(VS_i).
  § Transform the iteration space so that the rays lie along coordinate axes.

In the static data-flow graph of this example (described in detail further below and shown in Fig. 6), each edge carries an affine relation. For instance:
• Edges e2, e3 and e4 capture the uses at S2 (for t = 1) of the values of A produced by S1:
[T,N]->{S1[i]->S2[1,i+1] : 1<=i<N-2}
[T,N]->{S1[i]->S2[1,i]   : 1<=i<N-1}
[T,N]->{S1[i]->S2[1,i-1] : 2<=i<N-1}
• Edges e5 and e6: the multiple uses of the boundary elements I[0] and I[N-1] (copied into A by S1) by S2[t,1] and S2[t,N-2], respectively, for 1<=t<T, are represented by the following relations.
[T,N]->{S1[0]->S2[t,1]     : 1<=t<T}
[T,N]->{S1[N-1]->S2[t,N-2] : 1<=t<T}
• Edge e7: the use of array B in statement S3 corresponds to edge e7 with the following relation.
[T,N]->{S2[t,i]->S3[t,i] : 1<=t<T and 1<=i<N-1}
• Edges e8, e9 and e10: the uses in statement S2 of the values of A produced by S3 are represented by these edges with the following relations.
[T,N]->{S3[t,i]->S2[t+1,i+1] : 1<=t<T-1 and 1<=i<N-2}
[T,N]->{S3[t,i]->S2[t+1,i]   : 1<=t<T-1 and 1<=i<N-1}
[T,N]->{S3[t,i]->S2[t+1,i-1] : 1<=t<T-1 and 2<=i<N-1}

Given a path p = (e1, ..., el) with associated edge relations (R1, ..., Rl), the relation associated with p can be computed by composing the relations of its edges, i.e., relation(p) = Rl ∘ ··· ∘ R1. For instance, the relation for the path (e7, e8) in the example, obtained through the composition Re8 ∘ Re7, is Rp = [T,N] -> {S2[t,i] -> S2[t+1,i+1]}. Further, the domain and image of a composition are restricted to the points for which the composition can apply, i.e., domain(Rj ∘ Ri) = Ri^(-1)(image(Ri) ∩ domain(Rj)) and image(Rj ∘ Ri) = Rj(image(Ri) ∩ domain(Rj)). Hence, domain(Rp) = [T,N] -> {S2[t,i] : 1<=t<T-1 and 1<=i<N-2} and image(Rp) = [T,N] -> {S2[t,i] : 2<=t<T and 2<=i<N-1}. (This composition can be reproduced mechanically with the isl library; see the sketch below.)

Two kinds of paths, namely the injective circuit and the broadcast path, defined below, are of specific importance to the analysis.
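The composition Re8 ∘ Re7 described above can be reproduced mechanically with the isl library. The C program below is a hedged sketch (not the authors' implementation), assuming isl is available and linked with -lisl.

/* Sketch: composing the edge relations e7 and e8 of the Jacobi 1D example with isl. */
#include <stdio.h>
#include <stdlib.h>
#include <isl/ctx.h>
#include <isl/map.h>
#include <isl/set.h>

int main(void) {
    isl_ctx *ctx = isl_ctx_alloc();

    isl_map *Re7 = isl_map_read_from_str(ctx,
        "[T,N] -> { S2[t,i] -> S3[t,i] : 1 <= t < T and 1 <= i < N-1 }");
    isl_map *Re8 = isl_map_read_from_str(ctx,
        "[T,N] -> { S3[t,i] -> S2[t+1,i+1] : 1 <= t < T-1 and 1 <= i < N-2 }");

    /* isl_map_apply_range(A, B) maps x -> z whenever x -> y in A and y -> z in B,
     * i.e., it computes B o A (here Re8 o Re7).  Both arguments are consumed. */
    isl_map *Rp = isl_map_apply_range(Re7, Re8);

    char *s = isl_map_to_str(Rp);
    printf("Rp         = %s\n", s);                /* S2[t,i] -> S2[t+1,i+1], restricted */
    free(s);

    isl_set *dom = isl_map_domain(isl_map_copy(Rp));
    isl_set *img = isl_map_range(Rp);              /* consumes Rp */
    s = isl_set_to_str(dom); printf("domain(Rp) = %s\n", s); free(s);
    s = isl_set_to_str(img); printf("image(Rp)  = %s\n", s); free(s);

    isl_set_free(dom);
    isl_set_free(img);
    isl_ctx_free(ctx);
    return 0;
}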
o Overall approach: use ISL to find all "must" data-flow dependences.
o Cycles in the data dependence graph correspond to "rays" in the CDAG.
o The generalized geometric inequality allows orthogonal projections of different dimensionality.
  § Parametric exponents in the inequality: the sum of the weighted ranks of the projected subspaces must exceed the rank of the full iteration space.
  § Solve a linear program to find optimal Brascamp-Lieb inequality weights.
  § The result is an asymptotic parametric I/O lower bound for the affine program.

o 2S-partition of a CDAG: divide the execution trace into segments, each incurring exactly S load/stores; the operations executed in segment i form a convex vertex set VS_i.

DEFINITION 6 (Injective edge and circuit). An injective edge a is an edge of a data-flow graph whose associated relation Ra is both affine and injective, i.e., Ra = A·x⃗ + b⃗, where A is an invertible matrix. An injective circuit is a circuit E of a data-flow graph such that every edge e ∈ E is an injective edge.
Modeling Data Movement Complexity
o CDAG abstraction of the computation: vertex = operation, edge = data dependence.
o Two-level memory hierarchy: S fast memory locations and an unbounded number of slow memory locations.
o An S-partition of the CDAG satisfies the four properties P1-P4 given earlier.
o Example: the untiled and tiled versions below perform the same operations, yet their data movement costs differ. (A toy miss-count comparison of the two execution orders is sketched below.)

Untiled version; comp. complexity: (N-1)^2 ops:
for (i=1; i<N-1; i++)
  for (j=1; j<N-1; j++)
    A[i][j] = A[i][j-1] + A[i-1][j];

Tiled version; comp. complexity: (N-1)^2 ops:
for (it=1; it<N-1; it+=B)
  for (jt=1; jt<N-1; jt+=B)
    for (i=it; i < min(it+B, N-1); i++)
      for (j=jt; j < min(jt+B, N-1); j++)
        A[i][j] = A[i-1][j] + A[i][j-1];
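To make the contrast between the two execution orders concrete, here is a self-contained toy C simulation (not the paper's methodology): it counts misses in a fully associative LRU fast memory of S elements for the untiled and the tiled order. N, B and S are illustrative values; with these sizes the tiled order typically incurs noticeably fewer misses.

/* Toy illustration: data movement of the untiled vs. tiled order above,
 * measured as misses in a fully associative LRU fast memory of S elements. */
#include <stdio.h>

#define N 64
#define B 8
#define S 32          /* number of fast-memory slots */

static long slot_addr[S];
static long slot_time[S];
static long clock_tick = 0;
static long misses = 0;

static void touch(long addr) {            /* one load or store of an A-element */
    long lru = 0;
    clock_tick++;
    for (int k = 0; k < S; k++) {
        if (slot_addr[k] == addr) {       /* hit: refresh LRU timestamp */
            slot_time[k] = clock_tick;
            return;
        }
        if (slot_time[k] < slot_time[lru]) lru = k;
    }
    misses++;                             /* miss: evict least recently used slot */
    slot_addr[lru] = addr;
    slot_time[lru] = clock_tick;
}

static void reset(void) {
    for (int k = 0; k < S; k++) { slot_addr[k] = -1; slot_time[k] = 0; }
    clock_tick = 0;
    misses = 0;
}

static void stmt(int i, int j) {          /* accesses of A[i][j] = A[i][j-1] + A[i-1][j] */
    touch((long)i * N + (j - 1));
    touch((long)(i - 1) * N + j);
    touch((long)i * N + j);
}

static int min(int a, int b) { return a < b ? a : b; }

int main(void) {
    reset();
    for (int i = 1; i < N - 1; i++)                   /* untiled order */
        for (int j = 1; j < N - 1; j++)
            stmt(i, j);
    printf("untiled: %ld misses\n", misses);

    reset();
    for (int it = 1; it < N - 1; it += B)             /* tiled order */
        for (int jt = 1; jt < N - 1; jt += B)
            for (int i = it; i < min(it + B, N - 1); i++)
                for (int j = jt; j < min(jt + B, N - 1); j++)
                    stmt(i, j);
    printf("tiled:   %ld misses\n", misses);
    return 0;
}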
o Hong & Kung's 2S-partitioning argument: any valid schedule using S fast-memory locations divides the execution trace into segments, each incurring exactly S load/stores (except possibly the last), for a total of NS segments => S*NS >= Total I/O >= S*(NS-1).
o For each such segment, |In(VS_i)| <= 2S: up to S values already reside in fast memory when the segment starts, and up to S new values are loaded during it.
o Reasoning about the minimum number of vertex sets over all valid 2S-partitions therefore yields a lower bound on the number of loads/stores.

DEFINITION 7 (Broadcast edge and path). A broadcast edge b is an edge of a data-flow graph whose associated relation Rb is affine and dim(domain(Rb)) < dim(image(Rb)). A broadcast path is a path (e1, ..., en) of a data-flow graph such that e1 is a broadcast edge and e2, ..., en are injective edges.

o Generalized geometric inequality (Hölder-Brascamp-Lieb; further generalized by Bennett et al.): it extends Loomis-Whitney to more general linear maps, not necessarily all mapping onto spaces of the same dimension. For orthogonal projections φ_j : R^d -> R^{d_j},
    |E| <= ∏_{j=1}^{m} |φ_j(E)|^{s_j},
  where (s_1, ..., s_m) ∈ [0,1]^m is such that, for all i, 1 <= Σ_{j=1}^{m} s_j d_{i,j}, with d_{i,j} = dim(φ_j(span(e_i))) and e_i the i-th canonical basis vector. The Loomis-Whitney inequality stated earlier is the special case in which every φ_j drops a single coordinate and s_j = 1/(d-1). (A sketch of the resulting linear program and I/O bound follows below.)
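Connecting the inequality to the lower bound: choosing the exponents s_j amounts to solving a small linear program, and substituting |φ_j(VS_i)| <= |In(VS_i)| <= 2S turns the inequality into a bound on the segment size. The LaTeX block below is a sketch of this reasoning; it paraphrases the poster's bullets under the stated assumptions (m projections, |V| operation vertices) and is not the paper's exact formulation.

% Sketch: LP for the Brascamp-Lieb exponents and the resulting asymptotic I/O bound.
\begin{align*}
  \sigma = \min \sum_{j=1}^{m} s_j
  \quad\text{s.t.}\quad
  \sum_{j=1}^{m} s_j\, d_{i,j} \ge 1 \;\;(\forall i),
  \qquad 0 \le s_j \le 1 .
\end{align*}
% With |\phi_j(VS_i)| \le 2S for every projection, the inequality gives
% |VS_i| \le (2S)^{\sigma}, hence
\begin{align*}
  \#\text{segments} \ge \frac{|V|}{(2S)^{\sigma}},
  \qquad
  \text{I/O} \ge S\left(\frac{|V|}{(2S)^{\sigma}} - 1\right)
           = \Omega\!\left(\frac{|V|}{S^{\sigma-1}}\right).
\end{align*}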
Injective circuits and broadcast paths in a data-flow graph essentially indicate multiple uses of the same data, and are therefore good candidates for lower bound analysis. Hence only paths of these two kinds are considered in the analysis. The current example of Jacobi 1D computation illustrates the use of injective circuits to derive I/O lower bounds, while the use of broadcast paths for lower bound analysis is explained in another example that follows.

[Figure 6: Data-flow graph for Jacobi 1D.]
Fig. 6 shows the static data-flow graph GF = (VF, EF) for Jacobi 1D. GF contains a vertex for each statement in the code. The input I is also explicitly represented in GF by node I (shaded in black in Fig. 6). Each vertex has an associated domain as shown below:
• DI  = [N]   -> {I[i]    : 0<=i<N}
• DS1 = [N]   -> {S1[i]   : 0<=i<N}
• DS2 = [T,N] -> {S2[t,i] : 1<=t<T and 1<=i<N-1}
• DS3 = [T,N] -> {S3[t,i] : 1<=t<T and 1<=i<N-1}
The edges represent the true (read-after-write) data dependences between the statements. Each edge has an associated affine dependence relation as shown below:
• Edge e1: this edge corresponds to the dependence due to copying the inputs I to array A at statement S1 and has the following relation.
[N]->{I[i]->S1[i] : 0<=i<N}
• Edges e2, e3 and e4: the uses of array elements A[i-1], A[i] and A[i+1] at statement S2 are captured by edges e2, e3 and e4, respectively (their relations were listed earlier).

Injective circuits: In the Jacobi example, we have three circuits from vertex S2 back to itself through S3. The relation for each circuit is computed by composing the relations of its edges as explained earlier. The relations, and the dependence vectors they represent, are listed below.
• Circuit c1 = (e7, e8):
  Rc1 = [T,N] -> {S2[t,i]->S2[t+1,i+1] : 1<=t<T-1 and 1<=i<N-2},  b⃗1 = (1, 1)^T
• Circuit c2 = (e7, e9):
  Rc2 = [T,N] -> {S2[t,i]->S2[t+1,i]   : 1<=t<T-1 and 1<=i<N-1},  b⃗2 = (1, 0)^T
• Circuit c3 = (e7, e10):
  Rc3 = [T,N] -> {S2[t,i]->S2[t+1,i-1] : 1<=t<T-1 and 2<=i<N-1},  b⃗3 = (1, -1)^T

CDAG lower bound from 2S-partitions (worked example):
§ Establish an upper bound VSmax(S) on |VS_i| for any valid 2S-partition.
§ Then the minimum number of vertex sets is NSmin(S) >= Nvertices / VSmax(S).
§ => IOmin(S) >= (NSmin(S) - 1) * S.
§ In the example, with S = 2 and 16 operation vertices: VSmax(S) = 4 => NSmin = 16/4 = 4, so IOmin >= 2 * (4-1) = 6.

Lower Bounds: Research Directions
Theory & Models:
1) Alternate lower bounds approach (graph min-cut based): analyze the CDAG structure.
2) Composition of lower bounds.
3) Modeling vertical + horizontal data movement bounds for scalable parallel systems [SPAA '14].
Applications:
1) Comparative analysis of algorithms via lower bounds.
2) Assessment of compiler effectiveness.
3) Algorithm/architecture co-design space exploration.
Tools:
1) Automated lower bounds for arbitrary explicit CDAGs [ACM TACO '14, HiPEAC '15].
2) Automated parametric lower bounds for affine programs [this poster; POPL '15].