Document 11430356

advertisement
Optimizing Timing and Code Size Using Maximum Diret Loop Fusion 1
Meilin Liu
1
2
2
Meikang Qiu
Edwin H.-M. Sha
2
Department of Computer Siene
Department of Computer Siene
Chun Xue
2
Wright State University
University of Texas at Dallas
Dayton, Ohio 45435, USA
Rihardson, Texas 75083, USA
Abstrat
In this paper, we develop the tehnique that ombines loop distribution with maximum diret loop fusion
(LD MDF), whih performs maximum loop distribution,
followed by maximum diret loop fusion to optimize timing
and ode size simultaneously. We illustrate using examples that dependene yle is not always a restrition for
loop distribution for multi-level loops. We map the maximum diret loop fusion problem to the graph partitioning problem. A polynomial graph partitioning algorithm
is developed to ompute the fusion partitions. We prove
that our maximum diret loop fusion algorithm produes
the fewest number of resultant loop nests without violating
dependene onstraints. We also show that the resultant
ode size of the fused loops by the tehnique of loop distribution with maximum diret loop fusion is always smaller
than the ode size of the original loops.
Keywords: Loop Fusion, Loop Distribution, Code Size,
Embedded DSP
1
Introdution
Loop fusion, groups multiple distint loops into a single loop. Loop fusion an be used to redue the ost of
loop bound testing. Loop fusion an also be used to exploit the instrution-level parallelism on the modern highperformane arhiteture suh as VLIW [1℄. Loop fusion
an enhane the data loality by reusing the variables from
loal storage, thus it redues the number of memory referenes and the power onsumption [7, 1℄.
Diret loop fusion is to nd the legal fusion partition
of the loop nodes so that the loop nodes inside one partition
an be fused diretly without transformation. Maximal
diret loop fusion is to minimize the number of the fusion
partitions, thus the resultant number of the fused loops is
minimized.
Loop distribution separates independent statements
inside a single loop (or loop nest) into multiple loops (or
loop nests) [7, 1, 2, 5℄. Loop distribution exposes parallelism by separating the statements that ould be parallelized, and the statements that must be exeuted sequentially. Loop distribution an also be used to break up a
large loop that doesn't t into the ahe [2, 7, 1℄.
This work is partially supported by TI University Program, NSF EIA-0103709, Texas ARP 009741-0028-2001 and
NSF CCR-0309461, USA.
A lot of researh for loop transformation has been
done in loop fusion and loop distribution to improve
the instrution-level parallelism and enhane data loality [7, 1, 5, 2℄. In [2℄, Kennedy and MKinley use loop
fusion and distribution to enhane data loality and maximize parallelism separately, whih may not give the optimal results. MKinley et al. tried to use a ompound loop
transformation algorithm that onsists of loop permutation, fusion, distribution, and reversal to ahieve the best
loop struture for a loop nest in terms of the ahe line referenes. The above loop fusion tehniques, however, do not
onsider resultant ode size of a transformed loop whih is
another ritial onern for embedded system design [8℄.
Timing and ode size are the two most important performane metris for embedded systems with very limited
on-hip memory resoures [8℄. Combining loop distribution
and loop fusion an lead to an optimization solution with
both redued exeution time and restrited ode size. After loop distribution is applied to reate the nest possible
loop nests, diret loop fusion must be applied to exploit
data loality or improve the parallelism. In this paper, we
propose the tehnique of loop distribution with maximum
diret loop fusion (LD MDF). The authors of [2℄ stated
that loop distribution preserves dependenes if all statements involved in a data dependene yle in the original
loop are plaed in the same loop. We illustrate using examples that dependene yle is not always a restrition
for loop distribution for multi-level loops. We map the diret loop fusion problem to the graph partitioning problem
and we develop a polynomial graph partitioning algorithm
to get the fusion partitions. We prove that our maximum
diret loop fusion tehnique produes the fewest number
of resultant loop nests without violating dependene onstraints. We also show that the resultant ode size of the
fused loop by the LD MDF tehnique will be always smaller
than the ode size of the original loop.
The rest of the paper is organized as follows: We introdue the basi onepts and priniples related to our
tehnique in Setion 2. In Setion 3, we illustrate that
dependene yle is not always a restrition for loop distribution for multi-level loops using examples. We propose
the maximum diret loop fusion tehnique in Setion 4.
The basi idea of the tehnique of loop distribution with
maximum diret loop fusion (LD MDF) is illustrated in
Setion 5. Setion 6 presents the experimental results. Setion 7 onludes the paper.
2
Basi Conepts
L1
for i=0, N
for j=0,M
A[i,j]=C[i−2,j+1];
endfor
for j=0, M
B[i,j]=A[i,j+1]+A[i,j]+C[i−1,j];
endfor
for j=0, M
C[i,j]=B[i,j]+A[i,j+2]+A[i,j+1];
endfor
endfor
In this setion, we provide an overview of the basi
onepts and priniples related to our tehnique, inluding multi-dimensional data ow graph (M DF G) and loop
dependene graph (LDG).
2.1
Data Flow Graph
for i=0, N
for j=0, M
A[i,j]=(B[i−1,j]+C[i−1,j])/2;
B[i,j]=A[i,j−1]*2;
C[i,j]=B[i,j]+3;
endfor
endfor
(a)
(1,0)
(2,−1)
(1,0)
(0,0)
(L3)
L3
(b)
3
Loop Distribution
(0,0)
V3
(b)
Figure 1: A two-level loop and its orresponding DFG.
We use a multi-dimensional data ow graph
(M DF G) to model the body of one nested loop. A MDFG
~ t) is a node-weighted and edge-weighted diG = (V; E; d;
reted graph, where V is the set of omputation nodes,
E V V is the set of edges representing dependenes,
d~ is a funtion from E to Z n , representing the multidimensional delays between two nodes, where n is the number of dimensions, and t is a funtion from V to positive
integers, representing the omputation time of eah node.
A two-level loop is shown in Figure 1(a) and its orresponding data ow graph is shown in Figure 1(b).
2.2
(0,−2)
Figure 2: (a) A two-level loop with three inner loops. (b)
The orresponding loop dependene graph.
(0,1)
V2
L2
(L2)
(a)
V1
(1,0)
(0,−1)
(L1)
Loop Dependene Graph
Loop dependene graph is a higher-level graph
model ompared to the data ow graph. It is used to
model the data dependenes between multiple loops. A
multi-dimensional loop dependene graph (MLDG) G =
(V; E; Æ; o) is a node-labeled and edge-weighted direted
graph, where V is a set of nodes representing the loops.
E V V is a set of edges representing data dependenes
between the loops. Æ is a funtion from E to Z n , representing the minimum data dependene vetor between
the omputations of two loops. o is a funtion from V to
positive integers, representing the order of the exeution
sequene.
The loop dependene graph of the loop in Figure 2(a)
is shown in Figure 2(b). In a loop dependene graph, a
fusion-preventing dependene is represented by an edge e
with edge weight Æ(e) < (0; 0). All the omparisons between two loop dependene vetors are based on the lexiographi order in this paper. A two-dimensional vetor
~
v is smaller than a two-dimensional vetor ~
u aording
to the lexiographi order if either v1 < u1 or v1 = u1
and v2 < u2 . The fusion-preventing dependene edges are
e1 : L1 ! L2 and e2 : L1 ! L3 in the LDG shown in
Figure 2(b).
Loop distribution is an important part of our tehnique of loop distribution with maximum diret loop fusion
(LD MDF). But loop distribution is not always appliable.
In the proess of loop distribution, we must maintain all
the data dependenes to ensure that we won't hange the
semantis of the original program. We show that dependene yle is not always a restrition for loop distribution
for multi-level loops by illustration examples.
for i=0, N
for j=0, M
A[i,j]=(A[i,j]+B[i−1,j])/2;
B[i,j]=A[i,j]+5;
endfor
endfor
(a)
for i=0, N
for j=0, M
A[i,j]=(A[i,j]+B[i−1,j])/2;
endfor
for j=0, M
B[i,j]=A[i,j]+5;
endfor
endfor
(b)
Figure 3: An example loop that is legal to be distributed.
The program shown in Figure 3(a) has a dependene
yle in its orresponding data ow graph, but the statements involved in the dependene yle an be distributed.
The distributed loop shown in Figure 3(b) has the same semantis as the original program shown in Figure 3(a) sine
the true data dependene in the loop in Figure 3(a) is maintained to be a true data dependene in the distributed loop
in Figure 3(b).
for i=0, N
for j=0, M
A[i,j]=B[i,j−1]*2;
B[i,j]=A[i,j]+5;
endfor
endfor
(a)
for i=0, N
for j=0, M
A[i,j]=B[i,j−1]*2;
endfor
for j=0, M
B[i,j]=A[i,j]+5;
endfor
endfor
(b)
Figure 4: An example loop that is illegal to be distributed.
The program shown in Figure 4(a) also has a dependene yle in its orresponding data ow graph. If we
distribute the statements involved in the dependene yle, then the semantis of the program is hanged. We an
see that the ode shown in Figure 4(b) omputes dierently
from the ode shown in Figure 4(a). A[i; j ℄ is dependent on
B [i; j 1℄, whih is a true data dependene in the original
program as shown in Figure 4(a). If we diretly distribute
the loop shown in Figure 4(a), then this true data dependene beomes an anti-data dependene in the distributed
loop as shown in Figure 4(b). In this ase, the statements
involved in the dependene yle in the original loop annot
be distributed.
In summary, if the summation of the edge weights
of the dependene yle satises a ertain ondition, then
the statements involved in the dependene yle an be
distributed. The loop distribution theorems and the maximum loop distribution algorithm have been presented
in [4℄.
4
Diret Loop Fusion Tehnique
Diret loop fusion is to nd the legal fusion partition
of the loop nodes so that the loop nodes inside one partition an be fused diretly. To apply diret loop fusion, we
partition the loop nodes in the LDG into several partitions
so that loop nodes onneted by a fusion-preventing dependene edge are partitioned into dierent partitions. Maximal loop fusion is to minimize the number of the fusion
partitions, thus the resultant number of the fused loops is
minimized. We develop a polynomial graph partitioning
algorithm 4.1 and we prove that the total number of the
partitions got by our algorithm is minimal, i.e., the number
of the fused loops is minimal.
A fusion partition is a partition of loop nodes V in
the loop dependene graph into dierent partitions: eah
partition represents a set of loops to be fused. A fusion
partition is legal if and only if the following two onditions
are satised:
1. Preedene dependene onstraint: the relative order
of two nodes onneted by a preedene dependene
edge annot hange.
2. Fusion-preventing dependene onstraint: two nodes
onneted by a fusion-preventing dependene edge
must be put into dierent partitions.
4.1
Graph Partitioning Algorithm
We formulate maximal fusion problem as a sheduling
problem, i.e., we are looking for a mapping P from the loop
nodes V to the positive integers N so that the following two
onditions are satised:
1. Preedene dependene onstraint: e = u ! v, u; v 2
V , e is a preedene dependene edge, P (v) P (u)
(1)
2. Fusion-preventing dependene onstraint: e = u ! v,
u; v 2 V , e is a fusion-preventing dependene edge,
P (v) P (u) + 1
(2)
We ompute the partition number for eah loop node
in the loop dependene graph based on the two onditions
as stated above. More speially, we ompute the partition number of eah node by using the shortest path algorithm after onstruting the onstraint graph based on
the two onditions of the legal loop partition. The basi
idea of the partition algorithm by using the shortest path
algorithm is as follows:
Given the input loop dependene graph, we distinguish two kinds of edges, the fusion-preventing dependene edges and the preedene dependene edges. We
onstrut a onstraint graph based on the given loop dependene graph and the two onditions oming from the
preedene dependene edge and the fusion-preventing dependene edge. Then we apply the shortest path algorithm
on the onstraint graph to ompute the partition number
for eah loop node.
Graph Partitioning Algorithm
Require: A LDG G = (V; E; Æ; o)
Ensure: Partitions of loop nodes of the LDG
Remove all the edges e with Æ1 (e) 1 from E , and get
a graph G0 = (V; E 0 ; Æ), where E 0 E fejÆ1 (e) 1g
Add one orresponding node Pi in the onstraint graph
G for eah node vi in graph G0
0
0
0
for all edges e 2 E in graph G do
0
if e is a fusion-preventing edge in the original loop
dependene graph then
We get a onstraint Pi Pj 1. So we add one
edge from Pj to Pi in the onstraint graph G , whih
is weighted by 1.
Algorithm 4.1
else
The edge is a preedene edge, and we get a onstraint Pi Pj 0. So we add one edge from Pj to
Pi in the onstraint graph G , whih is weighted by
0.
end if
end for
Add the soure node P0 in the onstraint graph G , and
also the edges from P0 to all the other nodes Pi in the
onstraint graph G . All these added edges from the
soure node P0 to all the other nodes Pi are weighted
with 0
/* Compute the shortest distane from the soure node
to all the other nodes in the onstraint graph by using
the shortest path algorithm */
Call the Bellman-Ford algorithm to ompute D(P0 ; Pi),
the shortest distane from P0 to eah node Pi 2 V
/* Compute the minimum shortest distane */
for all nodes v 2 V do
Dmin
minPi (D(P0; Pi ))
end for
for all
P (Pi)
nodes v 2 V do
D(P0 ; Pi) Dmin + 1
end for
nodes v 2 V in the original LDG do
Put node vi in partition P (Pi )
for all
end for
return Partitions
In the algorithm, we rst remove all the edges e with
1 from the LDG and we obtain a direted ayli
graph (DAG) G0 . Then we onstrut the orresponding
onstraint graph G based on graph G0 . For eah node
Æ1 (e)
in graph G0 , we have a orresponding node Pi in the
onstruted onstraint graph G . The edges in the onstraint graph are onstruted based on the preedene dependene onstraint and the fusion-preventing dependene
onstraint. For eah preedene edge e : vi ! vj in the
original loop dependene, the onstraint is Pi Pj 0, so
an edge from Pj to Pi is added in the onstraint graph, and
the edge is weighted by 0. For eah fusion-preventing dependene edge e : vi ! vj in the original loop dependene,
the onstraint is Pi Pj 1, sine vj must be put into
a partition dierent from the partition in whih vi is put
into, and the partition number of vj must be larger than
the partition number of vi . So an edge from Pj to Pi is
added in the onstraint graph, and the edge is weighted by
1. Then, we add a soure node P0 and also the edges from
P0 to all the other nodes Pi . The added edges from the
soure node P0 to all the other nodes Pi are all weighted
by 0. The onstrution of the onstraint graph G does
not reate yles. The Bellman-Ford algorithm an be applied to ompute the shortest distane D(P0; Pi ) from the
soure node P0 to eah other node Pi in the onstruted
onstraint graph.
After we get the shortest distane from the soure
node P0 to eah other node Pi, we an ompute the minimum shortest distane minD of all the nodes in the onstraint graph, i.e., minD = minPi (D(P0; Pi )). Then for
eah node Pi , we ompute the partition number P (Pi) for
Pi by P (Pi) = D(P0; Pi ) minD + 1. In the algorithm,
we get the partition number for every node, and the nodes
with the same partition number are put into the same fusion partition.
The distributed loop of the example loop in Figure 6(a) is shown in Figure 6(b), and its orresponding
LDG is shown in Figure 5(a). The onstruted onstraint
graph by our graph partitioning algorithm 4.1 is shown
in Figure 5(b). After we run algorithm 4.1, the partition number we got for the loop nodes in the LDG are
as follows: P (L1) = 1; P (L2) = 1; P (L3) = 2; P (L4) =
1; P (L5) = 2; P (L6) = 2. So the partition results are
P artition1 = fL1; L2; L4g, and P artition2 = fL3; L5; L6g
as shown in Figure 5(). There are totally two fusion partitions aording to this partitioning result, whih ahieves
the maximal fusion, sine there are totally three fusionpreventing dependene edges in the original LDG. The
fused loop is shown in Figure 5(d).
vi
Theorem 4.1 Given a loop and its orresponding LDG,
the partition omputed by algorithm 4.1 is a legal partition.
And the total number of the partitions omputed by algorithm 4.1 is minimal.
By ontradition:
Aording to the properties of a legal loop, every legal
loop dependene graph G = (V; E; Æ; o) must satisfy that
Æ1 (e) 0; 8e 2 E , and Æ1 () 1, for eah yle in G.
Therefore, in algorithm 4.1, after we remove all the edges e
with Æ1 (e) 1, the obtained graph G0 is a DAG. Then, we
onstrut the onstraint graph G based on the graph G0 .
For eah node Li in graph G0 , there is a orresponding node
Pi in the orresponding onstraint graph. If the original
Proof 4.1
0
P1
P2
0
−1
L1
L2
P0
0
−1
0
P3
P4
(0,−1)
(0,0)
(0,−1)
L3
0
L4
0
−1
0
P5
P6
0
(0,2)
(0,1)
(0,−1)
L5
L6
(a)
(0,−1)
(0,0)
L3
(0,2)
(b)
L2
L1
(0,−1)
0
L4
(0,1)
L5
(0,−1)
L6
()
for i=0, N
for j=0, M
A[i,j]=A[i,j]*2+1;
B[i,j]=B[i,j−1]*3;
D[i,j]=B[i,j]*3;
endfor
for j=0, M
C[i,j]=(A[i,j+2]+B[i,j+1])/2;
E[i,j]=C[i,j−2]+5;
F[i,j]=(C[i,j−1]+D[i,j+1])/2;
endfor
endfor
(d)
Figure 5: (a) The LDG of the distributed loop in Figure 6(b). (b) The onstruted onstraint graph for partitioning by algorithm 4.1. () The graph partitioning result.
(d)The fused loop by the LD MDF tehnique.
edge e : Li ! Lj in G0 is a preedene edge, then in the
onstraint graph, there is one edge e : Pj ! Pi , whih is
weighted by 0. If the original edge e : Li ! Lj in G0 is a
fusion-preventing dependene edge, then in the onstraint
graph, there is one edge e : Pj ! Pi , whih is weighted by
1. And we also add a soure node V0 and also the edges
from node V0 to all the other nodes in the onstraint graph,
and all the added edges from the soure node V0 to other
nodes are weighted by 0 too.
The onstrution of the onstraint graph G does not
reate yles. Thus, G is a DAG. Bellman-Ford algorithm
is applied to ompute the single-soure shortest distane
D(P0; Pi) from P0 to all the other nodes Pi . Note that
there must at least exist a node P so that D(P0; P ) = 0
sine the onstraint graph is a DAG. Also, sine in the
onstraint graph, all the edges are weighted either by 0 or
by 1, then omputed D(P0; Pi ) values must be ontinuous
integers.
In the algorithm 4.1, we ompute Dmin
minPi (D(P0; Pi )), whih is the minimum value of the
longest shortest path. Dmin is a negative value aording to the onstrution of the onstraint graph. And the
omputed values for the nodes in the onstraint graph inlude 0; ; Dmin , i.e., so there are totally Dmin + 1
distint omputed values for all the loop nodes. Dene
Kmax = Dmin , then there are totally Kmax + 1 parti-
tions after running the algorithm.
We here prove that the number of the partitions omputed by algorithm 4.1 is minimal. Suppose that the number of the partition is smaller than Kmax . Consider node
Pm whose D(P0; Pm) = Dmin . Aording to the onstrution of the onstraint graph, and the denition of the
shortest path algorithm, there must exist a path with Kmax
number of edges with weight 1 from the soure node P0
to Pm . The edge with weight 1 is orresponding to the
fusion-preventing dependene edges in the original loop dependene graph. Along this path, it is obvious that all the
nodes in the original data ow graph assoiated with these
Kmax fusion-preventing dependene edges must be put into
(Kmax + 1) dierent partitions. This path from P0 to Pm
by the shortest path algorithm has given us the lower bound
of the number of partitions between v0 and vm . So it is
impossible to have the total number of the partitions less
than (Kmax + 1) by simply onsidering this path from v0 to
vm .
5
fused loops. Aording to the implementation proedure
of the LD MDF tehnique, it is obviously that the number of the fused loops by the LD MDF tehnique is always
smaller than the number of the loops in the original program. Correspondingly, we an onlude that the ode
size of the fused loops by the LD MDF tehnique is always
smaller than the ode size of the original loops as stated in
Corollary 5.1.
Corollary 5.1 Given a multi-level loop and its orresponding data ow graph, after we apply the tehnique
of loop distribution with maximum diret loop fusion
(LD MDF) to fuse the loops, the ode size of the transformed loop is always smaller than the ode size of the
original loop.
for i=0, N
for j=0, M
A[i,j]=A[i,j]*2+1;
endfor
for j=0, M
B[i,j]=B[i,j−1]*3;
endfor
for j=0, M
C[i,j]=(A[i,j+1]+B[i,j+1])/2;
D[i,j]=B[i,j]*3;
endfor
for j=0, M
E[i,j]=C[i,j−2]+5;
F[i,j]=(C[i,j−1]+D[i,j+1])/2;
endfor
endfor
Loop Distribution with Maximum
Diret Loop Fusion
The tehnique of loop distribution with maximum
diret loop fusion (LD MDF) ombines loop distribution
with maximum diret loop fusion to improve the timing
performane of the loops without the inrease of the ode
size. The basi idea and implementation of the LD MDF
tehnique are illustrated as follows:
1. Apply maximum loop distribution on the given loop
using the maximum loop distribution algorithm presented [4℄.
2. Construt the orresponding loop dependene graph
GL of the distributed loop.
3. Apply the maximum diret loop fusion algorithm 4.1
to ompute the fusion partitions.
4. Get the fused loop Lf by fusing all the loop nodes in
the same fusion partition.
The tehnique of loop distribution with maximum diret loop fusion rst applies maximum loop distribution
on a given loop using the maximum loop distribution algorithm proposed in [4℄. After we perform maximum loop
distribution on the given loop, we onstrut the loop dependene graph of the distributed loop. Then, we partition
the loop nodes in the LDG of the distributed loop so that
there is no fusion-preventing dependenes existing between
the nodes inside one fusion. In other words, we partition
the loop nodes in the LDG into several partitions so that
loop nodes onneted by a fusion-preventing dependene
edge are partitioned into dierent partitions. Thus, all
the loop nodes inside one fusion partition an be diretly
fused [2℄. After we get the fusion partitions, diret loop
fusion is applied on eah fusion partition. We prove that
the total number of the fusion partitions got by our graph
partitioning algorithm is minimum, i.e., our maximum diret loop fusion algorithm produes the fewest number of
(a)
for i=0, N
for j=0, M
A[i,j]=A[i,j]*2+1;
endfor
for j=0, M
B[i,j]=B[i,j−1]*3;
endfor
for j=0, M
C[i,j]=(A[i,j+2]+B[i,j+1])/2;
endfor
for j=0, M
D[i,j]=B[i,j]*3;
endfor
for j=0, M
E[i,j]=C[i,j−2]+5;
endfor
for j=0, M
F[i,j]=(C[i,j−1]+D[i,j+1])/2;
endfor
endfor
(b)
Figure 6: (a) The original 2-level loop with six inner loops.
(b)The distributed loop.
For example, the ode shown in Figure 6(a) ontains four sequential loops enlosed in one shared outermost loop. To apply our LD MDF tehnique, we rst
maximumly distribute the loop using the maximum loop
distribution algorithm proposed in [4℄. The maximumly
distributed loop for the original program is shown in Figure 6(b). After loop distribution, there are totally six
loops. Then, we partition the six loop nodes in the loop
dependene graph as shown in Figure 5(a) into two fusion partitions as shown in Figure 5(). We an see that
all the loop nodes onneted by fusion-preventing dependenes are partitioned into dierent loop partitions. So
the loop nodes inside one loop partition an be fused into
one loop diretly. Thus, the six loops an be fused into
two loops without transformation. The fused loop by our
LD MDF tehnique is shown in Figure 5(d), whih has two
inner loops. The number of the loops in the fused loop is
less than the number of the loops in the original loop. The
ode size of the fused loop is 12 instrutions, whih is also
smaller than the ode size of the original loop, whih is
16 instrutions, sine the loop ontrol instrutions are redued.
6
Experiments
In our experiments, we performed the general legalizing loop fusion tehnique (ULF IP) presented in [3℄ and
the tehnique of loop distribution with maximum diret
loop fusion (LD MDF). All the test ases are extrated
from real DSP appliations, most of them are extrated
from the real lters inluding WDF (Wave Digital lter),
IIR (Innite Impulse Response lter), DPCM (Dierential
Pulse-Code Modulation devie), and 2D (Two Dimensional
lter). The VLIW arhiteture is used as the test platform.
We simulated a VLIW arhiteture based DSP proessor
with eight funtional units. We ompare the timing performane and the ode size of the original loops, the fused
loops by the ULF IP tehnique, and the fused loop by the
LD MDF tehnique. The experiments are performed on a
Dell PC with a P 4 2:1G proessor and 512M B memory
running Red Hast Linux 9:0. Every experiment is nished
within one minute.
Time/(N*M)
Size
40
300
Original
ULF_IP
LD_MDF
35
Original
ULF_IP
LD_MDF
250
by our LD MDF tehnique an be improved 25.3% on average ompared to the original loops, and the ode size is
redued 10.0% on average ompared to the original loops.
7
Conlusion
In this paper, we developed the tehnique of ombining loop distribution with maximum diret loop fusion
(LD MDF) to ahieve a shorter exeution time with a redution of the ode size. we developed a polynomial graph
partitioning algorithm to ompute the fusion partitions.
We prove that our maximum diret loop fusion tehnique
produes the fewest number of resultant loop nests without violating dependene onstraints. We also show that
the resultant ode size of the fused loop by the LD MDF
tehnique will be always smaller than the ode size of the
original loop.
30
200
25
20
Referenes
150
15
100
10
50
5
0
EX1
EX2
EX3
EX4
(a)
EX5
EX6
0
EX1
EX2
EX3
EX4
EX5
EX6
(b)
Figure 7: (a)The Exeution Time of the original loops
and the fused loops by various tehniques. (b) The Code
Size of the original loops and the fused loops by various
tehniques.
EX1 refers to the example loop shown in Figure 6(a).
EX2, EX3, and EX4 refer to the DSP appliations presented in [6℄ that have several loops, inluding WDF
(Wave Digital lter), IIR (Innite Impulse Response lter), DPCM (Dierential Pulse-Code Modulation devie),
and 2D (Two Dimensional lter). EX5 and EX6 refer to
the example loops presented in [3℄.
In Figure 7(a), we ompare the exeution time of the
original loops and the fused loops by the ULF IP tehnique
and the LD MDF tehnique. The exeution time is dened
to be the shedule length times the total iterations. The
shedule length is the number of time units to nish one
iteration of the loop body. We assume that eah omputation an be nished in one time unit. For the sequentially
exeuted loops, the exeution time is the sum of the exeution time of eah individual loop. N denotes the total
number of the iterations for the outermost loop, and M denotes the total number of the iterations for the innermost
loop in Figure 7(a). Figure 7(b) ompares the ode size
of the original loops and the fused loops by the ULF IP
tehnique and the LD MDF tehnique. We alulate the
ode size by the number of instrutions.
Although the ULF IP tehnique proposed in [3℄ an
always ahieve a shorter exeution time than the LD MDF
tehnique, it inreases the ode size. In many ases, this
tehnique annot be applied beause of the memory onstraint. Compared to the ULF IP tehnique, the LD MDF
tehnique takes advantage of both loop distribution and
loop fusion, so it redues the original exeution time and
also avoids the ode-size expansion. The experimental results showed that the timing performane of the fused loop
[1℄ R. Allen and K. Kennedy. Optimizing Compilers for
Modern Arhitetures: A Dependene-based Approah.
Morgan Kaufmann, 2001.
[2℄ K. Kennedy and K. S. Mkinley. Maximizing loop parallelism and improving data loality via loop fusion and
distribution. In Languages and Compilers for Parallel
Computing, Leture Notes in Computer Siene, Number 768, pages 301{320, 1993.
[3℄ M. Liu, Q. Zhuge, Z. Shao, and E. H.-M. Sha. General
loop fusion tehnique for nested loops onsidering timing and ode size. In Pro. ACM/IEEE International
Conferene on Compilers, Arhitetures, and Synthesis
for Embedded Systems (CASES 2004), pages 190{201,
Sep. 2004.
[4℄ M. Liu, Q. Zhuge, Z. Shao, C. Xue, M. Qiu, and E. H.M. Sha. Loop distribution and fusion with timing and
ode size optimization for embedded DSP. In Pro.
The 2005 International Conferene on Embedded And
Ubiquitous Computing (EUC), pages 121{130, Leture
Note in Computer Siene, Springer, Nagasaki, Japan,
De. 2005.
[5℄ K. S. MKinley, S. Carr, and C.-W. Tseng. Improving data loality with loop transformations. ACM
Transations on Programming Languages and Systems
(TOPLAS), 18(4):424 { 453, July 1996.
[6℄ E. H.-M. Sha, T. W. O'Neil, and N. L. Passos. EÆient polynomial-time nested loop fusion with full parallelism. International Journal of Computers and Their
Appliations, 10(1):9{24, Mar. 2003.
[7℄ M. Wolfe. High Performane Compilers for Parallel Computing. Addison-Wesley Publishing Company,
In., 1996.
[8℄ Q. Zhuge, B. Xiao, and E.-M. Sha. Code size redution
tehnique and implementation for software-pipelined
DSP appliations. ACM Transations on Embedded
Computing Systems(TECS), 2(4):590{613, Nov. 2003.
Download