Optimizing Timing and Code Size Using Maximum Diret Loop Fusion 1 Meilin Liu 1 2 2 Meikang Qiu Edwin H.-M. Sha 2 Department of Computer Siene Department of Computer Siene Chun Xue 2 Wright State University University of Texas at Dallas Dayton, Ohio 45435, USA Rihardson, Texas 75083, USA Abstrat In this paper, we develop the tehnique that ombines loop distribution with maximum diret loop fusion (LD MDF), whih performs maximum loop distribution, followed by maximum diret loop fusion to optimize timing and ode size simultaneously. We illustrate using examples that dependene yle is not always a restrition for loop distribution for multi-level loops. We map the maximum diret loop fusion problem to the graph partitioning problem. A polynomial graph partitioning algorithm is developed to ompute the fusion partitions. We prove that our maximum diret loop fusion algorithm produes the fewest number of resultant loop nests without violating dependene onstraints. We also show that the resultant ode size of the fused loops by the tehnique of loop distribution with maximum diret loop fusion is always smaller than the ode size of the original loops. Keywords: Loop Fusion, Loop Distribution, Code Size, Embedded DSP 1 Introdution Loop fusion, groups multiple distint loops into a single loop. Loop fusion an be used to redue the ost of loop bound testing. Loop fusion an also be used to exploit the instrution-level parallelism on the modern highperformane arhiteture suh as VLIW [1℄. Loop fusion an enhane the data loality by reusing the variables from loal storage, thus it redues the number of memory referenes and the power onsumption [7, 1℄. Diret loop fusion is to nd the legal fusion partition of the loop nodes so that the loop nodes inside one partition an be fused diretly without transformation. Maximal diret loop fusion is to minimize the number of the fusion partitions, thus the resultant number of the fused loops is minimized. Loop distribution separates independent statements inside a single loop (or loop nest) into multiple loops (or loop nests) [7, 1, 2, 5℄. Loop distribution exposes parallelism by separating the statements that ould be parallelized, and the statements that must be exeuted sequentially. Loop distribution an also be used to break up a large loop that doesn't t into the ahe [2, 7, 1℄. This work is partially supported by TI University Program, NSF EIA-0103709, Texas ARP 009741-0028-2001 and NSF CCR-0309461, USA. A lot of researh for loop transformation has been done in loop fusion and loop distribution to improve the instrution-level parallelism and enhane data loality [7, 1, 5, 2℄. In [2℄, Kennedy and MKinley use loop fusion and distribution to enhane data loality and maximize parallelism separately, whih may not give the optimal results. MKinley et al. tried to use a ompound loop transformation algorithm that onsists of loop permutation, fusion, distribution, and reversal to ahieve the best loop struture for a loop nest in terms of the ahe line referenes. The above loop fusion tehniques, however, do not onsider resultant ode size of a transformed loop whih is another ritial onern for embedded system design [8℄. Timing and ode size are the two most important performane metris for embedded systems with very limited on-hip memory resoures [8℄. Combining loop distribution and loop fusion an lead to an optimization solution with both redued exeution time and restrited ode size. After loop distribution is applied to reate the nest possible loop nests, diret loop fusion must be applied to exploit data loality or improve the parallelism. In this paper, we propose the tehnique of loop distribution with maximum diret loop fusion (LD MDF). The authors of [2℄ stated that loop distribution preserves dependenes if all statements involved in a data dependene yle in the original loop are plaed in the same loop. We illustrate using examples that dependene yle is not always a restrition for loop distribution for multi-level loops. We map the diret loop fusion problem to the graph partitioning problem and we develop a polynomial graph partitioning algorithm to get the fusion partitions. We prove that our maximum diret loop fusion tehnique produes the fewest number of resultant loop nests without violating dependene onstraints. We also show that the resultant ode size of the fused loop by the LD MDF tehnique will be always smaller than the ode size of the original loop. The rest of the paper is organized as follows: We introdue the basi onepts and priniples related to our tehnique in Setion 2. In Setion 3, we illustrate that dependene yle is not always a restrition for loop distribution for multi-level loops using examples. We propose the maximum diret loop fusion tehnique in Setion 4. The basi idea of the tehnique of loop distribution with maximum diret loop fusion (LD MDF) is illustrated in Setion 5. Setion 6 presents the experimental results. Setion 7 onludes the paper. 2 Basi Conepts L1 for i=0, N for j=0,M A[i,j]=C[i−2,j+1]; endfor for j=0, M B[i,j]=A[i,j+1]+A[i,j]+C[i−1,j]; endfor for j=0, M C[i,j]=B[i,j]+A[i,j+2]+A[i,j+1]; endfor endfor In this setion, we provide an overview of the basi onepts and priniples related to our tehnique, inluding multi-dimensional data ow graph (M DF G) and loop dependene graph (LDG). 2.1 Data Flow Graph for i=0, N for j=0, M A[i,j]=(B[i−1,j]+C[i−1,j])/2; B[i,j]=A[i,j−1]*2; C[i,j]=B[i,j]+3; endfor endfor (a) (1,0) (2,−1) (1,0) (0,0) (L3) L3 (b) 3 Loop Distribution (0,0) V3 (b) Figure 1: A two-level loop and its orresponding DFG. We use a multi-dimensional data ow graph (M DF G) to model the body of one nested loop. A MDFG ~ t) is a node-weighted and edge-weighted diG = (V; E; d; reted graph, where V is the set of omputation nodes, E V V is the set of edges representing dependenes, d~ is a funtion from E to Z n , representing the multidimensional delays between two nodes, where n is the number of dimensions, and t is a funtion from V to positive integers, representing the omputation time of eah node. A two-level loop is shown in Figure 1(a) and its orresponding data ow graph is shown in Figure 1(b). 2.2 (0,−2) Figure 2: (a) A two-level loop with three inner loops. (b) The orresponding loop dependene graph. (0,1) V2 L2 (L2) (a) V1 (1,0) (0,−1) (L1) Loop Dependene Graph Loop dependene graph is a higher-level graph model ompared to the data ow graph. It is used to model the data dependenes between multiple loops. A multi-dimensional loop dependene graph (MLDG) G = (V; E; Æ; o) is a node-labeled and edge-weighted direted graph, where V is a set of nodes representing the loops. E V V is a set of edges representing data dependenes between the loops. Æ is a funtion from E to Z n , representing the minimum data dependene vetor between the omputations of two loops. o is a funtion from V to positive integers, representing the order of the exeution sequene. The loop dependene graph of the loop in Figure 2(a) is shown in Figure 2(b). In a loop dependene graph, a fusion-preventing dependene is represented by an edge e with edge weight Æ(e) < (0; 0). All the omparisons between two loop dependene vetors are based on the lexiographi order in this paper. A two-dimensional vetor ~ v is smaller than a two-dimensional vetor ~ u aording to the lexiographi order if either v1 < u1 or v1 = u1 and v2 < u2 . The fusion-preventing dependene edges are e1 : L1 ! L2 and e2 : L1 ! L3 in the LDG shown in Figure 2(b). Loop distribution is an important part of our tehnique of loop distribution with maximum diret loop fusion (LD MDF). But loop distribution is not always appliable. In the proess of loop distribution, we must maintain all the data dependenes to ensure that we won't hange the semantis of the original program. We show that dependene yle is not always a restrition for loop distribution for multi-level loops by illustration examples. for i=0, N for j=0, M A[i,j]=(A[i,j]+B[i−1,j])/2; B[i,j]=A[i,j]+5; endfor endfor (a) for i=0, N for j=0, M A[i,j]=(A[i,j]+B[i−1,j])/2; endfor for j=0, M B[i,j]=A[i,j]+5; endfor endfor (b) Figure 3: An example loop that is legal to be distributed. The program shown in Figure 3(a) has a dependene yle in its orresponding data ow graph, but the statements involved in the dependene yle an be distributed. The distributed loop shown in Figure 3(b) has the same semantis as the original program shown in Figure 3(a) sine the true data dependene in the loop in Figure 3(a) is maintained to be a true data dependene in the distributed loop in Figure 3(b). for i=0, N for j=0, M A[i,j]=B[i,j−1]*2; B[i,j]=A[i,j]+5; endfor endfor (a) for i=0, N for j=0, M A[i,j]=B[i,j−1]*2; endfor for j=0, M B[i,j]=A[i,j]+5; endfor endfor (b) Figure 4: An example loop that is illegal to be distributed. The program shown in Figure 4(a) also has a dependene yle in its orresponding data ow graph. If we distribute the statements involved in the dependene yle, then the semantis of the program is hanged. We an see that the ode shown in Figure 4(b) omputes dierently from the ode shown in Figure 4(a). A[i; j ℄ is dependent on B [i; j 1℄, whih is a true data dependene in the original program as shown in Figure 4(a). If we diretly distribute the loop shown in Figure 4(a), then this true data dependene beomes an anti-data dependene in the distributed loop as shown in Figure 4(b). In this ase, the statements involved in the dependene yle in the original loop annot be distributed. In summary, if the summation of the edge weights of the dependene yle satises a ertain ondition, then the statements involved in the dependene yle an be distributed. The loop distribution theorems and the maximum loop distribution algorithm have been presented in [4℄. 4 Diret Loop Fusion Tehnique Diret loop fusion is to nd the legal fusion partition of the loop nodes so that the loop nodes inside one partition an be fused diretly. To apply diret loop fusion, we partition the loop nodes in the LDG into several partitions so that loop nodes onneted by a fusion-preventing dependene edge are partitioned into dierent partitions. Maximal loop fusion is to minimize the number of the fusion partitions, thus the resultant number of the fused loops is minimized. We develop a polynomial graph partitioning algorithm 4.1 and we prove that the total number of the partitions got by our algorithm is minimal, i.e., the number of the fused loops is minimal. A fusion partition is a partition of loop nodes V in the loop dependene graph into dierent partitions: eah partition represents a set of loops to be fused. A fusion partition is legal if and only if the following two onditions are satised: 1. Preedene dependene onstraint: the relative order of two nodes onneted by a preedene dependene edge annot hange. 2. Fusion-preventing dependene onstraint: two nodes onneted by a fusion-preventing dependene edge must be put into dierent partitions. 4.1 Graph Partitioning Algorithm We formulate maximal fusion problem as a sheduling problem, i.e., we are looking for a mapping P from the loop nodes V to the positive integers N so that the following two onditions are satised: 1. Preedene dependene onstraint: e = u ! v, u; v 2 V , e is a preedene dependene edge, P (v) P (u) (1) 2. Fusion-preventing dependene onstraint: e = u ! v, u; v 2 V , e is a fusion-preventing dependene edge, P (v) P (u) + 1 (2) We ompute the partition number for eah loop node in the loop dependene graph based on the two onditions as stated above. More speially, we ompute the partition number of eah node by using the shortest path algorithm after onstruting the onstraint graph based on the two onditions of the legal loop partition. The basi idea of the partition algorithm by using the shortest path algorithm is as follows: Given the input loop dependene graph, we distinguish two kinds of edges, the fusion-preventing dependene edges and the preedene dependene edges. We onstrut a onstraint graph based on the given loop dependene graph and the two onditions oming from the preedene dependene edge and the fusion-preventing dependene edge. Then we apply the shortest path algorithm on the onstraint graph to ompute the partition number for eah loop node. Graph Partitioning Algorithm Require: A LDG G = (V; E; Æ; o) Ensure: Partitions of loop nodes of the LDG Remove all the edges e with Æ1 (e) 1 from E , and get a graph G0 = (V; E 0 ; Æ), where E 0 E fejÆ1 (e) 1g Add one orresponding node Pi in the onstraint graph G for eah node vi in graph G0 0 0 0 for all edges e 2 E in graph G do 0 if e is a fusion-preventing edge in the original loop dependene graph then We get a onstraint Pi Pj 1. So we add one edge from Pj to Pi in the onstraint graph G , whih is weighted by 1. Algorithm 4.1 else The edge is a preedene edge, and we get a onstraint Pi Pj 0. So we add one edge from Pj to Pi in the onstraint graph G , whih is weighted by 0. end if end for Add the soure node P0 in the onstraint graph G , and also the edges from P0 to all the other nodes Pi in the onstraint graph G . All these added edges from the soure node P0 to all the other nodes Pi are weighted with 0 /* Compute the shortest distane from the soure node to all the other nodes in the onstraint graph by using the shortest path algorithm */ Call the Bellman-Ford algorithm to ompute D(P0 ; Pi), the shortest distane from P0 to eah node Pi 2 V /* Compute the minimum shortest distane */ for all nodes v 2 V do Dmin minPi (D(P0; Pi )) end for for all P (Pi) nodes v 2 V do D(P0 ; Pi) Dmin + 1 end for nodes v 2 V in the original LDG do Put node vi in partition P (Pi ) for all end for return Partitions In the algorithm, we rst remove all the edges e with 1 from the LDG and we obtain a direted ayli graph (DAG) G0 . Then we onstrut the orresponding onstraint graph G based on graph G0 . For eah node Æ1 (e) in graph G0 , we have a orresponding node Pi in the onstruted onstraint graph G . The edges in the onstraint graph are onstruted based on the preedene dependene onstraint and the fusion-preventing dependene onstraint. For eah preedene edge e : vi ! vj in the original loop dependene, the onstraint is Pi Pj 0, so an edge from Pj to Pi is added in the onstraint graph, and the edge is weighted by 0. For eah fusion-preventing dependene edge e : vi ! vj in the original loop dependene, the onstraint is Pi Pj 1, sine vj must be put into a partition dierent from the partition in whih vi is put into, and the partition number of vj must be larger than the partition number of vi . So an edge from Pj to Pi is added in the onstraint graph, and the edge is weighted by 1. Then, we add a soure node P0 and also the edges from P0 to all the other nodes Pi . The added edges from the soure node P0 to all the other nodes Pi are all weighted by 0. The onstrution of the onstraint graph G does not reate yles. The Bellman-Ford algorithm an be applied to ompute the shortest distane D(P0; Pi ) from the soure node P0 to eah other node Pi in the onstruted onstraint graph. After we get the shortest distane from the soure node P0 to eah other node Pi, we an ompute the minimum shortest distane minD of all the nodes in the onstraint graph, i.e., minD = minPi (D(P0; Pi )). Then for eah node Pi , we ompute the partition number P (Pi) for Pi by P (Pi) = D(P0; Pi ) minD + 1. In the algorithm, we get the partition number for every node, and the nodes with the same partition number are put into the same fusion partition. The distributed loop of the example loop in Figure 6(a) is shown in Figure 6(b), and its orresponding LDG is shown in Figure 5(a). The onstruted onstraint graph by our graph partitioning algorithm 4.1 is shown in Figure 5(b). After we run algorithm 4.1, the partition number we got for the loop nodes in the LDG are as follows: P (L1) = 1; P (L2) = 1; P (L3) = 2; P (L4) = 1; P (L5) = 2; P (L6) = 2. So the partition results are P artition1 = fL1; L2; L4g, and P artition2 = fL3; L5; L6g as shown in Figure 5(). There are totally two fusion partitions aording to this partitioning result, whih ahieves the maximal fusion, sine there are totally three fusionpreventing dependene edges in the original LDG. The fused loop is shown in Figure 5(d). vi Theorem 4.1 Given a loop and its orresponding LDG, the partition omputed by algorithm 4.1 is a legal partition. And the total number of the partitions omputed by algorithm 4.1 is minimal. By ontradition: Aording to the properties of a legal loop, every legal loop dependene graph G = (V; E; Æ; o) must satisfy that Æ1 (e) 0; 8e 2 E , and Æ1 () 1, for eah yle in G. Therefore, in algorithm 4.1, after we remove all the edges e with Æ1 (e) 1, the obtained graph G0 is a DAG. Then, we onstrut the onstraint graph G based on the graph G0 . For eah node Li in graph G0 , there is a orresponding node Pi in the orresponding onstraint graph. If the original Proof 4.1 0 P1 P2 0 −1 L1 L2 P0 0 −1 0 P3 P4 (0,−1) (0,0) (0,−1) L3 0 L4 0 −1 0 P5 P6 0 (0,2) (0,1) (0,−1) L5 L6 (a) (0,−1) (0,0) L3 (0,2) (b) L2 L1 (0,−1) 0 L4 (0,1) L5 (0,−1) L6 () for i=0, N for j=0, M A[i,j]=A[i,j]*2+1; B[i,j]=B[i,j−1]*3; D[i,j]=B[i,j]*3; endfor for j=0, M C[i,j]=(A[i,j+2]+B[i,j+1])/2; E[i,j]=C[i,j−2]+5; F[i,j]=(C[i,j−1]+D[i,j+1])/2; endfor endfor (d) Figure 5: (a) The LDG of the distributed loop in Figure 6(b). (b) The onstruted onstraint graph for partitioning by algorithm 4.1. () The graph partitioning result. (d)The fused loop by the LD MDF tehnique. edge e : Li ! Lj in G0 is a preedene edge, then in the onstraint graph, there is one edge e : Pj ! Pi , whih is weighted by 0. If the original edge e : Li ! Lj in G0 is a fusion-preventing dependene edge, then in the onstraint graph, there is one edge e : Pj ! Pi , whih is weighted by 1. And we also add a soure node V0 and also the edges from node V0 to all the other nodes in the onstraint graph, and all the added edges from the soure node V0 to other nodes are weighted by 0 too. The onstrution of the onstraint graph G does not reate yles. Thus, G is a DAG. Bellman-Ford algorithm is applied to ompute the single-soure shortest distane D(P0; Pi) from P0 to all the other nodes Pi . Note that there must at least exist a node P so that D(P0; P ) = 0 sine the onstraint graph is a DAG. Also, sine in the onstraint graph, all the edges are weighted either by 0 or by 1, then omputed D(P0; Pi ) values must be ontinuous integers. In the algorithm 4.1, we ompute Dmin minPi (D(P0; Pi )), whih is the minimum value of the longest shortest path. Dmin is a negative value aording to the onstrution of the onstraint graph. And the omputed values for the nodes in the onstraint graph inlude 0; ; Dmin , i.e., so there are totally Dmin + 1 distint omputed values for all the loop nodes. Dene Kmax = Dmin , then there are totally Kmax + 1 parti- tions after running the algorithm. We here prove that the number of the partitions omputed by algorithm 4.1 is minimal. Suppose that the number of the partition is smaller than Kmax . Consider node Pm whose D(P0; Pm) = Dmin . Aording to the onstrution of the onstraint graph, and the denition of the shortest path algorithm, there must exist a path with Kmax number of edges with weight 1 from the soure node P0 to Pm . The edge with weight 1 is orresponding to the fusion-preventing dependene edges in the original loop dependene graph. Along this path, it is obvious that all the nodes in the original data ow graph assoiated with these Kmax fusion-preventing dependene edges must be put into (Kmax + 1) dierent partitions. This path from P0 to Pm by the shortest path algorithm has given us the lower bound of the number of partitions between v0 and vm . So it is impossible to have the total number of the partitions less than (Kmax + 1) by simply onsidering this path from v0 to vm . 5 fused loops. Aording to the implementation proedure of the LD MDF tehnique, it is obviously that the number of the fused loops by the LD MDF tehnique is always smaller than the number of the loops in the original program. Correspondingly, we an onlude that the ode size of the fused loops by the LD MDF tehnique is always smaller than the ode size of the original loops as stated in Corollary 5.1. Corollary 5.1 Given a multi-level loop and its orresponding data ow graph, after we apply the tehnique of loop distribution with maximum diret loop fusion (LD MDF) to fuse the loops, the ode size of the transformed loop is always smaller than the ode size of the original loop. for i=0, N for j=0, M A[i,j]=A[i,j]*2+1; endfor for j=0, M B[i,j]=B[i,j−1]*3; endfor for j=0, M C[i,j]=(A[i,j+1]+B[i,j+1])/2; D[i,j]=B[i,j]*3; endfor for j=0, M E[i,j]=C[i,j−2]+5; F[i,j]=(C[i,j−1]+D[i,j+1])/2; endfor endfor Loop Distribution with Maximum Diret Loop Fusion The tehnique of loop distribution with maximum diret loop fusion (LD MDF) ombines loop distribution with maximum diret loop fusion to improve the timing performane of the loops without the inrease of the ode size. The basi idea and implementation of the LD MDF tehnique are illustrated as follows: 1. Apply maximum loop distribution on the given loop using the maximum loop distribution algorithm presented [4℄. 2. Construt the orresponding loop dependene graph GL of the distributed loop. 3. Apply the maximum diret loop fusion algorithm 4.1 to ompute the fusion partitions. 4. Get the fused loop Lf by fusing all the loop nodes in the same fusion partition. The tehnique of loop distribution with maximum diret loop fusion rst applies maximum loop distribution on a given loop using the maximum loop distribution algorithm proposed in [4℄. After we perform maximum loop distribution on the given loop, we onstrut the loop dependene graph of the distributed loop. Then, we partition the loop nodes in the LDG of the distributed loop so that there is no fusion-preventing dependenes existing between the nodes inside one fusion. In other words, we partition the loop nodes in the LDG into several partitions so that loop nodes onneted by a fusion-preventing dependene edge are partitioned into dierent partitions. Thus, all the loop nodes inside one fusion partition an be diretly fused [2℄. After we get the fusion partitions, diret loop fusion is applied on eah fusion partition. We prove that the total number of the fusion partitions got by our graph partitioning algorithm is minimum, i.e., our maximum diret loop fusion algorithm produes the fewest number of (a) for i=0, N for j=0, M A[i,j]=A[i,j]*2+1; endfor for j=0, M B[i,j]=B[i,j−1]*3; endfor for j=0, M C[i,j]=(A[i,j+2]+B[i,j+1])/2; endfor for j=0, M D[i,j]=B[i,j]*3; endfor for j=0, M E[i,j]=C[i,j−2]+5; endfor for j=0, M F[i,j]=(C[i,j−1]+D[i,j+1])/2; endfor endfor (b) Figure 6: (a) The original 2-level loop with six inner loops. (b)The distributed loop. For example, the ode shown in Figure 6(a) ontains four sequential loops enlosed in one shared outermost loop. To apply our LD MDF tehnique, we rst maximumly distribute the loop using the maximum loop distribution algorithm proposed in [4℄. The maximumly distributed loop for the original program is shown in Figure 6(b). After loop distribution, there are totally six loops. Then, we partition the six loop nodes in the loop dependene graph as shown in Figure 5(a) into two fusion partitions as shown in Figure 5(). We an see that all the loop nodes onneted by fusion-preventing dependenes are partitioned into dierent loop partitions. So the loop nodes inside one loop partition an be fused into one loop diretly. Thus, the six loops an be fused into two loops without transformation. The fused loop by our LD MDF tehnique is shown in Figure 5(d), whih has two inner loops. The number of the loops in the fused loop is less than the number of the loops in the original loop. The ode size of the fused loop is 12 instrutions, whih is also smaller than the ode size of the original loop, whih is 16 instrutions, sine the loop ontrol instrutions are redued. 6 Experiments In our experiments, we performed the general legalizing loop fusion tehnique (ULF IP) presented in [3℄ and the tehnique of loop distribution with maximum diret loop fusion (LD MDF). All the test ases are extrated from real DSP appliations, most of them are extrated from the real lters inluding WDF (Wave Digital lter), IIR (Innite Impulse Response lter), DPCM (Dierential Pulse-Code Modulation devie), and 2D (Two Dimensional lter). The VLIW arhiteture is used as the test platform. We simulated a VLIW arhiteture based DSP proessor with eight funtional units. We ompare the timing performane and the ode size of the original loops, the fused loops by the ULF IP tehnique, and the fused loop by the LD MDF tehnique. The experiments are performed on a Dell PC with a P 4 2:1G proessor and 512M B memory running Red Hast Linux 9:0. Every experiment is nished within one minute. Time/(N*M) Size 40 300 Original ULF_IP LD_MDF 35 Original ULF_IP LD_MDF 250 by our LD MDF tehnique an be improved 25.3% on average ompared to the original loops, and the ode size is redued 10.0% on average ompared to the original loops. 7 Conlusion In this paper, we developed the tehnique of ombining loop distribution with maximum diret loop fusion (LD MDF) to ahieve a shorter exeution time with a redution of the ode size. we developed a polynomial graph partitioning algorithm to ompute the fusion partitions. We prove that our maximum diret loop fusion tehnique produes the fewest number of resultant loop nests without violating dependene onstraints. We also show that the resultant ode size of the fused loop by the LD MDF tehnique will be always smaller than the ode size of the original loop. 30 200 25 20 Referenes 150 15 100 10 50 5 0 EX1 EX2 EX3 EX4 (a) EX5 EX6 0 EX1 EX2 EX3 EX4 EX5 EX6 (b) Figure 7: (a)The Exeution Time of the original loops and the fused loops by various tehniques. (b) The Code Size of the original loops and the fused loops by various tehniques. EX1 refers to the example loop shown in Figure 6(a). EX2, EX3, and EX4 refer to the DSP appliations presented in [6℄ that have several loops, inluding WDF (Wave Digital lter), IIR (Innite Impulse Response lter), DPCM (Dierential Pulse-Code Modulation devie), and 2D (Two Dimensional lter). EX5 and EX6 refer to the example loops presented in [3℄. In Figure 7(a), we ompare the exeution time of the original loops and the fused loops by the ULF IP tehnique and the LD MDF tehnique. The exeution time is dened to be the shedule length times the total iterations. The shedule length is the number of time units to nish one iteration of the loop body. We assume that eah omputation an be nished in one time unit. For the sequentially exeuted loops, the exeution time is the sum of the exeution time of eah individual loop. N denotes the total number of the iterations for the outermost loop, and M denotes the total number of the iterations for the innermost loop in Figure 7(a). Figure 7(b) ompares the ode size of the original loops and the fused loops by the ULF IP tehnique and the LD MDF tehnique. We alulate the ode size by the number of instrutions. Although the ULF IP tehnique proposed in [3℄ an always ahieve a shorter exeution time than the LD MDF tehnique, it inreases the ode size. In many ases, this tehnique annot be applied beause of the memory onstraint. Compared to the ULF IP tehnique, the LD MDF tehnique takes advantage of both loop distribution and loop fusion, so it redues the original exeution time and also avoids the ode-size expansion. The experimental results showed that the timing performane of the fused loop [1℄ R. Allen and K. Kennedy. Optimizing Compilers for Modern Arhitetures: A Dependene-based Approah. Morgan Kaufmann, 2001. [2℄ K. Kennedy and K. S. Mkinley. Maximizing loop parallelism and improving data loality via loop fusion and distribution. In Languages and Compilers for Parallel Computing, Leture Notes in Computer Siene, Number 768, pages 301{320, 1993. [3℄ M. Liu, Q. Zhuge, Z. Shao, and E. H.-M. Sha. General loop fusion tehnique for nested loops onsidering timing and ode size. In Pro. ACM/IEEE International Conferene on Compilers, Arhitetures, and Synthesis for Embedded Systems (CASES 2004), pages 190{201, Sep. 2004. [4℄ M. Liu, Q. Zhuge, Z. Shao, C. Xue, M. Qiu, and E. H.M. Sha. Loop distribution and fusion with timing and ode size optimization for embedded DSP. In Pro. The 2005 International Conferene on Embedded And Ubiquitous Computing (EUC), pages 121{130, Leture Note in Computer Siene, Springer, Nagasaki, Japan, De. 2005. [5℄ K. S. MKinley, S. Carr, and C.-W. Tseng. Improving data loality with loop transformations. ACM Transations on Programming Languages and Systems (TOPLAS), 18(4):424 { 453, July 1996. [6℄ E. H.-M. Sha, T. W. O'Neil, and N. L. Passos. EÆient polynomial-time nested loop fusion with full parallelism. International Journal of Computers and Their Appliations, 10(1):9{24, Mar. 2003. [7℄ M. Wolfe. High Performane Compilers for Parallel Computing. Addison-Wesley Publishing Company, In., 1996. [8℄ Q. Zhuge, B. Xiao, and E.-M. Sha. Code size redution tehnique and implementation for software-pipelined DSP appliations. ACM Transations on Embedded Computing Systems(TECS), 2(4):590{613, Nov. 2003.