Partitioning 1998. 5. 19 조준동 SungKyunKwan Univ. VADA Lab. 1 Partitioning in VLSI CAD • • • Partitioning is a technique widely used to solve diverse problems occurring in VLSI CAD. Applications of partitioning can be found in logic synthesis, logic optimization, testing, and layout synthesis. High-quality partitioning is critical in high-level synthesis. To be useful, highlevel synthesis algorithms should be able to handle very large systems. Typically, designers partition high-level design specifications manually into procedures, each of which is then synthesized individually. However, logic decomposition of the design into procedures may not be appropriate for highlevel and logic-level synthesis [60]. Different partitionings of the high-level specifications may produce substantial differences in the resulting IC chip areas and overall system performance. Some technology mapping programs use partitioning techniques to map a circuit specified as a network of modules performing simple Boolean operations onto a network composed of specific modules available in an FPGA. SungKyunKwan Univ. VADA Lab. 2 Partitioning in VLSI CAD • • Since the test generation problem for large circuits may be extremely intensive computationally, circuit partitioning may provide the means to speed it up. Generally, the problem of test pattern generation is NP-complete. To date, all test generation algorithms that guarantee finding a test for a given fault exhibit the worst-case behavior requiring CPU times exponentially increasing with the circuit size. If the circuit can be partitioned into k parts (k not fixed), each of bounded size c, then the worst-case test generation time would be reduced linearly related to the circuit size. Partitioning is often utilized in layout synthesis to produce and/or improve the placement of the circuit modules. Partitioning is used to find strongly connected subcircuits in the design, and the resulting information is utilized by some placement algorithms to place in mutual proximity components belonging to such subcircuits, thus minimizing delays and routing lengths. SungKyunKwan Univ. VADA Lab. 3 Partitioning in VLSI CAD • Another important class of partitioning problems occurs at the system design level. Since IC packages can hold only a limited number of logic components and external terminals, the components must be partitioned into subcircuits small enough to be implemented in the available packages. • Partitioning has been used as well to estimate some properties of physical IC designs, such as the expected IC area. SungKyunKwan Univ. VADA Lab. 4 Circuit Partitioning • The early attempts to solve the circuit partitioning problem were based on the representation of the circuit as a graph G = (V,E), where V is a set of nodes (vertices) representing the fundamental components, such as gates, flip-flops, inputs and outputs and E is a set of edges representing nets present in the network. Graph partitioning problems representing VLSI design problems usually involve separating the set of the graph nodes into disjoint subsets while optimizing some objective function defined on the graph vertices and edges. In the partitioned graph, edges can be divided into two classes: inter-subset edges whose vertices belong to different subsets, and intra-subset edges whose vertices belong to the same subset. The objective functions associated with the graph partitioning problems usually treat these classes of edges in different ways. • One classic graph partitioning problem is the minimum cut (mincut) problem. Its objective is to divide V into two disjoint parts, U and W, such that the number of the inter-subset edges is minimized. The set e(U,W) is referred to as a cut set, and the number of edges in cut set as the cut value. SungKyunKwan Univ. VADA Lab. 5 Circuit Partitioning • graph and physical representation SungKyunKwan Univ. VADA Lab. 6 VHDL example process communication Behavioral description control/data flow graph SungKyunKwan Univ. VADA Lab. 7 Mincut Partitioning • • An exact solution to the mincut problem was provided by Ford and Fulkerson [11], who transformed the mincut problem into the maximum flow (maxflow) problem. The maxflow-mincut algorithm finds a maximum flow in a network; the maxflow value is equal to the mincut value. The first heuristic algorithm for a two-way graph partitioning into equal-sized subsets was proposed by Kernighan and Lin, Their method consists of choosing an initial partition randomly and reducing the cut value by exchanging appropriately selected pairs of nodes from the subsets. After exchanging the positions, nodes are locked in new positions. In subsequent steps, pair of unlocked nodes are selected and exchanged until all nodes are locked. The execution of the algorithm stops, when it riches the local minimum. Most nets in digital circuits are multi-point connections among more than two modules (logic gates, flip-flops, etc.). Therefore, modeling VLSI circuit partitioning problems as graph partitioning problems may lead to poor results caused by inadequate representation of multi-point nets which have to be decomposed into two-point connections. One way to approximate circuit partitioning problems is to transform the circuit into a weighted graph G' representation via a net model. For example, a multipoint net connecting n nodes may be modeled as a complete graph (clique) spanned on these nodes, i.e., containing all possible edges among these nodes. SungKyunKwan Univ. VADA Lab. 8 Clustering (Cont’d) • Clustering based on criterion B below the first cut-line, then criterion A • Clustering based on criterion A below the second cut-line, then criterion B SungKyunKwan Univ. VADA Lab. 9 Clustering Example • Two-cluster Partition • Three-cluster Partition SungKyunKwan Univ. VADA Lab. 10 Survey on Partitioning • . We discuss the traditional min-cut and ratio cut bipartitioning formulations along with multi-way extensions and newer problem formulations, e.g., constraint-driven partitioning (for FPGAs) and partitioning with module replication. Our discussion of solution approaches is divided into four major categories: move-based approaches, geometric representations, combinatorial formulations, and clustering approaches. Move-based algorithms iteratively explore the space of feasible solutions according to a neighborhood operator ; such methods include greed, iterative exchange, simulated annealing, and evolutionary algorithms. Algorithms based on geometric representations embed the circuit netlist in some type of "geometry", e.g, a 1-dimensional linear ordering or a multi-dimensional vector space; the embeddings are commonly constructed using spectral methods. Combinatorial methods transform the partitioning problem into another type of optimization, e.g., based on network flows or mathematical programming. Finally, clustering algorithms merge the netlist modules into many small clusters; we discuss methods which combine clustering with existing algorithms (e.g., two-phase partitioning). SungKyunKwan Univ. VADA Lab. 11 Survey on Partitioning • F-M partitioning algorithm is perhaps the most widely adopted algorithm, due to the linear time complexity, its efficiency and the ease of the implementation. There have been many enchancement of the algorithm proposed in the past. Both Khrishnamurthy and Ng, et. al., have reported that the quality of the solutions yielded by the F-M algorithm is very erratic for circuit partitioning. Subsequently, Krishnamurthy amended the Fiduccia-Mattheyses implementation with a look-ahead technique which considerably improved the average performance. Sanchis extended their work to partition hypergraphs into k partitions. Sechen proposed new improved objective function for mincut circuit partitioning, based on the statistical model, which estimate the expected number of net crossings of the cutline. There have been many improvements of F-M algorithm published, which utilized other techniques such as clustering, replication and other improvement scheme of the basic F-M heuristic. SungKyunKwan Univ. VADA Lab. 12 Survey on Partitioning • An important class of partitioning approaches consists of so-called constructive methods, where methods that are based on graph spectra received the most attention to date. They use eigenvalues and eigenvectors of matrices derived from the netlist graph. Early theoretical work by Barnes, Donath, and Hoffman established relationship between the spectral properties and the partitioning properties of graph. More recently, eigenvector and eigenvalue methods have been used for both component placement and graph minimum-width bisection. Hadley et al. used an eigenvector approach for obtaining good initial partitions of the netlist as a starting solution for iterative improvement algorithm, which was used afterwards. Hagen and Kahng applied eigenvector decomposition of graph for solving ratio-cut partitioning problem. They found that the second smallest eigenvalue of matrix representation of graph yields a lower bound on the optimal ratio cut cost. The most recent work from Alpert et al. and Chan et. al showed that more extensive eigenvector computation leads to better partitioning results. Other approaches to constructive partitioning approaches are based on placement techniques, vertex orderings and clustering, dynamic and boolean programming and geometric embeddings. SungKyunKwan Univ. VADA Lab. 13 Complexity of Partitioning In general, computing the optimal partitioning is an NP-complete problem, which means that the best known algorithms take time which is an exponential function of n=|N| and p, and it is widely believed that no algorithm whose running time is a polynomial function of n=|N| and p exists (see ``Computers and Intractability'', M. Garey and D. Johnson, W. H. Freeman, 1979, for details.) Therefore we need to use heuristics to get approximate solutions for problems where n is large. The picture below illustrates a larger graph partitioning problem; it was generated using the spectral partitioning algorithm as implemented in the graph partitioning software by Gilbert et al, described below. The partition is N = Nblue U Nblack, with red edges connecting nodes in the two partitions. SungKyunKwan Univ. VADA Lab. 14 Chaco • • Before a calculation can be performed on a parallel computer, it must first be decomposed into tasks which are assigned to different processors. Efficient use of the machine requires that each processor have about the same amount of work to do and that the quantity of interprocessor communication is kept small. Partitioning G means dividing N into the union of P disjoint pieces N = N1 U N2 U ... U NP, where the nodes (jobs) in Ni are assigned to be done by processor Pi. This partitioning is done subject to the optimality conditions below. – – • 1.The sums of the weights Wn of the nodes n in each Ni is approximately equal. This means the load is approximately balanced across processors. 2.The sum of the weights We of edges connecting nodes in different Ni and Nj should be minimized. This means that the total size of all messages communicated between different processors is minimized. Chaco is used at over 150 institutions for parallel computing, sparse matrix reordering, circuit placement and a range of other applications. More information about Chaco can be found at bah@cs.sandia.gov SungKyunKwan Univ. VADA Lab. 15 Chaco • • • • A good solution to the graph partitioning problem assigns nodes to processors so that 1.The sums of node weights are approximately equal for each processor. This means that each processor has an equal amount of floating point work to do, the the problem is load balanced. 2.As few edges cross processor boundaries as possible. This minimizes communication, since each crossing edge Ai,j means that xj must be sent to the processor owning xi. The figure below illustrates such a partitioning onto 4 processors (colored blue, red, green and magenta). Crossing edges, which require communication, are colored black, and noncrossing edges, which require no communication, have the same color as the processor. SungKyunKwan Univ. VADA Lab. 16 Chaco • Given n nodes and p processors, there are exponentially many ways to assign n nodes to p processors, some of which more nearly satisfy the optimality conditions than others. To illustrate, the following figure shows two partitions of a graph with 8 nodes onto 4 processors, with 2 nodes per processor. The partitioning on the left has 6 edges crossing processor boundaries and so is superior to the partitioning on the right, with 10 edges crossing processor boundaries. The reader is invited to find another 6-edge-crossing partition, and show that no fewer edge crossings suffice. SungKyunKwan Univ. VADA Lab. 17 Edge Separator and Vertex Separator Bisecting a graph G=(N,E) can be done in two ways. In the last section, we discussed finding the smallest subset Es of E such that removing Es from E divided G into two disconnected subgraphs G1 and G2, with nodes N1 and N2 respectively, where N1 U N2 = N and N1 and N2 are disjoint and equally large. (If the number of nodes is odd, we obviously cannot make |N1|=|N2|. So we will call Es an edge separator if |N1| and |N2| are sufficiently close; we will be more explicit about how different |N1| and |N2| can be only when necessary.) The edges in Es connect nodes in N1 to nodes in N2. Since removing Es disconnects G, Es is called an edge separator. The other way to bisect a graph is to find a vertex separator, a subset Ns of N, such that removing Ns and all incident edges from G also results in two disconnected subgraphs G1 and G2 of G. In other words N = N1 U Ns U N2, where all three subsets of N are disjoint, N1 and N2 are equally large, and no edges connect N1 and N2. SungKyunKwan Univ. The following figure illustrates these ideas. The green edges, Es1, form an edge separator, as well as the blue edges Es2. The red nodes, Ns, are a vertex separator, since removing them and the indicident edges (Es1, Es2, and the purple edges), leaves two disjoint subgraphs. Theorem. (Tarjan, Lipton, "A separator theorem for planar graphs", SIAM J. Appl. Math., 36:177-189, April 1979). Let G=(N,E) be an planar graph. Then we can find a vertex separator Ns, so that N = N1 U Ns U N2 is a disjoint partition of N, |N1| <= (2/3)*|N|, |N2| <= (2/3)*|N|, and |Ns| <= sqrt(8*|N|). VADA Lab. 18 Inertial Partitioning 1. Choose a straight line L, given by a*(x-xbar)+b(y-ybar) = 0. This is a straight line through (xbar,ybar), with slope -a/b. We assume without loss of generality that a2 + b2 = 1. 2.For each node ni=(xi,yi), compute a coordinate by computing the dotproduct Si = -b*(xi-xbar) + a*(yi-ybar). Si is distance from (xbar,ybar) of the projection of (xi,yi) onto the line L. 3.Find the median value Sbar of the Si's. 4.Let the nodes (xi,yi) satisfying Si <= Sbar be in partition N1, and the nodes where Si > Sbar be in partition N2. SungKyunKwan Univ. VADA Lab. 19 Inertial Partitioning In mathematical terms, we want to pick a line such that the sum of squares of lengths of the green lines in the figure are minimized; this is also called doing a total least squares fit of a line to the nodes. In physical terms, if we think of the nodes as unit masses, we choose (x,y) to be the axis about which the moment of inertia of the nodes is minimized. This is why the method is called inertial partitioning. This means choosing a, b, xbar and ybar so that a2 + b2 = 1, and the following quantity is minimized: sumi=1,...,|N| (length of i-th green line)2 = sumi=1,...,|N| ((xi-xbar)2 + (yi-ybar)2 (-b*(xi-xbar) + a*(yi-ybar))2 ) ... by the Pythagorean theorem = a2 * ( sumi=1,...,|N| (xi-xbar)2 ) + b2 * ( sumi=1,...,|N| (yi-ybar)2 ) + 2*a*b * ( sumi=1,...,|N| (xi-xbar)*(yi-ybar) ) = a2 * X2 + b2 * Y2 + 2*a*b * XY = [ a b ] * [ X2 XY ] * [ a ] = [ a b ] * M * [ a ] [ XY Y2 ] [ b ] [b] where X2, Y2 and XY are the summations in the previous lines. One can show that an answer is to choose xbar = sumi=1,...,|N| xi / |N|, ybar = sumi=1,...,|N| yi / |N|, i.e. (xbar,ybar) is the "center of mass" of the nodes, and (a,b) is the unit eigevector corresponding to the smallest eigenvalue of the matrix M. SungKyunKwan Univ. VADA Lab. 20 Partitioning on Planar graph • A very simple partitioning algorithm is based on breadth first search (BFS) of a graph. It is reasonably effective on planar graphs, and probably does well on overlap graphs as defined above. Given a connected graph G=(N,E) and a distinguished node r in N we will call the root, breadth first search produces a subgraph T of G (with the same nodes and a subset of the edges), where T is a tree with root r. In addition, it associates a level with each node n, which is the number of edges on the path from r to n in T. The implementation requires a data structure called a Queue, or a First-In-First-Out (FIFO) list. It will contain a list of objects to be processed. There are two operations one can perform on a Queue. Enqueue(x) adds an object x to the left end of the Queue. y=Dequeue() removes the rightmost entry of the Queue and returns it in y. In other words, if x1, x2, ..., xk are Enqueued on the Queue in that order, then k consecutive Dequeue operations (possibly interleaved with the Enqueue operations) will return x1, x2, ... , xk. • SungKyunKwan Univ. NT = {(r,0)} ... Initially T is just the root r, ... which is at level 0 ET = empty set ... T = (NT, ET) at each stage of the algorithm Enqueue((r,0)) ... Queue is a list of nodes to be processed Mark r ... Mark the root r as having been processed While the Queue is nonempty ... While nodes remain to be processed (n,level) = Dequeue() ... Get a node to process For all unmarked children c of n NT = NT U (c,level+1) ... Add child c to the list of nodes NT of T ET = ET U (n,c) ... Add the edge (n,c) to the edge list ET of T Enqueue((c,level+1)) ... Add child c to Queue for later processing Mark c ... Mark c as having been visited End for End while VADA Lab. 21 Breadth First Search • • Partitioning the graph into nodes at level L or lower, and nodes at level L+1 or higher, guarantees that only tree and interlevel edges will be cut. There can be no "extra" edges connecting, say, the root to the leaves of the tree. This is illustrated in the above figure, where the 10 nodes above the dotted blue line are assigned to partition N1, and the 10 nodes below the line as assigned to N2. For example, suppose one had an n-by-n mesh with unit distance between nodes. Choose any node r as root from which to build a BFS tree. Then the nodes at level L and above approximately form a diamond centered at r with a diagonal of length 2*L. This is shown below, where nodes are visited counterclockwise starting with the north. SungKyunKwan Univ. VADA Lab. 22 Kernighan and Lin Algorithm • • B. Kernighan and S. Lin ("An effective heuristic procedure for partitioning graphs", The Bell System Technial Journal, pp. 291--308, Feb 1970), which takes O(|N|3) time per iteration. A more complicated and efficient implementation, which takes only O(|E|) time per iteration, was presented by C. Fiduccia and R. Mattheyses, "A linear-time heuristic for improving network partitions", Technical Report 82CRD130, General Electric Co., Corporate Research and Development Ceter, Schenectady, NY 1982. We start with an edge weighted graph G=(N,E,WE), and a partitioning G = A U B into equal parts: |A| = |B|. Let w(e) = w(i,j) be the weight of edge e=(i,j), where the weight is 0 if no edge e=(i,j) exists. The goal is to find equal-sized subsets X in A and Y in B, such that exchanging X and Y reduces the total cost of edges from A to B. More precisely, we let T = sum[ a in A and b in B ] w(a,b) = cost of edges from A to B and seek X and Y such that new_A = A - X U Y and new_B = B - Y U X has a lower cost new_T. To compute new_T efficiently, we introduce: SungKyunKwan Univ. E(a) = external cost of a = sum[ b in B ] w(a,b) I(a) = internal cost of a = sum[ a' in A, a'!=a]w(a,a') D(a) = cost of a = E(a) - I(a) and analogously E(b) = external cost of b = sum[ a in A ] w(a,b) I(b) = internal cost of b = sum[ b' in B, b' !=b]w(b,b') D(b) = cost of b = E(b) - I(b) Then it is easy to show that swapping a in A and b in B changes T to new_T = T - ( D(a) + D(b) 2*w(a,b) ) = T - gain(a,b) In other words, gain(a,b) = D(a)+D(b)-2*w(a,b) measures the improvement in the partitioning by swapping a and b. D(a') and D(b') also change to new_D(a') = D(a') + 2*w(a',a) - 2*w(a',b) for all a' in A, a' !=a new_D(b') = D(b') + 2*w(b',b) - 2*w(b',a) for all b' in B, b' != b VADA Lab. 23 SungKyunKwan Univ. VADA Lab. 24 Kernighan and Lin Algorithm (0) Compute T = cost of partition N = A U B ... cost = O(|N|2) Repeat (1) Compute costs D(n) for all n in N ... cost = O(|N|2) (2) Unmark all nodes in G ... cost = O(|N|) (3) While there are unmarked nodes ... |N|/2 iterations (3.1) Find an unmarked pair (a,b) maximizing gain(a,b) ... cost = O(|N|2) (3.2) Mark a and b (but do not swap them) ... cost = O(1) (3.3) Update D(n) for all unmarked n, as though a and b had been swapped ... cost = O(|N|) End while SungKyunKwan Univ. ... At this point, we have computed a sequence of pairs ... (a1,b1), ... , (ak,bk) and ... gains gain(1), ..., gain(k) ... where k = |N|/2, ordered by the order in which ... we marked them (4) Pick j maximizing Gain = sumi=1...j gain(i) ... Gain is the reduction in cost from swapping ... (a1,b1),...,(aj,bj) (5) If Gain > 0 then (5.2) Update A = A - {a1,...,ak} U {b1,...,bk} ... cost = O(|N|) (5.2) Update B = B - {b1,...,bk} U {a1,...,ak} ... cost = O(|N|) (5.3) Update T = T - Gain ... cost = O(1) End if Until Gain <= 0 VADA Lab. 25 Spectral Partitioning • • • • This is a powerful but expensive technique, based on techniques introduced by Fiedler in the 1970s, but popularized in 1990 by A. Pothen, H. Simon, and K.-P. Liou, "Partitioning sparse matrices with eigenvectors of graphs", SIAM J. Matrix Anal. Appl., 11:430--452. We will first describe the algorithm, and then give three related justifications for its efficacy. Let G=(N,E) be an undirected, unweighted graph without self edges (i,i) or multiple edges from one node to another. We define two matrices related to this graph. Definition The incidence matrix In(G) of G is an |N|-by-|E| matrix, with one row for each node and one column for each edge. Suppose edge e=(i,j). Then column e of In(G) is zero except for the the i-th and j-th entries, which are +1 and -1, respectively. SungKyunKwan Univ. Note that there is some ambiguity in this definition, since G is undirected; writing edge e=(i,j) instead of (j,i) is equivalent to multiplying column e of In(G) by -1. We will see that this ambiguity will not be important to us. Definition The Laplacian matrix L(G) of G is an |N|-by-|N| symmetric matrix, with one row and column for each node. It is defined as follows. (L(G))(i,j) = degree of node i if i=j (number of incident edges) = -1 if i!=j and there is an edge (i,j) VADA Lab. 26 Spatial Locality: Hardware Partitioning • • • • • • • The interface logic should be properly partitioned for area and timing reasons. Minimization of global busses leads to lower bus capacitance, and thus lower interconnect power. Signal values within the clusters tend to be more highly correlated. Data path should be partitioned into approximately equal size. In the DSP area, data paths tens to occupy far more area than the control paths. Wiring is still one of the domain area consumers The method used to identify clusters is based on the eigenvalues and eigenvectors of the Laplacian of the graph. The eigen vector corresponding to the second smallest eigen value provides a 1-D placement of the nodes which minimizes the mean-squared connection length. SungKyunKwan Univ. VADA Lab. 27 Spectral Partitioning in VLSI placement SungKyunKwan Univ. VADA Lab. 28 Spectral Partitioning in VLSI placement • Setting the derivative of the Lagrangian, L, to zero gives: (Q I ) x 0 • • The solution to the above equation are those is the eigenvalue and x is the corresponding eigenvector. The smallest eigenvalue 0 gives a trivial solution with all nodes at the same point. The eigenvector corresponding to the second smallest eigenvalue minimizes the cost function while giving a non-trivial solution SungKyunKwan Univ. VADA Lab. 29 Key Ideas in Spectral Partitioning SungKyunKwan Univ. VADA Lab. 30 Spectral Partitioning SungKyunKwan Univ. VADA Lab. 31 Spectral Partitioning The following theorem state some important facts about In(G) and L(G). It introduces us to the idea that the eigenvalues and eigen vectors of L(G) are related to the connectivity of G. Theorem 1. Given a graph G, its associated matrices In(G) and L(G) have the following properties. 1.L(G) is a symmetric matrix. This means the eigenvalues of L(G) are real, and its eigenvectors are real and orthogonal. 2.Let e=[1,...,1]', where ' means transpose, i.e. the column vector of all ones. Then L(G)*e = 0. 3.In(G)*(In(G))' = L(G). This is independent of the signs chosen in each column of In(G). 4.Suppose L(G)*v = lambda*v, where v is nonzero. Then SungKyunKwan Univ. norm(In(G)'*v)2 lambda = -----------------norm(v)2 where norm(z)2 = sumi z(i)2 = sum{all edges e=(i,j)} (v(i)- v(j))2 ---------------------------------sumi v(i)2 5. The eigenvalues of L(G) are nonnegative: 0 <= lambda1 <= lambda2 <= ... <= lambdan 6.The number of of connected components of G is equal to the number of lambdai) equal to 0. In particular, lambda2 != 0 if and only if G is connected. VADA Lab. 32 Spectral Partitioning Compute the eigenvector v2 corresponding to lambda2 of L(G) for each node n of G if v2(n) < 0 put node n in partition Nelse put node n in partition N+ endif endfor First we show that this partition is at least reasonable, because it tends to give connected components N- and N+: Theorem 2. (M. Fiedler, "A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory", Czech.Math. J. 25:619--637, 1975.) Let G be connected, and N- and N+ be defined by the above algorithm. Then N- is connected. If no v2(n) = 0, N+ is also connected. SungKyunKwan Univ. There are a number of reasons lambda2 is called the algebraic connectivity. Here is another. Theorem 3. (Fiedler). Let G=(N,E) be a graph, and G1=(N,E1) a subgraph, i.e. with the same nodes and subset of the edges, so that G1 is "less connected" than G. Then lambda2(L(G1)) <= lambda2(L(G)), i.e. the algebraic connectivity of G1 is also less than or equal to the algebraic connectivity of G. Motivation for spectral bisection, by analogy with a vibrating string How does a taut string vibrate when it is plucked? From our background in either physics or music, we know that it has certain modes of vibration or harmonics. If we were to take snapshots of these modes, they would look like this: VADA Lab. 33 Spectral Partitioning SungKyunKwan Univ. VADA Lab. 34 Multilevel Kernighan-Lin Given a matching, Gc is computed as follows. We let there be a node r in Nc for each edge in Gc is computed in step (1) of Recursive_partition as follows. We define a Em. Then we construct Ec as follows: matching of a graph G=(N,E) as a subset Em of the edges. E with the property that no for r = 1 to |Em| ... for each node in Nc let (i,j) be the edge in Em corresponding to two edges in Em share an endpoint. A node r maximal matching is one to which no more for each other edge e=(i,k) in E incident on i edges can be added and remain a matching. let ek be the edge in Em incident on k, and We can compute a maximal matching by a let rk be the corresponding node in Nc simple random algorithm: add the edge (r,rk) to Ec end for let Em be empty for each other edge e=(j,k) in E incident on j mark all nodes in N as unmatched let ek be the edge in Em incident on k, and for i = 1 to |N| ... visit the nodes in a let rk be the corresponding node in Nc random order add the edge (r,rk) to Ec if node i has not been matched, end for choose an edge e=(i,j) where j is also unmatched, end for and add it to Em if there are multiple edges between pairs of mark i and j as matched nodes of Nc, collapse them into single edges end if end for SungKyunKwan Univ. VADA Lab. 35 Multilevel Kernighan-Lin Note that we can take node weights into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes I and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the large weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS, and is illustrated on the right. SungKyunKwan Univ. VADA Lab. 36 Multilevel Kernighan-Lin Given a partition (Nc+,Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+,N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. This is again shown below: Finally, in step (4) of Recurive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin. SungKyunKwan Univ. VADA Lab. 37 Multilevel Spectral Partitioning Now we turn to the divide-and-conquer algorithm of Barnard and Simon, which is based on spectral partitioning rather than Kernighan-Lin. The expensive part of spectral bisection is finding the eigenvector v2, which requires a possibly large number of matrix-vector multiplications with the Laplacian matrix L(G) of the graph G. The divide-and-conquer approach of Recursive_partition will dramatically decrease the cost. Barnard and Simon perform step (1) of Recursive_partition, computing Gc = (Nc,Ec) from G=(N,E), slightly differently than above: They find a maximal independent subset Nc of N. This means that N contains Nc and E contains Ec, no nodes in Nc are directly connected by edges in E (independence), and Nc is as large as possible (maximality). SungKyunKwan Univ. There is a simple "greedy" algorithm for finding an Nc: Nc = empty set for i = 1 to |N| if node i is not adjacent to any node already in Nc add i to Nc end if end for This is shown below in the case where G is simply a chain of 9 nodes with nearest neighbor connections, in which case Nc consists simply of every other node of N. VADA Lab. 38 hMETIS • • • hMETIS is a set of programs for partitioning hypergraphs such as those corresponding to VLSI circuits. The algorithms implemented by hMETIS are based on the multilevel hypergraph partitioning scheme described in [KAKS97]. hMETIS produces bisections that cut 10% to 300% fewer hyperedges than those cut by other popular algorithms such as PARABOLI, PROP, and CLIPPROP, especially for circuits with over 100,000 cells, and circuits with nonunit cell areaIt is extremely fast!A single run of hMETIS is faster than a single run of simpler schemes such as FM, KL, or CLIP. Furthermore, because of its very good average cut characteristics, it produces high quality partitionings in significantly fewer runs. It can bisect circuits with over 100,000 vertices in a couple of minutes on Pentium-class workstations. The performance of hMETIS on the new ISPD98 benchmark suite can be found in the paper by Chuck Alpert. http://www.users.cs.umn.edu/~karypis/metis/metis.html SungKyunKwan Univ. VADA Lab. 39 How good is Recursive Bisection? • • Horst D. Simon and Shang-Hua Teng , Report RNR-93-012, August 1993 The most commonly used p-way partitioning method is recursive bisection. It first "optimally" divides the graph (mesh) into two equal sized pieces and then recursively divides the two pieces.We show that,due to the greedy nature and the lack of global information,recursive bisection, in the worst case,may produce a partition that is very far from the optimal one. Our negative result is complemented by two positive ones.First, we show that for some important classes of graphs that occur in practical applications,such as well shaped finite element and finite difference meshes,recursive bisection is normally within a constant factor of the optimal one. Secondly,we show that if the balanced condition is relaxed so that each block in the partition is bounded by (1+e)n/p,then there exists a approximately balanced recursive partitioning scheme that finds a partition whose cost is within an 0(log p) factor of the cost of the optimal p-way partition. SungKyunKwan Univ. VADA Lab. 40 Partitioning Algorithm with Multiple Constraints 1998. 5. 19 조준동 SungKyunKwan Univ. VADA Lab. 41 Partitioning with pin and area constraints 회로가 그래프 G(V,E)로 표현될 때, V는 n개의 노드를 갖는 전체 노드의 집합으로 V = v_1 , v_2 , …, v_n 이며 각 노드는 면적 a_i를 갖는다. 간선 e_ij는 노드 v_i와 v_j를 연결한다. E는 전체 노드간의 간선들의 집합이다. 그래프 분할은 전체 노드의 집합을 서로 겹치지 않는 k 개 의 블록 V1,V2 ... ,Vk으로 나누는 것이다. 이때 각 블럭들은 각각의 면적 A1, A2, ... ,Ak 및 각 각의 블록의 핀 개수인 P1,P2, …, Pk를 가지고 있다. 각각의 블럭은 면적과 핀을 비롯한 여러 가지 제약조건들을 가지고 있다. 각 블록이 가질 수 있는 최대 면적은 A_upper이고 최소 면 적은 A_lower, 최대 핀의 개수는 P_upper이다. 또 C_ij는 블록 Vi와 Vj사이를 연결하는 간선 들의 가중치의 합이다. 분할 결과는 이러한 제약조건들을 만족시키면서 각 블록들간을 연결 하는 간선의 가중치가 적어지도록 만드는 것이다. k개의 부그래프의 집합을 K라고 할 때, 분 할은 제약조건들을 만족시키며 다음의 목적함수를 최소화시키는 최적의 매핑 Γ:V-> K 를 찾 는 것이다. k k W Cij (i j ) i 1 j 1 Alower Ai Aupper, Pi Pupper,1 i k SungKyunKwan Univ. VADA Lab. 42 스위칭에 의한 충전과 방전 • 전체 전력소모의 최대 90%까지 차지 Vdd PMOS pull-up network charge discharge NMOS pull-up network CL short circuit + leakage SungKyunKwan Univ. VADA Lab. 43 저전력을 위한 분할 • 기존의 방법 : cut을 지나가는 간선의 수 • 저전력 : 간선의 스위칭 동작의 수 0.25 0.75 0.25 0.75 0.25 ( a ) cut ÀÇ ¼ö·Î ÀÚ¸§ SungKyunKwan Univ. 0.25 ( b ) ½ºÀ§Äª µ¿ÀÛÀÇ ¼ö·Î ÀÚ¸§ VADA Lab. 44 최소비용흐름 알고리즘 • 주어진 양을 가장 적은 비용으로 원하는 목적지까지 보낼수 있는 방법 – 각 통로는 용량과 비용을 가짐 • Max-flow min-cut : 간선의 수만 고려 • Min-Cost flow : 간선마다 스위칭 동작의 가중치를 부여 – 비용 : 스위칭 동작 vs. 간선의 수 – 용량 : 간선에 흐를 수 있는 최대양 • 비용이 적을수록 선택되도록 큰 용량 Wi Si (1 )Ci SungKyunKwan Univ. VADA Lab. 45 Network and Mincost Flow 15 / 30 45 / 55 100 / 10 20 / 10 23 / 11 10 / 100 30 / 24 100 / 10 7 / 80 10 / 100 45 / 55 1/ 5 10 / 35 23 / 11 15 / 30 6 / 100 3/5 1 / 10 10 / 35 100 / 10 SungKyunKwan Univ. VADA Lab. 46 그래프 변환 알고리즘 • Min-Cost Flow 경로를 찾음 • Cut 을 찾기 위해서 그래프의 변환이 필요 • 레벨에 따른 topological 정렬 Level 5 Level 4 Level 3 Level 2 Level 1 SungKyunKwan Univ. VADA Lab. 47 그래프 변환 알고리즘 • 추가된 노드 및 간선 Level ( i+1 ) Sink Source Level ( i ) »õ·Î »ý¼ºµÈ °£¼± ±âÁ¸ÀÇ SungKyunKwan Univ. °£¼± »õ·Î »ý¼ºµÈ ³ëµå ±âÁ¸ÀÇ ³ëµå VADA Lab. 48 그래프 변환 Level 5 Level 4 S Level 3 T sink Source Level 2 Level 1 SungKyunKwan Univ. VADA Lab. 49 Algorithm Input: Flow f, Network Output: Partition the network into f subnetworks 단계 1: 그래프에 Flow 를 push하여 최소비용흐름 알고리즘 수행; 만약 각각의 partition에 대하여 A_upper 또는 P_upper를 만족하면 마침; 그렇지않으면 f = f+1; 증가시키고 upper bound를 만족할 때까지 단계 1을 반복한다. 단계 2: 만약 A_lower 또는 P_lower를 만족하지 않는두개의 partition p, q 가 있고 Alower Ap Aq Aupper Plower Pp Pq Pupper 라면 p와 q는 merge가 가능하고 모든 가능한{p,q} set에 대하여 최소비용매칭을 적용하 여 분할된 partition의 개수를 줄임. SungKyunKwan Univ. VADA Lab. 50 참고문헌 [1] J.D.Cho and P.D.Franzon, "High-Performance Design Automation for Multi-Chip Modules and Packages", World Scientific Pub. Co. 1996 [2] H.J.M.Veendrick, "Short-Circuit Dessipation of Static CMOS Circuitry and its Impact on the Design of Buffer Circuits" IEEE JSSCC, pp.468-473, August, 1984 [3] H.B.Bakoglu, "Circuits, Interconnections and Packaging for VLSI", pp.81-112, Addison-Wesley Publishing Co., 1990 [4] K.M.hall. "An r-dimensional quadratic placement algorithm", Management Sci., vol.17, pp.219-229, Nov, 1970 [5] Cadence Design Systems. "A Vision for Multi-Chip Module design in the nineties", Tech. Rep. Cadence Design Systems Inc., Santa Clara, CA, 1993 [6] R.Raghavan, J.Cohoon, and S.Shani. "Single Bend Wiring", Journal of Algorithms, 7(2):232-257, June, 1986 [7] Kernighan, B.W. and S.lin. "An efficient heuristic procedure to partition graphs" Bell System Technical Journal, 492:291-307, Feb. 1970 [8] Wei, Y.C. and C.K.Cheng "Ratio-Cut Partitioning for Hierachical Designs", IEEE Trans. on ComputerAided Design. 40(7):911-921, 1991 [9] S.W.Hadley, B.L.Mark, and A.Vanelli, "An Efficient Eigenvector Approach for Finding Netlist Partitions", IEEE Trans. on Computer-Aided Design, vol. CAD-11, pp.85-892, July, 1992 [10] L.R.Fold, Jr. and D.R.Fulkerson. "Flows in Networks", Princeton University Press, Princeton, NJ, 1962 [11] Liu H. and D.F.Wong, "Network Flow Based Multi-Way Partitioning With Area and Pin Constraints", IEEE/ACM Symposium on Physical Design, pp. 12-17, 1997 [12] Kirkpatrick, S. Jr., C.Gelatt, and M.Vecchi. "Optimization by simulated annealing", Science, 220(4598):498-516, May, 1983 [13] Pedram, M. "Power Minimization in IC Design: Principles and Applications," ACM Trans. on Design Automation of Electronics Systems, 1(1), Jan. pp. 3-56, 1996. [14] A.H.Farrahi and M.Sarrafzadeh. "FPGA Technology Mapping for Power Minimizatioin", In International Workshop on Field-Programmable Logic and Applications, pp66-77, Sep. 1994 [15] M.A.Breur, "Min-Cut Placement", J.Design Automation and Fault-Tolerant Computing, pp.343-382, Oct. 1977 SungKyunKwan Univ. VADA Lab. 51 [16] M.Hanan and M.J.Kutrzberg. A Review of the Placement and the Quadratic Assignment Problem, Apr. 1072. [17] N.R.Quinn, "The Placement Problem as Viewed from the Physics of Classical Mechanics", Proc. of the 12th Design Automation Conference, pp.173-178, 1975 [18] C.Sehen, and A.Sangiovanni-Vincentelli, "The Timber Wolf placement and routing package", IEEE Journal of Solid-State Circuits, Sc-20, pp.501-522, 1985 [19] K.Shahookar, and P.Mazumder, "A Genetic Approach to Standard Cell Placement", First European Design Automation Conference, Mar. 1990 [20] J.D.Cho, S.Raje, M.Sarrafzadeh, M.Sriram, and S.M.Kang, "Crosstalk Minimum Layer Assignment", In Proc. IEEE Custom Integr. Circuits Conf., San Diego, CA, pp.29.7.1-29.7.4, 1993 [21] J.M.Ho, M.Sarrafzadeh, G,Vijayan, and C.K.Wong. "Layer Assignment for Multi-Chip Modules", IEEE Trans. on Computer-Aided Design, CAD-9(12):1272-1277, Dec., 1991 [22] G.Devaraj. "Distributed placement and crosstalk driven router for multichip modules", In MS Thesis, Univ. of Cincinnati, 1994 [23] J.D.Cho. "Min-Cost Flow based Minimum-Cost Rectilinear Steiner Distance-Preserving Tree", International Symposium on Physical Desigh, pp-82-87, 1997 [24] A.Vitttal and M.Marek-Sadowska. "Minimal Delay Interconnection Design using Alphabetic Trees", In Design Automation Conference, pp.392-396, 1994 [25] M.C.Golumbic. "Algorithmic Graph Theory and Perfect Graph", pp.80-103, New York : Academic. 1980 [26] R.Vemuri. "Genetic Algorithms for partitioning, placement, and layer assignment for multichip modules", Ph.D. Thesis, Univ. of Cincinnati, 1994 [27] J.L.Kennington and R.V.Helgason, "Algorithms for Network Programmin", John Wiley, 1980 [28] J.Y.Cho and J.D.Cho "Improving Performance and Routability Estimation in MCM Placement", In InterPack'97, Hawaii, June, 1997 [29] J.Y.Cho and J.D.Cho "Partitioning for Low Power Using Min-Cost Flow Algorithm", submitted to 한국반도체학술대회, Feb, 1998 SungKyunKwan Univ. VADA Lab. 52