Local Unidirectional Bias for Smooth Cutsize-delay Tradeoff in Performance-driven Partitioning
Andrew B. Kahng and Xu Xu
UCSD CSE and ECE Depts.
Work supported in part by MARCO GSRC

Outline
• Motivation
• Performance driven bipartition problem
• New bipartitioning algorithm
• Experimental results
• Conclusion and future work

Partitioning and Performance
• The goal of traditional hypergraph partitioning is to minimize cutsize.
• To meet the performance requirements of current designs, we need a performance-driven partitioner, which considers both cutsize and delay.

Previous Work (I)
• [Cong et al. ISPD-2002]
  – Global clustering based algorithm with retiming
  – [Figure labels: min-delay clustering w/ retiming; min-cutsize clustering; de-clustering and refinement]
  – Reduces delay by 16% while increasing cutsize by 17%
  – Requires substantial gate replication

Previous Work (II)
• [Ababei et al. ICCAD-2002]
  – Reweighting based method
  – [Figure labels: input; global timing analysis; find critical paths; reweighting (path based or net based); cutsize oriented partitioner, such as hMetis or MLPart]
  – 14% reduction of delay with 10% increase in cutsize
  – 139% increase in runtime compared with hMetis

Motivating Questions
• Can we avoid global timing analysis?
  – Global timing analysis is extremely time-consuming
• Can we improve path delay without significantly degrading cutsize?
  – Need a smooth tradeoff between delay and cutsize
• Can we reduce implementation overheads?
  – Previous methods store thousands of critical paths and continuously update them

Delay Model
Delay = hop_delay + node_delay
[Figure: a path crossing the cut between Part 0 and Part 1; labels: hop, FF nodes, combinational nodes, cut]
Delay models used: [Cong et al. ISPD-2002] [Ababei et al.
ICCAD-2002]
• [Cong et al. ISPD-2002]: hop_delay = 5, node_delay = 1; example with 3 hops and 5 nodes: Delay = 3×5 + 5×1 = 20
• [Ababei et al. ICCAD-2002]: hop_delay = Elmore delay, node_delay = constant

Performance Driven Bipartition Problem
Given:
• Hypergraph H=(V,E)
• Area balance tolerance s (0<s<1), a parameter to control allowable slack in the area constraint
• a, a given parameter which captures the tradeoff between cutsize and path delay (hopcount)
Find: a bipartition (V0|V1) which satisfies the area balance constraint and minimizes
a(cutsize) + (1−a)(Max_hopcount)

Unidirectional Partition
• Path delay is minimized, with hopcount = 1, if the partition is unidirectional (“acyclic”), that is, all cuts are in the same direction
• Problems:
  – High cutsize
  – A unidirectional solution may not exist
• Can we achieve a “locally unidirectional” partition?
[Figure: two partitions between Part 0 and Part 1, one with max hopcount = 5 and one with max hopcount = 3]

V-Shaped Nodes
A combinational node v is a V-shaped node if there exist vj, vt in the other part and a path from vj to vt that includes only v.
[Figure: v in one part, vj and vt in the other part]

V-Shaped Nodes in Critical Paths
Empirical observations from study of partitioning solutions:
• There are V-shaped nodes in the partitioning solutions
• Every V-shaped node is included in many critical paths
• Every critical path contains several V-shaped nodes
For testcase 1:
• Number of nets: 16377
• Number of critical paths: 26772
• On average, one critical path contains 27.6 nodes
• On average, one critical path contains 3.4 V-nodes
• On average, one V-node belongs to 233.7 critical paths

Key Idea: V-Shaped Nodes Elimination
Move V-shaped node “b” to reduce path hopcount.
[Figure: nodes a to f across Part 0 and Part 1, before and after moving b]
  Path   Hopcount before move   Hopcount after move
  abc    2                      0
  dbc    1                      1
  ebc    1                      1

Distance-k V-Shaped Nodes Elimination
Distance-k V-shaped Nodes (Vk Nodes): Paths of k combinational nodes with
neighbors in the other part.
[Figure: nodes a, b, c, d across Part 0 and Part 1, before and after moving b and c]
• k = 2: moving the V2 nodes “b, c” reduces path hopcount from 2 to 0
• Problem with large k: cutsize may be greatly increased

New Gain Function
Gain(v) = δ(0)·g(v) + δ(1)·r1(v)
• g(v): traditional FM gain
• rj(v): reduction of Vj nodes after moving v
[Figure: node v before and after the move]

Distance-k Unidirectional Algorithm
  Calculate initial gains for all nodes and store the gains
  Select the node v with maximum gain
  /* CLIP-like method: move the cluster that v belongs to */
  Reset the gains of all nodes to zero
  Move v and update the gains of v and its neighbors
  While (some node has not been moved)
    Select one node v with the maximum updated gain
    Move v and update the related gains
  Find the point in the move sequence at which the sum of gains is maximum; undo all moves after this point

Experimental Setup
• Four industry testcases obtained as LEF/DEF
• Model of Ababei et al. (ICCAD-2002) used to calculate delay
• Partitioning solutions compared to results of MLPart
  – strongest multilevel netlist partitioning code
  – website: http://nexus6.cs.ucla.edu/GSRC/bookshelf/Slots/Partitioning/MLPart
• All tests on 600MHz Intel Pentium-III Xeon

Biasing against V1 Nodes vs. MLPart
δ(0)=1, δ(1)=10
            MLPart                            MLPart + V-shaped nodes removal
  Testcase  cutsize  h    delay  time(s)      cutsize  h    delay  time(s)
  1         820.7    5.3  352.8  11.79        856.1    3.3  266.8  12.58
  2         169.9    3.5  220.7  13.45        189.8    2.5  211.2  15.32
  3         141.3    3    291.6  16.67        152.3    2.3  283.6  18.27
  4         408.7    5.3  302.6  12.43        421.2    3.6  252.7  14.03
• Reduction of delay: 4.5%-24.4%, average: 15.1%
• Increase of cutsize: 3.0%-10.0%, average: 4.9%
• Increase of runtime: 6.3%-11.4%, average: 9.7%
Using the delay model in Cong et al. ISPD-2002:
• Reduction of delay: 4.3%-21.2%, average: 14.7%

Biasing against V2 Nodes vs.
MLPart
δ(0)=1, δ(1)=30, δ(2)=3
            MLPart                            MLPart + Vk=2 nodes removal
  Testcase  cutsize  h    delay  time(s)      cutsize  h    delay  time(s)
  1         820.7    5.3  352.8  11.79        847.5    3    262.1  13.16
  2         169.9    3.5  220.7  13.45        183.2    2    202.5  15.67
  3         141.3    3    291.6  16.67        149.2    2    275.6  18.92
  4         408.7    5.3  302.6  12.43        416.7    3.4  243.5  14.79
• Reduction of delay: 8.9%-30.0%, average: 18.7%
• Increase of cutsize: 3.1%-7.2%, average: 3.5%
• Increase of runtime: 11.9%-15.9%, average: 13.1%
Using the delay model in Cong et al. ISPD-2002:
• Reduction of delay: 8.3%-28.7%, average: 17.3%

Conclusions
• Simple yet efficient timing-driven partitioning that does not require global timing analysis
• Negligible implementation and runtime overhead
• Significantly reduces path delay, with cutsize and runtime almost the same as leading-edge MLPart
• Similar improvements observed with different path delay metrics
• Future work
  – Impact of new partitioner on placement
  – Efficient methods for biasing δ(k), k>2

Thank you!

Why Performance Driven Partitioning?
• Achieving timing closure becomes increasingly difficult in deep-submicron technologies due to non-ideal scaling of interconnect delay
• Routing alone can no longer solve the timing problem, even with aggressive optimizations (buffer insertion, buffer/wire sizing, …)
• Timing needs to be addressed at all design stages
• Partitioning is a critical step in defining interconnect timing properties, but is traditionally driven by the cutsize objective

Previous Work (I)
• With logic replication
  – Retiming
  – Replication graph
• Without logic replication
  – Net based reweighting
  – Path based reweighting

FM Partitioning and Gain Function
• Start with a random partition
• Move the node with the max gain and lock it
• Keep moving until all nodes are locked
• Find the best point in the move sequence
Gain(v) = reduction of cutsize after moving v
[Figure: node v before and after a move between Part 0 and Part 1, with Gain(v) = -1]

Procedure to Calculate rj(v)
  Delete all FF nodes and their related edges
  In the remaining graph, BFS from v
  For each level j from 1 to k
    If v is a Vj node before moving, rj’ = 1 (otherwise 0)
    If v is a Vj node after moving, rj’’ = 1 (otherwise 0)
    rj = rj’’ - rj’

CLIP Algorithm
Reminiscent of CLIP (Deng et al. DAC 1996) in how it induces movement of clusters across the cutline.
[Figure: CLIP move of node v]

Distance-k V-Shaped Nodes
k combinational nodes vi,1 … vi,k are distance-k V-shaped nodes (Vk-nodes) if:
• vi,1 … vi,k are in the same part
• there exist vj, vt in the other part
• there is a path from vj to vt that passes only through vi,1 … vi,k
[Figure: vj, vt and vi,1 … vi,k across Part 0 and Part 1]

Notation
• H(V,E) = circuit hypergraph
• V = set of nodes representing components of the circuit
• E = set of signal nets
• A bipartition (V0|V1) of H(V,E) divides V into two disjoint subsets s.t. V = V0 ∪ V1, which are called Part 0 and Part 1
• A = the total area of all the nodes in V
• A0 = the area of all the nodes in V0
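
Worked sketch

The notation above, together with the V-shaped node definition and the biased gain, can be tried on a toy netlist. What follows is a minimal Python sketch, not the authors' implementation. The assumptions are: each net is stored as a (driver, sinks) pair; the five-node netlist only loosely follows the "Key Idea" figure, whose exact connectivity is not recoverable; b is the only combinational node; and r1(v) is computed by brute force as the number of V1 nodes before the move minus the number after, which is one literal reading of "reduction of Vj nodes" and may differ in sign convention from the BFS procedure in the deck. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Assumption: a net is (driver, sinks); direction matters for V-shaped nodes.
Net = Tuple[str, List[str]]


def cutsize(nets: Sequence[Net], part: Dict[str, int]) -> int:
    """Number of nets that have pins in both Part 0 and Part 1."""
    return sum(
        1 for drv, sinks in nets
        if len({part[drv], *(part[s] for s in sinks)}) > 1
    )


def path_hopcount(path: Sequence[str], part: Dict[str, int]) -> int:
    """Number of times a path crosses the cut."""
    return sum(1 for u, w in zip(path, path[1:]) if part[u] != part[w])


def is_v_shaped(v: str, nets: Sequence[Net], part: Dict[str, int],
                combinational: Set[str]) -> bool:
    """Distance-1 check: v is combinational and has both a fanin and a
    fanout neighbor in the other part."""
    if v not in combinational:
        return False
    other = 1 - part[v]
    has_pred = any(part[drv] == other for drv, sinks in nets if v in sinks)
    has_succ = any(part[s] == other
                   for drv, sinks in nets if drv == v for s in sinks)
    return has_pred and has_succ


def count_v_nodes(nets: Sequence[Net], part: Dict[str, int],
                  combinational: Set[str]) -> int:
    return sum(is_v_shaped(v, nets, part, combinational) for v in part)


def biased_gain(v: str, nets: Sequence[Net], part: Dict[str, int],
                combinational: Set[str],
                delta0: float = 1.0, delta1: float = 10.0) -> float:
    """Gain(v) = delta0 * g(v) + delta1 * r1(v), with g(v) the FM cutsize
    reduction and r1(v) taken here as V1-node count before minus after."""
    moved = dict(part)
    moved[v] = 1 - part[v]
    g = cutsize(nets, part) - cutsize(nets, moved)
    r1 = (count_v_nodes(nets, part, combinational)
          - count_v_nodes(nets, moved, combinational))
    return delta0 * g + delta1 * r1


# Hypothetical toy netlist: a, d, e drive b; b drives c.
nets: List[Net] = [("a", ["b"]), ("d", ["b"]), ("e", ["b"]), ("b", ["c"])]
part: Dict[str, int] = {"a": 0, "c": 0, "b": 1, "d": 1, "e": 1}
combinational: Set[str] = {"b"}

if __name__ == "__main__":
    print(cutsize(nets, part))                          # 2
    print(path_hopcount(["a", "b", "c"], part))         # 2
    print(biased_gain("b", nets, part, combinational))  # 10.0
```

In this toy case the traditional FM gain of b is 0, since the cutsize stays at 2 after the move, so a cutsize-only partitioner has no reason to move it. The δ(1) term is what makes the move attractive, and it brings the hopcount of path abc from 2 down to 0.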