Architecture and Details of
a High Quality, Large-Scale
Analytical Placer
Andrew B. Kahng, Sherief Reda and Qinke Wang
VLSI CAD Lab
University of California, San Diego
http://vlsicad.ucsd.edu/
Work partially supported by the MARCO Gigascale Systems Research Center.
ABK is currently with Blaze DFM, Inc., Sunnyvale, CA.
Outline
• History of APlace
• From APlace1.0 to APlace2.0
• Anatomy of APlace2.0
• New techniques in APlace2.0
• Experimental Results
• Conclusions and Future Work
History of APlace
• Research to study Synopsys patent
– Naylor et al., US Patent 6,301,693 (2001)
• Extensible foundation: APlace1.0
– Timing-driven placement
– Mixed-size placement
– Area-I/O placement
• ISPD-2005 placement contest → APlace2.0
– Many parts of APlace rewritten
– Superior performance
APlace Problem Formulation
• Constrained Nonlinear Optimization:
Divide the layout area into uniform bins, and
seek to minimize HPWL etc. so that total cell
area in every bin is equalized
– $D_g(x, y)$ : density function that equals the
total cell area in a global bin $g$
– $D$ : average cell area over all global bins
Nonlinear Optimization
• Smooth approximation of placement
objectives: wirelength, density function, etc.
• Quadratic Penalty method
– Solve a sequence of unconstrained minimization
problems for a sequence of µ → 0
• Conjugate Gradient (CG) solver
– Useful for finding an unconstrained minimum
of a high-dimensional function
– Adaptable to large-scale placement problems:
memory requirement is linear in problem size
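The quadratic penalty method named on this slide can be written out in its textbook form. This is a sketch under assumptions: WL stands for the smoothed wirelength, D_g and D are the bin density and its target from the problem formulation, and the 1/(2µ) scaling is the standard convention rather than something stated on the slide.

```latex
\min_{x,y}\; WL(x,y)
\quad \text{s.t.} \quad D_g(x,y) = D \;\; \text{for every global bin } g

\;\Longrightarrow\;

\min_{x,y}\; WL(x,y) + \frac{1}{2\mu} \sum_{g} \bigl( D_g(x,y) - D \bigr)^{2},
\qquad \mu \to 0
```

Each value of µ gives one unconstrained problem for the CG solver; shrinking µ makes density violations progressively more expensive.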
Wirelength Approximation
• Half-Perimeter Wirelength (HPWL)
– Half-perimeter of net’s bounding box
– Simple, close measure of routing congestion
– Not strictly convex, or everywhere differentiable
• Log-Sum-Exp approximation
– Naylor et al., US Patent 6,301,693 (2001)
– Precise, closer to HPWL when α → 0
– Strictly convex, continuously differentiable
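A minimal sketch of the log-sum-exp approximation next to exact HPWL, assuming the usual form α·(ln Σ e^(x/α) + ln Σ e^(−x/α)) per coordinate. The function names and the three-pin net are illustrative only and are not taken from APlace.

```python
import math

def hpwl_1d(coords):
    """Exact half-perimeter contribution of one coordinate axis."""
    return max(coords) - min(coords)

def lse_wirelength_1d(coords, alpha):
    """alpha * (ln sum exp(x/alpha) + ln sum exp(-x/alpha)), computed stably."""
    hi, lo = max(coords), min(coords)
    # Factor out the extreme coordinate so every exponent is <= 0 (no overflow).
    upper = hi + alpha * math.log(sum(math.exp((x - hi) / alpha) for x in coords))
    lower = -lo + alpha * math.log(sum(math.exp((lo - x) / alpha) for x in coords))
    return upper + lower

def lse_wirelength(pins, alpha):
    """Smoothed wirelength of one net given its (x, y) pin positions."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return lse_wirelength_1d(xs, alpha) + lse_wirelength_1d(ys, alpha)

pins = [(0.0, 0.0), (4.0, 1.0), (2.0, 5.0)]
exact = hpwl_1d([p[0] for p in pins]) + hpwl_1d([p[1] for p in pins])
for alpha in (2.0, 0.5, 0.1):
    # The smoothed value overestimates HPWL and tightens as alpha shrinks.
    print(alpha, lse_wirelength(pins, alpha), exact)
```

Subtracting the extreme coordinate before exponentiating is a numerical-stability choice for the sketch; it does not change the value of the function.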
α : Smoothing Parameter
• “Significance criterion” for choosing nets with
large wirelength to minimize
– Larger gradients for longer nets
– Minimize long nets more efficiently than short nets
• Two-pin net
• Partial gradient for x1
– close to 0 when net length |x1 - x2| is small compared to α
– close to 1 or -1 otherwise
[Plot: partial gradient for x1, ranging from -1 to 1, versus (x1 - x2) / α, ranging from -10 to 10]
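The behaviour described for the two-pin net can be checked with a short derivation. It assumes the log-sum-exp form from the Wirelength Approximation slide, restricted to the x coordinate; the closed form is derived here and is not quoted from the slides.

```latex
WL(x_1, x_2) = \alpha \ln\!\left(e^{x_1/\alpha} + e^{x_2/\alpha}\right)
             + \alpha \ln\!\left(e^{-x_1/\alpha} + e^{-x_2/\alpha}\right)

\frac{\partial WL}{\partial x_1}
  = \frac{1}{1 + e^{-(x_1 - x_2)/\alpha}} - \frac{1}{1 + e^{(x_1 - x_2)/\alpha}}
  = \tanh\!\left(\frac{x_1 - x_2}{2\alpha}\right)
```

The hyperbolic tangent is near 0 when |x1 - x2| is small relative to α and saturates at ±1 otherwise, which is the S-shaped curve the plot shows.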
Area Potential Function
• Overlap area =
– overlap along the x and y directions
– 0/1 function with cell size ignored
• Area potential function: defines an “area
potential” exerted by a cell to nearby grids
– smooth bell-shaped function for standard cells
[Naylor et al., US Patent 6,301,693 (2001)]
Module Area Potential Function
• Mixed-size placement: decide scope of area
potential based on module's dimension
• p(d) : potential function
– d : distance from module to grid
– radius r = w/2 + 2wg for block with width w
– convex curve 1 - a*d^2 for d < w/2 + wg
– concave curve b*(r - d)^2 for w/2 + wg < d < w/2 + 2wg
– smooth at d = w/2 + wg
[Figure: bell-shaped p(d) versus d, falling from 1 at d = 0 to 0 at d = ±(w/2 + 2wg), with the two curves joining at d = w/2 + wg]
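A minimal sketch of the piecewise potential, taking wg to be the grid width (an assumption about the slide's notation). The coefficients a and b are not given on the slide; here they are derived by requiring equal value and equal slope where the two quadratic pieces meet at d = w/2 + wg. The function name is hypothetical.

```python
def bell_potential(d, w, wg):
    """Potential at distance d from the centre of a block of width w."""
    d = abs(d)
    inner = w / 2 + wg       # switch point between the two quadratic pieces
    r = w / 2 + 2 * wg       # radius of influence
    # a and b follow from matching value and slope at d = inner (derived, not quoted).
    a = 4.0 / ((w + 2 * wg) * (w + 4 * wg))
    b = 2.0 / (wg * (w + 4 * wg))
    if d <= inner:
        return 1.0 - a * d * d
    if d <= r:
        return b * (r - d) ** 2
    return 0.0

# For w = 4 and wg = 1 the pieces meet at d = 3 and the potential vanishes at d = 4.
for d in (0.0, 1.5, 3.0, 3.5, 4.0, 5.0):
    print(d, bell_potential(d, 4.0, 1.0))
```

Because value and slope both match at the switch point, the gradient of the density term stays continuous, which is what a conjugate gradient solver needs.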
Changes: APlace1.0 → APlace2.0
• Strong scalability from new clustering algorithm
• Dynamic adjustment of weights for wirelength and
overlap penalty during global placement
• Improvements to legalization, detailed placement
– whitespace compaction
– cell reordering algorithms
– global greedy cell movement
• APlace2.0 vs. APlace1.0: up to 19% WL reduction,
1.5-2x speedup
IBM BigBlue4 Placement
2.1M instances, HPWL = 833.21, CPU = 23h
Anatomy of APlace 2.0
Global Phase: Clustering → Adaptive APlace engine → Unclustering
Detailed Phase: Legalization → WS arrangement → Cell order polishing → Global moving
New Feature 1: Multi-Level Clustering
Objective: cluster to reduce runtime and allow scalable implementations with no compromise to quality
• Multi-level approach using best-choice clustering (ISPD’05)
• Clustering ratio ≈ 10
• #Top-level clusters ≈ 2000
• Wirelength calculation
– assume modules located at cluster center
– only consider inter-cluster parts of nets
[Flowchart: netlist → reduce netlist size by 10x → size ~ 2000? (no: reduce again; yes: global placement) → flat? (no: uncluster and repeat global placement; yes: Legalization)]
Best-Choice Clustering
• Each clustering level uses the best-choice heuristic with lazy updates and tight area control
For each clustering level:
  • Calculate the clustering score of each node to its neighbors based on the number of connections and areas
  • Sort all nodes based on their best scores using a heap
  • Until target clustering ratio is reached:
    – If top node of heap is “valid” then cluster it with its closest neighbor
    – Else recalculate the top node score and reinsert in heap; continue
    – Calculate the clustering score of the new node and reinsert into the heap
    – Update netlist and mark all neighbors of the new node as invalid
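The loop above can be sketched for a single clustering level as follows. This is a simplified illustration and not the APlace code: it works on a weighted graph rather than a hypergraph netlist, and the score (connection weight divided by combined area), the input format, the `max_area` cap and the function name are all assumptions.

```python
import heapq
import itertools

def best_choice_cluster(adj, area, target_count, max_area=float("inf")):
    """Greedy pairwise clustering with a lazily updated heap.

    adj:  {node: {neighbor: connection weight}}, symmetric (hypothetical format)
    area: {node: area}
    Returns {cluster: [original nodes]} once at most target_count nodes remain.
    """
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}   # private copies
    area = dict(area)
    members = {u: [u] for u in adj}
    valid, heap = {}, []
    tick = itertools.count()   # tie-breaker so the heap never compares node names
    ids = itertools.count()

    def push(u):
        # Score = connections / combined area; skip partners that break the area cap.
        cands = [(w / (area[u] + area[v]), v) for v, w in adj[u].items()
                 if area[u] + area[v] <= max_area]
        if cands:
            score, v = max(cands, key=lambda c: c[0])
            heapq.heappush(heap, (-score, next(tick), u, v))
        valid[u] = True

    for u in adj:
        push(u)

    while len(adj) > target_count and heap:
        _, _, u, v = heapq.heappop(heap)
        if u not in adj:
            continue                 # u was already merged away
        if not valid[u] or v not in adj:
            push(u)                  # lazy update: rescore only at the top of the heap
            continue
        c = ("cluster", next(ids))
        nbrs = {}
        for old in (u, v):
            for n, w in adj.pop(old).items():
                if n == u or n == v:
                    continue         # connection internal to the new cluster
                nbrs[n] = nbrs.get(n, 0) + w
                del adj[n][old]
        for n, w in nbrs.items():
            adj[n][c] = w
            valid[n] = False         # neighbors' scores are now stale
        adj[c] = nbrs
        area[c] = area.pop(u) + area.pop(v)
        members[c] = members.pop(u) + members.pop(v)
        push(c)
    return members

adj = {"a": {"b": 3}, "b": {"a": 3, "c": 1}, "c": {"b": 1, "d": 3}, "d": {"c": 3}}
area = {n: 1.0 for n in adj}
print(best_choice_cluster(adj, area, target_count=2))
```

Marking neighbors invalid instead of rescoring them immediately is the "lazy update": a stale node is only rescored if it ever reaches the top of the heap.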
Two Clustering Concerns
• Mark boundaries of clustering hierarchy at each clustering level
– allow exact reversal of clustering during unclustering
• Meet target number of objects by avoiding “saturation”
– bypass small fixed objects during clustering
[Figure: a cluster bypassing fixed objects]
Multiple Levels of Grids
• Adaptive grid size based on average cluster size
• Better global optimization
– use solution of placement problem constrained with
coarser grids as initial solution for problem constrained
with finer grids
• Better scalability
– larger grid size spreads modules faster
• Different levels of relaxation for density constraints
– According to grid size
New Feature 2: Adaptive WL Weight
• Important to QOR
• Initial weight value
– For each cluster level and grid level
– Based on wirelength and density partial
derivatives
– Goal: Magnitudes of gradients roughly equal
• Decrease WL weight by half whenever CG
solver obtains a stable solution
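A minimal sketch of the weight rule described above, assuming an objective of the form weight × wirelength + density penalty and using the sum of absolute partial derivatives as the "magnitude". Both choices, and the function names, are assumptions; the slide does not give the exact norm.

```python
def initial_wl_weight(wl_grad, density_grad):
    """Weight that makes the two gradient magnitudes roughly equal."""
    wl_mag = sum(abs(g) for g in wl_grad)
    den_mag = sum(abs(g) for g in density_grad)
    # Fall back to 1.0 if the wirelength gradient vanishes (arbitrary choice).
    return den_mag / wl_mag if wl_mag > 0 else 1.0

def weight_schedule(w0, rounds):
    """Yield the wirelength weight used in each round, halving after every stable CG solution."""
    w = w0
    for _ in range(rounds):
        yield w
        w *= 0.5

w0 = initial_wl_weight([1.0, -1.0, 2.0], [4.0, -4.0])
print(w0, list(weight_schedule(w0, 4)))
```

Halving the wirelength weight gradually shifts emphasis toward the density term, which plays the same role as driving µ toward 0 in the penalty formulation.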
New Feature 3: Legalization and
Detailed Placement
Variant of greedy legalization algorithm (Hill’01):
1. Sort all cells from left to right: move each cell in order
to the closest legal position
2. Sort all cells from right to left: move each cell in order
to the closest legal position(s)
3. Pick the better of (1) and (2)
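The three steps can be sketched with a simplified row-frontier model. This is an illustration under assumptions, not Hill's algorithm or the APlace variant: each row keeps a single frontier, "closest" means smallest Manhattan displacement, and "better" means smaller total displacement. All names and the tuple layouts are hypothetical.

```python
def greedy_pass(cells, rows, reverse):
    """cells: [(name, x, y, width)]; rows: [(y, x_min, x_max)].
    Returns (total displacement, {name: (x, y)}), or None if some cell does not fit."""
    # Each row keeps a frontier: the next free x when sweeping in the chosen direction.
    frontier = [x_max if reverse else x_min for (_, x_min, x_max) in rows]
    order = sorted(cells, key=lambda c: c[1] + (c[3] if reverse else 0), reverse=reverse)
    placed, total = {}, 0.0
    for name, x, y, width in order:
        best = None
        for i, (row_y, x_min, x_max) in enumerate(rows):
            if reverse:
                cand = min(frontier[i] - width, x)
                if cand < x_min:
                    continue
            else:
                cand = max(frontier[i], x)
                if cand + width > x_max:
                    continue
            cost = abs(cand - x) + abs(row_y - y)
            if best is None or cost < best[0]:
                best = (cost, i, cand)
        if best is None:
            return None
        cost, i, cand = best
        frontier[i] = cand if reverse else cand + width
        placed[name] = (cand, rows[i][0])
        total += cost
    return total, placed

def legalize(cells, rows):
    """Run the left-to-right and right-to-left passes and keep the cheaper result."""
    passes = (greedy_pass(cells, rows, False), greedy_pass(cells, rows, True))
    results = [r for r in passes if r is not None]
    if not results:
        raise ValueError("no legal placement found by either pass")
    return min(results, key=lambda r: r[0])

rows = [(0, 0, 10)]
cells = [("a", 0, 0, 3), ("b", 1, 0, 3), ("c", 6, 0, 3)]
print(legalize(cells, rows))
```

Running both sweep directions is cheap and guards against cases where one direction pushes cells off the end of a row, as happens to the right-to-left pass in this small example.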
Detailed Placement Components:
• Global cell movement (Goto81, KenningsM98 BoxPlace, FP…)
• Whitespace compaction (KahngTZ’99, KahngMR’04)
• Cell order polishing (similar to rowIroning, FS detailed placer)
• Intra-row cell reordering
• Inter-row cell reordering
Global Moving
• Move cell to “optimal” location among
available whitespace
– improve quality when utilization is low
• Two steps
– search for available location in optimal region of
a cell’s placement
– search for available location in “best” bin
• divide placement area into uniform bins
• choose “best” bin according to available whitespace
and cost of moving cell to bin center
• assume normal distribution of whitespace with width
and estimate if an available location exists
WhiteSpace (WS) Compaction
[Figure: layered graph for one row, running from a start node to an end node; one chain of nodes per cell (cell 1 to cell n), each node a candidate site (sites 1 to 12 shown)]
• Each chain represents the possible placement sites for each cell
• The cost on the arrows is the change in HPWL of the cell move to each site
• The order of chains corresponds to the order of cells from left to right in a row
• A shortest path from source to sink gives the best way to compact WS
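Because the graph is layered, with one layer per cell in row order, the shortest path can be found by a simple dynamic program. The sketch below assumes each cell's cost depends only on its own site, as the arrow costs above suggest; the function name, the `cost(i, s)` callback and the site-unit widths are hypothetical, and at least one cell is assumed.

```python
def compact_row(widths, num_sites, cost):
    """widths[i]: width of cell i in sites (cells keep their left-to-right order).
    cost(i, s): cost of putting the left edge of cell i at site s.
    Returns (total cost, [site per cell]) for the cheapest overlap-free assignment."""
    INF = float("inf")
    n = len(widths)
    best = [[INF] * num_sites for _ in range(n)]   # best[i][s]: cheapest path ending at (i, s)
    prev = [[None] * num_sites for _ in range(n)]  # predecessor site, for path recovery
    for i in range(n):
        run_min, run_arg = INF, None               # cheapest predecessor seen so far
        for s in range(num_sites - widths[i] + 1):
            if i == 0:
                base = 0.0
            else:
                p = s - widths[i - 1]              # right-most start site allowed for cell i-1
                if p >= 0 and best[i - 1][p] < run_min:
                    run_min, run_arg = best[i - 1][p], p
                base = run_min
            if base < INF:
                best[i][s] = base + cost(i, s)
                prev[i][s] = run_arg
    end = min(range(num_sites), key=lambda s: best[n - 1][s])
    if best[n - 1][end] == INF:
        raise ValueError("cells do not fit in the row")
    sites = [end]
    for i in range(n - 1, 0, -1):
        sites.append(prev[i][sites[-1]])
    return best[n - 1][end], sites[::-1]

# Two cells of width 2 in a 6-site row, each preferring a target site.
targets = [0, 1]
print(compact_row([2, 2], 6, lambda i, s: abs(s - targets[i])))
```

The running minimum over predecessor sites keeps the work at one pass over the sites per cell, instead of comparing every pair of sites between consecutive cells.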
Cell Order Polishing
• Permute a small window of neighboring cells
in order to improve wirelength
– MetaPlacer’s rowIroning: up to 15 cells in one row
assuming equal whitespace distribution
– FengShui's cell ordering: six objects in one or
more rows regarding whitespace as pseudo cells
• Branch-and-bound algorithm
– four nearby cells in one or multiple rows
– consider optimal placement for each permutation
– more accurate, overlap-free permutations and no
cell shifting
Single-Row Cell Ordering
• Cost of placing first j cells of a permutation
– cost = wirelength increase when placing a cell
– ΔWL ≠ 0 only if cell is leftmost or rightmost
– remaining cells placed to the right of first j cells
– unrelated to order or placement of remaining cells
• B&B algorithm
– construct permutations in lexicographic order
• next permutation has same prefix as the previous one
• beginning rows of DP table can be reused where possible
– cut branch when minimum cost of placing first j
cells > best cost so far
Two- or Three-Row Cell Ordering
• DP algorithm
– decide how many cells are assigned to each row, from
top to bottom
– construct a permutation in lexicographic order
– find “optimal” placement within the window
• Y-cost of placing first j cells: accurate
– remaining cells placed lower than first j cells
• X-cost of placing first j cells: inaccurate when
a net connects placed and unplaced cells
– results show still effective with small set of cells
and small window
Outline
• Introduction
• Clustering
• Global Placement
• Detailed Placement
• Experimental Results
– IBM ISPD04
– IBM-PLACE v2
– IBM ICCAD04
– IBM ISPD05
• Conclusions and Future Work
IBM ISPD04
• Test basic placer performance with standard cells
Circuit   APlace2.0   mPL5   Capo9.0   Dragon3   FP1    FS2.6
ibm10     17.20       17.3   1.1       1.04      1.07   1.07
ibm11     13.22       14     1.09      1.03      1.09   1.04
ibm12     21.83       22.3   1.11      1.03      1.08   1.07
ibm13     16.46       16.6   1.1       1.05      1.11   1.09
ibm14     30.55       31.6   1.1       1.05      1.11   1.04
ibm15     38.38       38.5   1.09      1.04      1.13   1.07
ibm16     41.36       43     1.1       1.05      1.07   1.09
ibm17     60.82       61.3   1.09      1.08      1.08   1.08
ibm18     39.32       41     1.09      1.02      1.1    1.04
Average   0.97        1      1.09      1.03      1.08   1.06
• 3% better than the best other placer, mPL5 (ISPD05)
IBM Place V2
• Test placer under whitespace presence and routability
Circuit      APlace2.0   Vias     mPL+WSA
ibm09-easy   3.023       495073   3.5
ibm09-hard   3.027       503410   3.65
ibm10-easy   5.977       758598   6.84
ibm10-hard   5.931       772744   6.76
ibm11-easy   4.577       638523   5.16
ibm11-hard   4.654       656525   5.15
ibm12-easy   8.337       892915   10.52
ibm12-hard   8.317       902465   10.13
Average      0.88                 1
• 12% better than mPL-R+WSA (ICCAD04)
IBM ICCAD04
• Test placer performance with cells and blocks (floorplacement)
Circuit   APlace2.0   FS2.6   Capo9.0
ibm10     28.55       41.96   34.98
ibm11     18.67       21.19   22.31
ibm12     33.51       40.84   40.78
ibm13     23.03       25.45   28.7
ibm14     35.9        39.93   40.97
ibm15     46.82       51.96   59.19
ibm16     54.58       62.77   67
ibm17     66.49       69.38   78.78
ibm18     42.14       45.59   50.39
Average   0.86        1       1.05
• 14% and 19% better than FS and Capo, respectively
IBM ISPD05
• Test placer performance with cells and movable/fixed blocks
Placer      adaptec2   adaptec4   BB1      BB2      BB3      BB4       AVG
APlace2.0   87.31      187.65     94.64    143.82   357.89   833.21    1
mFAR        91.53      190.84     97.7     168.7    379.95   876.28    1.06
Dragon      94.72      200.88     102.39   159.71   380.45   903.96    1.08
mPL         97.11      200.94     98.31    173.22   369.66   904.19    1.09
FastPlace   107.86     204.48     101.56   169.89   458.49   889.87    1.16
Capo        99.71      211.25     108.21   172.3    382.63   1098.76   1.17
NTUP        100.31     206.45     106.54   190.66   411.81   1154.15   1.21
FengShui    122.99     337.22     114.57   285.43   471.15   1040.05   1.5
KW          157.65     352.01     149.44   322.22   656.19   1403.79   1.84
• 6% better than the best other placer (mFAR)
APlace2.0 Conclusions
• 60 days + clean sheet of paper + Qinke Wang + Sherief Reda
• Scalable implementation
• State-of-the-art clustering and global placement engines
• Improved detailed placement engine
• Better than best published results by
– 3%: ISPD’04 suite
– 14%: ICCAD’04
– 12%: IBMPLACE V.2
– 6%: ISPD’05 Placement Contest
• Recent Applications (other than restoring functionality)
• IR-drop driven placement (ICCD-2005 Best Paper)
• Lens aberration-aware placement (DATE-2006)
• Toward APlace3.0: ?
Thank You
Questions?
Goals and Plan
Goals:
• Build a new placer to win the competition
• Scalable, robust, high-quality implementation
• Leave no stone unturned / QOR on the table
Plan and Schedule:
• Work within most promising framework: APlace
• 30 days for coding + 30 days for tuning
Philosophy
Respect the competition
• Well-funded groups with decades of experience
– ABKGroup’s Capo, MLPart, APlace = all unfunded side projects
– No placement-related industry interactions
• QOR target: 24-26% better than Capo v9r6 on all known
benchmarks
– Nearly pulled out 10 days before competition
Work smart
• Solve scalability and speed basics first
– Slimmed-down data structure, -msse compiler options, etc.
• Ordered list of ~15 QOR ideas to implement
• Daily regressions on all known benchmarks
• Synthetic testcases to predict bb3, bb4, etc.
Implementation Framework
APlace weaknesses:
• Weak clustering
• Poor legalization / detailed placement
New APlace:
1. New clustering
2. Adaptive parameter setting for scalability
3. New legalization + iterative detailed placement
New APlace Flow
Global Phase: Clustering → Adaptive APlace engine → Unclustering
Detailed Phase: Legalization → WS arrangement → Cell order polishing → Global moving
Parameterization and Parallelizing
Tuning Knobs:
• Clustering ratio, # top-level clusters, cluster area constraints
• Initial wirelength weight, wirelength weight reduction ratio
• Max # CG iterations for each wirelength weight
• Target placement discrepancy
• Detailed placement parameters, etc.
Resources:
• SDSC ROCKS Cluster: 8 Xeon CPUs at 2.8GHz
• Michigan Prof. Sylvester’s Group: 8 various CPUs
• UCSD FWGrid: 60 Opteron CPUs at 1.6GHz
• UCSD VLSICAD Group: 8 Xeon CPUs at 2.4GHz
Wirelength Improvement after Tuning: 2-3%
Artificial Benchmark Synthesis
• Synthetic benchmarks to test code scalability and performance
• Rapid response to broadcast of s00-nam.pdf
• Created “synthetic” versions of bigblue3 and bigblue4 within 48 hours
• Mimicked fixed-block layout diagrams in the artificial benchmark creation
• This process was useful: we identified (and solved) a problem with clustering in presence of many small fixed blocks
Results
Circuit    GP HPWL   Leg HPWL   DP HPWL   CPU (h)
adaptec1   80.20     81.80      79.50     3
adaptec2   84.70     92.18      87.31     3
adaptec3   218.00    230.00     218.00    10
adaptec4   182.90    194.75     187.71    13
bigblue1   93.67     97.85      94.64     5
bigblue2   140.68    147.85     143.80    12
bigblue3   357.28    407.09     357.89    22
bigblue4   813.91    868.07     833.21    50
Conclusions
• ISPD05 = an exercise in process and philosophy
• At end, we were still 4% short of where we wanted to be
• Not happy with how we handled 5-day time frame
• Auto-tuning → first results ~ best results
• During competition, wrote but then left out “annealing” DP improvements that gained another 0.5%
• Students and IBM ARL did a really, really great job
• Currently restoring capabilities (congestion, timing-driven, etc.) and cleaning (antecedents in Naylor patent)