Physically Aware Data Communication Optimization for Hardware Synthesis

Ryan Kastner, Wenrui Gong, Xin Hao, Forrest Brewer
Dept. of Electrical and Computer Engineering, University of California, Santa Barbara

Adam Kaplan, Philip Brisk and Majid Sarrafzadeh
Computer Science Department, University of California, Los Angeles


Hardware Compilation

Flow: Application specified in high-level language → Compiler → HDL (behavioral, structural) → Synthesis and Physical Design → Chip, bitstream, …

- We focus our efforts on mapping an application written in a high-level language to a hardware description
- We desire this mapping to have optimal characteristics (area, latency, etc.)
- In this talk, we focus on the problem of minimizing data communication in the final hardware


Obligatory Design Flow Slide

[Figure: Application Specification → SUIF: Syntactic & Semantic Analysis → AST → Machine SUIF: Compiler Backend → CFG, SSA, CDFG; a CFG Entity containing Basic Block Entities (Entity 1 to Entity 4)]

Per basic block entity:
1. Create interface
2. Transform instruction list to dataflow graph
3. Transform dataflow graph to behavioral HDL code (entity basic_block is … architecture behavioral of basic_block …)
4. Synthesize behavioral HDL code to RTL code (Behavioral Synthesis)

Per CFG entity:
5. Create CFG interface
6. Determine structural control and data communication between basic block entities
7. Generate synthesizable RTL code (entity cfg is … architecture behavioral of cfg …)
8. Synthesize RTL code (Logical & Physical Synthesis)


Characterizing Data Communication

Examples of data communication schemes:
- Distributed: control nodes are connected directly. Data communication = wire
- Centralized: control nodes share a memory (register bank, RAM) over a bus. Data communication = memory access

[Figure: Control Nodes 1 to 4 drawn once in the distributed scheme and once in the centralized scheme]


Identifying Data Communication

- Determine relationship between place(s) where data is defined and where data is used
- Naïve method: all use-points of a variable depend on all definitions of that variable
- Not all use points "use" a variable
- Global Data Communication = 5 variables
- Need analysis to minimize the amount of data communication

[Figure: example CFG with definitions and uses of a, b and c]


Use of SSA in Compilation

- Must determine relationship between where data is generated and where data is used
- Problem formulations:
  - [DAC02]: Minimize the total number of bits communicated between all pairs of control nodes
  - Today: Minimize overall wirelength
- SSA (Static Single Assignment)
  - Changes each variable to have a unique definition point
  - Must add Φ-nodes to merge definitions

[Figure: the example CFG in SSA form, with definitions a1, b1, a2, b2, c1, a3 and the merge a4 ← Φ(a2, a3)]


Physically Aware Compiler Transforms (Let's Get Physical!)

- Consider layout information during compilation
- Modify transforms to consider physical info
- Ideal: full physical synthesis. Extremely accurate, but way too time consuming
- Approximate using floorplanning
  - Much faster
  - Gives "good enough" high level physical picture
- Previous data communication work
  - No physical information
  - Can lead to negative results

[Figure: application → Hardware Compilation, with feedback from Floorplanner / Physical Synthesis]


Physically Aware Data Communication

- Modify placement of Φ-functions to consider wirelength


Φ-Placement Algorithm

 1. Given a CFG Gcfg(Vcfg, Ecfg)
 2. perform_ssa(Gcfg)
 3. calculate_def_use_chains(Gcfg)
 4. remove_back_edges(Gcfg)
 5. topological_sort(Gcfg)
 6. foreach vertex v ∈ Vcfg
 7.   foreach Φ-node φ ∈ v
 8.     s ← φ.sources
 9.     d ← |def_use_chain(φ.dest)|
10.     IDF ← iterated_dominance_frontier(s)
11.     PossiblePlacements ← findPlacementOptions(IDF)
12.     place(φ) ← selectBest(PossiblePlacements)
13.     distribute/duplicate φ to place(φ)


FindPlacementOptions Algorithm

 1. Given a set of CFG Nodes R
 2. Φ-options ← ∅
 3. insert(R) into Φ-options
 4. foreach instruction i ∈ R
 5.   if (i is a destination of Φ-function f)
 6.     return Φ-options
 7. temp_Φ-options ← ∅
 8. foreach non-dominated child c of R
 9.   temp_Φ-options ← crossProductJoin(temp_Φ-options, findPlacementOptions(c))
10. return Φ-options ∪ temp_Φ-options


Algorithm in Action

- FAST function from MediaBench testsuite

[Figure, shown in several builds: CFG fragment with nodes N3 and N9, T/F branch edges, and the values nn_4, i_2 and nn_5, i_3]


Full Floorplanning Results

Iterative approach:
1. Hardware compilation: initial optimization minimizes data communication
2. Full SA based floorplanning
3. Reoptimization based on floorplanning, to minimize wirelength
4. Full SA based floorplanning

- Spectacularly negative results

[Chart: Floorplan Wirelength. y-axis: wirelength (logarithmic), 1 to 10,000,000; x-axis: benchmark; series: WL (first), WL (second); benchmarks: adpcm_coder, adpcm_decoder, internal_filter, internal_expand, compress_output, Decode_MPEG2_Intra_Block, Decode_MPEG2_Non_Intra_Block, decode_motion_vector, FAST, FR4TR, det]


Simple Incremental Floorplanning

- Incremental Placement [Coudert et al]: Given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping), modify the placement to improve it.
- Given a floorplan and a set of changes to the modules (e.g., due to Φ-function movement), modify the floorplan to improve it.
- Equally applicable to floorplanning

[Figure: Initial Floorplan → Perturbations → Modified Floorplan, over modules 1, 2, 3, 4 and 6]


Our Incremental Floorplanner

[Figure: Initial Floorplan → Perturbations → Modified Floorplan → Incremental Floorplanner → Incremental Floorplan. The slicing tree (cuts | and -, modules 1, 2, 3, 4, 6) carries the node labels 32/36, 5/5.6, 11/12.4, 2/2.3, 27/30.4, 16/18 and 9/10.1]

1. Calculate area & room of each node: bottom up slicing tree traversal
2. Area redistribution: top down traversal

- Increase area if necessary
  - Not enough space at root
  - Aspect ratios become too distorted
- Simple, yet effective
- Other more complicated algorithms might work better


MediaBench Functions

[The extracted header names four value columns (Blocks, Links, Weight, Initial WL), but every row has five values; one column header was lost, so the value columns are left unlabeled.]

 #  Benchmark
 1  adpcm coder       33   31   54   2688   35568
 2  adpcm decoder     26   23   44   1952   21588
 3  internal filter   10  143   60  17088  411637
 4  internal expand  101   94  257  14336  317031
 5  compress output   34   17   60   2368   29114
 6  mpeg2dec block    62   13   66   2272   34510
 7  mpeg2dec vector   16    4   26   1024    4366
 8  FAST              14    4   15    704    3714
 9  FR4TR             77   87  155    704  340697
10  det               12    5   13   7936    3772


Incremental Floorplanning Results

[Chart: Normalized Wirelength (y-axis 0 to 1.2) for benchmarks 1 to 10 and their average; series: Initial, Overall Optimal, Overall Incremental, Phi Optimal, Phi Incremental]

- "Optimal" Approach:
  - 12% Overall Wirelength Reduction
  - 25% Phi-node Wirelength Reduction
- Our Approach:
  - 6% Overall Wirelength Reduction
  - 8% Phi-node Wirelength Reduction


Related Work

- Hardware compilation projects using SSA
  - PDG+SSA form [UCSB]
  - CASH [CMU]
  - SA-C [UCR]
  - Sea Cucumber [BYU]
- Physically aware behavioral synthesis techniques
  - SA for scheduling, binding and floorplanning [Prabhakaran97]
  - SA for binding and floorplanning [Yung-Ming94]
  - Scheduling, allocation and binding [Dougherty00]
  - Fasolt: bus topology [Knapp92]
  - High level synthesis [Tarafdar00]
- Incremental CAD
  - Problem overview/challenges [Coudert00]
  - Floorplanning [Crenshaw99]


Conclusions

- It's been a long strange trip…
- SSA is a nice IR for hardware compilation
  - Explicitly shows data flow
  - Useful for exploiting parallelism
- Compiler techniques applied to hardware design can reduce wirelength
  - They must be aware of physical information
  - They must use an incremental floorplanner
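
The selectBest step of the placement algorithm is not spelled out on the slides. The following is a minimal Python sketch of one way it could work, under assumptions that are not from the talk: each placement option is a tuple of CFG blocks that each hold a copy of the Φ-node, every source block is wired to every copy, every use block is wired to its nearest copy, distances are Manhattan distances between floorplan block centers, and the result is weighted by a hypothetical bit width. The names option_wirelength, select_best, centers and bits are all illustrative.

```python
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) center of a block in the floorplan


def manhattan(a: Point, b: Point) -> float:
    """Manhattan distance between two block centers."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def option_wirelength(option: Sequence[str], sources: Sequence[str],
                      uses: Sequence[str], centers: Dict[str, Point],
                      bits: int = 32) -> float:
    """Estimated wirelength of one placement option (assumed cost model).

    Assumption: every source feeds every copy of the phi-node, and every
    use is fed by the nearest copy. `bits` is a hypothetical wire width.
    """
    copies = [centers[block] for block in option]
    total = 0.0
    for p in copies:
        total += sum(manhattan(centers[s], p) for s in sources)
    for u in uses:
        total += min(manhattan(p, centers[u]) for p in copies)
    return bits * total


def select_best(options, sources, uses, centers, bits: int = 32):
    """Pick the placement option with the smallest estimated wirelength."""
    return min(options,
               key=lambda o: option_wirelength(o, sources, uses, centers, bits))


# Hypothetical floorplan: two defining blocks, a join block J, two use blocks.
centers = {"A": (0, 0), "B": (0, 4), "J": (2, 2), "U1": (10, 0), "U2": (10, 4)}
options = [("J",), ("U1", "U2")]  # one copy at the join, or one copy per use
best = select_best(options, ["A", "B"], ["U1", "U2"], centers)
print(best, option_wirelength(best, ["A", "B"], ["U1", "U2"], centers))
```

In this made-up floorplan the single copy at the join block is cheaper than duplicating the Φ-node at both uses. Moving J far from the other blocks flips the choice, which is the kind of decision the compiler cannot make without physical information.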
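
The two passes of the incremental floorplanner (bottom-up area calculation, top-down area redistribution) can be sketched in Python as follows. This is a sketch under assumptions, not the authors' implementation: room is split among a node's children in proportion to their areas, the root grows only when the requested room is smaller than the total area, and the aspect-ratio check mentioned on the slide is omitted. The pairs recoverable from the slicing-tree figure (32/36, 5/5.6, 27/30.4, 11/12.4, 16/18, 2/2.3, 9/10.1) are consistent with a proportional split; the tree shape and the node names below are inferred from how those areas sum and are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Slicing-tree node; a leaf is a module, an internal node is a cut."""
    name: str
    area: float = 0.0                 # leaf: module area; internal: set bottom-up
    children: List["Node"] = field(default_factory=list)
    room: float = 0.0                 # space granted during the top-down pass


def compute_area(node: Node) -> float:
    """Bottom-up pass: an internal node's area is the sum of its children."""
    if node.children:
        node.area = sum(compute_area(c) for c in node.children)
    return node.area


def redistribute(node: Node, room: float) -> None:
    """Top-down pass: split room among children in proportion to their area.

    Assumes every node has positive area.
    """
    node.room = room
    for c in node.children:
        redistribute(c, room * c.area / node.area)


def incremental_floorplan(root: Node, chip_room: float) -> float:
    """Run both passes; grow the root if there is not enough space."""
    needed = compute_area(root)
    room = max(chip_room, needed)     # "increase area if necessary"
    redistribute(root, room)
    return room


# Hypothetical tree whose leaf areas (5, 2, 9, 16) sum as in the figure.
m5, m2, m9, m16 = Node("m5", 5), Node("m2", 2), Node("m9", 9), Node("m16", 16)
n11 = Node("n11", children=[m2, m9])
n27 = Node("n27", children=[n11, m16])
root = Node("root", children=[m5, n27])

incremental_floorplan(root, chip_room=36.0)
for n in (root, m5, n27, n11, m16, m2, m9):
    print(f"{n.name}: {n.area:g}/{n.room:g}")
```

A proportional split keeps the update local and cheap, which matches the "simple, yet effective" remark on the slide; a fuller version would also reject or repair splits that distort aspect ratios too far.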