Co-Synthesis Algorithms: HW/SW Partitioning
Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)
Winter-Spring 2001

Topics
  - Introduction
  - Preliminaries
  - Hardware/Software Partitioning
  - Distributed System Co-Synthesis

Topics
  - Introduction
  - A Classification
  - Examples
      - Vulcan
      - Cosyma

Introduction to HW/SW Partitioning
  - The first variety of co-synthesis applications
  - Definition
      - A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture
  - Usually
      - Multiprocessor architecture = one CPU + some ASICs on the CPU bus

Introduction to HW/SW Partitioning (cont’d)
  - Terminology
      - Allocation: synthesis methods that design the multiprocessor topology along with the PEs (processing elements) and the SW architecture
      - Scheduling: the process of assigning PE (CPU and/or ASIC) time to processes so that they get executed

Introduction to HW/SW Partitioning (cont’d)
  - In most partitioning algorithms
      - The type of CPU is fixed and given
      - ASICs must be synthesized
          - What function should be implemented on each ASIC?
          - What characteristics should the implementation have?
      - The problems are single-rate synthesis problems
          - A CDFG (control/data flow graph) is the starting model

HW/SW Partitioning (cont’d)
  - Normal use of architectural components
      - The CPU performs the less computationally intensive functions
      - ASICs are used to accelerate core functions
  - Where to use?
      - High-performance applications: no CPU is fast enough for the operations
      - Low-cost applications: ASIC accelerators allow use of a much smaller, cheaper CPU

A Classification
  - Criterion: optimization strategy
      - Trade-off between performance and cost
  - Primal approach
      - Performance is the primary goal
      - First, all functionality is placed in ASICs; functionality is then progressively moved to the CPU to reduce cost
  - Dual approach
      - Cost is the primary goal
      - First, all functions are placed in the CPU; operations are then moved to the ASIC to meet the performance goal

A Classification (cont’d)
  - Classification by optimization strategy (cont’d)
      - Example co-synthesis systems
          - Vulcan (Stanford): primal strategy
          - Cosyma (Braunschweig, Germany): dual strategy

HW/SW Partitioning Examples: Vulcan

Partitioning Examples: Vulcan
  - Gupta, De Micheli, Stanford University
  - Primal approach
      1. All-HW initial implementation.
      2. Iteratively move functionality to the CPU to reduce cost.
  - System specification language: HardwareC
      - Is compiled into a flow graph

Partitioning Examples: Vulcan (cont’d)
  [Figure: two HardwareC fragments and their flow graphs]
  - HardwareC: x=a; y=b;
      Flow graph: a nop node with two edges, each labeled 1, to the operations x=a and y=b
  - HardwareC: if (c>d) x=e; else y=f;
      Flow graph: a cond node with an edge labeled c>d to x=e and an edge labeled c<=d to y=f

Partitioning Examples: Vulcan (cont’d)
  - Flow graph definition
      - A variation of a (single-rate) task graph
      - Nodes
          - Represent operations
          - Typically low-level operations: mult, add
      - Edges
          - Represent data dependencies
          - Each carries a Boolean condition under which the edge is traversed

Partitioning Examples: Vulcan (cont’d)
  - The flow graph
      - is executed repeatedly at some rate
      - can have initiation-time constraints for each node: $t(v_i) + l_{ij} \le t(v_j) \le t(v_i) + u_{ij}$
      - can have rate constraints on each node: $m_i \le R_i \le M_i$

Partitioning Examples: Vulcan (cont’d)
  - Vulcan co-synthesis algorithm
      - The partitioning quantum is a thread
          - The algorithm divides the flow graph into threads and allocates them
      - A thread boundary is determined by
          1. (always) a non-deterministic delay element, such as a wait for an external variable
          2. (by choice) other points of the flow graph
      - Target architecture: CPU + co-processor (multiple ASICs)

Partitioning Examples: Vulcan (cont’d)
  - Vulcan co-synthesis algorithm (cont’d)
      - Allocation: primal approach
      - Scheduling
          - is done by a scheduler on the target CPU
          - the scheduler is generated as part of the synthesis process
          - the scheduler schedules all threads (both HW and SW threads)
          - the schedule cannot be static, because some threads have non-deterministic initiation times

Partitioning Examples: Vulcan (cont’d)
  - Vulcan co-synthesis algorithm (cont’d)
      - Cost estimation
          - SW implementation
              - Code size: relatively straightforward
              - Data size: the biggest challenge; Vulcan puts some effort into finding bounds for each thread
          - HW implementation: ?

Partitioning Examples: Vulcan (cont’d)
  - Vulcan co-synthesis algorithm (cont’d)
      - Performance estimation
          - For both SW and HW implementations: derived from the flow graph and the basic execution times of the operators

Partitioning Examples: Vulcan (cont’d)
  - Algorithm details
      - Partitioning goal
          - Allocate each thread to one of two partitions
              - CPU set: $F_S$
              - Co-processor set: $F_H$
          - The required execution rate must be met, and the total cost minimized

Partitioning Examples: Vulcan (cont’d)
  - Algorithm details (cont’d)
      - Algorithm steps
          1. Put all threads in the $F_H$ set
          2. Iteratively do
              2.1. Move some operations to $F_S$.
                  2.1.1. Select a group of operations to move to $F_S$.
                  2.1.2. Check performance feasibility by computing the worst-case delay through the flow graph, given the new thread times.
                  2.1.3. Do the move, if feasible.
              2.2. Incrementally update the cost function to reflect the new partition.

Partitioning Examples: Vulcan (cont’d)
  - Algorithm details (cont’d)
      - Vulcan cost function
          $f(w) = c_1 S_h(F_H) - c_2 S_s(F_S) + c_3 B - c_4 P + c_5 |m|$
          - $c_1, \dots, c_5$: weight constants
          - $S_h()$, $S_s()$: size functions
          - $B$: bus utilization (< 1)
          - $P$: processor utilization (< 1)
          - $m$: total number of variables to be transferred between the CPU and the co-processor

Partitioning Examples: Vulcan (cont’d)
  - Algorithm details (cont’d)
      - Complementary notes
          - A heuristic to minimize communication: once a thread is moved to $F_S$, its immediate successors are placed in the list for evaluation in the next iteration
          - No back-tracking: once a thread is assigned to $F_S$, it remains there
      - Experimental results
          - The designs produced are considerably faster than all-SW implementations, but much cheaper than all-HW designs

HW/SW Partitioning Examples: Cosyma

Partitioning Examples: Cosyma
  - Rolf Ernst, et al., Technical University of Braunschweig, Germany
  - Dual approach
      1. All-SW initial implementation.
      2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective.
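For comparison with the dual strategy just outlined, the Vulcan-style primal loop and cost function from the preceding slides can be sketched in Python. This is only an illustration under stated assumptions, not Vulcan's implementation: the `Thread` fields, the weight values, the toy models for B, P and |m| (one variable per edge crossing the HW/SW boundary), and the rule that a feasible move is accepted only when it lowers the cost are all hypothetical choices made for the sketch.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Thread:
    """Hypothetical per-thread estimates (not Vulcan's data structures)."""
    name: str
    hw_size: float      # contribution to S_h if the thread stays in F_H
    sw_size: float      # contribution to S_s if the thread moves to F_S
    hw_time: float      # execution time in hardware
    sw_time: float      # execution time in software
    succs: tuple = ()   # names of immediate successor threads


def worst_case_delay(threads, in_sw):
    """Longest path through the (acyclic) thread graph for a given partition."""
    memo = {}

    def finish(name):
        if name not in memo:
            t = threads[name]
            own = t.sw_time if name in in_sw else t.hw_time
            memo[name] = own + max((finish(s) for s in t.succs), default=0.0)
        return memo[name]

    return max(finish(n) for n in threads)


def cost(threads, in_sw, period, c=(1.0, 0.1, 1.0, 1.0, 0.5)):
    """f = c1*S_h(F_H) - c2*S_s(F_S) + c3*B - c4*P + c5*|m| (toy B, P, m)."""
    c1, c2, c3, c4, c5 = c
    s_h = sum(t.hw_size for n, t in threads.items() if n not in in_sw)
    s_s = sum(t.sw_size for n, t in threads.items() if n in in_sw)
    # Assumption: one transferred variable per edge crossing the boundary.
    m = sum((n in in_sw) != (s in in_sw)
            for n, t in threads.items() for s in t.succs)
    # Assumption: utilizations are simple ratios, clamped to 1.
    p = min(1.0, sum(t.sw_time for n, t in threads.items() if n in in_sw) / period)
    bus = min(1.0, m / period)
    return c1 * s_h - c2 * s_s + c3 * bus - c4 * p + c5 * m


def primal_partition(threads, deadline, period):
    """Start all-HW, then move threads to software while the deadline holds."""
    in_sw = set()                       # F_S; everything else is F_H
    work = deque(threads)               # candidate threads to evaluate
    while work:
        name = work.popleft()
        if name in in_sw:               # no back-tracking: F_S members stay
            continue
        trial = in_sw | {name}
        feasible = worst_case_delay(threads, trial) <= deadline
        if feasible and cost(threads, trial, period) < cost(threads, in_sw, period):
            in_sw = trial
            # Communication heuristic: evaluate immediate successors next.
            work.extendleft(threads[name].succs)
    return in_sw


EXAMPLE = {
    "a": Thread("a", hw_size=10, sw_size=2, hw_time=1, sw_time=4, succs=("b",)),
    "b": Thread("b", hw_size=8, sw_size=2, hw_time=1, sw_time=3, succs=("c",)),
    "c": Thread("c", hw_size=20, sw_size=4, hw_time=2, sw_time=10),
}

if __name__ == "__main__":
    print(sorted(primal_partition(EXAMPLE, deadline=10, period=20)))
```

In this toy chain, threads a and b can move to the CPU without violating the deadline, while moving c would push the worst-case delay past it, so c stays in hardware.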
  - System specification language: Cx
      - Is compiled into an ESG (Extended Syntax Graph)
      - The ESG is much like a CDFG

Partitioning Examples: Cosyma (cont’d)
  - Cosyma co-synthesis algorithm
      - The partitioning quantum is a basic block
          - A basic block is a branch-free block of program code
      - Target architecture: CPU + accelerator ASIC(s)
      - Scheduling
      - Allocation
      - Cost estimation
      - Performance estimation
      - Algorithm details

Partitioning Examples: Cosyma (cont’d)
  - Cosyma co-synthesis algorithm (cont’d)
      - Performance estimation
          - SW implementation
              - Done by examining the object code that a compiler generates for the basic block
          - HW implementation
              - Assumes one operator per clock cycle
              - Creates a list schedule for the DFG of the basic block
              - The depth of the list gives the number of clock cycles required
          - Communication
              - Done by data-flow analysis of the adjacent basic blocks
              - In shared memory: proportional to the number of variables to be accessed

Partitioning Examples: Cosyma (cont’d)
  - Algorithm steps
      - Change in execution time caused by moving basic block b from the CPU to the ASIC:
          $\Delta c(b) = w \, \bigl( t_{HW}(b) - t_{SW}(b) + t_{com}(Z) - t_{com}(Z \cup b) \bigr) \times It(b)$
          - $w$: constant weight
          - $t_{HW}(b)$, $t_{SW}(b)$: execution time of basic block b in HW and in SW
          - $t_{com}(Z)$: estimated communication time between the CPU and the accelerator ASIC, given a set Z of basic blocks implemented on the ASIC
          - $It(b)$: total number of times that b is executed

Partitioning Examples: Cosyma (cont’d)
  - Experimental results
      - By moving only basic blocks to HW: typical speedup of only 2x
      - Reason: limited intra-basic-block parallelism
      - Cure: implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
          - Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining
      - Result
          - Speedups: 2.7 to 9.7
          - CPU times: 35 to 304 seconds on a typical workstation

What we learned today
  - HW/SW partitioning: one broad category of co-synthesis algorithms
  - Criteria by which a co-synthesis algorithm is categorized
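As a closing illustration, the Cosyma increment Δc(b) from the "Algorithm steps" slide can be evaluated with a few lines of Python. The sketch transcribes the slide's expression term by term; everything else is an assumption made for the example: the block names, the timing numbers, the `CYCLES_PER_VAR` constant, the `shared_vars` table, and the shared-memory communication model (cost proportional to the number of variables crossing the CPU/ASIC boundary). Ranking candidates by the most negative Δc is likewise a simple stand-in, since the slides do not describe how Cosyma searches the partition space.

```python
CYCLES_PER_VAR = 2  # hypothetical shared-memory cost per transferred variable


def t_com(Z, shared_vars):
    """Toy t_com(Z): cost of variables shared across the CPU/ASIC boundary.

    shared_vars maps a pair of adjacent basic blocks to the number of
    variables they exchange (an assumed result of data-flow analysis).
    """
    crossing = sum(n for (u, v), n in shared_vars.items()
                   if (u in Z) != (v in Z))
    return CYCLES_PER_VAR * crossing


def delta_c(b, Z, t_hw, t_sw, shared_vars, it, w=1.0):
    """Δc(b) = w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z ∪ b)) * It(b)."""
    Z = frozenset(Z)
    return w * (t_hw[b] - t_sw[b]
                + t_com(Z, shared_vars)
                - t_com(Z | {b}, shared_vars)) * it[b]


def best_candidate(Z, t_hw, t_sw, shared_vars, it):
    """Return the block outside Z with the most negative Δc, or None."""
    scores = {b: delta_c(b, Z, t_hw, t_sw, shared_vars, it)
              for b in t_sw if b not in Z}
    if not scores:
        return None
    b = min(scores, key=scores.get)
    return b if scores[b] < 0 else None


# Hypothetical estimates for three basic blocks (clock cycles per execution).
T_SW = {"b1": 10, "b2": 40, "b3": 30}
T_HW = {"b1": 8, "b2": 10, "b3": 12}
IT = {"b1": 1, "b2": 100, "b3": 100}            # execution counts It(b)
SHARED = {("b1", "b2"): 2, ("b2", "b3"): 1}     # variables between neighbors

if __name__ == "__main__":
    for block in T_SW:
        print(block, delta_c(block, set(), T_HW, T_SW, SHARED, IT))
    print("first move:", best_candidate(set(), T_HW, T_SW, SHARED, IT))
```

With these numbers the frequently executed block b2 dominates, because its per-execution saving is multiplied by It(b) = 100; this is the effect the It(b) factor is meant to capture.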