CoSynthesis_Algorithms-Partitioning.ppt

Co-Synthesis Algorithms: HW/SW Partitioning
Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)
Winter-Spring 2001

Topics
- Introduction
- Preliminaries
- Hardware/Software Partitioning
- Distributed System Co-Synthesis

Topics
- Introduction
- A Classification
- Examples
  - Vulcan
  - Cosyma

Introduction to HW/SW Partitioning
- The first variety of co-synthesis applications
- Definition
  - A HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture
- Usually
  - Multiprocessor architecture = one CPU + some ASICs on the CPU bus

Introduction to HW/SW Partitioning (cont’d)
- Terminology
  - Allocation
    - Synthesis methods which design the multiprocessor topology along with the PEs and SW architecture
  - Scheduling
    - The process of assigning PE (CPU and/or ASIC) time to processes so that they get executed

Introduction to HW/SW Partitioning (cont’d)
- In most partitioning algorithms
  - The type of CPU is fixed and given
  - ASICs must be synthesized
    - What function to implement on each ASIC?
    - What characteristics should the implementation have?
- They are single-rate synthesis problems
  - CDFG is the starting model

HW/SW Partitioning (cont’d)
- Normal use of architectural components
  - CPU performs less computationally-intensive functions
  - ASICs are used to accelerate core functions
- Where to use?
  - High-performance applications
    - No CPU is fast enough for the operations
  - Low-cost applications
    - ASIC accelerators allow use of a much smaller, cheaper CPU

A Classification
- Criterion: Optimization Strategy
  - Trade-off between performance and cost
- Primal Approach
  - Performance is the primary goal
  - First, all functionality in ASICs. Progressively move more to the CPU to reduce cost.
- Dual Approach
  - Cost is the primary goal
  - First, all functions in the CPU. Move operations to the ASIC to meet the performance goal.

A Classification (cont’d)
- Classification due to optimization strategy (cont’d)
  - Example co-synthesis systems
    - Vulcan (Stanford): Primal strategy
    - Cosyma (Braunschweig, Germany): Dual strategy

Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Vulcan

Partitioning Examples: Vulcan
- Gupta, De Micheli, Stanford University
- Primal approach
  1. All-HW initial implementation.
  2. Iteratively move functionality to CPU to reduce cost.
- System specification language
  - HardwareC
    - Is compiled into a flow graph

Partitioning Examples: Vulcan (cont’d)
[Figure: two HardwareC fragments and their flow graphs. The fragment "x=a; y=b;" becomes a nop node with two outgoing edges, each labeled 1, leading to the nodes x=a and y=b. The fragment "if (c>d) x=e; else y=f;" becomes a cond node with an edge labeled c>d leading to x=e and an edge labeled c<=d leading to y=f.]

Partitioning Examples: Vulcan (cont’d)
- Flow Graph Definition
  - A variation of a (single-rate) task graph
  - Nodes
    - Represent operations
    - Typically low-level operations: mult, add
  - Edges
    - Represent data dependencies
    - Each contains a Boolean condition under which the edge is traversed

Partitioning Examples: Vulcan (cont’d)
- Flow Graph
  - is executed repeatedly at some rate
  - can have initiation-time constraints for each node
    - $t(v_i) + l_{ij} \le t(v_j) \le t(v_i) + u_{ij}$
  - can have rate constraints on each node
    - $m_i \le R_i \le M_i$

Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis Algorithm
  - Partitioning quantum is a thread
    - The algorithm divides the flow graph into threads and allocates them
    - A thread boundary is determined by
      1. (always) a non-deterministic delay element, such as a wait for an external variable
      2. (on choice) other points of the flow graph
  - Target architecture
    - CPU + co-processor (multiple ASICs)

Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis algorithm (cont’d)
  - Allocation
    - Primal approach
  - Scheduling
    - is done by a scheduler on the target CPU
      - is generated as part of the synthesis process
      - schedules all threads (both HW and SW threads)
      - cannot be static, because some threads have non-deterministic initiation times

Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis algorithm (cont’d)
  - Cost estimation
    - SW implementation
      - Code size: relatively straightforward
      - Data size: biggest challenge.
        Vulcan puts some effort into finding bounds for each thread.
    - HW implementation
      - ?

Partitioning Examples: Vulcan (cont’d)
- Vulcan Co-synthesis algorithm (cont’d)
  - Performance estimation
    - Both SW and HW implementations
      - Estimated from the flow graph and the basic execution times of the operators

Partitioning Examples: Vulcan (cont’d)
- Algorithm Details
  - Partitioning goal
    - Allocate each thread to one of two partitions
      - CPU set: FS
      - Co-processor set: FH
    - The required execution rate must be met, and the total cost minimized

Partitioning Examples: Vulcan (cont’d)
- Algorithm Details (cont’d)
  - Algorithm steps
    1. Put all threads in the FH set.
    2. Iteratively do
       2.1. Move some operations to FS.
            2.1.1. Select a group of operations to move to FS.
            2.1.2. Check performance feasibility by computing the worst-case delay through the flow graph given the new thread times.
            2.1.3. Do the move, if feasible.
       2.2. Incrementally update the cost function to reflect the new partition.

Partitioning Examples: Vulcan (cont’d)
- Algorithm Details (cont’d)
  - Vulcan cost function

    $$f(w) = c_1 S_h(F_H) - c_2 S_s(F_S) + c_3 B - c_4 P + c_5 |m|$$

    - $c_1, \dots, c_5$: weight constants
    - $S_h()$, $S_s()$: size functions
    - $B$: bus utilization (<1)
    - $P$: processor utilization (<1)
    - $m$: total number of variables to be transferred between the CPU and the co-processor

Partitioning Examples: Vulcan (cont’d)
- Algorithm Details (cont’d)
  - Complementary notes
    - A heuristic to minimize communication
      - Once a thread is moved to FS, its immediate successors are placed in the list for evaluation in the next iteration.
    - No back-tracking
      - Once a thread is assigned to FS, it remains there
  - Experimental results
    - The designs produced are considerably faster than all-SW implementations, but much cheaper than all-HW ones

Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Cosyma

Partitioning Examples: Cosyma
- Rolf Ernst, et al.: Technical University of Braunschweig, Germany
- Dual approach
  1. All-SW initial implementation.
  2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective.
- System specification language
  - Cx
    - Is compiled into an ESG (Extended Syntax Graph)
    - ESG is much like a CDFG

Partitioning Examples: Cosyma (cont’d)
- Cosyma Co-synthesis Algorithm
  - Partitioning quantum is a basic block
    - A basic block is a branch-free block of program code
  - Target architecture
    - CPU + accelerator ASIC(s)
  - Scheduling
  - Allocation
  - Cost Estimation
  - Performance Estimation
  - Algorithm Details

Partitioning Examples: Cosyma (cont’d)
- Cosyma Co-synthesis Algorithm (cont’d)
  - Performance Estimation
    - SW implementation
      - Done by examining the object code that a compiler generates for the basic block
    - HW implementation
      - Assumes one operator per clock cycle.
      - Creates a list schedule for the DFG of the basic block.
      - Depth of the list gives the number of clock cycles required.
    - Communication
      - Done by data-flow analysis of the adjacent basic blocks.
      - In shared memory: proportional to the number of variables to be accessed

Partitioning Examples: Cosyma (cont’d)
- Algorithm Steps
  - Change in execution time caused by moving basic block b from the CPU to the ASIC:

    $$\Delta c(b) = w \left( t_{HW}(b) - t_{SW}(b) + t_{com}(Z) - t_{com}(Z \cup b) \right) \times It(b)$$

    - $w$: constant weight
    - $t_{HW}(b)$, $t_{SW}(b)$: execution time of basic block b in hardware and in software
    - $t_{com}(Z)$: estimated communication time between the CPU and the accelerator ASIC, given a set Z of basic blocks implemented on the ASIC
    - $It(b)$: total number of times that b is executed

Partitioning Examples: Cosyma (cont’d)
- Experimental Results
  - By moving only basic blocks to HW
    - Typical speedup of only 2x
    - Reason:
      - Limited intra-basic-block parallelism
    - Cure:
      - Implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
      - Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining
    - Result:
      - Speedups: 2.7 to 9.7
      - CPU times: 35 to 304 seconds on a typical workstation

What we learned today
- HW/SW Partitioning: one broad category of co-synthesis algorithms
- Criteria by which a co-synthesis algorithm is categorized
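
Example sketch: Vulcan-style primal partitioning

The Vulcan algorithm steps and cost function listed earlier can be illustrated with a small Python sketch. This is not Vulcan's implementation. It assumes an acyclic thread graph, a single deadline standing in for the rate constraint, one thread moved per iteration instead of a group, and software times that are never shorter than hardware times. The names `primal_partition`, `worst_case_delay`, and `vulcan_cost`, and all the numbers, are made up for illustration.

```python
from collections import deque


def worst_case_delay(succ, time_of):
    """Longest path through an acyclic thread graph (DAG is an assumption)."""
    memo = {}

    def longest_from(t):
        if t not in memo:
            tail = max((longest_from(s) for s in succ.get(t, ())), default=0)
            memo[t] = time_of[t] + tail
        return memo[t]

    return max(longest_from(t) for t in time_of)


def primal_partition(threads, succ, deadline):
    """threads: {name: (hw_time, sw_time)}, succ: {name: [successors]}.

    Returns (FH, FS) as sets of thread names.
    """
    fh, fs = set(threads), set()      # step 1: all threads start in FH
    pending = deque(threads)          # candidates still to be evaluated
    tried = set()                     # each thread is evaluated once (assumption)
    while pending:                    # step 2: iterate
        t = pending.popleft()         # step 2.1.1, one thread at a time here
        if t in tried:
            continue
        tried.add(t)
        # Step 2.1.2: worst-case delay if t ran on the CPU.
        trial = {n: (sw if n in fs or n == t else hw)
                 for n, (hw, sw) in threads.items()}
        if worst_case_delay(succ, trial) <= deadline:
            fh.remove(t)              # step 2.1.3: commit, never undone
            fs.add(t)
            # Communication heuristic: evaluate immediate successors next.
            for s in reversed(succ.get(t, ())):
                if s not in tried:
                    pending.appendleft(s)
    return fh, fs


def vulcan_cost(c, sh, ss, bus_util, cpu_util, n_vars):
    """f(w) = c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m|, as on the slide."""
    c1, c2, c3, c4, c5 = c
    return c1 * sh - c2 * ss + c3 * bus_util - c4 * cpu_util + c5 * n_vars


threads = {"A": (1, 5), "B": (1, 2), "C": (1, 3)}   # (hw_time, sw_time), made up
succ = {"A": ["B"], "B": ["C"]}                      # chain A -> B -> C
fh, fs = primal_partition(threads, succ, deadline=7)
print("FH:", sorted(fh), "FS:", sorted(fs))
```

With these numbers only thread A fits in software: moving B or C as well would push the worst-case delay of the chain past the deadline, so they stay in FH.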
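
Example sketch: evaluating the Cosyma cost increment

The Cosyma expression for the change in execution time can be tried out on toy data. The sketch below copies the formula term by term from the slide, signs included, and does not attempt to interpret them. The communication model is an assumption: `t_com` counts the variables touched on both sides of the CPU/ASIC boundary and multiplies by a fixed per-variable time, following the shared-memory remark above. The function names, block names, and numbers are all hypothetical.

```python
def t_com(in_hw, block_vars, per_var_time=1.0):
    """Assumed shared-memory model: time proportional to the number of
    variables used both by blocks on the ASIC and by blocks on the CPU."""
    hw_vars = set().union(*(block_vars[b] for b in in_hw))
    sw_vars = set().union(*(v for b, v in block_vars.items() if b not in in_hw))
    return per_var_time * len(hw_vars & sw_vars)


def delta_c(b, in_hw, t_hw, t_sw, block_vars, iterations, w=1.0):
    """Delta-c(b) for moving block b to the ASIC, given the set Z = in_hw."""
    z = frozenset(in_hw)
    return w * (t_hw[b] - t_sw[b]
                + t_com(z, block_vars)
                - t_com(z | {b}, block_vars)) * iterations[b]


# Toy data (made up): variables touched, per-execution times, execution counts.
block_vars = {"b1": {"x", "y"}, "b2": {"y", "z"}, "b3": {"z"}}
t_sw = {"b1": 4, "b2": 10, "b3": 3}
t_hw = {"b1": 2, "b2": 2, "b3": 1}
iterations = {"b1": 1, "b2": 100, "b3": 1}

for block in block_vars:
    value = delta_c(block, set(), t_hw, t_sw, block_vars, iterations)
    print(block, value)
```

Because the increment is scaled by It(b), a frequently executed block such as b2 dominates the comparison even when its per-execution figures are similar to those of the other blocks.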
