A Tool for Partitioning and Pipelined Scheduling of Hardware-Software Systems
Karam S Chatha and Ranga Vemuri
Department of ECECS
University of Cincinnati
{kchatha,ranga}@ececs.uc.edu

Organization of Talk
• Introduction
• Overview of Tool
• Codesign partitioner
• Pipelined Scheduler
• Results
• Conclusion

Introduction
• Motivation: The throughput of a loop-oriented HW-SW application can be maximized by obtaining a pipelined implementation.
• Objective: To obtain a pipelined implementation of the application on the codesign architecture such that:
  - Throughput constraint is satisfied
  - HW area constraint is satisfied
  - Number of pipeline stages is minimized
  - Increase in memory requirement is minimized

Introduction
Architecture and Task Graph
• Architecture:
  - HW Co-processor
  - SW Processor
  - Local Memory: for SW-SW communication.
  - Shared Memory: for SW-HW, HW-SW and HW-HW communication.
• Task graph:
  - A: S = 225 ns, H = 175 ns (8 +, …)
  - B: S = 400 ns, H = 150 ns (4 *, 8 +, …)
  - C: S = 200 ns, H = 100 ns (4 *, 8 -, …)
  - D: S = 100 ns, H = 400 ns (3 *, 3 +, …)
  - 10 data items per dependence

Some Definitions
• A pipelined design is characterized by its initiation interval.
• Initiation interval (II) is the time difference between the start of two consecutive iterations of the steady state.
• Given a partitioned task graph there exists a theoretical lower bound on the II of its pipelined schedule called the Minimum Initiation Interval (MII). For a directed acyclic task graph the MII is given by:
  MII = max (Sum_hw, Sum_sw)
  where Sum_hw is the sum of execution times of tasks bound to HW and Sum_sw is the sum of execution times of tasks bound to SW.

[Flowchart: HW-SW Codesign]
• Inputs: Task Graph, Architecture, Throughput and Area Constraints
• Partitioning goal: Satisfy throughput and area constraints.
  - Partition Design
  - Constraint Satisfied? If NO: Unable to Design with Given Constraints. If YES: continue.
  - Calculate MII
  - Set II = MII
  - Obtain a Pipelined Schedule which executes in II time.
  - Schd found? If NO: Increase II.
  - II > Constraint? (NO / YES)
• Pipelined scheduling goal: Satisfy throughput constraints, minimize the number of pipeline stages and minimize the increase in memory requirements.
• Output Successful Design

HW-SW Partitioner
• Branch and bound algorithm
• Initial solution tries to minimize MII
  - Suitability of a task v to be assigned to HW is given by:
    \[
    suit_v =
    \begin{cases}
    \left( \dfrac{v_{spm}}{max_{spm}} + K \cdot \dfrac{v_{cr}}{max_{cr}} \right) \cdot \dfrac{1}{v_{area}}, & \text{area constraint} \\[2ex]
    \dfrac{v_{spm}}{max_{spm}} + K \cdot \dfrac{v_{cr}}{max_{cr}}, & \text{no area constraint}
    \end{cases}
    \]
  - Sort tasks in descending order of their suitabilities.
  - Assign tasks to HW and SW alternately from the front and back of the sorted list so that Sum_hw and Sum_sw remain balanced.
• We also apply heuristics to effectively limit the search space of the algorithm.

HW-SW Partitioner
Area Estimation
• Resources required by tasks are divided into two types:
  1. Shared - adders, subtractors, multipliers, dividers
  2. Unshared - interconnect and controller
• Shared resource area is estimated by taking the union of the shared resources required by all the HW tasks.
• Unshared resource area is estimated by adding the area associated with the unshared resources of all the HW tasks.
• Total area is estimated by taking the sum of the area requirements of shared and unshared resources.

Pipelined Scheduling
[Flowchart]
  1. Try to obtain a task schedule which executes in II time. (use list scheduling)
  2. Schd. Found? If Yes: Success. If No: go to step 3.
  3. Select a dependency to retime. (use RECOD Step 1)
  4. Dependency found? If No: Failure. If Yes: go to step 5.
  5. Retiming Transformation (use RECOD Step 2), then try scheduling again.

RECOD Step 1: Select a dependency to retime
  1. Dependency is an intra loop dependency (ILD).
  2. Dependency between tasks bound to heterogeneous processors.
  3. Dependency whose predecessor task belongs to the longer constraining path.
  4. Dependency representing the least number of data items transferred.
[Figure: example task graph with tasks A to I, each bound to HW or SW, dependencies labeled d0 or d1, and annotations Var = 20 and Var = 10.]

RECOD Step 2: Partition to minimize increase in memory requirements.
[Figure: task graph with tasks A to I divided into Set P, Set R and Set S, with the Cutset marked.]
• Cost function for the partitioner:
  \[ Cost = \sum_{(u,w) \in cutset} v(u, w) \]
• Retiming Transformation:
  - for each (u, v) ∈ Cutset: the value on dependency (u, v) changes by 1
  - for each u ∈ pred(Cutset): d(u) changes by 1

JPEG Case Study
• We specified the JPEG image compression algorithm as a task graph with 12 tasks.
• We then obtained pipelined codesign implementations by specifying different constraints on the II and HW area.
[Plot: Constraints and Results. Area (sq mm, 0 to 100) versus Initiation Interval (ms, 0 to 400).]

Execution Time
• We evaluated the runtime of the tool by invoking it for 50 random task graphs and searching for optimal HW-SW partitions.
[Plot: Execution Times (secs, logarithmic axis from 0.001 to 10000) versus Task Graphs (in ascending order of number of nodes).]

Percentage Deviation
• We calculated the percentage deviation in initiation interval of the initial partition from the final partition.
• The average percentage deviation was 8.4%.
[Plot: Percentage deviation of initial solution from final (0 to 30) versus Task Graphs (in ascending order of number of nodes).]

Conclusion
• The tool can optimize the throughput, area, pipeline stages and memory requirements of a pipelined HW-SW system.
• The tool can obtain solutions for task graphs with up to 30 nodes within a short period of time.
• Although it assumes a single SW processor and a single HW coprocessor, the technique can be extended to multiple processor architectures.
• The limitation of the tool is its inability to handle large task graphs (> 30 nodes) in a reasonable amount of time. A time-out option for the branch and bound partitioner can overcome this limitation.
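
Worked example: MII of the four-task graph
The definition MII = max (Sum_hw, Sum_sw) from the Some Definitions slide can be checked by hand on the four-task example. The sketch below is only an illustration and is not the tool's partitioner: it enumerates every HW/SW binding exhaustively instead of using branch and bound, and it ignores the area constraint and communication. The function names are hypothetical. It also assumes that S and H denote SW and HW execution times and that the four timing labels belong to tasks A to D in the order they appear on the slide.

```python
from itertools import combinations

# Execution times in ns as (SW time, HW time).
# Assumption: S/H are SW/HW times and the labels map to A-D in slide order.
TASKS = {
    "A": (225, 175),
    "B": (400, 150),
    "C": (200, 100),
    "D": (100, 400),
}


def mii(tasks, hw_tasks):
    """Return max(Sum_hw, Sum_sw) when the tasks in hw_tasks are bound to HW."""
    sum_hw = sum(hw for name, (sw, hw) in tasks.items() if name in hw_tasks)
    sum_sw = sum(sw for name, (sw, hw) in tasks.items() if name not in hw_tasks)
    return max(sum_hw, sum_sw)


def best_partitions(tasks):
    """Try every HW/SW binding; return the lowest MII and the HW sets reaching it."""
    names = sorted(tasks)
    scored = [
        (mii(tasks, set(hw)), hw)
        for size in range(len(names) + 1)
        for hw in combinations(names, size)
    ]
    lowest = min(score for score, _ in scored)
    return lowest, [hw for score, hw in scored if score == lowest]


if __name__ == "__main__":
    lowest, bindings = best_partitions(TASKS)
    print(f"lowest MII = {lowest} ns, HW bindings = {bindings}")
```

Under these assumed numbers, binding B and C to HW gives Sum_hw = 250 ns and Sum_sw = 325 ns, so the MII is 325 ns. Exhaustive enumeration is feasible here only because there are 16 bindings; the number doubles with every added task, which is why a bounded search is needed for larger graphs.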