A Tool for Partitioning and Pipelined Scheduling of Hardware-Software Systems
Karam S Chatha and Ranga Vemuri
Department of ECECS
University of Cincinnati
{kchatha,ranga}@ececs.uc.edu
Organization of Talk
• Introduction
• Overview of Tool
• Codesign partitioner
• Pipelined Scheduler
• Results
• Conclusion
Introduction
• Motivation:
The throughput of a loop-oriented HW-SW
application can be maximized by obtaining a
pipelined implementation.
• Objective:
To obtain a pipelined implementation of the
application on the codesign architecture such that:
- Throughput constraint is satisfied
- HW area constraint is satisfied
- Number of pipeline stages is minimized
- Increase in memory requirement is minimized
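Introduction: Architecture and Task Graph
• Architecture:
- HW co-processor and SW processor
- Local memory: for SW-SW communication
- Shared memory: for SW-HW, HW-SW and HW-HW communication
• Task graph: tasks A, B, C and D, with 10 data items per dependence.
• Task annotations shown in the figure (S = SW execution time, H = HW execution time, followed by a resource list):
- S = 225 ns, H = 175 ns (8 +, …)
- S = 400 ns, H = 150 ns (4 *, 8 +, …)
- S = 200 ns, H = 100 ns (4 *, 8 -, …)
- S = 100 ns, H = 400 ns (3 *, 3 +, …)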
Some Definitions
• A pipelined design is characterized by its initiation interval.
• Initiation interval (II) is the time difference between the
starts of two consecutive iterations in the steady state.
• Given a partitioned task graph there exists a theoretical
lower bound on the II of its pipelined schedule called the
Minimum Initiation Interval (MII). For a directed acyclic
task graph the MII is given by:
MII = max (Sum_hw, Sum_sw)
where Sum_hw is the sum of execution times of tasks bound
to HW and Sum_sw is the sum of execution times of tasks
bound to SW.
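The MII formula above can be sketched in a few lines. This is only an illustration: the task names, the binding and the dictionary layout are hypothetical, and the execution times are illustrative values in ns.

```python
def minimum_initiation_interval(exec_times, binding):
    """Return MII = max(Sum_hw, Sum_sw) for a partitioned, acyclic task graph.

    exec_times maps task -> {"SW": time, "HW": time} (hypothetical layout).
    binding maps task -> "SW" or "HW".
    """
    sum_hw = sum(exec_times[t]["HW"] for t, side in binding.items() if side == "HW")
    sum_sw = sum(exec_times[t]["SW"] for t, side in binding.items() if side == "SW")
    return max(sum_hw, sum_sw)


# Illustrative execution times in ns; task names and binding are assumptions.
exec_times = {
    "t1": {"SW": 400, "HW": 150},
    "t2": {"SW": 200, "HW": 100},
    "t3": {"SW": 225, "HW": 175},
    "t4": {"SW": 100, "HW": 400},
}
binding = {"t1": "HW", "t2": "HW", "t3": "SW", "t4": "SW"}

# Sum_hw = 150 + 100 = 250, Sum_sw = 225 + 100 = 325
mii = minimum_initiation_interval(exec_times, binding)
```

With this binding the SW processor is the bottleneck, so moving work off it is the only way to lower the bound.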
Overview of Tool
Inputs: task graph, architecture, throughput and area constraints.
1. HW-SW codesign: partition the design to satisfy the throughput
and area constraints.
2. Constraint satisfied? If NO: unable to design with the given
constraints.
3. If YES: calculate MII and set II = MII.
4. Obtain a pipelined schedule which executes in II time. The scheduler
must satisfy the throughput constraint, minimize the number of pipeline
stages and minimize the increase in memory requirements.
5. Schedule found? If YES: output the successful design.
6. If NO: increase II and test whether II > constraint. If NO, repeat
step 4 with the new II. If YES, the current partition cannot meet the
throughput constraint.
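The flow above can be sketched as a small driver loop. The sketch rests on several assumptions: `partition`, `mii_of` and `schedule_for_ii` are hypothetical callables standing in for the partitioner, the MII calculation and the pipelined scheduler, `ii_step` is an invented increment, and the sketch simply gives up once II exceeds the constraint, which may not be what the tool does on that branch.

```python
def design_flow(partition, mii_of, schedule_for_ii, ii_constraint, ii_step=1):
    """Sketch of the overall flow; every callable here is a hypothetical stand-in.

    Returns (partition, ii, schedule) on success, or None on failure.
    """
    part = partition()
    if part is None:
        # Partitioner could not satisfy the throughput and area constraints.
        return None

    ii = mii_of(part)  # start from the theoretical lower bound
    while ii <= ii_constraint:
        schedule = schedule_for_ii(part, ii)
        if schedule is not None:
            return part, ii, schedule
        ii += ii_step  # no schedule at this II: relax it and retry
    # Assumption: stop here once II exceeds the throughput constraint.
    return None


# Toy stand-ins: scheduling only succeeds once II reaches 327.
result = design_flow(
    partition=lambda: "p0",
    mii_of=lambda part: 325,
    schedule_for_ii=lambda part, ii: ("sched", ii) if ii >= 327 else None,
    ii_constraint=400,
)
```

Starting at MII and only relaxing II on failure means the first schedule found has the smallest II the scheduler can achieve for that partition at the chosen step size.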
HW-SW Partitioner
• Branch and bound algorithm
• Initial solution tries to minimize MII
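- Suitability of task v to be assigned to HW is given by:
\[
\mathrm{suit}(v) =
\begin{cases}
\left( \dfrac{v_{spm}}{max_{spm}} + K \cdot \dfrac{v_{cr}}{max_{cr}} \right) \cdot \dfrac{1}{v_{area}}, & \text{area constraint} \\[2ex]
\dfrac{v_{spm}}{max_{spm}} + K \cdot \dfrac{v_{cr}}{max_{cr}}, & \text{no area constraint}
\end{cases}
\]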
- Sort tasks in descending order of their suitabilities.
- Assign tasks to HW and SW alternately from the front and back
of the sorted list so that Sum_hw and Sum_sw remain
balanced.
• We also apply heuristics to effectively limit the search space
of the algorithm.
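A minimal sketch of how such an initial solution might be built is given below. It takes the suitability values as precomputed inputs, and the balancing rule (give the next task to whichever side currently has the smaller sum) is an assumption, as are the task names and numbers. It is not the branch and bound search itself.

```python
from collections import deque


def initial_partition(tasks):
    """Build an initial HW/SW binding from a suitability-sorted list.

    tasks: list of (name, suitability, sw_time, hw_time) tuples (hypothetical layout).
    Returns (hw_tasks, sw_tasks, mii).
    """
    # Descending suitability: front = best HW candidates, back = best SW candidates.
    ordered = deque(sorted(tasks, key=lambda t: t[1], reverse=True))
    hw_tasks, sw_tasks = [], []
    sum_hw = sum_sw = 0
    while ordered:
        if sum_hw <= sum_sw:
            # Assumed rule: HW is less loaded, so take from the front.
            name, _, _, hw_time = ordered.popleft()
            hw_tasks.append(name)
            sum_hw += hw_time
        else:
            # SW is less loaded, so take from the back.
            name, _, sw_time, _ = ordered.pop()
            sw_tasks.append(name)
            sum_sw += sw_time
    return hw_tasks, sw_tasks, max(sum_hw, sum_sw)


# Illustrative data: suitabilities and times (ns) are made up for the example.
tasks = [
    ("t1", 0.9, 400, 150),
    ("t2", 0.7, 200, 100),
    ("t3", 0.4, 225, 175),
    ("t4", 0.1, 100, 400),
]
hw_tasks, sw_tasks, mii = initial_partition(tasks)
```

A good initial solution matters for branch and bound because its MII serves as the first bound against which partial solutions are pruned.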
HW-SW Partitioner
Area Estimation
• Resources required by tasks divided into two types:
1. Shared - adders, subtractors, multipliers, dividers
2. Unshared - interconnect and controller
• Shared resource area estimated by taking the union of the
shared resources required by all the HW tasks.
• Unshared resource area estimated by adding the area
associated with the unshared resources of all the HW tasks.
• Total area estimated by taking the sum of area requirements
of shared and unshared resources.
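The estimate above can be sketched as follows. The sketch assumes that "union" means taking, for each shared resource type, the largest count needed by any single HW task; the per-unit areas, field names and numbers are hypothetical.

```python
def estimate_hw_area(hw_tasks, unit_area):
    """Estimate total HW area as shared-resource area plus unshared-resource area.

    hw_tasks: list of {"shared": {resource: count}, "unshared_area": area}.
    unit_area: area of one instance of each shared resource type (hypothetical).
    """
    # Union of shared resources, interpreted as the per-type maximum count.
    shared = {}
    for task in hw_tasks:
        for kind, count in task["shared"].items():
            shared[kind] = max(shared.get(kind, 0), count)
    shared_area = sum(unit_area[kind] * count for kind, count in shared.items())

    # Unshared resources (interconnect, controller) simply add up.
    unshared_area = sum(task["unshared_area"] for task in hw_tasks)
    return shared_area + unshared_area


# Illustrative values only.
unit_area = {"mul": 2.0, "add": 0.5}
hw_tasks = [
    {"shared": {"mul": 4, "add": 8}, "unshared_area": 3.0},
    {"shared": {"mul": 3, "add": 3}, "unshared_area": 2.0},
]
# Shared: 4 mul + 8 add = 8.0 + 4.0; unshared: 3.0 + 2.0
area = estimate_hw_area(hw_tasks, unit_area)
```

Because shared resources are counted once, adding a task whose needs are already covered only increases the estimate by its unshared part.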
Pipelined Scheduling
1. Try to obtain a task schedule which executes in II time
(use list scheduling).
2. Schedule found? If yes: success.
3. If no: select a dependency to retime (use RECOD Step 1).
4. Dependency found? If no: failure.
5. If yes: apply the retiming transformation (use RECOD Step 2)
and return to step 1.
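The scheduling loop above can be sketched as below. The three callables are hypothetical placeholders for list scheduling, RECOD Step 1 and RECOD Step 2; only the control flow is taken from the slide.

```python
def pipelined_schedule(graph, ii, list_schedule, select_dependency, retime):
    """Alternate list scheduling and retiming until a schedule fits in II.

    list_schedule(graph, ii) -> schedule or None      (hypothetical)
    select_dependency(graph) -> dependency or None    (stands in for RECOD Step 1)
    retime(graph, dependency) -> transformed graph    (stands in for RECOD Step 2)
    """
    while True:
        schedule = list_schedule(graph, ii)
        if schedule is not None:
            return schedule  # success
        dependency = select_dependency(graph)
        if dependency is None:
            return None  # failure: nothing left to retime
        graph = retime(graph, dependency)


# Toy stand-ins: the "graph" is just a count of retimings applied so far.
found = pipelined_schedule(
    graph=0,
    ii=325,
    list_schedule=lambda g, ii: ("sched", g) if g >= 2 else None,
    select_dependency=lambda g: "dep" if g < 5 else None,
    retime=lambda g, dep: g + 1,
)
```

The loop terminates only if the selector eventually runs out of candidate dependencies, so a real selector must not keep returning dependencies forever.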
RECOD Step 1: Select a dependency to retime
1. Dependency is an intra-loop dependency (ILD).
2. Dependency between tasks bound to heterogeneous processors.
3. Dependency whose predecessor task belongs to the longer
constraining path.
4. Dependency representing the least number of data items transferred.
[Figure: example task graph with tasks A to I, each bound to HW or SW, dependencies labeled d0 or d1, and edge annotations Var = 20 and Var = 10]
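RECOD Step 2: Partition to minimize
increase in memory requirements.
Cost function for the partitioner:
\[
\mathrm{Cost} = \sum_{(u,w) \in \mathrm{cutset}} v(u,w)
\]
Retiming Transformation:
\[
\forall (u,v) \in \mathrm{Cutset}: \quad \delta(u,v) \leftarrow \delta(u,v) + 1
\]
\[
\forall u \in \mathrm{pred}(\mathrm{Cutset}): \quad d(u) \leftarrow d(u) + 1
\]
[Figure: task graph with tasks A to I partitioned into Set P, Set R and Set S, with the cutset marked]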
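To make the cost function and the retiming update concrete, here is a small sketch. It assumes that the cost of a cut is the total number of data items carried by its edges, and that retiming adds one to the value attached to each cutset edge and to each task on the predecessor side. The graph, the names `distance` and `shift`, and the candidate cuts are hypothetical.

```python
def cut_cost(cutset, data_items):
    """Cost of a cut: total data items carried by its edges (assumed reading)."""
    return sum(data_items[edge] for edge in cutset)


def retime(distance, shift, cutset, pred_tasks):
    """Return updated copies after one retiming step (assumed +1 updates).

    distance: edge -> per-edge value; shift: task -> per-task value.
    pred_tasks: tasks on the predecessor side of the cut.
    """
    new_distance = dict(distance)
    for edge in cutset:
        new_distance[edge] += 1
    new_shift = dict(shift)
    for task in pred_tasks:
        new_shift[task] = new_shift.get(task, 0) + 1
    return new_distance, new_shift


# Illustrative four-task graph; data item counts are made up.
data_items = {("A", "B"): 20, ("A", "C"): 10, ("B", "D"): 5, ("C", "D"): 5}
distance = {edge: 0 for edge in data_items}
candidates = {
    "after_A": ([("A", "B"), ("A", "C")], ["A"]),
    "before_D": ([("B", "D"), ("C", "D")], ["A", "B", "C"]),
}
# Pick the cut that carries the fewest data items.
best = min(candidates, key=lambda name: cut_cost(candidates[name][0], data_items))
cutset, pred_tasks = candidates[best]
new_distance, new_shift = retime(distance, {}, cutset, pred_tasks)
```

Every edge crossing the cut must buffer its data for one extra iteration, which is why a cut with fewer data items gives a smaller increase in memory.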
JPEG Case Study
• We specified the JPEG image compression algorithm as a task graph with
12 tasks.
• We then obtained pipelined codesign implementations by specifying
different constraints on the II and HW area.
[Figure: area (sq mm) versus initiation interval (ms), showing constraints and results]
Execution Time
• We evaluated the runtime of the tool by invoking it for 50 random task
graphs and searching for optimal HW-SW partitions.
[Figure: execution times (secs, logarithmic scale) for the 50 task graphs, in ascending order of number of nodes]
Percentage Deviation of Initial Solution from Final
• We calculated the percentage deviation in initiation interval of the initial
partition from the final partition.
• The average percentage deviation was 8.4%.
[Figure: percentage deviation for the 50 task graphs, in ascending order of number of nodes]
Conclusion
• The tool can optimize the throughput, area, pipeline stages
and memory requirements of a pipelined HW-SW system.
• The tool can obtain solutions for task graphs with up to 30
nodes within a short period of time.
• Although it assumes a single SW processor and a single HW
coprocessor, the technique can be extended to multiple
processor architectures.
• The limitation of the tool is its inability to handle large task
graphs (> 30 nodes) in a reasonable amount of time.
• A time-out option for the branch and bound partitioner can
overcome this limitation.