CoSynthesis_Algorithms-Partitioning.ppt

Co-Synthesis Algorithms:
HW/SW Partitioning
Part of
HW/SW Codesign of Embedded
Systems Course (CE 40-226)
Winter-Spring 2001
Codesign of Embedded Systems
Topics




Introduction
Preliminaries
Hardware/Software Partitioning
Distributed System Co-Synthesis
Topics



Introduction
A Classification
Examples


Vulcan
Cosyma
Introduction to
HW/SW Partitioning


The first variety of co-synthesis applications
Definition


A HW/SW partitioning algorithm implements a
specification on some sort of multiprocessor
architecture
Usually

Multiprocessor architecture = one CPU + some
ASICs on CPU bus
Introduction to
HW/SW Partitioning (cont’d)

A Terminology

Allocation


Synthesis methods which design the multiprocessor
topology along with the PEs and SW architecture
Scheduling

The process of assigning PE (CPU and/or ASIC) time to
processes so that they can execute
Introduction to
HW/SW Partitioning (cont’d)

In most partitioning algorithms


Type of CPU is fixed and given
ASICs must be synthesized



What function to implement on each ASIC?
What characteristics should the implementation have?
Most are single-rate synthesis problems

CDFG is the starting model
HW/SW Partitioning (cont’d)

Normal use of architectural components



CPU performs less computationally-intensive
functions
ASICs used to accelerate core functions
Where to use?

High-performance applications


No CPU is fast enough for the operations
Low-cost applications

ASIC accelerators allow use of much smaller, cheaper
CPU
A Classification

Criterion: Optimization Strategy


Primal Approach



Trade-off between Performance and Cost
Performance is the primary goal
First, all functionality in ASICs. Progressively move more
to CPU to reduce cost.
Dual Approach


Cost is the primary goal
First, all functions in the CPU. Move operations to the
ASIC to meet the performance goal.
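
The two strategies can be contrasted with a minimal sketch. It is not taken from Vulcan or Cosyma: the `Func` record, the sequential timing model (no communication cost), and the greedy orderings are all assumptions made only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Func:
    name: str
    sw_time: float   # execution time on the CPU (hypothetical units)
    hw_time: float   # execution time on an ASIC
    hw_cost: float   # ASIC cost if implemented in hardware (assumed > 0)

def total_time(funcs, in_hw):
    # Toy model: functions run one after another, communication is ignored.
    return sum(f.hw_time if f.name in in_hw else f.sw_time for f in funcs)

def total_cost(funcs, in_hw):
    return sum(f.hw_cost for f in funcs if f.name in in_hw)

def primal(funcs, deadline):
    """Start all-HW; move functions to the CPU while the deadline still holds."""
    in_hw = {f.name for f in funcs}
    # Try the most expensive hardware first: removing it saves the most cost.
    for f in sorted(funcs, key=lambda f: f.hw_cost, reverse=True):
        trial = in_hw - {f.name}
        if total_time(funcs, trial) <= deadline:
            in_hw = trial
    return in_hw

def dual(funcs, deadline):
    """Start all-SW; move functions to the ASIC until the deadline is met."""
    in_hw = set()
    # Move the best speedup-per-cost candidates first.
    by_gain = sorted(funcs, key=lambda f: (f.sw_time - f.hw_time) / f.hw_cost,
                     reverse=True)
    for f in by_gain:
        if total_time(funcs, in_hw) <= deadline:
            break
        in_hw = in_hw | {f.name}
    return in_hw

funcs = [Func("fir", 10, 2, 5), Func("fft", 20, 4, 8), Func("ctrl", 3, 2, 4)]
print(primal(funcs, deadline=20), dual(funcs, deadline=20))
```

On this toy input both directions end with only `fft` in hardware; in general the two strategies can settle on different partitions because they approach the trade-off from opposite ends.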
A Classification (cont’d)

Classification by optimization strategy
(cont’d)

Example co-synthesis systems


Vulcan (Stanford): Primal strategy
Cosyma (Braunschweig, Germany): Dual strategy
Co-Synthesis Algorithms:
HW/SW Partitioning
HW/SW Partitioning Examples:
Vulcan
Partitioning Examples:
Vulcan


Gupta, De Micheli, Stanford University
Primal approach
1. All-HW initial implementation.
2. Iteratively move functionality to CPU to reduce
cost.

System specification language

HardwareC

Is compiled into a flow graph
Partitioning Examples:
Vulcan (cont’d)
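HardwareC source:
x=a; y=b;
if (c>d) x=e;
else y=f;

Flow graph (figure): nodes nop, x=a, y=b, cond, x=e, y=f.
The edges out of nop carry condition 1; the edges out of cond
carry c>d (to x=e) and c<=d (to y=f).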
Partitioning Examples:
Vulcan (cont’d)

Flow Graph Definition


A variation of a (single-rate) task graph
Nodes



Represent operations
Typically low-level operations: mult, add
Edges


Represent data dependencies
Each contains a Boolean condition under which the edge
is traversed
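
As a rough illustration of this definition, the sketch below stores a flow graph as operation nodes plus edges that each carry a Boolean condition. The `FlowGraph` class and node ids are hypothetical, and the graph built here encodes the fragment `x=a; y=b; if (c>d) x=e; else y=f;` under the assumption that both assignments feed the `cond` node.

```python
from dataclasses import dataclass, field

def always(env):
    """Edge condition that is unconditionally true (written '1' on the slide)."""
    return True

@dataclass
class FlowGraph:
    nodes: dict = field(default_factory=dict)   # node id -> operation label
    edges: list = field(default_factory=list)   # (src, dst, condition) triples

    def add_node(self, node_id, op):
        self.nodes[node_id] = op

    def add_edge(self, src, dst, cond=always):
        self.edges.append((src, dst, cond))

    def successors(self, node_id, env):
        """Successors reached over edges whose condition holds for env."""
        return [dst for src, dst, cond in self.edges
                if src == node_id and cond(env)]

g = FlowGraph()
for nid, op in [("n0", "nop"), ("n1", "x=a"), ("n2", "y=b"),
                ("n3", "cond"), ("n4", "x=e"), ("n5", "y=f")]:
    g.add_node(nid, op)
g.add_edge("n0", "n1")
g.add_edge("n0", "n2")
g.add_edge("n1", "n3")   # assumed: both assignments precede the condition
g.add_edge("n2", "n3")
g.add_edge("n3", "n4", lambda env: env["c"] > env["d"])
g.add_edge("n3", "n5", lambda env: env["c"] <= env["d"])

print(g.successors("n3", {"c": 3, "d": 1}))
```

Keeping the condition on the edge, rather than in the node, is what lets one graph represent both data dependencies and control flow.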
Partitioning Examples:
Vulcan (cont’d)

Flow Graph


is executed repeatedly at some rate
can have initiation-time constraints for each node


t(v_i) + l_ij <= t(v_j) <= t(v_i) + u_ij
can have rate constraints on each node

m_i <= R_i <= M_i
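
A small sketch of how such constraints could be checked against a candidate schedule. It reads the initiation-time constraint as a lower bound l_ij and an upper bound u_ij on the start of node j relative to node i, and the rate constraint as bounds m_i and M_i on the rate R_i; the function names and data layout are assumptions, not part of Vulcan.

```python
def violated_initiation(t, constraints):
    """t maps node -> initiation time.
    constraints is a list of (i, j, l_ij, u_ij) tuples.
    Returns the constraints for which t[i]+l <= t[j] <= t[i]+u fails."""
    return [(i, j, l, u) for i, j, l, u in constraints
            if not (t[i] + l <= t[j] <= t[i] + u)]

def violated_rates(rates, bounds):
    """rates maps node -> R_i; bounds maps node -> (m_i, M_i).
    Returns the nodes whose rate falls outside its bounds."""
    return [n for n, (lo, hi) in bounds.items()
            if not (lo <= rates[n] <= hi)]

t = {"a": 0, "b": 5}
print(violated_initiation(t, [("a", "b", 2, 4)]))   # b starts too late
print(violated_rates({"a": 10}, {"a": (5, 20)}))    # within bounds
```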
Partitioning Examples:
Vulcan (cont’d)

Vulcan Co-synthesis Algorithm

Partitioning quantum is a thread


Algorithm divides the flow graph into threads and
allocates them
Thread boundary is determined by
1. (always) a non-deterministic delay element, such as a wait
for an external variable
2. (optionally) other points of the flow graph

Target architecture

CPU + Co-processor (multiple ASICs)
Partitioning Examples:
Vulcan (cont’d)

Vulcan Co-synthesis algorithm (cont’d)

Allocation


Primal approach
Scheduling

is done by a scheduler on the target CPU



is generated as part of synthesis process
schedules all threads (both HW and SW threads)
cannot be static, because some threads have non-deterministic
initiation times
Partitioning Examples:
Vulcan (cont’d)

Vulcan Co-synthesis algorithm (cont’d)

Cost estimation

SW implementation



Code size
  Relatively straightforward
Data size
  Biggest challenge.
  Vulcan puts some effort into finding bounds for each
  thread
HW implementation
  ?
Partitioning Examples:
Vulcan (cont’d)

Vulcan Co-synthesis algorithm (cont’d)

Performance estimation

Both SW and HW implementations

Estimated from the flow graph and basic execution times for the
operators
Partitioning Examples:
Vulcan (cont’d)

Algorithm Details

Partitioning goal

Allocate each thread to one of two partitions



CPU Set: FS
Co-processor set: FH
Required execution-rate must be met, and total cost
minimized
Partitioning Examples:
Vulcan (cont’d)

Algorithm Details (cont’d)
Algorithm steps
1. Put all threads in FH set
2. Iteratively do
2.1. Move some operations to FS.
2.1.1. Select a group of operations to move to FS.
2.1.2. Check performance feasibility, by computing
worst-case delay through flow-graph given the new
thread times
2.1.3. Do the move, if feasible
2.2. Incrementally update the new cost-function to reflect
the new partition
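
The steps above can be sketched roughly as follows. This is a simplified illustration, not Vulcan's implementation: threads are moved one at a time, candidates are taken in list order instead of being chosen by the cost function (so step 2.2 is omitted), and the worst-case delay is modeled as the longest path through a thread DAG. The dictionary layout and the names `worst_case_delay` and `vulcan_like_partition` are assumptions.

```python
from collections import deque

def worst_case_delay(threads, succ, in_sw):
    """Longest path through the thread DAG for the given partition."""
    memo = {}
    def finish(t):
        if t not in memo:
            own = threads[t]["sw"] if t in in_sw else threads[t]["hw"]
            memo[t] = own + max((finish(s) for s in succ.get(t, [])), default=0)
        return memo[t]
    return max(finish(t) for t in threads)

def vulcan_like_partition(threads, succ, max_delay):
    f_h, f_s = set(threads), set()      # step 1: all threads start in FH
    pending = deque(threads)            # threads not yet evaluated
    while pending:                      # step 2: iterate
        t = pending.popleft()           # 2.1.1: select a candidate
        trial = f_s | {t}
        if worst_case_delay(threads, succ, trial) <= max_delay:   # 2.1.2
            f_s, f_h = trial, f_h - {t}                           # 2.1.3
            # Evaluate the immediate successors of a moved thread next.
            for s in reversed(succ.get(t, [])):
                if s in pending:
                    pending.remove(s)
                    pending.appendleft(s)
    return f_h, f_s

threads = {"A": {"hw": 1, "sw": 4}, "B": {"hw": 2, "sw": 3},
           "C": {"hw": 1, "sw": 6}, "D": {"hw": 1, "sw": 2}}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
f_h, f_s = vulcan_like_partition(threads, succ, max_delay=9)
print(sorted(f_h), sorted(f_s))
```

Each thread is evaluated once and a move is never undone, so the loop terminates after at most one pass over the threads.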
Partitioning Examples:
Vulcan (cont’d)

Algorithm Details (cont’d)

Vulcan cost function
f(w) = c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m|

c1..c5: weight constants
Sh(), Ss(): size functions of the hardware and software partitions
B: bus utilization (<1)
P: processor utilization (<1)
m: total number of variables to be transferred
between the CPU and the co-processor
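
For concreteness, the cost function can be evaluated as in the sketch below. The parameter names and the default weights of 1.0 are assumptions; the slides do not give the weight values Vulcan uses.

```python
def vulcan_cost(size_hw, size_sw, bus_util, cpu_util, n_transfers,
                c=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """f(w) = c1*Sh(FH) - c2*Ss(FS) + c3*B - c4*P + c5*|m|.
    The caller supplies the already-estimated sizes and utilizations."""
    c1, c2, c3, c4, c5 = c
    return (c1 * size_hw - c2 * size_sw
            + c3 * bus_util - c4 * cpu_util + c5 * n_transfers)

print(vulcan_cost(size_hw=100, size_sw=20, bus_util=0.5,
                  cpu_util=0.25, n_transfers=4))
```

The signs follow the formula: hardware size, bus utilization, and transferred variables raise the value, while software size and processor utilization lower it.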
Partitioning Examples:
Vulcan (cont’d)

Algorithm Details (cont’d)
Complementary notes

A heuristic to minimize communication

Once a thread is moved to FS, its immediate successors
are placed in the list for evaluation in the next iteration.

No back-track


Once a thread is assigned to FS, it remains there
Experimental results

The designs produced are considerably faster than all-SW
implementations, yet much cheaper than all-HW designs
Co-Synthesis Algorithms:
HW/SW Partitioning
HW/SW Partitioning Examples:
Cosyma
Partitioning Examples:
Cosyma


Rolf Ernst, et al: Technical University of
Braunschweig, Germany
Dual approach
1. All-SW initial implementation.
2. Iteratively move basic blocks to the ASIC
accelerator to meet performance objective.

System specification language

Cx


Is compiled into an ESG (Extended Syntax Graph)
ESG is much like a CDFG
Partitioning Examples:
Cosyma (cont’d)

Cosyma Co-synthesis Algorithm

Partitioning quantum is a basic block

A basic block is a branch-free block of program

Target architecture

CPU + accelerator ASIC(s)

Scheduling
Allocation
Cost Estimation
Performance Estimation
Algorithm Details
Partitioning Examples:
Cosyma (cont’d)

Cosyma Co-synthesis Algorithm (cont’d)

Performance Estimation

SW implementation

Done by examining the object code for the basic block
generated by a compiler

HW implementation

Assumes one operator per clock cycle.
Creates a list schedule for the DFG of the basic block.
Depth of the list gives the number of clock cycles required.

Communication

Done by data-flow analysis of the adjacent basic blocks.
In shared memory
  Proportional to the number of variables to be accessed
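
The hardware estimate can be illustrated with a minimal list scheduler in which every operator takes one clock cycle and the number of scheduled cycles is the estimate. The DFG encoding (operator mapped to the operators it depends on) and the optional `max_ops_per_cycle` resource limit are assumptions added for the sketch, not details of Cosyma.

```python
def list_schedule(dfg, max_ops_per_cycle=None):
    """dfg maps each operator to the list of operators it depends on.
    Every operator takes one clock cycle.
    Returns a list of cycles, each a list of operators run in that cycle."""
    if max_ops_per_cycle is not None and max_ops_per_cycle < 1:
        raise ValueError("max_ops_per_cycle must be at least 1")
    done, schedule = set(), []
    remaining = list(dfg)
    while remaining:
        # An operator is ready once all of its inputs have been computed.
        ready = [op for op in remaining if all(d in done for d in dfg[op])]
        if not ready:
            raise ValueError("dependency cycle in DFG")
        if max_ops_per_cycle is not None:
            ready = ready[:max_ops_per_cycle]
        schedule.append(ready)
        done.update(ready)
        remaining = [op for op in remaining if op not in done]
    return schedule

# Hypothetical basic block: (a*b + c*d) - e
dfg = {"mul1": [], "mul2": [], "add": ["mul1", "mul2"], "sub": ["add"]}
print(len(list_schedule(dfg)))       # cycles with unlimited operators
print(len(list_schedule(dfg, 1)))    # cycles with one operator per cycle
```

With unlimited operators the depth equals the longest dependency chain; a tighter resource limit lengthens the schedule.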
Partitioning Examples:
Cosyma (cont’d)

Algorithm Steps

Change in execution-time caused by moving basic
block b from CPU to ASIC:
Δc(b) = w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z ∪ {b})) * It(b)

w: constant weight
t_HW(b), t_SW(b): execution time of basic block b in hardware and
in software
t_com(Z): estimated communication time between the CPU and
the accelerator ASIC, given a set Z of basic blocks
implemented on the ASIC
It(b): total number of times that b is executed
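
A minimal sketch of evaluating this expression, implemented exactly as written on the slide, including the signs of the communication terms. The communication model in `make_t_com` (a fixed number of cycles per variable crossing the CPU/ASIC boundary) and all numbers are assumptions chosen only to make the example runnable.

```python
def make_t_com(shared_vars, cycles_per_var=2):
    """shared_vars maps (src_block, dst_block) -> number of variables passed.
    Returns t_com(Z): cost of the variables crossing the CPU/ASIC boundary."""
    def t_com(Z):
        return cycles_per_var * sum(
            n for (src, dst), n in shared_vars.items()
            if (src in Z) != (dst in Z))
    return t_com

def delta_c(b, Z, t_hw, t_sw, t_com, iterations, w=1.0):
    """w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z | {b})) * It(b)."""
    Z = frozenset(Z)
    return w * (t_hw[b] - t_sw[b] + t_com(Z) - t_com(Z | {b})) * iterations[b]

t_hw = {"b1": 4, "b2": 3}
t_sw = {"b1": 20, "b2": 12}
iterations = {"b1": 100, "b2": 100}
t_com = make_t_com({("b0", "b1"): 2, ("b1", "b2"): 3, ("b2", "b3"): 1})

print(delta_c("b1", set(), t_hw, t_sw, t_com, iterations))
print(delta_c("b2", {"b1"}, t_hw, t_sw, t_com, iterations))
```

Because t_com depends on the whole set Z, the value for a block changes as its neighbors move, which is why the estimate has to be recomputed for each candidate move.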
Partitioning Examples:
Cosyma (cont’d)

Experimental Results

By moving only basic blocks to HW

Typical speedup of only 2x
Reason:
  Limited intra-basic-block parallelism
Cure:
  Implement several control-flow optimizations to increase
  parallelism in the basic block, and hence in the ASIC
  Examples: loop pipelining, speculative branch execution with
  multiple branch prediction, operator pipelining
Result:
  Speedups: 2.7 to 9.7
  CPU times: 35 to 304 seconds on a typical workstation
What we learned today


HW/SW Partitioning: One broad category of
co-synthesis algorithms
Criteria by which a co-synthesis algorithm is
categorized