High-Level Synthesis of High-Performance Microprocessor Blocks
SPARK High-Level Synthesis System
Sumit Gupta
Nick Savoiu
Nikil Dutt
Rajesh Gupta
Alex Nicolau
Timothy Kam
Michael Kishinevsky
Steve Haynal
Abdallah Tabbara
Center for Embedded Computer Systems, University of California, Irvine
http://www.cecs.uci.edu/~spark
Strategic CAD Labs, Design Technologies, Intel Inc, Hillsboro
http://www.intel.com/research/scl
Supported by Semiconductor Research Corporation and Intel
Copyright CECS & The Spark Project
08/31/2001
Overview
- Brief background
- Spark High-Level Synthesis Framework
- Previous work in the Spark framework
- High-level synthesis for microprocessor blocks
- Instruction Length Decoder
  - Design behavior
  - Steps involved in synthesis
- Work done this summer at SCL
- Future plans
High-Level Synthesis: From C to CDFG to Architecture
Scheduling with Given Resource Allocation
[Figure: an example schedule under a resource constraint of one adder (+) and one comparator (<)]
The Spark High-Level Synthesis Framework
Limitations of high-level synthesis targeted by Spark
- Quality of synthesis results is severely affected by complex control flow
  - Control-flow style affects the effectiveness of optimizations
  - Nested ifs and loops are not handled, or are handled poorly
- Poor understanding (much less integration) of the interaction between source-level and fine-grain "compiler" transformations
- No comprehensive synthesis framework
  - Few and scattered optimizations
  - Results presented only for scheduling
  - Effects on logic synthesis not understood
  - Small, synthetic benchmarks
Generalized Code Motions Across Hierarchical Blocks
[Figure: code motions of an operation (+) relative to an if node with true (T) and false (F) branches: speculation hoists it above the condition, reverse speculation moves it down into a branch, and conditional speculation duplicates it into both branches]
Characteristics of ASIC Design
- Large designs such as MPEG
- Multi-cycle implementation
- Resource constrained

Implications on the transformations applied:
- Extraction of parallelism is constrained by area limitations
  - Speculation may lead to additional registers
- More conservative with transformations such as loop unrolling
Characteristics of Microprocessor Blocks
- Smaller designs
- Single- or dual-cycle implementation
- High performance
  - Extract maximal parallelism
  - Area constraints are more relaxed

Implications on the transformations applied:
- Operations within the behavior are chained together with no latching
- All loops can be unrolled
Simplified Instruction Length Decoder
[Figure: Bytes 0 through 3 each produce a Length Contribution and a NeedNextByte flag; each byte's NeedNextByte flag determines whether the next byte's contribution is included in the instruction's length]
Simplified Instruction Length Decoder
[Figure: the same structure with the first instruction highlighted: starting at Byte 0, the NeedNextByte flags select how many of the Length Contributions of Bytes 1 through 3 are summed into the first instruction's length]
Behavioral Description in C
NextStartByte = 0;
for (i = 0; i < n; i++) {
    len[i] = CalculateLength(i);
    if (i == NextStartByte) {
        NextStartByte = len[i];
        Mark[i] = 1;
    }
} /* for (i = 0; i < n; i++) */

int CalculateLength(i)
{
    lc1 = LengthContribution(i);
    need1 = need_next_byte(i);
    if (need1) {
        lc2 = LengthContribution(i+1);
        need2 = need_next_byte(i+1);
        if (need2) {
            lc3 = LengthContribution(i+2);
            need3 = need_next_byte(i+2);
            if (need3) {
                lc4 = LengthContribution(i+3);
                Length = lc1 + lc2 + lc3 + lc4;
            } else
                Length = lc1 + lc2 + lc3;
        } else
            Length = lc1 + lc2;
    } else
        Length = lc1;
    return Length;
}
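The slides leave LengthContribution and need_next_byte undefined. Purely as an illustration, the declarations below are one hypothetical way to stub them out (per-byte table lookups on an instruction buffer) so the behavioral description can be compiled and simulated; none of these names or tables come from the actual ILD design.

/* Hypothetical stubs for simulation only; all names and tables are invented. */
#define N_BYTES 16                        /* buffer must cover i+3 for every i < n */

static unsigned char InstBuffer[N_BYTES]; /* raw instruction bytes                */
static int LenTable[256];                 /* length contribution per byte value   */
static int NeedTable[256];                /* needs-next-byte flag per byte value  */

static int LengthContribution(int i) { return LenTable[InstBuffer[i]]; }
static int need_next_byte(int i)     { return NeedTable[InstBuffer[i]]; }

static int len[N_BYTES], Mark[N_BYTES];   /* outputs written by the driver loop   */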
Speculate Maximally
NextStartByte = 0;
for (i = 0; i < n; i++) {
    len[i] = CalculateLength(i);
    if (i == NextStartByte) {
        NextStartByte = len[i];
        Mark[i] = 1;
    }
} /* for (i = 0; i < n; i++) */

int CalculateLength(i)
{
    /* data calculation */
    lc1 = LengthContribution(i);
    need1 = need_next_byte(i);
    lc2 = LengthContribution(i+1);
    need2 = need_next_byte(i+1);
    lc3 = LengthContribution(i+2);
    need3 = need_next_byte(i+2);
    lc4 = LengthContribution(i+3);
    TempLength1 = lc1 + lc2 + lc3 + lc4;
    TempLength2 = lc1 + lc2 + lc3;
    TempLength3 = lc1 + lc2;

    /* control logic */
    if (need1) {
        if (need2) {
            if (need3) {
                Length = TempLength1;
            } else
                Length = TempLength2;
        } else
            Length = TempLength3;
    } else
        Length = lc1;
    return Length;
}
Inlining (Done Earlier)
NextStartByte = 0;
for (i = 0; i < n; i++) {
    Results(i) = DataCalculation(i, i+1, i+2, i+3);
    Length(i) = ControlLogic(Results(i));
    len[i] = Length(i);
    if (i == NextStartByte) {
        NextStartByte = len[i];
        Mark[i] = 1;
    }
} /* for (i = 0; i < n; i++) */

int CalculateLength(i)
{
    /* data calculation (extracted into DataCalculation) */
    lc1 = LengthContribution(i);
    need1 = need_next_byte(i);
    lc2 = LengthContribution(i+1);
    need2 = need_next_byte(i+1);
    lc3 = LengthContribution(i+2);
    need3 = need_next_byte(i+2);
    lc4 = LengthContribution(i+3);
    TempLength1 = lc1 + lc2 + lc3 + lc4;
    TempLength2 = lc1 + lc2 + lc3;
    TempLength3 = lc1 + lc2;

    /* control logic (extracted into ControlLogic) */
    if (need1) {
        if (need2) {
            if (need3) {
                Length = TempLength1;
            } else
                Length = TempLength2;
        } else
            Length = TempLength3;
    } else
        Length = lc1;
    return Length;
}
Unroll Loop Completely
NextStartByte = 0;
i = 0;

Results(i) = DataCalculation(i, i+1, i+2, i+3);
Length(i) = ControlLogic(Results(i));
len[i] = Length(i);
if (i == NextStartByte) {
    NextStartByte = len[i];
    Mark[i] = 1;
}

Results(i+1) = DataCalculation(i+1, i+2, i+3, i+4);
Length(i+1) = ControlLogic(Results(i+1));
len[i+1] = Length(i+1);
if (i+1 == NextStartByte) {
    NextStartByte = len[i+1];
    Mark[i+1] = 1;
}

(Shown for only two unrolled iterations.)
Propagate Constant: Loop Index
NextStartByte = 0;

Results(0) = DataCalculation(0, 1, 2, 3);
Length(0) = ControlLogic(Results(0));
len[0] = Length(0);
if (0 == NextStartByte) {
    NextStartByte = len[0];
    Mark[0] = 1;
}

Results(1) = DataCalculation(1, 2, 3, 4);
Length(1) = ControlLogic(Results(1));
len[1] = Length(1);
if (1 == NextStartByte) {
    NextStartByte = len[1];
    Mark[1] = 1;
}
Maximally Parallelize/Compact
/* data calculation: all bytes in parallel */
Results(0) = DataCalculation(0, 1, 2, 3);
Results(1) = DataCalculation(1, 2, 3, 4);
...
Results(n) = DataCalculation(n, n+1, n+2, n+3);

/* control logic: all bytes in parallel */
Length(0) = ControlLogic(Results(0));
Length(1) = ControlLogic(Results(1));
...
Length(n) = ControlLogic(Results(n));

len[0] = Length(0);
len[1] = Length(1);
...
len[n] = Length(n);

/* ripple control logic */
NextStartByte = 0;
if (0 == NextStartByte) {
    NextStartByte = len[0];
    Mark[0] = 1;
}
if (1 == NextStartByte) {
    NextStartByte = len[1];
    Mark[1] = 1;
}
...
if (n == NextStartByte) {
    NextStartByte = len[n];
    Mark[n] = 1;
}
Final Design Architecture
/* data calculation */
Results(0) = DataCalculation(0, 1, 2, 3);
Results(1) = DataCalculation(1, 2, 3, 4);
...
Results(n) = DataCalculation(n, n+1, n+2, n+3);

/* control logic */
Length(0) = ControlLogic(Results(0));
Length(1) = ControlLogic(Results(1));
...
Length(n) = ControlLogic(Results(n));

/* ripple control logic */
if (0 == NextStartByte) {
    NextStartByte = len[0];
    Mark[0] = 1;
}
...
if (n == NextStartByte) {
    NextStartByte = len[n];
    Mark[n] = 1;
}

[Figure: final architecture: an Instruction Buffer feeds parallel Data Calculation blocks, whose results drive Control Logic blocks, followed by the Ripple Logic]
ILD Tasks Achieved This Summer
- Chaining across conditional boundaries (illustrated below)
  - Enables single-cycle schedules
  - Useful as a general high-level synthesis transformation as well
  - Had implications for other parts of the flow, such as VHDL generation
- Complete unrolling of loops
  - Was implemented previously
- Constant propagation
  - Useful for loop-index propagation after unrolling
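As a hand-written illustration only (this is not Spark's generated output), chaining across the conditional boundaries means the whole length computation can collapse into one combinational expression with no intermediate latching, which is what makes a single-cycle schedule possible:

/* Illustrative only: the nested conditionals and the adds are chained
 * into a single combinational expression, i.e. a one-cycle schedule. */
int CalculateLengthChained(int i)
{
    int lc1 = LengthContribution(i);
    int lc2 = LengthContribution(i + 1);
    int lc3 = LengthContribution(i + 2);
    int lc4 = LengthContribution(i + 3);

    return !need_next_byte(i)     ? lc1
         : !need_next_byte(i + 1) ? lc1 + lc2
         : !need_next_byte(i + 2) ? lc1 + lc2 + lc3
         :                          lc1 + lc2 + lc3 + lc4;
}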
Other Interaction within SCL
- Interfacing with the HLD team via XML
  - Implemented an XML generation pass
  - Creates a path from C into NexSiS and the rest of the HLD flow
  - Being driven by requirements from Abdallah
- Analyzed some other designs
  - Whitney: 3-D design
  - FAX: Willamette floating-point unit
Future Plans
- Continue to work on the ILD with the more complicated (complete) design
- Look at similar designs
  - Detect the first 3 zeros in a 32-bit vector (sketched below)
- Develop a set of transformations targeted at such high-performance blocks
- Expand interaction with the HLD design flow
  - Do some transformations before handing over the CDFG via XML to Symbolic Scheduling
  - For example: transformations that lead to node duplication, source-to-source transformations, some loop transformations
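The "first 3 zeros" block is only named above, not specified. Under that caveat, the following is a hypothetical behavioral-C sketch of one plausible reading (report the positions of the three lowest zero bits of a 32-bit vector); it is not the actual Intel design:

#include <stdint.h>

/* Hypothetical sketch: write into pos[0..2] the indices (bit 0 = LSB) of
 * the first three zero bits of v, -1 for unused entries, and return how
 * many zeros were found. */
static int first_three_zeros(uint32_t v, int pos[3])
{
    int found = 0;
    pos[0] = pos[1] = pos[2] = -1;
    for (int i = 0; i < 32 && found < 3; i++) {
        if (((v >> i) & 1u) == 0)
            pos[found++] = i;
    }
    return found;
}

Like the ILD loop, this loop has a fixed trip count, so it can be fully unrolled and the zero-detect logic flattened into a single-cycle block.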
Additional Slides
Spark’s Methodology
- Applies coarse- and fine-grain compiler optimizations
  - Targets control-flow transformations
  - "Fine-grain" loop optimization techniques for multiple and nested loops
- Mixed IR suitable for fine- and coarse-grain compiler transformations (similar to other systems such as SUIF)
- Synthesis from C provides
  - A flow from architecture design to synthesis
  - The opportunity to apply coarse-grain optimizations
- Compiler transformations modified to target HLS
  - Multiple mutually exclusive operations can be scheduled on the same resource in the same cycle
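For instance, in a fragment like the one below the two additions lie in mutually exclusive branches, so a scheduler that exploits this property can map both onto a single adder in the same cycle. This is a minimal illustrative sketch, not Spark's own code:

/* Illustrative only: the two '+' operations can never execute together,
 * so one adder can serve both in the same cycle. */
int shared_adder(int c, int a, int b, int d, int e)
{
    int x;
    if (c)
        x = a + b;   /* addition on the true branch  */
    else
        x = d + e;   /* addition on the false branch */
    return x;
}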
Spark’s Methodology
- Customizable, extensible scheduler
  - Range of transformations in a modular toolbox
    - Percolation, trailblazing, loop pipelining (RDLP), inlining
  - Selected under heuristics and/or user control
    - Code motions, loop transformations
- Ability to generate synthesizable RTL VHDL
  - Integrates with current IC design flows
- Code generation at various levels:
  - Behavioral C
  - Behavioral VHDL
  - Structural VHDL
Generalized Code Motions
- Hierarchical code motions
  - Operations are moved across entire conditional structures
- Speculation, to improve resource utilization (see the sketch below)
  - Has to be controlled to limit the impact on the number of registers
- Reverse speculation
  - Moves operations down into conditional branches
- Early condition execution
  - Evaluates a conditional as soon as the operation that computes it has executed
- Conditional speculation
  - Duplicates operations up into conditional branches
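A minimal before/after sketch of two of these motions, written with purely illustrative variable names rather than code from any Spark benchmark:

/* Speculation: hoist an operation above the condition that guards it. */
int spec_before(int c, int a, int b)
{
    int x = 0;
    if (c)
        x = a * b;          /* the multiply waits for the condition */
    return x;
}

int spec_after(int c, int a, int b)
{
    int t = a * b;          /* speculated: computed unconditionally */
    int x = 0;
    if (c)
        x = t;              /* the branch only selects the result */
    return x;
}

/* Conditional speculation: duplicate an operation from after the
 * if-statement up into both of its branches. */
int cspec_before(int c, int a, int b)
{
    int x;
    if (c) x = a * b;
    else   x = a - b;
    return x + 1;           /* the add runs only after the if completes */
}

int cspec_after(int c, int a, int b)
{
    int y;
    if (c) y = (a * b) + 1; /* '+ 1' duplicated into the true branch  */
    else   y = (a - b) + 1; /* '+ 1' duplicated into the false branch */
    return y;
}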
Scheduling Results on MPEG Prediction Block
Scheduling Results on ADPCM Encoder
Scheduling Results
- Synthesis results after scheduling by Spark show:
  - Considerable improvement (reduction) in execution cycles
  - The critical path decreases marginally
  - Area can increase significantly
- The benchmarks used are large, real-life applications
  - Well written; the gains are not due to sloppy code
Interconnect Minimization by Resource Binding
- Minimize the complexity of the steering logic
  - Multiplexers and demultiplexers
- Bind operations with the same inputs and outputs to the same functional units (see the sketch below)
- Bind variables that are inputs/outputs of the same functional units to the same registers
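A toy sketch of why this reduces interconnect (illustrative only, not Spark's binder): each functional-unit input port needs a multiplexer with one input per distinct source feeding it, so a binding that groups operations sharing the same inputs needs narrower multiplexers.

/* Count the distinct sources feeding the left input port of each FU under
 * a given operation-to-FU binding; all data below is invented. */
#include <stdio.h>

#define NUM_OPS 4
#define NUM_FUS 2

static const int src[NUM_OPS]    = {0, 0, 1, 2}; /* variable read by each op's left input */
static const int bind_a[NUM_OPS] = {0, 0, 1, 1}; /* ops with the same input share FU 0    */
static const int bind_b[NUM_OPS] = {0, 1, 0, 1}; /* binding that ignores the inputs       */

static int mux_inputs(const int bind[NUM_OPS], int fu)
{
    int seen[NUM_OPS], nseen = 0;
    for (int i = 0; i < NUM_OPS; i++) {
        if (bind[i] != fu)
            continue;
        int dup = 0;
        for (int j = 0; j < nseen; j++)
            if (seen[j] == src[i])
                dup = 1;
        if (!dup)
            seen[nseen++] = src[i];  /* each new source adds a mux input */
    }
    return nseen;
}

int main(void)
{
    for (int fu = 0; fu < NUM_FUS; fu++)
        printf("binding A: FU%d left-input mux width = %d\n", fu, mux_inputs(bind_a, fu));
    for (int fu = 0; fu < NUM_FUS; fu++)
        printf("binding B: FU%d left-input mux width = %d\n", fu, mux_inputs(bind_b, fu));
    return 0;   /* binding A needs 1+2 mux inputs, binding B needs 2+2 */
}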
Results after Binding
Results after Binding: ADPCM
Future Plans
- Synthesis for high-performance microprocessor blocks
  - Single-cycle behavioral descriptions
- Timing analysis and time budgeting
  - Introducing time-constrained synthesis
- Loop transformations
  - Parallelizing compiler transformations: loop interchange, exchange, splitting, fusion
- Resource versus throughput analysis
- Cost models for code motions
The Intermediate Representation
The Hierarchical Task Graph (HTG) is the main structure in the intermediate representation (IR). It:
- Maintains information on:
  - Code structure (IFs, LOOPs)
  - Loop bounds and type (FOR, WHILE)
- Does not lower array accesses to an address calculation followed by a memory access
- Is complete: the input C code can be regenerated from it

[Figure: relationship between the EDG AST and the HTG/CDFG in the IR]
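A minimal sketch of what such an IR node might look like; the field names below are invented for illustration and are not Spark's actual HTG data structures:

/* Hypothetical HTG node; every name here is illustrative only. */
struct cdfg_op;                    /* operation node in the underlying CDFG  */

enum htg_kind { HTG_STMT, HTG_IF, HTG_FOR, HTG_WHILE };

struct htg_node {
    enum htg_kind    kind;         /* code structure: statement, if, or loop */
    struct htg_node *children;     /* nested HTGs (branch bodies, loop body) */
    struct htg_node *next;         /* next node at the same nesting level    */
    struct cdfg_op  *ops;          /* data-flow operations inside this node  */
    int lower_bound, upper_bound;  /* loop bounds, meaningful for loop nodes */
};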
IR Examples
[Figure: an example C fragment with its corresponding HTG and CDFG representations]
Scheduling
The Scheduler Framework
Scheduler framework philosophy:
- Modularity and reusability
- Allow the designer to write new scheduling algorithms with minimal effort

Toolbox approach:
- Core transformations: percolation, trailblazing, RDLP
- Heuristics to decide which transformations are to be applied
The Scheduler Framework
- Designed to be completely customizable in terms of the scheduling algorithms and heuristics used
- An instance of a scheduling algorithm consists of a set of:
  - IR traversal algorithms
  - Code motion algorithms
  - Scheduling heuristics
- The designer can use predefined algorithms and heuristics or design new ones
  - Enabled by the toolbox approach

[Figure: a scheduling algorithm composed of IR Walkers, Scheduling Heuristics, a Candidate Provider, and Candidate Validators]
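A purely illustrative sketch of how such a composed scheduler could be wired together; none of the type or function names below are Spark's actual API, they only mirror the components named on the slide:

/* Hypothetical toolbox-style scheduler skeleton; all names are invented. */
typedef struct op   op_t;      /* an operation in the IR          */
typedef struct step step_t;    /* a scheduling step (cycle/state) */

typedef step_t *(*ir_walker_fn)(void *ir, step_t *prev);               /* next step to fill     */
typedef int     (*cand_provider_fn)(step_t *s, op_t **cands, int max); /* gather candidate ops  */
typedef int     (*cand_validator_fn)(step_t *s, op_t *cand);           /* is this motion legal? */
typedef op_t   *(*heuristic_fn)(op_t **cands, int n);                  /* pick the best one     */

struct scheduler {
    ir_walker_fn      walk;
    cand_provider_fn  provide;
    cand_validator_fn validate;
    heuristic_fn      choose;
};

/* Drive the composed scheduler over the IR until no step remains. */
static void schedule(struct scheduler *sch, void *ir,
                     void (*apply)(step_t *, op_t *))
{
    enum { MAX_CANDS = 64 };
    op_t *cands[MAX_CANDS], *valid[MAX_CANDS];

    for (step_t *s = sch->walk(ir, 0); s; s = sch->walk(ir, s)) {
        int n = sch->provide(s, cands, MAX_CANDS);
        int v = 0;
        for (int i = 0; i < n; i++)          /* keep only legal candidates      */
            if (sch->validate(s, cands[i]))
                valid[v++] = cands[i];
        if (v > 0)
            apply(s, sch->choose(valid, v)); /* heuristic picks, motion applied */
    }
}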
Extracting Parallelism with Speculation
Reverse Speculation
- Moves operations down into conditionals
- Only moves an operation into the branches that require its result
- Moves operations with lower priority
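A minimal before/after sketch with invented variable names, showing an operation moved down into the only branch that uses its result:

/* Before reverse speculation: the multiply executes on every path. */
int rspec_before(int a, int b, int c)
{
    int x = 0;
    int tmp = a * b;        /* computed unconditionally...      */
    if (c)
        x = tmp + 1;        /* ...but only this branch needs it */
    return x;
}

/* After reverse speculation: the multiply runs only when it is needed,
 * freeing its resource on the other path. */
int rspec_after(int a, int b, int c)
{
    int x = 0;
    if (c) {
        int tmp = a * b;    /* moved down into the consuming branch */
        x = tmp + 1;
    }
    return x;
}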
Early Condition Execution
- Evaluates conditions as soon as possible (ASAP)
- Moves all unscheduled operations down into the conditionals
- Uses reverse speculation to achieve this
Conditional Speculation
RDLP Example
[Figure: RDLP applied to a four-operation loop body (A: i=i+1, B: j=i+h, C: k=i+g, D: l=j+1). Compaction schedules B and C together (B:C); unroll-and-compact packs the unrolled iterations the same way; shift-and-pipeline then overlaps D of one iteration with A of the next (D:A).]
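The diagram itself is not reproducible here. As a generic illustration of the shift-and-pipeline idea only (not Spark's RDLP algorithm, and not the slide's exact example), the loop below is shifted so that the first operation of iteration i+1 overlaps the remaining operation of iteration i:

/* Generic loop-shifting illustration; array names and sizes are invented. */
void shift_before(int n, const int x[], int a[], int b[])
{
    for (int i = 0; i < n; i++) {
        a[i] = x[i] + 1;                /* op 1                  */
        b[i] = a[i] * 2;                /* op 2, depends on op 1 */
    }
}

void shift_after(int n, const int x[], int a[], int b[])
{
    if (n <= 0)
        return;
    a[0] = x[0] + 1;                    /* prologue: op 1 of iteration 0        */
    for (int i = 0; i < n - 1; i++) {
        b[i] = a[i] * 2;                /* op 2 of iteration i                  */
        a[i + 1] = x[i + 1] + 1;        /* overlapped: op 1 of iteration i + 1  */
    }
    b[n - 1] = a[n - 1] * 2;            /* epilogue: op 2 of the last iteration */
}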