PPTX slides - DAC Virtual Resources

Efficient IP Design Flow for Low-Power High-Level Synthesis
Quick & Accurate Power Analysis and Optimization Flow
Asher Berkovitz
Yaniv Fais
JAN.20.2014
Authors Contact Details
Asher Berkovitz, Asher.Berkovitz@freescale.com, +972-09-9522511
Yaniv Fais, Yaniv.Fais@freescale.com, +972-09-9522179
Freescale Semiconductor Israel, Shenkar 3, Herzelia
© 2014 Freescale Semiconductor, Inc. | External Use
Outline
• Challenges
• High Level Synthesis flow
• Power Efficiency
− Problems at RTL
• Proposed VSIM++ Flow
− Analysis
− Optimization
− Results on Networking Algorithm (Non-Abstract Version)
• Conclusions
Challenges
• IP blocks for networking applications must meet tight power budgets while also meeting aggressive performance requirements.
• Changes to the micro-architecture and other high-abstraction modeling choices can deliver the largest benefits in overall power.
• It is hard to measure power accurately at higher abstraction levels.
• Accurate power measurement at signoff comes late in the design process, when high-level changes are no longer possible.
High Level Synthesis design Flow

[Flow diagram, blocks top to bottom: Algorithms Definition; Macro-Architecture Definition; Bit-exact SystemC® Model; SystemC® Commands (uArch); Cell library (.lib); RTL; RTL Quick explore (Timing/Area); RTL2GDSII “Normal” flow]
Macro-architecture definition:
− Based on an accelerator base class
− Uses unified modules (FIFOs, interfaces etc.)
SystemC® Model:
− Accurate data path description according to macro-architecture
− Design to meet processing requirements
HLS:
− Architecture evaluation and RTL generation
− Builds pipelined data path and control logic
− Considers real timings during RTL generation
− Explore implementation tradeoffs
Power Dissipation
• Static Power: approximately test independent
• Dynamic Power: highly dependent on the application (signal transitions)
• Signal transitions can be divided into:
− Functional change
− Glitch (signal changes that are not captured by a sequential element)
• Glitches are not visible in RTL simulation and can contribute ~20% to power dissipation
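The functional/glitch split above can be illustrated with a minimal C++ sketch. It assumes the gate-level simulation gives the successive values of one single-bit net within one clock cycle. The helper `classify_transitions` and the struct `TransitionCount` are hypothetical names for illustration, not part of the VSIM flow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transition counts for one single-bit net during one clock cycle.
struct TransitionCount {
    int functional;  // 1 if the settled value differs from the starting value, else 0
    int glitch;      // extra transitions that settle out before the capturing edge
};

// `samples` holds the successive values the net takes within one clock cycle,
// starting with its value at the beginning of the cycle.
// Hypothetical helper for illustration only.
TransitionCount classify_transitions(const std::vector<bool>& samples) {
    TransitionCount result{0, 0};
    if (samples.size() < 2) return result;  // no transition possible

    int total = 0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        if (samples[i] != samples[i - 1]) ++total;  // count every toggle
    }
    // Only the net change from first to last value is seen by a sequential element.
    result.functional = (samples.front() != samples.back()) ? 1 : 0;
    result.glitch = total - result.functional;
    return result;
}
```

An RTL simulation would only show the functional transition; the glitch count is what the gate-level simulation with delays adds.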
Fast & Accurate power analysis flow (VSIM)
• Quick Physical Design (PD) flow:
− Timing violations allowed
− DRC violations allowed
− Less than 100% RTL to GL equivalence
• Customized test bench enables cycle-accurate Gate Level Simulation
• Power analysis is performed using the gate level netlist & parasitics file.
• Power analysis results are mapped back to the RTL netlist.
[Flow diagram: RTL DB; Quick PD flow; Test bench generation; GLV simulation; Power Analysis; Mapping GL 2 RTL]
Test Bench Generation
• Based on RTL to GL mapping, force RTL values on GLV simulation
• Advantages:
− Short run time: simulate selected window
[Figure: flip-flop (D Q) chains comparing two test benches. Std’ test bench: timing violation, values are a bit “off”. “VSIM” test bench: force the RTL value on the key point (force correct value @ time point X), correct values forced; GL & SDF, GL delay for logic cones (SDF).]
Gate level results mapping to RTL netlist
1. Map RTL 2 GL
2. For each unmapped GL instance: divide the power between drive/load key points
3. Assign GL key point power to RTL key point
4. The power of each RTL hierarchy is the sum of the power assigned to its key points
RTL netlist (values as annotated on the slide):
    reg [1:0] cond;   // 26
    reg [1:0] count;  // 29
    always @(posedge clk)
      if (cond == 2'b11)
        count <= count + 1;
[Figure: GL netlist with key points Cond_0, Cond_1, count_0, count_1 and a Clock Gate, annotated with per-instance values, next to the RTL netlist carrying the mapped totals]
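The four mapping steps above can be sketched in C++ as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not say how power is divided between drive/load key points, so the sketch splits it evenly, and it treats the text before the last `/` in a key point name as its hierarchy. `GlInstance`, `assign_power` and `sum_by_hierarchy` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One gate-level instance and the RTL key points it is attributed to.
// A directly mapped instance lists one key point; an unmapped instance lists
// the key points that drive it or that it loads.
struct GlInstance {
    double power;                         // power reported by gate-level analysis
    std::vector<std::string> key_points;  // hierarchical RTL key point names
};

// Steps 2 and 3: attribute each instance's power to RTL key points.
// ASSUMPTION: an even split between the listed key points.
std::map<std::string, double> assign_power(const std::vector<GlInstance>& instances) {
    std::map<std::string, double> per_key_point;
    for (const GlInstance& inst : instances) {
        if (inst.key_points.empty()) continue;  // nothing to attribute the power to
        const double share = inst.power / static_cast<double>(inst.key_points.size());
        for (const std::string& kp : inst.key_points) per_key_point[kp] += share;
    }
    return per_key_point;
}

// Step 4: the power of a hierarchy is the sum over its key points.
// ASSUMPTION: the hierarchy is the part of the name before the last '/'.
std::map<std::string, double> sum_by_hierarchy(
        const std::map<std::string, double>& per_key_point) {
    std::map<std::string, double> per_hierarchy;
    for (const auto& entry : per_key_point) {
        const std::size_t slash = entry.first.rfind('/');
        const std::string hier =
            (slash == std::string::npos) ? std::string() : entry.first.substr(0, slash);
        per_hierarchy[hier] += entry.second;
    }
    return per_hierarchy;
}
```

With an even split, an unmapped gate of power 4 sitting between `top/u0/cond` and `top/u0/count` contributes 2 to each key point.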
Mapping results to high-level language (VSIM++)
• By annotating C++ class names, variable names and file name/line numbers, we can map power consumption from the accurate gate level back to the C++ source.
• This capability allows us to:
− Analyze and fix clock gating
− Redesign “power hungry” resources
− Consider different architectures
C++ code (with line numbers):
    121: void process() { …
    122:   while (true) {
    123:     if (my_var==3)
    124:       count++;
    125:     …
    126:   }
    127: }
RTL netlist:
    reg [1:0] my_var_Ln123;
    reg [1:0] count_Ln124;
    always @(posedge clk)
      if (my_var_Ln123 == 2'b11)
        count_Ln124 <= count_Ln124 + 1;
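The line-number annotation can be followed back from RTL names to C++ lines with a small helper. The sketch below assumes the `_Ln<digits>` suffix convention seen in the RTL example (`my_var_Ln123`, `count_Ln124`). The functions `source_line_of` and `power_per_line` are hypothetical names, and how the real flow encodes class and file names is not shown in the slides.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Returns the C++ source line encoded as a trailing "_Ln<digits>" tag in an
// RTL register name, or -1 if the name carries no such tag.
int source_line_of(const std::string& rtl_name) {
    const std::string tag = "_Ln";
    const std::size_t pos = rtl_name.rfind(tag);
    if (pos == std::string::npos) return -1;
    const std::size_t first = pos + tag.size();
    if (first >= rtl_name.size()) return -1;  // tag present but no digits
    int line = 0;
    for (std::size_t i = first; i < rtl_name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(rtl_name[i]);
        if (!std::isdigit(c)) return -1;      // suffix is not purely numeric
        line = line * 10 + (c - '0');
    }
    return line;
}

// Accumulates per-register power (already mapped from gate level to RTL)
// onto C++ source lines. Registers without a line tag are skipped.
std::map<int, double> power_per_line(
        const std::vector<std::pair<std::string, double>>& rtl_power) {
    std::map<int, double> per_line;
    for (const auto& reg : rtl_power) {
        const int line = source_line_of(reg.first);
        if (line >= 0) per_line[line] += reg.second;
    }
    return per_line;
}
```

Sorting the resulting map by value would point at the “power hungry” source lines first.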
Example problem identified
• The HLS tool automatically inserts “clock gating” enabler code into the RTL, derived from the C++ process condition:
    always @(posedge clk)
      if (en)
        data[511:0] <= new_data;
• The gate-level implementation is not realized as a gated clock but as data logic, due to timing violations
• Solution: simplify the clock gating enablers to meet timing constraints
[Figure: bank of DFFs with signals new_data, en, clk, data]
Clock gating enabler simplification
• Original clock gating scheme:
− Complicated enable logic
− Synthesized to a non-efficient enabler
• Simplified clock gating scheme:
− Enable synthesized w/o changes
− Leading to high clock gating efficiency
[Figures: two DFF clock gating schematics with signals new_data, en, clk, data; enable sources labeled Hash Key, Header, Process control]
Conclusions
• Use High Level Synthesis for IP Design
− Quick and easy to explore architecture alternatives
− Quick front-end flow including verification
• Power analysis:
− Measure power on system level scenario
− Quick (doesn’t require full physical design flow convergence)
− Accurate (done on gate-level)
• Analysis and Optimization in high-level design (C++)
− Manual clock gating enable setting reduced dynamic power consumption by 19.4%
• Early in the design cycle: easy to change IP architecture!
Backup
Accuracy
• Measured using similar methodology on a different design
• Si measurement compared to full T/O gate level data
Test
Dynamic power accuracy
Single core Fast Fourier Transform
-7.59%
Single core Fast Fourier Transform
No memory miss
Dual core Fast Fourier Transform
-8.40%
7.57%