Efficient IP Design flow for Low-Power High-Level Synthesis
Quick & Accurate Power Analysis and Optimization Flow
Asher Berkovitz, Yaniv Fais
JAN.20.2014

Authors Contact Details
Asher Berkovitz, Asher.Berkovitz@freescale.com, +972-09-9522511
Yaniv Fais, Yaniv.Fais@freescale.com, +972-09-9522179
Freescale Semiconductor Israel, Herzelia, Shenkar 3
© 2014 Freescale Semiconductor, Inc. | External Use

Outline
• Challenges
• High Level Synthesis flow
• Power Efficiency
  − Problems at RTL
• Proposed VSIM++ Flow
  − Analysis
  − Optimization
  − Results on Networking Algorithm (Non-Abstract Version)
• Conclusions

Challenges
• IP blocks for networking applications need to meet tight power budgets while also meeting aggressive performance requirements.
• Changes to the micro-architecture and to other high-abstraction modeling choices can deliver the largest overall power savings.
• It is hard to measure power accurately at higher levels of abstraction.
• Accurate power measurement at signoff comes late in the design process, when high-level changes are no longer possible.

High Level Synthesis design Flow
[Flow diagram: Algorithms Definition → Macro-Architecture Definition → Bit-exact SystemC® Model → Architecture evaluation and RTL generation → RTL → RTL2GDSII "Normal" flow. Inputs to HLS: SystemC® Commands (uArch) and Cell library (.lib). Quick explore (Timing/Area) on the RTL.]
• Macro-architecture definition:
  − Based on an accelerator base class
  − Uses unified modules (FIFOs, interfaces, etc.)
• SystemC® Model:
  − Accurate data path description according to the macro-architecture
  − Designed to meet processing requirements
• HLS:
  − Builds pipelined data path and control logic
  − Considers real timings during RTL generation
  − Explores implementation tradeoffs
Power Dissipation
• Static Power: largely independent of the test
• Dynamic Power: highly dependent on the application (signal transitions)
• Signal transitions can be divided into:
  − Functional changes
  − Glitches (signal changes that are not captured by a sequential element)
• Glitches are not visible in RTL simulation and can contribute ~20% of power dissipation

Fast & Accurate power analysis flow (VSIM)
[Flow diagram: RTL DB → Quick PD flow → Test bench generation → GLV simulation → Power Analysis → Mapping GL 2 RTL]
• Quick Physical Design (PD) flow:
  − Timing violations allowed
  − DRC violations allowed
  − Less than 100% RTL to GL equivalence
• A customized test bench enables cycle-accurate Gate Level Simulation.
• Power analysis is performed using the gate level netlist & parasitics file.
• Power analysis results are mapped back to the RTL netlist.

Test Bench Generation
• Based on the RTL to GL mapping, force RTL values on the GLV simulation
  − Force the RTL value on the key point
• Std' test bench: timing violation, values are a bit "off"
• "VSIM" test bench: force correct value @ time point X, correct values forced
• "VSIM" test bench advantages:
  − Short run time: simulate selected window
  − GL delay for logic cones (SDF)
[Figure: flop-to-flop paths (D Q) under the Std' test bench and the "VSIM" test bench, simulated with GL & SDF.]

Gate level results mapping to RTL netlist
RTL netlist (with mapped power):
    reg cond[1:0]     26
    reg count[1:0]    29
    always @(posedge clk)
      if (condition == 2'b11) count = count + 1;
1. Map RTL 2 GL
2. For each unmapped GL instance: divide the power between drive/load key points
3. Assign GL key point power to RTL key point
4. The power of each RTL hierarchy is the sum of the power assigned to its key points
[Figure: GL netlist with key points Cond_0, Cond_1, count_0, count_1 and a Clock Gate, annotated with per-instance power values.]
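The four mapping steps above can be sketched in C++. This is a minimal illustration, not the VSIM implementation: the type `UnmappedInstance`, the functions `assign_to_key_points` and `sum_per_hierarchy`, the equal split of a cell's power between its adjacent key points, and the `hierarchy/name` naming of key points are all assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A gate-level instance with no direct RTL counterpart (for example a
// combinational cell), plus the key points that drive it or that it loads.
// Hypothetical type, introduced only for this sketch.
struct UnmappedInstance {
    double power;                         // power reported for this cell
    std::vector<std::string> key_points;  // adjacent drive/load key points
};

// Steps 2-3: start from the power of the key points themselves, then split
// each unmapped instance's power among its adjacent key points.
// ASSUMPTION: the split is equal; the slides do not state the weighting.
std::map<std::string, double> assign_to_key_points(
    std::map<std::string, double> key_point_power,
    const std::vector<UnmappedInstance>& unmapped) {
    for (const UnmappedInstance& inst : unmapped) {
        assert(!inst.key_points.empty());  // each cell must touch a key point
        const double share =
            inst.power / static_cast<double>(inst.key_points.size());
        for (const std::string& kp : inst.key_points) {
            key_point_power[kp] += share;
        }
    }
    return key_point_power;
}

// Step 4: the power of a hierarchy is the sum over its key points.
// ASSUMPTION: key points are named "hierarchy/name", so the hierarchy is
// everything before the last '/'.
std::map<std::string, double> sum_per_hierarchy(
    const std::map<std::string, double>& key_point_power) {
    std::map<std::string, double> result;
    for (const auto& entry : key_point_power) {
        const std::string::size_type slash = entry.first.rfind('/');
        const std::string hier =
            (slash == std::string::npos) ? "" : entry.first.substr(0, slash);
        result[hier] += entry.second;
    }
    return result;
}
```

For instance, two key points `top/cond_0` and `top/count_0` with a power of 1.0 each, plus one unmapped cell of 4.0 sitting between them, give 3.0 per key point and 6.0 for the hierarchy `top`.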
Mapping results to high-level language (VSIM++)
• By annotating C++ class names, variable names, and file names/line numbers, we can map power consumption from the accurate gate level back to the C++ source.
• This capability allows us to:
  − Analyze and fix clock gating
  − Redesign "power hungry" resources
  − Consider different architectures
RTL netlist:
    reg my_var_Ln123[1:0]
    reg count_Ln124[1:0]
    always @(posedge clk)
      if (my_var_Ln123 == 2'b11) count_Ln124 = count_Ln124 + 1;
C++ code:
    Line #
    121: void process() { …
    122:   while (true) {
    123:     if (my_var==3)
    124:       count++;
    125:     …
    126:   }
    127: }

Example problem identified
• The tool automatically inserts "clock gating" enabler code in the RTL (C++ process condition → HLS):
    always @(posedge clk)
      if (en) data[511:0] <= new_data;
• The gate-level implementation is not realized as a gated clock but as data logic, due to timing violations
• Solution: simplify the clock gating enablers to meet timing constraints
[Figure: DFF bank with new_data, clk, en and data.]

Clock gating enabler simplification
• Original clock gating scheme: complicated enable logic, synthesized to a non-efficient enabler
• Simplified clock gating scheme: enable synthesized w/o changes, leading to high clock gating efficiency
[Figure: the two schemes side by side, with Hash Key, Header and Process control DFFs and the signals new_data, en, clk and data.]
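The difference between a gated clock and a data-logic implementation of the same enable can be illustrated with a toy C++ model. It is only a sketch under stated assumptions: `RegisterModel`, `tick_gated`, `tick_muxed` and `clock_gating_efficiency` are hypothetical names, "data logic" is taken to mean a feedback mux that recirculates the old value, and efficiency is defined here as the fraction of cycles in which the register clock is suppressed. The model counts clock edges only; it does not estimate power.

```cpp
#include <cassert>
#include <cstdint>

// Toy register model that counts how many clock edges reach the flops.
// Hypothetical type, introduced only for this sketch.
struct RegisterModel {
    std::uint64_t value = 0;
    unsigned clock_edges_seen = 0;

    // Gated-clock implementation: the flops see a clock edge only in
    // cycles where the enable is high.
    void tick_gated(bool en, std::uint64_t new_data) {
        if (en) {
            ++clock_edges_seen;
            value = new_data;
        }
    }

    // Data-logic implementation (ASSUMPTION: a feedback mux): the flops are
    // clocked every cycle and reload their old value when enable is low.
    void tick_muxed(bool en, std::uint64_t new_data) {
        ++clock_edges_seen;
        value = en ? new_data : value;
    }
};

// Fraction of cycles in which the register clock was suppressed.
// ASSUMPTION: this definition of "efficiency" is chosen for the example.
double clock_gating_efficiency(const RegisterModel& reg,
                               unsigned total_cycles) {
    assert(total_cycles > 0 && reg.clock_edges_seen <= total_cycles);
    return 1.0 - static_cast<double>(reg.clock_edges_seen) / total_cycles;
}
```

Driving both variants with the same enable pattern leaves the same stored value, but only the gated variant reports a non-zero efficiency. With the enable high in one cycle out of four, the gated register scores 0.75 and the muxed register 0.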
Conclusions
• Use High Level Synthesis for IP Design
  − Quick and easy to explore architecture alternatives
  − Quick front-end flow including verification
• Power analysis:
  − Measure power on a system level scenario
  − Quick (doesn't require full physical design flow convergence)
  − Accurate (done on gate-level)
• Analysis and Optimization in high-level design (C++)
  − Manual clock gating enable setting reduced dynamic power consumption by 19.4%
• Early in the design cycle: easy to change the IP architecture!

Backup

Accuracy
• Measured using a similar methodology on a different design
• Si measurement compared to full T/O gate level data

Test                                                  Dynamic power accuracy
Single core Fast Fourier Transform                    -7.59%
Single core Fast Fourier Transform, no memory miss    -8.40%
Dual core Fast Fourier Transform                      7.57%