Low Power Clocking Through the Use of Dual Edge Triggered Flip-Flops
Gabriel Ricardo Theresa Holliday
ACSEL Lab, University of California, Davis

Outline
- Dual Edge Flip-Flops overview
- Standard Cell Characterization
- LEON Synthesis for SET design
- LEON Synthesis for DET design
- Issues with including Dual edge into synthesis flow
- Preliminary comparisons
- Conclusions and Future Work
- Questions

Symmetric Pulse Generator Flip-Flop (SPGFF)
- First stage, X and Y, are dynamic, second stage static NAND
- Results in small delay
- Can size to trade some delay for power

Operation of SPGFF
- Transparency window created by CLK and CLK3 for stage 1 (CLK1 and CLK4 for stage 2), allows for X (Y) to conditionally evaluate based on input D.
- Output stage NAND allows for X, Y to be passed to output based on clock value without the need for a latch.
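To make the dual-edge idea concrete, here is a minimal behavioral sketch contrasting a dual-edge triggered element with a single-edge one. It is an illustration only: it models neither the SPGFF pulse generator nor its dynamic X/Y nodes, and the class names and waveforms are made up for the example.

```python
class DualEdgeFlipFlop:
    """Behavioral model: Q takes the value of D on every clock transition."""

    def __init__(self, clk=0, q=0):
        self.clk = clk
        self.q = q

    def step(self, clk, d):
        if clk != self.clk:  # rising or falling edge
            self.q = d
        self.clk = clk
        return self.q


class SingleEdgeFlipFlop:
    """Behavioral model: Q takes the value of D on rising edges only."""

    def __init__(self, clk=0, q=0):
        self.clk = clk
        self.q = q

    def step(self, clk, d):
        if self.clk == 0 and clk == 1:  # rising edge
            self.q = d
        self.clk = clk
        return self.q


def run(ff, clk_wave, d_wave):
    """Apply one (clk, d) sample per step and collect the output Q."""
    return [ff.step(c, d) for c, d in zip(clk_wave, d_wave)]


# Hypothetical stimulus: the clock toggles every step, D changes between edges.
clk_wave = [0, 1, 0, 1, 0, 1]
d_wave = [0, 1, 0, 1, 0, 0]

det_q = run(DualEdgeFlipFlop(), clk_wave, d_wave)
set_q = run(SingleEdgeFlipFlop(), clk_wave, d_wave)
print("DET Q:", det_q)
print("SET Q:", set_q)
```

With the same clock, the dual-edge model captures a new value on every transition, while the single-edge model misses the values presented at falling edges. This is why a DET design can reach the same data rate at half the clock frequency.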

Transmission Gate Master Slave (TGMS)
[Figure: TGMS schematic]

Comparison between SPGFF and TGMS in 0.18um
         Delay    Power     EDP           Clk load
SPGFF    356 ps   133 μW    1.70e-23 Js   12 fF
TGMS     354 ps   89.9 μW   1.13e-23 Js   16 fF
$\text{delay} = \max\left(t_{su,r} + t_{clk \to q,f},\ t_{su,f} + t_{clk \to q,r}\right)$
[Figure: bar charts for TGMS and SPGFF. Performance (ps): total delay, setup time. Power @ 25% activity (μW): total power, data power, internal power, clock power]

Advantages of SPGFF
- Lowest clock energy among DET-CSEs, resulting in higher clock power savings
- Energy delay product comparable to high performance single edge triggered clocked storage elements

Characterization Methodology – Generating synthesis views
- Created automated process for generating Synopsys Liberty format (.lib) synthesis models.
- Uses Perl scripts and gspice (SPICE pre/postprocessor).
- Characterized for timing and energy.
- Can easily extend to generate Cadence synthesis models (.tlf).

Characterization Methodology – Trip-points
- Used same trip-points as those in technology library.
- Nominal conditions: 25°C, 1.8V supply
- Can easily generate best and worst case corner models (over temp and supply variation).
- Cell delay: defined as clock 50% rise/fall to output (Q or QN) 50% rise/fall
- Transition time: 10%-90% rise time, 90%-10% fall time

Trip-points - Falling
[Figure: falling trip-points]

Trip-points - Rising
[Figure: rising trip-points]

Characterization Methodology – Drive Characteristics
- Build 5x5 non-linear delay table.
- Clock slope values (ns): 0.03, 0.1, 0.4, 1.5, 3
- Output load values (fF): 0.35, 21, 38.5, 147, 311

Characterization Methodology – Trip-points
- Setup time: sweep input transition towards active edge until 10% increase in clock to output delay.
- Hold time: sweep input transition away from active edge until 10% increase in clock to output delay.

Characterization Methodology – Setup-hold
[Figure: clock to Q delay versus data to clock delay, showing the constant clk-Q and variable clk-Q regions, the 10% push-out points, and the failure region]

Characterization Methodology – Setup and Hold
- Build 3x2 non-linear delay table. (3ps accuracy)
- Clock slope values (ns): 0.03, 3
- Data slope values (ns): 0.03, 0.9, 3

Characterization Methodology – Internal energy
- Characterized over same data points as drive characteristics for internal energy (5x5 lookup table).
- Data pin, clock pin energy tables generated (1x5 lookup table).
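The 10% push-out criterion for setup time can be sketched as a simple sweep. This is a minimal illustration, not the lab's Perl flow: `clk_to_q_ns` is a hypothetical analytic stand-in for a SPICE measurement, and its 0.18 ns nominal delay and 0.02 ns time constant are invented values. Only the 3 ps step and the 10% threshold come from the slides.

```python
import math


def clk_to_q_ns(data_to_clock_ns, nominal_ns=0.18, tau_ns=0.02):
    """Hypothetical stand-in for a SPICE run: clock-to-Q delay grows as the
    data transition moves closer to the active clock edge."""
    return nominal_ns * (1.0 + math.exp(-data_to_clock_ns / tau_ns))


def find_setup_time_ns(measure, start_ns=0.5, step_ns=0.003, pushout=0.10):
    """Sweep the data transition toward the clock edge in fixed steps and
    return the last data-to-clock delay that stays within the push-out limit."""
    limit_ns = measure(start_ns) * (1.0 + pushout)  # reference taken far from the edge
    max_steps = int(2 * start_ns / step_ns)
    for k in range(max_steps):
        d2c_ns = start_ns - k * step_ns
        if measure(d2c_ns - step_ns) > limit_ns:
            return d2c_ns
    raise ValueError("no 10% push-out found in the swept range")


setup_ns = find_setup_time_ns(clk_to_q_ns)
print(f"setup time ~ {setup_ns * 1e3:.0f} ps")
```

Hold time would use the same loop with the sweep direction reversed, moving the data transition away from the active edge. In a real flow `measure` would launch a circuit simulation for each point, so a bisection search might be preferred over a linear sweep to reduce the number of runs.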

Characterization Results - single vs dual-edge – D to Q delay
[Figure: surface plots of TGMS delay and SPGFF delay. Axes: delay (ns), clock slope (ns), load (# of minimum sized inverters)]

What is typical output load?
- Extracted output loading from netlist for all CSEs.
- Average load = 24fF (6.8 min. inverters)
- 90% of CSEs have load less than 60fF (17 min. sized inverters)

Netlist extracted CSE output loading statistics
[Figure: histogram of output loading on CSEs. Axes: number of nets versus loading (# of min. sized inverters)]

Characterization Results - single vs dual-edge – Delay
[Figure: surface plots of TGMS delay and SPGFF delay with the typical region of operation marked. Axes: delay (ns), clock slope (ns), load (# of minimum sized inverters)]

Characterization Results – zoomed-in - single vs dual-edge – delay
[Figure: TGMS delay and SPGFF delay over the typical region. Axes: delay (ns), clock slope (ns), load (# min. inverters)]

Characterization Results - single vs dual-edge – Energy delay product
[Figure: TGMS energy and SPGFF energy over the typical region. Axes: energy (pJ), clock slope (ns), load (# min. inv)]

Leon SPARC core configuration
[Figure: Leon SPARC core configuration]

Leon SPARC synthesis
- Synthesized using TSMC 0.18um standard cell library.
- Target frequency of 200MHz
- Limit use of single sized D-FF.

SET-Synthesis flow
[Figure: flow diagram with the following blocks]
- RTL of processor (VHDL)
- Standard cell library
- Synthesis (Design Compiler)
- Netlist (.db)
- Power Analysis (Power Compiler)
- Reports (area, timing)

SET-CSE synthesis summary
Area and Power
Cell type                Area (mm2)   % total   Power (mW)   % total
Memory blocks            2.03         55%       214.3        72%
Core                     0.71         19%       73           24%
Clock tree (ideal net)   N/A          N/A       11.6         4%
Total                    3.7                    299

Core summary
Core                      Area (mm2)   % total core   Power (mW)
Sequential (1986 CSEs)    0.47         36%            26
Combinatorial + nets      0.24         64%            47
Total                     0.71                        73
Approximately 20k-gates

Clock tree loading
Clock tree components            Loading (pF)
Sequential cells (1986 cells)    5.18
Memory macro cells (6)           1.37
Wire routing*                    11.4
Total                            17.94
* based on library wire-load model

Clock tree power estimation
- High-fanout nets are beyond the library's wire-load model interpolation range.
- Wire-load models are not meant for estimating balanced distribution nets such as clock nets.
- Using library wire-load models for clock tree is not valid.
- Use an H-tree estimation equation to obtain a ballpark number.

H-tree estimation equation
- Equation developed by ACSEL lab member Nikola Nedovic.
- Recursively calculates H-tree loading for a given area, number of CSEs in design, and number of H-tree levels.

H-tree estimation method
[Figure: H-tree from the PLL to the leaf level. Labels: S, c, S/2^(L-1), M/4^(L-1), storage elements]

H-tree estimation method
[Table]
* Table taken from Nedovic, Nikola, Ph.D. Dissertation, UCD, “CLOCKED STORAGE ELEMENTS FOR HIGH-PERFORMANCE APPLICATIONS”

H-tree estimation method
Equation reduces to:
[Equation with two terms: load due to CSEs, load due to wiring]

Total H-tree power
[Equation with two terms: load switching power, clock driver power]

SET-CSE synthesis summary with H-tree estimate
Area and Power
Cell type                      Area (mm2)   % total   Power (mW)   % total
Memory blocks                  2.03         55%       214.3        66%
Core                           0.71         19%       63           19%
Clock tree (H-tree estimate)   N/A          N/A       48.5         15%
Total                          3.7                    325

SET-CSE power profile with H-tree estimate
SET power breakdown
Total cache           128.5716 mW   40%
Register file         85.762 mW     26%
Total core power      63 mW         19%
Calculated clk pwr    48.507 mW     15%

SET-CSE Core power profile
SET Core power breakdown
Total core power      63 mW         56%
Calculated clk pwr    48.507 mW     44%

Modeling DET-CSEs for Synthesis
- Need to model the timing parameters for both edges.
[Figure: timing diagram of system clock, data, and output. DET-CSE: Ts-r, Th-r, Ts-f, Th-f. SET-CSE: Tsetup, Thold. Tclk->Q]

Modeling DET-CSEs for Synthesis
- Can model complex timing relationships for synthesis.
[Figure: cell with D, CLK, and Q, showing a falling-edge timing arc and a rising-edge timing arc]

Modeling DET-CSEs for Synthesis
- Synthesis tool will time, and (try to) meet constraints for the dual-edge triggered synchronous system.
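The ideal-net clock tree power in the SET-CSE synthesis summary can be cross-checked against the clock tree loading table. The sketch below assumes the usual full-swing switching relation P = C * V^2 * f, with the 1.8 V nominal supply and 200 MHz target frequency quoted earlier. The function name is illustrative, and the slides do not state which relation the tool used.

```python
def switching_power_mw(load_pf, vdd_v, freq_mhz):
    """Dynamic power in mW of a net that charges and discharges once per cycle.

    Assumes P = C * V^2 * f, i.e. a full-swing clock net with activity 1.
    """
    load_f = load_pf * 1e-12
    freq_hz = freq_mhz * 1e6
    return load_f * vdd_v ** 2 * freq_hz * 1e3


# Total clock loading from the wire-load-model table (pF).
total_clock_load_pf = 17.94

ideal_net_power_mw = switching_power_mw(total_clock_load_pf, vdd_v=1.8, freq_mhz=200)
print(f"{total_clock_load_pf} pF at 1.8 V, 200 MHz -> {ideal_net_power_mw:.1f} mW")
```

The result comes out at roughly 11.6 mW, in line with the ideal-net row of the summary table. The H-tree estimate of 48.5 mW is not reproduced here because the estimation equation itself is not available in the text.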

Modeling DET-CSEs for Synthesis
- Synthesis tool will use the worst timing arc relationship for critical path constraint.
[Figure: rising-to-falling paths (critical) and falling-to-rising paths (not critical) between the rising edge sample window and the falling edge sample window]

Modeling DET-CSEs for Synthesis
- Synthesis tools are not capable of inferring a dual-edge triggered device from HDL code.
- For meeting timing we only care about the strictest constraint anyway (i.e. for one pair of launch and capture edges).
- Unnecessary to model complex timing device.

Modeling DET-CSEs for Synthesis
- Simply model DET-CSE as a SET-CSE with worst-edge timing parameters.
[Figure: timing diagram of system clock, data, and output with Ts-max, Th-max, and Tclk->Q-max]

Synthesis flow for DET-CSEs
[Figure: flow diagram with the following blocks]
- RTL of processor (VHDL)
- Standard cell library
- Automated Characterization (perl, hspice)
- Model of DET-CSE
- Synthesis (Design Compiler)
- Netlist with DET-CSEs (.db)
- Power Analysis
- Timing Analysis

Synthesis flow for DET-CSEs
- Use synthesis directives to force use of DET-CSE modeled device.
- Synthesize for target throughput, not frequency.
- Worst-case models for meeting critical-path timing constraints.
- Generate a worst-case hold model, to verify the race-path.
  - Fastest clk-Q with worst-case hold time

Modeling DET-CSEs for Synthesis
- Race-path modeling.
- May have under-constrained race-path.
[Figure: rising-to-falling and falling-to-rising paths between the rising edge sample window and the falling edge sample window]

DET-CSE synthesis summary with H-tree estimate
Area and Power
Cell type                                          Area (mm2)   % total   Power (mW)   % total
Memory blocks                                      2.03         44%       214.3        72%
Core                                               1.65         36%       64           21%
Clock tree (DET-CSE H-tree estimate) @ new freq.   N/A          N/A       20.2         7%
Total                                              4.64                   298.5

DET-CSE power profile
DET power breakdown
Total cache           128.5716 mW   43%
Register file         85.762 mW     29%
Total core power      64 mW         21%
Calculated clk pwr    20.2 mW       7%

DET Core summary
Core                      Area (mm2)   % total core   Power (mW)   % total
Sequential (1986 CSEs)    1.41         85.5%          22           34%
Combinatorial + nets      0.24         14.5%          42           66%
Total                     1.65                        64
Approximately 20k-gates (based on nand4)

DET-CSE power profile
DET Core power breakdown
Total core power      64 mW         76%
Calculated clk pwr    20.2 mW       24%

Issues with DET-CSE integration
- Memory blocks are single-edge triggered and must be clocked at twice the core clock rate.
- Currently using a dual-edge triggered VHDL behavioral model for memory blocks for netlist simulations.
- Possible solutions:
  - Clock the memory blocks at 2x nominal.
  - Modify memory address and data latch to be dual-edge triggered.
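The clock relationships behind "synthesize for target throughput, not frequency" and the memory clocking issue can be written out as a small calculation. This is a sketch under one assumption: that the throughput target equals the 200 MHz SET target frequency quoted earlier, so the DET core clock is half of it. The function and variable names are illustrative.

```python
def clock_mhz_for_throughput(captures_per_us, edges_per_cycle):
    """Clock frequency (MHz) needed to capture the given number of data
    values per microsecond when each cycle has `edges_per_cycle` active edges."""
    return captures_per_us / edges_per_cycle


target_captures_per_us = 200  # assumed: same data rate as the 200 MHz SET design

set_clock_mhz = clock_mhz_for_throughput(target_captures_per_us, edges_per_cycle=1)
det_clock_mhz = clock_mhz_for_throughput(target_captures_per_us, edges_per_cycle=2)

# Single-edge triggered memories must keep up with the dual-edge core,
# so they need twice the DET core clock.
memory_clock_mhz = 2 * det_clock_mhz

print(f"SET core clock: {set_clock_mhz:.0f} MHz")
print(f"DET core clock: {det_clock_mhz:.0f} MHz")
print(f"Memory clock:   {memory_clock_mhz:.0f} MHz")
```

Under this assumption the memories still need a full-rate clock even though the core clock is halved, which is why the slide lists dual-edge triggered address and data latches as an alternative.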

Power Comparison of two design netlists
                        SPGFF (DET core)   TGMS (SET core)
Total core power        64 mW (76%)        63 mW (56%)
Calculated clk pwr      20.2 mW (24%)      48.507 mW (44%)
Core Total              92.46 mW           106.8 mW
Total                   84.2 mW            111 mW
- 27mW savings
- 24% power savings in core

Summary of comparison
- 24% savings in core power.
- Estimated 28% increase in sequential cell area (17% increase in core area).
- Both meet specified performance @ 200MHz (report zero slack).

Summary
- Established methods for automated cell characterization.
- Developed design flow for DET-CSE integration.
- Demonstrated pre-layout results.
- Obtained functional DET-CSE netlist.
- Investigated functionally enhanced DET-CSEs (scan, reset).

Future work
- Expand family of DET-CSEs (i.e. sizings, functionalities)
- Obtain more accurate clock tree loading.
- Perform layout of cells for more accurate comparison.

Functionally enhanced Dual-Edge Triggered Flip-Flops
- Need to show that functions such as reset, set, and scan can be added to DET-CSEs
- Need to do analysis of power and performance impact of added functionality
- Do DET-CSEs still result in practical power savings?
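The savings quoted in the power comparison can be re-derived from the core and clock figures in the two breakdowns. The sketch below simply adds core logic power and the calculated clock power for each design. It takes "Total" to mean that sum, which is an interpretation of the slide rather than something it states.

```python
# Figures in mW, taken from the SET and DET core power breakdowns.
set_design = {"core logic": 63.0, "calculated clock": 48.507}
det_design = {"core logic": 64.0, "calculated clock": 20.2}

set_total_mw = sum(set_design.values())
det_total_mw = sum(det_design.values())

saving_mw = set_total_mw - det_total_mw
saving_pct = 100.0 * saving_mw / set_total_mw

print(f"SET total: {set_total_mw:.1f} mW")
print(f"DET total: {det_total_mw:.1f} mW")
print(f"Saving:    {saving_mw:.1f} mW ({saving_pct:.0f}%)")
```

Nearly all of the saving comes from the clock term. The core logic power is essentially unchanged between the two netlists (63 mW versus 64 mW).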

Scan in SPGFF
[Figure: transistor-level schematic of SPGFF with scan. Inputs D, SD, SCAN; clock phases CLK, CLK1, CLK2, CLK3, CLK4; internal nodes X, Y; output Q]

Scan in DFF
[Figure: functional schematic of DFF with scan]

Clear in SPGFF
[Figure: transistor-level schematic of SPGFF with clear. Inputs D, CLR; clock phases CLK, CLK1, CLK2, CLK3, CLK4; internal nodes X, Y; output Q]

Clear in DFF
[Figure: schematic of DFF with clear]

Preliminary Results of Adding Functionalities
              Delay           Power           EDP
SPGFF         356 ps          136 μW          1.73e-23 Js
With Scan     371 ps (4.2%)   143 μW (5%)     1.97e-23 Js (14%)
With Reset    407 ps (14%)    140 μW (3%)     2.32e-23 Js (34%)

              Delay           Power           EDP
SETFF         412 ps          82 μW           1.38e-23 Js
With Scan     483 ps (17%)    82 μW (0%)      1.89e-23 Js (37%)
With Reset    483 ps (17%)    71 μW (-13%)    1.65e-23 Js (20%)
$\text{delay} = \max\left(t_{su,r} + t_{clk \to q,f},\ t_{su,f} + t_{clk \to q,r}\right)$
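The EDP column in the tables above appears consistent with taking energy as power times delay, so that EDP = power * delay^2. That relation is inferred from the tabulated numbers and is not stated on the slides. The sketch below recomputes the SPGFF rows under that assumption, with illustrative helper names.

```python
def edp_js(delay_ps, power_uw):
    """Energy-delay product in J*s, assuming energy = power * delay."""
    delay_s = delay_ps * 1e-12
    power_w = power_uw * 1e-6
    return power_w * delay_s * delay_s


def pct_change(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return 100.0 * (new - base) / base


# (delay in ps, power in uW) from the SPGFF rows of the table.
spgff_rows = {
    "SPGFF": (356, 136),
    "With Scan": (371, 143),
    "With Reset": (407, 140),
}

base_delay_ps, base_power_uw = spgff_rows["SPGFF"]
base_edp = edp_js(base_delay_ps, base_power_uw)

for name, (delay_ps, power_uw) in spgff_rows.items():
    edp = edp_js(delay_ps, power_uw)
    print(
        f"{name:10s} delay {pct_change(delay_ps, base_delay_ps):+5.1f}%  "
        f"power {pct_change(power_uw, base_power_uw):+5.1f}%  "
        f"EDP {edp:.2e} Js"
    )
```

The recomputed values agree with the table to within rounding. Because delay enters squared, the 14% delay penalty of the reset variant dominates its EDP increase even though its power rises by only about 3%.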