Low Power Clocking
Through the Use of Dual Edge
Triggered Flip-Flops
Gabriel Ricardo
Theresa Holliday
ACSEL Lab University of California, Davis
Outline
• Dual Edge Flip-Flops overview
• Standard Cell Characterization
• LEON Synthesis for SET design
• LEON Synthesis for DET design
• Issues with including dual-edge flip-flops in the synthesis flow
• Preliminary comparisons
• Conclusions and Future Work
• Questions

Symmetric Pulse Generator Flip-Flop (SPGFF)
• The first stage (nodes X and Y) is dynamic; the second stage is a static NAND.
• This results in a small delay.
• Transistors can be sized to trade some delay for power.

Operation of SPGFF
• The transparency window created by CLK and CLK3 for stage 1 (CLK1 and CLK4 for stage 2) allows X (Y) to conditionally evaluate based on input D.
• The output-stage NAND allows X and Y to be passed to the output based on the clock value, without the need for a latch.

Transmission Gate Master Slave (TGMS)

Comparison between SPGFF and TGMS in 0.18 μm

Cell     Delay    Power     EDP           Clk load
SPGFF    356 ps   133 μW    1.70e-23 Js   12 fF
TGMS     354 ps   89.9 μW   1.13e-23 Js   16 fF

$\mathrm{delay} = \max\left(t_{su,r} + t_{clk\text{-}q,f},\; t_{su,f} + t_{clk\text{-}q,r}\right)$
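
The EDP column in the comparison table can be cross-checked from the delay and power columns. The sketch below assumes that energy is taken as power times delay, so that EDP is power times delay squared; the deck does not state this definition, and the function name is only illustrative. Under that assumption the computed values agree with the table to within about 1%.

```python
# Delay and power values copied from the SPGFF/TGMS comparison table.
cells = {
    "SPGFF": {"delay_s": 356e-12, "power_w": 133e-6},
    "TGMS": {"delay_s": 354e-12, "power_w": 89.9e-6},
}


def energy_delay_product(power_w, delay_s):
    """Return EDP in joule-seconds, assuming energy = power * delay."""
    energy_j = power_w * delay_s
    return energy_j * delay_s


edp = {
    name: energy_delay_product(c["power_w"], c["delay_s"])
    for name, c in cells.items()
}

for name, value in edp.items():
    print(f"{name}: {value:.2e} Js")
```

On these numbers the SPGFF pays a higher EDP than the TGMS at the cell level, which is why the comparison also lists clock load: the dual-edge cell is meant to recover power in the clock tree rather than in the cell itself.
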
[Figure: bar charts comparing TGMS and SPGFF. One chart shows power at 25% activity (μW), broken down into total, data, internal, and clock power; the other shows performance (ps), broken down into setup time and total delay.]

Advantages of SPGFF
• Lowest clock energy among DET-CSEs, resulting in higher clock power savings.
• Energy-delay product comparable to high-performance single-edge-triggered clocked storage elements.

Characterization Methodology – Generating synthesis views
• Created an automated process for generating Synopsys Liberty format (.lib) synthesis models.
  • Uses Perl scripts and gspice (a SPICE pre/postprocessor).
• Cells are characterized for timing and energy.
• Can easily be extended to generate Cadence synthesis models (.tlf).

Characterization Methodology – Trip-points
• Used the same trip-points as those in the technology library.
• Nominal conditions: 25 °C, 1.8 V supply.
• Can easily generate best-case and worst-case corner models (over temperature and supply variation).
• Cell delay: defined from the clock 50% rise/fall point to the output (Q or QN) 50% rise/fall point.
• Transition time: 10%-90% for rise time, 90%-10% for fall time.

Trip-points - Falling
Trip-points - Rising
Characterization Methodology – Drive Characteristics
• Build a 5x5 non-linear delay table.
• Clock slope values (ns): 0.03, 0.1, 0.4, 1.5, 3
• Output load values (fF): 0.35, 21, 38.5, 147, 311

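
The 5x5 grid above maps naturally onto a Liberty lookup-table template. The following is a minimal sketch of how a script might enumerate the 25 simulation points and emit the template text. The template name `delay_template_5x5`, the helper names, and the choice of pF as the library capacitance unit are assumptions made for illustration; the original flow used Perl scripts, not this code.

```python
CLOCK_SLOPES_NS = [0.03, 0.1, 0.4, 1.5, 3]
OUTPUT_LOADS_FF = [0.35, 21, 38.5, 147, 311]


def sweep_points(slopes_ns, loads_ff):
    """Return every (clock slope, output load) pair to simulate."""
    return [(slope, load) for slope in slopes_ns for load in loads_ff]


def lu_table_template(name, slopes_ns, loads_ff):
    """Format a Liberty lu_table_template; loads are converted from fF to pF
    (assumes the library declares pF as its capacitive load unit)."""
    loads_pf = [load / 1000.0 for load in loads_ff]
    index_1 = ", ".join(f"{s:g}" for s in slopes_ns)
    index_2 = ", ".join(f"{c:g}" for c in loads_pf)
    return (
        f"lu_table_template ({name}) {{\n"
        f"  variable_1 : input_net_transition;\n"
        f"  variable_2 : total_output_net_capacitance;\n"
        f'  index_1 ("{index_1}");\n'
        f'  index_2 ("{index_2}");\n'
        f"}}\n"
    )


points = sweep_points(CLOCK_SLOPES_NS, OUTPUT_LOADS_FF)
template = lu_table_template("delay_template_5x5", CLOCK_SLOPES_NS, OUTPUT_LOADS_FF)
print(len(points), "simulation points")
print(template)
```

Each of the 25 points corresponds to one SPICE run per timing arc, so keeping the grid in one place makes it easy to regenerate the tables for other corners.
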
Characterization Methodology – Trip-points
• Setup time: sweep the input transition toward the active edge until the clock-to-output delay increases by 10%.
• Hold time: sweep the input transition away from the active edge until the clock-to-output delay increases by 10%.

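
The 10% push-out criterion can be illustrated with a small search routine. This sketch uses bisection rather than the linear sweep described on the slide, and it replaces the SPICE simulation with a made-up analytic model (`toy_clk_to_q`); both are assumptions for illustration only. The 3 ps tolerance mirrors the accuracy quoted for the setup and hold tables.

```python
import math


def find_setup_time(clk_to_q, nominal_s, lo_s, hi_s, pushout=0.10, tol_s=3e-12):
    """Bisect the data-to-clock separation at which clk-to-Q is pushed out.

    lo_s must be a separation that violates the push-out limit and hi_s one
    that meets it. Returns the smallest passing separation found, accurate
    to within tol_s.
    """
    limit = nominal_s * (1.0 + pushout)
    while hi_s - lo_s > tol_s:
        mid = 0.5 * (lo_s + hi_s)
        if clk_to_q(mid) > limit:
            lo_s = mid  # data arrives too close to the clock edge
        else:
            hi_s = mid  # delay is still within the push-out limit
    return hi_s


NOMINAL_S = 200e-12  # hypothetical nominal clk-to-Q delay


def toy_clk_to_q(separation_s):
    """Stand-in for a SPICE run: delay grows as data approaches the edge."""
    return NOMINAL_S * (1.0 + math.exp(-separation_s / 20e-12))


setup_s = find_setup_time(toy_clk_to_q, NOMINAL_S, lo_s=0.0, hi_s=500e-12)
print(f"setup time ~ {setup_s * 1e12:.1f} ps")
```

The hold search is the mirror image: the data transition is moved on the other side of the active edge and the same push-out test is applied.
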
Characterization Methodology – Setup-hold
[Figure: clock-to-Q delay versus data-to-clock delay, showing the constant clk-Q regions, the variable clk-Q regions, the failure region, and the 10% push-out points.]

Characterization Methodology – Setup and Hold
• Build a 3x2 non-linear delay table (3 ps accuracy).
• Clock slope values (ns): 0.03, 3
• Data slope values (ns): 0.03, 0.9, 3

Characterization Methodology – Internal energy
• Internal energy is characterized over the same data points as the drive characteristics (5x5 lookup table).
• Data-pin and clock-pin energy tables are also generated (1x5 lookup tables).

Characterization Results – single vs dual-edge – D to Q delay
[Figure: surface plots of TGMS delay and SPGFF delay (ns) versus clock slope (ns) and load (number of minimum-sized inverters), with delay bands from 0 to 0.45 ns.]

What is typical output load?
• Extracted output loading from the netlist for all CSEs.
• Average load = 24 fF (6.8 minimum-sized inverters).
• 90% of CSEs have a load of less than 60 fF (17 minimum-sized inverters).

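
The two load figures above imply a consistent conversion between femtofarads and minimum-sized inverter loads. The sketch below derives the per-inverter capacitance from the stated average (24 fF for 6.8 inverters) and applies it to other loads; the constant and function names are illustrative, and the per-inverter value is inferred rather than quoted from the library.

```python
AVG_LOAD_FF = 24.0
AVG_LOAD_INVERTERS = 6.8

# Input capacitance of one minimum-sized inverter, inferred from the average.
UNIT_INVERTER_FF = AVG_LOAD_FF / AVG_LOAD_INVERTERS


def load_in_inverters(load_ff):
    """Express a capacitive load as a number of minimum-sized inverters."""
    return load_ff / UNIT_INVERTER_FF


for load_ff in (24.0, 60.0):
    print(f"{load_ff:g} fF ~ {load_in_inverters(load_ff):.1f} min. inverters")
```

The same conversion relates the fF load values used in characterization to the inverter-count axes used in the result plots.
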
Netlist extracted CSE output loading statistics
[Figure: histogram of output loading on CSEs, showing number of nets (0 to 1200) versus loading in number of minimum-sized inverters.]

Characterization Results – single vs dual-edge – Delay
[Figure: surface plots of TGMS delay and SPGFF delay (ns) versus clock slope (ns) and load (number of minimum-sized inverters), with the typical region of operation marked.]

Characterization Results – zoomed-in – single vs dual-edge – delay
[Figure: surface plots of TGMS delay and SPGFF delay (ns) over the typical region of operation: clock slope from 0.03 to 0.11 ns and load from 2 to 6 minimum-sized inverters, with delay bands from 0.14 to 0.21 ns.]

Characterization Results – single vs dual-edge – Energy delay product
[Figure: surface plots of TGMS energy and SPGFF energy (pJ) versus clock slope (0.03 to 0.11 ns) and load (2 to 6 minimum-sized inverters), with energy bands from 0.18 to 0.32 pJ.]

Leon SPARC core configuration
LEON SPARC synthesis
• Synthesized using the TSMC 0.18 μm standard cell library.
• Target frequency of 200 MHz.
• Limited to the use of a single size of D-FF.

SET-Synthesis flow
[Flow diagram: the RTL of the processor (VHDL) and the standard cell library feed synthesis (Design Compiler), which produces a netlist (.db) and reports (area, timing); the netlist goes to power analysis (Power Compiler).]

SET-CSE synthesis summary: Area and Power

Cell type                 Area (mm²)   % total   Power (mW)   % total
Memory blocks             2.03         55%       214.3        72%
Core                      0.71         19%       73           24%
Clock tree (ideal net)    N/A          N/A       11.6         4%
Total                     3.7                    299

Core summary

Core                      Area (mm²)   % total core   Power (mW)
Sequential (1986 CSEs)    0.47         36%            26
Combinatorial + nets      0.24         64%            47
Total                     0.71                        73

Approximately 20k gates.

Clock tree loading

Clock tree components            Loading (pF)
Sequential cells (1986 cells)    5.18
Memory macro cells (6)           1.37
Wire routing*                    11.4
Total                            17.94

* Based on the library wire-load model.

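
A first-order clock power figure follows from the loading table using the standard switching-power relation P = C·V²·f. The sketch below assumes a full-swing clock at the 1.8 V nominal supply and the 200 MHz target frequency quoted elsewhere in the deck, and ignores clock driver power. The result lands close to the 11.6 mW ideal-net clock tree entry in the SET synthesis summary, although the deck does not state that the entry was computed this way.

```python
VDD_V = 1.8        # nominal supply from the characterization conditions
F_CLK_HZ = 200e6   # SET target clock frequency

# Clock tree loading components from the table, in pF.
clock_loads_pf = {
    "sequential cells": 5.18,
    "memory macro cells": 1.37,
    "wire routing": 11.4,
}


def switching_power_w(load_f, vdd_v, freq_hz):
    """Dynamic power of a node charged and discharged once per clock cycle."""
    return load_f * vdd_v ** 2 * freq_hz


total_load_pf = sum(clock_loads_pf.values())
clock_power_mw = switching_power_w(total_load_pf * 1e-12, VDD_V, F_CLK_HZ) * 1e3
print(f"total load = {total_load_pf:.2f} pF, clock power ~ {clock_power_mw:.1f} mW")
```

Because wire routing dominates the total, any error in the wire-load estimate carries straight through to the clock power number.
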
Clock tree power estimation
• High-fanout nets are beyond the interpolation range of the library's wire-load models.
• Wire-load models are not meant for estimating balanced distribution nets such as clock nets.
• Using the library wire-load models for the clock tree is therefore not valid.
• Use an H-tree estimation equation to obtain a ballpark number instead.

H-tree estimation equation
• Equation developed by ACSEL lab member Nikola Nedovic.
• Recursively calculates the H-tree loading for a given area, number of CSEs in the design, and number of H-tree levels.

H-tree estimation method
[Figure: H-tree clock distribution from the PLL to the leaf level, annotated with S, c, S/2^(L-1), and M/4^(L-1) storage elements.]

H-tree estimation method
* Table taken from Nedovic, Nikola, Ph.D. Dissertation, UCD, “CLOCKED STORAGE ELEMENTS FOR HIGH-PERFORMANCE APPLICATIONS”
H-tree estimation method
• Equation reduces to:
[Equation: two annotated terms, the load due to CSEs and the load due to wiring.]

Total H-tree power
[Equation: two annotated terms, the load switching power and the clock driver power.]

SET-CSE synthesis summary with H-tree estimate: Area and Power

Cell type                       Area (mm²)   % total   Power (mW)   % total
Memory blocks                   2.03         55%       214.3        66%
Core                            0.71         19%       63           19%
Clock tree (H-tree estimate)    N/A          N/A       48.5         15%
Total                           3.7                    325

SET-CSE power profile with H-tree estimate
SET power breakdown (pie chart):
• Total cache: 128.5716 mW (40%)
• Register file: 85.762 mW (26%)
• Total core power: 63 mW (19%)
• Calculated clock power: 48.507 mW (15%)

SET-CSE Core power profile
SET core power breakdown (pie chart):
• Total core power: 63 mW (56%)
• Calculated clock power: 48.507 mW (44%)

Modeling DET-CSEs for Synthesis
• Need to model the timing parameters for both edges.
[Figure: timing diagram of system clock, data, and output. The SET-CSE is annotated with Tsetup, Thold, and Tclk->Q; the DET-CSE is annotated with Ts-r and Th-r for the rising edge and Ts-f and Th-f for the falling edge.]

Modeling DET-CSEs for Synthesis
• Can model complex timing relationships for synthesis.
[Figure: cell with pins D, CLK, and Q, showing a rising-edge timing arc and a falling-edge timing arc.]

Modeling DET-CSEs for Synthesis
• The synthesis tool will time, and try to meet constraints for, the dual-edge triggered synchronous system.
[Figure: DET-CSE with D and CLK inputs.]

Modeling DET-CSEs for Synthesis
• The synthesis tool will use the worst timing arc relationship for the critical-path constraint.
[Figure: clock waveform with rising-edge and falling-edge sample windows; rising-to-falling and falling-to-rising paths are shown, marked as critical or not critical.]

Modeling DET-CSEs for Synthesis
• Synthesis tools are not capable of inferring a dual-edge triggered device from HDL code.
• For meeting timing, only the strictest constraint matters anyway (i.e., for one pair of launch and capture edges).
• It is therefore unnecessary to model a complex timing device.

Modeling DET-CSEs for Synthesis
• Simply model the DET-CSE as a SET-CSE with worst-edge timing parameters.
[Figure: timing diagram of system clock, data, and output, annotated with Ts-max, Th-max, and Tclk->Q-max.]

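
The worst-edge simplification amounts to taking, for each timing parameter, the larger of the rising-edge and falling-edge values. The sketch below shows that reduction; the `EdgeTiming` type, the function name, and the numeric values are hypothetical and are not characterization results from the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeTiming:
    """Timing parameters measured with respect to one clock edge, in ps."""
    setup_ps: float
    hold_ps: float
    clk_to_q_ps: float


def worst_edge_model(rising, falling):
    """Collapse per-edge DET timing into one SET-style worst-case model."""
    return EdgeTiming(
        setup_ps=max(rising.setup_ps, falling.setup_ps),
        hold_ps=max(rising.hold_ps, falling.hold_ps),
        clk_to_q_ps=max(rising.clk_to_q_ps, falling.clk_to_q_ps),
    )


# Hypothetical per-edge values, for illustration only.
rising = EdgeTiming(setup_ps=90.0, hold_ps=10.0, clk_to_q_ps=250.0)
falling = EdgeTiming(setup_ps=75.0, hold_ps=20.0, clk_to_q_ps=265.0)

model = worst_edge_model(rising, falling)
print(model)
```

The resulting model is pessimistic for the faster edge, which is acceptable for meeting the critical-path constraint but not sufficient on its own for race-path (hold) checks, where the fastest clk-Q matters.
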
Synthesis flow for DET-CSEs
[Flow diagram: automated characterization (Perl, HSPICE) produces the model of the DET-CSE; this model, the standard cell library, and the RTL of the processor (VHDL) feed synthesis (Design Compiler), which produces a netlist with DET-CSEs (.db) for power analysis and timing analysis.]

Synthesis flow for DET-CSEs
• Use synthesis directives to force use of the DET-CSE modeled device.
• Synthesize for target throughput, not frequency.
• Use worst-case models for meeting critical-path timing constraints.
• Generate a worst-case hold model to verify the race path.
  • Fastest clk-Q with worst-case hold time.

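
Two of the points above lend themselves to a short numeric sketch. The first function assumes that a dual-edge design captures data on both clock edges, so the clock can run at half the target data rate. The second applies the usual hold (race-path) condition with the fastest clk-Q and the worst-case hold time, assuming zero clock skew. Function names and the example values are hypothetical.

```python
def det_clock_frequency_hz(target_throughput_hz):
    """Clock frequency for a DET design that captures on both clock edges."""
    return target_throughput_hz / 2.0


def hold_slack_ps(clk_to_q_min_ps, logic_min_ps, hold_max_ps):
    """Race-path slack: earliest data arrival minus worst-case hold time.

    Assumes zero clock skew between the launching and capturing elements.
    A negative result indicates a hold violation.
    """
    return clk_to_q_min_ps + logic_min_ps - hold_max_ps


clock_hz = det_clock_frequency_hz(200e6)
print(f"DET clock for 200 MHz throughput: {clock_hz / 1e6:.0f} MHz")

# Hypothetical values: a direct flop-to-flop path with no logic in between.
slack_ps = hold_slack_ps(clk_to_q_min_ps=150.0, logic_min_ps=0.0, hold_max_ps=40.0)
print(f"hold slack = {slack_ps:.0f} ps")
```

Pairing the minimum clk-Q with the maximum hold time keeps the check conservative across both edges.
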
Modeling DET-CSEs for Synthesis
• Race-path modeling.
• May have an under-constrained race path.
[Figure: clock waveform with rising-edge and falling-edge sample windows, showing rising-to-falling and falling-to-rising paths.]

DET-CSE synthesis summary with H-tree estimate: Area and Power

Cell type                                          Area (mm²)   % total   Power (mW)   % total
Memory blocks                                      2.03         44%       214.3        72%
Core                                               1.65         36%       64           21%
Clock tree (DET-CSE H-tree estimate) @ new freq.   N/A          N/A       20.2         7%
Total                                              4.64                   298.5

DET-CSE power profile
DET power breakdown (pie chart):
• Total cache: 128.5716 mW (43%)
• Register file: 85.762 mW (29%)
• Total core power: 64 mW (21%)
• Calculated clock power: 20.2 mW (7%)

DET Core summary

Core                      Area (mm²)   % total core   Power (mW)   % total
Sequential (1986 CSEs)    1.41         85.5%          22           34%
Combinatorial + nets      0.24         14.5%          42           66%
Total                     1.65                        64

Approximately 20k gates (based on nand4).

DET-CSE power profile
DET core power breakdown (pie chart):
• Total core power: 64 mW (76%)
• Calculated clock power: 20.2 mW (24%)

Issues with DET-CSE integration
• Memory blocks are single-edge triggered and must be clocked at twice the core clock rate.
• Currently using a dual-edge triggered VHDL behavioral model of the memory blocks for netlist simulations.
• Possible solutions:
  • Clock the memory blocks at 2x nominal.
  • Modify the memory address and data latches to be dual-edge triggered.

Power Comparison of two design netlists

DET core power breakdown (SPGFF):
• Total core power: 64 mW (76%)
• Calculated clock power: 20.2 mW (24%)
• Chart labels: Core Total = 92.46 mW; Total = 84.2 mW

SET core power breakdown (TGMS):
• Total core power: 63 mW (56%)
• Calculated clock power: 48.507 mW (44%)
• Chart labels: Core Total = 106.8 mW; Total = 111 mW

Result: 27 mW savings, a 24% power savings in the core.

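
The savings figure can be reproduced from the two breakdowns by adding core power and calculated clock power for each netlist. This is a small arithmetic sketch with illustrative names; it assumes the quoted savings compare those two sums, which is consistent with the 84.2 mW and 111 mW totals on the slide.

```python
# Core power and calculated clock power from the two breakdowns, in mW.
set_core = {"core_mw": 63.0, "clock_mw": 48.507}   # TGMS, single-edge
det_core = {"core_mw": 64.0, "clock_mw": 20.2}     # SPGFF, dual-edge


def total_mw(breakdown):
    """Sum of core power and calculated clock power."""
    return breakdown["core_mw"] + breakdown["clock_mw"]


savings_mw = total_mw(set_core) - total_mw(det_core)
savings_pct = 100.0 * savings_mw / total_mw(set_core)
print(f"savings = {savings_mw:.1f} mW ({savings_pct:.0f}%)")
```

Nearly all of the difference comes from the clock term: core power is about the same in both netlists, while the calculated clock power drops by more than half.
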
Summary of comparison
• 24% savings in core power.
• Estimated 28% increase in sequential cell area (17% increase in core area).
• Both designs meet the specified performance at 200 MHz (reporting zero slack).

Summary
• Established methods for automated cell characterization.
• Developed a design flow for DET-CSE integration.
• Demonstrated pre-layout results.
• Obtained a functional DET-CSE netlist.
• Investigated functionally enhanced DET-CSEs (scan, reset).

Future work
• Expand the family of DET-CSEs (e.g., sizings, functionalities).
• Obtain more accurate clock tree loading.
• Perform layout of cells for a more accurate comparison.

Functionally enhanced Dual-Edge Triggered Flip-Flops
• Need to show that functions such as reset, set, and scan can be added to DET-CSEs.
• Need to analyze the power and performance impact of the added functionality.
  • Do DET-CSEs still result in practical power savings?

Scan in SPGFF
[Figure: transistor-level schematic of the SPGFF with scan. Labeled signals: CLK, CLK1, CLK2, CLK3, CLK4, D, SD, SCAN, X, Y, Q.]

Scan in DFF
Functional Schematic of DFF with Scan
Clear in SPGFF
[Figure: transistor-level schematic of the SPGFF with clear. Labeled signals: CLK, CLK1, CLK2, CLK3, CLK4, D, CLR, X, Y, Q.]

Clear in DFF
Preliminary Results of Adding Functionalities

Cell         Delay           Power           EDP
SPGFF        356 ps          136 μW          1.73e-23 Js
With Scan    371 ps (4.2%)   143 μW (5%)     1.97e-23 Js (14%)
With Reset   407 ps (14%)    140 μW (3%)     2.32e-23 Js (34%)

Cell         Delay           Power           EDP
SETFF        412 ps          82 μW           1.38e-23 Js
With Scan    483 ps (17%)    82 μW (0%)      1.89e-23 Js (37%)
With Reset   483 ps (17%)    71 μW (-13%)    1.65e-23 Js (20%)

$\mathrm{delay} = \max\left(t_{su,r} + t_{clk\text{-}q,f},\; t_{su,f} + t_{clk\text{-}q,r}\right)$

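
The percentages in parentheses read as the change relative to the base cell in each table. A short sketch of that calculation, using values copied from the tables and an illustrative helper name, is given below.

```python
def percent_change(base, new):
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base


# (delay in ps, power in uW, EDP in Js) copied from the results tables.
spgff = {
    "base": (356, 136, 1.73e-23),
    "scan": (371, 143, 1.97e-23),
    "reset": (407, 140, 2.32e-23),
}
setff = {
    "base": (412, 82, 1.38e-23),
    "scan": (483, 82, 1.89e-23),
    "reset": (483, 71, 1.65e-23),
}

for name, table in (("SPGFF", spgff), ("SETFF", setff)):
    for variant in ("scan", "reset"):
        changes = [
            percent_change(b, v) for b, v in zip(table["base"], table[variant])
        ]
        print(name, variant, [f"{c:+.1f}%" for c in changes])
```

Read this way, adding scan costs the SPGFF proportionally less delay than it costs the single-edge flip-flop, while reset costs roughly the same fraction of delay in both.
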