Alex Kondratyev
Cadence Berkeley Labs,
Berkeley, CA, USA
In collaboration with Jordi Cortadella, Luciano Lavagno,
Kelvin Lwin and Christos Sotiriou
What do we optimize?
End of deterministic design
Technical and business implications
Asynchronous design with commercial tools
Desynchronization
Delay-insensitive datapath
Optimization metrics
Late 1970s:
Literals
Nodes of a Boolean network
Levels of a Boolean network
Area
Speed
Nowadays:
Literals
Nodes of a Boolean network
Levels of a Boolean network
Wire length
Area
Speed
Tools are optimizing for area and speed!
Universal metrics
Power: small ?
P = P_dyn + P_short + P_leak
P = a * f_clk * C * V^2
Delay: t = Q / I_ds = C * V / (k * (V_dd - V_t)^2)
[Figure: power and delay versus supply voltage]
Speed can be taken as a universal metric
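The power and delay expressions on this slide pull in opposite directions as the supply voltage changes. A minimal sketch, assuming the supply V equals V_dd, a threshold voltage V_t, and purely hypothetical constants (none of the numbers come from the slides):

```python
def dynamic_power(a, f_clk, c, v):
    """Switching power P = a * f_clk * C * V^2 (a is the activity factor)."""
    return a * f_clk * c * v ** 2


def gate_delay(c, v, k, v_t):
    """Delay t = C * V / (k * (V - V_t)^2); only meaningful for V > V_t."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c * v / (k * (v - v_t) ** 2)


# Hypothetical operating point, chosen only for illustration.
A, F_CLK, C, K, V_T = 0.1, 1e9, 1e-12, 1e-3, 0.3

for v in (1.2, 1.0, 0.8):
    p = dynamic_power(A, F_CLK, C, v)
    t = gate_delay(C, v, K, V_T)
    print(f"V={v:.1f} V  power={p:.3e} W  delay={t:.3e} s")
```

Lowering the supply reduces power quadratically while delay grows, which is why a faster design can be traded for a lower-power one.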
What do we optimize?
End of deterministic design
Technical and business implications
Asynchronous design with commercial tools
Desynchronization
Delay-insensitive datapath
Fine-grain pipelining
Timing margins
Algorithms/tools (approximations)
Modeling (process corners e.g.)
Architecture (unbalanced computation)
Algorithms/tools
False paths (< 5%)
Common path pessimism removal
Hierarchy hurts!!!
10-35% gain from floorplan flattening
(Reshape)
Bad news: we do not know how far we are from optimum
Good news: optimum is not possible to find
Modeling
0.25 µm, Vdd = 2.5 V ± 10%, T = 0, 125 °C
0.13 µm, Vdd = 1.0 V ± 10%, T = 40, 125 °C
[Figure: INVX2 (fall) delay at slow, typical and fast corners, one chart per process, axis 0 to 200]
First chart: Fast = 0.76 x Typical, Slow = 1.47 x Typical
Second chart: Fast = 0.73 x Typical, Slow = 1.55 x Typical
Why panic?
New BIG players: signal integrity and process variability
Variability sources
-Environment (T, Vdd) + signal integrity
Within-die only
-Process variations
(gate length L, wire width W, threshold voltage Vt)
-Die-to-die (design independent)
-Within-die (design dependent)
Environment + SI
Supply voltage: ± 10%
Temperature: -40 °C to 125 °C
IR drop: the supply seen by the gates falls from V_DD to V'_DD, decreasing the current from Vdd
Bad news:
10^6 gates x 8 metal layers
10^9 RC elements in VDD grid
Good news:
Field solvers can handle 10^7 variables
Abstraction, model reduction, IP reuse help further
Tools make IR drop sign-off at 5% Vdd (still 10% delay penalty)
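The link between a 5% IR drop and a delay penalty of about 10% can be checked against the delay model from the Universal metrics slide. A rough sketch, assuming t is proportional to V / (V - V_t)^2 and a hypothetical threshold of 0.3 V at a 1.0 V supply (both values are assumptions, not data from the slides):

```python
def gate_delay(v, v_t, c=1.0, k=1.0):
    """Delay model t = C * V / (k * (V - V_t)^2); C and k cancel in ratios."""
    return c * v / (k * (v - v_t) ** 2)


def ir_drop_delay_penalty(v_dd, v_t, drop_fraction):
    """Relative delay increase when the gate sees v_dd * (1 - drop_fraction)."""
    nominal = gate_delay(v_dd, v_t)
    drooped = gate_delay(v_dd * (1.0 - drop_fraction), v_t)
    return drooped / nominal - 1.0


# Hypothetical 1.0 V supply and 0.3 V threshold.
penalty = ir_drop_delay_penalty(v_dd=1.0, v_t=0.3, drop_fraction=0.05)
print(f"delay penalty for a 5% drop: {penalty:.1%}")
```

With these assumed values the penalty comes out at roughly 10%, the same order as the figure quoted on the slide; a different threshold voltage shifts the result.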
Environment + SI
Crosstalk
[Figure: an aggressor net couples into a victim net, causing a pulse or extra delay on the victim]
Pruning by coupling
Worst coupling estimation
H-Spice simulation
Compute switching windows
Pruning by timing
[Figure: bar chart for ind1, ind2, ind3, ind4, axis 0 to 12]
Conservative analysis: up to 20% delay penalty (post-layout fixes)
Process variations
- Die-to-die: design independent, well modeled via worst-case files
- Within-die: design dependent, systematic and random!!
[Figures: within-die versus die-to-die variation, and variation of Lgate, Wwire and Tt, across technology generations 250, 180, 130, 100, 70 and 130, 90, 65 (Nassif’01)]
Measuring variability
Microprocessor: at-speed functional testing
ASIC: no delay testing, no binning
[Figure: % of chips versus frequency, split into Bin1, Bin2, Bin3]
Strategically placed oscillators
Problem:
Up to 15% delay variation in RO (Nassif’03)
Vertical/horizontal (4%), poly-Si spacing (7%), distance (5%)
Modeling variability
Model for gate delay (linear wrt variability sources): d_var = env_var + device_var + wire_var
Independence of sources (within a group: model reduction (PCA or SVD))
For a single variability source: L_var = L_spatial + L_random
(is modeled by random normally distributed variables N(0, σ))
d (L )
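A linear delay model with independent, normally distributed sources can be sampled directly. The sketch below is illustrative only: the nominal delay, the sensitivity coefficients and the sigmas are hypothetical, and the three sources are assumed independent as the slide suggests.

```python
import math
import random


def sample_gate_delay(d_nominal, sensitivities, sigmas, rng):
    """One sample of d = d_nominal + sum_i s_i * x_i with x_i ~ N(0, sigma_i)."""
    delay = d_nominal
    for name, s in sensitivities.items():
        delay += s * rng.gauss(0.0, sigmas[name])
    return delay


def analytic_sigma(sensitivities, sigmas):
    """Standard deviation of the linear model when the sources are independent."""
    return math.sqrt(sum((s * sigmas[n]) ** 2 for n, s in sensitivities.items()))


# Hypothetical numbers (delay in ps); not taken from the slides.
D_NOMINAL = 100.0
SENSITIVITIES = {"env": 1.0, "device": 2.0, "wire": 0.5}
SIGMAS = {"env": 3.0, "device": 2.0, "wire": 4.0}

rng = random.Random(0)
samples = [sample_gate_delay(D_NOMINAL, SENSITIVITIES, SIGMAS, rng)
           for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / (len(samples) - 1))
print(f"sampled mean={mean:.2f} std={std:.2f}, "
      f"analytic std={analytic_sigma(SENSITIVITIES, SIGMAS):.2f}")
```

Because the model is linear, the sampled spread should agree with the closed-form value, which is what makes this kind of model attractive for fast analysis.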
Statistical timing analysis
Reconvergence needs some care
Numerical computation of a distribution
Approximate convolution (5% accuracy)
Use upper and lower bounds (10% diff. Blaauw’03)
Algorithms have linear complexity!
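One common way to propagate arrival-time distributions without numerical convolution is to keep every arrival time as a normal (mean, sigma) pair: sums add means and variances, and the max at a reconvergence point is approximated by Clark's moment matching. This is a swapped-in illustration of the general idea, not necessarily the method behind the accuracy figures on the slide; the correlation `rho` is the part that "needs some care" at reconvergence.

```python
import math


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def add_delays(m1, s1, m2, s2):
    """Sum of two independent normal delays: means and variances add."""
    return m1 + m2, math.sqrt(s1 * s1 + s2 * s2)


def clark_max(m1, s1, m2, s2, rho=0.0):
    """Normal approximation of max(X, Y) by matching the first two moments."""
    theta = math.sqrt(max(s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2, 0.0))
    if theta == 0.0:
        # Identical spread and full correlation: the larger mean always wins.
        return (m1, s1) if m1 >= m2 else (m2, s2)
    alpha = (m1 - m2) / theta
    pdf, cdf = norm_pdf(alpha), norm_cdf(alpha)
    mean = m1 * cdf + m2 * (1.0 - cdf) + theta * pdf
    second = ((m1 * m1 + s1 * s1) * cdf
              + (m2 * m2 + s2 * s2) * (1.0 - cdf)
              + (m1 + m2) * theta * pdf)
    return mean, math.sqrt(max(second - mean * mean, 0.0))


# Two hypothetical paths reconverging at a gate input (values in ps).
path_a = add_delays(100.0, 5.0, 40.0, 3.0)
path_b = add_delays(120.0, 4.0, 15.0, 2.0)
print(clark_max(*path_a, *path_b))
```

Each gate is visited once and each operation is constant time, which is consistent with the linear complexity claimed on the slide.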
What does it buy?
Trading yield
[Figure: delay distribution with the worst-case (WC) point and the confidence margin]
WC confidence margin must be big (chips work)
But it is fully unknown
Statistical timing analysis helps to quantify risk (reduce margin and be structure specific)
Statistical timing analysis might help to trade off confidence margin and yield (testing???)
Open issues:
why normal?
how to derive σ?
how to derive sensitivity coefficients?
Summing this up
[Figure: cycle time split into real computation time and clock overhead]
Worst vs. average: 45%
SI: 25%
Variability: 30%
Clock skew: 10%
Nonbalanced stages: 20%
Some designs run twice as fast as the spec requires!
Everything boils down to $$$
Synchronous design is becoming a costly proposition
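One way to see how individual margins add up to a factor of about two is to compound them. The sketch below assumes the SI, variability, clock skew and nonbalanced-stage margins stack multiplicatively on top of the real computation time; that stacking rule is an assumption made here for illustration, not something the slide states.

```python
# Margins read from the slide; treating them as multiplicative is an assumption.
MARGINS = {
    "signal integrity": 0.25,
    "variability": 0.30,
    "clock skew": 0.10,
    "nonbalanced stages": 0.20,
}


def compounded_slowdown(margins):
    """Cycle time divided by real computation time if every margin compounds."""
    factor = 1.0
    for margin in margins.values():
        factor *= 1.0 + margin
    return factor


factor = compounded_slowdown(MARGINS)
print(f"cycle time / real computation time = {factor:.3f}")
```

Under this assumption the clock period ends up a little over twice the real computation time.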
Is asynchronous an option?
It is about time but …
“Must” requirements for an asynchronous CAD tool:
Competitive
- added value with minimal (or no) penalty
- scalable (capable of handling large designs)
Simple
- minimal knowledge of asynchronous design
- RTL input
Risk-free
- does not change sign-off (STA)
- complete solution in verification and testing
- backup options (synchronous implementation)
Design options
[Figure: bundled approach, single-rail logic with a delay element between start and done]
[Figure: QDI approach, dual-rail logic with a C-element producing done]
Sliding the trade-off curve
Automation efforts
QDI datapath
NCL, phased logic
Bundled data desynchronization
EMI, skew penalty
Variability
Penalties?
Average speed gates blocks
Desynchronization flow
Think synchronous
Design synchronous: one clock and edge-triggered flip-flops
De-synchronize (automatically)
Run it asynchronously
Synchronous circuit
[Figure: circuit clocked by CLK; each MS flip-flop is drawn as a pair of latches (L) marked 0 and 1]
De-synchronization
[Figure: the same latches (L), each now driven by its own controller (C) instead of CLK]
De-synchronization
Distributed controllers substitute the clock network
The data path remains intact!
[Figure: controllers (C) attached to the latches in place of the clock tree]
[Figure: latch chain A, B, C, D with the enable waveforms of each latch]
A+ B- C+ D-
A- B+ C- D+
Non-overlapping handshake protocol
[Figure: latch chain A, B, C, D with the enable waveforms of each latch]
A+ B+ C+ D+
A- B- C- D-
Overlapping is also acceptable
Concurrent model
[Figure: marked graph over the events A+, A-, B+, B-, C+, C- of latches A, B, C, with tokens labeled data and bubble]
• + and – must alternate
• data available at the previous latch
• next latch must be closed before receiving new data
For any netlist
Synchronization layer
This is a circuit marked graph (CMG)
Properties of CMGs
Any CMG is live and safe
Safeness: no data overwriting
Liveness: no deadlock
[Figure: marked graph over the events A+, A-, B+, B-, C+, C-]
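Safeness and absence of deadlock can be checked on a small marked graph by exhaustively playing the token game. The sketch below uses a hypothetical two-latch fragment (arcs A+ to B- and B- to A+ between neighbours, plus the alternation arcs of each latch); the arc set and the initial marking are assumptions for illustration, not the exact graph from the slides.

```python
from collections import deque

# Arc (source event, target event) -> initial tokens. Hypothetical fragment.
ARCS = {
    ("A+", "A-"): 0, ("A-", "A+"): 1,   # + and - of latch A alternate
    ("B+", "B-"): 1, ("B-", "B+"): 0,   # + and - of latch B alternate
    ("A+", "B-"): 0,                    # B closes only after A has opened
    ("B-", "A+"): 1,                    # A reopens only after B has closed
}


def enabled(marking, arcs):
    """Events whose input arcs all carry at least one token."""
    events = {s for s, _ in arcs} | {t for _, t in arcs}
    return sorted(e for e in events
                  if all(marking[a] > 0 for a in arcs if a[1] == e))


def fire(marking, event):
    """Consume one token per input arc, produce one per output arc."""
    nxt = dict(marking)
    for arc in nxt:
        if arc[1] == event:
            nxt[arc] -= 1
        if arc[0] == event:
            nxt[arc] += 1
    return nxt


def explore(initial, limit=10000):
    """Breadth-first search over markings: returns (safe, deadlock_free)."""
    arcs = list(initial)
    start = tuple(initial[a] for a in arcs)
    seen, queue = {start}, deque([start])
    safe = deadlock_free = True
    while queue:
        state = queue.popleft()
        marking = dict(zip(arcs, state))
        if any(tokens > 1 for tokens in state):
            safe = False            # two tokens on an arc: data overwritten
        ready = enabled(marking, arcs)
        if not ready:
            deadlock_free = False   # nothing can fire
        for event in ready:
            nxt_marking = fire(marking, event)
            nxt = tuple(nxt_marking[a] for a in arcs)
            if nxt not in seen and len(seen) < limit:
                seen.add(nxt)
                queue.append(nxt)
    return safe, deadlock_free


print(explore(ARCS))
```

Every cycle of this fragment carries exactly one token, which is the structural reason the search finds it both safe and deadlock-free; removing the token from the B- to A+ arc leaves a token-free cycle and the graph deadlocks.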
Flow equivalence [Le Guernic, Talpin, Le Lann, 2003]
[Figure: latches A and B]
Flow equivalence
Synchronous behavior (sampled by CLK):
A: 1 3 0 2 1 5 3 1 6 0
B: 5 1 2 3 1 4 2 4 3 1
De-synchronized behavior:
A: 1 3 0 2 1 5 3 1 6 0
B: 5 1 2 3 1 4 2 4 3 1
Theorem: the de-synchronization model preserves flow-equivalence
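Flow equivalence compares, latch by latch, the sequence of stored values while ignoring when each value was stored. A minimal sketch of such a check, using the A and B sequences from the slide and hypothetical timestamps (the `(time, latch, value)` trace format is an assumption made for this example):

```python
from collections import defaultdict


def flows(trace):
    """Project a timed trace [(time, latch, value), ...] onto per-latch value sequences."""
    per_latch = defaultdict(list)
    for _, latch, value in sorted(trace, key=lambda event: event[0]):
        per_latch[latch].append(value)
    return dict(per_latch)


def flow_equivalent(trace_a, trace_b):
    """True when every latch stores the same sequence of values in both traces."""
    return flows(trace_a) == flows(trace_b)


def timed(values, latch, start, period):
    """Attach hypothetical timestamps to a value sequence."""
    return [(start + i * period, latch, v) for i, v in enumerate(values)]


A_VALUES = [1, 3, 0, 2, 1, 5, 3, 1, 6, 0]
B_VALUES = [5, 1, 2, 3, 1, 4, 2, 4, 3, 1]

# Synchronous: both latches sample on the same clock edges.
sync_trace = timed(A_VALUES, "A", 0, 10) + timed(B_VALUES, "B", 0, 10)
# De-synchronized: each latch runs at its own pace (made-up timing).
desync_trace = timed(A_VALUES, "A", 1, 9) + timed(B_VALUES, "B", 4, 11)

print(flow_equivalent(sync_trace, desync_trace))
```

The two traces differ in every timestamp yet are flow-equivalent, which is the sense in which the de-synchronized circuit computes the same thing.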
Timing equivalence
[Figure: latches A, B, C, D with controllers La, Lb, Lc, Ld and matched delays del_a, del_b, del_c, del_d; waveforms A+/A-, B+/B-, C+/C-, D+/D-]
del_b = del_a = del_c = del_d
Synchronous-like behavior
Timing equivalence
[Figure: latches A, B, C, D with controllers La, Lb, Lc, Ld and matched delays del_a, del_b, del_c, del_d; waveforms A+/A-, B+/B-, C+/C-, D+/D-]
del_b > del_a = del_c = del_d
B keeps the same period and settles the rest
Compatibility
Synchronous: T_sync ≥ T_comb + T_setup + T_skew + T_CQ
Desynchronized: T_desync ≥ T_comb + T_controller + T_CQ
Statement: Desynchronized design is behavior and timing compatible with its synchronous counterpart
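The two cycle-time bounds differ only in what is added to the combinational and clock-to-Q delays: setup plus skew on the synchronous side, controller delay on the desynchronized side. A small sketch with hypothetical delays in ns (none of the numbers are from the slides):

```python
def sync_cycle(t_comb, t_setup, t_skew, t_cq):
    """Lower bound on the synchronous clock period."""
    return t_comb + t_setup + t_skew + t_cq


def desync_cycle(t_comb, t_controller, t_cq):
    """Lower bound on the desynchronized cycle time."""
    return t_comb + t_controller + t_cq


# Hypothetical stage: 3.0 ns logic, 0.2 ns setup, 0.3 ns skew, 0.3 ns clock-to-Q.
t_sync = sync_cycle(3.0, 0.2, 0.3, 0.3)
for t_controller in (0.4, 0.5, 0.6):
    t_desync = desync_cycle(3.0, t_controller, 0.3)
    print(f"controller={t_controller} ns  sync={t_sync:.2f} ns  desync={t_desync:.2f} ns")
```

Under these bounds the desynchronized version is no slower whenever the controller delay does not exceed setup plus skew.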
Synchronous environment
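[Figure: marked graph linking the clock events Clk+, Clk- of the synchronous environment to the latch events A+, A-, B+, B-, C+, C-, with a timing arc]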
Implementation of a controller
• Only local handshakes with adjacent controllers are necessary
• Synthesis by using intuition, common sense, … and petrify
Delay matching
[Figure: combinational logic in parallel with a matched delay element d]
Post-layout delay matching
[Figure: combinational logic with its matched delay tuned after layout]
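Delay matching amounts to building a delay line that is at least as slow as the logic it shadows, plus a safety margin. A minimal sizing sketch, assuming the matched delay is a chain of identical buffers and that the worst-case logic delay comes from STA; the function name, the margin and all numbers are hypothetical:

```python
import math


def matched_delay_chain(sta_max_delay, margin, buffer_delay):
    """Return (buffer count, achieved delay) covering sta_max_delay * (1 + margin)."""
    target = sta_max_delay * (1.0 + margin)
    count = math.ceil(target / buffer_delay)
    return count, count * buffer_delay


# Hypothetical block: 3.2 ns worst-case logic, 10% margin, 0.15 ns per buffer.
count, achieved = matched_delay_chain(sta_max_delay=3.2, margin=0.10, buffer_delay=0.15)
print(f"{count} buffers give a matched delay of {achieved:.2f} ns")
```

Rerunning the sizing with post-layout delays is what the post-layout matching step refers to; the margin is where gate and wire variability has to be absorbed.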
66
Desynchronization. Gaining Trust
Synchronous RTL
=
67
Async DLX block diagram
68
Desynchronization. Gaining Trust
Synchronous RTL
            Synchronous     Desynchronized
Cycle:      4.4 ns          4.45 ns
Power:      70.9 mW         71.2 mW
Area:       372,656 µm²     378,058 µm²
DLX lessons. Positive
Asynchronous design with no area, power, delay penalties
30% less EMI
Partial tolerance of variability
(matched delays scale with the rest of the gates)
Binning!!!
[Figure: an error is flagged when Treq > Tclk; signals req, Clk, B, C]
DLX lessons. Negative
Asynchronous design with no area, power, delay advantage
Clock power is saved but latched designs have higher loads
P&R constraints of de-sync design are non-trivial
Matched delay variability might hurt
Hard work to come out even with synchronous
71
Clustering
Timing optimization
Retiming of
M-latches
Can we do better?
early
D
A late
C
M S
D
A late
M S
C
72
Problems of delay matching
Max(STA_delay) z
Min(STA_delay)
Gate and wire profiles are different
(must be compensated by margins)
Matched delay margins vs inter-die variation matching??
Calls for the use of different architectures
73
Sliding the trade-off curve
Automation efforts
QDI datapath
NCL, phased logic
Bundled data desynchronization
EMI, skew penalty
Variability Average speed gates blocks
74
Phased Logic (Linden’94)
t v   meaning
0 0   even phase, value ‘0’
1 0   odd phase, value ‘0’
1 1   even phase, value ‘1’
0 1   odd phase, value ‘1’
LSB is ‘value’ bit (v)
MSB is ‘timing’ bit (t)
Example sequence (t v): 10 (odd0), 11 (even1), 01 (odd1), 00 (even0)
A signal changes phase or value (only one bit changes)
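The encoding can be written down in a few lines: the value bit v carries the data, and the timing bit t equals v in the even phase and differs from it in the odd phase. The sketch below follows the code table of this slide; the function names are made up for the example.

```python
EVEN, ODD = "even", "odd"


def encode(value, phase):
    """Return (t, v): v is the value bit; t == v in even phase, t != v in odd phase."""
    v = value & 1
    t = v if phase == EVEN else v ^ 1
    return t, v


def decode(t, v):
    """Recover (value, phase) from the timing and value bits."""
    return v, (EVEN if t == v else ODD)


def encode_stream(values, first_phase=EVEN):
    """Encode successive values, alternating the phase on every symbol."""
    phase, symbols = first_phase, []
    for value in values:
        symbols.append(encode(value, phase))
        phase = ODD if phase == EVEN else EVEN
    return symbols


def bits_changed(a, b):
    return sum(x != y for x, y in zip(a, b))


stream = encode_stream([0, 1, 1, 0], first_phase=ODD)
print(stream)
print([bits_changed(a, b) for a, b in zip(stream, stream[1:])])
```

Starting in the odd phase, the values 0, 1, 1, 0 produce the sequence odd0, even1, odd1, even0 shown on the slide, and consecutive symbols always differ in exactly one bit.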
75
Phased logic gate
A PL gate has an internal state Even or Odd .
A PL gate fires when all inputs match the gate phase.
[Figure: a gate in Even phase with inputs E and O is not ready to fire; with all inputs E it is ready to fire; after firing the gate phase is O and its output carries phase E]
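The firing rule can be modeled with a small class. In this sketch a signal is a (value, phase) pair, and after firing the output is assumed to carry the phase the gate had before it toggled; that reading is inferred from the figure and should be treated as an assumption.

```python
class PhasedGate:
    """Toy phased-logic gate; signals are (value, phase) pairs."""

    def __init__(self, func, phase="even"):
        self.func = func        # Boolean function of the input values
        self.phase = phase      # input phase the gate is waiting for
        self.output = None      # last produced (value, phase), if any

    def ready(self, inputs):
        """The gate may fire only when every input phase matches the gate phase."""
        return all(phase == self.phase for _, phase in inputs)

    def fire(self, inputs):
        """Fire if ready: emit the result in the current phase, then toggle the gate."""
        if not self.ready(inputs):
            return False
        value = self.func(*(v for v, _ in inputs))
        self.output = (value, self.phase)
        self.phase = "odd" if self.phase == "even" else "even"
        return True


gate = PhasedGate(lambda a, b: a & b)
print(gate.fire([(1, "even"), (1, "odd")]))    # one input still in the old phase
print(gate.fire([(1, "even"), (1, "even")]))   # all inputs match the gate phase
print(gate.output, gate.phase)
```

No global clock appears anywhere: the phase of the inputs alone tells the gate when its data is complete.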
76
LUT-4 based implementation
[Figure: a LUT4 computes new_v from a_v, b_v, c_v, d_v; input completion detection uses a C-element on the timing bits a_t, b_t, c_t, d_t and gate_phase; D latches with reset hold the output bits v and t; gates G1, G2, G3 generate fo, fo_b and out_phase, with feedback fi]
• Functionality: v(a_v, b_v, c_v, d_v)
• Phase: out_phase = gate_phase
Area penalty!
NCL Design Flow
Synchronous: VHDL, synthesis (GTECH library), synchronous netlist
Asynchronous: 2-rail expansion + optimization (NCL library), NCL netlist
1. Pattern matching (Ligthart’00)
2. Completion separation (NCLX)
Introduction to NCL
2-phase functioning (evaluate (DATA) – precharge (NULL)) +
Self-timed register interaction (acknowledgement of phases)
[Figure: register, combinational logic, register; a completion detector (CD) returns Ack as NULL and DATA phases alternate]
Micropipeline with delay-insensitive (DI) datapath
79
From 2 to 3-rail Scheme
[Figure: 2-rail gate F with input rails x.0, x.1, y.0, y.1 and output rails z.1, z.0]
z.1, z.0 are 2-rails but they do not acknowledge inputs
Not DI scheme!!!
From 2 to 3-rail Scheme
[Figure: functional part, a 2-rail gate F with inputs x.0, x.1, y.0, y.1 and outputs z.1, z.0; completion part, a C-element combining x.go and y.go into z.go]
Rationale behind delay-insensitivity of 3-rail scheme:
1. 2-rail circuit is hazard-free under monotonic input changes
2. All input changes are observable at outputs
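Why a plain 2-rail gate fails to acknowledge its inputs, and how a separate go rail fixes that, can be shown with a dual-rail AND. The gate function (z.1 = x.1 AND y.1, z.0 = x.0 OR y.0) is an assumed example of F, and the C-element is modeled behaviorally; neither is taken from the slides.

```python
NULL = (0, 0)  # both rails low: no data yet; a signal is (rail .1, rail .0)


def dual_rail(bit):
    """Encode a Boolean as (rail .1, rail .0)."""
    return (1, 0) if bit else (0, 1)


def and_2rail(x, y):
    """Unate dual-rail AND: z.1 = x.1 & y.1, z.0 = x.0 | y.0."""
    (x1, x0), (y1, y0) = x, y
    return (x1 & y1, x0 | y0)


def c_element(previous, *inputs):
    """Muller C-element: follows the inputs when they agree, else holds its state."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return previous


# x = 0 has arrived, y is still NULL: the functional part already shows a valid 0.
z = and_2rail(dual_rail(0), NULL)
print("z =", z)

# The go rail waits for both inputs, so it does acknowledge y.
z_go = c_element(0, 1, 0)      # only x.go is high
print("z.go with y missing:", z_go)
z_go = c_element(z_go, 1, 1)   # y.go arrives
print("z.go with both inputs:", z_go)
```

The data rails may settle early, but a consumer that also waits for z.go never proceeds before every input has arrived, which is the observability requirement in point 2.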
NCLX flow (MUX)
[Figure: a MUX with data inputs a and b, select s and output z goes through tech. mapping and 2-rail expansion. Functional part (unate): z.1 and z.0 are computed from the rails a.1, a.0, b.1, b.0, s.1, s.0; on its own this 2-rail gate is incomplete. Completion part: a C-element combines a.go, s.go, b.go into z.go; together they form the complete 2-rail gate]
NCL lessons. Positive
Very low EMI
High security of computation
Automatic stand-by mode
Tolerance to variability
NCL lessons. Negative
Big area overhead: 2.7-3.0x
No performance advantage
(average case performance is swallowed by the penalty from NULL)
Completion introduces further penalties (power and delay)
Back in Business
Performance improvement:
Fast reset
- partition a circuit into chunks 4-6 levels logic deep
- apply reset to each chunk simultaneously
Use faster negative gates
- negative gates are about 20% faster than unate gates
Area improvement:
Make completion by outputs only
- single NOR gate suffices
Penalties and Savings
[Figure: delay and area after logic synthesis and after place & route for Sync, 2-R (v=0%), 2-R (v=60%) and 2-R (v=100%), axis 0 to 250]
20-35% performance improvement at the expense of a 100% area penalty
Use Case
[Figure: FF, combinational logic, FF, with reset, comp, clk and error signals]
Error signal provides a means to:
- Calibrate chips during manufacturing test
- Perform on-line delay testing
Use Case
Find timing critical portions of design
Re-implement them asynchronously
Up to 30% performance improvement at the cost of a
2x area penalty for critical portions
(appealing if the critical portion is small)
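How much the 2x penalty costs overall depends only on how large the critical portion is. A quick sketch of that arithmetic, with hypothetical critical fractions:

```python
def total_area(critical_fraction, area_factor=2.0):
    """Relative chip area when only the critical fraction is re-implemented at area_factor."""
    return (1.0 - critical_fraction) + critical_fraction * area_factor


for fraction in (0.05, 0.10, 0.25, 1.00):
    print(f"critical portion {fraction:.0%}: total area x{total_area(fraction):.2f}")
```

When the critical portion is a small fraction of the design, the overall area grows by roughly that same fraction rather than doubling.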
Best of both worlds. PLA designs
Dynamic PLAs are naturally two-phase
Delay matching of PLA is easy
Bundled routing to cope with wire delay variations
[Figure: PLAs A, B, C, D, E connected by go/done handshakes (go_A/done_A through go_E/done_E) alongside the data, with C-elements merging done signals]
Critical path is always go-done path
PLA vs SC
• PLA-based design vs standard cell based design (SC typical vs PLA typical)
[Figure: standard cell flow and PLA flow from RTL design to delay; labels: RC, Encounter, Cluster, SA-Placer, Bundle router, P_SC, P_PLA]

Circuit     SC_A   SC_D   PLA_A   PLA_D
alu2       11105   1274    7672    1160
alu4       18884   1622   18683    1521
C6288      80304   4401   75218    4235
apex7       4339    784    4729     818
apex6      14270    907   15291     838
C1355      14237   1351   12162    1354
C3540      29239   2074   29478    2083
k2         47332   1361   30789    1602
x3         11353    862   14588     845
C5315      31092   2005   42031    2076
average        1      1  0.9923  0.9975
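The average row appears to be the mean of the per-circuit PLA/SC ratios, which can be recomputed from the rows. This sketch assumes that A stands for area and D for delay, and that the apex7 and apex6 PLA values are 4729/818 and 15291/838 (they are spilled across lines in the extracted slide text, so that pairing is an assumption).

```python
# (SC_A, SC_D, PLA_A, PLA_D) per circuit, as read from the table.
RESULTS = {
    "alu2": (11105, 1274, 7672, 1160),
    "alu4": (18884, 1622, 18683, 1521),
    "C6288": (80304, 4401, 75218, 4235),
    "apex7": (4339, 784, 4729, 818),
    "apex6": (14270, 907, 15291, 838),
    "C1355": (14237, 1351, 12162, 1354),
    "C3540": (29239, 2074, 29478, 2083),
    "k2": (47332, 1361, 30789, 1602),
    "x3": (11353, 862, 14588, 845),
    "C5315": (31092, 2005, 42031, 2076),
}


def mean_ratio(results, numerator, denominator):
    """Average over circuits of column[numerator] / column[denominator]."""
    ratios = [row[numerator] / row[denominator] for row in results.values()]
    return sum(ratios) / len(ratios)


area_ratio = mean_ratio(RESULTS, 2, 0)    # PLA_A / SC_A
delay_ratio = mean_ratio(RESULTS, 3, 1)   # PLA_D / SC_D
print(f"mean PLA/SC area ratio:  {area_ratio:.4f}")
print(f"mean PLA/SC delay ratio: {delay_ratio:.4f}")
```

Both means land within a fraction of a percent of the 0.9923 and 0.9975 in the average row, so on these benchmarks the PLA flow is on par with standard cells in both columns; the spread between individual circuits (k2 versus C5315, for instance) is much larger than the difference in the averages.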