CMOS Circuit Design

Prof. MacDonald
MOS Transistor
[Figure: MOS transistor cross-section: gate between field-oxide regions, source and drain in the silicon substrate; gate, drain, source, and substrate terminals]
The substrate terminal is typically tied to ground for P-wells and to Vdd for N-wells.
MOS Transistor
[Figure: NFET cross-section: N+ source and drain in a P substrate under the gate, with field oxide on either side]
The device is symmetrical: for an NFET, the drain is defined as the node with the higher voltage.
With zero bias on the gate, the channel region is P-type, forming two back-to-back diodes.
No conduction occurs between source and drain.
MOS Transistor
[Figure: NFET cross-section with an N-type inversion layer under the gate and a depletion region around it]
If the gate voltage is raised to Vt, an N-type channel (the inversion layer) forms below the gate.
This effectively shorts out the back-to-back diodes and allows conduction.
MOS Transistor - off
[Figure: NFET cross-section biased with Vs = 0, Vb = 0, gate at Vg, drain at Vd]
If Vgs < Vt, no inversion layer exists, and the back-to-back diodes prevent conduction between drain and source regardless of Vds.
MOS Transistor – Linear mode
[Figure: NFET cross-section biased with Vs = 0, Vb = 0, Vg > Vt, Vd small]
If Vgs > Vt and Vds remains small, the inversion layer beneath the gate is almost uniform and complete from source to drain.
The channel acts as a resistor, and Ids increases linearly with Vds.
MOS Transistor – Almost saturated
[Figure: NFET cross-section biased with Vs = 0, Vb = 0, Vg > Vt, Vd = Vgs – Vt]
If Vds = Vgs – Vt, the inversion layer begins to disappear at the drain end of the channel. This is the transition point from linear mode to saturation mode.
MOS Transistor – saturated
[Figure: NFET cross-section biased with Vs = 0, Vb = 0, Vg > Vt, Vd > Vgs – Vt]
If Vds > Vgs – Vt, the inversion layer disappears near the drain (pinch-off).
The channel voltage at the end of the inversion layer is Vdsat = Vgs – Vt, and electrons that reach the end are swept into the drain. Further increases in Vds have little effect on Ids.
MOSFET Current Equation
[Figure: Ids versus Vds curves for several values of Vgs]
Cutoff (Vgs < Vt):
$$I_{ds} \approx 0$$
Linear (Vgs > Vt, Vds < Vgs – Vt):
$$I_{ds} = \mu C_{ox}\frac{W}{L}\left[(V_{gs}-V_t)V_{ds} - \frac{V_{ds}^2}{2}\right]$$
Saturation (Vgs > Vt, Vds ≥ Vgs – Vt):
$$I_{ds} = \frac{1}{2}\mu C_{ox}\frac{W}{L}(V_{gs}-V_t)^2$$
The body effect is another consideration not described here: if Vbs changes, Vt shifts, and consequently so does Ids.
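The three regions above can be captured in a few lines of code. This is a minimal sketch of the long-channel square-law model only; `k` stands for µCox·W/L, the values in the usage lines are hypothetical, and body effect and channel-length modulation are ignored, as in the equations.

```python
def nmos_ids(vgs: float, vds: float, vt: float, k: float) -> float:
    """Square-law NFET drain current in amps.

    k stands for uCox*(W/L) in A/V^2. Body effect and channel-length
    modulation are ignored, matching the first-order equations above.
    """
    vov = vgs - vt                              # overdrive, Vgs - Vt
    if vov <= 0:
        return 0.0                              # cutoff: Ids ~ 0
    if vds < vov:
        return k * (vov * vds - vds ** 2 / 2)   # linear region
    return 0.5 * k * vov ** 2                   # saturation region


if __name__ == "__main__":
    k, vt = 200e-6, 0.5                         # hypothetical device values
    for vds in (0.1, 0.5, 1.0, 1.5):
        print(f"Vds = {vds:.1f} V -> Ids = {nmos_ids(1.5, vds, vt, k) * 1e6:.1f} uA")
```

Both expressions give the same current at Vds = Vgs – Vt, so the modeled curve is continuous at the linear/saturation boundary.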
Strained Silicon
Strained silicon is a newer process technique for enhancing carrier mobility.
A material whose lattice spacing is close to, but slightly different from, silicon's (e.g. germanium) is added to strain the silicon lattice.
IBM and Intel reported starting production with strained silicon at the 90 nm node.
(Figure from IBM webpage)
MOSFET Scaling
As Moore's law predicts
– dimensions decrease by a factor S
– area decreases by S²
Two forms of scaling
– constant voltage: up to the early '90s, Vdd held steady at 5V
– constant electric field: after the early '90s, Vdd drops
Why scale CMOS
– faster if smaller (drive current ∝ 1/Leff)
– cheaper if transistors take less area
– more functionality if the same area is used with more transistors
MOSFET Scaling
Leff min is proportional to (Interview Question):
– Xj: requires precise implantation and anneal
– N: affects carrier mobilities if too high; cap if too low
– Tox: below 20 angstroms, tunneling leakage
Gate oxide electric field: intrinsic breakdown at 10 MV/cm
– this sets the max Vdd for a given technology
Thresholds need to be scaled
– but every 80 mV of reduction increases leakage 10×
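The 10× per 80 mV rule of thumb above turns into a quick estimate. This sketch simply assumes that exponential relationship with the 80 mV/decade figure from the slide; real subthreshold slopes vary with process and temperature.

```python
def leakage_multiplier(vt_reduction_mv: float, mv_per_decade: float = 80.0) -> float:
    """Growth in subthreshold leakage when Vt is lowered by vt_reduction_mv.

    Assumes the rule of thumb of one decade (10x) per mv_per_decade of Vt reduction.
    """
    return 10 ** (vt_reduction_mv / mv_per_decade)


for dv in (80, 160, 240):
    print(f"Vt lowered by {dv} mV -> leakage x{leakage_multiplier(dv):.0f}")
```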
Rough timeline
Node    Year        Comment
20u     60's        no CMOS, just NMOS (+/- 5V)
2u      early 80's  CMOS, but some NMOS
1u      late 80's   All CMOS, +5V
0.8u    92          5V (start of my career)
0.65u   94          5V
0.5u    96          5V (start of E-field scaling)
0.35u   98          3.5V (5V tolerant)
0.25u   00          2.5V (3.3V tolerant)
0.18u   01          1.8V (3.3V tolerant)
0.12u   02          1.5V
90nm    03          1.2V
65nm    05          ?
45nm    08          ?
35nm    10          ?
?       ?           Quantum Dot Computers? Not!
Tox values listed on the slide, in order (row alignment lost in extraction): 1u, 250A, 200A, 150A, 105A, 50A, 37A, 27A, ?, ?, ?, new
MOSFET Scaling – Constant Voltage
Quantity                      Before    After                   Effect
Channel length                L         L' = L/S
Channel width                 W         W' = W/S
Gate oxide thickness          tox       t'ox = tox/S
Junction depth                Xj        X'j = Xj/S
Vdd                           Vdd       Vdd
Threshold voltage             Vt        Vt
Doping                        Na, Nd    Na*S, Nd*S
Gate capacitance (per area)   Cox       C'ox = Cox*S
Total gate cap                Cg        C'g = Cg/S              faster and less power
Drive current                 Ids       Ids*S                   faster
Power (for same function)     P         P*S                     same circuit, scaled, consumes more power, so more power per function
Power density                 P/A       (P*S)/(A/S²) = S³*P/A
Device delay                  D = CV/I  D' = D/S²               much faster than before
Wire delay                    RC        R*S * (C/S) = RC        really gets bigger (relative to device delay)
MOSFET Scaling – Constant Field
Quantity                      Before    After                   Effect
Channel length                L         L' = L/S
Channel width                 W         W' = W/S
Gate oxide thickness          tox       t'ox = tox/S
Junction depth                Xj        X'j = Xj/S
Vdd                           Vdd       Vdd/S                   for reliability reasons
Threshold voltage             Vt        Vt/S                    not done in practice (leakage)
Doping                        Na, Nd    Na*S, Nd*S
Gate capacitance (per area)   Cox       C'ox = Cox*S
Total gate cap                Cg        C'g = Cg/S              faster and less power
Drive current                 Ids       Ids/S
Power (for same function)     P         P/S²                    same circuit, scaled, consumes less power, so same power for more function
Power density                 P/A       (P/S²)/(A/S²) = P/A
Device delay                  D = CV/I  D' = D/S                faster, but not as fast as constant-voltage scaling
Wire delay                    RC        R*S * (C/S) = RC        really gets bigger (relative to device delay)
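The two scaling tables can be cross-checked by multiplying out the first-order rules. This sketch assumes square-law drive current (Ids ∝ Cox·(W/L)·(Vgs – Vt)², with W/L unchanged) and that Vgs – Vt scales with Vdd; it returns the factor by which each quantity changes for a shrink factor S.

```python
def scaling_factors(S: float, constant_field: bool) -> dict:
    """First-order multipliers applied to each quantity when dimensions shrink by S."""
    v = 1 / S if constant_field else 1.0    # Vdd and Vt (so also Vgs - Vt)
    cox = S                                 # gate cap per area: tox shrinks by S
    cg = cox / S ** 2                       # total gate cap: Cox*S over area W*L/S^2
    ids = cox * v ** 2                      # Ids ~ Cox*(W/L)*(Vgs - Vt)^2
    power = v * ids                         # P = V*I
    device_delay = cg * v / ids             # D = CV/I
    wire_delay = S * (1 / S)                # (R*S)*(C/S)
    return {
        "Ids": ids,
        "power": power,
        "power_density": power * S ** 2,    # area shrinks by S^2
        "device_delay": device_delay,
        "wire_delay": wire_delay,
        "wire_vs_device": wire_delay / device_delay,
    }


for mode in (False, True):
    name = "constant field" if mode else "constant voltage"
    print(name, scaling_factors(2.0, mode))
```

For S = 2, constant voltage gives 8× power density and 1/4 device delay, while constant field keeps power density flat and halves device delay; in both cases wire delay grows relative to device delay.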
MOSFET Scaling – current issues
Static Power – Major problem
– no static power was the original motivation for CMOS
– gate oxides are 17 Angstroms: tunneling of 1 A/cm²
   – need a new oxide that acts electrically thin but is physically thick
   – silicon is used because of its nice native oxide with a good interface
– sub-threshold leakage increased due to scaled Vts
   – using dual-threshold processes, but this adds expense
Wire Delay
– need low-K material for the inter-layer dielectric
   – current materials are having mechanical reliability issues
   – thermal cycling the chips produces opens/shorts
– need low-resistance conductors
   – migrating from aluminum to copper (IBM first, Intel last to go)
   – but copper is difficult to etch: dual damascene process
Other observations
NFETs can't drive high voltages well: passing Vdd, the output reaches only Vdd – Vt.
PFETs can't drive low voltages well: passing 0V, the output falls only to Vt.
This will affect many of the circuits that we explore in this class, and it is a major source of interview questions (and exam questions).
MOS Inverters
Most fundamental circuit in the MOS family
Represents the basic operation of all static gates
One input and one output
– Output = ~Input
Inverter Threshold Voltage – Vth
– input voltage where the output equals the input
– not the same as the transistor threshold Vt
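The inverter threshold Vth can be estimated from the square-law equations: at Vin = Vout both transistors are saturated, so Vth is the input where the NFET and PFET currents are equal. This is a sketch under long-channel assumptions; `kn`, `kp` (µCox·W/L of each device) and the process numbers in the usage lines are hypothetical, and |Vtp| is passed as a magnitude.

```python
import math


def vth_closed_form(vdd, vtn, vtp_abs, kn, kp):
    """Inverter switching threshold with both FETs saturated (square law)."""
    r = math.sqrt(kn / kp)
    return (vdd - vtp_abs + r * vtn) / (1 + r)


def vth_bisect(vdd, vtn, vtp_abs, kn, kp, tol=1e-9):
    """Same threshold found numerically: the Vin where NFET and PFET currents match."""
    def mismatch(vin):
        i_n = 0.5 * kn * max(vin - vtn, 0.0) ** 2            # NFET pulling down
        i_p = 0.5 * kp * max(vdd - vin - vtp_abs, 0.0) ** 2  # PFET pulling up
        return i_n - i_p                                     # rises with vin
    lo, hi = vtn, vdd - vtp_abs
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mismatch(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Hypothetical 1.8 V process, |Vt| = 0.4 V, kn = 2.5 * kp for equal widths
    print(vth_closed_form(1.8, 0.4, 0.4, 2.5, 1.0))
    print(vth_bisect(1.8, 0.4, 0.4, 2.5, 1.0))
```

With a stronger NFET, Vth falls below Vdd/2; widening the PFET until kn = kp brings it back to Vdd/2 when the thresholds match.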
Voltage Transfer Characteristic (VTC)
[Figure: VTC, Vout versus Vin from 0 to Vdd; the Vout = Vin line crosses the curve at Vth, the gain = -1 points mark Vil and Vih, and Voh = Vdd]
Noise Margin – low gain region
[Figure: VTC with the low-gain regions (outside the gain = -1 points) highlighted]
Noise Margin – high gain region
[Figure: VTC with the high-gain region between the gain = -1 points highlighted]
Good design minimizes the high-gain region (also known as the transition region).
CMOS Inverter
[Figure: CMOS inverter VTC divided into regions A through E, with boundary lines Vout = Vin – Vtn, Vout = Vin – Vtp, Vout = Vtn, and Vout = Vdd + Vtp]
CMOS Inverter
[Figure: CMOS inverter Vout and current Ids versus Vin, from 0 to Vdd]
Layout of inverter – top view
[Figure: inverter layout, top view, with n-well, vdd and gnd rails, transistors I1 and I2 with their sources and drains, the shared gate (input), and the output; device widths marked W]
MOS Inverters – Dynamic
Performance is inversely proportional to delay
Delay is the time to raise (lower) the voltage at nodes
– node voltage is changed by charging (discharging) the load cap
– more current means more charge transported over time
$$Q = I \cdot t = C \cdot V \quad\Rightarrow\quad t_{delay} = \frac{Q}{I} = \frac{C \cdot V}{I}$$
MOS Inverters – Dynamic
[Figure: load capacitance components on an inverter output]
– junction cap
– gate cap
– wire cap (particularly bad when driving a load far away)
MOS Inverters – Dynamic
Lumped cap: CL = Cgdn + Cgdp + Cdbn + Cdbp + Cw + Cg
MOS Inverters – discharge delay
[Figure: input rising and output falling versus time as the lumped load CL = Cgdn + Cgdp + Cdbn + Cdbp + Cw + Cg discharges]
MOS Inverters – charge delay
[Figure: input falling to 0V and output rising versus time as the lumped load CL = Cgdn + Cgdp + Cdbn + Cdbp + Cw + Cg charges]
Propagation Delay
[Figure: input and output waveforms versus time, with Tplh marked between the 50% crossings]
Defined twice – once for a falling output and once for a rising output.
The propagation delay is the delay from the input crossing 50% of Vdd to the resulting output crossing 50% of Vdd.
Tplh = Rising propagation delay
Tphl = Falling propagation delay
Rise and Fall Times
[Figure: waveforms versus time with Trise and Tfall marked]
The rise time is the time for the signal to cross from 10% to 90% of Vdd.
The fall time is the time for the signal to cross from 90% to 10% of Vdd.
If an inverter is driven by a signal with a really slow rise or fall time, the delay through the inverter is aggravated, and since the inverter stays in the transition region longer, a lot of short-circuit current can be generated.
Rise and Fall Times
[Figure: slow and fast rising edges versus time, with Trise marked]
If excessive rise or fall times exist, fix them by increasing the drive strength or decreasing the load.
Increasing drive strength usually means widening transistors.
Decreasing the load usually means splitting up the load with buffers.
Calculating Delay Times
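[Figure: waveforms versus time with Tplh marked]
The simplest approach is to use average-current and average-capacitance models to calculate propagation delays for both edges:
$$\tau_{PLH} = \frac{C_{load}\,\Delta V_{LH}}{I_{avg,LH}} \qquad \tau_{PHL} = \frac{C_{load}\,\Delta V_{HL}}{I_{avg,HL}}$$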
MOS Inverters – fall delay
[Figure: output discharging through the NFET equivalent resistance Reqn versus time]
$$V_{out}(t) = V_{dd}\,e^{-t/(R_n C_L)}$$
MOS Inverters – rise delay
[Figure: output charging through the PFET equivalent resistance Reqp versus time]
$$V_{out}(t) = V_{dd}\left(1 - e^{-t/(R_p C_L)}\right)$$
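The exponential responses above give closed-form delay metrics: the 50% crossing of e^(-t/RC) is at RC·ln 2 ≈ 0.69 RC, and the 10% to 90% transition takes RC·ln 9 ≈ 2.2 RC. This sketch assumes an ideal step input and a single equivalent resistance; the Reqn and CL values in the usage lines are hypothetical.

```python
import math


def rc_step_metrics(r_eq: float, c_load: float):
    """Delay metrics of a single-pole RC step response.

    Returns (tau, 50% propagation delay, 10%-90% transition time) in seconds.
    """
    tau = r_eq * c_load
    t_50 = tau * math.log(2)        # solve e^(-t/tau) = 0.5
    t_10_90 = tau * math.log(9)     # time between the 10% and 90% points
    return tau, t_50, t_10_90


if __name__ == "__main__":
    tau, t50, t1090 = rc_step_metrics(5e3, 20e-15)   # hypothetical Reqn and CL
    print(f"tau = {tau*1e12:.1f} ps, t50 = {t50*1e12:.1f} ps, t10-90 = {t1090*1e12:.1f} ps")
```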
Combating delays
Reduce capacitive load
– drive fewer gates (buffer tree)
– drive smaller gates (less gate capacitance) in the subsequent stage
– drive closer gates (less distance means less interconnect load)
Increase drive current
– reduce Vt (not really an option for circuit designers)
– reduce L's (most transistors are already minimum sized for area)
– increase Vdd (can't, because of gate oxide integrity)
– increase Weff (main weapon of the circuit designer)
Reduce wire lengths for long wires (more later…)
Coupling Analysis
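[Figure: aggressor wire coupled to the victim wire through Ccoupling; the victim is held by a driver of resistance Reqn and has capacitance Cgood to ground; the aggressor swings by Vaggressor]
$$V_{victim} = \frac{C_{coupling}\,V_{dd}}{C_{coupling} + C_{good}}$$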
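The coupling formula is a capacitive divider, which makes the benefit of raising Cgood easy to see. This sketch assumes the worst case implied by the formula (the aggressor edge is much faster than Reqn can restore the victim); the capacitances are hypothetical.

```python
def victim_noise(c_coupling: float, c_good: float, v_swing: float) -> float:
    """Peak victim glitch for a full aggressor swing (capacitive divider, victim undriven)."""
    return c_coupling * v_swing / (c_coupling + c_good)


for c_good in (30e-15, 90e-15):             # hypothetical victim cap to ground
    glitch = victim_noise(10e-15, c_good, 1.8)
    print(f"Cgood = {c_good * 1e15:.0f} fF -> glitch = {glitch:.3f} V")
```

Tripling Cgood (as shielding with fixed signals tends to do) cuts the glitch from 0.45 V to 0.18 V in this example.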
Minimizing Coupling Capacitance
Wire spreaders are tools that search through a routed design and find places where signals can be spread apart.
Noise-sensitive signals (e.g. clock signals) can be shielded by running fixed signals (e.g. gnd, vdd) between the clock and other signals.
Technologies are being developed that lower the permittivity of the inter-layer dielectric (low-K).
– problems persist with these new materials
– thermal cycling the material causes ruptures due to differences in thermal expansion coefficients
Wire Spreading Example
[Figure: routed wires before and after wire spreading]
Shielding Signals
Coupling capacitance goes down with a 1/T relationship.
The good cap goes up because of the shielding.
[Figure: cross-section of metal 1 wires over the substrate (ground plane): victim signal, gnd shield, aggressor signal]
Long Lines and RC Delays
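A buffer cuts the length L that each wire segment must drive, and since wire RC delay grows with L², interconnect delay drops quadratically. The buffer inserts its own device delay, but many times the overall delay still goes down.
[Figure: 100 ps driver, wire of length L with 400 ps of RC delay, 100 ps receiver: 600 ps total]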
Long Lines and RC Delays
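If a distance L has 400 ps of RC delay, then a distance of L/2 will have 100 ps of delay: (L/2)², or ¼ of the delay.
[Figure: the same line split into two L/2 segments by a buffer; stage delays shown as 100 ps, 100 ps, 150 ps, 100 ps, and 100 ps, for 550 ps total]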
Long Lines and RC Delays
If a distance L has 400 ps of RC delay, then a distance of L/3 will have about 45 ps of delay: (L/3)², or 1/9 of the delay.
[Figure: the line split into three L/3 segments (45 ps each) by two buffers, with 100 ps per gate: 4 × 100 ps + 3 × 45 ps = 535 ps total]
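The buffering trade-off can be swept over the number of segments. This sketch assumes every gate (driver, repeaters, receiver) has the same delay and that each segment's wire delay scales as 1/n² of the full-length wire delay; the slide's L/2 case shows one 150 ps stage, which this equal-delay model does not capture (it gives 500 ps there).

```python
def buffered_line_delay(n_segments: int, wire_delay_full: float, gate_delay: float) -> float:
    """Total delay of a line split into n equal segments by n-1 repeaters.

    Each segment's RC delay is wire_delay_full / n^2 (delay ~ length^2), and
    the path has n + 1 gates: one driver, n - 1 buffers, one receiver.
    """
    wire = n_segments * wire_delay_full / n_segments ** 2
    gates = (n_segments + 1) * gate_delay
    return wire + gates


for n in (1, 2, 3, 4, 6):
    print(n, round(buffered_line_delay(n, 400.0, 100.0), 1))
```

With 100 ps gates and 400 ps of wire this gives 600 ps unbuffered and about 533 ps for three segments; beyond a few segments the added repeater delay dominates and the total rises again.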
Note on RC delays and Vdd
RC values are not affected by Vdd values to first order.
Device delay, however, rises roughly with the inverse square of the voltage: in this example, halving Vdd quadruples the gate delay while the wire delay stays the same.
[Figure: the L/3 buffered line at Vdd = 1.8V (100 ps gates, 45 ps wire segments) and at Vdd = 0.9V (400 ps gates, 45 ps wire segments)]
Inverter sizing and Fanout
To drive a huge load with a small inverter, we need a string of inverters to “ramp up” the capacitive gain.
If an inverter is too small, it will have a difficult time charging the next stage.
If an inverter is too large, it will overload the previous inverter.
Example chain (each stage 3× the previous, Wp = 2 × Wn):
Stage   Wp    Wn
1       4     2
2       12    6
3       36    18
4       108   54
Case of a huge load (e.g. I/O driving off-chip loads, or a clock tree driving 1000s of flip-flops)
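A tapered chain like the one above can be generated from a first stage, a Wp:Wn ratio, and a per-stage growth ratio. This sketch assumes every stage multiplies size (and input capacitance) by the same ratio, so N stages give an overall gain of ratio^N; the helper names are hypothetical.

```python
import math


def taper_chain(wn_first, wp_over_wn, stage_ratio, n_stages):
    """(Wp, Wn) for each inverter, each stage stage_ratio times larger than the last."""
    chain = []
    wn = wn_first
    for _ in range(n_stages):
        chain.append((wp_over_wn * wn, wn))
        wn *= stage_ratio
    return chain


def stages_needed(c_load, c_in, stage_ratio):
    """Smallest N with stage_ratio**N >= c_load / c_in (small tolerance for float log)."""
    return math.ceil(math.log(c_load / c_in, stage_ratio) - 1e-9)


print(taper_chain(2, 2, 3, 4))
print(stages_needed(1000.0, 1.0, 3))
```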
Parallel Transistor Configurations
Two same-type transistors in parallel have their transconductances added if on at the same time.
If both transistors are on simultaneously and the L values are the same for both, we can add the widths to get an effective single-transistor equivalent.
When both are on, (W/L)eq is the sum of all ratios.
Example: two 8/1 transistors in parallel are equivalent to one 16/1 transistor.
Parallel Transistor Configurations
(W/L)eq = (W/L)a + (W/L)b, or if the Ls are equal, simply add the Ws
[Figure: two parallel W/L transistors and their single-transistor equivalent]
Series Transistor Configurations
Two same-type transistors in series have their resistances added if on at the same time.
If both transistors are on simultaneously and the W values are the same for both, we can add the lengths to get an effective single-transistor equivalent.
Example: two 8/1 transistors in series are equivalent to one 8/2 = 4/1 transistor.
Series Transistor Configurations
(W/L)eq = 1 / (sum of the reciprocals of the individual W/L ratios)
or if the Ws are equal, simply add the Ls
[Figure: two series W/L transistors and their single-transistor equivalent]
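The parallel and series rules reduce to simple arithmetic on W/L ratios. This sketch uses exact fractions and assumes, as the rules do, that all devices are the same type and fully on; body effect on stacked devices is ignored.

```python
from fractions import Fraction


def parallel_wl(*ratios: Fraction) -> Fraction:
    """Equivalent W/L of same-type transistors in parallel (all on): ratios add."""
    return sum(ratios, Fraction(0))


def series_wl(*ratios: Fraction) -> Fraction:
    """Equivalent W/L of same-type transistors in series (all on): reciprocals add."""
    return 1 / sum((1 / r for r in ratios), Fraction(0))


print(parallel_wl(Fraction(8, 1), Fraction(8, 1)))   # two 8/1 in parallel
print(series_wl(Fraction(8, 1), Fraction(8, 1)))     # two 8/1 in series
```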
2 Input NOR – depletion NFET load
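If both A and B are high, the two parallel pull-down NFETs act as one wider NFET, giving an NFET-heavy inverter.
[Figure: 2-input NOR with a depletion NFET load from Vdd to Out and input NFETs A and B]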
CMOS NANDs and NORs
Consider transistor sizings for balanced circuits…
[Figure: CMOS 2-input NAND and 2-input NOR schematics with inputs A and B and output out]
NAND Layout
[Figure: NAND layout with vdd, gnd, inputs A and B, and OUT; legend: active area, n-well, metal 1, poly]
CMOS NOR Transistor sizing
Consider transistor sizings for balanced circuits…
[Figure: reference inverter (PFET W*2, NFET W) and 2-input NOR with series PFETs of W*4 each and parallel NFETs of W each]
CMOS NAND Transistor Sizing
Consider transistor sizings for balanced circuits…
[Figure: reference inverter (PFET 2*W, NFET W) and 2-input NAND with parallel PFETs of 2*W each and series NFETs of 2*W each]
CMOS NAND Transistor Sizing
Consider transistor sizings for balanced circuits…
[Figure: 3-input NAND (inputs A, B, C) with series NFETs of 3*W each, shown with reference sizes 2*W and W]
Fanin (number of inputs)
There is a limit to the number of inputs that can be used.
[Figure: high fan-in gate with inputs A through E]
Complex CMOS Logic
Can make single-stage gates that implement:
– AND-OR-Inverter (AOI)
– OR-AND-Inverter (OAI)
Given a function F = ((A*B)+C)’
Invert the function to get the NFET network: F’ = (A*B)+C
Take the dual of the NFET network equation to get the PFET network: F’d = (A’+B’)*C’
Remember that PFETs invert their inputs naturally
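The construction above can be checked exhaustively: for every input combination exactly one network should conduct, and the PFET network should conduct exactly when F is true. This truth-table sketch treats each network as a Boolean conduction function (an idealization that ignores sizing and timing).

```python
from itertools import product


def f(a, b, c):
    return not ((a and b) or c)                # F = ((A*B)+C)'


def nfet_conducts(a, b, c):
    return (a and b) or c                      # A, B in series, in parallel with C


def pfet_conducts(a, b, c):
    return ((not a) or (not b)) and (not c)    # PFETs conduct on low inputs: (A'+B')*C'


for a, b, c in product((False, True), repeat=3):
    pull_up, pull_down = pfet_conducts(a, b, c), nfet_conducts(a, b, c)
    assert pull_up != pull_down                # exactly one network on: no contention, no float
    assert pull_up == f(a, b, c)               # output pulled high exactly when F is true
print("PFET and NFET networks are complementary and implement F")
```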
Complex CMOS Logic
[Figure: PFET network (A and B in parallel, in series with C) and NFET network (A and B in series, in parallel with C) driving out]
Complex CMOS Logic - Euler
[Figure: NFET and PFET network graphs for F with nodes labeled by inputs A, B, C]
Find a common Euler path that does not traverse any branch more than once.
Complex CMOS Logic
Given the function F = ((A*B)+C)’, what is the best layout for sharing diffusions when possible? One solution, but not the best:
[Figure: layout with Vdd rail, active area in the nwell, out, active area in the pwell, Gnd rail, and gates A, B, C]
Complex CMOS Logic
Given the function F = ((A*B)+C)’, what is the best layout for sharing diffusions when possible? Swap the source and drain of C for a better layout:
[Figure: improved layout with Vdd, out, Gnd, and gates A, B, C]
Pass Gates
In most static CMOS, a PFET network pulls high and a dual NFET network pulls low.
In a pass-gate configuration, the transistors connect inputs directly to outputs.
Pass gates can either be “ON” and pass a value or be “OFF” and tri-state the output.
One NFET can do this, but it passes high values poorly.
One PFET can do this too, but it passes low values poorly.
[Figure: NFET pass gate from in to out, controlled by enable]
Pass Gates
There are a couple of problems: not only will it not drive a full logic high, but the effective R skyrockets toward infinity as the output approaches Vdd – Vt. This means it also slows down, and it provides no drive strength when statically high, so the output is susceptible to coupling noise.
[Figure: NFET pass gate from in to out, controlled by enable]
Charge Sharing (and Pass Gates)
Common interview question…
Basis for DRAM operation.
At t = 0, the gate is low, C1 (50 fF) is charged to 2 volts, and C2 (25 fF) is charged to 3 volts.
Later, the gate is turned on. What are the voltages of C1 and C2?
Simple Eng101, but most grads can’t do it.
[Figure: capacitors C1 (at v1) and C2 (at v2) connected through a pass transistor controlled by gate]
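Charge sharing follows from conservation of charge: once the gate is on and the nodes equalize, both capacitors sit at (C1·V1 + C2·V2)/(C1 + C2). This sketch assumes the pass transistor is driven hard enough that it never cuts off before the nodes equalize (for an NFET, gate voltage above the final voltage plus Vt), and it ignores parasitic capacitance.

```python
def charge_share(c1: float, v1: float, c2: float, v2: float) -> float:
    """Common final voltage of two capacitors connected together (charge conservation)."""
    return (c1 * v1 + c2 * v2) / (c1 + c2)


v = charge_share(50e-15, 2.0, 25e-15, 3.0)    # C1 = 50 fF at 2 V, C2 = 25 fF at 3 V
print(f"{v:.3f} V")
```

Both C1 and C2 end at 175/75 ≈ 2.33 V: the larger capacitor moves less (up 0.33 V) than the smaller one (down 0.67 V).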
Complementary Transmission Gates
Use a PFET and an NFET in parallel; passes both ones and zeros.
Never used by logic designers; circuit designers hide them.
TGs act as switches, either providing a resistive short or an open circuit.
Does not provide drive, so the signal is attenuated.
Susceptible to “above Vdd” or “below Gnd” noise at the input.
Effective Resistance of TGs
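For passing low values, the NFET is fully on.
For passing high values, the PFET is fully on.
The effective resistance stays relatively constant regardless of the input voltage (as opposed to how single-transistor pass gates respond).
[Figure: effective resistance versus Vin from 0 to Vdd: the NFET R climbs as Vin approaches Vdd – Vt, the PFET R climbs as Vin falls toward Vt, and the combination Reff stays roughly flat]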
Real Transmission Gate Mux
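[Figure: transmission-gate 2:1 mux with data inputs d0 and d1, select s, and output out]
Input inverters are needed for noise immunity, and an output inverter cancels the inversion and provides drive strength.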
TG Logic
[Figure: transmission-gate network with inputs a, b, b’, c, c’ and output f]
Implements F = A*B + C
CPL Logic
[Figure: CPL gate with inputs a, a’, b, b’ and output f]
Implements F = A^B
A few major problems, though:
1) it really needs 4 transistors to get both complements,
2) if F is high, you’ll have a Vt drop (slow and consumes power),
3) inputs are unbuffered to the source (noise).
CPL Logic
[Figure: complete CPL gate producing both f and f’ from a, a’, b, b’]
Implements F = A^B
Logical Effort Method (LEM)
LE is a method to estimate delay in CMOS circuits
Helps identify the best circuit style and choose widths
Based on a basic delay unit, τ
– isolates technology differences
– delay through an inverter, ignoring parasitics, with a fanout of 1
Two components of delay through any gate
– parasitic delay (no-load delay or self-loading delay): p
– effort delay or stage effort: f
Stage effort (f) is the product of
– electrical effort: h
– logical effort: g
d = p + h*g
Logical Effort - Parasitic Delay - P
Parasitic delay is calculated from the diffusion cap at the output
Bigger transistor: more current, but more P also
– diminishing returns on increasing widths
[Figure: 2-input NAND schematic with inputs A, B and output out]
Gate Type       P
Inv             Pinv
N-input NAND    n*Pinv
N-input NOR     n*Pinv
N-input MUX     2n*Pinv
XOR             4*Pinv
d = p + h*g
Electrical effort - H
Ratio of output to input capacitance: h = Cout / Cin
Captures the effects of fanout
d = p + h*g
Logical effort - G
Captures the complexity of the gate
– topology and ability to drive current
– considers fan-in relative to an inverter
[Figure: inverter and 2-input NAND/NOR schematics with relative transistor sizes]
Gate    Number of inputs
        1    2    3    4    5
Inv     1
NAND         4/3  5/3  6/3  7/3
NOR          5/3  7/3  9/3  11/3
Mux          2    2    2    2
XOR          4    12   32
d = p + h*g
Logical Effort - Delay
[Figure: normalized delay versus electrical effort h, split into effort delay and parasitic delay]
Logical Effort - Example 1
Tau = 50 ps for a given technology
Determine the delay through a 4-input NOR driving 10 identical circuits
Solution:
d = g*h + p
d = 9/3*10 + 4 = 34 delay units = 34 × 50 ps = 1.7 ns
Comment:
Large loads minimize the impact of parasitics
A large load will increase rise/fall times, and this estimation ignores that effect
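The single-stage estimate is d = g·h + p in units of τ. This sketch uses exact fractions, takes g = (2n + 1)/3 for an n-input NOR (matching the logical effort table) and p = n·Pinv with Pinv = 1, which is the parasitic value the example uses.

```python
from fractions import Fraction


def nor_logical_effort(n: int) -> Fraction:
    """Logical effort of an n-input NOR: (2n + 1) / 3."""
    return Fraction(2 * n + 1, 3)


def stage_delay(g, h, p):
    """Normalized stage delay d = g*h + p, in units of tau."""
    return g * h + p


tau_ps = 50
d = stage_delay(nor_logical_effort(4), h=10, p=4)   # 4-input NOR, fanout 10, p = 4*Pinv
print(d, "units =", d * tau_ps, "ps")
```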
Logical Effort - Multi-stage Networks
Principles generalize from gates to paths
Path logical effort: G = Πgi
Path parasitic delay: P = Σpi
Path electrical effort: H is still the ratio Cout / Cin
Introduce branching effort: b = Ctotal / Cuseful
Introduce path branching effort: B = product of all b's
Note B*H = Cout / Cin * Πbi = Πhi
Path effort: F = G*B*H
Path effort delay: Df = Σ(gi*hi)
Path delay: D = Σdi = Df + P
Logical Effort - Multi-stage Networks
The path is optimized when each stage bears the same effort
Dmin = N*F^(1/N) + P
To obtain the balanced stage effort, each stage effort fi should be
fi = gi * hi = F^(1/N)
so each hi should be
hi = (F^(1/N)) / gi
To determine sizings, start from the end and work backward:
Cini = (gi * Couti) / fmin
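The procedure above (compute F, split it evenly, then size backward from the load) can be sketched in a few lines. The function and its convention that `b[i]` is the branching effort at the output of stage i are assumptions for illustration; as a consistency check, the first computed input capacitance should come back equal to `c_in`.

```python
from math import prod


def optimize_path(g, p, b, c_in, c_out, tau_ps):
    """Logical-effort path optimization.

    g, p, b: per-stage logical effort, parasitic delay, and branching effort
    (b[i] applies at the output of stage i). Returns (Dmin in tau units,
    Dmin in ps, list of stage input capacitances in the units of c_in).
    """
    n = len(g)
    F = prod(g) * prod(b) * (c_out / c_in)     # path effort F = G*B*H
    f_hat = F ** (1 / n)                       # equal effort per stage
    d_min = n * f_hat + sum(p)                 # Dmin = N*F^(1/N) + P
    caps = [0.0] * n
    load = b[-1] * c_out                       # total load on the last stage
    for i in reversed(range(n)):
        caps[i] = g[i] * load / f_hat          # Cin_i = g_i * Cout_i / f_hat
        if i > 0:
            load = b[i - 1] * caps[i]          # Cout_(i-1) = b_(i-1) * Cin_i
    return d_min, d_min * tau_ps, caps


nand2 = 4 / 3
d, ps, caps = optimize_path([nand2] * 3, [2, 2, 2], [2, 3, 1], 1.0, 4.5, 50)
print(round(d, 3), round(ps, 1), [round(c, 3) for c in caps])
```

The usage line models three 2-input NAND stages with branching of 2 and 3 and 4.5C loads.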
Logical Effort - Example 2
Tau = 50 ps for a given technology (0.6u CMOS)
Size transistors to minimize the delay of a path of three 2-input NAND stages
No branching
Solution:
F = G*H*B = (4/3*4/3*4/3) where H = 1, B = 1
Dmin = 3*(4/3*4/3*4/3)^(1/3) + (2+2+2) = 10 units = 500 ps
[Figure: path with input cap C driving an output load C]
Logical Effort - Example 2
Solution:
fmin = F^(1/N) = 2.37^(1/3) = 4/3
Cini = (gi * Couti) / fmin = ((4/3) * Couti) / (4/3) = Couti = C
[Figure: the sized three-NAND path, with each stage input capacitance equal to C]
Logical Effort - Example 3
Same as example 2 but driving 8C output cap
Solution:
F = G*H*B = (4/3*4/3*4/3)*8 = 18.96 where H = 8, B = 1
Dmin = 3*(18.96)^(1/3) + (2+2+2) = 14 units = 700 ps
[Figure: path with input cap C driving an output load 8C]
Logical Effort - Example 3
Solution:
fmin = F^(1/N) = 18.96^(1/3) = 8/3
Cini = (gi * Couti) / fmin = ((4/3) * Couti) / (8/3) = ½ Couti
Resulting input capacitances, from the path input to the load: C, 2C, 4C, 8C
Logical Effort - Example 4
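Same as example 2, but with branching: the first stage drives 2 copies of the second stage, the second stage drives 3 copies of the third, and each final output drives 4.5C
Solution:
G = (4/3)³
B = 2*3 = 6
H = 4.5/1 = 4.5
F = G*B*H = 64
Dmin = 3*(64)^(1/3) + (2+2+2) = 18 units = 900 ps
[Figure: path with input cap C and branched stages, each final output loaded by 4.5C]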
Logical Effort - Example 4
Solution:
fmin = F^(1/N) = 64^(1/3) = 4
Cin2 = (g2 * Cout2) / fmin = ((4/3) * 4.5C) / 4 = 1.5C
Cin1 = (g1 * Cout1) / fmin = ((4/3) * (1.5C * 3)) / 4 = 1.5C
Cin0 = (g0 * Cout0) / fmin = ((4/3) * (1.5C * 2)) / 4 = C (matches the path input capacitance)
[Figure: sized path with input C, stage inputs of 1.5C, and 4.5C loads]
Logical Effort - Example 5
Solution:
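F = G*B*H = (1 * 5/3 * 4/3 * 1) * 1 * 2 = 40/9
fmin = (40/9)^(1/4) = 1.45
Cin3 = (g3 * Cout3) / fmin = (1 * 20C) / 1.45 = 14C
Cin2 = (g2 * Cout2) / fmin = ((4/3) * 14C) / 1.45 = 13C
Cin1 = (g1 * Cout1) / fmin = ((5/3) * 13C) / 1.45 = 15C
Cin0 = (g0 * Cout0) / fmin = (1 * 15C) / 1.45 = 10C (matches the path input capacitance)
[Figure: four-stage path (inverter, NOR, NAND, inverter) with input cap 10C and load 20C]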
Sequential Element Review
Sequential elements provide memory for circuits
– heart of a state machine: saving the current state
– used to hold or pipe data: data registers, shift registers
Two varieties
– level-sensitive transparent latch: less common
– edge-sensitive master-slave flip-flop: everywhere
D Latch Schematic - better
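[Figure: D latch schematic with input d, control gate, and output q]
CMOS Tri-state Inverter
[Figure: tri-state inverter with input, output, and enables en and ~en]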
D Latch Operation
Gate low: Q holds its value and ignores D
Gate high: transparent, Q follows D after a delay
[Figure: G, D, and Q waveforms versus time]
D Latch Common Uses
Most common: basic building block of a flip-flop
Other uses: to condition the enable for clock gating
[Figure: G, D, and Q waveforms versus time]
Standard CMOS D Flip-Flop Diagram
[Figure: two D latches in series (master and slave) clocked by CLK, with D in and Q out]
Two doors: never simultaneously open or closed.
Q is never directly influenced by D.
D Flip-Flop Schematic
[Figure: D flip-flop schematic with D, CLK, and Q]
D Flip-Flop Operation
Sample the input at the edge and launch it to the output.
The input must be good for the “setup time” before and the “hold time” after the edge to be sampled correctly (the sampling window).
[Figure: CLK, D, and Q waveforms versus time, with Tsu, Th, and Tlaunch marked]
D Flip-Flop Set-up Times
The time before the rising edge during which the data must be stable to be sampled correctly.
Logic designers can slow the clock (bigger period) to alleviate setup problems, but at the cost of performance.
A good flip-flop design will have a very short setup time.
D Flip-Flop Set-up Times
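Adding logic in the data path increases setup times and therefore decreases performance.
Muxes are often added for either functional or test purposes.
[Figure: mux with inputs d0 and d1 and select sel on net A, feeding the D input of a flip-flop clocked by clk, output Q]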
D Flip-Flop Hold Times
Hold times describe how long the data must be stable after the rising edge.
Often the hold time is zero.
Although hold times do not affect frequency, if you fail to meet them, your chip will not work, and slowing it down will not help.
D Flip-Flop Launch Times
The time required for data at the input to be sampled and launched to the output.
This adds to the time required to propagate through the cone of logic and reach the next flip-flop, and therefore increases the clock period and slows performance.
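Setup, hold, and launch times combine into two simple checks on a flip-flop-to-flip-flop path. This sketch assumes an ideal clock (no skew or jitter); the parameter names and the picosecond values in the usage lines are hypothetical. Note that the hold slack does not depend on the clock period, which is why slowing the clock cannot fix a hold failure.

```python
def timing_slack(t_period, t_launch, t_logic_max, t_logic_min, t_setup, t_hold):
    """Setup and hold slack for one flip-flop-to-flip-flop path (positive = passes).

    Assumes an ideal clock with no skew or jitter; t_launch is the time for the
    source flip-flop to launch data to its output after the clock edge.
    """
    # Setup: the slowest data launched at one edge must settle t_setup before the next edge.
    setup_slack = t_period - (t_launch + t_logic_max + t_setup)
    # Hold: the fastest data launched at an edge must not disturb the capture at that edge.
    hold_slack = (t_launch + t_logic_min) - t_hold
    return setup_slack, hold_slack


setup, hold = timing_slack(t_period=1000, t_launch=80, t_logic_max=800,
                           t_logic_min=30, t_setup=50, t_hold=20)   # hypothetical ps
print(f"setup slack = {setup} ps, hold slack = {hold} ps")
```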
Alternative Designs
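Pulsed Latch: faster but less robust
Latches are not good sequential elements due to race conditions when transparent.
Generate a glitch (short pulse) enable at the rising edge, and you can get away with a latch.
[Figure: pulse generator built from CLK and a delay element driving the enable of a latch with input D]
Alternative Designs
[Figure: CLK, D, and Q waveforms versus time for the pulsed latch]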
Meta-stability
Common logic or circuit design interview question.
If the input is asynchronous, it is possible for the input to change exactly at the rising edge and latch a middle value.
This is called the meta-stable point, and it results in indeterminate operation and high static power consumption.
[Figure: CLK and D waveforms versus time, with D changing at the clock edge]
Meta-stability
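[Figure: overlaid transfer curves of the cross-coupled inverters (VoutA versus VinA, VinB versus VoutB), crossing at the meta-stable point, where it is theoretically possible to be stable]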
CMOS dynamic latch
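[Figure: CMOS dynamic latch with input D, gate G, and output Q; the storage (soft) node is undriven when the latch is closed]
[Figure: G, D, and Q waveforms versus time]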
Leakage
[Figure: pass gate with enable at 0V and input at 0V, holding 2.0V on the storage node, which is discharged by subthreshold leakage]
A high value will leak to zero given enough time.
If leakage is 1 nA, the cap is 50 fF, and V = 2.0V:
Trefresh = (ΔV*C)/I = (2 * 50e-15)/1e-9 = 100 µs to drop to 0V
Trefresh = (ΔV*C)/I = (0.2 * 50e-15)/1e-9 = 10 µs to drop to 1.8V
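The refresh estimate is t = C·ΔV/I. This sketch assumes a constant leakage current, which is pessimistic for the full discharge since subthreshold leakage shrinks as the node voltage falls.

```python
def droop_time(c_farads: float, delta_v: float, i_leak: float) -> float:
    """Seconds for a constant leakage current to move a floating node by delta_v (t = C*dV/I)."""
    return c_farads * delta_v / i_leak


t_zero = droop_time(50e-15, 2.0, 1e-9)     # 2.0 V down to 0 V
t_10pct = droop_time(50e-15, 0.2, 1e-9)    # 2.0 V down to 1.8 V
print(f"{t_zero * 1e6:.0f} us, {t_10pct * 1e6:.0f} us")
```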
Capacitive Feedthrough
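Q1 = Q2
C1*V1 = C2*(V2 – V1)
V1 = (C2/(C1+C2))*V2
dV1 = (C2/(C1+C2))*dV2
where C2 couples the switching signal V2 onto node V1, and C1 is node V1's capacitance to ground.
[Figure: enable signal V2 coupling through C2 onto node V1 with capacitance C1; V1 steps each time V2 switches]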
Domino Circuits
Typical applications
– arithmetic functions: Manchester carry chains
– wordline decoders/drivers in on-chip SRAM arrays
– timing-critical paths
Performance advantages
– lower input capacitance
– no NFET / PFET network contention
– critical path through NFET devices, with higher mobilities and less area
Domino Gate Operation
Pre-charge phase
– clock input low
– dynamic node high
Evaluate phase
– clock input high
– dynamic node conditionally discharged based on NFET network evaluation
[Figure: domino gate schematic with CLK, inputs A and B, transistors P1, P2, N1 through N4, the dynamic node, and OUT]
Domino Circuits
[Figure: CLK, A, B, and OUT waveforms versus time, alternating pre-charge (PC) and evaluate (E) phases]
Leakage Sensitivity
Dynamic node can be inadvertently discharged
Sources of leakage through the NFET network
– subthreshold current
– radiation
– noise at inputs
– charge sharing
– resistive defects
[Figure: domino gate schematic with CLK, inputs A and B, transistors P1, P2, N1 through N4, and OUT]
Sensitivity Improvement via Keeper
Keeper transistor replenishes lost charge
Reduces performance
– increases diffusion capacitance on the dynamic node
– causes momentary contention during the evaluate phase
Difficult to test keeper functionality
[Figure: domino gate with an added keeper PFET (P3) on the dynamic node; CLK, inputs A and B, P1, P2, N1 through N4, and OUT]
Domino Circuits – Wide OR
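No series PFET network, like a NOR.
The domino OR uses 10 transistors: 7 NFETs and 3 PFETs.
The static equivalent (NOR plus inverter) has 6 NFETs and 6 PFETs (~12 NFETs of area, counting each PFET as ~2 NFETs).
13 NFET equivalents vs. 18 NFET equivalents.
Each input connects to one NFET vs. an NFET/PFET pair.
[Figure: 5-input domino OR with CLK, inputs A through E, and OUT]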
Types of Memories
Volatile memories
– require a power supply to retain information
– dynamic memories: use charge to store information and require refreshing
– static memories: use feedback (latch) to store information; no refresh required
Non-volatile memories
– ROM (mask)
– EEPROM
– FLASH (NAND or NOR)
– MRAM
Memory Hierarchy
Level           Access time   Capacity
RF              100 ps        100's of bytes
L1 SRAM         1 ns          10's of Kbytes
L2 SRAM         10 ns         100's of Kbytes
L3 DRAM         100 ns        10's of Gbytes
Disks / Flash   1 us          Tbytes
Register Files
Fastest and most robust memory array
Largest bit cell size
Basically an array of large latches
No sense amps: bits provide full-rail data out
Often multi-ported (e.g. 8 read ports, 2 write ports)
Often used with ALUs in the CPU as source/destination
Typically less than 10,000 bits
– 32 32-bit fixed-point registers
– 32 60-bit floating-point registers
SRAM
Same process as logic, so often combined on one die
Smaller bit cell than a register file: more dense but slower
Uses a sense amp to detect the small bit cell output
Fastest for reads and writes after the register file
Large per-bit area costs
– six transistors (single port), eight transistors (dual port)
L1 and L2 caches on a CPU are always SRAM
On-chip buffers (Ethernet buffer, LCD buffer)
Typical sizes: 16k by 32
Static Memory Cell
[Figure: six-transistor SRAM cell, T1 through T6, with Wordline, True Bit Line, and Complement Bit Line]
Motherboard architecture
Dynamic RAM
Most dense RAM (1 Gbit chips available)
Historically a different semiconductor process, so built on a separate die
L3 cache (in the old days) and computer main memory
Requires refresh of data due to leakage
New push to combine DRAM and logic
– embedded DRAM (eDRAM)
– business case hard to close: yields drop
DRAM Bit Cells (1T)
DRAM used since the early 70s
Destructive Read
Highest density
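[Figure: 1T DRAM cell: an access transistor gated by the wordline connects the cell capacitor Cb to the bitline, which has capacitance Cbl]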
DRAM Cross Section
Flash Cross Section
FLASH
NOR Flash
– less dense (256 Mbit) but provides fast random read access
– erase by Fowler-Nordheim (FN) tunneling, program by hot electron injection (HEI)
– 100,000 write cycles
– slow erase, fast program and read
– SRAM-like interface: give an address, get a byte of data
– great for code memory (BIOS, boot-up, cell phones, etc.)
NAND Flash
– more than 2X denser: up to 2 Gbit
– erase FN / program FN
– fast erase, slow program and read
– 1,000,000 write cycles
– I/O-like interface, not as simple as NOR
– good for data storage (memory cards, iPods, USB key drives)
Flash Cross Section
NOR FLASH
NAND Flash Reading
Tunneling vs Injection
Charge Pumps
Flash and EEPROM architectures need higher voltages for programming (+10V) that are not otherwise available
Charge pumps can pump a cap to get a high voltage
DC-to-DC step-up converter without inductors
Need to consider Vmax across any gate oxide
Generally cannot provide much power (I*V)
Charge pumps are used for a lot of other things, like overdrive voltages and PLLs
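Staged Diode Charge Pump
Dickson Charge Pump
[Figure: Dickson charge pump: diode-connected transistors Md1 through Md15 in series from Vin to Vout, pump capacitors C at the intermediate nodes (V1, V2, V3, V4, and so on) driven by clock phases φ, and output capacitor Cout]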
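Clock booster
[Figure: clock booster circuit with transistors N1, N1b, N2, N2b, P1, P2, capacitors C1 and C1b, outputs Out and Outb, and boosted nodes Vo and Vob]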
SRAM Organization
Blocks with unity aspect ratio
– rows
– columns
– IO
SRAM Read Cross-Section
[Figure: SRAM read path: cell (T) on the true and complement bit lines (TBL, CBL) with the wordline, bit line precharge circuit, bit line isolation circuit, and sense amp (outputs TSA, CSA) enabled by Set Sense Amp]
SRAM Isolation & Pre-charge Circuits
[Figure: cells on the bit lines, bit line pre-charge circuit, bit line isolation, and the bit switch circuit feeding the sense amp]
SRAM Sense Amplifier Circuit
[Figure: sense amplifier with outputs TSA and CSA, Set Sense Amp enable, and bit switch]
SRAM Internal Memory Waveforms
[Figure: Clock, Word line, Isolation, Set Sense Amp, Sense Amp Output, and Data waveforms]
SRAM Write Head Circuit
[Figure: write head driving Bit Line True and Bit Line Complement from Data under Write Enable]
SRAM Cell with Center GND Contact
[Figure: SRAM cell layout with a center ground contact: Vdd, PFET diffusion, NFET diffusion, Ground, polysilicon word line, and bit line contacts]
SRAM Cell with Shared Vdd Contact
[Figure: SRAM cell layout with a shared Vdd contact: PFET diffusion, Vdd, Ground, NFET diffusion, bit line contacts, and polysilicon word line]
Split Word Line SRAM Cell
[Figure: split word line SRAM cell layout: bit line contacts, PFET diffusion, polysilicon word line, NFET diffusion, Ground, and Vdd]
Bit Cell Analysis – Read Disturb
[Figure: six-transistor cell, T1 through T6, during a read with the Wordline asserted and the bit lines precharged to 1.8V]
The low data node starts at 0V but will jump up.
If it jumps too high, it can flip the bit. T6 is often not minimum L, to keep the jump low.
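Bit Cell Analysis – Read Disturb
If low, the right data node (Vrd) cannot exceed the threshold of T2, or the bit may flip.
$$\frac{K_{n6}}{2}\left(V_{dd} - V_{rd} - V_{tn}\right)^2 = \frac{K_{n5}}{2}\left(2\left(V_{dd} - V_{tn}\right)V_{rd} - V_{rd}^2\right)$$
Setting Vrd = Vtn gives the limit:
$$\frac{K_{n6}}{K_{n5}} < \frac{2\left(V_{dd} - 1.5\,V_{tn}\right)V_{tn}}{\left(V_{dd} - 2V_{tn}\right)^2}$$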
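The read-disturb balance can also be solved directly for Vrd. With a = Vdd – Vtn and r = Kn6/Kn5 it rearranges to Vrd² – 2a·Vrd + a²·r/(1 + r) = 0, whose smaller root is Vrd = a·(1 – 1/√(1 + r)). This sketch uses the same square-law assumptions as the equations (one Vtn for both devices, no body effect); Vdd = 1.8 V comes from the slide, and Vtn = 0.4 V is a hypothetical value.

```python
import math


def max_access_to_pulldown_ratio(vdd: float, vtn: float) -> float:
    """Largest Kn6/Kn5 that keeps the read-disturbed node Vrd at or below Vtn."""
    return 2 * (vdd - 1.5 * vtn) * vtn / (vdd - 2 * vtn) ** 2


def read_node_voltage(ratio: float, vdd: float, vtn: float) -> float:
    """Solve the read-disturb current balance for Vrd, given ratio = Kn6/Kn5.

    With a = Vdd - Vtn, the balance becomes Vrd^2 - 2a*Vrd + a^2*r/(1+r) = 0;
    the smaller (physical) root is Vrd = a * (1 - 1/sqrt(1 + r)).
    """
    a = vdd - vtn
    return a * (1 - 1 / math.sqrt(1 + ratio))


if __name__ == "__main__":
    vdd, vtn = 1.8, 0.4                        # Vtn is a hypothetical value
    r_max = max_access_to_pulldown_ratio(vdd, vtn)
    print(f"Kn6/Kn5 must stay below {r_max:.2f}")
    for r in (0.5, r_max):
        print(f"r = {r:.2f}: Vrd = {read_node_voltage(r, vdd, vtn):.3f} V")
```

At the limiting ratio the solved Vrd lands exactly on Vtn, which confirms the inequality; a weaker access transistor keeps the disturbed node comfortably lower.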
Bit Cell Analysis - Write
Must ensure that the write head circuit can overpower the cell by the end of the write cycle.
The side of the bit cell with a 0 dominates the write transaction, as the pass transistor is an NFET.
When the word line asserts, the write head circuit drives a zero on one of the two sides.
The bit data in the cell must be brought below the threshold of the cross-coupled inverter to flip the bit.
Bit Cell Analysis – Soft Error
Radiation (particularly in space, but occasionally on Earth) causes the generation of charge in circuits.
SOI technology helps, as it shields transistors from charge generated in the bulk silicon.
The bit cell node has a capacitance, and introduced charge will change the voltage at the node.
If the voltage swing exceeds the threshold of the cross-coupled inverter, the bit will flip (i.e. a soft error).
Qcrit is the charge required to flip the bit.
The data is bad, but the bit cell still works (thus a soft error).
Bit Cell Analysis – Soft Error
[Figure: six-transistor cell, T1 through T6, with a constant current source at a storage node turned on for time t]
Qcrit = I * t
Hamming Code (ECC)
Simple parity (a 9th bit) can detect one failure.
Hamming code is a kind of two-dimensional parity that will not only detect one failure but correct it as well. With an added overall parity bit it will also detect two failures, but any more failures can be missed.
ECC DRAM memories (DDR and SDRAM) have 64-bit buses with 8 additional ECC bits.
8 bit SEC-ECC example
8 data bits require 4 parity-checking bits
– log2(8) + 1 = 4
Build the encoded word by indexing bits from the left
– all bits at power-of-2 locations are parity
– data bits take the remaining spaces in order
8 bit SEC-ECC example
E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12 = encoded word
R1 R2 D1 R3 D2 D3 D4 R4 D5 D6  D7  D8
Each redundant bit is the parity bit for all bits in the word whose index contains the corresponding power of 2
R1 = E3 ^ E5 ^ E7 ^ E9 ^ E11
R1 = D1 ^ D2 ^ D4 ^ D5 ^ D7
R2 = E3 ^ E6 ^ E7 ^ E10 ^ E11
R2 = D1 ^ D3 ^ D4 ^ D6 ^ D7
R3 = E5 ^ E6 ^ E7 ^ E12
R3 = D2 ^ D3 ^ D4 ^ D8
R4 = E9 ^ E10 ^ E11 ^ E12
R4 = D5 ^ D6 ^ D7 ^ D8
Index:   1    2    3    4    5    6    7    8    9    10   11   12
Binary:  0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100
Bit:     R1   R2   D1   R3   D2   D3   D4   R4   D5   D6   D7   D8
8 bit SEC-ECC example
Encode the 8 bit number CE, or 1100_1110
E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12
R1 R2 1  R3 1  0  0  R4 1  1   1   0
R1 = E3 ^ E5 ^ E7 ^ E9 ^ E11
R1 = D1 ^ D2 ^ D4 ^ D5 ^ D7 = 1 ^ 1 ^ 0 ^ 1 ^ 1 = 0
R2 = E3 ^ E6 ^ E7 ^ E10 ^ E11
R2 = D1 ^ D3 ^ D4 ^ D6 ^ D7 = 1 ^ 0 ^ 0 ^ 1 ^ 1 = 1
R3 = E5 ^ E6 ^ E7 ^ E12
R3 = D2 ^ D3 ^ D4 ^ D8 = 1 ^ 0 ^ 0 ^ 0 = 1
R4 = E9 ^ E10 ^ E11 ^ E12
R4 = D5 ^ D6 ^ D7 ^ D8 = 1 ^ 1 ^ 1 ^ 0 = 1
encoded word = 0111_1001_1110 = 79E and check bits (R4 R3 R2 R1) = 1110
8 bit SEC-ECC example
Now flip any one bit (including parity). Here E6 flips from 0 to 1.
E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12
R1 R2 1  R3 1  1  0  R4 1  1   1   0
R1 = E3 ^ E5 ^ E7 ^ E9 ^ E11
R1 = D1 ^ D2 ^ D4 ^ D5 ^ D7 = 1 ^ 1 ^ 0 ^ 1 ^ 1 = 0 (unchanged)
R2 = E3 ^ E6 ^ E7 ^ E10 ^ E11
R2 = D1 ^ D3 ^ D4 ^ D6 ^ D7 = 1 ^ 1 ^ 0 ^ 1 ^ 1 = 0 (was 1)
R3 = E5 ^ E6 ^ E7 ^ E12
R3 = D2 ^ D3 ^ D4 ^ D8 = 1 ^ 1 ^ 0 ^ 0 = 0 (was 1)
R4 = E9 ^ E10 ^ E11 ^ E12
R4 = D5 ^ D6 ^ D7 ^ D8 = 1 ^ 1 ^ 1 ^ 0 = 1 (unchanged)
new check bits = 1000
8 bit SEC-ECC example
Since the old check bits do not match the new check bits, we know there is a failure: 1110 != 1000.
Bitwise XOR the old and new check bits to create an index to the failure:
1000 - new check bits
1110 - original check bits
0110 = 6, so the culprit is E6 (these are the syndrome bits)
Flip the culprit bit back and the data is correct.
Put an R0 in an E0 position as an overall parity bit: if the overall parity checks correctly but the check bits don’t match, then you have a double error.
You can’t fix it, but at least you can detect it.
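The encode, check, and correct steps above can be sketched in Python. This is an illustrative implementation of the 12-bit SEC layout described here (parity at positions 1, 2, 4, 8; D1 is the leftmost data bit), not production ECC, and the function names are hypothetical. Running it on 1100_1110 gives check bits 1110, and flipping E6 gives syndrome 6, matching the hand calculation.

```python
def parity(word, p):
    """XOR of every data position whose index contains the power of two p (word[i] = Ei)."""
    result = 0
    for i in range(1, 13):
        if i & p and i != p:
            result ^= word[i]
    return result


def encode(data):
    """Hamming SEC encode: data = [D1..D8] as 0/1 ints, returns [E1..E12]."""
    word = [0] * 13                                            # index 0 unused
    data_positions = [i for i in range(1, 13) if i & (i - 1)]  # non-powers of 2
    for pos, bit in zip(data_positions, data):
        word[pos] = bit
    for p in (1, 2, 4, 8):                                     # R1, R2, R3, R4
        word[p] = parity(word, p)
    return word[1:]


def check_bits(word):
    """Return (stored, recomputed) check bits of [E1..E12], each packed as R4R3R2R1."""
    w = [0] + list(word)
    stored = recomputed = 0
    for k, p in enumerate((1, 2, 4, 8)):
        stored |= w[p] << k
        recomputed |= parity(w, p) << k
    return stored, recomputed


def correct(word):
    """Flip a single bad bit located by the syndrome; returns (fixed word, syndrome)."""
    stored, recomputed = check_bits(word)
    syndrome = stored ^ recomputed                 # index of the bad bit, 0 if none
    fixed = list(word)
    if 0 < syndrome <= len(fixed):                 # single-error correction only
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome


data = [int(b) for b in format(0xCE, "08b")]       # 1100_1110 -> D1..D8
code = encode(data)
print("".join(map(str, code)))                     # E1..E12
print(format(check_bits(code)[0], "04b"))          # R4 R3 R2 R1
bad = list(code)
bad[6 - 1] ^= 1                                    # flip E6
fixed, syndrome = correct(bad)
print(syndrome, fixed == code)
```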