Circuit Design for SRCMOS Asynchronous Wave Pipelines

Oliver Hauck
Integrated Circuits and Systems Lab
Departments of Computer Science and Electrical Engineering
Darmstadt University of Technology
Outline

Pipelines: synchronous, asynchronous, wave pipelined,
and asynchronous wave pipelined (AWP)

Comparison: AWPs vs. sync, async, and sync wave pipes

AWP Circuit Design

Conclusion
Pipelining

Pipelining used as premier technique to
better exploit hardware and
boost performance of VLSI chips

Clocking overhead presents serious threat for
deeply pipelined systems built upon sub-micron
CMOS processes running at GHz frequencies
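General Framework for Pipelines
[Figure: Data passes from an input Latch/Reg through the Logic to an output Latch/Reg; Clk reaches the two registers with intentional skews $\delta_i$ and $\delta_o$.]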
Some Notations...
$G$: set of all gate output nodes in the logic
$t_{min}$, $t_{max}$: minimum and maximum logic delay
$t_{min}(i)$, $t_{max}(i)$: minimum and maximum logic delay from the input to internal node $i \in G$
$t_{stable}(i)$: minimum time internal node $i \in G$ has to be stable
$T_{clk}$: clock period or cycle time
$\delta_i$, $\delta_o$: intentional skew at the input and output registers
$\Delta = \delta_o - \delta_i$: delay between input and output clock
$t_{skew}$: uncontrolled clock skew at a register
$t_d$: propagation delay of a register
$t_{setup}$: set-up time of a register
$t_{hold}$: hold time of a register
General Relations
Data is latched by the output clock at time
$t = k \cdot T_{clk} + \delta_o$   (1)
$k$ is called the "global clock latency"; it equals the number of clock cycles at the output before the data is latched.
Lower bound: $t > \delta_i + t_{max} + t_d + t_{setup} + t_{skew}$   (2)
Upper bound: $t < T_{clk} + \delta_i + t_{min} + t_d - t_{hold} - t_{skew}$   (3)
Combining (2) and (3), with $\Delta = \delta_o - \delta_i$:
$t_{max} + t_{setup} + t_{skew} < k \cdot T_{clk} + \Delta - t_d < T_{clk} + t_{min} - t_{hold} - t_{skew}$   (4)
By transitivity, (4) implies:
$T_{clk} > (t_{max} - t_{min}) + t_{setup} + t_{hold} + 2 \cdot t_{skew}$   (5)
I.e., cycle time is bounded by delay variation, register overhead and clock skew.
Similarly, the minimum pulse width has to be respected for all $i \in G$:
$T_{clk} > (t_{max}(i) - t_{min}(i)) + t_{stable}(i) + t_{skew}$   (6)
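
Relations (4) and (5) can be checked numerically. The following is a minimal sketch, assuming the inequality directions t_max + t_setup + t_skew < k*Tclk + Delta - td < Tclk + t_min - t_hold - t_skew. The function names and the picosecond values are hypothetical and chosen only for illustration.

```python
def min_cycle_time(t_max, t_min, t_setup, t_hold, t_skew):
    """Lower bound on Tclk according to relation (5)."""
    return (t_max - t_min) + t_setup + t_hold + 2 * t_skew


def satisfies_relation_4(k, t_clk, delta, t_max, t_min, t_d,
                         t_setup, t_hold, t_skew):
    """True if the latching instant lies inside the window of relation (4)."""
    middle = k * t_clk + delta - t_d
    lower = t_max + t_setup + t_skew
    upper = t_clk + t_min - t_hold - t_skew
    return lower < middle < upper


# Hypothetical delays in picoseconds (not taken from the slides).
logic = dict(t_max=2000, t_min=1600, t_setup=50, t_hold=50, t_skew=50)

print(min_cycle_time(**logic))  # 600
# Conventional pipeline (k = 1, Delta = 0): 2500 ps works, 2200 ps does not.
print(satisfies_relation_4(1, 2500, 0, t_d=200, **logic))  # True
print(satisfies_relation_4(1, 2200, 0, t_d=200, **logic))  # False
```

With these numbers the bound from (5) is far below the 2300 ps a conventional pipeline needs, which is the gap wave pipelining tries to exploit.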
Synchronous Pipeline
[Figure: Data passes through Latch/Reg, Logic, and Latch/Reg; both registers are clocked by Clk.]

Throughput determined by longest logic path + clock/register overhead:
$k = 1,\ \Delta = 0 \Rightarrow T_{clk} > t_{max} + t_d + t_{setup} + t_{skew}$

Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead

Negative side-effects of gate-level pipelining:
Increased latency, clock load/skew, power, area, design time
More area for clocking and registers than for logic

Implementation options:
Register- vs. latch-based, explicit latches vs. latchless
TSPC vs. local clocks derived from global clock
Static vs. dynamic, single-ended vs. dual-rail
Asynchronous Pipeline
[Figure: Data passes through a Handshake stage, Logic, and a second Handshake stage; req_in/ack_in and req_out/ack_out are the handshake signals.]

Micropipeline (Sutherland 1989)
Synchronous clock replaced by asynchronous handshaking
Operation is data dependent, saves power during idle
Plug & Play composability
Elastic operation: input and output rate may differ momentarily, and the pipeline buffers at will
As with fine-grain sync pipelines, throughput can be high; handshake load causes high latency and backward stall
Used by Furber's group at Manchester U for AMULET1/2/3

Implementation options:
4-phase (level) vs. 2-phase (event) protocol on req and ack lines
Bundled data (matched delay) vs. distributed completion detection
Synchronous Wave Pipeline
[Figure: Data passes through Latch/Reg, Wave Logic, and Latch/Reg; Clk reaches the two registers through delay elements 1 and 2.]

Several data waves simultaneously active in the logic

Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area and power

$k \neq 0, 1 \Rightarrow \frac{t_{max} + t_d + t_{setup} + t_{skew} - \Delta}{k} < T_{clk} < \frac{t_{min} + t_d - t_{hold} - t_{skew} - \Delta}{k - 1}$

Logic has to minimize delay variations over P, T, V corners

Global clock used with constructive skew to adjust phases

However, tuning the logic and the delay elements is difficult
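
The double-sided constraint on the clock period follows from dividing relation (4) by k on the left and by k - 1 on the right. A minimal sketch of that window, assuming k > 1 and the inequality directions stated in the lead-in of relation (4); the function name and the picosecond values are hypothetical.

```python
def wave_clock_window(k, delta, t_max, t_min, t_d, t_setup, t_hold, t_skew):
    """Open interval (lower, upper) of valid Tclk for k > 1, or None if empty."""
    if k <= 1:
        raise ValueError("the two-sided form requires k > 1")
    lower = (t_max + t_d + t_setup + t_skew - delta) / k
    upper = (t_min + t_d - t_hold - t_skew - delta) / (k - 1)
    return (lower, upper) if lower < upper else None


# Hypothetical delays in picoseconds (not taken from the slides).
timing = dict(delta=0, t_max=2000, t_min=1600, t_d=200,
              t_setup=50, t_hold=50, t_skew=50)

for k in (2, 3, 4):
    print(k, wave_clock_window(k, **timing))
```

For these values the windows for k = 2 and k = 3 do not touch, so some clock periods between them are forbidden, and no window exists for k = 4. This is one reason tuning is hard: the clock cannot simply be slowed down.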
Wave Pipelining: A Short Outline

Wave pipelining occurs when combinational logic is clocked faster than latency would allow

Several data waves are then active in the logic without being separated by storage elements

Latency remains constant and throughput is determined by delay differences rather than absolute delay

The requirement for delay-balanced logic and the complicated timing are the main hurdles

Wave Pipelining: A Little History

Technique stems from the 60s and has had a reputation for being exotic ever since

Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU

Some working academic chips exist, mainly datapath

Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know

Asynchronous Wave Pipeline (AWP)
[Figure: Data passes through Wave Latch, Wave Logic, and Wave Latch; req_in travels through a matched delay to req_out.]

AWP is special case of the sync wave pipeline with the constructive skew set to worst-case logic delay:
$k = 0 \Rightarrow T_{clk} > \Delta - t_{min} - t_d + t_{hold} + t_{skew}$, where $\Delta > t_{max} + t_d + t_{setup} + t_{skew}$

Data words associated with events on request line

Several data waves and protocol events simultaneously active in the logic and the matched delay element, respectively

It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners
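
For k = 0, relation (4) splits into a lower bound on the matched delay and a lower bound on the request period, with no upper bound on the period. A minimal sketch under that reading; the function name, the explicit `margin` parameter and the picosecond values are hypothetical.

```python
def awp_timing(t_max, t_min, t_d, t_setup, t_hold, t_skew, margin):
    """Matched delay and minimum request period for the k = 0 case."""
    if margin <= 0:
        raise ValueError("margin must be positive to keep the inequality strict")
    # The matched delay has to exceed the worst-case path through the logic.
    delta = t_max + t_d + t_setup + t_skew + margin
    # The request period must exceed this value; there is no upper limit.
    t_clk_min = delta - t_min - t_d + t_hold + t_skew
    return delta, t_clk_min


# Hypothetical delays in picoseconds (not taken from the slides).
delta, t_clk_min = awp_timing(t_max=2000, t_min=1600, t_d=200,
                              t_setup=50, t_hold=50, t_skew=50, margin=10)
print(delta, t_clk_min)  # 2310 610
```

Here the minimum period equals the bound of relation (5) plus the chosen margin, so the throughput limit is still set by delay variation rather than by absolute delay.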
AWPs vs. Synchronous Pipelines

No global clock, instead a local clock (request) that is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with an event on request

Many pipeline registers removed, thus requirements on the clock (request) relaxed

Synchronous pipelines can reach the throughput of AWPs only with excessive cost in area, power and latency

AWPs vs. Asynchronous Pipelines

AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead

AWPs not elastic: data at output has to be consumed

AWPs eliminate hazards as side-effect of delay balancing

AWPs have in common with other async methodologies: data-dependent operation (avoids redundant transitions), composability (though inelastic), no global clock

AWPs vs. Synchronous Wave Pipelines
AWPs tackle two main difficulties in sync wave pipes:

Replacing the constructive skew by worst-case delay removes the double-sided timing constraint, i.e. in contrast to sync wave pipes, AWPs operate at any rate below the maximum

Using dynamic self-resetting logic controls delay variation and doesn't impact latency much

Wave Pipelining Combinational Logic

Overall goal: keep data wave coherent under all
possible conditions (data, PTV)

Desirable architecture features:
most logic paths have same depth
fanin/fanout the same everywhere

First step: pad all short paths to maximum length
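
The padding step can be sketched on a small netlist. This is a minimal illustration, assuming the logic is given as a DAG in which every gate counts as one level of depth; the function name, the `fanins` mapping and the `"OUT"` marker are hypothetical.

```python
from functools import lru_cache


def pad_to_equal_depth(fanins, outputs):
    """Return the number of buffers to insert on each edge.

    fanins maps each gate to the nodes driving it; names that do not
    appear as keys are treated as primary inputs at depth 0.
    """
    @lru_cache(maxsize=None)
    def depth(node):
        if node not in fanins:
            return 0
        return 1 + max(depth(src) for src in fanins[node])

    buffers = {}
    # Pad every gate input that arrives earlier than the deepest input.
    for gate, sources in fanins.items():
        for src in sources:
            slack = depth(gate) - 1 - depth(src)
            if slack:
                buffers[(src, gate)] = slack
    # Pad shallow outputs up to the depth of the deepest output.
    target = max(depth(out) for out in outputs)
    for out in outputs:
        slack = target - depth(out)
        if slack:
            buffers[(out, "OUT")] = slack
    return buffers


# Hypothetical three-gate example.
netlist = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "a"]}
print(pad_to_equal_depth(netlist, outputs=["g3", "g1"]))
```

In this example input c needs one buffer, input a needs two on its way to g3, and the shallow output g1 needs two, so every path ends up three levels deep.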
Example: 64-b Brent-Kung Parallel Adder
[Figure: 64-b Brent-Kung adder tree with logic columns 0 to 4 (pg, PG, PG, G, xor).]
Buffers provide for same depth on every logic path
All gates in the same column must have the same delay

Circuits

Logic style used has to minimize delay variation

Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream

Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed

Pass transistor logic gives sloppy edges, thereby introducing delay variation

Dynamic logic is attractive as only the output high transition is data-dependent; output pulldown is done by precharge

Circuits (cont.)

Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge

What is needed is a dynamic logic family without precharge overhead: SRCMOS

Work done at IBM: classic paper by Chappell et al., "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC (26), 11, 1991; or, more recently, "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC (34), 8, 1999, by Hwang et al.

SRCMOS

Distinguishing property of our SRCMOS circuits:
precharge feedback is fully local, and NMOS trees
are delay balanced
[Figure: SRCMOS gate with inputs, NMOS tree N, and output.]

Operation of a 2-AND
Delay Balancing at Transistor Level

NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices

Short paths are padded with dummy devices

Delay variation is minimal when exactly one path is on, i.e. wide fanin ORs are hard to use

Every output has to see the same load

Lightly loaded outputs are given dummy cap

Example: Carry tree in a 64-bit adder
$G_{im} = G_{lm} + P_{lm} \cdot (G_{kl} + P_{kl} \cdot (G_{jk} + P_{jk} \cdot G_{ij}))$

Gim Layout
Simulation of Gim cell
Pulses of the 4 possible input situations giving '1' at the output are tightly matched

Note: in this case, Pxy = Gxy = 1 never occurs
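
The note matters because it guarantees that at most one pulldown path of the Gim cell conducts at a time. A minimal sketch that enumerates the inputs, assuming the carry expression Gim = Glm + Plm (Gkl + Pkl (Gjk + Pjk Gij)) and treating each product term as one path; the function name and the path numbering are hypothetical.

```python
from itertools import product


def conducting_paths(g_lm, p_lm, g_kl, p_kl, g_jk, p_jk, g_ij):
    """Return the indices of the product terms of Gim that are on."""
    paths = [
        g_lm,
        p_lm and g_kl,
        p_lm and p_kl and g_jk,
        p_lm and p_kl and p_jk and g_ij,
    ]
    return [index for index, on in enumerate(paths) if on]


# (P, G) pairs of one group; (1, 1) is excluded, following the note above.
PG_STATES = [(0, 0), (1, 0), (0, 1)]

max_on = 0
for (p_lm, g_lm), (p_kl, g_kl), (p_jk, g_jk), g_ij in product(
        PG_STATES, PG_STATES, PG_STATES, [0, 1]):
    on = conducting_paths(g_lm, p_lm, g_kl, p_kl, g_jk, p_jk, g_ij)
    max_on = max(max_on, len(on))

print(max_on)  # 1
# If P = G = 1 were allowed, two paths could conduct at once:
print(conducting_paths(1, 1, 1, 0, 0, 0, 0))  # [0, 1]
```

The four path indices correspond to the four input situations that produce a '1' at the output.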

First Pulse Problem
Miller Effect
64-bit Adder Output Waveforms
[Figure: adder output waveforms with the latching window marked.]

Transistor Sizing
[Figure: SRCMOS gate schematic annotated with Wprecharge, Wkeeper, Cfeedback, Cload, Cdrive, Wpd, the inputs, NMOS tree N, and the output.]
Wpd / Cdrive = const
Cdrive / (Cload+Cfeedback+Wkeeper) = const
Cfeedback / Wprecharge = const
Wprecharge / Cdrive = const
LINEAR SIZING
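
The four ratios can be solved for the gate parameters once the load is known. A rough sketch, assuming Cload and Wkeeper are given, that Wkeeper enters the sum as an equivalent capacitance in the same normalized units, and that the four constants (here `k_pd`, `k_drive`, `k_fb`, `k_pre`) are hypothetical design parameters; none of the numeric values come from the slides.

```python
def linear_sizing(c_load, w_keeper, k_pd, k_drive, k_fb, k_pre):
    """Solve the four ratio constraints for one gate.

    k_pd    = Wpd / Cdrive
    k_drive = Cdrive / (Cload + Cfeedback + Wkeeper)
    k_fb    = Cfeedback / Wprecharge
    k_pre   = Wprecharge / Cdrive
    """
    # Cfeedback depends on Cdrive, so the second ratio is a linear equation.
    loop = k_drive * k_fb * k_pre
    if loop >= 1:
        raise ValueError("ratios admit no positive solution")
    c_drive = k_drive * (c_load + w_keeper) / (1 - loop)
    w_precharge = k_pre * c_drive
    c_feedback = k_fb * w_precharge
    w_pd = k_pd * c_drive
    return {"Cdrive": c_drive, "Wprecharge": w_precharge,
            "Cfeedback": c_feedback, "Wpd": w_pd}


# Hypothetical constants and loads in normalized units.
ratios = dict(k_pd=4.0, k_drive=0.25, k_fb=0.5, k_pre=2.0)
small = linear_sizing(c_load=3.0, w_keeper=0.0, **ratios)
large = linear_sizing(c_load=6.0, w_keeper=0.0, **ratios)
print(small)
print(large)
```

Doubling the load doubles every result, which is the sense in which the sizing is linear.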
Interconnect: Resistive Effects

0.9µm x 900µm MET2 parasitics: C=116fF, R=70 Ohms
[Figure: comparison of wire models: C only; R/3, R/3, R/3; R/2, R/2; RC only.]

Interconnect: Coupling Effects

2 adjacent MET2 lines coupled by C=54fF
PTV Variations

SRCMOS provides some robustness by generating fresh pulses at every gate output

Pulsed operation reduces data dependency, coupling

PTV noise is not critical when drift is in the same direction across die

Critical are: temperature gradient, supply drop, and local variations

What is needed: rule of thumb like "For process X, to be on the safe side, keep area between two latches < Y sqmm"

Conclusion

AWPs presented as an alternative approach to high-speed design, showing potential for GHz throughput without clocks

AWPs avoid some problems of conventional wave pipes
and (a)synchronous systems

64b adder + test circuit and EC crypto layout in the making

Not covered here: feedback + controllers

To do: support transistor sizing