Asynchronous VLSI Design: An Introduction

Asynchronous Logic: Results and Prospects
Alain J. Martin
California Institute of Technology
NTU, March 2007
What Is Asynchronous Logic?
Sequencing and Computation



“An algorithm is a sequence of computational steps.” (CL&R)
How do we implement sequencing in a continuous physical medium?
Traditional answer: use a global time reference (“the clock”)
[Figure: blocks A, B, C, D, E driven by a global clock CLK]
Can we compute without a clock?
Yes: “asynchronous” or “clockless” logic
Also “self-timed” or “speed-independent”
David Muller, “Theory of Asynchronous Circuits” (1959)
ILLIAC (1959) and ILLIAC II (1962): partially asynchronous
PDP6 (1960): asynchronous
Can we compute without a clock and without delay assumptions?
Delay-insensitivity (Molnar, 198x…)
Almost: “The class of delay-insensitive circuits is limited (not Turing-complete).” (Martin, 1990)
Quasi-delay-insensitive (QDI) logic:
– Delay-insensitive
– Isochronic forks (only delay assumption)
QDI is Turing-complete (Martin & Manohar, 1996)
What is an Asynchronous Circuit?
Asynchronous system: a collection of modules communicating by handshake protocols
Distributed system on a chip (communicating by message exchange)
[Figure: modules A, B, C, D, E connected by channels, each with an ack signal]
Caltech QDI Approach
Quasi-delay-insensitive (QDI) design
Minimal delay assumptions (only isochronic forks)
Stricter logic synthesis (DI codes for datapath, completion trees), but…
Robust and efficient (no evidence that delay assumptions improve efficiency)
Why Asynchronous and QDI Logic?
Scientific Reasons
Understanding the role of time in computation
Limit of delay insensitivity
Implementing a digital computation directly in a continuous physical medium
Design by program transformation (real correctness-by-construction approach)
“VLSI design as programming” paradigm
Engineering Reasons
Better match for high-level synthesis
– Can separate correctness from performance issues
Modularity and better use of concurrency
Large system design (SoC): only local communication
Efficiency
– Average-case instead of worst-case behavior
– Less pressure for global optimization (“timing closure”)
Robustness and reliability
– Robust to variations in fabrication technology, temperature, voltage, and noise; SEU-tolerant
Energy efficiency
Energy Advantages of Async
No clock
– Up to 50% of clock power recovered
Automatic shut-off of idle parts
– Perfect clock gating
No glitches (spurious transitions)
– Up to 50% of power in combinational circuits
Automatic adaptation to parameter variations
– Voltage scaling: perfect exchange of delay against energy
Flexibility of asynchronous interfaces:
– Better use of concurrency
Reactive Use in Embedded Systems
Archetype of a reactive system
Average execution time may be much shorter than maximal execution time
Sleep sequence without race condition
– Modeled after wait/signal with condition variables
Instant wake-up from deep sleep
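The wait/signal pattern that the sleep sequence is modeled after can be sketched in software. This is only a software analogy, not the circuit from the talk: the class `SleepableCore` and its event names are hypothetical. The sketch shows why checking the wake-up condition and going to sleep under the same lock avoids the lost-wake-up race.

```python
import threading

class SleepableCore:
    """Toy model of a reactive core that sleeps until an event is pending."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []   # events posted but not yet handled
        self.handled = []    # events processed, in order

    def post(self, event):
        # "signal": record the event and wake the core under the same lock,
        # so a wake-up cannot slip in between the check and the sleep.
        with self._cond:
            self._pending.append(event)
            self._cond.notify()

    def run(self, n_events):
        for _ in range(n_events):
            with self._cond:
                # "wait": sleep only while nothing is pending.
                while not self._pending:
                    self._cond.wait()
                event = self._pending.pop(0)
            self.handled.append(event)

core = SleepableCore()
worker = threading.Thread(target=core.run, args=(3,))
worker.start()
for e in ("timer", "uart", "button"):
    core.post(e)
worker.join()
```

The core consumes no cycles while blocked in `wait()`, and every posted event is handled exactly once regardless of how the two threads interleave.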
Robustness to PVT Variations
Increase in physical parameter variations (PVT) is becoming a huge problem…
Even worse in future technologies (nano CMOS or others)
Variations of physical parameters all affect timing
Increased timing variations reduce robustness and/or performance
Single time reference (clock) may become unavailable or too expensive in future technologies and large systems (SoC)
Robustness to Voltage and Temperature Variations
Single-event Upset and Soft-error Tolerance of QDI circuits
Soft errors caused by alpha particles, cosmic rays and other radiation sources are becoming increasingly problematic, even at ground level
QDI circuits can absorb most “dose-effects”
Single-event upsets that cause a soft error (bit flip) can be corrected efficiently in QDI circuits
Error-correction scheme specific to QDI
Entire async microcontroller SEU-tolerant
Detection and Correction of SE in QDI circuits
Single-error detection: duplicate and compare
Correction:
– prevent propagation of detected SE
– stability of guards corrects automatically
– “Detection is correction”
Simplest, most expensive coding, but simplest detection mechanism
Entire microcontroller SEU-tolerant
Disadvantages of Async
Size overhead (more transistors)
Poorly understood and rarely taught
No industrial CAD tools (yet)
No well-developed testing procedure (yet)
No easy transition path for large established companies…
Experimental Evidence
Asynchronous Chips @ Caltech
World's first asynchronous microprocessor (1988)
MiniMIPS (1998)
Lattice-structure filter (1994)
Lutonium 8051 microcontroller (2005)
First Asynchronous Microprocessor (Caltech, 1988)
16-bit RISC, 2-micron CMOS
Formal synthesis:
– Initial sequential description was a single page of CHP code
– 5 months from start of project to tape-out (small group)
– Fully functional on first silicon
Performance:
– 5 MIPS, 5mA @ 2V
– 18 MIPS, 45mA @ 5V
– 26 MIPS, 100mA @ 10V
Potato-chip experiment
– Runs on a potato as power supply!
– 50kHz @ 0.75V, 300kHz @ 0.9V
Asynchronous MIPS R3000 Microprocessor
Standard 32-bit RISC ISA
Single instruction issue, one branch delay slot
Precise exceptions
2 on-chip caches: 4kB Icache and 4kB Dcache
First prototype (1998):
– No TLB
– 2M transistors
– First asynchronous processor competitive with large synchronous designs
MiniMIPS Low-Voltage Operation
Functional from 0.5V Vdd up
Functional at 0.4V with some transistor resizing
Asynchronous MIPS: Practical Results
HP’s 0.6-micron CMOS
– Expected: 275 MIPS @ 7W @ 3.3V @ 25°C
– First prototype: 190 MIPS @ 4W @ 3.3V @ 25°C
– Voltage range: 1V (9.66MHz @ 0.021 W) to 8V
Functional on first silicon despite
– Inconsistencies in HP’s process parameters (e.g. higher Vt’s)
– Long polysilicon wire overlooked in the critical fetch loop
– (Testament to the robustness of the asynchronous design style!)
Roughly 4x faster than commercial synchronous MIPS ported to same technology
– Note: no particular effort made towards designing for low power.
Lutonium-18: QDI 8051 Microcontroller
TSMC SCN018 through MOSIS
– 0.18µm CMOS
– 1.8V nominal
– |Vt| = 0.4V to 0.5V
Expected area: 5mm² (including 8kB SRAM)
Performance from low-level simulation (conservative!)
Vdd      Speed       Power       Energy         Efficiency
1.8 V    200 MIPS    100.0 mW    500 pJ/inst     1800 MIPS/W
1.1 V    100 MIPS     20.7 mW    207 pJ/inst     4830 MIPS/W
0.9 V     66 MIPS      9.2 mW    139 pJ/inst     7200 MIPS/W
0.8 V     48 MIPS      4.4 mW     92 pJ/inst    10900 MIPS/W
0.5 V      4 MIPS      170 µW     43 pJ/inst    23000 MIPS/W
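The energy and efficiency columns follow from the speed and power columns, which can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are illustrative, and the 0.5 V power is read as 170 µW (0.17 mW), which is an interpretation of the source rather than a confirmed figure.

```python
# (Vdd in volts, throughput in MIPS, power in mW), taken from the table above.
# The 0.5 V power is assumed to be 170 uW = 0.17 mW.
rows = [
    (1.8, 200, 100.0),
    (1.1, 100, 20.7),
    (0.9, 66, 9.2),
    (0.8, 48, 4.4),
    (0.5, 4, 0.17),
]

def energy_per_instruction_pj(mips, power_mw):
    # mW / MIPS = (1e-3 J/s) / (1e6 inst/s) = 1e-9 J/inst = 1000 pJ/inst
    return power_mw / mips * 1000.0

def mips_per_watt(mips, power_mw):
    return mips / (power_mw / 1000.0)

for vdd, mips, power_mw in rows:
    print(f"{vdd:.1f} V: {energy_per_instruction_pj(mips, power_mw):6.1f} pJ/inst, "
          f"{mips_per_watt(mips, power_mw):8.0f} MIPS/W")
```

The computed values agree with the listed ones to within rounding, with one exception: 200 MIPS at 100 mW works out to 2000 MIPS/W, not the 1800 MIPS/W listed for 1.8 V. The 0.5 V row comes out near 43 pJ/inst only if the power is read as microwatts.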
Energy Efficiency Metric: Et^2
E = C*V^2, t = k/V
E*t^2 is independent of V
Estimate of energy efficiency
Comparison of designs
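Substituting the two first-order relations gives E*t^2 = (C*V^2)*(k/V)^2 = C*k^2, so the supply voltage cancels. The short check below confirms this numerically. The constants `C` and `K` are arbitrary illustrative values, not figures from the talk, and the model is only the first-order one stated on the slide.

```python
def energy(c, v):
    return c * v ** 2            # E = C * V^2

def delay(k, v):
    return k / v                 # t = k / V

def et2(c, k, v):
    return energy(c, v) * delay(k, v) ** 2

C, K = 2.0e-12, 5.0e-9           # illustrative constants only
values = [et2(C, K, v) for v in (0.9, 1.2, 1.8, 3.3)]
```

Every entry of `values` equals `C * K**2` up to floating-point rounding, which is why Et^2 can rank designs without fixing an operating voltage.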
“Algorithmic

of energy’’
See Chapter 15 in “Power Aware Computing”, Graybill & Melhem, eds., Kluwer
Voltage Scaling Advantage: Comparison to Intel XScale
Energy Breakdown and Comparisons
Microprocessor -- Results
[Figure: MIPS energy breakdown by unit: icache, fetch, decode, exec units (adder, shifter, fblock, mem, mult/div), write back, regfile (bypass)]
MIPS energy: 33nJ, 70nJ (bars labelled async-0.6µm, sync-0.6µm)
MIPS cycle time: 6ns, 21ns (bars labelled async-0.6µm, sync-0.6µm)
Microcontroller -- Estimation
Energy breakdown: icache 31%, regfile 27%, bus 12%, fetch 11%, execunits 8%, writeback 7%, decode 4%
8051 energy per instruction: 10.00nJ (1X), 1.67nJ (6X), 0.56nJ (18X), 0.14nJ (72X)
8051 cycle time: 20ns (1X), 10ns (2X), 5ns (4X), 10ns (2X)
(bars labelled sync-0.5µm, async-0.5µm, async-0.18µm@1.8V, async-0.18µm@0.9V)
More than 100X Et^2 improvement over any other 8051
Design Methodology
Handshakes & Dual-Rail Encoding
BUFFER: *[ L?x; R!x ]
[Figure: buffer with input channel L? (wires L0, L1, La) and output channel R! (wires R0, R1, Ra); DATA and ACK waveforms]
Four-phase handshake
Dual-rail encoding:
– 3 wires (2 data, 1 ack) for one bit of information
– Other DI codes are used: 1-of-N
C0   C1   Data
0    0    Hasn’t arrived
1    0    0
0    1    1
1    1    invalid
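The dual-rail code and the four-phase handshake can be sketched in a few lines. This is an illustrative model, not code from the talk: the function names are assumptions, and the trace lists wire values as (C0, C1, ack) in the order a return-to-zero handshake would produce them.

```python
def encode(bit):
    """Return (C0, C1) for one data bit: C0 high means 0, C1 high means 1."""
    return (1, 0) if bit == 0 else (0, 1)

def decode(c0, c1):
    """Return the bit, None if data has not arrived, or raise on (1, 1)."""
    if (c0, c1) == (0, 0):
        return None
    if (c0, c1) == (1, 1):
        raise ValueError("invalid dual-rail codeword (1, 1)")
    return 0 if c0 else 1

def four_phase_transfer(bit):
    """Trace of (C0, C1, ack) over one four-phase handshake."""
    c0, c1 = encode(bit)
    return [
        (0, 0, 0),     # idle: data neutral, ack low
        (c0, c1, 0),   # sender raises one data rail
        (c0, c1, 1),   # receiver raises ack
        (0, 0, 1),     # sender returns data to neutral
        (0, 0, 0),     # receiver lowers ack
    ]
```

Because validity is carried by the data rails themselves, the receiver needs no timing reference to know when the bit has arrived.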
A QDI pipeline stage
*[ L?x; R!f(x)]
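As a rough software analogy, the CHP process `*[ L?x; R!f(x) ]` behaves like a loop that receives on one channel and sends on another. The sketch below models channels as one-place `queue.Queue` objects, which is an assumption (CHP channels are rendezvous, not buffers), and bounds the infinite repetition with a hypothetical `n_tokens` parameter so that the program terminates.

```python
import queue
import threading

def stage(f, left, right, n_tokens):
    """Python stand-in for *[ L?x; R!f(x) ], repeated n_tokens times."""
    for _ in range(n_tokens):
        x = left.get()       # L?x : receive on the left channel
        right.put(f(x))      # R!f(x) : send on the right channel

# Two stages in series: a -> (+1) -> b -> (*2) -> c
a, b, c = (queue.Queue(maxsize=1) for _ in range(3))
inputs = [1, 2, 3, 4]
threads = [
    threading.Thread(target=stage, args=(lambda x: x + 1, a, b, len(inputs))),
    threading.Thread(target=stage, args=(lambda x: x * 2, b, c, len(inputs))),
]
for t in threads:
    t.start()

outputs = []
for x in inputs:
    a.put(x)
    outputs.append(c.get())
for t in threads:
    t.join()
```

No stage knows how long its neighbours take; ordering comes only from the blocking send and receive, which is the property the handshake provides in hardware.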
QDI Pipeline vs Bundled Data
Dual-rail or 1-of-n data encoding
Completion tree
Critics: high overhead (2*N + 1 wires and completion tree)
Alternative: bundled data
– N + 1 wires, no completion tree
– Delay line for indicating completion, spurious transitions
Big controversy!
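The job of a completion tree can be sketched as a predicate over the dual-rail bits of a word, alongside the wire counts quoted above. This is a simplified, combinational model with illustrative function names. A real completion tree would typically use state-holding C-elements, so that the output changes only once all bits are valid or all are neutral.

```python
def bit_valid(c0, c1):
    return bool(c0 or c1)    # one rail high: this bit has arrived

def word_valid(rails):
    """True once every dual-rail bit in the word carries data."""
    return all(bit_valid(c0, c1) for c0, c1 in rails)

def word_neutral(rails):
    """True once every bit has returned to the (0, 0) spacer."""
    return all((c0, c1) == (0, 0) for c0, c1 in rails)

def wire_counts(n_bits):
    """Wire counts from the slide: dual-rail QDI versus bundled data."""
    return {"dual_rail": 2 * n_bits + 1, "bundled": n_bits + 1}
```

For a 32-bit word this gives 65 wires against 33, which is the overhead the critics point to. The bundled-data alternative replaces `word_valid` with a matched delay line.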
Fine-grain Pipeline (PCHB)
[Figure: PCHB stage computing f from input L? to output R!, with enable (en), input and output validity (Lv, Rv), completion, and acknowledges La, Ra]
Fine-grain Pipeline
No need for separate register
Very high throughput and low forward latency
Excellent Et^2 performance
Entirely QDI
Used in MiniMIPS and Lutonium
Area overhead significant
Lower-Level Synthesis: HSE
CHP Program
*[ L?x; R!x ]
Handshaking Expansion
*[ [ ¬Ra ∧ L0 → R0↑
  [] ¬Ra ∧ L1 → R1↑
   ]; La↑;
   [ Ra → R0↓, R1↓ ];
   [ ¬L0 ∧ ¬L1 → La↓ ]
 ]
[ Ld ]; La↑; [ ¬Ld ]; La↓
[ ¬Ra ]; Rd↑; [ Ra ]; Rd↓
[Figure: handshake waveforms with transitions numbered 1 to 8]
Lower-Level Synthesis: PRS
CHP Program
*[ L?x; R!x ]
Handshaking Expansion
*[ [ ¬Ra ∧ L0 → R0↑
  [] ¬Ra ∧ L1 → R1↑
   ]; La↑;
   [ Ra → R0↓, R1↓ ];
   [ ¬L0 ∧ ¬L1 → La↓ ]
 ]
Production Rule Set
L0 ∨ L1 → Lv↑
¬La ∧ ¬Ra ∧ L0 → R0↑
¬La ∧ ¬Ra ∧ L1 → R1↑
R0 ∨ R1 → Rv↑
Lv ∧ Rv → La↑
¬L0 ∧ ¬L1 → Lv↓
Ra ∧ La → R0↓
Ra ∧ La → R1↓
¬R0 ∧ ¬R1 → Rv↓
¬Lv ∧ ¬Rv → La↓
To PRS for CMOS …
Lower-Level Synthesis: PRS
Each production rule has the form:
guard expr → node↑ or
guard expr → node↓
These can be evaluated as
If ( guard expr is true ) node = Vdd
or
If ( guard expr is true ) node = GND
A set of production rules must be stable and non-interfering (for hazard-free circuits)
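The "if guard then set node" reading of a production rule suggests a very small simulator. The sketch below is one possible reading of the buffer's rule set. The negations and operators in the guards are assumptions, chosen so that the pull-up and pull-down rules of each node are mutually exclusive. The names `settle` and `transfer` are illustrative. Rules fire one at a time in random order, and the assertion inside `settle` checks non-interference at every step.

```python
import random

# Each rule: (guard, node, value). Guards are an assumed reading of the
# buffer's production rules; value 1 means node = Vdd, 0 means node = GND.
RULES = [
    (lambda s: s["L0"] or s["L1"],                      "Lv", 1),
    (lambda s: not s["L0"] and not s["L1"],             "Lv", 0),
    (lambda s: not s["La"] and not s["Ra"] and s["L0"], "R0", 1),
    (lambda s: not s["La"] and not s["Ra"] and s["L1"], "R1", 1),
    (lambda s: s["Ra"] and s["La"],                     "R0", 0),
    (lambda s: s["Ra"] and s["La"],                     "R1", 0),
    (lambda s: s["R0"] or s["R1"],                      "Rv", 1),
    (lambda s: not s["R0"] and not s["R1"],             "Rv", 0),
    (lambda s: s["Lv"] and s["Rv"],                     "La", 1),
    (lambda s: not s["Lv"] and not s["Rv"],             "La", 0),
]

def settle(state, rng):
    """Fire enabled rules one at a time, in random order, until none is left."""
    while True:
        enabled = [(n, v) for g, n, v in RULES if g(state) and state[n] != v]
        # Non-interference: no node is pulled up and down at the same time.
        assert len({n for n, _ in enabled}) == len(enabled)
        if not enabled:
            return state
        node, value = rng.choice(enabled)
        state[node] = value

def transfer(bit, seed):
    """Push one bit through the buffer; return (bit seen on R, final state)."""
    rng = random.Random(seed)
    s = dict.fromkeys(["L0", "L1", "La", "Lv", "R0", "R1", "Ra", "Rv"], 0)
    s["L1" if bit else "L0"] = 1       # left environment sends the bit
    settle(s, rng)
    received = 1 if s["R1"] else 0     # right environment reads the rails
    assert s["La"] == 1                # the input has been acknowledged
    s["Ra"] = 1                        # right environment acknowledges
    s["L0"] = s["L1"] = 0              # left environment resets its data
    settle(s, rng)
    s["Ra"] = 0                        # right environment finishes the handshake
    settle(s, rng)
    return received, s
```

Calling `transfer(1, seed)` with different seeds changes the firing order but not the result, which illustrates why any sequential execution of a stable, non-interfering rule set is acceptable.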
Asynchronous Architectures
New asynchronous solutions for pipelined microprocessors
Execution units are in parallel, allowing concurrent and out-of-order execution of instructions
CAD Tools
Complete suite of tools: synthesis, simulation, verification, optimization, layout
Designer-assisted compilation
Tools are modular and customizable
Main representations: CHP, PRS, Cast
Design Flow
[Figure: design flow from a sequential program through synthesis (DDD, SDD) to a concurrent system, then PRS, sized PRS (Sizer), collection of cells, placed cells (Placer), routed cells (Router), and physical layout, with resizing using wire information. Simulators: chpsim, cosim, prsim/esim, spice. Other labels: PL2, database, logical, physical.]
Robustness and Reliability
Robustness to Power-Supply Noise
HSPICE simulation of a typical QDI asynchronous circuit: a five-stage ring of async (PCHB) pipeline stages.
Technology: TSMC 0.18-micron CMOS
Vdd: 1.8V, Vt: 0.5V, complete layout.
Vdd oscillates between 3.5V and 0V (maximal amplitude) at various frequencies. The circuit keeps working correctly!
(It will malfunction at some very high-frequency noise in phase with the circuit frequency.)
SE-Tolerant QDI Circuits
[Figure: duplicated gate with inputs xa, ya and xb, yb; intermediate outputs z’a and z’b; two C-elements (C) producing the final outputs za and zb]
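One way to read the figure is that each C-element compares the intermediate outputs of the two copies and passes a value on only when they agree. The sketch below models that reading. The wiring is an assumption based on the signal names, and the class and function names are illustrative.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out

def double_check(c_a, c_b, z_prime_a, z_prime_b):
    """Cross-check the two copies' intermediate outputs; return (za, zb)."""
    return c_a.update(z_prime_a, z_prime_b), c_b.update(z_prime_a, z_prime_b)

ca, cb = CElement(), CElement()
history = [
    double_check(ca, cb, 0, 0),   # both copies idle
    double_check(ca, cb, 1, 0),   # upset flips z'a only: outputs hold at 0
    double_check(ca, cb, 0, 0),   # upset disappears
    double_check(ca, cb, 1, 1),   # both copies switch: outputs follow
]
```

A flip on a single intermediate node never reaches za or zb, because the C-elements wait for agreement. This matches the slogan "detection is correction".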
Soft-error Tolerant Asynchronous Microprocessor (STAM)
The STAM architecture defines a simplified 32-bit RISC instruction set with eight general registers and four types of instructions: arithmetic, branch, memory, and shift operations.
A partially wired layout of the STAM was completed in TSMC SCN 0.18µm CMOS. In SPICE simulation, it runs at about 120 MHz.
The soft-error tolerance of the STAM was tested by injecting errors randomly while the STAM ran the RC4 program (a simple stream cipher) in the digital-level simulator.
About five soft errors, at locations chosen randomly from a list of all nodes of the STAM, are injected during the execution of each instruction.
About 25% of the 203,000 nets in the STAM experience a bit flip in each test.
The figure shows the locations of errors as dots; each box in the figure represents a CHP process.
Async Molecular Nanoelectronics
Molecular nano was our motivation for XQDI: Extreme case of variability!
“Extreme” QDI (XQDI)
Can we improve QDI to eliminate (or reduce further) the remaining variability dependencies?
– Isochronic forks
– Keepers on state-holding nodes
– Slew rates and oscillating rings
Isochronic Forks
Only timing assumption in QDI design
New design style that (1) minimizes the number of isochronic forks, and (2) mitigates their effect
d(single transition) << d(multi-transition path)
One-sided inequality can always be satisfied
Cell Design without Keeper
Keepers needed for state-holding cells
A keeper requires transistor sizing and balancing of current strengths. Difficult with variability…
Example of the C-element:
[Figure: C-element with keeper and without keeper]
Ring Oscillators
An async system is a collection of rings of operators. Oscillating rings are the engine of an asynchronous circuit.
The right choices of slew rates and number of stages guarantee that each ring oscillates.
What are the limits? How many restoring stages per ring?
[Figure: ring of C-elements]
Theoretical Results & General Comments
Concurrency and the digital/analog interface
Elementary building block: guarded transition (PR: guard expr → node↑ or guard expr → node↓)
Stability and non-interference are necessary and sufficient to guarantee the absence of logical hazards
A stable and non-interfering PR set is deterministic (Church-Rosser property)
Any sequential execution is OK (powerful simulator and execution model)
Analog Implementation
There exists a QDI (stable, non-interfering) implementation for any deterministic computation (Turing-completeness)
Arbitration treated separately. Metastability of arbiters is not a problem because of asynchrony
Analog requirements on isochronic forks and ring oscillation can always be satisfied by adding restoring delays to the circuit (single-sided timing requirement).
Knowledge vs. Ignorance
Cost of implementing sequencing
In a clocked discipline: relies on knowledge of delays
Because of increasing variability and complexity, this knowledge is increasingly expensive!
In a QDI system, timing is ignored; cost to implement sequencing is high but fixed!
“If knowledge is expensive, try ignorance”
At some point in time the costs cross…
Crossing point already passed for SoC…
[Figure: cost per component versus technology (increasing variability and complexity), with curves labelled CLOCKED and QDI]
Intel Says…
From ISSCC 2005 article by Intel about Itanium L3 cache:
“ …traditional synchronous design becomes increasingly inefficient. Much of total delay is dedicated to clock skew, latch delay, margin in each cycle, and non-ideal division to cycle boundaries. …Significant margins must be added to account for slow marginal cells that are statistically probable in a 24MB cache. The delivery of low clock skew over such an area is also difficult and costly. This single-ended asynchronous design eliminates the drawbacks above…”
Conclusion
Async QDI logic can be made extremely robust to timing variations and therefore to parameter variability
Flexible interfaces of async and absence of a global signal are better suited for complex system design, as in SoC
Better match for probabilistic design
Energy efficient
No synchronization failure because of metastability
As technology advances, less costly for complex designs
Conclusion
As we enter the nanoscale era:
– System complexity (interfaces, clocking in SoC, reuse)
– Robustness issues (parameter variations, soft errors, noise)
– Costs: masks, design time
– Power and energy consumption
– “End-of-Moore’s-law” argument for parallelism
An asynchronous approach offers many advantages and is unavoidable in the long run.
Industrial Prospects
Time is ripe. Why is industry so aloof?
Absence of industrial CAD tools
No seamless transition (GALS the stopgap solution?)
Maybe not in Intel’s interest?
Perhaps we need an industrial environment untied to traditional approaches and EDA tools
Async offers an opportunity to leapfrog the current technology limitations
Managing Complexity: The Design Productivity Gap
From: The International Roadmap for Semiconductors: 1999
Managing Complexity
All circuits designed have been found fully functional on first silicon:
Year    Transistors    Description
1985          200      Distributed mutual exclusion element
1986        2 000      Stack element
1989       20 000      First microprocessor
1995      500 000      DSP filter
1998    2 000 000      MIPS microprocessor