Introduction to basic concepts on asynchronous circuit design


Design Automation for Asynchronous Circuits

Alex Kondratyev

Cadence Berkeley Labs, Berkeley, CA, USA

In collaboration with Jordi Cortadella, Luciano Lavagno, Kelvin Lwin and Christos Sotiriou

1

What do we optimize?

End of deterministic design

Technical and business implications

Asynchronous design with commercial tools

Desynchronization

Delay-insensitive datapath

2

Optimization metrics

Late 1970s:

Literals

nodes of a Boolean network

Levels of a Boolean network

Area

Speed

Nowadays:

Literals

nodes of a Boolean network

Levels of a Boolean network

Wire length

Area

Speed

Tools are optimizing for area and speed!

3

Universal metrics

Power: small ?

P = P_dyn + P_short + P_leak

P_dyn = a * f_clk * C * V^2

4

Universal metrics

Power: small ?

P = P_dyn + P_short + P_leak

P_dyn = a * f_clk * C * V^2

Delay: t = Q / I_ds = C * V_dd / (k * (V_dd - V_t)^2)

Lower supply voltage: power decreases, delay increases

Speed can be taken as a universal metric
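The supply-voltage trade-off can be checked numerically. A minimal Python sketch, assuming the formulas P_dyn = a * f_clk * C * V^2 and t = C * V_dd / (k * (V_dd - V_t)^2); the function names and every parameter value (activity, capacitance, k, V_t) are illustrative assumptions, not data from the talk:

```python
def dynamic_power(a, f_clk, c, v):
    """Dynamic power P_dyn = a * f_clk * C * V^2."""
    return a * f_clk * c * v ** 2


def gate_delay(c, v_dd, k, v_t):
    """Delay t = Q / I_ds = C * V_dd / (k * (V_dd - V_t)^2)."""
    if v_dd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return c * v_dd / (k * (v_dd - v_t) ** 2)


# Hypothetical operating points: the same circuit at two supply voltages.
for v_dd in (1.2, 1.0):
    p = dynamic_power(a=0.1, f_clk=1e9, c=1e-12, v=v_dd)
    t = gate_delay(c=1e-12, v_dd=v_dd, k=1e-3, v_t=0.3)
    print(f"Vdd={v_dd:.1f} V  P_dyn={p:.3e} W  delay={t:.3e} s")
```

Lowering Vdd from 1.2 to 1.0 in this toy setting reduces the dynamic power and lengthens the delay, which is the coupling the slide relies on.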

5

What do we optimize?

End of deterministic design

Technical and business implications

Asynchronous design with commercial tools

Desynchronization

Delay-insensitive datapath

Fine-grain pipelining

6

Timing margins

• Algorithms/tools (approximations)

• Modeling (e.g. process corners)

• Architecture (unbalanced computation)

7

Algorithms/tools

False paths (< 5%)

Common path pessimism removal

Hierarchy hurts!!!

10-35% gain from floorplan flattening

(Reshape)

Bad news: we do not know how far we are from the optimum

Good news: the optimum is impossible to find

8

Modeling

(Figure: INVX2 (fall) delay at the fast, typical and slow corners, axis 0 to 200, for two technologies)

0.25 µm, Vdd = 2.5 ± 10%, T = 0, 125 °C: Fast = 0.76 Typical, Slow = 1.47 Typical

0.13 µm, Vdd = 1.0 ± 10%, T = 40, 125 °C: Fast = 0.73 Typical, Slow = 1.55 Typical

Why panic?

New BIG players: signal integrity and process variability

9

Variability sources

-Environment (T, Vdd) + signal integrity

Within-die only

-Process variations

(gate length L, wire width W, threshold voltage Vt)

-Die-to-die (design independent)

-Within-die (design dependent)

10

Environment + SI

Supply voltage: ± 10%

Temperature: -40 °C to 125 °C

IR drop: decrease in the current from Vdd (figure labels: V_DD, V'_DD)

Bad news: 10^6 gates x 8 metal layers, 10^9 RC elements in VDD grid

Good news: field solvers can handle 10^7 variables

Abstraction, model reduction, IP reuse help further

Tools make IR drop sign-off at 5% Vdd (still a 10% delay penalty)

11

Environment + SI

Crosstalk (aggressor, victim): pulse, delay

Analysis steps: pruning by coupling, worst coupling estimation, H-Spice simulation, compute switching windows, pruning by timing

(Figure: chart over ind1, ind2, ind3, ind4, axis 0 to 12)

Conservative analysis: up to 20% delay penalty (post-layout fixes)
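Pruning by timing can be pictured as an interval test: an aggressor is kept only if its switching window overlaps the victim's. A minimal Python sketch of that idea; the function names, net names and window values are hypothetical, and real tools use more refined window models:

```python
def windows_overlap(victim, aggressor):
    """Closed intervals (earliest, latest) overlap if neither ends before the other starts."""
    return victim[0] <= aggressor[1] and aggressor[0] <= victim[1]


def prune_by_timing(victim_window, aggressors):
    """Keep only the aggressors whose switching window can coincide with the victim's."""
    return [name for name, window in aggressors.items()
            if windows_overlap(victim_window, window)]


# Hypothetical switching windows in ns.
aggressors = {"n1": (0.0, 1.5), "n2": (2.5, 4.0), "n3": (3.0, 5.0)}
print(prune_by_timing((2.0, 3.0), aggressors))
```

Only the surviving aggressors need the expensive coupling analysis, which is where the pruning pays off.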

12

Process variations

-Die-to-die: design independent, well modeled via worst-case files

-Within-die: design dependent, systematic and random!!

(Figure, Nassif’01: within-die vs. die-to-die variation of Lgate, Wwire and Tt across technology nodes 250, 180, 130, 100, 70 and 130, 90, 65)

13

Measuring variability

Microprocessor: at-speed functional testing (figure: % chips vs. frequency, split into Bin1, Bin2, Bin3)

ASIC: no delay testing, no binning

Strategically placed oscillators

Problem: up to 15% delay variation in RO (Nassif’03): vertical/horizontal (4%), poly-Si spacing (7%), distance (5%)

14

Modeling variability

Model for gate delay (linear wrt variability sources): d is a sum of env_var, device_var and wire_var terms

Independence of sources (within a group: model reduction (PCA or SVD))

For a single variability source: L_var = L_spatial + L_random

(each term is modeled by normally distributed random variables N(0, σ))
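A linear delay model of this kind can be sketched in a few lines. This is a minimal Python illustration under stated assumptions: the nominal delay term, the sensitivity values and the source names are hypothetical, and only the linear form and the split L_var = L_spatial + L_random come from the slide:

```python
import random


def gate_delay(nominal, sensitivities, variations):
    """Linear model: nominal delay plus one sensitivity-weighted term per source."""
    return nominal + sum(sensitivities[s] * variations[s] for s in sensitivities)


def sample_length_variation(spatial, sigma_random, rng):
    """L_var = L_spatial + L_random, with L_random drawn from N(0, sigma)."""
    return spatial + rng.gauss(0.0, sigma_random)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
variations = {
    "env": 1.0,
    "device": sample_length_variation(spatial=0.3, sigma_random=0.1, rng=rng),
    "wire": 4.0,
}
sensitivities = {"env": 2.0, "device": 1.0, "wire": 0.5}  # hypothetical coefficients
print(gate_delay(100.0, sensitivities, variations))
```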

15

Statistical timing analysis

?

Reconvergence needs some care

Numerical computation of a distribution

Approximate convolution (5% accuracy)

Use upper and lower bounds (10% diff. Blaauw’03)

Algorithms have linear complexity!
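To make the propagation step concrete, here is a minimal Python sketch that pushes normal arrival times through a timing graph: sums for gates in series and a moment-matched normal for the max at a merge point. The max uses Clark's classical approximation, which is a swapped-in technique and not necessarily what the cited tools use; it assumes independent inputs, which is exactly what reconvergent paths violate:

```python
import math


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def series(a, b):
    """(mean, sigma) of the sum of two independent normal delays."""
    return (a[0] + b[0], math.hypot(a[1], b[1]))


def clark_max(a, b):
    """Normal approximation of max(a, b) for independent normals (Clark's moments)."""
    (m1, s1), (m2, s2) = a, b
    theta = math.hypot(s1, s2)
    if theta == 0.0:
        return (max(m1, m2), 0.0)
    alpha = (m1 - m2) / theta
    mean = m1 * _cdf(alpha) + m2 * _cdf(-alpha) + theta * _pdf(alpha)
    second = ((m1 * m1 + s1 * s1) * _cdf(alpha)
              + (m2 * m2 + s2 * s2) * _cdf(-alpha)
              + (m1 + m2) * theta * _pdf(alpha))
    return (mean, math.sqrt(max(second - mean * mean, 0.0)))


# Two hypothetical paths merging at one gate input (values in ps).
path1 = series((100.0, 5.0), (50.0, 3.0))
path2 = series((120.0, 8.0), (20.0, 2.0))
print(clark_max(path1, path2))
```

Each node is visited once and each operation is constant time, which is consistent with the linear complexity claimed on the slide.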

16

What does it buy?

(Figure: trading yield against confidence margin, worst case marked)

WC confidence margin must be big (chips work)

But it is fully unknown

STA helps to quantify risk (reduce margin and be structure specific)

STA might help to trade off confidence margin and yield (testing???)

Open issues:

why normal?

how to derive σ?

how to derive sensitivity coefficients?

17

What do we optimize?

End of deterministic design

Technical and business implications

Asynchronous design with commercial tools

Desynchronization

Delay-insensitive datapath

Fine-grain pipelining

18

Summing this up

(Figure: cycle time split into clock overhead and real computation time. Labels: worst vs. average 45%, SI 25%, variability 30%, clock skew 10%, nonbalanced stages 20%)

Some designs run twice as fast as the spec requires!

Everything boils down to $$$

Synchronous design is becoming a costly proposition

19

Is asynchronous an option?

It is about time but …

“Must” requirements for an asynchronous CAD tool:

Competitive

- added value with minimal (or no) penalty

- scalable (capable of handling large designs)

Simple

- minimal knowledge of asynchronous design

- RTL input

Risk-free

- does not change sign-off (STA)

- complete solution in verification and testing

- backup options (synchronous implementation)

20

What do we optimize?

End of deterministic design

Technical and business implications

Asynchronous design with commercial tools

Desynchronization

Delay-insensitive datapath

Fine-grain pipelining

21

Design options

Bundled approach: single-rail logic, with a delay element between start and done

QDI approach: dual-rail logic, with a C element producing done

22

Sliding the trade-off curve

Automation efforts:

QDI datapath: NCL, phased logic

Bundled data: desynchronization

(Figure: trade-off curve from gates to blocks. Labels: EMI, skew penalty, variability, average speed, penalties?)

23

Desynchronization flow

Think synchronous

Design synchronous: one clock and edge-triggered flip-flops

De-synchronize (automatically)

Run it asynchronously

24

Synchronous circuit

(Figure: CLK, MS flip-flops, latches labeled L0 and L1)

25

De-synchronization

(Figure: latches L0 and L1, each with a controller C)

26

De-synchronization

Distributed controllers substitute the clock network

The data path remains intact!

(Figure: network of controllers C)

27

Non-overlapping handshake protocol

(Figure: latches A, B, C, D; signal transitions A+ B- C+ D-, then A- B+ C- D+)

28

Overlapping is also acceptable

(Figure: latches A, B, C, D; signal transitions A+ B+ C+ D+, then A- B- C- D-)

29

Concurrent model

(Figure: latches A, B, C; transitions A+, A-, B+, B-, C+, C-; labels: data, bubble)

• + and – must alternate

• data available at the previous latch

• next latch must be closed before receiving new data

30

For any netlist

31

Synchronization layer

32

Synchronization layer

33

Synchronization layer

This is a circuit marked graph (CMG)

34

Properties of CMGs

Any CMG is live and safe

Safeness: no data overwriting

Liveness: no deadlock

(Figure: marked graph over transitions A+, A-, B+, B-, C+, C-)
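Safeness and absence of deadlock can be checked on a small marked graph by exploring its reachable markings. The sketch below is a toy Python model, not the exact controller model of the talk: the two-latch arc set and the initial marking are assumptions chosen so that every cycle carries exactly one token.

```python
from collections import deque


def check_marked_graph(arcs, initial):
    """Explore reachable markings; arcs are (source, target) transitions, one place per arc.

    Returns (safe, deadlock_free).
    """
    transitions = sorted({t for arc in arcs for t in arc})

    def enabled(marking, t):
        return all(marking[i] > 0 for i, (_, dst) in enumerate(arcs) if dst == t)

    def fire(marking, t):
        nxt = list(marking)
        for i, (src, dst) in enumerate(arcs):
            if dst == t:
                nxt[i] -= 1  # consume a token from each input place
            if src == t:
                nxt[i] += 1  # produce a token on each output place
        return tuple(nxt)

    safe, deadlock_free = True, True
    seen, queue = {initial}, deque([initial])
    while queue:
        marking = queue.popleft()
        if max(marking) > 1:
            safe = False   # two tokens in one place: data could be overwritten
            continue       # unsafe markings are not expanded, so the search stays finite
        ready = [t for t in transitions if enabled(marking, t)]
        if not ready:
            deadlock_free = False
        for t in ready:
            nxt = fire(marking, t)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return safe, deadlock_free


# Hypothetical two-latch model: A and B each alternate + and -, and
# A+ must precede B- while B- must precede the next A+.
ARCS = [("A+", "A-"), ("A-", "A+"), ("B+", "B-"), ("B-", "B+"),
        ("A+", "B-"), ("B-", "A+")]
MARKING = (0, 1, 1, 0, 0, 1)
print(check_marked_graph(ARCS, MARKING))
```

Removing all tokens from the initial marking makes the same graph deadlock immediately, which the checker reports as `(True, False)`.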

35

Behavioral equivalence

36

37

38

Synchronous flow

39

40

41

42

43

44

45

De-synchronized flow

46

47

+

48

49

50

51

52

53

54

Flow equivalence

[Guernic, Talpin, Lann, 2003]

(Figure: latches A and B)

55

Flow equivalence

CLK

A 1 3 0 2 1 5 3 1 6 0

B 5 1 2 3 1 4 2 4 3 1

Synchronous behavior

A 1 3 0 2 1 5 3 1 6 0

B 5 1 2 3 1 4 2 4 3 1

De-synchronized behavior

56

Flow equivalence

CLK

A 1 3 0 2 1 5 3 1 6 0

B 5 1 2 3 1 4 2 4 3 1

Synchronous behavior

A 1 3 0 2 1 5 3 1 6 0

B 5 1 2 3 1 4 2 4 3 1

De-synchronized behavior

Theorem:

The de-synchronization model preserves flow-equivalence
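Flow equivalence compares, latch by latch, the sequence of values stored, ignoring when they were stored. A minimal Python sketch using the A and B sequences from the slide; the trace format (time, latch, value) and the timestamps are assumptions made for illustration:

```python
def latch_flows(trace):
    """Project a trace of (time, latch, value) events onto per-latch value sequences."""
    flows = {}
    for _, latch, value in sorted(trace, key=lambda event: event[0]):
        flows.setdefault(latch, []).append(value)
    return flows


def flow_equivalent(trace_a, trace_b):
    """Two behaviors are flow-equivalent if every latch sees the same value sequence."""
    return latch_flows(trace_a) == latch_flows(trace_b)


A = [1, 3, 0, 2, 1, 5, 3, 1, 6, 0]
B = [5, 1, 2, 3, 1, 4, 2, 4, 3, 1]

# Synchronous: both latches update on every clock edge (period 10, hypothetical).
sync = [(10 * i, "A", a) for i, a in enumerate(A)] + \
       [(10 * i, "B", b) for i, b in enumerate(B)]
# De-synchronized: each latch runs at its own pace (hypothetical timestamps).
desync = [(7 * i + 1, "A", a) for i, a in enumerate(A)] + \
         [(12 * i + 4, "B", b) for i, b in enumerate(B)]

print(flow_equivalent(sync, desync))
```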

57

Timing equivalence

(Figure: latches La, Lb, Lc, Ld with controllers A, B, C, D and matched delays del_a, del_b, del_c, del_d; transitions A+, A-, B+, B-, C+, C-, D+, D-)

del_b = del_a = del_c = del_d

Synchronous-like behavior

58

Timing equivalence

(Figure: latches La, Lb, Lc, Ld with controllers A, B, C, D and matched delays del_a, del_b, del_c, del_d; transitions A+, A-, B+, B-, C+, C-, D+, D-)

del_b > del_a = del_c = del_d

B keeps the same period and settles the rest

59

Compatibility

Synchronous: T_sync >= T_comb + T_setup + T_skew + T_CQ

Desynchronized: T_desync >= T_comb + T_controller + T_CQ

Statement: a desynchronized design is behavior- and timing-compatible with its synchronous counterpart
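The two cycle-time bounds are easy to compare side by side. A minimal Python sketch, reading the slide as T_sync >= T_comb + T_setup + T_skew + T_CQ and T_desync >= T_comb + T_controller + T_CQ; all delay values (in ns) are hypothetical:

```python
def sync_cycle(t_comb, t_setup, t_skew, t_cq):
    """Lower bound on the synchronous clock period."""
    return t_comb + t_setup + t_skew + t_cq


def desync_cycle(t_comb, t_controller, t_cq):
    """Lower bound on the de-synchronized cycle time."""
    return t_comb + t_controller + t_cq


t_sync = sync_cycle(t_comb=3.0, t_setup=0.5, t_skew=0.5, t_cq=0.25)
t_desync = desync_cycle(t_comb=3.0, t_controller=0.75, t_cq=0.25)
print(t_sync, t_desync)
```

In this reading the de-synchronized design breaks even when the controller delay matches the setup time plus skew it replaces.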

60

Synchronous environment

(Figure: latches A, B, C next to a clocked (Clk) environment; transitions Clk+, Clk-, A+, A-, B+, B-, C+, C-, with a timing arc)

61

Implementation of a controller

• Only local handshakes with adjacent controllers are necessary

• Synthesis by using intuition, common sense, … and petrify

62

Implementation of a controller

63

Delay matching

Combinational logic d

64

Post-layout delay matching

Combinational logic

65

Post-layout delay matching

Combinational logic

66

Desynchronization. Gaining Trust

Synchronous RTL

=

67

Async DLX block diagram

68

Desynchronization. Gaining Trust

Synchronous RTL

Synchronous = Desynchronized

Synchronous: cycle 4.4 ns, power 70.9 mW, area 372,656 µm²

Desynchronized: cycle 4.45 ns, power 71.2 mW, area 378,058 µm²

69

DLX lessons. Positive

• Asynchronous design with no area, power, delay penalties

• 30% less EMI

• Partial tolerance of variability (matched delays scale with the rest of the gates)

• Binning!!!

(Figure: Error is raised when Treq > Tclk; labels: req, B, C, Clk)

70

DLX lessons. Negative

• Asynchronous design with no area, power, delay advantage

• Clock power is saved but latched designs have higher loads

• P&R constraints of de-sync design are non-trivial

• Matched delay variability might hurt

Hard work to come out even with synchronous

71

Can we do better?

• Clustering

• Timing optimization

• Retiming of M-latches

(Figure: M and S latches on paths through A, C and D, with early and late signals)

72

Problems of delay matching

Max(STA_delay) of the logic vs. Min(STA_delay) of the matched delay

Gate and wire profiles are different (must be compensated by margins)

Matched delay margins vs inter-die variation matching??

Calls for the use of different architectures
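The margin problem shows up as soon as one sizes a delay line. A minimal Python sketch, assuming the matched delay is a chain of identical cells whose best-case (Min STA) delay must cover the worst-case (Max STA) logic delay plus a margin; the function name, the chain structure and the numbers are hypothetical:

```python
import math


def delay_cells_needed(max_logic_delay, min_cell_delay, margin=0.1):
    """Smallest chain length whose best-case delay covers the worst-case logic delay plus margin."""
    if min_cell_delay <= 0:
        raise ValueError("cell delay must be positive")
    target = max_logic_delay * (1.0 + margin)
    return math.ceil(target / min_cell_delay)


# Hypothetical numbers in ns: 2.0 worst-case logic, 0.25 best-case per delay cell.
print(delay_cells_needed(2.0, 0.25, margin=0.25))
```

Because the chain is sized against its own best case, at the slow corner it is longer than needed, and that slack is the cost of the margin.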

73

Sliding the trade-off curve

Automation efforts:

QDI datapath: NCL, phased logic

Bundled data: desynchronization

(Figure: trade-off curve from gates to blocks. Labels: EMI, skew penalty, variability, average speed)

74

Phased Logic

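Linden’94

Each signal is a 2-bit code: LSB is ‘value’ bit (v), MSB is ‘timing’ bit (t)

t v = 0 0: even phase, value ‘0’ (even0)

t v = 1 0: odd phase, value ‘0’ (odd0)

t v = 1 1: even phase, value ‘1’ (even1)

t v = 0 1: odd phase, value ‘1’ (odd1)

(Figure: transitions among even0, odd0, even1, odd1)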

A signal changes phase or value (only one bit changes)
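The encoding can be exercised with a small helper. A minimal Python sketch, assuming codes are written (t, v) with 00 and 11 in the even phase and 10 and 01 in the odd phase; the function names are hypothetical:

```python
def decode(t, v):
    """Return (phase, value) for a phased-logic code with timing bit t and value bit v."""
    phase = "even" if t == v else "odd"
    return phase, v


def next_code(t, v, new_value):
    """Move to the opposite phase carrying new_value, flipping exactly one bit."""
    if new_value == v:
        return 1 - t, v       # same value: only the timing bit flips
    return t, new_value       # new value: only the value bit flips


code = (0, 0)                  # even0
for value in (1, 1, 0):
    code = next_code(*code, value)
    print(code, decode(*code))
```

Each step toggles the phase, so a receiver can tell a fresh token from a stale one without a separate request wire.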

75

Phased logic gate

A PL gate has an internal state: Even or Odd.

A PL gate fires when all inputs match the gate phase.

(Figure, three snapshots of a gate:

gate phase E, inputs E and O: gate is not ready to fire

gate phase E, inputs E and E: gate ready to fire

after firing: gate phase O)
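The firing rule can be written down directly. A minimal Python sketch of a PL gate; the class name is hypothetical, and the choice that the output token carries the phase the gate had when it fired is an assumption read from the figure, not a confirmed detail:

```python
class PLGate:
    """Toy phased-logic gate: fires when every input phase matches the gate phase."""

    def __init__(self, func, phase="even"):
        self.func = func
        self.phase = phase

    def ready(self, inputs):
        """inputs is a list of (phase, value) pairs."""
        return all(phase == self.phase for phase, _ in inputs)

    def fire(self, inputs):
        if not self.ready(inputs):
            raise RuntimeError("gate is not ready to fire")
        value = self.func(*(v for _, v in inputs))
        output = (self.phase, value)  # assumption: output takes the firing phase
        self.phase = "odd" if self.phase == "even" else "even"  # gate phase toggles
        return output


gate = PLGate(lambda a, b: a & b)
print(gate.ready([("even", 1), ("odd", 1)]))
print(gate.fire([("even", 1), ("even", 1)]), gate.phase)
```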

76

LUT-4 based implementation

(Figure: a LUT4 computes new_v from a_v, b_v, c_v, d_v; D latches with enable (EN) and reset (R, r-bit) hold the value and timing bits; input completion detection combines a_v, a_t, b_v, b_t, c_v, c_t, d_v, d_t and gate_phase through a C element; gates G1, G2, G3; signals fi, fo, fo_b, new_t, t, t_b, v, t_rbit, reset)

• Functionality: v(a_v, b_v, c_v, d_v)

out_phase = gate_phase

Area penalty!

77

NCL Design Flow

VHDL (synchronous) -> synthesis (GTECH library) -> synchronous netlist -> 2-rail expansion + optimization (NCL library) -> NCL netlist (asynchronous)

1. Pattern matching (Ligthart’00)

2. Completion separation (NCLX)

78

Introduction to NCL

2-phase functioning (evaluate (DATA), precharge (NULL)) + self-timed register interaction (acknowledgement of phases)

(Figure: Reg. -> combinational logic -> Reg., with completion detection (CD); labels: NULL Ack+, DATA Ack+)

Micropipeline with delay-insensitive (DI) datapath

79

From 2 to 3-rail Scheme

(Figure: 2-rail gate F with inputs x.0, x.1, y.0, y.1 and outputs z.1, z.0; waveforms of x.1, y.1, x.0, y.0, z.1, z.0)

z.1, z.0 are 2-rails but they do not acknowledge inputs

Not DI scheme!!!

81

From 2 to 3-rail Scheme

(Figure: functional part, a 2-rail gate F with inputs x.0, x.1, y.0, y.1 and outputs z.1, z.0; completion part, a C element combining x.go and y.go into z.go)

Rationale behind delay-insensitivity of 3-rail scheme:

1. 2-rail circuit is hazard-free under monotonic input changes

2. All input changes are observable at outputs
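Why the plain 2-rail gate fails to acknowledge its inputs, and how a completion rail repairs that, can be seen on a dual-rail AND. A minimal Python sketch; the rail order (rail.1, rail.0), the helper names and the C-element model are assumptions for illustration:

```python
NULL = (0, 0)  # (rail.1, rail.0): no data present


def encode(bit):
    """Dual-rail DATA code for a Boolean value."""
    return (1, 0) if bit else (0, 1)


def dual_rail_and(x, y):
    """Unate 2-rail AND: z.1 = x.1 AND y.1, z.0 = x.0 OR y.0."""
    return (x[0] & y[0], x[1] | y[1])


class CElement:
    """Muller C element: output rises when all inputs are 1, falls when all are 0, else holds."""

    def __init__(self, state=0):
        self.state = state

    def update(self, *inputs):
        if all(inputs):
            self.state = 1
        elif not any(inputs):
            self.state = 0
        return self.state


# x = 0 alone already forces z = 0, although y is still NULL: y is not acknowledged.
print(dual_rail_and(encode(0), NULL))

# The completion rail z.go waits for both x.go and y.go.
z_go = CElement()
print(z_go.update(1, 0), z_go.update(1, 1))
```

The functional rails may settle early, but z.go only rises once every input has arrived, which makes all input changes observable at the outputs.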

82

NCLX flow (MUX)

Steps: tech. map, 2-rail expansion (unate), completion

(Figure: MUX with inputs a, b, select s and output z. Functional part: rails a.1, a.0, b.1, b.0, s.1, s.0 produce z.1, z.0, a 2-rail gate (incomplete). Completion part: a C element combines a.go, s.go, b.go into z.go, giving a 2-rail gate (complete))

83

NCL lessons. Positive

• Very low EMI

• High security of computation

• Automatic stand-by mode

• Tolerance to variability

84

NCL lessons. Negative

• Big area overhead: 2.7-3.0x

• No performance advantage (average case performance is swallowed by the penalty from NULL)

• Completion introduces further penalties (power and delay)

85

Back in Business

Performance improvement:

• Fast reset

- partition a circuit into chunks 4-6 logic levels deep

- apply reset to each chunk simultaneously

• Use faster negative gates

- negative gates are about 20% faster than unate gates

Area improvement:

• Make completion by outputs only

- single NOR gate suffices

86

Penalties and Savings

(Figure: bar charts of delay and area, axis 0 to 250, after logic synthesis and after place & route, for Sync, 2-R (v=0%), 2-R (v=60%), 2-R (v=100%))

20-35% performance improvement at the expense of a 100% area penalty

87

Use Case

(Figure: FF -> combinational logic -> FF; signals reset, comp, clk, error)

The error signal provides a means to:

- Calibrate chips during manufacture testing

- Perform on-line delay testing

88

Use Case

• Find timing critical portions of design

• Re-implement them asynchronously

Up to 30% performance improvement at the cost of a 2x area penalty for critical portions (appealing if the critical portion is small)

89

Best of both worlds. PLA designs

Dynamic PLAs are naturally two-phase

Delay matching of PLA is easy

Bundled routing to cope with wire delay variations

(Figure: PLAs A, B, C, D, E chained by go/done signals: go, data, go_A ... go_E, done_A ... done_E, done, with C elements merging done signals)

Critical path is always go-done path

90

PLA vs SC

• PLA-based design vs standard cell based design (SC typical vs PLA typical)

Standard cell flow: RTL design, RC, Encounter

PLA flow: RTL design, Cluster, SA-Placer, Bundle router

(Figure labels: P_SC, P_PLA, Delay)

Circuit   SC_A    SC_D   PLA_A   PLA_D
alu2      11105   1274   7672    1160
alu4      18884   1622   18683   1521
C6288     80304   4401   75218   4235
apex7     4339    784    4729    818
apex6     14270   907    15291   838
C1355     14237   1351   12162   1354
C3540     29239   2074   29478   2083
k2        47332   1361   30789   1602
x3        11353   862    14588   845
C5315     31092   2005   42031   2076
average   1       1      0.9923  0.9975
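The average row can be cross-checked from the per-circuit numbers. A minimal Python sketch, assuming _A and _D stand for area and delay, that the average is the mean of per-circuit PLA/SC ratios, and that the spilled apex7 and apex6 values belong to the PLA columns as laid out here:

```python
# (SC_A, SC_D, PLA_A, PLA_D) per circuit, as read from the slide.
ROWS = {
    "alu2":  (11105, 1274, 7672, 1160),
    "alu4":  (18884, 1622, 18683, 1521),
    "C6288": (80304, 4401, 75218, 4235),
    "apex7": (4339, 784, 4729, 818),
    "apex6": (14270, 907, 15291, 838),
    "C1355": (14237, 1351, 12162, 1354),
    "C3540": (29239, 2074, 29478, 2083),
    "k2":    (47332, 1361, 30789, 1602),
    "x3":    (11353, 862, 14588, 845),
    "C5315": (31092, 2005, 42031, 2076),
}


def mean_ratio(rows, num, den):
    """Mean over circuits of column num divided by column den."""
    return sum(r[num] / r[den] for r in rows.values()) / len(rows)


area_ratio = mean_ratio(ROWS, 2, 0)    # PLA_A / SC_A
delay_ratio = mean_ratio(ROWS, 3, 1)   # PLA_D / SC_D
print(f"area {area_ratio:.4f}  delay {delay_ratio:.4f}")
```

Both ratios land close to the values in the average row, which supports this reading of the flattened table.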
