NoC meeting Southampton May 30

Asynchronous Links, for NanoNets?
Alex Yakovlev
University of Newcastle, UK
17/09/2007
NanoNet’07, Catania
Motivation-1


At very deep submicron (VDSM) feature sizes, gate delay is much smaller than
interconnect delay: total interconnect length can reach several
meters, and interconnect delay can be as much as 90% of total path
delay in VDSM circuits.
Timing is therefore a problem, particularly for global wires.
[Figure: relative delay (log scale, 0.1 to 100) versus feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters. Source: ITRS, 2003]
Multiple clock domains are a reality; the problem is the interface between them.
ITRS’05 predicted a 4x (8x) increase in global asynchronous signalling by 2012 (2020).
Motivation-2

Variability and uncertainty
– Geometry and process: for long channels, intra-die variations are less correlated between different parts of the interconnect, both for interconnects and repeaters
• e.g. M4 and M5 resistance/um differ massively, leading to mistracking (C. Visweswariah, SLIP’06)
• e.g. at 250nm, clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC’00)
– Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay; R. Ho, M. Horowitz)
A Network on Chip
Async Links
Multiple Clocks
Synchronization required
Arbitration required
Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)
Built for an onboard airborne computer-control system which tolerated up to two faults. The self-timed ring was a GALS system with self-checking and self-repair at the hardware level.
[Diagram: individually clocked subsystems connected by self-timed adapters forming a ring]
Communication Channel Adapter
Much higher reliability than a bus and other forms of redundancy
MCC was developed in TTL-Schottky gate arrays, approx. 2K gates.
Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)
AR, AS – acknowledgements
RR, RS – spare (for self-repair) lines
Outline









Token-based view of communication
Basics of asynchronous signalling
Self-timed data encoding
Pipelining
How to hide acknowledgements
Serial vs Parallel links
Arbiters and routers
Async2sync interface
CAD issues
Data exchange: token-based view
[Diagram: data source, tx, rx, dest]
Question 1: when can Rx look at the incoming data?
Data validity issue – forming a well-defined token

Question 2: when can Tx send new data?
Acknowledgement issue – separation between tokens
These are fundamental issues of flow control at the physical and
link levels
The answers are determined by many design aspects: technology
level, system architecture (application, pipelining), latency,
throughput, power, design process etc.
Tokens and spaces with global clocking
[Diagram: data source, tx, rx, dest, clk]

In globally clocked systems both Q1 and Q2
are resolved with the aid of clock pulses
Tokens and spaces
[Diagram: data source, tx, rx, dest; signals D_valid and data bundle; clocks Clk_tx and Clk_rx]


Without global clocking: Q1 can be resolved
differently from Q2
E.g.: Q1 – source-synchronous (mesochronous),
bundled data or self-synchronising codes; Q2 – ack
or stop signal, or by local timing
Tokens and spaces
[Diagram: data source, tx, rx, dest; signals D_valid and data bundle, with ack signals in place of the clocks]
Petri net model
[Net 1: source, Tx, Data Valid, Rx, dest, with Tx delay and Rx delay]
One-way delay, but may be unsafe!
[Net 2: source, Tx, Data Valid, Rx, dest, with an ack place; "Tx delay or ack" and "Rx delay or ack"]
Always safe, but with a round-trip delay!
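The difference between the two nets can be illustrated with a small token-counting simulation. This is a minimal sketch, not a Petri net tool: the function name `max_marking`, the integer time steps and the zero-delay ack are all assumptions made for illustration. It reports the largest number of tokens ever held in the Data Valid place; a value above 1 means the net is not safe.

```python
def max_marking(tx_delay, rx_delay, n_items, use_ack):
    """Largest token count ever seen in the 'Data Valid' place.

    tx_delay / rx_delay: local delays (in time steps) before Tx or Rx
    may fire again. use_ack=True adds the ack arc: Tx fires only when
    the place is empty (hypothetical zero-delay acknowledgement).
    """
    channel = 0              # tokens currently in the Data Valid place
    worst = 0
    sent = received = 0
    tx_ready = rx_ready = 0
    t = 0
    while received < n_items:
        tx_enabled = sent < n_items and t >= tx_ready
        if use_ack:
            tx_enabled = tx_enabled and channel == 0
        if tx_enabled:       # Tx fires: one more token on the link
            channel += 1
            sent += 1
            tx_ready = t + tx_delay
            worst = max(worst, channel)
        if channel > 0 and t >= rx_ready:   # Rx fires: consume a token
            channel -= 1
            received += 1
            rx_ready = t + rx_delay
        t += 1
    return worst


print(max_marking(1, 3, 5, use_ack=False))  # fast Tx, slow Rx, no ack
print(max_marking(1, 3, 5, use_ack=True))   # same delays, with ack
```

With local timing only, safety depends on the delay assumption holding (Tx no faster than Rx); with the ack arc, the marking stays at one token regardless of the delays.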
Asynchronous handshake signalling
Valid data tokens and safe spaces between them can be created by different means of signalling and encoding
– Level-based -> Return-To-Zero (RTZ) or 4-phase protocol
– Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol
– Pulse-based, e.g. GasP
– Phase-difference-based
– Data encoding: bundled data (BD), delay-insensitive (DI)
Handshake Signalling Protocols

Level Signalling (RTZ or 4-phase)
[Waveform: req and ack levels; one cycle is req+, ack+, req-, ack-]

Transition Signalling (NRZ or 2-phase)
[Waveform: req and ack transitions; one cycle is one transition on req followed by one transition on ack]
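The two protocols can be compared by listing the wire events of each cycle. This is an illustrative sketch; the function names and the event labels (`req+`, `ack-`, and so on) are assumptions, not notation from the slides.

```python
def four_phase(n_items):
    """RTZ (level) signalling: both wires return to zero every cycle."""
    trace = []
    for _ in range(n_items):
        trace += ["req+", "ack+", "req-", "ack-"]
    return trace


def two_phase(n_items):
    """NRZ (transition) signalling: one event per wire per cycle."""
    req = ack = 0
    trace = []
    for _ in range(n_items):
        req ^= 1                                   # any req edge is a request
        trace.append("req+" if req else "req-")
        ack ^= 1                                   # any ack edge acknowledges
        trace.append("ack+" if ack else "ack-")
    return trace


print(len(four_phase(3)), len(two_phase(3)))  # 12 6
```

Transferring the same number of items takes half as many transitions with 2-phase signalling, at the cost of logic that must respond to both rising and falling edges.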
Handshake Signalling Protocols

Pulse Signalling
[Waveform: req and ack pulses, one cycle marked]

Single-track Signalling (GasP)
[Waveform: req and ack share a single wire (req + ack), one cycle marked]
GasP signalling
[Circuit diagram with annotations: pull up from pred (req); pull down here (ack); pull up from here (req); pull down from succ (ack); pulse length control loops]
Source: R. Ho et al., ASYNC’04
Data encoding

Bundled data
– Code is positional binary; the token is determined by the Req+ signal, which arrives with a safe set-up delay after the data

Delay-insensitive codes (tokens are determined by the codeword values; a spacer, or NULL, state is required if RTZ)
– 1-of-2 (dual-rail per bit): systematic code, encoding and decoding are straightforward
– m-of-n (n>2): not systematic, i.e. incurs encoding and decoding costs; optimal when m=n/2
– One-hot, 1-of-n (n>2): completion detection is easy, but not practical for n>4
– Systematic codes, such as Berger: incur complex completion detection
Bundled Data
[Waveforms: Data, req and ack for the RTZ protocol and for the NRZ protocol, with one cycle marked on each]
DI encoded data (Dual-Rail)
[Waveforms, RTZ: Data.0, Data.1 and ack; NULL (spacer), Logical 0, NULL, Logical 1, with one cycle marked per data item]
[Waveforms, NRZ: Data.0, Data.1 and ack; Logical 0, Logical 1, Logical 1, Logical 1, with one cycle marked per transition]
This coding leads to complex logic implementation; it is hard to track odd and even phases and logic values. Hence see LEDR below.
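The RTZ dual-rail scheme can be sketched in a few lines. The helper names below are hypothetical; the mapping (Data.0 high for logical 0, Data.1 high for logical 1, both low for the NULL spacer) is the usual dual-rail convention and is assumed here.

```python
NULL = (0, 0)  # spacer: both rails low


def encode_rtz(bits):
    """Return the sequence of (Data.0, Data.1) rail values for the bits."""
    states = [NULL]
    for b in bits:
        states.append((1, 0) if b == 0 else (0, 1))  # valid codeword
        states.append(NULL)                           # return to zero
    return states


def decode_rtz(states):
    """Recover the bits; a state is a valid token when either rail is high."""
    bits = []
    for d0, d1 in states:
        if d0 and d1:
            raise ValueError("illegal dual-rail state: both rails high")
        if d0 or d1:          # completion detection is the OR of the rails
            bits.append(d1)
    return bits


print(encode_rtz([0, 1]))  # [(0, 0), (1, 0), (0, 0), (0, 1), (0, 0)]
```

Because every codeword is separated by a spacer, the receiver needs no timing assumption to tell two consecutive equal bits apart.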
DI codes (1-of-n and m-of-n)

1-of-4:
– 0001=>00, 0010=>01, 0100=>10, 1000=>11

2-of-4:
– 1100, 1010, 1001, 0110, 0101, 0011: 6 combinations in total (cf. 2-bit dual-rail: 4 combinations)

3-of-6:
– 111000, 110100, …, 000111: 20 combinations in total (can encode 4 bits + 4 control tokens)

2-of-7:
– 1100000, 1010000, …, 0000011: 21 combinations in total (4 bits + 5 control tokens)
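The codeword counts above follow from the binomial coefficient C(n, m). The short sketch below enumerates the codewords and splits each count into data bits and spare (control) codewords; the function name `m_of_n_codewords` is an illustrative assumption.

```python
from itertools import combinations
from math import comb


def m_of_n_codewords(m, n):
    """All n-wire codewords with exactly m wires high, as bit strings."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words


for m, n in [(1, 4), (2, 4), (3, 6), (2, 7)]:
    total = comb(n, m)
    bits = total.bit_length() - 1          # floor(log2(total))
    spare = total - 2 ** bits              # codewords left for control tokens
    print(f"{m}-of-{n}: {total} codewords, {bits} data bits, {spare} spare")
```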
DI codes completion detection and decoding
1-of-4 completion detection is a 4-input OR gate (CD=d0+d1+d2+d3)
Decoding 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0=d0+d2; q0.1=d1+d3; q1.0=d0+d1; q1.1=d2+d3)
For m-of-n codes, CD and decoding are non-trivial
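The OR-gate equations for the 1-of-4 code can be checked directly. In this sketch the wires are passed as a tuple `(d0, d1, d2, d3)`, with `d0` high meaning the value 00 and `d3` high meaning 11; that ordering and the function names are assumptions for illustration.

```python
def completion_detect(d):
    """CD = d0 + d1 + d2 + d3 (a 4-input OR)."""
    d0, d1, d2, d3 = d
    return d0 | d1 | d2 | d3


def to_dual_rail(d):
    """Decode 1-of-4 to two dual-rail bits using four 2-input ORs."""
    d0, d1, d2, d3 = d
    q0 = (d0 | d2, d1 | d3)   # (q0.0, q0.1): low-order bit
    q1 = (d0 | d1, d2 | d3)   # (q1.0, q1.1): high-order bit
    return q0, q1


# d2 high encodes the value 10: low bit is 0, high bit is 1.
print(to_dual_rail((0, 0, 1, 0)))  # ((1, 0), (0, 1))
```

The all-zero spacer maps to the dual-rail NULL state on both bits, so the spacer passes through the decoder unchanged.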
From J. Bainbridge et al., ASYNC’03
Incomplete DI codes
Incomplete 2-of-7: composed of 1-of-3 and 1-of-4
From J. Bainbridge et al., ASYNC’03
Phase difference based encoding (C. D’Alessandro et al. ASYNC’06,’07)
The proposed system consists of encoding a bit of data in the phase relationship between two signals generated using a reference
This would ensure that any transient fault appearing on one of the reference signals will be ignored if it is not mirrored by a corresponding transition on the other line
Similarity with multi-wire communication
[Waveforms: ref, t_1, t_0 and data; the two orderings "t_1 before t_0" and "t_0 before t_1" carry the data values, separated by spacers sp0 and sp1]
Phase encoding: multiple rail
No group of wires has the same delay
All wires toggle when an item of data is sent
Increased number of states available (n wires = n! states), hence more bits/symbol
The table compares examples of phase encoding with their respective m-of-n counterparts

Type of Link   | Number of states | Bits per symbol | Extra states | Transitions per symbol | Symbols per packet | Transitions per packet
Phase enc. (4) | 24               | 4               | 8            | 4                      | 32                 | 128
1-of-4         | 4                | 2               | 0            | 2                      | 64                 | 128
Phase enc. (6) | 720              | 9               | 208          | 6                      | 15                 | 90
3-of-6         | 20               | 4               | 4            | 6                      | 32                 | 192
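The table entries can be reproduced from n! and C(n, m). The sketch below is a reading of the table, not taken from the slides: it assumes a 128-bit packet (inferred from the symbols-per-packet column), n transitions per phase-encoded symbol (all wires toggle), and 2m transitions per RTZ m-of-n symbol (m wires rise, then return to zero). The function names are hypothetical.

```python
from math import ceil, comb, factorial

PACKET_BITS = 128  # assumption inferred from the table, not stated on the slide


def row(states, transitions_per_symbol):
    """(states, bits/symbol, extra states, transitions/symbol,
    symbols/packet, transitions/packet)"""
    bits = states.bit_length() - 1            # floor(log2(states))
    symbols = ceil(PACKET_BITS / bits)
    return (states, bits, states - 2 ** bits, transitions_per_symbol,
            symbols, symbols * transitions_per_symbol)


def phase_encoding(n):
    return row(factorial(n), n)       # n wires give n! orderings


def m_of_n_rtz(m, n):
    return row(comb(n, m), 2 * m)     # m wires go up and come back down


print(phase_encoding(4))  # (24, 4, 8, 4, 32, 128)
print(m_of_n_rtz(1, 4))   # (4, 2, 0, 2, 64, 128)
print(phase_encoding(6))  # (720, 9, 208, 6, 15, 90)
print(m_of_n_rtz(3, 6))   # (20, 4, 4, 6, 32, 192)
```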
Phase encoding Repeater
[Diagram: repeater between sender and receiver; inputs i1, i2, i3; phase detectors (mutexes) producing the pairwise order signals 1<2, 2<1, 1<3, 3<1, 2<3, 3<2; a go signal; outputs o1, o2, o3]
Pipelines
Dual-rail pipeline
From J.Bainbridge & S. Furber IEEE Micro, 2002
The problem of Acking
Question 2, “when can Tx send new data?”, has two aspects:
– Safety (avoid overflowing the channel, e.g. when Tx and Rx have much variation in delay)
– Performance (to maximize throughput and reduce latency)
Can we hide the ack (round-trip) delay?
To maintain throughput, more pipeline stages are required, but that costs too much latency and power
First minimize latency along a long wire (not specific to asynchronous) and then maximize throughput (using the “wagging tail buffer” approach)
From R. Ho et al., ASYNC’04
Use of wagging buffer approach
Alternate between
top and bottom
control
From R.Ho et al. ASYNC’04
“Wagging tail buffer” approach
[Diagram: data channel with top control signals reqtop/acktop and bottom control signals reqbot/ackbot]
Top and bot control channels work at ½ the frequency of the data channel
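The alternation between the top and bottom control channels can be pictured as a round-robin split and merge. This is only a behavioural sketch with hypothetical function names; it abstracts away all circuit detail of the approach by R. Ho et al. and shows just why each control channel sees half of the items.

```python
def wag_split(items):
    """Hand successive items alternately to the top and bottom control."""
    top = items[0::2]
    bot = items[1::2]
    return top, bot


def wag_merge(top, bot):
    """Re-interleave the two half-rate streams in the original order."""
    merged = []
    for i in range(len(top) + len(bot)):
        src = top if i % 2 == 0 else bot
        merged.append(src[i // 2])
    return merged


top, bot = wag_split(list(range(7)))
print(top, bot)              # [0, 2, 4, 6] [1, 3, 5]
print(wag_merge(top, bot))   # [0, 1, 2, 3, 4, 5, 6]
```

Each control channel completes one handshake for every two data items, so its round-trip time can be up to twice the data period without limiting throughput.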
Serial Link vs Parallel Link (from R. Dobkin)
Why Serial Link?
– Less interconnect area
– Less routing congestion
– Less coupling
– Less power (depends on range)
The relative improvement grows with technology scaling. The example in the figure refers to:
– Single gate delay serial link
– Fully-shielded parallel link with 8 gate delay clock cycle
– Equal bit-rate
– Word width N=8
[Figure: link length [mm] versus technology node [nm], showing the regions where the serial link or the parallel link dissipates less power and requires less area]
Serialization model
Tx
Rx
…
…
Acking at the bit level
Serialization model
Tx
Rx
Acking at the word level
Serialization model
Tx
Rx
Acking at the word level (with more concurrency)
Serial Link – Top Structure (R. Dobkin, ASYNC’07)
Transition signalling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a. data-strobe (DS)
Acknowledge per word instead of per bit
Synchronizers used at the level of the ack signals
Wave-pipelining over the channel
Differential encoding (DS-DE, IEEE1355-95)
Reported throughput: 67 Gbps for a 65nm process (viz. one bit per 15ps, the expected FO4 inverter delay), based on simulations
Encoding – Two Phase NRZ LEDR

Two Phase Non-Return-to-Zero Level Encoded Dual Rail
– “delta” encoding (one transition per bit)
[Waveforms: uncoded bit stream (B), phase bit (P) and state bit (S)]
$$P(i) = \begin{cases} \overline{B(i)}, & i \text{ odd} \\ B(i), & i \text{ even} \end{cases} \qquad S(i) = B(i) \;\; \forall i$$
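The "one transition per bit" property of LEDR can be checked with a short sketch. It assumes the common LEDR rule that the state rail carries the bit, S(i) = B(i), and that the phase rail equals B(i) on even i and its complement on odd i; the 1-based bit index, the initial rail values (0, 0) and the function names are further assumptions made for illustration.

```python
def ledr_encode(bits):
    """Return (P, S) rail values for bits B(1), B(2), ..."""
    rails = []
    for i, b in enumerate(bits, start=1):
        s = b                # state rail carries the data value
        p = b ^ (i % 2)      # phase rail: inverted on odd i, B(i) on even i
        rails.append((p, s))
    return rails


def ledr_decode(rails):
    """The data is read directly from the state rail."""
    return [s for _, s in rails]


def toggles_per_bit(rails, initial=(0, 0)):
    """Number of rails that change value for each transmitted bit."""
    counts = []
    prev = initial
    for cur in rails:
        counts.append((prev[0] != cur[0]) + (prev[1] != cur[1]))
        prev = cur
    return counts


data = [0, 0, 1, 1, 0, 0, 0, 0]
print(toggles_per_bit(ledr_encode(data)))  # [1, 1, 1, 1, 1, 1, 1, 1]
```

Exactly one rail toggles per bit: the state rail when the data value changes, the phase rail when it repeats, which is what lets the receiver detect each new bit without a clock.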
Transmitter – Fast SR Approach
(from R. Dobkin)
Receiver Splitter (from R. Dobkin)
Self Timed Networks

Router requires priority arbitration
– Arbitration necessary at every router merge
– Potential delay at every node on the path
BUT
– Asynchronous merge/arbitration time is average not worst
case


Adapters to locally clocked cells require
synchronization
Synchronization necessary when clocks are unknown
– Occurs when receiving data (data valid), and when sending
(acknowledge)
BUT
– Time can be long (2 cycles?)
– Must assume worst case time (maybe)
Router priority
Flow Control
Link
Merge


Split
Virtual channels implement scheduling algorithm
Contention for link resolved by priority circuits
Asynchronous Arbiters


Multiway arbiters (e.g. for Xbar switches):
– Cascaded mesh (latency ~ N)
– Cascaded tree (latency ~ log N)
– Token ring (busy ring and lazy ring) (latency from 1 to N)
Priority arbiters (e.g. for routers with different QoS):
– Static priority (topological order)
– Dynamic priority (request arrives with priority code)
– Ordered (time-priority): multiway arbiter, followed by a FIFO buffer
Static Priority Arbiter
[Circuit diagram: requests R1, R2, R3; MUTEX elements; Lock Register (signals s1–s3, r1–r3, Lock); Priority Module; C-elements producing grants G1, G2, G3]
Why Synchronizer?
[Diagram: a single DFF sampling DATA on CLK, with waveforms showing metastability at the output Q; below it, a Two DFF Synchronizer with two DFFs in series]
Two DFF Synchronizer: here one clock cycle is used for the metastability to resolve.
CAD support: Async design flow
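Synthesis of Asynchronous link interfaces
[Block diagram: VME Bus Controller between the Bus (DSr, DSw, DTACK) and the Device (LDS, LDTACK), with the Data Transceiver controlled by D]
[Timing diagram: Read Cycle, with waveforms DSr, LDS, LDTACK, D, DTACK]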
[Signal transition graph (STG) of the VME bus controller read and write cycles, over the signals DSr, DSw, LDS, LDTACK, D and DTACK]
[Figure: synthesis flow from the STG, with a csc signal inserted for Complete State Coding (CSC), through the state graph, to a logic asynchronous circuit]
Boolean equations:
LDS = D + csc
DTACK = D
D = LDTACK
csc = DSr
Conclusions on Async Links







At the nm level, links will be more asynchronous; perhaps mesochronous first, to avoid global clock skew
Delay-insensitive codes can be used to tolerate inter-wire delay variability
Phase encoding can be used for higher power-bit efficiency and SEU tolerance
Acking will be mainly used for flow control (word level), and its overhead can be ‘hidden’ by using the “wagging buffer” technique
Serial links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches
Synthesis tools can be used to build clock-free interfaces between different links
Asynchronous logic can be used for building higher-level circuits, e.g. arbiters for switches and routers
And finally …
ASYNC’08 and NOCs’08 … plus SLIP’08




Held in Newcastle upon Tyne, UK, 7-11 April 2008
(SLIP on 5-6 April – weekend)
async.org.uk/async2008
async.org.uk/nocs2008
Submission deadlines:
– Async’08: Abstract – Oct. 8 , Full paper – Oct. 15
– NOCs’08: Abstract – Nov. 12, Full paper – Nov. 19
Extras

More slides if I have time!
Chain Network Components
From J.Bainbridge & S. Furber
IEEE Micro, 2002
Reliability and latency

Asynchronous arbiters fail only if time is bounded
– Latency depends on fixed gates plus MUTEX lock time
– τ for 2 channels, τ + τ ln(N-1) for more
– This is likely to be small compared with flow control latency

Synchronizers fail at (fairly) predictable rates, but these rates may get worse
– Latency can be 35τ now for good reliability
The synchronizer



Clock and valid can happen very close together
Flip-flop #1 gets caught in metastability
We wait until it is resolved (1–2 clock periods)
[Diagram: DATA and VALID signals; two flip-flops #1 and #2 in series (D to Q); clocks CLK1 and CLK2]
MTBF
$$\mathrm{MTBF} = \frac{e^{t/\tau}}{T_w \cdot f_c \cdot f_d}$$
For a 0.18 µm process, τ is 20–50 ps
Tw is similar
Suppose the clock and data frequencies are 2 GHz
t needs to be > 25τ (more than one clock period) to get MTBF > 28 days
– 100 synchronizers: +5τ
– MTBF > 1 year: +2τ
– PVT variations: +5–10τ . . .
Measurement
[Plot: Event Histogram versus Q to Clock time (-1.0E-08 to 0.0E+00), with a log-scale axis from 1E-13 to 1E-19; annotations: 100ps input variation, 10ps noise and jitter, metastability time, deep meta]
Convert to log scale; the slope gives τ
Not always simple
[Plot: Effective Input Overlap (log scale, 1E-10 to 1E-20) versus Q to Clock time (1.0E-08 down to 3.0E-09); annotations: metastability time, 10ps noise and jitter, deep meta; more than one slope (350ps, 120ps, 140ps)]
Synchronization Strategies

Avoid synchronization time (and arbitration time) by
– predicting clocks, stoppable clocks
– dedicating link paths for long periods of time

Minimize time by circuit methods
– Higher power, better τ
– Reducing apparent device variability: wide transistors
– Many parallel synchronizers increase throughput

Reduce average latency by speculation
– Reduce synchronization time, detect errors and roll back
Timing regions can have predictable relationships

Locked
– Two clocks from same source
– Linked by PLL
– One produced by dividing the other
– Some asynchronous systems
– Some GALS

Not locked together but predictable
– Two clocks same frequency, but different oscillators
– As above, same frequency ratio
Don’t synchronise when you don’t need to
If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew
The FIFO must never overflow
The next read clock can be predicted and metastability avoided
[Diagram: FIFO with DATA input and DATA output; handshake signals REQ IN, ACK OUT, REQ OUT, ACK IN; labels "Write Data Available" and "Read done"]
Conflict Prediction
[Timing diagram: Receiver Clock, Transmitter Clock and Predicted Transmitter Clock]
Synchronization problem known a cycle in advance of the Receiver clock.
We can do this thanks to the periodic nature of the clocks.
Problems predicting next cycle

Difficult to predict
– Multiple source clocks
– Input output interfaces

Dynamic jitter and noise
– GALS start-up clocks take several cycles to stabilise
– Crosstalk
– Power supply variations introduce noise into both data and clock
– Temperature changes alter relative delays

As a proportion of cycle time, this is likely to increase with smaller geometries
Synchronizer reliability trends

Clock rates increase. 10 GHz gives 100ps for a cycle.
– Both data and clock rates up by n
– τ down by n

Assume τ scales with cycle time: reliability (MTBF) of one synchronizer is down by n
Number of synchronizers goes up by N
– Die reliability down by N

Die-to-die and on-die variability increases to as much as 40%
– 40% more time needed for all synchronizers
An example

Example
– 10 GHz clock and data rate
– τ = 10 ps
– 100 synchronizers
– MTBF required: 3.8 months (10^7 seconds)
– Time required: 41τ, or 4.1 cycles; + 40% = 5.8 cycles
Does this matter?
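The figures in this example can be reproduced from the MTBF relation MTBF = e^(t/τ) / (Tw · fc · fd). The sketch below makes two assumptions that are not stated on this slide: Tw is taken equal to τ, and the failure rate of N synchronizers is N times that of one. The function name is hypothetical.

```python
from math import log


def settling_time_in_tau(mtbf_s, n_sync, t_w, f_c, f_d):
    """t/tau needed so that n_sync synchronizers together reach mtbf_s.

    Solves mtbf_s = exp(t/tau) / (n_sync * t_w * f_c * f_d) for t/tau.
    """
    return log(mtbf_s * n_sync * t_w * f_c * f_d)


tau = 10e-12          # 10 ps
t_w = tau             # assumption: Tw similar to tau
f_c = f_d = 10e9      # 10 GHz clock and data rate

t_over_tau = settling_time_in_tau(1e7, 100, t_w, f_c, f_d)
cycles = t_over_tau * tau * f_c        # settling time in clock cycles
print(round(t_over_tau, 1), round(cycles, 1), round(cycles * 1.4, 1))
# 41.4 4.1 5.8
```

Because the required time grows only with the logarithm of the product, a tenfold change in any factor (MTBF target, synchronizer count, clock or data rate) shifts the answer by ln(10), roughly 2.3τ.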
Power futures



Total synchronizer area/power is small, BUT
τ is very sensitive to voltage/power: both n and p transistors can turn off at low voltages, giving no gain
This affects MUTEX circuits as well
[Plot: tau (ps, 0 to 250) versus Vdd (0.5 to 2)]
Power/speed tradeoffs

Increase Vdd when synchronisation is required

Make synchronizer transistors wide to reduce variation and, to some extent, τ

Make many synchronizer circuits, and select the consistently fastest one

Avoid reducing synchronizer Vdd when running slow
Speculation

Mostly, the synchronizer does not need 35τ to settle

Only e^-10 (0.005%) need more than 10τ

Why not go ahead anyway, and try again if more time was needed?
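The case for speculation can be put in numbers. The sketch below follows the slide's simplification that the fraction of synchronizations still unresolved after time t is e^(-t/τ); the expected-latency model (a fixed early decision time plus a penalty paid only on failure) and both function names are assumptions made for illustration, not the author's analysis.

```python
from math import exp


def fraction_unresolved(t_over_tau):
    """Fraction of events needing longer than t, assuming exponential decay."""
    return exp(-t_over_tau)


def expected_latency(early, penalty):
    """Mean latency in units of tau when deciding after `early` tau and
    paying an extra `penalty` tau whenever the early decision was wrong."""
    return early + fraction_unresolved(early) * penalty


print(f"{fraction_unresolved(10):.2e}")        # 4.54e-05, i.e. about 0.005%
print(round(expected_latency(10, 35), 3))      # mean latency in tau
```

Under these assumptions the mean latency stays very close to 10τ, since the retry penalty is paid only about once in 22,000 synchronizations; the cost is the extra logic needed to detect the failure and repeat the action.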
Low latency synchronization



Data Available, or Free to write, are produced early
– After one cycle?
If they prove to be in error, synchronization failed
– This is only known after two or more cycles
The Read Fail or Write Fail flag is then raised and the action can be repeated.
[Diagram: FIFO with DATA input and output; WRITE side signals Write clock, Write Data, Free to write, Write Fail, Full; READ side signals Read Clock, Read done, Data Available, Read Fail, Not Empty; a speculative synchronizer on each side]
Comments


Synchronization time will be an issue for future GALS
Latency and throughput can be affected
– Should the flit be large to reduce the effective overhead of time and power?

Some power/speed trade-off is possible
– Higher power synchronization can buy some performance?

Speculation is complex
– Is it worth it?