Asynchronous Links, for NanoNets?
Alex Yakovlev
University of Newcastle, UK
NanoNet’07, Catania, 17/09/2007

Motivation-1
At very deep submicron (VDSM), gate delay is much less than interconnect delay:
– total interconnect length can reach several meters
– interconnect delay can be as much as 90% of total path delay in VDSM circuits
Timing is a problem, particularly for global wires
[Figure: relative delay (log scale, 0.1 to 100) versus feature size (250 nm down to 32 nm) for gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters. Source: ITRS, 2003]
Multiple clock domains are a reality, with the problem of interfacing between them
ITRS’05 predicted a 4x (8x) increase in global asynchronous signalling by 2012 (2020)

Motivation-2
Variability and uncertainty
– Geometry and process: for long channels, intra-die variations are less correlated for different parts of the interconnect, both for interconnects and repeaters
  • e.g., M4 and M5 resistance/um differ massively, leading to mistracking (C. Visweswariah, SLIP’06)
  • e.g., at 250nm, clock skew has 25% variability due to interconnect variations (Y. Liu et al., DAC’00)
– Behavioural: crosstalk (sidewall capacitance can cause up to 7x variation in delay (R. Ho, M. Horowitz))

A Network on Chip
Async links
Multiple clocks
Synchronization required
Arbitration required

Example from the Past: Fault-Tolerant Self-Timed Ring (Varshavsky et al. 1986)
For an onboard airborne computer-control system which tolerated up to two faults.
The self-timed ring was a GALS system with self-checking and self-repair at the hardware level
[Figure: individually clocked subsystems connected by self-timed adapters forming a ring]

Communication Channel Adapter
Much higher reliability than a bus and other forms of redundancy
MCC was developed in TTL Schottky gate arrays, approx. 2K gates.
Data (DR, DS) is encoded using a 3-of-6 Sperner code (16 data values for a half-byte, plus 4 tokens for the ring acquisition protocol)
AR, AS – acknowledgements
RR, RS – spare (for self-repair) lines

Outline
Token-based view of communication
Basics of asynchronous signalling
Self-timed data encoding
Pipelining
How to hide acknowledgements
Serial vs parallel links
Arbiters and routers
Async2sync interface
CAD issues

Data exchange: token-based view
[Figure: source -> Tx -> data -> Rx -> dest]
Question 1: when can Rx look at the incoming data?
– Data validity issue: forming a well-defined token
Question 2: when can Tx send new data?
– Acknowledgement issue: separation between tokens
These are fundamental issues of flow control at the physical and link levels
The answers are determined by many design aspects: technology level, system architecture (application, pipelining), latency, throughput, power, design process, etc.
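
The two questions above can be illustrated with a very small software model. This is only a sketch under stated assumptions: the class `OnePlaceLink` and its method names are hypothetical and do not come from the talk, and a real link answers both questions in hardware (with a valid signal or self-timed code for Question 1, and an acknowledgement for Question 2).

```python
class OnePlaceLink:
    """Hypothetical one-token channel between a transmitter and a receiver."""

    def __init__(self):
        self.data = None     # value currently on the wires
        self.valid = False   # Question 1: True once a well-formed token is present

    def can_send(self):
        # Question 2: Tx may send only when the channel holds a space,
        # i.e. the previous token has been acknowledged.
        return not self.valid

    def send(self, value):
        if not self.can_send():
            return False     # sending now would overwrite an unread token
        self.data = value
        self.valid = True    # token formed: Rx may now look at the data
        return True

    def receive(self):
        if not self.valid:
            return None      # nothing valid to read yet
        value = self.data
        self.valid = False   # acknowledgement: restores the space
        return value


link = OnePlaceLink()
first = link.send("A")    # accepted: the channel was empty
second = link.send("B")   # refused: "A" has not been acknowledged yet
got = link.receive()      # Rx reads "A" and acknowledges it
third = link.send("B")    # accepted now that the space is restored
```

The refused second send is the safety side of Question 2; the price is that Tx has to wait for the acknowledgement before the next token can enter the channel.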
Tokens and spaces with global clocking
[Figure: source -> Tx -> data -> Rx -> dest, with a common clk]
In globally clocked systems both Q1 and Q2 are resolved with the aid of clock pulses

Tokens and spaces
[Figure: Tx and Rx with separate clocks Clk_tx and Clk_rx; data bundle with D_valid from Tx to Rx, and ack returned from Rx to Tx]
Without global clocking, Q1 can be resolved differently from Q2. E.g.:
– Q1: source-synchronous (mesochronous), bundled data or self-synchronising codes
– Q2: ack or stop signal, or by local timing

Petri net model
[Figure: source -> Tx (Tx delay) -> Data Valid -> Rx (Rx delay) -> dest] One-way delay, but may be unsafe!
[Figure: the same model with ack returned from Rx to Tx (Tx delay or ack, Rx delay or ack)] Always safe, but with a round-trip delay!

Asynchronous handshake signalling
Valid data tokens and safe spaces between them can be created by different means of signalling and encoding
– Level-based -> Return-To-Zero (RTZ) or 4-phase protocol
– Transition-based -> Non-Return-to-Zero (NRZ) or 2-phase protocol
– Pulse-based, e.g. GasP
– Phase-difference-based
– Data encoding: bundled data (BD), delay-insensitive (DI)

Handshake Signalling Protocols
Level signalling (RTZ or 4-phase)
[Timing diagram: req and ack, one cycle marked]
Transition signalling (NRZ or 2-phase)
[Timing diagram: req and ack, one cycle marked]
Pulse signalling
[Timing diagram: req and ack pulses, one cycle marked]
Single-track signalling (GasP)
[Timing diagram: req and ack share one wire (req + ack), one cycle marked]

GasP signalling
[Figure: GasP stage with pulse-length control loops: pull-up from predecessor (req), pull-down here (ack), pull-up from here (req), pull-down from successor (ack)]
Source: R.
Ho et al., ASYNC’04

Data encoding
Bundled data
– Code is positional binary; the token is determined by the Req+ signal; Req+ arrives with a safe set-up delay after the data
Delay-insensitive codes (tokens are determined by the codeword values; they require a spacer, or NULL, state if RTZ)
– 1-of-2 (dual-rail per bit): systematic code; encoding and decoding are straightforward
– m-of-n (n>2): not systematic, i.e. incurs encoding and decoding costs; optimal when m=n/2
– One-hot, 1-of-n (n>2): completion detection is easy; not practical for n>4
– Systematic codes, such as Berger, incur complex completion detection

Bundled Data
[Timing diagrams: data, req and ack for RTZ and for NRZ, one cycle marked in each]

DI encoded data (Dual-Rail)
[Timing diagram, RTZ: Data.0, Data.1 and ack; codewords for logical 0 and logical 1 separated by NULL (spacer), one cycle per data item]
[Timing diagram, NRZ: Data.0, Data.1 and ack; sequence logical 0, logical 1, logical 1, logical 1, one cycle per data item]
This coding leads to complex logic implementation; it is hard to track odd and even phases and logic values – hence see LEDR below

DI codes (1-of-n and m-of-n)
1-of-4:
– 0001 => 00, 0010 => 01, 0100 => 10, 1000 => 11
2-of-4:
– 1100, 1010, 1001, 0110, 0101, 0011: 6 combinations in total (cf. 2-bit dual-rail: 4 combinations)
3-of-6:
– 111000, 110100, …, 000111: 20 combinations in total (can encode 4 bits + 4 control tokens)
2-of-7:
– 1100000, 1010000, …, 0000011: 21 combinations in total (4 bits + 5 control tokens)

DI codes: completion detection and decoding
1-of-4 completion detection is a 4-input OR gate (CD = d0+d1+d2+d3)
Decoding 1-of-4 to dual rail is a set of four 2-input OR gates (q0.0 = d0+d2; q0.1 = d1+d3; q1.0 = d0+d1; q1.1 = d2+d3)
For m-of-n codes, CD and decoding are non-trivial
From J. Bainbridge et al., ASYNC’03

Incomplete DI codes
Incomplete 2-of-7: composed of 1-of-3 and 1-of-4
From J. Bainbridge et al., ASYNC’03

Phase-difference-based encoding (C. D’Alessandro et al., ASYNC’06, ’07)
The proposed system encodes a bit of data in the phase relationship between two signals generated using a reference
This ensures that any transient fault appearing on one of the reference signals is ignored if it is not mirrored by a corresponding transition on the other line
Similarity with multi-wire communication
[Timing diagram: ref, t_0, t_1 and data; "t_1 before t_0" and "t_0 before t_1" encode the two data values, separated by spacers sp0 and sp1]

Phase encoding: multiple rail
No group of wires has the same delay
All wires toggle when an item of data is sent
Increased number of states available (n wires = n! states), hence more bits/symbol
The table compares phase encoding with the respective m-of-n counterpart

Type of link     States   Bits/symbol   Extra states   Transitions/symbol   Symbols/packet   Transitions/packet
Phase enc. (4)   24       4             8              4                    32               128
1-of-4           4        2             0              2                    64               128
Phase enc. (6)   720      9             208            6                    15               90
3-of-6           20       4             4              6                    32               192

Phase encoding: repeater
[Figure: repeater between sender and receiver; inputs i1-i3, outputs o1-o3 and a go signal, with phase detectors (mutexes) deciding 1<3 / 3<1, 2<3 / 3<2 and 1<2 / 2<1]

Pipelines
Dual-rail pipeline
From J. Bainbridge & S.
Furber, IEEE Micro, 2002

The problem of Acking
Question 2, “when can Tx send new data?”, has two aspects:
– Safety (not to overflow the channel, or when Tx and Rx have much variation in delay)
– Performance (to maximize throughput and reduce latency)
Can we hide the ack (round-trip) delay?

To maintain throughput, more pipeline stages are required, but that costs too much latency and power
First minimize latency along a long wire (not specific to asynchronous), and then maximize throughput (using the “wagging tail buffer” approach)
From R. Ho et al., ASYNC’04

Use of wagging buffer approach
Alternate between top and bottom control
From R. Ho et al., ASYNC’04

“Wagging tail buffer” approach
[Timing diagram: reqtop, acktop, data, reqbot, ackbot]
Top and bottom control channels work at ½ the frequency of the data channel

Serial Link vs Parallel Link (from R. Dobkin)
Why serial link?
– Less interconnect area
– Less routing congestion
– Less coupling
– Less power (depends on range)
The relative improvement grows with technology scaling.
The example in the figure refers to:
– Single gate delay serial link
– Fully-shielded parallel link with 8 gate delay clock cycle
– Equal bit-rate
– Word width N=8
[Figure: link length [mm] versus technology node [nm], with regions where the serial link dissipates less power, the parallel link dissipates less power, the serial link requires less area, and the parallel link requires less area]

Serialization model
[Figure: Tx -> … -> Rx] Acking at the bit level
[Figure: Tx -> Rx] Acking at the word level
[Figure: Tx -> Rx] Acking at the word level (with more concurrency)

Serial Link – Top Structure (R. Dobkin, ASYNC’07)
Transition signalling instead of sampling: two-phase NRZ Level Encoded Dual Rail (LEDR) asynchronous protocol, a.k.a.
data-strobe (DS)
Acknowledge per word instead of per bit
Synchronizers used at the level of the ack signals
Wave-pipelining over channel
Differential encoding (DS-DE, IEEE1355-95)
Reported throughput: 67 Gbps for a 65nm process (viz. one bit per 15ps, the expected FO4 inverter delay), based on simulations

Encoding – Two Phase NRZ LEDR
Two-phase Non-Return-to-Zero Level Encoded Dual Rail
– “delta” encoding (one transition per bit)
[Waveform: uncoded bits (B) with the corresponding phase bit (P) and state bit (S)]
$P(i) = \begin{cases} \overline{B(i)}, & i \text{ odd} \\ B(i), & i \text{ even} \end{cases} \qquad S(i) = B(i)$

Transmitter – Fast SR Approach (from R. Dobkin)

Receiver Splitter (from R. Dobkin)

Self Timed Networks
Router requires priority arbitration
– Arbitration necessary at every router merge
– Potential delay at every node on the path
BUT
– Asynchronous merge/arbitration time is average, not worst case
Adapters to locally clocked cells require synchronization
Synchronization necessary when clocks are unknown
– Occurs when receiving data (data valid), and when sending (acknowledge)
BUT
– Time can be long (2 cycles?)
– Must assume worst case time (maybe)

Router priority
[Figure: router with flow control, link, merge and split blocks]
Virtual channels implement scheduling algorithm
Contention for link resolved by priority circuits

Asynchronous Arbiters
Multiway arbiters (e.g. for Xbar switches):
– Cascaded mesh (latency ~ N)
– Cascaded tree (latency ~ log N)
– Token ring (busy ring and lazy ring) (latency ~ from 1 to N)
Priority arbiters (e.g.
for routers with different QoS):
– Static priority (topological order)
– Dynamic priority (request arrives with priority code)
– Ordered (time-priority): multiway arbiter, followed by a FIFO buffer

Static Priority Arbiter
[Figure: requests R1-R3 pass through MUTEX elements and a lock register into a priority module; C-elements produce grants G1-G3]

Why Synchronizer?
[Figure: DFF sampling DATA with CLK; when DATA changes close to the clock edge, Q becomes metastable between 1 and 0]
[Figure: two-DFF synchronizer] Here one clock cycle is used for the metastability to resolve.

CAD support: Async design flow

Synthesis of Asynchronous link interfaces
[Figure: VME bus controller between the bus (DSr, DSw, DTACK), the data transceiver (D) and the device (LDS, LDTACK), with the read cycle timing diagram]
[Figure: signal transition graph of the read and write cycles over DSr, DSw, LDS, LDTACK, D and DTACK]
[Figure: synthesis flow from the signal transition graph, through the state graph with an inserted csc signal, to Boolean equations for LDS, DTACK, D and csc and the resulting asynchronous logic circuit]
csc – Complete State Coding (CSC)

Conclusions on Async Links
At the nm level links will be more asynchronous, perhaps at first mesochronous, to avoid global clock skew
Delay-insensitive codes can be used to tolerate inter-wire delay variability
Phase encoding can be used for higher power-bit efficiency and SEU tolerance
Acking will be mainly used for flow control (word level) and its overhead can be ‘hidden’ by using the “wagging buffer” technique
Serial links save area and power for long interconnects, with buffering (pipelining) if one wants to maintain high throughput; they also simplify building switches
Synthesis tools can be used to
build clock-free interfaces between different links
Asynchronous logic can be used for building higher level circuits, e.g. arbiters for switches and routers

And finally …

ASYNC’08 and NOCs’08 … plus SLIP’08
Held in Newcastle upon Tyne, UK, 7-11 April 2008 (SLIP on 5-6 April, the weekend)
async.org.uk/async2008
async.org.uk/nocs2008
Submission deadlines:
– Async’08: abstract – Oct. 8, full paper – Oct. 15
– NOCs’08: abstract – Nov. 12, full paper – Nov. 19

Extras
More slides if I have time!

Chain Network Components
From J. Bainbridge & S. Furber, IEEE Micro, 2002

A Network on Chip
Multiple clocks
Synchronization required
Arbitration required

Transmitter – Fast SR Approach (from R. Dobkin)

Receiver Splitter (from R. Dobkin)

Self Timed Networks
Router requires priority arbitration
– Arbitration necessary at every router merge
– Potential delay at every node on the path
BUT
– Asynchronous merge/arbitration time is average, not worst case
Adapters to locally clocked cells require synchronization
Synchronization necessary when clocks are unknown
– Occurs when receiving data (data valid), and when sending (acknowledge)
BUT
– Time can be long (2 cycles?)
– Must assume worst case time (maybe)

Router priority
[Figure: router with flow control, link, merge and split blocks]
Virtual channels implement scheduling algorithm
Contention for link resolved by priority circuits

Static priority arbiter
[Figure: requests R1-R3 pass through MUTEX elements and a lock register into a priority module; C-elements produce grants G1-G3]

Reliability and latency
Asynchronous arbiters fail only if time is bounded
– Latency depends on fixed gates plus MUTEX lock time
– τ for 2 channels, + ln(N-1) for more
– This is likely to be small compared with flow control latency
Synchronizers fail at (fairly) predictable rates, but these rates may get worse
– Latency can be 35τ now for good reliability

The synchronizer
Clock and valid can happen very close together
Flip-flop #1 gets caught in metastability
We wait until it is resolved (1–2 clock periods)
[Figure: DATA and VALID from the CLK1 domain; VALID passes through flip-flops #1 and #2 clocked by CLK2]

MTBF
$\mathrm{MTBF} = \dfrac{e^{t/\tau}}{T_w \cdot f_c \cdot f_d}$
For a 0.18 µm process, τ is 20–50 ps
T_w is similar
Suppose the clock and data frequencies are 2 GHz
t needs to be > 25τ (more than one clock period) to get MTBF > 28 days
– 100 synchronizers: + 5τ
– MTBF > 1 year: + 2τ
– PVT variations: + 5–10τ
. . .
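
The increments listed on the MTBF slide follow from inverting the formula. The derivation below is a sketch; it assumes N identical, independent synchronizers whose failure rates simply add, which the slide does not state explicitly.

```latex
\begin{gather*}
\mathrm{MTBF} = \frac{e^{t/\tau}}{T_w f_c f_d}
\quad\Longrightarrow\quad
\frac{t}{\tau} = \ln\bigl(\mathrm{MTBF}\cdot T_w f_c f_d\bigr) \\
\text{$N$ synchronizers: }\;
\frac{t_N}{\tau} = \ln\bigl(N\cdot\mathrm{MTBF}\cdot T_w f_c f_d\bigr)
                 = \frac{t}{\tau} + \ln N,
\qquad \ln 100 \approx 4.6 \\
\text{28 days} \to \text{1 year: }\;
\frac{\Delta t}{\tau} = \ln\frac{365}{28} \approx 2.6
\end{gather*}
```

Both values are of the same order as the +5τ and +2τ quoted on the slide. Because t enters through an exponential, every factor of e in the required MTBF or in the number of synchronizers costs only one more τ of waiting.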
Event Histogram
[Figure: measured event histogram against Q-to-clock time, annotated with 100ps input variation, 10ps noise and jitter, and the deep metastability region]
Convert to log scale; the slope is τ

Not always simple
[Figure: effective input overlap (log scale, 1E-10 down to 1E-20) versus Q-to-clock time (metastability time), annotated with 10ps noise and jitter and the deep metastability region]

More than one slope
[Figure: effective input overlap versus Q-to-clock time with several slopes: 350ps, 120ps, 140ps]

Synchronization Strategies
Avoid synchronization time (and arbitration time) by
– predicting clocks, stoppable clocks
– dedicating link paths for long periods of time
Minimize time by circuit methods
– Higher power, better τ
– Reducing apparent device variability: wide transistors
– Many parallel synchronizers increase throughput
Reduce average latency by speculation
– Reduce synchronization time, detect errors and roll back

Timing regions can have predictable relationships
Locked
– Two clocks from same source
– Linked by PLL
– One produced by dividing the other
– Some asynchronous systems
– Some GALS
Not locked together but predictable
– Two clocks same frequency, but different oscillators
– As above, same frequency ratio

Don’t synchronise when you don’t need to
If the two clocks are locked together, you don’t need a synchroniser, just an asynchronous FIFO big enough to accommodate any jitter/skew
FIFO must never overflow
Next read clock can be predicted and metastability avoided
[Figure: FIFO with DATA, REQ IN, ACK IN, REQ OUT and ACK OUT ports; signals Write, Data Available and Read done]

Conflict Prediction
[Figure: receiver clock, transmitter clock and predicted transmitter clock waveforms]
Synchronization problem known a cycle in advance of the Receiver clock.
We can do this thanks to the periodic nature of the clocks

Problems predicting next cycle
Difficult to predict
– Multiple source clocks
– Input/output interfaces
Dynamic jitter and noise
– GALS start-up: clocks take several cycles to stabilise
– Crosstalk
– Power supply variations introducing noise into both data and clock
– Temperature changes alter relative delays
As a proportion of cycle time, this is likely to increase with smaller geometries

Synchronizer reliability trends
Clock rates increase. 10 GHz gives 100ps for a cycle.
– Both data and clock rates up by n
– τ down by n
Assume τ scales with cycle time; reliability (MTBF) of one synchronizer down by n
Number of synchronizers goes up by N
– Die reliability down by N
Die-to-die and on-die variability increases to as much as 40%
– 40% more time needed for all synchronizers

An example
– 10 GHz clock and data rate
– τ = 10 ps
– 100 synchronizers
– MTBF required: 3.8 months (10^7 seconds)
– Time required: 41τ, or 4.1 cycles; + 40% = 5.8 cycles
Does this matter?
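
The figures in this example can be checked by inverting the MTBF formula, MTBF = e^(t/τ) / (T_w · f_c · f_d). The sketch below makes two assumptions that are not stated on the example slide: T_w is taken equal to τ (the MTBF slide only says it is "similar"), and the failure rates of the 100 synchronizers are taken to add. The function name `settling_time_in_tau` is illustrative.

```python
import math


def settling_time_in_tau(mtbf_s, n_sync, t_w_s, f_clock_hz, f_data_hz):
    """Resolution time t/tau needed so that n_sync synchronizers together
    reach mtbf_s, from MTBF = exp(t/tau) / (T_w * f_c * f_d)."""
    return math.log(mtbf_s * n_sync * t_w_s * f_clock_hz * f_data_hz)


tau = 10e-12   # 10 ps
f = 10e9       # 10 GHz clock and data rate

# Assumption: T_w = tau; 100 synchronizers; MTBF target 1e7 seconds.
t_over_tau = settling_time_in_tau(1e7, 100, tau, f, f)
cycles = t_over_tau * tau * f        # settling time in clock cycles
cycles_with_margin = cycles * 1.4    # plus 40% for variability

print(round(t_over_tau, 1), round(cycles, 1), round(cycles_with_margin, 1))
# prints: 41.4 4.1 5.8
```

Under these assumptions the result reproduces the 41τ, 4.1 cycles and 5.8 cycles quoted above.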
Power futures
Total synchronizer area/power is small, BUT very sensitive to voltage/power
– both n and p transistors can turn off at low voltages: no gain
This affects MUTEX circuits as well
[Figure: tau (ps, 0 to 250) versus Vdd (0.5 to 2)]

Power/speed tradeoffs
Increase Vdd when synchronisation is required
Make synchronizer transistors wide to reduce variation and, to some extent, τ
Make many synchronizer circuits, and select the consistently fastest one
Avoid reducing synchronizer Vdd when running slow

Speculation
Mostly, the synchronizer does not need 35τ to settle
Only e^-10 (0.005%) need more than 10τ
Why not go ahead anyway, and try again if more time was needed?

Low latency synchronization
Data Available, or Free to write, are produced early
– After one cycle?
If they prove to be in error, synchronization failed
– Only know this after two or more cycles
Read Fail or Write Fail flag is then raised and the action can be repeated.
[Figure: FIFO between write and read sides; a speculative synchronizer on the write side turns Full into Free to write / Write Fail using the write clock, and one on the read side turns Not Empty into Data Available / Read Fail using the read clock; Write Data and Read done complete the handshakes]

Comments
Synchronization time will be an issue for future GALS
Latency and throughput can be affected
– Should the flit be large to reduce the effective overhead of time and power?
Some power/speed trade-off is possible
– Higher power synchronization can buy some performance?
Speculation is complex
– Is it worth it?
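
The speculation argument can be put into rough numbers. This is a sketch only: it assumes simple exponential resolution, so the fraction of events still unresolved after kτ is e^-k, and it assumes (hypothetically) that a failed speculation costs a further full 35τ wait. The function names are illustrative and not from the talk.

```python
import math


def fraction_unresolved(k):
    """Fraction of metastable events still unresolved after k time constants,
    assuming simple exponential resolution exp(-t/tau)."""
    return math.exp(-k)


def mean_wait_in_tau(early, retry_penalty):
    """Mean wait in units of tau when acting after `early` tau and paying
    `retry_penalty` tau extra whenever the early answer proves wrong."""
    return early + fraction_unresolved(early) * retry_penalty


p_slow = fraction_unresolved(10)         # roughly the 0.005% quoted above
speculative = mean_wait_in_tau(10, 35)   # hypothetical retry cost of 35 tau

print(f"{p_slow:.1e} {speculative:.3f}")
# prints: 4.5e-05 10.002
```

Under these assumptions the mean wait stays close to 10τ instead of 35τ. The model leaves out the cost of detecting the failure and rolling back, which is the complexity the Comments slide asks about.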