Clocking and Timing in Fault-Tolerant Systems-on-Chip
Andreas Steininger

Outline
• The Clock as a Blessing
• The Clock as a Curse
• Alternative Synchronization Schemes
  - GALS
  - fully asynchronous
  - the DARTS approach
• Conclusion

Contributors to this Work
The DARTS project team (TU Vienna, RUAG Space):
Gottfried Fuchs, Matthias Fuegger, Ulrich Schmid, Thomas Handl, Gerald Kempf, Manfred Sust, Wolfgang Zangerl

The Need for Fault Tolerance
miniaturization is key to progress in VLSI
=> smaller structures
=> lower voltage swing
=> smaller critical charge
=> higher operating frequencies
… result in higher susceptibility to faults (SET, EMI, …)
=> cannot avoid faults, need to tolerate them

The Role of Time
“The only reason for time is so that everything doesn’t happen at once”, Albert Einstein

The Need for Clocking
activities need to be co-ordinated
• on system level (braking of wheels, …)
• on algorithmic level (consensus, …)
• on communication level
• on logic level (state machine switching, …)
co-ordination in the time domain (synchronization) is an efficient way to attain this
=> need a global notion of time (discrete “ticks”)

The Quality of Synchronization
[Figure: local time (number of ticks) plotted over real time; the spread between the local clocks is the precision π]

Typical Precision Values
on system level:        µs … ms
on algorithm level:     µs … ms
on communication level: ns … ms
on logic level:         ps … ns

Synchronization Requirements
• phase synchronisation (for “hardware clock” on logic level)
• clock synchronisation (for distributed time base on algorithmic level)
1 µs is excellent precision for a distributed clock; at 1 GHz this means 360,000° phase shift

Globally Synchronous Design
• whole design is “isochronic” (“perfect” precision)
• time conveyed by clock transitions
• perfect co-ordination of all activities
• very efficient design
• can assume consistent states
• high level of abstraction
• very efficient implementation:
  • single crystal oscillator
  • single control line (clock net)

“Isochronic” Regions?
speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns
[Figure: 2 cm reference length shown for 1 GHz, 4 GHz and 8 GHz]

The Variation Problem
Designer: projected conditions, system model
User: actual conditions, actual system
? (unknown) => worst case
? (imperfections) => safety margins
• Timing completely fixed after design
• No way to react to actual conditions & system (“PVT variations”)

Fault-Tolerant Architectures
• Duplication & Comparison: two FUs, comparator (=?) raises ERR
• Triple-Modular Redundancy: three FUs, voter produces Y

Lock-Step Operation (single clock)
[Figure: three FUs driven by a single clock feed voter Y; all replicas move from “3” to “4” together]
• good replica determinism
• single point of failure

Lock-Step Operation (independent clocks)
[Figure: three FUs with independent clocks feed voter Y; the replicas move from “3” to “4” at different times]
• single fault tolerant
• bad replica determinism

Fault-Tolerant HW-Clocking
[Figure: three FUs, each clocked through its own voter (v), feeding output voter Y]
[Figure: the same structure with unknown (?) voter inputs and delays (D)]

The Charm of SoCs
billions of transistors fit on one die
=> structuring into (IP) modules: “System-on-Chip”
BUT:
• large clock distribution networks => “isochronic”??
• FT clocking does not work with large skew
• may need individual clocks for function modules
=> clock-synchrony neither attainable nor desirable

Co-ordination of Data Exchange
[Figure: SRC -> f(x) -> SNK]
When can SNK use its input? When it is valid and consistent.
When can SRC apply the next input? When SNK has consumed the previous one.

The Synchronous Approach
[Figure: SRC -> f(x) -> SNK]
co-ordination based on (global) time

Alternative: Asynchronous Design
[Figure: SRC -> f(x) -> SNK]
co-ordination based on handshaking
REQ: “Data word valid, you can use it”
ACK: “Data word consumed, send the next”

Async. Design – Advantages
• closed-loop control makes timing much more robust and adaptive to PVT variations
• no need for worst-case timing
• local handshakes replace global clock
• activity only when needed
• beneficial for EMI
• tends to stop operation in case of fault

Async. Design – Disadvantages
• need to handle race between REQ and data
  Solution 1: “Bundled Data”
  Solution 2: “Delay Insensitive” (coding), with completion detection
• significant HW overhead (coding, delay elements)
• “adaptive” timing not as predictable
• more difficult to design
• classical fault-tolerance schemes not applicable
• tends to stop operation in case of fault

Best of Both Worlds
GALS: Globally Asynchronous Locally Synchronous
• use asynchronous principle where clock distribution is too cumbersome: “inter-module”
• retain efficiency of synchronous design wherever possible: “intra-module”
First mention in PhD thesis by Chapiro, Stanford, 1984

A GALS Example
CPU:    2 GHz
DSP:    2.7 GHz
PCI-IF: 533 MHz
USB-IF: 24 MHz

Communication in GALS
Shared Memory
producer writes to memory, consumer reads from there
pro: control flow stays independent
• shared single-port memory
• true dual-port memory
Direct Messages (Data words)
move data word from producer's output register to consumer's input register
• non-buffered / buffered (FIFO queues)
• clock fixed, data-driven or pausible

Shared Memory
decoupling of clock domains by memory acting as a third party
=> high area overhead
=> unusual
for single-port memory, arbitration is required
• arbitration problem (unbounded delay …)
• one side may block the other at the arbiter
for multiport memory, problems are confined to accesses to the same cell
• busy flag may become metastable
• blocking still possible for one specific address

Shared Memory
• perfect decoupling of data path
• potential metastability problems at arbitration logic
• potential blocking through arbitration
[Figure: CPU (2 GHz) and DSP (2.7 GHz) access a shared memory through arbitration]

Direct Messages
clock domain boundary is between producer's output register and consumer's input register
in general a synchronizer is needed at consumer's input
• definitely for conventional (fixed) clock
• can be avoided by data-driven / pausible clocking
control flows of producer and consumer are strongly coupled: a party that does not service its input/output register blocks the other party
buffers/queues/FIFOs can
• mitigate, but not avoid this problem (full/empty)
• compensate variations in the data rate on both sides, but not different average data rates

Direct Messages
[Figure: CPU (2 GHz) and DSP (2.7 GHz) connected through synchronizers (S)]
data moving over clock domain boundary
metastability problems => need to insert handshake
… with synchronizers and (optional) buffers

Arbiter: Principle
purpose:
○ manage concurrent requests to a shared resource
method:
○ handle pairs of request_in / grant_out
○ requests may arrive in any order
○ arbiter must activate only one grant_out at a time (respond to the first requester)
Mutual Exclusion (MUTEX) problem:
○ resolve concurrent requests => metastability problem

Arbiter: Circuit
MUTEX element: SR latch (requests R1, R2; latch outputs G1', G2'; grants G1, G2)
“Metastability filter”: e.g., high-threshold inverter
[Figure: latch output voltage V_out,FF over time t, with metastable level V_meta and inverter threshold V_th,inv]
[from D. J. Kinniment, “Synchronization and Arbitration in Digital Systems”, Wiley]

Arbiter: Operation
[Figure: waveforms of R1, R2, G1', G1, G2', G2]

Muller C-Element
IF a = b THEN y = a ELSE hold y
[Figure: C-element symbol with inputs a, b and output y; equivalent circuit using an RS latch (set, reset)]

Muller C-Element: Circuit
[Figure: transistor-level circuit; Alain Martin, Caltech]

Data-Driven Clocking
Principle:
○ as soon as new data arrive => start clocking
○ determine number k of clock cycles required to process new data
○ stop clocking after k cycles, wait for next data
Properties:
○ need to switch clock on and off => beware spurious clock pulses!
○ no metastability problem: data stable as soon as consumer clock starts
○ potential for power saving
○ useful for specific applications only (no pipe!)
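
The data-driven clocking principle above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the circuit from the slides: the function name `data_driven_clock`, the choice to place the first rising edge one half period after the start, and the rule that data arriving while the clock is still running waits until it has stopped are all hypothetical modelling choices.

```python
from typing import List, Tuple


def data_driven_clock(items: List[Tuple[float, int]],
                      half_period: float) -> List[List[float]]:
    """Return the rising-edge times generated for each data item.

    items: (arrival_time, k) pairs sorted by arrival time, where k is the
           number of clock cycles needed to process that item.
    half_period: clock half period, standing in for the delay element D
                 (hypothetical value, same time unit as arrival_time).
    """
    period = 2.0 * half_period
    idle_at = 0.0                      # time at which the clock is stopped again
    schedule = []
    for arrival, k in items:
        # Assumption: data arriving while the clock runs waits for it to stop.
        start = max(arrival, idle_at)
        # Assumption: first rising edge comes one half period after the start,
        # so the data is stable before the consumer samples it.
        edges = [start + half_period + i * period for i in range(k)]
        idle_at = start + k * period   # clock stops after exactly k cycles
        schedule.append(edges)
    return schedule


if __name__ == "__main__":
    # Three data words: the third arrives while the second is still processed.
    print(data_driven_clock([(0.0, 3), (10.0, 2), (12.0, 1)], half_period=1.0))
    # prints [[1.0, 3.0, 5.0], [11.0, 13.0], [15.0]]
```

Between bursts the model produces no edges at all, which is where the potential for power saving comes from; the third item also shows why the scheme does not pipeline.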

Data-Driven Clock: Circuit / 1
[Figure: loop with delay elements D producing CLK out]
CLK half period determined by D

Data-Driven Clock: Circuit / 2
[Figure: Muller C-element (C) with delay elements D; signals REQ, ACK, CLK out and their waveforms]
transition on REQ answered by transition on CLK out
min CLK half period determined by D

Pausible Clocking
Principle:
○ producer requests consumer's clock to pause
○ data provided to input register during idle time
○ consumer's clock may resume
  - free running (“pausible clock”)
  - with one cycle only (“stoppable clock”)
Properties:
○ need to switch clock on and off => beware spurious clock pulses! => beware of clock tree delays!
○ producer controls consumer's clock (blocking!)
○ applications must cope with paused clock

Pausible Clock: Circuit / 1
[Figure: Muller C-element (C) with delay elements D; signals REQ, ACK, CLK out and their waveforms]
inverter generates next REQ from ACK
self-oscillation

Pausible Clock: Circuit / 2
[Figure: arbiter (Arb) inserted in front of the Muller C-element; signals REQ', ACK', CLK out and their waveforms]
external unit can safely stop CLK by activating REQ'
… and gets ACK' as a response

Pausible Clock: Circuit / 3
[Figure: arbiters with REQ1/ACK1 … REQn/ACKn in front of the Muller C-element]
for more external sources, arbiters can be added and “anded” before the Muller C-Element
the two inverters can be eliminated by using a Muller C-Element with inverting output

Advantages of GALS
• synchronous islands can be designed efficiently
• modules operate independently
• can use module-specific clock & timing
• clocking is no single point of failure

Problems with GALS
• operation of modules not (inherently) co-ordinated:
  synchrony for communication, but not on system / algorithm level
• communication has to cross clock boundaries
• potential for metastability
  => performance penalty through synchronizers
  OR
  => module must handle irregular clocking

The DARTS Idea
Distributed Algorithms for Robust Tick Synchronization
[Figure: tick synchronisation placed between phase synchronisation and clock synchronisation]

The DARTS Approach
Concept: Multiple synchronized tick generators
Method: Distributed algorithm for fault-tolerant tick generation, implemented in (asynchronous) digital logic
Advantages:
• No crystal oscillator(s)
• No critical clock tree
• Clock is no single point of failure!
• Reasonable synchrony
[Figure: function units Fu1, Fu2, Fu3 on a data bus, each with a TG-Alg, connected by the TG-Net]

The DARTS Principle
• Every function unit Fui is augmented with a simple local clock unit (TG-Alg)
• TG-Algs communicate over a dedicated TG-Net to generate tick-synchronized local clock signals
• Up to f TG-Algs can be Byzantine faulty; need n ≥ 3f + 2 TG-Algs
[Figure: Fu1, Fu2, Fu3 on a data bus with TG-Algs and TG-Net (DARTS clocks), contrasted with a clock tree (standard synchronous clocking)]

A Comparison
synchronous SoC (single oscillator, clock tree): global synchrony (< 1 tick); single point of failure
GALS (one oscillator per function unit): NO (inherent) global synchrony; no single point of failure
DARTS (TG-Algs and TG-Net): global synchrony (potentially 1 tick); no single point of failure

The Distributed Algorithm
[Figure: TG-Alg 1 … TG-Alg 6 connected through the TG-Net]
[Srikanth & Toueg, 87]
    Initially:
        send tick(0) to all; clock := 0;
    “Relay Rule”
    If received tick(m) from at least f+1 remote nodes and m > clock:
        send tick(clock+1), …, tick(m) to all [once]; clock := m;
    “Increment Rule”
    If received tick(m) from at least 2f+1 remote nodes and m >= clock:
        send tick(m+1) to all [once]; clock := m+1;

Implementation Challenges
k-bit msg vs. zero-bit tick: TICK(0), TICK(1), …, TICK(k-1), TICK(k)
• k-bit messages with k unbounded (software-based algorithm) => replacement by zero-bit messages
• threshold functions for fault tolerance => glitch-free asynchronous implementation
• atomicity of actions => to be ensured by the architecture and delay constraints

The DARTS Prototype
ASIC design:
• radhard 180 nm technology
• 2 designs:
  - flexible
  - fast
Prototype board: 8 chips plus fixed & programmable interconnect

Proof of Concept
[Figure]

Frequency Stability (Warm-up)
[Plot: frequency in MHz (53.15 … 53.45) over time in hours (0 … 18)]

Frequency Stability (detail)
[Plot: frequency in MHz (51.94 … 52.0) and core voltage in V (1.7968 … 1.7974) over time in min (0 … 15)]

DARTS – General Properties
• Fully asynchronous implementation
• NO oscillators
• Tolerates up to three Byzantine faulty nodes (configurable number of TG-Algs; 5 to 12)
• Adapts to operating conditions (asynchronous logic)

Still Room for Improvements
o Transient faults are permanently stored in the elastic pipelines
o No on-the-fly integration of TG-Alg
o Relatively low clock speed
o Interfacing to traditional synchronous designs
o Scaling with number of faults is costly

Summary: Trends & Needs
• Proceeding miniaturization necessitates fault tolerance
• Co-ordination of activities is fundamental, thus tight synchrony is a desirable feature on all levels
• SoCs are large modular designs on a single die

Summary: SoC Clocking
• globally synchronous clock
  + ideal synchrony, efficient in design & implementation
  - isochrony unrealistic, single point of failure
• DARTS clock
  + best attainable global synchrony, adaptive timing, FT
  - high implementation efforts, frequency not stable
• GALS
  + uses best of synchronous & asynchronous design, independent & module-specific clocks
  - no global synchrony, metastability issues
• asynchronous design
  + power-efficient, robust against faults & PVT
  - high overheads, difficult to design, timing hard to predict

More information on DARTS
http://ti.tuwien.ac.at/ecs/research/projects/darts
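
As a supplement to the relay and increment rules of the distributed algorithm [Srikanth & Toueg, 87] listed earlier, the following is a minimal software sketch of tick generation. It is not the DARTS hardware implementation. The function name `simulate_ticks`, the random link delays between `d_min` and `d_max`, the `max_tick` cap that ends the run, and the modelling of faulty nodes as simply silent (a benign special case of Byzantine behaviour) are all assumptions made for illustration.

```python
import heapq
import random
from typing import Dict, List, Set, Tuple


def simulate_ticks(n: int, f: int, faulty: Set[int], max_tick: int,
                   d_min: float = 1.0, d_max: float = 2.0,
                   seed: int = 0) -> Tuple[List[int], int]:
    """Return (final clock of every node, largest skew seen among correct nodes)."""
    rng = random.Random(seed)
    clock = [0] * n
    # received[i][m]: remote nodes from which node i has received tick(m)
    received: List[Dict[int, Set[int]]] = [dict() for _ in range(n)]
    queue: List[Tuple[float, int, int, int, int]] = []  # (time, seq, dst, src, m)
    seq = 0

    def broadcast(now: float, src: int, m: int) -> None:
        nonlocal seq
        if src in faulty:
            return                      # assumption: faulty nodes stay silent
        for dst in range(n):
            if dst != src and dst not in faulty:
                delay = rng.uniform(d_min, d_max)
                heapq.heappush(queue, (now + delay, seq, dst, src, m))
                seq += 1

    def apply_rules(now: float, i: int) -> None:
        changed = True
        while changed:
            changed = False
            # Relay rule: tick(m) from >= f+1 remote nodes and m > clock
            ahead = [m for m, s in received[i].items()
                     if m > clock[i] and len(s) >= f + 1]
            if ahead:
                target = max(ahead)
                for m in range(clock[i] + 1, target + 1):
                    broadcast(now, i, m)
                clock[i] = target
                changed = True
            # Increment rule: tick(clock) from >= 2f+1 remote nodes
            # (the max_tick cap is only there to end the simulation)
            if (clock[i] < max_tick
                    and len(received[i].get(clock[i], ())) >= 2 * f + 1):
                clock[i] += 1
                broadcast(now, i, clock[i])
                changed = True

    for i in range(n):
        broadcast(0.0, i, 0)            # initially: send tick(0) to all
    correct = [i for i in range(n) if i not in faulty]
    max_skew = 0
    while queue:
        now, _, dst, src, m = heapq.heappop(queue)
        received[dst].setdefault(m, set()).add(src)
        apply_rules(now, dst)
        ticks = [clock[i] for i in correct]
        max_skew = max(max_skew, max(ticks) - min(ticks))
    return clock, max_skew


if __name__ == "__main__":
    clocks, skew = simulate_ticks(n=5, f=1, faulty={4}, max_tick=20)
    print("clocks:", clocks, "max skew in ticks:", skew)
```

With n = 5 and f = 1 (so n ≥ 3f + 2 holds) every correct node still hears from 2f + 1 remote correct nodes and the ticks keep advancing. With n = 4 and one silent node the increment rule can never fire and all clocks stay at 0, which illustrates why the bound on n is needed.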