Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip

Qiaoyan Yu¹, José Cano², José Flich², Paul Ampadu³
¹University of New Hampshire, USA
²Universitat Politècnica de València, Spain
³University of Rochester, USA

Outline
• Introduction & Motivation
  – Impact of permanent and transient errors on NoC routers
  – Advanced topologies
• Proposed method
  – LBDRhr
  – Transient error control in LBDRhr
• Experimental results
• Summary and conclusions

The 6th ACM/IEEE International Symposium on Networks-on-Chip. May 9-11, 2012, Lyngby, Denmark

Introduction
• Types of MPSoCs:
  – Application-specific
      Fully irregular topologies
      System design totally customized
      E.g. Spidergon STNoC
  – High-end
      Regular structures (2D mesh-based topologies)
      E.g. Tilera
      This work focuses here

Introduction
• Critical challenge in current NoCs: RELIABILITY
  – Permanent errors
      E.g. due to defective components (links, routers)
      Solution based on fault-tolerant routing: Logic-Based Distributed Routing (LBDR)
  – Transient errors
      E.g. due to particle strikes
      Solution based on error control coding: inherent information redundancy (IIR)
• Combining the two could be a good solution for addressing both permanent and transient errors in NoCs

Introduction & Motivation
• Problem: neither LBDR nor IIR can be applied as-is to advanced NoC topologies and configurations
  – LBDR approach
      Designed for 2D meshes
      Routers connected to one neighbouring router in each dimension and direction (ports: North, South, East, West, plus the local PE)
      Not ready for transient errors
  – IIR approach
      Designed for XY routing
      Not suitable for more advanced routing solutions
      Not ready for permanent faults
Advanced Topologies
[Figure: three example topologies (diagonal mesh, flattened butterfly, 2D-mesh with express channels). Legends: 1-hop links and 2-hop diagonal links; 1-hop links and 2-hop straight links; 1-hop links, 2-hop straight links and 3-hop links. Example port labels: NNN port, EE port.]
• The initial 2D-mesh is the underlying topology

Proposed Ideas
• To address fault tolerance for advanced topologies:
  – Redesign the LBDR mechanism: LBDRhr (LBDR for high-radix networks)
      Adaptive routing algorithm supported
      2 virtual channels
      Deadlock-free for the high-radix topologies defined
  – Develop a new method to detect transient errors in LBDRhr logic
      Exploits the inherent information redundancy in LBDRhr to significantly reduce the error control overhead

NoC Router Functionality
• Compute the routing direction for the next hop
• Pass the packet to its intended output port
Note: 24 is the maximum number of routing ports for each router; not all of them need to be implemented, depending on the topology

Permanent Error Management
• Previous method: Logic-Based Distributed Routing (LBDR)
  – Four routing ports per switch (North, South, East, West)
  – Two sets of bits: routing bits (Rxy, 2 per output port) and connectivity bits (Cx, 1 per output port)
  – Minimal path support
[Figure: LBDR logic. Comparators (CMP) on Xcurr/Xdst and Ycurr/Ydst produce E', W', N' and S'; gates combine these with routing bits such as RNE and RNW and the connectivity bit CN to produce the N output.]
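The LBDR decision for one output port can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `compare` and `lbdr_north` are hypothetical, the coordinate convention (X grows eastward, Y grows southward) is an assumption, and the north-port equation follows the pattern given for N'' on the LBDRhr adaptive-part slide, gated by the connectivity bit Cn.

```python
def compare(x_curr, y_curr, x_dst, y_dst):
    """Comparator stage: relative direction of the destination.

    Assumption: X grows eastward and Y grows southward.
    """
    return {
        "N": y_dst < y_curr,
        "S": y_dst > y_curr,
        "E": x_dst > x_curr,
        "W": x_dst < x_curr,
    }


def lbdr_north(d, r_ne, r_nw, c_n):
    """North output request: routing bits first, then the connectivity bit."""
    n = ((d["N"] and d["E"] and r_ne)                 # NE quadrant, turn allowed
         or (d["N"] and d["W"] and r_nw)              # NW quadrant, turn allowed
         or (d["N"] and not d["E"] and not d["W"]))   # destination straight north
    return bool(n and c_n)


# Destination to the north-east of the current router.
d = compare(x_curr=1, y_curr=2, x_dst=2, y_dst=0)
print(lbdr_north(d, r_ne=True, r_nw=False, c_n=True))
```

Clearing `c_n` models a failed or missing north link: the port is never requested, whatever the routing bits say.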
Permanent Error Management
• LBDRhr
  – Tolerates permanent link and router failures
  – Implemented with three basic logic blocks
      1-hop, 2-hop and 3-hop ports
  – Uses a few configuration bits to store local information about the neighboring routers
      8 configuration bits for routing purposes (Rxy)
      2 bits for two deroute options (special cases) at every input port (DRx)
      1 connectivity bit per output port (Cx)

LBDRhr logic (common part)
• Relative direction of the message's destination
  – XXX' set: destination is at least three hops away in the X direction
  – XX' set: destination is at least two hops away in the X direction
  – X' set: destination is at least one hop away in the X direction
• If XX' is set, X' is set; if XXX' is set, XX' and X' are set

LBDRhr logic (adaptive part)
• 3-hop ports: one gate per output signal
    e.g.: NNNlbdr = NNN' & Cnnn
• 2-hop ports: one gate per output signal (3-hop ports have priority)
    e.g.: NElbdr = N' & E' & Cne & !3hop
• 1-hop ports: routing restrictions taken into account
    e.g.: N'' = (N' & E' & Rne) | (N' & W' & Rnw) | (N' & !E' & !W')
  followed by one gate per output signal (3-hop and 2-hop ports have priority)
    e.g.: N''' = N'' & Cn & !3hop & !2hop
• In case of no solution at all (non-minimal path support): take the configured deroute option at the switch

LBDRhr logic (escape part)
[Figure: LBDRhr escape-part logic]

Permanent Error Management
• Deadlock-free routing example
[Figure: two 16-node example networks (nodes 0 to 15) with numbered links and a marked deroute point. VC0: faulty 2D-mesh with express channels. VC1: faulty 2D-mesh.]
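The common-part flags and the priority order among 3-hop, 2-hop and 1-hop ports described on the LBDRhr logic slides can be sketched as follows. This is an illustrative model under stated assumptions: the function names are hypothetical, positive X offsets are taken to mean east, and the `any_3hop` / `any_2hop` inputs stand for the slides' `3hop` / `2hop` signals, assumed here to mean "some 3-hop (or 2-hop) port request is already active".

```python
def x_flags(x_curr, x_dst):
    """Common part, X dimension: thresholded hop distance.

    Assumption: a positive offset means the destination lies to the east.
    """
    dx = x_dst - x_curr
    return {
        "E": dx >= 1, "EE": dx >= 2, "EEE": dx >= 3,
        "W": dx <= -1, "WW": dx <= -2, "WWW": dx <= -3,
    }


def nnn_request(nnn, c_nnn):
    """3-hop port: NNNlbdr = NNN' & Cnnn."""
    return bool(nnn and c_nnn)


def ne_request(n, e, c_ne, any_3hop):
    """2-hop port: NElbdr = N' & E' & Cne & !3hop."""
    return bool(n and e and c_ne and not any_3hop)


def n_request(n, e, w, r_ne, r_nw, c_n, any_3hop, any_2hop):
    """1-hop port: routing restrictions, then connectivity and priority."""
    n2 = (n and e and r_ne) or (n and w and r_nw) or (n and not e and not w)
    return bool(n2 and c_n and not any_3hop and not any_2hop)


flags = x_flags(x_curr=0, x_dst=3)
print(flags["EEE"], flags["EE"], flags["E"])
```

Because the flags are thresholds on the same offset, EEE' implies EE' and EE' implies E', which is the property the common-part slide states.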
Previous Transient Error Control Methods
• Previous methods: flooding [1], triple modular redundancy [2], IIR [3]
• Limitations of previous methods
  – Need knowledge of error locations
  – Consume large link switching power
  – Increase area overhead, or
  – Are limited to XY routing
[1] R. Mǎrculescu, ISVLSI'03.
[2] A. Yanamandra, et al. ASP-DAC'10.
[3] Q. Yu, et al. NOCS'11.

New Inherent Information Redundancy Extraction
• Forbidden signal patterns in routers are regarded as inherent information redundancy (IIR)
• More IIR is found in an LBDRhr-based router
[Figure: request signals to the West and East ports, with and without error, for five forbidden patterns: (a) switched single-request, (b) switched multi-request, (c) multi-switched request, (d) bidirectional switched-request, (e) mute-request. Annotations: non-header; opposite directions; two flips; header & no output.]

Error Detection for CMP in Router
• The forbidden signal patterns of opposite directions are used to further constrain the signal patterns
    Err1.1 = WWW' & EEE'
    Err2.1 = WWW' & !(WW' & W' & EE'_n & E'_n)
    Err3 = (A_n & (E' | W')) | (A & E'_n & W'_n)
    A: internal node in CMP

Error Detection for Multi-hop Logic
    Err2-hops = (NN & SS) | (EE & WW) | (NE & SW) | (SE & NW) | (NE & SE) | (SW & SE)
    Err3-hops = (NNE & SSW) | (EEN & SSW) | (EES & WWN) | (SSE & NNW) | (NNN & SSS) | (EEE & WWW) | (NNE & NNW) | (SSW & SSE) | (EEN & EES) | (EES & EEW) | (WWN & WWS)
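The forbidden-pattern checks above amount to testing whether any listed pair of signals is asserted at the same time. The sketch below models that in Python; it is an illustration rather than the synthesized logic. The pair lists are copied as printed on the slides, the helper names are hypothetical, and the `_n` suffix is assumed to denote the complemented signal.

```python
# Forbidden pairs, copied as printed in Err2-hops and Err3-hops.
ERR_2HOP_PAIRS = [
    ("NN", "SS"), ("EE", "WW"), ("NE", "SW"),
    ("SE", "NW"), ("NE", "SE"), ("SW", "SE"),
]
ERR_3HOP_PAIRS = [
    ("NNE", "SSW"), ("EEN", "SSW"), ("EES", "WWN"), ("SSE", "NNW"),
    ("NNN", "SSS"), ("EEE", "WWW"), ("NNE", "NNW"), ("SSW", "SSE"),
    ("EEN", "EES"), ("EES", "EEW"), ("WWN", "WWS"),
]


def forbidden_pair_error(asserted, pairs):
    """True if both signals of any forbidden pair are in the asserted set."""
    return any(a in asserted and b in asserted for a, b in pairs)


def cmp_err1_1(www, eee):
    """Err1.1 = WWW' & EEE' (opposite directions asserted together)."""
    return bool(www and eee)


def cmp_err2_1(www, ww, w, ee, e):
    """Err2.1 = WWW' & !(WW' & W' & EE'_n & E'_n).

    Assumption: X_n is the complement of X, so a valid WWW' requires
    WW' and W' set and no east flag set.
    """
    return bool(www and not (ww and w and not ee and not e))


print(forbidden_pair_error({"NN", "SS"}, ERR_2HOP_PAIRS))
print(cmp_err2_1(www=True, ww=False, w=True, ee=False, e=False))
```

A single asserted request never trips the pair check, so this kind of detector only catches upsets that create a second, contradictory request.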
Experimental Results
• Error Detection Rate
• Reliability
• Flit Throughput and Latency
• Area, Power and Delay

Experimental Results: Setup
• Multiple NoC topologies
• LBDRhr routing
• 8-bit address
• Synthesized with a TSMC 65nm technology
[Figure: logic facilitating error injection in modeling. Labels: Input, Original Gate, MUX, Error Injection Enable, Output.]

Error Detection Rate in CMP
• No matter how the NoC size changes, the error detection rate for E' and W' is 100% because of the use of the internal node.
• The error detection rate for EE', WW', EEE' and WWW' is less than 100%.
  – Only the occurrence of opposite-direction pairs helps to detect errors in EE', WW', EEE' and WWW' (the non-zero subtraction output does not help to detect the errors causing wrong EE', WW', EEE', WWW').
[Figure: error detection rate (axis from 0.2 to 1) versus number of nodes in the MPSoC (9, 16, 25, 100, 225, 400), for single, double and triple errors.]

Error Detection Rate in Multi-hop Logic
[Figure: two panels, 3-hop path and 2-hop path.]
• The error detection rate varies only slightly across topologies.
• The error detection rate of the 2-hop logic improves as the number of errors injected into the logic increases, because more IIR can be exploited.

Residual Error Rate Comparison
• The proposed method
  – Reduces the residual error rate by two orders of magnitude over TMR
  – Varies the error detection rate only slightly
Flit Throughput and Latency
• The number of faulty links in each topology is increased until only the underlying 2D-mesh remains

Area, Power and Delay
                    LBDRhr without     LBDRhr with Proposed    LBDRhr with
                    Error Detection    Error Detection         TMR
Area (µm²)          342 (100%)         363 (106.1%)            806 (235.7%)
Delay (ns)          0.495 (100%)       0.54 (109.1%)           0.51 (103%)
Power, dyn. (µW)    199.97 (100%)      207.27 (103.7%)         267.39 (133.7%)
Power, leak. (µW)   1.8084 (100%)      1.8405 (101.8%)         4.0969 (226.5%)

Conclusions
• For transient errors, our method reduces the residual error rate and the average power consumption by up to 200x and 30%, respectively, over triple modular redundancy.
• For permanent errors, the proposed method is able to cover permanent failures of all the long-range links and 80% of the failure combinations of the 2D-mesh links.

Thank you!

Error Detection for Deroute Logic
• The mutual exclusivity of the four deroute directions is regarded as a new form of inherent information redundancy
    Nderoute = ~DR[0] & ~DR[1]
    Ederoute = ~DR[0] & DR[1]
    Wderoute = DR[0] & ~DR[1]
    Sderoute = DR[0] & DR[1]
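The deroute equations decode the two DR bits into four mutually exclusive direction signals, so more than one asserted direction can only result from an upset. A minimal Python sketch of the decoder and such a check follows. The function names are hypothetical, and the rule "flag an error when more than one direction is asserted" is an assumption drawn from the word "exclusive" on the slide, not a description of the authors' checker.

```python
def deroute_decode(dr0, dr1):
    """Decode the deroute bits DR[0], DR[1] into four direction signals."""
    return {
        "N": (not dr0) and (not dr1),   # Nderoute = ~DR[0] & ~DR[1]
        "E": (not dr0) and dr1,         # Ederoute = ~DR[0] &  DR[1]
        "W": dr0 and (not dr1),         # Wderoute =  DR[0] & ~DR[1]
        "S": dr0 and dr1,               # Sderoute =  DR[0] &  DR[1]
    }


def deroute_error(signals):
    """Assumed check: the directions are exclusive, so more than one
    asserted signal is treated as an error."""
    return sum(bool(v) for v in signals.values()) > 1


signals = deroute_decode(dr0=False, dr1=True)   # selects East
print(deroute_error(signals))

signals["S"] = True                             # model an upset on Sderoute
print(deroute_error(signals))
```

Note that a check of this form would miss an upset that clears the only asserted signal; catching that case would need a stricter "exactly one" rule, which the slide does not state.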