
Transient and Permanent Error Control for High-End Multiprocessor Systems-on-Chip
Qiaoyan Yu¹, José Cano², José Flich², Paul Ampadu³
¹University of New Hampshire, USA
²Universitat Politècnica de València, Spain
³University of Rochester, USA

Outline
•  Introduction & Motivation
   –  Impact of permanent and transient errors on NoC routers
   –  Advanced topologies
•  Proposed method
   –  LBDRhr
   –  Transient error control in LBDRhr
•  Experimental results
•  Summary and conclusions
The 6th ACM/IEEE International Symposium on Networks-on-Chip. May 9-11, 2012, Lyngby, Denmark

Introduction
•  Types of MPSoCs:
   –  Application-specific
      ·  Fully irregular topologies
      ·  System design totally customized
      ·  E.g. Spidergon STNoC
   –  High-end
      ·  Regular structures (2D mesh-based topologies)
      ·  E.g. Tilera
This work focuses on high-end MPSoCs.

Introduction
•  Critical challenge in current NoCs: RELIABILITY
   –  Permanent errors
      ·  E.g. due to defective components (links, routers)
      ·  Solution based on fault-tolerant routing
      ·  Logic-Based Distributed Routing (LBDR)
   –  Transient errors
      ·  E.g. due to particle strikes
      ·  Solution based on error control coding
      ·  Inherent information redundancy (IIR)
Combining LBDR and IIR could be a good solution for addressing both permanent and transient errors in NoCs.

Introduction & Motivation
•  Problem: neither LBDR nor IIR can be applied to advanced NoC topologies and configurations
   –  LBDR approach
      ·  Designed for 2D meshes
      ·  Each router is connected to one neighbouring router in each dimension and direction
      ·  Not ready for transient errors
   –  IIR approach
      ·  Designed for XY routing
      ·  Not suitable for more advanced routing solutions
      ·  Not ready for permanent faults
[Figure: router with North, South, East, West and Local (PE) ports]

Advanced Topologies
[Figure: three panels (diagonal mesh, flattened butterfly, 2D-mesh with express channels), with example NNN and EE ports labelled. Link legends: 1-hop links, 2-hop diagonal links, 2-hop straight links, 3-hop links.]
The initial 2D-mesh is the underlying topology!

Proposed Ideas
•  To address fault tolerance for advanced topologies:
   –  Redesign the LBDR mechanism: LBDRhr (LBDR for high-radix networks)
      ·  Supports adaptive routing algorithms
      ·  Uses 2 virtual channels
      ·  Deadlock-free for the high-radix topologies defined
   –  Develop a new method to detect transient errors in LBDRhr logic
      ·  Exploits the inherent information redundancy in LBDRhr to significantly reduce the error control overhead

NoC Router Functionality
•  Compute the routing direction for the next hop
•  Pass the packet to its intended output port
Note: 24 is the maximum number of routing ports per router, but not all of them need to be implemented; the number depends on the topology.

Permanent Error Management
•  Previous method: Logic-Based Distributed Routing (LBDR)
   –  Four routing ports per switch (North, South, East, West)
   –  Two sets of bits: routing bits (Rxy, 2 per output port) and connectivity bits (Cx, 1 per output port)
   –  Minimal path support
[Figure: LBDR logic. Two comparators (CMP) take Xcurr/Xdst and Ycurr/Ydst and produce E', W' and N', S'. These signals are combined with routing bits (e.g. RNE, RNW) and connectivity bits (e.g. CN) to produce the output-port signals (e.g. N).]

Permanent Error Management
•  LBDRhr
   –  Tolerates permanent link and router failures
   –  Implemented with three basic logic blocks
      ·  1-hop, 2-hop and 3-hop ports
   –  Uses a few configuration bits to store local information about the neighboring routers
      ·  Rxy: 8 configuration bits for routing purposes
      ·  DRx: 2 bits for two deroute options (special cases) at every input port
      ·  Cx: 1 connectivity bit per output port

LBDRhr logic (common part)
Relative direction of the message's destination:
•  XXX' set: destination is at least three hops away in the X direction
•  XX' set: destination is at least two hops away in the X direction
•  X' set: destination is at least one hop away in the X direction
•  If XX' is set, then X' is set
•  If XXX' is set, then XX' and X' are set

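The relative-direction flags on this slide can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' hardware: the function name `direction_flags`, the dictionary output and the coordinate convention (x grows eastward, y grows southward) are all hypothetical choices made for the example.

```python
def direction_flags(x_curr, y_curr, x_dst, y_dst, max_hops=3):
    """Return LBDRhr-style relative-direction flags as a dict of booleans.

    Assumed convention (not taken from the slides): x grows eastward and
    y grows southward, so a larger x_dst means the destination lies east.
    """
    dx = x_dst - x_curr
    dy = y_dst - y_curr
    flags = {}
    for hops in range(1, max_hops + 1):
        # "E" * hops + "'" yields E', EE', EEE' for 1, 2 and 3 hops.
        flags["E" * hops + "'"] = dx >= hops
        flags["W" * hops + "'"] = -dx >= hops
        flags["S" * hops + "'"] = dy >= hops
        flags["N" * hops + "'"] = -dy >= hops
    return flags


# Destination two columns to the east, same row.
flags = direction_flags(x_curr=1, y_curr=2, x_dst=3, y_dst=2)
print(flags["E'"], flags["EE'"], flags["EEE'"])  # True True False
```

Because every flag is a threshold on the same offset, the implications stated on the slide (XX' implies X', XXX' implies XX' and X') hold by construction in this sketch.
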
LBDRhr logic (adaptive part)
•  3-hop ports: one gate per output signal
   e.g.: NNNlbdr = NNN' & Cnnn
•  Routing restrictions (at 1-hop ports) taken into account
   e.g.: N'' = (N' & E' & Rne) | (N' & W' & Rnw) | (N' & !E' & !W')
•  1-hop ports: one gate per output signal (3-hop and 2-hop ports have priority)
   e.g.: N''' = N'' & Cn & !3hop & !2hop
•  2-hop ports: one gate per output signal (3-hop ports have priority)
   e.g.: NElbdr = N' & E' & Cne & !3hop
•  In case of no solution at all (non-minimal path support): take the configured deroute option at the switch

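The example equations of the adaptive part can be written out as a small Python sketch. The function names and the boolean arguments are hypothetical; in particular, `any_3hop` and `any_2hop` are assumed to mean "some 3-hop (or 2-hop) output has already been selected", which is how the priority notes on the slide are read here.

```python
def nnn_lbdr(nnn, cnnn):
    """3-hop port example: NNNlbdr = NNN' & Cnnn."""
    return bool(nnn and cnnn)


def n_restricted(n, e, w, rne, rnw):
    """Routing restrictions: N'' = (N' & E' & Rne) | (N' & W' & Rnw) | (N' & !E' & !W')."""
    return bool((n and e and rne) or (n and w and rnw) or (n and not e and not w))


def n_lbdr(n2, cn, any_3hop, any_2hop):
    """1-hop port example: N''' = N'' & Cn & !3hop & !2hop."""
    return bool(n2 and cn and not any_3hop and not any_2hop)


def ne_lbdr(n, e, cne, any_3hop):
    """2-hop port example: NElbdr = N' & E' & Cne & !3hop."""
    return bool(n and e and cne and not any_3hop)


# Destination to the north-east, with the N-then-E turn forbidden (Rne = 0):
n2 = n_restricted(n=True, e=True, w=False, rne=False, rnw=True)
print(n2)  # False
print(n_lbdr(n2, cn=True, any_3hop=False, any_2hop=False))  # False
```

In this reading, a 1-hop request is raised only when the routing bits allow the turn, the link is present and no longer-range port has already been chosen.
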
LBDRhr logic (escape part)
[Figure]

Permanent Error Management
•  Deadlock-free routing example
[Figure: example network with nodes numbered 0 to 15 and a deroute point marked ("Deroute here"). Panel VC0: faulty 2D-mesh with express channels. Panel VC1: faulty 2D-mesh.]

Previous Transient Error Control Methods
•  Previous methods: flooding [1], triple modular redundancy [2], IIR [3]
•  Limitations of previous methods
   –  Need knowledge of error locations
   –  Consume large link switching power
   –  Increase area overhead, or
   –  Are limited to XY routing
[1] R. Mǎrculescu, ISVLSI'03.
[2] A. Yanamandra, et al., ASP-DAC'10.
[3] Q. Yu, et al., NOCS'11.

New Inherent Information Redundancy Extraction
•  Forbidden signal patterns in routers are regarded as inherent information redundancy (IIR)
•  More IIR is found in an LBDRhr-based router
[Figure: example request patterns (request to West port, request to East port), with and without error, for five cases: (a) switched single-request; (b) switched multi-request; (c) multi-switched request; (d) bidirectional switched-request; (e) mute-request. Annotations: non-header; opposite directions; two flips; header & no output.]

Error Detection for CMP in Router
•  The forbidden signal patterns of opposite directions are used to further constrain the valid signal patterns
   Err1.1 = WWW' & EEE'
   Err2.1 = WWW' & !(WW' & W' & EE'_n & E'_n)
   Err3 = (A_n & (E' | W')) | (A & E'_n & W'_n)
   A: internal node in CMP

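The three error equations for the X-dimension comparator can be evaluated with the following Python sketch. The function name `cmp_errors` and the dictionary encoding are hypothetical, and the suffix `_n` is assumed to denote the complemented signal; the internal node A is treated as an opaque boolean input.

```python
def cmp_errors(sig, a):
    """Evaluate Err1.1, Err2.1 and Err3 for the X-dimension comparator.

    sig -- dict of boolean flags "W'", "WW'", "WWW'", "E'", "EE'", "EEE'"
    a   -- value of the internal CMP node A (assumed boolean)
    """
    w, ww, www = sig["W'"], sig["WW'"], sig["WWW'"]
    e, ee, eee = sig["E'"], sig["EE'"], sig["EEE'"]

    # Err1.1: opposite 3-hop directions asserted together.
    err1_1 = www and eee
    # Err2.1: WWW' asserted without the pattern it implies.
    err2_1 = www and not (ww and w and not ee and not e)
    # Err3: internal node A disagrees with the 1-hop flags.
    err3 = (not a and (e or w)) or (a and not e and not w)
    return {"Err1.1": bool(err1_1), "Err2.1": bool(err2_1), "Err3": bool(err3)}


# Error-free pattern: destination three hops to the west.
good = {"W'": True, "WW'": True, "WWW'": True,
        "E'": False, "EE'": False, "EEE'": False}
print(any(cmp_errors(good, a=True).values()))  # False

# A transient flip clears WW' while WWW' stays set.
bad = dict(good, **{"WW'": False})
print(cmp_errors(bad, a=True)["Err2.1"])  # True
```
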
Error Detection for Multi-hop Logic
Err2-hops = (NN & SS) | (EE & WW) | (NE & SW) | (SE & NW) | (NE & SE) | (SW & SE)
Err3-hops = (NNE & SSW) | (EEN & SSW) | (EES & WWN) | (SSE & NNW) | (NNN & SSS) | (EEE & WWW) | (NNE & NNW) | (SSW & SSE) | (EEN & EES) | (EES & EEW) | (WWN & WWS)

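Both expressions are an OR over pairs of output requests that should never be asserted together, so a single generic check covers them. The sketch below is a hypothetical illustration: the names `ERR_2HOP_PAIRS` and `pair_error` are not from the slides, and the pair list is copied from the Err2-hops expression exactly as it appears above.

```python
# Forbidden pairs transcribed from the Err2-hops expression on the slide.
ERR_2HOP_PAIRS = [
    ("NN", "SS"), ("EE", "WW"), ("NE", "SW"),
    ("SE", "NW"), ("NE", "SE"), ("SW", "SE"),
]


def pair_error(requests, pairs):
    """Return True if any forbidden pair of requests is asserted together.

    requests -- iterable with the names of the asserted output requests
    pairs    -- list of (name, name) tuples that must not co-occur
    """
    active = set(requests)
    return any(a in active and b in active for a, b in pairs)


print(pair_error({"NN"}, ERR_2HOP_PAIRS))        # False
print(pair_error({"NN", "SS"}, ERR_2HOP_PAIRS))  # True
```

The Err3-hops expression can be handled the same way by passing its own list of pairs.
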
Experimental Results
•  Error detection rate
•  Reliability
•  Flit throughput and latency
•  Area, power and delay

Experimental Results: Setup
•  Multiple NoC topologies
•  LBDRhr routing
•  8-bit address
•  Synthesized with a TSMC 65nm technology
[Figure: logic facilitating error injection in modeling. Labels: Input, Original Gate, MUX, Error Injection Enable, Output.]

Error Detection Rate in CMP
•  No matter how the NoC size changes, the error detection rate for E' and W' is 100% because of the use of the internal node.
•  The error detection rate for EE', WW', EEE' and WWW' is less than 100%.
   –  Only the occurrence of opposite-direction pairs helps to detect errors in EE', WW', EEE' and WWW' (the non-zero subtraction output does not help to detect errors that cause wrong EE', WW', EEE' or WWW' values).
[Figure: error detection rate (y-axis, 0.2 to 1) versus number of nodes in the MPSoC (x-axis: 9, 16, 25, 100, 225, 400) for single, double and triple errors.]

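A toy fault-injection experiment gives some intuition for the E'/W' result. The sketch below is not the authors' experimental setup. It assumes that the internal node A is 1 exactly when the destination lies east or west of the current router, that a single transient error flips either E' or W', and that an opposite-direction check of the form E' & W' is applied alongside Err3. All names are hypothetical.

```python
def err3(a, e, w):
    """Err3 = (A_n & (E' | W')) | (A & E'_n & W'_n)."""
    return bool((not a and (e or w)) or (a and not e and not w))


def opposite(e, w):
    """Assumed opposite-direction check for the 1-hop flags: E' & W'."""
    return bool(e and w)


def detector(a, e, w):
    """Combined check used in this sketch."""
    return err3(a, e, w) or opposite(e, w)


# Error-free (A, E', W') patterns under the assumption stated above.
VALID = [(False, False, False), (True, True, False), (True, False, True)]


def inject(bits, index):
    """Flip one bit, mimicking a mux that selects the inverted gate output."""
    flipped = list(bits)
    flipped[index] = not flipped[index]
    return tuple(flipped)


def detection_rate():
    """Fraction of single flips on E' (index 1) or W' (index 2) that are flagged."""
    trials = [inject(v, i) for v in VALID for i in (1, 2)]
    detected = sum(1 for a, e, w in trials if detector(a, e, w))
    return detected / len(trials)


print(detection_rate())  # 1.0
```

In this toy model no error-free pattern is flagged, and every single flip of E' or W' is caught; the opposite-direction term covers the flips that leave both E' and W' asserted.
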
Error Detection Rate in Multi-hop Logic
[Figure: two panels, 3-hop path and 2-hop path]
•  The error detection rate varies only slightly across topologies.
•  The error detection rate of the 2-hop logic improves as the number of errors injected into the logic increases, because more IIR is available.

Residual Error Rate Comparison
•  The proposed method
   –  Reduces the residual error rate by two orders of magnitude over TMR
   –  Shows only slight variation in the error detection rate

Flit Throughput and Latency
•  The number of faulty links in each topology is increased until only the underlying 2D-mesh remains

Area, Power and Delay

                      LBDRhr without     LBDRhr with Proposed   LBDRhr with TMR
                      Error Detection    Error Detection
Area (µm²)            342 (100%)         363 (106.1%)           806 (235.7%)
Delay (ns)            0.495 (100%)       0.54 (109.1%)          0.51 (103%)
Dynamic power (µW)    199.97 (100%)      207.27 (103.7%)        267.39 (133.7%)
Leakage power (µW)    1.8084 (100%)      1.8405 (101.8%)        4.0969 (226.5%)

Conclusions
•  For transient errors, our method reduces the residual error rate by up to 200x and the average power consumption by up to 30%, compared with triple modular redundancy.
•  For permanent errors, the proposed method covers permanent failures of all the long-range links and 80% of the failure combinations of the 2D-mesh links.

Thank you!
Error Detection for Deroute Logic
•  The mutual exclusivity of the four deroute directions is regarded as a new inherent information redundancy
   Nderoute = ~DR[0] & ~DR[1]
   Ederoute = ~DR[0] & DR[1]
   Wderoute = DR[0] & ~DR[1]
   Sderoute = DR[0] & DR[1]

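The deroute decoding and the exclusivity check can be sketched in Python as follows. The function names are hypothetical, and treating "anything other than exactly one asserted direction" as an error is an assumption about how the exclusivity property would be checked.

```python
def deroute_decode(dr0, dr1):
    """Decode the two deroute bits DR[0], DR[1] into four direction signals."""
    return {
        "N": (not dr0) and (not dr1),
        "E": (not dr0) and dr1,
        "W": dr0 and (not dr1),
        "S": dr0 and dr1,
    }


def deroute_error(signals):
    """Flag an error unless exactly one deroute direction is asserted (assumed check)."""
    return sum(bool(v) for v in signals.values()) != 1


signals = deroute_decode(dr0=False, dr1=True)
print(signals["E"], deroute_error(signals))  # True False

# A transient flip that also raises the N signal breaks exclusivity.
corrupted = dict(signals, N=True)
print(deroute_error(corrupted))  # True
```
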