On exploiting redundancy
Sandeep Gupta
Ming Hsieh Department of Electrical Engineering
University of Southern California
sandeep@usc.edu
Thanks
• Funding
– NSF and SRC
• Colleagues
– Murali Annavaram
– Melvin A. Breuer
– Jaechul Cha
– Da Cheng
– Keith Chugg
– Tong-Yu Hsieh
– Hsunwei Hsiung
– Zhigang Jiang
– K.-J. Lee
– Mohammad Mirza-Aghatabar
– Antonio Ortega
– Shideh Shahidi
– Doochul Shin
Redundancy
Historical definitions and uses
World English Dictionary (via dictionary.com)
• redundancy
— n , pl -cies
1. a. the state or condition of being redundant or
superfluous, esp superfluous in one's job
b. (as modifier): a redundancy payment
2. excessive proliferation or profusion, esp of superfluity
3. duplication of components in electronic or mechanical
equipment so that operations can continue following
failure of a part
4. repetition of information or inclusion of additional
information to reduce errors in telecommunication
transmissions and computer processing
Computing Dictionary (via dictionary.com)
• redundancy definition
1. The provision of multiple interchangeable components to perform
a single function in order to provide resilience (to cope with
failures and errors). Redundancy normally applies primarily to
hardware. For example, a cluster may contain two or three
computers doing the same job. They could all be active all the
time thus giving extra performance through parallel
processing and load balancing; one could be active and the others
simply monitoring its activity so as to be ready to take over if it
failed ("warm standby"); the "spares" could be kept turned off and
only switched on when needed ("cold standby"). Another
common form of hardware redundancy is disk mirroring.
2. data redundancy.
(1995-05-09)
Classical forms of redundancy in digital systems
• Spares
– Circuit modules: Spare copies of modules
• Cold or hot standby
• Dedicated or shared
– Computational tasks: Executing spare copies of tasks
• In space or in time
– Data to be communicated: Spare bits/bytes
• In space or in time; coding is typically used
– Data to be stored: Spare bits/bytes/…
• In space; coding is typically used
Classical motivations for using redundancy
• To improve resilience to failures or errors during
the system’s operational life
– Circuits and computation
• To continue operation after a failure occurs
• Typically limited to systems that demanded high-reliability,
high-availability, …
– Data
• To ensure integrity despite errors during communication or
storage
• In memories and disks, redundancy used to
compensate for fabrication-induced failures
Redundancy
Our motivation
Yield decreasing with scaling
• Lower initial
yield
• Yield learning
rate not improved
• Lower mature
yield
R. C. Leachman and C. N. Berglund, “Systematic Mechanisms Limited Yield (SMLY)
Assessment Study”, International SEMATECH, Doc. #03034383A-ENG, March 2003
Causes of decreasing yield
• Continue to fabricate chips with finer feature
sizes – widths, separations, and so on
• Some key tools for fabrication not being
scaled due to cost considerations, e.g., 193nm
light source for photolithography
• Continual increase in number of fabrication
steps and masks
• Continued increase in chip size
Yield of a chip with area A
• When a chip has zero spares and only chips with zero defects are counted in "yield"
– Poisson model: Y = e^(−D0·A·κ)
– Negative binomial model: Y = (1 + D0·A·κ/β)^(−β)
where D0 is the defect density, κ the kill ratio, and β the clustering factor
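The two models can be compared numerically; a small sketch (the parameter values below are illustrative, not taken from the slides):

```python
import math

def yield_poisson(d0, area, kappa):
    """Poisson yield model: Y = exp(-D0 * A * kappa)."""
    return math.exp(-d0 * area * kappa)

def yield_neg_binomial(d0, area, kappa, beta):
    """Negative binomial yield model: Y = (1 + D0*A*kappa/beta)^(-beta)."""
    return (1.0 + d0 * area * kappa / beta) ** (-beta)

# Illustrative numbers: D0 = 0.5 defects/cm^2, A = 1 cm^2,
# kill ratio = 0.8, clustering factor beta = 2.
y_p = yield_poisson(0.5, 1.0, 0.8)
y_nb = yield_neg_binomial(0.5, 1.0, 0.8, 2.0)
```

For the same defect density, clustering (finite β) predicts a higher yield than the Poisson model, and the negative binomial model converges to the Poisson model as β → ∞.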
A simple way to view the problem
• Each passing year is likely to bring
– Increasing defect densities
– Increasing variations in values of circuit
parameters
– Increasing circuit sizes
Our motivation
• Use redundancy in computational logic to
improve yield, despite increasing fabrication
challenges
– Increasing defect densities
– Increasing variations in values of circuit
parameters
• Redundancy has been used for decades to
improve memory yields
– New approaches to optimally exploit redundancy
for increasing fabrication challenges
Our motivation
• Yield enhancement via design in real world
– Spare rows/columns used in SRAMs since 1970s
– IBM CELL processor has one spare core
– Nvidia GTX 480/470/465 have 1/2/5 disabled
processors
(Die photos: IBM CELL; Nvidia GTX 470)
Our concept of redundancy
• Any functionality that is not necessary for
acceptable system operation
– Functionality = circuit module
– Acceptable =
• General purpose computing
– Guaranteed correctness of output values, plus
– Performance limits
• Media applications
– Output values within user-specified limits, in terms of PSNR,
BER, MSE, …
Our primary figures of merit
• Yield = % of fabricated chips deemed
acceptable and shipped to customers
• Die area: A measure of chip fabrication cost
• These two are combined to indicate total
functionality obtained from each wafer
• Yield/area
• GOPS/area or GOPS/wafer
• Revenue/wafer
• Revenue/total-cost
Yield-per-area (YPA)
• Consider a wafer
– Without our approach
• For the original design, each die has area 𝐴
• The wafer contains, say, 100 dies
• The yield is, say, 30%, hence
– Chips sold: 𝑐1 , 𝑐2 , …, 𝑐30 ; and chips discarded: 𝑐31 , 𝑐32 , …, 𝑐100
– Our approach changes the design
• Our new design is larger and hence each die has area, say, 1.25𝐴
• Hence, the wafer contains 80 dies
• As our new design can tolerate some defects, yield increases, say, to 50%
– Chips sold: c*1, c*2, …, c*40; chips discarded: c*41, c*42, …, c*80
• Assuming the same cost per wafer
– Benefit of our approach = 40/30 = 1.33
– Yield-per-area gain = (50%/1.25A) / (30%/A) = 1.33
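The yield-per-area arithmetic of this example, as a sketch:

```python
def ypa_gain(y_orig, a_orig, y_new, a_new):
    """Yield-per-area of the redundant design relative to the original."""
    return (y_new / a_new) / (y_orig / a_orig)

# Numbers from the slide: 30% yield at area A vs. 50% yield at area 1.25A.
g = ypa_gain(0.30, 1.0, 0.50, 1.25)
```

The gain equals the ratio of sold chips per wafer (40/30 ≈ 1.33), since wafer cost is held fixed.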
GOPS-per-area or GOPS-per-wafer
• What if our approach produces chips with different
performance
– Without our approach
• For the original design, each die has area 𝐴
• The wafer contains, say, 100 dies
• The yield is, say, 30%, hence
– Chips sold: c1, c2, …, c30, each with performance P_orig
– Chips discarded: c31, c32, …, c100
• Total performance per wafer: 30·P_orig
GOPS-per-area or GOPS-per-wafer
• What if our approach produces chips with different
performance
– Our approach changes the design
• Our new design is larger and hence each die has area, say, 1.1𝐴
• Hence, the wafer contains 91 dies
• As our new design can tolerate some defects, yield increases, say,
to 40%
– Chips sold: c*1, c*2, …, c*36, with, say,
» 26 chips with performance P_orig, and
» 10 chips with performance 0.9·P_orig
– Chips discarded: c*37, c*38, …, c*91
• Total performance per wafer: 26·P_orig + 10·0.9·P_orig = 35·P_orig
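The same arithmetic in code, with the slide's example numbers plugged in directly (91 dies at 40% yield giving 36 sellable chips):

```python
# Baseline wafer: 100 dies at 30% yield, each sold chip at performance P_orig.
sold_orig = 30
perf_orig_wafer = sold_orig * 1.0      # in units of P_orig

# Enhanced wafer (slide's numbers): 26 full-performance chips
# plus 10 chips at 0.9x performance.
full_perf, degraded = 26, 10
perf_new_wafer = full_perf * 1.0 + degraded * 0.9
```

Even with a smaller die count and some degraded chips, total performance per wafer rises from 30·P_orig to 35·P_orig.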
General purpose processors
Key idea
• In general purpose computing systems
– Large gap exists between
• Correctness requirements at circuit layer
• Functional correctness for users
– Relax the unnecessarily stringent correctness
requirements at circuit layer
• Guarantee the correctness requirements at the
– μArch layer
– ISA layer
– System/software layer
– User layer
Module X is redundant in what sense?
• We could do everything we can do now, if we
did not have module X
• Why did we add X?
– To improve performance
– To reduce power
–…
Implicit redundancy: Examples
• Storage
– In absence of last-level cache (LLC), main memory can
provide alternative storage
– One part of LLC can provide alternative storage for another
part
• Computation
– In absence of a multiplier, adder and shifter can provide
the functionality
– In absence of a floating point unit (FPU), integer units can
provide the functionality
• Controllers, predictors, …
–…
Observation
• Anything beyond Intel 4004 is redundant
How to exploit implicit redundancy?
• If module X does not work in one particular
chip
– Can we use the chip as-is?
– Or, should we simply ‘disable’ X and use the chip?
– Or, can we find some other way to use the chip?
Objectives
• To guarantee functional correctness of a microprocessor in presence of a defect in a module
– At negligible cost in circuit area
– At minimal impact on circuit performance
• Performance defined at various levels
– Circuit-level: clock period, in ps
– Architecture-level: instructions per cycle (IPC)
– Combined: instructions per second (IPS); alternatively, operations per second (OPS)
• Which chips are affected?
– All chips (including defect-free ones), vs.
– Only chips with the defect
Key ideas
• Enable sale of some defective chips that would have
been thrown away without our method(s)
• Explore all layers to identify method(s) that
– Guarantee correctness of user results
– Minimize performance penalty for all chips
• Preferably zero performance penalty for non-defective chips
– Minimize area overheads
• Typically by exploiting features that exist in modern
microprocessors
Case study-1: Branch predictor
• A commonly used module in all types of microprocessors
– Instead of waiting to find out which direction a branch instruction will take, predict the branch and continue with pre-fetch, decode, …
• Since prediction is not guaranteed to be correct 100% of the time, any design that uses a branch predictor also includes
– A module that detects every mis-prediction
– A module that helps recover from a mis-prediction by flushing all incorrect values and continuing along the correct branch
• Correctness of any program output is guaranteed even if the branch predictor module is faulty
– The above modules identify and recover from any mis-prediction caused by the fault
• Hence, a fault in the branch predictor only causes performance degradation; correctness is still guaranteed
Estimating performance degradation
• A gate-level design of gshare branch predictor implemented in
SimpleScalar processor simulator
• Single stuck-at faults on each line studied for benchmark
programs
(Figure: gate-level gshare predictor — the branch instruction address is XORed with the n-bit branch history table (BHT) shift register by an array of n 2-input XOR gates; the result drives an n-to-2^n decoder that selects one of the 2^n 2-bit FSMs in the pattern history table (PHT), whose state bits Cnt1(j), Cnt0(j) generate the prediction P (P_taken/P_Not-taken); the actual outcome A (Taken/Not-taken) updates the tables.)
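A behavioral sketch of a gshare predictor (hypothetical table size; this is not the gate-level model used in the study) illustrates the indexing scheme described above:

```python
class GsharePredictor:
    """Global history XORed with branch-address bits indexes a table of
    2-bit saturating counters (the PHT of 2-bit FSMs)."""

    def __init__(self, n=10):
        self.n = n
        self.history = 0                  # n-bit global history shift register
        self.pht = [1] * (2 ** n)         # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # gshare: XOR of history and low-order branch-address bits
        return (pc ^ self.history) & ((1 << self.n) - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.n) - 1)
```

A stuck-at fault anywhere in this structure can only change the prediction bit; the detect-and-recover machinery still guarantees correct program outputs.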
Degradation in branch-prediction accuracy

Module              Line type  # faults (% of total)   twolf SA0       twolf SA1       gzip SA0        gzip SA1
                                                       Min / Max       Min / Max       Min / Max       Min / Max
Decoder             A          72 (0.0002%)            0.00 / 0.00     0.40 / 1.16     0.00 / 0.00     -0.04 / 0.06
                    D          1,048,576 (3.63%)       0.00 / 0.04     0.00 / 0.00     0.00 / 0.00     0.00 / 0.00
                    E          524,288 (1.82%)         0.00 / 0.00     18.27 / 18.27   0.00 / 0.00     26.77 / 26.77
FSMs                MSB        13,634,188 (47.24%)     0.00 / 0.00     0.00 / 0.00     0.00 / 0.00     0.00 / 0.00
                    LSB        13,634,188 (47.24%)     0.00 / 0.00     0.00 / 0.00     0.00 / 0.00     0.00 / 0.00
Branch Inst. Addr.  N/A        36 (0.0001%)            -0.11 / 0.20    -0.11 / 0.24    -0.03 / 0.20    -0.03 / 0.20
BHT                 N/A        36 (0.0001%)            0.98 / 13.23    0.98 / 14.52    -0.02 / 6.36    -0.02 / 6.36
XOR                 Input      36 (0.0001%)            -0.02 / 0.98    -0.02 / 0.98    -0.04 / 0.02    -0.04 / 0.02
                    Output     36 (0.0001%)            0.55 / 1.61     0.55 / 1.62     -0.02 / 0.23    -0.02 / 0.23
P                   N/A        2 (0.000006%)           42.84 / 42.84   48.49 / 48.49   60.54 / 60.54   33.01 / 33.01
A                   N/A        2 (0.000006%)           37.69 / 37.69   39.17 / 39.17   60.37 / 60.37   31.18 / 31.18

• 97.27% of the faults induce no measurable degradation
• 1.82% result in 0–1% degradation
• 0.91% induce degradation larger than 1%
• Faults inducing less than 1% loss in mis-prediction accuracy result in less than 2% performance degradation
Summary for branch prediction
• Branch predictors significantly improve performance of architectures
• An overwhelming percentage of faults in branch predictors are acceptable
– The corresponding faulty versions of the predictor still provide most of the above performance improvement and guarantee correctness of results
• Hence, chips with such faults can be used without any reconfiguration
A new type of redundancy
• Combinationally redundant fault (CRF)
– Does not change the IO behavior of a combinational block of logic
• Sequentially redundant fault (SRF)
– Does not change the IO behavior of a sequential logic block
• Faults in the branch predictor change the IO behavior of the entire processor chip
– Yet, at the architecture level, program outputs are guaranteed to be correct
– A new category – architecturally redundant faults (ARF)
A new type of redundancy
(Venn diagram showing the containment of the CRF, SRF, ARF, and PA-ARF fault classes.)
Case study-2: LLC on chip
• LLC (last-level cache on chip) is the largest individual module (~MB scale)
– Capacity is expected to grow in future multicore microprocessors
– Implemented with densely packed, small devices
– Susceptible to process variation and defects
• Existing hardware-based approaches
– Beyond certain levels, spare rows and columns cause high overheads
– Require circuit modification and may be under- or over-designed
OS-level LLC defect tolerance
(Figure: an SRAM cache with N_S sets and N_W ways, organized in blocks of S_B bytes, alongside DRAM main memory; each page-frame maps to one page-cover, a contiguous region of cache sets; defective locations are marked X.)
• Page-cover: a set of contiguous cache locations which are mapped by one page-frame in main memory
• OS controls the allocation/de-allocation of page frames
• Page-cover disabling (PCD): OS blocks all page-frames that map to faulty page-covers
• In PCD, defective page-covers are not used during system operation
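The page-cover bookkeeping can be sketched as follows. The set/cover geometry matches the 3MB case study (4096 sets, 64 sets per cover, hence 64 page-covers); the frame-numbering scheme is an assumption for illustration:

```python
SETS_PER_COVER = 64                      # 64 consecutive cache sets per page-cover
NUM_SETS = 4096                          # 3MB, 12-way, 64B blocks -> 4096 sets
NUM_COVERS = NUM_SETS // SETS_PER_COVER  # 64 page-covers

def faulty_covers(faulty_sets):
    """Map defective cache sets to the page-covers that must be disabled."""
    return sorted({s // SETS_PER_COVER for s in faulty_sets})

def usable_page_frames(all_frames, faulty_sets):
    """PCD: the OS blocks every page-frame that maps to a faulty page-cover.
    Here a frame's cover is assumed to be (frame number mod NUM_COVERS)."""
    bad = set(faulty_covers(faulty_sets))
    return [f for f in all_frames if f % NUM_COVERS not in bad]
```

One defective page-cover removes only 1/64 of the frames from use, which is why the capacity loss is small.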
Defective LLC
• LLC: 3MB, 12-way, 64-byte blocks; 48 subarrays, each with one spare row and two spare columns
• Page-cover: 64 consecutive cache sets
– Total of 64 page-covers in the 3MB cache case study
(Figure: cumulative percentage of chips after spare repair vs. number of defective page-covers (0–16), at a bit-cell failure rate of 2.5×10⁻⁶; 77% is marked on the curve.)
• Most defective caches have limited numbers of defective page-covers
• Capacity loss in cache and memory is small
Efficiency: EGOPS-per-area
(Figure: EGOPS-per-area (×10⁻⁷ per μm²) vs. normalized area for designs labeled (N_SA, N_sr, N_sc) — number of subarrays, spare rows, and spare columns; design points include (12, 1, 1), (48, 1, 1), (48, 1, 2), and (96, 1, 3), at bit-cell failure rates 1.0×10⁻⁶ and 2.5×10⁻⁶. The optimum of "spares + PCD" exceeds the optimum of "spares only" by 1.5X–2.5X.)
• Area: enhanced CACTI, 32nm
• Implementation: modified Linux kernel 2.6.32; dual-core processor with shared 12-way 3MB LLC
• Benchmark: SPEC CPU2000
• High efficiency gain over the traditional approach
• High flexibility compared to hardware-based approaches
Case study-3: Computational units
• Integer adders in ALUs
• Floating point units (FPUs)
ALU-disabling and adder error-masking
• Exploit existing features in modern processors
– Multiple ALUs and sophisticated instruction
scheduling
• ALU-disabling
– Prevent instruction scheduler from issuing
instructions to a defective ALU
• Adder error-masking
– Re-issue to a defect-free ALU any add instruction
that was issued to an ALU with a defective adder
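A toy model of the two policies (the ALU/scheduler representation here is hypothetical, not the processor's actual issue logic):

```python
class ALU:
    def __init__(self, defective=False, adder_defective=False):
        self.defective = defective              # whole ALU unusable
        self.adder_defective = adder_defective  # only the adder is defective

def issue(op, alus):
    """Return the index of the ALU that finally executes `op`."""
    # ALU-disabling: never issue to a fully defective ALU.
    usable = [i for i, a in enumerate(alus) if not a.defective]
    target = usable[0]
    # Adder error-masking: re-issue an add that landed on a defective adder.
    if op == "add" and alus[target].adder_defective:
        target = next(i for i in usable if not alus[i].adder_defective)
    return target
```

An ALU with only a defective adder stays usable for every non-add operation, which is why error-masking wastes less capacity than disabling the whole ALU.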
(Figure: normalized efficiency of the ALU approach (0.98–1.12) vs. area of two ALUs as a percentage of chip area (5%, 15%), for unenhanced yields Y_unenc = 0.7, 0.5, and 0.3, across benchmarks ammp, equake, apsi, applu, swim, art, bzip2, gcc, gzip, mcf, vortex, and mgrid.)
• Efficiency: total performance per unit fabrication cost
– Normalized to the unenhanced processor's efficiency
• With negligible area overhead, our approach enables the use of defective chips that would otherwise be discarded
ALU approach breakdown
(Figure: normalized efficiency (0.98–1.12) broken down by tier, for ALU area of 5%, 10%, and 15% and unenhanced yields Y_unenc = 0.7, 0.5, and 0.3.)
• Tier1: defect-free
• Tier2: one defective ALU, defects not in the adder of the defective ALU
• Tier3: one defective adder
Floating point multiplication emulation
(Figure: normalized efficiency (0.95–1.25) vs. FPU area percentage (10%, 15%, 20%) for defect-free yields Y_dfree = 70%, 50%, and 30%, across benchmarks ammp, equake, apsi, swim, and art; the integer benchmarks bzip2, gcc, gzip, mcf, vortex, and parser are unaffected.)
• No area overheads; efficiency always larger than 1
• More than 10% improvement in some benchmarks when Y_dfree is low
• 2%–5% improvement in most cases when Y_dfree is in the higher ranges
How widely applicable?
• Can we develop such uniquely efficient
approaches for many modules?
An example processor
Pentium III, Katmai
* Hsien-Hsin Lee, Georgia Tech Advanced Computer Architecture Lecture,
users.ece.gatech.edu/~leehs/ECE6100/slides/Lec12-p6_P4.ppt, 2001
Module area

Pentium III [3]                         AMD Bulldozer [1][2]
Caching modules             ~46%
Datapath modules            ~16%
Queuing and speculation
modules                     ~21%

No on-chip LLC              0%          2MB on-chip LLC          41.20%
Fetch + L1i cache           11.14%      L1i cache                2.30%
L1d cache                   6.83%       L1d cache                2.15%
dTLB                        1.80%
FPU                         4.72%       FPU                      10.47%
ALU (int. cluster)          4.37%       ALU (int. cluster)       5.55%
L/SU                        2.75%       L/SU                     9.09%
BTB                         2.17%       Branch predictor         5.36%
Decoder                     7.88%       Decoder                  5.31%
Inst. scheduler & Qs        6.49%       Inst. scheduler & Qs     5.30%
                                        Fetch                    2.15%
                                        Physical register        1.91%
Rename                      1.58%       Rename                   3.67%

• With small or zero overheads, defects in these modules can be tolerated without producing errors

[1] "Bulldozer x86-64 Core," IEEE Int'l Solid-State Circuits Conf., IEEE Press, 2011, pp. 80-81
[2] T. Fischer et al., "Design Solutions for the Bulldozer 32-nm SOI 2-Core Processor Module in an 8-Core CPU," IEEE Int'l Solid-State Circuits Conf., IEEE Press, 2011
[3] Hsien-Hsin Lee, Georgia Tech Advanced Computer Architecture Lecture, users.ece.gatech.edu/~leehs/ECE6100/slides/Lec12-p6_P4.ppt, 2001
Practical issues
• Design changes
– In some cases, only the software/firmware
– In other cases, low overhead changes to hardware
• Test and diagnosis
– In many cases, need to identify failing
modules/locations
• Information flow and reconfiguration
– In some cases, failure information must be
propagated to system assembly, software
configuration, …
Error tolerance
Introduction
• Concept
– The circuit produces some errors at its outputs
– Yet, the application using the circuit produces acceptable results
• Applications
– Video, images, graphics, audio, communications, error-correcting codes for wireless communications
• Discrete Cosine Transform (DCT) module in JPEG encoders
– More than 50% of single stuck-at faults (SAFs) are acceptable for a typical architecture
• Motion Estimation (ME) module in MPEG encoders
– More than 50% of single SAFs are acceptable for a typical architecture
A case study
• Graphics processors
– Side-by-side comparison of
• A fault-free chip
• Faulty ones from the reject bin
– Comparison for several
• Graphics demos
• Graphics sequences designed to exercise different
capabilities
• About 40 minutes of game play
– About 20% of faulty chips had output errors that
were imperceptible to users
Errors case 3
• Max(MSE) > 0.1, yet perception OK
– Class 0, Chip No. 54
– Max(MSE) = 9.177
– Very tiny dot (Type II)
(Screenshots of the rendered outputs omitted.)
Our other case studies
• JPEG encoder
• MPEG encoder
• Digital answering machine
• Decoder for turbo-codes for wireless communication
• In all cases: a large percentage of hard faults in the largest modules had negligible impact on user experience
Metrics: Circuit level
• Error significance (ES): maximum numerical magnitude by which the values at the outputs of an erroneous circuit deviate from the error-free response
• Error rate (ER): percentage of vectors for which the values at the outputs deviate from the error-free response
– ER = (number of vectors that cause an error) / (total number of input vectors applied)
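A sketch of computing ES and ER for a module against its fault-free version; the stuck-LSB adder fault below is a made-up example, not one of the studied faults:

```python
def es_er(golden_fn, faulty_fn, vectors):
    """Error significance = max output deviation; error rate = fraction of
    applied vectors whose outputs deviate from the error-free response."""
    errors = [abs(faulty_fn(v) - golden_fn(v)) for v in vectors]
    es = max(errors)
    er = sum(1 for e in errors if e != 0) / len(vectors)
    return es, er

# Toy example: an adder whose output LSB is stuck at 0.
golden = lambda ab: ab[0] + ab[1]
faulty = lambda ab: (ab[0] + ab[1]) & ~1
vecs = [(a, b) for a in range(4) for b in range(4)]
es, er = es_er(golden, faulty, vecs)
```

Here every odd sum loses its LSB, so ES = 1 and ER = 0.5 over the exhaustive 2-bit input space.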
Metrics: Circuit to system level
(Figure: JPEG pipeline — Original Image → DCT → Quantization → Huffman Encoding → Encoded JPEG, and Encoded JPEG → Huffman Decoding → Dequantization → IDCT → Retrieved Image; ES and ER are measured at the faulty module, and each fault is labeled FA or FNA.)
• ES and ER for a fault are calculated separately for a module
– In the above example, we inject faults in the DCT module of the JPEG system
• Quality of the image is decided using application-level metrics – PSNR, BER, MSE, …
– using a complete system that contains the faulty module
– Each fault is marked either FNA or FA
How to exploit ET
• During testing
– Classical design
– Classify faults as acceptable and non-acceptable
– Develop new ET test techniques that
• Detect every unacceptable fault
• But not detect any acceptable fault
– Theoretical results
• Proved that perfect ET test sets always exist
– Practical results
• Developed test generators and demonstrated that ET test costs are
comparable to normal test costs
One approach to exploit ET
• Normal design; new analysis and test
– Conventional test: fault-free → sell; faulty → throw away
– ET test: fault-free → sell; faulty but acceptable → sell; faulty and unacceptable → throw away
How to exploit ET
• Acceptable faults are indications of over-specification
of the functionality
• Exploit these during design
– Arithmetic circuits
• Re-optimization of classic architectures for acceptable
approximations
– General circuits
• A synthesis approach to exploit allowable approximate versions of
functions
• An approach to re-design a circuit to exploit allowable
approximations
Results: Approximate redesign

Circuit   ER    % area reduction
C432      1%    17.6
          5%    21.8
C880      1%    6.5
          5%    9.2
C1908     1%    17.0
          5%    20.2
Results: Approximate synthesis
(Figure: % reduction in literals (0–30) vs. error rate (0.000–0.030) for rd73 (7/3/903), clip (9/5/793), sao2 (10/4/496), 5xp1 (7/10/347), Z9sym (9/1/610), sym10 (10/1/1470), and t481 (16/1/5233); legend gives # of inputs / # of outputs / # of literals in the original function.)
• For each circuit, we performed synthesis for various ER values
Results: Approximate adder designs
(Figure: yield (0.5–0.9) vs. RS threshold (0–4.25) for RCA (MAD=1), CLA (MAD=0.24), CSA(v) (MAD=0.34), CSA(f) (MAD=0.43), and KSA (MAD=0.14).)
Explicit redundancy
Using spare copies to improve yield-per-area
Objective
• To derive fundamental properties – strengths
and limitations – of the use of spares for
improving yield-per-area (YPA)
– Focus on use of spares for random logic (RL)
modules, where two modules cannot share spare
copies
– Focus on functional yield, i.e., correct operation at
any speed
– Consider all possible ways in which spares can be
used
Random logic vs. regular memory/logic
Objective
• To answer questions like
– What is the maximum value of original chip yield for which
spares can improve YPA?
– What is the maximum YPA improvement that can be
obtained?
– What rules govern the optimal use of spares?
• What amount of spares provides maximum YPA improvement?
• At what granularity must spares be used to obtain maximum YPA
improvement?
• How should a given budget for spares be used to obtain maximum
YPA improvement?
• Particularly interested in the future, where, without
the use of redundancy, yield → 0
Theory: Simplifying assumptions
• Given system
– Decomposed into modules m1, m2, …, mk of areas a1, a2, …, ak
– At arbitrary granularity, i.e., we allow
• The given circuit to be partitioned in any desired manner
– In particular, we often consider equal-area modules
• At arbitrary levels of granularity, including k→ ∞
• Ideal reconfiguration interconnects, i.e., the interconnects required to use spare copies have
– Negligible area
– Negligible probability of being defective
– Negligible delay and power impact
• Spare modules not idealized, i.e., every spare module has area,
defects, delays, …
• Defect and yield model
– Given defect density
– No clustering
Yield models for module mi with area ai
• Poisson model: Y_mi = e^(−D0·ai·κ)
• Negative binomial model: Y_mi = (1 + D0·ai·κ/β)^(−β)
where D0 is the defect density, κ the kill ratio, and β the clustering factor
Notation
(Figure (a): the original system as a chain of modules m1, m2, …, mk; replicated arrangements of the same modules illustrate spare placement.)

Some ways of using spares
(Figure: spare configurations covering a fraction α of the system.)

More ways of using spares
(Figure: additional spare configurations.)
Baseline: The original system (no spares)
• Consider the original system (no spares)
– Aorig = 1
• Area for all cases normalized with respect to Aorig
– Yorig = Y
• Where, 0 < Y ≤ 1
– YPAorig = Yorig/Aorig = Y
n-way replication of the entire system
– A_new = n
– Y_new = 1 − (1 − Y)^n
• Where 0 < Y ≤ 1
– YPA_new = Y_new / A_new = [1 − (1 − Y)^n] / n
– In terms of yield-per-area, the gain of using spares is
• G = [1 − (1 − Y)^n] / (n·Y)
n-way replication of the entire system
– YPA gain, G = [1 − (1 − Y)^n] / (n·Y)
• Where 0 < Y ≤ 1
– Properties
• G ≤ 1 for all n, i.e., replicating the entire system never improves YPA
• As Y → 0, G → 1
Duplication at finer granularities
• Partition the system into k equal-area modules and duplicate each one, so A_new = 2
• YPA gain, G = Y_new / (2Y) = (2 − Y^(1/k))^k / 2
– As Y → 0, G → 2^(k−1)
• Y_new = [1 − (1 − Y^(1/k))²]^k
– As k → ∞, Y_new → 1
– Extremely useful, if we can find a way to create fine-grain duplication without overheads of reconfiguration interconnects
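These formulas can be sketched directly (ideal reconfiguration interconnect assumed, per the stated simplifying assumptions):

```python
def y_dup(Y, k):
    """Yield after duplicating each of k equal-area modules of a system
    whose original yield is Y."""
    y = Y ** (1.0 / k)                 # per-module yield
    return (1 - (1 - y) ** 2) ** k     # each module survives unless both copies fail

def gain_dup(Y, k):
    """YPA gain of full duplication at granularity k (new area = 2)."""
    return y_dup(Y, k) / (2 * Y)
```

As Y → 0 the gain approaches 2^(k−1), and for fixed Y the duplicated yield approaches 1 as k grows, matching the summary slide.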
When any way of using spares cannot help
• A completely general scenario for spares for RL
– Spares used for fraction α (by area) of the original system, where 0 ≤ α ≤ 1
– Without spares, yield for this fraction of the system: Y^α
– This fraction of the system may be partitioned into
• Any number of modules, and
• Each module may have arbitrary size
– At least one spare copy used for every module in this fraction of the system
– Yield for the remainder of the system: Y^(1−α)
When any way of using spares cannot help
• Since at least one spare is used for fraction α of the original system
– A_new ≥ 1 + α
• Since the area of the spares ≥ α
– Y_new ≤ Y^(1−α)
• Since the yield of the fraction with spares ≤ 1
– Hence, YPA_new ≤ Y^(1−α) / (1 + α)
When any way of using spares cannot help
• Hence, the gain of using spares is bounded by
– G ≤ [Y^(1−α) / (1 + α)] / Y = Y^(−α) / (1 + α)
– where 0 < Y ≤ 1 and 0 ≤ α ≤ 1
When any way of using spares cannot help
• For the use of spares to be effective, we need
– G > 1
– That is, we need Y^(−α) > 1 + α
– For spares to be effective, the original yield, Y, must be less than (1 + α)^(−1/α)
When any way of using spares cannot help
– Hence, for spares to be effective, the original system yield, Y, must satisfy Y < (1 + α)^(−1/α), which never exceeds 1/2

k     Maximum value of Y for which G ≥ 1
2     0.34
3     0.41
4     0.43
5     0.45
6     0.46
7     0.46
∞     0.50

– Duplication at practical granularities provides YPA gain for values of Y close to a universal bound!
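The thresholds in the table are consistent with duplicating each of k equal-area modules, for which G = (2 − Y^(1/k))^k / 2 ≥ 1 exactly when Y ≤ (2 − 2^(1/k))^k; a sketch that reproduces the table entries:

```python
def max_useful_yield(k):
    """Largest original yield Y for which duplicating each of k equal-area
    modules still gives a YPA gain G >= 1."""
    return (2 - 2 ** (1.0 / k)) ** k
```

The limit as k → ∞ is e^(−ln 2) = 0.5, the universal bound cited on the slide.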
When any way of using spares cannot help
• For Y ≤ 1/2
– In the bound, G = Y^(−α)/(1 + α) improves monotonically with α whenever −ln Y ≥ 1/(1 + α)
– This suggests that the optimum over partial-duplication schemes is achieved by full duplication (α = 1)
Summary of theoretical results
• Necessary condition for spares to improve YPA
– The original yield, Y, is less than 0.5
• In the future, i.e., as Y decreases (Y → 0)
– YPA gain provided by spares increases
– YPA gain → 2^(k−1), if the system is partitioned into k equal-area modules (ignoring interconnect overheads)
• Partitioning and granularity
– YPA gain increases with k (ignoring interconnect overheads)
– As k → ∞, duplication is all that is needed to achieve perfect yield (ignoring interconnect overheads)
– Equal-area partitions provide maximum YPA gain
To reality: Algorithms for design
• Given
– Scalable switches, forks, joins along with their associated
yields and areas
– An upper bound qi* on the number of spares for each
module mi
• Design: Optimize yield/area, by
– Identifying an optimal set of modules mi, where i = 1, 2,… ,
k, and associated yi and ai
– Identifying the best number of spares for each module
– Identifying locations where steering logics should be
inserted among the modules
Algorithm for inserting redundancy to maximize YPA
Example: Six-module pipeline
• Original pipeline: m1 (0.97) → m2 (0.75) → m3 (0.84) → m4 (0.45) → m5 (0.96) → m6 (0.98), giving Y ≈ 0.26
(Figure: the redundant pipeline — extra copies of the low-yield modules m2, m3, and especially m4 are inserted, with forks (F), switches (S), and joins (J) steering data among the copies; the m5–m6 chain is duplicated.)
• Y_new ≈ 0.87
• Various degrees of replication and concatenation
• Forks, joins and switches of appropriate bus width inserted
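The yield arithmetic of the example can be sketched as follows (ideal steering logic assumed; the replication counts passed in below are illustrative, not the algorithm's actual choice):

```python
def pipeline_yield(module_yields, copies=None):
    """Yield of a pipeline where module i has copies[i] interchangeable
    copies: the module works unless every copy is defective."""
    copies = copies or [1] * len(module_yields)
    y = 1.0
    for yi, ni in zip(module_yields, copies):
        y *= 1 - (1 - yi) ** ni
    return y

ys = [0.97, 0.75, 0.84, 0.45, 0.96, 0.98]   # the six-module example
baseline = pipeline_yield(ys)                # product of module yields, ~0.26
```

Replicating the low-yield modules most heavily (m4 in particular) is what lifts the pipeline yield toward the slide's ≈0.87.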
After testing and configuration
(Figure: the same redundant pipeline after test — good module copies shown in blue and working transmission paths solid; defective copies and unused switch segments crossed out.)
• Power-down unused modules and switch segments
Case study: OpenSPARC T2
• 8 identical cores
• Area of each core is around 12mm²
• Area of other modules is 246mm²
• Without loss of generality we set YOM = 1
• YPA gain: G = [(Y_new/A_new − Y_orig/A_orig) / (Y_orig/A_orig)] × 100
Results: Modules after CLB partitioning
• Search space reduced from 188,887 gates and FFs to 1,808 CLBs, i.e., ~104 times fewer
• CLB sizes vary over a wide range

Size (# of gates)          # of CLBs   Percentage of area in the core
10000 < Size(C)            6           59.4%
1000 < Size(C) < 10000     11          30.5%
100 < Size(C) < 1000       25          4.78%
Size(C) < 100              1766        5.32%
Total                      1808        100%
Results: Modules after clustering small CLBs
(Figure: small CLBs grouped into clusters of 2–4, fed from primary inputs.)

Optimization results after CLB clustering
                            Before optimization   After optimization
Number of CLBs              1808                  58
Number of control signals   1808                  58
Number of spare FFs         1830                  131
Area overhead (FFs)         2.09%                 0.15%
Key characteristics of algorithms
• Consider regular as well as irregular logic circuit
structures
• Allow each module its own pool of spares
• Consider all interconnect and reconfiguration
overheads
• Consider partial redundancy
• Improve yield/area significantly, especially at early
stages of a new process
– Improve time to market for a new process
Partitioning dominant CLBs
From our theoretical results
(Figure: (a) a dominant CLB C duplicated as a whole; (b) C partitioned into C1 and C2, each duplicated, with extra circuitry added for testing.)
Partitioning dominant CLBs
Design flow:
1. Synthesis: original circuit → netlist
2. CLB Generation: netlist → CLBs (using the area of each gate and FF)
3. CLB Clustering
4. Hypergraph Generation
5. Decision: need more partitions?
– Yes: partition the hypergraph using hMETIS to obtain new CLBs, then repeat
– No: add redundancy, produce the redundant configuration, compare against the golden Y/A, and exit
Experimental results
• Dorig is the original design without redundancy.
• DRed-core is a redundant design which uses redundancy at core
level (traditional technique)
• DRed-func-modules is a redundant design that uses the original
functional modules of design for redundancy
• DRed-CLB+Reg is similar to ours but simply replicates registers
• DRed-CLB is the redundant design generated by our design flow
without extra partitioning
• DRed-Golden is the golden redundant design generated by our
design flow
Experimental results
• Finding 1: For emerging technologies with low yield, the original design without redundancy has a very low Y/A, which leads to long time to market
Experimental results
• Finding 2: DRed-Core has the worst yield/area as defect density increases; we need redundancy at finer granularities than the traditional core level for multi-core chips
– DRed-CLB has 1.1 to 13.3 times better Y/A than DRed-Core, as a function of defect density
Experimental results
• Finding 3: DRed-func-modules, with 13 unbalanced partitions, underperforms designs with more partitions; this gap increases with defect density
Experimental results
• Finding 4: FF sharing led to significant Y/A gains; DRed-CLB has up to 45% better Y/A than DRed-CLB+Reg
Experimental results
• Finding 5: DRed-CLB outperforms all designs but the golden one, and its gain is close to the upper-bound gain
– We may not need extra partitioning for yield/area improvement
Explicit redundancy configurations
for highly-regular architectures
Objectives
• Show that regularity significantly improves
effectiveness of spares
• To quantify the benefits and overheads of the use of spares in the context of highly-regular architectures
– Focus on use of spares for identical copies of functional modules, where spare copies are shared
– Consider all possible ways in which spares are used
• Level of granularity at which to add spares
• Scope of sharing of spares
• Number of spares
Proposed redundancy designs
• K-way spare core sharing (KSCS)
– Example:
Scope of sharing    16   8   4   2   1
Number of spares    3    4   4   8   0
– Case 1: inter-processor spares
– Case 2: intra-processor and inter-processor spares
Impact of scope of sharing
• Comparison of different spare configurations
(Figure: yield, yield/area, and area for configuration IDs #1–#16 from the table below; area ranges from 1.25 to 3.00 and yield/area from 0.64 to 0.88 across the configurations.)

Config. ID   Number of spare cores of different scopes of sharing
#1           0    0    0    0    96   0    0
#2           0    0    0    32   64   0    0
#3           0    0    0    64   32   0    0
#4           0    0    0    96   0    0    0
#5           0    0    16   80   0    0    0
#6           0    0    32   64   0    0    0
#7           0    0    48   48   0    0    0
#8           0    0    64   32   0    0    0
#9           0    0    80   16   0    0    0
#10          0    0    96   0    0    0    0
#11          0    16   0    80   0    0    0
#12          0    32   0    64   0    0    0
#13          0    48   0    48   0    0    0
#14          0    64   0    32   0    0    0
#15          0    80   0    16   0    0    0
#16          0    96   0    0    0    0    0

• Config. IDs are ordered such that the scope of sharing increases
Proposed redundancy designs
• Definitions
– Scope of sharing: the domain in which any defective module can
be replaced by a specific spare copy
– Maximum number of spare cores (MNSC)
• Greedy repair process
– Given spares with different scopes, spares with the smaller
scope are used first, preserving the maximum remaining MNSC for
other cores
– After the repair process, chips with fewer than the nominal
number of working functional modules are counted as non-working
101
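The greedy repair process can be sketched as follows. The bookkeeping here is an assumption for illustration: cores are indexed 0..N-1, and a spare of scope s covers one aligned block of s cores.

```python
from collections import defaultdict

def greedy_repair(n_cores, defective, spares_by_scope):
    """defective: set of defective core indices.
    spares_by_scope: {scope: spares per aligned block of that scope}.
    Returns the number of defective cores left unrepaired."""
    remaining = defaultdict(int)
    for scope, per_block in spares_by_scope.items():
        for block in range(n_cores // scope):
            remaining[(scope, block)] = per_block
    unrepaired = 0
    for core in sorted(defective):
        # Greedy policy: use the smallest-scope spare first, so wide-scope
        # spares stay available for cores with no local spare left.
        for scope in sorted(spares_by_scope):
            key = (scope, core // scope)
            if remaining[key] > 0:
                remaining[key] -= 1
                break
        else:
            unrepaired += 1
    return unrepaired

# Two defects in one scope-4 block: the first uses the local spare, the
# second falls back to the chip-wide (scope-8) spare, so all are repaired.
# greedy_repair(8, {1, 2, 5}, {4: 1, 8: 1}) -> 0
```

A chip is then counted as non-working whenever `greedy_repair` returns a nonzero count, matching the yield accounting above.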
Proposed redundancy designs
• Area overhead
[Figure: (A) original 8-core processor; (B) with one spare core and a
1-to-2 de-mux; (C) with four spare cores and a 1-to-5 de-mux. Overhead
components: (1) area of spare modules, (2) wasted area, (3.1) impact on
de-mux height, (3.2) impact on crossbar, (3.3) impact on de-mux width]
102
Experiments
[Plot: Yield/Area vs. defect density d (0.05 to 0.3) for CS-P (core-level
spares within processors), inter-processor spares, and intra-processor
plus inter-processor spares]
103
Utilizing flexibilities in CMP binning and
e-optimal designs
104
Binning CMPs
• Number-of-Processors Binning (NPB) models
– We define the NPB function as follows, where k and l are the
lower and upper bounds on the number of processors that can
be used:

  φ(j) = 0,  if j < k,
         j,  if k ≤ j ≤ l,
         l,  if j > l

– NPB Origin: only n, i.e., (k = n, l = n)
– NPB A: k to n, i.e., (k < n, l = n)
– NPB B: n or greater, i.e., (k = n, l > n)
– NPB C: k to max, i.e., (k < n, l > n)
n denotes the number of processors in the nominal design
105
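The piecewise NPB function translates directly into code. A minimal sketch (parameter names follow the slide's k, l, and n; the nominal n = 32 is an assumed example value):

```python
def npb(j, k, l):
    """phi(j): processors enabled when j of them work; 0 means the chip
    is binned as non-sellable (fewer than k working processors)."""
    if j < k:
        return 0
    return min(j, l)   # enable at most the upper bound l

# The four NPB models for a nominal n-processor design:
n = 32                               # assumed nominal design
assert npb(32, n, n) == 32           # NPB Origin (k = n, l = n): exactly n
assert npb(30, n, n) == 0            # one short of n -> discarded
assert npb(30, 24, n) == 30          # NPB A (k < n, l = n): k to n
assert npb(36, n, 40) == 36          # NPB B (k = n, l > n): n or greater
assert npb(45, 24, 40) == 40         # NPB C (k < n, l > n): capped at l
```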
Value functions
• We define three value functions based on how the value 𝑣(𝑖) of
a chip depends on 𝑖, the number of (enabled) working
processors on a chip
– Linear: The value of a chip is linear with the number of (enabled)
working processors
– IPC: We evaluate the value of a chip based on instructions per cycle (IPC)
– Catalog price: We derive another practical value function using catalog
prices for a chip with different numbers of enabled processors
[Plot: normalized value vs. number of working processors (11 to 39) for
the linear, IPC, and catalog-price value functions]
106
Utility functions and evaluation metrics
• We define utility functions using one NPB function
and one value function as follows
– u(j) = v(φ(j))
• Evaluation metrics
  Value function   Evaluation metric
  Linear           Processors-sold-per-unit-area (PSPA)
  IPC              Instructions-per-cycle-per-area (IPCPA)
  Catalog price    Revenue-per-area (RPA)
107
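The composition u(j) = v(φ(j)) can be sketched in code. The linear value function is exact; the IPC and catalog-price functions would be lookup tables built from measured data:

```python
def phi(j, k, l):
    """NPB binning function: processors enabled when j work."""
    return 0 if j < k else min(j, l)

def utility(j, k, l, v):
    """u(j) = v(phi(j)) for a chip with j working processors."""
    return v(phi(j, k, l))

linear = lambda i: i  # linear value: proportional to enabled processors

# A chip with 30 working processors under NPB A (k = 24, l = n = 32):
assert utility(30, 24, 32, linear) == 30
# Below the lower bound k the chip has zero value:
assert utility(20, 24, 32, linear) == 0
# Working processors beyond l add no value:
assert utility(40, 24, 32, linear) == 32
```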
Optimal redundancy design
• We derive the optimal spare configurations for different value
functions (a) linear, (b) IPC, and (c) catalog price.
[Tables: optimal spare configurations for (a) linear, (b) IPC, and (c)
catalog-price value functions, under each NPB function (Origin, A, B, C).
Optimal numbers of spare processors: 36/24/40/28 (linear), 36/24/36/32
(IPC), and 36/28/40/32 (catalog price); spare cores are allocated across
4-way, 8-way, 16-way, and 32-way sharing scopes (e.g., 128 4-way and 64
16-way spare cores in the optimal core design for NPB Origin)]
[Bar charts: normalized figure-of-merit of the no-spares, spare-processors,
and spare-cores designs under NPB Origin, A, B, and C; the no-spares design
falls far below the optimal spare designs, which reach up to 83.8%]
e-optimal redundancy design
• Due to diminishing returns, we observe scenarios
where the benefit grows only slowly with the number of spares
• In such cases, given a margin e within which a near-optimal
design is acceptable, we can derive a close-to-optimal chip
design with significantly reduced chip area
[Plot: normalized figure-of-merit (PSPA, IPCPA, RPA) vs. number of spare
processors (8 to 16), showing slow growth near the optimum (values roughly
between 45.9% and 56.0%)]
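The e-optimal selection can be sketched as follows; the figure-of-merit values here are hypothetical, chosen only to lie loosely in the range shown on the slide:

```python
def e_optimal(fom_by_spares, e):
    """Smallest spare count whose figure-of-merit is within a fraction e
    of the best achievable figure-of-merit."""
    best = max(fom_by_spares.values())
    for n in sorted(fom_by_spares):            # fewest spares first
        if fom_by_spares[n] >= (1 - e) * best:
            return n

# Hypothetical PSPA values vs. number of spare processors:
fom = {8: 0.483, 10: 0.506, 12: 0.532, 14: 0.549, 16: 0.560}
assert e_optimal(fom, 0.08) == 12   # within 8% of the optimum, fewer spares
assert e_optimal(fom, 0.0) == 16    # with no margin, take the true optimum
```

Because of the diminishing returns noted above, the e-optimal design often needs noticeably fewer spares (and hence less area) than the true optimum.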
e-optimal redundancy design
• We derive e-optimal redundancy designs for various utility
functions: (a) linear, (b) IPC, and (c) catalog price, with e = 8%.
[Tables: e-optimal spare configurations for (a) linear, (b) IPC, and (c)
catalog-price value functions, under each NPB function (Origin, A, B, C).
e-optimal numbers of spare processors: 31/23/35/24 (linear), 31/23/32/24
(IPC), and 31/24/35/27 (catalog price); spare cores are again allocated
across 4-way to 32-way sharing scopes]
[Bar charts: normalized figure-of-merit of the e-optimal spare-processor
and spare-core designs under each NPB function; the e-optimal designs stay
within the 8% margin of the optimal designs (e.g., 77.1% vs. 82.5%) while
using fewer spares]
Conclusion
• Three forms of redundancy
– Implicitly present; no errors in program outputs
– Implicitly present; some errors in application
outputs acceptable
– Explicitly added; no errors at module outputs
• Implicit forms of redundancy require us to
look beyond the circuit level
– For processors: μArch, ISA, system/software, and
user layers
– For error tolerance: Application and user layers 111
Conclusion
• Every form provides an avenue for combating
increasing defect rates and variations of nanoscale technologies
• Exploitation of redundancy increasingly
beneficial with increasing defect density
112
Conclusion
• Ordering of methods to exploit redundancy in
terms of how soon they become effective
• Implicit redundancy in
– General purpose processors
– Error tolerance in medical applications
• Explicit redundancy
– Memories (in use)
– Multi-core/multi-processor chips
– Random logic
113
Questions?
Sandeep Gupta
Ming Hsieh Department of Electrical Engineering
University of Southern California
sandeep@usc.edu
114