SOLVING THE
DRAM SCALING CHALLENGE:
RETHINKING THE INTERFACE BETWEEN
CIRCUITS, ARCHITECTURE, AND SYSTEMS
Samira Khan
MEMORY IN TODAY’S SYSTEM
Processor
DRAM
Memory
Storage
DRAM is critical for performance
2
MAIN MEMORY CAPACITY
Gigabytes of DRAM
Increasing demand for high capacity
1. More cores
2. Data-intensive applications
3
DEMAND 1:
INCREASING NUMBER OF CORES
2012-2015: SPARC M5 (6 cores) → SPARC M6 (12 cores) → SPARC M7 (32 cores)
More cores need more memory
4
DEMAND 2:
DATA-INTENSIVE APPLICATIONS
MEMORY CACHING
IN-MEMORY DATABASE
More demand for memory
5
HOW DID WE GET MORE CAPACITY?
Technology
Scaling
DRAM Cells
DRAM Cells
DRAM scaling enabled high capacity
6
DRAM SCALING TREND
[Chart: DRAM capacity in megabits/chip (log scale, 1 to 10,000) vs. year of start of mass production, 1985-2015]
DRAM scaling is getting difficult
Source: Flash Memory Summit 2013, Memcon 2014
7
DRAM SCALING CHALLENGE
Technology
Scaling
DRAM Cells
DRAM Cells
Manufacturing reliable cells at low cost
is getting difficult
8
WHY IS IT DIFFICULT TO SCALE?
DRAM Cells
To answer this, we need to take a closer look at a DRAM cell
9
WHY IS IT DIFFICULT TO SCALE?
[Figure: a DRAM cell, shown as a logical view (access transistor and capacitor) and a vertical cross-section (bitline, contact, capacitor, transistor)]
10
WHY IS IT DIFFICULT TO SCALE?
Technology
Scaling
DRAM Cells
DRAM Cells
Challenges in Scaling
1. Capacitor reliability
2. Cell-to-cell interference
11
SCALING CHALLENGE 1:
CAPACITOR RELIABILITY
Technology
Scaling
DRAM Cells
DRAM Cells
Capacitor is getting taller
12
CAPACITOR RELIABILITY
[Figure: cross-section of a modern DRAM capacitor, roughly 58 nm wide, a tall and narrow structure]
These tall, narrow capacitors result in failures during manufacturing
Source: Flash Memory Summit, Hynix 2012
13
IMPLICATION:
DRAM COST TREND
[Chart: projected cost per bit vs. year, 2000-2020]
Cost per bit is expected to go higher
Source: Flash Memory Summit, Hynix 2012
14
SCALING CHALLENGE 2:
CELL-TO-CELL INTERFERENCE
[Figure: with technology scaling, neighboring cells move closer together, going from less interference to more interference]
More interference results in
more failures
15
IMPLICATION:
DRAM ERRORS IN THE FIELD
1.52% of DRAM modules failed in Google's servers
1.6% of DRAM modules failed at LANL
SIGMETRICS’09, SC’12
16
GOAL
ENABLE
HIGH CAPACITY MEMORY
WITHOUT
SACRIFICING RELIABILITY
17
TWO DIRECTIONS
DRAM
Difficult to scale
MAKE DRAM
SCALABLE
SIGMETRICS’14, DSN’15, ONGOING
NEW
TECHNOLOGIES
Predicted to be
highly scalable
LEVERAGE NEW
TECHNOLOGIES
WEED’13, ONGOING
18
Technology
Scaling
DRAM Cells
DRAM Cells
DRAM SCALING CHALLENGE
Detect
and
Mitigate
DRAM Cells
UNIFY
Reliable
System
SYSTEM-LEVEL
TECHNIQUES
TO ENABLE DRAM
SCALING
Non-Volatile
Memory
Storage
NON-VOLATILE
MEMORIES:
UNIFIED MEMORY &
STORAGE
PAST AND FUTURE WORK
19
TRADITIONAL APPROACH
TO ENABLE DRAM SCALING
[Diagram: unreliable DRAM cells are made reliable at manufacturing time, so that reliable DRAM cells go into a reliable system in the field]
DRAM has strict reliability guarantee
20
MY APPROACH
[Diagram: instead of making DRAM cells reliable at manufacturing time, ship unreliable DRAM cells and make the system in the field responsible for reliability]
Shift the responsibility to systems
21
VISION: SYSTEM-LEVEL DETECTION
AND MITIGATION
Detect
and
Mitigate
Unreliable
DRAM Cells
Reliable System
Detect and mitigate errors after
the system has become operational
ONLINE PROFILING
Reduces cost, increases yield,
and enables scaling
22
CHALLENGE:
INTERMITTENT
FAILURES
Detect
and
Mitigate
DRAM Cells
Reliable
System
SYSTEM-LEVEL
TECHNIQUES
TO ENABLE DRAM
SCALING
EFFICACY OF
SYSTEM-LEVEL
TECHNIQUES WITH
REAL DRAM CHIPS
HIGH-LEVEL DESIGN
NEW SYSTEM-LEVEL
TECHNIQUES
23
CHALLENGE: INTERMITTENT FAILURES
Detect
and
Mitigate
Unreliable
DRAM Cells
Reliable System
If failures were permanent, a simple
boot up test would have worked
What are the characteristics of
intermittent failures?
24
DRAM RETENTION FAILURE
[Figure: a DRAM cell's capacitor leaks charge through its access transistor, so every cell is refreshed every 64 ms; a retention failure occurs when a cell's retention time falls below the 64 ms refresh interval]
25
INTERMITTENT RETENTION FAILURE
DRAM Cells
• Some retention failures are intermittent
• Two characteristics of intermittent retention failures:
  1. Data Pattern Sensitivity
  2. Variable Retention Time
26
DATA PATTERN SENSITIVITY
[Figure: interference from neighboring cells; the same cell retains its data under one neighboring data pattern but fails under another]
Some cells can fail depending on the
data stored in neighboring cells
JSSC’88, MDTD’02
27
VARIABLE RETENTION TIME
[Chart: retention time (ms) of a single cell over time, jumping between values roughly in the 128-640 ms range]
Retention time changes randomly
in some cells
IEDM’87, IEDM’92
28
CURRENT APPROACH TO CONTAIN
INTERMITTENT FAILURES
Manufacturing Time Testing
PASS
FAIL
1. Manufacturers perform exhaustive
testing of DRAM chips
2. Chips failing the tests are discarded
29
SCALING AFFECTING TESTING
Manufacturing Time Testing
PASS
FAIL
Longer Tests and More Failures
More interference in smaller technology
nodes leads to lower yield and higher cost
30
SYSTEM-LEVEL ONLINE PROFILING
1. Modules are not fully tested at manufacturing time
2. Ship modules with possible failures
3. Detect and mitigate failures online
Increases yield, reduces cost, enables scaling
31
CHALLENGE:
INTERMITTENT
FAILURES
Detect
and
Mitigate
DRAM Cells
Reliable
System
SYSTEM-LEVEL
TECHNIQUES
TO ENABLE DRAM
SCALING
EFFICACY OF
SYSTEM-LEVEL
TECHNIQUES WITH
REAL DRAM CHIPS
HIGH-LEVEL DESIGN
NEW SYSTEM-LEVEL
TECHNIQUES
32
EFFICACY OF SYSTEM-LEVEL TECHNIQUES
Can we leverage existing techniques?
1 Testing
2 Guardbanding
3 Error Correcting Code
4 Higher Strength ECC
We analyze the effectiveness of these techniques
using experimental data from real DRAMs
Data set publicly available
33
METHODOLOGY
FPGA-based testing infrastructure
Evaluated 96 chips from three major vendors
34
1. TESTING
1. Write some pattern to the module
2. Wait for one refresh interval
3. Read and verify
Repeat
Test each module with different patterns for many rounds:
Zeros (0000), Ones (1111), Tens (1010), Fives (0101), Random
35
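The write-wait-read-verify loop is simple to state in code. Below is a minimal Python sketch of it; the dram object with write_word/read_word is a hypothetical stand-in for raw access to a module (the actual experiments use the FPGA infrastructure above), and auto-refresh is assumed to be paused during the wait so cells must hold data on their own.

    import random
    import time

    REFRESH_INTERVAL_S = 0.064      # 64 ms

    PATTERNS = [
        0x0000000000000000,         # Zeros
        0xFFFFFFFFFFFFFFFF,         # Ones
        0xAAAAAAAAAAAAAAAA,         # Tens  (1010...)
        0x5555555555555555,         # Fives (0101...)
    ]

    def test_round(dram, num_words, pattern):
        """One round: write a pattern, wait one refresh interval (with
        auto-refresh paused), then read back and verify."""
        for addr in range(num_words):
            dram.write_word(addr, pattern)            # 1. write the pattern
        time.sleep(REFRESH_INTERVAL_S)                # 2. wait out the refresh interval
        return [addr for addr in range(num_words)
                if dram.read_word(addr) != pattern]   # 3. read and verify

    def profile(dram, num_words, rounds=100):
        """Repeat every pattern (plus a fresh random pattern) for many rounds
        and aggregate all failing word addresses seen so far."""
        failing = set()
        for _ in range(rounds):
            for pattern in PATTERNS + [random.getrandbits(64)]:
                failing.update(test_round(dram, num_words, pattern))
        return failing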
EFFICACY OF TESTING
[Chart: cumulative number of failing cells found (0-200,000) vs. number of test rounds (0-1,000) for the Zeros, Ones, Tens, Fives, Random, and All patterns]
Only a few rounds can discover most of the failing cells
Even after hundreds of rounds, a small number of new failing cells keeps appearing
Conclusion: Testing alone cannot detect all possible failures
36
2. GUARDBANDING
• Adding a safety margin to the refresh interval
• Can avoid VRT failures
4X Guardband
2X Guardband
Refresh Interval
Effectiveness depends on the difference between retention times of a cell
37
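A small sketch of why guardbanding helps and why it is not sufficient on its own. A cell is protected if it passes testing only when it retains data for N times the refresh interval; a VRT cell escapes the guardband when it looks fine during testing but its retention time later drops below the real 64 ms interval. The cell values below are made up for illustration.

    REFRESH_INTERVAL_MS = 64.0

    def passes_test(tested_retention_ms, guardband):
        """During testing, a cell passes only if it holds data for
        guardband * the refresh interval (the safety margin)."""
        return tested_retention_ms >= guardband * REFRESH_INTERVAL_MS

    def fails_in_field(worst_retention_ms):
        """In the field the cell only has to survive the real 64 ms interval."""
        return worst_retention_ms < REFRESH_INTERVAL_MS

    # Hypothetical VRT cells: (retention observed during test, worst-case retention), in ms.
    cells = [
        (500.0, 400.0),   # small variation: any reasonable guardband covers it
        (200.0,  50.0),   # escapes a 2X guardband (passes at 128 ms, later drops to 50 ms)
        (900.0,  40.0),   # large swing: escapes even a 5X (320 ms) guardband
    ]

    for tested, worst in cells:
        for g in (2.0, 5.0):
            escaped = passes_test(tested, g) and fails_in_field(worst)
            print(f"tested={tested:5.0f} ms  worst={worst:5.0f} ms  "
                  f"{g:.0f}X guardband: {'ESCAPES' if escaped else 'covered'}")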
EFFICACY OF GUARDBANDING
[Chart: number of failing cells (log scale, 1 to 1,000,000) vs. retention time (0-20 seconds)]
Most of the cells exhibit close-by retention times
There are a few cells with large differences in retention times
Conclusion: Even a large guardband (5X) cannot detect 5-15% of the failing cells
42
3. ERROR CORRECTING CODE
• Error Correcting Code (ECC)
– Additional information to detect errors and correct data
43
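To make the idea concrete, here is a minimal Python sketch of a SECDED (single-error-correct, double-error-detect) code, using the extended Hamming(8,4) construction on a 4-bit word. Commodity ECC DIMMs commonly use a (72,64) code of the same flavor; this sketch is an illustration, not the exact code any DRAM vendor uses.

    def encode(data4):
        """data4: [d1, d2, d3, d4] -> 8-bit codeword (index == position)."""
        d1, d2, d3, d4 = data4
        p1 = d1 ^ d2 ^ d4          # covers positions 3, 5, 7
        p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
        p3 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
        code = [0, p1, p2, d1, p3, d2, d3, d4]
        code[0] = sum(code[1:]) % 2              # overall (even) parity bit
        return code

    def decode(code):
        """Return (status, data bits). status: 'ok', 'corrected', or 'double-error'."""
        c = list(code)
        s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
        s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
        s3 = c[4] ^ c[5] ^ c[6] ^ c[7]
        syndrome = s1 + 2 * s2 + 4 * s3          # points at the erroneous position
        overall = sum(c) % 2                     # 0 if overall parity still holds
        if syndrome == 0 and overall == 0:
            status = "ok"
        elif overall == 1:
            c[syndrome] ^= 1                     # single error: flip it back
            status = "corrected"
        else:
            status = "double-error"              # detected, but not correctable
        return status, [c[3], c[5], c[6], c[7]]

    word = [1, 0, 1, 1]
    cw = encode(word)
    cw[5] ^= 1                                   # inject a single-bit error
    print(decode(cw))                            # ('corrected', [1, 0, 1, 1])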
EFFECTIVENESS OF ECC
[Chart: probability of a new failure (log scale, 1E-18 to 1E+00) vs. number of test rounds (1-1,000) for No ECC, SECDED, and SECDED with a 2X guardband]
SECDED code reduces the error rate by 100 times
Adding a 2X guardband reduces the error rate by 1000 times
The combination of techniques reduces the error rate by 10^7 times
Conclusion: A combination of mitigation techniques is much more effective
47
4. HIGHER STRENGTH ECC (HI-ECC)
No testing; use strong ECC instead
Amortize the cost of ECC over larger data chunks
Can potentially tolerate errors at the cost of higher-strength ECC
48
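A rough way to see the Hi-ECC trade-off: if each bit fails independently with probability p, a t-error-correcting code over an n-bit chunk becomes uncorrectable only when more than t bits fail, so a stronger code amortized over a larger chunk can keep the uncorrectable rate low. A back-of-the-envelope Python sketch; the per-bit failure probability and the 512-bit chunk size are made-up illustrative values.

    from math import comb

    def p_uncorrectable(n_bits, t_correctable, p_bit):
        """Probability that more than t of n independent bits fail
        (binomial tail), i.e. the chunk exceeds the code's strength."""
        return sum(comb(n_bits, k) * p_bit**k * (1 - p_bit)**(n_bits - k)
                   for k in range(t_correctable + 1, n_bits + 1))

    P_BIT = 1e-6   # hypothetical per-bit retention-failure probability

    # SECDED over a 64-bit word (t = 1) vs. stronger codes over a 512-bit chunk.
    for name, n, t in [("SECDED, 64-bit word",   64, 1),
                       ("DECTED, 512-bit chunk", 512, 2),
                       ("3EC4ED, 512-bit chunk", 512, 3),
                       ("4EC5ED, 512-bit chunk", 512, 4)]:
        print(f"{name:24s} P(uncorrectable) ~= {p_uncorrectable(n, t, P_BIT):.2e}")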
EFFICACY OF HI-ECC
[Chart: expected time to failure in years (log scale, 1E-05 to 1E+25) vs. number of test rounds (1-10,000) for 4EC5ED, 3EC4ED, DECTED, and SECDED codes, each with a 2X guardband; the 10-year mark is highlighted]
After starting with 4EC5ED, can reduce to a 3EC4ED code after 2 rounds of tests
Can reduce to a DECTED code after 10 rounds of tests
Can reduce to a SECDED code after 7000 rounds of tests (4 hours)
Conclusion: Testing can help to reduce the ECC strength, but blocks memory for hours
53
CONCLUSIONS SO FAR
Key Observations:
• Testing alone cannot detect all possible failures
• Combination of ECC and other mitigation
techniques is much more effective
• Testing can help to reduce the ECC strength
– Even when starting with a higher strength ECC
– But degrades performance
54
CHALLENGE:
INTERMITTENT
FAILURES
Detect
and
Mitigate
DRAM Cells
Reliable
System
SYSTEM-LEVEL
TECHNIQUES
TO ENABLE DRAM
SCALING
EFFICACY OF
SYSTEM-LEVEL
TECHNIQUES WITH
REAL DRAM CHIPS
HIGH-LEVEL DESIGN
NEW SYSTEM-LEVEL
TECHNIQUES
55
TOWARDS AN
ONLINE PROFILING SYSTEM
1. Initially protect DRAM with strong ECC
2. Periodically test parts of DRAM
3. Mitigate errors and reduce ECC strength
Run tests periodically, at short intervals, over small regions of memory
56
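A high-level sketch of how such an online profiler might be structured. The memory object and its set_ecc, regions, test_region, remap_row, and num_regions hooks are hypothetical stand-ins (none of these interfaces exist as-is), and the coverage thresholds and timing constants are illustrative; this is only the control loop implied by the three steps above.

    import itertools
    import time

    def online_profiler(memory, region_size_mb=16, test_period_s=60.0):
        """Sketch: start with strong ECC, keep testing small regions in the
        background, mitigate what the tests find, and relax the ECC strength
        as more of memory has been profiled."""
        memory.set_ecc("4EC5ED")                        # 1. initially strong ECC
        profiled = set()
        total = memory.num_regions(region_size_mb)
        for region in itertools.cycle(memory.regions(region_size_mb)):
            failing_rows = memory.test_region(region)   # 2. test a small region
            for row in failing_rows:
                memory.remap_row(row)                   # 3. mitigate the failures
            profiled.add(region)
            coverage = len(profiled) / total
            if coverage >= 0.99:                        # thresholds are illustrative
                memory.set_ecc("SECDED")
            elif coverage >= 0.50:
                memory.set_ecc("DECTED")
            time.sleep(test_period_s)                   # short interval between tests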
CHALLENGE:
INTERMITTENT
FAILURES
Detect
and
Mitigate
DRAM Cells
Reliable
System
SYSTEM-LEVEL
TECHNIQUES
TO ENABLE DRAM
SCALING
EFFICACY OF
SYSTEM-LEVEL
TECHNIQUES WITH
REAL DRAM CHIPS
HIGH-LEVEL DESIGN
NEW SYSTEM-LEVEL
TECHNIQUES
57
WHY SO MANY ROUNDS OF TESTS?
DATA-DEPENDENT FAILURE
Fails only when a specific pattern is stored in the neighboring cells
[Figure: the physical neighbors L and R of a cell D are at linear addresses X-1 and X+1, but address scrambling maps them to non-adjacent logical addresses such as X-4 and X+2]
Even many rounds of random patterns cannot detect all failures
58
DETERMINE THE LOCATION OF
PHYSICALLY ADJACENT CELLS
[Figure: for a failing cell D at scrambled address X, the physical neighbors L and R sit at unknown addresses X-? and X+?]
NAÏVE SOLUTION
For a given failure X, test every combination of two bit addresses in the row: O(n²)
8192 × 8192 tests, 49 days for a row with 8K cells
Our goal is to reduce the test time
59
STRONGLY VS. WEAKLY DEPENDENT CELLS
STRONGLY DEPENDENT
Fails even if only one neighbor's data changes
WEAKLY DEPENDENT
Fails only if both neighbors' data change
Can detect neighbor location in strongly
dependent cells by testing every bit address
0, 1, … , X, X+1, X+2, … n
60
PHYSICAL NEIGHBOR LOCATION TEST
Testing every bit address will
detect only one neighbor
Run parallel tests in different rows
[Figure: parallel tests in different rows (cells A, B, C) find different neighbor locations, e.g. X-4 in one row and X+2 in another]
Aggregate the locations from different rows
61
PHYSICAL NEIGHBOR LOCATION TEST
[Figure: locating the neighbor of a failing cell X (here at scrambled address X-4). Linear testing checks one bit address at a time (0, 1, 2, ..., 7); recursive testing splits the addresses into halves (0-3 vs. 4-7, then 0-1, 2-3, 4-5, 6-7) to narrow down the neighbor's location]
62
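A sketch of the two search strategies for a strongly dependent cell. Here victim_fails(aggressors) is a hypothetical stand-in for one hardware test that writes the aggressor pattern to the given set of addresses in the row and checks whether the victim flips; linear testing calls it once per candidate address, while recursive testing halves the candidate set each time, binary-search style.

    def linear_search(candidates, victim_fails):
        """Try one candidate aggressor address at a time."""
        return [a for a in candidates if victim_fails({a})]

    def recursive_search(candidates, victim_fails):
        """Narrow down the (single) aggressor by testing halves of the
        candidate set; valid for strongly dependent cells, where flipping
        one neighbor alone is enough to cause the failure."""
        candidates = list(candidates)
        if len(candidates) == 1:
            return candidates
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        if victim_fails(set(left)):
            return recursive_search(left, victim_fails)
        return recursive_search(right, victim_fails)

    # Toy example: the row has 8 candidate bit addresses and the victim's
    # single aggressor happens to be address 2.
    fails = lambda aggressors: 2 in aggressors
    print(linear_search(range(8), fails))      # [2] after 8 tests
    print(recursive_search(range(8), fails))   # [2] after ~log2(8) = 3 tests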
PHYSICAL NEIGHBOR-AWARE TEST
Vendor   Reduction in number of tests   Extra failures detected
A        745,654X                       27%
B        1,016,800X                     11%
C        745,654X                       14%
Detects more failures
with small number of tests
leveraging neighboring information
63
NEW SYSTEM-LEVEL TECHNIQUES
REDUCE FAILURE MITIGATION OVERHEAD (ONGOING)
Mitigation for the worst case vs. the common case
Reduces mitigation cost by around 3X-10X
VARIABLE REFRESH (DSN'15)
Leverages adaptive refresh to mitigate failures
64
SUMMARY
Detect
and
Mitigate
Unreliable
DRAM Cells
Reliable System
Proposed online profiling to enable scaling
Analyzed efficacy of system-level detection
and mitigation techniques
Found that combination of techniques is
much more effective, but blocks memory
Proposed new system-level techniques to
reduce detection and mitigation overhead
65
Technology
Scaling
DRAM Cells
DRAM Cells
DRAM SCALING CHALLENGE
Detect
and
Mitigate
DRAM Cells
UNIFY
Reliable
System
SYSTEM-LEVEL
TECHNIQUES
TO ENABLE DRAM
SCALING
Non-Volatile
Memory
Storage
NON-VOLATILE
MEMORIES:
UNIFIED MEMORY &
STORAGE
PAST AND FUTURE WORK
66
TWO-LEVEL STORAGE MODEL
[Diagram: the CPU accesses memory (DRAM) with load/store instructions (volatile, fast, byte-addressable) and accesses storage through file I/O (non-volatile, slow, block-addressable)]
67
TWO-LEVEL STORAGE MODEL
[Diagram: emerging non-volatile memories (NVM: PCM, STT-RAM) fall between memory (DRAM: volatile, fast, byte-addressable) and storage (non-volatile, slow, block-addressable)]
Non-volatile memories combine
characteristics of memory and storage
68
VISION: UNIFY MEMORY AND STORAGE
CPU
NVM
PERSISTENT
MEMORY
Ld/St
Provides an opportunity to manipulate
persistent data directly
69
CPU
DRAM IS STILL FASTER
MEMORY
DRAM
NVM
PERSISTENT
MEMORY
Ld/St
A hybrid unified
memory-storage system
70
CPU
CHALLENGE: DATA CONSISTENCY
MEMORY
PERSISTENT
MEMORY
Ld/St
System crash can result in
permanent data corruption in NVM
71
CURRENT SOLUTIONS
Explicit interfaces to manage consistency
– NV-Heaps [ASPLOS’11], BPFS [SOSP’09], Mnemosyne
[ASPLOS’11]
AtomicBegin {
Insert a new node;
} AtomicEnd;
Limits adoption of NVM
Code must be rewritten with a clear partition between volatile and non-volatile data
Burden on programmers
72
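To make the programmer burden concrete, here is a Python sketch of what code looks like under an explicit-consistency interface of this kind. The atomic transaction, undo log, and allocation names are hypothetical stand-ins for the sort of primitives systems like NV-Heaps or Mnemosyne expose, not their actual APIs; the point is only that the programmer must decide which data is persistent and wrap every durable update in a transaction.

    from contextlib import contextmanager

    class UndoLog:
        """Stand-in for a durable undo log in NVM (no real persistence here)."""
        def begin(self):  print("log: begin")
        def commit(self): print("log: flush + commit")
        def abort(self):  print("log: abort + roll back")

    @contextmanager
    def atomic(log):
        """Hypothetical durable transaction: either every update inside the
        block reaches NVM, or none of them does."""
        log.begin()
        try:
            yield
            log.commit()
        except Exception:
            log.abort()
            raise

    class Node:
        def __init__(self, value=None, next=None):
            self.value, self.next = value, next

    head = Node()          # imagine this list lives in persistent memory
    scratch = {}           # ordinary volatile DRAM data, not logged

    def insert(log, value):
        with atomic(log):  # every persistent update must be wrapped like this
            head.next = Node(value, next=head.next)

    insert(UndoLog(), 42)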
OUR GOAL AND APPROACH
Goal:
Provide efficient transparent
consistency in hybrid systems
[Timeline: execution alternates between running and checkpointing phases, forming epochs 0, 1, ...]
Periodic Checkpointing of dirty data
Transparent to application and system
Hardware checkpoints and recovers data
73
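A minimal sketch of the epoch structure: execution alternates between a running phase and a checkpointing phase, and after a crash the hardware rolls back to the last fully committed checkpoint. The app, tracker, and nvm objects below are hypothetical stand-ins for the application and the hardware that tracks dirty data and writes checkpoints.

    def run_epochs(app, tracker, nvm, epoch_length_ms=10):
        """Periodic, application-transparent checkpointing of dirty data."""
        epoch = 0
        while not app.done():
            tracker.reset()                         # start tracking dirty data
            app.run_for(epoch_length_ms)            # running phase of epoch N
            dirty = tracker.dirty_data()            # pages / cache blocks written
            nvm.write_checkpoint(epoch, dirty)      # checkpointing phase
            nvm.mark_consistent(epoch)              # final step: atomic commit
            epoch += 1

    def recover(nvm):
        """After a crash, restore the newest fully committed checkpoint; work
        from the interrupted epoch is lost, but no data is corrupted."""
        return nvm.load_checkpoint(nvm.last_consistent_epoch())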
CHECKPOINTING GRANULARITY
Page granularity (DRAM): extra writes, small metadata
Dirty cache block granularity (NVM): no extra writes, huge metadata
High write-locality pages in DRAM, low write-locality pages in NVM
74
DUAL GRANULARITY CHECKPOINTING
DRAM: good for streaming writes
NVM: good for random writes
Can adapt to access patterns
75
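A sketch of the dual-granularity decision implied by the last two slides: pages with high write locality are kept in DRAM and checkpointed at page granularity, while sparsely written pages stay in NVM and are checkpointed at cache-block granularity. The locality threshold below is an arbitrary illustrative value, not a parameter from the original work.

    CACHE_BLOCK = 64            # bytes
    PAGE_SIZE   = 4096          # bytes

    def write_locality(dirty_blocks_in_page):
        """Fraction of a page's cache blocks written during this epoch."""
        return dirty_blocks_in_page / (PAGE_SIZE // CACHE_BLOCK)

    def choose_granularity(dirty_blocks_in_page, threshold=0.5):
        """Dual-granularity checkpointing policy (threshold is illustrative):
        streaming, high-locality pages -> DRAM, page-granularity checkpoints;
        sparsely written pages         -> NVM, block-granularity checkpoints."""
        if write_locality(dirty_blocks_in_page) >= threshold:
            return ("DRAM", "page")       # small metadata, some extra writes
        return ("NVM", "cache block")     # no extra data writes, more metadata

    print(choose_granularity(60))   # ('DRAM', 'page')        for streaming writes
    print(choose_granularity(3))    # ('NVM', 'cache block')  for random writes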
TRANSPARENT DATA CONSISTENCY
Cost of consistency compared to a system with zero-cost consistency, running unmodified legacy code:
DRAM: -3.5%
NVM: +2.7%
Provides consistency without
significant performance overhead
76
Technology
Scaling
DRAM Cells
DRAM Cells
DRAM SCALING CHALLENGE
Detect
and
Mitigate
DRAM Cells
UNIFY
Reliable
System
SYSTEM-LEVEL
TECHNIQUES
TO ENABLE DRAM
SCALING
Non-Volatile
Memory
Storage
NON-VOLATILE
MEMORIES:
UNIFIED MEMORY &
STORAGE
PAST AND FUTURE WORK
77
OTHER WORK
IMPROVING CACHE
PERFORMANCE
STORAGE
MEMORY
CPU
MICRO’10, PACT’10, ISCA’12, HPCA’12, HPCA’14
EFFICIENT LOW VOLTAGE
PROCESSOR OPERATION
HPCA’13, INTEL TECH JOURNAL’14
NEW DRAM ARCHITECTURE
HPCA’15, ONGOING
ENABLING DRAM SCALING
SIGMETRICS’14, DSN’15, ONGOING
UNIFYING MEMORY &
STORAGE
WEED’13, ONGOING
78
DRAM
NVM
LOGIC
STORAGE
MEMORY
CPU
FUTURE WORK: TRENDS
79
APPROACH
DESIGN AND BUILD BETTER SYSTEMS
WITH NEW CAPABILITIES
BY REDEFINING FUNCTIONALITIES
ACROSS DIFFERENT LAYERS
IN THE SYSTEM STACK
80
SSD
OPERATING
SYSTEM
APPLICATION
RETHINKING STORAGE
CPU
FLASH
CONTROLLER
FLASH
CHIPS
APPLICATION, OPERATING SYSTEM, DEVICES
What is the best way to design a system
to take advantage of the SSDs?
81
ENHANCING SYSTEMS WITH
NON-VOLATILE MEMORY
Recover
and
migrate
NVM
PROCESSOR, FILE SYSTEM,
DEVICE, NETWORK, MEMORY
How to provide efficient instantaneous
system recovery and migration?
82
DESIGNING SYSTEMS WITH MEMORY
AS AN ACCELERATOR
MANAGE
DATA
FLOW
SPECIALIZED CORES
MEMORY WITH LOGIC
APPLICATION, PROCESSOR, MEMORY
How to manage data movement when
applications run on different accelerators?
83
THANK YOU
QUESTIONS?
84
SOLVING THE
DRAM SCALING CHALLENGE:
RETHINKING THE INTERFACE BETWEEN
CIRCUITS, ARCHITECTURE, AND SYSTEMS
Samira Khan