Efficient Scrub Mechanisms for Error-Prone Emerging Memories

Manu Awasthi±†  manua@cs.utah.edu
Manjunath Shevgoor±  shevgoor@cs.utah.edu
Rajeev Balasubramonian±  rajeev@cs.utah.edu
Kshitij Sudan±  kshitij@cs.utah.edu
Bipin Rajendran‡  brajend@us.ibm.com
Viji Srinivasan‡  viji@us.ibm.com
± University of Utah, ‡ IBM T.J. Watson Research Center
Abstract
Many memory cell technologies are being considered as possible replacements for DRAM and Flash technologies, both of which
are nearing their scaling limits. While these new cells (PCM, STT-RAM, FeRAM, etc.) promise high density, better scaling, and non-volatility, they introduce new challenges. Solutions at the architecture level can help address some of these problems; e.g., prior research has proposed wear-leveling and hard error tolerance mechanisms to overcome the limited write endurance of PCM cells.
In this paper, we focus on the soft error problem in PCM, a topic
that has received little attention in the architecture community.
Soft errors in DRAM memories are typically addressed by having
SECDED support and a scrub mechanism. The scrub mechanism
scans the memory looking for a single-bit error and corrects it before the line experiences a second uncorrectable error. However,
PCM (and other emerging memories) are prone to new sources of
soft errors. In particular, multi-level cell (MLC) PCM devices will
suffer from resistance drift, which increases the soft error rate and
incurs high overheads for the scrub mechanism. This paper is the
first to study the design of architectural scrub mechanisms, especially when tailored to the drift phenomenon in MLC PCM. Many
of our solutions will also apply to other soft-error prone emerging memories. We first show that scrub overheads can be reduced
with support for strong ECC codes and a lightweight error detection operation. We then design different scrub algorithms that
can adaptively trade-off soft and hard errors. Using an approach
that combines all proposed solutions, our scrub mechanism yields
a 96.5% reduction in uncorrectable errors, a 24.4 × decrease in
scrub-related writes, and a 37.8% reduction in scrub energy, relative to a basic scrub algorithm used in modern DRAM systems.
1 Introduction
Challenges in DRAM and Flash scaling [23, 19] have ignited great interest in alternative memory technologies such
as PCM, STT-RAM, Memristors, and FeRAM. Most architectural studies to date have focused on the primary problems with these new technologies: long latencies, energy-intensive writes, and limited write endurance. The last problem has fueled several recent solutions that attempt wear-leveling and hard error tolerance (Pairing [14], ECP [35],
∗ This work has been supported in parts by NSF grants CCF-0811249,
CCF-0916436, and NSF CAREER award CCF-0545959
† The author is currently at Micron Technology Inc.
SAFER [37], FREE-p [45]). However, little attention has
been paid to the problem of soft error tolerance in emerging
memories. This is especially true as scaling continues and
as these emerging devices employ multi-level cells (MLCs).
In particular, PCM and FeRAM devices are expected to suffer from the problem of resistance drift. The resistance of a
cell is expected to drift upward over time, eventually causing a cell to represent a wrong state. This is a well documented problem for PCM MLCs [5] and is also listed as an
important research need for PCM and FeRAM devices by
the Semiconductor Research Corporation [2].
Drift-based errors require new error tolerance solutions
within the memory system. Unlike DRAM errors which
are largely random, drift-based errors in PCM are imminent
over long time periods and multi-bit errors are expected to
be very common. This is a problem that can be mitigated
with device-level solutions, but not completely eliminated.
Hence, ultimately, the PCM device will expose multi-bit errors and it is up to the architecture and OS to provide fault tolerance. This technology model is no different than what
is currently assumed for state-of-the-art DRAM systems;
while DRAM devices provide error margins, occasional errors are possible, and ECC support and scrub mechanisms
are required to tolerate these errors. Typically, SECDED
(Single Error Correct, Double Error Detect) codes are used
to recover from a single bit error in DRAM. If a line already
has an error, it is vulnerable to a second bit error that cannot
be corrected. To prevent the occurrence of this second error, the memory is constantly examined in the background.
When a single bit error is detected, it is corrected and the
line is written back. This is referred to as Scrubbing [15].
DRAM multi-bit error rates are small because for the most
part, each bit error is an independent event1 . Hence, the
DRAM scrub mechanism can be very basic and it incurs
a very small overhead (one DRAM read and write every
200,000 cycles). In MLC PCM, errors are not independent;
if one cell has drifted to the wrong state, there is a high
probability that other cells will drift in the near future [5].
1 In this work, we will ignore chipkill support, which is an orthogonal
consideration. If an entire chip were to fail, the handling strategy is not
affected by whether the chip is DRAM or PCM.
Hence, multi-bit error rates for MLC PCM will be much
higher. Our analysis shows that a DRAM-like scrub mechanism will have to issue a PCM read in nearly every cycle to
achieve tolerable error rates. To address this high overhead,
more sophisticated scrub mechanisms are required. This is
the first body of work that observes non-trivial overheads for
scrubbing and proposes optimizations to scrub mechanisms
for MLC PCM. We expect this to be an important area of
research for future memories with higher error rates.
To reduce scrubbing overheads, we first advocate the use
of multi-bit error correction support and quantify its impact
on PCM device lifetime. We then propose a lightweight
read mechanism that performs approximate error detection
on the PCM chip itself and reduces data transfer energy to
the memory controller. Next, we introduce three variants
to the scrub algorithm that are synergistic with the approximate error detection mechanism. Our policies and results
expose new trade-offs in PCM design and represent different points in the energy, hard error rate, and soft error rate
design spectrum. We show that our architectural policies are
more effective than device-level solutions at handling drift-based errors with low overheads.
2 Background
2.1 Error Tolerance in DRAM Systems
All memory devices are expected to yield soft errors.
Studies have shown that DRAM systems require varying
forms of error tolerance support from the architecture or
OS [36]. At a bare minimum, SECDED support is required.
Many platforms augment this with a scrubbing mechanism [15, 36]. A few high-end systems invest significant
architecture support to provide chipkill correctness [39, 44].
Future PCM based devices will require greater architecture (and possibly, OS) support because soft error rates and
multi-bit error rates in PCM devices are expected to be much
higher. The topic of multi-bit error correction in DRAM has
received little attention because low-frequency scrubbing
can easily handle such errors. Schroeder et al. [36] report
a typical scrubbing rate of 1 GB every 45 minutes, which
is a small overhead (one 64 B cache line every 200,000 cycles). Every scrub operation transfers a cache line from the
DIMM to the memory controller, where error detection and,
if required, write of the corrected data is performed. As
we show later, an unoptimized PCM system would incur a
significant overhead from the basic scrubbing mechanism;
a cache line would have to be transferred to the processor
almost every other cycle.
2.2 PCM MLCs
While the phenomenon of drift manifests in multiple
memory cell technologies and there may be many sources
of soft errors in future memories, this paper will focus on
drift-induced soft errors in MLC PCM devices.
PCM cells are programmed with electrical pulses that
heat the chalcogenide material. A cell can be programmed
to a high-resistance amorphous state, a low-resistance crystalline state, or states that lie between these two extremes.
A single-level cell (SLC) only uses the extreme states and
assigns a 1 (SET) value to the crystalline state and a 0 (RESET) value to the amorphous state. By suitably controlling
the programming process, a cell can be programmed into
a hybrid state with a given percentage of amorphous material. As a result, a cell can have any resistance between
the purely crystalline and purely amorphous extremes. The
available resistance range (usually between 10^3 and 10^6 Ω)
can be partitioned into multiple regions, each region representing a different state. This gives rise to multi-level cells
(MLCs), where a single cell can represent n bits of information by partitioning the material’s resistance range into
2n different levels. Figure 1 shows the distribution of resistance in an MLC, and the subsequent onset of drift. The
resistances that represent the boundaries between neighboring levels (states) are referred to as the boundary resistance
thresholds (R1, R2, and R3 in Figure 1b). For example,
a cell may represent the '10' state if its resistance is between the boundary resistance thresholds of 10^4.5 (R2) and 10^5.5 (R3) Ω. It is expected that MLCs will be the primary
source of density increments across successive PCM generations [1].
The programming process for MLCs is iterative [29].
After an initial Reset pulse, many short pulses are provided to gradually decrease the cell’s resistance. After each
pulse, the cell is read to confirm if the resistance is within
the specified programmed resistance range for the desired
state. To provide margin for error, the programmed resistance range for a state is usually tighter than the gap between the surrounding boundary resistance thresholds. At
the end of a successful write, a cell’s resistance is within the
programmed resistance range for that state, but the exact
resistance will be a function of variations in the manufacturing, programming, and crystallization processes. Prior
work [17, 4] has shown that after the programming process,
the resistance of the cell will follow a normal distribution:
with a high probability, the cell resistance will be at the midpoint of the resistance range; with a small probability, the
resistance will lie at the edges of the range (solid line distributions in Figure 1b).
2.3 Resistance Drift in PCM
Once a cell is programmed, the presence of defect structures in the chemical lattice of the chalcogenide material
leads to changes in the programmed resistance, or short-term resistance drift [28]. The "defective crystals" stabilize
over time; the material becomes more amorphous and acquires a higher resistance. Such defects are less common in
the crystalline state and hence resistance drift is less significant for a cell programmed into a state with a higher percentage of crystalline material (depicted in Figure 1a). The
drift is not affected by read operations; it is corrected only
when the cell is written, at which point, the drift process
starts all over again.
The move from SLC to MLC introduces one significant
Figure 1: Illustration of the resistance drift phenomenon. (a) Initial programmed resistance (A) and resistance (B) after some elapsed time for a single cell. (b) Distribution of initial programmed resistance and resistance after drift.
challenge - drift-induced soft errors. Drift is not a problem
for SLC PCM for two reasons. First, the rate of resistance
drift is very small if the cell is programmed as mostly crystalline (Figure 1a shows that each cell drifts at a different
rate). Hence, it will take a very long time for a cell programmed to be crystalline (SET) to drift to a high resistance
value that represents the RESET state. Second, a cell programmed to be in the RESET state has a high rate of drift,
but all resistances above the specified boundary resistance
threshold represent the RESET state; so drift does not cause
the cell to represent an erroneous state. However, as shown
in Figure 1a, in a 4-level MLC PCM, drift can lead to frequent errors. The drift rate is higher for cells programmed
with higher resistances. Consider a cell that is programmed
to have an initial resistance A that represents the 10 state.
Such a cell has a high rate of drift and after a relatively short
period of time, the cell’s resistance B becomes higher than
the boundary resistance threshold. At this point, it starts to
represent the neighboring 00 state, causing an error. This is
an example of a “soft” or “transient” error. The error rate
will increase dramatically as the resistance range is partitioned into more levels and boundary resistance thresholds
are more closely spaced.
When a cell is programmed to a given state, its initial
resistance R0 will lie somewhere within the programmed
resistance range for that state and follows a normal distribution (solid curves in Figure 1b). The dotted curves in
Figure 1b show the distribution of resistances after some
time has elapsed. In essence, each resistance has drifted to
a higher value, with some cells having a resistance that lies
within the threshold boundaries of the adjacent state.
Drift within the RESET state (00 in a 4-level MLC or
000 in an 8-level MLC) is not problematic because a higher
resistance will continue to represent the RESET state. However, its adjacent state (10 in the 4-level MLC in Figure 1b)
is considered the most drift-prone as it has the next highest
drift rate and is at risk of drifting past the resistance threshold of the RESET state. Hence, to a large extent, this paper will focus on worst-case behavior in cells that are programmed to the most drift-prone state (10). We assume the
use of a Gray code mapping policy that ensures that adjacent states differ in only one bit. Thus, every drift-based
cell error corresponds to only a single bit error. As shown
in Figure 1, we assume that the mapping of states in order
is 00 (amorphous), 10, 11, 01 (crystalline).
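To make the level-to-state mapping concrete, the following sketch (Python; a simplified illustration, with boundary thresholds of 10^3.5, 10^4.5, and 10^5.5 Ω assumed at the log-space midpoints between the per-state mean resistances, consistent with the R2 and R3 example above) shows how a sensed resistance is quantized to a Gray-coded state.

```python
# Illustrative 4-level MLC read quantization. The boundary thresholds are
# assumed values (log-space midpoints between state means); a real device
# would use its calibrated R1, R2, R3.
BOUNDARIES = [10**3.5, 10**4.5, 10**5.5]          # R1, R2, R3 in ohms
STATES = ["01", "11", "10", "00"]                 # crystalline ... amorphous

def resistance_to_state(resistance_ohms):
    """Map a sensed cell resistance to its 2-bit Gray-coded state."""
    for level, boundary in enumerate(BOUNDARIES):
        if resistance_ohms < boundary:
            return STATES[level]
    return STATES[-1]                             # above R3: RESET (00)

# A cell written to '10' that drifts past R3 begins to read as '00';
# under the Gray mapping this is a single-bit error.
assert resistance_to_state(10**5.2) == "10"
assert resistance_to_state(10**5.6) == "00"
```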
2.4 Modeling Drift
Jeong et al. [16] describe the following equation to measure resistance drift. Given an initial resistance R0, the resistance R_drift(t) can be calculated as R_drift(t) = R0 × t^α, where α represents the drift exponent and t is the time elapsed (in seconds) since programming the cell.
The value of R0 for a given state follows a normal distribution with the mean lying at the mid-point of the programmed resistance range defined by the iterative write process [24] for that state. For most experiments in this paper,
we will assume that the programmed resistance range is set
to include resistances that are within (±) 2.75σ of the mean,
where σ refers to the standard deviation of the normal distribution. The boundary resistance threshold is at a distance
of (±) 3.0σ from the mean, allowing some margin for drift.
Other programmed resistance ranges are also considered in
Sections 2.5 and 5.5.
The drift exponent α depends on a number of factors.
The value of α is impacted most by the state represented by
the cell. The higher the initial resistance R0 , the higher the
value of α. For the nominal material thickness, we adopt
expected values of α for each state according to empirical
data [33, 17, 4, 13]. The exact value of α for each state
follows a normal distribution around the expected value for
that state and is also modeled. The parameters for our drift
model are based on the assumptions of Xu and Zhang [43,
42] and are summarized in Table 1.
The type and thickness of chalcogenide material impacts
the drift exponent. For this study, we assume the empirical
values of α recommended for standard GST material [16,
18]. Temperature also impacts the drift rate [34, 12, 46].
For this study, we assume that page coloring/wear-leveling techniques will ensure load balance across banks and not create localized hotspots.
Storage level:      0       1       2       3
Data:               01      11      10      00
lg R0 (mean):       3.0     4.0     5.0     6.0
lg R0 (deviation):  0.17 for all levels
α (mean):           0.001   0.02    0.06    0.10
α (SDMR):           40% for all levels
Table 1: Configuration of 4-level MLCs [43]. SDMR stands for Standard Deviation to Mean Ratio.
Figure 2: Representative time to drift for cells in the most drift-prone states (10 and 11) of a 4-level MLC, as a function of the separation between the programmed cell resistance and the mean of the programmed resistance range. For example, 2.5 on the X-axis refers to a cell that was programmed to a resistance of mean + 2.5σ. Boundary thresholds are at mean + 3σ. The assumed models are described in Table 1 and verified against published experimental data [17, 4, 13]. 512 seconds is the maximum refresh interval considered in this study.
If the uniform temperature of a
PCM chip increases, scrub rates must be increased, and such
adaptive policies are discussed in Section 3.3.
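As a concrete illustration of the model, the sketch below (Python) computes the time for a cell to drift past its boundary threshold. It uses the mean parameters from Table 1 and places the boundary at 3σ above the mean in log-resistance space; these are illustrative values, and the exact figures depend on the sampled α and R0 of each cell, with the worst-case times quoted in the text corresponding to cells that also draw an above-average drift exponent.

```python
def time_to_cross(lg_r0, lg_boundary, alpha):
    """Seconds until R_drift(t) = R0 * t^alpha reaches the boundary resistance.
    lg_r0 and lg_boundary are log10 resistances; alpha is the drift exponent."""
    if lg_r0 >= lg_boundary:
        return 0.0                    # already past the boundary
    return 10 ** ((lg_boundary - lg_r0) / alpha)

# Drift-prone '10' state with Table 1 means: lg R0 mean = 5.0, sigma = 0.17,
# boundary at mean + 3*sigma, mean drift exponent alpha = 0.06.
SIGMA = 0.17
BOUNDARY = 5.0 + 3 * SIGMA
for k in (1.0, 2.0, 2.75):            # cell programmed k*sigma above the mean
    t = time_to_cross(5.0 + k * SIGMA, BOUNDARY, alpha=0.06)
    print(f"programmed at +{k} sigma -> ~{t:.2e} s before it reads as '00'")
```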
2.5 Inadequacy of Device-Level Solutions
We validated that the drift times estimated from the
above drift model match the drift times derived using the
distributions of R0 and α from empirical data in prior
work [43, 17, 4, 13]. Using the model discussed above, we
show in Figure 2 the drift time for a 4-level cell programmed
in one of the drift-prone states (10 or 11).
Figure 2 shows that for a given boundary threshold (3σ
from the mean), if the cell is programmed in the 10 state at
1σ from the mean, the drift time for an error is under 10^4 seconds.
Furthermore, the drift time drops to less than 100 seconds
when the cell is programmed at 2σ from the mean resistance. Therefore, it is clear that if the problem of resistance
drift is not addressed effectively, cells that have been programmed reasonably accurately cannot be treated as truly
non-volatile [5].
To delay the onset of drift, it may be possible to widen
the state’s boundary thresholds. However, doing so compromises the density advantage of PCM, and significantly
reduces the number of levels of a multi-level cell. In fact, we
expect that the boundary thresholds will be squeezed closer
together with each new generation.
It may also be possible to introduce a correction to the read resistance during a read. For example, it has been suggested
that a reference cell in a row be used to normalize the resistance of each cell in that row [26]. But this is likely to
not be very effective because of the significant difference
in drift times for different cells programmed to the same
state, as can be seen in Figure 2. Some recent work has
also proposed mitigating the effects of drift with better coding strategies [26]. While coding strategies can help and are
better than reference cell based strategies, they cannot eliminate drift-based errors. Papandreou et al. [26] show that
modulation coding can help an MLC PCM exhibit a raw error rate of 10−5 after 37 days at room temperature; this is
well short of the 10−15 error rate target that is acceptable
for such an experiment. This shows that two of the leading
device-level solutions (coding and reference cells) to date
are not adequate.
Based on various initial empirical data, for most of this
study we assume that the boundary threshold for a state
is 3σ away from the mean and the programmed resistance
range is within 2.75σ from the mean. The drift time for
the worst-case cell can be as low as 1.81 seconds, making it necessary to explore techniques to effectively mitigate
the effects of resistance drift. If the programmed resistance
range is made narrower with a more precise write process,
the worst-case drift time can be increased. However, doing so causes an increase in write latency and a drop in cell
endurance. The corresponding trade-offs showing that architectural techniques are more effective than precise write
process to combat drift are discussed in Section 5.5.
In summary, various device-level strategies can be used
to delay the onset of drift. These strategies can therefore
mitigate the problem, but not eliminate it; they must be
augmented with architectural solutions. These device-level
strategies also incur significant cost. Our goal here is to
out-do these device-level techniques with simple architectural policies. This is a philosophy similar to that used for
modern-day DRAM – DRAM chips are produced at very
low cost and exhibit errors; these errors are tolerated with architectural solutions that use error correction codes, RAID-like redundancy, and scrubbing.
2.6 Naive Refresh and Basic Scrub
DRAM systems employ a refresh operation to handle
charge leakage in cells and a scrub mechanism to correct
single-bit errors. We first show that such baseline mechanisms are highly inadequate for PCM MLC systems.
The drift problem can be addressed if a row can be read
and re-written before the worst-case cell can drift to a neighboring state. For most of this study, we assume that the iterative write process can force R0 to be in a well-defined
programmed resistance range (within 2.75σ of the mean),
so cells will not immediately be on the brink of drifting to a
neighboring state. With such a baseline, the worst-case drift
time is 1.8 seconds and every line must be refreshed every 1.8 seconds. Since a cell can typically only endure 10^8 writes, this limits the PCM device's lifetime to 1.8×10^8 seconds, or only 5.7 years (ignoring short-term writes 2). More importantly, we must refresh all of the billion 64-byte lines in a high-capacity PCM memory within 1.8 seconds; assuming 1000 ns for a PCM write [31], we must handle simultaneous refresh of about 600 lines (each 64 bytes), a process that will overwhelm the PCM system. If we assume that writes are precise enough such that the programmed resistance is within 1.375σ of the mean, the worst-case drift time is now around 500 seconds. As we show later, precise writes are expensive for many reasons: they incur a longer write time, worsen performance, and degrade lifetime. Even with such a device-level solution, about 4 64-byte lines will have to be refreshed simultaneously, much more onerous than the current refresh process in DRAM.

2 If the gap between writes to a block is very high, the writes are referred to as long-term. Such blocks are vulnerable to drift-based errors.

Alternatively, each line can have a SECDED code and we can keep reading PCM lines at a specified rate until a single error is detected. Only then is the corrected line re-written. Since the decoding for the SECDED code is performed at the memory controller, every line read during the scrubbing process must be sent to the processor over the memory channel, assuming an on-chip controller in modern processors. This is the basic scrub mechanism in use for DRAM-based systems today. Figure 3 shows the uncorrectable error rates for this scrubbing mechanism (ECC-1) for different programmed resistance ranges and different intervals for the read operation (detailed methodology in Section 4). Even if we assume precise writes and a 2-second read interval, as many as 97 uncorrectable errors are encountered. Further, the entire PCM main memory capacity will have to be sent to the processor within 2 seconds (roughly one cache line every 2 cycles), more than saturating even the highest-performing modern memory channel and incurring high dynamic energy cost. If the read interval is increased to a more manageable 512 seconds, the uncorrectable error rates are well over 10^4. Thus, modern solutions that work well for DRAM are highly inadequate for PCM in terms of their impact on error rates, lifetime, energy, and bandwidth needs.

Figure 3: Uncorrectable errors in baseline systems for different programmed resistance ranges for 10^7 long-term writes. Baseline refers to a model where a re-write (refresh) is performed for each line at the specified interval. ECC-1 refers to a DRAM-like scrub mechanism where a re-write is performed only after a single error is detected. The error is detected by issuing reads at the specified interval.

3 Architectural Support for Efficient Scrub
Drift-prone PCM systems introduce a new energy, performance, and endurance bottleneck: the scrubbing process.
We carefully consider the design of efficient scrub mechanisms and propose novel solutions. We introduce a three-pronged approach to tackle the drift problem: (i) stronger
ECC codes, (ii) low-cost and approximate error detection
within a PCM chip, and (iii) scrub algorithm extensions that
are conservative and adaptive. Most results are presented
in Section 5. We mention a few in this section to preserve
context. Please refer to Section 4 for detailed methodology.
3.1 Multi-Bit Error Correction Support
DRAM-based systems assume SECDED support and
have a scrubbing process to trigger error correction before
the incidence of a second error. If the gap between the first
and second error is shorter than the scrub interval for a line,
the second error may manifest before the first can be corrected, resulting in an uncorrectable multi-bit error. Thankfully, this gap is very high in DRAM systems and even a
large scrub interval (3 hours for a 4 GB memory system)
can yield very low uncorrectable error rates [36]. In MLC
PCM devices, the gap between the first and second error in
a line is much smaller than in DRAM devices, requiring a
faster scrub rate. It appears intuitive that we can lower the
scrub rate if we had support for multi-bit error correction. If
we assume that we have a code that can correct E errors3 ,
then the scrub mechanism initiates recovery when it detects
the E th error. The required scrub rate is determined by the
typical gap between the E-th and (E+1)-th errors in a line.
For the drift model described in the previous section, we
observe that as E increases, this gap does increase. Hence,
if we have support to recover from many errors, we can get
by with a longer scrub interval. The reason is the exponent
term in the drift equation and the Gaussian distributions of
R0 and α. In short, the worst-case cell has an order of magnitude less drift time than the next-worst cell, and so on.
Hence, the first change that we advocate is the introduction of multi-bit error correction codes. For a 512-bit line,
Table 2 quantifies the storage overhead incurred by multibit error correction codes, as expressed by the Hamming
bound [10]. The storage for ECC bits grows linearly with
E. We observe that eight errors can be corrected with a
storage overhead of 14.25%. We argue that this is an acceptable overhead as it is similar to the overhead for modern
SECDED DRAM systems that have an 8-bit SECDED code
for each 64-bit word.
3 As a shorthand, we refer to such a code as ECC-E.
Number of correctable errors    Additional bits of ECC storage (512-bit line)    Storage overhead (% for 512-bit line)
1                               10                                               1.95%
2                               19                                               3.71%
4                               37                                               7.22%
8                               73                                               14.25%
Table 2: ECC Overheads.
A second problem with using large ECC codes is that
ECC codes can reduce PCM device lifetime [35]. When
writing a PCM line, to improve endurance, we assume that
a cell is written to only if the state has changed [20, 47].
While data bits in a cache line don’t always undergo change,
ECC bits have high entropy, i.e., they are very random and
very likely to flip (from 0 to 1 or vice versa) on every write.
If we are writing a 512-bit line (256 4-level cells), we observe that across a suite of benchmark programs, 118 cells
(46%) are expected to be re-programmed. For the corresponding 37 ECC-8 coding cells, 75% of the cells are expected to be re-programmed, i.e., a much higher level of
write activity. If we assume that word and row-shifting
wear-leveling optimizations [32, 6, 31, 47] are employed,
we can claim that the higher activity in the ECC code will
be spread across all cells. By adding an ECC-8 code to a
512-bit word, as done above, the average write activity of a
cell is increased by a factor of 1.07. This 7% reduction in
lifetime is worth noting; the corresponding reduction in lifetime for an ECC-4 code is only 4%. As we show later, the
more efficient scrubbing process can reduce the number of
scrub-related writes and possibly offset this 7% drop. This
observation highlights that a scrub process must be designed
carefully to balance several soft error, hard error, energy,
and storage considerations. A third problem with strong
ECC codes is the high cost of encoding and decoding. We
will address this concern in the next sub-section.
While the use of strong ECC codes has been proposed
here for soft error tolerance, it also doubles up as a hard error tolerance mechanism. Programming a PCM cell results
in repeated expansion and contraction of the chalcogenide
alloy. This increases the probability that the material physically detaches from the heating element, resulting in the cell
being permanently stuck-at some value. Recently proposed
techniques such as Pairing [14], ECP [35], SAFER [37], and
FREE-p [45] address hard error tolerance in PCM, but do
not address drift-induced soft errors. On the other hand, the
proposed ECC-E code can be used to tolerate up to E errors
and these errors can be either hard or soft errors. Since ECC
is being maintained across a fairly large line, an ECC-8 code
has a storage overhead of only 14%. In comparison, for an
ECP scheme [35] to tolerate eight hard errors in a 512-bit
line, a storage overhead of 16% would be incurred, and the
PCM device would be intolerant of soft errors.
3.2 Light Array Read for Drift Detection (LARDD)
In conventional scrub mechanisms for DRAM, a scrub
operation is issued once every 200,000 cycles [36]. This is
an infrequent operation and is not worth optimizing. However, as shown in Section 2.6, scrub operations in MLC
PCM devices are expected to be much more frequent. We
therefore propose a different access pipeline for scrub operations.
Already, in DRAM systems, refresh is known to be a
growing bottleneck because retention times are not changing, but capacity continues to increase. As a result, refresh
control is moving from the memory controller to the DRAM
chips [15]. Modern DRAM chips already have structures
that can track the last refreshed row and timing deadlines
for the next refresh operation. In a similar vein, we propose
localized PCM chip control for scrub operations so that data
is not constantly being shipped to the memory controller on
the memory bus.
We advocate that, similar to a DRAM refresh operation,
PCM rows are read from arrays, and simple logic on the
PCM chip performs error detection. If panic is not triggered,
the chip moves on to the next operation and no write is performed. The memory controller and processor are involved
only if the error detection panics and demands that the row
be corrected and re-written. The key to making this happen
on a PCM chip is simple error detection. As described earlier, it is desirable to have support for codes that can correct
multi-bit errors. However, the encoding and decoding cost
for BCH codes is significant. The required circuitry [27]
can likely not be accommodated on high-density memory
chips. We therefore add a simpler error detection mechanism in addition to the BCH codes. The BCH codes are
still required for eventual error correction at the memory
controller, but the simpler error detection logic is invoked
in the common case when the scrub operations are issued
within a PCM chip. Our error detection logic is a parity-like
scheme. For a 256-cell line, we partition it into eight 32-cell fields. Within each 32-cell field, we count the number
of drift-prone (10) states and track if this number is odd or
even (“parity”). During a scrub operation, we check to see
if the number of drift-prone states within each 32-cell field
has deviated from the recorded “parity”. Every deviation is
counted as an error and we sum the number of errors across
all eight 32-cell fields. Such error estimation is clearly approximate; we must therefore be conservative in flagging a
panic and re-writing a line. In Section 5, we show that such
a circuit has very low cost and can possibly be included on
a PCM chip. The storage overhead for parity is 8 bits per
512-bit line, a 1.625% overhead in addition to the 14.25%
already being incurred for the BCH code.
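The following is a minimal software sketch of this parity-based check (Python); the field size and drift-prone state follow the description above, while the panic threshold is left as a parameter since, with headroom, it would be set below the ECC correction capability E.

```python
DRIFT_PRONE = "10"        # Gray-coded value of the most drift-prone state
CELLS_PER_FIELD = 32
FIELDS_PER_LINE = 8       # 8 x 32 cells = 256 cells = one 512-bit line

def drift_parities(cells):
    """One bit per 32-cell field: parity of the count of cells currently
    read as the drift-prone state."""
    assert len(cells) == CELLS_PER_FIELD * FIELDS_PER_LINE
    return [sum(c == DRIFT_PRONE
                for c in cells[f * CELLS_PER_FIELD:(f + 1) * CELLS_PER_FIELD]) & 1
            for f in range(FIELDS_PER_LINE)]

def lardd_check(cells_now, parities_at_write, panic_threshold):
    """Approximate error count: fields whose drift-prone parity has deviated
    since the write. Panic (ship the line to the memory controller for BCH
    correction and re-write) once the estimate reaches the threshold."""
    estimate = sum(now != then for now, then
                   in zip(drift_parities(cells_now), parities_at_write))
    return estimate >= panic_threshold
```

At write time, the eight parity bits returned by drift_parities would be stored alongside the line's BCH code.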
We assume that parity is maintained for each 32-cell field
within a PCM chip so that the chip can perform its own autonomous scrub operation in the common case without coordinating with other PCM chips. The BCH code may be
associated per cache line and could be striped across multiple PCM chips. The entire cache line (plus BCH code) is
read out of all PCM chips in the rank only when a panic is
triggered.
The proposed read pipeline is referred to as a Light Array Read for Drift Detection (LARDD). The PCM chip sequentially reads each row, performs the approximate parity-
based error detection, and moves on if no re-write is required. If the error detection raises a panic signal, the row
is shipped to the memory controller where error correction
and re-write are handled. This is unlike the DRAM scrub
process, where every cache line is shipped to the memory
controller. The scrub process is therefore made up of several periodic LARDDs and occasional re-writes to a line.
The scrub interval is the time between successive LARDDs
to a given PCM row.
3.3 Scrub Algorithms
Our basic scrub algorithm issues LARDDs at a specified frequency, and triggers a line re-write when a minimum
number of parity-based errors are encountered in that line.
We now introduce a few extensions to this basic algorithm
that make the scrubbing process more efficient.
Headroom. The first extension is the use of headroom. If
we assume that we have an ECC-8 code that can recover
from up to eight errors, a line re-write must be triggered
before the line encounters its ninth error. If the LARDD
triggers a re-write after observing eight parity-based errors,
many uncorrectable errors may slip through. This can happen for two reasons. First, the parity-based error detection
mechanism is approximate; eight parity-based errors may
represent more than eight actual errors and the line would
be uncorrectable. Second, if the eighth and ninth errors happen in quick succession without an intervening LARDD, the
line becomes uncorrectable. The probability of such uncorrectable errors can be reduced by triggering a line re-write
well before the maximum error budget E is reached. This is
referred to as the Headroom scheme, where a line re-write
is triggered as soon as the parity-based error detection circuit identifies at least E − h errors. If a high headroom h
is provided, the uncorrectable error rate drops significantly.
But because we are being conservative in our error correction, line re-writes are triggered more often, thus accelerating wearout and increasing the hard error rate. Thus, the
choice of headroom introduces a trade-off between hard error rates and uncorrectable soft error rates.
Gradual. The Gradual scheme uses non-uniform LARDD
rates for each line. If a line has very few drift-based errors, it is unlikely that the next LARDD will trigger a rewrite. Hence, a line can employ a small LARDD frequency
and increase the frequency as more drift-based errors are
detected. To keep the design simple, we assume only two
LARDD frequencies. The LARDD frequency is doubled after E − h − g parity-based errors are encountered. Each line
must maintain a Gradual bit to track its LARDD frequency;
those lines with a Gradual bit set to zero will skip every
alternate LARDD. These Gradual bits can be stored in the
PCM memory itself. An entire row of Gradual bits (representing hundreds of PCM rows) can be read at a time by the
LARDD controller on the PCM chip. After all those rows
have been scrubbed, the updated row of bits can be written back into PCM cells. The Gradual scheme can cause
a slight increase in the uncorrectable error rate (in the rare
event that a flurry of soft errors happen between consecutive
LARDDs), but it can significantly reduce the energy overhead associated with LARDDs.
Adaptive. The uncorrectable error rate can be a function
of dynamic events. For example, prior work [46] has shown
that drift is accelerated at higher temperatures. If a PCM device heats up, the pre-selected LARDD rate may lead to an
unexpected number of uncorrectable errors. Similarly, if we
execute a workload that has many more drift-prone states,
the error rate would again go up. In yet another example,
that will be used as an adaptive LARDD case study in this
paper, as a PCM device ages, the number of worn out cells
(hard errors) in a line increases. This reduces the soft error
budget for that line as only a maximum of E errors (hard
or soft) can be corrected by the ECC-E code. Again, a preselected LARDD rate will cause more uncorrectable errors
as the number of hard errors in the device increases. As a
result, the LARDD rate will necessarily have to be adaptive.
Every epoch (say W writes), we examine the uncorrectable
errors in the last epoch, and double the LARDD rate if the
errors exceed a threshold. Assuming that we start with a
LARDD policy with headroom h, we get rid of the headroom policy after a line is known to have at least E − h − 1
hard errors (to prevent the full re-write from happening almost immediately after the write). We also consider an
adaptive policy where the faster LARDD rate is only applied to lines with greater than HE hard errors. Similar to
the Gradual optimization described earlier, a bit is tracked
per line to figure out when LARDDs must be skipped.
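The adaptive extension can be summarized by a small epoch-level controller, sketched below (Python); the epoch length W, the error threshold, and the interval floor are tunable knobs, with the threshold of 5 errors and the 64 ms floor borrowed from the case study in Section 5.4.

```python
def adapt_lardd_interval(interval_s, uncorrectable_in_epoch,
                         error_threshold=5, min_interval_s=0.064):
    """Called once per epoch (e.g., every W long-term writes): halve the
    LARDD interval if the last epoch saw too many uncorrectable errors."""
    if uncorrectable_in_epoch > error_threshold:
        interval_s = max(interval_s / 2.0, min_interval_s)
    return interval_s
```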
4 Simulation Methodology
Since we must model millions of writes to a PCM device
over its lifetime, detailed cycle-by-cycle simulations with
real workloads are not an option. However, our experiments
are frequently guided by observations and data from detailed Simics simulations which model PCM as main memory storage behind a 256 MB DRAM cache, and simulate multi-threaded and multi-programmed workloads from
PARSEC [3] and SpecJBB2005.
In each experiment, we model the behavior of 10^7 long-term writes to 64-byte cache lines. For each write, we probabilistically estimate the cells that will encode states 00, 01,
10, and 11 (for a 4-level MLC). The probabilities are derived from Simics simulations which capture the percentage
occurrence of each state in actual cache line data. For each
cell, we estimate the value of R0 and α based on the normal
distributions specified in Table 1. Based on this, we compute the time for each cell to drift to a neighboring state and
identify the cells that will be among the first to fail.
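In outline, each simulated long-term write is handled roughly as in the sketch below (Python; the state probabilities stand in for the Simics-derived data, the lg R0 samples are clipped to the 2.75σ programmed range, and the per-state boundary is placed at 3σ above the mean, all per the assumptions of Sections 2.4 and 2.6).

```python
import random

SIGMA_LGR = 0.17                                   # lg R0 deviation (Table 1)
LG_R_MEAN = {"01": 3.0, "11": 4.0, "10": 5.0, "00": 6.0}
ALPHA_MEAN = {"01": 0.001, "11": 0.02, "10": 0.06, "00": 0.10}
ALPHA_SDMR = 0.40                                  # alpha std-dev / mean (Table 1)

def simulate_one_write(state_probs, cells_per_line=256):
    """Sample one long-term write to a 256-cell line; return the per-cell
    times (seconds) at which drift would flip a cell to its neighboring state,
    earliest first. Cells in the RESET (00) state cannot drift into an error."""
    states, weights = zip(*state_probs.items())
    drift_times = []
    for _ in range(cells_per_line):
        state = random.choices(states, weights=weights)[0]
        if state == "00":
            continue
        mean = LG_R_MEAN[state]
        lg_r0 = random.gauss(mean, SIGMA_LGR)
        lg_r0 = min(max(lg_r0, mean - 2.75 * SIGMA_LGR), mean + 2.75 * SIGMA_LGR)
        alpha = random.gauss(ALPHA_MEAN[state], ALPHA_SDMR * ALPHA_MEAN[state])
        if alpha <= 0:
            continue                               # effectively no drift
        lg_boundary = mean + 3 * SIGMA_LGR         # threshold to the next state
        drift_times.append(10 ** ((lg_boundary - lg_r0) / alpha))
    return sorted(drift_times)
```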
We assume robust wear-leveling techniques [32, 6, 31,
47] will be used, and a 64 GB PCM main memory will likely
handle well over 10^17 writes in its lifetime. Many lines may
have successive writes within a short interval and will not
be prone to drift-based errors, i.e., the line is re-written before its worst-case cells drift to neighboring states. But we
expect many of the lines in the PCM device to have a long
interval between successive writes, especially in the following situations: (i) read-only data structures in long-running applications, (ii) data of long-running applications that are resident in the DRAM cache for a very long time, and (iii) filesystem data when PCM is used as persistent storage. Our simulations are focused on estimating error rates for the first 10^7 such long-term writes, i.e., we always assume that drift-based errors show up before the next write to that line. For each such write, we estimate if a given error tolerance mechanism would have allowed an error to escape correction.

Figure 4: Error rates, scrub-related writes, and scrub energy for various schemes as a function of Refresh/LARDD interval. (a) Uncorrectable Errors. (b) Scrub-Related Writes (normalized to Baseline). (c) Scrub Energy (normalized to Baseline).

Energy(Read)    Write Energy(01)    Write Energy(11)    Write Energy(10)    Write Energy(00)
10 pJ           50 pJ               100 pJ              400 pJ              1600 pJ
Table 3: Per-cell access read and write energy for the different programmed values, derived from [41].
To get an estimate of the impact of our proposed policies
on cell endurance, we report the total number of scrub or
refresh related writes to a line. The eventual impact of this
on lifetime will be a function of the percentage of long-term
writes in a workload.
Energy per read and write is estimated based on the data
of Xu et al. [41] for each cell state and is summarized in
Table 3. We also quantify the energy cost of the parity-based error detection circuit (described shortly). Our energy
estimates only include the energy consumed within PCM
chips for scrub operations; we do not include the energy
cost of data transfers on the memory channel or the cost of
ECC encode/decode at the memory controller.
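Under these assumptions, the scrub energy bookkeeping reduces to the sketch below (Python; the per-cell energies come from Table 3, the parity-check energy from Section 5.3, and differential writes, i.e., programming only cells whose state changed, follow the assumption in Section 3.1).

```python
READ_PJ_PER_CELL = 10                              # Table 3
WRITE_PJ = {"01": 50, "11": 100, "10": 400, "00": 1600}
PARITY_CHECK_PJ = 1.47                             # on-chip parity check (Sec. 5.3)

def lardd_energy_pj(cells_per_line=256):
    """Energy of one light array read plus the parity-based drift check."""
    return cells_per_line * READ_PJ_PER_CELL + PARITY_CHECK_PJ

def rewrite_energy_pj(old_states, new_states):
    """Energy of a scrub re-write: only cells whose state changed are programmed."""
    return sum(WRITE_PJ[new] for old, new in zip(old_states, new_states)
               if old != new)

def scrub_energy_pj(num_lardds, rewrites):
    """Total in-chip scrub energy: frequent cheap LARDDs plus occasional
    re-writes; `rewrites` is a list of (old_states, new_states) pairs."""
    return (num_lardds * lardd_energy_pj()
            + sum(rewrite_energy_pj(o, n) for o, n in rewrites))
```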
We compare our proposed mechanisms against two baseline techniques described in Section 2.6. The first technique
is similar to DRAM refresh that blindly reads and re-writes
lines at a specified rate. The second is a basic scrub mechanism (ECC-1) similar to that used in DRAM, where reads
are issued at the specified rate and a SECDED code is used
to correct any observed single-bit errors.
5 Results
5.1 Impact of Strong ECC Codes
We first show the impact of strong ECC codes on drift-induced error rates. We assume a scrub mechanism that
maintains an ECC-E code per line, issues LARDDs at a
given frequency, and re-writes the line if E errors are detected.

Figure 5: Uncorrectable errors for 10^7 long-term writes for the proposed scrub mechanism as a function of E and LARDD interval. The ECC-1 model represents a DRAM-like scrub mechanism.

Even with strong ECC support, there are occasions when errors go undetected or cannot be corrected;
this happens when multiple errors happen between successive LARDDs. For a given LARDD rate, the error rate as
a function of E depends on how errors are clustered over
time. Since the values of R0 and α follow a normal distribution, there are few outlier worst-case cells that are separated significantly in their drift times (roughly by a factor
of two). Hence, we observe that as E increases, the error
rate drops sharply (as the likely time gap between the E-th and (E+1)-th error goes up), as shown in Figure 5. Also,
for a given E, the uncorrectable error rate increases sharply
as the LARDD interval is increased. With an 8-second interval between LARDDs, the ECC-8 scheme yields significantly fewer uncorrectable errors (228) than the baseline
ECC-1 scrub mechanism (9,016). However, the error rate
with ECC-8 is still very high. Figure 5 shows that while
multi-bit error tolerant codes can help the basic scrub mechanism, there is still much room for improvement.
5.2 Impact of Headroom and Gradual
Headroom. To reduce the likelihood of errors slipping
through, the Headroom scheme triggers a re-write operation
even before the E-th error happens.

Policy                          Reduction in uncorrectable   Reduction in             Reduction in
                                error rate                   scrub-related writes     scrub energy
ECC-8                           92%                          173.7×                   52.4%
ECC-8-headroom-3                99.6%                        18.8×                    35.4%
ECC-8-headroom-3-gradual-1      96.5%                        24.4×                    37.8%
Table 4: Summary of improvements with proposed scrub mechanisms for a 512-second LARDD interval, relative to the ECC-1 baseline.

In our nomenclature,
a headroom-h scheme triggers a re-write after the LARDD
detects at least E − h errors. An uncorrectable error with a
headroom-3 scheme now happens if errors E − 3, E − 2,
E −1, E, and E +1, all happen within one LARDD interval,
an event with significantly lower probability.
Figure 4a shows that headroom-h helps reduce error rate,
and also helps reduce LARDD frequency for a given error
rate. The ECC-8-headroom-3 scheme is able to yield only
647 errors at a LARDD rate of 512 seconds (roughly one cache line read every 500 cycles). In comparison, the ECC-1 baseline yields 10^5 errors at that LARDD rate.
However, since the policy panics and re-writes sooner
than the no-headroom policy, the number of scrub-related
writes increases as shown in Figure 4b. Thus, there is a
clear and significant trade-off between endurance (hard error rates) and uncorrectable soft error rates when designing
scrub policies for MLC PCM. The ECC-8 scheme issues
173× fewer scrub-related writes than the ECC-1 scheme;
the ECC-8-headroom-3 scheme gives up some of this advantage and is only 19× better than the ECC-1 baseline
for scrub related writes. While some recent papers have
shown that wear-leveling and other optimizations can increase PCM lifetimes to over 10 years, it is important to realize that some of that lifetime may be lost to drift-induced
soft-error tolerance techniques. Hence, another conclusion
of our study is that hard error avoidance and tolerance in
PCM will be even more important in the future.
Using the energy estimates in Table 3, we compare energy consumption of various headroom schemes in Figure 4c. While the ECC-E scheme is more energy-efficient
than the ECC-1 baseline, an ECC-E-headroom-h scheme
consumes more energy compared to its ECC-E counterpart
because of the higher re-write frequency. So the use of headroom introduces a trade-off involving uncorrectable soft error rates, hard error rates, and energy.
Headroom-Gradual. As described in Section 3.3, to reduce the energy overhead of LARDD operations, we employ adaptive LARDD rates. If a line has fewer than
E − h − g errors, we halve the LARDD frequency. Our
analysis shows that such a scheme reduces the number of
LARDDs, but leads to a higher error rate. We also observe
that the error rates are manageable only when it is combined
with a headroom scheme. For all of our evaluations, we only
show results for g = 1.
Figure 4 shows that at a 512-second LARDD interval,
headroom-gradual leads to an 8× higher error rate than the
headroom scheme, but reduces the number of LARDDs by
25.5%, thus consuming 5% less scrub-related energy.
For a given ECC support, we can make two significant
conclusions from Figure 4 - (i) increased headroom leads
to fewer errors, but decreases lifetime and increases energy
consumption, and (ii) a headroom-gradual scheme leads to a higher uncorrectable error rate as compared to a pure headroom scheme, but decreases energy consumption, and has a very small (negative) effect on lifetimes. The quantitative improvements for the proposed schemes, relative to the ECC-1 baseline, are summarized in Table 4.

Policy                       2 s    8 s    32 s    128 s    512 s
LARDD-ECC-8                  45     84     632     2,043    4,153
LARDD-ECC-8-headroom-3       0      0      15      1,491    2,441
LARDD-ECC-8-headroom-5       0      0      0       6        486
LARDD-parity-4/8             0      19     29      871      2,647
Table 5: Uncorrectable errors with parity and ECC schemes, as a function of LARDD interval. A parity-based scheme that panics at 4 parity errors has similar uncorrectable error rates as a scheme that panics at 5 actual errors (ECC-8-headroom-3).
5.3 Impact of Parity-Based Approximate Error Detection
Strong ECC codes incur a non-trivial overhead in terms
of circuit complexity, latency, and energy. Since any off-chip storage array is likely to have some form of error protection, incurring this overhead on every request out of the
LLC is to be expected. However, incurring this cost on every LARDD will likely be prohibitively expensive.
We first describe a basic baseline ECC circuit. We use
a BCH scheme similar to the one used in [27] to correct
up to 8 errors. The ECC circuitry is split into two parts,
the first used to detect and the second used to correct errors. Since error detection is used on every LARDD, the
error detect circuitry is optimized to complete in one cycle.
To minimize area and leakage power overheads, the encode
and decode circuitry are shared. Also, since error correction
is only performed when an error is detected, the correction
circuitry is power gated when not in use to reduce leakage
power. It uses the Berlekamp-Massey algorithm to calculate the
error location polynomial and Chien search to evaluate error
position. From [27] and [7], we estimate the energy used by
an ECC circuit capable of correcting 8 errors for a 512 bit
line to be 192 pJ. This is in addition to the 2560 pJ incurred
on average for every PCM line read [41].
To alleviate this cost, and make it possible to perform
error detection on the PCM chip itself, we introduce the
parity-based approximate error detection circuit described
in Section 3.2. Our error detection circuit can be approximate because we provide headroom. Assuming that an
ECC-8 code exists for a 256-cell line, a full re-write is triggered as soon as 4 of the 8 32-cell words flag a parity error.
Our parity scheme consists of simple combinational
logic per bit to detect the drift prone 10 states and signal
a 1. Each PCM line consists of 256 PCM cells and this is
divided into 8 sets of 32 cells each. Parity is calculated for
each of these sets by XORing all the bits of the detector circuit. The parity calculated for every 32 cell set is compared
with a parity bit stored during the write. If more than 4 sets
have mismatches, a Panic signal is raised, which triggers a
full re-write. The energy (1.47 pJ) used by this parity circuit is estimated by synthesizing this circuit using 65 nm
libraries in Synopsys Design Compiler assuming worst case
switching. The parity optimization makes a LARDD even
lighter, reducing the cost of a line read from 2752 pJ to 2561
pJ (a 7% reduction). Even more importantly, the area of our
parity-based circuit (1,656 µm² at 65 nm) is 11 times less than
that of the baseline ECC circuit [27].
We note that the overhead of proposed LARDD schemes
can be further reduced by using 3D-stacked memory chips.
The logic interface dies on such 3D stacks can accommodate circuits for error detection and adaptive LARDD rates,
without impacting PCM cell density. This is consistent with
a recent trend [38] to move more functionality to the logic
interface dies on 3D stacks to save bandwidth.
We carry out simulations on real cache line data so that
the distribution of drifted cells across the 8 32-cell words
is cognizant of how states are distributed across real cache
lines. In Table 5, we compare the error rates for our initial
LARDD-ECC-8 policies and the LARDD-parity-4/8 policy
and show that the error rates are not much more than the
headroom-3 scheme.
5.4 Adaptive Scrub Rate Case Study
Drift times and error rates are a function of many factors.
We expect that the OS will track error rates over time and
occasionally increase or decrease LARDD rates to best balance the error rate and LARDD overhead. As a case study,
we evaluate such an approach for an important scenario: the
emergence of hard errors over the lifetime of a PCM chip.
We simulate a model where a line may have a number
of hard errors (but less than E hard errors) with a probability that is linearly proportional to the number of writes
already seen by the simulation. In other words, there are
few hard errors early in the simulated lifetime of the line,
but the number of hard errors gets closer to E by the end
of the simulated lifetime. A line is considered defunct if it
has E hard errors. As the hard errors increase, we effectively can handle fewer soft errors and the error rates start
to increase.
In our experiments, we assume that E is 8, h is 3, and
HE is 2. Recall from the discussion in Section 3.3 that a
faster LARDD rate is only used if a line has more than HE
hard errors. We start with a LARDD interval of 16 seconds
and the interval is halved if more than 5 uncorrectable errors
are detected in the previous epoch (10^5 long-term writes). In
a baseline with zero hard errors, a total of 4 uncorrectable errors were detected in the entire simulation. Once hard errors
were introduced as described above, the number of uncorrectable errors grows to 149,000. With our dynamic policy,
the LARDD interval is progressively reduced in the latter
half of the simulation to 64 ms, while keeping the number of
uncorrectable errors below 1,000. If a longer LARDD interval is used for lines with fewer than HE errors, LARDD energy is reduced by 22%, but the number of errors increases
by 18%. We also observed similar trends when we assumed
other hard error proliferation rates. The OS would have to
consider the above trade-offs when setting the parameters of
such adaptive policies.
5.5 Comparison Against Device-Level Solutions
In this section, we consider two of the most effective
device-level solutions to delay the onset of drift-based errors. We show that architectural solutions are superior to
these device-level alternatives in almost every regard.
The first device-level solution is non-uniform banding.
Instead of splitting the available resistance range equally
among all resistance states, the range is split in a nonuniform manner so that drift-prone states have more widely
spaced boundary resistance thresholds. This device-level
solution does not incur higher implementation complexity,
but is less effective as the number of levels in a cell is increased. In our experiments, the boundary resistance threshold for the two drift-prone states is widened by 10%, while
that for the non-drift-prone states is reduced by 10%.
The second device-level solution attempts more precise
writes. In most experiments in this paper, we assume that
the programmed resistance range allows cells with resistance at most 2.75σ from the mean. Instead, we can implement more precise writes by halving the programmed resistance range to allow cells with resistance at most 1.375σ
from the mean. This means that a cell will have to drift further to cross the boundary threshold (which is situated at a
resistance 3σ from the mean), allowing the baseline device
to get away with a much longer refresh/LARDD interval.
We’ll shortly compare the error rates with these competing
approaches. The cost of this device-level solution is summarized next.
Currently, not much work exists on the viability of precise writes in PCM. Based on work in [21, 22, 25, 11], we
project the following impacts on latency, energy, and endurance. The iterative write process subjects a cell to multiple current pulses and after every pulse, we check to see if
the resistance is within the programmed resistance range. If
we provide a pulse that causes a big jump in resistance, it
may be possible to miss the acceptable range altogether. So
the typical jump needs to be a little bit smaller than the width
of the acceptable resistance range. Thus, with a high likelihood, the final pulse will place the cell within the acceptable range. However, as already discussed, because of variations, the resistance follows a normal distribution within
that range.
If a write must be made more precise (narrower acceptable range), each pulse will have to cause a smaller jump in
resistance. So the write will involve many short resistance
jumps. Assuming a fixed pulse width (say, 5 ns), the cell
can be either programmed with many low-amplitude pulses,
or a few high-amplitude pulses [21, 22]. The former provides the higher precision at a latency penalty. Typically, if
the programmed resistance range is being halved, we will
need twice as many pulses, each causing half the resistance
jump.

Policy                              Uncorrectable errors
Baseline ECC-1 (2.75σ)              153,164
ECC-1 + Precise Write (1.375σ)      26,610
ECC-1 + Non-uniform banding         70,935
ECC-8 (2.75σ)                       11,293
ECC-8-headroom-3 (2.75σ)            674
Table 6: Uncorrectable errors for a LARDD interval of 512 seconds. We show a baseline scrub mechanism (ECC-1) combined with device-level solutions (precise writes and non-uniform banding). These are compared against architectural solutions that do not change programmed resistance ranges or boundary resistance thresholds.

Fortunately, the higher precision write also consumes less overall energy (since energy is proportional to I²). The
work of Lin et al. [21] shows an example where a write with twice the precision consumes one-fourth the energy but incurs twice the latency. To measure the impact of higher write latency, we ran detailed Simics simulations with a state-of-the-art memory scheduling model that buffers writes and
drains them once they reach a high water mark. We observed that a 200-cycle increase in write latency can reduce
IPC by 5.2%. To the best of our knowledge, the impact of write precision on endurance is not yet precisely known.
The work of Goux et al. [9, 8] shows that endurance is very
strongly influenced by the RESET pulse at the start of the
iterative write process. Subsequent pulses have a smaller
impact, as they typically do not result in melting the active
volume. Endurance is also a linear function of the number
of pulses. Thus, we can conclude that a write that is twice as
precise and needs twice as many low-amplitude pulses will
have an endurance that may be at most 2× worse. In summary, the precise write incurs a penalty in terms of latency and
endurance.
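For reference, the sketch below captures the write-buffering behavior assumed in the scheduling model mentioned above: writes are queued and only drained once the queue crosses a high water mark, so longer write latencies mostly add queue pressure rather than stalling every write. The queue depths and the low-water mark are illustrative; this is a simplified illustration, not the simulated configuration.

```python
# Simplified high-water-mark write-drain sketch (illustrative parameters).
from collections import deque

class WriteQueue:
    def __init__(self, high_water_mark=32, drain_to=8):
        self.q = deque()
        self.high_water_mark = high_water_mark  # start draining here
        self.drain_to = drain_to                # stop draining here (assumed low-water mark)

    def enqueue(self, write):
        """Buffer a write; drain the queue if the high water mark is reached."""
        self.q.append(write)
        if len(self.q) >= self.high_water_mark:
            return self.drain()
        return []

    def drain(self):
        """Issue buffered writes to the PCM banks until the queue is mostly empty."""
        drained = []
        while len(self.q) > self.drain_to:
            drained.append(self.q.popleft())
        return drained
```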
We now examine how the competing approaches compare in terms
of error rates. Table 6 considers a baseline PCM device
(with the 2.75σ programmed resistance range), a PCM device with a more precise 1.375σ programmed resistance
range, and a PCM device with non-uniform bands (10%
wider boundary thresholds for drift-prone states). All three
models have a basic DRAM-like scrubbing mechanism
(ECC-1) with a LARDD interval of 512 seconds. We see
that precise writes and non-uniform banding are able to
bring the error rate down from 153.2K in the baseline to
26.6K and 70.9K, respectively. If the baseline is instead
augmented with architectural solutions (scrub with ECC-8
and scrub with ECC-8-headroom-3), while keeping the programmed resistance range fixed at 2.75σ, we are able to
bring the errors down to 11.3K and 674, respectively. The
architectural solutions are also superior because they delay
issuing writes, thus improving energy and endurance.
The precise write approach also degrades performance because it increases write latency. Thus, if we had to pick a
single approach to mitigate drift-based errors, an architectural solution with LARDDs, multi-bit error correction, and
headroom appears most desirable. The architectural solution can be combined with device-level solutions to achieve
even higher gains.
6 Related Work
Recently, many architecture-level solutions have been
proposed to address PCM challenges [32, 31, 20, 47, 30,
46]. Only one of these targets resistance drift in MLC PCM [46]. In the eDRAM space, strong BCH-based ECC methods have been used for a variety of purposes, most recently in Hi-ECC [40]. Hi-ECC uses 5EC6ED codes to reduce refresh
power in eDRAM, along with other optimizations.
Mitigating Resistance Drift: A few device-level solutions
for drift mitigation have been proposed recently [18, 34, 16,
43, 42, 26]. These techniques are in early stages of design
and range from corrective voltage pulses to dopant materials to reference cells and modulation coding. Zhang and
Li [46] propose the Helmet architecture that uses a number
of methods to counter drift: encoding mechanisms to reduce
the occurrence of drift-prone states, switching from MLC to
SLC mode, and modifying the OS for temperature-aware
page allocation. Our work looks at a completely distinct set
of optimizations that are focused on scrub-based tolerance
of errors exposed by drift. We anticipate that some of the
Helmet optimizations can be combined with our scrub policies to further reduce error rates and overheads.
Hard Error Correction: Ipek et al. [14] propose a dynamically replicated memory architecture that allows for continued operation through graceful degradation of PCM cells.
Schechter et al. [35] attempt to deal with hard errors in resistive memories by tracking worn out cells and storing a
pointer to these cells along with the cache line. Seong et
al. [37] propose SAFER, a technique to recover from the
“stuck-at” hard errors in PCM based on the observation that
even if a bit is stuck at a particular value, that value is still
readable with the right encoding (inversion or not). Yoon et
al. [45] propose FREE-p, where a failed data block stores
a pointer to another data block that is used as a substitute.
These solutions can only handle cells that are known to have
failed and are not effective for soft error tolerance.
In Flash, strong error correction codes and adaptive
programming voltages are employed, with no background
scrub mechanism. Apart from these Flash coding strategies, to the best of our knowledge, no other work combines lightweight detection codes with stronger correction codes. The LDPC error detection strategy used in Flash is more sophisticated and has a longer latency than the parity-based schemes proposed in this work.
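To make the contrast concrete, the sketch below shows the detect-cheap/correct-rarely structure of such a scheme: a lightweight parity check gates the invocation of a strong multi-bit decoder. The parity grouping and the bch_correct placeholder are illustrative assumptions, not a specific decoder implementation or library call.

```python
# Hedged sketch: check cheap interleaved parity on every scrub read and only
# invoke the expensive strong-ECC (e.g., ECC-8-style BCH) decode on a mismatch.

def parity_bits(data_words, groups=8):
    """Interleaved even parity: one bit per group of data words (illustrative)."""
    bits = [0] * groups
    for i, word in enumerate(data_words):
        bits[i % groups] ^= bin(word).count("1") & 1
    return bits

def bch_correct(data_words, check_symbols):
    """Placeholder for a device-specific strong ECC decoder (not a real library)."""
    raise NotImplementedError("strong ECC decode is controller-specific")

def scrub_line(data_words, stored_parity, check_symbols):
    """Cheap detection first; fall back to strong correction only on a mismatch."""
    if parity_bits(data_words) == stored_parity:
        return data_words                          # common case: no decode, no write
    return bch_correct(data_words, check_symbols)  # rare case: full correction
```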
7 Conclusions
In this work, we show that drift-based multi-bit errors in
PCM MLC devices are a significant problem. Traditional
scrub mechanisms that have worked for DRAM are highly
inadequate for MLC PCM. We show how scrub mechanisms can be extended with stronger ECC codes, headroom
schemes, and adaptive scrub rates. These extensions have
an impact on uncorrectable soft error rates, lifetime, and energy. Our combined policy (ECC-8-headroom-3-gradual)
yields a 96.5% error rate reduction, a 24.4× reduction in
scrub-related writes, and a 37.8% reduction in scrub energy,
relative to a DRAM-like (ECC-1) scrub mechanism. The
impact of this improvement on overall device lifetime and
FIT rates would depend on a number of factors, including
the frequency of long-term writes in the workload. We show
that these architectural techniques are much more effective
than device-level approaches.
References

[1] Multilevel Phase Change Memories. http://www.thinfilmproducts.umicore.com/feature.asp?page=art6.
[2] Research Needs for Memory Technologies. Technical report, Semiconductor Research Corporation (SRC), 2011. www.src.org/program/grc/ds/research-needs/2011/memory.pdf.
[3] C. Benia et al. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Technical report, Princeton University, 2008.
[4] M. Boniardi, D. Ielmini, S. Lavizzari, A. Lacaita, A. Redaelli, and A. Pirovano. Statistical and scaling behavior of structural relaxation effects in phase-change memory (PCM) devices. 2009.
[5] G. W. Burr, M. J. Breitwisch, M. Franceschini, D. Garetto, K. Gopalakrishnan, B. Jackson, B. Kurdi, C. Lam, L. A. Lastras, A. Padilla, B. Rajendran, S. Raoux, and R. S. Shenoy. Phase Change Memory Technology, 2010. http://arxiv.org/abs/1001.1164v1.
[6] S. Cho and H. Lee. Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy, and Endurance. In Proceedings of MICRO, 2009.
[7] W. Gao and S. Simmons. A study on the VLSI implementation of ECC for embedded DRAM. In Proceedings of IEEE CCECE, 2003.
[8] L. Goux, T. Gille, D. Castro, G. Hurkx, J. Lisoni, R. Delhougne, D. Gravesteijn, K. De Meyer, K. Attenborough, and D. Wouters. Transient Characteristics of the Reset Programming of a Phase-Change Line Cell and the Effect of the Reset Parameters on the Obtained State. IEEE Transactions on Electron Devices, 56(7), 2009.
[9] L. Goux, D. Tio Castro, G. Hurkx, J. Lisoni, R. Delhougne, D. Gravesteijn, K. Attenborough, and D. Wouters. Degradation of the Reset Switching During Endurance Testing of a Phase-Change Line Cell. IEEE Transactions on Electron Devices, 56(2), 2009.
[10] R. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 1950.
[11] Y. Hwang, C. Um, J. Lee, C. Wei, H. Oh, G. Jeong, H. Jeong, C. Kim, and C. Chung. MLC PRAM with SLC write-speed and robust read scheme. In Proceedings of the Symposium on VLSI Technology, 2010.
[12] D. Ielmini, M. Boniardi, A. Lacaita, A. Redaelli, and A. Pirovano. Unified mechanisms for structural relaxation and crystallization in phase-change memory devices. Microelectronic Engineering, 86(7-9):1942-1945, 2009. INFOS 2009.
[13] D. Ielmini, D. Sharma, S. Lavizzari, and A. Lacaita. Reliability Impact of Chalcogenide-Structure Relaxation in Phase-Change Memory (PCM) Cells - Part I: Experimental Study. IEEE Transactions on Electron Devices, 56(5), May 2009.
[14] E. Ipek, J. Condit, E. Nightingale, D. Burger, and T. Moscibroda. Dynamically Replicated Memory: Building Reliable Systems from Nanoscale Resistive Memories. In Proceedings of ASPLOS, 2010.
[15] B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems - Cache, DRAM, Disk. Elsevier, 2008.
[16] C.-W. Jeong, D.-H. Kang, H.-J. Kim, S.-P. Ko, and D.-W. Lim. Multiple Level Cell Phase-Change Memory Devices Having Controlled Resistance Drift Parameter, Memory Systems Employing Such Devices and Methods of Reading Memory Devices, 2008. United States Patent Application US 2008/0316804 A1.
[17] S. Kang, W. Y. Cho, B.-H. Cho, K.-J. Lee, C.-S. Lee, H.-R. Oh, B.-G. Choi, Q. Wang, H.-J. Kim, M.-H. Park, Y. H. Ro, S. Kim, C.-D. Ha, K.-S. Kim, Y.-R. Kim, D.-E. Kim, C.-K. Kwak, H.-G. Byun, G. Jeong, H. Jeong, K. Kim, and Y. Shin. A 0.1-µm 1.8-V 256-Mb Phase-Change Random Access Memory (PRAM) With 66-MHz Synchronous Burst-Read Operation. IEEE Journal of Solid-State Circuits, 42(1), Jan. 2007.
[18] A. Kostylev and W. Czubatyj. Method of Eliminating Drift in Phase-Change Memory, 2004. United States Patent Application US 2004/0228159 A1.
[19] S. K. Lai. Flash memories: Successes and challenges. IBM Journal of Research and Development, 52(4.5), 2008.
[20] B. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting Phase Change Memory as a Scalable DRAM Alternative. In Proceedings of ISCA, 2009.
[21] J.-T. Lin, Y.-B. Liao, M.-H. Chiang, I.-H. Chiu, C.-L. Lin, W.-C. Hsu, P.-C. Chiang, S.-S. Sheu, Y.-Y. Hsu, W.-H. Liu, K.-L. Su, M.-J. Kao, and M.-J. Tsai. Design optimization in write speed of multi-level cell application for phase change memory. In International Conference of Electron Devices and Solid-State Circuits, 2009.
[22] J.-T. Lin, Y.-B. Liao, M.-H. Chiang, and W.-C. Hsu. Operation of Multi-Level Phase Change Memory Using Various Programming Techniques. In Proceedings of ICICDT, 2009.
[23] W. Mueller, G. Aichmayr, W. Bergner, E. Erben, T. Hecht, C. Kapteyn, A. Kersch, S. Kudelka, F. Lau, J. Luetzen, A. Orth, J. Nuetzel, T. Schloesser, A. Scholz, U. Schroeder, A. Sieck, A. Spitzer, M. Strasser, P.-F. Wang, S. Wege, and R. Weis. Challenges for the DRAM cell scaling to 40nm. 2005.
[24] T. Nirschl, J. Phipp, T. Happ, G. Burr, B. Rajendran, M.-H. Lee, A. Schrott, M. Yang, M. Breitwisch, C.-F. Chen, E. Joseph, M. Lamorey, R. Cheek, S.-H. Chen, S. Zaidi, S. Raoux, Y. Chen, Y. Zhu, R. Bergmann, H.-L. Lung, and C. Lam. Write Strategies for 2 and 4-bit Multi-Level Phase-Change Memory. 2007.
[25] A. Pantazi, A. Sebastian, N. Papandreou, M. J. Breitwisch, C. Lam, H. Pozidis, and E. Eleftheriou. Multilevel Phase-Change Memory Modeling and Experimental Characterization. In Proceedings of EPCOS, 2009.
[26] N. Papandreou, H. Pozidis, T. Mittelholzer, G. Close, M. Breitwisch, C. Lam, and E. Eleftheriou. Drift-tolerant multilevel phase-change memory. In Proceedings of the International Memory Workshop (IMW), 2011.
[27] S. Paul, F. Cai, X. Zhang, and S. Bhunia. Reliability-Driven ECC Allocation for Multiple Bit Error Resilience in Processor Cache. IEEE Transactions on Computers, 99, 2010.
[28] A. Pirovano, A. Lacaita, F. Pellizzer, S. Kostylev, A. Benvenuti, and R. Bez. Low-field amorphous state resistance and threshold voltage drift in chalcogenide materials. IEEE Transactions on Electron Devices, 51(5), May 2004.
[29] M. Qureshi, M. Franceschini, and L. Lastras. Improving Read Performance of Phase Change Memories via Write Cancellation and Write Pausing. In Proceedings of HPCA, 2010.
[30] M. Qureshi, M. Franceschini, L. Lastras-Montano, and J. Karidis. Morphable Memory System: A Robust Architecture for Exploiting Multi-Level Phase Change Memory. In Proceedings of ISCA, 2010.
[31] M. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali. Enhancing Lifetime and Security of PCM-Based Main Memory with Start-Gap Wear Leveling. In Proceedings of MICRO, 2009.
[32] M. Qureshi, V. Srinivasan, and J. Rivers. Scalable High Performance Main Memory System Using Phase-Change Memory Technology. In Proceedings of ISCA, 2009.
[33] A. Redaelli, A. Pirovano, A. Locatelli, and F. Pellizzer. Numerical Implementation of Low Field Resistance Drift for Phase Change Memory Simulations. In Non-Volatile Semiconductor Memory Workshop, 2008.
[34] S. Kostylev and T. Lowrey. Drift of Programmed Resistance In Electrical Phase Change Memory Devices, 2008.
[35] S. Schechter, G. Loh, K. Strauss, and D. Burger. Use ECP, not ECC, for Hard Failures in Resistive Memories. In Proceedings of ISCA, 2010.
[36] B. Schroeder, E. Pinheiro, and W. Weber. DRAM Errors in the Wild: A Large-Scale Field Study. In Proceedings of SIGMETRICS, 2009.
[37] N. Seong, D. Woo, V. Srinivasan, J. Rivers, and H. Lee. SAFER: Stuck-At-Fault Error Recovery for Memories. In Proceedings of MICRO, 2010.
[38] A. Udipi, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. Jouppi. Combining Memory and a Controller with Photonics through 3D-Stacking to Enable Scalable and Energy-Efficient Systems. In Proceedings of ISCA, 2011.
[39] A. N. Udipi et al. Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores. In Proceedings of ISCA, 2010.
[40] C. Wilkerson, A. Alameldeen, Z. Chishti, W. Wu, D. Somasekhar, and S.-L. Lu. Reducing Cache Power with Low-Cost, Multi-Bit Error Correcting Codes. In Proceedings of ISCA, 2010.
[41] W. Xu, J. Liu, and T. Zhang. Data Manipulation Techniques to Reduce Phase Change Memory Write Energy. In Proceedings of ISLPED, 2009.
[42] W. Xu and T. Zhang. A time-aware fault tolerance scheme to improve reliability of multilevel phase-change memory in presence of significant resistance drift. IEEE Transactions on VLSI Systems, PP(99), 2010.
[43] W. Xu and T. Zhang. Using Time-Aware Memory Sensing to Address Resistance Drift Issue in Multi-Level Phase Change Memory. In Proceedings of ISQED, 2010.
[44] D. Yoon and M. Erez. Virtualized and Flexible ECC for Main Memory. In Proceedings of ASPLOS, 2010.
[45] D.-H. Yoon, N. Muralimanohar, J. Chang, P. Ranganathan, N. Jouppi, and M. Erez. FREE-p: Protecting Non-Volatile Memory against both Hard and Soft Errors. In Proceedings of HPCA, 2011.
[46] W. Zhang and T. Li. Helmet: A Resistance Drift Resilient Architecture for Multi-level Cell Phase Change Memory System. In Proceedings of DSN, 2011.
[47] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology. In Proceedings of ISCA, 2009.