Improving Read Performance of PCM via Write Cancellation

Improving Read Performance of PCM via
Write Cancellation and Write Pausing
Moinuddin Qureshi
Michele Franceschini and Luis Lastras
IBM T. J. Watson Research Center, Yorktown Heights, NY
HPCA – 2010
© 2007 IBM Corporation
Introduction
More cores in system → More concurrency → Larger working set
DRAM-based memory systems are hitting power, cost, and scaling walls
Phase Change Memory (PCM): Emerging technology, projected
to be more scalable, higher density, power-efficient
PCM Operation
Switching by heating using electrical pulses
RESET state: amorphous (high resistance)
SET state: crystalline (low resistance)
[Figure: temperature vs. time for the RESET pulse (relative to Tmelt) and the SET pulse (relative to Tcryst); cell schematic with memory element and access device; labels: small current, large current, RESET = high resistance, SET = low resistance.]
Read latency 2x-4x of DRAM. Write latency much higher
Photo Courtesy: Bipin Rajendran, IBM
Problem of Contention from Slow Writes
PCM writes are 4x-8x slower than reads. Writes are not latency critical.
Typical response: Use large buffers and intelligent scheduling.
But once a write is scheduled to a bank, a later-arriving read waits
Write request causes contention for reads → increased read latency
Outline
• Introduction
• Quantifying the Problem
• Adaptive Write Cancellation
• Write Pausing
• Combining Cancellation & Pausing
• Summary
Configuration: Hybrid Memory
[Diagram: Processor Chip, DRAM Cache (256MB), PCM-Based Main Memory]
Each bank has a separate 32-entry read queue (RDQ) and write queue (WRQ)
Baseline uses read priority scheduling if WRQ < 80% full.
If WRQ > 80% full, oldest-first policy → “forced write” (rare, <0.1%)
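The baseline policy above can be sketched as a small per-bank scheduler. This is a minimal illustration, not the authors' simulator: the function name `next_request`, the `(arrival_time, tag)` request tuples, and the reading of "oldest-first" as "oldest request across both queues" are assumptions made here.

```python
from collections import deque

WRQ_CAPACITY = 32             # entries per bank (from the slide)
FORCED_WRITE_FRACTION = 0.8   # occupancy above which oldest-first applies

def next_request(rdq, wrq):
    """Pick the next request for an idle bank.

    rdq and wrq are deques of (arrival_time, tag) tuples in arrival order
    (a hypothetical representation). Returns (kind, request) or None.
    """
    if not rdq and not wrq:
        return None
    if len(wrq) > FORCED_WRITE_FRACTION * WRQ_CAPACITY:
        # Oldest-first: assumed to compare the heads of both queues.
        if rdq and rdq[0][0] <= wrq[0][0]:
            return ("read", rdq.popleft())
        return ("forced_write", wrq.popleft())
    if rdq:
        # Read priority: any pending read goes before a write.
        return ("read", rdq.popleft())
    return ("write", wrq.popleft())
```

With one pending read and one older pending write, the read is still served first; only when the WRQ holds more than 80% of 32 entries (26 or more) does the oldest write go ahead as a forced write.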
Problem
Read Latency=1K cycles, Write Latency=8K cycles (sensitivity in paper)
12 workloads: each with 8 benchmarks from SPEC06
[Figure: Effective Read Latency (cycles) and Normalized Execution Time for four configurations: Baseline, No Read Priority, Write Latency=1K, Write Latency=0.]
Writes significantly increase read latency
(Problem only for asymmetric memories)
Write Cancellation
Write Cancellation: “abort” an ongoing write to improve read latency
The cancelled line is left in a non-deterministic state, so a read to that line is serviced from the matching entry in the WRQ
Perform write cancellation as soon as a read request arrives at a bank
(as long as the write is not done in forced mode)
Write Cancellation with Static Threshold
Canceling a write request close to completion is wasteful
and causes episodes of forced-writes (low performance)
WCST: Cancel a write request only if less than K% of its service is done
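The static-threshold rule can be written as a one-line predicate. The sketch below is only an illustration: the function name and its parameters (`elapsed_cycles`, `write_latency`, `k_percent`, `forced`) are hypothetical, and progress is assumed to be measured as elapsed service time over total write latency.

```python
def should_cancel_static(elapsed_cycles, write_latency, k_percent, forced=False):
    """WCST sketch: cancel the ongoing write only if less than K% is done.

    Writes issued in forced mode are never cancelled.
    """
    if forced:
        return False
    percent_done = 100.0 * elapsed_cycles / write_latency
    return percent_done < k_percent
```

With an 8K-cycle write and K=75%, a read arriving 2000 cycles into the write (25% done) cancels it, while a read arriving at 7000 cycles (87.5% done) waits. K=0% never cancels and K=100% always cancels, matching the NeverCancel and AlwaysCancel endpoints.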
[Figure: Effective Read Latency (cycles, axis from 1000 to 1600) for K=0% (NeverCancel), K=50%, K=65%, K=75%, K=90%, and K=100% (AlwaysCancel); one off-scale bar is labeled 2365.]
Adaptive Write Cancellation
Best threshold depends on the number of pending entries in the WRQ.
Fewer entries → Higher threshold (best read latency)
More entries → Lower threshold (reduces forced writes)
[Figure: Threshold (0% to 100%) versus Num Entries in WRQ (ticks at 10, 20, 30), with regions marked High, Low, and ForcedWrites.]
Write Cancellation with Adaptive Threshold (WCAT)
Threshold = 100 – (4*NumEntriesInWRQ)
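The WCAT formula is simple enough to check by hand. The following sketch evaluates it and applies it to a cancellation decision; the function names are hypothetical, and clamping the threshold at 0 for occupancies above 25 is an assumption (the slide formula would otherwise go negative). Forced-mode writes are left out for brevity.

```python
def adaptive_threshold(num_entries_in_wrq):
    """WCAT threshold in percent: 100 - 4 * NumEntriesInWRQ.

    Clamped at 0 (an assumption) so it never goes negative.
    """
    return max(0, 100 - 4 * num_entries_in_wrq)

def should_cancel_adaptive(elapsed_cycles, write_latency, num_entries_in_wrq):
    """Cancel only if write progress is below the occupancy-dependent threshold."""
    percent_done = 100.0 * elapsed_cycles / write_latency
    return percent_done < adaptive_threshold(num_entries_in_wrq)
```

An empty WRQ gives a threshold of 100% (always cancel), 10 entries give 60%, and 25 or more entries give 0% (never cancel). A write that is half done is therefore cancelled with 10 pending entries but not with 20.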
Adaptivity of WCAT
We sampled all WRQ every 2M cycles to measure occupancy
Num Entries in WRQ    Low (0-1)   Med (2-13)   High (14-25)   Forced (26+)
WCST (K=75%)          61.4%       29.8%        7.4%           1.43%
WCAT                  58.2%       35.4%        5.6%           0.72%
WCAT uses a higher threshold initially, when the WRQ is empty, but a
lower threshold later, which reduces the episodes of forced writes
Results for WCAT
Baseline: 2365 cycles. Ideal: 1K cycles.
[Figure: Average Read Latency (cycles) and Extra Write Cycles (%) for Write Cancellation, WCST (K=75%), and WCAT.]
Adaptive threshold reduces latency and incurs half the overhead
Iterative Write in PCM devices
In Multi-Level Cells (MLC), the programming precision requirement
increases linearly with the number of levels
PCM cells respond differently to same programming pulse
Acknowledged solution to address uncertainty: Iterative writes
Each iteration consists of steps of: write-read-verify
[Flowchart: Write → Read → Verify; if not done, return to Write; if done, the write completes.]
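The write-read-verify loop can be sketched as follows. This is a toy illustration only: `ToyCell`, its halfway-to-target response, the tolerance, and the iteration cap are all hypothetical stand-ins, not a model of real PCM cell behaviour or of the algorithm used in the talk.

```python
class ToyCell:
    """Hypothetical cell: each pulse moves the stored value a fixed
    fraction of the way toward the target."""
    def __init__(self, value=0.0, step=0.5):
        self.value = value
        self.step = step

    def pulse(self, target):
        self.value += self.step * (target - self.value)

    def read(self):
        return self.value

def iterative_write(cell, target, tolerance, max_iters=32):
    """Repeat write-read-verify until the cell reads within tolerance.

    Returns the number of iterations used.
    """
    for iteration in range(1, max_iters + 1):
        cell.pulse(target)                      # write
        value = cell.read()                     # read
        if abs(value - target) <= tolerance:    # verify
            return iteration
    raise RuntimeError("cell did not converge within max_iters")
```

For this toy cell, writing a target of 1.0 with tolerance 0.1 takes four iterations (0.5, 0.75, 0.875, 0.9375). A tighter tolerance, as needed with more levels per cell, takes more iterations.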
Model for Iterative Writes
We develop an analytical model to capture the number of iterations,
in terms of bits/cell, number of levels written in one shot, and learning
Time required to write a line is the worst case over all cells in the line
MLC: 3 bits/cell
Avg number of iterations: 8.3 (consistent with MLC literature)
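The worst-case rule for a line can be stated in a few lines of code. The per-cell iteration counts and the cycles-per-iteration parameter below are hypothetical inputs; the analytical model that produces the per-cell counts is not reproduced here.

```python
def line_write_iterations(cell_iterations):
    """A line is done only when its slowest cell is done."""
    return max(cell_iterations)

def line_write_cycles(cell_iterations, cycles_per_iteration):
    """Line write time, assuming every iteration takes the same
    number of cycles (an assumption for illustration)."""
    return line_write_iterations(cell_iterations) * cycles_per_iteration
```

For example, if the cells of a line need 3, 5, 2, and 4 iterations, the line takes 5 iterations, which is why the line-level average sits above the per-cell average.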
Concept of Write Pausing
Iterative writes can be paused to service pending read requests
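[Timeline: potential pause points lie at the boundaries between iterations. Without pausing, Rd X waits behind Iter 1, Iter 2, Iter 3, Iter 4. With pausing: Iter 1, Iter 2, Rd X, Iter 3, Iter 4.]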
Reads can be performed at the end of
each iteration (potential pause point)
Better read latency with
negligible write overhead
We extend the iterative write algorithm of Nirschl et al. [IEDM’07]
to support Write Pausing
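The effect of the pause point on read latency can be illustrated with a small timing calculation. This is a sketch under stated assumptions: equal-length iterations, a single read arriving during a single write, and hypothetical names (`read_start_time`, the policy strings); the four iterations of 1000 cycles used in the example are illustrative values, not figures from the talk.

```python
def read_start_time(read_arrival, write_start, num_iters, iter_cycles, policy):
    """Cycle at which a read arriving during an ongoing write begins service.

    policy is one of "no_pause", "iteration_pause", "anytime_pause".
    """
    write_end = write_start + num_iters * iter_cycles
    if read_arrival < write_start or read_arrival >= write_end:
        return read_arrival              # bank is not busy with this write
    if policy == "no_pause":
        return write_end                 # read waits for the whole write
    if policy == "iteration_pause":
        elapsed = read_arrival - write_start
        # Round up to the next iteration boundary (integer ceiling).
        boundaries = -(-elapsed // iter_cycles)
        return write_start + boundaries * iter_cycles
    if policy == "anytime_pause":
        return read_arrival              # idealized: pause immediately
    raise ValueError("unknown policy: %s" % policy)
```

For a write of four 1000-cycle iterations starting at cycle 0 and a read arriving at cycle 1500, the read starts at 4000 without pausing, at 2000 with pausing at the end of iteration 2, and at 1500 with an idealized anytime pause.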
Results for Write Pausing
[Figure: Effective Read Latency (cycles) for Baseline, Write Pause, and Anytime Pause.]
Write Pausing at end of iteration gets 85% of benefit of “Anytime” Pause
Write Pausing + WCAT
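[Timeline: a read (Rd X) arrives during Iter 2. With pausing alone, the read is serviced after Iter 2 completes: Iter 1, Iter 2, Rd X, Iter 3, Iter 4. With micro-cancellation, Iter 2 is cancelled, the read is serviced, and Iter 2 is re-executed: Iter 1, Rd X, Iter 2, Iter 3, Iter 4.]
Only one iteration is cancelled → “micro-cancellation” has low overhead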
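One way to combine pausing with the adaptive threshold is sketched below. It assumes, and this is an assumption rather than something the slides spell out, that the WCAT threshold is applied to the progress of the current iteration only: if the iteration is below the threshold it is cancelled and redone, otherwise the read waits for the iteration boundary. All names are hypothetical, and the read is assumed to arrive while the write is in progress.

```python
def handle_read_during_write(read_arrival, write_start, iter_cycles,
                             num_entries_in_wrq):
    """Return (action, read_start, wasted_cycles) for a read that arrives
    while an iterative write is in progress."""
    into_iter = (read_arrival - write_start) % iter_cycles
    if into_iter == 0:
        # Read arrived exactly at an iteration boundary: just pause.
        return ("pause", read_arrival, 0)
    percent_done = 100.0 * into_iter / iter_cycles
    threshold = max(0, 100 - 4 * num_entries_in_wrq)   # WCAT formula, clamped
    if percent_done < threshold:
        # Micro-cancellation: abandon only the current iteration.
        return ("micro_cancel", read_arrival, into_iter)
    # Too far along to cancel: finish the iteration, then pause.
    return ("pause", read_arrival + (iter_cycles - into_iter), 0)
```

With 1000-cycle iterations and 5 entries in the WRQ (threshold 80%), a read arriving 200 cycles into an iteration cancels it and wastes 200 cycles, while a read arriving 900 cycles in waits 100 cycles for the boundary.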
Results
Baseline: 2365 cycles. Ideal: 1K cycles.
[Figure: Effective Read Latency (cycles) and Speedup (wrt Baseline) for Write Pause, Write Pause + Micro Cancellation, and Anytime Pause.]
Write Pause + Micro Cancellation very close to Anytime Pause
(re-execution overhead of micro cancellation <4% extra iterations)
Impact of Write Queue Size
[Figure: Speedup wrt Baseline (32-entry) versus Number of Entries in Each WRQ (8, 16, 32, 64, 128, 256, 512) for Baseline and Pause + Micro Cancellation.]
We will need large buffers to best exploit the benefit of Pausing
Summary
• Slow writes increase the effective read latency (2.3x)
• Write Cancellation: Cancel ongoing write to service read
• Threshold-based write cancellation
• Adaptive Threshold: better performance, half the overhead
• Write Pausing exploits iterative writes to service pending reads
• Write Pausing + Micro Cancellation close to optimal pause
• Effective read latency: from 2365 to 1330 cycles (1.45x speedup)
• We will need large write buffers to exploit the benefit of Pausing
Questions
Write Pausing in Iterative Algorithms
(Nirschl+ IEDM’07)
Workloads and Figure of Merit
12 memory-intensive workloads from SPEC 2006:
• 6 rate-mode (eight copies of same benchmark)
• 6 mix-mode (two copies of four benchmarks)
Key metric: Effective Read Latency
Tin = Time at which read request enters RDQ
Tout = Time at which read request finishes service at memory
Effective Read Latency = Tout – Tin (average reported)
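The figure of merit is a plain average over per-request latencies. A minimal sketch, assuming each read is logged as a hypothetical `(t_in, t_out)` pair in cycles:

```python
def effective_read_latency(requests):
    """Average of (Tout - Tin) over all read requests.

    requests is a non-empty list of (t_in, t_out) pairs, where t_in is
    the cycle the read entered the RDQ and t_out the cycle it finished
    service at memory.
    """
    latencies = [t_out - t_in for t_in, t_out in requests]
    return sum(latencies) / len(latencies)
```

For instance, one read that sees an idle bank (1000 cycles) and one that waits behind a write for a total of 2000 cycles give an effective read latency of 1500 cycles.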
Sensitivity to Write Latency
At WriteLatency=4K, the speedup is 1.35x instead of 1.45x (at 8K latency)