Improving Read Performance of PCM via Write Cancellation and Write Pausing
Moinuddin Qureshi, Michele Franceschini, and Luis Lastras
IBM T. J. Watson Research Center, Yorktown Heights, NY
HPCA 2010

Introduction
- More cores in the system -> more concurrency -> larger working sets
- DRAM-based memory systems are hitting a power, cost, and scaling wall
- Phase Change Memory (PCM): an emerging technology projected to be more scalable, denser, and more power-efficient

PCM Operation
- Switching by heating the material with electrical pulses
- RESET state: amorphous (high resistance); SET state: crystalline (low resistance)
- [Figure: programming temperature vs. time - a large current pulse heats above Tmelt for RESET, a smaller and longer pulse above Tcryst for SET; cell schematic shows memory element plus access device. Photo courtesy: Bipin Rajendran, IBM]
- Read latency is 2x-4x that of DRAM; write latency is much higher

Problem of Contention from Slow Writes
- PCM writes are 4x-8x slower than reads
- Writes are not latency critical. Typical response: use large buffers and intelligent scheduling
- But once a write is scheduled to a bank, a later-arriving read must wait
- Write requests cause contention for reads -> increased read latency

Outline
- Introduction
- Quantifying the Problem
- Adaptive Write Cancellation
- Write Pausing
- Combining Cancellation and Pausing
- Summary

Configuration: Hybrid Memory
- Processor chip, DRAM cache (256 MB), and PCM-based main memory
- Each PCM bank has a separate read queue (RDQ) and a 32-entry write queue (WRQ)
- Baseline uses read-priority scheduling while the WRQ is less than 80% full
- If the WRQ is more than 80% full, an oldest-first policy issues "forced writes" (rare, <0.1%)

Problem
- Read latency = 1K cycles, write latency = 8K cycles (sensitivity study in the paper)
- 12 workloads, each with 8 benchmarks from SPEC CPU2006
- [Figure: effective read latency (cycles) and normalized execution time for Baseline, No Read Priority, Write Latency = 1K, and Write Latency = 0]
- Writes significantly increase read latency (a problem only for asymmetric memories)

Write Cancellation
- Write cancellation: abort an on-going write to improve read latency
- The cancelled line is left in a non-deterministic state, so a read to that line must be serviced from the matching entry in the WRQ
- Cancel a write as soon as a read request arrives at the bank (as long as the write is not being performed in forced mode)

Write Cancellation with Static Threshold (WCST)
- Cancelling a write that is close to completion is wasteful and causes episodes of forced writes (low performance)
- WCST: cancel a write only if less than K% of its service is done
- [Figure: effective read latency (cycles) for K = 0% (never cancel, 2365 cycles), 50%, 65%, 75%, 90%, and 100% (always cancel)]

Adaptive Write Cancellation
- The best threshold depends on the number of pending entries in the WRQ: fewer entries -> higher threshold (best read latency); more entries -> lower threshold (fewer forced writes)
- [Figure: desired threshold vs. number of entries in the WRQ, from 100% at low occupancy down to 0% near the forced-write region]
- Write Cancellation with Adaptive Threshold (WCAT): Threshold(%) = 100 - 4 x NumEntriesInWRQ
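As a companion to the two cancellation slides above, a minimal Python sketch of the decision a bank controller could make when a read arrives during a write. The function and parameter names (should_cancel_write, wcat_threshold, service_done_percent) are illustrative assumptions; only the K% rule, the forced-write exception, and the WCAT formula come from the slides.

    def wcat_threshold(wrq_occupancy: int) -> int:
        # Adaptive threshold (WCAT): Threshold(%) = 100 - 4 * NumEntriesInWRQ, clamped to [0, 100]
        return max(0, min(100, 100 - 4 * wrq_occupancy))

    def should_cancel_write(service_done_percent: float, wrq_occupancy: int,
                            forced_mode: bool, adaptive: bool = True,
                            static_k: int = 75) -> bool:
        # Decide whether to cancel the on-going write so a newly arrived read can be served
        if forced_mode:                        # forced writes (WRQ nearly full) are never cancelled
            return False
        k = wcat_threshold(wrq_occupancy) if adaptive else static_k
        return service_done_percent < k        # cancel only if less than K% of the write is done

    # Example: the WRQ holds 5 pending writes and the current write is 30% done.
    print(should_cancel_write(30.0, wrq_occupancy=5, forced_mode=False))            # WCAT: K = 80 -> True
    print(should_cancel_write(30.0, wrq_occupancy=5, forced_mode=False,
                              adaptive=False, static_k=25))                         # WCST, K = 25 -> False

Clamping the adaptive threshold to [0, 100] makes the policy "always cancel" for an empty WRQ and "never cancel" once 25 or more writes are pending, which matches the occupancy curve sketched above.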
Adaptivity of WCAT
- We sampled every WRQ once per 2M cycles to measure occupancy

  Num entries in WRQ    Low (0-1)   Med (2-13)   High (14-25)   Forced (26+)
  WCST (K=75%)          61.4%       29.8%        7.4%           1.43%
  WCAT                  58.2%       35.4%        5.6%           0.72%

- WCAT uses a higher threshold initially, while the WRQ is nearly empty, and a lower threshold later, which reduces the episodes of forced writes

Results for WCAT
- [Figure: average read latency (cycles) and extra write cycles (%) for simple Write Cancellation, WCST (K=75%), and WCAT; baseline = 2365 cycles, ideal = 1K cycles]
- The adaptive threshold reduces latency further and incurs half the write overhead

Iterative Writes in PCM Devices
- In multi-level cells (MLC), the required programming precision increases linearly with the number of levels
- PCM cells respond differently to the same programming pulse
- The acknowledged solution to this uncertainty is iterative writing: each iteration is a write-read-verify step
- [Diagram: write -> read -> verify; repeat until the verify step succeeds]

Model for Iterative Writes
- We develop an analytical model for the number of iterations, in terms of bits per cell, the number of levels written in one shot, and learning
- The time required to write a line is the worst case over all cells in the line
- MLC with 3 bits/cell: average of 8.3 iterations (consistent with the MLC literature)

Concept of Write Pausing
- Iterative writes can be paused to service pending read requests
- [Diagram: a read request X arriving during Iteration 2 is serviced at the next potential pause point, between Iterations 2 and 3]
- Reads can be performed at the end of each iteration (a potential pause point)
- Better read latency with negligible write overhead
- We extend the iterative write algorithm of Nirschl et al. [IEDM'07] to support write pausing
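To make the pause points concrete, a toy Python sketch of a write-read-verify loop that drains pending reads between iterations. Everything here (one_pulse, verified, serve_read, the convergence model) is an illustrative assumption; only the structure, one read-verify per iteration with reads serviced at iteration boundaries, comes from the slides.

    import random
    from collections import deque

    def iterative_write_with_pausing(iterate_once, verify, pending_reads, serve_read,
                                     max_iterations=32):
        # Program a line with write-read-verify iterations; the end of each iteration is a
        # potential pause point where reads that arrived at this bank are serviced.
        for iteration in range(1, max_iterations + 1):
            iterate_once()                       # apply one programming pulse to unfinished cells
            if verify():                         # read back and check: every cell at its target?
                return iteration                 # write complete
            while pending_reads:                 # pause point: drain reads before the next pulse
                serve_read(pending_reads.popleft())
        raise RuntimeError("write did not converge")

    # Toy usage: 4 cells, each needing a few more pulses; two reads are already waiting.
    cells = [random.randint(3, 9) for _ in range(4)]
    reads = deque(["read A", "read B"])

    def one_pulse():          # one pulse moves every unfinished cell one step closer to its target
        for i, c in enumerate(cells):
            cells[i] = max(0, c - 1)

    def verified():           # done when every cell has reached its target level
        return all(c == 0 for c in cells)

    iterations = iterative_write_with_pausing(one_pulse, verified, reads,
                                               lambda r: print("paused write to serve", r))
    print("write finished after", iterations, "iterations")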
Results for Write Pausing
- [Figure: effective read latency (cycles) for Baseline, Write Pause, and Anytime Pause]
- Pausing only at the end of each iteration obtains 85% of the benefit of "anytime" pausing

Write Pausing + WCAT
- [Diagram: a read arriving in the middle of Iteration 2 cancels that iteration only; after the read is serviced, Iteration 2 is re-executed and the write continues]
- Only one iteration is cancelled, so this "micro-cancellation" has low overhead (a toy sketch appears at the end of this deck)

Results
- [Figure: effective read latency (cycles) and speedup with respect to the baseline for Write Pause, Write Pause + Micro Cancellation, and Anytime Pause; baseline = 2365 cycles, ideal = 1K cycles]
- Write Pause + Micro Cancellation comes very close to Anytime Pause (the re-execution overhead of micro-cancellation is <4% extra iterations)

Impact of Write Queue Size
- [Figure: speedup with respect to the 32-entry baseline for Baseline and Pause + Micro Cancellation as the per-bank WRQ grows from 8 to 512 entries]
- Large write buffers are needed to best exploit the benefit of pausing

Summary
- Slow writes increase the effective read latency (by 2.3x)
- Write cancellation: cancel an on-going write to service a read
  - Threshold-based write cancellation
  - Adaptive threshold: better performance at half the overhead
- Write pausing exploits iterative writes to service pending reads
- Write Pausing + Micro Cancellation is close to the optimal (anytime) pause
- Effective read latency drops from 2365 to 1330 cycles (1.45x speedup)
- Large write buffers are needed to exploit the benefit of pausing

Questions

Write Pausing in Iterative Algorithms (Nirschl et al., IEDM'07)

Workloads and Figure of Merit
- 12 memory-intensive workloads from SPEC CPU2006:
  - 6 rate-mode (eight copies of the same benchmark)
  - 6 mix-mode (two copies each of four benchmarks)
- Key metric: Effective Read Latency
  - Tin = time at which a read request enters the RDQ
  - Tout = time at which the read request finishes service at memory
  - Effective Read Latency = Tout - Tin (average reported)

Sensitivity to Write Latency
- At a write latency of 4K cycles, the speedup is 1.35x instead of the 1.45x obtained at an 8K write latency
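To complement the Write Pausing + WCAT slide, a toy timeline sketch of micro-cancellation in the same simplified style as the earlier sketch: a read arriving while an iteration is in flight aborts only that iteration, which is then re-executed. The names, the cycle counts, and the omission of read service time are illustrative assumptions, not the controller model from the paper.

    from collections import deque

    def write_with_pause_and_micro_cancel(pulses_needed, read_arrival_cycles,
                                          cycles_per_iteration=250):
        # pulses_needed: pulses each cell still needs; read_arrival_cycles: sorted arrival times.
        # A read arriving mid-iteration cancels just that iteration (micro-cancellation);
        # the iteration is re-executed afterwards. Read service time is not modeled.
        reads = deque(read_arrival_cycles)
        now, completed, redone = 0, 0, 0
        cells = list(pulses_needed)
        while any(cells):
            end = now + cycles_per_iteration
            if reads and reads[0] < end:               # a read arrives while this pulse is in flight
                now = reads.popleft()                  # abort the pulse and serve the read
                redone += 1
                continue                               # redo only this one iteration
            cells = [max(0, c - 1) for c in cells]     # the iteration ran to completion
            completed += 1
            now = end
            while reads and reads[0] <= now:           # pause point: drain reads that queued up
                reads.popleft()
        return completed, redone

    # Example: a line needing up to 7 iterations, with reads arriving at cycles 300 and 1100.
    print(write_with_pause_and_micro_cancel([5, 7, 6, 4], [300, 1100]))   # -> (7, 2)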