
MICRO-43
Elastic Refresh: Techniques to Mitigate Refresh
Penalties in High Density Memory
Jeffrey Stuecheli1,2, Dimitris Kaseridis1, Hillery C. Hunter3 & Lizy K. John1
1 ECE Department, The University of Texas at Austin
2 IBM Corp., Austin
3 IBM Thomas J. Watson Research Center
Laboratory for Computer Architecture
12/7/2010
Overview/Summary
 Refresh overhead is increasing with device density
 Because this overhead grows with density, performance increasingly suffers
 Current refresh scheduling methods are ineffective at hiding these delays
 We propose more sophisticated mitigation methods
– Elastic Refresh Scheduling
Background
Basic DRAM/Refresh Info
 Each bit is stored on a capacitor
 A single access transistor gates each cell; the capacitor holds the charge
 Leakage: the cell loses charge over time
 Refresh: rewrite each cell on a periodic basis
 DDR3
– Refresh requirement depends on temperature: 64 ms @ 85°C, 32 ms @ 95°C
– DRAM device contains an internal address counter
– JEDEC simply specifies the time interval (tREFI, REFresh Interval): tREFI = 64 ms / 8192 = 7.8 µs (3.9 µs at 95°C)
Background
Transition to denser devices
 7.8 µs is based on 8K rows per bank
 DRAM device density doubles every ~2 years
 With one refresh per row, tREFI would halve each generation
 Instead, multiple rows are refreshed with each command
 Current-delivery constraints force tRFC to increase with denser devices
[Figure: 512 Mbit device at 95 nm vs. 2 Gbit device at 42 nm]
Background
“Stacked” Refresh Operations in a Single Command: Example
Source: Micron, TN-47-16, “Designing for High-Density DDR2 Memory”
5
Laboratory for Computer Architecture
12/7/2010
Background
tRFC Growth with DRAM Density
 In the most basic terms, tRFC should scale linearly with density
– Based strictly on the current needed to charge the capacitance
 ~Fixed charge per bit
 This has been reflected in the DDR3 spec, with the exception of 8 Gbit
 Net: even if DRAM vendors can slow the growth, the delay is already large today

DRAM type   Refresh completion time (tRFC)
512 Mbit    90 ns
1 Gbit      110 ns
2 Gbit      160 ns
4 Gbit      300 ns
8 Gbit      350 ns
Motivation
Slowdown Effects Observed in Simulation
 Simics/GEMS
 4 cores, 2 × 1333 MHz channels, 2 DDR3 ranks per channel
[Chart: IPC degradation over a no-refresh baseline for 2 Gbit, 4 Gbit, and 8 Gbit devices across the SPEC CPU2006 integer and floating-point benchmarks plus geometric means; y-axis spans 0% to 30%]
Motivation
Why it is so bad
[Timing diagram: refreshes recur every tREFI and each occupies tRFC; a normal DRAM read takes 26 ns, while a worst-case read that arrives just behind a refresh takes 326 ns]

DRAM capacity   tRFC     Bandwidth overhead (95°C, per rank)   Latency overhead (95°C)
512 Mb          90 ns    2.7%                                  1.4 ns
1 Gb            110 ns   3.3%                                  2.1 ns
2 Gb            160 ns   5.0%                                  4.9 ns
4 Gb            300 ns   7.7%                                  11.5 ns
8 Gb            350 ns   9.0%                                  15.7 ns
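As a rough sanity check (an approximation, not taken from the slide): refresh occupies tRFC out of every tREFI of a rank's time, and a read that arrives in that window waits about half of tRFC on average, so

\[
\text{BW overhead} \approx \frac{t_{RFC}}{t_{REFI}}, \qquad
\text{latency overhead} \approx \frac{t_{RFC}}{t_{REFI}} \cdot \frac{t_{RFC}}{2}.
\]

With tREFI = 3.9 µs at 95°C this reproduces the 4 Gbit and 8 Gbit rows (350/3900 ≈ 9.0%, and 9.0% × 175 ns ≈ 15.7 ns); the smaller densities in the table sit somewhat above this simple estimate.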
Motivation
Postponing Refresh Operations
 Each cell needs to be refreshed every 64 ms
 Refresh command spacing is based on an average rate
 As such, cells do not fail if a refresh is not sent the moment tREFI expires
 The current DDR3 spec allows the controller to fall up to eight tREFI intervals behind (the backlog count); a sketch of this accounting follows below
– Worst-case cell refresh interval is elongated by only ~0.1% (8 in 8K)
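As a minimal sketch of this backlog accounting, assuming a simple per-rank counter (the names and the tick/issue hooks below are illustrative, not from the deck):

#define MAX_POSTPONED 8              /* DDR3 allows falling up to 8 tREFI intervals behind */

typedef struct {
    unsigned backlog;                /* refreshes currently owed to this rank */
} rank_refresh_state;

/* Called once per tREFI (7.8 us, or 3.9 us at 95 C): one more refresh becomes due. */
void trefi_tick(rank_refresh_state *r) {
    if (r->backlog < MAX_POSTPONED)
        r->backlog++;
    /* at MAX_POSTPONED the scheduler must issue a refresh at high priority */
}

/* Called when the controller actually sends a REF command to the rank. */
void refresh_issued(rank_refresh_state *r) {
    if (r->backlog > 0)
        r->backlog--;
}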
Motivation
Current Approaches
 Demand Refresh (DR)
– The most basic policy: sends a refresh as a high-priority operation every tREFI period
 Defer Until Empty (DUE)
– Exploits the DRAM's ability to postpone refreshes
– Refresh operations are postponed until no reads are queued or the maximum backlog count has been reached
 Why these policies are ineffective (decision rules sketched below)
– DR: does nothing to hide refreshes
– DUE: too aggressive in sending refresh operations; does not take advantage of the backlog in many cases
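A minimal sketch of the two decision rules as described above, evaluated when the scheduler considers a rank (function and parameter names are illustrative):

#include <stdbool.h>

/* Demand Refresh: issue as soon as a refresh is due, regardless of pending reads. */
bool dr_should_issue_refresh(unsigned backlog) {
    return backlog > 0;
}

/* Defer Until Empty: postpone while reads are queued, unless the maximum
 * backlog (8 postponed refreshes in DDR3) forces the refresh out. */
bool due_should_issue_refresh(unsigned backlog, bool read_queue_empty) {
    return backlog >= 8 || (backlog > 0 && read_queue_empty);
}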
Elastic Refresh
 Exploit
– Non-uniform request distribution
– Refresh overhead only has to fit into otherwise free cycles
 Initially not aggressive; converges with DUE as the refresh backlog grows
 Latency-sensitive workloads are often lower bandwidth
 Goal: decrease the probability of reads conflicting with refreshes
Elastic Refresh
Idle Delay Function
 Introduce a refresh-backlog-dependent idle delay threshold
 With a low backlog, there is no reason to send a refresh command the moment the rank goes idle
 With a bursty request stream, the probability of a future request decreases with idle time
 As the backlog grows, decrease this delay threshold
[Figure: idle delay threshold vs. refresh backlog (1–8), with a constant region at low backlog, a proportional region where the threshold falls linearly, and a high-priority region as the backlog approaches its maximum]
Elastic Refresh
Tuning the Idle Delay Function
 The optimal shape of the IDF is
workload dependent
 IDF can be controlled with the
listed parameters
 Our system contains hardware
to determine “good” parameters
– Max Delay and Proportional Slope
Parameter            Units                              Description
Max Delay            Memory clocks                      Sets the delay in the constant region
Proportional Slope   Memory clocks per postponed step   Sets the slope of the proportional region
High Priority Pivot  Postponed step                     Point where the idle delay goes to zero
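A minimal sketch of how an idle delay function built from these three parameters could be evaluated; it assumes the proportional region falls linearly to zero at the pivot, which the slide's figure suggests but does not spell out:

/* Idle delay threshold, in memory clocks, as a function of the refresh backlog.
 * max_delay:        threshold in the constant region
 * prop_slope:       memory clocks removed per postponed step
 * high_prio_pivot:  backlog at which the threshold reaches zero (high priority) */
unsigned idle_delay_threshold(unsigned backlog, unsigned max_delay,
                              unsigned prop_slope, unsigned high_prio_pivot) {
    if (backlog >= high_prio_pivot)
        return 0;                                              /* high-priority region */
    unsigned prop = (high_prio_pivot - backlog) * prop_slope;  /* proportional region  */
    return prop < max_delay ? prop : max_delay;                /* clamp to constant region */
}

The scheduler would then release a pending refresh only after the rank has been idle for at least this many memory clocks, so the policy is lazy at low backlog and converges toward DUE-like, and finally high-priority, behavior as the backlog grows.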
Elastic Refresh
Max Delay Circuit
 Circuit collects the average rank idle period
 Conceptually, given an exponential-type idle-time distribution, the average can be used to locate the tail
 The calculated average is used as Max Delay
 Circuit function:
– Accumulate idle delay over 1024 read events
– Average computed by bit selection (concatenation) from the accumulator
[Diagram: each DRAM read sent adds the current idle count (14 bits) into a delay accumulator (20 bits) and increments a 10-bit operation count; its carry-out latches the Max Delay (10 bits) passed to the idle delay function]
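A minimal software sketch of the averaging the slide describes: accumulate idle time over 1024 read events and take the upper accumulator bits as the average. The variable widths mirror the diagram labels; the exact hardware wiring is not reproduced here.

#include <stdint.h>

static uint32_t delay_accum;   /* ~20-bit accumulator of rank idle cycles   */
static uint32_t op_count;      /* ~10-bit count of DRAM reads sent          */
static uint32_t max_delay;     /* ~10-bit average, fed to the idle delay fn */

/* Called each time a DRAM read is sent, with the idle cycles seen before it. */
void on_dram_read_sent(uint32_t current_idle_count) {
    delay_accum += current_idle_count;
    if (++op_count == 1024) {          /* 10-bit operation count wraps */
        max_delay = delay_accum >> 10; /* divide by 1024 via bit selection */
        delay_accum = 0;
        op_count = 0;
    }
}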
Elastic Refresh
Proportional Slope Circuit
 Conceptually, the proportional region acts to gracefully transition to high priority while utilizing the full postponed range
 The circuit works to balance utilization across the postponed range (Low/High counts)
 A PI-type controller adjusts the slope to balance the High/Low counts
[Diagram: a "Postponed < Threshold" test steers samples into Low and High counters (each periodically divided by 2); their difference, weighted by proportional and integral terms w(p) and w(i), produces the Prop Slope sent to the idle delay function]
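A minimal sketch of the slope-tuning idea, assuming the Low/High counters track which half of the postponed range the backlog occupies and a PI-style update steers the slope until both halves are used about equally; the sign convention, aging interval, and weights are assumptions, not taken from the deck:

static int low_count, high_count;  /* samples in the low/high half of the postponed range */
static int integral;               /* integral term of the PI controller                  */
static int prop_slope;             /* result, fed to the idle delay function              */

/* Sample the current backlog against the midpoint of the postponed range. */
void observe_backlog(unsigned backlog, unsigned midpoint) {
    if (backlog < midpoint) low_count++; else high_count++;
    if (low_count + high_count >= 256) {   /* age samples ("divide by 2" in the diagram) */
        low_count >>= 1;
        high_count >>= 1;
    }
}

/* Periodic PI update with weights w_p and w_i. If the backlog sits in the high
 * half too often, shrink the slope (lower thresholds, refresh sooner); if it never
 * leaves the low half, grow the slope to use more of the postponed range. */
void update_prop_slope(int w_p, int w_i) {
    int error = high_count - low_count;
    integral += error;
    prop_slope -= w_p * error + w_i * integral;
    if (prop_slope < 0) prop_slope = 0;
}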
Elastic Refresh
Hardware Cost
 Trivial integration into DUE based policies
– Structure replaces “empty” indication of DUE
 Logic size
– ~100 latch bits for static policy
– ~80 additional latch bits for dynamic policy
 Logic cycle time
– Low frequency compared to ALU functions in processor core.
– Infrequent updates could enable pipelined control.
[Diagram: memory controller structure: request input interface, input queue, rank queues × N, bank queues × 8, tREFI counter, refresh queue, refresh scheduler, and output to the DRAM I/O drivers]
Simulation Methodology
 Simics extended with the GEMS model
– 1-, 4-, and 8-core CMPs
– First-Ready, First-Come-First-Served (FR-FCFS) memory controller policy
– DDR3-1333 8-8-8 memory, 2 memory controllers (MCs), 2 ranks per MC
– tRFC = 550 ns, tREFI = 3.9 µs @ 95°C (estimate for a 16 Gbit device)
– Refresh policies:
• Demand Refresh (DR)
• Defer Until Empty (DUE)
• Elastic Refresh policies
 SPEC CPU2006 workloads
Results
IPC Improvement: 1 Core and 8 Cores
[Charts: IPC improvement of the Fixed Delay (FD) and Dynamic Delay (DD) Elastic Refresh policies for the 1-core and 8-core configurations, across the SPEC CPU2006 integer and floating-point benchmarks and their geometric means; y-axes range from 1.0 to 1.4]
Related Work
 B. Bhat and F. Mueller, “Making DRAM refresh predictable,” in Euromicro Conference on Real-Time Systems (ECRTS), 2010
 M. Ghosh and H. S. Lee, “Smart Refresh: An enhanced memory controller design for reducing energy in conventional and 3D die-stacked DRAMs,” in MICRO-40, 2007
 T. Kirihata et al., “An 800 MHz embedded DRAM with a concurrent refresh mode,” in IEEE ISSCC Digest of Technical Papers, Feb. 2004
Conclusions
 The significant performance degradation caused by refresh can be mitigated with low-overhead mechanisms
 Commodity DRAM is cost driven
– Elastic Refresh requires no DRAM changes
 Future work:
– Coordinate refresh with other structures on the CMP
– Investigate refresh for future DRAM devices (DDR4)
• Example: dynamically select how many rows to refresh per command
Thank You,
Questions?
Laboratory for Computer Architecture
University of Texas Austin
IBM Austin
IBM T. J. Watson Lab