Versatile Refresh: Low Complexity Refresh Scheduling for High-throughput Multi-banked eDRAM
Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu
(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu
ACM Sigmetrics/Performance 2012
What Is Embedded DRAM?
1. 2nd Most Common Embedded Memory
• Consists of 1 Transistor, 1 Capacitor cell
• 2X-3X denser than SRAM
• 2X-4X slower than SRAM
2. Supported by Key ASIC and IP Vendors
• IBM, TSMC, NEC, Mosys, ST
3. Used in a Number of Applications
• Servers, Networking, Storage, Gaming, Mobile
4. Industry Examples
• IBM's P7
• Sony Playstations, Nintendo GameCube, Wii
• Apple iPhone, Microsoft Zune HD, Xbox 360
• Cisco Catalyst 3K-10K
[Figure: eDRAM 1T1C Memory Cell, with labels Select, Data, and Storage Capacitor]
Problem: eDRAM Refresh Causes Memory Bandwidth Loss
DRAM Capacitor has Finite Retention Time (W = Tref)
Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
[Figure: a single bank with rows 1 to R, a Refresh Port, and an R/W Port]
Example: R = 64 rows. All 64 rows will lose data in 4050 cycles!
Solution: Periodic Refresh. Reserve Refresh Cycles for Every Cell in Memory
Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%
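The arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative and not from the talk; the inputs are the slide's own example values.

```python
def retention_cycles(retention_us, clock_mhz):
    """Convert a retention time in microseconds to clock cycles (us * MHz = cycles)."""
    return retention_us * clock_mhz


def periodic_refresh_loss(rows, w_cycles):
    """Fraction of cycles reserved when each of `rows` rows is refreshed once per W cycles."""
    return rows / w_cycles


w = retention_cycles(18, 225)       # slide example: 18 us at 225 MHz
loss = periodic_refresh_loss(64, w)  # slide example: R = 64 rows
print(f"W = {w:.0f} cycles, loss = {loss:.2%}")  # W = 4050 cycles, loss = 1.58%
```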
Trend: Higher Density Multi-banked Macros (Mb/mm2)
(1) More Rows are Packed Together and Need to be Refreshed
(2) More Banks are Packed Together and Need to be Refreshed
(3) Smaller Capacitor with Lower Geometry → Smaller W
(4) Smaller W with Higher Temperature
(5) Low Clock Speed Mode Decreases 'Clock-time' to Refresh
[Figure: memory banks 1 to B, each with rows 1 to R, sharing refresh and R/W ports 1 to M; shared circuitry to conserve area]
Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)
Does not Scale with Larger Macros, Geometry & Low Power Modes
Examples of Periodic Refresh with Multi-banked Macros
M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time

M = 1 (Ports); W given @ 100C

Mode                  W                       R (Rows)  B (Banks)  Periodic Refresh Loss
Normal (250 MHz)      18.2 us = 4050 cycles   64        8          12.6%
Normal (250 MHz)      18.2 us = 4050 cycles   128       8          25%
Low Power (150 MHz)   18.2 us = 2699 cycles   64        8          18.9%
Low Power (150 MHz)   18.2 us = 2699 cycles   128       8          38%
The Problem is Only Getting Worse Over Time …
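As a rough check of the table, the sketch below applies the periodic refresh loss RB/W to each row. It takes the cycle counts exactly as listed in the table; the function and variable names are illustrative only.

```python
def multibank_periodic_loss(rows, banks, w_cycles):
    """Periodic refresh loss ~ R*B / W for a macro whose banks share one port (M = 1)."""
    return rows * banks / w_cycles


# (mode, W in cycles, R, B), copied from the table
CASES = [
    ("Normal", 4050, 64, 8),
    ("Normal", 4050, 128, 8),
    ("Low Power", 2699, 64, 8),
    ("Low Power", 2699, 128, 8),
]

for mode, w, r, b in CASES:
    loss = multibank_periodic_loss(r, b, w)
    print(f"{mode:10s} R={r:3d} B={b} W={w} -> loss = {loss:.1%}")
```

The computed values agree with the table to within rounding of the last digit.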
Vendor Solution: Concurrent Refresh
[Figure: memory banks 1 to B, each with rows 1 to R; R/W ports 1 to M and a separate concurrent refresh port]
Concurrent Refresh++: Refresh a Bank Which is Not Being Concurrently Accessed
++ T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. IEEE Journal of Solid-State Circuits, 40(6):1377–1387, June 2005.
How is Concurrent Refresh Used Today?
[Figure: memory banks 1 to B with per-bank row pointers (RP1, RP2, RP3, RP4, ..., RP16); the next concurrently accessed bank; a Refresh Pointer (Bank 2); a Deficit Register with a Count (3)]
Deficit Register Tracks Non-refreshed Bank(s)
Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern
Concurrent Refresh Overhead is Proportional to 1 bank
Concurrent Refresh Overhead = R/W = 64 rows /4050 cycles = ~1.58%
Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100%, for ANY Concurrent Refresh Scheduler
Goals of Our Work: An Industry Outlook
▪ Design a Concurrent Refresh Scheduler that can
1. Provide Deterministic Memory Performance Guarantees
− Maximize Memory Throughput (Optimality)
2. Be Universally Applicable
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of cell retention time W++, and Clock speed
3. Maximize Memory Burst Tolerance
4. Have Low Implementation Overhead
++ Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM
Problem Formulation
▪ We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.
[Figure: two timelines of refresh slots. Fixed TDM Constraint: refresh slots tied to fixed Refresh Windows 1 to 4. Sliding Window Constraint: refresh slots counted within any refresh window.]
Sliding Window Constraint: Supports X idle cycles in any (t, t+Y)
Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles
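A minimal sketch of the sliding window constraint, assuming a schedule is given as a list of booleans (True = idle timeslot). The helper name and the sample schedules are made up for illustration.

```python
def satisfies_sliding_window(idle, x, y):
    """True if every run of y consecutive timeslots contains at least x idle slots."""
    return all(sum(idle[t:t + y]) >= x for t in range(len(idle) - y + 1))


Y, X = 4, 1

# Idle slot at a fixed position in every block of Y slots (TDM-style).
tdm = [False, False, False, True] * 3

# Idle slots placed irregularly, but never more than Y - X accesses in a row.
flexible = [t in {3, 4, 8, 11} for t in range(12)]

# Idle slots too far apart: slots 1..4 contain no idle.
starved = [t in {0, 7} for t in range(8)]

print(satisfies_sliding_window(tdm, X, Y))       # True
print(satisfies_sliding_window(flexible, X, Y))  # True
print(satisfies_sliding_window(starved, X, Y))   # False
```

The second schedule shows the flexibility in question: the user chooses when to supply idle cycles, as long as no window of Y slots is left without X of them.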
Key Performance Metrics
▪ Refresh Overhead = X / Y
• Memory bandwidth wasted on refresh
▪ Burst Tolerance = Y - X
• Maximum number of consecutive memory accesses without interruption for refresh
We'll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1
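The two metrics can be sketched in Python as follows. The values X = 1, Y = 8 are arbitrary illustration values, and `longest_access_burst` is a hypothetical helper that measures the longest run of non-idle slots in a schedule.

```python
def refresh_overhead(x, y):
    """Fraction of timeslots given up as idle slots for refresh."""
    return x / y


def burst_tolerance(x, y):
    """Maximum number of back-to-back accesses allowed by 'x idles in any y slots'."""
    return y - x


def longest_access_burst(idle):
    """Length of the longest run of non-idle slots in a schedule (True = idle)."""
    longest = current = 0
    for slot_is_idle in idle:
        current = 0 if slot_is_idle else current + 1
        longest = max(longest, current)
    return longest


X, Y = 1, 8
schedule = ([False] * (Y - X) + [True] * X) * 3  # burst of Y - X accesses, then X idles

print(refresh_overhead(X, Y))          # 0.125
print(burst_tolerance(X, Y))           # 7
print(longest_access_burst(schedule))  # 7
```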
Our Solution: Versatile Refresh Algorithm
[Figure: memory banks 1 to B with per-bank row pointers RP1, RP2, RP3, RP4, ..., RPB; Next Concurrent Refresh Pointer; Deficit Register (Pointer, Count); Max Register (Count)]
Bank with deficit has priority for refresh.
Maximum Allowed Deficit Register Controls Burst Tolerance (Y)
Necessary Refresh Overhead for any Algorithm: Intuition, X=1
▪ At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
▪ An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
▪ A total of BR inequalities to ensure cells are refreshed in time
▪ Interestingly, only two of these inequalities matter
• The one corresponding to the oldest cell
• The one corresponding to the oldest “youngest cell in each bank”
Necessary Refresh Overhead for any Algorithm: Derivation, X=1
▪ How much can the adversary age the oldest cell?
• Current age is at least BR-1
• Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
$$OV_{nec} = \frac{1}{Y} \ge \frac{1}{W - BR + 1}$$
▪ How much can the adversary age the oldest “youngest cell in each bank”?
• Current age is at least B-1
• Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
$$OV_{nec} = \frac{1}{Y} \ge \frac{R}{W - (B - 1)}$$
Optimality for Versatile Refresh Overhead: Results, X = 1
▪ Necessity: Result for any Algorithm
$$OV_{nec} \ge \max\left\{ \frac{1}{W - BR + 1},\ \frac{R}{W - (B - 1)} \right\} > \frac{R}{W}$$
▪ Sufficiency: Result for VR Algorithm (with parameter X):
$$OV_{suf} \le \max\left\{ \frac{X}{W - BR},\ \frac{R}{W - BX - 1} \right\}$$
Nearly Optimal Refresh For X=1
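To get a feel for the necessity result, the sketch below evaluates the two inequalities derived for X = 1, namely (BR-1) + Y ≤ W and (B-1) + YR ≤ W. The function names are illustrative, and the sample values B = 16, R = 128, W = 2500 are borrowed from the example used later in the talk.

```python
def necessary_overhead(banks, rows, w):
    """Lower bound on refresh overhead 1/Y for any scheduler when X = 1."""
    if w < banks * rows:
        raise ValueError("retention time W must be at least B*R cycles")
    oldest_cell = 1 / (w - banks * rows + 1)      # from (BR-1) + Y <= W
    oldest_youngest = rows / (w - (banks - 1))    # from (B-1) + Y*R <= W
    return max(oldest_cell, oldest_youngest)


def max_idle_period(banks, rows, w):
    """Largest integer Y that satisfies both inequalities when X = 1."""
    return min(w - banks * rows + 1, (w - (banks - 1)) // rows)


B, R, W = 16, 128, 2500
bound = necessary_overhead(B, R, W)
print(f"overhead >= {bound:.4f}, compared with R/W = {R / W:.4f}")
print(f"largest usable Y = {max_idle_period(B, R, W)}")
```

For these values the second inequality (the oldest "youngest cell in each bank") is the binding one, which is why the bound sits just above R/W.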
Performance Guarantees of Versatile Refresh Algorithm
[Plot: Worst-case Refresh Overhead (X/Y) versus Cell Retention Time (W). Y-axis marks: 0, R/W, 1/B, 1. X-axis marks: RB and Wc = RB + B-1. Annotations: “Bad” Region with High Overhead; Increasing X; Near-optimal Refresh Overhead for X = 1; Refresh Overhead ~ R/W, for W large.]
Why Would We Ever Use Large X?
Why Would We Ever Use Large X?
▪ Because of Burst Tolerance (large X → large Y - X)
• If memory accesses are bursty, refreshes can be hidden
▪ There is a Critical Value of X for Max Burst Tolerance
$$X_c = \min\left\{ R,\ \frac{W}{B} - R \right\}$$
▪ Example: B = 16, R = 128, W = 2500
Calculations for Customer ASIC++
                               Theoretical    In Practice
Total Bandwidth                375 MHz        375 MHz
Versatile Refresh Formula      6825 > 257N    6825 > 257N
Versatile Refresh Constraint   1 in 26.55     1 in 26
Data-path                      360 MHz        360 MHz
Refresh                        14.12 MHz      14.42 MHz
Extra Bandwidth for CPU        0.88 MHz       0.58 MHz
R = 1024, B = 6 Banks, W = 18.2us @ 100C = 6825 cycles @ 375MHz
(++Note that these numbers have been sanitized)
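The table's arithmetic can be reproduced with the sketch below. It reads N in "6825 > 257N" as the period of the refresh constraint (1 idle in N cycles), which is an interpretation of the slide, and it takes the coefficient 257 as given without deriving it. Variable and function names are illustrative.

```python
import math

TOTAL_MHZ = 375.0
DATAPATH_MHZ = 360.0
W_CYCLES = 6825      # 18.2 us at 375 MHz
FORMULA_COEFF = 257  # taken as-is from "6825 > 257N"


def budget(period):
    """Return (refresh MHz, leftover MHz) when 1 cycle in `period` is reserved for refresh."""
    refresh = TOTAL_MHZ / period
    extra = TOTAL_MHZ - DATAPATH_MHZ - refresh
    return refresh, extra


n_theory = W_CYCLES / FORMULA_COEFF  # about 26.556
n_practice = math.floor(n_theory)    # a whole number of cycles: 26

for label, n in (("Theoretical", n_theory), ("In Practice", n_practice)):
    refresh, extra = budget(n)
    print(f"{label:12s} 1 in {n:.2f}: refresh {refresh:.2f} MHz, extra {extra:.2f} MHz")
```

Rounding the period down to a whole number of cycles is what moves about 0.3 MHz from the CPU's spare bandwidth to refresh.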
Versatile Refresh Enhancement
▪ Enhancement:
• No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
• Any idle slot is a no-conflict slot; but not vice versa
• For VR, no-conflict slots are as good as idle slots.
▪ Observation:
• This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
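The distinction between idle and no-conflict slots can be illustrated with a small sketch. It assumes one memory port (M = 1), represents each timeslot by the index of the accessed bank (None for an idle slot), and holds the refresh target fixed for simplicity; none of these names come from the talk.

```python
from typing import List, Optional


def is_idle(accessed_bank: Optional[int]) -> bool:
    """An idle slot has no memory access at all."""
    return accessed_bank is None


def is_no_conflict(accessed_bank: Optional[int], refresh_bank: int) -> bool:
    """A no-conflict slot: the bank to be refreshed is not the one being accessed."""
    return accessed_bank != refresh_bank  # also True when the slot is idle (None)


def count_slots(trace: List[Optional[int]], refresh_bank: int):
    """Count idle and no-conflict slots in a trace for a fixed refresh target."""
    idle = sum(is_idle(a) for a in trace)
    no_conflict = sum(is_no_conflict(a, refresh_bank) for a in trace)
    return idle, no_conflict


trace = [0, 1, None, 2, 0, 0]  # accessed bank per timeslot
print(count_slots(trace, refresh_bank=0))  # (1, 3)
```

Here only one slot is idle, but three slots can serve a refresh of bank 0, which is why access patterns that spread across banks need few or no reserved idle slots.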
Fully Enhanced Versatile Refresh Algorithm
[Figure: memory banks 1 to B with per-bank row pointers RP1, RP2, RP3, RP4, ..., RPB; Next Refresh Bank Pointer; Deficit Register (Pointer, Count); Max Register (Count); an Enforcer Module (User Logic) that provides X idles in Y timeslots and receives no-conflict feedback; repeat for Multiple Memory Ports (M)]
Simulation: Synthetic Statistical Workload
[Plot: Refresh Overhead (%), axis from 0 to 9, versus alpha (0.001, 0.01, 0.1, 0.25, 0.5, 1) for Periodic Refresh, VR (X = 4), and VR (X = 128)]
VR with X = 4: Min worst-case overhead (best for adversarial)
VR with X = 128: Max burst tolerance (best for benign)
▪ Parameter Alpha Controls Degree of Temporal Locality
• alpha ~ 0 → always read from bank 1 (adversarial)
• alpha ~ 1 → read from random banks (benign)
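The slide does not spell out the workload model, so the sketch below is only one hypothetical generator that matches the two limiting cases: with probability alpha the next access jumps to a uniformly random bank, otherwise it stays on the current bank. The function name, the seed parameter, and the zero-based bank numbering are all assumptions.

```python
import random


def bank_trace(alpha, banks, length, seed=0):
    """Generate a bank access trace with tunable temporal locality (hypothetical model)."""
    rng = random.Random(seed)
    bank = 0  # "bank 1" on the slide, zero-based here
    trace = []
    for _ in range(length):
        if rng.random() < alpha:          # with probability alpha, jump to a random bank
            bank = rng.randrange(banks)
        trace.append(bank)                # otherwise keep hitting the same bank
    return trace


adversarial = bank_trace(alpha=0.0, banks=8, length=1000)
benign = bank_trace(alpha=1.0, banks=8, length=1000)
print(len(set(adversarial)), len(set(benign)))
```

With alpha = 0 the trace never leaves the first bank; with alpha = 1 every access picks a bank uniformly at random.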
Refresh Overhead has Disappeared Completely!
Conclusion
▪ With Versatile Refresh A Designer Can …
1. Exactly Calculate Available Memory Bandwidth
− For any eDRAM macro with B banks, R Rows, M memory ports
− For any characteristics of Temperature, W = Tref and Clock speed
2. Achieve Optimal Worst-case Memory Bandwidth
3. Design for Large Burst Tolerance
4. Potentially Eliminate Back-pressure
− Simplify associated complex design and verification
5. Maximize Best-case Memory Bandwidth
6. Avail of a Formally Verified VR Controller
− On a suitably reduced memory instance