Recent Progress in Embedded Memory Controller Design

Memory Hierarchy
Latency (L), capacity (C), and bandwidth (BW) at each level; the memory controller sits between the processor/cache and DRAM.

Level   Latency   Capacity   Bandwidth
Cache   0.5 ns    10 MB      -
DRAM    50 ns     100 GB     100 GB/s
Flash   10 µs     2 TB       2 GB/s
Disk    10 ms     4 TB       600 MB/s
DRAM Primer
• A DRAM location is addressed by <bank, row, column> (see the sketch below)
• One page buffer (row buffer) per bank
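As a minimal sketch of the <bank, row, column> addressing above; the field widths and bit ordering are illustrative assumptions, not values from the slides:

```python
# Minimal sketch of <bank, row, column> address decomposition.
# The field widths below are illustrative assumptions, not values from the slides.
BANK_BITS, ROW_BITS, COL_BITS = 3, 14, 10   # e.g. 8 banks, 16K rows, 1K columns

def decode(addr: int):
    """Split a linear address into (bank, row, column)."""
    col = addr & ((1 << COL_BITS) - 1)
    addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)    # bank bits placed between column and row
    addr >>= BANK_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    return bank, row, col

# Two accesses that share (bank, row) hit the same open page buffer; a different
# row in the same bank forces an extra precharge/activate, i.e. a page crossing.
```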
DRAM Characteristics


• DRAM page crossing
  • Charges ~10K DRAM cells and bitlines
  • Increases power and latency
  • Decreases effective bandwidth
• Sequential access vs. random access (a rough estimate follows the takeaway below)
  • Fewer page crossings
  • Lower power consumption
  • 4.4x shorter latency
  • 10x better BW
Take Away: DRAM = Disk
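The page-crossing effect can be approximated with a back-of-envelope model; the timing constants below are generic DDR-like assumptions, not numbers from the slides:

```python
# Rough model of effective DRAM bandwidth vs. page-crossing rate.
# T_BURST, T_ACT_PRE and BYTES_PER_BURST are illustrative assumptions.
T_BURST = 5.0        # ns to transfer one burst from an already-open page
T_ACT_PRE = 40.0     # ns of extra precharge + activate work on a page crossing
BYTES_PER_BURST = 64

def effective_bw(crossing_rate: float) -> float:
    """Effective bandwidth in GB/s given the fraction of bursts that cross a page."""
    avg_ns_per_burst = T_BURST + crossing_rate * T_ACT_PRE
    return BYTES_PER_BURST / avg_ns_per_burst   # bytes per ns == GB/s

print(effective_bw(0.05))   # mostly sequential accesses: ~9.1 GB/s
print(effective_bw(0.90))   # mostly random accesses:     ~1.6 GB/s
```

Even with made-up constants, the fixed activate/precharge cost dominates random access, which is where the latency and bandwidth gaps above come from.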
Embedded Controller
• Bad news: nothing comparable to a general-purpose processor's memory controller is available off the shelf
• Good news: opportunities for customization
Agenda

• Overview
• Multi-Port Memory Controller (MPMC) Design
• “Out-of-Core” Algorithmic Exploration
Motivating Example: H.264 Decoder

• Diverse QoS requirements across the decoder's memory ports
  • Latency-sensitive ports with modest bandwidth demands
  • Bandwidth-sensitive ports demanding up to roughly 165 MB/s
  • Latency, BW, and power requirements change dynamically
[Figure: per-port bandwidth demands of the H.264 decoder, ranging from about 0.09 MB/s to 164.8 MB/s]
Wanted

• Bandwidth guarantee
• Prioritized access
• Reduced page crossing
Previous Works

• Requirements considered:
  • Q0: Distinguish bandwidth guarantees for different classes of ports
  • Q1: Distinguish bandwidth guarantees for each port
  • Q2: Prioritized access
  • Q3: Residual bandwidth allocation
  • Q4: Effective DRAM bandwidth
• Prior schedulers [Rixner,00][McKee,00][Hur,04], [Heighecker,03,05][Whitty,08], [Lee,05], and [Burchard,05] each address only a subset of Q0-Q4
• The proposed BCBR addresses all five (Q0-Q4)
Key Observations

• Port locality: requests from the same port tend to target the same DRAM page
• Service-time flexibility: 1/24 second to decode a video frame is about 4M cycles at 100 MHz, ample time for request reordering
• Weighted round robin: statically allocated BW, underutilized at runtime, though it does give a minimum BW guarantee and bursting service
• Proposed responses: credit borrow & repay (reorder requests according to priority) and dynamic BW calculation (capture and re-allocate residual BW)
Weighted Round Robin

• Assume bandwidth requirements over a round of Tround = 10 scheduling cycles: Q2: 30%, Q1: 50%, Q0: 20%
• Time is measured in scheduling cycles; T(Rij) is the arrival time of the jth request of Qi
[Timing diagram: over cycles 0-9, Q2 serves R20-R22 in its three slots, Q1 serves R10-R14 in its five slots, and Q0 serves R00-R01 in its two slots; a sketch of this schedule follows below]
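A minimal sketch of this weighted round-robin schedule; the queue names follow the slide, while the slot layout and data structures are my own assumptions:

```python
from collections import deque

def wrr_schedule(weights, arrivals, t_round=10):
    """Weighted round robin over one round of t_round scheduling cycles.

    weights:  {"Q2": 3, "Q1": 5, "Q0": 2}       # 30% / 50% / 20% of the round
    arrivals: {"Q2": [(0, "R20"), ...], ...}    # (arrival cycle, request id)
    Returns (cycle, slot owner, request served or None) for each cycle.
    """
    pending = {q: deque(sorted(reqs)) for q, reqs in arrivals.items()}
    # Statically allocated slots: each queue owns a contiguous chunk of the round.
    slots = [q for q, w in weights.items() for _ in range(w)][:t_round]
    schedule = []
    for cycle, owner in enumerate(slots):
        if pending[owner] and pending[owner][0][0] <= cycle:   # request has arrived
            schedule.append((cycle, owner, pending[owner].popleft()[1]))
        else:
            schedule.append((cycle, owner, None))              # slot goes unused
    return schedule
```

The defining property, and weakness, is that the slot layout is fixed regardless of which requests are actually pending.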
Problem with WRR

• Priority: Q0 > Q2
[Timing diagram: with the static slot order Q2, Q1, Q0, requests R00 and R01 arrive early in the round but are not served until Q0's slots at cycles 8-9; a worked run follows below]
• 8 cycles of waiting time for Q0, and it could be worse!
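Feeding the wrr_schedule sketch above with hypothetical arrival times reproduces the problem: with the slot order Q2, Q1, Q0, a Q0 request arriving at cycle 0 is not served until cycle 8.

```python
# Hypothetical arrivals (cycle, request id); Q0 is the high-priority port.
arrivals = {
    "Q2": [(0, "R20"), (1, "R21"), (2, "R22")],
    "Q1": [(3, "R10"), (4, "R11"), (5, "R12"), (6, "R13"), (7, "R14")],
    "Q0": [(0, "R00"), (1, "R01")],
}
for cycle, owner, req in wrr_schedule({"Q2": 3, "Q1": 5, "Q0": 2}, arrivals):
    print(cycle, owner, req)       # R00 is finally served in cycle 8
```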
Borrow Credits

• Zero waiting time for Q0!
[Timing diagram: Q0's requests R00 and R01 are served immediately by borrowing Q2's slots; each borrowed cycle is recorded in debtQ0 as a cycle owed to Q2, and Q2 serves only R20 for now]
Repay Later
• At Q0's turn, the BW guarantee is recovered
[Timing diagram: during Q0's slots, the debt recorded in debtQ0 is repaid by serving Q2's deferred requests R21 and R22, while Q1 serves R10-R14 in its own slots; a code sketch follows this slide]
• Prioritized access!
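A minimal sketch of the credit borrow & repay mechanism described in the last two slides; the function and variable names, and the exact tie-breaking rules, are my own assumptions:

```python
from collections import deque

def bcbr_schedule(weights, arrivals, critical, t_round=10):
    """Sketch of credit borrow & repay (BCBR) on top of weighted round robin.

    weights:  {"Q2": 3, "Q1": 5, "Q0": 2}        # slots per round (30/50/20%)
    arrivals: {"Q0": [(0, "R00"), ...], ...}     # (arrival cycle, request id)
    critical: latency-sensitive queues allowed to borrow, e.g. ["Q0"]
    """
    pending = {q: deque(sorted(r)) for q, r in arrivals.items()}
    debt = {q: deque() for q in weights}         # debt[q]: lenders q owes a cycle
    slots = [q for q, w in weights.items() for _ in range(w)][:t_round]

    def ready(q, cycle):
        return pending[q] and pending[q][0][0] <= cycle

    schedule = []
    for cycle, owner in enumerate(slots):
        served = None
        # Borrow: a latency-critical queue with a ready request is served at once
        # in someone else's slot and records one cycle of debt to the slot owner.
        borrower = next((q for q in critical if q != owner and ready(q, cycle)), None)
        if borrower is not None:
            served = (borrower, pending[borrower].popleft()[1])
            debt[borrower].append(owner)
        # Repay: during the borrower's own slots, the queues it owes go first,
        # so every queue still gets its guaranteed share over the round.
        elif debt[owner] and ready(debt[owner][0], cycle):
            lender = debt[owner].popleft()
            served = (lender, pending[lender].popleft()[1])
        # Otherwise: plain weighted round robin.
        elif ready(owner, cycle):
            served = (owner, pending[owner].popleft()[1])
        schedule.append((cycle, owner, served))
    return schedule
```

With the same hypothetical arrivals as before and critical=["Q0"], R00 and R01 are served in cycles 0-1 (zero wait), Q2 serves R20 in cycle 2, Q1 keeps its five slots, and Q2's deferred R21 and R22 are repaid in Q0's slots at cycles 8-9, so every queue still receives its weighted share over the round.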
Problem: Depth of DebtQ

• DebtQ doubles as a residual-bandwidth collector
• The bandwidth effectively allocated to Q0 grows to 20% + residual BW
• The required depth of DebtQ0 therefore decreases
[Timing diagram: slots left idle by other queues help repay the debt in debtQ0 early, so the debt queue drains faster; see the sketch after this slide]
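One way to picture the residual-bandwidth role of the debt queue, as a pure sketch under my own assumptions: a cycle whose owner has nothing ready is handed to a queue that is still owed a repayment, so debt queues drain early and their required depth shrinks.

```python
from collections import deque
from typing import Optional

def pick_for_residual_slot(debts: dict, pending: dict, cycle: int) -> Optional[str]:
    """Hand a residual (otherwise wasted) cycle to a lender that is still owed
    a repayment; debts[q] lists the queues q owes, pending[q] holds q's requests."""
    for borrower, owed in debts.items():
        if owed and pending[owed[0]] and pending[owed[0]][0][0] <= cycle:
            return owed.popleft()          # serve this lender in the idle slot
    return None

# Example: Q0 still owes Q2 two cycles; an idle cycle repays one of them early.
debts = {"Q0": deque(["Q2", "Q2"])}
pending = {"Q2": deque([(1, "R21"), (2, "R22")])}
print(pick_for_residual_slot(debts, pending, cycle=5))   # -> Q2
```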
Evaluation Framework

• Simulation framework
  • Workload: ALPBench suite
  • DRAMSim: simulates DRAM latency, BW, and power
  • Reference schedulers: PQ, RR, WRR, BGPQ
Bandwidth Guarantee

• Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%; system residual: 8%
• Measured bandwidth share per port (RR and PQ give no BW guarantee):

Port  RR      PQ      BGPQ    WRR     BCBR
P0    1.08%   0.73%   1.07%   0.76%   0.76%
P1    24%     80%     39%     33%     33%
P2    24%     18%     20%     22%     22%
P3    24%     0%      20%     22%     22%
P4    24%     0%      20%     22%     22%

Provides BW guarantee!
Cache Response Latency
• Average 16x faster than WRR
• As fast as PQ (prioritized access)
[Figure: cache response latency (ns) for each scheduler]
DRAM Energy & BW Efficiency
• 30% less page crossing (compared to RR)
• 1.4x more energy efficient
• 1.2x higher effective DRAM BW
• As good as WRR (exploits port locality)

Scheduler                  RR      BGPQ    WRR     BCBR
GB/J                       0.298   0.289   0.412   0.411
Act-Pre ratio              29.6%   30.1%   23.0%   23.0%
Improvement (GB/J vs. RR)  1.0x    0.97x   1.38x   1.38x
Hardware Cost
• BCBR frontend: 1393 LUTs, 884 registers, 0 BRAMs
• Reference backend (speedy DDRMC): 1986 LUTs, 1380 registers, 4 BRAMs
• BCBR + speedy backend: 3379 LUTs, 2264 registers, 4 BRAMs
• Xilinx MPMC (frontend + backend): 3450 LUTs, 5540 registers, 1-9 BRAMs
Better performance without higher cost!
Agenda

• Overview
• Multi-Port Memory Controller (MPMC) Design
• “Out-of-Core” Algorithm / Architecture Exploration
Idea
• Out-of-core algorithms
  • Data does not fit in DRAM
  • Performance dominated by I/O
  • Key questions: how to reduce the number of I/Os, and at what block granularity
• Remember: DRAM = disk
• So let's ask the same questions, plug in DRAM parameters, and get DRAM-specific answers
Motivating Example: CDN

• Caches in a CDN
  • Get closer to users
  • Save bandwidth
• Zipf's law
  • The 80-20 rule → high hit rate
Video Cache
Defining the Knobs
• Transaction: a number of column access commands enclosed by a row activation and a precharge
• W: burst size; s: number of bursts per transaction
• The activation/precharge and per-burst costs are functions of the array organization and timing parameters; the number of bursts s is a function of the algorithmic parameters (see the cost model below)
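A minimal sketch of the resulting transaction cost model; the t_act / t_burst / t_pre symbols stand in for the array-organization and timing parameters, and the values are assumptions:

```python
# Sketch of a DRAM "transaction" cost model: s back-to-back bursts of W bytes
# enclosed by one row activation and one precharge.  The values are illustrative.
T_ACT, T_BURST, T_PRE = 15.0, 5.0, 15.0     # ns (assumed)

def transaction_ns(s: int) -> float:
    """Latency of one transaction with s bursts."""
    return T_ACT + s * T_BURST + T_PRE

def transaction_bw(s: int, w: int = 64) -> float:
    """Effective GB/s when moving s bursts of w bytes per transaction."""
    return s * w / transaction_ns(s)

# The fixed activate/precharge overhead is amortized over s*W bytes, so the
# algorithmic choice of block size (via s) directly sets the effective bandwidth.
print(transaction_bw(1), transaction_bw(8), transaction_bw(64))
```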
d-ary Heap
• Algorithmic design variables: branching factor, record size
B+ Tree
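For example, the DRAM-specific answer for the d-ary heap's branching factor can be sketched by sizing one heap node to exactly one transaction, mirroring the classic out-of-core "one node per block" rule; the parameter names and numbers below are assumptions:

```python
def branching_factor(w_bytes: int, s_bursts: int, record_bytes: int) -> int:
    """Sketch: pick d so that one d-ary heap node (d records) fills one DRAM
    transaction of s_bursts bursts of w_bytes each."""
    return max(2, (w_bytes * s_bursts) // record_bytes)

# e.g. 64-byte bursts, 8 bursts per transaction, 16-byte records  ->  d = 32
print(branching_factor(64, 8, 16))
```

The same sizing argument applies to choosing the B+ tree's node fanout.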
Lessons Learned

• The optimal result can be beautifully derived!
• Big-O does not matter in some cases, depending on the characteristics of the input data