Recent Progress in Embedded Memory Controller Design

Memory Hierarchy
Latency (L), Capacity (C), Bandwidth (BW)

  Level   Latency   Capacity   Bandwidth
  Cache   0.5 ns    10 MB      -
  DRAM    50 ns     100 GB     100 GB/s
  Flash   10 us     2 TB       2 GB/s
  Disk    10 ms     4 TB       600 MB/s

[Figure label: Controller]

DRAM Primer
• Address: <bank, row, column>
• Page buffer per bank

DRAM Characteristics
• DRAM page crossing
    • Charges ~10K DRAM cells and bitlines
    • Increases power & latency
    • Decreases effective bandwidth
• Sequential access vs. random access
    • Less page crossing
    • Lower power consumption
    • 4.4x shorter latency
    • 10x better BW

Take Away: DRAM = Disk

Embedded Controller
• Bad news
    • None available as in general purpose processors
• Good news
    • Opportunities for customization

Agenda
• Overview
• Multi-Port Memory Controller (MPMC) Design
• "Out-of-Core" Algorithmic Exploration

Motivating Example: H.264 Decoder
• Diverse QoS requirements
• Dynamic latency, BW and power
[Figure: H.264 decoder memory ports, marked latency sensitive or bandwidth sensitive; annotated values (MB/s): 6.4, 9.6, 1.2, 164.8, 0.09, 31.0, 156.7, 94]

Wanted
• Bandwidth guarantee
• Prioritized access
• Reduced page crossing

Previous Works
• Bandwidth guarantee
    • Q0: Distinguish bandwidth guarantee for different classes of ports
    • Q1: Distinguish bandwidth guarantee for each port
• Q2: Prioritized access
• Q3: Residual bandwidth allocation
• Q4: Effective DRAM bandwidth
[Table: works compared against Q0 to Q4; the column positions of the check marks are not recoverable]
    [Rixner,00] [McKee,00] [Hur,04]
    [Heighecker,03,05] [Whitty,08]
    [Lee,05]
    [Burchard,05]
    Proposed BCBR

Key Observations
• Port locality: requests from the same port -> same DRAM page
    -> Weighted round robin
        • Minimum BW guarantee
        • Bursting service
• Service time flexibility
    • 1/24 second to decode a video frame
    • 4M cycles at 100 MHz for request reordering
    -> Credit borrow & repay
        • Reorder requests according to priority
• Residual bandwidth
    • Statically allocated BW
    • Underutilized at runtime
    -> Dynamic BW calculation
        • Capture and re-allocate residual BW

Weighted Round Robin
• Assume bandwidth requirement
    • Tround = 10
    • Q2: 30%, Q1: 50%, Q0: 20%
• Time: scheduling cycles
• T(Rij): arrival time of the jth request for Qi
[Timeline, clock cycles 0 to 9]
  Request time   T(R2): R20 R21 R22
                 T(R1): R10 R11 R12 R13 R14
                 T(R0): R00 R01
  Service time   Q2: R20 R21 R22
                 Q1: R10 R11 R12 R13 R14
                 Q0: R00 R01

Problem with WRR
• Priority: Q0 > Q2
• 8 cycles of waiting time for Q0!
• Could be worse!
[Timeline, clock cycles 0 to 9]
  Request time   T(R2): R20 R21 R22
                 T(R1): R10 R11 R12 R13 R14
                 T(R0): R00 R01
  Service time   Q2: R20 R21 R22
                 Q1: R10 R11 R12 R13 R14
                 Q0: R00 R01

Borrow Credits
• Zero waiting time for Q0!
[Timeline, clock cycles 0 to 9]
  Request time   T(R2): R20 R21 R22
                 T(R1): R10 R11 R12
                 T(R0): R00 R01
  Service time   Q0*: R00 R01 (borrow)
                 Q2: R20
  debtQ0         Q2 entries

Repay Later
• At Q0's turn, BW guarantee is recovered
• Prioritized access!
[Timeline, clock cycles 0 to 9]
  Request time   T(R2): R20 R21 R22
                 T(R1): R10 R11 R12 R13 R14
                 T(R0): R00 R01
  Service time   Q0*: R00 R01
                 Q2: R20, later R21 R22 (repay)
                 Q1: R10 R11 R12 R13 R14
  debtQ0         Q2 entries

Problem: Depth of DebtQ
• DebtQ as residual BW collector
    • BW allocated to Q0 increases to: 20% + residual BW
    • Requirement for the depth of DebtQ0 decreases
[Timeline, clock cycles 0 to 9]
  Request time   T(R2): R20 R21 R22
                 T(R1): R10 R11 R12 R13
                 T(R0): R00 R01 R03
  Service time   Q0*: R00 R01, later R03
                 Q2: R20, later R21 R22 (help repay)
                 Q1: R10 R11 R12 R13
  debtQ0         Q2 entries

Evaluation Framework
• Simulation framework
    • Workload: ALPBench suite
    • DRAMSim: simulates DRAM latency + BW + power
    • Reference schedulers: PQ, RR, WRR, BGPQ

Bandwidth Guarantee
• Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%
• System residual: 8%

  Port   RR      PQ      BGPQ    WRR     BCBR
  0      1.08%   0.73%   1.07%   0.76%   0.76%
  1      24%     80%     39%     33%     33%
  2      24%     18%     20%     22%     22%
  3      24%     0%      20%     22%     22%
  4      24%     0%      20%     22%     22%

Callouts: "No BW guarantee"; "Provides BW guarantee!"
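The borrow and repay steps walked through in the timelines above can be sketched in a few lines of Python. This is only an illustrative model, not the BCBR hardware: the function name `schedule`, the `debt` queue, and the assumption that every request has already arrived at cycle 0 are all hypothetical simplifications, and residual-bandwidth collection is not modeled.

```python
from collections import deque


def schedule(slots, pending, priority_port=None):
    """Serve one round of slots; return a list of (cycle, port, request).

    slots: the port that owns each scheduling cycle (static WRR allocation).
    pending: dict mapping port -> list of request names, all assumed to
             have arrived at cycle 0 (a simplification for this sketch).
    priority_port: if given, this port may borrow slots from other ports
                   and repays them later out of its own slots.
    """
    queues = {port: deque(reqs) for port, reqs in pending.items()}
    debt = deque()  # creditors whose slots the priority port has used
    served = []
    for cycle, owner in enumerate(slots):
        port = owner
        if priority_port is not None:
            if owner != priority_port and queues[priority_port]:
                # Borrow: the priority port takes the owner's slot.
                port = priority_port
                debt.append(owner)
            elif owner == priority_port and debt:
                # Repay: hand this slot to the oldest creditor.
                port = debt.popleft()
        if queues[port]:
            served.append((cycle, port, queues[port].popleft()))
        else:
            served.append((cycle, None, None))  # idle slot
    return served


def first_service_cycle(served, port):
    """Cycle at which `port` is first served, or None if never served."""
    for cycle, served_port, _ in served:
        if served_port == port:
            return cycle
    return None


# Tround = 10 with Q2: 30%, Q1: 50%, Q0: 20%, as in the slides.
SLOTS = ["Q2"] * 3 + ["Q1"] * 5 + ["Q0"] * 2
PENDING = {
    "Q2": ["R20", "R21", "R22"],
    "Q1": ["R10", "R11", "R12", "R13", "R14"],
    "Q0": ["R00", "R01"],
}

if __name__ == "__main__":
    for label, prio in (("WRR", None), ("borrow/repay", "Q0")):
        order = [req for _, _, req in schedule(SLOTS, PENDING, prio)]
        print(label, order)
```

Under these assumptions, plain WRR first serves Q0 at cycle 8, while the borrow/repay variant serves it at cycle 0 and still gives each port its 3, 5, and 2 slots within the round.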
Cache Response Latency
• On average 16x faster than WRR
• As fast as PQ (prioritized access)
[Figure; axis label: Latency (ns)]

DRAM Energy & BW Efficiency
• 30% less page crossing (compared to RR)
    • 1.4x more energy efficient
    • 1.2x higher effective DRAM BW
• As good as WRR (exploits port locality)

                  RR      BGPQ    WRR     BCBR
  GB/J            0.298   0.289   0.412   0.411
  Act-Pre Ratio   29.6%   30.1%   23.0%   23.0%
  Improvement     1.0x    0.97x   1.38x   1.38x

Hardware Cost

  Design                                   LUTs   Registers   BRAMs
  BCBR (frontend)                          1393   884         0
  Speedy DDRMC (reference backend)         1986   1380        4
  Xilinx MPMC (frontend + backend)         3450   5540        1-9
  BCBR + Speedy                            3379   2264        4

• Better performance without higher cost!

Agenda
• Overview
• Multi-Port Memory Controller (MPMC) Design
• "Out-of-Core" Algorithm / Architecture Exploration

Idea
• Out-of-core algorithms
    • Data does not fit in DRAM
    • Performance dominated by IO
• Key questions
    • Reduce #IOs
    • Block granularity
• Remember: DRAM = Disk
• So let's
    • Ask the same questions
    • Plug in DRAM parameters
    • Get DRAM-specific answers

Motivating Example: CDN
• Caches in CDN
    • Get closer to users
    • Save bandwidth
• Zipf's law
    • 80-20 rule -> hit rate

Video Cache

Defining the Knobs
• Transaction: a number of column access commands enclosed by row activation / precharge
• W: burst size
• s: # bursts
[Equation annotations: function of array organization & timing params; function of algorithmic parameters; function of array organization & timing params]

D-ary Heap
• Algorithmic design variables
    • Branching factor
    • Record size

B+ Tree

Lessons Learned
• Optimal result can be beautifully derived!
• Big O does not matter in some cases
    • Depending on data input characteristics
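To make the branching-factor knob concrete, here is a minimal sketch of an array-backed d-ary min-heap in Python. It is not the design from the talk: the class name `DaryHeap` and the `group_reads` counter are hypothetical. The counter treats each contiguous block of up to d children examined during a pop as one unit of work, which is only an assumed stand-in for one DRAM transaction.

```python
class DaryHeap:
    """Array-backed min-heap in which every node has up to d children."""

    def __init__(self, d):
        if d < 2:
            raise ValueError("branching factor must be at least 2")
        self.d = d
        self.items = []
        # Hypothetical cost proxy: contiguous child groups read by pop().
        self.group_reads = 0

    def push(self, value):
        items = self.items
        items.append(value)
        i = len(items) - 1
        # Sift up: swap with the parent while the parent is larger.
        while i > 0:
            parent = (i - 1) // self.d
            if items[parent] <= items[i]:
                break
            items[parent], items[i] = items[i], items[parent]
            i = parent

    def pop(self):
        items = self.items
        if not items:
            raise IndexError("pop from empty heap")
        top = items[0]
        last = items.pop()
        if items:
            items[0] = last
            i, n = 0, len(items)
            # Sift down: each step reads one contiguous group of children.
            while True:
                first = self.d * i + 1
                if first >= n:
                    break
                self.group_reads += 1
                end = min(first + self.d, n)
                smallest = min(range(first, end), key=items.__getitem__)
                if items[i] <= items[smallest]:
                    break
                items[i], items[smallest] = items[smallest], items[i]
                i = smallest
        return top


def heap_sort(values, d):
    """Sort values with a d-ary heap; return (sorted list, group reads)."""
    heap = DaryHeap(d)
    for value in values:
        heap.push(value)
    ordered = [heap.pop() for _ in range(len(values))]
    return ordered, heap.group_reads
```

Calling `heap_sort(data, 2)` and `heap_sort(data, 8)` on the same input gives the same sorted output, but the wider heap is shallower and so touches fewer child groups, each of which is larger. Which side of that trade-off wins depends on how a group maps onto burst size and row-buffer parameters, which this sketch does not model.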
