Memory Hierarchy
Latency (L), Capacity (C), Bandwidth (BW)

Level   L        C        BW
Cache   0.5 ns   10 MB
DRAM    50 ns    100 GB   100 GB/s
Flash   10 us    2 TB     2 GB/s
Disk    10 ms    4 TB     600 MB/s

[Diagram label: Controller]

DRAM Primer
• Address: <bank, row, column>
• Page buffer per bank

DRAM Characteristics
• DRAM page crossing
    • Charges ~10K DRAM cells and bitlines
    • Increases power and latency
    • Decreases effective bandwidth
• Sequential access vs. random access
    • Fewer page crossings
    • Lower power consumption
    • 4.4x shorter latency
    • 10x better BW
• Take away: DRAM = Disk

Embedded Controller
• Bad news: none available as in a general-purpose processor
• Good news: opportunities for customization

Agenda
• Overview
• Multi-Port Memory Controller (MPMC) Design
• “Out-of-Core” Algorithmic Exploration

Motivating Example: H.264 Decoder
• Diverse QoS requirements: latency-sensitive and bandwidth-sensitive ports
• Dynamic latency, BW and power
[Figure labels, in MB/s: 6.4, 9.6, 1.2, 164.8, 0.09, 31.0, 156.7, 94; the mapping of values to ports is not recoverable]

Wanted
• Bandwidth guarantee
• Prioritized access
• Reduced page crossing

Previous Works
• Bandwidth guarantee
    • Q0: Distinguish bandwidth guarantee for different classes of ports
    • Q1: Distinguish bandwidth guarantee for each port
• Q2: Prioritized access
• Q3: Residual bandwidth allocation
• Q4: Effective DRAM bandwidth

Comparison table, columns Q0 Q1 Q2 Q3 Q4 (check marks listed in extracted order; their column alignment is not recoverable):
[Rixner,00][McKee,00][Hur,04]
[Heithecker,03,05][Whitty,08]   ✓ ✓ ✓
[Lee,05]                        ✓
[Burchard,05]                   ✓
Proposed BCBR                   ✓ ✓ ✓ ✓ ✓ ✓ ✓

Key Observations
• Port locality: the same port requests the same DRAM page
    -> Weighted round robin: minimum BW guarantee, bursting service
• Service time flexibility: 1/24 second to decode a video frame, i.e. 4M cycles at 100 MHz for request reordering
    -> Credit borrow & repay: reorder requests according to priority
• Residual bandwidth: statically allocated BW is underutilized at runtime
    -> Dynamic BW calculation: capture and re-allocate residual BW

Weighted Round Robin
• Assume bandwidth requirements with Tround = 10: Q2: 30%, Q1: 50%, Q0: 20% (3, 5 and 2 slots per round)
• Time: scheduling cycles
• T(Rij): arrival time of the jth request for Qi

[Timing diagram, clocks 0-9]
Request time  T(R2):  R20 R21 R22
Service time  Q2:     R20 R21 R22
Request time  T(R1):  R10 R11 R12 R13 R14
Service time  Q1:     R10 R11 R12 R13 R14
Request time  T(R0):  R00 R01
Service time  Q0:     R00 R01

Problem with WRR
• Priority: Q0 > Q2
• Q0: 8 cycles of waiting time!
• Could be worse!

[Timing diagram, clocks 0-9]
T(R2):  R20 R21 R22
Q2:     R20 R21 R22
T(R1):  R10 R11 R12 R13 R14
Q1:     R10 R11 R12 R13 R14
T(R0):  R00 R01
Q0:     R00 R01

Borrow Credits
• Zero waiting time for Q0!
[Timing diagram, clocks 0-9: Q0* borrows slots to serve R00 and R01 as they arrive; Q2 serves R20; debtQ0 records Q2 entries; annotation: "borrow"]

Repay Later
• At Q0’s turn, the BW guarantee is recovered
• Prioritized access!
[Timing diagram, clocks 0-9: Q2 serves R20, then R21 and R22 later in the round; Q1 serves R10-R14; Q0* serves R00 and R01; debtQ0 holds Q2 entries; annotation: "repay"]

Problem: Depth of DebtQ
• DebtQ as residual BW collector
• BW allocated to Q0 increases to 20% + residual BW
• Requirement for the depth of DebtQ0 decreases
[Timing diagram, clocks 0-9: requests R20-R22 (Q2), R10-R13 (Q1), R00, R01, R03 (Q0*); debtQ0 holds Q2 entries; annotation: "Help repay"]

Evaluation Framework
• Simulation framework
• Workload: ALPBench suite
• DRAMSim: simulates DRAM latency, BW and power
• Reference schedulers: PQ, RR, WRR, BGPQ

Bandwidth Guarantee
• Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%
• System residual: 8%

Port   RR      PQ      BGPQ    WRR     BCBR
0      1.08%   0.73%   1.07%   0.76%   0.76%
1      24%     80%     39%     33%     33%
2      24%     18%     20%     22%     22%
3      24%     0%      20%     22%     22%
4      24%     0%      20%     22%     22%

• No BW guarantee: under RR, P1 receives 24% against a 30% guarantee; under PQ, P3 and P4 receive 0%
• BCBR provides BW guarantee!
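
The borrow-and-repay idea from the Weighted Round Robin, Borrow Credits and Repay Later slides can be sketched as a small slot-level simulation. This is a simplified illustration and not the BCBR hardware: the function names, the single-round scope, the rule that repayment always takes the borrower's own slot, and the arrival clocks in the usage lines are all assumptions made for the example.

```python
from collections import deque

def slot_owners(shares_pct, t_round):
    """Expand (queue, percent) pairs into one WRR round of slot owners."""
    owners = []
    for queue, pct in shares_pct:
        owners.extend([queue] * (pct * t_round // 100))
    return owners

def run_round(owners, arrivals, borrower=None):
    """Serve one round of slots (hypothetical model, one request per slot).

    arrivals maps queue name -> list of request arrival clocks.
    borrower=None gives plain WRR; otherwise that queue may borrow slots
    from other queues and repays them out of its own slots.
    Returns (schedule, waits): the queue served per clock, and the
    waiting time of every served request per queue.
    """
    pending = {q: deque(sorted(times)) for q, times in arrivals.items()}
    debt = deque()                      # lenders still owed a slot (the "debtQ")
    waits = {q: [] for q in arrivals}
    schedule = []

    def ready(queue, clock):
        return bool(pending[queue]) and pending[queue][0] <= clock

    for clock, owner in enumerate(owners):
        if borrower is not None and owner != borrower and ready(borrower, clock):
            debt.append(owner)          # borrow: remember whose slot was taken
            chosen = borrower
        elif owner == borrower and debt:
            chosen = debt.popleft()     # repay: hand own slot back to the lender
        else:
            chosen = owner              # ordinary WRR slot
        if ready(chosen, clock):
            waits[chosen].append(clock - pending[chosen].popleft())
            schedule.append(chosen)
        else:
            schedule.append(None)       # idle slot (residual bandwidth)
    return schedule, waits

# Usage with the shares from the slides; arrival clocks are assumed.
owners = slot_owners([("Q2", 30), ("Q1", 50), ("Q0", 20)], t_round=10)
arrivals = {"Q2": [0, 1, 2], "Q1": [3, 4, 5, 6, 7], "Q0": [0, 1]}
_, wrr_waits = run_round(owners, arrivals)
schedule, bcbr_waits = run_round(owners, arrivals, borrower="Q0")
print(wrr_waits["Q0"], bcbr_waits["Q0"], schedule)
```

Under these assumed arrivals, Q0's waiting times drop from 8 cycles per request under WRR to 0 with borrowing, while each queue is still served its 3, 5 and 2 slots within the round. Q2 pays for this with longer waits on R20-R22, which is the service time flexibility the Key Observations slide relies on.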
Cache Response Latency
• On average 16x faster than WRR
• As fast as PQ (prioritized access)
[Figure: latency per scheduler; axis: Latency (ns)]

DRAM Energy & BW Efficiency
• 30% less page crossing (compared to RR)
• 1.4x more energy efficient
• 1.2x higher effective DRAM BW
• As good as WRR (exploits port locality)

                RR      BGPQ    WRR     BCBR
GB/J            0.298   0.289   0.412   0.411
Act-Pre Ratio   29.6%   30.1%   23.0%   23.0%
Improvement     1.0x    0.97x   1.38x   1.38x

Hardware Cost

Design                                   LUTs   Registers   BRAMs
BCBR: frontend                           1393   884         0
Reference backend: speedy DDRMC          1986   1380        4
Xilinx MPMC: frontend + backend          3450   5540        1-9
BCBR + Speedy                            3379   2264        4

• Better performance without higher cost!

Agenda
• Overview
• Multi-Port Memory Controller (MPMC) Design
• “Out-of-Core” Algorithm / Architecture Exploration

Idea
• Out-of-core algorithms
    • Data does not fit in DRAM
    • Performance dominated by IO
• Key questions
    • Reduce #IOs
    • Block granularity
• Remember: DRAM = Disk
• So let’s
    • Ask the same questions
    • Plug in DRAM parameters
    • Get DRAM-specific answers

Motivating Example: CDN
• Caches in a CDN
    • Get closer to users
    • Save bandwidth
• Zipf’s law
    • 80-20 rule
    • Hit rate

Video Cache

Defining the Knobs
• Transaction: a number of column access commands enclosed by row activation / precharge
• W: burst size
• s: # bursts
[Equation annotations: "Function of array organization & timing params"; "Function of algorithmic parameters"; "Function of array organization & timing params"]

D-nary Heap
• Algorithmic design variables: branching factor, record size

B+ Tree

Lessons Learned
• Optimal result can be beautifully derived!
• Big O does not matter in some cases
• Depending on data input characteristics
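
To make "plug in DRAM parameters, get DRAM-specific answers" concrete, here is a minimal sketch of how a branching factor could be chosen for a d-ary heap stored in DRAM. The slides do not give the cost formula, so everything here is assumed for illustration: a transaction costs a fixed row activation/precharge overhead plus a per-burst time, each sift-down level fetches the d sibling records in one transaction of s bursts, and comparison cost is ignored. The function names and the timing and size values are hypothetical.

```python
import math

def transaction_time_ns(s, t_row_overhead_ns, t_burst_ns):
    """Assumed cost of one transaction: activate/precharge overhead + s bursts."""
    return t_row_overhead_ns + s * t_burst_ns

def heap_levels(n, d):
    """Number of levels in a complete d-ary heap holding n records."""
    levels, capacity, width = 0, 0, 1
    while capacity < n:
        capacity += width   # add one more full level
        width *= d
        levels += 1
    return levels

def sift_down_cost_ns(n, d, record_bytes, burst_bytes,
                      t_row_overhead_ns, t_burst_ns):
    """Assumed worst-case sift-down cost: one transaction per level below the root."""
    s = math.ceil(d * record_bytes / burst_bytes)   # bursts to read d siblings
    per_level = transaction_time_ns(s, t_row_overhead_ns, t_burst_ns)
    return (heap_levels(n, d) - 1) * per_level

def best_branching_factor(n, candidates, **params):
    """Pick the candidate d with the lowest modeled sift-down cost."""
    return min(candidates, key=lambda d: sift_down_cost_ns(n, d, **params))

# Hypothetical parameters: 8-byte records, 64-byte bursts (W),
# 40 ns row overhead, 5 ns per burst, one million records.
params = dict(record_bytes=8, burst_bytes=64,
              t_row_overhead_ns=40, t_burst_ns=5)
for d in (2, 4, 8, 16, 32, 64, 128):
    print(d, sift_down_cost_ns(1_000_000, d, **params))
print("best d:", best_branching_factor(1_000_000, (2, 4, 8, 16, 32, 64, 128), **params))
```

Under this model a wide heap beats the binary heap by a large margin even though both are O(log n), because the row overhead is paid once per level and extra bursts within an open page are cheap. Different record sizes or timing parameters shift the best d, which is consistent with the slides' point that the answer depends on the parameters plugged in.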