Design and Analysis of a Robust Pipelined Memory System
Hao Wang†, Haiquan (Chuck) Zhao*, Bill Lin†, and Jun (Jim) Xu*
†University of California, San Diego   *Georgia Institute of Technology
Infocom 2010, San Diego

Memory Wall
• Modern Internet routers need to manage large amounts of packet- and flow-level data at line rates
• e.g., they need to maintain per-flow records during a monitoring period, but:
– Core routers see millions of flows, translating to 100’s of megabytes of storage
– On a 40 Gb/s OC-768 link, a new packet can arrive every 8 ns

Memory Wall
• The SRAM/DRAM dilemma
• SRAM: access latency is typically 5-15 ns (fast enough for the 8 ns line rate)
• But SRAM capacity is substantially inadequate in many cases: 4 MB is typical for the largest parts (much less than the 100’s of MBs needed)

Memory Wall
• DRAM provides inexpensive bulk storage
• But its random access latency is typically 50-100 ns (much slower than the 8 ns needed for a 40 Gb/s line rate)
• Conventional wisdom holds that DRAMs are not fast enough to keep up with ever-increasing line rates

Memory Design Wish List
• Line-rate memory bandwidth (like SRAM)
• Inexpensive bulk storage (like DRAM)
• Predictable performance
• Robustness to adversarial access patterns

Main Observation
• Modern DRAMs can be fast and cheap!
– Driven by graphics, video games, and HDTV
– At commodity pricing: just $0.01/MB currently, $20 for 2 GB!

Example: Rambus XDR Memory
• 16 internal banks

Memory Interleaving
• Performance is achieved through memory interleaving
– e.g., suppose we have B = 6 DRAM banks and the access pattern is sequential: bank 1 serves addresses 1, 7, 13, …; bank 2 serves 2, 8, 14, …; and so on up to bank 6, which serves 6, 12, 18, …
– The effective memory bandwidth is then B times faster

Memory Interleaving
• But suppose the access pattern instead strides by B: 1, 7, 13, 19, 25, … — then every access hits the same bank
• Memory bandwidth degrades to the worst-case DRAM latency

Memory Interleaving
• One solution is to apply a pseudo-random permutation of memory locations, so that a fixed access pattern is spread across the banks

Adversarial Access Patterns
• However, memory bandwidth can still degrade to the worst-case DRAM latency even with randomization:
1. Lookups to the same global variable will trigger accesses to the same memory bank
2. An attacker can flood packets with the same TCP/IP header, triggering updates to the same memory location, and hence the same memory bank, regardless of the randomization function

Outline
• Problem and Background
→ Proposed Design
• Theoretical Analysis
• Evaluation

Pipelined Memory Abstraction
• Emulates SRAM with a fixed delay
• [Timing diagram: the same sequence of read/write operations to addresses a, b, c is issued at times 0-5 to an ideal SRAM and to the SRAM emulation; the SRAM returns each read’s data in the issuing cycle, while the emulation returns the identical data stream at times D, D+1, …, D+5]

Implications of Emulation
• Fixed pipeline delay: if a read operation is issued at time t to an emulated SRAM, the data is available from the memory controller at exactly t + D (instead of in the same cycle)
• Coherency: the read operations output the same results as an ideal SRAM system

Proposed Solution: Basic Idea
• Keep an SRAM reservation table of the memory operations and data that occurred in the last C cycles
• Avoid introducing a new DRAM operation for memory references to the same location within C cycles

Details of Memory Architecture
• [Block diagram: input operations (op, addr, data) enter a C-entry reservation table, searched through an MRI table (CAM) and an MRW table (CAM); addresses pass through a random permutation to B per-bank request buffers in front of the DRAM banks; read data returns to the data output, with R-links connecting merged reads]

Merging of Operations
• Requests arrive from right to left; in “A + B” below, B arrived first
1. READ + WRITE → WRITE: the read copies its data from the write
2. WRITE + WRITE → WRITE: the 2nd write overwrites the 1st write
3. READ + READ → READ: the 2nd read copies its data from the 1st read
4. WRITE + READ → WRITE + READ: no merging, since the read must still fetch the value that precedes the write
• A software sketch of these rules follows below
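A minimal software sketch of the four merging rules above, assuming a Python dict in place of the MRI/MRW CAM lookups; the class and method names (ReservationTable, issue) are illustrative, not from the paper’s hardware design:

```python
# Minimal sketch of the operation-merging rules, assuming a Python dict
# stands in for the MRI/MRW CAM lookups. All names here are illustrative.

class ReservationTable:
    def __init__(self, window):
        self.window = window       # C: how many cycles an entry stays live
        self.entries = {}          # addr -> (cycle_issued, op, data)

    def issue(self, cycle, op, addr, data=None):
        """Return the DRAM operation to enqueue, or None if merged away."""
        self._expire(cycle)
        prev = self.entries.get(addr)
        if prev is None:                       # no recent op to this address
            self.entries[addr] = (cycle, op, data)
            return (op, addr, data)            # must go to DRAM
        _, prev_op, _ = prev
        if op == 'R' and prev_op == 'W':       # READ + WRITE -> WRITE
            return None                        # read forwards the write data
        if op == 'W' and prev_op == 'W':       # WRITE + WRITE -> WRITE
            self.entries[addr] = (cycle, 'W', data)
            return None                        # 2nd write overwrites the 1st
        if op == 'R' and prev_op == 'R':       # READ + READ -> READ
            return None                        # 2nd read R-links to the 1st
        # WRITE + READ -> WRITE + READ: the earlier read still needs the
        # old value, so the write cannot absorb it; both operations stand.
        self.entries[addr] = (cycle, 'W', data)
        return ('W', addr, data)

    def _expire(self, cycle):
        """Drop entries older than the C-cycle window."""
        self.entries = {a: e for a, e in self.entries.items()
                        if cycle - e[0] < self.window}
```

Within any C-cycle window, this merging leaves at most one read and one write per address headed to a DRAM bank — the property formalized on the Robustness slide, and the reason an attacker’s best strategy is to repeat a request only every C+1 cycles.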
Proposed Solution
• Rigorously prove that, with merging, the worst-case delay for a memory operation is bounded by some fixed D w.h.p.
• Provide a pipelined memory abstraction in which operations issued at time t complete exactly D cycles later, at time t + D (instead of in the same cycle)
• The reservation table, with C > D, is also used to implement the pipeline delay, as well as serving as a “cache”

Outline
• Problem and Background
• Proposed Design
→ Theoretical Analysis
• Evaluation

Robustness
• At most one write operation in a request buffer every C cycles to a particular memory address
• At most one read operation in a request buffer every C cycles to a particular memory address
• At most one read operation followed by one write operation in a request buffer every C cycles to a particular address

Theoretical Analysis
• Worst-case analysis
• Convex ordering
• Large deviation theory
• Prove: with a cache of size C, the best an attacker can do is to send repetitive requests every C+1 cycles

Bound on Overflow Probability
• We want to bound the probability that a request buffer overflows within n cycles:
  $\Pr[\text{overflow}] \le \sum_{0 \le s < t \le n} \Pr[D_{s,t}]$, where $D_{s,t}$ is the event $X_{s,t} \ge \mu(t-s) + K$
• $X_{s,t}$ is the number of updates to one bank during cycles [s, t], $\mu$ is the bank’s service rate, and K is the length of a request queue
• For the total overflow probability bound, multiply by B

Chernoff Inequality
• For any $\theta > 0$:
  $\Pr[D_{s,t}] = \Pr[X \ge \mu(t-s) + K] = \Pr[e^{\theta X} \ge e^{\theta(\mu(t-s)+K)}] \le \dfrac{E[e^{\theta X}]}{e^{\theta(\mu(t-s)+K)}}$
• Since this is true for all $\theta > 0$:
  $\Pr[D_{s,t}] \le \min_{\theta > 0} \dfrac{E[e^{\theta X}]}{e^{\theta(\mu(t-s)+K)}}$
• We want to find the update sequence that maximizes $E[e^{\theta X}]$

Worst Case Request Patterns
• [Diagram: a request sequence of length C built from blocks of repeated addresses]
• $q_1 + q_2 + 1$ requests to distinct counters $a_1, \ldots, a_{q_1+q_2+1}$
• $q_1$ requests repeat 2T times each
• $q_2$ requests repeat 2T − 1 times each
• 1 request repeats r times

Outline
• Problem and Background
• Proposed Design
• Theoretical Analysis
→ Evaluation

Evaluation
• Overflow probability bound for 16 million addresses, μ = 1/10, and B = 32
• [Plot: overflow probability bound (from 10^0 down to about 10^-14) versus queue length K (80 to 180) for C = 6000, 7000, 8000, and 9000; annotation: SRAM 156 KB, CAM 24 KB]

Evaluation
• Overflow probability bound for 16 million addresses, μ = 1/10, and C = 8000
• [Plot: overflow probability bound (from 10^0 down to about 10^-30) versus request buffer size K (80 to 180) for B = 32, 34, 36, and 38]

Conclusion
• Proposed a robust memory architecture that provides the throughput of SRAM with the density of DRAM
• Unlike conventional caching, which has unpredictable hit/miss performance, our design guarantees w.h.p. a pipelined memory abstraction that can accept a new memory operation every cycle with a fixed pipeline delay
• Convex ordering and large deviation theory are used to rigorously prove robustness under adversarial accesses

Thank You
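Backup: Numeric Chernoff Sketch
• A minimal numeric illustration of the minimization over θ from the Chernoff slide, under a deliberately simplified model that is assumed here, not taken from the paper: X ~ Binomial(n, 1/B), i.e., each of n already-merged requests in a window lands in a given bank independently via the random permutation. The paper’s actual bound instead maximizes E[e^{θX}] over worst-case request sequences via convex ordering; the parameter values below are only examples.

```python
# Numeric Chernoff minimization under an assumed simplified model:
# X ~ Binomial(n, 1/B), so E[e^{theta X}] = (1 + (e^theta - 1)/B)^n.
# This only illustrates the min over theta; it is not the paper's bound.
import math

def chernoff_log10_bound(n, B, K, mu, tau):
    """log10 of min_{theta>0} E[e^{theta X}] / e^{theta (mu*tau + K)}."""
    threshold = mu * tau + K
    best = 0.0                            # theta -> 0 gives the trivial bound 1
    for i in range(1, 3001):              # grid search over theta in (0, 30]
        theta = i / 100.0
        log_mgf = n * math.log1p((math.exp(theta) - 1.0) / B)  # log E[e^{theta X}]
        best = min(best, log_mgf - theta * threshold)
    return best / math.log(10.0)

# Example parameters echoing the evaluation slides: B = 32 banks,
# service rate mu = 1/10, window tau = 8000 cycles, one request per cycle.
for K in (80, 120, 160):
    print(K, chernoff_log10_bound(n=8000, B=32, K=K, mu=0.1, tau=8000))
```

As on the evaluation slides, the bound falls off sharply as the queue length K grows, which is what makes modest per-bank request buffers sufficient.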