Virtually Pipelined Network Memory
Banit Agrawal, Tim Sherwood
UC Santa Barbara
4/11/2020

Memory Design is Hard
• Increasing functionality
• Increasing size of data structures
  - IPv4 routing table size: 100k → 200k → 360k entries
  - Packet classification rules: 2000 → 5000 → 10000
• Increasing line rate: 10 Gbps → 40 Gbps → 160 Gbps
• Throughput must hold in the worst case
  - Need to service the traffic at the advertised rate

What Do Programmers Think?
• Network programmers picture the network system's memory as plain DRAM:
  - Low cost
  - Low power
  - High capacity
  - High bandwidth ... in case of some access patterns
• What is the problem?

DRAM Bank Conflicts
• [Figure: four DRAM banks, each with its own row decoder, sense amplifiers, and column decoder, sharing one address bus and one data bus; banks 1 and 3 are busy]
• An access to a busy bank must wait
  - Variable latency
  - Variable throughput
• Worst case: every access conflicts

Prior Work
• Reducing bank conflicts in common access patterns
  - Prefetching and memory-aware layout [Lin-HPCA'01, Mathew-HPCA'00]
  - Reordering of requests [Hong-HPCA'99, Rixner-ISCA'00]
  - Vector processing domain [Espasa-Micro'97]
  - Good for desktop computing, but no guarantees for the worst case
• Reducing bank conflicts for special access patterns
  - Packet buffering: each datum is written once and read once
  - Low bank conflicts: optimizations including row locality and scheduling [Hasan-ISCA'03, Nikologiannis-ICC'01]
  - No bank conflicts: reordering and clever memory management algorithms [Garcia-Micro'03, Iyer-StanTechReport'02]
  - Not applicable to arbitrary access patterns
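The variable-latency behavior of bank conflicts can be illustrated with a toy model (a sketch with illustrative parameter values, not the controller described in these slides):

```python
# Toy model of DRAM bank conflicts: each request maps to a bank, and a
# request to a busy bank must wait for the in-flight access to finish.
# L and NUM_BANKS are illustrative values, not taken from a real DRAM part.

L = 15          # bank access latency in cycles
NUM_BANKS = 4

def latencies(requests):
    """Return the observed latency of each (arrival_time, bank) request."""
    bank_free_at = [0] * NUM_BANKS     # cycle at which each bank goes idle
    out = []
    for t, bank in requests:
        start = max(t, bank_free_at[bank])   # wait if the bank is busy
        bank_free_at[bank] = start + L
        out.append(start + L - t)            # completion time minus arrival
    return out

# Friendly pattern: each request hits a different bank -> constant latency.
print(latencies([(0, 0), (1, 1), (2, 2), (3, 3)]))   # [15, 15, 15, 15]

# Worst case: every request conflicts on bank 0 -> latency grows unboundedly.
print(latencies([(0, 0), (1, 0), (2, 0), (3, 0)]))   # [15, 29, 43, 57]
```

A friendly pattern sees a constant 15-cycle latency, while a stream of requests to one bank sees latency grow without bound — exactly the worst case an adversary can force.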
Where Do Network Systems Stand?
• [Figure: a spectrum of systems]
  - Full determinism required: 0% deadline failures
  - No exploitable deadline failures
  - Best effort (co-operative): common-case-optimized parts
• Goal: no exploitable deadline failures, built from common-case-optimized parts

Virtually Pipelined Memory
• Normalize the overall latency
  - Using randomization and buffering
  - Deterministic latency for all accesses: a request at time t is answered at time t+D
  - [Figure: the memory controller sits between the requester and DRAM]
• Trillions of accesses without any bank conflict
  - Even under adversarial access patterns

Outline
• Memory for networking systems
• Memory controller
• Design analysis
  - Hardware design
  - How we compare
• Conclusion

Memory Controller
• [Figure: a hash unit (HU) spreads addresses across per-bank controllers (banks 0–3); a bus scheduler arbitrates the shared bus; each bank controller keeps a key → (row, data) table, e.g. 5 → 2,A; 6 → 0,F; 7 → 2,B; 8 → 3,A. An address arrives at time t and its data is returned at time t+D]

Non-conflicting Accesses
• Bank latency (L) = 15 cycles; normalized delay (D) = 30 cycles
• [Timing diagram: requests A, B, C arrive 10 cycles apart and map to different banks; each datum is ready exactly D cycles after its request]

Redundant Accesses
• [Timing diagram: a repeated request to A (AA) arrives while A is still in flight; it is merged with the pending access and still answered D cycles after it was issued]

Conflicting Accesses
• [Timing diagram: requests A–E all conflict on the same bank; the bank access queue absorbs the conflicts and every datum is still ready D cycles after its request. A stall occurs only when the queue overflows]

Implementing Virtually Pipelined Banks
• Per-bank hardware (one copy per bank controller):
  - Bank access queue: holds scheduled accesses (row id, r/w) waiting for the physical bank
  - Delay storage buffer: set-associative storage (valid bit, address, data words, per-entry counters) that holds each datum until its normalized deadline
  - Circular delay buffer: tracks access[t−d] ... access[t] with in/out pointers, releasing each datum exactly D cycles after its request
  - Write buffer (FIFO): queues writes (address, data) on their way to memory
  - Control logic coordinates these structures and the memory interface (address/data)
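The behavior of one virtually pipelined bank can be sketched as follows. This is a simplified model, not the Verilog design: every request is answered exactly D cycles after it arrives, a redundant request to an in-flight address is merged, and the controller stalls when more than K entries occupy the delay storage buffer. The class and method names are invented for illustration.

```python
# Sketch of one virtually pipelined bank (assumed simplified semantics):
# constant normalized delay D, merging of redundant accesses, and a stall
# when the K-entry delay storage buffer is full.

D = 30   # normalized delay in cycles
K = 3    # delay storage buffer entries per bank

class VirtualBank:
    def __init__(self):
        self.in_flight = {}   # address -> list of request times (merged accesses)

    def request(self, t, addr):
        """Issue a request at cycle t; returns 'ok', 'merged', or 'stall'."""
        self._release(t)
        if addr in self.in_flight:        # redundant access: merge, no new entry
            self.in_flight[addr].append(t)
            return "merged"
        if len(self.in_flight) >= K:      # delay storage buffer is full
            return "stall"
        self.in_flight[addr] = [t]
        return "ok"

    def _release(self, now):
        # An entry leaves the buffer once its oldest request is D cycles old.
        self.in_flight = {a: ts for a, ts in self.in_flight.items()
                          if ts[0] + D > now}

bank = VirtualBank()
print(bank.request(0, "A"))    # ok
print(bank.request(5, "A"))    # merged (same address already in flight)
print(bank.request(10, "B"))   # ok
print(bank.request(15, "C"))   # ok
print(bank.request(20, "E"))   # stall (all K = 3 entries occupied)
print(bank.request(31, "E"))   # ok (A was released at cycle 30)
```

The stall case is exactly the event whose mean time to occur (MTS) is analyzed next.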
Delay Storage Buffer Stall
• Mean time to stall (MTS)
  - B = number of banks; 1/B is the probability that a request maps to a given bank
  - A stall happens when more than K accesses to the same bank fall within an interval of D cycles
• Illustration: normalized delay D = 30 cycles, delay storage buffer entries K = 3
  - [Timing diagram: requests A–F each increment (+1) the buffer occupancy on arrival and decrement it (−1) when their data is released D cycles later; with this arrival pattern the buffer fills and MTS = 30]
• Approximate closed form:

  MTS = log(1/2) / log(1 − C(D−1, K−1) · (1/B)^(K−1)) + D

Markovian Analysis
• Bank access queue stall: state-based analysis
  - B = number of banks; 1/B is the probability that an access goes to a given bank
  - If more than D cycles of work are pending for a bank, a stall occurs
• Example: bank access latency L = 3, normalized delay D = 6
  - [Figure: a Markov chain with states idle, 1–6, and stall; each access (probability 1/B) adds L cycles of work, each cycle drains one, and exceeding D cycles of pending work enters the stall state]
  - MTS is reached when the probability of the stall state becomes 0.5
• [Figure: the initial distribution I and the transition matrix M, whose entries are 1/B and 1 − 1/B]
• Compute P = I·M^n and find the smallest n for which the stall-state probability in P reaches 50%; that n is the MTS

Hardware Design and Overhead
• Hardware design
  - Verilog implementation
  - Verification using ModelSim and a C++ simulation model
  - Synthesis using Synopsys Design Compiler
• Hardware overhead tool
  - Uses Cacti parameters
  - Verified against the synthesized design
• Optimal design parameters found with this tool:
  - MTS of 45.7 seconds with an area overhead of 34.1 mm² at 77% efficiency
  - MTS of 10 hours with an area overhead of 34 mm² at 71.4% efficiency

How Does VPNM Perform?
• Packet buffering
  - Only the head and tail pointers of each queue need to be stored
  - Can support an arbitrarily large number of logical queues

  Scheme       | Line rate (Gbps) | Area (mm²) | Total delay (ns) | Supported interfaces
  RADS [17]    | 40               | 10         | 53               | 130
  CFDS [12]    | 160              | 60         | 10000            | 850
  Our approach | 160              | 41.9       | 960              | 4096

  - Versus CFDS: 35% less area, 10x less latency, 5x more queues
• Packet reassembly

Conclusion
• VPNM provides
  - Deterministic latency (every request at t answered at t+D) through randomization and normalization
  - Worst-case throughput that is impossible to exploit
  - Handles any access pattern
  - Ease of programmability/mapping: packet buffering, packet reassembly

Thanks for your attention. Questions?
http://www.cs.ucsb.edu/~arch/