Saturating the Transceiver Bandwidth: Switch Fabric Design on FPGAs
FPGA'2012
Zefu Dai, Jianwen Zhu
University of Toronto

Switch is Important
Cisco Catalyst 6500 Series Switches since 1999:
- Total revenue: $42 billion
- Total systems: 700,000
- Total ports: 110 million
- Total customers: 25,000
- Average price/system: $60K
- Average price/port: $381
- Commodity 48-port 1GigE switch: $52/port
http://www.cisco.com/en/US/prod/collateral/modules/ps2797/ps11878/at_a_glance_c45-652087.pdf
http://www.albawaba.com/cisco-readies-catalyst-6500-tackle-next-decades-networking-challenges-383275

Inside the Box
Function blocks: virtualization, admin. control, configuration, firewall, L2 bridge, security, QoS, traffic management, routing.
[Figure: line cards (40 Gb/s, with FPGAs) connect over optical fiber and 3.125/6.6/10 Gb/s SERDES links across the backplane to a 2 Tb/s switch fabric.]
The switch fabric is critical: it connects to everyone.

Go Beyond the Line Card
XC7VH870T (single chip):
- 72 GTX transceivers at 13.1 Gb/s
- 16 GTZ transceivers at 28.05 Gb/s
- Raw total: 1.4 Tb/s
Next-generation line rate: 160 Gb/s

Can We Do Switch Fabric?
- Is a single-chip switch fabric possible on an FPGA that saturates the transceiver bandwidth?
- Is it possible to rival ASIC performance?
Prof. Jonathan Rose, FPGA'06, FPGA vs. ASIC gap: area 40x, speed 3x, power 12x.

What's a Switch Fabric
Two major tasks, with little computational work:
- Provide the data path connection (crossbar)
- Resolve congestion (buffers)
[Figure: N x N switch at line rate R, with input buffers, crossbar and output buffers.]

Switch Fabric Architectures
Buffer location marks the difference:
- Output Queued (OQ) switch: buffered at the outputs
- Input Queued (IQ) switch: buffered at the inputs
- Crosspoint Queued (CQ) switch: buffered at the crossbar

Ideal Switch
OQ switch:
- Best performance
- Simplest crossbar (distributed multiplexers)
- Memory bandwidth requirement: (N+1)R per output buffer; 2NR for the Centralized Shared Memory (CSM) variant

Lowest Memory Requirement Switch
IQ switch:
- 2R memory bandwidth
- Low performance: 58% throughput cap from head-of-line (HOL) blocking
- Scheduling is a maximum bipartite matching problem

Most Popular Switch
Combined Input and Output Queued (CIOQ) switch:
- Internal speedup S: read (write) S cells from (to) memory per cell time, 1 <= S <= N
- Emulates an OQ switch with S = 2
- (S+1)R memory bandwidth
- Complex scheduling

State-of-the-art Switch
Combined Input and Crosspoint Queued (CICQ) switch:
- High performance, close to an OQ switch
- Simple scheduling
- 2R memory bandwidth
- N^2 crosspoint buffers
Example: IBM Prizma switch (2004), 2 Tb/s.

Switch Fabric on FPGA?
FPGA resources: LUTs, wires, DSPs, ... and plenty of on-chip SRAM, built in the same technology as the SERDES transceivers.
Idea: memory is the switch! Use the SRAMs to saturate the transceiver bandwidth.

Start Point
The CICQ switch relies on a large number of small buffers:
- Xilinx 18-kbit dual-port BRAM at 500 MHz: 36 Gb/s
- Each crosspoint buffer needs only 2R = 20 Gb/s at R = 10 Gb/s, wasting almost half of the BRAM bandwidth
- N = 100 ports means N^2 = 10k crosspoint buffers: the BRAMs run out

Memory Can be Shared
Use x BRAMs to emulate y crosspoint buffers (x < y):
- Example: x = 5, y = 9, since 5 * 36 Gb/s = 9 * 20 Gb/s
- Total: 5 * (N/3)^2 BRAMs instead of N^2

Memory is the Switch
- Logical view: a small OQ switch (e.g., 3 x 3) lives entirely in the same memory
- Physical view: BRAMs; the address decoder and sense amplifiers act as the crossbar
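To make the "memory is the switch" idea concrete, here is a minimal behavioural sketch in Python (not the paper's RTL): cells are "switched" purely by where they are written in one shared memory, with per-output pointer queues and a free address pool standing in for the address decoder. The class and method names, the drop-on-full policy and the cell format are illustrative assumptions; per-cycle time multiplexing of the BRAM ports is abstracted away.

```python
from collections import deque

class MemorySubSwitch:
    """Behavioural model of an S x S memory-based sub-switch (MBS).

    One shared cell memory plays the role of both the crosspoint buffers
    and the crossbar: a cell reaches its output simply by being written
    at an address linked into that output's pointer queue. Addresses are
    drawn from a shared free pool, so a busy output can borrow space from
    idle ones (see the next slide).
    """

    def __init__(self, num_ports, depth_cells):
        self.num_ports = num_ports
        self.mem = [None] * depth_cells               # shared BRAM contents
        self.free_pool = deque(range(depth_cells))    # free address pool
        self.out_queue = [deque() for _ in range(num_ports)]  # pointer queues

    def write(self, output_port, cell):
        """Input side: enqueue one cell for an output (drop if memory is full)."""
        if not self.free_pool:
            return False                              # back-pressure / drop
        addr = self.free_pool.popleft()
        self.mem[addr] = cell
        self.out_queue[output_port].append(addr)
        return True

    def read(self, output_port):
        """Output side: dequeue the head-of-line cell of one output."""
        if not self.out_queue[output_port]:
            return None
        addr = self.out_queue[output_port].popleft()
        cell = self.mem[addr]
        self.mem[addr] = None
        self.free_pool.append(addr)                   # recycle the address
        return cell

# Tiny usage example: a 3 x 3 sub-switch, as in the slide's logical view.
sw = MemorySubSwitch(num_ports=3, depth_cells=8)
sw.write(2, "cell A")   # an input writes a cell destined to output 2
sw.write(0, "cell B")
assert sw.read(2) == "cell A"
assert sw.read(0) == "cell B"
```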
Memory Can be Borrowed
- Give busy buffers more memory space
- Enables efficient multicast support
- Free address pool and per-output pointer queues, themselves implemented in BRAMs

Group Crosspoint Queued (GCQ) Switch
Similar to Dally's hierarchical crossbar (HC) switch, but:
- Higher memory requirement in exchange for simpler scheduling and better performance
- Uses SRAMs as small switches (not just as buffers)
- Multicast support
[Figure: the ports are grouped and the N x N crossbar is tiled from Memory Based Sub-switches (MBS).]
- Efficient utilization of BRAMs
- Simple scheduler (time multiplexing)

Resource Savings
Each S x S sub-switch requires P BRAMs:
- P * B = 2SR (R: line rate; B: bandwidth of one BRAM)
- P/S = 2R/B (a constant)
Total BRAMs required for the entire crossbar:
- P * (N/S)^2 = (2N^2R/B)/S = C/S (C is a constant)
- Savings: N^2 - C/S (larger S, more savings)
Max S is limited by the minimum packet size:
- Aggregate data width of the BRAMs <= minimum packet size
- TCP packet (40 bytes): max S = 8
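To make the arithmetic above concrete, here is a small back-of-the-envelope calculator that applies the formulas on the "Resource Savings" slide. The function name, the choice of N = 64 ports, and the R = 10 Gb/s / B = 36 Gb/s figures (taken from the "Start Point" slide) are illustrative assumptions; the real count also depends on data-width rounding, so it will not exactly reproduce the implementation numbers on the next slide.

```python
import math

def gcq_bram_budget(n_ports, group_size, line_rate_gbps, bram_bw_gbps):
    """Apply the slide's formulas:
       per sub-switch   P = ceil(2 * S * R / B)
       whole crossbar   P * (N / S)^2  ~= (2 * N^2 * R / B) / S
       baseline (CICQ)  one buffer per crosspoint -> N^2 BRAM-equivalents
    """
    n, s, r, b = n_ports, group_size, line_rate_gbps, bram_bw_gbps
    p = math.ceil(2 * s * r / b)            # BRAMs per S x S sub-switch
    sub_switches = math.ceil(n / s) ** 2    # number of sub-switches in the tile
    total = p * sub_switches
    baseline = n * n                        # CICQ: one buffer per crosspoint
    return p, total, baseline - total

# Example sweep with the numbers used earlier in the talk:
# R = 10 Gb/s line rate, B = 36 Gb/s per 18-kbit dual-port BRAM at 500 MHz.
for s in (1, 2, 4, 8):                      # S = 8 is the TCP-packet limit
    p, total, saved = gcq_bram_budget(n_ports=64, group_size=s,
                                      line_rate_gbps=10, bram_bw_gbps=36)
    print(f"S={s}: {p} BRAMs per sub-switch, {total} total, {saved} saved")
```

As the slide states, the savings grow with S: the printed sweep shows the total BRAM count shrinking roughly as 1/S until the minimum-packet-size limit (S = 8) is reached.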
Hardware Implementation

Device          N   S   Data width  Registers     LUTs          BRAMs      Savings
Virtex6-240T    16  4   256 bit     36945 (12%)   49537 (32%)   224 (27%)  288 (56%)
Spartan6-150T   9   3   256 bit     27028 (14%)   37285 (40%)   95 (36%)   67 (41%)

- Saturates the transceiver bandwidth
- Small S due to automatic place and route (BRAM frequency as low as 160 MHz)
- Clock domain crossing (CDC) handled with a source-synchronous CDC technique

Performance Evaluation
Experimental setup:
- Booksim: network simulator by Dally's group
- Tested switches: IQ, OQ, CICQ, and GCQ configurations with different S
Simulation parameters:
- Flit delay: 2; credit delay: 2
- N: 16; flit size: 32 bytes; packet size: 1 flit and 16 flits
- Traffic: uniform; network topology: fat tree; routing: nearest common ancestor

Maximum Throughput Test
Keep minimum buffering in the crossbar, sweep the input buffer depth.
[Figure: throughput vs. input buffer depth (16-512 flits) for IQ, CICQ-1flit, CICQ-4flit, GCQ-S4-16flit, GCQ-S8-64flit and OQ-256flit; the IQ curve shows the HOL blocking phenomenon.]

Buffer Memory Efficiency
Keep minimum buffering at the inputs, sweep the crossbar buffer size.
[Figure: throughput vs. total buffer size (512-2304 flits) for CICQ, GCQ-S4, GCQ-S8 and OQ.]

Packet Latency Test
Limit the total buffer size to 1k flits.
[Figure: latency (cycles) vs. traffic load (0.01-0.99) for IQ-80flit, CICQ-4flit, GCQ-S4 with 8/16/32/64-flit buffers, and OQ-1024flit.]

Discussion: Why can't ISE meet timing?
- Hierarchical compilation and place-and-route: the design is regular and symmetric
[Figure: block diagram of a Memory Based Sub-switch: 256-bit input and output ports in a 40 MHz domain are de-multiplexed/serialized into a 160 MHz memory domain containing the shared buffer, packet-header logic, free address pool, address recycle bin and output pointer queues.]

Discussion: Why is BRAM slow?
- CACTI prediction based on ITRS-LSTP data vs. measured Xilinx 18-kbit BRAM performance
- From 45 nm to 28 nm there is little improvement in BRAM performance. Why?
[Figure: on-chip SRAM frequency (0-1200 MHz) vs. process node (130 nm to 28 nm), CACTI prediction vs. Xilinx BRAM; reference points: Fulcrum FM4000 (130 nm, 2 MB, 720 MHz), Sun SPARC (90 nm, 4 MB L2, 800 MHz), Intel Xeon (65 nm, 16 MB L3, 850 MHz).]

Discussion: Extrapolation on Virtex-7
Virtex-7 XC7VH870T:
- Max BRAM frequency: 600 MHz
- Total of 2820 18-kbit BRAMs
- 1.4 Tb/s of transceiver bandwidth
CICQ requirement: > 10k BRAMs.
Proposed GCQ switch, assuming a 400 MHz BRAM frequency:
- 2R/B = 0.69
- Savings fraction 1 - P/S^2 = 91%
- 1014 18-kbit BRAMs (36% of the device)

Questions?

Conclusions
- FPGAs can rival ASICs on switch fabric design, with resources left over for other functions
- Big room for improvement in the FPGA's on-chip SRAM performance
- FPGA CAD tools can do a better job

Roadmap for Data Link Speed
Nathan Binkert et al. @ ISCA 2011, link characteristics:

Process (nm)                   45     32     22
Link speed (Gb/s)              80     160    320
Max link length (m)            10     10     10
In-flight data (Bytes)         1107   2214   4428
Optical link parameters
  Data wavelengths             8      16     32
  Optical data rate (Gb/s)     10     10     10
Electrical link parameters
  SERDES speed (Gb/s)          10     20     32
  SERDES channels              8      8      10

Link speed grows in channel count, NOT channel speed.

Interlaken
[Figure: Chip x and Chip y connected by a 40 Gb/s Interlaken link built from 10 Gb/s channels (CH); each packet is stripped into 64-bit cells, interleaved across the channels, and reassembled at the receiver.]
(A toy sketch of this strip/interleave/assemble step is given at the end of the deck.)

High Link Speed Processing
- Parallel processing: a single 40 Gb/s stream is memory bounded, so strip it into parallel 10 Gb/s streams, run them through parallel switches, and reassemble
- Prefer a high-radix switch with thin ports over a low-radix switch with fat ports

Latest Improvement Upon CICQ
Hierarchical Crossbar (Dally):
- Requires (40%) fewer buffers
- Higher requirements on memory bandwidth and on the scheduling of the sub-switches
[Figure: hierarchical crossbar with 2R input buffers and (S+1)R CIOQ-style sub-switches.]
Example: Cray BlackWidow (2006), 2 Tb/s hierarchical crossbar.

Can We Do Single-Chip Switch Fabric?
FPGA vs. ASIC:
- Comparable transceiver bandwidth
- Abundant on-chip SRAMs, built in the same technology as an ASIC's (the ASIC can't do better)
- Goal: saturate the transceiver bandwidth
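As an illustration of the strip/interleave/assemble step on the Interlaken slide above, here is a toy Python sketch. It is not the Interlaken protocol itself (no framing, CRC, or flow control); the 64-bit cell size and the four-channel 40 Gb/s link follow the slide, and the function names are illustrative.

```python
def strip(packet: bytes, num_channels: int = 4, cell_bytes: int = 8):
    """Split a packet into 64-bit cells and deal them round-robin
    across the channels (the 'Strip' and 'Interleave' steps)."""
    cells = [packet[i:i + cell_bytes] for i in range(0, len(packet), cell_bytes)]
    lanes = [[] for _ in range(num_channels)]
    for idx, cell in enumerate(cells):
        lanes[idx % num_channels].append(cell)
    return lanes

def assemble(lanes):
    """Receiver side: merge the per-channel cell streams back
    into the original byte order (the 'Assemble' step)."""
    out = bytearray()
    idx = 0
    while any(lanes):
        lane = lanes[idx % len(lanes)]
        if lane:
            out += lane.pop(0)
        idx += 1
    return bytes(out)

# Round-trip check: one packet over four 10 Gb/s channels of a 40 Gb/s link.
pkt = bytes(range(40))                 # a 40-byte (minimum TCP) packet
assert assemble(strip(pkt)) == pkt
```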