2012 IEEE Nuclear Science Symposium and Medical Imaging Conference Record (NSS/MIC) N14-36 A 34 Gbps Data Transmission System with FPGAs Embedded Transceivers and QSFP+ Modules Roberto Ammendolaa , Andrea Biagionib , Ottorino Frezzab , Francesca Lo Cicerob , Alessandro Lonardob , Pier Paoluccib , Davide Rossettib , Andrea Salamona , Francesco Simulab , Laura Tosorattob and Piero Vicinib Abstract—APEnet+ is our custom developed PCIe gen2 board based on an Altera Stratix IV FPGA. We demonstrate reliable usage of Altera’s embedded transceivers coupled with QSFP+ (Quad Small Form Pluggable) technology. QSFP+ standard defines a hot-pluggable transceiver available in copper or optical cable assemblies for an aggregated bandwidth of up to 40 Gbps. We use embedded transceivers in a 4 lane configuration, each one capable of 8.5 Gbps, for an aggregate bandwidth of 34 Gpbs per link. On Stratix IV 290 we can place up to 6 bidirectional links, together with a PCIe gen2 x8 hard IP. We describe design and implementation of this data transmission system. • • Index Terms—data acquisition, field programmable gate arrays, serial transceivers, high speed interconnect I. I NTRODUCTION N the last few years we assisted to the introduction on the market of programmable components (FPGA) equipped with multiple high speed transceivers [1]. This allowed to develop with relative low effort and low cost, high bandwidth data transmission systems customized to be effective in different application fields. The APEnet+ project [2] leverages on the huge number of embedded transceivers available on Altera Stratix IV devices to implement an ”off-chip logic” free 3D Torus network controller integrated in a PCIe-based board with an impressive aggregated bandwidth equal to 400 Gbps per single device. The APEnet+ interconnect hardware, with appropriate customizations, is under evaluation in several experiments where the required performances as well as system constraints change dramatically. As a consequence, a large effort was devoted to design a robust and performant logic architecture able to sustain the data traffic between the host and the Torus links. A preliminary work [4] was done to check the interoperability of FPGAs embedded transceivers with off-the-shelf, high bandwidth, 40 Gbps QSFP+ cables. In this paper we describe the design and the implementation of the APEnet+ Torus links and report on state-of-the-art test results for achieved performance, BER and signal integrity. I A. Use Cases APEnet+ is tested and used in several experiments as interconnect system: • QUonG: a PC cluster designed for Grand Challenge applications such as QCD, equipped with NVIDIA GPUs a INFN Roma Tor Vergata; {firstname.secondname}@roma2.infn.it b INFN Roma La Sapienza; {firstname.secondname}@roma1.infn.it 978-1-4673-2030-6/12/$31.00 ©2012 IEEE E-mail contact E-mail contact • and a 3D toroidal network topology and using APEnet+ and NVIDIA support of Peer-to-Peer protocol to reach higher performances [3]. Euretile: an EU FP7 funded project aimed to investigate and design a distribuited brain-inspired massively parallel computing architecture[8]. NA62, LHCb: real time event selection in HEP experiments using Grahical Processing Units, using APEnet+ P2P feature to inject data arriving from DAQ directly into GPU memory. Biocomputing: investigation on a fully connected parallel computing architecture. II. APE NET + H ARDWARE APEnet+ is a custom PCI Express board Gen2 x8 developed by INFN. It is based on an Altera Stratix IV (EP4S290F45C2 model) FPGA, and has 6 QSFP+ modules directly connected to Altera’s embedded transceivers. Four modules are placed on the main board, two more are on a small piggy back card, thus resulting a 2-slot wide card. PCI Express mechanical compatibility is preserved with x16 cards. The two boards are connected through a Samtec SEAM-SEAF 18.5 mm stack height connector [5], which has a demonstrated speed rating of 10.5 GHz at 23 mm stacking height [6]. Complete assembly of the APEnet+ card can be seen on Fig. 1. For the applications it is aimed for, as we need to build networks with 3D topology, the card hosts 6 channels using the QSFP+ electrical and mechanical standard [7] which defines an hot-pluggable transceiver with 4 transmit and 4 receive lanes. We designed the system in such a way that each lane is connected to the transmit and receive side of the FPGA embedded transceivers, for a total amount of 32: 24 of them are for the QSFP+ channels, 8 for the PCIe side. Each transceiver is capable of up to 8.5 Gbps data rate in full bidirectional mode, so that the raw bandwidth is up to 34 Gb/s for any of the 12 directions. A. Logic Blocks APEnet+ is an interconnection card, which enables communication among the channels and the PCIe interface and among the channels themselves. Three main logic blocks are implemented in the FPGA: • a network interface, managing the PCIe connection with the use of an embedded microcontroller; • a switch block, responsible for data routing and dispatch; 872 TABLE II L OGIC U SAGE PER L OGIC B LOCK External Power Gbit Ethernet mini-USB Network Interface Block Switch Block Single Link Block Combinational ALUTs 32,858 12,879 4,216 Memory ALUTs 228 0 0 ALMs 25,893 12,810 4,413 Dedicated logic registers 29,622 12,321 3,842 Block memory bits 4,609,848 2,875,392 8,032 Power Dissipation 914 mW 3,464 mW 1,134 mW XX+ Z+ YZ- Y+ Gbit Ethernet PCI-e connector QSFP+ Connectors Altera Stratix IV SO-DIMM DDR3 Fig. 1. APEnet+ board. The six QSFP+ communication channels are named X+, X−, Y +, Y −, Z+, Z− as possible directions from a vertex in a 3D mesh. The thermal power dissipation measurements are obtained from the power analyzer tool by Altera, with a value for the power input I/O toggle rate set to 25%. The power dissipation associated to a single channel (4 transceivers and related logic) amounts to slightly more than 1 Watt. B. Link description Fig. 2. Outline of the main logic blocks representing the architecture implemented inside the FPGA. TABLE I L OGIC U SAGE S UMMARY Logic utilization 42% Total Thermal Power Dissipation 13191mW Combinational ALUTs 70,673 / 232,960 (30%) Memory ALUTs 228 / 116,480 (< 1%) Dedicated logic registers 61,712 / 232,960 (26%) Total pins 242 / 1,112 (22%) Total block memory bits 7,533,432 / 13,934,592 (54%) DSP block 18-bit elements 4 / 832 (< 1%) Total GXB Receiver Channel PCS 32 / 32 (100%) Total GXB Receiver Channel PMA 32 / 48 (67%) Total GXB Transmitter Channel PCS 32 / 32 (100%) Total GXB Transmitter Channel PMA 32 / 48 (67%) Total PLLs 3 / 12 (25%) • the 6 links, composed by the transceivers and the necessary logic needed for data alignment, transmission and flow control. A more detailed view of the logic architecture is in Fig. 2. We show on table I and II an outline of the FPGA logic usage measured with the synthesis software. The FPGA employed in APEnet+, the Stratix IV GX, provides up to 32 full-duplex Clock-Data-Recovery-based transceivers with Physical Coding Sublayer (PCS) and Physical Medium Attachment (PMA). These transceivers support proprietary protocols and implement 4 bi-directional lanes bonded into a single channel; each lane supports data rates up to 8.5 Gbps, totalling up to 34 Gbps per link, with the deserializer supporting 20-bit deserialization factors (double-width mode). The PMA analog settings of the Altera transceiver — Equalization, Pre-emphasis, DC Gain and Voltage Output Differential (VOD) — can be dynamically reconfigured by a Reconfig block. This block reconfigures the physical medium attachment (PMA) controls, functional blocks, clock multiplier unit (CMU) phase-locked loops (PLLs), receiver clock data recovery (CDR), and input reference clocks of a transceiver channel. What is sent over the link is 8B10B-encoded to maintain the DC balance in the serial data transmitted. We used the Altera 8B10B encoder, which generates 10-bit code groups from the 8-bit data and 1-bit control identifier; in double-width mode, two 8B10B encoders/decoders are cascaded for handling the 20-bit encoded data. The serializers embedded in the transceivers are fed directly with an external, programmable low-jitter clock source, for a maximum frequency of 425 MHz. A half-rate clock is used for the data path; the Fmax resulting from the synthesis is well over the requested 212.5 MHz for the logic of each individual channel. 873 WORD ALIGNING BLOCK deser FIFO write/read enable DESKEW BLOCK LANE 0 Lane 0 clock LANE 1 Lane 1 clock Lane 0 clock FPGA LOGIC TRANSCEIVER ser LANE 2 Lane 2 clock Lane 0 clock LANE 3 Lane 3 clock Fig. 3. Lane 0 clock Schematic view of architecture for the aligning custom logic block. Fig. 4. Deskewing procedure example: at T0 the align command is received by two out of four lanes (say 0 and 3); at T1 the special character is removed from lane 0 and 3, while align command is received by lane 1 and 2; at T2 lane 0 and 3 write first word in FIFO, while special character is removed from lane 1 and 2; at T3 lane 0 and 3 write second word, while lane 1 and 2 write first word in FIFO, and concurrently first word is read along all 4 lanes. C. Channel Alignment When two boards are connected to each other, a synchronization mechanism has to take place on both sides. Altera ensures transceiver channel bonding up to 4 lanes, but word alignment procedures need to be performed by custom logic (a schematic view is depicted in Fig. 3). The word aligning block receives parallel data from the deserializer and restores word alignment based on special patterns that must be received during link synchronization; the complete process is depicted in Fig. 5. RESET RESET ALIGN_WORD ALIGNMENT ALIGNMENT ALIGN_OK On each board, after the reset phase, the word aligning block sends the alignment pattern “ALIGN WORD” for a costant number of clock cycles (CHECK ALIGN phase), at the same time looking for the matching synchronization word in the received data stream. After the “CHECK ALIGN” phase and if the requested data is received, this block signals that synchronization is acquired by sending the “ALIGN OK WORD”, otherwise the “CHECK ALIGN” phase restarts after the “RESTART ALIGN” state. An example of lane deskewing is given on Fig. 4. RESTART_ALIGN ALIGN_OK_WORD ALIGNMENT ALIGN_OK_WORD ALIGN_OK SEND_DESKEW DESKEW_WORD WAIT_DESKEW DESKEW_WORD SEND_ACK WAIT_ACK ACK_WORD ALIGN_OK ALIGNMENT ALIGN_OK SEND_DESKEW WAIT_DESKEW SEND_ACK ACK_WORD WAIT_ACK The deskewing circuitry consists of four DESKEW FIFOs, each one directly connected to one channel lane. To enable the deskewing circuitry at the receiver side, the transmitter simultaneously sends a “DESKEW WORD (K28.3)” on all four lanes. At this point, the FSM signals that channel alignment is reached by sending the MAGIC ACK WORD. When both nodes receive the MAGIC ACK WORD, the channel is ready to start transmission. ALIGN_WORD ALIGNMENT When both nodes are aligned and receive the “ALIGN OK WORD”, channel alignment is assured and the deskewing operation starts. DESKEW FIFOs use different clocks (see Fig. 3) on the read and write sides: on the write side the clock used is the specific clock provided by the transceiver, while on the read side the clock used is the same for all FIFOs (clock is recovered by Lane 0). The write enable signal is asserted after recognition of “DESKEW WORD” pattern, while the read enable signal is asserted when all FIFOs are no longer empty. NODE A NODE B CHANNEL_OK CHANNEL_OK Fig. 5. Scheme of hand-shake taking place between Node A and Node B when they are connected together. Reset commands are sent to the hardware by the nodes’ operating systems, whose actions are asynchronous and completely independent of each other. III. T EST S YSTEM AND R ESULTS A. BER In order to produce the clearest signal and thus being able to increase signal clock frequency on cable, a fine tuning 874 TABLE III BER MEASUREMENTS . BER Direction Cable Lenght Data Rate < 7.04 E-11 X+ 5m 8.5 Gbps < 8.12 E-11 Y+ 5m 8.5 Gbps < 7.04 E-10 Z+ 5m 8.5 Gbps < 9.8 E-13 X+ 5m 7 Gbps < 2.46 E-14 X+ 2m 7 Gbps of PMAs analog settings is required. Since the transmitted signal is sensitive to the physical route that it must travel toward the receiver (cable, connectors, PCB lanes), a number of Analog settings must be appropriately tuned to offset for distortions. PCB). Manual tuning of this settings involves trial and error, then locking at compile time the optimal values found in this way. Altera provides the Transceiver Toolkit as a specific tool to choose the appropriate values for these analog controls. The Transceiver Toolkit performs its job by tuning the “Reconfig block”; it can dynamically reconfigure analog settings of the PMAs such as Pre-emphasis and DC Gain on the transmitter side and Equalization and VOD on the receiver side; furthermore, it offers a simple GUI to tweak these settings and to see immediate feedback on receiver side. By means of this tool we found the optimal settings for each of the four lanes in all six channels1 . Once the tuning phase is done, a following step is the testing of these settings on our board with our custom logic. Our testing firmware is made of a random signal generator (PRBS @32 bits) on the trasmitting side and a data checker on the receiving side — started after synchronization performed by a dedicated protocol — connected atop the APEnet+ link block. The numbers of correctly and incorrectly transferred 128-bits words are written onto registers. This let us either measure the bit error rate (=no. of incorrectly transferred words/no. of transferred words) of our channels and verify our custom logic. In the table III we report some of the measures performed, taken with different cable lenghts, channel ports, and data rate. We verified successful data transfer up to the maximum declared rate per single lane (8.5 Gbps) in a defined range of the analog settings. Anyway at this moment, for reliable operations and upper level firmware and software validation, we use a more conservative setting for all transceivers lanes at 7.0 Gbps, which makes the channels raw aggregated bandwidth at 28 Gbps. B. Signal Integrity In order to verify link quality, we prepared a testbed for measuring signal integrity for the received data. Testbed is composed by an APEnet+ board which act as a sender, a QSFP+ cable, and a QSFP+-to-SMA breakout custom card. We prepared a special firmware for APEnet+, in which all the trasmitter part of the transceivers are sending a random 1 For more information about Altera Transceiver Toolkit, see http://www. altera.com/literature/hb/qts/qts qii53029.pdf. Fig. 6. Eye pattern at 5.0 Gbps data rate. Fig. 7. Eye pattern at 6.0 Gbps data rate. data pattern, with no waiting for the receive side to be aligned (as it is a one way measurement). Measurements were performed with a 20 GHz, 40 GS/s high bandwidth sampling scope (a LeCroy WaveMaster 820Zi-A) connected to the SMA test points on the break-out card. In Fig. 6 and 7 we show the Eye Pattern at 5 Gbps and 6 Gbps. A cable of 1 mt. lenght has been used for these diagrams. Signal integrity is good for data rate up to 6 Gbps, after which signal degradation occurs. This is due to the fact that in this testbed no receiver side optimizations can be used (like VOD or equalization), and no 100 Ohm differential termination is inserted (as instead QSFP+ standard requires). We are preparing for more precise measurements in the near future. IV. C ONCLUSIONS AND F UTURE W ORK We described our custom design board which enables interboard communication using QSFP+ modules and Altera’s embedded transceivers. The transmission system is capable of 34 Gbps of raw bandwidth per communication channel, and we demonstrated reliable operations up to 28 Gbps for cable lenghts up to 5 mt on all the 6 links simultaneously. Next generation FPGAs (Stratix V) have been recently introduced in the market with transceivers working at up to 14.1 Gbps, which should permit to build a quad link of 56 Gbps, nominally. As future work we plan to port our communication system on this new device. 875 ACKNOWLEDGMENT This work was partially supported by the EU Framework Programme 7 project EURETILE under grant number 247846. We would like to thank Giacomo Chiodi and Luigi Recchia from the LABE facility at Roma La Sapienza for helpful assistance. We also thank Tommaso Tessitore and Luigi Notaro at Teledyne Lecroy for support in the measurements. R EFERENCES [1] http://www.altera.com/devices/fpga/stratix-fpgas/stratix-iv/ transceivers/stxiv-transceivers.html [2] D. Rossetti et el., APEnet+ project status, Proceedings of the XXIX International Symposium on Lattice Field Theory (Lattice 2011). July 10-16, 2011. [3] R. Ammendola et al., QUonG: A GPU-based HPC System Dedicated to LQCD Computing, Symposium on Application Accelerators in HighPerformance Computing (SAAHPC), 113-122, (2011). [4] R Ammendola et al, High-speed data transfer with FPGAs and QSFP+ modules, 2010 JINST 5 C12019 doi:10.1088/1748-0221/5/12/C12019 [5] http://www.samtec.com/ProductInformation/TechnicalSpecifications/ Overview.aspx?series=SEAF [6] http://www.samtec.com/ProductInformation/TechnicalSpecifications/ Testing Standard.aspx?series=SEAF&stack=23 [7] ftp://ftp.seagate.com/sff/SFF-8436.PDF [8] http://euretile.roma1.infn.it/mediawiki/index.php/Main Page 876