A 34 Gbps Data Transmission System with FPGAs Embedded

advertisement
2012 IEEE Nuclear Science Symposium and Medical Imaging Conference Record (NSS/MIC)
N14-36
A 34 Gbps Data Transmission System with FPGAs
Embedded Transceivers and QSFP+ Modules
Roberto Ammendolaa , Andrea Biagionib , Ottorino Frezzab , Francesca Lo Cicerob , Alessandro Lonardob , Pier
Paoluccib , Davide Rossettib , Andrea Salamona , Francesco Simulab , Laura Tosorattob and Piero Vicinib
Abstract—APEnet+ is our custom developed PCIe gen2 board
based on an Altera Stratix IV FPGA. We demonstrate reliable
usage of Altera’s embedded transceivers coupled with QSFP+
(Quad Small Form Pluggable) technology. QSFP+ standard
defines a hot-pluggable transceiver available in copper or optical
cable assemblies for an aggregated bandwidth of up to 40 Gbps.
We use embedded transceivers in a 4 lane configuration, each
one capable of 8.5 Gbps, for an aggregate bandwidth of 34 Gpbs
per link. On Stratix IV 290 we can place up to 6 bidirectional
links, together with a PCIe gen2 x8 hard IP. We describe design
and implementation of this data transmission system.
•
•
Index Terms—data acquisition, field programmable gate arrays, serial transceivers, high speed interconnect
I. I NTRODUCTION
N the last few years we assisted to the introduction on
the market of programmable components (FPGA) equipped
with multiple high speed transceivers [1]. This allowed to
develop with relative low effort and low cost, high bandwidth
data transmission systems customized to be effective in different application fields.
The APEnet+ project [2] leverages on the huge number of
embedded transceivers available on Altera Stratix IV devices
to implement an ”off-chip logic” free 3D Torus network
controller integrated in a PCIe-based board with an impressive
aggregated bandwidth equal to 400 Gbps per single device.
The APEnet+ interconnect hardware, with appropriate customizations, is under evaluation in several experiments where
the required performances as well as system constraints change
dramatically. As a consequence, a large effort was devoted
to design a robust and performant logic architecture able to
sustain the data traffic between the host and the Torus links.
A preliminary work [4] was done to check the interoperability
of FPGAs embedded transceivers with off-the-shelf, high
bandwidth, 40 Gbps QSFP+ cables.
In this paper we describe the design and the implementation
of the APEnet+ Torus links and report on state-of-the-art test
results for achieved performance, BER and signal integrity.
I
A. Use Cases
APEnet+ is tested and used in several experiments as
interconnect system:
• QUonG: a PC cluster designed for Grand Challenge
applications such as QCD, equipped with NVIDIA GPUs
a INFN
Roma
Tor
Vergata;
{firstname.secondname}@roma2.infn.it
b INFN
Roma
La
Sapienza;
{firstname.secondname}@roma1.infn.it
978-1-4673-2030-6/12/$31.00 ©2012 IEEE
E-mail
contact
E-mail
contact
•
and a 3D toroidal network topology and using APEnet+
and NVIDIA support of Peer-to-Peer protocol to reach
higher performances [3].
Euretile: an EU FP7 funded project aimed to investigate
and design a distribuited brain-inspired massively parallel
computing architecture[8].
NA62, LHCb: real time event selection in HEP experiments using Grahical Processing Units, using APEnet+
P2P feature to inject data arriving from DAQ directly into
GPU memory.
Biocomputing: investigation on a fully connected parallel
computing architecture.
II. APE NET + H ARDWARE
APEnet+ is a custom PCI Express board Gen2 x8 developed
by INFN. It is based on an Altera Stratix IV (EP4S290F45C2
model) FPGA, and has 6 QSFP+ modules directly connected
to Altera’s embedded transceivers. Four modules are placed
on the main board, two more are on a small piggy back card,
thus resulting a 2-slot wide card. PCI Express mechanical
compatibility is preserved with x16 cards. The two boards
are connected through a Samtec SEAM-SEAF 18.5 mm stack
height connector [5], which has a demonstrated speed rating of
10.5 GHz at 23 mm stacking height [6]. Complete assembly
of the APEnet+ card can be seen on Fig. 1.
For the applications it is aimed for, as we need to build
networks with 3D topology, the card hosts 6 channels using the
QSFP+ electrical and mechanical standard [7] which defines
an hot-pluggable transceiver with 4 transmit and 4 receive
lanes. We designed the system in such a way that each lane
is connected to the transmit and receive side of the FPGA
embedded transceivers, for a total amount of 32: 24 of them are
for the QSFP+ channels, 8 for the PCIe side. Each transceiver
is capable of up to 8.5 Gbps data rate in full bidirectional
mode, so that the raw bandwidth is up to 34 Gb/s for any of
the 12 directions.
A. Logic Blocks
APEnet+ is an interconnection card, which enables communication among the channels and the PCIe interface and
among the channels themselves. Three main logic blocks are
implemented in the FPGA:
• a network interface, managing the PCIe connection with
the use of an embedded microcontroller;
• a switch block, responsible for data routing and dispatch;
872
TABLE II
L OGIC U SAGE PER L OGIC B LOCK
External Power
Gbit Ethernet
mini-USB
Network
Interface
Block
Switch
Block
Single
Link
Block
Combinational
ALUTs
32,858
12,879
4,216
Memory
ALUTs
228
0
0
ALMs
25,893
12,810
4,413
Dedicated
logic
registers
29,622
12,321
3,842
Block
memory
bits
4,609,848
2,875,392
8,032
Power
Dissipation
914 mW
3,464 mW
1,134 mW
XX+
Z+
YZ-
Y+
Gbit Ethernet
PCI-e connector
QSFP+ Connectors
Altera Stratix IV
SO-DIMM DDR3
Fig. 1. APEnet+ board. The six QSFP+ communication channels are named
X+, X−, Y +, Y −, Z+, Z− as possible directions from a vertex in a 3D
mesh.
The thermal power dissipation measurements are obtained
from the power analyzer tool by Altera, with a value for the
power input I/O toggle rate set to 25%. The power dissipation
associated to a single channel (4 transceivers and related logic)
amounts to slightly more than 1 Watt.
B. Link description
Fig. 2.
Outline of the main logic blocks representing the architecture
implemented inside the FPGA.
TABLE I
L OGIC U SAGE S UMMARY
Logic utilization
42%
Total Thermal Power Dissipation
13191mW
Combinational ALUTs
70,673 / 232,960 (30%)
Memory ALUTs
228 / 116,480 (< 1%)
Dedicated logic registers
61,712 / 232,960 (26%)
Total pins
242 / 1,112 (22%)
Total block memory bits
7,533,432 / 13,934,592 (54%)
DSP block 18-bit elements
4 / 832 (< 1%)
Total GXB Receiver Channel PCS
32 / 32 (100%)
Total GXB Receiver Channel PMA
32 / 48 (67%)
Total GXB Transmitter Channel PCS
32 / 32 (100%)
Total GXB Transmitter Channel PMA
32 / 48 (67%)
Total PLLs
3 / 12 (25%)
•
the 6 links, composed by the transceivers and the necessary logic needed for data alignment, transmission and
flow control.
A more detailed view of the logic architecture is in Fig. 2.
We show on table I and II an outline of the FPGA logic
usage measured with the synthesis software.
The FPGA employed in APEnet+, the Stratix IV GX,
provides up to 32 full-duplex Clock-Data-Recovery-based
transceivers with Physical Coding Sublayer (PCS) and Physical Medium Attachment (PMA). These transceivers support
proprietary protocols and implement 4 bi-directional lanes
bonded into a single channel; each lane supports data rates up
to 8.5 Gbps, totalling up to 34 Gbps per link, with the deserializer supporting 20-bit deserialization factors (double-width
mode).
The PMA analog settings of the Altera transceiver —
Equalization, Pre-emphasis, DC Gain and Voltage Output
Differential (VOD) — can be dynamically reconfigured by a
Reconfig block. This block reconfigures the physical medium
attachment (PMA) controls, functional blocks, clock multiplier
unit (CMU) phase-locked loops (PLLs), receiver clock data
recovery (CDR), and input reference clocks of a transceiver
channel.
What is sent over the link is 8B10B-encoded to maintain the
DC balance in the serial data transmitted. We used the Altera
8B10B encoder, which generates 10-bit code groups from the
8-bit data and 1-bit control identifier; in double-width mode,
two 8B10B encoders/decoders are cascaded for handling the
20-bit encoded data.
The serializers embedded in the transceivers are fed directly
with an external, programmable low-jitter clock source, for a
maximum frequency of 425 MHz. A half-rate clock is used
for the data path; the Fmax resulting from the synthesis is well
over the requested 212.5 MHz for the logic of each individual
channel.
873
WORD ALIGNING
BLOCK
deser
FIFO write/read enable
DESKEW BLOCK
LANE 0
Lane 0 clock
LANE 1
Lane 1 clock
Lane 0 clock
FPGA LOGIC
TRANSCEIVER
ser
LANE 2
Lane 2 clock
Lane 0 clock
LANE 3
Lane 3 clock
Fig. 3.
Lane 0 clock
Schematic view of architecture for the aligning custom logic block.
Fig. 4. Deskewing procedure example: at T0 the align command is received
by two out of four lanes (say 0 and 3); at T1 the special character is removed
from lane 0 and 3, while align command is received by lane 1 and 2; at T2
lane 0 and 3 write first word in FIFO, while special character is removed
from lane 1 and 2; at T3 lane 0 and 3 write second word, while lane 1 and 2
write first word in FIFO, and concurrently first word is read along all 4 lanes.
C. Channel Alignment
When two boards are connected to each other, a synchronization mechanism has to take place on both sides. Altera
ensures transceiver channel bonding up to 4 lanes, but word
alignment procedures need to be performed by custom logic
(a schematic view is depicted in Fig. 3). The word aligning
block receives parallel data from the deserializer and restores
word alignment based on special patterns that must be received
during link synchronization; the complete process is depicted
in Fig. 5.
RESET
RESET
ALIGN_WORD
ALIGNMENT
ALIGNMENT
ALIGN_OK
On each board, after the reset phase, the word aligning block
sends the alignment pattern “ALIGN WORD” for a costant
number of clock cycles (CHECK ALIGN phase), at the same
time looking for the matching synchronization word in the received data stream. After the “CHECK ALIGN” phase and if
the requested data is received, this block signals that synchronization is acquired by sending the “ALIGN OK WORD”,
otherwise the “CHECK ALIGN” phase restarts after the
“RESTART ALIGN” state.
An example of lane deskewing is given on Fig. 4.
RESTART_ALIGN
ALIGN_OK_WORD
ALIGNMENT
ALIGN_OK_WORD
ALIGN_OK
SEND_DESKEW
DESKEW_WORD
WAIT_DESKEW
DESKEW_WORD
SEND_ACK
WAIT_ACK
ACK_WORD
ALIGN_OK
ALIGNMENT
ALIGN_OK
SEND_DESKEW
WAIT_DESKEW
SEND_ACK
ACK_WORD
WAIT_ACK
The deskewing circuitry consists of four DESKEW FIFOs,
each one directly connected to one channel lane. To enable
the deskewing circuitry at the receiver side, the transmitter
simultaneously sends a “DESKEW WORD (K28.3)” on all
four lanes.
At this point, the FSM signals that channel alignment is
reached by sending the MAGIC ACK WORD. When both
nodes receive the MAGIC ACK WORD, the channel is ready
to start transmission.
ALIGN_WORD
ALIGNMENT
When both nodes are aligned and receive the
“ALIGN OK WORD”, channel alignment is assured
and the deskewing operation starts.
DESKEW FIFOs use different clocks (see Fig. 3) on the
read and write sides: on the write side the clock used is
the specific clock provided by the transceiver, while on the
read side the clock used is the same for all FIFOs (clock
is recovered by Lane 0). The write enable signal is asserted
after recognition of “DESKEW WORD” pattern, while the
read enable signal is asserted when all FIFOs are no longer
empty.
NODE A
NODE B
CHANNEL_OK
CHANNEL_OK
Fig. 5. Scheme of hand-shake taking place between Node A and Node B
when they are connected together. Reset commands are sent to the hardware by
the nodes’ operating systems, whose actions are asynchronous and completely
independent of each other.
III. T EST S YSTEM AND R ESULTS
A. BER
In order to produce the clearest signal and thus being able
to increase signal clock frequency on cable, a fine tuning
874
TABLE III
BER MEASUREMENTS .
BER
Direction
Cable Lenght
Data Rate
< 7.04 E-11
X+
5m
8.5 Gbps
< 8.12 E-11
Y+
5m
8.5 Gbps
< 7.04 E-10
Z+
5m
8.5 Gbps
< 9.8 E-13
X+
5m
7 Gbps
< 2.46 E-14
X+
2m
7 Gbps
of PMAs analog settings is required. Since the transmitted
signal is sensitive to the physical route that it must travel
toward the receiver (cable, connectors, PCB lanes), a number
of Analog settings must be appropriately tuned to offset for
distortions. PCB). Manual tuning of this settings involves trial
and error, then locking at compile time the optimal values
found in this way. Altera provides the Transceiver Toolkit
as a specific tool to choose the appropriate values for these
analog controls. The Transceiver Toolkit performs its job by
tuning the “Reconfig block”; it can dynamically reconfigure
analog settings of the PMAs such as Pre-emphasis and DC
Gain on the transmitter side and Equalization and VOD on
the receiver side; furthermore, it offers a simple GUI to tweak
these settings and to see immediate feedback on receiver side.
By means of this tool we found the optimal settings for each
of the four lanes in all six channels1 .
Once the tuning phase is done, a following step is the testing
of these settings on our board with our custom logic.
Our testing firmware is made of a random signal generator
(PRBS @32 bits) on the trasmitting side and a data checker on
the receiving side — started after synchronization performed
by a dedicated protocol — connected atop the APEnet+ link
block.
The numbers of correctly and incorrectly transferred
128-bits words are written onto registers. This let us either
measure the bit error rate (=no. of incorrectly transferred
words/no. of transferred words) of our channels and verify
our custom logic.
In the table III we report some of the measures performed,
taken with different cable lenghts, channel ports, and data rate.
We verified successful data transfer up to the maximum
declared rate per single lane (8.5 Gbps) in a defined range of
the analog settings. Anyway at this moment, for reliable operations and upper level firmware and software validation, we
use a more conservative setting for all transceivers lanes at 7.0
Gbps, which makes the channels raw aggregated bandwidth at
28 Gbps.
B. Signal Integrity
In order to verify link quality, we prepared a testbed for
measuring signal integrity for the received data. Testbed is
composed by an APEnet+ board which act as a sender, a
QSFP+ cable, and a QSFP+-to-SMA breakout custom card.
We prepared a special firmware for APEnet+, in which all
the trasmitter part of the transceivers are sending a random
1 For more information about Altera Transceiver Toolkit, see http://www.
altera.com/literature/hb/qts/qts qii53029.pdf.
Fig. 6.
Eye pattern at 5.0 Gbps data rate.
Fig. 7.
Eye pattern at 6.0 Gbps data rate.
data pattern, with no waiting for the receive side to be aligned
(as it is a one way measurement).
Measurements were performed with a 20 GHz, 40 GS/s high
bandwidth sampling scope (a LeCroy WaveMaster 820Zi-A)
connected to the SMA test points on the break-out card. In
Fig. 6 and 7 we show the Eye Pattern at 5 Gbps and 6 Gbps.
A cable of 1 mt. lenght has been used for these diagrams.
Signal integrity is good for data rate up to 6 Gbps, after
which signal degradation occurs. This is due to the fact that
in this testbed no receiver side optimizations can be used
(like VOD or equalization), and no 100 Ohm differential
termination is inserted (as instead QSFP+ standard requires).
We are preparing for more precise measurements in the near
future.
IV. C ONCLUSIONS AND F UTURE W ORK
We described our custom design board which enables interboard communication using QSFP+ modules and Altera’s
embedded transceivers. The transmission system is capable of
34 Gbps of raw bandwidth per communication channel, and
we demonstrated reliable operations up to 28 Gbps for cable
lenghts up to 5 mt on all the 6 links simultaneously.
Next generation FPGAs (Stratix V) have been recently
introduced in the market with transceivers working at up
to 14.1 Gbps, which should permit to build a quad link of
56 Gbps, nominally. As future work we plan to port our
communication system on this new device.
875
ACKNOWLEDGMENT
This work was partially supported by the EU Framework
Programme 7 project EURETILE under grant number 247846.
We would like to thank Giacomo Chiodi and Luigi Recchia
from the LABE facility at Roma La Sapienza for helpful
assistance.
We also thank Tommaso Tessitore and Luigi Notaro at
Teledyne Lecroy for support in the measurements.
R EFERENCES
[1] http://www.altera.com/devices/fpga/stratix-fpgas/stratix-iv/
transceivers/stxiv-transceivers.html
[2] D. Rossetti et el., APEnet+ project status, Proceedings of the XXIX
International Symposium on Lattice Field Theory (Lattice 2011). July
10-16, 2011.
[3] R. Ammendola et al., QUonG: A GPU-based HPC System Dedicated
to LQCD Computing, Symposium on Application Accelerators in HighPerformance Computing (SAAHPC), 113-122, (2011).
[4] R Ammendola et al, High-speed data transfer with FPGAs and QSFP+
modules, 2010 JINST 5 C12019 doi:10.1088/1748-0221/5/12/C12019
[5] http://www.samtec.com/ProductInformation/TechnicalSpecifications/
Overview.aspx?series=SEAF
[6] http://www.samtec.com/ProductInformation/TechnicalSpecifications/
Testing Standard.aspx?series=SEAF&stack=23
[7] ftp://ftp.seagate.com/sff/SFF-8436.PDF
[8] http://euretile.roma1.infn.it/mediawiki/index.php/Main Page
876
Download