10 Gigabit technologies for a 40 MHz readout
1st "Faster, Fifi, faster" LHCb Collaboration Upgrade Workshop, January 2007, Edinburgh
Niko Neufeld, CERN/PH
Thanks to Artur Barczyk, Beat Jost, Radu Stoica and Sai Suman for many interesting discussions.

[Cartoon: "Faster, Fifi, Faster", © Gary Larson]

LHCb Trigger-DAQ system: today
• LHC crossing rate: 40 MHz
• Visible events: 10 MHz
• Two-stage trigger system
  – Level-0: synchronous, in hardware; 40 MHz → 1 MHz
  – High Level Trigger (HLT): software on a CPU farm; 1 MHz → 2 kHz
• Front-end Electronics (FE): interface to the readout network
• Readout network
  – Gigabit Ethernet LAN
  – full readout at 1 MHz
• Event-filter farm
  – ~1800 to 2200 1U servers
[Diagram: the L0 trigger and the Timing and Fast Control system steer the front-end boards (FE), which send data through the readout network to the CPUs of the event-filter farm; accepted events go to permanent storage.]

Terminology
• channel: elementary sensitive element = 1 ADC = 8 to 10 bits. The entire detector comprises millions of channels.
• event: all data fragments (each comprising several channels) created at the same discrete time together form an event; it is an electronic snapshot of the detector response to the original physics reaction.
• zero-suppression: send only the channel numbers of channels with a non-zero value (after applying a suitable threshold).
• packing factor: the number of event fragments ("triggers") packed into a single packet/message
  – reduces the message rate
  – optimises bandwidth usage
  – is limited by the number of CPU cores in the receiving server (to guarantee prompt processing and thus limit latency)

LHCb DAQ system: features
• Average event (= trigger) rate: 1 MHz
• Data from several triggers (L0 accepts) are concatenated into one IP packet
  ⇒ reduces the message/packet rate (a minimal sketch of such a sender is given after the "10 Gigabit for DAQ: where?" slide)
• IP packets are pushed over 1000BaseT links
  ⇒ large buffering throughout the network is required
  ⇒ no traffic shaping
  ⇒ extremely simple protocol
• The destination IP address is assigned synchronously and centrally to all TELL1/UKL1 boards via a custom optical network (TTC)
• A large Ethernet/IP network (~3000 Gigabit ports) connects the PC-server farm to the TELL1/UKL1 boards
• Load balancing via destination assignment

LHCb DAQ features (2)
• Uses only industry standards: Ethernet, IA32 PCs
  ⇒ there is no scenario for the next 10 years of computing that does not include these two
• Uses commercial components throughout
  ⇒ no in-house electronics to design and maintain
  ⇒ can easily take advantage of newer = better = cheaper hardware
• Very simple interface to the TELL1/UKL1 boards
• Scalable
• "Compact": cable distances of no more than 36 m allow the use of cheap UTP cabling
How many of these can be carried over to 10 Gigabit?

10 Gigabit for DAQ: where?
1. At the source: data-processing boards (TELL10s) send data on one or several 10 Gigabit links
2. On the way: switches/routers distribute data over 10 Gigabit links
3. At the destination: servers receive data at up to 10 Gigabit speed
A DAQ system in the 500 GB/s range will have 10 Gigabit at least in 1. and 2.
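To make the packing-factor and push-protocol ideas above concrete, here is a minimal C sketch of a sender that concatenates several event fragments behind a small header and pushes them to the centrally assigned destination as a single UDP datagram. The header layout, the field names and the function push_packed_fragments are illustrative assumptions, not the actual LHCb packet format or readout-board firmware.

/* Minimal sketch (hypothetical format) of the "packing factor" idea:
 * concatenate several event fragments into one datagram and push it
 * to the centrally assigned destination -- no handshake, no retries. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

struct packed_evt_header {     /* illustrative header, not the real LHCb layout */
    uint32_t first_event_id;   /* id of the first trigger in this packet        */
    uint16_t n_fragments;      /* packing factor actually used                  */
    uint16_t total_len;        /* payload bytes that follow the header          */
};

/* Pack n fragments (frag[i], len[i]) into buf and push them as one datagram. */
int push_packed_fragments(int sock, const struct sockaddr_in *dest,
                          uint32_t first_event_id,
                          const uint8_t *const *frag, const uint16_t *len,
                          uint16_t n, uint8_t *buf, size_t bufsize)
{
    struct packed_evt_header hdr = { htonl(first_event_id), htons(n), 0 };
    size_t pos = sizeof hdr;

    for (uint16_t i = 0; i < n; ++i) {           /* concatenate the fragments */
        if (pos + len[i] > bufsize)
            return -1;                           /* would overflow the buffer */
        memcpy(buf + pos, frag[i], len[i]);
        pos += len[i];
    }
    hdr.total_len = htons((uint16_t)(pos - sizeof hdr));
    memcpy(buf, &hdr, sizeof hdr);

    /* "Push" protocol: fire-and-forget UDP; buffering is left to the network. */
    return (int)sendto(sock, buf, pos, 0,
                       (const struct sockaddr *)dest, sizeof *dest);
}

The packing factor n would be chosen as described under "Terminology": large enough to keep the packet rate manageable, but no larger than the number of cores in the receiving server.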
10 Gigabit technologies: the champion and the contenders
• Ethernet
  – Well established (various optical standards; short-range copper (CX4); long-range copper over UTP Cat 6A standardised); widely used as an aggregation technology
  – Beginning to conquer the MAN and WAN market (succeeding SONET)
  – Large market share; vendor-independent IEEE 802.3 standard
  – Very active R&D on 100 Gigabit and on 40 Gigabit (which will probably die)
• Myrinet
  – Popular cluster-interconnect technology, low latency
  – 10 Gigabit standard (optical and copper (CX4) exist)
  – Single vendor (Myricom)
• InfiniBand
  – Cluster-interconnect technology, low latency
  – 10 Gbps and 20 Gbps standards (optical and copper)
  – Open industry standard; several vendors (OEMs) but very few chip makers (Mellanox)
  – Powerful protocol/software stack (reliable/unreliable datagrams, QoS, out-of-band messages, etc.)

The champion: Ethernet
• Ad 1.) the TELL10 card: Ethernet still allows a simple FIFO-like interface (more details on the TELL10 and 10 Gbps Ethernet in Guido's talk)
  ⇒ however, because of the (ridiculously) small frame size, a higher-level protocol is mandatory
  ⇒ pure Ethernet remains quite primitive and does not provide reliable messaging; the natural reliable protocol, TCP/IP, is very (too?) heavy to implement in an FPGA
• Ad 2.) prices per router port are dropping quickly but are still higher than for InfiniBand; a copper standard exists, but not over the existing cabling (Cat 6) and with a high power consumption → optical is still *very* expensive (a question of quantity!)
• Ad 3.) various NIC cards exist
  ⇒ the emphasis is on TCP/IP offload for the host → our primitive protocol cannot profit from this!
  ⇒ not yet on the mainboard, but that is only a question of time, at least for high-end servers

Ad 1.) A contender: InfiniBand at the source (TELL10)
• Nallatech plug-in card
• On-board Xilinx Virtex-II Pro FPGA
• Up to 20k logic cells of programmable logic per module
• Up to 88 Block RAMs and 88 embedded multipliers per module
• 2x InfiniBand I/O links
• 2x RocketIO serial links (as on the UKL1)

Ad 2.) InfiniBand switches
• 10 Gbps standard
• 20 Gbps (DDR) also available
• 30 Gbps coming up
[Pictures: a high-density switch (432 ports), an edge switch (24 ports) and an optical transceiver module]

Ad 3.) InfiniBand for the servers
• Quite a few InfiniBand adapter cards ("HCAs") exist.
• No mainboards exist yet with an onboard InfiniBand adapter
  – the availability of onboard Gigabit Ethernet NICs is what makes (copper UTP) Gigabit Ethernet essentially zero-cost on the servers
  – the physical signalling is compatible (Myricom makes dual-personality cards!), and there are rumours that Intel will bring out a chipset offering both options
• InfiniBand on the server might have performance advantages (see below)
• Dual-personality switches exist: they act as an InfiniBand to Ethernet/IP bridge

(Potential) advantages of InfiniBand (applies in part also to Myrinet and other cluster interconnects)
• Low latency and reliable datagrams
• Cost per switch port much lower than for Ethernet (InfiniBand requires much less buffering per port, and very high-speed buffer memory is very expensive)
• Event building using remote DMA could waste much less CPU on data movement
  ⇒ implement a pull protocol → could give a much more efficient use of the network bandwidth (currently we can use only ~20% of the theoretically available bandwidth)
  ⇒ implement load balancing and destination assignment in the interconnect → no need for a custom ("TTC"-like) network for this purpose
  ⇒ currently we cannot handle more than ~300 MB/s per server; for illustration, at the current event size of 35 kB a 10 MHz readout would need about 1000 servers just for data formatting, checking and moving
A minimal sketch of such an RDMA "pull" follows this slide.
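As an illustration of the remote-DMA "pull" just described, here is a minimal sketch using the standard libibverbs API: the receiving server reads an event fragment directly out of the source's memory, so neither CPU touches the payload. Queue-pair setup, memory registration and the out-of-band exchange of the remote address and rkey are omitted, and the function name pull_fragment is an illustrative assumption, not existing LHCb code.

/* Minimal sketch of the RDMA-read "pull": fetch an event fragment from
 * the source's memory without involving either CPU in the data movement.
 * remote_addr and rkey must have been advertised by the source out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int pull_fragment(struct ibv_qp *qp, struct ibv_cq *cq,
                  void *local_buf, uint32_t len, uint32_t lkey,
                  uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    memset(&sge, 0, sizeof sge);
    sge.addr   = (uintptr_t)local_buf;           /* where the fragment lands  */
    sge.length = len;
    sge.lkey   = lkey;

    memset(&wr, 0, sizeof wr);
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_READ;   /* read from remote memory   */
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion      */
    wr.wr.rdma.remote_addr = remote_addr;        /* advertised by the source  */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))         /* hand the read to the HCA  */
        return -1;

    do {                                         /* busy-wait for completion  */
        n = ibv_poll_cq(cq, 1, &wc);             /* (a real system would block) */
    } while (n == 0);

    return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}

Whether the source side of such a transfer could be implemented directly in a TELL10 FPGA, or would need an embedded host processor, is precisely the first open question on the next slide.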
Open questions: InfiniBand
• Technological
  – Can an FPGA drive the InfiniBand adapter, or do we need an embedded host processor running an OS?
  – Almost all of the traffic is unidirectional (from the TELL10s to the servers). Can we take advantage of this fact?
• Market
  – Will InfiniBand ever be standard on PC mainboards?

System design using 10 Gbps
• With 10 Gbps technology several upgrade scenarios can be studied:
  1. A full 40 MHz readout (scaling up the current system by a factor of 40, which requires zero suppression at 40 MHz)
  2. A two-stage system, with part of the detector (VeLo++) read out at 40 MHz and the complete detector at 1 MHz
• A sketch of scenario 2 is given on the next slide; going to scenario 1 means scaling up further and simplifying the data flow (no back-traffic).
• Both could be built today, even if they could not (yet) be afforded.

Scenario 2: sketch
[Diagram: the LHCb detector is read out by ~400 TELL10 boards into four high-density switch fabrics (400 "in" ports per switch); 4 x 10 Gbps links; about 65 Gbps per rack into ~50 farm racks (Rack1 ... Rack50), each with one 32-port or two 16-port switches.]
Data flow:
1. Read out 10 kB events at 40 MHz and buffer them on the TELL10.
2. Send the data to the farm for a trigger decision.
3. Send the trigger decision back to the TELL10.
4. Receive the trigger decision on the TELL10.
5. If the trigger decision is positive, read out the full 35 kB event at 1 MHz.

R&D for a future LHCb DAQ
• We are constantly playing with and evaluating 10 Gbps technologies in our spare time.
• Our main current interest is how far the present DAQ architecture can be scaled up
  – fringe benefit: this also makes the current system more efficient
• Once the two fundamental parameters for the new LHCb DAQ are known (event size and rate), a vigorous R&D programme of ~2 years should easily suffice to design and prototype a system. (The short sizing sketch below illustrates how these two numbers drive the design.)
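As a back-of-the-envelope check of the numbers quoted above, here is a small C program that turns an event size and a trigger rate into an aggregate throughput, a minimum number of 10 Gigabit links and a minimum number of farm servers. The ~300 MB/s per server and the 35 kB / 10 kB event sizes are the figures from the talk; the uniform-load assumption and the helper size_daq are simplifications for illustration only.

/* Back-of-the-envelope sizing: event size and trigger rate determine the
 * aggregate throughput and hence the number of 10 Gigabit links and of
 * farm servers (at the ~300 MB/s per server quoted in the talk). */
#include <math.h>
#include <stdio.h>

static void size_daq(const char *label, double event_kB, double rate_MHz)
{
    double throughput_GBs = event_kB * rate_MHz;   /* kB x MHz = GB/s        */
    double link_GBs       = 10.0 / 8.0;            /* one 10 Gbps link, GB/s */
    double server_GBs     = 0.3;                   /* ~300 MB/s per server   */

    printf("%-28s %6.0f GB/s  >= %5.0f 10-Gig links  >= %5.0f servers\n",
           label, throughput_GBs,
           ceil(throughput_GBs / link_GBs),
           ceil(throughput_GBs / server_GBs));
}

int main(void)
{
    size_daq("today: 35 kB @ 1 MHz",        35.0,  1.0);
    size_daq("illustration: 35 kB @ 10 MHz", 35.0, 10.0);
    size_daq("scenario 2: 10 kB @ 40 MHz",  10.0, 40.0);
    return 0;
}

The ~1167 servers it prints for 35 kB at 10 MHz reproduces the "~1000 servers just for data moving" figure on the InfiniBand-advantages slide, and the ~400 GB/s for scenario 2 is consistent with the "500 GB/s range" mentioned at the start of the talk.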