THE ASSOCIATIVE MEMORY SYSTEM FOR THE FTK PROCESSOR AT ATLAS*

D. MAGALOTTI
University of Modena and Reggio Emilia, Via Università 4, 41121 Modena, Italy

S. CITRARO, S. DONATI, P. LUCIANO, M. PIENDIBENE
University of Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy

P. GIANNETTI
Sezione di Pisa INFN, Largo Bruno Pontecorvo 3, 56127 Pisa, Italy

A. LANZA
Sezione di Pavia INFN, Via Agostino Bassi 6, 27100 Pavia, Italy

G. VERZELLESI
University of Modena and Reggio Emilia, Via Università 4, 41121 Modena, Italy

A. SAKELLARIOU
Prisma Electronics SA, El Venizelou 128, Nea Smyrni, 17123 Athens, Greece

W. BILLEREAU, J.M. COMBE
CERN, CH-1211 Geneva 23, Switzerland

In high energy physics experiments, the most interesting processes are very rare and hidden in an extremely large background. As the experiment complexity, accelerator backgrounds, and instantaneous luminosity increase, more effective and accurate data selection techniques are needed. The Fast TracKer processor (FTK) is a real-time tracking processor designed for the ATLAS trigger upgrade. The core of the FTK is the Associative Memory system, which provides massive computing power to minimize the processing time of complex tracking algorithms executed online. This paper reports on the results and performance of a new prototype of the Associative Memory system.

* The AM system project receives support from Istituto Nazionale di Fisica Nucleare and from the European Community FP7 People grant FTK 324318 FP7-PEOPLE-2012-IAPP.

1. Introduction

The trigger system of a detector installed at a hadron collider must have high efficiency for the interesting physics processes and must suppress the enormous QCD backgrounds. A multilevel trigger [1] is an effective solution for this task. The ATLAS trigger system [2] consists of three levels. The hardware Level-1 trigger quickly locates the regions of interest in the calorimeter and the muon system, operating with output rates up to 100 kHz.
The subsequent trigger levels, Level-2 and the Event Filter (EF), are collectively known as the high-level trigger (HLT). They consist of software algorithms running on a farm of commercial CPUs. A significant improvement in system performance would derive from the use of track parameters. The Fast TracKer processor (FTK) [3] is a hardware-based system designed to perform track reconstruction in the silicon detectors with offline quality and in time for Level-2 selections.

The FTK is highly parallel, with the detector segmented into η–φ towers, each with its own tracking processor. Each processor covers one sixteenth of the detector in φ, 22.5°, plus 10° overlap to maintain high efficiency. The η range of each region is divided into four overlapping intervals, for a total of 64 η–φ towers. Consequently, a tower receives only a fraction of the hits (the clusters produced by particles in the silicon detector), and the Processing Units (PUs) executing track reconstruction have substantially fewer candidates to process. Pattern recognition inside each detector tower is executed by two PUs working in parallel. The time-consuming pattern recognition problem is solved by the Associative Memory (AM) technology [5], which exploits parallelism to the maximum level. This approach reduces the typically exponential complexity of CPU-based algorithms to a linear problem.

2. The Associative Memory system tests

Figure 1: The Processing Unit of the FTK system. The AM board is visible on the left, the AUX board is visible on the right.

Figure 1 shows the first prototype of the FTK processing unit, already built and tested. It is composed of the AM system, which performs the pattern matching, and the AUX board (currently a first simplified version), which will perform the track fitting [3].
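The pattern matching performed by the AM system can be illustrated with a small software model. The following Python sketch is only a toy under stated assumptions, not the AMchip logic: the class name `AssociativeMemory`, the four-layer patterns, and the integer hit addresses are hypothetical. Each stored pattern keeps one match flag per layer, every incoming hit is compared against all patterns for that layer, and a pattern fires as a road once all of its layers have matched. The work per hit does not depend on how many hits are combined, which is the sense in which the combinatorial problem becomes linear in the number of hits.

```python
from collections import defaultdict


class AssociativeMemory:
    """Toy software model of an associative-memory pattern bank (illustrative only)."""

    def __init__(self, patterns):
        # patterns: list of tuples holding one coarse hit address per detector layer
        self.patterns = list(patterns)
        self.n_layers = len(self.patterns[0])
        # Inverted index: (layer, address) -> ids of the patterns storing that address.
        # In hardware every pattern compares the incoming hit at the same time;
        # the index reproduces the same result with one lookup per hit.
        self.index = defaultdict(list)
        for pid, pattern in enumerate(self.patterns):
            for layer, address in enumerate(pattern):
                self.index[(layer, address)].append(pid)
        self.reset()

    def reset(self):
        """Clear the per-layer match flags at the start of an event."""
        self.matched = [set() for _ in self.patterns]

    def load_hit(self, layer, address):
        """Present one hit to the bank and set the flags of the patterns it matches."""
        for pid in self.index.get((layer, address), ()):
            self.matched[pid].add(layer)

    def roads(self):
        """Return the ids of the patterns matched on every layer (fired roads)."""
        return [pid for pid, layers in enumerate(self.matched)
                if len(layers) == self.n_layers]


# Hypothetical bank of three 4-layer patterns and one event given as (layer, address) hits.
bank = [(3, 7, 1, 4), (3, 8, 1, 4), (5, 7, 2, 0)]
am = AssociativeMemory(bank)
event = [(0, 3), (1, 7), (2, 1), (3, 4), (1, 9), (0, 5)]
for layer, address in event:
    am.load_hit(layer, address)
print(am.roads())  # [0]: only the first pattern is matched on all four layers
```

The second pattern misses its layer-1 address and the third misses layers 2 and 3, so neither is reported.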
The AM system is made of the Associative Memory chip (AMchip), which is the key element of the FTK processor, and two types of boards: a VME board (AMB) and the local associative memory boards (LAMBs), large mezzanines mounted on the AMB that host the AMchips. The main task of the AMB is to manage the distribution of the hits to the AMchips and the readout of the fired patterns (roads) from them [4].

2.1. Serial links tests

The AM board uses a large network of high-speed serial links for data distribution. We chose Xilinx Spartan 6 devices, which provide multi-gigabit transceivers. Serial links offer significant flexibility over parallel buses in terms of transmission distance, noise immunity, and performance. Moreover, circuit board routing is simpler, and differential serial links can transmit data over longer distances than parallel links. An 8B/10B data encoding is used for DC balance, clock recovery, and error detection. The tests we performed on the serial links were a measurement of the link quality and a measurement of the Bit Error Rate (BER) using a Pseudo Random Bit Sequence (PRBS). The transmitter sends a selected PRBS, and the receiver drives a signal that is asserted each time an error is detected on the link.

2.2. Pattern matching tests

The pattern-matching test validates the entire functionality of the AM system. The AMchips are configured through the JTAG protocol. On each LAMB, the 32 AMchips are connected in 8 pipelines of 4 chips. A dedicated FPGA controls all the chains in parallel, performing the conversion between the VME protocol and the JTAG protocol. A simulated pattern bank has been generated and loaded into the array of chips. With a JTAG procedure, we check that each pattern of the bank has been written correctly. Then an input file with events made of random hits is generated and loaded via VME into the AMB input memories, one for each layer. When all the memories are loaded, the hits are transmitted at full speed to the AMchips.
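The flow of this test, including the final comparison of the hardware output with a reference simulation, can be mirrored in a small software sketch. The Python code below is illustrative only and is not the actual test software: `reference_roads` stands in for the AM simulation, `FakeBoard` is a hypothetical stand-in for the hardware reached through VME and JTAG, and the number of layers, the address range, and the bank size are arbitrary assumptions.

```python
import random

N_LAYERS = 8       # assumption: number of detector layers per pattern
N_ADDRESSES = 16   # assumption: small address range so that some roads fire


def make_bank(n_patterns, rng):
    """Random pattern bank: one address per layer for each pattern."""
    return [tuple(rng.randrange(N_ADDRESSES) for _ in range(N_LAYERS))
            for _ in range(n_patterns)]


def make_event(n_hits_per_layer, rng):
    """One list of random hit addresses per layer, like the per-layer input memories."""
    return [[rng.randrange(N_ADDRESSES) for _ in range(n_hits_per_layer)]
            for _ in range(N_LAYERS)]


def reference_roads(bank, event):
    """Reference model: a pattern fires if every layer contains its address."""
    hit_sets = [set(layer_hits) for layer_hits in event]
    return sorted(pid for pid, pattern in enumerate(bank)
                  if all(pattern[layer] in hit_sets[layer]
                         for layer in range(N_LAYERS)))


class FakeBoard:
    """Hypothetical device under test; a real test would use VME and JTAG access."""

    def __init__(self, broken_pattern=None):
        self.bank = []
        self.broken_pattern = broken_pattern  # pattern id that never fires (fault injection)

    def write_bank(self, bank):
        self.bank = list(bank)

    def read_bank(self):
        return list(self.bank)

    def run(self, event):
        roads = reference_roads(self.bank, event)
        return [road for road in roads if road != self.broken_pattern]


def run_test(board, n_events=20, seed=1):
    """Load a bank, verify the readback, then compare roads event by event."""
    rng = random.Random(seed)
    bank = make_bank(200, rng)
    board.write_bank(bank)
    if board.read_bank() != bank:      # readback check of the loaded bank
        return False
    for _ in range(n_events):
        event = make_event(12, rng)
        if board.run(event) != reference_roads(bank, event):
            return False               # hardware and reference disagree
    return True


print(run_test(FakeBoard()))  # True: the stand-in board reproduces the reference exactly
```

Because the stand-in board reuses the reference model, the comparison here passes by construction; the `broken_pattern` option only shows how a missing road would be caught.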
The matched roads output by the AMchips are stored in output memories that are read via VME. Finally, we compare the list of matched roads found by the hardware with the output of the AM simulation, which produces the list of expected roads from the same random input hits. The test was successful running at 100 MHz, as expected.

3. Evolution of the AM board

The LAMB and the AMB shown in figure 1 were designed for the AMchip04 [xx], the first chip designed explicitly for FTK, with all the needed functionality but a die size much smaller (14 mm²) than the final one (180 mm²). The AMchip04 is characterized by fully parallel I/O and is packaged in an LQ208 (like its predecessor, AMchip03), a package that is very convenient for its pin accessibility but very limited in terms of available pads. However, working with the TSMC 65 nm technology and calculating the power consumption of the chip, we understood that the full-size final chip would require a large number of VDD and GND pins, more than the old LQ208 package can provide. Increasing the chip area requires a change to a complex BGA package, which is almost incompatible with the routing complexity of the LAMB (see figure 2, left).

Figure 2: The routing of the left half of the LAMB (the board is symmetric with respect to the central vertical axis). On the left, the LAMB designed for the AMchip04 in LQ208 packages; on the right, the new solution for BGA AMchips with serialized I/O.

These problems led us to a different strategy for the AMchip, based entirely on serial link communication. We bought an IP from Silicon Creations to provide serialized I/O buses inside the AMchip. We investigated sophisticated packaging options and found the best solution to be a flip-chip BGA, with I/O optimized to work at 2 Gbps.
The distribution of input and output data on serial links required a full redesign of the LAMB, which, as shown on the right of figure 2, is characterized by an extremely simplified routing. As a consequence of this simplification, the new LAMB layout is more compact than the old one. This was an advantage for the AMB, since it left space for additional DC-DC converters and filters that cannot be placed below the LAMB. The motherboard has been redesigned without the extension on the board front, which was initially made to accommodate 4 DC-DC converters (see figure 1) that could not fit in the internal space. In addition, we replaced the 4 Spartan 6 devices with 2 Artix 7 200T FPGAs (16 serial links each) and the previous connector with the high-performance connector ASP134488-01, which provides not only the high-speed signals but also all the current necessary to power the chips on the LAMB. Figure 3 shows a photograph of the latest motherboard version, with superimposed blue and red arrows showing the flow of input (blue) and output (red) data from the P3 connector (violet square at the bottom right) to the LAMBs and back, respectively. The serial links on the boards have been tested successfully, but the system still lacks the new AMchip for a complete characterization.

Figure 4: The prototype of the final version of AMBoard

4. Conclusion

The design of the first prototype of the Processing Unit of the FTK processor had to face the most challenging aspects of the AMchip evolution. A first prototype compatible with the parallel I/O and package of the AMchip04 has been built and successfully tested. A new development was necessary to evolve to the final-size AMchip, which has serialized I/O. The network of high-speed serial links is an elegant solution that offers large advantages over parallel buses: significant flexibility in terms of transmission distance, noise immunity, and performance.
References

1. W. Smith, Triggering at LHC Experiments, Nucl. Instr. and Meth. A 478, pp. 62–67 (2002).
2. G. Aad et al. [ATLAS Collaboration], Expected Performance of the ATLAS Experiment - Detector, Trigger and Physics, pp. 549 (2008).
3. A. Andreani et al., The FastTracker Real Time Processor and Its Impact on Muon Isolation, Tau and b-Jet Online Selections at ATLAS, IEEE Trans. Nucl. Sci. 59, no. 2, pp. 348–357 (2012).
4. A. Andreani et al., The AMchip04 and the processing unit prototype for the FastTracker, IOP J. Instr. 7, C08007 (2012).