The Cray XD1 Computer and its Reconfigurable Architecture
Dave Strenski, stren@cray.com
July 11, 2005

Outline
• XD1 overview: architecture, interconnect, Active Manager
• XD1 FPGAs: architecture, example execution
• Core development strategy: FORTRAN to VHDL considerations, memory allocation, unrolling, one versus many cores
• XD1 FPGA running examples: MTA kernel and Ising model, FFT kernel from DSPlogic, Smith-Waterman kernel from Cray, LANL traffic simulation code, other works in progress

Cray Today
• Nasdaq: CRAY
• Formed on April 1, 2000 as Cray Inc.
• Headquartered in Seattle, WA
• Roughly 900 employees across 30 countries
• Four major development sites: Chippewa Falls, WI; Mendota Heights, MN; Seattle, WA; Vancouver, Canada
• Significant progress in the market: X1 sales and the Sandia National Laboratory Red Storm contract; Oak Ridge National Laboratory Leadership Class system; DARPA HPCS Phase II funding of $50M through 2006 for Cascade; acquired OctigaBay; 70+ Cray XD1s sold to date

Cray XD1 Overview

Cray XD1 System Architecture
• Compute: 12 AMD Opteron 32/64-bit x86 processors running high-performance Linux
• RapidArray interconnect: 12 communications processors, 1 Tb/s switch fabric
• Active Management: dedicated processor
• Application acceleration: 6 co-processors
• Processors directly connected via the integrated switch fabric

Cray XD1 Chassis
• Front: six two-way Opteron blades, fans, six SATA hard drives, six FPGA modules
• Rear: 0.5 Tb/s switch with 12 x 2 GB/s ports to the fabric, three I/O slots (e.g. JTAG), four 133 MHz PCI-X slots, and a connector for a second 0.5 Tb/s switch with 12 more 2 GB/s ports to the fabric

Compute Blade
• Two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory
• RapidArray communications processor
• Connector to main board

Cray Innovations
• Balanced interconnect, Active Management, and application acceleration combine for Cray XD1 performance and usability

Architecture
[Diagram: a typical Intel Xeon server funnels I/O through a Northbridge/Southbridge or PCI-X bridge, limiting it to about 1 GB/s per PCI-X slot; the Cray XD1's AMD Opterons connect to the RapidArray fabric directly over 3.2 GB/s HyperTransport links, with 6.4 GB/s to DDR memory]

Removing the Bottleneck
[Chart: I/O, interconnect, and memory bandwidth per node for a Xeon server (1 GB/s PCI-X I/O, 0.25 GB/s GigE interconnect, 5.3 GB/s DDR 333 memory), the Cray XD1 (8 GB/s RapidArray, 6.4 GB/s DDR 400), the Cray XT3 (SS interconnect, 6.4 GB/s DDR 400), and the Cray X1 (31, 34.1, and 102 GB/s figures)]

Communications Optimizations
• Direct Connected Processor architecture: each AMD Opteron 2XX links to a RapidArray communications processor over 3.2 GB/s HyperTransport, with 2 GB/s links into the fabric
• Cray communications libraries: MPI 1.2, TCP/IP, PVM, Shmem, Global Arrays
• RapidArray communications processor offloads: HT/RA tunnelling with bonding, routing with route redundancy, reliable transport, short-message latency optimization, DMA operations, system-wide clock synchronization, system-wide process and time synchronization

Synchronized Linux Scheduler
[Diagram: without synchronization, system overhead strikes each processor at a different time, so every barrier waits on the slowest process and CPU cycles are wasted; with the schedulers synchronized, system cycles line up across processors and barriers 1-3 complete sooner]
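The effect sketched above is easy to expose with a barrier microbenchmark. As a rough illustration (not from the original deck), a minimal MPI timing loop of this shape is what the speedup chart on the next slide measures: unsynchronized system overhead on any one node stretches every barrier.

```c
/* Minimal MPI barrier-timing sketch to expose OS jitter.
 * Illustrative only -- not Cray's benchmark code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 10000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* align the start */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);      /* jitter on any node delays every barrier */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("mean barrier time: %g us\n", 1e6 * (t1 - t0) / iters);
    MPI_Finalize();
    return 0;
}
```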
Reducing OS Jitter
[Chart: Linux synchronization speedup (%) versus processor count from 1 to 64, rising toward 40-50% at the largest counts]
• Cray XD1 Linux synchronization increases application scaling
• Improves efficiency by 42%
• Lowers application license fees for an equivalent processor count

Direct Connect Topology
• 1 Cray XD1 chassis: 12 AMD Opteron processors, 58 GFLOPS, 8 GB/s between SMPs, 1.8 µs interconnect, integrated switching
• 3 Cray XD1 chassis: 36 AMD Opteron processors, 173 GFLOPS, 8 GB/s between SMPs, 2.0 µs interconnect, integrated switching
• 25 Cray XD1 chassis (two racks): 300 AMD Opteron processors, 1.4 TFLOPS, 2-8 GB/s between SMPs, 2.0 µs interconnect, integrated switching

Fat Tree Topology
• 12 Cray XD1 chassis: 144 AMD Opteron processors, 691 GFLOPS, 4/8 GB/s between SMPs, 2.0 µs interconnect
• Fat tree switching, integrated first & third order
• 6/12 RapidArray spine switches (24 ports each)

MPI Latency
[Chart: MPI latency (µs, 0-35) versus message size from 0 to 4096 bytes for the Cray XD1 (RapidArray), Quadrics (Elan 4), 4x InfiniBand, and Myrinet (D card)]
• RapidArray short-message latency is 4 times lower than InfiniBand
• The Cray XD1 has sent 2 KB before the others have sent their first byte

MPI Throughput
[Chart: bandwidth (MB/s, up to ~1400) versus message size from 1 byte to 1 MB for the Cray XD1 (1/2 RapidArray fabric), Quadrics Elan 4, 4x InfiniBand, and Myrinet (D card)]
• The Cray XD1 delivers 2x the bandwidth of InfiniBand at a 1 KB message size
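Latency and bandwidth curves like the two above typically come from a ping-pong microbenchmark. A minimal sketch (illustrative, not Cray's benchmark code):

```c
/* Minimal MPI ping-pong sketch: one-way latency and bandwidth
 * per message size, as plotted on the two slides above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;
    int rank;
    char *buf = calloc(4096, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int len = 1; len <= 4096; len *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {        /* ping */
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) { /* pong */
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / (2.0 * reps); /* one-way time */
        if (rank == 0)
            printf("%5d bytes: %.2f us, %.1f MB/s\n", len, dt * 1e6, len / dt / 1e6);
    }
    MPI_Finalize();
    free(buf);
    return 0;
}
```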
Active Manager
• System usability: single system command and control
• Resiliency: dedicated management processors, real-time OS, and communications fabric; proactive background diagnostics with self-healing
• CLI and web access
• Active Management software: automated management for exceptional reliability, availability, and serviceability

Active Manager GUI: SysAdmin
• The GUI provides quick access to status information and system functions

Automated Management
• Single system command and control for users and administrators across compute, front-end, and file-services partitions
• Partition management, Linux configuration, hardware monitoring, software upgrades, file system management, data backups
• Network configuration, accounting and user management, security, performance analysis, resource and queue management

Self-Monitoring
• Monitors parity, heartbeat, temperature, fan speed, air velocity, voltage, current, and hard-drive thermals, with diagnostics across processors, memory, fans, power supplies, and the interconnect
• Dedicated management processor, OS, and fabric

Thermal Management
[Diagram only]

File Systems: Local Disks
• One S-ATA hard drive per SMP, with a local Linux (EXT2/3) directory per drive, reachable across the RapidArray fabric

File Systems: SAN
• An SMP can act as a file server for the SAN: EXT2/3 over an FC HBA to the FC SAN, exported to the compute SMPs via NFS

Programming Environment
• Operating system: Cray HPC enhanced Linux distribution (derived from SuSE 8.2)
• System management: Active Manager for system administration and workload management
• Application Acceleration Kit: IP cores, reference designs, command-line tools, API, JTAG interface card
• Scientific libraries: AMD Core Math Library (ACML)
• Shared memory access: Shmem, Global Arrays, OpenMP
• 3rd-party tools: Fortran 77/90/95, HPF, C/C++, Java, Etnus TotalView
• Communications libraries: MPI 1.2
• The Cray XD1 is standards-based for ease of programming: Linux, x86, MPI

Cray XD1's FPGA Architecture

The Rebirth of Co-processing
• 1976: Intel 8086 processor with 8087 coprocessor
• 2004: AMD Opteron with Xilinx Virtex-II Pro FPGA

Application Acceleration
• Reconfigurable computing: the FPGA is tightly coupled to the Opteron through the RAP and acts like a programmable coprocessor performing vector operations
• Well-suited for searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, and random number generation
• Superlinear speedup for key algorithms

[Diagram: four chassis configurations, mixing one or two RapidArray switches]

Application Acceleration FPGA
• Fine-grained parallelism applied for 100x potential speedup: the compute processor hands a data set to the FPGA, which applies the inner loop ("do for each array element ... end do") across the data in parallel

Compute Blade
[Diagram: compute blade with Opteron processor, DDR 400 DRAM, RapidArray processor, and the Application Acceleration FPGA on an expansion module]

Interconnections
[Diagram: the expansion module's FPGA sits between neighbor modules, linked by HyperTransport (HT) and RapidArray Transport (RT) into the RapidArray fabric]

Module Detail
[Diagram: the RAP bridges the compute module's 3.2 GB/s HyperTransport links and 2 GB/s RapidArray fabric links; the acceleration FPGA attaches to four QDR II SRAM banks and to the neighboring compute modules at 2 GB/s]

Virtex II Pro FPGA
• Virtex-II series fabric with multi-gigabit transceivers (Rocket I/O MGTs), XC2VP30 - XC2VP50
• 422 MHz maximum clock rate
• 30,000 - 53,000 LEs
• 3 - 5 million "system gates"
• 136 - 232 Block RAMs
• 136 - 232 18x18 multipliers
• 300 MHz PowerPC
• 8 - 16 Rocket I/O MGTs

Virtex II Family Logic Blocks
• 1 LE = LUT + register; 1 slice = 2 LEs; 1 CLB = 4 slices
• Each slice also provides RAM16/SRL16 LUT modes and carry (CY) logic

XC2VP30-6 Examples

  Function                    f (MHz)   LEs    BRAM   Mult.   Number possible
  64-bit adder                  194       66     0      0        450
  64-bit accumulator            198       64     0      0        450
  18x18 multiplier              259       88     0      1        136
  SP FP multiplier              188      252     0      4         34
  1024 FFT (16-bit complex)     140     5526    22     12          5
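As a worked check of the "Number possible" column: a single-precision floating-point multiplier costs 252 LEs and 4 of the 18x18 multiplier blocks, and an XC2VP30 has 30,816 LEs and 136 multiplier blocks (see the Module Variants slide below), so the multiplier blocks are the binding resource:

$$ N_{\mathrm{SP\,FP}} = \min\!\left(\left\lfloor \tfrac{30816}{252} \right\rfloor,\ \left\lfloor \tfrac{136}{4} \right\rfloor\right) = \min(122,\ 34) = 34 $$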
Module Variants
A variety of Application Acceleration variants can be manufactured by populating different pin-compatible FPGAs and QDR II RAMs.

  FPGA       Speed   Logic Elements   PowerPC   18x18 Multipliers
  XC2VP30    -6          30,816          2            136
  XC2VP40    -6          43,632          2            192
  XC2VP50    -7          53,136          2            232

  RAM                   Speed     Dimensions   Quantity   Module Memory
  K7R163682             200 MHz   512K x 36       4          8 MByte
  K7R323682             200 MHz   1M x 36         4         16 MByte
  K7R643682 (future)    200 MHz   2M x 36         4         32 MByte

Processor to FPGA
• Since the acceleration FPGA is connected to the local processing node through its HyperTransport I/O bus, the FPGA can be accessed directly using reads and writes.
• Additionally, a node can transfer large blocks of data to and from the acceleration FPGA using a simple DMA engine in the FPGA's RapidArray Transport core.

FPGA to Processor
• The acceleration FPGA can also directly access the memory of a processor. Read and write requests can be performed in bursts of up to 64 bytes.
• The acceleration FPGA can access processor memory without interrupting the processor; memory coherency is maintained by the processor.

FPGA to Neighbor
• Each acceleration FPGA is connected to its neighbors in a ring (SMP 1 through SMP 6) using the Virtex-II Pro MGT (Rocket I/O) transceivers.
• The XC2VP40 FPGAs provide a 2 GB/s link to each neighbor FPGA; the XC2VP50 FPGAs provide a 3 GB/s link.

Cray XD1 FPGA Programming

Hard, but it could be worse!

Application Acceleration Interfaces
• User logic sits between the RapidArray Transport core (TX/RX to the RAP) and a QDR RAM interface core driving four QDR II SRAMs (ADDR(20:0), D(35:0), Q(35:0) each)
• XC2VP30-50 running at up to 200 MHz
• 4 QDR II RAMs with over 400 HSTL-I I/Os at 200 MHz DDR (400 MTransfers/s)
• 16-bit simplified HyperTransport interface at 400 MHz DDR (800 MTransfers/s)
• The QDR and HT interfaces take up less than 20% of an XC2VP30; the rest is available for user applications

FPGA Linux API
Administration commands
• fpga_open - allocate and open FPGA
• fpga_close - close allocated FPGA
• fpga_load - load binary into FPGA
Operation commands
• fpga_start - start FPGA (release from reset)
• fpga_reset - soft-reset the FPGA
Mapping commands
• fpga_set_ftrmem - map application virtual address to allow access by the FPGA
• fpga_memmap - map FPGA RAM into application virtual space
Control commands
• fpga_wrt_appif_val - write data into the application interface (register space)
• fpga_rd_appif_val - read data from the application interface (register space)
Status commands
• fpga_status - get status of FPGA
DMA commands
• fpga_put - send data to the FPGA
• fpga_get - receive data from the FPGA
Interrupt/blocking commands
• fpga_intwait - block the process waiting for an FPGA interrupt
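A host program drives the FPGA through the lifecycle above: open, load, map memory, start, wait, close. The sketch below is schematic only: the call names come from the slide, but the argument lists, the header name, and the bitstream file are assumptions; the Application Acceleration Kit documents the real prototypes.

```c
/* Schematic host-side lifecycle for the FPGA Linux API.
 * Call names are from the slide; argument lists, the header name,
 * and the bitstream filename are simplified assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include "fpga_api.h"   /* hypothetical header from the Acceleration Kit */

#define BUF_WORDS 4096

int main(void)
{
    unsigned *buf = malloc(BUF_WORDS * sizeof *buf);
    int fpga;

    fpga = fpga_open("/dev/ufp0");                        /* allocate and open */
    fpga_load(fpga, "mersenne_twister.bin");              /* load bitstream (hypothetical file) */
    fpga_set_ftrmem(fpga, buf, BUF_WORDS * sizeof *buf);  /* let the FPGA see this buffer */
    fpga_start(fpga);                                     /* release user logic from reset */

    fpga_intwait(fpga);                                   /* block until the FPGA interrupts */
    printf("first result word: %u\n", buf[0]);

    fpga_close(fpga);
    free(buf);
    return 0;
}
```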
Additional High Level Tools
• High-level flow: MATLAB/Simulink (The MathWorks) feeds Xilinx System Generator for DSP; SystemC and ANSI C/C++ (Adelante, Celoxica, Forte Design Systems, Mentor Graphics, Prosilog, Synopsys) plus the DSPlogic RCIO library feed C synthesis; both produce VHDL/Verilog
• Standard flow: VHDL/Verilog synthesis (Mentor Graphics, Synopsys, Synplicity, Xilinx) produces a gate-level EDIF file, which place and route turns into the binary file for the FPGA
• Example: the C function int mask(int a, int m) { return a & m; } maps to the VHDL process (a, m) is begin z <= a and m; end process;

Standard Development Flow
• Write and simulate VHDL, Verilog, or C (ModelSim), then synthesize (Xilinx ISE) and implement
• Merge the result with the supplied cores (RAP interface, QDR RAM interface, DSPlogic RCIO core) and metadata into a binary file
• Download to the XD1 and load/run on the acceleration FPGA from the command line or an application
• Verify on target with Xilinx ChipScope and ModelSim

On Target Debugging
• Integrated Logic Analyzer (ILA) blocks are used to capture and store internal logic events based on user-defined triggers.
• Trapped events can then be read out over JTAG (OctigaBay JTAG I/O card with a Xilinx Parallel Cable III/IV or MultiLINX) and displayed on a PC by the Xilinx ChipScope software.

FORTRAN to VHDL ideas

      program test
      integer xyz
      integer a, b, c, n(1000), temp(1000)
      do i = 1, 1000
         n(i) = xyz(a, b, c, temp)
      end do
      end

The variable temp is allocated once, outside the loop calling the function. This is efficient FORTRAN code because you only allocate the space once. With an FPGA design you would instead want to allocate the temporary space on the FPGA.

FORTRAN to VHDL ideas
Convert real variables to integers where possible. Original, with a real scale factor:

      program test
      integer xyz
      integer a, b, n(1000)
      real delta
      delta = 0.01
      do i = 1, 1000
         n(i) = xyz(a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a .gt. b*delta) then
         xyz = a
      else
         xyz = b
      endif
      return
      end

Integer version, storing 1/delta so the comparison a > b*delta becomes a*delta > b:

      program test
      integer xyz
      integer a, b, n(1000)
      integer delta
      delta = 100   ! 1/delta
      do i = 1, 1000
         n(i) = xyz(a, b, delta)
      end do
      end

      function xyz (a, b, delta)
      if (a*delta .gt. b) then
         xyz = a
      else
         xyz = b
      endif
      return
      end

FORTRAN to VHDL ideas
Move code that doesn't change outside the function; maybe make multiple cores, one for each mode. Here the test of mode is loop-invariant but is re-evaluated in every iteration:

      function xyz (i, j, mode)
      integer i, j, mode
      do i = 1, 1000
         do j = 1, 1000
            if (mode .eq. 2) then
               if (a(i,j,k) .gt. b(i,j,k)) then
                  xyz = a(i,j,k)
               else
                  xyz = b(i,j,k)
               end if
            else
               xyz = 0
            end if
         end do
      end do
      return
      end

Mixing FPGAs and MPI
• It gets a bit tricky mixing FPGAs with an MPI code. The XD1 has two or four Opterons per node but only one FPGA, and only one Opteron is able to grab the FPGA at a time.
[Diagram: jobs scheduled across nodes; within each node the single FPGA belongs to whichever job's CPU claimed it first, so other ranks find it not available]
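One simple way an MPI code can cope: every rank tries to open its node's FPGA, and only the rank that succeeds takes the FPGA path. A hedged sketch, reusing the same assumed API shapes as the earlier lifecycle example (including the assumption that fpga_open returns a negative value when the FPGA is already claimed):

```c
/* Arbitrating a shared FPGA among MPI ranks on a node: whoever
 * opens it first uses it; everyone else takes the CPU path.
 * API shapes are assumptions, as in the earlier sketch. */
#include <mpi.h>
#include <stdio.h>
#include "fpga_api.h"   /* hypothetical header from the Acceleration Kit */

int main(int argc, char **argv)
{
    int rank, have_fpga, fpga;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only one Opteron per node can grab the FPGA at a time,
       so fpga_open succeeds for at most one rank per node. */
    fpga = fpga_open("/dev/ufp0");
    have_fpga = (fpga >= 0);

    if (have_fpga) {
        fpga_load(fpga, "kernel.bin");   /* hypothetical bitstream */
        fpga_start(fpga);
        /* ... offload this rank's share of the work ... */
        fpga_close(fpga);
    } else {
        /* ... compute the same kernel on the Opteron instead ... */
    }

    printf("rank %d: %s path\n", rank, have_fpga ? "FPGA" : "CPU");
    MPI_Finalize();
    return 0;
}
```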
Cray XD1 FPGA Examples

Random Number Example
• The FPGA implements the "Mersenne Twister" RNG algorithm often used for Monte Carlo analysis. The algorithm generates integers with a uniform distribution and won't repeat for 2^19937 - 1 values.
• The FPGA automatically transfers the generated numbers into two buffers located in the processor's local memory.
• The processor application alternately reads the pseudo-random numbers from the two buffers. As the processor marks a buffer 'empty', the FPGA refills it with new numbers.

MTA Example
• Load and start a.out in the Opteron's memory
• Call FPGA_OPEN, then FPGA_LOAD
• Call FPGA_SET_FTRMEM to allocate buffers A and B in processor memory
• Call FPGA_START; the FPGA checks the buffer flags, generates random numbers, and toggles a buffer's flag when it has been filled
• The Opteron consumes the random numbers; Opteron and FPGA run asynchronously
• Call FPGA_CLOSE; the Opteron exits

Random Number Results

  Source            Platform                     Speed (32-bit integers/s)   Size
  Original C code   2.2 GHz Opteron              ~101 million                N/A
  VHDL code         FPGA (XC2VP30-6) @ 200 MHz   ~319 million                ~25% of chip (includes RapidArray core)

• The FPGA provides 3x the performance of the fastest available Opteron.
• The algorithm takes up a small portion of the smallest FPGA.
• Performance is limited by the speed at which numbers can be written into processor memory, not by the FPGA logic. The logic could easily produce 1.6 billion integers/second by increasing parallelism.

Ising Model with Monte Carlo
• Code was developed by Martin Siegert at Simon Fraser University
• Uses the MTA random number generation design
• Runs 2.5 times faster with the FPGA
• Should run faster still with the newest MTA design, which returns floating-point random numbers instead of integers
• Tar file available for the Cray XD1

FFT design from DSPlogic
• Code was developed by Mike Babst and Rod Swift at DSPlogic
• Uses 16-bit fixed-point data as input and 32-bit fixed-point as output, which yields an accuracy similar to the single-precision results posted at the FFTW web site (www.fftw.org)
• A one-dimensional complex FFT of length 65536 on the FPGA is about 5 times faster than on the 2.2 GHz Opteron using FFTW; packing the data more tightly can double the performance to 10x
• Performance depends on the size of the data

Smith-Waterman
• Code was developed internally by Cray
• CUPS = cell updates per second; Rate = FPGA frequency * cells/clock * number of S-W processing elements
• Current: 80 MHz * 1 * 32 = 2.6 billion CUPS, using 60% of the chip
• Optimization: 100 MHz * 1 * 50 = 5 billion CUPS
• Virtex-4 FPGA: 100 MHz * 1 * 150 = 15 billion CUPS
• Opteron using SSEARCH34 = 100 million CUPS
• The current version runs 25 times faster than a 2.2 GHz Opteron
• The nucleotide (4-bit) version is running in house; the amino acid (8-bit) version is just finished and is being incorporated into SSEARCH to make it easier to use
• Smith-Waterman on the FPGA is about 10 times faster than BLAST on the Opteron
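For reference, the update each processing element performs once per clock is the standard Smith-Waterman cell recurrence. The slides don't spell it out, so the following minimal C version uses illustrative scoring values (a linear gap penalty; not the Cray design's parameters):

```c
/* Standard Smith-Waterman cell update (linear gap penalty) -- the
 * operation each FPGA processing element performs once per clock.
 * GAP/MATCH/MISS values are illustrative, not from the Cray core. */
#include <stdio.h>

#define GAP    2   /* linear gap penalty  */
#define MATCH  3   /* substitution scores */
#define MISS  -1

static int max4(int a, int b, int c, int d)
{
    int m = a > b ? a : b;
    m = m > c ? m : c;
    return m > d ? m : d;
}

/* diag, up, left are H(i-1,j-1), H(i-1,j), H(i,j-1) */
static int sw_cell(char x, char y, int diag, int up, int left)
{
    int s = (x == y) ? MATCH : MISS;
    return max4(0, diag + s, up - GAP, left - GAP);
}

int main(void)
{
    /* one cell comparing 'A' against 'A' with empty neighbors */
    printf("H = %d\n", sw_cell('A', 'A', 0, 0, 0));  /* prints H = 3 */
    return 0;
}
```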
Los Alamos Traffic Simulation
• Code was developed by Justin Tripp, Henning Mortveit, Anders Hansson, and Maya Gokhale at Los Alamos National Laboratory
• Uses the FPGA for straight road sections and the Opteron for everything else
• Runs 34.4 times faster with the FPGA relative to a 2.2 GHz Opteron
• System integration issues must be optimized to exploit this speedup in the overall simulation

Other XD1 FPGA Projects
• A financial company is using the random number generation core for a Monte Carlo simulation
• Seismic companies are using FPGAs for FFTs and convolutions
• Pharmaceutical companies are using FPGAs for searching and sorting
• NCSA is working on a civil engineering "dirt" code
• The University of Illinois is working on porting part of NAMD to an FPGA

Other Useful FPGA Designs
• JPEG2000, developed by Barco Silex, currently runs on Virtex FPGAs; working with them on a real-time, high-resolution compression project
• 64-bit floating-point matrix multiplication by Ling Zhuo and Viktor Prasanna at the University of Southern California: 8.3 GFLOPS on an XC2VP125, compared to 5.5 GFLOPS on a 3.2 GHz Xeon
• Finite-Difference Time-Domain (FDTD) by Ryan Schneider, Laurence Turner, and Michal Okoniewski at the University of Calgary

Questions