Scalable Framework for Heterogeneous Clustering of Commodity FPGAs
Jeremy Espenshade
May 7, 2009

High Performance Computing
Performing difficult computations in an acceptable time period.
Example areas of interest:
▪ Cryptanalysis
▪ Bioinformatics
▪ Molecular Dynamics
▪ Image Processing

Specialized Hardware
Architectural differences can provide orders-of-magnitude speedup on suitable applications:
▪ GPGPUs: visualization, linear algebra, etc. through data parallelism
▪ Cell Processor: video encoding, linear algebra, etc. through data parallelism
▪ FPGAs: DSP, image processing, cryptography, etc. through bit-level parallelism

Outline
▪ Background
  – Cluster Computing
  – FPGA Technology
  – Commercial FPGA Supercomputers
▪ Proposed Framework
  – Requirements and Motivation
  – Hardware and Software Organization
  – HW/SW Interaction
▪ Application Case Studies
  – DES Cryptanalysis: step-by-step design flow, demonstration FPGA cluster, performance comparison
  – Matrix Multiplication: platform performance characterization
▪ Conclusion and Future Work

Cluster Computing
Historical monolithic supercomputers have given way to networks of smaller computers.
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray
Middleware technologies have made cluster construction and programming easy and efficient:
▪ Message Passing – MPI, PVM
▪ Shared Memory – OpenMP
▪ Remote Procedure Call – Java RMI, CORBA
▪ Grid Organization – Condor, Globus

MPI
De facto standard API for inter-process communication over distributed memory.
Language-independent library with point-to-point and collective operations:
▪ MPI_Send / MPI_Recv
▪ MPI_Bcast / MPI_Reduce
▪ MPI_Scatter / MPI_Gather
▪ MPI_Barrier
OpenMPI is an open-source implementation in native C.
MPI creates a "virtual topology" of the computation environment with process ranks, for example:
▪ Process Tree: MPI_Send(data, child); MPI_Recv(data, parent)
▪ Process Ring: MPI_Send(data, (myrank + 1) % size); MPI_Recv(data, (myrank - 1) % size)
▪ Master/Slave: MPI_Bcast(data, to slaves); MPI_Reduce(data, to master)
[Diagrams: tree, ring, and master/slave topologies over ranks R0–R3.]
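For illustration, a minimal ring-exchange sketch in C: each rank passes its value one hop around the ring. MPI_Sendrecv is used here rather than the paired blocking send and receive shown above, so the exchange cannot deadlock.

    /* Illustrative sketch: pass each rank's value one hop around a ring. */
    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;          /* right neighbor in the ring */
        int prev = (rank + size - 1) % size;   /* left neighbor (avoids a negative modulo) */
        int send_val = rank, recv_val = -1;

        /* Send to the next rank and receive from the previous rank in one call. */
        MPI_Sendrecv(&send_val, 1, MPI_INT, next, 0,
                     &recv_val, 1, MPI_INT, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("Rank %d received %d from rank %d\n", rank, recv_val, prev);
        MPI_Finalize();
        return 0;
    }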
Field-Programmable Gate Arrays
▪ Devices whose function can be specified through a hardware description language (VHDL/Verilog)
▪ Slower than custom ASICs, but much more flexible
▪ Large degree of fine-grained parallelism
▪ Built from basic logic blocks, programmable interconnects, and configurable I/O blocks

FPGAs realize computation over a network of CLBs. Each CLB contains:
▪ Eight 6-input look-up tables (LUTs)
▪ Eight flip-flops
▪ Control multiplexers
▪ Arithmetic and carry logic
LUTs can be preconfigured to implement any 6-input logical function.

Bit-Level Parallelism
'A = B ⊕ C ⊕ D ⊕ E ⊕ F ⊕ G' can be calculated in a single cycle, even over large operand lengths.

Hybrid FPGAs
– Hardwired PowerPC cores
– 2-15k CLBs
– Arithmetic DSP slices
– Ethernet MAC units

Commercial FPGA Supercomputers
Cray XT5h
▪ AMD Opteron and Xilinx Virtex-4 FPGAs on a single blade
▪ Cray SeaStar2+™ interconnect
▪ Custom API for RPUs
▪ http://www.cray.com/Assets/PDF/products/xt/CrayXT5hBrochure.pdf
SRC-6/7
▪ Altera Stratix II FPGAs and Intel Xeon or AMD Opteron processors
▪ SRC Hi-Bar® interface
▪ Carte® programming environment
▪ http://www.srccomp.com/products/src7_hardware.asp

Motivating Concepts
▪ FPGAs have great performance potential, especially for applications with high bit-, instruction-, and data-level parallelism.
▪ Many FPGAs working together would allow even better exploitation of parallelism:
  – Increased data parallelism through multi-node partitioning
  – Task parallelism through independent heterogeneous nodes
▪ FPGA cluster frameworks are currently limited to proprietary supercomputers:
  – Commodity clusters will reduce the barrier to entry and promote FPGA integration and use.
  – Occam's Razor applies to programming environments: simple is good.

Framework Requirements
▪ Easily programmable
  – Common API for both parallel programming and hardware access
  – Modular hardware supported without modification
  – Hardware/software design independence: interface access is independent of application and implementation
▪ Minimal framework overhead: the system should exhibit acceptable performance
▪ Commodity technologies: Ethernet networking, Linux OS, open software
▪ Scalable and flexible
  – Additional FPGAs easily integrated
  – Heterogeneous nodes seamlessly supported
▪ Extensible: future improvements possible without harsh restrictions

Hardware Organization
[Diagram: FPGA nodes, each hosting one or more hardware units, connected over a 10/100/1000T Ethernet network; each node has Compact Flash storage and a hard-wired Ethernet MAC on chip.]

Software Organization
▪ Embedded Linux OS
▪ Root file system on Compact Flash (256 MB): BusyBox utilities, minimal libraries
▪ OpenMPI
▪ OpenSSH: certificate-based security for shell access
▪ OpenSSL: TCP/IP security
▪ Various support libraries (zlib, etc.)

Hardware Device Access
▪ Special device files in /dev
  – Hardware devices mapped to character devices
  – Accessed with the fopen, fwrite, fread, and fclose commands
  – Dynamic major number reported in /proc/devices
▪ FILE * AllocateResource(char * base_name)
  – Lock-based arbitration

HW/SW Interaction
[Diagram: the user MPI application (MPI_Init, MPI_Send, MPI_Recv, MPI_Finalize, fopen, fwrite, fread, fclose) sits above the kernel driver (open, write, read, release), which talks over the PLB to a constant hardware interface (interrupt controller, write FIFO, read FIFO, state machines) wrapping the application-specific hardware unit. Opening the device enables interrupts and resets the FIFOs.]

Device Drivers
▪ Xilinx provides a minimal set of drivers: SysACE Compact Flash, Ethernet MAC (MII/GMII), etc.
▪ A custom FIFO driver provides the hardware abstraction:
  – Boot time: registers the platform device and character driver, maps the physical memory address and IRQ into virtual space, and constructs address offsets for the registers.
  – Device open: resets the hardware accelerator and FIFOs and enables interrupts.
  – Interrupt handling: a read waits on data in the read FIFO; the hardware generates an interrupt when data arrives.
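An illustrative sketch of this device-file access pattern from user space follows. The device base name "des_unit", the unbuffered setvbuf call, and the one-word-in/two-words-out exchange are borrowed from the DES case study later in the talk; the configuration value is a placeholder.

    #include <stdio.h>
    #include "fpga_util.h"   /* declares AllocateResource() */

    int main(void) {
        int config = 0;      /* word pushed into the hardware write FIFO (placeholder value) */
        int result[2];       /* words returned through the read FIFO */

        /* AllocateResource finds a free /dev node matching the base name and
           acquires its lock before returning a stdio stream for it. */
        FILE *dev = AllocateResource("des_unit");
        if (dev == NULL)
            return 1;

        /* Disable stdio buffering so every fwrite/fread maps directly onto a
           driver write/read call against the FIFOs. */
        setvbuf(dev, NULL, _IONBF, 0);

        fwrite(&config, sizeof(int), 1, dev);   /* hand work to the accelerator */
        fread(result, sizeof(int), 2, dev);     /* blocks until the read-FIFO interrupt fires */

        printf("Result: 0x%08x 0x%08x\n", result[0], result[1]);
        fclose(dev);                            /* releases the device and its lock */
        return 0;
    }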
Heterogeneous Nodes
[Diagram: five FPGAs, each with Compact Flash and an Ethernet MAC, hosting processes P0–P9 across one or more hardware units.]
▪ Each hardware unit can behave as an independent node.
▪ Hardware units can perform different functions, or similar functions at different speeds.
▪ Each FPGA can host as many units as fit in the device.
▪ Configurations with multiple hardware units per process, or other exotic setups, are also inherently supported.

Case Study: DES Cryptanalysis
▪ The Data Encryption Standard (DES) is a widely used and long-standing block cipher.
▪ Due to insufficient cryptographic strength resulting from its 56-bit key, DES has been successfully broken by brute-force exhaustive key searches in the past decade.
▪ Past approaches:
  – Distributed computing (a la Folding@Home)
  – Custom DES ASICs in a cluster
  – A hybrid of the two: a custom system of 120 Spartan-IIe FPGAs

Independent HW/SW Design
▪ Hardware design: implement the algorithm as a hardware accelerator on the embedded platform and search keys as fast as possible.
▪ Software design: an MPI application partitions the key space and coordinates results.
▪ System integration and deployment: cross compilation.

Hardware Design: Base System
▪ Xilinx Base System Builder generates the PowerPC, DDR, Compact Flash, and Ethernet interfaces.
▪ The Xilinx Peripheral Creation Wizard generates the PLB slave interface: read/write FIFOs, interrupt controller, software reset, and software-accessible registers.

Hardware Design: Key Guesser
▪ State-machine-centric design: FIFO access, interface/accelerator interaction, interrupt generation.
▪ The key guesser top level contains 2^X DES encryption engines.
▪ Each engine is an 18-stage pipeline (16 rounds plus 1 input and 1 output stage), initialized with a known plaintext and ciphertext.
▪ The high 24 bits of the key are expected as input; each engine checks the lowest (32 - X) bits, with its middle X bits assigned by the component number:
  | High 24 | Mid X | Low 32 - X |
▪ If the key is found it is returned; otherwise zero is returned after the key space has been checked.
[Diagram: the FIFO interface and state machine feed the plaintext, ciphertext, and high 24 key bits to eight DES Encrypt engines (000–111); outputs are a correct-guess flag, a search-complete flag, and the result key.]

Simulation and Synthesis
▪ Xilinx ISE: VHDL model of the interface and DES guessing-unit interaction.
▪ Simulate and synthesize; timing and resource utilization are reported for each configuration.

Interface State Machine
▪ For each configuration parameter: receive the configuration (read request / read acknowledge).
▪ While the key is not found and the search continues: start searching and wait for guessing results.
▪ For each key half: return the result or a failure notification (write request / write acknowledge) and generate an interrupt.

XPS System Assembly
▪ Interrupt controller connection
▪ Processor Local Bus (PLB) connection
▪ Physical address assignment
[Diagram: the PowerPC and bus arbiter on the PLB together with DDR RAM, the Ethernet MAC, Compact Flash, and the user-logic hardware accelerator units HAU_0, HAU_1, ..., HAU_N.]

Linux Kernel and Device Tree
▪ The Linux kernel build targets a DTS file created as a Xilinx library.
▪ The driver extracts memory addresses, IRQs, and name information from the device tree.
▪ Example unit description:

    plb_des_1: plb-des@c9c20000 {
        compatible = "xlnx,plb-des-1.00.a";
        interrupt-parent = <&xps_intc_0>;
        interrupts = < 1 2 >;
        reg = < 0xc9c20000 0x10000 >;
        xlnx,family = "virtex5";
        xlnx,include-dphase-timer = <0x1>;
    };

Deployment
▪ Build the Linux kernel targeting the specific platform and DTS, including the device driver for the hardware accelerators:
    make ARCH=powerpc CONFIG_FIFO_DRIVER=y
  This generates the .ELF programming file.
▪ Bit-stream generation: synthesize, map, place, and route the design to generate the .BIT configuration file.
▪ Create a System ACE file: merge the .BIT and .ELF into an .ACE file.
▪ Place the .ACE file onto the Compact Flash and boot.

Software Design: Master/Slave Structure
▪ The master coordinates a dynamic work queue; slaves wait for work or a stop condition.
▪ Master program flow:
  – Send a 24-bit key space indicator (KSI) to each slave.
  – Wait for a response:
    ▪ If the key was found, break out, report the result, and distribute stop conditions to all slaves.
    ▪ If the key was not found, send the next KSI to that slave.
▪ Slave program flow:
  – Allocate and initialize a hardware unit.
  – Wait for work or a stop condition.
  – When new work arrives, send the KSI to the hardware and return the result.

    #include <stdio.h>
    #include "mpi.h"
    #include "fpga_util.h"

    int main(int argc, char *argv[]) {
        FILE *my_dev;
        int rank;
        int key_space_indicator = 0;   /* next KSI from the master's work queue */
        int slave_rank = 1;            /* rank of the slave being addressed */
        int KSI;                       /* KSI received by a slave */
        int result[3];                 /* result words exchanged between slave and master */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {               /* Master process */
            /* For each slave: seed it with an initial KSI */
            MPI_Send(&key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
            /* While work remains in the queue and the key has not been found:
               collect a result and hand the responding slave the next KSI */
            MPI_Recv(result, 3, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&key_space_indicator, 1, MPI_INT, slave_rank, 0, MPI_COMM_WORLD);
            /* Once the key is found, report its two 32-bit halves */
            printf("The answer is 0x%08x%08x!\n", result[0], result[1]);
        } else {                       /* Slave processes */
            my_dev = AllocateResource("des_unit");
            setvbuf(my_dev, NULL, _IONBF, 0);
            /* Until a stop condition is received */
            MPI_Recv(&KSI, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            fwrite(&KSI, sizeof(int), 1, my_dev);    /* push the KSI into the write FIFO */
            fread(result, sizeof(int), 2, my_dev);   /* blocking read of the two result words */
            MPI_Send(result, 3, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }

Demonstration FPGA Cluster
▪ Four Virtex-5 devices: PowerPC 440 processor @ 400 MHz; two ML507 and two ML510 boards; 256/512 MB DDR2.
▪ One Virtex-4 device: PowerPC 405 processor @ 300 MHz; ML410 board; 256 MB DDR2.
▪ 100 MHz Processor Local Bus (PLB).
▪ RIT Ethernet network.

DES Resource Usage
▪ ML510: 3 units each; ML507: 2 units each; ML410: 1 unit.
▪ 11 units total over 5 FPGAs.
[Chart: total keys guessed per second (millions) versus number of hardware accelerators, ideal scaling versus actual scaling.]

DES Performance
▪ One hardware DES unit can guess 8 keys/cycle x 100 MHz = 800 million keys/sec.
▪ 11 distributed hardware units give 8800 M keys/sec ideally.
▪ Actual performance is 8548.55 M keys/sec, or 97.14% of ideal.
▪ A DES search application with the same program structure was developed for a cluster of 2.33 GHz Xeon processors:
  – Single-node performance is 0.735 M keys/sec.
  – It scales to 7.722 M keys/sec across 11 cores at 95.4% efficiency.

System Performance Comparison
System            Performance      Speedup   Cost (approx.)   Price/Performance
Commodity FPGAs   8548.552 MK/s    1107x     ~$11K            371x
Cray XD1          7200 MK/s        930x      ~$100K           42x
SRC-6             4000 MK/s        518x      N/A              N/A
Xeon 2.33 GHz     7.722 MK/s       1x        ~$4K             1x
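As a back-of-envelope sketch, the short program below only recomputes the figures quoted above and extends them to the time one 24-bit work unit takes and the worst-case time to exhaust the 56-bit key space at the measured rate.

    #include <stdio.h>

    int main(void) {
        const double clock_hz      = 100e6;                /* PLB / accelerator clock */
        const double engines       = 8.0;                  /* DES pipelines per hardware unit */
        const double unit_rate     = engines * clock_hz;   /* keys/s per hardware unit */
        const double units         = 11.0;                 /* units in the demonstration cluster */
        const double measured_rate = 8548.55e6;            /* measured aggregate keys/s */
        const double keys_per_ksi  = 4294967296.0;         /* 2^32 keys covered by one KSI */
        const double total_keys    = 72057594037927936.0;  /* 2^56 possible DES keys */

        printf("Per-unit rate     : %.0f M keys/s\n", unit_rate / 1e6);
        printf("Ideal cluster rate: %.0f M keys/s\n", units * unit_rate / 1e6);
        printf("Efficiency        : %.2f %%\n", 100.0 * measured_rate / (units * unit_rate));
        printf("Time per KSI      : %.2f s on one unit\n", keys_per_ksi / unit_rate);
        printf("Worst-case search : %.1f days at the measured rate\n",
               total_keys / measured_rate / 86400.0);
        return 0;
    }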
Case Study: Matrix Multiplication
▪ A standard test of computational capability is matrix multiplication: A * B = C.
▪ It is highly data parallel: each index of the result matrix can be computed independently as C[i][j] = A[i][] * B[][j], the dot product of row i of A with column j of B.
▪ Hardware design: store multiple rows of A and compute a running dot product for each row as a column of B is received.
▪ Software design: statically partition the work across the available FPGAs and aggregate the results (a sketch follows below).
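An illustrative sketch of the static partitioning, assuming the matrix dimension divides evenly among the ranks, B held transposed so that a block of its columns is contiguous, and a plain C loop standing in for the FPGA accelerator:

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    #define N 256                      /* matrix dimension (assumed divisible by the rank count) */

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int cols = N / size;                                 /* columns of B (and C) per rank */
        float *A   = malloc((size_t)N * N * sizeof *A);      /* every rank holds all of A */
        float *Bt  = rank ? NULL : malloc((size_t)N * N * sizeof *Bt);  /* B transposed, master only */
        float *Ct  = rank ? NULL : malloc((size_t)N * N * sizeof *Ct);  /* C transposed, master only */
        float *myB = malloc((size_t)cols * N * sizeof *myB);
        float *myC = malloc((size_t)cols * N * sizeof *myC);

        if (rank == 0)                                       /* fill A and B with test data */
            for (int i = 0; i < N * N; i++) { A[i] = 1.0f; Bt[i] = 2.0f; }

        /* Static partitioning: broadcast A, scatter equal column blocks of B. */
        MPI_Bcast(A, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Scatter(Bt, cols * N, MPI_FLOAT, myB, cols * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* Each rank computes its block of C's columns; on the cluster this loop is
           replaced by streaming the columns of B through the hardware unit. */
        for (int c = 0; c < cols; c++)
            for (int i = 0; i < N; i++) {
                float acc = 0.0f;
                for (int k = 0; k < N; k++)
                    acc += A[i * N + k] * myB[c * N + k];
                myC[c * N + i] = acc;
            }

        /* Aggregate the result blocks back on the master. */
        MPI_Gather(myC, cols * N, MPI_FLOAT, Ct, cols * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("C[0][0] = %.1f\n", Ct[0]);
        MPI_Finalize();
        return 0;
    }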
Single-Node Matrix Multiplication Performance
[Chart: speedup over the Xeon versus matrix dimension for the ML410 (32 MACs), ML507 (32 MACs), ML510 (32 MACs), ML510 (64 MACs), and the 2.33 GHz Xeon.]
Single-node execution time (s) by matrix dimension:
                256      512      1024     2048
ML410           0.162    0.957    6.562    47.139
ML507           0.071    0.470    3.427    26.015
ML510 (32)      0.071    0.471    3.410    26.403
ML510 (64)      0.043    0.269    1.875    14.269
Xeon 2.33       0.031    0.258    1.956    13.405

▪ The FPGAs fare poorly in comparison to the Xeon GPP.
▪ The largest FPGA with the most available hardware is comparable to a single core, but with poor price/performance.
▪ With greater concurrency, the FPGA should have performed better. Why didn't it?

Effective bandwidth (Mbps) by matrix size:
                256x256   512x512   1024x1024   2048x2048
ML410           16.15     19.73     21.73       23.49
ML507           37.03     40.18     41.61       42.56
ML510 (32)      36.92     40.08     41.82       41.94
ML510 (64)      36.46     38.99     40.27       39.98

▪ Bandwidth is the limiting factor: data is communicated word by word over a 32-bit arbitrated bus (PLB).
▪ Processor-independent DMA is required for improved performance.

Multi-FPGA Scaling
[Charts: scaling factor and execution time versus number of FPGAs for the 256x256 through 2048x2048 problems, over the configurations 2xML507; 2xML507 + ML510; 2xML507 + 2xML510; and 2xML507 + 2xML510 + ML410.]
▪ The problem scales well across the Virtex-5 devices: 83% efficiency across 4 FPGAs.
▪ Adding the Virtex-4 causes worse performance because of the static partitioning.

Conclusion
▪ A scalable framework for the clustering of FPGAs was developed and demonstrated.
▪ It supports flexible application development with decoupled HW/SW co-design: a standard MPI programming interface and simple hardware-interaction abstractions/APIs.
▪ Commodity hardware technologies and open software allow a low barrier to entry for further research and development.
▪ Application case studies: well-suited applications like cryptanalysis perform admirably, with performance improvements of > 1100x and price/performance improvements of > 370x; current bandwidth limitations hold back other applications.

Future Work
▪ Framework infrastructure
  – Hardware communication: DMA data transfer from the PowerPC to the hardware, fast HW-to-HW interconnection on a single FPGA, and dynamic reconfiguration of hardware accelerators.
  – Cluster management: job submission and resource matching with Condor or similar; monitoring with Ganglia or similar.
  – Robustness: fault tolerance and correct hardware-usage enforcement.
▪ New applications
  – A performance study with more inter-process communication; image processing could be a good place to start.
  – Neural simulation platform (Dmitri Yudanov, CE Dept).
  – Expressed interest from the EE and Astrophysics departments.
▪ Design flow improvements: an integrated tool chain and improved deployment procedure.

Thank you for listening.
Contact: Jeremy Espenshade, jke2553@rit.edu, RIT Computer Engineering Hardware Design Lab.
Primary Advisor: Dr. Marcin Lukowiak

References
▪ Espenshade, Jeremy. Scalable Framework for Heterogeneous Clustering of Commodity FPGAs. Master's thesis, Rochester Institute of Technology, 2009.
▪ Cray Inc. Cray XD1 Supercomputer Outscores Competition in HPC Challenge Benchmark Tests. Business Wire, Feb 15, 2005. http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=674199&highlight=.
▪ Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang, Kris Gaj, Volodymyr Kindratenko, and Duncan Buell. The Promise of High-Performance Reconfigurable Computing. IEEE Computer, 41(2):69-76, 2008.
▪ Xilinx Corp. Virtex-5 Multi-Platform FPGAs, 2009. http://www.xilinx.com/products/virtex5/.