HPEC using FPGAs: Challenges and Benefits

2 Utah State University
  Cache Valley, 90 miles north of Salt Lake City
  David G. Sant Engineering Innovation Building

3 Agenda
  On-board computing for spacecraft
  A primer on FPGAs (5 slides)
  HPEC using FPGAs (26 slides)
  The Polymorphic Systolic Array Framework
  Improving productivity
  Enabling real-time and responsive reconfiguration
  Future technologies for FPGAs
  Acknowledgements

4 On-board Computing
  Civilian and military space missions are getting more complex
    Need to support several types of data from several types of sensors
  Missions will require the spacecraft computer to be more responsive
  Need for in-situ data processing (signal processing)
    Not just compression, but data analysis, decision making, etc.
  Power budget and form factor of the spacecraft computer are extremely tight
  State of the art: RadHard microprocessor from BAE Systems or RISC processor?
    Aging workhorse, time to upgrade big time

5 So, what do we upgrade to?
  Commodity microprocessors
    Cell, GPU, many/multi-core
    Very powerful
    Blows out the power budget
    RadHard parts need to be custom ordered
  Commodity DSP chips
    Good as long as you stick to just one chip
    RadHard parts can be custom ordered
  Commodity reconfigurable chips
    FPGAs (field programmable gate arrays)
    Can perform like a custom silicon chip
    Best performance/power ratios
    RadHard parts already available, with a steady roadmap from Xilinx

6 Programming perspective
  Optimistic viewpoint:
    Microprocessors: frozen pizza
    DSP chips: take ‘n’ bake
    FPGAs: raw ingredients

7 Quick Primer on FPGAs
  Mixture of blocks on a die
    Some dedicated
      DSP (MAC units)
      PPC (optional)
      RAM
    Some programmable
      Look Up Tables (LUTs)
      Gazillions of network switches
    Hidden special circuit
      ICAP (internal configuration access port)

8 Simple View of Programming an FPGA
  [Figure label: NMOS transistor]
  All computations are assumed to be based on Boolean logic
  So,
    Problem-solving concept => algorithms
    Algorithms => discrete set of simple tasks (add/multiply…)
    Simple tasks => a set of Boolean
    functions talking to each other
    Boolean function => simple manipulation of 1 and 0 bits
    Each bit is stored in a small memory cell (SRAM)
  An FPGA is essentially a vast set of SRAM cells waiting to be loaded with 0s and 1s to mimic Boolean logic

9 Programming an FPGA
  Each Look Up Table (LUT) has a unique mailing address
    16 bits go into each LUT
  Each routing switch has a unique mailing address
    One bit for each switch
  The executable for an FPGA is a sequence of bits that have to be delivered precisely to each LUT and switch box
  This binary/executable is called the “Configuration Bitstream” or simply the “Bitstream”

10 Programming an FPGA
  Programming the FPGA is like having a mailman deliver bits to each address correctly
    Slow process
  But a bitstream is slightly more complex
    Each FPGA is like a country (has a unique code)
    Before entering the chip, a bitstream has to undergo security clearance (CRC, or cyclic redundancy check)
    Port of entry = ICAP
    FPGA addresses are hierarchical (state, county, city, suburb, house address)
    The term used for encoding all this overhead is “Frame Address”
    All this address stuff is overhead
    The actual useful stuff is inside the mail envelope

11 So what does a real configured/programmed FPGA look like?
  Before programming
    Nice clean plate
    Empty LUTs, switches….
  After programming
    Messy plate of spaghetti
    Configured LUTs, switches….
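To make the LUT idea from slides 8 and 9 concrete, here is a minimal sketch, assuming a 4-input LUT whose 16 configuration bits are simply the truth table of the Boolean function it mimics. The names `make_lut` and `lut_eval` and the two example functions are illustrative only and do not come from any vendor tool; real bitstreams place these bits in a device-specific order.

```python
from itertools import product


def make_lut(func):
    """Return the 16 configuration bits (a truth table) for a 4-input Boolean function."""
    # Input pattern (a, b, c, d) is treated as the 4-bit address a*8 + b*4 + c*2 + d.
    return [int(bool(func(a, b, c, d))) for a, b, c, d in product((0, 1), repeat=4)]


def lut_eval(bits, a, b, c, d):
    """Evaluate a configured LUT: the inputs only select one of the 16 stored bits."""
    return bits[a * 8 + b * 4 + c * 2 + d]


# Hypothetical example function: a vote that is true when at least three inputs are high.
vote_bits = make_lut(lambda a, b, c, d: a + b + c + d >= 3)
# The same 16 memory cells, loaded differently, become a 4-input XOR (parity).
xor_bits = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)

print("vote LUT bits:", "".join(map(str, vote_bits)))
print("xor  LUT bits:", "".join(map(str, xor_bits)))
print("vote(1,1,0,1) =", lut_eval(vote_bits, 1, 1, 0, 1))
print("xor(1,1,0,1)  =", lut_eval(xor_bits, 1, 1, 0, 1))
```

Nothing in the "hardware" (`lut_eval`) changes between the two functions; only the 16 stored bits differ, which is why loading a different bitstream turns the same fabric into a different circuit.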
  All those green things are wires that have been set up to carry data between LUTs, FFs, etc…

12 High Performance Embedded Computing (HPEC) using FPGAs
  Signal processing algorithms
    Wildly useful and hence widely used
    Computationally quite parallel/pipeline-amenable
    Proven to be accelerate-able by systolic array designs on FPGAs
  The Good of FPGAs:
    FPGA vendors claim orders-of-magnitude performance advantage over DSP chips (www.xilinx.com, www.altera.com)
    They can be reconfigured partially and dynamically
  The Bad (no, the Ugly):
    Productivity is the biggest barrier
      The number of signal processing folks willing to adopt FPGAs is small and stagnant
    Partial dynamic reconfiguration is very slow compared to processing speeds

13 Elaborating the Good of FPGAs: Extreme DSP computing

14 Elaborating the Good of FPGAs: Partial Dynamic Reconfiguration
  At some point in time……
  Abruptly… say we need to quickly increase parallelism support for application α,
  at the cost of taking away parallelism support for the other application
  Can we dynamically reconfigure the chip without disturbing the execution of either application?
  And do it fast enough?
  [Figure: an FPGA holding parallel processing circuits for two applications. Labels show 4 or 5 parallel processing circuits for application α and 6 or 7 parallel processing circuits for application β]
  Because we did not have enough space on the chip to support high levels of parallelism for both applications, or
  there was a power budget we couldn’t satisfy
  Remember, programming the FPGA is a very, very, very slow process RELATIVE to the execution speeds of applications

15 Productivity
  It’s a funny thing in the FPGA world
  FPGA programmers are essentially VLSI design guys
    They don’t buy $5K parts to get average performance
    Every clock cycle is precious
    Every LUT/FF/MAC/BRAM is precious
    They don’t adopt new programming languages in a hurry
    They love to have full control over every operation

16 Productivity, so what does it mean?
  Wants an entire system on FPGA modeled, performance predicted, designed, implemented, debugged, verified, with guaranteed timing closure, low power, high throughput….
    Done really, really fast, just like software
  And then wants to make some minor changes and do it quickly all over again, just like software…

17 Why can’t new designs be compiled, loaded onto FPGAs and tested super fast?
  Need to look at the traditional design flow
  1. Hardware-software partition (quick)
  2. Create macro- and micro-architectures for the hardware portion (a month, two months..)
  3. Write bug-free VHDL/Verilog code for the architectures (a few months)
  4. Synthesize, translate, map, place and route (5 to 15 hours)
  5. Simulate
  6. Load configuration onto chip
  If there is a functional or timing bug, you pay a penalty of a few days to weeks
  Test again. If there is a timing bug, you pay a penalty of several weeks
  If you decide to make a micro-architecture change, go back to step 2
  Good luck trying to finish your project on time and on budget
  This will still not get you a dynamically reconfigurable design

18 One way to Improve Productivity
  Stick to the traditional design flow as much as possible
    FPGA users are once bitten, twice shy
    Very conservative and believe in the existing flow
  But introduce structure into the flow, i.e. physical structure, macro-architecture structure
  Make Partial Dynamic Reconfiguration (PDR) almost automatic
    FPGA designers are not conversant with PDR designs

19 Augmented Design Flow: Exclusively for Signal Processing Algorithms
  Hardware-software partitioning (just a concept and specific to an application)
  Structured macro-architecture via floor planning
  Structured micro-architecture design
  Generic structure applicable to many algorithms
  Project, schedule data flow model of Sig. Proc. kernel onto things called Sockets of the macro-architecture
  Well understood process
  Embed dynamic reconfiguration capability
  New technology
  Works in tandem with macro-architecture
  Code, synthesize….
  Test on chip

20 Structured Macro-architecture
  Some important terms/elements:
  Socket: A physical region on the FPGA chip reserved by the designer to be loaded/configured with a PE. This is also called a Partial Reconfiguration Region (PRR)
  Switch Box: A circuit that makes the array of Sockets re-partition-able
  PE/Processing Element: A circuit/bitstream that implements a signal processing kernel’s systolic array data-flow functionality.
  To activate a socket, a PE must be loaded into it

21 Socket/PRR: Under the Hood
  Yellow box: a socket/PRR
  It contains BRAMs, MACs and LUTs/FFs (purple and blue/green/black stuff)
  If you want to dynamically reconfigure the parallelism of systolic arrays on an FPGA:
    All PRRs must be created with identical resources of MACs, BRAMs, LUTs, FFs.
  [Figure: physical fabric of Virtex 4 SX 35 FPGA]

22 Switch Box: Stuff that makes the Array of Sockets Re-partition-able
  Simple circuit
  Need to set mux select lines & FIFO controls
  Resides in the static region on the FPGA
  Change SB connections to change the partitioning of sockets/PRRs between systolic array kernels’ nodes

23 Ok, time to port Macro-architecture Framework onto Chip

24 What really happened when we tried it
  Virtex 4 SX 35
  Static region (luminescent green stuff)
    • Microprocessor
    • Switch Boxes
    • Cache
    • Controller
  PRRs/Sockets (white boxes)
    • To be filled with Systolic Array Processing Elements

25 Now to the Micro-architecture… First, Hardware-Software Partitioning
  Example: Extended Kalman Filter (EKF). A critical navigation algorithm and a nasty signal processing kernel.
  All stuff with rounded edges are tasks that can change based on the physics of the problem. So put it all in software (MicroBlaze).
  All else is consistent, so put it in hardware (PolySAF)

26 Designing/Deriving the Processing Element: Example EKF
  Works on the Faddeev algorithm to compute the Schur complement

27 One of the many possible ways
  Port

28 Code, Synthesize, … Optimize
  Port: code, synthesize, translate, map, place and route
    For one Socket/PRR (just a few days’ worth of work)
  Move nets around to meet timing: manually pick up a wire in this small bowl of spaghetti of wires, and move it around.
    Nuisance of a task, but necessary
    But you need to do it in only one PRR (just a few hours’ worth of work)
  Copy the locally optimized bitstream/circuit of the one PRR to all PRRs
    Automatically obtain global timing closure for the PolySAF
  If the microprocessor and cache are retained for multiple designs, then global timing closure for the whole chip is also automatically gifted to you

29 Have we answered the Productivity problem? Time to Grade the Approach
  Need to look at the traditional design flow
  1. Hardware-software partition (quick)
  2. Create macro- and micro-architectures for the hardware portion (a month, two months..)
  3. Write bug-free VHDL/Verilog code for the architectures (a few months)
  4. Synthesize, translate, map, place and route (5 to 15 hours)
  5. Simulate
  6. Load configuration onto chip
  If there is a functional or timing bug, you pay a penalty of a few days to weeks
  Test again. If there is a timing bug, you pay a penalty of several weeks
  If you decide to make a micro-architecture change, go back to step 3
  Good luck trying to finish your project on time and on budget
  Grading notes on the slide:
    Applicable to a wide range of Sig. Proc. algorithms
    Reuse most of the macro structure and code only for one PRR
    Do for only one PRR

30 Want the details, the math, the algorithms, etc.? Read this paper
  A. Sudarsanam, R. Barnes, A. Dasu, J. Carver, and R. Kallam, “Dynamically Reconfigurable Systolic Array Accelerators: A case study with EKF and DWT Algorithms,” IET/IEE Computers & Digital Techniques, Vol. 4, Issue 1, Jan. 2010.
  Author preprint available online at the Reconfigurable Computing Group: www.usu.edu/rcg

31 Now, onto Partial Dynamic Reconfiguration in the PolySAF
  3 nodes EKF, 2 nodes DWT
  Detach socket
  2 nodes EKF, 2 nodes DWT
  Reconfigure
  Reset new PRR
  Re-attach
  2 nodes EKF, 3 nodes DWT
  DWT: discrete wavelet transform, the kernel used in JPEG 2000 image compression

32 How to Physically Reconfigure PRR?
  Known methods

33 Comparison of all known options
  Best known technique: from Microsoft Research Labs (2008) eMIPS project
  Too slow, too expensive (hogs up valuable on-chip BRAMs)

34 Embedding Dynamic Reconfiguration into the System
  Active bitstream (PRR) to PRR: hardware circuit
  [Figure: block diagram. Labels: FPGA, ARC, ICAP wrapper, ICAP, snoop, PRR (source), active bitstream, PRR (destination), PRR (destination)]

35 Accelerated Relocation Circuit (ARC)
  Manipulate frame addresses
    FAR is the Frame Address Register
  Lots of unnecessary overhead can be avoided
    No need for CRC processing

36 Results… reconfiguration times in millisecs
  Column groups on the slide: Resources (LUT, FF, DSP, BRAM); Bitstream size (bytes); # of frames; ARC; BiRF* (IEEE TVLSI 2009); Microsoft* (Tech. Report 2008)

  Test circuit /          LUT    FF  DSP  BRAM  Bitstream    # of  ARC same/     Same  BRAM   Same    Opp
  PolySAF node                                    (bytes)  frames   opp side     side         side   side
  FSA_no_DSP              486   273    0     0      31159     195       0.48     84.7    14   3.38   8.86
  DSA_no_DSP              438   273    0     0      30693     195       0.48     83.4    14   3.33   8.73
  Matrix_Mult_no_DSP     1234   988    0     0      68469     432       1.07    186.1    30   7.42  19.47
  FSA_with_DSP            423   216    1     0      32349     195       0.48     87.9    15   3.50   9.20
  DSA_with_DSP            375   216    1     0      32349     195       0.48     89.8    15   3.58   9.20
  Matrix_Mult_with_DSP    502   466    8     0      65261     432       1.07    177.3    29   7.07  18.56
  DCT                    1419  1636    8     8      44397     540       1.34   120.64    22   4.81  12.62
  CSC                     318   438    1    12      17313     301       0.74    47.04     9   1.87   4.92
  DWT                     940   389    0     4      47897     303       0.75    130.2    21   5.19  13.62

  Row label on the slide: RFT cases
  All systems run @ 100 MHz
  * Estimated values for state-of-the-art competing technologies
  Footprint of ARC: 1064 LUTs, 638 FFs and 1 BRAM

37 Next steps… Improve, Formalize and Collaborate
  Performance prediction model
    Predict how big the circuit will be and how it will perform, using Excel and Matlab
    Big leap in productivity
  Arithmetic precision manipulation is extraordinarily powerful when it comes to FPGAs
    If the right non-IEEE precision can be chosen for a Sig. Proc. App.
    Then you can save medium to massive amounts of area and power in the circuit mapped onto the FPGA
    Great opportunity for small satellites
  Efficient communication between microprocessor and PolySAF via threads
  Validate and brutally test this on a large number of algorithms (FFTs, filters, hyperspectral processing…..)
    NASA can help with this
  Technology is attractive for software defined radios, precision navigation…

38 Kaleidoscope: Future of FPGA
  Near term
    Maybe better tools to program and debug FPGAs?
      Mentor’s Catapult, AutoESL compiler, Synfora compiler….
    Maybe some sort of standardization in FPGA programming
      Hopefully the DARPA HPCS program will produce something
  Longer term (revolutionary things to come)
    Vertically integrated FPGA + DRAM on a single chip
    1000x improvement in performance/watt
    Visit the Micron Research Center at USU to learn more: www.usu.edu/mrc

39 Acknowledgements
  Joe Bredekamp and the NASA AISR (Applied Information Systems Research) program
  Funding from NASA is valuable
    Focused research
    Want my technology to be adopted for real missions
  Xilinx and Mentor Graphics (donated > $100K worth of software)
  My grad students
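
As a small illustration of the arithmetic-precision point on the "Next steps" slide, the sketch below runs a short multiply-accumulate (the operation a MAC unit performs) in fixed-point arithmetic with different fractional word lengths and reports the error against double precision. The coefficients, the samples, and the helpers `to_fixed` and `fixed_dot` are made up for illustration and are not from the slides; how much area or power a narrower word actually saves depends on the device and tools and is not modeled here.

```python
def to_fixed(x, frac_bits):
    """Quantize x to a signed fixed-point integer with frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def fixed_dot(coeffs, samples, frac_bits):
    """Multiply-accumulate in integer arithmetic, then scale back to a float."""
    acc = sum(to_fixed(c, frac_bits) * to_fixed(s, frac_bits)
              for c, s in zip(coeffs, samples))
    # Each product carries 2 * frac_bits fractional bits.
    return acc / (1 << (2 * frac_bits))


# Illustrative 4-tap filter coefficients and input samples (hypothetical values).
coeffs = [0.125, -0.3, 0.55, 0.2]
samples = [0.9, -0.41, 0.33, 0.07]
reference = sum(c * s for c, s in zip(coeffs, samples))

for frac_bits in (4, 8, 12, 16):
    approx = fixed_dot(coeffs, samples, frac_bits)
    print(f"{frac_bits:2d} fractional bits: result={approx:+.6f} "
          f"abs error={abs(approx - reference):.2e}")
```

The designer's job is then to pick the narrowest word length whose error the application can tolerate, since every bit removed shrinks the multipliers and adders that have to be mapped onto the fabric.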