Unapproved draft – please do not release without permission of Lockheed Martin legal

FPGA-Based Processor for Hubble Space Telescope Autonomous Docking – A Case Study

Authors: Jonathan F. Feifarek, Timothy C. Gallagher, Lockheed Martin, Denver, Colorado

Abstract

Designing electronic hardware for flight projects requires balancing conflicting constraints: a short design cycle, tolerance of radiation-induced effects, limited power, and flexibility to accommodate changing requirements and fixes. Our design space, consisting of three autonomous rendezvous and docking algorithms mission critical to the Hubble Space Telescope Robotic Vehicle project, presented our team with these constraints. In addition, analyses showed the processing throughput of each algorithm exceeded the capacity of existing radiation-hardened computers by two orders of magnitude. An FPGA-based solution with dual redundancy at the FPGA level was selected. This paper discusses the challenges encountered in designing an algorithmically intensive FPGA-based processor for flight and presents a case study showing how these difficulties were mitigated for the Hubble Robotic Vehicle.

Introduction

NASA Goddard's Hubble Space Telescope Robotic Vehicle (HRV) project necessitated a rapid-turnaround customized flight solution to service the Hubble Space Telescope (HST) before aging critical components failed. A key component, autonomous rendezvous and docking, required processing throughput that greatly exceeded the capacity of any existing radiation-hardened processor because of the need to process independent algorithms with different camera views in parallel. FPGAs were determined to be the only viable solution that could meet these processing constraints while providing the update rate needed by guidance, navigation, and control to accomplish low-rate direct docking with the delicate HST.
Use of a High Order Language (HOL) was considered the best way to realistically meet the difficult schedule of converting existing complex and specialized C/C++ vision code to synthesizable RTL for implementation on FPGAs. Other advantages of using a higher abstraction level of design are the ability to optimize and verify code orders of magnitude faster than RTL simulation allows, automation of complex state machine control, and generation of bit-accurate and cycle-accurate RTL output before porting to hardware. To meet the high proton radiation environment within the cost and power constraints of the mission, a dual redundancy approach at the FPGA level was chosen, taking advantage of the system's tolerance for limited data dropouts but not for result uncertainty. The details of the SEU mitigation scheme specific to our design are presented along with its merits and limitations. The Vision Processing Card (VPC) design implementation was the result of selecting an FPGA-based architecture using a HOL design to build mission-specific hardware. The architecture of this board, along with the hardware implementing the more complex algorithm, Natural Features Image Recognition (NFIR), is shown, and timing/sizing results are presented.

Flight Computer vs FPGA Trade

During the architectural definition phase of the Hubble proposal effort, a design trade study was undertaken to select the best processing system to perform a complex set of NFIR-type algorithms. This processing system had to use available products with space-effects mitigation (rad-hard components, or rad-tolerant components with mitigation capability), perform time-dependent imagery processing at 5 frames per second, provide adequate margin for growth in algorithm functionality, and be fully operational within one year. To achieve this aggressive schedule, the program planned on re-using predecessor floating-point C/C++ code.
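Much of such predecessor floating-point code can often be moved to integer or fixed-point arithmetic without losing the required accuracy. As a hedged illustration only (the names and the Q16.16 format are ours, not taken from the HRV code), a fixed-point stand-in for a float in C might look like:

```c
/* Illustrative sketch: Q16.16 fixed point (16 integer bits, 16 fraction
   bits) as a stand-in for floating point.  Names are hypothetical. */
#include <stdint.h>

typedef int32_t fx_t;                 /* Q16.16 value */
#define FX_ONE 65536                  /* 1.0 in Q16.16 */

static inline fx_t fx_from_int(int v) { return (fx_t)(v * 65536); }

static inline fx_t fx_mul(fx_t a, fx_t b) {
    /* widen to 64 bits so the intermediate product cannot overflow */
    return (fx_t)(((int64_t)a * (int64_t)b) >> 16);
}

static inline fx_t fx_div(fx_t a, fx_t b) {
    return (fx_t)(((int64_t)a << 16) / b);
}
```

On an FPGA, each `fx_mul` maps naturally onto a hardware multiplier, which is far cheaper than a full floating-point unit.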
As with many algorithms written in floating-point code, only a small and very specific subset of the computation actually requires the large dynamic range. Many applications such as MATLAB will default to floating point even when more efficient integer precision gives the required accuracy. There are also other efficient variants, such as 18-bit floating point. At the time of the proposal, parts of the algorithm were running on commercial processors; after scaling for the full application and the necessary repetition rate, calculations showed that 40x the performance of a standard Pentium-based machine was necessary. For dataflow types of processing, which encompass many optical and DSP algorithms, FPGA-based reconfigurable computing obtains performance greater than 100 times that of a microprocessor [1]. Many other researchers have validated gains of this magnitude over Pentium-based CPUs on other image processing benchmarks such as median filtering and edge detection, pattern recognition, and multi-spectral processing [2,3]. The NFIR algorithm ran in real time when implemented in parallel on dual Pentium 4 Xeon processors running at 3.2 GHz with hyper-threading enabled, a configuration that could not fly for radiation, reliability, and power reasons. Based on this performance, it was estimated that a single RAD750 running at 40 MHz would only be able to process images at 1/160 of this rate. Assuming linear speed scaling, it would have taken 20 of the fastest (at the time) radiation-hardened computers to achieve the same performance as our final dual-FPGA processor running with a 40 MHz clock. A drawback of FPGA-based reconfigurable computing is the design, synthesis, and place-and-route cycle, which is a complex flow compared to software development. It requires detailed hardware knowledge and sometimes low-level component manipulation.
These problems can be alleviated, but not eliminated, by using a state-of-the-art High Order Language (HOL) design methodology.

High Order Language (HOL) vs Hardware Description Language (HDL)

As noted by NASA Goddard Space Flight Center, there are three essential elements in reconfigurable computing designs [4]: rad-tolerant flight-qualified FPGAs, ground prototyping for evaluation of applications, and tool development/design methodologies. While the first two elements are taken care of by manufacturing enhancements/SEE testing and COTS vendors, the last item leaves it to the designer to take advantage of new tools, techniques, and design methodologies such as HOLs. HOLs constitute a tool to raise the abstraction level of FPGA hardware designs such that optimization, verification, modification (i.e., maintainability), time-to-market, and first-pass success are improved over handcrafted RTL. In the past few years there has been an emergence of tools based on high-order languages such as Java, C, C++, and MATLAB for FPGA development. Many of these tools have been either fringe commercial products or university research systems. While they have shown fast algorithm-to-gates conversion cycles, the generated results, in terms of FPGA area and speed compared to hand-coded VHDL, have not been very efficient. This may explain the lack of enthusiasm outside of their own circles and the unfair comparisons of actual performance obtained by HOLs. We have shown results from using the more mature HOL toolsets, such as Celoxica's DK HOL system, that validate very rapid development times while producing cycle-accurate and bit-accurate RTL from C [5]. A byproduct of rapid design and development in a C-based system is much greater opportunity to optimize the algorithm, since the design, compile, and simulation cycles are orders of magnitude faster than in RTL.
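To illustrate the abstraction gain, the following is a hedged sketch in plain C of the kind of per-pixel kernel a HOL can turn into hardware. In Handel-C, the loop could be written as a replicated `par` statement so that each iteration becomes its own logic operating in the same clock cycle; this C stand-in runs sequentially. The function names are illustrative, not taken from the HRV design.

```c
/* Illustrative only: plain C standing in for Handel-C.  In Handel-C,
   "par (i = 0; i < n; i++)" would replicate the body into n parallel
   hardware lanes; here the loop executes sequentially. */
#include <stdint.h>

/* Reduce a 12-bit camera sample to 8 bits -- a typical per-pixel kernel. */
static uint8_t scale_pixel(uint16_t raw12) {
    return (uint8_t)(raw12 >> 4);
}

static void scale_row(const uint16_t *in, uint8_t *out, int n) {
    for (int i = 0; i < n; i++)       /* Handel-C: par (i = 0; i < n; i++) */
        out[i] = scale_pixel(in[i]);
}
```

The point is that the same C-level statement of intent serves both the software model and the generated hardware, which is what makes the fast optimize/verify cycle possible.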
For many cases with complex RTL code, smaller module-level testing must be performed before the designer can move on to the next piece of code, leaving less time to optimize effectively. Expert hand-coded RTL implementation is still better than current-generation HOLs for high-performance designs. As a result of these trades, and because multiple image processing algorithms already existed in C and C++, the HRV program selected Celoxica's Handel-C HOL along with their DK development environment.

Design Flow

The design flow used to implement Vision Processing algorithms on the FPGA processor is closely linked to Celoxica's Handel-C flow shown in Figure 1 (drawing used with permission of Celoxica, Inc.).

Figure 1 – Design Flow

The algorithms were initially developed in C++, re-using existing algorithms with calls to public-domain, widely validated OpenCV routines. These were then ported to C and partitioned between hardware and software components. Both hardware and software components were initially hosted on a standard PC platform, which allowed incremental debug of the hardware components using the DK tools' hardware/software co-verification. The result was rapid iteration of implementation, verification, and performance measurement. For the hardware portion of the design, a Xilinx 2V6000 FPGA hosting a MicroBlaze™ core was selected. In order to host the algorithms on representative hardware while the flight hardware was still under development, a commercial FPGA board was used in a PC platform. Celoxica furnished APIs for this hardware that readily allowed the Handel-C portion of the design to be loaded on an FPGA and co-verified with software. Similarly, the software components were initially run on the PC in a Windows environment, then later executed on the FPGA within the MicroBlaze™ microprocessor.
This microprocessor executes the top-level portions of the algorithm (described in the NFIR section) as well as accessing software libraries, Xilinx and Celoxica hardware floating-point libraries, and a custom floating-point array processor.

Reconfigurable Computer Architecture

A Reconfigurable Computer platform was co-developed with SEAKR Engineering, Inc., a local company with extensive experience in FPGA development and in building space-qualified hardware (http://www.seakr.com/data/Unsorted/RCC_Datasheet.pdf). The resulting Reconfigurable Computer Card architecture is shown in Figure 2.

Figure 2 – HRV Reconfigurable Computer Card Architecture

The VPC board contains four Coprocessors, each consisting of a Xilinx 2V6000 FPGA and three associated memory blocks. The selection of the Xilinx 2V6000 was driven by the required algorithm size and throughput; this is the largest QPRO part that Xilinx produces. A Virtex-II Pro part was rejected as a candidate because it did not have an acceptable level of radiation performance data. Data comes onto the board through four serial camera data ports, one for each Coprocessor. This data arrives at 10485760 bps and is converted into parallel words by an LVDS deserializer, allowing the Coprocessor to process 873,813 12-bit samples per second. Each of the Coprocessors contains three memory interfaces, all implemented on mezzanine cards. These cards allow for growth in memory size or late changes to memory type, and provide inexpensive design flexibility. The three memory interfaces connected to each Coprocessor hold: 1) pyramid data in a ping-pong configuration; 2) edge and corner data; and 3) data and program memory for an embedded microprocessor, including data for a 3-D model required by all the algorithms.
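The ping-pong arrangement of the pyramid memory lets the front end load frame t+1 into one bank while the algorithm reads frame t from the other. A minimal sketch of the idea, with illustrative names and a stand-in bank size rather than the flight memory map:

```c
/* Hedged sketch of a ping-pong (double-buffered) memory: one bank is
   written with the incoming frame while the other is read.  Names and
   the bank size are illustrative, not the flight values. */
#include <stdint.h>
#include <string.h>

#define BANK_WORDS 1024

typedef struct {
    uint16_t bank[2][BANK_WORDS];
    int      write_bank;              /* bank being loaded (frame t+1) */
} pingpong_t;

static void pp_init(pingpong_t *p) { memset(p, 0, sizeof *p); }

/* Called at each frame boundary: the just-loaded bank becomes readable. */
static void pp_swap(pingpong_t *p) { p->write_bank ^= 1; }

static uint16_t *pp_write_ptr(pingpong_t *p) {
    return p->bank[p->write_bank];
}
static const uint16_t *pp_read_ptr(pingpong_t *p) {
    return p->bank[p->write_bank ^ 1];
}
```

The benefit is that neither side ever waits on the other within a frame; the only synchronization point is the swap at the frame boundary.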
The Configuration Control FPGA is a radiation-tolerant Actel device that performs configuration scrubbing, performs voting, and provides a serial interface to an external system. It loads one of multiple configurations from the Flash EEPROM.

Natural Features Image Recognition Architecture

The Natural Features Image Recognition (NFIR) algorithm is one of two algorithms developed for the HRV mission and, being the more complex of the two in terms of logic and math requirements, is discussed in this section. A block diagram with logic flow is shown in Figure 3. The NFIR is initialized with a known attitude from an independent algorithm and inputs a digital image frame. NFIR must process this frame in 200 milliseconds and output a pose estimate of the camera relative to the Hubble Space Telescope – this pose contains three axes of relative position and three axes of relative attitude. The NFIR algorithm consists of two primary steps: 1) point-based pose estimation (after initialization, the input pose used is the result of the previous NFIR pose estimate); and 2) edge-based pose estimation (the input pose used is the result of the point-based pose of this cycle). Step 1 is encompassed in the first two blocks within the microprocessor flowchart in Figure 3, as well as the Lucas-Kanade trackers, which operate on pyramidal down-sampled images. Step 2 is contained in the bottom blocks of the flowchart, the hardware Edge Finder, and the pyramidal down-sampled edge-enhanced images produced by the Front End hardware. A custom Floating Point Unit is available to be explicitly accessed by microprocessor software.
[Figure 3 – Natural Features Image Recognition (NFIR) Algorithm Block Diagram, showing the Front End Image Processor (pyramidal downsampling / edge enhancement, Edge Finder, Lucas-Kanade trackers), the memory manager, the Xilinx MicroBlaze™ microprocessor core running the pose computation loop (project model points 3D to 2D, project edges, compute new pose, output pose), and a single custom Floating Point Unit (FPU) instance with multiple software invocations.]

For point-based pose estimation, initialization consists of using the initial image, the initial pose corresponding to the image, and the model to determine a set of points to track with subsequent images. The model points are projected into the image frame (along the direction of the initial pose). The image points 'closest' to the projected points are chosen as input to the tracker. Feature (point) tracking is based on a pyramidal implementation of the Lucas-Kanade feature tracker algorithm. The tracker uses the 2D image points each cycle to determine the image points that correspond to the points with which the tracker was initialized. Non-matching points are marked so they are not processed further. The point-based pose estimation uses the model points (in the model coordinate frame) and the corresponding image points (in the camera image coordinate frame) to calculate the pose (the translation and rotation between the two coordinate systems). For the edge-based pose estimation, initialization data consists of an initial pose, a model, and a sequence of images.
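The "project model points into the image frame" step amounts to transforming each 3D model point by the current pose and applying a perspective divide. A hedged pinhole-camera sketch (the flight code also handles calibration details not shown here; names are illustrative):

```c
/* Hedged sketch of projecting a 3D model point into the image frame
   under a pose (rotation R, translation t) with an ideal pinhole camera
   of focal length f.  Names are illustrative. */
#include <math.h>

typedef struct { double x, y, z; } vec3;

static void project_point(const double R[3][3], vec3 t, double f,
                          vec3 p, double *u, double *v) {
    /* camera-frame coordinates: pc = R*p + t */
    vec3 pc = {
        R[0][0]*p.x + R[0][1]*p.y + R[0][2]*p.z + t.x,
        R[1][0]*p.x + R[1][1]*p.y + R[1][2]*p.z + t.y,
        R[2][0]*p.x + R[2][1]*p.y + R[2][2]*p.z + t.z
    };
    *u = f * pc.x / pc.z;             /* perspective divide */
    *v = f * pc.y / pc.z;
}
```

The 'closest image point' search described above then runs over the (u, v) outputs of this projection.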
The current image is processed to detect edges using a Sobel filter. The model edges are projected into the camera image frame (using the input pose). The edge image is searched to determine the strongest edge (if any) corresponding to each model edge. The corresponding edge determination accounts for an in-image-plane rotation and offset (i.e., optical flow). Each model edge (3D) and detected edge (3D) is processed to yield point data directly related to the inter-frame motion. All the edges are processed to provide a set of this data. This set is then processed using Lie algebra to formulate a rigid-body transformation. The resulting inter-frame motion is used to calculate a new pose estimate, which is output as the final product of the NFIR algorithm.

SEU Mitigation

Reprogrammable SRAM-based FPGAs are susceptible to SEUs in registers, internal memory, and, primarily, in their configuration latches. Configuration latch upsets, however, may or may not cause a functional error. Most of the device configuration memory is used for routing information, yet typical designs only use a small portion of the total routing resources. For example, a specific design that uses 90% of the available Configurable Logic Blocks may only actively use about 10% of the total configuration bits [6]. In general, the remaining bits can be considered don't-cares from an SEU standpoint. Thus, only about 1 in 10 configuration latch upsets causes a functional error [6]. Though not every SEU will cause a functional error, upsets in the configuration latches must be detected and corrected to remove the error and prevent multiple simultaneous configuration latch upsets from accumulating.
The enabling characteristic that makes the Xilinx FPGA attractive for space use is the ability to perform partial reconfiguration, allowing readback and reprogramming of the device while it is functioning, without interruption of the device to the user. Single Event Effects (SEEs) also include Single Event Functional Interrupts (SEFIs). These are events that cause an interruption of the normal functionality of the device and remain present until the device is fully reconfigured. Though the device must be taken "off-line" for a complete reconfiguration cycle, the removal of power is not required to reset the SEFI condition. These events must also be detected and mitigated. One form of SEFI is the Power On Reset (POR) event, which clears the configuration of the entire device. Persistent bits are those bits in the configuration memory whose upsets result in errors that will not flush out of the system. One of the simplest ways of mitigating this type of error is a system reset [7]. Our application allows us to make use of an FPGA reset on every cycle, since the algorithm operates on one frame of data with a single image memory in DRAM. Even this area of the design is not very sensitive to upsets, since the DRAM is protected with Error Detection And Correction (EDAC) logic, its data is self-flushing every 400 milliseconds, and the algorithms are highly tolerant of upsets to individual pixels. The MicroBlaze™ data and instructions stored in SRAM are more sensitive to upsets, but are also EDAC protected. The predicted upset rates for the Virtex-II 6000 device are shown in Table 1. These values were generated using CREME96 for a 600 km, 28 degree orbit. The total number of upsets is expected to be 0.895 per device-day. However, 75% of the upsets will be to the configuration memory. Typically, only 1 in 10 of these upsets results in a functional error.
This 1/10 factor is accounted for in the table, making the effective rate drop to 0.217 upsets per device-day.

XQVR6000 Function   Upset Rate          No. Bits    Upsets/Dev/10 yr   Upsets/Dev/Day
                    (upset/bit/10 yr)
CFG                 1.68E-04            16395508    2.75E+03           7.53E-01
Fcn. CFG            1.68E-05            16395508    2.75E+02           7.53E-02
BRAM                1.89E-04             2654208    5.01E+02           1.37E-01
POR                 2.72E-03                   1    2.72E-03           7.46E-07
SMAP                3.66E-03                   1    3.66E-03           1.00E-06
JCFG                9.31E-04                   1    9.31E-04           2.55E-07
CLB-FF              2.23E-04               76032    1.70E+01           4.65E-03
TOTAL                                               7.93E+02           2.17E-01

Table 1 – Predicted Upsets/Device-Day (with 1/10 configuration bit factor; the TOTAL row counts the functional configuration upsets, Fcn. CFG, rather than the raw CFG row)

At the board level, the risk of SEUs corrupting the data is reduced with dual Coprocessor redundancy for each algorithm, by configuration bit checking, and by SEFI protection. The dual redundancy scheme calls for running the identical algorithm on identical input data on two Coprocessors and comparing the pose outputs of each pair after each frame calculation. If these values do not match, they are sent with a flag indicating this fact so they can be dealt with downstream (see the SEU Mitigation at System Level section). To prevent the accumulation of upsets, the Control PLA and the Configuration/Interface PLA work together to perform configuration bit checking. Typically, the Control PLA reads back the configuration bits of each FPGA frame, computes a CRC for that frame, and compares it to the CRC value in the FLASH memory.
If a discrepancy is detected, a signal notifies the Configuration/Interface PLA to send configuration bits to that FPGA frame alone (a process known as partial reconfiguration [8][9]). The time to perform this partial reconfiguration is less than 1 millisecond, compared to the 320 ms it takes to perform a full reconfiguration of the entire chip.

[Figure 4 – Dual Voting SEU Mitigation Block Diagram, showing the configuration memory (FLASH) holding configurations CFG1a-c and CFG2a-c, the configuration verifier, loader, and checker, paired algorithm instances (A and B) for each camera on the radiation-tolerant RAM-based FPGAs, and the results checker and pose selector that compare Pose 1A with Pose 1B and Pose 2A with Pose 2B before sending the selected poses to the flight computer.]

The Dual Voting SEU Mitigation Block Diagram in Figure 4 shows the SEU mitigation scheme we designed for detecting and recovering from SEUs. Rather than using a conventional CRC technique, the two bitstreams are compared with each other, which results in fast detection of upsets. The dual redundant scheme provides coverage for all SEUs, so most chip-level SEU mitigation schemes are not required. This is important to the HRV program, where cost and schedule are critical. Typically, these chip-level techniques consist of applying TMR to critical portions of the design such as state machines, which is very difficult to do with the Handel-C flow, as data path and processing are combined. Full module redundancy would be the required implementation of TMR, because all the logic paths, including I/O cells and not just flip-flops, are susceptible to SEUs [6]. Providing triplicate I/O paths using the Xilinx XTMR tool is expensive in pin usage and may introduce skew on the board that must be controlled.
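A minimal sketch of the per-frame results check between a Coprocessor pair (structure and names are illustrative, not the flight implementation): the two poses are compared and, on a mismatch, forwarded with a miscompare flag so the system level can fall back to the last known good value.

```c
/* Hedged sketch of the board-level dual-redundancy check.  The pose
   from Coprocessor A is compared against Coprocessor B each frame; on
   mismatch, a flag marks the result for system-level handling.  Types
   and names are illustrative. */
#include <stdbool.h>
#include <string.h>

typedef struct { double pos[3], att[3]; } pose_t;   /* 3-axis position + attitude */

typedef struct { pose_t pose; bool miscompare; } pose_result_t;

static pose_result_t vote_pose(const pose_t *a, const pose_t *b) {
    pose_result_t r;
    r.pose = *a;   /* either copy is forwarded; the flag says they disagreed */
    r.miscompare = (memcmp(a, b, sizeof *a) != 0);
    return r;
}
```

Because both Coprocessors run bit-identical logic on identical inputs, an exact (bitwise) compare suffices; no tolerance band is needed.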
Adding EDAC to DRAM interfaces increases complexity and power, as does internal parity generation and checking. Full TMR of the logic within the chip is not likely to fit with any of the algorithms. The Xilinx XTMR tool will be used to remove half-latches and the use of SRL16s as RAM. The degree of fault tolerance to implement is defined by system-level requirements. The following are some questions that must be answered: Does the system only need to detect an error? How quickly must the system respond to an error? Must the system also correct the error? [10] In the case of our Vision Processing algorithms, the system must detect and respond to an error within a second: five images are received in this period, the algorithm must process the latest image, and erroneous data is simply marked as bad and discarded (the last known good value is used instead). Since there are multiple algorithms and data sets running in parallel, it is also possible to use the previous pose outputs to decide which of two disagreeing results is in error.

Results

The requirements on NFIR were to provide 30% margin in FPGA sizing resources (70% utilization maximum) and to produce pose estimates in real time at a 5 Hz rate. Both of these requirements were met. The NFIR sizing results are summarized in Table 2. These results reflect a highly parallel configuration in which four instances of the Lucas-Kanade tracker were used. The figure for BlockRAM utilization reflects an intentional allocation of all BlockRAM resources to the data and instruction caches within the MicroBlaze™ microprocessor. With no BlockRAM allocated to these caches, this figure dropped to below 50%.
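The utilization percentages and the cycles-to-rate conversion in the results tables follow from simple arithmetic, sketched here as a sanity check (the constants are the reported table values, not independent measurements):

```c
/* Sanity-check helpers for the utilization percentages (Table 2) and
   the cycles-to-frame-time conversion (Table 3).  Constants used with
   these come straight from the reported results. */
#include <math.h>

static int pct(long used, long avail) {
    return (int)(100.0 * used / avail + 0.5);   /* rounded percentage */
}

static double frame_ms(long cycles, double clk_hz) {
    return 1000.0 * cycles / clk_hz;            /* frame time in milliseconds */
}
```

For example, 24280 LUTs of 67584 available rounds to 36%, and 8,222,000 cycles at 40 MHz gives 205.55 ms per frame, just over the 200 ms allowed by a 5 Hz rate.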
                       LUTs    Flip Flops   Multipliers   BlockRAMs
MicroBlaze Processor    2700         2000             4          33
Front End               8000         4000             0          42
LK Tracker (4)         13580        10580            70          68
Total                  24280        16580            74         143
Available              67584        67584           144         144
Percentage Utilized      36%          25%           51%         99%

Table 2 – NFIR Sizing Results

The NFIR timing results for the various sequential processes are summarized in Table 3. The resulting total is 8.2 million cycles – at a 40 MHz clock, this represents 205 milliseconds, which is just over the 200 ms frame period required for the 5 Hz rate. At a 50 MHz FPGA clock, the algorithm runs at 6 Hz, meeting the required 5 Hz rate with a 20% margin.

Function Timed          Cycles/Loop   Loops   Total Cycles
Project Model Points         26000       1          26000
Lktracker (hardware)       2000000       1        2000000
FindExtrinsic              3078000       1        3078000
Project edges               120000       3         360000
FindEdges (hardware)        400000       1         400000
Project ellipses             80000       3         240000
computeAllFis               180000       2         360000
computeVsumCsum             280000       2         560000
computeAlpha                230000       2         460000
UpdatePose                    6000       3          18000
getAllErrors                240000       3         720000
Total                      6640000                8222000

Table 3 – NFIR Timing Results

Summary

In this paper, we reported the design trades and resulting architecture of a multiple-FPGA reconfigurable computing system running complex autonomous docking algorithms. The trades and design were completed, a card was built with parts that have flight equivalents, and the Natural Features Image Recognition (NFIR) algorithm was demonstrated. The target goal of operating at 5 Hz was met with spare capacity in FPGA lookup tables and gates, and these results were obtained prior to the program Critical Design Review.

References

1. M. Caffrey, "A Space Based Reconfigurable Radio", International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), 2002
2. B. Draper, R. Beveridge, W. Bohm, C.
Ross, M. Chawathe, "Implementing Image Applications on FPGAs", International Conference on Pattern Recognition, Quebec City, August 2002
3. R. Manner, M. Sessler, H. Simmler, "Pattern Recognition and Reconstruction on a FPGA Coprocessor Board", IEEE Symposium on Field-Programmable Custom Computing Machines, 2000
4. T. Flatley, "Developing Reconfigurable Computing Systems for Space Flight Applications", abstract, International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), 2002
5. T. Gallagher, "Joint High & Low Level Language Tools for System on Chip FPGA Devices", Bishop's Lodge Workshop in Distributed Embedded Computing, 2005
6. P. Graham, M. Caffrey, J. Zimmerman, D. Johnson, "Consequences and Categories of SRAM FPGA Configuration SEUs", International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), 2003
7. D. Johnson, K. Morgan, M. Wirthlin, M. Caffrey, P. Graham, "Detection of Configuration Memory Upsets Causing Persistent Errors in SRAM-based FPGAs", International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), 2004
8. Xilinx Application Note XAPP216, "Correcting Single-Event Upsets Through Virtex Partial Configuration"
9. Xilinx Application Note XAPP197, "TMR Design Techniques for Virtex FPGAs"
10. M. Berg, "A Simplified Approach to Fault Tolerant State Machine Design for Single Event Upsets", International Conference on Military and Aerospace Programmable Logic Devices (MAPLD), 2002