Unapproved draft– please do not release without permission of Lockheed Martin legal
FPGA Based Processor for Hubble Space Telescope Autonomous Docking – A Case Study
Authors: Jonathan F. Feifarek, Timothy C. Gallagher, Lockheed Martin, Denver, Colorado
Abstract
Designing electronic hardware for flight projects requires balancing conflicting constraints: a
short design cycle, tolerance of radiation-induced effects, limited power, and flexibility to
accommodate changing requirements and fixes. Our design space, consisting of three
autonomous rendezvous and docking algorithms mission-critical to the Hubble Space Telescope
Robotic Vehicle project, presented our team with all of these constraints. In addition, analyses
showed that the processing throughput of each algorithm exceeded the capacity of existing
radiation-hardened computers by two orders of magnitude. An FPGA-based solution with dual
redundancy at the FPGA level was selected. This paper discusses the challenges encountered in
designing an algorithmically intensive FPGA-based processor for flight and presents a case study
showing how these difficulties were mitigated for the Hubble Robotic Vehicle.
Introduction
NASA Goddard’s Hubble Space Telescope Robotic Vehicle (HRV) project necessitated a rapid-turnaround,
customized flight solution to service the Hubble Space Telescope (HST) before aging
critical components failed. A key component, autonomous rendezvous and docking, required
processing throughput that greatly exceeded the capacity of any existing radiation-hardened
processor, because independent algorithms with different camera views had to be processed in
parallel. FPGAs were determined to be the only viable solution able to meet these processing
constraints while providing the update rate the guidance, navigation, and control system needed
to accomplish low-rate direct docking with the delicate HST.
Use of a High Order Language (HOL) was considered the best way to realistically meet the
difficult schedule of converting existing complex and specialized C/C++ vision code to
synthesizable RTL for implementation on FPGAs. Other advantages of designing at a higher
abstraction level are the ability to optimize and verify code orders of magnitude faster than with
RTL simulation, automation of complex state machine control, and generation of bit-accurate
and cycle-accurate RTL output before porting to hardware.
To meet the high-proton radiation environment within the cost and power constraints of the
mission, a dual-redundancy approach at the FPGA level was chosen, taking advantage of the
system’s tolerance for limited data dropouts but not for result uncertainty. The details of the
SEU mitigation scheme specific to our design are presented along with its merits and limitations.
The Vision Processing Card (VPC) design implementation was the result of selecting an FPGA-based
architecture and using a HOL design flow to build mission-specific hardware. The architecture of
this board, along with the hardware implementing the more complex algorithm, Natural Features
Image Recognition (NFIR), is shown, and timing and sizing results are presented.
Flight Computer vs FPGA Trade
During the architectural definition phase of the Hubble proposal effort, a design trade study was
undertaken to select the best processing system to perform a complex set of NFIR-type
algorithms. This processing system had to use available products with space-effects mitigation
(rad-hard components, or rad-tolerant components with mitigation capability), process time-dependent
imagery at 5 frames per second, provide adequate margin for growth in algorithm
functionality, and be fully operational within one year. To achieve this aggressive schedule, the
program planned on re-using predecessor floating-point C/C++ code.
As with many algorithms written in floating point, only a small and very specific subset of the
code actually requires the large dynamic range. Many applications such as MATLAB default to
floating point even when more efficient integer precision would give the required accuracy, and
there are other efficient variants such as 18-bit floating point. At the time of the proposal, parts
of the algorithm were running on commercial processors; however, after scaling for the full
application and the necessary repetition rate, the calculations showed that 40x more performance
was needed than a standard Pentium-based machine could provide.
For dataflow processing, which encompasses many types of optical and DSP algorithms,
FPGA-based reconfigurable computing obtains performance greater than 100 times that of a
microprocessor [1]. Many other researchers have validated these order-of-magnitude gains over
Pentium-based CPUs on other image processing benchmarks such as median filtering and edge
detection, pattern recognition, and multi-spectral processing [2,3].
The NFIR algorithm ran in real time when implemented in parallel on dual Pentium 4 Xeon
processors running at 3.2 GHz with hyper-threading enabled, a configuration that could not fly
for radiation, reliability, and power reasons. Based on this performance, it was estimated that a
single RAD750 running at 40 MHz would only be able to process images at 1/160 of this rate.
Assuming linear speed scaling, it would have taken 20 of the fastest (at the time) radiation-hardened
computers to achieve the same performance as our final dual-FPGA processor running
with a 40 MHz clock.
A drawback of FPGA-based reconfigurable computing is the design, synthesis, and
place-and-route cycle, which is a complex flow compared to software development. It requires
detailed hardware knowledge and sometimes low-level component manipulation. These
problems can be alleviated, but not eliminated, by using a state-of-the-art High Order Language
(HOL) design methodology.
High Order Language (HOL) vs Hardware Description Language (HDL)
As noted by NASA Goddard Space Flight Center, there are three essential elements in
reconfigurable computing designs [4]: rad-tolerant flight-qualified FPGAs, ground prototyping
for evaluation of applications, and tool development / design methodologies. While the first two
elements are taken care of by manufacturing enhancements, SEE testing, and COTS vendors, the
last leaves it to the designer to take advantage of new tools, techniques, and design
methodologies such as HOLs.
HOLs constitute a tool to raise the abstraction level of FPGA hardware designs such that
optimization, verification, modification (i.e., maintainability), time-to-market, and first-pass
success are improved over handcrafted RTL.
In the past few years there has been an emergence of tools based on high-order languages such as
Java, C, C++, and MATLAB for FPGA development. Many of these tools have been either fringe
commercial products or university research systems. While they have shown fast algorithm-to-gates
conversion cycles, the resulting performance in terms of FPGA area and speed has not been
very efficient compared to hand-coded VHDL. This may explain the lack of enthusiasm outside
of their own circles and the unfavorable comparisons of actual performance obtained by HOLs.
We have shown results from using more mature HOL toolsets, such as Celoxica’s DK HOL
system, that validate the very rapid development times along with the production of cycle-accurate
and bit-accurate RTL from C [5]. As a byproduct of rapid design and development in a C-based
system, much greater optimization of the algorithm is possible, since the design, compile, and
simulation cycles are orders of magnitude faster than in RTL. In many cases with complex RTL
code, smaller module-level testing must be completed before the designer can move on to the
next piece of code, leaving less time to optimize effectively. Expert hand-coded RTL
implementation is still better for high-performance designs than current-generation HOLs.
As a result of these trades, and because multiple image processing algorithms already existed in
C and C++, the HRV program selected Celoxica’s Handel-C HOL along with their DK
development environment.
Design Flow
The design flow used to implement Vision Processing algorithms on the FPGA processor is
closely linked to Celoxica’s Handel-C flow shown in Figure 1.
Drawing used with permission of Celoxica, Inc.
Figure 1 – Design Flow
The algorithms were initially developed in C++ re-using existing algorithms with calls to public
domain, widely validated OpenCV routines. These were then ported to C and partitioned
between hardware and software components. Both hardware and software components were
initially hosted on a standard PC platform, which allowed incremental debug of the hardware
components using the DK tools’ powerful Hardware/Software Co-verification. The result was a
rapid iteration of implementation, verification, and performance measurement.
For the hardware portion of the design, a Xilinx 2V6000 FPGA hosting a microBlaze™ core
was selected. In order to host the algorithms on characteristic hardware while the flight
hardware was still under development, a commercial FPGA board was used in a PC platform.
Celoxica furnished APIs for this hardware that readily allowed the Handel-C portion of the
design to be loaded on an FPGA and co-verified with software. Similarly, the software
components were initially run on the PC in a Windows environment, then later executed on the
FPGA within the microBlaze™ microprocessor. This microprocessor executes the top-level
portions of the algorithm (described in the NFIR section) as well as accessing software libraries,
Xilinx and Celoxica hardware floating-point libraries, and a custom floating-point array
processor.
Reconfigurable Computer Architecture
A Reconfigurable Computer platform was co-developed with SEAKR Engineering, Inc. This
local company had extensive experience with FPGA development and building space qualified
hardware (http://www.seakr.com/data/Unsorted/RCC_Datasheet.pdf ). The resulting
Reconfigurable Computer Card architecture is shown in Figure 2.
Figure 2. HRV Reconfigurable Computer Card Architecture
The VPC board contains four coprocessors each consisting of a Xilinx 2V6000 FPGA and three
associated memory blocks. The selection of the Xilinx 2V6000 was driven by the required
algorithm size and throughput; this is the largest QPRO part that Xilinx produces. A Virtex 2 Pro
part was rejected as a candidate as it does not have an acceptable level of radiation performance
data.
Data comes onto the board through four serial camera data ports, one for each Coprocessor.
This data arrives at 10,485,760 bps and is converted into parallel words by an LVDS
deserializer, so that the Coprocessor can process 873,813 12-bit samples per second.
Each Coprocessor has three memory interfaces, all implemented on mezzanine cards.
These cards allow for growth in memory size or late changes to memory type, and provide
inexpensive design flexibility. The three memory interfaces connected to the coprocessor hold:
1) pyramid data in a ping-pong configuration; 2) edge and corner data; and 3) data and program
memory for an embedded microprocessor, including data for a 3-D model required by all the
algorithms.
The Configuration Control FPGA is a radiation tolerant Actel device which performs
configuration scrubbing, performs voting, and provides a serial interface to an external system. It
loads one of multiple configurations from the Flash EEPROM.
Natural Features Image Recognition Architecture
The Natural Features Image Recognition (NFIR) algorithm is one of two algorithms developed
for the HRV mission and, being the more complex of the two in terms of logic and math
requirements, is discussed in this section. A block diagram with logic flow is shown in Figure 3.
NFIR is initialized with a known attitude from an independent algorithm and inputs a digital
image frame. NFIR must process this frame in 200 milliseconds and output a pose estimate of
the camera relative to the Hubble Space Telescope; this pose contains 3 axes of relative position
and 3 axes of relative attitude.
The NFIR algorithm consists of two primary steps:
1) Point-based pose estimate (after initialization, the input pose used is the result of the
previous NFIR pose estimate).
2) Edge-based pose estimate (the input pose used is the result of the point-based pose of this
cycle).
Step 1 is encompassed in the first two blocks within the microprocessor flowchart in Figure 3, as
well as the Lucas-Kanade trackers, which operate on pyramidally down-sampled images. Step 2
is contained in the bottom blocks of the flowchart, the hardware Edge Finder, and the pyramidally
down-sampled edge-enhanced images produced by the Front End hardware. A custom Floating
Point Unit is available to be explicitly accessed by microprocessor software.
[Figure 3 diagram: serial camera pixels feed a Front End Image Processor (Pyramidal
Downsampling / Edge Enhancement, an Edge Finder, and Lucas-Kanade Trackers) through a
Memory Manager; RAM holds raw images, edge-enhanced images (t, t+1), and program/data
memory. A Xilinx microBlaze™ microprocessor core runs Project Model Points (3D to 2D),
Project Edges, Compute New Pose (iterative), and Output Pose. A single custom Floating Point
Unit (FPU) pipeline instance, with scalar/matrix multiply, divide, and convert operations, serves
multiple software invocations from the uBlaze software libraries and the uBlaze hardware FPU.]
Figure 3 - Natural Features Image Recognition (NFIR) Algorithm Block Diagram
For Point Based Pose Estimation, initialization consists of using the initial image, the initial pose
corresponding to the image, and the model to determine a set of points to track with subsequent
images. The model points are projected into the image frame (along the direction of the initial
pose). The image points ‘closest’ to the projected points are chosen as input to the tracker.
Feature (point) tracking is based on a pyramidal implementation of the Lucas-Kanade feature
tracker algorithm. Each cycle, the tracker uses the 2-D image points to determine the image
points that correspond to the points with which it was initialized. Non-matching points are
marked so they will not be processed further.
Point-based pose estimation uses the model points (in the model coordinate frame) and
corresponding image points (in the camera image coordinate frame) to calculate the pose (the
translation and rotation between the two coordinate systems).
For the Edge-Based Pose Estimation, initialization data consists of an initial pose, a model, and
a sequence of images. The current image is processed to detect edges using a Sobel filter. The
model edges are projected into the camera image frame (using the input pose). The edge image
is searched to determine the strongest edge (if any) corresponding to each model edge. The
corresponding-edge determination accounts for an in-image-plane rotation and offset (i.e.,
optical flow).
Each model edge (3-D) and detected edge (3-D) is processed to yield point data directly related
to the inter-frame motion. All the edges are processed to provide a set of such data, which is
then processed using Lie algebra to formulate a rigid-body transformation. The resulting
inter-frame motion is used to calculate a new pose estimate, which is output as the final product
of the NFIR algorithm.
SEU Mitigation
Reprogrammable SRAM-based FPGAs are susceptible to SEUs in registers, internal memory,
and primarily in their configuration latches. A configuration latch upset, however, may or may
not cause a functional error. Most of the device configuration memory is used for routing
information, yet typical designs use only a small portion of the total routing resources. For
example, a specific design that uses 90% of the available Configurable Logic Blocks may
actively use only about 10% of the total configuration bits [6]. In general, the remaining bits can
be considered don’t-cares from an SEU standpoint; thus, only about 1 in 10 configuration latch
upsets causes a functional error [6]. Though not every SEU causes a functional error, upsets in
the configuration latches must be detected and corrected to remove the error and prevent
multiple simultaneous configuration latch upsets from accumulating. The enabling characteristic
that makes the Xilinx FPGA attractive for space use is its ability to perform partial
reconfiguration, allowing readback and reprogramming of the device while it is functioning,
without interruption to the user.
Single Event Effects (SEEs) also include Single Event Functional Interrupts (SEFIs). These are
events that cause an interruption of the normal functionality of the device, and remain present
until the device is fully reconfigured. Though the device must be taken “off-line” for a complete
reconfiguration cycle, the removal of power is not required to reset the SEFI condition. These
events must also be detected and mitigated. One form of SEFI is the Power On Reset (POR)
event, which clears the configuration of the entire device.
Persistent bits are those bits in the configuration memory whose upset results in errors that will
not flush out of the system. One of the simplest ways of mitigating this type of error is a system
reset [7]. Our application allows us to make use of an FPGA reset on every cycle, since the
algorithm operates on one frame of data with a single image memory in DRAM. Even this
area of the design is not very sensitive to upsets, since the DRAM is protected with Error
Detection And Correction (EDAC) logic, its data is self-flushing every 400 milliseconds, and
the algorithms are highly tolerant of upsets to individual pixels. The microBlaze™ data and
instructions stored in SRAM are more sensitive to upsets, but are also EDAC protected.
The predicted upset rates for the Virtex-II 6000 devices are shown in Table 1. These values were
generated using CREME96 for a 600 km, 28-degree orbit. The total number of upsets is
expected to be 0.895 per device-day. However, 75% of the upsets will be to the configuration
memory, and typically only 1 in 10 of these upsets results in a functional error. This 1/10 factor
is accounted for in the table, making the effective rate drop to 0.217 upsets per device-day.
XQVR6000 Predicted Upset Rates

Function   Upset Rate (upset/bit/10yr)   No. Bits    Upsets/dev/10yr   Upsets/dev/day
CFG                 1.68E-04             16395508       2.75E+03          7.53E-01
Fcn. CFG            1.68E-05             16395508       2.75E+02          7.53E-02
BRAM                1.89E-04              2654208       5.01E+02          1.37E-01
POR                 2.72E-03                    1       2.72E-03          7.46E-07
SMAP                3.66E-03                    1       3.66E-03          1.00E-06
JCFG                9.31E-04                    1       9.31E-04          2.55E-07
CLB-FF              2.23E-04                76032       1.70E+01          4.65E-03
TOTAL (using Fcn. CFG)                                  7.93E+02          2.17E-01

Table 1. Predicted Upsets/Device-Day (with 1/10 Configuration bit factor)
At the board level, the risk of SEUs corrupting the data is reduced by dual Coprocessor
redundancy for each algorithm, by configuration bit checking, and by SEFI protection.
The dual-redundancy scheme calls for applying the identical algorithm and input data to two
Coprocessors and comparing the pose data outputs of each pair after each frame calculation. If
these values do not match, they are sent with a flag indicating this fact so they can be dealt with
downstream (see the SEU Mitigation at System Level section).
To prevent the accumulation of upsets, the Control PLA and Configuration/Interface PLA work
together to perform configuration bit checking. Typically, the Control PLA reads back the
configuration bits of each FPGA frame, computes a CRC for that frame, and compares it to the
CRC value in the FLASH memory. If a discrepancy is detected, a signal notifies the
Configuration/Interface PLA to send configuration bits to that FPGA frame alone (a process
known as partial reconfiguration [8][9]).
The time to perform this partial reconfiguration is less than 1 millisecond, compared to the
320 ms it takes to perform a full reconfiguration of the entire chip.
[Figure 4 diagram: Configuration Memory (FLASH) holds configurations CFG1a-c and
CFG2a-c. A CFG Verifier and CFG Loader manage the configuration bits, and a Configuration
Checker compares the bitstreams of each redundant pair. Video from Cam 1 and Cam 2 feeds
redundant algorithm copies (Algo 1 A/B, Algo 2 A/B) hosted in radiation-tolerant RAM-based
FPGAs. A Results Checker compares Pose 1A with 1B and Pose 2A with 2B, and a Pose
Selector sends "Pose 1, OK" and "Pose 2, OK" to the Flight Computer.]
Figure 4 – Dual Voting SEU Mitigation Block Diagram
The Dual Voting SEU Mitigation Block Diagram in Figure 4 shows the scheme we designed for
detecting and recovering from SEUs. Rather than using a conventional CRC technique, the two
bitstreams are compared with each other, which results in fast detection of upsets.
The dual-redundant scheme provides coverage for all SEUs, so most chip-level SEU mitigation
schemes are not required. This is important to the HRV program, where cost and schedule are
critical. Typically, chip-level techniques consist of applying TMR to critical portions of the
design such as state machines, which is very difficult to do with the Handel-C flow because data
path and processing are combined. Full module redundancy, rather than selective TMR of flip-flops
alone, was the selected approach, because all the logic paths, including I/O cells and not
just flip-flops, are susceptible to SEUs [6].
Providing triplicate I/O paths using the Xilinx XTMR tool is expensive in pin usage and may
introduce skew on the board that must be controlled. Adding EDAC to DRAM interfaces
increases complexity and power, as does internal parity generation and checking. Full TMR of
logic within the chip is not likely to fit with any of the algorithms. The Xilinx XTMR tool will
still be used to remove half-latches and the use of SRL16s as RAM.
The degree of fault tolerance implemented is defined by the system-level requirements. The
following questions must be answered: Does the system only need to detect an error? How
quickly must the system respond to an error? Must the system also correct the error? [10] In the
case of our Vision Processing algorithms, the system must detect and respond to an error within
a second, since five images are received in this period and the algorithm must process the latest
image; erroneous data is simply marked as bad and discarded (with the last known good value
used instead). Since multiple algorithms and data sets run in parallel, it is also possible to use
the previous pose outputs to determine which of two disagreeing results is in error.
Results
The requirements on NFIR were to maintain a 30% margin in FPGA sizing resources (70%
maximum utilization) and to produce pose estimates in real time at a 5 Hz rate. Both of these
requirements were met.
The NFIR sizing results are summarized in Table 2. These results reflect a highly parallel
configuration in which four instances of the Lucas-Kanade tracker were used. The figure for
BlockRAM utilization reflects an intentional allocation of all BlockRAM resources to the data
and instruction caches within the microBlaze™ microprocessor. With no BlockRAM allocated
to these caches, this figure drops to below 50%.
                       LUTs    Flip Flops    Multipliers    BlockRAMs
MicroBlaze Processor   2700       2000            4             33
Front End              8000       4000            0             42
LK Tracker (4)        13580      10580           70             68
Total                 24280      16580           74            143
Available             67584      67584          144            144
Percentage Utilized     36%        25%          51%            99%

Table 2 - NFIR Sizing Results
The NFIR timing results for the various sequential processes are summarized in Table 3. The
resulting total is 8.2 million cycles; at a 40 MHz clock this represents 205 milliseconds, just over
the 200 ms required for a 5 Hz rate. At a 50 MHz FPGA clock, the algorithm runs at 6 Hz,
meeting the required 5 Hz rate with a 20% margin.
Function Timed          Cycles/Loop    Loops    Total Cycles
Project Model Points          26000        1           26000
Lktracker (hardware)        2000000        1         2000000
FindExtrinsic               3078000        1         3078000
Project edges                120000        3          360000
FindEdges (hardware)         400000        1          400000
Project ellipses              80000        3          240000
computeAllFis                180000        2          360000
computeVsumCsum              280000        2          560000
computeAlpha                 230000        2          460000
UpdatePose                     6000        3           18000
getAllErrors                 240000        3          720000
Total                       6640000                  8222000

Table 3 – NFIR Timing Results
Summary
In this paper, we reported the design trades and resulting architecture of a multiple FPGA based
reconfigurable computing system running complex autonomous docking algorithms. The trades
and design were completed, a card was built with parts that have flight equivalents, and the
Natural Features Image Recognition (NFIR) algorithm was demonstrated. The target goal of
operating at 5 Hz was met with spare capacity in the FPGA lookup table and gates, and these
results were obtained prior to the program Critical Design Review.
References
1. M. Caffrey, “A Space Based Reconfigurable Radio”, International Conference on Military
and Aerospace Programmable Logic Devices (MAPLD), 2002
2. B. Draper, R. Beveridge, W. Bohm, C. Ross, M Chawathe, “Implementing Image
Applications on FPGAs”, International Conference on Pattern Recognition, Quebec City,
August 2002
3. R. Manner, M. Sessler, H. Simmler, “Pattern Recognition and Reconstruction on a FPGA
Coprocessor Board”, 2000 IEEE Symposium on Field-Programmable Custom Computing
Machines
4. T. Flatley, “Developing Reconfigurable Computing Systems for Space Flight Applications”,
Abstract, International Conference on Military and Aerospace Programmable Logic Devices
(MAPLD), 2002
5. T. Gallagher, “Joint High & Low Level Language Tools for System on Chip FPGA
Devices”, Bishop’s Lodge Workshop in Distributed Embedded Computing, 2005
6. P. Graham, M. Caffrey, J. Zimmerman, D. Johnson “Consequences and Categories of SRAM
FPGA Configuration SEUs”, International Conference on Military and Aerospace
Programmable Logic Devices (MAPLD), 2003
7. D. Johnson, K. Morgan, M. Wirthlin, M. Caffrey, P. Graham, “Detection of Configuration
Memory Upsets Causing Persistent Errors in SRAM-based FPGAs”, International
Conference on Military and Aerospace Programmable Logic Devices (MAPLD), 2004
8. Xilinx XAPP216: Correcting Single-Event Upsets Through Virtex Partial Configuration
9. Xilinx XAPP197: TMR Design Techniques for Virtex FPGAs
10. M. Berg, “A Simplified Approach to Fault Tolerant State Machine Design for Single Event
Upsets”, International Conference on Military and Aerospace Programmable Logic Devices
(MAPLD), 2002