
Analysis and Acceleration for Target Recognition
by
Jairam Ramanathan
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2001
© Jairam Ramanathan, MMI. All rights reserved.
The author hereby grants to MIT permission to reproduce and distribute publicly
paper and electronic copies of this thesis document in whole or in part.
Author ......................................
Department of Electrical Engineering and Computer Science
February 6, 2001
Certified by.......
Paul D. Fiore
Senior Principal Engineer, BAE SYSTEMS
VI-A Company Thesis Supervisor
Certified by.......
Dan E. Dudgeon
Senior Staff, MIT Lincoln Laboratory
MIT Thesis Supervisor
Accepted by.........
Arthur C. Smith
Chairman, Department Committee on Graduate Students
Analysis and Acceleration for Target Recognition
by
Jairam Ramanathan
Submitted to the Department of Electrical Engineering and Computer Science
on February 6, 2001, in partial fulfillment of the
requirements for the degree of
Master of Engineering
Abstract
This thesis examined the hardware acceleration properties of automatic target recognition algorithms. It specifically focused on an algorithm produced by the System-Oriented High Range Resolution Automatic Recognition Program at Wright-Patterson Air Force Base. Analysis of this algorithm determined which calculations would be most suitable for, and derive the most benefit from, hardware acceleration. The algorithm was appropriately modified and restructured to ease its hardware translation without significantly affecting its recognition performance. A portion of the algorithm was then implemented and executed on a custom hardware board containing multiple field-programmable gate arrays, and the timing and algorithmic performance were compared with the corresponding software execution statistics. The near order-of-magnitude speedup showed the viability of custom hardware acceleration for target recognition algorithms.
VI-A Company Thesis Supervisor: Paul D. Fiore
Title: Senior Principal Engineer, BAE SYSTEMS
MIT Thesis Supervisor: Dan E. Dudgeon
Title: Senior Staff, MIT Lincoln Laboratory
Acknowledgments
Generous financial support for my work was provided by BAE SYSTEMS (formerly
Sanders, a Lockheed Martin Company). In particular at BAE SYSTEMS, I would
like to thank Dr. Cory Myers for his advice and guidance over the past three years.
I would also like to thank my thesis advisors, Dr. Paul Fiore of BAE SYSTEMS and
Dr. Dan Dudgeon of MIT Lincoln Laboratory, for their help in bringing my thesis to
completion.
I would like to thank Ken Smith, John Zaino, and Marion Reine of BAE SYSTEMS, and Eric Pauer, formerly of Sanders, for their assistance and advice while I
was undertaking my research.
I would finally like to thank my parents for their constant support and encouragement while I pursued my goals.
Contents

1 Introduction 9
  1.1 Automatic Target Recognition 9
  1.2 Acceleration Issues 11
  1.3 Overview of the Thesis 13

2 Project Background 15
  2.1 Initial Implementation 16
  2.2 Wordlength Analysis 19
    2.2.1 Bit Precision Assignment 19
    2.2.2 Scheduling 21
  2.3 Final Implementation 23
  2.4 Synthesis and Generation 24
  2.5 Target Device 24
  2.6 Summary 25

3 SHARP Algorithm 26
  3.1 SAR to HRR Conversion 28
  3.2 Least Squares Fitting 29
  3.3 Preprocessing 31
    3.3.1 Power Transformation 32
  3.4 Algorithm Performance 33
  3.5 Summary 35

4 Software Analysis 36
  4.1 Timing Analysis 37
  4.2 Least Squares Fit 38
    4.2.1 QR Factorization Approach 39
    4.2.2 Matched Filter Approach 40
  4.3 Power Transformation 46
  4.4 Revised Timing Analysis 46
  4.5 Summary 47

5 Hardware-Specific Analysis 49
  5.1 Power Transformation 50
    5.1.1 Vector Transformation 50
    5.1.2 Vector Averaging 54
  5.2 Least Squares Fitting 55
    5.2.1 Bias Removal and Magnitude Normalization 55
    5.2.2 Correlation 58
  5.3 Summary 61

6 Performance Comparison 62
  6.1 Bias Removal and Normalization 63
  6.2 Correlation 68
  6.3 Final Comparison Estimates 72
    6.3.1 Algorithm Performance 72
    6.3.2 Algorithm Timing 73
  6.4 Proposed Improvements 74
  6.5 Summary 75

7 Summary 76

A Acronyms 78
List of Figures

1-1 Overview of forced decision and threshold decision methods. (a) Forced decision. (b) Threshold decision. 10
1-2 SHARP Approach 11
2-1 Ptolemy Screenshots. (a) ACS Domain Palette (b) Example Dataflow Diagram 18
2-2 Wordlength Analysis Tool Outputs. (a) All cost-variance pairs (b) Pareto-optimal cost-variance pairs 22
3-1 Segmentation of SAR image to simulate Doppler filtering 28
3-2 SHARP Algorithm Block Diagram 29
4-1 Range shifting of a bmp263 target by using 70-windows. (a) Original 80-vector. (b)-(l) 11 70-wide range shifts from rightmost to leftmost. 43
4-2 Range shifting of a bmp263 target by using 80-windows. (a) Original 90-vector. (b)-(l) 11 80-wide range shifts from rightmost to leftmost. 45
5-1 Power Transformation Block Diagram 51
5-2 Divider Output - Floating Point and <4.8> Fixed Point 52
5-3 LUT Output - Floating Point Input and <4.8> Fixed Point Input 53
5-4 LUT Output - Floating Point and <1.8> Fixed Point 53
5-5 LUT Output - Floating Point Input and <1.8> Transformed Divider Input 54
5-6 Normalization Block Diagram 56
5-7 Ptolemy Normalization Design 57
5-8 Correlation Block Diagram 58
5-9 Ptolemy Correlation Design 60
6-1 Execution Schedule for the Normalization Routine 64
6-2 FPGA Occupancy for the Normalization Routine 65
6-3 Input Signature to Normalization Routine 66
6-4 Normalization Routine - Software and Hardware 66
6-5 Normalization Routine - Floating Point and Modified Fixed Point 67
6-6 Execution Schedule for the Correlation Routine 69
6-7 FPGA Occupancy for the Correlation Routine 70
6-8 Correlation Target Signature 71
6-9 Correlation Template Signature 71
6-10 Correlation by Range Shift 72
List of Tables

3.1 Confusion Matrix by Vehicle Using Original AFRL Code 34
3.2 Confusion Matrix by Vehicle Type Using Original AFRL Code 34
4.1 Timing Analysis of Original Software 38
4.2 Confusion Matrix by Vehicle Using Matched Filter 44
4.3 Confusion Matrix by Vehicle Type Using Matched Filter 44
4.4 Timing Analysis of Optimized Software 47
5.1 Elementary Function Blocks 50
6.1 Normalization Design Bit Precisions 63
6.2 Correlation Design Bit Precisions 68
6.3 Confusion Matrix by Vehicle Using Fixed-Point Simulation 72
6.4 Confusion Matrix by Vehicle Type Using Fixed-Point Simulation 73
Chapter 1
Introduction
Automatic target recognition (ATR) is an important part of many military applications. The ability to discriminate between hostile and friendly targets as well as
the ability to differentiate between various hostile targets are often mission-critical
objectives for real-time systems. As such, it is a common priority to produce the best
possible recognition performance in the least possible time.
In terms of execution speed, there are limits to what can be currently accomplished in software. Many high-speed applications are turning to dedicated hardware
to provide an execution speed that software cannot attain.
ATR is a promising candidate for hardware execution. ATR systems are computationally intensive, but the computations performed are highly repetitive, so even a minor speedup in these computations can significantly improve overall execution time.
In this chapter, we present a brief introduction to ATR, examine the major issues
to be considered for acceleration and discuss our method of approach, and finally
present an outline for the remainder of the thesis.
1.1
Automatic Target Recognition
Target recognition entails classifying a target given characteristic signatures (templates) of several target classes. It is important to note that the number of different
types of observable targets will generally exceed the number of targets for which templates are available. Consequently, a design choice must be made to deal with the situation in which the target does not belong to any of the template classes. There are
two reasonable options: the identifier can either match the target as well as possible
to one of the template classes (forced-decision) or it can leave the target unclassified
(threshold-decision). These approaches are shown in Figure 1-1.
Figure 1-1: Overview of forced decision and threshold decision methods. (a) Forced
decision. (b) Threshold decision.
There are strengths and shortcomings to both threshold-decision and forced-decision approaches. Forced-decision methods are guaranteed to classify all known targets. However, they will also (incorrectly) identify all unknown targets. Conversely, threshold-decision methods will leave most unknown targets unclassified.
However, they will generally also fail to identify some known targets.
It is computationally more intensive to implement a threshold-decision approach.
Before classifying any target, the decision must be made whether that target should
be classified at all. In many cases, all the calculations for a forced-decision approach
must be repeated in a threshold-decision approach, but end up unused due to the
threshold comparison. Consequently, the method chosen for a particular application
depends strongly on the performance and computational specifications. The extent to
which the template set spans the possible target set also strongly impacts the choice
of approach.
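To make the distinction concrete, the two decision rules can be sketched as follows. This is an illustrative sketch only; the match scores and threshold value are hypothetical and do not represent the SHARP scoring function.

```python
import numpy as np

def forced_decision(scores):
    """Forced decision: always assign the best-matching template class."""
    return int(np.argmax(scores))

def threshold_decision(scores, threshold):
    """Threshold decision: assign the best class only if its score
    clears the threshold; otherwise leave the target unclassified."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None

# Hypothetical match scores against three template classes.
scores = np.array([0.31, 0.72, 0.55])
forced = forced_decision(scores)            # always classifies
guarded = threshold_decision(scores, 0.9)   # declines: best score below 0.9
```

Note that the threshold rule performs all of the forced-decision work before possibly discarding the result, which is the extra computational burden described above.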
Figure 1-2: SHARP Approach
The Air Force Research Laboratory (AFRL) at Wright-Patterson Air Force Base
(WPAFB) has developed an algorithm that implements a forced-decision method.
The algorithm attempts to classify targets given their high range resolution (HRR)
radar signatures.
Their approach is called the System-Oriented HRR Automatic
Recognition Program (SHARP) [32]. The basic premise of the SHARP approach is
shown in Figure 1-2.
The SHARP objective is to develop and mature advanced air-to-ground HRR ATR
capabilities for transition into suitable operational Air Force airborne platforms. It is
clear that in an application of this type, it is desirable to minimize the latency between
data collection and classification. Therefore, this thesis examined a framework that
provided the SHARP algorithm with a significant speedup via hardware acceleration.
Field-programmable gate arrays (FPGAs) have received a great deal of attention
in recent years. FPGAs are hardware devices that contain a large array of small
configurable logic blocks (CLBs) surrounded by many routing resources [10]. Consequently, they can be easily reconfigured to perform functions required by the user.
This thesis used custom hardware containing FPGAs to achieve the hardware acceleration of the SHARP algorithm.
1.2
Acceleration Issues
The first stage of acceleration is purely at the software level. In order to properly reap
the benefits of hardware acceleration, an algorithm must be reorganized and modified
in a manner that is conducive to hardware mapping. In many cases, this reorganization actually provides a speed gain for software execution. However, this gain
is even more pronounced when demonstrated in the hardware domain. These types
of improvements are generally achieved by using alternate methods for calculations
that increase the computational efficiency of the algorithm. This research examined
the SHARP algorithm to determine what portions would benefit the most from an
algorithmic speedup, and then attempted to improve their efficiency.
Another class of algorithm modification may result in better performance when
implemented in hardware, but may actually slow the performance in software. Typically, these types of improvements are made by modifying the control flow of the
algorithm. Software implementations of algorithms are able to make full use of the
incoming data and alter their behaviors based on this data. However, it is expensive
and inefficient to change the steps of an algorithm in hardware. Designs in which
the flow of the data is independent of what the data contains are more suitable for
implementation in hardware and result in better performance than algorithms with
variable control flows. Every effort must be made to remove data dependencies from
the control flow in order to make the best use of the available hardware resources.
This research examined all data dependencies in the SHARP algorithm and designed
reasonable alternatives for dealing with them.
An additional factor that must be considered in the mapping of an algorithm to hardware is the precision of the data. While most software algorithms use floating-point calculations, it is currently impractical to implement floating-point operations
in hardware. Even assuming fixed-point arithmetic, decisions must be made regarding
the data precisions to be kept at various points in the algorithm. It is important to
note that whenever floating-point data is truncated to a fixed-width precision, error
is introduced. Therefore, bit precisions were chosen in a manner that minimized the
effect on the output.
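As a generic illustration of this truncation error (not the precision-assignment procedure used in this thesis), rounding a value onto a fixed-point grid with a given number of fractional bits introduces an error of at most half the quantization step:

```python
import numpy as np

def quantize(x, frac_bits):
    """Round x onto a fixed-point grid with 2**-frac_bits resolution.
    (Sign and integer bits are ignored here for simplicity.)"""
    step = 2.0 ** -frac_bits
    return np.round(x / step) * step

x = np.linspace(-1.0, 1.0, 101)
err = np.abs(x - quantize(x, 8))
max_err = err.max()  # bounded by half of one quantization step, i.e. 2**-9
```

The wider the wordlength, the smaller this bound, which is exactly the cost-versus-noise trade-off the precision assignment must balance.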
While the aforementioned considerations can be viewed as hardware independent,
it is essential that the actual specifics of the hardware be taken into consideration.
It is common practice to make use of libraries of basic hardware elements when
generating a large-scale design. It is possible that these libraries may not contain
some elements required by the design. In such cases, a decision must be made to
either create a design for the missing element or to alter the algorithm to make use
of existing blocks. These decisions generally must be made on a case-by-case basis
because the complexities of both approaches depend on the specific situation.
In addition, the physical hardware resources must also be considered. Many hardware boards contain more than one FPGA, so it is necessary to determine an efficient
way to partition the algorithm to make use of the available resources. Ideally, a design
would use as much of the FPGA area as possible, since speedup is generally proportional to area in FPGA designs. In addition, the transfer of data between FPGAs
must be considered. There are typically a limited number of interconnects between
FPGAs, restricting the amount of data that can be passed from one FPGA to another. This research attempted to partition the SHARP algorithm in a manner that
made use of the available resources while still facilitating analysis and debugging.
A final concern to be addressed is data transfer to and from the hardware. For
most applications, the time needed to write data to the memory of the hardware
board and read the output is not negligible. In fact, for small-scale designs, the data
transfer time may exceed the time actually spent on computation. Consequently, it
is important that the data transfer be managed so that the speed benefit gained by
executing the computation in hardware is not outweighed by the time lost in data
transfer.
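This trade-off can be summarized with a simple model (the timing figures below are hypothetical, chosen only to illustrate the point): the observed speedup divides the software time by the hardware compute time plus the transfer time.

```python
def effective_speedup(t_software, t_hardware, t_transfer):
    """Observed speedup once host<->board data transfer is included."""
    return t_software / (t_hardware + t_transfer)

# Hypothetical timings (ms): a 10x raw compute speedup...
raw = effective_speedup(100.0, 10.0, 0.0)
# ...shrinks sharply when transfer dominates a small-scale design.
small = effective_speedup(100.0, 10.0, 40.0)
```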
By addressing the above issues, this thesis shows that it is possible to produce a
hardware design that provides a significant speedup in execution time for the SHARP
algorithm, while not adversely affecting the algorithm's performance. As a result,
it will be more feasible to incorporate computationally intensive ATR methods in
practical real-time systems.
1.3
Overview of the Thesis
We begin in Chapter 2 by presenting our approach to generating hardware designs.
The approach was to utilize the tools produced by the Algorithm Analysis and Mapping Environment for Adaptive Computing Systems (ACS) [4] project at BAE SYSTEMS. In this chapter, we will present an overview of the methodology of the ACS
project.
In Chapter 3, we present a development of the SHARP algorithm. The necessary
computations are arranged in two separate components that execute sequentially in
order to facilitate analysis. We also present the results of execution of the SHARP
algorithm in order to measure the performance of the algorithm.
Chapter 4 begins the analysis of the SHARP algorithm in a way that prepares the
algorithm for hardware implementation, without making decisions that rely on the
use of any particular hardware device. All modifications to the SHARP algorithm are
performed strictly at the software level. We use a timing analysis in order to better
focus our efforts and present the modified performance and timing results at the end
of the chapter.
In Chapter 5, we discuss modifications and analyses that focus on the hardware
level. This chapter discusses both design decisions that would be necessary for any
hardware implementation as well as decisions that are specifically intended for the
hardware board used on the ACS program.
Chapter 6 presents the hardware implementation of the algorithm and compares
its performance with the performance of its software counterpart. We provide timing
estimates and discuss how they could be improved through further design modification. We also discuss the extent to which the hardware acceleration benefits the
algorithm as a whole.
Finally, in Chapter 7, we summarize our findings and discuss the results of this
research.
Chapter 2
Project Background
In order to create the hardware designs used to accelerate the SHARP algorithm, this
research made use of the tools developed by the Algorithm Analysis and Mapping
Environment for Adaptive Computing Systems (ACS) [4] project at BAE SYSTEMS.
The goal of this project is to decrease the amount of time spent in creating hardware
designs by automating much of the design process.
The ACS project has specifically concentrated on implementing digital signal processing (DSP) algorithms in field-programmable gate arrays (FPGAs), because FPGAs offer very high-speed processing at a small hardware cost [11]. To demonstrate
the potential of FPGA computation for DSP algorithms, the ACS tools were used
to implement a Winograd Discrete Fourier Transform (DFT) [24] and a linear FM
detector [25].
It has also been shown [30] that FPGAs are particularly suitable for ATR applications because logic can be configured down to the bit level. While FPGAs do
not offer the computing power of other technologies, such as application-specific integrated circuits (ASICs), their high speed-to-cost ratio makes them attractive to
developers. Additionally, unlike ASICs, which are programmed at the factory, FPGAs can be repeatedly reprogrammed by the user to suit the functionality needed
[10].
FPGAs have been used to accelerate both target detection [17] and target
recognition algorithms, typically using SAR imagery [6], [9], [30].
For these reasons, it was decided to employ the tools provided by the ACS project
in order to achieve hardware acceleration of the SHARP algorithm. Because development of the ACS tools was continually progressing, this work also involved contributing to the development of the software itself in order to reap the benefits of its capabilities.
This chapter will present the methodology of the ACS project. Section 2.1 will
describe the method used to lay out an algorithm in software as a starting point
from which hardware design can begin. Section 2.2 will present the tool used to
automatically set bit precisions throughout the algorithm as well as schedule the
execution of the algorithm components. Section 2.3 will describe how this information
was then employed to begin the generation of hardware-level components. Section 2.4
will show how the design process was brought to completion, resulting in full hardware
designs that are ready for implementation. Finally, in Section 2.5, we present specifics
of the target hardware device used on the ACS program.
2.1
Initial Implementation
The first step taken in the ACS program to create a hardware design is to lay out
the algorithm in Ptolemy [29], a software program developed at the University of
California at Berkeley. Ptolemy is a simulation tool that enables designers to build
and test large-scale systems using elementary building blocks. Since this is very much
the same approach taken when building large-scale hardware designs, the ACS project
created a new domain in Ptolemy that provided the functionality necessary to begin
hardware design. Reference [19] shows a very similar approach that used Khoros, produced by Khoral Research [16], as the simulation tool.
From an interface standpoint, the ACS domain of Ptolemy is very similar to its
other domains. Namely, a system is built up by placing elementary computational
blocks (such as adders, multipliers, etc.) on the workspace and connecting them together to build more complex functions. Figure 2-1 shows both a typical group
of blocks (called a "palette") and a dataflow, which will be used again later in this
thesis. However, the difference between the ACS domain and the other domains in
Ptolemy lies in the interpretation of the individual blocks.
Because Ptolemy is at its core a simulation environment, most domains are used
for simulation. For example, in the synchronous dataflow (SDF) domain, the functionality of an adder is to sum its inputs and produce the result as its output. However,
the goal of the ACS domain is not simulation; instead, the ACS domain tries to encapsulate the information necessary to implement an element in hardware within that
block. Consequently, at the simplest level of abstraction, the adder block contains
the instructions for building an adder in hardware.
Naturally, there are more layers of complexity to be addressed. There is no universal adder design capable of handling all situations. Different applications may require
different precisions on the input or output, or may require different input-to-output
latencies. Consequently, the ACS blocks have two primary functions. The first function is to generate files which can be used in conjunction with the wordlength analysis
tool to be discussed in Section 2.2 to help the designer configure the block to meet
the needs of the larger system. We discuss this function below. The second function
is to actually generate a design that fits the criteria that the designer has specified.
This function will be discussed in Section 2.3.
The intention of the ACS program was to make each element as configurable as
possible. For example, in addition to being able to set bit precisions on the input
and output, designers would be able to choose among different designs for the same
element, depending on whether area or latency was a priority. However, while the
blocks have been implemented in such a way that adding these capabilities is possible,
they are not currently implemented. Consequently, the main use of the first function
of the ACS blocks is to determine the bit precisions for the fixed-precision calculations
that occur in hardware.
The ACS blocks have many variable parameters. Many blocks operate on vectors
rather than scalars, so vector lengths can be specified. Particular blocks, such as a
gain element, may have parameters specific to their function, such as the gain level.
Additionally, all blocks have bit precision parameters.
If the designer knows the
precisions that are required for the system and knows that the underlying design will
Figure 2-1: Ptolemy Screenshots. (a) ACS Domain Palette (b) Example Dataflow Diagram
be able to handle those precisions, they can be manually entered and locked.
In many cases, the designer may not know what the precision requirements are for
a specific block. In these cases, the wordlength analysis tool discussed in Section 2.2
can be used to suggest wordlengths. In order to be able to use the wordlength tool,
the ACS blocks must be able to generate files that describe their input/output relationships. These files are produced as MATLAB [21] scripts. They consist of files that
calculate the output of the block from its input, calculate the possible output range
given the ranges of its inputs, characterize the input-to-output latency, describe any
restrictions imposed by the underlying design on the input and output wordlengths,
as well as estimate the space needed for the design based on its parameters. The
generation of these files will be discussed in Section 2.3.
2.2
Wordlength Analysis
The wordlength analysis tool serves two primary purposes. First, it is responsible for
choosing bit precisions for all inputs and outputs that have not already been locked
by the user. There are two components of the bit precision of a fixed-point number:
the major bit, indicating the significance of the highest order bit of the number in
two's complement form, and the wordlength, indicating the number of bits used to
quantize that number.
The second function of the tool is to generate scheduling information for the
design. This information is necessary to compensate for the different latencies that
may exist along different datapaths in the design. The schedule will be used in Section
2.3 to ensure that all inputs to a particular block arrive at the same time, regardless
of when they are first available in hardware. We discuss each of these functions below.
2.2.1
Bit Precision Assignment
The first step in assigning bit precisions is to choose all of the major bits. Typically,
this can be done easily using the information provided by each block concerning the
output range span. The tool requires that the input precisions be chosen in advance; consequently, the input range span is known. From this, the range spans for
every node in the design can be calculated. As a result, the major bits are set to the
minimum value that spans the required range.
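A minimal sketch of this range-driven assignment follows. It assumes a two's-complement format in which major bit m spans roughly [-2**m, 2**m); the function names and the adder example are illustrative, whereas the actual tool derives ranges from the MATLAB scripts generated by each block.

```python
import math

def propagate_add(range_a, range_b):
    """Output range of an adder, given input ranges as (lo, hi) pairs."""
    return (range_a[0] + range_b[0], range_a[1] + range_b[1])

def major_bit(lo, hi):
    """Smallest (conservative) major bit m whose two's-complement span
    [-2**m, 2**m) covers the interval [lo, hi]."""
    bound = max(abs(lo), abs(hi))
    if bound == 0:
        return 0
    return math.floor(math.log2(bound)) + 1

in_a, in_b = (-3.0, 5.0), (0.0, 4.0)
out = propagate_add(in_a, in_b)   # (-3.0, 9.0)
m = major_bit(*out)               # 4, since [-16, 16) covers [-3, 9]
```

Propagating ranges node by node in this fashion is what makes the major bits "essentially prechosen," leaving only the wordlengths to optimize.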
Once the major bits have been set, it is necessary to choose wordlengths for each
node. Unlike the major bits, which are essentially prechosen due to range analysis,
there is a large degree of flexibility in the wordlengths.
Any set of wordlengths
that satisfies the constraints specified by the different blocks is considered a feasible
combination. However, most feasible combinations are of little practical value. It is
therefore necessary to establish a way to score each combination and determine which
of these combinations are "optimal."
The method used to optimize the design wordlengths is described in detail in [10]
and [13]. The tool makes use of the Markov chain Monte Carlo (MCMC) method [7] to
choose wordlengths that are jointly optimized with regard to output noise variance
and hardware cost. Specifically, the only wordlength combinations of interest are
those such that no other design has both lower variance and lower hardware cost.
These combinations are known as "Pareto-optimal" points.
The MCMC method is used as follows. The starting point is a large number
of randomly chosen wordlength combinations. Note that all wordlengths that have
been explicitly chosen by the designer will not be changed. Combinations that violate
any of the constraints imposed by the individual blocks are rejected. Presumably,
unless the wordlength constraints are highly restrictive, many feasible combinations
remain.
At this point, the combinations are evaluated with regard to cost and variance.
Any of the combinations that are not Pareto-optimal are rejected. The remaining
combinations are randomly perturbed. From a given combination, the subsequent
combination will either be identical to the original, or will differ by 1 in one of the
wordlengths. Assuming there are n wordlengths to be chosen, this results in 2n + 1
possible perturbations. Again, all designs that violate constraints or are not Paretooptimal are rejected.
Through repeated iteration of this process, several steady-state points will be
reached. From any steady-state point, altering any of the wordlengths will either
violate a constraint or result in a higher output variance or hardware cost. These
jointly-optimized combinations are then presented to the designer. Typically, the
designer will want to choose the design with the highest cost that will still fit in
the physical hardware, since these designs will usually have the lowest variances.
However, in some circumstances, a lower cost design may be desired, so the user is
free to choose a different design. Figure 2-2 shows typical outputs of the wordlength
analysis. Figure 2-2a shows all of the cost-variance pairs encountered during iteration,
while Figure 2-2b shows only those pairs that are Pareto-optimal.
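The selection and perturbation steps can be sketched as follows. This is a toy version: the real tool [10], [13] scores candidates with hardware cost and output noise variance models, whereas the (cost, variance) tuples here are made-up numbers.

```python
import random

def pareto_optimal(points):
    """Keep (cost, variance) pairs not dominated by any other pair,
    i.e. no other pair is at least as good in both coordinates."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1]
                       for q in points)]

def perturb(wordlengths):
    """One random move: keep the combination unchanged, or change
    exactly one wordlength by +/-1 -- the 2n + 1 perturbations
    described above."""
    move = random.randrange(2 * len(wordlengths) + 1)
    w = list(wordlengths)
    if move > 0:
        index, sign = divmod(move - 1, 2)
        w[index] += 1 if sign else -1
    return w

# Hypothetical (cost, variance) pairs; the last one is dominated.
designs = [(1100, 3e28), (1200, 1e28), (1300, 9e27), (1250, 2e28)]
front = pareto_optimal(designs)
```

Iterating `perturb` on the surviving combinations and re-filtering with `pareto_optimal` is the repeated-iteration loop that drives the designs toward the steady-state points described above.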
Once the wordlengths have been set, all that remains is scheduling, after which the design is ready to be created as described in Section 2.3.
2.2.2
Scheduling
Recall that the purpose of scheduling is essentially to synchronize the inputs to all
blocks. This is achieved through use of the latency information provided by each
block. Latency generally depends on the wordlength chosen for a block. Therefore,
it is to be expected that different wordlength combinations will result in different
schedules.
Scheduling is achieved in a feed-forward sense, beginning with the data sources and
progressing through to the data sinks. All blocks that are connected to the sources
and have no other inputs can execute as soon as the schedule is launched.
This
rationale is applied to every block. Namely, a block can execute as soon as all of its
inputs are available. Using the latency information, it is reasonably straightforward
to schedule all the blocks.
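A minimal sketch of this feed-forward (as-soon-as-possible) scheduling follows. It assumes an acyclic dataflow graph with a known latency per block; the block names and latencies are hypothetical.

```python
def asap_schedule(predecessors, latency):
    """Assign each block the earliest cycle at which all of its inputs
    are available; blocks with no predecessors start at cycle 0."""
    start = {}
    def ready(block):
        if block not in start:
            start[block] = max((ready(p) + latency[p]
                                for p in predecessors[block]), default=0)
        return start[block]
    for block in predecessors:
        ready(block)
    return start

# Hypothetical design: 'add' consumes both the source directly and the
# multiplier output, so its two datapaths have unequal latencies.
predecessors = {"src": [], "mult": ["src"], "add": ["src", "mult"]}
latency = {"src": 0, "mult": 3, "add": 1}
schedule = asap_schedule(predecessors, latency)
# The direct src->add path must be delayed to meet the mult output.
delay = schedule["add"] - (schedule["src"] + latency["src"])
```

The `delay` computed for the shorter datapath is exactly the synchronizing delay that, as noted below, must be inserted in hardware and accounted for in the cost estimates.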
Note that in many cases, this scheduling implies that while the output of a block
may be available, it may not actually be consumed for many clock cycles. Consequently, some delay must be introduced to synchronize the datapaths in hardware.
The wordlength analysis tool includes these anticipated delays in its cost estimates.
The insertion of these delays will be discussed below.
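The feed-forward scheduling described above can be sketched as an earliest-start (as-soon-as-possible) traversal of the block graph: each block starts at the latest arrival time over its inputs, where an input arrives at the producer's start time plus the producer's latency. The block names and latencies below are invented, not taken from the actual tool.

```python
# Sketch of feed-forward (ASAP) scheduling from latency information.
# Assumes an acyclic block graph, as in a feed-forward datapath.

def asap_schedule(latency, edges):
    """latency: {block: cycles}; edges: (producer, consumer) pairs.
    Returns {block: earliest start cycle}."""
    consumers, indeg = {}, {b: 0 for b in latency}
    for p, c in edges:
        consumers.setdefault(p, []).append(c)
        indeg[c] += 1
    start = {b: 0 for b in latency}
    ready = [b for b in latency if indeg[b] == 0]   # the data sources
    while ready:
        b = ready.pop()
        for c in consumers.get(b, []):
            # The input from b arrives at start[b] + latency[b].
            start[c] = max(start[c], start[b] + latency[b])
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return start

lat = {"src": 0, "mul": 3, "add": 1, "sink": 0}
edges = [("src", "mul"), ("src", "add"), ("mul", "add"), ("add", "sink")]
sched = asap_schedule(lat, edges)
print(sched)  # "add" starts at cycle 3, "sink" at cycle 4
```

Because "add" must wait for "mul", the value on the direct src-to-add path sits idle for 3 cycles; in hardware, a matching delay would be inserted on that connection to keep the inputs aligned.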
Figure 2-2: Wordlength Analysis Tool Outputs. (a) All cost-variance pairs. (b) Pareto-optimal cost-variance pairs (Number of Designs = 9).
2.3
Final Implementation
Once the wordlengths and schedule have been generated, this information is once
again used by the ACS domain in Ptolemy to actually begin creating the design. The
bit precisions provided by the wordlength analysis tool are entered and locked into
Ptolemy. Ptolemy then uses the generation capability of each block to actually create
designs that meet the user specifications. This is currently achieved in one of two
ways. The block will either contain instructions for generating a custom-designed
Very High Speed Integrated Circuit (VHSIC) Hardware Design Language (VHDL)
file or initiate a call to the external Core Generator tool, produced by Xilinx [34]. An
invocation of the Core Generator will result in both a VHDL file and an Electronic
Design Interchange Format (EDIF) file.
It is from these underlying designs that the Ptolemy blocks were initially constructed. All of the information that the blocks provide to the wordlength analysis
tool is a property of either the custom-made designs or the generic designs provided
by the Core Generator. Consequently, the characteristics of the designs have been
encapsulated and analyzed before any designs are actually instantiated.
In addition to generating the individual hardware designs, Ptolemy is also responsible for connecting the individual blocks so that they function together as a system.
There are two factors to be considered in making connections between blocks. First,
any differences in bit precision on either end of a connection must be accounted for.
Note that the two ends of a connection will never differ in the position of the most
significant bit, only in wordlength. Additionally, the wordlength can only decrease across a connection, since an increase in
wordlength would only result in artificial bits being appended. Consequently, Ptolemy
ensures that all connections are made properly by trimming off any unneeded low-order bits.
Additionally, Ptolemy must implement the delays implied by the schedule. For
example, if a block produces its output at clock cycle 1, but the block that will
consume this output can only execute at clock cycle 3 due to latency on another
input, Ptolemy inserts a delay of 2 clock cycles on this connection to synchronize the
inputs. These additions to accommodate varying bit precisions and input latencies
are made through the inclusion of additional VHDL files.
2.4
Synthesis and Generation
There are two remaining tasks to be completed before the design can be used. The
VHDL files that have been generated are read into Synplify, a software tool produced
by Synplicity [28]. This tool is used to simulate the hardware execution of the design
to ensure that it performs as expected. Once the simulation results are approved, the
design is synthesized. Recall that only the VHDL files produced by invoking the Core
Generator have corresponding EDIF files. Synthesis of the design produces EDIF files
for the remaining VHDL files.
The final task is processing all of the EDIF files using the Alliance Series tools
produced by Xilinx [34]. This results in final bitstreams which can be used to program
the hardware and actually execute the design. Execution is accomplished through use
of software libraries provided by the board manufacturer to communicate with the
hardware. The Alliance tools were used to target an Annapolis Micro Systems [2]
Wildforce hardware board, which is described below.
2.5
Target Device
The hardware details of the Wildforce board are presented in [3]. Here we will only
present a brief description of those features of the Wildforce that will play a role in
our design process.
The Wildforce board has 5 Xilinx [34] XC4062 FPGAs, labeled CPE0 and PE1-PE4. However, CPE0 is used for the external interface: all start signals are sent to CPE0,
and CPE0 returns all the interrupts. Consequently, there are 4 FPGAs available for
use in design. Each FPGA has an associated memory bank, which can contain
262,144 (256K) 36-bit values.
Each FPGA has a 36-bit connection to its associated memory. Additionally, there
is a 32-bit connection from each FPGA to a local bus which can be used to transfer
data between FPGAs. Through the use of the Peripheral Component Interconnect
(PCI) bus, the external world also has 36-bit connections to each of the memory
banks.
2.6
Summary
This chapter presented the approach of the ACS project. Ptolemy was used in conjunction with the wordlength analysis tool to fully specify the design and then to
begin hardware generation. Several tools were then used to simulate and complete
generation of the design, resulting in bitstreams that could actually be implemented
in a reconfigurable hardware board.
We also presented some specifications of the Wildforce board, which was used
as the target device by the ACS program. The role of these specifications in our
hardware-specific analysis of the SHARP algorithm will be shown in Chapter 5. The
results of using the ACS tools for accelerating the SHARP algorithm will be presented
in Chapter 6.
Chapter 3
SHARP Algorithm
Moving target identification is a long-standing military objective. [26] provides a brief
overview of the recent history of ATR development, culminating in the current focus
on hardware implementation for target identification. The ability to classify an unknown target regardless of velocity, orientation, or heading has obvious applications.
Various approaches have been taken to attempt to handle this problem. Traditional
synthetic aperture radar (SAR) approaches have been tried [5]; however, SAR ATR of
moving targets has been less successful than SAR ATR of stationary targets. Due to
the long processing interval inherent with SAR imagery, target information is prone
to blurring, resulting in degraded ATR performance.
Target identification using moving target indication (MTI) radar has also been attempted [14]. MTI radar has the advantage over SAR imagery of being able to detect
and track moving targets. However, MTI radar lacks the bandwidth necessary for
target discrimination. Consequently, although targets can be detected, they cannot
be identified.
As a result, the Air Force Research Laboratory (AFRL) at Wright-Patterson Air
Force Base (WPAFB) has begun investigating the potential of high range resolution
(HRR) radar for the moving target problem [22]. HRR radar is well-suited for identifying moving targets for two reasons. Unlike SAR, HRR radar signatures can be
formed quickly, allowing the ground clutter to be separated from the target through
the use of Doppler filtering [20]. Additionally, because of its large range resolution,
HRR provides the bandwidth lacking in MTI radar, making target discrimination
feasible.
The AFRL initiated the System-oriented HRR Automatic Recognition Program
(SHARP) [32] in order to develop HRR ATR capability. As part of the SHARP program, an algorithm was designed that implemented a forced-decision ATR approach.
The algorithm classifies targets based on their signatures and aspect angles using an
extensive template set.
The SHARP algorithm makes use of the dataset from the Moving and Stationary
Target Acquisition and Recognition (MSTAR) [15] database to measure its performance. The subset of the MSTAR dataset used had seven classes, consisting of three
BMP2 armored personnel carriers (APCs), one BTR70 APC, and three T72 main
battle tanks. There were two sets of data available: the training set and the test set.
The training set was taken at a 17° depression angle and consisted of roughly 230
different aspect angles for each class spread across the total 360° span. The test set
was taken at a 15° depression angle and consisted of roughly 195 aspect angles for
each class.
The results of using the SHARP algorithm to classify the entire test set are stored
in a confusion matrix. Confusion matrices are described in [8] and provide a way
to measure the performance of a classification algorithm. The rows of the confusion
matrix represent the test set, while the columns represent the template set. The value
in row i and column j of the matrix is the percentage of signatures in Class i that
were classified as Class j.
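A minimal sketch of how such a confusion matrix is built: entry (i, j) is the percentage of Class-i test signatures assigned to Class j. The labels below are invented examples, not the actual MSTAR results.

```python
# Hypothetical sketch: building a confusion matrix from (actual, predicted)
# label pairs, with each row normalized to percentages.

def confusion_matrix(actual, predicted, classes):
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return {a: {p: 100.0 * counts[a][p] / sum(counts[a].values())
                for p in classes} for a in classes}

actual    = ["BMP2", "BMP2", "BMP2", "T72"]
predicted = ["BMP2", "T72",  "BMP2", "T72"]
cm = confusion_matrix(actual, predicted, ["BMP2", "T72"])
print(cm["BMP2"]["BMP2"])  # ~66.7: two of the three BMP2 signatures were correct
```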
The MSTAR data was available as SAR imagery rather than HRR signatures.
In Section 3.1, we discuss the calculations necessary to obtain simulated HRR data
from the SAR images. These calculations were necessary so that there was data with
which to test the algorithm performance, but they were not considered part of the
algorithm (which is intended to operate on real HRR data). The SHARP algorithm
consists of some preprocessing and a least squares fitting, but we present these out of
order in Sections 3.2 and 3.3 because their development is more intuitive in this order.
Finally, in Section 3.4 we present the results of executing the SHARP algorithm on
the MSTAR dataset.
3.1
SAR to HRR Conversion
The approach of the SHARP algorithm is to employ the characteristics of HRR
radar signatures to achieve better results for moving target recognition. However,
the MSTAR dataset used to determine algorithm performance consisted only of SAR
imagery. It was therefore necessary to perform some preprocessing of the data in
order to be able to simulate HRR data and get a better estimate of the algorithm
performance.
The method used to obtain simulated HRR data from the available SAR images is
discussed in [14] and [32]. The dimensions of the original SAR images were 128×128.
These were subsampled by taking the center 101×70 pixels where the actual target
was contained, as shown in Figure 3-1. This was done to simulate the results of the
Doppler filtering which would be performed on actual HRR signatures to separate
the target from the ground clutter. The subimages were then zero padded back up
to 128×128 so that an inverse Fast Fourier Transform (IFFT) could be taken. The
IFFT was taken on the dimension that was 70 pixels wide before zero padding and
resulted in range signatures collected over the radar aperture (the dimension that was
originally 101 pixels wide).
Figure 3-1: Segmentation of SAR image to simulate Doppler filtering
The resulting signatures were finally deweighted in angle using an inverse Taylor
Figure 3-2: SHARP Algorithm Block Diagram
window [31] to make the energy of the signatures uniform. The result was a 128×128
matrix, representing 128 signatures that were each 128 gates wide in range. It is important
to note that only the center 101×70 of this matrix represented real data, while the
rest was zero padding. This 128×128 matrix was used as the input to the algorithm;
it would be compressed to a single signature, which was then used for classification.
The method of performing this compression will be discussed in Section 3.3.
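The conversion just described can be sketched numerically. This is an illustrative approximation under stated assumptions: the exact centering conventions are assumed, and a caller-supplied (here omitted, i.e., flat) window stands in for the inverse Taylor deweighting.

```python
import numpy as np

# Sketch of the SAR-to-HRR conversion: keep the center 101x70 target region
# of a 128x128 SAR image, zero-pad back to 128x128, inverse-FFT along the
# 70-wide (range) dimension, and optionally deweight in angle.

def sar_to_hrr(sar, window=None):
    rows, cols = sar.shape                         # expected 128 x 128
    sub = np.zeros_like(sar, dtype=complex)
    r0, c0 = (rows - 101) // 2, (cols - 70) // 2   # assumed centering
    sub[r0:r0 + 101, c0:c0 + 70] = sar[r0:r0 + 101, c0:c0 + 70]
    profiles = np.fft.ifft(sub, axis=1)            # range compression per angle
    if window is not None:
        profiles *= window[:, None]                # deweighting across angle
    return profiles                                # 128 signatures, 128-wide

hrr = sar_to_hrr(np.ones((128, 128)))
print(hrr.shape)  # (128, 128)
```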
3.2
Least Squares Fitting
The dataflow of the SHARP algorithm is shown in Figure 3-2, a revision of Figure
1-1. We begin by presenting the least squares fit routine.
The SHARP algorithm minimizes an error metric and bases its classification decision on that minimum error. Consequently, it makes sense to begin our development
of the algorithm by defining this metric. Given a target vector d and a template
vector m, the SHARP algorithm tries to fit the vectors to each other by altering the
bias and magnitude of $\mathbf{m}$. Specifically, the algorithm determines an optimal offset $k_1$
and an optimal gain $k_2$ such that $k_1 + k_2 m_i \approx d_i$, or equivalently,

$$[\,\mathbf{1}\ \ \mathbf{m}\,]\,\mathbf{k} \approx \mathbf{d}. \qquad (3.1)$$
Since in most cases it is not possible to find a k such that Equation 3.1 is exactly
satisfied, the least squares method provides a k that minimizes the error between the
two sides of Equation 3.1. This solution is developed in [27] and is given by

$$\mathbf{k} = \left([\,\mathbf{1}\ \ \mathbf{m}\,]^T [\,\mathbf{1}\ \ \mathbf{m}\,]\right)^{-1} [\,\mathbf{1}\ \ \mathbf{m}\,]^T \mathbf{d}. \qquad (3.2)$$
Our primary focus is not to determine the least squares solution to this equation, but
to determine the error associated with that solution because our objective is to find
the minimum error. Naturally, the error is dependent on the choice of k and is given
by

$$\epsilon^2 = \left([\,\mathbf{1}\ \ \mathbf{m}\,]\,\mathbf{k} - \mathbf{d}\right)^T \left([\,\mathbf{1}\ \ \mathbf{m}\,]\,\mathbf{k} - \mathbf{d}\right). \qquad (3.3)$$
Consequently, the SHARP algorithm determines the least squares solution k to Equation 3.1 and then evaluates Equation 3.3 to determine the resultant error.
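Equations 3.1 through 3.3 can be checked with a small numerical sketch: solve for the optimal offset and gain fitting the template to the target, then evaluate the residual error of that fit. The vectors here are invented toy data.

```python
import numpy as np

# Sketch of the least squares fit of Equations 3.1-3.3: find k = [k1, k2]
# minimizing ||[1 m] k - d||^2, then return the fit and its squared error.

def ls_fit_error(m, d):
    A = np.column_stack([np.ones_like(m), m])   # the matrix [1 m]
    k = np.linalg.solve(A.T @ A, A.T @ d)       # normal equations (Eq. 3.2)
    r = A @ k - d
    return k, float(r @ r)                      # squared error (Eq. 3.3)

m = np.array([1.0, 2.0, 3.0, 4.0])
d = 2.0 * m + 5.0                               # an exact affine match
k, err = ls_fit_error(m, d)
print(k, err)   # k ~ [5, 2] (offset, gain), err ~ 0
```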
Now that we have defined a metric by which to compare the various template
signatures, we must define the template set to which this metric will be applied. The
template set spans two dimensions which we must take into account: aspect angle
and range. Clearly, one possible solution is the brute force attempt; we can consider
template vectors spanning all aspect angles and consider all possible range shifts and
minimize the error across all the resulting comparisons. However, this is a highly
inefficient approach that fails to utilize knowledge of the data.
The SHARP algorithm assumes that the signal information is contained in the
center 70 range gates of the entire 128-gate span. The algorithm further assumes that
the templates are already close to being correctly aligned. Consequently, the decision
was made to only consider shifts within five range gates. This results in a dramatic
savings in computation when compared with considering all 139 possible range shifts.
In order to facilitate performing the eleven necessary range shifts, the target vector
is padded on each end with five zeros to produce an 80-wide vector. The eleven shifts
are generated by taking subvectors (1-70, 2-71, ... , 11-80) of the padded vector.
To cut down on the number of aspects we must consider, we make use of the
fact that we know the aspect angle of the target vector. It therefore makes sense to
choose some subset of angles centered around the target aspect. In fact, [1] shows
that performance may actually be improved by constraining the aspect angles to avoid
mismatches that may occur at drastically different angles. The SHARP algorithm
only considers templates whose aspects are within 5° of the aspect angle of the target.
This is again a huge savings, since a 360-degree span has been reduced to a 10-degree
span.
The result of these decisions is to reduce the number of error calculations that
must be computed to a manageable number. The template data contained signatures
that were multiples of 1° apart in aspect. Consequently, the maximum number of
aspect angles to be considered for any given target aspect angle a is eleven (i.e.,
a − 5, a − 4, ..., a + 5). Therefore, we see that for a given target signature, we must
perform at most 11 (aspects) × 11 (ranges) × 7 (classes) = 847 error calculations.
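The candidate set just described can be enumerated directly: at most 11 aspects, 11 range shifts, and 7 classes per target signature. The function below is a sketch; the parameter names are invented.

```python
# Sketch of the search space per target signature at aspect angle a:
# every (class, aspect, range shift) triple that must be scored.

def candidates(a, n_classes=7, aspect_win=5, shift_win=5):
    return [(c, asp, s)
            for c in range(n_classes)
            for asp in range(a - aspect_win, a + aspect_win + 1)
            for s in range(-shift_win, shift_win + 1)]

print(len(candidates(90)))  # 7 * 11 * 11 = 847 error calculations
```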
3.3
Preprocessing
We recall that the input data was available as a 128×128 matrix. Additionally, we
know that the least squares fitting component of the SHARP algorithm expects the
data as a 70-wide (or 80-wide, depending on zero padding) vector. Consequently, it is
apparent that some preprocessing had to be done on the input data before the least
squares fit could be performed. Since we know that only the center 101×70 of the
input matrix represented the target data, only this submatrix was used for processing.
The first step in the preprocessing was to remove automatic gain control and
range effects from the signatures in order to improve classifier performance. This was
achieved through a normalization. Recall that each input signature r was 70-wide
and contained range gate magnitudes. From this vector, a power vector p was formed
such that $p_i = r_i^2$. The root mean square (RMS) amplitude of $\mathbf{p}$ was then calculated
and is given by

$$p_{\mathrm{rms}} = \sqrt{\frac{1}{128}\sum_{i=1}^{70} p_i}, \qquad (3.4)$$

where the $\frac{1}{128}$ factor is due to the mean being taken over all 128 range gates (including
zero padding). The power vector was then normalized by this RMS amplitude to
obtain

$$\mathbf{n} = \frac{\mathbf{p}}{p_{\mathrm{rms}}}. \qquad (3.5)$$

Combining Equations 3.4 and 3.5, we obtain

$$\mathbf{n} = \frac{\mathbf{p}}{\sqrt{\frac{1}{128}\sum_{i=1}^{70} p_i}}. \qquad (3.6)$$

We finally substitute for $\mathbf{p}$ and obtain

$$\mathbf{n} = \frac{\mathbf{r}^2}{\sqrt{\frac{1}{128}\sum_{i=1}^{70} r_i^2}}, \qquad (3.7)$$

where the squaring is done elementwise.
Before the resulting signatures could be averaged to produce a single vector, an
additional transformation had to be performed to satisfy the constraints of the least
squares fit routine. This is discussed below in Section 3.3.1.
3.3.1
Power Transformation
It is important to note that using the method of error calculation above made the
SHARP algorithm a Gaussian classifier. As shown in [33], this means that the incoming data was assumed to have a Gaussian distribution. Equivalently, the incoming
data had to be completely characterized by its first and second moments, i.e., its
mean ($\mu$) and variance ($\sigma^2$). This assumption was implicitly made by using Equation
3.3 to perform the least squares fit. Determining the optimal bias $k_1$ was equivalent
to matching the means of the two signatures because the new mean $\mu_{\mathrm{new}}$ was equal
to $k_1$. Similarly, finding the optimal gain $k_2$ was equivalent to matching the variances
because the new variance $\sigma^2_{\mathrm{new}}$ was equal to $k_2^2\,\sigma^2$.
Although the data was assumed to be Gaussian, it has been shown that the
statistics of HRR data are better modeled as having a Rayleigh distribution [22].
Consequently, in order to maximize the performance of the classifier, it was necessary
to transform the data. Empirical studies conducted in [1] determined that a transform of the form $t = n^c$ would transform the distribution such that it more closely
resembled a Gaussian distribution. Specifically, using $c = 0.2$ resulted in improved
classifier performance.
Combining the power transformation with Equation 3.7, we see that the calculation performed on each of the 101 signatures is

$$\mathbf{t} = \left(\frac{\mathbf{r}^2}{\sqrt{\frac{1}{128}\sum_{i=1}^{70} r_i^2}}\right)^{0.2}. \qquad (3.8)$$

Up to now, the preprocessing was identical for both the template and target signatures. At this point, the computation finally differed. Because we wanted the
template signatures to be as accurate as possible, all 128 signatures were averaged
together to obtain a single 70-gate vector. However, the target signatures should attempt to simulate the quality of real HRR signatures. Consequently, only the center
eight profiles were averaged. Target vectors were finally padded with five zeros on
either end to accommodate the range shifting of the least squares fit.
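The preprocessing pipeline above can be sketched end to end: square each 70-wide signature into a power vector, normalize by the RMS amplitude taken over all 128 gates (zero padding included), apply the power transform with c = 0.2, and average profiles into a single vector. The input shapes and the center-8 selection convention here are assumptions for illustration.

```python
import numpy as np

# Sketch of Equations 3.4-3.8 plus profile averaging. For targets, only the
# center 8 of the 101 profiles are averaged, to simulate real HRR quality.

def preprocess(signatures, n_avg, c=0.2, total_gates=128):
    p = signatures ** 2                               # power per gate
    p_rms = np.sqrt(np.sum(p, axis=1) / total_gates)  # mean over all 128 gates
    t = (p / p_rms[:, None]) ** c                     # normalize + transform
    mid = len(t) // 2
    return t[mid - n_avg // 2: mid + n_avg // 2].mean(axis=0)

rng = np.random.default_rng(0)
sigs = np.abs(rng.normal(size=(101, 70))) + 0.1       # stand-in magnitudes
target = preprocess(sigs, n_avg=8)                    # center 8 profiles
print(target.shape)  # (70,)
```

The resulting 70-wide vector would then be zero-padded on each end before the range-shifted least squares fit.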
3.4
Algorithm Performance
We present the results of execution of the SHARP algorithm on the 1,362 MSTAR test
signatures in two different formats. Table 3.1 shows the confusion matrix arranged
by vehicle. Table 3.2 shows the confusion matrix arranged by vehicle class. Recall
that element (i, j) indicated the percentage of Class i signatures that were assigned to
Class j. Through these tables, we can see quantitatively how the SHARP algorithm
performs.
We see from Table 3.1 that the performance by vehicle was certainly not optimal.
In fact, for the BMP263 and BMP2c21, less than 50% of the signatures were correctly
classified. However, we see from Table 3.2 that the majority of misclassifications were
still the same vehicle type. Consequently, we see that while the SHARP algorithm did
not perform well on a vehicle-by-vehicle basis, it did perform well on vehicle types. In
fact 86.64% of the target signatures were correctly matched with their vehicle class.
Table 3.1: Confusion Matrix by Vehicle Using Original AFRL Code

                              Predicted
Actual      BMP263   BMP266   BMP2c21  BTR70c21  T72132   T72812   T72s7
BMP263      47.69%   16.92%   23.08%    3.59%    3.59%    2.56%    3.59%
BMP266       7.69%   60.00%   14.87%    4.10%    4.10%    3.59%    5.64%
BMP2c21     21.54%   12.31%   49.74%    4.62%    4.10%    3.59%    4.10%
BTR70c21     3.59%    4.10%    5.64%   76.92%    3.08%    3.08%    3.59%
T72132        --       --       --       --       --       --       --
T72812       0.00%    2.05%    1.54%    1.54%    5.64%   75.38%   13.85%
T72s7        1.57%    4.19%    2.62%    1.57%    8.90%   10.99%   70.16%
Table 3.2: Confusion Matrix by Vehicle Type Using Original AFRL Code

                    Predicted
Actual      BMP2     BTR70    T72
BMP2       84.62%    4.10%   11.28%
BTR70      13.33%   76.92%    9.74%
T72         6.87%    1.20%   91.92%
A detailed timing analysis of the SHARP algorithm will be presented in Chapter
4, but we note here that the time required for classification of the test signatures (excluding the time necessary for template formation) was approximately 21.64 minutes.
While a speedup of a process that only takes 21.64 minutes to execute is not very
significant, it is important to note that in a practical system, this time would be much
larger. It is likely that the template set would be much more densely populated in
aspect and would contain data for many depression angles. Consequently, we could
expect a realistic execution time to be on the order of days. In this case, the speedup
provided by hardware acceleration would clearly be significant.
We present the results of algorithm execution not only to demonstrate what the
capabilities and shortcomings of the SHARP algorithm are, but also to provide a
point of reference for our further analysis. Since we are attempting to provide the
same capability as the SHARP algorithm, but with hardware acceleration which will
significantly decrease the execution time, we need a quantifiable way to compare
the results. We would therefore like to see that executing some or all of the SHARP
algorithm in hardware does not significantly change the confusion matrices from those
in Tables 3.1 and 3.2.
3.5
Summary
In this chapter, we presented the methods used in the SHARP algorithm. We demonstrated how the SAR imagery from the MSTAR dataset was processed to produce
simulated HRR signatures for classification. The SHARP algorithm was developed in
two parts: preprocessing and least squares fitting. The preprocessing consisted of a
normalization and a power transformation. This power transformation was necessary
because the least squares fit routine was implemented as a Gaussian classifier.
Since the goal of this research was to accelerate the algorithm without significantly
affecting performance, the results of executing the SHARP algorithm were presented
to serve as a reference point for future comparisons. Chapter 4 will begin our analysis of the SHARP algorithm, which will rely heavily upon an understanding of the
framework presented in this chapter.
Chapter 4
Software Analysis
In attempting to accelerate an algorithm, it is important that the original software be
optimized as much as possible. Achieving a 10× speedup through the use of hardware
acceleration is not very impressive if a 5× speedup can be gained simply by improving
the software. Thus, it is only possible to get a true measure of the impact of hardware
acceleration by comparison to a software implementation that is fully optimized.
Additionally, this software optimization can typically lead to better hardware
performance. Naturally, functions that require few computations will execute faster
in both hardware and software. Moreover, using fewer computational elements leaves
extra FPGA area free, so the remaining blocks can be implemented using larger-area
designs which can either improve the precision of the calculation or reduce its
latency. Consequently, it is clear that software optimization is an important step
towards hardware acceleration.
In Section 4.1, we present a timing analysis of the SHARP algorithm. Section
4.2 presents an analysis of the least squares fit routine. The section begins with the
original AFRL approach and then presents an alternative that significantly improves
its efficiency. Section 4.3 presents optimizations that target the preprocessing in the
SHARP algorithm. Finally, in Section 4.4, we present a revised timing analysis to
demonstrate the effects of these modifications and refocus further study.
4.1
Timing Analysis
It is important to recognize that the SHARP algorithm consisted of two separate
parts. Template signature formation comprised the first part while target signature
formation and classification made up the second part. The important distinction
between these parts is that the former could be computed and stored ahead of time.
The compilation of the template library was not considered part of the algorithm execution, since the template data was static. Once the templates had been
computed, they could simply be stored and read on subsequent executions. Consequently, our analysis of the SHARP algorithm focused only on the formation and
classification of target signatures.
We began our analysis by examining the execution of the SHARP algorithm to
determine the relative amounts of computation involved.
A profiling of this sort
was helpful to determine what calculations dominated the execution time and would
therefore demonstrate the greatest benefit from an acceleration. Conversely, it also
prevented us from undertaking a detailed analysis to gain a significant speedup in a
computation that did not make up a significant portion of the algorithm execution
time. Through analysis of the initial execution of the algorithm, we were able to
better focus our efforts and achieve the greatest speedup.
All timing tests were conducted on a Sun Microsystems Ultra 5 running at 360
MHz with 320 MB of RAM. As mentioned above, we ignored the time required for
template formation. Consequently, we broke our analysis into three categories: target
loading/preprocessing, least squares fitting, and classification. The timing is shown
in Table 4.1 for all seven target classes, which we recall consisted of 1,362 signatures
(128×128 matrices). All timings are shown in seconds.
As expected, the first priority for acceleration was the least squares fit routine.
Least squares fitting required many error calculations, each of which required several
vector multiplications. Additionally, the least squares fit was performed on many
vectors. Consequently, it was no surprise that it was responsible for most of the
execution time.
Table 4.1: Timing Analysis of Original Software

            Loading/Preprocessing   Least Squares Fit   Classification
BMP263             24.55                 164.16              0.11
BMP266             24.89                 160.08              0.11
BMP2c21            24.56                 162.55              0.11
BTR70              24.62                 161.00              0.11
T72132             24.54                 156.27              0.11
T72812             25.05                 161.48              0.11
T72s7              24.50                 159.52              0.11
Total             172.71                1125.06              0.77
We also see that the secondary area for acceleration was the preprocessing. While
we could not do much to improve the load time of the data, we could improve the
time required to preprocess the data into power-transformed HRR signatures. Consequently, we also focused on the power transformation and attempted to improve its
execution time.
4.2
Least Squares Fit
We recall from Chapter 3 that the goal of the least squares fitting was to minimize the
error between the target signature and the class to which it was assigned. Specifically,
the SHARP algorithm determined the class that minimized the error between the
target signature and all template signatures that were within 5° in aspect and within
5 range gate shifts. In this context, the error to be minimized was
$$\epsilon^2 = \left([\,\mathbf{1}\ \ \mathbf{m}\,]\,\mathbf{k} - \mathbf{d}\right)^T \left([\,\mathbf{1}\ \ \mathbf{m}\,]\,\mathbf{k} - \mathbf{d}\right), \qquad (4.1)$$
where m and d are column vectors representing the template and target signatures,
respectively, and k is a 2 x 1 vector consisting of the optimal offset and gain between
the vectors.
4.2.1
QR Factorization Approach
The approach taken by the original AFRL code to calculate the fitting error was
relatively straightforward. The code first calculated the least squares solution k and
then plugged it into Equation 4.1 to arrive at the fitting error. However, rather than
simply using Equation 3.2 to calculate k, the AFRL code employed a QR factorization
in order to reduce the amount of calculation necessary.
From [27], we know that the goal of QR factorization is to take a matrix $A$ and
produce two matrices, $Q$ and $R$, such that $A = QR$. Additionally, $Q$ must have
orthonormal columns (implying that $Q^T Q = I$) and $R$ must be upper triangular.
The AFRL took advantage of this method by taking the QR factorization of $[\,\mathbf{1}\ \ \mathbf{m}\,]$.
Since the dimensions of $[\,\mathbf{1}\ \ \mathbf{m}\,]$ were $70 \times 2$, the dimensions of $Q$ and $R$ were $70 \times 2$
and $2 \times 2$, respectively. This factorization was then substituted into Equation 3.2.
This substitution is shown below:

$$\mathbf{k} = \left([\,\mathbf{1}\ \ \mathbf{m}\,]^T [\,\mathbf{1}\ \ \mathbf{m}\,]\right)^{-1} [\,\mathbf{1}\ \ \mathbf{m}\,]^T \mathbf{d} \qquad (4.2)$$
$$\mathbf{k} = \left((QR)^T QR\right)^{-1} (QR)^T \mathbf{d} \qquad (4.3)$$
$$\mathbf{k} = \left(R^T Q^T Q R\right)^{-1} R^T Q^T \mathbf{d} \qquad (4.4)$$
$$\mathbf{k} = \left(R^T R\right)^{-1} R^T Q^T \mathbf{d}. \qquad (4.5)$$
At first glance, it is not clear that using QR factorization provided a benefit over the straightforward
calculation. In fact, from [27] we know the QR factorization of a $70 \times 2$ matrix requires
280 multiplications. Additionally, we can easily calculate the number of multiplications in Equations 4.2 and 4.5 to be 430 and 161, respectively. It would thus seem
that the QR factorization approach required 441 total multiplications (11 more than
the straightforward calculation). However, the benefit of the QR approach was in the
context of its use.
Recall that in order to compare one target signature to one template signature,
there were eleven range shifts to be considered. For straightforward calculation,
this resulted in 11 × 430 = 4,730 multiplications. However, in the QR approach,
factorization only had to be done once per template. Consequently, only 280 + 11
× 161 = 2,051 multiplications were required. While this was clearly an improvement
over the straightforward approach, it could be improved further using a matched
filter.
4.2.2
Matched Filter Approach
The foundation for using a matched filter approach for the SHARP algorithm is
developed in [12]. We will present this development, extend it to implementation,
and demonstrate the resulting benefit.
The development of the matched filter approach began with the original equation
by which the SHARP algorithm performed the least squares fitting,

$$[\,\mathbf{1}\ \ \mathbf{m}\,]\,\mathbf{k} \approx \mathbf{d}. \qquad (4.6)$$
From [27], we know that the least squares solution $\mathbf{k}$ is found by solving

$$[\,\mathbf{1}\ \ \mathbf{m}\,]^T [\,\mathbf{1}\ \ \mathbf{m}\,]\,\mathbf{k} = [\,\mathbf{1}\ \ \mathbf{m}\,]^T \mathbf{d}, \qquad (4.7)$$

or equivalently,

$$\begin{bmatrix} N & N\bar{m} \\ N\bar{m} & \mathbf{m}^T\mathbf{m} \end{bmatrix} \mathbf{k} = \begin{bmatrix} N\bar{d} \\ \mathbf{m}^T\mathbf{d} \end{bmatrix}, \qquad (4.8)$$

where $N$ is the length of $\mathbf{m}$ and $\mathbf{d}$ (in this case, 70), and $\bar{m}$ and $\bar{d}$ are the means of
$\mathbf{m}$ and $\mathbf{d}$, respectively. Solving the first line of this equation gives

$$N k_1 + N\bar{m} k_2 = N\bar{d}, \qquad (4.9)$$
$$k_1 = \bar{d} - \bar{m} k_2. \qquad (4.10)$$
Substituting back into Equation 4.6 gives

$$[\,\mathbf{1}\ \ \mathbf{m}\,] \begin{bmatrix} \bar{d} - \bar{m} k_2 \\ k_2 \end{bmatrix} \approx \mathbf{d}, \qquad (4.11)$$
$$\bar{d}\mathbf{1} - k_2 \bar{m}\mathbf{1} + k_2 \mathbf{m} \approx \mathbf{d}, \qquad (4.12)$$

and finally,

$$(\mathbf{m} - \bar{m}\mathbf{1})\, k_2 \approx (\mathbf{d} - \bar{d}\mathbf{1}). \qquad (4.13)$$

Note that $\mathbf{m} - \bar{m}\mathbf{1}$ is simply $\mathbf{m}$ with its bias removed. Similarly, $\mathbf{d} - \bar{d}\mathbf{1}$ is $\mathbf{d}$ with its
bias removed. Defining $\hat{\mathbf{m}} = \mathbf{m} - \bar{m}\mathbf{1}$ and $\hat{\mathbf{d}} = \mathbf{d} - \bar{d}\mathbf{1}$, we obtain

$$\hat{\mathbf{m}}\, k_2 \approx \hat{\mathbf{d}}. \qquad (4.14)$$

From this, we know the least squares solution for $k_2$ is

$$k_2 = \frac{\hat{\mathbf{m}}^T \hat{\mathbf{d}}}{\hat{\mathbf{m}}^T \hat{\mathbf{m}}}. \qquad (4.15)$$
Using Equations 4.14 and 4.15, we can calculate the fitting error:

$$\epsilon^2 = (k_2 \hat{\mathbf{m}} - \hat{\mathbf{d}})^T (k_2 \hat{\mathbf{m}} - \hat{\mathbf{d}}) \qquad (4.16)$$
$$\epsilon^2 = (k_2 \hat{\mathbf{m}}^T - \hat{\mathbf{d}}^T)(k_2 \hat{\mathbf{m}} - \hat{\mathbf{d}}) \qquad (4.17)$$
$$\epsilon^2 = k_2^2\, \hat{\mathbf{m}}^T\hat{\mathbf{m}} - 2 k_2\, \hat{\mathbf{m}}^T\hat{\mathbf{d}} + \hat{\mathbf{d}}^T\hat{\mathbf{d}} \qquad (4.18)$$
$$\epsilon^2 = \hat{\mathbf{d}}^T\hat{\mathbf{d}} - \frac{(\hat{\mathbf{m}}^T\hat{\mathbf{d}})^2}{\hat{\mathbf{m}}^T\hat{\mathbf{m}}}. \qquad (4.19)$$

We see from Equation 4.19 that if $\hat{\mathbf{m}}$ and $\hat{\mathbf{d}}$ were normalized so that $|\hat{\mathbf{m}}| = |\hat{\mathbf{d}}| = 1$,
then minimizing the least squares error could be accomplished by maximizing $\hat{\mathbf{m}}^T\hat{\mathbf{d}}$,
which is precisely the objective of a matched filter [23]. Therefore, we see that by
removing the bias from the template and target vectors and then normalizing them to
unit magnitude, we could accomplish the necessary error minimization by maximizing
the correlation between the two vectors. However, it was important to realize that
calculating the bias and normalizing were expensive calculations. Consequently, it
was desirable that these operations not be performed frequently.
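Equation 4.19 can be verified numerically against the direct fit of Equation 4.1: after removing the bias from m and d, the matched-filter expression for the error should agree with the error of the full two-parameter least squares fit. The vectors below are invented toy data.

```python
import numpy as np

# Numerical check of Equation 4.19: the bias-removed matched-filter error
# equals the error of the direct affine least squares fit.

rng = np.random.default_rng(2)
m, d = rng.normal(size=70), rng.normal(size=70)

mh, dh = m - m.mean(), d - d.mean()                 # bias-removed vectors
err_mf = dh @ dh - (mh @ dh) ** 2 / (mh @ mh)       # Equation 4.19

A = np.column_stack([np.ones(70), m])               # direct fit of Eq. 4.1
k = np.linalg.lstsq(A, d, rcond=None)[0]
err_ls = float((A @ k - d) @ (A @ k - d))

print(abs(err_mf - err_ls) < 1e-9)  # True: the two errors agree
```

With both vectors further scaled to unit magnitude, minimizing this error reduces to maximizing the single dot product mh @ dh.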
Recall from Section 3.2 that a single target vector was compared against seven
template classes. Within each template class, signatures were only considered if their
aspect angles were within 5° of the target signature. Additionally, the target signature
was shifted up to 5 range gates in either direction to align the signatures. Clearly, the
choice of template vector had no bearing on processing the target vector. However, a
bias removal and normalization was necessary for every range shift using the AFRL
approach for range shifting.
The range shifting done by the algorithm was performed by padding the original
70-vector with 5 zeros on either end and then taking a window of size 70 from the
resulting 80-vector. However, Figure 4-1 shows the problem with such an approach.
Figure 4-la shows the 80-vector obtained by zero-padding one of the bmp263 target
signatures. Figures 4-1b through 4-11 show the 11 range shifts of that signature.
The 70 samples containing target information were elements 6 through 75 of the
80-vector. However, every window besides the center one left out at least one of
these samples. In the extreme case, windows 1-70 and 11-80 left out 5 samples of the
original vector as shown in Figures 4-1b and 4-1l. In many cases, this did not matter
because the actual target signatures were generally smaller than 70 range gates, but
in some cases, real data samples were lost. Consequently, the mean of the vector
could potentially change with every range shift, requiring constant bias removal and
normalization.
Instead of accepting this conclusion and taking the resulting performance degradation, we proposed a modification to the algorithm which enabled the bias removal
and normalization to be performed only once for each target vector. This modification
was to ensure that samples were never dropped. This was accomplished by providing
additional zero padding and taking a larger window. Specifically, the original 70-vector was padded with 10 zeros on either end, resulting in a 90-vector. The 11 range
shifts were then performed by taking a subwindow of size 80. Note that the target
samples were contained in elements 11-80 of the 90-vector. We then see in Figure 4-2
that each range shift included all 70 of the target samples. Consequently, the mean
Figure 4-1: Range shifting of a bmp263 target by using 70-windows. (a) Original
80-vector (b)-(l) 11 70-wide range shifts from rightmost to leftmost.
did not change across the range shift, implying that bias removal and normalization
only needed to be performed once for each target vector.
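The modified windowing scheme can be sketched directly; the assertion confirms that every one of the 11 windows retains all 70 data samples, so the window mean is identical across shifts (pure Python, with an arbitrary ramp standing in for a real signature):

```python
signature = [1.0 + 0.1 * i for i in range(70)]   # arbitrary stand-in data
padded = [0.0] * 10 + signature + [0.0] * 10     # 90-vector

# The 11 range shifts: every 80-wide subwindow of the 90-vector.
windows = [padded[s:s + 80] for s in range(11)]

# Each window spans padded indices that cover all 70 real samples
# (indices 10..79), so the mean is the same for every shift.
means = [sum(w) / len(w) for w in windows]
assert all(abs(mu - means[0]) < 1e-12 for mu in means)
print(len(windows), len(windows[0]))  # prints: 11 80
```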
Making this modification did slightly change the performance of the algorithm
since forcing all the target samples to be kept changed some of the error calculations
from the original version. The modified performances are shown in Tables 4.2 and
4.3. As before, element (i, j) indicates the percentage of Class i signatures that were
identified as Class j. Note that the algorithm performance actually improved slightly
because by not dropping samples, we were using a more accurate measurement of the
least squares error.
Table 4.2: Confusion Matrix by Vehicle Using Matched Filter

              |                             Predicted
    Actual    | BMP263   BMP266   BMP2c21  BTR70c21  T72132   T72812   T72s7
    BMP263    | 47.69%   16.41%   24.10%    4.62%     2.56%    2.05%    2.56%
    BMP266    |  7.69%   60.51%   13.33%    4.10%     4.62%    3.59%    6.15%
    BMP2c21   | 21.03%   12.82%   50.26%    3.59%     4.10%    4.10%    4.10%
    BTR70c21  |  3.59%    4.10%    6.15%   77.44%     3.08%    3.08%    2.56%
    T72132    |  2.04%    1.02%    4.08%    0.51%    76.02%    7.90%    6.12%
    T72812    |  2.04%    1.04%    4.04%    0.50%     7.90%   76.02%   12%
    T72s7     |  1.05%    3.66%    2.62%    1.05%     8.90%   11.52%   71.20%
Table 4.3: Confusion Matrix by Vehicle Type Using Matched Filter

              |        Predicted
    Actual    | BMP2     BTR70    T72
    BMP2      | 84.62%    4.10%   11.28%
    BTR70     | 13.85%   77.44%    8.72%
    T72       |  5.84%    1.20%   92.96%
We recall that QR factorization provided an advantage over straightforward calculation because the cost of factorization was amortized across the eleven range shifts.
Similarly, the matched filter approach provided an advantage over QR factorization
because the cost of bias removal and normalization was amortized across the number
of template signatures to be considered. Bias removal and normalization required 80
multiplications as did the calculation of the correlation. Consequently, to compare
against M template signatures, QR factorization required 2,051M multiplications,
Figure 4-2: Range shifting of a bmp263 target by using 80-windows. (a) Original
90-vector. (b)-(l) 11 80-wide range shifts from rightmost to leftmost.
while the matched filter approach required 80 + M x 11 x 80 = 80 + 880M multiplications. In addition, by comparing Tables 4.1 and 4.4, we see that adopting a
matched filter approach provided a significant advantage over the QR factorization
approach in the algorithm timing.
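Using the multiplication counts above (2,051 per template for QR factorization versus a one-time cost of 80 plus 880 per template for the matched filter), the crossover can be checked with a few lines of Python:

```python
def qr_cost(m):
    """Multiplications to compare a target against m templates via QR factorization."""
    return 2051 * m

def matched_filter_cost(m):
    """One-time bias removal and normalization (80 multiplies) plus
    11 range-shifted correlations of 80 multiplies per template."""
    return 80 + 11 * 80 * m

for m in (1, 10, 100):
    print(m, qr_cost(m), matched_filter_cost(m))

# 80 + 880m < 2051m for every m >= 1, so the matched filter always wins here.
assert all(matched_filter_cost(m) < qr_cost(m) for m in range(1, 1000))
```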
4.3 Power Transformation
In considering the power transformation, we chose not to conduct an in-depth analysis
to determine if alternate calculations could achieve the same result more efficiently.
Instead, we noted that there was one straightforward but critical change that could
be implemented which resulted in a dramatic speedup.
Recall from Section 3.3.1 that for target formation, the center eight profiles of
the 101 available signatures were averaged together to produce the target signature,
while template formation employed all 101 profiles. However, the original AFRL code
performed the power transformation and range gate normalization on all profiles in
both cases. The only difference between the template and target formation in the
AFRL code was that the target formation code used only eight of these profiles for
averaging, in spite of having preprocessed all of the signatures.
It was apparent that we could gain a substantial savings in computation simply by
eliminating the unnecessary calculations from the original AFRL code. By ensuring
that the preprocessing only occurred on the eight profiles used in the averaging rather
than all 101 profiles, we expected a speedup on the order of 12x. This ensured that the
time spent on computing the power transformation and the range gate normalization
would be dominated by the time required for the least squares fitting. The next section
will present a revised timing analysis that incorporates the above modifications.
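The restructuring amounts to selecting the eight center profiles before, rather than after, the expensive preprocessing. A pure-Python sketch (with `power_transform` as a simplified stand-in for the power transformation and range gate normalization, and a hypothetical 101-profile stack) shows that the two orderings produce identical target vectors while the second performs 8 rather than 101 transformations:

```python
def power_transform(profile):
    """Simplified stand-in for the power transformation / range gate normalization."""
    return [abs(x) ** 0.4 for x in profile]

def form_target_original(profiles):
    """Original AFRL structure: preprocess all profiles, then average only 8."""
    processed = [power_transform(p) for p in profiles]     # 101 transformations
    mid = len(processed) // 2
    center = processed[mid - 4:mid + 4]                    # the 8 center profiles
    return [sum(col) / 8 for col in zip(*center)]

def form_target_optimized(profiles):
    """Restructured code: select the 8 center profiles before preprocessing."""
    mid = len(profiles) // 2
    center = profiles[mid - 4:mid + 4]
    processed = [power_transform(p) for p in center]       # only 8 transformations
    return [sum(col) / 8 for col in zip(*processed)]

profiles = [[1.0 + i + 0.5 * j for j in range(70)] for i in range(101)]
assert form_target_original(profiles) == form_target_optimized(profiles)
```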
4.4 Revised Timing Analysis
Table 4.4 shows the results of the timing analysis on the fully optimized software. All
timings are shown in seconds.
Table 4.4: Timing Analysis of Optimized Software

              | Loading/Preprocessing | Least Squares Fit | Classification
    BMP263    |  2.79                 |  34.22            |  .11
    BMP266    |  2.53                 |  33.93            |  .12
    BMP2c21   |  2.58                 |  34.55            |  .12
    BTR70     |  2.54                 |  34.14            |  .11
    T72132    |  2.63                 |  34.67            |  .11
    T72812    |  2.69                 |  34.20            |  .11
    T72s7     |  2.57                 |  33.64            |  .12
    Total     | 18.33                 | 239.35            |  .80
By comparing Tables 4.1 and 4.4, we see that the software optimizations made
a significant impact upon the execution time of the algorithm. The total execution
time was reduced from 1298.54 seconds to 258.48 seconds. Additionally, the ratio
between least squares fitting time and preprocessing time had almost doubled, further supporting the supposition that least squares fitting dominated the execution.
Therefore, accelerating the least squares fitting remained our highest priority.
4.5 Summary
This chapter began the analysis of the SHARP algorithm. The timing analysis presented in Section 4.1 led us to focus our efforts primarily on the least squares fitting
routine, and secondarily on the power transformation. The QR factorization approach
used by the AFRL was presented and analyzed. It was then replaced with a matched
filter approach, which significantly decreased its execution time. The unnecessary
calculations were removed from the power transformation which also resulted in a
speedup.
Finally, we presented a revised timing analysis. This analysis showed that the
algorithm execution time had been greatly reduced due to our optimizations and further supported the claim that accelerating the least squares fitting was the highest
priority. Having concluded our software analysis, we begin a hardware-specific analysis in Chapter 5 that alters the already modified algorithm to further improve its
anticipated hardware performance.
Chapter 5
Hardware-Specific Analysis
In this chapter, we examine the SHARP algorithm from a hardware standpoint.
Thus far, hardware has not been a factor in our analysis. This chapter considers the
difficulties that may present themselves when mapping the SHARP algorithm into
hardware using the ACS program approach.
Recall from Section 1.2 that our concerns with regards to hardware implementation are control flow modifications, bit precisions, design availability, partitioning,
and data transfer. While Chapter 2 indicated that most of the bit precision decisions
could be left to the wordlength analysis tool, this chapter discusses situations in which
the tool would not suffice, and considers the remaining hardware issues. These issues
are analyzed with regards to our two areas of interest: the power transformation and
the least squares fitting.
Our hardware-specific analysis begins with the power transformation in Section
5.1. We ultimately decided not to implement the power transformation in hardware,
but this section explains why this decision was made and provides insight which can
be used in further analysis. We then analyze the least squares fit in Section 5.2,
concluding with the Ptolemy diagrams of our design in Figures 5-7 and 5-9.
5.1 Power Transformation
We begin our analysis with a list of the elementary blocks available for use in our
design in Table 5.1. From these blocks, we were able to construct larger systems that
implemented the functionality we desired.
Table 5.1: Elementary Function Blocks

    Accumulator          Produces a running total of its input
    Chop                 Produces a subvector of its input
    Const                Produces a constant value
    Divider              Produces the quotient of its two inputs
    Lookup Table (LUT)   Performs a table lookup on its input
    Maximum              Produces the maximum of its two inputs
    Multiply             Produces the product of its two inputs
    Register             Repeats its input
    Shift                Performs a bitshift on its input
    Sqrt                 Produces the square root of its input
    Square               Squares its input
    Subtractor           Produces the difference of its two inputs
Recall from Section 3.3.1 that the ultimate purpose of the power transformation
was to average eight power-transformed vectors. Consequently, from an implementation standpoint, there were two components to consider. First, the range gate normalization and power transformation calculation had to be performed on the eight
input vectors. Second, the eight vectors had to be averaged together. We consider
these components in Sections 5.1.1 and 5.1.2.
5.1.1 Vector Transformation
From Chapter 3, we recall that the equation to be implemented on each vector is
$$\frac{\mathbf{r}^{0.4}}{\left(\frac{1}{70}\sum_{i=1}^{70} r_i^{4}\right)^{0.1}} \quad (5.1)$$
where r contains the magnitudes of the individual range gates and the exponentiation
is done elementwise. This calculation is represented as a block diagram in Figure 5-1.
The vector was squared at the outset. The lower datapath was then used to calculate
the RMS amplitude. The vector was squared again and then accumulated. The
Chop was used to read the 90 partial sums produced by the Accumulator and only
output the final sum. The Shift was a right shift by 7 bits, which was equivalent to a
division by 128. The square root of the value was then taken, resulting in the RMS
amplitude. The Register repeated the amplitude 90 times so that each element of the
squared 90-vector could be normalized. Finally, a LUT was used to implement the
function $y = x^{0.2}$, completing the calculation.
Figure 5-1: Power Transformation Block Diagram
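The Figure 5-1 datapath can be imitated as a floating-point reference model in software. In the sketch below (pure Python, arbitrary stand-in input), each line mirrors one block: Square, a second Square feeding the Accumulator, the Shift by 7 bits, Sqrt, the Divider, and the 0.2-power LUT:

```python
import math

def power_transform_datapath(r):
    """Floating-point imitation of the Figure 5-1 pipeline (no quantization)."""
    squared = [x * x for x in r]              # first Square
    acc = sum(x * x for x in squared)         # second Square + Accumulator
    shifted = acc / 128.0                     # Shift: right shift by 7 bits
    rms = math.sqrt(shifted)                  # Sqrt: the RMS amplitude
    normalized = [x / rms for x in squared]   # Divider (Register repeats rms)
    return [x ** 0.2 for x in normalized]     # LUT: y = x^0.2

r = [0.5 + 0.01 * i for i in range(90)]       # arbitrary 90-vector stand-in
out = power_transform_datapath(r)
print(len(out), round(max(out), 3))
```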
There were two issues that needed to be considered to make this design feasible. First, the output precision of the divider had to be manually chosen. This was
necessary for all dividers and was a consequence of the inability to sufficiently characterize the divider input. Specifically, the denominator input could theoretically be
very close to zero. As a result, the major bit chosen by the wordlength analysis tool
would be extremely conservative and result in a high quantization error. Therefore,
we attempted to choose an appropriate major bit for the divider output based on our
knowledge of the system.
In order to choose the divider precision, we considered the system after the first
Square and before the LUT. Note that if we removed the Shift (which is equivalent
to a gain of $\frac{1}{128}$), then the system was a straightforward normalization. That is, the
output vector was simply the input vector divided by its magnitude. As a result,
without the Shift, the output range span could be no larger than [-1, 1]. We then had
to trace the effect of the Shift. Passing through the Sqrt resulted in a factor of $\frac{1}{8\sqrt{2}}$.
The Register did not affect this factor, but passing through the Divider inverted it, so
the divider output range span could be no larger than $[-8\sqrt{2}, 8\sqrt{2}] \approx [-11.3, 11.3]$.
Consequently, the output major bit of the divider could be set to 4, which would
Figure 5-2: Divider Output - Floating Point and <4.8> Fixed Point
easily span this range.
The second factor to be considered was the LUT. The underlying design provided by Xilinx supported up to 256 256-bit values [34]. This effectively meant the
wordlength of the input to the LUT had to be 8, since there were only $2^8 = 256$ different addresses in the LUT. Combining this with our earlier choice of major bit, the
output precision of the divider had to be set to <4.8>, using the notation <mb.wl>
to indicate a major bit of mb and a wordlength of wl. This meant that the least
significant bit (LSB) only had a significance of $2^{-3}$, so all quantizations were only
guaranteed to be within 0.125 of the corresponding floating point values. Figure 5-2
shows the output of the Divider in floating point and at <4.8> precision for a typical
bmp263 target. Figure 5-3 shows the corresponding outputs of the exponentiation.
Figure 5-4 shows the floating point output and the same output quantized to <1.8>
precision.
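The <mb.wl> notation can be made concrete with a small quantization helper: the most significant bit of the word has weight $2^{mb}$, so the LSB has weight $2^{mb-wl+1}$. (Pure-Python sketch; round-to-nearest is an assumption about the tool's behavior.)

```python
def quantize(x, mb, wl):
    """Round x to <mb.wl> fixed point; the LSB has weight 2**(mb - wl + 1)."""
    lsb = 2.0 ** (mb - wl + 1)
    return round(x / lsb) * lsb

# <4.8>: LSB weight 2^-3 = 0.125, so values land on a coarse grid.
assert quantize(1.0, 4, 8) == 1.0
assert abs(quantize(1.07, 4, 8) - 1.07) <= 0.125
# <1.8>: LSB weight 2^-6, a much finer grid for values near 1.
assert abs(quantize(1.07, 1, 8) - 1.07) <= 2.0 ** -6
```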
It is apparent from Figures 5-2 and 5-3 that <4.8> precision was not sufficient to characterize the input. However, Figure 5-4 shows that <1.8> precision was
Figure 5-3: LUT Output - Floating Point Input and <4.8> Fixed Point Input
Figure 5-4: LUT Output - Floating Point and <1.8> Fixed Point
Figure 5-5: LUT Output - Floating Point Input and <1.8> Transformed Divider Input
acceptable for the output. Consequently, if we could make the input more closely
resemble the output, 8-bit precision would be sufficient. We accomplished this by
inserting two Sqrt's between the Divider and LUT. We used the Sqrt's to maintain
16-bit precision until the input of the LUT, which was set to <1.8> precision. However, the input had already been raised to the 0.25 power by the Sqrt's, so the LUT
was used to implement the function $y = x^{0.8}$, resulting in an overall exponent of 0.2.
The output precision was then set to <1.16>. The floating point and <1.16> fixed
point outputs are shown in Figure 5-5.
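The trick rests on the identity $x^{0.2} = (x^{0.25})^{0.8}$: two square roots compress the divider output before the table lookup, and the LUT exponent is adjusted to compensate. A quick numeric check (pure Python):

```python
import math

def lut_direct(x):
    """Original arrangement: the LUT computes y = x^0.2 directly."""
    return x ** 0.2

def lut_with_sqrts(x):
    """Modified arrangement: two Sqrts (x^0.25) feed a LUT computing y = x^0.8."""
    y = math.sqrt(math.sqrt(x))   # x^0.25, kept at 16-bit precision in hardware
    return y ** 0.8               # overall exponent: 0.25 * 0.8 = 0.2

for x in (0.01, 0.5, 1.0, 7.3, 11.3):
    assert abs(lut_direct(x) - lut_with_sqrts(x)) < 1e-9
```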
5.1.2 Vector Averaging
In order to be able to average the eight transformed vectors, it was necessary to
develop a way to parallelize them. The only reasonable way to do this was through
the use of a memory element. The output of the transformation calculation could
be stored in memory. Then, when all eight vectors were in memory, an address
generator could output the vectors in parallel. Specifically, the first element of each
vector would be produced, followed by the second element of each vector until all
elements had been produced.
The advantage of rearranging the elements in this fashion was that it was then
possible to produce the average of the 8 vectors. In order to produce the first element
of the average vector, the first elements of each of the 8 original vectors could be
passed into an accumulator in serial order. The output of the accumulator could be
downsampled to produce the final sum, which could be passed through a gain of $\frac{1}{8}$ to
produce the final average.
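The intended behavior of the memory element and address generator can be sketched as follows (pure Python, hypothetical data): eight vectors are stored back to back, read out element-interleaved, and each group of eight is accumulated and scaled by 1/8:

```python
# Eight hypothetical transformed 90-vectors.
vectors = [[float(v + e) for e in range(90)] for v in range(8)]
memory = [x for vec in vectors for x in vec]   # stored back to back

# Address generator: element 0 of each vector, then element 1, and so on.
interleaved = [memory[v * 90 + e] for e in range(90) for v in range(8)]

# Accumulate each group of eight and apply a gain of 1/8.
average = [sum(interleaved[8 * e:8 * (e + 1)]) / 8 for e in range(90)]

expected = [sum(vec[e] for vec in vectors) / 8 for e in range(90)]
assert average == expected
```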
Unfortunately, time was not available for this approach to be taken. The complexity involved in including a memory bank into the calculation and developing the
necessary address generator did not permit its implementation. Fortunately, as Section 4.4 indicates, the time saved in performing the power transformation in hardware
was not the dominating factor in the final performance. Consequently, we concentrated the remainder of our analysis on the least squares fit.
5.2 Least Squares Fitting
The least squares fitting consisted of two distinct parts. First, the input vectors
had their biases removed and their magnitudes normalized. Second, the correlation
between the target vectors and the appropriate template vectors was calculated.
5.2.1 Bias Removal and Magnitude Normalization
We begin as before by presenting a block diagram of the desired calculation in Figure
5-6.
In order to calculate the bias, a Chop was used to take an 80-subvector of the
input 90-vector. The vector was accumulated and a Chop was used to extract the
total sum. The bias was finally calculated by multiplying by a constant of $\frac{1}{80}$. The bias
was then repeated 80 and 90 times, so that it could be subtracted from both the original
90-vector and the 80-subvector. Note that this could also have been achieved through
Figure 5-6: Normalization Block Diagram
the use of one Register and Subtractor operating on the 90-vector. However, because
the magnitude which was calculated next operated on the 80-vector, an additional
Chop would have been necessary, increasing the latency of the design. Consequently,
two Registers and Subtractors were used to operate on both the 90-vector and the
80-subvector.
To calculate the magnitude, the 80-vector was squared and accumulated. Once
again, a Chop was used to get the total sum. The square root of this sum was the
desired magnitude. The magnitude was repeated 90 times and used to normalize the
90-vector by using a Divider.
We note once again that we had to manually set the output range of the divider.
Fortunately, the purpose of this divider was to divide the elements of a vector by its
magnitude, so we fixed its output range to [-1,1]. Aside from this, all precisions were
left for the wordlength analysis tool to assign.
The normalization design was estimated to be large enough to require an entire
FPGA dedicated to it. Consequently, PE1 was used for normalization, requiring the
signature data to be written to the memory bank of PE1. Chapter 6 will show this
estimate was accurate, as PE1 had a relatively high design occupancy. Figure 5-7
shows the Ptolemy graph for the normalization design. Note that Ptolemy requires
buffers (Shifts of 0) to be placed after the inputs, but these did not affect our analysis,
so we disregard them.
Figure 5-7: Ptolemy Normalization Design
Figure 5-8: Correlation Block Diagram
5.2.2 Correlation
The block diagram for the correlation calculation is shown in Figure 5-8.
There
were two inputs to the correlation: a target signature and an angle input which was
used to produce a template signature. Once both signatures had been generated, the
target signature (which was 90-wide) was passed to 11 parallel Chops which took the
11 different range shifts. Each of these 80-vectors was correlated with the template
vector through the use of a Multiply, an Accumulator, and a Chop. Finally, the 11
correlations were passed through a binary tree of Maximums to produce the maximum
correlation.
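The correlation datapath can be mirrored in software: 11 Chops of the 90-wide target, a multiply-accumulate against the 80-wide template for each shift, and a maximum reduction standing in for the binary tree of Maximums (pure Python, with arbitrary stand-in signatures):

```python
# Arbitrary stand-in signatures: a 90-wide (padded) target, an 80-wide template.
target = [0.01 * i for i in range(90)]
template = [0.01 * (i + 5) for i in range(80)]

def correlate(shift):
    """One range shift: Chop an 80-subvector, then Multiply/Accumulate."""
    window = target[shift:shift + 80]
    return sum(t * u for t, u in zip(window, template))

# The 11 parallel Chop/Multiply/Accumulator paths, one per range shift.
correlations = [correlate(s) for s in range(11)]

# The binary tree of Maximums reduces the 11 results to the best correlation.
best = max(correlations)
print(round(best, 4))
```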
There were many issues to be resolved with the correlation design. The most
difficult decision was how to handle the template library. In software execution, all
templates within 5° in aspect were considered. Due to nonuniform aspect spacing,
this meant that the number of templates being considered was aspect-dependent.
Consequently, there was a data-dependent control flow.
In order to remove this dependency, the template set was filled out to contain 360
templates. Because templates were spaced in multiples of 1°, any angles that were not
represented had templates containing all zeros inserted. Additionally, to facilitate the
±5° aspect span, the template sets were given 5° wraparound at each end. Therefore,
for a given angle, there were exactly 11 templates per class to be considered and they
appeared sequentially in the template library. This resulted in a template library size
of 7 (classes) x 370 (vectors) x 80 (elements/vector) = 207,200 elements.
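The fill-out can be sketched as follows: a sparse map from integer aspect angle to template becomes a dense 370-entry array per class (360 angles plus 5° of wraparound at each end), so any integer query angle selects exactly 11 consecutive templates. The stored angles below are hypothetical (pure Python):

```python
ZERO_TEMPLATE = [0.0] * 80

def fill_out(templates_by_angle):
    """Dense 370-entry library: 360 angles plus 5 wraparound entries per end."""
    dense = [templates_by_angle.get(a, ZERO_TEMPLATE) for a in range(360)]
    return dense[-5:] + dense + dense[:5]    # 5 + 360 + 5 = 370 entries

def candidates(library, angle):
    """The 11 consecutive templates covering angle-5 .. angle+5 degrees."""
    return library[angle:angle + 11]         # entry k holds angle (k - 5) mod 360

# Hypothetical sparse template set for one class (non-uniform aspect spacing).
sparse = {0: [1.0] * 80, 3: [2.0] * 80, 358: [3.0] * 80}
library = fill_out(sparse)

assert len(library) == 370
assert len(candidates(library, 0)) == 11
assert [3.0] * 80 in candidates(library, 0)  # 358 degrees is within 5 of 0
assert 7 * 370 * 80 == 207200                # library size across all 7 classes
```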
An additional problem was how the template library would be used.
Because
there was no way to hold a target vector in place, the decision was made to only
perform one vector comparison per initiation. Note that this meant that each target
vector had to be repeated 77 times (7 (classes) x 11 (aspect angles/class)) for all the
comparisons to be made. Consequently, a simple address generator was built into the
template library which would output one signature given its starting address. Clearly,
this is not the most efficient approach, but timing constraints prevented an alternate
approach (i.e. a memory element could be used to hold the target vector in place)
from being investigated.
The target set consisted of 1,362 vectors. Since each vector had to be repeated 77
times, the input dataset consisted of 104,874 90-vectors, meaning 9,438,660 elements
had to be written to the hardware memory. Because the memory banks only had
262,144 addresses, this dataset was broken into 37 smaller pieces. To make processing
simpler, vectors were not split across pieces; rather, each download consisted of 37
90-vectors repeated 77 times (256,410 elements).
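The partitioning arithmetic can be verified directly; each download of 37 repeated vectors must fit within the 262,144 addresses of a memory bank (pure Python):

```python
TARGET_VECTORS = 1362
REPEATS = 77          # 7 classes x 11 aspect angles per class
VECTOR_LEN = 90
BANK_ADDRESSES = 262144

assert TARGET_VECTORS * REPEATS == 104874
assert TARGET_VECTORS * REPEATS * VECTOR_LEN == 9438660

VECTORS_PER_PIECE = 37
pieces = -(-TARGET_VECTORS // VECTORS_PER_PIECE)      # ceiling division
download = VECTORS_PER_PIECE * REPEATS * VECTOR_LEN   # elements per download
assert pieces == 37
assert download == 256410
assert download <= BANK_ADDRESSES                     # fits in one memory bank
```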
The only remaining difficulties were related to bit precision. The decision was
made to put all of the correlation calculation (not necessarily the template address
generation) in a single FPGA. If the correlation was not able to fit into a single FPGA,
additional designs would be necessary in order to be able to route the template data
to multiple FPGAs, due to the limited number of interconnects between FPGAs.
The most expensive elements in the correlation design were the Multiplies. It was
not possible to fit 11 16-bit multipliers in a single FPGA. Consequently, the multipliers
were reduced to 12-bit input precision. This still resulted in accuracy down to $2^{-10}$
(assuming <1.12> precision), which would be sufficient for determining maximum
correlations. Figure 5-9 shows the Ptolemy graph for the correlation design. Again,
the buffers can be disregarded.
Figure 5-9: Ptolemy Correlation Design
At this point, since the crucial design elements had been decided, the only remaining issue was to make FPGA assignments. Since there were only two large
components (the normalization and the correlation), this was relatively easy. The
data input and normalization had already been placed in PE1. The address input
was sent to PE2. The template library and address generator were placed in PE3.
Finally, the correlator and output memory were assigned to PE4.
5.3 Summary
This chapter analyzed the SHARP algorithm to eliminate any difficulties in creating
a hardware design for it. After analyzing the power transformation, it was decided
that the complexity inherent in creating a design for it outweighed the benefit that
would be gained through its hardware acceleration. Consequently, execution of the
power transformation was left in software.
The least squares fit was also analyzed with regards to hardware issues. The
biggest difficulty encountered was handling the template library. The library was
modified to remove data dependencies and a compromise was made to repeat the
input data set in order to reduce the complexity of the design. Although this compromise greatly reduced the efficiency of the design, Chapter 6 will show that the
resulting hardware acceleration still provided a significant speedup. This chapter finally presented the Ptolemy diagrams that were used to produce the hardware designs
that will be presented in Chapter 6.
Chapter 6
Performance Comparison
Our initial intent was to provide hardware acceleration for the SHARP algorithm,
such that a standalone application could be developed that implemented a portion of
the algorithm in hardware, and demonstrated an appreciable speedup over its fully
software counterpart. Unfortunately, due to project funding and timing constraints, it
was not possible to fully meet this objective. However, here we present the results we
were able to obtain which show that the ultimate objectives are completely attainable.
Recall from Chapter 5 that our decision was to only implement the least squares
fit routine in hardware, due to the inherent complexity of implementing the power
transformation. Consequently, the least squares fit was split into two components.
Bias removal and normalization comprised the first part while correlation and a maximum tree comprised the second. These designs were created separately in order to
facilitate testing, and also because they are completely modular; there would be no
difficulty in merging the components once each had been fully tested.
In Section 6.1, we present the normalization design. We consider both its timing
and algorithmic performance. We also suggest further modifications that result in
improved performance. Section 6.2 presents the correlation design. Although this
design was not fully functional, its performance and timing were estimated. Section
6.3 combines the two designs and compares the estimated timing and performance
results with the software results presented in Chapter 4. Finally, Section 6.4 indicates
how these designs could be further modified to improve the overall performance.
Table 6.1: Normalization Design Bit Precisions

    Block        | Input 1   | Input 2   | Output
    input        |           |           | <4.16>
    Chop         | <4.16>    |           | <4.16>
    Accumulator  | <4.16>    |           | <11.17>
    Chop         | <11.16>   |           | <11.15>
    Const        |           |           | <-6.14>
    Multiply     | <11.15>   | <-6.14>   | <4.16>
    Register     | <4.15>    |           | <4.15>
    Register     | <4.16>    |           | <4.16>
    Subtractor   | <4.16>    | <4.15>    | <5.16>
    Subtractor   | <4.16>    | <4.16>    | <5.16>
    Square       | <5.15>    |           | <10.16>
    Accumulator  | <10.16>   |           | <17.18>
    Chop         | <17.17>   |           | <17.17>
    Sqrt         | <17.16>   |           | <9.15>
    Register     | <9.15>    |           | <9.15>
    Divider      | <5.14>    | <9.15>    | <6.16>
    output       |           |           | <6.16>

6.1 Bias Removal and Normalization
We begin first with the precisions chosen by the wordlength analysis tool, shown in
Table 6.1. Elements are listed as they appear in Figure 5-6, in column-major order.
There are two factors of interest in this table. First, the input precision was
arbitrarily chosen to be <4.16>. We know from our analysis of the power transformation in Section 5.1.1 that a precision of <1.16> would be more appropriate and
provide better precision on the input. Additionally, we note that the Divider output
was not manually set to <1.16>. The <6.16> output precision was a result of the
conservative precision assignment made by the wordlength analysis tool. Figure 6-1
shows the execution schedule for the design, indicating an execution time of 312 clock
cycles. Figure 6-2 shows a floorplan of PE1 for this design that indicates the area of
the design. This floorplan was generated using the graphical floorplanner tool in the
Xilinx Alliance tools.
The floating point software and fixed point hardware results for the input bmp263
Figure 6-1: Execution Schedule for the Normalization Routine
Figure 6-2: FPGA Occupancy for the Normalization Routine
Figure 6-3: Input Signature to Normalization Routine
Figure 6-4: Normalization Routine - Software and Hardware
Figure 6-5: Normalization Routine - Floating Point and Modified Fixed Point
signature shown in Figure 6-3 are shown in Figure 6-4. We can see that the normalization only performed well in hardware at output values close to zero. As the output
signature moved away from 0, the error between the software and hardware versions
increased. By examining the precisions, it was determined that the cause of this poor
performance was the two Accumulators and the Square. The output precisions of these blocks were only accurate within $2^{-5}$, $2^{0}$, and $2^{-5}$, respectively. By
modifying these precisions so that precision was accurate up to the LSB of the input,
we would hope to see better results. In fact, the floating point output is compared
with a software implementation using these fixed-point precisions in Figure 6-5.
Here, we see excellent performance, even using fixed-precision arithmetic. We
would also expect this performance to further improve when the input precision and
divider precision are manually set to <1.16> since these precisions better fit the data.
Additionally, we expect the design cost to decrease slightly because the wordlengths
would grow more slowly given an initial major bit of 1.
67
Table 6.2: Correlation Design Bit Precisions

    Block             | Input 1   | Input 2   | Output
    data input        |           |           | <1.16>
    angle input       |           |           | <15.16>
    Template Library  | <15.16>   |           | <1.16>
    Chop              | <1.16>    |           | <1.12>
    Multiply          | <1.12>    | <1.12>    | <2.16>
    Accumulator       | <2.16>    |           | <9.16>
    Chop              | <9.16>    |           | <4.16>
    Maximum           | <4.16>    | <4.16>    | <4.16>
    output            | <4.16>    |           | <4.16>
6.2 Correlation
While project funding prevented a working correlation design from being fully tested,
we had all the information necessary to simulate execution and predict what the result
would have been. We begin as before with the precisions provided by the wordlength
analysis tool. The precisions along the 11 parallel paths and the precisions of the 10
Maximums were identical, so we present a condensed version of the precision table in
Table 6.2. Figure 6-6 shows the corresponding execution schedule, taking a total of
100 clock cycles. Figure 6-7 shows the occupancy of PE4, since most of the calculation
was there. PE2 and PE3 only contained hardware for address and template generation
and, as a result, were mostly unoccupied.
We see from Table 6.2 that once again, precision was lost in the accumulator.
Consequently, we again modified the accumulator precision to be accurate to the
LSB of its input (i.e. from <9.16> to <9.23>). Figure 6-10 shows the comparison
between software floating point and fixed point simulation when the target signature
shown in Figure 6-8 (already normalized) was correlated with the template signature
shown in Figure 6-9. Figure 6-10 shows the correlations for each of the 11 range shifts.
We can see from Figure 6-10 that with the modified precisions, the correlation
design worked well with fixed-precision arithmetic. Section 6.3 will present results for
the overall algorithm execution.
Figure 6-6: Execution Schedule for the Correlation Routine
Figure 6-7: FPGA Occupancy for the Correlation Routine
Figure 6-8: Correlation Target Signature
Figure 6-9: Correlation Template Signature
Figure 6-10: Correlation by Range Shift
6.3 Final Comparison Estimates
6.3.1 Algorithm Performance
Our first concern should be the fidelity of the algorithm.
Whether the hardware
version of the algorithm executes faster than its software counterpart is irrelevant if
the hardware version performs very poorly. Thus, in Tables 6.3 and 6.4 we present the
confusion matrices obtained through software execution, but with the least squares
fit performed using the precisions we have chosen for the hardware implementation.
Table 6.3: Confusion Matrix by Vehicle Using Fixed-Point Simulation
(rows: actual; columns: predicted)

           BMP263   BMP266   BMP2c21  BTR70c21  T72132   T72812   T72s7
BMP263     46.15%   15.90%   24.10%    4.10%     3.08%    4.10%    2.56%
BMP266      7.69%   58.97%   14.36%    4.10%     4.10%    4.62%    6.15%
BMP2c21    20.51%   13.33%   50.77%    4.10%     4.62%    2.56%    4.10%
BTR70c21    4.62%    6.15%    5.13%   75.38%     3.08%    2.56%    3.08%
T72132      2.04%    1.53%    4.08%    0.00%    75.00%   10.20%    7.14%
T72812      0.00%    2.05%    1.54%    1.03%     6.15%   74.87%   14.36%
T72s7       1.57%    4.19%    2.09%    1.05%     8.90%   11.51%   70.68%
Table 6.4: Confusion Matrix by Vehicle Type Using Fixed-Point Simulation
(rows: actual; columns: predicted)

          BMP2     BTR70    T72
BMP2      83.93%    4.10%   11.97%
BTR70     15.90%   75.38%    8.72%
T72        6.36%    0.69%   92.96%
Comparison with Tables 4.2 and 4.3 shows that performance degraded slightly due
to the use of fixed-precision calculation. However, this degradation does not
appear significant, since the fixed-point results were within 2% of the
floating-point results.
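As a consistency check, the by-type percentages in Table 6.4 can be recovered to within a few hundredths of a percent from the by-vehicle matrix in Table 6.3, by summing the predicted columns within each vehicle type and averaging over the actual rows of that type. The sketch below assumes equal test-vector counts per vehicle, which likely accounts for the small residual differences:

```python
# Rows = actual vehicle, columns = predicted vehicle (Table 6.3 values, %).
vehicles = ["BMP263", "BMP266", "BMP2c21", "BTR70c21", "T72132", "T72812", "T72s7"]
M = [
    [46.15, 15.90, 24.10,  4.10,  3.08,  4.10,  2.56],
    [ 7.69, 58.97, 14.36,  4.10,  4.10,  4.62,  6.15],
    [20.51, 13.33, 50.77,  4.10,  4.62,  2.56,  4.10],
    [ 4.62,  6.15,  5.13, 75.38,  3.08,  2.56,  3.08],
    [ 2.04,  1.53,  4.08,  0.00, 75.00, 10.20,  7.14],
    [ 0.00,  2.05,  1.54,  1.03,  6.15, 74.87, 14.36],
    [ 1.57,  4.19,  2.09,  1.05,  8.90, 11.51, 70.68],
]
groups = {"BMP2": [0, 1, 2], "BTR70": [3], "T72": [4, 5, 6]}

by_type = {}
for a_name, a_rows in groups.items():
    row = {}
    for p_name, p_cols in groups.items():
        # Sum predicted columns within the type, average over actual rows.
        row[p_name] = sum(sum(M[r][c] for c in p_cols) for r in a_rows) / len(a_rows)
    by_type[a_name] = row
```

Under these assumptions the diagonal comes out to roughly 83.93%, 75.38%, and 92.94%, matching Table 6.4 closely.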
6.3.2 Algorithm Timing
We also estimated the timing performance of the hardware version of the algorithm.
Because we know that we will suffer some performance degradation through the use
of fixed-point arithmetic, a hardware implementation is not advantageous when compared with the software version unless it runs significantly faster.
A timing test measured the time necessary to transfer all of the input (data as
well as aspect angles) to the hardware board and read back the output, in the
same manner as would be done in the final working version. The
total time spent in data transfer was 6.99 seconds. We must now estimate the time
spent in calculation. Recall that the normalization and correlation designs took 312
and 100 clock cycles to execute, respectively. However, the last 90 clock cycles of
the normalization design were used for writing the output vector to a memory bank
because the design was standalone. Consequently, in a combined design, the total
execution time would be 312 + 100 - 90 = 322 clock cycles to compare one target
vector to one template vector.
The data was processed in 37 pieces, each consisting of 37 vectors, repeated 77
times. This results in a total of 37 x 37 x 77 x 322 = 33,942,986 clock cycles to process
all vectors. Since the board was running at a clock speed of 2.5 MHz, we should
expect a total computation time of 13.58 seconds. Consequently, the total time spent
in least squares fitting is 20.57 seconds. Because we have not altered the remainder
of the algorithm, we can use the timing measurements from Table 4.4 to estimate the
total execution time for the accelerated algorithm.
We should still expect 18.33 seconds to preprocess the data. Our timing estimate
for least squares fitting is 20.57 seconds. The time for classification should remain
unchanged at .80 seconds. Consequently, the estimated total time for execution of
the accelerated SHARP algorithm is 39.70 seconds. This is clearly an improvement
over the software execution time of 258.48 seconds and indicates that even with a
suboptimal design such as the one proposed here, a significant benefit can be gained
over software execution.
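The timing estimate above can be reproduced with a few lines of arithmetic. The constants are taken directly from the text and Table 4.4; nothing here is a new measurement:

```python
# Reproduces the Section 6.3.2 timing estimate from constants stated in the text.
CLOCK_HZ = 2.5e6                       # Wildforce board clock during the tests
CYCLES_PER_COMPARE = 312 + 100 - 90    # normalization + correlation, minus the
                                       # 90 write-out cycles absorbed in a
                                       # combined design

pieces, vectors_per_piece, repeats = 37, 37, 77
total_cycles = pieces * vectors_per_piece * repeats * CYCLES_PER_COMPARE

compute_s = total_cycles / CLOCK_HZ    # hardware computation time, ~13.58 s
transfer_s = 6.99                      # measured data-transfer time
lsq_s = compute_s + transfer_s         # least squares fitting total, ~20.57 s

preprocess_s = 18.33                   # unchanged stages, from Table 4.4
classify_s = 0.80
total_s = preprocess_s + lsq_s + classify_s   # ~39.70 s

speedup = 258.48 / total_s             # vs. measured software time, ~6.5x
```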
6.4 Proposed Improvements
Even though we have shown that the above design provides a significant speedup
for the SHARP algorithm, several further modifications could improve it even
more. Here, we briefly discuss two such improvements.
The first improvement is fairly straightforward. Even though the Wildforce board
was only running at 2.5 MHz, it supports clock speeds up to 50 MHz [3]. Consequently,
if the clock speed could be increased, we would expect a directly proportional decrease
in the computation time of the least squares fit. The board was only clocked at 2.5
MHz due to a difficulty encountered in receiving board interrupts at higher clock
speeds.
If this problem were resolved, the execution time could be decreased simply by
increasing the Wildforce clock speed.
The second improvement is to design a memory element to hold target vectors in
place. Recall that the lack of this element required the dataset to be repeated 77 times.
Consequently, we should expect a dramatic savings if we successfully implemented it.
However, in order to make full use of it, the template library address generator would
have to be expanded to produce all 77 comparison template vectors.
The first effect of this change would be to reduce the data transfer time. The input
dataset would only consist of 1,362 length-90 vectors and 1,362 aspect angles. The output
would consist of 77 correlations for each of the 1,362 input vectors. Consequently,
only one download would be necessary. We conservatively estimate the data transfer
time to be 1 second.
Additionally, for each input vector, normalization only occurs once, while correlation happens 77 times. Consequently, even if we assume a one-time cost of 90 clock
cycles to store the target vector and a cost of 50 clock cycles for address generation
(a conservative estimate), then the total number of clock cycles to fully process one
vector would be at most 312 + 90 + 77 x (50 + 100) = 11,952 clock cycles. To process all 1,362 vectors (still at 2.5 MHz) would then take approximately 6.5 seconds.
Including data transfer, we see that even our conservative estimate of 7.5 seconds is
a significant improvement over the estimate of 20.57 seconds in Section 6.3.
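Under the stated conservative assumptions of 90 clock cycles to store a target vector and 50 cycles for address generation, the improved-design estimate works out as follows:

```python
# Estimate for the improved design with a memory element holding target vectors.
CLOCK_HZ = 2.5e6
N_VECTORS = 1362                 # input dataset, downloaded only once

STORE_CYCLES = 90                # assumed one-time target-vector storage cost
ADDR_CYCLES = 50                 # assumed (conservative) address-generation cost

# Normalize once, then 77 correlations per vector.
cycles_per_vector = 312 + STORE_CYCLES + 77 * (ADDR_CYCLES + 100)

compute_s = N_VECTORS * cycles_per_vector / CLOCK_HZ   # ~6.5 s at 2.5 MHz
transfer_s = 1.0                 # conservative single-download estimate
total_s = compute_s + transfer_s # ~7.5 s total
```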
6.5 Summary
In this chapter, we have presented timing and performance results, both measured
and estimated, for the hardware acceleration of the SHARP algorithm. We have
demonstrated that although the hardware design we presented was not optimal, its
inefficiency was outweighed by the benefit of simply running in hardware.
Our timing estimate indicated that hardware execution would result in a speedup
on the order of 6.5x when compared with the optimized software implementation
results presented in Chapter 4, while the performance of the algorithm in hardware
remained within 2% of its software counterpart.
We also indicated areas in which the design could be further improved. By increasing the Wildforce clock speed and using a memory element to eliminate the need
to repeat the dataset 77 times, the algorithm speedup could be increased to well over
an order of magnitude, demonstrating the potential of reconfigurable hardware.
Chapter 7
Summary
ATR is an evolving military application. As technology advances, the performance
standards for ATR systems are set higher. Until ATR systems achieve perfect recognition, they will continue increasing in computational intensity as technology permits
in order to improve performance.
It is naturally important that the time for classification remain relatively constant as the classification performance increases. The merits of perfect recognition
are questionable if the recognition cannot be included in a real-time system. Consequently, for military applications, it is important to investigate technologies that are
increasing in computational power as well as speed.
FPGAs have a promising future in areas of intensive computing. Reconfigurable
computing has recently received a great deal of attention due to its computational
capabilities coupled with its affordability and ease of design. Numerous
defense-related projects have begun examining the potential of FPGAs for
computationally intensive applications.
This thesis examined the potential of FPGA acceleration for a particular ATR
system, the SHARP algorithm. While the identification performance of the algorithm
was clearly not precise enough to be included in a practical system, it served as an
instructional case to demonstrate the capabilities of reconfigurable hardware for target
recognition systems.
This thesis also demonstrated the advances being made in FPGA design tools.
The amount of time being spent in hardware design is being drastically reduced
due to the existence of tools like those provided by the ACS program. While the
development of these tools is still far from maturation, this thesis demonstrated that
they can already be used to design and implement sophisticated systems while hiding
most of the low-level decisions from the designer. As this field matures, we can expect
even more low-level decisions to be abstracted away, while achieving more efficient
underlying designs.
By examining these technologies through the framework of a comparatively simple
ATR algorithm, this thesis created a foundation that demonstrates what the current
capabilities are and where the needs for further study lie. It is hoped that the methods
and results presented in this thesis will serve as a basis to advance future research.
Appendix A
Acronyms
ACS: Adaptive Computing Systems
AFRL: Air Force Research Laboratory
APC: Armored Personnel Carrier
ASIC: Application Specific Integrated Circuit
ATR: Automatic Target Recognition
CLB: Configurable Logic Block
DFT: Discrete Fourier Transform
EDIF: Electronic Design Interchange Format
FPGA: Field-Programmable Gate Array
HRR: High Range Resolution
IFFT: Inverse Fast Fourier Transform
LSB: Least Significant Bit
LUT: Lookup Table
MCMC: Markov Chain Monte Carlo
MSTAR: Moving and Stationary Target Acquisition and Recognition
MTI: Moving Target Indication
RMS: Root Mean Square
SAR: Synthetic Aperture Radar
SDF: Synchronous Dataflow
SHARP: System-Oriented HRR Automatic Recognition Program
VHDL: VHSIC Hardware Description Language
VHSIC: Very High Speed Integrated Circuit
WPAFB: Wright-Patterson Air Force Base
Bibliography
[1] Air Force Research Laboratory, An Air-to-Ground Classification Analysis. AFRL
Internal Memo, 1994.
[2] Annapolis Micro Systems, Inc., Annapolis Micro Systems Home Page.
http://www.annapmicro.com, 2001.
[3] Annapolis Micro Systems, Inc., Wildforce Reference Manual, 1999.
[4] BAE SYSTEMS, Algorithm Analysis and Mapping Environment for Adaptive
Computing Systems. http://www.sanders.com/adv-tech/aam.htm, 2000.
[5] B. Bhanu, G. Jones III, Object Recognition Results Using MSTAR Synthetic
Aperture Radar Data. IEEE Workshop on Computer Vision Beyond the Visible
Spectrum: Methods and Applications, pages 55-62, 2000.
[6] Y. Cho, Optimized Automatic Target Recognition Algorithm on Scalable
Myrinet/Field Programmable Array Nodes. IEEE Asilomar Conference on Signals, Systems, and Computers, 2000.
[7] P. Djuric, Bayesian Methods for Signal Processing. IEEE Signal Processing Magazine, pages 26-28, September 1998.
[8] R. Duda, P. Hart, Pattern Classification and Scene Analysis. John Wiley & Sons,
Inc. 1973.
[9] P. Fiore, P. Topiwala, Bit-ordered Tree Classifiers for SAR Target Classification.
IEEE Asilomar Conference on Signals, Systems, and Computers, 1997.
[10] P. Fiore, A Custom Computing Framework for Orientation and Photogrammetry.
Ph.D. Thesis, Massachusetts Institute of Technology, June 2000.
[11] P. Fiore, C. Myers, J. Smith, E. Pauer, Rapid Implementation of Mathematical
and DSP Algorithms in Configurable Computing Devices. SPIE International
Symposium on Voice, Video, and Data Communications, November 1998.
[12] P. Fiore, Suitability of AFRL Forced Decision Algorithm for ACS Demo. Sanders
Internal Memo, August 1999.
[13] P. Fiore, Wordlength Optimization via the MCMC for Custom DSP Computing.
Submitted to IEEE Transactions on Signal Processing, November 2000.
[14] D. Gross et al. High Range Resolution Ground Moving Target ATR Using Advanced Space-Based SAR/MTI Concepts. AIAA Space Technology Conference &
Exposition, September 1999.
[15] R. Hummel, Moving and Stationary Target Acquisition and Recognition
(MSTAR). http://www.darpa.mil/spa/Programs/mstar.htm, 2001.
[16] Khoral Research, Inc. Khoral Home Page. http://www.khoral.com, 2001.
[17] D. Kottke, P. Fiore, Systolic Array for Acceleration of Template Based ATR,
IEEE International Conference on Image Processing, 1997.
[18] T. Lamont-Smith, Translation to the Normal Distribution for Radar Clutter.
IEEE Proceedings - Radar, Sonar, and Navigation, Vol. 147, No. 1, pages 17-22,
February 2000.
[19] B. Levine et al. Mapping of an Automated Target Recognition Application from
a Graphical Software Environment to FPGA-based Reconfigurable Hardware.
IEEE Symposium on Field-Programmable Custom Computing Machines, pages
292-293, 1999.
[20] T. Marzetta, E. Martinsen, C. Plum, Fast Pulse Doppler Radar Processing Accounting for Range Bin Migration. IEEE National Radar Conference, pages 264-268, 1993.
[21] The Mathworks, MATLAB Home Page. http://www.mathworks.com/products/matlab/, 2001.
[22] R. Mitchell, R. DeWall, Overview of High Range Resolution Radar Target Identification. SPIE Automatic Target Recognition Conference, pages 35-47, October
1994.
[23] A. Oppenheim, R. Schafer, Digital Signal Processing. Prentice-Hall, 1975.
[24] E. Pauer, P. Fiore, J. Smith, C. Myers, Algorithm Analysis and Mapping Environment for Adaptive Computing Systems. ACM International Symposium on
Field-Programmable Gate Arrays, 1999.
[25] E. Pauer, P. Fiore, J. Smith, Algorithm Analysis and Mapping Environment
for Adaptive Computing Systems: Further Results. IEEE Symposium on Field-Programmable Custom Computing Machines, pages 264-265, April 1999.
[26] J. Ratches, C. Walters, R. Buser, B. Guenther, Aided and Automatic Target
Recognition Based Upon Sensory Inputs from Image Forming Systems. IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 9, pages
1004-1019, September 1997.
[27] G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, 1998.
[28] Synplicity, Inc., Synplicity Home Page, http://www.synplicity.com, 2001.
[29] University of California at Berkeley, The Ptolemy Project.
http://ptolemy.eecs.berkeley.edu, 2001.
[30] J. Villasenor et al. Configurable Computing Solutions for Automatic Target
Recognition. IEEE Symposium on Field-Programmable Custom Computing Machines, pages 70-79, 1996.
[31] A. Wilkinson, R. Lord, M. Inggs, Stepped-Frequency Processing by Reconstruction of Target Reflectivity Spectrum. IEEE Southern African Symposium on
Communications and Signal Processing, September 1998.
[32] R. Williams et al. Automatic Target Recognition of Time Critical Moving Targets
Using 1D High Range Resolution (HRR) Radar. IEEE Radar Conference, pages
54-59, 1999.
[33] A. Willsky, G. Wornell, J. Shapiro, Stochastic Processes, Detection, and Estimation, 6.432 Course Notes, 1999.
[34] Xilinx Corporation, Xilinx Home Page. http://www.xilinx.com, 2000.