sensors
Article
A Highly Efficient Heterogeneous Processor for
SAR Imaging
Shiyu Wang * , Shengbing Zhang *, Xiaoping Huang, Jianfeng An and Libo Chang
School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
* Correspondence: onion0709@mail.nwpu.edu.cn (S.W.); zhangsb@nwpu.edu.cn (S.Z.);
Tel.: +86-1331-927-0830 (S.W.)
Received: 4 June 2019; Accepted: 1 August 2019; Published: 3 August 2019
Abstract: The expansion and improvement of synthetic aperture radar (SAR) technology have greatly
enhanced its practicality. SAR imaging requires real-time processing with limited power consumption
for large input images. Designing a specific heterogeneous array processor is an effective approach to
meet the power consumption constraints and real-time processing requirements of an application
system. In this paper, taking the chirp scaling algorithm (CSA), a commonly used algorithm for SAR
imaging, as an example, the characteristics of each calculation stage in the SAR imaging
process are analyzed, and the data flow model of SAR imaging is extracted. A heterogeneous array
architecture for SAR imaging that effectively supports fast Fourier transform/inverse fast Fourier
transform (FFT/IFFT) and phase compensation operations is proposed. First, a heterogeneous array
architecture consisting of fixed-point PE units and floating-point FPE units is proposed for the
FFT/IFFT and phase compensation operations, respectively, increasing energy efficiency by
50% compared with the architecture using floating-point units. Second, data cross-placement and
simultaneous access strategies are proposed to support the intra-block parallel processing of SAR
block imaging, achieving up to 115.2 GOPS throughput. Third, a resource management strategy for
heterogeneous computing arrays is designed, which supports the pipeline processing of FFT/IFFT
and phase compensation operation, improving PE utilization by a factor of 1.82 and increasing energy
efficiency by a factor of 1.5. Experimental results show that the processor, implemented in 65-nm
technology, can achieve energy efficiency of up to 254 GOPS/W. The imaging fidelity and accuracy
of the proposed processor were verified by evaluating the image quality of the actual scene.
Keywords: heterogeneous array; SAR imaging; data cross-placement; computing resource management
1. Introduction
Aerospace synthetic aperture radar (SAR) can obtain high-precision microwave images and other
value-added products over large areas at all times and in all weather, and it has an extensive range
of applications in remote sensing, environmental monitoring, geographical mapping, war zone
surveillance, precision guidance, and reconnaissance [1–4].
Extensions and modifications of the SAR technology have significantly increased its practicality
and applications. The demand for high-resolution and wide-swath (HRWS) SAR imaging is growing,
especially in the areas of ocean observation, geological survey, and environmental protection. In 1978,
the United States launched the first spaceborne SAR named Seasat-1. It is a satellite specifically
designed for telemetry of the Earth’s oceans, and is aimed at realizing the possibility of global satellite
monitoring of the oceans and determining the system requirements for marine remote sensing satellites.
RADARSAT-1 was successfully launched in Canada in 1995 [5]. It not only provided Canada with a
large amount of all-weather and all-time SAR data, but also provided useful information for commercial
and scientific users in disaster management, agriculture, mapping, hydrology, forestry, oceanography,
Sensors 2019, 19, 3409; doi:10.3390/s19153409
www.mdpi.com/journal/sensors
ice research, and coastal monitoring. In January 2006, Japan launched the Advanced Land Observing
Satellite (ALOS) [6]. The Phased Array type L-band Synthetic Aperture Radar (PALSAR) that it carried
is not affected by atmospheric conditions, cloud cover, and other related conditions, so it can be
used for ground observations around the clock. In June 2007, TerraSAR-X was launched by the German
National Space Center. Its X-band SAR reliably provided high-resolution, wide-area radar images
regardless of weather conditions, with geometric accuracy superior to that of
any other spaceborne SAR sensor [7]. Moreover, for both civil and military applications, it is desired to
monitor moving targets, including ground moving target indication/ground moving target imaging
(GMTI/GMTIm) [8,9].
For SAR processing systems, SAR imaging accounts for most of the processing time, so imaging
efficiency directly affects the throughput and rapid response capability of the entire platform.
Imaging delay also seriously affects subsequent image processing, such as content analysis, risk
diagnosis, and feature extraction.
Spaceborne and airborne real-time SAR imaging is the most direct and effective real-time imaging
implementation approach, which can quickly provide SAR image data for SAR applications while
significantly reducing the communication burden of air-to-ground data links [10]. At the same time,
the working environment of spaceborne and airborne imaging systems is harsh, and the power
consumption of the processor is also severely limited. Therefore, real-time and low power consumption
are two essential items that must be met by spaceborne/airborne SAR imaging processors.
Since SAR imaging requires a large amount of two-dimensional parallel computing, it is difficult
for a single multi-core central processing unit (CPU) to meet its real-time requirements. The SAR
imaging scheme with multiple CPU nodes has high power consumption and low processing efficiency,
and cannot be applied in spaceborne/airborne SAR processing. Generally, heterogeneous schemes
such as CPU + GPU, CPU + DSP (s), and CPU + FPGA (one or more) can meet the performance
requirements of real-time processing, but their power consumption is above 10 W, or even more than
100 W. A dedicated chip that fully implements the imaging algorithm can achieve better real-time
performance and lower power consumption and is suitable for applications with strict power
constraints, but the
scheme hardens the algorithm, resulting in poor flexibility. ASIP (Application Specific Instruction Set
Processor) is a dedicated processor solution between a general-purpose processor and an application
specific integrated circuit (ASIC). This processor combines the flexibility of a general-purpose processor
and the efficiency of an ASIC. In order to achieve a good trade-off between flexibility and processing
efficiency, the development of a dedicated processor that is capable of fully implementing the SAR
imaging process is an effective solution to meet its power consumption and real-time requirements for
spaceborne/airborne SAR processing.
The chirp scaling algorithm (CSA) is one of the most commonly used algorithms for SAR
imaging [11]. Its calculations mainly include fast Fourier transform/inverse fast Fourier
transform (FFT/IFFT), phase multiplication, interpolation, etc., among which FFT/IFFT operations
account for the highest proportion. The accuracy requirements and computing flow of these operations are
different. Therefore, how to design an array structure and storage structure suitable for such processing
is a key issue to be solved.
With the progress of integrated circuit (IC) technology, more processing units and memory blocks
can be integrated on a single chip. Based on the abundant computational and memory resources on the
chip, to make full use of bandwidth resources, this paper proposes a heterogeneous array structure that
efficiently supports CSA imaging processing by combining block parallelism and pipeline processing
while buffering the intermediate results on-chip.
This structure supports parallel and pipeline processing and increases the utilization of
computing units. Moreover, we have designed an on-chip multi-level data buffer structure matching
the heterogeneous array structure to ensure data supply for pipeline processing. This solution can
reduce the complexity of the system while improving real-time performance.
The paper is organized as follows. Section 2 outlines related work and background. Section 3
analyzes the characteristics of CSA and proposes the design of the processor. Section 4 presents the
heterogeneous architecture implementations. We present the evaluation of experimental results in
Section 5, and the conclusions in Section 6.
2. Related Work
Digital signal processors (DSPs), CPUs, and graphics processing units (GPUs) have respective
advantages in real-time SAR processing. A CPU-based system has good flexibility and
portability [12]. However, its power efficiency for computing is quite low, which is a bottleneck in
real-time SAR applications. GPU-based methods make full use of the GPU’s powerful parallel
computation capability and programmability, which effectively improves the
real-time quality of SAR scene generation [13–16]. At present, the GPU + CPU method can effectively
combine the advantages of the two processors to improve imaging efficiency [17,18]. However, the
average power consumption, which is up to 150 W, limits the application of GPUs in micro air vehicles.
Nowadays, high capability DSPs easily realize many complex theories and algorithms on hardware,
and promote the development of SAR technology [19–21]. In 2003, Hanover University implemented
a SAR real-time processing system using a multi-DSP architecture. This system uses highly parallel
digital signal processor technology (HiPAR-DSP) for SAR signal processing [22]. The Indian Space
Research Organization (ISRO) developed the SAR Specialized Processor (NRTP) based on Analog
Devices’ DSP multiprocessor, which approximates the real-time imaging of SAR [23]. However, for
some applications with strictly constrained power, DSP has lower energy efficiency, resulting in lower
imaging efficiency.
The field-programmable gate array (FPGA) has developed rapidly and has become one of the most
important technologies for realizing digital signal processing. With its rich on-chip memory and
computational resources, FPGA can be configured as a SAR imaging platform to meet the high
throughput rate SAR signal processing requirements [24–26]. An FPGA based on fault-tolerant
architecture (Xilinx Virtex-II Pro) is applied to SAR processing systems [27,28]. In 2006, the University
of Florida developed a high-performance heterogeneous spatial computing framework based on
hardware/software interfaces. In this architecture, the CPU is responsible for scheduling and task
management, and the FPGA acts as a coprocessor for computational acceleration [29]. With the rapid
development of storage capacity and computing power of commercial FPGAs, SAR real-time imaging
systems can all be built by FPGA (Xilinx Virtex-6) [30]. However, for highly complex algorithms, the
development cycle of FPGA is relatively long.
For the real-time requirements and physical implementation limitations of SAR imaging, ASIC
implementation is generally employed [31,32]. The Massachusetts Institute of Technology (MIT)
Lincoln Laboratory uses bit-level systolic-array technology to design a SAR signal processor with high
throughput and low power consumption [33]. The Jet Propulsion Laboratory has also developed an
airborne SAR processing system using a VLSI+SOC (very large scale integration+system on chip)
hardware solution [10]. The processor’s low power consumption and small size make it suitable for
small SAR imaging systems.
In general, the DSP solution is used to implement SAR imaging through software programming.
Since the DSP is designed for general purposes, this implementation has high flexibility and a short
design cycle. It is more suitable for real-time SAR imaging than a CPU, but for low power applications,
it is still not the most suitable choice. The ASIC solution for SAR imaging has the optimal power and
performance for a single computational process. However, SAR imaging combines multiple
calculations on one device, which causes the design cost and power consumption of SAR imaging to
soar, lengthens the design cycle, and reduces flexibility. ASIP makes a good trade-off between the
high flexibility of a general purpose processor and the high processing efficiency of an ASIC, and can
be tailored and optimized for a certain type of algorithm or domain application to meet constraints
such as performance, area, and power consumption. Moreover, it can effectively reduce design cycles
and the design risk. Thus, many advantages of ASIP make it a very important implementation method
in the field of signal processing.
Making trade-offs between speed, cost, power consumption, and flexibility, ASIP design
methodology in the design of SAR real-time signal processing system can not only satisfy the
real-time and performance requirements of aerospace systems, but also shorten the lead time of the
processors. ASIP, when designed with a specific architecture with higher parallelism and higher
complexity, also has good scalability. Therefore, we have designed a dedicated processor that can
fully implement the SAR imaging process to meet the power consumption and real-time requirements of
the application environment.
3. Processor Architecture Design
The CSA is one of the most commonly used algorithms for SAR imaging [11]. Compared with
other algorithms, the CSA has the advantages of a simple operation process, low computational
complexity, and high imaging efficiency. On the other hand, the CSA improves the fidelity of the
image, especially the preservation of the phase information. Moreover, the CSA can adapt to different
radar scanning modes, for example, spotlight, strip-map, scan SAR, sliding spotlight, Tops, and
Mosaic modes [34,35].
3.1. CSA Flow Analysis
The imaging principle of the CSA is shown in Figure 1. The CSA can be divided into three
modules according to function, or into seven steps according to the operation sequence. The
algorithm is executed step by step, alternating between FFT/IFFT and phase compensation operations.
One SAR imaging pass requires four Fourier transforms and three phase multiplications.
[Figure 1: flow chart in three modules and seven steps. SAR echo data → Step 1: azimuth FFT →
Step 2: H1 CS factor → Step 3: range FFT → Step 4: H2 range compensation factor → Step 5: range
IFFT → Step 6: H3 azimuth compensation factor → Step 7: azimuth IFFT → SAR image. Domain labels in
the figure: range-Doppler domain, 2-D frequency domain.]
Figure 1. Chirp scaling algorithm (CSA) flow chart. SAR: synthetic aperture radar.
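The seven-step alternation of transforms and phase multiplications can be sketched in a few lines. The following is a minimal illustration only, not the processor's implementation: the phase factors `h1`, `h2`, and `h3` are assumed to be precomputed complex matrices (their formulas are not given in this section), the data layout (rows indexed by azimuth, columns by range) is an assumption, and all function names are hypothetical. A simple recursive radix-2 FFT stands in for whatever FFT the hardware uses.

```python
import cmath

def fft(x, inverse=False):
    """Unnormalized recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse FFT with 1/n normalization."""
    return [v / len(x) for v in fft(x, inverse=True)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def along_range(f, a):
    """Apply a 1-D transform to every row (assumed range direction)."""
    return [f(row) for row in a]

def along_azimuth(f, a):
    """Apply a 1-D transform to every column (assumed azimuth direction)."""
    return transpose(along_range(f, transpose(a)))

def multiply(a, h):
    """Point-wise complex multiplication by a phase factor matrix."""
    return [[x * y for x, y in zip(ra, rh)] for ra, rh in zip(a, h)]

def csa_flow(echo, h1, h2, h3):
    """Seven steps: four transforms alternating with three phase multiplications."""
    d = along_azimuth(fft, echo)   # Step 1: azimuth FFT
    d = multiply(d, h1)            # Step 2: CS factor
    d = along_range(fft, d)        # Step 3: range FFT
    d = multiply(d, h2)            # Step 4: range compensation factor
    d = along_range(ifft, d)       # Step 5: range IFFT
    d = multiply(d, h3)            # Step 6: azimuth compensation factor
    return along_azimuth(ifft, d)  # Step 7: azimuth IFFT
```

With all three factors set to 1, the four transforms cancel and the output equals the input, which is a convenient sanity check of the data flow before real phase factors are plugged in.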
The Q-point FFT/IFFT can be decomposed into $2Q\log_2 Q$ real multiplications and $3Q\log_2 Q$ real
additions [36]. Table 1 lists the computation quantity of the seven-step operation.
From Table 1, we can see that the proportion of FFT(IFFT) in all operations is:

$$W = \frac{(2 + 2 + 3 + 3)\,NM\log_2 M + (2 + 2 + 3 + 3)\,NM\log_2 N}{10NM\log_2 M + 10NM\log_2 N + 18NM} \qquad (1)$$
Table 1. Computational statistics of CSA.

| Calculation Content | Step | Real-Multi | Real-Add | Total |
|---|---|---|---|---|
| Azimuth-FFT ¹ | 1 | $2NM\log_2 M$ | $3NM\log_2 M$ | $5NM\log_2 M$ |
| CS Factor-Multi ³ | 2 | $4NM$ | $2NM$ | $6NM$ |
| Range-FFT | 3 | $2NM\log_2 N$ | $3NM\log_2 N$ | $5NM\log_2 N$ |
| RC Factor ⁴-Multi | 4 | $4NM$ | $2NM$ | $6NM$ |
| Range-IFFT ⁵ | 5 | $2NM\log_2 N$ | $3NM\log_2 N$ | $5NM\log_2 N$ |
| AC Factor ⁶-Multi | 6 | $4NM$ | $2NM$ | $6NM$ |
| Azimuth-IFFT | 7 | $2NM\log_2 M$ | $3NM\log_2 M$ | $5NM\log_2 M$ |
| Total | - | $4NM\log_2 M + 4NM\log_2 N + 12NM$ | $6NM\log_2 M + 6NM\log_2 N + 6NM$ | $10NM\log_2 M + 10NM\log_2 N + 18NM$ |

¹ FFT: Fast Fourier Transformation; ² M: Azimuth direction sample numbers; N: Range direction sample
numbers; ³ CS Factor: Chirp Scaling Factor; ⁴ RC Factor: Range Compensation Factor; ⁵ IFFT: Inverse
Fast Fourier Transform; ⁶ AC Factor: Azimuth Compensation Factor.

For different imaging matrix sizes, the proportion W of FFT(IFFT) is slightly different, as shown in
Table 2. It can be shown from Table 2 that the W values are basically above 90% and can reach up to
95% as the matrix size becomes larger. Therefore, accelerating the FFT/IFFT operation will inevitably
reduce the imaging time and optimize the imaging efficiency.

Table 2. Computational load statistics.

| Image Size | 256 × 256 | 1024 × 1024 | 4096 × 4096 | 16,384 × 16,384 | 65,536 × 65,536 |
|---|---|---|---|---|---|
| FFT Computational Load | $10^{7}$ | $2.1 \times 10^{8}$ | $4.1 \times 10^{9}$ | $7.6 \times 10^{10}$ | $1.2 \times 10^{12}$ |
| Phase Compensation Computational Load | $10^{6}$ | $1.8 \times 10^{7}$ | $3 \times 10^{8}$ | $4.8 \times 10^{9}$ | $1.2 \times 10^{10}$ |
| W-Value | 89.8% | 91.7% | 93% | 94% | 94.7% |
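The proportion W in Equation (1) is easy to check numerically. The short sketch below evaluates it for the square image sizes listed in Table 2, using only the operation counts from Table 1; the function name `fft_share` is a hypothetical helper introduced here for illustration.

```python
import math

def fft_share(n_range, m_azimuth):
    """Proportion W of FFT/IFFT operations among all CSA operations, per Equation (1)."""
    nm = n_range * m_azimuth
    fft_ops = 10 * nm * math.log2(m_azimuth) + 10 * nm * math.log2(n_range)
    phase_ops = 18 * nm  # three phase multiplications, 6NM real operations each
    return fft_ops / (fft_ops + phase_ops)

for size in (256, 1024, 4096, 16384, 65536):
    print(f"{size} x {size}: W = {100 * fft_share(size, size):.1f}%")
```

For square images the NM factor cancels, so W depends only on the logarithm of the size. The computed values rise from just under 90% toward 95% and agree with the W-values in Table 2 to within about 0.1 percentage point.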
3.2. Computation Flow Strategy
In the imaging process, we adopt the block imaging method and perform parallel processing
between blocks. Within the algorithm, the four FFT/IFFT and three phase operations are pipelined
according to the algorithm flow, while each multi-range (multi-azimuth) FFT/IFFT and phase
operation can itself be processed in parallel. To organize the pipeline processing of the two types
of operations in SAR imaging, we designed a calculation process based on space–time flow (ST-Flow),
as shown in Figure 2. At any given time, in space, multi-line FFT/IFFT can be performed in parallel,
and the phase compensation operation can be calculated simultaneously at multiple points, so no
calculation unit is idle. On the timeline, data is continuously fed into the processing unit, and
the calculation unit does not stall waiting for data. With this ST-Flow, SAR imaging can be done in
a continuous process.
[Figure 2: along the time-line (pipeline) axis, FFT/IFFT and phase multiplication (PM) stages
alternate; the space-line (parallel) axis shows the operations performed in parallel.]
Figure 2. SAR imaging flow.
3.3. Heterogeneous Arrays
CSA includes scalar operations for phase multiplication and the vector operation FFT (IFFT). As
Table 2 shows, FFT/IFFT operations account for up to 95% of SAR imaging, so accelerating the
FFT/IFFT operation efficiently is the most important approach for imaging processors.
The fixed-point FFT/IFFT operation with lower accuracy causes only a small loss of imaging accuracy
and can significantly improve the processing throughput. In [37], the quantization error power of
the fixed-point processing CSA was evaluated in detail. The analysis results showed that as the word
length increases from 12 to 16, the quantization error power remains essentially unchanged, and the
imaging quality with a 15 or 16-bit word length is very close to that of single-precision
floating-point. Therefore, we design PE arrays to support 12-bit, 14-bit, and 16-bit fixed-point
FFT/IFFT. For applications with lower accuracy requirements, low-bit width operation can be selected.
However, the phase compensation operation requires high precision and must use floating-point
arithmetic operations. Based on the earlier description and discussion, a heterogeneous array is
designed, which includes two types of computing units named PE and FPE. PE is used for the FFT/IFFT
operation and FPE is used for the phase compensation operation.
Since the operation ratio of FFT/IFFT against phase compensation is approximately 9:1, the
configuration of PE and FPE should also follow this proportional relationship. For smaller matrix
sizes, the FFT/IFFT share is near 90%; to meet the different matrix sizes, we design the processing
array so that the ratio of PE to FPE is 8:1, as shown in Figure 3.
[Figure 3: eight lines of PEs for FFT/IFFT, together with FPEs for phase compensation.]
Figure 3. PE array structure diagram. PE: computing unit for FFT/IFFT operation in a heterogeneous
array.
In CSA flow, each range/azimuth FFT/IFFT operation is relatively independent, and there is no
data dependency between range/azimuth lines, so each range/azimuth FFT/IFFT operation can be
performed in parallel. Moreover, in the FFT/IFFT operation, each butterfly operation is relatively
independent, and multiple butterfly operations can be performed in parallel. The phase compensation
process performs independent operations at a single point so that multiple independent operations
can be performed in parallel.
In CSA flow, the four FFT/IFFT and three phase operations are data dependent; they are processed in
a pipeline. As shown in Figure 4, to establish a pipeline between the FFT and the phase operation,
the parallel FFT/IFFT operations are offset from one another by 1/8 of a computation cycle.
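To make the selectable word length of the PE array concrete, the sketch below quantizes a single complex multiplication, the core of an FFT butterfly, to 12, 14, and 16 bits and reports the error against floating point. Everything in it is an assumption made for illustration: the fractional format (one sign bit, the rest fraction), truncation by right shift, and the helper names are not taken from the processor's datapath, and the sketch does not reproduce the error analysis of [37].

```python
import cmath

def to_fixed(x, bits):
    """Quantize x in [-1, 1) to a signed integer using `bits` total bits (assumed format)."""
    scale = 1 << (bits - 1)
    q = round(x * scale)
    return max(-scale, min(scale - 1, q))  # saturate to the representable range

def fixed_cmul(a, b, bits):
    """Multiply two fixed-point complex numbers given as (re, im) integer pairs."""
    shift = bits - 1
    re = (a[0] * b[0] - a[1] * b[1]) >> shift  # truncate back to `bits` bits
    im = (a[0] * b[1] + a[1] * b[0]) >> shift
    return re, im

def cmul_error(x, w, bits):
    """Absolute error of the fixed-point product relative to the floating-point product."""
    scale = 1 << (bits - 1)
    a = (to_fixed(x.real, bits), to_fixed(x.imag, bits))
    b = (to_fixed(w.real, bits), to_fixed(w.imag, bits))
    re, im = fixed_cmul(a, b, bits)
    return abs(complex(re, im) / scale - x * w)

x = 0.5 + 0.25j                     # sample data value
w = cmath.exp(-1j * cmath.pi / 4)   # sample twiddle factor
for bits in (12, 14, 16):
    print(bits, cmul_error(x, w, bits))
```

In this toy setting the error of one multiplication stays within a few least significant bits at every word length. How such errors accumulate over a full FFT and affect image quality is the question addressed in [37].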
[Figure 4: rows of FFT/IFFT operations along the space-line (parallel) axis and a row of phase
multiplication (PM) operations along the time-line (pipeline) axis.]
Figure 4. Pipeline between FFT/IFFT and phase operation.
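The effect of offsetting the parallel FFT/IFFT lines by 1/8 of a computation cycle can be illustrated with a small timing model. This is a sketch under stated assumptions rather than the processor's scheduler: it assumes eight FFT lines that each take one time unit per transform, and it assumes that the phase multiplication of one line's output takes 1/8 of that time, which is suggested by the 8:1 PE-to-FPE ratio but not stated explicitly. The function name is hypothetical.

```python
from fractions import Fraction

def staggered_schedule(num_lines=8, fft_time=Fraction(1), rounds=3):
    """Return (fft_done, pm_start, pm_end) for each FFT result, in time order."""
    offset = fft_time / num_lines   # start-time offset between neighbouring lines
    pm_time = fft_time / num_lines  # assumed phase-multiplication time per FFT result
    done = sorted(line * offset + (r + 1) * fft_time
                  for r in range(rounds) for line in range(num_lines))
    events = []
    fpe_free = Fraction(0)
    for t in done:
        start = max(t, fpe_free)    # wait only if the phase unit is still busy
        fpe_free = start + pm_time
        events.append((t, start, fpe_free))
    return events

for fft_done, pm_start, pm_end in staggered_schedule(rounds=1):
    print(f"FFT done at {fft_done}, PM runs {pm_start} -> {pm_end}")
```

Under these assumptions every FFT result is consumed the moment it is produced and the phase unit is busy without gaps. If all eight lines finished at the same instant instead, seven results would have to wait in a buffer.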
3.4. Data Placement and Simultaneous Access
In the FFT/IFFT process, the data transfer follows a bit-reverse address sequence. To support this
data access pattern, we use a multi-bank distributed data placement strategy, as shown in Table 3.
According to the calculation requirements, one row of PEs performs 16 butterfly operations in
parallel and must be supplied with 32 data at the same time. Therefore, data access is performed in
parallel. As shown in Figure 5, 32 data are simultaneously accessed from Bank 0 and Bank 1 in the
first cycle. In the second cycle, data are read simultaneously from Bank 2 and Bank 3. Bank
selection and the address in a bank are generated to follow each step in the FFT/IFFT processing
flow. Although each PE performs a different FFT/IFFT operation, they use similar data placement and
access strategies.
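As a reminder of what a bit-reverse address sequence looks like, the sketch below computes the generic radix-2 bit-reversal permutation. It is a textbook illustration, not the address generator of the processor: Figure 5 refers to base-4 butterflies, for which the hardware ordering may differ, and the function names are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(n):
    """Bit-reversed address sequence for an n-point transform (n a power of two)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]

print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Neighbouring positions in this sequence map to addresses that are far apart in memory, which is why a single-port, single-bank memory cannot feed many butterflies per cycle and a multi-bank placement is used.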
Table 3. 4096 points input data storage in four banks.

| Bank_NO. | Input Data Storage Status in Bank |
|---|---|
| Bank_0 | 0–15, 64–79, ... |
| Bank_1 | 16–31, 80–95, ... |
| Bank_2 | 32–47, 96–111, ... |
| Bank_3 | 48–63, 112–127, ..., 4080–4095 |
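The placement in Table 3 amounts to distributing groups of 16 consecutive points over the four banks in round-robin order. The sketch below expresses that mapping. The bank index follows directly from the table, whereas the address inside a bank is an assumed layout (groups stored back to back), since the table does not specify it; `place` and `bank_contents` are hypothetical helper names.

```python
BANKS = 4    # number of memory banks in Table 3
GROUP = 16   # consecutive points stored together in one bank

def place(index):
    """Map an input point index to (bank, address within bank)."""
    group = index // GROUP
    bank = group % BANKS
    # Assumed in-bank layout: groups are stored back to back.
    address = (group // BANKS) * GROUP + index % GROUP
    return bank, address

def bank_contents(bank, points=4096):
    """List the point indices stored in one bank."""
    return [i for i in range(points) if place(i)[0] == bank]

for bank in range(BANKS):
    contents = bank_contents(bank)
    print(f"Bank_{bank}: {contents[0]}-{contents[15]}, {contents[16]}-{contents[31]}, ...")
```

With this mapping, any 32 consecutive points that start on a group boundary fall into two different banks, so they can be read in the same cycle, as in the access pattern described for Figure 5.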
[Figure 5: timing diagram. Groups of 16 operands are delivered to PE0_0~PE0_15: Data(0)–Data(15)
from Bank_0, Data(16)–Data(31) from Bank_1, Data(32)–Data(47) from Bank_2, and Data(48)–Data(63)
from Bank_3, and so on.]
Figure 5. 16 PEs perform base-4 butterfly operation timing diagram for four banks.
There is no special requirement for the sequence of data in the phase compensation calculation
process; therefore, as shown in Figure 6, the calculation process only needs to access the data in
parallel.
Sensors 2019, 19, 3409
Figure 6. 32 FPEs perform phase operation for two banks. FPE: computing unit in a heterogeneous
array used for phase compensation operation.
4. Architectural Implementations
4.1. Overall Architecture
A highly efficient heterogeneous processor for SAR imaging is designed. Figure 7 shows the top-level
architecture of the proposed SAR imaging processor. This section describes the overall hardware block
diagram and functional modules. Essentially, the architecture consists of three major components:
a hybrid–PE array, an on-chip buffer module, and a data systolic engine.
Figure 7. Top-level architecture of processor. Blocks shown: heterogeneous PE array (PEs and FPEs),
on-chip Buffer A and Buffer B (4-KB banks), Local-TF buffer (32 KB), Local-PF buffer (16 KB),
resource controller, IO/decoder, and external memory system.
To meet the throughput requirement of SAR imaging, two identical sets of heterogeneous arrays
are implemented, which can perform different block imaging processing computations in parallel.
Each of the heterogeneous arrays contains 16 × 16 PEs and 2 × 16 FPEs. The number ratio of PE and
FPE satisfies the proportional relationship of 8:1.
To feed the processing array with adequate data supply, three types of buffers are implemented
on chip. In a processing array, all the data banks for 16-line PEs and two-line FPEs are organized as
a 264-KB data buffer with two sub-buffers, each of which contains 32 banks for PEs and one bank
for an FPE. A 32-KB twiddle factor dedicated local buffer (Local-TF buffer) and a 16-KB phase factor
dedicated local buffer (Local-PF buffer) for the phase compensation operation are also implemented
inside a processing array.
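The sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the 4-KB bank size is taken from the labels in Figure 7, and the constant and function names are hypothetical, not part of the design.

```python
# Figures quoted in the text; the names below are illustrative only.
BANK_KB = 4        # size of one data bank (label in Figure 7)
PE_BANKS = 32      # banks serving the PE rows in one sub-buffer
FPE_BANKS = 1      # bank serving the FPEs in one sub-buffer
SUB_BUFFERS = 2    # sub-buffers per processing array


def data_buffer_kb():
    """Total data buffer size of one processing array, in KB."""
    return SUB_BUFFERS * (PE_BANKS + FPE_BANKS) * BANK_KB


def pe_to_fpe_ratio(pe_rows=16, fpe_rows=2, cols=16):
    """Number of PEs per FPE in one heterogeneous array."""
    return (pe_rows * cols) // (fpe_rows * cols)
```

Under these assumptions, two sub-buffers of 33 banks each give the 264-KB data buffer, and 256 PEs against 32 FPEs give the 8:1 ratio.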
To organize the data transfer between off-chip RAM and on-chip buffers, a data systolic engine is
implemented. With this data systolic engine, the input raw image echo can be read and the imaging
output can be written back following the processing flow.
4.2. Heterogeneous PE Arrays
Each PE performs a pipelined four-point butterfly operation in six cycles, and all of the PEs in a
row perform the butterfly operations of a block in parallel. During the FFT/IFFT operation, all 64
input data are sent to one row of PEs in two cycles from the data buffer, and the 64 output data are
written back to the data buffer in two cycles.
In a heterogeneous array, as shown in Figure 7, the PEs are interconnected to pass twiddle factors,
and the Local-TF buffer distributes the twiddle factors to the PEs from top to bottom. The twiddle
factors pass two rows down each cycle, so the required twiddle factors are assigned to the 16 rows of
PEs in eight cycles. In addition, each PE supports zero-padding to expand the raw data length to an
integer power of two.
During the phase compensation operation, the two input data banks send 32 input data to two
rows of FPEs (32 FPEs) in parallel. The Local-PF buffer passes and distributes the phase compensation
factor from bottom to top.
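As a minimal sketch of the arithmetic a single PE carries out, the following Python function computes one four-point (base-4) butterfly. It uses floating-point complex numbers for readability, whereas the PEs described here are fixed-point; applying the twiddle factors to the inputs before the butterfly is also an assumption, and all names are hypothetical.

```python
def radix4_butterfly(x0, x1, x2, x3, w1=1, w2=1, w3=1):
    """Four-point DFT of (x0, x1*w1, x2*w2, x3*w3).

    w1..w3 are twiddle factors (assumed to be applied to the inputs).
    """
    a, b, c, d = x0, x1 * w1, x2 * w2, x3 * w3
    t0, t1 = a + c, a - c
    t2, t3 = b + d, -1j * (b - d)
    # Outputs X[0], X[1], X[2], X[3] of the four-point DFT.
    return t0 + t2, t1 + t3, t0 - t2, t1 - t3


PES_PER_ROW = 16
POINTS_PER_BUTTERFLY = 4


def points_per_row():
    """Input points consumed by one row of PEs per butterfly pass."""
    return PES_PER_ROW * POINTS_PER_BUTTERFLY


def twiddle_fill_cycles(rows=16, rows_per_cycle=2):
    """Cycles needed for twiddle factors to reach every PE row."""
    return rows // rows_per_cycle
```

With 16 PEs per row and four points per butterfly, one row consumes the 64 input data mentioned above, and passing two rows per cycle reaches all 16 rows in eight cycles.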
4.3. Alternate Systolic-Memory and On-Chip Buffer Organization
Since on-chip memory space is limited, all of the radar echo data is stored in the external memory
first. As shown in Figure 8, the data systolic engine (DSE) fetches the data from dynamic random
access memory (DRAM) and pushes the data into on-chip memory. To hide the communication latency
of data transfer between DSEs and arithmetic components, we employ the alternate systolic technique.
In order to avoid DSE competition in hardware resources, we use two alternate systolic memory
modules for each of the input/output interfaces for the whole system. At the same time, we adopt
two DSE channels for input data and weight at the input end. The proposed memory architecture can
provide 4 GB/s of read/write memory bandwidth at 250-MHz frequency to satisfy the data requirements
of the processor.
As shown in Figure 8, our storage architecture consists of three layers: DRAM, a data transfer
engine system, and an on-chip buffer. Since the on-chip storage resources are limited in size, all the
pending radar echo data is first stored in off-chip memory (DRAM). During data processing, the data
is first cached by the data transfer engine system into the on-chip buffer, and then sent to the PE
array for processing by the on-chip buffer. As shown in Figure 8, in order to hide the communication
latency between the off-chip memory and the on-chip buffer, we use the double-buffered data alternate
transmission method.
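The benefit of alternating (double-buffered) transfers can be illustrated with a simple timing model. This is a sketch under stated assumptions, not the authors' implementation: every block is assumed to take the same load and compute time, and the 128-bit interface width is read from a label in Figure 7; the function names are hypothetical.

```python
def sequential_cycles(n_blocks, load, compute):
    """Total cycles when each block is loaded, then processed, with no overlap."""
    return n_blocks * (load + compute)


def double_buffered_cycles(n_blocks, load, compute):
    """Total cycles when loading block k+1 overlaps processing of block k."""
    if n_blocks == 0:
        return 0
    # First load and last compute cannot be hidden; the slower stage sets the pace.
    return load + (n_blocks - 1) * max(load, compute) + compute


def bandwidth_gb_per_s(bus_bits=128, freq_mhz=250):
    """Peak bandwidth of a bus_bits-wide interface at freq_mhz (assumed width)."""
    return (bus_bits // 8) * freq_mhz / 1000
```

For example, four blocks with a 3-cycle load and 5-cycle compute take 32 cycles sequentially but 23 cycles with two alternating buffers. A 128-bit interface at 250 MHz corresponds to the 4 GB/s quoted above.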
Figure 8. Memory hierarchy architecture. (a) Input port; (b) Output port. Blocks shown: off-chip
DRAM, DRAM controller, data systolic engines (DSEs) with asynchronous buffers, data transformation
pipeline engine (DTPE), and on-chip buffers.
4.4. Resource Controller
The resource controller is responsible for allocating the execution unit and arranging the access
flow of the on-chip buffer.
Two imaging blocks are respectively assigned to two arrays for parallel processing. The
FFT/IFFT and phase compensation operations are involved in the intra-block processing, so the PE is
assigned to the FFT/IFFT during the calculation and the FPE is assigned to the phase compensation
operation.
According to the designed data mapping and access strategy, in order to support the parallel
access of data, the resource controller allocates bank and bank addresses for each range of data.
When performing range FFT/IFFT, each row of data is stored in four banks according to a distributed
storage strategy.
As shown in Figure 9, we take a row of 1024 points as an example (r = 1024). When performing
FFT/IFFT, the 1024 points are segmented and stored in four banks according to the distributed storage
strategy. A total of 16 consecutive points are used as a segment: points 0 to 15 are placed in
Bank_0, 16 to 31 are placed in Bank_1, 32 to 47 are placed in Bank_2, and 48 to 63 are placed in
Bank_3; the above operation is repeated until all data of the 64 segments are stored. A base-4
FFT/IFFT operation at 1024 points requires a total of five levels of operation. The calculation
process uses multi-bank parallel data access. Taking the first stage as an example, data 0 to 31 is
read from Bank_0 and Bank_1 in the first cycle, and data 992 to 1023 is read from Bank_2 and Bank_3
in the second cycle. The data access process of the latter four operational levels is similar to
that of the first level.
Similarly, when performing azimuth FFT/IFFT, each azimuth of data is stored in four banks
according to the storage strategy (taking 1024 points as an example, a = 1024). The data access
process is similar to that of the range FFT/IFFT.
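The distributed storage strategy described above amounts to a round-robin assignment of 16-point segments to four banks. The sketch below illustrates that mapping; the in-bank address layout (segments packed back to back) is an assumption, since the text only specifies which bank each segment goes to, and all names are hypothetical.

```python
SEGMENT = 16   # consecutive points per segment
BANKS = 4      # banks in one bank group


def bank_of(index):
    """Bank that holds point `index` under round-robin segment placement."""
    return (index // SEGMENT) % BANKS


def address_in_bank(index):
    """Offset of point `index` inside its bank (assumed packed layout)."""
    return (index // (SEGMENT * BANKS)) * SEGMENT + index % SEGMENT


def segment_count(n_points):
    """Number of 16-point segments in a row of n_points."""
    return n_points // SEGMENT


def butterfly_stages(n_points):
    """Number of base-4 stages for an n_points FFT (n_points a power of 4)."""
    stages = 0
    while n_points > 1:
        n_points //= 4
        stages += 1
    return stages
```

For a 1024-point row this gives 64 segments and five base-4 stages, and points 992 to 1023 fall in Bank_2 and Bank_3, consistent with the access pattern described for the first stage.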
Figure 9. Data access pattern. Panels: range operation and azimuth operation, each mapping the
a × r (azimuth × range) data matrix onto a bank group (Bank_0 to Bank_3).
SAR imaging is a continuous process with huge differences in operational density between
FFT/IFFT and the phase compensation operation. For the characteristics of the computational process,
we have designed a way to organize the processing of SAR imaging in space and time flow (ST-Flow),
as shown in Figure 10.
Taking 1024 points FFT/IFFT as an example, each FPE performs a one-point phase compensation
operation in one cycle, and all the FPEs in a row perform phase compensation operations in parallel.
During the phase compensation operation, all 16 input data are sent to one row of FPEs in one cycle
from the data buffer, and 16 output data are written back to the data buffer in one cycle. It can be
seen that the 1024-point phase compensation operation requires 64 cycles. In order to satisfy the task
saturation and parallelism of the parallel pipeline between phase compensation and FFT/IFFT, the
resource controller sets the start time for each row of PE to be delayed by 64 cycles from the previous
row. Considering the different matrix sizes, the ratio of PEs to FPEs is configured to be 8:1, so for larger
matrices, the FPE will be idle. During the processing of the FPE, it is necessary to wait for the PE to
complete the FFT operation before starting the processing of the next frame.
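The staggered start times described above follow directly from the phase compensation cycle count. The following sketch reproduces that arithmetic; it models only the start offsets, not the full ST-Flow schedule, and the function names are hypothetical.

```python
def phase_comp_cycles(n_points, fpes_per_row=16):
    """Cycles for one FPE row to process n_points at one point per FPE per cycle."""
    return -(-n_points // fpes_per_row)  # ceiling division


def pe_row_start_cycles(rows=16, n_points=1024):
    """Start cycle of each PE row, each delayed by one phase compensation pass."""
    delay = phase_comp_cycles(n_points)
    return [row * delay for row in range(rows)]
```

For 1024 points, `phase_comp_cycles` returns 64, so consecutive PE rows start at cycles 0, 64, 128, and so on, with the sixteenth row starting at cycle 960.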
Figure 10. Space–time flow (ST-Flow) of imaging processing. Axes: space-line (parallel processing of
imaging blocks) and time-line (cycles). PM: phase multiplication.
5. Processor Performance Evaluation
We implemented the SAR imaging processor in 65-nm CMOS (complementary metal oxide
semiconductor) technology with a 1.2 V supply voltage using Synopsys tools. Figure 11 shows the die
photograph of the chip. In the evaluation, the CS imaging algorithm is selected as the benchmark.
5.1. Performance Analysis
In this section, we configure the processor with fixed-point PEs and single-precision floating-point
FPEs. We evaluate the processor performance at 200 MHz with different fixed-point lengths. The test
echo data matrix size is 16,384 × 16,384. We perform two operations in parallel on the heterogeneous
PE, which can take advantage of the computing power and increase the throughput. When the CSA is
processed in heterogeneous PE mode, the throughput reaches 115.2 Giga operations per second
(GOPS), with 463 mW of power consumption. As shown in Table 4, when all the imaging processes
use single-precision floating-point units, the power consumption of the processor is up to 713 mW,
and its energy efficiency is only 67% of the fixed/floating point heterogeneous imaging mode. Also,
the processor can reduce a small amount of power consumption when selecting low-bit fixed-point
FFT/IFFT operation. The processor consumes 463 mW for 16-bit fixed-point FFT/IFFT and reduces to
454 mW for 12-bit fixed-point FFT/IFFT, as shown in Table 4.
Figure 11. Die photograph of the chip. Labeled blocks: PLL, controller, two PE arrays, REGs, TF
buffer and PF buffer, and Buffer A and B.
Table 4. System performance assessment with different fixed-point length FFT. 1
PE Bit-Width (Bits) | 12 | 14 | 16 | Single-Precision Floating
Throughput (GOPS) | 115.2 | 115.2 | 115.2 | 115.2
Power (mW) | 454 | 459 | 463 | 713
Energy efficiency (GOPS/mW) | 0.254 | 0.250 | 0.240 | 0.16
1 Table 4 provides statistics on throughput, power consumption, and energy efficiency for the entire
heterogeneous processor.
5.2. Array Utilization Analysis
As shown in Table 5, we can see that in the algorithm processing, the ST-flow two-dimensional
parallel pipeline achieves better array utilization than one-dimensional time-based computational flow
(TI-flow). The high utilization of the array can increase the throughput of the system. The time-based
computational flow (TI-flow) that is employed in existing processors is inefficient for SAR imaging
processing. As shown in Table 5, in the ST-Flow mode, the FFT operation and the phase multiplication
(PM) operation are pipelined, the throughput reaches 115.2 GOPS, the resource utilization rate can
reach 98.8%, and the energy efficiency is 0.24 GOP/mW. In TI-Flow mode, the FFT operation and the PM
operation are executed sequentially, the throughput is only 62.6 GOPS, the resource utilization rate is
54.3%, and the energy efficiency is only 0.16 GOP/mW. Compared with the TI-Flow mode, the resource
utilization in ST-Flow mode significantly increases, the throughput increases by 84.5%, and the average
power consumption only increases by 21.2%.
Table 5. Array utilization with ST-flow and time-based computational flow (TI-flow). GOPS: Giga
operations per second.
Metric | ST-Flow | TI-Flow: FFT | TI-Flow: Phase Compensation | TI-Flow: Overall
Array utilizations | 98.8% | 88.9% | 11.1% | 54.3%
Throughput (GOPS) | 115.2 | 102.4 | 12.8 | 62.6
Power (mW) | 463 | 435 | 317 | 382
Energy efficiency (GOP/mW) | 0.24 | - | - | 0.16
5.3. Analysis of Array Scalability
We analyze the performance of a single heterogeneous array, as shown in Figures 12 and 13. On
the horizontal (X) axis, the numbers 5, 9, 18, and 36 represent the array scales of 5 × 4, 9 × 8, 18 × 16,
and 36 × 32, respectively.
As the size of the array increases, the throughput and imaging efficiency of the system increase
significantly, but the power consumption of the processor also rises sharply. In general, the power-delay
product and energy efficiency of large PE arrays are better than those of small PE arrays. On the other
hand, the array size must be closely matched to the buffer size; an oversized or undersized array
configuration will result in wasted PE resources or low memory bandwidth utilization. Therefore, the
size of a single heterogeneous processing array is designed to be 18 × 16 after a trade-off between the
chip implementation complexity and processing performance.
Figure 12. Power consumption and imaging time with different array sizes.
Figure 13. Energy efficiency and power-delay product with different array sizes.
5.4. Comparison with Other Schemes
Table 6 lists the SAR imaging time for different sizes of input. For the ordinary SAR radar
(for instance, the Chinese Gaofen-3 satellite, pulse repetition frequency: 2000 Hz), the real-time
processing time of 16,384 × 16,384 SAR raw data requires 8 s. The proposed scheme can meet the
real-time requirements.
The power consumption and SAR imaging time for other studies are also listed in Table 6.
As can be seen from Table 6, the power consumption of the proposed scheme is the smallest, because
the proposed scheme can completely realize the entire SAR imaging process without additional
microcontroller unit (MCU) or CPU. Similar to [15], the Mobile-GPU architecture uses a lower power
cost (5 W) to achieve better real-time performance. Compared with [15], the proposed architecture is
better in performance-to-power ratio and improves by a factor of 230.4. From the real-time performance
perspective, the CPU + GPU scheme is the best, but its power consumption exceeds 300 W. The
real-time performance of the proposed scheme is only 8.6% of [17], but the performance-to-power
ratio improves by a factor of 63.4. Table 7 shows the comparison of the proposed scheme and related
research in real-time performance. As can be seen from Table 7, compared with [15], the speedup ratio
reached 21.33.
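The improvement factors quoted above can be reproduced from the power and processing-time figures in Table 6. One reading that matches the quoted numbers, offered here as an assumption rather than the authors' stated definition, is that the performance-to-power improvement is the ratio of power × time (energy per frame) between the reference design and the proposed one. The function names are hypothetical.

```python
def energy_joules(power_w, time_s):
    """Energy spent on one frame: power multiplied by processing time."""
    return power_w * time_s


def perf_per_power_gain(ref_power_w, ref_time_s, power_w, time_s):
    """Ratio of the reference design's energy per frame to the proposed design's."""
    return energy_joules(ref_power_w, ref_time_s) / energy_joules(power_w, time_s)


def speedup(ref_time_s, time_s):
    """Speed-up ratio of the proposed design over a reference design."""
    return ref_time_s / time_s


# Values from Table 6: Mobile-GPU (5 W, 3.2 s) and CPU + GPU (345 W, 2.8 s)
# against the proposed processor (0.463 W; 0.15 s and 32.9 s for the same frames).
mobile_gpu_gain = perf_per_power_gain(5, 3.2, 0.463, 0.15)
cpu_gpu_gain = perf_per_power_gain(345, 2.8, 0.463, 32.9)
```

Under this reading, the Mobile-GPU comparison gives a factor of about 230.4 and the CPU + GPU comparison about 63.4, while the 2048 × 2048 speed-up is 3.2 s / 0.15 s ≈ 21.33.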
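Table 6. Comparison with previous works.
Architectural Model | Operating Frequency | Power Consumption | SAR Imaging Algorithms | Frame Size | SAR Signal Processing Time (s)
Proposed solution | 200 MHz | 463 mW | CS | 1024 × 1024 | 0.04
Proposed solution | 200 MHz | 463 mW | CS | 2048 × 2048 | 0.15
Proposed solution | 200 MHz | 463 mW | CS | 6472 × 3328 | 0.68
Proposed solution | 200 MHz | 463 mW | CS | 16,384 × 16,384 | 8.2
Proposed solution | 200 MHz | 463 mW | CS | 30,000 × 6000 | 5.54
Proposed solution | 200 MHz | 463 mW | CS | 32,768 × 32,768 | 32.9
GPGPU [14] | - | >500 W | Omega-k | 30,000 × 6000 | 8.5
CPU + GPU [18] | - | 345 W | CS | 32,768 × 32,768 | 2.8
Mobile-GPU [16] | 2.3 GHz | 5 W | CS | 2048 × 2048 | 3.2
Microprocessor + FPGA [15] | - | 68 W | CS | 6472 × 3328 | 8
CPU + ASIC [28] | 100 MHz | 10 W | - | 1024 × 1024 | -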
Table 7. Speed-up ratio to previous works.
Architectural | Imaging Time | Imaging Time in Proposed Solution | Speed-Up Ratio | Frame Size
CPU + ASIC [28] | - | 0.04 s | - | 1024 × 1024
Mobile-GPU [16] | 3.2 s | 0.15 s | 21.33 | 2048 × 2048
Microprocessor + FPGA [15] | 8 s | 0.68 s | 11.76 | 6472 × 3328
GPGPU [14] | 8.5 s | 5.54 s | 1.54 | 30,000 × 6000
CPU + GPU [18] | 2.8 s | 32.9 s | 0.086 | 32,768 × 32,768
5.5. SAR Imaging Quality Evaluation
We compared the scene SAR imaging results of different fixed-point length FFT. Radar data were
obtained from RADARSAT-1 of Canada (width: 50 km; resolution: 6 m) [38]. The imaging effect is
shown in Figure 14.
For the actual scenes, the mean square error (MSE), peak signal-to-noise ratio (PSNR) [39],
structural similarity index (SSIM) [40], and radiometric resolution (RL) [41] are commonly adopted to
evaluate SAR imaging quality.
Sufficient imaging accuracy can be achieved with single-precision floating-point imaging.
Fixed-point processing methods will cause a certain loss of precision. We take the single-precision
floating-point imaging as the test reference to evaluate the fixed-point FFT SAR image quality.
The MSE is adopted to calculate the squared intensity difference between the pixels of the
partial fixed-point image and the pixels of the full single-precision floating-point image. The PSNR is
essentially the same as the MSE, but it is associated with the quantized gray level of the SAR image.
The MSE and PSNR are calculated as shown in Formulas (2) and (3):
$$\mathrm{MSE} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( f'(i,j) - f(i,j) \right)^2 \qquad (2)$$
$$\mathrm{PSNR} = 10 \log_{10} \frac{Q^2 \times M \times N}{\sum_{i=1}^{M} \sum_{j=1}^{N} \left( f'(i,j) - f(i,j) \right)^2} \qquad (3)$$
where $f'(i,j)$ and $f(i,j)$ represent the image pixels to be evaluated and the reference image pixels,
respectively; M and N represent the length and width of the image, respectively; and Q represents the
gray level of the image (Q = 255).
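Formulas (2) and (3) translate directly into code. The sketch below works on images stored as nested lists of pixel values; the function names are hypothetical, and returning infinity for identical images is a convention chosen here, not something stated in the text.

```python
import math


def mse(test, ref):
    """Mean square error between a test image and a reference image (Formula 2)."""
    rows, cols = len(ref), len(ref[0])
    total = sum(
        (test[i][j] - ref[i][j]) ** 2 for i in range(rows) for j in range(cols)
    )
    return total / (rows * cols)


def psnr(test, ref, q=255):
    """Peak signal-to-noise ratio in dB (Formula 3), with gray level q."""
    err = mse(test, ref)
    if err == 0:
        return math.inf  # identical images: no error energy
    return 10 * math.log10(q * q / err)
```

Since the denominator of Formula (3) equals M × N × MSE, the PSNR reduces to 10 log10(Q² / MSE), which is the form used in `psnr`.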
Figure 14. The scene SAR imaging results for different fixed-point length FFT. (a) 12-bit fixed-point
FFT; (b) 14-bit fixed-point FFT; (c) 16-bit fixed-point FFT; (d) single-precision floating-point FFT.
PSNR and MSE are simple and straightforward SAR image quality assessments based on the
visibility of errors. Because the PSNR index does not exactly match the visual quality perceived by
the human eye, it cannot meet the evaluation requirements of the human visual system (HVS) [40].
Therefore, we also adopt the structural similarity index (SSIM) to evaluate the SAR images, as shown
in Formula (4):
$$\mathrm{SSIM}(x, y) = \frac{(2\varphi_x \varphi_y + \varepsilon_1)(2\delta_{xy} + \varepsilon_2)}{(\varphi_x^2 + \varphi_y^2 + \varepsilon_1)(\delta_x^2 + \delta_y^2 + \varepsilon_2)} \qquad (4)$$
where $\delta_x^2$ represents the fixed-point image variance, and $\delta_y^2$ represents the
single-precision floating-point image variance; $\varphi_x$ represents the mean value of the
fixed-point image, and $\varphi_y$ represents the mean value of the single-precision floating-point
image. The SSIM value range is [0, 1], and the larger the SSIM value, the smaller the image distortion.
RL is also a very important evaluation indicator. RL is adopted to evaluate the minimum variation
of target reflection that radar sensors can distinguish, as shown in Formula (5):
$$\mathrm{RL} = 10 \log_{10} \left( \frac{\alpha}{\beta} + 1 \right) \qquad (5)$$
where $\alpha$ represents the standard deviation of the image, and $\beta$ represents the mean value
of the image.
Table 8 lists the loss of precision due to the different data widths. As can be seen from Table 8, the
PSNR value of a partial 16-bit fixed-point image can reach 29.1 dB. The results show that the partial
16-bit fixed-point image and the single-precision floating-point image differ only by 0.02 and 0.05
dB on the two indexes of SSIM and RL, respectively. For the actual scene SAR imaging, compared
with a single-precision floating-point image, the accuracy loss of a partial 16-bit fixed-point image is
within 2%.
𝑆𝑆𝐼𝑀(π‘₯, 𝑦) =
(2πœ‘ πœ‘ + πœ€ )(2𝛿 + πœ€ )
(πœ‘ + πœ‘ + πœ€ )(𝛿 + 𝛿 + πœ€ )
(4)
where 𝛿 represents the fixed-point image variance, and 𝛿 represents the single-precision
floating-point image variance; πœ‘ represents the mean value of the fixed-point image, and πœ‘
Table 8 lists the loss of precision due to the different data widths. As can be seen from Table 8,
the PSNR value of a partial 16-bit fixed-point image can reach 29.1 dB, the results show that the partial
16-bit fixed-point image and the single-precision floating-point image differ only by 0.02 and 0.05 dB
on the two indexes of SSIM and RL, respectively. For the actual scene SAR imaging, compared with
a single-precision floating-point image, the accuracy loss of a partial 16-bit fixed-point image is within
Sensors 2019, 19, 3409
17 of 20
2%.
Table 8. Quantitative evaluation of actual scene SAR imaging.

| FFT Pro-Acc ¹ | PSNR ² (dB) | MSE ³ | SSIM ⁴ | RL ⁵ (dB) |
|---|---|---|---|---|
| Single-precision floating-point | ∞ | 0 | 1 | 4.99 |
| 12-bit fixed-point | 13.7 | 2765.2 | 0.23 | 4.11 |
| 14-bit fixed-point | 22.4 | 377.4 | 0.77 | 4.71 |
| 16-bit fixed-point | 29.1 | 81.8 | 0.98 | 4.94 |

¹ FFT pro-acc: FFT processing accuracy; ² PSNR: peak signal-to-noise ratio; ³ MSE: mean square error; ⁴ SSIM: Structural Similarity Index; ⁵ RL: Radiometric Resolution.
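The SSIM and RL definitions in Formulas (4) and (5) can be sketched as follows. This is an illustrative simplification under stated assumptions: the statistics are taken over the whole image rather than over the local windows used in [40], the cross term δxy is taken to be the covariance of the two images, and the stabilizing constants `eps1` and `eps2` are placeholder values, since the paper does not state the ones it used. The function names and sample images are hypothetical.

```python
import math

def _flatten(img):
    """Turn a 2-D list of pixels into a flat list."""
    return [p for row in img for p in row]

def ssim_global(x, y, eps1=1e-4, eps2=1e-4):
    """Formula (4) evaluated with whole-image statistics (an assumption).

    eps1 and eps2 are placeholder stabilizing constants.
    """
    fx, fy = _flatten(x), _flatten(y)
    n = len(fx)
    mean_x, mean_y = sum(fx) / n, sum(fy) / n
    var_x = sum((a - mean_x) ** 2 for a in fx) / n
    var_y = sum((b - mean_y) ** 2 for b in fy) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(fx, fy)) / n
    num = (2 * mean_x * mean_y + eps1) * (2 * cov_xy + eps2)
    den = (mean_x ** 2 + mean_y ** 2 + eps1) * (var_x + var_y + eps2)
    return num / den

def radiometric_resolution(img):
    """Formula (5): 10*log10(std/mean + 1), in dB."""
    f = _flatten(img)
    n = len(f)
    mean = sum(f) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in f) / n)
    return 10.0 * math.log10(std / mean + 1.0)

# Hypothetical sample images, used only to exercise the functions.
float_img = [[10.0, 20.0], [30.0, 40.0]]
fixed_img = [[12.0, 20.0], [30.0, 36.0]]
stripes = [[1.0, 3.0], [1.0, 3.0]]  # mean 2, standard deviation 1

print(ssim_global(fixed_img, float_img))
print(radiometric_resolution(stripes))
```

An image compared with itself gives an SSIM of 1, which matches the single-precision reference row of Table 8.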
Phase is also important information for a SAR image. The phase mean (PM) and phase deviation (PD) are estimated by the method proposed in [42]. Table 9 lists the phase precision with different fixed-point SAR imaging. As can be seen from Table 9, the loss of phase precision with partial 16-bit fixed-point imaging is less than 3%.
Table 9. Phase information evaluation of actual scene SAR imaging.

| FFT Pro-Acc ¹ | PM ² | PD ³ |
|---|---|---|
| Single-precision floating-point | 0.00244° | 3.3026° |
| 12-bit fixed-point | 0.00916° | 3.3054° |
| 14-bit fixed-point | 0.00398° | 3.3032° |
| 16-bit fixed-point | 0.00252° | 3.3026° |

¹ FFT pro-acc: FFT processing accuracy; ² PM: phase mean; ³ PD: phase deviation.
For the point target imaging quality evaluation, we adopted the point target simulation echo data. We compared the point target SAR imaging results for FFT with different fixed-point lengths, as shown in Figure 15. For the point target image, spatial resolution (RES), peak side lobe ratio (PSLR) and integrated side lobe ratio (ISLR) are commonly adopted to assess imaging quality [38,43]. Table 10 shows the results of the point target imaging quality assessment and comparison.
Figure 15. The point target SAR imaging results for different fixed-point length FFT. (a) 12-bit fixed-point FFT; (b) 14-bit fixed-point FFT; (c) 16-bit fixed-point FFT; (d) single-precision floating-point FFT.
Table 10. Quantitative evaluation of point target SAR imaging.

| FFT Pro-Acc ¹ | Azimuth RES ² (m) | Azimuth PSLR ³ (dB) | Azimuth ISLR ⁴ (dB) | Range RES (m) | Range PSLR (dB) | Range ISLR (dB) |
|---|---|---|---|---|---|---|
| Single-precision floating-point | 4.74 | −12.91 | −9.64 | 2.58 | −13.31 | −9.96 |
| 12-bit fixed-point | 5.43 | −5.68 | −2.99 | 3.71 | −5.88 | −3.22 |
| 14-bit fixed-point | 4.81 | −11.85 | −8.22 | 2.83 | −11.55 | −9.09 |
| 16-bit fixed-point | 4.77 | −12.86 | −9.53 | 2.61 | −13.28 | −9.93 |

¹ FFT pro-acc: FFT processing accuracy; ² RES: spatial resolution; ³ PSLR: peak side lobe ratio; ⁴ ISLR: integrated side lobe ratio.
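For readers unfamiliar with the point target metrics, the sketch below shows one conventional way to compute PSLR and ISLR from a one-dimensional impulse response. It is not the authors' measurement code. It assumes the usual definitions (PSLR as the largest side lobe power relative to the main lobe peak, ISLR as the side lobe energy relative to the main lobe energy, with the main lobe bounded by the first nulls on either side of the peak). Exact ISLR conventions vary between references, and the paper does not state which one it uses. The function names and the unweighted sinc test profile are hypothetical.

```python
import math

def sinc(x):
    """Normalized sinc, sin(pi*x)/(pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sidelobe_ratios(power):
    """Return (PSLR, ISLR) in dB for a 1-D impulse-response power profile.

    The main lobe is taken to span from the peak down to the first local
    minimum (null) on each side; everything else counts as side lobes.
    """
    peak = max(range(len(power)), key=power.__getitem__)
    left = peak
    while left > 0 and power[left - 1] < power[left]:
        left -= 1
    right = peak
    while right < len(power) - 1 and power[right + 1] < power[right]:
        right += 1
    main = power[left:right + 1]
    side = power[:left] + power[right + 1:]
    pslr = 10.0 * math.log10(max(side) / power[peak])
    islr = 10.0 * math.log10(sum(side) / sum(main))
    return pslr, islr

# Assumed test input: an unweighted sinc response over 40 resolution cells,
# sampled at 20 points per cell (801 samples, peak at index 400).
profile = [sinc((i - 400) / 20.0) ** 2 for i in range(801)]
pslr_db, islr_db = sidelobe_ratios(profile)
print(pslr_db, islr_db)
```

An unweighted sinc response has a PSLR of roughly −13.3 dB, which is the same order as the single-precision row of Table 10. Values for a real image depend on the windowing and on the extent over which side lobes are integrated.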
For partial 16-bit fixed-point imaging, in the azimuth direction, the PSLR and ISLR precision losses
of the image are 0.3% and 0.8%, respectively; the RES precision loss is 0.2%. In the range direction, the
PSLR and ISLR precision losses of the image are 0.2% and 0.2%, respectively; the RES precision loss
is 0.7%.
According to the actual scene and the point target image quantization analysis, as shown in
Tables 8–10, the partial 16-bit fixed-point imaging accuracy is close to the single-precision floating-point
imaging accuracy, which meets the requirements of on-orbit SAR imaging applications.
6. Conclusions
This paper proposes a heterogeneous imaging processor using fixed-floating point heterogeneous
parallel acceleration technology to perform SAR imaging in the aerospace field. The processor consists
of two 18 × 16 heterogeneous arrays that provide 115.2 GOPS throughput. To improve energy efficiency,
each array supports fixed-floating hybrid calculations to take full advantage of computing resources,
which can increase the throughput of imaging processing by a factor of 1.82. At the same time, through
a sensible algorithm-to-hardware architecture mapping, the PE array can be partitioned by rows to run
the imaging process in parallel, which keeps hardware utilization high and improves the efficiency by
a factor of 1.5. A single processor requires 8 s and consumes 463 mW to process SAR raw data with
a granularity of 16,384 × 16,384, which meets the real-time and power consumption limits of the
on-orbit SAR imaging platform. The proposed solution also scales well: by extending the size of the
processor array, the real-time requirements of larger-scale SAR imaging can be met.
Author Contributions: Investigation, J.A.; Methodology, S.W.; Project administration, S.W. and S.Z.; Software,
X.H. and J.A.; Writing—original draft, S.W.; Writing—review & editing, L.C.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Percivall, G.S.; Alameh, N.S.; Caumont, H.; Moe, K.L.; Evans, J.D. Improving Disaster Management Using Earth Observations—GEOSS and CEOS Activities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 1368–1375.
2. Joyce, K.E.; Belliss, S.E.; Samsonov, S.V.; McNeill, S.J.; Glassey, P.J. A review of the status of satellite remote sensing and image processing techniques for mapping natural hazards and disasters. Prog. Phys. Geog. 2009, 33, 183–207.
3. Tralli, D.M.; Blom, R.G.; Zlotnicki, V.; Donnellan, A.; Evans, D.L. Satellite remote sensing of earthquake, volcano, flood, landslide and coastal inundation hazards. ISPRS J. Photogramm. Remote Sens. 2005, 59, 185–198.
4. Gierull, C.H.; Vachon, P.W. Foreword to the Special Issue on Multichannel Space-Based SAR. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 4995–4997.
5. Radarsat Constellation Mission. Available online: http://directory.eoportal.org/web/eoportal/satellitemissions/r (accessed on 23 March 2017).
6. Advanced Land Observing Satellite-2. Available online: https://directory.eoportal.org/web/eoportal/satellitemissions (accessed on 23 March 2017).
7. TerraSAR-X Add-on for Digital Elevation Measurement. Available online: https://directory.eoportal.org/web/eoportal/satellite-missions/t/tandem-x (accessed on 23 March 2017).
8. Yang, L.; Zhao, L.; Zhou, S.; Guo, B. Sparsity-Driven SAR Imaging for Highly Maneuvering Ground Target by the Combination of Time-Frequency Analysis and Parametric Bayesian Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 4, 1443–1455.
9. Zhu, S.; Liao, G.; Tao, H.; Yang, Z. Estimating Ambiguity-Free Motion Parameters of Ground Moving Targets from Dual-Channel SAR Sensors. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 3328–3349.
10. Wai-Chi, F.; Jin, M.Y. On board processor development for NASA’s spaceborne imaging radar with VLSI system-on-chip technology. In Proceedings of the 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), Vancouver, BC, Canada, 23–26 May 2004; Volume 2, p. II-901-4.
11. Raney, R.K.; Runge, H.; Bamler, R.; Cumming, I.G.; Wong, F.H. Precision SAR processing using chirp scaling. IEEE Trans. Geosci. Remote Sens. 1994, 32, 786–799.
12. Li, G.; Zhang, F.; Ma, L.; Hu, W.; Li, W. Accelerating SAR imaging using vector extension on multi-core SIMD CPU. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015.
13. Peternier, A.; Boncori, J.P.M.; Pasquali, P. Near-real-time focusing of ENVISAT ASAR Stripmap and Sentinel-1 TOPS imagery exploiting OpenCL GPGPU technology. Remote Sens. Environ. 2017, 202, 45–53.
14. Lou, Y.; Clark, D.; Marks, P.; Muellerschoen, R.J.; Wang, C.C. Onboard Radar Processor Development for Rapid Response to Natural Hazards. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2770–2776.
15. Tang, H.; Li, G.; Zhang, F.; Hu, W.; Li, W. A Spaceborne SAR on-board processing simulator using mobile GPU. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016.
16. Ge, B.; Chen, L.; An, D.; Zhou, Z. GPU-based FFBP algorithm for high-resolution spotlight SAR imaging. In Proceedings of the 2017 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Xiamen, China, 22–25 October 2017; pp. 1–5.
17. Zhang, F.; Li, G.; Li, W.; Hu, W.; Hu, Y. Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing. Sensors 2016, 16, 494.
18. Wu, Z.; Liu, Y.; Zhang, L.; Li, N.; Du, K.; Balz, T. Highly efficient synthetic aperture radar processing system for airborne sensors using CPU+GPU architecture. J. Appl. Remote Sens. 2015, 9, 097293.
19. Ye, J.; Shanqing, H.; Jiayun, Z.; Teng, L. Virtual single-node processing for SAR imaging based on multi-DSP. In Proceedings of the 2016 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Hong Kong, China, 5–8 August 2016.
20. Zhijun, Y.; Xiangfei, N.; Wenyi, X.; Xiaowei, N.; Weiming, T. Real time imaging processing of ground-based SAR based on multicore DSP. In Proceedings of the 2017 IEEE International Conference on Imaging Systems and Techniques (IST), Beijing, China, 2017; pp. 1–5.
21. Xingyu, X.; Abou-Khousa, M.A.; Al-Wahedi, K. Embedded Synthetic Aperture Radar Imaging System on Compact DSP Platform. In Proceedings of the 2017 International Conference on Electrical and Computing Technologies and Applications, Ras Al Khaimah, UAE, 21–23 November 2017.
22. Langemeyer, S.; Kloos, H.; Simon-Klar, C.; Friebe, L.; Hinrichs, W.; Lieske, H.; Pirsch, P. A compact and flexible multi-DSP system for real-time SAR applications. In Proceedings of the IEEE International Geoscience & Remote Sensing Symposium, Anchorage, AK, USA, 20–24 September 2004.
23. Desai, N.M.; Kumar, B.S.; Sharma, R.K.; Kunal, A.; Gameti, R.B.; Gujraty, V.R. Near Real Time SAR Processors for ISRO’s Multi-Mode RISAT-I and DMSAR. In Proceedings of the 7th European Conference on Synthetic Aperture Radar, Friedrichshafen, Germany, 2–5 June 2008; pp. 1–4.
24. Di, W.; Chen, C.; Liu, Y. FPGA-Based Multi-core Reconfigurable System for SAR Imaging. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 8921–8924.
25. Li, B.; Li, C.; Xie, Y.; Chen, L.; Shi, H.; Deng, Y. A SoPC based Fixed Point System for Spaceborne SAR Real-Time Imaging Processing. In Proceedings of the 2018 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 2018; pp. 1–6.
26. Bierens, L.; Vollmuller, B.J. On-board Payload Data Processor (OPDP) and its application in advanced multi-mode, multi-spectral and interferometric satellite SAR instruments. In Proceedings of the EUSAR 2012, 9th European Conference on Synthetic Aperture Radar, Nuremberg, Germany, 23–26 April 2012.
27. Le, C.; Chan, S.; Cheng, F.; Fang, W.; Fischman, M.; Hensley, S.; Johnson, R.; Jourdan, M.; Marina, M.; Parham, B.; et al. Onboard FPGA-based SAR processing for future spaceborne systems. In Proceedings of the IEEE 2004 Radar Conference, Philadelphia, PA, USA, 29–29 April 2004; pp. 15–20.
28. Fang, W.C.; Le, C.; Taft, S. On-board fault-tolerant SAR processor for spaceborne imaging radar systems. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems, Kobe, Japan, 23–26 May 2005; pp. 420–423.
29. Greco, J.; Cieslewski, G.; Jacobs, A.; Troxel, I.A.; George, A.D. Hardware/software Interface for High-performance Space Computing with FPGA Coprocessors. In Proceedings of the 2006 IEEE Aerospace Conference, Big Sky, MT, USA, 4–11 March 2006; pp. 1–10.
30. Pfitzner, M.; Cholewa, F.; Pirsch, P.; Blume, H. FPGA based Architecture for real-time SAR processing with integrated Motion Compensation. In Proceedings of the 2013 Asia-Pacific Conference on Synthetic Aperture Radar (Apsar), Tsukuba, Japan, 23–27 September 2013; pp. 521–524.
31. Bi, D.; Xie, Y.; Li, X.; Zheng, Y.R. Efficient 2-D synthetic aperture radar image reconstruction from compressed sampling using a parallel operator splitting structure. Digit. Signal Process. 2016, 50, 171–179.
32. Long, T.; Yang, Z.; Li, B.; Chen, L.; Ding, Z.; Chen, H.; Xie, Y. A Multi-mode SAR Imaging Chip based on a Dynamically Reconfigurable SoC Architecture Consisting of Dual-operation Engines and Multilayer Switching Network. Preprints 2018.
33. Song, W.S.; Baranoski, E.J.; Martinez, D.R. One trillion operations per second on-board VLSI signal processor for Discoverer II space based radar. In Proceedings of the Aerospace Conference Proceedings, Big Sky, MT, USA, 25–25 March 2000.
34. Chen, Q.; Yu, A.; Sun, Z.; Huang, H. A multi-mode space-borne SAR simulator based on SBRAS. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012; pp. 4567–4570.
35. Stangl, M.; Werninghaus, R.; Schweizer, B.; Fischer, C.; Brandfass, M.; Mittermayer, J.; Breit, H. TerraSAR-X technologies and first results. IEE Proc. Sonar Navig. 2006, 153, 86–95.
36. Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301.
37. Yang, C.; Li, B.; Chen, L.; Wei, C.; Xie, Y.; Chen, H.; Yu, W. A Spaceborne Synthetic Aperture Radar Partial Fixed-Point Imaging System Using a Field-Programmable Gate Array—Application-Specific Integrated Circuit Hybrid Heterogeneous Parallel Acceleration Technique. Sensors 2017, 17, 1493.
38. Cumming, I.G.; Wong, F.H. Digital Processing of Synthetic Aperture Radar Data: Algorithms and Implementation; Artech House: Norwood, MA, USA, 2005.
39. Hu, A.; Zhang, R.; Yin, D.; Chen, Y. Perceptual quality assessment of SAR image compression. Int. J. Remote Sens. 2013, 34, 8764–8788.
40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
41. Márquez-Martínez, J.; Mittermayer, J.; Rodríguez-Cassolà, M. Radiometric resolution optimization for future SAR systems. In Proceedings of the IEEE International Geoscience & Remote Sensing Symposium, Anchorage, AK, USA, 20–24 September 2004.
42. Tell, B.R.; Laur, H. Phase preservation in SAR processing: The interferometric offset test. In Proceedings of the ‘Remote Sensing for a Sustainable Future’, International Geoscience and Remote Sensing Symposium, Lincoln, NE, USA, 27–31 May 1996; Volume 1, pp. 477–480.
43. Zhang, H.; Li, Y.; Su, Y. SAR image quality assessment using coherent correlation function. In Proceedings of the 2012 5th International Congress on Image and Signal Processing, Chongqing, China, 16–18 October 2012.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).