sensors Article A Highly Efficient Heterogeneous Processor for SAR Imaging Shiyu Wang * , Shengbing Zhang *, Xiaoping Huang, Jianfeng An and Libo Chang School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China * Correspondence: onion0709@mail.nwpu.edu.cn (S.W.); zhangsb@nwpu.edu.cn (S.Z.); Tel.: +86-1331-927-0830 (S.W.) Received: 4 June 2019; Accepted: 1 August 2019; Published: 3 August 2019 Abstract: The expansion and improvement of synthetic aperture radar (SAR) technology have greatly enhanced its practicality. SAR imaging requires real-time processing with limited power consumption for large input images. Designing a specific heterogeneous array processor is an effective approach to meet the power consumption constraints and real-time processing requirements of an application system. In this paper, taking a commonly used algorithm for SAR imaging—the chirp scaling algorithm (CSA)—as an example, the characteristics of each calculation stage in the SAR imaging process is analyzed, and the data flow model of SAR imaging is extracted. A heterogeneous array architecture for SAR imaging that effectively supports Fast Fourier Transformation/Inverse Fast Fourier Transform (FFT/IFFT) and phase compensation operations is proposed. First, a heterogeneous array architecture consisting of fixed-point PE units and floating-point FPE units, which are respectively proposed for the FFT/IFFT and phase compensation operations, increasing energy efficiency by 50% compared with the architecture using floating-point units. Second, data cross-placement and simultaneous access strategies are proposed to support the intra-block parallel processing of SAR block imaging, achieving up to 115.2 GOPS throughput. Third, a resource management strategy for heterogeneous computing arrays is designed, which supports the pipeline processing of FFT/IFFT and phase compensation operation, improving PE utilization by a factor of 1.82 and increasing energy efficiency by a factor of 1.5. Implemented in 65-nm technology, the experimental results show that the processor can achieve energy efficiency of up to 254 GOPS/W. The imaging fidelity and accuracy of the proposed processor were verified by evaluating the image quality of the actual scene. Keywords: heterogeneous array; SAR imaging; data cross-placement; computing resource management 1. Introduction Aerospace synthetic aperture radar (SAR) can be all-time and all-weather to obtain high-precision microwave images and other value-added products over large areas, and it has an extensive range of applications in remote sensing, environmental monitoring, geographical mapping, war zone surveillance, precision guidance, and reconnaissance [1–4]. Extensions and modifications of the SAR technology have significantly increased its practicality and applications. The demand for high-resolution and wide-swath (HRWS) SAR imaging is growing, especially in the areas of ocean observation, geological survey, and environmental protection. In 1978, the United States launched the first spaceborne SAR named Seasat-1. It is a satellite specifically designed for telemetry of the Earth’s oceans, and is aimed at realizing the possibility of global satellite monitoring of the oceans and determining the system requirements for marine remote sensing satellites. RADARSAT-1 was successfully launched in Canada in 1995 [5]. It not only provided Canada with a large amount of all-weather and all-time SAR data, but also provided useful information for commercial and scientific users in disaster management, agriculture, mapping, hydrology, forestry, oceanography, Sensors 2019, 19, 3409; doi:10.3390/s19153409 www.mdpi.com/journal/sensors Sensors 2019, 19, 3409 2 of 20 ice research, and coastal monitoring. In January 2006, Japan launched the Advanced Land Observing Satellite (ALOS) [6]. The Phased Array type L-band Synthetic Aperture Radar (PALSAR) that it carried was an L-band SAR sensor that is not affected by atmospheric conditions, cloud cover, and other related conditions, so it can be used for ground observations around the clock. In June 2007, the Terra SAR-X was launched by the German National Space Center. Its X-band SAR radar reliably provided high-resolution weather conditions and wide-area radar images with superior geometric accuracy over any other spaceborne SAR sensor [7]. Moreover, for both civil and military applications, it is desired to monitor moving targets, including ground moving target indication/ground moving target imaging (GMTI/GMTIm) [8,9]. For SAR processing systems, SAR imaging time accounts for most of the processing time, and directly brings a significant impact on system throughput and rapid response capability. The imaging delay of SAR will seriously affect the subsequent image processing, such as content analysis, risk diagnosis, and feature extraction. SAR imaging efficiency plays a very important role in the SAR system platform, and it can directly affect the throughput and rapid response capability of the entire platform. Spaceborne and airborne real-time SAR imaging is the most direct and effective real-time imaging implementation approach, which can quickly provide SAR image data for SAR applications while significantly reducing the communication burden of air-to-ground data links [10]. At the same time, the working environment of spaceborne and airborne imaging systems is harsh, and the power consumption of the processor is also severely limited. Therefore, real-time and low power consumption are two essential items that must be met by spaceborne/airborne SAR imaging processors. Since SAR imaging requires a large amount of two-dimensional parallel computing, it is difficult for a single multi-core central processing unit (CPU) to meet its real-time requirements. The SAR imaging scheme with multiple CPU nodes has high power consumption and low processing efficiency, and cannot be applied in spaceborne/airborne SAR processing. Generally, heterogeneous schemes such as CPU + GPU, CPU + DSP (s), and CPU + FPGA (one or more) can meet the performance requirements of real-time processing, but their power consumption is above 10 W, or even more than 100 W. A dedicated chip that fully implements the imaging algorithm can achieve better results in real-time, and low power consumption and is suitable for applications with strict power constraints, but the scheme hardens the algorithm, resulting in poor flexibility. ASIP (Application Specific Instruction Set Processor) is a dedicated processor solution between a general-purpose processor and an application specific integrated circuit (ASIC). This processor combines the flexibility of a general-purpose processor and the efficiency of an ASIC. In order to achieve a good trade-off between flexibility and processing efficiency, the development of a dedicated processor that is capable of fully implementing the SAR imaging process is an effective solution to meet its power consumption and real-time requirements for spaceborne/airborne SAR processing. The chirp scaling algorithm (CSA) is one of the most commonly used algorithms for SAR imaging [11]. Its calculations mainly include Fast Fourier Transformation/Inverse Fast Fourier Transform (FFT/IFFT), phase multiplication, interpolation, etc., especially FFT/IFFT operations account for the highest proportion. The accuracy requirements and computing flow of these operations are different. Therefore, how to design an array structure and storage structure suitable for such processing is a key issue to be solved. With the progress of integrated circuit (IC) technology, more processing units and memory blocks can be integrated on a single chip. Based on the abundant computational and memory resources on the chip, to make full use of bandwidth resources, this paper proposes a heterogeneous array structure that efficiently supports CSA imaging processing by combining block parallelism and pipeline processing while buffering the intermediate results on-chip. It can support the parallel and pipeline processing and increases the maximum utilization of computing units. Moreover, we have designed an on-chip multi-level data buffer structure matching the heterogeneous array structure to ensure data supply for pipeline processing. This solution can reduce the complexity of the system while improving real-time performance. Sensors 2019, 19, 3409 3 of 20 The paper is organized as follows. Section 2 outlines related work and background. Section 3 analyzes the characteristics of CSA and proposes the design of the processor. Section 4 presents the heterogeneous architecture implementations. We present the evaluation of experimental results in Section 5, and the conclusions in Section 6. 2. Related Work Digital signal processors (DSPs), CPUs, and graphics processing units (GPUs) have respective advantages in real-time SAR processing. As the system adopts CPU, it has good flexibility and portability [12]. However, their power efficiency for computing is quite low, which is a bottleneck in real-time SAR applications. Due to GPU’s powerful parallel computation capability and programmability, the new method makes full use of GPU’s powerful computation ability, which effectively improves the real-time quality of SAR scene generation [13–16]. At present, the GPU + CPU method can effectively combine the advantages of the two processors to improve imaging efficiency [17,18]. However, the average power consumption which is up to 150 W, limits the application of GPU in micro air vehicles. Nowadays, high capability DSPs easily realize many complex theories and algorithms on hardware, and promote the development of SAR technology [19–21]. In 2003, Hanover University implemented a SAR real-time processing system using a multi-DSP architecture. This system uses highly parallel digital signal processor technology (HiPAR-DSP) for SAR signal processing [22]. The Indian Space Research Organization (IRSO) developed the SAR Specialized Processor (NRTP) based on Analog Devices’ DSP multiprocessor, which approximates the real-time imaging of SAR [23]. However, for some applications with strictly constrained power, DSP has lower energy efficiency, resulting in lower imaging efficiency. The rapid development of field-programmable gate array (FPGA) has been one of the most important technologies of realizing digital signal processing. With its rich on-chip memory and computational resources, FPGA can be configured as a SAR imaging platform to meet the high throughput rate SAR signal processing requirements [24–26]. An FPGA based on fault-tolerant architecture (Xilinx Virtex-II Pro) is applied to SAR processing systems [27,28]. In 2006, the University of Florida developed a high-performance heterogeneous spatial computing framework based on hardware/software interfaces. In this architecture, the CPU is responsible for scheduling and task management, and the FPGA acts as a coprocessor for computational acceleration [29]. With the rapid development of storage capacity and computing power of commercial FPGAs, SAR real-time imaging systems can all be built by FPGA (Xilinx Virtex-6) [30]. However, for highly complex algorithms, the development cycle of FPGA is relatively long. For the real-time requirements and physical implementation limitations of SAR imaging, ASIC implementation is generally employed [31,32]. The Massachusetts Institute of Technology (MIT) Lincoln Laboratory uses bit-level systolic-array technology to design a SAR signal processor with high throughput and low power consumption [33]. The jet propulsion laboratory has also developed an airborne SAR processing system using a VLSI+SOC (very large scale integration+system on chip) hardware solution [10]. The processor’s low power consumption and small size make it suitable for small SAR imaging systems. In general, the DSP solution is used to implement SAR imaging through software programming. Since the DSP is designed for general purposes, this implementation has high flexibility and a short design cycle. It is more suitable for real-time SAR imaging than a CPU, but for low power applications, it is still not the most suitable choice. The ASIC solution for SAR imaging has the optimal power and performance for a single computational process. However, SAR imaging is a combination of multiple calculations on one device, which causes the design cost and power consumption of SAR imaging to soar, the design cycle to become longer, and poor flexibility. ASIP makes a good trade-off between the high flexibility of a general purpose processor and the high processing efficiency of an ASIC, and can be tailored and optimized for a certain type of algorithm or domain application to meet constraints such as performance, area, and power consumption. Moreover, it can effectively reduce design cycles Sensors 2019, 19, x FOR PEER REVIEW Sensors 2019, 19, 3409 4 of 20 4 of 20 constraints such as performance, area, and power consumption. Moreover, it can effectively reduce design and Thus, the design Thus, of many of important ASIP make it a very important and the cycles design risk. many risk. advantages ASIPadvantages make it a very implementation method implementation method in the field of signal processing. in the field of signal processing. Making trade-offs trade-offs between between speed, speed, cost, cost, power power consumption, consumption, and ASIP design design Making and flexibility, flexibility, ASIP methodology in real-time signal processing system can can not only satisfy the realmethodology in the thedesign designofofSAR SAR real-time signal processing system not only satisfy the time and and performance requirements of aerospace systems, butbut also the real-time performance requirements of aerospace systems, alsoshorten shortenthe thelead leadtime time of of the processors. ASIP, when designed with a specific architecture with higher parallelism and higher processors. ASIP, when designed with a specific architecture with higher parallelism and higher complexity,also alsohas hasgood goodscalability. scalability. Therefore, have designed a dedicated processor thatfully can complexity, Therefore, wewe have designed a dedicated processor that can fully implement the SAR imaging process to meet the power consumption and real-time implement the SAR imaging process to meet the power consumption and real-time requirements of requirements the application environment. the applicationofenvironment. 3. Processor Architecture Design the most most commonly commonly used used algorithms algorithms for SAR SAR imaging imaging [11]. [11]. Compared with The CSA is one of the simple operation operation process, low computational computational other algorithms, the CSA has the advantages of a simple efficiency. On the other hand, the CSA improves the fidelity of the complexity, and high imaging efficiency. image, especially the preservation of the phase information. Moreover, Moreover, the CSA can adapt to different different scanningmodes, modes,forfor example, spotlight, strip-map, sliding spotlight, Tops, and radar scanning example, spotlight, strip-map, scanscan SAR,SAR, sliding spotlight, Tops, and Mosaic Mosaic modes [34,35]. modes [34,35]. 3.1. CSA Flow Flow Analysis Analysis 3.1. CSA The The CSA CSA can can be be divided into three The imaging imaging principle principle of of the the CSA CSA is is shown shown in in Figure Figure 1. 1. The divided into three modules accordingto tofunctions, functions,orordivided divided into seven steps according to operation the operation sequence. modules according into seven steps according to the sequence. The The algorithm is executed step byand step, andalgorithm in the algorithm we the perform the alternating algorithm is executed step by step, in the process, process, we perform alternating operation operation of and FFT/IFFT phase compensation. perform a SARfour imaging, four Fourier and transform of FFT/IFFT phaseand compensation. To performTo a SAR imaging, Fourier transform threeand three-phase multiplication are needed. phase multiplication are needed. SAR Echo Data Step 1 Azimuth FFT Step 2 Step 3 H1 CS Factor Range FFT Module 1 2-D Frequency Domain H2 Range Compensation Factor Step 4 Step 5 Range-Doppler Domain Range IFFT Module 2 Range-Doppler Domain H3 Azimuth Compensation Factor Step 6 Step 7 Azimuth IFFT Module 3 SAR Image Figure 1. Chirp scaling algorithm (CSA) flow chart. SAR: SAR: synthetic synthetic aperture radar. The into 2Q2π loglog multiplications and and 3Q log The Q-point Q-point FFT/IFFT FFT/IFFTcan canbebedecomposed decomposed into π real multiplications 3π2 Q logreal π 2 Q real additions [36].[36]. TableTable 1 lists the computation quantity of the operation. real additions 1 lists the computation quantity ofseven-step the seven-step operation. From Table 1, we can see that the proportion of FFT(IFFT) in all operations is: W = (2 + 2 + 3 + 3)NM log2 M + (2 + 2 + 3 + 3)NM log2 N 10NM log2 M + 10NM log2 N + 18NM (1) Sensors 2019, 19, x FOR PEER REVIEW Sensors 2019, 19, 3409 5 of 20 5 of 20 Table 1. Computational statistics of CSA. For different imaging matrix sizes, the proportion W of FFT(IFFT) is slightly different, as shown in Calculation Step Real-Multi Real-Add Total TableContent 2. It can be shown from Table 2 that the W values are basically above 90% and can reach up to 3ππ log π 5ππ log π Azimuth-FFT 1 1 2ππ log π 2 95% as the matrix size becomes larger. Therefore, accelerating the FFT/IFFT operation6ππ will inevitably 4ππ 2ππ CS Factor-Multi 3 2 2ππ log π the imaging efficiency. 3ππ log π 5ππ log π Range-FFT reduce the imaging3 time and optimize RC Factor 4-Multi Range-IFFT 5 AC Factor 6-Multi Azimuth-IFFT 4 5 6 7 Total - 2ππ 4NM 3ππ log π 2ππ log π Table 4ππ 1. Computational statistics 2ππof CSA. 3ππ log π 2ππ log π Steplog π + 4NMlog Real-Multi 4ππ π + 6ππ log π + Real-Add 6ππ log π + 12NM 6NM 2 1 3NM log2 M 2NM log2 M 2 Calculation Content 6ππ 5ππ log π 6ππ 5ππ log π 10ππ log π +Total 10ππ log π + 18ππ 5NM log2 M Azimuth-FFT 1 1 FFT: Fast Fourier Transformation; M: Azimuth direction sample numbers; N: Range direction sample 2 4NM 2NM 6NM CS Factor-Multi 3 3 CS Factor: Chirp Scaling Factor; 4 RC Factor: Range Compensation Factor; 5.IFFT: Inverse numbers; Range-FFT 3 2NM log2 N 3NM log2 N 5NM log2 N 4 -Multi Fast Fourier Transform. 46 AC Factor: Azimuth 4NM Compensation Factor. 2NM 6NM RC Factor 5 2NM log2 N 3NM log2 N 5NM log2 N Range-IFFT 5 6 -Multi 6 that the proportion 4NM 6NM AC Factor From Table 1, we can see of FFT(IFFT)2NM in all operations is: Azimuth-IFFT 7 2NM log2 M 3NM log2 M 5NM log2 M (2 + 2 + 3 + 3)ππ log π + (2 + 2 + 3 + 3)ππ log π 4NM log M + 6NM log M + 10NM log2 M + (1) 10ππ log 2 π + 10ππ log π + 218ππ 4NMlog2 N + 12NM 6NM log2 N + 6NM 10NM log2 N + 18NM 1 FFT: 3 For different imaging matrix2 sizes, the proportion W of FFT(IFFT) is slightly different, as shown Fast Fourier Transformation; M: Azimuth direction sample numbers; N: Range direction sample numbers; 4 RC Factor: Range Compensation Factor; 5 IFFT: Inverse Fast Fourier Transform. 6 CS Factor: Chirp Scaling Factor; in Table 2. It can be shown from Table 2 that the W values are basically above 90% and can reach up AC Factor: Azimuth Compensation Factor. to 95% as the matrix size becomes larger. Therefore, accelerating the FFT/IFFT operation will inevitably reduce the imaging time and2. optimize the imaging efficiency. Table Computational load statistics. Total π = 256 × Table 256 2.1024 × 1024 4096 4096 16,384 × 16,384 Computational load×statistics. Image Size FFT Computational Image Size 107 Load FFT Computational Load PhaseCompensation Compensation Phase Computational 106Load Computational Load W-Value W-Value 89.8% 2562.1 × 256 4096 16,384 9 × 4096 7.6 × 108 1024 × 1024 4.1 × 10 × 10×1016,384 2.1 × 10 4.1 × 10 7.6 × 10 10 4.810 ×910 10 × 107 1.8 × 103 × 1083 × 10 1.8 4.8 × 89.8% 91.7% 93% 94% 91.7% 93% 94% 65,536 × 65,536 65,536 65,536 12 1.2 ×× 10 1.2 × 10 1.2 × × 10 1.2 1010 94.7% 94.7% Space-line (Parallel) 3.2. Computation Flow Strategy 3.2. Computation Flow Strategy In the imaging process, we take the block imaging method and perform parallel processing In the imaging process, we take the block imaging and method perform parallel between blocks. In the algorithm process, four FFT/IFFT threeand phase operations areprocessing pipelined between blocks. the algorithm four FFT/IFFT and(multi-azimuth) three phase operations areand pipelined according to theInalgorithm flow,process, while each multi-range FFT/IFFT phase according to the flow, while individually. each multi-range (multi-azimuth) FFT/IFFT and phase operation operation can bealgorithm parallel processing To organize the pipeline processing of two types can be parallel processing individually. To organize the pipeline processing of two types of operations of operations in SAR imaging, we designed a calculation process based on space–time flow (ST-Flow), in SAR imaging, we2.designed calculation process FFT/IFFT based on space–time flow (ST-Flow), asand shown in as shown in Figure At a time,a in space, multi-line can be performed in parallel, phase Figure 2. At a time, in space, multi-line FFT/IFFT can be performed in parallel, and phase compensation compensation operation can be calculated simultaneously at multiple points, so no calculation unit operation calculated multiple points, so no calculation unitcalculation is idle. Onunit the is idle. Oncan thebe timeline, datasimultaneously is continuouslyatfed into the processing unit, and the timeline, fed intofor thedata. processing unit,ST-Flow, and the calculation unitcan does have does not data haveisa continuously stall due to waiting With this SAR imaging be not done in a stall due to waiting continuous process.for data. With this ST-Flow, SAR imaging can be done in a continuous process. Time-line (Pipeline) FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM PM:Phase Multiplication SAR imaging imaging flow. flow. Figure 2. SAR 3.3. Heterogeneous Arrays FFT/IFFT Sensors 2019, 19, x FOR PEER REVIEW 6 of 20 CSA includes scalar operation for phase multiplication and vector operation FFT (IFFT). As Table 2 shown, FFT/IFFT operations account for up to 95% of SAR imaging, so accelerating the Sensors 2019,operation 19, 3409 6 of 20 FFT/IFFT efficiently is the most important approach for imaging processors. The fixed-point FFT/IFFT operation with lower accuracy has a small loss of imaging accuracy, and can significantly improve the processing throughput. In [37], the quantization error power of the 3.3. Heterogeneous Arrays fixed-point processing CSA was evaluated in detail. The analysis results showed that as the word CSA includes scalar phase multiplication and vector operation (IFFT). Asand Table length increases from 12 operation to 16, the for quantization error power remains essentiallyFFT unchanged, the2 shown, FFT/IFFT operations account for up to 95% of SAR imaging, so accelerating the FFT/IFFT imaging quality with a 15 or 16-bit word length is very close to that of a single precision floatingoperation efficiently the most important approach for imaging processors. point. Therefore, we is design PE arrays to support 12-bit, 14-bit, and 16-bit fixed-point FFT/IFFT. For The fixed-point FFT/IFFT operation with lower accuracy has a small loss imaging accuracy, applications with lower accuracy requirements, low-bit width operation can be of selected. and can significantly improve the processing throughput. In [37], the quantization error of However, the phase compensation operation requires high precision and must use power floatingthe fixed-point was detail. The analysis results showed that as the point arithmeticprocessing operations.CSA Based onevaluated the earlierin description and discussion, a heterogeneous array word length increases from 12 to 16, the quantization error power remains essentially unchanged, is designed, which includes two types of computing units named PE and FPE. PE is used for FFT/IFFT and the imaging a 15 or 16-bit word operation. length is very close to that of a single precision operation and FPEquality is usedwith for phase compensation floating-point. design PE arraysagainst to support 12-bit, 14-bit, and 16-bit fixed-point FFT/IFFT. Since the Therefore, operation we ratio of FFT/IFFT phase compensation is approximately 9:1, the For applications with lower accuracy requirements, low-bit width operation can be selected. configuration of PE and FPE should also follow this proportional relationship. For smaller matrix phase compensation operation requires high must usearray, floating-point sizes,However, the ratio isthe near 90%; to meet the different matrix sizes, we precision design theand processing in which arithmetic operations. Based on the earlier description and discussion, a heterogeneous array is the ratio of PE and FPE is 8:1, as shown in Figure 3. designed, which includes two types of computing units named PE and FPE. PE is used for FFT/IFFT In CSA flow, each range/azimuth FFT/IFFT operation is relatively independent, and there is no operation and FPE is used forrange/azimuth, phase compensation operation. data dependency between so each range/azimuth FFT/IFFT operation can be Since the operation ratio of FFT/IFFT against phase compensation is approximately 9:1, the performed in parallel. Moreover, in the FFT/IFFT operation, each butterfly operation is relatively configuration of PE and FPE should also follow this proportional relationship. For smaller matrix sizes, independent, and multiple butterfly operations can be performed in parallel. The phase the ratio is nearprocess 90%; to meet the different matrix sizes, we design processing in which the compensation performs independent operations at athe single point array, so that multiple ratio of PE and FPE is 8:1, as shown in Figure 3. independent operations can be performed in parallel. ··· FPE ··· ··· PE PE ··· ··· ··· 8 lines PE FFT/IFFT PE FPE Phase compensation Figure 3. 3. PE PEarray arraystructure structurediagram. diagram. PE: computing unit for FFT/IFFT operation in a heterogenous PE: computing unit for FFT/IFFT operation in a heterogenous array. array. In CSA flow, each range/azimuth FFT/IFFT operation is relatively independent, and there is no data dependency range/azimuth, sophase each range/azimuth operation canare be performed In CSA flow,between four FFT/IFFT and three operations areFFT/IFFT data dependent; they processed parallel. Moreover, in the FFT/IFFT butterfly operation relatively independent, in the pipeline. As shown in Figure 4, tooperation, establish aeach pipeline between the FFTisand the phase operation, andparallel multiple butterfly operations can be performed the FFT/IFFT differ by 1/8 computation cycles.in parallel. The phase compensation process performs independent operations at a single point so that multiple independent operations can be performed in parallel. In CSA flow, four FFT/IFFT and three phase operations are data dependent; they are processed in the pipeline. As shown in Figure 4, to establish a pipeline between the FFT and the phase operation, the parallel FFT/IFFT differ by 1/8 computation cycles. Sensors 2019, 19, 3409 Sensors 2019, 19, x FOR PEER REVIEW Sensors 2019, 19, x FOR PEER REVIEW 7 of 20 7 of 20 7 of 20 Space-line Space-line(Parallel) (Parallel) Time-line (Pipeline) Time-line (Pipeline) FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... FFT/IFFT FFT/IFFT ... PM:Phase Multiplication PM PM PM PM PM PM PM PM PM PM PM PM PM PM PM PM PM:Phase Multiplication PM PM PM PM PM PM PM PM PM PM PM PM PM PM PM PM Figure 4. Pipeline between FFT/IFFT and phase operation. Figure 4. Pipeline between FFT/IFFT and phase operation. 3.4. Data Placement and Simultaneous Access 3.4. Data Placement and Simultaneous Access process, the data data transfer transfer has has aa bit-reverse bit-reverse address address sequence. sequence. To support this In the FFT/IFFT process, In the FFT/IFFT process, the data transfer has a bit-reverse address sequence. To support this data access pattern, we use a multi-bank multi-bank distributed distributed data data placement placement strategy, strategy, as as shown shown in in Table Table 3. 3. data access pattern, we use a multi-bank distributed data placement strategy, as shown in Table 3. According to the calculation requirements, one row of PE parallel performs 16 butterfly operations, According to the calculation requirements, one row of PE parallel performs 16 butterfly operations, and According to the calculation requirements, one row of PE parallel performs 16 butterfly operations, and needs to provide 32atdata at thetime. sameTherefore, time. Therefore, dataisaccess is performed inAs parallel. needs to provide 32 data the same data access performed in parallel. shownAs in and needs to provide 32 data at the same time. Therefore, data access is performed in parallel. As shown 5, in32 Figure 5, 32 data are simultaneously accessed Bank in the In first In Figure data are simultaneously accessed from Bank from 0 andBank Bank01and in the first1 cycle. thecycle. second shown in Figure 5, 32 data are simultaneously accessed from Bank 0 and Bank 1 in the first cycle. In the second cycle, data are read simultaneously Bank 3. 2 and 3. Bank and in the cycle, data are read simultaneously from Bank 2from and Bank BankBank selection andselection the address a the second cycle, data are read simultaneously from Bank 2 and Bank 3. Bank selection and the address a bank are generated follow each step in processing the FFT/IFFT processing each bank arein generated to follow eachtostep in the FFT/IFFT flow. Althoughflow. eachAlthough PE performs a address in a bank are generated to follow each step in the FFT/IFFT processing flow. Although each PE performs a different FFT/IFFT operation, they use similar data placement and access strategies. different FFT/IFFT operation, they use similar data placement and access strategies. PE performs a different FFT/IFFT operation, they use similar data placement and access strategies. Table 3. 3. 4096 points input data storage storage in in four four banks. banks. Table Table 3. 4096 points input data storage in four banks. Bank_NO. Bank_NO. Space-line Space-line(Parallel) (Parallel) Bank_NO. Bank_0 Bank_0 Bank_0 Bank_1 Bank_1 Bank_1 Bank_2 Bank_2 Bank_2 Bank_3 Bank_3 Bank_3 Input Data Storage Bank Input Data StorageStatus Status in in Bank Input Data Storage Status in Bank 0–15 64–79 0–15 64–79 0–15 64–79 -- 16–31 80–95 -- 16–31 80–95 16–31 80–95 32–47 96–111 32–47 96–111 -- 32–47 96–111 -48–63 112–127 4080–4095 48–63 112–127 - 4080–4095 48–63 112–127 - 4080–4095 16 Operands 16 Operands (PE0_0~PE0_15)=> (PE0_0~PE0_15)=> Bank_0 Bank_0 Bank_1 Bank_1 Bank_2 Bank_2 Bank_3 Bank_3 Data(0) Data(0) Data(16) Data(16) Data(1) Data(1) Data(17) Data(17) Data(32) Data(32) Data(48) Data(48) Data(2) Data(2) Data(18) Data(18) Data(33) Data(33) Data(49) Data(49) ... ... ... ... Data(34) Data(34) Data(50) Data(50) Data(13) Data(13) Data(29) Data(29) ... ... ... ... Time-line (Pipeline) Time-line (Pipeline) Data(14) Data(14) Data(30) Data(30) Data(45) Data(45) Data(61) Data(61) Data(15) Data(15) Data(31) Data(31) Data(46) Data(46) Data(62) Data(62) ... ... ... ... Data(47) Data(47) Data(63) Data(63) ... ... ... ... Figure 5. 16 PEs perform base-4 butterfly operation timing diagram for four banks. Figure 5. 16 PEs perform base-4 butterfly operation timing diagram for four banks. There There is is no no special special requirement requirement for for the the sequence sequence of of data data in in the the phase phase compensation compensation calculation calculation There is no special requirement for the sequence of data in the phase compensation calculation process; therefore, as shown in Figure 6, the calculation process only needs to access the data in parallel. process; therefore, as shown in Figure 6, the calculation process only needs to access the data in process; therefore, as shown in Figure 6, the calculation process only needs to access the data in parallel. parallel. Sensors 2019, 19, 3409 Sensors 2019, 19, x FOR PEER REVIEW Sensors 2019, 19, x FOR PEER REVIEW 8 of 20 8 of 20 8 of 20 Space-line Space-line(Parallel) (Parallel) Time-line (Pipeline) Time-line (Pipeline) 16 Operands 16 Operands (FPE0_0~FPE0_15)=> Bank_0 (FPE0_0~FPE0_15)=> Bank_0 (FPE1_0~FPE1_15)=> Bank_1 (FPE1_0~FPE1_15)=> Bank_1 Data(0) Data(0) Data(0) Data(0) Data(1) Data(1) Data(1) Data(1) Data(2) Data(2) Data(2) Data(2) ... ... ... ... Data(13) Data(13) Data(14) Data(14) Data(15) Data(15) Data(13) Data(13) Data(14) Data(14) Data(15) Data(15) Figure 6. 6. 32 FPEs perform phase operation for two banks. FPE: computing unit in a heterogeneous Figure 32 FPEs perform phase operation for two banks. FPE: computing unit in a heterogeneous array used for for phase phase compensation compensation operation. operation. array used Architectural Implementations 4. Architectural 4. Architectural Implementations 4.1. Overall Overall Architecture Architecture 4.1. 4.1. Overall Architecture A highly highly efficient efficientheterogeneous heterogeneousprocessor processorfor for SAR imaging is designed. Figure 7 shows the A SAR imaging is designed. Figure 7 shows the topA highly efficient heterogeneous processor for SAR imaging is designed. Figure 7 shows the toptop-level architecture of the proposed SAR imaging processor. This section describes the overall level architecture of the proposed SAR imaging processor. This section describes the overall level architecture of the proposed SAR imaging processor. This section describes the overall hardware block and functional functional modules. modules. Essentially, Essentially, the the architecture architecture consists consists of of three three major major hardware block diagram diagram and hardware block diagram and functional modules. Essentially, the architecture consists of three major components: a hybrid–PE array, an on-chip buffer module, and a data systolic engine. components: a hybrid–PE array, an on-chip buffer module, and a data systolic engine. components: a hybrid–PE array, an on-chip buffer module, and a data systolic engine. ··· ··· ··· ··· ··· ··· ··· ··· Switch Switch PE PE ··· ··· ··· ··· PE PE PE PE PE PE FPE FPE FPE FPE FPE FPE FPE FPE FPE FPE FPE FPE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE ··· ··· FPE FPE FPE FPE FPE FPE ··· ··· FPE FPE FPE FPE FPE FPE PE PE PE PE PE PE Local-PF Local-PF buffer(16KB) buffer(16KB) 16×32b 16×32b 16×32b 16×32b PF-BufferοΌ32KBοΌ PF-BufferοΌ32KBοΌ Figure 7. Top-level Top-level architecture of processor. Figure Figure 7. Top-level architecture of processor. Heterogeneous HeterogeneousPE PEarray array ··· ··· ······ PE PE ······ Buffer_A Buffer_ACtrl Ctrl 16×16b 16×16b 16×16b 16×16b PE PE ······ 16×16b 16×16b PE PE ······ Ex-memory Ex-memorysystem system On-chip On-chipBuffer BufferBB Bank[0](4KB) Bank[0](4KB) Local buffer(16KB) Local buffer(16KB) Local-TF Local-TF buffer(32KB) buffer(32KB) PE PE PE ··· PE PE PE PE PE PE ··· PE PE PE ······ On-chip Buffer A On-chip Buffer A 16×16b 16×16b 16×16b 16×16b 16×16b 16×16b 16×16b 16×16b Switch Switch ······ ······ Bank[32](4KB) Bank[32](4KB) Data Buffer_A2 Data Buffer_A2 ······ Bank[0](4KB) Bank[0](4KB) ······ ······ Bank[32](4KB) Bank[32](4KB) Bank[32](4KB) Bank[32](4KB) Resource Resource Controller Controller Data Buffer_A1 Data Buffer_A1 Bank[0](4KB) Bank[0](4KB) 128b 128b 16×16b 16×16b TF-Buffer TF-Buffer οΌ2KBοΌ οΌ2KBοΌ Data Data Buffer_B1 Buffer_B1 16×16bit 16×16bit Ctrl Ctrl IO / IO / Decoder Decoder Bank[0](4KB) Bank[0](4KB) Bank[32](4KB) Bank[32](4KB) Data Data Buffer_B2 Buffer_B2 Buffer_B Ctrl Buffer_B Ctrl Ctrl Ctrl Sensors 2019, 19, 3409 9 of 20 To meet the throughput requirement of SAR imaging, two identical sets of heterogeneous arrays are implemented, which can perform different block imaging processing computations in parallel. Each of the heterogeneous arrays contains 16 × 16 PEs and 2 × 16 FPEs. The number ratio of PE and FPE satisfies the proportional relationship of 8:1. To feed the processing array with adequate data supply, three types of buffers are implemented on chip. In a processing array, all the data banks for 16-line PEs and two-line FPEs are organized as a 264-KB data buffer with two sub-buffers, each of which contains 32 banks for PEs and one bank for an FPE. A 32-KB twiddle factor dedicated local buffer (Local-TF buffer) and a 16-KB phase factor dedicated local buffer (Local-PF buffer) for the phase compensation operation is also implemented inside a processing array. To organize the data transfer between off-chip RAM and on-chip buffers, a data systolic engine is implemented. With this data systolic engine, the input raw image echo can be read and the imaging output can be written back following the processing flow. 4.2. Heterogeneous PE Arrays Each PE pipelined performs a four-point butterfly operation in six cycles, and all of the PE in a row parallel perform butterfly operations in a block. During the FFT/IFFT operation, all 64 input data are sent to one row of PEs in two cycles from the data buffer, and the 64 output data are written back to the data buffer in two cycles. In a heterogeneous array, as shown in Figure 7, PEs are interconnected to pass a twiddle factor, the Local-TF buffer distributes the twiddle factor to the PE from top to bottom. The twiddle factor passes two rows down each cycle, and the required twiddle factors are assigned to 16 rows of PEs in eight cycles. Besides, each PE supports zero-padding to expand the raw data to an integer power of two. During the phase compensation operation, the two input data banks send 32 input data to two rows of FPEs (32 FPEs) in parallel. The Local-PF buffer passes and distributes the phase compensation factor from bottom to top. 4.3. Alternate Systolic-Memory and On-Chip Buffer Organization Since on-chip memory space is limited, all of the radar echo data is stored in the external memory first. As shown in Figure 8, the data systolic engine (DSE) fetches the data from dynamic random access memory (DRAM) and pushes the data into on-chip memory. To hide the communication latency of data transfer between DSEs and arithmetic components, we employ the alternate systolic technique. In order to avoid DSE competition in hardware resources, we use two alternate systolic memory modules for each of the input/output interfaces for the whole system. At the same time, we adopt two DSE channels for input data and weight at the input end. The proposed memory architecture can provide 4 GB/s of read/write memory bandwidth at 250-MHz frequency to satisfy the data requirements of the processor. As shown in Figure 8, our storage architecture consists of three layers: DRAM, a data transfer engine system, and an on-chip buffer. Since the on-chip storage resources are limited in size, all the pending radar echo data is first stored in off-chip memory (DRAM). During data processing, the data is first cached by the data transfer engine system into the on-chip buffer, and then sent to the PE array for processing by the on-chip buffer. As shown in Figure 8, in order to hide the communication latency between the off-chip memory and the on-chip buffer, we use the double-buffered data alternate transmission method. Sensors 2019, 19, 3409 Sensors 2019, 19, x FOR PEER REVIEW 10 of 20 10 of 20 Data Systolic Engine (DSE)_0 Data Systolic Engine (DSE)_1 Asynchronous Buffer_0 Asynchronous Buffer_1 Data Systolic Engine (DSE)_2 Data Systolic Engine (DSE)_3 Asynchronous Buffer_2 Asynchronous Buffer_3 Data Transformation Pipeline Engine (DTPE) Off-chip DRAM DRAM-controller Off-chip DRAM Data transfer engine system on-chip Buffer_1 on-chip Buffer_0 Input port (a) Data Systolic Engine (DSE) 6 Off-chip DRAM Off-chip DRAM Data Systolic Engine (DSE) 5 DRAM-controller Data Transformation Pipeline Engine (DTPE) Output image Onput Port (b) Figure 8. 8. Memory hierarchy architecture. architecture. (a): Input Port; (b): Output Figure Output Port. Port. 4.4. Resource Controller As shown in Figure 8, our storage architecture consists of three layers: DRAM, a data transfer engine system, and an on-chip buffer. Since on-chip the storage resources in size, the The resource controller is responsible forthe allocating execution unitare andlimited arranging the all access pending radar echo data is first stored in off-chip memory (DRAM). During data processing, the data flow of the on-chip buffer. is first cached by the data are transfer engine system into the on-chip buffer, andprocessing. then sent toThe theFFT/IFFT PE array Two imaging blocks respectively assigned to two arrays for parallel for processing by the on-chip buffer. As shown in Figure 8, in order to hide the communication and phase compensation operations are involved in the intra-block processing, so the PE is assigned to latency between thethe off-chip memory the ison-chip buffer, usecompensation the double-buffered data the FFT/IFFT during calculation and and the FPE assigned to thewe phase operation. alternate transmission method. data mapping and access strategy, in order to support the parallel According to the designed access of data, the resource controller allocates bank and bank addresses for each range of data. 0,16 Bank_2 0,32 Bank_3 0,48 0,15 0,64 0,31 0,80 0,47 0,96 0,63 0,112 ... ... ... ... 0,79 0,95 0,111 0,127 ... ... ... ... ... ... ... 0,r-1 Range r 0,0 0,1 0,2 1,0 1,1 1,2 ... ... a-2,0 a-2,1 a-2,2 a-1,0 a-1,1 a-1,2 0,r-3 0,r-2 0,r-1 1,r-3 1,r-2 1,r-1 ... ... ... ... Azimuth Range Operation ... ... ……… Bank_1 ... ... ... ... ……… 0,0 ……… Bank_0 ……… Bank group flow of the on-chip buffer. Two imaging blocks are respectively assigned to two arrays for parallel processing. The FFT/IFFT and phase compensation operations are involved in the intra-block processing, so the PE is assigned to the FFT/IFFT during the calculation and the FPE is assigned to the phase compensation operation. Sensors 2019, 19, 3409 11 of 20 According to the designed data mapping and access strategy, in order to support the parallel access of data, the resource controller allocates bank and bank addresses for each range of data. When When performing FFT/IFFT, of data is stored four banksaccording accordingto to aa distributed distributed performing range range FFT/IFFT, each each row row of data is stored in in four banks storage strategy. storage strategy. As shown When performing As shown in in Figure Figure 9, 9, we we take take aa row row of of 1024 1024 points points as as an an example example (r (r == 1024). 1024). When performing FFT/IFFT, 1024 points are segmented and stored in four banks according to the distributed FFT/IFFT, 1024 points are segmented and stored in four banks according to the distributed storage storage strategy. AAtotal totalofof1616consecutive consecutivepoints pointsare are used a segment, in which approximately toare 15 strategy. used asas a segment, in which approximately 0 to015 are placed in Bank_0, 31placed are placed in Bank_1, 32 are to 47 are placed in Bank_2, toplaced 63 are placed in Bank_0, 16 to16 31to are in Bank_1, 32 to 47 placed in Bank_2, and 48and to 6348are placed in Bank_3; the above operation is repeated until all data of 256 segments are stored. A base-4 in Bank_3; the above operation is repeated until all data of 256 segments are stored. A base-4 FFT/IFFT FFT/IFFT operation at 1024 pointsa requires a total of of five levels of The operation. The calculation operation at 1024 points requires total of five levels operation. calculation process usesprocess multiuses multi-bank parallel data access. Taking the first stage as an example, data 0 to 31 is read bank parallel data access. Taking the first stage as an example, data 0 to 31 is read from Bank_0from and Bank_0 in and in the first cycle, datais992 tofrom 1023 Bank_2 is read from Bank_2inand the Bank_1 theBank_1 first cycle, and data 992 and to 1023 read and Bank_3 theBank_3 second in cycle. second cycle. latter fouroperational levels of the operational data access process to the first level. The latter fourThe levels of the data access process are similar to are the similar first level. Similarly, when performing azimuth FFT/IFFT, each azimuth of data is stored Similarly, when performing azimuth FFT/IFFT, each azimuth of data is stored in in four four banks banks according to the storage strategy (taking 1024 points as an example, a = 1024). The data access process according to the storage strategy (taking 1024 points as an example, a = 1024). The data access process is similar similar to is to the the FFT/IFFT FFT/IFFT range. range. a a-2,r-3 a-2,r-2 a-2,r-1 a-1,r-3 a-1,r-2 a-1,r-1 32,0 Bank_3 48,0 64,0 31,0 80,0 47,0 96,0 63,0 112,0 ... ... ... ... 79,0 95,0 111,0 127,0 ... ... ... ... ... ... ... 2,0 a-1,0 a-3,0 ... ... ... ... a-2,0 a-2,1 a-2,2 a-1,0 a-2,1 a-1,2 ... ... 0,r-3 0,r-2 0,r-1 1,r-3 1,r-2 1,r-1 ……… 16,0 Bank_2 15,0 1,2 ……… Bank_1 ... ... ... ... 0,2 1,1 ……… 0,0 0,1 1,0 … … Bank_0 0,0 ... ... a-2,r-3 a-2,r-2 a-2,r-1 a Azimuth Bank group Range r a-1,r-3 a-1,r-2 a-1,r-1 Azimuth Operation Figure 9. Data access pattern. SAR imaging imaging is is a SAR a continuous continuous process process with with huge huge differences differences in in operational operational density density between between FFT/IFFT and the phase compensation operation. For the characteristics of the computational process, FFT/IFFT and the phase compensation operation. For the characteristics of the computational process, we have designed a way to organize the processing of SAR imaging in space and time flow (ST-Flow), we have designed a way to organize the processing of SAR imaging in space and time flow (ST-Flow), as shown shown in as in Figure Figure 10. 10. Taking 1024 points FPE performs performs aa one-point one-point phase phase compensation compensation Taking 1024 points FFT/IFFT FFT/IFFT as as an an example, example, each each FPE operation in in one one cycle, cycle, and and all all the the FPE FPE in in aa row phase compensation compensation operations. operations. operation row parallel parallel perform perform phase During the phase compensation operation, all 16-input data are sent to one row of FPEs in cycle During the phase compensation operation, all 16-input data are sent to one row of FPEs in one one cycle from the data buffer, and 16 output data are written back to the data buffer in one cycle. It can from the data buffer, and 16 output data are written back to the data buffer in one cycle. It can be be seen that the 1024-point phase compensation operation requires 64 cycles. In order to satisfy the task saturation and parallelism of the parallel pipeline between phase compensation and FFT/IFFT, the resource controller sets the start time for each row of PE to be delayed by 64 cycles from the previous row. Considering the different matrix sizes, the ratio of PEs to FPEs is configured to be 8:1, so for larger matrices, the FPE will be idle. During the processing of the FPE, it is necessary to wait for the PE to complete the FFT operation before starting the processing of the next frame. seen that the 1024-point phase compensation operation requires 64 cycles. In order to satisfy the task saturation and parallelism of the parallel pipeline between phase compensation and FFT/IFFT, the resource controller sets the start time for each row of PE to be delayed by 64 cycles from the previous row. Considering the different matrix sizes, the ratio of PEs to FPEs is configured to be 8:1, so for larger2019, matrices, Sensors 19, 3409the FPE will be idle. During the processing of the FPE, it is necessary to wait for 12 ofthe 20 PE to complete the FFT operation before starting the processing of the next frame. Time-line ... Cycles … ... … … FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT FFT/IFFT ... ... Parallel Pipeline FFT/IFFT Imaging block_2 FFT/IFFT FFT/IFFT ……… Imaging block_3 FFT/IFFT Parallel processing Imaging block_n-1 Imaging block_n FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT PM FFT/IFFT FFT/IFFT FFT/IFFT PM FFT/IFFT PM ... PM ... PM ... PM ... PM FFT/IFFT PM FFT/IFFT PM ... PM ... PM PM PM PM PM ... ... PM PM PM PM FFT/IFFT PM FFT/IFFT PM ... ... ... ... ……… Parallel processing Imaging block_1 FFT/IFFT ……… Imaging block_0 ……… Parallel processing ... ... PM PM … PM PM … Space-line PM PM PM PM … Parallel PM Cycles ... ... PM:Phase Multiplication Figure10. 10. Space–time Figure Space–timeflow flow(ST-Flow) (ST-Flow)ofofimaging imagingprocessing. processing. 5. Processor Performance Evaluation 5. Processor Performance Evaluation We implemented the SAR imaging processor at 65-nm CMOS (complementary metal oxide We implemented the SAR imaging processor at 65-nm CMOS (complementary metal oxide semiconductor) technology with 1.2 V of supply voltage using Synopsys tools. Figure 11 shows the die semiconductor) technology with 1.2 V of supply voltage using Synopsys tools. Figure 11 shows the photograph of the chip. In the evaluation, the CS imaging algorithm is selected as the benchmark. die photograph of the chip. In the evaluation, the CS imaging algorithm is selected as the benchmark. 5.1. Performance Analysis In this section, we configure the processor with fixed-point PE and single-precision floating-point FPE. We evaluate the processor performance at 200 MHz with different fixed-point lengths. The test echo data matrix size is 16,384 × 16,384. We perform two operations in parallel on the heterogeneous PE, which can take advantage of the computing power and increase the throughput. When the CSA is processed in heterogeneous PE mode, the throughput is achieved to 115.2 Giga operations per second (GOPS), with 463 mW of power consumption. As shown in Table 4, when all the imaging processes use single-precision floating-point units, the power consumption of the processor is up to 713 mW, and its energy efficiency is only 67% of the fixed/floating point heterogeneous imaging mode. Also, the processor can reduce a small amount of power consumption when selecting low-bit fixed-point Sensors 2019, 19, 3409 13 of 20 FFT/IFFT operation. The processor consumes 463 mW for 16-bit fixed-point FFT/IFFT and reduces to 454 mW for Sensors 2019, 19,12-bit x FOR fixed-point PEER REVIEWFFT/IFFT, as shown in Table 4. 13 of 20 PLL Controller PE Array PE Array REGs TF Buffer &PF Buffer Buffer A&B Figure Figure 11. Die Die photograph photograph of of the the chip. chip. 1 4. System performance assessment with different fixed-point length FFT. 5.1. PerformanceTable Analysis PE Bit-Width (Bits) the processor 12 14 fixed-point 16 Single-Precision Floating floatingIn this section, we configure with PE and single-precision (GOPS) 115.2 115.2at 200 115.2 115.2fixed-point lengths. point FPE. WeThroughput evaluate the processor performance MHz with different Power (mW) 459We perform 463 713 in parallel on the The test echo data matrix size is 16,384454 × 16,384. two operations Energy efficiency (GOPS/mW) 0.254 0.250 0.240 0.16 heterogeneous PE, which can take advantage of the computing power and increase the throughput. 1 4 provides statistics on in throughput, power consumption, and energy efficiency for entire heterogeneous When Table the CSA is processed heterogeneous PE mode, the throughput isthe achieved to 115.2 Giga processor. operations per second (GOPS), with 463 mW of power consumption. As shown in Table 4, when all the imaging processes use single-precision floating-point units, the power consumption of the 5.2. Array Utilization Analysis processor is up to 713 mW, and its energy efficiency is only 67% of the fixed/floating point As shownimaging in Table mode. 5, we can seethe that in the algorithm processing, the ST-flow two-dimensional heterogeneous Also, processor can reduce a small amount of power consumption parallel pipeline achieves better array utilization than one-dimensional time-based computational flow when selecting low-bit fixed-point FFT/IFFT operation. The processor consumes 463 mW for 16-bit (TI-flow). The high utilization of the array can increase the throughput of the system. The time-based fixed-point FFT/IFFT and reduces to 454 mW for 12-bit fixed-point FFT/IFFT, as shown in Table 4. computational flow (TI-flow) that is employed in existing processors is inefficient for SAR imaging 1 processing. AsTable shown in Table 5, in the ST-Flow mode, FFT operation theFFT. phase mean (PM) 4. System performance assessment with the different fixed-point and length operation are pipelined, the throughput reaches 115.2 GOPS, the resource utilization rate can reach PE Bit-Width (Bits) 12 14 16 Single-Precision Floating 98.8%, and the energy efficiency is 0.24 GOP/mW. In TI-Flow mode, the FFT operation and the PM Throughput (GOPS) 115.2 115.2 115.2 115.2 operation are executed sequentially, the throughput is only 62.6 GOPS, the resource utilization rate is Power (mW) 454 459 463 713 54.3%, and the energy efficiency is only 0.16 GOP/mW. Compared with the TI-Flow mode, the resource Energy efficiency (GOPS/mW) 0.254 0.250 0.240 0.16 utilization in ST-Flow mode significantly increases, the throughput increases by 84.5%, and the average 1 Table 4 provides statistics on throughput, power consumption, and energy efficiency for the entire power consumption only increases by 21.2%. heterogeneous processor. Table 5. Array utilization with ST-flow and time-based computational flow (TI-flow). GOPS: Giga 5.2. Array Utilization Analysis operations per second. As shown in Table 5, we can see that in the algorithm processing, the ST-flow two-dimensional TI-Flow parallel pipeline achieves better arrayST-Flow utilization than one-dimensional time-based computational FFT Phase Compensation Overall flow (TI-flow). The high utilization of the array can increase the throughput of the system. The timeArray utilizations 98.8% 88.9% 11.1% 54.3% based computational flow (TI-flow) that is employed in existing processors is inefficient for SAR ThroughputAs (GOPS) 102.4 mode, the12.8 62.6the phase imaging processing. shown in Table115.2 5, in the ST-Flow FFT operation and Power (mW) 463 435 317 382 mean (PM) operation are pipelined, the throughput reaches 115.2 GOPS,- the resource utilization rate Energy efficiency (GOP/mW) 0.24 0.16 can reach 98.8%, and the energy efficiency is 0.24 GOP/mW. In TI-Flow mode, the FFT operation and the PM operation are executed sequentially, the throughput is only 62.6 GOPS, the resource 5.3. Analyzes of Array Scalability utilization rate is 54.3%, and the energy efficiency is only 0.16 GOP/mW. Compared with the TI-Flow We analyze theutilization performance of a single heterogeneous array, as shown in Figures 12increases and 13. On mode, the resource in ST-Flow mode significantly increases, the throughput by the horizontal axis, power the numbers 5, 9, 18,only and increases 36 represent the array scales of 5 × 4, 9 × 8, 18 × 16, 84.5%, and the (X) average consumption by 21.2%. and 36 × 32, respectively. Energy Energy efficiency efficiency (GOP/mW) (GOP/mW) 0.24 0.24 -- -- 0.16 0.16 5.3. 5.3. Analyzes Analyzes of of Array Array Scalability Scalability We We analyze analyze the the performance performance of of aa single single heterogeneous heterogeneous array, array, as as shown shown in in Figures Figures 12 12 and and 13. 13. On On 14 of 20 the horizontal (X) axis, axis, the the numbers numbers 5, 5, 9, 9, 18, 18, and and 36 36 represent represent the the array array scales scales of of 55 ×× 4, 4, 99 ×× 8, 8, 18 18 ×× 16, 16, and and 36 36 ×× 32, 32, respectively. respectively. As the size of the array increases, the throughput and imaging efficiency of the system increase the size size of of the the array array increases, increases, the the throughput throughput and and imaging imaging efficiency efficiency of of the the system system increase increase As the significantly, but the power consumption of the processor also rises sharply. In general, the significantly,but butthe thepower power consumption of the processor sharply. In general, the powerpowersignificantly, consumption of the processor alsoalso risesrises sharply. In general, the power-delay delay product and energy efficiency of large PE arrays are better than those of small PE arrays. On delay product and energy efficiency of large PE arrays are better than those of small PE arrays. On product and energy efficiency of large PE arrays are better than those of small PE arrays. On the other the other hand, the array size must be closely matched to the buffer size; an oversized or undersized the other bematched closely matched to thesize; buffer an oversized or undersized hand, thehand, array the sizearray mustsize be must closely to the buffer ansize; oversized or undersized array array will in wasted PE or memory bandwidth utilization. array configuration configuration willinresult result PEorresources resources or low low memory bandwidth utilization. configuration will result wastedinPEwasted resources low memory bandwidth utilization. Therefore, the Therefore, the size of aa single heterogeneous processing array is designed to be 18 ×× 16 after aa tradeTherefore, the size of single heterogeneous processing array is designed to be 18 16 after tradesize of a single heterogeneous processing array is designed to be 18 × 16 after a trade-off between the off the complexity and off between between the chip chip implementation implementation complexityperformance. and processing processing performance. performance. chip implementation complexity and processing Sensors 2019, 19, 3409 the horizontal (X) Figure 12. Power array sizes. Figure 12. and imaging imaging time time with with different different Power consumption consumption and different array array sizes. sizes. Energy efficiency and power-delay product with different array sizes. Figure Figure 13. 13. Energy Energy efficiency efficiency and power-delay product with different different array array sizes. sizes. 5.4. Comparison with Other Schemes Table 6 lists the SAR imaging time for different sizes of input. For the ordinary SAR radar (for instance, the Chinese Gaofen-3 satellite, pulse repetition frequency: 2000 Hz), the real-time processing time of 16,384 × 16,384 SAR raw data requires 8 s. The proposed scheme can meet the real-time requirements. The power consumption and SAR imaging time for other studies are also listed in Table 6. As can be seen from Table 6, the power consumption of the proposed scheme is the smallest, because the proposed scheme can completely realize the entire SAR imaging process without additional microcontroller unit (MCU) or CPU. Similar to [15], the Mobile-GPU architecture uses a lower power cost (5 W) to achieve better real-time performance. Compared with [15], the proposed architecture is better in performance-to-power ratio and improves by a factor of 230.4. From the real-time performance Sensors 2019, 19, 3409 15 of 20 perspective, the CPU + GPU scheme is the best, but its power consumption exceeds 300 W. The real-time performance of the proposed scheme is only 8.6% of [17], but the performance-to-power ratio improves by a factor of 63.4. Table 7 shows the comparison of the proposed scheme and related research in real-time performance. As can be seen from Table 7, compared with [15], the speedup ratio reached 21.33. Table 6. Comparison with previous works. Architectural Model Operating Frequency Power Consumption SAR Imaging Algorithms Frame Size SAR Signal Processing Time (s) 0.04 0.15 0.68 8.2 5.54 32.9 Proposed solution 200 MHZ 463 mW CS 1024 × 1024 2048 × 2048 6472 × 3328 16,384 × 16,384 30,000 × 6000 32,768 × 32,768 GPGPU [14] - >500 W Omega-k 30,000 × 6000 8.5 CPU + GPU [18] - 345 W CS 32,768 ×32,768 2.8 Mobile-GPU [16] 2.3 GHZ 5W CS 2048 × 2048 3.2 Microprocessor + FPGA [15] - 68 W CS 6472 × 3328 8 CPU + ASIC [28] 100 MHZ 10 W - 1024 × 1024 - Table 7. Speed-up ratio to previous works. Architectural Imaging Time Imaging Time in Proposed Solution Speed-Up Ratio Frame Size CPU + ASIC [28] Mobile-GPU [16] Microprocessor + FPGA [15] GPGPU [14] CPU + GPU [18] 3.2 s 8s 8.5 s 2.8 s 0.04 s 0.15 s 0.68 s 5.54 s 32.9 s 21.33 11.76 1.54 0.086 1024 × 1024 2048 × 2048 6472 × 3328 30,000 × 6000 32,768 × 32,768 5.5. SAR Imaging Quality Evaluation We compared the scene SAR imaging results of different fixed-point length FFT. Radar data were obtained from RADARSAT-1 of Canada (width: 50 km; resolution: 6 m) [38]. The imaging effect is shown in Figure 14. For the actual scenes, the mean square error (MSE), peak signal-to-noise ratio (PSNR) [39], structural similarity index (SSIM) [40], and radiometric resolution (RL) [41] are commonly adopted to evaluate SAR imaging quality. Sufficient imaging accuracy can be achieved with single-precision floating-point imaging. Fixed-point processing methods will cause a certain loss of precision. We take the single-precision floating-point imaging as the test reference to evaluate the fixed-point FFT SAR image quality. The MSE is adopted to calculate the squared intensity difference between the pixels of the partial fixed-point image and the pixels of the full single-precision floating-point image. The PSNR is essentially the same as the MSE, but it is associated with the quantized gray level of the SAR image. The MSE and PSNR are calculated as shown in Formulas (2) and (3): MSE = M N 1 XX 0 ( f (i, j) − f (i, j))2 M×N i=1 j=1 (2) Sensors 2019, 19, 3409 16 of 20 Q2 × M × N PSNR = 10 log10 P P 2 M N 0 j=1 ( f (i, j) − f (i, j)) i=1 (3) where f 0 (i, j) and f (i, j) represent the image pixels to be evaluated and the reference image pixels, respectively; M, N represent the length and width of the image, respectively. Q represents the gray level of the image (Q = 255). Sensors 2019, 19, x FOR PEER REVIEW 16 of 20 (a) (b) (c) (d) Figure 14. The scene scene SAR SAR imaging imaging results results for for different fixed-point length FFT. (a) 12-bit fixed-point FFT; (b) 14-bit fixed-point FFT; FFT; (c) (c) 16-bit 16-bit fixed-point fixed-point FFT; FFT;(d) (d)single-precision single-precisionfloat-point float-pointFFT. FFT. PSNR MSEscenes, are simple and straightforward SAR image quality assessments the For theand actual the mean square error (MSE), peak signal-to-noise ratio based (PSNR)on[39], visibility of errors. Due to the PSNR index not being exactly the same as the visual quality seen by structural similarity index (SSIM) [40], and radiometric resolution (RL) [41] are commonly adopted theevaluate human SAR eye, the evaluation requirements of the human visual system (HVS) cannot be met [40]. to imaging quality. Therefore, we also adoptaccuracy SSIM (the Structural Similarity Index) to evaluate the SAR images. As shown Sufficient imaging can be achieved with single-precision floating-point imaging. Fixedin Formula (4): point processing methods will cause a certainloss of precision. We take the single-precision floating+ εSAR y + ε1 2δxy point imaging as the test reference to evaluate 2Ο thex Οfixed-point FFT image quality. 2 SSIM(x, y) = (4) The MSE is adopted to calculate the squared the pixels of the partial Ο2x + intensity Ο2y + ε1 difference δ2x + δ2y + εbetween 2 fixed-point image and the pixels of the full single-precision floating-point image. The PSNR is where δ2x represents image and δ2y represents the single-precision essentially the samethe as fixed-point the MSE, but it is variance, associated with the quantized gray level of thefloating-point SAR image. image variance; Ο represents the mean value of the fixed-point image, and Ο represents the mean The MSE and PSNR x are calculated as shown in Formulas (2) and (3): y value of the single-precision floating-point image. The SSIM value range is [0, 1], and the larger the 1 SSIM value, the smaller image distortion. πππΈ = π (π, π) − π(π, π) (2) π ×indicator. π RL is also a very important evaluation RL is adopted to evaluate the minimum variation of target reflection that radar sensors can distinguish. As shown in Formula (5): π ×π×π ππππ = 10 log ! (3) ∑ ∑ απ (π, π) − π(π, π) RL = 10 log10 + 1 (5) β where π (π, π) and π(π, π) represent the image pixels to be evaluated and the reference image pixels, respectively; M, N represent thedeviation length and of the image, respectively. Q represents theimage. gray where α represents the standard of width the image, and β represents the mean value of the level Table of the8image (Qloss = 255). lists the of precision due to the different data widths. As can be seen from Table 8, the PSNR and MSE are simple and straightforward SAR image assessments based on the PSNR value of a partial 16-bit fixed-point image can reach 29.1 dB,quality the results show that the partial visibility of errors. Due to the PSNR index not being exactly the same as the visual quality seen by 16-bit fixed-point image and the single-precision floating-point image differ only by 0.02 and 0.05 the human eye,indexes the evaluation of the human visual system cannot becompared met [40]. dB on the two of SSIMrequirements and RL, respectively. For the actual scene(HVS) SAR imaging, Therefore, we also adopt SSIM (the Structural Similarity Index) to evaluate the SAR images. with a single-precision floating-point image, the accuracy loss of a partial 16-bit fixed-point imageAs is shown 2%. in Formula (4): within πππΌπ(π₯, π¦) = (2π π + π )(2πΏ + π ) (π + π + π )(πΏ + πΏ + π ) (4) where πΏ represents the fixed-point image variance, and πΏ represents the single-precision floating-point image variance; π represents the mean value of the fixed-point image, and π Table 8 lists the loss of precision due to the different data widths. As can be seen from Table 8, the PSNR value of a partial 16-bit fixed-point image can reach 29.1 dB, the results show that the partial 16-bit fixed-point image and the single-precision floating-point image differ only by 0.02 and 0.05 dB on the two indexes of SSIM and RL, respectively. For the actual scene SAR imaging, compared with a single-precision floating-point image, the accuracy loss of a partial 16-bit fixed-point image is within Sensors 2019, 19, 3409 17 of 20 2%. Table 8. Quantitative evaluation of actual scene SAR imaging. Table 8. Quantitative evaluation of actual scene SAR imaging. FFT Pro-Acc1 FFT Pro-Acc 1 Single-precision float-point 3 (dB) 5 (dB) PSNR2 (dB) MSE MSE SSIM4 (dB) RL 3 (dB) SSIM 4 (dB) RL 5 (dB) PSNR 2 (dB) ∞ 0 1 4.99 Single-precision float-point ∞ 0 1 4.99 12-bit fixed-point 13.7 2765.2 0.23 4.11 12-bit fixed-point 13.7 2765.2 0.23 4.11 14-bit fixed-point 377.4 0.77 4.71 14-bit fixed-point 22.422.4 377.4 0.77 4.71 16-bit fixed-point 81.8 0.98 4.94 16-bit fixed-point 29.129.1 81.8 0.98 4.94 2 3 1 1FFT 2 3 4 SSIM: FFTpro-acc: pro-acc:FFT FFTprocessing processing accuracy;PSNR: PSNR: peak signal-to-noise accuracy; peak signal-to-noise ratio;ratio; MSE:MSE: meanmean squaresquare error; error; 5 RL: Radiometric 4 SSIM: Structural 5 Structural Similarity Index; Resolution. Similarity Index; RL: Radiometric Resolution. Phaseisisalso also important information a image. SAR image. The mean phase(PM) mean phase Phase important information for afor SAR The phase and(PM) phaseand deviations deviations (PD) are estimated by the method proposed in [42]. Table 9 lists the phase precision with (PD) are estimated by the method proposed in [42]. Table 9 lists the phase precision with different different fixed-point SARAs imaging. As can be Table seen from the loss of phase precision with fixed-point SAR imaging. can be seen from 9, theTable loss of9,phase precision with partial 16-bit partial 16-bit fixed-point is less than 3%. fixed-point imaging is lessimaging than 3%. Table9.9. Phase Phase information information evaluation of actual scene Table scene SAR SAR imaging. imaging. 1 FFT Pro-Acc FFT Pro-Acc 2 PM PM 1 Single-precision float-point Single-precision float-point 12-bit fixed-point 12-bit fixed-point 14-bit fixed-point 14-bit fixed-point 16-bit fixed-point 16-bit fixed-point 1 1 3 PD PD 3 2 0.00244° β¦ 0.00244 β¦ 0.00916° 0.00916 β¦ 0.00398 0.00398° β¦ 0.00252 0.00252° 2 3.3026° 3.3026β¦ 3.3054° 3.3054β¦ 3.3032β¦ 3.3032° 3.3026β¦ 3.3026° 3 3 PD: pro-acc: FFT processingaccuracy; accuracy; 2 PM: phase deviation. FFTFFT pro-acc: FFT processing PM:phase phasemean; mean;PD: phase deviation. For quality evaluation, wewe adopted thethe point target simulation echoecho data. Forthe thepoint pointtarget targetimaging imaging quality evaluation, adopted point target simulation We compared the point target SAR imaging results for FFT with different fixed-point lengths, as shown data. We compared the point target SAR imaging results for FFT with different fixed-point lengths, inasFigure For the target spatial spatial resolution (RES),(RES), peak peak side lobe ratioratio (PSLR) and shown15. in Figure 15.point For the pointimage, target image, resolution side lobe (PSLR) integrated side lobe ratioratio (ISLR) are commonly adopted to assess imaging quality [38,43]. Table 10 and integrated side lobe (ISLR) are commonly adopted to assess imaging quality [38,43]. Table shows the results of the point targets imaging quality assessment and comparison. 10 shows the results of the point targets imaging quality assessment and comparison. (a) (b) (c) (d) Figure15. 15.The Thepoint pointtarget targetSAR SAR imaging results different fixed-point length (a) 12-bit fixedFigure imaging results forfor different fixed-point length FFT.FFT. (a) 12-bit fixed-point point (b)fixed-point 14-bit fixed-point (c) fixed-point 16-bit fixed-point FFT; (d) single-precision float-point FFT; (b)FFT; 14-bit FFT; (c)FFT; 16-bit FFT; (d) single-precision float-point FFT. FFT. Table 10. Quantitative evaluation of point target SAR imaging. FFT Pro-Acc 1 Single-precision float-point 12-bit fixed-point 14-bit fixed-point 16-bit fixed-point 1 Azimuth Direction 2 RES (m) 3 Range Direction PSLR (dB) 4 ISLR (dB) RES (m) PSLR (dB) ISLR (dB) 4.74 −12.91 −9.64 2.58 −13.31 −9.96 5.43 4.81 4.77 −5.68 −11.85 −12.86 −2.99 −8.22 −9.53 3.71 2.83 2.61(m) −5.88 −11.55 −13.28 −3.22 −9.09 −9.93 FFT pro-acc: FFT processing accuracy; 2 RES: spatial resolution; 3 PSLR: peak side lobe ratio; 4 ISLR: integrated side lobe ratio. Sensors 2019, 19, 3409 18 of 20 For partial 16-bit fixed-point imaging, in the azimuth direction, the PSLR and ISLR precision loss of the image are 0.3% and 0.8%, respectively; the RES precision loss is 0.2%. In the range direction, the PSLR and ISLR precision losses of the image are 0.2% and 0.2%, respectively; the RES precision loss is 0.7%. According to the actual scene and the point target image quantization analysis, as shown in Tables 8–10, the partial 16-bit fixed-point imaging accuracy is close to the single-precision floating-point imaging accuracy, which meets the requirements of on-orbit SAR imaging applications. 6. Conclusions This paper proposes a heterogeneous imaging processor using fixed-floating point heterogeneous parallel acceleration technology to perform SAR imaging in the aerospace field. The processor consists of two 18 × 16 heterogeneous arrays that provide 115.2 GOPS throughput. To improve energy efficiency, each array supports fixed-floating hybrid calculations to take full advantage of computing resources, which can increase the throughput of imaging processing by 1.82 times. At the same time, the PE array can be partitioned by rows through a sensible algorithm-to-hardware architecture mapping, process the imaging process in parallel, provide high-utilization hardware resources, and improve the efficiency by a factor of 1.5. A single processor requires 8 s and consumes 463 mW to process SAR raw data with a granularity of 16,384 × 16,384, which meets the limits real-time and power consumption of the on-orbit SAR imaging platform. The proposed solution also has good scalability, by extending the size of the processor array, the real-time requirements of larger-scale SAR imaging can be met. Author Contributions: Investigation, J.A.; Methodology, S.W.; Project administration, S.W. and S.Z.; Software, X.H. and J.A.; Writing—original draft, S.W.; Writing—review & editing, L.C. Funding: This research received no external funding. Conflicts of Interest: The authors declare no conflict of interest. References 1. 2. 3. 4. 5. 6. 7. 8. 9. Percivall, G.S.; Alameh, N.S.; Caumont, H.; Moe, K.L.; Evans, J.D. Improving Disaster Management Using Earth Observations—GEOSS and CEOS Activities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 1368–1375. [CrossRef] Joyce, K.E.; Belliss, S.E.; Samsonov, S.V.; McNeill, S.J.; Glassey, P.J. A review of the status of satellite remote sensing and image processing techniques for mapping natural hazards and disasters. Prog. Phys. Geog. 2009, 33, 183–207. [CrossRef] Tralli, D.M.; Blom, R.G.; Zlotnicki, V.; Donnellan, A.; Evans, D.L. Satellite remote sensing of earthquake, volcano, flood, landslide and coastal inundation hazards. ISPRS J. Photogramm. Remote Sens. 2005, 59, 185–198. [CrossRef] Gierull, C.H.; Vachon, P.W. Foreword to the Special Issue on Multichannel Space-Based SAR. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 4995–4997. [CrossRef] Radarsat Constellation Mission. Available online: http://directory.eoportal.org/web/eoportal/satellitemissions/r (accessed on 23 March 2017). Advanced Land Observing Satellite-2. Available online: https://directory.eoportal.org/web/eoportal/satellitemissions (accessed on 23 March 2017). TerraSAR-X Add-on for Digital Elevation Measurement. Available online: https://directory.eoportal.org/ web/eoportal/satellite-missions/t/tandem-x (accessed on 23 march 2017). Yang, L.; Zhao, L.; Zhou, S.; Guo, B. Sparsity-Driven SAR Imaging for Highly Maneuvering Ground Target by the Combination of Time-Frequency Analysis and Parametric Bayesian Learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 4, 1443–1455. [CrossRef] Zhu, S.; Liao, G.; Tao, H.; Yang, Z. Estimating Ambiguity-Free Motion Parameters of Ground Moving Targets from Dual-Channel SAR Sensors. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 3328–3349. [CrossRef] Sensors 2019, 19, 3409 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 19 of 20 Wai-Chi, F.; Jin, M.Y. On board processor development for NASA’s spaceborne imaging radar with VLSI system-on-chip technology. In Proceedings of the 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), Vancouver, BC, Canada, 23–26 May 2004; p; Volume 2, p. II-901-4. Raney, R.K.; Runge, H.; Bamler, R.; Cumming, I.G.; Wong, F.H. Precision SAR processing using chirp scaling. IEEE Trans. Geosci. Remote Sens. 1994, 32, 786–799. [CrossRef] Li, G.; Zhang, F.; Ma, L.; Hu, W.; Li, W. Accelerating SAR imaging using vector extension on multi-core SIMD CPU. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–1 July 2015. Peternier, A.; Boncori, J.P.M.; Pasquali, P. Near-real-time focusing of ENVISAT ASAR Stripmap and Sentinel-1 TOPS imagery exploiting OpenCL GPGPU technology. Remote Sens. Environ. 2017, 202, 45–53. [CrossRef] Lou, Y.; Clark, D.; Marks, P.; Muellerschoen, R.J.; Wang, C.C. Onboard Radar Processor Development for Rapid Response to Natural Hazards. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2770–2776. [CrossRef] Tang, H.; Li, G.; Zhang, F.; Hu, W.; Li, W. A Spaceborne SAR on-board processing simulator using mobile GPU. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016. Ge, B.; Chen, L.; An, D.; Zhou, Z. GPU-based FFBP algorithm for high-resolution spotlight SAR imaging. In Proceedings of the 2017 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Xiamen, China, 22–25 October 2017; pp. 1–5. Zhang, F.; Li, G.; Li, W.; Hu, W.; Hu, Y. Accelerating Spaceborne SAR Imaging Using Multiple CPU/GPU Deep Collaborative Computing. Sensors 2016, 16, 494. [CrossRef] [PubMed] Wu, Z.; Liu, Y.; Zhang, L.; Li, N.; Du, K.; Balz, T. Highly efficient synthetic aperture radar processing system for airborne sensors using CPU+GPU architecture. J. Appl. Remote Sens. 2015, 9, 097293. [CrossRef] Ye, J.; Shanqing, H.; Jiayun, Z.; Teng, L. Virtual single-node processing for SAR imaging based on multi-DSP. In Proceedings of the 2016 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Hong Kong, China, 5–8 August 2016. Zhijun, Y.; Xiangfei, N.; Wenyi, X.; Xiaowei, N.; Weiming, T. Real time imaging processing of ground-based SAR based on multicore DSP. In Proceedings of the 2017 IEEE International Conference on Imaging Systems and Techniques (IST), Beijing, China; 2017; pp. 1–5. Xingyu, X.; Abou-Khousa, M.A.; Al-Wahedi, K. Embedded Synthetic Aperture Radar Imaging System on Compact DSP Platform. In Proceedings of the 2017 International Conference on Electrical and Computing Technologies and Applications, Ras Al Khaimah, UAE, 21–23 November 2017. Langemeyer, S.; Kloos, H.; Simon-Klar, C.; Friebe, L.; Hinrichs, W.; Lieske, H.; Pirsch, P. A compact and flexible multi-DSP system for real-time SAR applications. In Proceedings of the IEEE International Geoscience & Remote Sensing Symposium, Anchorage, AL, USA, 20–24 September 2004. Desai, N.M.; Kumar, B.S.; Sharma, R.K.; Kunal, A.; Gameti, R.B.; Gujraty, V.R. Near Real Time SAR Processors for ISRO’s Multi-Mode RISAT-I and DMSAR. In Proceedings of the 7th European Conference on Synthetic Aperture Radar, Friedrichshafen, Germany, 2–5 June 2008; pp. 1–4. Di, W.; Chen, C.; Liu, Y. FPGA-Based Multi-core Reconfigurable System for SAR Imaging. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 2018, 22–27 July; pp. 8921–8924. Li, B.; Li, C.; Xie, Y.; Chen, L.; Shi, H.; Deng, Y. A SoPC based Fixed Point System for Spaceborne SAR Real-Time Imaging Processing. In Proceedings of the 2018 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA; 2018; pp. 1–6. Bierens, L.; Vollmuller, B.J. On-board Payload Data Processor (OPDP) and its application in advanced multi-mode, multi-spectral and interferometric satellite SAR instruments. In Proceedings of the EUSAR 2012, 9th European Conference on Synthetic Aperture Radar, Nuremberg, Germany, 23–26 April 2012. Le, C.; Chan, S.; Cheng, F.; Fang, W.; Fischman, M.; Hensley, S.; Johnson, R.; Jourdan, M.; Marina, M.; Parham, B.; et al. Onboard FPGA-based SAR processing for future spaceborne systems. In Proceedings of the IEEE 2004 Radar Conference, Philadelphia, PA, USA, 29–29 April 2004; pp. 15–20. Fang, W.C.; Le, C.; Taft, S. On-board fault-tolerant SAR processor for spaceborne imaging radar systems. In Proceedings of the 2005 IEEE International Symposium on Circuits and Systems, Kobe, Japan, 23–26 May 2005; pp. 420–423. Sensors 2019, 19, 3409 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 20 of 20 Greco, J.; Cieslewski, G.; Jacobs, A.; Troxel, I.A.; George, A.D. Hardware/software Interface for High-performance Space Computing with FPGA Coprocessors. In Proceedings of the 2006 IEEE Aerospace Conference, Big Sky, MT, USA, 4–11 March 2006; pp. 1–10. Pfitzner, M.; Cholewa, F.; Pirsch, P.; Blume, H. FPGA based Architecture for real-time SAR processing with integrated Motion Compensation. In Proceedings of the 2013 Asia-Pacific Conference on Synthetic Aperture Radar (Apsar), Tsukuba, Japan, 23–27 September 2013; pp. 521–524. Bi, D.; Xie, Y.; Li, X.; Zheng, Y.R. Efficient 2-D synthetic aperture radar image reconstruction from compressed sampling using a parallel operator splitting structure. Digit. Signal Process. 2016, 50, 171–179. [CrossRef] Long, T.; Yang, Z.; Li, B.; Chen, L.; Ding, Z.; Chen, H.; Xie, Y. A Multi-mode SAR Imaging Chip based on a Dynamically Reconfigurable SoC Architecture Consisting of Dual-operation Engines and Multilayer Switching Network. Preprints 2018. [CrossRef] Song, W.S.; Baranoski, E.J.; Martinez, D.R. One trillion operations per second on-board VLSI signal processor for Discoverer II space based radar. In Proceedings of the Aerospace Conference Proceedings, Big Sky, MT, USA, 25–25 March 2000. Chen, Q.; Yu, A.; Sun, Z.; Huang, H. A multi-mode space-borne SAR simulator based on SBRAS. In Proceedings of the 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, Germany, 22–27 July 2012; pp. 4567–4570. Stangl, M.; Werninghaus, R.; Schweizer, B.; Fischer, C.; Brandfass, M.; Mittermayer, J.; Breit, H. TerraSAR-X technologies and first results. IEE Proc. Sonar Navig. 2006, 153, 86–95. [CrossRef] Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301. [CrossRef] Yang, C.; Li, B.; Chen, L.; Wei, C.; Xie, Y.; Chen, H.; Yu, W. A Spaceborne Synthetic Aperture Radar Partial Fixed-Point Imaging System Using a Field-Programmable Gate Array—Application-Specific Integrated Circuit Hybrid Heterogeneous Parallel Acceleration Technique. Sensors 2017, 17, 1493. [CrossRef] [PubMed] Cumming, I.G.; Wong, F.H. Digital Processing of Synthetic Aperture Radar Data: Algorithms and Implementation; Artech house: Norwood, MA, USA, 2005. Hu, A.; Zhang, R.; Yin, D.; Chen, Y. Perceptual quality assessment of SAR image compression. Int. J. Remote Sens. 2013, 34, 8764–8788. [CrossRef] Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [CrossRef] [PubMed] Márquez-Martínez, J.; Mittermayer, J.; Rodríguez-Cassolà, M. Radiometric resolution optimization for future SAR systems. In Proceedings of the IEEE International Geoscience & Remote Sensing Symposium, Anchorage, AL, USA, 20–24 September 2004. Tell, B.R.; Laur, H. Phase preservation in SAR processing: The interferometric offset test. In Proceedings of the ‘Remote Sensing for a Sustainable Future’, International Geoscience and Remote Sensing Symposium, Lincoln, NE, USA, 27–31 May 1996; Volume 1, pp. 477–480. Zhang, H.; Li, Y.; Su, Y. SAR image quality assessment using coherent correlation function. In Proceedings of the 2012 5th International Congress on Image and Signal Processing, Chongqin, China, 16–18 October 2012. © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).