An FPGA-based Hardware/Software Design using Binarized Neural Networks for Agricultural Applications: A Case Study

CHUN-HSIAN HUANG (Member, IEEE), National Taitung University, Taiwan
Corresponding author: Chun-Hsian Huang (e-mail: huangch@nttu.edu.tw).
This work was supported in part by the Research Project MOST 107-2221-E-143-002-MY3, Ministry of Science and Technology, Taiwan.
Digital Object Identifier 10.1109/ACCESS.2021.3058110

ABSTRACT This work presents an FPGA-based hardware/software design that helps an agricultural robot intelligently decide whether biological agents need to be applied to the target crops. For target crop recognition, the global positioning phase integrates selective search with a thresholding scheme to reduce the number of regions of interest (ROIs) in a captured image, while the local recognition phase uses a binarized neural network (BNN) architecture to recognize the target crop. Furthermore, an estimation method for pest and disease severity is also presented. Experiments show that integrating the presented BNN architecture requires only a few extra resources (less than 17% of the available FPGA resources of a Xilinx Zynq UltraScale+ MPSoC ZU3EG A484), compared to an existing BNN architecture, while the top-1 and top-5 accuracy rates are increased by 32.25% to 32.84% and by 14.99% to 15.17%, respectively. Furthermore, when the presented BNN architecture was also implemented on an ARM Cortex-A53 CPU and an NVIDIA GeForce RTX 2080 GPU, our BNN hardware module on the FPGA accelerated the frames per second (FPS) by factors of 3,690.18 and 1.07, respectively.

INDEX TERMS FPGA, binarized neural networks, object detection, agriculture

I. INTRODUCTION
In the early 1980s, agricultural development gradually began to integrate with computer science to support automatic management [1]. Currently, emerging techniques such as the Internet of Things (IoT) make agricultural management even more efficient. For example, different sensors can be deployed on farms to monitor the growth environments of crops and their health statuses. Using the collected sensor data, farmers and companies can extract valuable information to improve crop productivity. Furthermore, with the popularity of robotics, the efficiency of agricultural management can be enhanced significantly.

For agricultural applications, protecting the target crops from pests and diseases is a crucial task. In this work, dragon fruits are adopted as the target crops, and we aim to enable an agricultural robot to intelligently decide whether biological agents need to be applied to them. The main motivation of this work is therefore to implement an accurate, real-time robot vision system that helps the agricultural robot detect the target crop and analyze its health status.

Target crop recognition is a typical application of object detection. Currently, the convolutional neural network (CNN) [2] is the most representative deep learning model. A CNN consists of an input layer, multiple hidden layers, and an output layer, where the hidden layers include a series of convolutional layers that convolve their inputs with learned kernels through multiplications or dot products.
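To make the convolution operation concrete, the following minimal NumPy sketch (our illustration, not from the paper) shows how a single 3 × 3 kernel slides over an input feature map and computes a dot product at each position:

```python
import numpy as np

def conv2d_single(x, k):
    """Valid 2-D convolution of one channel x with one small kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output pixel is a dot product between the kernel
            # and the local image patch it currently covers.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.random.rand(64, 64)          # one input channel
k = np.random.randn(3, 3)           # one learned 3x3 kernel
print(conv2d_single(x, k).shape)    # (62, 62) with valid padding
```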
As a result, CNNs are very computing-intensive, so they are usually implemented on powerful platforms such as cloud servers. This also means that captured images must be transferred to such platforms through a communication network for object detection. However, this usually leads to high data-transfer latencies over the Internet, so real-time detection cannot be achieved. Fortunately, in recent years a new alternative called edge computing [3] has been proposed to bring computation and data closer to the source when required. For an intelligent robot, the concept of edge computing makes the real-time requirement achievable. However, implementing computing-intensive CNNs always incurs a large amount of power consumption [4] and memory access [5]. This is a big challenge, especially for energy-constrained edge devices.

To realize edge artificial intelligence (AI), quantized neural networks (QNNs) [6] have been proposed to reduce the network size; their lower-precision parameters reduce both the required storage space and the bit-width of the processing elements. A binarized neural network (BNN) is a typical design that restricts weights and activations to 1-bit values [7]. Such refined computations increase system throughput and performance, which also enables low-power deep learning applications [8]. However, implementing neural networks on computing architectures such as CPUs, GPUs, and application-specific integrated circuits (ASICs) usually incurs a vast amount of power consumption [4]. Recently, accelerating CNNs using FPGAs has become a new alternative, due to their ability to maximize parallelism and energy efficiency [9]. Furthermore, by taking advantage of reconfiguration, different CNN parameters or topologies can be easily reconfigured in the FPGA for evaluation.

Based on the above discussion, to achieve the goal of this work, that is, making the agricultural robot able to intelligently decide whether biological agents need to be applied to the target crops, we propose an FPGA-based hardware/software design using BNNs. In this design, a BNN architecture is presented to recognize the target crops. By considering the environment of the real scene and the features of the target crops, a thresholding scheme is used to reduce the number of regions of interest (ROIs) in a captured image, and an estimation method for pest and disease severity is also presented. Through the assistance of the proposed design, the agricultural robot can detect the target crops and decide whether biological agents need to be applied to protect them from pests and diseases.

The rest of this paper is organized as follows. Section II describes the preliminary work, while Section III introduces our estimation method of pest and disease severity. The FPGA-based hardware/software design is described in Section IV, and the system evaluation is given in Section V. Finally, Section VI concludes this work.
II. PRELIMINARY WORK
To enable the agricultural robot to intelligently decide whether biological agents need to be applied to the target crops, the target crops must be detected accurately so that the pest and disease severity can then be estimated. In the following sections, we introduce the state-of-the-art work on crop detection and estimation of pest and disease severity, as well as existing BNN architectures and applications.

A. CROP DETECTION AND ESTIMATION OF PEST AND DISEASE SEVERITY
To increase agricultural productivity, crop pest and disease recognition is a crucial task. Traditionally, this task mainly relies on agricultural experts to diagnose the health status of crops. Recently, with the rapid progress of technology, the use of image processing techniques for crop pest and disease recognition has become a hot research issue. To estimate pest and disease severity, calculating the size of a deformed or discoloured area relative to the whole crop is a typical method. Zaw et al. [10] adopted k-means clustering to select the defected area in the segmentation phase; a support vector machine (SVM) was then used to estimate the leaf disease. Dhingra et al. [11] presented a segmentation technique based on neutrosophic logic, where feature subsets computed over the segmented regions were used to detect whether a plant leaf is diseased or not. Bierman et al. [12] trained both SVM and artificial neural network (ANN) classifiers on 18 color and texture features and fused the results of the two classifiers. Their experiments showed a recognition accuracy of 100% for both downy and powdery mildew using the ensemble classifier. The above methods [10]–[12] provide highly accurate crop pest and disease recognition; however, they were not applied to a real growth environment.

To detect target objects, especially in more complex real-time image recognition tasks, CNN-based inference designs have recently become the mainstream solution [13]. Zhang et al. [14] presented intelligent fruit detection based on a multi-task cascaded convolutional network, by using which an automated robot can work in real time with high accuracy. An image fusion procedure was also presented to improve the performance of the detector. Experiments were performed on a personal computer equipped with an NVIDIA GeForce GTX 1060 graphics card, and the results showed that the proposed detector performed excellently in terms of both accuracy and time cost. Yu et al. [15] presented a fruit pose estimator called rotated YOLO (R-YOLO), which improved the localization precision of the picking points. Their design was implemented on an embedded control platform, the NVIDIA Jetson TX2, for inference. Experiments showed that the method provided better performance in terms of real-time detection and localization accuracy of the picking points. Based on the above research work [14], [15], detecting the target crops in a real environment requires both a powerful embedded computing architecture and a refined CNN model.

B. BNN DESIGNS
Object detection with a neural network is a computing-intensive task, which poses a challenge for the hardware architecture used. Most existing CNNs incur a large amount of power consumption. Thus, QNNs such as BNNs [7] were proposed and applied to embedded and edge computing environments. In a BNN design, the weights and activations are constrained to either +1 or −1, and the activation function depicted in Equation (1) is applied to all BNN layers:

$\mathrm{Sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$    (1)

Because of the reduction in memory and computational demands, algorithm efficiency can thus be improved.
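As a concrete illustration of why binarization is hardware-friendly, the following NumPy sketch (ours, not from the cited works) emulates a binarized dot product: with weights and activations in {−1, +1} encoded as single bits, the multiply-accumulate reduces to an XNOR followed by a popcount:

```python
import numpy as np

def sign(x):
    """Binarization in Equation (1): +1 for x >= 0, -1 otherwise."""
    return np.where(x >= 0, 1, -1)

def binary_dot(a_bits, w_bits):
    """Dot product of two {-1,+1} vectors stored as {0,1} bits.

    With b in {0,1} encoding v = 2*b - 1, the element-wise product is
    +1 when the bits match and -1 otherwise, so the dot product equals
    2 * popcount(XNOR) - n.
    """
    n = a_bits.size
    xnor = ~(a_bits ^ w_bits) & 1        # 1 where the bits agree
    return 2 * int(xnor.sum()) - n

rng = np.random.default_rng(0)
a = sign(rng.standard_normal(256))       # binarized activations
w = sign(rng.standard_normal(256))       # binarized weights
a_bits = (a > 0).astype(np.uint8)
w_bits = (w > 0).astype(np.uint8)
assert binary_dot(a_bits, w_bits) == int(a @ w)  # matches the real dot product
```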
Nurvitadhi et al. [16] compared the applicability of BNNs on different hardware computing architectures, namely CPU, GPU, ASIC, and FPGA. They implemented a BNN accelerator on an Arria 10 FPGA as well as a 14-nm ASIC, and compared them against optimized software on a Xeon server CPU, an Nvidia Titan X server GPU, and an Nvidia TX1 mobile GPU. Experiments showed that the FPGA provided superior efficiency over the CPU and GPU, and that, compared to a fixed ASIC solution, the FPGA can provide orders-of-magnitude efficiency improvements over software. Zhao et al. [17] presented a BNN accelerator synthesized from C++ to an FPGA-targeted design; their experiments demonstrated the energy and resource efficiency of the FPGA-based CNN accelerator. By using FINN [18], Fraser et al. [8] implemented a large BNN on an ADM-PCIE-8K5 FPGA platform, which could classify CIFAR-10 images at 88.7% accuracy. Moss et al. [19] presented a BNN accelerator implemented on the Intel Xeon+FPGA platform, where a specialized FPGA architecture handled the most computing-intensive parts while the remaining parts were handled by the Xeon CPU. Experiments showed the design could provide comparable performance and better energy efficiency than a Titan X GPU card. To further apply BNNs to real applications, Jokic et al. [20] proposed an FPGA-based 20 kfps streaming camera system called BinaryEye, which could classify regions of interest within a frame in real-time streaming mode. For maritime/sea border security and surveillance applications, Hashimoto et al. [21] adopted BNNs for ship classification from Synthetic Aperture Radar (SAR) images; their experiments showed that the FPGA-based design could classify whether an object is a ship with accuracy equivalent to a GPU.

Based on the above research work [8], [16], [17], [19]–[21], we observe that integrating resource-efficient BNNs and energy-efficient FPGAs into a robot vision system is an ideal solution. Therefore, in this work, we adopt a BNN and implement it on the FPGA device to recognize our target crops. However, due to binarization, the accuracies of existing BNNs are reduced compared to full-precision CNNs [22]. To enhance the recognition accuracy of target crops using a BNN in a real scene, a thresholding scheme is also presented in this work; details are introduced in Section III.
III. ESTIMATION FLOW OF PEST AND DISEASE SEVERITY
To intelligently decide whether biological agents need to be applied to the target crops, the estimation of pest and disease severity is the core part of the robot vision system. The proposed flow contains two phases, namely target crop detection and estimation of pest and disease severity, as shown in Figure 1.

FIGURE 1. Estimation flow of pest and disease severity

A. TARGET CROP DETECTION
To accurately detect the target crop, the overall process is divided into global positioning and local recognition.

1) Global Positioning
The size of the captured image is much greater than that of the images used for target crop recognition. This means the captured image must first be split into many segmenting images before they are transferred to the BNN. Instead of using sliding windows, the presented global positioning consists of selective search [23] and a thresholding scheme. Selective search is based on a hierarchical grouping algorithm [23], where regions in an image are merged according to five similarity measures, namely color similarity, texture similarity, size similarity, shape similarity/compatibility, and a final meta-similarity measure.

• Color similarity: All the values of each color channel in each region are calculated, and the histogram of each color channel is represented by 25 bins. The resulting 75 bins (25 for each of the red, green, and blue channels) are combined into a vector. Color similarity measures the histogram intersection distance between two regions.
• Texture similarity: Eight Gaussian derivatives of the image are created and used to extract a 10-bin histogram for each color channel, so a 10 × 8 × 3 dimensional vector is generated for each region. To compute the texture similarity between two regions, histogram intersection is also used.
• Size similarity: Smaller regions are merged earlier rather than later. This ensures that region proposals at all scales are formed in all parts of the image, and it prevents a single large cluster from swallowing up all smaller regions.
• Shape similarity/compatibility: Two regions are compatible when they fit well into each other. When two regions do not even touch each other, they are not merged.
• Final meta-similarity: The final meta-similarity between two regions is a linear combination of the aforementioned four similarities.

Through the hierarchical grouping algorithm, the ROIs are extracted. When the color features of the target crops are taken into consideration, not all the ROIs need to be transferred to the BNN for target crop recognition. A thresholding scheme, given in Algorithm 1, is therefore presented to reduce the number of ROIs. As depicted in Equation (2), the original ROIs of a captured image ($ROI_{ori}$) are represented as m + 1 segmenting images:

$ROI_{ori} = \{I_0, I_1, \ldots, I_m\}$    (2)

Algorithm 1 Global positioning - thresholding
1: for i ← 0 to m do
2:   #green_i ← Calc(I_i)
3:   ratio_i ← #green_i / #total_i
4:   if ratio_i > thres then
5:     Add(I_i, ROI_local)
6:   end if
7: end for

For each segmenting image I_i in ROI_ori, its number of green pixels (#green_i) is calculated first. Here, based on the real crops, a pixel is defined as a green pixel when its color falls within a specific color range. When the ratio (ratio_i) of the number of green pixels to the number of total pixels (#total_i) in a segmenting image I_i is greater than a predefined threshold (thres), the segmenting image I_i is included in ROI_local for the local recognition phase.
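A minimal OpenCV-based sketch of this global positioning phase is shown below (our illustration, not the authors' code). It assumes the opencv-contrib-python package for selective search, and the green-pixel bounds are a hypothetical HSV interval standing in for the paper's unspecified color range:

```python
import cv2
import numpy as np

# Hypothetical HSV bounds for a "green pixel"; the paper defines the
# range empirically from the real crops but does not list the values.
GREEN_LO = np.array([35, 40, 40])
GREEN_HI = np.array([85, 255, 255])
THRES = 0.3  # predefined threshold `thres` (illustrative value)

def global_positioning(img):
    """Selective search followed by green-ratio thresholding (Algorithm 1)."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()
    rois_local = []
    for (x, y, w, h) in ss.process():              # ROI_ori = {I_0, ..., I_m}
        patch = img[y:y + h, x:x + w]
        hsv = cv2.cvtColor(patch, cv2.COLOR_BGR2HSV)
        green = cv2.inRange(hsv, GREEN_LO, GREEN_HI)
        ratio = np.count_nonzero(green) / green.size   # #green_i / #total_i
        if ratio > THRES:                          # keep only green-enough ROIs
            rois_local.append((x, y, w, h))
    return rois_local
```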
2) Local Recognition
To increase the recognition accuracy of target crops, in the local recognition phase, we also present a BNN architecture, as shown in Figure 2. Its input is a 64 × 64 8-bit RGB image, and it outputs a 16-bit result; ten categories can be recognized. The BNN architecture contains seven convolutional operations, three maximum pooling operations, and three fully-connected operations. The kernel sizes of the convolutional operations and the maximum pooling operations are 3 × 3 and 2 × 2, respectively. Same padding is used in the first three convolutional operations, while valid padding is used in the last four. Through the convolutional and maximum pooling operations, the sizes of the feature maps are 32 × 32, 16 × 16, 14 × 14, 7 × 7, 5 × 5, 3 × 3, and 1 × 1, and the depths of the ten layers are 32, 64, 128, 128, 256, 256, 256, 512, 512, and 10, respectively.

FIGURE 2. Proposed BNN architecture
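The following PyTorch sketch reconstructs a full-precision stand-in for this topology from the feature-map sizes and depths stated above; the pooling positions are our inference from those sizes, and the binarization of weights and activations is omitted for brevity:

```python
import torch
import torch.nn as nn

# Full-precision stand-in for the presented BNN topology. Pooling
# positions are inferred from the stated feature-map sizes (32x32,
# 16x16, 14x14, 7x7, 5x5, 3x3, 1x1); in the actual design, weights
# and activations are additionally binarized with Sign(x).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.MaxPool2d(2),    # 64 -> 32
    nn.Conv2d(32, 64, 3, padding=1), nn.MaxPool2d(2),   # 32 -> 16
    nn.Conv2d(64, 128, 3, padding=1),                   # 16 -> 16
    nn.Conv2d(128, 128, 3), nn.MaxPool2d(2),            # 16 -> 14 -> 7
    nn.Conv2d(128, 256, 3),                             # 7 -> 5
    nn.Conv2d(256, 256, 3),                             # 5 -> 3
    nn.Conv2d(256, 256, 3),                             # 3 -> 1
    nn.Flatten(),
    nn.Linear(256, 512), nn.Linear(512, 512), nn.Linear(512, 10),
)
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 10])
```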
When the global positioning phase finishes, each segmenting image in ROI_local is resized to a 64 × 64 8-bit RGB image and then transferred to the BNN for target crop recognition. If the target crop is recognized in a segmenting image, this segmenting image is used in the estimation of pest and disease severity.

B. ESTIMATION OF PEST AND DISEASE SEVERITY
This phase estimates the pest and disease severity of the target crop, so that it can be decided whether the biological agent should be applied. By referring to the experience of agricultural experts, our target crops, that is, dragon fruits, can be divided into five levels of pest and disease severity, as shown in Figure 3, where a larger level value indicates a greater pest and disease severity.

FIGURE 3. Five levels of pest and disease severity: (a) Level 0, (b) Level 1, (c) Level 2, (d) Level 3, (e) Level 4

As described in Section II-A, the estimation of the five levels of pest and disease severity is mainly based on the size of the deformed or discoloured area relative to the whole crop. The method used in this work is given in Algorithm 2. Here, Img_extract represents a segmenting image in which the target crop is recognized. To distinguish the area of pests and diseases from the target crop, based on color characteristics, two masks, namely mask_crop and mask_severity, are used to remove the background area (Line 2). By performing edge detection (Line 3), the target crop and its area of pests and diseases are contoured. Next, the pixel counts of the target crop (pixel_crop) and of the area of pests and diseases (pixel_severity) are individually calculated (Line 4). The ratio of pixel_severity to pixel_crop is then used to estimate the level of pest and disease severity (Level0 ∼ Level4). When the level of pest and disease severity is greater than Level2, the biological agents are applied to the target crop.

Algorithm 2 Estimation of pest and disease severity
1: function PixelCount(Img, mask)
2:   Img_masking ← Masking(Img, mask)
3:   Img_edge ← EdgeDetect(Img_masking)
4:   count ← ContourArea(Img_edge)
5:   return count
6: end function
7: pixel_crop ← PixelCount(Img_extract, mask_crop)
8: pixel_severity ← PixelCount(Img_extract, mask_severity)
9: ratio ← pixel_severity / pixel_crop
10: if ratio ≥ 80% then
11:   result ← Level4
12: else if ratio ≥ 60% then
13:   result ← Level3
14: else if ratio ≥ 40% then
15:   result ← Level2
16: else if ratio ≥ 20% then
17:   result ← Level1
18: else
19:   result ← Level0
20: end if

Therefore, through the target crop detection described in Section III-A and the estimation of pest and disease severity described in Section III-B, the agricultural robot can intelligently decide whether biological agents need to be applied to the target crops.
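A compact OpenCV rendering of Algorithm 2 might look as follows (our sketch; the two HSV ranges are hypothetical placeholders for the paper's color-based mask_crop and mask_severity):

```python
import cv2
import numpy as np

def pixel_count(img_bgr, lo, hi):
    """Masking -> edge detection -> contoured area (Algorithm 2, lines 1-6)."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lo, hi)            # Masking(Img, mask)
    edges = cv2.Canny(mask, 100, 200)          # EdgeDetect(...)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return sum(cv2.contourArea(c) for c in contours)   # ContourArea(...)

def severity_level(img_bgr):
    # Hypothetical HSV ranges standing in for mask_crop / mask_severity.
    crop = pixel_count(img_bgr,
                       np.array([35, 40, 40]), np.array([85, 255, 255]))
    severity = pixel_count(img_bgr,
                           np.array([10, 40, 40]), np.array([30, 255, 255]))
    ratio = severity / crop if crop else 0.0
    for level, bound in ((4, 0.8), (3, 0.6), (2, 0.4), (1, 0.2)):
        if ratio >= bound:
            return level    # biological agents applied when level > 2
    return 0
```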
IV. FPGA-BASED HARDWARE/SOFTWARE DESIGN
The full system architecture design is shown in Figure 4. A Linux operating system (OS) runs on the microprocessor. A USB controller connects to a camera for capturing real-time images, and a UART controller connects to the motor control board for controlling the movement of the agricultural robot.

FIGURE 4. FPGA-based system architecture

The design flow for implementing the FPGA-based system architecture is illustrated in Figure 5. Using the deep learning framework, the presented BNN model is trained on the target crop images (training data). When the training phase finishes, the BNN parameters and topology are extracted to generate the corresponding BNN hardware module. Instead of implementing the corresponding hardware description language (HDL) codes directly, a high-level synthesizer (HLS) is used in this work to translate high-level languages such as Python and C++ into HDL codes such as Verilog. This makes the BNN hardware module easily replaceable without rewriting HDL codes and allows quick refinement of the BNN architecture. Finally, by using the FPGA design tool, the BNN hardware module is generated and configured in the programmable logic, while the application programs, including selective search, thresholding, detection, and estimation, run together with the OS on the microprocessor.

FIGURE 5. Design flow for implementing the FPGA-based system architecture

V. SYSTEM EVALUATION
To evaluate the proposed method, the AVNET Ultra96-V2 platform containing a Xilinx Zynq UltraScale+ MPSoC ZU3EG A484 device was adopted to implement the FPGA-based hardware/software design. Figure 6 shows the agricultural robot; the AVNET Ultra96-V2 platform is housed in the main control box. The agricultural robot was customized for this work rather than being an existing product. The robot contains a robotic arm equipped with a camera to capture images of the target crops and a nozzle to apply the biological agents to the crops. Furthermore, a motor control board in the main control box is responsible for controlling the movement of the robotic body and the robotic arm. Note that this work focuses on the robot vision system, and Figure 6 mainly shows where the robot vision system performs; the mechanical and electrical engineering of the robotic body and arm is not addressed here.

FIGURE 6. Agricultural robot

A deep learning framework called Theano [24] was used for training the presented BNN. It was executed on a server equipped with an Intel Core i7-8700 CPU, 64 GB of RAM, and a graphics card containing an NVIDIA GeForce RTX 2080 GPU. Figure 7 shows the real scene of our greenhouse, where one hundred target crops, i.e., dragon fruits, were planted. As shown in Figure 8, images of the different levels of pest and disease severity were captured by the cameras for training. Furthermore, the dragon fruits in the data collection platform were changed every day to ensure the diversity of the captured images.

FIGURE 7. Greenhouse

FIGURE 8. Image capture of target crops

In the proposed system, the thresholding scheme introduced in Algorithm 1 was used, based on the ratio of the number of green pixels to the number of total pixels in a target crop. In the real scene shown in Figure 7, besides the target crops, most of the remaining objects are filtered out by the thresholding scheme. Therefore, the accuracy of target crop recognition in our real scene is close to 100%. To evaluate the presented BNN completely and objectively, the ImageNet dataset [25] was used in this experiment: besides the target crops, the remaining images were obtained from ImageNet [25]. Our training data contained 5,000 images of the target crops and 45,000 images covering nine other categories of objects (5,000 images per category): acorn, banana, bell pepper, cauliflower, spider, ladybug, lemon, mushroom, and orange. Furthermore, the number of training epochs was set to 500, and batch normalization [26] was applied to the BNN training for acceleration.
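As an illustration of this dataset setup, the sketch below assembles a ten-class training set of 64 × 64 RGB images from per-class folders; the directory layout and helper names are hypothetical, while the class list and image size follow the text:

```python
from pathlib import Path
import numpy as np
from PIL import Image

# Ten classes: the target crop plus the nine ImageNet categories named
# in the text. The "dataset/<class>/*.jpg" layout is a hypothetical example.
CLASSES = ["dragon_fruit", "acorn", "banana", "bell_pepper", "cauliflower",
           "spider", "ladybug", "lemon", "mushroom", "orange"]

def load_dataset(root="dataset"):
    images, labels = [], []
    for label, name in enumerate(CLASSES):
        for path in sorted(Path(root, name).glob("*.jpg")):
            img = Image.open(path).convert("RGB").resize((64, 64))
            images.append(np.asarray(img, dtype=np.uint8))
            labels.append(label)
    return np.stack(images), np.array(labels)

X, y = load_dataset()   # e.g., 5,000 images per class -> 50,000 in total
```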
We adopted the FINN framework [18] to generate the corresponding BNN hardware module. The generated BNN module was based on a Xilinx Vivado HLS project; we extracted the compiled HDL results and integrated them into the system design. In the BNN hardware module, all the parameter and topology information obtained in the training phase is stored in a parameter memory bank and loaded into the corresponding layers to customize the BNN architecture. According to our implementation, the proposed design can operate at up to 100.04 MHz.

A real case of the target crop detection and estimation of pest and disease severity is shown in Figure 9. Through the global positioning, several ROIs were labelled in the image. Next, through the local recognition, the segmenting image containing the target crop was extracted for the calculation of the ratio of pixel_severity to pixel_crop. Finally, based on Algorithm 2, the estimation result was obtained. To further analyze the proposed design, an existing BNN architecture [8] used by Jokic et al. [20] and Hashimoto et al. [21] was also implemented for comparison. In the following sections, we discuss the BNN configurations, the recognition accuracy, and the system performance.

FIGURE 9. A real case of the target crop detection and the estimation of pest and disease severity (panels: Global Positioning, Local Recognition, Calculation of Ratio, Estimation)

A. BNN CONFIGURATIONS
The comparison between the configurations of the existing BNN architecture [8], [20], [21] and ours is given in Table 1. Compared to the existing BNN architecture, besides the same number of fully-connected layers (#FC), our presented BNN architecture contains more convolutional layers (#Conv.) and more maximum pooling layers (#MaxPool), and its input is a larger 64 × 64 RGB image. As a result, compared to the existing BNN architecture [8], [20], [21], ours has 36.77% more parameters (Params).

TABLE 1. Comparison on BNN configurations

                       Input Image   #Conv.   #MaxPool   #FC   Params (Mbits)
BNN [8], [20], [21]    32 × 32       6        2          3     1.55
Ours                   64 × 64       7        3          3     2.12

The comparison of resource usage, in terms of the available FPGA resources, when integrating the existing BNN architecture [8], [20], [21] and ours into the system is shown in Figure 10, and the corresponding power consumption is given in Table 2. Owing to the deeper and larger architecture, integrating the presented BNN architecture into the system is expected to require more resources and power than the existing one. According to the experiments, integrating our BNN architecture uses an extra 7.66% of the available LUTs, 4.73% of the available flip-flops, and 16.66% of the available BRAMs, while it results in a power increase of 6.46%. Compared to the increases in LUTs and flip-flops, the increase in BRAMs is more pronounced. This is because the presented BNN architecture contains more parameters, as shown in Table 1, so more BRAMs were required to implement the parameter memory bank of the BNN hardware module.

FIGURE 10. Comparison on resource usage

TABLE 2. Comparison on power consumption (W)

           BNN [8], [20], [21]   Ours
Static     0.324                 0.326
Dynamic    2.664                 2.856
Total      2.989                 3.182

B. RECOGNITION ACCURACY
In this experiment, three sets of test images, containing 400, 600, and 800 images, were used to test the accuracies of the system designs integrating the existing BNN architecture [8], [20], [21] and ours. As described in Section V, besides the target crops, the remaining categories of objects were also used for testing, and in each of the three test sets, every category contained the same number of images. Two evaluation metrics, namely top-1 and top-5 accuracy, were used in the experiment; they are defined in Definition 1 and Definition 2, respectively.

Definition 1: Top-1 accuracy is the conventional accuracy; the classification result having the highest probability must be exactly the expected result.

Definition 2: Top-5 accuracy indicates that the expected result must be one of the five classification results having the highest probabilities.
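These two metrics can be computed as follows (a small NumPy helper of our own, for illustration):

```python
import numpy as np

def top_k_accuracy(probs, labels, k=1):
    """Fraction of samples whose true label is among the k highest-
    probability predictions (k=1 gives top-1, k=5 gives top-5)."""
    topk = np.argsort(probs, axis=1)[:, -k:]          # k best classes per row
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

probs = np.random.rand(800, 10)            # e.g., 800 test images, 10 classes
probs /= probs.sum(axis=1, keepdims=True)
labels = np.random.randint(0, 10, size=800)
print(top_k_accuracy(probs, labels, k=1), top_k_accuracy(probs, labels, k=5))
```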
For the existing BNN architecture [8], [20], [21] and ours, the experimental results in terms of top-1 and top-5 accuracy rates are shown in Figure 11. According to the experiments, compared to the existing BNN architecture [8], [20], [21], the top-1 accuracy rate using ours is increased by 32.25% to 32.84%, while the top-5 one is increased by 14.99% to 15.17%. As shown in Table 1, compared to the existing BNN architecture [8], [20], [21], our presented BNN architecture contains more layers and accepts a larger input image; as a result, more key features of the target crops could be extracted in the training phase, which increases the recognition accuracy. Furthermore, compared to the existing BNN architecture [8], [20], [21], integrating our BNN hardware module into the system requires only a few extra resources (less than 17% of the available FPGA resources), as described in Section V-A, while the recognition accuracy is increased significantly.

FIGURE 11. Top-1 and top-5 accuracy rates

C. SYSTEM PERFORMANCE
Based on the flow of target crop detection and estimation of pest and disease severity shown in Figure 1, given a time of $T_{select}$ microseconds for selective search and thresholding, a time of $T_{recognize}$ microseconds for target crop recognition using the BNN, and a time of $T_{estimate}$ microseconds for estimation of pest and disease severity, the total processing time $T_{total}$ is as depicted in Equation (3):

$T_{total} = T_{select} + T_{recognize} + T_{estimate}$    (3)

As shown in Figure 5, the most computing-intensive part, that is, the BNN computation, is implemented as a hardware module for acceleration. The remaining parts, including the selective search and thresholding ($T_{select}$) and the estimation of pest and disease severity ($T_{estimate}$), are executed on the microprocessor. In this experiment, we therefore focus on comparing the recognition times ($T_{recognize}$) on different computing architectures: the ARM Cortex-A53 CPU, the NVIDIA GeForce RTX 2080 GPU, the existing BNN architecture [8], [20], [21] on the FPGA, and ours on the FPGA. The set of 800 images, as introduced in Section V-B, was used for evaluation.
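As a quick sanity check, the FPS figures in the upcoming Table 3 are simply the image count divided by the measured processing time:

```python
# FPS = images / total time; times below are the Table 3 measurements in µs.
times_us = {"ARM Cortex-A53 (CPU)": 564_248_000,
            "NVIDIA GeForce RTX 2080 (GPU)": 163_337,
            "BNN [8], [20], [21] (FPGA)": 88_011,
            "Ours (FPGA)": 152_670}
for arch, t in times_us.items():
    print(f"{arch}: {800 / (t * 1e-6):,.2f} FPS")   # e.g., Ours -> 5,240.06
```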
The experimental results, including the required processing time and the frames per second (FPS), are given in Table 3. Here, besides the existing BNN architecture [8], [20], [21], the presented BNN architecture, as shown in Figure 2, was also implemented on the ARM Cortex-A53 CPU and the NVIDIA GeForce RTX 2080 GPU for comparison. Note that NVIDIA TensorRT was not used in the GPU implementation.

TABLE 3. System performance

Computing Architecture           Time (µs)      FPS
ARM Cortex-A53 (CPU)             564,248,000    1.42
NVIDIA GeForce RTX 2080 (GPU)    163,337        4,897.85
BNN [8], [20], [21] (FPGA)       88,011         9,089.77
Ours (FPGA)                      152,670        5,240.06

According to the experiments, the FPS using the existing BNN architecture [8], [20], [21] is the best among the four computing architectures. However, as shown in Table 1, the existing BNN architecture [8], [20], [21] receives smaller input images and contains fewer layers than our presented BNN architecture, so its processing time can be expected to be shorter, and its FPS higher, than those of the other three computing architectures for the same number of images. However, as shown in Figure 11, the top-1 accuracy rates using the existing BNN architecture [8], [20], [21] are all less than 10%; such a low accuracy is not acceptable in this work. When the same BNN architecture with higher accuracy, as shown in Figure 2, was implemented on the ARM Cortex-A53 CPU and the NVIDIA GeForce RTX 2080 GPU, our BNN hardware module on the FPGA accelerated the FPS by factors of 3,690.18 and 1.07, respectively. Note that the NVIDIA GeForce RTX 2080 is a very powerful GPU mainly used in servers rather than in embedded computing systems. Therefore, this experiment demonstrates that, when recognition accuracy is taken into consideration, the proposed design provides higher system performance and satisfies real-time requirements.

VI. CONCLUSIONS
To enable the agricultural robot to intelligently decide whether biological agents need to be applied to the target crops, this work proposes an FPGA-based hardware/software design using BNNs. In the target crop detection, a thresholding scheme is used to reduce the number of ROIs in a captured image, while a BNN architecture is presented to help recognize the target crop. Furthermore, an estimation method of pest and disease severity is also presented. Experiments show that, although integrating the presented BNN architecture requires a few extra FPGA resources, the recognition accuracy is increased significantly while a high FPS is still achieved. However, in a real environment, the target crop detection is easily affected by lighting, capture angle, and background objects. In the future, besides RGB images, infrared and depth images will be used in our design to enhance the recognition accuracy, and different QNN architectures will also be discussed and tested.

REFERENCES
[1] S. C. Borgelt, J. D. Harrison, K. A. Sudduth, and S. J. Birrell, "Evaluation of GPS for applications in precision agriculture," Applied Engineering in Agriculture, vol. 12, no. 6, pp. 633–638, 1996.
[2] Z. Zhao, P. Zheng, S. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[3] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[4] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, "FP-BNN: Binarized neural network on FPGA," Neurocomputing, vol. 275, pp. 1072–1086, 2018.
[5] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, "Going deeper with embedded FPGA platform for convolutional neural network," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016, pp. 26–35.
[6] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[7] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Proceedings of the 28th International Conference on Neural Information Processing Systems, vol. 2, Dec. 2015, pp. 3123–3131.
[8] N. J. Fraser, Y. Umuroglu, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "Scaling binarized neural networks on reconfigurable logic," in Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, 2017, pp. 25–30.
[9] A. Shawahna, S. M. Sait, and A. El-Maleh, "FPGA-based accelerators of deep learning networks for learning and classification: A review," IEEE Access, vol. 7, pp. 7823–7859, 2019.
[10] K. K. Zaw, Z. M. M. Myo, and D. T. H. Thoung, "Support vector machine based classification of leaf diseases," International Journal Science and Engineering Applications, vol. 7, pp. 143–147, 2018.
[11] G. Dhingra, V. Kumar, and H. D. Joshi, "A novel computer vision based neutrosophic approach for leaf disease identification and classification," Measurement, vol. 135, pp. 782–794, 2019.
[12] A. Bierman, T. LaPlumm, L. Cadle-Davidson, D. Gadoury, D. Martinez, S. Sapkota, and M. Rea, "A high-throughput phenotyping system using machine vision to quantify severity of grapevine powdery mildew," Plant Phenomics, vol. 2019, 2019, Article ID 9209727.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems, vol. 1, 2012, pp. 1097–1105.
[14] L. Zhang, G. Gui, A. M. Khattak, M. Wang, W. Gao, and J. Jia, "Multi-task cascaded convolutional networks based intelligent fruit detection for designing automated robot," IEEE Access, vol. 7, pp. 56028–56038, 2019.
[15] Y. Yu, K. Zhang, H. Liu, L. Yang, and D. Zhang, "Real-time visual localization of the picking points for a ridge-planting strawberry harvesting robot," IEEE Access, vol. 8, pp. 116556–116568, 2020.
[16] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, "Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC," in Proceedings of the International Conference on Field-Programmable Technology, 2016, pp. 77–84.
[17] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang, "Accelerating binarized convolutional neural networks with software-programmable FPGAs," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 15–24.
[18] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 65–74.
[19] D. J. M. Moss, E. Nurvitadhi, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H. W. Leong, "High performance binary neural networks on the Xeon+FPGA platform," in Proceedings of the 27th International Conference on Field Programmable Logic and Applications (FPL), 2017, pp. 1–4.
[20] P. Jokic, S. Emery, and L. Benini, "BinaryEye: A 20 kfps streaming camera system on FPGA with real-time on-device image recognition using binary neural networks," in Proceedings of the IEEE 13th International Symposium on Industrial Embedded Systems (SIES), 2018, pp. 1–17.
[21] S. Hashimoto, Y. Sugimoto, K. Hamamoto, and N. Ishihama, "Ship classification from SAR images based on deep learning," in Intelligent Systems and Applications. Springer International Publishing, 2019, pp. 18–34.
[22] T. Simons and D.-J. Lee, "A review of binarized neural networks," Electronics, vol. 8, no. 6, 2019.
[23] P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, vol. 59, pp. 167–181, 2004.
[24] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU math compiler in Python," in Proceedings of the 9th Python in Science Conference, 2010, pp. 3–10.
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[26] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015, http://arxiv.org/abs/1502.03167.

CHUN-HSIAN HUANG (M'17) received his Ph.D. degree in Computer Science and Information Engineering from National Chung Cheng University, Taiwan, in January 2011. In July 2011, he was a postdoctoral scholar at the Intel-NTU Connected Context Computing Center, National Taiwan University. From August 2011 to February 2012, he was an assistant researcher at the Chung-Shan Institute of Science and Technology, Taiwan. In February 2012, he joined the faculty of the Department of Computer Science and Information Engineering, National Taitung University, where he is currently an Associate Professor. Dr. Huang's research interests include embedded systems, reconfigurable computing, cyber-physical systems, and robotic applications. Details can be found on his website:
https://sites.google.com/site/chunhsianhuang/english-version