2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)

Software-Hardware Co-Optimization for CNNs Based on Reconfigurable Devices

1st Fang Liu, School of Computer Science, Wuhan University; Wuhan City College, Wuhan, China, liufangfang@whu.edu.cn
2nd Zimeng Fan, College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China, fanzimeng@wust.edu.cn
3rd Yanxiang He*, School of Computer Science, Wuhan University, Wuhan, China, yxhe@whu.edu.cn
4th Min Peng*, School of Computer Science, Wuhan University, Wuhan, China, pengm@whu.edu.cn

Abstract—Convolutional Neural Network (CNN) is an efficient recognition algorithm used in pattern recognition, image processing, etc. How to build an efficient computational system for CNNs has become a pressing problem, which is traditionally addressed by software optimization or hardware accelerator design alone. In contrast, this paper argues that software algorithms and hardware architectures affect each other in neural network applications with a high degree of coupling. We therefore propose a software-hardware co-optimization design to improve the processing efficiency of CNNs. First, a modular analysis and collaborative design are carried out for the ShuffleNetV2 model by implementing quantization and improving the computational unit. Second, the model is optimized for the characteristics of reconfigurable computing devices. Third, 8-bit quantization is implemented, while the depthwise-separable convolution operation and the channel selection module are redesigned so that they perform their operations in a hardware-friendly form. The experimental work was carried out on the Xilinx Zynq XC7Z045 FPGA platform using High Level Synthesis (HLS). The experimental results show that the optimized model achieves a significant improvement in resource utilization and latency.

Index Terms—Software-Hardware Co-Optimization, Convolutional Neural Networks, ShuffleNetV2, Reconfigurable Computing

I. INTRODUCTION

With the development of computing power [1] [2] [3], big data [5] [6] [7], and machine learning technology [12] [13] [14], the research field of convolutional neural networks (CNNs) has been extended from image processing to text, video, and speech due to their high inference accuracy and strong adaptability. This greatly impacts our society, for example in transportation [9] and healthcare [10].

This paper was supported by the National Key R&D Program of China under Grant No. 2018YFC1604003, the General Program of the Natural Science Foundation of China (NSFC) under Grants No. 61772382 and No. 62072346, the Key R&D Project of Hubei Province under Grant No. 2020BAA021, and the Natural Science Foundation of Hubei Province under Grant No. 2020CFB795.

CNNs are constantly evolving and being optimized, with the emergence of SqueezeNet, MobileNet, ShuffleNet, Xception, and other lightweight models that use special structures or units to reduce the amount of computation.
For the convolution and pooling operations in CNNs, the core is matrix operations, which demand high arithmetic performance, and both convolution and pooling follow a streaming mode. As convolutional neural networks have developed, a large number of convolutional variants have emerged that require substantial convolutional computation, and the increasing depth of the networks also lowers computational efficiency [17] [18] [19]. How to build an efficient computational system oriented to CNNs has therefore become an urgent problem [4]. Existing approaches to this problem mainly rely on network-level optimization of CNNs or on designing hardware accelerators based on the characteristics of CNNs [15]. In current research, new deep neural network models need to match the hardware architecture, so model features are not fully exploited on generic hardware architectures, while the design of neural network accelerators does not sufficiently consider the characteristics of the models themselves to achieve more efficient computation [16].

This paper argues that an important guideline for building efficient AI systems is software-hardware co-design: neural network algorithms and hardware architectures interact with each other in the computation process and have a high degree of coupling. Therefore, software optimization needs to be combined with hardware optimization in the form of hardware-software synergy. While the neural network model architecture and the number of parameters are reasonably designed, the hardware architecture supporting the model is designed from the hardware perspective to ensure the accuracy and efficiency of the whole computing system.

In the AI system design process, deep learning algorithms are first designed based on the target application scenario, followed by optimization of the neural network model to enable hardware acceleration. This optimization step typically includes model compression and fixed-point quantization to reduce the workload and improve the peak performance of the AI accelerator; the hardware architecture design is then adjusted based on the algorithm optimization strategy used. The three steps are iterated to ensure that the metrics of the target application scenario are met. After the hardware architecture is designed, a customized software compiler is used to convert the neural network model into a sequence of instructions for run-time execution. The software compiler automatically performs scheduling and operator optimization to further improve the computational efficiency of the hardware [50]. The hardware-software co-design is based on reconfigurable computing technology, and the performance of the programmable devices is fully exploited to optimize the performance of the neural network model [34].

Based on the idea of collaborative hardware and software optimization, this paper presents a modular analysis and collaborative design of ShuffleNetV2 [30], a lightweight CNN model. By implementing quantization, improving the computational units and optimizing the shuffle module, the network optimization is carried out with hardware and software in cooperation.
The experimental data in this paper show that the optimized ShuffleNetV2 network [30] achieves significantly better latency and resource utilization. The main contributions of this paper are as follows:

• A collaborative design that starts from the variant convolutions of the ShuffleNetV2 model, quantizes and improves the convolutional computation unit, and optimizes the data layout.

• A module-level analysis and layout of the ShuffleNetV2 implementation on an FPGA development board.

• A software optimization designed in unison with the hardware, realizing a combined software-hardware design [32] [33].

The method of this paper is verified on an FPGA, and the experimental data show that it improves resource utilization and reduces latency, achieving a clear optimization of ShuffleNetV2 through software-hardware cooperation.

II. BACKGROUND

A. Convolutional Neural Network

Convolutional Neural Network (CNN) is a class of feedforward neural networks that incorporates convolutional computation, has a deep structure with representation learning capability, and is commonly used for visual image processing. A CNN can effectively reduce a large data volume to a much smaller one while preserving the image features [16]. A typical CNN consists of three parts: a convolutional layer, a pooling layer, and a fully connected layer. The convolutional layer is responsible for feature extraction and is the core component of a CNN; convolution kernels are used to extract features as the key factors for recognition, and many convolutional variants have emerged from research on the convolution operation, such as depthwise convolution and pointwise convolution. The pooling layer dramatically reduces the dimensionality of the data through downsampling, with maximum or average pooling used to retain the important feature information. The fully connected layer acts as a "classifier": convolution, pooling, and activation functions are applied over multiple layers before the results are classified by the fully connected layer.

B. ShuffleNetV2

ShuffleNetV2 is an upgraded version of ShuffleNet. It is optimized to minimize memory accesses for the same channel size, gives full consideration to group convolution in terms of grouping method and group number, and reduces element-level operations [30]. ShuffleNetV2 introduces a new operation, channel split, which divides the input channels into two parts: one part is passed directly downward and the other part is processed by the convolution branch; at the end of the unit the output channels of the two parts are concatenated, making the information between the channels interoperable. ShuffleNetV2 retains the ShuffleNet modules that require downsampling, removes the random split operation from these modules, and processes the information in two branches separately before concatenation, doubling the number of output channels. ShuffleNetV2 uses a scaling factor to control the balance between accuracy and efficiency.

Fig. 1. ShuffleNetV2 Model
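To make the channel split and concatenation described above concrete, the following is a minimal C++ sketch (not the paper's implementation) operating on a flat CHW feature map; the Tensor struct and the half-and-half split are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flat CHW tensor: data.size() == C * H * W, channel-major layout.
struct Tensor {
    std::size_t C, H, W;
    std::vector<float> data;
};

// Channel split: the first half of the channels is passed straight through,
// the second half is sent to the convolution branch of the ShuffleNetV2 unit.
void channel_split(const Tensor& in, Tensor& pass, Tensor& branch) {
    const std::size_t half = in.C / 2;
    const std::size_t plane = in.H * in.W;
    pass.C = half;          pass.H = in.H;   pass.W = in.W;
    branch.C = in.C - half; branch.H = in.H; branch.W = in.W;
    pass.data.assign(in.data.begin(), in.data.begin() + half * plane);
    branch.data.assign(in.data.begin() + half * plane, in.data.end());
}

// Channel concatenation at the end of the unit, after which the channels of
// the two branches can exchange information in the following shuffle.
Tensor channel_concat(const Tensor& a, const Tensor& b) {
    Tensor out{a.C + b.C, a.H, a.W, a.data};
    out.data.insert(out.data.end(), b.data.begin(), b.data.end());
    return out;
}
```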
C. Reconfigurable Computing

The theory of reconfigurable computing was proposed by Gerald Estrin in the 1960s. The core idea is that hardware resources can be configured to build dedicated circuits for computing based on the characteristics of the data flow and the computation [37] [38]. If the algorithm changes, the hardware resources can be reconfigured according to the new data flow and computation characteristics so that the hardware adapts to the new computational requirements. The basis of reconfigurable computing is a computing device whose hardware resources can be reconfigured; the most commonly used hardware platform for reconfigurable computing today is the FPGA [32]. Research shows that the combined computing power of reconfigurable computing can exceed that of a CPU by several times while consuming far less power, even though the computing unit used in reconfigurable computing, the FPGA [31], is clocked at a much lower frequency than contemporary CPUs [15]. The convolution operations in a CNN model are highly parallel and well suited to multi-core architectures [22] [23] [24], so reconfigurable computing allows effective optimization of the CNN model and improves its overall operational performance.

III. RELATED WORK

A. CNN Optimization Model

CNN-based optimization models such as SqueezeNet [44], MobileNets [11], ShuffleNet [35], and Xception [36] have been widely used in mobile applications [8]. SqueezeNet was published at ICLR 2017 by researchers from Berkeley and Stanford; Squeeze refers to the Squeeze layer in the model, which uses a 1×1 convolution kernel to convolve the feature map of the previous layer, aiming to reduce the dimensionality of the feature map. ShuffleNet was proposed by the Face++ team in 2017; the shuffle in the model performs channel shuffling, which reduces the number of channels in each part of the model. Xception was proposed by Google; it is based on Inception-V3 and combined with depthwise convolution.

B. Hardware Accelerators for CNNs

Current hardware accelerators for CNNs mainly use FPGAs as the hardware platform to accelerate part of the CNN computation in hardware. FPGA-based model optimization can compress the model to greatly reduce its size with a reasonable loss of accuracy, for example by performing data quantization [48] or by making the weights smaller through pruning or sparse matrices [31]. A second approach is to optimize the structure of the model, mainly by optimizing the computational units, e.g. via the fast Fourier transform [20]; by optimizing the loop structure [45] [46], e.g. by unrolling the input and output channels to increase the parallelism of the convolution operations [44]; and by optimizing the access structure, e.g. by optimizing the arrangement of the CNN data [21]. From the existing research it can be seen that current work focuses mainly on the optimization of CNN models or on the design of CNN-oriented hardware accelerators, whereas software and hardware should work together for CNN-oriented optimization [49]. Therefore, this paper proposes the idea of collaborative optimization of software and hardware for CNNs and carries out the related work based on it.

C. Exploring Design Space

In this section, a collaborative hardware and software optimization exploration [41] [42] [43] based on the ShuffleNetV2 model is presented. The idea of the optimization method is to analyze the convolutional modules based on the computational characteristics of ShuffleNetV2 after confirming the quantization method, using a block-based modular hardware-software optimization method.
D. The problem of quantifying model parameters

Deep neural network models can still deliver high-accuracy computation with limited-precision data [40]. In a hardware-software collaborative approach [28] [29], an FPGA-based co-design must balance classification accuracy against latency in order to arrive at a reasonable quantization scheme.

Research [35] has shown that the weights have a small impact on classification accuracy as long as they retain reasonable precision. As shown in Fig. 2, experiments on the cifar-10 dataset show that the weights can be quantized to 8-bit (or lower) precision with a reduction in inference accuracy of less than 1%.

Fig. 2. Accuracy loss in Top-1 classification versus bit-width Q of input feature maps

It follows that, with the weights quantized to 8-bit precision, the loss in classification accuracy can be studied in software by reducing the quantization bit width of the feature maps, where the quantization bit width is denoted Q. Since the input images all lie between -1 and 1 after preprocessing, the default bit width for the integer part in the experiments is 1: if the quantization bit width is Q, the integer part occupies 1 bit and the fractional part Q-1 bits. The experimental data show that Q = 8 is the optimal feature-map quantization scheme when both accuracy and resource utilization are considered for applications where accuracy is not extremely demanding, while any value of Q >= 8 results in only a small loss in classification accuracy while improving resource utilization and performance. Therefore, in this paper an 8-bit quantization scheme is used for the weights and feature maps in the hardware-software collaborative optimization scheme.
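As a concrete illustration of the Q = 8 fixed-point format described above (1 integer bit and Q-1 fractional bits for values preprocessed into [-1, 1)), the following is a minimal C++ sketch of the quantize/dequantize arithmetic; it is an assumed reference implementation, not the exact quantizer used in the paper.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Fixed-point format assumed from the text: Q total bits, 1 sign/integer bit,
// Q-1 fractional bits, so the scale factor is 2^(Q-1).
constexpr int Q     = 8;
constexpr int SCALE = 1 << (Q - 1);   // 128 for Q = 8

// Quantize a real value (inputs are preprocessed into [-1, 1)) to int8.
inline std::int8_t quantize(float x) {
    int v = static_cast<int>(std::lround(x * SCALE));
    v = std::max(-SCALE, std::min(SCALE - 1, v));   // clamp to [-128, 127]
    return static_cast<std::int8_t>(v);
}

// Recover an approximate real value from the 8-bit code.
inline float dequantize(std::int8_t q) {
    return static_cast<float>(q) / SCALE;
}
```

In a Vivado HLS flow the same format could be expressed directly with an ap_fixed<8,1> type; the sketch above only spells out the underlying arithmetic.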
E. Analysis of ShuffleNetV2 convolutional modules

This section analyzes the various parts of ShuffleNetV2 to find its computational bottlenecks and optimization directions, following the idea of accelerating high-probability events. ShuffleNetV2 is a lightweight CNN-based model whose computation mainly consists of convolution, i.e. multiply-accumulate (MAC) operations, so this paper bases its acceleration on the MAC operations that occur with high probability. As shown in Fig. 3, there are four types of computation in the ShuffleNetV2 model, namely depthwise convolution, pointwise convolution, normal convolution and the fully connected layer, where the width and height of the convolution kernel for both depthwise and normal convolution are 3. In the ShuffleNetV2 model, pointwise convolution accounts for 70.240% of the operations, normal convolution for 23.159%, depthwise convolution for 6.442%, and the fully connected layer for 0.159%; the proportion of the fully connected layer is negligible because it only exists in the output layer to obtain the classification results. Based on the idea of accelerating high-probability events, the hardware-software co-design in this paper focuses on the depthwise-separable convolution (pointwise and depthwise convolution) and the normal convolution of the ShuffleNetV2 model.

Fig. 3. Analysis of ShuffleNetV2 convolution module

IV. SOFTWARE-HARDWARE CO-DESIGN

Based on the analysis in the previous section, this paper redesigns the ShuffleNetV2 model around its three computationally intensive convolution operations, while reducing the impact of the channel split operation and optimizing the structure of the model through software-hardware collaboration [25] [26]. The system is optimized in terms of both software and hardware [42] [43] [46]. In the software design part, different computational processes and the corresponding data flows are designed for the three different convolution types; in the hardware design part, the overall architecture of the system and the top-level architecture for the convolution operations are highlighted.

A. Software Optimization Design

In the ShuffleNetV2 model [30], there are three different types of convolutional units: pointwise convolution with a convolution kernel size of 1, depthwise convolution with a convolution kernel size of 3, and normal convolution with a convolution kernel size of 3. In this paper, we design a highly parallelized convolution computation scheme for each convolution type, and achieve fine-grained parallelism by copying the data before transferring it to the PE unit, so that the convolution kernel data stream and the feature-map data stream are aligned. In this section we denote the feature map by C × H × W and the convolution kernel by c × d × d.

a) Pointwise convolution: In contrast to normal convolution, pointwise convolution has a kernel size d of only 1, so the convolution kernel does not slide over the feature map; instead, the operation resembles traversing each pixel of the feature map itself. This allows all of the convolution kernels to operate in parallel. As shown in Fig. 4, we take the values at the same position in each channel of the input feature map one at a time in turn, copy them c times, and complete one round of operations before the value at the next pixel position is taken out from each channel. The convolution kernel part, on the other hand, takes out the data of each convolution kernel in turn.

Fig. 4. Pointwise convolution data layout

b) Depthwise convolution: Compared with the other two convolution operations, the most important feature of depthwise convolution is that the number of convolution kernels equals the number of channels of the feature map, i.e. C = c, and there is no data dependency between the channels. The feature map is read in the same way as for pointwise convolution, but instead of taking out the value of a single pixel, a window of data is extracted, as shown in Fig. 5. The convolution kernel then takes out its data sequentially along the channel direction.

c) Normal convolution: Since pointwise convolution and depthwise convolution can be regarded as the two steps of one normal convolution, the operation for normal convolution is a fusion of the two previous methods. For a single convolution kernel, the data is read in the same order as for depthwise convolution, while between different convolution kernels the feature maps need to be copied c times, as for pointwise convolution.

Fig. 5. Depthwise convolution data layout
Fig. 6. Normal convolution data layout
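To make the access patterns above concrete, the following C++ sketch gives a plain reference model (not the HLS implementation) of the pointwise and depthwise data layouts: for pointwise convolution each pixel position is read once and reused by all c kernels, while for depthwise convolution a d × d window is read per channel with no cross-channel reduction. The flat CHW layout and the no-padding assumption are illustrative choices.

```cpp
#include <cstddef>
#include <vector>

// Pointwise (1x1) convolution: traverse pixel positions; the C input values at
// one position are reused ("copied") by all c kernels, so the c dot products
// can be computed in parallel by the PEs.
void pointwise_conv(const std::vector<float>& in,   // C * H * W (CHW)
                    const std::vector<float>& w,    // c * C
                    const std::vector<float>& bias, // c
                    std::vector<float>& out,        // c * H * W
                    std::size_t C, std::size_t c, std::size_t H, std::size_t W) {
    const std::size_t plane = H * W;
    out.assign(c * plane, 0.0f);
    for (std::size_t p = 0; p < plane; ++p)
        for (std::size_t k = 0; k < c; ++k) {
            float acc = bias[k];
            for (std::size_t ch = 0; ch < C; ++ch)
                acc += in[ch * plane + p] * w[k * C + ch];
            out[k * plane + p] = acc;
        }
}

// Depthwise (d x d) convolution: one kernel per channel (C == c); a d x d
// window is extracted per output position and no cross-channel sum is needed.
void depthwise_conv(const std::vector<float>& in,   // C * H * W
                    const std::vector<float>& w,    // C * d * d
                    const std::vector<float>& bias, // C
                    std::vector<float>& out,        // C * (H-d+1) * (W-d+1)
                    std::size_t C, std::size_t H, std::size_t W, std::size_t d) {
    const std::size_t Ho = H - d + 1, Wo = W - d + 1;
    out.assign(C * Ho * Wo, 0.0f);
    for (std::size_t ch = 0; ch < C; ++ch)
        for (std::size_t y = 0; y < Ho; ++y)
            for (std::size_t x = 0; x < Wo; ++x) {
                float acc = bias[ch];
                for (std::size_t ky = 0; ky < d; ++ky)
                    for (std::size_t kx = 0; kx < d; ++kx)
                        acc += in[ch * H * W + (y + ky) * W + (x + kx)]
                             * w[ch * d * d + ky * d + kx];
                out[ch * Ho * Wo + y * Wo + x] = acc;
            }
}
```

A normal convolution would fuse the two patterns: the window access of the depthwise case within one kernel, and the c-fold replication of the feature map across kernels.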
B. Hardware Design

a) Overall architecture: The scheme is based on a reconfigurable computing platform for the hardware design [37] [38] [39]. The overall structure of the optimized solution is shown in Fig. 7. The system is realized by quantizing the data, uploading and storing the weights in BRAM, deploying the whole system on the FPGA, and controlling the system flow through the ARM core. The data is compressed by the quantization operation of the previous section so that it can be stored in BRAM, which speeds up data access during inference. Data is read over the AXI4 bus, and the commands from the ARM core are also transmitted via the AXI4 bus. When the ShuffleNetV2 module has finished its computation, an interrupt signal is generated and returned to the ARM core via interrupt control, marking the completion of the inference computation. Finally, the ARM core completes the inference process by retrieving the inference results from BRAM. We use the ARM core on the FPGA for the overall control of the system, calling the ShuffleNetV2 module designed in this paper (packaged as an IP core) to perform the inference, record information, and configure the parameters.

Fig. 7. Overall System Architecture

b) Convolution module design: The design of the convolution module is shown in Fig. 8. In addition to transferring the feature maps to the module, an identifier Flag is set to distinguish which type of convolution is being performed: when Flag = PW the operation is pointwise convolution, when Flag = DW it is depthwise convolution, and when Flag = CON it is normal convolution. The PE unit consists of a number of PEs (processing units) and an addition tree. The feature maps, prepared as described in the previous section, are multiplied in parallel with the weights in the PEs, the convolution is then completed by a set of addition trees, and finally each output result is added to the bias. If Flag is in PW mode, the output feature maps are divided into two groups. With this design, the feature map is already divided at the pointwise convolution of the previous convolution unit, so the next channel split unit can be skipped. It is worth noting that the feature map is still stored contiguously in hardware after this division, so subsequent operations are not affected, while the memory-access pressure caused by the channel split operation is reduced.

Fig. 8. Convolution module top-level design
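As a behavioural sketch of the Flag-controlled top level in Fig. 8, the code below shows the shared PE datapath: parallel multiplies in the PEs followed by an adder-tree reduction and a bias addition, with PW-mode results steered into one of two output groups. The identifiers Flag, PW, DW and CON follow the text; the fixed PE count and the array-based interface are assumptions for illustration, not the paper's HLS code.

```cpp
#include <cstddef>

// Convolution mode identifier passed to the module, as in Fig. 8.
enum Flag { PW, DW, CON };

constexpr std::size_t NUM_PE = 16;  // the optimal PE count found experimentally

// PE stage + adder tree: NUM_PE feature/weight pairs are multiplied in
// parallel, the partial products are reduced by a log2(NUM_PE)-level adder
// tree, and the bias is added to produce one output element.
float pe_unit(const float feat[NUM_PE], const float weight[NUM_PE], float bias) {
    float prod[NUM_PE];
    for (std::size_t i = 0; i < NUM_PE; ++i)
        prod[i] = feat[i] * weight[i];               // parallel multiplies (PEs)
    for (std::size_t s = NUM_PE / 2; s > 0; s /= 2)  // adder tree reduction
        for (std::size_t i = 0; i < s; ++i)
            prod[i] += prod[i + s];
    return prod[0] + bias;
}

// Top level: the Flag does not change the MAC datapath; it reflects how the
// incoming streams were laid out (replicated pixels for PW, sliding windows
// for DW, their fusion for CON) and, in PW mode, steers each result into one
// of two output groups so the following channel split stage can be skipped.
void conv_top(Flag mode, const float feat[NUM_PE], const float weight[NUM_PE],
              float bias, float* group0, float* group1,
              std::size_t idx, std::size_t half) {
    const float r = pe_unit(feat, weight, bias);
    if (mode == PW && idx >= half)
        group1[idx - half] = r;   // second half of the split output channels
    else
        group0[idx] = r;          // first half, or the single output buffer
}
```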
V. EXPERIMENT

A. Experiment Setup

The dataset used in this paper is the classical image classification dataset cifar-10. We implemented ShuffleNetV2 on different hardware platforms: the CPU is an Intel(R) Core(TM) i5-7500 CPU @ 3.4 GHz (single core) running a C++ version of the ShuffleNetV2 model, and the GPU is a Tesla P100 on Google Colab running the PyTorch implementation of ShuffleNetV2 [50]. The optimization itself was implemented on the ZYNQ 7045 board, which uses the Xilinx FPGA chip XC7Z045 and operates at 100 MHz. Five different configurations of the system were set up to explore the optimal number of PEs, with 4, 8, 16, 32 and 64 PEs respectively.

B. Experiment Results

In this subsection we first show the comparisons made in exploring the optimal number of PEs; here we set up five sets of comparison experiments. The experiments are tested and compared by setting different numbers of PEs on the FPGA through the three convolution modules that have been implemented. The choice of the number of PEs is based on the alignment of the data stream and the nature of the three convolution operations.

Fig. 9. Comparison of resource usage with different numbers of PEs

Fig. 9 shows the consumption of the DSP, BRAM, FF, and LUT hardware resources as the number of PEs increases. It can be seen that, except for BRAM, which does not change significantly, the consumption of DSP, FF and LUT all increase with the number of PEs, with LUT increasing the most, and FF consumption increasing from 39k to 434k as the number of PEs increases from 4 to 64.

Fig. 10. Comparison of latency and LUT usage for different numbers of PEs

Fig. 10 shows the relationship between LUT usage, latency, and the maximum number of LUTs as the number of PEs increases. Latency decreases as the number of PEs increases, but at the same time the consumption of LUTs also increases. The maximum number of LUTs of the FPGA used in this paper is 210k. As can be seen from Fig. 10, when the number of PEs is 32 the LUT usage essentially reaches this maximum of 210k, and the LUT resources would be insufficient if more PEs were added, so the number of PEs should be less than 32. In practice, the LUT usage should preferably not be driven to saturation. When the number of PEs is 16, the latency and LUT curves intersect and the efficiency of the LUTs is at its highest. Therefore, according to the experimental results, the optimal setting for the number of PEs in this paper is 16.

TABLE I
COMPARISON WITH CPU AND GPU

Device | Execution time (ms) | Top-1 acc
FPGA XC7Z045 | 28.78 | 75.027%
GPU | 43.4 | 75.84%
CPU | 283.8 | 75.1%

In Table I, the design of this paper is compared with ShuffleNetV2 running on the CPU and GPU with the same number of parameters. In terms of accuracy, it is only 0.813% lower than the original ShuffleNetV2, which shows that the design is not overly detrimental to the accuracy of the model. At the same time, the number of parameters remains approximately the same, except for the data caches used to remove data dependencies. On top of that, our work is still 9.8x faster than the single-core CPU and 1.5x faster than the GPU.

Table II shows the comparison of the baseline experiment with the accelerated designs from other work [20] [21] [47], where the baseline model is the data before optimization. DSP, BRAM, FF and LUT denote the hardware resource usage. The BRAM usage of the baseline model is as high as 1276, while its DSP usage is only 95, which shows low resource utilization and high memory consumption [48]. The optimized model substantially increases the DSP usage and reduces the BRAM usage. At the same time, the optimized latency is only 26.3% of the original [4].

TABLE II
COMPARISON WITH OTHER WORK

Design | Device | DSP | FF | LUT | FPS | GOPS | Latency | Clock
baseline | XC7Z045 | 95 | 18k | 28k | – | 15.79 | 109.42 ms | 100 MHz
Partial Reconfiguration on FPGA [47] | XC7Z020 | 220 | 106k | 154k | – | – | 40-44 ms | 100 MHz
Synetgy [20] | ZU3EG | 37 | 30k | 24k | 58.7 | 47.09 | 104.3 ms | 250 MHz
Data Optimization CNN Accelerator [21] | XC7Z045 | 340 | 22k | 42k | – | 57.5 | 100 ms | 100 MHz
Our | XC7Z045 | 220 | 87k | 115k | 291.45 | 76.27 | 28.78 ms | 100 MHz

Using the same dataset as this paper, the latency of this paper is only 65% to 72% of that of [47], with approximately the same hardware resource consumption.
The model used in [20] is the same as that used in this paper, though it is worth noting that [20] is of limited reference value for the FPS figure, as it was tested on the ImageNet dataset. In addition, the device used in [20] is a ZU3EG with a clock frequency of 250 MHz, while the device used in this paper is clocked at only 100 MHz. The final results show that the GOPS of this paper is improved by 1.59x over [20] and the latency is reduced from 104.3 ms to 28.78 ms. Compared with [21], where the data is re-arranged by changing the data layout, which is similar to the method proposed in this paper, our latency is still only 32% of that of [21]. The comparative results demonstrate the improved resource utilization and reduced latency of this paper's approach.

VI. CONCLUSION

Improving the computational efficiency of CNNs and related models has been a hot research topic. Traditional optimization methods mainly either optimize the model or design hardware accelerators. In this paper, we propose an optimization scheme for ShuffleNetV2 on a reconfigurable platform based on the idea of hardware-software co-optimization, and validate it on FPGA hardware. The paper starts from data quantization and the optimization of each of the three convolution types of the model: pointwise convolution, depthwise convolution and normal convolution. The redesign of the convolution module and the high degree of pipelining result in a significant optimization of the latency. Experimental results obtained on a Xilinx Zynq 7045 development board show that the optimization solution significantly speeds up the inference of ShuffleNetV2, improves resource utilization and reduces the latency of the system compared with the existing model. In future work, we will continue to focus on software-hardware cooperative optimization of deep learning models; the next step will concentrate on optimizing the computation of the first and last normal convolutions in ShuffleNetV2. We will further reduce the latency through operations such as loop unrolling, and will conduct research on hardware-software collaborative model optimization by adding more layers to the data-flow architecture and improving the ratio of computation to communication.

VII. ACKNOWLEDGMENT

This paper was supported by the National Key R&D Program of China under Grant No. 2018YFC1604003, the General Program of the Natural Science Foundation of China (NSFC) under Grants No. 61772382 and No. 62072346, the Key R&D Project of Hubei Province under Grant No. 2020BAA021, and the Natural Science Foundation of Hubei Province under Grant No. 2020CFB795. We thank MindSpore, a new deep learning computing framework, for the partial support of this work.

REFERENCES

[1] M. Qiu and J. Li, “Real-time embedded systems: optimization, synthesis, and networking,” CRC Press.
[2] J. Wang, M. Qiu, B. Guo, “High reliable real-time bandwidth scheduling for virtual machines with hidden Markov predicting in telehealth platform,” Future Generation Computer Systems, 49, 68-76, 2015.
[3] F. Hu, S. Lakdawala, Q. Hao, M. Qiu, “Low-power, intelligent sensor hardware interface for medical data preprocessing,” IEEE Transactions on Information Technology in Biomedicine, 13(4), 656-663, 2009.
[4] S. H. Shi, X. M. Chu, “Speeding up Convolutional Neural Networks,” arXiv preprint arXiv:1704.07724, 2017.
[5] P. Wu, Z. Lu, Q. Zhou, Z. Lei, X. Li, M. Qiu, P. C. K. Hung, “Big data logs analysis based on seq2seq networks for cognitive Internet of Things,” Future Generation Computer Systems, 90, 477-488, 2019.
[6] R. Lu, X. Jin, S. Zhang, M. Qiu, X. Wu, “A study on big knowledge and its engineering issues,” IEEE Transactions on Knowledge and Data Engineering, 31(9), 1630-1644, 2018.
[7] W. Dai, L. Qiu, A. Wu, M. Qiu, “Cloud infrastructure resource allocation for big data applications,” IEEE Transactions on Big Data, 4(3), pp. 313-324, 2016.
[8] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. C. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” arXiv preprint arXiv:1801.04381, 2018.
[9] M. Zhu, X. Y. Liu, F. Tang, M. Qiu, R. Shen, W. Shu, M. Y. Wu, “Public vehicles for future urban transportation,” IEEE Transactions on Intelligent Transportation Systems, 17(12), pp. 3344-3353, 2016.
[10] Q. Zhang, T. Huang, Y. Zhu, M. Qiu, “A case study of sensor data collection and analysis in smart city: provenance in smart food supply chain,” International Journal of Distributed Sensor Networks, 9(11), 382132, 2013.
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[12] H. Qiu, M. Qiu, Z. Lu, “Selective encryption on ECG data in body sensor network based on supervised machine learning,” Information Fusion, 55, 59-67, 2020.
[13] K. Gai and M. Qiu, “Reinforcement learning-based content-centric services in mobile sensing,” IEEE Network, 32(4), pp. 34-39, 2018.
[14] K. Gai and M. Qiu, “Optimal resource allocation using reinforcement learning for IoT content-centric services,” Applied Soft Computing, Vol. 70, pp. 12-21, 2018.
[15] Z. W. Cai, X. D. He, J. Sun, and N. Vasconcelos, “Deep learning with low precision by half-wave Gaussian quantization,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 5406-5414, 2017.
[16] K. Simonyan, A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[17] J. Niu, Y. Gao, M. Qiu, Z. Ming, “Selecting proper wireless network interfaces for user experience enhancement with guaranteed probability,” Journal of Parallel and Distributed Computing, 72(12), pp. 1565-1575, 2012.
[18] M. Qiu, K. Zhang, M. Huang, “Usability in mobile interface browsing,” Web Intelligence and Agent Systems: An International Journal, 4(1), pp. 43-59, 2006.
[19] M. Qiu, K. Zhang, M. Huang, “An empirical study of web interface design on small display devices,” IEEE/WIC/ACM International Conference on Web Intelligence (WI'04), pp. 29-35, 2004.
[20] Y. F. Yang, Q. J. Huang, “Synetgy: Algorithm-hardware co-design for ConvNet accelerators on embedded FPGAs,” ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2019.
[21] W. Hu, S. Chen, et al., “Data Optimization CNN Accelerator Design on FPGA,” IEEE ISPA, 294-299, Xiamen, China, 2019, doi: 10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00051.
[22] M. Qiu, H. Li, E. H.-M. Sha, “Heterogeneous real-time embedded software optimization considering hardware platform,” ACM Symposium on Applied Computing, pp. 1637-1641, 2009.
[23] M. Qiu, C. Xue, Z. Shao, E. H.-M. Sha, “Energy minimization with soft real-time and DVS for uniprocessor and multiprocessor embedded systems,” ACM/IEEE Design, Automation and Test in Europe Conference (DATE), pp. 1-6, 2007.
[24] M. Qiu, Z. Chen, M. Liu, “Low-power low-latency data allocation for hybrid scratch-pad memory,” IEEE Embedded Systems Letters, 6(4), 69-72, 2014.
[25] Y. Guo, Q. Zhuge, J. Hu, M. Qiu, E. H.-M. Sha, “Optimal data allocation for scratch-pad memory on embedded multi-core systems,” IEEE International Conference on Parallel Processing (ICPP), pp. 464-471, 2011.
[26] L. Zhang, M. Qiu, W. C. Tseng, E. H.-M. Sha, “Variable partitioning and scheduling for MPSoC with virtually shared scratch pad memory,” Journal of Signal Processing Systems, Vol. 58(2), pp. 247-265, 2010.
[27] K. Kwon, A. Amid, et al., “Co-design of deep neural nets and neural net accelerators for embedded vision applications,” arXiv preprint arXiv:1804.10642, 2018.
[28] M. Liu, S. Zhang, Z. Fan, M. Qiu, “H Infinite state estimation for discrete-time chaotic systems based on a unified model,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 42, 2012.
[29] M. Qiu, D. Cao, H. Su, K. Gai, “Data transfer minimization for financial derivative pricing using Monte Carlo simulation with GPU in 5G,” International Journal of Communication Systems, 29(16), pp. 2364-2374, 2016.
[30] N. Ma, X. Zhang, et al., “ShuffleNetV2: Practical guidelines for efficient CNN architecture design,” arXiv preprint arXiv:1807.11164, 2018.
[31] J. S. Wang, Q. W. Lou, et al., “Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA,” arXiv preprint arXiv:1808.04311, 2018.
[32] J. T. Qiu, J. Wang, et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016.
[33] R. Z. Zhao, X. Y. Niu, et al., “Optimizing CNN-based object detection algorithms on embedded FPGA platforms,” in Proceedings of the Annual ARC Processor Summit (ARC), Springer, 255-267, 2017.
[34] D. Gschwend, “ZynqNet: An FPGA-accelerated embedded convolutional neural network,” ETH Zurich, 2016.
[35] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” arXiv preprint arXiv:1707.01083, 2017.
[36] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” IEEE Conference on Computer Vision and Pattern Recognition, 1800-1807, 2017.
[37] H. Zhao, M. Chen, M. Qiu, K. Gai, M. Liu, “A novel pre-cache schema for high performance Android system,” Future Generation Computer Systems, Vol. 56, pp. 766-772, 2016.
[38] Y. Gao, S. Iqbal, P. Zhang, M. Qiu, “Performance and power analysis of high-density multi-GPGPU architectures: A preliminary case study,” IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015.
[39] M. Qiu, L. Chen, Y. Zhu, J. Hu, X. Qin, “Online data allocation for hybrid memories on embedded tele-health systems,” IEEE International Conference on High Performance Computing and Communications (HPCC), 2014.
[40] P. Gysel, et al., “Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 29(11), 5784-5789, 2018.
[41] K. Gai, M. Qiu, M. Liu, Z. Xiong, “In-memory big data analytics under space constraints using dynamic programming,” Future Generation Computer Systems, 83, 219-227, 2018.
[42] J. Niu, C. Liu, Y. Gao, M. Qiu, “Energy efficient task assignment with guaranteed probability satisfying timing constraints for embedded systems,” IEEE Transactions on Parallel and Distributed Systems, 25(8), 2043-2052, 2013.
[43] Y. Guo, Q. Zhuge, J. Hu, J. Yi, M. Qiu, E. H.-M. Sha, “Data placement and duplication for embedded multicore systems with scratch pad memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits, 2013.
[44] F. N. Iandola, S. Han, et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” ICLR, 2017.
[45] M. Qiu, M. Guo, M. Liu, C. J. Xue, L. T. Yang, E. H.-M. Sha, “Loop scheduling and bank type assignment for heterogeneous multi-bank memory,” Journal of Parallel and Distributed Computing, 69(6), 546-558, 2009.
[46] Z. Shao, M. Wang, Y. Chen, C. Xue, M. Qiu, L. T. Yang, E. H.-M. Sha, “Real-time dynamic voltage loop scheduling for multi-core embedded systems,” IEEE Transactions on Circuits and Systems II, 54(5), 445-449, 2007.
[47] M. Farhadi, M. Ghasemi, Y. Z. Yang, “A novel design of adaptive and hierarchical convolutional neural networks using partial reconfiguration on FPGA,” arXiv preprint arXiv:1909.05653, 2019.
[48] J. S. Wang, Q. W. Lou, et al., “Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA,” arXiv preprint arXiv:1808.04311, 2018.
[49] J. T. Qiu, J. Wang, et al., “Going deeper with embedded FPGA platform for convolutional neural network,” ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 26-35, 2016.
[50] MindSpore. https://www.mindspore.cn/, 2020.