Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime

Saad Ashfaq, MohammadHossein AskariHemmat, Sudhakar Sah, Ehsan Saboori, Olivier Mastropietro, Alexander Hoffman
Deeplite Inc., Montreal, Canada
{saad, hossein, sudhakar, ehsan, olivier, alexander.hoffman}@deeplite.ai

Abstract—Deep learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models comes at the expense of high computational, storage and power requirements. Sensing the immediate need for accelerating and compressing these models to improve on-device performance, we introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low bit quantized models on Arm-based platforms. We implement low-level quantization kernels for Armv7 and Armv8 architectures, enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2x and 2.2x compared to TensorFlow Lite with the XNNPACK backend on classification and detection models, respectively. We also achieve significant speedups of up to 5x and 3.2x compared to ONNX Runtime for classification and detection models, respectively.

Keywords—Low-bit quantization; Bit-serial; DeepliteRT; Deep learning; Inference; Edge computing; Optimization

I. INTRODUCTION

Modern deep learning models for computer vision are achieving accuracies and detection performance with the potential to drive exciting new product developments across multiple domains, including but not limited to home surveillance cameras, AI traffic monitoring systems, automotive driver assistance, mobile video applications and intelligent consumer electronics. It is becoming increasingly practical to deploy deep learning models for vision applications on existing server infrastructure, especially with the advent of high-performance data center GPUs and CPUs. However, due to stringent requirements for deterministic latency, reliable connectivity, data privacy, and cost sensitivity, cloud-centric AI has become a major limiting factor in many vision applications. For example, face detection for surveillance raises privacy issues around uploading non-anonymized facial data, real-time person detection in automotive requires on-device processing, and always-on person identification with smart doorbell cameras needs cost-effective inference performance. Therefore, to achieve mass-market deployment, cutting-edge vision applications need to be enabled within edge devices themselves. The lack of sufficient processing power and practical economics to run these models in real time, all the time, has been a major barrier to adoption.

Model quantization is one of the ways to improve the performance of computer vision models on CPUs and overcome barriers to adopting AI at the edge. Current quantization frameworks and inference engines predominantly use 8-bit integer (INT8) quantization for model weights and activations, instead of using the same precision as training, typically 32-bit floating point (FP32). This reduces memory bandwidth and can potentially provide a 2-4x speedup while retaining accuracy. However, we sought to push the limits of existing quantization methods by using fewer than 8 bits for quantization.
In this paper, we present Deeplite Runtime (DeepliteRT), an innovative ultra-low precision inference engine for Arm CPUs, and a quantization framework in Deeplite Neutrino™ [4] that automatically quantizes and optimizes CNN models with less than 4 bits of precision. DeepliteRT empowers developers to unlock advanced AI on billions of low-power Arm-based devices, achieving up to 5x faster inference compared to optimized FP32 baselines and up to 2x faster inference for classification models relative to TFLite [10] used with the highly optimized XNNPACK [12] backend. The speedups on object detection models are up to 3.2x and 2.2x compared to ONNX Runtime and TFLite with XNNPACK, respectively.

II. BACKGROUND

Deep learning model performance has progressed considerably in the past few years, especially with state-of-the-art classification [3] and object detection models [1], [17]. These models perform well on computer vision problems and improve upon the state of the art year over year. However, the constantly improving performance of these models still does not make them suitable for use on edge devices due to their computational, power and memory requirements. These models use 32-bit precision, which can be reduced to 16 bits or even to 8 bits by quantizing the weights and activations accordingly. Existing methods for neural network quantization can be classified into two categories [24]: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ does not require the original network to be re-trained on the target dataset and is therefore typically limited to FP16 or INT8 quantization, where sufficient information is preserved in the original training statistics to accurately quantize the network. On the other hand, QAT quantizes the network during training and preserves significantly more accuracy than PTQ. QAT methods have been tested at as low as INT4 precision [23], using 4 bits per neural network weight and 8 bits per activation. However, INT4 quantization in [23] is tested only on classification models. Low-bit quantization is much more difficult for object detection problems. Significant work on quantizing models with just 1 bit is also available, but such models suffer from accuracy losses beyond 10% [26][27].

On the other end of the research spectrum, low-complexity lightweight models like MobileNet [2] are available to help enable AI on edge devices. However, these models are handcrafted and lack the performance needed for many computer vision applications. Also, because of their handcrafted and compact architecture, compression techniques such as quantization are generally not effective [18] and custom modifications might be necessary to accommodate quantization [22]. Even when high accuracy quantized models with 1-2 bits of precision are available, they cannot be deployed on commercial off-the-shelf hardware [19] due to the lack of support for 1 or 2-bit instructions and operators. Most commodity hardware only supports operations down to 8 bits. Such devices either need a custom inference engine (runtime) or custom hardware support such as the INT4 instructions provided in advanced NVIDIA processors [25]. Therefore, ultra-low bit quantization (below 4 bits) remains a challenging problem to solve from both the QAT and the runtime perspective. Currently, standard AI frameworks such as TensorFlow [15] and PyTorch [16] do not support 1 or 2 bits of precision.
Open-source runtimes and compilers such as TFLite [10] and TVM [28] currently support 8 bits of precision with quantization. This paper makes the following contributions:

• Quantization-aware training with 1 and 2-bit precision (weights and activations) for object detection and classification models.
• A mixed precision approach to minimize the accuracy drop of quantized models.
• Custom ultra-low precision convolution operators to accelerate the speed and memory throughput of quantized layers.
• An end-to-end framework to deploy and execute mixed precision ultra-low bit quantized models on Armv7 and Armv8 Cortex-A processors.

III. CHALLENGES OF DETECTION MODELS FOR EMBEDDED APPLICATIONS

We benchmarked state-of-the-art object detection models on a Raspberry Pi 4B platform. Our initial study found that unless we use a very compact model (YOLOv5n [1]) combined with a very low-resolution input image (less than 300px), we cannot achieve more than 4-5 FPS, even with 8-bit quantization (Figure 1). The speedup from using YOLOv5n, however, comes with an expensive tradeoff in accuracy, as shown in Figure 2.

Fig. 1. YOLOv5 benchmark on Raspberry Pi 4B (Arm Cortex-A72)

The high latency and low throughput of current deep neural networks on commodity CPUs like the Cortex-A72 in the Raspberry Pi 4B demonstrate the harsh limitations of AI inference on low-power and affordable processors. Despite there being billions of devices powered by Arm Cortex-A CPUs, even the latest quantization techniques did not provide sufficiently low latency for practical applications. Although reducing the model parameters from 32 bits to 8 bits results in a respectable speedup without a significant loss in accuracy, it is still not enough to run these models on such small-footprint hardware. Furthermore, many of the compact networks [2] designed for these devices are not accurate enough, including the smaller variants of YOLOv5 [1] as shown in Figure 2.

Fig. 2. Accuracy drop on YOLO variants for VOC and COCO datasets [1]

IV. ACCURATE ULTRA-LOW BIT QUANTIZATION

It has been known for several years now that deep learning models are robust to quantization. However, how many bits are enough for a given AI task has remained an open question ever since early work on post-training and training-aware quantization, such as [5] and [6]. Using advances in Deeplite Neutrino™, a CNN-based model optimization software, we employed a quantization-aware training method to quantize both model weights and activations below 4 bits. In addition, we provide a way to save and infer such models on low-power CPUs and compare them to full-precision and INT8 baselines using DeepliteRT.

The main goal of quantization is to reduce model precision and complexity while preserving accuracy. As mentioned earlier, quantization can be applied to a model after or during training. Quantization after training (post-training quantization) can be done statically or dynamically. In post-training static quantization, weights are quantized ahead of time, using a calibration process on the validation set to compute a scale and bias for the activations. In post-training dynamic quantization, much like post-training static quantization, the weights are quantized ahead of time, but the activations are dynamically quantized at inference. Dynamic quantization is useful for models whose execution time is dominated by the time it takes to load the model weights, e.g., LSTMs [30].
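To make the calibration step concrete, below is a minimal NumPy sketch of post-training static quantization under a simple min/max calibration scheme; the unsigned 8-bit target, the calibration rule, and all function names are illustrative assumptions rather than a description of any specific framework.

```python
# Minimal sketch of post-training static quantization (illustrative only):
# a scale and zero-point for the activations are computed once from calibration
# batches and then reused unchanged at inference time.
import numpy as np

def calibrate(activation_batches, num_bits=8):
    """Derive an affine scale and zero-point from min/max calibration statistics."""
    lo = min(float(a.min()) for a in activation_batches)
    hi = max(float(a.max()) for a in activation_batches)
    qmin, qmax = 0, 2 ** num_bits - 1                     # unsigned 8-bit range
    scale = (hi - lo) / (qmax - qmin) or 1.0              # guard against a zero scale
    zero_point = int(np.clip(round(qmin - lo / scale), qmin, qmax))
    return scale, zero_point

def quantize(x, scale, zero_point, num_bits=8):
    q = np.clip(np.round(x / scale) + zero_point, 0, 2 ** num_bits - 1)
    return q.astype(np.uint8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Calibrate on a few batches standing in for the validation set, then quantize at inference.
calib = [np.random.randn(32, 64).astype(np.float32) for _ in range(4)]
scale, zp = calibrate(calib)
x = np.random.randn(1, 64).astype(np.float32)
print("max reconstruction error:", float(np.abs(dequantize(quantize(x, scale, zp), scale, zp) - x).max()))
```

In post-training dynamic quantization, the only change would be computing the scale and zero-point from each activation tensor at inference time instead of from the calibration set.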
Quantization can also be learned by the network. In quantization-aware training [6], the model learns to represent parameters with a pre-defined precision. Compared to post-training quantization, quantization-aware training yields better accuracy; on the other hand, it requires an additional training session to learn the newly quantized parameters. Neutrino provides state-of-the-art quantization-aware techniques that can quantize models down to 3 bits, 2 bits or 1 bit, as well as mixed precision. At training time, we quantize an input tensor t as follows [4]:

t̄ = round(clip(t / s, −Q_N, Q_P))

where t̄ ∈ ℤ is the quantized tensor, t ∈ ℝ is the input tensor, and s ∈ ℝ is the scaling factor. To quantize the input tensor t with b bits, Q_P = 2^(b−1) − 1 and Q_N = 2^(b−1) are the clipping limits. Given the equation above, the quantization error can be computed as:

t̂ = t̄ × s,   error_q = t − t̂

During training, our quantization algorithm learns the scaling factor s so that the quantization error, error_q, is minimized. Our quantizer is coupled with DeepliteRT to efficiently run the quantized model on the target platform.

V. DEEPLITERT

To enable extremely low precision computations on commodity hardware such as Arm CPUs, we require a proprietary inference engine. Public-domain AI runtimes such as TFLite and ONNX Runtime [11] currently only support quantization down to 8-bit precision. Furthermore, running operations at less than 8 bits either requires special hardware support or innovative ways to perform low-bit computations. Running models with sub-8-bit precision on commodity hardware has been explored before. TernGEMM [32] proposes a GEMM-based method that uses logical bitwise operators to perform matrix multiplication for ternary weights, {-1, 0, 1}, and 3 to 6-bit activations. Han et al. [13] developed low precision convolution kernels and optimization passes in the TVM machine learning compiler to compute sub-8-bit operations on commodity hardware. ULPPACK [33] presents two packing schemes for low precision operations that allow for a trade-off between accuracy and speed.

DeepliteRT leverages low-level hardware intrinsics to run our ultra-low precision quantized models with high performance on Arm CPUs. DeepliteRT accelerates model inference through ultra-low bit convolution kernels. The convolution is performed using bit-serial computation, where popcount and bitwise operations are used to calculate the dot products of the low-bit weight and activation values. Using 1-bit weights and activations with unipolar encoding, where each bit can take on the values {0, 1}, the bit-serial dot product can be computed with the following equation:

W · A = POPCOUNT(W & A)

This can be further extended to the multi-bit case (w bits for weights and a bits for activations) by splitting the weights and activations into separate bitplanes and then summing the dot products across all the bitplane combinations:

W · A = Σ_{i=0}^{w−1} Σ_{j=0}^{a−1} (POPCOUNT(W[i] & A[j]) ≪ (i + j))

Our implementation uses intrinsics from the Neon vectorized instruction set for both the Armv7 and Armv8 architectures to target 32-bit and 64-bit Arm CPU devices. Efficient tiling and parallelization schemes are also used to further improve the performance of the vectorized kernels. For the ResNet18 model running on the low-power Arm Cortex-A53 CPU in the Raspberry Pi 3B+, our overall implementation realizes speedups of up to 2.9x at 2-bit and 4.4x at 1-bit precision over an optimized floating-point baseline, improving upon the results published in prior work [7].
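As a scalar reference model of the bit-serial arithmetic above, the Python sketch below packs unipolar low-bit weights and activations into bitplanes and rebuilds their dot product from popcounts of ANDed planes, exactly as in the two equations. The packing layout and helper names are illustrative; the real DeepliteRT kernels implement this with Arm Neon intrinsics, tiling and parallelization rather than in Python.

```python
# Reference model of the bit-serial dot product (unipolar encoding, values >= 0).

def pack_bitplanes(values, num_bits):
    """Pack bit i of every element into a single integer mask per bitplane."""
    planes = []
    for i in range(num_bits):
        mask = 0
        for pos, v in enumerate(values):
            mask |= ((v >> i) & 1) << pos
        planes.append(mask)
    return planes

def bitserial_dot(weights, activations, w_bits, a_bits):
    """W . A = sum_i sum_j POPCOUNT(W[i] & A[j]) << (i + j); with w_bits = a_bits = 1
    this reduces to the single-popcount form POPCOUNT(W & A)."""
    W = pack_bitplanes(weights, w_bits)
    A = pack_bitplanes(activations, a_bits)
    acc = 0
    for i in range(w_bits):
        for j in range(a_bits):
            acc += bin(W[i] & A[j]).count("1") << (i + j)
    return acc

# 2-bit weights and activations: the bit-serial result matches the plain dot product.
w = [3, 1, 0, 2, 1, 3]     # values in [0, 3]
a = [1, 2, 3, 0, 2, 1]
assert bitserial_dot(w, a, 2, 2) == sum(x * y for x, y in zip(w, a))
print(bitserial_dot(w, a, 2, 2))   # 10
```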
Additionally, we extend our work to object detection models and achieve speedups of up to 2.2x and 3.2x over TFLite with XNNPACK [10][12] and ONNX Runtime [11], respectively, for both YOLOv5s and YOLOv5m [1] on the Arm Cortex-A72 CPU in the Raspberry Pi 4B [19].

VI. END-TO-END PIPELINE FOR COMPUTER VISION TASKS

Our end-to-end pipeline combines all the steps required to quantize and run a CNN-based model on DeepliteRT. As Figure 3 shows, Deeplite Neutrino™ automatically quantizes a trained full-precision CNN down to 2 bits and passes it to the Deeplite Compiler, which compiles the quantized model and generates a dlrt file ready to be deployed and executed with DeepliteRT.

Fig. 3. End-to-end quantization and runtime pipeline

VII. EVALUATION

We evaluated the accuracy and inference time of ultra-low bit models for classification and object detection tasks. For classification, we used ResNet18 [3] and ResNet50 [3] on the ImageNet dataset and ResNet18 on the VWW [8] dataset. For object detection, we benchmarked our pipeline on VGG16-SSD [17], YOLOv5s [1] and YOLOv5m [1] on the Pascal VOC [9] dataset and a subset of the MS-COCO [20] dataset. To run these models, we used TensorFlow [15], PyTorch [16], ONNX Runtime [11], TVM [28], TFLite [10] and DeepliteRT. The target platforms for these experiments were the Raspberry Pi 3B+ with 4x Arm Cortex-A53, the Raspberry Pi 4B with 4x Arm Cortex-A72 and the NVIDIA Jetson Nano with 4x Arm Cortex-A57.

A. ResNet18 on VWW

Figure 5 illustrates the accuracy-performance tradeoff for DeepliteRT and ONNX Runtime on the ResNet18 model trained on the VWW dataset for 2A/2W (2 bits for activations and weights) and 1A/2W (1 bit for activations and 2 bits for weights) precisions. The quantized ResNet18 on the VWW dataset achieved a 15.58x reduction in model size with 3.75x and 2.90x speedups on the Raspberry Pi 3B+ and Raspberry Pi 4B devices, respectively. Moreover, this was achieved with less than a 2% drop in accuracy for the 1A/2W configuration and less than a 1% drop in accuracy for the 2A/2W configuration.

Fig. 4. Accuracy/performance benchmark of DeepliteRT on ResNet18 model on VWW dataset (2A/2W: weights and activations quantized to 2 bits, 1A/2W: activations quantized to 1 bit and weights quantized to 2 bits)

Fig. 5. DeepliteRT performance benchmark for ResNet18 with VWW [8] dataset, 224px image

We also benchmarked DeepliteRT against TFLite [10] with XNNPACK [12] for the 2A/2W and 1A/2W quantized models. As Figure 6 shows, our 2-bit model has less than a 1% accuracy drop while being significantly faster than the TFLite INT8 model running on the XNNPACK backend.
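As a back-of-the-envelope check on the 15.58x model-size reduction reported above: moving every FP32 weight to 2 bits gives an ideal 32/2 = 16x reduction, and keeping even a tiny fraction of the data at full precision pulls the ratio down toward the measured value. The FP32 fraction in the snippet below is purely illustrative, since the paper does not break down which tensors remain at higher precision.

```python
# Ideal vs. mixed-precision compression ratio per parameter (illustrative numbers only).

def compression_ratio(fp32_fraction, low_bits=2):
    """FP32 model size divided by mixed-precision size, ignoring packing overheads."""
    bits_per_param = fp32_fraction * 32 + (1 - fp32_fraction) * low_bits
    return 32 / bits_per_param

print(f"all weights at 2 bits: {compression_ratio(0.0):.2f}x")     # 16.00x ideal bound
print(f"0.18% kept in FP32:    {compression_ratio(0.0018):.2f}x")  # ~15.58x, near the reported figure
```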
B. VGG16-SSD on VOC

We used the Single Shot Detection (SSD) [17] model with a VGG16 [31] backbone as part of our object detection performance analysis. As evident from the graph in Figure 7, we achieved 3.19x and 2.95x speedups on the Raspberry Pi 3B+ and Raspberry Pi 4B, respectively. The most impressive aspect of the quantized model is that the significant speedup and compression come at the cost of less than a 0.02 drop in mAP.

Fig. 6. Accuracy/performance benchmark of DeepliteRT on VGG16-SSD300 [17] model on Pascal VOC [9] object detection dataset (2A/2W: weights and activations quantized to 2 bits)

C. YOLOv5s and YOLOv5m on VOC

Our classification results are state-of-the-art for both accuracy and latency on Arm hardware with limited computational power such as the Cortex-A53 in the Raspberry Pi 3B+. The VGG16-SSD [17] detection results were also promising, but even the best performing model on the Raspberry Pi 4B takes more than a second for a single inference, so it might not be practical for commercial applications. To address this challenge, we extended our quantization support to YOLOv5 [1], a state-of-the-art object detection architecture. To understand how our ultra-low precision models compare to TFLite with XNNPACK [10][12] (FP16 model from the Ultralytics repository [1]) and full-precision ONNX Runtime results on Arm CPUs, we used YOLOv5s and YOLOv5m from the Ultralytics repository and trained them on the person class of the VOC [9] dataset.

Fig. 7. Accuracy/performance benchmark of DeepliteRT on ResNet18 and ResNet50 models on ImageNet dataset. Lower bars mean lower inference time; DLRT is only 50% slower compared to embedded GPU performance.

Our performance benchmarks show speedups of up to 2.2x over TFLite with XNNPACK [10][12] and 3.2x over ONNX Runtime [11] for both YOLOv5s and YOLOv5m [1] on the Raspberry Pi 4B. As Figure 8 shows, we can even achieve 9 FPS for YOLOv5s and 3 FPS for YOLOv5m on this target device. By combining accurate, SOTA models like YOLOv5 with accessible hardware like the Arm Cortex-A CPU inside the Raspberry Pi [19], we are creating new possibilities for vision applications at the edge for the AI community.

Fig. 8. DeepliteRT performance benchmark for YOLOv5m and YOLOv5s on 320px images (TFLite performs poorly without the XNNPACK delegate and can be even slower than the FP32 models)

D. YOLOv5n on COCO 8 Classes

We performed an extensive evaluation of our quantization framework and runtime on a subset of the MS-COCO dataset. The COCO dataset consists of 80 classes, but we evaluated on a subset containing the classes that are most relevant for real-life use cases: person, dog, cat, car, bus, truck, bicycle, and motorcycle. Table I shows the latency improvement of our 2-bit model running on the Arm Cortex-A53 processor. Inference time for our quantized model is 2.54x faster than the baseline with approximately a 1% accuracy drop. Since it is extremely challenging to quantize already compact models, we use a mixed precision approach, keeping a few quantization-sensitive layers in FP32 and quantizing the rest down to 2 bits. Based on the results, mixed precision ultra-low bit quantization is an effective method to increase the speedup of compact models with a minimal drop in accuracy.

TABLE I. DEEPLITERT BENCHMARK ON YOLOV5N / COCO 8 CLASSES (352px input, Arm Cortex-A53)

Model                                      | Quantization approach | mAP   | Latency (ms)
YOLOv5n - FP32                             | No quantization       | 0.424 | 250
YOLOv5n - Mixed precision (FP32 and 2-bit) | Conservative          | 0.414 | 98.371
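One common way to choose which layers stay in FP32 for such a mixed-precision configuration is a per-layer sensitivity sweep: fake-quantize one layer at a time, measure the metric drop, and keep the most sensitive layers at full precision. The PyTorch sketch below illustrates that generic idea only; it is not the selection procedure used in Deeplite Neutrino, and the tiny model, the naive 2-bit scale, and the evaluate stand-in are illustrative assumptions.

```python
# Generic per-layer sensitivity sweep for mixed-precision quantization (illustrative).
import torch
import torch.nn as nn

_VAL_BATCH = torch.randn(8, 3, 32, 32)   # stand-in for a real validation set

def fake_quantize_2bit(w):
    """Symmetric 2-bit fake quantization (b = 2: Q_N = 2, Q_P = 1) with a naive
    max-based scale; real tools learn the scale during training as in Section IV."""
    s = w.abs().max().item() / 2 or 1.0
    return torch.clamp(torch.round(w / s), -2, 1) * s

def evaluate(model):
    """Stand-in for a real validation metric such as mAP or top-1 accuracy."""
    with torch.no_grad():
        return model(_VAL_BATCH).softmax(dim=1).max(dim=1).values.mean().item()

def layer_sensitivity(model):
    """Metric drop when each conv/linear layer is quantized to 2 bits in isolation."""
    baseline = evaluate(model)
    drops = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            original = module.weight.data.clone()
            module.weight.data = fake_quantize_2bit(original)
            drops[name] = baseline - evaluate(model)
            module.weight.data = original            # restore the FP32 weights
    return drops

if __name__ == "__main__":
    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
    drops = layer_sensitivity(model)
    keep_fp32 = sorted(drops, key=drops.get, reverse=True)[:1]   # most sensitive layer(s)
    print("layers kept in FP32:", keep_fp32)
    print("layers quantized to 2 bits:", [n for n in drops if n not in keep_fp32])
```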
VIII. CONCLUSION

This paper presents a novel ultra-low bit quantization method and a corresponding runtime that enable executing ultra-low bit quantized models on Arm CPUs. The clear memory benefits (up to 16x compression with 2-bit quantization) are complemented by faster arithmetic on low-cost CPUs. The result is a 2-5x speedup over existing FP32 and INT8 runtime frameworks, approaching GPU-level latency on a commodity Arm processor, as demonstrated by the image classification benchmarks on the Arm Cortex-A57 CPU in the NVIDIA Jetson Nano. We successfully applied quantization to complex SOTA models including VGG16-SSD, the YOLO family and ResNet18/50. DeepliteRT yielded significant inference acceleration over the next-best alternatives, reinforcing the promise of ultra-low bit CNN models on low-cost CPUs.

REFERENCES

[1] G. Jocher, Ultralytics, GitHub repository, 2019. [Online]. Available: https://github.com/ultralytics/yolo5
[2] A. G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," arXiv preprint arXiv:1512.03385, 2015.
[4] A. Sankaran et al., "Deeplite Neutrino: An End-to-End Framework for Constrained Deep Learning Model Optimization," CoRR, vol. abs/2101.04073, 2021. [Online]. Available: https://arxiv.org/abs/2101.04073
[5] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter, "NVIDIA Tensor Core Programmability, Performance & Precision," 2018. [Online]. Available: https://arxiv.org/abs/1803.04014
[6] B. Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference," CoRR, vol. abs/1712.05877, 2017. [Online]. Available: http://arxiv.org/abs/1712.05877
[7] M. Cowan, T. Moreau, T. Chen, J. Bornholt, and L. Ceze, "Automatic generation of high-performance quantized machine learning kernels," in Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020), 2020, pp. 305–316. doi: 10.1145/3368826.3377912.
[8] A. Chowdhery, P. Warden, J. Shlens, A. Howard, and R. Rhodes, "Visual Wake Words Dataset," CoRR, vol. abs/1906.05721, 2019. [Online]. Available: http://arxiv.org/abs/1906.05721
[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes Challenge: A Retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.
[10] S. Li, "TensorFlow Lite: On-Device Machine Learning Framework," Journal of Computer Research and Development, vol. 57, no. 9, pp. 1839–1853, 2020.
[11] ONNX Runtime developers, ONNX Runtime, 2021. [Online]. Available: https://onnxruntime.ai/
[12] Google, XNNPACK, GitHub repository, 2019. [Online]. Available: https://github.com/google/XNNPACK
[13] Q. Han et al., "Extremely Low-Bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures," 2020. doi: 10.1145/3404397.3404407.
[14] J. M. Alarcón, A. N. H. Blin, M. J. V. Vacas, and C. Weiss, "Exploring hyperon structure with electromagnetic transverse densities," arXiv, 2018. doi: 10.48550/ARXIV.1802.00479.
[15] M. Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. [Online]. Available: https://www.tensorflow.org/
[16] A. Paszke et al., "PyTorch: An Imperative Style, High-Performance Deep Learning Library," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024–8035.
[17] W. Liu et al., "SSD: Single Shot MultiBox Detector," CoRR, vol. abs/1512.02325, 2015. [Online]. Available: http://arxiv.org/abs/1512.02325
[18] S. Yun and A. Wong, "Do All MobileNets Quantize Poorly? Gaining Insights into the Effect of Quantization on Depthwise Separable Convolutional Networks Through the Eyes of Multi-scale Distributional Dynamics," CoRR, vol. abs/2104.11849, 2021. [Online]. Available: https://arxiv.org/abs/2104.11849
[19] W. Gay, Raspberry Pi Hardware Reference, 1st ed. USA: Apress, 2014.
[20] T.-Y. Lin et al., "Microsoft COCO: Common Objects in Context," CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[21] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations," CoRR, vol. abs/1609.07061, 2016. [Online]. Available: http://arxiv.org/abs/1609.07061
[22] U. Kulkarni, S. M. Meena, S. V. Gurlahosur, and G. Bhogar, "Quantization Friendly MobileNet (QF-MobileNet) Architecture for Vision Based Applications on Embedded Platforms," Neural Networks, vol. 136, pp. 28–39, 2021. doi: 10.1016/j.neunet.2020.12.022.
[23] R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry, "ACIQ: Analytical Clipping for Integer Quantization of neural networks," CoRR, vol. abs/1810.05723, 2018. [Online]. Available: http://arxiv.org/abs/1810.05723
[24] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, and T. Blankevoort, "A White Paper on Neural Network Quantization," CoRR, vol. abs/2106.08295, 2021. [Online]. Available: https://arxiv.org/abs/2106.08295
[25] D. Salvator, H. Wu, M. Kulkarni, and N. Emmart, "Int4 Precision for AI Inference," NVIDIA Developer Blog, Nov. 2019. [Online]. Available: https://developer.nvidia.com/blog/int4-for-ai-inference/
[26] M. Courbariaux and Y. Bengio, "BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
[27] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/abs/1603.05279
[28] T. Chen et al., "TVM: End-to-End Optimization Stack for Deep Learning," CoRR, vol. abs/1802.04799, 2018. [Online]. Available: http://arxiv.org/abs/1802.04799
[29] NVIDIA TensorRT Introduction. [Online]. Available: https://developer.nvidia.com/tensorrt (accessed 21 May 2022).
[30] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997. doi: 10.1162/neco.1997.9.8.1735.
[31] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[32] S. Choi, K. Shim, J. Choi, W. Sung, and B. Shim, "TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference," in SiPS, 2021, pp. 111–116. doi: 10.1109/SiPS52927.2021.00028.
[33] J. Won, J. Si, S. Son, T. J. Ham, and J. W. Lee, "ULPPACK: Fast Sub-8-bit Matrix Multiply on Commodity SIMD Hardware," in Proceedings of Machine Learning and Systems, vol. 4, 2022, pp. 52–63.