Real-time Face Mask Detection System on Edge using Deep Learning and Hardware Accelerators

Stavan Ruparelia, Electronics and Communication Engineering Department, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India. stavanrupareliya7878@gmail.com
Monil Jethva, Electronics and Communication Engineering Department, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India. jethvamonil99@gmail.com
Ruchi Gajjar, Electronics and Communication Engineering Department, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India. ruchi.gajjar@nirmauni.ac.in

Abstract—Real-time face mask detection using Artificial Intelligence is one of the most advanced ways of detecting face masks and their wearing condition in public or private areas. In this work, a system based on object detection models is proposed that can detect and classify mask-wearing conditions in real time. The system is implemented with two recent deep convolutional neural networks, YOLOv5s and YOLOv5l. The proposed system can efficiently detect and classify face masks based on their wearing condition, count them, and store the count in a CSV file along with a timestamp. To perform real-time inference, the deep learning models were deployed on the Nvidia Jetson Nano and Jetson Xavier NX, embedded platforms designed for Edge AI. The detection algorithms achieved mAP of 86.43 and 92.49 for YOLOv5s and YOLOv5l respectively. Comparing the two detection models, YOLOv5l achieved a higher mAP than YOLOv5s, while comparing the frame rates on both devices, the Nvidia Jetson Xavier NX provides higher fps than the Nvidia Jetson Nano for real-time inference.

Index Terms—COVID-19, Edge AI, Hardware Accelerator, Real-time Detection, YOLOv5

I. INTRODUCTION

The world is facing a global crisis due to the outbreak of the Covid-19 virus, affecting not only human health but also the economy as a whole. More than 244 million Covid-19 cases, including 4 million deaths caused by this infectious virus, have been confirmed by the World Health Organization (WHO) as of 27th Oct. 2021 [1]. Even with vaccines being administered at a high rate in different countries, the threat of the virus spreading remains a concern. Vaccination is not the only solution; to limit the spread of the virus, it is recommended to take Public Health and Social Measures (PHSM) such as wearing a face mask, avoiding or limiting public gatherings, and restricting domestic and international movement [2]. Covering the face or wearing a face mask is one of the essential steps to restrict the transmission of Coronavirus. Even though wearing a face mask is mandatory in many countries, many people neglect to follow this norm. While some do not wear a mask at all in public places, others wear it incorrectly, with the nose left uncovered. This is a threat not only to the person himself but to all people in the vicinity. One-to-one surveillance is also not a viable option, given the population and task force in many countries.
Hence, if a system can be devised that automatically detects whether a person is wearing a mask and can also distinguish between mask-wearing conditions, it would be effective in raising an alert and alarming the authorities when the Covid-appropriate mask protocol is not followed. In this work, we propose a neural network based face mask detection system that efficiently detects face masks, distinguishes between mask-wearing conditions (proper mask, improper mask, and no mask), and also counts the people wearing masks in order to limit social gatherings, as shown in Fig. 1. The trained deep learning models are deployed on edge AI devices for real-time inference on an actual video feed.

Fig. 1. Basic flow of the proposed face mask detection system

The paper is organized as follows: the literature survey of face mask detection using AI and of object detection using AI and embedded systems is discussed in Section 2. The proposed face mask detection system implementation flow, dataset details, and hardware accelerators are described in detail in Section 3. Section 4 gives details about the detection models' training setup as well as the real-time inference results. Section 5 concludes the work.

II. LITERATURE REVIEW

A. Face Mask Detection Using AI

With the emerging applications of machine learning, object detection has gained a lot of attention. Deep learning architectures have proved efficient in discriminating multiple classes of objects within a frame. YOLO is one such convolutional neural network architecture designed to detect objects. Researchers have used various versions of YOLO for the task of face mask detection. A mask detector based on YOLOv2 with ResNet-50 was proposed by [3]. YOLOv3 and Faster R-CNN were used by [4] to distinguish people wearing masks from people without masks. Images of people wearing or not wearing a mask were classified using YOLOv3 by [5] and YOLOv4 by [6]. [7] proposed a dataset covering four different mask types, with and without masks, incorrect masks, and the mask area; the classification was done using YOLO v3 and v4, giving an mAP of 71.69%. YOLOv4 was used to classify people wearing correct masks, improper masks, or no masks in [8] with an accuracy of 98.90%. Classification of unmasked faces, masked faces, and people using YOLOv4, along with detection of social distancing and notification of its violation, was done by [9]. Similar to YOLO, [10] used SSD and MobileNetV2 to perform real-time face mask detection.

It is evident from this survey that the published works have used versions of YOLO for mask detection. The majority of the works aim to classify only whether a person is wearing a mask or not; very few report discrimination of improper mask-wearing conditions. Also, there is limited or no work on another aspect of limiting the spread of Covid-19, i.e. restricting the number of people in a gathering. In this paper, we not only classify the above mentioned three mask conditions but ensure that they are effectively detected under varying size, illumination, angle, and colour, along with a count of people for social gathering restrictions. The complete system is implemented on hardware accelerated edge AI devices for real-time inference.

B. Object Detection using AI and Embedded Systems

An important aspect of implementing a machine learning model for real-time applications is proper inference and the choice of embedded platform.
There are works reported in the literature where machine learning models, specifically object detection models, are implemented on hardware accelerated edge AI devices such as Nvidia's Jetson modules (Nano, TX1, Jetson Xavier NX) and Intel's NCS and NCS2. Table I summarizes various works on the deployment of detection models on edge devices along with the achieved fps.

TABLE I
STUDY OF VARIOUS ALGORITHMS & ARCHITECTURES DEPLOYED ON EDGE

Author | Detection Model          | Hardware                       | fps
[11]   | Proposed DCNN            | Jetson TX1                     | 1.6
[12]   | Customized YOLOv3-tiny   | Jetson Nano                    | 8
[13]   | SSD                      | Intel NCS 2 / Intel NCS        | 5 / 4
[14]   | YOLOv3                   | Jetson TX1                     | 8.5
[15]   | YOLOv4                   | Jetson Xavier NX               | 10
[16]   | Proposed YOLO Nano       | Jetson TX1                     | 14.68
[17]   | YOLOv3 Tiny, YOLOv4 Tiny | Jetson Xavier NX / Jetson Nano | 8 / 24

However, not much work has been reported on deploying a face mask detection model on an embedded platform for real-time inference. In this work, we port our models onto the Nvidia Jetson Nano and Jetson Xavier NX and present a comparison of real-time inference for YOLOv5l and YOLOv5s.

III. PROPOSED SYSTEM METHODOLOGY

A. Proposed System Block Diagram

For surveillance of whether individuals are wearing a mask correctly in public or corporate places, the first step is to train the detection models with the required dataset. For this, images were collected from different sources, processed, and annotated to create a genuine dataset; the steps of creating this dataset are discussed in the following subsection. Transfer learning was applied to pre-trained models using the collected dataset, and new weights were generated. The trained detection algorithm can detect and classify faces into three categories: Proper Mask, Improper Mask, and No Mask. The model architectures and new weights were deployed on hardware accelerated devices, with a camera attached to capture the real-time video stream. The device generates the inference results using the trained deep learning models along with a CSV file that stores the timestamp and the class-wise count of each class detected in each frame. Fig. 2 represents the proposed approach used in this work.

Fig. 2. Proposed real-time face mask detection system implementation flow

B. Dataset Preparation

1) Data Collection: To create a genuine dataset for this work, we needed images that cover all classes and a range of situations, from individual portraits to crowded scenes. For this purpose, images were collected from different sources such as the internet, news, and social media. A large number of images were collected, out of which 3000 were selected to create the dataset. Some of the selected images are shown in Fig. 3.

Fig. 3. Images from the dataset

2) Data Pre-processing: After collecting the data, the images were separated class-wise, resulting in 1300 images of the Proper Mask class, 1100 images of the No Mask class, and 600 images of the Improper Mask class. As detection models require input images of a specific shape, all images were reshaped to 512 x 512 x 3 using a Python script to maintain the uniformity of the dataset.

3) Data Augmentation: As a sufficiently large dataset is necessary for training the detection models, image augmentation was used to generate more images. For augmentation, different image processing techniques such as flipping, shifting, zooming, and brightness manipulation were applied, as sketched below.
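The augmentation code itself is not given in the paper; the following is a minimal sketch, assuming OpenCV, of the kinds of transformations listed above applied to an already resized 512 x 512 x 3 image. The shift, zoom, and brightness amounts and the file paths are illustrative assumptions. Since annotation was performed after augmentation (Section III-B), the augmented images are simply labelled afterwards rather than having their bounding boxes transformed here.

```python
# Minimal augmentation sketch (illustrative only): horizontal flip, shift,
# zoom, and brightness manipulation on an already resized 512x512x3 image.
import cv2
import numpy as np

def augment(image: np.ndarray) -> list:
    h, w = image.shape[:2]
    augmented = []

    # Horizontal flip
    augmented.append(cv2.flip(image, 1))

    # Shift (translate) by roughly 10% of the width and height
    m_shift = np.float32([[1, 0, 0.1 * w], [0, 1, 0.1 * h]])
    augmented.append(cv2.warpAffine(image, m_shift, (w, h)))

    # Zoom: crop the central 80% and resize back to the original shape
    y0, x0 = int(0.1 * h), int(0.1 * w)
    crop = image[y0:h - y0, x0:w - x0]
    augmented.append(cv2.resize(crop, (w, h)))

    # Brightness manipulation in HSV space
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 2] = np.clip(hsv[..., 2] + 40, 0, 255)
    augmented.append(cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR))

    return augmented

# Example usage (hypothetical paths):
# img = cv2.resize(cv2.imread("dataset/proper_mask/img001.jpg"), (512, 512))
# for i, aug in enumerate(augment(img)):
#     cv2.imwrite(f"dataset/proper_mask/img001_aug{i}.png", aug)
```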
The data augmentation resulted in a final dataset of 1350 Proper Mask images, 1150 No Mask images, and 850 Improper Mask images.

4) Data Annotation: To train the detection models, each image must be fed to the neural network together with annotations of the objects present in it. Hence each image of the dataset was annotated manually using the LabelImg annotation tool [18], and the annotations were stored in txt file format.

C. Detection Model Architecture

Face mask detection and classification for surveillance purposes is a challenging problem, as it covers different crowd conditions, varying angles and illumination, dissimilar mask patterns, and much more. Another challenge is to classify mask-wearing conditions into three categories, i.e. Proper Mask, Improper Mask, and No Mask, since the Improper Mask class can easily be confused with either Proper Mask or No Mask. To address this challenge, we chose one of the latest and best performing object detection algorithms, YOLOv5 [19]. YOLOv5 can be partitioned into three parts: feature extraction, feature fusion, and detection layers. Feature extraction is performed by CSPDarknet [20], the backbone of the CNN. The PANet [21] architecture is used to fuse the features extracted by the backbone. For the final detection and classification of the objects in the input image, detection layers are used, which generate bounding boxes, object classes, and confidence scores. Two versions of YOLOv5 are used in this work: YOLOv5s and YOLOv5l. The overall architecture is similar for both; the difference lies in the number of convolution layers and activation functions.

D. Description of Hardware Accelerators

Edge AI is a concept that involves generating real-time inferences on hardware accelerated embedded devices. As training is computationally intensive, it was performed on Google Colab's Tesla K80 GPU. To perform the real-time inference, the architecture with the generated weights is deployed on hardware accelerated devices. For deployment, we chose two powerful devices from Nvidia: the Nvidia Jetson Nano [22] and the Jetson Xavier NX [23]. The specifications of the selected hardware are given in Table II.

TABLE II
COMPARISON OF HARDWARE TECHNICAL SPECIFICATIONS

Specification  | Jetson Nano                   | Jetson Xavier NX
GPU            | 128-core NVIDIA Maxwell       | 384 NVIDIA CUDA cores and 48 Tensor cores
CPU            | Quad-core Arm A57 @ 1.43 GHz  | 6-core NVIDIA Carmel ARM 64-bit
DL Accelerator | NA                            | 2x NVDLA Engines
Memory         | 4GB LPDDR4                    | 8GB 128-bit LPDDR4x
Storage        | microSD Card                  | microSD Card
Power          | 5V                            | 19V
Price          | 9,299                         | 36,250

IV. EXPERIMENTAL SETUP AND RESULTS

A. Training Setup

Training the detection models is one of the most crucial and challenging parts. First, the collected dataset was split into a Training Dataset, a Validation Dataset, and a Testing Dataset in the proportion 14:3:3. As training the detection algorithms requires high processing power, the detection models were trained on Google Colab, which allocated an Nvidia Tesla K80 GPU environment for the training session. The deep learning models were trained for 25 epochs on 2345 training images and validated at the same time on 502 validation images.
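The exact training invocation is not given in the paper; the sketch below illustrates how such a transfer-learning run is typically launched with the ultralytics/yolov5 repository [19], assuming the repository has been cloned with its requirements installed and the images and txt labels arranged in the layout the repository expects. The directory paths, file names, and batch size are assumptions, not values taken from the paper.

```python
# Sketch of a transfer-learning run with the ultralytics/yolov5 repository.
# Run from inside the cloned yolov5 directory; repeat with yolov5l.pt for
# the larger model.
import subprocess
from pathlib import Path

# Dataset description file expected by YOLOv5 (three classes, as in this work).
data_yaml = """
train: ../face_mask_dataset/images/train
val: ../face_mask_dataset/images/val
nc: 3
names: ['Proper Mask', 'Improper Mask', 'No Mask']
"""
Path("face_mask.yaml").write_text(data_yaml)

# Fine-tune the pretrained YOLOv5s checkpoint for 25 epochs at 512x512 input.
subprocess.run(
    [
        "python", "train.py",
        "--img", "512",
        "--batch", "16",          # assumed batch size
        "--epochs", "25",
        "--data", "face_mask.yaml",
        "--weights", "yolov5s.pt",
    ],
    check=True,
)
```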
The curves of the different loss components per epoch are shown in Fig. 4. As can be observed, the loss was high in the early phase of training, but over successive epochs the model parameters were updated and the models converged to their best performance.

Fig. 4. Loss vs epoch curves of the detection models

We also compared the performance of our trained detection models with YOLOv3 and YOLOv4; the comparison is presented in Table III. YOLOv5s performed slightly less accurately than YOLOv3 and YOLOv4, whereas YOLOv5l outperformed all the other detection models with the highest mAP of 92.49.

TABLE III
PERFORMANCE COMPARISON AMONG VARIOUS DETECTION MODELS

Model   | mAP   | Precision | Recall | F1 Score
YOLOv3  | 89.89 | 0.85      | 0.92   | 0.89
YOLOv4  | 89.58 | 0.86      | 0.92   | 0.89
YOLOv5s | 86.43 | 0.84      | 0.79   | 0.82
YOLOv5l | 92.49 | 0.89      | 0.93   | 0.91

B. Real-time Inference Results

To run the detection models in real time, the trained deep learning models were deployed on the Nvidia Jetson Xavier NX and Jetson Nano. Real-time inference results are displayed in Fig. 5. As can be seen, the proposed detection models can accurately detect and classify face masks in a crowded area. In Fig. 5, results (a) and (b) were generated using YOLOv5s, whereas results (c) and (d) were generated using YOLOv5l.

Fig. 5. YOLOv5s & YOLOv5l real-time inference results

Comparing the inference results of YOLOv5s and YOLOv5l, both detection models performed very well, but in certain cases YOLOv5l produced better results. Comparing result (c) with result (a), YOLOv5s missed one face mask in the image, whereas YOLOv5l detected all the face masks present and classified them in the real-time stream under various conditions. In terms of the confidence of the detected class, YOLOv5l was also more confident about the detected face masks than YOLOv5s. Deployment on the Nvidia Jetson Xavier NX provided 30 and 24 fps for the YOLOv5s and YOLOv5l models respectively, whereas deployment on the Nvidia Jetson Nano provided 12 and 8 fps respectively. Hence, we can conclude that the Nvidia Jetson Xavier NX provides higher fps compared to the Nvidia Jetson Nano.
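As described in Section III-A, during live inference the system appends the class-wise count of detections in each frame to a CSV file together with a timestamp. The loop below is a minimal sketch of how this can be done with a custom YOLOv5 checkpoint loaded through torch.hub and an attached camera; the checkpoint path, camera index, CSV layout, and display step are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of a real-time counting and CSV-logging loop on the Jetson board.
import csv
from datetime import datetime

import cv2
import torch

CLASSES = ["Proper Mask", "Improper Mask", "No Mask"]

# Load the custom-trained checkpoint through torch.hub (fetches the
# ultralytics/yolov5 code on first use); 'best.pt' is an assumed file name.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

cap = cv2.VideoCapture(0)  # USB/CSI camera attached to the Jetson board

with open("mask_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp"] + CLASSES)

    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break

        # YOLOv5's AutoShape wrapper expects RGB input for numpy arrays.
        results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        # One row per detection; the "name" column holds the class label.
        names = results.pandas().xyxy[0]["name"].tolist()
        counts = [names.count(c) for c in CLASSES]
        writer.writerow([datetime.now().isoformat()] + counts)

        # Draw the predictions and show the annotated frame.
        annotated = cv2.cvtColor(results.render()[0], cv2.COLOR_RGB2BGR)
        cv2.imshow("Face mask detection", annotated)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
```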
V. CONCLUSION

In this paper, a system is proposed that can detect and classify human face masks and store their count in a CSV file in real time. Two modern deep learning algorithms, YOLOv5l and YOLOv5s, were trained on a face mask dataset assembled by us. After testing the trained detection models on various images, YOLOv5l achieved an mAP of 92.49 whereas YOLOv5s achieved an mAP of 86.43; the experimental evaluation thus shows that YOLOv5l is more accurate than YOLOv5s. The proposed system was deployed on two edge AI devices, the Nvidia Jetson Xavier NX and the Nvidia Jetson Nano, to perform real-time inference of face mask detection and classification. Furthermore, the class-wise counts of detected face masks were stored in a CSV file along with the timestamp in real time. The system presented in this work can be integrated with real-time camera surveillance systems in crowded areas to detect face masks and perform further analysis based on the counts stored in the CSV files.

REFERENCES

[1] "WHO coronavirus (COVID-19) dashboard." [Online]. Available: https://covid19.who.int/
[2] [Online]. Available: https://covid19.who.int/measures
[3] M. Loey, G. Manogaran, M. H. N. Taha, and N. E. M. Khalifa, "Fighting against COVID-19: A novel deep learning model based on YOLOv2 with ResNet-50 for medical face mask detection," Sustainable Cities and Society, vol. 65, p. 102600, 2021.
[4] S. Singh, U. Ahuja, M. Kumar, K. Kumar, and M. Sachdeva, "Face mask detection using YOLOv3 and Faster R-CNN models: COVID-19 environment," Multimedia Tools and Applications, vol. 80, no. 13, pp. 19753–19768, 2021.
[5] M. R. Bhuiyan, S. A. Khushbu, and M. S. Islam, "A deep learning based assistive system to classify COVID-19 face mask for human safety with YOLOv3," in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2020, pp. 1–5.
[6] S. Abbasi, H. Abdi, and A. Ahmadi, "A face-mask detection approach based on YOLO applied for a new collected dataset," in 2021 26th International Computer Conference, Computer Society of Iran (CSICC). IEEE, 2021, pp. 1–6.
[7] A. Kumar, A. Kalia, K. Verma, A. Sharma, and M. Kaushal, "Scaling up face masks detection with YOLO on a novel dataset," Optik, vol. 239, p. 166744, 2021.
[8] S. Degadwala, D. Vyas, U. Chakraborty, A. R. Dider, and H. Biswas, "YOLO-v4 deep learning model for medical face mask detection," in 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS). IEEE, 2021, pp. 209–213.
[9] K. Bhambani, T. Jain, and K. A. Sultanpure, "Real-time face mask and social distancing violation detection system using YOLO," in 2020 IEEE Bangalore Humanitarian Technology Conference (B-HTC). IEEE, 2020, pp. 1–6.
[10] P. Nagrath, R. Jain, A. Madan, R. Arora, P. Kataria, and J. Hemanth, "SSDMNV2: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2," Sustainable Cities and Society, vol. 66, p. 102692, 2021.
[11] Y. Han and E. Oruklu, "Traffic sign recognition based on the NVIDIA Jetson TX1 embedded system using convolutional neural networks," in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2017, pp. 184–187.
[12] V. Mazzia, A. Khaliq, F. Salvetti, and M. Chiaberge, "Real-time apple detection system using embedded systems with hardware accelerators: An edge AI application," IEEE Access, vol. 8, pp. 9102–9114, 2020.
[13] R. Gajjar, N. Gajjar, V. J. Thakor, N. P. Patel, and S. Ruparelia, "Real-time detection and identification of plant leaf diseases using convolutional neural networks on an embedded platform," The Visual Computer, pp. 1–16, 2021.
[14] Y.-C. Chen, H. Fathoni, and C.-T. Yang, "Implementation of fire and smoke detection using DeepStream and edge computing approachs," in 2020 International Conference on Pervasive Artificial Intelligence (ICPAI). IEEE, 2020, pp. 272–275.
[15] S. Ruparelia, M. Jethva, and R. Gajjar, "Real-time tomato detection, classification, and counting system using deep learning and embedded systems," in Proceedings of the International e-Conference on Intelligent Systems and Signal Processing. Springer, 2022, pp. 511–522.
[16] L. Wang, X. Ye, H. Xing, Z. Wang, and P. Li, "YOLO Nano underwater: A fast and compact object detector for embedded device," in Global Oceans 2020: Singapore–US Gulf Coast. IEEE, 2020, pp. 1–4.
[17] M. Jethva, S. Ruparelia, and R. Gajjar, "Face mask detection and counting using deep learning and embedded systems," in a forthcoming conference.
[18] Tzutalin, "LabelImg," Free Software: MIT License, 2015. [Online].
Available: https://github.com/tzutalin/labelImg
[19] Ultralytics, "ultralytics/yolov5: YOLOv5 in PyTorch - ONNX - CoreML - TFLite." [Online]. Available: https://github.com/ultralytics/yolov5.git
[20] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, "CSPNet: A new backbone that can enhance learning capability of CNN," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[21] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[22] NVIDIA Developer, "Jetson Nano Developer Kit," 2019. [Online]. Available: https://developer.nvidia.com/embedded/jetson-nano-developer-kit
[23] "Jetson Xavier NX Developer Kit," May 2020. [Online]. Available: https://developer.nvidia.com/embedded/jetson-xavier-nx-devkit