Real-time Face Mask Detection System on Edge
using Deep Learning and Hardware Accelerators
2021 2nd International Conference on Communication, Computing and Industry 4.0 (C2I4) | 978-1-6654-2013-6/21/$31.00 ©2021 IEEE | DOI: 10.1109/C2I454156.2021.9689421
Stavan Ruparelia
Electronics and Communication Engineering Department,
Institute of Technology, Nirma University,
Ahmedabad, Gujarat, India.
Monil Jethva
Electronics and Communication Engineering Department,
Institute of Technology, Nirma University,
Ahmedabad, Gujarat, India.
Ruchi Gajjar
Electronics and Communication Engineering Department,
Institute of Technology, Nirma University,
Ahmedabad, Gujarat, India.
Abstract—Real-time face mask detection using Artificial
Intelligence offers an automated way of detecting face masks
and their wearing condition in public or private areas. In this
work, a system based on object detection models is proposed
that detects and classifies mask-wearing conditions in real time.
The system is implemented with two recent deep convolutional
neural networks: YOLOv5s and YOLOv5l. The proposed system
detects and classifies face masks based on their wearing
condition, counts them, and stores the count in a CSV file
with a timestamp. To perform real-time inference, the deep
learning models were deployed on the Nvidia Jetson Nano and
Jetson Xavier NX, which are embedded platforms for Edge AI.
The detection algorithms achieved mAP of 86.43 and 92.49 for
YOLOv5s and YOLOv5l respectively. YOLOv5l thus achieved a
higher mAP than YOLOv5s, while the Nvidia Jetson Xavier NX
delivered more fps than the Nvidia Jetson Nano for real-time
inference.
Index Terms—COVID-19, Edge AI, Hardware Accelerator,
Real-time Detection, YOLOv5
The outbreak of the Covid-19 virus has caused a
global crisis, affecting not only human health but also the
economy as a whole. More than 244 million Covid-19 cases,
including 4 million deaths, have been confirmed by the
World Health Organization (WHO)
as of 27th Oct. 2021 [1]. Even with a high rate of vaccination
in different countries, the spread of the
virus is still a concern. Vaccination alone is
not sufficient; to limit the spread of this virus,
it is recommended to take Public Health and Social Measures
(PHSM) such as wearing a face mask, avoiding or limiting public
gatherings, and restricting domestic and international movement [2].
Covering the face or wearing a face mask is one of the
essential steps to restrict the transmission of Coronavirus. Even
though it is a mandate to wear a face mask in many countries,
many people are showing negligence in following this norm.
While some do not wear a mask at all in public places, others
wear it incorrectly, leaving the nose uncovered. This is a
threat not only to the person concerned but to all people in the
vicinity. One-to-one surveillance is also not a viable option,
given the population and the available task force in many countries.
Hence, a system that automatically detects
whether a person is wearing a mask and also differentiates the wearing conditions of the mask would be effective in
alerting the authorities when the Covid-appropriate
mask protocol is not followed. In this work, we
propose a neural network based face mask detection system
that detects face masks, distinguishes between three
mask-wearing conditions (proper mask, improper mask, and no
mask), and counts the people wearing masks in order to
limit social gatherings, as shown in Fig. 1. The trained deep
learning model is deployed on edge AI devices for real-time inference
on a live video feed.
The rest of the paper is organized as follows. Section 2
presents the literature survey of face mask detection using AI
and of object detection using AI and embedded systems. Section 3
describes the proposed face mask system implementation flow,
dataset details, and hardware accelerators.
Section 4 gives details of the detection models’ training
setup and the real-time inference results. Section 5
concludes our work.
A. Face Mask Detection Using AI
With the emerging applications of machine learning, object
detection has gained a lot of attention. Deep learning architectures have proved efficient in discriminating multiple classes of
objects within a frame. YOLO is one such convolutional neural
network architecture designed to detect objects, and researchers have
used various versions of YOLO for the task of face mask
detection. A mask detector based on YOLOv2 with
ResNet-50 was proposed in [3]. YOLOv3 and Faster R-CNN
were used in [4] to distinguish people wearing masks from people
without masks. Images of people wearing or not wearing
a mask were classified using YOLOv3 in [5] and YOLOv4
in [6]. The authors of [7] proposed a dataset with four categories: with mask, without mask, incorrect mask, and mask area.
The classification was done using YOLOv3 and YOLOv4, giving
an mAP of 71.69%. YOLOv4 was used in [8] to classify people
wearing correct masks, improper masks, or no masks
with an accuracy of 98.90%. Classification of unmasked faces,
masked faces, and people using YOLOv4, along with detection
of social distancing violations and notification of them, was done in
[9]. As an alternative to YOLO, the authors of [10] used SSD with MobileNetV2
to perform real-time face mask detection.
Fig. 1. Basic flow of the proposed face mask detection system
It is evident from this survey that most published works
use versions of YOLO for mask detection. The majority of
them aim to classify whether or not a person is wearing a mask.
Very few works have reported discrimination of improper
mask-wearing conditions. Also, there is limited or no work
on another aspect of limiting the spread of Covid-19,
i.e. restricting the number of people in a gathering. In this
paper, we not only classify the three mask conditions
mentioned above but also ensure that they are effectively
detected under varying size, illumination, angle, and colour, and
we count people to support social gathering restrictions. The
complete system is implemented on hardware accelerated edge
AI devices for real-time inference.
B. Object Detection using AI and Embedded System
An important aspect of implementing a machine learning
model for real-time applications is efficient inference and the
choice of embedded platform. Works have been reported in the
literature where machine learning models, specifically object
detection models, are implemented on hardware accelerated
edge AI devices such as Nvidia’s Jetson modules (Nano,
TX1, and Xavier NX) and Intel’s NCS and NCS2. Table
I summarizes various works on
the deployment of detection models, along with the achieved fps.
Table I. Deployment of detection models on embedded platforms in prior work
Detection models: proposed DCNN; proposed YOLO Nano; YOLOv3 Tiny and YOLOv4 Tiny
Hardware: Jetson TX1, Jetson Nano, Intel NCS 2, Intel NCS, Jetson Xavier NX
However, little work has been reported on deploying
face mask detection models on embedded platforms for
real-time inference. In this work, we deploy our models on the
Nvidia Jetson Nano and Jetson Xavier NX and present a
comparison of real-time inference for YOLOv5s and YOLOv5l.
A. Proposed System Block Diagram
To monitor whether individuals wear masks
correctly in public or corporate places, the first step is to
train the detection models on a suitable dataset. The
dataset was collected from
different sources, and the images were processed and annotated. The steps
of creating the dataset are discussed in the Dataset Preparation subsection below.
Transfer learning was applied to pre-trained models using
the collected dataset to generate new weights.
The trained detection models detect faces and classify them into
three categories: Proper Mask, Improper Mask, and
No Mask. The model architectures and new weights were
deployed on hardware accelerated devices, and a camera was
attached to each device to capture the real-time video
stream. The device generates inference results using
the trained deep learning models, along with a CSV file
that stores the timestamp and the count of each class
detected in each frame. The diagram in Fig. 2 represents the
proposed approach used in this work.
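The per-frame logging step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name log_frame_counts, the column layout, and the ISO timestamp format are assumptions, and the list of class labels stands in for whatever the detector returns for one frame.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Class names come from the paper; the column order is an assumption.
CLASSES = ["Proper Mask", "Improper Mask", "No Mask"]


def log_frame_counts(writer, detected_labels, timestamp=None):
    """Write one CSV row: a timestamp followed by the count of each class."""
    timestamp = timestamp or datetime.now()
    counts = Counter(detected_labels)
    row = [timestamp.isoformat(timespec="seconds")]
    row += [counts.get(name, 0) for name in CLASSES]
    writer.writerow(row)
    return row


# Usage with an in-memory buffer; a deployment would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["timestamp"] + CLASSES)  # header row
frame_labels = ["Proper Mask", "No Mask", "Proper Mask"]  # hypothetical detections
row = log_frame_counts(writer, frame_labels, datetime(2021, 10, 27, 12, 0, 0))
```

Keeping one row per frame with a fixed column per class makes it straightforward to sum the columns later, for example to check a gathering limit.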
B. Dataset Preparation
1) Data Collection: To create the dataset for this
work, we needed images that cover all classes and
different situations, ranging from pictures of individuals to
pictures of crowds. For this purpose, images were collected from different sources such as the internet, news, and social media. From
the large number of images collected, 3000
images were selected to create the dataset. Some of
the selected images are shown in Fig. 3.
2) Data Pre-processing: After collecting the data, images
were separated by class. This separation resulted in 1300 images of the Proper Mask class, 1100
images of the No Mask class, and 600 images of the Improper Mask
class. As detection models require a specific input shape,
all images were resized to 512 x 512 x 3 using a
Python script to keep the dataset uniform.
3) Data Augmentation: As a sufficiently large
dataset is necessary for training the detection model, we
used image augmentation to generate more
images. Different image processing
techniques such as flipping, shifting, zooming, and brightness
manipulation were applied. The data augmentation resulted in the final
dataset having 1350 Proper Mask images, 1150 No Mask
images, and 850 Improper Mask images.
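Three of the augmentation operations named above (flipping, shifting, and brightness manipulation) can be illustrated on a tiny grayscale image stored as a list of rows. This is only a sketch under stated assumptions: the function names are hypothetical, the paper does not say which library or parameters were used, and zooming is left out because it requires interpolation.

```python
def flip_horizontal(image):
    """Mirror each row of a 2-D image (a list of rows of pixel values)."""
    return [row[::-1] for row in image]


def shift_right(image, dx, fill=0):
    """Shift pixels dx columns to the right, padding the left edge with fill."""
    width = len(image[0])
    dx = max(0, min(dx, width))
    return [[fill] * dx + row[: width - dx] for row in image]


def adjust_brightness(image, factor):
    """Scale pixel intensities by factor and clip them to the 8-bit range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


image = [[10, 20, 30],
         [40, 50, 60]]
flipped = flip_horizontal(image)          # [[30, 20, 10], [60, 50, 40]]
shifted = shift_right(image, 1)           # [[0, 10, 20], [0, 40, 50]]
brighter = adjust_brightness(image, 1.5)  # [[15, 30, 45], [60, 75, 90]]
```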
4) Data Annotation: To train the detection models, each
image must be fed to the neural network with annotations of the
objects present in it. Hence each image of the dataset
was annotated manually using the LabelImg [18] annotation tool.
The annotations were stored in
txt file format.
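As an illustration of what such a txt annotation can look like, the sketch below converts a pixel bounding box into the YOLO text layout that LabelImg can export: a class index followed by the box centre, width, and height, all normalised by the image size. The class-to-index mapping, the function name to_yolo_line, and the example box are assumptions; the paper does not list them.

```python
# Assumed class ordering; the paper does not state the index of each class.
CLASS_IDS = {"Proper Mask": 0, "Improper Mask": 1, "No Mask": 2}


def to_yolo_line(label, box, image_size=(512, 512)):
    """Convert a pixel box (xmin, ymin, xmax, ymax) to one YOLO-format line."""
    xmin, ymin, xmax, ymax = box
    img_w, img_h = image_size
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{CLASS_IDS[label]} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical face box in a 512 x 512 image.
line = to_yolo_line("No Mask", (128, 64, 384, 320))
# line == "2 0.500000 0.375000 0.500000 0.500000"
```

Each image gets one txt file with one such line per annotated face.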
C. Detection Model Architecture
Face mask detection and classification for surveillance
purposes is a challenging problem, as it must handle different
crowd conditions, various angles and illumination conditions,
dissimilar mask patterns, and more. Another challenge
is to classify mask-wearing conditions into three categories, i.e.
Proper Mask, Improper Mask, and No Mask, as Improper Mask
can easily be confused with either Proper Mask or No
Mask. To address this challenge, we chose a
recent object detection algorithm, YOLOv5 [19].
YOLOv5 can be partitioned into three parts: Feature
Extraction, Feature Fusion, and Detection layers. Feature
extraction is done using CSPDarknet [20] as the
backbone of the CNN. The PANet [21] architecture is used to
fuse the features extracted by the backbone.
For the final detection and classification of
the objects in the input image, Detection Layers
generate bounding boxes, object classes, and confidence
scores. Two versions of YOLOv5 have been used in
this work, namely YOLOv5s and YOLOv5l. The
architecture is similar for both; they differ mainly
in network depth and width, i.e. the number of convolution layers and
channels.
D. Description of Hardware Accelerators
Edge AI is a concept that involves generating real-time
inferences on hardware accelerated embedded devices. As
training is computationally intensive, it was performed on Google Colab’s Tesla K80 GPU. To perform
real-time inference, the architecture with the generated weights
is deployed on hardware accelerated devices. For deployment,
we chose two hardware platforms from Nvidia:
the Nvidia Jetson Nano [22] and Jetson Xavier NX
[23]. The specifications of the selected hardware are given in
Table II.
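Table II. Specifications of the selected hardware accelerators
HW Accelerator: Jetson Nano
  GPU: 128-core NVIDIA Maxwell GPU
  CPU: Quad-core Arm A57 @ 1.43 GHz
  Storage: microSD Card
  Price: 9,299
HW Accelerator: Jetson Xavier NX
  GPU: ... cores and 48 Tensor cores
  CPU: 6-core NVIDIA Carmel ARM 64-bit
  DL Accelerator: 2x NVDLA Engines
  Memory: 8GB 128-bit LPDDR4x
  Storage: microSD Card
  Price: 36,250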
A. Training Setup
Training the detection models is one of the most
crucial and challenging parts of the work. First, our collected dataset was
split into three subsets: Training Dataset, Validation
Dataset, and Testing Dataset, in the proportion
14:3:3 respectively. As training the detection algorithms
requires high processing power, we trained our detection
models on Google Colab, which allocated an Nvidia
Tesla K80 GPU for the training session. The deep
learning models were trained for 25 epochs on 2345 training
images and validated at the same time on 502 validation
images. The graphs of the different
types of loss per epoch are shown in Fig. 4. As can
be observed, the loss was high in the starting phase of training,
but it decreased over multiple epochs as the model parameters
were updated.
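The reported image counts follow from the 14:3:3 split applied to the augmented dataset of 1350 + 1150 + 850 = 3350 images. The short check below reproduces the arithmetic; the helper name split_counts and the use of floor division are assumptions, and the paper does not state the size of the test set.

```python
def split_counts(total, ratio=(14, 3, 3)):
    """Split total images by ratio, rounding each share down."""
    parts = sum(ratio)
    return tuple(total * r // parts for r in ratio)


total_images = 1350 + 1150 + 850  # class totals after augmentation
train, val, test = split_counts(total_images)
# train == 2345 and val == 502, matching the counts given in the text
```

Rounding down leaves one of the 3350 images unassigned, so the actual test set may hold 502 or 503 images.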
Table III. Performance comparison of the detection models (metrics include F1 score)
We also compared the performance of our trained detection
models with YOLOv3 and YOLOv4. The performance comparison
is presented in Table III. It can be seen that
YOLOv5s performed slightly less accurately than
YOLOv3 and YOLOv4, whereas YOLOv5l outperformed all
other detection models with the highest mAP of 92.49.
Fig. 2. Proposed real-time Face mask detection system implementation flow
Fig. 3. Images from the Dataset
B. Real-time Inference Results
To run the detection models in real time, the trained
deep learning models were deployed on the Nvidia Jetson
Xavier NX and Jetson Nano. Real-time inference results are
displayed in Fig. 5. It can be seen that the proposed
detection models accurately detect face masks and classify
them in a crowded area. In Fig. 5, result(a) and result(b)
are the inference results generated using YOLOv5s, whereas
result(c) and result(d) are the inference results generated using
YOLOv5l. Both detection models performed well, but
at certain points YOLOv5l generated better results. Comparing
result(a) and result(c), YOLOv5s missed the detection of
one face mask in the image, whereas YOLOv5l
detected all the face masks present in the image and classified them in a real-time stream under various conditions.
In terms of the confidence of
the detected class, YOLOv5l was also more confident about the
detected face masks than YOLOv5s. Deployment
on the Nvidia Jetson Xavier NX provided 30 fps and 24 fps
for the YOLOv5s and YOLOv5l detection models respectively,
whereas deployment on the Nvidia Jetson Nano provided
12 fps and 8 fps for the YOLOv5s and YOLOv5l detection
models respectively. Hence we can conclude that the Nvidia Jetson Xavier NX
provides higher fps than the Nvidia Jetson Nano.
Fig. 4. Loss vs Epoch curve of the Detection models
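To put the reported frame rates in perspective, the sketch below derives the speedup of the Jetson Xavier NX over the Jetson Nano and the per-frame processing time for each model. The fps values are those reported above; the helper names speedup and frame_time_ms are hypothetical.

```python
# Frames per second reported in the text for each device and model.
FPS = {
    "Jetson Xavier NX": {"YOLOv5s": 30, "YOLOv5l": 24},
    "Jetson Nano": {"YOLOv5s": 12, "YOLOv5l": 8},
}


def speedup(model):
    """Ratio of Xavier NX fps to Nano fps for the given model."""
    return FPS["Jetson Xavier NX"][model] / FPS["Jetson Nano"][model]


def frame_time_ms(device, model):
    """Average time spent on one frame, in milliseconds."""
    return 1000 / FPS[device][model]


speedups = {model: speedup(model) for model in ("YOLOv5s", "YOLOv5l")}
# speedups == {"YOLOv5s": 2.5, "YOLOv5l": 3.0}
```

The larger model benefits more from the faster device: YOLOv5l takes 125 ms per frame on the Jetson Nano, which is one third of that on the Jetson Xavier NX.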
In this paper, a system is proposed that detects and classifies
human face masks and stores the count in a CSV file in real time.
Two modern deep learning algorithms, YOLOv5l and
YOLOv5s, were trained on the face mask dataset assembled
by us. When the trained detection models were tested on various
images, YOLOv5l achieved an mAP of 92.49 whereas YOLOv5s
achieved an mAP of 86.43, showing that YOLOv5l is more
accurate than YOLOv5s. The
proposed system was deployed on two edge AI devices, the Nvidia
Jetson Xavier NX and the Nvidia Jetson Nano, to perform
real-time inference of face mask detection and classification.
Furthermore, the class-wise counts of detected face masks
were stored in a CSV file along with the timestamp
in real time. The system presented in this work can be integrated
with real-time camera surveillance systems in crowded areas
to detect face masks and perform further analysis based on the
counts stored in the CSV files.
Fig. 5. YOLOv5s & YOLOv5l Real-time Inference Results
[1] “Who coronavirus (covid-19) dashboard.” [Online]. Available:
[2] [Online]. Available: https://covid19.who.int/measures
[3] M. Loey, G. Manogaran, M. H. N. Taha, and N. E. M. Khalifa,
“Fighting against covid-19: A novel deep learning model based on yolov2 with resnet-50 for medical face mask detection,” Sustainable cities
and society, vol. 65, p. 102600, 2021.
[4] S. Singh, U. Ahuja, M. Kumar, K. Kumar, and M. Sachdeva, “Face mask
detection using yolov3 and faster r-cnn models: Covid-19 environment,”
Multimedia Tools and Applications, vol. 80, no. 13, pp. 19 753–19 768,
[5] M. R. Bhuiyan, S. A. Khushbu, and M. S. Islam, “A deep learning
based assistive system to classify covid-19 face mask for human safety
with yolov3,” in 2020 11th International Conference on Computing,
Communication and Networking Technologies (ICCCNT). IEEE, 2020,
pp. 1–5.
[6] S. Abbasi, H. Abdi, and A. Ahmadi, “A face-mask detection approach
based on yolo applied for a new collected dataset,” in 2021 26th
International Computer Conference, Computer Society of Iran (CSICC).
IEEE, 2021, pp. 1–6.
[7] A. Kumar, A. Kalia, K. Verma, A. Sharma, and M. Kaushal, “Scaling
up face masks detection with yolo on a novel dataset,” Optik, vol. 239,
p. 166744, 2021.
[8] S. Degadwala, D. Vyas, U. Chakraborty, A. R. Dider, and H. Biswas,
“Yolo-v4 deep learning model for medical face mask detection,” in 2021
International Conference on Artificial Intelligence and Smart Systems
(ICAIS). IEEE, 2021, pp. 209–213.
[9] K. Bhambani, T. Jain, and K. A. Sultanpure, “Real-time face mask and
social distancing violation detection system using yolo,” in 2020 IEEE
Bangalore Humanitarian Technology Conference (B-HTC). IEEE, 2020,
pp. 1–6.
[10] P. Nagrath, R. Jain, A. Madan, R. Arora, P. Kataria, and J. Hemanth,
“Ssdmnv2: A real time dnn-based face mask detection system using
single shot multibox detector and mobilenetv2,” Sustainable cities and
society, vol. 66, p. 102692, 2021.
[11] Y. Han and E. Oruklu, “Traffic sign recognition based on the nvidia
jetson tx1 embedded system using convolutional neural networks,” in
2017 IEEE 60th International Midwest Symposium on Circuits and
Systems (MWSCAS). IEEE, 2017, pp. 184–187.
[12] V. Mazzia, A. Khaliq, F. Salvetti, and M. Chiaberge, “Real-time apple
detection system using embedded systems with hardware accelerators:
An edge ai application,” IEEE Access, vol. 8, pp. 9102–9114, 2020.
[13] R. Gajjar, N. Gajjar, V. J. Thakor, N. P. Patel, and S. Ruparelia,
“Real-time detection and identification of plant leaf diseases using
convolutional neural networks on an embedded platform,” The Visual
Computer, pp. 1–16, 2021.
[14] Y.-C. Chen, H. Fathoni, and C.-T. Yang, “Implementation of fire and
smoke detection using deepstream and edge computing approachs,”
in 2020 International Conference on Pervasive Artificial Intelligence
(ICPAI). IEEE, 2020, pp. 272–275.
[15] S. Ruparelia, M. Jethva, and R. Gajjar, “Real-time tomato detection,
classification, and counting system using deep learning and embedded
systems,” in Proceedings of the International e-Conference on Intelligent
Systems and Signal Processing. Springer, 2022, pp. 511–522.
[16] L. Wang, X. Ye, H. Xing, Z. Wang, and P. Li, “Yolo nano underwater:
A fast and compact object detector for embedded device,” in Global
Oceans 2020: Singapore–US Gulf Coast. IEEE, 2020, pp. 1–4.
[17] S. R. Monil Jethva and R. Gajjar, “Face mask detection and counting
using deep learning and embedded systems,” in forthcoming conference.
[18] Tzutalin, “Labelimg,” Free Software: MIT License, 2015. [Online].
Available: https://github.com/tzutalin/labelImg
[19] Ultralytics, “ultralytics/yolov5: Yolov5 in pytorch - onnx - coreml - tflite.” [Online]. Available: https://github.com/ultralytics/yolov5.git
[20] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H.
Yeh, “Cspnet: A new backbone that can enhance learning capability of
cnn,” in Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition workshops, 2020, pp. 390–391.
[21] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network
for instance segmentation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2018, pp. 8759–8768.
[22] N. Developer, “Nvidia jetson nano developer kit,” 2019.
[Online]. Available: https://developer.nvidia.com/embedded/jetson-nano-developer-kit
[23] “Jetson xavier nx developer kit,” May 2020. [Online]. Available: