IEEE SENSORS JOURNAL, VOL. 22, NO. 14, 15 JULY 2022
Video Hand Gestures Recognition Using Depth
Camera and Lightweight CNN
David González León, Jade Gröli, Sreenivasa Reddy Yeduri, Member, IEEE, Daniel Rossier,
Romuald Mosqueron, Om Jee Pandey, Senior Member, IEEE,
and Linga Reddy Cenkeramaddi, Senior Member, IEEE
Abstract—Hand gestures are a well-known and intuitive method of human-computer interaction. The majority of the research has concentrated on hand gesture recognition from RGB images; however, little work has been done on recognition from videos. In addition, RGB cameras are not robust under varying lighting conditions. Motivated by this, we present video-based hand gesture recognition using a depth camera and a lightweight convolutional neural network (CNN) model. We constructed a dataset and then used a lightweight CNN model to detect and classify hand movements efficiently. We also examined the classification accuracy with a limited number of frames in a video gesture. We compare the depth camera's video gesture recognition performance to that of the RGB camera. We evaluate the proposed model's performance on edge computing devices and compare it to benchmark models in terms of accuracy and inference time. The proposed model results in an accuracy of 99.48% on the RGB version of the test dataset and 99.18% on the depth version of the test dataset. Finally, we compare the accuracy of the proposed lightweight CNN model with state-of-the-art hand gesture classification models.
Index Terms—Hand gestures, human-computer interaction, video hand gestures, hand gesture recognition, RGB-D camera, lightweight CNN.
Manuscript received 27 April 2022; accepted 24 May 2022. Date of publication 14 June 2022; date of current version 14 July 2022. This work was supported by the Indo-Norwegian Collaboration in Autonomous Cyber-Physical Systems (INCAPS) by the International Partnerships for Excellent Education, Research and Innovation (INTPART) Program from the Research Council of Norway under Project 287918. The associate editor coordinating the review of this article and approving it for publication was Dr. Brajesh Kumar Kaushik. (Corresponding author: Linga Reddy Cenkeramaddi.)
David González León, Jade Gröli, Daniel Rossier, and Romuald Mosqueron are with the Information and Telecommunication Department, HEIG-VD Engineering School, REDS Institute, 1401 Yverdon-les-Bains, Switzerland (e-mail: david.gonzalezleon@heig-vd.ch; jade.groli@heig-vd.ch; daniel.rossier@heig-vd.ch; romuald.mosqueron@heig-vd.ch).
Sreenivasa Reddy Yeduri and Linga Reddy Cenkeramaddi are with the ACPS Research Group, Department of Information and Communication Technology, University of Agder, 4879 Grimstad, Norway (e-mail: sreenivasa.r.yeduri@uia.no; linga.cenkeramaddi@uia.no).
Om Jee Pandey is with the Department of Electronics Engineering, IIT BHU Varanasi, Varanasi 221005, India (e-mail: omjee.ece@iitbhu.ac.in).
Digital Object Identifier 10.1109/JSEN.2022.3181518

I. INTRODUCTION
Human action or gesture recognition has been extensively studied over the last decade [1], [2]. In recent years, artificial intelligence and sensor technology have gained popularity as a means of improving people's autonomy, the main aim being to help them carry out day-to-day activities more efficiently [3]. In this context, hand gesture recognition is an efficient solution due to its inherent applications, which include sign language detection [3], smart homes, autonomous vehicles [4], health care, Augmented Reality (AR) and Virtual Reality (VR) [5], driver monitoring in autonomous cars [6], and automatic surgical tasks [7].
Video-based gesture recognition under big data makes human-computer interaction more flexible and natural, bringing a richer interactive experience to teaching, gaming, and on-board control [8]. However, video-based gesture recognition faces many challenges due to factors such as camera movement, target scale transformation, viewing angle, dynamic background, and illumination. To address these, in this work we propose hand gesture recognition from videos captured using a depth camera. The key contributions of the proposed work are as follows:
• We created a dataset of video-based hand gestures using an RGB-D camera.
• We proposed a lightweight deep CNN model for hand gesture classification from video sequences.
• When compared to RGB camera-based hand gestures, the proposed depth camera-based hand gestures are more reliable and robust.
• Furthermore, when compared to image-based hand gesture recognition, the proposed video-based hand gestures have a number of practical applications.
• It is also simple to incorporate additional video gestures without requiring major changes to the proposed model.
• We also included a detailed analysis of a reduced number of frames in a video gesture, which is extremely useful in other domains as well.
• Through extensive experiments, we evaluate the performance of the proposed method in terms of classification accuracy and inference time.
• Further, we have deployed the model on an edge computing system to show its capability for real-time applications.
The rest of the paper is organized as follows: Section II presents the literature on video-based gesture recognition frameworks. Section III describes the details of the system and the dataset, and the proposed lightweight CNN model is elaborated in Section IV. The numerical results are presented in Section V, and the concluding remarks are given in Section VI.
II. RELATED WORK
Video-based gesture recognition is an important application in computer vision. However, the relation between video frames needs to be evaluated, which is important for the model. In [9], the authors have proposed a Temporal Pyramid Relation Network to model the temporal relation between frames effectively and efficiently. Initially, Temporal Pyramid Pooling is used to obtain the temporal feature sequences of multiple scale features. Then, the temporal relations of these feature sequences are obtained by stacking a Temporal Relation Network on the feature sequence of each scale, respectively. Finally, the representations of all features are aggregated to obtain the final prediction.
Spatiotemporal gesture segmentation is the task of identifying the hand in a video sequence. In [10], a unified framework has been proposed to simultaneously perform spatial segmentation, temporal segmentation, and recognition. This framework identifies the gesture from a continuous image stream even if the hand location is ambiguous or there is no information about the beginning and end of a gesture in a video [10]. A ResC3D network-based multimodal gesture recognition method has been proposed in [11] for gesture recognition from video sequences. Retinex and median filter video enhancement techniques have been applied to find an effective and compact representation of the video sequences. Thereafter, a ResC3D network has been developed for extracting and blending the features. It has been shown that the proposed method in [11] achieves an accuracy of 67.71%. A comparison of data mining methods has been carried out in [12] for gesture classification from video streaming. The authors have considered a vector of 20 body-joint parts captured by a Kinect camera. The gestures are classified as stand, sit down, and lie down. Backpropagation neural network, decision tree, support vector machine, and naive Bayes methods have been chosen for classification. It has been shown that the backpropagation neural network, with 100% accuracy, outperforms the other methods. Further, the average accuracy of all these methods is 93.72% [12]. A gesture spotting method, a combination of gesture spotting and gesture recognition, has been proposed in [13] for instructing a video game (Quake II) with gestures in a close-to-natural way. The proposed method in [13] recognizes the meaningful movements in addition to subtracting the unintended movements from a given image sequence. A video-based full-body gesture recognition system has been proposed in [14] that uses view-invariant pose coefficient vectors as feature vectors. The proposed system is independent of the orientation of the camera and yields promising results [14]. An adaptive HMM-based gesture recognition method with user adaptation using a Kinect camera has been designed in [15] for a natural user interface of a humanoid robot device. Here, the Kinect camera is used to reduce the large-scale video data, where the data include depth measurement information. Thus, only the 3-axis coordinate information of the joints is analyzed and categorized. The proposed method has been applied to classify 14 classes and shows superior performance [15].
A spatio-temporal feature extraction technique has been proposed in [16] for the application of Arabic Sign Language gesture recognition in both offline and online settings. Here, the temporal features of gestures in a video are obtained through predictions computed in the forward, backward, and bi-directional senses. Thereafter, the prediction errors are accumulated into one image which represents the motion of the sequence. Finally, the proposed feature extraction technique is complemented with classification techniques such as K-Nearest Neighbour (KNN) and Bayesian classifiers. Through extensive results, it has been shown that these classification methods, cascaded with the proposed feature extraction technique, outperform hidden Markov models (HMMs) [16]. A video semantic feature learning method has been proposed in [17] to improve video-based gesture recognition by integrating image topological sparse coding with a dynamic time warping algorithm. The proposed method in [17] divides the learning into two phases: semi-supervised learning of video image features and supervised optimization of video sequence features. Then, gestures are recognized using the K-nearest neighbor algorithm and a distance-weighting-based dynamic time warping algorithm [17].
In [18], the authors have proposed to recognize gestures for controlling unmanned aerial vehicles using a support vector machine (SVM) and an RGB-D camera. A thorough analysis has been carried out under different lighting conditions with different SVM kernels [18]. Large-scale gesture recognition from videos still has many challenges, including the background. In [19], an RGB-D data-based method has been proposed for gesture recognition from large-scale videos. Here, more details of the gestures are obtained by expanding the inputs into 32-frame videos. Then, the C3D model is applied to the RGB and depth videos to extract the spatial and temporal features, respectively. Finally, these features are fused to enhance the performance. The proposed model has been applied to the Chalearn LAP IsoGD Database and achieved an accuracy of 49.2% [19].
A new method combining a CNN with an artificial feature method has been proposed in [20] for the classification of human behaviours such as sitting, standing, walking, walking upstairs, walking downstairs, and lying still. Here, the CNN is used to extract local features, the artificial feature method is used to extract artificial statistical features, and these combined features are used for classification. This method has also been used to classify other complex behaviours such as squatting, running, cycling, tying shoelaces, climbing
mountains, and even various sports [20]. The use of pre-trained convolutional neural networks for feature extraction and context mining has been described in [21] for anomaly detection, which is an important aspect of intelligent surveillance systems. A denoising autoencoder with relatively low model complexity has been used for efficient and accurate anomaly detection [21]. In [22], a Dynamic Graph CNN (DGCNN) model has been proposed which directly takes 3D points as inputs, which are then used for successful recognition of 3D objects. Here, the authors have adopted the DGCNN to recognize 3D geometry features in the spatio-temporal space of the data for action recognition. It has been shown that the proposed method in [22] achieves state-of-the-art results with accuracies of 98.56% and 95.54% on the IBM DVS128 Gesture and DHP19 datasets, respectively. In [23], a rapid deep learning approach for hand gesture detection for intelligent vehicle transportation is proposed. The classification accuracy and computational efficiency of the proposed long-term recurrent convolutional neural network, employed for hand gesture identification, are obtained by extracting three representative frames from the video sequence. To extract the representative frames, a semantic segmentation-based deconvolutional neural network is used. The proposed approach in [23] has been tested on the Cambridge public dataset, and it outperforms benchmark models. In video-based surgical gesture recognition, extracting visual and temporal information from a video is challenging. In [7], a 3D CNN model has been proposed to extract spatio-temporal features from video sequences. The proposed model has been tested on the JIGSAWS dataset and achieved an accuracy of 84% [7].
III. SYSTEM DESCRIPTION AND DATASET DETAILS
In this section, we describe the details of the system we consider for the creation of the dataset. We also detail the different classes of the dataset.
A. System Description
The camera used to build the dataset is an RGB-D camera, the Intel RealSense Depth Camera D435, as shown in Fig. 1. This camera supports a maximum range of 3 meters and has two channels, one for the RGB stream and the other for the depth stream. It comes with a depth field of view (FOV) of 87°×58° and an RGB sensor FOV of 69°×42°. In this application, both streams have the same dimension of 480×640 [24]. Normally, the depth stream has a dimension of 480×860. However, we lower it to 480×640 to match the dimension of the RGB stream.
Fig. 2 shows the complete setup for extracting both the RGB and depth versions of the images. The camera is mounted on a tripod stand as shown in Fig. 1. The camera is connected to the computer through a USB-C to USB 3.0 port to allow video capture. Finally, we used a custom Python script to record and save the video sequences from both streams.
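The authors' capture script is not published; the following is a minimal sketch of such a recording loop, assuming the pyrealsense2 Python wrapper of the Intel RealSense SDK. The stream sizes, frame rate, 40-frame sequence length, and 25% resize follow the description in the text, while the per-sequence .npy output files are an illustrative choice of ours.

```python
# Hedged sketch of a two-stream capture script for the D435 (assumptions noted above).
import numpy as np
import cv2
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
# Request both streams at 640x480 so the RGB and depth frames match.
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

rgb_frames, depth_frames = [], []
try:
    for _ in range(40):  # one gesture sequence is 40 frames
        frames = pipeline.wait_for_frames()
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = np.asanyarray(frames.get_depth_frame().get_data())
        # Resize to 25% of the original size before saving, as in the dataset.
        rgb_frames.append(cv2.resize(color, (0, 0), fx=0.25, fy=0.25))
        depth_frames.append(cv2.resize(depth, (0, 0), fx=0.25, fy=0.25))
finally:
    pipeline.stop()

np.save("gesture_rgb.npy", np.stack(rgb_frames))     # hypothetical file names
np.save("gesture_depth.npy", np.stack(depth_frames))
```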
The sequences are recorded in the same environment (a room) and with the same framing. The framing of each sequence goes from under the waist to the top of the head, with a 1.5 meter field of view.
Fig. 1. The camera module.
Fig. 2. The complete setup for creating the dataset.
B. Dataset Details
Each sequence is made up of 40 frames, and there are 762 sequences, each corresponding to a gesture, in the dataset. For each recorded sequence, there is an RGB and a depth version, obtained from the two channels of the camera, as the camera generates both streams. The images are resized to 25% of their original size before being saved.
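As a rough illustration of how such saved sequences could be assembled into training and test arrays, the sketch below assumes one .npy file per sequence stored under a hypothetical dataset/depth/<gesture>/ layout and an 80/20 split; neither the file layout nor the split ratio is stated in the paper.

```python
# Illustrative dataset-assembly sketch (file layout and split are assumptions).
import glob
import numpy as np
from sklearn.model_selection import train_test_split

CLASSES = ["scroll_up", "scroll_down", "scroll_right", "scroll_left",
           "zoom_in", "zoom_out"]

sequences, labels = [], []
for label, name in enumerate(CLASSES):
    for path in glob.glob(f"dataset/depth/{name}/*.npy"):  # hypothetical layout
        seq = np.load(path)               # shape (40, 120, 160): 25% of 480x640
        sequences.append(seq[..., None])  # add a channel axis for the 3D CNN
        labels.append(label)

X = np.stack(sequences).astype("float32")
y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```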
C. Gesture Details
• Scroll_up and scroll_down
The principle is the same for both: the gesture goes from under the waist to the top of the head for scroll_up, and the opposite for scroll_down. Figs. 3a and 3b show the RGB version of the starting and end positions of the hand in the video for scroll up. Figs. 4a and 4b show the depth version of the starting and end positions of the hand in the video for scroll up. Figs. 3c and 3d show the RGB version of the starting and end positions of the hand in the video for scroll down. Figs. 4c and 4d show the depth version of the starting and end positions of the hand in the video for scroll down.
Fig. 3. A complete set of RGB versions of the gestures.
Fig. 4. A complete set of depth versions of the gestures.
• Scroll_right and scroll_left
The movement starts with the right hand at mid-torso height on the left or the right, depending on the type of scroll required. Figs. 3e and 3f show the RGB version of the starting and end positions of the hand in the video for scroll right. Figs. 4e and 4f show the depth version of the starting and end positions of the hand in the video for scroll right. Figs. 3g and 3h show the RGB version of the starting and end positions of the hand in the video for scroll left. Figs. 4g and 4h show the depth version of the starting and end positions of the hand in the video for scroll left.
• Zoom_in
The movement starts at the top of the chest with the right hand closed and moves forward straight toward the camera while the hand is released gradually. Figs. 3i and 3j show the RGB version of the starting and end positions of the hand in the video for zoom in. Figs. 4i and 4j show the depth version of the starting and end positions of the hand in the video for zoom in.
• Zoom_out
The movement starts at shoulder height with the right arm stretched toward the camera and the hand open. Figs. 3k and 3l show the RGB version of the starting and end positions of the hand in the video for zoom out. Figs. 4k and 4l show the depth version of the starting and end positions of the hand in the video for zoom out.
Fig. 5. Visualization of the model used for training and testing.
IV. LIGHTWEIGHT CNN FOR VIDEO-BASED HAND-GESTURES CLASSIFICATION
In this section, we explain the proposed lightweight CNN model. It has one input layer, 3 Convolution3D layers, 3 max pooling layers, 3 batch normalization layers, 4 dense layers, 2 dropout layers, and 1 flatten layer, as shown in Fig. 5. The dimensions and output of each layer are tabulated in Table I. The functionality of each layer is described as follows:
1) Convolution3D layer: The principle of this layer is to create a convolution kernel that is convolved with the layer input to produce a tensor of outputs. It applies cuboidal convolution filters to the 3D input. This layer convolves the 3D input by moving the filter along the horizontal, vertical, and depth directions. It computes the dot product of the weights and inputs, then adds a bias term.
2) Max Pooling 3D layer: This is used to reduce the dimensionality of the input, which then helps the CNN to train faster.
3) Batch Normalization layer: It allows each network layer to perform learning independently. It also helps to normalize the output of the corresponding input layer, which reduces the network initialization sensitivity and speeds up the training.
4) Flatten layer: The flatten layer takes a matrix as input and creates a vector, which is then connected to the fully-connected layers for classification.
5) Dropout layer: This layer helps prevent over-fitting by setting randomly selected input units to 0 with a frequency of r at each step during training. The remaining units are scaled up by 1/(1 − r) so that the sum over all inputs is unchanged. Here, r represents the dropout rate.
6) Dense layer: The dense layer is basically used for changing the dimensions of the vector. It generates an m-dimensional vector through matrix-vector multiplication.
TABLE I
Architecture Details of the Proposed CNN Model for the 1 of 2 Dataset. Layers conv3d_1 and conv3d_2 Are Convolutional Layers With 16 and 32 Filters, Respectively. Both Layers Have a Kernel Size of (3,3,4).
TABLE II
Convolutional Neural Network Parameters
Fig. 6. The variation of (a) accuracy and (b) loss for the RGB version of the images, and (c) accuracy and (d) loss for the depth version of the images.
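To make the layer composition concrete, the sketch below builds a Keras model with the stated layer counts (3 Conv3D, 3 MaxPooling3D, 3 BatchNormalization, 1 Flatten, 2 Dropout, 4 Dense) and the 16- and 32-filter Conv3D layers with kernel size (3,3,4) from Table I. The third Conv3D configuration, dense widths, dropout rate, input resolution, and optimizer are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of a lightweight 3D CNN for 6 video gestures (assumptions noted above).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6                      # scroll up/down/left/right, zoom in/out
FRAMES, H, W, C = 40, 120, 160, 1    # 25% of 480x640; C=1 for depth, 3 for RGB

model = models.Sequential([
    layers.Input(shape=(FRAMES, H, W, C)),
    layers.Conv3D(16, kernel_size=(3, 3, 4), activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.BatchNormalization(),
    layers.Conv3D(32, kernel_size=(3, 3, 4), activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.BatchNormalization(),
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu"),  # assumed filters
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```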
V. RESULTS AND DISCUSSION
In this section, we evaluate the performance of the proposed model in terms of accuracy, loss, and confusion matrix. For training on the dataset, we consider three different methods when loading and creating the training and testing sets. First, we trained using all 40 frames of each sequence. Then, we trained using only 1 out of 2 frames in a sequence, resulting in the usage of 20 frames of each sequence. Finally, we trained using 1 out of 4 frames in a sequence. The goal was to see if there was a noticeable drop in training and testing accuracy using reduced sequences. The sizes of the full, 1 of 2, and 1 of 4 trained models are 34 MB, 20 MB, and 14 MB, respectively, for both RGB and depth images.
Table II lists the parameters considered for the evaluation of the proposed model for all three cases.
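The "1 of 2" and "1 of 4" settings amount to keeping every second or fourth frame of a 40-frame sequence; a small sketch of such a subsampling helper is shown below (the helper name is ours, not from the paper).

```python
# Frame-subsampling sketch for the reduced-sequence experiments.
import numpy as np

def subsample(sequence: np.ndarray, keep_every: int) -> np.ndarray:
    """Return every `keep_every`-th frame of a (frames, H, W, C) sequence."""
    return sequence[::keep_every]

full = np.zeros((40, 120, 160, 1), dtype="float32")   # dummy 40-frame sequence
print(subsample(full, 2).shape)  # (20, 120, 160, 1) -> the "1 of 2" setting
print(subsample(full, 4).shape)  # (10, 120, 160, 1) -> the "1 of 4" setting
```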
Figs. 6a and 6b show the variation of accuracy and loss for training and validation of the dataset with the proposed model for the RGB version of the images. It is observed from Fig. 6a that the accuracy increases exponentially and saturates at 99.48% at epoch 100. From Fig. 6b, it is observed that the loss decreases exponentially and saturates at epoch 110. Further, Figs. 6c and 6d show the variation of accuracy and loss for training and validation of the dataset with the proposed model for the depth version of the images. It is observed from Fig. 6c that the accuracy increases exponentially and saturates at 99.18% at epoch 90. From Fig. 6d, it is observed that the loss decreases exponentially and saturates at epoch 70.
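Accuracy and loss curves like those in Fig. 6 can be produced by tracking the Keras training history; the sketch below reuses the model, X_train, and y_train objects from the earlier sketches (an assumption), and the epoch count is chosen only to cover the saturation points discussed above.

```python
# Hedged sketch of tracking and plotting training/validation accuracy.
import matplotlib.pyplot as plt

history = model.fit(X_train, y_train,
                    validation_split=0.2,   # hold out part of the training data
                    epochs=120, batch_size=8)

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```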
Fig. 7. Confusion matrix for (a) RGB with full frames of the video sequence, (b) depth with full frames of the video sequence, (c) RGB with 1 of 2 frames of the video sequence, (d) depth with 1 of 2 frames of the video sequence, (e) RGB with 1 of 4 frames of the video sequence, and (f) depth with 1 of 4 frames of the video sequence.
Figs. 7a and 7b show the confusion matrices when the full sequence is considered for the RGB and depth versions of the images,
respectively. From Fig. 7a, it is observed that zoom in is detected as zoom out with an inaccuracy of 1%. From Fig. 7b, it is observed that zoom out is detected as zoom in with an inaccuracy of 1%. This is to be expected, since those two movements are very similar to each other. Figs. 7c and 7d show the confusion matrices when 1 of 2 frames of the video sequence is considered for the RGB and depth versions of the images,
respectively. From Fig. 7c, it is observed that scroll down is detected as scroll left with an inaccuracy of 1%. The reason is the same as in the case of the full dataset. It is also observed that zoom out is detected as zoom in with an inaccuracy of 2%. This could be linked to the fact that the movement for a scroll down sometimes starts with the hand on the right, going up and then down; the hand drifts slightly left before starting the downward movement, which could confuse the model. From Fig. 7d, it is observed that the depth version has no error in classifying the gestures. Figs. 7e and 7f show the confusion matrices when 1 of 4 frames of the video sequence is considered for the RGB and depth versions of the images, respectively. From Fig. 7e, it is observed that scroll left is detected as scroll right with an inaccuracy of 1%. The model confuses scroll right and scroll left only rarely, so this could be an error on the test set. From Fig. 7f, it is observed that the depth version has no error in classifying the gestures.
TABLE III
Performance Comparison of the Proposed Model on Different Sizes of the Dataset
TABLE IV
Hand Gesture Recognition Accuracy of the Proposed Model and Other Deep Learning Models
Figs. 8a and 8b show the variation of validation accuracy with fold for training and testing, respectively. Table III lists the overall accuracy of the proposed model for all three cases. It is observed from Table III that when we consider the full dataset the overall accuracy is 99.18%, when we consider the 1 of 2 dataset the accuracy is 99.04%, and when we consider the 1 of 4 dataset the accuracy is 99.16%.
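A fold-wise evaluation like the one summarized in Fig. 8 and Table III can be sketched as below; it reuses the X and y arrays from the dataset sketch and assumes a build_model() factory wrapping the architecture sketched earlier (both are assumptions, and the exact validation protocol of the paper may differ).

```python
# Hedged sketch of 10-fold evaluation with per-class confusion matrices.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

fold_accuracies = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    model = build_model()                        # fresh (hypothetical) model per fold
    model.fit(X[train_idx], y[train_idx], epochs=100, batch_size=8, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    fold_accuracies.append(acc)

print("mean 10-fold accuracy:", np.mean(fold_accuracies))
preds = np.argmax(model.predict(X[test_idx]), axis=1)
print(confusion_matrix(y[test_idx], preds))      # per-class errors, as in Fig. 7
```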
Table IV compares the proposed method with state-of-the-art methods in terms of classification accuracy. It is observed from Table IV that no paper in the literature has considered our dataset. Further, the classification accuracy of the proposed model is higher than that of the state-of-the-art methods.
Fig. 8. 10-fold validation accuracy for (a) training and (b) testing.
VI. CONCLUSION AND FUTURE WORK
We have proposed a lightweight convolutional neural network for accurately classifying gestures from video sequences in RGB and depth versions captured using an RGB-D camera. A dataset in which each gesture is recorded as two video sequences (an RGB and a depth version), each with 40 frames, has been created. The proposed model has been compared to state-of-the-art models in terms of classification accuracy. The proposed model has achieved accuracies of 99.23% and 99.18% on the RGB and depth versions of the test datasets, respectively. In the future, we plan to extend the work to complex scenarios wherein multiple people are present in a single frame.
REFERENCES
[1] R. Xu, S. Zhou, and W. Li, "MEMS accelerometer based nonspecific-user hand gesture recognition," IEEE Sensors J., vol. 12, no. 5, pp. 1166–1173, May 2012.
[2] W. Lao, J. Han, and P. H. N. De With, “Automatic video-based
human motion analyzer for consumer surveillance system,” IEEE Trans.
Consum. Electron., vol. 55, no. 2, pp. 591–598, May 2009.
[3] F. Zhan, “Hand gesture recognition with convolution neural networks,”
in Proc. IEEE 20th Int. Conf. Inf. Reuse Integr. Data Sci. (IRI), Jul. 2019,
pp. 295–298.
[4] D.-L. Dinh and T.-S. Kim, “Smart home appliance control via hand
gesture recognition using a depth camera,” in Smart Energy Control Systems for Sustainable Buildings, J. Littlewood, C. Spataru,
R. J. Howlett, and L. C. Jain, Eds. Cham, Switzerland: Springer, 2017,
pp. 159–172.
[5] T.-H. Tran and V.-H. Do, “Improving continuous hand gesture detection
and recognition from depth using convolutional neural networks,” in
Intelligent Systems and Networks, D.-T. Tran, G. Jeon, T. D. L. Nguyen,
J. Lu, and T.-D. Xuan, Eds. Singapore: Springer, 2021, pp. 80–86.
[6] A. Roitberg, T. Pollert, M. Haurilet, M. Martin, and R. Stiefelhagen,
“Analysis of deep fusion strategies for multi-modal gesture recognition,”
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops
(CVPRW), Jun. 2019, pp. 1–9.
[7] I. Funke, S. Bodenstedt, F. Oehme, F. von Bechtolsheim, J. Weitz,
and S. Speidel, “Using 3D convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video,”
in Medical Image Computing and Computer Assisted Intervention—
MICCAI, D. Shen et al., Eds. Cham, Switzerland: Springer, 2019,
pp. 467–475.
[8] Y. Sun et al., “Gesture recognition algorithm based on multi-scale
feature fusion in RGB-D images,” IET Image Process., vol. 14, no. 15,
pp. 3662–3668, 2020.
[9] K. Yang, R. Li, P. Qiao, Q. Wang, D. Li, and Y. Dou, “Temporal pyramid
relation network for video-based gesture recognition,” in Proc. 25th
IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 3104–3108.
[10] J. Alon, V. Athitsos, Q. Yuan, and S. Sclaroff, “A unified framework
for gesture recognition and spatiotemporal gesture segmentation,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1685–1699,
Sep. 2009.
[11] Q. Miao et al., “Multimodal gesture recognition based on the ResC3D
network,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW),
Oct. 2017, pp. 3047–3055.
[12] O. Patsadu, C. Nukoolkit, and B. Watanapa, “Human gesture recognition
using Kinect camera,” in Proc. 9th Int. Conf. Comput. Sci. Softw. Eng.
(JCSSE), May 2012, pp. 28–32.
[13] H. Kang, C. W. Lee, and K. Jung, “Recognition-based gesture spotting in video games,” Pattern Recognit. Lett., vol. 25,
no. 15, pp. 1701–1714, Nov. 2004. [Online]. Available: https://www.
sciencedirect.com/science/article/pii/S0167865504001576
[14] B. Peng, G. Qian, and S. Rajko, “View-invariant full-body gesture
recognition from video,” in Proc. 19th Int. Conf. Pattern Recognit.,
Dec. 2008, pp. 1–5.
[15] I.-J. Ding and C.-W. Chang, “An adaptive hidden Markov model-based
gesture recognition approach using Kinect to simplify large-scale video
data processing for humanoid robot imitation,” Multimedia Tools Appl.,
vol. 75, no. 23, pp. 15537–15551, Dec. 2016.
[16] T. Shanableh, K. Assaleh, and M. Al-Rousan, "Spatio-temporal feature-extraction techniques for isolated gesture recognition in Arabic sign language," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 3, pp. 641–650, Jun. 2007.
[17] S. Xu, L. Liang, and C. Ji, “Gesture recognition for human–machine
interaction in table tennis video based on deep semantic understanding,” Signal Process., Image Commun., vol. 81, Feb. 2020,
Art. no. 115688. [Online]. Available: https://www.sciencedirect.com/
science/article/pii/S0923596519307076
[18] W. G. Aguilar, B. Cobeña, G. Rodriguez, V. S. Salcedo, and
B. Collaguazo, “SVM and RGB-D sensor based gesture recognition
for UAV control,” in Augmented Reality, Virtual Reality, and Computer
Graphics, L. T. De Paolis and P. Bourdot, Eds. Cham, Switzerland:
Springer, 2018, pp. 713–719.
[19] Y. Li et al., “Large-scale gesture recognition with a fusion of RGB-D
data based on the C3D model,” in Proc. 23rd Int. Conf. Pattern Recognit.
(ICPR), Dec. 2016, pp. 25–30.
[20] X. Bu, “Human motion gesture recognition algorithm in video based on
convolutional neural features of training images,” IEEE Access, vol. 8,
pp. 160025–160039, 2020.
[21] C. Wu, S. Shao, C. Tunc, and S. Hariri, “Video anomaly detection using
pre-trained deep convolutional neural nets and context mining,” in Proc.
IEEE/ACS 17th Int. Conf. Comput. Syst. Appl. (AICCSA), Nov. 2020,
pp. 1–8.
[22] J. Chen, J. Meng, X. Wang, and J. Yuan, “Dynamic graph CNN for
event-camera based gesture recognition,” in Proc. IEEE Int. Symp.
Circuits Syst. (ISCAS), Oct. 2020, pp. 1–5.
[23] V. John, A. Boyali, S. Mita, M. Imanishi, and N. Sanma, "Deep learning-based fast hand gesture recognition using representative frames," in Proc. Int. Conf. Digit. Image Comput., Techn. Appl. (DICTA), Nov. 2016, pp. 1–8.
[24] Intel RealSense. Intel Realsense Depth Camera D435. Accessed:
Apr. 2022. [Online]. Available: https://www.intelrealsense.com/depthcamera-d435/
David González León is pursuing the bachelor's degree in embedded systems with HEIG-VD, Yverdon-les-Bains, Switzerland. During the writing of this article, he was an exchange student for one semester at UiA, Grimstad, Norway. His research interests are embedded systems, medical device development, reprogrammable components (FPGA), systems on chip, the IoT, and machine learning.
Jade Gröli is pursuing the bachelor's degree in embedded systems with HEIG-VD, Yverdon-les-Bains, Switzerland. During the writing of this article, she was an exchange student for one semester at UiA, Grimstad, Norway. Her research interests are embedded systems, medical device development, reprogrammable components (FPGA), systems on chip, the IoT, and machine learning.
Sreenivasa Reddy Yeduri (Member, IEEE)
received the B.E. degree in electronics and
communication engineering from Andhra University, Visakhapatnam, India, in 2013, the M.Tech.
degree from the Indian Institute of Information
Technology, Gwalior, India, in 2016, and the
Ph.D. degree from the Department of Electronics
and Communication Engineering, National Institute of Technology, Goa, India, in 2021. He is currently working as a Postdoctoral Research Fellow
with the Autonomous and Cyber-Physical Systems (ACPS) Research Group, University of Agder, Grimstad, Norway.
His research interests are machine-type communications, the Internet
of Things, LTE MAC, 5G MAC, optimization in communication, wireless
networks, power line communications, visible light communications,
hybrid communication systems, spectrum cartography, spectrum sensing, V2X communication, V2V communication, wireless sensor networks,
mmWave RADAR, and aerial vehicle traffic management.
Daniel Rossier is a Professor in embedded software and OS management. He is also the coauthor of two patents on the autonomous and adaptive management of frequencies on WiFi networks and the author of a (public) patent on smart object-oriented technology based on migrating environments. His main research areas are operating systems, embedded virtualization, ARM technologies, and embedded software execution environments.
Romuald Mosqueron received the Ph.D.
degree in electronics and informatics of images
from Burgundy University, Dijon, France,
in 2006. He worked as a Project Manager at EPFL (Switzerland) for ten years in the field of signal and image processing for industrial and broadcast projects. He is currently working as a Professor with the Information and Telecommunication Department, HEIG-VD Engineering School, REDS Institute, Yverdon-les-Bains, Switzerland. His research interests are embedded systems in broadcast, where he has developed several solutions to improve live transmission and contributed to finding new content for the user experience. He is also active in 5G stand-alone
network developments and end user module design. He is a member
of several work groups for 5G video contribution and distribution. He is
also active in robotics and sport developments.
Om Jee Pandey (Senior Member, IEEE)
received the Ph.D. degree from the Department of Electrical Engineering, Indian Institute
of Technology Kanpur, Kanpur, India, in 2019.
He worked as a Postdoctoral Fellow with
the Communications Theories Research Group,
Department of Electrical and Computer Engineering, University of Saskatchewan, Saskatoon, SK, Canada. He is currently working as an
Assistant Professor with the Department of Electronics Engineering, IIT BHU Varanasi, Varanasi,
India. His research interests include wireless sensor networks, low-power
wide-area networks, unmanned aerial vehicle networks, mobile and pervasive computing, cyber-physical systems, the Internet of Things, cloud
and fog computing, UAV-assisted optical communications, and social
networks. He serves as a regular reviewer for various reputed journals of
IEEE, including IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE
TRANSACTIONS ON WIRELESS COMMUNICATIONS, IEEE TRANSACTIONS
ON COMMUNICATIONS, IEEE TRANSACTIONS ON GREEN COMMUNICATIONS
AND NETWORKING, IEEE TRANSACTIONS ON NETWORK AND SERVICE
MANAGEMENT, IEEE INTERNET OF THINGS JOURNAL, IEEE SYSTEMS
JOURNAL, IEEE SENSORS JOURNAL, and IEEE ACCESS.
Linga Reddy Cenkeramaddi (Senior Member,
IEEE) received the master’s degree in electrical
engineering from the Indian Institute of Technology, New Delhi, India, in 2004, and the Ph.D.
degree in electrical engineering from the Norwegian University of Science and Technology,
Trondheim, Norway, in 2011. He worked at Texas
Instruments in mixed signal circuit design before
joining the Ph.D. program at NTNU. After finishing his Ph.D. degree, he worked in radiation
imaging for an atmosphere space interaction
monitor (ASIM mission to International Space Station) with the University
of Bergen, Norway, from 2010 to 2012. At present, he is the Group Leader
of the Autonomous and Cyber-Physical Systems (ACPS) Research
Group and working as a Professor with the University of Agder, Campus
Grimstad, Norway. He has coauthored over 90 research publications that
have been published in prestigious international journals and standard
conferences. His main scientific interests are in cyber-physical systems,
autonomous systems, and wireless embedded systems. He is the Principal and the Co-Principal Investigator of many research grants from
Norwegian Research Council. He is also a member of the editorial boards
of various international journals and the technical program committees
of several IEEE conferences.