Robust Abnormal Human-Posture Recognition Using OpenPose and Multiview Cross-Information

Mingyang Xu, Limei Guo, and Hsiao-Chun Wu, Fellow, IEEE
Abstract—With the emerging demand for intelligent health
surveillance, video information has been widely explored
to facilitate real-time automatic patient monitoring systems.
Recently, human-posture recognition techniques based on
deep learning or artificial intelligence networks have been
reported in the literature. Nonetheless, during the training, testing, or both stages, it is quite difficult to extract
reliable features for all postures to be recognized. In this
work, we propose a robust multiperspective abnormal
human-posture recognition approach based on the multiview
cross-information and the confidence measurement, which
is adopted to evaluate the importance of the posture-feature
information from different perspectives. In our proposed approach, human skeletal data are first extracted by OpenPose.
Then those data are utilized as the input of the YOLOv5s system to recognize/detect abnormal postures such as falls
and bumps. Based on the NTU-RGB+D public dataset and the PyTorch framework, the simulation results show that our
proposed abnormal human-posture recognition method can lead to high accuracy.
Index Terms— Abnormal human-posture recognition, confidence measure, deep learning, multiview cross-information.
Manuscript received 17 February 2023; revised 10 April 2023;
accepted 11 April 2023. Date of publication 19 April 2023; date
of current version 31 May 2023. This work was supported by the
Louisiana Board of Regents Research Competitiveness Subprogram
under Grant LEQSF(2021-22)-RD-A-34. The associate editor coordinating the review of this article and approving it for publication was
Dr. Avik Santra. (Corresponding author: Hsiao-Chun Wu.)
Mingyang Xu and Limei Guo are with the School of Computer Science
and Engineering, Central South University, Changsha, Hunan 410075,
China (e-mail: xdjh2007@163.com; xmy12312300@126.com).
Hsiao-Chun Wu is with the School of Electrical Engineering and Computer Science, Louisiana State University, Baton Rouge, LA 70803 USA,
and also with the Innovation Center for AI Applications, Yuan Ze University, Chungli 32003, Taiwan (e-mail: wu@ece.lsu.edu).
Digital Object Identifier 10.1109/JSEN.2023.3267300
I. Introduction
The pervasive applications of intelligent surveillance technology can be found in people’s daily life. Massive video information has been used for event detection, violent
scene analysis, and human behavior recognition. For example,
surveillance cameras could be installed in the houses of elderly
people who live alone for their health monitoring [1], [2].
Nowadays, human behavior/posture recognition can benefit
from the emerging artificial intelligence technology [3], [4],
[5]. Nevertheless, how to effectively detect and recognize
abnormal human postures from surveillance videos is still
quite challenging. Although many researchers have been working on this problem for years and a number of techniques
have been proposed, difficulties still exist in real scenarios.
In practice, there are many factors significantly influencing
the recognition accuracy, such as the shooting angle, the
luminance, the scenery background, and the human’s outfit [6].
The conventional human-posture recognition schemes include
the dense trajectory (DT) algorithm in [7] and improved dense
trajectories (iDT) method in [8] and [9], where these two
methods are based on the removal of background trajectories induced by camera motion. The local space–time feature, which
exploits both temporal states for compensating time-warping
effects and shape contexts for extracting space–time shape
variations, is introduced for human action recognition [10].
Optical flow is adopted to manifest the motion information to
locate the region of interest, the direction gradients in tandem
with the optical flow histogram are carried out to describe the
region, and then the support vector machine (SVM) classifier
is undertaken to recognize various types of human actions [11].
The affine scale-invariant feature transform (SIFT) is proposed
to recognize human actions [12]. Generally speaking, the
aforementioned existing schemes do not require a large amount of training data, especially from many different conditions,
but they cannot reach very high accuracy [13]. In comparison
with the above-stated conventional methods, deep-learning
networks have the advantages of fast computation speed, high
accuracy, and end-to-end training capacity, and thus they have
been very popular for human-posture recognition in the recent
decade [14], [15], [16]. However, in reality, the accuracy of
human-posture recognition would degrade due to unreliable
feature information [17].
1558-1748 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Sri Sivasubramanya Nadar College of Engineering. Downloaded on February 27,2024 at 13:09:33 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. Block diagram of our proposed new robust abnormal human-posture recognition approach.
To solve this often-encountered problem, in this work,
we propose a novel multiperspective human-posture recognition scheme, which is based on a confidence measure to
integrate the multiview cross-information. In our approach,
human skeletal data involving the first-order information (joint
point coordinates) and the second-order information (bone
length and orientation), which would lead to reliable features (not easily affected by external environmental factors),
are first acquired for human-posture recognition. The overall framework of our proposed approach is illustrated by
Fig. 1. The open-source libraries “Letterbox,” “OpenPose,”
and “YOLOv5s” are utilized in our approach. Letterbox is utilized to preprocess the video images to make them of the same
size. Then, OpenPose is utilized to extract the human skeletal
data from multiperspective (multiview) images. According to
the number and quality of the skeletal points, the importance
of the extracted feature information from different images will
be evaluated by the proposed confidence measure. Only those
features which meet the prespecified confidence requirements
can be used to classify postures. The confidence measure
is adopted to fuse the cross-information of multiperspective
images and thus enhance the recognition accuracy. YOLOv5s
will further extract the first- and second-order information of
the skeleton points to classify abnormal human postures such
as falls and bumps. The system response time is 0.198 s.
To achieve (near) real-time operation, we propose to sample
one frame out of every two to three consecutive frames.
The rest of this article is organized as follows. Section II
outlines our proposed new robust abnormal human-posture
recognition approach. Section III presents the analysis and
evaluation of experimental results. The conclusion will be
finally drawn in Section IV.
II. Proposed Robust Abnormal Human-Posture Recognition Approach
Our proposed novel robust abnormal human-posture
recognition approach consists of five mechanisms, namely
1) preprocessing, 2) extraction of human skeletal points, 3)
multiview comprehensive analysis, 4) confidence evaluation,
and 5) human-posture recognition. Details will be presented
in Sections II-A–II-E.
A. Preprocessing
The first mechanism, preprocessing, can be further divided
into two subtasks, namely 1) image enhancement and 2) image
resizing. The details can be found as follows.
1) Image Enhancement: During the data acquisition, images
are inevitably corrupted by various kinds of noise. Common
approaches to noise suppression include low-pass filtering,
discrete Fourier transform (DFT) filtering, and wavelet decomposition. In this
work, denoising convolutional neural networks (DnCNNs)
are adopted for image denoising. The image enhancement
subtask here is the process of adjusting images so that the
results are more suitable for display or further image analysis.
For example, an image can be sharpened and brightened to make
its key features easier to identify. In this work, histogram
equalization is adopted for image enhancement.
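As an illustration of the enhancement step, the following minimal sketch shows plain histogram equalization on a small grayscale image stored as nested lists. The function name `equalize_histogram` and the sample image are illustrative assumptions; the article does not specify which implementation (or color handling) is used in the actual system.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image (list of rows of ints in [0, levels-1]).

    Illustrative sketch only; not the authors' implementation.
    """
    # Count how often each gray level occurs.
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1

    # Build the cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    if total == 0:
        return []  # empty image
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in image]

    # Look-up table that makes the output CDF approximately linear.
    lut = [max(0, round((c - cdf_min) / (total - cdf_min) * (levels - 1)))
           for c in cdf]
    return [[lut[value] for value in row] for row in image]


# A dark, low-contrast patch is stretched over the full gray-level range.
dark = [[50, 50, 51], [51, 52, 52], [52, 53, 53]]
bright = equalize_histogram(dark)
```

The darkest input level maps to 0 and the brightest to 255, so a narrow range of gray levels is spread over the whole scale.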
2) Image Resizing Using Letterbox: The enhanced images
need to be resized such that all images will have the same
size for the ultimate human-posture recognition. In real-world
scenarios, different images have different sizes. It is usual to
uniformly scale the original image to a standard size using
zooming and filling. However, if too much information needs
to be filled in, the required processing time would be long.
Letterbox handles this problem better: it maintains
the length–width ratio of the image and adaptively pads the
scaled image with the fewest black edges (fillers).
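The geometry behind letterbox resizing can be sketched as follows. This is a simplified version that pads to a fixed square canvas; the function name `letterbox_geometry` and the default target size of 640 pixels are assumptions made for illustration, and the padding policy of the actual Letterbox routine may differ (for example, by padding only up to a stride multiple).

```python
def letterbox_geometry(width, height, target=640):
    """Compute the scaled size and padding for an aspect-preserving resize.

    Returns (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).
    `target` is a hypothetical square canvas size, not a value from the article.
    """
    # One common scale factor keeps the length-width ratio unchanged.
    scale = min(target / width, target / height)
    new_w = round(width * scale)
    new_h = round(height * scale)

    # The remaining space is filled with border pixels, split between both sides.
    pad_w = target - new_w
    pad_h = target - new_h
    left = pad_w // 2
    top = pad_h // 2
    return new_w, new_h, left, top, pad_w - left, pad_h - top


# A 1920x1080 frame is scaled to 640x360 and padded above and below.
geometry = letterbox_geometry(1920, 1080)
```

Only the shorter dimension receives padding, which is what keeps the number of filler pixels small.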
B. Extraction of Human Skeletal Points Using OpenPose
OpenPose is an open-source package to recognize and
classify human joints based on deep learning [18]; it adopts
a technique based on Part Affinity Fields (PAFs) to
associate various parts of a human body and form a complete
human skeleton on the image. To do so, OpenPose carries
out two steps. First, it invokes the VGG-19 model to extract
image features and feeds these features into a pair of convolutional neural networks (CNNs) running in parallel. As shown
by Fig. 2, the first branch of the OpenPose network calculates
the confidence map to detect body parts. The second branch
of the OpenPose network calculates the PAFs and combines
various body parts to form the skeleton of a human body.
These two parallel branches can be run multiple times to create
a robust confidence map and PAFs. Feature F in Fig. 2 is
produced by the VGG-19 model.
Fig. 2. OpenPose network architecture (duplicated from [18]).
Fig. 3. Illustration of a human skeletal map (graph).
To recognize abnormal human postures, we adopt the OpenPose model to acquire skeletal points in real time, which
include 18 joints such as the nose, shoulders, and knees.
A typical human skeletal map is illustrated by Fig. 3.
The skeletal points are labeled and the coordinates $(x_p, y_p)$
of the 18 skeletal points in the image are estimated, where
$p$ specifies the joint index. Table I lists the names of these
joints. Meanwhile, since the positions of the joints change
continuously over time, the vertical movement velocity $v_y(m)$
of the human body’s center point is calculated over every ten
frames as given by
$$v_y(m) = \frac{\sum_{p \in P} y_p(m) - \sum_{p \in P} y_p(m-9)}{36\,\tau}$$
where $m$ indicates the frame index, $P = \{2, 5, 8, 11\}$, $y_p(m)$
denotes the vertical coordinate (along the y-axis) of the $p$th
joint in the $m$th frame, and $\tau$ represents the reciprocal of the
frame rate (the time between two successive frames). Obviously, abrupt variations of $v_y(m)$ over time could indicate
abnormal human postures. Note that the temporal relationship
across consecutive frames is not considered in this work. Since
the instantaneous vertical coordinates $y_p(m)$ depend on the
view angle, how to determine the threshold for judging the
abnormal temporal variations of $v_y(m)$ is not trivial [19].
Therefore, we propose to adopt the YOLOv5 package to
construct and train a behavior-recognition model based on all
skeletal points to detect abnormal human postures such as falls
and bumps.
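The vertical-velocity formula can be evaluated directly from per-frame joint coordinates. The sketch below assumes a hypothetical data layout in which `y[m][p]` holds the vertical coordinate of joint `p` in frame `m`; the divisor 36 equals the four joints in P times the nine frame intervals between frames m − 9 and m.

```python
TORSO_JOINTS = (2, 5, 8, 11)  # the joint set P from the text


def vertical_velocity(y, m, tau):
    """Vertical velocity of the body center at frame m.

    y[m][p] is the vertical coordinate of joint p in frame m (assumed layout);
    tau is the time between two successive frames in seconds.
    """
    if m < 9:
        raise ValueError("need at least ten frames (m >= 9)")
    now = sum(y[m][p] for p in TORSO_JOINTS)
    before = sum(y[m - 9][p] for p in TORSO_JOINTS)
    # 36 = 4 joints (averaging over P) x 9 frame intervals.
    return (now - before) / (36 * tau)


# Every joint moves down by 2 pixels per frame at 30 frames per second.
frames = [[2.0 * m] * 18 for m in range(10)]
speed = vertical_velocity(frames, 9, 1 / 30)
```

With a uniform motion of 2 pixels per frame at 30 frames per second, the result is about 60 pixels per second, as expected.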
C. Multiview Comprehensive Analysis
In reality, images are often acquired by fixed cameras and
other fixed devices [20]. Since the human(s) in the scene can
move around, the acquired image may often not be at the front
view of a person [21]. If only a single camera is available to
acquire a single-view image for human-posture recognition,
the performance may often be unsatisfactory [22]. Hence,
multiview recognition methods were proposed in [23], [24],
and [25]. In [23], the traffic data were collected using multiple
cameras to obtain a wide range of trajectories around the crossroad and to solve the problems of limited visual angle and vehicle
occlusion in the single-camera scene. The multiview
approach demonstrated great potential in the field of artificial
intelligence face recognition [24]. Three cameras were used in
the unmanned aerial vehicle (UAV) target-tracking algorithm
in [25] to effectively solve the problem that human bodies
may appear in images with different proportions, directions,
and occlusion. In this work, we propose a new multiview
recognition approach to extract the human skeletal data from
multiperspective images to combat the problem of low accuracy based on single-view images [26].
In multiview recognition, the number of cameras should be
selected carefully so that the recognition accuracy and the
computational complexity are balanced. Generally speaking, the recognition accuracy
would increase with the number of cameras from different
angles. However, when the number of cameras increases, the
required computational complexity will also increase and it
will negatively affect the real-time recognition performance.
To determine how many cameras would be appropriate for
obtaining a satisfactory recognition accuracy and an acceptable real-time performance, we randomly select 500 groups
of images pertinent to normal human postures and another
500 groups of images pertinent to abnormal human postures
from the NTU-RGB + D dataset (see [27], [28]) to form a
training dataset. Then, we randomly select 100 groups
of images pertinent to normal human postures and another
100 groups of images pertinent to abnormal human postures
from the NTU-RGB + D dataset again to form a test dataset.
Then we train and test the model to record the changes in the
recognition accuracy as the number of cameras varies. The test
results are shown in Fig. 4. According to Fig. 4, when the
number of cameras is less than four, the recognition accuracy
increases linearly. When the number of cameras is greater than
or equal to four, the recognition accuracy is not significantly
improved (the recognition accuracy increases sublinearly). The
“training set ratio”¹ for Fig. 4 is 80%. Therefore,
in this work, we choose to use the image data acquired
by three different cameras to balance the recognition accuracy
and the real-time requirement.
Fig. 4. Test recognition accuracy versus the number of cameras where
the training set ratio is 80%.
D. Confidence Evaluation
The effectiveness and necessity of confidence mechanisms
have been demonstrated in the existing literature [29], [30],
where various confidence scores were introduced for the
regression problems to determine which data should be actually used to estimate the underlying parameters. In this
work, our objective is quite different as we would like to
1) discard those images containing blocked or blurred views
and 2) evaluate the confidence score for those accepted image
data for the ultimate weighted decision to reach more accurate recognition results. Our proposed new multiview-image
confidence mechanism is performed by the process illustrated by Fig. 5, where both feature-extraction and confidence-evaluation tasks are based on the first-order information (joint
coordinates), the second-order information (bone lengths and
orientations) of the skeletal data, and the motion information
of the global map. Thus, our proposed new abnormal human-posture recognition system will rely on the joints, frames, and
features associated with high confidence scores.
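To make the distinction between first- and second-order information concrete, the sketch below derives bone lengths and orientations from joint coordinates. The function name `bone_features`, the dictionary layout, and the example bone list are hypothetical; the article does not list which joint pairs form bones.

```python
import math


def bone_features(joints, bones):
    """Second-order features (length, orientation) from first-order joint data.

    joints: {joint_index: (x, y)} for the detected joints (assumed layout).
    bones:  iterable of (parent, child) joint-index pairs (hypothetical).
    """
    features = {}
    for parent, child in bones:
        if parent not in joints or child not in joints:
            continue  # a bone needs both endpoints to be detected
        dx = joints[child][0] - joints[parent][0]
        dy = joints[child][1] - joints[parent][1]
        length = math.hypot(dx, dy)
        orientation = math.atan2(dy, dx)  # angle in radians
        features[(parent, child)] = (length, orientation)
    return features


# Joint 3 is missing, so only the bone (1, 2) yields a feature.
example = bone_features({1: (0.0, 0.0), 2: (3.0, 4.0)}, [(1, 2), (2, 3)])
```

Because lengths and angles are differences of coordinates, they do not change when the whole skeleton is shifted within the image.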
¹ The available dataset is divided into two parts, namely the training and
test sets. The proportion of the entire data to be used as the training data is
called the training set ratio.
Fig. 5. Illustration of a multicamera image-data acquisition system.
If the extracted skeletal graph includes at least a joint point in the
lower and upper bodies, we set lowerhalfflag==1 and upperhalfflag==1,
respectively. Coinsidevalue is the percentage of the area of the overlapping parts of OpenPose boxes and YOLOv5 boxes in the total area.
Fig. 6. Illustration of a multicamera image-data acquisition system.
Assume that three cameras take photos (images) at the same
time, as shown by Fig. 6. Thus, we can label these images
from different cameras by $i = 1$, $2$, and $3$. Given each image
(photo) acquired by a camera, if the following three conditions
are satisfied, we set the “image-quality flag” $V_{1,i} = 1$ for
the $i$th image; otherwise, we set $V_{1,i} = 0$: 1) the number of
detected joints of a human is greater than nine (to be determined by the
confidence mechanism), 2) the box of OpenPose and the box
of YOLOv5 intersect with each other by more than 30% (when
OpenPose is used to extract skeletal points and YOLOv5 is
used for human-posture recognition [31], they are utilized to
frame the human body and the skeletal points, respectively),
and 3) the extracted skeletal graph includes at least one joint
point in each of the lower and upper bodies. When $V_{1,i} = 0$,
the image (photo) will be discarded and not be used for the
next step of recognition.
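The three acceptance conditions can be sketched as a small predicate. Several details here are assumptions rather than statements from the article: the overlap is computed as intersection over union, the joint indices follow the common 18-keypoint COCO ordering of OpenPose, and hips are counted as lower-body joints.

```python
# Assumed split of the 18 joints (COCO-style ordering; hips counted as lower body).
UPPER_BODY = frozenset({0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16, 17})
LOWER_BODY = frozenset({8, 9, 10, 11, 12, 13})


def box_overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def image_quality_flag(detected_joints, pose_box, yolo_box):
    """Return 1 if the image passes all three checks, else 0."""
    detected = set(detected_joints)
    enough_joints = len(detected) > 9
    boxes_agree = box_overlap_ratio(pose_box, yolo_box) > 0.30
    both_halves = bool(detected & UPPER_BODY) and bool(detected & LOWER_BODY)
    return 1 if (enough_joints and boxes_agree and both_halves) else 0


# Twelve joints covering both body halves, with well-aligned boxes: accepted.
accepted = image_quality_flag(range(12), (0, 0, 10, 10), (1, 1, 10, 10))
# Twelve upper-body joints only: rejected by the third condition.
rejected = image_quality_flag(UPPER_BODY, (0, 0, 10, 10), (1, 1, 10, 10))
```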
If the $i$th image passes the image-quality check (i.e.,
$V_{1,i} = 1$), then the “confidence measure” $V_{2,i}$ specifies the
normalized weighted sum of the confidence weights of the
joints shown in the image $i$. Note that if $V_{1,i} = 0$, then
$V_{2,i} = 0$. We set the confidence weights of all joints as
listed in Table I. The confidence weight of a joint depends
on how critical this joint is for us to classify human postures.
For example, the nose is the critical joint for determining the
position of a human head, a hand and an elbow are the critical
joints for determining the position of a human arm, and a foot
and a knee are the critical joints for determining the position
of a human leg. The importance of the aforementioned joints
is relatively high, so the corresponding confidence weights
are large. As the secondary joints for us to classify human
postures, eyes, shoulders, and hips are less important than the
aforementioned critical joints, so the corresponding confidence
weights are smaller than those of the critical joints. After
adopting YOLOv5 in the next recognition step, we can have
three different recognition results for the images $i = 1$, $2$, and
$3$. Thereafter, the corresponding confidence measures $V_{2,i}$
will be invoked to fuse such recognition results to increase
the recognition accuracy.
Let $F_i$ denote the recognition result from the $i$th camera
for $i = 1$, $2$, and $3$. If a normal action is detected, we have
$F_i = 1$; otherwise (an abnormal action is detected instead),
we have $F_i = 0$. Thus, the overall (fusion) decision rule is
given by
$$\begin{cases} \text{normal action}, & \ldots \\ \text{abnormal action}, & \ldots \end{cases}$$
where the “fused decision metric” $F$ is defined by
$$F = \frac{1}{\ldots} \sum_{i=1}^{3} V_{1,i} \times V_{2,i} \times F_i.$$
Fig. 7. Network structure of the YOLOv5s model (duplicated from [39]).
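A confidence-weighted fusion of the per-camera decisions can be sketched as follows. Two choices in this sketch are assumptions, because the corresponding parts of the rule are not legible in this copy of the article: the weighted sum is normalized by the total confidence of the accepted views, and the result is compared with a threshold of 0.5. The joint weights passed to `confidence_measure` are likewise hypothetical stand-ins for the values of Table I.

```python
def confidence_measure(detected_joints, weights):
    """Normalized weighted sum of the weights of the detected joints.

    weights: {joint_index: weight}; hypothetical values, not those of Table I.
    """
    total = sum(weights.values())
    return sum(weights[j] for j in detected_joints if j in weights) / total


def fuse(views, threshold=0.5):
    """Fuse per-camera results.

    views: list of (v1, v2, f) tuples, one per camera, where v1 is the
    image-quality flag, v2 the confidence measure, and f = 1 (normal) or
    0 (abnormal). Normalization and threshold are assumptions of this sketch.
    """
    weight_sum = sum(v1 * v2 for v1, v2, _ in views)
    if weight_sum == 0:
        return None  # every view was discarded
    score = sum(v1 * v2 * f for v1, v2, f in views) / weight_sum
    return "normal action" if score >= threshold else "abnormal action"


# One confident camera outweighs two less confident ones (0.9 / 1.6 > 0.5).
decision = fuse([(1, 0.9, 1), (1, 0.4, 0), (1, 0.3, 0)])
```

The example shows how this differs from a simple majority vote: two of three cameras report an abnormal action, yet the fused decision follows the single high-confidence view.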
E. Human-Posture Recognition
Finally, our proposed human-posture recognition mechanism in this work consists of two steps, namely 1) region
of interest (ROI) localization and 2) posture classification
using YOLOv5s. Details of these two steps can be found as follows.
1) ROI Localization: To expedite the
posture classification step using YOLOv5 later on, the kernel
correlation filter (KCF) algorithm in [32] is adopted here to
localize the ROI of a skeletal graph.
2) Posture Classification Using YOLOv5s: The YOLO algorithm, proposed by Redmon et al. [33], was
designed for target detection (classification) and localization.
It has evolved into different versions (V1–V5) [33], [34], [35],
[36]. The YOLOv5 model mainly consists of four parts,
namely input, backbone, neck, and prediction. The YOLOv5s
model is adopted in this work; its network structure,
shown by Fig. 7, includes six modules, namely the CBL,
Res-Unit, CSP1X, CSP2X, Focus, and SPP modules. The CBL
module is a basic convolution module, while the Res-Unit
module adopts the residual structure of the ResNet network
for reference to build a deep network. Both CSP1X and
CSP2X modules are modified from the network structure of
CSPNet. The Focus module was recently established to slice
images while the SPP module adopts the multidimensional
maximum pooling technique to fuse multiscale features of an
image. The YOLOv5 algorithm includes four different network
structure models, namely YOLOv5s, YOLOv5m, YOLOv5l,
and YOLOv5x, according to [37]. The major differences
among these models lie in the network depth
and width. We adopt the YOLOv5s model because it has
the smallest network depth and width among them. The
performance comparison of the aforementioned four network
models is reported in [38]. According to [38], the detection
performance improves as the network depth and width both
increase. However, the required computation time also increases
as the network depth and width both increase. In this work,
we focus on real-time implementation and hence we use the
smallest depth and width.
Fig. 8. Four successive detections of (a) normal actions and
(b) abnormal actions using three visible-light cameras that capture
photos simultaneously (using three rows of synchronous photos).
Fig. 9. Performance comparison of three schemes, namely our proposed novel robust abnormal human-posture recognition approach, the SSD
algorithm proposed in [40], and the Faster R-CNN algorithm proposed in [41], in terms of (a) accuracy, (b) sensitivity, (c) specificity, and (d) precision
with respect to the training set ratio using visible-light images.
III. Analysis and Evaluation of Experimental Results
A. Simulation Setup
All simulations are carried out on the Windows 10 operating
system, where the NVIDIA RTX 2060 GPU and the CUDA 11.7
computing platform are utilized. The learning is based on
the PyTorch 1.10 framework. During the training process,
the stochastic gradient-descent (SGD) method based on the
Nesterov momentum (0.9) is adopted to optimize the learning
model, where the batch size is set to be 16, the weight decay
is set to be 0.0001, and the ℓ2 loss function is invoked.
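For readers unfamiliar with the optimizer settings, the following pure-Python sketch applies SGD with Nesterov momentum and weight decay to a one-parameter ℓ2 loss. It assumes the update form documented for PyTorch's `torch.optim.SGD` with `nesterov=True`, and the learning rate 0.01 is the initial value reported in the dataset subsection; the toy loss and the function name `sgd_nesterov_step` are illustrative only, and mini-batching (batch size 16) is not modeled.

```python
def sgd_nesterov_step(param, grad, buf, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with Nesterov momentum and L2 weight decay.

    Returns the updated parameter and the updated momentum buffer.
    """
    g = grad + weight_decay * param   # weight decay enters as an extra gradient term
    buf = momentum * buf + g          # momentum buffer accumulates gradients
    step = g + momentum * buf         # Nesterov look-ahead correction
    return param - lr * step, buf


# Toy problem: minimize the l2 loss (w - 3)^2 starting from w = 0.
w, buf = 0.0, 0.0
for _ in range(2000):
    grad = 2.0 * (w - 3.0)            # derivative of (w - 3)^2
    w, buf = sgd_nesterov_step(w, grad, buf)
```

With weight decay, the iterate settles slightly below 3, at the minimizer of the regularized objective, which is 6 / (2 + 0.0001).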
B. Dataset
The NTU-RGB + D dataset contains 60 categories of
actions (see Table II) with a total of 56 880 video samples.
These 60 categories can be further grouped into three major
categories: 40 categories of actions belong to “daily behaviors,” nine categories of actions belong to “health-related
actions,” and 11 categories of actions belong to “two-person
mutual actions.” These actions were performed by 40 individuals aged from 10 to 35. The NTU-RGB + D dataset
was collected using three Microsoft Kinect V2 cameras
at different angles; the video data were collected
in the form of depth sequences, 3-D skeletal data, RGB videos,
and infrared frames [27], [28]. The three cameras were located
at the same height but with different orientations. To further
enhance the camera views, the heights and distances of the
cameras can be changed according to Table III. In this work,
we utilize the 18 skeletal points (joints) to identify abnormal
human postures. We further divide the dataset into training and
test sets for cross-validation. The total number of image frames
extracted from RGB videos is 31 357, while the proportions of
these data belonging to the training sets (a.k.a. the training set
ratios) are 20%, 40%, 60%, and 80%, respectively. To tackle
the problem of insufficient training data when the training set
ratio is small, we also apply the “data-augmentation” strategy
to create more training data by flipping the images horizontally and varying the brightness. The memory requirement is
4.2 GB, the training time for 2400 images (collected from three
Microsoft Kinect V2 cameras) is about 2.03 s per epoch, and
it requires no more than 1000 epochs to converge according to
our empirical experience. The system response time is 0.198 s.
The weight decay and the initial learning rate are set to be
0.0001 and 0.01, respectively.
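The data-augmentation strategy mentioned above (horizontal flipping and brightness variation) can be sketched on a grayscale image stored as nested lists. The brightness factors 0.8 and 1.2 and all function names are hypothetical; the article does not report the factors actually used.

```python
def flip_horizontal(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]


def adjust_brightness(image, factor):
    """Scale pixel intensities by `factor`, clipped to the 8-bit range."""
    return [[min(255, max(0, round(v * factor))) for v in row] for row in image]


def augment(image, factors=(0.8, 1.2)):
    """Return the flipped copy plus one brightness-shifted copy per factor.

    The default factors are illustrative assumptions.
    """
    variants = [flip_horizontal(image)]
    for factor in factors:
        variants.append(adjust_brightness(image, factor))
    return variants


# One source image yields three additional training images.
sample = [[10, 200], [30, 250]]
variants = augment(sample)
```

When flipped images are used with skeletal labels, the left and right joint labels would also need to be swapped; that bookkeeping is omitted here.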
C. Evaluation of Results
Our experiments are focused on four types of abnormal
postures and four types of normal postures as identified
by [42]. The abnormal postures include the postures related
to headache, chest pain, back pain, and neck pain, which
indicate abnormal health conditions, while the normal postures include drinking, nodding/bowing, sitting, and standing.
Fig. 8(a) demonstrates the four successive detections of normal postures and Fig. 8(b) demonstrates the four successive
detections of abnormal postures (note that each row of photos
is taken by a camera while three rows of photos are taken
simultaneously by three different cameras).
Our proposed novel robust abnormal human-posture recognition approach is compared with the existing skeleton-based
behavior recognition schemes including the single-shot multibox detector (SSD) algorithm proposed in [40] and the faster
region-based convolutional neural network (Faster R-CNN) algorithm proposed
in [41]. The multiangle image fusion strategy based on confidence is a part of our proposed new scheme. In contrast,
the other two existing (SSD and Faster R-CNN) schemes in
comparison do not fuse multiview images based on confidence;
instead, they reach the final decision using a simple majority
vote of the local decisions based on individual images. The
recognition results from the aforementioned three schemes in
comparison are depicted by Fig. 9. To verify the effectiveness
of each scheme, the performance is evaluated using four
metrics, namely accuracy, sensitivity, specificity, and precision,
all defined in [43], as given by
$$\text{accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{TN} + \mathrm{FP}}$$
$$\text{sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
$$\text{specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}$$
$$\text{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
where “TP” (true positive) denotes the number of abnormal-action photos recognized as abnormal-action photos,
“TN” (true negative) denotes the number of normal-action
photos recognized as normal-action photos, “FP” (false
positive) denotes the number of normal-action photos
recognized as abnormal-action photos, and “FN” (false
negative) denotes the number of abnormal-action photos
recognized as normal-action photos. According to Fig. 9,
our proposed new robust abnormal-human-posture recognition
approach (denoted by “Our model” in the figures) outperforms
the other two existing schemes in terms of the accuracy,
sensitivity, specificity, and precision when visible-light cameras are used and the training set ratio reaches up to 60%.
Fig. 10. Four successive detections of (a) normal actions and
(b) abnormal actions using three thermal-infrared cameras which capture photos simultaneously (using three rows of synchronous photos).
Fig. 11. Performance comparison of three schemes, namely our proposed novel robust abnormal human-posture recognition approach, the SSD
algorithm proposed in [40], and the Faster R-CNN algorithm proposed in [41], in terms of (a) accuracy, (b) sensitivity, (c) specificity, and (d) precision
with respect to the training set ratio using thermal infrared images.
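The four evaluation metrics follow directly from the confusion counts, with abnormal actions treated as the positive class. The sketch below uses the standard definitions; the function name `classification_metrics` and the example counts are illustrative and are not results from the article.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and precision from confusion counts.

    Abnormal actions are the positive class. A metric with a zero
    denominator is reported as NaN.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "sensitivity": ratio(tp, tp + fn),   # share of abnormal photos detected
        "specificity": ratio(tn, tn + fp),   # share of normal photos kept normal
        "precision": ratio(tp, tp + fp),     # share of alarms that are correct
    }


# Made-up counts for 100 abnormal and 100 normal test photos.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```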
On the other hand, we also take into account that visible-light
cameras cannot capture quality photos at night owing to the
lack of lighting. It is well known that thermal infrared cameras
rely on human bodies’ own thermal radiation independent of
external lighting, making them effective for 24-h surveillance
requirements [44], [45]. Therefore, the applications of thermal
infrared cameras for safety monitoring in [46] and night-vision
assistance in [47] and [48] were developed [49]. The outstanding performance of the thermal infrared camera-dependent
approach for human-activity recognition was reported in [50].
Consistent with the experiments using visible-light images, our
experiments using thermal infrared images are also focused
on the same four abnormal and four normal postures identified in [42]. Fig. 10(a) demonstrates the four successive
detections of normal postures and Fig. 10(b) demonstrates
the four successive detections of abnormal postures (note
that each row of photos is taken by a camera, while three
rows of photos are taken simultaneously by three different
cameras). Furthermore, Fig. 11 depicts the accuracy, sensitivity, specificity, and precision with respect to the training set
ratio resulting from the aforementioned three schemes using
thermal infrared cameras in comparison. According to Fig. 11,
our proposed new robust abnormal-human-posture recognition
approach outperforms the other two existing schemes in terms
of the accuracy, sensitivity, specificity, and precision when the
training set ratio reaches up to 60%. As a result, the advantage of our proposed new robust abnormal-human-posture
recognition approach is more substantial for thermal infrared images.
IV. Conclusion
In this article, a new multiview cross-information learning
neural network model based on the OpenPose and YOLOv5
frameworks is proposed. The OpenPose network is adopted
to extract the key skeletal points of the human-body image.
Meanwhile, noise is suppressed to enhance the image quality. Then the YOLOv5 recognition system is employed for
training and recognition of normal/abnormal human postures.
To improve recognition accuracy, a new confidence mechanism
is introduced to measure the confidence level of the recognition
result from a single camera. Then a weighted-sum fusion rule
is established to fuse the individual recognition results from
different cameras. Through simulations based on the NTU-RGB + D dataset, we compare our proposed novel robust
abnormal human-posture recognition approach with the other
two existing schemes. Our proposed new approach leads to the
best performance in terms of accuracy, sensitivity, specificity,
and precision when the training data are sufficient (or the
training set ratio reaches up to 80%).
References
[1] L. Feifei, “Research on detection and recognition of indoor falls based
on video surveillance,” Ph.D. dissertation, School Control Sci. Eng.,
Dept. Biomed. Eng., Shandong Univ., Jinan, China, Apr. 2016.
[2] A. B. Abdusalomov, M. Mukhiddinov, A. Kutlimuratov, and
T. K. Whangbo, “Improved real-time fire warning system based on
advanced technologies for visually impaired people,” Sensors, vol. 22,
no. 19, p. 7305, Sep. 2022.
[3] Y. Cao, R. Xie, K. Yan, S.-H. Fang, and H.-C. Wu, “Novel dynamic
segmentation for human-posture learning system using hidden logistic
regression,” IEEE Signal Process. Lett., vol. 29, pp. 1487–1491, 2022.
[4] C. Yu, Z. Xu, K. Yan, Y.-R. Chien, S.-H. Fang, and H.-C. Wu,
“Noninvasive human activity recognition using millimeter-wave radar,”
IEEE Syst. J., vol. 16, no. 2, pp. 3036–3047, Jun. 2022.
[5] G. Liu et al., “Automatic human posture recognition using Kinect sensors
by advanced graph convolutional network,” in Proc. IEEE Int. Symp.
Broadband Multimedia Syst. Broadcast. (BMSB), Jun. 2022, pp. 1–7.
[6] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as
space-time shapes,” in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV),
Dec. 2005, pp. 2247–2253.
[7] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation
of local spatio-temporal features for action recognition,” in Proc. Brit.
Mach. Vis. Conf., 2009, pp. 1–11.
[8] H. Wang and C. Schmid, “Action recognition with improved trajectories,” in Proc. ICCV, Mar. 2013, pp. 3551–3558.
[9] X. Yukai, S. Shengli, L. Linjian, L. Huikai, and Z. Yue, “Overview
of human abnormal behavior recognition based on computer vision,”
Infrared, vol. 39, no. 11, pp. 34–39, Nov. 2018.
[10] P.-C. Hsiao, C.-S. Chen, and L.-W. Chang, “Human action recognition
using temporal-state shape contexts,” in Proc. 19th Int. Conf. Pattern
Recognit., Dec. 2008, pp. 1–4.
[11] H.-B. Zhang, S.-Z. Li, F. Guo, S. Liu, and B.-X. Liu, “Real-time
human action recognition based on shape combined with motion feature,” in Proc. IEEE Int. Conf. Intell. Comput. Intell. Syst., Oct. 2010,
pp. 633–637.
[12] Z. Zhang and J. Liu, “Recognizing human action and identity based
on affine-sift,” in Proc. Int. Conf. Electr. Electron. Eng. (EEESYM),
Jun. 2012, pp. 216–219.
[13] A. S. Alharthi, S. U. Yunas, and K. B. Ozanyan, “Deep learning for
monitoring of human gait: A review,” IEEE Sensors J., vol. 19, no. 21,
pp. 9575–9591, Nov. 2019.
[14] W. Tieyan, “Human fall detection method based on smartphone and
machine learning algorithm,” Sci. Technol. Innov., vol. 105, pp. 85–88,
Jul. 2022.
[15] U. Zia, W. Khalil, S. Khan, I. Ahmad, and M. N. Khan, “Towards human
activity recognition for ubiquitous health care using data from a waist-mounted smartphone,” Turkish J. Electr. Eng. Comput. Sci., vol. 28,
no. 2, pp. 646–663, Mar. 2020.
[16] A. Basavaraju, J. Du, F. Zhou, and J. Ji, “A machine learning approach
to road surface anomaly assessment using smartphone sensors,” IEEE
Sensors J., vol. 20, no. 5, pp. 2635–2647, Mar. 2020.
[17] E. Ramanujam, T. Perumal, and S. Padmavathi, “Human activity recognition with smartphone and wearable sensors using deep learning techniques: A review,” IEEE Sensors J., vol. 21, no. 12, pp. 13029–13040,
Mar. 2021.
[18] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2D
pose estimation using part affinity fields,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 172–186.
[19] Z. Han, J. Zhao, H. Leung, K. F. Ma, and W. Wang, “A review of deep
learning models for time series prediction,” IEEE Sensors J., vol. 21,
no. 6, pp. 7833–7848, Mar. 2021.
[20] D. Darsena, G. Gelli, I. Iudice, and F. Verde, “Sensing technologies
for crowd management, adaptation, and information dissemination in
public transportation systems: A review,” IEEE Sensors J., vol. 23, no. 1,
pp. 68–87, Jan. 2023.
[21] S. Cai, M. Shao, M. Du, G. Bao, and B. Fan, “A binocular-camera-assisted sensor-to-segment alignment method for inertial sensor-based
human gait analysis,” IEEE Sensors J., vol. 23, no. 3, pp. 2663–2671,
Feb. 2023.
[22] G. Zhang, J. Yin, P. Deng, Y. Sun, L. Zhou, and K. Zhang, “Achieving
adaptive visual multi-object tracking with unscented Kalman filter,”
Sensors, vol. 22, p. 9106, Nov. 2022.
[23] X. Tang, H. Song, W. Wang, and Y. Yang, “Vehicle spatial distribution
and 3D trajectory extraction algorithm in a cross-camera traffic scene,”
in Proc. Int. Conf. Sensors, Basel, Switzerland, Nov. 2020, p. 6517.
[24] B. M. Nair, J. Foytik, R. Tompkins, Y. Diskin, T. Aspiras, and V. Asari,
“Multi-pose face recognition and tracking system,” Proc. Comput. Sci.,
vol. 6, pp. 381–386, Aug. 2011.
[25] P. Sun and X. Ding, “UAV image detection algorithm based on improved
YOLOv5,” in Proc. IEEE 5th Int. Conf. Inf. Syst. Comput. Aided Educ.
(ICISCAE), Sep. 2022, pp. 757–760.
[26] R. Ravindran, M. J. Santora, and M. M. Jamali, “Multi-object detection
and tracking, based on DNN, for autonomous vehicles: A review,” IEEE
Sensors J., vol. 21, no. 5, pp. 5668–5677, Mar. 2021.
[27] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A large
scale dataset for 3D human activity analysis,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1010–1019.
[28] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot,
“NTU RGB+D 120: A large-scale benchmark for 3D human activity
understanding,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 10,
pp. 2684–2701, Oct. 2020.
[29] A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with
supervised learning,” in Proc. 22nd Int. Conf. Mach. Learn. (ICML),
Jan. 2005, pp. 625–632.
[30] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate
multiclass probability estimates,” in Proc. 8th ACM SIGKDD Int. Conf.
Knowl. Discovery Data Mining, Jul. 2002, pp. 694–699.
[31] X. Cai, F. Shuang, X. Sun, Y. Duan, and G. Cheng, “Towards lightweight
neural networks for garbage object detection,” Sensors, vol. 22, no. 19,
p. 7455, Jul. 2022.
[32] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed
tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
[33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Jun. 2016, pp. 779–788.
[34] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017,
pp. 6517–6525.
[35] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,”
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1–6.
[36] A. Bochkovskiy, C. Wang, and H. Liao, “YOLOv4: Optimal speed and
accuracy of object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Apr. 2020, pp. 1–17.
[37] W. Li, H. Mutian, X. Shuo, Y. Tian, Z. Tianyi, and L. Jianfei, “Waste
classification and detection based on YOLOv5s network,” Packag. Eng.,
vol. 42, pp. 50–56, Aug. 2021.
[38] J. Xue, F. Cheng, Y. Li, Y. Song, and T. Mao, “Detection of farmland
obstacles based on an improved YOLOv5s algorithm by using CIoU and
anchor box scale clustering,” Sensors, vol. 22, no. 5, p. 1790, Feb. 2022.
[39] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, “TPH-YOLOv5: Improved
YOLOv5 based on transformer prediction head for object detection on
drone-captured scenarios,” in Proc. IEEE/CVF Int. Conf. Comput. Vis.
Workshops (ICCVW), Oct. 2021, pp. 2778–2788.
[40] W. Liu, D. Anguelov, and D. Erhan, “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Dec. 2016, pp. 21–37.
[41] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Int. Conf.
Adv. Neural Inf. Process. Syst., Jun. 2015, pp. 1137–1149.
[42] J. Wang, D. Chen, and J. Yang, “Human behavior classification by
analyzing periodic motions,” Frontiers Comput. Sci. China, vol. 4, no. 4,
pp. 580–588, Mar. 2010.
[43] J. Li, Z. Chi, and Z. Li, “Human fall detection system based on
threshold analysis method,” Transducer Microsyst. Technol., vol. 8,
pp. 1209–1221, Apr. 2019.
[44] A. Akula, A. K. Shah, and R. Ghosh, “Deep learning approach for
human action recognition in infrared images,” Cognit. Syst. Res., vol. 50,
pp. 146–154, Aug. 2018.
[45] B. A. El-Rahiem et al., “An efficient deep learning model for classification of thermal face images,” J. Enterprise Inf. Manag., vol. 11,
pp. 1–12, Jul. 2020.
[46] T. P. Rani, P. Kalaichelvi, S. Sakthy, and S. Padmasri, “Monitoring
and training KIT for autism spectrum disorder patients using artificial
intelligence,” in Proc. 1st Int. Conf. Comput. Sci. Technol. (ICCST),
Nov. 2022, pp. 251–262.
[47] K. Geng and G. Yin, “Using deep learning in infrared images to enable
human gesture recognition for autonomous vehicles,” IEEE Access,
vol. 8, pp. 88227–88240, 2020.
[48] C. Zhang, D. Xiao, Q. Yang, Z. Wen, and L. Lv, “Review: Application of
infrared thermography in livestock monitoring,” Trans. ASABE, vol. 63,
no. 2, pp. 389–399, 2020.
[49] A. N. Wilson, K. Gupta, B. H. Koduru, A. Kumar, A. Jha, and
L. R. Cenkeramaddi, “Recent advances in thermal imaging and its
applications using machine learning: A review,” IEEE Sensors J., vol. 23,
no. 4, pp. 3395–3407, Feb. 2023.
[50] H. Hei, X. Jian, and E. Xiao, “Sample weights determination based on
cosine similarity method as an extension to infrared action recognition,”
J. Intell. Fuzzy Syst., vol. 40, no. 3, pp. 3919–3930, Mar. 2021.
Mingyang Xu was born in 2000, in Jiangsu,
China. He is currently pursuing the B.S. degree
in communication engineering with Central
South University, Changsha, China.
His research interests include the areas
of network communication and signal/image processing.
Limei Guo received the B.S. degree in
electronic engineering from Hunan University,
Changsha, China, in 1995, and the M.S. and
Ph.D. degrees in traffic information and engineering control from Central South University,
Changsha, in 2002 and 2010, respectively.
From September 2013 to September 2014,
she was a Visiting Scholar at the School
of Electrical Engineering and Computer Science,
Louisiana State University, Baton Rouge, LA,
USA. Since January 2002, she has been with
the Faculty of Central South University, where she is currently an Associate
Professor. She has published more than 20 technical journal
and conference papers in communication engineering. Her research
interests include the areas of wireless communications and image processing.
Hsiao-Chun Wu (Fellow, IEEE) received the
B.S.E.E. degree from National Cheng Kung
University, Tainan, Taiwan, in 1990, and the
M.S. and Ph.D. degrees in electrical and
computer engineering from the University of
Florida, Gainesville, FL, USA, in 1993 and 1999, respectively.
From March 1999 to January 2001, he worked
for Motorola Personal Communications Sector
Research Labs as a Senior Electrical Engineer.
From July to August 2007, he was a
Visiting Assistant Professor at Television and Networks Transmission
Group, Communications Research Centre, Ottawa, ON, Canada. From
August to December 2008, he was a Visiting Associate Professor at
the Department of Electrical Engineering, Stanford University, Stanford,
CA, USA. Since January 2001, he has been with the Faculty of the
Department of Electrical and Computer Engineering, Louisiana State
University (LSU), Baton Rouge, LA, USA. He is currently a Distinguished
Professor at LSU. He is also a Visiting Professor of the International
College of Semiconductor Technology, National Chiao Tung University,
Hsinchu, Taiwan. Besides, he is currently with the Innovation Center for
Artificial Intelligence Applications, Yuan Ze University, Chungli, Taiwan.
He has published more than 300 peer-refereed technical journal and
conference papers in electrical and computer engineering. His research
interests include the areas of wireless communications and signal processing.
Dr. Wu is an IEEE Distinguished Lecturer. He currently serves
ON MOBILE COMPUTING. Besides, he is an Academic Editor for Sensors. He used to serve as an Editor and Technical Editor for IEEE
SIGNAL PROCESSING LETTERS, and IEEE Communications Magazine.
He has also served for numerous textbooks, IEEE/ACM conferences
and journals as the technical committee, symposium chair, track
chair, or Reviewer in signal processing, communications, circuits, and