Robust Abnormal Human-Posture Recognition Using OpenPose and Multiview Cross-Information

Mingyang Xu, Limei Guo, and Hsiao-Chun Wu, Fellow, IEEE

Abstract—With the emerging demand for intelligent health surveillance, video information has been widely explored to facilitate real-time automatic patient-monitoring systems. Recently, human-posture recognition techniques based on deep learning or artificial-intelligence networks have been reported in the literature. Nonetheless, during the training stage, the testing stage, or both, it is quite difficult to extract reliable features for all postures to be recognized. In this work, we propose a robust multiperspective abnormal human-posture recognition approach based on multiview cross-information and a confidence measure, which is adopted to evaluate the importance of the posture-feature information from different perspectives. In our proposed approach, human skeletal data are first extracted by OpenPose. Then those data are utilized as the input of the YOLOv5s system to recognize/detect abnormal postures such as falls and bumps. Based on the NTU-RGB+D public dataset and the PyTorch framework, the simulation results show that our proposed abnormal human-posture recognition method can lead to high accuracy.

Index Terms—Abnormal human-posture recognition, confidence measure, deep learning, multiview cross-information.

Manuscript received 17 February 2023; revised 10 April 2023; accepted 11 April 2023. Date of publication 19 April 2023; date of current version 31 May 2023. This work was supported by the Louisiana Board of Regents Research Competitiveness Subprogram under Grant LEQSF(2021-22)-RD-A-34. The associate editor coordinating the review of this article and approving it for publication was Dr. Avik Santra. (Corresponding author: Hsiao-Chun Wu.) Mingyang Xu and Limei Guo are with the School of Computer Science and Engineering, Central South University, Changsha, Hunan 410075, China (e-mail: xdjh2007@163.com; xmy12312300@126.com). Hsiao-Chun Wu is with the School of Electrical Engineering and Computer Science, Louisiana State University, Baton Rouge, LA 70803 USA, and also with the Innovation Center for AI Applications, Yuan Ze University, Chungli 32003, Taiwan (e-mail: wu@ece.lsu.edu). Digital Object Identifier 10.1109/JSEN.2023.3267300

I. INTRODUCTION

The pervasive applications of intelligent surveillance technology can be found in people's daily lives, as massive video information has been explored for event detection, violent-scene analysis, and human-behavior recognition. For example, surveillance cameras could be installed in the houses of elderly people who live alone for health monitoring [1], [2]. Nowadays, human behavior/posture recognition can benefit from the emerging artificial-intelligence technology [3], [4], [5]. Nevertheless, how to effectively detect and recognize abnormal human postures from surveillance videos is still quite challenging. Although many researchers have been working on this problem for years and a number of techniques have been proposed, difficulties still exist in real scenarios. In practice, many factors significantly influence the recognition accuracy, such as the shooting angle, the luminance, the scenery background, and the human's outfit [6].
The conventional human-posture recognition schemes include the dense-trajectory (DT) algorithm in [7] and the improved-dense-trajectories (iDT) method in [8] and [9]; these two methods rely on removing the background trajectories induced by camera motion. The local space–time feature, which exploits both temporal states for compensating time-warping effects and shape contexts for extracting space–time shape variations, was introduced for human-action recognition [10]. In [11], optical flow is adopted to capture the motion information and locate the region of interest, histograms of oriented gradients together with the optical-flow histogram are computed to describe that region, and then a support vector machine (SVM) classifier is applied to recognize various types of human actions. The affine scale-invariant feature transform (SIFT) was proposed to recognize human actions [12]. Generally speaking, the aforementioned schemes do not require a large amount of training data collected under many different conditions, but they cannot reach a very high accuracy [13]. In comparison with the above-stated conventional methods, deep-learning networks have the advantages of fast computation speed, high accuracy, and end-to-end training capacity, and thus they have become very popular for human-posture recognition in the recent decade [14], [15], [16]. However, in reality, the accuracy of human-posture recognition would degrade due to unreliable feature information [17].

To solve this often-encountered problem, in this work, we propose a novel multiperspective human-posture recognition scheme, which is based on a confidence measure to integrate the multiview cross-information. In our approach, human skeletal data involving the first-order information (joint-point coordinates) and the second-order information (bone length and orientation), which lead to reliable features (not easily affected by external environmental factors), are first acquired for human-posture recognition. The overall framework of our proposed approach is illustrated by Fig. 1. The open-source libraries "Letterbox," "OpenPose," and "YOLOv5s" are utilized in our approach. Letterbox is utilized to preprocess the video images so that they all have the same size. Then, OpenPose is utilized to extract the human skeletal data from multiperspective (multiview) images. According to the number and quality of the skeletal points, the importance of the extracted feature information from different images is evaluated by the proposed confidence measure. Only those features which meet the prespecified confidence requirements are used to classify postures. The confidence measure is adopted to fuse the cross-information of multiperspective images and thus enhance the recognition accuracy. YOLOv5s further extracts the first- and second-order information of the skeleton points to classify abnormal human postures such as falls and bumps. The system response time is 0.198 s. To achieve (near) real-time operation, we propose to sample one frame out of every two to three consecutive frames during the test.

Fig. 1. Block diagram of our proposed new robust abnormal human-posture recognition approach.
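Since the letterbox resizing just mentioned (and detailed in Section II-A2) is central to the preprocessing stage, a minimal sketch is given below. It is written with OpenCV; the function name, the 640 × 640 target size, and the gray padding value are illustrative assumptions rather than the exact implementation used in this work.

```python
import cv2

def letterbox_resize(image, target=(640, 640), pad_value=114):
    """Resize `image` to `target` (width, height) while preserving its aspect
    ratio, and pad the remaining border with `pad_value` pixels."""
    h, w = image.shape[:2]
    tw, th = target
    scale = min(tw / w, th / h)                        # uniform scaling factor
    nw, nh = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (nw, nh), interpolation=cv2.INTER_LINEAR)
    # Split the leftover border evenly between the two sides.
    dw, dh = tw - nw, th - nh
    left, right = dw // 2, dw - dw // 2
    top, bottom = dh // 2, dh - dh // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)

# Example usage on one sampled frame:
# frame = cv2.imread("view1_frame0001.jpg")
# canvas = letterbox_resize(frame)   # canvas.shape == (640, 640, 3)
```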
The rest of this article is organized as follows. Section II outlines our proposed new robust abnormal human-posture recognition approach. Section III presents the analysis and evaluation of the experimental results. The conclusion is finally drawn in Section IV.

II. NOVEL ROBUST ABNORMAL HUMAN-POSTURE RECOGNITION APPROACH

Our proposed novel robust abnormal human-posture recognition approach consists of five mechanisms, namely 1) preprocessing, 2) extraction of human skeletal points, 3) multiview comprehensive analysis, 4) confidence evaluation, and 5) human-posture recognition. Details will be presented in Sections II-A–II-E.

A. Preprocessing

The first mechanism, preprocessing, can be further divided into two subtasks, namely 1) image enhancement and 2) image resizing. The details are as follows.

1) Image Enhancement: During data acquisition, images are inevitably corrupted by various noises. Several approaches can be utilized to suppress such noise, including low-pass filtering, discrete Fourier transform (DFT) filtering, and wavelet decomposition. In this work, denoising convolutional neural networks (DnCNNs) are adopted for image denoising. The image-enhancement subtask here is the process of adjusting images so that the results are more suitable for display or further image analysis; for example, sharpening and brightening an image makes it easier to identify key features. In this work, histogram equalization is adopted for image enhancement.

2) Image Resizing Using Letterbox: The enhanced images need to be resized such that all images have the same size for the ultimate human-posture recognition. In real-world scenarios, different images have different sizes. It is usual to uniformly scale the original image to a standard size using zooming and filling. However, if too much information needs to be filled in, the required processing time would be long. Letterbox solves this problem better: it maintains the length–width (aspect) ratio of the image and adaptively pads the original image with the fewest black edges (fillers).

B. Extraction of Human Skeletal Points Using OpenPose

OpenPose is an open-source package that recognizes and classifies human joints based on deep learning [18]; it adopts a technique based on Part Affinity Fields (PAFs) to associate various parts of a human body and form a complete human skeleton on the image. To do so, OpenPose first invokes the VGG-19 model to extract image features and feeds these features into a pair of convolutional neural network (CNN) branches running in parallel. As shown by Fig. 2, the first branch of the OpenPose network calculates the confidence maps to detect body parts. The second branch of the OpenPose network calculates the PAFs and combines various body parts to form the skeleton of a human body. These two parallel branches can be run multiple times to create robust confidence maps and PAFs. Feature F in Fig. 2 is produced by the VGG-19 model.

TABLE I: Indices, Names, and Confidence Weights of Joints.

Fig. 2. OpenPose network architecture (duplicated from [18]).
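As an illustration of this extraction step, the following sketch calls OpenPose's Python bindings to obtain, for each detected person, the 18 COCO-style keypoints (x_p, y_p) together with per-joint confidence scores. The model folder, the COCO model choice, and the exact binding calls (which vary across OpenPose releases) are assumptions, not the authors' code.

```python
import cv2
from openpose import pyopenpose as op  # OpenPose Python bindings, built from source

# Configure OpenPose for the 18-keypoint COCO skeleton used in this work
# (the default BODY_25 model outputs 25 keypoints instead).
params = {"model_folder": "openpose/models/", "model_pose": "COCO"}
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

def extract_skeletons(image):
    """Return an array of shape (num_people, 18, 3): (x, y, score) per joint."""
    datum = op.Datum()
    datum.cvInputData = image
    # Older releases accept a plain Python list here; newer ones need op.VectorDatum.
    wrapper.emplaceAndPop(op.VectorDatum([datum]))
    return datum.poseKeypoints

# skeletons = extract_skeletons(cv2.imread("view1.jpg"))
# skeletons[0, 0] -> (x, y, confidence) of joint 0 (the nose) for the first person
```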
Fig. 3. Illustration of a human skeletal map (graph).

To recognize abnormal human postures, we adopt the OpenPose model to acquire skeletal points in real time, which include 18 joints such as the nose, the shoulders, the knees, and so on. A typical human skeletal map is illustrated by Fig. 3. The skeletal points are labeled, and the coordinates (x_p, y_p) of the 18 skeletal points in the image are estimated, where p specifies the joint index. Table I lists the names of these joints. Meanwhile, since the positions of the joints change continuously over time, the vertical movement velocity v_y(m) of the human body's center point is calculated over every ten frames as given by

$$ v_y(m) \;\overset{\text{def}}{=}\; \frac{\sum_{p\in\mathcal{P}} y_p(m) - \sum_{p\in\mathcal{P}} y_p(m-9)}{36\,\tau} \qquad (1) $$

where m indicates the frame index, $\mathcal{P} \overset{\text{def}}{=} \{2, 5, 8, 11\}$, y_p(m) denotes the vertical coordinate (along the y-axis) of the pth joint in the mth frame, and τ represents the reciprocal of the frame rate (the time between two successive frames). Obviously, abrupt variations of v_y(m) over time could indicate abnormal human postures. Note that the temporal relationship across consecutive frames is not considered in this work. Since the instantaneous vertical coordinates y_p(m) depend on the view angle, how to determine the threshold for judging the abnormal temporal variations of v_y(m) is not trivial [19]. Therefore, we propose to adopt the YOLOv5 package to construct and train a behavior-recognition model based on all skeletal points to detect abnormal human postures such as falls and bumps.

C. Multiview Comprehensive Analysis

In reality, images are often acquired by fixed cameras and other fixed devices [20]. Since the human(s) in the scene can move around, the acquired image may often not show the front view of a person [21]. If only a single camera is available to acquire a single-view image for human-posture recognition, the performance may often be unsatisfactory [22]. Hence, multiview recognition methods were proposed in [23], [24], and [25]. In [23], traffic data were collected using multiple cameras to obtain a wide range of trajectories around a crossroad, thereby overcoming the limited visual angle and the vehicle occlusion of the single-camera scene. The multiview approach also demonstrated great potential in the field of artificial-intelligence face recognition [24]. Three cameras were used in the unmanned-aerial-vehicle (UAV) target-tracking algorithm in [25] to effectively handle the problem that human bodies may appear in images with different proportions, orientations, and degrees of occlusion. In this work, we propose a new multiview recognition approach to extract the human skeletal data from multiperspective images so as to combat the low accuracy incurred by single-view images [26].

In multiview recognition, the number of cameras should be selected carefully to balance the recognition accuracy against the computational complexity. Generally speaking, the recognition accuracy increases with the number of cameras from different angles. However, when the number of cameras increases, the required computational complexity also increases, which negatively affects the real-time recognition performance.
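Returning to the center-point velocity in (1), the short sketch below transcribes that formula literally, assuming the per-frame joint coordinates are buffered in a NumPy array of shape (num_frames, 18, 2); the data layout and the function name are our assumptions.

```python
import numpy as np

CENTER_JOINTS = [2, 5, 8, 11]  # the joint set P in (1)

def vertical_velocity(joints, m, frame_rate):
    """Compute v_y(m) per (1).

    joints:     NumPy array of shape (num_frames, 18, 2) holding (x_p, y_p).
    m:          current frame index (requires m >= 9).
    frame_rate: frames per second, so tau = 1 / frame_rate.
    """
    tau = 1.0 / frame_rate
    y_now = joints[m, CENTER_JOINTS, 1].sum()        # sum of y_p(m) over P
    y_prev = joints[m - 9, CENTER_JOINTS, 1].sum()   # sum of y_p(m - 9) over P
    # 36 = |P| * 9: average over the four joints and the nine frame intervals.
    return (y_now - y_prev) / (36.0 * tau)

# In image coordinates y grows downward, so a large positive v_y(m) suggests a
# rapid drop of the body's center point, e.g., during a fall (hypothetical usage):
# if vertical_velocity(joints, m, frame_rate=30) > SPEED_THRESHOLD: ...
```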
To determine how many cameras would be appropriate for obtaining a satisfactory recognition accuracy and an acceptable real-time performance, we randomly select 500 groups of images pertinent to normal human postures and another 500 groups of images pertinent to abnormal human postures from the NTU-RGB+D dataset (see [27], [28]) to form a training dataset. Then, we randomly select 100 groups of images pertinent to normal human postures and another 100 groups of images pertinent to abnormal human postures from the NTU-RGB+D dataset to form a test dataset. We then train the model and record the changes in the recognition accuracy as the number of cameras varies. The test results are shown in Fig. 4. According to Fig. 4, when the number of cameras is less than four, the recognition accuracy increases linearly; when the number of cameras is greater than or equal to four, the recognition accuracy is not significantly improved (it increases only sublinearly). The "training set ratio"¹ for Fig. 4 is 80%. Therefore, in this work, we decide to use the image data acquired by three different cameras to balance the recognition accuracy and the real-time requirement.

Fig. 4. Test recognition accuracy versus the number of cameras, where the training set ratio is 80%.

D. Confidence Evaluation

The effectiveness and necessity of confidence mechanisms have been demonstrated in the existing literature [29], [30], where various confidence scores were introduced for regression problems to determine which data should actually be used to estimate the underlying parameters. In this work, our objective is quite different: we would like to 1) discard those images containing blocked or blurred views and 2) evaluate a confidence score for the accepted image data so that the ultimate weighted decision can reach more accurate recognition results. Our proposed new multiview-image confidence mechanism is performed by the process illustrated by Fig. 5, where both the feature-extraction and confidence-evaluation tasks are based on the first-order information (joint coordinates) and the second-order information (bone lengths and orientations) of the skeletal data, as well as the motion information of the global map. Thus, our proposed new abnormal human-posture recognition system relies on the joints, frames, and features associated with high confidence scores.

¹The available dataset is divided into two parts, namely the training and test sets. The proportion of the entire data used as the training data is called the training set ratio.

Fig. 5. Process of the proposed multiview-image confidence mechanism. If the extracted skeletal graph includes at least one joint point in the lower (upper) body, we set lowerhalfflag = 1 (upperhalfflag = 1). The coincide value is the percentage of the overlapping area of the OpenPose and YOLOv5 bounding boxes relative to the total area.

Fig. 6. Illustration of a multicamera image-data acquisition system.

Assume that three cameras take photos (images) at the same time, as shown by Fig. 6. Thus, we can label the images from the different cameras by i = 1, 2, and 3.
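Before stating the image-quality conditions, the following is a minimal helper for the "coincide value" used in Fig. 5, i.e., the percentage of overlap between the OpenPose box and the YOLOv5 box; interpreting "total area" as the union area (an IoU-style measure) and the (x1, y1, x2, y2) box format are our assumptions.

```python
def coincide_value(box_a, box_b):
    """Overlap percentage of two axis-aligned boxes (x1, y1, x2, y2),
    computed here as intersection area divided by union (total) area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Condition 2) below then reads: coincide_value(openpose_box, yolo_box) > 0.30
```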
Given each image (photo) acquired by a camera, if the following three conditions are satisfied, we set the "image-quality flag" V_{1,i} = 1 for the ith image; otherwise, we set V_{1,i} = 0: 1) the number of extracted joints of the human is greater than nine (as examined by the confidence mechanism); 2) the box produced by OpenPose and the box produced by YOLOv5 intersect with each other by more than 30% (when OpenPose is used to extract the skeletal points and YOLOv5 is used for human-posture recognition [31], these boxes frame the human body and the skeletal points, respectively); and 3) the extracted skeletal graph includes at least one joint point in each of the lower and upper bodies. When V_{1,i} = 0, the image (photo) is discarded and not used in the next recognition step.

If we have satisfactory confidence in the ith image (i.e., V_{1,i} = 1), then the "confidence measure" V_{2,i} specifies the normalized weighted sum of the confidence weights of the joints shown in image i. Note that if V_{1,i} = 0, then V_{2,i} = 0. We set the confidence weights of all joints as listed in Table I. The confidence weight of a joint depends on how critical this joint is for classifying human postures. For example, the nose is the critical joint for determining the position of a human head, a hand and an elbow are the critical joints for determining the position of a human arm, and a foot and a knee are the critical joints for determining the position of a human leg. The importance of these joints is relatively high, so the corresponding confidence weights are large. As secondary joints for classifying human postures, the eyes, shoulders, and hips are less important than the aforementioned critical joints, so their confidence weights are smaller.

After adopting YOLOv5 in the next recognition step, we obtain three different recognition results for the images i = 1, 2, and 3. Thereafter, the corresponding confidence measures V_{2,i} are invoked to fuse these recognition results and thereby increase the recognition accuracy. Let F_i denote the recognition result from the ith camera for i = 1, 2, and 3. If a normal action is detected, F_i = 1; otherwise (an abnormal action is detected instead), F_i = 0. Thus, the overall (fusion) decision rule is given by

$$ F \;\underset{\text{abnormal action}}{\overset{\text{normal action}}{\gtrless}}\; \frac{1}{2} \qquad (2) $$

where the "fused decision metric" F is defined by

$$ F \;\overset{\text{def}}{=}\; \frac{1}{3} \sum_{i=1}^{3} V_{1,i} \times V_{2,i} \times F_i. \qquad (3) $$

Fig. 7. Network structure of the YOLOv5s model (duplicated from [39]).

E. Human-Posture Recognition

Finally, our proposed human-posture recognition mechanism consists of two steps, namely 1) region-of-interest (ROI) localization and 2) posture classification using YOLOv5s. Details of these two steps are as follows.

1) ROI Localization: To expedite the execution time of the subsequent posture-classification step using YOLOv5, the kernelized correlation filter (KCF) algorithm in [32] is adopted here to localize the ROI of a skeletal graph.

2) Posture Classification Using YOLOv5s: The YOLO algorithm was proposed by Redmon et al. [33] for target detection (classification) and localization. It has evolved through different versions (v1–v5) [33], [34], [35], [36].
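A minimal sketch of the fusion rule in (2) and (3), as reconstructed above, is given below; the list-of-tuples input format is an assumption, and the 1/3 normalization reflects the three-camera setup of Section II-C.

```python
def fuse_decisions(views):
    """Fuse per-camera decisions according to (2) and (3).

    views: list of (V1, V2, F) tuples, one per camera, where
           V1 is the image-quality flag (0 or 1),
           V2 is the normalized confidence measure in [0, 1], and
           F  is the local decision (1 = normal action, 0 = abnormal action).
    """
    n = len(views)  # three cameras in the setup of Section II-C
    fused = sum(v1 * v2 * f for v1, v2, f in views) / n  # fused metric F in (3)
    return "normal" if fused > 0.5 else "abnormal"       # threshold rule (2)

# Example: camera 2 is occluded (V1 = 0), so only cameras 1 and 3 contribute:
# fuse_decisions([(1, 0.9, 1), (0, 0.0, 1), (1, 0.8, 0)])  # -> "abnormal"
```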
The YOLOv5 model mainly consists of four parts, namely the input, backbone, neck, and prediction. The YOLOv5s model is adopted in this work; its network structure is shown by Fig. 7 and includes six modules, namely the CBL, Res-Unit, CSP1X, CSP2X, Focus, and SPP modules. The CBL module is a basic convolution module, while the Res-Unit module adopts the residual structure of the ResNet network to build a deep network. Both the CSP1X and CSP2X modules are modified from the network structure of CSPNet. The Focus module was recently established to slice images, while the SPP module adopts the multidimensional maximum-pooling technique to fuse multiscale features of an image. The YOLOv5 algorithm includes four different network-structure models, namely YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, according to [37]. The major differences among these models are the network depth and width. We adopt the YOLOv5s model because it has the smallest network depth and width among them. The performance comparison of the aforementioned four network models is given in [38]. According to [38], the detection performance improves as the network depth and width increase; however, the required computation time also increases with the network depth and width. In this work, we focus on the real-time implementation and hence use the smallest depth and width.

Fig. 8. Four successive detections of (a) normal actions and (b) abnormal actions using three visible-light cameras that capture photos simultaneously (three rows of synchronous photos).

Fig. 9. Performance comparison of three schemes, namely our proposed novel robust abnormal human-posture recognition approach, the SSD algorithm proposed in [40], and the Faster R-CNN algorithm proposed in [41], in terms of (a) accuracy, (b) sensitivity, (c) specificity, and (d) precision with respect to the training set ratio using visible-light images.

III. SIMULATION

A. Simulation Setup

All simulations are carried out on the Windows 10 operating system, where an NVIDIA RTX 2060 GPU and the CUDA 11.7 computing platform are utilized. The learning is based on the PyTorch 1.10 framework. During the training process, the stochastic gradient-descent (SGD) method based on the Nesterov momentum (0.9) is adopted to optimize the learning model, where the batch size is set to 16, the weight decay is set to 0.0001, and the ℓ2 loss function is invoked as the objective.

B. Dataset

The NTU-RGB+D dataset contains 60 categories of actions (see Table II) with a total of 56,880 video samples. These 60 categories can be further grouped into three major categories: 40 categories of actions belong to "daily behaviors," nine categories belong to "health-related actions," and 11 categories belong to "two-person mutual actions." These actions were performed by 40 individuals aged from 10 to 35. The NTU-RGB+D dataset was collected with the Microsoft Kinect V2 sensor by three cameras with different angles; the video data were collected in the form of depth sequences, 3-D skeletal data, RGB videos, and infrared frames [27], [28]. The three cameras were located at the same height but with different orientations.
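The training configuration of Section III-A can be summarized by the short PyTorch snippet below; the stand-in model and the way the stated hyperparameters (SGD with Nesterov momentum 0.9, batch size 16, weight decay 0.0001, initial learning rate 0.01, and an ℓ2 objective) are combined reflect our reading of the setup, not released code.

```python
import torch

# `model` is a stand-in for the YOLOv5s-based posture classifier.
model = torch.nn.Linear(36, 2)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # initial learning rate (Section III-B)
    momentum=0.9,       # Nesterov momentum
    nesterov=True,
    weight_decay=1e-4,  # weight decay (Sections III-A and III-B)
)
criterion = torch.nn.MSELoss()  # the l2-loss objective stated above

# One optimization step on a dummy batch of size 16:
inputs, targets = torch.randn(16, 36), torch.randn(16, 2)
optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()
```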
To further enhance the camera views, the heights and distances of the cameras can be changed according to Table III. In this work, we utilize the 18 skeletal points (joints) to identify abnormal human postures. We further divide the dataset into training and test sets for cross-validation. The total number of image frames extracted from the RGB videos is 31,357, while the proportions of these data belonging to the training sets (a.k.a. the training set ratios) are 20%, 40%, 60%, and 80%, respectively. To tackle the problem of insufficient training data when the training set ratio is small, we also apply a "data-augmentation" strategy to create more training data by flipping the images horizontally and varying the brightness. The memory requirement is 4.2 GB, the training time for 2400 images (collected from three Microsoft Kinect V2 cameras) is about 2.03 s per epoch, and no more than 1000 epochs are required to converge according to our empirical experience. The system response time is 0.198 s. The weight decay and the initial learning rate are set to 0.0001 and 0.01, respectively.

TABLE II: Forty Categories of Actions in the NTU-RGB+D Dataset.

TABLE III: Different Camera Setups.

C. Evaluation of Results

Our experiments are focused on four types of abnormal postures and four types of normal postures as identified by [42]. The abnormal postures include the postures related to headache, chest pain, back pain, and neck pain, which indicate abnormal health conditions, while the normal postures include drinking, nodding/bowing, sitting, and standing. Fig. 8(a) demonstrates four successive detections of normal postures and Fig. 8(b) demonstrates four successive detections of abnormal postures (note that each row of photos is taken by one camera, while the three rows of photos are taken simultaneously by the three different cameras). Our proposed novel robust abnormal human-posture recognition approach is compared with the existing skeleton-based behavior-recognition schemes, including the single-shot multibox detector (SSD) algorithm proposed in [40] and the faster region-based convolutional neural network (Faster R-CNN) algorithm proposed in [41]. The multiangle image-fusion strategy based on confidence is a part of our proposed new scheme. On the contrary, the other two existing schemes (SSD and Faster R-CNN) in the comparison do not fuse multiview images based on confidence; they reach the final decision using a simple majority vote of the local decisions based on individual images. The recognition results of the aforementioned three schemes are depicted by Fig. 9. To verify the effectiveness of each scheme, the performance is evaluated using four metrics, namely the accuracy, sensitivity, specificity, and precision defined in [43], as given by

$$ \text{accuracy} \;\overset{\text{def}}{=}\; \frac{TP + TN}{TP + FN + TN + FP} \qquad (4) $$

$$ \text{sensitivity} \;\overset{\text{def}}{=}\; \frac{TP}{TP + FN} \qquad (5) $$

$$ \text{specificity} \;\overset{\text{def}}{=}\; \frac{TN}{TN + FP} \qquad (6) $$

$$ \text{precision} \;\overset{\text{def}}{=}\; \frac{TP}{TP + FP} \qquad (7) $$

where "TP" (true positive) denotes the number of abnormal-action photos recognized as abnormal-action photos, "TN" (true negative) denotes the number of normal-action photos recognized as normal-action photos, "FP" (false positive) denotes the number of normal-action photos recognized as abnormal-action photos, and "FN" (false negative) denotes the number of abnormal-action photos recognized as normal-action photos.

Fig. 10. Four successive detections of (a) normal actions and (b) abnormal actions using three thermal-infrared cameras which capture photos simultaneously (three rows of synchronous photos).
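For completeness, the four metrics in (4)–(7) can be computed directly from the confusion counts, as in the plain-Python sketch below (treating the abnormal class as the positive class, per the definitions above); the example counts are hypothetical.

```python
def evaluate(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity, precision) per (4)-(7)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # (4)
    sensitivity = tp / (tp + fn)                 # (5)
    specificity = tn / (tn + fp)                 # (6)
    precision = tp / (tp + fp)                   # (7)
    return accuracy, sensitivity, specificity, precision

# Example with hypothetical counts (not results from this article):
# evaluate(tp=90, tn=85, fp=15, fn=10)  # -> (0.875, 0.90, 0.85, 0.857...)
```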
According to Fig. 9, our proposed new robust abnormal-human-posture recognition approach (denoted by "Our model" in the figures) outperforms the other two existing schemes in terms of the accuracy, sensitivity, specificity, and precision when visible-light cameras are used and the training set ratio reaches 60% or more.

On the other hand, we also take into account the lack of lighting at night, when visible-light cameras cannot capture quality photos. It is well known that thermal infrared cameras rely on the human body's own thermal radiation, independent of external lighting, which makes them effective for the 24-h surveillance requirement [44], [45]. Therefore, thermal infrared cameras have been applied to safety monitoring [46] and night-vision assistance [47], [48], among other applications [49]. The outstanding performance of a thermal-infrared-camera-dependent approach for human-activity recognition was reported in [50]. Consistent with the experiments using visible-light images, our experiments using thermal infrared images also focus on the same four abnormal and four normal postures identified in [42]. Fig. 10(a) demonstrates four successive detections of normal postures and Fig. 10(b) demonstrates four successive detections of abnormal postures (note that each row of photos is taken by one camera, while the three rows of photos are taken simultaneously by the three different cameras). Furthermore, Fig. 11 depicts the accuracy, sensitivity, specificity, and precision with respect to the training set ratio resulting from the aforementioned three schemes using thermal infrared cameras. According to Fig. 11, our proposed new robust abnormal-human-posture recognition approach outperforms the other two existing schemes in terms of the accuracy, sensitivity, specificity, and precision when the training set ratio reaches 60% or more. As a result, the advantage of our proposed new robust abnormal-human-posture recognition approach is even more substantial for thermal infrared images.

Fig. 11. Performance comparison of three schemes, namely our proposed novel robust abnormal human-posture recognition approach, the SSD algorithm proposed in [40], and the Faster R-CNN algorithm proposed in [41], in terms of (a) accuracy, (b) sensitivity, (c) specificity, and (d) precision with respect to the training set ratio using thermal infrared images.

IV. CONCLUSION

In this article, a new multiview cross-information learning neural-network model based on the OpenPose and YOLOv5 frameworks is proposed. The OpenPose network is adopted to extract the key skeletal points of the human-body image. Meanwhile, noise is suppressed to enhance the image quality. Then the YOLOv5 recognition system is employed for the training and recognition of normal/abnormal human postures.
To improve the recognition accuracy, a new confidence mechanism is introduced to measure the confidence level of the recognition result from a single camera. Then a weighted-sum fusion rule is established to fuse the individual recognition results from different cameras. Through simulations based on the NTU-RGB+D dataset, we compare our proposed novel robust abnormal human-posture recognition approach with the other two existing schemes. Our proposed new approach leads to the best performance in terms of accuracy, sensitivity, specificity, and precision when the training data are sufficient (i.e., when the training set ratio reaches 80%).

REFERENCES

[1] L. Feifei, "Research on detection and recognition of indoor falls based on video surveillance," Ph.D. dissertation, School Control Sci. Eng., Dept. Biomed. Eng., Shandong Univ., Jinan, China, Apr. 2016.
[2] A. B. Abdusalomov, M. Mukhiddinov, A. Kutlimuratov, and T. K. Whangbo, "Improved real-time fire warning system based on advanced technologies for visually impaired people," Sensors, vol. 22, no. 19, p. 7305, Sep. 2022.
[3] Y. Cao, R. Xie, K. Yan, S.-H. Fang, and H.-C. Wu, "Novel dynamic segmentation for human-posture learning system using hidden logistic regression," IEEE Signal Process. Lett., vol. 29, pp. 1487–1491, 2022.
[4] C. Yu, Z. Xu, K. Yan, Y.-R. Chien, S.-H. Fang, and H.-C. Wu, "Noninvasive human activity recognition using millimeter-wave radar," IEEE Syst. J., vol. 16, no. 2, pp. 3036–3047, Jun. 2022.
[5] G. Liu et al., "Automatic human posture recognition using Kinect sensors by advanced graph convolutional network," in Proc. IEEE Int. Symp. Broadband Multimedia Syst. Broadcast. (BMSB), Jun. 2022, pp. 1–7.
[6] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Proc. 10th IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2005, pp. 2247–2253.
[7] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in Proc. Brit. Mach. Vis. Conf., 2009, pp. 1–11.
[8] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proc. ICCV, Mar. 2013, pp. 3551–3558.
[9] X. Yukai, S. Shengli, L. Linjian, L. Huikai, and Z. Yue, "Overview of human abnormal behavior recognition based on computer vision," Infrared, vol. 39, no. 11, pp. 34–39, Nov. 2018.
[10] P.-C. Hsiao, C.-S. Chen, and L.-W. Chang, "Human action recognition using temporal-state shape contexts," in Proc. 19th Int. Conf. Pattern Recognit., Dec. 2008, pp. 1–4.
[11] H.-B. Zhang, S.-Z. Li, F. Guo, S. Liu, and B.-X. Liu, "Real-time human action recognition based on shape combined with motion feature," in Proc. IEEE Int. Conf. Intell. Comput. Intell. Syst., Oct. 2010, pp. 633–637.
[12] Z. Zhang and J. Liu, "Recognizing human action and identity based on affine-SIFT," in Proc. Int. Conf. Electr. Electron. Eng. (EEESYM), Jun. 2012, pp. 216–219.
[13] A. S. Alharthi, S. U. Yunas, and K. B. Ozanyan, "Deep learning for monitoring of human gait: A review," IEEE Sensors J., vol. 19, no. 21, pp. 9575–9591, Nov. 2019.
[14] W. Tieyan, "Human fall detection method based on smartphone and machine learning algorithm," Sci. Technol. Innov., vol. 105, pp. 85–88, Jul. 2022.
[15] U. Zia, W. Khalil, S. Khan, I. Ahmad, and M. N. Khan, "Towards human activity recognition for ubiquitous health care using data from a waist-mounted smartphone," Turkish J. Electr. Eng. Comput. Sci., vol. 28, no. 2, pp. 646–663, Mar. 2020.
[16] A. Basavaraju, J. Du, F. Zhou, and J. Ji, "A machine learning approach to road surface anomaly assessment using smartphone sensors," IEEE Sensors J., vol. 20, no. 5, pp. 2635–2647, Mar. 2020.
[17] E. Ramanujam, T. Perumal, and S. Padmavathi, "Human activity recognition with smartphone and wearable sensors using deep learning techniques: A review," IEEE Sensors J., vol. 21, no. 12, pp. 13029–13040, Mar. 2021.
[18] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 172–186.
[19] Z. Han, J. Zhao, H. Leung, K. F. Ma, and W. Wang, "A review of deep learning models for time series prediction," IEEE Sensors J., vol. 21, no. 6, pp. 7833–7848, Mar. 2021.
[20] D. Darsena, G. Gelli, I. Iudice, and F. Verde, "Sensing technologies for crowd management, adaptation, and information dissemination in public transportation systems: A review," IEEE Sensors J., vol. 23, no. 1, pp. 68–87, Jan. 2023.
[21] S. Cai, M. Shao, M. Du, G. Bao, and B. Fan, "A binocular-camera-assisted sensor-to-segment alignment method for inertial sensor-based human gait analysis," IEEE Sensors J., vol. 23, no. 3, pp. 2663–2671, Feb. 2023.
[22] G. Zhang, J. Yin, P. Deng, Y. Sun, L. Zhou, and K. Zhang, "Achieving adaptive visual multi-object tracking with unscented Kalman filter," Sensors, vol. 22, p. 9106, Nov. 2022.
[23] X. Tang, H. Song, W. Wang, and Y. Yang, "Vehicle spatial distribution and 3D trajectory extraction algorithm in a cross-camera traffic scene," in Proc. Int. Conf. Sensors, Basel, Switzerland, Nov. 2020, p. 6517.
[24] B. M. Nair, J. Foytik, R. Tompkins, Y. Diskin, T. Aspiras, and V. Asari, "Multi-pose face recognition and tracking system," Proc. Comput. Sci., vol. 6, pp. 381–386, Aug. 2011.
[25] P. Sun and X. Ding, "UAV image detection algorithm based on improved YOLOv5," in Proc. IEEE 5th Int. Conf. Inf. Syst. Comput. Aided Educ. (ICISCAE), Sep. 2022, pp. 757–760.
[26] R. Ravindran, M. J. Santora, and M. M. Jamali, "Multi-object detection and tracking, based on DNN, for autonomous vehicles: A review," IEEE Sensors J., vol. 21, no. 5, pp. 5668–5677, Mar. 2021.
[27] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "NTU RGB+D: A large scale dataset for 3D human activity analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1010–1019.
[28] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 10, pp. 2684–2701, Oct. 2020.
[29] A. Niculescu-Mizil and R. Caruana, "Predicting good probabilities with supervised learning," in Proc. 22nd Int. Conf. Mach. Learn. (ICML), Jan. 2005, pp. 625–632.
[30] B. Zadrozny and C. Elkan, "Transforming classifier scores into accurate multiclass probability estimates," in Proc. 8th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2002, pp. 694–699.
[31] X. Cai, F. Shuang, X. Sun, Y. Duan, and G. Cheng, "Towards lightweight neural networks for garbage object detection," Sensors, vol. 22, no. 19, p. 7455, Jul. 2022.
[32] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
[33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 779–788.
[34] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 6517–6525.
[35] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1–6.
[36] A. Bochkovskiy, C. Wang, and H. Liao, "YOLOv4: Optimal speed and accuracy of object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Apr. 2020, pp. 1–17.
[37] W. Li, H. Mutian, X. Shuo, Y. Tian, Z. Tianyi, and L. Jianfei, "Waste classification and detection based on YOLOv5s network," Packag. Eng., vol. 42, pp. 50–56, Aug. 2021.
[38] J. Xue, F. Cheng, Y. Li, Y. Song, and T. Mao, "Detection of farmland obstacles based on an improved YOLOv5s algorithm by using CIoU and anchor box scale clustering," Sensors, vol. 22, no. 5, p. 1790, Feb. 2022.
[39] X. Zhu, S. Lyu, X. Wang, and Q. Zhao, "TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 2778–2788.
[40] W. Liu, D. Anguelov, and D. Erhan, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis. (ECCV), Dec. 2016, pp. 21–37.
[41] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., Jun. 2015, pp. 1137–1149.
[42] J. Wang, D. Chen, and J. Yang, "Human behavior classification by analyzing periodic motions," Frontiers Comput. Sci. China, vol. 4, no. 4, pp. 580–588, Mar. 2010.
[43] J. Li, Z. Chi, and Z. Li, "Human fall detection system based on threshold analysis method," Transducer Microsyst. Technol., vol. 8, pp. 1209–1221, Apr. 2019.
[44] A. Akula, A. K. Shah, and R. Ghosh, "Deep learning approach for human action recognition in infrared images," Cognit. Syst. Res., vol. 50, pp. 146–154, Aug. 2018.
[45] B. A. El-Rahiem et al., "An efficient deep learning model for classification of thermal face images," J. Enterprise Inf. Manag., vol. 11, pp. 1–12, Jul. 2020.
[46] T. P. Rani, P. Kalaichelvi, S. Sakthy, and S. Padmasri, "Monitoring and training KIT for autism spectrum disorder patients using artificial intelligence," in Proc. 1st Int. Conf. Comput. Sci. Technol. (ICCST), Nov. 2022, pp. 251–262.
[47] K. Geng and G. Yin, "Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles," IEEE Access, vol. 8, pp. 88227–88240, 2020.
[48] C. Zhang, D. Xiao, Q. Yang, Z. Wen, and L. Lv, "Review: Application of infrared thermography in livestock monitoring," Trans. ASABE, vol. 63, no. 2, pp. 389–399, 2020.
[49] A. N. Wilson, K. Gupta, B. H. Koduru, A. Kumar, A. Jha, and L. R. Cenkeramaddi, "Recent advances in thermal imaging and its applications using machine learning: A review," IEEE Sensors J., vol. 23, no. 4, pp. 3395–3407, Feb. 2023.
[50] H. Hei, X. Jian, and E. Xiao, "Sample weights determination based on cosine similarity method as an extension to infrared action recognition," J. Intell. Fuzzy Syst., vol. 40, no. 3, pp. 3919–3930, Mar. 2021.
Mingyang Xu was born in 2000, in Jiangsu, China. He is currently pursuing the B.S. degree in communication engineering with Central South University, Changsha, China. His research interests include the areas of network communication and signal/image processing.

Limei Guo received the B.S. degree in electronic engineering from Hunan University, Changsha, China, in 1995, and the M.S. and Ph.D. degrees in traffic information and engineering control from Central South University, Changsha, in 2002 and 2010, respectively. From September 2013 to September 2014, she was a Visiting Scholar at the School of Electrical Engineering and Computer Science, Louisiana State University, Baton Rouge, LA, USA. Since January 2002, she has been with the Faculty of Central South University, where she is currently an Associate Professor. She has published more than 20 technical journal and conference papers in communication engineering. Her research interests include the areas of wireless communications and image processing.

Hsiao-Chun Wu (Fellow, IEEE) received the B.S.E.E. degree from National Cheng Kung University, Tainan, Taiwan, in 1990, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Florida, Gainesville, FL, USA, in 1993 and 1999, respectively. From March 1999 to January 2001, he worked for Motorola Personal Communications Sector Research Labs as a Senior Electrical Engineer. From July to August 2007, he was a Visiting Assistant Professor with the Television and Networks Transmission Group, Communications Research Centre, Ottawa, ON, Canada. From August to December 2008, he was a Visiting Associate Professor at the Department of Electrical Engineering, Stanford University, Stanford, CA, USA. Since January 2001, he has been with the Faculty of the Department of Electrical and Computer Engineering, Louisiana State University (LSU), Baton Rouge, LA, USA, where he is currently a Distinguished Professor. He is also a Visiting Professor of the International College of Semiconductor Technology, National Chiao Tung University, Hsinchu, Taiwan, and is currently with the Innovation Center for Artificial Intelligence Applications, Yuan Ze University, Chungli, Taiwan. He has published more than 300 peer-refereed technical journal and conference papers in electrical and computer engineering. His research interests include the areas of wireless communications and signal processing. Dr. Wu is an IEEE Distinguished Lecturer. He currently serves as an Associate Editor for IEEE TRANSACTIONS ON BROADCASTING and IEEE TRANSACTIONS ON SIGNAL PROCESSING and an Editor for IEEE TRANSACTIONS ON COMMUNICATIONS and IEEE TRANSACTIONS ON MOBILE COMPUTING. He is also an Academic Editor for Sensors. He previously served as an Editor and Technical Editor for IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS and IEEE Communications Magazine and as an Associate Editor for IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, IEEE COMMUNICATIONS LETTERS, IEEE SIGNAL PROCESSING LETTERS, and IEEE Communications Magazine. He has also served numerous textbooks, IEEE/ACM conferences, and journals as a technical committee member, symposium chair, track chair, or reviewer in signal processing, communications, circuits, and computers.