This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2020.2965299, IEEE Transactions on Image Processing.

View-invariant Deep Architecture for Human Action Recognition using Two-stream Motion and Shape Temporal Dynamics

Chhavi Dhiman, Member, IEEE, and Dinesh Kumar Vishwakarma, Senior Member, IEEE

C. Dhiman and D. K. Vishwakarma are with the Biometric Research Laboratory, Department of Information Technology, Delhi Technological University (formerly Delhi College of Engineering), Bawana Road, Delhi-110042, India (e-mail: dvishwakarma@gmail.com).

Abstract— Human action recognition for unknown views is a challenging task. We propose a deep view-invariant human action recognition framework, which is a novel integration of two important action cues: motion and shape temporal dynamics (STD). The motion stream encapsulates the motion content of the action as RGB Dynamic Images (RGB-DIs), which are generated by Approximate Rank Pooling (ARP) and processed by a fine-tuned InceptionV3 model. The STD stream learns long-term view-invariant shape dynamics of the action using a sequence of LSTM and Bi-LSTM learning models. A Human Pose Model (HPM) generates view-invariant features for structural similarity index matrix (SSIM) based key depth human-pose frames. The final prediction of the action is made on the basis of three types of late fusion techniques, i.e. maximum (max), average (avg) and multiply (mul), applied to the individual stream scores. To validate the performance of the proposed framework, experiments are performed using both cross-subject and cross-view validation schemes on three publicly available benchmarks: the NUCLA multi-view dataset, the UWA3D-II Activity dataset and the NTU RGB-D Activity dataset. Our algorithm significantly outperforms existing state-of-the-art methods, measured in terms of recognition accuracy, receiver operating characteristic (ROC) curves and area under the curve (AUC).

Index Terms— human action recognition, spatio-temporal dynamics, human pose model, late fusion

I. INTRODUCTION

HUMAN Action Recognition (HAR) in videos has gained a lot of attention in the pattern recognition and computer vision community due to its wide range of applications, such as intelligent video surveillance, multimedia analysis, human-computer interaction, and healthcare. Despite the efforts made in this domain, human action recognition in videos remains challenging, mainly for two reasons: (1) the intra-class inconsistency of an action due to different motion speeds, illumination intensities and viewpoints, and (2) the inter-class similarity of actions due to similar sets of human poses. The performance of an action recognition system greatly depends on how much relevant information is selected and how efficiently it is processed. With the emergence of affordable depth maps from the Microsoft Kinect device, research based on 3D features [1] [2] [3] is proliferating speedily. Depth maps largely simplify the removal of irrelevant background detail and illumination variation, and they represent the fine details of the human pose. They have turned out to be the preferred solution [4] [5] for handling intra-class inconsistency due to illumination intensity.
Recently, Convolutional Neural Networks (CNNs) [6] have achieved exceptional success in classifying both images and videos [7] [8] [9], owing to their outstanding modelling capacity and their ability to learn discriminative representations from raw images. Previous works [10] [11] show that appearance, motion, and temporal information act as important cues for understanding human actions effectively [12]. Multi-stream architectures [13], with two streams [14] or three streams [15], have boosted the performance of CNN-based recognition systems by jointly exploiting RGB- and depth-based appearance and the motion content of actions. Optical flow [16] and dense trajectories [17] are the most widely used representations of object motion in videos. However, these approaches are not tailored to incorporate viewpoint invariance in action recognition. Dense trajectories are sensitive to camera viewpoint and do not include explicit human pose details during the action. The depth human pose can be useful for understanding the temporal structure and global motion of human gait for more accurate recognition. Therefore, we propose a view-invariant two-stream deep human action recognition framework, which is a fusion of a Shape Temporal Dynamics (STD) stream and a motion stream. The STD stream learns the depth sequence of human poses as view-invariant Human Pose Model (HPM) [18] features over a time period, using a Long Short-Term Memory (LSTM) model. The motion stream encapsulates the motion details of the action as Dynamic Images (DIs). The key contributions of this work are as follows:

• A two-stream view-invariant deep framework is designed for human action recognition using a motion stream and a Shape Temporal Dynamics (STD) stream, which capture the motion involved in an action sequence as Dynamic Images (DIs) and the temporal information of human shape dynamics (as view-invariant HPM [18] features) using a sequence of Bi-LSTM and LSTM learning models, respectively.
• In the STD stream, to make identification more effective, discriminant human shape dynamics are learnt for only the key frames rather than the entire video sequence, using the sequence of Bi-LSTM and LSTM models. This enhances the learning ability of the framework by removing redundant information.
• The novelty of the motion stream lies in highlighting the distinct motion features of the human poses of each action using transfer learning, where the salient motion features of each action are encoded as RGB-DIs using Approximate Rank Pooling (ARP) with accelerated performance.
• Three late fusion strategies are used to integrate the STD and motion streams, i.e. maximum (max), average (avg), and multiplication (mul) late fusion.
• To evaluate the performance of the proposed framework, experiments are conducted using cross-subject and cross-view validation schemes on three challenging public datasets: the NUCLA multi-view dataset [19], the UWA3D-II Activity dataset [20] and the NTU RGB-D Activity dataset [21]. The proposed framework is compared with similar state-of-the-art methods and is observed to exhibit superior performance.

The paper is organized as follows. In Section II, recent related works are discussed, highlighting their contributions in handling view invariance with different types of action representations using RGB, depth and RGB-depth modalities. In Section III, the proposed framework is discussed in detail, and experimental results are reported in Section IV in terms of recognition accuracy, ROC curves, and AUC. The conclusion of the proposed work is outlined in Section V.

II. RELATED WORK

Human action recognition solutions based on deep learning are proliferating, each with added advantages over the others. Multi-stream deep architectures [8] [9] have surpassed the performance of single-stream deep state-of-the-art methods [1] [16] because such architectures are enriched with a fusion of different types of action cues: temporal, motion, and spatial. The motion between frames is mostly described by optical flow [16]. Extraction of optical flow is quite slow and often governs the total processing time of video analysis. Recent works [22] avoid optical flow computation by encoding the motion information as motion vectors or dynamic images [23]. Temporal pooling has emerged as an efficient solution for preserving the temporal dynamics of an action over a video. Various temporal pooling approaches have been presented, in which the pooling function is applied to temporal templates [24], to time-series representations of spatial or motion action features [25], or as a ranking function over video frames [26]. It is observed that all these approaches capture the temporal dynamics of the action only over short time intervals.

CNNs are powerful feature descriptors for a given input. For video analysis in particular, it is crucial to decide how video information is presented to the CNN. Videos are treated as a stream of still images [27]. To utilize the temporal details of an action sequence, 2D filters are replaced by 3D filters [1] in CNNs, which are fed a fixed-length sub-video or a short sequence of video frames packed into an array of images. However, these approaches effectively capture the local motion details within a small window but fail to preserve the long-term motion patterns associated with certain actions. Motion Binary History (MBH) [28], Motion History Image (MHI), and Motion Energy Image (MEI) [29] based static image representations of RGB-D action sequences attempt to preserve long-term motion patterns, but generating these representations involves a loss of information. It is difficult to represent the complex long-term dynamics of an action in a compact form.

The mainstream literature listed above [5] [8] [9] [17] targets action recognition from a common viewpoint. Such frameworks fail to produce good performance on test samples from different viewpoints. The viewpoint dependency of a framework can be handled by incorporating view-wise geometric details of the actions. Geometry-based action representations [17] [30] have improved the performance of view-invariant action recognition.
Li et al. [31] proposed Hankelets to capture view-invariant dynamic properties of short tracklets. Zhang et al. [32] defined continuous virtual paths by linear transformations of the action descriptors to associate actions from dissimilar views, where each point on the virtual path represented a virtual view. Rahmani and Mian [33] suggested a non-linear knowledge transfer model (NKTM) that mapped dense trajectory action descriptors to canonical views. These approaches [31] [32] [33] lack spatial information, and hence shape/appearance details, which are an important cue for representing the fine characteristics of an action. The work in [34] aligned view-specific features in sparse feature spaces and transferred the view-shared features in the sparse feature space to maintain a view-specific and a common dictionary set separately. The work in [39] adaptively assigned attention to only the salient joints of each skeleton frame and generated temporal attention over salient frames to learn key spatial and temporal action features using LSTM models. However, the performance of this work is sensitive to different values of θ and does not match the real-time performance expected of a recognition system. The publicly available datasets provide action samples for only a limited number of views, either three or five. Liu et al. [35] introduced a human pose model (HPM) learned with synthetically generated 3D human poses from 180 different views θ ∈ (0, π), which makes the HPM [18] model rich with a wide range of views for a pose. Later, Liu et al. [17] fused the non-linear knowledge transfer model based view-independent dense trajectory action descriptor with view-invariant HPM features. The integration of spatial details with trajectories improved the recognition performance significantly.

III. PROPOSED METHODOLOGY

The proposed RGB-D based human action recognition framework is shown in Fig. 1. The architecture is designed by learning both the motion and the view-invariant shape dynamics of humans while performing an action. The motion stream encodes the motion content of the action as dynamic images (DIs) and uses the concept of transfer learning to understand the action in RGB videos with the help of a pre-trained InceptionV3 network. Geometric details of the human shape during the action are extracted as view-invariant Human Pose Model (HPM) [18] features in the STD stream and are learned in a sequential manner using one Bi-LSTM and one LSTM layer followed by dense, dropout and softmax layers. The two streams are combined using a late fusion concept to predict the action.

Fig. 1. Schematic block diagram of the proposed approach. The depth stream performs depth silhouette segmentation, SSIM-based selection of 10 key frames, ROI extraction and HPM fc7 feature extraction, followed by Bi-LSTM, LSTM and classification layers; the RGB stream forms a dynamic image and extracts InceptionV3 features, followed by its own classification layers; the two stream scores are combined by late fusion.
A. STD Stream: Depth Human Pose Model (HPM) based Action Descriptor

The depth shape representation of a human pose preserves information about the relative positions of body parts. In the proposed work, the STD stream encodes the fine details of depth human poses irrespective of the viewpoint, represented as view-invariant HPM features, and learns the temporal dynamics of these features using Bi-LSTM and LSTM sequential structures. The HPM model [18] has an architecture similar to AlexNet [37], but it is trained on synthetically generated multi-viewpoint action data covering 180 viewpoints, obtained by fitting human models to the CMU motion capture data [38]. This makes HPM insensitive to view variations in the actions.

1) Structural Similarity Index Matrix (SSIM) based Key Frame Extraction

Initially, a depth video with n depth frames {f_1, f_2, ..., f_n} is pre-processed by morphological operations to obtain the depth human silhouette and reduce background noise. The redundant information in the video is removed by selecting a minimum number of key pose frames [40] based on the Structural Similarity Index Matrix (SSIM) [41], which computes a global structural similarity index value and a local SSIM map for two consecutive depth frames. If there is only a small change between two consecutive human poses, the structural similarity index (SSI) value is high; for distinct human poses, the SSI value is small. Mathematically, the SSIM value is defined as:

\mathbb{SSI}(f_i, f_{i+1}) = [\mathcal{L}(f_i, f_{i+1})]^{\alpha} \times [\mathbb{C}(f_i, f_{i+1})]^{\beta} \times [\mathcal{S}(f_i, f_{i+1})]^{\gamma}    (1)

where

\mathcal{L}(f_i, f_{i+1}) = \frac{2\,\vartheta_{f_i}\vartheta_{f_{i+1}} + K_1}{\vartheta_{f_i}^2 + \vartheta_{f_{i+1}}^2 + K_1}, \quad \mathbb{C}(f_i, f_{i+1}) = \frac{2\,\sigma_{f_i}\sigma_{f_{i+1}} + K_2}{\sigma_{f_i}^2 + \sigma_{f_{i+1}}^2 + K_2}, \quad \mathcal{S}(f_i, f_{i+1}) = \frac{\sigma_{f_i f_{i+1}} + K_3}{\sigma_{f_i}\sigma_{f_{i+1}} + K_3},

and ϑ_{f_i}, ϑ_{f_{i+1}}, σ_{f_i}, σ_{f_{i+1}}, σ_{f_i f_{i+1}} are the local means, variances and cross-covariance of the two consecutive frames f_i, f_{i+1}, while L(·), C(·) and S(·) are the luminance, contrast and structural components of the pixels. Since depth images are not sensitive to the luminance and contrast components, the exponents of L(·) and C(·), i.e. α and β, are set to 0.5, and the exponent of the structural component S(·), γ, is set to 1. The SSI value is computed for every pair of consecutive frames in a video and arranged in ascending order, together with the respective frame numbers, in a vector Λ. The first ten SSI values and the corresponding frame numbers i, i ∈ (1, n), are selected from the sorted vector Λ as key frames. The salient information of each selected key frame is extracted as a region of interest (ROI), resized to a 227 × 227 image and transformed into view-invariant HPM features, composed as a [10 × 4096] fc7-layer feature vector using the HPM [18] model.

Fig. 2. (a) Shape Temporal Dynamics (STD) stream design; (b) SSIM-based key frame extraction procedure, demonstrated for only nine frames as a test case with α = 0.5, β = 0.5, γ = 1.
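For illustration, a minimal Python sketch of this key-frame selection step is given below (it is not the authors' implementation): it computes a global SSI value for each pair of consecutive depth frames with α = β = 0.5 and γ = 1 and keeps the ten most dissimilar frames. The stabilising constants K1, K2 and K3 are placeholder values, and the local SSIM map mentioned above is omitted.

```python
import numpy as np

# Placeholder stabilising constants; the paper does not report K1, K2, K3.
K1, K2, K3 = 0.01, 0.03, 0.015

def ssi(f1, f2, alpha=0.5, beta=0.5, gamma=1.0):
    """Global structural similarity between two depth frames (Eq. 1),
    computed from frame-level means, variances and covariance."""
    f1, f2 = f1.astype(np.float64), f2.astype(np.float64)
    mu1, mu2 = f1.mean(), f2.mean()
    var1, var2 = f1.var(), f2.var()
    cov = ((f1 - mu1) * (f2 - mu2)).mean()
    L = (2 * mu1 * mu2 + K1) / (mu1**2 + mu2**2 + K1)                   # luminance
    C = (2 * np.sqrt(var1) * np.sqrt(var2) + K2) / (var1 + var2 + K2)   # contrast
    S = (cov + K3) / (np.sqrt(var1) * np.sqrt(var2) + K3)               # structure
    return (L ** alpha) * (C ** beta) * (S ** gamma)

def select_key_frames(frames, n_key=10):
    """Return the indices of the n_key frames whose pose differs most from
    the next frame, i.e. the lowest SSI values of consecutive frame pairs."""
    scores = [(ssi(frames[i], frames[i + 1]), i) for i in range(len(frames) - 1)]
    scores.sort(key=lambda p: p[0])                   # ascending: most dissimilar first
    return sorted(idx for _, idx in scores[:n_key])   # restore temporal order
```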
2) Model Architecture and Learning

In this section, the deep convolutional neural network (CNN) structure that extracts the view-invariant shape dynamics of the human depth silhouettes is described. The deep architecture is similar to [18], except that the last fc7 layer is connected to a combination of Bidirectional LSTM and LSTM layers to learn the long-term temporal dynamics of the action. The architecture of our CNN is as follows:

Input(227,227) → Conv(11,96,4) → ReLU → maxPool(3,2) → Norm → Conv(5,256,1) → ReLU → maxPool(3,2) → Norm → Conv(3,256,1) → ReLU → maxPool(3,2) → Fc6(4096) → ReLU → Dropout(0.5) → Fc7(4096) → BiLstm(512) → Lstm(128) → ReLU → Fc(·) → softmax(·)

where Conv(h, n, s) is a convolution layer with an h × h kernel, n filters and stride s; maxPool(h, s) is a max-pooling layer with an h × h kernel and stride s; Norm is a normalization layer; ReLU is a rectified linear unit; Dropout(p) is a dropout layer with dropout ratio p; Fc(N) is a fully connected layer with N neurons; and BiLstm(O) and Lstm(O) are a Bidirectional Long Short-Term Memory (LSTM) layer and a one-directional LSTM layer, respectively, with output shape O. The Bidirectional LSTM layer is trained with a weight regularizer of 0.001, a recurrent dropout of 0.55, and return sequences enabled. A softmax layer is attached at the end of the network. The last fully connected layer is designed with 10, 30, and 60 output neurons for the NUCLA, UWA3D, and NTU RGB-D datasets, respectively, according to the number of classes in each dataset. The pre-trained HPM model is learned on view-invariant synthetic action data covering 399 types of human poses; the proposed deep HPM-based shape descriptor model is then trained end to end for 80 epochs with the Adam optimizer. The softmax layer generates a probability vector [1 × n], where n is the number of classes, indicating the degree to which the test sample belongs to each class of the dataset.
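The recurrent head that sits on top of the HPM fc7 features could be written in Keras roughly as follows. This is an illustrative sketch, not the authors' released code; the number of units per direction in the Bi-LSTM, its merge mode and the placement of the dropout layer are assumptions where the text above leaves them unspecified.

```python
from tensorflow.keras import layers, models, regularizers

def build_std_head(n_classes, seq_len=10, feat_dim=4096):
    """Recurrent head of the STD stream: [10 x 4096] HPM fc7 key-frame
    features -> Bi-LSTM -> LSTM(128) -> ReLU -> Dense -> softmax."""
    inp = layers.Input(shape=(seq_len, feat_dim))
    x = layers.Bidirectional(
        layers.LSTM(512,                          # units per direction (assumed)
                    return_sequences=True,        # "return sequences enabled"
                    kernel_regularizer=regularizers.l2(0.001),
                    recurrent_dropout=0.55))(inp)
    x = layers.LSTM(128)(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.5)(x)                    # dropout ratio assumed
    out = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(inp, out)

# e.g. NUCLA has 10 classes; the head is trained end to end with Adam for 80 epochs
std_head = build_std_head(n_classes=10)
std_head.compile(optimizer='adam', loss='categorical_crossentropy',
                 metrics=['accuracy'])
```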
B. Motion Stream: RGB Dynamic Image (DI) based Action Descriptor

In this section, the appearance and motion dynamics of an action sequence are represented in terms of RGB Dynamic Images (RGB-DIs), which are used to extract deep motion features with the pre-trained InceptionV3 architecture according to the dynamics of the action sequence. DIs focus mainly on the salient objects and the motion of the salient objects, averaging away the background pixels and their motion patterns while preserving long-term action dynamics. In comparison with other sequence temporal pooling strategies [23] [24] [26] [36], the Approximate Rank Pooling (ARP) method emphasizes the order (τ) of frame occurrence to extract the complex long-term dynamics of the action. The construction of a dynamic image depends on a ranking function that ranks each frame along the time axis. According to Fernando et al. [26], a video {I_1, I_2, ..., I_N} is represented through a ranking function over the frame features φ(I_t), t ∈ [1, N], which assigns a score s to each frame I_t at instant t to reflect its rank. The time average of φ(I_t) up to time t is computed as Q_t = (1/t) Σ_{i=1}^{t} φ(I_i), and s(t|r) = ⟨r, Q_t⟩, where r is a vector of parameters. The score of each frame is computed such that s(t_2|r) > s(t_1|r) for t_2 > t_1, for which the vector r is learned as a convex optimization problem using RankSVM [42]. The optimization is given by Eq. (2):

r^* = \partial(I_1, I_2, \dots, I_N; \varphi) = \arg\min_{r} E(r),    (2)

where E(r) = \frac{\lambda}{2}\|r\|^2 + \frac{2}{N(N-1)} \sum_{t_2 > t_1} \max\{0,\, 1 - s(t_2|r) + s(t_1|r)\},

and ∂(I_1, I_2, ..., I_N; φ) maps a sequence of N frames to r*; it is also termed the rank pooling function, since r* holds the information needed to rank all the frames in the video. The first term of the objective function E(r) is the quadratic regularizer used in support vector machines. The second term is a hinge loss that counts how many pairs t_2 > t_1 are incorrectly ranked by the scoring function s(·), i.e. are not separated by at least a unit margin, s(t_2|r) > s(t_1|r) + 1.

Fig. 4: Dynamic image formation using Approximate Rank Pooling (ARP) [43]. First row: R channel, second row: G channel, third row: B channel of each RGB video frame.

In the proposed work, the learning of the ranking function for dynamic image construction is accelerated by applying approximate rank pooling (ARP) [43], in contrast with bidirectional rank pooling [36], which performs rank pooling twice, in the forward and the backward direction, resulting in a high DI generation time. ARP ranks the frames using simple linear operations at the pixel level over the frames, which is extremely simple and efficient for fast computation. ARP approximates the rank pooling procedure by a single step of gradient-based optimization of Eq. (2). For r = \vec{0}:

r^* = \vec{0} - \eta \nabla E(r)\big|_{r=\vec{0}} \quad \text{for any } \eta > 0,    (3)

where

\nabla E(r)\big|_{r=\vec{0}} \propto \sum_{t_2 > t_1} \nabla \max\{0,\, 1 - s(t_2|r) + s(t_1|r)\}\big|_{r=\vec{0}} = \sum_{t_2 > t_1} \nabla \langle r, Q_{t_1} - Q_{t_2} \rangle = \sum_{t_2 > t_1} (Q_{t_1} - Q_{t_2}),    (4)

r^* \propto \sum_{t_2 > t_1} \Big[ \frac{1}{t_2}\sum_{i=1}^{t_2} \varphi_i - \frac{1}{t_1}\sum_{j=1}^{t_1} \varphi_j \Big] = \sum_{t=1}^{T} \gamma_t \varphi_t,    (5)

where γ_t = 2(T − t + 1) − (T + 1)(h_T − h_{t−1}) and h_t = Σ_{i=1}^{t} 1/i is the t-th harmonic number, with h_0 = 0. Hence, the rank pooling function can be rewritten as:

\hat{\partial}(I_1, I_2, \dots, I_N; \varphi) = \sum_{t=1}^{T} \gamma_t \varphi(I_t).    (6)

Fig. 3: Computation of the γ_t parameters for a fixed video length N; the numbers in red show the dependency of γ_t on the consecutive video frames i ∈ (t, N).

Therefore, ARP can be defined as a weighted sum of the sequential video frames. The weights γ_t, t ∈ [1, N], are precomputed for a fixed-length video using Eq. (5), as shown in Fig. 3. While computing γ_n, the order of occurrence of all frames at times t ≥ n is taken into account by computing a weight (2i − N − 1)/i for each frame i, where i ∈ [n, N]; the computed weight values of all considered frames are summed to obtain the single value γ_n.
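The γ_t weights and the per-channel weighted sum of Eq. (6) can be sketched in NumPy as follows; the rescaling of the resulting dynamic image to an 8-bit range is an assumption made here for feeding a pre-trained CNN, since the text does not specify a normalisation step.

```python
import numpy as np

def arp_coefficients(N):
    """ARP weights: gamma_t = sum_{i=t..N} (2*i - N - 1) / i,
    equivalently 2(N - t + 1) - (N + 1)(h_N - h_{t-1})."""
    w = np.array([(2.0 * i - N - 1) / i for i in range(1, N + 1)])
    # reverse cumulative sum gives gamma_t for t = 1..N
    return np.cumsum(w[::-1])[::-1]

def dynamic_image(frames):
    """Weighted sum of the RGB frames of one video (Eq. 6); each channel is
    pooled independently. `frames` has shape (N, H, W, 3)."""
    frames = np.asarray(frames, dtype=np.float64)
    gamma = arp_coefficients(len(frames))
    di = np.tensordot(gamma, frames, axes=1)          # sum_t gamma_t * f_t
    # rescale to an 8-bit image for the pre-trained CNN / display (assumed step)
    di = 255.0 * (di - di.min()) / (di.max() - di.min() + 1e-8)
    return di.astype(np.uint8)
```

With this formulation a dynamic image is obtained in a single weighted-sum pass over the frames, which is what makes ARP markedly faster than bidirectional rank pooling.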
The rank-pooling function can therefore be defined directly using the individual frame features φ(I_t) and γ_t = 2(T − t + 1) as a linear function of time t, instead of computing the intermediate average feature vectors Q_t per frame to assign the ranking scores. The procedure of Approximate Rank Pooling (ARP) is shown in Fig. 4: each video frame is multiplied by its corresponding precomputed weight γ_t (i.e. f_1 is multiplied by γ_1) for every channel separately. The R, G, and B channels of the dynamic image are obtained as the weighted sums of the R, G, and B channels of the video frames, respectively. The size of the DI so obtained is the same as that of the original frames.

To compute view-invariant motion features of the action, the constructed DIs are passed through the motion stream shown in Fig. 5, which is a combination of the InceptionV3 convolution layers followed by a set of classification layers, i.e. GlobalAveragePooling2D( ), BatchNormalization( ), Dropout(0.3), Dense(512, 'ReLU'), Dropout(0.5), and Dense(10, 'softmax'). Convolution features of shape 8 × 8 × 2048 are obtained as a high-dimensional representation of the input DI using the pre-trained InceptionV3 model. The batch normalization layer keeps the internal covariate shift of the hidden-unit values minimal after the 'ReLU' activations of the Dense(512, 'ReLU') layer. The combination of batch normalization and dropout layers helps to handle overfitting without relying on the dropout layer alone, which would otherwise result in a larger loss of weights. The layers of the motion stream are trained end to end on the multi-view datasets to update the weights of the InceptionV3 convolution layers according to the training samples. The best-trained model weights, i.e. those obtained at the highest validation accuracy, are used for testing so that a high recognition rate is achieved irrespective of view variations.

Fig. 5: Layer structure of the motion stream (input 299 × 299 × 3, InceptionV3 convolution features 8 × 8 × 2048, GAP, BN, ReLU-activated fully connected layer, dropout, fully connected layer and softmax activation). GAP: Global Average Pooling, BN: Batch Normalization.
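A rough Keras equivalent of this motion-stream head is sketched below as an illustration only; the ImageNet initialisation of InceptionV3 and the Adam learning rate of 0.0002 (reported with the training settings in Section IV) are assumptions about how the fine-tuning is configured.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

def build_motion_stream(n_classes):
    """Motion stream: fine-tuned InceptionV3 on RGB dynamic images, followed by
    GAP -> BN -> Dropout(0.3) -> Dense(512, ReLU) -> Dropout(0.5) -> softmax."""
    base = InceptionV3(weights='imagenet', include_top=False,
                       input_shape=(299, 299, 3))        # 8 x 8 x 2048 features
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(512, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_classes, activation='softmax')(x)
    model = models.Model(base.input, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```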
IV. EXPERIMENTAL RESULTS

To validate the performance of the proposed framework for view-invariant human action recognition, three publicly available datasets are used: the NUCLA multi-view action 3D dataset, the UWA3D depth dataset and the NTU RGB-D activity dataset.

A. NUCLA Multi-view Action 3D Dataset
The Northwestern-UCLA multi-view RGB-D dataset [19] was captured by Jiang Wang and Xiaohan Nie at UCLA simultaneously from three different viewpoints using Kinect v1. The dataset covers 10 action categories performed by 10 subjects: (1) pick up with one hand, (2) pick up with two hands, (3) drop trash, (4) walk around, (5) sit down, (6) stand up, (7) donning, (8) doffing, (9) throw, and (10) carry. This dataset is very challenging because many actions share the same "walking" pattern before and after the actual action is performed. To handle this inter-class similarity, ten SSIM-based depth key frames are selected and processed to extract view-invariant 3D HPM shape features. Moreover, some actions, such as "pick up with one hand" and "pick up with two hands", are difficult to discriminate from different viewpoints. For cross-view validation, two views are used for training and the third view is used for testing. For cross-subject validation, test samples are selected irrespective of viewpoint. Sample frames of the dataset are shown in Fig. 6(a).

B. UWA3D Multi-view Activity-II Dataset
The UWA3D multi-view activity-II dataset [20] is a large dataset covering 30 human actions performed by ten subjects and recorded from four different viewpoints at different times using the Kinect v1 sensor. The 30 actions are: (1) one-hand waving, (2) one-hand punching, (3) two-hand waving, (4) two-hand punching, (5) sitting down, (6) standing up, (7) vibrating, (8) falling down, (9) holding chest, (10) holding head, (11) holding back, (12) walking, (13) irregular walking, (14) lying down, (15) turning around, (16) drinking, (17) phone answering, (18) bending, (19) jumping jack, (20) running, (21) picking up, (22) putting down, (23) kicking, (24) jumping, (25) dancing, (26) mopping floor, (27) sneezing, (28) sitting down (chair), (29) squatting, and (30) coughing. The four viewpoints are: (a) front, (b) left, (c) right, and (d) top. The major challenge of this dataset lies in the fact that a large number of action classes are not recorded simultaneously, resulting in intra-action differences in addition to viewpoint variations. The dataset also contains self-occlusions and human-object interactions in some videos. Sample images of the UWA3D dataset from the four viewpoints are shown in Fig. 6(b).

C. NTU RGB-D Human Activity Dataset
The NTU RGB-D action recognition dataset [21] is a large-scale RGB-D dataset for human activity analysis captured simultaneously by three Microsoft Kinect v2 cameras placed at three different angles: −45°, 0°, and +45°. It consists of 56,880 action samples, including RGB videos, depth map sequences, 3D skeletal data, and infrared videos for each sample. The dataset covers 60 action types (50 single-person actions and 10 two-person interactions) performed by 40 subjects, with each action repeated twice, facing the left or the right sensor respectively. The heights of the sensors and their distances from the subject performing an action were further varied to obtain more viewpoint variation. This makes NTU RGB-D the largest and most complex cross-view action dataset of its kind to date. RGB and depth sample frames of the NTU RGB-D dataset are shown in Fig. 6(c). The resolution of the RGB videos is 1920 × 1080 and that of the depth maps is 512 × 424. We follow the standard cross-subject and cross-view evaluation protocols in the experiments, as specified in [21]. Under the cross-subject evaluation protocol, 20 of the 40 subjects are selected for training and the remaining 20 for testing. Under the cross-view evaluation protocol, view 2 and view 3 are used as training views and view 1 is used as the test view.

In the experiments, both the motion stream and the STD stream of the proposed deep framework are pre-trained end to end independently.
For the testing phase, the best-trained model is selected based on the highest validation accuracy achieved. Under the cross-view validation scheme, one view is used as the test view and all remaining views are used for training. In the training phase, the training samples are split into training and validation samples using an 80-20 splitting strategy, and the Adaptive Moment Estimation (Adam) optimizer is used with (epochs, batch size, learning rate) set to (80, 10, 0.0002). In the testing phase, the scores obtained from each stream for each test sample are fused using three late fusion mechanisms: max, avg and mul.
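The score-level fusion just described can be sketched as follows, assuming each stream outputs a [1 × n] softmax probability vector for a test sample:

```python
import numpy as np

def late_fusion(p_motion, p_std, mode='mul'):
    """Fuse the softmax score vectors of the motion and STD streams.
    mode: 'max' (element-wise maximum), 'avg' (mean) or 'mul' (product)."""
    if mode == 'max':
        fused = np.maximum(p_motion, p_std)
    elif mode == 'avg':
        fused = (p_motion + p_std) / 2.0
    elif mode == 'mul':
        fused = p_motion * p_std
    else:
        raise ValueError('unknown fusion mode: ' + mode)
    return int(np.argmax(fused))      # predicted class label
```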
The results obtained under the cross-subject validation scheme are provided in Table I for the NUCLA multi-view dataset, the UWA3D-II activity dataset, and the NTU RGB-D dataset. The performance of the proposed framework under the cross-view validation scheme is shown in Tables II and III for the NUCLA multi-view dataset and the UWA3D-II activity dataset, respectively, highlighting the highest accuracy obtained by the proposed framework on each dataset. The performance of the framework is reported in four steps, i.e. HPM, the STD stream (HPM_LSTM), the motion stream (DI_InceptionV3), and the hybrid framework (HPM_LSTM + DI_InceptionV3). It is observed that for both cross-view and cross-subject validation, the late fusion scheme produces remarkable results. The small variation in accuracy across different views exhibits the view-invariance property of the proposed framework, whereas in other state-of-the-art methods [1] [18] [35] the accuracy of the presented frameworks varies from view to view by as much as 10%, which shows that they are sensitive to different views.

Fig. 6. Sample images of (a) the NUCLA multi-view 3D action dataset, (b) the UWA3D-II multi-view action dataset and (c) the NTU RGB-D human activity dataset. The left group of images in Fig. 6(c) is recorded when the subject faces camera sensor C-3 and the right group when the subject faces camera sensor C-2.

TABLE I
CROSS-SUBJECT VALIDATION RESULTS FOR THE NUCLA AND UWA3D-II MULTI-VIEW ACTION 3D DATASETS AND THE NTU RGB-D ACTIVITY DATASET

Method | NUCLA dataset | UWA3D II dataset | NTU RGB-D dataset
Motion stream | 93% | 82.6% | 62%
STD stream | 76% | 73.5% | 68.3%
Proposed Hybrid Approach (Max) | 83% | 81.8% | 71.6%
Proposed Hybrid Approach (Avg) | 84.5% | 79.6% | 75.7%
Proposed Hybrid Approach (Mul.) | 87.3% | 85.2% | 79.4%

TABLE II
CROSS-VIEW VALIDATION RESULTS FOR THE NUCLA MULTI-VIEW ACTION 3D DATASET (TRAINING/TEST VIEWS)

Method | [1,2]/3 | [1,3]/2 | [2,3]/1 | Mean
HPM | 85.21 | 78.57 | 71.96 | 78.58
Motion stream | 86.29 | 76.42 | 70.6 | 77.77
STD stream | 89.96 | 81.37 | 75.12 | 82.15
Hybrid (Max) | 91.73 | 85.43 | 79.72 | 85.62
Hybrid (Avg) | 90.46 | 80.65 | 74.50 | 81.87
Hybrid (Mul.) | 93.12 | 89.94 | 85.36 | 89.47

The comparison of recognition accuracy with other state-of-the-art methods is shown in Tables IV, V, and VI for the NUCLA, UWA3D-II and NTU RGB-D datasets, respectively. It is observed that the novel integration of the motion stream and the STD stream in the proposed method outperforms the recent works HPM_TM [18], HPM_TM+DT [17] and NKTM [44]. Interestingly, our method achieves 91.3% and 85.36% average recognition accuracy, which is about 9% and 12.3% higher than the similar work HPM_TM+DT [17] with view 1 as the test view, for the UWA3D-II activity dataset and the NUCLA dataset, respectively. HPM_TM [18] utilised a Fourier Temporal Pyramid (FTP) to extract the temporal details of the human shape per frame. HPM_TM+DT [17] further tried to capture the motion patterns of the humans as dense trajectories (DT), which boosted the performance of [17] by 3% over [18]. However, DT does not provide view-invariant motion features of the action. The proposed STD stream in the defined view-invariant action recognition architecture learns the long- and short-term temporal information of the human shape per action using the sequence of Bi-LSTM and LSTM learning models. LSTMs and Bi-LSTMs are stronger and more generalised learning models for remembering complex and long-term temporal information about the action than FTP, the temporal modelling approach used there. However, the classification accuracy obtained for the NTU RGB-D dataset is not as good as that obtained on the other datasets, due to the large variety of samples and their complexity in NTU RGB-D, as shown in Table VI. Viewpoint and large intra-class variations make this dataset very challenging. The performance of the work in [45] is comparatively better than the proposed framework on the NTU RGB-D activity dataset; it utilized skeleton-joint based action features to make the prediction.

Nevertheless, the novel integration of the motion stream and the STD stream using late fusion boosts the recognition accuracy on all three multi-view datasets. This is verified by the ROC curves and AUC values in Fig. 7 for the individual test views of each multi-view dataset, from which it can easily be seen that the hybrid-approach ROC curves show superior performance to the classification results of the individual motion and STD streams. This supports the fact that fusing the scores of the two streams increases the correct selection of true samples, i.e. improves the true positive rate (TPR). At the same time, the AUC values help to compare the ROC curves at their points of intersection.

Computation time: Our technique outperforms current cross-view action recognition methods on the multi-view NUCLA, UWA3D and NTU RGB-D activity datasets by fusing motion-stream and view-invariant Shape Temporal Dynamics (STD) stream information. The proposed two-stream deep architecture is not only proficient but also time-efficient compared with existing cross-view action recognition techniques. The experiments are performed on a single 8 GB NVIDIA GeForce GTX 1080 GPU system. The approach does not demand computationally expensive training and testing phases, as shown in Table VII. The major reason for the low computational cost of the training and testing phases is the compact and competent representation of the action: in the motion stream, the entire video sequence is represented by a single DI generated with the fast ARP method, and the STD stream processes only the key human-pose depth frames instead of all the frames of the action sequence.
TABLE III
CROSS-VIEW VALIDATION RESULTS (%) FOR THE UWA3D MULTI-VIEW ACTIVITY-II DATASET, REPORTED FOR HPM, THE MOTION STREAM, THE STD STREAM AND THE HYBRID FUSION VARIANTS OVER ALL TWELVE TRAINING/TEST VIEW COMBINATIONS ({v1,v2}, {v1,v3}, {v1,v4}, {v2,v3}, {v2,v4}, {v3,v4} AS TRAINING PAIRS) AND THEIR MEAN

TABLE IV
COMPARISON OF ACTION RECOGNITION ACCURACY (%) ON THE NUCLA MULTI-VIEW ACTION 3D DATASET (TRAIN/TEST VIEWS)

Method | Data type | [1,2]/3 | [1,3]/2 | [2,3]/1
CVP [32] | RGB | 60.6 | 55.8 | 39.5
nCTE [46] | RGB | 68.6 | 68.3 | 52.1
NKTM [44] | RGB | 75.8 | 73.3 | 59.1
HOPC+STK [20] | Depth | 80 | – | –
HPM_TM [18] | Depth | 92.2 | 78.5 | 68.5
HPM_TM+DT [17] | RGB-D | 92.9 | 82.8 | 72.5
HPM | Depth | 85.21 | 78.57 | 71.96
Motion stream | RGB | 86.29 | 79.7 | 70.6
STD stream | Depth | 89.96 | 81.37 | 75.12
Proposed Hybrid Approach | RGB-D | 93.12 | 89.94 | 85.36

TABLE V
COMPARISON OF ACTION RECOGNITION ACCURACY (%) ON THE UWA3D MULTI-VIEW ACTIVITY-II DATASET FOR ALL TRAINING/TEST VIEW PAIRS, COMPARING DT [47], C3D [1], nCTE [46], NKTM [33], R-NKTM [44], HPM(RGB+D)_Traj [35], HPM_TM+DT [17], HPM, THE MOTION STREAM, THE STD STREAM AND THE PROPOSED HYBRID APPROACH
Fig. 7: Performance evaluation of the proposed framework for the NUCLA multi-view dataset (a)-(c), the UWA3D dataset (d)-(g) and the NTU RGB-D Activity dataset (h) in terms of ROC curves and the area under the curve (AUC).

V. CONCLUSION

In this work, a novel two-stream RGB-D deep framework is presented that capitalizes on the view-invariant characteristics of both the depth and the RGB data streams to make action recognition insensitive to view variations. The proposed approach processes the RGB-based motion stream and the depth-based STD stream independently. The motion stream captures the motion details of the action in the form of RGB dynamic images (RGB-DIs), which are processed with a fine-tuned InceptionV3 deep network. The STD stream captures the view-invariant temporal dynamics of the depth frames of key poses using the HPM [18] model followed by a sequence of Bi-LSTM and LSTM layers, which helps to learn the long-term view-invariant shape dynamics of the actions. Structural Similarity Index Matrix (SSIM) based key-pose extraction helps to inspect the major shape variations during the action while discarding the redundant frames with only minor shape changes. The late fusion of the scores of the motion stream and the STD stream is used to predict the label of the test sample. To validate the performance of the proposed framework, experiments are conducted on three publicly available multi-view datasets, the NUCLA multi-view dataset, the UWA3D-II Activity dataset and the NTU RGB-D Activity dataset, using cross-view and cross-subject validation schemes. The ROC representation of the recognition performance of the proposed framework for each test view exhibits an improved AUC for late fusion over the motion and STD streams individually. It is also noticed that the recognition accuracy of the framework is consistent across different views, which confirms the view-invariant characteristics of the framework. Finally, comparisons with other state-of-the-art methods show the superiority of the proposed deep architecture in terms of both time efficiency and accuracy. In future work, we aim to utilize the skeleton details of actions along with the depth details for different viewpoints, to make action recognition more robust to the intra-class variations of large sets of samples while maintaining the time-efficient performance of the system.
TABLE VI
COMPARISON OF ACTION RECOGNITION ACCURACY (%) ON THE NTU RGB-D ACTIVITY DATASET

Method | Data type | Cross-subject | Cross-view
Skepxel loc+vel [45] | Joints | 81.3 | 89.2
STA-LSTM [48] | Joints | 73.4 | 81.2
ST-LSTM [49] | Joints | 69.2 | 77.7
HPM(RGB+D)_Traj [35] | RGB-D | 80.9 | 86.1
HPM_TM+DT [17] | RGB-D | 77.5 | 84.5
Re-TCN [50] | Joints | 74.3 | 83.1
Dyadic [51] | RGB-D | 62.1 | –
DeepResnet-56 [52] | Joints | 78.2 | 85.6
HPM | Depth | 65.8 | 70.9
Motion stream | RGB | 62 | 68.7
STD stream | Depth maps | 68.3 | 72.4
Proposed Hybrid Approach (max fusion) | RGB-D | 71.6 | 79.8
Proposed Hybrid Approach (avg fusion) | RGB-D | 75.7 | 83
Proposed Hybrid Approach (product fusion) | RGB-D | 79.4 | 84.1

TABLE VII
AVERAGE COMPUTATION SPEED (FRAMES PER SECOND: FPS)

Method | Training | Testing
NKTM [33] | 12 fps | 16 fps
HOPC [20] | 0.04 fps | 0.5 fps
HPM+TM [18] | 22 fps | 25 fps
Ours | 26 fps | 30 fps

REFERENCES

[1] S. Ji, W. Xu, M. Yang and K. Yu, "3D Convolutional Neural Networks for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[2] J. Han, L. Shao, D. Xu and J. Shotton, "Enhanced Computer Vision with Microsoft Kinect Sensor: A Review," IEEE Transactions on Cybernetics, vol. 43, no. 5, pp. 1318-1334, 2013.
[3] T. Singh and D. K. Vishwakarma, "Video Benchmarks of Human Action Datasets: A Review," Artificial Intelligence Review, vol. 52, no. 2, pp. 1107-1154, 2018.
[4] C. Dhiman and D. K. Vishwakarma, "A Robust Framework for Abnormal Human Action Recognition using R-Transform and Zernike Moments in Depth Videos," IEEE Sensors Journal, vol. 19, no. 13, pp. 5195-5203, 2019.
[5] E. P. Ijjina and K. M. Chalavadi, "Human action recognition in RGB-D videos using motion sequence information and deep learning," Pattern Recognition, vol. 72, pp. 504-516, 2017.
[6] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
[7] L. Ren, J. Lu, J. Feng and J. Zhou, "Multi-modal uniform deep learning for RGB-D person re-identification," Pattern Recognition, vol. 72, pp. 446-457, 2017.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014.
[9] N. Yudistira and T. Kurita, "Gated spatio and temporal convolutional neural network for activity recognition: towards gated multimodal deep learning," EURASIP Journal on Image and Video Processing, vol. 2017, p. 85, 2017.
[10] Y. Yang, I. Saleemi and M. Shah, "Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1635-1648, 2013.
[11] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh and L. V. Gool, "Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification," arXiv:1711.08200, 2017.
[12] D. K. Vishwakarma and K. Singh, "Human Activity Recognition Based on Spatial Distribution of Gradients at Sublevels of Average Energy Silhouette Images," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 4, pp. 316-327, 2017.
[13] Z. Tu, W. Xie, Q. Qin, R. Poppe, R. C. Veltkamp, B. Li and J. Yuan, "Multi-stream CNN: Learning representations based on human-related regions for action recognition," Pattern Recognition, vol. 79, pp. 32-43, 2018.
[14] H. Wang, Y. Yang, E. Yang and C. Deng, "Exploring hybrid spatio-temporal convolutional networks for human action recognition," Multimedia Tools and Applications, vol. 76, pp. 15065-15081, 2017.
[15] Y. Liu, Q. Wu, L. Tan and H. Shi, "Gaze-Assisted Multi-Stream Deep Neural Network for Action Recognition," IEEE Access, vol. 5, pp. 19432-19441, 2017.
[16] W. Lin, Y. Mi, J. Wu, K. Lu and H. Xiong, "Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion," arXiv:1711.07430, 2018.
[17] J. Liu, N. Akhtar and A. Mian, "Viewpoint Invariant RGB-D Human Action Recognition," in International Conference on Digital Image Computing: Techniques and Applications, Sydney, 2017.
[18] H. Rahmani and A. Mian, "3D Action Recognition from Novel Viewpoints," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016.
[19] J. Wang, B. X. Nie, Y. Xia, Y. Wu and S. C. Zhu, "Cross-view action modeling, learning and recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Ohio, 2014.
[20] H. Rahmani, A. Mahmood, D. Huynh and A. Mian, "Histogram of oriented principal components for cross-view action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2430-2443, 2016.
[21] A. Shahroudy, J. Liu, T. T. Ng and G. Wang, "NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis," in IEEE Conference on Computer Vision and Pattern Recognition, Nevada, United States, 2016.
[22] C. Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola and P. Krähenbühl, "Compressed Video Action Recognition," in IEEE Conference on Computer Vision and Pattern Recognition, US, 2018.
[23] X. Yang, "Action recognition for depth video using multi-view dynamic images," Information Sciences, vol. 480, pp. 287-304, 2019.
[24] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, 2001.
[25] M. S. Ryoo, B. Rothrock and L. Matthies, "Pooled Motion Features for First-Person Videos," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, 2015.
[26] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati and T. Tuytelaars, "Modeling video evolution for action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, 2015.
[27] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga and G. Toderici, "Beyond Short Snippets: Deep Networks for Video Classification," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015.
Liu, "Action recognition by dense trajectories," in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2011. [48] S. Song, C. Lan, J. Xing, W. Zeng and J. Liu, "An end-to-end spatiotemporal attention model for human action recognition from skeleton data," in AAAI conference on Artificial Intelligence, Califirnia, USA, 2017. [49] J. Liu, A. Shahroudy, . D. Xu and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in European Conference on Computer Vision (ECCV), Amsterdam, 2016. [28] S. Jetley and F. Cuzzolin, "3D Activity Recognition Using Motion History and Binary Shape Templates," in Proceedings of Asian Conference on Computer Vision(ACCV), Singapore, 2014. [50] T. S. Kim and A. Reiter, "Interpretable 3D Human Action Analysis with Temporal Convolutional," in Conference on Computer Vision and Pattern Recognition workshop, Honolulu, Hawaii, 2017. [29] M. A. R. Ahad, J. K. Tan, H. Kim and S. Ishikawa, "Motion history image: Its variants and applications," Machine Vision and Applications, vol. 23, no. 2, pp. 255-281, 2010. [51] A. S. Keçeli, "Viewpoint projection based deep feature learning for single and dyadic action recognition," Expert Systems With Applications, vol. 104, pp. 235-243, 2018. [30] Y. Kong, Z. Ding, J. Li and Y. Fu, "Deeply Learned View-Invariant Features for Cross-View Action Recognition," IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 3028-3037, 2017. [52] H.-H. Phama, L. Khoudoura, A. Crouzil, P. Zegers and S. A. Velastin, "Exploiting deep residual networks for human action recognition from skeletal data," Computer Vision and Image Understanding, vol. 170, pp. 51-66, 2018. [31] B. Li, O. I. Camps and M. Sznaier, "Cross-view Activity Recognition using Hankelets," in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012. [32] Z. Zhang, . C. Wang, B. Xiao, . W. Zhou, S. Liu and C. Shi, "Cross-view action recognition via a continuous virtual path," in IEEE Conference on Computer Vision and Pattern Recognition, 2013. [33] H. Rahmani and A. Mian, "Learning a non-linear knowledge transfer model for cross-view action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, 2015. [34] J. Zheng and Z. Jiang, "Learning view invaraint sparse representations for cross view action recognition," in International Conference on Computer Vision (ICCV), Sydney, 2013. [35] J. Liu and A. S. Mian, "Learning Human Pose Models from Synthesized Data for Robust RGB-D Action," CoRR, vol. arXiv:1707.00823v2, 2017. [36] P. Wang, S. Wang , Z. Gao, Y. Hou and W. Li, "Structured Images for RGB-D Action Recognition," in IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 2017. [37] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Neural Information Processing Systems (NIPS), Nevada, 2012. [38] CMU motion capture database, http://mocap.cs.. [39] S. Song, C. Lan , J. Xing , W. Zeng and J. Liu , "Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection," IEEE Transactions on Image Processing , vol. 27, no. 7, pp. 3459-3471, 2018. [40] K. Schindler and L. v. Gool, "Action snippets: How many frames does human action recognition require?," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 2008. [41] W. Zhou, A. C. Bovik , H. R. Sheikh and E. P. 
Simoncelli, "Image Qualifty Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004. Chhavi Dhiman (M’19) received M.Tech. from Delhi Technological University, New Delhi, India, in 2014. She is currently pursuing Ph.D. degree from the Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, India. Her current research interest includes Machine Learning, Deep Learning, Pattern Recognition, Human Action Identification and Classification. Dinesh Kumar Vishwakarma (M’16, SM’19) received Doctor of Philosophy (Ph.D.) degree in the field of Computer Vision from Delhi Technological University (Formerly Delhi College of Engineering), New Delhi, India, in 2016. He is currently an Associate Professor with the Department of Information Technology, Delhi Technological University, New Delhi. His current research interests include Computer Vision, Deep Learning, Sentiment Analysis, Fake News Detection, Crowd Analysis, Human Action and Activity Recognition. He is a reviewer of various Journals/Transactions of IEEE, Elsevier, and Springer. He has been awarded “Premium Research Award” by Delhi Technological University, Delhi, India in 2018.) [42] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004. [43] H. Bilen, . B. Fernando, E. Gavves, A. Vedaldi and S. Gould, "Dynamic Image Networks for Action Recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. [44] H. Rahmani, A. Mian and M. Shah, "Learning a Deep Model for Human Action Recognition from Novel Viewpoints," IEEE Transactions on 1057-7149 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.