View-invariant Deep Architecture for Human
Action Recognition using Two-stream Motion
and Shape Temporal Dynamics
Chhavi Dhiman, Member, IEEE, and Dinesh Kumar Vishwakarma, Senior Member, IEEE

Abstract— Human action recognition from unknown views is a challenging task. We propose a deep view-invariant human action recognition framework, which is a novel integration of two important action cues: motion and shape temporal dynamics (STD). The motion stream encapsulates the motion content of the action as RGB Dynamic Images (RGB-DIs), which are generated by Approximate Rank Pooling (ARP) and processed using a fine-tuned InceptionV3 model. The STD stream learns long-term view-invariant shape dynamics of the action using a sequence of LSTM and Bi-LSTM learning models, where a Human Pose Model (HPM) generates view-invariant features for key depth human pose frames selected with the structural similarity index matrix (SSIM). The final prediction of the action is made on the basis of three late fusion techniques, i.e., maximum (max), average (avg), and multiply (mul), applied to the individual stream scores. To validate the performance of the proposed framework, experiments are performed using both cross-subject and cross-view validation schemes on three publicly available benchmarks: the NUCLA multi-view dataset, the UWA3D-II Activity dataset, and the NTU RGB-D Activity dataset. Our algorithm outperforms the existing state of the art significantly, measured in terms of recognition accuracy, receiver operating characteristic (ROC) curves, and area under the curve (AUC).
Index Terms— human action recognition, spatio-temporal
dynamics, human pose model, late fusion
I. INTRODUCTION
HUMAN Action Recognition (HAR) in videos has gained considerable attention in the pattern recognition and computer vision community due to its wide range of applications, such as intelligent video surveillance, multimedia analysis, human-computer interaction, and healthcare. Despite the efforts made in this domain, human action recognition in videos remains challenging, mainly for two reasons: (1) intra-class inconsistencies of an action caused by different motion speeds, illumination intensities, and viewpoints, and (2) inter-class similarity of actions caused by similar sets of human poses. The performance of an action recognition system
greatly depends on how much relevant information is selected
and how efficiently it is processed. With the emergence of affordable depth sensing through the Microsoft Kinect device, research based on 3D features [1] [2] [3] is proliferating rapidly. Depth maps largely simplify the removal of irrelevant background details and illumination variations, and they represent fine details of the human pose. They have therefore become a preferred solution [4] [5] for handling intra-class inconsistency caused by illumination.
Recently, Convolutional Neural Networks (CNNs) [6] have achieved exceptional success in classifying both images and videos [7] [8] [9], owing to their outstanding modelling capacity and their ability to learn discriminative representations from raw images. Previous works [10] [11] show that appearance, motion, and temporal information act as important cues for understanding human actions effectively [12]. Multi-stream architectures [13], with two streams [14] and three streams [15], have boosted the performance of CNN-based recognition systems by jointly exploiting RGB- and depth-based appearance and the motion content of actions. Optical flow [16] and dense trajectories [17] are the approaches used most widely to represent the motion of objects in videos. However, these approaches are not tailored to incorporate viewpoint invariance into action recognition. Dense trajectories are sensitive to camera viewpoints and do not include explicit human pose details during the action. The depth human pose can be useful for understanding the temporal structure and global motion of human gait for more accurate recognition. Therefore, we propose a
view-invariant two-stream deep human action recognition framework, which is a fusion of a Shape Temporal Dynamics (STD) stream and a motion stream. The STD stream learns the depth sequence of human poses, represented as view-invariant Human Pose Model (HPM) [18] features over a time period, using a Long Short-Term Memory (LSTM) model. The motion stream encapsulates the motion details of the action as Dynamic Images (DIs). The key contributions of the work are threefold:
 A two-stream view-invariant deep framework is designed for human action recognition using a motion stream and a Shape Temporal Dynamics (STD) stream, which capture the motion involved in an action sequence as Dynamic Images (DIs) and the temporal information of human shape dynamics (as view-invariant HPM features [18]) using a sequence of Bi-LSTM and LSTM learning models, respectively.
C. Dhiman and D. K. Vishwakarma are with the Biometric Research Laboratory, Department of Information Technology, Delhi Technological University (Formerly Delhi College of Engineering), Bawana Road, Delhi-110042, India (e-mail: dvishwakarma@gmail.com).
 In the STD stream, to make identification more effective, discriminative human shape dynamics are learned only for key frames, rather than the entire video sequence, using the sequence of Bi-LSTM and LSTM models. This enhances the learning ability of the framework by removing redundant information.
 The novelty of the motion stream lies in highlighting the distinct motion features of human poses for each action using transfer learning, where the salient motion of each action is encoded as RGB-DIs using Approximate Rank Pooling (ARP) with accelerated performance. Three late fusion strategies are used to integrate the STD and motion streams, i.e., maximum (max), average (avg), and multiplication (mul) late fusion.
 To evaluate the performance of the proposed framework, experiments are conducted using cross-subject and cross-view validation schemes on three challenging public datasets: the NUCLA multi-view dataset [19], the UWA3D-II Activity dataset [20], and the NTU RGB-D Activity dataset [21]. The proposed framework is compared with similar state-of-the-art methods and is observed to exhibit superior performance.
The paper is organized as follows. In section II, recent related works are discussed, highlighting their contributions to handling view invariance with different types of action representations using RGB, depth, and RGB-depth modalities. In section III, the proposed framework is discussed in detail, and experimental results are reported in section IV in terms of recognition accuracy, ROC curves, and AUC. The conclusion of the proposed work is outlined in section V.
II. RELATED WORK
Human action recognition solutions based on deep learning are proliferating, each with added advantages over the others. Multi-stream deep architectures [8] [9] have surpassed the performance of single-stream deep state-of-the-art methods [1] [16] because such architectures are enriched with a fusion of different types of action cues: temporal, motion, and spatial. The motion between frames is most commonly described by optical flow [16]. Extraction of optical flow is quite slow and often dominates the total processing time of video analysis. Recent works [22] avoid optical flow computation by encoding the motion information as motion vectors and dynamic images [23]. Temporal pooling has appeared as an efficient solution to preserve the temporal dynamics of the action over a video. Various temporal pooling approaches have been presented, in which the pooling function is applied to temporal templates [24], time-series representations of spatial or motion action features [25], or ranking functions over video frames [26]. It is observed that all these approaches capture the temporal dynamics of the action only over short time intervals.
CNNs are powerful feature descriptors for a given input. For video analysis in particular, it is crucial to decide how video information is presented to the CNN. Videos can be treated as a stream of still images [27]. To utilize the temporal details of an action sequence, 2D filters are replaced by 3D filters [1] in CNNs, which are fed a fixed-length sub-video or a short sequence of video frames packed into an array of images. These approaches effectively apprehend the local motion details within a small window but fail to preserve the long-term motion patterns associated with certain actions. Motion Binary History (MBH) [28], Motion History Image (MHI), and Motion Energy Image (MEI) [29] based static image representations of RGB-D action sequences attempt to preserve long-term motion patterns, but the process of generating these representations involves a loss of information. It is difficult to represent the complex long-term dynamics of an action in a compact form.
The mainstream literature listed above [5] [8] [9] [17] targets action recognition from a common viewpoint. Such frameworks fail to produce good performance for test samples captured from different viewpoints. The viewpoint dependency of a framework can be handled by incorporating view-wise geometric details of the actions. Geometry-based action representations [17] [30] have improved the performance of view-invariant action recognition. Li et al. [31] proposed hanklets to capture view-invariant dynamic properties of short tracklets. Zhang et al. [32] defined continuous virtual paths by linear transformations of the action descriptors to associate actions from dissimilar views, where each point on the virtual path represented a virtual view. Rahmani and Mian [33] suggested a non-linear knowledge transfer model (NKTM) that mapped dense trajectory action descriptors to canonical views. These approaches [31] [32] [33] lacked spatial information and hence shape/appearance details, which are an important cue for representing the finer characteristics of an action. The work in [34] aligned view-specific features in sparse feature spaces and transferred the view-shared features in the sparse feature space to maintain a view-specific and a common dictionary set separately. The work in [39] adaptively assigned attention to only the salient joints of each skeleton frame and generated temporal attention over salient frames to understand key spatial and temporal action features using LSTM models. However, the performance of this work is sensitive to different values of θ and is not able to match the real-time performance expected of a recognition system. The publicly available datasets provide action samples for a limited number of views, either three or five. Liu et al. [35] introduced a human pose model (HPM) learned with synthetically generated 3D human poses for 180 different views θ ∈ (0, π), which makes the HPM [18] model rich with a wide range of views for a pose. Later, Liu et al. [17] fused a non-linear knowledge transfer model-based view-independent dense trajectory action descriptor with view-invariant HPM features. Integrating spatial details with trajectories improved the recognition performance significantly.
III. PROPOSED METHODOLOGY
The proposed RGB-D based human action recognition framework is demonstrated in Fig. 1. The architecture is designed by learning both the motion and the view-invariant shape dynamics of humans while performing an action. The motion stream encodes the motion content of the action as dynamic images (DIs) and uses the concept of transfer learning to understand the action in RGB videos with the help of a pre-trained InceptionV3 model.
Fig. 1. Schematic block diagram of the proposed approach.
Geometric details of the human shape during actions are extracted as view-invariant Human Pose Model (HPM) [18] features in the STD stream, which are learned in a sequential manner using one Bi-LSTM and one LSTM layer followed by dense, dropout, and softmax layers. The two streams are combined using a late fusion concept to predict the action.
A. STD stream: Depth-Human Pose Model (HPM) based
action descriptor
The depth shape representation of a human pose preserves information about the relative positions of body parts. In the proposed work, the STD stream encodes the fine details of depth human poses irrespective of the viewpoint, represented as view-invariant HPM features, and learns the temporal dynamics of these features using Bi-LSTM and LSTM sequential structures. The HPM model [18] has an architecture similar to AlexNet [37], but it is trained using multi-viewpoint action data synthetically generated from 180 viewpoints by fitting human models to the CMU motion capture data [38]. This makes HPM insensitive to view variations in the actions.
1) Structural Similarity Index Matrix (SSIM) based Key Frame
Extraction
Initially, a depth video with n depth frames {f_1, f_2, ..., f_n} is pre-processed by morphological operations to obtain the depth human silhouette and reduce the background noise. The redundant information in the video is removed by selecting a minimum number of key pose frames [40] based on the Structural Similarity Index Matrix (SSIM) [41], which computes a global structural similarity index value and a local SSIM map for two consecutive depth frames. If there is only a small change between two consecutive human poses, the structural similarity index (𝕊𝕊𝕀) value is high; for distinct human poses, the 𝕊𝕊𝕀 value is small. Mathematically, the SSIM value is defined as:
𝕊𝕊𝕀(f_i, f_{i+1}) = [ℒ(f_i, f_{i+1})]^α × [ℂ(f_i, f_{i+1})]^β × [𝒮(f_i, f_{i+1})]^γ		(1)
where, writing x ≡ f_i and y ≡ f_{i+1}, ℒ(f_i, f_{i+1}) = (2 ϑ_x ϑ_y + K_1) / (ϑ_x² + ϑ_y² + K_1), ℂ(f_i, f_{i+1}) = (2 σ_x σ_y + K_2) / (σ_x² + σ_y² + K_2), and 𝒮(f_i, f_{i+1}) = (σ_xy + K_3) / (σ_x σ_y + K_3); here ϑ_x, ϑ_y, σ_x, σ_y, and σ_xy are the local means, standard deviations, and cross-covariance of the two consecutive frames f_i and f_{i+1}, and ℒ(·), ℂ(·), and 𝒮(·) are the luminance, contrast, and structural components of the pixels. Since depth images are not sensitive to the luminance and contrast components, the exponents of ℒ(·) and ℂ(·), i.e., α and β, are set to 0.5, and the exponent γ of the structural component 𝒮(·) is set to 1. The 𝕊𝕊𝕀 value is computed for every pair of consecutive frames in a video and arranged in ascending order, together with the respective frame numbers, in a vector Λ. The first ten 𝕊𝕊𝕀 values and the corresponding frame numbers i, i ∈ (1, n), are selected from the sorted vector Λ as key frames. The salient information of each selected key frame is extracted as a region of interest (ROI), resized to a [227 × 227] image, and transformed into view-invariant HPM features, composed as a [10 × 4096] fc7-layer feature vector using the HPM [18] model.
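A minimal sketch of this key-frame selection step is given below, assuming each depth frame is already available as a 2-D NumPy array after silhouette segmentation. The local window size, the stabilising constants K1-K3, and the helper names (ssi, select_key_frames) are illustrative choices rather than the authors' exact implementation.

import numpy as np
from scipy.ndimage import uniform_filter

def ssi(f1, f2, alpha=0.5, beta=0.5, gamma=1.0,
        K1=0.01, K2=0.03, K3=0.015, win=7):
    """Global structural similarity index between two depth frames (Eq. (1)).
    Local statistics use a uniform window; K1-K3 are the usual SSIM stabilisers
    (K3 = K2/2 by convention). gamma = 1 keeps the structure term well defined
    even when the local cross-covariance is negative."""
    f1 = f1.astype(np.float64); f2 = f2.astype(np.float64)
    mu1, mu2 = uniform_filter(f1, win), uniform_filter(f2, win)
    var1 = uniform_filter(f1 * f1, win) - mu1 ** 2
    var2 = uniform_filter(f2 * f2, win) - mu2 ** 2
    cov  = uniform_filter(f1 * f2, win) - mu1 * mu2
    s1, s2 = np.sqrt(np.maximum(var1, 0)), np.sqrt(np.maximum(var2, 0))
    L = (2 * mu1 * mu2 + K1) / (mu1 ** 2 + mu2 ** 2 + K1)   # luminance
    C = (2 * s1 * s2 + K2) / (var1 + var2 + K2)             # contrast
    S = (cov + K3) / (s1 * s2 + K3)                         # structure
    ssim_map = (L ** alpha) * (C ** beta) * (S ** gamma)
    return float(ssim_map.mean())

def select_key_frames(frames, k=10):
    """Pick the k frames whose pose differs most from the next frame
    (lowest consecutive SSI value), then restore temporal order."""
    scores = [(ssi(frames[i], frames[i + 1]), i) for i in range(len(frames) - 1)]
    scores.sort()                          # ascending SSI: most distinct poses first
    keep = sorted(i for _, i in scores[:k])
    return [frames[i] for i in keep]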
2) Model Architecture and Learning
In this section, the deep convolutional neural network (CNN) structure that extracts view-invariant shape dynamics of the human depth silhouettes is described. The deep architecture is similar to [18], except that we connect the last fc7 layer to a combination of bidirectional LSTM and LSTM layers to learn the long-term temporal dynamics of the action. The architecture of our CNN is as follows: Input(227,227) → Conv(11,96,4) → ReLU → maxPool(3,2) → Norm → Conv(5,256,1) → ReLU → maxPool(3,2) → Norm → Conv(3,256,1) → ReLU → maxPool(3,2) → Fc6(4096) → ReLU → Dropout(0.5) → Fc7(4096) → BiLstm(512) → Lstm(128) → ReLU → Fc(·) → softmax(·)
Fig. 2. (a) Shape Temporal Dynamics (STD) stream design; (b) the SSIM-based key frame extraction procedure, demonstrated for only nine frames as a test case with α = 0.5, β = 0.5, γ = 1.
where Conv(h, n, s) is a convolution layer with an h × h kernel, n filters, and stride s; maxPool(h, s) is a max-pooling layer with an h × h kernel and stride s; Norm is a normalization layer; ReLU is a rectified linear unit; Dropout(p) is a dropout layer with dropout ratio p; and Fc(N) is a fully connected layer with N neurons. BiLstm(O) and Lstm(O) are a bidirectional Long Short-Term Memory (LSTM) layer and a one-directional LSTM layer, respectively, with output shape O. The bidirectional LSTM layer is trained with a weight regularizer of 0.001, a recurrent dropout of 0.55, and the return-sequence option enabled. A softmax layer is attached at the end of the network. The last fully connected layer is designed with 10, 30, and 60 output neurons for the NUCLA, UWA3D, and NTU RGB-D datasets, respectively, according to the number of classes in each dataset.
The pre-trained HPM model is learned from view-invariant synthetic action data for 399 types of human poses. The proposed deep HPM-based shape descriptor model is then learned end to end for 80 epochs with the 'Adam' optimizer. The softmax layer generates a [1 × n] probability vector, where n is the number of classes, indicating how strongly the test sample belongs to each class of the dataset.
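To make the sequential part of the STD stream concrete, the following Keras sketch builds the Bi-LSTM/LSTM head that sits on top of the HPM fc7 features (10 key frames × 4096 dimensions). The split of the BiLSTM units across directions, the loss function, and the helper name build_std_head are assumptions; the layer sizes, regulariser, dropout, and optimizer settings follow the description above and the experimental section.

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_std_head(num_classes=10, timesteps=10, feat_dim=4096):
    # Input: sequence of HPM fc7 features for the 10 SSIM-selected key frames.
    inp = layers.Input(shape=(timesteps, feat_dim))
    # 256 units per direction so the concatenated output is 512, matching the
    # reported BiLstm(512) output shape (the per-direction split is an
    # assumption); weight regulariser 0.001, recurrent dropout 0.55, and
    # return_sequences=True as stated in the text.
    x = layers.Bidirectional(
        layers.LSTM(256, return_sequences=True,
                    kernel_regularizer=regularizers.l2(0.001),
                    recurrent_dropout=0.55))(inp)
    x = layers.LSTM(128)(x)            # Lstm(128)
    x = layers.Activation("relu")(x)
    x = layers.Dense(num_classes)(x)   # 10 / 30 / 60 neurons per dataset
    out = layers.Activation("softmax")(x)
    model = models.Model(inp, out)
    # Training settings from the experimental section: Adam, lr = 0.0002.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model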
B. Motion stream: RGB-Dynamic Image (DI) based action
descriptor
In this section, the appearance and motion dynamics of an action sequence are represented as RGB Dynamic Images (RGB-DIs), which are used to extract deep motion features with the pre-trained InceptionV3 architecture according to the dynamics of the action sequence. DIs focus mainly on the salient objects and the motion of those objects, averaging away the background pixels and their motion patterns while preserving long-term action dynamics. In comparison to other temporal pooling strategies [23] [24] [26] [36], the Approximate Rank Pooling (ARP) method emphasizes the order (τ) of frame occurrence to extract the complex long-term dynamics of the action.
Construction of a dynamic image depends on a ranking function that ranks each frame along the time axis. Following Fernando et al. [26], a video {I_1, I_2, ..., I_N} is represented by a ranking function φ(I_t), t ∈ [1, N], where φ(·) assigns a score 𝓈 to each frame I_t at instant t to reflect the rank of the frame. The time average of φ(I_t) up to time t is computed as Q_t = (1/t) Σ_{i=1}^{t} φ(I_i), and 𝓈(t|r) = ⟨r, Q_t⟩, where r ∈ ℝ^r is a vector of parameters. The score of each frame is computed such that 𝓈(t_2|r) > 𝓈(t_1|r) for t_2 > t_1, for which the vector r is learned as a convex optimization problem using RankSVM [42]. The optimization is given in Eq. (2):
r* = ∂(I_1, I_2, ..., I_N; φ) = argmin_r E(r),		(2)
where E(r) = (λ/2)‖r‖² + (2/(N(N−1))) × Σ_{t_2>t_1} max{0, 1 − 𝓈(t_2|r) + 𝓈(t_1|r)},
and ∂(I_1, I_2, ..., I_N; φ) maps a sequence of N frames to r*; it is also termed the rank pooling function and holds the information needed to rank all the frames in the video. The first term in the objective function E(r) is the quadratic regularizer used in support vector machines. The second term is a hinge loss that counts how many pairs t_2 > t_1 are falsely ranked by the scoring function 𝓈(·), i.e., whose scores are not separated by at least a unit margin, 𝓈(t_2|r) > 𝓈(t_1|r) + 1.
Fig. 4. Dynamic image formation using Approximate Rank Pooling (ARP) [48]. First row: R-channel; second row: G-channel; third row: B-channel of each RGB video frame.
In the proposed work, the learning of the ranking function for dynamic image construction is accelerated by applying approximate rank pooling (ARP) [43], in contrast to bi-directional rank pooling [36], which performs rank pooling twice, in the forward and backward directions, resulting in a high DI generation time. ARP involves simple linear operations at the pixel level over the frames to rank them, which is extremely simple and efficient to compute. ARP approximates the rank pooling procedure by applying gradient-based optimization to Eq. (2) as follows:
For r = 0⃗, r* = 0⃗ − η∇E(r)|_{r=0⃗} for any η > 0,		(3)
where ∇E(r) ∝ Σ_{t_2>t_1} ∇max{0, 1 − 𝓈(t_2|r) + 𝓈(t_1|r)}|_{r=0⃗} = Σ_{t_2>t_1} ∇⟨r, Q_{t_1} − Q_{t_2}⟩ = Σ_{t_2>t_1} (Q_{t_1} − Q_{t_2}),		(4)
r* ∝ Σ_{t_2>t_1} [ (1/t_2) Σ_{i=1}^{t_2} φ_i − (1/t_1) Σ_{j=1}^{t_1} φ_j ] = Σ_{t=1}^{T} γ_t φ_t,		(5)
where γ_t = 2(T − t + 1) − (T + 1)(h_T − h_{t−1}), h_t = Σ_{i=1}^{t} 1/i is the t-th harmonic number, and h_0 = 0. Hence, the rank-pooling function is re-written as:
∂̂(I_1, I_2, ..., I_N; φ) = Σ_{t=1}^{T} γ_t φ(I_t)		(6)
Fig. 3. Computation of the γ_t parameter for a fixed video length N; for each γ_t, the per-frame weights (2i − N − 1)/i, i ∈ [t, N], are summed.
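As a concrete illustration of Eqs. (5)-(6) and the weight scheme of Fig. 3, the NumPy sketch below pre-computes the γ_t coefficients for a fixed-length video and forms the dynamic image as the weighted sum of the RGB frames. The final rescaling of the pooled result to an 8-bit image is an assumption, since the text does not specify how the DI is normalised before being fed to InceptionV3.

import numpy as np

def arp_weights(N):
    """gamma_t = sum over i in [t, N] of (2i - N - 1)/i  (cf. Eq. (5) and Fig. 3)."""
    return np.array([sum((2.0 * i - N - 1.0) / i for i in range(t, N + 1))
                     for t in range(1, N + 1)])

def dynamic_image(frames):
    """Approximate Rank Pooling (Eq. (6)): weighted sum of the video frames.
    `frames` has shape (N, H, W, 3); each RGB channel is pooled independently,
    and the result is rescaled to [0, 255] for the CNN input (normalisation
    choice assumed, not specified in the text)."""
    frames = np.asarray(frames, dtype=np.float64)
    g = arp_weights(frames.shape[0])                      # one weight per frame
    di = np.tensordot(g, frames, axes=(0, 0))             # sum_t gamma_t * frame_t
    di = (di - di.min()) / (di.max() - di.min() + 1e-8)   # rescale to [0, 1]
    return (255.0 * di).astype(np.uint8)

For example, with N = 3 the weights are γ ≈ (−1.33, 0.67, 0.67), so later frames contribute positively while the earliest frame is subtracted, which is what lets the DI encode the temporal evolution of the action.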
Therefore, ARP can be defined as a weighted sum of sequential video frames. The weights γ_t, t ∈ [1, N], are pre-computed for a fixed-length video using Eq. (5), as shown in Fig. 3. While computing γ_n, the order of occurrence of all the frames with t ≥ n is taken into account by computing a weight (2i − N − 1)/i for each frame i ∈ [n, N]; the computed weight values of the considered frames are summed to obtain a single value of γ_n. Therefore, the rank-pooling function can be defined directly using the individual frame features φ(I_t) and γ_t = 2(T − t + 1) as a linear function of time t, instead of computing the intermediate average feature vectors Q_t per frame to assign the ranking scores. The procedure of Approximate Rank Pooling (ARP) is shown in Fig. 4, where each video frame is multiplied by its corresponding pre-computed weight γ_t, i.e., f_1 is multiplied by γ_1 for every channel separately. The R, G, and B channels of the dynamic image are obtained as the weighted sums of the R, G, and B channels of each video frame, respectively. The size of the DI so obtained is the same as that of the original frames. To compute view-invariant motion features of the action, the constructed DIs are passed through the motion stream shown in Fig. 5, which is a combination of the InceptionV3 convolution layers followed by a set of classification layers, i.e., GlobalAveragePooling2D(), BatchNormalisation(), Dropout(0.3), Dense(512, 'ReLU'), Dropout(0.5), and Dense(10, 'softmax') layers.
Convolution features with shape 8 × 8 × 2048 are obtained as a high-dimensional representation of the input DI using the pre-trained InceptionV3 model. The batch normalisation layer is used to keep the internal covariate shift of the hidden units minimal after the 'ReLU' activations of the Dense(512, 'ReLU') layer. The combination of batch normalisation and dropout layers helps to handle overfitting without relying on the dropout layer alone, which would result in a larger loss of weights. The layers of the motion stream are trained end to end on the multi-view datasets to update the weights of the InceptionV3 convolution layers according to the training samples. The best-trained model weights, obtained at the highest achieved validation accuracy, are used for testing to achieve a high recognition rate irrespective of view variations.
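A Keras sketch of the motion-stream classifier described above is shown below; it stacks the listed classification layers on top of the pre-trained InceptionV3 convolutional base and is trained end to end. The ImageNet initialisation, the categorical cross-entropy loss, and the helper name build_motion_stream are assumptions consistent with standard transfer-learning practice rather than details stated in the text.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_motion_stream(num_classes=10):
    # Pre-trained InceptionV3 base: a 299x299x3 dynamic image is mapped to an
    # 8x8x2048 convolutional feature volume ("imagenet" initialisation assumed).
    base = tf.keras.applications.InceptionV3(include_top=False,
                                             weights="imagenet",
                                             input_shape=(299, 299, 3))
    # Classification layers in the order listed in the text.
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    model = models.Model(base.input, out)   # trained end to end on the DIs
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model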
Fig. 5. Layer structure of the motion stream: InceptionV3 ([299 × 299 × 3] input, [8 × 8 × 2048] output), GAP, BN, dropout, Fc [1 × 512] with ReLU activation, dropout, and Fc [1 × 10] with softmax activation. GAP: Global Average Pooling, BN: Batch Normalization.
IV. EXPERIMENTAL RESULTS
To validate the performance of the proposed framework for view-invariant human action recognition, three publicly available datasets are used: the NUCLA multi-view action 3D dataset, the UWA3D depth dataset, and the NTU RGB-D activity dataset.
A. NUCLA multi-view action 3D Dataset
The Northwestern-UCLA multi-view RGB-D dataset [19] was captured by Jiang Wang and Xiaohan Nie at UCLA, simultaneously from three different viewpoints, using Kinect v1. The dataset covers 10 action categories performed by 10 subjects: (1) pick up with one hand, (2) pick up with two hands, (3) drop trash, (4) walk around, (5) sit down, (6) stand up, (7) donning, (8) doffing, (9) throw, and (10) carry. This dataset is very challenging because many actions share the same "walking" pattern before and after the actual action is performed. To handle this inter-class similarity, ten SSIM-based depth key frames are selected and processed to extract view-invariant 3D HPM shape features. Moreover, some actions, such as "pick up with one hand" and "pick up with two hands", are difficult to discriminate from different viewpoints. For cross-view validation, two views are used for training and the third view is used for testing. For cross-subject validation, test samples are selected irrespective of the viewpoint. Sample frames of the dataset are shown in Fig. 6(a).
B. UWA3D Multi-view Activity-II Dataset
UWA3D multi-view activity-II dataset [20] is a large dataset,
which covers 30 human actions performed by ten subjects and
recorded from 4 different viewpoints at different times using
the Kinect v1 sensor. The 30 actions are:(1) one hand waving,
(2) one hand punching, (3) two hands waving, (4) two hands
punching, (5) sitting down, (6) standing up, (7) vibrating, (8)
falling down, (9) holding chest, (10) holding head, (11) holding
back, (12) walking, (13) irregular walking, (14) lying down,
(15) turning around, (16) drinking, (17) phone answering, (18)
bending, (19) jumping jack, (20) running, (21) picking up, (22)
putting down, (23) kicking, (24) jumping, (25) dancing, (26)
moping floor, (27) sneezing, (28) sitting down (chair), (29)
squatting, and (30) coughing. The four viewpoints are: (a) front,
(b) left, (c) right, and (d) top. The major challenge of the dataset lies in the fact that a large number of action classes were not recorded simultaneously, resulting in intra-action differences besides viewpoint variations. The dataset also contains self-occlusions
and human-object interactions in some videos. Sample images
of UWA3D dataset from four different viewpoints are shown in
Fig. 6(b).
C. NTU RGB-D Human Activity Dataset
NTU RGB-D action recognition dataset [21] is a large-scale
RGB-D Dataset for human activity analysis captured by 3
Microsoft Kinect v2 cameras placed at three different angles, −45°, 0°, and 45°, simultaneously. It consists of 56,880 action
samples including RGB videos, depth map sequences, 3D
skeletal data, and infrared videos for each sample. The dataset
consists of 60 types (50 single person actions and 10 two-person
interactions) of actions performed by 40 subjects repeated twice
facing the left or right sensor respectively. The height of the
sensors and their distances from the subject performing an
action were further adjusted to get more viewpoint variations.
This makes the NTU RGB-D dataset the largest and most complex cross-view action dataset of its kind to date. RGB and
depth sample frames of NTU RGB-D dataset are shown in Fig.
6(c). The resolution of RGB videos and depth maps is
1920×1080 and 512×424 respectively. We follow the standard
cross-subject and cross-view evaluation protocol in the
experiments, as specified in [21]. Under the cross-subject evaluation protocol, 20 of the 40 subjects are selected for training and the remaining 20 for testing. Under the cross-view
evaluation protocol, view 2 and view 3 are used as training
views and view 1 is used as test view.
In the experiments, both the motion stream and the STD stream of the proposed deep framework are trained end-to-end independently. For the testing phase, the best-trained model is selected based on the highest validation accuracy achieved. Under the cross-view validation scheme, one view is used as the test view and all remaining views are used for training. In the training phase, the training samples are split into training and validation samples using an 80-20 splitting strategy, and the Adaptive Moment Estimation (Adam) optimizer is used with (epochs, batch size, learning rate of the Adam optimizer) set to (80, 10, 0.0002). In the testing phase, the scores obtained from each stream for each test sample are fused using three late fusion mechanisms: max, avg, and mul (a sketch of this fusion step is given later in this section). The results obtained under the cross-subject validation scheme are provided in Table I for the NUCLA multi-view dataset, the UWA3D II activity dataset, and the NTU RGB-D dataset. The performance of the proposed framework under the cross-view validation scheme is shown in Tables II and III for the NUCLA multi-view dataset and the UWA3D II activity dataset, respectively, highlighting the highest
obtained accuracy of the proposed framework for each dataset. The performance of the framework is described in four stages, i.e., HPM, the STD stream (HPM_LSTM), the motion stream (DI_InceptionV3), and the hybrid framework (HPM_LSTM + DI_InceptionV3), respectively. It is observed that, for both cross-view and cross-subject validation, the late fusion scheme has produced remarkable results. The small variation in accuracy across different views exhibits the view-invariance property of the proposed framework, whereas in other state-of-the-art methods [1] [18] [35] the accuracy of the presented frameworks varies from view to view by as much as 10%, which shows that these methods are sensitive to different views.
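The three late-fusion rules applied to the per-stream softmax scores can be summarised by the short sketch below, referred to earlier in this section; the alignment of the two score arrays, one row per test sample, and the helper name late_fusion are assumptions of this illustration.

import numpy as np

def late_fusion(p_motion, p_std, rule="mul"):
    """Fuse per-class softmax scores of the motion and STD streams.
    p_motion, p_std: arrays of shape (num_samples, num_classes)."""
    if rule == "max":
        fused = np.maximum(p_motion, p_std)      # element-wise maximum
    elif rule == "avg":
        fused = 0.5 * (p_motion + p_std)         # element-wise average
    elif rule == "mul":
        fused = p_motion * p_std                 # element-wise product
    else:
        raise ValueError("rule must be 'max', 'avg' or 'mul'")
    return fused.argmax(axis=1)                  # predicted class per sample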
Fig. 6. Sample images of (a) the NUCLA multi-view 3D Action Dataset, (b) the UWA3D II Multi-View Action Dataset, and (c) the NTU RGB-D Human Activity Dataset. The left group of images in Fig. 6(c) is recorded when the subject faces camera sensor C-3 and the right group when the subject faces camera sensor C-2.
TABLE I
CROSS-SUBJECT VALIDATION RESULTS FOR THE NUCLA AND UWA3D II MULTI-VIEW ACTION 3D DATASETS AND THE NTU RGB-D ACTIVITY DATASET

Method | NUCLA dataset | UWA3D II dataset | NTU RGB-D dataset
Motion stream | 93% | 82.6% | 62%
STD stream | 76% | 73.5% | 68.3%
Proposed Hybrid Approach (Max) | 83% | 81.8% | 71.6%
Proposed Hybrid Approach (Avg) | 84.5% | 79.6% | 75.7%
Proposed Hybrid Approach (Mul.) | 87.3% | 85.2% | 79.4%

TABLE II
CROSS-VIEW VALIDATION RESULTS FOR THE NUCLA MULTI-VIEW ACTION 3D DATASET

Method | Train/Test views [1,2]/3 | [1,3]/2 | [2,3]/1 | Mean
HPM | 85.21 | 78.57 | 71.96 | 78.58
Motion stream | 86.29 | 76.42 | 70.6 | 77.77
STD stream | 89.96 | 81.37 | 75.12 | 82.15
Proposed Hybrid Approach (Max) | 91.73 | 85.43 | 79.72 | 85.62
Proposed Hybrid Approach (Avg) | 90.46 | 80.65 | 74.50 | 81.87
Proposed Hybrid Approach (Mul.) | 93.12 | 89.94 | 85.36 | 89.47
The comparison of the recognition accuracy with other state-of-the-art methods is shown in Tables IV, V, and VI for the NUCLA, UWA3D II, and NTU RGB-D datasets, respectively. It is observed that the novel integration of the motion stream and the STD stream in the proposed method outperforms the recent works HPM_TM [18], HPM_TM+DT [17], and NKTM [44]. Interestingly, our method achieves 91.3% and 85.36% average recognition accuracy, which is about 9% and 12.3% higher than the similar work HPM_TM+DT [17] with view 1 as the test view, for the UWA3D II activity dataset and the NUCLA dataset, respectively. HPM_TM [18] utilised a Fourier Temporal Pyramid (FTP) to extract the temporal details of human shapes per frame. HPM_TM+DT [17] further tried to understand the motion patterns of the humans through their dense trajectories (DT), which boosted the performance of [17] by 3% over [18]. However, DT does not provide view-invariant motion features of the action. The proposed STD stream in the defined view-invariant action recognition architecture learns the long-term temporal information of human shapes per action using the sequence of Bi-LSTM and LSTM learning models. LSTMs and Bi-LSTMs are stronger and more generalised learning models for remembering complex and long-term temporal information about the action than FTP, the temporal modelling approach used in [18]. However, the classification accuracy obtained for the NTU RGB-D dataset is not as good as that obtained on the other datasets, due to the large number and variety of samples and their complexity in the NTU RGB-D dataset, as shown in Table VI. Viewpoint and large intra-class variations make this dataset very challenging. The performance of the work in [45] is comparatively better than the proposed framework on the NTU RGB-D Activity dataset; it utilized skeleton-joint-based action features to make the prediction.
However, the novel integration of the motion stream and the STD stream using late fusion has boosted the recognition accuracy for all three multi-view datasets. This is verified by the ROC curves and AUC values in Fig. 7 for the individual test views of each multi-view dataset, from which it can easily be seen that the ROC curves of the hybrid approach show superior performance compared with the classification results of the individual motion and STD streams. This supports the fact that fusing the scores of the two streams increases the correct selection of true samples, i.e., improves the true positive rate (TPR). At the same time, the AUC values help to compare the ROC curves at their points of intersection.
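The ROC/AUC evaluation reported in Fig. 7 can be reproduced for any of the score sets (motion, STD, or fused) with a standard one-vs-rest computation such as the sketch below; the micro-averaging of the per-class curves is an assumption, since the text does not state how the multi-class ROC curves were aggregated.

import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def micro_roc(y_true, scores, num_classes):
    """Micro-averaged ROC curve and AUC for multi-class softmax scores.
    y_true: integer labels (num_samples,); scores: (num_samples, num_classes)."""
    y_bin = label_binarize(y_true, classes=np.arange(num_classes))
    fpr, tpr, _ = roc_curve(y_bin.ravel(), np.asarray(scores).ravel())
    return fpr, tpr, auc(fpr, tpr)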
Computation time: Our technique outperforms current cross-view action recognition methods on the multi-view NUCLA, UWA3D, and NTU RGB-D Activity datasets by fusing the motion stream and the view-invariant Shape Temporal Dynamics (STD) stream information. The proposed two-stream deep architecture not only performs proficiently but is also time-efficient compared with existing cross-view action recognition techniques. The experiments are performed on a single 8 GB NVIDIA GeForce GTX 1080 GPU system. The framework does not demand computationally expensive training and testing phases, as shown in Table VII. The major reason behind the lower computation cost of the training and testing phases is the compact and competent representation of the action: in the motion stream, the entire video sequence is represented by a single DI generated with the fast ARP method, and the STD stream processes only the key human pose depth frames instead of all the frames in the action sequence.
TABLE III
CROSS-VIEW VALIDATION RESULTS FOR THE UWA3D MULTI-VIEW ACTIVITY-II DATASET

Training views | {v1,v2} | {v1,v3} | {v1,v4} | {v2,v3} | {v2,v4} | {v3,v4} | Mean
Test view | v3 | v4 | v2 | v4 | v2 | v3 | v1 | v4 | v1 | v3 | v1 | v2 |
HPM | 58.61 | 69.92 | 65.42 | 73.11 | 61.33 | 69.98 | 61.02 | 63.18 | 61.19 | 61.47 | 73.99 | 64.31 | 65.29
Motion stream | 87.4 | 81.2 | 78.1 | 85.5 | 73.9 | 79.4 | 82.6 | 73.1 | 81.6 | 72.4 | 83.5 | 81.1 | 79.98
STD stream | 62.1 | 73.5 | 69.6 | 79.6 | 65.4 | 75.9 | 64.3 | 69.5 | 66.3 | 69.8 | 78.6 | 68.8 | 70.2
Proposed Hybrid Approach (Max) | 86.6 | 85.3 | 81.8 | 86.5 | 78.3 | 82.8 | 85.1 | 83.6 | 85.1 | 81.2 | 85.3 | 82.3 | 83.65
Proposed Hybrid Approach (Avg) | 73.2 | 78.8 | 75.4 | 81.3 | 79.9 | 81.4 | 79.4 | 77.3 | 79.4 | 80.9 | 84.1 | 84.2 | 79.6
Proposed Hybrid Approach (Mul.) | 84.3 | 88.2 | 82.6 | 88.6 | 80.5 | 83.2 | 88.9 | 84.6 | 93.9 | 85.2 | 83.0 | 91.2 | 86.18
TABLE IV
COMPARISON OF ACTION RECOGNITION ACCURACY (%) ON THE NUCLA MULTI-VIEW ACTION 3D DATASET

Method | Data Type | Train/Test views [1,2]/3 | [1,3]/2 | [2,3]/1 | Mean
CVP [32] | RGB | 60.6 | 55.8 | 39.5 | 52
nCTE [46] | RGB | 68.6 | 68.3 | 52.1 | 63
NKTM [44] | RGB | 75.8 | 73.3 | 59.1 | 69.4
HOPC+STK [20] | Depth | 80 | - | - | 80
HPM_TM [18] | Depth | 92.2 | 78.5 | 68.5 | 79.7
HPM_TM+DT [17] | RGBD | 92.9 | 82.8 | 72.5 | 82.7
HPM | Depth | 85.21 | 78.57 | 71.96 | 78.58
Motion Stream | RGB | 86.29 | 79.7 | 70.6 | 77.77
STD stream | Depth | 89.96 | 81.37 | 75.12 | 82.15
Proposed Hybrid Approach | RGBD | 93.12 | 89.94 | 85.36 | 89.47
TABLE V
COMPARISON OF ACTION RECOGNITION ACCURACY (%) ON THE UWA3D MULTI-VIEW ACTIVITY-II DATASET

Training views | | {v1,v2} | {v1,v3} | {v1,v4} | {v2,v3} | {v2,v4} | {v3,v4} | Mean
Method | Data Type | v3 | v4 | v2 | v4 | v2 | v3 | v1 | v4 | v1 | v3 | v1 | v2 |
DT [47] | RGB | 57.1 | 59.9 | 60.6 | 54.1 | 61.2 | 60.8 | 71 | 59.5 | 68.4 | 51.1 | 69.5 | 51.5 | 60.4
C3D [1] | RGB | 59.5 | 59.6 | 56.6 | 64 | 59.5 | 60.8 | 71.7 | 60 | 69.5 | 53.5 | 67.1 | 50.4 | 61
nCTE [46] | RGB | 55.6 | 60.6 | 56.7 | 62.5 | 61.9 | 60.4 | 69.9 | 56.1 | 70.3 | 54.9 | 71.7 | 54.1 | 61.2
NKTM [33] | RGB | 60.1 | 61.3 | 57.1 | 65.1 | 61.6 | 66.8 | 70.6 | 59.5 | 73.2 | 59.3 | 72.5 | 54.5 | 63.5
R-NKTM [44] | RGB | 64.9 | 67.7 | 61.2 | 68.4 | 64.9 | 70.1 | 73.6 | 66.5 | 73.6 | 60.8 | 75.5 | 61.2 | 67.4
HPM(RGB+D)_Traj [35] | RGBD | 85.8 | 89.9 | 79.3 | 85.4 | 74.4 | 78 | 83.3 | 73 | 91.1 | 82.1 | 90.3 | 80.5 | 82.8
HPM_TM+DT [17] | RGBD | 86.9 | 89.8 | 81.9 | 89.5 | 76.7 | 83.6 | 83.6 | 79 | 89.6 | 82.1 | 89.2 | 83.8 | 84.6
HPM | Depth | 58.61 | 69.92 | 65.42 | 73.11 | 61.33 | 69.98 | 61.02 | 63.18 | 61.19 | 61.47 | 73.99 | 64.31 | 65.29
Motion Stream | RGB | 87.4 | 81.2 | 78.1 | 85.5 | 73.9 | 79.4 | 82.6 | 73.1 | 81.6 | 72.4 | 83.5 | 81.1 | 79.98
STD stream | Depth | 62.1 | 73.5 | 69.6 | 79.6 | 65.4 | 75.9 | 64.3 | 69.5 | 66.3 | 69.8 | 78.6 | 68.8 | 70.2
Proposed Hybrid Approach | RGBD | 84.3 | 88.2 | 82.6 | 88.6 | 80.5 | 83.2 | 88.9 | 84.6 | 93.9 | 85.2 | 83.0 | 91.2 | 86.18

Fig. 7. Performance evaluation of the proposed framework for the NUCLA multi-view dataset (a)-(c), the UWA3D dataset (d)-(g), and the NTU RGB-D Activity dataset (h), in terms of ROC curves and the area under the curve (AUC).
V. CONCLUSION
In this work, a novel two-stream RGB-D deep framework is presented which capitalizes on the view-invariant characteristics of both the depth and RGB data streams to make action recognition insensitive to view variations. The proposed approach processes the RGB-based motion stream and the depth-based STD stream independently. The motion stream captures the motion
details of the action in the form of RGB Dynamic Images (RGB-DIs), which are processed with a fine-tuned InceptionV3 deep network. The STD stream captures the view-invariant temporal dynamics of the depth frames of key poses using the HPM [18] model followed by a sequence of Bi-LSTM and LSTM layers, which help to learn the long-term view-invariant shape dynamics of the actions. Structural Similarity Index Matrix (SSIM) based key pose extraction helps to inspect the major shape variations during the action while reducing the redundant frames that contain only minor shape changes. The late fusion of the scores of the motion stream and the STD stream is used to predict the label of the test sample. To validate the performance of the proposed framework, experiments are conducted on three publicly available multi-view datasets, the NUCLA multi-view dataset, the UWA3D II Activity dataset, and the NTU RGB-D Activity dataset, using cross-view and cross-subject validation schemes. The ROC representation of the recognition performance of the proposed framework for each test view exhibits an improved AUC for late fusion over the motion and STD streams individually. It is also noticed that the recognition accuracy of the framework is consistent across different views, which confirms the view-invariant characteristics of the framework. Finally, comparisons with other state-of-the-art methods are outlined for the proposed deep architecture, demonstrating the superiority of the framework in terms of both time efficiency and accuracy.
In future work, we aim to utilize the skeleton details of actions along with the depth details for different viewpoints to make action recognition more robust to the intra-class variations of a large set of samples while maintaining the time-efficient performance of the system.
TABLE VI
COMPARISON OF ACTION RECOGNITION ACCURACY (%) ON THE NTU RGB-D ACTIVITY DATASET

Method | Data type | Cross subject | Cross view
Skepxelloc+vel [45] | Joints | 81.3 | 89.2
STA-LSTM [48] | Joints | 73.4 | 81.2
ST-LSTM [49] | Joints | 69.2 | 77.7
HPM(RGB+D)_Traj [35] | RGB-D | 80.9 | 86.1
HPM_TM+DT [17] | RGB-D | 77.5 | 84.5
Re-TCN [50] | Joints | 74.3 | 83.1
dyadic [51] | RGB-D | 62.1 | -
DeepResnet-56 [52] | Joints | 78.2 | 85.6
HPM | Depth | 65.8 | 70.9
Motion Stream | RGB | 62 | 68.7
STD stream | Depth Maps | 68.3 | 72.4
Proposed Hybrid Approach | RGB-D (max fusion) | 71.6 | 79.8
Proposed Hybrid Approach | RGB-D (late fusion) | 75.7 | 83
Proposed Hybrid Approach | RGB-D (product fusion) | 79.4 | 84.1

TABLE VII
AVERAGE COMPUTATION SPEED (FRAMES PER SECOND: FPS)

Method | Training | Testing
NKTM [33] | 12 fps | 16 fps
HOPC [20] | 0.04 fps | 0.5 fps
HPM+TM [18] | 22 fps | 25 fps
Ours | 26 fps | 30 fps
REFERENCES
[1] S. Ji, W. Xu, M. Yang and K. Yu, "3D Convolutional Neural Networks for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[2] J. Han, L. Shao, D. Xu and J. Shotton, "Enhanced Computer Vision with Microsoft Kinect Sensor: A Review," IEEE Transactions on Cybernetics, vol. 5, pp. 1318-34, 2013.
[3] T. Singh and D. K. Vishwakarma, "Video Benchmarks of Human Action
Datasets: A Review," Artificial Intelligence Review, vol. 52, no. 2, pp.
1107-1154, 2018.
[4] C. Dhiman and D. K. Vishwakarma, "A Robust Framework for
Abnormal Human Action Recognition using R-Transform and Zernike
Moments in Depth Videos," IEEE Sensors Journal, vol. 19, no. 13, pp.
5195-5203, 2019.
[5] E. P. Ijjina and K. M. Chalavadi, "Human action recognition in RGB-D
videos using motion sequence information and deep learning," Pattern
Recognition, vol. 72, pp. 504-516, 2017.
[6] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning
applied to document recognition," in Proceedings of the IEEE, 1998.
[7] L. Ren , J. Lu , J. Feng and . J. Zhou, "Multi-modal uniform deep learning
for RGB-D person re-identification," Pattern Recognition, vol. 72, pp.
446-457, 2017.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. FeiFei, "Large-scale Video Classification with Convolutional Neural
Networks," in IEEE Conference on Computer Vision and Pattern
Recognition, Columbus, OH, 2014.
[9] N. Yudistira and T. Kurita, "Gated spatio and temporal convolutional
neural network for activity recognition: towards gated multimodal deep
learning," EURASIP Journal on Image and Video Processing, vol. 2017,
p. 85, 2017.
[10] Y. Yang , I. Saleemi and M. Shah , "Discovering motion primitives for
unsupervised grouping and one-shot learning of human actions, gestures,
and expressions," IEEE Transaction on Pattern Analysis and Machine
Intelligence, vol. 35, pp. 1635-1648, 2013.
[11] A. Diba, M. Fayyaz, V. Sharma, A. H. Karami, M. M. Arzani, R. Yousefzadeh and L. V. Gool, "Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification," vol. arXiv:1711.08200, 2017.
[12] D. K. Vishwakarma and K. Singh, "Human Activity Recognition Based on Spatial Distribution of Gradients at Sublevels of Average Energy Silhouette Images," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 4, pp. 316-327, 2017.
[13] Z. Tu, W. Xie, Q. Qin, R. Poppe, R. C. Veltkamp, B. Li and J. Yuanf,
"Multi-stream CNN: Learning representations based on human-related
regions for action recognition," Pattern Recognition, vol. 79, pp. 32-43,
2018.
[14] H. Wang, Y. Yang, E. Yang and C. Deng, "Exploring hybrid spatiotemporal convolutional networks for human action recognition,"
Multimedia Tools and Applications, vol. 76, pp. 15065-15081, 2017.
[15] Y. Liu, Q. Wu, L. Tan and H. shi, "Gaze-Assisted Multi-Stream Deep
Neural Network for Action Recognition," IEEE Access, vol. 5, pp.
19432-19441, 2017.
[16] W. Lin, Y. Mi, J. Wu, K. Lu and H. Xiong, "Action Recognition with
Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion,"
arXiv:1711.07430 , 2018.
[17] J. Liu, N. Akhtar and A. M. Saeed , "Viewpoint Invariant RGB-D Human
Action Recognition," in International Conference on Digital Image
Computing: Techniques and Applications, Sydney, 2017.
[18] H. Rahmani and A. Mian, "3D Action Recognition from Novel
Viewpoints," in IEEE Conference on Computer Vision and Pattern
Recognition, Las Vegas, NV, USA, 2016.
[19] J. Wang, B. X. Nie, Y. Xia, Y. Wu and S. C. Zhu, "Cross-view action
modeling, learning and recognition," in IEEE Conference on Computer
Vision and Pattern Recognition, Columbus, Ohio, 2014.
[20] H. Rahmani, A. Mahmood, D. Huynh and A. Mian, "Histogram of
oriented principal components for cross-view action recognition," IEEE
Transaction on Pattern Analysis and Machine Intelligence, vol. 38, no.
12, pp. 2430-2443, 2016.
[21] A. Shahroudy, J. Liu, T. T. Ng and G. Wang, "NTU RGB+D: A Large
Scale Dataset for 3D Human Activity Analysis," in Conference on
Computer Vision and Pattern Recognition, Nevada, United States, 2016.
[22] C. Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Samola and P.
Krahenbuhl, "Compressed Video Action Recognition," in IEEE
Conference on Computer Vision and Pattern Recognition, US, 2018.
[23] X. Yang, "Action recognition for depth video using multi-view dynamic
images," Information Sciences, vol. 480, pp. 287-304, 2019.
[24] A. F. Bobick and J. W. Davis, "The recognition of human movement
using temporal templates," IEEE Transaction on Pattern Analysis and
Machine Intelligence, vol. 23, no. 3, pp. 257-267, 2011.
[25] M. S. Ryoo, B. Rothrock and L. Matthies, "Pooled Motion Features for
First-Person Videos," in IEEE Conference on Computer Vision and
Pattern Recognition, Boston, Massachusetts, 2015.
[26] B. Fernando, E. Gavves, J. Oramas, . A. Ghodrati and T. Tuytelaars,
"Modeling video evolution for action recognition," in IEEE Conference
on Computer Vision and Pattern Recognition, Boston, Massachusetts,
2015.
[27] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga
and G. Toderici, "Beyond Short Snippets: Deep Networks for Video
Classification," in IEEE Conference on Computer Vision and Pattern
Recognition, Boston, 2015.
[28] S. Jetley and F. Cuzzolin, "3D Activity Recognition Using Motion History and Binary Shape Templates," in Proceedings of the Asian Conference on Computer Vision (ACCV), Singapore, 2014.
[29] M. A. R. Ahad, J. K. Tan, H. Kim and S. Ishikawa, "Motion history image: Its variants and applications," Machine Vision and Applications, vol. 23, no. 2, pp. 255-281, 2010.
[30] Y. Kong, Z. Ding, J. Li and Y. Fu, "Deeply Learned View-Invariant Features for Cross-View Action Recognition," IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 3028-3037, 2017.
[31] B. Li, O. I. Camps and M. Sznaier, "Cross-view Activity Recognition using Hankelets," in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012.
[32] Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu and C. Shi, "Cross-view action recognition via a continuous virtual path," in IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[33] H. Rahmani and A. Mian, "Learning a non-linear knowledge transfer model for cross-view action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, 2015.
[34] J. Zheng and Z. Jiang, "Learning view-invariant sparse representations for cross-view action recognition," in International Conference on Computer Vision (ICCV), Sydney, 2013.
[35] J. Liu and A. S. Mian, "Learning Human Pose Models from Synthesized Data for Robust RGB-D Action Recognition," CoRR, vol. arXiv:1707.00823v2, 2017.
[36] P. Wang, S. Wang, Z. Gao, Y. Hou and W. Li, "Structured Images for RGB-D Action Recognition," in IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 2017.
[37] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Neural Information Processing Systems (NIPS), Nevada, 2012.
[38] CMU motion capture database, http://mocap.cs..
[39] S. Song, C. Lan, J. Xing, W. Zeng and J. Liu, "Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection," IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3459-3471, 2018.
[40] K. Schindler and L. van Gool, "Action snippets: How many frames does human action recognition require?," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 2008.
[41] W. Zhou, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[42] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
[43] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi and S. Gould, "Dynamic Image Networks for Action Recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016.
[44] H. Rahmani, A. Mian and M. Shah, "Learning a Deep Model for Human Action Recognition from Novel Viewpoints," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667-681, 2018.
[45] J. Liu, N. Akhtar and A. Mian, "Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition," CoRR, vol. arXiv:1711.05941v4, 2018.
[46] A. Gupta, J. Martinez, J. J. Little and R. J. Woodham, "3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014.
[47] H. Wang, A. Kläser, C. Schmid and C. L. Liu, "Action recognition by dense trajectories," in IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2011.
[48] S. Song, C. Lan, J. Xing, W. Zeng and J. Liu, "An end-to-end spatio-temporal attention model for human action recognition from skeleton data," in AAAI Conference on Artificial Intelligence, California, USA, 2017.
[49] J. Liu, A. Shahroudy, D. Xu and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in European Conference on Computer Vision (ECCV), Amsterdam, 2016.
[50] T. S. Kim and A. Reiter, "Interpretable 3D Human Action Analysis with Temporal Convolutional Networks," in Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, Hawaii, 2017.
[51] A. S. Keçeli, "Viewpoint projection based deep feature learning for single and dyadic action recognition," Expert Systems With Applications, vol. 104, pp. 235-243, 2018.
[52] H.-H. Pham, L. Khoudour, A. Crouzil, P. Zegers and S. A. Velastin, "Exploiting deep residual networks for human action recognition from skeletal data," Computer Vision and Image Understanding, vol. 170, pp. 51-66, 2018.

Chhavi Dhiman (M'19) received the M.Tech. degree from Delhi Technological University, New Delhi, India, in 2014. She is currently pursuing the Ph.D. degree with the Department of Electronics and Communication Engineering, Delhi Technological University, Delhi, India. Her current research interests include Machine Learning, Deep Learning, Pattern Recognition, and Human Action Identification and Classification.

Dinesh Kumar Vishwakarma (M'16, SM'19) received the Doctor of Philosophy (Ph.D.) degree in the field of Computer Vision from Delhi Technological University (Formerly Delhi College of Engineering), New Delhi, India, in 2016. He is currently an Associate Professor with the Department of Information Technology, Delhi Technological University, New Delhi. His current research interests include Computer Vision, Deep Learning, Sentiment Analysis, Fake News Detection, Crowd Analysis, and Human Action and Activity Recognition. He is a reviewer of various journals/transactions of IEEE, Elsevier, and Springer. He was awarded the "Premium Research Award" by Delhi Technological University, Delhi, India, in 2018.