Letter
Deep Learning-Based Real-Time Multiple-Person
Action Recognition System
Jen-Kai Tsai, Chen-Chien Hsu * , Wei-Yen Wang
and Shao-Kang Huang
Department of Electrical Engineering, National Taiwan Normal University, Taipei 106, Taiwan;
johnny850219@gmail.com (J.-K.T.); wywang@ntnu.edu.tw (W.-Y.W.); grghwng@hotmail.com (S.-K.H.)
* Correspondence: jhsu@ntnu.edu.tw; Tel.: +886-2-7749-3551
Received: 20 July 2020; Accepted: 21 August 2020; Published: 23 August 2020
Abstract: Action recognition has gained great attention in automatic video analysis, greatly reducing
the cost of human resources for smart surveillance. Most methods, however, focus on the detection of
only one action event for a single person in a well-segmented video, rather than the recognition of
multiple actions performed by more than one person at the same time for an untrimmed video. In this
paper, we propose a deep learning-based multiple-person action recognition system for use in various
real-time smart surveillance applications. By capturing a video stream of the scene, the proposed
system can detect and track multiple people appearing in the scene and subsequently recognize their
actions. Thanks to the high resolution of the video frames, we establish a zoom-in function to obtain more satisfactory action recognition results when people in the scene are far from the camera.
To further improve the accuracy, recognition results from inflated 3D ConvNet (I3D) with multiple
sliding windows are processed by a nonmaximum suppression (NMS) approach to obtain a more
robust decision. Experimental results show that the proposed method can perform multiple-person
action recognition in real time suitable for applications such as long-term care environments.
Keywords: smart surveillance; action recognition; deep learning; human tracking
1. Introduction
Due to the rise of artificial intelligence in the field of video analysis, human action recognition
has gained great popularity in smart surveillance, which focuses on automatic detection of suspicious
behavior and activities. As a result, the system can launch an alert in advance to prevent accidents
in public places [1] such as airports [2], stations [3], etc. To serve the need of the upcoming aging
society, smart surveillance can also provide great advantages to long-term care centers as well,
where action recognition can help center staff notice dangers or accidents when taking care of large
numbers of patients. Therefore, it is important to develop an action recognition system for real-time
smart surveillance.
Recently, data-driven action recognition has become a popular research topic due to the flourishing of deep learning [4–8]. Based on the type of input data, the existing literature on action recognition
can be divided into two categories: skeleton-based and image-based methods. The former includes
3D skeleton [9,10] generated by Microsoft Kinect, and 2D skeleton generated by OpenPose [11]
and AlphaPose [12]. The latter includes approaches using single-frame images [13], multiframe
video [14,15], and optical flow [16]. Note that the size of skeleton data is much smaller than that of the
image data. For example, the sizes of the 3D skeleton data and the image data in the NTU RGB+D dataset are 5.8 and 136 GB, respectively; that is, the skeleton data are roughly 23 times smaller than
the image data. Thus, the training process by using skeleton data is faster than that by using image
data. However, the image data might contain vital information [17], including age, gender, clothing,
expression, background, illumination, etc., in comparison to skeleton data. Moreover, in order to
identify each individual appearing in the scene, the face image is also needed in the proposed method.
Thus, image-based methods render more information of the scene for wider applications, such as smart
identification and data interpretation, which is of practical value and worthy of further investigation.
Convolutional neural networks (CNNs) are a powerful tool in the field of image-based action
recognition. According to the dimension of the convolution unit used in the network, previous
approaches can be separated into 2D and 3D convolution approaches. Although 2D convolution
provides appealing performance in body gesture recognition, it cannot effectively handle the recognition
of continuous action streams. This is due to the lack of temporal information, which imposes the
limitation of modeling continuous action with 2D convolution. On the other hand, 3D convolution
contains an additional time dimension that can remedy this deficiency. Thus, 3D convolution has been
widely used in recent data-driven action recognition architectures [18], such as C3D (convolutional
3D) [19], I3D (inflated 3D ConvNet) [20], and 3D-fused two-stream [21]. Widely seen as foundational
research, Ji et al. [14] proposed taking several frames from an input video and simultaneously feeding them
into a neural network. Features extracted from the video contain the information in both time and space
dimensions via 3D convolution that can be utilized to generate results for action recognition. Although
the recognition performance of [14], evaluated based on the TRECVID 2008 [22] and KTH datasets [23],
provided good action recognition, the accuracy of 3D convolution relies on a huge amount of network
parameters, leading to low efficiency in the recognition process. This inevitably causes difficulty in
reaching real-time action recognition. To address this problem, a recent approach [21] developed a two-stream architecture that takes both RGB and optical flow images as input data and improves on I3D by adopting the concept of Inception v1 [24] from GoogLeNet. Inception v1 contains 2D convolution
using a 1 × 1 filter in the network to reduce the amount of training parameters, and hence improves
the efficiency of recognition. As a result, accuracy of the I3D approaches outperforms traditional
approaches in the well-known action recognition datasets, such as UCF101 [25] and Kinetics [26].
Although the I3D-based action recognition has a desired performance, the original version of I3D
cannot recognize actions of multiple people appearing in a scene at the same time. The I3D-based
recognition system also encounters problems when locating the start and end time for each of the
actions in the input video for subsequent action recognition [27], because there is likely a mismatch
between the start and end time of the actions and the segmented time interval.
In order to solve these problems, we propose a real-time multiple-person action recognition system
with the following major contributions. (1) We extend the use of I3D for real-time multiple-person
action recognition. (2) The system is capable of tracking multiple people in real time to provide action
recognition with improved accuracy. (3) An automatic zoom-in function is established to enhance the
quality of input data for recognizing people located far from the camera. (4) The mismatch problem
mentioned earlier can be addressed by adopting a sliding window method. (5) To further improve the
accuracy, recognition results from I3D with multiple sliding windows are processed by a nonmaximum
suppression (NMS) approach to obtain a more robust decision. Experimental results show that the
proposed method is able to achieve multiple-person action recognition in real time.
The rest of the paper is organized as follows: Section 2 introduces the proposed real-time
multiple-person action recognition system, Section 3 presents the experimental results, and the
conclusion is given in Section 4.
2. Real-Time Multiple-Person Action Recognition System
Figure 1 shows the complete flow chart of the proposed multiple-person action recognition system.
First of all, we use YOLOv3 [28] to locate multiple people appearing in the scene. The Deep SORT [29]
algorithm is used to track the people and provide each of them with an identity (ID) number. To
identify the person’s name for display, FaceNet [30] is then used for face recognition to check whether
or not the ID exists. For people far from the camera, we also establish a “zoom in” function to improve
the recognition performance. Video frames in sliding windows are preprocessed and resized before being fed into I3D for action recognition. Finally, this system utilizes a one-dimensional NMS to postprocess the outputs from I3D to improve the accuracy and robustness of action recognition.
Figure 1. Flow chart of the proposed multiple-person action recognition system (I3D, inflated 3D ConvNet; NMS, nonmaximum suppression).
2.1. YOLO (You Only Look Once)
Object detection can be separated into two main categories: two-stage and one-stage methods.
The former, such as RCNN [31], fast RCNN [32], faster RCNN [33], and mask RCNN [34], first detects the locations of different objects in the image and then recognizes the objects. The latter, such as YOLO [35] and SSD [36], combines both tasks in a single neural network. YOLOv3 [28] is a neural-network-based object detection algorithm implemented in the Darknet framework, which can obtain
the class and a corresponding bounding box for every object in images and videos. Compared with
previous methods, YOLOv3 has the advantages of fast recognition speed, high accuracy, and capability
of detecting multiple objects at the same time [37]. Thus, this paper uses YOLOv3 at the first stage of
action recognition for two reasons. First, the system has to locate each of the people appearing in the scene in real time. Second, the bounding box information corresponding to each person in the scene is critical for preprocessing in the proposed system. In this step, we convert the input video from the camera into frames. The frames are then resized from 1440 ×
1080 to 640 × 480 for use by YOLOv3 to obtain a detection result represented by coordinates of the
bounding boxes. Note that the frame-resizing step is a trade-off between the speed and accuracy of object detection.
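As a rough illustration of this detection stage (not the authors' exact code), the Python sketch below resizes a captured frame to 640 × 480 and runs a Darknet YOLOv3 model through OpenCV's DNN module; the configuration and weight file names are assumed placeholders.

```python
import cv2
import numpy as np

# Hypothetical file names; the paper uses YOLOv3 trained in the Darknet framework.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getUnconnectedOutLayersNames()

def detect_people(frame_1440x1080, conf_thresh=0.5):
    """Resize the native frame to 640x480 and return person bounding boxes (x, y, w, h)."""
    frame = cv2.resize(frame_1440x1080, (640, 480))
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(layer_names)

    boxes = []
    h, w = frame.shape[:2]
    for output in outputs:
        for det in output:
            scores = det[5:]
            class_id = int(np.argmax(scores))
            # COCO class 0 is "person".
            if class_id == 0 and scores[class_id] > conf_thresh:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
    return frame, boxes
```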
2.2. Deep SORT
Simple online and real-time tracking (SORT) is a real-time multiple object tracking method
proposed by Bewley et al. [38]. It combines a Kalman filter and the Hungarian algorithm, predicting the object location in the next frame from that of the current frame by measuring the speed of detected objects
through time. It is a tracking-by-detection method that performs tracking based on the result of object
detection per frame. As an updated version of SORT, simple online and real-time tracking with a
deep association metric (Deep SORT) proposed by Wojke [29] contains an additional convolutional
neural network for extracting additional features, which is pretrained by a large-scale video dataset,
the motion analysis and reidentification set (MARS). Hence, Deep SORT is capable of reducing SORT
tracking error by about 45% [29]. In the proposed system, the goal of using Deep SORT is to perform
multiple-person tracking, where an ID number corresponding to each individual is created in a database based on the detection results of YOLOv3. This means that each person appearing in the scene will be associated with an individual bounding box and an ID number.
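Deep SORT itself combines a Kalman filter, the Hungarian algorithm, and an appearance embedding pretrained on MARS; the toy sketch below is only a simplified IoU-based stand-in that illustrates the tracking-by-detection idea of carrying ID numbers across frames from per-frame detections.

```python
import itertools

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

class GreedyIoUTracker:
    """Toy tracking-by-detection: match detections to existing IDs by IoU overlap."""
    def __init__(self, iou_thresh=0.3):
        self.iou_thresh = iou_thresh
        self.tracks = {}                      # id -> last bounding box
        self._next_id = itertools.count()

    def update(self, boxes):
        assigned = {}
        unmatched = list(boxes)
        for tid, prev in list(self.tracks.items()):
            if not unmatched:
                break
            best = max(unmatched, key=lambda b: iou(prev, b))
            if iou(prev, best) >= self.iou_thresh:
                assigned[tid] = best          # same person, same ID as last frame
                unmatched.remove(best)
        for box in unmatched:                 # new people entering the scene
            assigned[next(self._next_id)] = box
        self.tracks = assigned
        return assigned                       # id -> box for the current frame
```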
2.3. FaceNet
FaceNet is a face recognition approach proposed by Google [30], which has achieved an excellent recognition performance of 99.4% accuracy on the LFW (labeled faces in the wild) database. It adopts the Inception-ResNet network architecture and applies a model pretrained on the MS-Celeb-1M dataset. Note that this dataset contains 8 million face images of about 1 million identities. During the pretraining process, it first generates a 512-dimensional feature vector via L2 normalization and an embedding process. Then, the feature vectors are mapped into a feature space to calculate the Euclidean distance between features. Finally, a triplet loss algorithm is applied in the training process. In the processing of FaceNet, the
similarity of face-matching depends on calculating the Euclidean distance between the features of the
input face image and the stored face images in the database. As soon as the similarity of each matching
pair is given, SVM classification is applied to make the final matching decision.
The face recognition process of the proposed method is shown in Figure 2. At the beginning, we
prestore matched pairs of individual names and their corresponding features generated by FaceNet in
the database. When the bounding box image and corresponding ID number of each individual are
obtained from Deep SORT, we check if the ID number exists in the database for reducing the redundant
execution of FaceNet. Considering the efficiency required for real-time applications, we distinguish the individuals who do not require a face-matching process via FaceNet from the others. If the ID
number does exist in the database, the name related to the ID number will be directly obtained from
the database. Otherwise, we will check whether the face of the individual can be detected by utilizing
the Haar-based cascade classifier in OpenCV, based on the bounding box image of the individual. If
the face of the individual cannot be detected, FaceNet will not be executed, resulting in an “Unknown”
displayed on the top of the bounding box in the video frames. Otherwise, the cropped face image is
fed into FaceNet for face-matching by comparing with the stored features in the database. When the
feature of the cropped face image is similar to the stored feature in the database, we update the table
with the ID number and the corresponding name retrieved from the database for display on the top of
the bounding box in the video frames.
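A minimal sketch of this decision flow is given below. The `facenet_embed` function stands in for the actual FaceNet model, the 0.9 distance threshold is an illustrative value, and a nearest-embedding check replaces the SVM classifier used in the paper.

```python
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

id_to_name = {}          # Deep SORT track ID -> resolved name (the "table" in the text)
known_names = []         # prestored names from the database
known_embeddings = []    # corresponding FaceNet features, one per name

def facenet_embed(face_img):
    """Placeholder for the real FaceNet forward pass (512-D, L2-normalized vector)."""
    raise NotImplementedError("plug in the actual FaceNet model here")

def resolve_name(track_id, person_crop, dist_thresh=0.9):
    """Return the display name for a tracked person, running FaceNet only when needed."""
    if track_id in id_to_name:                        # ID already resolved: skip FaceNet
        return id_to_name[track_id]
    gray = cv2.cvtColor(person_crop, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0 or not known_embeddings:       # no visible face to match
        return "Unknown"
    x, y, w, h = faces[0]
    emb = facenet_embed(person_crop[y:y + h, x:x + w])
    dists = np.linalg.norm(np.asarray(known_embeddings) - emb, axis=1)
    best = int(np.argmin(dists))
    if dists[best] < dist_thresh:
        id_to_name[track_id] = known_names[best]      # cache for later frames
        return known_names[best]
    return "Unknown"
```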
Figure 2. Flow chart of face recognition via FaceNet.

2.4. Automatic “Zoom-In”

At this stage, every image for further processing has been resized to 640 × 480 pixels, which is much smaller than the original image of 1440 × 1080 pixels. However, if there are individuals located far from the camera, the size of the bounding box will become too small for making accurate action recognition. Fortunately, we are able to utilize the high resolution (1440 × 1080 pixels) of the original video to zoom in on the bounding box for individuals located at a longer distance. Because the position of the camera is fixed, we can roughly estimate the depth of an individual according to the height of the bounding box in the image. If the individual is located far from the camera, the height of the bounding box will be a much smaller value. Here, we set a threshold to determine when to automatically launch the zoom-in function according to the height of the bounding box in the image. Once the zoom-in function is activated, we locate the center of the bounding box in the original 1440 × 1080 image frame and crop a new image of 640 × 480 centered at the bounding box, as shown in Figure 3. Note that the cropped region will not exceed the boundary of the original image.

Figure 3. The proposed zoom-in method.
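The zoom-in rule described above amounts to clamping a fixed-size crop window to the frame boundary; a minimal sketch follows, in which the box-height threshold of 120 pixels is an assumed value not reported in the paper.

```python
def should_zoom(box, min_box_height=120):
    """Trigger the zoom-in when the tracked person's box is too small (i.e., far away)."""
    return box[3] < min_box_height

def zoom_in_crop(frame_hires, box, crop_w=640, crop_h=480):
    """Crop a 640x480 window from the original 1440x1080 frame, centered on the
    bounding box and clamped so that it never exceeds the image boundary."""
    H, W = frame_hires.shape[:2]
    x, y, w, h = box                          # bounding box in hi-res coordinates
    cx, cy = x + w // 2, y + h // 2
    left = min(max(cx - crop_w // 2, 0), W - crop_w)
    top = min(max(cy - crop_h // 2, 0), H - crop_h)
    return frame_hires[top:top + crop_h, left:left + crop_w]
```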
2.5. Blurring Background

In this process, we aim at utilizing image processing to enhance the performance of action recognition by I3D. For each individual, we blur the entire 640 × 480 image except the corresponding bounding box with a Gaussian kernel, because the image region within the bounding box of each individual contains more important information for recognition than the other regions in the image. In Figure 4, we can see from the top that three people have been detected. Take the individual marked as “person-0” for example. The far left image in the second row in Figure 4 shows that only the region within the bounding box of person-0 remains intact after the blurring process. The same process applies to the other two individuals as well. Then, all of the preprocessed images related to each individual are further reduced into smaller images of 224 × 224 for collection into an individual dataset used by the respective sliding windows. The design of this process is to retain the important information of the images and reduce interference from the background region, thereby improving recognition accuracy.

Figure 4. The process of blurring background.
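A compact sketch of this preprocessing step using OpenCV is shown below; the Gaussian kernel size is an illustrative choice, as the paper does not report the exact value.

```python
import cv2

def blur_background(frame_640x480, box, ksize=(31, 31)):
    """Gaussian-blur the whole frame except the person's bounding box,
    then shrink the result to the 224x224 input expected by I3D."""
    x, y, w, h = box
    blurred = cv2.GaussianBlur(frame_640x480, ksize, 0)
    blurred[y:y + h, x:x + w] = frame_640x480[y:y + h, x:x + w]  # keep the person sharp
    return cv2.resize(blurred, (224, 224))
```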
2.6. Sliding Windows
As mentioned earlier, most work on human action detection assumes presegmented video clips, so that only the recognition part of the task remains to be solved. However, the information on the start and end time of an observed action is important in providing satisfactory recognition for continuous action streams. Here, we apply the sliding windows [39] method to divide the input video into a sequence of overlapped short video segments, as shown in Figure 5. In fact, we sample the video with blurred background every five frames to construct a sequence of frames for processing by a sliding window of 16 frames as time elapses. The 16 frames in the sliding window, presumably indicating the start and end time of the action, for each person detected are then fed into I3D to recognize the action performed by the person. Specifically, each video segment F consisting of 16 frames in a sliding window can be constructed by:

F = { f_{c-5n} | 0 ≤ n < 16 }  (1)

where f_c is the frame captured at the current time. Hence, each video segment, representing 80 frames from the camera, will be fed into I3D for action recognition in sequence. In this paper, five consecutive sliding windows, each consisting of 16 frames, and their corresponding recognition classes by I3D are grouped as an input set for processing by NMS, as shown in Figure 5.
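Equation (1) can be realized with a simple fixed-length buffer over the subsampled frames; the sketch below follows the 5-frame stride and 16-frame window of the paper, while the variable names are ours.

```python
from collections import deque

STRIDE = 5           # sample every 5th camera frame
WINDOW = 16          # frames per sliding window fed to I3D

frame_buffer = deque(maxlen=WINDOW)   # holds f_{c-5n} for 0 <= n < 16

def push_frame(frame_index, preprocessed_224):
    """Collect subsampled, background-blurred 224x224 frames; return a full
    16-frame window (oldest to newest) whenever one is available."""
    if frame_index % STRIDE != 0:
        return None
    frame_buffer.append(preprocessed_224)
    if len(frame_buffer) < WINDOW:
        return None
    return list(frame_buffer)          # one video segment F, spanning 80 raw frames
```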
Figure 5. The proposed sliding windows method for action recognition, where from top to bottom shows the ground truth of actions and overlapped sliding windows each consisting of 16 frames.
2.7. Inflated 3D ConvNet (I3D)
I3D, a neural network architecture based on 3D convolution proposed by Carreira et al. [20], as shown in Figure 6, is adopted for action recognition in the proposed system, taking only RGB images as the input data. Note that the optical flow input in the original approach is discarded in the proposed design considering the recognition speed. Moreover, I3D contains several Inception modules that hold several convolution units with a 1 × 1 × 1 filter, as shown in Figure 7. This design allows the dimension of the input data to be adjusted for various sizes by changing the number of those convolution units. In the proposed method, the input data to I3D are the video segments of a window size of 16 frames from the previous stage. Therefore, each of the video segments is used to produce a recognition class and a corresponding confidence score via I3D based on the input: 16 frames × 224 image height (pixels) × 224 image width (pixels) × 3 channels.
Figure 6. Inflated 3D (I3D) network architecture [20] © 2020 IEEE.
Figure 7. Inception module network architecture [20] © 2020 IEEE.
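As a rough illustration of the input format described above, the sketch below stacks one 16-frame window into a (1, 16, 224, 224, 3) tensor such as an I3D-style network would consume; the normalization to [-1, 1] is a common convention assumed here, and the sketch does not reproduce the I3D architecture itself.

```python
import numpy as np

def to_i3d_input(window_frames):
    """window_frames: list of 16 RGB frames, each 224x224x3, values 0-255.
    Returns a float32 batch of shape (1, 16, 224, 224, 3) scaled to [-1, 1]."""
    clip = np.stack(window_frames, axis=0).astype(np.float32)
    clip = clip / 127.5 - 1.0             # assumed I3D-style normalization
    return clip[np.newaxis, ...]          # add the batch dimension
```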
2.8. Nonmaximum Suppression (NMS)
Traditional two-dimensional nonmaximum suppression is commonly used for obtaining a final decision in multiple object detection. Basically, NMS generates the best bounding box related to the target object by iteratively eliminating the redundant candidate bounding boxes in the image. To accomplish this task, NMS requires the coordinates of the bounding box related to the individual and the corresponding confidence score. During the iteration, the bounding box having the highest confidence score is selected to calculate the intersection over union (IoU) score with each of the other bounding boxes. If the IoU score is greater than a threshold, the corresponding confidence score of the bounding box will be reset to zero, which means that bounding box will be eliminated. This process repeats until all bounding boxes are handled. As a result, the bounding boxes having the highest confidence are selected after the iteration.

As an attempt to improve the robustness of the recognition results, the start and end times of the five consecutive sliding windows and their corresponding recognition classes are used to derive the final recognition results via NMS. Figure 8 shows how a one-dimensional nonmaximum suppression is used to derive the final recognition results, where the threshold of the IoU score is set as 0.4. From top to bottom, we can see that each group of five consecutive sliding windows, together with their corresponding recognition classes and confidence scores, is utilized to recognize the action class and its best start and end time via NMS. The final recognition result is a winner-take-all decision, resulting in the final estimate of the action class held by the video segment with the highest confidence score. According to Figure 8, we can see that not until the last (fifth) sliding window of frames provides an output does the proposed method make a final estimate of action recognition. Although there is a delay of around 2.5 s between the actual recognition results and the ground truth, the NMS process can effectively suppress the inconsistency of the action recognition results.
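A minimal sketch of this one-dimensional NMS over temporal windows is given below, assuming each window is represented by its (start, end) frame interval, predicted class, and confidence score; the 0.4 IoU threshold matches the paper, while the data layout is ours.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals measured in frames."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def nms_1d(windows, iou_thresh=0.4):
    """windows: list of dicts {'interval': (start, end), 'cls': str, 'score': float}
    for five consecutive sliding windows. Returns the surviving windows,
    highest-confidence first (the winner-take-all decision uses the first element)."""
    remaining = sorted(windows, key=lambda w: w["score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [w for w in remaining
                     if temporal_iou(best["interval"], w["interval"]) <= iou_thresh]
    return kept
```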
Figure 8. One-dimensional nonmaximum suppression (NMS) is used to derive the final recognition results.

3. Experimental Results
3.1. Computational Platforms

To evaluate the proposed action recognition system, we train the I3D model on the Taiwan Computing Cloud (TWCC) platform of the National Center for High-performance Computing (NCHC) and verify the proposed method on an Intel(R) Core(TM) i7-8700 @ 3.2 GHz with an NVIDIA GeForce GTX 1080Ti graphics card under Windows 10. The computational time for training and testing our I3D model is about 5 h on TWCC. The experiments are conducted under Python 3.5 using the TensorFlow backend with the Keras library and the NVIDIA CUDA 9.0 library for parallel computation.

3.2. Datasets

In this paper, training is conducted on the NTU RGB+D dataset [40], which has 60 action classes. Each sample scene is captured by three Microsoft Kinect V2 sensors placed at different positions, and the available data of each scene include RGB videos, depth map sequences, 3D skeleton data, and infrared (IR) videos. The resolution of the RGB videos is 1920 × 1080 pixels, whereas the corresponding depth maps and IR videos are all 512 × 424 pixels. The skeleton data contain the 3D coordinates of 25 body joints for each frame. In order to apply the proposed action recognition system to a practical scenario, a long-term care environment is chosen for demonstration purposes in this paper. We manually select 12 action classes that are likely to be adopted in this environment, as shown in Table 1. In the training set, most of the classes contain 800 videos, except the class “background”, which has 1187 videos. In the testing set, there are 1841 videos.

Table 1. Classes of action recognition.

Number  Action       Number  Action
1       Drink        7       Chest ache
2       Eat meal     8       Backache
3       Stand up     9       Stomachache
4       Cough        10      Walking
5       Fall down    11      Writing
6       Headache     12      Background

3.3. Experimental Results
In order to investigate the performance of the proposed method for practical use in a long-term care environment, there are five major objectives for evaluation:
• Can the system simultaneously recognize the actions performed by multiple people in real time?
• Performance comparison of action recognition with and without the “zoom-in” function;
• Differences of action recognition using black background and blurred background;
• Differences of I3D-based action recognition with and without NMS;
• Accuracy of the overall system.

For the first objective, after training with the dataset for 1000 epochs, two screenshots selected from an action recognition result video are shown in Figure 9, where actions performed by three people in a simplified scenario are recognized at the same time. As individuals in the scene behave over time, for example, eating a meal and standing up for ID 214, having a headache for ID 215, and falling down and having a stomachache for ID 216, the system can recognize the actions corresponding to what the individuals are performing. In addition, the proposed system can also display the ID and name obtained by Deep SORT and FaceNet, respectively. There is no hard limit on the number of people appearing in the scene for action recognition, as long as the individuals can be detected by YOLO in the image. However, if the bounding box image of an individual is too small, the system might encounter a recognition error. Table 2 shows the execution speed of the proposed system for various numbers of people, including facial recognition and action recognition.

Figure 9. Action recognition results where action classes are (a) Eat_meal, Headache, Fall-down and (b) Stand_up, Headache, Stomachache.

Table 2. Execution speed of the proposed system for various numbers of people.

Number of People       1      2      3      4      8      10
Average speed (fps)    12.29  11.70  11.46  11.30  11.02  10.93
Second, we design an experiment to measure the action recognition results with and without using the zoom-in function. Figure 10 shows that the individual facing the camera slowly moves backward while doing actions. It can be seen that as time elapses, the individual moves farther and farther away from the camera, resulting in poor recognition performance because the size of the bounding box is too small to make accurate action recognition. On the contrary, the bounding box becomes sufficiently large with the zoom-in function, as shown in the bottom-right image of Figure 10. Figures 11 and 12 show the recognition results with and without using the zoom-in function, where the horizontal axis indicates the frame number of the video clip and the vertical axis reveals the action class recognized with the zoom-in function (green) and without the zoom-in function (red), respectively, in comparison to the manually labeled ground truth (blue). Table 3 shows the recognition accuracy for the same individual standing at different distances. We can see that the average accuracy of the recognition results with and without the zoom-in method is 90.79% and 69.74%, respectively, as shown in the far left column in Table 3. As clearly shown in these figures, there is no difference in the recognition results with or without the zoom-in function for the individual located near the camera. However, when the individual moves far from the camera, the zoom-in approach greatly enhances the recognition accuracy from 32.14% to 89.29%.

Figure 10. Images of a video where a person moves backward from near (left) to far (right) to evaluate the recognition performance with and without the zoom-in method.

Figure 11. Recognition result using the zoom-in function (green line), where the blur rectangle indicates the individual is located far from the camera.

Figure 12. Recognition result without the zoom-in function (red line), where the blur rectangle indicates the individual is located far from the camera.

Table 3. Recognition accuracy with and without the zoom-in method of the scenario in Figure 10.

                  Average Accuracy    Short-Distance Accuracy    Long-Distance Accuracy
With zoom-in      90.79%              91.49%                     89.29%
Without zoom-in   69.74%              91.49%                     32.14%

Third, to investigate the difference between image frames with blurred background and black background for action recognition, we use the same video input to analyze the accuracy of action recognition with different background modifications, as shown in Figure 13. Figures 14 and 15 show the recognition results using blurred and black backgrounds, respectively. It is clear that the recognition result is better using the blurred background (green line) than the black background (red line), in comparison to the manually labeled ground truth (blue), where the recognition accuracy using images with blurred background is 82.9%, whereas the one with black background is 46.9%. We can see that retaining appropriate background information contributes to a better recognition performance.

Figure 13. Images with blurred background (left) and black background (right).

Figure 14. Recognition result using blurred background.

Figure 15. Recognition result using black background.

Fourth, to verify the improvement of introducing NMS, we analyze the accuracy of the action recognition system before and after using NMS. In this experiment, we fed a video of 1800 frames consisting of 12 actions into the proposed system for action recognition, as shown in Figure 16, where the orange line and the blue line represent the recognition results with and without using NMS. It is clear that the inconsistency of the recognition results is greatly suppressed by the NMS method. Note that the average recognition accuracy increases from 76.3% to 82.9% with the use of NMS.

Figure 16. Comparison of recognition results with and without using NMS.

Finally, to show the performance of the proposed approach, Figure 17 shows the recognition accuracy of the overall system, where the orange and blue lines indicate the action recognition results obtained by the proposed method and the manually labeled ground truth, respectively. We can see that except for a few occasions, the recognition results are consistent with the ground truth at an accuracy of approximately 82.9%. Due to the fact that this work specifically chose 12 action classes to perform multiple-person action recognition for long-term care centers, there is no universal benchmark for serving the need of a meaningful comparison.
Figure 17. Performance of action recognition of the overall system.
While conducting the experiments, we found that there are some difficult scenarios for action
recognition by the proposed system. If two different actions involve similar movement, it may easily
incur a recognition error. For example, an individual leaning forward could be classified as either a ‘Stomachache’ or ‘Cough’ action. Also, if occlusion occurs when capturing the video, the action recognition system is likely to make an incorrect decision. Interested readers can refer to the following link: https://youtu.be/t6HpYCjlTLA to watch a demo video showing real-time multiple-person action recognition using the proposed approach in this paper.

4. Conclusions
This
This paper
paper presented
presented aa real-time
real-time multiple-person
multiple-person action
action recognition
recognition system
system suitable
suitable for
for smart
smart
surveillance
applications
to
identify
individuals
and
recognize
their
actions
in
the
scene.
For
people
far
surveillance applications to identify individuals and recognize their actions in the scene. For people
from
the camera,
a “zoom
in” function
automatically
activates
to make
use ofuse
theof
high
video
far from
the camera,
a “zoom
in” function
automatically
activates
to make
theresolution
high resolution
frame
for
better
recognition
performance.
In
addition,
we
leverage
the
I3D
architecture
for
action
video frame for better recognition performance. In addition, we leverage the I3D architecture for
recognition
in real time
by using
window
and design
NMS. For
purpose, the
action recognition
in real
time sliding
by using
slidingdesign
window
anddemonstration
NMS. For demonstration
proposed
is applied
to long-term
caretoenvironments
where
the system iswhere
capable
of system
detecting
purpose, approach
the proposed
approach
is applied
long-term care
environments
the
is
(abnormal)
actions
that
might
be
of
concerns
for
accident
prevention.
Note
that
there
is
a
delay
of
capable of detecting (abnormal) actions that might be of concerns for accident prevention. Note that
around
s between
the recognition
results
and the occurrence
of the
observed
action
the proposed
there is a2.5
delay
of around
2.5 s between
the recognition
results and
occurrence
ofby
observed
action
method.
However,method.
this delay
does not make
significant
on significant
the smart surveillance
by the proposed
However,
this delay
does impacts
not make
impacts onapplication
the smart
for
long-termapplication
care centers,for
because
human
afterhuman
the alarm
is generally
longer
the is
2
surveillance
long-term
careresponse
centers, time
because
response
time after
thethan
alarm
sgenerally
delay time.
Ideally,
a smaller
more appealing.
is ourtime
future
decrease the
longer
than
the 2 s delay
delay time
time.isIdeally,
a smallerItdelay
is work
moretoappealing.
It isdelay
our
time
via
optimizing
the
filter
size
of
the
sliding
window
to
achieve
the
objective
of
less
than
1
s
for
the
future work to decrease the delay time via optimizing the filter size of the sliding window to achieve
delay
time.
Thanks
to
the
architecture
of
the
proposed
method,
the
proposed
method
can
be
used
in
the objective of less than 1 s for the delay time. Thanks to the architecture of the proposed method,
the
for various
to provide
smart application
surveillance,
for example,
detection
of
the future
proposed
method application
can be usedscenarios
in the future
for various
scenarios
to provide
smart
unusual
behavior
in
a
factory
environment.
surveillance, for example, detection of unusual behavior in a factory environment.
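To make the relationship between the sliding-window settings and the reported delay of roughly 2.5 s concrete, the following Python sketch buffers incoming frames into fixed-length windows, runs a stand-in clip classifier on each window, and estimates a rough upper bound on the reporting delay from the window length, stride, and frame rate. It is a simplified sketch under our own assumptions, not the implementation of the proposed system; classify_clip stands in for the I3D forward pass, and the 64-frame window, 16-frame stride, and 30 fps values are examples only.

from collections import deque

WINDOW, STRIDE, FPS = 64, 16, 30   # assumed values, not the system's exact settings

def classify_clip(frames):
    """Stand-in for the I3D clip classifier; returns (label, score)."""
    return "Walk", 0.9

def sliding_window_inference(frame_stream):
    """Emit one (label, score) decision every STRIDE frames once WINDOW frames are buffered."""
    buffer, since_last = deque(maxlen=WINDOW), 0
    for frame in frame_stream:
        buffer.append(frame)
        since_last += 1
        if len(buffer) == WINDOW and since_last >= STRIDE:
            since_last = 0
            yield classify_clip(list(buffer))

decisions = list(sliding_window_inference(range(200)))   # dummy integer "frames"
print(len(decisions), "decisions")                       # 9 decisions for 200 frames

# Rough upper bound on the reporting delay: a new action may have to wait for
# the current window to fill plus one more stride before it dominates a window.
print(f"approximate delay: {(WINDOW + STRIDE) / FPS:.2f} s")   # about 2.67 s here

Shrinking the window or the stride lowers this bound, which is consistent with the stated future work of optimizing the filter size of the sliding window to reach a delay below 1 s.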
Author Contributions: Conceptualization, J.-K.T. and C.-C.H.; methodology, J.-K.T.; software, J.-K.T.; validation, J.-K.T., C.-C.H., and S.-K.H.; writing—original draft preparation, J.-K.T.; writing—review and editing, J.-K.T., C.-C.H., and S.-K.H.; visualization, J.-K.T.; supervision, C.-C.H. and W.-Y.W.; project administration, C.-C.H.; funding acquisition, C.-C.H. and W.-Y.W. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the Chinese Language and Technology Center of National Taiwan Normal University (NTNU) from The Featured Areas Research Center Program within the Framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and the Ministry of Science and Technology, Taiwan, under Grants no. MOST 109-2634-F-003-006 and MOST 109-2634-F-003-007 through Pervasive Artificial Intelligence Research (PAIR) Labs.
Acknowledgments: The authors are grateful to the National Center for High-performance Computing for
computer time and facilities to conduct this research.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wiliem, A.; Madasu, V.; Boles, W.; Yarlagadda, P. A suspicious behaviour detection using a context space model for smart surveillance systems. Comput. Vis. Image Underst. 2012, 116, 194–209. [CrossRef]
2. Feijoo-Fernández, M.C.; Halty, L.; Sotoca-Plaza, A. Like a cat on hot bricks: The detection of anomalous behavior in airports. J. Police Crim. Psychol. 2020. [CrossRef]
3. Ozer, B.; Wolf, M. A train station surveillance system: Challenges and solutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 24–27 June 2014; pp. 652–657. [CrossRef]
4. Zhang, H.B.; Zhang, Y.X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.X.; Chen, D.S. A comprehensive survey of vision-based human action recognition methods. Sensors 2019, 19, 1005. [CrossRef] [PubMed]
5. Bakalos, N.; Voulodimos, A.; Doulamis, N.; Doulamis, A.; Ostfeld, A.; Salomons, E.; Caubet, J.; Jimenez, V.; Li, P. Protecting water infrastructure from cyber and physical threats: Using multimodal data fusion and adaptive deep learning to monitor critical systems. IEEE Signal Process. Mag. 2019, 36, 36–48. [CrossRef]
6. Kar, A.; Rai, N.; Sikka, K.; Sharma, G. AdaScan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3376–3385.
7. Wei, H.; Jafari, R.; Kehtarnavaz, N. Fusion of video and inertial sensing for deep learning–based human action recognition. Sensors 2019, 19, 3680. [CrossRef] [PubMed]
8. Ding, R.; Li, X.; Nie, L.; Li, J.; Si, X.; Chu, D.; Liu, G.; Zhan, D. Empirical study and improvement on deep transfer learning for human activity recognition. Sensors 2018, 19, 57. [CrossRef] [PubMed]
9. Xia, L.; Chen, C.; Aggarwal, J. View invariant human action recognition using histograms of 3D joints. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27.
10. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 816–833.
11. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310.
12. Fang, H.-S.; Xie, S.; Tai, Y.-W.; Lu, C. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2334–2343.
13. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.
14. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [CrossRef] [PubMed]
15. Hwang, P.-J.; Wang, W.-Y.; Hsu, C.-C. Development of a mimic robot-learning from demonstration incorporating object detection and multiaction recognition. IEEE Consum. Electron. Mag. 2020, 9, 79–87. [CrossRef]
16. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
17. Chen, Z.; Li, A.; Wang, Y. A temporal attentive approach for video-based pedestrian attribute recognition. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV); Springer: Cham, Switzerland, 2019; pp. 209–220.
18. Hwang, P.-J.; Hsu, C.-C.; Wang, W.-Y.; Chiang, H.-H. Robot learning from demonstration based on action and object recognition. In Proceedings of the IEEE International Conference on Consumer Electronics, Las Vegas, NV, USA, 4–6 January 2020.
19. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
20. Carreira, J.; Zisserman, A. Quo vadis, action recognition? New models and the Kinetics dataset. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
21. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
22. Rose, T.; Fiscus, J.; Over, P.; Garofolo, J.; Michel, M. The TRECVid 2008 event detection evaluation. In Proceedings of the IEEE Workshop on Applications of Computer Vision, Snowbird, UT, USA, 7–9 December 2009.
23. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 32–36.
24. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
25. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402.
26. Carreira, J.; Noland, E.; Hillier, C.; Zisserman, A. A short note on the Kinetics-700 human action dataset. arXiv 2019, arXiv:1907.06987.
27. Song, Y.; Kim, I. Spatio-temporal action detection in untrimmed videos by using multimodal features and region proposals. Sensors 2019, 19, 1085. [CrossRef] [PubMed]
28. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
29. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 3645–3649.
30. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
31. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 580–587.
32. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083.
33. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 7–12 December 2015; pp. 91–99.
34. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2017, arXiv:1703.06870.
35. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
36. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Single shot MultiBox detector. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37.
37. Wu, Y.-T.; Chien, Y.-H.; Wang, W.-Y.; Hsu, C.-C. A YOLO-based method on the segmentation and recognition of Chinese words. In Proceedings of the International Conference on System Science and Engineering, New Taipei City, Taiwan, 28–30 June 2018.
38. Bewley, A.; Zongyuan, G.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the IEEE International Conference on Image Processing, Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
39. Shou, Z.; Wang, D.; Chang, S.-F. Temporal action localization in untrimmed videos via multi-stage CNNs. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1049–1058.
40. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1010–1019.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).