Soft Spatial Attention-based Multimodal Driver
Action Recognition Using Deep Learning
Imen JEGHAM, Anouar BEN KHALIFA, Ihsen ALOUANI, Mohamed Ali MAHJOUB
Abstract— Driver behaviors and decisions are crucial factors for on-road driving safety. With a precise driver behavior monitoring system, traffic accidents and injuries can be significantly reduced. However, understanding human behaviors in real-world driving settings is a challenging task because of the uncontrolled conditions, including illumination variation, occlusion, and dynamic and cluttered backgrounds. In this paper, a Kinect sensor, which provides multimodal signals, is adopted as a driver monitoring sensor to recognize safe driving and the most common distracting secondary in-vehicle actions. We propose a novel soft spatial attention-based network named the Depth-based Spatial Attention network (DSA), which adds a cognitive process to a deep network by selectively focusing on the driver's silhouette and motion in the cluttered driving scene. In fact, at each time t, we introduce a new weighted RGB frame based on an attention model designed using a depth frame. The final classification accuracy is substantially enhanced compared to the state-of-the-art results, with an achieved improvement of up to 27%.
Index Terms— Driver action recognition, Kinect sensor, Spatial soft attention, Multimodal, Deep learning.
I. INTRODUCTION

Driver behaviors and decisions are the principal factors
that can affect driving safety. More than 90% of reported
vehicle crashes in the US have been caused by the driver’s
inadvertent errors and misbehavior, a pattern similar to other countries worldwide [1]. Traffic accidents can be reduced by
10% to 20% with an efficient driver monitoring system [2],
[3]. Therefore, it is important to have a clear perspective on
the driver behavior. For this reason, recognizing the driver’s
in-vehicle actions represents one of the most important tasks
for Intelligent Transportation Systems (ITS) [4]. In fact, for
highly automated vehicles including the level-3 automation
(according to the automation definition in SAE standard J3016
[5]), the driver is allowed to perform secondary tasks and is
responsible for taking over vehicle control in emergencies. In China and the US, many accidents were recorded when Tesla drivers relied solely on the autopilot system while driving.
Preprint submitted to IEEE SENSORS JOURNAL, May xx, 2020.
Imen JEGHAM is with Université de Sousse, Institut Supérieur
d’Informatique et des Techniques de Communication de H. Sousse,
LATIS- Laboratory of Advanced Technology and Intelligent Systems,
4011, Sousse, Tunisie; (e-mail: imen.jegham@isitc.u-sousse.tn).
Anouar BEN KHALIFA is with Université de Sousse, Ecole Nationale d’Ingénieurs de Sousse, LATIS- Laboratory of Advanced
Technology and Intelligent Systems, 4023, Sousse, Tunisie (e-mail:
anouar.benkhalifa@eniso.rnu.tn).
Ihsen ALOUANI is with IEMN-DOAE, Université Polytechnique Hauts-de-France, Valenciennes, France.
Mohamed Ali MAHJOUB is with Université de Sousse, Ecole Nationale d'Ingénieurs de Sousse, LATIS- Laboratory of Advanced Technology and Intelligent Systems, 4023, Sousse, Tunisie.
Different physiological and vision sensors have been widely
used for driver status monitoring. Physiological sensors have
generally been restricted to estimating specific driver behaviors.
For example, EOG and EEG are mainly used to estimate
driver fatigue and somnolence. Moreover, these sensors are
expensive and require specific information in advance such as
gaze direction. On the other hand, vision sensors are the most
widely used sensors for Driver Action Recognition (DAR).
Unlike other sensors, they allow a holistic understanding of the
driver’s actions and situation. Various types of vision sensors
can be employed including omnidirectional, thermographic,
and Kinect cameras. Kinect has been successfully used for
low-cost navigation. It consists of an RGB-Depth camera
that captures RGB images along with 16-bit depth images
that indicate the objects’ distance rather than their brightness.
Kinect was initially designed for indoor motion sensing and
was extended later to outdoor environments [6].
Efficient real-world DAR systems should be designed under uncontrolled environmental conditions. Therefore, many
challenges including lighting conditions, viewpoint variation
and cluttered and dynamic background need to be addressed
[4], [7], [8]. In the past few years, deep learning has shown remarkable efficiency in solving a plethora of complex real-life problems. Deep learning-based Human Action Recognition (HAR) techniques have surpassed most traditional hand-crafted feature systems, which require considerable preprocessing and domain knowledge [9]. However, these systems still fail to take into account the spatial and temporal context of the scene. Moreover, HAR systems are not directly applicable under DAR constraints. In fact, they
are mostly based on a holistic analysis of the human body
posture and behaviour. The limited in-vehicle space where the
actions are executed and the parallel execution of different
in-vehicle actions with driving tasks drastically challenge HAR techniques. Furthermore, other challenges related to
naturalistic driving settings including the dynamic and highly
cluttered background and the high illumination variation are
added. This motivates the development of comprehensive DAR
techniques under real-life conditions.
Inspired by the human vision process, visual attention models extract relevant information by selectively concentrating
on parts of the visual space where and when it is needed.
Attention models can be clustered into two main categories [10]: hard and soft attention models.
Hard attention models make hard decisions when picking parts of the input data. Training these models through back-propagation is difficult and computationally expensive, as it depends on sampling. Soft attention models, on the other hand, consider the entire input but dynamically weight each part of the scene. This mechanism can be applied in three
different variants: spatial, temporal and spatio-temporal. Such
soft context interpretability is necessary and helpful for driver
monitoring applications. Since accidents can happen in the
blink of an eye, the continuous processing of the temporal
dimension is important. Therefore, in this study, we focus on
a visual soft spatial attention mechanism.
In this paper, for the first time, we exploit the depth modality, which highlights the driver's silhouette and suppresses the cluttered background, for soft spatial attention. We use RGB-D data
to focus on the relevant parts of cluttered naturalistic driving
scenes to recognize driver in-vehicle actions. Thus, we propose
a novel Depth-based soft Spatial Attention network (DSA) that
efficiently focuses on the driver’s silhouette to monitor driver
behaviors for the ITS.
The main contributions of this paper can be summarized as
follows:
• We propose DSA, a novel spatial attention-based network that focuses on the driver's posture. DSA
is designed to be efficient in cluttered and dynamic
scenes to recognize the driver’s in-vehicle actions under
uncontrolled driving settings.
• We unprecedentedly exploit a depth sensor modality for
a visual attention mechanism in DAR.
• We evaluate our network on the first public multimodal driver action dataset, MDAD [11], and report promising results for different views and under various environments.
The remainder of the paper is organized as follows: In Section 2, the background on deep learning techniques for action recognition as well as the related work is detailed. DSA is described in Section 3. Section 4 presents the experimental
results. We finally conclude in Section 5.
II. BACKGROUND AND RELATED WORK

Due to their proven performance, deep learning techniques have been applied to action recognition. However, these systems still fail to exclusively focus on the relevant information in the scene. A proper attention model can guide the system to efficiently extract pertinent information. In this section, we categorize deep learning-based action recognition techniques and present attention-based related work.
A. Deep learning techniques for action recognition
A lot of papers have surveyed vision-based HAR research
using deep learning [12]–[14]. They all admit that the two major aspects in developing deep networks for action recognition
are the convolution process and temporal modeling. However,
dealing with the temporal dimension is a challenging issue.
Thus, many solutions have been proposed and can be clustered
into three major categories:
3DCNN or spacetime networks: These techniques extend
convolutional operations to the temporal domain. 3DCNNs
[15], [16] extract features from both temporal and spatial
dimensions by capturing spatiotemporal information encoded
in neighboring frames. This kind of network is more appropriate for measuring spatiotemporal dynamics over a short period of time; C3D [17] has been put forward to overcome this shortcoming. However, this technique results in expensive
computation costs and a large number of parameters that make
the training task very complex.
Multistream CNN: They use different convolutional networks to model both spatial and motion information in action
frames. In fact, their architecture mainly contains two streams:
a spatial CNN that learns actions from static frames and a
temporal CNN that uses the optical flow to recognize actions
[18], [19]. Other streams are also used including the audio
signal stream [20]. However, this may not be appropriate for
collecting information over a long period. Thus, improvements
have been suggested using temporal pooling, which can pool
spatial features computed at every frame across time. However, this category suffers from a common drawback: the absence of interaction between streams, which is very important for learning spatiotemporal features.
Hybrid network: To aggregate temporal information, this
network combines the CNN with temporal sequence modeling
techniques. In fact, recurrent neural networks, and particularly Long Short-Term Memory (LSTM) [21], have the advantage of modeling long-term sequences. A lot of approaches have
been proposed and have shown good performances [22]–
[24]. This category takes advantage of both CNN and LSTM,
requires lower complexity and achieves considerable time
savings compared to other techniques [14], [25].
For these reasons, we opt for hybrid networks in our
approach. Nevertheless, on their own, hybrid networks are
unable to selectively focus on relevant scene information.
Hence, extending these techniques with visual attention has
a promising chance to enhance the overall system’s performance.
B. Visual attention
Over the last few years, attention models have been widely
used in various fields including sentence summarization [26],
machine translation [27], speech recognition [28], etc. In the
literature, unimodal data, and particularly the RGB modality, has been widely employed. Karpathy et al. [29] highlighted a
Fig. 1: Proposed architecture of DSA (input: RGB frames It and depth frames Dt yield attention frames At and attended frames Jt; feature extraction: a CNN produces per-frame features Xt; classification: an LSTM produces hidden states ht and outputs Yt).
multi-resolution architecture for HAR to deal with the input at
two spatial resolutions: a low-resolution context stream and a
high-resolution fovea stream that focused the attention onto the
center of the frame. Sharma et al. [10] proposed a visual soft
attention-based model for HAR in videos, which integrated
convolutional features extracted from different parts of the
space time volume by selectively focusing on special parts of
video frames. Wang et al. [30] designed an attention module
that consisted of two parallel parts: a channel and space level
attention, to be inserted into a CNN. Girdhar et al. [31] put forward an attention mechanism derived from top-down and bottom-up attention as low-rank approximations of bilinear pooling methods. Wang et al. [25] introduced
an unsupervised attention mechanism that focused on specific
visual regions of video frames and improved the recognition
accuracy by integrating the attention weighted module. This
algorithm was limited to recognizing specific driver activities over the short term from a particular view. The work presented
above focused mainly on the spatial locations of each frame
without considering temporal relations between frames. Later
work incorporated attention into the motion stream [23], [32]. However, the latter only used optical flow frames computed from consecutive frames and hardly considered long-term temporal relations in a video sequence. Thus, Wang et al. [33]
suggested the attention-aware temporal weighted CNN that
embedded visual attention into a temporal multi stream CNN.
Meng et al. [34] introduced two separate temporal and spatial
attention mechanisms. For spatial attention, they identified the
most-relevant spatial location in that frame. For the temporal
attention, they recognized the most relevant frames from an
input video. Such work was insufficient to focus on the most
relevant objects and movements [35].
Using multimodal data helps to orient the system's attention. Therefore, Baradel et al. [36] used multimodal video data
including RGB frames and articulated pose. They proposed a
two-stream approach: the raw RGB stream that was treated
by a spatio-temporal soft-attention mechanism conditioned on
features from the pose network and the pose stream that
was refined with a convolutional model taking as an input
a 3D tensor holding data from a sub-sequence. Zhu et al.
[37] proposed a cuboid model obtained by organizing the pairwise displacements between all body joints for skeleton-based action recognition. This representation allowed deep models to focus mainly on actions. However, the skeleton modality posed strong restrictions on subjects and failed to consider
the positions of ordinal body joints when the actions were
performed on a limited surface. To overcome these shortcomings,
we introduce a novel attention model based on RGB and depth
modalities that will efficiently identify the driver pose from
the background context to recognize safe driving and many
secondary actions from different views.
III. DSA: DEPTH-BASED SPATIAL ATTENTION NETWORK
Visual attention is one of the key perception mechanisms
to efficiently pick the data of most potential interest. It is
the cognitive and behavioral process of selectively focusing
on a relevant part of data while neglecting other useless
information. This brings a new interpretability perspective by
weighting regions according to the relevance of information
they hold. Intuitively, including visual soft attention enhances
the overall accuracy since the system focuses exclusively on
important cues of the scene. In this paper, to recognize driver
in-vehicle actions under realistic driving settings, we introduce
DSA, which is a hybrid deep network that exploits the depth
modality to extract informative parts of RGB data to reliably
classify driver actions. In fact, the gray levels of depth images
indicate the distance of objects rather than their brightness.
Therefore, based on the depth modality, we highlight the
driver’s actions and ignore the dynamic background. As shown
in Figure 1, a single human action is described by a sequence
of two modalities: a set of RGB input images I = {It} and a set of depth input images D = {Dt}, where t varies from 1 to n
and n represents the number of frames per sequence. The aim
is to predict the action class Y. Therefore, at each time step
t, we extract a soft spatial attention image At from the depth
image and combine it with the RGB image to obtain Jt : the
hybrid network’s input.
A. Spatial attention module
In this subsection, we develop the mechanism by which we
obtain the soft spatial attention (At ), as well as the restored
attention on the RGB frame (Jt ).
1) Soft spatial attention (At): To compute the spatial attention (At), a three-step framework is proposed. A detailed explanation of these steps is given in Figure 2.
Power rate transform: In this work, to highlight driver information, the power rate transform is first employed to stretch
the image grayscale histogram. This process can be modeled
with Equation (1), where c and µ are regulation parameters, D
is the original depth image that presents the initial grayscale
value, and R represents the transferred grayscale image that is
a 16-bit depth image whose values are in the range [0, 2^16).

$$R = c \times D^{\mu} \quad (1)$$
According to Equation (1), when µ > 1, image details are
highlighted and image stretching is focused principally on
the important grayscale range; and when µ < 1, the low
grayscale range is highlighted.

Fig. 2: Layout of the proposed soft attention mechanism (for both side and front views: depth image → normalization → power rate transformation (c=1, µ=0.2) → foreground reinforcement → morphological operation → spatial attention frame).

According to the histogram of
depth frames, all values of important cues are concentrated
in a low range, and grayscale stretching is obtained for the parameters c=1 and µ < 1 [38]. Figure 3 illustrates the power rate transforms of a depth frame for several values of µ.
Empirically, the values of c=1 and µ=0.2 are selected for
the power rate transform of depth frames since they generate
the best grayscale stretching results. As seen in Figure 2,
the driver silhouette is clearer after the power rate transform.
Nevertheless, background noise is still visible, which will be
addressed by foreground reinforcement.
Fig. 3: Comparison of power rate transforms of a depth image for multiple values of µ (c=1; µ = 0.2, 0.4, 0.6, 0.8, 0.9).
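As an illustration, the following is a minimal NumPy sketch of Equation (1); the normalization to [0, 1] before the power and the rescaling back to the 16-bit range follow the Normalization step of Figure 2, and the synthetic frame, array size, and function name are only stand-ins.

```python
import numpy as np

def power_rate_transform(depth, c=1.0, mu=0.2):
    """Power rate transform of Equation (1): R = c * D^mu.

    The 16-bit depth frame is normalized to [0, 1], raised to the power mu
    (mu < 1 stretches the low grayscale range where the driver lies),
    then rescaled back to the 16-bit range [0, 2^16).
    """
    d = depth.astype(np.float64) / (2 ** 16 - 1)
    r = c * np.power(d, mu)
    return np.clip(r * (2 ** 16 - 1), 0, 2 ** 16 - 1).astype(np.uint16)

# Stand-in depth frame; c = 1 and mu = 0.2 as selected empirically in the paper.
depth_frame = (np.random.rand(480, 640) * 5000).astype(np.uint16)
stretched = power_rate_transform(depth_frame)
```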
Foreground reinforcement: The brightest parts of a depth frame correspond to the objects nearest to the Kinect, while the darkest parts correspond to the farthest ones. These grayscale values conveniently indicate our region of interest. In an outdoor environment, everything is bathed in IR light. Therefore, light reflection on the background darkens foreground elements, even for the nearest pixels, while brightening background regions. For that, an additional process
dedicated to outdoor environments is required. This process
aims principally to decrease the luminosity of the background
parts while highlighting foreground regions. This can be
modeled by Equation (2), where R represents the transferred grayscale image resulting from the previous process and m
corresponds to the value of the brightest pixel of R. In fact, σ
represents a depth frame where the foreground is reinforced
and the luminosity of the background is reduced. Given that σ
is a 16-bit depth frame, its values vary in the interval [0, 2^16).
Black pixels are left unchanged while other pixels are updated: the brightness of background regions is considerably reduced, while the foreground regions, which are mainly darker and correspond to small values of R, are significantly highlighted. This process is illustrated in Figure 2.

$$\sigma(i, j) = \begin{cases} |R_{i,j} - m| & \text{if } R_{i,j} \neq 0 \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
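A minimal sketch of this reinforcement step, following Equation (2) as reconstructed above (the input is assumed to be the 16-bit frame produced by the power rate transform):

```python
import numpy as np

def foreground_reinforcement(r):
    """Foreground reinforcement of Equation (2).

    m is the value of the brightest pixel of R. Every non-zero pixel is
    replaced by |R(i, j) - m|, so bright (sunlit background) pixels become
    dark while darker foreground pixels become bright; black (invalid)
    pixels are left unchanged.
    """
    r = r.astype(np.int64)
    m = int(r.max())
    sigma = np.abs(r - m)
    sigma[r == 0] = 0
    return sigma.astype(np.uint16)

# Stand-in transformed depth frame R.
reinforced = foreground_reinforcement(
    (np.random.rand(480, 640) * (2 ** 16 - 1)).astype(np.uint16))
```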
Morphological Operations: Two issues can be properly addressed using morphological operations. On the one hand,
since we consider an outdoor environment, some dot patterns
projected on the scene (to be then reflected back to the
camera) are totally affected by the sunlight. Thus, the obtained
images are noisy and present black regions, as depicted in
Figure 2. On the other hand, the Kinect provides its two modalities from different physical components. Hence, RGB and depth frames require calibration when the two modalities are fused [39]. Morphological operations are then used to remove small
objects and to fill small holes while preserving the shape and
size of the driver in the frame.
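The paper does not specify which morphological operations or structuring element are used; the sketch below assumes an opening followed by a closing with an elliptical kernel, which matches the stated goal of removing small objects and filling small holes.

```python
import cv2
import numpy as np

def clean_attention_map(sigma, kernel_size=7):
    """Morphological cleaning of the reinforced depth frame (assumed variant:
    opening to remove small bright noise, then closing to fill small holes,
    with a hypothetical elliptical structuring element)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    opened = cv2.morphologyEx(sigma, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

# Stand-in reinforced frame.
attention_frame = clean_attention_map(np.zeros((480, 640), dtype=np.uint16))
```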
2) Attention on RGB frame: Depth images represent a rich source of information. They considerably remove a high amount of noisy and cluttered background, which efficiently simplifies the identification of the driver's silhouette from the background context. However, in an outdoor environment, because of the luminosity challenge, some important cues are deleted. These cues may be crucial for the identification of driver behaviors. Thus, the ability of depth frames to locate and extract useful information is decreased. Therefore, a weighted restoration of the RGB frame after applying the depth-based attention becomes important, since it both highlights the information mainly present in the depth frame and restores the missing data. Thereby, Jt can be expressed by Equation (3), where w is the weight of the restored RGB frame.

$$J_t = I_t \times A_t + w \times I_t \quad (3)$$
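Below is a minimal sketch of Equation (3). It assumes the attention frame At has already been registered to the RGB frame and rescales it to [0, 1] before the element-wise weighting; w defaults to 0.5, the value selected later in the experiments.

```python
import numpy as np

def attend_rgb(rgb, attention, w=0.5):
    """Attended frame of Equation (3): J_t = I_t * A_t + w * I_t.

    `attention` is the spatial attention frame A_t, assumed registered to the
    RGB frame; it is rescaled to [0, 1] and broadcast over the color channels.
    The second term restores, with weight w, the regions removed by A_t.
    """
    a = attention.astype(np.float32)
    if a.max() > 0:
        a = a / a.max()
    i = rgb.astype(np.float32)
    j = i * a[..., np.newaxis] + w * i
    return np.clip(j, 0, 255).astype(np.uint8)

# Stand-in frames: 8-bit RGB image and 16-bit attention map.
j_t = attend_rgb(np.zeros((480, 640, 3), dtype=np.uint8),
                 np.zeros((480, 640), dtype=np.uint16))
```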
Figure 4 illustrates the impact of the soft spatial attention
mechanism in a single video frame. The figure on the upper
left is the original RGB frame (It ) and the figure on the lower
left is the visualization map of soft spatial attention (At ).
Combining these two images results in a new frame where
some regions are highlighted and others are removed. Since some key cues for DAR are dropped, the scene parts removed by the attention mechanism are restored with a low weight (Jt).
Fig. 4: Schematic diagram of soft spatial attention (original RGB frame It, spatial attention frame At, and attention on RGB frame Jt).
B. Feature extraction and classification
All the RGB frames obtained after the attention process, Jt, are fed to a trained VGG16 model [40], which requires a fixed-size 224 × 224 RGB image as input. Thus, they are resized to 224 × 224 resolution. The human features Xt are extracted from the space stream network, which gives 4,096 features in total from each frame.
Driver actions can be considered as a sequence of body-part motions over time. For this reason, a time stream network based on LSTM is then required. LSTM is a neural network structure known for its capability of modeling time series data. It can receive different inputs and maintain the memory of events that happen over a long period of time. Each LSTM cell contains a self-connected memory along with three special multiplicative gates that regulate the flow of information into and out of the cell [41]. These gates are the input gate, the output gate, and the forget gate, which respectively control the flow of the input activations into the memory cell, the output flow of the cell activation, and which information from the input and previous output should be remembered or forgotten. The LSTM cell can be described by Equation (4), where yt is the final output; it, ft, ot, ct and ht represent the outputs at time t of, respectively, the input gate, the forget gate, the output gate, the memory cell state and the cell output; W are the input weight matrices connecting the LSTM cell to its inputs; b are bias vectors; and σ and tanh are, respectively, the logistic sigmoid and hyperbolic tangent nonlinearities. The LSTM parameters are initialized with the Glorot initializer [42], which independently samples weights from a uniform distribution; the forget gate bias is initialized with ones and the remaining biases with zeros.

$$\begin{cases} i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \\ f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \\ o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \\ c_t = f_t * c_{t-1} + i_t * \tanh(W_c[h_{t-1}, x_t] + b_c) \\ h_t = o_t * \tanh(c_t) \\ y_t = \mathrm{softmax}(W_y h_t + b_y) \end{cases} \quad (4)$$
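The space and time streams described above map onto a standard CNN+LSTM pipeline. The Keras sketch below is one possible realization and not the authors' exact implementation: the sequence length, LSTM width, and optimizer are assumptions, while the 4,096-dimensional fc2 features of a pretrained VGG16 and the 16-way softmax follow the text (Keras defaults already give Glorot-uniform weights and a forget-gate bias of ones).

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_FRAMES, NUM_CLASSES = 30, 16   # sequence length is an assumption; MDAD has 16 actions

# Space stream: pretrained VGG16, kept frozen, returning the 4,096-d fc2 features per frame.
vgg = VGG16(weights="imagenet", include_top=True)
feature_extractor = models.Model(vgg.input, vgg.get_layer("fc2").output)
feature_extractor.trainable = False

# Time stream: an LSTM over the per-frame features Xt, then a softmax classifier.
sequence_input = layers.Input(shape=(NUM_FRAMES, 224, 224, 3))    # attended frames Jt
frame_features = layers.TimeDistributed(feature_extractor)(sequence_input)
hidden = layers.LSTM(256)(frame_features)                         # hidden size is an assumption
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(hidden)

dsa_classifier = models.Model(sequence_input, outputs)
dsa_classifier.compile(optimizer="adam", loss="categorical_crossentropy",
                       metrics=["accuracy"])
```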
IV. EXPERIMENTS

To highlight the utility of DSA, we perform driver activity recognition experiments on two different views. In this section, we provide the classification results of different kinds of experiments.

A. Dataset

We evaluate DSA on the MDAD dataset [11], which is, to the best of our knowledge, the only public real-world multimodal and multiview driver action dataset in the literature. MDAD consists of two temporally synchronized data modalities (RGB and depth) captured from frontal and side views.
It features numerous drivers who are asked to perform safe driving, referred to as action A1. Moreover, they perform 15 of the most common distracting secondary tasks, such as smoking, GPS setting, and reaching behind. These actions are referred to as A2 to A16 (more details can be found in [11]). The dataset contains more than 444K frames and presents a high number of challenges related to naturalistic driving settings, including complex actions, illumination variation, and a dynamic, cluttered background.
B. Experimental setup
All the experiments are performed on a PC with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, 16 GB of RAM, and an NVIDIA GTX 1660 graphics card with 6 GB of VRAM. For the MDAD dataset, 55% of the data is used for training, 20% for validation, and the remaining 25% for testing.
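One hypothetical way to obtain such a 55/20/25 split over sequence indices with scikit-learn (the number of sequences and the random seed are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(1000)                                     # stand-in for the sequence list
train_idx, rest_idx = train_test_split(indices, train_size=0.55, random_state=0)
val_idx, test_idx = train_test_split(rest_idx, train_size=20 / 45, random_state=0)
# 55% train, 20% validation, 25% test of the original set.
```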
C. Evaluation
1) Choice of weight in the attention model: The input of the hybrid network (Jt) is expressed by Equation (3). This subsection aims to identify the optimal weight w in this equation. Different choices of the weight of the restored RGB frame are empirically evaluated, and the results are summarized in Figure 5 in terms of classification accuracy. We notice that, compared to the LRCN results (without the attention mechanism), the soft attention model (w > 0) improves the accuracy, and the accuracy peak is recorded for a weight of 0.5. Based on this observation, the weight is fixed at 0.5 in our proposed approach.
Fig. 5: Classification accuracy for different weights of the restored RGB frame (side and front views; w = 0, 0.25, 0.5, 0.75, 1).

TABLE I: Classification accuracy for different views with various approaches.

Approach        Side View   Front View
STIP [11]       34.89%      30.2%
LRCN [43]       57.81%      48.96%
Alexnet [44]    38.25%      37.81%
VGG16           44.25%      43.26%
VGG19 [45]      46.92%      43.86%
PHOG-MLP [46]   28.41%      23.18%
GMM-CNN [2]     40.32%      38.61%
MCNN [47]       48.75%      45.18%
DSA             69.79%      65.63%

Fig. 6: Confusion matrices of LRCN from the front and side views.

Fig. 7: Confusion matrices of DSA from the front and side views.

2) Quantitative results: Table I compares the accuracy of DSA with state-of-the-art techniques. The obtained results show that deep learning approaches perform better than traditional machine learning techniques. Moreover, DSA achieves higher performance than the aforementioned methods. In fact, the system is mainly focused
on driver actions, and noisy elements such as cluttered background are partially dropped. Because of the different sensors’
positioning within the car, the front view is more exposed
to sunlight, as illustrated in Figure 8. For this reason, the classification accuracy of the side view is always higher than that of the front view.
3) Qualitative results: Figure 6 and Figure 7 depict the
confusion matrices for the cases of the LRCN and the
DSA. The driver’s different distracting actions are executed
in parallel with driving tasks such as steering wheel turning,
vehicle surrounding surveillance, etc. Moreover, the actions are
executed in a limited in-vehicle space. Therefore, a high interclass similarity is recorded, which creates a high confusion
between actions. Using the LRCN from the front view, action
A2 (”Doing hair and makeup”) and action A14 (”Drinking
using left hand”) are unrecognized for all tested subjects (as
shown in Figure 6a). The first one is misclassified as A1
(”Safe driving”) and the second action is misclassified as A12
(”Fatigue and somnolence”) or A6 (”Writing message using
left hand”). These wrong classifications are principally due
to the similarity of these actions, the background clutter, the
occlusion and the high illumination recorded from the front
viewpoint. From the side view and according to Figure 6b, all
the actions are recognized for at least three subjects.
Comparing the results obtained in Figure 6 and Figure 7, we notice a considerable improvement in classification accuracy, which is explained by the partial removal of the noisy background. In fact, DSA helps to focus principally on the relevant
information of the driving scene while neglecting useless
information. Therefore, some actions are totally recognized
such as action A7 (”Talking phone using right hand”) from
the side view and A13 (”Drinking using right hand”) from the
front view. However, some misclassification is still present due
to challenges related to the realistic recording environment.
D. Discussion
Monitoring driver behaviors in realistic driving settings is
crucial for developing safe ITS. Since drivers’ actions are
executed in a limited in-vehicle space in parallel with driving
tasks, a high interclass similarity is created. This aspect makes
DAR a challenging task.
In this paper, we put forward DSA, a depth-based soft
spatial attention network to recognize in-vehicle actions under
realistic environment settings. In fact, based on the rich
information of the depth frame, the system selectively focuses
on the driver’s silhouette and motion. DSA acheives higher
accuracy compared with widely used state-of-the-art action
recognition techniques. Promising classification results for two
views (side and front) are recorded. However, some confusion
between similar actions is still present. Our proposed approach is partially affected by high illumination variation, since the depth data themselves are affected, especially for the front view, which complicates the understanding of the scene even with the naked eye. Figure 8 depicts the effect of high sunlight on the RGB
and depth frames of two different views in the same instant t.
The most important elements are removed from depth frames,
and even the RGB frame acquired from the frontal view is
affected. Given the multiview data, we also perform a view fusion process. Thus, we employ a basic data fusion technique that concatenates the feature vectors extracted from the different views.
We achieve 75% in terms of classification accuracy, which
motivates further investigating the attention-based multi-view
fusion for DAR.
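A minimal sketch of this concatenation-based fusion, assuming per-frame feature vectors (e.g., the 4,096-dimensional VGG16 features) extracted separately from the side and front views:

```python
import numpy as np

def fuse_views(features_side, features_front):
    """Basic multi-view fusion: concatenate the per-frame feature vectors of
    the two views along the feature axis before the time stream network."""
    assert features_side.shape[0] == features_front.shape[0]   # same number of frames
    return np.concatenate([features_side, features_front], axis=-1)

# e.g. two sequences of 30 frames with 4,096-d features each -> shape (30, 8192)
fused = fuse_views(np.zeros((30, 4096)), np.zeros((30, 4096)))
```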
Fig. 8: Effect of illumination variation on depth and RGB frames from the front and side views at the same instant t.
V. CONCLUSION
In this paper, we propose a novel depth-based soft spatial
attention network for driver action recognition. By combining the depth modality with RGB images, DSA focuses the
attention on the human silhouette to reliably classify the driver
actions. Soft spatial attention improves the capability of the
CNN by selectively highlighting relevant frame regions. Our
experiments on a multimodal and multiview driver action
dataset have demonstrated that DSA improves the classification accuracy by up to 27% compared to state-of-the-art
methods and achieves up to 75% in terms of accuracy when
fusing the two views.
REFERENCES
[1] S. Singh, “Critical reasons for crashes investigated in the national
motor vehicle crash causation survey,” Tech. Rep., 2015. [Online].
Available: http://www-nrd.nhtsa.dot.gov/Pubs/812115.pdf
[2] Y. Xing, C. Lv, H. Wang, D. Cao, E. Velenis, and F. Wang, “Driver
activity recognition for intelligent vehicles: A deep learning approach,”
IEEE Transactions on Vehicular Technology, vol. 68, no. 6, pp. 5379–
5390, 2019.
[3] A. Mimouna, I. Alouani, A. Ben Khalifa, Y. El Hillali, A. Taleb-Ahmed,
A. Menhaj, A. Ouahabi, and N. E. Ben Amara, “Olimp: A heterogeneous
multimodal dataset for advanced environment perception,” Electronics,
vol. 9, no. 4, p. 560, 2020.
[4] I. Jegham, A. B. Khalifa, I. Alouani, and M. A. Mahjoub, “Vision-based
human action recognition: An overview and real world challenges,”
Forensic Science International: Digital Investigation, vol. 32, p. 200901,
2020.
[5] SAE, “levels of driving automation,” 2019, last accessed 24/04/2020.
[Online]. Available: https://www.sae.org/news/2019/01/sae-updatesj3016-automated-driving-graphic
[6] D. Pagliari, L. Pinto, M. Reguzzoni, and L. Rossi, “Integration of
kinect and low-cost gnss for outdoor navigation,” ISPRS - International
Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLI-B5, pp. 565–572, 2016.
[7] A. B. Khalifa, I. Alouani, M. A. Mahjoub, and N. E. B. Amara,
“Pedestrian detection using a moving camera: A novel framework for
foreground detection,” Cognitive Systems Research, vol. 60, pp. 77 – 96,
2020.
[8] I. JEGHAM, A. BEN KHALIFA, I. ALOUANI, and M. A. MAHJOUB,
“Safe driving : Driver action recognition using surf keypoints,” in 2018
30th International Conference on Microelectronics (ICM), 2018, pp. 60–
63.
[9] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, “Rgb-d-based
human motion recognition with deep learning: A survey,” Computer
Vision and Image Understanding, vol. 171, pp. 118 – 139, 2018.
[10] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using
visual attention,” CoRR, vol. abs/1511.04119, 2015. [Online]. Available:
http://arxiv.org/abs/1511.04119
[11] I. Jegham, A. Ben Khalifa, I. Alouani, and M. A. Mahjoub, “Mdad: A
multimodal and multiview in-vehicle driver action dataset,” in Computer
Analysis of Images and Patterns, M. Vento and G. Percannella, Eds.
Cham: Springer International Publishing, 2019, pp. 518–529.
[12] L. Wang, D. Q. Huynh, and P. Koniusz, “A comparative review of
recent kinect-based action recognition algorithms,” IEEE Transactions
on Image Processing, vol. 29, pp. 15–28, 2020.
[13] M. Cornacchia, K. Ozcan, Y. Zheng, and S. Velipasalar, “A survey
on activity detection and classification using wearable sensors,” IEEE
Sensors Journal, vol. 17, no. 2, pp. 386–403, 2017.
[14] Y. Kong and Y. Fu, “Human action recognition and prediction: A
survey,” arXiv preprint arXiv:1806.11230, 2018.
[15] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
for human action recognition,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 35, no. 1, pp. 221–231, Jan 2013.
[16] J. Li, X. Liu, W. Zhang, M. Zhang, J. Song, and N. Sebe, “Spatiotemporal attention networks for action recognition and detection,” IEEE
Transactions on Multimedia, pp. 1–1, 2020.
[17] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
spatiotemporal features with 3d convolutional networks,” in The IEEE
International Conference on Computer Vision (ICCV), December 2015.
[18] K. Simonyan and A. Zisserman, “Two-stream convolutional networks
for action recognition in videos,” in Advances in Neural Information
Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D.
Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014,
pp. 568–576.
[19] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new
model and the kinetics dataset,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017.
[20] J. Wu, Y. Zhang, and W. Lin, “Towards good practices for action video
encoding,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2014.
[21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] L. Sun, K. Jia, K. Chen, D.-Y. Yeung, B. E. Shi, and S. Savarese, “Lattice
long short-term memory for human action recognition,” in The IEEE
International Conference on Computer Vision (ICCV), Oct 2017.
[23] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek, “VideoLSTM
convolves, attends and flows for action recognition,” Computer Vision
and Image Understanding, vol. 166, pp. 41 – 50, 2018.
[24] N. Tufek, M. Yalcin, M. Altintas, F. Kalaoglu, Y. Li, and S. K. Bahadir,
“Human action recognition using deep learning methods on limited
sensory data,” IEEE Sensors Journal, vol. 20, no. 6, pp. 3101–3112,
2020.
[25] K. Wang, X. Chen, and R. Gao, “Dangerous driving behavior detection
with attention mechanism,” in Proceedings of the 3rd International
Conference on Video and Image Processing, 2019, pp. 57–62.
[26] P. Ren, Z. Chen, Z. Ren, F. Wei, J. Ma, and M. de Rijke, “Leveraging
contextual sentence relations for extractive summarization using a neural
attention model,” in Proceedings of the 40th International ACM SIGIR
Conference on Research and Development in Information Retrieval,
2017, pp. 95–104.
[27] Y. Cheng, Agreement-Based Joint Training for Bidirectional Attention-Based Neural Machine Translation. Springer Singapore, 2019, pp.
11–23.
[28] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for
connectionist temporal classification in speech recognition,” in ICASSP
2019 - 2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), May 2019, pp. 7115–7119.
[29] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2014.
[30] W. Wang, X. Lu, P. Zhang, H. Xie, and W. Zeng, “Driver action
recognition based on attention mechanism,” in 2019 6th International
Conference on Systems and Informatics (ICSAI), Nov 2019, pp. 1255–
1259.
[31] R. Girdhar and D. Ramanan, “Attentional pooling for action recognition,” in Advances in Neural Information Processing Systems 30,
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp.
34–45.
[32] W. Du, Y. Wang, and Y. Qiao, “Recurrent spatial-temporal attention
network for action recognition in videos,” IEEE Transactions on Image
Processing, vol. 27, no. 3, pp. 1347–1360, March 2018.
[33] L. Wang, J. Zang, Q. Zhang, Z. Niu, G. Hua, and N. Zheng, “Action
recognition by an attention-aware temporal weighted convolutional neural network,” Sensors, vol. 18, no. 7, p. 1979, Jun 2018.
[34] L. Meng, B. Zhao, B. Chang, G. Huang, W. Sun, F. Tung, and L. Sigal,
“Interpretable spatio-temporal attention for video action recognition,”
in The IEEE International Conference on Computer Vision (ICCV)
Workshops, Oct 2019.
[35] J.-M. Perez-Rua, B. Martinez, X. Zhu, A. Toisoul, V. Escorcia, and
T. Xiang, “Knowing what, where and when to look: Efficient video
action modeling with attention,” arXiv preprint arXiv:2004.01278, 2020.
[36] F. Baradel, C. Wolf, and J. Mille, “Human Activity Recognition with
Pose-driven Attention to RGB,” in BMVC 2018 - 29th British Machine
Vision Conference, Newcastle, United Kingdom, Sep. 2018, pp. 1–14.
[Online]. Available: https://hal.inria.fr/hal-01828083
[37] K. Zhu, R. Wang, Q. Zhao, J. Cheng, and D. Tao, “A cuboid cnn model
with an attention mechanism for skeleton-based action recognition,”
IEEE Transactions on Multimedia, pp. 1–1, 2019.
[38] Q. Xiao, M. Qin, P. Guo, and Y. Zhao, “Multimodal fusion based on
lstm and a couple conditional hidden markov model for chinese sign
language recognition,” IEEE Access, vol. 7, pp. 112 258–112 268, 2019.
[39] B. Karan, “Calibration of kinect-type rgb-d sensors for robotic applications,” Fme Transactions, vol. 43, pp. 47–54, 2015.
[40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[41] A. Graves, Supervised Sequence Labelling. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2012, pp. 5–13.
[42] X. Glorot and Y. Bengio, “Understanding the difficulty of training
deep feedforward neural networks,” in Proceedings of the thirteenth
international conference on artificial intelligence and statistics, 2010,
pp. 249–256.
[43] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional
networks for visual recognition and description,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[44] C. Yan, “Driving posture recognition by convolutional neural networks,”
IET Computer Vision, vol. 10, pp. 103–114, March 2016.
[45] A. Koesdwiady, S. M. Bedawi, C. Ou, and F. Karray, “End-to-end
deep learning for driver distraction recognition,” in Image Analysis and
Recognition, F. Karray, A. Campilho, and F. Cheriet, Eds.
Cham:
Springer International Publishing, 2017, pp. 11–18.
[46] C. H. Zhao, B. L. Zhang, X. Z. Zhang, S. Q. Zhao, and H. X. Li, “Recognition of driving postures by combined features and random subspace
ensemble of multilayer perceptron classifiers,” Neural Computing and
Applications, vol. 22, no. 1, pp. 175–184, 2013.
[47] Y. Hu, M. Lu, and X. Lu, “Driving behaviour recognition from still
images by using multi-stream fusion cnn,” Machine Vision and Applications, vol. 30, no. 5, pp. 851–865, 2019.
Imen JEGHAM is a PhD student in computer science at the Higher Institute of Computer Science
and Communication Techniques of Hammam
Sousse (university of Sousse – Tunisia). She
received the engineering degree in computer
science in 2014 and the European master in
highway and traffic engineering in 2017, from the
national school of engineers of Sousse, Tunisia.
Her research interests include computer vision,
pattern recognition, signal and image processing, and traffic engineering.
Anouar BEN KHALIFA received the engineering degree (2005) from the National Engineering
School of Monastir, Tunisia, an MSc degree (2007) and a PhD degree (2014) in Electrical
Engineering, Signal Processing, System Analysis and Pattern Recognition from the National
Engineering School of Tunis – (Tunisia). He is
now Associate Professor in Electrical and Computer Engineering at the National Engineering
School of Sousse– (Tunisia). He is a Founding
member of the LATIS research labs (Laboratory
of Advanced Technology and Intelligent Systems). He was the head of the Department of Industrial Electronic Engineering at the National Engineering School of Sousse from 2016 to 2019. His research interests are Artificial Intelligence, Pattern Recognition, Image Processing,
Machine Learning, Intelligent Transportation Systems and Information
Fusion.
Ihsen ALOUANI is an Associate Professor at
the IEMN-DOAE lab at the Polytechnic University Hauts-de-France, France. He received his PhD from the Polytechnic University Hauts-de-France, and an MSc and an engineering degree from the National Engineering School of Sousse, Tunisia.
He is the head of ”Cyber-defense and Information Security” Master’s program. His research
focus is on Intelligent Transportation Systems,
Hardware acceleration and security.
Mohamed Ali MAHJOUB is a Professor at the National Engineering School of Sousse (University of Sousse, Tunisia) and a member of the LATIS research laboratory (signals, image and document team). He received an MSc in computer science in 1990, and a PhD and an HDR in electrical
engineering, signal processing and system analysis, from the National School of Engineers of
Tunis, Tunisia, in 1999 and 2013 respectively.
His research interests include dynamic bayesian
network, computer vision, pattern recognition,
HMM, and data retrieval. His main papers have been published in
international journals and conferences.