Soft Spatial Attention-based Multimodal Driver Action Recognition Using Deep Learning

Imen JEGHAM, Anouar BEN KHALIFA, Ihsen ALOUANI, Mohamed Ali MAHJOUB

Abstract— Driver behaviors and decisions are crucial factors for on-road driving safety. With a precise driver behavior monitoring system, traffic accidents and injuries can be significantly reduced. However, understanding human behaviors in real-world driving settings is a challenging task because of uncontrolled conditions including illumination variation, occlusion, and dynamic, cluttered backgrounds. In this paper, a Kinect sensor, which provides multimodal signals, is adopted as a driver monitoring sensor to recognize safe driving and the most distracting common secondary in-vehicle actions. We propose a novel soft spatial attention-based network named the Depth-based Spatial Attention network (DSA), which adds a cognitive process to a deep network by selectively focusing on the driver's silhouette and motion in the cluttered driving scene. In fact, at each time t, we introduce a new weighted RGB frame based on an attention model designed using a depth frame. The final classification accuracy is substantially enhanced compared to the state-of-the-art results, with an achieved improvement of up to 27%.

Index Terms— Driver action recognition, Kinect sensor, Spatial soft attention, Multimodal, Deep learning.

I. INTRODUCTION

DRIVER behaviors and decisions are the principal factors that affect driving safety. More than 90% of reported vehicle crashes in the US have been caused by the driver's inadvertent errors and misbehavior, a proportion similar to other countries worldwide [1]. Traffic accidents could be reduced by 10% to 20% with an efficient driver monitoring system [2], [3]. Therefore, it is important to have a clear perspective on driver behavior. For this reason, recognizing the driver's in-vehicle actions represents one of the most important tasks for Intelligent Transportation Systems (ITS) [4]. In fact, for highly automated vehicles, including level-3 automation (according to the automation definition in SAE standard J3016 [5]), the driver is allowed to perform secondary tasks and is responsible for taking over vehicle control in emergencies. In China and the US, many accidents were recorded when a Tesla driver relied solely on the autopilot system while driving.

Preprint submitted to IEEE SENSORS JOURNAL, May xx, 2020. Imen JEGHAM is with Université de Sousse, Institut Supérieur d'Informatique et des Techniques de Communication de H. Sousse, LATIS - Laboratory of Advanced Technology and Intelligent Systems, 4011, Sousse, Tunisie (e-mail: imen.jegham@isitc.u-sousse.tn). Anouar BEN KHALIFA is with Université de Sousse, Ecole Nationale d'Ingénieurs de Sousse, LATIS - Laboratory of Advanced Technology and Intelligent Systems, 4023, Sousse, Tunisie (e-mail: anouar.benkhalifa@eniso.rnu.tn).
Ihsen ALOUANI is with IEMN-DOAE, Université Polytechnique Hauts-de-France, Valenciennes, France. Mohamed Ali MAHJOUB is with Université de Sousse, Ecole Nationale d'Ingénieurs de Sousse, LATIS - Laboratory of Advanced Technology and Intelligent Systems, 4023, Sousse, Tunisie.

Different physiological and vision sensors have been widely used for driver status monitoring. Physiological sensors have generally been restricted to estimating specific driver behaviors. For example, EOG and EEG are mainly used to estimate driver fatigue and somnolence. Moreover, these sensors are expensive and require specific information in advance, such as gaze direction. On the other hand, vision sensors are the most widely used sensors for Driver Action Recognition (DAR). Unlike other sensors, they allow a holistic understanding of the driver's actions and situation. Various types of vision sensors can be employed, including omnidirectional, thermographic, and Kinect cameras. Kinect has been successfully used for low-cost navigation. It consists of an RGB-Depth camera that captures RGB images along with 16-bit depth images that indicate the objects' distance rather than their brightness. Kinect was initially designed for indoor motion sensing and was later extended to outdoor environments [6].

Efficient real-world DAR systems should be designed under uncontrolled environmental conditions. Therefore, many challenges, including lighting conditions, viewpoint variation, and cluttered and dynamic backgrounds, need to be addressed [4], [7], [8]. In the past few years, deep learning has shown remarkable efficiency in solving a plethora of complex real-life problems. Deep learning-based Human Action Recognition (HAR) techniques have surpassed most traditional hand-crafted feature systems, which require considerable preprocessing and domain knowledge [9]. However, these systems still fail to take into account the spatial and temporal context of the scene. Moreover, HAR systems are not automatically useful under DAR constraints. In fact, they are mostly based on a holistic analysis of the human body posture and behaviour. The limited in-vehicle space in which the actions are executed and the parallel execution of in-vehicle actions with driving tasks drastically challenge HAR techniques. Furthermore, other challenges related to naturalistic driving settings are added, including the dynamic and highly cluttered background and the high illumination variation. This motivates the development of comprehensive DAR techniques under real-life conditions.

Inspired by the human vision process, visual attention models extract relevant information by selectively concentrating on parts of the visual space where and when it is needed. Attention models can be clustered into two main categories [10]: hard and soft attention models.
Hard attention models make hard decisions when picking parts of the input data. Training these models through back-propagation is difficult and computationally expensive, as it relies on sampling. Soft attention models, on the other hand, consider the entire input but dynamically weight each part of the scene. This mechanism can be applied in three different variants: spatial, temporal and spatio-temporal. Such soft context interpretability is necessary and helpful for driver monitoring applications. Since accidents can happen in the blink of an eye, continuous processing of the temporal dimension is important. Therefore, in this study, we focus on a visual soft spatial attention mechanism.

In this paper, we unprecedentedly exploit the depth modality, which highlights the driver's silhouette and ignores the cluttered background, for soft spatial attention. We use RGB-D data to focus on the relevant parts of cluttered naturalistic driving scenes to recognize driver in-vehicle actions. Thus, we propose a novel Depth-based soft Spatial Attention network (DSA) that efficiently focuses on the driver's silhouette to monitor driver behaviors for ITS. The main contributions of this paper can be summarized as follows:
• We propose DSA, a novel spatial attention-based network that focuses on the driver's posture. DSA is designed to be efficient in cluttered and dynamic scenes and to recognize the driver's in-vehicle actions under uncontrolled driving settings.
• We unprecedentedly exploit a depth sensor modality for a visual attention mechanism in DAR.
• We evaluate our network on the first public multimodal driver action dataset, MDAD [11], and report promising results for different views and under various environments.

The remainder of the paper is organised as follows: In Section 2, the background on deep learning techniques for action recognition as well as the related work are detailed. DSA is described in Section 3. Section 4 exposes the experimental results. We finally conclude in Section 5.

II. BACKGROUND AND RELATED WORK

Due to their proven performance, deep learning techniques have been applied to action recognition applications. However, these systems still fail to exclusively focus on the relevant information of the scene. A proper attention model can guide the system to efficiently extract pertinent information. In this section, we categorize deep learning-based action recognition techniques and we present attention-based related work.

A. Deep learning techniques for action recognition

Many papers have surveyed vision-based HAR research using deep learning [12]–[14]. They all admit that the two major aspects in developing deep networks for action recognition are the convolution process and temporal modeling. However, dealing with the temporal dimension is a challenging issue. Thus, many solutions have been proposed; they can be clustered into three major categories:

3DCNN or space-time networks: These techniques extend convolutional operations to the temporal domain. 3DCNNs [15], [16] extract features from both temporal and spatial dimensions by capturing spatiotemporal information encoded in neighboring frames. This kind of network is more appropriate for measuring spatiotemporal dynamics over a short period of time. C3D [17] has been put forward to overcome this shortcoming. However, this technique results in expensive computation costs and a large number of parameters that make the training task very complex.

Multistream CNN: These networks use different convolutional streams to model both spatial and motion information in action frames. In fact, their architecture mainly contains two streams: a spatial CNN that learns actions from static frames and a temporal CNN that uses the optical flow to recognize actions [18], [19].
Other streams are also used, including the audio signal stream [20]. However, this may not be appropriate for collecting information over a long period. Thus, improvements have been suggested using temporal pooling, which can pool spatial features computed at every frame across time. Nevertheless, this category suffers from a common drawback, namely the absence of interactions between streams, which are very important for training spatiotemporal features.

Hybrid network: To aggregate temporal information, this kind of network combines a CNN with temporal sequence modeling techniques. In fact, recurrent neural networks, and particularly Long Short-Term Memory (LSTM) [21], have the advantage of modeling long-term sequences. A lot of approaches have been proposed and have shown good performances [22]–[24]. This category takes advantage of both CNN and LSTM, requires lower complexity and achieves considerable time savings compared to other techniques [14], [25]. For these reasons, we opt for hybrid networks in our approach. Nevertheless, on their own, hybrid networks are unable to selectively focus on relevant scene information. Hence, extending these techniques with visual attention has a promising chance to enhance the overall system's performance.

B. Visual attention

Over the last few years, attention models have been widely used in various fields, including sentence summarization [26], machine translation [27] and speech recognition [28]. In the literature, a single modality, and particularly the RGB modality, has been widely employed. Karpathy et al. [29] highlighted a multi-resolution architecture for HAR that deals with the input at two spatial resolutions: a low-resolution context stream and a high-resolution fovea stream that focuses the attention onto the center of the frame. Sharma et al. [10] proposed a visual soft attention-based model for HAR in videos, which integrated convolutional features extracted from different parts of the space-time volume by selectively focusing on specific parts of the video frames. Wang et al. [30] designed an attention module consisting of two parallel parts, channel-level and space-level attention, to be inserted into a CNN. Girdhar et al. [31] put forward an attention mechanism that derives top-down and bottom-up attention as low-rank approximations of bilinear pooling methods.

Fig. 1: Proposed architecture of DSA.
Wang et al. [25] introduced an unsupervised attention mechanism that focuses on specific visual regions of video frames and improved the recognition accuracy by integrating an attention-weighted module. This algorithm was limited to recognizing specific short-term driver activities from a particular view. The work presented above focused mainly on the spatial locations within each frame without considering temporal relations between frames. Later work incorporated attention into the motion stream [23], [32]. However, the latter only used optical flow frames computed from consecutive frames and hardly considered the long-term temporal relations in a video sequence. Thus, Wang et al. [33] suggested an attention-aware temporal weighted CNN that embedded visual attention into a temporal multi-stream CNN. Meng et al. [34] introduced two separate temporal and spatial attention mechanisms. For spatial attention, they identified the most relevant spatial location in each frame; for temporal attention, they identified the most relevant frames of an input video. Such work was insufficient to focus on the most relevant objects and movements [35].

Using multimodal data helps to orient the system's attention. Therefore, Baradel et al. [36] used multimodal video data including RGB frames and articulated pose. They proposed a two-stream approach: a raw RGB stream processed by a spatio-temporal soft-attention mechanism conditioned on features from the pose network, and a pose stream refined with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. Zhu et al. [37] proposed a cuboid model obtained by organizing the pairwise displacements between all body joints for skeleton-based action recognition. This representation allowed deep models to focus mainly on actions. However, the skeleton modality posed strong restrictions on subjects and failed to consider the positions of ordinal body joints when the actions were performed on a limited surface. To overcome these shortcomings, we introduce a novel attention model based on RGB and depth modalities that efficiently separates the driver pose from the background context to recognize safe driving and many secondary actions from different views.

III. DSA: DEPTH-BASED SPATIAL ATTENTION NETWORK

Visual attention is one of the key perception mechanisms for efficiently picking the data of most potential interest. It is the cognitive and behavioral process of selectively focusing on a relevant part of the data while neglecting other, useless information. This brings a new interpretability perspective by weighting regions according to the relevance of the information they hold. Intuitively, including visual soft attention enhances the overall accuracy, since the system focuses exclusively on important cues of the scene. In this paper, to recognize driver in-vehicle actions under realistic driving settings, we introduce DSA, a hybrid deep network that exploits the depth modality to extract the informative parts of the RGB data and reliably classify driver actions. In fact, the gray levels of depth images indicate the distance of objects rather than their brightness. Therefore, based on the depth modality, we highlight the driver's actions and ignore the dynamic background.

As shown in Figure 1, a single human action is described by a sequence of two modalities: a set of RGB input images I = {I_t} and a set of depth input images D = {D_t}, where t varies from 1 to n and n represents the number of frames per sequence. The aim is to predict the action class Y. Therefore, at each time step t, we extract a soft spatial attention image A_t from the depth image and combine it with the RGB image to obtain J_t, the hybrid network's input.
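To make this data flow concrete, the following is a minimal Python/PyTorch sketch of the DSA pipeline under stated assumptions: the helper spatial_attention stands in for the three-step attention module detailed in the next subsection, the restoration weight w follows Equation (3), and cnn, lstm and classifier are placeholders for the space stream, time stream and classification layer of Figure 1.

```python
import torch

def spatial_attention(depth_frame):
    # Placeholder for the three-step attention module of Section III-A:
    # here the depth frame is simply normalized to [0, 1].
    d = depth_frame.float()
    return d / d.max().clamp(min=1e-6)

def dsa_forward(rgb_seq, depth_seq, cnn, lstm, classifier, w=0.5):
    """Sketch of the DSA pipeline for one video: depth-driven soft spatial
    attention, per-frame CNN features, and temporal modeling with an LSTM.
    rgb_seq / depth_seq: lists of n aligned RGB (H x W x 3) and depth (H x W) tensors."""
    features = []
    for I_t, D_t in zip(rgb_seq, depth_seq):
        A_t = spatial_attention(D_t)                 # soft spatial attention map
        J_t = I_t * A_t.unsqueeze(-1) + w * I_t      # Eq. (3): weighted restoration of the RGB frame
        features.append(cnn(J_t))                    # space stream: 4,096-D feature vector X_t
    X = torch.stack(features)                        # (n, 4096) feature sequence
    H, _ = lstm(X.unsqueeze(1))                      # time stream over the n time steps
    return classifier(H[-1, 0])                      # class scores for the 16 actions
```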
A. Spatial attention module

In this subsection, we develop the mechanism by which we obtain the soft spatial attention map (A_t), as well as the restored attention on the RGB frame (J_t).

1) Soft spatial attention (A_t): To compute the spatial attention map A_t, a three-step framework is proposed. A detailed illustration of these steps is given in Figure 2.

Fig. 2: Layout of the proposed soft attention mechanism: for both the side and front views, the depth image passes through normalization, power rate transformation (c=1, µ=0.2), foreground reinforcement and morphological operations to produce the spatial attention frame.

Power rate transform: In this work, to highlight driver information, the power rate transform is first employed to stretch the image grayscale histogram. This process can be modeled with Equation (1), where c and µ are regulation parameters, D is the original depth image holding the initial grayscale values, and R represents the transferred grayscale image, a 16-bit depth image whose values lie in the range [0, 2^16).

R = c × D^µ    (1)

According to Equation (1), when µ > 1, image details are highlighted and image stretching is focused principally on the important grayscale range; when µ < 1, the low grayscale range is highlighted. According to the histogram of depth frames, all values of the important cues are concentrated in a low range, and grayscale stretching is obtained for the parameters c=1 and µ < 1 [38]. Figure 3 shows power rate transforms of a depth frame for several values of µ. Empirically, the values c=1 and µ=0.2 are selected for the power rate transform of depth frames since they generate the best grayscale stretching results. As seen in Figure 2, the driver silhouette is clearer after the power rate transform. Nevertheless, background noise is still visible; this is addressed by foreground reinforcement.

Fig. 3: Comparison of power rate transforms of a depth image for multiple values of µ (0.2, 0.4, 0.6, 0.8 and 0.9, with c=1).
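As an illustration, the power rate transform of Equation (1) can be applied to a 16-bit depth frame with a few lines of NumPy. This is a minimal sketch: the normalization to [0, 1] before exponentiation and the rescaling back to the 16-bit range are implementation choices made to keep the output in the original value range, and the file name in the usage comment is hypothetical.

```python
import numpy as np

def power_rate_transform(depth, c=1.0, mu=0.2):
    """Stretch the grayscale histogram of a 16-bit depth frame (Eq. (1)).

    With mu < 1, the low grayscale range, where most of the driver's
    silhouette is concentrated, is expanded toward the full range.
    """
    d = depth.astype(np.float64)
    d_norm = d / (2.0 ** 16 - 1)                  # normalize to [0, 1]
    r = c * np.power(d_norm, mu)                  # R = c * D^mu
    return np.clip(r * (2.0 ** 16 - 1), 0, 2 ** 16 - 1).astype(np.uint16)

# Example usage (hypothetical file name, requires OpenCV to load 16-bit PNGs):
# depth = cv2.imread("depth_frame.png", cv2.IMREAD_UNCHANGED)
# stretched = power_rate_transform(depth, c=1.0, mu=0.2)
```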
Foreground reinforcement: The brightest parts of a depth frame represent the objects nearest to the Kinect, while darker parts indicate the farthest ones. These grayscale values are convenient for indicating our region of interest. In an outdoor environment, however, everything is bathed in IR light. Light reflection therefore darkens foreground elements, even for the nearest pixels, while brightening background regions. For this reason, an additional process dedicated to outdoor environments is required. This process aims principally to decrease the luminosity of the background parts while highlighting foreground regions. It can be modeled by Equation (2), where R represents the transferred grayscale image resulting from the previous step and m corresponds to the value of the brightest pixel of R. Then, σ represents a depth frame where the foreground is reinforced and the luminosity of the background is reduced. Given that σ is a 16-bit depth frame, its values vary in the interval [0, 2^16). Black pixels are not changed, while the other pixels are updated: the brightness of background regions is considerably reduced, while the foreground regions, which are mainly darker and correspond to small values of R, are significantly highlighted. This process is illustrated in Figure 2.

σ(i, j) = |R(i, j) − m|   if R(i, j) ≠ 0,
σ(i, j) = 0               otherwise.    (2)

Morphological operations: Two issues can be properly addressed using morphological operations. On the one hand, since we consider an outdoor environment, some of the dot patterns projected onto the scene (to be reflected back to the camera) are strongly affected by sunlight. Thus, the obtained images are noisy and present black regions, as depicted in Figure 2. On the other hand, Kinect provides two different modalities from distinct components. Hence, the RGB and depth frames require calibration when the two modalities are fused [39]. Morphological operations are then used to remove small objects and to fill small holes while preserving the shape and size of the driver in the frame.

2) Attention on the RGB frame: Depth images represent a rich source of information. They remove a considerable amount of the noisy and cluttered background, which greatly simplifies separating the driver's silhouette from the background context. However, in an outdoor environment, because of the luminosity challenge, some important cues are deleted. These cues may be crucial for the identification of driver behaviors. Thus, the ability of depth frames to locate and extract useful information is decreased. Therefore, a weighted restoration of the RGB frame after applying depth-based attention becomes important, since it additionally highlights information mainly present in the depth frame and restores missed data. Thereby, J_t can be expressed by Equation (3), where w is the weight of the restored RGB frame.

J_t = I_t × A_t + w × I_t    (3)
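The three attention steps and the weighted restoration of Equation (3) can be chained as in the hedged sketch below. It reuses the power_rate_transform helper sketched above and standard OpenCV morphology; the elliptical kernel size and the normalization of A_t to [0, 1] are illustrative assumptions rather than the paper's exact settings, while w = 0.5 is the value retained in Section IV.

```python
import cv2
import numpy as np

def foreground_reinforcement(r):
    """Eq. (2): highlight near (dark) regions and dim the bright background.
    r is the 16-bit image produced by the power rate transform."""
    m = int(r.max())                        # brightest pixel value of R
    sigma = np.abs(r.astype(np.int64) - m)  # |R(i, j) - m|
    sigma[r == 0] = 0                       # black (invalid) pixels stay unchanged
    return sigma.astype(np.uint16)

def morphological_cleanup(sigma, kernel_size=7):
    """Remove small objects and fill small holes (opening, then closing)."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    opened = cv2.morphologyEx(sigma, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)

def soft_spatial_attention(depth):
    """Build the attention map A_t in [0, 1] from a raw 16-bit depth frame."""
    r = power_rate_transform(depth, c=1.0, mu=0.2)   # step 1 (sketched above)
    sigma = foreground_reinforcement(r)              # step 2
    cleaned = morphological_cleanup(sigma)           # step 3
    return cleaned.astype(np.float64) / max(int(cleaned.max()), 1)

def attended_rgb(rgb, depth, w=0.5):
    """Eq. (3): J_t = I_t * A_t + w * I_t, with A_t broadcast over the 3 channels."""
    a = soft_spatial_attention(depth)[..., np.newaxis]
    j = rgb.astype(np.float64) * a + w * rgb.astype(np.float64)
    return np.clip(j, 0, 255).astype(np.uint8)       # clip for an 8-bit display range
```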
Figure 4 illustrates the impact of the soft spatial attention mechanism on a single video frame. The image on the upper left is the original RGB frame (I_t) and the image on the lower left is the visualization map of the soft spatial attention (A_t). Combining these two images results in a new frame where some regions are highlighted and others are removed. Since some key cues for DAR would otherwise be dropped, the scene parts removed by the attention mechanism are restored with a low weight (J_t).

Fig. 4: Schematic diagram of the soft spatial attention (original RGB frame I_t, spatial attention frame A_t, and attention on the RGB frame J_t).

B. Feature extraction and classification

All the RGB frames obtained after the attention process (J_t) are fed to a trained VGG16 model [40], which requires a fixed-size 224 × 224 RGB image as input. Thus, they are resized to a 224 × 224 resolution. The human features X_t are extracted from this space stream network, which gives 4,096 features per frame.

Driver actions can be considered as a sequence of body-part motions over time. For this reason, a time stream network based on an LSTM is then required. The LSTM is a neural network structure known for its capability of modeling time series. It can receive different inputs and maintain a memory of events that happen over a long period of time. Each LSTM cell contains a self-connected memory along with three special multiplicative gates that regulate the flow of information into and out of the cell [41]. These gates are the input gate, the output gate and the forget gate, which respectively control the flow of the input activations into the memory cell, the output flow of the cell activation, and which information from the input and the previous output should be remembered or forgotten. The LSTM cell can be described by Equation (4), where y_t is the final output; i_t, f_t, o_t, c_t and h_t represent the outputs at time t of, respectively, the input gate, the forget gate, the output gate, the memory cell state and the cell output; W are the input weight matrices connecting the LSTM cell to its inputs; b are bias vectors; and σ and tanh are, respectively, the logistic sigmoid and hyperbolic tangent nonlinearities.

i_t = σ(W_i [h_{t−1}, x_t] + b_i)
f_t = σ(W_f [h_{t−1}, x_t] + b_f)
o_t = σ(W_o [h_{t−1}, x_t] + b_o)
c_t = f_t ∗ c_{t−1} + i_t ∗ tanh(W_c [h_{t−1}, x_t] + b_c)
h_t = o_t ∗ tanh(c_t)
y_t = softmax(W_y h_t + b_y)    (4)

The LSTM parameters are initialized with the Glorot initializer [42] for the weights, which samples independently from a uniform distribution, with ones for the forget gate bias, and with zeros for the remaining biases.
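A minimal PyTorch sketch of this hybrid space/time network is given below for concreteness; the LSTM hidden size, the use of torchvision's ImageNet-pretrained VGG16 truncated before its last fully connected layer, and the way the Glorot/forget-gate initialization is mapped onto PyTorch's two LSTM bias vectors are assumptions, not the paper's exact training configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class DSAClassifier(nn.Module):
    """Space stream (VGG16 features, 4,096-D per frame) + time stream (LSTM)."""

    def __init__(self, num_classes=16, hidden_size=512):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Keep the convolutional trunk and the first fully connected layers,
        # so each 224x224 attended frame J_t is mapped to a 4,096-D vector X_t.
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:-1])  # drop the 1000-way layer
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)
        # Glorot initialization for the LSTM weights; forget-gate slice of each
        # bias set to one, the remaining bias entries set to zero.
        for name, p in self.lstm.named_parameters():
            if "bias" in name:
                nn.init.zeros_(p)
                p.data[hidden_size:2 * hidden_size] = 1.0
            else:
                nn.init.xavier_uniform_(p)

    def forward(self, frames):               # frames: (batch, n, 3, 224, 224) attended frames J_t
        b, n = frames.shape[:2]
        x = frames.reshape(b * n, 3, 224, 224)
        x = self.avgpool(self.features(x)).flatten(1)
        x = self.fc(x).reshape(b, n, 4096)   # per-frame features X_t
        out, _ = self.lstm(x)                # time stream over the sequence
        return self.head(out[:, -1])         # class scores (softmax applied in the loss)
```

Whether the VGG16 trunk is fine-tuned or kept frozen is a training choice not specified above; the sketch leaves all parameters trainable.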
IV. EXPERIMENTS

To highlight the utility of DSA, we perform driver activity recognition experiments on two different views. In this section, we provide the classification results of different kinds of experiments.

A. Dataset

We evaluate DSA on the MDAD dataset [11], which, to the best of our knowledge, is the only public real-world multimodal and multiview driver action dataset introduced in the literature. MDAD consists of two temporally synchronized data modalities (RGB and depth) captured from frontal and side views. It features numerous drivers who are asked to perform safe driving, referred to as action A1. Moreover, they execute 15 common, highly distracting secondary tasks such as smoking, GPS setting, reaching behind, etc. These actions are referred to as A2 to A16 (more details can be found in [11]). The dataset comprises more than 444K frames with a large number of challenges related to naturalistic driving settings, including complex actions, illumination variation, and dynamic and cluttered backgrounds.

B. Experimental setup

All the experiments are performed on a PC with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, 16 GB of RAM, and an NVIDIA GTX 1660 graphics card with 6 GB of VRAM. For the MDAD dataset, 55% of the data are used for training, 20% for validation and the remaining 25% for testing.

C. Evaluation

1) Choice of the weight in the attention model: The input of the hybrid network (J_t) is expressed by Equation (3). This subsection aims at identifying the optimal weight w in this equation. Different choices of the weight of the restored RGB frame are empirically evaluated, and the results are summarized in Figure 5 in terms of classification accuracy. We notice that, compared to the LRCN results (without the attention mechanism), the soft attention model (w > 0) improves the accuracy, and the accuracy peak is recorded for a weight of 0.5. Based on this observation, the weight is fixed at 0.5 for our proposed approach.

Fig. 5: Classification accuracy for different weights of the restored RGB frame (w = 0, 0.25, 0.5, 0.75, 1), for the side and front views.
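The choice of w amounts to a small grid search over Equation (3); the sketch below shows how such a sweep could be scripted, where evaluate_dsa is a hypothetical helper that trains and tests the network for a given restoration weight and returns the classification accuracy.

```python
# Hypothetical sweep over the restoration weight w of Eq. (3).
# `evaluate_dsa` is an assumed helper returning the test accuracy for a given w.
def select_restoration_weight(evaluate_dsa, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {w: evaluate_dsa(w) for w in candidates}
    best_w = max(results, key=results.get)
    for w, acc in sorted(results.items()):
        print(f"w = {w:.2f}: accuracy = {acc:.2%}")
    print(f"selected weight: {best_w}")   # the experiments above report a peak at w = 0.5
    return best_w
```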
2) Quantitative results: Table I compares the accuracy of DSA with that of state-of-the-art techniques. The obtained results show that the deep learning approaches perform better than the traditional machine learning techniques. Moreover, DSA achieves higher performance than all the aforementioned methods. In fact, the system focuses mainly on the driver's actions, and noisy elements such as the cluttered background are partially dropped. Because of the different sensors' positioning within the car, the front view is more exposed to sunlight, as illustrated in Figure 8. For this reason, the classification accuracy of the side view is always higher than that of the front view.

Fig. 6: Confusion matrices of LRCN from the front and side views.

Fig. 7: Confusion matrices of DSA from the front and side views.

3) Qualitative results: Figure 6 and Figure 7 depict the confusion matrices for the LRCN and the DSA, respectively.
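Confusion matrices such as those in Figures 6 and 7 can be computed directly from the test predictions; the short sketch below uses scikit-learn and assumes the ground-truth and predicted labels are available as integer indices mapping A1–A16 to 0–15.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_confusion(y_true, y_pred, num_classes=16):
    """Row-normalized confusion matrix: entry (i, j) is the percentage of
    class-i test sequences predicted as class j (as plotted in Figs. 6-7)."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(num_classes)))
    cm = cm.astype(np.float64)
    row_sums = cm.sum(axis=1, keepdims=True)
    return 100.0 * cm / np.maximum(row_sums, 1.0)
```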
The driver's different distracting actions are executed in parallel with driving tasks such as steering wheel turning, vehicle surrounding surveillance, etc. Moreover, the actions are executed in a limited in-vehicle space. Therefore, a high interclass similarity is recorded, which creates high confusion between actions. Using the LRCN from the front view, action A2 ("Doing hair and makeup") and action A14 ("Drinking using left hand") are unrecognized for all tested subjects (as shown in Figure 6a). The first is misclassified as A1 ("Safe driving") and the second as A12 ("Fatigue and somnolence") or A6 ("Writing message using left hand"). These wrong classifications are principally due to the similarity of these actions, the background clutter, the occlusion and the high illumination recorded from the front viewpoint. From the side view, and according to Figure 6b, all the actions are recognized for at least three subjects. Comparing the results in Figure 6 and Figure 7, we notice a considerable improvement in classification accuracy, which is explained by the partial removal of the noisy background. In fact, DSA helps focus principally on the relevant information of the driving scene while neglecting useless information. Therefore, some actions are fully recognized, such as action A7 ("Talking phone using right hand") from the side view and A13 ("Drinking using right hand") from the front view. However, some misclassification is still present due to challenges related to the realistic recording environment.

TABLE I: Classification accuracy for different views with various approaches.

Approach        Side View   Front View
STIP [11]       34.89%      30.2%
AlexNet [44]    38.25%      37.81%
VGG16           44.25%      43.26%
VGG19 [45]      46.92%      43.86%
PHOG-MLP [46]   28.41%      23.18%
GMM-CNN [2]     40.32%      38.61%
MCNN [47]       48.75%      45.18%
LRCN [43]       57.81%      48.96%
DSA             69.79%      65.63%

D. Discussion

Monitoring driver behaviors in realistic driving settings is crucial for developing safe ITS. Since drivers' actions are executed in a limited in-vehicle space, in parallel with driving tasks, a high interclass similarity is created. This aspect makes DAR a challenging task. In this paper, we put forward DSA, a depth-based soft spatial attention network to recognize in-vehicle actions under realistic environment settings. In fact, based on the rich information of the depth frame, the system selectively focuses on the driver's silhouette and motion. DSA achieves higher accuracy than widely used state-of-the-art action recognition techniques. Promising classification results for both views (side and front) are recorded. However, some confusion between similar actions is still present. Our proposed approach is partially affected by the high illumination variation, since the depth data are affected, especially for the front view. This complicates the understanding of the scene even with the naked eye. Figure 8 depicts the effect of strong sunlight on the RGB and depth frames of the two views at the same instant t.
The most important elements are removed from the depth frames, and even the RGB frame acquired from the frontal view is affected.

Fig. 8: Effect of illumination variation on the depth and RGB frames from different views at the same instant t.

Given the multiview data, we also perform a view fusion process. We employ a basic data fusion technique that concatenates the feature vectors extracted from the different views. We achieve 75% classification accuracy, which motivates further investigation of attention-based multi-view fusion for DAR.

V. CONCLUSION

In this paper, we propose a novel depth-based soft spatial attention network for driver action recognition. By combining the depth modality with RGB images, DSA focuses the attention on the human silhouette to reliably classify driver actions. Soft spatial attention improves the capability of the CNN by selectively highlighting relevant frame regions. Our experiments on a multimodal and multiview driver action dataset have demonstrated that DSA improves the classification accuracy by up to 27% compared to state-of-the-art methods and achieves up to 75% accuracy when fusing the two views.

REFERENCES

[1] S. Singh, "Critical reasons for crashes investigated in the national motor vehicle crash causation survey," Tech. Rep., 2015. [Online]. Available: http://www-nrd.nhtsa.dot.gov/Pubs/812115.pdf
[2] Y. Xing, C. Lv, H. Wang, D. Cao, E. Velenis, and F. Wang, "Driver activity recognition for intelligent vehicles: A deep learning approach," IEEE Transactions on Vehicular Technology, vol. 68, no. 6, pp. 5379–5390, 2019.
[3] A. Mimouna, I. Alouani, A. Ben Khalifa, Y. El Hillali, A. Taleb-Ahmed, A. Menhaj, A. Ouahabi, and N. E. Ben Amara, "Olimp: A heterogeneous multimodal dataset for advanced environment perception," Electronics, vol. 9, no. 4, p. 560, 2020.
[4] I. Jegham, A. B. Khalifa, I. Alouani, and M. A. Mahjoub, "Vision-based human action recognition: An overview and real world challenges," Forensic Science International: Digital Investigation, vol. 32, p. 200901, 2020.
[5] SAE, "Levels of driving automation," 2019, last accessed 24/04/2020. [Online]. Available: https://www.sae.org/news/2019/01/sae-updatesj3016-automated-driving-graphic
[6] D. Pagliari, L. Pinto, M. Reguzzoni, and L. Rossi, "Integration of kinect and low-cost gnss for outdoor navigation," ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XLI-B5, pp. 565–572, 2016.
[7] A. B. Khalifa, I. Alouani, M. A. Mahjoub, and N. E. B. Amara, "Pedestrian detection using a moving camera: A novel framework for foreground detection," Cognitive Systems Research, vol. 60, pp. 77–96, 2020.
[8] I. Jegham, A. Ben Khalifa, I. Alouani, and M. A. Mahjoub, "Safe driving: Driver action recognition using surf keypoints," in 2018 30th International Conference on Microelectronics (ICM), 2018, pp. 60–63.
[9] P. Wang, W. Li, P. Ogunbona, J. Wan, and S. Escalera, "Rgb-d-based human motion recognition with deep learning: A survey," Computer Vision and Image Understanding, vol. 171, pp. 118–139, 2018.
[10] S. Sharma, R. Kiros, and R. Salakhutdinov, "Action recognition using visual attention," CoRR, vol. abs/1511.04119, 2015. [Online]. Available: http://arxiv.org/abs/1511.04119
[11] I. Jegham, A. Ben Khalifa, I. Alouani, and M. A. Mahjoub, "Mdad: A multimodal and multiview in-vehicle driver action dataset," in Computer Analysis of Images and Patterns, M. Vento and G. Percannella, Eds. Cham: Springer International Publishing, 2019, pp. 518–529.
[12] L. Wang, D. Q. Huynh, and P. Koniusz, "A comparative review of recent kinect-based action recognition algorithms," IEEE Transactions on Image Processing, vol. 29, pp. 15–28, 2020.
[13] M. Cornacchia, K. Ozcan, Y. Zheng, and S. Velipasalar, "A survey on activity detection and classification using wearable sensors," IEEE Sensors Journal, vol. 17, no. 2, pp. 386–403, 2017.
[14] Y. Kong and Y. Fu, "Human action recognition and prediction: A survey," arXiv preprint arXiv:1806.11230, 2018.
[15] S. Ji, W. Xu, M. Yang, and K. Yu, "3d convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, Jan 2013.
[16] J. Li, X. Liu, W. Zhang, M. Zhang, J. Song, and N. Sebe, "Spatiotemporal attention networks for action recognition and detection," IEEE Transactions on Multimedia, pp. 1–1, 2020.
[17] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[18] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 568–576.
[19] J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[20] J. Wu, Y. Zhang, and W. Lin, "Towards good practices for action video encoding," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] L. Sun, K. Jia, K. Chen, D.-Y. Yeung, B. E. Shi, and S. Savarese, "Lattice long short-term memory for human action recognition," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[23] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek, "Videolstm convolves, attends and flows for action recognition," Computer Vision and Image Understanding, vol. 166, pp. 41–50, 2018.
[24] N. Tufek, M. Yalcin, M. Altintas, F. Kalaoglu, Y. Li, and S. K. Bahadir, "Human action recognition using deep learning methods on limited sensory data," IEEE Sensors Journal, vol. 20, no. 6, pp. 3101–3112, 2020.
[25] K. Wang, X. Chen, and R. Gao, "Dangerous driving behavior detection with attention mechanism," in Proceedings of the 3rd International Conference on Video and Image Processing, 2019, pp. 57–62.
[26] P. Ren, Z. Chen, Z. Ren, F. Wei, J. Ma, and M. de Rijke, “Leveraging contextual sentence relations for extractive summarization using a neural attention model,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 95–104. [27] Y. Cheng, Agreement-Based Joint Training for Bidirectional AttentionBased Neural Machine Translation. Springer Singapore, 2019, pp. 11–23. [28] J. Salazar, K. Kirchhoff, and Z. Huang, “Self-attention networks for connectionist temporal classification in speech recognition,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 7115–7119. [29] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. [30] W. Wang, X. Lu, P. Zhang, H. Xie, and W. Zeng, “Driver action recognition based on attention mechanism,” in 2019 6th International Conference on Systems and Informatics (ICSAI), Nov 2019, pp. 1255– 1259. [31] R. Girdhar and D. Ramanan, “Attentional pooling for action recognition,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 34–45. [32] W. Du, Y. Wang, and Y. Qiao, “Recurrent spatial-temporal attention network for action recognition in videos,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1347–1360, March 2018. [33] L. Wang, J. Zang, Q. Zhang, Z. Niu, G. Hua, and N. Zheng, “Action recognition by an attention-aware temporal weighted convolutional neural network,” Sensors, vol. 18, no. 7, p. 1979, Jun 2018. [34] L. Meng, B. Zhao, B. Chang, G. Huang, W. Sun, F. Tung, and L. Sigal, “Interpretable spatio-temporal attention for video action recognition,” in The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019. [35] J.-M. Perez-Rua, B. Martinez, X. Zhu, A. Toisoul, V. Escorcia, and T. Xiang, “Knowing what, where and when to look: Efficient video action modeling with attention,” arXiv preprint arXiv:2004.01278, 2020. [36] F. Baradel, C. Wolf, and J. Mille, “Human Activity Recognition with Pose-driven Attention to RGB,” in BMVC 2018 - 29th British Machine Vision Conference, Newcastle, United Kingdom, Sep. 2018, pp. 1–14. [Online]. Available: https://hal.inria.fr/hal-01828083 [37] K. Zhu, R. Wang, Q. Zhao, J. Cheng, and D. Tao, “A cuboid cnn model with an attention mechanism for skeleton-based action recognition,” IEEE Transactions on Multimedia, pp. 1–1, 2019. [38] Q. Xiao, M. Qin, P. Guo, and Y. Zhao, “Multimodal fusion based on lstm and a couple conditional hidden markov model for chinese sign language recognition,” IEEE Access, vol. 7, pp. 112 258–112 268, 2019. [39] B. Karan, “Calibration of kinect-type rgb-d sensors for robotic applications,” Fme Transactions, vol. 43, pp. 47–54, 2015. [40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [41] A. Graves, Supervised Sequence Labelling. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 5–13. [42] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256. [43] J. Donahue, L. 
Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[44] C. Yan, "Driving posture recognition by convolutional neural networks," IET Computer Vision, vol. 10, pp. 103–114, March 2016.
[45] A. Koesdwiady, S. M. Bedawi, C. Ou, and F. Karray, "End-to-end deep learning for driver distraction recognition," in Image Analysis and Recognition, F. Karray, A. Campilho, and F. Cheriet, Eds. Cham: Springer International Publishing, 2017, pp. 11–18.
[46] C. H. Zhao, B. L. Zhang, X. Z. Zhang, S. Q. Zhao, and H. X. Li, "Recognition of driving postures by combined features and random subspace ensemble of multilayer perceptron classifiers," Neural Computing and Applications, vol. 22, no. 1, pp. 175–184, 2013.
[47] Y. Hu, M. Lu, and X. Lu, "Driving behaviour recognition from still images by using multi-stream fusion cnn," Machine Vision and Applications, vol. 30, no. 5, pp. 851–865, 2019.

Imen JEGHAM is a PhD student in computer science at the Higher Institute of Computer Science and Communication Techniques of Hammam Sousse (University of Sousse, Tunisia). She received the engineering degree in computer science in 2014 and the European master's degree in highway and traffic engineering in 2017 from the National School of Engineers of Sousse, Tunisia. Her research interests include computer vision, pattern recognition, signal and image processing, and traffic engineering.

Anouar BEN KHALIFA received the engineering degree (2005) from the National Engineering School of Monastir, Tunisia, and the MSc degree (2007) and PhD degree (2014) in Electrical Engineering, Signal Processing, System Analysis and Pattern Recognition from the National Engineering School of Tunis, Tunisia. He is now an Associate Professor in Electrical and Computer Engineering at the National Engineering School of Sousse, Tunisia. He is a founding member of the LATIS research laboratory (Laboratory of Advanced Technology and Intelligent Systems). He was the head of the Department of Industrial Electronic Engineering at the National Engineering School of Sousse from 2016 to 2019. His research interests are artificial intelligence, pattern recognition, image processing, machine learning, intelligent transportation systems and information fusion.

Ihsen ALOUANI is an Associate Professor at the IEMN-DOAE lab of the Polytechnic University Hauts-de-France, France. He received his PhD from the Polytechnic University Hauts-de-France, and his MSc and engineering degrees from the National Engineering School of Sousse, Tunisia. He is the head of the "Cyber-defense and Information Security" master's program. His research focuses on intelligent transportation systems, hardware acceleration and security.

Mohamed Ali MAHJOUB is a Professor at the National Engineering School of Sousse (University of Sousse, Tunisia) and a member of the LATIS research laboratory (signals, images and documents team). He received the MSc in computer science in 1990, and the PhD and HDR in electrical engineering, signal processing and system analysis from the National School of Engineers of Tunis, Tunisia, in 1999 and 2013, respectively. His research interests include dynamic Bayesian networks, computer vision, pattern recognition, HMMs, and data retrieval. His main papers have been published in international journals and conferences.