Vision-Based Autonomous Navigation Approach for a Tracked Robot Using Deep Reinforcement Learning

Muhammad Mudassir Ejaz, Tong Boon Tang, Senior Member, IEEE, and Cheng-Kai Lu, Senior Member, IEEE

Abstract—Tracked robots need to achieve safe autonomous steering in various changing environments. In this article, a novel end-to-end network architecture is proposed for tracked robots to learn collision-free autonomous navigation through deep reinforcement learning. Specifically, this research improves the learning time and the exploratory behaviour of the robot by normalizing the input data and injecting parametric noise into the network parameters. Features were extracted from four consecutive depth images by deep convolutional neural networks and used to drive the tracked robot. In addition, a comparison was made between three Q-variant models in terms of average reward, variance, and dispersion across episodes, and a detailed statistical analysis was performed to measure the reliability of all the models. The proposed model was superior in all the environments. It is worth noting that our proposed model, the layer normalisation dueling double deep Q-network (LND3QN), could be directly transferred to a real robot without any fine-tuning after being trained in a simulation environment. The proposed model also demonstrated outstanding performance in several cluttered real-world environments with both static and dynamic obstacles.

Index Terms—Autonomous navigation, deep learning, reinforcement learning, obstacle avoidance.

Manuscript received August 5, 2020; accepted August 8, 2020. Date of publication August 13, 2020; date of current version December 16, 2020. This work was supported in part by the YUTP-Fundamental Research Grant (YUTP-FRG) under Grant 015LC0-002. The associate editor coordinating the review of this article and approving it for publication was Dr. Ioannis Raptis. (Corresponding author: Cheng-Kai Lu.) Muhammad Mudassir Ejaz is with the Department of Electrical and Electronics Engineering, Universiti Teknologi PETRONAS (UTP), Seri Iskandar 32610, Malaysia. Tong Boon Tang and Cheng-Kai Lu are with the Institute of Health Analytics (IHA), Universiti Teknologi PETRONAS (UTP), Seri Iskandar 32610, Malaysia (e-mail: chengkai.lu@utp.edu.my). Digital Object Identifier 10.1109/JSEN.2020.3016299

I. INTRODUCTION

Robots are used in various applications, such as data collection, surveillance [1], exploration, rescue services, and inspection [2]. All of these applications require navigation, but achieving collision-free and safe navigation is a challenging task. Autonomous navigation has been studied for a long time, and many well-developed methods have been proposed for safe autonomous navigation in various environments [3], [4]. However, these conventional methods rely on assumptions that do not hold in large environments [5]. For large areas, visual simultaneous localization and mapping (V-SLAM) [6] is used, where the motion of the robot is estimated from pixel information. However, this method is susceptible to changing light conditions and performs poorly in low-textured environments [7]. Recently, deep learning approaches for autonomous navigation have highlighted the bottlenecks of traditional methods and proposed solutions to address their limitations [8]–[11].
Deep reinforcement learning (DRL)-based methods have gained massive popularity in the field of autonomous navigation due to their promising performance and because no labelled data is required for training [12]–[15]. Laser range sensors were widely used for autonomous navigation with DRL in the past, but owing to their limited capability to describe the 3D world, vision sensors have become a good choice since they provide more information, generalize the environment better, and are cheaper. In particular, depth images are more appropriate than RGB images since they exhibit much better visual fidelity due to their texture-less nature [16]. Also, depth images from a simulation environment and the real world are quite similar, so model transfer is also easy. In DRL-based methods, the agent learns from its own behaviour through trial and error, so it is impractical to let robots train in the real world. Hence, a simulation environment is required for training.

It takes DRL-based methods longer to learn a problem than supervised learning methods since there is no labelled data and no supervision is provided to the network. One way to reduce the learning time is to improve the exploratory nature of the agent. Exploration is the process where an agent explores new regions by acquiring more information from the environment, irrespective of the rewards. In contrast, exploitation is the opposite of exploration, where the agent takes steps to increase the cumulative reward. This trade-off between exploration and exploitation is a long-standing problem. Also, the computational cost is a significant issue when working with deep learning models: the distribution of data in each layer changes abruptly, making the model slow to train.

In this article, we present a novel end-to-end network architecture that relieves the burden of the computational cost by normalising the input data before each convolutional layer using a layer normalisation method. It also improves the exploratory nature of an agent by adding parametric noise to the fully connected layers of a deep Q-network (DQN). The proposed layer normalisation dueling double deep Q-network (LND3QN) was trained in three virtual environments, and the results were compared with those of Q-variant models. The results also demonstrate the outstanding performance of the model on a tracked robot in various cluttered real-world environments, considering both static and dynamic obstacles. To the best of our knowledge, no previous reports have focused on these issues together.

The rest of the manuscript is organized as follows. Related works are discussed in Section II, and the architecture and implementation of LND3QN with noise injection are presented in Section III. In Section IV, the experimental results are discussed. Lastly, Section V concludes this article.

II. RELATED WORK

Autonomous navigation using DRL-based methods has become a popular choice since labelled data is not required for training. In addition, the transferability of a model is high if depth images are used as the network input. In 2013, Mnih et al.
proposed a DQN algorithm in which actions were selected from raw pixels, and tested it on ATARI games [17]. The same approach was implemented on ViZDoom, a first-person shooter (FPS) game environment, by Lample and Chaplot [18]. In the FPS game environment, an agent navigates different regions to increase its score. This idea opened the way for using DRL-based methods for autonomous navigation from raw images. Both Tai and Liu [19] and Zhang et al. [20] used depth images as the network input, where DQN and successor-feature-based DRL were adopted for obstacle-avoidance navigation, respectively. Xie et al. [21] acquired depth-prediction information from RGB images by converting them through a fully convolutional residual network (FCRN) following Laina et al. [22], and a duelling deep Q-network was used for action prediction. The same DRL method was used by Wu et al. [23], and noise was added in the fully connected layers through noisy nets for better exploration, following Fortunato et al. [24]. Wu et al. [25] proposed a novel method in which two data streams were merged and given to the network as input. The methods mentioned above used DQN [26] and its variants as the DRL method for network training.

Another main attribute of DRL-based methods is the exploration technique. The purpose of exploration is to ensure that the agent's policy does not converge prematurely to a local optimum. Several exploration methods have been proposed in the literature to deal with the exploration-exploitation trade-off, such as counting tables [27], learned dynamic models [28], [29], self-supervised curiosity [30], and state-space density modeling [31]. To achieve better exploration, a bootstrapped DQN was proposed by Osband et al. [32], where temporally correlated noise is added to the parameters. It has been found that adding noise improves the exploratory behaviour of the agent [33]. Fortunato et al. [24] proposed NoisyNets for DQN, which enhance the agent's exploration capability by adding noise to the network parameters.

III. ARCHITECTURE AND IMPLEMENTATION OF LND3QN WITH NOISE INJECTION

A. Problem Definition

The objective of this work was to enable tracked robots to learn autonomous navigation effectively using DRL. We formulated this autonomous navigation problem as a Markov decision process (MDP), which consists of a tuple M = (S, A, R, P, γ). S is the state space, A is the action space, R is the immediate reward, P represents the transition probability, and γ ∈ [0, 1] is the discount factor [34]. Reinforcement learning (RL) is a closed-loop learning process in which an agent performs an action a_t ∈ A in a given state s_t ∈ S, moves to the next state s_{t+1}, and receives an immediate reward r_t from the environment. A policy π(a|s) defines the mapping from states to actions. The goal of the agent is to maximize the cumulative reward from the environment through the Q-value. The Q-value is defined as the expected cumulative discounted reward obtained by following the optimal policy, and it can be formulated as follows:

$$Q(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\middle|\; s_0 = s,\ a_0 = a\right] \tag{1}$$

In the above equation, γ represents the discount factor that controls the weighting of future rewards.
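To make the role of the discount factor concrete, the following minimal NumPy sketch (not from the paper; the reward sequence is invented for illustration) computes the discounted return that the expectation in Equation (1) is taken over.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode, i.e. the quantity inside
    the expectation of Eq. (1)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episode: a few positive steering rewards, then a collision penalty.
episode_rewards = [2.5, 2.7, 2.9, -10.0]
print(discounted_return(episode_rewards, gamma=0.99))
```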
Bellman's equation [35] can be used to formulate the RL problem of obtaining the maximum reward from the environment, as shown in Equation (2):

$$Q^{*}(s, a) = R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \tag{2}$$

The reward is the only learning signal given to the agent by the environment to train itself. Considering learning speed, we designed a dense-reward function rather than a sparse one. With sparse rewards, the agent requires more experience to learn, which slows down the learning process, whereas a dense-reward function guides the agent towards actions that give a maximum return. Therefore, we designed an informative reward function that allows the model to learn fast, safe, and smooth steering with high efficiency, as expressed by Equation (3):

$$r(s_t, a_t) = \begin{cases} -10 & \text{if collision} \\ c_1 \vartheta^{2} \cos(c_2 \vartheta \omega) - c_3 & \text{otherwise} \end{cases} \tag{3}$$

where c_1, c_2, and c_3 are constants with values of 3, 3, and 0.1, respectively. Here, c_1 acts as a scaling factor, c_2 acts as a bias, and c_3 acts as a regulariser. The linear and angular velocities are represented by ϑ and ω, respectively. This reward function encourages the robot to move straight as far as possible for maximum reward until an angular action is required to avoid a collision. When the robot is close to an obstacle, its linear velocity decreases and its angular velocity increases, which changes the orientation of the robot. In the case of a collision, the robot receives −10 as a penalty; otherwise, it receives the total reward calculated by Equation (3) after 500 steps.
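As a concrete illustration of the reward design in Equation (3), here is a minimal sketch (an illustrative reimplementation, not the authors' code; the function name is ours, the constants follow the text).

```python
import math

def steering_reward(linear_v, angular_v, collided,
                    c1=3.0, c2=3.0, c3=0.1):
    """Dense reward of Eq. (3): -10 on collision, otherwise
    c1 * v^2 * cos(c2 * v * w) - c3, which favours fast, straight motion."""
    if collided:
        return -10.0
    return c1 * linear_v ** 2 * math.cos(c2 * linear_v * angular_v) - c3

# Hypothetical step: moving straight at 0.4 m/s earns more than turning sharply.
print(steering_reward(0.4, 0.0, collided=False))          # ~0.38
print(steering_reward(0.4, math.pi / 6, collided=False))  # smaller reward
```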
B. Network Architecture

To attain the objectives mentioned above, we selected the dueling double deep Q-network (D3QN) as the RL method. D3QN is a model-free, value-based method proposed by Wang et al. [36]. D3QN addresses the overestimation issue faced by DDQN and DQN, which arises from estimating every action value in every state. The novelty of D3QN is that it splits the Q-value into two streams: one computes the value of the state, and the other calculates the advantage of each action given that state. This architecture helps to generalize over actions without imposing any change on the underlying learning algorithm. After the state value and the advantage value are calculated, they are combined into one stream of Q-values, formulated as Q^π(s, a) = V^π(s) + A^π(s, a), where V represents the value function and A is the advantage function. The benefit of the duelling network is that every update of the Q-value also updates the shared state value, which benefits all actions; thereby, the process of state-value learning becomes more efficient. We modified the D3QN model and propose a novel method named the layer normalisation dueling double deep Q-network with noise injection (LND3QN). This method reduces the computational cost by normalizing the state space through layer normalisation and improves the exploratory nature of the agent by injecting noise into the network parameters. The network architecture of the proposed method is illustrated in Fig. 1.

Fig. 1. Layer normalisation duelling double deep Q-network with noise injection (LND3QN) network architecture. The input state is a series of four depth images that pass through three convolutional layers to extract features from the input data. A duelling architecture divides the flattened layer into value and advantage functions.

Four consecutive depth images captured from the camera are stacked together as the input of the network. The reason for stacking the images is to preserve the temporal information of the environment. In a feed-forward neural network, a non-linear mapping is computed from an input x to an output vector y. Suppose x^m is the vector of summed inputs to the m-th hidden layer of the network. The summed inputs are computed as a weight-matrix projection of the bottom-up inputs h^m, as follows:

$$x_i^{m} = {w_i^{m}}^{T} h^{m}, \qquad h_i^{m+1} = f\!\left(x_i^{m} + b_i^{m}\right) \tag{4}$$

where f(·) is an element-wise non-linear function, w_i^m is the incoming weight vector of the i-th unit, T denotes the matrix transpose, and b_i^m is the bias. These parameters are learned by the optimisation algorithm, with the gradients computed by back-propagation. In a feed-forward neural network, the output of the first layer becomes the input of the second layer, which greatly changes the summed inputs of that layer, especially when the rectified linear unit (ReLU) activation function is used. To reduce this covariate shift, the layer normalisation technique is applied before each of the convolutional layers. Layer normalisation computes the mean and variance of the summed inputs within a layer, as follows:

$$\mu^{m} = \frac{1}{K} \sum_{i=1}^{K} x_i^{m}, \qquad \sigma^{m} = \frac{1}{K} \sum_{i=1}^{K} \left(x_i^{m} - \mu^{m}\right)^{2} \tag{5}$$

where K represents the number of hidden units in the layer, x^m is the summed input vector, and μ^m and σ^m are the mean and variance, respectively. Three convolutional layers with different filters and kernels are used to extract the features from the input data. Layer normalisation with the identity function is applied to the input data, after which the processed data is fed into the first convolutional layer. A kernel of size 10 × 14 with a stride of 8 converts the image into 20 × 16 with 32 features. In the second and third convolutional layers, layer normalisation uses the ReLU activation function, and the kernel sizes are 4 × 4 and 3 × 3, respectively. The last convolutional layer transforms the image to a size of 10 × 8 with 64 features, which is then flattened into a 5120-D array. The covariance was measured after each convolutional layer, and we found that with layer normalisation it was reduced to 32%, 55%, and 69%, respectively. This reduction not only sped up training but also reduced the computational load. Layer normalisation [37] is preferable to batch and weight normalisation because it does not introduce any dependencies between training examples; it normalises the input data using its mean and variance.
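The layer statistics in Equation (5) can be sketched as follows (a simplified NumPy illustration, not the paper's TensorFlow implementation; the gain, bias, and epsilon terms are conventional additions assumed here for numerical stability).

```python
import numpy as np

def layer_norm(x, eps=1e-5, gain=1.0, bias=0.0):
    """Normalise the summed inputs of one layer for a single example (Eq. (5)):
    subtract the layer mean and divide by the layer standard deviation."""
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean() + eps)
    return gain * (x - mu) / sigma + bias

# Hypothetical pre-activation map of one convolutional layer (20 x 16 x 32 features).
pre_activation = np.random.randn(20, 16, 32) * 5.0 + 3.0
normed = layer_norm(pre_activation)
print(normed.mean(), normed.std())   # approximately 0 and 1
```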
After feature extraction, the network is divided into two branches, each consisting of two dense layers: one estimates the state value, and the other estimates the advantage function, which corresponds to the action commands. The extracted features are passed into two fully connected layers with 512 units and 1 unit to determine the state value. Meanwhile, the same features are fed into the second branch with two fully connected layers of 512 units and N units to calculate the Q-values, where N represents the number of actions. After obtaining the state value V(s) and the advantage A from the fully connected layers, the Q-value is calculated using the following equation:

$$Q_i(s, a) = \mathrm{LeakyReLU}\!\left(V(s) + \Big(A_i(s, a) - \frac{1}{N}\sum_{a'} A(s, a')\Big)\right) \tag{6}$$

where N is the number of actions. The leaky ReLU activation function was used instead of ReLU to fix the issue of dying neurons, which occurred due to layer normalisation.

Generally, the epsilon-greedy method is used for exploration in Q-variant methods. The initial value of ε is usually large and decreases gradually until it reaches the final epsilon. Entropy-based exploration works in the opposite manner, since the probability value in an explored region is small and increases as the agent moves towards unexplored regions [38]. However, these probability-based exploration methods take more steps to achieve good exploration, which affects training. For better exploration, we introduced noise into the network parameters, which enhances the agent's exploratory nature. We used NoisyNets, proposed by Fortunato et al. [24], which works by perturbing the weights and biases of the linear layer y = wx + b, where x, w, and b represent the input, weight, and bias, respectively. Gaussian noise is added as uncertainty in the network parameters, as shown in Equation (7):

$$y = (\mu^{w} + \sigma^{w} \odot \varepsilon^{w})\,x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b} \tag{7}$$

where μ^w, σ^w, μ^b, and σ^b are the learnable network parameters, ε^w and ε^b are random noise variables, and ⊙ denotes element-wise multiplication. Adding independent noise to every parameter makes the network heavier, which affects the computational power. To overcome this issue, we used factorised Gaussian noise rather than independent noise. The weight and bias entries can then be expressed as follows:

$$\omega_{i,j} = \mu^{\omega}_{i,j} + \sigma^{\omega}_{i,j}\, f(\varepsilon_i) f(\varepsilon_j); \qquad b_j = \mu^{b}_{j} + \sigma^{b}_{j}\, f(\varepsilon_j) \tag{8}$$

where f is defined as f(ε) = sgn(ε)√|ε|. We sampled μ^w and μ^b from a uniform distribution on the interval [−1/√N, 1/√N], where N is the input layer size, and σ^w and σ^b were initialised to 0.4/√N. The advantage of using noise for exploration compared to epsilon-greedy and entropy-based methods is that it does not need any hyperparameter tuning; the amount of noise in the network is updated automatically during training.
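A minimal NumPy sketch of a factorised noisy linear layer, in the spirit of Equations (7) and (8), is given below. This is an illustrative reimplementation rather than the authors' code: the class name is ours, the initialisation constants follow the description in the text, and the layer sizes (5120 flattened features into 512 hidden units) match the architecture above.

```python
import numpy as np

def f(x):
    """Scaling function used by factorised noise: f(x) = sgn(x) * sqrt(|x|)."""
    return np.sign(x) * np.sqrt(np.abs(x))

class NoisyLinear:
    def __init__(self, n_in, n_out, sigma0=0.4, seed=0):
        rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(n_in)
        # Learnable means and noise scales (Eq. (7)); held fixed in this sketch.
        self.mu_w = rng.uniform(-bound, bound, (n_out, n_in))
        self.mu_b = rng.uniform(-bound, bound, n_out)
        self.sigma_w = np.full((n_out, n_in), sigma0 / np.sqrt(n_in))
        self.sigma_b = np.full(n_out, sigma0 / np.sqrt(n_in))
        self.rng = rng

    def __call__(self, x):
        # Factorised Gaussian noise (Eq. (8)): one noise vector per input unit
        # and one per output unit, instead of an independent sample per weight.
        eps_in = f(self.rng.standard_normal(self.mu_w.shape[1]))
        eps_out = f(self.rng.standard_normal(self.mu_w.shape[0]))
        w = self.mu_w + self.sigma_w * np.outer(eps_out, eps_in)
        b = self.mu_b + self.sigma_b * eps_out
        return x @ w.T + b   # noisy version of y = wx + b

layer = NoisyLinear(5120, 512)            # flattened features -> hidden units
hidden = layer(np.random.randn(5120))
print(hidden.shape)                       # (512,)
```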
C. Training Framework

The training framework of the proposed model consists of two networks, as depicted in Fig. 2. The state information goes to the online network to estimate the value Q(s_t; θ) for the corresponding action; the action is then multiplied by Q(s_t; θ−) to estimate Q(s_t, a_t; θ−). Meanwhile, the next state goes to both the online and target networks simultaneously to compute the target Q-value Q(s_{t+1}; θ−). The target value y_i is calculated using Equation (9), in which γ represents the discount factor, set to 0.99, and r_t is the instantaneous reward given at each time step. Here, θ and θ− represent the online and target network parameters, respectively.

$$y_i = \begin{cases} r_t, & \text{if the episode ends at step } t+1 \\ r_t + \gamma\, Q\!\left(s_{t+1}, \operatorname*{argmax}_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta);\, \theta^{-}\right), & \text{otherwise} \end{cases} \tag{9}$$

In deep learning, a model is optimised using a loss function. Given the Q-values from the online and target networks, the loss is calculated using the following equation:

$$L(\theta) = \mathbb{E}\big[(y_i - Q(s_t, a_t; \theta))^{2}\big] \tag{10}$$

The parameters of the online network are initialized randomly, while the parameters of the target network are duplicated from θ. Afterwards, at every time step, back-propagation is applied to the loss function using the Adam optimiser with a learning rate (α) of 0.0001 to update the online network's parameters. For efficient training of the online network, the parameters of the convolutional layers were divided by 2 during back-propagation. The target network is a duplicate of the online network, and hence its parameters are not trainable; the rate at which this duplication of parameters occurred (the target update rate τ) was 0.001. Algorithm 1 shows the pseudo-code for the proposed method.

Fig. 2. Training framework of the proposed LND3QN model. The input state is fed into the online network to perform an action, and the estimated Q-value is calculated after taking the action. The target value (y) is calculated from the immediate reward (r_t), the discount factor (γ), and the Q-values from both networks at each iteration. The online network (θ) is responsible for the optimal actions, while the target network (θ−) estimates the target values. The loss is back-propagated to the online network to update its parameters. The target network is a duplicate of the online network and is updated periodically.
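Before the full pseudo-code in Algorithm 1, the core update can be illustrated with a minimal NumPy sketch of the target in Equation (9), the loss in Equation (10), and the slow target-network update. This is illustrative only, not the authors' TensorFlow implementation: the function names are ours, and interpreting the 0.001 duplication rate as a soft update with τ is an assumption.

```python
import numpy as np

def ddqn_target(r, q_online_next, q_target_next, done, gamma=0.99):
    """Target value y of Eq. (9): the online network picks the action,
    the target network evaluates it; only r if the episode ended."""
    if done:
        return r
    a_star = int(np.argmax(q_online_next))
    return r + gamma * q_target_next[a_star]

def td_loss(targets, q_taken):
    """Mean squared TD error of Eq. (10)."""
    return float(np.mean((np.asarray(targets) - np.asarray(q_taken)) ** 2))

def soft_update(theta_target, theta_online, tau=0.001):
    """Slowly track the online parameters (assumed soft-update reading of the
    0.001 duplication rate)."""
    return {k: (1 - tau) * theta_target[k] + tau * theta_online[k]
            for k in theta_target}

# Hypothetical Q-values over the seven discrete actions for one transition.
q_online_next = np.array([0.2, 0.5, 0.1, 0.4, 0.3, 0.0, 0.1])
q_target_next = np.array([0.1, 0.6, 0.2, 0.3, 0.2, 0.1, 0.0])
y = ddqn_target(r=2.4, q_online_next=q_online_next,
                q_target_next=q_target_next, done=False)
print(y, td_loss([y], [2.0]))
```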
Algorithm 1 LND3QN With Noise Injection
Input: batch size N_B, replay memory M, replay memory size N_M, discount factor γ, exploration frames E, observation steps N_O, target network update rate τ, parameters of the online network (θ) and target network (θ−), number of episodes N_E
for episode = 1 to N_E do
    Observe the state s_t
    Select an action a* = argmax_a Q(s_t, a; θ)
    Execute the action, receive the reward r_t, and move to the next state s_{t+1}
    Store the transition (s_t, a*, r_t, s_{t+1}) in M
    if |M| > N_M then
        Remove the oldest transition from M
    end
    Apply layer normalisation (LN) in each convolutional layer
    Insert factorised Gaussian noise in the fully connected layers
    Sample a mini-batch of N_B transitions (s_t, a_t, r_t, s_{t+1}) from M
    Sample noise variables ε_i, ε_j, ε_i^-, ε_j^- ~ N(0, 1)
    Evaluate Q_1(s_{t+1}, a; θ) and Q_2(s_{t+1}, a; θ−)
    if the episode terminates at s_{t+1} then
        Set y = r_t
    else
        Set y = r_t + γ Q_2(s_{t+1}, argmax_{a_{t+1}} Q_1(s_{t+1}, a_{t+1}; θ); θ−)
    end
    Compute the loss L(θ) = E[(y − Q(s_t, a_t; θ))^2]
    Perform an Adam optimiser step on the online network parameters θ
    Update the target network: θ− ← θ
end

IV. EXPERIMENTS AND DISCUSSIONS

A. Experiments in a Virtual Environment

Gazebo is an open-source 3D simulator with a graphical interface, high-quality graphics, and a physics engine for training and testing algorithms in elaborate indoor and outdoor environments. It is a well-known simulator that works with the robot operating system (ROS). The ROS allows us to simulate the virtual tracked robot for training and then transfer the trained model to a physical robot. Given this benefit, we designed different environments for training in Gazebo, each differing in complexity and number of obstacles, as shown in Fig. 3. The first environment was a small 10 × 10 m world with a few obstacles. Willow Garage's office, which comes with Gazebo, was chosen as the second environment. In the last environment, we introduced walking persons into the café environment, which resembles a real-world scenario.

Fig. 3. 3D environments built in the Gazebo simulator. The first environment is a small 10 × 10 m square world, while the second environment is the Willow Garage office with narrow paths and a large area. The third environment is a café world, which resembles a real-world scenario.

Communication in the ROS is done via topics between different nodes, as shown in Fig. 4. Messages are subscribed and published via topics (indicated as rectangular boxes) by four main nodes (represented by ovals). The depth image is published to Gazebo World (GW) by the camera_controller node via the camera/depth/image_raw topic. GW processes the data and then sends it to the motor_controller node using the cmd_vel topic. The odometry node subscribes to messages from GW via the odom topic to move the tracked robot.

Fig. 4. ROS communication flowchart. Nodes and topics are represented by ellipses and rectangular boxes, respectively. An arrow shows that a message is published to the node.

Training was performed on a desktop system with an Intel i7 CPU, 16 GB of RAM, and an NVIDIA RTX 2060 GPU using TensorFlow [39]; each iteration took approximately 0.22 s of training time on average. Hyper-parameters such as the learning rate (α) and the discount factor (γ) were set to 0.0001 and 0.99, respectively. The experience replay buffer size was 50000 for the first environment and was reduced to 30000 for the second and third environments. In every experiment, seven velocity commands were considered, comprising linear velocities of 0.4 m/s and 1.2 m/s and angular velocities of π/12, π/6, 0, −π/6, and −π/12 rad/s. In the first environment, the total number of training episodes was 1500; in the second and third environments, we reduced the total episodes to 150. In each environment, the proposed LND3QN model was compared with three baseline models, namely DQN, DDQN, and D3QN. In all the baseline models, the epsilon-greedy method was used for exploration, with the initial and final epsilon values set to 0.1 and 0.001, respectively. All the models were evaluated by the reward obtained by the robot in all three environments. For better demonstration, we divided the reward graph into four different plots. The first plot shows the reward at each episode.
The second plot depicts the smoothed value of the reward, calculated by the exponential moving average (EMA) method, formulated as follows:

$$S_t = \begin{cases} X_1, & t = 1 \\ \alpha X_t + (1 - \alpha) S_{t-1}, & t > 1 \end{cases} \tag{11}$$

where the coefficient α represents the smoothing factor, which lies between 0 and 1, X is the reward in each episode, and S is the smoothed value at a given episode. For a smoother curve, we kept α = 0.99. The third plot illustrates the average reward, computed over windows of 10 episodes. Variance is shown in the fourth plot, where a model with low variance is considered well trained.

Fig. 5. Rewards received by all the models in the first environment. (a) represents the reward at each episode, (b) shows the smoothed reward over each episode, while (c) and (d) depict the average reward and variance, respectively.

Fig. 5 illustrates the results from the first environment, where training was started from scratch. In Fig. 5(a), the suggested model is more prominent, meaning it received the maximum reward in most episodes. Fig. 5(b) depicts the smoothed reward; it can be seen that the proposed model performed best among all the models at the end of training and received the maximum reward at the 1100th episode. The curve remained stable after the 1100th episode, which indicates that the proposed model was trained faster than the other models. In the variance plot, depicted in Fig. 5(d), the spikes represent the deviation of an episodic reward from the mean. In the beginning, all the models have approximately the same variance, but in the interval from the 26th to the 30th episode, a peak is found, which indicates that the training of the proposed model tended towards over-fitting; however, as training continued, the variance of the proposed model decreased and reached a minimum at the end of training.

Fig. 6. Performance curves of all the Q-variant models and the proposed model in the second environment. The reward at the end of each episode and its smoothed values are depicted in (a) and (b), respectively. (c) and (d) show the average reward and variance, respectively.

In the second environment, the trained weights from the first environment were used for training. Fig. 6 shows the reward pattern of all the models. It can be clearly seen in Fig. 6(b) that the suggested model received higher rewards from the beginning until the end of training. The reward values of DQN and DDQN are similar at the end of training, while D3QN performed better than both. The stability of LND3QN is clearly seen from the variance plot shown in Fig. 6(d). At the start of training, the D3QN model tends towards over-fitting, as it exhibits high variance, but from the 7th interval onwards the model starts learning.

Fig. 7. Comparison of all the models in terms of reward received in the third environment. (a) portrays the reward at each episode, and (b) shows the smoothed reward over each episode. The average reward and variance are plotted in (c) and (d), respectively.

Fig. 7 illustrates the reward graph obtained from the third environment by all the models, where the second environment's trained weights were used. Compared with the three baseline models, the LND3QN model yielded higher episodic rewards due to its better exploration strategy, as shown in Fig. 7(b).
From the figure, it can be seen that the D3QN model did not perform well in this environment and obtained the lowest reward, whereas the DQN and DDQN models were similar at the end of training. Once again, the proposed model resulted in the lowest variance, as depicted in Fig. 7(d), while DQN exhibited the highest variance at the start and end of training.

DRL-based methods have a high level of variability in performance and are susceptible to different factors, such as hyper-parameters, the environment, and implementation details [40]. This variability will cause issues when the algorithm is deployed in a real-world scenario. To account for this, the reliability of all the models was measured. Dispersion across runs (DR) is one way to calculate the variability of a model. It can be measured by either the variance or the standard deviation, but we measured it by calculating the interquartile range (IQR), since it is a robust statistic. The IQR is the difference between the third quartile value and the first quartile value. We applied the analysis of variance (ANOVA) to the reward values.

TABLE I. Comparison Between the Proposed LND3QN Model and Three Baseline Models With Respect to the Interquartile Range in Various Virtual Environments.

Table I lists the IQR value of each model. In the first environment, LND3QN had an IQR of 81, while DDQN had the lowest IQR of 41. However, DQN had a slightly better IQR than D3QN. In environment 2, D3QN performed better, with an IQR of 63, while the other models had lower IQR values. In the third environment, the proposed model performed outstandingly, with an IQR of 45, which is 68% more than the baseline models. On the contrary, all the baseline models had IQR values below 20.

Fig. 8. Dispersion across runs in all three environments. Better reliability is indicated by more positive values. The y-axis represents the range of reward values.

For better visualisation of reliability and further statistical analysis, boxplots show how the rewards are dispersed across the episodes in all the scenarios, as shown in Fig. 8. The results were evaluated in terms of range, median, and outliers. Fig. 8(a) demonstrates the results of the first environment. The figure shows that the proposed model yielded higher rewards in most of the episodes, with a median value of 120, while DQN obtained the lowest median value. DDQN had a smaller IQR with one outlier; however, its median reward was higher than those of the DQN and D3QN models. In the second environment, the medians achieved by the DQN and DDQN models plunged to below 55, whereas the suggested LND3QN model attained the highest median value, as shown in Fig. 8(b). In the last environment, the proposed model performed very well from the start of training compared with the others, as shown in Fig. 8(c). LND3QN achieved the highest median value, while the baseline models had median values below 50. Looking at the third quartile of the proposed model in Fig. 8(a)-(c), we notice that it is higher than those of all the baseline models, which shows that LND3QN attained the highest rewards and that its success rate is also high. An episode is considered successful if it lasts for 500 steps without a collision during training.

TABLE II. Results and Descriptions of Reported Values by DQN Variants and the Proposed Model. "Length (Iters)" Denotes Algorithm Iterations and "Length (Episodes)" Denotes the Number of Episodes.

A further statistical comparison is made in terms of the average reward and the standard deviation (SD) of the return, as illustrated in Table II. This statistical analysis further demonstrates the stability and reliability of the proposed model. The average reward of all the baseline models in the first environment was below 90, whereas the LND3QN average reward was 98.02. Similarly, the SD of LND3QN was higher than those of the other models. In the second environment, LND3QN obtained a lower SD value than D3QN, but its average reward was higher than that of the others.
The superiority of the proposed model was shown in the last environment, where the SD achieved by the DQN, DDQN, and D3QN models was below 11, while the SD attained by LND3QN was 22. Also, the average reward of the proposed model was 76, while the others were below 55.

B. Experiments in Real-World Scenarios

Real-world experiments were conducted in an indoor environment with three different scenarios. A tracked robot equipped with an Intel Realsense camera and a Lidar was used throughout the experiments, as shown in Fig. 9. The dimensions of the robot were 43 × 37 × 18 cm, and it weighed 35 kg. The tracks were driven by four servo motors, and each motor communicated over an RS232 cable. In addition, two small motors were attached to the camera to turn it left, right, up, and down. All motors were connected to a UDOO x86 computer, while the sensors were powered by an Intel Core i5 mini PC. The ROS was installed on both systems, where the UDOO x86 served as the ROS master, and a wireless router was used for communication between them.

Fig. 9. Tracked robot equipped with an Intel Realsense depth camera and a Lidar sensor. (a) and (b) show the front and side views of the tracked robot, respectively.

Fig. 10. Different scenarios in an indoor environment to evaluate the validity and adaptability of our proposed model.

In the first scenario, two boxes were placed as obstacles, as shown in Fig. 10(a). The goal of the robot was to avoid both of them and navigate autonomously. Depth images were captured from the Intel Realsense camera and passed to the proposed model, LND3QN, to determine the actions. In the beginning, the robot moved forward with a linear velocity and then changed its direction to the right by selecting a larger angular velocity to avoid the first obstacle. As it crossed the first obstacle, the second obstacle was nearby, so it again changed its direction to the right by increasing the angular velocity and minimising the linear velocity. Once it crossed the second obstacle, it went forward again and approached a wall. Similarly, the robot turned left by receiving a larger angular velocity command to avoid a collision with the wall. During the experiments, the angular velocities were recorded, and graphs of the selected actions were plotted. Fig. 11 illustrates six intermediate time-step images, with the respective angular velocity shown at the bottom. All the action commands were provided by the proposed model, which was trained only in the simulation environment.

Fig. 11. Autonomous navigation in the second real-world indoor scenario. The images show six intermediate time steps, and the action selections by the robot are illustrated below them.
In addition, the suggested model was evaluated in other unseen environments, as shown in Fig. 10(b), where more obstacles were placed. As shown in Fig. 12, the robot first turned left to avoid a collision and travelled straight until it came closer to the obstacle. Once it reached the obstacle, the robot changed its direction by decreasing the linear velocity and increasing the angular velocity to turn left.

Fig. 12. Action selection by the robot in the first indoor scenario. The curve below the images shows the steering actions selected by the tracked robot at each step.

Furthermore, we evaluated the proposed model in a more complex scenario by placing more obstacles, which resembles a cluttered environment, as shown in Fig. 10(c). In the beginning, the robot was placed near a wall. In order not to bump into the wall, it changed its direction to the left and moved forward until it approached the next wall. Similarly, the action commands from the learned policy enabled the robot to change its direction once again to avoid the obstacle, as shown in Fig. 13.

Fig. 13. Action choices by the robot in the third real-world scenario. The graph shows the angular actions taken by the tracked robot at each time step.

Experiments were also conducted in a dynamic environment, where a person was moving in a 4 × 4 m² area, as presented in Fig. 14. The goal was to avoid the moving obstacle (a human) and the static obstacles placed in the testing environment. Fig. 14(a) shows the layout of the testing environment, with two obstacles indicated as a green box and a yellow box. The black box indicates the robot, with an arrow that represents the starting point. In Fig. 14(b), a complete trajectory is shown in red. Fig. 14(c)-(f) shows the path covered by the robot at each time step with both static and moving obstacles, where red shows the path in the current scenario and light red indicates the path taken in the previous scenario. The blue arrow indicates the movement of the person in the testing environment. For a better understanding of the person's movement, the starting position is shown as a faded figure and the final position as a solid figure. We also changed the locations of the static obstacles, as shown in Fig. 14(d)-(f).

Fig. 14. Experiments conducted in a real-world dynamic environment. The path covered by the robot in the current scenario is indicated in red, whereas light red shows the path taken in the previous scenario. Obstacles are represented by the green and yellow boxes. The black box depicts the robot's starting location, and the blue arrow shows the movement of a person. (a) Layout of the testing environment. (b) Experiment performed in a static environment. (c)-(f) Experiments performed in a dynamic environment, with relocation of the static obstacles.

V. CONCLUSION

In this article, we presented a novel DRL-based method for autonomous navigation of a tracked robot. DRL-based methods require long training, and the computational cost slows down the training of the neural networks. Similarly, the exploration-exploitation trade-off also affects learning. We focused on these two main issues and proposed the LND3QN method, which accelerates the training of neural networks by applying layer normalisation before each convolutional layer.
The injection of noise into the fully connected layers improved the exploratory nature of the tracked robot. The proposed method derives action commands directly from depth images through a carefully designed network architecture. A CNN was used to extract features from four consecutive depth images, and Q-values were calculated from these features. The proposed model, LND3QN, was simulated, analyzed, and compared with three baseline models, namely DQN, DDQN, and D3QN. It is worth noting that the suggested model showed outstanding performance compared with all the baseline models in terms of average reward and variance. Furthermore, statistical analysis was performed to validate the variability of the models in terms of DR. The results show that LND3QN has better reliability than the other Q-variant models. Real-world experiments were conducted using a tracked robot in different scenarios, and the results illustrate that the proposed model generalizes well to unseen environments and is competent in determining steering actions. In the future, we will use prioritised experience replay, which may further reduce the learning time and increase performance by selecting the actions that have a high Q-value. The model may become more efficient if we include more past transitions to better understand the surrounding environment and modify the reward function.

REFERENCES

[1] R. Hasan, S. Asif Hussain, S. Azeemuddin Nizamuddin, and S. Mahmood, “An autonomous robot for intelligent security systems,” in Proc. 9th IEEE Control Syst. Graduate Res. Colloq. (ICSGRC), Shah Alam, Malaysia, Aug. 2018, pp. 201–206.
[2] F. Ingrand and M. Ghallab, “Deliberation for autonomous robots: A survey,” Artif. Intell., vol. 247, pp. 10–44, Jun. 2017.
[3] A. Pandey, “Mobile robot navigation and obstacle avoidance techniques: A review,” Int. Robot. Autom. J., vol. 2, no. 3, May 2017, Art. no. 00022.
[4] H. Isakhani, N. Aouf, O. K. Stamatis, and J. F. Whidborne, “A furcated visual collision avoidance system for an autonomous micro robot,” IEEE Trans. Cogn. Develop. Syst., vol. 12, no. 1, pp. 1–11, Mar. 2020.
[5] A. Al-Kaff, F. García, D. Martín, A. De La Escalera, and J. Armingol, “Obstacle detection and avoidance system based on monocular camera and size expansion algorithm for UAVs,” Sensors, vol. 17, no. 5, p. 1061, May 2017.
[6] C.-H. Chien, C.-C.-J. Hsu, W.-Y. Wang, and H.-H. Chiang, “Indirect visual simultaneous localization and mapping based on linear models,” IEEE Sensors J., vol. 20, no. 5, pp. 2738–2747, Mar. 2020.
[7] C. Debeunne and D. Vivet, “A review of visual-LiDAR fusion based simultaneous localization and mapping,” Sensors, vol. 20, no. 7, p. 2068, 2020.
[8] J. Wang, V. A. Shim, R. Yan, H. Tang, and F. Sun, “Automatic object searching and behavior learning for mobile robots in unstructured environment by deep belief networks,” IEEE Trans. Cogn. Develop. Syst., vol. 11, no. 3, pp. 395–404, Sep. 2019.
[9] S. Stevsic, T. Nageli, J. Alonso-Mora, and O. Hilliges, “Sample efficient learning of path following and obstacle avoidance behavior for quadrotors,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 3852–3859, Oct. 2018.
[10] F. Codevilla, M. Muller, A. Lopez, V. Koltun, and A. Dosovitskiy, “End-to-end driving via conditional imitation learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Brisbane, QLD, Australia, May 2018, pp. 1–9.
[11] K. Wu, M. Abolfazli Esfahani, S. Yuan, and H. Wang, “TDPP-net: Achieving three-dimensional path planning via a deep neural network architecture,” Neurocomputing, vol. 357, pp. 151–162, Sep. 2019.
[12] F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” presented at the Robot., Sci. Syst. XIII (RSS), Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, Jul. 2017.
[13] P. Long, T. Fanl, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 6252–6259.
[14] J. Bruce, N. Suenderhauf, P. Mirowski, R. Hadsell, and M. Milford, “One-shot reinforcement learning for robot navigation with interactive replay,” 2017, arXiv:1711.10137. [Online]. Available: https://arxiv.org/abs/1711.10137
[15] M. Pfeiffer et al., “Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4423–4430, Oct. 2018.
[16] L. Tai, J. Zhang, M. Liu, and W. Burgard, “Socially compliant navigation through raw depth inputs with generative adversarial imitation learning,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Brisbane, QLD, Australia, May 2018, pp. 1111–1117.
[17] V. Mnih et al., “Playing Atari with deep reinforcement learning,” 2013, arXiv:1312.5602. [Online]. Available: https://arxiv.org/abs/1312.5602
[18] G. Lample and D. S. Chaplot, “Playing FPS games with deep reinforcement learning,” presented at the 31st AAAI Conf. Artif. Intell. (AAAI), California, CA, USA, Feb. 2017.
[19] L. Tai and M. Liu, “Towards cognitive exploration through deep reinforcement learning for mobile robots,” 2016, arXiv:1610.01733. [Online]. Available: http://arxiv.org/abs/1610.01733
[20] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Vancouver, BC, Canada, Sep. 2017, pp. 2371–2378.
[21] L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular vision based obstacle avoidance through deep reinforcement learning,” 2017, arXiv:1706.09829. [Online]. Available: http://arxiv.org/abs/1706.09829
[22] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Proc. 4th Int. Conf. 3D Vis. (3DV), Stanford, CA, USA, Oct. 2016, pp. 239–248.
[23] K. Wu, M. Esfahani, S. Yuan, and H. Wang, “Learn to steer through deep reinforcement learning,” Sensors, vol. 18, no. 11, p. 3650, Oct. 2018.
[24] M. Fortunato et al., “Noisy networks for exploration,” in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–18.
[25] K. Wu, H. Wang, M. A. Esfahani, and S. Yuan, “BND*-DDQN: Learn to steer autonomously through deep reinforcement learning,” IEEE Trans. Cognit. Develop. Syst., early access, Jul. 16, 2020, doi: 10.1109/TCDS.2019.2928820.
[26] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[27] H. Tang et al., “Exploration: A study of count-based exploration for deep reinforcement learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 2753–2762.
[28] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel, “VIME: Variational information maximizing exploration,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1109–1117.
[29] J. Achiam and S. Sastry, “Surprise-based intrinsic motivation for deep reinforcement learning,” presented at the Int. Conf. Learn. Represent. (ICLR), Toulon, France, Apr. 2017.
[30] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Honolulu, HI, USA, Jul. 2017, pp. 16–17.
[31] G. Ostrovski, M. G. Bellemare, A. Oord, and R. Munos, “Count-based exploration with neural density models,” in Proc. 34th Int. Conf. Mach. Learn., vol. 70, 2017, pp. 2721–2730.
[32] I. Osband, C. Blundell, A. Pritzel, and B. V. Roy, “Deep exploration via bootstrapped DQN,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4026–4034.
[33] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 901–909.
[34] M. Zimmer and S. Doncieux, “Bootstrapping Q-learning for robotics from neuro-evolution results,” IEEE Trans. Cogn. Develop. Syst., vol. 10, no. 1, pp. 102–119, Mar. 2018.
[35] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, vol. 135. Cambridge, MA, USA: MIT Press, 1998.
[36] Z. Wang, T. Schaul, M. Hessel, H. V. Hasselt, M. Lanctot, and N. D. Freitas, “Dueling network architectures for deep reinforcement learning,” in Proc. 33rd Int. Conf. Mach. Learn. (PMLR), vol. 48, 2016, pp. 1995–2003.
[37] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” 2016, arXiv:1607.06450. [Online]. Available: https://arxiv.org/abs/1607.06450
[38] M. Usama and D. Eui Chang, “Learning-driven exploration for reinforcement learning,” 2019, arXiv:1906.06890. [Online]. Available: http://arxiv.org/abs/1906.06890
[39] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proc. 12th USENIX Symp. Operating Syst. Design Implement. (OSDI), vol. 16, 2016, pp. 265–283.
[40] L. J. Lin, “Self-improvement based on reinforcement learning, planning and teaching,” in Proc. 8th Int. Workshop, San Mateo, CA, USA: Morgan Kaufmann, 1991.

Muhammad Mudassir Ejaz received the B.Eng. degree in biomedical engineering from the NED University of Engineering and Technology, Pakistan, in 2015. He is currently pursuing the master's degree with Universiti Teknologi PETRONAS (UTP), Malaysia. Since 2018, he has been with the Smart Assistive and Rehabilitative Technology (SMART), Department of Electrical and Electronics Engineering, UTP. His research interests include image and video processing, robot vision, deep reinforcement learning, and deep learning.

Tong Boon Tang (Senior Member, IEEE) received the B.Eng. (Hons.) and Ph.D. degrees from The University of Edinburgh. He is currently the Director of the Institute of Health and Analytics for Personalized Care, UTP, Malaysia. His research interests include biomedical instrumentation, from device and measurement to data fusion. He serves as the Secretary of the HICoE Council and the Chair of the IEEE Circuits and Systems Society Malaysia Chapter.
Cheng-Kai Lu (Senior Member, IEEE) received the B.S. and M.S. degrees in electronics engineering from Fu Jen Catholic University, Taipei, Taiwan, in 2001 and 2003, respectively, and the Ph.D. degree in engineering from The University of Edinburgh, U.K., in 2012. After graduation, he worked as the Director of the Research and Development Division, Chyao Shiunn Electronic Industrial Company Ltd., Shanghai, China, before he joined the National Applied Research Laboratories, Science and Technology Policy Research and Information Centre, Taiwan. He is currently a Faculty Member of the Electrical and Electronic Engineering Department, Universiti Teknologi PETRONAS (UTP), Malaysia. His research interests include medical imaging, embedded systems, and artificial intelligence, and their applications and clinical decision support systems. Apart from academic experience, he has more than eight years of industrial work experience. He has not only published his research works on peer-reviewed articles (book chapters, journal articles, conference papers, and reports) but also has filed a couple of patents. The most significant technical contributions to date by him are on production line automation to significantly reduce labor costs in manufacturing while he served as the Director of the Research and Development Division, Chyao Shiunn Electronic Industrial Company Ltd., from 2012 to 2014. His patents have also been licensed out successfully, and he receives partial royalties from patent licensing agreements. It is worthwhile to mention that one of his inventions (TW Patent: PS keyboard system, 2006) has been adopted by the Republic of China Air Force and has further been applied to light aircraft and the specific long-haul flight plane. He has served as an Executive Member for the IEEE EMBS Malaysia Chapter and the Penang Chapter from January 2017 to February 2018 and since 2018.