Article

Energy Management Strategy for a Hybrid Electric Vehicle Based on Deep Reinforcement Learning

Yue Hu 1,2,3,†, Weimin Li 1,3,4,*, Kun Xu 1,†, Taimoor Zahid 1,2, Feiyan Qin 1,2 and Chenming Li 5

1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; yue.hu@siat.ac.cn (Y.H.); kun.xu@siat.ac.cn (K.X.); taimoor.z@siat.ac.cn (T.Z.); fy.qin@siat.ac.cn (F.Q.)
2 Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China
3 Jining Institutes of Advanced Technology, Chinese Academy of Sciences, Jining 272000, China
4 Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China
5 Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China; 1155101099@link.cuhk.edu.hk
* Correspondence: wm.li@siat.ac.cn; Tel.: +86-159-1537-4149
† These authors contributed equally to this work.

Received: 28 December 2017; Accepted: 24 January 2018; Published: 26 January 2018
Appl. Sci. 2018, 8, 187; doi:10.3390/app8020187

Abstract: An energy management strategy (EMS) is important for hybrid electric vehicles (HEVs) because it plays a decisive role in the performance of the vehicle. However, the variation of future driving conditions strongly influences the effectiveness of the EMS. Most existing EMS methods simply follow predefined rules that do not adapt to different driving conditions online. It is therefore desirable for an EMS to learn from the environment, i.e., from the driving cycle itself. In this paper, a deep reinforcement learning (DRL)-based EMS is designed that learns to select actions directly from the states, without any prediction or predefined rules. Furthermore, a DRL-based online learning architecture is presented, which is essential for applying the DRL algorithm to HEV energy management under different driving conditions. Simulation experiments have been conducted using MATLAB and Advanced Vehicle Simulator (ADVISOR) co-simulation. The experimental results validate the effectiveness of the DRL-based EMS compared with a rule-based EMS in terms of fuel economy, and the online learning architecture is also shown to be effective. The proposed method ensures both optimality and real-time applicability in HEVs.

Keywords: hybrid electric vehicle; energy management strategy; deep reinforcement learning; online learning

1. Introduction

An energy management strategy (EMS) is one of the key technologies for hybrid electric vehicles (HEVs) due to its decisive effect on the performance of the vehicle [1]. The EMS for HEVs has been a very active research field during the past decades. However, designing a highly efficient and adaptive EMS remains a challenging task due to the complex structure of HEVs and the uncertainty of the driving cycle. The existing EMS methods can be broadly classified into the following three categories:

(1) Rule-based EMS, such as the thermostatic strategy, the load-following strategy, and the electric assist strategy [2,3]. These methods rely heavily on the results of extensive experimental trials and human expertise, without a priori knowledge of the driving conditions [4]. Other related control strategies employ heuristic control techniques, with the resultant strategies formalized as fuzzy rules [5,6]. Although these rule-based strategies are effective and can be easily implemented, their optimality
and flexibility are critically limited by the working conditions and, consequently, they are not adaptive to different driving cycles.

(2) Optimization-based EMS: these optimization methods are based either on known driving cycles or on predicted future driving conditions, and include dynamic programming (DP) [7–9], sequential quadratic programming (SQP), genetic algorithms (GA) [10], the Pontryagin minimum principle (PMP) [11], and so on. Usually, these algorithms can determine the optimal power split between the engine and the motor for a particular driving cycle. However, the resulting power-split solutions are only optimal with respect to that specific driving cycle; in general, they are neither optimal nor charge-sustaining for other cycles. Unless future driving conditions can be predicted during real-time operation, these control laws cannot be applied directly. Moreover, these methods suffer from the "curse of dimensionality", which prevents their wide adoption in real-time applications. Model predictive control (MPC) [12] is another type of optimization-based method: the optimal control problem over a finite horizon is solved at each sampling instant and the control actions are obtained by online rolling optimization. This approach offers good control performance and strong robustness.

(3) Learning-based EMS: some strategies learn from historical data or use previous driving data for online learning or application [13,14]. Some researchers propose that traffic information and cloud computing in intelligent transportation systems (ITSs) can enhance HEV energy management, since vehicles obtain real-time data via intelligent infrastructure or connected vehicles [15,16]. Whether they learn from historical data or from predicted data, these EMS methods still need complex control models and expert knowledge; thus, they are not end-to-end control methods. Reinforcement learning-based control methods have also been used for HEV energy management [17,18]. However, reinforcement learning must learn from a scalar reward signal that is frequently sparse, noisy, and delayed. In addition, the sequences of highly correlated states, and the change of the data distribution as the algorithm learns new behaviors, are further difficulties for reinforcement learning. Despite these remaining problems, the learning-based EMS is an emerging and promising approach because of its potential for self-adaptation to different driving conditions.

In our previous research, online learning control strategies based on neural dynamic programming (NDP) [19] and fuzzy Q-learning (FQL) [20] were proposed. These strategies do not rely on prior information about future driving conditions and can self-tune the parameters of the algorithms. A back-propagation (BP) neural network was used to estimate the Q-value which, in turn, tuned the parameters of the fuzzy controller [20]. However, this approach still requires designing the fuzzy controller, as well as professional knowledge.

Deep reinforcement learning (DRL) has shown impressive performance in playing Atari [21] and Go games [22] in recent years. DRL is a powerful algorithm for solving complex control problems and handling large state spaces by establishing a deep neural network that relates the value estimation to the associated state-action pairs.
As a result, the DRL algorithm has quickly been applied in robotics [23], building HVAC control [24], ramp metering [25], and other fields. In the automotive field, DRL has been used for lane keeping assist [26], autonomous braking systems [27], and autonomous vehicles [28]. However, from our perspective, the motion control of autonomous vehicles requires very high precision; the mechanism of DRL has not yet been explained in depth and may not meet this requirement. Nevertheless, DRL remains a suitable technique for the HEV EMS considered in this research, since the EMS is concerned with fuel economy rather than control precision. A DRL-based EMS has been designed for plug-in hybrid electric vehicles (PHEVs) [29]; this was the first time DRL was applied to a PHEV EMS. However, there are several problems in that study: (1) the learning process is still offline, which means that the trained deep network only works well on the same driving cycle and may not perform well under other driving conditions. As a result, the method can be used in buses with a fixed route, but it is not acceptable for vehicles whose routes vary; (2) the immediate reward is important because it affects the performance of DRL. The optimization objective is the vehicle fuel economy, but the reward is defined as a function of the power supplied by the engine. The relationship between fuel economy and engine power is complex, and the paper does not justify this choice; (3) the structure of the deep neural network can be improved by fixing the Q-target network, which makes the algorithm more stable.

In this research, an energy management strategy based on deep reinforcement learning is proposed. Our work achieves good performance and high scalability by (1) building the system model of the HEV and formulating the HEV energy management problem; (2) developing a DRL-based control framework and an online learning architecture for a HEV EMS that adapts to different driving conditions; and (3) facilitating algorithm training and evaluation in the simulation environment. Figure 1 illustrates our DRL-based algorithm for the HEV EMS. The DRL-based EMS can autonomously learn the optimal policy based on data inputs, without any prediction or predefined rules. For training and validation, we use the HEV model built in ADVISOR software (National Renewable Energy Laboratory, Golden, CO, USA).
Simulation results reveal that the algorithm is able to improve the fuel economy while meeting other requirements, such as dynamic performance and vehicle drivability.

Figure 1. Deep reinforcement learning (DRL)-based framework for the HEV EMS.

The proposed DRL-based EMS uses a fixed target Q network, which makes the algorithm more stable. The immediate reward is a function directly related to fuel consumption. More importantly, a DRL-based online learning architecture is presented; it is a critical factor in applying the DRL algorithm to HEV energy management under different driving conditions.

The rest of this paper is organized as follows: Section 2 introduces the system model of the HEV and describes the mathematical formulation of the HEV EMS. Section 3 explains our deep reinforcement learning-based control strategy, including offline learning and the online learning application. The experimental results are given in Section 4, followed by the conclusions in Section 5.

2. Problem Formulation

The prototype vehicle is a single-axis parallel HEV, the drivetrain structure of which is shown in Figure 2. The drivetrain integrates an engine, an electric traction motor/generator, Ni-MH batteries, an automatic clutch, and an automated manual transmission. The motor is directly linked between the clutch output and the transmission input. This architecture provides regenerative braking during deceleration and allows efficient motor-assist operation. To provide pure electric propulsion, the engine can be disconnected from the drivetrain by the automatic clutch.
We adopted vehicle model from our[19,20] previous [19,20] for this research. The key this vehicle are given in Table 1. parameters of this vehicle are given in Table 1. Figure 2. Drivetrain structure of the HEV. Figure 2. Drivetrain structure of the HEV. Table 1. Summary of the HEV parameters. Table 1. Summary of the HEV parameters. Part or Vehicle Parameters Value Part or Vehicle Parameters Value Displacement: 1.0 L Spark Ignition Maximum power: 50 kW/5700 r/min Displacement: 1.0 L (SI) engine Spark Ignition Maximum power: torque: 50 89.5 Nm/5600 r/min Maximum kW/5700 r/min (SI) engine Maximum power: 10 kW r/min Maximum torque: 89.5 Nm/5600 Permanent magnet motor Maximum Nm Maximum torque: power: 46.5 10 kW Permanent magnet motor Capacity: 6.5 Ah Maximum torque: 46.5 Nm Advanced Ni-Hi battery Nominal cell voltage: 1.2 V Capacity: 6.5 Ah Total cells: 120 Advanced Ni-Hi battery Nominal cell voltage: 1.2 V Total5-speed cells: 120 Automated manual transmission GR: 2.2791/2.7606/3.5310/5.6175/11.1066 5-speed Automated manual transmission Vehicle Curb weight: 1000 kg GR: 2.2791/2.7606/3.5310/5.6175/11.1066 Vehicle Curb weight: 1000 kg In the following, key concepts of the DRL-based EMS are formulated: state: Inkey the concepts DRL algorithm, control action directly determined by the system states. InSystem the following, of the DRL-based EMSisare formulated: In this study, the total required torque ( Tdem ) and the battery state-of–charge (SOC) are selected to System state: In the DRL algorithm, control action is directly determined by the system states. form a two-dimensional state space, i.e., s (t ) = (Tdem (t ), SOC (t )) T , where Tdem (t ) represents the In this study, the total required torque (Tdem ) and the battery state-of–charge (SOC) are selected to form the battery state ofT charge at time t. torque at state time space, t, and i.e., SOCs((tt)) =represents arequired two-dimensional ( Tdem (t), SOC (t))T , where dem ( t ) represents the required The(decision-making on the torque-split ratio torqueControl at time action: t, and SOC t) represents the battery state of charge at between time t. the internal combustion engine (ICE)action: and battery is the core problem of torque-split the HEV energy management strategy. We choose Control The decision-making on the ratio between the internal combustion the output torque from the ICE as the control action in this study, denoted as A ( t ) = T ( t ) , where engine (ICE) and battery is the core problem of the HEV energy management strategy. Wee choose the t output torque from the T ICE asshould the control action ininthis study, denoted A(t) = Talgorithm, t is the time step index. be discretized order to apply the as DRL-based e ( t ), wherei.e., e (t ) 1 2 n isthe theentire time action step index. T ( t ) should be discretized in order to apply the DRL-based algorithm, space ise A = { A , A ,..., A } , where n is the degree of discretization. In this i.e., the entire action space is A = A1 , A2 , ..., An , where n is the degree of discretization. In this research, we consider n as 24. The motor output torque Tm (t ) can be obtained by subtracting research, we consider n as 24. The motor output torque Tm (t) can be obtained by subtracting Te (t) Te (t )T from from (t). Tdem (t ) . 
Immediate reward: The immediate reward is important in the DRL algorithm because it directly influences the parameter tuning of the deep neural network (DNN). The DRL agent always tries to maximize the reward it can obtain by taking the optimal action at each time step. Therefore, the immediate reward should be defined according to the optimization objective. The control objective of the HEV EMS is to minimize vehicle fuel consumption and emissions along a driving mission, while satisfying vehicle drivability and battery health constraints. In this work, we focus on the fuel economy of the HEV; emissions are not taken into consideration. With this objective in mind, the reciprocal of the ICE fuel consumption at each time step is defined as the immediate reward, and a penalty value is introduced to penalize the situation in which the SOC exceeds its thresholds. The immediate reward is defined by the following equation:

r^{a}_{ss'} = \begin{cases} \frac{1}{C_{ICE}}, & C_{ICE} \neq 0 \ \cap \ 0.4 \le SOC \le 0.85 \\ \frac{1}{C_{ICE}+C}, & C_{ICE} \neq 0 \ \cap \ (SOC < 0.4 \ \text{or} \ SOC > 0.85) \\ \frac{2}{Min_{C_{ICE}}}, & C_{ICE} = 0 \ \cap \ 0.4 \le SOC \\ -\frac{1}{C}, & C_{ICE} = 0 \ \cap \ SOC < 0.4 \end{cases}    (1)

where r^a_{ss'} is the immediate reward generated when the state changes from s to s' by taking action a; C_ICE is the instantaneous fuel consumption of the ICE; C is the numerical penalty, equal to the maximum instantaneous power supply of the ICE; and Min_{C_ICE} is the minimum nonzero value of the instantaneous ICE fuel consumption. The SOC variation range is from 40% to 85% in this study. This definition guarantees low ICE fuel consumption while satisfying the SOC constraints.
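For concreteness, the piecewise reward of Equation (1) can be written as a small Python function. This is a minimal sketch assuming the quantities defined above (the instantaneous ICE fuel consumption, the penalty constant C, and the minimum nonzero fuel consumption) are available from the vehicle model; the argument names are illustrative, and the last branch follows the reconstruction of Equation (1) given above.

    def immediate_reward(c_ice, soc, penalty_c, min_c_ice,
                         soc_low=0.4, soc_high=0.85):
        # Piecewise immediate reward of Equation (1).
        if c_ice != 0:
            if soc_low <= soc <= soc_high:
                return 1.0 / c_ice                 # engine on, SOC within limits
            return 1.0 / (c_ice + penalty_c)       # engine on, SOC outside limits
        # engine off (c_ice == 0)
        if soc >= soc_low:
            return 2.0 / min_c_ice                 # pure electric drive, SOC healthy
        return -1.0 / penalty_c                    # engine off but SOC depleted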
Formally, the goal of the EMS of the HEV is to find the optimal control strategy, π*, that maps the observed states s_t to the control actions a_t. Mathematically, the control strategy of the HEV can be formulated as an infinite-horizon dynamic optimization problem:

R = \sum_{t=0}^{\infty} \gamma^t r(t)    (2)

where r(t) is the immediate reward incurred by a_t at time t, and γ ∈ (0, 1) is a discount factor that ensures the convergence of the infinite sum. We use Q*(s_t, a_t), i.e., the optimal action value, to represent the maximum accumulated reward that can be obtained by taking action a_t in state s_t. Q*(s_t, a_t) is given by the Bellman equation:

Q^*(s_t, a_t) = E[\, r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \mid s_t, a_t \,]    (3)

The Q-learning method is used to update the value estimate, as shown in Equation (4):

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \eta \left( r_{t+1} + \gamma \max_{a_{t+1}} Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \right)    (4)

where η ∈ (0, 1] is the learning rate. Such a value iteration algorithm converges to the optimal action-value function, Q_t → Q* as t → ∞.

3. Deep Reinforcement Learning-Based EMS

The deep reinforcement learning-based EMS combines a deep neural network with conventional reinforcement learning. The EMS makes decisions based only on the current system state, since the proposed EMS is an end-to-end control strategy. This deep reinforcement learning network is also called a deep Q-network (DQN). In the rest of this section, the value function approximation, the DRL algorithm design, and the online learning application of the DRL-based algorithm are presented.

3.1. Value Function Approximation

In conventional reinforcement learning, the state-action value is represented by a large, but finite, table of states and actions, i.e., the Q table. In this work, however, a deep neural network is used to approximate the Q-value of Equation (3). As depicted in Figure 3, the inputs of the network are the system states, which are defined in Section 2. The rectified linear unit (ReLU) is used as the activation function for the hidden layers, and a linear layer is used for obtaining the action values at the output layer. In order to balance exploration and exploitation, the ε-greedy policy is used for action selection, i.e., the policy chooses the maximum Q-value action with probability 1 − ε and selects a random action with probability ε.

Figure 3. Structure of the neural network.

The Q-value estimates for all control actions can be calculated by performing a forward pass through the neural network. The mean squared error between the target Q-value and the inferred output of the neural network is defined as the loss function in Equation (5):

L(\theta) = E\left[\left( r + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, \theta^-) - Q(s_t, a_t, \theta) \right)^2\right]    (5)

where Q(s_t, a_t, θ) is the output of the neural network with parameters θ, and r + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, θ^-) is the target Q-value, computed using parameters θ^- from some previous iteration. This fixed target Q network makes the algorithm more stable. The parameters of the neural network are updated by the gradient descent method.
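A minimal PyTorch sketch of this value-function approximation is given below. The paper's experiments were run in MATLAB/ADVISOR, so the framework and all names here are illustrative assumptions; the structure (a two-dimensional state input, ReLU hidden layers of 20, 50, and 100 neurons as described in Section 4.1, a linear output of one Q-value per discrete action, and the fixed-target loss of Equation (5)) follows the description above. Terminal-state handling is omitted for brevity.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        # Maps the state (T_dem, SOC) to one Q-value per discrete ICE torque action.
        def __init__(self, n_actions=24):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(2, 20), nn.ReLU(),
                nn.Linear(20, 50), nn.ReLU(),
                nn.Linear(50, 100), nn.ReLU(),
                nn.Linear(100, n_actions),   # linear output layer: action values
            )

        def forward(self, state):
            return self.layers(state)

    def td_loss(q_net, target_net, batch, gamma=0.99):
        # Mean squared error of Equation (5), using a fixed target network (theta^-).
        states, actions, rewards, next_states = batch
        q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = rewards + gamma * target_net(next_states).max(dim=1).values
        return nn.functional.mse_loss(q_sa, target)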
we scale scale the the total total required required torque torque TTdem to the significantly.In Inorder order to to facilitate facilitate the the learning learning process, process, we dem to the range [−1, 1] before feeding to the neural network as shown in Equation (6). The minimum and range [−1, 1] before feeding to the neural network as shown in Equation (6). The minimum and maximum values for Tdem can be obtained from historical observation: maximum values for Tdem can be obtained from historical observation: Tdem − min( Tdem ) 0 Tdem =' (6) Tdem − min(Tdem ) ) − min( Tdem ) Tdemmax = ( Tdem (6) max(Tdem ) − min(Tdem ) 3.2. DRL Algorithm Design 3.2. DRL Algorithm Design Our DRL-based EMS control algorithm is presented in Algorithm 1. The outer loop controls the number of training episodes, while the inner loop performs the EMS control at each time step within Our DRL-based EMS control algorithm is presented in Algorithm 1. The outer loop controls the one training episode. number of training episodes, while the inner loop performs the EMS control at each time step within one training episode. Appl. Sci. 2018, 8, 187 7 of 15 Algorithm 1: Deep Q-Learning with Experience Replay Initialize replay memory D to capacity N Initialize action-value function Q with random weights θ Initialize target action-value function Q̂ with weights θ − = θ 1: For episode = 1, M do 2: Reset environment: s0 = (SOC Initial , T0 ) 3: For t = 1, T, do 4: With probability ε select a random action at otherwise select at = maxQ(st , a; θ ) at 5: 6: 7: 8: 9: Choose action at and observe the reward rt Set st+1 = (SOCt+1 , Tt+1 ) Store (st , at , rt , st+1 ) in memory D Sample random mini-batch of (st , at , rt , st+1 ) from D if terminal s j+1 : Set y j = r j else set y j = r j + γmaxQ̂(s j+1 , a j+1 ; θ _ ) a j +1 10: Perform a gradient descent step on (y j − Q(s j a j ; θ ))2 11: Every C steps reset Q̂ = Q 12: end for 13: end for In order to avoid the strong correlations between the samples in a short time period of conventional RL, experience replay is adopted to store the experience (i.e., a batch of state, action, reward, and next state:(st , at , rt , st+1 )) at each time step in a data experience pool. For each certain time, random samples of experience are drawn from the experiment pool and used to train the Q network. We initialize memory D as an empty set. Then we initialize weights θ in the action-value function estimation Q neural network. In order to break the dependency loop between the target value and weights θ, a separate neural network Q̂ with weights θ − is created for calculating the target Q value. We can set the maximum number of episode as M. During the learning process, in step 4, the algorithm selects the maximum Q value action with probability 1 − ε and selects a random action with probability ε based on the observation of the state. In step 5, action at is executed and reward rt is obtained. In step 6, the system state becomes the next state. In step 7, the state action transition tuple is stored in memory. Then, a mini-batch of transition tuples is drawn randomly from the memory. Step 9 calculates the target Q value. The weights in neural network Q are updated by using the gradient descent method in step 10. The network Q̂ is periodically updated by copying parameters from the network Q in step 11. 3.3. 
3.3. DRL-Based Algorithm Online Learning Application

In Section 3.2, the DRL-based algorithm was proposed; however, it is an offline learning algorithm that can only be applied in the simulation environment. More importantly, the training process can only cover a limited set of driving cycles; therefore, the trained DQN only performs well under the learned driving conditions and may not provide satisfactory results under other driving cycles. This is unacceptable in HEV real-time applications. As a result, online learning is necessary for DRL-based algorithms in HEV EMS applications.

The DRL-based online learning architecture is presented in Figure 4. Action execution and network training should be separated. There is a controller that contains a Q neural network, selects an action for the HEV, and stores the state-action transitions. When the HEV needs to learn a new driving cycle, actions are selected with the ε-greedy method; otherwise, the HEV can always select the maximum Q-value action. A separate on-board computer or a remote computing center is responsible for training the Q neural network. The on-board computer or remote computing center obtains the state-action transitions from the action controller and trains the neural network based on the DRL algorithm. The Q neural network in the controller is periodically updated by copying parameters from the on-board computer or remote computing center.

Figure 4. DRL-based online learning architecture.

The communication between the action controller and the on-board computer can be via CAN bus or Ethernet. If the Q neural network training is completed by a remote computing center, a vehicle terminal named Telematics-box (T-box) should be installed in the HEV in order to communicate with the remote computing center through the 3G communication network. A remote computing center can also obtain state-action transitions from other connected HEVs, as shown in Figure 4. This is useful for training a large Q neural network that can deal with different driving conditions.
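The separation between on-board action selection and off-board training described above can be sketched as a small controller component. This is only a schematic illustration of the architecture in Figure 4, not a specification of the actual T-box or CAN interfaces; the class and method names are hypothetical.

    import random
    import torch

    class ActionController:
        # On-board controller: selects actions with the local Q network and buffers transitions.
        def __init__(self, q_net, n_actions=24, epsilon=0.0):
            self.q_net = q_net          # local copy of the Q neural network
            self.n_actions = n_actions
            self.epsilon = epsilon      # set > 0 only while a new driving cycle is being learned
            self.outbox = []            # transitions (s, a, r, s') awaiting transmission

        def select_action(self, state):
            if random.random() < self.epsilon:
                return random.randrange(self.n_actions)   # epsilon-greedy exploration
            with torch.no_grad():
                return int(self.q_net(torch.tensor(state)).argmax())

        def record(self, transition):
            self.outbox.append(transition)   # sent via CAN/Ethernet or the T-box to the trainer

        def receive_parameters(self, state_dict):
            self.q_net.load_state_dict(state_dict)   # weights pushed back by the computing center

On the training side, the on-board computer or remote computing center would keep the replay memory, run the same update loop as in the offline sketch, and periodically push the updated network parameters back to the controller.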
The main differences between online learning and offline learning are as follows: (1) online learning can adapt to varying driving conditions, while offline learning can only learn from the given driving cycles; (2) action execution and network training should be separated in online learning because of the limited computing ability of the on-board controller; and (3) online training efficiency should be higher than that of offline training, since the vehicle must learn the optimal EMS within the shortest time. Thus, it is necessary to cluster the representative state-action transitions and to use the recent data in the experience pool.

Interestingly, offline learning and online learning can be combined to achieve a good EMS. For instance, we can train the DQN offline under the Urban Dynamometer Driving Schedule (UDDS), and then apply the online learning under the New European Driving Cycle (NEDC), as sketched below.
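Under the assumptions of the previous sketches, this combination amounts to initializing the online DQN from the weights learned offline and then continuing training on the new cycle. The snippet below is a hypothetical illustration of that workflow: train_dqn, QNetwork, the environment objects, the file name, and the reduced exploration setting are all illustrative assumptions rather than the authors' code.

    # 1) Pre-train offline on UDDS in simulation, then save the weights.
    q_net, target_net = QNetwork(), QNetwork()
    train_dqn(udds_env, q_net, target_net, episodes=50)
    torch.save(q_net.state_dict(), "dqn_udds_pretrained.pt")   # hypothetical file name

    # 2) Online learning on NEDC starts from the pre-trained weights
    #    (Section 4.2 reports 20 online training passes for this case).
    q_net_online, target_online = QNetwork(), QNetwork()
    q_net_online.load_state_dict(torch.load("dqn_udds_pretrained.pt"))
    train_dqn(nedc_env, q_net_online, target_online,
              episodes=20, eps_start=0.2, eps_end=0.2)   # reduced exploration is an assumption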
4. Experimental Results and Discussion

4.1. Offline Application

4.1.1. Experiment Setup

In order to evaluate the effectiveness of the proposed DRL-based algorithm, simulation experiments were carried out in the MATLAB and ADVISOR co-simulation environment. The offline learning application is evaluated first, and the UDDS driving cycle is used in the learning process. The simulation model of the HEV described in Section 2 is built in ADVISOR. The hyper-parameters of the DRL-based algorithm used in the simulations are summarized in Table 2.

Table 2. Summary of the DRL-based algorithm hyper-parameters.

Hyper-Parameter        Value
Mini-batch size        32
Replay memory size     1000
Discount factor γ      0.99
Learning rate          0.00025
Initial exploration    1
Final exploration      0.2
Replay start size      200

In this application, the input layer of the network has two neurons, i.e., T_dem and SOC. There are three hidden layers with 20, 50, and 100 neurons, respectively. The output layer has 24 neurons representing the discrete ICE torque. All these layers are fully connected. The network is trained for 50 episodes, and each episode corresponds to one trip (1369 s).
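Mapped onto the illustrative training-loop sketch from Section 3.2, the settings of Table 2 and the network structure above would be passed as follows (a hypothetical call; udds_env again stands in for the ADVISOR model):

    q_net, target_net = QNetwork(n_actions=24), QNetwork(n_actions=24)
    train_dqn(udds_env, q_net, target_net,
              episodes=50, steps_per_episode=1369,   # 50 episodes, each a 1369 s UDDS trip
              memory_size=1000, batch_size=32,
              gamma=0.99, lr=0.00025,
              eps_start=1.0, eps_end=0.2, replay_start=200)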
This is because of the adding of a large penalty drops selects in the total reward thethat process. This because of theconstraint. adding of a large penalty when the algorithm selects actions results in the of violation of the SOC algorithm actions thatduring results intraining the violation theisSOC constraint. when the algorithm selects actions that results in the violation of the SOC constraint. Figure 5. Track of loss. Figure 5. Track of loss. Figure 5. Track of loss. Figure 6. Track of the total reward. Figure 6. Track of the total reward. Figure 6. Track of the total reward. Then, the simulation results of the trained DRL-based EMS for the UDDS driving cycle are shown in Figure 7. In order to evaluate the performance and effectiveness of the trained DRL-based EMS, comparison results are listed in Table 3. Power consumption is converted to fuel consumption; equivalent Appl. Sci. 2018, 8, x FOR PEER REVIEW 10 of 15 Appl. Sci. 2018, 8, 187 10 of 15 Then, the simulation results of the trained DRL-based EMS for the UDDS driving cycle are shown in Figure 7. In order to evaluate the performance and effectiveness of the trained DRL-based EMS, comparison results areby listed in Table Power consumption is converted toand fuelfuel consumption; fuel consumption is obtained adding the3.converted power consumption consumption. equivalent fuel consumption is obtained by adding the converted power consumption and fuelto the As shown by the results of Table 3, fuel consumption is improved significantly compared consumption. As shown by the results of Table 3, fuel consumption is improved significantly rule-based control strategy, as fuel consumption is decreased by 10.09%. Meanwhile, the equivalent fuel compared to the rule-based control strategy, as fuel consumption is decreased by 10.09%. consumption is also decreased by 8.05%. The DRL-based EMS achieves good performance. Notably, Meanwhile, the equivalent fuel consumption is also decreased by 8.05%. The DRL-based EMS the rule-base designed by Notably, the experts the DRL-based EMS only learns fromwhile the states achieves EMS goodisperformance. thewhile rule-base EMS is designed by the experts the and historical data. EMS only learns from the states and historical data. DRL-based Figure 7. Simulation results of the trained EMS under UDDS. Figure 7. Simulation results of the trained EMS under UDDS. Table 3. Comparison of the results under UDDS. Table 3. Comparison of the results under UDDS. Control Strategy Fuel Consumption (L/100 km) Rule-Based 3.857 Control Strategy DRL-based Fuel Consumption 3.468(L/100 km) Rule-Based 4.2.DRL-based Online Application 3.857 3.468 Equivalent Fuel Consumption (L/100 km) 3.861 Equivalent Fuel3.550 Consumption (L/100 km) 3.861 3.550 4.2.1. Experiment Setup 4.2. Online Application The DRL-based online learning architecture is presented in Section 3.2. In order to evaluate the 4.2.1.online Experiment learningSetup performance conveniently, we also use ADVISOR software to simulate the online learning working process. In the online learning application, the neural network setting is the same The DRL-based online learning architecture is presented in Section 3.2. In order to evaluate the as the offline application. Two different kinds of simulations are performed. In the first scenario, the online learning performance webeginning also use and, ADVISOR software the online neural network parameters conveniently, are random at the in the second one, to thesimulate neural network learning working process. 
In the online learning application, the neural network setting is the same as in the offline application. Two different kinds of simulations are performed. In the first scenario, the neural network parameters are random at the beginning; in the second, the neural network is pre-trained offline under an existing driving cycle before the online learning process. In the first case, the online learning simulation is carried out under the NEDC driving cycle without any pre-training. In the second case, we first pre-train the neural network offline under the UDDS driving cycle and then apply the online learning under the NEDC driving cycle.

4.2.2. Experimental Results

In the first case, we trained the neural network 50 times under the NEDC driving cycle with the same initial conditions. Unlike the offline learning, this process is online and simulates the vehicle running under the NEDC driving cycle. The track of the loss is depicted in Figure 8; the loss also decreases quickly along the training process in the online application. Figure 9 depicts the track of the total reward and the fuel consumption of one driving cycle along the training process; the overall trend of the total reward is the same as in the offline application. This reveals that the proposed DRL-based online learning architecture is effective. As can be seen from Figure 9, the trends of the total reward and the fuel consumption are nearly opposite, which reflects that the definition of the reward is suitable.

Figure 8. Track of loss in the online application.
Figure 9. Track of the total reward and fuel consumption in the online application.

The simulation results of the online trained DRL-based EMS for the NEDC driving cycle are shown in Figure 10, and the comparison of the results is listed in Table 4. Fuel consumption is again improved compared to the rule-based control strategy: fuel consumption is decreased by 10.29%, while the equivalent fuel consumption is decreased by 2.57%.

Figure 10. Simulation results of the trained EMS under NEDC.

Table 4. Comparison of the results under NEDC.

Control Strategy   Fuel Consumption (L/100 km)   Equivalent Fuel Consumption (L/100 km)
Rule-based         3.877                         3.892
DRL-based          3.478                         3.792

In the second case, the neural network was pre-trained offline under the UDDS driving cycle, such that the DRL-based EMS adapts to the UDDS driving cycle but has no a priori knowledge of the NEDC driving cycle. The comparison between the DRL-based EMS trained offline under the UDDS driving cycle, evaluated on the NEDC driving cycle, and the other control strategies is listed in Table 5. It is obvious that the EMS trained offline under the UDDS driving cycle does not adapt well to the NEDC driving cycle.
Table 5. Comparison of the results under NEDC.

Control Strategy                                     Fuel Consumption (L/100 km)   Equivalent Fuel Consumption (L/100 km)
Rule-based                                           3.877                         3.892
DRL-based EMS trained under NEDC online              3.478                         3.792
DRL-based EMS only pre-trained under UDDS offline    3.690                         3.872

Based on the EMS pre-trained offline under the UDDS driving cycle, we can apply the online learning process under the NEDC driving cycle. We trained the pre-trained neural network 20 times under the NEDC driving cycle with the same initial conditions. After the training process, we tested the trained EMS under the NEDC driving cycle. The simulation results are shown in Figure 11, and the comparison of the results is listed in Table 6. The results show that the pre-training process effectively decreases the online training time. This is because the DRL-based EMS learns some features that are shared between different driving conditions.

Figure 11. Simulation results under NEDC of the online trained EMS which was pre-trained under UDDS.

Table 6. Comparison of the results under NEDC.

Control Strategy                                                   Fuel Consumption (L/100 km)   Equivalent Fuel Consumption (L/100 km)
DRL-based EMS trained under NEDC online                            3.478                         3.792
DRL-based EMS only pre-trained under UDDS offline                  3.690                         3.872
DRL-based EMS pre-trained offline and trained under NEDC online    3.440                         3.795

Figure 12 depicts the track of the total reward and the fuel consumption of one driving cycle along the training process; the curves are smoother than the curves without pre-training. This also reveals that offline pre-training can improve the online learning efficiency even though the driving condition is different.

Figure 12. Track of the total reward and fuel consumption in the online application in which the EMS was pre-trained first.
5. Conclusions

This paper presents a deep reinforcement learning-based, data-driven approach to obtain an energy management strategy for a HEV. The proposed method combines Q-learning and a deep neural network to form a deep Q-network that obtains the action directly from the states. Key concepts of the DRL-based EMS have been formulated, and the value function approximation and the DRL algorithm design have been described in detail. In order to adapt to varying driving cycles, a DRL-based online learning architecture has been presented. Simulation results demonstrate that the DRL-based EMS achieves better fuel economy than the rule-based EMS. Furthermore, the online learning approach can learn from different driving conditions. Future work will focus on improving the online learning efficiency and on testing on a real vehicle. Another important issue is how to output continuous actions: in this paper, the output actions are discretized, which may lead to strong oscillation of the ICE output torque. A deep deterministic policy gradient (DDPG) algorithm can output continuous actions and may solve this problem; this will be part of future work. However, DDPG is also based on DRL, so the contribution of this paper will speed up the application of deep reinforcement learning methods in the energy management of HEVs.

Acknowledgments: This research was supported by the National Natural Science Foundation of China (61573337), the National Natural Science Foundation of China (61603377), the National Natural Science Foundation of China (61273139), and the Technical Research Foundation of Shenzhen (JSGG20141020103523742).
Author Contributions: Yue Hu and Kun Xu proposed the method and wrote the main part of this paper; Weimin Li supported this article financially and checked the paper; Taimoor Zahid checked the grammar of this paper; and Feiyan Qin and Chenming Li analyzed the data.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Un-Noor, F.; Padmanaban, S.; Mihet-Popa, L.; Mollah, M.N.; Hossain, E. A Comprehensive Study of Key Electric Vehicle (EV) Components, Technologies, Challenges, Impacts, and Future Direction of Development. Energies 2017, 10, 1217.
2. Wirasingha, S.G.; Emadi, A. Classification and review of control strategies for plug-in hybrid electric vehicles. IEEE Trans. Veh. Technol. 2011, 60, 111–122.
3. Gokce, K.; Ozdemir, A. An instantaneous optimization strategy based on efficiency maps for internal combustion engine/battery hybrid vehicles. Energy Convers. Manag. 2014, 81, 255–269.
4. Odeim, F.; Roes, J.; Wulbeck, L.; Heinzel, A. Power management optimization of fuel cell/battery hybrid vehicles with experimental validation. J. Power Sources 2014, 252, 333–343.
5. Li, C.; Liu, G. Optimal fuzzy power control and management of fuel cell/battery hybrid vehicles. J. Power Sources 2009, 192, 525–533.
6. Denis, N.; Dubois, M.R.; Desrochers, A. Fuzzy-based blended control for the energy management of a parallel plug-in hybrid electric vehicle. IET Intell. Transp. Syst. 2015, 9, 30–37.
7. Ansarey, M.; Panahi, M.S.; Ziarati, H.; Mahjoob, M. Optimal energy management in a dual-storage fuel-cell hybrid vehicle using multi-dimensional dynamic programming. J. Power Sources 2014, 250, 359–371.
8. Vagg, C.; Akehurst, S.; Brace, C.J.; Ash, L. Stochastic dynamic programming in the real-world control of hybrid electric vehicle. IEEE Trans. Control Syst. Technol. 2016, 24, 853–866.
9. Qin, F.; Xu, G.; Hu, Y.; Xu, K.; Li, W. Stochastic optimal control of parallel hybrid electric vehicles. Energies 2017, 10, 214.
10. Chen, Z.; Mi, C.C.; Xiong, R.; Xu, J.; You, C. Energy management of a power-split plug-in hybrid electric vehicle based on genetic algorithm and quadratic programming. J. Power Sources 2014, 248, 416–426.
11. Xie, S.; Li, H.; Xin, Z.; Liu, T.; Wei, L. A Pontryagin minimum principle-based adaptive equivalent consumption minimum strategy for a plug-in hybrid electric bus on a fixed route. Energies 2017, 10, 1379.
12. Guo, L.; Guo, B.; Guo, Y.; Chen, H. Optimal energy management for HEVs in Eco-driving applications using bi-level MPC. IEEE Trans. Intell. Transp. Syst. 2017, 18, 2153–2162.
13. Tian, H.; Li, S.E.; Wang, X.; Huang, Y.; Tian, G. Data-driven hierarchical control for online energy management of plug-in hybrid electric city bus. Energy 2018, 142, 55–67.
14. Daming, Z.; Al-Durra, A.; Fei, G.; Simoes, M.G. Online energy management strategy of fuel cell hybrid electric vehicles based on data fusion approach. J. Power Sources 2017, 366, 278–291.
15. Martinez, C.M.; Hu, X.; Cao, D.; Velenis, E.; Gao, B.; Wellers, M. Energy Management in Plug-in Hybrid Electric Vehicles: Recent Progress and a Connected Vehicles Perspective. IEEE Trans. Veh. Technol. 2017, 66, 4534–4549.
16. Zheng, C.; Xu, G.; Xu, K.; Pan, Z.; Liang, Q. An energy management approach of hybrid vehicles using traffic preview information for energy saving. Energy Convers. Manag. 2015, 105, 462–470.
17. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement learning-based energy management strategy for a hybrid electric tracked vehicle. Energies 2015, 8, 7243–7260.
18. Qi, X.; Wu, G.; Boriboonsomsin, K.; Barth, M.J.; Gonder, J. Data-driven reinforcement learning-based real-time energy management system for plug-in hybrid electric vehicles. Transp. Res. Rec. 2016, 2572, 1–8.
19. Li, W.; Xu, G.; Xu, Y. Online learning control for hybrid electric vehicle. Chin. J. Mech. Eng. 2012, 25, 98–106.
20. Hu, Y.; Li, W.; Xu, H.; Xu, G. An online learning control strategy for hybrid electric vehicle based on fuzzy Q-learning. Energies 2015, 8, 11167–11186.
21. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
22. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
23. Springenberg, T.; Boedecker, J.; Riedmiller, M.; Obermayer, K. Autonomous learning of state representations for control. Künstliche Intell. 2015, 29, 1–10.
24. Wei, T.; Wang, Y.; Zhu, Q. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference, Austin, TX, USA, 18–22 June 2017.
25. Belletti, F.; Haziza, D.; Gomes, G.; Bayen, A.M. Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2017, 99, 1–10.
26. Sallab, A.E.; Abdou, M.; Perot, E.; Yogamani, S. End-to-end deep reinforcement learning for lane keeping assist. arXiv 2016.
27. Chae, H.; Kang, C.M.; Kim, B.D.; Kim, J.; Chung, C.C.; Choi, J.W. Autonomous braking system via deep reinforcement learning. arXiv 2017.
28. Zhang, T.; Kahn, G.; Levine, S.; Abbeel, P. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 528–535.
29. Qi, X.; Luo, Y.; Wu, G.; Boriboonsomsin, K.; Barth, M.J. Deep reinforcement learning-based vehicle energy efficiency autonomous learning system. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Redondo Beach, CA, USA, 11–14 June 2017.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).