Energy Management Strategy for a Hybrid Electric
Vehicle Based on Deep Reinforcement Learning
Yue Hu 1,2,3,†, Weimin Li 1,3,4,*, Kun Xu 1,†, Taimoor Zahid 1,2, Feiyan Qin 1,2 and Chenming Li 5
1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; yue.hu@siat.ac.cn (Y.H.); kun.xu@siat.ac.cn (K.X.); taimoor.z@siat.ac.cn (T.Z.); fy.qin@siat.ac.cn (F.Q.)
2 Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China
3 Jining Institutes of Advanced Technology, Chinese Academy of Sciences, Jining 272000, China
4 Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China
5 Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong 999077, China; 1155101099@link.cuhk.edu.hk
* Correspondence: wm.li@siat.ac.cn; Tel.: +86-159-1537-4149
† These authors contributed equally to this work.
Received: 28 December 2017; Accepted: 24 January 2018; Published: 26 January 2018
Abstract: An energy management strategy (EMS) is important for hybrid electric vehicles (HEVs), since it plays a decisive role in the performance of the vehicle. However, the variation of future driving conditions deeply influences the effectiveness of the EMS. Most existing EMS methods simply follow predefined rules that are not adaptive to different driving conditions online. Therefore, it is desirable that the EMS can learn from the environment or the driving cycle. In this paper, a deep reinforcement learning (DRL)-based EMS is designed such that it can learn to select actions directly from the states without any prediction or predefined rules. Furthermore, a DRL-based online learning architecture is presented, which is essential for applying the DRL algorithm to HEV energy management under different driving conditions. Simulation experiments have been conducted using MATLAB and Advanced Vehicle Simulator (ADVISOR) co-simulation. Experimental results validate the effectiveness of the DRL-based EMS compared with the rule-based EMS in terms of fuel economy. The online learning architecture is also shown to be effective. The proposed method ensures both optimality and real-time applicability for HEVs.
Keywords: hybrid electric vehicle; energy management strategy; deep reinforcement learning;
online learning
1. Introduction
An energy management strategy (EMS) is one of the key technologies for hybrid electric vehicles
(HEVs) due to its decisive effect on the performance of the vehicle [1]. The EMS for HEVs has been
a very active research field during the past decades. However, how to design a highly-efficient and
adaptive EMS is still a challenging task due to the complex structure of HEVs and the uncertain
driving cycle.
The existing EMS methods can be generally classified into the following three categories: (1) Rule-based EMS, such as the thermostatic strategy, the load-following strategy, and the electric assist strategy [2,3]. These methods rely heavily on the results of extensive experimental trials and human expertise, without a priori knowledge of the driving conditions [4]. Other related control strategies employ heuristic control techniques, with the resultant strategies formalized as fuzzy rules [5,6]. Though these rule-based strategies are effective and can be easily implemented, their optimality
and flexibility are critically limited by working conditions and, consequently, are not adaptive to
different driving cycles. (2) Optimization-based EMS: some optimization methods employed in the control strategy are based either on known driving cycles or on predicted future driving conditions, such as
dynamic programming (DP) [7–9], sequential quadratic programming (SQP), genetic algorithms
(GA) [10], the Pontryagin minimum principle (PMP) [11], and so on. Usually, these algorithms can
manage to determine the optimal power split between the engine and the motor for a particular
driving cycle. However, the obtained optimal power-split solutions are only optimal with respect to a specific driving cycle; in general, they are neither optimal nor charge-sustaining for other cycles.
Unless future driving conditions can be predicted during real-time operation, there is no way to apply these control laws directly. Moreover, these methods suffer from the “curse of dimensionality” problem,
which prevents their wide adoption in real-time applications. Model predictive control (MPC) [12]
is another type of optimization-based method. The optimal control problem in the finite domain is
solved at each sampling instant and control actions are obtained based on online rolling optimization.
This method has the advantages of good control effect and strong robustness. (3) Learning-based
EMS: some strategies can learn from the historical data or use the previous driving data for online
learning or application [13,14]. Some researchers propose that traffic information and cloud computing
in intelligent transportation systems (ITSs) can enhance HEV energy management since vehicles
obtain real-time data via intelligent infrastructures or connected vehicles [15,16]. Whether they learn from historical data or from predicted data, these EMS methods still need complex control models and professional knowledge from experts. Thus, these EMS methods are not end-to-end
control methods. Reinforcement learning-based control methods have also been used for HEV energy
management [17,18]. However, reinforcement learning must be able to learn from a scalar reward
signal that is frequently sparse, noisy, and delayed. Additionally, sequences of highly correlated states are a major problem for reinforcement learning, as is the change of the data distribution while the algorithm learns new behaviors.
The learning-based EMS is an emerging and promising method because of its potential ability of self-adaption to different driving conditions, even though some problems remain. In our previous research, online learning control strategies based on neural dynamic programming (NDP) [19] and fuzzy Q-learning (FQL) [20] were proposed. These strategies do not rely on prior
information related to future driving conditions and can self-tune the parameters of the algorithms.
A back propagation (BP) neural network was used to estimate the Q-value which, in turn, tuned the parameters of the fuzzy controller [20]. However, this approach still requires designing the fuzzy controller, as well as professional knowledge.
Deep reinforcement learning (DRL) has shown successful performance in playing Atari [21] and
Go games [22] in recent years. The DRL method is a powerful algorithm to solve complex control
problems and handle large state spaces by establishing a deep neural network to relate the value
estimation and associated state-action pairs. As a result, the DRL algorithm has been quickly applied in
robotics [23], building HVAC control [24], ramp metering [25], and other fields. In the automotive field,
DRL has been used for lane keeping assist [26], autonomous braking systems [27], and autonomous vehicles [28]. However, from our perspective, motion control of autonomous vehicles requires very high precision. The mechanism of DRL has not been explained very deeply and may not meet this strict requirement.
Nevertheless, DRL is a powerful technique that can be used for the HEV EMS in this research, since the EMS concerns fuel economy rather than control precision. A DRL-based EMS has been designed
for plug-in hybrid electric vehicles (PHEVs) [29]. This is the first time DRL has been applied to a
PHEV EMS. However, there are several problems in this study: (1) The learning process is still offline,
which means that the trained deep network can only work well in the same driving cycle, but would
not be able to obtain good performance in other driving conditions. As a result, this method can be used in buses with a fixed route; however, it is not acceptable for vehicles with route variation; (2) The
immediate reward is important as it affects the performance of DRL. The optimization objective is
the vehicle fuel economy, but the reward is a function based on the power supply from the engine.
The relationship between fuel economy and engine power is complex and the paper lacks the ability to justify this phenomenon; (3) The structure of the deep neural network can be well designed by fixing the Q target network, which can make the algorithm more stable.
In this research, an energy management strategy based on deep reinforcement learning is proposed. Our work achieves good performance and high scalability by (1) building the system model of the HEV and formulating the HEV energy management problem; (2) developing a DRL-based control framework and an online learning architecture for a HEV EMS, which is adapted to different driving conditions; and (3) facilitating algorithm training and evaluation in the simulation environment. Figure 1 illustrates our DRL-based algorithm for the HEV EMS. The DRL-based EMS can autonomously learn the optimal policy based on data inputs, without any prediction or predefined rules. For training and validation, we use the HEV model built in the ADVISOR software (National Renewable Energy Laboratory, Golden, CO, USA). Simulation results reveal that the algorithm is able to improve the fuel economy while meeting other requirements, such as dynamic performance and vehicle drivability.
Figure 1. Deep reinforcement learning (DRL)-based framework for HEV EMS.
The proposed DRL-based EMS uses a fixed target Q network which can make the algorithm more stable. The immediate reward is a function directly related to fuel consumption. More importantly, a DRL-based online learning architecture is presented. It is a critical factor for applying the DRL algorithm to HEV energy management under different driving conditions.
The rest of this paper is organized as follows: Section 2 introduces the system model of the HEV and describes the mathematical formulation of the HEV EMS. Section 3 explains our deep reinforcement learning-based control strategy, including the offline learning and online learning applications. The experimental results are given in Section 4, followed by the conclusions in Section 5.
2. Problem Formulation
The prototype vehicle is a single-axis parallel HEV, the drivetrain structure of which is shown in Figure 2. The drivetrain integrates an engine, an electric traction motor/generator, Ni-MH batteries, an automatic clutch, and an automated manual transmission system. The motor is directly linked between the auto clutch output and the transmission input. This architecture provides regenerative braking during deceleration and allows an efficient motor assist operation. To provide pure electrical propulsion, the engine can be disconnected from the drivetrain by the automatic clutch. We have adopted the vehicle model from our previous work [19,20] for this research. The key parameters of this vehicle are given in Table 1.
Figure 2. Drivetrain structure of the HEV.
Table 1. Summary of the HEV parameters.

Part or Vehicle | Parameters and Values
Spark Ignition (SI) engine | Displacement: 1.0 L; Maximum power: 50 kW/5700 r/min; Maximum torque: 89.5 Nm/5600 r/min
Permanent magnet motor | Maximum power: 10 kW; Maximum torque: 46.5 Nm
Advanced Ni-MH battery | Capacity: 6.5 Ah; Nominal cell voltage: 1.2 V; Total cells: 120
Automated manual transmission | 5-speed; GR: 2.2791/2.7606/3.5310/5.6175/11.1066
Vehicle | Curb weight: 1000 kg
In the following, the key concepts of the DRL-based EMS are formulated:

System state: In the DRL algorithm, the control action is directly determined by the system states. In this study, the total required torque (Tdem) and the battery state-of-charge (SOC) are selected to form a two-dimensional state space, i.e., s(t) = (Tdem(t), SOC(t))^T, where Tdem(t) represents the required torque at time t, and SOC(t) represents the battery state of charge at time t.

Control action: The decision-making on the torque-split ratio between the internal combustion engine (ICE) and the battery is the core problem of the HEV energy management strategy. We choose the output torque from the ICE as the control action in this study, denoted as A(t) = Te(t), where t is the time step index. Te(t) should be discretized in order to apply the DRL-based algorithm, i.e., the entire action space is A = {A1, A2, ..., An}, where n is the degree of discretization. In this research, we set n to 24. The motor output torque Tm(t) can be obtained by subtracting Te(t) from Tdem(t).
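As an illustration only, the following minimal Python sketch shows one way the discretized action space and the torque split described above could be represented. The torque bounds and the helper name are assumptions introduced here, not values from the paper; only the maximum engine torque (89.5 Nm) and n = 24 come from the text.

```python
import numpy as np

# Assumed engine torque bounds (Nm); 89.5 Nm is the SI engine's maximum torque from Table 1.
TE_MIN, TE_MAX = 0.0, 89.5
N_ACTIONS = 24               # degree of discretization n used in the paper

# Discretized ICE torque candidates A = {A_1, ..., A_n}
ACTION_SPACE = np.linspace(TE_MIN, TE_MAX, N_ACTIONS)

def split_torque(t_dem: float, action_index: int) -> tuple:
    """Return (engine torque Te, motor torque Tm) for a chosen action.

    Tm is obtained by subtracting Te from the total required torque Tdem,
    as defined in Section 2.
    """
    t_e = ACTION_SPACE[action_index]
    t_m = t_dem - t_e
    return t_e, t_m
```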
Immediate reward: The immediate reward is important in the DRL algorithm because it directly influences the parameter tuning of the deep neural network (DNN). The DRL agent always tries to maximize the reward which it can obtain by taking the optimal action at each time step.
Therefore, the immediate reward should be defined according to the optimization objective. The control
objective of the HEV EMS is to minimize vehicle fuel consumption and emissions along a driving
mission. Meanwhile, the vehicle drivability and battery health should be satisfied. In this work,
we focus more on fuel economy of the HEV; the emissions are not taken into consideration. Keeping
this objective in mind, the reciprocal of the ICE fuel consumption at each time step is defined as the
immediate reward. A penalty value is introduced to penalize the situation when the SOC exceeds the
threshold. The immediate reward is defined by the following equation:
$$
r_{ss'}^{a}=
\begin{cases}
\dfrac{1}{C_{ICE}} & C_{ICE}\neq 0 \ \cap \ 0.4 \leq SOC \leq 0.85 \\[4pt]
\dfrac{1}{C_{ICE}+C} & C_{ICE}\neq 0 \ \cap \ (SOC < 0.4 \ \text{or} \ SOC > 0.85) \\[4pt]
\dfrac{2}{Min_{C_{ICE}}} & C_{ICE}= 0 \ \cap \ 0.4 \leq SOC \\[4pt]
\dfrac{1}{Min_{C_{ICE}}}-C & C_{ICE}= 0 \ \cap \ SOC < 0.4
\end{cases}
\qquad (1)
$$
where r_{ss'}^{a} is the immediate reward generated when the state changes from s to s' by taking action a; C_{ICE} is the instantaneous fuel consumption value of the ICE; C is the numerical penalty, as well as the maximum instantaneous power supply from the ICE; Min_{C_{ICE}} is the minimum nonzero value of the ICE instantaneous fuel consumption. The SOC variation range is from 40% to 85% in this study. This definition can guarantee low ICE fuel consumption while satisfying the SOC constraints.
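To make the reward definition concrete, here is a simplified Python sketch of the reward logic described in the text: the reciprocal of the ICE fuel consumption, a penalty when the SOC leaves its window, and Min_{C_ICE} as a stand-in when the engine is off. It is not a line-by-line transcription of Equation (1), and the default constants are illustrative assumptions only.

```python
def immediate_reward(c_ice: float, soc: float,
                     min_c_ice: float = 1e-4,   # assumed minimum nonzero fuel rate Min_C_ICE
                     penalty: float = 50.0,     # assumed value for the numerical penalty C
                     soc_low: float = 0.4,
                     soc_high: float = 0.85) -> float:
    """Simplified reward: reciprocal of ICE fuel use, penalized when the SOC leaves its window."""
    soc_ok = soc_low <= soc <= soc_high
    if c_ice != 0.0:
        base = 1.0 / c_ice                      # engine on: reward low fuel consumption
    else:
        base = 1.0 / min_c_ice                  # engine off: keep the reward finite via Min_C_ICE
    return base if soc_ok else base - penalty   # penalize SOC violations
```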
Formally, the goal of the EMS of the HEV is to find the optimal control strategy, π ∗ , that maps
the observed states st to the control action at . Mathematically, the control strategy of the HEV can be
formulated as an infinite horizon dynamic optimization problem as follows:
$$R=\sum_{t=0}^{\infty}\gamma^{t}\, r(t) \qquad (2)$$
where r(t) is the immediate reward incurred by a_t at time t, and γ ∈ (0, 1) is a discount factor that ensures the convergence of the infinite sum. We use Q∗(st, at), i.e., the optimal value,
to represent the maximum accumulative reward which we can obtain by taking action at in state st .
Q∗ (st , at ) is calculated by the Bellman Equation as follows:
$$Q^{*}(s_t,a_t)=\mathbb{E}\Big[r_{t+1}+\gamma \max_{a_{t+1}} Q^{*}(s_{t+1},a_{t+1}) \,\Big|\, s_t,a_t\Big] \qquad (3)$$
The Q-learning method is used to update the value estimation, as shown in Equation (4).
$$Q_{t+1}(s_t,a_t)=Q_t(s_t,a_t)+\eta\Big(r_{t+1}+\gamma \max_{a_{t+1}} Q_t(s_{t+1},a_{t+1})-Q_t(s_t,a_t)\Big) \qquad (4)$$
where η ∈ (0, 1] represents the learning rate. Such a value iteration algorithm converges to the optimal
action value function, Qt → Q∗ as t → ∞ .
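As a minimal illustration of the update in Equation (4), the following Python sketch performs one tabular Q-learning step. The table shape and the discretization of the states are assumptions made only for the example.

```python
import numpy as np

n_states, n_actions = 1000, 24        # assumed discretization for illustration
Q = np.zeros((n_states, n_actions))   # tabular action-value estimate
gamma, eta = 0.99, 0.1                # discount factor and learning rate

def q_learning_step(s: int, a: int, r: float, s_next: int) -> None:
    """One application of Equation (4): move Q(s, a) toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += eta * (td_target - Q[s, a])
```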
3. Deep Reinforcement Learning-Based EMS
A deep reinforcement learning-based EMS is developed, which combines a deep neural network with conventional reinforcement learning. The EMS makes decisions based only on the current system state, since the proposed EMS is an end-to-end control strategy. This deep reinforcement neural network can also be called a deep Q-network (DQN). In the rest of this section, the value function approximation, the DRL algorithm design, and the online learning application of the DRL-based algorithm are presented.
3.1. Value Function Approximation
In conventional reinforcement learning, the state-action value is represented by a large, but finite, table over states and actions, i.e., the Q table. In this work, however, a deep neural network is used to approximate the Q-value given by Equation (3). As depicted in Figure 3, the inputs of the network are the system states, which are defined in Section 2. The rectified linear unit (ReLU) is used as the activation function for the hidden layers, and a linear layer is used to obtain the action values at the output layer. In order to balance exploration and exploitation, the ε-greedy policy is used for action selection, i.e., the policy chooses the maximum Q-value action with probability 1 − ε and selects a random action with probability ε.
Figure 3. Structure of the neural network.
The Q-value estimation for all control actions can be calculated by performing a forward pass through the neural network. The mean squared error between the target Q-value and the inferred output of the neural network is defined as the loss function in Equation (5):

$$L(\theta)=\mathbb{E}\Big[\big(r+\gamma \max_{a_{t+1}} Q(s_{t+1},a_{t+1},\theta^{-})-Q(s_t,a_t,\theta)\big)^{2}\Big] \qquad (5)$$

where Q(st, at, θ) is the output of the neural network with the parameters θ, and r + γ max_{a_{t+1}} Q(s_{t+1}, a_{t+1}, θ⁻) is the target Q-value, using parameters θ⁻ from some previous iteration. This fixed target Q network makes the algorithm more stable. The parameters in the neural network are updated by the gradient descent method.
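The following PyTorch sketch illustrates the kind of network and loss described above, using the layer sizes reported later in Section 4.1.1 (two inputs, hidden layers of 20, 50, and 100 ReLU units, and 24 linear outputs). It is an illustrative reading of the paper's description, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network: state (Tdem, SOC) -> 24 action values."""
    def __init__(self, n_states: int = 2, n_actions: int = 24):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_states, 20), nn.ReLU(),
            nn.Linear(20, 50), nn.ReLU(),
            nn.Linear(50, 100), nn.ReLU(),
            nn.Linear(100, n_actions),   # linear output layer gives Q(s, a) for every action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

q_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(q_net.state_dict())   # θ⁻ is a frozen copy of θ

def td_loss(batch, gamma: float = 0.99) -> torch.Tensor:
    """Mean squared error between the fixed-target value and Q(s, a, θ), as in Equation (5)."""
    s, a, r, s_next = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```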
The inputs of the DQN are the total required torque Tdem and the battery SOC. The variation range of the SOC is from 0 to 1 and does not need preprocessing. However, the total required torque Tdem can vary significantly. In order to facilitate the learning process, we scale the total required torque Tdem to the range [−1, 1] before feeding it to the neural network, as shown in Equation (6). The minimum and maximum values for Tdem can be obtained from historical observation:

$$T_{dem}'=\frac{T_{dem}-\min(T_{dem})}{\max(T_{dem})-\min(T_{dem})} \qquad (6)$$
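A one-line Python sketch of the min–max scaling in Equation (6); the torque bounds below are placeholders that would, in practice, come from historical observation.

```python
def scale_torque(t_dem: float, t_min: float = -200.0, t_max: float = 200.0) -> float:
    """Min-max scaling of the required torque before it is fed to the network (Equation (6))."""
    return (t_dem - t_min) / (t_max - t_min)
```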
3.2. DRL Algorithm Design

Our DRL-based EMS control algorithm is presented in Algorithm 1. The outer loop controls the number of training episodes, while the inner loop performs the EMS control at each time step within one training episode.
Algorithm 1: Deep Q-Learning with Experience Replay
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
1: For episode = 1, M do
2:   Reset environment: s0 = (SOC_Initial, T0)
3:   For t = 1, T do
4:     With probability ε select a random action at; otherwise select at = argmax_a Q(st, a; θ)
5:     Execute action at and observe the reward rt
6:     Set st+1 = (SOCt+1, Tt+1)
7:     Store (st, at, rt, st+1) in memory D
8:     Sample a random mini-batch of (sj, aj, rj, sj+1) from D
9:     If sj+1 is terminal: set yj = rj; else set yj = rj + γ max_{aj+1} Q̂(sj+1, aj+1; θ⁻)
10:    Perform a gradient descent step on (yj − Q(sj, aj; θ))²
11:    Every C steps reset Q̂ = Q
12:  end for
13: end for
In order to avoid the strong correlations between the samples in a short time period of conventional
RL, experience replay is adopted to store the experience (i.e., a batch of state, action, reward, and next
state: (st, at, rt, st+1)) at each time step in a data experience pool. At regular intervals, random samples of experience are drawn from the experience pool and used to train the Q network.
We initialize memory D as an empty set. Then we initialize weights θ in the action-value function
estimation Q neural network. In order to break the dependency loop between the target value and
weights θ, a separate neural network Q̂ with weights θ − is created for calculating the target Q value.
The maximum number of episodes is set to M. During the learning process, in step 4,
the algorithm selects the maximum Q value action with probability 1 − ε and selects a random action
with probability ε based on the observation of the state. In step 5, action at is executed and reward rt is
obtained. In step 6, the system state becomes the next state. In step 7, the state action transition tuple is
stored in memory. Then, a mini-batch of transition tuples is drawn randomly from the memory. Step 9
calculates the target Q value. The weights in neural network Q are updated by using the gradient
descent method in step 10. The network Q̂ is periodically updated by copying parameters from the
network Q in step 11.
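The following Python sketch ties Algorithm 1 together: an experience-replay buffer, ε-greedy action selection, a gradient step on the fixed-target TD loss, and a periodic copy of θ into θ⁻. It reuses the hypothetical QNetwork and td_loss from the earlier sketch and a generic env interface that returns tensors; the target-update period and the ε annealing schedule are assumptions, while the remaining hyper parameters follow Table 2.

```python
import random
from collections import deque

import torch

replay = deque(maxlen=1000)                                    # replay memory D (capacity from Table 2)
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.00025)   # learning rate from Table 2
batch_size, replay_start = 32, 200                             # mini-batch and replay start size from Table 2
sync_every = 500                                               # assumed target-update period C (not given)
epsilon = 1.0                                                  # initial exploration from Table 2

def select_action(state: torch.Tensor) -> int:
    """ε-greedy selection over the 24 discrete ICE torque actions (step 4)."""
    if random.random() < epsilon:
        return random.randrange(24)
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def train_episode(env, steps: int) -> None:
    """One episode of Algorithm 1; env is a generic simulator returning state tensors."""
    global epsilon
    state = env.reset()                                        # s0 = (SOC_initial, T0)
    for t in range(1, steps + 1):
        action = select_action(state)
        next_state, reward = env.step(action)                  # execute a_t, observe r_t and s_{t+1}
        replay.append((state, action, reward, next_state))     # store the transition in D
        if len(replay) >= replay_start:
            batch = random.sample(replay, batch_size)          # random mini-batch from D
            s = torch.stack([b[0] for b in batch])
            a = torch.tensor([b[1] for b in batch])
            r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
            s_next = torch.stack([b[3] for b in batch])
            loss = td_loss((s, a, r, s_next))                  # fixed-target TD loss (Equation (5))
            optimizer.zero_grad()
            loss.backward()                                    # gradient step on (y_j - Q(s_j, a_j; θ))²
            optimizer.step()
        if t % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())     # every C steps reset Q̂ = Q
        state = next_state
        epsilon = max(0.2, epsilon - 1e-4)                     # anneal toward the final exploration 0.2
```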
3.3. DRL-Based Algorithm Online Learning Application
In Section 3.2, the DRL-based algorithm is proposed; however, it is an offline learning algorithm
which can only be applied in the simulation environment. More importantly, the training process can
only be applied to limited driving cycles; therefore, the trained DQN only performs well under the
learned driving conditions, which may not provide satisfactory results under other driving cycles.
This is unacceptable in HEV real-time applications. As a result, online learning is necessary for
DRL-based algorithms in HEV EMS applications.
The DRL-based online learning architecture is presented in Figure 4. Action execution and
network training should be separated. There is a controller which contains a Q neural network and
selects an action for the HEV while storing the state action transitions. When the HEV needs to
learn a new driving cycle, the method of action selection will be the ε − greedy method. Otherwise,
the HEV can always select the maximum Q-value action. There is another on-board computer or
remote computing center which is responsible for Q neural network training. The on-board computer
or remote computing center obtains state action transitions from the action controller and trains the
neural network based on the DRL algorithm. The Q neural network is periodically updated by copying parameters from the on-board computer or remote computing center.
Figure 4. DRL-based online learning architecture.
The communication mode between the action controller and the on-board computer can be via CAN bus or Ethernet. If the Q neural network training is completed by a remote computing center, a vehicle terminal named Telematics-box (T-box) should be installed in the HEV in order to communicate with the remote computing center through the 3G communication network. A remote computing center can obtain state action transitions from other connected HEVs, as is shown in Figure 4. This is useful to train a large Q neural network which can deal with different driving conditions.
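As a conceptual Python sketch only (the class and method names are invented for illustration, and the Q network is the hypothetical one from the earlier sketches), the separation between the on-vehicle action controller and the training side described above might look like this:

```python
import random

import torch


class ActionController:
    """On-vehicle side: selects actions and buffers transitions for the trainer."""

    def __init__(self, q_net, n_actions: int = 24, epsilon: float = 0.0):
        self.q_net = q_net          # local copy of the Q network
        self.n_actions = n_actions
        self.epsilon = epsilon      # > 0 only while a new driving cycle is being learned
        self.outbox = []            # transitions to send via CAN bus, Ethernet, or the T-box link

    def act(self, state: torch.Tensor) -> int:
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q_net(state).argmax().item())

    def record(self, transition) -> None:
        self.outbox.append(transition)     # (s_t, a_t, r_t, s_{t+1})

    def sync(self, parameters) -> None:
        self.q_net.load_state_dict(parameters)   # periodic update with trained parameters


class RemoteTrainer:
    """On-board computer or remote computing center: trains the Q neural network."""

    def __init__(self, q_net):
        self.q_net = q_net
        self.replay = []

    def receive(self, transitions) -> None:
        self.replay.extend(transitions)    # may aggregate data from several connected HEVs

    def export_parameters(self):
        # ...run the Algorithm 1 updates on self.replay here...
        return self.q_net.state_dict()     # sent back to the vehicle's action controller
```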
The main differences between online learning and offline learning are as follows: (1) online learning can adapt to varying driving conditions, while offline learning can only learn from the given driving cycles; (2) action execution and network training should be separated in online learning because of the limited computing ability of the on-board controller; and (3) the online training efficiency should be higher than that of offline training, since the vehicle must learn the optimal EMS within the shortest time. Thus, it is necessary to cluster the representative state action transitions and use the recent data in the experience pool. Interestingly, offline learning and online learning can be combined to realize a good EMS. For instance, we can train the DQN offline under the Urban Dynamometer Driving Schedule (UDDS), and then apply the online learning under the New European Driving Cycle (NEDC).

4. Experimental Results and Discussion

4.1. Offline Application

4.1.1. Experiment Setup

In order to evaluate the effectiveness of the proposed DRL-based algorithm, simulation experiments are conducted in the MATLAB and ADVISOR co-simulation environment. The offline learning application is evaluated first, and the UDDS driving cycle is used in the learning process.
The simulation model for the HEV described in Section 2 is built in ADVISOR. Meanwhile, the hyper parameters of the DRL-based algorithm used in the simulations are summarized in Table 2.

Table 2. Summary of the DRL-based algorithm hyper parameters.

Hyper Parameter | Value
mini-batch size | 32
replay memory size | 1000
discount factor γ | 0.99
learning rate | 0.00025
initial exploration | 1
final exploration | 0.2
replay start size | 200

In this application, the input layer of the network has two neurons, i.e., Tdem and SOC. There are three hidden layers having 20, 50, and 100 neurons, respectively. The output layer has 24 neurons representing the discrete ICE torque. All these layers are fully connected. The network is trained for 50 episodes, and each episode means a trip (1369 s).

We evaluate the performance of the DRL-based EMS by comparing it with the rule-based EMS known as the “Parallel Electric Assist Control Strategy” [20]. The initial SOC is 0.8.

4.1.2. Experimental Results

Firstly, we evaluate the learning performance of the DRL-based algorithm. The track of the average loss is recorded in Figure 5. It is clear that the average loss decreases quickly along the training process. Figure 6 depicts the track of the total reward of one episode along the training process. Even though the curve oscillates, the overall trend of the track is rising. There are also some dramatic drops in the total reward during the training process. This is because a large penalty is added when the algorithm selects actions that result in the violation of the SOC constraint.

Figure 5. Track of loss.

Figure 6. Track of the total reward.

Then, the simulation results of the trained DRL-based EMS for the UDDS driving cycle are shown in Figure 7. In order to evaluate the performance and effectiveness of the trained DRL-based EMS, comparison results are listed in Table 3. Power consumption is converted to fuel consumption; the equivalent fuel consumption is obtained by adding the converted power consumption and the fuel consumption. As shown by the results in Table 3, fuel consumption is improved significantly compared to the rule-based control strategy: fuel consumption is decreased by 10.09%. Meanwhile, the equivalent fuel consumption is also decreased by 8.05%. The DRL-based EMS achieves good performance. Notably, the rule-based EMS is designed by experts, while the DRL-based EMS only learns from the states and historical data.

Figure 7. Simulation results of the trained EMS under UDDS.

Table 3. Comparison of the results under UDDS.

Control Strategy | Fuel Consumption (L/100 km) | Equivalent Fuel Consumption (L/100 km)
Rule-Based | 3.857 | 3.861
DRL-based | 3.468 | 3.550

4.2. Online Application

4.2.1. Experiment Setup
The DRL-based online learning architecture is presented in Section 3.3. In order to evaluate the online learning performance conveniently, we also use the ADVISOR software to simulate the online learning working process. In the online learning application, the neural network setting is the same as in the offline application. Two different kinds of simulations are performed. In the first scenario, the neural network parameters are random at the beginning and, in the second one, the neural network is pre-trained offline under an existing driving cycle before the online learning process. In the first case, the online learning simulation without any pre-training is done under the NEDC driving cycle. In the second case, we first pre-train the neural network offline under the UDDS driving cycle, and then apply the online learning under the NEDC driving cycle.

4.2.2. Experimental Results

In the first case, we trained the neural network 50 times under the NEDC driving cycle with the same initial condition. Unlike the offline learning, this process is online and simulates the vehicle running under the NEDC driving cycle. The track of the loss is depicted in Figure 8. The loss also decreases quickly along the training process in the online application. Figure 9 depicts the track of the total reward and the fuel consumption of one driving cycle along the training process, and the overall trend of the total reward is the same as in the offline application. This reveals that the proposed DRL-based online learning architecture is effective. As we can see from Figure 9, the trends of the total reward and the fuel consumption are nearly opposite. This reflects that the definition of the reward is suitable.

Figure 8. Track of loss in the online application.

Figure 9. Track of the total reward and fuel consumption in the online application.

Simulation results of the online trained DRL-based EMS for the NEDC driving cycle are shown in Figure 10. The comparison of the results is listed in Table 4. Fuel consumption is also improved compared to the rule-based control strategy, as fuel consumption is decreased by 10.29%, while the equivalent fuel consumption is decreased by 2.57%.

Figure 10. Simulation results of the trained EMS under NEDC.

Table 4. Comparison of the results under NEDC.

Control Strategy | Fuel Consumption (L/100 km) | Equivalent Fuel Consumption (L/100 km)
Rule-Based | 3.877 | 3.892
DRL-based | 3.478 | 3.792

In the second case, the neural network was pre-trained offline under the UDDS driving cycle, such that the DRL-based EMS can adapt to the UDDS driving cycle but has no a priori knowledge about the NEDC driving cycle. The comparison of the results between the offline-trained DRL-based EMS (trained under the UDDS driving cycle and applied to the NEDC driving cycle) and the other control strategies is listed in Table 5. It is obvious that the EMS trained offline under the UDDS driving cycle does not adapt well to the NEDC driving cycle.

Table 5. Comparison of the results under NEDC.

Control Strategy | Fuel Consumption (L/100 km) | Equivalent Fuel Consumption (L/100 km)
Rule-Based | 3.877 | 3.892
DRL-based EMS trained under NEDC online | 3.478 | 3.792
DRL-based EMS only pre-trained under UDDS offline | 3.690 | 3.872

Based on the EMS pre-trained offline under the UDDS driving cycle, we can apply the online learning process under the NEDC driving cycle. We trained the pre-trained neural network 20 times under the NEDC driving cycle with the same initial conditions. After the training process, we tested the trained EMS under the NEDC driving cycle. The simulation results are shown in Figure 11. The comparison of the results is listed in Table 6. The results show that the pre-training process can effectively decrease the online training time. This is because the DRL-based EMS learns some of the same features between different driving conditions.

Figure 11. Simulation results under NEDC of the online trained EMS which was pre-trained under UDDS.

Table 6. Comparison of the results under NEDC.

Control Strategy | Fuel Consumption (L/100 km) | Equivalent Fuel Consumption (L/100 km)
DRL-based EMS trained under NEDC online | 3.478 | 3.792
DRL-based EMS only pre-trained under UDDS offline | 3.690 | 3.872
DRL-based EMS pre-trained offline and trained under NEDC online | 3.440 | 3.795

Figure 12 depicts the track of the total reward and the fuel consumption of one driving cycle along the training process; the curves are smoother than the curves without pre-training. This also reveals that pre-training offline can improve the online learning efficiency even though the driving condition is different.

Figure 12. Track of the total reward and fuel consumption in the online application in which the EMS was pre-trained firstly.
5. Conclusions
This paper presents a deep reinforcement learning-based, data-driven approach to obtain an energy management strategy for a HEV. The proposed method combines Q-learning and a deep neural network to form a deep Q network which can obtain the action directly from the states. Key concepts of the DRL-based EMS have been formulated. The value function approximation and the DRL algorithm design have been described in detail in this paper. In order to adapt to varying driving cycles, a DRL-based online learning architecture has been presented. Simulation results demonstrate that the DRL-based EMS can obtain better fuel economy than the rule-based EMS. Furthermore, the online learning approach can learn from different driving conditions. Future work will focus on how to improve the online learning efficiency and on testing on a real vehicle. Another important issue is how to output continuous actions. In this paper, the output actions are discretized and this may lead to violent oscillation of the ICE output torque. A deep deterministic policy gradient (DDPG) algorithm can output continuous actions and may solve this problem; this will be a future work. However, DDPG is also based on DRL. The contribution of this paper will speed up the application of deep reinforcement learning methods in the energy management of HEVs.
Acknowledgments: This research was supported by the National Natural Science Foundation of China (61573337), the National Natural Science Foundation of China (61603377), the National Natural Science Foundation of China (61273139), and the Technical Research Foundation of Shenzhen (JSGG20141020103523742).

Author Contributions: Yue Hu and Kun Xu proposed the method and wrote the main part of this paper; Weimin Li supported this article financially and checked the paper; Taimoor Zahid checked the grammar of this paper; and Feiyan Qin and Chenming Li analyzed the data.

Conflicts of Interest: The authors declare no conflict of interest.
References

1. Un-Noor, F.; Padmanaban, S.; Mihet-Popa, L.; Mollah, M.N.; Hossain, E. A Comprehensive Study of Key Electric Vehicle (EV) Components, Technologies, Challenges, Impacts, and Future Direction of Development. Energies 2017, 10, 1217. [CrossRef]
2. Wirasingha, S.G.; Emadi, A. Classification and review of control strategies for plug-in hybrid electric vehicles. IEEE Trans. Veh. Technol. 2011, 60, 111–122. [CrossRef]
3. Gokce, K.; Ozdemir, A. An instantaneous optimization strategy based on efficiency maps for internal combustion engine/battery hybrid vehicles. Energy Convers. Manag. 2014, 81, 255–269. [CrossRef]
4. Odeim, F.; Roes, J.; Wulbeck, L.; Heinzel, A. Power management optimization of fuel cell/battery hybrid vehicles with experimental validation. J. Power Sources 2014, 252, 333–343. [CrossRef]
5. Li, C.; Liu, G. Optimal fuzzy power control and management of fuel cell/battery hybrid vehicles. J. Power Sources 2009, 192, 525–533. [CrossRef]
6. Denis, N.; Dubois, M.R.; Desrochers, A. Fuzzy-based blended control for the energy management of a parallel plug-in hybrid electric vehicle. IET Intell. Transp. Syst. 2015, 9, 30–37. [CrossRef]
7. Ansarey, M.; Panahi, M.S.; Ziarati, H.; Mahjoob, M. Optimal energy management in a dual-storage fuel-cell hybrid vehicle using multi-dimensional dynamic programming. J. Power Sources 2014, 250, 359–371. [CrossRef]
8. Vagg, C.; Akehurst, S.; Brace, C.J.; Ash, L. Stochastic dynamic programming in the real-world control of hybrid electric vehicles. IEEE Trans. Control Syst. Technol. 2016, 24, 853–866. [CrossRef]
9. Qin, F.; Xu, G.; Hu, Y.; Xu, K.; Li, W. Stochastic optimal control of parallel hybrid electric vehicles. Energies 2017, 10, 214. [CrossRef]
10. Chen, Z.; Mi, C.C.; Xiong, R.; Xu, J.; You, C. Energy management of a power-split plug-in hybrid electric vehicle based on genetic algorithm and quadratic programming. J. Power Sources 2014, 248, 416–426. [CrossRef]
11. Xie, S.; Li, H.; Xin, Z.; Liu, T.; Wei, L. A Pontryagin minimum principle-based adaptive equivalent consumption minimum strategy for a plug-in hybrid electric bus on a fixed route. Energies 2017, 10, 1379. [CrossRef]
12. Guo, L.; Guo, B.; Guo, Y.; Chen, H. Optimal energy management for HEVs in eco-driving applications using bi-level MPC. IEEE Trans. Intell. Transp. Syst. 2017, 18, 2153–2162. [CrossRef]
13. Tian, H.; Li, S.E.; Wang, X.; Huang, Y.; Tian, G. Data-driven hierarchical control for online energy management of plug-in hybrid electric city bus. Energy 2018, 142, 55–67. [CrossRef]
14. Daming, Z.; Al-Durra, A.; Fei, G.; Simoes, M.G. Online energy management strategy of fuel cell hybrid electric vehicles based on data fusion approach. J. Power Sources 2017, 366, 278–291.
15. Martinez, C.M.; Hu, X.; Cao, D.; Velenis, E.; Gao, B.; Wellers, M. Energy management in plug-in hybrid electric vehicles: Recent progress and a connected vehicles perspective. IEEE Trans. Veh. Technol. 2017, 66, 4534–4549. [CrossRef]
16. Zheng, C.; Xu, G.; Xu, K.; Pan, Z.; Liang, Q. An energy management approach of hybrid vehicles using traffic preview information for energy saving. Energy Convers. Manag. 2015, 105, 462–470. [CrossRef]
17. Liu, T.; Zou, Y.; Liu, D.; Sun, F. Reinforcement learning-based energy management strategy for a hybrid electric tracked vehicle. Energies 2015, 8, 7243–7260. [CrossRef]
18. Qi, X.; Wu, G.; Boriboonsomsin, K.; Barth, M.J.; Gonder, J. Data-driven reinforcement learning-based real-time energy management system for plug-in hybrid electric vehicles. Transp. Res. Rec. 2016, 2572, 1–8. [CrossRef]
19. Li, W.; Xu, G.; Xu, Y. Online learning control for hybrid electric vehicle. Chin. J. Mech. Eng. 2012, 25, 98–106. [CrossRef]
20. Hu, Y.; Li, W.; Xu, H.; Xu, G. An online learning control strategy for hybrid electric vehicle based on fuzzy Q-learning. Energies 2015, 8, 11167–11186. [CrossRef]
21. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
22. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [CrossRef] [PubMed]
23. Springenberg, T.; Boedecker, J.; Riedmiller, M.; Obermayer, K. Autonomous learning of state representations for control. Künstliche Intell. 2015, 29, 1–10.
24. Wei, T.; Wang, Y.; Zhu, Q. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference, Austin, TX, USA, 18–22 June 2017.
25. Belletti, F.; Haziza, D.; Gomes, G.; Bayen, A.M. Expert level control of ramp metering based on multi-task deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2017, 99, 1–10. [CrossRef]
26. Sallab, A.E.; Abdou, M.; Perot, E.; Yogamani, S. End-to-end deep reinforcement learning for lane keeping assist. arXiv 2016.
27. Chae, H.; Kang, C.M.; Kim, B.D.; Kim, J.; Chung, C.C.; Choi, J.W. Autonomous braking system via deep reinforcement learning. arXiv 2017.
28. Zhang, T.; Kahn, G.; Levine, S.; Abbeel, P. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 528–535.
29. Qi, X.; Luo, Y.; Wu, G.; Boriboonsomsin, K.; Barth, M.J. Deep reinforcement learning-based vehicle energy efficiency autonomous learning system. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Redondo Beach, CA, USA, 11–14 June 2017.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).