Optimizing a Start-Stop System to Minimize Fuel
Consumption using Machine Learning

by

Noel Hollingsworth

S.B., C.S. M.I.T., 2012

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Signature redacted
Department of Electrical Engineering and Computer Science
December 10, 2013

Certified by: Signature redacted
Leslie Pack Kaelbling
Panasonic Professor of Computer Science and Engineering
Thesis Supervisor

Accepted by: Signature redacted
Albert R. Meyer
Chairman, Masters of Engineering Thesis Committee
Optimizing a Start-Stop System to Minimize Fuel
Consumption using Machine Learning
by
Noel Hollingsworth
Submitted to the Department of Electrical Engineering and Computer Science
on December 10, 2013, in partial fulfillment of the
requirements for the degree of
Master of Engineering in Electrical Engineering and Computer Science
Abstract
Much work has gone into improving the efficiency of cars' engines. One approach
to maximizing efficiency has been to create start-stop systems. These systems shut
the car's engine off when the car comes to a stop, saving fuel that would be used to
keep the engine running. However, these systems introduce additional energy costs
associated with restarting the engine, and these costs must be balanced against the
savings by the system.
In this thesis I describe my work with Ford to improve the performance of their
start-stop controller, optimizing the controller both for the general population and
for individual drivers. I use reinforcement-learning techniques in both cases to find
the best-performing controller. I find a 27% improvement over Ford's current controller
when optimizing for the general population, and then find an additional 1.6% improvement
over the improved controller when optimizing for an individual.
Thesis Supervisor: Leslie Pack Kaelbling
Title: Panasonic Professor of Computer Science and Engineering
Contents

1  Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
   1.1  Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
   1.2  Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2  Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
   2.1  Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 12
   2.2  Policy Search Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
        2.2.1  Policy Gradient Algorithms . . . . . . . . . . . . . . . . . . . . 14
        2.2.2  Kohl-Stone Policy Gradient . . . . . . . . . . . . . . . . . . . . 15
        2.2.3  Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . 17
        2.2.4  Covariance Matrix Adaptation Evolution Strategy . . . . . . . . . 18
   2.3  Model Based Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . 19
        2.3.1  Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . 20
        2.3.2  Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . 21
   2.4  Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
        2.4.1  Machine Learning for Adaptive Power Management . . . . . . . . . . 22
        2.4.2  Design, Analysis, and Learning Control of a Fully Actuated Micro
               Wind Turbine . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
        2.4.3  Reinforcement learning for energy conservation in buildings . . . 23
        2.4.4  Adaptive Data centers . . . . . . . . . . . . . . . . . . . . . . . 23
   2.5  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3  Learning a Policy for a Population of Drivers . . . . . . . . . . . . . . . . . 25
   3.1  Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
   3.2  Simulator Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
        3.2.1  Simulator Output . . . . . . . . . . . . . . . . . . . . . . . . . 29
   3.3  Optimizing For a Population . . . . . . . . . . . . . . . . . . . . . . . . 32
        3.3.1  Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 34
        3.3.2  Optimizing For a Subset of a Larger Population . . . . . . . . . . 36
        3.3.3  Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 38
   3.4  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4  Optimizing a Policy for an Individual Driver . . . . . . . . . . . . . . . . . . 41
   4.1  Problem Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
        4.1.1  Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
   4.2  Kohl-Stone Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 45
        4.2.1  Initial Results . . . . . . . . . . . . . . . . . . . . . . . . . . 45
        4.2.2  Finding Policies in Parallel . . . . . . . . . . . . . . . . . . . 48
        4.2.3  Improving the Convergence Speed of the Algorithm . . . . . . . . . 51
        4.2.4  Determining the Learning Rate for Kohl-Stone . . . . . . . . . . . 53
        4.2.5  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
   4.3  Cross-Entropy Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
        4.3.1  Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
        4.3.2  Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
   4.4  Model Based Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . 60
        4.4.1  Experiment With Normally Distributed Output Variables . . . . . . . 63
        4.4.2  Experiment with Output Variable Drawn from Logistic Distribution . 65
        4.4.3  Nearest Neighbors Regression . . . . . . . . . . . . . . . . . . . 66
        4.4.4  Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
   4.5  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5  Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
   5.1  Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
List of Figures

2.1  Kohl-Stone Policy Search Pseudo-Code . . . . . . . . . . . . . . . . . . . . . 16
2.2  CEM Pseudo-code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1  Hypothetical rule-based policy . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2  Vehicle Data Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3  Simulator State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4  Simulator Speed Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5  Simulator Brake Pressure Comparison . . . . . . . . . . . . . . . . . . . . . 31
3.6  Simulator Stop Length Comparison . . . . . . . . . . . . . . . . . . . . . . . 32
3.7  Kohl-Stone Convergence Graph . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1  Results of Kohl-Stone for Individual Drivers . . . . . . . . . . . . . . . . . 46
4.2  Moving Average of Kohl-Stone for Individuals . . . . . . . . . . . . . . . . . 47
4.3  Kohl-Stone Result Histogram . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4  Parallel Kohl-Stone Results . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5  Kohl-Stone Results When Sped Up . . . . . . . . . . . . . . . . . . . . . . . 53
4.6  Results from Kohl-Stone with a Slower Learning Rate . . . . . . . . . . . . . 55
4.7  Results from Kohl-Stone with a Faster Learning Rate . . . . . . . . . . . . . 56
4.8  Cost Improvement from CMAES with an Initial Covariance of 0.001 . . . . . . . 59
4.9  Average Covariance Values for CMAES . . . . . . . . . . . . . . . . . . . . . 60
4.10 Cost Improvement from CMAES with an Initial Covariance of 0.005 and 0.008 . . 61
4.11 Stop Length Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.12 Error from Gaussian Processes as Function of Training Set Size . . . . . . . . 64
4.13 Comparison of Gaussian Processes Predictions with Actual Results . . . . . . . 66
4.14 Comparison of Gaussian Processes Predictions with Actual Results with Laplace
     Output Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.15 Comparison of Nearest Neighbors Predictions with Actual Results . . . . . . . 68
Chapter 1
Introduction
Energy consumption is one of the most important problems facing the world today. Most
current energy sources pollute and are not renewable, which makes minimizing their use
critical. The most notable way that Americans use non-renewable sources of energy is by
driving their cars. Therefore, any improvement to the efficiency of cars being driven could
have a major effect on total energy consumption.
This reduction can be significant even if the gain in efficiency is small. In 2009 Ford sold
over 4.8 million automobiles [1], with an average fuel efficiency of 27.1 miles per gallon [2].
Assuming these vehicles are driven 15,000 miles a year, a 0.01 mile per gallon improvement
in efficiency would save more than 980,000 gallons of gasoline per year.
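To make the arithmetic explicit, the saving follows directly from the figures above:

\frac{4.8 \times 10^{6} \times 15{,}000}{27.1\ \text{mpg}} - \frac{4.8 \times 10^{6} \times 15{,}000}{27.11\ \text{mpg}} \approx 2.6568 \times 10^{9} - 2.6558 \times 10^{9} \approx 9.8 \times 10^{5}\ \text{gallons per year.}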
One approach that automobile manufacturers have used to improve fuel efficiency is to create
start-stop systems for use in their cars. These systems turn off the vehicle's engine when
it comes to a halt, saving fuel that would otherwise be expended running the engine while
the car was stationary. Start-stop systems have primarily been deployed in hybrid vehicles,
but are now being deployed in automobiles with pure combustion engines as well [3]. These
systems have resulted in a 4-10% improvement in fuel efficiency in hybrid cars, but even
greater gains should be possible. Although the systems are generally efficient, shutting the
car's engine down for short periods of time results in losing more energy from the engine
restarting than is gained from the engine being off. By not turning off the car's engine as
often for shorter stops and turning it off more often for longer stops, it may be possible to
conserve substantially more energy. My thesis will explore several methods of improving a
controller's performance by making more accurate predictions of drivers' stopping behavior.
1.1
Problem Overview
In this thesis I describe my work with Ford to improve the performance of their start-stop
controller. I will examine several methods of improving the performance of the controller.
The performance was measured by a cost function defined by Ford that was a combination
of three factors: the energy savings from turning the car's engine off, the energy cost from
restarting the car's engine after turning it off, and the annoyance felt by the driver when
they attempted to accelerate if the car's engine was still shut down.
My goal was to improve the performance of the controller by altering its policy for shutting
the engine down and turning the engine back on. I did not define the controller's entire
policy. Instead, I was given a black-box controller with five parameters for a policy for the
start-stop controller's rule-based system, which determined when to turn the car's engine off
and when to turn it back on. I attempted to change the parameters of the system's policy
in order to maximize the performance of the system.
I was given less than ten hours of driving data to optimize the policy. This is not enough
data to determine the effectiveness of the various approaches I used, so I created a simulator
in order to measure their effectiveness. The simulator was designed to have custom drivers.
These drivers were sampled from a distribution where the mean behavior of the drivers would
match the behavior from Ford's data. This allowed me to measure the performance of the
algorithms without using real data. It is still important to test the algorithms on real data
before implementing them in production vehicles, in order to determine how drivers react
to different policies. I will describe what sort of data would need to be collected when I
discuss the effectiveness of each algorithm.
I took two broad approaches to optimizing the policy.
First, in chapter three, I discuss
optimizing a single policy for all drivers in a population. This approach makes it easy for
Ford to ensure that the policy chosen is sensible, and requires no additional computing
resources in the car other than those already being used. Second, in chapter four, I discuss
optimizing a policy in the car's computer for the driver of the car. This has the potential for
better results but requires additional computing resources online. In addition, Ford would
have no way of knowing what policies would be used by the controller. It is up to Ford to
balance the trade-off between complexity and performance, but in each section I will explain
the advantages and disadvantages of the approach I used.
1.2
Outline
After the introduction, this thesis consists of four additional chapters.
" In chapter two I discuss the background of the problem. First I describe the approaches
I used to optimize the controller's performance. I then discuss several other applications
of machine learning to energy conservation, comparing and contrasting the algorithms
they used with the ones I experimented with.
• In chapter three I discuss the problem setup in more detail and describe optimizing a
single policy for a population of drivers. I first give a formal overview of the problem
before describing the simulator I created to create synthetic data. I then discuss using
the Kohl-Stone policy gradient algorithm [4] to minimize the expected cost of a policy
over a population of drivers.
" In chapter four I discuss determining a policy online for each driver. I describe three
approaches: the Kohl-Stone policy gradient algorithm, a cross-entropy method, and
model based reinforcement learning using Gaussian processes.
For each approach I
describe the performance gain compared to a policy found for the population of drivers
the individual was drawn from, and discuss potential difficulties in implementing the
algorithms in a production vehicle.
• In chapter five I conclude the thesis, summarizing my results and discussing potential
future work.
Now I will proceed to the background chapter, where I will introduce the methods used in
the thesis.
Chapter 2
Background
This chapter provides an overview of the knowledge needed to understand the methods used
in the thesis. The primary focus is on the reinforcement learning techniques that will be
used to optimize the start-stop controller, though I also give a brief overview of alternative
methods of optimizing the controller. The chapter concludes with a look at related work
that applies machine learning in the service of energy conservation.
2.1
Reinforcement Learning
Reinforcement learning is a subfield of machine learning concerned with optimizing the
performance of agents that interact with their environment and attempt to maximize an accumulated reward or minimize an accumulated cost. Formally, given a summed
discounted reward:
R = \sum_{t=0}^{\infty} \gamma^{t} r_{t}, \qquad (2.1)

where r_t denotes the reward gained at time step t and γ is a discount factor in the range
[0, 1), the goal is to maximize

E[R \mid \pi], \qquad (2.2)

where π corresponds to a policy that determines the agent's actions.
In this setting the agent often does not know what reward its actions will generate, so
it must take an action in order to determine the reward associated with the action. The
need to take exploratory actions makes reinforcement learning a very different problem from
supervised learning, where the goal is to predict an output from an input given a training set
of input-output pairs. The consequence of this difference is that one of the most important
aspects of a reinforcement learning problem is balancing exploration and exploitation. If an
agent exploits by always taking the best known action it may never use an unknown action
that could produce higher long-term reward, while an agent that chooses to constantly
explore will take suboptimal actions and obtain much less reward.
Another key difference between supervised learning and reinforcement learning is that in
reinforcement learning the states the agent observes depend on both the states the agent
previously observed and the actions executed in those states. This means that certain
actions may generate large short term rewards, but lead to states that perform worse in the
long run.
Reinforcement learning problems are typically formulated as a Markov decision process, or
MDP, defined by a 4-tuple
(S, A, P, R), \qquad (2.3)
where S corresponds to the set of states that the agent may be in; A is the set of all possible
actions the agent may take; P is the probability distribution for each state-action pair, with
p(s' \mid s, a) specifying the probability of entering state s' after having been in state s and
executing action a; and R is the reward function R(s, a) specifying the reward received by the agent
for executing action a in state s. The goal is to find a policy π mapping states to actions
that leads to the highest long-term reward. As shown in equation 2.1 the reward is usually
described as a discounted sum of the rewards received at all future steps. A key element of
the MDP is the Markov property. The Markov property is that given an agent's history of
states, {s_0, s_1, ..., s_{n-1}, s_n}, the agent's reward and transition probabilities only depend
on the latest state s_n, that is,

P(s' \mid s_1, \ldots, s_n, a_1, \ldots, a_n) = P(s' \mid s_n, a_n). \qquad (2.4)
This is a simplifying assumption that makes MDPs much easier to solve. There are a few
algorithms that are guaranteed to converge to a solution for an MDP, but unfortunately
the start-stop problem deals with a state space that is not directly observable, and Ford
required the use of a black box policy parameterized by five predetermined parameters,
so I will use a more general framework that instead searches through the space of
parameterized policies.
2.2
Policy Search Methods
Policy search methods are used to solve problems where a predefined policy with open
parameters is used. In policy search, there is a parameterized policy π(θ) and, as in an MDP,
a cost function C(θ) that is being minimized, which can also be thought of as maximizing
a reward that is the negative of the cost function. Instead of choosing actions directly, the
goal is to determine a parameter vector for the policy, which then chooses actions based on
the state of the system. This can allow for a much more compact representation, because
the parameter vector θ may have far fewer parameters than a corresponding value function.
Additionally, background knowledge can be used to pre-configure the policy. This enables
the policy to be restricted to sensible classes. Any algorithm drawn from this family is
referred to as a policy search algorithm.
2.2.1
Policy Gradient Algorithms
Policy gradient algorithms are a subset of policy search algorithms that work by finding the
derivative of the reward function with respect to the parameter vector, or some approximation of the derivative, and then moving in the direction of lower cost. This is repeated until
the algorithm converges to a local minimum. If the derivative can be computed analytically
and the problem is an MDP then equation 2.5, known as the policy gradient theorem, holds:
\frac{\partial R}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta}\, Q^{\pi}(s, a). \qquad (2.5)
Here, d^π(s) is the summed probability of reaching a state s using the policy π, with some
discount factor applied, which can be expressed as the following, where s_0 is the initial state
and γ is the discount factor:

d^{\pi}(s) = \sum_{t=0}^{\infty} \gamma^{t} \Pr\{s_t = s \mid s_0, \pi\}. \qquad (2.6)
Many sophisticated algorithms for policy search rely on the policy gradient theorem. Unfortunately it relies on the assumption that the policy is differentiable. This is not true
for the start-stop problem, as Ford requires that we use a black-box policy, which allows me
to observe the results of applying the policy to an agent's actions, but not to view the actual policy
so as to obtain a derivative. The policy also uses a rule-based system, so it is unlikely that a
smooth analytic gradient exists. Because of this I used a relatively simple policy gradient
algorithm published by Kohl and Stone [4].
2.2.2
Kohl-Stone Policy Gradient
The primary advantage of the Kohl-Stone algorithm for determining an optimal policy is
that it can be used when an analytical gradient of the policy is not available. In place of an
analytical gradient, it obtains an approximation of the gradient by sampling, an approach
which was first used by Williams [14].
The algorithm starts with an arbitrary initial parameter vector 0. It then loops, repeatedly
creating random perturbations of the parameters and determining the cost of using those
policies. A gradient is then interpolated from the cost of the random perturbations. Next,
the algorithm adds a vector of length η (the step size) in the direction of the negative gradient to the policy.
The process repeats until the parameter vector converges to a local minimum, at which point
the algorithm halts. Pseudo-code for the algorithm is given in Figure 2.1.
1   θ ← initialParameters
2   while not finished do
3       create R_1, ..., R_m where R_i = θ with its values randomly perturbed
4       determine C(R_1), C(R_2), ..., C(R_m)
5       for i ← 1 to |θ| do
            // Find the average cost after perturbing parameter i up, down, or not at all
6           AvgPos  ← average of C(R) over all R where R[i] > θ[i]
7           AvgNone ← average of C(R) over all R where R[i] == θ[i]
8           AvgNeg  ← average of C(R) over all R where R[i] < θ[i]
            // Determine the value of the differential for parameter i
9           if AvgNone > AvgPos and AvgNone > AvgNeg then
10              change_i ← 0
11          else
12              change_i ← AvgPos - AvgNeg
        // Move the parameters a step of length η against the estimated cost gradient
13      change ← η · change / |change|
14      θ ← θ - change

Figure 2.1: Pseudo-code for the Kohl-Stone policy search algorithm
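A minimal Python sketch of this cost-minimizing variant is shown below; the cost function is supplied by the caller, and the names and default values are illustrative rather than Ford's or Kohl and Stone's actual implementation.

import numpy as np

def kohl_stone(cost, theta0, step_size=0.025, epsilon=0.25,
               num_perturbations=50, num_iterations=200, rng=None):
    # Sampling-based gradient estimate in the spirit of Kohl and Stone,
    # adapted to minimize a scalar cost over a parameter vector.
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iterations):
        # Perturb each parameter by -epsilon, 0, or +epsilon at random.
        offsets = rng.choice([-epsilon, 0.0, epsilon],
                             size=(num_perturbations, theta.size))
        costs = np.array([cost(theta + off) for off in offsets])
        change = np.zeros_like(theta)
        for i in range(theta.size):
            pos, neg, none = offsets[:, i] > 0, offsets[:, i] < 0, offsets[:, i] == 0
            avg_pos = costs[pos].mean() if pos.any() else costs.mean()
            avg_neg = costs[neg].mean() if neg.any() else costs.mean()
            avg_none = costs[none].mean() if none.any() else costs.mean()
            # Leave a parameter alone if not perturbing it looked best.
            if not (avg_none > avg_pos and avg_none > avg_neg):
                change[i] = avg_pos - avg_neg
        norm = np.linalg.norm(change)
        if norm > 0:
            # Step of length step_size against the estimated cost gradient.
            theta -= step_size * change / norm
    return theta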
Another advantage of the algorithm, aside from not needing an analytic gradient, is that
there are only three parameters that must be tuned. The first of these is the learning rate.
If the learning rate is too low the algorithm converges very slowly, while if it is too high the
algorithm may diverge, or it may skip over a better local minimum and go to a minimum
that has higher cost. The usual solution to finding the learning rate is to pick an arbitrary
learning rate and decrease it if divergence is observed. If the algorithm is performing very
slowly then the learning rate may be increased. However this approach may not be suitable
for situations where a user cannot tweak the settings of the algorithm online, although there
are methods that adapt the learning rate on-line automatically.
The second parameter chosen by the user is the amount by which each parameter is randomly
perturbed. If the value is set too low the difference in costs due to noise may be much greater
than the differences in cost due to the perturbation.
If it is set too high the algorithm's
approximation of a gradient may no longer be accurate. This is sometimes set arbitrarily,
though I have performed experiments for the start-stop problem in order to determine what
values worked best.
The last parameter that must be set is the initial policy vector. The policy gradient algorithm converges to a local minimum, and the starting policy vector determines which local
minimum the algorithm will converge to. If the problem is convex there will only be one
minimum, so the choice of starting policy will not matter, but in most practical cases the
problem will be non-convex. One solution to this problem is to restart the search from many
different locations, and choose the best minimum that is found. When the policy is found
offline this adds computation time, but is otherwise relatively simple to implement. Unfortunately, like the choice of learning rate, it is much more difficult to set when it needs to be
chosen on-line, as the cost of using different policy starting locations manifests itself in the
real world cost being optimized.
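As a usage sketch of the random-restart strategy, reusing the kohl_stone function sketched above with a hypothetical drive_cycle_cost function (neither name comes from the thesis):

import numpy as np

# Random restarts: run the kohl_stone sketch from several random starting policies
# and keep the best result found. `drive_cycle_cost` is a hypothetical cost function
# backed by recorded or simulated drive cycles.
rng = np.random.default_rng(0)
candidates = [kohl_stone(drive_cycle_cost, rng.uniform(0.0, 1.0, size=5), rng=rng)
              for _ in range(10)]
best_policy = min(candidates, key=drive_cycle_cost)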
2.2.3
Evolutionary Algorithms
Kohl-Stone and related methods are policy gradient approaches, where given a policy the
next policy is chosen by taking the gradient, or some approximation of it, and moving the
policy in the direction of the negative gradient. An alternative to the gradient-based approach
is a sampling-based evolutionary approach. This approach works by starting at
a point, sampling around the point, and then moving in the direction of better samples.
Even though this approach and Kohl-Stone both use sampling, they are somewhat different.
Kohl-Stone uses sampling in order to compute a gradient, and then uses the gradient to move
to a better location. This approach uses the sampled results directly to move to a better
location. As a result the samples may be taken much further away from the current policy
than the samples used in Kohl-Stone, and the algorithms move to the weighted average of
the good samples, not in the direction of the gradient.
One example of an evolutionary approach is a cross-entropy method, or CEM [11]. CEM has
the same basic framework as Kohl-Stone, starting at an initial policy θ_0 and then iterating
through updates until converging to a minimum. In addition to a vector for the initial
location it also needs an initial covariance matrix Σ_0. In each update it generates k samples
from the multivariate normal distribution with θ_t as the mean and Σ_t as the covariance
matrix. It then evaluates the cost of using each sampled policy, and then moves to an
average of the best sampled results, where only the K_e best results are used. It also alters
the covariance matrix to correspond to the variance of the best sampled results.
1   θ ← initialPolicyVector
2   Σ ← initialCovariance
3   while not finished do
4       create R_1, ..., R_k where R_i is randomly sampled from Normal(θ, Σ)
5       determine C(R_1), C(R_2), ..., C(R_k)
6       sort R_1, R_2, ..., R_k in increasing order of cost, so the K_e lowest-cost samples come first
7       θ ← (1 / K_e) · sum_{i=1..K_e} R_i
8       Σ ← (1 / K_e) · sum_{i=1..K_e} (R_i - θ)(R_i - θ)^T

Figure 2.2: Pseudo-code for CEM.
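A compact Python sketch of this update rule, again assuming a caller-supplied cost function and illustrative defaults:

import numpy as np

def cross_entropy_method(cost, theta0, cov0, num_samples=50, num_elite=10,
                         num_iterations=100, rng=None):
    # Sample candidate policies, keep the lowest-cost elites, and refit the
    # sampling mean and covariance to them.
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    cov = np.asarray(cov0, dtype=float)
    for _ in range(num_iterations):
        samples = rng.multivariate_normal(theta, cov, size=num_samples)
        costs = np.array([cost(s) for s in samples])
        elite = samples[np.argsort(costs)[:num_elite]]   # lowest-cost samples
        theta = elite.mean(axis=0)
        diffs = elite - theta
        # Refit the covariance to the elites; a small jitter keeps it positive definite.
        cov = diffs.T @ diffs / num_elite + 1e-9 * np.eye(theta.size)
    return theta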
2.2.4
Covariance Matrix Adaptation Evolution Strategy
The covariance matrix adaptation evolution strategy, or CMAES, is an improvement over the
cross-entropy method [12]. It works similarly except the procedure of the algorithm is more
configurable and an evolutionary path is maintained. The path allows for information about
18
previous updates of the covariance matrix and mean to be stored, causing the adaption of the
covariance matrix to be based on the entire history of the search, producing better results.
It also contains many more parameters that can be tweaked to produce better results for a
specific problem, though all of them may be set to a sensible default setting as well.
CMAES and CEM are very general algorithms for function maximization but are used quite
often in reinforcement learning because they provide a natural way of determining what
information to gather next. One example is Stulp and Sigaud's use of the evolutionary path
adaptation of the covariance matrix from CMAES in the Path Integral Policy Improvement
algorithm, or PI2 [11]. The PI2 algorithm has similar goals to CEM and CMAES, except that
instead of aiming to maximize a function it aims to optimize the result of a trajectory.
This is a very different goal from function maximization, because in trajectories earlier steps
are more important than later ones. PI2's application to trajectories makes it an important
algorithm for robotics. In my case I used a standard CMAES library, because it is most
natural to approach the start-stop problem as a function to be optimized.
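The thesis does not name the specific CMAES library used; as one illustration, the widely used Python cma package exposes an ask/tell interface that a cost-minimizing loop could use, sketched below (the cost function and parameter names are placeholders, not the thesis's actual setup).

import cma  # the pycma package, one widely used CMA-ES implementation

def optimize_with_cmaes(cost, theta0, sigma0=0.05, num_iterations=100):
    # Ask/tell loop: sample a population of policies, report their costs, repeat.
    es = cma.CMAEvolutionStrategy(theta0, sigma0)
    for _ in range(num_iterations):
        candidates = es.ask()
        es.tell(candidates, [cost(c) for c in candidates])
    return es.result.xbest  # best parameter vector found so far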
2.3
Model Based Policy Search
Although I have been addressing approaches that use reinforcement learning to optimize
the start-stop controller, an alternative approach is to first learn a model using supervised
learning, and then use policy search on the model to find the optimal policy. In a supervised
learning problem the user is given pairs of values (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(n)}, y^{(n)}),
where each x^{(i)} is a vector of real values and each y^{(i)} is a real value, and the goal
is to predict the value y^{(i)} for a new input x^{(i)}. The system can then use a
policy search algorithm to determine a policy vector θ that performs best given the predicted
output y^{(i)}. Using this model introduces the limitations that the policy chosen will have no
effect on the output, and that the policy chosen will not affect future states of the world.
For the start-stop controller, these limitations are similar to assuming that no matter how
the start-stop controller behaves the driver will drive the same way.
When y^{(i)} is real valued, as it is when trying to predict the length of a car's stop, the
supervised learning problem is called regression. I make use of two regression techniques in
this thesis, Gaussian processes and nearest neighbors regression.
2.3.1
Gaussian Processes
The first regression method I used was Gaussian processes, which are a framework for
nonlinear regression that gives probabilistic predictions given the inputs [7]. This often
takes the form of a normal distribution N(y | μ(x), σ²(x)), where the value for y is normally
distributed with a mean and variance determined by x. The probabilistic prediction is helpful
in cases where the uncertainty associated with the answer is useful, which it may be in the
start-stop prediction problem. For example, the controller may choose to behave differently
if it is relatively sure that the driver is coming to a stop 5 seconds long than when it believes
the driver has a relatively equal chance of coming to a stop between 1 and 9 seconds long.
Gaussian processes work by choosing functions from the set of all functions to represent
the input to output mapping. This is more complex than linear regression, a simpler form
of regression that only allows linear combinations of a set of basis functions applied to the
input to be used to predict the output. Overfitting is avoided by using a Bayesian approach,
where smoother functions are given higher prior probability. This corresponds with the view that
many functions are somewhat smooth and that the smoother a function is, the more likely
it explains just the underlying process creating the data and not the noise associated with
the data.
Once the Gaussian process is used to predict the value of y^{(i)}, the optimal value for the policy
parameter θ can be found. This can be done with one of the search methods described
previously, as long as the reward of using a specified policy for any value of y^{(i)} can be
calculated. This may present problems if a decision must be made quickly and it is not easy
to search over the policy space, because a new policy must be searched for every time a
decision must be made. If a near-optimal policy can be found quickly this may represent
a good approach, because the policy can be dynamically determined for each decision that
must be made.
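As a concrete illustration of this kind of probabilistic regression (not the thesis's actual model or features), a minimal scikit-learn sketch that predicts stop length with an uncertainty estimate might look like the following; the training data here is invented.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Invented training data: per-stop features (speed before the stop in km/h,
# brake pressure) and the observed stop length in seconds.
X_train = np.array([[32.0, 900.0], [18.0, 400.0], [55.0, 1200.0], [12.0, 300.0]])
y_train = np.array([14.0, 6.0, 22.0, 4.0])

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)  # smooth prior plus observation noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

# Probabilistic prediction for a new stop: a mean and an uncertainty estimate.
mean, std = gp.predict(np.array([[25.0, 700.0]]), return_std=True)
print(f"predicted stop length: {mean[0]:.1f} s +/- {std[0]:.1f} s")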
2.3.2
Nearest Neighbors
The second regression technique I used was nearest neighbors regression. Given some point
x^{(i)}, this technique gives a prediction for y^{(i)} by averaging the outputs of the training
examples that are closest to x^{(i)}. The number of training examples to average is a parameter
of the algorithm and can be decided using cross-validation.
Unlike Gaussian processes,
nearest neighbors regression gives a point estimate instead of a probabilistic prediction. The
advantage of nearest neighbors regression is that it is simple to implement and makes no
distributional assumptions.
That makes it ideal as a simple check to ensure that a more
sophisticated approach such as Gaussian processes is not making errors due to incorrect
distributional assumptions, or a poor choice of hyper-parameters. A policy can be found as
soon as a prediction for y^{(i)} is made, using similar search methods as described earlier.
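A corresponding nearest neighbors sketch, with the number of neighbors chosen by cross-validation as described above; again the data and features are placeholders.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# The same kind of invented (feature, stop length) data as in the Gaussian process sketch.
X_train = np.array([[32.0, 900.0], [18.0, 400.0], [55.0, 1200.0],
                    [12.0, 300.0], [40.0, 1000.0], [20.0, 500.0]])
y_train = np.array([14.0, 6.0, 22.0, 4.0, 18.0, 8.0])

# Choose the number of neighbors by cross-validation, then predict a point estimate.
search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [1, 2, 3]}, cv=3)
search.fit(X_train, y_train)
print("best k:", search.best_params_["n_neighbors"])
print("predicted stop length:", search.predict(np.array([[25.0, 700.0]]))[0])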
2.4
Related Work
I will now discuss examples of machine learning being used for energy conservation. Machine
learning is useful for energy conservation because instead of creating simple systems that
must be designed for the entire space of users, or even a subset of users, the system can be
adapted to the current user. This can enable higher energy savings than are possible with
systems that are not specific to the user. I will discuss four examples of machine learning
for energy conservation, to illustrate where it can be applied and what methods are used:
The work by Theocharous et al. on conserving laptop battery using supervised learning
[6]; Kolter, Jackowski, and Tedrake's work in optimizing a wind turbine's performance using
reinforcement learning [8]; the work by Dalamagkidis et al. on minimizing the power
consumption of a building [9]; and lastly the work of Bodik et al. on creating an
adaptive data center to minimize power consumption [10].
2.4.1
Machine Learning for Adaptive Power Management
One area where machine learning is often used for power conservation is minimizing the power
used by a computer. By turning off components of the computer that have not been used
recently, power can be conserved, provided the user does not begin using the components
again shortly thereafter. Theocharous et al. applied supervised learning to this area, using
a method similar to the model based policy approach described earlier [6]. A key difference
was that their approach was classification based, where the goal was to classify the current
state as a suitable time to turn the components of the laptop off. One notable difference
of their approach from the approach I described is that they created custom contexts and
trained custom classifiers for each context. This is an interesting area to explore, especially
if the contexts can be determined automatically. One method for determining these contexts
is a Hidden Markov Model, which I looked into but found too complex for Ford to implement
in their vehicles.
2.4.2
Design, Analysis, and Learning Control of a Fully Actuated
Micro Wind Turbine
Another interesting application of machine learning to energy maximization is the work of Kolter and
colleagues in maximizing the energy production of a wind turbine [8]. This can be treated
similarly to energy conservation, where instead of minimizing the energy spent, the goal is
to maximize the energy gained. The goal is to choose the settings of a wind turbine that
maximize energy production. Their work uses a sampling based approach to approximate
a gradient. This approach is more complex than the Kohl-Stone algorithm, as they form
a second order approximation of the objective function in a small area known as the trust
region around the current policy. They then solve the problem exactly in the trust region and
continue. This approach takes advantage of the second order function information, hopefully
allowing for quicker convergence of the policy parameters. The disadvantages of the approach
are that it is more complex, needing increased time for semidefinite programming to solve
for the optimal solution in the trust region, and needing an increased number of samples to
form the second order approximation.
In their case the calculations could be done before
the wind turbine was deployed, which meant the increased time for computation was not an
issue. In our case we would like the approach to be feasible online, and it is unclear whether
there will be enough computation power allocated to our system in the car to perform the
additional computations that are needed for their approach.
2.4.3
Reinforcement learning for energy conservation in buildings
Another example of reinforcement learning being used to minimize energy consumption is in buildings.
Dalamagkidis et al. sought to minimize the power consumption in a building subject to an
annoyance penalty that was paid if the temperature in the building was too warm or too cold, or
if the air quality inside the building was not good enough [9]. This is a very similar problem
to the start-stop problem, where I am seeking a policy that minimizes the fuel cost and the
annoyance the driver feels. Their approach was to use temporal-difference learning, which
is a method of solving reinforcement learning problems. By using this approach they were
able to optimize a custom policy that was near the performance of a hand-tuned policy in 4
years of simulator time.
2.4.4
Adaptive Data centers
The last example of machine learning being used to minimize power is in data centers. Data
centers have machines that can be turned off based on the workload in the near future. The
goal is to meet the demands of the users of the data center, while keeping as many of the
machines turned off as possible. Bodik et al. wrote about using linear regression to predict
the workload that would be demanded in the next two minutes [10]. Based on this prediction
the parameters of a policy vector were changed. This resulted in 75% power savings over
23
not turning any machines off, while only having service level violations in .005% of cases.
2.5
Conclusion
There are a variety of approaches related to machine learning that can be used to find the
optimal start-stop controller. This chapter gave a summary of the techniques that I applied
to the problem. I also gave a brief summary of other areas where machine learning has been
used to conserve or maximize energy, and discussed the advantages and disadvantages of the approaches
used. In the following chapters I will study the results of applying these techniques to the problem
of determining a policy that performs well for a population of drivers, as well as the problem
of finding a policy that adapts to the driver of the car in which the policy is used.
Chapter 3
Learning a Policy for a Population of
Drivers
In this chapter I will discuss optimizing a single policy for a population of drivers in order
to minimize average cost. This is similar to the approach Ford currently takes, except that
their policy is a hand-tuned policy that they expect to work well on average, rather than one
optimized with a machine learning or statistical method. The benefits of this approach are
that the policy can be examined before it is used in vehicles to check that it is reasonable,
and that there are no additional computations required inside a car's computer beyond
computing the actions to take given the policy. The disadvantage of this approach is that the
single policy used for all drivers will probably use more energy than a unique policy
specialized to each driver.
3.1
Problem Overview
In this approach the goal is to select a start-stop controller that performs best across all
users. Formally, the goal is to pick a parameter vector 0 for the policy that minimizes
E[C \mid \theta], \qquad (3.1)

over all drivers in the population, where C is composed of three factors,

C = c_1 + c_2 + c_3. \qquad (3.2)
The first factor, c_1, represents the fuel lost from leaving the car's engine running when the car is
stopped, and is equal to the time that the car has spent stopped minus the time the combustion
engine was turned off during a driving session. The second factor, c_2, represents the fixed
cost in energy to restart the car's engine, and is equal to 2 times the number of times the
car's engine has been restarted during a period of driving. Every restart of the car's engine
costing 2 was a detail specified by Ford. The last factor, c_3, is the driver annoyance penalty,
and is paid based on the length of time between the driver pressing the accelerator pedal
and the car actually accelerating. Each time the driver accelerates, c_3 is incremented by
10 × max(0, restartDelay − 0.25), where restartDelay is equal to the time elapsed between
the driver pressing the accelerator pedal and the engine turning back on.
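A small Python sketch of this cost function, computed from per-stop quantities; the trace representation (the dictionary keys) is invented for illustration, while the constants are the ones Ford specified above.

def drive_cycle_cost(stops):
    # `stops` is a list of dicts, one per stop, with illustrative keys:
    #   time_stopped    -- seconds the car spent stopped
    #   engine_off_time -- seconds the engine was off during the stop
    #   num_restarts    -- number of engine restarts during the stop
    #   restart_delays  -- delays (s) between pressing the accelerator and the
    #                      engine being on again
    c1 = sum(s["time_stopped"] - s["engine_off_time"] for s in stops)  # idling fuel
    c2 = 2 * sum(s["num_restarts"] for s in stops)                     # restart energy, cost 2 each
    c3 = sum(10 * max(0.0, d - 0.25)                                   # driver annoyance
             for s in stops for d in s["restart_delays"])
    return c1 + c2 + c3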
The goal is to minimize the expected cost E[C | θ] by choosing an appropriately parameterized
policy π(θ). Ford uses a rule-based policy, an example of which is shown in Figure 3.1. Here
the parameter vector corresponds to values in the rules.
1   if CarStopped then
2       if EngineOn and BrakePedal > θ_0 then
3           TURN ENGINE OFF
4       if EngineOn and BrakePedal > θ_1 and AverageSpeed < θ_2 then
5           TURN ENGINE OFF
6       if EngineOff and BrakePedal < θ_3 then
7           TURN ENGINE ON
8       if EngineOff and BrakePedal < θ_4 and TimeStopped > 20 then
9           TURN ENGINE ON

Figure 3.1: Hypothetical rule-based policy
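For illustration, the hypothetical rule set in Figure 3.1 could be rendered as a function evaluated at each control tick, as sketched below; this mirrors the figure, not Ford's actual black-box controller.

def start_stop_policy(theta, car_stopped, engine_on, brake_pedal,
                      average_speed, time_stopped):
    # Returns "ENGINE_OFF", "ENGINE_ON", or None (no change) for one control tick.
    # theta is the five-element parameter vector from the rules in Figure 3.1.
    if not car_stopped:
        return None
    if engine_on and brake_pedal > theta[0]:
        return "ENGINE_OFF"
    if engine_on and brake_pedal > theta[1] and average_speed < theta[2]:
        return "ENGINE_OFF"
    if not engine_on and brake_pedal < theta[3]:
        return "ENGINE_ON"
    if not engine_on and brake_pedal < theta[4] and time_stopped > 20:
        return "ENGINE_ON"
    return None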
This policy framework is then called 100 times a second during a stop in order to determine
how to proceed. There are two important things to note about this rule-based policy. First,
this is an example of what the policy might look like, not the actual policy. Ford gave us
an outline similar to this, but did not tell us what the actual rules are, so we do not know
what the parameters correspond to. Second, the controller is responsible for both turning
the engine off when the car comes to a stop, as well as turning the engine back on when the
user is about to accelerate. This second part helps to minimize c 3 in the cost function, but
may also lead to more complicated behavior, as the controller may turn the engine off and
on multiple times during a stop if it believes the car is about to accelerate but the car does
not.
Given the black-box nature of the policy we are not allowed to add or remove parameters, and
instead must only change the vector θ in order to achieve the best results. Our inputs from
any driver are three data traces: the driver's speed, the driver's accelerator pedal pressure,
and the brake pedal pressure, which are each sampled at 100 Hz. Figure 3.2 shows a graph
of an example of input data.
3.2
Simulator Model
In order to help determine the best policy Ford gave us a small amount of data, representing
under 5 hours of driving from a driver in an Ann Arbor highway loop, as well as about
an hour of data from drivers in stop and go traffic. This data was insufficient to run the
experiments we hoped to perform, so to create more data that was similar to the data from
Ford I built a simulator. The simulator was based on a state machine, which is shown in
Figure 3.3.
The simulator has three states: Stopped, Creeping, and Driving. In the Stopped state the
driver has come to a stop. In the Driving state the driver is driving at a constant speed,
with Gaussian noise added in. The Creeping state represents when the driver is moving at
a low speed, with their foot on neither pedal. The driver stays at their current state for a
predetermined amount of time, and then transitions to another state by either accelerating
or decelerating at a relatively constant rate.

Figure 3.2: A slice of the data. This graph shows a driver coming to a stop, and then
accelerating out of it.
The simulator is parameterized to allow for the creation of many different drivers.
For
example, the average stopping time is a parameter in the simulator, as well as the maximum
speed a driver will accelerate to after being stopped. There are 40 parameters, allowing me
to create a population of drivers with varying behavior.
The main weakness of the simulator is the state machine model. It is unlikely that a driver's
actions are actually only determined by their current state. For example, it would be expected that a driver on the highway would have a pattern of actions different from a driver
in stop and go traffic in an urban environment. To alleviate this issue, and in order to help
me conduct my experiments, I created 2000 driver profiles, 1000 of whose parameters were
centered around parameters from a driver whose behavior matched the Ann Arbor data, and
1000 whose parameters were centered around a simulated driver who matched the stop and
go data. Every driver uses four randomly picked driver profiles, using one 50% of the time,
two 20% of the time, and one 10% of the time. This enables a driver's behavior to be based
on their recent actions, as a driver will use a profile for a period of time before switching to
another profile.

Figure 3.3: The state machine used by the simulator. Circles represent states and arrows
represent possible transitions between states.
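The sketch below shows the general shape of such a simulator: a three-state machine whose dwell times and speeds come from a per-profile parameter set, with the active profile switched from time to time. The parameter names, distributions, and switching rule are simplified placeholders rather than the 40 parameters and exact profile-mixing scheme described above.

import numpy as np

rng = np.random.default_rng()

def simulate_driver(profiles, weights, duration_s=3600.0):
    # Generate a coarse speed trace (one value per second) from a three-state driver model.
    speeds, t, state = [], 0.0, "Driving"
    profile = profiles[rng.choice(len(profiles), p=weights)]
    while t < duration_s:
        if rng.random() < 0.02:  # occasionally switch to another behavior profile
            profile = profiles[rng.choice(len(profiles), p=weights)]
        if state == "Stopped":
            dwell, speed, nxt = rng.exponential(profile["mean_stop_s"]), 0.0, "Creeping"
        elif state == "Creeping":
            dwell, speed = rng.exponential(profile["mean_creep_s"]), profile["creep_speed"]
            nxt = rng.choice(["Stopped", "Driving"])
        else:  # Driving
            dwell = rng.exponential(profile["mean_drive_s"])
            speed, nxt = profile["cruise_speed"], rng.choice(["Stopped", "Creeping"])
        steps = max(1, int(dwell))
        speeds.extend(max(0.0, speed + rng.normal(0.0, profile["speed_noise"]))
                      for _ in range(steps))
        t += steps
        state = nxt
    return np.array(speeds)

# Two placeholder profiles loosely standing in for highway and stop-and-go behavior.
profiles = [
    {"mean_stop_s": 19.0, "mean_creep_s": 4.0, "mean_drive_s": 100.0,
     "creep_speed": 6.0, "cruise_speed": 70.0, "speed_noise": 2.0},
    {"mean_stop_s": 10.0, "mean_creep_s": 5.0, "mean_drive_s": 40.0,
     "creep_speed": 5.0, "cruise_speed": 30.0, "speed_noise": 2.0},
]
trace = simulate_driver(profiles, weights=[0.5, 0.5])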
3.2.1
Simulator Output
The simulator that I have described outputs the speed trace of the vehicle. After the speed
trace is generated a second program is run that takes the speed values and generates appro-
priate values for the brake and accelerator pedal pressure values. Ford created this program
in Matlab and I ported it to Java. The output from the combination of the two programs is
a data trace with a similar format to data from the real car, giving results for all three values
at 100 Hz for a drive cycle. The only difference in output format is that the synthetic data
is never missing any data, while the data in the real car occasionally doesn't have values for
certain time steps. This occurs rarely, and only causes a millisecond of data to be missing
at a time, so it was not modelled into the simulator.
The highway driving data was given to me first, so I created the highway driver profiles first.
Figures 3.4, 3.5, and 3.6 show that I was able to match coarse statistics from the highway
driving data.
Figure 3.4: The percentage of the time the driver spends at different speeds. On the left is
real data from the Ann-Arbor highway loop, and on the right is simulated data with a
profile designed to match the highway loop.
Figures 3.4, 3.5, and 3.6 represent a summary of traces generated from the real data and the
simulated data. Aside from analyzing larger statistics generated from the traces, I analyzed
small subsections of traces to ensure that individual outputs behaved similarly as well. I did
this by looking at certain events, such as a driver coming to a stop, and then accelerating
out of the stop, and looking at the trace values generated, making sure that the simulator
output traces that behaved similarly to the real data.
About six months later I was given the stop-and-go data. This data was comprised of eight
Figure 3.5: The percentage of the time the driver spends with different brake pressures when
at a stop. On the left is real data from the Ann-Arbor highway loop, and on the right is
simulated data with a profile designed to match the highway loop.
smaller driving episodes from stop-and-go traffic. I created a driving profile different from
the highway profile that had similar summary statistics to the data, as shown in the following
table.
                                    Sim - Highway   Sim - Stop-And-Go   Ford Stop-And-Go
Average Speed (km/h)                     23.2             33.8                33.5
Average Stop Time (s)                   19.24            10.32                9.72
Average Time Between Stops (s)         102.78            41.48               40.83
% of Segments below 10 km/h                 4               76                  77
A driving segment is defined as the period of driving in between instances of the vehicle
coming to a halt. The statistics show that the profile has similar behavior to the real stop-and-go data, as the vehicle stops for shorter lengths of time, and often does not accelerate
to a high speed between stops. Combined with the highway simulator profile, it enables me
to create data similar to all data that Ford has given me.
The largest concern with the simulator is that aspects of real-world behavior may not be
modelled in the simulator. One example is that the simulated driver does not change their
behavior based on the start-stop policy in use. It is hard to know if drivers will change their
Figure 3.6: A histogram showing the length of stops in a trace. On the left is real data from
the Ann-Arbor highway loop, and on the right is simulated data with a profile designed
to match the highway loop. The right trace has a much larger number of stops, but the
distribution of stopping times is similar.
actions based on different behavior from the controller without implementing a controller in
a vehicle. Because the Ford data was collected from a vehicle using a static controller, I have
no evidence that drivers change their behavior, or of how they would change their behavior
if they did. Therefore, I opted to use simulated drivers that act the same regardless of the
policy in place.
There may be other areas where the simulated behavior may not match
actual behavior, but there seems to not be any reason that the methods described would not
work just as well with real data.
The simulator will be used for all tests described in the following two chapters, as it allowed
me to quickly test how the algorithms performed using hours of synthetic data, and should
help avoid overfitting to the small amount of data that was provided by Ford.
3.3
Optimizing For a Population
I will now describe the procedures used to optimize for a population, as well as the results
obtained from the experiments.
In order to optimize for a population I used a modified
version of the Kohl-Stone policy gradient algorithm described in section 2.2.2. I chose to
use this algorithm rather than an evolutionary algorithm because of Kohl-Stone's simplicity
and because the runtime of the algorithm was not a factor, as the algorithm did not need to
run on a car's computer.
The modified version of the algorithm was the same as the algorithm described in section
2.2.2, with one modification made to determining C(R_1), ..., C(R_m) at each iteration of the
algorithm. These costs are the costs of using policies that are perturbations of the current
policy θ, and in the original Kohl-Stone algorithm they are evaluated based on the costs of a
completely new trial using each R_i as the policy.
This is because the differences in the policy
may affect the actions of the driver. In my modified algorithm I evaluated the cost of the
perturbations based on a single drive trace, chosen from a random driver from the population.
I did this to cut down on the time necessary to run the algorithm, and to cut down on the
amount of noise present in the results. Most of the runtime of the algorithm is spent creating
the drive traces, and I evaluate 50 perturbations of θ on each drive trace. Therefore, creating
only one drive trace at each step speeds up the algorithm by a factor of almost 50. Using
only one drive trace also cuts down on the amount of noise present in the cost: if I selected
a new random driver for each perturbation, the differences in cost would mainly arise from
differences between the drive traces and not from the differences in the parameters.
For example,
certain drive traces, such as a file where there are very few stops, will always have a low
cost. Conversely, a drive trace where there are many short stops will always have a relatively
high cost no matter what the controller's policy is. Using a new drive trace each time would
necessitate adding in more perturbations to reduce the noise, extending the runtime of the
algorithm even further. Using one random trace to optimize the policy is an example of a
technique known as variance control, which is also employed by the PEGASUS algorithm.
PEGASUS is another policy search method which searches for the policy that performs best
on random trials that are made deterministic by fixing the random number generation [13].
The main negative side effect of using only one drive trace for each iteration of the algorithm
is that the modified algorithm operates with the assumption that small perturbations to the
start-stop controller have no effect on driver behavior. This is true for the simulator because
the start-stop controller never has any effect on the driving behavior.
This is probably
true for real life as well, as I would expect drivers to change their behavior with drastically
different controllers, such as a controller that always turned the engine off, versus one that
never turned the engine off, but because the algorithm only needs to approximate a gradient
the policies measured are similar enough that drivers might not even notice their difference.
However, this does mean that the learning rate may need to be relatively small, so that
any changes at any iteration do not cause the driver to drastically change their behavior.
The other negative side effect is that on each iteration the value for θ changes based on the
driving trace drawn from one driver. This can lead to the value for θ moving away from the
optimum if the driver is far away from the mean of the population. This should not be an
issue over time, as I expect the value for θ to approach the population mean.
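A sketch of the modified iteration, with hypothetical simulator hooks (sample_driver, generate_trace, evaluate_policy) standing in for the actual simulator interface; the key point is that one shared drive trace scores all 50 perturbations at each iteration.

import numpy as np

def population_kohl_stone(sample_driver, generate_trace, evaluate_policy, theta0,
                          step_size=0.025, epsilon=0.25, num_perturbations=50,
                          num_iterations=200, rng=None):
    # Modified Kohl-Stone for a population: at each iteration, draw one random
    # driver, generate a single drive trace, and score every perturbation of the
    # current policy on that same shared trace.
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iterations):
        trace = generate_trace(sample_driver(rng), hours=6)  # one shared 6-hour trace
        offsets = rng.choice([-epsilon, 0.0, epsilon],
                             size=(num_perturbations, theta.size))
        costs = np.array([evaluate_policy(theta + off, trace) for off in offsets])
        change = np.zeros_like(theta)
        for i in range(theta.size):
            pos, neg, none = offsets[:, i] > 0, offsets[:, i] < 0, offsets[:, i] == 0
            avg_pos = costs[pos].mean() if pos.any() else costs.mean()
            avg_neg = costs[neg].mean() if neg.any() else costs.mean()
            avg_none = costs[none].mean() if none.any() else costs.mean()
            if not (avg_none > avg_pos and avg_none > avg_neg):
                change[i] = avg_pos - avg_neg
        norm = np.linalg.norm(change)
        if norm > 0:
            theta -= step_size * change / norm
    return theta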
3.3.1
Experimental Results
For my experiment I used the modified version of the Kohl-Stone algorithm. I normalized
each of the parameters in the controller to the range [0, 1], and set a learning rate of 0.025.
For each iteration of the algorithm I created 50 perturbations and perturbed each value
randomly by either -0.25, 0, or 0.25. Each driving trace used was 6 hours long.
I performed 35 trials of the Kohl-Stone algorithm each running from a random starting
location for 200 iterations.
After 200 iterations I measured the performance of each trial
on 25 random drivers drawn from the population. Figure 3.7 shows the average cost of all
35 trials over time, demonstrating that the algorithm had stopped improving many iterations
before the 200-iteration limit was reached.
I chose the best performing parameters, and
then measured on another sample of 25 random drivers to get the expected cost of a 6 hour
driving segment.
I found the best parameters had an average cost of 1329, while Ford's
original parameters had a cost of 1810, for an improvement of 481 or 27%. This cost saving
is equivalent to the car's engine being off instead of idling for 80 seconds every hour, though
the actual cost savings may arise from any combination of the annoyance factors and energy
savings discussed previously.

Figure 3.7: A graph of the average cost of the 35 trials over time.
I also compared the results of using the policy learned by my algorithm to the results of
Ford's policy on the real data from Ford.
There was not much data from Ford, so the
comparison is not as useful as the comparison on the simulated data, but ideally the policy
should perform better than Ford's policy on the real data as well.
                         Ford Policy   Optimized Policy   Percentage Improvement
I-94 Data                      56.84              55.61                       2%
Ann Arbor Loop Data           199.33             177.75                      11%
W 154 Data                     67.61              63.92                       5%
Urban Profile 1                 36.1              34.05                       6%
Urban Profile 2                50.35               47.2                       6%
Urban Profile 3               698.12             634.95                       9%
Urban Profile 4               115.72             111.57                       4%
Urban Profile 5               116.02             100.53                      13%
Urban Profile 6                64.45              57.32                      11%
Urban Profile 7                62.83               53.9                      14%
Total                        1467.37             1336.8                       9%
The optimized policy does not exhibit as much improvement over the Ford policy as it did for
the simulated data, but it still outperforms Ford's policy by 9% overall. It also outperforms
Ford's policy across all of the data sets. Along with the small amount of data present, there are
two caveats to the results. First, there was no brake pedal or accelerator pedal data available
for most of the samples, so those values were generated as functions of the speed data using
the program Ford gave me. It is unknown how accurate these predictions are. Second, there is
no way of knowing if using different controllers may have changed the actions of the drivers
in the experiments.
real vehicles. These factors may cause the optimized controller to perform relatively worse
or better compared to these results.
3.3.2
Optimizing For a Subset of a Larger Population
In the previous section I demonstrated that policy search can produce a policy that performs
better than the hand tuned policy created by Ford. The rest of my thesis will discuss methods
of improving on this policy.
One way to improve on a policy generated for the entire population of drivers is to create
separate policies for subsets of the entire population. This is beneficial in two ways. First,
if Ford can subdivide the population ahead of time, they can install the policies in different
types of vehicles. For instance, drivers in Europe and the United States might behave differently, or drivers of compact cars and pickup trucks might have different stopping behavior.
If the populations can be identified ahead of time and data can be collected from each population, different policies can be installed for each set of drivers.
Classifying for a smaller
population might also be useful if the driver can be classified online as driving in a certain
way. If the driver's style can be classified by the car's computer, a policy can be swapped
in that works best for the situation. For instance, drivers on the highway and drivers in
stop-and-go traffic in an urban environment might behave very differently and have different
optimal policies. It should be possible to obtain better results by using a policy optimized
for a specific pattern of driving when the driver's style fits that pattern.
In addition to providing results for optimizing for a smaller population of drivers, this section
acts as a middle ground between the previous section and the next chapter, where I will
discuss optimizing a policy for a specific driver. As the policy is optimized for a more specific
group better results are expected, as the policy only needs to match the behavior of that
group and does not need to account for drivers that are not present who behave differently.
As long as the drivers in the smaller subsets are different from the larger population in a
meaningful way, better results are expected as the group becomes smaller, with the best
results being expected with policies optimized for a single driver. The key is to divide the
population into subsets that are expected to be different in some way; if the population
is subdivided into large random subsets, the expectation is that the resulting policies will be identical.
It is also necessary for there to be enough data for each subset in order to create policies
that perform well.
It may even be possible that the methods used in this subsection will result in lower costs
than optimizing policies for a single driver. This strategy may lead to lower costs if policies
are optimized for modes of driving, which are then swapped in during a drive, based on the
driver's behavior. Lower costs are likely if the individual driver's behavior in a particular
setting, such as highway driving, is more similar to the average driving behavior from all
drivers in that setting than the individual driver's average behavior across all settings of
driving.
In this case the policy found for all drivers in a particular mode will probably
perform better than the policy found for a single driver across all modes of driving.
3.3.3 Experimental Results
The most natural way to divide the driver profiles from the simulator into groups is to
divide them into two sets of drivers, the first being the drivers generated from the Ann
Arbor data, and the second being the drivers generated from the urban data. This doesn't
represent a division of car buyers that Ford can identify ahead of time, but the profiles
do represent modes of driving that would benefit from separate policies.
I will compare the results of optimizing for the smaller populations with the results of
optimizing for a single larger population comprised of both sets of drivers.
In order to determine the best policies for each population I repeated the experiments from
Section 3.3, running the experiment 35 times from random starting locations, and testing
on 25 random drivers to determine the best policy, and finally testing on 25 other random
drivers to determine the expected cost of the policy.
For the profiles matching the Ann
Arbor data I found that the best policy had an average cost of 1233, while for the
profiles matching the urban data the best policy had a cost of 1399. Averaged together the
cost was 1321, representing the average cost of six hours of driving if three hours are spent
in each mode and the best policy is used for each mode. I found that the best single policy
for the combined population had a cost of 1398. Therefore, optimizing for the subsets of the
population resulted in an improvement of 5.5%, or 13 seconds of idling removed per hour
in real world terms. This is less than the improvement from going from Ford's policy to an
optimized policy, but would represent a massive savings of energy if an improvement of this
scale could be implemented on all hybrid cars produced by Ford.
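For concreteness, the per-hour figure quoted above follows directly from the six-hour costs (my arithmetic, using the numbers above):

\[
\frac{1398 - 1321}{6\ \text{hours}} \approx 12.8 \approx 13\ \text{seconds of idling-equivalent cost removed per hour.}
\]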
3.4 Conclusion
In this chapter I discussed the policy-search methods that were able to find a single policy
that performs better than Ford's hand tuned policy. I discussed the simulator that generated
the data necessary for policy search as well as the Kohl-Stone policy gradient algorithm
for policy search. The optimized policy that policy-search found performs 27% better on
synthetic data and 9% better on the limited real data available.
This policy could probably be immediately implemented in Ford's vehicles with better results.
The main concerns related to these results were that the simulated data does not capture
unknown characteristics of the real-world data and that drivers might modify their behavior
based on the new policy and produce worse results. The optimized policy does not perform
as well on the real data as it does on the synthetic data, but it outperforms Ford's policy
on every data set available, which seems to indicate that the simulator is capturing at least
some of the important features of the real data. It is impossible for me to test how drivers
would behave differently with the optimized policy. The only way to test this is to implement
the policy on test vehicles that are being driven under normal conditions to see how the user
would behave.
In order to use this policy on production vehicles, a logical next step would be to implement
the policy in vehicles where the user's driving behavior can be measured.
This could be
tested against vehicles running the original hand-tuned policy. If the newly optimized policy
performs better than the old policy, it could then be implemented in vehicles sold by Ford.
Along with testing the effects of the policy, another option to improve the performance of the
policy would be to collect more real world data. Though the simulator was constructed to
match the real-world data there was not enough real data to be sure that it matches the larger
population of driving behavior from drivers in Ford vehicles.
Collecting much more data
would allow for a policy-search algorithm to be run directly on the data, eliminating the need
for synthetic data. Unfortunately, this would still not address the problem that there would
be no way to test the effect of a policy on a driver's behavior. Each time a policy is optimized
it should be tested in real vehicles to ensure that drivers do not change their behavior with
the optimized policy in a way that negatively affects performance.
Alternatively, a large
group of random policies could be implemented for random drivers in an effort to show that
policies have no effect on drivers' actions. If this is true, then there would be no need to
test the effects on individual drivers. However, it seems unlikely that completely different
controllers would have no effect whatsoever on drivers' actions.
The main benefits of the approach outlined in this chapter are that it is simple, producing a
single policy that can be examined by Ford, and that it requires no computation inside the
car. However, it is likely that optimizing for a smaller subsection of driver behavior than
the whole population will produce better results than creating a policy that must be used
by each driver. I demonstrated this partially in this chapter, showing that optimizing for
subsets of a larger population produced better results than creating a single policy for the
entire population.
In the next chapter I will discuss optimizing for a single driver, which
should produce even better results.
Chapter 4
Optimizing a Policy for an Individual
Driver
In this chapter I will discuss optimizing the start-stop policy for each individual driver. This
should result in a lower cost on average than the previous approach of optimizing a single
policy for a population of drivers.
The primary disadvantage of this approach is that it
requires more computation in the car's internal computer, which may be an issue if the
computational power of the car is limited. It also requires time to find an optimized solution
and is more complex, needing much more debugging before it can be implemented in a real
car.
4.1 Problem Overview
In the previous chapter I sought to optimize

E[C | θ],    (4.1)

for the entire population by varying the policy θ, where C was the cost function comprised
of factors representing the fuel cost of the controller's actions and the annoyance felt by the
driver of the vehicle due to the controller's actions. In this chapter I will instead seek to
optimize

E[C | θ, d],    (4.2)

where d is the individual driver, again by varying θ. By creating a policy for each driver that
need only work well for that driver I should get better results than finding a single policy
that must work well for all drivers. In this chapter I will discuss experiments to find the
best policy using two methods of policy search, the Kohl-Stone policy gradient algorithm,
and CMAES, a variant of the cross-entropy maximization algorithm. In addition to these
two methods of policy search I will also discuss experiments using a model-based approach
where I attempt to minimize
E[C | θ, d, r],    (4.3)
where r is the recent behavior of the driver. I will do this by creating a model of the driver's
behavior using Gaussian processes.
As before I will use the simulator to measure the performance of my algorithms. The amount
of data available was not sufficient to test any of the approaches on real data. The disadvantages of the simulator were discussed in chapter 3, but one of the main disadvantages
of the simulator, that the simulated drivers do not change their behavior based on the different start-stop policies, is not as much of an issue here. This is because for most of the
approaches in this chapter the effectiveness of a policy would be determined by how that
policy, or a policy similar to it performed with the actual user. Therefore, the results incorporate how the user changed their behavior in response to a change in policy. I will examine
this further when discussing the results of each technique and their expected effectiveness
when implemented in a vehicle.
Along with an identical simulator, I used the same cost and rule configuration as before.
The main changes are that I assumed that I am allowed to do computation inside the car's
computer, and that the car's computer can switch policies at will. This allows me to use
data from the driver to find better policies. There are several complications introduced by
this approach, namely the computational expense, the ability for the algorithms to converge
to suboptimal solutions, and the additional complexity introduced by running algorithms on
the car's computer.
4.1.1 Issues
The additional computational load is the most notable effect of finding policies customized
for each driver.
In my experiments I did not create an explicit limit on the amount of
computation available inside the car, though I will discuss the computational expense of
each method.
The other main disadvantage of optimizing the policy for a driver is that it will result
in many different policies that cannot be predicted ahead of time, and therefore cannot be
checked manually. The algorithms used may also converge to policies that are worse than
before, for instance if the optimization overfits a policy to a recent driving period. I
will experimentally evaluate how often the tested algorithms converge to poorly performing
policies by comparing their performance with the general policy found in chapter 3.
An even worse situation than the algorithm finding a suboptimal policy would be for there
to be a bug in the code, leading to crashes, or the controller always returning a bad policy. It
is almost certainly possible to separate the algorithm for finding the best policy from other
computation inside the car's computer, so that a failure in the optimization algorithm does
not cause all of the car's computation to fail, but a bug in the code could lead to worse
policies being used. The controller returning a bad policy because of a bug would be worse
than the controller failing completely, because if the controller failed the car could use the
default policy values found for the population. Conversely, if the controller had a bug in it
that resulted in nonsensical values, such as settings that would lead to the car's engine never
turning off, the car would execute the flawed policy.
There is no way for me to guarantee that the final algorithms used in a production vehicle
will not have any bugs, especially because I believe that any algorithms used in consumer
vehicles would need to be recoded to the standards used for code executing in the car. The
only way for me to limit the number of bugs is to use relatively simple algorithms that
are much easier to debug than more complicated algorithms. Because this was a goal, the
two direct policy search algorithms chosen for analysis are simple to implement, though the
Kohl-Stone policy gradient algorithm is simpler than CMAES. The model-based approach
using Gaussian processes is more complicated however, and I relied on an external library
for my implementation. I found that Gaussian processes were not effective, so they seem to
be a suboptimal solution before considering complexity, but if a more complicated solution
obtained better results the tradeoff in cost versus complexity would need to be considered,
as it might not be worth it to find a solution with .001% more efficiency at the cost of greatly
increased complexity. This is in direct contrast to finding a policy for the population, where
the complexity of the approach does not need to be considered, because none of the code
runs inside a vehicle.
The final way in which optimizing for an individual differs from optimizing for a population
is that the speed of reaching an optimal solution must be considered. Previously
this did not matter, because I was optimizing for a static population offline. Here, if it takes
10,000 hours to find the best solution the person might have sold their car before an optimal
solution can be found. Even if it takes 100 or 10 hours, this could lead to problems if the
driver's behavior changes. Consider a car driven by one person on the weekdays for ten hours
and by someone else on the weekends for four hours. The optimal policy would probably
be different for each person, and if the policy took ten hours to converge it might never be
ideal for the person driving on the weekends. This shows the importance of finding a good
policy fast, though I will show that it is difficult to choose an optimal policy in a time-frame
of as little as ten hours. In order to speed up the process it would probably be necessary to
use an approach where key aspects of the driver were identified and then a policy optimized
for drivers with similar characteristics was swapped in. This approach would not lead to a
policy directly optimized for the driver, but still might get better results, and was discussed
briefly in the previous chapter.
4.2 Kohl-Stone Optimization
In this section I will discuss using the Kohl-Stone policy gradient algorithm in order to find an
optimized policy for an individual. I will discuss experiments using the algorithm, beginning
with the same framework I used in the previous section.
I will then discuss experiments
made with small modifications to the algorithm, in an attempt to make it perform better
when optimizing the start-stop controller. The Kohl-Stone policy search algorithm is simple,
so porting it into a real vehicle should not be an issue. When using this algorithm I can
also largely account for drivers behaving differently under different policies, as the
algorithm chooses a policy based on how each policy actually performs with the driver.
4.2.1 Initial Results
I will now discuss the results of using an identical version of the Kohl-Stone algorithm as
used in chapter 3. This is identical to the original Kohl-Stone algorithm, except that at
each iteration of the algorithm I only collect one new drive cycle and find the results of all
perturbations of the current policy using that drive cycle, instead of using a new drive cycle
to measure the performance of each perturbation of the policy. This requires assuming that
drivers will behave identically when presented with a controller that is very similar to another
controller. This assumption seems reasonable, and makes the algorithm much faster, as I
measure the performance of 50 perturbations of the policy at each iteration. Given that each
drive cycle is 6 hours long, not making this assumption would make each iteration require
300 hours of driving data.
As in chapter 3 I used a learning rate of .025 and a perturbation rate of .025, where all the
parameters were scaled to be in the range [0,1]. Each iteration of the algorithm consisted
of 6 hours of driving, and the parameters were updated after each iteration.
Instead of
beginning the policy search from a random location, I began with the best policy found for
the population of drivers the individual drivers were drawn from.
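To make this setup concrete, the following is a minimal Python sketch of one iteration of the modified Kohl-Stone update described above. The cost function, the clipping of parameters to [0, 1], and the guard against empty perturbation groups are my own assumptions for illustration, not code from the thesis or from Ford.

```python
import numpy as np

def kohl_stone_iteration(theta, drive_cycle, cost_fn,
                         n_perturbations=50, epsilon=0.025, eta=0.025):
    """One iteration of the modified Kohl-Stone policy gradient update (sketch).

    theta       -- current policy parameters, scaled to lie in [0, 1]
    drive_cycle -- the most recent six hours of driving data
    cost_fn     -- cost_fn(policy, drive_cycle) returns the controller cost
    epsilon     -- perturbation size; eta -- learning rate (step length)
    """
    dim = len(theta)
    # Evaluate 50 randomly perturbed copies of the current policy on the same drive cycle.
    perturbations = np.random.choice([-epsilon, 0.0, epsilon],
                                     size=(n_perturbations, dim))
    costs = np.array([cost_fn(np.clip(theta + p, 0.0, 1.0), drive_cycle)
                      for p in perturbations])

    # Estimate, dimension by dimension, whether raising or lowering the parameter helps.
    adjustment = np.zeros(dim)
    for d in range(dim):
        plus, zero, minus = (perturbations[:, d] > 0,
                             perturbations[:, d] == 0,
                             perturbations[:, d] < 0)
        if not (plus.any() and zero.any() and minus.any()):
            continue  # unlikely with 50 samples, but avoids averaging an empty group
        avg_plus, avg_zero, avg_minus = (costs[plus].mean(),
                                         costs[zero].mean(),
                                         costs[minus].mean())
        if avg_zero <= avg_plus and avg_zero <= avg_minus:
            adjustment[d] = 0.0                    # the current value already looks best here
        else:
            adjustment[d] = avg_minus - avg_plus   # positive if raising the parameter is cheaper

    # Move by a vector in the improving direction whose norm equals the learning rate.
    norm = np.linalg.norm(adjustment)
    if norm > 0:
        theta = np.clip(theta + eta * adjustment / norm, 0.0, 1.0)
    return theta
```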
To determine the effectiveness of the approach I ran 35 experiments with a randomly selected driver in each experiment. Between each iteration of the algorithm I measured the
performance of the current policy by generating five new random drive cycles from the driver
and finding the cost of the policy for each of them. Figure 4.1 shows the results of running
the algorithm for 200 iterations.
Figure 4.1: A graph of the average costs of the driver's policies as the algorithm progresses.
The algorithm starts with the policy optimized for the population of drivers and attempts
to find the best policy for the individual driver.
The initial policy has an average cost of 1424.0, while the policies after 200 iterations have
an average cost of 1401.8, for a 1.6% improvement. While this improvement is not nearly as
large as the 27% improvement moving from Ford's policy to an optimized policy, it is still
an improvement that would be worthwhile to implement in production vehicles. Figure 4.1
is quite noisy, and it may seem that this improvement is only due to the noise present in
the process.
Figure 4.2 shows a 10 step moving average, showing that there is clearly a
real improvement from the original starting location, though it also shows that the costs are
not monotonically decreasing, possibly due to both the noise present in the process and the
policies overfitting to recent drive cycles.
Figure 4.2: A graph of the moving average of the costs of the last ten iterations.
In addition to the cost difference from beginning to end, there are two other traits of the
algorithm that are important to monitor. The first is how quickly the algorithm converges
to a policy that has a locally optimal cost. To measure this I measured the total cost of
using the best policy found so far throughout the algorithm's runtime, and compared it with
the cost of using the original policy for 200 drive cycles instead. I found an average cost
of 280878 for using the best policy, compared with an average cost of 284805 for using the
original policy, giving an improvement of 1.4%. This is fairly close to the improvement of
1.6% found from always using the final policy, and, along with figure 4.2, demonstrates that
most of the improvements come early on in the algorithm's execution. However, it would
still be better to find the solution quicker, especially if the behavior of the driver is changing,
as the moving average of the costs stays near the original costs for the first 5 iterations,
representing 30 hours of driving, and does not approach the final cost until after more than
25 iterations, representing 150 hours of driving. I will investigate possible approaches to
speeding up the optimization in later sections.
The second important aspect of the algorithm is the variance of the cost improvement. It
is especially useful to know whether the algorithm can lead to significantly worse results for
individual users. This may happen if the policy overfits to recent drives, or if it becomes
stuck in a local minimum that is worse than the original policy. If the algorithm can lead
to worse results it may be useful to check whether the original policy would have performed much
better on the recent driving behavior. If it would have, the original policy can be substituted
back in, which ensures that the customized policy never performs much worse than
the static policy.
Figure 4.3 shows the improvement from the population's policy for all 35 trials. 22 of the
trials have an improvement, while 13 see a decrease in performance.
The cluster near 0
suggests that for a significant number of the drivers there is no improvement from using a
customized policy. These drivers likely behave similarly to an average driver drawn from the
population, and the differences between them may be due entirely to the noise present in
the process. There are no drivers that have a final policy that performs dramatically worse
than the initial policy, suggesting that the policy search algorithm could be used without any
modification, though using the original policy when the optimized policy performs poorly
may result in a slight performance gain.
In this section I have discussed the results of my application of the Kohl-Stone algorithm.
In the following sections I will vary the algorithm in order to determine if it is possible to
find better results, and if it is possible to find a fully optimized policy in a shorter period of
time.
4.2.2 Finding Policies in Parallel
One approach to finding better policies is to run parallel executions of the Kohl-Stone algorithm, each starting in a random location. The controller can then swap in the policy
that would have performed best during the car's recent activity after each iteration of the
algorithm. This has the potential to get better results than a single run of the algorithm if
the cost as a function of the policy has many local minima, and if the best policy for the
population is not always similar to the best policy for the driver.

Figure 4.3: The cost improvement from the first iteration to the last iteration for all 35 trials.
The approach has two disadvantages. The first is that it requires more computation inside
the car's computer. If five policies are searched for in parallel, the approach must compute
the Kohl-Stone algorithm for each, and must evaluate five times as many policies with its
start-stop controller. Kohl-Stone is relatively simple, but one run of the algorithm already
evaluates 50 policies at each iteration, so a complicated controller may make this approach
impractical. The second problem with optimizing policies in parallel is that when evaluating
the current policy in each execution of the algorithm only one policy was in use by the driver.
The other policies will be evaluated with the assumption that the driver would have behaved
similarly if they had been used instead. This could result in a situation where the controller
oscillates between two policies that fit the driver's past behavior, but which perform worse
when actually put in the car, while ignoring another policy that has better results when
actually used.
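A minimal sketch of the bookkeeping for this parallel variant is shown below, reusing the kohl_stone_iteration sketch from section 4.2.1. It is my reconstruction of the procedure described above, not the controller's actual code, and the function and variable names are illustrative.

```python
import numpy as np

def parallel_kohl_stone_step(policies, drive_cycle, cost_fn):
    """One step of the parallel variant: advance several independent Kohl-Stone runs
    on the same recent drive cycle, then deploy whichever current policy would have
    performed best on that cycle."""
    # Advance every run; each run started from its own random location.
    policies = [kohl_stone_iteration(theta, drive_cycle, cost_fn) for theta in policies]

    # Rank the current policies on the recent driving, assuming the driver would have
    # behaved the same under each of them, and deploy the cheapest one.
    costs = [cost_fn(theta, drive_cycle) for theta in policies]
    deployed = policies[int(np.argmin(costs))]
    return policies, deployed
```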
I tested the effectiveness of this approach in the same manner as I tested the Kohl-Stone
algorithm. I conducted 35 trials with 35 new drivers, with the same parameterization of Kohl-Stone as before. Each trial consisted of 5 parallel executions of the algorithm, and to measure
the performance of the algorithm I used the policy which performed best on my previous
test. Figure 4.4 shows a graph of the costs of this approach over time, comparing it to an
approach where the best policy for the population is used.
Figure 4.4: The blue line shows a moving average of the last 10 iterations of the parallel
Kohl-Stone approach, averaged across all 35 drivers. The green line shows the average result
that would have been achieved by using the best policy for the population of drivers.
I found that in the final iteration the average cost was 1428.3, compared with an average
cost when using the population's policy of 1423.6. This suggests that starting at a random
location is significantly worse than starting at the best location for the policy. This probably
occurs because the cost as a function of a policy is non-convex, with multiple local minima,
leading the policy search to converge to a result that is worse than the best policy in many
cases. Additionally, it seems that in this population a policy near the population's policy
represents the best policy for the individual driver, which is in line with the results from
section 4.2.1.
The approach of finding policies in parallel was relatively unsuccessful. I believe this to be
due to characteristics of the cost as a function of the policy for my simulated drivers.
If
the algorithms could be implemented in test vehicles, it might still be worth testing this
approach to discover if real drivers have similar characteristics to the simulated drivers, but
there does not appear to be an obvious benefit from using this approach.
4.2.3 Improving the Convergence Speed of the Algorithm
In addition to improving the final policy found by the algorithm, another objective is to
try to reduce the time the algorithm takes to converge to a final policy. In section 4.2.1
I demonstrated that the baseline Kohl-Stone algorithm takes 30 hours of driving to start
making significant improvements, and that it takes 150 hours to find the best performing
policy. Speeding this up would allow the vehicle to obtain larger fuel savings in a shorter
period of time, and cause the controller to react better when the driver characteristics change.
To try to speed up the algorithm I ran the Kohl-Stone algorithm as before, with multiple
iterations of the algorithm before collecting new data.
I chose to run 4 iterations of the
algorithm between each new collection of driving data, which could cause the algorithm to
be up to four times faster. If this approach had been successful I would have attempted to
increase the number of iterations without collecting data in order to further speed up the
algorithm.
The approach has the potential to be up to four times as fast as the original Kohl-Stone
algorithm. The disadvantage is that it can lead to overfitting, because it has the potential to
move more based on a single segment of driving. If all segments of driving are very similar
then this will be fine, but if they are different then this can lead to problems. For instance,
if there is an accident on the road, and all traffic is stop-and-go on a particular day, then the
controller may diverge significantly from an optimal controller if it is attempting to adapt too
fast. The approach also forces me to assume that drivers behave similarly when observing
a wider range of controller policies. The algorithm may move the controller policy up to
four times as far away before gathering new data than before, which may be enough of a
difference that a driver notices and behaves differently.
I ran another experiment with the same setup as 4.2.1, except with this change to the
algorithm, with the results shown in Figure 4.5. I find that the average policy used improves
by a much smaller amount, most likely due to overfitting, as it oscillates between policies
that perform better and worse than the original policy.
Note that the average of the costs is in all cases lower than average costs shown in Figure 4.2,
which measured the performance of the original Kohl-Stone algorithm. This indicates that
the drivers in this experiment, which were sampled from the same distribution as the drivers
in the previous study, produce a lower controller cost when using policies that were fit for
the whole population. This is shown by the results from the beginning of each graph, when
each driver is using the same policy, but the costs differ by more than 50. Therefore what I
attempted to measure was the percentage improvement. I found that when using a specific
approach the percentage improvement in cost for different drivers was similar, even if the
costs of the driver's actions were different.
Given that I found that optimized policies perform only 1.6% better on average than the
original policy, and that individual driving segments have extremely noisy behavior, it makes
sense that overfitting can be a problem if the algorithm attempts to learn faster.
The
measured differences in the effectiveness of policies are most likely due primarily to this noise,
which makes overfitting such an issue.
This approach was unsuccessful, and demonstrates the risk of overfitting when attempting to
improve the convergence speed of the algorithm. I will now discuss the results of my last set
of experiments with the Kohl-Stone algorithm, which were conducted in order to determine
the best learning rate for the algorithm.
Figure 4.5: The blue line shows a moving average of the last 10 iterations of the sped up
Kohl-Stone approach, averaged across all 35 drivers.
4.2.4 Determining the Learning Rate for Kohl-Stone
In the previous two subsections I discussed two variations of the Kohl-Stone policy gradient
algorithm, showing that neither performed better than the original algorithm when optimizing the start-stop controller's policy. Now I return to the original algorithm and describe my
attempts to improve its performance by modifying its learning rate. Although Kohl-Stone
has a small number of parameters, one of them, the learning rate, may have a large effect
on performance.
The learning rate for the Kohl-Stone algorithm determines the magnitude of the policy
change at each iteration of the algorithm. In the Kohl-Stone algorithm the policy's change
at each iteration is equal to a vector with the same direction as the gradient that has a
norm equal to the learning rate. This is slightly different than similar algorithms, such as
gradient descent, where the parameters change by the gradient multiplied by the learning
rate. Still, small and large learning rates have similar effects with Kohl-Stone as they would
with gradient descent. A large learning rate may overfit to the most recent driving segment,
moving too quickly towards a policy that performs well for the recent drive, but that is
not ideal for the driver's overall behavior. It may also overshoot the correct policy, moving
from a policy where certain parameters are too low compared to the ideal policy to a policy
where the parameters are too high. A small learning rate will lead to an algorithm that
converges at a slower rate. Furthermore, a small learning rate will explore a smaller region
of the parameter space at each iteration. This makes it more prone to getting stuck on local
minima, as if the parameters are the minima of the space explored at each iteration the
algorithm will stop making progress. I expect there to be many local minima, so this could
be an issue with smaller learning rates.
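Concretely, writing ĝ_t for the estimated gradient of the cost at iteration t (notation mine, not the thesis's), the two update rules differ only in how the step length is set:

\[
\theta_{t+1} = \theta_t - \eta\,\frac{\hat{g}_t}{\lVert \hat{g}_t \rVert}
\quad\text{(Kohl-Stone: every step has length } \eta\text{)},
\qquad
\theta_{t+1} = \theta_t - \eta\,\hat{g}_t
\quad\text{(gradient descent: step length scales with } \lVert \hat{g}_t \rVert\text{)}.
\]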
I chose a learning rate of .025 for all previous experiments based on initial experiments run
with an earlier version of the start-stop controller with a set of drivers that did not include
the urban drivers. I will now demonstrate results showing that .025 is most likely still close
to the optimal learning rate and that significantly smaller or larger learning rates do not
perform better.
First, I ran experiments with a learning rate of .0125 in order to determine if it was possible
to get better results with a smaller learning rate. Figure 4.6 shows the results for this setting,
which performs worse than the original Kohl-Stone algorithm. I believe this to be due to the
fact that it learns at a very slow rate, and that differences between policies that are close
together are very difficult to measure across the 6 hours of driving the algorithm observes
at each iteration.
Therefore with a small learning rate the algorithm improves a little bit
at the beginning, but is not able to move to a better location that is further away, as the
approach with a learning rate of .025 did.
Next, I ran experiments with a learning rate of .1, and then a learning rate of .05, in order to
determine the effectiveness of larger learning rates. The results from those experiments are
shown in Figure 4.7. Like the experiments with a learning rate of .0125 they perform worse
than the experiments with a learning rate of .025. In this case I believe it to be due to the
algorithm overfitting recent drive cycles. This is shown most noticeably for the experiment
with a learning rate of .1, where the algorithm's cost swings violently. This is probably due
to the rapid changes in the policy.

Figure 4.6: The blue line shows a moving average of the last 10 iterations of the Kohl-Stone
algorithm with a learning rate of .0125, averaged across all 35 drivers.
I found that both increasing and decreasing the learning rate produced worse results than
the original setup of the Kohl-Stone algorithm. The learning rate of .025 was chosen based
on similar tests with a preliminary version of the controller, so it is reasonable that the same
learning rate would still perform the best. It would be possible to fine-tune the learning rate
of the controller further, but it is likely that the best learning rate for the simulator and real
cars would be slightly different, so I did not make an effort to find the absolute best
learning rate. However, given the similarities between the real data and the simulated data
it seems reasonable that a learning rate of .025 would work well for an actual driver as well.
Figure 4.7: The blue lines show a moving average of the last 10 iterations of the Kohl-Stone
algorithm, averaged across all 35 drivers. The left graph corresponds to the experiment with
a learning rate of .05, and the right graph corresponds to the experiment with a learning
rate of .1.
4.2.5 Conclusion
In this section I examined the performance of the Kohl-Stone algorithm for policy improvement. I found that using the Kohl-Stone algorithm for individual drivers resulted in a policy
that had a cost that was on average 1.6% lower than the cost from using the best policy
found for the population of drivers. This improvement is much smaller than the 27% improvement I found when moving from Ford's policy to an optimized policy, but is still worth
implementing in real vehicles, as the Ford team has declared that even improvements that
resulted in a gain of efficiency of less than 1% are worth it because of the scale of the problem.
After finding this initial improvement I tried two alternative approaches, using variants of
Kohl-Stone, and also tried different values for the learning rate in the algorithm. I found
that the modifications to the algorithm and learning rate produced either equivalent or
worse results than the original algorithm. I do not have enough data from actual drivers to
determine if the alternative approaches would be more successful with real drivers, but if real
drivers behave similarly to simulated drivers using the original Kohl-Stone algorithm with
a learning rate near .025 will work the best. Still, it would be worthwhile to perform the
same experiments with real data if it could be obtained. I will now investigate an alternative
method of searching for policies using cross-entropy methods.
4.3 Cross-Entropy Methods
Another group of approaches for optimizing a policy is the family of cross-entropy methods, described
in section 2.2.3.
In this section I will describe my results when I optimized the policy
using a specific cross-entropy method, the covariance matrix adaptation evolution strategy,
or CMAES, which I described in section 2.2.4. The main difference between cross-entropy
methods and the Kohl-Stone policy gradient algorithm is that cross-entropy methods do not
attempt to approximate a gradient.
Instead they sample within a region, and then move
in the direction of the better samples. The shape of the sampling region is then updated
based on all of the samples. This has the potential to be faster than Kohl-Stone, as the
algorithm can sample from a wider area, and in some cases can quickly identify the region
where the minimum is located. It also has the potential to perform worse than Kohl-Stone
if the algorithm stops sampling from the area around where the best performing policy is
located. This may be a problem here because of how noisy the problem is. The noise may
cause the algorithm to believe that a region containing the minimum is not worth sampling
from further.
4.3.1 Experiment
To test the effectiveness of CMAES I conducted the same experiments as I used to test the
Kohl-Stone policy gradient algorithm. For each experiment I ran 35 trials, each with a random driver, and ran the algorithm for 200 iterations, which is about 1200 hours of simulated
driving time. Rather than implementing the algorithm myself, as I did for the Kohl-Stone
policy gradient algorithm, I used the CMAES implementation in Python provided by Hansen
[12]. I used all the default parameters for the algorithm, except that instead of taking the
default 13 samples at each iteration I took 50 samples, in order to compensate for the high
level of noise.
This also matches the number of samples I used at each iteration for the
Kohl-Stone algorithm.
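For illustration, the experimental loop can be sketched with Hansen's cma package roughly as follows. The helper evaluate_policy, the starting point theta0, the mapping of the 0.005 covariance diagonal to an initial standard deviation, and the bounds option are my assumptions about the setup rather than the code actually used, and the interface shown is that of the current pycma release.

```python
import cma  # Hansen's CMA-ES package (pycma)

# theta0: the best policy found for the population, with parameters scaled to [0, 1]
# evaluate_policy(theta): cost of one six-hour simulated drive cycle under policy theta
sigma0 = 0.005 ** 0.5   # initial standard deviation so the covariance diagonal starts at 0.005
es = cma.CMAEvolutionStrategy(theta0, sigma0,
                              {'popsize': 50,        # 50 samples per iteration, as above
                               'bounds': [0, 1]})    # keep sampled policies in range

for _ in range(200):                                 # 200 iterations of six hours each
    candidates = es.ask()                            # sample policies from the current distribution
    costs = [evaluate_policy(c) for c in candidates]
    es.tell(candidates, costs)                       # shift the mean and reshape the covariance

best_policy = es.result.xbest
```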
The two parameters of the CMAES algorithm that must be determined manually are the
initial value of the covariance matrix and the initial parameters of the policy. As with
Kohl-Stone I chose to start the search with the best policy found for the population of drivers
in order to have a starting location with the lowest average cost. The typical method
of choosing the initial covariance matrix is to use a diagonal matrix, which ensures that the initial
samples are independent in each coordinate around the mean.
I ran several experiments
with different initial values for these entries.
First I ran an experiment with an initial value of .001 along the diagonal of the covariance
matrix. I chose 0.001 so that the initial standard deviation along each axis would be similar
to the learning rate of the best parameterization of the Kohl-Stone algorithm. My results
from the experiment are demonstrated in figure 4.8, showing that the algorithm did not find
policies with improved performance.
This may have been due to the initial value of the
covariance matrix being too low or high. Figure 4.9 demonstrates the average value of the
covariance matrix along the diagonal over time, showing that it rose as the algorithm was
executed. This suggests that a larger initial value for the entries of the covariance matrix
might produce better results.
To test if a larger covariance matrix produced better results I ran the same experiment with
values of 0.005 along the diagonal of the initial covariance matrix instead of 0.001. Figure 4.10
shows the results of that experiment, showing that there is a small improvement in policies
used when this initial covariance matrix is used. I found that the average initial cost was
1222 and the average final cost was 1214 for an average improvement of .7%. While this is
an improvement, it is worse than the average improvement with the Kohl-Stone algorithm.
I next ran the experiments again with an initial value of 0.008, in order to test if it was
possible to find better results with a larger initial sampling region. Figure 4.10 shows the
results from the experiment, demonstrating worse results than when a setting of 0.005 was
used. I also ran experiments with initial values of 0.025, 0.05 , and 0.1 in the diagonal of the
covariance matrix, and found similar results. This shows that 0.005 is probably close to the
optimal initial setting for the value of the entries in the covariance matrix.

Figure 4.8: A 10 iteration moving average of the cost with the policy found from CMAES
with an initial covariance setting of .001 along the diagonals, averaged across all 35 drivers.
4.3.2 Summary
At best I found a 0.7% improvement in the cost using CMAES. This is less than Kohl-Stone's
best improvement, and as CMAES is also more complicated than Kohl-Stone it does
not seem to be a good alternative in this case. CMAES most likely performs worse because
the sampling region is altered based on recent results, which may be an issue for the
start-stop problem, because, as I have shown earlier, the cost of a six hour driving segment given
a policy and a driver is extremely noisy. Therefore CMAES may exclude certain regions
where the average cost is lower if it finds a few bad samples from that region. This seems to
make CMAES a poorer fit for the problem than the Kohl-Stone policy gradient algorithm.
Figure 4.9: The average value of the covariance matrix along the diagonals during the
execution of the CMAES algorithm, averaged across all 35 drivers.
4.4 Model Based Policy Search
The previous two sections described two approaches to optimizing the policy using model-free
reinforcement learning, where the policy is directly optimized. An alternative, described
in section 2.3, is to learn a model of the driver's behavior, and then either search for a
policy that performs well given the predicted behavior or use a precomputed policy that was
optimized for similar behavior.
The stop lengths of the driver are the most important aspects of their behavior to model.
The expected stop length for a particular stop should correspond with whether the controller
will turn the car's engine off or keep it on. This is the main decision the car's controller must
make, although the controller must also decide when the car's engine should be turned back
on, which would be determined by predicting the remaining time until the driver presses
the accelerator pedal when stopped. I will focus on predicting the stop length, though the
techniques in this section could be used to predict the time until accelerator pedal application.
Figure 4.10: A 10 iteration moving average of the cost with the policy found from CMAES,
averaged across all 35 drivers. The left graph shows the results when the initial setting along the
diagonal of the covariance matrix is .005, and the right graph shows when the setting is 0.008.
Once a stop length is predicted a policy must then be used.
It is possible to perform a
policy search given the predicted stop length on-line, but this would probably be too slow
to work in a vehicle. Instead, a reasonable process would be to precompute policies that
work well for various stop lengths. Then, when the actual stop length is predicted,
the controller would swap in the policy predicted for the stop length that is closest to the
value of the predicted stop. I will not discuss this part of the process further until I show my
results for predicting the stop length, as tuning policies for predicted stop lengths is useless
when there is not an accurate prediction mechanism.
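As a sketch of how such a precomputed policy bank might be used once a prediction is available (the stop-length keys and parameter vectors below are placeholders, not tuned values):

```python
import numpy as np

# Hypothetical precomputed policies, keyed by the stop length (in seconds) they were
# tuned for offline. The parameter vectors here are placeholders, not real tunings.
policy_bank = {
    2.0:  np.array([0.2, 0.1, 0.9]),
    10.0: np.array([0.5, 0.3, 0.7]),
    30.0: np.array([0.8, 0.6, 0.4]),
}

def select_policy(predicted_stop_length):
    """Swap in the precomputed policy tuned for the stop length closest to the prediction."""
    lengths = np.array(sorted(policy_bank))
    closest = float(lengths[np.argmin(np.abs(lengths - predicted_stop_length))])
    return policy_bank[closest]

# Example: a predicted 8-second stop selects the policy tuned for 10-second stops.
policy = select_policy(8.0)
```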
Predicting the stop length is an example of regression, as the goal is to predict a continuously
distributed random variable from a set of features of the driver's behavior. The regression
function will be fit on a per-driver basis, because certain feature values may be correlated
with longer or shorter stops for some drivers, and may not be correlated at all with stop
length for other drivers. For instance, some drivers may stop suddenly at stop lights and
others may not. Therefore a sudden deceleration before a stop may be an indicator that
some drivers are at a stop light and would be expected to have a long stop, but may indicate
the opposite for other drivers, where stopping suddenly might mean they are in stop and go
traffic and would be expected to have a short stop.
To predict the stop length on a per driver basis I will first use Gaussian Processes, which I
described in section 2.3. The predictions provided by Gaussian processes have both a mean and
a variance, giving the program an idea of how likely the mean prediction is to be accurate.
This is useful because the controller may wish to act differently given its confidence of the
mean prediction. For example, it may be best for the controller to act differently if it predicts
that the mean is 11 seconds and there is a 99% chance the stop will be between 10
and 12 seconds, than if it predicts that the mean is 11 seconds and there is a 60%
chance the stop will be between 10 and 12 seconds. For my Gaussian process implementation
I used the GPML library [7].
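The experiments used the GPML library; as a stand-in, the same kind of probabilistic prediction can be sketched in Python with scikit-learn's Gaussian process regressor. The kernel choice and the variable names X_train, y_train, and X_test are assumptions for illustration, not the configuration actually used.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# X_train: one row of driver features per observed stop; y_train: stop lengths in seconds.
# The WhiteKernel noise level plays the role of the output-variance hyperparameter and is
# fitted, along with the length scale, by maximizing the marginal likelihood of the data.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

# Each prediction comes with a mean and a standard deviation, so the controller can
# weigh how confident the model is before committing to a policy.
mean_stop, std_stop = gp.predict(X_test, return_std=True)
```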
Gaussian Processes have one important hyperparameter that must be chosen by hand, the
estimate of the variance of the output, σ². I chose this parameter by selecting the value of
σ² that maximized the likelihood of the data. Maximum likelihood methods are prone to
overfitting when there are many values being fitted, but because there is only a single scalar
value to fit here there is little risk of overfitting, and maximum likelihood is a good choice
of method to use to decide the variance.
A potential issue with Gaussian processes for the stop prediction problem is that they typically express their prediction as a normally distributed random variable. It is also possible
to use a logistic distribution, which is similar to the normal distribution, but has heavier
tails. The stop distribution in both the collected data and the simulator is exponentially
distributed, as Figure 4.11 shows. I could find no examples of research using Gaussian processes for exponentially distributed variables, and there are no libraries available that predict
an exponentially distributed variable with Gaussian processes. This does not guarantee
that there will be an issue, as it is possible that individual stop lengths are drawn from a
normal distribution given features of the driver's behavior, but that without features they
are exponentially distributed.
It is still something that must be kept in mind, and most
likely explains at least some of the poor performance of the algorithm in the experiments I
ran.
Figure 4.11: A histogram showing what percentage of stops in the simulator have a given length.

I also experimented with nearest neighbors regression, in order to determine if the
problems with Gaussian Processes were fixable with a simple approach that does not rely on
distributional assumptions. This approach is described in section 2.3 and, unlike Gaussian
Processes, provides a point prediction instead of a probabilistic prediction. This is less useful,
but if the predictions are very accurate they could still be used to find a policy. I used the
Scikit-Learn library for my implementation of nearest neighbors regression [15].
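A minimal version of this baseline with Scikit-Learn might look as follows; the feature-scaling step is my addition, and the variable names are assumed for illustration.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train: features for each past stop; y_train: the observed stop lengths in seconds.
# Standardizing the features first keeps distances from being dominated by features
# with large numeric ranges such as seconds since the last acceleration.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_train, y_train)

predicted_stop_lengths = knn.predict(X_test)   # point predictions only, no uncertainty estimate
```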
4.4.1 Experiment With Normally Distributed Output Variables
The first step of the experiment was deciding which features to use.
I chose 30 features
that I felt represented the drivers' overall pattern of recent driving, as well as any behavior
from the driving segment immediately before the stop. These feature values included the
average vehicle speed from the 30 minutes preceding the stop, as well as from the 10 minutes
preceding the stop, the maximum stop length over the past 30 and 10 minutes, the average
stop length over the past 30 and 10 minutes, the number of seconds since the accelerator pedal
was pressed, the speed at which the driver began decelerating, and the rate of deceleration
immediately before the stop. For each feature that summarized recent driver behavior I created
a feature representing the driver's behavior from the last 30 minutes, as well as the last 10
minutes in order to emphasize short term behavior more, but to also capture the driver's
medium term behavior as well.
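The sketch below illustrates the kind of feature extraction described above. The exact thresholds, sampling rate, and feature definitions are my assumptions rather than the features actually computed in the experiments, and the deceleration-related features are omitted for brevity.

```python
import numpy as np

def stop_run_lengths(stopped):
    """Lengths, in samples, of consecutive runs where the vehicle is stopped."""
    runs, current = [], 0
    for s in stopped:
        if s:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs or [0]

def stop_features(speed, accel_pedal, t_stop, dt=1.0):
    """Features computed at the moment a stop begins.

    speed and accel_pedal are uniformly sampled numpy traces up to the stop, t_stop is
    the sample index where the stop begins, and dt is the sampling period in seconds.
    """
    feats = {}
    for minutes in (30, 10):
        n = int(minutes * 60 / dt)
        spd = speed[max(0, t_stop - n):t_stop]
        runs = stop_run_lengths(spd < 0.1)            # treat near-zero speed as stopped
        feats['avg_speed_%dm' % minutes] = float(np.mean(spd))
        feats['max_stop_%dm' % minutes] = max(runs) * dt
        feats['avg_stop_%dm' % minutes] = float(np.mean(runs)) * dt
    pressed = np.flatnonzero(accel_pedal[:t_stop] > 0)
    feats['sec_since_accel'] = (t_stop - pressed[-1]) * dt if pressed.size else t_stop * dt
    return feats
```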
I then measured the effectiveness of Gaussian Processes at predicting the stop length from
a driver given training sets of different sizes. I conducted my experiment using 35 different
drivers, with a 1000-stop test set, and measured the effectiveness of training sets that had
sizes ranging from 12 stops to 1750 stops. Figure 4.12 shows the effectiveness of the Gaussian
processes for each of these training set sizes at minimizing the squared error between the
mean of the predictive distribution for the stop and the actual stop length.
Figure 4.12: Graph of the error rate given the size of the training set used.
As Figure 4.12 shows, the Gaussian processes with a normally distributed output variable
perform poorly, with the best results coming with a training set of size 1000, where the
square root of the average squared error is 8.6 seconds, roughly 2 seconds better
than predicting the mean every time. With a 1000-stop training set the average predicted
probability density of the actual stops is .003, which was fairly constant across all training set
sizes. An example of the predictions from the Gaussian process is given in Figure 4.13, which
shows predicted versus real values for a 30-stop sample. The graph shows that
generally the Gaussian process predicts a stop close to the mean, with the actual distribution
having a much larger range of values than the predicted values. Gaussian processes produce
better results than simply predicting the mean of the distribution though. The stop lengths
have a mean of 10.8 seconds, and the exponential distribution has a standard deviation
equal to its mean, so the average error would be 10.8 seconds if predicting the mean, instead
of the 8.6 seconds found when using a Gaussian process. However the probability predictions
are very inaccurate, as the actual stops had an average probability density of .003, while
outputting predictions with probability densities drawn from an exponential distribution
with a mean of 10.8 would result in an average probability density of .046. This may be due
to the exponential distribution having a heavy tail, while the normal distribution has light tails,
which I will attempt to fix by looking at the results when using a logistic variable for the
output. It also may be due to the fact that the normal distribution is symmetric,
and thus must allot as much probability density to values less than 0 as to values more than
2x, where x is the mean of the distribution. Stop lengths less than 0 will never
occur, while, as Figure 4.13 shows, it is somewhat common to have an actual stop that is
more than twice the predicted value of the stop. There does not seem to be a way to fix this
without using an exponentially distributed output variable.
4.4.2 Experiment with Output Variable Drawn from Logistic Distribution
To test the effects of using a different distribution for the output variable, I tested the
Gaussian process again, using an output variable drawn from a logistic distribution instead of
a normal distribution. This performed slightly better than the normally distributed variable
at predicting probability densities, with the Gaussian process predicting a .0119 probability
density on average for each stop length, but had an average error of 9.2 seconds at best.

Figure 4.13: Graph demonstrating predictions versus actual stop lengths. The green line
represents the predicted stop lengths, while the blue line represents the actual stop lengths.
Figure 4.14 shows a graph comparing predicted stop lengths to actual stop lengths, showing
how it once again performs quite poorly.
4.4.3 Nearest Neighbors Regression
To determine if the distributional assumptions of the Gaussian processes were the only
reason for their poor performance, I next conducted experiments to predict stop lengths
using nearest neighbors regression.
This is a simpler technique than Gaussian Processes,
and does not produce probabilistic estimates, but has the benefit of making no assumptions
about the distributions of outputs given the inputs.
I conducted this experiment similarly to the experiments with Gaussian processes, using a
training set of 1000 stops. I initially averaged the 5 nearest neighbors to find predictions
for a new stop, finding the square root of the mean squared error to be 13.3. This is much
worse than Gaussian Processes, and is due both to the noise present in the training set and
to nearest neighbors' inability to determine which features are meaningful.

Figure 4.14: Graph demonstrating predictions versus actual stop lengths when the prediction
is a random variable drawn from a logistic distribution. The green line represents the
predicted stop lengths, while the blue line represents the actual stop lengths.

Figure 4.15 shows
that the nearest neighbors predictions have a much greater variability than the Gaussian
processes' predictions, which leads to worse results because they are still unable to predict
accurately.
Increasing the number of neighbors to average over decreases this variability, but never finds
a solution better than predicting the mean value of the stop times. If the training set size
was increased the predictions would improve, but because the training set is drawn from the
driver's past driving sessions it is unrealistic to keep increasing the training set size.
For
example, in order to collect a training set of 10000 stops a year of driving may be required, by
which time the behavior of the driver may have changed. Therefore it does not appear that
nearest neighbors is a useful regression technique for predicting stop times, at least for the
simulated data. This may be due to the simulator, or to the simplicity of nearest neighbors,
which does not determine which features are most important, or make any distributional
assumptions when making predictions.

Figure 4.15: Graph demonstrating predictions versus actual stop lengths for nearest neighbors
regression. The green line represents the predicted stop lengths, while the blue line represents
the actual stop lengths.
4.4.4 Summary
Using Gaussian Processes with either a logistic distribution or a normal distribution for
the output variable resulted in poor predictions for the stop length. There are two issues
that seem to be preventing this approach from working well.
First, the stop length is
exponentially distributed, but the predictors are not. This is even more of an issue because
both predictors are symmetric, while the stop length cannot be less than zero. It might be
possible to normalize the distributions and remove the predicted values of less than zero,
but the probability predictions at each point would still be very imprecise. I tried to address
this issue using a nearest neighbors based regression method, but found it to be inaccurate
as well. This may be due to the simplicity of the model, or it may indicate that until the
second issue is resolved the choice of method is relatively meaningless.
The second issue is that the simulator is using a Markov model, where the stop lengths are
drawn randomly from an exponential distribution each time there is a stop. The mean
of this distribution is set by which of the four states the driver is currently in. Thus the
best that can be done from the summary features is to predict the mean of the exponential
distribution.
The Ford script for producing brake and accelerator pedals looks ahead, so
features from the brake and accelerator pedal pressure immediately preceding the stop may
be slightly more predictive of the stop length, but it is highly unlikely that there is enough
data currently to predict a stop length within less than 3 or 4 seconds based on driver's trace
files from the simulator. Drivers probably do not behave this way in real life, as previous
behavior is most likely predictive for stopping time, but I do not have enough data to measure
how drivers behave. It would be possible to arbitrarily create drivers in the simulator that
stop for certain lengths based on their previous behavior, but there would be no way to
verify that experiments done with that set of drivers would be replicable with real drivers.
Both issues have the potential to be fixed. In order to determine how effective individual driver behavior is at predicting stop lengths, more data must be gathered from drivers. Currently Ford does not have enough data from individual drivers to construct a model of drivers' stopping behavior. If this data could be gathered, it would be possible to use nearest neighbors again, or a generalized linear model, to predict the stop length instead of using Gaussian Processes. With a generalized linear model the predicted value would be an exponentially distributed variable whose mean, or the inverse of its mean, is a linear combination of the features. This would fix both of the major issues with the current stop-length model while still allowing for probabilistic predictions.
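As a concrete illustration of this generalized-linear-model idea, the sketch below fits an exponential regression by maximum likelihood, with the rate (the inverse of the mean) modeled as the exponential of a linear combination of the features. The features, coefficients, and data here are synthetic placeholders rather than anything derived from Ford's data.

```python
# Sketch of a generalized linear model for stop length: the stop length is
# treated as exponentially distributed with rate exp(w . x), so its mean is
# the inverse of a log-linear function of the features. Data are synthetic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

n, d = 500, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])  # intercept + features
w_true = np.array([-2.5, 0.4, -0.3])            # hypothetical true coefficients
rate = np.exp(X @ w_true)
y = rng.exponential(scale=1.0 / rate)           # stop lengths, mean = 1 / rate

def neg_log_likelihood(w):
    # Exponential log-likelihood: sum_i (log(rate_i) - rate_i * y_i)
    eta = X @ w
    return -np.sum(eta - np.exp(eta) * y)

result = minimize(neg_log_likelihood, x0=np.zeros(d), method="BFGS")
w_hat = result.x
print("estimated coefficients:", np.round(w_hat, 2))
print("predicted mean stop length for first stop:", 1.0 / np.exp(X[0] @ w_hat))
```

Unlike a Gaussian Process with a normal output, this model respects the non-negativity and skew of the stop lengths while still providing a full predictive distribution, namely an exponential with the estimated rate.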
Given the inaccuracy of the stop-length predictions, I did not proceed to the next step, which is determining what policy should be used based on the predicted stop lengths. It would be more appropriate to consider this step once a more accurate model can be created, especially because the form of the model may need to change in order to improve its accuracy.
4.5 Conclusion
In this chapter I examined three approaches to selecting a policy that performs best for an individual driver: the Kohl-Stone policy gradient algorithm, the covariance matrix adaptation
evolution strategy, and an approach that used Gaussian processes to create a model of the
driver's behavior.
Kohl-Stone and CMAES represent model-free approaches that attempt
to learn the best policy for the individual directly. With the best parameters, I found an
average improvement of 1.6% using Kohl-Stone, and an improvement of 0.7% using CMAES, over the policy optimized for the population that I found in Chapter 3. These improvements are much smaller than the gain from moving from Ford's hand-tuned policy to the population-optimized policy, but would still be worth implementing in production vehicles.
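For reference, the sketch below shows the kind of Kohl-Stone update that the model-free approach performs, in the style described in [4]. The objective function, perturbation size, step size, and number of policy parameters are all illustrative assumptions; in the thesis the score would come from running the start-stop simulator for an individual driver.

```python
# Sketch of a Kohl-Stone style policy gradient step (after [4]): perturb each
# parameter of several candidate policies by -eps, 0, or +eps, score them,
# and move along the estimated gradient. evaluate_policy is a stand-in for
# the cost of one simulated driving session under the policy theta.
import numpy as np

rng = np.random.default_rng(4)

def evaluate_policy(theta):
    # Placeholder cost to minimize; not the thesis's start-stop cost function.
    return float(np.sum((theta - 1.0) ** 2) + rng.normal(scale=0.1))

def kohl_stone_step(theta, eps=0.05, step_size=0.02, n_policies=20):
    perturbations = rng.choice([-eps, 0.0, eps], size=(n_policies, theta.size))
    scores = np.array([evaluate_policy(theta + p) for p in perturbations])

    adjustment = np.zeros_like(theta)
    for i in range(theta.size):
        minus = perturbations[:, i] < 0
        zero = perturbations[:, i] == 0
        plus = perturbations[:, i] > 0
        if not (minus.any() and zero.any() and plus.any()):
            continue  # too few samples in one of the groups this round
        avg_minus, avg_zero, avg_plus = (scores[minus].mean(),
                                         scores[zero].mean(),
                                         scores[plus].mean())
        # Leave the parameter alone if the unperturbed value scores best.
        if not (avg_zero < avg_plus and avg_zero < avg_minus):
            adjustment[i] = avg_plus - avg_minus
    norm = np.linalg.norm(adjustment)
    if norm > 0:
        theta = theta - step_size * adjustment / norm  # descend the cost
    return theta

theta = np.zeros(4)
for _ in range(200):
    theta = kohl_stone_step(theta)
print(np.round(theta, 2))   # drifts toward the placeholder cost minimum at 1.0
```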
My other approach was a model-based reinforcement learning method, where I attempted
to learn a model of the driver's stop lengths given features of the driver's behavior. This
fared poorly, most likely due to how stops are distributed, as well as the configuration of the
simulator as a state machine.
I expect both of the model-free approaches to also produce better results than a static policy in real vehicles. I also suspect that the model-based approach would perform better in real vehicles, as it is likely that a driver's stop length is correlated with features from before the stop, whereas in my model stop lengths are explicitly not correlated with the features. The next step is then to collect real data and implement the model-free approaches in test vehicles. Using real data it is possible to determine whether the model-free approach would work well if drivers do not change their behavior based on the policy in use. If drivers do change their behavior based on the policy in use, it is necessary to implement Kohl-Stone on a real vehicle to observe how well it works. From actual data it should also be possible to determine which features predict driver stopping behavior, which would allow the creation of a better model of drivers' stop lengths. Policies could then be determined based on the predicted stop length. These policies could be tested either on the data or in test vehicles. Again, if drivers change their behavior based on the policy in use, it would be necessary to measure the effectiveness of the approach in a test vehicle, not on collected data.
Chapter 5
Conclusion
In this thesis I aimed to improve the performance of a car's start-stop controller using machine learning. I investigated how well various algorithms could improve a controller, first trying to find the controller that is best on average for all drivers, and then evaluating techniques that optimize a controller for a single driver.
Because of the lack of available data, I built a simulator in order to evaluate these methods.
This simulator was designed to have its average behavior mimic the limited set of driving
data available, but also to simulate a wide variety of driving behavior. All methods were evaluated on the simulator, though I also evaluated the population-level optimization methods on the real data that was available.
I found major gains when optimizing a controller for the whole population: a 27% improvement in cost over the pre-existing controller on the simulated data, and a 9% improvement on the real data. This suggests that the improved controller could be immediately tested for use in production vehicles. The only remaining questions are whether the available data is representative of the full driving population, and whether drivers will change their behavior in response to the changed controller. The first question can be answered with data drawn from the general population rather than from drivers testing out the start-stop system, but the second can only be resolved if the controller is tested in a vehicle.
I next evaluated methods for optimizing the controller for an individual. I found a further
1.6% improvement on the average simulated driver using similar techniques to optimizing
for the whole population, which is enough of an improvement to be worth implementing in
a production vehicle. I then evaluated another method for optimizing a controller directly,
as well as a method which built a model for the driver's actions before learning a controller.
Building a model was ineffective, due partially to the lack of sophistication in the simulator.
Unfortunately building more sophisticated behavior into the simulator seems infeasible without more real-world data. Still, the gains from optimizing a controller directly are promising
and merit study in an actual vehicle.
5.1 Future Work
There are two major areas for future study. First, the methods used in this thesis could be implemented in actual vehicles for testing. This would reveal drivers' responses to new controllers and ensure that they do not change their behavior for the worse. Once the new methods have been evaluated in test vehicles they can be put into production vehicles.
The second area for future study is more sophisticated methods that require more data. For instance, it would be more feasible to build a model of individual driver behavior if many hours of real driving data were available. This would hopefully enable a model-based reinforcement learning method to be more successful. More data could be used on its own to test these methods, or it could serve as the basis for a more sophisticated simulator that models real drivers more faithfully.
Even given the small amount of data on hand, the conclusions of this thesis are very promising. It appears that, assuming drivers do not react negatively to a new controller, at least a 9% improvement over the old controller is reachable simply by using a new controller that is optimized for the population. Reducing the energy used during stops by 9% in all new cars would greatly increase the aggregate efficiency of the vehicles on the road, and likewise reduce the amount of fossil fuel they burn.
Bibliography
[1] Ford Motor Company 2010 Annual Report. http://corporate.ford.com/doc/ir2010_annualreport.pdf
[2] Ford's Fuel Economy Leadership. http://media.ford.com/images/10031/FuelEconomyNumbers.pdf
[3] Keith Barry. Ford Offers Auto Start-Stop Across Model Range. http://www.wired.com/autopia/2010/12/ford-offers-auto-start-stop-across-model-range/
[4] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation, 2004.
[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[6] G. Theocharous, S. Mannor, N. Shah, P. Gandhi, B. Kveton, S. Siddiqi, and C.-H. Yu. Machine Learning for Adaptive Power Management. Intel Technology Journal, Vol. 10, pp. 299-312, November 2006.
[7] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[8] J. Zico Kolter, Zachary Jackowski, and Russ Tedrake. Design, Analysis, and Learning Control of a Fully Actuated Micro Wind Turbine. Under review, 2012.
[9] K. Dalamagkidis and D. Kolokotsa. Reinforcement Learning for Building Environmental Control. In C. Weber, M. Elshaw, and N. M. Mayer (Eds.), Reinforcement Learning: Theory and Applications, pp. 283-294, 2008.
[10] P. Bodik, M. P. Armbrust, K. Canini, A. Fox, M. Jordan, and D. A. Patterson. A case for adaptive datacenters to conserve energy and improve reliability. University of California at Berkeley, Tech. Rep. UCB/EECS-2008-127, 2008.
[11] Freek Stulp and Olivier Sigaud. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
[12] N. Hansen. The CMA Evolution Strategy: A Tutorial. November 2005. Available: http://www.lri.fr/hansen/cmatutorial.pdf
[13] Andrew Y. Ng and Michael Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 406-415, Stanford, California, 2000.
[14] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229-256, 1992.
[15] F. Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.