Optimizing a Start-Stop System to Minimize Fuel Consumption using Machine Learning

by

Noel Hollingsworth

S.B., C.S. M.I.T., 2012

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

February 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, December 10, 2013
Certified by: Leslie Pack Kaelbling, Panasonic Professor of Computer Science and Engineering, Thesis Supervisor
Accepted by: Albert R. Meyer, Chairman, Masters of Engineering Thesis Committee

Abstract

Submitted to the Department of Electrical Engineering and Computer Science on December 10, 2013, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science.

Many people are working on improving the efficiency of car engines. One approach to maximizing efficiency has been to create start-stop systems. These systems shut the car's engine off when the car comes to a stop, saving fuel that would be used to keep the engine running. However, these systems introduce an additional energy cost, associated with restarting the engine, which must be balanced against the savings. In this thesis I describe my work with Ford to improve the performance of their start-stop controller. I discuss optimizing the controller both for the general population and for individual drivers, using reinforcement-learning techniques in both cases to find the best-performing controller. I find a 27% improvement over Ford's current controller when optimizing for the general population, and an additional 1.6% improvement over the improved controller when optimizing for an individual.

Thesis Supervisor: Leslie Pack Kaelbling
Title: Panasonic Professor of Computer Science and Engineering

Contents

1 Introduction
  1.1 Problem Overview
  1.2 Outline

2 Background
  2.1 Reinforcement Learning
  2.2 Policy Search Methods
    2.2.1 Policy Gradient Algorithms
    2.2.2 Kohl-Stone Policy Gradient
    2.2.3 Evolutionary Algorithms
    2.2.4 Covariance Matrix Adaptation Evolution Strategy
  2.3 Model Based Policy Search
    2.3.1 Gaussian Processes
    2.3.2 Nearest Neighbors
  2.4 Related Work
    2.4.1 Machine Learning for Adaptive Power Management
    2.4.2 Design, Analysis, and Learning Control of a Fully Actuated Micro Wind Turbine
    2.4.3 Reinforcement Learning for Energy Conservation in Buildings
    2.4.4 Adaptive Data Centers
  2.5 Conclusion

3 Learning a Policy for a Population of Drivers
  3.1 Problem Overview
  3.2 Simulator Model
    3.2.1 Simulator Output
  3.3 Optimizing For a Population
    3.3.1 Experimental Results
    3.3.2 Optimizing For a Subset of a Larger Population
    3.3.3 Experimental Results
  3.4 Conclusion

4 Optimizing a Policy for an Individual Driver
  4.1 Problem Overview
    4.1.1 Issues
  4.2 Kohl-Stone Optimization
    4.2.1 Initial Results
    4.2.2 Finding Policies in Parallel
    4.2.3 Improving the Convergence Speed of the Algorithm
    4.2.4 Determining the Learning Rate for Kohl-Stone
    4.2.5 Conclusion
  4.3 Cross-Entropy Methods
    4.3.1 Experiment
    4.3.2 Summary
  4.4 Model Based Policy Search
    4.4.1 Experiment With Normally Distributed Output Variables
    4.4.2 Experiment with Output Variable Drawn from Logistic Distribution
    4.4.3 Nearest Neighbors Regression
    4.4.4 Summary
  4.5 Conclusion

5 Conclusion
  5.1 Future Work

List of Figures

2.1 Kohl-Stone Policy Search Pseudo-Code
2.2 CEM Pseudo-code
3.1 Hypothetical Rule-Based Policy
3.2 Vehicle Data Trace
3.3 Simulator State Machine
3.4 Simulator Speed Comparison
3.5 Simulator Brake Pressure Comparison
3.6 Simulator Stop Length Comparison
3.7 Kohl-Stone Convergence Graph
4.1 Results of Kohl-Stone for Individual Drivers
4.2 Moving Average of Kohl-Stone for Individuals
4.3 Kohl-Stone Result Histogram
4.4 Parallel Kohl-Stone Results
4.5 Kohl-Stone Results When Sped Up
4.6 Results from Kohl-Stone with a Slower Learning Rate
4.7 Results from Kohl-Stone with a Faster Learning Rate
4.8 Cost Improvement from CMAES with an Initial Covariance of 0.001
4.9 Average Covariance Values for CMAES
4.10 Cost Improvement from CMAES with an Initial Covariance of 0.005 and 0.008
4.11 Stop Length Histogram
4.12 Error from Gaussian Processes as a Function of Training Set Size
4.13 Comparison of Gaussian Processes Predictions with Actual Results
4.14 Comparison of Gaussian Processes Predictions with Actual Results with Laplace Output Variable
4.15 Comparison of Nearest Neighbors Predictions with Actual Results

Chapter 1

Introduction

Energy consumption is one of the most important problems facing the world today. Most current energy sources pollute and are not renewable, which makes minimizing their use critical. The most notable way that Americans use non-renewable energy is driving their cars, so any improvement to the efficiency of the cars being driven could have a major effect on total energy consumption. The reduction can be significant even if the gain in efficiency is small. In 2009 Ford sold over 4.8 million automobiles [1], with an average fuel efficiency of 27.1 miles per gallon [2]. Assuming these vehicles are driven 15,000 miles a year, a 0.01 mile-per-gallon improvement in efficiency would save more than 980,000 gallons of gasoline per year: the fleet drives 4.8 million x 15,000 = 7.2 x 10^10 miles, which consumes about 2.657 x 10^9 gallons at 27.1 miles per gallon, and about 980,000 gallons less at 27.11 miles per gallon.

One approach that automobile manufacturers have used to improve fuel efficiency is to create start-stop systems for their cars. These systems turn off the vehicle's engine when it comes to a halt, saving fuel that would otherwise be expended running the engine while the car is stationary. Start-stop systems have primarily been deployed in hybrid vehicles, but are now being deployed in automobiles with pure combustion engines as well [3]. These systems have produced a 4-10% improvement in fuel efficiency in hybrid cars, but even greater gains should be possible. Although the systems are generally efficient, shutting the car's engine down for a short period of time loses more energy to the restart than is gained from the engine being off. By turning the engine off less often for shorter stops and more often for longer stops, it may be possible to conserve substantially more energy. This thesis explores several methods of improving a controller's performance by making more accurate predictions of drivers' stopping behavior.

1.1 Problem Overview

In this thesis I describe my work with Ford to improve the performance of their start-stop controller. I examine several methods of improving the performance of the controller, measured by a cost function defined by Ford that combines three factors: the energy savings from turning the car's engine off, the energy cost of restarting the engine after turning it off, and the annoyance felt by the driver when they attempt to accelerate while the engine is still shut down. My goal was to improve the performance of the controller by altering its policy for shutting the engine down and turning the engine back on. I did not define the controller's entire policy.
Instead, I was given a black-box rule-based controller whose policy has five parameters, and which determines when to turn the car's engine off and when to turn it back on. I attempted to change the parameters of the policy in order to maximize the performance of the system.

I was given less than ten hours of driving data with which to optimize the policy. This is not enough data to determine the effectiveness of the various approaches I used, so I created a simulator in order to measure their effectiveness. The simulator was designed to support custom drivers, sampled from a distribution whose mean behavior matches the behavior in Ford's data. This allowed me to measure the performance of the algorithms without using real data. It is still important to test the algorithms on real data before implementing them in production vehicles, in order to determine how drivers react to different policies. I will describe what sort of data would need to be collected when I discuss the effectiveness of each algorithm.

I took two broad approaches to optimizing the policy. First, in chapter three, I discuss optimizing a single policy for all drivers in a population. This approach makes it easy for Ford to ensure that the chosen policy is sensible, and requires no computing resources in the car beyond those already being used. Second, in chapter four, I discuss optimizing a policy in the car's computer for the driver of that car. This has the potential for better results but requires additional computing resources online; in addition, Ford would have no way of knowing what policies the controller would end up using. It is up to Ford to balance the trade-off between complexity and performance, but in each section I explain the advantages and disadvantages of the approach used.

1.2 Outline

After the introduction, this thesis consists of four additional chapters.

- In chapter two I discuss the background of the problem. I first describe the approaches I used to optimize the controller's performance, then discuss several other applications of machine learning to energy conservation, comparing and contrasting the algorithms they used with the ones I experimented with.

- In chapter three I discuss the problem setup in more detail and describe optimizing a single policy for a population of drivers. I first give a formal overview of the problem, then describe the simulator I created to generate synthetic data, and finally discuss using the Kohl-Stone policy gradient algorithm [4] to minimize the expected cost of a policy over a population of drivers.

- In chapter four I discuss determining a policy online for each driver. I describe three approaches: the Kohl-Stone policy gradient algorithm, a cross-entropy method, and model-based reinforcement learning using Gaussian processes. For each approach I describe the performance gain compared to a policy found for the population of drivers the individual was drawn from, and discuss potential difficulties in implementing the algorithm in a production vehicle.

- In chapter five I conclude the thesis, summarizing my results and discussing potential future work.

Now I will proceed to the background chapter, where I introduce the methods used in the thesis.

Chapter 2

Background

This chapter provides an overview of the knowledge needed to understand the methods used in the thesis.
The primary focus is on the reinforcement learning techniques that will be used to optimize the start-stop controller, though I also give a brief overview of alternative methods of optimizing the controller. The chapter concludes with a look at related work that applies machine learning in the service of energy conservation.

2.1 Reinforcement Learning

Reinforcement learning is a subfield of machine learning concerned with optimizing the performance of agents that interact with their environment while attempting to maximize an accumulated reward or minimize an accumulated cost. Formally, given a summed discounted reward

    R = \sum_{t=0}^{\infty} \gamma^t r_t,    (2.1)

where r_t denotes the reward gained at time step t and \gamma is a discount factor in the range [0, 1), the goal is to maximize

    E[R \mid \pi],    (2.2)

where \pi corresponds to a policy that determines the agent's actions.

In this setting the agent often does not know what reward its actions will generate, so it must take an action in order to determine the reward associated with it. The need to take exploratory actions makes reinforcement learning a very different problem from supervised learning, where the goal is to predict an output from an input given a training set of input-output pairs. A consequence of this difference is that one of the most important aspects of a reinforcement learning problem is balancing exploration and exploitation. An agent that always exploits by taking the best known action may never try an unknown action that could produce higher long-term reward, while an agent that constantly explores will take suboptimal actions and obtain much less reward. Another key difference from supervised learning is that the states the agent observes depend both on the states it previously observed and on the actions it executed in those states. This means that certain actions may generate large short-term rewards but lead to states that perform worse in the long run.

Reinforcement learning problems are typically formulated as a Markov decision process, or MDP, defined by a 4-tuple

    (S, A, P, R),    (2.3)

where S is the set of states the agent may be in; A is the set of all possible actions the agent may take; P gives the transition probability distribution for each state-action pair, with p(s' \mid s, a) specifying the probability of entering state s' having been in state s and executed action a; and R is the reward function, with R(s, a) specifying the reward received by the agent for executing action a in state s. The goal is to find a policy \pi mapping states to actions that leads to the highest long-term reward. As shown in equation 2.1, the reward is usually described as a discounted sum of the rewards received at all future steps.

A key element of the MDP is the Markov property: given an agent's history of states {s_0, s_1, ..., s_{n-1}, s_n}, the agent's reward and transition probabilities depend only on the latest state s_n, that is,

    P(s' \mid s_1, \ldots, s_n, a_1, \ldots, a_n) = P(s' \mid s_n, a_n).    (2.4)

This is a simplifying assumption that makes MDPs much easier to solve.
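To make the objective in equation 2.2 concrete, the following sketch estimates E[R | \pi] for a fixed policy by averaging Monte Carlo rollouts. It is an illustration rather than anything used in the thesis; the environment interface (reset/step, in the style of common RL simulators) and all names are assumptions.

    def estimate_return(policy, env, episodes=100, gamma=0.99, horizon=1000):
        """Monte Carlo estimate of E[R | pi]: average the discounted return
        of several rollouts.  `env.reset()` returns a state; `env.step(a)`
        returns (next_state, reward, done)."""
        total = 0.0
        for _ in range(episodes):
            state = env.reset()
            ret, discount = 0.0, 1.0
            for _ in range(horizon):  # truncate the infinite sum in eq. 2.1
                state, reward, done = env.step(policy(state))
                ret += discount * reward
                discount *= gamma
                if done:
                    break
            total += ret
        return total / episodes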
There are a few algorithms that are guaranteed to converge to a solution for an MDP, but unfortunately the start-stop problem deals with a state space that is not directly observable, and Ford required the use of a black-box policy parameterized by five predetermined parameters. I will therefore use a more general framework that instead searches through the space of parameterized policies.

2.2 Policy Search Methods

Policy search methods are used to solve problems where a predefined policy with open parameters is given. In policy search there is a parameterized policy \pi(\theta) and, as in an MDP, a cost function C(\theta) to be minimized, which can equivalently be thought of as maximizing a reward equal to the negative of the cost. Instead of choosing actions directly, the goal is to determine a parameter vector for the policy, which then chooses actions based on the state of the system. This can allow for a much more compact representation, because the parameter vector \theta may have far fewer parameters than a corresponding value function. Additionally, background knowledge can be used to pre-configure the policy, restricting it to sensible classes. Any algorithm drawn from this family is called a policy search algorithm.

2.2.1 Policy Gradient Algorithms

Policy gradient algorithms are a subset of policy search algorithms that work by finding the derivative of the reward function with respect to the parameter vector, or some approximation of it, and then moving in the direction of lower cost. This is repeated until the algorithm converges to a local minimum. If the derivative can be computed analytically and the problem is an MDP, then equation 2.5, known as the policy gradient theorem, holds:

    \frac{\partial R}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} Q^\pi(s, a).    (2.5)

Here d^\pi(s) is the summed probability of reaching state s using the policy \pi, with the discount factor applied, which can be expressed as follows, where s_0 is the initial state and \gamma is the discount factor:

    d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr\{s_t = s \mid s_0, \pi\}.    (2.6)

Many sophisticated policy search algorithms rely on the policy gradient theorem. Unfortunately it relies on the assumption that the policy is differentiable. This does not hold for the start-stop problem: Ford requires that we use a black-box policy, which lets me observe the results of applying the policy to an agent's actions but not view the policy itself so as to obtain a derivative. The policy is also a rule-based system, so it is unlikely that a smooth analytic gradient even exists. Because of this I used a relatively simple policy gradient algorithm published by Kohl and Stone [4].

2.2.2 Kohl-Stone Policy Gradient

The primary advantage of the Kohl-Stone algorithm is that it can be used when an analytical gradient of the policy is not available. In place of an analytical gradient, it obtains an approximation of the gradient by sampling, an approach first used by Williams [14]. The algorithm starts with an arbitrary initial parameter vector \theta. It then loops, repeatedly creating random perturbations of the parameters and determining the cost of using those perturbed policies. A gradient is interpolated from the costs of the random perturbations, and the algorithm adds a vector of length \eta in the direction of the negative gradient to the policy. The process repeats until the parameter vector converges to a local minimum, at which point the algorithm halts. Pseudo-code for the algorithm is given in Figure 2.1.
    \theta <- initialParameters
    while not finished do
        create R_1, ..., R_m, where each R_j is \theta with its values randomly perturbed
        determine C(R_1), C(R_2), ..., C(R_m)
        for i <- 1 to |\theta| do
            // Average the costs observed after perturbing parameter i up, down, or not at all
            AvgPos  <- average of C(R_j) over all j where R_j[i] > \theta[i]
            AvgNone <- average of C(R_j) over all j where R_j[i] = \theta[i]
            AvgNeg  <- average of C(R_j) over all j where R_j[i] < \theta[i]
            // Determine the value of the differential for parameter i
            if AvgNone < AvgPos and AvgNone < AvgNeg then
                change_i <- 0
            else
                change_i <- AvgNeg - AvgPos
        // Move the parameters a step of length \eta toward lower cost
        change <- (change / |change|) * \eta
        \theta <- \theta + change

Figure 2.1: Pseudo-code for the Kohl-Stone policy search algorithm, written here for cost minimization.

Another advantage of the algorithm, aside from not needing an analytic gradient, is that only three parameters must be tuned. The first of these is the learning rate \eta. If the learning rate is too low the algorithm converges very slowly, while if it is too high the algorithm may diverge, or may skip over a better local minimum and settle in one that has higher cost. The usual solution is to pick an arbitrary learning rate, decrease it if divergence is observed, and increase it if the algorithm is progressing very slowly. This approach may not be suitable when a user cannot tweak the settings of the algorithm online, although there are methods that adapt the learning rate online automatically.

The second parameter chosen by the user is the amount by which each parameter is randomly perturbed. If the value is set too low, the differences in cost due to noise may be much greater than the differences due to the perturbation. If it is set too high, the algorithm's approximation of the gradient may no longer be accurate. This is sometimes set arbitrarily, though I performed experiments for the start-stop problem in order to determine which values worked best.

The last parameter that must be set is the initial policy vector. The policy gradient algorithm converges to a local minimum, and the starting vector determines which local minimum it converges to. If the problem is convex there is only one minimum, so the choice of starting policy does not matter, but in most practical cases the problem is non-convex. One solution is to restart the search from many different locations and keep the best minimum found. When the policy is found offline this adds computation time but is otherwise relatively simple to implement. Unfortunately, like the learning rate, the starting location is much more difficult to set when it must be chosen online, as the cost of trying different starting locations manifests itself in the real-world cost being optimized.
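Before moving on to evolutionary methods, here is a compact Python rendering of the loop in Figure 2.1. It is an illustrative reconstruction, not Ford's or Kohl and Stone's code; the cost function, learning rate eta, perturbation size eps, and population size m are stand-ins for whatever the application supplies.

    import numpy as np

    def kohl_stone(cost, theta0, eta=0.025, eps=0.25, m=50, iters=200):
        """Kohl-Stone policy search for cost minimization.  `cost` maps a
        parameter vector to a scalar cost."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(iters):
            # Perturb each parameter by -eps, 0, or +eps at random.
            signs = np.random.randint(-1, 2, size=(m, theta.size))
            costs = np.array([cost(theta + eps * s) for s in signs])
            change = np.zeros(theta.size)
            for i in range(theta.size):
                pos = costs[signs[:, i] > 0]
                none = costs[signs[:, i] == 0]
                neg = costs[signs[:, i] < 0]
                if min(len(pos), len(none), len(neg)) == 0:
                    continue  # a group is empty; leave this parameter alone
                # Move toward the lower-cost side, unless no change is best.
                if not (none.mean() < pos.mean() and none.mean() < neg.mean()):
                    change[i] = neg.mean() - pos.mean()
            norm = np.linalg.norm(change)
            if norm > 0:
                theta += eta * change / norm
        return theta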
2.2.3 Evolutionary Algorithms

Kohl-Stone and related methods are policy gradient approaches: given a policy, the next policy is chosen by taking the gradient, or some approximation of it, and moving the policy in the direction of the negative gradient. A different approach is a sampling-based evolutionary one, which works by starting at a point, sampling around it, and moving in the direction of the better samples. Even though this approach and Kohl-Stone both use sampling, they are somewhat different. Kohl-Stone uses sampling in order to compute a gradient, and then uses the gradient to move to a better location, while an evolutionary approach uses the sampled results directly to move to a better location. As a result, the samples may be taken much further away from the current policy than the samples used in Kohl-Stone, and the algorithm moves to a weighted average of the good samples, not in the direction of a gradient.

One example of an evolutionary approach is the cross-entropy method, or CEM [11]. CEM has the same basic framework as Kohl-Stone, starting at an initial policy \theta_0 and iterating through updates until it converges to a minimum. In addition to the initial location it also needs an initial covariance matrix \Sigma_0. In each update it generates k samples from the multivariate normal distribution with mean \theta_t and covariance \Sigma_t, evaluates the cost of using each sampled policy, and moves to an average of the best sampled results, where only the K_e best results are used. It also sets the covariance matrix to the covariance of those best results. Pseudo-code is given in Figure 2.2.

    \theta <- initialPolicyVector
    \Sigma <- initialCovariance
    while not finished do
        create R_1, ..., R_k, where each R_i is randomly sampled from NORMAL(\theta, \Sigma)
        determine C(R_1), C(R_2), ..., C(R_k)
        sort R_1, R_2, ..., R_k in increasing order of cost, so the best K_e come first
        \theta <- (1 / K_e) \sum_{i=1}^{K_e} R_i
        \Sigma <- (1 / K_e) \sum_{i=1}^{K_e} (R_i - \theta)(R_i - \theta)^T

Figure 2.2: Pseudo-code for CEM.

2.2.4 Covariance Matrix Adaptation Evolution Strategy

The covariance matrix adaptation evolution strategy, or CMAES, is an improvement over the cross-entropy method [12]. It works similarly, except that the procedure of the algorithm is more configurable and an evolutionary path is maintained. The path stores information about previous updates of the covariance matrix and mean, so the adaptation of the covariance matrix is based on the entire history of the search, producing better results. CMAES also contains many more parameters that can be tweaked to produce better results for a specific problem, though all of them may be left at sensible defaults. CMAES and CEM are very general algorithms for function optimization, but they are used quite often in reinforcement learning because they provide a natural way of determining what information to gather next. One example is Stulp and Sigaud's use of the evolutionary path adaptation of the covariance matrix from CMAES in the Path Integral Policy Improvement algorithm, or PI^2 [11]. The PI^2 algorithm has goals similar to CEM and CMAES, except that instead of aiming to maximize a function it aims to maximize the result of a trajectory. This is a very different goal from function maximization, because in trajectories earlier steps are more important than later ones, and PI^2's applicability to trajectories makes it an important algorithm for robotics. In my case I used a standard CMAES library, because it is most natural to approach the start-stop problem as a function to be optimized.
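The thesis does not name the CMAES library it uses; as an illustration, the open-source Python package `cma` exposes the standard ask-and-tell interface shown below, here applied to a placeholder quadratic standing in for the black-box start-stop cost over five normalized parameters.

    import cma  # pip install cma

    def policy_cost(theta):
        # Placeholder for the black-box controller cost; in practice this
        # would run a drive trace through the controller with policy theta.
        return sum((t - 0.3) ** 2 for t in theta)

    es = cma.CMAEvolutionStrategy([0.5] * 5, 0.1)  # initial mean and step size
    while not es.stop():
        candidates = es.ask()                      # sample candidate policies
        es.tell(candidates, [policy_cost(c) for c in candidates])
    best_theta = es.result.xbest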
2.3 Model Based Policy Search

The approaches discussed so far use reinforcement learning directly to optimize the start-stop controller. An alternative is to first learn a model using supervised learning, and then run policy search against the model to find the optimal policy. In a supervised learning problem the user is given pairs of input and output values (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(n), y^(n)), where each x^(i) is a vector of real values and each y^(i) is a real value, and the goal is to predict the value y^(i) for a new input x^(i). The system can then use a policy search algorithm to determine the policy vector \theta that performs best given the predicted output y^(i). Using a model this way introduces the limitations that the policy chosen has no effect on the output, and that the policy chosen does not affect future states of the world. For the start-stop controller, these limitations amount to assuming that the driver will drive the same way no matter how the controller behaves.

When y^(i) is real-valued, as it is when trying to predict the length of a car's stop, the supervised learning problem is called regression. I make use of two regression techniques in this thesis: Gaussian processes and nearest neighbors regression.

2.3.1 Gaussian Processes

The first regression method I used was Gaussian processes, a framework for nonlinear regression that gives probabilistic predictions given the inputs [7]. A prediction often takes the form of a normal distribution N(y \mid \mu(x), \Sigma(x)), where y is normally distributed with a mean and variance determined by x. The probabilistic prediction is helpful in cases where the uncertainty associated with the answer is useful, as it may be in the stop-prediction problem. For example, the controller may choose to behave differently when it is relatively sure the driver is coming to a stop 5 seconds long than when it believes the driver has a roughly equal chance of coming to a stop anywhere between 1 and 9 seconds long.

Gaussian processes work by choosing functions from the set of all functions to represent the input-to-output mapping. This is more expressive than linear regression, a simpler form of regression in which only linear combinations of a set of basis functions applied to the input can be used to predict the output. Overfitting is avoided by using a Bayesian approach in which smoother functions are given higher prior probability. This corresponds to the view that many functions are somewhat smooth, and that the smoother a function is, the more likely it is to explain just the underlying process creating the data and not the noise associated with the data.

Once the Gaussian process has been used to predict the value of y^(i), the optimal value of the policy parameter \theta can be found. This can be done with one of the search methods described previously, as long as the reward of using a specified policy for any value of y^(i) can be calculated. It may present problems if a decision must be made quickly and the policy space is not easy to search, because a new policy must be found every time a decision is made. If a near-optimal policy can be found quickly, this can be a good approach, because the policy can be dynamically determined for each decision that must be made.
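As an illustration of this pipeline, the sketch below fits a Gaussian process on synthetic stand-in data with scikit-learn (the thesis does not say what tooling it used). The features and the linear ground truth are fabricated purely for the example; the point is the probabilistic prediction, a mean and a standard deviation for each query.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(200, 3))             # stand-in stop features
    y = 5 * X[:, 0] + rng.normal(0, 0.5, size=200)   # stand-in stop lengths

    # A smooth RBF kernel plus a white-noise term; the smoothness prior is
    # what keeps the GP from fitting the noise in the training data.
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
    gp.fit(X, y)

    mean, std = gp.predict(rng.uniform(0, 1, size=(5, 3)), return_std=True)
    # `mean` is the predicted stop length; `std` says how sure the model is,
    # which a controller could use to hedge its shutdown decision.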
2.3.2 Nearest Neighbors

The second regression technique I used was nearest neighbors regression. Given a point x^(i), this technique predicts y^(i) by averaging the outputs of the training examples that are closest to x^(i). The number of training examples to average is a parameter of the algorithm and can be chosen using cross-validation. Unlike Gaussian processes, nearest neighbors regression gives a point estimate instead of a probabilistic prediction.

The advantage of nearest neighbors regression is that it is simple to implement and makes no distributional assumptions. That makes it ideal as a simple check that a more sophisticated approach such as Gaussian processes is not making errors due to incorrect distributional assumptions or a poor choice of hyper-parameters. A policy can be found as soon as a prediction for y^(i) is made, using the same search methods described earlier.
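A matching sketch for nearest neighbors regression, again on fabricated stand-in data, with the number of neighbors chosen by cross-validation as described above:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, size=(200, 3))             # stand-in stop features
    y = 5 * X[:, 0] + rng.normal(0, 0.5, size=200)   # stand-in stop lengths

    # Pick the number of neighbors to average by cross-validation.
    search = GridSearchCV(KNeighborsRegressor(),
                          {"n_neighbors": [1, 3, 5, 10, 25]}, cv=5)
    search.fit(X, y)
    prediction = search.predict(rng.uniform(0, 1, size=(1, 3)))
    # A point estimate only: no uncertainty, unlike the Gaussian process.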
2.4 Related Work

I will now discuss examples of machine learning being used for energy conservation. Machine learning is useful for energy conservation because instead of creating simple systems that must be designed for the entire space of users, or even a subset of users, the system can be adapted to the current user. This can enable higher energy savings than are possible with systems that are not specific to the user. I discuss four examples, to illustrate where machine learning can be applied and what methods are used: the work by Theocharous et al. on conserving laptop battery power using supervised learning [6]; Kolter, Jackowski, and Tedrake's work on optimizing a wind turbine's performance using reinforcement learning [8]; the work by Dalamagkidis et al. on minimizing the power consumption of a building [9]; and lastly the work of Bodik et al. on creating an adaptive data center to minimize power consumption [10].

2.4.1 Machine Learning for Adaptive Power Management

One area where machine learning is often used for power conservation is minimizing the power used by a computer. By turning off components of the computer that have not been used recently, power can be conserved, provided the user does not begin using the components again shortly thereafter. Theocharous et al. applied supervised learning to this area, using a method similar to the model-based policy approach described earlier [6]. A key difference is that their approach was classification based: the goal was to classify the current state as a suitable time to turn components of the laptop off. Another notable difference from the approach I described is that they created custom contexts and trained a custom classifier for each context. This is an interesting area to explore, especially if the contexts can be determined automatically. One method for determining such contexts is a hidden Markov model, which I looked into, but it was too complex for Ford to implement in their vehicles.

2.4.2 Design, Analysis, and Learning Control of a Fully Actuated Micro Wind Turbine

Another interesting application of machine learning to energy maximization is Kolter and colleagues' work on maximizing the energy production of a wind turbine [8]. This can be treated similarly to energy conservation: instead of minimizing the energy spent, the goal is to maximize the energy gained, by choosing the settings of the wind turbine that maximize energy production. Their work uses a sampling-based approach to approximate a gradient. The approach is more complex than the Kohl-Stone algorithm, as they form a second-order approximation of the objective function in a small area around the current policy known as the trust region, solve the problem exactly within the trust region, and continue. This takes advantage of second-order information about the function, hopefully allowing quicker convergence of the policy parameters. The disadvantages of the approach are that it is more complex, needing increased time for semidefinite programming to solve for the optimal solution in the trust region, and that it needs an increased number of samples to form the second-order approximation. In their case the calculations could be done before the wind turbine was deployed, so the increased computation time was not an issue. In our case we would like the approach to be feasible online, and it is unclear whether enough computational power would be allocated to our system in the car to perform the additional computations their approach needs.

2.4.3 Reinforcement Learning for Energy Conservation in Buildings

Another example of reinforcement learning being used to minimize energy consumption is in buildings. Dalamagkidis et al. sought to minimize the power consumption of a building, subject to an annoyance penalty that was paid if the temperature in the building was too warm or too cold, or if the air quality inside the building was not good enough [9]. This is a very similar problem to the start-stop problem, where I seek a policy that minimizes both the fuel cost and the annoyance the driver feels. Their approach used temporal-difference learning, a standard method for solving reinforcement learning problems. With it they were able to learn a custom policy that approached the performance of a hand-tuned policy within 4 years of simulated time.

2.4.4 Adaptive Data Centers

The last example of machine learning being used to minimize power consumption is in data centers, which have machines that can be turned off based on the workload expected in the near future. The goal is to meet the demands of the data center's users while keeping as many machines turned off as possible. Bodik et al. used linear regression to predict the workload that would be demanded in the next two minutes [10], and changed the parameters of a policy vector based on this prediction. This resulted in 75% power savings over never turning machines off, while incurring service-level violations in only 0.005% of cases.

2.5 Conclusion

There are a variety of machine learning approaches that can be used to find an optimal start-stop controller. This chapter summarized the techniques I applied to the problem. I also gave a brief summary of other areas where machine learning has been used to save or produce energy, and discussed the advantages and disadvantages of the approaches used. In the next two chapters I study the results of applying these techniques, first to the problem of determining a policy that performs well for a population of drivers, and then to the problem of finding a policy that adapts to the driver of the car in which the policy is used.

Chapter 3

Learning a Policy for a Population of Drivers

In this chapter I discuss optimizing a single policy for a population of drivers in order to minimize average cost. This is similar to the approach Ford currently takes, except that their policy is a hand-tuned one that they expect to work well on average, rather than one optimized with a machine learning or statistical method. The benefits of this approach are that the policy can be examined before it is used in vehicles to check that it is reasonable, and that no computation is required inside a car's computer beyond computing the actions to take given the policy.
The disadvantage of this approach is that the single policy used for all drivers will probably use more energy than a unique policy specialized to each driver.

3.1 Problem Overview

In this approach the goal is to select the start-stop controller that performs best across all users. Formally, the goal is to pick a parameter vector \theta for the policy that minimizes

    E[C \mid \theta]    (3.1)

over all drivers in the population, where C is composed of three factors:

    C = c_1 + c_2 + c_3.    (3.2)

The first factor, c_1, represents the fuel lost by leaving the car's engine on while the car is stopped, and is equal to the time the car has spent stopped minus the time the combustion engine was turned off during a driving session. The second factor, c_2, represents the fixed energy cost of restarting the car's engine, and is equal to 2 times the number of times the engine has been restarted during a period of driving; the cost of 2 per restart was a detail specified by Ford. The last factor, c_3, is the driver annoyance penalty, paid based on the length of time between the driver pressing the accelerator pedal and the car actually accelerating. Each time the driver accelerates, c_3 is incremented by 10 x max(0, restartDelay - 0.25), where restartDelay is the time elapsed between the driver pressing the accelerator pedal and the engine turning back on. (A sketch of this cost computation appears at the end of this section.)

The goal is to minimize the expected cost E[C \mid \theta] by choosing an appropriate parameterized policy \pi(\theta). Ford uses a rule-based policy, an example of which is shown in Figure 3.1. Here the parameter vector corresponds to the thresholds in the rules.

    if CarStopped then
        if EngineOn and BrakePedal > \theta_0 then
            TURN ENGINE OFF
        if EngineOn and BrakePedal > \theta_1 and AverageSpeed < \theta_2 then
            TURN ENGINE OFF
        if EngineOff and BrakePedal < \theta_3 then
            TURN ENGINE ON
        if EngineOff and BrakePedal < \theta_4 and TimeStopped > 20 then
            TURN ENGINE ON

Figure 3.1: Hypothetical rule-based policy

This policy framework is invoked 100 times a second during a stop in order to determine how to proceed. There are two important things to note about this rule-based policy. First, it is an example of what the policy might look like, not the actual policy: Ford gave us an outline similar to this, but did not tell us what the actual rules are, so we do not know what the parameters correspond to. Second, the controller is responsible both for turning the engine off when the car comes to a stop and for turning the engine back on when the user is about to accelerate. The second responsibility helps to minimize c_3 in the cost function, but may also lead to more complicated behavior, as the controller may turn the engine off and on multiple times during a stop if it believes the car is about to accelerate when it is not.

Given the black-box nature of the policy, we are not allowed to add or remove parameters, and must only change the vector \theta in order to achieve the best results. Our inputs from any driver are three data traces, each sampled at 100 Hz: the driver's speed, the accelerator pedal pressure, and the brake pedal pressure. Figure 3.2 shows an example of the input data.
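Before turning to the simulator, the cost definition above can be made concrete in a few lines. The Stop record and its fields are hypothetical stand-ins for however a drive trace gets segmented; only the three formulas come from the definitions of c_1, c_2, and c_3 above.

    from dataclasses import dataclass, field

    @dataclass
    class Stop:
        stop_time: float        # seconds the vehicle was stationary
        engine_off_time: float  # seconds the engine was off during the stop
        restarts: int           # number of engine restarts during the stop
        restart_delays: list = field(default_factory=list)  # pedal-press-to-restart delays (s)

    def trace_cost(stops):
        """C = c1 + c2 + c3 for one driving session."""
        c1 = sum(s.stop_time - s.engine_off_time for s in stops)  # idling fuel
        c2 = 2 * sum(s.restarts for s in stops)                   # restart cost
        c3 = sum(10 * max(0.0, d - 0.25)                          # annoyance
                 for s in stops for d in s.restart_delays)
        return c1 + c2 + c3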
3.2 Simulator Model

In order to help determine the best policy, Ford gave us a small amount of data, representing under 5 hours of driving from a driver on an Ann Arbor highway loop, as well as about an hour of data from drivers in stop-and-go traffic. This data was insufficient to run the experiments we hoped to perform, so I built a simulator to create more data similar to the data from Ford.

Figure 3.2: A slice of the data. This graph shows a driver coming to a stop, and then accelerating out of it.

The simulator is based on a state machine, shown in Figure 3.3. It has three states: Stopped, Creeping, and Driving. In the Stopped state the driver has come to a stop. In the Driving state the driver is driving at a constant speed, with Gaussian noise added in. The Creeping state represents the driver moving at a low speed with their foot on neither pedal. The driver stays in their current state for a predetermined amount of time, and then transitions to another state by either accelerating or decelerating at a relatively constant rate.

Figure 3.3: The state machine used by the simulator. Circles represent states and arrows represent possible transitions between states.

The simulator is parameterized to allow for the creation of many different drivers. For example, the average stopping time is a parameter in the simulator, as is the maximum speed a driver will accelerate to after being stopped. There are 40 parameters in all, allowing me to create a population of drivers with varying behavior.

The main weakness of the simulator is the state machine model itself. It is unlikely that a driver's actions are determined only by their current state. For example, a driver on the highway would be expected to have a pattern of actions different from a driver in stop-and-go traffic in an urban environment. To alleviate this issue, and to help me conduct my experiments, I created 2000 driver profiles, 1000 of whose parameters were centered around those of a driver whose behavior matched the Ann Arbor data, and 1000 whose parameters were centered around a simulated driver who matched the stop-and-go data. Every driver uses four randomly picked driver profiles, using the first 50% of the time, two others 20% of the time each, and the last 10% of the time. This lets a driver's behavior depend on their recent actions, as a driver uses one profile for a period of time before switching to another.
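The sketch below illustrates the three-state model just described. Everything in it is illustrative: the real simulator has about 40 parameters per profile, ramps speed at a roughly constant rate between states rather than jumping, and feeds its speed trace into a separate pedal-generation program; the parameter names and the exponential dwell times are my assumptions.

    import random

    def simulate(profile, n_transitions, hz=100):
        """Emit a speed trace (km/h, sampled at `hz`) from a three-state
        Stopped/Creeping/Driving driver model."""
        targets = {"stopped": 0.0,
                   "creeping": profile["creep_speed"],
                   "driving": profile["cruise_speed"]}
        state, trace = "stopped", []
        for _ in range(n_transitions):
            # Hold the current state for a random dwell time...
            dwell = random.expovariate(1.0 / profile["mean_dwell"][state])
            for _ in range(int(dwell * hz)):
                noise = random.gauss(0, profile["speed_noise"]) \
                    if state == "driving" else 0.0
                trace.append(max(0.0, targets[state] + noise))
            # ...then move to one of the other two states.
            state = random.choice([s for s in targets if s != state])
        return trace

    profile = {"creep_speed": 5.0, "cruise_speed": 60.0, "speed_noise": 1.5,
               "mean_dwell": {"stopped": 15.0, "creeping": 5.0, "driving": 60.0}}
    speeds = simulate(profile, n_transitions=20)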
3.2.1 Simulator Output

The simulator described above outputs the speed trace of the vehicle. After the speed trace is generated, a second program takes the speed values and generates appropriate values for the brake and accelerator pedal pressures. Ford created this program in Matlab and I ported it to Java. The output of the two programs combined is a data trace with the same format as data from a real car, giving all three values at 100 Hz for a drive cycle. The only difference in output format is that the synthetic data is never missing any values, while the data from a real car occasionally lacks values for certain time steps. This occurs rarely, and only causes a millisecond of data to be missing at a time, so it was not modelled in the simulator.

The highway driving data was given to me first, so I created the highway driver profiles first. Figures 3.4, 3.5, and 3.6 show that I was able to match coarse statistics from the highway driving data.

Figure 3.4: The percentage of the time the driver spends at different speeds. On the left is real data from the Ann Arbor highway loop; on the right is simulated data from a profile designed to match the highway loop.

Figure 3.5: The percentage of the time the driver spends with different brake pressures when at a stop. On the left is real data from the Ann Arbor highway loop; on the right is simulated data from a profile designed to match the highway loop.

Figure 3.6: A histogram showing the length of stops in a trace. On the left is real data from the Ann Arbor highway loop; on the right is simulated data from a profile designed to match the highway loop. The right trace has a much larger number of stops, but the distribution of stopping times is similar.

Figures 3.4, 3.5, and 3.6 summarize traces generated from the real data and the simulated data. Aside from analyzing aggregate statistics generated from the traces, I analyzed small subsections of traces to ensure that individual outputs behaved similarly as well. I did this by looking at certain events, such as a driver coming to a stop and then accelerating out of it, and checking that the simulator output traces that behaved like the real data.

About six months later I was given the stop-and-go data, comprising eight smaller driving episodes from stop-and-go traffic. I created a driving profile, different from the highway profile, with summary statistics similar to that data, as shown in the following table.

                                      Sim - Highway   Sim - Stop-And-Go   Ford Stop-And-Go
    Average Speed (km/h)                       23.2               33.8               33.5
    Average Stop Time (s)                     19.24              10.32               9.72
    Average Time Between Stops (s)           102.78              41.48              40.83
    % of Segments Below 10 km/h                   4                 76                 77

A driving segment is defined as the period of driving between instances of the vehicle coming to a halt. The statistics show that the profile behaves like the real stop-and-go data: the vehicle stops for shorter lengths of time, and often does not accelerate to a high speed between stops. Combined with the highway simulator profile, it enables me to create data similar to all of the data that Ford has given me.
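For concreteness, statistics like those in the table can be computed from a speed trace in a few lines. This is a hypothetical reconstruction rather than the analysis code used for the table; the stop threshold and the exact definitions (for instance, reading "segment below 10 km/h" as a segment whose peak speed stays under 10 km/h, and averaging speed over the whole trace) are my guesses.

    import numpy as np

    def segment_stats(speeds, hz=100):
        speeds = np.asarray(speeds, dtype=float)
        stopped = speeds < 0.5  # treat near-zero speed as halted
        # Split the trace into alternating stopped/moving runs of samples.
        edges = np.flatnonzero(np.diff(stopped.astype(int))) + 1
        runs = np.split(np.arange(speeds.size), edges)
        stop_lengths = [len(r) / hz for r in runs if stopped[r[0]]]
        segments = [speeds[r] for r in runs if not stopped[r[0]]]
        return {
            "avg_speed_kmh": float(speeds.mean()),
            "avg_stop_time_s": float(np.mean(stop_lengths)),
            "avg_time_between_stops_s": float(np.mean([s.size / hz for s in segments])),
            "pct_segments_below_10_kmh": 100.0 * float(np.mean([s.max() < 10 for s in segments])),
        }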
There may be other areas where the simulated behavior may not match actual behavior, but there seems to not be any reason that the methods described would not work just as well with real data. The simulator will be used for all tests described in the following two chapters, as it allowed me to quickly test how the algorithms performed using hours of synthetic data, and should help avoid overfitting to the small amount of that was provided by Ford. 3.3 Optimizing For a Population I will now describe the procedures used to optimize for a population, as well as the results obtained from the experiments. In order to optimize for a population I used a modified version of the Kohl-Stone policy gradient algorithm described in section 2.2.2. I chose to 32 use this algorithm rather than an evolutionary algorithm because of Kohl-Stone's simplicity and because the runtime of the algorithm was not a factor, as the algorithm did not need to run on a car's computer. The modified version of the algorithm was the same as the algorithm described in section 2.2.2, with one modification made to determining C(R1 ),..., C(R,) at each iteration of the algorithm. These costs are the costs of using a policy that are a perturbation of the current policy 0, and in the original Kohl-Stone algorithm are evaluated based on the costs of a completely new trial using Ri as a policy. This is because the differences in the policy may affect the actions of the driver. In my modified algorithm I evaluated the cost of the perturbations based on a single drive trace, chosen from a random driver from the population. I did this to cut down on the time necessary to run the algorithm, and to cut down on the amount of noise present in the results. Most of the runtime of the algorithm is spent creating the drive traces, and I create 50 perturbations of 0 at each drive trace. Therefore, creating only one drive trace at each step speeds up the algorithm by a factor of almost 50. Using only one drive trace also cuts down on the amount of noise present in the cost, as if I selected a new random driver for each drive trace, the differences in cost would mainly arise from differences in the drive trace and not the difference from the parameters. For example, certain drive traces, such as a file where there are very few stops, will always have a low cost. Conversely, a drive trace where there are many short stops will always have a relatively high cost no matter what the controller's policy is. Using a new drive trace each time would necessitate adding in more perturbations to reduce the noise, extending the runtime of the algorithm even further. Using one random trace to optimize the policy is an example of a technique known as variance control, which is also employed by the PEGASUS algorithm. PEGASUS is another policy search method which searches for the policy that performs best on random trials that are made deterministic by fixing the random number generation [13]. The main negative side effect of using only one drive trace for each iteration of the algorithm is that the modified algorithm operates with the assumption that small perturbations to the start-stop controller have no effect on driver behavior. This is true for the simulator because 33 the start-stop controller never has any effect on the driving behavior. 
The main negative side effect of using only one drive trace per iteration is that the modified algorithm operates under the assumption that small perturbations to the start-stop controller have no effect on driver behavior. This is true for the simulator, where the controller never affects driving behavior, and it is probably close to true in real life as well. I would expect drivers to change their behavior under drastically different controllers, such as one that always turns the engine off versus one that never does, but because the algorithm only needs to approximate a gradient, the policies being compared are similar enough that drivers might not even notice their difference. It does mean, however, that the learning rate may need to be relatively small, so that the change at any iteration does not cause drivers to drastically change their behavior. The other negative side effect is that at each iteration the value of \theta changes based on the driving trace drawn from one driver, which can move \theta away from the optimum if that driver is far from the mean of the population. This should not be an issue over time, as I expect the value of \theta to approach the population mean.

3.3.1 Experimental Results

For my experiment I used the modified version of the Kohl-Stone algorithm. I normalized each of the controller's parameters to the range [0, 1] and set a learning rate of 0.025. For each iteration of the algorithm I created 50 perturbations, perturbing each value randomly by -0.25, 0, or 0.25. Each driving trace was 6 hours long. I performed 35 trials of the Kohl-Stone algorithm, each running from a random starting location for 200 iterations, and then measured the performance of each trial on 25 random drivers drawn from the population. Figure 3.7 shows the average cost of the 35 trials over time, demonstrating that the algorithm had stopped improving many iterations before the 200-iteration limit was reached.

Figure 3.7: A graph of the average cost of the 35 trials over time.

I chose the best-performing parameters, and then measured them on another sample of 25 random drivers to get the expected cost of a 6-hour driving segment. The best parameters had an average cost of 1329, while Ford's original parameters had a cost of 1810, an improvement of 481, or 27%. This cost saving is equivalent to the car's engine being off instead of idling for 80 seconds every hour, though the actual savings may arise from any combination of the annoyance and energy factors discussed previously.

I also compared the results of using the policy learned by my algorithm against the results of Ford's policy on the real data from Ford. There was not much data from Ford, so the comparison is not as useful as the comparison on the simulated data, but ideally the learned policy should perform better than Ford's policy on the real data as well.

                           Ford Policy   Optimized Policy   Percentage Improvement
    I-94 Data                    56.84              55.61                       2%
    Ann Arbor Loop Data         199.33             177.75                      11%
    W 154 Data                   67.61              63.92                       5%
    Urban Profile 1              36.10              34.05                       6%
    Urban Profile 2              50.35              47.20                       6%
    Urban Profile 3             698.12             634.95                       9%
    Urban Profile 4             115.72             111.57                       4%
    Urban Profile 5             116.02             100.53                      13%
    Urban Profile 6              64.45              57.32                      11%
    Urban Profile 7              62.83              53.90                      14%
    Total                      1467.37            1336.80                       9%

The optimized policy does not exhibit as much improvement over the Ford policy as it did on the simulated data, but it still outperforms Ford's policy by 9% overall, and it outperforms Ford's policy on every data set. Along with the small amount of data, there are two caveats to these results. First, no brake pedal or accelerator pedal data were available for most of the samples, so those values were generated as functions of the speed data using the program Ford gave me, and it is unknown how accurate these generated values are. Second, there is no way of knowing whether different controllers would have changed the actions of the drivers in the experiments; this is impossible to test without implementing a modified controller in real vehicles. These factors may cause the optimized controller to perform relatively worse or better than these results suggest.
First, there were no brake pedal or accelerator pedal data available for most of the samples, so those values were generated as functions of the speed data using the program Ford gave me. It is unknown accurate these predictions are. Second, there is no way of knowing if using different controllers may have changed the actions of the driver's in the experiments. This is impossible to test without implementing a modified controller in real vehicles. These factors may cause the optimized controller to perform relatively worse or better compared to these results. 3.3.2 Optimizing For a Subset of a Larger Population In the previous section I demonstrated that policy search can produce a policy that performs better than the hand tuned policy created by Ford. The rest of my thesis will discuss methods 36 of improving on this policy. One way to improve on a policy generated for the entire population of drivers is to create separate policies for subsets of the entire population. This is beneficial in two ways. First, if Ford can subdivide the population ahead of time, they can install the policies in different types of vehicles. For instance, drivers in Europe and the United States might behave differently, or drivers of compact cars and pickup trucks might have different stopping behavior. If the populations can be identified ahead of time and data can be collected from each population different policies can be installed for each set of drivers. Classifying for a smaller population might also be useful if the driver can be classified online as driving in a certain way. If the driver's style can be classified by the car's computer, a policy can be swapped in that works best for the situation. For instance, drivers on the highway and drivers in stop-and-go traffic in an urban environment might behave very differently and have different optimal policies. It should be possible to obtain better results by using a policy optimized for a specific pattern of driving when the driver's style fits that pattern. In addition to providing results for optimizing for a smaller population of drivers, this section acts as a middle ground between the previous section and the next chapter, where I will discuss optimizing a policy for a specific driver. As the policy is optimized for a more specific group better results are expected, as the policy only needs to match the behavior of that group and does not need to account for drivers that are not present who behave differently. As long as the drivers in the smaller subsets are different from the larger population in a meaningful way, better results are expected as the group becomes smaller, with the best results being expected with policies optimized for a single driver. The key is to divide the population into subsets that are expected to be different in some way, as if the population is subdivided into random large subsets the expectation is that the policies will be identical. It is also necessary for there to be enough data for each subset in order to create policies that perform well,. It may even be possible that the methods used in this subsection will result in lower costs than optimizing policies for a single driver. This strategy may lead to lower costs if policies 37 are optimized for modes of driving, which are then swapped in during a drive, based on the driver's behavior. 
3.3.3 Experimental Results

The most natural way to divide the driver profiles from the simulator into groups is to divide them into two sets of drivers, the first being the drivers generated from the Ann Arbor data and the second being the drivers generated from the urban data. This does not represent a division of car buyers that Ford can identify ahead of time, but the profiles do represent modes of driving that would benefit from separate policies. I will compare the results of optimizing for the smaller populations with the results of optimizing for a larger population comprising both sets of drivers.

In order to determine the best policies for each population I repeated the experiments from Section 3.3, running the experiment 35 times from random starting locations, testing on 25 random drivers to determine the best policy, and finally testing on 25 other random drivers to determine the expected cost of that policy. For the profiles matching the Ann Arbor data I found that the best policy had an average cost of 1233, while for the profiles matching the urban data the best policy had a cost of 1399. Averaged together the cost was 1321, representing the average cost of six hours of driving if three hours are spent in each mode and the best policy is used for each mode. I found that the best single policy for the combined population had a cost of 1398. Therefore, optimizing for the subsets of the population resulted in an improvement of 5.5%, or 13 seconds of idling removed per hour in real-world terms. This is less than the improvement from going from Ford's policy to an optimized policy, but it would represent a substantial aggregate energy saving if an improvement of this scale could be implemented on all hybrid cars produced by Ford.

3.4 Conclusion

In this chapter I discussed the policy-search methods that were able to find a single policy that performs better than Ford's hand-tuned policy. I discussed the simulator that generated the data necessary for policy search, as well as the Kohl-Stone policy gradient algorithm used for the search. The optimized policy that policy search found performs 27% better on synthetic data and 9% better on the limited real data available. This policy could probably be deployed in Ford's vehicles immediately and produce better results. The main concerns with these results were that the simulated data may not capture unknown characteristics of the real-world data, and that drivers might modify their behavior under the new policy and produce worse results. The optimized policy does not perform as well on the real data as it does on the synthetic data, but it outperforms Ford's policy on every data set available, which suggests that the simulator captures at least some of the important features of the real data. It is impossible for me to test how drivers would behave differently with the optimized policy. The only way to test this is to implement the policy on test vehicles that are being driven under normal conditions and observe how the users behave.
In order to use this policy on production vehicles, a logical next step would be to implement the policy in vehicles where the user's driving behavior can be measured. This could be tested against vehicles running the original hand-tuned policy. If the newly optimized policy performs better than the old policy, it could then be implemented in vehicles sold by Ford. Along with testing the effects of the policy, another option to improve the performance of the policy would be to collect more real-world data. Though the simulator was constructed to match the real-world data, there was not enough real data to be sure that it matches the larger population of driving behavior from drivers in Ford vehicles. Collecting much more data would allow a policy-search algorithm to be run directly on the data, eliminating the need for synthetic data. Unfortunately, this would still not address the problem that there would be no way to test the effect of a policy on a driver's behavior. Each time a policy is optimized it should be tested in real vehicles to ensure that drivers do not change their behavior with the optimized policy in a way that negatively affects performance. Alternatively, a large group of random policies could be implemented for random drivers in an effort to show that policies have no effect on drivers' actions. If this were true, there would be no need to test the effects on individual drivers. However, it seems unlikely that completely different controllers would have no effect whatsoever on drivers' actions.

The main benefits of the approach outlined in this chapter are that it is simple, producing a single policy that can be examined by Ford, and that it requires no computation inside the car. However, it is likely that optimizing for a smaller subsection of driver behavior than the whole population will produce better results than creating one policy that must be used by every driver. I demonstrated this partially in this chapter, showing that optimizing for subsets of a larger population produced better results than creating a single policy for the entire population. In the next chapter I will discuss optimizing for a single driver, which should produce even better results.

Chapter 4

Optimizing a Policy for an Individual Driver

In this chapter I will discuss optimizing the start-stop policy for each individual driver. This should result in a lower cost on average than the previous approach of optimizing a single policy for a population of drivers. The primary disadvantage of this approach is that it requires more computation in the car's internal computer, which may be an issue if the computational power of the car is limited. It also requires time to find an optimized solution and is more complex, needing much more debugging before it can be implemented in a real car.

4.1 Problem Overview

In the previous chapter I sought to optimize

E[C | θ],    (4.1)

for the entire population by varying the policy θ, where C was the cost function comprising factors representing the fuel cost of the controller's actions and the annoyance felt by the driver of the vehicle due to the controller's actions. In this chapter I will instead seek to optimize

E[C | θ, d],    (4.2)

where d is the individual driver, again by varying θ. By creating a policy for each driver that need only work well for that driver, I should get better results than finding a single policy that must work well for all drivers.
In this chapter I will discuss experiments to find the best policy using two methods of policy search: the Kohl-Stone policy gradient algorithm and CMAES, a variant of the cross-entropy maximization algorithm. In addition to these two methods of policy search I will also discuss experiments using a model-based approach, where I attempt to minimize

E[C | θ, d, τ],    (4.3)

where τ is the recent behavior of the driver. I will do this by creating a model of the driver's behavior using Gaussian processes.

As before, I will use the simulator to measure the performance of my algorithms. The amount of data available was not sufficient to test any of the approaches on real data. The disadvantages of the simulator were discussed in chapter 3, but one of the main disadvantages, that the simulated drivers do not change their behavior based on the different start-stop policies, is not as much of an issue here. This is because, for most of the approaches in this chapter, the effectiveness of a policy is determined by how that policy, or a policy similar to it, performed with the actual user. Therefore the results incorporate how the user changed their behavior in response to a change in policy. I will examine this further when discussing the results of each technique and their expected effectiveness when implemented in a vehicle. Along with an identical simulator, I used the same cost and rule configuration as before. The main changes are that I assume I am allowed to do computation inside the car's computer, and that the car's computer can switch policies at will. This allows me to use data from the driver to find better policies. There are several complications introduced by this approach, namely the computational expense, the possibility of the algorithms converging to suboptimal solutions, and the additional complexity introduced by running algorithms on the car's computer.

4.1.1 Issues

The additional computational load is the most notable effect of finding policies customized for each driver. In my experiments I did not impose an explicit limit on the amount of computation available inside the car, though I will discuss the computational expense of each method. The other main disadvantage of optimizing the policy for a driver is that it will result in many different policies that cannot be predicted ahead of time, and therefore cannot be checked manually. The algorithms used may also converge to worse policies than before, for instance if the optimization overfits a policy to a recent driving period. I will experimentally evaluate how often the tested algorithms converge to poorly performing policies by comparing their performance with the general policy found in chapter 3. An even worse situation than the algorithm finding a suboptimal policy would be a bug in the code, leading to crashes or to the controller always returning a bad policy. It is almost certainly possible to separate the algorithm for finding the best policy from other computation inside the car's computer, so that a failure in the optimization algorithm does not cause all of the car's computation to fail, but a bug in the code could still lead to worse policies being used. The controller returning a bad policy because of a bug would be worse than the controller failing completely, because if the controller failed, the car could use the default policy values found for the population. Conversely, if the controller had a bug that resulted in nonsensical values, such as settings that would lead to the car's engine never turning off, the car would execute the flawed policy.
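As a concrete illustration of this fallback idea (my sketch, not a mechanism from this thesis), a production controller could sanity-check each newly learned policy and revert to the population default when the values are implausible:

```python
import numpy as np

# Default policy found offline for the whole population (placeholder values).
POPULATION_POLICY = np.full(8, 0.5)  # invented dimensionality

def validate_policy(theta):
    """Reject policies with nonsensical values, such as parameters that
    fall outside the normalized [0, 1] range or are not finite."""
    theta = np.asarray(theta, dtype=float)
    return (theta.shape == POPULATION_POLICY.shape
            and bool(np.all(np.isfinite(theta)))
            and bool(np.all((theta >= 0.0) & (theta <= 1.0))))

def safe_policy(theta):
    # Execute the learned policy only if it passes the sanity check;
    # otherwise fall back to the population default.
    if validate_policy(theta):
        return np.asarray(theta, dtype=float)
    return POPULATION_POLICY

# A buggy optimizer output (NaN entry) is caught and replaced.
print(safe_policy([0.2, np.nan, 0.9, 0.1, 0.5, 0.5, 0.5, 0.5]))
```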
There is no way for me to guarantee that the final algorithms used in a production vehicle will not have any bugs, especially because I believe that any algorithms used in consumer vehicles would need to be recoded to the standards used for code executing in the car. The only way for me to limit the number of bugs is to use relatively simple algorithms that are much easier to debug than more complicated ones. With this goal in mind, the two direct policy-search algorithms chosen for analysis are simple to implement, though the Kohl-Stone policy gradient algorithm is simpler than CMAES. The model-based approach using Gaussian processes is more complicated, however, and I relied on an external library for my implementation. I found that Gaussian processes were not effective, so they seem to be a suboptimal solution even before considering complexity, but if a more complicated solution obtained better results, the tradeoff between cost and complexity would need to be considered, as it might not be worth finding a solution with 0.001% more efficiency at the cost of greatly increased complexity. This is in direct contrast to finding a policy for the population, where the complexity of the approach does not need to be considered, because none of the code runs inside a vehicle.

The final way in which optimizing for an individual differs from optimizing for a population is that the speed of reaching an optimal solution must be considered. Previously this did not matter, because I was optimizing for a static population offline. Here, if it takes 10,000 hours to find the best solution, the person might have sold their car before an optimal solution can be found. Even if it takes 100, or even 10, hours, this could lead to problems if the driver's behavior changes. Consider a car driven by one person on weekdays for ten hours and by someone else on weekends for four hours. The optimal policy would probably be different for each person, and if the policy took ten hours to converge it might never be ideal for the person driving on the weekends. This shows the importance of finding a good policy quickly, though I will show that it is difficult to choose an optimal policy in a time frame as short as ten hours. In order to speed up the process it would probably be necessary to use an approach where key aspects of the driver were identified and a policy optimized for drivers with similar characteristics was swapped in. This approach would not lead to a policy directly optimized for the driver, but it still might get better results, and it was discussed briefly in the previous chapter.

4.2 Kohl-Stone Optimization

In this section I will discuss using the Kohl-Stone policy gradient algorithm to find an optimized policy for an individual. I will discuss experiments using the algorithm, beginning with the same framework I used in the previous chapter. I will then discuss experiments with small modifications to the algorithm, in an attempt to make it perform better when optimizing the start-stop controller. The Kohl-Stone policy search algorithm is simple, so porting it into a real vehicle should not be an issue.
When using this algorithm I can also largely account for drivers behaving differently under different policies, as the algorithm chooses a policy based on how that policy actually performed with the driver.

4.2.1 Initial Results

I will now discuss the results of using the same version of the Kohl-Stone algorithm as in chapter 3. This is identical to the original Kohl-Stone algorithm, except that at each iteration I collect only one new drive cycle and find the results of all perturbations of the current policy using that drive cycle, instead of using a new drive cycle to measure the performance of each perturbation of the policy. This requires assuming that drivers will behave identically when presented with a controller that is very similar to another controller. This assumption seems reasonable, and it makes the algorithm much faster, as I measure the performance of 50 perturbations of the policy at each iteration. Given that each drive cycle is 6 hours long, not making this assumption would make each iteration require 300 hours of driving data. As in chapter 3, I used a learning rate of .025 and a perturbation rate of .025, with all the parameters scaled to the range [0, 1]. Each iteration of the algorithm consisted of 6 hours of driving, and the parameters were updated after each iteration. Instead of beginning the policy search from a random location, I began with the best policy found for the population of drivers from which the individual drivers were drawn.

To determine the effectiveness of the approach I ran 35 experiments, with a randomly selected driver in each experiment. Between iterations of the algorithm I measured the performance of the current policy by generating five new random drive cycles from the driver and finding the cost of the policy for each of them. Figure 4.1 shows the results of running the algorithm for 200 iterations.

Figure 4.1: A graph of the average costs of the drivers' policies as the algorithm progresses. The algorithm starts with the policy optimized for the population of drivers and attempts to find the best policy for the individual driver.

The initial policy has an average cost of 1424.0, while the policies after 200 iterations have an average cost of 1401.8, for a 1.6% improvement. While this improvement is not nearly as large as the 27% improvement from moving from Ford's policy to an optimized policy, it is still an improvement that would be worthwhile to implement in production vehicles. Figure 4.1 is quite noisy, and it may seem that this improvement is due only to the noise present in the process. Figure 4.2 shows a 10-step moving average, demonstrating that there is clearly a real improvement from the original starting location, though it also shows that the costs are not monotonically decreasing, possibly due to both the noise present in the process and the policies overfitting to recent drive cycles.

Figure 4.2: A graph of the moving average of the costs of the last ten iterations.

In addition to the cost difference from beginning to end, there are two other traits of the algorithm that are important to monitor. The first is how quickly the algorithm converges to a policy that has a locally optimal cost.
To measure this, I computed the total cost of using the best policy found so far throughout the algorithm's runtime, and compared it with the cost of using the original policy for 200 drive cycles instead. I found an average cost of 280878 for using the best policy, compared with an average cost of 284805 for using the original policy, giving an improvement of 1.4%. This is fairly close to the improvement of 1.6% found from always using the final policy and, along with Figure 4.2, demonstrates that most of the improvement comes early in the algorithm's execution. However, it would still be better to find the solution more quickly, especially if the behavior of the driver is changing, as the moving average of the costs stays near the original costs for the first 5 iterations, representing 30 hours of driving, and does not approach the final cost until after more than 25 iterations, representing 150 hours of driving. I will investigate possible approaches to speeding up the optimization in later sections.

The second important aspect of the algorithm is the variance of the cost improvement. It is especially useful to know whether the algorithm can lead to significantly worse results for individual users. This may happen if the policy overfits to recent drives, or if it becomes stuck in a local minimum that is worse than the original policy. If the algorithm can lead to worse results, it may be useful to check whether the original policy would have performed much better on the recent driving behavior; if so, the original policy can be substituted back in, which ensures that the customized policy does not perform much worse than the static policy. Figure 4.3 shows the improvement from the population's policy for all 35 trials. Twenty-two of the trials show an improvement, while 13 show a decrease in performance. The cluster near 0 suggests that for a significant number of the drivers there is no improvement from using a customized policy. These drivers likely behave similarly to an average driver drawn from the population, and the differences between them may be due entirely to the noise present in the process. There are no drivers whose final policy performs dramatically worse than the initial policy, suggesting that the policy search algorithm could be used without any modification, though using the original policy when the optimized policy performs poorly may yield a slight performance gain.

Figure 4.3: The cost improvement from the first iteration to the last iteration for all 35 trials.

In this section I have discussed the results of my application of the Kohl-Stone algorithm. In the following sections I will vary the algorithm in order to determine whether it is possible to find better results, and whether it is possible to find a fully optimized policy in a shorter period of time.

4.2.2 Finding Policies in Parallel

One approach to finding better policies is to run parallel executions of the Kohl-Stone algorithm, each starting in a random location. The controller can then swap in the policy that would have performed best during the car's recent activity after each iteration of the algorithm. This has the potential to get better results than a single run of the algorithm if the cost as a function of the policy has many local minima, and if the best policy for the population is not always similar to the best policy for the driver.
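A minimal sketch of this parallel scheme follows. It reuses the hypothetical `kohl_stone_iteration` and `evaluate_cost` routines from the earlier sketch (passed in as arguments here), and its structure is illustrative rather than the exact code used in these experiments.

```python
import numpy as np

def parallel_kohl_stone(n_params, get_drive_trace, kohl_stone_iteration,
                        evaluate_cost, n_parallel=5, n_iterations=200):
    # Each of the parallel searches starts from its own random policy.
    policies = [np.random.rand(n_params) for _ in range(n_parallel)]
    deployed = policies[0]
    for _ in range(n_iterations):
        trace = get_drive_trace()  # the most recent 6 hours of driving
        # Advance every search using the same recorded trace, assuming
        # the driver would have behaved similarly under each policy ...
        policies = [kohl_stone_iteration(p, trace) for p in policies]
        # ... then deploy whichever current policy would have performed
        # best on the car's recent activity.
        costs = [evaluate_cost(p, trace) for p in policies]
        deployed = policies[int(np.argmin(costs))]
    return deployed
```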
The approach has two disadvantages. The first is that it requires more computation inside the car's computer. If five policies are searched for in parallel, the approach must compute the Kohl-Stone algorithm for each, and must evaluate five times as many policies with its start-stop controller. Kohl-Stone is relatively simple, but one run of the algorithm already evaluates 50 policies at each iteration, so a complicated controller may make this approach impractical. The second problem with optimizing policies in parallel is that when the current policy in each execution of the algorithm is evaluated, only one policy was actually in use by the driver. The other policies are evaluated under the assumption that the driver would have behaved similarly had they been used instead. This could result in a situation where the controller oscillates between two policies that fit the driver's past behavior but perform worse when actually put in the car, while ignoring another policy that gets better results when actually used.

I tested the effectiveness of this approach in the same manner as I tested the Kohl-Stone algorithm. I conducted 35 trials with 35 new drivers, with the same parameterization of Kohl-Stone as before. Each trial consisted of 5 parallel executions of the algorithm, and to measure the performance of the approach I used, at each point, the policy that had performed best on the preceding evaluation. Figure 4.4 shows a graph of the costs of this approach over time, comparing it to an approach where the best policy for the population is used.

Figure 4.4: The blue line shows a moving average of the last 10 iterations of the parallel Kohl-Stone approach, averaged across all 35 drivers. The green line shows the average result that would have been achieved by using the best policy for the population of drivers.

I found that in the final iteration the average cost was 1428.3, compared with an average cost of 1423.6 when using the population's policy. This suggests that starting from a random location is significantly worse than starting from the population's best policy. This probably occurs because the cost as a function of the policy is non-convex, with multiple local minima, leading the policy search to converge to a result that is worse than the best policy in many cases. Additionally, it seems that for this population a policy near the population's policy is the best policy for the individual driver, which is in line with the results from section 4.2.1. The approach of finding policies in parallel was relatively unsuccessful. I believe this to be due to characteristics of the cost as a function of the policy for my simulated drivers. If the algorithms could be implemented in test vehicles, it might still be worth testing this approach to discover whether real drivers have similar characteristics to the simulated drivers, but there does not appear to be an obvious benefit from using this approach.

4.2.3 Improving the Convergence Speed of the Algorithm

In addition to improving the final policy found by the algorithm, another objective is to reduce the time the algorithm takes to converge to a final policy. In section 4.2.1 I demonstrated that the baseline Kohl-Stone algorithm takes 30 hours of driving to start making significant improvements, and 150 hours to find the best-performing policy.
Speeding this up would allow the vehicle to obtain larger fuel savings in a shorter period of time, and would let the controller react better when the driver's characteristics change. To try to speed up the algorithm I ran the Kohl-Stone algorithm as before, but with multiple iterations of the algorithm between collections of new data. I chose to run 4 iterations of the algorithm between each new collection of driving data, which could make the algorithm up to four times faster. If this approach had been successful I would have attempted to increase the number of iterations run without collecting data in order to speed the algorithm up further. The disadvantage is that this can lead to overfitting, because the algorithm can move further based on a single segment of driving. If all segments of driving are very similar this is fine, but if they differ it can lead to problems. For instance, if there is an accident on the road and all traffic is stop-and-go on a particular day, the controller may diverge significantly from an optimal controller if it attempts to adapt too fast. The approach also forces me to assume that drivers behave similarly when observing a wider range of controller policies. The algorithm may move the controller policy up to four times as far before gathering new data, which may be enough of a difference that a driver notices and behaves differently.

I ran another experiment with the same setup as in section 4.2.1, except for this change to the algorithm, with the results shown in Figure 4.5. I find that the average policy used improves by a much smaller amount, most likely due to overfitting, as it oscillates between policies that perform better and worse than the original policy. Note that the average of the costs is in all cases lower than the average costs shown in Figure 4.2, which measured the performance of the original Kohl-Stone algorithm. This indicates that the drivers in this experiment, though sampled from the same distribution as the drivers in the previous study, produce a lower controller cost when using policies fit for the whole population. This is shown by the results at the beginning of each graph, when every driver is using the same policy, yet the costs differ by more than 50. For this reason I instead measured the percentage improvement. I found that for a given approach the percentage improvement in cost was similar across drivers, even when the costs of the drivers' actions differed. Given that optimized policies perform only 1.6% better on average than the original policy, and that individual driving segments are extremely noisy, it makes sense that overfitting can be a problem if the algorithm attempts to learn faster. The measured differences in the effectiveness of policies are most likely due primarily to this noise, which is what makes overfitting such an issue. This approach was unsuccessful, and it demonstrates the risk of overfitting when attempting to improve the convergence speed of the algorithm. I will now discuss the results of my last set of experiments with the Kohl-Stone algorithm, which were conducted in order to determine the best learning rate for the algorithm.
Figure 4.5: The blue line shows a moving average of the last 10 iterations of the sped-up Kohl-Stone approach, averaged across all 35 drivers.

4.2.4 Determining the Learning Rate for Kohl-Stone

In the previous two subsections I discussed two variations of the Kohl-Stone policy gradient algorithm, showing that neither performed better than the original algorithm when optimizing the start-stop controller's policy. Now I return to the original algorithm and describe my attempts to improve its performance by modifying its learning rate. Although Kohl-Stone has a small number of parameters, one of them, the learning rate, may have a large effect on performance. The learning rate for the Kohl-Stone algorithm determines the magnitude of the policy change at each iteration. In the Kohl-Stone algorithm the policy change at each iteration is a vector with the same direction as the estimated gradient and a norm equal to the learning rate, i.e., Δθ = η g / ‖g‖. This is slightly different from similar algorithms such as gradient descent, where the parameters change by the gradient multiplied by the learning rate, Δθ = η g. Still, small and large learning rates have similar effects with Kohl-Stone as they would with gradient descent. A large learning rate may overfit to the most recent driving segment, moving too quickly toward a policy that performs well for the recent drive but is not ideal for the driver's overall behavior. It may also overshoot the correct policy, moving from a policy where certain parameters are too low compared to the ideal policy to one where they are too high. A small learning rate will lead to an algorithm that converges more slowly. Furthermore, a small learning rate explores a smaller region of the parameter space at each iteration. This makes it more prone to getting stuck in local minima: if the current parameters are the minimum of the region explored at each iteration, the algorithm will stop making progress. I expect there to be many local minima, so this could be an issue with smaller learning rates.

I chose a learning rate of .025 for all previous experiments based on initial experiments run with an earlier version of the start-stop controller and a set of drivers that did not include the urban drivers. I will now present results showing that .025 is most likely still close to the optimal learning rate, and that significantly smaller or larger learning rates do not perform better. First, I ran experiments with a learning rate of .0125 in order to determine whether it was possible to get better results with a smaller learning rate. Figure 4.6 shows this setting performing worse than the original Kohl-Stone algorithm. I believe this is because it learns at a very slow rate, and because differences between policies that are close together are very difficult to measure across the 6 hours of driving the algorithm observes at each iteration. Therefore, with a small learning rate the algorithm improves a little at the beginning, but it is not able to move to a better location that is further away, as the approach with a learning rate of .025 did. Next, I ran experiments with a learning rate of .1, and then a learning rate of .05, in order to determine the effectiveness of larger learning rates. The results from those experiments are shown in Figure 4.7.
Like the experiments with a learning rate of .0125, these performed worse than the experiments with a learning rate of .025. In this case I believe it to be due to the algorithm overfitting to recent drive cycles. This is shown most noticeably in the experiment with a learning rate of .1, where the algorithm's cost swings violently, probably due to the rapid changes in the policy.

Figure 4.6: The blue line shows a moving average of the last 10 iterations of the Kohl-Stone algorithm with a learning rate of .0125, averaged across all 35 drivers.

I found that both increasing and decreasing the learning rate produced worse results than the original setup of the Kohl-Stone algorithm. The learning rate of .025 was chosen based on similar tests with a preliminary version of the controller, so it is reasonable that the same learning rate would still perform best. It would be possible to fine-tune the learning rate of the controller further, but the best learning rate for the simulator and for real cars would likely differ slightly, so I did not make an effort to find the absolute best learning rate. However, given the similarities between the real data and the simulated data, it seems reasonable that a learning rate of .025 would work well for an actual driver too.

Figure 4.7: The blue lines show a moving average of the last 10 iterations of the Kohl-Stone algorithm, averaged across all 35 drivers. The left graph corresponds to the experiment with a learning rate of .05, and the right graph corresponds to the experiment with a learning rate of .1.

4.2.5 Conclusion

In this section I examined the performance of the Kohl-Stone algorithm for policy improvement. I found that using the Kohl-Stone algorithm for individual drivers resulted in a policy whose cost was on average 1.6% lower than the cost of using the best policy found for the population of drivers. This improvement is much smaller than the 27% improvement I found when moving from Ford's policy to an optimized policy, but it is still worth implementing in real vehicles, as the Ford team has stated that even efficiency gains of less than 1% are worthwhile because of the scale of the problem. After finding this initial improvement I tried two alternative approaches using variants of Kohl-Stone, and also tried different values for the learning rate. I found that the modifications to the algorithm and to the learning rate produced either equivalent or worse results than the original algorithm. I do not have enough data from actual drivers to determine whether the alternative approaches would be more successful with real drivers, but if real drivers behave similarly to simulated drivers, using the original Kohl-Stone algorithm with a learning rate near .025 will work best. Still, it would be worthwhile to perform the same experiments with real data if it could be obtained. I will now investigate an alternative method of searching for policies: cross-entropy methods.

4.3 Cross-Entropy Methods

Another group of approaches for optimizing a policy is the family of cross-entropy methods, described in section 2.2.3.
In this section I will describe my results from optimizing the policy with a specific cross-entropy method, the covariance matrix adaptation evolution strategy, or CMAES, which I described in section 2.2.4. The main difference between cross-entropy methods and the Kohl-Stone policy gradient algorithm is that cross-entropy methods do not attempt to approximate a gradient. Instead they sample within a region and then move in the direction of the better samples. The shape of the sampling region is then updated based on all of the samples. This has the potential to be faster than Kohl-Stone, as the algorithm can sample from a wider area, and in some cases can quickly identify the region where the minimum is located. It also has the potential to perform worse than Kohl-Stone if the algorithm stops sampling from the area around the best-performing policy. This may be a problem here because of how noisy the problem is: the noise may cause the algorithm to conclude that a region containing the minimum is not worth sampling from further.

4.3.1 Experiment

To test the effectiveness of CMAES I conducted the same experiments as I used to test the Kohl-Stone policy gradient algorithm. For each experiment I ran 35 trials, each with a random driver, and ran the algorithm for 200 iterations, which is about 1200 hours of simulated driving time. Rather than implementing the algorithm myself, as I did for the Kohl-Stone policy gradient algorithm, I used the CMAES implementation in Python provided by Hansen [12]. I used all the default parameters for the algorithm, except that instead of taking the default 13 samples at each iteration I took 50 samples, in order to compensate for the high level of noise. This also matches the number of samples I used at each iteration for the Kohl-Stone algorithm.

The two parameters of the CMAES algorithm that must be determined manually are the initial value of the covariance matrix and the initial parameters of the policy. As with Kohl-Stone, I chose to start the search with the best policy found for the population of drivers, in order to have a starting location with the lowest average cost. The typical method of choosing the initial covariance matrix is to assume a diagonal matrix, which ensures that initial samples are independent in each coordinate around the mean. I ran several experiments with different initial values for these diagonal entries. First I ran an experiment with an initial value of .001 along the diagonal of the covariance matrix. I chose 0.001 so that the initial standard deviation along each axis would be similar to the learning rate of the best parameterization of the Kohl-Stone algorithm. My results are shown in Figure 4.8, which demonstrates that the algorithm did not find policies with improved performance. This may have been due to the initial value of the covariance matrix being too low or too high. Figure 4.9 shows the average value of the covariance matrix along the diagonal over time, showing that it rose as the algorithm executed. This suggests that a larger initial value for the entries of the covariance matrix might produce better results. To test whether a larger covariance matrix produced better results, I ran the same experiment with values of 0.005 along the diagonal of the initial covariance matrix instead of 0.001. Figure 4.10 shows the results of that experiment, showing a small improvement in the policies used with this initial covariance matrix. I found that the average initial cost was 1222 and the average final cost was 1214, for an average improvement of 0.7%. While this is an improvement, it is worse than the average improvement with the Kohl-Stone algorithm.
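The sketch below shows the shape of this experiment using the `cma` Python package (an assumption on my part; the thesis used Hansen's earlier implementation directly). Note that `cma` is parameterized by an initial step size rather than a covariance matrix, so a diagonal initial covariance of 0.005 corresponds to sigma0 = sqrt(0.005) ≈ 0.07; `evaluate_cost` and `get_drive_trace` are again hypothetical stand-ins for the simulator.

```python
import numpy as np
import cma  # pip install cma

def optimize_with_cmaes(x0, evaluate_cost, get_drive_trace, n_iterations=200):
    es = cma.CMAEvolutionStrategy(
        x0,                      # start from the population's best policy
        np.sqrt(0.005),          # initial std per coordinate (diag cov 0.005)
        {'popsize': 50,          # 50 samples per iteration instead of 13
         'bounds': [0.0, 1.0]})  # parameters are normalized to [0, 1]
    for _ in range(n_iterations):
        if es.stop():
            break
        trace = get_drive_trace()           # one new 6-hour drive cycle
        candidates = es.ask()               # sample candidate policies
        costs = [evaluate_cost(np.asarray(c), trace) for c in candidates]
        es.tell(candidates, costs)          # update the mean and covariance
    return es.result.xbest
```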
I next ran the experiments again with an initial value of 0.008, in order to test whether it was possible to find better results with a larger initial sampling region. Figure 4.10 shows the results from that experiment, demonstrating worse results than with a setting of 0.005. I also ran experiments with initial values of 0.025, 0.05, and 0.1 along the diagonal of the covariance matrix, and found similar results. This shows that 0.005 is probably close to the optimal initial setting for the entries of the covariance matrix.

Figure 4.8: A 10-iteration moving average of the cost with the policy found from CMAES with an initial covariance setting of .001 along the diagonal, averaged across all 35 drivers.

4.3.2 Summary

At best I found a 0.7% improvement in cost using CMAES. This is less than Kohl-Stone's best improvement, and as CMAES is also more complicated than Kohl-Stone, it does not seem to be a good alternative in this case. CMAES most likely performs worse because the sampling region is altered based on recent results, which may be an issue for the start-stop problem because, as I have shown earlier, the cost of a six-hour driving segment given a policy and a driver is extremely noisy. Therefore CMAES may exclude certain regions where the average cost is lower if it finds a few bad samples from those regions. This seems to make CMAES a poorer fit for the problem than the Kohl-Stone policy gradient algorithm.

Figure 4.9: The average value of the covariance matrix along the diagonal during the execution of the CMAES algorithm, averaged across all 35 drivers.

4.4 Model Based Policy Search

The previous two sections described two approaches to optimizing the policy using model-free reinforcement learning, where the policy is directly optimized. An alternative, described in section 2.3, is to learn a model of the driver's behavior, and then either search for a policy that performs well given the predicted behavior or use a precomputed policy that was optimized for similar behavior. The stop lengths of the driver are the most important aspect of their behavior to model. The expected stop length for a particular stop should determine whether the controller turns the car's engine off or keeps it on. This is the main decision the car's controller must make, although the controller must also decide when the car's engine should be turned back on, which would be determined by predicting the remaining time until the driver presses the accelerator pedal while stopped. I will focus on predicting the stop length, though the techniques in this section could also be used to predict the time until accelerator pedal application.

Figure 4.10: A 10-iteration moving average of the cost with the policy found from CMAES, averaged across all 35 drivers. The left graph shows the results when the initial setting along the diagonal of the covariance matrix is .005; the right graph shows the setting of 0.008.

Once a stop length is predicted, a policy must then be used.
It is possible to perform a policy search given the predicted stop length online, but this would probably be too slow to work in a vehicle. Instead, a reasonable process would be to precompute policies that work well for various stop lengths. Then, when the actual stop length is predicted, the controller would swap in the precomputed policy whose stop length is closest to the predicted value. I will not discuss this part of the process further until I show my results for predicting the stop length, as tuning policies for predicted stop lengths is useless without an accurate prediction mechanism.

Predicting the stop length is an example of regression, as the goal is to predict a continuously distributed random variable from a set of features of the driver's behavior. The regression function will be fit on a per-driver basis, because certain feature values may be correlated with longer or shorter stops for some drivers, and may not be correlated at all with stop length for others. For instance, some drivers may stop suddenly at stop lights and others may not. Therefore, a sudden deceleration before a stop may be an indicator that some drivers are at a stop light and would be expected to have a long stop, but it may indicate the opposite for other drivers, for whom stopping suddenly might mean they are in stop-and-go traffic and would be expected to have a short stop.

To predict the stop length on a per-driver basis I will first use Gaussian Processes, which I described in section 2.3. The predictions provided by a Gaussian process have both a mean and a variance, giving the program an idea of how likely the mean prediction is to be accurate. This is useful because the controller may wish to act differently given its confidence in the mean prediction. For example, it may be best for the controller to act differently if it predicts that the mean is 11 seconds and there is a 99% chance the stop will be between 10 and 12 seconds than if it predicts that the mean is 11 seconds and there is a 60% chance the stop will be between 10 and 12 seconds. For my Gaussian process implementation I used the GPML library [7]. Gaussian Processes have one important hyperparameter that must be chosen by hand, the estimate of the output noise variance σ². I chose this parameter by selecting the value of σ² that maximized the likelihood of the data. Maximum likelihood methods are prone to overfitting when many values are being fitted, but because there is only a single scalar value to fit here there is little risk of overfitting, and maximum likelihood is a good way to choose the variance.

A potential issue with Gaussian processes for the stop prediction problem is that they typically express their prediction as a normally distributed random variable. It is also possible to use a logistic distribution, which is similar to the normal distribution but has heavier tails. The stop distribution in both the collected data and the simulator is exponential, as Figure 4.11 shows. I could find no examples of research using Gaussian processes for exponentially distributed variables, and there are no libraries available that predict an exponentially distributed variable with Gaussian processes. This does not guarantee that there will be an issue, as it is possible that individual stop lengths are drawn from a normal distribution given features of the driver's behavior, but that without features they are exponentially distributed. It is still something that must be kept in mind, and it most likely explains at least some of the poor performance of the algorithm in the experiments I ran.
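For illustration, the sketch below re-creates the shape of this setup in Python with scikit-learn (my substitution; the thesis used the MATLAB GPML library). The WhiteKernel term plays the role of the output noise variance σ², and fitting maximizes the marginal likelihood, which is how σ² was chosen above. The feature matrix X and stop lengths y are synthetic placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X = rng.random((1000, 30))         # 1000 stops x 30 behavior features
y = rng.exponential(10.8, 1000)    # stop lengths (exponentially distributed)

# Anisotropic RBF kernel plus a white-noise term; fitting maximizes the
# marginal likelihood, which selects the noise variance automatically.
kernel = ConstantKernel() * RBF(length_scale=np.ones(30)) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, y)

# Each prediction is a mean plus a standard deviation, so the controller
# can weigh how much confidence to place in the mean prediction.
mean, std = gp.predict(X[:5], return_std=True)
print(mean, std)
```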
Figure 4.11: A histogram showing what percentage of stops in the simulator have a given length.

I also experimented with nearest neighbors regression, in order to determine whether the problems with Gaussian Processes were fixable with a simple approach that does not rely on distributional assumptions. This approach is described in section 2.3 and, unlike Gaussian Processes, provides a point prediction instead of a probabilistic prediction. This is less useful, but if the predictions are very accurate they could still be used to find a policy. I used the Scikit-Learn library for my implementation of nearest neighbors regression [15].

4.4.1 Experiment With Normally Distributed Output Variables

The first step of the experiment was deciding which features to use. I chose 30 features that I felt represented the drivers' overall pattern of recent driving, as well as any behavior from the driving segment immediately before the stop. These feature values included the average vehicle speed over the 30 minutes preceding the stop as well as over the 10 minutes preceding the stop, the maximum stop length over the past 30 and 10 minutes, the average stop length over the past 30 and 10 minutes, the number of seconds since the accelerator pedal was pressed, the speed at which the driver began decelerating, and the rate of deceleration immediately before the stop. For each feature that summarized recent driver behavior I created one version computed over the last 30 minutes and one over the last 10 minutes, in order to emphasize short-term behavior while still capturing the driver's medium-term behavior.

I then measured the effectiveness of Gaussian Processes at predicting a driver's stop length given training sets of different sizes. I conducted my experiment using 35 different drivers, with a 1000-stop test set, and measured the effectiveness of training sets with sizes ranging from 12 stops to 1750 stops. Figure 4.12 shows the effectiveness of the Gaussian processes for each of these training set sizes at minimizing the squared error between the mean of the predictive distribution for the stop and the actual stop length.

Figure 4.12: Graph of the error rate given the size of the training set used.

As Figure 4.12 shows, the Gaussian process with a normally distributed output variable performs poorly, with the best results coming from a training set of size 1000, where the square root of the average squared error is 8.6 seconds, which is less than 2 seconds better than predicting the mean every time. With a 1000-stop training set the average predicted probability density of the actual stops is .003, which was fairly constant across all training set sizes. An example of the predictions from the Gaussian process is given in Figure 4.13, which shows predicted versus actual values for a 30-stop sample. The graph shows that the Gaussian process generally predicts a stop close to the mean, with the actual distribution having a much larger range of values than the predicted values. Gaussian processes do produce better results than simply predicting the mean of the distribution, though.
The stop lengths have a mean of 10.8 seconds, and the exponential distribution has a standard deviation equal to its mean, so the average error would be 10.8 seconds if the mean were predicted every time, compared with the 8.6 seconds found when using a Gaussian process. However, the probability predictions are very inaccurate: the actual stops had an average probability density of .003, while outputting predictions with probability densities drawn from an exponential distribution with a mean of 10.8 would result in an average probability density of .046. This may be due to the exponential distribution having a heavy tail while the normal distribution has light tails, which I will attempt to fix by looking at the results when using a logistic output variable. It may also be due to the fact that the normal distribution is symmetric, and thus must allot as much probability density to values less than 0 as to values greater than 2x, where x is the mean of the distribution. Values less than 0 will never occur, while, as Figure 4.13 shows, it is somewhat common for an actual stop to be more than twice the predicted value. There does not seem to be a way to fix this without using an exponentially distributed output variable.

Figure 4.13: Graph demonstrating predictions versus actual stop lengths. The green line represents the predicted stop lengths, while the blue line represents the actual stop lengths.

4.4.2 Experiment with Output Variable Drawn from Logistic Distribution

To test the effects of using a different distribution for the output variable, I tested the Gaussian process again, using an output variable drawn from a logistic distribution instead of a normal distribution. This performed slightly better than the normally distributed variable at predicting probability densities, with the Gaussian process predicting a .0119 probability density on average for each stop length, but it had an average error of 9.2 seconds at best. Figure 4.14 shows a graph comparing predicted stop lengths to actual stop lengths, showing that it once again performs quite poorly.

Figure 4.14: Graph demonstrating predictions versus actual stop lengths when the prediction is a random variable drawn from a logistic distribution. The green line represents the predicted stop lengths, while the blue line represents the actual stop lengths.

4.4.3 Nearest Neighbors Regression

To determine whether the distributional assumptions of the Gaussian processes were the only reason for their poor performance, I next conducted experiments to predict stop lengths using nearest neighbors regression. This is a simpler technique than Gaussian Processes and does not produce probabilistic estimates, but it has the benefit of making no assumptions about the distribution of outputs given the inputs. I conducted this experiment similarly to the experiments with Gaussian processes, using a training set of 1000 stops. I initially averaged the 5 nearest neighbors to find predictions for a new stop, finding the square root of the mean squared error to be 13.3. This is much worse than Gaussian Processes, and is due both to the noise present in the training set and to nearest neighbors' inability to determine which features are meaningful.
Figure 4.15 shows that the nearest neighbors predictions have a much greater variability than the Gaussian processes' predictions, which leads to worse results because they are still unable to predict accurately. Increasing the number of neighbors averaged over decreases this variability, but never finds a solution better than predicting the mean value of the stop times. If the training set size were increased the predictions would improve, but because the training set is drawn from the driver's past driving sessions it is unrealistic to keep increasing the training set size. For example, collecting a training set of 10000 stops might require a year of driving, by which time the behavior of the driver may have changed. Therefore it does not appear that nearest neighbors is a useful regression technique for predicting stop times, at least for the simulated data. This may be due to the simulator, or to the simplicity of nearest neighbors, which does not determine which features are most important or make any distributional assumptions when making predictions.

Figure 4.15: Graph demonstrating predictions versus actual stop lengths for nearest neighbors regression. The green line represents the predicted stop lengths, while the blue line represents the actual stop lengths.

4.4.4 Summary

Using Gaussian Processes with either a logistic distribution or a normal distribution for the output variable resulted in poor predictions of the stop length. There are two issues that seem to be preventing this approach from working well. First, the stop length is exponentially distributed, but the predictors are not. This is even more of an issue because both predictors are symmetric, while the stop length cannot be less than zero. It might be possible to normalize the distributions and remove the predicted values of less than zero, but the probability predictions at each point would still be very imprecise. I tried to address this issue using a nearest-neighbors-based regression method, but found it to be inaccurate as well. This may be due to the simplicity of the model, or it may indicate that until the second issue is resolved the choice of method is relatively meaningless.

The second issue is that the simulator uses a Markov model, where a stop length is drawn randomly from an exponential distribution each time there is a stop. The mean of this distribution is set by which of the four states the driver is currently in. Thus the best that can be done from the summary features is to predict the mean of the exponential distribution. The Ford script for producing brake and accelerator pedal data looks ahead, so features from the brake and accelerator pedal pressure immediately preceding the stop may be slightly more predictive of the stop length, but it is highly unlikely that there is currently enough data to predict a stop length to within less than 3 or 4 seconds based on drivers' trace files from the simulator. Drivers probably do not behave this way in real life, as previous behavior is most likely predictive of stopping time, but I do not have enough data to measure how drivers actually behave. It would be possible to arbitrarily create drivers in the simulator that stop for certain lengths based on their previous behavior, but there would be no way to verify that experiments done with that set of drivers would be replicable with real drivers.

Both issues have the potential to be fixed. In order to determine how effective individual driver behavior is at predicting stop lengths, more data must be gathered from drivers. Currently Ford does not have enough data from individual drivers to construct a model of drivers' stopping behavior. If this data could be gathered, it would be possible to use nearest neighbors again, or instead a generalized linear model, to predict the stop length instead of using Gaussian Processes. If a generalized linear model were used, the predicted value would be an exponentially distributed variable whose mean, or the inverse of its mean, is a linear combination of the features. This would fix both of the major issues that the current stop-length model has, while still allowing for probabilistic predictions.
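A minimal sketch of such a model follows, using the statsmodels library (my choice for illustration) and treating the exponential distribution as a Gamma family with dispersion fixed at 1. The feature matrix X and stop lengths y are again synthetic placeholders.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.random((1000, 30)))  # behavior features + intercept
y = rng.exponential(10.8, 1000)              # stop lengths in seconds

# An exponential response is a Gamma GLM with dispersion fixed at 1; with
# a log link the mean stop length is exp(X @ beta), a function of the
# features, while an inverse link would make 1/mean linear instead.
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()

# The fitted mean pins down the whole exponential predictive distribution,
# so probabilistic predictions remain available, e.g. P(stop > 10 s).
mu = result.predict(X[:5])
print(mu, np.exp(-10.0 / mu))
```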
Given the inaccuracy of the predictions of the stop length, I did not proceed to the next step, which is determining what policy should be used based on the predicted stop lengths. It would be more appropriate to consider this step after a more accurate model can be created, especially because the form of the model may need to change in order to improve its accuracy.

4.5 Conclusion

In this chapter I examined three approaches to selecting a policy that performs best for an individual driver: the Kohl-Stone policy gradient algorithm, the covariance matrix adaptation evolution strategy, and an approach that used Gaussian processes to create a model of the driver's behavior. Kohl-Stone and CMAES represent model-free approaches that attempt to learn the best policy for the individual directly. With the best parameters, I found an average improvement of 1.6% using Kohl-Stone and an improvement of 0.7% using CMAES over the policy optimized for the population that I found in chapter 3. These improvements are much smaller than the one I found moving from Ford's hand-tuned policy to the optimized policy for the population, but they would still be worth implementing in production vehicles. My other approach was a model-based reinforcement learning method, where I attempted to learn a model of the driver's stop lengths given features of the driver's behavior. This fared poorly, most likely due to how stops are distributed, as well as the configuration of the simulator as a state machine.

I expect both of the model-free approaches to also produce better results than a static policy in real vehicles. I also suspect that the model-based approach would perform better in real vehicles, as it is likely that a real driver's stop length is correlated with features from before the stop, whereas in my model stop lengths are explicitly not correlated with the features. The next step is then to collect real data and implement the model-free approaches in test vehicles. Using real data it is possible to determine whether the model-free approach would work well if the drivers do not change their behavior based on the policy in use. If drivers do change their behavior based on the policy in use, it is necessary to implement Kohl-Stone on a real vehicle to observe how well it works. From actual data it should also be possible to determine which features predict driver stopping behavior, which would allow for the creation of a better model of the driver's stop lengths. Policies could then be determined based on predicted stop length. These policies could be tested either with the data or in test vehicles.
Both issues have the potential to be fixed. In order to determine how effective individual driver behavior is at predicting stop lengths, more data must be gathered from drivers. Currently Ford does not have enough data from individual drivers to construct a model of drivers' stopping behavior. If this data could be gathered, it would be possible to try nearest neighbors again, or to use a generalized linear model instead of Gaussian processes to predict the stop length. With a generalized linear model, the predicted value would be an exponentially distributed variable whose mean, or the inverse of its mean, is a linear combination of the features. This would fix both of the major issues with the current model for predicting the stop length, while still allowing for probabilistic predictions.
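As a sketch of what such a model could look like, assuming the statsmodels library and synthetic stand-in data: an exponential response is a Gamma response with the dispersion fixed at one, and the inverse power link makes the reciprocal of the mean stop length a linear combination of the features, matching the parameterization described above.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Synthetic stand-in data: summary features per stop, with stop
    # lengths drawn so that 1 / E[stop length] is linear in the features.
    X = sm.add_constant(rng.uniform(size=(500, 2)))
    rate = X @ np.array([0.05, 0.10, 0.02])  # reciprocal mean per stop
    y = rng.exponential(scale=1.0 / rate)    # stop lengths in seconds

    # A Gamma GLM with the inverse power link; fixing the dispersion
    # (scale) at 1 corresponds to an exponentially distributed response.
    model = sm.GLM(y, X,
                   family=sm.families.Gamma(sm.families.links.InversePower()))
    result = model.fit(scale=1.0)
    print(result.params)           # coefficients of the linear predictor
    print(result.predict(X[:5]))   # predicted mean stop lengths in seconds

Because an estimated mean fully determines an exponential distribution, this parameterization yields a complete predictive distribution over stop lengths rather than just a point estimate.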
Given the inaccuracy of the predictions of the stop length, I did not proceed to the next step, which is determining what policy should be used based on the predicted stop lengths. It would be more appropriate to consider this step once a more accurate model can be built, especially because the form of the model may need to change in order to improve its accuracy.

4.5 Conclusion

In this chapter I examined three approaches to selecting the policy that performs best for an individual driver: the Kohl-Stone policy gradient algorithm, the covariance matrix adaptation evolution strategy (CMAES), and an approach that used Gaussian processes to build a model of the driver's behavior. Kohl-Stone and CMAES are model-free approaches that attempt to learn the best policy for the individual directly. With the best parameters, I found an average improvement of 1.6% using Kohl-Stone, and an improvement of 0.7% using CMAES, over the policy optimized for the population in Chapter 3. These improvements are much smaller than the gain from moving from Ford's hand-tuned policy to the population-optimized policy, but would still be worth implementing in production vehicles.

My other approach was a model-based reinforcement learning method, in which I attempted to learn a model of the driver's stop lengths from features of the driver's behavior. This fared poorly, most likely because of how the simulated stops are distributed, as well as the simulator's configuration as a state machine. I expect both of the model-free approaches to also produce better results than a static policy in real vehicles. I also suspect that the model-based approach would perform better in real vehicles, as a driver's stop length is likely correlated with features from before the stop, whereas in my model the stop lengths are explicitly not correlated with those features.

The next step is to collect real data and to implement the model-free approaches in test vehicles. With real data it is possible to determine whether the model-free approaches would work well, provided drivers do not change their behavior based on the policy in use. If drivers do change their behavior based on the policy in use, it is necessary to implement Kohl-Stone in a real vehicle to observe how well it works. From actual data it should also be possible to determine which features predict driver stopping behavior, which would allow for the creation of a better model of drivers' stop lengths. Policies could then be determined based on the predicted stop length and tested either on the data or in test vehicles. Again, if drivers change their behavior based on the policy in use, the effectiveness of this approach would have to be measured in a test vehicle rather than on collected data.

Chapter 5

Conclusion

In this thesis I aimed to improve the performance of a car's start-stop controller using machine learning. I investigated the performance of various algorithms at improving the controller, first trying to find the controller that performs best on average for all drivers, then evaluating techniques that optimize the controller for a single driver. Because of the lack of available data, I built a simulator in order to evaluate these methods. The simulator was designed so that its average behavior mimics the limited set of driving data available, while still producing a wide variety of driving behavior. All methods were evaluated on the simulator, though I also evaluated the population-level methods on the real data that was available.

I found major gains when optimizing a controller for the whole population: a 27% improvement over the cost of the pre-existing controller on the simulated data, and a 9% improvement on the real data. This suggests that the improved controller could be tested immediately for use in production vehicles. The only remaining questions are whether the available data is representative of the full driving population, and whether drivers will change their behavior in response to the changed controller. The first question can be answered with data drawn from the general population rather than from drivers testing out the start-stop system, but the second can only be resolved by testing the controller in a vehicle.

I next evaluated methods for optimizing the controller for an individual. I found a further 1.6% improvement for the average simulated driver using techniques similar to those used for the whole population, which is enough of an improvement to be worth implementing in a production vehicle. I then evaluated another method for optimizing a controller directly, as well as a method that built a model of the driver's actions before learning a controller. Building a model was ineffective, due partly to the lack of sophistication in the simulator. Unfortunately, building more sophisticated behavior into the simulator seems infeasible without more real-world data. Still, the gains from optimizing a controller directly are promising and merit study in an actual vehicle.

5.1 Future Work

There are two major areas for future study. First, the methods used in this thesis could be implemented in actual vehicles for testing. This would measure drivers' responses to the new controllers, to ensure that they do not change their behavior for the worse. Once the new methods have been evaluated in test vehicles, they could be put into production vehicles.

The second area for future study is more sophisticated methods that require more data. For instance, building a model of individual driver behavior would be far more feasible given many hours of real driving data, which would hopefully allow a model-based reinforcement learning method to succeed. More data could be used on its own to test these methods, or as the basis for a more sophisticated simulator that models real drivers better. Even given the small amount of data on hand, the conclusions of this thesis are very promising. It appears that, assuming drivers do not react negatively to a new controller, at least a 9% improvement over the old controller is reachable simply by using a new controller optimized for the population. Reducing by 9% the energy that all new cars use during stops would substantially increase the aggregate efficiency of the vehicles being operated, and likewise reduce the amount of fossil fuel they burn.