Reinforcement Using Supervised Learning for Policy Generalization

Julien Laumonier∗
DAMAS Laboratory
Department of Computer Science and Software Engineering
Laval University, G1K 7P4, Quebec (Qc), Canada
(418) 656-2131 ext. 4505
jlaumoni@damas.ift.ulaval.ca

∗ I would like to thank François Laviolette, Camille Besse and Brahim Chaib-draa for their comments on this problem.

Introduction

Applying reinforcement learning to large Markov Decision Processes (MDPs) is an important issue for solving very large problems. Since exact resolution is often intractable, many approaches have been proposed to approximate the value function (for example, TD-Gammon (Tesauro 1995)) or to approximate the policy directly by gradient methods (Russell & Norvig 2002). Such approaches provide a policy over the whole state space, whereas classical reinforcement learning algorithms do not guarantee, in finite time, the exploration of all states. However, these approaches often require a manual definition of the parameters of the approximation functions. Recently, Lagoudakis & Parr (2003) introduced the problem of approximating the policy by a policy iteration algorithm using a mix between a rollout algorithm and Support Vector Machines (SVM). The work presented in this paper is an extension of Lagoudakis' idea. We propose a new and more general formalism which combines the reinforcement learning and supervised learning formalisms. To learn an approximation of an optimal policy, we propose several combinations of reinforcement and supervised learning algorithms. Contrary to Lagoudakis' approach, we are not restricted to approximate policy iteration but can use any reinforcement learning algorithm. One argument for this approach is that, in reinforcement learning, the most important result to obtain is the optimal policy, i.e., a total order over the actions for each state. That is why we do not focus on the value of a state but on direct policy approximation.

Problem Definition

As in (Lagoudakis & Parr 2003), we define the problem of learning a policy as the generalization, by supervised learning, of an approximate policy obtained by a reinforcement learning algorithm. The formalism follows the MDP's: we have a set of states S (|S| = n), where each state is represented by v state variables, s_i = (s_1, . . . , s_v)_i. We also have p = |A| actions a_1, . . . , a_p. A reward function and a transition function can also be defined. Basically, a policy π in an MDP is a classifier: π : S → A. The global objective is to learn, with a supervised learning algorithm, an approximation of the optimal policy π*. To do so, we define the loss function as follows:

l(π(s), a) = 0 if π(s) = a, and 1 if π(s) ≠ a.

The learned policy π should minimize the loss function for all states in S. The risk R(π) of a policy is the expected loss, i.e., the fraction of states for which the policy π does not predict the optimal action:

R(π) = E_{S,A}[l(π(S), A)] = (1/n) ∑_{i=1}^{n} l(π(s_i), a_i).

Note that n can be very large or even infinite. The optimal policy π* is the policy which minimizes the true risk over all policies in Π: π* = arg min_{π∈Π} R(π). By definition, the risk of the optimal policy is zero. In practice, we do not have the real optimal policy. Thus, the supervised learning algorithm uses a learning set π*_L, which is a subset of π* of size m. The empirical risk is defined as the errors made by a policy π on this learning set:

R_emp(π) = (1/m) ∑_{i=1}^{m} l(π(s_i), a_i).

Approach

To learn a generalized policy, we propose two approaches commonly used in supervised learning: an offline and an online algorithm. In both approaches, the input of the supervised learner is a state together with the proposed optimal action.

Offline Learning

The offline algorithm is divided into two steps. The first step learns a policy π using a classical reinforcement learning algorithm (such as Q-learning, for example). The second step generalizes the tabular policy obtained previously using a supervised learning algorithm (here, an SVM).
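For concreteness, the following Python sketch illustrates these two steps; it is an illustration under stated assumptions, not the implementation used for the experiments. It assumes a small discrete environment exposing env.reset() and env.step(a) (returning the next state, the reward and a termination flag) and a state_features() mapping from a state index to a feature vector; the RBF-kernel SVM comes from scikit-learn's SVC.

import numpy as np
from sklearn.svm import SVC

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Step 1: learn a tabular policy with classical Q-learning.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)  # assumed environment interface
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
    return Q

def generalize_policy(Q, state_features):
    # Step 2: build a learning set (state features -> greedy action)
    # from the tabular policy and generalize it with an RBF-kernel SVM.
    X = np.array([state_features(s) for s in range(Q.shape[0])])
    y = np.argmax(Q, axis=1)
    clf = SVC(kernel="rbf")
    clf.fit(X, y)
    return clf  # clf.predict([state_features(s)]) approximates pi(s)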
Online Learning

This approach aims to combine online supervised learning with the reinforcement learning interaction with the environment. The online algorithm is a mix between the Perceptron and Q-learning. Basically, at each step of Q-learning, after an update of the Q-value, the current state s and the current optimal action a (obtained by arg max_a Q(s, a)) are given to the Perceptron part, which approximates the policy. In the Q-learning part, the action is chosen by the approximated policy (with a classical exploration probability).
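Again purely as an illustration of the loop just described (under the same assumed env and state_features() interface as above), the sketch below uses scikit-learn's linear Perceptron with partial_fit as a stand-in for the kernelized (RBF) perceptron used in the experiments.

import numpy as np
from sklearn.linear_model import Perceptron

def perceptron_q_learning(env, n_states, n_actions, state_features,
                          episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    clf = Perceptron()
    classes = np.arange(n_actions)  # full action set, required by partial_fit
    fitted = False
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # action chosen by the approximated policy, with exploration
            if not fitted or np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(clf.predict([state_features(s)])[0])
            s2, r, done = env.step(a)  # assumed environment interface
            # Q-learning update, then feed (s, arg max_a Q(s, a)) to the classifier
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            greedy = int(np.argmax(Q[s]))
            clf.partial_fit([state_features(s)], [greedy], classes=classes)
            fitted = True
            s = s2
    return Q, clf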
Experiments

First experiments were made using Q-learning as the reinforcement learning algorithm for both approaches (offline and online), with an SVM (offline) and a Perceptron (online) as the supervised learning algorithms. The supervised learning algorithms use a Radial Basis Function (RBF) kernel to allow the approximation of a non-linear policy. We test our approach on a very simplified version of mountain car with random initial states. Figure 1 shows the convergence result using Q-Learning and Perceptron Q-Learning (the online algorithm) for 1000 episodes. Perceptron Q-Learning seems to converge faster because it generalizes during learning. Table 1 reports a pure exploitation of the tabular policy (learned by Q-Learning), the SVM policy and the Perceptron Q-Learning policy over 1000 episodes, after 1000 episodes of learning. We can see in this table that the SVM policy is more stable (lower standard deviation) and provides better results (better average) than the classical tabular policy. This is explained by the states where the tabular policy does not provide any action, so a random action is chosen, which leads on average to worse results. On the other hand, Perceptron Q-Learning seems to lead to a more specialized policy which is very good in many cases (best average) but can be bad for a few initial states (worst standard deviation).

Figure 1: Learning convergence for 1000 episodes (Nb of steps per episode vs. Nb Episodes, for Q-Learning and PerceptronQ).

Policy              Average    Std Dev
Tabular             19.145     10.924
Offline SVM         14.405      6.400
Online Perceptron    7.925     25.833

Table 1: Exploitation of the Tabular, SVM and Perceptron policies.

Discussion

With our approach, policy generalization seems to give good results on the simple mountain car problem. As could be expected, the SVM approach gives a better generalization than the perceptron algorithm, as it does in a pure classification problem. However, one of the differences between the offline and the online approach is that the offline approach learns on a nearly converged policy, whereas the online perceptron also learns from exploration state-action examples even though these examples, at the beginning, are not necessarily the optimal actions. Therefore, the impact of the supervised learning algorithm is difficult to assess. Finally, the results presented here are based on a very simple problem on which PerceptronQ outperforms the tabular Q-learning; on the real mountain car problem, some stability problems may occur when using approximated policies.

PAC-Bounds

Our results do not guarantee any theoretical convergence or even any bound on the value of the generalized policy. Indeed, even if we can ensure that the Q-values will converge to the optimal values, the samples provided to the classification algorithm are non-iid in the online version. Recently, Probably Approximately Correct (PAC) bounds have been defined for MDP-based algorithms (Strehl et al. 2006). These bounds are based on the notion of sample complexity (Kakade 2003) and ensure that after t timesteps the policy will be an ε-optimal policy with probability 1 − δ. Generally, t is bounded by a polynomial function in S, A, ε and δ. One idea is to mix these bounds with supervised learning bounds to obtain tighter bounds on the final generalized policy. One problem is the definition of PAC-MDP, which does not fit well with PAC bounds for supervised learning in our formalism. Indeed, the PAC-MDP framework characterizes near-optimality through the value function, by bounding the number of timesteps for which V^π(s) < V*(s) − ε. This formulation does not give us any bound on the risk of the policy. Formally, we need a bound on P(R(π) < η) to mix with the classical PAC bounds from supervised learning.

Conclusion

The work presented here is only a preliminary study of the combination of reinforcement learning and supervised learning algorithms. Much work remains to be done to evaluate precisely the impact of generalizing the policy instead of the value function. More specifically, we plan to evaluate the various parameters which can influence the results (algorithms, kernels, etc.) and to study PAC bounds for policy generalization more thoroughly.

References

Kakade, S. M. 2003. On the Sample Complexity of Reinforcement Learning. Ph.D. Dissertation, University College London.
Lagoudakis, M. G., and Parr, R. 2003. Reinforcement learning as classification: Leveraging modern classifiers. In 20th Inter. Conf. on Machine Learning (ICML-2003).
Russell, S. J., and Norvig, P. 2002. Artificial Intelligence: A Modern Approach (2nd Edition). Prentice Hall.
Strehl, A. L.; Li, L.; Wiewiora, E.; Langford, J.; and Littman, M. L. 2006. PAC model-free reinforcement learning. In 23rd Inter. Conf. on Machine Learning (ICML-06).
Tesauro, G. 1995. Temporal difference learning and TD-Gammon. Commun. ACM 38(3):58–68.