Reinforcement Using Supervised Learning for Policy Generalization

Julien Laumonier∗
DAMAS Laboratory
Department of Computer Science and Software Engineering
Laval University, G1K 7P4, Quebec (Qc), Canada
(418) 656-2131 ext. 4505
objective is to learn, by a supervised learning algorithm, an
approximation of the optimal policy π∗. To do so, we define
the loss function as follows:
\[
l(\pi(s), a) =
\begin{cases}
0 & \text{if } \pi(s) = a \\
1 & \text{if } \pi(s) \neq a.
\end{cases}
\]
Applying reinforcement learning to large Markov Decision
Processes (MDPs) is an important issue for solving very large
problems. Since exact resolution is often intractable,
many approaches have been proposed to approximate the
value function (for example, TD-Gammon (Tesauro 1995))
or to approximate the policy directly by gradient methods
(Russell & Norvig 2002). Such approaches provide a policy
over the whole state space, whereas classical reinforcement
learning algorithms do not guarantee the exploration of all
states in finite time. However, these approaches often require
the parameters of the approximation functions to be defined manually.
Recently, Lagoudakis & Parr (2003) introduced the problem
of approximating a policy with a policy iteration algorithm
that mixes a rollout algorithm and Support Vector
Machines (SVM). The work presented in this paper is an
extension of Lagoudakis’ idea. We propose a new and more
general formalism which combines the reinforcement learning
and supervised learning formalisms. To learn an approximation
of an optimal policy, we propose combinations of
various reinforcement and supervised learning algorithms.
Contrary to Lagoudakis’ approach, we are not restricted to
approximate policy iteration: any reinforcement learning
algorithm can be used. One argument for this approach is that,
in reinforcement learning, the most important result to obtain
is the optimal policy, i.e., a total order of actions for each
state. That is why we focus not on the value of a state but on
direct policy approximation.
The learned policy π should minimize the loss function
for all states in S. The risk R(π) of a policy π is defined as the
fraction of states for which π does not predict the
optimal action:
\[
R(\pi) = E[l(\pi(S), A)] = \frac{1}{n} \sum_{i=1}^{n} l(\pi(s_i), a_i),
\]
where a_i denotes the optimal action in state s_i.
Note that n could be very large or even infinite. The optimal
policy π∗ is the policy that minimizes the true risk over all
policies in Π: π∗ = arg min_{π∈Π} R(π). By definition, the risk
of the optimal policy is zero. In practice, we do not have the
real optimal policy. Thus, the supervised learning algorithm
uses a learning set π_L, which is a subset of π∗ of size m.
The empirical risk is defined as the error made by a policy
π on this learning set:
\[
R_{emp}(\pi) = \frac{1}{m} \sum_{i=1}^{m} l(\pi(s_i), a_i),
\]
where the sum runs over the m state-action pairs (s_i, a_i) of π_L.
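The loss and the empirical risk defined above can be sketched in a few lines of Python. This is only a minimal illustration: the learning set, the state encoding and the `always_left` policy are hypothetical examples, not data from the experiments.

```python
def zero_one_loss(predicted, optimal):
    """0-1 loss l(pi(s), a): 0 if the policy predicts the optimal action, else 1."""
    return 0 if predicted == optimal else 1


def empirical_risk(policy, learning_set):
    """Average 0-1 loss of `policy` over the (state, optimal action) pairs."""
    if not learning_set:
        raise ValueError("the learning set must not be empty")
    total = sum(zero_one_loss(policy(s), a) for s, a in learning_set)
    return total / len(learning_set)


# Hypothetical learning set: states are tuples of state variables,
# each paired with the action assumed to be optimal in that state.
learning_set = [
    ((0, 0), "left"),
    ((0, 1), "left"),
    ((1, 0), "right"),
    ((1, 1), "right"),
]


def always_left(state):
    """A deliberately poor policy, used only to show a non-zero risk."""
    return "left"


print(empirical_risk(always_left, learning_set))  # 0.5
```

A policy that agrees with every pair of the learning set has empirical risk zero, which matches the statement that the risk of the optimal policy is zero.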
To learn a generalized policy, we propose two approaches
commonly used in supervised learning: an offline and an online algorithm. In both approaches, the input of the supervised learning is defined by a state and the proposed optimal
Problem Definition
As in Lagoudakis & Parr (2003), we define the problem of
learning a policy as the generalization, by supervised learning,
of an approximate policy obtained by a reinforcement learning
algorithm. The formalism follows that of MDPs: we
have a set of states S (|S| = n), where each state is represented
by v state variables, s_i = (s_1, ..., s_v)_i. We also have
p = |A| actions a_1, ..., a_p. A reward function
and a transition function can also be defined. Basically, a
policy π in an MDP is a classifier: π : S → A. The global
Offline Learning
The offline algorithm is divided into two steps. The first step
learns a policy π using a classical reinforcement learning
algorithm (such as Q-learning, for example). The second step
generalizes the tabular policy obtained in the first step, using a
supervised learning algorithm (here, SVM).
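The two offline steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the environment is a hypothetical 10-state chain instead of mountain car, all hyperparameters are arbitrary, and a 1-nearest-neighbour classifier stands in for the SVM so that the example needs only the standard library.

```python
import random

ACTIONS = (-1, +1)   # move left / move right
GOAL = 9             # hypothetical chain with positions 0..9, goal at 9


def step(state, action):
    """Toy environment (an assumption): reward -1 per step until the goal."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, -1.0, nxt == GOAL


def q_learning(episodes=200, alpha=0.5, gamma=0.95, epsilon=0.1,
               max_steps=500, seed=0):
    """Step 1: tabular Q-learning; only visited (state, action) pairs are stored."""
    rng = random.Random(seed)
    q = {}
    for _episode in range(episodes):
        s = rng.randrange(GOAL)                      # random non-goal start
        for _t in range(max_steps):
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)              # exploration
            else:
                a = max(ACTIONS, key=lambda b: q.get((s, b), 0.0))
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(q.get((s2, b), 0.0) for b in ACTIONS)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = s2
            if done:
                break
    return q


def greedy_policy(q):
    """Tabular policy: the best tried action for every visited state."""
    states = {s for s, _a in q}
    return {s: max(ACTIONS, key=lambda b: q.get((s, b), float("-inf")))
            for s in states}


def fit_nearest_neighbour(examples):
    """Step 2: generalize the table. 1-NN stands in for the SVM of the paper."""
    points = list(examples.items())

    def policy(state):
        _nearest, action = min(points, key=lambda p: abs(p[0] - state))
        return action

    return policy


tabular = greedy_policy(q_learning())
generalized = fit_nearest_neighbour(tabular)
print(generalized(2.5))   # an action for a state that is not in the table
```

The point of the second step is visible in the last line: the tabular policy has no entry for the state 2.5, while the generalized policy still returns an action for it.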
Online Learning
This approach aims to combine online supervised learning with the reinforcement learning interaction with the environment. The online algorithm is a mix between the Perceptron and
Q-learning. Basically, at each step of Q-learning, after
an update of the Q-value, the current state s and the current
optimal action a (obtained by arg max_a Q(s, a)) are given to
the Perceptron part, which approximates the policy. In the
Q-learning part, the action is chosen by the approximated policy
(with a classical exploration probability).
∗ I would like to thank François Laviolette, Camille Besse and
Brahim Chaib-draa for their comments on this problem.
Copyright © 2007, Association for the Advancement of Artificial
Intelligence. All rights reserved.
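The online scheme described in the Online Learning section can be sketched as below. Again this is a hedged illustration rather than the algorithm as implemented for the paper: the chain environment, the hyperparameters and the `KernelPerceptron` class (a mistake-driven multiclass perceptron with an RBF kernel) are assumptions made for the example.

```python
import math
import random

ACTIONS = (-1, +1)   # move left / move right
GOAL = 9             # hypothetical chain with positions 0..9, goal at 9


def step(state, action):
    """Toy environment (an assumption): reward -1 per step until the goal."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, -1.0, nxt == GOAL


class KernelPerceptron:
    """Multiclass perceptron with an RBF kernel, updated only on mistakes."""

    def __init__(self, classes, width=0.5):
        self.classes = tuple(classes)
        self.width = width
        self.alpha = {}          # (point, class) -> accumulated weight

    def _scores(self, x):
        totals = dict.fromkeys(self.classes, 0.0)
        for (point, c), w in self.alpha.items():
            totals[c] += w * math.exp(-self.width * (point - x) ** 2)
        return totals

    def predict(self, x):
        totals = self._scores(x)
        return max(self.classes, key=totals.get)

    def update(self, x, label):
        predicted = self.predict(x)
        if predicted != label:   # promote the label, demote the wrong prediction
            self.alpha[(x, label)] = self.alpha.get((x, label), 0.0) + 1.0
            self.alpha[(x, predicted)] = self.alpha.get((x, predicted), 0.0) - 1.0


def perceptron_q_learning(episodes=100, alpha=0.5, gamma=0.95, epsilon=0.1,
                          max_steps=500, seed=0):
    rng = random.Random(seed)
    q = {}
    policy = KernelPerceptron(ACTIONS)
    for _episode in range(episodes):
        s = rng.randrange(GOAL)
        for _t in range(max_steps):
            # The action comes from the approximated policy, with exploration.
            a = rng.choice(ACTIONS) if rng.random() < epsilon else policy.predict(s)
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(q.get((s2, b), 0.0) for b in ACTIONS)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            # After the Q update, (s, arg max_a Q(s, a)) is given to the perceptron.
            greedy = max(ACTIONS, key=lambda b: q.get((s, b), 0.0))
            policy.update(s, greedy)
            s = s2
            if done:
                break
    return q, policy


q_values, learned_policy = perceptron_q_learning()
print([learned_policy.predict(s) for s in range(GOAL)])
```

Because the labels fed to the perceptron come from Q-values that are still changing, early examples may be wrong; the sketch makes no attempt to correct for this.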
online perceptron also uses exploratory state-action examples to learn, although these examples, at the beginning, are
not necessarily the optimal actions. Therefore, the impact of
the supervised learning algorithm is difficult to guarantee.
Finally, the results presented here are based on a very simple
problem where Perceptron Q-Learning outperforms tabular Q-Learning. In the real mountain car problem, some stability
problems may occur when using approximated policies.
First experiments have been made using Q-Learning as the reinforcement learning algorithm for both approaches (offline
and online), with SVM (offline) and Perceptron (online) as the
supervised learning algorithms. The supervised learning algorithms use a Radial Basis Function (RBF) kernel to allow
the approximation of a nonlinear policy. We test our approach on a very simplified version of mountain car with
random initial states. Figure 1 shows the convergence results for Q-Learning and Perceptron Q-Learning (the online algorithm) over 1000 episodes. Perceptron Q-Learning
seems to converge faster because it generalizes during learning. Table 1 reports a pure exploitation of the tabular policy (learned by Q-Learning), the SVM policy and the Perceptron Q-Learning policy over 1000 episodes, after 1000 episodes
of learning. The table shows that the SVM policy
is more stable (lower standard deviation) and provides better
results (better average) than the classical tabular policy. This
is explained by the states for which the tabular policy does not
provide any action: a random action is chosen there, which
leads on average to worse results. On the other hand, Perceptron Q-Learning seems to lead to a more specialized policy,
which is very good in many cases (best average) but can be
bad for a few initial states (worst standard deviation).
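The effect of missing table entries on pure exploitation can be illustrated with a small sketch. It assumes a hypothetical 10-state chain rather than the mountain car task and two hand-made tables; the helper names `run_episode` and `evaluate` are illustrative and do not come from the experiments. Where the table has no entry, a random action is chosen, as described above.

```python
import random
import statistics

ACTIONS = (-1, +1)   # move left / move right
GOAL = 9             # hypothetical chain with positions 0..9, goal at 9


def run_episode(table, start, rng, max_steps=200):
    """Exploit `table` from `start`; return the number of steps to reach GOAL."""
    state = start
    for t in range(1, max_steps + 1):
        if state in table:
            action = table[state]
        else:
            action = rng.choice(ACTIONS)   # no entry: a random action is chosen
        state = min(max(state + action, 0), GOAL)
        if state == GOAL:
            return t
    return max_steps


def evaluate(table, episodes=1000, seed=0):
    """Mean and standard deviation of episode length from random initial states."""
    rng = random.Random(seed)
    steps = [run_episode(table, rng.randrange(GOAL), rng) for _ in range(episodes)]
    return statistics.mean(steps), statistics.stdev(steps)


full_table = {s: +1 for s in range(GOAL)}                    # an action everywhere
partial_table = {s: +1 for s in range(GOAL) if s % 2 == 0}   # odd states missing

print("full table:    mean=%.2f std=%.2f" % evaluate(full_table))
print("partial table: mean=%.2f std=%.2f" % evaluate(partial_table))
```

In this toy setting every random move to the left has to be undone later, so a table with missing entries can only lengthen an episode; a generalized policy avoids the problem by always proposing an action.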
Our results do not guarantee any theoretical convergence, or
even any bound on the value of the generalized policy. Indeed, even if we can ensure that the Q-values converge
to the optimal values, the samples provided to the classification algorithm are non-i.i.d. for the online version. Recently, Probably Approximately Correct (PAC) bounds have
been defined for MDP-based algorithms (Strehl et al. 2006).
These bounds are based on the notion of sample complexity
(Kakade 2003) and ensure that after t timesteps the policy
will be an ε-optimal policy with probability 1 − δ. Generally, t is bounded by a polynomial function in S, A, ε, δ. One
idea is to mix these bounds with supervised learning bounds
to get tighter bounds on the final generalized policy.
One problem is that the definition of PAC-MDP does not
fit well with PAC bounds for supervised learning in our
formalism. Indeed, the PAC-MDP framework defines an ε-optimal policy as a policy whose value satisfies V^π(s) ≥ V∗(s) − ε. This formulation does not give us
any bound on the risk of the policy. Formally, we need
to find a bound on P(R(π) < η) to mix with the classical PAC
bounds from supervised learning.
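To make the mismatch explicit, the two kinds of statement can be written side by side. This is only a sketch in the notation of this paper: the left-hand statement restates the PAC-MDP condition, the right-hand one is the form of risk statement that would be needed, and the confidence level δ' is a placeholder symbol introduced here.

\[
P\big(V^{\pi}(s) \ge V^{*}(s) - \epsilon\big) \ge 1 - \delta
\qquad \text{versus} \qquad
P\big(R(\pi) < \eta\big) \ge 1 - \delta'.
\]

The first constrains the value of the policy, the second the fraction of states in which it disagrees with π∗. A policy can be ε-optimal in value while choosing a non-optimal action in many states, for instance when several actions have nearly equal values, so the first statement does not directly imply the second.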
The work presented here is only a preliminary study of the
combination of reinforcement learning and supervised learning algorithms. Much work remains to be done to evaluate
precisely the impact of generalizing the policy instead of
the value function. More specifically, we plan to evaluate
the various parameters which can influence the results (algorithms, kernels, etc.), and we will study
PAC bounds for policy generalization more theoretically.
[Figure 1: Learning convergence for 1000 episodes. Axes: Nb Episodes (x), Nb of steps (y).]
[Table 1: Exploitation of Tabular, SVM and Perceptron. Rows: Offline SVM, Online Perceptron; column: Std Dev.]
With our approach, policy generalization seems to give
good results for the simple mountain car problem. As
could be expected, the SVM approach gives a better generalization than the perceptron algorithm, as it does in a pure
classification problem. However, one difference between the offline and the online approaches is that the offline
approach learns on a nearly converged policy whereas the