Compress and Control
Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, Guillaume Desjardins
Google DeepMind, Australian National University
{veness,bellemare,alschua,gdesjardins}@google.com
marcus.hutter@anu.edu.au
Abstract
This paper describes a new information-theoretic policy evaluation technique for reinforcement learning. This technique
converts any compression or density model into a corresponding estimate of value. Under appropriate stationarity and ergodicity conditions, we show that the use of a sufficiently
powerful model gives rise to a consistent value function estimator. We also study the behavior of this technique when
applied to various Atari 2600 video games, where the use of
suboptimal modeling techniques is unavoidable. We consider
three fundamentally different models, all too limited to perfectly model the dynamics of the system. Remarkably, we
find that our technique provides sufficiently accurate value
estimates for effective on-policy control. We conclude with a
suggestive study highlighting the potential of our technique
to scale to large problems.
1 Introduction
Within recent years, a number of information-theoretic approaches have emerged as practical alternatives to traditional
machine learning algorithms. Noteworthy examples include
the compression-based approaches of Frank, Chui, and Witten (2000) and Bratko et al. (2006) to classification, and Cilibrasi and Vitányi (2005) to clustering. What differentiates
these techniques from more traditional machine learning approaches is that they rely on the ability to compress the raw
input, rather than combining or learning features relevant to
the task at hand. Thus this family of techniques has proven
most successful in situations where the nature of the data
makes it somewhat unwieldy to specify or learn appropriate features. This class of methods can be formally justified
by appealing to various notions within algorithmic information theory, such as Kolmogorov complexity (Li and Vitányi
2008). In this paper we show how a similarly inspired approach can be applied to reinforcement learning, or more
specifically, to the tasks of policy evaluation and on-policy
control.
Policy evaluation refers to the task of estimating the
value function associated with a given policy, for an arbitrary given environment. The performance of well-known
reinforcement learning techniques such as policy iteration
(Howard 1960), approximate dynamic programming (Bertsekas and Tsitsiklis 1996; Powell 2011) and actor-critic
methods (Sutton and Barto 1998), for example, all crucially
depend on how well policy evaluation can be performed. In
this paper we introduce a model-based approach to policy
evaluation, which transforms the task of estimating a value
function to that of learning a particular kind of probabilistic
state model.
To better put our work into context, it is worth making
the distinction between two fundamentally different classes
of model-based reinforcement learning methods. Simulation-based techniques involve learning some kind of forward model of the environment from which future samples
can be generated. Given access to such models, planning
can be performed directly using search. Noteworthy recent
examples include the work of Doshi-Velez (2009), Walsh,
Goschin, and Littman (2010), Veness et al. (2010), Veness
et al. (2011), Asmuth and Littman (2011), Guez, Silver,
and Dayan (2012), Hamilton, Fard, and Pineau (2013) and
Tziortziotis, Dimitrakakis, and Blekas (2014). Although the
aforementioned works demonstrate quite impressive performance on small domains possessing complicated dynamics,
scaling these methods to large state or observation spaces
has proven challenging. The main difficulty that arises when
using learnt forward models is that the modeling errors
tend to compound when reasoning over long time horizons
(Talvitie 2014).
In contrast, another family of techniques, referred to in
the literature as planning as inference, attempts to side-step
the issue of needing to perform accurate simulations by reducing the planning task to one of probabilistic inference
within a generative model of the system. These ideas have
been recently explored in both the neuroscience (Botvinick
and Toussaint 2012; Solway and Botvinick 2012) and machine learning (Attias 2003; Poupart, Lang, and Toussaint
2011) literature. The experimental results to date have been
somewhat inconclusive, making it far from clear whether the
transformed problem is any easier to solve in practice. Our
main contribution in this paper is to show how to set up a particularly tractable form of inference problem by generalizing
compression-based classification to reinforcement learning.
The key novelty is to focus the modeling effort on learning
the stationary distribution of a particular kind of augmented
Markov chain describing the system, from which we can approximate a type of dual representation (Wang, Bowling,
and Schuurmans 2007; Wang et al. 2008) of the value function. Using this technique, we were able to produce effective controllers on a problem domain orders of magnitude
larger than what has previously been addressed with simulation-based methods.
2 Background
We start with a brief overview of the parts of reinforcement learning and information theory needed to describe our
work, before reviewing compression-based classification.
2.1 Markov Decision Processes
A Markov Decision Process (MDP) is a type of probabilistic model widely used within reinforcement learning (Sutton and Barto 1998; Szepesvári 2010) and control (Bertsekas and Tsitsiklis 1996). In this work, we limit our attention to finite horizon, time homogenous MDPs whose
action and state spaces are finite. Formally, an MDP is a
triplet (S, A, µ), where S is a finite, non-empty set of states,
A is a finite, non-empty set of actions and µ is the transition probability kernel that assigns to each state-action pair
(s, a) ∈ S × A a probability measure µ(· | s, a) over S × R.
S and A are known as the state space and action space
respectively. The transition probability kernel gives rise to
the state transition kernel P(s′ | s, a) := µ({s′} × R | s, a), which gives the probability of transitioning from state s to state s′ if action a is taken in s.
An agent’s behavior is determined by a policy, which defines, for each state s ∈ S and time t ∈ N, a probability
measure over A denoted by πt (· | s). A stationary policy is
a policy which is independent of time, which we will denote by π(· | s) where appropriate. At each time t, the agent
communicates an action At ∼ πt (· | St−1 ) to the system in
state St−1 ∈ S. The system then responds with a state-reward pair (St, Rt) ∼ µ(· | St−1, At). Here we will assume that each reward is bounded, taking values in [rmin, rmax] ⊂ R,
and that the system starts in a state s0 and executes for
an infinite number of steps. Thus the execution of the system can be described by a sequence of random variables
A1 , S1 , R1 , A2 , S2 , R2 , ...
The finite m-horizon return from time t is defined as $Z_t := \sum_{i=t}^{t+m-1} R_i$. The expected m-horizon return from
time t, also known as the value function, is denoted by
V π (st ) := E[Zt+1 | St = st ]. The return space Z is the set
of all possible returns. The action-value function is defined
by Qπ (st , at+1 ) := E[Zt+1 | St = st , At+1 = at+1 ]. An
optimal policy, denoted by π ∗ , is a policy that maximizes
the expected return E [Zt+1 | St ] for all t; in our setting, a
state-dependent deterministic optimal policy always exists.
2.2 Compression and Sequential Prediction
We now review sequential probabilistic prediction in the
context of statistical data compression. An alphabet X is a
set of symbols. A string of data x1 x2 . . . xn ∈ X n of length
n is denoted by x1:n. The prefix x1:j of x1:n, j ≤ n, is denoted by x≤j or x<j+1. The empty string is denoted by ε.
The concatenation of two strings s and r is denoted by sr.
A coding distribution ρ is a sequence of probability mass functions ρn : X^n → [0, 1] which, for all n ∈ N, satisfy the constraint that $\rho_n(x_{1:n}) = \sum_{y \in \mathcal{X}} \rho_{n+1}(x_{1:n} y)$ for all x1:n ∈ X^n, with the base case ρ0(ε) := 1. From here onwards, whenever the meaning is clear from the argument to ρ, the subscript on ρ will be dropped. Under this definition, the conditional probability of a symbol xn given previous data x<n is defined as ρ(xn | x<n) := ρ(x1:n)/ρ(x<n) provided ρ(x<n) > 0, with the familiar chain rules $\rho(x_{1:n}) = \prod_{i=1}^{n} \rho(x_i \mid x_{<i})$ and $\rho(x_{j:k} \mid x_{<j}) = \prod_{i=j}^{k} \rho(x_i \mid x_{<i})$ now following.
A binary source code c : X ∗ → {0, 1}∗ assigns to
each possible data sequence x1:n a binary codeword c(x1:n )
of length $\ell_c(x_{1:n})$. The typical goal when constructing a
source code is to minimize the lengths of each codeword
while ensuring that the original data sequence x1:n is always recoverable from c(x1:n ). A fundamental technique
known as arithmetic encoding (Witten, Neal, and Cleary
1987) makes explicit the connection between coding distributions and source codes. Given a coding distribution ρ and
a data sequence x1:n , arithmetic encoding constructs a code
aρ which produces a binary codeword whose length is essentially − log2 ρ(x1:n ). We refer the reader to the standard
text of Cover and Thomas (1991) for further information.
2.3 Compression-based classification
Compression-based classification was introduced by Frank,
Chui, and Witten (2000). Given a sequence of n labeled i.i.d.
training examples D := (y1 , c1 ), . . . , (yn , cn ), where yi and
ci are the input and class labels respectively, one can apply
Bayes rule to express the probability of a new example Y
being classified as class C ∈ C given the training examples
D by
P [ Y | C, D ] P [ C | D ]
P [ C | Y, D ] = P
.
P [ Y | c, D ] P [ c | D ]
(1)
c∈C
The main idea behind compression-based classification is
to model P [ Y | C, D ] using a coding distribution for the
inputs that is trained on the subset of examples from D
that match class C. Well known non-probabilistic compression methods such as LEMPEL-ZIV (Ziv and Lempel 1977) can be used by forming their associated coding distribution $2^{-\ell_z(x_{1:n})}$, where $\ell_z(x_{1:n})$ is the length of the compressed data x1:n in bits under compression method z. The
class probability P [ C | D] can be straightforwardly estimated from its empirical frequency or smoothed versions
thereof. Thus the overall accuracy of the classifier essentially depends upon how well the inputs can be modeled by
the class conditional coding distribution.
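To make this concrete, the following sketch (our own illustration, not code from the paper) uses the off-the-shelf zlib compressor as the compression method z: the class-conditional term P[Y | C, D] is approximated by 2 raised to minus the number of extra bits needed to compress the new input after the class-conditional training inputs, and the class prior P[C | D] is the empirical class frequency. The example inputs at the bottom are hypothetical.

```python
# A rough sketch (ours, not from the paper) of compression-based classification
# using the off-the-shelf zlib compressor as the compression method z.
import math
import zlib
from collections import defaultdict


def code_length_bits(data):
    # Codeword length in bits, standing in for l_z(x_{1:n}).
    return 8 * len(zlib.compress(data, 9))


def classify(y, train):
    """Return the most probable class for input y, given (input, class) pairs."""
    inputs_by_class = defaultdict(bytes)
    class_counts = defaultdict(int)
    for yi, ci in train:
        inputs_by_class[ci] += yi
        class_counts[ci] += 1

    log_scores = {}
    for c, data in inputs_by_class.items():
        # log2 P[Y | C, D] is approximated by minus the *extra* bits needed
        # to compress y after the class-c training inputs.
        extra_bits = code_length_bits(data + y) - code_length_bits(data)
        log_prior = math.log2(class_counts[c] / len(train))   # empirical P[C | D]
        log_scores[c] = -extra_bits + log_prior
    return max(log_scores, key=log_scores.get)


if __name__ == "__main__":
    train = [(b"the cat sat on the mat", "en"), (b"the dog ate the bone", "en"),
             (b"le chat est sur le tapis", "fr"), (b"le chien mange un os", "fr")]
    print(classify(b"the cat ate the bone", train))
```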
Compression-based classification has both advantages
and disadvantages. On one hand, it is straightforward to apply generic compression techniques (including those operating at the bit or character level) to complicated input types
such as richly formatted text or DNA strings (Frank, Chui,
and Witten 2000; Bratko et al. 2006). On the other hand,
learning a probabilistic model of the input may be significantly more difficult than directly applying standard discriminative classification techniques. Our approach to policy
evaluation, which we now describe, raises similar questions.
3 Compress and Control
We now introduce Compress and Control (CNC), our new
method for policy evaluation.
3.1 Overview
Policy evaluation is concerned with the estimation of
the state-action value function Qπ (s, a). Here we assume
that the environment is a finite, time homogenous MDP
M := (S, A, µ), and that the policy to be evaluated is a
stationary Markov policy π. To simplify the exposition, we
consider the finite m-horizon case, and assume that all rewards are drawn from a finite set R ⊂ R; later we will discuss how to remove these restrictions.
At a high level, CNC performs policy evaluation by learning a time-independent state-action conditional distribution
P(Z | S, A); the main technical component of our work involves establishing that this time-independent conditional
probability is well defined. Our technique involves constructing a particular kind of augmented Markov chain
whose stationary distribution allows for the recovery of
P(Z | S, A). Given this distribution, we can obtain

$$Q^\pi(s, a) = \sum_{z \in \mathcal{Z}} z \, P(Z = z \mid S = s, A = a).$$
In the spirit of compression-based classification, CNC estimates this distribution by using Bayes rule to combine learnt
density models of both P(S | Z, A) and P(Z | A). Although
it might seem initially strange to learn a model that conditions on the future return, the next section shows how this
counterintuitive idea can be made rigorous.
3.2 Transformation
Our goal is to define a transformed chain whose stationary
distribution can be marginalized to obtain a distribution over
states, actions and the m-horizon return. We need two lemmas for this purpose. To make these statements precise, we
will use some standard terminology from the Markov chain
literature; for more detail, we recommend the textbook of
Brémaud (1999).
Definition 1. A Homogenous Markov Chain (HMC) given
by {Xt }t≥0 over state space X is said to be: (AP) aperiodic
iff gcd{n ≥ 1 : P[Xn = x|X0 = x] > 0} = 1, ∀x ∈ X ;
(PR) positive recurrent iff E[min{n ≥ 1 : Xn = x}|X0 =
x] < ∞, ∀x ∈ X ; (IR) irreducible iff ∀x, x′ ∃n ≥ 1 : P[Xn = x′ | X0 = x] > 0; (EA) essentially aperiodic iff
gcd{n ≥ 1 : P[Xn = x|X0 = x] > 0} ∈ {1, ∞}, ∀x ∈ X .
Note also that EA+IR implies AP.
Although the term ergodic is sometimes used to describe particular combinations of these properties (e.g.
AP+PR+IR), here we avoid it in favor of being more explicit.
Lemma 1. Consider a stochastic process {Xt }t≥1 over
state space X that is independent of a sequence of
U-valued random variables {Ut }t≥1 in the sense that
P(xt | x<t, u<t) = P(xt | x<t), and with Ut only depending on Xt−1 and Xt in the sense that P(ut | x1:t, u<t) = P(ut | xt−1, xt) and P(Ut = u | Xt−1 = x, Xt = x′) being independent of t. Then, if {Xt}t≥1 is an (IR/EA/PR) HMC over X, then {Yt}t≥1 := {(Xt, Ut)}t≥1 is an (IR/EA/PR) HMC over Y := {yt ∈ X × U : ∃xt−1 ∈ X : P(yt | xt−1) > 0}.

[Figure 1: Lemma 1 applied to {(At, St), Rt}t≥1: a graphical depiction of the dependence structure, in which each Rt depends only on Xt−1 and Xt.]

Lemma 1 allows the HMC {Xt := (At, St)}t≥1 to be augmented to obtain the HMC {Yt := (Xt, Rt)}t≥1, where At, St and Rt denote the action, state and reward at time t respectively; see Figure 1 for a graphical depiction of the dependence structure.
The second result allows the HMC {Yt }t≥1 to be further
augmented to give the snake HMC {Yt:t+m }t≥1 (Brémaud
1999). This construction ensures that there is sufficient information within each augmented state to be able to condition on the m-horizon return.
Lemma 2. If {Yt }t≥1 is an (IR/EA/PR) HMC over state
space Y, then for any m ∈ N, the stochastic process
{Wt }t≥1 , where Wt := (Yt , ..., Yt+m ), is an (IR/EA/PR)
HMC over W := {y0:m ∈ Y m+1 : P(y1:m |y0 ) > 0}.
Now if we assume that the HMC defined by M and π is (IR+EA+PR), Lemmas 1 and 2 imply that there exists a unique stationary distribution ν′ over the augmented state space (A × S × R)^{m+1}. Furthermore, if we let $(A'_0, S'_0, R'_0, \ldots, A'_m, S'_m, R'_m) \sim \nu'$ and define $Z' := \sum_{i=1}^{m} R'_i$, it is clear that there exists a joint distribution ν over Z × (A × S × R)^{m+1} such that $(Z', A'_0, S'_0, R'_0, \ldots, A'_m, S'_m, R'_m) \sim \nu$. Hence the ν-probability P[Z′ | S′0, A′1] is well defined, which allows us to express the action-value function Qπ as

$$Q^\pi(s, a) = \mathbb{E}_\nu[Z' \mid S'_0 = s, A'_1 = a]. \qquad (2)$$

Finally, by expanding the expectation and applying Bayes rule, Equation 2 can be further re-written as

$$Q^\pi(s, a) = \sum_{z \in \mathcal{Z}} z \, \nu(z \mid s, a) = \sum_{z \in \mathcal{Z}} z \, \frac{\nu(s \mid z, a)\, \nu(z \mid a)}{\sum_{z' \in \mathcal{Z}} \nu(s \mid z', a)\, \nu(z' \mid a)}. \qquad (3)$$
The CNC approach to policy evaluation involves directly
learning the conditional distributions ν(s | z, a) and ν(z | a)
in Equation 3 from data, and then using these learnt distributions to form a plug-in estimate of Qπ (s, a). Notice that
ν(s|z, a) conditions on the return, similar in spirit to prior
work on planning as inference (Attias 2003; Botvinick and
Toussaint 2012; Solway and Botvinick 2012). The distinguishing property of CNC is that the conditioning is performed with respect to a stationary distribution that has been
explicitly constructed to allow for efficient modeling and inference.
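As a purely hypothetical illustration of Equation 3 (the numbers below are ours, chosen only to make the arithmetic transparent), suppose Z = {0, 1} and that, for a fixed state s and action a, the learnt models give ν(z = 1 | a) = 0.25, ν(z = 0 | a) = 0.75, ν(s | z = 1, a) = 0.6 and ν(s | z = 0, a) = 0.1. Then

$$\nu(z = 1 \mid s, a) = \frac{0.6 \cdot 0.25}{0.6 \cdot 0.25 + 0.1 \cdot 0.75} = \frac{0.15}{0.225} = \tfrac{2}{3},$$

so the plug-in estimate is $Q^\pi(s, a) = 0 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{2}{3} = \tfrac{2}{3}$.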
3.3 Online Policy Evaluation
We now provide an online algorithm for compression-based
policy evaluation. This will produce, for all times t ∈
N, an estimate Q̂πt (s, a) of the m-horizon expected return
Qπ(s, a) as a function of the first t − m action-observation-reward triples.
Constructing our estimate involves modeling the ν-probability terms in Equation 3 using two different coding
distributions, ρS and ρZ respectively; ρS will encode states
conditional on return-action pairs, and ρZ will encode returns
conditional on actions. Sample states, actions and returns
can be generated by directly executing the system (M, π). Provided the HMC defined by M and π is (IR+EA+PR), Lemmas 1 and
2 ensure that the empirical distributions formed from a sufficiently large sample of action/state/return triples will be
arbitrarily close to the required conditional ν-probabilities.
Next we describe how the coding distributions are trained.
Given a history ht := s0 , a1 , s1 , r1 . . . , an+m , sn+m , rn+m
with t = n + m, we define the m-lagged return at any time
i ≤ n + 1 by zi := ri + · · · + ri+m−1 . The sequence of the
first n states occurring in ht can be mapped to a subsequence denoted by $s^{z,a}_{0:n-1}$ that is defined by keeping only the states $(s_i : z_{i+1} = z \wedge a_{i+1} = a)_{i=0}^{n-1}$. Similarly, a sequence of m-lagged returns z1:n can be mapped to a subsequence $z^{a}_{1:n}$ formed by keeping only the returns $(z_i : a_i = a)_{i=1}^{n}$ from z1:n. Our value estimate at time t of taking action a in state s can now be defined as

$$\hat{Q}^\pi_t(s, a) := \sum_{z \in \mathcal{Z}} z \, w^{z,a}_t(s), \qquad (4)$$

where

$$w^{z,a}_t(s) := \frac{\rho_S(s \mid s^{z,a}_{0:n-1}) \, \rho_Z(z \mid z^{a}_{1:n})}{\sum_{z' \in \mathcal{Z}} \rho_S(s \mid s^{z',a}_{0:n-1}) \, \rho_Z(z' \mid z^{a}_{1:n})} \qquad (5)$$
approximates the probability of receiving a return of z if action a is selected in state s.
Implementation. The action-value function estimate Q̂πt
can be computed efficiently by maintaining |Z||A| buckets,
each corresponding to a particular return-action pair (z, a).
Each bucket contains an instance of the coding distribution
ρS encoding the state sequence $s^{z,a}_{0:n-1}$. Similarly, |A| buckets containing instances of ρZ are created to encode the various return subsequences. This procedure is summarized in
Algorithm 1.
To obtain a particular state-action value estimate, Equations 4 and 5 can be computed directly by querying the appropriate bucketed coding distributions. Assuming that the time required to compute each conditional probability using ρS and ρZ is constant, the time complexity for computing Q̂t(s, a) is O(|Z|).

Algorithm 1 CNC POLICY EVALUATION
Require: Stationary policy π, environment M
Require: Finite planning horizon m ∈ N
Require: Coding distributions ρS and ρZ
1: for i = 1 to t do
2:   Perform ai ∼ π(· | si−1)
3:   Observe (si, ri) ∼ µ(· | si−1, ai)
4:   if i ≥ m then
5:     Update ρS in bucket (zi−m+1, ai−m+1) with si−m
6:     Update ρZ in bucket ai−m+1 with zi−m+1
7:   end if
8: end for
9: return Q̂πt
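The Python sketch below is our own rendering of Algorithm 1 and the plug-in estimate of Equations 4 and 5, under assumed interfaces: the coding distributions are abstracted as objects exposing hypothetical update(x) and prob(x) methods, make_state_model and make_return_model are zero-argument factories for them, and policy and env_step stand in for sampling from π(· | s) and µ(· | s, a).

```python
# A minimal sketch of Algorithm 1 and Equations 4-5 (our own illustration,
# not the authors' code). rho_S and rho_Z are bucketed "coding distribution"
# objects with hypothetical update(x) and prob(x) methods.
from collections import defaultdict


class CNCEvaluator:
    def __init__(self, make_state_model, make_return_model, returns):
        self.returns = list(returns)                  # finite return space Z
        self.rho_S = defaultdict(make_state_model)    # one bucket per (z, a) pair
        self.rho_Z = defaultdict(make_return_model)   # one bucket per action a

    def update(self, state, action, lagged_return):
        # Lines 5-6 of Algorithm 1: update the bucketed models.
        self.rho_S[(lagged_return, action)].update(state)
        self.rho_Z[action].update(lagged_return)

    def q_value(self, state, action):
        # Equations 4 and 5: plug-in estimate of Q^pi(s, a).
        weights = {z: self.rho_S[(z, action)].prob(state) *
                      self.rho_Z[action].prob(z) for z in self.returns}
        norm = sum(weights.values())
        if norm == 0.0:
            return 0.0
        return sum(z * w / norm for z, w in weights.items())


def cnc_policy_evaluation(s0, policy, env_step, evaluator, m, steps):
    # policy(s) samples a ~ pi(.|s); env_step(s, a) samples (s', r) ~ mu(.|s, a).
    # Both are hypothetical callables standing in for executing (M, pi).
    history, state = [], s0            # history[k] holds (s_k, a_{k+1}, r_{k+1})
    for i in range(1, steps + 1):
        action = policy(state)
        next_state, reward = env_step(state, action)
        history.append((state, action, reward))
        state = next_state
        if i >= m:
            s = history[i - m][0]                           # s_{i-m}
            a = history[i - m][1]                           # a_{i-m+1}
            z = sum(r for _, _, r in history[i - m:i])      # z_{i-m+1}
            evaluator.update(s, a, z)
    return evaluator
```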
3.4 Analysis
We now show that the state-action estimates defined by
Equation 4 are consistent provided that consistent density
estimators are used for both ρS and ρZ . Also, we will say fn
converges stochastically to 0 with rate $n^{-1/2}$ if and only if

$$\exists c > 0, \forall \delta \in [0, 1] : \; P\left[\, |f_n(\omega)| \leq \sqrt{\tfrac{c}{n} \ln \tfrac{2}{\delta}}\, \right] \geq 1 - \delta,$$

and will denote this by writing $f_n(\omega) \in O_P(n^{-1/2})$.
Theorem 1. Given an m-horizon, finite state space, finite action space, time homogenous MDP M := (S, A, µ) and a stationary policy π that gives rise to an (IR+EA+PR) HMC, for all ε > 0 we have, for any state s ∈ S and action a ∈ A, that

$$\lim_{n \to \infty} P\left[\, |\hat{Q}^\pi_n(s, a) - Q^\pi(s, a)| \geq \varepsilon \,\right] = 0,$$

provided ρS and ρZ are consistent estimators of ν(s|z, a) and ν(z|a) respectively. Furthermore, if $|\rho_S(s|z, a) - \nu(s|z, a)| \in O_P(n^{-1/2})$ and $|\rho_Z(z|a) - \nu(z|a)| \in O_P(n^{-1/2})$, then $|\hat{Q}^\pi_n(s, a) - Q^\pi(s, a)| \in O_P(n^{-1/2})$.
Next we state consistency results for two types of estimators often used in model-based reinforcement learning.
Theorem 2. The frequency estimator $\rho(x_n \mid x_{<n}) := \frac{1}{n-1} \sum_{i=1}^{n-1} [\![ x_n = x_i ]\!]$, when used as either ρS or ρZ, is a consistent estimator of ν(s|z, a) or ν(z|a) respectively, for any s ∈ S, z ∈ Z, and a ∈ A; furthermore, the absolute estimation error converges stochastically to 0 with rate $n^{-1/2}$.
Note that the above result is essentially tabular, in the sense that each state is treated atomically. The next result applies to a factored application of multi-alphabet Context Tree Weighting (CTW) (Tjalkens, Shtarkov, and Willems 1993; Willems, Shtarkov, and Tjalkens 1995; Veness et al. 2011), which can handle considerably larger state spaces in practice. In the following, we use the notation sn,i to refer to the ith factor of state sn.

Theorem 3. Given a state space that is factored in the sense that S := B1 × · · · × Bk, the estimator $\rho(s_n \mid s_{<n}) := \prod_{i=1}^{k} \mathrm{CTW}(s_{n,i} \mid s_{n,<i}, s_{<n,1:i})$, when used as ρS, is a consistent estimator of ν(s|z, a) for any s ∈ S, z ∈ Z, and a ∈ A; furthermore, the absolute estimation error converges stochastically to 0 at a rate of $n^{-1/2}$.
4 Experimental Results
In this section we describe two sets of experiments. The first
set is an experimental validation of our theoretical results
using a standard policy evaluation benchmark. The second
combines CNC with a variety of density estimators and studies the resulting behavior in a large on-policy control task.
4.1 Policy Evaluation
Our first experiment involves a simplified version of the
game of Blackjack (Sutton and Barto 1998, Section 5.1). In
Blackjack, the agent requests cards from the dealer. A game
is won when the agent’s card total exceeds the dealer’s own
total. We used CNC to estimate the value of the policy that
stays if the player’s sum is 20 or 21, and hits in all other
cases. A state is represented by the single card held by the
dealer, the player’s card total so far, and whether the player
holds a usable ace. In total, there are 200 states, two possible actions (hit or stay), and three possible returns (-1, 0 and
1). A Dirichlet-Multinomial model with hyper-parameters
αi = 1/2 was used for both ρS and ρZ.
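A model of this kind might be written as follows; this is a sketch of ours, using the hypothetical update/prob interface from the CNC sketch in Section 3.3, with the symbol counts (200 states, returns in {-1, 0, 1}) taken from the setup above.

```python
# Sketch (ours) of a Dirichlet-Multinomial predictor with alpha_i = 1/2,
# written against the hypothetical update/prob interface used earlier.
from collections import defaultdict


class DirichletMultinomial:
    def __init__(self, num_symbols, alpha=0.5):
        self.num_symbols = num_symbols
        self.alpha = alpha
        self.counts = defaultdict(float)
        self.total = 0.0

    def prob(self, x):
        # Posterior predictive probability of symbol x.
        return ((self.counts[x] + self.alpha) /
                (self.total + self.alpha * self.num_symbols))

    def update(self, x):
        self.counts[x] += 1.0
        self.total += 1.0


# For the Blackjack setup above one might instantiate, e.g.:
#   evaluator = CNCEvaluator(lambda: DirichletMultinomial(200),
#                            lambda: DirichletMultinomial(3),
#                            returns=[-1, 0, 1])
```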
Figure 2 depicts the estimated MSE and average maximum squared error of Q̂π over 100,000 episodes; the mean
and maximum are taken over all possible state-action pairs
and averaged over 10,000 trials. We also compared CNC to
a first-visit Monte Carlo value estimate (Szepesvári 2010).
The CNC estimate closely tracks the Monte Carlo estimate,
even performing slightly better early on due to the smoothing introduced by the use of a Dirichlet prior. As predicted
by the analysis in Section 3.4, the MSE decays toward zero.

[Figure 2: Mean and maximum squared errors of the Monte Carlo and Dirichlet CNC estimates on the game of Blackjack, plotted against episodes (1000's).]
4.2 On-policy Control
Our next set of experiments explored the on-policy control
behavior of CNC under an ε-greedy policy. The purpose of
these experiments is to demonstrate the potential of CNC to
scale to large control tasks when combined with a variety of
different density estimators. Note that Theorem 1 does not
apply here: using CNC in this way violates the assumption
that π is stationary.
Evaluation Platform. We evaluated CNC using ALE, the
Arcade Learning Environment (Bellemare et al. 2013), a
reinforcement learning interface to the Atari 2600 video
game platform. Observations in ALE consist of frames of
160 × 210 7-bit color pixels generated by the Stella Atari
2600 emulator. Although the emulator generates frames at
60Hz, in our experiments we consider time steps that last 4
consecutive frames, following the existing literature (Bellemare, Veness, and Talvitie 2014; Mnih et al. 2013). We first
focused on the game of PONG, which has an action space of {UP, DOWN, NOOP} and provides a reward of 1 or -1
whenever a point is scored by either the agent or its computer opponent. Episodes end when either player has scored
21 points; as a result, possible scores for one episode range from -21 to 21, with a positive score corresponding to a win for the agent.
Experimental Setup. We studied four different CNC
agents, with each agent corresponding to a different choice
of model for ρS; the Sparse Adaptive Dirichlet (SAD) estimator (Hutter 2013) was used for ρZ for all agents. Each agent used an ε-greedy policy (Sutton and Barto 1998) with
respect to its current value function estimates. The exploration rate was initialized to 1.0, then decayed linearly to
0.02 over the course of 200,000 time steps. The horizon was
set to m = 80 steps, corresponding to roughly 5 seconds of
play. The agents were evaluated over 10 trials, each lasting
2 million steps.
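The exploration schedule just described is a simple linear interpolation; a sketch of it (ours, with the constants taken from the setup above) is:

```python
# Sketch (ours) of the linear exploration-rate schedule described above.
def exploration_rate(step, start=1.0, end=0.02, decay_steps=200_000):
    frac = min(step, decay_steps) / decay_steps
    return start + frac * (end - start)
```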
The first model we consider is a factored application of
the SAD estimator, a count based model designed for large,
sparse alphabets. The model divides the screen into 16 × 16
regions. The probability of a particular image patch occurring within each region is modeled using a region-specific
SAD estimator. The probability assigned to a whole screen is
the product of the probabilities assigned to each patch.
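A simplified sketch of this kind of factored model is given below. It is our own illustration: the patch extraction, patch size and smoothing constants are assumptions, a plain count-based model stands in for the region-specific SAD estimators, and log probabilities are used for numerical stability.

```python
# Simplified sketch (ours) of the factored screen model: one count-based patch
# estimator per region, with the (log) screen probability equal to the sum of
# the per-region (log) patch probabilities.
import math
from collections import defaultdict


class PatchCountModel:
    def __init__(self, smoothing=0.5, alphabet_bound=2 ** 16):
        self.counts = defaultdict(float)
        self.total = 0.0
        self.smoothing = smoothing
        self.alphabet_bound = alphabet_bound   # crude bound on distinct patches

    def log_prob(self, patch):
        return math.log((self.counts[patch] + self.smoothing) /
                        (self.total + self.smoothing * self.alphabet_bound))

    def update(self, patch):
        self.counts[patch] += 1.0
        self.total += 1.0


class FactoredScreenModel:
    def __init__(self, patch_size=16):
        self.patch_size = patch_size
        self.models = defaultdict(PatchCountModel)   # one model per region

    def _patches(self, frame):
        # frame: 2D list (or array) of palette indices; yields (region, patch).
        p = self.patch_size
        for y in range(0, len(frame), p):
            for x in range(0, len(frame[0]), p):
                patch = tuple(tuple(row[x:x + p]) for row in frame[y:y + p])
                yield (y // p, x // p), patch

    def log_prob(self, frame):
        return sum(self.models[r].log_prob(patch)
                   for r, patch in self._patches(frame))

    def update(self, frame):
        for r, patch in self._patches(frame):
            self.models[r].update(patch)
```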
The second model is an auto-regressive application of logistic regression (Bishop 2006), which assigns a probability to
each pixel using a shared set of parameters. The product of
these per-pixel probabilities determines the probability of a
screen under this model. The features for each pixel prediction correspond to the pixel’s local context, similar to standard context-based image compression techniques (Witten,
Moffat, and Bell 1999). The model’s parameters were updated online using A DAGRAD (Duchi, Hazan, and Singer
2011). The hyperparameters (including learning rate, choice
of context, etc.) were optimized via the random sampling
technique of Bergstra and Bengio (2012).
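A rough sketch of such a model appears below. It is ours alone: it is simplified to binary pixels and a small hand-picked causal context, and is only meant to illustrate the shared-parameter auto-regressive structure and the AdaGrad update, not the configuration found by the hyperparameter search.

```python
# Rough sketch (ours) of an auto-regressive logistic pixel model with shared
# parameters and AdaGrad updates, simplified to binary pixels.
import numpy as np


class LogisticPixelModel:
    def __init__(self, lr=0.1, eps=1e-8):
        # Causal context: pixels above and to the left of the one being predicted.
        self.offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                        (-2, 0), (0, -2), (-1, -2), (-2, -1)]
        self.w = np.zeros(len(self.offsets) + 1)     # +1 for a bias feature
        self.g2 = np.zeros_like(self.w)              # AdaGrad accumulators
        self.lr, self.eps = lr, eps

    def _features(self, frame, y, x):
        H, W = frame.shape
        ctx = [float(frame[y + dy, x + dx])
               if 0 <= y + dy < H and 0 <= x + dx < W else 0.0
               for dy, dx in self.offsets]
        return np.array(ctx + [1.0])

    def log_prob(self, frame, learn=True):
        # frame: 2D numpy array of 0/1 pixels. Chain rule over pixels in
        # row-major order; optionally performs an online AdaGrad update.
        logp = 0.0
        for y in range(frame.shape[0]):
            for x in range(frame.shape[1]):
                phi = self._features(frame, y, x)
                p = 1.0 / (1.0 + np.exp(-self.w @ phi))
                p = np.clip(p, 1e-6, 1.0 - 1e-6)
                pixel = float(frame[y, x])
                logp += np.log(p if pixel else 1.0 - p)
                if learn:
                    grad = (p - pixel) * phi         # gradient of the log loss
                    self.g2 += grad * grad
                    self.w -= self.lr * grad / (np.sqrt(self.g2) + self.eps)
        return logp
```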
The third model uses the LEMPEL-ZIV algorithm (Ziv
and Lempel 1977), a dictionary-based compression technique. It works by adapting its internal data structures over
time to assign shorter code lengths to more frequently seen
substrings of data. For our application, the pixels in each
frame were encoded in row-major order, by first searching
for the longest sequence in the history matching the new data
to be compressed, and then encoding a triple that describes
the temporal location of the longest match, its length, as well
as the next unmatched symbol. This process repeats until no
data is left. Recalling Section 2.3, the (implicit) conditional
probability of a state s under the LEMPEL-ZIV model can
now be obtained by computing

$$\rho_S(s \mid s^{z,a}_{0:n-1}) := 2^{-[\ell_{LZ}(s^{z,a}_{0:n-1} s) \,-\, \ell_{LZ}(s^{z,a}_{0:n-1})]}.$$

[Figure 3: Left: average reward per 100 steps in PONG, plotted against steps (1000's), for the Lempel-Ziv, logistic regression and factored SAD agents. Right: average score per episode in PONG. Error bars indicate one inter-trial standard error.]

[Figure 4: Average score (scaled) over the last 500 episodes for three Atari 2600 games (Freeway, Pong, Q*bert); DQN and BASS results from the literature are shown for comparison. Error bars indicate one inter-trial standard error.]
Results. As depicted in Figure 3 (left), all three models
improved their policies over time. By the end of training,
two of these models had learnt control policies achieving
win rates of approximately 50% in PONG. Over their last 50 episodes of training, the LEMPEL-ZIV agents averaged -0.09 points per episode (std. error: 1.79) and the factored SAD agents, 3.29 (std. error: 2.49). While the logistic regression agents were less successful (average -17.87, std. error: 0.38), we suspect that further training time would significantly improve their performance. Furthermore, all agents
ran at real-time or better. These results highlight how CNC
can be successfully combined with fundamentally different
approaches to density estimation.
We performed one more experiment to illustrate the effects of combining CNC with a more sophisticated density
model. We used SKIPCTS, a recent Context Tree Weighting derivative, with a context function tailored to the ALE observation space (Bellemare, Veness, and Talvitie 2014). As shown in Figure 3 (right), CNC combined with SKIPCTS learns a near-optimal policy in PONG. We also compared
our method to existing results from the literature (Bellemare
et al. 2013; Mnih et al. 2013), although note that the DQN
scores, which correspond to a different training regime and
do not include Freeway, are included only for illustrative
purposes. As shown in Figure 4, CNC can also learn competitive control policies on FREEWAY and Q*BERT.
Interestingly, we found SKIPCTS to be insufficiently accurate for effective MCTS planning when used as a forward model, even with enhancements such as double progressive widening (Couëtoux et al. 2011). In particular, our best simulation-based agent did not achieve a score above −14 in PONG, and performed no better than random in Q*BERT and FREEWAY. In comparison, our CNC variants performed
significantly better using orders of magnitude less computation. While it would be premature to draw any general conclusions, the CNC approach does appear to be more forgiving
of modeling inaccuracies.
5 Discussion and Limitations
The main strength and key limitation of the CNC approach
seems to be its reliance on an appropriate choice of density estimator. One could only expect the method to perform well if the learnt models can capture the observational
structure specific to high and low return states. Specifying a model can thus be viewed as committing to a particular kind
of compression-based similarity metric over the state space.
The attractive part of this approach is that density modeling is a well-studied area, which opens up the possibility
of bringing in many ideas from machine learning, statistics
and information theory to address fundamental questions in
reinforcement learning. The downside of course is that density modeling is itself a difficult problem. Further investigation is required to better understand the circumstances under
which one would prefer CNC over more traditional modelfree approaches that rely on function approximation to scale
to large and complex problems.
So far we have only applied CNC to undiscounted, finite
horizon problems with finite action spaces, and more importantly, finite (and rather small) return spaces. This setting is favorable for CNC, since the per-step running time
depends on |Z| ≤ m|rmax − rmin |; in other words, the
worst case running time scales no worse than linearly in the
length of the horizon. However, even modest changes to the
above setting can change the situation drastically. For example, using discounted return can introduce an exponential
dependence on the horizon. Thus an important topic for future work is to further develop the CNC approach for large
or continuous return spaces. Since the return space is only
one dimensional, it would be natural to consider various discretizations of the return space. For example, one could consider a tree-based discretization that recursively subdivides the return space into successively smaller halves. A binary tree of depth d would produce $2^d$ intervals of equal size with an accuracy of $\varepsilon = m(r_{\max} - r_{\min})/2^d$. This implies that to achieve an accuracy of at least $\varepsilon$ we would need to set $d \geq \log_2(m(r_{\max} - r_{\min})/\varepsilon)$, which should be feasible
for many applications. Furthermore, one could attempt to
adaptively learn the best discretization (Hutter 2005a) or approximate Equation 4 using Monte Carlo sampling. These
enhancements seem necessary before we could consider applying CNC to the complete suite of ALE games.
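As a concrete illustration of the discretization bound above (with reward limits chosen purely for the sake of example), take m = 80 and rewards in [-1, 1]: achieving an accuracy of ε = 0.1 would require

$$d \geq \log_2\!\left(\frac{80 \cdot 2}{0.1}\right) = \log_2 1600 \approx 10.6,$$

that is, a tree of depth d = 11 and $2^{11} = 2048$ return intervals.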
6 Closing Remarks
This paper has introduced CNC, an information-theoretic
policy evaluation and on-policy control technique for reinforcement learning. The most interesting aspect of this approach is the way in which it uses a learnt probabilistic
model that conditions on the future return; remarkably, this
counterintuitive idea can be justified both in theory and in
practice.
While our initial results show promise, a number of open
questions clearly remain. For example, so far the CNC value
estimates were constructed by using only the Monte Carlo
return as the learning signal. However, one of the central
themes in Reinforcement Learning is bootstrapping, the idea
of constructing value estimates on the basis of other value
estimates (Sutton and Barto 1998). A natural question to explore is whether bootstrapping can be incorporated into the
learning signal used by CNC.
For the case of on-policy control, it would also be interesting to investigate the use of compression techniques
or density estimators that can automatically adapt to nonstationary data. A promising line of investigation might be
to consider the class of meta-algorithms given by György,
Linder, and Lugosi (2012), which can convert any stationary
coding distribution into its piece-wise stationary extension;
efficient algorithms from this class have shown promise for
data compression applications, and come with strong theoretical guarantees (Veness et al. 2013). Furthermore, extending the analysis in Section 3.4 to cover the case of on-policy
control or policy iteration (Howard 1960) would be highly
desirable.
Finally, we remark that information-theoretic perspectives
on reinforcement learning have existed for some time; in
particular, Hutter (2005b) described a unification of algorithmic information theory and reinforcement learning, leading to the AIXI optimality notion for reinforcement learning
agents. Establishing whether any formal connection exists
between this body of work and ours is deferred to the future.
Acknowledgments. We thank Kee Siong Ng, Andras
György, Shane Legg, Laurent Orseau and the anonymous
reviewers for their helpful feedback on earlier revisions.
References
Asmuth, J., and Littman, M. L. 2011. Learning is planning:
near Bayes-optimal reinforcement learning via Monte-Carlo
tree search. In Uncertainty in Artificial Intelligence (UAI),
19–26.
Attias, H. 2003. Planning by Probabilistic Inference. In
Proceedings of the 9th International Workshop on Artificial
Intelligence and Statistics.
Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M.
2013. The Arcade Learning Environment: An Evaluation
Platform for General Agents. Journal of Artificial Intelligence Research (JAIR) 47:253–279.
Bellemare, M. G.; Veness, J.; and Talvitie, E. 2014. Skip
Context Tree Switching. In Proceedings of the Thirty-First
International Conference on Machine Learning (ICML).
Bergstra, J., and Bengio, Y. 2012. Random search for hyperparameter optimization. Journal of Machine Learning Research (JMLR) 13:281–305.
Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic
Programming. Athena Scientific, 1st edition.
Bishop, C. M. 2006. Pattern Recognition and Machine
Learning (Information Science and Statistics). Secaucus,
NJ, USA: Springer-Verlag New York, Inc.
Botvinick, M., and Toussaint, M. 2012. Planning as inference. In Trends in Cognitive Sciences 10, 485–588.
Bratko, A.; Cormack, G. V.; R, D.; Filipi, B.; Chan, P.; Lynam, T. R.; and Lynam, T. R. 2006. Spam filtering using
statistical data compression models. Journal of Machine
Learning Research (JMLR) 7:2673–2698.
Brémaud, P. 1999. Markov chains: Gibbs fields, Monte
Carlo simulation and queues. Texts in applied mathematics.
New York, Berlin, Heidelberg: Springer.
Cilibrasi, R., and Vitányi, P. M. B. 2005. Clustering by
compression. IEEE Transactions on Information Theory
51:1523–1545.
Couëtoux, A.; Hoock, J.-B.; Sokolovska, N.; Teytaud, O.;
and Bonnard, N. 2011. Continuous upper confidence
trees. In Proceedings of the 5th International Conference on
Learning and Intelligent Optimization, LION’05, 433–445.
Springer-Verlag.
Cover, T. M., and Thomas, J. A. 1991. Elements of information theory. New York, NY, USA: Wiley-Interscience.
Doshi-Velez, F. 2009. The Infinite Partially Observable
Markov Decision Process. In Advances in Neural Information Processing Systems (NIPS) 22.
Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research (JMLR)
12:2121–2159.
Frank, E.; Chui, C.; and Witten, I. H. 2000. Text categorization using compression models. In Proceedings of Data
Compression Conference (DCC), 200–209. IEEE Computer
Society Press.
Guez, A.; Silver, D.; and Dayan, P. 2012. Efficient
Bayes-Adaptive Reinforcement Learning using Samplebased Search. In Advances in Neural Information Processing Systems (NIPS) 25.
György, A.; Linder, T.; and Lugosi, G. 2012. Efficient
Tracking of Large Classes of Experts. IEEE Transactions
on Information Theory 58(11):6709–6725.
Hamilton, W. L.; Fard, M. M.; and Pineau, J. 2013. Modelling Sparse Dynamical Systems with Compressed Predictive State Representations. In ICML, volume 28 of JMLR
Proceedings, 178–186.
Howard, R. A. 1960. Dynamic Programming and Markov
Processes. MIT Press.
Hutter, M. 2005a. Fast non-parametric Bayesian inference
on infinite trees. In Proceedings of 10th International Conference on Artificial Intelligence and Statistics (AISTATS),
144–151.
Hutter, M. 2005b. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer.
Hutter, M. 2013. Sparse adaptive Dirichlet-multinomial-like processes. In Conference on Computational Learning
Theory (COLT), 432–459.
Li, M., and Vitányi, P. 2008. An Introduction to Kolmogorov
Complexity and Its Applications. Springer, third edition.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.;
Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602.
Poupart, P.; Lang, T.; and Toussaint, M. 2011. Escaping Local Optima in POMDP Planning as Inference. In The 10th
International Conference on Autonomous Agents and Multiagent Systems - Volume 3, AAMAS ’11, 1263–1264.
Powell, W. B. 2011. Approximate Dynamic Programming:
Solving the Curses of Dimensionality. Wiley-Interscience,
2nd edition.
Solway, A., and Botvinick, M. 2012. Goal-directed decision
making as probabilistic inference: A computational framework and potential neural correlates. Psychological Review 119:120–154.
Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Szepesvári, C. 2010. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.
Talvitie, E. 2014. Model Regularization for Stable Sample
Rollouts. In Uncertainty in Artificial Intelligence (UAI).
Tjalkens, T. J.; Shtarkov, Y. M.; and Willems, F. M. J. 1993.
Context tree weighting: Multi-alphabet sources. In Proceedings of the 14th Symposium on Information Theory Benelux.
Tziortziotis, N.; Dimitrakakis, C.; and Blekas, K. 2014.
Cover Tree Bayesian Reinforcement Learning. Journal of
Machine Learning Research (JMLR) 15:2313–2335.
Veness, J.; Ng, K. S.; Hutter, M.; and Silver, D. 2010. Reinforcement Learning via AIXI Approximation. In Proceedings of the Conference for the Association for the Advancement of Artificial Intelligence (AAAI).
Veness, J.; Ng, K. S.; Hutter, M.; Uther, W.; and Silver, D.
2011. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research (JAIR) 40:95–142.
Veness, J.; White, M.; Bowling, M.; and Gyorgy, A. 2013.
Partition Tree Weighting. In Proceedings of Data Compression Conference (DCC), 321–330.
Walsh, T. J.; Goschin, S.; and Littman, M. L. 2010. Integrating Sample-Based Planning and Model-Based Reinforcement Learning. In Proceedings of the Conference for the
Association for the Advancement of Artificial Intelligence
(AAAI).
Wang, T.; Bowling, M.; Schuurmans, D.; and Lizotte, D. J.
2008. Stable dual dynamic programming. In Advances in
Neural Information Processing Systems (NIPS) 20, 1569–
1576.
Wang, T.; Bowling, M.; and Schuurmans, D. 2007. Dual representations for dynamic programming and reinforcement
learning. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning,
44–51.
Willems, F. M.; Shtarkov, Y. M.; and Tjalkens, T. J. 1995.
The Context Tree Weighting Method: Basic Properties.
IEEE Transactions on Information Theory 41:653–664.
Witten, I. H.; Moffat, A.; and Bell, T. C. 1999. Managing gigabytes: compressing and indexing documents and images.
Morgan Kaufmann.
Witten, I. H.; Neal, R. M.; and Cleary, J. G. 1987. Arithmetic
coding for data compression. Communications of the ACM 30:520–540.
Ziv, J., and Lempel, A. 1977. A universal algorithm for
sequential data compression. IEEE Transactions on Information Theory 23(3):337–343.