Supplemental material for AAAI-16 paper:
Efficient PAC-optimal Exploration in Concurrent, Continuous State MDPs with Delayed Updates
Jason Pazis and Ronald Parr
To prove lemmas 8.1, 8.2, 8.3, and 8.4, and theorem 8.5, we first need to
introduce some supporting theory.
12  Bellman error MDPs
This section introduces the concept of Bellman error MDPs, a new analysis tool
for RL algorithms. Existing work on bounds based on the Bellman error has focused on one-step max-norm errors (Williams and Baird, 1993). Unfortunately,
for the same reason that the maximum expected discounted accumulated reward of an MDP can be much smaller than $\frac{R_{\max}}{1-\gamma}$, bounds based on the max-norm of the Bellman error can be loose. Bellman error MDPs are much more flexible
and intuitive, allowing us to prove stronger bounds with less effort. While this
work focuses on PAC optimal exploration, Bellman error MDPs can be used to
prove bounds for other online or offline RL algorithms.
Definitions
Let $M$ be an MDP $(S, A, P, R, \gamma)$ with Bellman operator $B^\pi$ for policy $\pi$, and $Q$ an approximate value function for $M$. Additionally, $\forall s, a, \pi$ let $0 \le Q^\pi(s,a) \le Q_{\max}$. Let the Bellman error MDP $M_{(\pi,Q)}$ be an MDP which differs from $M$ only in its reward function, which is defined as $R_{(\pi,Q)}(s,a) = Q(s,a) - B^\pi Q(s,a)$ (notice that $Q(s,a)$ and $B^\pi$ are the approximate value function and Bellman operator of the original MDP). We will use $B^\pi_{(\pi,Q)}$ to denote $M_{(\pi,Q)}$'s Bellman operator under policy $\pi$, and $Q^\pi_{(\pi,Q)}$ to denote its value function.
Bellman error MDP theory
Theorem 12.1 is the main theorem of this section. Most results on Bellman
error MDPs follow directly from this theorem and basic properties of MDPs.
Theorem 12.1. The return of $\pi$ over $M$ is equal to $Q$ minus the return of $\pi$ over $M_{(\pi,Q)}$¹:
$$Q^\pi(s,a) = Q(s,a) - Q^\pi_{(\pi,Q)}(s,a) \quad \forall (s,a) \in (S,A).$$
Proof. We will use induction to prove our claim: If $0$ denotes a value function that is zero everywhere, all we need to prove is that $(B^\pi)^i Q(s,a) = Q(s,a) - (B^\pi_{(\pi,Q)})^i 0$ and take the limit as $i \to \infty$.
The base case $B^\pi Q(s,a) = Q(s,a) - B^\pi_{(\pi,Q)} 0$ is given by hypothesis. Assuming that $(B^\pi)^i Q(s,a) = Q(s,a) - (B^\pi_{(\pi,Q)})^i 0$ holds for $i$, we will prove that it also holds for $i+1$:
$$
\begin{aligned}
(B^\pi)^{i+1} Q(s,a) &= B^\pi (B^\pi)^i Q(s,a)\\
&= \int_{s'} P(s'|s,a)\left( R(s,a,s') + \gamma (B^\pi)^i Q(s',\pi(s')) \right)\\
&= \int_{s'} P(s'|s,a)\left( R(s,a,s') + \gamma\left( Q(s',\pi(s')) - (B^\pi_{(\pi,Q)})^i 0 \right)\right)\\
&= \int_{s'} P(s'|s,a)\left( R(s,a,s') + \gamma Q(s',\pi(s')) \right) - \int_{s'} P(s'|s,a)\left( \gamma (B^\pi_{(\pi,Q)})^i 0 \right)\\
&= B^\pi Q(s,a) - \int_{s'} P(s'|s,a)\left( \gamma (B^\pi_{(\pi,Q)})^i 0 \right)\\
&= Q(s,a) - R_{(\pi,Q)}(s,a) - \int_{s'} P(s'|s,a)\left( \gamma (B^\pi_{(\pi,Q)})^i 0 \right)\\
&= Q(s,a) - (B^\pi_{(\pi,Q)})^{i+1} 0.
\end{aligned}
$$
If we now take the limit as $i \to \infty$ we have the original claim:
$$\lim_{i\to\infty} (B^\pi)^i Q(s,a) = Q(s,a) - \lim_{i\to\infty}(B^\pi_{(\pi,Q)})^i 0 \;\Rightarrow\; Q^\pi(s,a) = Q(s,a) - Q^\pi_{(\pi,Q)}(s,a).$$
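The identity in Theorem 12.1 is easy to check numerically. The sketch below is a toy illustration only (a random finite MDP, an arbitrary policy, and an arbitrary approximate value function, none of which come from the paper's algorithm): it solves both $M$ and $M_{(\pi,Q)}$ exactly and confirms that $Q^\pi = Q - Q^\pi_{(\pi,Q)}$.

```python
# Numerical check of Theorem 12.1 on a small random finite MDP (toy illustration).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # transition kernel P(s'|s,a)
R = rng.random((nS, nA))                                         # reward R(s,a)
pi = rng.integers(0, nA, size=nS)                                # an arbitrary (not greedy) policy
Q = rng.random((nS, nA))                                         # an arbitrary approximate value function

def q_pi(P, R, pi, gamma):
    """Exact Q^pi of the MDP (P, R, gamma): solve (I - gamma * P_pi) V = R_pi, then back up once."""
    P_pi = P[np.arange(nS), pi]                                  # P(s'|s, pi(s))
    R_pi = R[np.arange(nS), pi]
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
    return R + gamma * P @ V

BQ = R + gamma * P @ Q[np.arange(nS), pi]                        # B^pi Q
R_be = Q - BQ                                                    # reward of the Bellman error MDP M_(pi,Q)
assert np.allclose(q_pi(P, R, pi, gamma), Q - q_pi(P, R_be, pi, gamma))
```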
Corollary 12.2 bounds the range of the Bellman error MDP value function, a property that will prove very useful in the analysis of our algorithm. It follows from Theorem 12.1 and the fact that $0 \le Q^\pi(s,a) \le Q_{\max}$.
Corollary 12.2. Let $0 \le Q(s,a) \le Q_{\max}$ $\forall (s,a)$. Then:
$$-Q_{\max} \le Q^\pi_{(\pi,Q)}(s,a) \le Q_{\max} \quad \forall (s,a,\pi).$$

¹ Note that this is true for any policy $\pi$, not just the greedy policy over $Q$.
Lemma 12.3 below (a consequence of Theorem 12.1) proves that the suboptimality of the greedy policy over $Q$ is bounded above by the difference, in their respective Bellman error MDPs, between the value of the greedy policy and the value of the optimal policy.
Lemma 12.3. $\forall (s, a, Q)$:
$$V^*(s) - V^{\pi^Q}(s) \le Q^{\pi^Q}_{(\pi^Q,Q)}(s, \pi^Q(s)) - Q^{\pi^*}_{(\pi^*,Q)}(s, \pi^*(s)).$$
Proof.
$$
\begin{aligned}
Q(s, \pi^*(s)) &\le Q(s, \pi^Q(s)) \Rightarrow\\
Q^{\pi^*}(s, \pi^*(s)) + Q^{\pi^*}_{(\pi^*,Q)}(s, \pi^*(s)) &\le Q^{\pi^Q}(s, \pi^Q(s)) + Q^{\pi^Q}_{(\pi^Q,Q)}(s, \pi^Q(s)) \Rightarrow\\
Q^{\pi^*}(s, \pi^*(s)) - Q^{\pi^Q}(s, \pi^Q(s)) &\le Q^{\pi^Q}_{(\pi^Q,Q)}(s, \pi^Q(s)) - Q^{\pi^*}_{(\pi^*,Q)}(s, \pi^*(s)) \Rightarrow\\
V^*(s) - V^{\pi^Q}(s) &\le Q^{\pi^Q}_{(\pi^Q,Q)}(s, \pi^Q(s)) - Q^{\pi^*}_{(\pi^*,Q)}(s, \pi^*(s)).
\end{aligned}
$$
Lemma 12.4 proves an upper bound on the expected value of the greedy
policy in its Bellman error MDP when we do not have a uniform bound on the
Bellman error over all state-actions.
Lemma 12.4. Let $X_1, \ldots, X_i, \ldots, X_n$ be sets of state-actions where $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_i$ $\forall (s,a) \in X_i$, $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_{\pi^Q}$ $\forall (s,a) \notin \cup_{i=1}^n X_i$, and $\epsilon_{\pi^Q} \le \epsilon_i$ $\forall i$. Let $p^e_i(s,a,t)$ for $t \in [0, T-1]$ be Bernoulli random variables expressing the probability of taking a state-action $(s_t, a_t)$ at step $t$, for which $(s_t, a_t) \in X_i$, when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter. Then:
$$Q^{\pi^Q}_{(\pi^Q,Q)}(s,a) \le \frac{\epsilon_{\pi^Q}}{1-\gamma} + \epsilon_e,$$
where $\epsilon_e = \sum_{i=1}^n \left( \sum_{t=0}^{T-1} \left(\gamma^t p^e_i(s,a,t)\right)\right)\left(\epsilon_i - \epsilon_{\pi^Q}\right) + \gamma^T Q_{\max}$.
Proof. The expected reward in the Bellman error MDP for step $t \in [0, T-1]$ is bounded above by $\epsilon_{\pi^Q} + \sum_{i=1}^n \left(p^e_i(s,a,t)(\epsilon_i - \epsilon_{\pi^Q})\right)$. From Corollary 12.2 the value of any state-action that may be encountered at step $T$ in the Bellman error MDP is bounded above by $Q_{\max}$. From these two facts we have that:
$$
\begin{aligned}
Q^{\pi^Q}_{(\pi^Q,Q)}(s,a) &\le \sum_{t=0}^{T-1} \gamma^t \left( \epsilon_{\pi^Q} + \sum_{i=1}^n \left(p^e_i(s,a,t)(\epsilon_i - \epsilon_{\pi^Q})\right)\right) + \gamma^T Q_{\max}\\
&\le \frac{\epsilon_{\pi^Q}}{1-\gamma} + \sum_{i=1}^n \left( \sum_{t=0}^{T-1} \left(\gamma^t p^e_i(s,a,t)\right)\right)\left(\epsilon_i - \epsilon_{\pi^Q}\right) + \gamma^T Q_{\max}.
\end{aligned}
$$
Lemmas 12.5 and 12.6 prove bounds on the difference in expected value between the optimal policy and the greedy policy over $Q$ when we do not have a uniform bound on the Bellman error over all state-actions. They avoid the square-peg-in-round-hole problem encountered when one tries to analyze exploration algorithms using max-norm bounds.
Lemma 12.5. Let $Q(s,a) - B^{\pi^*} Q(s,a) \ge -\epsilon_*$ $\forall (s,a)$, $X_1, \ldots, X_i, \ldots, X_n$ be sets of state-actions where $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_i$ $\forall (s,a) \in X_i$, $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_{\pi^Q}$ $\forall (s,a) \notin \cup_{i=1}^n X_i$, and $\epsilon_{\pi^Q} \le \epsilon_i$ $\forall i$. Let $p_i(s,a,t)$ for $t \in [0, T-1]$ be Bernoulli random variables expressing the probability of taking a state-action $(s_t, a_t)$ at step $t$, for which $(s_t, a_t) \in X_i$, when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter. Then:
$$V^*(s) - V^{\pi^Q}(s) \le \frac{\epsilon_* + \epsilon_{\pi^Q}}{1-\gamma} + \epsilon_e,$$
where $\epsilon_e = \sum_{i=1}^n \left(\sum_{t=0}^{T-1}\left(\gamma^t p_i(s,a,t)\right)\right)\left(\epsilon_i - \epsilon_{\pi^Q}\right) + \gamma^T Q_{\max}$.
Proof. Since $Q(s,a) - B^{\pi^*} Q(s,a) \ge -\epsilon_*$ $\forall (s,a)$, we have that $Q^{\pi^*}_{(\pi^*,Q)}(s,a) \ge -\frac{\epsilon_*}{1-\gamma}$. Substituting the above along with Lemma 12.4 into Lemma 12.3 concludes our proof.
Lemma 12.6. Let $Q(s,a) - B^{\pi^*} Q(s,a) \ge -\epsilon_*$ $\forall (s,a)$, $X_1, \ldots, X_i, \ldots, X_n$ be sets of state-actions where $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_i$ $\forall (s,a) \in X_i$, $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_{\pi^Q}$ $\forall (s,a) \notin \cup_{i=1}^n X_i$, and $\epsilon_{\pi^Q} \le \epsilon_i$ $\forall i$. Let $T_H = \left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil$ and define $H = \{1, 2, 4, \ldots, 2^i\}$ where $i$ is the largest integer such that $2^i \le T_H$. Define $p_{h,i}(s,a)$ for $h \in [0, T_H-1]$ to be Bernoulli random variables expressing the probability of encountering exactly $h$ state-actions for which $(s,a) \in X_i$ when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter for a total of $\min\{T, T_H\}$ steps. Finally let $p^e_{h,i}(s,a) = \sum_{m=h}^{2h-1} p_{m,i}(s,a)$. Then:
$$V^*(s) - V^{\pi^Q}(s) \le \frac{\epsilon_* + \epsilon_{\pi^Q}}{1-\gamma} + \epsilon_s + \epsilon_e,$$
where $\epsilon_e = 2\sum_{i=1}^n \left(\sum_{h\in H}\left(h\, p^e_{h,i}(s,a)\right)\left(\epsilon_i - \epsilon_{\pi^Q}\right)\right) + \gamma^T Q_{\max}$.
Proof. Let $p_i(s,a,t)$ for $t \in [0, T-1]$ be Bernoulli random variables expressing the probability of taking a state-action $(s_t, a_t)$ at step $t$, for which $(s_t, a_t) \in X_i$, when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter. We have that $\forall i$:
$$\sum_{t=0}^{T-1}\left(\gamma^t p_i(s,a,t)\right) \le \sum_{t=0}^{T-1} p_i(s,a,t) = \sum_{h=1}^{T}\left(h\, p_{h,i}(s,a)\right) \le 2\sum_{h\in H}\left(h\, p^e_{h,i}(s,a)\right).$$
We also have that:
$$\gamma^{\min\{T, T_H\}} Q_{\max} \le \gamma^{T_H} Q_{\max} + \gamma^T Q_{\max} \le \epsilon_s + \gamma^T Q_{\max},$$
and the result follows from Lemma 12.5.
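The factor of 2 in front of the sum above comes from grouping the counts into the dyadic buckets $[h, 2h-1]$, $h \in H$. A minimal numerical check of this step (the probability vector below is an arbitrary toy example, not derived from any MDP) is:

```python
# Sketch of the dyadic-bucket step in the proof of Lemma 12.6: for counts h in {1, ..., T},
# grouping probabilities into buckets [h, 2h-1] for h in H = {1, 2, 4, ...} over-counts the
# expected count by at most a factor of 2.
import numpy as np

rng = np.random.default_rng(1)
T = 100
p = rng.random(T + 1)              # p[h] = probability of encountering exactly h "bad" state-actions
p[0] = 0.0
p /= p.sum()

H = [1]
while 2 * H[-1] <= T:
    H.append(2 * H[-1])

exact = sum(h * p[h] for h in range(1, T + 1))                    # sum_h h * p_h
bucketed = sum(h * p[h:min(2 * h, T + 1)].sum() for h in H)       # sum_{h in H} h * p^e_h
assert exact <= 2 * bucketed                                       # the factor-2 over-count used above
print(exact, 2 * bucketed)
```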
Using Bellman error MDPs
An important consequence of Lemma 12.3 is that in order to get a bound on
the difference in performance between the optimal policy and the greedy policy
over a value function, we only need a lower bound on the performance of an
optimal policy and an upper bound on the performance of the greedy policy in
their respective Bellman error MDPs.
The algorithm presented in this paper relies on an optimistic approximation
scheme. Because of that, we will be able to get a good bound on the performance of the optimal policy in its Bellman error MDP from the start, while our
bound on the performance of the greedy policy will improve as more samples
are gathered.
13  Useful lemmas
Lemma 13.1. Let $t_i$ for $i = 0 \to l$ be the outcomes of independent (but not necessarily identically distributed) random variables in $\{0,1\}$, with $P(t_i = 1) \ge p_i$. If $\frac{2}{m}\ln\frac{1}{\delta} < 1$ and:
$$\sum_{i=0}^{l} p_i \ge \frac{m}{1 - \sqrt{\frac{2}{m}\ln\frac{1}{\delta}}},$$
then $\sum_{i=0}^{l} t_i \ge m$ with probability at least $1 - \delta$.
Proof. Let $\sum_{i=0}^{l} p_i = \mu$. From Chernoff's bound we have that the probability that $\sum_{i=0}^{l} t_i$ is less than $(1-\epsilon)\mu$ for $0 < \epsilon < 1$ is upper bounded by:
$$\delta = P\left(\sum_{i=0}^{l} t_i \le (1-\epsilon)\mu\right) \le e^{-\frac{\epsilon^2\mu}{2}}. \qquad (1)$$
Setting $(1-\epsilon)\mu = m$ we have that $\mu = \frac{m}{1-\epsilon}$. Substituting into equation 1 above and taking logarithms we have:
$$
\begin{aligned}
\ln\delta &\le -\frac{\epsilon^2}{2}\,\frac{m}{1-\epsilon} \Rightarrow\\
\frac{\epsilon^2}{1-\epsilon} &\le \frac{2}{m}\ln\frac{1}{\delta} \Rightarrow\\
\epsilon^2 &< \frac{2}{m}\ln\frac{1}{\delta} \Rightarrow\\
\epsilon &< \sqrt{\frac{2}{m}\ln\frac{1}{\delta}}.
\end{aligned}
$$
Substituting into $\mu = \frac{m}{1-\epsilon}$ concludes our proof.
Lemma 13.1 is an improvement over Li's lemma 56 (Li, 2009) in two ways: While Li's lemma requires that $\mu > 2m$ for all $\delta < 1$, for realistically large values of $m$ Lemma 13.1 requires a value of $\mu$ much closer to $m$. As an example, for $\delta = 10^{-9}$ and $m = 10^3$, $10^6$ and $10^9$ our bound yields $1.2556m$, $1.0065m$ and $1.0002m$ respectively. More importantly, our bound allows for the success probability of each trial to be different, which will allow us to prove TCE PAC bounds, rather than just bounds on the number of significantly suboptimal steps.
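The multipliers quoted above follow directly from the denominator $1-\sqrt{\frac{2}{m}\ln\frac{1}{\delta}}$; a minimal computation reproducing them (values of $\delta$ and $m$ as in the text) is:

```python
# Reproduces the 1.2556m, 1.0065m and 1.0002m figures quoted for Lemma 13.1.
import math

delta = 1e-9
for m in (1e3, 1e6, 1e9):
    eps = math.sqrt((2.0 / m) * math.log(1.0 / delta))   # epsilon < sqrt((2/m) ln(1/delta))
    print(f"m = {m:.0e}: sum of p_i must be at least {1.0 / (1.0 - eps):.4f} * m")
```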
Lemma 13.2. $\tilde{B}$ is a $\gamma$-contraction in maximum norm.
Proof. Suppose $\|Q_1 - Q_2\|_\infty = \epsilon$. For any $(s,a,M)$ we have:
$$
\begin{aligned}
\tilde{B}Q_1(s,a,M) &= \min\left\{Q_{\max},\; F(Q_1, N(U,s,a,M)) + d(s,a,M,N(U,s,a,M), d_{known})\right\}\\
&\le \min\left\{Q_{\max},\; F(Q_2, N(U,s,a,M)) + \gamma\epsilon + d(s,a,M,N(U,s,a,M), d_{known})\right\}\\
&\le \gamma\epsilon + \min\left\{Q_{\max},\; F(Q_2, N(U,s,a,M)) + d(s,a,M,N(U,s,a,M), d_{known})\right\}\\
&= \gamma\epsilon + \tilde{B}Q_2(s,a,M)\\
\Rightarrow \tilde{B}Q_1(s,a,M) &\le \gamma\epsilon + \tilde{B}Q_2(s,a,M).
\end{aligned}
$$
Similarly we have that $\tilde{B}Q_2(s,a,M) \le \gamma\epsilon + \tilde{B}Q_1(s,a,M)$, which completes our proof.
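The only structural fact the proof needs beyond the $\gamma$-Lipschitz property of $F$ is that clipping at $Q_{\max}$ cannot increase distances, so the $\gamma\epsilon$ slack can be pulled outside the min. A toy numerical check (the averaging backup below is a stand-in for $F$ plus the distance term, not the paper's operator) is:

```python
# Toy check that clipping at Q_max preserves a gamma-contraction (the key step of Lemma 13.2).
import numpy as np

rng = np.random.default_rng(2)
gamma, q_max, n = 0.9, 10.0, 50
W = rng.random((n, n)); W /= W.sum(axis=1, keepdims=True)   # an averaging (gamma-Lipschitz) backup
r = rng.random(n)

def backup(q):
    # Stand-in for F(Q, .) + d(...): a discounted averaging backup, clipped at Q_max.
    return np.minimum(q_max, r + gamma * W @ q)

q1, q2 = q_max * rng.random(n), q_max * rng.random(n)
lhs = np.abs(backup(q1) - backup(q2)).max()
rhs = gamma * np.abs(q1 - q2).max()
assert lhs <= rhs + 1e-12    # ||B~Q1 - B~Q2||_inf <= gamma * ||Q1 - Q2||_inf
print(lhs, rhs)
```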
Versions of lemma 13.3 below have appeared in the literature before. It is included here for completeness:
Lemma 13.3. Let $\hat{B}$ be a $\gamma$-contraction with fixed point $\hat{Q}$, and $Q$ the output of
$$\frac{\ln\frac{\epsilon}{Q_{\max}-Q_{\min}}}{\ln\gamma}$$
iterations² of value iteration using $\hat{B}$. Then if $Q_{\min} \le \hat{Q}(s,a,M) \le Q_{\max}$ and $Q_{\min} \le Q_0(s,a,M) \le Q_{\max}$ $\forall (s,a,M)$, where $Q_0(s,a,M)$ is the initial value for $(s,a,M)$:
$$-\epsilon \le Q(s,a,M) - \hat{B}Q(s,a,M) \le \epsilon \quad \forall (s,a,M).$$
Proof. If $\gamma = 0$, $\hat{B}Q_0(s,a,M) = \hat{Q}(s,a,M)$ $\forall (s,a,M)$. Otherwise, we have that:
$$\|\hat{Q} - Q_0\|_\infty \le Q_{\max} - Q_{\min}.$$

² Note that $\frac{\ln\frac{\epsilon}{Q_{\max}-Q_{\min}}}{\ln\gamma} \le \frac{1}{1-\gamma}\ln\frac{Q_{\max}-Q_{\min}}{\epsilon}$.
Let $Q_n$ be the value function at the $n$-th step of value iteration. Since $\hat{B}$ is a $\gamma$-contraction with fixed point $\hat{Q}$, for $n \ge 1$:
$$
\begin{aligned}
\|\hat{Q} - \hat{B}Q_n\|_\infty &= \|\hat{B}\hat{Q} - \hat{B}Q_n\|_\infty\\
&\le \gamma\|\hat{Q} - Q_{n-1}\|_\infty\\
&= \gamma\|\hat{Q} - \hat{B}Q_{n-2}\|_\infty\\
&\le \gamma^2\|\hat{Q} - Q_{n-2}\|_\infty\\
&\;\;\vdots\\
&\le \gamma^n\|\hat{Q} - Q_0\|_\infty \Rightarrow\\
\|\hat{Q} - \hat{B}Q_n\|_\infty &\le \gamma^n(Q_{\max} - Q_{\min}).
\end{aligned}
$$
We want
$$\gamma^n(Q_{\max} - Q_{\min}) \le \epsilon,$$
which after rearranging and taking the $\gamma$ logarithm (dividing by $\ln\gamma < 0$ reverses the inequality) becomes:
$$n \ge \log_\gamma\frac{\epsilon}{Q_{\max} - Q_{\min}} = \frac{\ln\frac{\epsilon}{Q_{\max}-Q_{\min}}}{\ln\gamma},$$
which the number of iterations in the statement of the lemma satisfies.
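A quick way to see the iteration count of Lemma 13.3 in action is to run value iteration with a generic $\gamma$-contraction and confirm that after $\frac{\ln\frac{\epsilon}{Q_{\max}-Q_{\min}}}{\ln\gamma}$ sweeps the residual $Q - \hat{B}Q$ is within $\pm\epsilon$. The operator below is an arbitrary toy contraction, not the paper's $\tilde{B}$:

```python
# Toy check of Lemma 13.3's iteration count for a generic gamma-contraction.
import math
import numpy as np

rng = np.random.default_rng(3)
gamma, n = 0.9, 40
W = rng.random((n, n)); W /= W.sum(axis=1, keepdims=True)
r = rng.random(n)                                    # rewards in [0, 1]

def B_hat(q):                                        # a gamma-contraction in max-norm
    return r + gamma * W @ q

q_min, q_max, eps = 0.0, 1.0 / (1.0 - gamma), 1e-3
n_iter = math.ceil(math.log(eps / (q_max - q_min)) / math.log(gamma))

q = np.full(n, q_min)                                # any initialization in [Q_min, Q_max]
for _ in range(n_iter):
    q = B_hat(q)
residual = np.abs(q - B_hat(q)).max()
assert residual <= eps                               # the residual is within the tolerance
print(n_iter, residual)
```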
Bellman error analysis
Lemma 13.4 below bounds the probability that an individual approximation unit will have Bellman error of unacceptably high magnitude during a non-delay step.
Lemma 13.4. If $\epsilon_b = \hat{Q}_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$, then for a fixed $u(s,a,M)$ belonging to a fixed $\tilde{U}$:
$$P\left(F^{\pi^*}(Q_{\tilde{U}}, u(s,a,M)) - B^{\pi^*} Q_{\tilde{U}}(s,a,M) \le -\epsilon_c \right) \le \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2},$$
and:
$$P\left(F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(s,a,M)) - B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) \ge \epsilon_c + \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}}\right) \le \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}.$$
Proof. Let $Y$ be the set of $k_a(s,a,M)$ samples used by $F^\pi(Q_{\tilde{U}}, u(s,a,M))$ at $(s,a,M)$ and define $f(x_1, \ldots, x_{k_a}) = F^\pi(Q_{\tilde{U}}, u(s,a,M))$, where $x_1, \ldots, x_{k_a}$ are realizations of independent (from the Markov property) variables whose outcomes are possible next states, one for each state-action-MDP in $Y$. The outcomes of the variables (which is where the Markov property ensures independence) are the next states the transitions lead to, not the state-action-MDP triples the samples originate from. The state-action-MDP triples the samples originate from are fixed with respect to $f$, and no assumptions are made about their distribution. Since $\tilde{U}$ is fixed, $Q_{\tilde{U}}$ is fixed with respect to $f$ (we are examining the effects of a single application of $F^\pi(Q_{\tilde{U}}, u(s,a,M))$ to the fixed function $Q_{\tilde{U}}$, while varying the next-states that samples in $u(s,a,M)$ land on).
If $k_a(s,a,M) > 0$ then $\forall\, i \in [1, k_a(s,a,M)]$:
$$\sup_{x_1,\ldots,x_{k_a},\hat{x}_i} \left|f(x_1,\ldots,x_{k_a}) - f(x_1,\ldots,x_{i-1},\hat{x}_i,x_{i+1},\ldots,x_{k_a(s,a,M)})\right| = c_i \le \frac{\hat{Q}_{\max}}{k_a(s,a,M)},$$
and
$$\sum_{i=1}^{k_a(s,a,M)} (c_i)^2 \le k_a(s,a,M)\,\frac{\hat{Q}_{\max}^2}{k_a(s,a,M)^2} = \frac{\hat{Q}_{\max}^2}{k_a(s,a,M)}.$$
From McDiarmid's inequality we have:
$$
\begin{aligned}
P\left(F^\pi(Q_{\tilde{U}}, u(s,a,M)) - E\left[F^\pi(Q_{\tilde{U}}, u(s,a,M))\right] \le -\epsilon_0\right)
&\le P\left(f(x_1,\ldots,x_{k_a}) - E\left[f(x_1,\ldots,x_{k_a})\right] \le -\epsilon_0\right)\\
&\le e^{-\frac{2\epsilon_0^2}{\sum_{i=1}^{k_a(s,a,M)}(c_i)^2}} \le e^{-\frac{2\epsilon_0^2 k_a(s,a,M)}{\hat{Q}_{\max}^2}},
\end{aligned}
$$
and:
$$
\begin{aligned}
P\left(F^\pi(Q_{\tilde{U}}, u(s,a,M)) - E\left[F^\pi(Q_{\tilde{U}}, u(s,a,M))\right] \ge \epsilon_0\right)
&\le P\left(f(x_1,\ldots,x_{k_a}) - E\left[f(x_1,\ldots,x_{k_a})\right] \ge \epsilon_0\right)\\
&\le e^{-\frac{2\epsilon_0^2}{\sum_{i=1}^{k_a(s,a,M)}(c_i)^2}} \le e^{-\frac{2\epsilon_0^2 k_a(s,a,M)}{\hat{Q}_{\max}^2}}.
\end{aligned}
$$
If $k_a(s,a,M) = 0$ then $e^{-\frac{2\epsilon_0^2 k_a(s,a,M)}{\hat{Q}_{\max}^2}} = 1$ and the above is trivially true.
From Definition 5.13 we have that:
$$B^{\pi^*} Q_{\tilde{U}}(s,a,M) - \epsilon_c \le E\left[F^{\pi^*}(Q_{\tilde{U}}, u(s,a,M))\right] - \frac{\epsilon_b}{\sqrt{k_a(s,a,M)}}$$
and
$$E\left[F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(s,a,M))\right] \le B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) + \epsilon_c + \frac{\epsilon_b}{\sqrt{k_a(s,a,M)}},$$
where the expectations are over the next-states that samples in $u(s,a,M)$ used by $F^{\pi^*}$ and $F^{\pi^{Q_{\tilde{U}}}}$ respectively land on.
Substituting $\frac{\epsilon_b}{\sqrt{k_a(s,a,M)}}$ for $\epsilon_0$ above we have that:
$$
\begin{aligned}
P\left(F^{\pi^*}(Q_{\tilde{U}}, u(s,a,M)) - B^{\pi^*} Q_{\tilde{U}}(s,a,M) \le -\epsilon_c\right)
&\le P\left(F^{\pi^*}(Q_{\tilde{U}}, u(s,a,M)) - E\left[F^{\pi^*}(Q_{\tilde{U}}, u(s,a,M))\right] \le -\frac{\epsilon_b}{\sqrt{k_a(s,a,M)}}\right)\\
&\le e^{-2\left(\frac{\epsilon_b}{\sqrt{k_a(s,a,M)}}\right)^2 \frac{k_a(s,a,M)}{\hat{Q}_{\max}^2}}
= e^{-2\left(\hat{Q}_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2 k_a(s,a,M)}}\right)^2 \frac{k_a(s,a,M)}{\hat{Q}_{\max}^2}}\\
&= \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2},
\end{aligned}
$$
and:
$$
\begin{aligned}
P\left(F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(s,a,M)) - B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) \ge \epsilon_c + \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}}\right)
&\le P\left(F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(s,a,M)) - E\left[F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(s,a,M))\right] \ge \frac{\epsilon_b}{\sqrt{k_a(s,a,M)}}\right)\\
&\le e^{-2\left(\frac{\epsilon_b}{\sqrt{k_a(s,a,M)}}\right)^2 \frac{k_a(s,a,M)}{\hat{Q}_{\max}^2}}
= e^{-2\left(\hat{Q}_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2 k_a(s,a,M)}}\right)^2 \frac{k_a(s,a,M)}{\hat{Q}_{\max}^2}}\\
&= \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}.
\end{aligned}
$$
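The concentration step can be made concrete with a small simulation. In the sketch below $F$ is replaced by a plain average of $k$ bounded next-state values (a stand-in with the same bounded-differences constant $\hat{Q}_{\max}/k$; the paper's $F$ also involves rewards and distance terms), and the empirical frequency of a deviation of $\epsilon_b/\sqrt{k}$ is compared against the McDiarmid bound $e^{-2\epsilon_0^2 k/\hat{Q}_{\max}^2} = \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}$; the parameter values are arbitrary.

```python
# Monte Carlo illustration of the McDiarmid step in Lemma 13.4. Here F is a plain average of
# k bounded "next-state values" (a stand-in, not the paper's operator), which has bounded
# differences c_i = Q_hat_max / k exactly as in the proof.
import math
import numpy as np

rng = np.random.default_rng(4)
q_hat_max, k, n_sam, delta = 10.0, 64, 50, 0.1
log_term = math.log(4 * (1 + math.ceil(math.log2(k))) * n_sam ** 2 / delta)
eps_b = q_hat_max * math.sqrt(log_term / 2.0)              # epsilon_b from the lemma
eps_0 = eps_b / math.sqrt(k)                               # the deviation threshold used in the proof

trials = 200_000
f = (q_hat_max * rng.random((trials, k))).mean(axis=1)     # one draw of f per trial, E[f] = q_hat_max/2
empirical = float(np.mean(np.abs(f - q_hat_max / 2.0) >= eps_0))
mcdiarmid = math.exp(-2 * eps_0 ** 2 * k / q_hat_max ** 2) # = delta / (4 * ceil(1+log2 k) * N_SAM^2)
print(empirical, mcdiarmid)     # both are tiny; the empirical frequency stays below the bound
```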
Given a bound on the probability that an individual approximation unit has Bellman error of unacceptably high magnitude, lemma 13.5 uses the union bound to bound the probability that there exists at least one approximation unit in $\tilde{U}$ for some published $\tilde{U}$ with Bellman error of unacceptably high magnitude during a non-delay step.
Lemma 13.5. If $\epsilon_b = \hat{Q}_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$, the probability that for any published $\tilde{U}$ there exists at least one $u(s,a,M) \in \tilde{U}$ such that:
$$F^{\pi^*}(Q_{\tilde{U}}, u(s,a,M)) - B^{\pi^*} Q_{\tilde{U}}(s,a,M) \le -\epsilon_c \qquad (2)$$
or:
$$F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(s,a,M)) - B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) \ge \epsilon_c + \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}} \qquad (3)$$
during a non-delay step, is upper bounded by $\frac{\delta}{2}$.
Proof. When $\tilde{U}$ is the empty set no approximation units exist, so the above is trivially true. Otherwise, at most $\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})$ distinct non-empty $\tilde{U}$ exist during non-delay steps. Thus, there are at most $2\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2$ ways for at least one of the at most $N_{SAM}(d_{known})$ approximation units to fail at least once during non-delay steps ($\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2$ ways each for equation 2 or equation 3 to be true at least once), each with a probability at most $\frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}$. From the union bound, we have that the probability that for any published $\tilde{U}$ there exists at least one $u(s,a,M) \in \tilde{U}$ such that equation 2 or 3 is true during a non-delay step, is upper bounded by $\frac{\delta}{2}$.
Based on lemma 13.5 we can now bound the probability that any $(s,a,M)$ will have Bellman error of unacceptably high magnitude during a non-delay step:
Lemma 13.6. If $\epsilon_b = \hat{Q}_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$ then:
$$Q_{\tilde{U}}(s,a,M) - B^{\pi^*} Q_{\tilde{U}}(s,a,M) > -\epsilon_a - 2\epsilon_c$$
and:
$$Q_{\tilde{U}}(s,a,M) - B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) < \epsilon_a + 2\epsilon_c + \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}} + 2d(s,a,M,N(\tilde{U},s,a,M), d_{known})$$
$\forall (s,a,M,\tilde{U})$ simultaneously with probability $1 - \frac{\delta}{2}$.
Proof. When $\tilde{U}$ is the empty set, $Q_{\tilde{U}}(s,a,M) = Q_{\max}$. Since $B^{\pi^*} Q_{\tilde{U}}(s,a,M) \le Q_{\max}$, $Q_{\tilde{U}}(s,a,M) - B^{\pi^*} Q_{\tilde{U}}(s,a,M) \ge -\epsilon_a - 2\epsilon_c$. Otherwise, $\forall (s,a,M,\tilde{U})$ with probability $1 - \frac{\delta}{2}$:
$$
\begin{aligned}
B^{\pi^*} Q_{\tilde{U}}(s,a,M) &= \min\left\{Q_{\max},\; B^{\pi^*} Q_{\tilde{U}}(s,a,M)\right\}\\
&\le \min\left\{Q_{\max},\; B^{\pi^*} Q_{\tilde{U}}(N(\tilde{U},s,a,M)) + d(s,a,M,N(\tilde{U},s,a,M),d_{known}) + \epsilon_c\right\}\\
&< \min\left\{Q_{\max},\; F^{\pi^*}(Q_{\tilde{U}}, u(N(\tilde{U},s,a,M))) + \epsilon_c + d(s,a,M,N(\tilde{U},s,a,M),d_{known}) + \epsilon_c\right\}\\
&\le \tilde{B}^{\pi^*} Q_{\tilde{U}}(s,a,M) + 2\epsilon_c\\
&\le \tilde{B} Q_{\tilde{U}}(s,a,M) + 2\epsilon_c\\
&\le Q_{\tilde{U}}(s,a,M) + \epsilon_a + 2\epsilon_c.
\end{aligned}
$$
In step 1 we used the fact that $B^{\pi^*} Q_{\tilde{U}}(s,a,M) \le Q_{\max}$, in step 2 we used Definition 5.13, in step 3 we used lemma 13.5, in step 4 we used Definition 5.11, in step 5 we used the fact that $\tilde{B} Q_{\tilde{U}}(s,a,M) \ge \tilde{B}^{\pi^*} Q_{\tilde{U}}(s,a,M)$, and in step 6 we used the fact that $-\epsilon_a \le Q_{\tilde{U}}(s,a,M) - \tilde{B} Q_{\tilde{U}}(s,a,M)$.
When no approximation units containing active samples exist, $d(s,a,M,N(\tilde{U},s,a,M))$ is not defined. Otherwise, $\forall (s,a,M,\tilde{U})$ with probability $1 - \frac{\delta}{2}$:
$$
\begin{aligned}
B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) &= \min\left\{Q_{\max},\; B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M)\right\}\\
&\ge \min\left\{Q_{\max},\; B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(N(\tilde{U},s,a,M)) - d(s,a,M,N(\tilde{U},s,a,M),d_{known}) - \epsilon_c\right\}\\
&> \min\left\{Q_{\max},\; F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(N(\tilde{U},s,a,M))) - \epsilon_c - \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}} - \epsilon_c - d(s,a,M,N(\tilde{U},s,a,M),d_{known})\right\}\\
&\ge \min\left\{Q_{\max},\; F^{\pi^{Q_{\tilde{U}}}}(Q_{\tilde{U}}, u(N(\tilde{U},s,a,M))) + d(s,a,M,N(\tilde{U},s,a,M),d_{known})\right\}\\
&\qquad - 2\epsilon_c - \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde{U},s,a,M),d_{known})\\
&\ge \tilde{B}^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) - 2\epsilon_c - \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde{U},s,a,M),d_{known})\\
&= \tilde{B} Q_{\tilde{U}}(s,a,M) - 2\epsilon_c - \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde{U},s,a,M),d_{known})\\
&\ge Q_{\tilde{U}}(s,a,M) - \epsilon_a - 2\epsilon_c - \frac{2\epsilon_b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde{U},s,a,M),d_{known}).
\end{aligned}
$$
In step 1 we used the fact that $B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) \le Q_{\max}$, in step 2 we used Definition 5.13, in step 3 we used lemma 13.5, in step 5 we used Definition 5.11, in step 6 we used the fact that $\tilde{B}^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) = \tilde{B} Q_{\tilde{U}}(s,a,M)$, and in step 7 we used the fact that $Q_{\tilde{U}}(s,a,M) - \tilde{B} Q_{\tilde{U}}(s,a,M) \le \epsilon_a$.
Note that both the first half of lemma 13.5 (used in the first half of the proof) and the second half (used in the second half of the proof) hold simultaneously with probability $1 - \frac{\delta}{2}$, thus we don't need to take a union bound over the individual probabilities.
Lemma 13.7 bounds the number of times we can encounter approximation
units with fewer than k active samples.
Lemma 13.7. Let $(s_{1,j}, s_{2,j}, s_{3,j}, \ldots)$ for $j \in \{1, \ldots, k_p\}$ be the random paths generated in MDPs $M_1, \ldots, M_{k_p}$ respectively on some execution of Algorithm 1. Let $\tilde{U}(t,j)$ be the approximation set used by Algorithm 1 at step $t$ in MDP $M_j$. Let $\tau(t,j)$ be the number of steps from step $t$ in MDP $M_j$ to the next delay step, or to the first step $t'$ for which $\tilde{U}(t,j) \ne \tilde{U}(t',j)$, whichever comes first. Let $T_H = \left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil$ and define $H = \{1, 2, 4, \ldots, 2^i\}$ where $i$ is the largest integer such that $2^i \le T_H$. Let $k_a^-$ be the largest value in $K_a$ that is strictly smaller than $k_a$, or $0$ if such a value does not exist. Let $X_{k_a}(t,j)$ be the set of state-action-MDP triples at step $t$ in MDP $M_j$ such that no approximation unit with at least $k_a$ active samples exists within $d_{known}$ distance, and at least one approximation unit with at least $k_a^-$ active samples exists within $d_{known}$ distance ($k_a^- = 0$ covers both the case where an approximation unit with $0$ active samples exists within $d_{known}$ distance, and the case where no approximation units exist within $d_{known}$ distance). Define $p_{h,k_a}(s_{t,j})$ for $k_a \in K_a$ to be Bernoulli random variables that express the following conditional probability: Given the state of the approximation set and sample candidate queue at step $t$ for MDP $M_j$, exactly $h$ state-action-MDP triples in $X_{k_a}(t,j)$ are encountered in MDP $M_j$ during the next $\min\{T_H, \tau(t,j)\}$ steps. Let $p^e_{h,k_a}(s_{t,j}) = \sum_{i=h}^{2h-1} p_{i,k_a}(s_{t,j})$. Finally let $T_j$ be the set of non-delay steps (see Definition 7.5) for MDP $M_j$.
If $\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{k_p N_{SAM}(d_{known})} < 1$ and $N_{SAM}(d_{known}) \ge 2$, with probability $1 - \frac{\delta}{2}$:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H}\left(h\, p^e_{h,k_a}(s_{t,j})\right) < \frac{(k_a - k_a^- + k_p)\left(1 + \log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})}{1 - \sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}}}$$
$\forall\, k_a \in K_a$ and $\forall\, h \in H$ simultaneously.
Proof. From the Markov property we have that $p^e_{h,k_a}(s_{t,j})$ variables at least $T_H$ steps apart (of the same or different $j$) are independent³. Define $T_{iH}$ for $i \in \{0, 1, \ldots, T_H - 1\}$ to be the (infinite) set of timesteps for which $t \in \{i, i+T_H, i+2T_H, \ldots\}$.
$k_a - k_a^-$ samples will be added to an approximation unit with $k_a^-$ active samples before it progresses to $k_a$ active samples. Additionally, at most $N_{SAM}(d_{known})$ approximation units can be added to the approximation set, and we are exploring in $k_p$ MDPs in parallel. Thus, at most $(k_a - k_a^- + k_p - 1)N_{SAM}(d_{known})$ state-action-MDP triples within $d_{known}$ distance of approximation units with $k_a^-$ active samples can be encountered on non-delay steps⁴.
Let us assume that there exists an $i \in \{0, 1, \ldots, T_H - 1\}$ and $h \in H$ such that:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j \cap T_{iH}} p^e_{h,k_a}(s_{t,j}) \ge \frac{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}{h\left(1 - \sqrt{\frac{2h\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}}\right)}.$$
From lemma 13.1 it follows that with probability at least $1 - \frac{\delta}{2\lceil 1+\log_2 k\rceil}$, at least $(k_a - k_a^- + k_p)N_{SAM}(d_{known})$ state-action-MDP triples within $d_{known}$ distance of approximation units with $k_a^-$ active samples will be encountered on non-delay steps, which is a contradiction. It must therefore be the case that:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j \cap T_{iH}} p^e_{h,k_a}(s_{t,j}) < \frac{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}{h\left(1 - \sqrt{\frac{2h\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}}\right)}$$
with probability at least $1 - \frac{\delta}{2\lceil 1+\log_2 k\rceil}$ for all $i \in \{0, 1, \ldots, T_H - 1\}$ and $h \in H - \{T_H\}$ simultaneously, which implies that:
$$
\begin{aligned}
\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H}\left(h\, p^e_{h,k_a}(s_{t,j})\right)
&< \frac{(k_a - k_a^- + k_p)\,|H|\, T_H\, N_{SAM}(d_{known})}{1 - \sqrt{\frac{2T_H\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}}}\\
&\le \frac{(k_a - k_a^- + k_p)\left(1 + \log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})}{1 - \sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}}}
\end{aligned}
$$
with probability at least $1 - \frac{\delta}{2\lceil 1+\log_2 k\rceil}$ for all $h \in H$ simultaneously.
From the union bound we have that since $k_a$ can take $\lceil 1+\log_2 k\rceil$ values, with probability $1 - \frac{\delta}{2}$:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H}\left(h\, p^e_{h,k_a}(s_{t,j})\right) < \frac{(k_a - k_a^- + k_p)\left(1 + \log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})}{1 - \sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a - k_a^- + k_p)N_{SAM}(d_{known})}}}$$
$\forall\, k_a \in K_a$ and $\forall\, h \in H$ simultaneously.

³ Careful readers may notice that what happens at step $t$ affects which variables are selected at future timesteps. This is not a problem. We only care that the outcomes of the variables are independent given their selection.
⁴ This covers the worst case, which could happen if every time a state-action-MDP triple within $d_{known}$ distance of an approximation unit with $k_a - 1$ samples is encountered by one of the MDPs, the remaining $k_p - 1$ MDPs encounter a state-action-MDP triple within $d_{known}$ of the same approximation unit simultaneously.
8  Proofs of main lemmas and theorems
Lemma 8.1. The space complexity of algorithm 1 is:
$$O\left(N_{SAM}(d_{known})\right)$$
per concurrent MDP.
Proof. Algorithm 1 only needs access to the value and number of active samples of each approximation unit, and from the definition of the covering number and Algorithm 2 we have that at most $N_{SAM}(d_{known})$ approximation units can be added.
Lemma 8.2. The space complexity of algorithm 2 is:
$$O\left(k|A|N_{SAM}(d_{known})\right).$$
Proof. From the definition of the covering number we have that at most $N_{SAM}(d_{known})$ approximation units can be added. From the definition of an approximation unit and Algorithm 2 we have that at most $k$ samples can be added per approximation unit, and each sample requires at most $O(|A|)$ space.
Lemma 8.3. The per step computational complexity of algorithm 1 is bounded above by:
$$O\left(|A|N_{SAM}(d_{known})\right).$$
Proof. A naive search for the nearest approximation unit of each of the at most $|A|$ actions needs to perform at most $O\left(N_{SAM}(d_{known})\right)$ operations.
Since every step of algorithm 1 results in a sample being processed by algorithm 2, we will be bounding the per sample computational complexity of
algorithm 2:
Lemma 8.4. Let $c$ be the maximum number of approximation units to which a single sample can be added⁵. The per sample computational complexity of algorithm 2 is bounded above by:
$$O\left(ck|A|N_{SAM}(d_{known}) + \frac{k|A|N_{SAM}(d_{known})}{\ln\gamma}\ln\frac{\epsilon_a}{Q_{\max}}\right).$$
Proof. Lines 4 through 12 will be performed only once per sample. Line 6 requires at most $O(N_{SAM}(d_{known}))$ operations. Line 7 requires at most $O(kN_{SAM}(d_{known}))$ operations. Line 9 can reuse the result of line 6 and only check if the approximation unit added in line 7 (if any) has $k$ samples. Line 10 requires at most $O(N_{SAM}(d_{known}))$ operations.

⁵ $c$ will depend on the dimensionality of the state-action-MDP space. For domains that are big enough to be interesting it will be significantly smaller than other quantities of interest.
Given a pointer to $N(U, s', a', M)$ for every next state-action-MDP triple, line 14 requires at most $O\left(k|A|N_{SAM}(d_{known})\right)$ operations ($O(k|A|)$ for each approximation unit), and since $\tilde{B}$ is a $\gamma$-contraction in maximum norm (lemma 13.2), it will be performed at most $\frac{\ln\frac{\epsilon_a}{Q_{\max}}}{\ln\gamma}$ times per sample⁶ (lemma 13.3), for a total cost of $O\left(\frac{k|A|N_{SAM}(d_{known})}{\ln\gamma}\ln\frac{\epsilon_a}{Q_{\max}}\right)$. This step dominates the computational cost of the learner.
Lines 15 and 16 require at most $O\left(N_{SAM}(d_{known})\right)$ operations each, and will be performed at most $\frac{\ln\frac{\epsilon_a}{Q_{\max}}}{\ln\gamma}$ times per sample.
Maintaining a pointer to $N(U, s', a', M)$ for every next state-action-MDP triple requires the following: 1) Computing $N(U, s', a', M)$ every time a sample is added to the sample set, for a cost of up to $O(|A|N_{SAM}(d_{known}))$ operations per sample. 2) Updating all $N(U, s', a', M)$ pointers every time the number of active samples for an approximation unit changes: There can exist at most $k|A|N_{SAM}(d_{known})$ pointers in the sample set, and a single sample can affect the number of active samples of at most $c$ approximation units. Each pointer can be updated against a particular approximation unit in constant time, thus the cost of maintaining the pointers is at most $O(ck|A|N_{SAM}(d_{known}))$ operations per sample.

⁶ Note that the per sample cost does not increase even if multiple samples enter through line 4 before publishing $\tilde{U}$.
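As an illustration of where these counts come from, the sketch below is schematic only (the approximation-unit contents, distance function, and parameter values are stand-ins, since Algorithms 1 and 2 are specified in the main paper): it shows the naive nearest-approximation-unit scan behind Lemmas 8.1-8.4 and the $\frac{\ln\frac{\epsilon_a}{Q_{\max}}}{\ln\gamma}$ sweep count from Lemma 13.3.

```python
# Illustrative sketch (not the paper's Algorithm 1/2): a naive nearest-approximation-unit
# search and the value-iteration sweep count used in the Lemma 8.4 cost accounting.
import math
import numpy as np

def nearest_unit(units, query, dist):
    """Naive O(N_SAM) scan for the closest approximation unit to `query`."""
    best, best_d = None, float("inf")
    for u in units:
        d = dist(u["center"], query)
        if d < best_d:
            best, best_d = u, d
    return best, best_d

def vi_sweeps(eps_a, q_max, gamma):
    """Sweep count from Lemma 13.3: ln(eps_a / Q_max) / ln(gamma), rounded up."""
    return math.ceil(math.log(eps_a / q_max) / math.log(gamma))

# Toy usage: 1000 approximation units over a 4-dimensional state-action space.
rng = np.random.default_rng(0)
units = [{"center": rng.random(4), "value": 1.0, "active": 0} for _ in range(1000)]
_, d = nearest_unit(units, rng.random(4), dist=lambda a, b: float(np.linalg.norm(a - b)))
print(d, vi_sweeps(eps_a=0.01, q_max=10.0, gamma=0.95))   # ~135 sweeps for these values
```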
Theorem 8.5. Let $(s_{1,j}, s_{2,j}, s_{3,j}, \ldots)$ for $j \in \{1, \ldots, k_p\}$ be the random paths generated in MDPs $M_1, \ldots, M_{k_p}$ respectively on some execution of algorithm 1, and $\tilde\pi$ be the (non-stationary) policy followed by algorithm 1. Let $\epsilon_b = \hat{Q}_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$, $k \ge \frac{\hat{Q}_{\max}^2}{2\epsilon_s^2(1-\gamma)^2}\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}$, $\epsilon_c$ be defined as in definition 5.13, $\epsilon_a$ be defined as in algorithm 2, $k_p$ be defined as in definition 7.1, and $T_j$ be the set of non-delay steps (see definition 7.5) for MDP $M_j$. If $\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})} < 1$ and $N_{SAM}(d_{known}) \ge 2$, with probability at least $1-\delta$, for all $t, j$:
$$V^{\tilde\pi}(s_{t,j}, M_j) \ge V^*(s_{t,j}, M_j) - \frac{4\epsilon_c + 2\epsilon_a}{1-\gamma} - 3\epsilon_s - \epsilon_e(t,j),$$
where⁷:
$$\sum_{j=1}^{k_p}\sum_{t=0}^{\infty} \epsilon_e(t,j) = \sum_{j=1}^{k_p}\sum_{t\in T_j} \epsilon_e(t,j) + \sum_{j=1}^{k_p}\sum_{t\notin T_j} \epsilon_e(t,j),$$
with:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j} \epsilon_e(t,j) \approx \tilde{O}\left(\frac{\left(\frac{\hat{Q}_{\max}}{\epsilon_s(1-\gamma)} + k_p\right)N_{SAM}(d_{known})\hat{Q}_{\max}}{1-\gamma}\right)$$
and:
$$\sum_{t\notin T_j} \epsilon_e(t,j) \le \left(Q_{\max}\sum_{k_a\in K_a}\sum_{i=1}^{N_{SAM}(d_{known})} D_{i,j,k_a}\right),$$
where $D_{i,j,k_a}$ is the delay of the $k_a$-th sample for $k_a \in K_a$ in the $i$-th approximation unit, with respect to MDP $M_j$. Note that the probability of success $1-\delta$ holds for all timesteps in all MDPs simultaneously, and $\sum_{j=1}^{k_p}\sum_{t=0}^{\infty}\epsilon_e(t,j)$ is an undiscounted infinite sum. Unlike $\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j)$ which is a sum over all MDPs, $\sum_{t\notin T_j}\epsilon_e(t,j)$ gives a bound on the total cost due to value function update delays for each MDP separately.

⁷ $f(n) = \tilde{O}(g(n))$ is a shorthand for $f(n) = O(g(n)\log^c g(n))$ for some $c$.
Proof. We have that:
$$Q_{\tilde{U}}(s,a,M) - B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) \le Q_{\max} \quad \forall\, (s,a,M,\tilde{U}).$$
Since $k_a^- \ge \frac{k_a}{2}$ $\forall k_a \in K_a - \{1\}$ we have that for any $(s,a,M)$ within $d_{known}$ distance of an approximation unit with $k_a^- > 0$ active samples, with probability $1-\frac{\delta}{2}$:
$$Q_{\tilde{U}}(s,a,M) - B^{\pi^{Q_{\tilde{U}}}} Q_{\tilde{U}}(s,a,M) < \epsilon_a + 2\epsilon_c + 2\hat{Q}_{\max}\sqrt{\frac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}.$$
Substituting $k \ge \frac{\hat{Q}_{\max}^2}{2\epsilon_s^2(1-\gamma)^2}\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}$ for $k_a$ into lemma 13.6, we have that with probability at least $1-\frac{\delta}{2}$ for any $(s,a,M)$ within $d_{known}$ of an approximation unit with $k$ active samples⁸:
$$\tilde{Q}(s,a,M) - B^{\pi^{Q_{\tilde{U}}}}\tilde{Q}(s,a,M) < \epsilon_a + 2\epsilon_c + 2(1-\gamma)\epsilon_s.$$
Let $T_H$, $H$, $\tilde{U}(t,j)$, $\tau(t,j)$, and $p^e_{h,k_a}(s_{t,j})$ be defined as in lemma 13.7. Even though $\tilde\pi$ is non-stationary, it is comprised of stationary segments. Starting from step $t$ in MDP $M_j$, $\tilde\pi$ is equal to $\pi^{Q_{\tilde{U}(t,j)}}$ for at least $\tau(t,j)$ steps. Substituting the above into lemma 12.6 we have that with probability at least $1-\frac{\delta}{2}$, for all $M_j$ and $t\in T_j$:
$$V^*(s_{t,j}, M_j) - V^{\tilde\pi}(s_{t,j}, M_j) \le \frac{4\epsilon_c + 2\epsilon_a}{1-\gamma} + 3\epsilon_s + \epsilon_e(t,j),$$
where:
$$\epsilon_e(t,j) = \gamma^{\tau(t,j)}Q_{\max} + 2\sum_{h\in H}\left(h\, p^e_{h,1}(s_{t,j})\right)Q_{\max} + 2\sum_{k_a\in K_a - \{1\}}\sum_{h\in H}\left(h\, p^e_{h,k_a}(s_{t,j})\right)2\hat{Q}_{\max}\sqrt{\frac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}.$$

⁸ Notice that in this case $d(s,a,M,N(s,a,M),d_{known}) = 0$.
Define:
$$c_0 = \left(1 + \log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known}).$$
From the above it follows that:
$$
\begin{aligned}
\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j)
&= \sum_{j=1}^{k_p}\sum_{t\in T_j}\Bigg(\gamma^{\tau(t,j)}Q_{\max} + 2\sum_{h\in H}\left(h\,p^e_{h,1}(s_{t,j})\right)Q_{\max}\\
&\qquad\qquad + 2\sum_{k_a\in K_a-\{1\}}\sum_{h\in H}\left(h\,p^e_{h,k_a}(s_{t,j})\right)2\hat{Q}_{\max}\sqrt{\tfrac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}\Bigg)\\
&= \sum_{j=1}^{k_p}\sum_{t\in T_j}\gamma^{\tau(t,j)}Q_{\max} + 2\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H}\left(h\,p^e_{h,1}(s_{t,j})\right)Q_{\max}\\
&\qquad\qquad + 2\sum_{k_a\in K_a-\{1\}}\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H}\left(h\,p^e_{h,k_a}(s_{t,j})\right)2\hat{Q}_{\max}\sqrt{\tfrac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}\\
&< \frac{k_p\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})Q_{\max}}{1-\gamma}
+ \frac{2(1+k_p)c_0 Q_{\max}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}
+ 2\sum_{k_a\in K_a-\{1\}}\frac{(k_a-k_a^-+k_p)\,c_0\,2\hat{Q}_{\max}\sqrt{\tfrac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}}\\
&\le \frac{(2+\lceil 3+\log_2 k\rceil k_p)\,c_0 Q_{\max}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}
+ 2\sum_{k_a\in K_a-\{1\}}\frac{\left(\frac{k_a}{2}+k_p\right)c_0\,2\hat{Q}_{\max}\sqrt{\tfrac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}}\\
&\le \frac{(2+\lceil 3+\log_2 k\rceil k_p)\,c_0 Q_{\max} + 2\sum_{k_a\in K_a-\{1\}}\left(\sqrt{k_a}+\frac{2k_p}{\sqrt{k_a}}\right)c_0\hat{Q}_{\max}\sqrt{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}\\
&\le \frac{(2+\lceil 3+\log_2 k\rceil k_p)\,c_0 Q_{\max} + \left(2\sqrt{k}\left(3+\frac{2}{\sqrt{2}}\right)+4k_p\left(1+\frac{2}{\sqrt{2}}\right)\right)c_0\hat{Q}_{\max}\sqrt{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}\\
&\le \frac{c_0\hat{Q}_{\max}\left((2+\lceil 3+\log_2 k\rceil k_p)+9\sqrt{k}+10k_p\right)\sqrt{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}\\
&\le \frac{\left(1+\log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})\hat{Q}_{\max}\left((2+\lceil 3+\log_2 k\rceil k_p)+9\sqrt{k}+10k_p\right)\sqrt{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}
\end{aligned}
$$
with probability $1-\delta$, where in step 3 we used the fact that there can be at most $\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})$ policy changes and lemma 13.7, and in the next-to-last step we used the fact that $\hat{Q}_{\max}\sqrt{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}} \ge Q_{\max}$. Since lemma 13.6 (used to bound the Bellman error of each $(s,a,M,\tilde{U})$) and lemma 13.7 (used to bound how many times each $(s,a,M,\tilde{U})$ is encountered) hold with probability at least $1-\frac{\delta}{2}$ each, the bound above holds with probability at least $1-\delta$.
Since for realistically large $N_{SAM}(d_{known})$:
$$1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}} \approx 1,$$
we have that:
$$
\begin{aligned}
\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j)
&< \frac{\left(1+\log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})\hat{Q}_{\max}\left((2+\lceil 3+\log_2 k\rceil k_p)+9\sqrt{k}+10k_p\right)\sqrt{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}\\
&\approx \tilde{O}\left(\frac{\left(\frac{\hat{Q}_{\max}}{\epsilon_s(1-\gamma)} + k_p\right)N_{SAM}(d_{known})\hat{Q}_{\max}}{1-\gamma}\right)
\end{aligned}
$$
with probability $1-\delta$, where we have substituted $k$ with
$$c\,\frac{\hat{Q}_{\max}^2}{2\epsilon_s^2(1-\gamma)^2}\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}$$
for some constant $c$.
From the definition of a delay step (Definition 7.5) we have that the maximum number of delay steps in MDP $M_j$ is $\sum_{k_a\in K_a}\sum_{i=1}^{N_{SAM}(d_{known})} D_{i,j,k_a}$. Allowing for the algorithm to perform arbitrarily badly ($\epsilon_e(t,j) = Q_{\max}$) on every delay step we have that $\forall j$:
$$\sum_{t\notin T_j}\epsilon_e(t,j) \le Q_{\max}\sum_{k_a\in K_a}\sum_{i=1}^{N_{SAM}(d_{known})} D_{i,j,k_a}.$$
References
Li, L. 2009. A unifying framework for computational reinforcement learning
theory. Ph.D. Dissertation.
Williams, R., and Baird, L. 1993. Tight performance bounds on greedy policies
based on imperfect value functions. Technical report, Northeastern University,
College of Computer Science.