Supplemental material for AAAI-16 paper: Efficient PAC-optimal Exploration in Concurrent, Continuous State MDPs with Delayed Updates

Jason Pazis and Ronald Parr

To prove lemmas 8.1, 8.2, 8.3, and 8.4, and theorem 8.5, we first need to introduce some supporting theory.

12 Bellman error MDPs

This section introduces the concept of Bellman error MDPs, a new analysis tool for RL algorithms. Existing work on bounds based on the Bellman error has focused on one-step max-norm errors (Williams and Baird, 1993). Unfortunately, for the same reason that the maximum expected discounted accumulated reward of an MDP can be much smaller than $\frac{R_{\max}}{1-\gamma}$, bounds based on the max-norm of the Bellman error can be loose. Bellman error MDPs are much more flexible and intuitive, allowing us to prove stronger bounds with less effort. While this work focuses on PAC-optimal exploration, Bellman error MDPs can be used to prove bounds for other online or offline RL algorithms.

Definitions

Let $M$ be an MDP $(S, A, P, R, \gamma)$ with Bellman operator $B^{\pi}$ for policy $\pi$, and $Q$ an approximate value function for $M$. Additionally, $\forall s, a, \pi$ let $0 \le Q^{\pi}(s,a) \le Q_{\max}$. Let the Bellman error MDP $M_{(\pi,Q)}$ be an MDP which differs from $M$ only in its reward function, which is defined as $R_{(\pi,Q)}(s,a) = Q(s,a) - B^{\pi} Q(s,a)$ (notice that $Q(s,a)$ and $B^{\pi}$ are the approximate value function and Bellman operator of the original MDP). We will use $B^{\pi}_{(\pi,Q)}$ to denote $M_{(\pi,Q)}$'s Bellman operator under policy $\pi$, and $Q^{\pi}_{(\pi,Q)}$ to denote its value function.

Bellman error MDP theory

Theorem 12.1 is the main theorem of this section. Most results on Bellman error MDPs follow directly from this theorem and basic properties of MDPs.

Theorem 12.1. The return of $\pi$ over $M$ is equal to $Q$ minus the return of $\pi$ over $M_{(\pi,Q)}$ (note that this is true for any policy $\pi$, not just the greedy policy over $Q$):
$$Q^{\pi}(s,a) = Q(s,a) - Q^{\pi}_{(\pi,Q)}(s,a) \quad \forall (s,a) \in (S,A).$$

Proof. We will use induction to prove our claim: if $0$ denotes a value function that is zero everywhere, all we need to prove is that $(B^{\pi})^i Q(s,a) = Q(s,a) - (B^{\pi}_{(\pi,Q)})^i 0$ and take the limit as $i \to \infty$.

The base case $B^{\pi} Q(s,a) = Q(s,a) - B^{\pi}_{(\pi,Q)} 0$ follows directly from the definition of $R_{(\pi,Q)}$. Assuming that $(B^{\pi})^i Q(s,a) = Q(s,a) - (B^{\pi}_{(\pi,Q)})^i 0$ holds for $i$, we will prove that it also holds for $i+1$:
$$\begin{aligned}
(B^{\pi})^{i+1} Q(s,a) &= B^{\pi} (B^{\pi})^i Q(s,a)\\
&= \int_{s'} P(s'|s,a)\Big(R(s,a,s') + \gamma (B^{\pi})^i Q(s',\pi(s'))\Big)\\
&= \int_{s'} P(s'|s,a)\Big(R(s,a,s') + \gamma\big(Q(s',\pi(s')) - (B^{\pi}_{(\pi,Q)})^i 0\big)\Big)\\
&= \int_{s'} P(s'|s,a)\Big(R(s,a,s') + \gamma Q(s',\pi(s'))\Big) - \int_{s'} P(s'|s,a)\Big(\gamma (B^{\pi}_{(\pi,Q)})^i 0\Big)\\
&= B^{\pi} Q(s,a) - \int_{s'} P(s'|s,a)\Big(\gamma (B^{\pi}_{(\pi,Q)})^i 0\Big)\\
&= Q(s,a) - R_{(\pi,Q)}(s,a) - \int_{s'} P(s'|s,a)\Big(\gamma (B^{\pi}_{(\pi,Q)})^i 0\Big)\\
&= Q(s,a) - (B^{\pi}_{(\pi,Q)})^{i+1} 0.
\end{aligned}$$
If we now take the limit as $i \to \infty$ we have the original claim:
$$\lim_{i \to \infty}(B^{\pi})^i Q(s,a) = Q(s,a) - \lim_{i \to \infty}(B^{\pi}_{(\pi,Q)})^i 0 \;\Rightarrow\; Q^{\pi}(s,a) = Q(s,a) - Q^{\pi}_{(\pi,Q)}(s,a).$$

Corollary 12.2 bounds the range of the Bellman error MDP value function, a property that will prove very useful in the analysis of our algorithm. It follows from Theorem 12.1 and the fact that $0 \le Q^{\pi}(s,a) \le Q_{\max}$.

Corollary 12.2. Let $0 \le Q(s,a) \le Q_{\max}$ $\forall (s,a)$. Then:
$$-Q_{\max} \le Q^{\pi}_{(\pi,Q)}(s,a) \le Q_{\max} \quad \forall (s,a,\pi).$$
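To make Theorem 12.1 concrete, here is a small numerical check (our own sketch, not part of the paper; it assumes Python with numpy and a randomly generated finite MDP). It forms the Bellman error MDP reward $R_{(\pi,Q)} = Q - B^{\pi}Q$ for an arbitrary policy $\pi$ and an arbitrary $Q$, evaluates $\pi$ in both MDPs, and verifies $Q^{\pi} = Q - Q^{\pi}_{(\pi,Q)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random finite MDP: P[s, a] is a distribution over next states, R[s, a] a reward.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.uniform(0.0, 1.0, size=(nS, nA))
pi = rng.integers(nA, size=nS)                            # an arbitrary (not necessarily greedy) policy
Q = rng.uniform(0.0, 1.0 / (1 - gamma), size=(nS, nA))    # an arbitrary approximate value function

def evaluate(reward):
    """Exact policy evaluation of pi for the given reward function."""
    Qpi = np.zeros((nS, nA))
    for _ in range(2000):                                 # enough sweeps for convergence at gamma = 0.9
        V = Qpi[np.arange(nS), pi]
        Qpi = reward + gamma * P @ V
    return Qpi

BpiQ = R + gamma * P @ Q[np.arange(nS), pi]               # one-step Bellman backup B^pi Q
R_be = Q - BpiQ                                           # reward of the Bellman error MDP M_(pi,Q)

Q_pi = evaluate(R)                                        # return of pi in the original MDP
Q_be = evaluate(R_be)                                     # return of pi in the Bellman error MDP

assert np.allclose(Q_pi, Q - Q_be)                        # Theorem 12.1: Q^pi = Q - Q^pi_(pi,Q)
```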
Lemma 12.3 below (a consequence of Theorem 12.1) proves that the difference in value between the optimal policy and the greedy policy over $Q$ is bounded above by the difference, in reverse order, of the values of those policies in their respective Bellman error MDPs.

Lemma 12.3. $\forall (s, Q)$:
$$V^*(s) - V^{\pi^Q}(s) \le Q^{\pi^Q}_{(\pi^Q,Q)}(s,\pi^Q(s)) - Q^{\pi^*}_{(\pi^*,Q)}(s,\pi^*(s)).$$

Proof. Since $\pi^Q$ is greedy over $Q$, we have $Q(s,\pi^*(s)) \le Q(s,\pi^Q(s))$. Expanding both sides via Theorem 12.1:
$$\begin{aligned}
Q(s,\pi^*(s)) &\le Q(s,\pi^Q(s)) \Rightarrow\\
Q^{\pi^*}(s,\pi^*(s)) + Q^{\pi^*}_{(\pi^*,Q)}(s,\pi^*(s)) &\le Q^{\pi^Q}(s,\pi^Q(s)) + Q^{\pi^Q}_{(\pi^Q,Q)}(s,\pi^Q(s)) \Rightarrow\\
Q^{\pi^*}(s,\pi^*(s)) - Q^{\pi^Q}(s,\pi^Q(s)) &\le Q^{\pi^Q}_{(\pi^Q,Q)}(s,\pi^Q(s)) - Q^{\pi^*}_{(\pi^*,Q)}(s,\pi^*(s)) \Rightarrow\\
V^*(s) - V^{\pi^Q}(s) &\le Q^{\pi^Q}_{(\pi^Q,Q)}(s,\pi^Q(s)) - Q^{\pi^*}_{(\pi^*,Q)}(s,\pi^*(s)).
\end{aligned}$$

Lemma 12.4 proves an upper bound on the expected value of the greedy policy in its Bellman error MDP when we do not have a uniform bound on the Bellman error over all state-actions.

Lemma 12.4. Let $X_1, \dots, X_i, \dots, X_n$ be sets of state-actions where $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_i$ $\forall (s,a) \in X_i$, $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_{\pi^Q}$ $\forall (s,a) \notin \cup_{i=1}^n X_i$, and $\epsilon_{\pi^Q} \le \epsilon_i$ $\forall i$. Let $p_i(s,a,t)$ for $t \in [0, T-1]$ be Bernoulli random variables expressing the probability of taking a state-action $(s_t,a_t)$ at step $t$, for which $(s_t,a_t) \in X_i$, when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter. Then:
$$Q^{\pi^Q}_{(\pi^Q,Q)}(s,a) \le \frac{\epsilon_{\pi^Q}}{1-\gamma} + e,$$
where $e = \sum_{i=1}^n\left(\sum_{t=0}^{T-1}\gamma^t p_i(s,a,t)\right)(\epsilon_i - \epsilon_{\pi^Q}) + \gamma^T Q_{\max}$.

Proof. The expected reward in the Bellman error MDP at step $t \in [0,T-1]$ is bounded above by $\epsilon_{\pi^Q} + \sum_{i=1}^n p_i(s,a,t)(\epsilon_i - \epsilon_{\pi^Q})$. From Corollary 12.2, the value of any state-action that may be encountered at step $T$ in the Bellman error MDP is bounded above by $Q_{\max}$. From these two facts we have that:
$$\begin{aligned}
Q^{\pi^Q}_{(\pi^Q,Q)}(s,a) &\le \sum_{t=0}^{T-1}\gamma^t\left(\epsilon_{\pi^Q} + \sum_{i=1}^n p_i(s,a,t)(\epsilon_i - \epsilon_{\pi^Q})\right) + \gamma^T Q_{\max}\\
&\le \frac{\epsilon_{\pi^Q}}{1-\gamma} + \sum_{i=1}^n\left(\sum_{t=0}^{T-1}\gamma^t p_i(s,a,t)\right)(\epsilon_i - \epsilon_{\pi^Q}) + \gamma^T Q_{\max}.
\end{aligned}$$

Lemmas 12.5 and 12.6 prove bounds on the difference in expected value between the optimal policy and the greedy policy over $Q$ when we do not have a uniform bound on the Bellman error over all state-actions. They avoid the square-peg-in-round-hole problem encountered when one tries to analyze exploration algorithms using max-norm bounds.

Lemma 12.5. Let $Q(s,a) - B^{\pi^*} Q(s,a) \ge -\epsilon_*$ $\forall (s,a)$, and let $X_1, \dots, X_i, \dots, X_n$ be sets of state-actions where $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_i$ $\forall (s,a) \in X_i$, $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_{\pi^Q}$ $\forall (s,a) \notin \cup_{i=1}^n X_i$, and $\epsilon_{\pi^Q} \le \epsilon_i$ $\forall i$. Let $p_i(s,a,t)$ for $t \in [0, T-1]$ be Bernoulli random variables expressing the probability of taking a state-action $(s_t,a_t)$ at step $t$, for which $(s_t,a_t) \in X_i$, when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter. Then:
$$V^*(s) - V^{\pi^Q}(s) \le \frac{\epsilon_* + \epsilon_{\pi^Q}}{1-\gamma} + e,$$
where $e = \sum_{i=1}^n\left(\sum_{t=0}^{T-1}\gamma^t p_i(s,a,t)\right)(\epsilon_i - \epsilon_{\pi^Q}) + \gamma^T Q_{\max}$.

Proof. Since $Q(s,a) - B^{\pi^*} Q(s,a) \ge -\epsilon_*$ $\forall (s,a)$, we have that $Q^{\pi^*}_{(\pi^*,Q)}(s,a) \ge -\frac{\epsilon_*}{1-\gamma}$. Substituting the above along with Lemma 12.4 into Lemma 12.3 concludes our proof.
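The bound of Lemma 12.4 can be checked numerically in the same setting. The sketch below is ours, not the paper's: it assumes numpy, a single set $X_1$, a finite horizon $T$, and clamps the $\epsilon$'s at zero so that the geometric-series step in the proof applies (as it does later, where the lemma is used with nonnegative $\epsilon$'s). It computes the visit probabilities $p_1(s,a,t)$ exactly from the transition model and compares the resulting bound against the exact value of $\pi^Q$ in its Bellman error MDP.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, T = 6, 3, 0.9, 60
Qmax = 1.0 / (1 - gamma)                       # rewards in [0, 1], so 0 <= Q^pi <= Qmax

P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = distribution over next states
R = rng.uniform(0.0, 1.0, size=(nS, nA))
Q = rng.uniform(0.0, Qmax, size=(nS, nA))      # arbitrary approximate value function
pi = Q.argmax(axis=1)                          # greedy policy over Q

BQ = R + gamma * P @ Q[np.arange(nS), pi]      # B^{pi^Q} Q
be = Q - BQ                                    # Bellman error (reward of the Bellman error MDP)

# X_1 = state-actions whose Bellman error exceeds the median; eps_piQ bounds the rest.
X1 = be > np.median(be)
eps_piQ = max(be[~X1].max(), 0.0)              # clamp at 0 so the eps/(1-gamma) step is valid
eps_1 = max(be[X1].max(), eps_piQ)

# Exact value of pi^Q in its Bellman error MDP (policy evaluation on reward `be`).
Qbe = np.zeros((nS, nA))
for _ in range(3000):
    Qbe = be + gamma * P @ Qbe[np.arange(nS), pi]

# p_1(s, a, t): probability of being at a state-action in X_1 at step t, starting from (s, a).
P_pi = P[np.arange(nS), pi]                    # state-to-state transitions under pi^Q
x_next = X1[np.arange(nS), pi].astype(float)   # indicator that (s', pi(s')) is in X_1
disc_visits = X1.astype(float)                 # t = 0 term: gamma^0 * p_1(s, a, 0)
d = P.copy()                                   # d[s, a] = state distribution at step t = 1
for t in range(1, T):
    disc_visits = disc_visits + gamma ** t * (d @ x_next)
    d = d @ P_pi

bound = eps_piQ / (1 - gamma) + disc_visits * (eps_1 - eps_piQ) + gamma ** T * Qmax
assert np.all(Qbe <= bound + 1e-8)             # Lemma 12.4
```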
Lemma 12.6. Let $Q(s,a) - B^{\pi^*} Q(s,a) \ge -\epsilon_*$ $\forall (s,a)$, and let $X_1, \dots, X_i, \dots, X_n$ be sets of state-actions where $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_i$ $\forall (s,a) \in X_i$, $Q(s,a) - B^{\pi^Q} Q(s,a) \le \epsilon_{\pi^Q}$ $\forall (s,a) \notin \cup_{i=1}^n X_i$, and $\epsilon_{\pi^Q} \le \epsilon_i$ $\forall i$. Let $T_H = \left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil$ and define $H = \{1, 2, 4, \dots, 2^i\}$ where $i$ is the largest integer such that $2^i \le T_H$. Define $p_{h,i}(s,a)$ for $h \in [0, T_H - 1]$ to be Bernoulli random variables expressing the probability of encountering exactly $h$ state-actions belonging to $X_i$ when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter for a total of $\min\{T, T_H\}$ steps. Finally let $p^e_{h,i}(s,a) = \sum_{m=h}^{2h-1} p_{m,i}(s,a)$. Then:
$$V^*(s) - V^{\pi^Q}(s) \le \frac{\epsilon_* + \epsilon_{\pi^Q} + \epsilon_s}{1-\gamma} + e,$$
where $e = 2\sum_{i=1}^n\left(\sum_{h\in H} h\, p^e_{h,i}(s,a)\right)(\epsilon_i - \epsilon_{\pi^Q}) + \gamma^T Q_{\max}$.

Proof. Let $p_i(s,a,t)$ for $t \in [0,T-1]$ be Bernoulli random variables expressing the probability of taking a state-action $(s_t,a_t)$ at step $t$, for which $(s_t,a_t) \in X_i$, when starting from state-action $(s,a)$ and following $\pi^Q$ thereafter. We have that $\forall i$:
$$\sum_{t=0}^{T-1}\gamma^t p_i(s,a,t) \le \sum_{t=0}^{T-1} p_i(s,a,t) = \sum_{h=1}^{T} h\, p_{h,i}(s,a) \le 2\sum_{h\in H} h\, p^e_{h,i}(s,a).$$
We also have that:
$$\gamma^{\min\{T,T_H\}} Q_{\max} \le \gamma^{T_H} Q_{\max} + \gamma^{T} Q_{\max} \le \epsilon_s + \gamma^T Q_{\max},$$
and the result follows from Lemma 12.5.

Using Bellman error MDPs

An important consequence of Lemma 12.3 is that in order to get a bound on the difference in performance between the optimal policy and the greedy policy over a value function, we only need a lower bound on the performance of an optimal policy and an upper bound on the performance of the greedy policy in their respective Bellman error MDPs. The algorithm presented in this paper relies on an optimistic approximation scheme. Because of that, we will be able to get a good bound on the performance of the optimal policy in its Bellman error MDP from the start, while our bound on the performance of the greedy policy will improve as more samples are gathered.

13 Useful lemmas

Lemma 13.1. Let $t_i$ for $i = 0, \dots, l$ be the outcomes of independent (but not necessarily identically distributed) random variables in $\{0,1\}$, with $P(t_i = 1) \ge p_i$. If $\frac{2}{m}\ln\frac{1}{\delta} < 1$ and:
$$\sum_{i=0}^{l} p_i \ge \frac{m}{1 - \sqrt{\frac{2}{m}\ln\frac{1}{\delta}}},$$
then $\sum_{i=0}^{l} t_i \ge m$ with probability at least $1-\delta$.

Proof. Let $\sum_{i=0}^{l} p_i = \mu$. From Chernoff's bound we have that the probability that $\sum_{i=0}^{l} t_i$ is less than $(1-\epsilon)\mu$ for $0 < \epsilon < 1$ is upper bounded by:
$$\delta = P\left(\sum_{i=0}^{l} t_i \le (1-\epsilon)\mu\right) \le e^{-\frac{\epsilon^2\mu}{2}}. \qquad (1)$$
Setting $(1-\epsilon)\mu = m$ we have that $\mu = \frac{m}{1-\epsilon}$. Substituting into equation 1 above and taking logarithms we have:
$$\ln\delta \le -\frac{\epsilon^2}{2}\,\frac{m}{1-\epsilon} \;\Rightarrow\; \frac{\epsilon^2}{1-\epsilon} \le \frac{2}{m}\ln\frac{1}{\delta} \;\Rightarrow\; \epsilon^2 < \frac{2}{m}\ln\frac{1}{\delta} \;\Rightarrow\; \epsilon < \sqrt{\frac{2}{m}\ln\frac{1}{\delta}}.$$
Substituting into $\mu = \frac{m}{1-\epsilon}$ concludes our proof.

Lemma 13.1 is an improvement over Li's lemma 56 (Li, 2009) in two ways. While Li's lemma requires that $\mu > 2m$ for all $\delta < 1$, for realistically large values of $m$ Lemma 13.1 yields values much closer to $m$. As an example, for $\delta = 10^{-9}$ and $m = 10^3$, $10^6$ and $10^9$, our bound yields $1.2556m$, $1.0065m$ and $1.0002m$ respectively. More importantly, our bound allows the success probability of each trial to be different, which will allow us to prove TCE PAC bounds, rather than just bounds on the number of significantly suboptimal steps.
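The constants quoted above are easy to reproduce; the following snippet (ours) simply evaluates the requirement of Lemma 13.1 for $\delta = 10^{-9}$ and compares it to the $2m$ required by Li's lemma 56.

```python
import math

delta = 1e-9
for m in (1e3, 1e6, 1e9):
    factor = 1.0 / (1.0 - math.sqrt(2.0 * math.log(1.0 / delta) / m))
    # Lemma 13.1 requires sum(p_i) >= factor * m; Li's lemma 56 requires more than 2m.
    print(f"m = {m:.0e}: required sum of p_i is {factor:.4f} m (vs. 2 m)")
```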
Lemma 13.2. $\tilde B$ is a $\gamma$-contraction in maximum norm.

Proof. Suppose $\|Q_1 - Q_2\|_\infty = \epsilon$. For any $(s,a,M)$ we have:
$$\begin{aligned}
\tilde B Q_1(s,a,M) &= \min\big\{Q_{\max},\, F(Q_1, N(U,s,a,M)) + d(s,a,M,N(U,s,a,M),d_{known})\big\}\\
&\le \min\big\{Q_{\max},\, F(Q_2, N(U,s,a,M)) + \gamma\epsilon + d(s,a,M,N(U,s,a,M),d_{known})\big\}\\
&\le \gamma\epsilon + \min\big\{Q_{\max},\, F(Q_2, N(U,s,a,M)) + d(s,a,M,N(U,s,a,M),d_{known})\big\}\\
&= \gamma\epsilon + \tilde B Q_2(s,a,M)\\
\Rightarrow\;\tilde B Q_1(s,a,M) &\le \gamma\epsilon + \tilde B Q_2(s,a,M).
\end{aligned}$$
Similarly we have that $\tilde B Q_2(s,a,M) \le \gamma\epsilon + \tilde B Q_1(s,a,M)$, which completes our proof.

Versions of Lemma 13.3 below have appeared in the literature before. It is included here for completeness.

Lemma 13.3. Let $\hat B$ be a $\gamma$-contraction with fixed point $\hat Q$, and $Q$ the output of $\frac{\ln\frac{\epsilon}{Q_{\max}-Q_{\min}}}{\ln\gamma}$ iterations of value iteration using $\hat B$ (note that $\frac{\ln\frac{\epsilon}{Q_{\max}-Q_{\min}}}{\ln\gamma} \le \frac{1}{1-\gamma}\ln\frac{Q_{\max}-Q_{\min}}{\epsilon}$). Then if $Q_{\min} \le \hat Q(s,a,M) \le Q_{\max}$ and $Q_{\min} \le Q_0(s,a,M) \le Q_{\max}$ $\forall (s,a,M)$, where $Q_0(s,a,M)$ is the initial value for $(s,a,M)$:
$$-\epsilon \le Q(s,a,M) - \hat B Q(s,a,M) \le \epsilon \quad \forall (s,a,M).$$

Proof. If $\gamma = 0$, $\hat B Q_0(s,a,M) = \hat Q(s,a,M)$ $\forall (s,a,M)$. Otherwise, we have that:
$$\|\hat Q - Q_0\|_\infty \le Q_{\max} - Q_{\min}.$$
Let $Q_n$ be the value function at the $n$-th step of value iteration. Since $\hat B$ is a $\gamma$-contraction with fixed point $\hat Q$, for $n \ge 1$:
$$\begin{aligned}
\|\hat Q - \hat B Q_n\|_\infty &= \|\hat B\hat Q - \hat B Q_n\|_\infty\\
&\le \gamma\|\hat Q - Q_{n-1}\|_\infty = \gamma\|\hat Q - \hat B Q_{n-2}\|_\infty\\
&\le \gamma^2\|\hat Q - Q_{n-2}\|_\infty\\
&\;\;\vdots\\
&\le \gamma^n\|\hat Q - Q_0\|_\infty\\
\Rightarrow\;\|\hat Q - \hat B Q_n\|_\infty &\le \gamma^n(Q_{\max} - Q_{\min}).
\end{aligned}$$
We want $\gamma^n(Q_{\max} - Q_{\min}) \le \epsilon$, which after rearranging and taking logarithms becomes:
$$n \ge \log_\gamma\frac{\epsilon}{Q_{\max}-Q_{\min}} = \frac{\ln\frac{\epsilon}{Q_{\max}-Q_{\min}}}{\ln\gamma}.$$
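For intuition on how Lemma 13.3 is used later (e.g., to bound the number of value iteration sweeps in the proof of Lemma 8.4), here is a minimal sketch with a generic clipped optimal Bellman backup standing in for the contraction. The operator, MDP, and parameter values are our own illustrative stand-ins, not the paper's $\tilde B$.

```python
import math
import numpy as np

gamma, eps = 0.95, 1e-3
Qmin, Qmax = 0.0, 1.0 / (1 - gamma)

rng = np.random.default_rng(2)
nS, nA = 8, 3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.uniform(0.0, 1.0, size=(nS, nA))

def B(Q):
    """Optimal Bellman backup clipped at Qmax (a gamma-contraction, cf. Lemma 13.2)."""
    return np.minimum(Qmax, R + gamma * P @ Q.max(axis=1))

# Iteration count from Lemma 13.3: enough to drive the Bellman residual below eps.
n_iters = math.ceil(math.log(eps / (Qmax - Qmin)) / math.log(gamma))

Q = np.full((nS, nA), Qmin)      # any initialization in [Qmin, Qmax] works
for _ in range(n_iters):
    Q = B(Q)

residual = np.abs(Q - B(Q)).max()
assert residual <= eps           # -eps <= Q - B(Q) <= eps, as Lemma 13.3 guarantees
```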
Bellman error analysis

Lemma 13.4 below bounds the probability that an individual approximation unit will have Bellman error of unacceptably high magnitude during a non-delay step.

Lemma 13.4. If $b = \hat Q_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$, then for a fixed $u(s,a,M)$ belonging to a fixed $\tilde U$:
$$P\left(F^{\pi^*}(Q_{\tilde U}, u(s,a,M)) - B^{\pi^*} Q_{\tilde U}(s,a,M) \le -\epsilon_c\right) \le \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2},$$
and:
$$P\left(F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(s,a,M)) - B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) \ge \epsilon_c + \frac{2b}{\sqrt{k_a(s,a,M)}}\right) \le \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}.$$

Proof. Let $Y$ be the set of $k_a(s,a,M)$ samples used by $F^{\pi}(Q_{\tilde U}, u(s,a,M))$ at $(s,a,M)$ and define $f(x_1,\dots,x_{k_a}) = F^{\pi}(Q_{\tilde U}, u(s,a,M))$, where $x_1,\dots,x_{k_a}$ are realizations of independent (from the Markov property) variables whose outcomes are possible next states, one for each state-action-MDP in $Y$. The outcomes of the variables (which is where the Markov property ensures independence) are the next states the transitions lead to, not the state-action-MDP triples the samples originate from. The state-action-MDP triples the samples originate from are fixed with respect to $f$, and no assumptions are made about their distribution. Since $\tilde U$ is fixed, $Q_{\tilde U}$ is fixed with respect to $f$ (we are examining the effects of a single application of $F^{\pi}(Q_{\tilde U}, u(s,a,M))$ to the fixed function $Q_{\tilde U}$, while varying the next states that samples in $u(s,a,M)$ land on). If $k_a(s,a,M) > 0$ then $\forall i \in [1, k_a(s,a,M)]$:
$$\sup_{x_1,\dots,x_{k_a},\hat x_i}\left|f(x_1,\dots,x_{k_a}) - f(x_1,\dots,x_{i-1},\hat x_i,x_{i+1},\dots,x_{k_a(s,a,M)})\right| = c_i \le \frac{\hat Q_{\max}}{k_a(s,a,M)},$$
and
$$\sum_{i=1}^{k_a(s,a,M)} c_i^2 \le k_a(s,a,M)\,\frac{\hat Q_{\max}^2}{k_a(s,a,M)^2} = \frac{\hat Q_{\max}^2}{k_a(s,a,M)}.$$
From McDiarmid's inequality we have:
$$P\left(F^{\pi}(Q_{\tilde U}, u(s,a,M)) - E\left[F^{\pi}(Q_{\tilde U}, u(s,a,M))\right] \le -\epsilon_0\right) \le P\left(f(x_1,\dots,x_{k_a}) - E[f(x_1,\dots,x_{k_a})] \le -\epsilon_0\right) \le e^{-\frac{2\epsilon_0^2}{\sum_{i=1}^{k_a(s,a,M)} c_i^2}} \le e^{-\frac{2\epsilon_0^2 k_a(s,a,M)}{\hat Q_{\max}^2}},$$
and:
$$P\left(F^{\pi}(Q_{\tilde U}, u(s,a,M)) - E\left[F^{\pi}(Q_{\tilde U}, u(s,a,M))\right] \ge \epsilon_0\right) \le P\left(f(x_1,\dots,x_{k_a}) - E[f(x_1,\dots,x_{k_a})] \ge \epsilon_0\right) \le e^{-\frac{2\epsilon_0^2 k_a(s,a,M)}{\hat Q_{\max}^2}}.$$
If $k_a(s,a,M) = 0$ then $e^{-\frac{2\epsilon_0^2 k_a(s,a,M)}{\hat Q_{\max}^2}} = 1$ and the above is trivially true.

From Definition 5.13 we have that:
$$B^{\pi^*} Q_{\tilde U}(s,a,M) - \epsilon_c \le E\left[F^{\pi^*}(Q_{\tilde U}, u(s,a,M))\right] - \frac{b}{\sqrt{k_a(s,a,M)}}$$
and
$$E\left[F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(s,a,M))\right] \le B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) + \epsilon_c + \frac{b}{\sqrt{k_a(s,a,M)}},$$
where the expectations are over the next states that the samples in $u(s,a,M)$ used by $F^{\pi^*}$ and $F^{\pi^{Q_{\tilde U}}}$ respectively land on. Substituting $\frac{b}{\sqrt{k_a(s,a,M)}}$ for $\epsilon_0$ above we have that:
$$\begin{aligned}
P\left(F^{\pi^*}(Q_{\tilde U}, u(s,a,M)) - B^{\pi^*} Q_{\tilde U}(s,a,M) \le -\epsilon_c\right) &\le P\left(F^{\pi^*}(Q_{\tilde U}, u(s,a,M)) - E\left[F^{\pi^*}(Q_{\tilde U}, u(s,a,M))\right] \le -\frac{b}{\sqrt{k_a(s,a,M)}}\right)\\
&\le e^{-2\left(\frac{b}{\sqrt{k_a(s,a,M)}}\right)^2\frac{k_a(s,a,M)}{\hat Q_{\max}^2}} = e^{-2\left(\hat Q_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2 k_a(s,a,M)}}\right)^2\frac{k_a(s,a,M)}{\hat Q_{\max}^2}}\\
&= \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2},
\end{aligned}$$
and:
$$\begin{aligned}
P\left(F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(s,a,M)) - B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) \ge \epsilon_c + \frac{2b}{\sqrt{k_a(s,a,M)}}\right) &\le P\left(F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(s,a,M)) - E\left[F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(s,a,M))\right] \ge \frac{b}{\sqrt{k_a(s,a,M)}}\right)\\
&\le e^{-2\left(\frac{b}{\sqrt{k_a(s,a,M)}}\right)^2\frac{k_a(s,a,M)}{\hat Q_{\max}^2}} = \frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}.
\end{aligned}$$

Given a bound on the probability that an individual approximation unit has Bellman error of unacceptably high magnitude, Lemma 13.5 uses the union bound to bound the probability that there exists at least one approximation unit in $\tilde U$, for some published $\tilde U$, with Bellman error of unacceptably high magnitude during a non-delay step.

Lemma 13.5. If $b = \hat Q_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$, the probability that for any published $\tilde U$ there exists at least one $u(s,a,M) \in \tilde U$ such that:
$$F^{\pi^*}(Q_{\tilde U}, u(s,a,M)) - B^{\pi^*} Q_{\tilde U}(s,a,M) \le -\epsilon_c \qquad (2)$$
or:
$$F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(s,a,M)) - B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) \ge \epsilon_c + \frac{2b}{\sqrt{k_a(s,a,M)}} \qquad (3)$$
during a non-delay step, is upper bounded by $\frac{\delta}{2}$.

Proof. When $\tilde U$ is the empty set no approximation units exist, so the above is trivially true. Otherwise, at most $\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})$ distinct non-empty $\tilde U$ exist during non-delay steps. Thus, there are at most $2\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2$ ways for at least one of the at most $N_{SAM}(d_{known})$ approximation units to fail at least once during non-delay steps ($\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2$ ways each for equation 2 or equation 3 to be true at least once), each with probability at most $\frac{\delta}{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}$. From the union bound, we have that the probability that for any published $\tilde U$ there exists at least one $u(s,a,M) \in \tilde U$ such that equation 2 or 3 is true during a non-delay step, is upper bounded by $\frac{\delta}{2}$.

Based on Lemma 13.5 we can now bound the probability that any $(s,a,M)$ will have Bellman error of unacceptably high magnitude during a non-delay step.

Lemma 13.6. If $b = \hat Q_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$, then:
$$Q_{\tilde U}(s,a,M) - B^{\pi^*} Q_{\tilde U}(s,a,M) > -\epsilon_a - 2\epsilon_c$$
and:
$$Q_{\tilde U}(s,a,M) - B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) < \epsilon_a + 2\epsilon_c + \frac{2b}{\sqrt{k_a(s,a,M)}} + 2d(s,a,M,N(\tilde U,s,a,M),d_{known})$$
$\forall (s,a,M,\tilde U)$ simultaneously with probability $1 - \frac{\delta}{2}$.

Proof. When $\tilde U$ is the empty set, $Q_{\tilde U}(s,a,M) = Q_{\max}$. Since $B^{\pi^*} Q_{\tilde U}(s,a,M) \le Q_{\max}$, $Q_{\tilde U}(s,a,M) - B^{\pi^*} Q_{\tilde U}(s,a,M) \ge -\epsilon_a - 2\epsilon_c$. Otherwise, $\forall (s,a,M,\tilde U)$ with probability $1-\frac{\delta}{2}$:
$$\begin{aligned}
B^{\pi^*} Q_{\tilde U}(s,a,M) &= \min\big\{Q_{\max},\, B^{\pi^*} Q_{\tilde U}(s,a,M)\big\}\\
&\le \min\big\{Q_{\max},\, B^{\pi^*} Q_{\tilde U}(N(\tilde U,s,a,M)) + d(s,a,M,N(\tilde U,s,a,M),d_{known}) + \epsilon_c\big\}\\
&< \min\big\{Q_{\max},\, F^{\pi^*}(Q_{\tilde U}, u(N(\tilde U,s,a,M))) + \epsilon_c + d(s,a,M,N(\tilde U,s,a,M),d_{known}) + \epsilon_c\big\}\\
&\le \tilde B^{\pi^*} Q_{\tilde U}(s,a,M) + 2\epsilon_c\\
&\le \tilde B Q_{\tilde U}(s,a,M) + 2\epsilon_c\\
&\le Q_{\tilde U}(s,a,M) + \epsilon_a + 2\epsilon_c.
\end{aligned}$$
In step 1 we used the fact that $B^{\pi^*} Q_{\tilde U}(s,a,M) \le Q_{\max}$, in step 2 we used Definition 5.13, in step 3 we used Lemma 13.5, in step 4 we used Definition 5.11, in step 5 we used the fact that $\tilde B Q_{\tilde U}(s,a,M) \ge \tilde B^{\pi^*} Q_{\tilde U}(s,a,M)$, and in step 6 we used the fact that $-\epsilon_a \le Q_{\tilde U}(s,a,M) - \tilde B Q_{\tilde U}(s,a,M)$.
When no approximation units containing active samples exist, $d(s,a,M,N(\tilde U,s,a,M))$ is not defined. Otherwise, $\forall (s,a,M,\tilde U)$ with probability $1-\frac{\delta}{2}$:
$$\begin{aligned}
B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) &= \min\big\{Q_{\max},\, B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M)\big\}\\
&\ge \min\big\{Q_{\max},\, B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(N(\tilde U,s,a,M)) - d(s,a,M,N(\tilde U,s,a,M),d_{known}) - \epsilon_c\big\}\\
&> \min\Big\{Q_{\max},\, F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(N(\tilde U,s,a,M))) - \epsilon_c - \frac{2b}{\sqrt{k_a(s,a,M)}} - \epsilon_c - d(s,a,M,N(\tilde U,s,a,M),d_{known})\Big\}\\
&\ge \min\big\{Q_{\max},\, F^{\pi^{Q_{\tilde U}}}(Q_{\tilde U}, u(N(\tilde U,s,a,M))) + d(s,a,M,N(\tilde U,s,a,M),d_{known})\big\} - 2\epsilon_c - \frac{2b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde U,s,a,M),d_{known})\\
&\ge \tilde B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) - 2\epsilon_c - \frac{2b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde U,s,a,M),d_{known})\\
&= \tilde B Q_{\tilde U}(s,a,M) - 2\epsilon_c - \frac{2b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde U,s,a,M),d_{known})\\
&\ge Q_{\tilde U}(s,a,M) - \epsilon_a - 2\epsilon_c - \frac{2b}{\sqrt{k_a(s,a,M)}} - 2d(s,a,M,N(\tilde U,s,a,M),d_{known}).
\end{aligned}$$
In step 1 we used the fact that $B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) \le Q_{\max}$, in step 2 we used Definition 5.13, in step 3 we used Lemma 13.5, in step 5 we used Definition 5.11, in step 6 we used the fact that $\tilde B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) = \tilde B Q_{\tilde U}(s,a,M)$, and in step 7 we used the fact that $Q_{\tilde U}(s,a,M) - \tilde B Q_{\tilde U}(s,a,M) \le \epsilon_a$. Note that the first half of Lemma 13.5 (used in the first half of the proof) and the second half (used in the second half of the proof) hold simultaneously with probability at least $1-\frac{\delta}{2}$, thus we do not need to take a union bound over the individual probabilities.

Lemma 13.7 bounds the number of times we can encounter approximation units with fewer than $k$ active samples.

Lemma 13.7. Let $(s_{1,j}, s_{2,j}, s_{3,j}, \dots)$ for $j \in \{1, \dots, k_p\}$ be the random paths generated in MDPs $M_1, \dots, M_{k_p}$ respectively on some execution of Algorithm 1. Let $\tilde U(t,j)$ be the approximation set used by Algorithm 1 at step $t$ in MDP $M_j$. Let $\tau(t,j)$ be the number of steps from step $t$ in MDP $M_j$ to the next delay step, or to the first step $t'$ for which $\tilde U(t,j) \ne \tilde U(t',j)$, whichever comes first. Let $T_H = \left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil$ and define $H = \{1, 2, 4, \dots, 2^i\}$ where $i$ is the largest integer such that $2^i \le T_H$. Let $k_a^-$ be the largest value in $K_a$ that is strictly smaller than $k_a$, or $0$ if such a value does not exist. Let $X_{k_a}(t,j)$ be the set of state-action-MDP triples at step $t$ in MDP $M_j$ such that no approximation unit with at least $k_a$ active samples exists within $d_{known}$ distance, and at least one approximation unit with at least $k_a^-$ active samples exists within $d_{known}$ distance ($k_a^- = 0$ covers both the case where an approximation unit with $0$ active samples exists within $d_{known}$ distance, and the case where no approximation units exist within $d_{known}$ distance). Define $p_{h,k_a}(s_{t,j})$ for $k_a \in K_a$ to be Bernoulli random variables that express the following conditional probability: given the state of the approximation set and sample candidate queue at step $t$ for MDP $M_j$, exactly $h$ state-action-MDP triples in $X_{k_a}(t,j)$ are encountered in MDP $M_j$ during the next $\min\{T_H, \tau(t,j)\}$ steps. Let $p^e_{h,k_a}(s_{t,j}) = \sum_{i=h}^{2h-1} p_{i,k_a}(s_{t,j})$. Finally let $T_j$ be the set of non-delay steps (see Definition 7.5) for MDP $M_j$.

If $\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{k_p N_{SAM}(d_{known})} < 1$ and $N_{SAM}(d_{known}) \ge 2$, with probability $1-\frac{\delta}{2}$:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H} h\, p^e_{h,k_a}(s_{t,j}) < \frac{(k_a - k_a^- + k_p)\left(1+\left\lceil\log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}}$$
$\forall k_a \in K_a$ and $\forall h \in H$ simultaneously.

Proof. From the Markov property we have that $p^e_{h,k_a}(s_{t,j})$ variables at least $T_H$ steps apart (of the same or a different $j$) are independent. (Careful readers may notice that what happens at step $t$ affects which variables are selected at future timesteps. This is not a problem: we only care that the outcomes of the variables are independent given their selection.) Define $T^i_H$ for $i \in \{0, 1, \dots, T_H - 1\}$ to be the (infinite) set of timesteps for which $t \in \{i, i+T_H, i+2T_H, \dots\}$.
$k_a - k_a^-$ samples will be added to an approximation unit with $k_a^-$ active samples before it progresses to $k_a$ active samples. Additionally, at most $N_{SAM}(d_{known})$ approximation units can be added to the approximation set, and we are exploring in $k_p$ MDPs in parallel. Thus, at most $(k_a - k_a^- + k_p - 1) N_{SAM}(d_{known})$ state-action-MDP triples within $d_{known}$ distance of approximation units with $k_a^-$ active samples can be encountered on non-delay steps. (This covers the worst case, which could happen if every time a state-action-MDP triple within $d_{known}$ distance of an approximation unit with $k_a - 1$ samples is encountered by one of the MDPs, the remaining $k_p - 1$ MDPs encounter a state-action-MDP triple within $d_{known}$ of the same approximation unit simultaneously.)

Let us assume that there exists an $i \in \{0, 1, \dots, T_H - 1\}$ and $h \in H$ such that:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j\cap T^i_H} p^e_{h,k_a}(s_{t,j}) \ge \frac{(k_a - k_a^- + k_p) N_{SAM}(d_{known})}{h\left(1-\sqrt{\frac{2h\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}\right)}.$$
From Lemma 13.1 it follows that with probability at least $1-\frac{\delta}{2\lceil 1+\log_2 k\rceil}$, at least $(k_a - k_a^- + k_p)N_{SAM}(d_{known})$ state-action-MDP triples within $d_{known}$ distance of approximation units with $k_a^-$ active samples will be encountered on non-delay steps, which is a contradiction. It must therefore be the case that:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j\cap T^i_H} p^e_{h,k_a}(s_{t,j}) < \frac{(k_a - k_a^- + k_p) N_{SAM}(d_{known})}{h\left(1-\sqrt{\frac{2h\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}\right)}$$
for all $i \in \{0, 1, \dots, T_H - 1\}$ and $h \in H$ simultaneously, with probability at least $1-\frac{\delta}{2\lceil 1+\log_2 k\rceil}$, which implies that:
$$\begin{aligned}
\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H} h\, p^e_{h,k_a}(s_{t,j}) &\le \frac{(k_a - k_a^- + k_p)\,|H|\,T_H\, N_{SAM}(d_{known})}{1-\sqrt{\frac{2 T_H\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}}\\
&\le \frac{(k_a - k_a^- + k_p)\left(1+\left\lceil\log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}}
\end{aligned}$$
with probability at least $1-\frac{\delta}{2\lceil 1+\log_2 k\rceil}$, for all $h \in H$ simultaneously.

From the union bound we have that, since $k_a$ can take $\lceil 1+\log_2 k\rceil$ values, with probability $1-\frac{\delta}{2}$:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H} h\, p^e_{h,k_a}(s_{t,j}) < \frac{(k_a - k_a^- + k_p)\left(1+\left\lceil\log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(k_a-k_a^-+k_p)N_{SAM}(d_{known})}}}$$
$\forall k_a \in K_a$ and $\forall h \in H$ simultaneously.

8 Proofs of main lemmas and theorems

Lemma 8.1. The space complexity of algorithm 1 is $O(N_{SAM}(d_{known}))$ per concurrent MDP.

Proof. Algorithm 1 only needs access to the value and number of active samples of each approximation unit, and from the definition of the covering number and Algorithm 2 we have that at most $N_{SAM}(d_{known})$ approximation units can be added.

Lemma 8.2. The space complexity of algorithm 2 is $O(k|A|N_{SAM}(d_{known}))$.

Proof. From the definition of the covering number we have that at most $N_{SAM}(d_{known})$ approximation units can be added. From the definition of an approximation unit and Algorithm 2 we have that at most $k$ samples can be added per approximation unit, and each sample requires at most $O(|A|)$ space.

Lemma 8.3. The per-step computational complexity of algorithm 1 is bounded above by $O(|A|N_{SAM}(d_{known}))$.

Proof. A naive search for the nearest approximation unit of each of the at most $|A|$ actions needs to perform at most $O(N_{SAM}(d_{known}))$ operations.
Since every step of algorithm 1 results in a sample being processed by algorithm 2, we will be bounding the per-sample computational complexity of algorithm 2.

Lemma 8.4. Let $c$ be the maximum number of approximation units to which a single sample can be added ($c$ will depend on the dimensionality of the state-action-MDP space; for domains that are big enough to be interesting it will be significantly smaller than other quantities of interest). The per-sample computational complexity of algorithm 2 is bounded above by:
$$O\left(c\,k|A|N_{SAM}(d_{known}) + \frac{k|A|N_{SAM}(d_{known})}{\ln\gamma}\ln\frac{\epsilon_a}{Q_{\max}}\right).$$

Proof. Lines 4 through 12 will be performed only once per sample. Line 6 requires at most $O(N_{SAM}(d_{known}))$ operations. Line 7 requires at most $O(kN_{SAM}(d_{known}))$ operations. Line 9 can reuse the result of line 6 and only check whether the approximation unit added in line 7 (if any) has $k$ samples. Line 10 requires at most $O(N_{SAM}(d_{known}))$ operations.

Given a pointer to $N(U,s',a',M)$ for every next state-action-MDP triple, line 14 requires at most $O(k|A|N_{SAM}(d_{known}))$ operations ($O(k|A|)$ for each approximation unit), and since $\tilde B$ is a $\gamma$-contraction in maximum norm (Lemma 13.2), it will be performed at most $\frac{\ln\frac{\epsilon_a}{\hat Q_{\max}}}{\ln\gamma}$ times per sample (Lemma 13.3), for a total cost of $O\left(\frac{k|A|N_{SAM}(d_{known})}{\ln\gamma}\ln\frac{\epsilon_a}{\hat Q_{\max}}\right)$. (Note that the per-sample cost does not increase even if multiple samples enter through line 4 before publishing $\tilde U$.) This step dominates the computational cost of the learner.

Lines 15 and 16 require at most $O(N_{SAM}(d_{known}))$ operations each, and will be performed at most $\frac{\ln\frac{\epsilon_a}{\hat Q_{\max}}}{\ln\gamma}$ times per sample.

Maintaining a pointer to $N(U,s',a',M)$ for every next state-action-MDP triple requires the following: 1) Computing $N(U,s',a',M)$ every time a sample is added to the sample set, for a cost of up to $O(|A|N_{SAM}(d_{known}))$ operations per sample. 2) Updating all $N(U,s',a',M)$ pointers every time the number of active samples for an approximation unit changes: there can exist at most $k|A|N_{SAM}(d_{known})$ pointers in the sample set, and a single sample can affect the number of active samples of at most $c$ approximation units. Each pointer can be updated against a particular approximation unit in constant time, thus the cost of maintaining the pointers is at most $O(c\,k|A|N_{SAM}(d_{known}))$ operations per sample.

Theorem 8.5. Let $(s_{1,j}, s_{2,j}, s_{3,j}, \dots)$ for $j \in \{1, \dots, k_p\}$ be the random paths generated in MDPs $M_1, \dots, M_{k_p}$ respectively on some execution of algorithm 1, and $\tilde\pi$ be the (non-stationary) policy followed by algorithm 1. Let $b = \hat Q_{\max}\sqrt{\frac{\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{2}}$, $k \ge \frac{\hat Q_{\max}^2}{2\epsilon_s^2(1-\gamma)^2}\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}$, $\epsilon_c$ be defined as in definition 5.13, $\epsilon_a$ be defined as in algorithm 2, $k_p$ be defined as in definition 7.1, and $T_j$ be the set of non-delay steps (see definition 7.5) for MDP $M_j$. If $\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})} < 1$ and $N_{SAM}(d_{known}) \ge 2$, with probability at least $1-\delta$, for all $t, j$:
$$V^{\tilde\pi}(s_{t,j}, M_j) \ge V^*(s_{t,j}, M_j) - \frac{4\epsilon_c + 2\epsilon_a + 3\epsilon_s}{1-\gamma} - \epsilon_e(t,j),$$
where:
$$\sum_{j=1}^{k_p}\sum_{t=0}^{\infty}\epsilon_e(t,j) = \sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j) + \sum_{j=1}^{k_p}\sum_{t\notin T_j}\epsilon_e(t,j),$$
with:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j) \approx \tilde O\left(\frac{\left(\frac{\hat Q_{\max}}{\epsilon_s(1-\gamma)} + k_p\right)N_{SAM}(d_{known})\,\hat Q_{\max}}{1-\gamma}\right)$$
and:
$$\sum_{t\notin T_j}\epsilon_e(t,j) \le Q_{\max}\sum_{k_a\in K_a}\sum_{i=1}^{N_{SAM}(d_{known})} D_{i,j,k_a},$$
where $D_{i,j,k_a}$ is the delay of the $k_a$-th sample for $k_a \in K_a$ in the $i$-th approximation unit, with respect to MDP $M_j$. ($f(n) = \tilde O(g(n))$ is shorthand for $f(n) = O(g(n)\log^c g(n))$ for some $c$.)

Note that the probability of success $1-\delta$ holds for all timesteps in all MDPs simultaneously, and $\sum_{j=1}^{k_p}\sum_{t=0}^{\infty}\epsilon_e(t,j)$ is an undiscounted infinite sum. Unlike $\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j)$, which is a sum over all MDPs, $\sum_{t\notin T_j}\epsilon_e(t,j)$ gives a bound on the total cost due to value function update delays for each MDP separately.
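Because $k$ appears on both sides of the sample-size requirement above (inside the logarithm), it is convenient to solve for it numerically. The sketch below (ours) does so by fixed-point iteration; all parameter values are illustrative placeholders, not values from the paper.

```python
import math

# Illustrative values only.
Q_hat_max = 10.0
gamma = 0.9
eps_s = 0.5
delta = 0.1
N_sam = 100.0            # covering number N_SAM(d_known)

k = 1.0
for _ in range(100):     # k appears inside the log, so iterate to a fixed point
    rhs = (Q_hat_max ** 2 / (2 * eps_s ** 2 * (1 - gamma) ** 2)) * \
          math.log(4 * math.ceil(1 + math.log2(k)) * N_sam ** 2 / delta)
    k = max(rhs, 1.0)
print(math.ceil(k))      # samples required per approximation unit under these settings
```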
Proof. We have that:
$$Q_{\tilde U}(s,a,M) - B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) \le Q_{\max} \quad \forall (s,a,M,\tilde U).$$
Since $k_a^- \ge \frac{k_a}{2}$ $\forall k_a \in K_a \setminus \{1\}$, we have that for any $(s,a,M)$ within $d_{known}$ distance of an approximation unit with $k_a^- > 0$ active samples, with probability $1-\frac{\delta}{2}$:
$$Q_{\tilde U}(s,a,M) - B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) < \epsilon_a + 2\epsilon_c + 2\hat Q_{\max}\sqrt{\frac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}.$$
Substituting $k \ge \frac{\hat Q_{\max}^2}{2\epsilon_s^2(1-\gamma)^2}\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}$ for $k_a$ into Lemma 13.6, we have that with probability at least $1-\frac{\delta}{2}$, for any $(s,a,M)$ within $d_{known}$ of an approximation unit with $k$ active samples (notice that in this case $d(s,a,M,N(s,a,M),d_{known}) = 0$):
$$Q_{\tilde U}(s,a,M) - B^{\pi^{Q_{\tilde U}}} Q_{\tilde U}(s,a,M) < \epsilon_a + 2\epsilon_c + 2(1-\gamma)\epsilon_s.$$

Let $T_H$, $H$, $\tilde U(t,j)$, $\tau(t,j)$, and $p^e_{h,k_a}(s_{t,j})$ be defined as in Lemma 13.7. Even though $\tilde\pi$ is non-stationary, it is comprised of stationary segments: starting from step $t$ in MDP $M_j$, $\tilde\pi$ is equal to $\pi^{Q_{\tilde U(t,j)}}$ for at least $\tau(t,j)$ steps. Substituting the above into Lemma 12.6 we have that with probability at least $1-\frac{\delta}{2}$, for all $M_j$ and $t \in T_j$:
$$V^*(s_{t,j}, M_j) - V^{\tilde\pi}(s_{t,j}, M_j) \le \frac{4\epsilon_c + 2\epsilon_a + 3\epsilon_s}{1-\gamma} + \epsilon_e(t,j),$$
where:
$$\epsilon_e(t,j) = \gamma^{\tau(t,j)} Q_{\max} + 2\sum_{h\in H} h\, p^e_{h,1}(s_{t,j})\, Q_{\max} + 2\sum_{k_a\in K_a\setminus\{1\}}\sum_{h\in H} h\, p^e_{h,k_a}(s_{t,j})\, 2\hat Q_{\max}\sqrt{\frac{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}{k_a}}.$$
Define:
$$c_0 = \left(1+\left\lceil\log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known}),$$
and, for brevity, write $\Lambda = \ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}$ and $\rho_x = \sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{x\, N_{SAM}(d_{known})}}$. From the above it follows that:
$$\begin{aligned}
\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j) &= \sum_{j=1}^{k_p}\sum_{t\in T_j}\left(\gamma^{\tau(t,j)} Q_{\max} + 2\sum_{h\in H} h\, p^e_{h,1}(s_{t,j})\, Q_{\max} + 2\sum_{k_a\in K_a\setminus\{1\}}\sum_{h\in H} h\, p^e_{h,k_a}(s_{t,j})\, 2\hat Q_{\max}\sqrt{\frac{2\Lambda}{k_a}}\right)\\
&= \sum_{j=1}^{k_p}\sum_{t\in T_j}\gamma^{\tau(t,j)} Q_{\max} + 2\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H} h\, p^e_{h,1}(s_{t,j})\, Q_{\max} + 2\sum_{k_a\in K_a\setminus\{1\}}\sum_{j=1}^{k_p}\sum_{t\in T_j}\sum_{h\in H} h\, p^e_{h,k_a}(s_{t,j})\, 2\hat Q_{\max}\sqrt{\frac{2\Lambda}{k_a}}\\
&< \frac{k_p\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})\, Q_{\max}}{1-\gamma} + \frac{2(1+k_p)\,c_0\, Q_{\max}}{1-\rho_{1+k_p}} + 2\sum_{k_a\in K_a\setminus\{1\}}\frac{(k_a - k_a^- + k_p)\,c_0\, 2\hat Q_{\max}\sqrt{\frac{2\Lambda}{k_a}}}{1-\rho_{k_a-k_a^-+k_p}},
\end{aligned}$$
where the first term uses the fact that there can be at most $\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})$ policy changes, and the remaining terms use Lemma 13.7.
Using $k_a - k_a^- \le \frac{k_a}{2}$, $\rho_{k_a-k_a^-+k_p} \le \rho_{1+k_p}$, and bounding the resulting sums over $K_a\setminus\{1\}$ by geometric series, the right-hand side above is at most:
$$\frac{\left(2+\lceil 3+\log_2 k\rceil k_p\right)c_0\, Q_{\max} + \left(2\sqrt{k}\left(3+\frac{2}{\sqrt{2}}\right) + 4 k_p\left(1+\frac{2}{\sqrt{2}}\right)\right)c_0\,\hat Q_{\max}\sqrt{2\Lambda}}{1-\rho_{1+k_p}} \le \frac{\left(\left(2+\lceil 3+\log_2 k\rceil k_p\right) + 9\sqrt{k} + 10 k_p\right)c_0\,\hat Q_{\max}\sqrt{2\Lambda}}{1-\rho_{1+k_p}},$$
where the last step uses the fact that $\hat Q_{\max}\sqrt{2\Lambda} \ge Q_{\max}$. Expanding $c_0$, $\Lambda$, and $\rho_{1+k_p}$, we therefore have:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j) < \frac{\left(1+\left\lceil\log_2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\right\rceil\right)\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil N_{SAM}(d_{known})\,\hat Q_{\max}\left(\left(2+\lceil 3+\log_2 k\rceil k_p\right) + 9\sqrt{k} + 10 k_p\right)\sqrt{2\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}}}{1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}}}.$$
Since Lemma 13.6 (used to bound the Bellman error of each $(s,a,M,\tilde U)$) and Lemma 13.7 (used to bound how many times each $(s,a,M,\tilde U)$ is encountered) hold with probability at least $1-\frac{\delta}{2}$ each, the bound above holds with probability at least $1-\delta$. Since for realistically large $N_{SAM}(d_{known})$:
$$1-\sqrt{\frac{2\left\lceil\frac{1}{1-\gamma}\ln\frac{Q_{\max}}{\epsilon_s}\right\rceil\ln\frac{2\lceil 1+\log_2 k\rceil}{\delta}}{(1+k_p)N_{SAM}(d_{known})}} \approx 1,$$
and substituting $k$ with $\frac{c\,\hat Q_{\max}^2}{2\epsilon_s^2(1-\gamma)^2}\ln\frac{4\lceil 1+\log_2 k\rceil N_{SAM}(d_{known})^2}{\delta}$ for some constant $c$, we have that, with probability $1-\delta$:
$$\sum_{j=1}^{k_p}\sum_{t\in T_j}\epsilon_e(t,j) \approx \tilde O\left(\frac{\left(\frac{\hat Q_{\max}}{\epsilon_s(1-\gamma)} + k_p\right)N_{SAM}(d_{known})\,\hat Q_{\max}}{1-\gamma}\right).$$

From the definition of a delay step (Definition 7.5) we have that the maximum number of delay steps in MDP $M_j$ is $\sum_{k_a\in K_a}\sum_{i=1}^{N_{SAM}(d_{known})} D_{i,j,k_a}$. Allowing for the algorithm to perform arbitrarily badly ($\epsilon_e(t,j) = Q_{\max}$) on every delay step we have that $\forall j$:
$$\sum_{t\notin T_j}\epsilon_e(t,j) \le Q_{\max}\sum_{k_a\in K_a}\sum_{i=1}^{N_{SAM}(d_{known})} D_{i,j,k_a}.$$

References

Li, L. 2009. A unifying framework for computational reinforcement learning theory. Ph.D. Dissertation.

Williams, R., and Baird, L. 1993. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, College of Computer Science.