PAC Reinforcement Learning Bounds for RTDP and Rand-RTDP

Alexander L. Strehl, Lihong Li, Michael L. Littman
Computer Science Dept., Rutgers University, Piscataway, NJ 08854 USA
strehl@cs.rutgers.edu, lihong@cs.rutgers.edu, mlittman@cs.rutgers.edu

Abstract

Real-time Dynamic Programming (RTDP) is a popular algorithm for planning in a Markov Decision Process (MDP). It can also be viewed as a learning algorithm, where the agent improves the value function and policy while acting in an MDP. It has been empirically observed that an RTDP agent generally performs well when viewed this way, but past theoretical results have been limited to asymptotic convergence proofs. We show that, like the true learning algorithms E3 and RMAX, a slightly modified version of RTDP satisfies a Probably Approximately Correct (PAC) condition (with better sample complexity bounds). In other words, we show that the number of timesteps in an infinite-length run in which the RTDP agent acts according to a non-optimal policy from its current state is less than some polynomial (in the size of the MDP), with high probability. We also show that a randomized version of RTDP is PAC with asymptotically equal sample complexity bounds, but with much lower per-step computational cost: $O(\ln(S)\ln(SA) + \ln(A))$ rather than $O(S + \ln(A))$, when we consider only the dependence on $S$, the number of states, and $A$, the number of actions of the MDP.

Introduction

The Real-Time Dynamic Programming (RTDP) algorithm was introduced by Barto et al. (1995). In that work, it was shown that RTDP generalizes a popular search algorithm, Learning Real-Time A* (LRTA*) (Korf, 1990). Thus, RTDP is important to the reinforcement learning (RL) community as well as the heuristic search community. New analyses and variations of RTDP, such as Labeled RTDP and Bounded RTDP, continue to be studied (Bonet & Geffner, 2003; McMahan et al., 2005).

With respect to standard RL terminology, RTDP is, by definition, strictly a "planning" algorithm because it receives as input the MDP (rewards and transitions) it will act upon. A "learning" algorithm, as typically defined, doesn't receive these quantities as input. However, as noted by Barto et al. (1995), RTDP can be thought of as a learning algorithm. The form of RTDP is similar to many learning algorithms in that an agent following RTDP travels through the MDP in one long trajectory and updates its action-values (Q-value estimates) only for those state-actions that it actually visits. It also displays complex behavior that could arguably be considered "learning".[1] Traditional planning algorithms, such as value iteration, tend to solve an MDP by iterating through all the state-action pairs many times, rather than by following a valid trajectory through the MDP. Hence, RTDP can be considered something in between a learning algorithm and a planning algorithm, and it represents an important connection between the learning and planning communities in RL.

[1] See Section 3.4 of Barto et al. (1995) for a related discussion.

The first result of our paper is that RTDP has very nice theoretical properties when analyzed in a PAC-like framework (as defined in a later section). We prove that a slightly modified version of RTDP, when run forever in any MDP, will make only a polynomial number of mistakes (non-optimal choices) with high probability.
We formalize this notion of sample complexity below. This result doesn't imply convergence of its action-value estimates.[2]

[2] In fact, the action-value estimates of our modified version of RTDP won't converge to the optimal action-values Q*. We leave it as an important open problem whether the original RTDP algorithm is PAC-MDP or not.

To the best of our knowledge, the only related theoretical result is that the action-value estimates, $Q(s,a)$, computed by RTDP converge to $Q^*(s,a)$ in the limit (Barto et al., 1995; Bertsekas & Tsitsiklis, 1996). Instead, we would like a guarantee on RTDP's performance after only finitely many interactions with the environment. Furthermore, we are not as interested in the convergence of RTDP's action-value estimates as we are in the performance of its behavior (the value of the agent's policy). For instance, in the work of Bonet and Geffner (2003), it is empirically shown that RTDP's action-value estimates may take a long time to converge to their optimal values. This finding is due to the fact that some states may be difficult to reach and RTDP only updates the value estimates for states as they are reached by the agent. The authors create a new algorithm, Labeled RTDP, that exhibits a better convergence rate (empirically). However, when the actual performances of the algorithms were compared as measured by average cost to the goal (Figure 3 of Bonet and Geffner (2003)), RTDP's performance was at least as good as Labeled RTDP's. This observation suggests that the slow convergence rate of RTDP may not negatively impact its resulting behavior. Therefore, we have chosen to directly analyze the sample complexity of RTDP rather than the convergence rate of its action-value estimates.

One of RTDP's attractive features is that, by only updating value estimates for the states reached by an agent in the MDP, it focuses computation on relevant states. In fact, each step (action choice) of the agent requires only a single Bellman backup, a computation that, in the worst case, scales linearly with the number of states. Although this method is a significant improvement over more expensive operations such as value iteration, it can still be costly in large environments. Our second result is to provide an algorithm, called Rand-RTDP, that has similar PAC-like learning guarantees, but whose per-step computation scales as $O(\ln^2(S))$ when we consider its dependence on $S$. The key insight is that the single full update of RTDP can be closely approximated by an average of a few randomly chosen sample updates. This idea has been mentioned in Bertsekas and Tsitsiklis (1996) and also analyzed in Kearns et al. (1999).

Definitions and Notation

This section introduces the Markov Decision Process notation used throughout the paper; see Sutton and Barto (1998) for an introduction. An MDP $M$ is a five-tuple $\langle S, A, T, R, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $T : S \times A \times S \rightarrow [0,1]$ is a transition function, $R : S \times A \rightarrow [0,1]$ is a reward function, and $0 \le \gamma < 1$ is a discount factor on the summed sequence of rewards. We also let $S$ and $A$ denote the number of states and the number of actions, respectively. From state $s$ under action $a$, the agent receives a random reward $r$, which has expectation $R(s,a)$, and is transported to state $s'$ with probability $T(s'|s,a)$. A policy is a strategy for choosing actions. We assume (unless otherwise noted) that all rewards lie between 0 and 1.
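For concreteness, the following minimal Python sketch shows one way the quantities above might be represented; the class, field names, and the Bernoulli reward distribution are our own illustration (any reward distribution on [0, 1] with the stated mean fits the definition), not part of the original paper.

    import numpy as np

    class MDP:
        """Illustrative tabular MDP <S, A, T, R, gamma> with rewards in [0, 1] (sketch, not from the paper)."""
        def __init__(self, T, R, gamma):
            # T[s, a, s2] = probability of reaching s2 from s under a
            # R[s, a]     = expected immediate reward for taking a in s
            self.T, self.R, self.gamma = T, R, gamma
            self.S, self.A = R.shape

        def step(self, s, a, rng):
            """Sample a next state and a reward with mean R[s, a]; rng is a numpy Generator."""
            s2 = rng.choice(self.S, p=self.T[s, a])
            r = float(rng.random() < self.R[s, a])   # assumed Bernoulli reward with the correct mean
            return s2, r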
A stationary policy is one that produces an action based only on the current state. For any policy $\pi$, let $V^\pi_M(s)$ ($Q^\pi_M(s,a)$) denote the discounted, infinite-horizon value (action-value) function for $\pi$ in $M$ (which may be omitted from the notation) from state $s$. If $T$ is a positive integer, let $V^\pi_M(s,T)$ denote the $T$-step value function of policy $\pi$. Specifically, $V^\pi_M(s) = E[\sum_{j=1}^{\infty} \gamma^{j-1} r_j]$ and $V^\pi_M(s,T) = E[\sum_{j=1}^{T} \gamma^{j-1} r_j]$, where $[r_1, r_2, \ldots]$ is the reward sequence generated by following policy $\pi$ from state $s$. These expectations are taken over all possible paths the agent might follow. The optimal policy is denoted $\pi^*$ and has value functions $V^*_M(s)$ and $Q^*_M(s,a)$. Note that a policy cannot have a value greater than $1/(1-\gamma)$.

PAC-MDP

In this section, we provide the formal definition we use to characterize a learning algorithm as efficient. Any algorithm A that guides an agent to act based only on its history (trajectory through the MDP), such as Q-learning (Watkins & Dayan, 1992) and R-MAX (Brafman & Tennenholtz, 2002), can be viewed as simply some non-stationary policy. Thus, each action the agent takes is the result of executing some policy. Suppose we stop the algorithm after its $(t-1)$st action, and let us denote the agent's policy by $A_t$. This policy has a well-defined value function $V^{A_t}(s_t)$, where $s_t$ is the current ($t$th) state reached by the agent. We say it is $\epsilon$-optimal, for any $\epsilon$, from its current state if $V^*(s_t) - V^{A_t}(s_t) \le \epsilon$. Now, suppose that the algorithm A is run for an infinite-length trajectory (run forever in the MDP, starting from an arbitrary initial state). Kakade (2003) defines the sample complexity of exploration (sample complexity, for short) of A to be the number of timesteps $t$ such that the non-stationary policy at time $t$, $A_t$, is not $\epsilon$-optimal from the current state, $s_t$, at time $t$ (formally, $V^{A_t}(s_t) < V^*(s_t) - \epsilon$). We believe this definition captures the essence of measuring learning. An algorithm A is then said to be an efficient PAC-MDP (Probably Approximately Correct in Markov Decision Processes) algorithm if, for any $\epsilon$ and $\delta$, the per-step computational complexity and the sample complexity of A are less than some polynomial in the relevant quantities ($S$, $A$, $1/\epsilon$, $1/\delta$, $1/(1-\gamma)$), with probability at least $1-\delta$. It is simply PAC-MDP if we relax the definition to have no computational complexity requirement. The terminology, PAC, is borrowed from Valiant (1984), a classic paper dealing with classification.
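Stated as a formula (our restatement of the definition above, not a display appearing in the original paper), the sample complexity of exploration of A on an infinite run is the size of the set of "mistake" timesteps,

\[
\big|\{\, t \;:\; V^{A_t}(s_t) < V^{*}(s_t) - \epsilon \,\}\big|,
\]

and A is PAC-MDP if, with probability at least $1-\delta$, this count is bounded by a polynomial in $S$, $A$, $1/\epsilon$, $1/\delta$, and $1/(1-\gamma)$.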
Two Algorithms

In this section, we present two learning/planning algorithms. Both require as input a model, including the transition and reward functions, of an MDP. In addition, we assume that the learner receives $S$, $A$, and $\gamma$ as input. The following experiment is then conducted. The agent always occupies a single state $s$ of the MDP $M$. The algorithm is told this state and must compute an action $a$. The agent receives a reward $r$ (which our algorithms completely ignore) and is then transported to another state $s'$ according to the reward and transition functions. This procedure then repeats forever. The first state occupied by the agent may be chosen arbitrarily. We define a timestep to be a single interaction with the environment, as described above. Since we are interested in algorithms that satisfy the PAC-MDP guarantee, we also allow the learning algorithm to receive two additional inputs, $\epsilon$ and $\delta$, both real numbers between zero and one (exclusive).

Both algorithms maintain action-value estimates, $Q(s,a)$, for each state-action pair $(s,a)$. At time $t = 1, 2, \ldots$, let $Q_t(s,a)$ denote the algorithm's current action-value estimate for $(s,a)$ and let $V_t(s)$ denote $\max_{a \in A} Q_t(s,a)$. The learner always acts greedily with respect to its estimates, meaning that if $s_t$ is the $t$th state reached, $a := \mathrm{argmax}_{a \in A} Q_t(s_t,a)$ is the next action chosen. Additionally, both algorithms make use of "optimistic initialization", that is, $Q_1(s,a) = 1/(1-\gamma)$ for all $(s,a)$. The main difference between the two algorithms is how the action-values are updated on each timestep.

The RTDP Algorithm

Real-time dynamic programming (RTDP) is an algorithm for acting in a known MDP that was first studied by Barto et al. (1995). Suppose that at time $t \ge 1$, action $a$ is performed from state $s$. Then, the following update is performed:

$Q_{t+1}(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a)\, V_t(s').$   (1)

We analyze a modified version of RTDP, which has an additional free parameter, $\epsilon_1 \in (0,1)$. In the analysis section (Proposition 2), we will provide a precise value for $\epsilon_1$ in terms of the other inputs ($S$, $A$, $\epsilon$, $\delta$, and $\gamma$) that guarantees the resulting algorithm is PAC-MDP. Our modification is that we allow the update, Equation 1, to take place only if the new action-value results in a decrease of at least $\epsilon_1$. In other words, the following condition must be satisfied for an update to occur:

$Q_t(s,a) - \left( R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a)\, V_t(s') \right) \ge \epsilon_1.$   (2)

Otherwise, no change is made and $Q_{t+1}(s,a) = Q_t(s,a)$. For all other state-action pairs $(s',a') \ne (s,a)$, the action-value estimate is left unchanged, that is, $Q_{t+1}(s',a') = Q_t(s',a')$.

The only difference between our modified version of RTDP and the standard version is that we enforce an update condition (Equation 2). The main purpose of this change is to bound the number of updates performed by RTDP, which is required for our analysis showing that RTDP is PAC-MDP. It also has the interesting side effect of reducing the overall computational cost of running RTDP. Without the update condition, RTDP will continually refine its value-function estimates, which involves infinitely many Bellman backups (computations of Equation 1). With the update condition, RTDP stops updating its estimates once they are "close" to optimal. By "close" we mean sufficient to obtain an $\epsilon$-optimal policy. Hence, the modified version will perform only a finite number of Bellman backups. A similar modification is described by Bonet and Geffner (2003).
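The modified update can be summarized in a few lines of Python. This is our own illustrative sketch of Equations 1 and 2, not code from the paper; the array layout (T as an S x A x S array, R and Q as S x A arrays) is an assumption.

    import numpy as np

    def rtdp_update(Q, s, a, T, R, gamma, eps1):
        """Modified RTDP backup: apply Eq. (1) only if it lowers Q[s, a] by at least eps1 (Eq. (2)). Sketch."""
        V = Q.max(axis=1)                        # V_t(s') = max_a' Q_t(s', a')
        backup = R[s, a] + gamma * T[s, a] @ V   # full Bellman backup, Eq. (1)
        if Q[s, a] - backup >= eps1:             # update condition, Eq. (2)
            Q[s, a] = backup                     # successful update
            return True
        return False                             # otherwise Q is left unchanged

Acting then amounts to repeatedly choosing the greedy action for the current state, performing this update, and moving to the sampled next state.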
The Rand-RTDP Algorithm

We now introduce a randomized version of RTDP, called Rand-RTDP. In addition to action-value estimates, Rand-RTDP maintains a quantity $LAU(s,a)$ for each $(s,a)$, which stands for "last attempted update". Let $LAU_t(s,a)$ denote the value of $LAU(s,a)$ at time $t$ (meaning immediately before the $t$th action is chosen). This quantity stores the timestep at which $(s,a)$ was last encountered and an "attempted update" occurred (as defined below). The algorithm also has two free parameters, $\epsilon_1 \in (0,1)$ and a positive integer $m$. In the analysis section (Proposition 3) we provide precise values for these parameters in terms of the other inputs ($S$, $A$, $\epsilon$, $\delta$, and $\gamma$) that guarantee the resulting algorithm is PAC-MDP.

Suppose that at time $t \ge 1$, action $a$ is performed from state $s$, and $LAU_t(s,a)$ is less than or equal to the last timestep at which any action-value estimate was updated (meaning some action-value estimate has changed on or since the last attempted update of $(s,a)$). We call such a timestep an attempted update. Consider the following update:

$Q_{t+1}(s,a) = \frac{1}{m} \sum_{i=1}^{m} \big( r_i + \gamma V_t(s_i) \big) + \epsilon_1,$   (3)

where $s_1, \ldots, s_m$ are $m$ random draws from $T(\cdot|s,a)$, the transition distribution for $(s,a)$, and $r_1, \ldots, r_m$ are $m$ random draws from the reward distribution for $(s,a)$ (with mean $R(s,a)$). We allow the update, Equation 3, to take place only if the new action-value results in a decrease of at least $\epsilon_1$. In other words, the following condition must be satisfied for a successful update to occur:

$Q_t(s,a) - \frac{1}{m} \sum_{i=1}^{m} \big( r_i + \gamma V_t(s_i) \big) \ge 2\epsilon_1.$   (4)

Otherwise, no change is made, an unsuccessful update occurs, and $Q_{t+1}(s,a) = Q_t(s,a)$. For all other state-action pairs $(s',a') \ne (s,a)$, the action-value estimate is left unchanged, that is, $Q_{t+1}(s',a') = Q_t(s',a')$. In addition, if the condition on $LAU_t(s,a)$ above doesn't hold, then no update is made.

The main difference between Rand-RTDP and RTDP is that, instead of a full Bellman backup (Equation 1), Rand-RTDP performs a multi-sample backup (Equation 3). In addition, the update of Rand-RTDP involves an additive bonus of $\epsilon_1$ (a small positive number). The purpose of this bonus is to ensure that, with high probability, the action-value estimates, $Q(s,a)$, are optimistic, which means that they always upper bound the optimal action-values, $Q^*(s,a)$. The bonus is needed because, even if the previous value estimates, $V_t(\cdot)$, are optimistic, the random sampling procedure could result in a choice of next states and rewards such that the average $\frac{1}{m}\sum_{i=1}^{m}(r_i + \gamma V_t(s_i))$ of these terms is no longer optimistic.

The purpose of the $LAU(s,a)$ values (and the condition on them required for an attempted update) is to limit the number of attempted updates performed by the algorithm. This limit allows us to prove that, with high probability, every randomized update will result in a new action-value estimate that is not too far from what would have resulted if a full backup had been performed. They also limit the total amount of computation. Specifically, they prevent an attempted update for $(s,a)$ if the last attempted update for $(s,a)$ was unsuccessful and no other action-value estimates have changed since the last attempted update of $(s,a)$.

Rand-RTDP's Generative Model

Note that Rand-RTDP updates its action-value estimates by obtaining random samples of next states $s'$ and immediate rewards $r$ for its current state-action pair. This sampling is the only time Rand-RTDP uses its knowledge of $M$. In other words, Rand-RTDP requires only a generative model of the underlying MDP,[3] rather than an actual representation of the transition and reward functions.

[3] A generative model of an MDP is a model that outputs a random next state and immediate reward pair given input of a state-action pair $(s,a)$. Each call to the model should produce independent draws distributed according to the dynamics of the MDP (Kearns et al., 1999).

Next, we show that the agent can "simulate" a generative model when it is given the true dynamics of the underlying MDP. The reverse, obtaining the true dynamics given a generative model, is in general impossible given only finite time. If the algorithm is given an explicit representation of the transition function, $T(\cdot|s,a)$, in standard form (a list, for each $(s,a)$, of possible next states $s'$ and their corresponding probabilities $T(s'|s,a)$), then we can modify this data structure, by adding to each transition probability in the list the sum of the transition probabilities of all next states appearing earlier in the list, in the same amount of time it takes just to read the original data structure (transition function) as input. This modified data structure, combined with a random number generator, allows us to draw random next states by binary search in time logarithmic in the number of next states (at most $S$) (Knuth, 1969). Thus, we can simulate a generative model in logarithmic time per query, after a one-time setup whose cost is within a constant factor of the time needed to read the input. Knuth (1969) also describes a more sophisticated method that allows constant-time queries, but requires sorting the input lists in the preprocessing stage.
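As an illustration, the cumulative-probability trick and binary-search sampling can be implemented with Python's standard bisect module. This is our own sketch under the sparse-list representation described above, not code from the paper.

    import bisect
    import itertools
    import random

    def preprocess(next_states, probs):
        """Turn a sparse list of (next state, probability) pairs into cumulative sums (one linear pass)."""
        cum = list(itertools.accumulate(probs))   # cum[i] = probs[0] + ... + probs[i]
        return next_states, cum

    def sample_next_state(next_states, cum, rng=random):
        """Draw s' ~ T(.|s,a) in O(log K) time, where K is the number of possible next states."""
        u = rng.random() * cum[-1]                # cum[-1] is 1 up to rounding
        i = bisect.bisect_right(cum, u)           # binary search for the sampled index
        return next_states[min(i, len(next_states) - 1)]

For example, sample_next_state(*preprocess([2, 5, 9], [0.2, 0.3, 0.5])) returns state 2, 5, or 9 with the corresponding probabilities.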
Implementation of Rand-RTDP

An implementation of Rand-RTDP is provided in Algorithm 1.

Algorithm 1 Rand-RTDP
0: Inputs: $\gamma$, $S$, $A$, $m$, $\epsilon_1$
1: for all $(s,a)$ do
2:   $Q(s,a) \leftarrow 1/(1-\gamma)$   // action-value estimates
3:   $LAU(s,a) \leftarrow 0$   // time of last attempted update
4: end for
5: $t^* \leftarrow 0$   // time of most recent Q-value change
6: for $t = 1, 2, 3, \ldots$ do
7:   Let $s$ denote the state at time $t$.
8:   Choose action $a := \mathrm{argmax}_{a' \in A} Q(s,a')$.
9:   if $LAU(s,a) \le t^*$ then
10:    Draw $m$ random next states $s_1, \ldots, s_m$ and $m$ random immediate rewards $r_1, \ldots, r_m$ from the generative model for $(s,a)$.
11:    $q \leftarrow \frac{1}{m} \sum_{i=1}^{m} \big( r_i + \gamma \max_{a' \in A} Q(s_i, a') \big)$
12:    if $Q(s,a) - q \ge 2\epsilon_1$ then
13:      $Q(s,a) \leftarrow q + \epsilon_1$
14:      $t^* \leftarrow t$
15:    end if
16:    $LAU(s,a) \leftarrow t$
17:  end if
18: end for

Analysis and Comparison

Computational Complexity

In this section, we analyze the per-step computational complexity of the two algorithms. We measure complexity assuming individual numbers require unit storage and can be manipulated arithmetically in unit time.[4]

[4] We have tried not to abuse this assumption.

The per-step computational cost of RTDP is $O(K + \ln(A))$, where $K$ is the number of states that can be reached in one step with positive probability. This cost is highly detrimental in domains with a large number of next states (in the worst case, $K = S$). The per-step computational cost of Rand-RTDP is $O(m \cdot G + \ln(A))$, where $G$ is the cost of drawing a single sample from the generative model. In the analysis section we provide a value for $m$ that is sufficient for Rand-RTDP to be PAC-MDP. Using this value of $m$ (see Proposition 3) and assuming that $G = O(\ln(K))$, the worst-case per-step computational cost of Rand-RTDP is

$O(m \cdot G + \ln(A)) = O\!\left( \frac{\ln\big(SA/(\epsilon\delta(1-\gamma))\big)}{\epsilon^2 (1-\gamma)^4} \ln(K) + \ln(A) \right),$

which is a vast improvement over RTDP when only the dependence on $S$ is considered. Specifically, when $\epsilon$, $\delta$, $A$, and $\gamma$ are taken as constants, the computational cost is $O(\ln(S)\ln(K)) = O(\ln^2(S))$. Note that the computational requirements of both algorithms involve an additive factor of $O(\ln(A))$, due to storing and accessing the maximum action-value estimate for the current state, after an update occurs, via a priority queue.

General Framework

The algorithms we consider have the following form:

Definition 1 Suppose an RL algorithm A maintains a value, denoted $Q(s,a)$, for each state-action pair $(s,a)$ with $s \in S$ and $a \in A$. Let $Q_t(s,a)$ denote the estimate for $(s,a)$ immediately before the $t$th action of the agent. We say that A is a greedy algorithm if the $t$th action of A, $a_t$, is $a_t := \mathrm{argmax}_{a \in A} Q_t(s_t,a)$, where $s_t$ is the $t$th state reached by the agent.

The following definition of a new MDP will be useful in our analysis.

Definition 2 (Known State-Action MDP) For an MDP $M = \langle S, A, T, R, \gamma \rangle$, a given set of action-values, $Q(s,a)$ for each state-action pair $(s,a)$, and a set $K$ of state-action pairs, we define the known state-action MDP $M_K = \langle S \cup \{s_0\}, A, T_K, R_K, \gamma \rangle$ as follows. Let $s_0$ be an additional state added to the state space of $M$. Under all actions from $s_0$, the agent is returned to $s_0$ with probability 1. The reward for taking any action from $s_0$ is 0. For all $(s,a) \in K$, $R_K(s,a) = R(s,a)$ and $T_K(\cdot|s,a) = T(\cdot|s,a)$. For all $(s,a) \notin K$, $R_K(s,a) = Q(s,a)$ and $T_K(s_0|s,a) = 1$.
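To make Definition 2 concrete, here is a small sketch (our own, using an assumed array-based representation in which the extra absorbing state $s_0$ gets index $S$) of how $M_K$ could be built from $M$, the current estimates $Q$, and a set $K$ of state-action pairs.

    import numpy as np

    def known_state_action_mdp(T, R, Q, K, gamma):
        """Construct M_K from M = (T, R, gamma), estimates Q, and the set K of known pairs (illustrative sketch)."""
        S, A = R.shape
        TK = np.zeros((S + 1, A, S + 1))   # extra state s0 has index S
        RK = np.zeros((S + 1, A))
        TK[S, :, S] = 1.0                  # s0 is absorbing with reward 0
        for s in range(S):
            for a in range(A):
                if (s, a) in K:            # known pair: true dynamics
                    TK[s, a, :S] = T[s, a]
                    RK[s, a] = R[s, a]
                else:                      # unknown pair: one-step reward Q(s,a), then s0
                    TK[s, a, S] = 1.0
                    RK[s, a] = Q[s, a]
        return TK, RK, gamma

Evaluating the greedy policy $\pi_t$ in this MDP gives the value $V^{\pi_t}_{M_{K_t}}$ used in Proposition 1 below.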
The known state-action MDP is a generalization of the standard notion of a "known state MDP" used by Kearns and Singh (2002) and Kakade (2003). It is an MDP whose dynamics (reward and transition functions) are equal to the true dynamics of $M$ for a subset of the state-action pairs (specifically those in $K$). For all other state-action pairs, the value of taking those state-action pairs in $M_K$ (and following any policy from that point on) is equal to the current action-value estimates $Q(s,a)$. We intuitively view $K$ as a set of state-action pairs for which the agent has sufficiently accurate estimates of their dynamics.

Definition 3 (Escape Event) Suppose that for some algorithm there is a set of state-action pairs $K_t$ defined during each timestep $t$. Let $A_K$ be defined as the event, called the escape event, that some state-action pair $(s,a)$ with $(s,a) \notin K_t$ is experienced by the agent at time $t$.

Note that all learning algorithms we consider take $\epsilon$ and $\delta$ as input. We let $A(\epsilon,\delta)$ denote the version of algorithm A parameterized with $\epsilon$ and $\delta$. The following proposition is taken from Strehl et al. (2006).

Proposition 1 Let $A(\epsilon,\delta)$ be a greedy learning algorithm such that, for every timestep $t$, there exists a set $K_t$ of state-action pairs. We assume that $K_t = K_{t+1}$ unless, during timestep $t$, an update to some action-value occurs or the event $A_K$ happens. Let $M_{K_t}$ be the known state-action MDP and $\pi_t$ be the current greedy policy, that is, for all states $s$, $\pi_t(s) = \mathrm{argmax}_a Q_t(s,a)$. Suppose that for any inputs $\epsilon$ and $\delta$, with probability at least $1-\delta$, the following conditions hold: (1) $Q_t(s,a) \ge Q^*(s,a) - \epsilon$ (optimism), (2) $V_t(s) - V^{\pi_t}_{M_{K_t}}(s) \le \epsilon$ (accuracy), and (3) the total number of updates of action-value estimates plus the number of times the escape event from $K_t$, $A_K$, can occur is bounded by $\zeta(\epsilon,\delta)$ (learning complexity). Then, when $A(\epsilon,\delta)$ is executed on any MDP $M$, it will follow a $4\epsilon$-optimal policy from its current state on all but

$O\!\left( \frac{\zeta(\epsilon,\delta)}{\epsilon(1-\gamma)^2} \ln\frac{1}{\delta} \ln\frac{1}{\epsilon(1-\gamma)} \right)$

timesteps, with probability at least $1 - 2\delta$.

Our proofs work by the following scheme (for both RTDP and Rand-RTDP): (1) define a set of known state-action pairs for each timestep $t$; (2) show that these sets satisfy the conditions of Proposition 1.

Sample Complexity Bound for RTDP

In this section, we prove that RTDP is PAC-MDP with a sample complexity bound of

$O\!\left( \frac{SA}{\epsilon^2(1-\gamma)^4} \ln\frac{1}{\delta} \ln\frac{1}{\epsilon(1-\gamma)} \right).$   (5)

First, we show that RTDP has the property of "optimism".

Lemma 1 During execution of RTDP, $Q_t(s,a) \ge Q^*(s,a)$ holds for all timesteps $t$ and state-action pairs $(s,a)$.

Proof: The proof is by induction on the timestep $t$. For the base case, note that $Q_1(s,a) = 1/(1-\gamma) \ge Q^*(s,a)$ for all $(s,a)$. Now, suppose the claim holds for all timesteps less than or equal to $t$. Thus, we have that $Q_t(s,a) \ge Q^*(s,a)$ for all $(s,a)$. We also have that $V_t(s) = \max_{a \in A} Q_t(s,a) \ge Q_t(s, \mathrm{argmax}_a Q^*(s,a)) \ge Q^*(s, \mathrm{argmax}_a Q^*(s,a)) = V^*(s)$. Suppose $s$ is the $t$th state reached and $a$ is the action taken at time $t$. Of the action-values maintained by RTDP, only $Q(s,a)$ can change. Specifically, if it does get updated, then we have that $Q_{t+1}(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a) V_t(s') \ge R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a) V^*(s') = Q^*(s,a)$.

Proposition 2 The RTDP algorithm with $\epsilon_1 = \epsilon(1-\gamma)$ is PAC-MDP.

Proof: We apply Proposition 1. By Lemma 1, Condition (1) is satisfied.
During timestep $t$ of the execution of RTDP, we define $K_t$ to be the set of all state-action pairs $(s,a)$ such that:

$Q_t(s,a) - \left( R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a) V_t(s') \right) \le \epsilon_1.$   (6)

We claim that $V_t(s) - V^{\pi_t}_{M_{K_t}}(s) \le \epsilon_1/(1-\gamma)$ always holds. To verify this claim, note that $V^{\pi_t}_{M_{K_t}}$ is the solution to the following set of equations:

$V^{\pi_t}_{M_{K_t}}(s) = R(s,\pi_t(s)) + \gamma \sum_{s' \in S} T(s'|s,\pi_t(s)) V^{\pi_t}_{M_{K_t}}(s'),$ if $(s,\pi_t(s)) \in K_t$,
$V^{\pi_t}_{M_{K_t}}(s) = Q_t(s,\pi_t(s)),$ if $(s,\pi_t(s)) \notin K_t$.

The vector $V_t$ is the solution to a similar set of equations, except with some additional positive reward terms, each bounded by $\epsilon_1$ (see Equation 6). It follows that $V_t(s) - V^{\pi_t}_{M_{K_t}}(s) \le \epsilon_1/(1-\gamma)$. Thus, by letting $\epsilon_1 = \epsilon(1-\gamma)$, we satisfy $V_t(s) - V^{\pi_t}_{M_{K_t}}(s) \le \epsilon$, as desired (to fulfill Condition (2) of Proposition 1). Finally, note that, by definition, the algorithm performs updates if and only if the event $A_K$ occurs. Hence, it is sufficient to bound the number of times that $A_K$ can occur. Every successful update of $Q(s,a)$ decreases its value by at least $\epsilon_1$. Since its value is initialized to $1/(1-\gamma)$, each state-action pair can be updated at most $1/(\epsilon_1(1-\gamma))$ times, for a total of at most $SA/(\epsilon_1(1-\gamma)) = SA/(\epsilon(1-\gamma)^2)$ timesteps $t$ at which $A_K$ can occur. Ignoring log factors, this analysis leads to a total sample complexity bound of

$\tilde{O}\!\left( \frac{SA}{\epsilon^2(1-\gamma)^4} \right).$   (7)

Sample Complexity Bound for Rand-RTDP

In this section, we prove that Rand-RTDP is also PAC-MDP, with a sample complexity bound of

$O\!\left( \frac{SA}{\epsilon^2(1-\gamma)^4} \ln\frac{1}{\delta} \ln\frac{1}{\epsilon(1-\gamma)} \right).$   (8)

For Rand-RTDP, we define $K_t$, during timestep $t$ of the execution of Rand-RTDP, to be the set of all state-action pairs $(s,a)$ such that:

$Q_t(s,a) - \left( R(s,a) + \gamma \sum_{s' \in S} T(s'|s,a) V_t(s') \right) \le 3\epsilon_1.$   (9)

We will show, in the proof of the following lemma, that Rand-RTDP will perform at most $\kappa$ attempted updates during any given infinite-length run through any MDP, where

$\kappa := SA\left( 1 + \frac{SA}{\epsilon_1(1-\gamma)} \right).$

Lemma 2 During execution of Rand-RTDP, if $m \ge \ln(2\kappa/\delta)/(2\epsilon_1^2(1-\gamma)^2)$, then on every timestep $t$ such that the escape event $A_K$ occurs, a successful update also occurs, with probability at least $1 - \delta/2$.

Proof: First, we bound the number of updates and attempted updates. Note that each time a successful update occurs for $(s,a)$, this event results in a decrease to the current action-value estimate, $Q(s,a)$, of at least $\epsilon_1$. As $Q(s,a)$ is initialized to $1/(1-\gamma)$ and cannot fall below zero, it can be successfully updated at most $1/(\epsilon_1(1-\gamma))$ times. Hence, the total number of successful updates is bounded by $SA/(\epsilon_1(1-\gamma))$. Now, note that when a state-action pair $(s,a)$ is first experienced, it will always result in an attempted update. Another attempted update of $(s,a)$ will occur only when some action-value has been successfully updated (due to Line 9 in Algorithm 1) since the last attempted update of $(s,a)$. Thus, the total number of attempted updates is bounded by $SA(1 + SA/(\epsilon_1(1-\gamma))) = \kappa$.

Suppose that at time $t$, $(s,a)$ is experienced and results in an attempted update. The Rand-RTDP agent draws $m$ next states $s_1, \ldots, s_m$ and $m$ immediate rewards $r_1, \ldots, r_m$ from its generative model. Define the random variables $X_i := r_i + \gamma V_t(s_i)$ for $i = 1, \ldots, m$. Note that the $X_i$ are independently and identically distributed with mean $E[X_i] = R(s,a) + \gamma \sum_{s'} T(s'|s,a) V_t(s')$. Hence, by the Hoeffding bound, we have that

$\Pr\!\left[ \frac{1}{m}\sum_{i=1}^{m} X_i - E[X_1] \ge \epsilon_1 \right] \le e^{-2\epsilon_1^2 m (1-\gamma)^2}.$

Thus, by our choice of $m$, we have that

$\frac{1}{m}\sum_{i=1}^{m} X_i - \epsilon_1 < E[X_1]$   (10)

will hold with probability at least $1 - \delta/(2\kappa)$.
We claim that if $(s,a) \notin K_t$ (so that the event $A_K$ occurred at time $t$), then this attempted update will be successful if Equation 10 holds. To see this fact, note that the difference between the old action-value estimate, $Q_t(s,a)$, and the candidate for the new action-value estimate is at least $\epsilon_1$:

$Q_t(s,a) - \left( \frac{1}{m}\sum_{i=1}^{m} (r_i + \gamma V_t(s_i)) + \epsilon_1 \right) = Q_t(s,a) - \frac{1}{m}\sum_{i=1}^{m} X_i - \epsilon_1 > Q_t(s,a) - E[X_1] - 2\epsilon_1 > 3\epsilon_1 - 2\epsilon_1 = \epsilon_1.$

The first step follows from the definition of $X_i$. The second step follows from Equation 10. The third step follows from the fact that $(s,a) \notin K_t$ (see Equation 9). Since there are at most $\kappa$ attempted updates, the chance that some such update is not successful (which is at most the probability that Condition 10 fails to hold) is at most $\delta/2$, over all timesteps that satisfy the above restrictions (an attempted update occurs for $(s,a) \notin K_t$ at time $t$), by the union bound. What we have shown is that, with high probability, every attempted update of $(s,a)$ at a time $t$ such that $(s,a) \notin K_t$ will be successful. It can also be shown that if this statement holds for the execution of the algorithm up to some time $t$, and $(s,a) \notin K_t$ is the state-action pair experienced at time $t$, then an attempted update will occur for $(s,a)$. This step is not hard to prove but is a bit tedious, so we omit it here.

Next, we show that Rand-RTDP has the property of "optimism" with high probability.

Lemma 3 During execution of Rand-RTDP, if $m \ge \ln(2\kappa/\delta)/(2\epsilon_1^2(1-\gamma)^2)$, then $Q_t(s,a) \ge Q^*(s,a)$ holds for all timesteps $t$ and state-action pairs $(s,a)$, with probability at least $1 - \delta/2$.

Proof: The proof is by induction on the timestep $t$. When $t = 1$, no actions have been taken by the agent yet, and we have that $Q_1(s,a) = 1/(1-\gamma) \ge Q^*(s,a)$ for all $(s,a)$, due to optimistic initialization. Now, suppose that $Q_t(s,a) \ge Q^*(s,a)$ holds for all $(s,a)$ at some timestep $t$. Suppose that $a$ is the $t$th action, and that it was taken from state $s$ ($s$ was the $t$th state reached by the agent). If this step does not result in an attempted update, then no action-value estimates change, and we are done. Thus, suppose that an attempted update does occur. The Rand-RTDP agent draws $m$ next states $s_1, \ldots, s_m$ and $m$ immediate rewards $r_1, \ldots, r_m$ from its generative model (to be used in the attempted update). By Equation 3, if the update is successful, then $Q_{t+1}(s,a) = \frac{1}{m}\sum_{i=1}^{m}(r_i + \gamma V_t(s_i)) + \epsilon_1$ (if it is not successful, then $Q_{t+1}(s,a)$ still upper bounds $Q^*(s,a)$, by the inductive assumption). However, we have that $\frac{1}{m}\sum_{i=1}^{m}(r_i + \gamma V_t(s_i)) + \epsilon_1 \ge \frac{1}{m}\sum_{i=1}^{m}(r_i + \gamma V^*(s_i)) + \epsilon_1$, by the induction hypothesis. Applying the Hoeffding bound (with random variables $Y_i := r_i + \gamma V^*(s_i)$ for $i = 1, \ldots, m$) and the union bound as in the proof of Lemma 2, we then have that $Q_{t+1}(s,a) \ge Q^*(s,a)$ holds, with probability at least $1 - \delta/2$, over all attempted updates of any state-action pair.

Proposition 3 The Rand-RTDP algorithm with

$m = \frac{\ln(2\kappa/\delta)}{2\epsilon_1^2(1-\gamma)^2} = O\!\left( \frac{1}{\epsilon^2(1-\gamma)^4} \ln\frac{SA}{\epsilon(1-\gamma)\delta} \right)$ and $\epsilon_1 = \epsilon(1-\gamma)/3$

is PAC-MDP.

Proof sketch: The proof is very similar to that of Proposition 2 and leads to a total sample complexity bound of

$\tilde{O}\!\left( \frac{SA}{\epsilon^2(1-\gamma)^4} \right).$   (11)

Note that the PAC sample complexity bound we have proved for RTDP (Equation 7) is asymptotically the same as that for Rand-RTDP (Equation 11). This result suggests that it may be possible to use Rand-RTDP instead of RTDP on large-scale problems where computational cost is critical, without sacrificing much in terms of convergence performance.
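As a quick illustration of how these parameters are set, the following helper (our own sketch, not code from the paper) computes $\epsilon_1$, $\kappa$, and $m$ from the inputs according to Propositions 2 and 3.

    import math

    def rtdp_eps1(eps, gamma):
        """Update threshold for modified RTDP (Proposition 2): eps1 = eps * (1 - gamma)."""
        return eps * (1.0 - gamma)

    def rand_rtdp_params(S, A, eps, delta, gamma):
        """Parameters for Rand-RTDP (Proposition 3): eps1 = eps(1-gamma)/3,
        kappa = SA(1 + SA/(eps1(1-gamma))), m = ln(2*kappa/delta) / (2*eps1^2*(1-gamma)^2)."""
        eps1 = eps * (1.0 - gamma) / 3.0
        kappa = S * A * (1.0 + S * A / (eps1 * (1.0 - gamma)))
        m = math.ceil(math.log(2.0 * kappa / delta) / (2.0 * eps1 ** 2 * (1.0 - gamma) ** 2))
        return eps1, kappa, m

Plugging in the experimental setting below (S = 500, A = 2, gamma = 0.95) yields values of m far larger than the m = 30 or m = 50 used in the experiments, as is typical of worst-case bounds.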
Experiments

For each experiment, we recorded the cumulative reward obtained by the algorithm (to stand in for sample complexity)[5] over a fixed number of timesteps (50,000). We also measured the number of backups computed by the algorithm (specifically, the number of times the algorithm computes the value of some reachable next state, $V(s')$) as a measure of computational complexity. Our experiments were designed to supplement the theoretical analysis and are not meant as a thorough evaluation of the algorithms.

[5] The true sample complexity of the algorithm is expensive, and often impossible, to compute in practice.

The algorithms were tested on random MDPs generated as follows, for parameters $S = 500$, $A = 2$, and $\gamma = 0.95$. To guarantee that every state was reachable from every other state, random Hamiltonian circuits (to ensure connectivity) connecting every state were constructed under each action, and a probability of 0.1 was assigned to each of the corresponding transitions. Then, for each state-action pair, the transition function was completed by assigning random probabilities (summing to 0.9) to 99 randomly selected next states. Therefore, for each state-action pair, there were at most 100 next states. To make the MDP more interesting, we constructed the reward function so that $R(s,a)$ tended to be large if the index of $s$ was large; specifically, the mean reward, $R(s,a)$, was randomly selected with mean proportional to the index of $s$.[6] Each experiment was repeated 100 times and the results were averaged.

[6] We found that if the mean rewards are chosen uniformly at random, then the optimal policy almost always picks the action that maximizes the immediate reward, which is not interesting in the context of sequential decision making.

The results for RTDP are summarized below (cumulative reward and number of backups, averaged over 100 runs):

  $\epsilon_1 = 0.1$:  reward 25,248 ± 13;  backups 4,476,325 ± 589
  $\epsilon_1 = 0.2$:  reward 25,235 ± 12;  backups 4,396,626 ± 889
  $\epsilon_1 = 0.3$:  reward 25,208 ± 15;  backups 4,288,353 ± 4,505
  $\epsilon_1 = 0.4$:  reward 25,197 ± 16;  backups 4,005,369 ± 31,161

The results for Rand-RTDP are summarized below:

  $\epsilon_1 = 0.1$, $m = 30$:  reward 25,124 ± 14;  backups 1,469,122 ± 184
  $\epsilon_1 = 0.2$, $m = 30$:  reward 25,138 ± 12;  backups 1,450,281 ± 227
  $\epsilon_1 = 0.3$, $m = 30$:  reward 25,126 ± 15;  backups 1,435,197 ± 269
  $\epsilon_1 = 0.4$, $m = 30$:  reward 25,104 ± 15;  backups 1,420,445 ± 343
  $\epsilon_1 = 0.1$, $m = 50$:  reward 25,159 ± 13;  backups 2,448,039 ± 308
  $\epsilon_1 = 0.2$, $m = 50$:  reward 25,150 ± 15;  backups 2,415,387 ± 411
  $\epsilon_1 = 0.3$, $m = 50$:  reward 25,148 ± 14;  backups 2,388,299 ± 456
  $\epsilon_1 = 0.4$, $m = 50$:  reward 25,139 ± 13;  backups 2,361,596 ± 596

As a comparison, we also ran the optimal policy, which achieved 25,873 ± 13 cumulative reward, and the policy that chooses actions uniformly at random, which achieved 24,891 ± 15 cumulative reward. We see from the results that Rand-RTDP did not require as much computation as RTDP, but also did not achieve quite as much reward.

Conclusion

In this paper, we have provided a theoretical analysis of the sample complexity of RTDP and a randomized variant of it, Rand-RTDP. When viewed as learning algorithms, both algorithms are shown to be efficient in the PAC-MDP framework. In particular, their sample complexity bounds have a sub-quadratic dependence on the sizes of the state and action spaces, which are arguably the most important factors. To the best of our knowledge, this is the first sample complexity result for RTDP, although its limit behavior has been investigated previously (Barto et al., 1995; Bonet & Geffner, 2003). Furthermore, the Rand-RTDP algorithm improves the per-step computational cost over RTDP without an asymptotic increase in its sample complexity.
Acknowledgments

Thanks to the National Science Foundation (IIS-0325281). We also thank John Langford and Eric Wiewiora for discussions.

References

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72, 81–138.

Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.

Bonet, B., & Geffner, H. (2003). Labeled RTDP: Improving the convergence of real-time dynamic programming. Proceedings of the 13th International Conference on Automated Planning and Scheduling (pp. 12–21). Trento, Italy: AAAI Press.

Brafman, R. I., & Tennenholtz, M. (2002). R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.

Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.

Kearns, M., Mansour, Y., & Ng, A. Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99) (pp. 1324–1331).

Kearns, M. J., & Singh, S. P. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 209–232.

Knuth, D. (1969). The art of computer programming: Vol. 2, Seminumerical algorithms, Chapter 3: Random numbers. Addison-Wesley.

Korf, R. E. (1990). Real-time heuristic search. Artificial Intelligence, 42, 189–211.

McMahan, H. B., Likhachev, M., & Gordon, G. J. (2005). Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. Proceedings of the Twenty-second International Conference on Machine Learning (ICML-05) (pp. 569–576).

Strehl, A. L., Li, L., & Littman, M. L. (2006). Incremental model-based learners with formal learning-time guarantees. To appear in Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence (UAI-06).

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. The MIT Press.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.