Introduction to Reinforcement Learning
E0397 Lecture Slides by Richard Sutton (with small changes)

Reinforcement Learning: Basic Idea
• Learn to take correct actions over time by experience
• Similar to how humans learn: "trial and error"
  - Try an action, "see" what happens, "remember" what happens
  - Use this to change the choice of action the next time you are in the same situation
  - "Converge" to learning the correct actions
• Focus on long-term return, not immediate benefit: some actions may not seem beneficial in the short term, but their long-term return will be good.

RL in Cloud Computing
• Why are we talking about RL in this class?
• Which problems do you think can be solved by a learning approach?
• Is a focus on long-term benefit appropriate?

RL Agent
• The agent's objective is to affect the environment
• The environment is stochastic and uncertain
• [Diagram: the agent sends an action to the environment; the environment returns a new state and a reward. A minimal code sketch of this interaction loop appears at the end of this block of slides.]

What is Reinforcement Learning?
• An approach to Artificial Intelligence
• Learning from interaction
• Goal-oriented learning
• Learning about, from, and while interacting with an external environment
• Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

Key Features of RL
• The learner is not told which actions to take
• Trial-and-error search
• Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
• The need to explore and exploit
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment

Examples of Reinforcement Learning
• RoboCup soccer teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000
• Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry-standard methods
• Dynamic channel assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls
• Elevator control (Crites & Barto): (probably) world's best down-peak elevator controller
• Many robots: navigation, bipedal walking, grasping, switching between skills...
• TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player

Supervised Learning
• Training info = desired (target) outputs; the supervised learning system maps inputs to outputs; error = (target output - actual output)
• Best examples: classification/identification systems, e.g. fault classification, WiFi location determination, workload identification
• E.g. classification systems: give the agent lots of data that has already been classified; the agent can "train" itself using this knowledge
• The training phase is non-trivial
• (Normally) no learning happens in the on-line phase
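To make the agent-environment loop in the "RL Agent" slide concrete, here is a minimal runnable sketch of the interaction cycle. The ToyEnv and RandomAgent classes and all of their method names are invented for illustration; they are not part of the slides, and a learning agent would adjust its behavior inside update() rather than act randomly.

```python
import random

class ToyEnv:
    """Hypothetical 2-state environment, used only to make the loop runnable."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Stochastic transition: the action changes the chance of reaching state 1.
        self.state = 1 if random.random() < (0.3 + 0.4 * action) else 0
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

class RandomAgent:
    """Placeholder agent: picks actions at random."""
    def select_action(self, state):
        return random.choice([0, 1])

    def update(self, state, action, reward, next_state):
        pass  # a learning agent would change its action choices here

env, agent = ToyEnv(), RandomAgent()
state = env.reset()
total_reward = 0.0
for t in range(100):                        # interact for 100 time steps
    action = agent.select_action(state)     # agent acts on the current state
    next_state, reward = env.step(action)   # environment returns reward and next state
    agent.update(state, action, reward, next_state)  # "remember" what happened
    total_reward += reward
    state = next_state
print("return over 100 steps:", total_reward)
```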
Reinforcement Learning
• Training info = evaluations ("rewards" / "penalties"); the RL system maps inputs to outputs ("actions")
• Objective: get as much reward as possible
• Absolutely no pre-existing training data
• The agent learns ONLY by experience

Elements of RL
• State: the state of the system
• Actions: suggested by the RL agent, taken by actuators in the system; actions change the state (deterministically or stochastically)
• Reward: the immediate gain from taking action a in state s
• Value: the long-term benefit of taking action a in state s, i.e., the rewards gained over a long time horizon

The n-Armed Bandit Problem
• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t you get a reward r_t, where E[r_t | a_t] = Q*(a_t)
• These Q*(a) are the unknown action values; the distribution of r_t depends only on a_t
• The objective is to maximize the reward in the long term, e.g., over 1000 plays
• To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them

The Exploration/Exploitation Dilemma
• Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a_t* = argmax_a Q_t(a)
• Choosing a_t = a_t* is exploitation; choosing a_t ≠ a_t* is exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring; but you should always reduce exploring. Maybe.

Action-Value Methods
• Methods that adapt action-value estimates and nothing else. For example, suppose that by the t-th play action a has been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}. Then the "sample average" estimate is
  Q_t(a) = (r_1 + r_2 + ... + r_{k_a}) / k_a,
  and lim_{k_a → ∞} Q_t(a) = Q*(a).

ε-Greedy Action Selection
• Greedy action selection: a_t = a_t* = argmax_a Q_t(a)
• ε-greedy: a_t = a_t* with probability 1 - ε, and a random action with probability ε
• ... the simplest way to balance exploration and exploitation

10-Armed Testbed
• n = 10 possible actions
• Each Q*(a) is chosen from a normal distribution; each reward r_t is also normal
• 1000 plays; repeat the whole thing 2000 times and average the results
• (A code sketch of an ε-greedy learner on such a bandit appears at the end of this block of slides.)

ε-Greedy Methods on the 10-Armed Testbed
• [Figure: average performance of greedy and ε-greedy methods on the 10-armed testbed]

Incremental Implementation
• Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on a):
  Q_k = (r_1 + r_2 + ... + r_k) / k
• Can we compute this incrementally, without storing all the rewards? We could keep a running sum and count, or, equivalently:
  Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} - Q_k]
• This is a common form for update rules:
  NewEstimate = OldEstimate + StepSize [Target - OldEstimate]

Tracking a Nonstationary Problem
• Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem.
• Better in the nonstationary case is the constant-step-size update
  Q_{k+1} = Q_k + α [r_{k+1} - Q_k],  for constant α, 0 < α ≤ 1,
  which gives
  Q_k = (1 - α)^k Q_0 + Σ_{i=1}^{k} α (1 - α)^{k-i} r_i,
  an exponential, recency-weighted average.

Formal Description of RL: The Agent-Environment Interface
• Agent and environment interact at discrete time steps t = 0, 1, 2, ...
• At step t the agent observes state s_t ∈ S, produces action a_t ∈ A(s_t), and gets the resulting reward r_{t+1} and resulting next state s_{t+1}
• [Diagram: ... s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3} ...]

The Agent Learns a Policy
• The policy at step t, π_t, is a mapping from states to action probabilities: π_t(s, a) = probability that a_t = a when s_t = s
• Reinforcement learning methods specify how the agent changes its policy as a result of experience
• Roughly, the agent's goal is to get as much reward as it can over the long run.
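Here is the promised minimal sketch of ε-greedy action selection with incremental sample-average updates on a bandit like the 10-armed testbed. The normal distributions and the constants (n = 10, ε = 0.1, 1000 plays) follow the slides; the variable names and the single-run structure (no 2000-run averaging) are illustrative simplifications.

```python
import random

n, epsilon, plays = 10, 0.1, 1000
q_star = [random.gauss(0.0, 1.0) for _ in range(n)]   # true action values Q*(a), unknown to the learner
Q = [0.0] * n                                         # sample-average estimates Q_t(a)
counts = [0] * n                                      # k_a: how often each action has been chosen

for t in range(plays):
    if random.random() < epsilon:                     # explore with probability epsilon
        a = random.randrange(n)
    else:                                             # exploit: a_t = argmax_a Q_t(a)
        a = max(range(n), key=lambda i: Q[i])
    r = random.gauss(q_star[a], 1.0)                  # reward drawn from a normal centered on Q*(a)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]                    # incremental update: Q_{k+1} = Q_k + (1/(k+1))[r - Q_k]

print("best true action:", max(range(n), key=lambda i: q_star[i]),
      "most-chosen action:", max(range(n), key=lambda i: counts[i]))
```

Averaging the reward per play over many such runs (the 2000 repetitions mentioned in the testbed slide) reproduces the comparison between greedy and ε-greedy methods.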
Getting the Degree of Abstraction Right
• Time steps need not refer to fixed intervals of real time.
• Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., a shift in the focus of attention), etc.
• States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
• An RL agent is not like a whole animal or robot.
• Reward computation is in the agent's environment because the agent cannot change it arbitrarily.
• The environment is not necessarily unknown to the agent, only incompletely controllable.

Goals and Rewards
• Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal must be outside the agent's direct control, and thus outside the agent.
• The agent must be able to measure success: explicitly, and frequently during its lifespan.

Returns
• Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ... What do we want to maximize?
• In general, we want to maximize the expected return E[R_t] for each step t.
• Episodic tasks: the interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. Then
  R_t = r_{t+1} + r_{t+2} + ... + r_T,
  where T is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks
• Continuing tasks: the interaction does not have natural episodes. Use the discounted return
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1},
  where γ, 0 ≤ γ ≤ 1, is the discount rate.
• γ close to 0: shortsighted; γ close to 1: farsighted.

An Example (Pole Balancing)
• Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
• As an episodic task where the episode ends upon failure:
  reward = +1 for each step before failure; return = number of steps before failure.
• As a continuing task with discounted return:
  reward = -1 upon failure, 0 otherwise; return = -γ^k, for k steps before failure.
• In either case, the return is maximized by avoiding failure for as long as possible.

A Unified Notation
• In episodic tasks, we number the time steps of each episode starting from zero.
• We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
• Think of each episode as ending in an absorbing state that always produces a reward of zero.
• We can then cover all cases by writing
  R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},
  where γ can be 1 only if a zero-reward absorbing state is always reached.

The Markov Property
• "The state" at step t means whatever information is available to the agent at step t about its environment.
• The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov property:
  Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0} = Pr{s_{t+1} = s', r_{t+1} = r | s_t, a_t}
  for all s', r, and histories s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0.

Markov Decision Processes
• If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
  - the state and action sets;
  - the one-step "dynamics" defined by the transition probabilities
    P^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a}   for all s, s' ∈ S, a ∈ A(s);
  - the expected rewards
    R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']   for all s, s' ∈ S, a ∈ A(s).
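The finite-MDP definition above can be written down directly as tables of transition probabilities P^a_{ss'} and expected rewards R^a_{ss'}. Below is a minimal sketch for a made-up two-state, two-action MDP; the state names, action names, and numbers are invented for illustration and are not from the slides.

```python
import random

# Transition probabilities P[s][a][s'] and expected rewards R[s][a][s']
# for a made-up two-state, two-action finite MDP.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 0.7, "s1": 0.3}},
}
R = {
    "s0": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.0, "s1": 2.0}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": -1.0, "s1": 1.0}},
}

def sample_step(s, a):
    """Use the distribution model to sample a next state and look up its expected reward."""
    next_states = list(P[s][a].keys())
    probs = [P[s][a][sp] for sp in next_states]
    sp = random.choices(next_states, weights=probs)[0]
    return sp, R[s][a][sp]

print(sample_step("s0", "go"))
```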
An Example Finite MDP: the Recycling Robot
• At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high or low.
• Reward = number of cans collected.

Recycling Robot MDP
• S = {high, low}
• A(high) = {search, wait}; A(low) = {search, wait, recharge}
• R^search = expected number of cans while searching; R^wait = expected number of cans while waiting; R^search > R^wait.

Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy. State-value function for policy π:
  V^π(s) = E_π[R_t | s_t = s] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π. Action-value function for policy π:
  Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ]

Bellman Equation for a Policy π
• The basic idea:
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
      = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ...)
      = r_{t+1} + γ R_{t+1}
• So:
  V^π(s) = E_π[R_t | s_t = s] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
• Or, without the expectation operator:
  V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

More on the Bellman Equation
• V^π(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
• This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.
• [Backup diagrams for V^π and Q^π]

Optimal Value Functions
• For finite MDPs, policies can be partially ordered: π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S.
• There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all π*.
• Optimal policies share the same optimal state-value function:
  V*(s) = max_π V^π(s)   for all s ∈ S
• Optimal policies also share the same optimal action-value function:
  Q*(s, a) = max_π Q^π(s, a)   for all s ∈ S and a ∈ A(s).
  This is the expected return for taking action a in state s and thereafter following an optimal policy.

Bellman Optimality Equation for V*
• The value of a state under an optimal policy must equal the expected return for the best action from that state:
  V*(s) = max_{a ∈ A(s)} Q*(s, a)
        = max_{a ∈ A(s)} E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
        = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]
• [Backup diagram for V*]
• V* is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*
  Q*(s, a) = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a ]
           = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ]
• [Backup diagram for Q*]
• Q* is the unique solution of this system of nonlinear equations.

Gridworld
• Actions: north, south, east, west; deterministic.
• If an action would take the agent off the grid: no move, but reward = -1.
• Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.
• [Figure: state-value function for the equiprobable random policy; γ = 0.9]

Why Optimal State-Value Functions are Useful
• Any policy that is greedy with respect to V* is an optimal policy.
• Therefore, given V*, a one-step-ahead search produces the long-term optimal actions (e.g., back to the gridworld).

What About Optimal Action-Value Functions?
• Given Q*, the agent does not even have to do a one-step-ahead search:
  π*(s) = argmax_{a ∈ A(s)} Q*(s, a)

Solving the Bellman Optimality Equation
• Finding an optimal policy by solving the Bellman optimality equation requires:
  - accurate knowledge of the environment dynamics;
  - enough space and time to do the computation;
  - the Markov property.
• How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; Chapter 4), BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman optimality equation.
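When P^a_{ss'} and R^a_{ss'} are known, one iterative way to solve the Bellman optimality equation is to apply the max-backup repeatedly as an update rule (value iteration, one of the dynamic-programming methods alluded to above). Below is a minimal sketch on the same kind of made-up two-state MDP used earlier; all names and numbers are illustrative, not from the slides.

```python
# Value iteration: repeatedly apply
#   V(s) <- max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
# until the values stop changing, then read off a greedy (hence optimal) policy.
gamma, theta = 0.9, 1e-8

P = {  # made-up two-state MDP: P[s][a][s'] and R[s][a][s']
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "go": {"s0": 0.7, "s1": 0.3}},
}
R = {
    "s0": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.0, "s1": 2.0}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": -1.0, "s1": 1.0}},
}

V = {s: 0.0 for s in P}
while True:
    delta = 0.0
    for s in P:
        backups = {a: sum(P[s][a][sp] * (R[s][a][sp] + gamma * V[sp]) for sp in P[s][a])
                   for a in P[s]}
        new_v = max(backups.values())          # the max-backup of the Bellman optimality equation
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:                          # stop when no value changed noticeably
        break

# A policy that is greedy with respect to V* is optimal (as stated above).
policy = {s: max(P[s], key=lambda a: sum(P[s][a][sp] * (R[s][a][sp] + gamma * V[sp]) for sp in P[s][a]))
          for s in P}
print(V, policy)
```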
Solving MDPs
• The goal in an MDP is to find the policy that maximizes long-term reward (value).
• If all the information (dynamics and rewards) is available, this can be done by solving the equations explicitly, or by iterative methods.

Policy Iteration
• π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V* → π*
• Alternate policy evaluation (compute V^{π_i}) with policy improvement ("greedification" with respect to V^{π_i}).

From MDPs to Reinforcement Learning: MDPs vs RL
• MDP methods are great if we know the transition probabilities
  P^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a}   for all s, s' ∈ S, a ∈ A(s)
  and the expected immediate rewards
  R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']   for all s, s' ∈ S, a ∈ A(s).
• But in real-life tasks this is precisely what we do NOT know! This is what turns the task into a learning problem.
• The simplest method to learn: Monte Carlo.

Monte Carlo Methods
• Monte Carlo methods learn from complete sample returns; they are only defined for episodic tasks.
• Monte Carlo methods learn directly from experience:
  - On-line: no model necessary, and still attains optimality
  - Simulated: no need for a full model

Monte Carlo Policy Evaluation
• Goal: learn V^π(s)
• Given: some number of episodes under π which contain s
• Idea: average the returns observed after visits to s
• Every-visit MC: average the returns for every time s is visited in an episode
• First-visit MC: average the returns only for the first time s is visited in an episode
• Both converge asymptotically
• [Algorithm box: first-visit Monte Carlo policy evaluation]

Monte Carlo Estimation of Action Values (Q)
• Monte Carlo is most useful when a model is not available: we want to learn Q*.
• Q^π(s, a): the average return starting from state s and action a and thereafter following π.
• Also converges asymptotically if every state-action pair is visited.
• Exploring starts: every state-action pair has a non-zero probability of being the starting pair.

Monte Carlo Control
• MC policy iteration: policy evaluation using MC methods, followed by policy improvement.
• Policy improvement step: greedify with respect to the value (or action-value) function.

Monte Carlo Exploring Starts
• The fixed point is the optimal policy π*. Now proven (almost).
• Problem with Monte Carlo: learning happens only after the end of an entire episode.

TD Prediction
• Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
• Recall the simple every-visit Monte Carlo method:
  V(s_t) ← V(s_t) + α [ R_t - V(s_t) ]
  where the target R_t is the actual return after time t.
• The simplest TD method, TD(0):
  V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ]
  where the target r_{t+1} + γ V(s_{t+1}) is an estimate of the return.

Simple Monte Carlo
• V(s_t) ← V(s_t) + α [ R_t - V(s_t) ], where R_t is the actual return following state s_t.
• [Backup diagram: the complete episode from s_t down to a terminal state]

Simplest TD Method
• V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ]
• [Backup diagram: a single sampled step from s_t via r_{t+1} to s_{t+1}]
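A minimal sketch of tabular TD(0) prediction using the one-step update above. The environment is a small invented symmetric random-walk chain with a reward of 1 for stepping off the right end, and the values of α and γ are arbitrary choices; only the update rule itself comes from the slides.

```python
import random

alpha, gamma, episodes = 0.1, 1.0, 200
n_states = 5                      # non-terminal states 0..4; stepping off either end terminates
V = [0.0] * n_states              # V(s) for the non-terminal states

for _ in range(episodes):
    s = n_states // 2             # start each episode in the middle of the chain
    while True:
        s_next = s + random.choice([-1, 1])          # the fixed policy: move left or right at random
        if s_next < 0:            # terminated on the left: reward 0
            r, done = 0.0, True
        elif s_next >= n_states:  # terminated on the right: reward 1
            r, done = 1.0, True
        else:
            r, done = 0.0, False
        v_next = 0.0 if done else V[s_next]          # the value of a terminal state is 0
        V[s] += alpha * (r + gamma * v_next - V[s])  # TD(0): V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
        if done:
            break
        s = s_next

print([round(v, 2) for v in V])   # for this chain the true values are 1/6, 2/6, ..., 5/6
```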
cf. Dynamic Programming
• V(s_t) ← E_π[ r_{t+1} + γ V(s_{t+1}) ]
• [Backup diagram: a full-width one-step backup over all possible actions and successor states]

TD Methods Bootstrap and Sample
• Bootstrapping: the update involves an estimate
  - MC does not bootstrap
  - DP bootstraps
  - TD bootstraps
• Sampling: the update does not involve an expected value
  - MC samples
  - DP does not sample
  - TD samples

Example: Driving Home

  State               Elapsed time (minutes)   Predicted time to go   Predicted total time
  leaving office      0                        30                     30
  reach car, raining  5 (5)                    35                     40
  exit highway        20 (15)                  15                     35
  behind truck        30 (10)                  10                     40
  home street         40 (10)                  3                      43
  arrive home         43 (3)                   0                      43

Driving Home
• [Figure: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)]

Advantages of TD Learning
• TD methods do not require a model of the environment, only experience.
• TD, but not MC, methods can be fully incremental:
  - you can learn before knowing the final outcome (less memory, less peak computation)
  - you can learn without the final outcome (from incomplete sequences)
• Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk Example
• [Figure: values learned by TD(0) after various numbers of episodes]

TD and MC on the Random Walk
• [Figure: learning curves, data averaged over 100 sequences of episodes]

Optimality of TD(0)
• Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
• Compute updates according to TD(0), but only update the estimates after each complete pass through the data.
• For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
• Constant-α MC also converges under these conditions, but to a different answer!

Random Walk under Batch Updating
• After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.

You are the Predictor
• Suppose you observe the following 8 episodes:
  A, 0, B, 0
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 0
• What are V(A) and V(B)? (Everyone agrees V(B) = 0.75: six of the eight episodes from B ended with reward 1.)
• The prediction for A that best matches the training data is V(A) = 0.
  - This minimizes the mean-squared error on the training set.
  - This is what a batch Monte Carlo method gets.
• If we consider the sequentiality of the problem, then we would set V(A) = 0.75.
  - This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?).
  - This is called the certainty-equivalence estimate.
  - This is what TD(0) gets.

Learning an Action-Value Function
• Estimate Q^π for the current behavior policy π.
• After every transition from a nonterminal state s_t, do this:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
• If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
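A minimal sketch of the action-value update just above, written as a stand-alone function over a dictionary of Q-values. The function name, the dictionary representation, and the default step size are illustrative choices, not from the slides; the Sarsa slide that follows plugs exactly this kind of update into a control loop.

```python
def td_action_value_update(Q, s, a, r, s_next, a_next, terminal, alpha=0.1, gamma=1.0):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)] in place."""
    q_next = 0.0 if terminal else Q.get((s_next, a_next), 0.0)  # Q of a terminal state is taken as 0
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * q_next - q_sa)

# Example: one observed transition while following the behavior policy.
Q = {}
td_action_value_update(Q, s="x", a="left", r=1.0, s_next="y", a_next="right", terminal=False)
print(Q)
```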
Sarsa: On-Policy TD Control
• Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.

Windy Gridworld
• Undiscounted, episodic; reward = -1 until the goal is reached.
• [Figure: results of Sarsa on the windy gridworld]

Q-Learning: Off-Policy TD Control
• One-step Q-learning:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

Planning and Learning
• Use of environment models.
• Integration of planning and learning methods.

The Original Idea (Sutton, 1990)

Models
• Model: anything the agent can use to predict how the environment will respond to its actions.
• Distribution model: a description of all possibilities and their probabilities, i.e., P^a_{ss'} and R^a_{ss'} for all s, s', and a ∈ A(s).
• Sample model: produces sample experiences, e.g., a simulation model.
• Both types of models can be used to produce simulated experience.
• Often sample models are much easier to come by.

Planning
• Planning: any computational process that uses a model to create or improve a policy.
• Planning in AI: state-space planning; plan-space planning (e.g., partial-order planners).
• We take the following (unusual) view:
  - all state-space planning methods involve computing value functions, either explicitly or implicitly;
  - they all apply backups to simulated experience.

Planning (cont.)
• Classical DP methods are state-space planning methods.
• Heuristic search methods are state-space planning methods.
• A planning method based on Q-learning: random-sample one-step tabular Q-planning.

Learning, Planning, and Acting
• Two uses of real experience:
  - model learning: to improve the model;
  - direct RL: to directly improve the value function and policy.
• Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.

Direct vs. Indirect RL
• Indirect methods make fuller use of experience: they get a better policy with fewer environment interactions.
• Direct methods are simpler and are not affected by bad models.
• But the two are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.

The Dyna Architecture (Sutton 1990)

The Dyna-Q Algorithm
• Combines direct RL, model learning, and planning.

Dyna-Q on a Simple Maze
• Rewards = 0 until the goal is reached, when the reward is 1.

Value Prediction with Function Approximation
• As usual, policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
• In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.
• E.g., θ_t could be the vector of connection weights of a neural network.

Adapt Supervised Learning Algorithms
• Training info = desired (target) outputs; the supervised learning system maps inputs to outputs.
• Training example = {input, target output}; error = (target output - actual output).
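Following the "adapt supervised learning" framing above, here is a minimal sketch of value prediction with a linear function approximator V(s) ≈ θ·x(s): each training example pairs a state's features (input) with an observed return (target output), and the parameter vector θ is nudged to reduce the squared error between target and actual output. The feature function, step size, and data are all invented for illustration; the slides themselves stop short of specifying this update.

```python
# Linear value-function approximation: V(s) ~= dot(theta, x(s)).
# Each training example is {input: features of s, target output: an observed return},
# and the parameter vector theta is moved to reduce (target - actual)^2.
alpha = 0.05

def features(s):
    """Hypothetical feature vector for a state described by two numbers."""
    return [1.0, s[0], s[1]]          # bias term plus the raw state variables

theta = [0.0, 0.0, 0.0]               # the parameter vector that gets updated

def v_hat(s):
    return sum(w * x for w, x in zip(theta, features(s)))

# Made-up (state, observed return) training examples.
examples = [((0.2, 1.0), 1.4), ((0.9, 0.1), 0.3), ((0.5, 0.5), 0.9)] * 200

for s, target in examples:
    error = target - v_hat(s)                                     # error = target output - actual output
    x = features(s)
    theta = [w + alpha * error * xi for w, xi in zip(theta, x)]   # gradient step for a linear estimator

print([round(w, 2) for w in theta], round(v_hat((0.2, 1.0)), 2))
```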