Chap. 13  Reinforcement Learning (RL)
Machine Learning, Tom M. Mitchell

Outline
- What is Reinforcement Learning?
- Methods Used in Reinforcement Learning
- Temporal Difference Methods
- Applications

Introduction
- What is reinforcement learning?
- History
- What can reinforcement learning do?
- Elements of reinforcement learning

What is Reinforcement Learning?
Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. In reinforcement learning, the computer is simply given a goal to achieve; it then learns how to achieve that goal by trial-and-error interactions with its environment.

To provide the intuition behind reinforcement learning, consider the problem of learning to ride a bicycle. The goal given to the RL system is simply to ride the bicycle without falling over. In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right. At this point there are two possible actions: turn the handlebars left or turn them right. The RL system turns the handlebars to the left and immediately crashes to the ground, thus receiving a negative reinforcement. The RL system has just learned not to turn the handlebars left when tilted 45 degrees to the right. In the next trial, the RL system knows not to turn the handlebars to the left, so it performs the only other possible action: turn right when tilted 45 degrees to the right. It immediately crashes to the ground, again receiving a strong negative reinforcement. At this point the RL system has learned not only that turning the handlebars right or left when tilted 45 degrees to the right is bad, but that the "state" of being tilted 45 degrees to the right is itself bad.
Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right, and so on.

The Agent and Its Environment
[Figure: the agent observes a state and a reward from the environment and responds with an action, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, …]
The agent seeks to maximize the discounted sum of rewards
  r0 + γ r1 + γ² r2 + … ,  where 0 ≤ γ < 1.
The discount factor γ is used to exponentially decrease the weight of reinforcements received in the future.

RL systems learn a mapping from situations to actions by trial-and-error interactions with a dynamic environment. The "goal" of the RL system is defined using the concept of a reward function, which is the exact function of future reinforcements (rewards) the agent seeks to maximize. In other words, there exists a mapping from state/action pairs to rewards: after performing an action in a given state, the RL agent receives some reward in the form of a scalar value. The RL agent learns to perform actions that will maximize the sum of the rewards received when starting from some initial state and proceeding to a terminal state.

RL vs. Other Function Approximation
RL is similar in some respects to the function approximation problems discussed in other chapters. The target function to be learned in this case is a control policy π: S → A. A policy determines which action should be performed in each state; that is, a policy is a mapping from states to actions. The value of a state is defined as the sum of the rewards received when starting in that state and following some fixed policy to a terminal state. The optimal policy is therefore the mapping from states to actions that maximizes the sum of the rewards when starting in an arbitrary state and performing actions until a terminal state is reached.

The reinforcement learning problem differs from other function approximation tasks in several important respects:
- Delayed reward: in RL, a direct correspondence between states and correct actions is not available.
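The discounted sum of rewards above can be sketched in a few lines of Python. This is a minimal illustration; the reward sequence and the choice γ = 0.9 are assumptions for the example, not values from the text:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward sequence."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# A hypothetical episode: no reward for two steps, then a reward of 100.
print(round(discounted_return([0, 0, 100], gamma=0.9), 2))  # 81.0
```

Because γ < 1, a reward received two steps in the future is weighted by γ² = 0.81, which is why the same reward of 100 is worth only 81 from two steps away.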
The trainer provides only a sequence of immediate reward values as the agent executes its actions, so the agent faces the problem of temporal credit assignment: determining which of its actions deserve credit or blame for rewards received later.
- Exploration: the agent influences the distribution of training examples by the action sequence it chooses. The question is which experimentation strategy produces the most effective learning.

The Learning Task
- Ways of formulation
- Markov Decision Process (MDP)
- Precise task definition
- An example

Ways of Formulation
There are many ways to formulate the problem of learning sequential control strategies:
- Agent's actions: may be deterministic or nondeterministic; the agent may or may not have the ability to predict the next state that will result from each action.
- Trainer of the agent: an expert (who shows it examples of optimal action sequences) or the agent itself (training by performing actions of its own choice).

States and Transitions
[Figure: a state-transition graph in which states s0, s1, …, s8 are connected by actions a1, …, a7.]

Markov Decision Process (MDP)
- Finite set of states S; set of actions A.
- t: discrete time step; s_t: the state at time t; a_t: the action at time t.
- At each discrete time step, the agent observes the state s_t ∈ S and chooses an action a_t ∈ A. It then receives an immediate reward r_t, and the state changes to s_{t+1}.
- Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t); that is, r_t and s_{t+1} depend only on the current state and action.
- The functions δ and r may be nondeterministic, and they are not necessarily known to the agent.
This produces the interaction sequence s_t, a_t, r_t, s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, a_{t+2}, r_{t+2}, …

Learning Task
The agent executes actions in the environment, observes the results, and learns a policy π: S → A from states s_t ∈ S to actions a_t ∈ A that maximizes the accumulated reward
  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + …
from any starting state s_t, where 0 ≤ γ < 1 is the discount factor for future rewards.
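In a deterministic environment, V^π(s) can be computed directly by following the policy π from s until a terminal state is reached. Below is a minimal sketch of that recursion; the three-state chain, its rewards, and γ = 0.9 are hypothetical choices for illustration, not an example from the text:

```python
# Hypothetical deterministic MDP: s0 --right--> s1 --right--> s2 (terminal),
# with reward 100 received on the transition into s2 and 0 elsewhere.
delta = {("s0", "right"): "s1", ("s1", "right"): "s2"}  # transition function delta(s, a)
reward = {("s0", "right"): 0, ("s1", "right"): 100}     # reward function r(s, a)
pi = {"s0": "right", "s1": "right"}                     # a fixed policy pi: S -> A

def v_pi(s, gamma=0.9):
    """V^pi(s) = r(s, pi(s)) + gamma * V^pi(delta(s, pi(s))); terminal states have value 0."""
    if s not in pi:  # terminal state: no further actions, no further reward
        return 0.0
    a = pi[s]
    return reward[(s, a)] + gamma * v_pi(delta[(s, a)], gamma)

print(round(v_pi("s1"), 2))  # 100.0
print(round(v_pi("s0"), 2))  # 0 + 0.9 * 100 = 90.0
```

Note how the value of s0 is the discounted value of s1: each step away from the reward multiplies it by another factor of γ.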
The target function is π: S → A, but there are no direct training examples of the form <s, a>; training examples are of the form <<s, a>, r>.

State Value Function
Consider deterministic environments, namely δ(s, a) and r(s, a) are deterministic functions of s and a. For each policy π: S → A the agent might adopt, we define an evaluation function:
  V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + … = Σ_{i=0}^{∞} γ^i r_{t+i}    (13.1)
where r_t, r_{t+1}, … are generated by following the policy π from start state s.
Task: learn the optimal policy π* that maximizes V^π(s):
  π* = argmax_π V^π(s),  ∀s    (13.2)

Action Value Function
The state value function denotes the reward for starting in state s and following policy π:
  V^π(s) = r_t + γ r_{t+1} + γ² r_{t+2} + …
The action value function denotes the reward for starting in state s, taking action a, and following policy π afterwards:
  Q^π(s, a) = r(s, a) + γ r_{t+1} + γ² r_{t+2} + … = r(s, a) + γ V^π(δ(s, a))

Optimal Value Functions
- V*(s) = max_π V^π(s)
- π* is the policy that maximizes V^π(s) for all states s:
  π*(s) = argmax_π V^π(s),  ∀s    (13.2)
  π*(s) = argmax_a { r(s, a) + γ V*(δ(s, a)) }    (13.3)

Example
[Figure: a grid world with goal state G, showing the r(s, a) (immediate reward) values, which are 0 everywhere except 100 for actions entering G; the resulting V*(s) values (100, 90, 81, …); one optimal policy; and all optimal policies.]

What to Learn
- The Q-function
- A training rule to learn an approximate Q
- A simple deterministic world
- Q-learning for deterministic worlds
- An example
- Explore or exploit?
Homework: 13.1, 13.2

Example
Taking action a_right from state s1 leads to state s2. With γ = 0.9, the training rule gives
  Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a') = 0 + 0.9 × max{66, 81, 100} = 90
[Figure: grids showing the initial Q̂(s, a) values and the Q̂(s, a) values after the update, including entries such as 72, 81, 90, and 100.]

Thank you
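The Q-learning update used in the example above, Q̂(s, a) ← r + γ max_{a'} Q̂(s', a'), can be sketched end to end as follows. The 2×3 grid, its reward of 100 for entering the goal, and the purely random exploration strategy are illustrative assumptions, not the chapter's exact example:

```python
import random

GAMMA = 0.9
STATES = [(r, c) for r in range(2) for c in range(3)]  # a toy 2x3 grid world
GOAL = (0, 2)                                          # absorbing goal state G
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    """Deterministic transition delta(s, a) and reward r(s, a)."""
    nr, nc = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    if not (0 <= nr < 2 and 0 <= nc < 3):
        return s, 0                        # moving into a wall leaves the state unchanged
    s2 = (nr, nc)
    return s2, (100 if s2 == GOAL else 0)  # reward 100 only on entering the goal

def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice([x for x in STATES if x != GOAL])
        while s != GOAL:
            a = rng.choice(sorted(ACTIONS))  # purely random exploration
            s2, r = step(s, a)
            # Q-learning update for deterministic worlds:
            q[(s, a)] = r + GAMMA * max(q[(s2, b)] for b in ACTIONS)
            s = s2
    return q

q = q_learning()
print(round(q[((0, 1), "right")], 1))  # 100.0 (one step into the goal)
print(round(q[((0, 0), "right")], 1))  # 90.0  (= 0.9 * 100, one step further away)
```

Each update backs up the value of the best successor action, so after repeated episodes the Q̂ values settle into the γ-discounted pattern 100, 90, 81, … radiating out from the goal, matching the grid-world figures in the slides.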