Reinforcement Learning
Yishay Mansour, Tel-Aviv University

Outline
• Goal of Reinforcement Learning
• Mathematical Model (MDP)
• Planning

Goal of Reinforcement Learning
Goal-oriented learning through interaction: control of large-scale stochastic environments with partial knowledge.
Supervised / Unsupervised Learning: learn from labeled / unlabeled examples.

Reinforcement Learning - origins
Artificial Intelligence, Control Theory, Operations Research, Cognitive Science & Psychology.
Solid foundations; well-established research.

Typical Applications
• Robotics
  – Elevator control [CB].
  – Robo-soccer [SV].
• Board games
  – Backgammon [T].
  – Checkers [S].
  – Chess [B].
• Scheduling
  – Dynamic channel allocation [SB].
  – Inventory problems.

Contrast with Supervised Learning
The system has a "state", and the algorithm influences the state distribution.
Inherent tradeoff: exploration versus exploitation.

Mathematical Model - Motivation
A model of uncertainty: the environment, the actions, and our knowledge.
Focus on decision making; maximize long-term reward.
Markov Decision Process (MDP).

Mathematical Model - MDP
A Markov decision process consists of:
• S - set of states
• A - set of actions
• δ - transition probabilities
• R - reward function
Similar to a DFA!

MDP model - states and actions
Environment = states; actions = transitions δ(s, a, s').
[Figure: an action a taken at a state leads to two possible next states, with probabilities 0.7 and 0.3.]

MDP model - rewards
R(s, a) = the reward at state s for doing action a (a random variable).
Example: R(s, a) = -1 with probability 0.5, +10 with probability 0.35, +20 with probability 0.15.

MDP model - trajectories
Trajectory: s0, a0, r0, s1, a1, r1, s2, a2, r2, ...

MDP - Return function
Combine all the immediate rewards into a single value.
Modeling issues: Are early rewards more valuable than later rewards? Is the system "terminating" or continuous?
Usually the return is linear in the immediate rewards.

MDP model - return functions
• Finite horizon (parameter H): return = Σ_{i=1}^{H} R(s_i, a_i)
• Infinite horizon, discounted (parameter γ < 1): return = Σ_{i=0}^{∞} γ^i R(s_i, a_i)
• Infinite horizon, undiscounted: return = lim_{N→∞} (1/N) Σ_{i=0}^{N-1} R(s_i, a_i)
• Terminating MDP: return = Σ_{i=0}^{N} R(s_i, a_i), where N is the termination time.

MDP model - action selection
Aim: maximize the expected return.
Fully observable - the agent can "see" the "entire" state.
Policy - a mapping from states to actions.
Optimal policy: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.

Contrast with Supervised Learning
Supervised learning: a fixed distribution on examples.
Reinforcement learning: the state distribution is policy dependent!
A small local change in the policy can make a huge global change in the return.

MDP model - summary
• s ∈ S - set of states, |S| = n.
• a ∈ A - set of actions, |A| = k.
• δ(s1, a, s2) - transition function.
• R(s, a) - immediate reward function.
• π: S → A - policy.
• Σ_{i=0}^{∞} γ^i r_i - discounted cumulative return.

Simple example: N-armed bandit
A single state s with actions a1, a2, a3, ...
Goal: maximize the sum of immediate rewards.
Given the model: pick the greedy action. Difficulty: the model is unknown.

N-Armed Bandit: Highlights
• Algorithms (near greedy):
  – Exponential weights: G_i = sum of rewards of action a_i, w_i = e^{G_i}.
  – Follow the leader.
• Results: for any sequence of T rewards, E[online] ≥ max_i {G_i} - sqrt(T log N).

Planning - Basic Problems
Given a complete MDP model:
• Policy evaluation - given a policy π, estimate its return.
• Optimal control - find an optimal policy π* (maximizing the return from any start state).

Planning - Value Functions
V^π(s): the expected return starting at state s and following π.
Q^π(s, a): the expected return starting at state s, taking action a, and then following π.
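To make the definition of V^π concrete, here is a minimal Python sketch (not part of the original slides) that estimates V^π(s) as the average discounted return over sampled trajectories. It uses the four-state ring example from the policy-evaluation slides below (A = {+1, -1}, δ(s_i, a) = s_{(i+a) mod 4}, R(s_i, a) = i, γ = 1/2, uniformly random π), so the estimate can be checked against the exact value V^π(s0) = 5/3 derived there.

```python
import random

# Four-state ring MDP from the policy-evaluation example later in the deck:
# states 0..3, actions +1/-1 move around the ring, R(s_i, a) = i, gamma = 1/2,
# and the evaluated policy pi picks +1 or -1 uniformly at random.
GAMMA = 0.5
ACTIONS = (+1, -1)

def step(state, action):
    """Deterministic ring transition: delta(s_i, a) = s_{(i + a) mod 4}."""
    return (state + action) % 4

def reward(state, action):
    """R(s_i, a) = i for every action (constant here; in general a random variable)."""
    return state

def random_policy(state):
    return random.choice(ACTIONS)

def monte_carlo_value(start, policy, episodes=20_000, horizon=40):
    """Estimate V^pi(start) as the average discounted return over rollouts.
    Rollouts are truncated at `horizon`; with gamma = 1/2 the neglected tail is tiny."""
    total = 0.0
    for _ in range(episodes):
        s, discount, ret = start, 1.0, 0.0
        for _ in range(horizon):
            a = policy(s)
            ret += discount * reward(s, a)
            discount *= GAMMA
            s = step(s, a)
        total += ret
    return total / episodes

if __name__ == "__main__":
    # Should come out close to the exact value V^pi(s0) = 5/3 ~ 1.667 derived below.
    print(monte_carlo_value(0, random_policy))
```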
V*(s) and Q*(s, a) are defined using an optimal policy π*: V*(s) = max_π V^π(s).

Planning - Policy Evaluation
Discounted infinite horizon (Bellman equation):
V^π(s) = E_{s'} [R(s, π(s)) + γ V^π(s')], where s' ~ δ(s, π(s), ·).
Rewriting the expectation:
V^π(s) = E[R(s, π(s))] + γ Σ_{s'} δ(s, π(s), s') V^π(s').
This is a linear system of equations.

Algorithms - Policy Evaluation Example
A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{(i+a) mod 4}, π is the uniformly random policy, and for all a: R(s_i, a) = i.
[Figure: four states s0, s1, s2, s3 arranged in a ring, labeled with rewards 0, 1, 2, 3.]
V^π(s0) = 0 + γ [π(+1 | s0) V^π(s1) + π(-1 | s0) V^π(s3)] = 0 + (V^π(s1) + V^π(s3)) / 4.
Solving the linear system: V^π(s0) = 5/3, V^π(s1) = 7/3, V^π(s2) = 11/3, V^π(s3) = 13/3.

Algorithms - Optimal Control
State-action value function:
Q^π(s, a) = E[R(s, a)] + γ E_{s' ~ δ(s, a, ·)} [V^π(s')].
Note that V^π(s) = Q^π(s, π(s)) for a deterministic policy π.

Algorithms - Optimal Control Example
In the same four-state example:
Q^π(s0, +1) = 0 + γ V^π(s1) = (1/2)(7/3) = 7/6, and Q^π(s0, -1) = 0 + γ V^π(s3) = (1/2)(13/3) = 13/6.

Algorithms - Optimal Control
CLAIM: A policy π is optimal if and only if at each state s: V^π(s) = max_a {Q^π(s, a)} (Bellman equation).
PROOF (sketch): Assume there is a state s and an action a such that V^π(s) < Q^π(s, a). Then the strategy that performs a the first time it visits s, and follows π otherwise, is better than π. The same holds on every visit to s, so the policy that always performs action a at state s is better than π.

Algorithms - Optimal Control Example
In the four-state example, the policy is improved by changing it using the state-action value function: at each state, switch to the action with the larger Q-value.

Algorithms - Optimal Control
The greedy policy with respect to Q^π(s, a): π'(s) = argmax_a {Q^π(s, a)}.
The ε-greedy policy with respect to Q^π(s, a): π'(s) = argmax_a {Q^π(s, a)} with probability 1 - ε, and a random action with probability ε.

MDP - computing the optimal policy
1. Linear programming.
2. Value iteration: V^{i+1}(s) = max_a {R(s, a) + γ Σ_{s'} δ(s, a, s') V^i(s')}.
3. Policy iteration: π^{i}(s) = argmax_a {Q^{π^{i-1}}(s, a)}.

Convergence
• Value Iteration: the distance from optimal, max_s {V*(s) - V^t(s)}, drops with each iteration.
• Policy Iteration: the policy can only improve, ∀s: V^{π^{t+1}}(s) ≥ V^{π^t}(s). Fewer iterations than value iteration, but each iteration is more expensive.

Relations to Board Games
• state = current board
• action = what we can play
• opponent action = part of the environment
• value function = probability of winning
• Q-function = modified policy
• Hidden assumption: the game is Markovian

Planning versus Learning
Planning and learning are tightly coupled in reinforcement learning.
Goal: maximize the return while learning.

Example - Elevator Control
Learning (alone): model the arrival process well.
Planning (alone): given the arrival model, build a schedule.
Real objective: construct a schedule while updating the model.

Partially Observable MDP
Rather than observing the state, we observe some function of the state.
Ob - the observation function, a random variable for each state.
Examples: (1) Ob(s) = s + noise. (2) Ob(s) = the first bit of s.
Problem: different states may "look" similar.
The optimal strategy is history dependent!

POMDP - Belief State Algorithm
Given a history of actions and observations, we compute a posterior distribution over the current state (the belief state).
The belief-state MDP:
• States: distributions over S (the states of the POMDP).
• Actions: as in the POMDP.
• Transitions: the posterior distribution (given the observation).
We can perform planning and learning on the belief-state MDP.
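The belief-state construction above treats the posterior itself as the MDP state. Below is a minimal sketch, not from the original slides, of the Bayes update that produces that posterior for a finite POMDP; the array names (trans, obs_prob) and the toy numbers are illustrative assumptions rather than the deck's notation.

```python
import numpy as np

# One belief-state update for a finite POMDP (illustrative sketch).
#   trans[s, a, s2]  - transition probability delta(s, a, s2)
#   obs_prob[s2, o]  - probability of observing o in state s2 (the "Ob" function)
#   belief[s]        - current posterior distribution over states

def belief_update(belief, action, observation, trans, obs_prob):
    """One Bayes-filter step: push the belief through delta, then reweight by the observation."""
    predicted = belief @ trans[:, action, :]        # sum_s b(s) * delta(s, a, s')
    updated = predicted * obs_prob[:, observation]  # multiply by Pr[o | s']
    return updated / updated.sum()                  # renormalize to a distribution over S

# Tiny made-up example (2 states, 1 action, 2 observations), just to show the call.
trans = np.array([[[0.9, 0.1]],
                  [[0.2, 0.8]]])                    # trans[s, a, s']
obs_prob = np.array([[0.8, 0.2],
                     [0.3, 0.7]])                   # obs_prob[s, o]
belief = np.array([0.5, 0.5])
print(belief_update(belief, action=0, observation=1, trans=trans, obs_prob=obs_prob))
```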
POMDP - Hard computational problems
• Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT, L].
• Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT, L].
• Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].

Resources
• Reinforcement Learning: An Introduction [Sutton & Barto]
• Markov Decision Processes [Puterman]
• Dynamic Programming and Optimal Control [Bertsekas]
• Neuro-Dynamic Programming [Bertsekas & Tsitsiklis]
• Ph.D. thesis - Michael Littman