COSC 878 Seminar on Large Scale Statistical Machine Learning

Today's Plan
• Course website: http://people.cs.georgetown.edu/~huiyang/cosc878/
• Join the Google group: https://groups.google.com/forum/#!forum/cosc878
• Student introductions
• Team-up and presentation scheduling
• First talk

Reinforcement Learning: A Survey
Grace, 1/13/15

What is Reinforcement Learning
• The problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment

Solve RL Problems – Two strategies
• Search in the behavior space
  – Find one behavior that performs well in the environment
  – Genetic algorithms, genetic programming
• Statistical methods and dynamic programming
  – Estimate the utility of taking actions in states of the world
  – We focus on this strategy

Standard RL model
[Figure: the standard RL model, an agent interacting with an environment]

What we learn in RL
• The agent's job is to find a policy \pi that maximizes some long-run measure of reinforcement
  – A policy \pi maps states to actions
  – Reinforcement = reward

Difference between RL and Supervised Learning
• In RL there is no presentation of input/output pairs
  – No training data
  – We only know the immediate reward
  – We do not know the best actions in the long run
• In RL the system must be evaluated online while learning
  – Online evaluation (knowing the online performance) is important

Difference between RL and AI/Planning
• AI planning algorithms are less general
  – They require a predefined model of state transitions
  – And they assume determinism
• RL assumes only that the state space can be enumerated and stored in memory

Models
• The difficult part: how to incorporate the future into the model
• Three models
  – Finite horizon
  – Infinite horizon
  – Average-reward

Finite Horizon
• At a given moment in time, the agent optimizes its expected reward for the next h steps
• It ignores what will happen after h steps

Infinite Horizon
• Maximize the long-run reward
• Puts no limit on the number of future steps
• Future rewards are discounted geometrically by a discount factor \gamma (between 0 and 1)
• Mathematically more tractable than the finite-horizon model

Average-reward
• Maximize the long-run average reward
• The limiting case of the infinite-horizon model as \gamma approaches 1
• Weakness:
  – Cannot tell when the large rewards arrive
  – If we prefer a large initial reward, this model has no way to express it
• Cures:
  – Maximize both the long-run average and the initial rewards
  – The bias-optimal model

Compare model optimality
• All unlabeled arrows produce a reward of 0
• A single action
[Figure: a start state with an upper, a middle, and a lower path]

Compare model optimality: Finite horizon, h = 4
• Upper path: 0 + 0 + 2 + 2 + 2 = 6
• Middle path: 0 + 0 + 0 + 0 + 0 = 0
• Lower path: 0 + 0 + 0 + 0 + 0 = 0

Compare model optimality: Infinite horizon, \gamma = 0.9
• Upper path: 0·0.9^0 + 0·0.9^1 + 2·0.9^2 + 2·0.9^3 + 2·0.9^4 + ... = 2·0.9^2·(1 + 0.9 + 0.9^2 + ...) = 1.62 · 1/(1 − 0.9) = 16.2
• Middle path: ... + 10·0.9^5 + ... ≈ 59
• Lower path: ... + 11·0.9^6 + ... ≈ 58.5

Compare model optimality: Average reward
• Upper path: ≈ 2
• Middle path: ≈ 10
• Lower path: ≈ 11

Parameters
• The finite-horizon and infinite-horizon models have parameters
  – h (finite horizon)
  – \gamma (infinite horizon)
• These parameters matter to the choice of optimality model
  – Choose them carefully in your application
• The average-reward model's advantage: it is not influenced by those parameters

MARKOV MODELS

Markov Process
• Markov property (the "memoryless" property): the next state of a system depends only on its current state:
  Pr(S_{i+1} | S_i, ..., S_0) = Pr(S_{i+1} | S_i)
  [Figure: e.g., a chain s_0 → s_1 → ... → s_i → s_{i+1} → ...]
• Markov process: a stochastic process with the Markov property
[A. A. Markov, 1906]
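To make the memoryless property concrete, here is a minimal Python sketch (not from the slides): it simulates a Markov process from a small hand-made 3-state transition matrix, sampling each next state from the current state alone, and its empirical state frequencies approach the chain's stable-state distribution, the same quantity PageRank computes for the web graph below.

# Minimal sketch (not from the slides): simulating a 3-state Markov process.
# The transition matrix below is an arbitrary illustration.
import numpy as np

M = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])          # M[s, s'] = Pr(S_{i+1} = s' | S_i = s)

rng = np.random.default_rng(0)
state, trajectory = 0, [0]
for _ in range(10000):
    # The memoryless property in action: the next state is sampled from the
    # current state only, never from the earlier history.
    state = rng.choice(3, p=M[state])
    trajectory.append(state)

# Empirical state frequencies approach the chain's stable-state (stationary)
# distribution.
print(np.bincount(trajectory, minlength=3) / len(trajectory))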
Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-armed Bandit

Markov Chain (S, M)
• A discrete-time Markov process
• Example: Google PageRank [L. Page et al., 1999]
  – State S: a web page
  – Transition probability M
  – PageRank: how likely a random web surfer is to land on a page
  [Figure: a link graph over pages A–E, each with its PageRank value]
  PageRank(S) = (1 − \alpha)/N + \alpha \sum_{q \in B(S)} PageRank(q)/L(q)
  where 1 − \alpha is the random-jump factor, N is the number of pages, B(S) is the set of pages linked to S, and L(q) is the number of outlinks of q
• The stable-state distribution of such a Markov chain is the PageRank

Hidden Markov Model (S, M, O, e)
• A Markov chain whose states are hidden; observable symbols are emitted with some probability according to the states [Leonard E. Baum et al., 1966]
  [Figure: a hidden chain s_0 → s_1 → s_2 → ... emitting observations o_0, o_1, o_2, ...]
  – s_i: hidden state
  – p_i: transition probability
  – o_i: observation
  – e_i: observation probability (emission probability)

Markov Decision Process (S, M, A, R, \gamma)
• An MDP extends a Markov chain with actions and rewards [R. Bellman, 1957]
  [Figure: s_0 → s_1 → s_2 → s_3 → ..., with an action a_i and a reward r_i on each transition]
  – s_i: state
  – a_i: action
  – r_i: reward
  – p_i: transition probability

Definition of MDP
• A tuple (S, M, A, R, \gamma)
  – S: state space
  – M: transition matrix, M_a(s, s') = P(s' | s, a)
  – A: action space
  – R: reward function, R(s, a) = the immediate reward for taking action a at state s
  – \gamma: discount factor, 0 < \gamma ≤ 1
• Policy \pi: \pi(s) = the action taken at state s
• The goal is to find an optimal policy \pi* that maximizes the expected total reward

Policy
• Policy: \pi(s) = a, according to which an action a is selected at state s
[Figure: a grid example with \pi(s_0) = move right and up, \pi(s_1) = move right and up, \pi(s_2) = move right]
[Slide altered from Carlos Guestrin's ML lecture]

Value of Policy
• Value V^\pi(s): the expected long-term reward starting from state s, with future rewards discounted by \gamma \in [0, 1):
  V^\pi(s_0) = E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \gamma^3 R(s_3) + \gamma^4 R(s_4) + ...]
[Figure: starting from s_0, the policy \pi chooses an action and the process branches over possible next states s_1, s_1', s_1'', then s_2, s_2', s_2'', each collecting its reward R(·)]
[Slide altered from Carlos Guestrin's ML lecture]

Computing the value of a policy
• Value function:
  V^\pi(s_0) = E_\pi[R(s_0, a) + \gamma R(s_1, a) + \gamma^2 R(s_2, a) + \gamma^3 R(s_3, a) + ...]
             = E_\pi[R(s_0, a) + \gamma \sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a)]
             = R(s_0, a) + \gamma E_\pi[\sum_{t=1}^{\infty} \gamma^{t-1} R(s_t, a)]
             = R(s_0, a) + \gamma \sum_{s'} M_a(s_0, s') V^\pi(s')
  where s_0 is the current state and s' a possible next state

Optimality — Bellman Equation
• The Bellman equation [R. Bellman, 1957] for an MDP is a recursive definition of the optimal value function (state-value function) V*(·):
  V*(s) = max_a [R(s, a) + \gamma \sum_{s'} M_a(s, s') V*(s')]
• Optimal policy:
  \pi*(s) = argmax_a [R(s, a) + \gamma \sum_{s'} M_a(s, s') V*(s')]
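The Bellman equation above can be solved by simple fixed-point iteration. The following is a minimal value-iteration sketch in Python (not from the slides); the tiny 3-state, 2-action MDP, its rewards, and its transition probabilities are invented purely for illustration.

# Minimal value-iteration sketch for the Bellman equation above.
# The 3-state, 2-action MDP below is invented for illustration.
import numpy as np

gamma = 0.9
R = np.array([[0.0, 1.0],                     # R[s, a]: immediate reward
              [0.0, 0.0],
              [5.0, 0.0]])
M = np.array([                                # M[a, s, s'] = P(s' | s, a)
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.1, 0.0]],
])

V = np.zeros(3)
for _ in range(1000):
    # Q(s, a) = R(s, a) + gamma * sum_s' M_a(s, s') V(s')
    Q = R + gamma * (M @ V).T
    V_new = Q.max(axis=1)                     # V(s) = max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:      # stop at the fixed point
        V = V_new
        break
    V = V_new

print(V)                                      # approximates V*(s)
print(Q.argmax(axis=1))                       # pi*(s) = argmax_a Q(s, a)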
Optimality — Bellman Equation (continued)
• The Bellman equation can be rewritten via the action-value function Q, which gives the relationship between V* and Q:
  V*(s) = max_a Q(s, a)
  Q(s, a) = R(s, a) + \gamma \sum_{s'} M_a(s, s') V*(s')
• Optimal policy:
  \pi*(s) = argmax_a Q(s, a)

MDP algorithms
• Solve the Bellman equation for the optimal value V*(s) and the optimal policy \pi*(s)
• Model-based approaches
  – Value Iteration
  – Policy Iteration
  – Modified Policy Iteration
  – Prioritized Sweeping
• Model-free approaches
  – Temporal Difference (TD) Learning
  – Q-Learning
[Bellman, '57; Howard, '60; Puterman and Shin, '78; Singh & Sutton, '96; Sutton & Barto, '98; Richard Sutton, '88; Watkins, '92]
[Slide altered from Carlos Guestrin's ML lecture]
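As a model-free contrast to the value-iteration sketch above, here is a minimal tabular Q-learning sketch in Python (not from the slides). The 5-state chain environment, its rewards, and all hyper-parameters are invented for illustration.

# Minimal tabular Q-learning sketch, one of the model-free approaches listed
# above. The chain environment and hyper-parameters are invented.
import numpy as np

n_states, n_actions = 5, 2                    # actions: 0 = left, 1 = right
gamma, alpha, eps = 0.9, 0.1, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    # Deterministic chain: reward 1 only when the right end is reached.
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

Q = np.zeros((n_states, n_actions))
for episode in range(500):
    s = 0
    for _ in range(20):
        # Epsilon-greedy action selection, then interact with the environment.
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a'),
        # using only observed transitions; no transition model M is needed.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))                       # learned greedy policy per state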