COSC 878
Seminar on Large Scale Statistical Machine Learning
Today’s Plan
• Course Website
http://people.cs.georgetown.edu/~huiyang/cosc878/
• Join the Google group:
https://groups.google.com/forum/#!forum/cosc878
• Student Introductions
• Team-up and Presentation Scheduling
• First Talk
Reinforcement Learning:
A Survey
Grace
1/13/15
What is Reinforcement Learning
• The problem faced by an agent that must
learn behavior through trial-and-error
interactions with a dynamic environment
Solve RL Problems – Two strategies
• Search in the behavior space
– To find one behavior that performs well in the environment
– Genetic algorithms, genetic programming
• Statistical methods and dynamic programming
– Estimate the utility of taking actions in states of
the world
– We focus on this strategy
Standard RL model
What we learn in RL
• The agent’s job is to find a policy π that maximizes some long-run measure of reinforcement.
– A policy π maps states to actions
– Reinforcement = reward
Difference between RL and Supervised
Learning
• In RL, there is no presentation of input/output pairs
– No training data
– We only observe the immediate reward
– We do not know the best actions in the long run
• In RL, the system must be evaluated online while it is learning
– Online evaluation (knowing the online performance) is important
Difference between RL and
AI/Planning
• AI algorithms are less general
– AI algorithms require a predefined model of state transitions
– And they assume determinism
• RL assumes that the state space can be enumerated and stored in memory
Models
• The difficult part:
– How to incorporate the future into the model
• Three models
– Finite horizon
– Infinite horizon
– Average-reward
Finite Horizon
• At a given moment in time, the agent optimizes its expected reward for the next h steps
• Ignore what will happen after h steps
Infinite Horizon
• Maximize the long-run reward
• Does not put a limit on the number of future steps
• Future rewards are discounted geometrically
• Mathematically more tractable than the finite horizon
• Discount factor γ (between 0 and 1)
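A sketch of the geometrically discounted objective, with γ the discount factor from the last bullet and r_t the reward at step t:

```latex
E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right], \qquad 0 \le \gamma < 1
```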
Average-reward
• Maximize the long-run average reward
• It is the limiting case of the infinite-horizon model as γ approaches 1
• Weakness:
– It cannot distinguish when the large rewards are received
– If we prefer a large initial reward, this model has no way to express that preference
• Cures:
– Maximize both the long-run average and the initial rewards
– The bias-optimal model
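The average-reward criterion can be sketched as the limiting per-step reward over a growing horizon (again with r_t the reward at step t; the exact index bounds are a convention):

```latex
\lim_{h \to \infty} E\!\left[\frac{1}{h}\sum_{t=0}^{h-1} r_t\right]
```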
Compare model optimality
• All unlabeled arrows produce a reward of 0
• A single action
[Figure: three alternative action sequences leading out of a common start state, with per-step rewards labeled on the arrows.]
Compare model optimality
• Finite horizon, h = 4
– Upper line: 0 + 0 + 2 + 2 + 2 = 6
– Middle line: 0 + 0 + 0 + 0 + 0 = 0
– Lower line: 0 + 0 + 0 + 0 + 0 = 0
Compare model optimality
• Infinite horizon, γ = 0.9
– Upper line: 0*0.9^0 + 0*0.9^1 + 2*0.9^2 + 2*0.9^3 + 2*0.9^4 + … = 2*0.9^2*(1 + 0.9 + 0.9^2 + …) = 1.62 * 1/(1−0.9) = 16.2
– Middle line: … + 10*0.9^5 + … ≈ 59
– Lower line: … + 11*0.9^6 + … ≈ 58.5
Compare model optimality
• Average reward
– Upper line: ≈ 2
– Middle line: ≈ 10
– Lower line: ≈ 11
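To make the comparison concrete, here is a small Python check of all three criteria on the reward streams I read off the arithmetic above; the step at which each path starts paying off and the truncation length T are my assumptions, not given on the slides.

```python
def reward_stream(first_step, reward, horizon):
    """Reward received at each time step t = 0, 1, ..., horizon - 1."""
    return [reward if t >= first_step else 0 for t in range(horizon)]

def finite_horizon(rewards, h):
    return sum(rewards[:h])                      # sum of the first h rewards

def discounted(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def average(rewards):
    return sum(rewards) / len(rewards)

T = 2000                                         # long truncation approximating the infinite sums
upper = reward_stream(2, 2, T)                   # +2 per step starting at t = 2
middle = reward_stream(5, 10, T)                 # +10 per step starting at t = 5
lower = reward_stream(6, 11, T)                  # +11 per step starting at t = 6

for name, r in [("upper", upper), ("middle", middle), ("lower", lower)]:
    print(name,
          finite_horizon(r, 5),                  # finite horizon (five terms, as on the slide)
          round(discounted(r, 0.9), 1),          # infinite horizon, gamma = 0.9
          round(average(r), 1))                  # long-run average reward
```

Running it prints roughly 6 / 16.2 / 2.0 for the upper line, 0 / 59.0 / 10.0 for the middle, and 0 / 58.5 / 11.0 for the lower, matching the three slides.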
Parameters
• Finite horizon and infinite horizon both have parameters
– h
– γ
• These parameters matter to the choice of optimality model
– Choose them carefully in your application
• The average-reward model’s advantage: it is not influenced by those parameters
MARKOV MODELS
Markov Process
• Markov Property¹ (the “memoryless” property):
for a system, its next state depends only on its current state.
Pr(S_{i+1} | S_i, …, S_0) = Pr(S_{i+1} | S_i)
e.g., a chain s0 → s1 → … → s_i → s_{i+1} → …
• Markov Process:
a stochastic process with the Markov property.
¹ A. A. Markov, ’06
Family of Markov Models
• Markov Chain
• Hidden Markov Model
• Markov Decision Process
• Partially Observable Markov Decision Process
• Multi-armed Bandit
Markov Chain
(S, M)
• Discrete-time Markov process
• Example: Google PageRank¹
– State S: a web page
– Transition probability M
– PageRank: how likely a random web surfer will land on a page
[Figure: a small web graph over pages A–E, each annotated with its PageRank.]
PageRank(S) = (1 − α)/N + α · Σ_{Y ∈ Π} PageRank(Y)/L(Y)
where 1 − α is the random-jump factor, N is the number of pages, Π is the set of pages that link to S, and L(Y) is the number of outlinks of Y.
• The stationary (stable-state) distribution of such a Markov chain is PageRank
¹ L. Page et al., ’99
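To illustrate the formula above, here is a minimal power-iteration sketch in Python; the link structure over pages A–E is invented for the example, and α = 0.85 is just a common choice of weight, not a value from the slide.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["A", "D"],
}
pages = list(links)
N = len(pages)                      # number of pages
alpha = 0.85                        # 1 - alpha is the random-jump factor
pr = {p: 1.0 / N for p in pages}    # start from the uniform distribution

# Power iteration toward the stationary distribution of the chain.
for _ in range(100):
    new_pr = {}
    for s in pages:
        incoming = [y for y in pages if s in links[y]]   # pages that link to s
        new_pr[s] = (1 - alpha) / N + alpha * sum(pr[y] / len(links[y]) for y in incoming)
    pr = new_pr

print(pr)                           # approximate PageRank of each page
```

The fixed point of this iteration is exactly the stationary distribution mentioned on the slide.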
Hidden Markov Model
(S, M, O, e)
• A Markov chain whose states are hidden and which emits observable symbols with some probability according to its current state¹.
[Figure: hidden states s0 → s1 → s2 → … with transition probabilities p0, p1, p2; each state emits an observation o0, o1, o2 with emission probabilities e0, e1, e2.]
– s_i: hidden state
– p_i: transition probability
– o_i: observation
– e_i: observation probability (emission probability)
¹ Leonard E. Baum et al., ’66
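A tiny generative sketch of an HMM (S, M, O, e) in Python: the two hidden states, three observation symbols, and all probabilities below are hypothetical, chosen only to show hidden states emitting observable symbols.

```python
import random

states = ["Rainy", "Sunny"]                                   # hidden states S
M = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},                   # transition probabilities p_i
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
e = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},       # emission probabilities e_i over symbols O
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

s = random.choice(states)              # hidden state sequence starts here
for t in range(5):
    o = sample(e[s])                   # emit an observable symbol according to the hidden state
    print(t, "hidden:", s, "observed:", o)
    s = sample(M[s])                   # move to the next hidden state
```

An observer sees only the emitted symbols o_t; recovering the hidden s_t from them is the usual HMM inference problem.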
Markov Decision Process (S, M, A, R, γ)
• MDP extends MC with actions and rewards¹
[Figure: a chain s0 → s1 → s2 → s3 → … in which each transition is taken under an action a_i, yields a reward r_i, and occurs with transition probability p_i.]
– s_i: state
– a_i: action
– r_i: reward
– p_i: transition probability
¹ R. Bellman, ’57
Definition of MDP
• A tuple (S, M, A, R, γ)
– S: state space
– M: transition matrix, M_a(s, s') = P(s' | s, a)
– A: action space
– R: reward function, R(s, a) = immediate reward for taking action a at state s
– γ: discount factor, 0 < γ ≤ 1
• Policy π
– π(s) = the action taken at state s
• Goal: find an optimal policy π* maximizing the expected total reward.
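As a sketch of how the tuple (S, M, A, R, γ) might be carried around in code, a small Python container; the field names and data structures are my own choices, not something from the survey.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    states: List[str]                           # S: state space
    actions: List[str]                          # A: action space
    M: Dict[str, Dict[str, Dict[str, float]]]   # M[a][s][s2] = P(s2 | s, a)
    R: Callable[[str, str], float]              # R(s, a): immediate reward for action a in state s
    gamma: float                                # discount factor, 0 < gamma <= 1

    def is_policy(self, pi: Dict[str, str]) -> bool:
        # A policy pi maps every state s to an action pi(s) = a.
        return all(pi.get(s) in self.actions for s in self.states)
```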
Policy
• Policy: π(s) = a, according to which we select an action a at state s.
[Figure: a gridworld trajectory s0 → s1 → s2 with π(s0) = move right and up, π(s1) = move right and up, π(s2) = move right.]
[Slide altered from Carlos Guestrin’s ML lecture]
Value of Policy
• Value V_π(s): expected long-term reward starting from s
V_π(s0) = E_π[R(s0) + γ R(s1) + γ^2 R(s2) + γ^3 R(s3) + γ^4 R(s4) + …]
• Future rewards are discounted by γ ∈ [0, 1)
[Figure: start from s0, receive R(s0), take action π(s0).]
[Slide altered from Carlos Guestrin’s ML lecture]
Value of Policy
V_π(s0) = E_π[R(s0) + γ R(s1) + γ^2 R(s2) + γ^3 R(s3) + γ^4 R(s4) + …]
[Figure: the trajectory from s0 expanded one step, branching over possible next states s1, s1’, s1’’ with rewards R(s1), R(s1’), R(s1’’).]
[Slide altered from Carlos Guestrin’s ML lecture]

Value of Policy
V_π(s0) = E_π[R(s0) + γ R(s1) + γ^2 R(s2) + γ^3 R(s3) + γ^4 R(s4) + …]
[Figure: the tree expanded a further step, following π(s1), π(s1’), π(s1’’) into states s2, s2’, s2’’ with rewards R(s2), R(s2’), R(s2’’).]
[Slide altered from Carlos Guestrin’s ML lecture]
Computing the value of a policy
• Value function
V_π(s0) = E_π[R(s0, a) + γ R(s1, a) + γ^2 R(s2, a) + γ^3 R(s3, a) + …]
= E_π[R(s0, a) + γ Σ_{t=1}^{∞} γ^{t−1} R(s_t, a)]
= R(s0, a) + γ Σ_{s'} M_{π(s0)}(s0, s') V_π(s')
where s0 is the current state, a = π(s0), and s' ranges over the possible next states.
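A sketch of iterative policy evaluation for the recursion just derived, V_π(s) = R(s, π(s)) + γ Σ_{s'} M_{π(s)}(s, s') V_π(s'); the two-state MDP, its rewards, and the fixed policy below are hypothetical.

```python
states = ["s0", "s1"]
# M[a][s][s2] = P(s2 | s, a) and R[(s, a)] = immediate reward, made-up numbers
M = {"a": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.5, "s1": 0.5}},
     "b": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}}}
R = {("s0", "a"): 1.0, ("s0", "b"): 0.0, ("s1", "a"): 0.0, ("s1", "b"): 2.0}
pi = {"s0": "a", "s1": "b"}            # the fixed policy being evaluated
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(200):                   # repeated sweeps converge because gamma < 1
    V = {s: R[(s, pi[s])] + gamma * sum(M[pi[s]][s][s2] * V[s2] for s2 in states)
         for s in states}

print(V)                               # approximate V_pi for each state
```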
Optimality — Bellman Equation
• The Bellman equation¹ for an MDP is a recursive definition of the optimal value function V*(·)
• State-value function:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} M_a(s, s') V*(s') ]
• Optimal policy:
π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} M_a(s, s') V*(s') ]
¹ R. Bellman, ’57
Optimality — Bellman Equation
• The Bellman equation can be rewritten using the action-value function Q, which makes the relationship between V and Q explicit:
V*(s) = max_a Q(s, a)
Q(s, a) = R(s, a) + γ Σ_{s'} M_a(s, s') V*(s')
• Optimal policy:
π*(s) = argmax_a Q(s, a)
MDP algorithms
Solve the Bellman equation → optimal value V*(s) → optimal policy π*(s)
• Model-based approaches:
– Value Iteration
– Policy Iteration
– Modified Policy Iteration
– Prioritized Sweeping
• Model-free approaches:
– Temporal Difference (TD) Learning
– Q-Learning
[Bellman ’57; Howard ’60; Puterman and Shin ’78; Singh & Sutton ’96; Sutton & Barto ’98; Richard Sutton ’88; Watkins ’92]
[Slide altered from Carlos Guestrin’s ML lecture]
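As a minimal sketch of the first model-based approach in the list, value iteration applies the Bellman backup repeatedly to get V*(s) and then extracts the greedy policy π*(s); the small MDP below is hypothetical.

```python
states = ["s0", "s1"]
actions = ["a", "b"]
# M[a][s][s2] = P(s2 | s, a) and R[(s, a)] = immediate reward, made-up numbers
M = {"a": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.5, "s1": 0.5}},
     "b": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}}}
R = {("s0", "a"): 1.0, ("s0", "b"): 0.0, ("s1", "a"): 0.0, ("s1", "b"): 2.0}
gamma = 0.9

V = {s: 0.0 for s in states}
for _ in range(200):                   # Bellman backups until (approximate) convergence
    V = {s: max(R[(s, a)] + gamma * sum(M[a][s][s2] * V[s2] for s2 in states)
                for a in actions)
         for s in states}

# Greedy policy extraction: pi*(s) = argmax_a [ R(s, a) + gamma * sum_s' M_a(s, s') V*(s') ]
pi_star = {s: max(actions,
                  key=lambda a: R[(s, a)] + gamma * sum(M[a][s][s2] * V[s2] for s2 in states))
           for s in states}

print(V, pi_star)                      # approximate V* and the corresponding greedy policy
```

The model-free methods in the list (TD learning, Q-learning) replace the known M and R used here with samples gathered by interacting with the environment.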