CS188 Discussion Week 14: Reinforcement Learning
By Nuttapong Chentanez
Want to compute an optimal policy π*(s)
Markov Decision Process (MDP)
- Consists of a set of states s ∈ S
- A transition model T(s, a, s’) = P(s’ | s, a)
  - The probability that executing action a in state s leads to s’
- A reward function R(s, a, s’)
  - Can also be written R(s’) for the reward for entering s’, or R(s) for being in s
- A start state (or a start-state distribution)
- Possibly some terminal states
Bellman’s Equation
Idea: the optimal value maximizes over the first action and then follows the optimal policy thereafter:
V*(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
Simple Monte-Carlo (Sampling)
V^π(s) = average of the total discounted rewards observed over sampled episodes that start at s and follow π
Value Iteration
Theorem: value iteration converges to the unique optimal values. The policy may converge long before the values do.
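The backup described above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation; the two-state MDP at the bottom (states, actions, T, R) is a made-up example.

```python
# Minimal value-iteration sketch: repeatedly apply the Bellman backup
# V(s) <- max_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')].

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions
            )
        # Stop when successive value functions are (nearly) identical.
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Hypothetical two-state chain: 'go' moves A->B (reward 1),
# 'stay' at B earns reward 2 forever.
states = ['A', 'B']
actions = ['stay', 'go']
T = {('A', 'stay'): {'A': 1.0}, ('A', 'go'): {'B': 1.0},
     ('B', 'stay'): {'B': 1.0}, ('B', 'go'): {'A': 1.0}}
R = {('A', 'stay', 'A'): 0.0, ('A', 'go', 'B'): 1.0,
     ('B', 'stay', 'B'): 2.0, ('B', 'go', 'A'): 0.0}
V = value_iteration(states, actions, T, R)
```

For this chain the fixed point is V*(B) = 2/(1 − 0.9) = 20 and V*(A) = 1 + 0.9·20 = 19.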
Policy Iteration
Policy Evaluation: Calculate utilities for a fixed policy
Policy Improvement: Update policy based on resulting converged utility
Repeat until policy does not change
In practice, there is no need to compute the “exact” utility of a policy; a rough estimate is enough.
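The evaluate/improve loop above can be sketched as follows; the evaluation step deliberately runs only a fixed number of sweeps, giving a rough estimate rather than an exact utility. The two-state MDP at the bottom is a hypothetical example.

```python
# Minimal policy-iteration sketch: alternate policy evaluation and
# greedy policy improvement until the policy stops changing.

def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=50):
    pi = {s: actions[0] for s in states}   # arbitrary initial policy
    while True:
        # Policy evaluation: a few sweeps give a rough estimate of V^pi.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                        for s2, p in T[(s, pi[s])].items())
                 for s in states}
        # Policy improvement: act greedily with respect to V.
        def q(s, a):
            return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                       for s2, p in T[(s, a)].items())
        new_pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
        if new_pi == pi:                   # policy unchanged: done
            return pi, V
        pi = new_pi

# Hypothetical two-state chain (same shape as a tiny gridworld).
states = ['A', 'B']
actions = ['stay', 'go']
T = {('A', 'stay'): {'A': 1.0}, ('A', 'go'): {'B': 1.0},
     ('B', 'stay'): {'B': 1.0}, ('B', 'go'): {'A': 1.0}}
R = {('A', 'stay', 'A'): 0.0, ('A', 'go', 'B'): 1.0,
     ('B', 'stay', 'B'): 2.0, ('B', 'go', 'A'): 0.0}
pi, V = policy_iteration(states, actions, T, R)
```

Here the policy converges to go-from-A, stay-at-B even though the 50-sweep value estimates are still slightly off their fixed points.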
Combine dynamic programming with Monte-Carlo:
- Take one step of sampling, then use the current estimate of the next state’s value
- Also known as Temporal Difference (TD) learning: V(s) ← V(s) + α [ r + γ V(s’) − V(s) ]
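A minimal TD(0) sketch of this idea for evaluating a fixed policy. The two-state chain and the fixed policy below are hypothetical; because its transitions are deterministic a small constant α suffices here (with stochastic transitions, α should decrease over time, e.g. 1/k).

```python
import random

# TD(0) policy evaluation: take one sampled step, then bootstrap from
# the current estimate of the next state's value.

def td_evaluate(steps=5000, gamma=0.9, alpha=0.1):
    V = {'A': 0.0, 'B': 0.0}
    s = 'A'
    for _ in range(steps):
        # Fixed policy: from A, go to B (reward 1); from B, stay (reward 2).
        s2, r = ('B', 1.0) if s == 'A' else ('B', 2.0)
        # TD update: nudge V(s) toward the one-step sample r + gamma*V(s').
        V[s] += alpha * (r + gamma * V[s2] - V[s])
        # Occasionally restart at A so both states keep being visited.
        s = 'A' if random.random() < 0.1 else s2
    return V

V = td_evaluate()  # true values: V(B) = 2/(1-0.9) = 20, V(A) = 1 + 0.9*20 = 19
```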
Reinforcement Learning
Still have an MDP:
- A set of states s ∈ S
- A set of actions (per state) A
- A model T(s, a, s’)
- A reward function R(s, a, s’)
Still looking for a policy π(s)
However, the agent doesn’t know T or R
Must actually try out actions and states to learn
Reinforcement Learning in Animals
Studied experimentally in psychology for > 60 years
Rewards: food, pain, hunger, etc.
Example: bees learn near-optimal foraging plans on artificial flowers
Dolphin training
Model-Based Learning
- Can try to learn T and R first, then solve the resulting MDP
Simplest case:
Count the outcomes s’ for each (s, a)
Normalize to give an estimate of T(s, a, s’)
Discover R(s, a, s’) the first time we experience (s, a, s’)
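The counting scheme above can be sketched directly; the experience tuples at the bottom are hypothetical.

```python
from collections import Counter, defaultdict

# Simplest model-based learner: count observed outcomes for each (s, a),
# normalize the counts into an estimate of T(s, a, s'), and record
# R(s, a, s') the first time each transition is experienced.

counts = defaultdict(Counter)   # (s, a) -> Counter of next states
rewards = {}                    # (s, a, s') -> first observed reward

def observe(s, a, s2, r):
    counts[(s, a)][s2] += 1
    rewards.setdefault((s, a, s2), r)   # keep the first observed reward

def T(s, a, s2):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / total if total else 0.0

# Hypothetical experience: from 'A', action 'go' reached 'B' three times
# out of four, and stayed at 'A' once.
for s2 in ['B', 'B', 'A', 'B']:
    observe('A', 'go', s2, 1.0 if s2 == 'B' else 0.0)
```

After these four observations the estimate is T('A', 'go', 'B') = 3/4.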
Passive Learning
- Given a fixed policy π(s), try to learn V^π(s) without knowing T or R
TD Features
- On-line, incremental, bootstrapping (uses the information learned so far)
- Model-free; converges as long as the learning rate α decreases over time, e.g. α = 1/k, with γ < 1
Problems with TD
- TD learns the value of states under a given policy
- If we want to learn the optimal policy, that alone won’t work
- Idea: learn values of state-action pairs (Q-values) instead
- Q^π(s, a): utility of starting at state s, taking action a, then following π thereafter
- Q*(s, a): the Q-value of the optimal policy, i.e. the utility of starting at state s, taking action a, then following the optimal policy afterward
Another solution: Exploration Function
Idea: instead of exploring a fixed amount, explore where the value estimate is not yet well established
E.g. f(u, n) = u + k/n, where u is the value estimate, n is the visit count, and k is a constant
Q-Learning Algorithm: apply the TD idea to learn Q*
Practical Q-Learning
- In realistic situations there are too many states to visit; the state space may even be infinite
-Need to learn from small training data
-Be able to generalize to new, similar states
-Fundamental problem in machine learning
Function Approximation
- Inefficient/infeasible to learn a separate Q-value for each state-action pair
- Suppose we approximate Q(s_t, a_t) with a function f with parameters θ
- Can then do a gradient-descent update on θ
Learn Q*(s, a) values
- Receive a sample (s, a, s’, r)
- Consider your old estimate: Q(s, a)
- Consider your new sample estimate: sample = r + γ max_a’ Q(s’, a’)
- Modify the old estimate towards the new sample: Q(s, a) ← Q(s, a) + α (sample − Q(s, a))
- Equivalently, average samples over time: Q(s, a) ← (1 − α) Q(s, a) + α · sample
Converges as long as α decreases over time, e.g. α = 1/k, with γ < 1
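A tabular sketch of this update loop. The two-state deterministic MDP below is hypothetical; because its transitions are deterministic, a constant α converges here (in general α should decay, as noted above).

```python
import random
from collections import defaultdict

# Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).
# Hypothetical deterministic MDP: T maps (state, action) -> (next state, reward).
T = {('A', 'stay'): ('A', 0.0), ('A', 'go'): ('B', 1.0),
     ('B', 'stay'): ('B', 2.0), ('B', 'go'): ('A', 0.0)}
actions = ['stay', 'go']

def q_learn(steps=20000, gamma=0.9, alpha=0.1, epsilon=0.2):
    Q = defaultdict(float)
    s = 'A'
    for _ in range(steps):
        # Epsilon-greedy action selection (exploration vs. exploitation).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a: Q[(s, a)])
        s2, r = T[(s, a)]
        sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (sample - Q[(s, a)])
        s = s2
    return Q

Q = q_learn()
```

For this chain the optimal values are Q*(B, stay) = 20 and Q*(A, go) = 19; note that Q-learning finds them even though the behavior policy keeps taking some random actions.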
Exploration vs. Exploitation
- “If we always take the current best action, we will never explore other actions and never know whether a better one exists”
- Explore initially, then exploit later
- Many schemes exist for balancing exploration vs. exploitation
Simplest: random actions (ε-greedy)
Every time step, flip a coin:
- With probability ε, act randomly
- With probability 1 − ε, act according to the current policy (e.g. the action with the best Q-value)
This will explore the space, but keeps taking random actions even after the learning is “done”
A solution: lower ε over time
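A sketch of ε-greedy selection with ε lowered over time, as suggested above. The decay schedule and the tiny Q-table below are hypothetical placeholders.

```python
import random

# Epsilon-greedy with a decaying epsilon: explore with probability eps,
# otherwise exploit the action with the best current Q-value.

def epsilon_greedy(Q, s, actions, t, eps0=1.0, decay=0.001):
    eps = eps0 / (1.0 + decay * t)   # epsilon shrinks as step t grows
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit

Q = {('s0', 'left'): 0.0, ('s0', 'right'): 1.0}
a = epsilon_greedy(Q, 's0', ['left', 'right'], t=10**6)  # eps ~ 0.001 here
```

With eps0=0 the selection is purely greedy, which is the limiting behavior as ε decays to zero.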
Project time
- v_t is the reward received (the training target)
- Idea: the gradient indicates the direction of change in θ that most increases Q
- We want Q to look more like v_t, so we modify each parameter depending on whether increasing or decreasing it will make Q more like v_t:
  θ_i ← θ_i + α (v_t − Q(s_t, a_t)) ∂Q/∂θ_i
- In this simple form it does not always work, but the idea is on the right track
- Feature selection is difficult in general
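This update is easiest to see for a linear approximation Q(s, a) = Σ_i θ_i f_i(s, a), where ∂Q/∂θ_i is just the feature f_i. The features and the single training sample below are hypothetical.

```python
# Gradient update for a linear Q approximation:
# theta_i <- theta_i + alpha * (v_t - Q(s_t, a_t)) * dQ/dtheta_i,
# and for linear Q, dQ/dtheta_i = f_i(s_t, a_t).

def q_value(theta, features):
    return sum(t * f for t, f in zip(theta, features))

def gradient_update(theta, features, v_t, alpha=0.1):
    error = v_t - q_value(theta, features)   # how far Q is from the target
    return [t + alpha * error * f for t, f in zip(theta, features)]

theta = [0.0, 0.0]
features = [1.0, 2.0]   # f(s_t, a_t) for one hypothetical state-action pair
theta = gradient_update(theta, features, v_t=5.0)
```

Starting from θ = [0, 0], the error is 5, so each parameter moves by α·5·f_i, i.e. to [0.5, 1.0].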