
Outline
• MDP (brief)
– Background
– Learning MDP
• Q learning
• Game theory (brief)
– Background
• Markov games (2-player)
– Background
– Learning Markov games
• Littman’s Minimax Q learning (zero-sum)
• Hu & Wellman’s Nash Q learning (general-sum)
Stochastic games (SG)
Partially observable SG (POSG)
Expectation over next states
Immediate reward
Value of next state
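The three labels above annotate the Bellman optimality equation of an MDP. A standard form is sketched below; the symbols r(s,a) for the immediate reward, p(s'|s,a) for the transition probability, and γ for the discount factor are notation assumed here rather than taken from the slides.

```latex
v(s,\pi^*) = \max_{a \in A} \Big[
  \underbrace{r(s,a)}_{\text{immediate reward}}
  + \gamma \underbrace{\sum_{s' \in S} p(s' \mid s,a)}_{\text{expectation over next states}}
  \, \underbrace{v(s',\pi^*)}_{\text{value of next state}}
\Big]
```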
• Model-based reinforcement learning:
1. Learn the reward function and the state transition function
2. Solve for the optimal policy
• Model-free reinforcement learning:
1. Directly learn the optimal policy without knowing the reward function or the state transition function
Number of times action a causes state transition s → s’
Number of times action a has been executed in state s
Total reward accrued when applying a in s
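The three counts above are enough to estimate a model from experience: the transition probability as a ratio of counts, and the reward as an average. A minimal sketch follows, assuming hashable states and actions; the class name `CountModel` and its methods are hypothetical helpers, not part of the original slides.

```python
from collections import defaultdict


class CountModel:
    """Hypothetical helper: estimates p(s'|s,a) and r(s,a) from counts."""

    def __init__(self):
        self.n_sas = defaultdict(int)         # times a caused s -> s'
        self.n_sa = defaultdict(int)          # times a was executed in s
        self.reward_sum = defaultdict(float)  # total reward for a in s

    def observe(self, s, a, r, s_next):
        """Record one experienced transition (s, a, r, s')."""
        self.n_sas[(s, a, s_next)] += 1
        self.n_sa[(s, a)] += 1
        self.reward_sum[(s, a)] += r

    def transition_prob(self, s, a, s_next):
        """Estimated p(s'|s,a); 0.0 if (s, a) was never tried."""
        n = self.n_sa[(s, a)]
        return self.n_sas[(s, a, s_next)] / n if n else 0.0

    def mean_reward(self, s, a):
        """Estimated r(s,a); 0.0 if (s, a) was never tried."""
        n = self.n_sa[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0


# Usage with made-up observations.
model = CountModel()
model.observe("s0", "go", 1.0, "s1")
model.observe("s0", "go", 0.0, "s0")
model.observe("s0", "go", 1.0, "s1")
model.observe("s0", "go", 2.0, "s1")
```

Once the estimates are stable, the learned model can be handed to any planner (for example value iteration) to solve for the policy, which is the second step of the model-based approach.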
v(s’)
1. Start with arbitrary initial values of Q(s,a), for all s ∈ S, a ∈ A
2. At each time t the agent chooses an action and observes its reward r_t
3. The agent then updates its Q-values based on the Q-learning rule
4. The learning rate α_t needs to decay over time in order for the learning algorithm to converge
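The four steps above can be sketched as a tabular Q-learning loop. This is an illustration under assumptions the slides do not state: epsilon-greedy action selection, a learning rate of 1/(visit count) as one possible decaying schedule, and a hypothetical two-state toy environment `toy_step`.

```python
import random
from collections import defaultdict


def q_learning(step, states, actions, episodes=500, horizon=20,
               gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning; `step(s, a)` must return (reward, next_state)."""
    rng = random.Random(seed)
    q = defaultdict(float)     # step 1: arbitrary initial values (zero here)
    visits = defaultdict(int)
    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(horizon):
            # Step 2: choose an action (epsilon-greedy, an assumption)
            # and observe the reward.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: q[(s, b)])
            r, s_next = step(s, a)
            # Step 4: learning rate decays with the visit count.
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]
            # Step 3: Q-learning rule.
            target = r + gamma * max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q


def toy_step(s, a):
    """Hypothetical environment: staying in state 1 pays reward 1."""
    if a == "move":
        return 0.0, 1 - s
    return (1.0 if s == 1 else 0.0), s


q = q_learning(toy_step, states=[0, 1], actions=["stay", "move"])
greedy = {s: max(["stay", "move"], key=lambda a: q[(s, a)]) for s in [0, 1]}
```

In this toy problem the greedy policy read off the learned table moves out of state 0 and stays in state 1. The 1/(visit count) schedule satisfies the usual decay conditions but converges slowly; other decaying schedules are equally valid.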
Famous game theory example
A co-operative game
Generalization of MDP
Mixed strategy
Stationary: the agent’s policy does not change over time
Deterministic: the same action is always chosen whenever the agent is in state s
Example (figure): payoff matrices for State 1 and State 2
v(s,π*) ≥ v(s,π) for all s ∈ S, π ∈ Π
Max V
Such that: rock + paper + scissors = 1
Worst case
Best response
Expectation over all actions
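The labels above describe the pieces of a minimax value: an expectation over the agent's own actions, the worst case over the opponent's actions, and the best choice of mixed strategy. The sketch below evaluates the first two pieces for rock-paper-scissors, assuming payoffs of +1, -1, and 0 for win, loss, and tie; the names `PAYOFF`, `expected_payoff`, and `worst_case_value` are illustrative only.

```python
from fractions import Fraction

ACTIONS = ["rock", "paper", "scissors"]

# PAYOFF[a][o]: agent's reward when it plays a and the opponent plays o.
PAYOFF = {
    "rock":     {"rock": 0,  "paper": -1, "scissors": 1},
    "paper":    {"rock": 1,  "paper": 0,  "scissors": -1},
    "scissors": {"rock": -1, "paper": 1,  "scissors": 0},
}


def expected_payoff(pi, o):
    """Expectation over the agent's actions for a fixed opponent action."""
    return sum(pi[a] * PAYOFF[a][o] for a in ACTIONS)


def worst_case_value(pi):
    """Value of mixed strategy pi against the most damaging opponent action."""
    return min(expected_payoff(pi, o) for o in ACTIONS)


uniform = {a: Fraction(1, 3) for a in ACTIONS}
rock_heavy = {"rock": Fraction(1, 2),
              "paper": Fraction(1, 4),
              "scissors": Fraction(1, 4)}

print(worst_case_value(uniform))     # 0
print(worst_case_value(rock_heavy))  # -1/4
```

The uniform strategy guarantees a value of 0, while over-weighting rock can be exploited by an opponent who plays paper. The minimax strategy is the one that maximizes this worst-case value.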
Quality of a state-action pair
Discounted value of all succeeding states weighted by their likelihood
This learning rule converges to the correct values of Q and v
Discounted value of all succeeding states
Expected reward for taking action a when opponent chooses o from state s
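The labels above appear to annotate the value equations of a zero-sum Markov game. A sketch in the notation commonly used for minimax-Q follows; the symbols R, T, γ, and PD(A) (the set of probability distributions over the agent's actions) are assumed here, since the slide equations themselves are not in the extracted text.

```latex
\begin{aligned}
V(s) &= \max_{\pi \in PD(A)} \; \min_{o \in O} \; \sum_{a \in A} Q(s,a,o)\,\pi_a \\
Q(s,a,o) &= R(s,a,o) + \gamma \sum_{s'} T(s,a,o,s')\, V(s')
\end{aligned}
```

Here R(s,a,o) is the expected reward for taking action a when the opponent chooses o from state s, and the sum over s' is the discounted value of all succeeding states weighted by their likelihood.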
explor controls how often the agent will deviate from its current policy
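One way to read the role of explor is as the probability of ignoring the current policy and acting uniformly at random. A minimal sketch under that assumption follows; the function `choose_action` and the dictionary representation of the policy are hypothetical.

```python
import random


def choose_action(policy, actions, explor, rng=random):
    """Pick an action for the current state.

    policy: dict mapping each action to its probability under the
            current (possibly mixed) policy for this state.
    explor: probability of deviating to a uniformly random action.
    """
    if rng.random() < explor:
        return rng.choice(actions)  # deviate from the current policy
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Usage: a mostly-rock policy with 20% exploration.
rng = random.Random(0)
policy = {"rock": 0.8, "paper": 0.1, "scissors": 0.1}
sample = [choose_action(policy, list(policy), 0.2, rng) for _ in range(5)]
```

With explor set to 0 the agent always samples from its policy; with explor set to 1 it always acts uniformly at random.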
Hu and Wellman: general-sum Markov games as a framework for RL
Theorem (Nash, 1951): There exists a mixed strategy Nash equilibrium for any finite bimatrix game
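To make the equilibrium condition concrete, the sketch below checks whether a pair of mixed strategies is a Nash equilibrium of a bimatrix game: neither player may gain by switching to any pure action. Checking pure deviations is sufficient because a mixed deviation's payoff is an average of pure ones. The function `is_nash` and the matching-pennies payoffs are illustrative assumptions, not taken from the slides.

```python
def is_nash(A, B, x, y, tol=1e-9):
    """Check a mixed-strategy profile (x, y) in the bimatrix game (A, B).

    A[i][j], B[i][j]: payoffs to the row and column player when the row
    player picks i and the column player picks j.
    """
    rows, cols = range(len(A)), range(len(A[0]))
    row_value = sum(x[i] * y[j] * A[i][j] for i in rows for j in cols)
    col_value = sum(x[i] * y[j] * B[i][j] for i in rows for j in cols)
    # Best payoff each player could get from a unilateral pure deviation.
    best_row = max(sum(y[j] * A[i][j] for j in cols) for i in rows)
    best_col = max(sum(x[i] * B[i][j] for i in rows) for j in cols)
    return best_row <= row_value + tol and best_col <= col_value + tol


# Matching pennies: zero-sum, no pure-strategy equilibrium.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]

mixed_ok = is_nash(A, B, [0.5, 0.5], [0.5, 0.5])
pure_ok = is_nash(A, B, [1.0, 0.0], [1.0, 0.0])
```

For matching pennies the uniform profile passes the check and the pure profile fails it, which is consistent with the theorem: an equilibrium exists, but only in mixed strategies.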