R. Brafman and M. Tennenholtz
Presented by Daniel Rasmussen
Exploration versus Exploitation
Should we follow our best known path for a guaranteed reward, or take an unknown path for a chance at a greater reward?
Explicit exploration/exploitation policies do not work well in multiagent settings
R-Max uses an implicit exploration/exploitation policy
How does it do this?
Why should we believe it works?
Background
R-Max algorithm
Proof of near-optimality
Variations on the algorithm
Conclusions
A set of (normal form) games
Each game has a set of possible actions and rewards associated with those actions as usual
We add a probabilistic transfer function that maps the current game and the players’ actions onto the next game (a minimal data-structure sketch follows the example below)
[Figure: an example stochastic game — 2×2 stage games whose cells hold (agent, adversary) payoffs, connected by probabilistic transitions (e.g. 0.7, 0.3, 0.2, 0.4).]
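As a concrete illustration of this definition, here is a minimal Python sketch of a stochastic-game representation. The class and method names (`StageGame`, `StochasticGame`, `step`) are illustrative, not taken from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class StageGame:
    """One stage game: payoffs and a transfer function over next stage games."""
    rewards: dict       # (agent action, adversary action) -> (agent reward, adversary reward)
    transitions: dict   # (agent action, adversary action) -> {next game id: probability}

@dataclass
class StochasticGame:
    games: dict         # game id -> StageGame

    def step(self, game_id, a_agent, a_adversary):
        """Play one joint action: return the rewards and sample the next stage game."""
        g = self.games[game_id]
        rewards = g.rewards[(a_agent, a_adversary)]
        dist = g.transitions[(a_agent, a_adversary)]
        next_id = random.choices(list(dist), weights=list(dist.values()))[0]
        return rewards, next_id
```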
History: the set of all past states, the actions taken in those states, and the rewards received (plus the current state)
(S × A² × R)^t × S
Policy: a mapping from the set of histories to the set of possible actions in the current state
π: (S × A² × R)^t × S → A, for all t ≥ 0
Value of a policy: expected average reward if we follow the policy for T steps
Denoted U(s, π_a, π_o, T)
U(π_a) = min_{π_o} min_s lim_{T→∞} U(s, π_a, π_o, T)
ε-return mixing time: how many steps do we need to take before the expected average reward is within ε of the stable average
The minimum T such that U(s, π_a, π_o, T) > U(π_a) − ε for all s and π_o
Optimal policy: the policy with the highest U(π_a)
We will always parameterize this in ε and T (i.e. the policy with the highest U(π_a) whose ε-return mixing time is T)
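The T-step value U(s, π_a, π_o, T) can be made concrete by Monte Carlo simulation. The sketch below assumes the hypothetical `StochasticGame.step` interface from the earlier snippet and treats policies as callables from (history, current state) to an action.

```python
def average_reward(game, s0, pi_a, pi_o, T, episodes=1000):
    """Monte Carlo estimate of U(s0, pi_a, pi_o, T): the expected average
    reward to the agent over T steps of play."""
    total = 0.0
    for _ in range(episodes):
        s, history = s0, []
        for _ in range(T):
            a = pi_a(history, s)                    # policies map histories x states to actions
            o = pi_o(history, s)
            (r_agent, r_adversary), s_next = game.step(s, a, o)
            history.append((s, a, o, r_agent))
            total += r_agent
            s = s_next
    return total / (episodes * T)
```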
The agent always knows what state it is in
After each stage the agent knows what actions were taken and what rewards were received
We know the maximum possible reward, R_max
We know T, the ε-return mixing time of an optimal policy
When faced with the choice between a known and unknown reward, always try the unknown reward
Maintain an internal model of the stochastic game
Calculate an optimal policy according to model and carry it out
Update model based on observations
Calculate a new optimal policy and repeat
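A sketch of that control flow, assuming hypothetical helpers `optimal_policy(model, T)` (planning in the model), `env.play` (one interaction with the real game), and `update` (model bookkeeping, sketched later); this is only the loop skeleton, not the authors' pseudocode.

```python
def rmax_loop(model, env, s0, T, c, num_phases):
    """High-level R-Max control flow: plan in the optimistic model, act for
    T steps, fold observations into the model, and re-plan when it changes."""
    s = s0
    for _ in range(num_phases):
        policy = optimal_policy(model, T)              # plan in the current model
        for _ in range(T):
            a = policy(s)                              # agent's action from the plan
            o, rewards, s_next = env.play(s, a)        # adversary acts, rewards observed
            if update(model, c, s, a, o, rewards, s_next):
                policy = optimal_policy(model, T)      # an entry became known: re-plan
            s = s_next
    return model
```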
Input
N: number of games
k: number of actions in each game
ε: the error bound
δ: the probability of failure
R_max: the maximum reward value
T: the ε-return mixing time of an optimal policy
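For reference, these inputs can be bundled into one small configuration object (illustrative names only):

```python
from dataclasses import dataclass

@dataclass
class RMaxParams:
    N: int           # number of stage games
    k: int           # number of actions per player in each game
    epsilon: float   # error bound
    delta: float     # allowed probability of failure
    R_max: float     # maximum possible reward
    T: int           # epsilon-return mixing time of an optimal policy
```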
Initializing the internal model
Create states {G_1, ..., G_N} to represent the stage games of the stochastic game
Create a fictitious game G_0
Initialize all rewards to (R_max, 0)
Set all transfer functions to point to G_0
Associate a boolean known/unknown variable with each entry in each game, initialized to unknown
Associate a list of states reached with each entry, which is initially empty
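A minimal sketch of this initialization, reusing the hypothetical `RMaxParams` object above; game id 0 plays the role of the fictitious game G_0.

```python
def init_model(params):
    """Optimistic initial model: every entry pays (R_max, 0), every transition
    leads to the fictitious game G_0, and every entry starts out unknown."""
    actions = range(params.k)
    model = {}
    for game_id in range(params.N + 1):             # game_id 0 is the fictitious G_0
        entry = {}
        for a in actions:
            for o in actions:
                entry[(a, o)] = {
                    "reward": (params.R_max, 0.0),  # (agent, adversary) rewards
                    "next": {0: 1.0},               # all transfer functions point to G_0
                    "known": False,
                    "reached": [],                  # states reached from this entry so far
                }
        model[game_id] = entry
    return model
```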
[Figure: the example stochastic game labelled G_1–G_6, shown next to the freshly initialized internal model: every entry’s reward is (R_max, 0), every transition points to the fictitious game G_0, and every entry is marked unknown with an empty list of reached states.]
Repeat
Compute an optimal policy for T steps based on the current internal model
Execute that policy for T steps
After each step:
▪ If an entry was visited for the first time, update the rewards based on observations
▪ Update the list of states reached from that entry
▪ If the list of states reached now contains c+1 elements
▪ mark that entry as known
▪ update the transition function
▪ compute a new policy
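One possible realization of this per-step bookkeeping (a hypothetical helper; `c` is the sample-count threshold mentioned above). Returning True signals the caller to recompute its policy.

```python
def update(model, c, game_id, a, o, rewards, next_id):
    """Fold one observed step into the model; return True if the entry just
    became known (the caller should then compute a new policy)."""
    entry = model[game_id][(a, o)]
    if not entry["reached"]:                        # first visit: record observed rewards
        entry["reward"] = rewards
    entry["reached"].append(next_id)                # update the list of states reached
    if not entry["known"] and len(entry["reached"]) == c + 1:
        entry["known"] = True                       # mark the entry as known
        samples = entry["reached"]
        entry["next"] = {g: samples.count(g) / len(samples)   # empirical transition function
                         for g in set(samples)}
        return True
    return False
```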
[Figures: a step-by-step walkthrough of model updates. The first entry played in G_1 records its observed reward (2,2) and adds (G_4, 1) to its list of reached states; a later step records the observed reward (3,3) for an entry in G_4, which reaches G_1; after a further visit the G_1 entry’s list is {(G_4, 1), (G_2, 1)}, so the entry is marked known and its transition probabilities are estimated from the samples (0.5 each).]
Goal: prove that if the agent follows the R-Max algorithm for a polynomial number of steps, its average reward will be within ε of the optimum (with probability at least 1 − δ)
Outline of proof:
After each T-step phase the agent will either have obtained near-optimal average reward or have learned something new
There are only polynomially many parameters to learn, so the agent can completely learn its model in polynomial time
The adversary can block learning, but in doing so it lets the agent obtain near-optimal reward
Either way the agent wins!
Lemma 1: If the agent’s model is a sufficiently close approximation of the true game, then an optimal policy in the model will be near-optimal in the true game
We guarantee that the model is sufficiently close by waiting until we have c + 1 samples before marking an entry known
Lemma 2: the difference between the expected reward under the model and the actual expected reward is less than the exploration probability times R_max
|V_model − V_actual| < α · R_max, where α is the probability of reaching an unknown entry (i.e. of exploring)
Combining Lemma 1 and 2
In unknown states the agent expects an unrealistically high reward (R_max); if the actual reward falls well short of this optimistic value, Lemma 2 tells us that the probability of exploration must be high
In known states Lemma 1 tells us that we will obtain near-optimal reward
We will always be in either a known or unknown state, therefore we will always explore with high probability or obtain near-optimal reward
If we explore for long enough then (almost) all states will be known, and we are guaranteed near-optimal reward
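The dichotomy can be compressed into a single bound, sketched here in the slides’ notation with α denoting the probability that a T-step phase reaches an unknown entry:

```latex
% Sketch of the implicit explore-or-exploit argument, in the slides' notation.
% V_model : expected T-step average reward of the planned policy in the model.
% V_actual: its expected T-step average reward in the true game.
\[
  \lvert V_{\mathrm{model}} - V_{\mathrm{actual}} \rvert \;<\; \alpha \, R_{\max}
\]
% Small alpha: the actual reward is close to the optimistic model value (exploit).
% Large alpha: an unknown entry is reached with probability at least alpha (explore).
```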
Remove the assumption that we know T, the ε-return mixing time of an optimal policy
Remove the assumption that we know R_max
Simplified version for repeated games
R-Max is a model-based reinforcement learning algorithm that is guaranteed to converge on the near-optimal average reward in polynomial time
Works on zero-sum stochastic games, MDPs, and repeated games
The authors provide a formal justification for the optimism-under-uncertainty heuristic
Guarantee that the agent either obtains near-optimal reward or learns efficiently
For complicated games, N, k, and T are all likely to be large, so a polynomial-time guarantee may still not be computationally feasible in practice
We do not consider how the agent’s behaviour might impact the adversary’s