
CH2. Multi-armed Bandits

Table of Contents
2.1 k-armed Bandit Problems
2.2 Action-Value Methods
2.3 The 10-armed Testbed
2.4 Incremental Implementation
2.5 Tracking a Nonstationary Problem
2.6 Optimistic Initial Values
2.7 Upper-Confidence-Bound Action Selection
2.8 Gradient Based Algorithms
2.9 Associative Search (Contextual Bandits)
2.1 k-armed Bandit Problems
In a k-armed bandit problem, you are faced with the following environment:
You are faced repeatedly with a choice among k different options (actions)
You receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected
Value of an action: each of the k actions has an expected or mean reward given that that action is selected
The objective is to maximize the total reward over some time period, or number of time steps
We could describe the k-armed bandit problem mathematically by the following:
A_t : the action selected on time step t
R_t : the reward corresponding to action A_t
q∗(a) : the value of an arbitrary action a
This is the expected reward given that a is selected: q∗(a) = E[R_t | A_t = a]
If we knew the value of each action, the problem would be trivial since we could choose the action with the highest value at each time step. Therefore, we assume that we do not know the action values with certainty. We denote the estimated value of action a at time step t as Q_t(a). Our goal is then to make Q_t(a) close to q∗(a).
If we maintain estimates of the action values, then at any time step there is at least one action
whose estimated value is greatest. We call these actions greedy. We say we are exploiting
when choosing greedy actions. On the other hand, we say we are exploring when we choose
actions that aren't greedy. Exploitation is the right thing to do to maximize the expected reward
on one step, but exploration may produce the greater total reward in the long run. (Example in
Tic-Tac-Toe)
2.2 Action-Value Methods
Action-value methods are used for estimating the values of actions and for using the estimates
to make action selection decisions.
Recall that the true value of an action is the mean reward when that action is selected.
Mathematically, we can formulate this by:
Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
       = Σ_{i=1}^{t−1} R_i · 𝟙_{A_i=a} / Σ_{i=1}^{t−1} 𝟙_{A_i=a}
where 𝟙_{predicate} denotes the random variable that is 1 if predicate is true and 0 if it is not. If the denominator is zero, we set Q_t(a) to a default value such as 0.
As the denominator goes to infinity, by the law of large numbers, Q_t(a) converges to q∗(a). We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards.
The simplest action selection rule is to select one of the actions with the highest estimated
value, one of the greedy actions. We write the greedy action selection method as:
A_t = argmax_a Q_t(a)
To include exploration in our action selection method, we use what is called an ϵ-greedy method, where with probability ϵ we select an action uniformly at random among all the actions available, including the greedy action. An advantage of this method is that, as the number of steps increases, every action will be sampled an infinite number of times, thus ensuring that all the Q_t(a) converge to q∗(a).
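As a minimal sketch of this selection rule (the function name and the use of NumPy are my own choices, not from the text), ϵ-greedy selection over a table of estimates Q might look like:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick uniformly among all actions, otherwise pick greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: any action, the greedy one included
    return int(np.argmax(Q))               # exploit: ties broken by lowest index here
```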
2.3 The 10-armed Testbed
We compare the results of greedy methods and the ϵ-greedy methods with the 10-armed testbed. The conditions of the 10-armed testbed are as follows (see the sketch after this list):
A set of 2000 randomly generated k-armed bandit problems with k = 10
The action values q∗(a), a = 1, ⋯ , 10, were selected according to a normal (Gaussian) distribution with mean 0 and variance 1
When a learning method applied to that problem selected action A_t at time step t, the actual reward R_t was selected from a normal distribution with mean q∗(A_t) and variance 1
One run consists of experience over 1000 time steps
We test over 2000 independent runs
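A minimal sketch of this setup, assuming NumPy (the variable and function names are mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_problems = 10, 2000

# True action values q*(a) for each of the 2000 problems, drawn from N(0, 1).
q_star = rng.normal(loc=0.0, scale=1.0, size=(n_problems, k))

def bandit_reward(problem, action):
    """Sample a reward for taking `action` on one problem: N(q*(action), 1)."""
    return rng.normal(q_star[problem, action], 1.0)
```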
We see from Figure 2.2 that the ϵ-greedy methods fare better than the greedy method. The greedy method performed worse in the long run because it often got stuck performing suboptimal actions. On the other hand, the ϵ-greedy methods performed better because they continued to explore and improve their chances of recognizing the optimal action.
The advantage of the ϵ-greedy methods depends on the task.
If the reward variance were large, it would take more exploration to find the optimal action, and the ϵ-greedy methods might do even better.
If the reward variances were zero, then the greedy method would know the true value of each action after trying it once. In this case, the greedy method might actually perform best because it would soon find the optimal action and then never explore.
However, if the bandit task were nonstationary, meaning that the true values of the actions changed over time, the ϵ-greedy methods might do better.
2.4 Incremental Implementation
We now tackle the problem of how to compute the averages of observed rewards in a computationally efficient manner. The definitions of rewards and action values are changed, now concentrating on a single action:
R_i : the reward received after the ith selection of this action
Q_n : the estimate of its action value after it has been selected n − 1 times

Q_n = (R_1 + R_2 + ⋯ + R_{n−1}) / (n − 1)
Then, given Q_n and the nth reward R_n, the new average of all n rewards can be computed by

Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i
        = (1/n) ( R_n + Σ_{i=1}^{n−1} R_i )
        = (1/n) ( R_n + (n − 1) · (1/(n − 1)) Σ_{i=1}^{n−1} R_i )
        = (1/n) ( R_n + (n − 1) Q_n )
        = (1/n) ( R_n + n Q_n − Q_n )
        = Q_n + (1/n) [ R_n − Q_n ]
This implementation requires memory only for Q_n and n, and only a small computation for each new reward, which is computationally efficient. The update rule above is of a form that occurs frequently. The general form is

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

The expression [Target − OldEstimate] is an error in the estimate. It is reduced by taking a step toward the Target.
The target is presumed to indicate a desirable direction in which to move. In the update rule above, the target is the nth reward.
Note that the step-size parameter StepSize changes in the incremental method as n increases. In general, we denote the step-size parameter by α.
Pseudocode for a complete bandit algorithm using incrementally computed sample averages and ϵ-greedy action selection is shown below.
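The book presents this as a pseudocode box; a minimal Python sketch of the same loop (the function name and the reward_fn callback are my own scaffolding, not the book's notation) could be:

```python
import numpy as np

def simple_bandit(k, epsilon, steps, reward_fn, rng=np.random.default_rng()):
    """Incremental sample-average estimates with epsilon-greedy action selection."""
    Q = np.zeros(k)               # value estimates Q(a), initialized to 0
    N = np.zeros(k, dtype=int)    # N(a): number of times each action was selected
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))      # explore: uniformly random action
        else:
            a = int(np.argmax(Q))         # exploit: greedy action
        r = reward_fn(a)                  # observe reward R from the bandit
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]         # incremental update: Q <- Q + (1/N)[R - Q]
    return Q, N
```

Using the bandit_reward sketch from §2.3, one run would be, e.g., `simple_bandit(10, 0.1, 1000, lambda a: bandit_reward(0, a))`.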
2.5 Tracking a Nonstationary Problem
The averaging methods above are appropriate for stationary bandit problems, problems in which the reward probabilities do not change over time. For reinforcement learning problems that are nonstationary, it makes sense to give more weight to recent rewards than to long-past ones, for example by using a constant step-size parameter. The incremental update rule then becomes

Q_{n+1} = Q_n + α[R_n − Q_n]

where the step-size parameter α ∈ (0, 1] is constant. This results in Q_{n+1} being a weighted average of past rewards and the initial estimate Q_1:
Q_{n+1} = Q_n + α[R_n − Q_n]
        = αR_n + (1 − α)Q_n
        = αR_n + (1 − α)[αR_{n−1} + (1 − α)Q_{n−1}]
        = αR_n + (1 − α)αR_{n−1} + (1 − α)²Q_{n−1}
        = αR_n + (1 − α)αR_{n−1} + (1 − α)²αR_{n−2} + ⋯ + (1 − α)^{n−1}αR_1 + (1 − α)^n Q_1
        = (1 − α)^n Q_1 + Σ_{i=1}^{n} α(1 − α)^{n−i} R_i
We call this a weighted average because the sum of the weights is (1 − α)^n + Σ_{i=1}^{n} α(1 − α)^{n−i} = 1, as we can check numerically below. Note that the weight, α(1 − α)^{n−i}, given to the reward R_i depends on how many rewards ago, n − i, it was observed. The quantity 1 − α is less than 1, and thus the weight given to R_i decreases as the number of intervening rewards increases. This means that recent rewards have a higher weight than past rewards. This is sometimes called an exponential recency-weighted average.
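A short sketch of the constant step-size update, together with a numerical check that the weights above sum to 1 (the function name and the α = 0.1, n = 50 values are arbitrary choices of mine):

```python
import numpy as np

def constant_alpha_update(Q, reward, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    return Q + alpha * (reward - Q)

# Check that (1 - alpha)^n + sum_{i=1..n} alpha * (1 - alpha)^(n - i) equals 1.
alpha, n = 0.1, 50
weights = [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
print(np.isclose(sum(weights), 1.0))   # True
```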
Sometimes it is convenient to vary the step-size parameter from step to step. Let α_n(a) denote the step-size parameter used to process the reward received after the nth selection of action a (the choice α_n(a) = 1/n results in the sample-average method). For a choice of the sequence {α_n(a)} to guarantee convergence, the following two conditions are required:
1. Σ_{n=1}^{∞} α_n(a) = ∞
2. Σ_{n=1}^{∞} α_n(a)² < ∞
The first condition is required to guarantee that the steps are large enough to eventually
overcome any initial conditions or random fluctuations.
The second condition guarantees that eventually the steps become small enough to assure
convergence.
For the sample-average choice α_n(a) = 1/n, both conditions are met. For a constant step size α_n(a) = α, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards. This is actually desirable in a nonstationary environment.
2.6 Optimistic Initial Values
The methods above are dependent to some extent on the initial action-value estimates, Q_1(a) (note that we have so far taken Q_1(a) to be 0 by default). These methods are biased by their initial estimates.
For sample-average methods, the bias disappears once all actions have been selected at
least once
For methods with constant alpha, the bias is permanent, though it decreases over time
Initial action values can also be used as a simple way to encourage exploration. Suppose the initial action values are set to +5 in the 10-armed testbed. Recall that the q∗(a) are selected from a normal distribution with mean 0 and variance 1; hence the initial estimate of +5 is extremely optimistic. However, this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being disappointed with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge.
Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time. We call this technique for encouraging exploration optimistic initial values.
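A minimal sketch of a purely greedy learner with optimistic initial estimates on the testbed of §2.3 (the +5 initial value is from the text above; the constant step size α = 0.1 and the rest of the scaffolding are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k, steps, alpha = 10, 1000, 0.1
q_star = rng.normal(0.0, 1.0, size=k)    # one 10-armed problem

Q = np.full(k, 5.0)                      # optimistic initial estimates Q_1(a) = +5
for _ in range(steps):
    a = int(np.argmax(Q))                # purely greedy (epsilon = 0)
    r = rng.normal(q_star[a], 1.0)       # reward ~ N(q*(a), 1)
    Q[a] += alpha * (r - Q[a])           # constant step-size update
```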
Note that this method is not adequate for non-stationary problems because its drive for
exploration is inherently temporary; it focuses on the time horizon near the initial conditions.
2.7 Upper-Confidence-Bound Action Selection
We would like to have a preference among the available actions when exploring. UCB does this by selecting among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. We define a new action selection rule for this:

A_t = argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ]

N_t(a) denotes the number of times action a has been selected prior to time t
The number c > 0 controls the degree of exploration
If N_t(a) = 0, then a is considered to be a maximizing action
The idea of UCB is that the square-root term is a measure of the uncertainty or variance in the estimate of a's value. The quantity being maxed over is thus an upper bound on the possible true value of action a, with c determining the confidence level.
Each time a is selected the uncertainty is presumably reduced: N_t(a) increments, and the uncertainty term decreases.
Each time an action other than a is selected, t increases but N_t(a) does not, leading to the uncertainty estimate increasing.
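A minimal sketch of this selection rule (the function name is mine; c is left as a free parameter):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ] at time step t >= 1."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])           # N_t(a) = 0: treat a as a maximizing action
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```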
2.8 Gradient Based Algorithms
In this section we consider learning a numerical preference for each action a, which we denote H_t(a). The larger the preference, the more often that action is taken. Note that only the relative preference of one action over another is important; if we add 1000 to all the action preferences there is no effect on the action probabilities. The action probabilities are determined according to a soft-max distribution:
Pr{A_t = a} = e^{H_t(a)} / Σ_{b=1}^{k} e^{H_t(b)} = π_t(a)

π_t(a) is the probability of taking action a at time t. Initially, all action preferences are the same (e.g., H_1(a) = 0, ∀a) so that all actions have an equal probability of being selected.
There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting action A_t and receiving the reward R_t, the action preferences are updated by:

H_{t+1}(A_t) ≐ H_t(A_t) + α(R_t − R̄_t)(1 − π_t(A_t)), and
H_{t+1}(a) ≐ H_t(a) − α(R_t − R̄_t)π_t(a), for all a ≠ A_t

where α > 0 is a step-size parameter, and R̄_t ∈ ℝ is the average of all the rewards up through and including time t.
The R̄_t term serves as a baseline with which the reward is compared.
If the reward is higher than the baseline, then the probability of taking A_t in the future is increased.
If the reward is below the baseline, then the probability is decreased.
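A minimal sketch of one update step (the function name, the reward_fn callback, the α = 0.1 default, and the incremental computation of the baseline R̄ are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_bandit_step(H, r_baseline, t, reward_fn, alpha=0.1):
    """One step of the gradient bandit update with an average-reward baseline.

    t is the current (1-based) time step; r_baseline is the running average reward.
    """
    pi = np.exp(H - H.max())
    pi /= pi.sum()                         # soft-max probabilities pi_t(a)
    a = int(rng.choice(len(H), p=pi))      # sample A_t ~ pi_t
    r = reward_fn(a)                       # observe reward R_t
    r_baseline += (r - r_baseline) / t     # incremental average of rewards through time t
    H = H - alpha * (r - r_baseline) * pi  # H(a) -= alpha*(R - Rbar)*pi(a) for all a
    H[a] += alpha * (r - r_baseline)       # net for A_t: +alpha*(R - Rbar)*(1 - pi(A_t))
    return H, r_baseline
```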
2.9 Associative Search (Contextual Bandits)
Up to this point, we have only considered non-associative tasks; tasks in which there is no need
to associate different actions with different situations. However, in a general reinforcement
learning task there is more than one situation, and the goal is to learn a policy: a mapping from
situations to the actions that are best in those situations.
If a reinforcement learning problem involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best, then it is an example of an associative search task.