
CH2. Multi-armed Bandits

Table of Contents
2.1 k-armed Bandit Problems
2.2 Action-Value Methods
2.3 The 10-armed Testbed
2.4 Incremental Implementation
2.5 Tracking a Nonstationary Problem
2.6 Optimistic Initial Values
2.7 Upper-Confidence-Bound Action Selection
2.8 Gradient Based Algorithms
2.9 Associative Search (Contextual Bandits)
2.1 k-armed Bandit Problems
In a k-armed bandit problem, you are faced with the following environment:
You are faced repeatedly with a choice among k different options (actions)
You receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected
Value of an action: each of the k actions has an expected or mean reward given that that action is selected
The objective is to maximize the total reward over some time period, or number of time steps
We could describe the k-armed bandit problem mathematically by the following:
A_t : the action selected on time step t
R_t : the reward corresponding to action A_t
q∗(a) : the value of an arbitrary action a
This is the expected reward given that a is selected: q∗(a) = E[R_t | A_t = a]
If we knew the value of each action, the problem would be trivial since we could choose the action with the highest value at each time step. Therefore, we assume that we do not know the action values with certainty. We denote the estimated value of action a at time step t as Q_t(a). Our goal is then to make Q_t(a) close to q∗(a).
If we maintain estimates of the action values, then at any time step there is at least one action
whose estimated value is greatest. We call these actions greedy. We say we are exploiting
when choosing greedy actions. On the other hand, we say we are exploring when we choose
actions that aren't greedy. Exploitation is the right thing to do to maximize the expected reward
on one step, but exploration may produce the greater total reward in the long run. (Example in
Tic-Tac-Toe)
2.2 Action-Value Methods
Action-value methods are used for estimating the values of actions and for using the estimates
to make action selection decisions.
Recall that the true value of an action is the mean reward when that action is selected.
Mathematically, we can formulate this by:
Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
       = Σ_{i=1}^{t−1} R_i · 𝟙_{A_i=a} / Σ_{i=1}^{t−1} 𝟙_{A_i=a}
where 𝟙_{predicate} denotes the random variable that is 1 if predicate is true and 0 if it is not. If the denominator is zero, we set Q_t(a) to a default value such as 0.
As the denominator goes to infinity, by the law of large numbers, Q_t(a) converges to q∗(a). We call this the sample-average method for estimating action values because each estimate is an average of the sample of relevant rewards.
The simplest action selection rule is to select one of the actions with the highest estimated
value, one of the greedy actions. We write the greedy action selection method as:
A_t = argmax_a Q_t(a)
To include exploration in our action selection method, we use what is called an ϵ-greedy method, where with probability ϵ we select an action uniformly at random among all the actions available, including the greedy action. An advantage of this method is that, as the number of steps increases, every action will be sampled an infinite number of times, thus ensuring that all the Q_t(a) converge to q∗(a).
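As a minimal sketch of this selection rule (the function name and the use of NumPy are my own choices, not from the text), ϵ-greedy selection over a table of estimates Q might look like:

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick uniformly among all actions, otherwise pick greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: any action, the greedy one included
    return int(np.argmax(Q))               # exploit: ties broken by lowest index here
```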
2.3 The 10-armed Testbed
We compare the results of greedy methods and the ϵ-greedy methods with the 10-armed testbed. The conditions of the 10-armed testbed are as follows (see the sketch after this list):
A set of 2000 randomly generated k-armed bandit problems with k = 10
The action values q∗(a), a = 1, ⋯ , 10, were selected according to a normal (Gaussian) distribution with mean 0 and variance 1
When a learning method applied to that problem selected action A_t at time step t, the actual reward R_t was selected from a normal distribution with mean q∗(A_t) and variance 1
One run consists of experience over 1000 time steps
We test over 2000 independent runs
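A minimal sketch of this setup, assuming NumPy (the variable and function names are mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_problems = 10, 2000

# True action values q*(a) for each of the 2000 problems, drawn from N(0, 1).
q_star = rng.normal(loc=0.0, scale=1.0, size=(n_problems, k))

def bandit_reward(problem, action):
    """Sample a reward for taking `action` on one problem: N(q*(action), 1)."""
    return rng.normal(q_star[problem, action], 1.0)
```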
We see from Figure 2.2 that the ϵ-greedy methods fare better than the greedy method. The greedy method performed worse in the long run because it often got stuck performing suboptimal actions. On the other hand, the ϵ-greedy methods performed better because they continued to explore and improve their chances of recognizing the optimal action.
The advantage of the ϵ-greedy methods depends on the task.
If the reward variance were large, it would take more exploration to find the optimal action, and the ϵ-greedy methods might do even better.
If the reward variances were zero, then the greedy method would know the true value of each action after trying it once. In this case, the greedy method might actually perform best because it would soon find the optimal action and then never explore.
However, if the bandit task were nonstationary, meaning that the true values of the actions changed over time, the ϵ-greedy methods might do better.
2.4 Incremental Implementation
We now tackle the problem of how to compute the averages of observed rewards in a computationally efficient manner. The definitions of rewards and action values are changed, now concentrating on a single action:
R_i : the reward received after the ith selection of this action
Q_n : the estimate of its action value after it has been selected n − 1 times

Q_n = (R_1 + R_2 + ⋯ + R_{n−1}) / (n − 1)
Then, given Q_n and the nth reward R_n, the new average of all n rewards can be computed by

Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i
        = (1/n) ( R_n + Σ_{i=1}^{n−1} R_i )
        = (1/n) ( R_n + (n − 1) · (1/(n − 1)) Σ_{i=1}^{n−1} R_i )
        = (1/n) ( R_n + (n − 1) Q_n )
        = (1/n) ( R_n + n Q_n − Q_n )
        = Q_n + (1/n) [ R_n − Q_n ]
This implementation requires memory only for Q_n and n, and only a small computation for each new reward, which is computationally efficient. The update rule above is of a form that occurs frequently. The general form is

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

The expression [Target − OldEstimate] is an error in the estimate. It is reduced by taking a step toward the Target.
The target is presumed to indicate a desirable direction in which to move. In the update rule above, the target is the nth reward.
Note that the step-size parameter StepSize changes in the incremental method as n increases. In general, we denote the step-size parameter by α.
Pseudocode for a complete bandit algorithm using incrementally computed sample averages and ϵ-greedy action selection is shown below.
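The book presents this as a pseudocode box; a minimal Python sketch of the same loop (the function name and the reward_fn callback are my own scaffolding, not the book's notation) could be:

```python
import numpy as np

def simple_bandit(k, epsilon, steps, reward_fn, rng=np.random.default_rng()):
    """Incremental sample-average estimates with epsilon-greedy action selection."""
    Q = np.zeros(k)               # value estimates Q(a), initialized to 0
    N = np.zeros(k, dtype=int)    # N(a): number of times each action was selected
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))      # explore: uniformly random action
        else:
            a = int(np.argmax(Q))         # exploit: greedy action
        r = reward_fn(a)                  # observe reward R from the bandit
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]         # incremental update: Q <- Q + (1/N)[R - Q]
    return Q, N
```

Using the bandit_reward sketch from §2.3, one run would be, e.g., `simple_bandit(10, 0.1, 1000, lambda a: bandit_reward(0, a))`.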
2.5 Tracking a Nonstationary Problem
The averaging methods above are appropriate for stationary bandit problems, problems in which the reward probabilities do not change over time. For reinforcement learning problems that are nonstationary, it makes sense to give more weight to recent rewards than to long-past ones, for example by using a constant step-size parameter. The incremental update rule then becomes

Q_{n+1} = Q_n + α[R_n − Q_n]

where the step-size parameter α ∈ (0, 1] is constant. This results in Q_{n+1} being a weighted average of past rewards and the initial estimate Q_1:
Q_{n+1} = Q_n + α[R_n − Q_n]
        = αR_n + (1 − α)Q_n
        = αR_n + (1 − α)[αR_{n−1} + (1 − α)Q_{n−1}]
        = αR_n + (1 − α)αR_{n−1} + (1 − α)²Q_{n−1}
        = αR_n + (1 − α)αR_{n−1} + (1 − α)²αR_{n−2} + ⋯ + (1 − α)^{n−1}αR_1 + (1 − α)^n Q_1
        = (1 − α)^n Q_1 + Σ_{i=1}^{n} α(1 − α)^{n−i} R_i
We call this a weighted average because the sum of the weights is (1 − α)^n + Σ_{i=1}^{n} α(1 − α)^{n−i} = 1, as we can check numerically below. Note that the weight, α(1 − α)^{n−i}, given to the reward R_i depends on how many rewards ago, n − i, it was observed. The quantity 1 − α is less than 1, and thus the weight given to R_i decreases as the number of intervening rewards increases. This means that recent rewards have a higher weight than past rewards. This is sometimes called an exponential recency-weighted average.
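A short sketch of the constant step-size update, together with a numerical check that the weights above sum to 1 (the function name and the α = 0.1, n = 50 values are arbitrary choices of mine):

```python
import numpy as np

def constant_alpha_update(Q, reward, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    return Q + alpha * (reward - Q)

# Check that (1 - alpha)^n + sum_{i=1..n} alpha * (1 - alpha)^(n - i) equals 1.
alpha, n = 0.1, 50
weights = [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
print(np.isclose(sum(weights), 1.0))   # True
```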
Sometimes it is convenient to vary the step-size parameter from step to step. Let α_n(a) denote the step-size parameter used to process the reward received after the nth selection of action a (the choice α_n(a) = 1/n results in the sample-average method). For a choice of the sequence {α_n(a)} to guarantee convergence, the following two conditions are required:
1. Σ_{n=1}^{∞} α_n(a) = ∞
2. Σ_{n=1}^{∞} α_n(a)² < ∞
The first condition is required to guarantee that the steps are large enough to eventually
overcome any initial conditions or random fluctuations.
The second condition guarantees that eventually the steps become small enough to assure
convergence.
For the sample-average choice α_n(a) = 1/n, both conditions are met. For a constant step size α_n(a) = α, the second condition is not met, indicating that the estimates never completely converge but continue to vary in response to the most recently received rewards. This is actually desirable in a nonstationary environment.
2.6 Optimistic Initial Values
The methods above are dependent to some extent on the initial action-value estimates, Q_1(a) (note that we have so far taken Q_1(a) to be 0 by default). These methods are biased by their initial estimates.
For sample-average methods, the bias disappears once all actions have been selected at
least once
For methods with constant alpha, the bias is permanent, though it decreases over time
Initial action values can also be used as a simple way to encourage exploration. Suppose the initial action values are set to +5 in the 10-armed testbed. Recall that the q∗(a) are selected from a normal distribution with mean 0 and variance 1; hence the initial estimate of +5 is extremely optimistic. However, this optimism encourages action-value methods to explore. Whichever actions are initially selected, the reward is less than the starting estimates; the learner switches to other actions, being disappointed with the rewards it is receiving. The result is that all actions are tried several times before the value estimates converge.
Initially, the optimistic method performs worse because it explores more, but eventually it performs better because its exploration decreases with time. We call this technique for encouraging exploration optimistic initial values.
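A minimal sketch of a purely greedy learner with optimistic initial estimates on the testbed of §2.3 (the +5 initial value is from the text above; the constant step size α = 0.1 and the rest of the scaffolding are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
k, steps, alpha = 10, 1000, 0.1
q_star = rng.normal(0.0, 1.0, size=k)    # one 10-armed problem

Q = np.full(k, 5.0)                      # optimistic initial estimates Q_1(a) = +5
for _ in range(steps):
    a = int(np.argmax(Q))                # purely greedy (epsilon = 0)
    r = rng.normal(q_star[a], 1.0)       # reward ~ N(q*(a), 1)
    Q[a] += alpha * (r - Q[a])           # constant step-size update
```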
Note that this method is not adequate for non-stationary problems because its drive for
exploration is inherently temporary; it focuses on the time horizon near the initial conditions.
2.7 Upper-Confidence-Bound Action Selection
We would like to have a preference among the available actions when exploring. UCB does this by selecting among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates. We define a new action selection rule for this:

A_t = argmax_a [ Q_t(a) + c √( ln t / N_t(a) ) ]

N_t(a) denotes the number of times action a has been selected prior to time t
The number c > 0 controls the degree of exploration
If N_t(a) = 0, then a is considered to be a maximizing action
The idea of UCB is that the square-root term is a measure of the uncertainty or variance in the estimate of a's value. The quantity being maxed over is thus an upper bound on the possible true value of action a, with c determining the confidence level.
Each time a is selected the uncertainty is presumably reduced: N_t(a) increments, and the uncertainty term decreases.
Each time an action other than a is selected, t increases but N_t(a) does not, leading to the uncertainty estimate increasing.
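A minimal sketch of this selection rule (the function name is mine; c is left as a free parameter):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Select A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ] at time step t >= 1."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])           # N_t(a) = 0: treat a as a maximizing action
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```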
2.8 Gradient Based Algorithms
In this section we consider learning a numerical preference for each action a, which we denote H_t(a). The larger the preference, the more often that action is taken. Note that only the relative preference of one action over another is important; if we add 1000 to all the action preferences there is no effect on the action probabilities. The action probabilities are determined according to a soft-max distribution:
Pr{A_t = a} = e^{H_t(a)} / Σ_{b=1}^{k} e^{H_t(b)} = π_t(a)

π_t(a) is the probability of taking action a at time t. Initially, all action preferences are the same (e.g., H_1(a) = 0, ∀a) so that all actions have an equal probability of being selected.
There is a natural learning algorithm for this setting based on the idea of stochastic gradient ascent. On each step, after selecting action A_t and receiving the reward R_t, the action preferences are updated by:

H_{t+1}(A_t) ≐ H_t(A_t) + α(R_t − R̄_t)(1 − π_t(A_t)), and
H_{t+1}(a) ≐ H_t(a) − α(R_t − R̄_t)π_t(a), for all a ≠ A_t

where α > 0 is a step-size parameter, and R̄_t ∈ ℝ is the average of all the rewards up through and including time t.
The R̄_t term serves as a baseline with which the reward is compared.
If the reward is higher than the baseline, then the probability of taking A_t in the future is increased.
If the reward is below the baseline, then the probability is decreased.
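A minimal sketch of one update step (the function name, the reward_fn callback, the α = 0.1 default, and the incremental computation of the baseline R̄ are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_bandit_step(H, r_baseline, t, reward_fn, alpha=0.1):
    """One step of the gradient bandit update with an average-reward baseline.

    t is the current (1-based) time step; r_baseline is the running average reward.
    """
    pi = np.exp(H - H.max())
    pi /= pi.sum()                         # soft-max probabilities pi_t(a)
    a = int(rng.choice(len(H), p=pi))      # sample A_t ~ pi_t
    r = reward_fn(a)                       # observe reward R_t
    r_baseline += (r - r_baseline) / t     # incremental average of rewards through time t
    H = H - alpha * (r - r_baseline) * pi  # H(a) -= alpha*(R - Rbar)*pi(a) for all a
    H[a] += alpha * (r - r_baseline)       # net for A_t: +alpha*(R - Rbar)*(1 - pi(A_t))
    return H, r_baseline
```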
2.9 Associative Search (Contextual Bandits)
Up to this point, we have only considered non-associative tasks; tasks in which there is no need
to associate different actions with different situations. However, in a general reinforcement
learning task there is more than one situation, and the goal is to learn a policy: a mapping from
situations to the actions that are best in those situations.
If a reinforcement learning problem involves both trial-and-error learning to search for the best actions, and association of these actions with the situations in which they are best, then it is an example of an associative search task.