Contents: IDS_notes(1), hao22b(1), Notes on TS and Information-based method

1 Notes on TS and Information-based method
1.1 Thompson sampling
1.1.1 Algorithm
1.1.2 Research on TS
1.1.3 Bayesian Regret
1.1.3.1 Soft knowledge and hard knowledge
1.1.3.2 Why Bayesian regret?
1.1.4 Contributions of information-based method
1.1.5 Why Thompson Sampling Works?
1.1.6 Limitations of Thompson Sampling
1.2 Information-based methods and TS
1.2.1 Problems, research and improvement
1.2.1.1 1. Large/uncountable action spaces
1.2.1.2 2. Deal with contextual bandit (2022)
1.2.1.3 3. Approximate implementations
1.2.1.4 4. New algorithms based on the analysis technique ----- Information-directed sampling!

[Scanned handwritten note, only partially legible: examples of feedback structures beyond bandit feedback; full feedback versus bandit feedback versus graph feedback, where playing an action always reveals the losses incurred at its neighbours; an apple-testing example with only weak observations; a revealing-action example with a unique loss-revealing action; and an example where everything is observed except one's own loss. The note also mentions the impact of the feedback structure on minimax regret.]
1.2.1.5 Other topics: about Frequentist IDS

Notes on TS and Information-based method

We can think of the information-based method as having been born to study TS. In the subsequent analysis, algorithms that can outperform TS in certain situations were generated; these are called information-directed methods.

Thompson sampling

Algorithm

The Thompson sampling algorithm simply samples actions according to the posterior probability that they are optimal. In particular, the action $A_t$ is chosen randomly at time $t$ according to the sampling distribution $\pi_t(a) = \mathbb{P}(A^* = a \mid \mathcal{F}_{t-1})$, where $A^*$ denotes the optimal action and $\mathcal{F}_{t-1}$ the history of observations. By definition, this means that for each $a$,
$$\mathbb{P}(A_t = a \mid \mathcal{F}_{t-1}) = \mathbb{P}(A^* = a \mid \mathcal{F}_{t-1}).$$
This algorithm is sometimes called probability matching because the action-selection distribution is matched to the posterior distribution of the optimal action.

Practical implementations of Thompson sampling typically use two simple steps at each time $t$ to randomly generate an action. First, a parameter $\hat{\theta}_t$ is sampled from the posterior distribution of the true parameter $\theta$. Then, the algorithm selects the action that would be optimal if the sampled parameter were actually the true parameter.

Research on TS

Thompson sampling has the honor of being the first bandit algorithm and is named after its inventor [Thompson, 1933], who considered the Bernoulli case with two arms.
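The two-step procedure above can be sketched for the Beta/Bernoulli case. This is a minimal illustration, not an implementation from the notes; the environment (`true_means`), the uniform Beta(1,1) prior, and the regret bookkeeping are all assumptions for the example.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Beta/Bernoulli Thompson sampling: sample a parameter from the
    posterior, then act greedily with respect to the sample."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)  # Beta(1, 1) uniform prior on each arm's mean
    beta = np.ones(k)
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        theta_hat = rng.beta(alpha, beta)   # step 1: sample from the posterior
        a = int(np.argmax(theta_hat))       # step 2: act greedily w.r.t. the sample
        reward = float(rng.random() < true_means[a])
        alpha[a] += reward                  # conjugate Beta/Bernoulli update
        beta[a] += 1.0 - reward
        regret += best - true_means[a]
    return regret
```

As the posterior concentrates on the best arm, the probability-matching distribution concentrates too, so suboptimal arms are pulled progressively less often.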
Thompson provided no theoretical guarantees, but argued intuitively and gave hand-calculated empirical analysis. It would be wrong to say that Thompson sampling was entirely ignored for the next eight decades, but it was definitely not popular until recently, when a large number of authors independently rediscovered the article/algorithm [Graepel et al., 2010, Granmo, 2010, Ortega and Braun, 2010, Chapelle and Li, 2011, May et al., 2012]. The surge in interest was mostly empirical, but theoreticians followed soon with regret guarantees.

For the frequentist analysis, we followed the proofs by Agrawal and Goyal [2012, 2013a], but the setting is slightly different. We presented results for the 'realisable' case where the pay-off distributions are actually Gaussian, while Agrawal and Goyal use the same algorithm but prove bounds for rewards bounded in [0, 1]. Agrawal and Goyal [2013a] also analyse the Beta/Bernoulli variant of Thompson sampling, which for rewards in [0, 1] is asymptotically optimal in the same way as KL-UCB (see Chapter 10). This result was simultaneously obtained by Kaufmann et al. [2012b], who later showed that for appropriate priors, asymptotic optimality also holds for single-parameter exponential families [Korda et al., 2013]. For Gaussian bandits with unknown mean and variance, Thompson sampling is asymptotically optimal for some priors, but not others, even quite natural ones [Honda and Takemura, 2014].

The Bayesian analysis of Thompson sampling based on confidence intervals is due to Russo and Van Roy [2014b]. Recently the idea has been applied to a wide range of bandit settings [Kawale et al., 2015, Agrawal et al., 2017] and reinforcement learning [Osband et al., 2013, Gopalan and Mannor, 2015, Leike et al., 2016, Kim, 2017]. The BayesUCB algorithm is due to Kaufmann et al. [2012a], with improved analysis and results by Kaufmann [2018].
The frequentist analysis of Thompson sampling for linear bandits is by Agrawal and Goyal [2013b], with refined analysis by Abeille and Lazaric [2017a] and a spectral version by Kocák et al. [2014]. A recent paper analyses the combinatorial semi-bandit setting [Wang and Chen, 2018]. The information-theoretic analysis is by Russo and Van Roy [2014a, 2016], while the generalisation beyond the negentropy potential is by Lattimore and Szepesvári [2019c]. As we mentioned, these ideas have been applied to convex bandits [Bubeck et al., 2015a, Bubeck and Eldan, 2016] and also to partial monitoring [Lattimore and Szepesvári, 2019c]. There is a tutorial on Thompson sampling by Russo et al. [2018] that focuses mostly on applications and computational issues. We mentioned there are other ways to configure Algorithm 24, for example the recent article by Kveton et al. [2019].

Bayesian Regret

Soft knowledge and hard knowledge

An online optimization algorithm typically starts with two forms of prior knowledge. The first, hard knowledge, posits that the mapping from action to outcome distribution lies within a particular family of mappings. For example, with hard knowledge we can suppose the reward of arm $a$ obeys a normal distribution $\mathcal{N}(\theta_a, \sigma_a^2)$, in which $\sigma_a^2$ is known for each $a$. The second, soft knowledge, concerns which of these mappings are more or less likely to match reality. Soft knowledge evolves with observations and is typically represented in terms of a probability distribution or a confidence set. With soft knowledge we can simply suppose $\theta$ obeys a prior distribution, or we may have some prior knowledge without committing to a specific distribution: "distributions are not restricted to Gaussian and more complex information structures are allowed."

Why Bayesian regret?

In (Russo and Van Roy, 2014/2016), the first paper using the information-theoretic method, the authors said the following. Prior to their study, frequentist regret bounds for Thompson sampling had been attained only for fixed priors.
One of the first theoretical guarantees for Thompson sampling was provided by May et al. (2012), but they showed only that the algorithm converges asymptotically to optimality. Agrawal and Goyal (2012), Kaufmann et al. (2012), Agrawal and Goyal (2013a) and Korda et al. (2013) studied the classical multi-armed bandit problem, where sampling one action provides no information about other actions. They provided frequentist regret bounds for Thompson sampling that are asymptotically optimal in the sense defined by Lai and Robbins (1985). To attain these bounds, the authors fixed a specific uninformative prior distribution, and studied the algorithm's performance assuming this prior is used. Our interest in Thompson sampling is motivated by its ability to incorporate rich forms of prior knowledge about the actions and the relationship among them. Accordingly, we study the algorithm in a very general framework, allowing for an arbitrary prior distribution over the true outcome distributions. To accommodate this level of generality while still focusing on finite-time performance, we study the algorithm's expected regret under the prior distribution. This measure is sometimes called Bayes risk or Bayesian regret.

They found that TS has the ability to incorporate rich forms of prior knowledge about the actions and the relationship among them, so they wanted to study the algorithm in a very general framework. For such a general setting it may be hard to find a frequentist regret bound that works everywhere, because our knowledge about the environment is limited: for example, we may know only the entropy of the unknown parameter, not its distribution. The Bayesian view can incorporate rich forms of prior knowledge; we don't care whether the rewards are normally distributed or binomial. Until that time, only one other article had examined regret bounds that depend on soft knowledge.
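The quoted distinction can be made concrete. The notation below is assumed (the scanned formulas were lost): $\pi$ is the algorithm, $f_\theta(a)$ the mean reward of action $a$ under environment $\theta$, and $A^\ast$ the optimal action. Frequentist regret fixes the environment, while Bayesian regret additionally averages over the prior:

```latex
\mathrm{Regret}(T, \pi, \theta)
  \;=\; \mathbb{E}\left[\,\sum_{t=1}^{T} \bigl(f_\theta(A^\ast) - f_\theta(A_t)\bigr) \,\middle|\, \theta\,\right],
\qquad
\mathrm{BayesRegret}(T, \pi)
  \;=\; \mathbb{E}_{\theta \sim \mathrm{prior}}\bigl[\mathrm{Regret}(T, \pi, \theta)\bigr].
```

A bound on the Bayesian regret holds on average over environments drawn from the prior, which is what lets the analysis get by with coarse soft knowledge such as the entropy of $A^\ast$ rather than a specific reward family.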
An important aspect of our regret bound is its dependence on soft knowledge through the entropy of the optimal-action distribution. One of the only other regret bounds that depends on soft knowledge was provided very recently by Li (2013). Inspired by a connection between Thompson sampling and exponential weighting schemes, that paper introduced a family of Thompson-sampling-like algorithms and studied their application to contextual bandit problems. While our analysis does not currently treat contextual bandit problems, we improve upon their regret bound in several other respects. First, their bound depends on the entropy of the prior distribution of mean rewards, which is never smaller, and can be much larger, than the entropy of the distribution of the optimal action. In addition, their bound has a worse-order dependence on the problem's time horizon, and, in order to guarantee each action is explored sufficiently often, requires that actions are frequently selected uniformly at random. In contrast, our focus is on settings where the number of actions is large and the goal is to learn without sampling each one.

Contributions of information-based method

Provided a new analysis of Thompson sampling based on tools from information theory. It inherits the simplicity and elegance enjoyed by work in that field, and applies to a much broader range of information structures than those studied in prior work on Thompson sampling.

Using soft knowledge: the analysis leads to regret bounds that highlight the benefits of soft knowledge, quantified in terms of the entropy of the optimal-action distribution. Such regret bounds yield insight into how future performance depends on past observations. This is key to assessing the benefits of exploration, and as such, to the design of more effective schemes that trade off between exploration and exploitation.

Under different problems' information structures, the information ratio can be bounded by d/2 (linear bandits), 1/2 (full information) and d/(2m) (combinatorial semi-bandits).
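These constants enter the regret through a general bound of Russo and Van Roy: a uniform bound on the information ratio converts directly into a Bayesian regret bound scaled by the entropy of the optimal action. The notation is assumed here (the scanned formulas were lost): $\mathbb{E}_t$ and $I_t$ denote expectation and mutual information conditioned on the history, $\Delta_t = f_\theta(A^\ast) - f_\theta(A_t)$ the single-period regret, and $Y_t$ the observation.

```latex
\Gamma_t \;=\; \frac{\bigl(\mathbb{E}_t[\Delta_t]\bigr)^2}{I_t\bigl(A^\ast;\,(A_t, Y_t)\bigr)},
\qquad
\Gamma_t \le \overline{\Gamma}\ \ \text{for all } t
\;\Longrightarrow\;
\mathbb{E}\bigl[\mathrm{Regret}(T)\bigr] \;\le\; \sqrt{\overline{\Gamma}\, H(A^\ast)\, T}.
```

For example, the $d/2$ bound for linear bandits yields Bayesian regret of order $\sqrt{(d/2)\, H(A^\ast)\, T}$.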
This reflects the impact of each problem's information structure on the regret per bit of information acquired by TS about the optimum. Subsequent work has established bounds on the information ratio for problems with convex reward functions (Bubeck and Eldan, 2016) and for problems with graph-structured feedback (Liu et al., 2017). In this way, it is easier to understand the influence of different information structures on regret.

In forthcoming work, we leverage this insight to produce an algorithm that outperforms Thompson sampling: information-directed sampling!

While our focus has been on providing theoretical guarantees for Thompson sampling, we believe the techniques and quantities used in the analysis may be of broader interest. Our formulation and notation may be complex, but the proofs themselves essentially follow from combining known relations in information theory with the tower property of conditional expectation, Jensen's inequality, and the Cauchy-Schwarz inequality. In addition, the information-theoretic view taken in this paper may provide a fresh perspective on this class of problems.

Why Thompson Sampling Works?

Reference: https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf

To understand whether TS is well suited to a particular application, it is useful to develop a high-level understanding of why it works. As information is gathered, beliefs about action rewards are carefully tracked. By sampling actions according to the posterior probability that they are optimal, the algorithm continues to sample all actions that could plausibly be optimal, while shifting sampling away from those that are unlikely to be optimal. Roughly speaking, the algorithm tries all promising actions while gradually discarding those that are believed to underperform. This intuition is formalized in recent theoretical analyses of Thompson sampling, which we now review.

Regret Analysis for Classical Bandit Problems

Asymptotic Instance-Dependent Regret Bounds.
Instance-Independent Regret Bounds.

Regret Analysis for Complex Online Decision Problems

This tutorial has covered the use of TS to address an array of complex online decision problems. In each case, we first modeled the problem at hand, carefully encoding prior knowledge. We then applied TS, trusting it could leverage this structure to accelerate learning. The results described in the previous subsection are deep and interesting, but do not justify using TS in this manner. We will now describe alternative theoretical analyses of TS that apply very broadly. These analyses point to TS's ability to exploit problem structure and prior knowledge, but also to settings where TS performs poorly.

Regret Bounds via UCB

Regret Bounds via Information Theory

Limitations of Thompson Sampling

Problems that do not Require Exploration. Problems that do not Require Exploitation. Time Sensitivity.

Problems Requiring Careful Assessment of Information Gain

TS is well suited to problems where the best way to learn which action is optimal is to test the most promising actions. However, there are natural problems where such a strategy is far from optimal, and efficient learning requires a more careful assessment of the information actions provide. An example from (Russo and Van Roy, 2018a) highlights this point. The shortcoming of TS in that example can be interpreted through the lens of the information ratio: when actions are sampled by TS, the information ratio is far from the minimum possible, reflecting that it is possible to acquire information at a much lower cost per bit. Other examples, also from (Russo and Van Roy, 2018a), illustrate a broader range of problems for which TS suffers in this manner.

So we can see what the information-theoretic view can do: in some settings it shows that TS is suboptimal, and the information ratio can be used to develop new algorithms that outperform TS.
Information-based methods and TS

Problems, research and improvement

The information-theoretic method was first used to analyze the Bayesian regret of Thompson sampling, as discussed above.

1. Large/uncountable action spaces

A Rate-Distortion Analysis of Thompson Sampling: following the above line of analysis, the regret of Thompson sampling can be bounded by the mutual information between the environment and a compressed learning-target statistic (the scanned formula naming the statistic was lost). When this statistic can be chosen to be far less informative than the environment itself, we obtain a significantly tighter bound. Applications: linear bandits, generalized linear bandits with i.i.d. noise, logistic bandits. For the logistic bandit, however, the argument rests on a conjecture that is only verified computationally; the proof is not complete.

2. Deal with contextual bandit (2022)

Lifted information ratio! On the relationship between the "decoupling coefficient" and the "lifted information ratio": the decoupling coefficient matches the definition of the lifted information ratio, up to replacing the mutual information by the root mean-squared error in predicting the true parameter. Notably, the two definitions essentially coincide in the special case of Gaussian losses. The authors also show new results that advance the state of the art in the well-studied problem of logistic bandits, and believe the newly proposed formalism may find many more applications in the future.

3. Approximate implementations

The information-based method helps to analyze the regret of approximate TS algorithms.

Ensemble Sampling (2017): this is only an approximate TS algorithm, and at the time its theoretical analysis was limited.

An Analysis of Ensemble Sampling (NeurIPS 2022): information-based methods are used frequently in this regret analysis, although the concept of the information ratio is not.

4. New algorithms based on the analysis technique ----- Information-directed sampling!
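IDS selects, at each round, the action distribution that minimizes the ratio between squared expected single-period regret and expected information gain. A minimal sketch of that per-round optimization is below; the per-action regret estimates `delta` and information gains `gain` are assumed inputs (computing them from a posterior is not shown). Russo and Van Roy show the minimizing distribution is supported on at most two actions, so a search over action pairs suffices.

```python
import numpy as np

def ids_distribution(delta, gain, grid=1000):
    """Information-directed sampling step: find the action distribution pi
    minimizing (expected regret)^2 / (expected information gain).

    delta[a]: estimated expected single-period regret of action a.
    gain[a]:  estimated information gain about the optimum from playing a.
    The minimizer has support on at most two actions, so we search over
    pairs (i, j) and mixing weights q on a grid.
    """
    delta = np.asarray(delta, dtype=float)
    gain = np.asarray(gain, dtype=float)
    k = len(delta)
    qs = np.linspace(0.0, 1.0, grid)
    best_ratio, best_pi = np.inf, None
    for i in range(k):
        for j in range(k):
            d = qs * delta[i] + (1 - qs) * delta[j]  # expected regret of the mix
            g = qs * gain[i] + (1 - qs) * gain[j]    # expected info gain of the mix
            with np.errstate(divide="ignore", invalid="ignore"):
                ratio = np.where(g > 0, d**2 / g, np.inf)
            m = int(np.argmin(ratio))
            if ratio[m] < best_ratio:
                best_ratio = ratio[m]
                pi = np.zeros(k)
                pi[i] += qs[m]
                pi[j] += 1 - qs[m]
                best_pi = pi
    return best_pi, best_ratio
```

The mixing is the whole point: with a low-regret but uninformative action and a costly but informative one (say `delta=[1.0, 0.1]`, `gain=[1.0, 0.0]`), IDS plays a mixture, achieving a smaller ratio than any deterministic choice, which is exactly the behavior probability matching cannot express.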
Simple analytic examples demonstrate that UCB and TS can perform very poorly when faced with more complex information structures. The shortcomings stem from the fact that they do not adequately account for particular kinds of information. Information-directed sampling (IDS) has also recently demonstrated its potential as a data-efficient reinforcement learning algorithm (Lu et al., 2021).

D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. Oper. Res., 66(1):230–252, 2018 (first version 2014).

Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and this measure of information gain. We benchmark the performance of IDS through simulations of the widely studied Bernoulli, Gaussian, and linear bandit problems, for which UCB algorithms and Thompson sampling are known to be very effective. We find that even in these settings, IDS outperforms UCB algorithms and Thompson sampling. This is particularly surprising for Bernoulli bandit problems, where UCB algorithms and Thompson sampling are known to be asymptotically optimal in the sense proposed by Lai and Robbins [49].

Drawback: computationally demanding; developing a computationally efficient version of IDS may require innovation.

It is worth noting that the problem formulation we work with, which is presented in Section 3, is very general, encompassing not only problems with bandit feedback, but also a broad array of information structures for which observations can offer information about rewards of arbitrary subsets of actions or factors that influence these rewards. Because IDS and our analysis accommodate this level of generality, they can be specialized to problems that in the past have been studied individually.

J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 358–384. PMLR, 2018.
In this work, we consider bandits with heteroscedastic noise, where we explicitly allow the noise distribution to depend on the evaluation point. We show that this leads to new trade-offs for information and regret, which are not taken into account by existing approaches like upper confidence bound algorithms (UCB) or Thompson Sampling. We propose a frequentist version of Information Directed Sampling (IDS): minimize the regret-information ratio over all possible action sampling distributions. Empirically, we demonstrate in a linear setting with heteroscedastic noise that some of our methods can outperform UCB and Thompson Sampling, while staying competitive when the noise is homoscedastic.

Many "computationally efficient" algorithms have been proposed for different types of IDS, but there is not enough theoretical analysis of them. Ensemble Sampling, as a computationally efficient method for TS, has been proven to have low regret, but no similar work exists for the IDS method. Here we can see some drawbacks of IDS in reinforcement learning problems: an exact method can be analyzed, but is not tractable; an approximate method is hard to equip with regret guarantees, although it may perform as well as, and often better than, the exact method in some simple experiments.

Other topics: about Frequentist IDS

Bayesian IDS: covers a large class of information structures.
Frequentist IDS: worst-case regret, limited to certain settings.

https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/494381/thesis.pdf?sequence=1&isAllowed=y

We close with a list of exciting directions for the future. Naturally, our focus is on open questions within the IDS framework, but more generally, the exploration-exploitation trade-off in models with structured feedback is not yet fully understood.

9.2.1 First-Principles Derivation
9.2.2 Asymptotic and Instance-Dependent Regret
9.2.3 Partial Monitoring
9.2.4 Other Information Trade-Offs
9.2.5 Reinforcement Learning