
Notes on TS and Information-based method

Contents
IDS_notes(1)
hao22b(1)
Notes on TS and Information-based method
1 Notes on TS and Information-based method
1.1 Thompson sampling
1.1.1 Algorithm
1.1.2 Research on TS
1.1.3 Bayesian Regret
1.1.3.1 Soft knowledge and hard knowledge
1.1.3.2 Why Bayesian regret?
1.1.4 Contributions of information-based method
1.1.5 Why Thompson Sampling Works ?
1.1.6 Limitations of Thompson Sampling
1.2 Information-based methods and TS
1.2.1 Problems, research and improvement
1.2.1.1 1. Large/uncountable action spaces
1.2.1.2 2. Deal with contextual bandit (2022)
1.2.1.3 3. Approximate implementations
1.2.1.4 4. New algorithms based on the analysis technique: Information-directed sampling
1.2.1.5 Other topics: about Frequentist IDS
[Scanned handwritten pages: the OCR here is too garbled to reconstruct in full. The recoverable fragments concern online learning with feedback structures beyond bandit feedback (full feedback, missing self-observation, graph feedback), the loss incurred on each round, minimax regret, and three examples: an apple testing problem with weakly observed losses, a problem with a revealing action, and a policeman who observes every crime except his own.]
Notes on TS and Information-based method
We can think of the information-based method as having been born to study TS. In the subsequent analysis, algorithms that can outperform TS in certain situations were derived; these are called information-directed methods.
Thompson sampling
Algorithm
The Thompson sampling algorithm simply samples actions according to the posterior probability that they are optimal. In particular, the action $A_t$ is chosen randomly at time $t$ according to the sampling distribution $\pi_t(a) = \mathbb{P}(A^* = a \mid \mathcal{F}_{t-1})$, where $A^*$ is the optimal action and $\mathcal{F}_{t-1}$ is the history of observations. By definition, this means that for each action $a$,

$$\mathbb{P}(A_t = a \mid \mathcal{F}_{t-1}) = \mathbb{P}(A^* = a \mid \mathcal{F}_{t-1}).$$
This algorithm is sometimes called probability matching because the action selection distribution is
matched to the posterior distribution of the optimal action.
Practical implementations of Thompson sampling typically use two simple steps at each time $t$ to randomly generate an action. First, an index (parameter) $\hat\theta_t$ is sampled from the posterior distribution of the true index $\theta$. Then, the algorithm selects the action that would be optimal if the sampled parameter were actually the true parameter.
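To make these two steps concrete, here is a minimal Python sketch of Thompson sampling for a Bernoulli bandit with independent Beta(1, 1) priors. The function name, the choice of priors, and the simulation loop are illustrative assumptions for this sketch, not something prescribed by the notes.

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, horizon, rng=None):
    """Minimal Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(true_means)
    alpha = np.ones(k)  # posterior pseudo-counts of successes
    beta = np.ones(k)   # posterior pseudo-counts of failures
    regret = 0.0
    best_mean = max(true_means)
    for _ in range(horizon):
        # Step 1: sample a parameter vector from the posterior.
        theta_hat = rng.beta(alpha, beta)
        # Step 2: act as if the sampled parameter were the true parameter.
        a = int(np.argmax(theta_hat))
        reward = rng.binomial(1, true_means[a])
        # Conjugate Beta/Bernoulli posterior update.
        alpha[a] += reward
        beta[a] += 1 - reward
        regret += best_mean - true_means[a]
    return regret

# Example: 3 arms, 10,000 rounds.
print(thompson_sampling_bernoulli([0.3, 0.5, 0.7], horizon=10_000))
```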
Research on TS
Thompson sampling has the honor of being the first bandit algorithm and is named after its
inventor [Thompson, 1933], who considered the Bernoulli case with two arms. Thompson
provided no theoretical guarantees, but argued intuitively and gave hand-calculated empirical
analysis.
It would be wrong to say that Thompson sampling was entirely ignored for the next eight
decades, but it was definitely not popular until recently, when a large number of authors
independently rediscovered the article/algorithm [Graepel et al., 2010, Granmo, 2010, Ortega
and Braun, 2010, Chapelle and Li, 2011, May et al., 2012]. The surge in interest was mostly
empirical, but theoreticians followed soon with regret guarantees.
For the frequentist analysis, we followed the proofs by Agrawal and Goyal [2012, 2013a], but the
setting is slightly different. We presented results for the ‘realisable’ case where the pay-off
distributions are actually Gaussian, while Agrawal and Goyal use the same algorithm but prove
bounds for rewards bounded in [0, 1]. Agrawal and Goyal [2013a] also analyse the
Beta/Bernoulli variant of Thompson sampling, which for rewards in [0, 1] is asymptotically
optimal in the same way as KL-UCB (see Chapter 10). This result was simultaneously obtained
by Kaufmann et al. [2012b], who later showed that for appropriate priors, asymptotic
optimality also holds for single-parameter exponential families [Korda et al., 2013]. For
Gaussian bandits with unknown mean and variance, Thompson sampling is asymptotically
optimal for some priors, but not others –
even quite natural ones [Honda and Takemura, 2014].
The Bayesian analysis of Thompson sampling based on confidence intervals is due to Russo
and Van Roy [2014b]. Recently the idea has been applied to a wide range of bandit settings
[Kawale et al., 2015, Agrawal et al., 2017] and reinforcement learning [Osband et al., 2013,
Gopalan and Mannor, 2015, Leike et al., 2016, Kim, 2017]. The BayesUCB algorithm is due to
Kaufmann et al. [2012a], with improved analysis and results by Kaufmann [2018]. The
frequentist analysis of Thompson sampling for linear bandits is by Agrawal and Goyal [2013b],
with refined analysis by Abeille and Lazaric [2017a] and a spectral version by Kocák et al.
[2014]. A recent paper analyses the combinatorial semi-bandit setting [Wang and Chen, 2018].
The information-theoretic analysis is by Russo and Van Roy [2014a, 2016], while the
generalising beyond the negentropy potential is by Lattimore and Szepesvári [2019c]. As we
mentioned, these ideas have been applied to convex bandits [Bubeck et al., 2015a, Bubeck and
Eldan, 2016] and also to partial monitoring [Lattimore and Szepesvári, 2019c]. There is a
tutorial on Thompson sampling by Russo et al. [2018] that focuses mostly on applications and
computational issues. We mentioned there are other ways to configure Algorithm 24, for
example the recent article by Kveton et al. [2019].
Bayesian Regret
Soft knowledge and hard knowledge
An online optimization algorithm typically starts with two forms of prior knowledge.
The first, hard knowledge, posits that the mapping from action to outcome distribution lies within a particular family of mappings. For example, with hard knowledge we can suppose that the reward of arm $a$ obeys a normal distribution $N(\mu_a, \sigma_a^2)$, in which $\sigma_a^2$ is known for each $a$.
The second, soft knowledge, concerns which of these mappings are more or less likely to match
reality. Soft knowledge evolves with observations and is typically represented in terms of a probability
distribution or a confidence set.
With soft knowledge we can simply suppose that the unknown mean $\mu_a$ obeys a prior distribution, or we may only have some prior knowledge without knowing a specific distribution. "Distributions are not restricted to Gaussian, and more complex information structures are allowed."
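As a small illustration of this split (not from the notes), the sketch below encodes the hard knowledge as a Gaussian reward model with known noise variance and the soft knowledge as a Gaussian prior on each arm's mean; the conjugate update keeps the posterior Gaussian, which is what a Thompson sampling implementation would then sample from. The class name and default values are assumptions for the example.

```python
import numpy as np

class GaussianArmBelief:
    """Posterior over one arm's mean reward, N(mu, var), with known noise variance."""

    def __init__(self, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
        self.mu = prior_mean        # soft knowledge: prior belief about the mean
        self.var = prior_var
        self.noise_var = noise_var  # hard knowledge: reward noise is N(0, noise_var)

    def update(self, reward):
        # Conjugate normal-normal update: precisions add, mean is precision-weighted.
        precision = 1.0 / self.var + 1.0 / self.noise_var
        self.mu = (self.mu / self.var + reward / self.noise_var) / precision
        self.var = 1.0 / precision

    def sample(self, rng):
        # Used by Thompson sampling: draw a plausible mean from the posterior.
        return rng.normal(self.mu, np.sqrt(self.var))
```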
Why Bayesian regret?
In (Russo and Van Roy, 2014/2016), the first paper using the information-theoretic method, the authors explain this choice as quoted below. Prior to their study, frequentist regret bounds for Thompson sampling were attained only for specific, fixed priors.
One of the first theoretical guarantees for Thompson sampling was provided by May et al.
(2012), but they showed only that the algorithm converges asymptotically to optimality. Agrawal
and Goyal (2012); Kaufmann et al. (2012); Agrawal and Goyal (2013a) and Korda et al. (2013)
studied the classical multi-armed bandit problem, where sampling one action provides no
information about other actions. They provided frequentist regret bounds for Thompson
sampling that are asymptotically optimal in the sense defined by Lai and Robbins (1985). To
attain these bounds, the authors fixed a specific uninformative prior distribution, and
studied the algorithm’s performance assuming this prior is used.
Our interest in Thompson sampling is motivated by its ability to incorporate rich forms of
prior knowledge about the actions and the relationship among them. Accordingly, we study the
algorithm in a very general framework, allowing for an arbitrary prior distribution over the
true outcome distributions. To accommodate this level of generality while still
focusing on finite-time performance, we study the algorithm’s expected regret under the prior
distribution. This measure is sometimes called Bayes risk or Bayesian regret.
They found TS had ability to incorporate rich forms of prior knowledge about the actions and the
relationship among them, so they wanted to study the algorithm in a very general framework.
When studying such a general regret bound, it may be hard to find a frequentist regret analysis that works for all settings, because our knowledge about the environment is limited. For example, we may only know the entropy of the unknown parameter, not its distribution. The Bayesian view can incorporate rich forms of prior knowledge: we do not care whether the rewards are normally distributed or binomial.
Up to that time, only one other article had examined regret bounds that depend on soft knowledge.
An important aspect of our regret bound is its dependence on soft knowledge through the
entropy of the optimal-action distribution. One of the only other regret bounds that depends
on soft knowledge was provided very recently by Li (2013). Inspired by a connection between
Thompson sampling and exponential weighting schemes, that paper introduced a family of
Thompson sampling like algorithms and studied their application to contextual bandit problems.
While our analysis does not currently treat contextual bandit problems, we improve upon their
regret bound in several other respects.
First, their bound depends on the entropy of the prior distribution of mean rewards, which is
never smaller, and can be much larger, than the entropy of the distribution of the optimal
action.
In addition, their bound has a worse dependence on the problem’s time horizon, and, in order to guarantee each action is explored sufficiently often, requires that actions are frequently selected uniformly at random. In contrast, our focus is on settings where the number of actions is large and the goal is to learn without sampling each one.
Contributions of information-based method
Provided a new analysis of Thompson sampling based on tools from information theory.
Inherits the simplicity and elegance enjoyed by work in that field.
Apply to a much broader range of information structures than those studied in prior work on
Thompson sampling.
Using soft knowledge: our analysis leads to regret bounds that highlight the benefits of soft
knowledge, quantified in terms of the entropy of the optimal-action distribution. Such
regret bounds yield insight into how future performance depends on past observations. This is
key to assessing the benefits of exploration, and as such, to the design of more effective
schemes that trade off between exploration and
exploitation.
Under different problems' information structures, the information ratio can be bounded by d/2, 1/2 and d/(2m). This reflects the impact of each problem’s information structure on the regret-per-bit of information acquired by TS about the optimum (the corresponding generic bound is written out at the end of this section).
Subsequent work has established bounds on the information ratio for problems with convex
reward functions (Bubeck and Eldan, 2016) and
for problems with graph structured feedback (Liu et al., 2017).
In this way, it is easier to understand the influence of different information structures on
regret.
In forthcoming work, we leverage this insight to produce an algorithm that outperforms Thompson sampling: information-directed sampling!
While our focus has been on providing theoretical guarantees for Thompson sampling, we
believe the techniques and quantities used in the analysis may be of broader interest. Our
formulation and notation may be complex, but the proofs themselves essentially follow from
combining known relations in information theory with the tower property of conditional
expectation, Jensen’s inequality, and the Cauchy-Schwarz inequality. In addition, the information
theoretic view taken in this paper may provide a fresh perspective on this class of problems.
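For reference, the central quantities behind these contributions can be written out explicitly. This is a summary sketch using Russo and Van Roy's notation ($\Gamma_t$, $A^*$, $H$, $I_t$), which is assumed here rather than defined in these notes. The information ratio of the action-sampling distribution at time $t$ is

$$\Gamma_t = \frac{\big(\mathbb{E}_t[R_{t,A^*} - R_{t,A_t}]\big)^2}{I_t\big(A^*;\,(A_t, Y_{t,A_t})\big)},$$

the squared one-step expected regret divided by the information gained about the optimal action $A^*$. If $\Gamma_t \le \bar\Gamma$ for all $t$, then the Bayesian regret of Thompson sampling over horizon $T$ satisfies

$$\mathbb{E}[\mathrm{Regret}(T)] \le \sqrt{\bar\Gamma\, H(A^*)\, T},$$

and the values quoted above ($|\mathcal{A}|/2$ for classical bandits, $1/2$ for full information, $d/2$ for linear bandits, $d/(2m)$ for combinatorial semi-bandits) are bounds on $\bar\Gamma$ for the corresponding information structures.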
Why Thompson Sampling Works ?
Reference: https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
To understand whether TS is well suited to a particular application, it is useful to develop a high level
understanding of why it works. As information is gathered, beliefs about action rewards are carefully
tracked. By sampling actions according to the posterior probability that they are optimal, the
algorithm continues to sample all actions that could plausibly be optimal, while shifting sampling
away from those that are unlikely to be optimal. Roughly speaking, the algorithm tries all promising
actions while gradually discarding those that are believed to underperform. This intuition is
formalized in recent theoretical analyses of Thompson sampling, which we now review.
Regret Analysis for Classical Bandit Problems
Asymptotic Instance-Dependent Regret Bounds.
Instance-Independent Regret Bounds.
Regret Analysis for Complex Online Decision Problems
This tutorial has covered the use of TS to address an array of complex online decision problems. In
each case, we first modeled the problem at hand, carefully encoding prior knowledge. We then
applied TS, trusting it could leverage this structure to accelerate learning. The results described in the
previous subsection are deep and interesting, but do not justify using TS in this manner.
We will now describe alternative theoretical analyses of TS that apply very broadly. These analyses
point to TS’s ability to exploit problem structure and prior knowledge, but also to settings where TS
performs poorly.
Regret Bounds via UCB
Regret Bounds via Information Theory
Limitations of Thompson Sampling
Problems that do not Require Exploration
Problems that do not Require Exploitation
Time Sensitivity
Problems Requiring Careful Assessment of Information Gain
TS is well suited to problems where the best way to learn which action is optimal is to test the most
promising actions. However, there are natural
problems where such a strategy is far from optimal, and efficient learning requires a more careful
assessment of the information actions provide.
An example from (Russo and Van Roy, 2018a) highlights this point. The shortcoming of TS in that example can be interpreted through the lens of the information ratio: for that problem, the information ratio when actions are sampled by TS is far from the minimum possible, reflecting that it is possible to acquire information at a much lower cost per bit. Other examples, also from (Russo and Van Roy, 2018a), illustrate a broader range of problems for which TS suffers in this manner.
So we can see what the information-theoretic tools can do: in some settings, we can use information to show that TS is suboptimal, and we can use the information ratio to develop new algorithms that outperform TS.
Information-based methods and TS
Problems, research and improvement
The information-theoretic method was first used to analyze the Bayesian regret of Thompson sampling, as discussed above.
1. Large/uncountable action spaces
A Rate-Distortion Analysis of Thompson Sampling:
Following the above line of analysis, we can bound the regret of Thompson sampling by the mutual information between a compressed statistic of the environment and the environment itself. When this statistic can be chosen to be far less informative than the environment, we obtain a significantly tighter bound.
Application: Linear Bandits, Generalized Linear Bandits with iid Noise, Logistic Bandits.
This article deals with the logistic bandit, but the result relies on a conjecture that is only computationally verifiable; the proof is not completed.
2. Deal with contextual bandit (2022)
Lifted information ratio!
Relationship between the "decoupling coefficient" and the "lifted information ratio": the decoupling coefficient matches the definition of the lifted information ratio, up to the difference of replacing the mutual information by the root mean-squared error in predicting the true parameter. Notably, this definition essentially coincides with the lifted information ratio in the special case of Gaussian losses.
We have also managed to show some new results that advance the state of the art in the well-studied problem of logistic bandits. We believe that these results are very encouraging and that our newly proposed formalism may find many more applications in the future.
3. Approximate implementations
The information-based method helps to analyze the regret of approximate TS algorithms.
Ensemble Sampling (2017)
This is only an approximate TS algorithm, and its original theoretical analysis was insufficient.
An Analysis of Ensemble Sampling (NeurIPS 2022)
In this regret analysis, information-based methods are used frequently, although the authors do not use the concept of the information ratio.
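A rough Python sketch of the ensemble sampling idea for a linear bandit is given below, loosely following the usual construction: each ensemble member maintains a regularized least-squares estimate fit to rewards perturbed by its own Gaussian noise, and each round acts greedily for one member drawn uniformly at random (standing in for a posterior sample). The class name, hyperparameter defaults, and perturbation details are assumptions for illustration, not taken verbatim from the papers above.

```python
import numpy as np

class EnsembleSamplingLinear:
    """Sketch of ensemble sampling for a linear bandit: M perturbed least-squares
    estimates approximate posterior samples; each round acts greedily for one of them."""

    def __init__(self, dim, n_models=10, noise_std=1.0, prior_std=1.0, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        self.noise_std = noise_std
        self.M = n_models
        # Each member m keeps a regularized least-squares system A_m theta = b_m,
        # anchored at its own prior draw theta_tilde_m ~ N(0, prior_std^2 I).
        self.A = np.stack([np.eye(dim) / prior_std**2 for _ in range(n_models)])
        theta_tilde = prior_std * self.rng.standard_normal((n_models, dim))
        self.b = theta_tilde / prior_std**2

    def act(self, action_features):
        m = self.rng.integers(self.M)                     # sample one member, like a posterior draw
        theta_m = np.linalg.solve(self.A[m], self.b[m])   # its current point estimate
        return int(np.argmax(action_features @ theta_m))  # greedy action for that member

    def update(self, x, reward):
        for m in range(self.M):
            # Each member fits the reward corrupted by its own Gaussian perturbation,
            # which keeps the ensemble spread out like an approximate posterior.
            perturbed = reward + self.noise_std * self.rng.standard_normal()
            self.A[m] += np.outer(x, x) / self.noise_std**2
            self.b[m] += x * perturbed / self.noise_std**2
```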
4. New algorithms based on the analysis technique: Information-directed sampling!
It can be demonstrated through simple analytic examples that UCB and TS can perform very poorly when faced with more complex information structures. These shortcomings stem from the fact that they do not adequately account for particular kinds of information.
Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient
reinforcement learning algorithm (Lu et al., 2021).
D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. Operations Research, 66(1):230–252, 2018 (first version 2014).
Each action is sampled in a manner that minimizes the ratio between the squared expected single-period regret and a measure of information gain.
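To make this concrete, here is a small Python sketch of the IDS optimization for a finite action set, given estimated per-action expected regrets and information gains (assumed to be supplied by whatever posterior approximation is in use). It relies on the fact, shown by Russo and Van Roy, that the minimizing distribution can be supported on at most two actions, so a search over action pairs and mixing weights suffices; the grid resolution and example numbers are assumptions.

```python
import numpy as np

def ids_distribution(expected_regret, info_gain):
    """Return an action distribution minimizing (expected regret)^2 / information gain.

    The optimum is attained by a distribution supported on at most two actions,
    so we search over all pairs and, for each pair, over the mixing weight.
    """
    expected_regret = np.asarray(expected_regret, dtype=float)
    info_gain = np.asarray(info_gain, dtype=float)
    k = len(expected_regret)
    best_ratio, best_dist = np.inf, None
    weights = np.linspace(0.0, 1.0, 1001)  # grid over the mixing weight q
    for i in range(k):
        for j in range(k):
            reg = weights * expected_regret[i] + (1 - weights) * expected_regret[j]
            gain = weights * info_gain[i] + (1 - weights) * info_gain[j]
            with np.errstate(divide="ignore", invalid="ignore"):
                ratio = np.where(gain > 0, reg**2 / gain, np.inf)
            idx = int(np.argmin(ratio))
            if ratio[idx] < best_ratio:
                best_ratio = ratio[idx]
                best_dist = np.zeros(k)
                best_dist[i] += weights[idx]
                best_dist[j] += 1 - weights[idx]
    return best_dist, best_ratio

# Example: IDS mixes the low-regret action (index 0) with the highly informative action (index 2).
dist, ratio = ids_distribution(expected_regret=[0.05, 0.10, 0.12],
                               info_gain=[0.01, 0.02, 0.20])
print(dist, ratio)
```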
We benchmark the performance of IDS through simulations of the widely studied Bernoulli,
Gaussian, and linear bandit problems, for which UCB algorithms and Thompson sampling are
known to be very effective. We find that even in these settings, IDS outperforms UCB
algorithms and Thompson sampling. This is particularly surprising for Bernoulli bandit
problems, where UCB algorithms and Thompson sampling are known to be asymptotically
optimal in the sense proposed by Lai and Robbins [49].
Drawback: IDS is computationally demanding; developing a computationally efficient version of IDS may require innovation.
It is worth noting that the problem formulation we work with, which is presented in Section 3, is
very general, encompassing not only problems with bandit feedback, but also a broad array of
information structures for which observations can offer information about rewards of
arbitrary subsets of actions or factors that influence these rewards. Because IDS and our
analysis accommodate this level of generality, they can be specialized to problems that in the
past have been studied individually.
J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic
noise. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 358–384. PMLR, 2018.
In this work, we consider bandits with heteroscedastic noise, where we explicitly allow the noise
distribution to depend on the evaluation point. We show that this leads to new trade-offs for
information and regret, which are not taken into account by existing approaches like upper
confidence bound algorithms (UCB) or Thompson Sampling.
Frequentist version of Information Directed Sampling (IDS)
Minimize the regret-information ratio over all possible action sampling distributions
Empirically, we demonstrate in a linear setting with heteroscedastic noise, that some of our
methods can outperform UCB and Thompson Sampling, while staying competitive when the
noise is homoscedastic.
Many "computationally-efficient " algorithms have been proposed for different types of IDS, but there
is not enough theory analysis. Ensemble Sampling, as a computationally-efficient method for TS, have
been proved to have low regret. But no similar work exists for IDS method.
Here we can see some drawbacks of IDS in reinforcement learning problems. An exact method can be analyzed but is not tractable. An approximate method has difficulty guaranteeing regret bounds, although it may perform as well as, and often better than, the exact method in some simple experiments.
Other topics: about Frequentist IDS
Bayesian IDS: covers a large class of information structures.
Frequentist IDS: worst-case regret guarantees, but limited to more restricted settings.
https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/494381/thesis.pdf?sequence=1&isAllowed=y
We close with a list of exciting directions for the future. Naturally, our focus is on open questions
within the IDS framework, but more generally, the exploration-exploitation trade-off in models
with structured feedback is not yet fully understood.
9.2.1 First-Principles Derivation
9.2.2 Asymptotic and Instance-Dependent Regret
9.2.3 Partial Monitoring
9.2.4 Other Information Trade-Offs
9.2.5 Reinforcement Learning