Learning to interact

Presenting work by various authors, as well as own work in collaboration with colleagues at Microsoft and the University of Amsterdam.
Example tasks:
- News recommendation: find the best news articles based on user context; optimize click-through rate.
- Advertisement: tune ad display parameters (e.g., the mainline reserve, which controls ad placement above the organic results) to optimize revenue.
- Query auto-completion (QAC): improve the ranking of suggestions to optimize suggestion usage.
Typical approach: lots of offline tuning + A/B testing.
Example: which search interface results in higher revenue? [Kohavi et al. ’09, ’12]
- Each treatment needs to be carefully designed and tuned.
- Typically, only 2-5 options are compared.
- Depending on effect size and variance, thousands to millions of impressions are required to detect statistically significant differences.
- The result: slow development cycles (e.g., weeks).
Can any of this be automated to speed up innovation?
Image adapted from: https://www.flickr.com/photos/prayitnophotography/4464000634
Exploration-exploitation trade-off
Formalized as a (contextual) bandit problem: a reinforcement learning problem where actions do not affect future states. This addresses the key challenge of how to balance exploration and exploitation: explore to learn, exploit to benefit from what has been learned.
[Figure: three example snapshots of estimated arm payoffs (axis ticks at 10, 50, 100); both arms are promising, with higher uncertainty for C.]
Bandit approaches balance exploration and exploitation based on expected payoff and uncertainty.
Idea 1: use a simple exploration approach (here: ε-greedy).
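A minimal sketch of ε-greedy for a k-armed bandit (the arm payoffs and the simulation loop below are illustrative assumptions, not from the talk):

```python
import random

def epsilon_greedy(n_arms, pull, n_rounds=10000, epsilon=0.1):
    """Epsilon-greedy: explore a random arm with probability epsilon,
    otherwise exploit the arm with the highest empirical mean payoff."""
    counts = [0] * n_arms       # pulls per arm
    means = [0.0] * n_arms      # empirical mean payoff per arm
    total = 0.0
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                    # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]     # incremental mean
        total += reward
    return means, total

# Illustrative simulation: three arms with Bernoulli click payoffs.
true_ctr = [0.10, 0.05, 0.02]
means, total = epsilon_greedy(3, lambda a: float(random.random() < true_ctr[a]))
print(means, total)
```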
Idea 2: explore efficiently in a small action space, but use machine learning to optimize over a context space. [Li et al. ’12]
Contextual bandits
Example results [Li et al. ’12]: balancing exploration and exploitation is crucial for good results.
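As a concrete illustration, here is a minimal sketch of a linear contextual bandit with an upper-confidence exploration bonus (a LinUCB-style construction; the dimensions, features, and parameters are illustrative assumptions, not the exact algorithms evaluated by Li et al.):

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression with an upper-confidence exploration bonus."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)      # regularized design matrix (d x d)
        self.b = np.zeros(dim)    # accumulated reward-weighted contexts
        self.alpha = alpha        # exploration strength

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b    # ridge estimate of payoff weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

# Illustrative use: choose among 5 articles given a 4-dimensional
# user-context vector; exploration happens in the small action space
# (5 arms), learning in the large context space.
arms = [LinUCBArm(dim=4) for _ in range(5)]
x = np.array([1.0, 0.2, 0.5, 0.0])                    # hypothetical context
best = max(range(len(arms)), key=lambda a: arms[a].ucb(x))
arms[best].update(x, reward=1.0)                      # observed click
```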
1) Balance exploration and exploitation, to ensure continued learning while applying what has been learned.
2) Explore in a small action space, but learn in a large contextual space.
Image: Illustrated Sutra of Cause and Effect ("E innga kyo"), woodblock reproduction published in 1941 by Sinbi-Shoin Co., Tokyo. Public domain, via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:E_innga_kyo.jpg
Problem: estimate effects of mainline reserve changes.
Controlled experiment vs. counterfactual reasoning [Bottou et al. ’13].
Key idea: estimate what would have happened if a different system (a different distribution over parameter values) had been used, using importance sampling.
Example distributions: the current system draws parameter values from $P$, the proposed alternative from $P'$.
Step 1: factorize the distribution over outcomes based on the known causal graph. The two systems then differ in only one factor, so the importance ratio $P'(\omega)/P(\omega)$ reduces to $P'(q \mid x, a) \,/\, P(q \mid x, a)$.
Step 2: compute estimates using importance sampling:
$$Y' = \frac{1}{n} \sum_{i=1}^{n} y_i \, \frac{P'(q_i \mid x_i, a_i)}{P(q_i \mid x_i, a_i)}$$
[Bottou et al. ’13]
[Figure: example distributions $P(q)$ and $P'(q)$ over the parameter $q$.]
This works because:
$$\mathbb{E}_{P'}[y] = \int y \, P'(y) \, dy = \int y \, \frac{P'(y)}{P(y)} \, P(y) \, dy = \mathbb{E}_{P}\!\left[ y \, \frac{P'(y)}{P(y)} \right] \approx \frac{1}{n} \sum_{i=1}^{n} y_i \, \frac{P'(q_i \mid x_i, a_i)}{P(q_i \mid x_i, a_i)}$$
[Precup et al. ’00]
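A minimal sketch of this estimator in Python (variable names and the optional weight clipping are illustrative assumptions, not from Bottou et al.; the logging propensities must be recorded at data-collection time):

```python
import numpy as np

def counterfactual_estimate(y, p_new, p_log, clip=None):
    """Importance-sampling estimate of E_{P'}[y] from data logged under P.

    y     : outcomes y_i observed under the logging system
    p_new : P'(q_i | x_i, a_i) under the proposed system
    p_log : P(q_i | x_i, a_i) under the logging system; must be > 0
            wherever P' > 0 (exploration coverage)
    clip  : optional cap on importance weights, trading variance for bias
    """
    w = np.asarray(p_new) / np.asarray(p_log)
    if clip is not None:
        w = np.minimum(w, clip)       # clipped importance weights
    return float(np.mean(np.asarray(y) * w))

# Illustrative data: clicks logged under P, evaluated under P'.
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
p_log = np.array([0.5, 0.2, 0.4, 0.3, 0.5])
p_new = np.array([0.6, 0.1, 0.5, 0.2, 0.6])
print(counterfactual_estimate(y, p_new, p_log))
```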
Counterfactual reasoning allows analysis over a continuous range of parameter values. [Bottou et al. ’13]
1) Leverage known causal structure and importance sampling to reason about "alternative realities".
2) Bound the estimator error to distinguish between uncertainty due to low sample size and uncertainty due to limited exploration coverage.
Example: optimize QAC ranking. Compare two rankings:
[Figure: two rankings over documents 1-4 and their interleaved (combined) ranking.]
1) Generate the interleaved (combined) ranking.
2) Observe user clicks.
3) Credit clicks to the original rankers to infer the outcome (see the sketch below).
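The slides do not fix a particular interleaving method; below is a minimal sketch of team-draft interleaving, one common variant, together with the click-crediting step (document IDs and the clicked position are illustrative):

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Team-draft interleaving: in each round, the two rankers (in random
    order) each contribute their highest-ranked document not yet picked."""
    interleaved, teams = [], []
    n_docs = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < n_docs:
        order = [0, 1]
        random.shuffle(order)                 # coin flip per round
        for team in order:
            ranking = ranking_a if team == 0 else ranking_b
            doc = next((d for d in ranking if d not in interleaved), None)
            if doc is not None:
                interleaved.append(doc)
                teams.append(team)            # remember who contributed it
    return interleaved, teams

def credit_clicks(teams, clicked_positions):
    """Credit each clicked document to the ranker that contributed it."""
    wins = [0, 0]
    for pos in clicked_positions:
        wins[teams[pos]] += 1
    return wins   # e.g., wins[0] > wins[1] => ranker A inferred better

ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d2", "d3", "d4", "d1"]
interleaved, teams = team_draft_interleave(ranking_a, ranking_b)
print(interleaved, credit_clicks(teams, clicked_positions=[0]))
```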
Learning approach: dueling bandit gradient descent (DBGD) optimizes a weight vector for weighted linear combinations of ranking features [Yue & Joachims ’09]. A candidate ranker is generated by sampling a random direction on the unit sphere around the current best weight vector; relative listwise feedback is obtained using interleaving.
[Figure: current best weight vector and a randomly generated candidate in a two-dimensional feature space (feature 1 vs. feature 2).]
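A compact sketch of the DBGD loop under these definitions; the interleaved comparison on live traffic is abstracted into a hypothetical candidate_wins_interleaving callback, and the step sizes delta and gamma are illustrative:

```python
import numpy as np

def dbgd(w0, candidate_wins_interleaving, n_steps=1000, delta=1.0, gamma=0.1):
    """Dueling bandit gradient descent: probe a random direction on the
    unit sphere, duel candidate vs. current best via interleaving, and
    step toward the candidate if it wins."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        u = np.random.randn(w.size)
        u /= np.linalg.norm(u)          # uniform direction on the unit sphere
        w_candidate = w + delta * u     # exploratory candidate ranker
        if candidate_wins_interleaving(w, w_candidate):
            w = w + gamma * u           # exploit: move toward the winner
    return w
```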
Approach: candidate pre-selection (CPS) generates many candidates and selects the most promising one via a tournament on historical data, using probabilistic interleaving and importance sampling [Hofmann et al. ’13c].
[Figure: candidate rankers in the two-dimensional feature space (feature 1 vs. feature 2).]
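At a high level, CPS might look like the sketch below; the historical tournament is abstracted into a hypothetical historical_win_estimate callback (the probabilistic-interleaving and importance-sampling machinery of Hofmann et al. ’13c is not reproduced here):

```python
import numpy as np

def candidate_preselection(w, historical_win_estimate,
                           n_candidates=20, delta=1.0):
    """Generate many candidate rankers around the current best weight
    vector, then run a tournament on logged interaction data to pick
    the most promising one.

    historical_win_estimate(w_a, w_b) -> float
        estimated probability that ranker w_a beats w_b, computed from
        historical clicks via probabilistic interleaving and importance
        sampling (see Hofmann et al. '13c for details)."""
    candidates = []
    for _ in range(n_candidates):
        u = np.random.randn(w.size)
        u /= np.linalg.norm(u)            # direction on the unit sphere
        candidates.append(w + delta * u)
    # Single-elimination tournament: the survivor is the candidate most
    # likely to beat the others according to the historical estimates.
    while len(candidates) > 1:
        a, b = candidates.pop(), candidates.pop()
        candidates.insert(0, a if historical_win_estimate(a, b) > 0.5 else b)
    return candidates[0]
```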
TD + DBGD vs. CPS. From earlier work: learning from relative listwise feedback is robust to noise. Here: adding structure further dramatically improves performance. [Hofmann et al. ’13b, Hofmann et al. ’13c]
[Figure: learning curves over 1,000 impressions under an informational click model (performance axis 0 to 0.8), comparing TD + DBGD and CPS.]
1) Avoid the combinatorial action space by exploring in parameter space.
2) Reduce variance using relative feedback.
3) Leverage known structure for sample-efficient learning.
Optimizing interactive systems
- Contextual bandits: a systematic approach to balancing exploration and exploitation; contextual bandits explore in a small action space but optimize over a large context space.
- Counterfactual reasoning: leverages causal structure and importance sampling for "what if" analyses.
- Online learning to rank: avoids combinatorial explosion by exploring and learning in parameter space; uses known ranking structure for sample-efficient learning.
From research to applications: assess the action and solution spaces in a given application, collect and learn from exploration data, and increase experimental agility.
Try this (at home):
- Open-source code samples: https://bitbucket.org/ilps/lerot
- The Living Labs challenge allows experimentation with online learning and evaluation methods: http://livinglabs.net/challenge/
[Kohavi et al. ‘09] R. Kohavi, R. Longbotham, D. Sommerfield, R. M. Henne: Controlled experiments on the web: survey and practical guide
(Data Mining and Knowledge Discovery 18, 2009).
[Kohavi et al. ‘12] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, Y. Xu: Trustworthy online controlled experiments: five puzzling
outcomes explained (KDD 2012).
[Li et al. ‘11] L. Li, W. Chu, J. Langford, X. Wang: Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms (WSDM 2011).
[Li et al. ‘12] L. Li, W. Chu, J. Langford, T. Moon, X. Wang: An Unbiased Offline Evaluation of Contextual Bandit Algorithms based on
Generalized Linear Models, ICML-2011 Workshop on Online Trading of Exploration and Exploitation.
[Bottou et al. ‘13] L. Bottou, J. Peters, J. Quiñonero-Candela, D.X. Charles, D.M. Chickering, E. Portugaly, D. Ray, P. Simard, E. Snelson:
Counterfactual reasoning and learning systems: the example of computational advertising (Journal of Machine Learning Research
14 (1), 2013).
[Precup et al. ‘00] D. Precup, R. S. Sutton, S. Singh: Eligibility Traces for Off-Policy Policy Evaluation (ICML 2000).
[Yue & Joachims ‘09] Y. Yue, T. Joachims: Interactively optimizing information retrieval system as a dueling bandits problem (ICML 2009).
[Hofmann et al. ’13b] K. Hofmann, A. Schuth, S. Whiteson, M. de Rijke: Reusing Historical Interaction Data for Faster Online Learning to
Rank for IR (WSDM 2013).
[Hofmann et al. ’13c] K. Hofmann, S. Whiteson, M. de Rijke: Balancing exploration and exploitation in listwise and pairwise online
learning to rank for information retrieval (Information Retrieval 16, 2013).