Presenting work by various authors, and own work in collaboration with colleagues at Microsoft and the University of Amsterdam.

Example tasks:
- Find the best news articles based on user context; optimize click-through rate.
- Tune ad display parameters (e.g., mainline reserve) to optimize revenue.
- Improve the ranking of query auto-completion (QAC) to optimize suggestion usage.

Typical approach: lots of offline tuning + A/B testing. Example: which search interface results in higher revenue? [Kohavi et al. '09, '12]
- Need to carefully design / tune each treatment.
- Typically compare 2-5 options.
- Depending on effect size and variance, thousands to millions of impressions are required to detect statistically significant differences; slow development cycles (e.g., weeks).
Can any of this be automated to speed up innovation?

Image adapted from: https://www.flickr.com/photos/prayitnophotography/4464000634

Exploration-exploitation trade-off. Formalized as a bandit problem (= a reinforcement learning problem where actions do not affect future states). Addresses the key challenge of how to balance exploration and exploitation: explore to learn, exploit to benefit from what has been learned.

[Figure: slot-machine example with observed payoffs 100, 50, 10; both remaining arms are promising, with higher uncertainty for arm C]

Bandit approaches balance exploration and exploitation based on expected payoff and uncertainty.

Idea 1: Use a simple exploration approach (here: ε-greedy).
Idea 2: Explore efficiently in a small action space, but use machine learning to optimize over a context space: contextual bandits [Li et al. '12].

Example results [Li et al. '12]: balancing exploration and exploitation is crucial for good results.

Takeaways: 1) Balance exploration and exploitation, to ensure continued learning while applying what has been learned. 2) Explore in a small action space, but learn in a large contextual space.

Illustrated Sutra of Cause and Effect "E innga kyo" by Unknown - Woodblock reproduction, published in 1941 by Sinbi-Shoin Co., Tokyo.
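To make the ε-greedy idea above concrete, here is a minimal simulation sketch (not from the cited work): the arm payoffs follow the slide's 100/50/10 example, and the noisy uniform reward model, function name, and parameter values are assumptions for illustration.

```python
import random

def epsilon_greedy(expected_payoffs, epsilon=0.1, steps=10000, seed=42):
    """Epsilon-greedy on a simulated multi-armed bandit: with probability
    epsilon pull a random arm (explore), otherwise pull the arm with the
    highest empirical mean payoff so far (exploit)."""
    rng = random.Random(seed)
    pulls = [0] * len(expected_payoffs)
    means = [0.0] * len(expected_payoffs)
    for step in range(steps):
        if step == 0 or rng.random() < epsilon:
            arm = rng.randrange(len(expected_payoffs))          # explore
        else:
            arm = max(range(len(expected_payoffs)), key=means.__getitem__)  # exploit
        # Hypothetical noisy payoff: uniform, with the given expected value.
        reward = rng.uniform(0, 2 * expected_payoffs[arm])
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # running mean
    return means, pulls

# The slide's example arms, with expected payoffs 100, 50, and 10:
means, pulls = epsilon_greedy([100, 50, 10])
```

After enough steps the empirical means approach the true payoffs, and exploitation concentrates pulls on the best arm while ε keeps a small, constant exploration rate.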
Licensed under Public domain via Wikimedia Commons: http://commons.wikimedia.org/wiki/File:E_innga_kyo.jpg#mediaviewer/File:E_innga_kyo.jpg

Problem: estimate the effects of mainline reserve changes [Bottou et al. '13]: controlled experiment vs. counterfactual reasoning.

Key idea: estimate what would have happened if a different system (a different distribution over parameter values) had been used, using importance sampling. [Figure: example logging and counterfactual distributions over parameter values]

Step 1: factorize the data distribution based on the known causal graph. The alternative system differs from the logging system only in the factor $\pi(q \mid x, a)$, so all other factors cancel in the ratio of the two distributions:
$$\frac{P'(\omega)}{P(\omega)} = \frac{\pi'(q \mid x, a)}{\pi(q \mid x, a)}$$

Step 2: compute estimates using importance sampling [Bottou et al. '13]:
$$Y' \approx \frac{1}{n} \sum_{i=1}^{n} y_i \, \frac{\pi'(q_i \mid x_i, a_i)}{\pi(q_i \mid x_i, a_i)}$$

This works because [Precup et al. '00]:
$$E_{P'}[y] = \int y \, P'(y) \, dy = \int y \, \frac{P'(y)}{P(y)} \, P(y) \, dy = E_{P}\!\left[ y \, \frac{P'(y)}{P(y)} \right] \approx \frac{1}{n} \sum_{i=1}^{n} y_i \, \frac{\pi'(q_i \mid x_i, a_i)}{\pi(q_i \mid x_i, a_i)}$$

Counterfactual reasoning allows analysis over a continuous range of parameter values [Bottou et al. '13].

Takeaways: 1) Leverage known causal structure and importance sampling to reason about "alternative realities". 2) Bound the estimator error to distinguish between uncertainty due to low sample size and uncertainty due to exploration coverage.

[Figure: two rankings of documents 1-4 and the interleaved (combined) ranking]

Compare two rankings (example: optimize QAC ranking):
1) Generate an interleaved (combined) ranking.
2) Observe user clicks.
3) Credit clicks to the original rankers to infer the outcome.

Learning approach: dueling bandit gradient descent (DBGD) optimizes a weight vector for weighted linear combinations of ranking features [Yue & Joachims '09]. A candidate ranker is generated by sampling the unit sphere around the current best weight vector; relative listwise feedback is obtained using interleaving. [Figure: feature space (feature 1 vs. feature 2) showing the current best weight vector and a randomly generated candidate]

Approach: candidate pre-selection (CPS) generates many candidates and selects the most promising one via a tournament on historical data, using probabilistic interleaving and importance sampling [Hofmann et al. '13c]. [Figure: feature space showing candidate pre-selection]

From earlier work: learning from relative listwise feedback is robust to noise. Here: adding structure further dramatically improves performance. [Figure: learning curves (0-1000 impressions) for CPS vs. TD + DBGD under an informational click model] [Hofmann et al. '13b, Hofmann et al. '13c]

Takeaways: 1) Avoid a combinatorial action space by exploring in parameter space. 2) Reduce variance using relative feedback. 3) Leverage known structures for sample-efficient learning.

Optimizing interactive systems:
- Contextual bandits: a systematic approach to balancing exploration and exploitation; contextual bandits explore in a small action space but optimize in a large context space.
- Counterfactual reasoning: leverages causal structure and importance sampling for "what if" analyses.
- Online learning to rank: avoids combinatorial explosion by exploring and learning in parameter space; uses known ranking structure for sample-efficient learning.

Research applications: assess action and solution spaces in a given application, collect and learn from exploration data, increase experimental agility.

Try this (at home): try the open-source code samples; the Living Labs challenge allows experimentation with online learning and evaluation methods.
Code: https://bitbucket.org/ilps/lerot
Challenge: http://livinglabs.net/challenge/

[Kohavi et al. '09] R. Kohavi, R. Longbotham, D. Sommerfield, R. M. Henne: Controlled experiments on the web: survey and practical guide (Data Mining and Knowledge Discovery 18, 2009).
[Kohavi et al. '12] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, Y. Xu: Trustworthy online controlled experiments: five puzzling outcomes explained (KDD 2012).
[Li et al. '11] L. Li, W. Chu, J. Langford, X. Wang: Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms (WSDM 2011).
[Li et al. '12] L. Li, W. Chu, J. Langford, T. Moon, X.
Wang: An Unbiased Offline Evaluation of Contextual Bandit Algorithms based on Generalized Linear Models (ICML 2011 Workshop on Online Trading of Exploration and Exploitation).
[Bottou et al. '13] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M. Chickering, E. Portugaly, D. Ray, P. Simard, E. Snelson: Counterfactual reasoning and learning systems: the example of computational advertising (Journal of Machine Learning Research 14 (1), 2013).
[Precup et al. '00] D. Precup, R. S. Sutton, S. Singh: Eligibility Traces for Off-Policy Policy Evaluation (ICML 2000).
[Yue & Joachims '09] Y. Yue, T. Joachims: Interactively optimizing information retrieval systems as a dueling bandits problem (ICML 2009).
[Hofmann et al. '13b] K. Hofmann, A. Schuth, S. Whiteson, M. de Rijke: Reusing Historical Interaction Data for Faster Online Learning to Rank for IR (WSDM 2013).
[Hofmann et al. '13c] K. Hofmann, S. Whiteson, M. de Rijke: Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval (Information Retrieval 16, 2013).
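The importance-sampling estimator from the counterfactual reasoning section can be sketched as follows. This is a minimal illustration, not the authors' code: the two-valued parameter, the distributions `pi_old`/`pi_new`, the outcome model, and all names are hypothetical.

```python
import random

def counterfactual_estimate(logged, pi_new, pi_old):
    """Importance-sampling estimate of the expected outcome under pi_new,
    from outcomes y_i logged while parameters q_i were drawn from pi_old:
        Y' ~= (1/n) * sum_i y_i * pi_new(q_i) / pi_old(q_i)
    """
    return sum(y * pi_new[q] / pi_old[q] for q, y in logged) / len(logged)

# Hypothetical setup: a parameter q takes value 0 or 1; the logging
# system draws q from pi_old, and the outcome y depends on q.
pi_old = {0: 0.8, 1: 0.2}        # logging distribution
pi_new = {0: 0.2, 1: 0.8}        # counterfactual distribution
outcome_mean = {0: 1.0, 1: 2.0}  # expected outcome per parameter value

rng = random.Random(0)
logged = []
for _ in range(100_000):
    q = 0 if rng.random() < pi_old[0] else 1
    logged.append((q, outcome_mean[q] + rng.gauss(0, 0.1)))

estimate = counterfactual_estimate(logged, pi_new, pi_old)
# True expected outcome under pi_new: 0.2 * 1.0 + 0.8 * 2.0 = 1.8
```

The logged data alone would suggest a mean outcome near 1.2; reweighting by the ratio of the two distributions recovers the counterfactual value without ever deploying `pi_new`.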
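The interleaved-comparison steps from the online learning to rank section (generate a combined ranking, observe clicks, credit them back to the original rankers) can be sketched with team-draft interleaving, one common variant. A minimal sketch: the function names and example rankings are mine, and it assumes both rankers rank the same set of documents.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Team-draft interleaving: rankers A and B take turns (coin flip on
    ties) adding their highest-ranked unused document; remember which
    ranker contributed each position so clicks can be credited back."""
    docs = set(ranking_a) | set(ranking_b)
    interleaved, team = [], []
    count = {"A": 0, "B": 0}
    while len(interleaved) < len(docs):
        if count["A"] < count["B"] or (count["A"] == count["B"] and rng.random() < 0.5):
            side, ranking = "A", ranking_a
        else:
            side, ranking = "B", ranking_b
        doc = next(d for d in ranking if d not in interleaved)
        interleaved.append(doc)
        team.append(side)
        count[side] += 1
    return interleaved, team

def credit_clicks(team, clicked_positions):
    """Credit each clicked position to the ranker that placed it there."""
    credit = {"A": 0, "B": 0}
    for pos in clicked_positions:
        credit[team[pos]] += 1
    return credit

rng = random.Random(7)
interleaved, team = team_draft_interleave([1, 2, 3, 4], [2, 3, 4, 1], rng)
credit = credit_clicks(team, [0])  # suppose the user clicked the top result
```

Over many impressions, the ranker that accumulates more click credit is inferred to be better; the coin flips keep the comparison unbiased with respect to position.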
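The DBGD update described above (sample the unit sphere around the current best weights, duel the candidate against the current ranker, step toward a winning candidate) can be sketched as below. In the real setting the duel is decided by interleaved user clicks; here a hypothetical `duel` oracle that prefers weights closer to an unknown "ideal" vector stands in for that feedback, and all step sizes are illustrative.

```python
import math
import random

def dbgd_step(w, duel, delta=0.1, gamma=0.05, rng=random):
    """One step of dueling bandit gradient descent: perturb the current
    weight vector along a uniformly random unit vector, compare the
    candidate ranker against the current one, and take a small step
    toward the candidate if it wins."""
    u = [rng.gauss(0, 1) for _ in w]
    norm = math.sqrt(sum(x * x for x in u)) or 1.0
    u = [x / norm for x in u]  # uniform direction on the unit sphere
    candidate = [wi + delta * ui for wi, ui in zip(w, u)]
    if duel(w, candidate):  # True iff the candidate wins the comparison
        w = [wi + gamma * ui for wi, ui in zip(w, u)]
    return w

# Hypothetical stand-in for interleaved comparisons: the candidate wins
# whenever it is closer to some unknown ideal weight vector.
ideal = [0.7, 0.3]
def duel(current, candidate):
    dist = lambda v: sum((vi - ti) ** 2 for vi, ti in zip(v, ideal))
    return dist(candidate) < dist(current)

rng = random.Random(1)
w = [0.0, 0.0]
for _ in range(500):
    w = dbgd_step(w, duel, rng=rng)
```

Each accepted step moves the weights a little toward a direction that won its duel, so the ranker improves from relative feedback only, without ever observing absolute relevance labels.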