15 millions de prédictions par seconde Les défis de Criteo Nicolas Le Roux Scientific Program Manager - R&D 2014-11-20 Copyright © 2014 Criteo What does Criteo do? • We buy advertising spaces on websites • We display ads for our partners • We get paid if the user clicks on the ad Copyright © 2014 Criteo Retargeting Copyright © 2014 Criteo In practice 1. A user lands on a webpage 2. The website contacts an ad-exchange 3. The ad-exchange contacts Criteo and its competitors 4. It is an auction: each competitor tells how much it bids 5. The highest bidder wins the right to display an ad Copyright © 2014 Criteo Details of the auction • Real-time bidding (RTB) • Second-price auction: the winner pays the second highest price • Optimal strategy: bid the expected gain • Expected gain = price per click (CPC) * probability of click (CTR) Copyright © 2014 Criteo Goal of the arbitrage • Choose the appropriate advertiser • Only win the profitable displays: it is a prediction problem • An overbid means you are losing money • An underbid means you are losing potential revenue Copyright © 2014 Criteo What to do once we win the display? • We are now directly in contact with the website • Choose the best products • Choose the color, the font and the layout • Generate the banner Copyright © 2014 Criteo Goal of the recommendation • Increase the value of a display: it is a ranking problem • The potential gains are huge Copyright © 2014 Criteo Real-time constraints at Criteo • More than 2 billion banners displayed per day Copyright © 2014 Criteo Real-time constraints at Criteo • More than 2 billion banners displayed per day Time to answer an RTB request (in ms) 100 Time to bid before timeout Copyright © 2014 Criteo Real-time constraints at Criteo • More than 2 billion banners displayed per day Time to answer an RTB request (in ms) 50 Network time Copyright © 2014 Criteo 10 Parsing + logging 35 Other computation 5 Available for prediction Fitting within the real-time constraints • To avoid timeouts, only simple models are allowed • Logistic regression • Decision trees • What freedom do we have? Copyright © 2014 Criteo Predicting the CTR • 𝑃 𝑐𝑙𝑖𝑐𝑘 = 1 𝑥 = 𝜎(𝜃 𝑇 𝑥) • 𝑥: features containing historical and contextual information • 𝜃: parameters of the model Copyright © 2014 Criteo Predicting the CTR • 𝑃 𝑐𝑙𝑖𝑐𝑘 = 1 𝑥 = 𝜎(𝜃 𝑇 𝑥) • 𝑥: [TimeSinceLastVisit, CurrentURL] • 𝜃: parameters of the model Copyright © 2014 Criteo A large number of parameters • 𝑥: [TimeSinceLastVisit, CurrentURL] • One parameter per modality Copyright © 2014 Criteo A large number of parameters • 𝑥: [TimeSinceLastVisit, CurrentURL] • One parameter per modality • TimeSinceLastVisit • Less than 10 seconds: [1, 0, 0] • Between 10 seconds and 5 minutes: [0, 1, 0] • More than 5 minutes: [0, 0, 1] Copyright © 2014 Criteo A large number of parameters • 𝑥: [TimeSinceLastVisit, CurrentURL] • One parameter per modality • TimeSinceLastVisit • Less than 10 seconds: [1, 0, 0] • Between 10 seconds and 5 minutes: [0, 1, 0] • More than 5 minutes: [0, 0, 1] • CurrentURL • lemonde.fr : [1, 0, 0, ..., 0] • facebook.com : [0, 1, 0, ..., 0] • maisonetjardin.fr : [0, 0, 1, ..., 0] • 𝑥: [More Than 5 minutes, facebook.com] = [0, 0, 1, 0, 1, 0, ..., 0] Copyright © 2014 Criteo Modeling higher-order information • A linear model cannot represent higher order information Copyright © 2014 Criteo Modeling higher-order information • A linear model cannot represent higher order information • E.g. CurrentUrl = "disney.com" and Advertiser = "Guns4Life" Copyright © 2014 Criteo Modeling higher-order information • A linear model cannot represent higher order information • E.g. CurrentUrl = "disney.com" and Advertiser = "Guns4Life" • We model these by creating "cross-features" Copyright © 2014 Criteo Modeling higher-order information • A linear model cannot represent higher order information • E.g. CurrentUrl = "disney.com" and Advertiser = "Guns4Life" • We model these by creating "cross-features" • CurrentUrl has 𝑝1 modalities, Advertiser has 𝑝2 modalities • The cross-feature has 𝑝1 𝑝2 modalities Copyright © 2014 Criteo Hashing • This model has estimation and computational issues Copyright © 2014 Criteo Hashing • This model has estimation and computational issues • Choose ℎ: ℕ → 1, … , 𝑝 Copyright © 2014 Criteo Hashing • This model has estimation and computational issues • Choose ℎ: ℕ → 1, … , 𝑝 • Replace 𝑥𝑖 = 1 with 𝑥ℎ(𝑖) = 1 Copyright © 2014 Criteo Hashing • This model has estimation and computational issues • Choose ℎ: ℕ → 1, … , 𝑝 • Replace 𝑥𝑖 = 1 with 𝑥ℎ(𝑖) = 1 • The original 𝑥 is projected to ℝ𝑝 Copyright © 2014 Criteo Hashing • This model has estimation and computational issues • Choose ℎ: ℕ → 1, … , 𝑝 • Replace 𝑥𝑖 = 1 with 𝑥ℎ(𝑖) = 1 • The original 𝑥 is projected to ℝ𝑝 • 𝑃 𝑐𝑙𝑖𝑐𝑘 = 1 𝑥 = 𝜎(𝜃 𝑇 𝑥) Copyright © 2014 Criteo Dealing with collisions • There are many pairs 𝑖1 , 𝑖2 such that ℎ(𝑖1 ) = ℎ(𝑖2 ) • These two features will become indistinguishable Copyright © 2014 Criteo Dealing with collisions • There are many pairs 𝑖1 , 𝑖2 such that ℎ(𝑖1 ) = ℎ(𝑖2 ) • These two features will become indistinguishable • First solution: increase 𝑝 Copyright © 2014 Criteo Dealing with collisions • There are many pairs 𝑖1 , 𝑖2 such that ℎ(𝑖1 ) = ℎ(𝑖2 ) • These two features will become indistinguishable • First solution: increase 𝑝 • Second solution: do feature selection. Copyright © 2014 Criteo The cost of adding variables • « Hey, I thought of this great variable: Time since last product view. • Can we add it to the model? » Copyright © 2014 Criteo The cost of adding variables • « Hey, I thought of this great variable: Time since last product view. • Can we add it to the model? » • Storage: #Banners/day x #Days x 4 = 480GB Copyright © 2014 Criteo The cost of adding variables • « Hey, I thought of this great variable: Time since last product view. • Can we add it to the model? » • Storage: #Banners/day x #Days x 4 = 480GB • RAM: #Users x #Campaigns x 4 = 40GB Copyright © 2014 Criteo Feature selection • About 40 original features Copyright © 2014 Criteo Feature selection • About 40 original features • 780 level-2 cross-features • 9880 level-3 cross-features Copyright © 2014 Criteo Feature selection • About 40 original features • 780 level-2 cross-features • 9880 level-3 cross-features • Each feature contains many modalities Copyright © 2014 Criteo Feature selection at scale • Greedy methods score the variables independently • Group sparsity methods score them jointly but are biased (think L1) Copyright © 2014 Criteo Feature selection at scale • Greedy methods score the variables independently • Group sparsity methods score them jointly but are biased (think L1) • Let us use greedy methods to provide a good initial set of features • Refine with group sparsity Copyright © 2014 Criteo Learning the parameters • n = 10^9, p = 10^7 • Theory tells us that stochastic gradient methods should be used Copyright © 2014 Criteo Learning the parameters - The actual situation • Tens of models are trained several times a day • Warm starts favor batch methods • Not all points are equal • Stochastic methods are harder to parallelize. Copyright © 2014 Criteo Criteo's optimizer • Batch optimizer • Distributed computation of the gradients (107 examples/s) • Update computation on a single node Copyright © 2014 Criteo Criteo's optimizer • Batch optimizer • Distributed computation of the gradients (107 examples/s) • Update computation on a single node • Robustness trumps accuracy Copyright © 2014 Criteo Dealing with spurious clicks • Some clicks do not bring sales • Some clicks are pure fake • We can build a fraud detection system Copyright © 2014 Criteo Dealing with spurious clicks • Some clicks do not bring sales • Some clicks are pure fake • We can build a fraud detection system • What is the real issue here? Copyright © 2014 Criteo Dealing with spurious clicks • The real goal is to only buy clicks which bring sales Copyright © 2014 Criteo Dealing with spurious clicks • The real goal is to only buy clicks which bring sales • We have labeled data Copyright © 2014 Criteo Dealing with spurious clicks • The real goal is to only buy clicks which bring sales • We have labeled data • Let's build a sale prediction model. Copyright © 2014 Criteo Are we done yet? • We know how to build a good model • We know how to train it • We are predicting a meaningful target Copyright © 2014 Criteo Simpson’s paradox CTR Overall First position 60/9000 (0.67%) Second position 50/7000 (0.71%) Copyright © 2014 Criteo Simpson’s paradox CTR Overall Low-value users High-value users First position 60/9000 (0.67%) 48/8000 (0.6%) 12/1000 (1.2%) Second position 50/7000 (0.71%) 2/1000 (0.2%) 48/6000 (0.8%) Copyright © 2014 Criteo Simpson’s paradox CTR Overall Low-value users High-value users First position 60/9000 (0.67%) 48/8000 (0.6%) 12/1000 (1.2%) Second position 50/7000 (0.71%) 2/1000 (0.2%) 48/6000 (0.8%) • Because of the underprediction, we might not detect our error. Copyright © 2014 Criteo Offline model evaluation • The data collection process depends on the current algorithm • Changing the predictor changes the input distribution • How can we predict what will happen? Copyright © 2014 Criteo Evaluating the model • A/B test • Slow • Costly Copyright © 2014 Criteo Evaluating the model • A/B test • Slow • Costly • Counterfactual analysis: “What would have happened if…?” • Based around importance sampling Copyright © 2014 Criteo Counterfactual analysis • Requires randomization in real time • Requires the ability to replay what happened • “Would I have taken another decision?” • Extensive logging: ALL campaigns Copyright © 2014 Criteo Using side information • There is never enough data • With more information, we could learn finer behaviours • But there are only so many sales, can we do better? Copyright © 2014 Criteo From unsupervised to weakly supervised learning • Unsupervised learning tries to learn about X • Weakly supervised learning uses related tasks • Long visits on the website • Sales which do not follow a click • Big data: unstructured targets rather than inputs Copyright © 2014 Criteo Summary for the prediction • The technical constraints shape our solution • Accuracy is not the only goal • Scaling is much more than the quantity of data Copyright © 2014 Criteo Summary for the prediction • The technical constraints shape our solution • Accuracy is not the only goal • Scaling is much more than the quantity of data • Which latest developments can we incorporate and benefit from? Copyright © 2014 Criteo Product recommendation • We have the list of products seen • A catalog can contain 105 products • How do we choose the right products to show? Copyright © 2014 Criteo Product recommendation • We have the list of products seen • A catalog can contain 105 products • How do we choose the right products to show? • In less than 20ms Copyright © 2014 Criteo Two-stage approach • Stage 1: Product preselection based on: • Popularity • Browsing history of the user Copyright © 2014 Criteo Two-stage approach • Stage 1: Product preselection based on: • Popularity • Browsing history of the user • Stage 2: Exact scoring using a prediction model Copyright © 2014 Criteo Major hurdles for product recommendation • Products come and go • There might be a sale (Black Friday) • Complementary vs. similar products Copyright © 2014 Criteo Other challenges • Multiple products in a banner • Interaction between products and layout • Different timeframes for different products Copyright © 2014 Criteo A first recipe for success • There are many sources of success/failure • It is often suboptimal to focus on one • The first step to address each source is often manual Copyright © 2014 Criteo Conclusion • Prediction is at the core of our business • Huge engineering constraints • Different bottlenecks than in academia • Build from the ground up, not the other way around Copyright © 2014 Criteo This is the end Thank you! Questions? Copyright © 2014 Criteo