Computational Advertising: The LinkedIn Way
Deepak Agarwal, LinkedIn Corporation
CIKM, San Francisco
Oct 30th, 2013
Computational Advertising
 Matchmaker (Broder, CACM)
– Placing the “best” ads in a given context for every user visit
 Matchmaking at scale requires automation
– Serving with low marginal cost increases profit margins
 Automation through machine learning/optimization
 A new discipline called Computational Advertising
LinkedIn Advertising: Brand, Self-Serve, Sponsored Updates
SERVING
 Flow for one ad request (serving constraint < 100 millisec):
– Ad request arrives with member profile (e.g. region = US, age = 20) and context (e.g. profile page, 300 x 250 ad slot)
– Filter campaigns (targeting criteria, frequency cap, budget pacing)
– Automatic format selection
– Campaigns eligible for auction are scored by the Response Prediction Engine and sorted by Bid * CTR
– Second-price pricing: e.g. Click Cost = Bid3 x CTR3 / CTR2
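A minimal sketch (not the production system) of this ranking and pricing rule, assuming each campaign carries a CPC bid and a predicted CTR; the helper and field names are hypothetical:

```python
# Rank eligible campaigns by bid * predicted CTR, charge second-price click costs.
def run_auction(campaigns):
    # campaigns: list of dicts with "bid" (CPC bid) and "ctr" (predicted CTR)
    ranked = sorted(campaigns, key=lambda c: c["bid"] * c["ctr"], reverse=True)
    priced = []
    for pos in range(len(ranked) - 1):
        ad, runner_up = ranked[pos], ranked[pos + 1]
        # Pay just enough to keep the slot: cost * CTR_pos = bid_next * CTR_next,
        # i.e. the slide's "Click Cost = Bid3 x CTR3 / CTR2" for position 2.
        cost = runner_up["bid"] * runner_up["ctr"] / ad["ctr"]
        priced.append((ad, min(cost, ad["bid"])))
    return priced

ads = [{"bid": 2.0, "ctr": 0.010}, {"bid": 3.0, "ctr": 0.006},
       {"bid": 1.5, "ctr": 0.009}]
print(run_auction(ads))  # winner pays 1.8 rather than its full 2.0 bid
```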
Response Prediction: Important Input for Optimization
 CTR of an ad format on some slot of a LinkedIn page
– E.g. CTR of 160 x 600 ad slot formats f160x600_exp_3_4, f160x600_exp_3_5, f160x600_exp_3_6 (or f300x250_exp_2_10 for a 300 x 250 slot)
 CTR of an ad on some position for a selected ad format
Counting clicks and views using a moving window
 Estimate the CTR of each format for each page type:
– CTR_{page type, format} = clicks_{page type, format} / views_{page type, format}
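A toy sketch of such a moving-window counter (structure assumed; class and method names are hypothetical, with per-time-bucket counts over a sliding window):

```python
from collections import deque

# Per-(page type, format) clicks/views over the last `window` time buckets.
class MovingWindowCTR:
    def __init__(self, window=24):
        self.buckets = deque([{}], maxlen=window)

    def roll(self):                       # start a new time bucket
        self.buckets.append({})

    def record(self, page_type, fmt, clicked):
        entry = self.buckets[-1].setdefault((page_type, fmt), [0, 0])
        entry[0] += int(clicked)          # clicks
        entry[1] += 1                     # views

    def ctr(self, page_type, fmt):
        clicks = sum(b.get((page_type, fmt), (0, 0))[0] for b in self.buckets)
        views = sum(b.get((page_type, fmt), (0, 0))[1] for b in self.buckets)
        return clicks / views if views else None
```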
Onboarding new formats, avoiding starvation
Explore/exploit dilemma
The same format statistics under four traffic-allocation schemes (V3.0 is a newly onboarded format with no observations; its CTR is an initial estimate):

Format  Clicks  Views       CTR    Traffic: ???  Random  Greedy  Softmax
V1.2    1,523   624,915     0.24%  ???           25%     0%      7.8%
V2.0    34,872  11,839,741  0.29%  ???           25%     0%      35.3%
V2.1    37,224  12,594,481  0.30%  ???           25%     100%    36.3%
V3.0    0       0           0.28%  ???           25%     0%      20.7%

Greedy (exploit-only) starves the new format V3.0; random (explore-only) wastes traffic on the weak V1.2; softmax explore/exploit concentrates traffic on strong formats while still exploring V3.0.
Explore/exploit
 Exploit-only (greedy)
– Maximizes performance given current knowledge
– Cannot adapt to a changing environment
 Explore-only (random)
– Serves old & new, good & bad formats evenly
– Ignores our knowledge about performance
 Explore/exploit (softmax)
– Exploits formats that are known to be good, but explores those that could potentially be good
– Profits from current knowledge while continuing to learn
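A minimal sketch of softmax traffic allocation with an epsilon exploration floor, one plausible reading of the "epsilon-greedy + softmax" scheme named on the next slide (temperature and epsilon values are illustrative):

```python
import math
import random

# Softmax shares over CTR estimates, mixed with a uniform epsilon share.
def allocate_traffic(ctr_by_format, temperature=0.001, epsilon=0.05):
    formats = list(ctr_by_format)
    m = max(ctr_by_format.values())   # subtract max for numerical stability
    weights = [math.exp((ctr_by_format[f] - m) / temperature) for f in formats]
    total = sum(weights)
    n = len(formats)
    return {f: epsilon / n + (1 - epsilon) * w / total
            for f, w in zip(formats, weights)}

shares = allocate_traffic({"V1.2": 0.0024, "V2.0": 0.0029,
                           "V2.1": 0.0030, "V3.0": 0.0028})
fmt = random.choices(list(shares), weights=list(shares.values()))[0]
```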
Evaluating Explore/Exploit schemes
 We evaluated several explore/exploit techniques offline
– Offline replay based on precision@1 on randomized data
 Provides unbiased estimates of online performance (Langford et al., 2009); see the sketch after this list
 Candidates: Thompson sampling, UCB, softmax, and epsilon-greedy, each with moving-window counts and different training update frequencies (a few minutes, a few hours, daily)
 Epsilon-greedy + softmax, Thompson sampling, and UCB were among the most promising schemes
– Faster updates help with new formats initially; daily updates are fine if very few new formats are introduced into the system (as in our application)
– Segmenting by user attributes did not help much
 Softmax + epsilon-greedy, deployed on LinkedIn advertising, provided significant gains
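A sketch of the replay evaluator referenced above, assuming the logged data was collected under uniformly random format selection (Langford et al., 2009); the `policy` callable is hypothetical and maps a context to a chosen format:

```python
# Counting only events where the policy agrees with the log yields an
# unbiased estimate of the policy's online CTR on randomized data.
def replay_ctr(logged_events, policy):
    matched = clicks = 0
    for context, shown_format, clicked in logged_events:
        if policy(context) == shown_format:
            matched += 1
            clicks += int(clicked)
    return clicks / matched if matched else None
```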
CTR estimates for ads: Curse of dimensionality
Mitigating the curse
 What segments? Which ones are good?
– Too few coarse segments: fails to personalize
– Too many: curse of dimensionality, data sparseness
 Most segments have no clicks; is 0/5 == 0/50 == 0/5M?
 Pool data to mitigate sparseness
– E.g. 0/5 clicks/views for (profile page, ad 77, user from Palo Alto) can borrow strength from coarser segments such as users from Palo Alto (20/20,000) and visits on the profile page (40/1,000)
 Pooling with hundreds of millions of segments is challenging
– There are different ways to pool; we use logistic regression
Taming the curse of dimensionality: Logistic Regression
[Figure: + and - examples scattered over two feature dimensions (Dim 1 x Dim 2), with a logistic regression decision boundary separating them]
CTR Prediction Model for Ads
 Feature vectors
– Member feature vector: xi
– Campaign feature vector: cj
– Context feature vector: zk
 Model: a cold-start component (shared across campaigns) plus a warm-start per-campaign component
– Both can have L2 penalties
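The model equation on this slide did not survive extraction; below is a plausible reconstruction consistent with the cold-start + warm-start decomposition described here (the feature maps f, g and the exact form are assumptions, not the published formula):

```latex
% Cold-start term shared across campaigns plus a warm-start term
% specific to campaign j; both coefficient vectors may carry L2 penalties.
\operatorname{logit}\, p_{ijk}
  = \underbrace{\theta_w^{\top} f(x_i, c_j, z_k)}_{\text{cold-start component}}
  + \underbrace{\theta_{c_j}^{\top} g(x_i, z_k)}_{\text{warm-start per-campaign component}}
```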
Model Fitting
 Single machine (well understood)
– Conjugate gradient
– L-BFGS
– Trust region
– …
 Model training with large-scale data
– Cold-start component Θw is more stable
 Weekly/bi-weekly training is good enough
 However: it requires large-scale logistic regression (next slides)
– Warm-start per-campaign model Θc is more dynamic
 New items can be generated at any time; big loss if opportunities are missed
 Need to update the warm-start component as frequently as possible: per-item logistic regression given Θc
Large Scale Logistic Regression: Computational Challenge
 Hundreds of millions/billions of observations
 Hundreds of thousands/millions of covariates
 Fitting a logistic regression model on a single machine is not feasible
 Model fitting is iterative, using methods like gradient descent, Newton’s method, etc.
– Multiple passes over the data
 Problem: find x to minimize F(x)
– Iteration n: x_n = x_{n-1} - b_{n-1} F′(x_{n-1})
– b_{n-1} is the step size, which can change every iteration
– Iterate until convergence
– Conjugate gradient, L-BFGS, Newton trust region, …
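A toy scalar version of this iteration (illustrative only; the production fitters are conjugate gradient, L-BFGS, and the like):

```python
# Gradient descent: x_n = x_{n-1} - b_{n-1} * F'(x_{n-1}), decaying step size.
def minimize_gd(grad_F, x0, step=0.1, decay=0.99, tol=1e-8, max_iter=10_000):
    x = x0
    for _ in range(max_iter):
        x_new = x - step * grad_F(x)
        if abs(x_new - x) < tol:
            break
        x, step = x_new, step * decay
    return x

# Example: F(x) = (x - 3)^2, F'(x) = 2 (x - 3); converges to x = 3.
print(minimize_gd(lambda x: 2 * (x - 3), x0=0.0))
```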
Compute using Map-Reduce
 Big Data is split into Partition 1 … Partition N
 Mapper 1 … Mapper N process the partitions and emit <Key, Value> pairs
 Reducer 1 … Reducer M aggregate the pairs and write the outputs
Large Scale Logistic Regression
 Naïve:
– Partition the data and run logistic regression for each partition
– Take the mean of the learned coefficients
– Problem: not guaranteed to converge to the global solution
 Alternating Direction Method of Multipliers (ADMM)
– Boyd et al. 2011
– Set up constraints: each partition’s coefficients = global consensus
– Solve the optimization problem using Lagrange multipliers
– Advantage: converges to the global solution
Large Scale Logistic Regression via ADMM
 Each iteration: BIG DATA is split into Partition 1 … Partition K, a logistic regression is fit on each partition in parallel, and a consensus computation combines the results
 Repeat (Iteration 1, Iteration 2, …) until convergence
Large Scale Logistic Regression via ADMM
 Notation
– (Xi , yi): data in the ith partition
– βi: coefficient vector for partition i
– β: consensus coefficient vector
– r(β): penalty component such as ||β||₂²
 Optimization problem: minimize Σ_{i=1..K} ℓ(βi; Xi, yi) + r(β) subject to βi = β for all partitions i
 ADMM updates
– Local regressions: refit each βi with shrinkage towards the current best global estimate β
– Updated consensus: recompute β from the local βi and the dual variables, then update the duals (Boyd et al. 2011)
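A compact sketch of these consensus-ADMM updates for L2-regularized logistic regression (illustrative, following the scheme in Boyd et al. 2011; not the production trainer):

```python
import numpy as np
from scipy.optimize import minimize

# Labels y are in {-1, +1}; each partition fits locally with shrinkage
# towards the consensus z, then the consensus and duals are updated.
def local_fit(X, y, beta0, z, u, rho):
    def obj(b):
        loss = np.logaddexp(0.0, -y * (X @ b)).sum()     # logistic loss
        return loss + 0.5 * rho * np.sum((b - z + u) ** 2)
    return minimize(obj, beta0, method="L-BFGS-B").x

def admm_logreg(partitions, dim, rho=1.0, lam=1.0, n_iter=20):
    K = len(partitions)
    betas = [np.zeros(dim) for _ in range(K)]
    us = [np.zeros(dim) for _ in range(K)]
    z = np.zeros(dim)                                    # consensus coefficients
    for _ in range(n_iter):
        # Local regressions (parallelizable across partitions/mappers).
        betas = [local_fit(X, y, b, z, u, rho)
                 for (X, y), b, u in zip(partitions, betas, us)]
        # Consensus update; closed form for r(beta) = (lam/2) ||beta||^2.
        avg = sum(b + u for b, u in zip(betas, us)) / K
        z = (K * rho) * avg / (lam + K * rho)
        # Dual updates.
        us = [u + b - z for u, b in zip(us, betas)]
    return z
```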
ADMM at LinkedIn
 Lessons and Improvements
– Initialization is important (ADMM-M)
 Use the mean of the partitions’ coefficients
 Reduces number of iterations by 50%
– Adaptive step size (learning rate) (ADMM-MA)
 Exponential decay of learning rate
– Together, these optimizations reduce training time from 10h to 2h
Explore/Exploit with Logistic Regression
[Figure: + and - training examples with two decision boundaries: the COLD START line and the COLD + WARM START line for an ad-id, alongside the posterior of the warm-start coefficients]
 E/E: sample a line from the posterior of the warm-start coefficients (Thompson sampling)
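A sketch of this Thompson sampling step, assuming a Gaussian approximation N(mu, Sigma) to the warm-start posterior for one campaign (all names and values are illustrative):

```python
import numpy as np

# Score an ad with a *sampled* CTR instead of the posterior mean, so
# uncertain (new) campaigns occasionally win the auction and gather data.
def sampled_ctr(cold_score, warm_features, mu, Sigma, rng):
    theta = rng.multivariate_normal(mu, Sigma)   # one draw ~ "sample a line"
    logit = cold_score + warm_features @ theta
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
ctr = sampled_ctr(-4.0, np.array([0.5, 1.0]), np.zeros(2), 0.2 * np.eye(2), rng)
```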
Models Considered
 CONTROL: per-campaign CTR counting model
 COLD-ONLY: only cold-start component
 LASER: our model (cold-start + warm-start)
 LASER-EE: our model with Explore-Exploit using Thompson
sampling
Metrics
 Model metrics
– Test Log-likelihood
– AUC/ROC
– Observed/Expected ratio
 Business metrics (Online A/B Test)
– CTR
– CPM (Revenue per impression)
Observed / Expected Ratio
 Observed: #Clicks in the data
 Expected: Sum of predicted CTR for all impressions
 Not a “standard” classifier metric, but in many ways more useful for
this application
 What we usually see: Observed / Expected < 1
– Quantifies the “winner’s curse” aka selection bias in auctions
 When choosing from among thousands of candidates, an item with
mistakenly over-estimated CTR may end up winning the auction
 Particularly helpful in spotting inefficiencies by segment
– E.g. by bid, number of impressions in training (warmness), geo, etc.
– Allows us to see where the model might be giving too much weight to
the wrong campaigns
 High correlation between O/E ratio and model performance online
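For concreteness, a small sketch of the O/E computation by segment (the record layout is an assumption):

```python
from collections import defaultdict

# Each record is (segment, clicked, predicted_ctr) for one impression.
def oe_by_segment(records):
    observed, expected = defaultdict(int), defaultdict(float)
    for segment, clicked, p in records:
        observed[segment] += int(clicked)
        expected[segment] += p
    return {s: observed[s] / expected[s] for s in expected if expected[s] > 0}
# Ratios well below 1 in a segment flag over-predicted CTRs there
# (the winner's curse).
```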
Offline: ROC Curves
[Figure: ROC curves, true positive rate vs. false positive rate; AUC: CONTROL 0.672, COLD-ONLY 0.757, LASER 0.778]
Online A/B Test
 Three models
– CONTROL (10%)
– LASER (85%)
– LASER-EE (5%)
 Segmented analysis
– 8 segments by campaign warmness
 Degree of warmness: the number of training samples available for the campaign
 Segment #1: campaigns with almost no data in training
 Segment #8: campaigns served most heavily in previous batches, so their CTR estimates can be quite accurate
Daily CTR Lift Over Control
[Figure: daily percentage CTR lift over CONTROL for LASER and LASER-EE, Day 1 through Day 7; both models show positive lift every day (exact percentages redacted as “+%”)]
Daily CPM Lift Over Control
[Figure: daily percentage eCPM lift over CONTROL for LASER and LASER-EE, Day 1 through Day 7; both models show positive lift every day (exact percentages redacted as “+%”)]
CPM Lift By Campaign Warmness Segments
[Figure: percentage CPM lift over CONTROL for LASER and LASER-EE across campaign warmness segments 1-8; lifts range from negative to positive across segments (exact percentages redacted)]
O/E Ratio By Campaign Warmness Segments
[Figure: observed clicks / expected clicks (0.5 to 1.0) for CONTROL, LASER, and LASER-EE across campaign warmness segments 1-8]
Number of Campaigns Served: Improvement from E/E
Insights
 Overall performance
– LASER and LASER-EE are both much better than CONTROL
– LASER and LASER-EE performance is very similar
 Segmented analysis by campaign warmness
– Segment #1 (very cold)
 LASER-EE much worse than LASER due to its exploration
 LASER much better than CONTROL due to cold-start features
– Segments #3-#5
 LASER-EE significantly better than LASER
 The winner’s curse hit LASER
– Segments #6-#8 (very warm)
 LASER-EE and LASER are equivalent
 Number of campaigns served
– LASER-EE serves significantly more campaigns than LASER
– Provides a healthier marketplace
Theory vs. Practice
 Textbook: data is stationary → Reality: features and items change constantly
 Textbook: training data is clean → Reality: fraud, bugs, tracking delays, online/offline inconsistencies, etc.
 Textbook: training is hard, testing and inference are easy → Reality: all aspects have challenges at web scale
 Textbook: models don’t change → Reality: never-ending processes of improvement
 Textbook: complex algorithms work best → Reality: simple models with good features and lots of data win
Solutions to Practical Problems
 Rapid model development cycle
– Quick reaction to changes in data, product
– Write once for training, testing, inference
 Can adapt to changing data
– Integrated Thompson sampling explore/exploit
– Automatic training
– Multiple training frequencies for different parts of model
 Good tools yield good models
– Reusable components for feature extraction and transformation
– Very high-performance inference engine for deployment
– Modelers can concentrate on building models, not re-writing common
functions or worrying about production issues
Summary
 Reducing dimension through logistic regression, coupled with explore/exploit schemes like Thompson sampling, is an effective way to solve response prediction problems in advertising
 Partitioning model components into a cold-start (stable) part and a warm-start (non-stationary) part with different training frequencies is an effective way to scale computations
 ADMM, with a few modifications, is an effective model-training strategy for large, high-dimensional data
 These methods work well for LinkedIn advertising and yield significant improvements
Collaborators
 I wouldn’t be here without them; I am extremely lucky to work with such talented individuals
Liang Zhang
Jonathan Traupman
Romer Rosales
Bo Long
Doris Xin
Current Work
 Investigating Spark and various other fitting algorithms
– Promising results, ADMM still looks good on our datasets
 Stream Ads
– Multi-response prediction (clicks, shares, likes, comments)
– Filtering low quality ads extremely important
 Revenue/Engagement tradeoffs (Pareto optimal solutions)
 Stream Recommendation
– Holistic solution to both content and ads on the stream
 Large scale ML infrastructure at LinkedIn
– Powers several recommendation systems
We are hiring!
 Interns for summer 2014 (contact me: dagarwal@linkedin.com)
 Full-time
– Graduating PhDs, experienced researchers
Backup slides
LASER Configuration
 Feature processing pipeline
– Sources: transform external data into feature vectors
– Transformers: modify/combine feature vectors
– Assembler: packages feature vectors for training/inference
 Configuration language
– Model structure can be changed extensively
– Library of reusable components
– Train, test, and deploy models without any code changes
– Speeds up the model development cycle
LASER Transformer Pipeline
 Request (user profile, item, context) → User Source and Item Source → Subset transformers → Interaction transformer → Assembler → Training or Inference
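A conceptual sketch of this pipeline in plain Python (component names are illustrative; the real pipeline is driven by the configuration language, not hand-written code):

```python
# Sources build feature vectors, transformers subset and cross them,
# and the assembler packages everything for training or inference.
def user_source(request):
    return {"user:region=" + request["region"]: 1.0}

def item_source(request):
    return {"item:id=%d" % request["ad_id"]: 1.0}

def subset(features, prefix):
    return {k: v for k, v in features.items() if k.startswith(prefix)}

def interaction(a, b):
    return {ka + "&" + kb: va * vb
            for ka, va in a.items() for kb, vb in b.items()}

def assemble(*vectors):
    out = {}
    for v in vectors:
        out.update(v)
    return out

request = {"region": "US", "ad_id": 77}
u, i = user_source(request), item_source(request)
features = assemble(u, i, interaction(subset(u, "user:"), subset(i, "item:")))
# `features` now feeds either training or the inference engine.
```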
LASER Performance
 Real time inference
– About 10µs per inference (1500 ads = 15ms)
– Reacts to changing features immediately
 “Better wrong than late”
– If a feature isn’t immediately available, back off to prior value
 Asynchronous computation
– Actions that block or take time run in background threads
 Lazy evaluation
– Sources & transformers do not create feature vectors for all items
– Feature vectors are constructed/transformed only when needed
 Partial results cache
– Logistic regression inference is a series of dot products
– Scalars are small; cache can be huge
– Hardware-like implementation to minimize locking and heap pressure
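A toy sketch of such a partial-results cache; the production version is described as hardware-like (minimal locking and heap pressure), while this only shows the idea of caching per-(user, ad) dot products as small scalars:

```python
# The per-(user, ad) piece of a logistic regression score is one dot
# product, i.e. a single float, so millions of entries fit in memory.
class PartialDotCache:
    def __init__(self):
        self._cache = {}                         # (user_key, ad_id) -> float

    def score(self, user_key, user_vec, ad_id, ad_weights):
        key = (user_key, ad_id)
        if key not in self._cache:
            self._cache[key] = sum(ad_weights.get(f, 0.0) * v
                                   for f, v in user_vec.items())
        return self._cache[key]
```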