Statistical Challenges in Online Advertising Deepak Agarwal Deepayan Chakrabarti

Statistical Challenges in Online
Advertising
Deepak Agarwal
Deepayan Chakrabarti
(Yahoo! Research)
Research
© 2008 Yahoo!
Online Advertising
• Multi-billion dollar industry, high growth
– $9.7B in 2006 (17% increase), total $150B
• Why will this continue?
– Broadband cheap, ubiquitous
– “Getting things done” easier on the internet
– Advertisers shifting dollars
• Why does it work?
– Massive scale, automated, low marginal cost
– Key: Monetize more and better, “learn from data”
– New discipline “Computational Advertising”
What is “Computational Advertising”?
New scientific sub-discipline, at the intersection of
– Large scale search and text analysis
– Information retrieval
– Statistical modeling
– Machine learning
– Optimization
– Microeconomics
Online advertising: 6000 ft Overview
[Diagram: Advertisers supply Ads to the Ad Network; the Ad Network picks ads to show alongside the Content Provider's Content; the User sees both]
• Examples of ad networks: Yahoo, Google, MSN, RightMedia, …
Outline
• Background on online advertising
– Sponsored Search, Content Match, Display, Unified
marketplace
• The Fundamental Problem
• Statistical sub-problems:
– Description
– Existing methods
– Challenges
Different flavors
Online Advertising
• Revenue Models
– CPM
– CPC
– CPA
• Advertising Setting
– Display
– Content Match
– Sponsored Search
• Misc.
– Ad exchanges
Revenue Models
• CPM: Cost Per iMpression
• CPC: Cost Per Click
• CPA: Cost Per Action
[Diagram: the ad-network flow from the overview, with money ($, $$) changing hands when an ad is shown (CPM), clicked (CPC), or leads to an action (CPA)]
• Example: Suppose we show an ad N times on the same spot
• Under CPM: Revenue = N * CPM
• Under CPC: Revenue = N * CTR * CPC
– CTR: Click-through Rate (probability of a click given an impression)
– CPC: Depends on auction mechanism
Auction Mechanism
• Revenue depends on type of auction
– Generalized First-price:
• CPC = bid on clicked ad
– Generalized Second-price:
• CPC = bid of ad below clicked ad (or the reserve price)
• CPC could be modified by additional factors
• [Optimal Auction Design in a Multi-Unit Environment: The
Case of Sponsored Search Auctions] by Edelman+/2006
• [Internet Advertising and the Generalized Second Price
Auction…] by Edelman+/2006
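The two pricing rules above can be sketched as follows. This is a minimal illustration, assuming ads are ranked by bid alone and ignoring the "additional factors"; the function name `cpc_prices` and the reserve value are hypothetical, not from the talk.

```python
def cpc_prices(bids, reserve=0.05, rule="second"):
    """Return (ranked_bids, cpc_per_slot) for ads ranked by bid, highest first."""
    ranked = sorted(bids, reverse=True)
    prices = []
    for pos, bid in enumerate(ranked):
        if rule == "first":
            # Generalized first-price: a clicked ad pays its own bid.
            prices.append(bid)
        else:
            # Generalized second-price: a clicked ad pays the bid of the ad
            # below it, or the reserve price if no ad is below.
            below = ranked[pos + 1] if pos + 1 < len(ranked) else reserve
            prices.append(max(below, reserve))
    return ranked, prices


ranked, gsp = cpc_prices([0.50, 0.75, 0.60])
_, gfp = cpc_prices([0.50, 0.75, 0.60], rule="first")
```

Under the second-price rule the top ad pays less than its own bid, which is why the CPC in the revenue formula depends on the auction mechanism.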
Revenue Models
• Example: Suppose we show an ad N times on the same spot
• Under CPM: Revenue = N * CPM
• Under CPC: Revenue = N * CTR * CPC
• Under CPA: Revenue = N * CTR * Conv. Rate * CPA
– Conv. Rate: Conversion Rate (probability of a user conversion on the advertiser’s landing page given a click)
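The three revenue formulas can be checked with a few lines of code. This is a sketch only: the function name and all numbers are made-up illustrations, and, as on the slide, the CPM price is treated as a per-impression price.

```python
def expected_revenue(n, model, price, ctr=None, conv_rate=None):
    """Expected revenue from showing an ad n times under one pricing model."""
    if model == "CPM":
        return n * price                      # Revenue = N * CPM
    if model == "CPC":
        return n * ctr * price                # Revenue = N * CTR * CPC
    if model == "CPA":
        return n * ctr * conv_rate * price    # Revenue = N * CTR * Conv. Rate * CPA
    raise ValueError(f"unknown model: {model}")


# Hypothetical numbers: 10,000 impressions, 1% CTR, 5% conversion rate.
rev_cpm = expected_revenue(10_000, "CPM", 0.002)
rev_cpc = expected_revenue(10_000, "CPC", 0.50, ctr=0.01)
rev_cpa = expected_revenue(10_000, "CPA", 20.0, ctr=0.01, conv_rate=0.05)
```

Each step from CPM to CPC to CPA adds one more quantity (CTR, then conversion rate) that has to be estimated from data.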
Revenue Models
• Revenue dependence, by model:
– CPM: website traffic
– CPC: website traffic + ad relevance
– CPA: website traffic + ad relevance + landing page quality
• Also compared across CPM, CPC, CPA (shown graphically on the slide): Relevance to advertisers; Prices and Bids; Ease of picking ads
Advertising Setting
• What do you show the user?
• How does the user interact with the ad system?
[Diagram: the ad-network flow from the overview: Advertisers, Ads, Ad Network (picks ads), Content Provider, Content, User]
Advertising Setting: Display
[Screenshot: a web page with display ads; the ad network picks the ads]
• Graphical display ads
• Mostly for brand awareness
• Revenue model is typically CPM
Advertising Setting: Content Match
[Screenshot: a web page showing a content match ad]
• Text ads
• Pick ads: Match ads to the content
• The user intent is unclear
• Revenue model is typically CPC
• Query (webpage) is long and noisy
Advertising Setting: Sponsored Search
[Screenshot: a search results page showing the Search Query and the Sponsored Search Ads]
• Text ads
• Pick ads: Match ads to the query
• User “declares” his/her intention
• Click rates generally higher than for Content Match
• Revenue model is typically CPC (recently some CPA)
• Query is short and less noisy than Content Match
Summary
• Different revenue models
– Depends on the goal of the advertiser campaign
• Brand awareness
– Display advertising
– Pay per impression (CPM)
• Attracting users to advertised product
– Content Match, Sponsored Search
– Pay per click (CPC), Pay per action (CPA)
Unified Marketplace
• Publishers, Ad-networks, advertisers participate
together in a single exchange
• Publishers put impressions in the exchange;
advertisers/ad-networks bid for it
• CPM, CPC, CPA are all integrated into a single
auction mechanism
Overview: The Open Exchange
[Diagram: a publisher has an ad impression to sell and AUCTIONS it on the exchange. Bidders (labels include Ad.com and AdSense) bid $0.50, $0.60, and $0.65; a further $0.75 bid arrives via a network and becomes a $0.45 bid. The $0.65 bid WINS.]
• Transparency and value
Unified scale: Expected CPM
• Campaigns are CPC, CPA, CPM
• They may all participate in an auction together
• Converting to a common denomination is a challenge
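One way to put the three campaign types on a common scale is expected revenue per impression, following the revenue formulas earlier in the deck. The sketch below assumes the CTR and conversion-rate estimates are already available; the campaign records and field names are hypothetical.

```python
def expected_value_per_impression(campaign):
    """Convert a CPM, CPC, or CPA bid into expected revenue per impression."""
    kind, bid = campaign["type"], campaign["bid"]
    if kind == "CPM":
        return bid                                  # paid on every impression
    if kind == "CPC":
        return bid * campaign["ctr"]                # paid only on a click
    if kind == "CPA":
        return bid * campaign["ctr"] * campaign["conv_rate"]  # paid on a conversion
    raise ValueError(f"unknown campaign type: {kind}")


# Hypothetical campaigns competing for one impression.
campaigns = [
    {"name": "brand", "type": "CPM", "bid": 0.004},
    {"name": "shoes", "type": "CPC", "bid": 0.50, "ctr": 0.01},
    {"name": "loans", "type": "CPA", "bid": 20.0, "ctr": 0.01, "conv_rate": 0.02},
]
winner = max(campaigns, key=expected_value_per_impression)
```

The conversion is only as good as the CTR and conversion-rate estimates, which is where the statistical challenge lies.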
Outline
• Background on online advertising
• The Fundamental Problem
– Display advertising
– Sponsored Search and Content Match
• Statistical sub-problems:
– Description
– Existing methods
– Challenges
Display Advertising
• Main goal of advertisers: Brand Awareness
• Revenue Model: Primarily Cost per impression (CPM)
• Traditional Advertising Model:
1. Ads are targeted at particular demographics (user characteristics)
– GM ads on Y! autos shown to “males above 55”
– Mortgage ad shown to “everybody on Y! Front page”
2. Book a slot well in advance
– “2M impressions in Jan next year”
– These future impressions must be guaranteed by the ad network
Display Advertising
• Fundamental Problem: Guarantee impressions to advertisers
1. Predict Supply:
• How many impressions will be available?
• Demographics overlap
2. Predict Demand:
• How much will advertisers want each demographic?
[Venn diagram: overlapping demographics Young, US, Female, and Y! Mail, with impression counts in each region]
Display Advertising
• Fundamental Problem: Guarantee impressions to advertisers
1. Predict Supply
2. Predict Demand
3. Find the optimal allocation
• subject to supply and demand constraints
[Venn diagram: overlapping demographics Young, US, Female, and Y! Mail, with impression counts in each region]
Display Advertising
• Fundamental Problem: Guarantee impressions to
advertisers
1. Predict Supply
2. Predict Demand
3. Find the optimal allocation, subject to constraints
• Optimal in terms of what objective function?
Allocation through Optimization
• Optimal in terms of what objective function?
– E.g. Maximize value of remaining inventory
• Cherry-picks valuable inventory, saves it for later
– Fairness
• “Spreads the wealth” subject to constraints
[Diagram: supply pools i with supply si on one side, demands j with demand dj on the other; xij is the amount allocated from pool i to demand j]
Example
[Venn diagram: overlapping demographics Young, US, Female, and Y! Mail, with impression counts in each region]
• Supply Pools
– US, Y, nF: Supply = 2, Price = 1
– US, Y, F: Supply = 3, Price = 5
• Demand
– US & Y (2)
• How should we distribute impressions from the supply pools to satisfy this demand?
Example (Cherry-picking)
• Cherry-picking: Fulfill demands at least cost
• Supply Pools
– US, Y, nF: Supply = 2, Price = 1; (2) allocated to the demand
– US, Y, F: Supply = 3, Price = 5
• Demand
– US & Y (2)
• How should we distribute impressions from the supply pools to satisfy this demand?
Example (Fairness)
• Cherry-picking: Fulfill demands at least cost
• Fairness: Equitable distribution of available supply pools
• Supply Pools
– US, Y, nF: Supply = 2, Cost = 1; (1) allocated to the demand
– US, Y, F: Supply = 3, Cost = 5; (1) allocated to the demand
• Demand
– US & Y (2)
• How should we distribute impressions from the supply pools to satisfy this demand?
Objective functions
• Maximize $\sum_j V_j y_j$
– $y_j$: remaining inventory for pool $j$
– $V_j$: Value of pool $j$
• "Fairness"
– $\tilde{x}_{jk} = (x_j w_j / X_w)\, d_k$, with $X_w = \sum_{j : S_j \subseteq S_k} x_j w_j$ (the sum runs over the pools $j$ that can serve demand $k$)
– e.g. $w_j = 1 \Rightarrow$ proportional allocation
– In general, $w_j$ can be a monotonically decreasing function of value $V_j$
– Objective function: Minimize $\sum_k \lambda_k \sum_j x_{jk} \log(x_{jk} / \tilde{x}_{jk})$, with per-demand weights $\lambda_k \ge 0$
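The two allocation styles can be compared on the two-pool example. This is a rough sketch, assuming $x_j$ in the fairness formula is the supply of pool $j$ and that both pools are eligible for the demand; the function names are illustrative. With $w_j = 1$ the proportional rule gives 0.8 and 1.2, not the even 1/1 split drawn in the fairness example.

```python
def cherry_pick(pools, demand):
    """Fill the demand from the cheapest eligible pools first."""
    alloc, remaining = {}, demand
    for name, (supply, price) in sorted(pools.items(), key=lambda kv: kv[1][1]):
        take = min(supply, remaining)
        alloc[name] = take
        remaining -= take
    return alloc


def proportional(pools, demand, weight=lambda price: 1.0):
    """Fair target x~_jk = (x_j * w_j / X_w) * d_k, with X_w = sum_j x_j * w_j."""
    weighted = {name: supply * weight(price)
                for name, (supply, price) in pools.items()}
    total = sum(weighted.values())
    return {name: demand * w / total for name, w in weighted.items()}


pools = {"US,Y,nF": (2, 1), "US,Y,F": (3, 5)}   # name -> (supply, price)
cheap = cherry_pick(pools, demand=2)            # all from the cheap pool
fair = proportional(pools, demand=2)            # w_j = 1: proportional to supply
```

Passing a `weight` function that decreases with price moves the fair target toward the cheaper pool, in the spirit of the "monotonically decreasing function of value" remark.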
Display Advertising
• Fundamental Problem: Guarantee impressions to
advertisers
1. Predict Supply
2. Predict Demand
3. Find the optimal allocation, subject to constraints
– Pick the right objective function
• Further issues:
– Risk Management: Supply and demand forecasts
should have both mean and variance
– Forecast aggregation: Forecasts may be needed over
multiple resolutions, in time and in demographics
• Forecasting accuracy is critical!
– Overshoot → under-delivery of impressions → unhappy advertisers
– Undershoot → loss in revenue
Sponsored Search and Content Match
• Given a query:
– Select the top-k ads to be shown on the k slots to
maximize total expected revenue
• What is total expected revenue?
Example (Content Match)
[Screenshot: a web page with content match ads in Ad Position 1, Ad Position 2, and Ad Position 3]
Reminder: Auction Mechanism
• Revenue depends on type of auction
– Generalized First-price:
• CPC = bid on clicked ad
– Generalized Second-price:
• CPC = bid of ad below clicked ad (or the reserve price)
• CPC could be modified by additional factors
• Total expected revenue = revenue obtained in a given
time window
• [Optimal Auction Design in a Multi-Unit Environment: The
Case of Sponsored Search Auctions] by Edelman+/2006
• [Internet Advertising and the Generalized Second Price
Auction…] by Edelman+/2006
Sponsored Search and Content Match
• Given a query:
– Select the top-k ads to be shown on the k slots to
maximize total expected revenue
• What affects the total revenue?
– Relevance of the ad to the query
– Bids on the ads
– User experience on the ad landing page (ad “quality”)
– Expected total revenue is some function of these.
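A deliberately simplified instance of "some function of these": rank ads by bid times estimated CTR and keep the top k. The sketch assumes first-price payment and no position or slate effects; the ad records and field names are made up for illustration.

```python
def top_k_ads(ads, k):
    """Rank ads by expected revenue per impression (bid * CTR) and keep the k best."""
    ranked = sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)
    return ranked[:k]


# Hypothetical candidate ads for one query.
ads = [
    {"name": "a", "bid": 1.00, "ctr": 0.020},   # expected 0.020 per impression
    {"name": "b", "bid": 0.50, "ctr": 0.050},   # expected 0.025 per impression
    {"name": "c", "bid": 2.00, "ctr": 0.005},   # expected 0.010 per impression
]
slate = top_k_ads(ads, k=2)
slate_names = [ad["name"] for ad in slate]
```

The highest bidder ("c") does not make the slate here: a low CTR outweighs a high bid, which is why relevance estimation is central.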
• Fundamental Problem:
– Estimate relevance of the ad to the query
Ad Relevance
Computation
Overview
• Information Retrieval (IR)
– Techniques
– Challenges
• Machine Learning using Click Feedback
• Online Learning
IR-based ad matching
• “Why not use a search engine to match ads to
context?”
– Ads are the “documents”
– Context (user query or webpage content) is the
“query”
• Three broad approaches:
– Vector space models
– Probabilistic models
– Language models
• Open-source software is available:
– Lemur (www.lemurproject.org)
IR-based ad matching
• Vector space models:
– Each word/phrase in the vocabulary is a separate
dimension
– Each ad and query is a point in this vector space
– Example: cosine similarity
• Probabilistic models
• Language models
IR-based ad matching
• Q1: How can we score the goodness of an ad for a
context?
• Cosine similarity: $\cos(q, a) = \frac{q \cdot a}{\|q\|\,\|a\|}$, where $q$ is the query vector and $a$ is the ad vector
• Advantages:
– Simple and easy to interpret
– Normalizes for different ad and context lengths
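Cosine similarity over raw term counts can be sketched as below. Whitespace tokenization and unweighted counts are simplifying assumptions; a production system would typically use tf-idf weights.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)          # missing terms count as 0
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


score = cosine_similarity("cheap digital camera", "digital camera sale")
```

Dividing by both norms is what makes the score insensitive to ad and context length.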
IR-based ad matching
• Vector space models
• Probabilistic models:
– Predict, for every (ad, query) pair, the probability that
the ad is relevant to the query
– Example: Okapi BM25
• Language models
IR-based ad matching
• Q1: How can we score the goodness of an ad for a
context?
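• Okapi BM25 (standard form, summed over the terms $t$ of the query):
$\text{score}(q, a) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{(k_1 + 1)\, tf_{t,a}}{k_1 \left((1 - b) + b \cdot |a| / avgdl\right) + tf_{t,a}} \cdot \frac{(k_3 + 1)\, tf_{t,q}}{k_3 + tf_{t,q}}$
– $\text{IDF}(t)$: Inverse Document Frequency
– $tf_{t,a}$: Term Frequency in ad
– $|a| / avgdl$: Norm. document length
– $tf_{t,q}$: Term Frequency in query
– $k_1$, $b$, $k_3$: Parameters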
• Advantages:
– Different terms are weighted differently
– Tunable parameters
– Good performance
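A compact BM25 scorer, as a sketch of the components named above. The parameter defaults (k1, b, k3) and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are common choices assumed here, not values taken from the talk.

```python
import math
from collections import Counter


def bm25_scores(query, ads, k1=1.2, b=0.75, k3=8.0):
    """Score each ad (treated as a document) against the query with Okapi BM25."""
    docs = [Counter(ad.lower().split()) for ad in ads]
    n = len(docs)
    avg_len = sum(sum(d.values()) for d in docs) / n
    q_tf = Counter(query.lower().split())
    scores = []
    for d in docs:
        length = sum(d.values())
        score = 0.0
        for term, qf in q_tf.items():
            df = sum(1 for other in docs if term in other)
            if df == 0 or term not in d:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # inverse document frequency
            tf = d[term]
            # Term frequency in the ad, damped and normalized by ad length.
            tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
            # Term frequency in the query.
            q_part = qf * (k3 + 1) / (qf + k3)
            score += idf * tf_part * q_part
        scores.append(score)
    return scores


ads = ["cheap digital camera deals", "garden furniture sale", "digital photo frames"]
scores = bm25_scores("digital camera", ads)
```

The rarer term ("camera") contributes more than the common one ("digital"), which is the "different terms are weighted differently" advantage.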
IR-based ad matching
• Vector space models
• Probabilistic models
• Language models:
– Ads and queries are generated by statistical models of
how words are used in the language
– What statistical models can be used?
– How do we translate query and ad generation
probabilities into relevance?
IR-based ad matching
• What statistical models can be used?
– Bigram model
– Multinomial model
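• Given any ad or query, we can compute the parameter setting most likely to have generated the document
– Multinomial model: $P(d \mid \theta) = \frac{|d|!}{\prod_w tf_{w,d}!} \prod_w \theta_w^{tf_{w,d}}$
– $|d|$: Total length; $\theta_w$: Term probability (model parameters); $tf_{w,d}$: Term Frequency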
IR-based ad matching
How do we translate query and ad generation probabilities into relevance?
[Diagram: Query → Query params; Ad → Ad params]
• Method 1
– Compute most likely query and ad params
– Generate ad using query params
– High probability → high relevance
• Method 2
– Compute most likely query and ad params
– Generate query using ad params
– High probability → high relevance
• Method 3
– Compute most likely query and ad params
– Compute KL-divergence between params
– Low KL-divergence → high relevance
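Methods 2 and 3 can be sketched with a multinomial model. Pure maximum-likelihood parameters assign zero probability to unseen words, so this sketch adds additive smoothing over a shared vocabulary; the smoothing, the `alpha` value, and the function names are assumptions, not part of the talk.

```python
import math
from collections import Counter


def term_probs(text, vocab, alpha=0.1):
    """Smoothed multinomial parameters: (tf + alpha) / (length + alpha * |V|)."""
    tf = Counter(text.lower().split())
    total = sum(tf.values()) + alpha * len(vocab)
    return {w: (tf[w] + alpha) / total for w in vocab}


def log_likelihood(text, params):
    """Log-probability of generating `text` from `params`.
    The multinomial coefficient is dropped; it is constant for a fixed text."""
    return sum(math.log(params[w]) for w in text.lower().split())


def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


query = "digital camera"
ads = ["digital camera sale", "garden furniture sale"]
vocab = set(" ".join([query] + ads).lower().split())

q_params = term_probs(query, vocab)
ad_params = [term_probs(ad, vocab) for ad in ads]
method2 = [log_likelihood(query, p) for p in ad_params]    # generate query from ad params
method3 = [kl_divergence(q_params, p) for p in ad_params]  # distance between params
```

Both methods rank the camera ad above the furniture ad: it gives the query a higher likelihood and has a lower KL-divergence from the query model.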
IR-based ad matching
• New methods to combine syntactic and semantic
information
• For example, “A Semantic Approach to Contextual
Advertising” by Broder+/SIGIR/2007
– Words only provide syntactic clues
– Classify ads and queries into a common taxonomy
– Taxonomy matches provide semantic clues
Challenges of IR-based ad matching
• Word matches might not always work
Woes of word matching
Extract Topical info
Increases coverage,
more relevant match
Challenges of IR-based ad matching
• Word matches might not always work
• Works well for frequent words, what about rare
words? Long tail, big revenue impact.
– Remedy: Add more matching dimensions (phrase,…)
• Static, does not capture effect of external factors
– E.g. high interest in basketball page due to an event;
dies off after the event
– Click feedback a powerful way of capturing such latent
effects; difficult to do it through relevance only
• Relevance scores may not correspond to CTR; does
not provide estimates of expected revenue
Challenges of IR-based ad matching
• Heterogeneous corpus (query, ads). Single tfidf
scores not applicable.
• In content match, queries long and noisy
• Partial feedback does not work
– Not scalable
• Ads are small, relevance of landing page difficult to
determine (video, image, text)
Machine Learning using Click
Feedback
Overview
• Information Retrieval (IR)
• Machine Learning using Click Feedback
– Advantages and Challenges of Click Feedback
– Feature-based models
• Description
• Case Studies
– Hierarchical Models
– Matrix Factorization and Collaborative Filtering
– Challenges and Open Problems
• Online Learning
Learning from Click Feedback
• Learning relevance from partial human-labeled
training data
– Attractive but not scalable
• Users provide us direct feedback through ad clicks
– Low cost and automated learning mechanism
– Large amounts of feedback for big ad-networks
• Estimation problem:
– Estimate CTR = Pr(click| query, ad, user)
Learning from Clicks: Challenges
• Noisy labels
– Clicks (unscrupulous users gaming the system)
– Negatives (not clear; I never click on ads )
• Sparseness
– (query, ad) matrix has billions of cells; long tail
• Too few data points in large number of cells; MLE has high
variance
• Goal is to learn the best cells, not all cells
• Dynamic and seasonal effects
– CTRs evolve; subject to seasonal effects
• Summer, Halloween,..
• Palin ads popular yesterday, not today
Challenges continued
• Selection bias
– We never showed watch ads on golf pages
• Positional bias, presentation bias
– Same ad performs differently at different positions
• Slate bias
– Performance of ad depends on other ads that were
displayed
Feature based approach
• Query, Ad characterized by features
– Query: bag-of-words, phrases, topic,…
– Ads: bag-of-words, keywords, size,…
• Query feature vector: q
• Ad feature vector: a
• Pr(Click|Q,A) = f(q,a;θ)
• Example: Logistic regression
– log-odds(Pr(Click|Q,A)) = q’ W a
– W estimated from data
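The bilinear logistic model above can be written out directly. The feature vectors and the weight matrix below are made-up values for illustration; in practice W would be estimated from click data with regularization.

```python
import math


def click_probability(q, a, W):
    """Pr(click | Q, A) = sigmoid(q' W a) for feature vectors q, a and weights W."""
    log_odds = sum(q[i] * W[i][j] * a[j]
                   for i in range(len(q))
                   for j in range(len(a)))
    return 1.0 / (1.0 + math.exp(-log_odds))


q = [1.0, 0.0]                 # hypothetical query features
a = [0.0, 1.0, 1.0]            # hypothetical ad features
W = [[0.5, -1.0, 2.0],         # W[i][j]: weight for (query feature i, ad feature j)
     [0.0, 0.3, 0.1]]
p = click_probability(q, a, W)  # log-odds = -1.0 + 2.0 = 1.0
```

W has one entry per (query feature, ad feature) pair, so its size grows with the product of the two feature dimensions.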
Feature based models: Challenges
• Challenges
– High dimensional, need to regularize (Priors)
– De-bias for positional and slate effects
– Negative events to be weighted appropriately
• Go through case studies reported in literature
Predicting Clicks: Estimating the Click-through rates of new
ads: Richardson et al, WWW 2007
• Estimate CTR of new ads in Sponsored search
• Log-odds(CTR(ad)) = $\sum_i w_i f_i(\text{ad})$
• Features used:
– Bid term CTRs of related ads (from other accounts)
• CTRs of all other ads with keyword “camera”
– Appearance, attention, advertiser reputation, landing page quality, relevance of bid terms to ad, bag-of-words in ad.
• Does not capture interactions between (query, ad), main focus is to estimate CTR of new ads only
• Negative events down-weighted based on eye-tracking study
Combining relevance with Click Feedback, Chakrabarti et al,
WWW 08
• Content Match application
• CTR estimation for arbitrary (page, ad) pairs
• Features :
– Bag-of-words in query, ads; relevance scores from IR
– Cross-product of words: Occurs in both page and ad
• Learn to predict click data using such features
• Prediction function amenable to WAND algorithm
– Helps with fast retrieval at serve time
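Proposed Method
• A logistic regression model for CTR
$\text{logit}(\text{CTR}_{p,a}) = \sum_w \alpha_w M_{p,w} + \sum_w \beta_w M_{a,w} + \sum_w \delta_w I_{p,a,w}$
– $\alpha_w$, $\beta_w$, $\delta_w$: Model parameters
– $\sum_w \alpha_w M_{p,w}$: Main effect for page (how good is the page)
– $\sum_w \beta_w M_{a,w}$: Main effect for ad (how good is the ad)
– $\sum_w \delta_w I_{p,a,w}$: Interaction effect (words shared by page and ad)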
Proposed Method
• $M_{p,w} = tf_{p,w}$
• $M_{a,w} = tf_{a,w}$
• $I_{p,a,w} = tf_{p,w} \cdot tf_{a,w}$
• So, IR-based term frequency measures are taken
into account
Proposed Method
• Two sources of complexity
– Adding in IR scores
– Word selection for efficient learning
Proposed Method
• How can IR scores fit into the model?
– What is the relationship between logit(pij) and cosine score?
– Quadratic relationship
[Plot: logit(pij) against cosine score]
Proposed Method
• How can IR scores fit into the model?
• This quadratic relationship can be used in two ways
– Put in cosine and cosine² as features
– Use it as a prior
Proposed Method
• Word selection
– Overall, nearly 110k words in corpus
– Learning parameters for each word would be:
• Very expensive
• Require a huge amount of data
• Suffer from diminishing returns
– So we want to select ~1k top words which will have the
most impact
Proposed Method
• Word selection
– Data based:
• Define an interaction measure for each word
• Higher values for words which have higher-than-expected
CTR when they occur on both page and ad
Experiments
[Plot: precision against recall]
• 25% lift in precision at 10% recall
Regelson and Fain, 2006
• Estimate CTR of terms by “borrowing strength” at
multiple resolutions
• Hierarchical clustering of related terms
– Clustering advertiser keyword matrix
• Estimating CTR at finer resolutions by using
information at coarser resolutions
– Weighted average, more weight to finer resolutions
– Weights selected heuristically, no principled approach
Estimation in the “tail”
• A more principled approach to “Estimating Rates of
Rare Events at Multiple Resolutions” [KDD/2007]
• Contextual Advertising
– Show an ad on a webpage (“impression”)
– Revenue is generated if a user clicks
– Problem: Estimate the click-through rate (CTR) of an
ad on a page
• Most (ad, page) pairs have very few impressions, if any, and even fewer clicks
• Severe data sparsity
Estimation in the “tail”
• Use an existing, well-understood hierarchy
– Categorize ads and webpages to leaves of the
hierarchy
– CTR estimates of siblings are correlated
The hierarchy allows us to aggregate data
• Coarser resolutions
– provide reliable estimates for rare events
– which then influences estimation at finer resolutions
System overview
[Pipeline: Retrospective data [URL, ad, isClicked] → Crawl a sample of URLs → Classify pages and ads → Impute impressions, fix sampling bias → Rare event estimation using hierarchy]
Sampling of webpages
• Naïve strategy: sample at random from the set of
URLs
Sampling errors in impression volume AND click
volume
• Instead, we propose:
– Crawling all URLs with at least one click, and
– a sample of the remaining URLs
Variability is only in impression volume
Imputation of impression volume
[Matrix: rows are Page classes, columns are Ad classes]
• Each cell: #impressions = nij + mij + xij
– nij: Clicked pool
– mij: Sampled Non-clicked pool
– xij: Excess impressions (to be imputed)
• [column constraint] Each column sums to #impressions on ads of this ad class
• [row constraint] Each row sums to ∑nij + K.∑mij
• All cells sum to Total impressions (known)
Imputation of impression volume
• Region = (page node, ad node)
• Region Hierarchy
– A cross-product of the page hierarchy and the ad hierarchy
• [block constraint] The regions at Level i+1 that lie under one region at Level i sum to that region
[Diagram: region hierarchy from Level 0 through Level i to Level i+1]
Imputing xij
Iterative Proportional Fitting [Darroch+/1972]
• Initialize xij = nij + mij
• Iteratively scale xij values to match row/col/block constraint
• Ordering of constraints: top-down, then bottom-up, and repeat
[Diagram: a Level i region and its block at Level i+1]
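The scaling loop at the heart of Iterative Proportional Fitting can be sketched on a single matrix. This simplified version handles only row and column constraints, not the block constraints across hierarchy levels, and the targets are made-up numbers.

```python
def ipf(x, row_targets, col_targets, iters=100):
    """Alternately rescale rows and columns until both sets of sums match."""
    x = [row[:] for row in x]                      # work on a copy
    for _ in range(iters):
        for i, target in enumerate(row_targets):   # match row constraints
            s = sum(x[i])
            if s > 0:
                x[i] = [v * target / s for v in x[i]]
        for j, target in enumerate(col_targets):   # match column constraints
            s = sum(row[j] for row in x)
            if s > 0:
                for row in x:
                    row[j] *= target / s
    return x


start = [[1.0, 2.0], [3.0, 4.0]]                   # initial cell values
fitted = ipf(start, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
row_sums = [sum(row) for row in fitted]
col_sums = [sum(row[j] for row in fitted) for j in range(2)]
```

Each pass only rescales, so cells that start at zero stay at zero and the relative structure of the initial values is preserved.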
Imputation: Summary
• Given
– nij (impressions in clicked pool)
– mij (impressions in sampled non-clicked pool)
– # impressions on ads of each ad class in the ad
hierarchy
• We get
– Estimated impression volume
Ñij = nij + mij + xij
in each region ij of every level
System overview
[Pipeline: Retrospective data [page, ad, isclicked] → Crawl a sample of pages → Classify pages and ads → Impute impressions, fix sampling bias → Rare event estimation using hierarchy]
Rare rate modeling
1. Freeman-Tukey transform:
– yij = F-T(clicks and impressions at ij) ≈ transformed-CTR
– Variance stabilizing transformation: Var(y) is independent of E[y] → needed in further modeling
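One common form of the Freeman-Tukey transform for a rate is shown below. The exact variant used in the talk is not given on the slide, so this particular formula is an assumption.

```python
import math


def freeman_tukey(clicks, impressions):
    """Freeman-Tukey transformed CTR: average of sqrt(c/N) and sqrt((c+1)/N)."""
    return 0.5 * (math.sqrt(clicks / impressions)
                  + math.sqrt((clicks + 1) / impressions))


y_zero = freeman_tukey(0, 100)    # regions with zero clicks still get y > 0
y_four = freeman_tukey(4, 100)
y_nine = freeman_tukey(9, 100)
```

The "+1" term keeps zero-click regions away from the boundary, which matters when most regions have no clicks at all.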
Rare rate modeling
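2. Generative Model (Tree-structured Markov Model)
[Diagram: each region ij has an unobserved “state” Sij, linked to its parent's state Sparent(ij) with variance Wij; the observation yij depends on Sij and the covariates βij, with variance Vij. The parent region has its own yparent(ij), βparent(ij), Vparent(ij), and Wparent(ij).]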
Rare rate modeling
• Model fitting with a 2-pass Kalman
filter:
– Filtering: Leaf to root
– Smoothing: Root to leaf
• Linear in the
number of regions
Tree-structured Markov model
• $y_r \sim N(u_r^T \beta_{d(r)} + S_r,\; V_r)$
– $\beta_{d(r)}$: coefficient vector for covariates at level $d(r)$
– $S_r$: random effects, one per region (require smoothing)
• Markov Model
– $S_r = S_{pa(r)} + w_r$, with $w_r \sim N(0, W_r)$ and $w_r$ independent of $S_{pa(r)}$
– $S_{Root} = W_{Root} = 0$
– Smoothing: Depends on $W_r / V_r$
• $V_r = V / N_r$; $W_r = W_{d(r)}$
• $\text{Var}(S_r) = \sum_{i=1}^{d(r)} W_i$; $\text{Corr}(l, l')$ is a ratio of the cumulative sums $\sum_{i=1}^{l} W_i$ and $\sum_{i=1}^{l'} W_i$
Scalable Model fitting
Multi-resolution Kalman filter
• Posterior of states $\{S_r\}$:
– Kalman filter algorithm (Huang and Cressie, 2002)
• Algorithm is "essentially" linear in the number of regions
– Depends on number of parent regions
– At each parent region, O((# children regions)³) computation
• Two steps: Uptree filtering, downtree smoothing
• Variance components: ECME algorithm (Liu and Rubin, 1994)
Multi-Resolution Kalman filter: Mathematical overview
• Filtering (uptree) step:
– Update posterior of leaf nodes using standard Bayesian updates
– Invert the state equations: $S_{pa(r)} = B_r S_r + \psi_r$, where $\psi_r$ is a noise term and $B_r = \text{corr}(d(r), d(r) - 1)$
– Collect contribution of child for parent
– Combine information from children
– Recombine info available for parent
• Smoothing (downtree) step:
– Update info on children using info from parent
Experiments
• 503M impressions
• 7-level hierarchy of which the top 3 levels were used
• Zero clicks in
– 76% regions in level 2
– 95% regions in level 3
• Full dataset DFULL, and a 2/3 sample DSAMPLE
Experiments
• Estimate CTRs for all regions R in level 3 with zero
clicks in DSAMPLE
• Some of these regions R>0 get clicks in DFULL
• A good model should predict higher CTRs for R>0 as
against the other regions in R
Experiments
• We compared 4 models
– TS: our tree-structured model
– LM (level-mean): each level smoothed independently
– NS (no smoothing): CTR proportional to 1/Ñ
– Random: Assuming |R>0| is given, randomly predict
the membership of R>0 out of R
Experiments
[Plot: results, with the TS model labeled]
• Few impressions → Estimates depend more on siblings
• Enough impressions → little “borrowing” from siblings
Related Work
• Multi-resolution modeling
– studied in time series modeling and spatial statistics
[Openshaw+/79, Cressie/90, Chou+/94]
• Imputation
– studied in statistics [Darroch+/1972]
• Application of such models to estimation of such rare events (rates of ~$10^{-3}$) is novel
Summary
• A method to estimate
– rates of extremely rare events
– at multiple resolutions
– under severe sparsity constraints
• The method has two parts
– Imputation → incorporates hierarchy, fixes sampling bias
– Tree-structured generative model → extremely fast parameter fitting
Collaborative Filtering
• Collaborative filtering
– Similarity based methods
$r_{ui} = \sum_{j \in N(i)} s_{ij}\, r_{uj} \Big/ \sum_{j \in N(i)} s_{ij}$
– $r_{ui}$: Rating (CTR) for query $u$ of ad $i$
– $s_{ij}$: Ad-ad similarity matrix
– $N(i)$: Local neighborhood of ad $i$
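The similarity-weighted average can be sketched with nested dictionaries. The data layout, the names, and all numbers are illustrative assumptions; how the similarities are learned is a separate problem.

```python
def predict_ctr(u, i, ctr, sim, neighbors):
    """r_ui = sum_j s_ij * r_uj / sum_j s_ij over the neighborhood N(i) of ad i."""
    num = sum(sim[i][j] * ctr[u][j] for j in neighbors[i])
    den = sum(sim[i][j] for j in neighbors[i])
    return num / den if den else 0.0


# Hypothetical data: observed CTRs for query "q" on ads a1 and a2,
# and the similarity of a new ad a0 to each of them.
ctr = {"q": {"a1": 0.02, "a2": 0.04}}
sim = {"a0": {"a1": 1.0, "a2": 3.0}}
neighbors = {"a0": ["a1", "a2"]}
estimate = predict_ctr("q", "a0", ctr, sim, neighbors)
```

The estimate for the unseen ad is pulled toward the CTR of its most similar neighbor.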
Collaborative Filtering
• Collaborative filtering
– Similarity based methods: $r_{ui} = \sum_{j \in N(i)} s_{ij}\, r_{uj} \Big/ \sum_{j \in N(i)} s_{ij}$
– Possible adaptation: $\text{log-odds}(p_{qa}) = f(q, a; \theta) + z_{qa}$
• $f(q, a; \theta)$: Feature-based model
• $z_{qa} = \sum_{j \in N(a)} s_{aj}\, z_{qj} \Big/ \sum_{j \in N(a)} s_{aj}$: Collaborative filtering model
– Challenges:
• Learning similarity
• Simultaneously incorporating query and ad similarities
Matrix Factorization
• Matrix Factorization
– Each query (ad) is a linear combination of latent factors
– Solve for factors, under some regularization and constraints
$\text{log-odds}(p_{qa}) = f(q, a; \theta) + \sum_{k=1}^{r} u_{qk} v_{ak}$
– $u_{qk}$: Factor coefficients for query
– $v_{ak}$: Factor coefficients for ad
Matrix Factorization
• Matrix Factorization
$\text{log-odds}(p_{qa}) = f(q, a; \theta) + \sum_{k=1}^{r} u_{qk} v_{ak}$
• Bi-clustering
$\text{log-odds}(p_{qa}) = f(q, a; \theta) + z_{\rho(q), \gamma(a)}$
– $\rho(q)$: Query cluster; $\gamma(a)$: ad cluster
– Predictive Discrete latent factor models, Agarwal and Merugu, KDD 07.
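Both corrections to the feature-based log-odds can be sketched in a few lines. The factor values, cluster assignments, and function names are hypothetical; fitting them from click data is the hard part and is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mf_click_probability(f_qa, u_q, v_a):
    """log-odds = f(q, a; theta) + sum_k u_qk * v_ak."""
    return sigmoid(f_qa + sum(u * v for u, v in zip(u_q, v_a)))


def bicluster_click_probability(f_qa, q, a, query_cluster, ad_cluster, z):
    """log-odds = f(q, a; theta) + z[cluster of q][cluster of a]."""
    return sigmoid(f_qa + z[query_cluster[q]][ad_cluster[a]])


# Hypothetical values: the feature model says -3.0, the latent factors add +3.0.
p_mf = mf_click_probability(-3.0, u_q=[1.0, 0.5], v_a=[2.0, 2.0])

# Hypothetical clusters: one offset per (query cluster, ad cluster) block.
z = [[0.0, 1.0], [-1.0, 0.0]]
p_bc = bicluster_click_probability(0.0, "ipod", "dock", {"ipod": 0}, {"dock": 1}, z)
```

Matrix factorization gives each (query, ad) pair its own correction through r continuous factors, while bi-clustering shares one correction across every pair in the same block.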
Challenges of Feature-based models
• Learns from clicks but still misses context in many
instances as in relevance based approach
• Introducing features that are too granular makes it
hard to learn CTR reliably
• Does not capture the dynamics of the system
• Training cost is high
• Slow prediction functions inadmissible due to latency
constraints
Challenges of Feature-based models
• Other methods
– Boosting, Neural nets, Decision Trees, Random Forests, ……
• Local models
– Mixture of experts: Fit local, think global
$P(\text{click} \mid Q, A) = \sum_{k=1}^{L} \pi_k\, P_k(\text{click} \mid Q, A)$, with mixing weights $\pi_k$
• Hierarchical modeling with multiple trees
– User interest, query, ad,..
– Each tree is different
– How to perform smoothing with multiple disparate trees?
Challenges of Feature-based models
• Combining cold start with warm start is the main
challenge in collaborative filtering based methods
• We believe solving basic issues is more challenging
– Positional bias
– Selection bias
– Correlation in ads on a slate
– Dynamic CTR; seasonal variations
Online learning
Overview
• Information Retrieval (IR)
• Machine Learning using Click Feedback
• Online Learning
Online learning for ad matching
• All previous approaches learn from historical data
• This has several drawbacks:
– Slow response to emerging patterns in the data
• due to special events like elections, …
– Initial systemic biases are never corrected
• If the system has never shown “sound system dock” ads
for the “iPod” query, it can never learn if this match is
good
– System needs to be retrained periodically
Online learning for ad matching
• Solution: Combining exploitation with exploration
– Exploitation: Pick ads that are good according to
current model
– Exploration: Pick ads that increase our knowledge
about the entire space of ads
• Multi-armed bandits
– Background
– Applications to online advertising
– Challenges and Open Problems
Background: Bandits
Bandit “arms” with unknown payoff probabilities p1, p2, p3
• “Pulling” arm i yields a reward:
– reward = 1 with probability pi (success)
– reward = 0 otherwise (failure)
Background: Bandits
• Goal: Pull arms sequentially so as to maximize the
total expected reward
– Estimate payoff probabilities pi
– Bias the estimation process towards better arms
Background: Bandits
• An algorithm to sequentially pick the arms is called a
bandit policy
• Regret of a policy = how much extra payoff could be
gained in expectation if the best arm is always pulled
– Of course, the best arm is not known to the policy
– Hence, the regret is the price of exploration
– Low regret implies that the policy quickly converges to
the best arm
• What is the optimal policy?
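The regret definition above can be made concrete with a small simulation. This is a sketch under assumptions: arms are Bernoulli with hypothetical payoff probabilities, regret is accumulated in expectation (best payoff minus the payoff of the chosen arm), and the function and policy names are illustrative only.

```python
import random

def simulate_regret(probs, policy, horizon, seed=0):
    """Expected regret of `policy` against always pulling the best arm."""
    rng = random.Random(seed)
    k = len(probs)
    successes = [0] * k
    failures = [0] * k
    best = max(probs)  # known to the simulator, not to the policy
    regret = 0.0
    for _ in range(horizon):
        arm = policy(successes, failures, rng)
        regret += best - probs[arm]        # expected loss of this pull
        if rng.random() < probs[arm]:      # Bernoulli reward
            successes[arm] += 1
        else:
            failures[arm] += 1
    return regret

def uniform_policy(successes, failures, rng):
    # Pure exploration: ignores everything learnt so far.
    return rng.randrange(len(successes))

# Hypothetical payoff probabilities (e.g. CTRs of three ads).
probs = [0.10, 0.05, 0.02]
print(simulate_regret(probs, uniform_policy, horizon=1000))
```

A policy that always pulled the best arm would score zero here; the uniform policy pays the average gap on every pull, so its regret grows linearly in the horizon.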
Background: Bandits
• Which arm should be pulled next?
– Not necessarily what looks best right now, since it might
have had a few lucky successes
– Seems to depend on some complicated function of the
successes and failures of all arms
argmax g(s1, f1, s2, f2, …, sk, fk) ?
(si = number of successes of arm i, fi = number of failures of arm i)
Background: Bandits
• What is the optimal policy?
• Consider a bandit which
– has an infinite time horizon, but
– future rewards are geometrically discounted
$R_{total} = R(1) + \gamma R(2) + \gamma^2 R(3) + \dots \quad (0 < \gamma < 1)$
• Theorem [Gittins/1979]: The optimal policy
decouples and solves a bandit problem for each arm
independently
argmax g(s1, f1, s2, f2, …, sk, fk) ?
argmax {g1(s1, f1), g2(s2, f2), …, gk(sk, fk)}
Background: Bandits
• What is the optimal policy?
• Theorem [Gittins/1979]: The optimal policy
decouples and solves a bandit problem for each arm
independently
– Significantly reduces the dimension of the problem
space
– Gives a minimum regret bound of O(log T)
– But, the optimal functions gi(si, fi) are hard to compute
– Need approximate methods…
Background: Bandits
Bandit Policy (each arm carries a priority: Priority 1, Priority 2, Priority 3)
1. Assign priority to each arm
2. “Pull” arm with max priority, and observe reward (Allocation)
3. Update priorities (Estimation)
Background: Bandits
• One common policy is UCB1 [Auer/2002]
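$\text{Priority}_i = \frac{s_i}{s_i + f_i} + \sqrt{\frac{2 \log T}{s_i + f_i}}$
– $s_i$, $f_i$: number of successes and number of failures of arm i
– $T$: total number of observations; $s_i + f_i$: number of observations of arm i
– First term: observed payoff; second term: factor representing uncertainty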
Background: Bandits
Priority = observed payoff + factor representing uncertainty
• As total observations T becomes large:
– Observed payoff tends asymptotically towards the true
payoff probability
– The system never completely “converges” to one best
arm; only the rate of exploration tends to zero
Background: Bandits
• Sub-optimal arms are pulled O(log T) times
• Hence, UCB1 has O(log T) regret
• This is the lowest possible regret
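A minimal sketch of UCB1 on simulated Bernoulli arms follows. The priority is the standard UCB1 index (observed payoff plus an uncertainty bonus); the payoff probabilities, the horizon, and the function names are assumptions made for the example.

```python
import math
import random

def ucb1_priority(s, f, total):
    """UCB1 index of an arm with s successes and f failures after `total` pulls overall."""
    n = s + f
    if n == 0:
        return float("inf")  # every arm is pulled once before indices are compared
    return s / n + math.sqrt(2.0 * math.log(total) / n)

def run_ucb1(probs, horizon, seed=0):
    """Simulate UCB1; returns per-arm success and failure counts."""
    rng = random.Random(seed)
    k = len(probs)
    s = [0] * k
    f = [0] * k
    for _ in range(horizon):
        total = sum(s) + sum(f)  # pulls made so far
        arm = max(range(k), key=lambda i: ucb1_priority(s[i], f[i], total))
        if rng.random() < probs[arm]:
            s[arm] += 1
        else:
            f[arm] += 1
    return s, f

# Hypothetical payoff probabilities; arm 0 is the best arm.
s, f = run_ucb1([0.9, 0.1, 0.1], horizon=2000, seed=1)
print([a + b for a, b in zip(s, f)])  # pulls per arm
```

The uncertainty term shrinks as an arm accumulates observations, so poorly performing arms are revisited less and less often but never abandoned entirely.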
Background: Bandits
[Figure: Webpage 1, Webpage 2, Webpage 3, …, each with bandit “arms” = ads; ~10^9 pages, ~10^6 ads]
Background: Bandits
Content Match = A matrix (rows: webpages, columns: ads)
• Each row is a bandit
• Each cell has an unknown CTR
Background: Bandits
Why not simply apply a bandit policy directly to our
problem?
• Convergence is too slow
~10^9 bandits, with ~10^6 arms per bandit
• Additional structure is available that can help
→ Taxonomies
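A quick back-of-the-envelope calculation shows why convergence is too slow and how much a taxonomy can reduce the problem. The page and ad counts are the ones quoted above; the node counts (20 and 221) are the depth-1 and depth-2 sizes reported in the experiments later in the talk, and treating pages and ads as mapped to the same taxonomy is an assumption of this sketch.

```python
# Sizes quoted on the slides.
num_pages = 10**9
num_ads = 10**6
cells = num_pages * num_ads  # one unknown CTR per (page, ad) cell

# Assumption: pages and ads are both mapped to the same taxonomy, using the
# node counts reported for its depth-1 and depth-2 levels.
depth1_nodes, depth2_nodes = 20, 221
depth1_cells = depth1_nodes ** 2  # (page parent class, ad parent class) pairs
depth2_cells = depth2_nodes ** 2  # (page class, ad class) pairs

print(f"raw cells:     {cells:.1e}")
print(f"depth-1 cells: {depth1_cells}")
print(f"depth-2 cells: {depth2_cells}")
```

Even a single observation per raw cell would need on the order of 10^15 impressions, whereas the class-level matrices have only hundreds to tens of thousands of cells.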
Taxonomies for dimensionality reduction
[Figure: taxonomy tree with Root at the top, child nodes such as Apparel, Computers and Travel, and pages/ads mapped to nodes]
• Already exist
• Actively maintained
• Existing classifiers to map pages and ads to taxonomy nodes
A bandit policy that uses this structure can be faster
Outline
• Multi-level Bandit Policy for Content Match
• Experiments
• Summary
Multi-level Policy
[Figure: matrix with webpage classes as rows and ad classes as columns]
Consider only two levels
Multi-level Policy
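[Figure: webpage-by-ad matrix with rows and columns grouped under the parent classes Apparel, Computers and Travel; columns are labelled with ad parent classes and ad child classes; one group of cells is marked “Block” and one set of cells is marked “One bandit”]
Consider only two levels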
Multi-level Policy
Key idea: CTRs in a block are homogeneous
Multi-level Policy
• CTRs in a block are homogeneous
– Used in allocation (picking ad for
each new page)
– Used in estimation (updating
priorities after each observation)
Multi-level Policy (Allocation)
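[Figure: a page classifier maps the page to its class; rows and columns are grouped into parent classes labelled A, C, T; the ad parent class to pick is marked “?”]
• Classify webpage → page class, parent page class
• Run bandit on ad parent classes → pick one ad parent class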
Multi-level Policy (Allocation)
[Figure: page classifier and parent classes A, C, T; the final ad to pick is marked “?”]
• Classify webpage → page class, parent page class
• Run bandit on ad parent classes → pick one ad parent class
• Run bandit among cells → pick one ad class
• In general, continue from root to leaf → final ad
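The allocation steps above can be sketched for two levels as follows. This is an illustration, not the authors' implementation: it assumes UCB1 as the bandit at both levels, stores one top-level bandit per parent page class and one lower-level bandit per (page class, ad parent class) pair, and uses hypothetical class names.

```python
import math

class UCB1Bandit:
    """UCB1 over a fixed set of arms, tracking clicks and pulls per arm."""
    def __init__(self, arms):
        self.arms = list(arms)
        self.clicks = {a: 0 for a in self.arms}
        self.pulls = {a: 0 for a in self.arms}

    def pick(self):
        total = sum(self.pulls.values())
        def priority(a):
            n = self.pulls[a]
            if n == 0:
                return float("inf")  # try every arm once
            return self.clicks[a] / n + math.sqrt(2.0 * math.log(total) / n)
        return max(self.arms, key=priority)

    def update(self, arm, clicked):
        self.pulls[arm] += 1
        self.clicks[arm] += int(clicked)

class TwoLevelPolicy:
    def __init__(self, ad_taxonomy):
        # ad_taxonomy: {ad parent class: [ad child classes]} (hypothetical layout)
        self.ad_taxonomy = ad_taxonomy
        self.parent_bandits = {}  # keyed by parent page class
        self.child_bandits = {}   # keyed by (page class, ad parent class)

    def allocate(self, page_class, page_parent):
        # Level 1: bandit over ad parent classes for this parent page class.
        top = self.parent_bandits.setdefault(
            page_parent, UCB1Bandit(self.ad_taxonomy))
        ad_parent = top.pick()
        # Level 2: bandit over ad child classes inside the chosen block.
        low = self.child_bandits.setdefault(
            (page_class, ad_parent), UCB1Bandit(self.ad_taxonomy[ad_parent]))
        return ad_parent, low.pick()

    def observe(self, page_class, page_parent, ad_parent, ad_child, clicked):
        # The same click feeds both levels, so the top level sees aggregated data.
        self.parent_bandits[page_parent].update(ad_parent, clicked)
        self.child_bandits[(page_class, ad_parent)].update(ad_child, clicked)

taxonomy = {"Apparel": ["Shoes", "Shirts"], "Travel": ["Flights", "Hotels"]}
policy = TwoLevelPolicy(taxonomy)
ad_parent, ad_child = policy.allocate("Hotels", "Travel")
policy.observe("Hotels", "Travel", ad_parent, ad_child, clicked=True)
```

Extending the sketch from two levels to a full root-to-leaf descent would repeat the second step once per taxonomy level.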
Multi-level Policy (Allocation)
Bandits at higher levels
• use aggregated information
• have fewer bandit arms
→ Quickly figure out the best ad parent class
Multi-level Policy (Estimation)
• CTRs in a block are homogeneous
– Observations from one cell also give
information about others in the block
– How can we model this dependence?
Multi-level Policy (Estimation)
• Shrinkage Model
$S_{cell} \mid CTR_{cell} \sim \mathrm{Bin}(N_{cell}, CTR_{cell})$
$CTR_{cell} \sim \mathrm{Beta}(\mathrm{Params}_{block})$
where $S_{cell}$ = # clicks in cell and $N_{cell}$ = # impressions in cell
All cells in a block come from the same distribution
Multi-level Policy (Estimation)
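• Intuitively, this leads to shrinkage of cell CTRs towards block CTRs
$E[CTR] = \alpha \cdot \mathrm{Prior}_{block} + (1 - \alpha) \cdot S_{cell} / N_{cell}$
where $E[CTR]$ is the estimated CTR, $\mathrm{Prior}_{block}$ is the Beta prior (“block CTR”), and $S_{cell} / N_{cell}$ is the observed CTR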
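The shrinkage estimate can be worked through with the Beta-Binomial posterior mean. The slides do not give the form of α or the block parameters; the sketch below assumes a Beta(a, b) block prior, for which conjugacy gives α = (a + b) / (a + b + N_cell). The parameter values are hypothetical.

```python
def shrunk_ctr(clicks, impressions, a, b):
    """Posterior mean CTR of a cell under a Beta(a, b) block prior.

    Assumption: the block prior is Beta(a, b), so the prior mean
    a / (a + b) plays the role of the "block CTR".
    """
    prior_mean = a / (a + b)
    if impressions == 0:
        return prior_mean  # no data: fall back on the block CTR
    alpha = (a + b) / (a + b + impressions)  # weight on the prior
    return alpha * prior_mean + (1 - alpha) * clicks / impressions

# Hypothetical block prior with mean 0.02, and a cell with 5 clicks in 100 impressions.
print(shrunk_ctr(5, 100, a=2, b=98))
```

With few impressions α stays close to 1 and the estimate sits near the block CTR; as impressions accumulate, α falls and the observed cell CTR dominates.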
Experiments
[Figure: taxonomy structure]
– Depth 0: Root
– Depth 1: 20 nodes
– Depth 2: 221 nodes
– …
– Depth 7: ~7000 leaves
We use these 2 levels (20 nodes and 221 nodes)
Experiments
• Data collected over a 1 day period
• Collected from only one server, under some other
ad-matching rules (not our bandit)
• ~229M impressions
• CTR values have been linearly transformed for
purposes of confidentiality
Experiments (Multi-level Policy)
[Plot: clicks vs. number of pulls]
Multi-level gives much higher #clicks
Experiments (Multi-level Policy)
[Plot: mean-squared error vs. number of pulls]
Multi-level gives much better Mean-Squared Error
→ it has learnt more from its explorations
Experiments (Shrinkage)
[Plots: clicks and mean-squared error vs. number of pulls, with and without shrinkage]
Shrinkage → improved Mean-Squared Error, but no gain in #clicks
Summary
• Taxonomies exist for many datasets
• They can be used for
– Dimensionality Reduction
– Multi-level bandit policy → higher #clicks
– Better estimation via shrinkage models → better MSE
Challenges and Open Problems
• Bandit policies typically assume stationarity
• But, sudden changes are the norm in the online
advertising world:
– Ads may be suddenly removed when they run out of
budget
– New ads are constantly added to the system
– The total number of ads is huge, and full exploration
may be too costly
– Mortal multi-armed bandits [NIPS/2008]
Mortal Multi-armed Bandits
• Traditional bandit policies like UCB1 spend a large
fraction of their initial pulls on exploration
– Hard-earned knowledge may be lost due to finite arm
lifetimes
• Method 1 (Sampling):
– Pick a random sample from the set of available arms
– Run UCB1 on sample, until some fraction of arms in the
sample are lost
– Pro: Quicker convergence, more exploitation
– Con: Best arm in the sample may be worse than best
arm overall
– Pick sample size to control this tradeoff
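Method 1 can be sketched as below. The slides leave the details open, so several choices here are assumptions: arms have known expiry times in the simulation only, the sample is redrawn once the lost fraction reaches `loss_fraction`, statistics of an arm are kept if it is sampled again, and all names are hypothetical.

```python
import math
import random

def ucb1_pick(stats, arms):
    """UCB1 choice among `arms`; stats[a] = [clicks, pulls]."""
    total = sum(stats[a][1] for a in arms)
    def priority(a):
        clicks, pulls = stats[a]
        if pulls == 0:
            return float("inf")
        return clicks / pulls + math.sqrt(2.0 * math.log(total) / pulls)
    return max(arms, key=priority)

def sampled_ucb1(ctr, lifetime, sample_size, loss_fraction, horizon, seed=0):
    """ctr: {arm: true CTR}; lifetime: {arm: last time step at which it is alive}."""
    rng = random.Random(seed)
    stats = {a: [0, 0] for a in ctr}
    sample, reward = [], 0
    for t in range(horizon):
        alive = [a for a in ctr if lifetime[a] >= t]
        if not alive:
            break
        live_sample = [a for a in sample if lifetime[a] >= t]
        lost = len(sample) - len(live_sample)
        # Redraw the sample at the start, or once enough sampled arms have expired.
        if not sample or lost >= loss_fraction * len(sample):
            sample = rng.sample(alive, min(sample_size, len(alive)))
            live_sample = sample
        arm = ucb1_pick(stats, live_sample)
        clicked = rng.random() < ctr[arm]
        stats[arm][0] += int(clicked)
        stats[arm][1] += 1
        reward += int(clicked)
    return reward, stats

# Hypothetical ads with CTRs and expiry times.
ctr = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}
lifetime = {"a": 30, "b": 99, "c": 60, "d": 99}
reward, stats = sampled_ucb1(ctr, lifetime, sample_size=2,
                             loss_fraction=0.5, horizon=100, seed=0)
```

A smaller `sample_size` means fewer arms to explore before exploiting, at the risk that the sample misses the best arm overall.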
Mortal Multi-armed Bandits
• Traditional bandit policies like UCB1 spend a large
fraction of their initial pulls on exploration
– Hard-earned knowledge may be lost due to finite arm
lifetimes
• Method 2 (Payoff threshold):
– New bandit policy: If the observed payoff of any arm is
higher than a threshold, pull it till it expires
– Pro: Good arms, once found, are exploited quickly
– Con: While exploiting good arms, the best arm may be
starving and may expire without being found
– Pick threshold to control this tradeoff
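Method 2 can be sketched in the same simulated setting. Two details are assumptions not stated on the slide: while no arm is being exploited, the least-pulled live arm is explored, and an arm must have at least `min_pulls` observations before its observed payoff is compared with the threshold. All names are hypothetical.

```python
import random

def threshold_policy(ctr, lifetime, threshold, min_pulls, horizon, seed=0):
    """ctr: {arm: true CTR}; lifetime: {arm: last time step at which it is alive}."""
    rng = random.Random(seed)
    stats = {a: [0, 0] for a in ctr}  # [clicks, pulls] per arm
    committed = None                  # arm currently being exploited, if any
    reward = 0
    for t in range(horizon):
        alive = [a for a in ctr if lifetime[a] >= t]
        if not alive:
            break
        if committed is not None and lifetime[committed] < t:
            committed = None          # the exploited arm expired
        if committed is None:
            # Assumption: explore the live arm with the fewest pulls.
            arm = min(alive, key=lambda a: stats[a][1])
        else:
            arm = committed
        clicked = rng.random() < ctr[arm]
        stats[arm][0] += int(clicked)
        stats[arm][1] += 1
        reward += int(clicked)
        clicks, pulls = stats[arm]
        # Commit to an arm whose observed payoff exceeds the threshold.
        if committed is None and pulls >= min_pulls and clicks / pulls > threshold:
            committed = arm
    return reward, stats

# Degenerate hypothetical example: one arm always pays, the other never does.
reward, stats = threshold_policy({"good": 1.0, "bad": 0.0},
                                 {"good": 99, "bad": 99},
                                 threshold=0.5, min_pulls=1, horizon=10)
```

Raising the threshold delays commitment and favors finding the best arm; lowering it exploits sooner but may lock onto a mediocre arm while a better one expires unseen.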
Mortal Multi-armed Bandits
• Challenges:
– Selecting the critical sample size or threshold correctly,
for arbitrary payoff distributions
– What if even the payoff distribution is unknown?
Challenges and Open Problems
• Mortal multi-armed bandits
• What if the bandit policy has some information about
the budget?
– The bandit policy can control which arms expire, and
when
– “Handling Advertisements of Unknown Quality in
Search Advertising” by Pandey+/NIPS/2006
• Combining budgets with extra knowledge of ad CTRs
– E.g., Using an ad taxonomy
• Using a bandit scheme to infer/correct an ad
taxonomy
Conclusions
Conclusions
• We provided an introduction to Online Advertising
– Discussed the eco-system and various actors involved
– Discussed different flavors of online advertising
• Sponsored Search, Content Match, Display Advertising
Conclusions
Online Advertising
– Revenue Models: CPM, CPC, CPA
– Advertising Setting: Display, Content Match, Sponsored Search
– Misc.: Ad exchanges
Conclusions
• Outlined associated statistical challenges
– Sponsored search, Content Match, Display
• We believe the following to be a technical roadmap
– Offline Modeling: regression, collaborative filtering, mixture of experts; multi-resolution models; selection bias; slate correlation; noisy labels
– Online Models: time series; explore/exploit; multi-armed bandits
Conclusions
• Offline Modeling
– By far the best studied so far
– Selection bias, slate correlations and noisy labels have not been studied carefully; good opportunity here
– More emphasis on matrix structure, goal is to estimate
interactions
• Explore/Exploit
– Some work using multi-armed bandits; long way to go
• Time series model to capture temporal aspects
– Little work
• Holistic approach that combines all components in a
principled way