Contextual Advertising by Combining Relevance with Click Feedback
Wenfan Li
Introduction
• In 2007, online advertising revenue was $21.0 billion, against a projected $20.0 billion.
• In 2012, online advertising revenue was $36.6 billion, a substantial growth of 15%.
• Textual advertising contributes a major part of overall online advertising revenue.
• The aim of web advertising is to improve the user experience and increase the probability of clicks, and hence revenue.
Sponsored Search (SS) or Paid Search Advertising
• Ads are displayed on the search result pages based on the search query
Contextual Advertising or Context Match (CM)
• Ads are shown within the content of a generic, third-party
web page.
Contribution
• Similarity formulation.
• Scoring formulas (click propensity) with a correlation parameter to incorporate the click data.
• In line with the current system; can be activated with the extended scoring formula with minimal effort.
Vector Representation
• Vector X represents the ad in a vector space; the axes X1, X2, X3 represent feature types (sketched below):
X1 = unigram
X2 = class
X3 = phrase
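A minimal sketch (Python, with an illustrative ad creative, taxonomy class, and extracted phrase, all assumptions for the example) of encoding an ad as a sparse vector over unigram, class, and phrase axes:

```python
# A minimal sketch of representing an ad as a sparse feature vector over
# three axes: unigrams (X1), classes (X2), and phrases (X3).
ad_text = "cheap flights to new york"   # hypothetical ad creative
ad_classes = ["Travel/Air"]             # hypothetical taxonomy class
ad_phrases = ["new york"]               # hypothetical extracted phrase

vector = {}
for word in ad_text.split():            # unigram features (X1)
    key = ("unigram", word)
    vector[key] = vector.get(key, 0) + 1
for c in ad_classes:                    # class features (X2)
    vector[("class", c)] = 1
for p in ad_phrases:                    # phrase features (X3)
    vector[("phrase", p)] = 1

print(vector)
```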
Related work
Ribeiro-Neto et al. did extensive research on similarity matching and suggested the following:
• Vector model, as explained above
• Ranking algorithm
1. Rank is given by the cosine similarity between the page vector and the union of all the ad section vectors (a sketch follows below).
2. The cosine function's limitation of not considering synonyms is overcome by including the vocabulary of similar pages.
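A minimal sketch of the cosine ranking idea over sparse term-weight vectors; the tf-idf weights are toy values:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

page = {"flights": 0.8, "york": 0.5, "hotel": 0.2}  # toy tf-idf weights
ad = {"flights": 0.9, "cheap": 0.4, "york": 0.3}
print(round(cosine(page, ad), 3))  # rank ads for a page by this score
```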
Related work
• The impact of individual features can be represented by a tree (matching function)
• Nodes 3, 6 → arithmetic operators, log function
• Nodes 5, 6, 8 → numerical features, ad terms
• Results in improved performance
• tf-idf weighting
Limitation
• All existing approaches rely on a-priori information and do not use feedback in the form of click data.
• Traditional similarity methods do not account for differential
click probabilities on different pages.
Method
• Apply a regression model to combine click feedback and semantic information from pages and ads:
(a) Feature extraction
(b) Feature selection
(c) Coefficient estimation
• Incorporate traditional relevance measures to enhance the matching of relevant ads to pages.
Feature Extraction
• Define regions
- page title, page metadata, page body, page URL, etc.
• Assign a region-specific score t_p(r)w to each word w in page region p(r), and measure the main effects
- Main effect for a page region:
M_p(r)w = 1(w ∈ p(r)) · t_p(r)w
-- used for getting better estimates and adjustments
- Main effect for an ad region:
M_a(r)w = 1(w ∈ a(r)) · t_a(r)w
-- participates in the selection of relevant ads
Feature Extraction
Interaction effects between page and ad regions (sketched below):
I_p(r1)w1,a(r2)w2 = 1(w1 ∈ p(r1), w2 ∈ a(r2)) · f(t_p(r1)w1, t_a(r2)w2)
-- participate in the selection of relevant ads
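A minimal sketch of the main and interaction effects, assuming the combining function f is a simple product and restricting interactions to matching words to keep the example small:

```python
# Toy region-specific scores t_p(r)w and t_a(r)w.
page_regions = {"title": {"flights": 0.8, "york": 0.5}}
ad_regions = {"title": {"flights": 0.9, "cheap": 0.4}}

# Main effects: indicator 1(w in region) times the region-specific score.
M_page = {(r, w): t for r, ws in page_regions.items() for w, t in ws.items()}
M_ad = {(r, w): t for r, ws in ad_regions.items() for w, t in ws.items()}

# Interaction effects for words co-occurring in a page and an ad region.
I = {}
for (r1, w1), t1 in M_page.items():
    for (r2, w2), t2 in M_ad.items():
        if w1 == w2:                        # matching words only, for brevity
            I[(r1, w1, r2, w2)] = t1 * t2   # f(t1, t2) assumed to be t1 * t2

print(M_page, M_ad, I, sep="\n")
```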
Feature Selection
• Data-based approach
- Interaction measure: i_w = CTR_w^both / (CTR_w^page · CTR_w^ad) (sketched below)
• Relevance-based approaches
- Compute the average tf-idf score for the respective region.
- Compute the geometric mean of the tf-idf scores for the respective regions.
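A minimal sketch of the data-based interaction measure i_w, using made-up (clicks, impressions) counts:

```python
def ctr(clicks, impressions):
    return clicks / impressions if impressions else 0.0

# Toy per-word counts: word on the page only, in the ad only, and in both.
counts = {"flights": {"page": (40, 1000), "ad": (30, 1000), "both": (25, 300)}}

for w, c in counts.items():
    # i_w = CTR_w^both / (CTR_w^page * CTR_w^ad)
    i_w = ctr(*c["both"]) / (ctr(*c["page"]) * ctr(*c["ad"]))
    print(w, round(i_w, 2))  # i_w >> 1 suggests a strong page-ad interaction
```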
Logistic Regression
• Challenge: highly correlated features
• Challenge: sparse feature occurrences
Logistic Regression
• Feature-based models
• q_ij = average of the prior log-odds, used as an offset in the model (sketched below)
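A minimal sketch (not the paper's fitter) of a logistic regression whose linear predictor includes a fixed prior log-odds offset q, i.e. logit(p) = q + w·x; the synthetic data and plain gradient ascent are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))    # toy main/interaction features
q = np.full(1000, -3.0)           # prior log-odds offset (e.g. a baseline CTR)
true_w = np.array([0.5, -0.2, 0.1])
p = 1 / (1 + np.exp(-(q + X @ true_w)))
y = rng.binomial(1, p)            # simulated clicks

w = np.zeros(3)
for _ in range(500):              # gradient ascent on the log-likelihood
    pred = 1 / (1 + np.exp(-(q + X @ w)))
    w += 0.01 * X.T @ (y - pred) / len(y)

print(np.round(w, 2))             # should roughly recover true_w
```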
Optimization
• Split the data into regions
• Distribute the data among clusters
• Apply the regression model to each part individually
• A MapReduce platform is used for this purpose (a sketch follows below)
• The process is highly adaptable, flexible, and self-controlled; we need not distribute, monitor, or schedule the data ourselves.
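A minimal single-machine sketch of the divide-and-conquer idea: fit a model per data partition (the map step) and combine the coefficients (the reduce step). Averaging coefficients is one simple combination, not necessarily the production scheme, and a real system would run this on Hadoop:

```python
import numpy as np

def fit_partition(X, y, q, steps=300, lr=0.01):
    """Fit the offset logistic regression on one partition of the data."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        pred = 1 / (1 + np.exp(-(q + X @ w)))
        w += lr * X.T @ (y - pred) / len(y)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
q = np.full(2000, -3.0)
y = rng.binomial(1, 1 / (1 + np.exp(-(q + X @ np.array([0.5, -0.2, 0.1])))))

parts = np.array_split(np.arange(2000), 4)                  # data "regions"
w_parts = [fit_partition(X[i], y[i], q[i]) for i in parts]  # map step
w = np.mean(w_parts, axis=0)                                # reduce step
print(np.round(w, 2))
```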
Optimization
• Prior on the model coefficients
• The variance of a feature occurrence with success probability S is S(1-S)
Challenges and Mitigation
• Increase in variance
- Consider a selected set of informative features.
- Down-weight the sparse data.
• Overfitting
- Divide-and-conquer strategy using MapReduce
• Correlated features in the regression model
- Avoid multicollinear variables.
Normal Distribution Curve
• To avoid correlation and sparsity effects, and to ensure convergence, consider the range that covers 99% of the a-priori probability.
Ad Search Prototype
• The prototype ad search engine is based on the WAND
algorithm and inverted indexing of the ads using the Hadoop
distributed computing framework.
• Features used: unigrams, phrases and classes.
• The inverted index is composed of one postings list for each
feature that has one entry (posting) for each ad that contains
this feature.
Architecture of ad search prototype
Inverted Indexing
• Extract features from the ads.
• Use M_a(r) to calculate a static score for each individual ad.
• Use this score to assign adIDs to the ads in decreasing ad-score order.
• Each feature is represented by a unique numeric featureID. The resulting data file is sorted by <adID, featureID>.
• Invert the sort, using <featureID, adID> as the key.
• Write the delta-compressed posting lists (sketched below).
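A minimal sketch of building delta-encoded posting lists from ads whose adIDs were already assigned in decreasing static-score order:

```python
# Toy ads: (adID, extracted features), already sorted by adID.
ads = [
    (0, ["flights", "cheap"]),
    (1, ["flights", "york"]),
    (2, ["cheap"]),
]

postings = {}
for ad_id, features in ads:
    for f in features:
        postings.setdefault(f, []).append(ad_id)  # adIDs stay sorted

def delta_encode(ids):
    """Store the first adID, then the gaps between consecutive adIDs."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

for f, ids in postings.items():
    print(f, delta_encode(ids))  # e.g. cheap -> [0, 2]
```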
Differences between ad search engines and web search engines
• In web search, the queries are short and the documents are long.
• In the ad search case, the engine searches the vector space with a long query and relatively short ad vectors.
WAND Algorithm
• WAND is a document-at-a-time algorithm that finds the next cursor to be moved based on an upper bound of the score for the documents at which the cursors are currently positioned.
• The algorithm keeps a heap of current candidates. The invariant of the algorithm is that the heap contains the best matches (highest scores) among the documents (ads) with IDs less than the document pointed to by the current minimum cursor.
• The algorithm assumes that all cursors before the current one will hit this document (i.e., the document contains all those terms represented by cursors before or at that document).
WAND Algorithm
• Cursors pointing at documents whose upper bound is smaller than the minimum score among the candidate documents are candidates for a move.
• WAND can be used with any function that is monotonic with respect to the number of matching terms in the document.
• Some non-monotonic scoring functions can also be used, as long as there is a mechanism to estimate the score upper bounds, e.g., a set of functions where a fixed subset of the features (known a priori) always decreases the score.
• Example: cosine similarity (a simplified candidate-selection sketch follows below)
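A simplified sketch of WAND's candidate selection (document-at-a-time). Real implementations also keep a candidate heap and fully score each returned candidate; that part is omitted here:

```python
import bisect

def next_candidate(postings, bounds, cursors, theta):
    """Find the next docID whose summed score upper bounds can beat theta."""
    while True:
        # Order terms by the docID under each cursor (exhausted -> infinity).
        order = sorted(postings, key=lambda t: postings[t][cursors[t]]
                       if cursors[t] < len(postings[t]) else float("inf"))
        # Pivot: the first term where accumulated upper bounds exceed theta.
        acc, pivot = 0.0, None
        for t in order:
            if cursors[t] >= len(postings[t]):
                return None                  # remaining lists are exhausted
            acc += bounds[t]
            if acc > theta:
                pivot = t
                break
        if pivot is None:
            return None                      # no doc can beat the threshold
        pivot_doc = postings[pivot][cursors[pivot]]
        if postings[order[0]][cursors[order[0]]] == pivot_doc:
            return pivot_doc                 # candidate: fully evaluate it
        # Otherwise skip the earliest cursor forward to the pivot document.
        t = order[0]
        cursors[t] = bisect.bisect_left(postings[t], pivot_doc, cursors[t])

postings = {"flights": [0, 1, 5], "cheap": [0, 2, 5], "york": [1, 5]}
bounds = {"flights": 1.0, "cheap": 0.5, "york": 0.8}
cursors = {t: 0 for t in postings}
print(next_candidate(postings, bounds, cursors, theta=1.2))  # -> 0
```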
WAND Framework
• The page is parsed and the features are extracted along with their tf-idf scores.
• Features are extracted from the input query.
• The query is a bag of pairs <featureID, weight>.
• Apply the reweighting based on the I_w table.
• For each query feature, WAND opens a cursor over the posting list of that feature.
• The cursors are moved forward, examining the documents as they are encountered.
WAND
• Incorporate the logistic-regression-based model.
• Modify the scoring equation to exclude the page effect and use it as the WAND scoring formula.
• The M_p(r) table is used to adjust the final scores to calculate the probabilities according to equation 3.
Experimental Data
• Conducted on a large-scale real-world dataset.
• Experiments use 30 days of data: the first 15 days serve as the training corpus, and the results are evaluated on the remaining days.
• Sampling scheme: retain all impressions associated with clicked page views, and take a 1/2000 random sample of the page views associated with non-clicked page views (sketched below).
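A minimal sketch of that sampling scheme on toy page views:

```python
import random

random.seed(0)

def sample(page_views, rate=1 / 2000):
    """Keep every clicked page view; sample non-clicked ones at `rate`."""
    return [pv for pv in page_views
            if pv["clicked"] or random.random() < rate]

page_views = [{"id": i, "clicked": i % 5000 == 0} for i in range(100_000)]
print(len(sample(page_views)))  # 20 clicked plus roughly 50 sampled negatives
```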
Word selection schemes
• Obtain 112K unique words after stop-word removal.
• Truncate the word list to words with at least 100 impressions.
• Keep words with at least 5 clicks when shown on pages and at least 5 clicks when shown on ads.
• The truncation leaves approximately 3.4K words.
Word selection schemes
• For data-based feature selection:
Two experiments were conducted, one with the top 1K words and one with all 3.4K words.
• For the two relevance-based selections (sketched below):
The first relevance-based measure assigns the average tf-idf score to a word, computed from a single corpus (the union of the page title and ad title regions).
The second relevance-based measure is the geometric mean of the average tf-idf scores of a word, computed from the page region and ad region corpora.
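A minimal sketch of the two relevance-based word scores on toy per-occurrence tf-idf values:

```python
import math

# Toy per-occurrence tf-idf scores of each word in the two region corpora.
page_tfidf = {"flights": [0.8, 0.6], "york": [0.5]}
ad_tfidf = {"flights": [0.9, 0.7], "york": [0.2]}

def avg(xs):
    return sum(xs) / len(xs)

for w in page_tfidf:
    merged = avg(page_tfidf[w] + ad_tfidf[w])               # scheme 1: merged corpus
    geo = math.sqrt(avg(page_tfidf[w]) * avg(ad_tfidf[w]))  # scheme 2: geometric mean
    print(w, round(merged, 3), round(geo, 3))
```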
Hybrid model
• The relevance of an ad for a page is quantified by a score that is a function of the page and ad content.
• Two such scores are tested for the region combination (pti, oti).
Hybrid model
• (a) Use Relevance2 as an additional feature in the logistic regression, without adding substantial overhead to the indexing algorithm.
• (b) Use Relevance2 to derive the priors q_ij in equation 3.
• For (a), quadratic terms are added to the logistic regression.
• For (b), the q_ij's are derived using the approximate quadratic curve shown below:
Hybrid model
• Sort all positive Relevance2 values in the training data and create 100 bins. For each bin, compute the empirical log-odds and the average Relevance2 score (sketched below).
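A minimal sketch of the binning step on synthetic scores and click labels:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
scores = rng.random(10_000)                    # toy positive Relevance2 values
clicks = rng.binomial(1, 0.02 + 0.1 * scores)  # toy click labels

order = np.argsort(scores)                     # sort by Relevance2
for b in np.array_split(order, 100)[:3]:       # 100 equal-count bins, show 3
    ctr = clicks[b].mean()
    log_odds = math.log(ctr / (1 - ctr)) if 0 < ctr < 1 else float("-inf")
    print(round(scores[b].mean(), 3), round(log_odds, 2))
```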
Evaluation Criteria
• We use the precision-recall metric to measure the performance of our algorithms (a sketch of these definitions follows below).
• Recall for a given threshold on the model score is the fraction of all clicks that were present (recalled) in the test cases above that threshold.
• Precision for a given recall value is the CTR on the test cases above the threshold.
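A minimal sketch of these precision-recall definitions on synthetic scores and clicks:

```python
import numpy as np

rng = np.random.default_rng(3)
scores = rng.random(1000)               # model scores for test impressions
clicks = rng.binomial(1, scores * 0.1)  # toy click labels

for threshold in (0.2, 0.5, 0.8):
    above = scores >= threshold
    recall = clicks[above].sum() / clicks.sum()  # fraction of clicks recalled
    precision = clicks[above].mean()             # CTR above the threshold
    print(threshold, round(recall, 2), round(precision, 3))
```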
Results (precision-recall curve)
Results
Results
Histograms of parameters
Conclusion
• Propose a new class of models to combine relevance with click
feedback for a contextual advertising system.
• Through large-scale experiments, we convincingly demonstrate the advantage of combining relevance with click feedback.
Future Work
• Enrich our models to discover interactions caused by word synonyms.
• Experiment with a wide range of feature selection and model fitting schemes for the logistic regression.
• Explore the application of our method to other systems with implicit feedback in the form of clicks, such as web search and search advertising.