Contextual Advertising by Combining Relevance with Click Feedback
Wenfan Li

Introduction
• In 2007, online advertising revenue was $21.0 billion, exceeding the $20.0 billion projected.
• In 2012, online advertising revenue was $36.6 billion, substantial growth of 15% over the prior year.
• Textual advertising contributes a major part of overall online advertising revenue.
• The aim of web advertising is to improve the user experience while increasing the probability of clicks, and hence revenue.

Sponsored Search (SS) or Paid Search advertising
• Ads are displayed on the result pages based on the search query.

Contextual advertising or Context Match (CM)
• Ads are shown within the content of a generic, third-party web page.

Contribution
• A similarity formulation.
• Scoring formulas (click propensity) with a correlation parameter to incorporate the click data.
• In line with current systems; can be activated with the extended scoring formula with minimal effort.

Vector Representation
• A vector X represents the ad in a vector space whose axes X1, X2, X3 represent features:
  X1 = unigram
  X2 = class
  X3 = phrase

Related work
• Ribeiro-Neto et al. did extensive research in similarity search and suggested the following:
  - Vector representation, as explained above.
  - A ranking algorithm:
    1. Rank is the cosine similarity between the page vector and the union of all the ad-section vectors.
    2. The cosine function's limitation of not considering synonyms is overcome by including the vocabulary of similar pages.

Related work
• The impact of individual features can be represented by a tree (the matching function; figure not reproduced):
  - Nodes 3, 6 -> arithmetic operators, log function
  - Nodes 5, 6, 8 -> numerical features, ad terms
• Results in improved performance over tf-idf weighting.

Limitation
• All existing approaches rely on a-priori information and do not use feedback in the form of click data.
• Traditional similarity methods do not account for differential click probabilities on different pages.

Method
• Apply a regression model to combine click feedback with semantic information from pages and ads:
  (a) Feature extraction
  (b) Feature selection
  (c) Coefficient estimation
• Incorporate traditional relevance measures to enhance the matching of relevant ads to pages.

Feature Extraction
• Define regions: page title, page metadata, page body, page URL, etc.
• Assign a region-specific score tp(r)w to each word w in page region p(r) and measure the main effects:
  - Main effect for a page region: Mp(r)w = 1(w ∈ p(r)) · tp(r)w
    -- used to get better estimates and adjustments.
  - Main effect for an ad region: Ma(r)w = 1(w ∈ a(r)) · ta(r)w
    -- participates in the selection of relevant ads.

Feature Extraction
• Interaction effects between page and ad regions:
  Ip(r1)w1,a(r2)w2 = 1(w1 ∈ p(r1), w2 ∈ a(r2)) · f(tp(r1)w1, ta(r2)w2)
  -- participate in the selection of relevant ads.

Feature Selection
• Data-based approach: interaction measure iw = CTRw,both / (CTRw,page · CTRw,ad)
• Relevance-based approaches:
  - Compute the average tf-idf score for the respective region.
  - Compute the geometric mean of the tf-idf scores of the respective regions.
(A code sketch of these features follows the Optimization slides below.)

Logistic Regression
• Challenges: high correlation between features; sparseness of features.

Logistic Regression
• Feature-based models.
• qij = average of the prior log-odds, used as the prior in the model (fitting sketched below).

Optimization
• Split the data into regions.
• Distribute the data among clusters.
• Apply the regression model on the individual clusters.
• The MapReduce platform is used for this purpose.
• Highly adaptable, flexible, and self-managing: we need not distribute, monitor, or schedule the data ourselves.

Optimization
• Prior: the feature occurrence variance is S(1 − S), the Bernoulli variance for an occurrence probability S.
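To make the preceding feature definitions concrete, here is a minimal Python sketch of the main effects, the interaction effects, and the data-based interaction measure iw. All names and data layouts are illustrative assumptions, not the paper's implementation; for tractability, the interaction sketch only pairs occurrences of the same word across a page region and an ad region.

```python
def main_effects(region_words, region_scores):
    """Main-effect features: M_{r,w} = 1(w in region r) * t_{r,w},
    where t_{r,w} is a region-specific score (e.g. tf-idf) for word w."""
    feats = {}
    for region, words in region_words.items():
        for w in words:
            feats[(region, w)] = region_scores[region].get(w, 0.0)
    return feats

def interaction_effects(page_feats, ad_feats, f=lambda tp, ta: tp * ta):
    """Interaction features: I = 1(w1 in page region, w2 in ad region)
    * f(t_page, t_ad). Taking f as the product is one plausible choice;
    pairing is restricted to shared words to keep the features sparse."""
    feats = {}
    for (pr, w1), tp in page_feats.items():
        for (ar, w2), ta in ad_feats.items():
            if w1 == w2:
                feats[(pr, w1, ar, w2)] = f(tp, ta)
    return feats

def interaction_measure(clicks, views):
    """Data-based selection measure i_w = CTR_both / (CTR_page * CTR_ad).
    `clicks` and `views` map ('page'|'ad'|'both', word) -> counts."""
    def ctr(key):
        v = views.get(key, 0)
        return clicks.get(key, 0) / v if v else 0.0
    words = {w for (_, w) in views}
    return {w: ctr(('both', w)) / (ctr(('page', w)) * ctr(('ad', w)))
            for w in words
            if ctr(('page', w)) > 0 and ctr(('ad', w)) > 0}
```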
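The paper fits the click model with logistic regression at scale on MapReduce; below is a minimal single-machine sketch, assuming a Gaussian prior on each coefficient centered at its prior log-odds qj, so sparse features shrink toward a sensible default rather than toward zero. The plain gradient-descent fit and all names are my assumptions, not the paper's procedure.

```python
import numpy as np

def fit_logistic_map(X, y, q, tau2=1.0, lr=0.1, iters=500):
    """MAP logistic regression: p(click) = sigmoid(X @ beta), with an
    independent Gaussian prior beta_j ~ N(q_j, tau2) per coefficient."""
    beta = q.copy()
    n = len(y)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))           # predicted CTRs
        grad = X.T @ (p - y) / n + (beta - q) / tau2  # log-loss + prior pull
        beta -= lr * grad
    return beta
```

With a small tau2, coefficients of rarely seen features stay near their prior log-odds; with a large tau2, the fit approaches unregularized logistic regression.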
Challenges and Mitigation
• Increase in variance
  - Consider a selected set of informative features.
  - Down-weight the sparse data.
• Overfitting
  - Divide-and-conquer strategy using MapReduce.
• Correlated features in the regression model
  - Avoid multicollinearity among the variables.

Normal Distribution Curve
• To avoid correlation and sparsity effects and to ensure convergence, consider the range that covers 99% of the a-priori probability.

Ad Search Prototype
• The prototype ad search engine is based on the WAND algorithm and inverted indexing of the ads using the Hadoop distributed computing framework.
• Features used: unigrams, phrases, and classes.
• The inverted index is composed of one postings list per feature, with one entry (posting) for each ad that contains that feature.

Architecture of ad search prototype (figure not reproduced)

Inverted Indexing (sketched in code after the WAND slides below)
• Extract features from the ads.
• Use Ma(r) to calculate a static score for each individual ad.
• Use this score to assign an adID to the ads in decreasing ad-score order.
• Each feature is represented by a unique numeric featureID. The resulting data file is sorted by <adID, featureID>.
• Invert the sorting, with <featureID, adID> as the key.
• Write the delta-compressed postings lists.

Differences between an ad search engine and a web search engine
• In web search, the queries are short and the documents are long.
• In ad search, the engine searches the vector space with a long query and relatively short ad vectors.

WAND Algorithm
• WAND is a document-at-a-time algorithm that finds the next cursor to be moved based on an upper bound of the score for the documents at which the cursors are currently positioned.
• The algorithm keeps a heap of current candidates. Its invariant is that the heap contains the best matches (highest scores) among the documents (ads) with IDs less than the document pointed to by the current minimum cursor.
• The algorithm assumes that all cursors before the current one will hit this document (i.e., the document contains all the terms represented by cursors before or at that document).

WAND Algorithm
• Cursors pointing at documents whose upper bound is smaller than the minimum score among the candidate documents are candidates for a move.
• WAND can be used with any scoring function that is monotonic with respect to the number of matching terms in the document.
• Some non-monotonic scoring functions can also be used, as long as we can find a mechanism to estimate the score upper bounds; an example is a set of functions where a fixed subset of the features (known a priori) always decreases the score, such as cosine similarity.

WAND Framework
• The page is parsed and its features are extracted along with their tf-idf scores.
• Features are extracted from the input query.
• The query is a bag of pairs <featureID, weight>.
• Reweighting is applied based on the table Iw.
• For each query feature, WAND opens a cursor over the postings list of that feature.
• The cursors are moved forward, examining the documents as they are encountered.

WAND
• Incorporate the logistic-regression-based model.
• Modify the scoring equation to exclude the page effect, and use it as the WAND scoring formula.
• The Mp(r) table is then used to adjust the final scores and calculate the probabilities according to equation 3.
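As a concrete illustration of the indexing pass described in the Inverted Indexing slide, here is a minimal Python sketch, assuming `ads` maps each ad to its set of featureIDs and `static_score` is the Ma(r)-based score; the actual prototype runs this as a Hadoop job and writes binary postings.

```python
def build_inverted_index(ads, static_score):
    """Assign adIDs in decreasing static-score order, then build one
    delta-compressed postings list per featureID."""
    # adID 0 gets the highest static score, matching the slide's ordering
    ordered = sorted(ads, key=static_score, reverse=True)
    postings = {}
    for ad_id, ad in enumerate(ordered):
        for fid in ads[ad]:
            postings.setdefault(fid, []).append(ad_id)
    # delta-compress: keep the first adID, then gaps between successive IDs
    return {fid: [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
            for fid, ids in postings.items()}
```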
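And a compact sketch of the WAND pivoting loop itself, under an assumed cursor interface (`.doc` for the current adID, `None` when exhausted; `.next(target)` to advance to the first doc >= target). This is a generic document-at-a-time WAND outline, not the prototype's code; `upper_bound` and `score` stand in for the per-term upper bounds and the modified scoring formula discussed above.

```python
import heapq

def wand_top_k(cursors, upper_bound, score, k):
    """Return the top-k (score, adID) pairs via WAND pivoting."""
    heap = []                        # min-heap of (score, adID) candidates
    threshold = float("-inf")        # score needed to enter the top-k
    while True:
        live = sorted((c for c in cursors if c.doc is not None),
                      key=lambda c: c.doc)
        if not live:
            break
        # pivot: first doc where accumulated upper bounds beat the threshold
        acc, pivot = 0.0, None
        for c in live:
            acc += upper_bound(c)
            if acc > threshold:
                pivot = c.doc
                break
        if pivot is None:
            break                    # nothing left can beat the current top-k
        if live[0].doc == pivot:
            # all cursors up to the pivot sit on it: evaluate the doc fully
            s = score(pivot, live)
            if len(heap) < k:
                heapq.heappush(heap, (s, pivot))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot))
            if len(heap) == k:
                threshold = heap[0][0]
            for c in live:
                if c.doc == pivot:
                    c.next(pivot + 1)
        else:
            live[0].next(pivot)      # cheapest cursor skips ahead to the pivot
    return sorted(heap, reverse=True)
```

The threshold only starts rising once k candidates have been collected, which is what lets later iterations skip long runs of adIDs whose upper bounds cannot beat the current top-k.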
Experiments: Data
• Conducted on a large-scale real-world dataset.
• We ran our experiments on 30 days of data. The first 15 days were used as the training corpus, and the results were evaluated on the remaining days.
• Sampling scheme: we retain all impressions associated with clicked page views and take a 1/2000 random sample of the impressions associated with non-clicked page views.

Word selection schemes
• We obtain 112K unique words after stop-word removal.
• Truncate the word list to words with at least 100 impressions.
• Keep only words with at least 5 clicks on pages and 5 clicks on ads.
• The truncation leaves approximately 3.4K words.

Word selection schemes
• For data-based feature selection, we conducted two experiments: one with the top 1K words and the other with all 3.4K words.
• For the two relevance-based selections: the first relevance-based measure assigns each word the average tf-idf score computed from a single corpus (the union of the page-title and ad-title regions); the second is the geometric mean of the word's average tf-idf scores computed from the page-region and ad-region corpora separately.

Hybrid model
• The relevance of an ad for a page is quantified by a score that is a function of the page and ad content.
• We test two such scores for the region combination (pti, oti).

Hybrid model
• (a) Use Relevance2 as an additional feature in the logistic regression, without adding substantial overhead to the indexing algorithm.
• (b) Use Relevance2 to derive the priors qij in equation 3.
• For (a), we add quadratic terms to the logistic regression.
• For (b), we derive the qij's by fitting an approximate quadratic curve (next slide).

Hybrid model
• Sort all positive Relevance2 values in the training data and create 100 bins. For each bin, compute the empirical log-odds and the average Relevance2 score.

Evaluation Criteria
• We use precision-recall metrics to measure the performance of our algorithms.
• Recall, for a given threshold on the model score, is the fraction of all clicks that were present (recalled) in the test cases above that threshold.
• Precision, for a given recall value, is the CTR on the test cases above the threshold.

Results (precision-recall curves; figures not reproduced)

Histograms of parameters (figure not reproduced)

Conclusion
• We propose a new class of models that combine relevance with click feedback for a contextual advertising system.
• Through large-scale experiments, we convincingly demonstrate the advantage of combining relevance with click feedback.

Future Work
• Enrich our models to discover interactions caused by word synonyms.
• Experiment with a wider range of feature-selection and model-fitting schemes for the logistic regression.
• Explore the application of our method to other systems with implicit feedback in the form of clicks, such as web search and search advertising.