Trust and Profit Sensitive Ranking for Web Databases and On-line Advertisements
Raju Balakrishnan, rajub@asu.edu (Arizona State University)

Agenda
Problem 1: Ranking the Deep Web
– Need for a new ranking.
– SourceRank: agreement analysis.
– Computing agreement and collusion.
– Results & system implementation.
Problem 2: Ad-Ranking Sensitive to Mutual Influences
– Browsing model & nature of influence.
– Ranking function and generalizations.
– Results.

Deep Web Integration Scenario
Millions of sources containing structured tuples; an uncontrolled collection of redundant information, accessed through a mediator.
Search engines have only nominal access: we don't Google for a "Honda Civic 2008 Tampa".
(Figure: a mediator integrating many deep-web databases.)

Why Another Ranking?
Current rankings are oblivious to result importance and trustworthiness.
Example query: "Godfather Trilogy" on Google Base.
Importance: searching for titles matching the query, none of the results is the classic Godfather.
Trustworthiness (bait and switch): the titles and cover images match exactly, and the prices are low. Amazing deal! But when you proceed to checkout you realize that the product is a different one (or when you open the mail package, if you are really unlucky).

Source Selection in the Deep Web
Problem: given a user query, select a subset of sources that provide important and trustworthy answers.
Surface web search combines link analysis with query relevance to account for both trustworthiness and relevance of the results. Unfortunately, deep web records do not have hyperlinks.

Source Agreement Observations
Many sources return answers to the same query.
Comparing the semantics of the answers is facilitated by the structure of the tuples.
Idea: compute the importance and trustworthiness of sources based on the agreement of answers returned by different sources.

Agreement Implies Trust & Importance
Important results are likely to be returned by a large number of sources. For example, for the query "Godfather", hundreds of sources return the classic "The Godfather", while only a few return the little-known movie "Little Godfather".
Two independent sources are unlikely to agree on corrupt/untrustworthy answers. For example, a wrong author for a book (say, the Godfather author listed as "Nino Rota") would not be agreed upon by other sources. As we know, truth is one (or a few), but lies are many.

Agreement Is Not Just for Search
(Illustration: "Which tire?")

Agreement Implies Trust & Relevance
The probability that two independently selected irrelevant/false tuples agree is
  P_a(f1, f2) = 1 / |U|,
where U is the universe of possible false tuples, so this probability is negligibly small.
The probability that two independently picked relevant and true tuples agree is
  P_a(r1, r2) = 1 / |RT|,
where RT is the set of relevant and true tuples. Since |U| >> |RT|,
  P_a(r1, r2) >> P_a(f1, f2).

Method: Sampling-Based Agreement
The agreement link weight from S1 to S2 is
  w(S1 → S2) = (1 − β) × A(R1, R2) / |R2| + β,
where β induces the smoothing links to account for unseen samples, and R1, R2 are the result sets of S1 and S2.
Link semantics from Si to Sj with weight w: Si acknowledges a w fraction of the tuples in Sj. Since the weight is a fraction, the links are asymmetric.
Agreement is computed using keyword queries; partial titles of movies/books are used as the queries. The mean agreement over all queries is used as the final agreement.
(Figure: example agreement graph over sources S1, S2, S3 with weighted, asymmetric links.)
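A minimal sketch (not from the slides) of how the sampling-based agreement graph above could be built. The smoothing value and the result_set_agreement placeholder are illustrative assumptions; the actual system computes A(R1, R2) with the soft matching described in the following slides.

```python
# Sketch of the sampling-based agreement graph, under the assumptions stated above.
from collections import defaultdict

BETA = 0.1  # smoothing factor beta (illustrative value, not the one used in the system)

def result_set_agreement(r1, r2):
    """Placeholder for A(R1, R2): here naive exact-match counting;
    the real method uses soft (SoftTFIDF / Jaro-Winkler) record matching."""
    common = set(r2)
    return sum(1 for t in r1 if t in common)

def agreement_graph(results, queries, sources):
    """results[source][query] -> list of result tuples for that sampling query.
    Returns asymmetric link weights:
        w(S1 -> S2) = (1 - beta) * mean_q A(R1, R2) / |R2| + beta
    """
    w = defaultdict(float)
    for s1 in sources:
        for s2 in sources:
            if s1 == s2:
                continue
            fractions = []
            for q in queries:
                r1, r2 = results[s1].get(q, []), results[s2].get(q, [])
                if r2:
                    fractions.append(result_set_agreement(r1, r2) / len(r2))
            mean_agreement = sum(fractions) / len(fractions) if fractions else 0.0
            w[(s1, s2)] = (1 - BETA) * mean_agreement + BETA
    return w
```

Averaging over the sampling queries mirrors the "mean agreement over all queries" step, and dividing by |R2| keeps the asymmetric fraction semantics of the links.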
Method: Calculating SourceRank
How can the agreement graph be used for improved search?
• The source graph is viewed as a Markov chain, with the edges as transition probabilities between the sources.
• The prestige of sources, accounting for the transitive nature of agreement, can be computed by a Markov random walk.
• SourceRank is the stationary visit probability of this random walk on the database vertex.
• This static SourceRank may be combined with a query-specific source-relevance measure for the final ranking.

Computing Agreement is Hard
Computing the semantic agreement between two records is the record linkage problem, and is known to be hard.
Example "Godfather" tuples from two web sources: "Godfather, The: The Coppola Restoration" versus "The Godfather - The Coppola Restoration Giftset [Blu-ray]", with the titles, castings (Marlon Brando, Al Pacino / James Caan, Marlon Brando), and prices (13.99 USD / $9.99) denoted differently.
Semantically identical entities may be represented syntactically differently by two databases (non-common domains).

Method: Computing Agreement
Agreement computation has three levels:
1. Comparing attribute values: SoftTFIDF with Jaro-Winkler as the similarity measure is used.
2. Comparing records: We do not assume a predefined schema matching; this is an instance of a bipartite matching problem. Optimal matching is O(v^3), so O(v^2) greedy matching is used: values are greedily matched against the most similar value in the other record. Attribute importance is weighted by IDF (e.g., matching titles ("Godfather") is more important than matching formats ("paperback")).
3. Comparing result sets: Using the record similarity computed above, result-set similarities are computed with the same greedy approach.

Detecting Source Collusion
Sources may copy data from each other, or mirror one another, boosting the SourceRank of the group.
Observation 1: Even non-colluding sources in the same domain may contain the same data; e.g., movie databases may contain all Hollywood movies.
Observation 2: The top-k answers of even non-colluding sources may be similar; e.g., answers to the query "Godfather" may contain all three movies in the Godfather trilogy.

Source Collusion (continued)
Basic method: if two sources return the same top-k answers to queries with a large number of answers (e.g., queries like "the" or "DVD"), they are likely to be colluding.
We compute the degree of collusion of sources as their agreement on large-answer queries; the words with the highest document frequency in the crawl are used as the queries.
The agreement between two databases is adjusted for collusion by multiplying it by (1 − collusion).

Factal: Search Based on SourceRank
"I personally ran a handful of test queries this way and got much better results [than Google Products] using Factal" --- Anonymous WWW'11 Reviewer.
http://factal.eas.asu.edu

Evaluation
Precision and DCG are compared with the following baseline methods:
1. CORI: adapted from text database selection; the union of sampled documents from the sources is indexed, and the sources with the highest number of term hits are selected [Callan et al. 1995].
2. Coverage: adapted from relational databases; the mean relevance of the top-5 results to the sampling queries [Nie et al. 2004].
3. Google Products: the product search used over Google Base.
All experiments distinguish SourceRank from the baseline methods at a 0.95 confidence level.
Though the combinations (SR-Coverage, SR-CORI) are not our competitors, note that they are not better:
1. SourceRank implicitly considers query relevance, as the selected sources fetch answers by query similarity; combining it again with query similarity may be an "over-weighting".
2. The search is vertical (domain-specific).
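Before the results, a minimal sketch of the SourceRank computation described above: a random walk on the collusion-adjusted agreement graph, iterated to its stationary distribution. The iteration count and the row-normalization into transition probabilities are illustrative assumptions, not details taken from the slides.

```python
def source_rank(weights, sources, iterations=100):
    """weights[(s1, s2)]: agreement link weight from s1 to s2, already multiplied
    by (1 - collusion) as described above. Because the smoothing links make every
    out-degree positive, the chain is irreducible and the power iteration
    converges to the stationary visit probabilities, i.e. the SourceRank."""
    n = len(sources)
    out_sum = {s1: sum(weights.get((s1, s2), 0.0) for s2 in sources if s2 != s1)
               for s1 in sources}
    rank = {s: 1.0 / n for s in sources}
    for _ in range(iterations):
        rank = {
            s2: sum(rank[s1] * weights.get((s1, s2), 0.0) / out_sum[s1]
                    for s1 in sources
                    if s1 != s2 and out_sum[s1] > 0)
            for s2 in sources
        }
    return rank  # higher value = more agreed-upon (trustworthy/important) source
```

The static scores this produces would then be combined with a query-specific source-relevance measure for the final ranking, as noted on the SourceRank slide.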
Online Top-4 Sources — Movies
(Figure: precision and DCG for Coverage, SourceRank, CORI, SR-Coverage, and SR-CORI; a 29% improvement is highlighted.)

Online Top-4 Sources — Books
(Figure: precision and DCG for Coverage, SourceRank, CORI, SR-Coverage, and SR-CORI; a 48% improvement is highlighted.)

Google Base Top-5 Precision — Books (675 sources)
675 Google Base sources responding to a set of book queries are used as the book-domain sources. GBase-Domain is Google Base searching only these 675 domain sources. Source selection by SourceRank (or Coverage) is followed by ranking by Google Base.
(Figure: top-5 precision; a 24% improvement is highlighted.)

Google Base Top-5 Precision — Movies (209 sources)
(Figure: top-5 precision for GBase, GBase-Domain, SourceRank, and Coverage; a 25% improvement is highlighted.)

Trustworthiness of Source Selection
1. The results in the sampled crawl are corrupted by replacing attribute values not specified in the queries with random strings (since partial titles are the queries, we corrupt all attributes except the titles).
2. If the source selection is sensitive to corruption, the source ranks should decrease with the corruption level.
(Figure: Google Base movies — decrease in rank (%) versus corruption level for SourceRank, Coverage, and CORI.)
Every relevance measure based on query similarity is oblivious to the corruption of attributes unspecified in the queries.

Trustworthiness — Google Base Books
(Figure: Google Base books — decrease in rank (%) versus corruption level for SourceRank, Coverage, and CORI.)

Collusion — Ablation Study
Two databases with the same one million tuples from IMDB are created, and the correlation between their ranking functions is reduced progressively.
Observations:
1. At high correlation the adjusted agreement is very low.
2. At low correlations the adjusted agreement is almost the same as the pure agreement.
Natural agreement is preserved while near-mirrors are caught.
(Figure: collusion, agreement, and adjusted agreement versus rank correlation.)

Computation Time
The random walk is known to be feasible at large scale. The time to compute the agreements is evaluated against the number of sources. Note that the computation is offline and easy to parallelize.

Publications and Recognition
SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement. Raju Balakrishnan, Subbarao Kambhampati. WWW 2011 (full paper).
Factal: Integrating Deep Web Based on Trust and Relevance. Raju Balakrishnan, Subbarao Kambhampati. WWW 2011 (demonstration).
SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement (Best Poster Award, WWW 2010). Raju Balakrishnan, Subbarao Kambhampati. WWW 2010, pages 1055-1056.

Contributions
1. Agreement-based trust assessment for the deep web.
2. Agreement-based relevance assessment for the deep web.
3. Collusion detection between web sources.
4. Evaluations on Google Base sources and online web databases.

Agenda
Ranking the Deep Web
– Need for new ranking.
– SourceRank: agreement analysis.
– Computing agreement and collusion.
– Results & system implementation.
Proposed work: ranking the deep web results.
Problem 2: Ad-Ranking Sensitive to Mutual Influences — a different aspect of ranking.
– Motivation and problem definition.
– Browsing model & nature of influence.
– Ranking function & generalization.
– Results.
Proposed work: mechanism design & evaluations.
(Search engines generate their multi-billion-dollar revenue from textual ads; the related problem of ranking ads is as important as the ranking of results.)
Ad Ranking: State of the Art
Current approaches sort by bid amount, or by bid amount × relevance. Ads are considered in isolation, as both ignore mutual influences. We consider the ads as a set, and the ranking is based on the user's browsing model.

User's Browsing Model
• The user browses down, starting at the first ad.
• At every ad he may:
  – click the ad with relevance probability R(a) = P(click(a) | view(a));
  – abandon browsing with probability γ(a);
  – go down to the next ad otherwise.
• The process repeats for the ads below with a reduced view probability.
If a2 is similar to a1, the residual relevance of a2 decreases and its abandonment probability increases.

Mutual Influences
Three manifestations of mutual influences on an ad a:
1. Similar ads placed above a reduce the user's residual relevance of a.
2. Relevance of other ads placed above a: the user may click on the ads above and never view a.
3. Abandonment probability of other ads placed above a: the user may abandon browsing and never view a.

Expected Profit Considering Ad Similarities
Considering the bids $(a_i), the residual relevances R_r(a_i), the abandonment probabilities γ(a_i), and the similarities, the expected profit from a set of n results is
  Expected Profit = Σ_{i=1..n} $(a_i) R_r(a_i) Π_{j=1..i−1} (1 − R_r(a_j) − γ(a_j)).
THEOREM: Ranking to maximize the expected profit considering similarities between the results is NP-hard. Even worse, constant-ratio approximation algorithms are hard (unless NP = ZPP) for this diversity ranking problem.
The proof is a reduction from the independent set problem to choosing the top-k ads considering similarities.

Expected Profit Considering the Other Two Mutual Influences (2 and 3)
Dropping similarity, and hence replacing the residual relevance R_r(a_i) by the absolute relevance R(a_i):
  Expected Profit = Σ_{i=1..n} $(a_i) R(a_i) Π_{j=1..i−1} (1 − R(a_j) − γ(a_j)).
Ranking to maximize this expected utility is a sorting problem.

Optimal Ranking
Rank the ads in descending order of
  RF(a) = $(a) R(a) / (R(a) + γ(a)).
The physical meaning: RF is the profit generated per unit of consumed view probability. Higher positions have more view probability, so placing ads that produce more profit per unit of consumed view probability higher up is intuitive.

Comparison to Current Ad Rankings
Sort by bid amount: assumes the abandonment probability is zero, γ(a) = 0, so RF(a) = $(a) R(a) / R(a) = $(a). This assumes that the user has infinite patience to go down the results until he finds the ad he wants.
Sort by bid amount × relevance: assumes γ(a) = k − R(a), where k is a constant for all ads, so RF(a) = $(a) R(a) / k ∝ $(a) R(a). This assumes that the abandonment probability is negatively proportional to the relevance.

Generality of the Proposed Ranking
The generalized ranking is based on utilities: for documents, utility = relevance (reducing to the popular relevance ranking); for ads, utility = bid amount.
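To make the ranking concrete, here is a minimal sketch of sorting by RF and of the expected-profit model from the preceding slides. The Ad field names and the assumption R(a) + γ(a) ≤ 1 are illustrative; this is not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    bid: float          # $(a)
    relevance: float    # R(a) = P(click(a) | view(a))
    abandonment: float  # gamma(a); assumed to satisfy relevance + abandonment <= 1

def rf(ad):
    """RF(a) = $(a) * R(a) / (R(a) + gamma(a)):
    profit generated per unit of consumed view probability."""
    return ad.bid * ad.relevance / (ad.relevance + ad.abandonment)

def rank_ads(ads):
    """Optimal ranking for mutual influences 2 and 3: sort by RF, descending."""
    return sorted(ads, key=rf, reverse=True)

def expected_profit(ranked_ads):
    """E[profit] = sum_i $(a_i) R(a_i) prod_{j<i} (1 - R(a_j) - gamma(a_j))."""
    view_prob, profit = 1.0, 0.0
    for ad in ranked_ads:
        profit += view_prob * ad.bid * ad.relevance
        view_prob *= 1.0 - ad.relevance - ad.abandonment
    return profit
```

Setting abandonment to 0 for every ad makes rf() reduce to the bid, and setting abandonment to k − relevance makes it proportional to bid × relevance, matching the two special cases above.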
Quantifying Expected Profit
Simulation setup: relevance R(a) is uniform random in [0, 1]; the abandonment probability γ(a) is uniform random (with R(a) + γ(a) ≤ 1); the number of clicks is Zipf random with exponent 1.5; bid amounts $(a) are uniform random in [0, 10].
(Figure: expected profit of RF, bid amount × relevance, and bid amount.)
The proposed strategy gives the maximum profit over the entire range, with improvements of 35.9% and 45.7% over the competing strategies; the difference in profit between RF and the competing strategies is significant. The bid-amount-only strategy becomes optimal as γ(a) → 0.

Publication and Recognition
Optimal Ad-Ranking for Profit Maximization. Raju Balakrishnan, Subbarao Kambhampati. WebDB 2008.
Yahoo! Research Key Scientific Challenges Award for Computational Advertising, 2009-10.

Overall Contributions
1. SourceRank-based source selection sensitive to (1) the trustworthiness and (2) the importance of deep-web sources.
2. A method to assess the collusion of deep-web sources.
3. An optimal generalized ranking for ads and search results.
4. A ranking framework optimal with respect to the perceived relevance of search snippets and the abandonment probability.