Raju Balakrishnan
(PhD Proposal Defense)
Committee: Subbarao Kambhampati (chair)
Yi Chen
AnHai Doan
Huan Liu.
Problem 1: Ranking the Deep Web
– Need for New Ranking.
– SourceRank: Agreement Analysis.
– Computing Agreement and Collusion.
– Results & System Implementation.
Proposed Work: Ranking the Deep Web Results.
Problem 2: Ad-Ranking sensitive to Mutual Influences.
– Browsing model & Nature of Influence.
– Ranking Function and Generalizations.
– Results.
Proposed Work: Mechanism Design for Ads.
Millions of sources containing structured tuples
Uncontrolled collection of redundant information
Search engines have nominal access. We don't Google for a "Honda Civic 2008 Tempe".
[Figure: mediator architecture integrating multiple web databases.]
Rankings are oblivious to result Importance & Trustworthiness
Example query: "Godfather Trilogy" on Google Base
Importance: the returned titles match the query, but none of the results is the classic Godfather.
Trustworthiness (bait and switch):
The titles and cover image match exactly.
Prices are low. Amazing deal!
But when you proceed to checkout you realize that the product is a different one (or when you open the mail package, if you are really unlucky).
Problem: Given a user query, select a subset of sources to provide important and trustworthy answers.
Surface web search combines link analysis with query-relevance to account for the trustworthiness and relevance of the results.
Unfortunately, deep web records do not have hyperlinks.
Observations
Many sources return answers to the same query.
Comparison of the semantics of the answers is facilitated by the structure of the tuples.
Idea: Compute the importance and trustworthiness of sources based on the agreement of the answers returned by different sources.
Important results are likely to be returned by a large number of sources.
e.g. for the query "Godfather", hundreds of sources return the classic "The Godfather", while only a few sources return the little-known movie "Little Godfather".
Two independent sources are not likely to agree on corrupt/untrustworthy answers.
e.g. a wrong author for a book (e.g. the Godfather author listed as "Nino Rota") would not be agreed upon by other sources. As we know, truth is one (or a few), but lies are many.
The probability of agreement of two independently selected irrelevant/false tuples is very small:
P_a(f_1, f_2) ≈ 1 / |U|   (e.g. 1/100k for a universe of 100k tuples)
The probability of agreement of two independently picked relevant and true tuples is much higher:
P_a(r_1, r_2) ≈ 1 / |R ∩ T|
where U is the universe of tuples and R, T are the sets of relevant and trustworthy tuples. Since |U| >> |R ∩ T|:
P_a(r_1, r_2) >> P_a(f_1, f_2)
The agreement-based weight of the link from S_1 to S_2 is smoothed as:
W(S_1 → S_2) = β + (1 - β) × A(R_1, R_2) / |R_2|
[Figure: example agreement graph over three sources S_1, S_2, S_3 with asymmetric edge weights (0.78, 0.86, 0.6, 0.4, 0.22, 0.14).]
Link semantics: a link of weight w from S_i to S_j means that S_i acknowledges (agrees with) a w fraction of the tuples in S_j.
Since the weight is a fraction of the target's tuples, the links are asymmetric.
β is a smoothing factor to account for the unseen samples.
R_1 and R_2 are the result sets of S_1 and S_2 respectively.
Agreement is computed using 200 keyword queries.
Partial titles of movies/books are used as the queries.
The mean agreement over all the queries is used as the final agreement.
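To make the link weight concrete, here is a minimal sketch of how the smoothed weight for a single query could be computed; the smoothing value beta = 0.1 and the function name are illustrative assumptions, not necessarily what the implemented system uses. The per-query weights would then be averaged over the 200 sampling queries as described above.

```python
# Minimal sketch of the smoothed agreement link weight W(S1 -> S2).
# beta and the function name are assumptions for illustration.

def link_weight(agreement, result_size_s2, beta=0.1):
    """agreement: A(R1, R2), the (weighted) number of tuples in S2's
    result set R2 that S1's result set R1 agrees with.
    result_size_s2: |R2|.
    beta: smoothing factor accounting for unseen samples (assumed 0.1).
    """
    if result_size_s2 == 0:
        return beta
    return beta + (1.0 - beta) * (agreement / result_size_s2)
```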
How can the agreement graph be used for improved search?
• The source graph is viewed as a Markov chain, with the edges as transition probabilities between the sources.
• The prestige of sources, considering the transitive nature of agreement, may be computed by a Markov random walk.
SourceRank is the stationary visit probability of this random walk on the database vertex.
This static SourceRank may be combined with a query-specific source-relevance measure for the final ranking.
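A minimal sketch of this random-walk computation, assuming the agreement graph is given as a dense weight matrix; the reset (teleport) probability of 0.15 and the fixed iteration count are illustrative assumptions.

```python
# Sketch: SourceRank as the stationary visit probability of a random walk
# over the agreement graph, computed by power iteration.

import numpy as np

def source_rank(weights, reset=0.15, iterations=100):
    """weights[i, j]: weight of the agreement link from source i to source j."""
    w = np.asarray(weights, dtype=float)
    transition = w / w.sum(axis=1, keepdims=True)  # row-normalize to a Markov chain
    n = transition.shape[0]
    rank = np.full(n, 1.0 / n)                     # start from the uniform distribution
    for _ in range(iterations):
        rank = reset / n + (1.0 - reset) * (rank @ transition)
    return rank                                    # stationary visit probabilities
```

The resulting static scores could then be combined linearly with a query-specific source-relevance measure, as described above.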
Computing semantic agreement between two records is the record linkage problem, and is known to be hard.
Source 1 tuple: Title: "Godfather, The: The Coppola Restoration" | Cast: "James Caan / Marlon Brando more"
Source 2 tuple: Title: "The Godfather - The Coppola Restoration Giftset [Blu-ray]" | Cast: "Marlon Brando, Al Pacino"
Example: "Godfather" tuples from two web sources. Note that the titles and casts are written differently.
Semantically identical entities may be represented syntactically differently by different databases (non-common domains).
Agreement computation has three levels.
1. Comparing attribute values:
Soft TF-IDF with Jaro-Winkler as the similarity measure is used.
2. Comparing records:
We do not assume a predefined schema matching; this is an instance of a bipartite matching problem.
An O(v^2) greedy matching is used: each value is greedily matched against the most similar value in the other record.
Attribute importance is weighted by IDF (e.g. the same title (Godfather) is more important than the same format (paperback)).
3. Comparing result sets:
Using the record similarities computed above, result-set similarities are computed with the same greedy approach.
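The greedy record comparison in step 2 might look roughly like the following sketch; value_similarity stands in for the Soft TF-IDF/Jaro-Winkler measure and idf for precomputed IDF weights, both of which are assumed here rather than taken from the actual implementation.

```python
# Sketch of the O(v^2) greedy, IDF-weighted matching of attribute values
# between two records (no schema matching assumed).

def record_agreement(record1, record2, value_similarity, idf):
    """record1, record2: lists of attribute-value strings.
    Returns an IDF-weighted similarity in [0, 1]."""
    total, weight_sum = 0.0, 0.0
    unmatched = list(record2)
    for value in record1:
        if not unmatched:
            break
        # Greedily match against the most similar remaining value.
        best = max(unmatched, key=lambda other: value_similarity(value, other))
        total += idf(value) * value_similarity(value, best)
        weight_sum += idf(value)
        unmatched.remove(best)
    return total / weight_sum if weight_sum else 0.0
```

Result-set agreement (step 3) would apply the same greedy matching one level up, pairing records of one result set with their most similar counterparts in the other.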
Sources may copy data from each other, or create mirrors, boosting the SourceRank of the group.
Observation 1: Even non-colluding sources in the same domain may contain the same data.
e.g. movie databases may contain all Hollywood movies.
Observation 2: The top-k answers of even non-colluding sources may be similar.
e.g. answers to the query "Godfather" may contain all three movies in the Godfather trilogy.
Basic method: if two sources return the same top-k answers to queries with a very large number of answers (e.g. queries like "the" or "DVD"), they are likely to be colluding.
We compute the degree of collusion of two sources as their agreement on such large-answer queries.
The words with the highest DF in the crawl are used as these queries.
The agreement between two databases is adjusted for collusion by multiplying by (1 - collusion).
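A sketch of the collusion adjustment just described; the top-k agreement values on the high-DF queries are assumed to come from the same agreement computation as before.

```python
# Sketch: collusion detection and adjustment. The inputs are agreements
# on queries with very large answer sets (highest-DF words in the crawl).

def collusion_degree(topk_agreements_on_high_df_queries):
    scores = topk_agreements_on_high_df_queries
    return sum(scores) / len(scores) if scores else 0.0

def adjusted_agreement(agreement, collusion):
    # Near-mirrors (collusion close to 1) lose almost all credit, while
    # genuinely agreeing sources (collusion close to 0) keep theirs.
    return agreement * (1.0 - collusion)
```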
"I personally ran a handful of test queries this way and got much better [than Google Products] results using Factal" -- Anonymous WWW'11 Reviewer.
Precision and DCG are compared with the following baseline methods:
1) CORI: adapted from text database selection. The union of sampled documents from the sources is indexed, and the sources with the highest number of term hits are selected [Callan et al. 1995].
2) Coverage: adapted from relational databases. Mean relevance of the top-5 results to the sampling queries [Nie et al. 2004].
3) Google Products: the product search used over Google Base.
All experiments distinguish SourceRank from the baseline methods at a 0.95 confidence level.
[Figure: top-5 precision of source selection for Coverage, SourceRank, CORI, SR-Coverage and SR-CORI. SourceRank improves on the best baseline by 29%.]
Though the combinations (SR-Coverage, SR-CORI) are not our competitors, note that they are not better: (1) query-similarity may be an "overweighting" of relevance, as the selected sources already fetch relevant answers; (2) the search is vertical.
[Figure: precision and DCG of source selection for Coverage, SourceRank, CORI, SR-Coverage and SR-CORI. SourceRank improves on the best baseline by 48%.]
[Figure: precision on the Google Base book domain (675 sources). SourceRank improves by 24%.]
675 Google Base sources responding to a set of book queries are used as the book-domain sources.
GBase-Domain is Google Base searching only over these 675 domain sources.
For the SourceRank (Coverage) bars, source selection by SourceRank (Coverage) is followed by ranking by Google Base.
[Figure: precision on the Google Base movie domain (209 sources) for GBase, GBase-Domain, SourceRank and Coverage. SourceRank improves by 25%.]
[Figure: decrease in source rank vs. corruption level for SourceRank, Coverage and CORI on Google Base Movies.]
1. We corrupted the results in the sample crawl by replacing attribute values not specified in the queries with random strings (since partial titles are the queries, we corrupted all attributes except the titles).
2. If the source selection is sensitive to corruption, the source ranks should decrease with the corruption level.
Any relevance measure based purely on query-similarity is oblivious to corruption of the attributes unspecified in the query.
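For concreteness, the corruption step could be implemented along these lines, assuming (purely for illustration) that crawled tuples are dictionaries keyed by attribute name and that "corruption level" is read as a per-attribute corruption probability.

```python
# Sketch: corrupting a crawled tuple by replacing attribute values that are
# not specified in the query (e.g. everything except the title) with random
# strings, with probability equal to the corruption level.

import random
import string

def corrupt_tuple(record, query_attributes, corruption_level):
    corrupted = dict(record)
    for attribute, value in record.items():
        if attribute not in query_attributes and random.random() < corruption_level:
            corrupted[attribute] = "".join(
                random.choices(string.ascii_lowercase, k=max(len(value), 1)))
    return corrupted
```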
[Figure: decrease in source rank vs. corruption level for SourceRank, Coverage and CORI in the second test domain.]
[Figure: agreement and collusion-adjusted agreement vs. rank correlation between two mirrored databases.]
Two databases with the same one million tuples from IMDB are created.
The correlation between their ranking functions is then progressively reduced.
Observations:
1. At high rank correlation, the adjusted agreement is very low.
2. At low correlations, the adjusted agreement is almost the same as the pure agreement.
Thus genuine agreement is preserved while near-mirrors are caught.
The random walk is known to be feasible at large scale.
The time to compute the agreements is evaluated against the number of sources.
Note that the agreement computation is offline and easy to parallelize.
SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement.
Raju Balakrishnan, Subbarao Kambhampati.
Accepted for WWW 2011 (full paper).
Factal: Integrating Deep Web Based on Trust and Relevance.
Raju Balakrishnan, Subbarao Kambhampati.
Accepted for WWW 2011 (demonstration).
SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement (Best Poster Award, WWW 2010).
Raju Balakrishnan, Subbarao Kambhampati.
WWW 2010, pages 1055-1056.
Results from the selected sources need to be combined and ranked. Ranking is online and hence time-critical. We plan to:
Consider higher-order agreements.
e.g. second-order agreement, i.e. "friends" of two tuples agreeing with each other.
Compute and combine query-similarity.
Query similarity considering the structure of the tuples.
Use positional voting, since sources rank their tuples.
e.g. Borda count instead of the set-based agreement used currently (see the sketch after this list).
Learn to combine ranking parameters.
Multiple parameters (agreement, lineage, query-similarity, higher-order agreements, etc.) need to be combined.
Consider diversity ranking.
Incorporating diversity at the result level and at the source level.
Perform a large-scale evaluation of result ranking.
As with source selection, the result ranking needs to be evaluated on multiple large-scale datasets.
Enhance the prototype.
The Factal prototype may be enhanced with the result ranking.
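As one concrete possibility for the positional-voting item above, a Borda-count style aggregation over the sources' ranked answers could look like the following; this is proposed, not implemented, work and the sketch is purely illustrative.

```python
# Sketch: Borda-count aggregation of ranked result lists from multiple
# sources, as one possible positional-voting alternative to the current
# set-based agreement.

from collections import defaultdict

def borda_aggregate(rankings):
    """rankings: one ranked list of tuple ids per source.
    Returns tuple ids ordered by their total Borda score."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, tuple_id in enumerate(ranking):
            scores[tuple_id] += n - position   # top-ranked tuple gets n points
    return sorted(scores, key=scores.get, reverse=True)
```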
Ranking the Deep Web – as important as the ranking of the results.
Problem 2: Ad-Ranking Sensitive to Mutual Influences (a different aspect of ranking)
– Motivation and Problem Definition.
– Browsing Model & Nature of Influence.
– Ranking Function & Generalizations.
– Results.
Proposed Work: Mechanism Design & Evaluations.
Existing strategies sort either by Bid Amount × Relevance or by Bid Amount alone.
In both, ads are considered in isolation: the mutual influences between ads are ignored.
We consider the ads as a set, and the ranking is based on the user's browsing model:
• The user browses down, starting at the first ad; as he goes down, the residual relevance of the ads below decreases and the abandonment probability increases.
• At every ad a he may:
Click the ad with relevance probability R(a) = P(click(a) | view(a)),
Abandon browsing with abandonment probability γ(a), or
Go down to the next ad with the remaining probability.
The process repeats for the ads below with a reduced view probability.
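A minimal simulation of this browsing model, assuming R(a) + γ(a) ≤ 1 for every ad and ignoring the residual-relevance effect of similar ads; all names are illustrative.

```python
# Sketch: one simulated browsing session under the model above.
# Assumes relevance(a) + abandonment(a) <= 1 for every ad.

import random

def simulate_session(ads, bid, relevance, abandonment):
    """ads: ad ids in display order; bid/relevance/abandonment: per-ad functions.
    Returns the profit earned in this session."""
    for ad in ads:
        u = random.random()
        if u < relevance(ad):                       # user clicks this ad
            return bid(ad)
        if u < relevance(ad) + abandonment(ad):     # user abandons browsing
            return 0.0
        # otherwise the user moves down to the next ad
    return 0.0
```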
Three manifestations of the mutual influences on an ad a:
1. Similar ads placed above a reduce the user's residual relevance of a.
2. Relevance of the other ads placed above a: the user may click an ad above and never view a.
3. Abandonment probability of the other ads placed above a: the user may abandon the search and never view a.
Considering these influences, the expected profit from a set of n results is:
Expected Profit = Σ_{i=1..n} $(a_i) × R_r(a_i) × Π_{j=1..i-1} (1 - R_r(a_j) - γ(a_j))
where $(a_i) is the profit (bid amount) of ad a_i, R_r(a_i) is its residual relevance given the ads placed above it, and γ(a_j) is the abandonment probability at a_j.
THEOREM: Ranking to maximize the expected profit considering similarities between the results is NP-hard.
Even worse, constant-ratio approximation algorithms are hard (unless NP = ZPP) for the diversity ranking problem.
The proof is a reduction from the independent set problem to choosing the top-k ads considering similarities.
Dropping similarity, and hence replacing the residual relevance R_r by the standalone relevance R:
Expected Profit = Σ_{i=1..n} $(a_i) × R(a_i) × Π_{j=1..i-1} (1 - R(a_j) - γ(a_j))
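The simplified expected profit can be computed directly for any ordering, as in this sketch (the function and parameter names are illustrative assumptions).

```python
# Sketch: expected profit of an ad ordering under the simplified model
# (standalone relevance R instead of residual relevance).

def expected_profit(ads, bid, relevance, abandonment):
    total, view_probability = 0.0, 1.0
    for ad in ads:
        total += bid(ad) * relevance(ad) * view_probability
        # probability that the user neither clicks nor abandons here,
        # and therefore goes on to view the next ad
        view_probability *= 1.0 - relevance(ad) - abandonment(ad)
    return total
```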
Ranking to maximize this expected utility is a sorting problem.
Rank the ads in descending order of:
RF(a) = $(a) × R(a) / (R(a) + γ(a))
The physical meaning of RF is the profit generated per unit of consumed view probability.
Higher slots have more view probability, so placing the ads that produce more profit per unit of consumed view probability higher up is intuitive.
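Ranking by RF is then a plain sort, as in the sketch below; the guard against a zero denominator is an implementation detail assumed here.

```python
# Sketch: rank ads in descending order of RF(a) = $(a) R(a) / (R(a) + gamma(a)).

def rank_by_rf(ads, bid, relevance, abandonment):
    def rf(ad):
        denominator = relevance(ad) + abandonment(ad)
        return bid(ad) * relevance(ad) / denominator if denominator > 0 else 0.0
    return sorted(ads, key=rf, reverse=True)
```

Together with the expected_profit sketch above, this ordering can be compared against the bid-only and bid × relevance orderings on synthetic data, which mirrors the comparison in the experiments that follow.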
Sort by Bid Amount: assume the abandonment probability is zero, γ(a) = 0. Then
RF(a) = $(a) × R(a) / R(a) = $(a)
This assumes the user has infinite patience to go down the results until he finds the ad he wants.
Bid Amount × Relevance: assume γ(a) = k - R(a). Then
RF(a) = $(a) × R(a) / k ∝ $(a) × R(a)
This assumes the abandonment probability is negatively proportional to the relevance.
The proposed ranking generalizes utility-based ranking:
For documents, utility = relevance, giving the popular relevance ranking.
For ads, utility = bid amount.
[Figure: expected profit vs. abandonment probability for RF, Bid Amount × Relevance, and Bid Amount.]
Simulation setup:
Relevance R(a): uniform random in [0, 1].
Bid amounts $(a): uniform random in [0, 10].
Number of clicks: Zipf random with exponent 1.5.
Abandonment probability γ(a): uniform random, over the range shown on the x-axis.
The difference in profit between RF and the competing strategies is significant (45.7% and 35.9% at the annotated points).
The proposed strategy gives the maximum profit over the entire range.
The bid-amount-only strategy becomes optimal as γ(a) → 0.
Optimal Ad-Ranking for Profit Maximization.
Raju Balakrishnan, Subbarao Kambhampati.
WebDB 2008.
Yahoo! Research Key Scientific Challenges Award for Computational Advertising, 2009-10.
1. Ad auction based on the proposed ranking.
Associate a pricing mechanism (e.g. Generalized Second Pricing) with the proposed ranking to formulate a complete auction mechanism (a sketch follows this list).
Formulate an envy-free equilibrium: at an envy-free equilibrium, no advertiser can increase his profit by changing bids, and the auction becomes stable.
Analyze the advertisers' profits and compare with the existing mechanisms: the profits at equilibrium are likely to be the sustained profits.
2. Evaluating the proposed ranking on click logs.
In addition to the currently used CTR and bid amounts, the proposed ranking requires one additional parameter: the abandonment probability.
Mining/learning the abandonment probability from click logs, and using it to improve the ranking, will establish the applicability of the proposed method.
Since access to ad click logs is restricted to search providers, I am hoping to perform this work as part of a possible summer internship (this is a risk factor).
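As a purely illustrative sketch of the first item (not a committed design), a GSP-style payment adapted to the RF ranking could charge each advertiser, per click, the minimum bid needed to keep its slot; the quality factor below follows from RF(a) = $(a) R(a) / (R(a) + γ(a)).

```python
# Sketch (proposed work, illustrative only): GSP-style pricing on top of
# the RF ranking. Each advertiser pays, per click, the minimum bid that
# keeps its RF score at or above the next ad's RF score.

def gsp_prices(ranked_ads, bid, relevance, abandonment):
    """ranked_ads: ads already sorted by descending RF.
    Returns a dict of per-click prices; the last ad pays a reserve of 0."""
    def quality(ad):
        denominator = relevance(ad) + abandonment(ad)
        return relevance(ad) / denominator if denominator > 0 else 0.0

    def rf(ad):
        return bid(ad) * quality(ad)

    prices = {}
    for i, ad in enumerate(ranked_ads):
        if i + 1 < len(ranked_ads) and quality(ad) > 0:
            prices[ad] = rf(ranked_ads[i + 1]) / quality(ad)
        else:
            prices[ad] = 0.0
    return prices
```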
1. SourceRank based source selection sensitive to trustworthiness and importance for the deep web.
2. A wisdom of sources approach to rank deep web results considering trust and relevance.
3. An optimal generalized ranking for ads and search results.
4. A ranking framework optimal with respect to the perceived relevance of search snippets, and abandonment probability.
5. A complete auction mechanism extending the optimal ranking with pricing.