SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement
Raju Balakrishnan, Subbarao Kambhampati
Arizona State University
Deep Web Integration Scenario
Millions of
sources
containing
structured
tuples
Uncontrolled
collection of
redundant
information
Mediator
Search engines
have nominal
access. We
don’t Google for
a “Honda Civic
2008 Tampa”
Web DB
Web DB
Web DB
Web DB
Web DB
Deep Web
2
Why Another Ranking?
Existing rankings are oblivious to result importance and trustworthiness.
Example query: "Godfather Trilogy" on Google Base.
Importance: the search matches titles against the query, yet none of the results is the classic Godfather.
Trustworthiness (bait and switch): the titles and cover images match exactly, and the prices are low. Amazing deal! But when you proceed to check out, you realize that the product is a different one (or when you open the mail package, if you are really unlucky).
Agenda
1. Problem Definition
2. SourceRank: Ranking based on Agreement
3. Computing Agreement
4. Computing Source Collusion
5. System Implementation and Results
Source Selection in the Deep Web
Problem: Given a user query, select a subset of sources to provide important and trustworthy answers.
Surface web search combines link analysis with query relevance to account for both trustworthiness and relevance of the results.
Unfortunately, deep web records do not have hyperlinks.
Source Agreement
Observations:
• Many sources return answers to the same query.
• Comparison of the semantics of the answers is facilitated by the structure of the tuples.
Idea: Compute the importance and trustworthiness of sources based on the agreement of the answers returned by different sources.
Agreement Implies Trust & Importance
Important results are likely to be returned by a large number of sources.
e.g. For the query "Godfather", hundreds of sources return the classic "The Godfather", while only a few return the little-known movie "Little Godfather".
Two independent sources are unlikely to agree on corrupt/untrustworthy answers.
e.g. A wrong author for a book (say, listing the Godfather author as "Nino Rota") would not be agreed upon by other sources. As we know, truth is one (or a few), but lies are many.
Agreement is not just for search.
[Illustration: "Which tire?"]
Agreement Implies Trust & Relevance
The probability that two independently selected irrelevant/false tuples agree is
  $P_a(f_1, f_2) \approx \frac{1}{|U|}$  (a very small number, on the order of 1/100k in the slide's example),
where $U$ is the universe of possible false answers.
The probability that two independently picked relevant and true tuples agree is
  $P_a(r_1, r_2) \approx \frac{1}{|RT|}$,
where $RT$ is the set of relevant, true tuples. Since $|U| \gg |RT|$,
  $P_a(r_1, r_2) \gg P_a(f_1, f_2)$.
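A worked instance with assumed sizes ($|U| = 100{,}000$ matches the 100k order of magnitude on the slide; $|RT| = 5$ is made up for illustration):

```latex
P_a(f_1, f_2) \approx \frac{1}{|U|} = \frac{1}{100{,}000} = 10^{-5}, \qquad
P_a(r_1, r_2) \approx \frac{1}{|RT|} = \frac{1}{5} = 0.2
```

Under these assumed sizes, two sources are four orders of magnitude more likely to agree on a relevant, true tuple than on a false one.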
Method: Sampling-based Agreement
The agreement link weight from S1 to S2 is
  $W(S_1 \rightarrow S_2) = \beta + (1 - \beta)\,\frac{A(R_1, R_2)}{|R_2|}$
where $\beta$ induces the smoothing links to account for unseen samples, and $R_1, R_2$ are the result sets of S1 and S2.
[Figure: example agreement graph over sources S1, S2, S3 with weighted, asymmetric edges (weights 0.78, 0.22, 0.86, 0.14, 0.6, 0.4).]
Link semantics from Si to Sj with weight w: Si acknowledges a fraction w of the tuples in Sj. Since the weight is a fraction, the links are asymmetric.
Agreement is computed using keyword queries: partial titles of movies/books are used as queries, and the mean agreement over all the queries is used as the final agreement.
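A minimal sketch of this edge-weight computation in Python (the smoothing value beta = 0.1 and the helper `agreement` are placeholders, not taken from the slides):

```python
# Sketch (not the authors' code): smoothed agreement link weight
# W(S1 -> S2) = beta + (1 - beta) * A(R1, R2) / |R2|, averaged over the
# sampling queries. `agreement(r1, r2)` stands in for the record-level
# agreement computation described on the following slides.

def link_weight(result_sets_s1, result_sets_s2, agreement, beta=0.1):
    """result_sets_s1/s2: lists of result sets, one per sampling query."""
    weights = []
    for r1, r2 in zip(result_sets_s1, result_sets_s2):
        if not r2:                      # S2 returned nothing for this query
            continue
        a = agreement(r1, r2)           # how much of R2 is endorsed by R1
        weights.append(beta + (1 - beta) * a / len(r2))
    return sum(weights) / len(weights) if weights else beta
```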
Method: Calculating SourceRank
How can I use the agreement graph for improved search?
• The source graph is viewed as a Markov chain, with the edges as transition probabilities between the sources.
• The prestige of the sources, considering the transitive nature of agreement, may be computed based on a Markov random walk (sketched after this slide).
SourceRank is equal to the stationary visit probability of this random walk on the database vertex.
This static SourceRank may be combined with a query-specific source-relevance measure for the final ranking.
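A minimal sketch of the random walk, assuming the agreement weights are already collected in a matrix `W` (entry (i, j) holding the weight of the link from source i to source j); this is standard power iteration, not the authors' exact implementation:

```python
import numpy as np

def source_rank(W, iterations=100, tol=1e-9):
    """Stationary visit probabilities of the random walk on the agreement graph."""
    n = W.shape[0]
    # Normalize outgoing weights so each row is a transition distribution.
    # The smoothing links (beta) keep every row non-zero.
    T = W / W.sum(axis=1, keepdims=True)
    rank = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(iterations):
        new_rank = rank @ T             # one step of the Markov random walk
        if np.abs(new_rank - rank).sum() < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank                         # SourceRank of each source
```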
Computing Agreement is Hard
Computing the semantic agreement between two records is the record linkage problem, which is known to be hard.
[Example "Godfather" tuples from two web sources: "Godfather, The: The Coppola Restoration" with cast "Marlon Brando, Al Pacino" versus "The Godfather - The Coppola Restoration Giftset [Blu-ray]" with cast "James Caan / Marlon Brando", at prices 13.99 USD and $9.99. Note that the titles and castings are denoted differently.]
Semantically identical entities may be represented syntactically differently by two databases (non-common domains).
Method: Computing Agreement
Agreement computation has three levels:
1. Comparing attribute values
• SoftTFIDF with Jaro-Winkler as the similarity measure is used.
2. Comparing records
• We do not assume a predefined schema matching; this is an instance of a bipartite matching problem. Optimal matching is O(v^3), so an O(v^2) greedy matching is used: each value is greedily matched against the most similar value in the other record (sketched after this slide).
• Attribute importance is weighted by IDF (e.g. matching titles ("Godfather") is more important than matching formats ("paperback")).
3. Comparing result sets
• Using the record similarities computed above, result-set similarities are computed with the same greedy approach.
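A hedged sketch of the greedy record-level matching from step 2, with `similarity` standing in for SoftTFIDF/Jaro-Winkler and `idf` for the attribute-value weight (both placeholders, not the authors' code):

```python
# Greedy O(v^2) matching of attribute values between two records,
# replacing the optimal O(v^3) bipartite matching.

def record_agreement(record1, record2, similarity, idf):
    """record1, record2: lists of attribute values; returns a [0, 1] score."""
    unmatched = list(record2)
    score, total_weight = 0.0, 0.0
    for v1 in record1:
        if not unmatched:
            break
        # Greedily pick the most similar remaining value in the other record.
        best = max(unmatched, key=lambda v2: similarity(v1, v2))
        weight = idf(v1)                # rare values (e.g. titles) count more
        score += weight * similarity(v1, best)
        total_weight += weight
        unmatched.remove(best)
    return score / total_weight if total_weight else 0.0
```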
Detecting Source Collusion
Sources may copy data from each other, or maintain mirrors, boosting the SourceRank of the whole group.
Observation 1: Even non-colluding sources in the same domain may contain the same data.
e.g. Movie databases may contain all Hollywood movies.
Observation 2: The top-k answers of even non-colluding sources may be similar.
e.g. Answers to the query "Godfather" may contain all three movies in the Godfather trilogy.
Source Collusion--Continued
Basic method: If two sources return the same top-k answers to queries with a large number of answers (e.g. queries like "the" or "DVD"), they are likely to be colluding.
We compute the degree of collusion of two sources as their agreement on such large-answer queries; the words with the highest DF in the crawl are used as these queries.
The agreement between two databases is adjusted for collusion by multiplying it by (1 - collusion), as sketched below.
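A minimal sketch of the collusion adjustment, where `agreement_on` is a placeholder for the pairwise agreement computation from the earlier slides:

```python
# Collusion is estimated as agreement on "large answer" queries (high-DF words
# such as "the" or "DVD"); the regular agreement is then discounted by it.

def adjusted_agreement(s1, s2, agreement_on, sampling_queries, high_df_words):
    agreement = agreement_on(s1, s2, sampling_queries)   # normal sampling queries
    collusion = agreement_on(s1, s2, high_df_words)      # large-answer queries
    return agreement * (1.0 - collusion)                 # discount mirrors/copies
```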
Factal: Search based on SourceRank
"I personally ran a handful of test queries this way and got much better results [than Google Products] results using Factal" --- Anonymous WWW'11 Reviewer.
http://factal.eas.asu.edu
Evaluation
Precision and DCG are compared with the following baseline methods:
1) CORI: Adapted from text database selection. The union of sample documents from the sources is indexed, and the sources with the highest number of term hits are selected [Callan et al. 1995].
2) Coverage: Adapted from relational databases. Mean relevance of the top-5 results to the sampling queries [Nie et al. 2004].
3) Google Products: The product search used over Google Base.
All experiments distinguish SourceRank from the baseline methods at a 0.95 confidence level.
Online Top-4 Sources--Movies
[Bar chart: top-4 precision and DCG for Coverage, SourceRank, CORI, SR-Coverage, and SR-CORI; a 29% improvement for SourceRank is marked.]
Though the combinations (SR-Coverage, SR-CORI) are not our competitors, note that they are not better:
1. SourceRank implicitly considers query relevance, since the selected sources fetch answers by query similarity; combining again with query similarity may be an "overweighting".
2. The search is vertical.
Online Top-4 Sources--Books
[Bar chart: top-4 precision and DCG for Coverage, SourceRank, CORI, SR-Coverage, and SR-CORI; a 48% improvement for SourceRank is marked.]
Google Base Top-5 Precision--Books
[Bar chart over 675 sources; a 24% improvement for SourceRank is marked.]
• The 675 Google Base sources responding to a set of book queries are used as the book-domain sources.
• GBase-Domain is Google Base searching only over these 675 domain sources.
• The SourceRank (respectively, Coverage) bar shows source selection by SourceRank (Coverage) followed by ranking by Google Base.
Google Base Top-5 Precision--Movies
[Bar chart over 209 sources comparing Gbase, Gbase-Domain, SourceRank, and Coverage; a 25% improvement for SourceRank is marked.]
Trustworthiness of Source Selection
1. We corrupted the results in the sample crawl by replacing attribute values not specified in the queries with random strings (since partial titles are the queries, we corrupted all attributes except titles).
2. If the source selection is sensitive to corruption, the source ranks should decrease with increasing corruption levels.
[Plot: Google Base Movies. Decrease in rank (%) versus corruption level (0 to 0.9) for SourceRank, Coverage, and CORI.]
Every relevance measure based purely on query similarity is oblivious to the corruption of attributes unspecified in the queries. A sketch of the corruption step follows.
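A minimal sketch of this corruption step, assuming dictionary-shaped records and an illustrative `query_fields` parameter (not the authors' code):

```python
import random
import string

def corrupt_record(record, level, query_fields=("title",)):
    """With probability `level`, replace each non-query attribute value
    (here, everything except the title) with a random string."""
    corrupted = dict(record)
    for field, value in record.items():
        if field not in query_fields and random.random() < level:
            corrupted[field] = "".join(
                random.choices(string.ascii_lowercase, k=len(str(value))))
    return corrupted
```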
Trustworthiness--Google Base Books
[Plot: Decrease in rank (%) versus corruption level (0 to 0.9) for SourceRank, Coverage, and CORI.]
Collusion--Ablation Study
[Plot: collusion, agreement, and adjusted agreement versus rank correlation (1 down to 0).]
Two databases with the same one million tuples from IMDB are created, and the correlation between their ranking functions is progressively reduced.
Observations:
1. At high correlation, the adjusted agreement is very low.
2. At low correlations, the adjusted agreement is almost the same as the pure agreement.
Natural agreement is preserved while near-mirrors are caught.
Computation Time
The random walk is known to be feasible at large scale.
The time to compute the agreements is evaluated against the number of sources.
Note that the computation is offline and easy to parallelize.
Contributions
1. Agreement-based trust assessment for the deep web
2. Agreement-based relevance assessment for the deep web
3. Collusion detection between web sources
4. Evaluations on Google Base sources and online web databases
The search using SourceRank is demonstrated on Friday: 10-15:30.