Characterizing the
Uncertainty of Web Data:
Models and Experiences
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
Dipartimento di Informatica ed Automazione
{blanco,crescenz,merialdo,papotti}@dia.uniroma3.it
The Web as a
Source of Information

Opportunities
− a huge amount of information publicly available
− valuable data repository can be built by aggregating information spread over many sources
− abundance of redundancy for data of many domains
[Blanco et al. WebDb2010, demo@WWW2011]
[Figure: sample stock-quote tables from several sources; columns Sy, La, Min, Max, Vol, Open; rows for Ibm, Cisc, Appl]
Limitations
− sources are inaccurate, uncertain and unreliable
− some sources reproduce the contents published by others
Data conflicts: HRBN max price? 20.64, 20.49, …, 20.88
Popularity-based rankings


• Several ranking methods for web sources
- E.g. Google PageRank, Alexa Traffic Rank
- Mainly based on the popularity of the sources
• Several factors can compromise the quality of data even when extracted from authoritative sources
- Errors in the editorial process
- Errors in the publishing process
- Errors in the data extraction process
Problem Definition
[Figure: example tables published by sources w1, w2 and w3; errors in bold]
A set of sources (possibly with copiers) provide values
of several attributes for a common set of objects
[Figure: the same example with a fourth source, w4 (Copier); the scores score (w1) ... score (w4) are to be computed]
We want to compute automatically
− A score of accuracy for each web source
− The probability distribution for each value
State-of-the-art

Probabilistic models to evaluate the accuracy of web data sources (i.e., algorithms to reconcile data from inaccurate sources)
− NAIVE (voting)
− ACCU [Yin et al, TKDE08; Wu&Marian, WebDb07; Galland et al, WSDM10]
− DEP [Dong et al, PVLDB09]
− M-DEP [Blanco et al, Caise10; Dong et al, PVLDB10]
Goals

The goal of our work is twofold:
− illustrate the state-of-the-art models
− compare the results of these models on the same real-world datasets
NAIVE

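− Independent sources
− Consider a single attribute at a time
− Count the votes for each possible value
[Figure: values reported by the Sources compared with the Truth, with one case labelled "it works" and one labelled "it does not!"; in the example, 381 gets 2 votes and 380 gets 1 vote]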
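The vote counting described for NAIVE can be sketched in a few lines of Python. This is only an illustration: the source names `w1`, `w2`, `w3` and the assignment of values to them are assumptions chosen to reproduce the vote counts on the slide (381 gets 2 votes, 380 gets 1 vote).

```python
from collections import Counter

def naive_vote(claims):
    """Return (winning value, vote counts) for one attribute of one object.

    `claims` maps a source name to the value that source reports.
    Every source counts for exactly one vote.
    """
    votes = Counter(claims.values())
    value, _ = votes.most_common(1)[0]
    return value, votes

# Hypothetical claims: which source says what is an assumption.
claims = {"w1": 380, "w2": 381, "w3": 381}
winner, votes = naive_vote(claims)
print(winner, dict(votes))
```

Because every source has the same weight, the majority value wins even when the minority source is the only accurate one.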
Limitations of the NAIVE Model

Real sources can exhibit different accuracies
− Every source is considered equivalent, regardless of its authority and accuracy
− More accurate sources should weigh more than inaccurate sources
ACCU: a Model considering the
Accuracy of the Sources

The vote of a source is weighted according to its accuracy with respect to that attribute
[Figure: example Sources with accuracies 3/3, 1/3 and 1/3, compared with the Result and the Truth]
Main intuition: sources are unlikely to agree on errors!
Consensus on (many) true values allows the algorithm to compute accuracy
[Figure: cycle between Source Accuracy Discovery and Truth Discovery (consensus)]
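The loop between truth discovery and source accuracy discovery can be sketched as a simple fixpoint iteration. This is a deliberately simplified stand-in, not the probabilistic formulation of the cited ACCU papers: votes are weighted linearly by accuracy, accuracy is the fraction of a source's claims that match the current truth, and the starting accuracy of 0.8 and all the toy data are assumptions.

```python
def weighted_truth(claims, accuracy):
    """For each object, pick the value with the largest accuracy-weighted vote."""
    truth = {}
    for obj, by_source in claims.items():
        score = {}
        for source, value in by_source.items():
            score[value] = score.get(value, 0.0) + accuracy[source]
        truth[obj] = max(score, key=score.get)
    return truth

def estimate_accuracy(claims, truth, sources):
    """Fraction of each source's claims that agree with the current truth."""
    acc = {}
    for s in sources:
        reported = [(obj, vals[s]) for obj, vals in claims.items() if s in vals]
        hits = sum(1 for obj, value in reported if truth[obj] == value)
        acc[s] = hits / len(reported) if reported else 0.0
    return acc

def accu_sketch(claims, rounds=10):
    """Alternate truth discovery and accuracy discovery until the truth is stable."""
    sources = sorted({s for vals in claims.values() for s in vals})
    accuracy = {s: 0.8 for s in sources}  # assumed uniform starting accuracy
    truth = {}
    for _ in range(rounds):
        new_truth = weighted_truth(claims, accuracy)
        if new_truth == truth:
            break
        truth = new_truth
        accuracy = estimate_accuracy(claims, truth, sources)
    return truth, accuracy

# Hypothetical data: w1 and w2 agree on o1..o3; o4 is a one-against-one conflict.
claims = {
    "o1": {"w1": 10, "w2": 10, "w3": 11},
    "o2": {"w1": 20, "w2": 20, "w3": 20},
    "o3": {"w1": 30, "w2": 30, "w3": 31},
    "o4": {"w1": 40, "w3": 41},
}
truth, accuracy = accu_sketch(claims)
print(truth, accuracy)
```

In this toy run the consensus on o1, o2 and o3 gives w3 a low accuracy, which is what settles the otherwise tied object o4 in favour of w1.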
Limitations of the ACCU model

Misleading majorities might be formed by copiers
[Figure: Sources (Independents and a Copier) with accuracies 3/3, 2/3 and 1/3, compared with the Result and the Truth]
− Both values (380 and 381) get 3/3 as weighted vote
− Copiers have to be detected to neutralize the “copied” portion of their votes
A Generative Model of Copiers
[Figure: generative model with the Truth, Source 1, Source 2, Source 3 and a Copier; independently produced objects and copied objects, with errors e, e1 and e2]
DEP: A Model to Consider Source
Dependencies
[Figure: Sources (Independents and a Copier) with accuracies 3/3, 2/3 and 1/3 and “portions” of independent opinion 3/3, 3/3 and 1/3, compared with the Result and the Truth]
A source is copying 2/3 of its tuples
− 380 gets 3/3 as independent weighted vote
− 381 gets 2/3 x 3/3 + 1/3 x 1/3 = 7/9 as independent weighted vote
Main intuition: copiers can be detected as they propagate
false values (i.e., errors)
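The arithmetic of the DEP example can be checked with exact fractions. The sketch below assumes a reading of the slide in which one independent source (accuracy 3/3) reports 380, a second independent source (accuracy 2/3) reports 381, and the copier (accuracy 1/3, copying 2/3 of its tuples) also reports 381; the source names are hypothetical.

```python
from fractions import Fraction as F

# (claimed value, accuracy, portion of independent opinion) per source.
# Which source claims which value is an assumption inferred from the numbers.
sources = {
    "independent_a": (380, F(3, 3), F(3, 3)),
    "independent_b": (381, F(2, 3), F(3, 3)),
    "copier":        (381, F(1, 3), F(1, 3)),  # copies 2/3 of its tuples
}

def weighted_votes(sources, discount_copies):
    """Sum accuracy-weighted votes per value, optionally scaled by independence."""
    votes = {}
    for value, accuracy, independent in sources.values():
        weight = accuracy * independent if discount_copies else accuracy
        votes[value] = votes.get(value, F(0)) + weight
    return votes

accu_votes = weighted_votes(sources, discount_copies=False)  # accuracy only
dep_votes = weighted_votes(sources, discount_copies=True)    # accuracy x independence
print(accu_votes)
print(dep_votes)
```

With accuracy alone the two values tie, while discounting the copied portion of the copier's vote leaves 380 ahead of 381.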
Contextual Analysis of Truth,
Accuracies, and Dependencies
[Figure: cycle linking Truth Discovery, Source Accuracy Discovery and Dependence Detection]
M-DEP: Improved Evidence from
MULTIATT Analysis
[Figure: tables published by w1, w2, w3 and w4 (Copier) compared with the Truth, analysed with MULTIATT(3); errors in bold]
− An analysis based only on the Volume would fail in this example: it would recognize w2 as a copier of w1 but it would not detect w4 as a copier of w3
− actually w1 and w2 are independent sources sharing a common format for volumes
Experiments with Web Data



• Soccer players
- Truth: hand crafted from official pages
- Stats: 976 objects and 510 symbols (on average)
• Videogames
- Truth: www.esrb.com
- Stats: 227 objects and 40 symbols (on average)
• NASDAQ Stock Quotes
- Truth: www.nasdaq.com
- Stats: 819 objects, 2902 symbols (on average)
Sample Accuracies of the Sources
Sampled accuracy: the number of true values correctly
reported over the number of objects.
Pearson correlation
coefficient shows that
quality of data and
popularity do not overlap
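The sampled accuracy defined above, and a Pearson correlation against a popularity score, can be computed as in the following sketch. The function names and the toy data are illustrative assumptions, not the measurements reported in the talk.

```python
from math import sqrt

def sampled_accuracy(reported, truth):
    """True values correctly reported, divided by the number of objects."""
    correct = sum(1 for obj, value in truth.items() if reported.get(obj) == value)
    return correct / len(truth)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences.

    Undefined (division by zero) when either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical sample: four objects, one source that misses one and gets one wrong.
truth = {"a": 1, "b": 2, "c": 3, "d": 4}
reported = {"a": 1, "b": 2, "c": 9}
acc = sampled_accuracy(reported, truth)
print(acc)
```

Given one accuracy and one popularity score per source, `pearson(accuracies, popularities)` returns a value near 0 when the two are unrelated.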
Experiments with Models
Probability Concentration measures the performance in computing probability distributions for the observed objects.
− Low scores for Soccer: no authority on the Web
− Differences in VideoGames: number of distinct symbols (5 vs 75)
− High SA scores in Finance for every model: large number of distinct symbols
Global Execution Times
Lessons Learned

Three dimensions to decide which technique to use:
• Characteristics of the domain
- domains where authoritative sources exist are
much easier to handle
- a large number of distinct symbols helps a lot too
• Requirements on the results
- on average, more complex models return better
results, especially for Probability Concentration
• Execution times
- depend on the number of objects and number of
distinct symbols. Naïve always scales well.
Thanks!