Truth Finding on the Deep Web: Is the Problem Solved?

Xian Li (SUNY at Binghamton) xianli@cs.binghamton.edu
Xin Luna Dong (AT&T Labs-Research) lunadong@research.att.com
Kenneth B. Lyons (AT&T Labs-Research) kbl@research.att.com
Weiyi Meng (SUNY at Binghamton) meng@cs.binghamton.edu
Divesh Srivastava (AT&T Labs-Research) divesh@research.att.com
ABSTRACT
The amount of useful information available on the Web has been growing at
a dramatic pace in recent years and people rely more and more on the Web to
fulfill their information needs. In this paper, we study truthfulness of Deep
Web data in two domains where we believed data are fairly clean and data
quality is important to people’s lives: Stock and Flight. To our surprise, we
observed a large amount of inconsistency on data from different sources and
also some sources with quite low accuracy. We further applied on these two
data sets state-of-the-art data fusion methods that aim at resolving conflicts
and finding the truth, analyzed their strengths and limitations, and suggested
promising research directions. We hope our study will increase awareness of the seriousness of conflicting data on the Web and in turn inspire more research in our community to tackle this problem.
1. INTRODUCTION
The Web has been changing our lives enormously. The amount
of useful information available on the Web has been growing at a
dramatic pace in recent years. In a variety of domains, such as science, business, technology, arts, entertainment, government, sports,
and tourism, people rely on the Web to fulfill their information
needs. Compared with traditional media, information on the Web
can be published fast, but with fewer guarantees on quality and
credibility. While conflicting information is observed frequently
on the Web, typical users still trust Web data. In this paper we try
to understand the truthfulness of Web data and how well existing
techniques can resolve conflicts from multiple Web sources.
This paper focuses on Deep Web data, where data are stored in
underlying databases and queried using Web forms. We considered
two domains, Stock and Flight, where we believed data are fairly
clean because incorrect values can have a big (unpleasant) effect on
people’s lives. As we shall show soon, data for these two domains
also show many different features.
We first answer the following questions. Are the data consistent?
Are correct data provided by the majority of the sources? Are the
sources highly accurate? Is there an authoritative source that we
can trust and ignore all other sources? Are sources sharing data
with or copying from each other?
Our observations are quite surprising. Even for these domains
that most people consider as highly reliable, we observed a large
amount of inconsistency: for 70% of the data items more than one value is
provided. Among them, nearly 50% are caused by various kinds of
ambiguity, although we have tried our best to resolve heterogeneity
over attributes and instances; 20% are caused by out-of-date data;
and 30% seem to be caused purely by mistakes. Only 70% of the correct values are provided by the majority of the sources (over half of
the sources); and over 10% of them are not even provided by more
sources than their alternative values are. Although well-known authoritative sources, such as Google Finance for stock and Orbitz
for flight, often have fairly high accuracy, they are not perfect and
often do not have full coverage, so it is hard to recommend one as
the “only” source that users need to care about. Meanwhile, there
are many sources with low and unstable quality. Finally, we did observe data sharing between sources, and often on low-quality data,
making it even harder to find the truths on the Web.
Recently, many data fusion techniques have been proposed to
resolve conflicts and find the truth [2, 3, 6, 7, 8, 9, 12, 13, 15, 16, 17,
18, 19]. We next investigate how they perform on our data sets and
answer the following questions. Are these techniques effective?
Which technique among the many performs the best? How much
do the best achievable results improve over trusting data from a
single source? Is there a need and is there space for improvement?
Our investigation shows both strengths and limitations of the current state-of-the-art fusion techniques. On one hand, these techniques perform quite well in general, finding correct values for 96% of the data items on average. On the other hand, we observed a lot of instability among the methods and we did not find one method that
is consistently better than others. While it appears that considering trustworthiness of sources, copying or data sharing between
sources, similarity and formatting of data are helpful in improving
accuracy, it is essential that accurate information on source trustworthiness and copying between sources is used; otherwise, fusion
accuracy can even be harmed. According to our observations, we
identify the problem areas that need further improvement.
Related work: Dalvi et al. [4] studied redundancy of structured
data on the Web but did not consider the consistency aspect. Existing works on data fusion ([3, 8] as surveys and [9, 12, 13, 16,
18, 19] as recent works) have experimented on data collected from
the Web in domains such as book, restaurant and sports. Our work
is different in three aspects. First, we are the first to quantify and
study consistency of Deep Web data. Second, we are the first to
empirically compare all fusion methods proposed to date. Finally, we focus on two domains where we believed data should be
quite clean and correct values are more critical. We wish our study
on these two domains can increase awareness of the seriousness of conflicting data on the Web and inspire more research in our community to tackle this problem.

Table 1: Overview of data collections.

        Srcs   Period      Objects   Local attrs   Global attrs   Considered items
Stock   55     July 2011   1000*20   333           153            16000*20
Flight  38     Dec 2011    1200*31   43            15             7200*31
In the rest of the paper, Section 2 describes the data we considered, Section 3 describes our observations on data quality, Section 4
describes results of various fusion methods, Section 5 discusses future research challenges, and Section 6 concludes.
2. PROBLEM DEFINITION AND DATA SETS
We start with defining how we model data from the Deep Web
and describing our data collections.
2.1 Data model
We consider Deep Web sources in a particular domain, such as
flights. For each domain, we consider objects of the same type,
each corresponding to a real-world entity. For example, an object
in the flight domain can be a particular flight on a particular day.
Each object can be described by a set of attributes. For example,
a particular flight can be described by scheduled departure time,
actual departure time, etc. We call a particular attribute of a particular object a data item. We assume that each data item is associated
with a single true value that reflects the real world. For example,
the true value for the actual departure time of a flight is the minute
that the airplane leaves the gate on the specific day.
Each data source can provide a subset of objects in a particular
domain and can provide values of a subset of attributes for each
object. Data sources have heterogeneity at three levels. First, at
the schema level, they may structure the data differently and name
an attribute differently. Second, at the instance level, they may
represent an object differently. This is less of a problem for some
domains where each object has a unique ID, such as stock ticker
symbol, but more of a problem for other domains such as business
listings, where a business is identified by its name, address, phone
number, business category, etc. Third, at the value level, some of
the provided values might be exactly the true values, some might
be very close to (or different representations of) the true values, but
some might be very different from the true values. In this paper, we
manually resolve heterogeneity at the schema level and instance
level whenever possible, and focus on heterogeneity at the value
level, such as variety and correctness of provided values.
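To make these notions concrete, here is a minimal sketch of the data model in Python (ours, not from the paper; all identifiers are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataItem:
    """A particular attribute of a particular object, e.g., the actual
    departure time of a specific flight on a specific day."""
    object_id: str   # e.g., "AA119@2011-12-08" or "AAPL@2011-07-07"
    attribute: str   # e.g., "actual departure time"

# Each source provides at most one value per data item; the task is to
# recover the single true value of each item from these claims.
observations = {
    ("source1", DataItem("AA119@2011-12-08", "scheduled departure")): "6:15pm",
    ("source2", DataItem("AA119@2011-12-08", "scheduled departure")): "6:22pm",
}
```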
2.2 Data collections
We consider two data collections from stock and flight domains
where we believed data are fairly clean and we deem data quality
very important. Table 1 shows some statistics of the data.
Stock data: The first data set contains 55 sources in the Stock domain. We chose these sources as follows. We searched “stock price
quotes” and “AAPL quotes” on Google and Yahoo, and collected
the deep-web sources from the top 200 returned results. There were
89 such sources in total. Among them, 76 use the GET method (i.e.,
the form data are encoded in the URL) and 13 use the POST method
(i.e., the form data appear in a message body). We focused on the
former 76 sources, for which data extraction poses fewer problems.
Among them, 17 use Javascript to dynamically generate data and 4
rejected our crawling queries. So we focused on the remaining 55
sources. These sources include some popular financial aggregators
such as Yahoo! Finance, Google Finance, and MSN Money, official stock-market websites such as NASDAQ, and financial-news
websites such as Bloomberg and MarketWatch.
Table 2: Examined attributes for Stock.

Last price   Open price   Today's change (%)   Today's change ($)
Market cap   Volume       Today's high price   Today's low price
Dividend     Yield        52-week high price   52-week low price
EPS          P/E          Shares outstanding   Previous close
We focused on 1000 stocks, including the 30 symbols from Dow
Jones Index, the 100 symbols from NASDAQ Index (3 symbols appear in both Dow Jones and NASDAQ), and 873 randomly chosen
symbols from the other symbols in Russell 3000. Every weekday
in July 2011 we searched each stock symbol on each data source,
downloaded the returned web pages, and parsed the DOM trees to
extract the attribute-value pairs. We collected data one hour after the stock market closes on each day to minimize the difference
caused by different crawling times. Thus, each object is a particular
stock on a particular day.
We observe very different attributes from different sources about
the stocks: the number of attributes provided by a source ranges
from 3 to 71, and there are in total 333 attributes. Some of the
attributes have the same semantics but are named differently. After we matched them manually, there are 153 attributes. We call
attributes before the manual matching local attributes and those after the matching global attributes. Figure 1 shows the number of
providers for each global attribute. The distribution follows Zipf's
law; that is, only a small portion of attributes have a high coverage
and most of the “tail” attributes have a low coverage. In fact, 21
attributes (13.7%) are provided by at least one third of the sources
and over 86% are provided by less than 25% of the sources. Among
the 21 attributes, the values of 5 attributes can keep changing after
market close due to after-hours trading. In our analysis we focus on
the remaining 16 attributes, listed in Table 2. For each attribute, we
normalized values to the same format (e.g., “6.7M”, “6,700,000”,
and “6700000” are considered as the same value).
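As an illustration of this normalization step (our sketch; the paper does not give code), a function mapping strings such as "6.7M" and "6,700,000" to one canonical number could look like:

```python
def normalize_number(raw: str) -> float:
    """Map values like '6.7M', '6,700,000', '6700000' to one canonical float."""
    suffixes = {"K": 1e3, "M": 1e6, "B": 1e9}
    s = raw.strip().replace(",", "")
    if s and s[-1].upper() in suffixes:
        return float(s[:-1]) * suffixes[s[-1].upper()]
    return float(s)

assert normalize_number("6.7M") == normalize_number("6,700,000") == 6700000.0
```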
For purposes of evaluation we consider three gold standards. The
NASDAQ gold standard contains data provided by Nasdaq.com on
the 100 symbols in the NASDAQ index. The Majority100 gold
standard contains the voting results on the 100 NASDAQ symbols from 5 popular financial websites: NASDAQ, Yahoo! Finance,
Google Finance, MSN Money, and Bloomberg; we voted only on
data items provided by at least three sources. The Majority200
gold standard includes the voting results for another 100 randomly
selected symbols in addition to those in Majority100 to increase variety of data items in the standard. The values in all gold standards
are normalized as well.
Flight data: The second data set contains 38 sources from the
flight domain. We chose the sources in a similar way as in the
stock domain and the keyword query we used is “flight status”.
The sources we selected include 3 airline websites (AA, UA, Continental), 8 airport websites (such as SFO, DEN), and 27 third-party
websites, including Orbitz, Travelocity, etc.
We focused on 1200 flights departing from or arriving at the
hub airports of the three airlines (AA, UA, and Continental). We
grouped the flights into batches according to their scheduled arrival
time, collected data for each batch one hour after the latest scheduled arrival time every day in Dec 2011. Thus, each object is a
particular flight on a particular day. We extracted data and normalized the values in the same way as in the Stock domain.
We observe a total of 43 local attributes and 15 global attributes
in this domain (distribution shown in Figure 1). Each source covers 4 to 15 attributes. The distribution of the attributes also follows Zipf's law: 6 global attributes (40%) are provided by more than half of the sources while 53% of the attributes are provided by less than 25% of the sources. We focus on the 6 popular attributes in
our analysis, including scheduled departure/arrival time, actual departure/arrival time, and departure/arrival gate. We take the data provided by the three airline websites as the gold standard.

[Figure 1: Attribute coverage.]

Summary and comparison: In both data collections objects are easily distinguishable from each other: a stock object can be identified by date and stock symbol, and a flight object can be identified by date, flight number, and departure city (different flights departing from different cities may have the same flight number). On the other hand, we observe a lot of heterogeneity for attributes and value formatting; we have tried our best to resolve the heterogeneity manually. In both domains we observe that the distributions of the attributes follow Zipf's Law and only a small percentage of attributes are popular among all sources. The Stock data set is larger than the Flight data set with respect to both the number of sources and the number of data items we consider.

Note that generating gold standards is challenging. We had to generate gold standards by trusting some particular sources in the Flight and Stock domains. As we show later, this way of generating gold standards has inherent limitations.

3. WEB DATA QUALITY

We first ask ourselves the following four questions about Deep Web data and answer them in this section.

1. Are there a lot of redundant data on the Web? In other words, are there many different sources providing data on the same data item?

2. Are the data consistent? In other words, are the data provided by different sources on the same data item the same and, if not, are the values provided by the majority of the sources the true values?

3. Does each source provide data of high quality in terms of correctness and is the quality consistent over time? In other words, how consistent are the data of a source compared with a gold standard? And how does this change over time?

4. Is there any copying? In other words, is there any copying among the sources and, if we remove the copiers, are the majority values from the remaining sources true?

We report detailed results on a randomly chosen data set for each domain: the data of 7/7/2011 for Stock and the data of 12/8/2011 for Flight. In addition, we report the trend on all collected data (collected on different days).

3.1 Data redundancy

We first examine redundancy of the data. The object (resp., data-item) redundancy is defined as the percentage of sources that provide a particular object (resp., data item). Figure 2 and Figure 3 show the redundancy on the objects and attributes that we examined; note that the overall redundancy can be much lower.

[Figure 2: Object redundancy.]

[Figure 3: Data-item redundancy.]

For the Stock domain, we observe a very high redundancy at the object level: about 16% of the sources provide all 1000 stocks and all sources provide over 90% of the stocks; on the other hand, almost all stocks have a redundancy over 50%, and 83% of the stocks have a full redundancy (i.e., are provided by all sources). The redundancy at the data-item level is much lower because different sources can provide different sets of attributes. We observe that 80% of the sources cover over half of the data items, while 64% of the data items have a redundancy of over 50%.

For the Flight domain, we observe a lower redundancy. At the object level, 36% of the sources cover 90% of the flights and 60% of the sources cover more than half of the flights; on the other hand, 87% of the flights have a redundancy of over 50%, and each flight has a redundancy of over 30%. At the data-item level, only 28% of the sources provide more than half of the data items, and only 29% of the data items have a redundancy of over 50%. This low redundancy is because an airline or airport web site provides information only on flights related to the particular airline or airport.

Summary and comparison: Overall we observe a large redundancy in both domains: on average each data item has a redundancy of 66% for Stock and 32% for Flight. The redundancy is neither uniform across different data items nor follows Zipf's Law: a small portion of data items have very high redundancy, a small portion have very low redundancy, and most fall in between (for different domains, "high" and "low" can mean slightly different numbers).
3.2 Data consistency
We next examine consistency of the data. We start with measuring inconsistency of the values provided on each data item and
consider the following three measures. Specifically, we consider
data item d and we denote by V̄ (d) the set of values provided by
various sources on d.
• Number of values: We report the number of different values
provided on d; that is, we report |V̄ (d)|, the size of V̄ (d).
• Entropy: We quantify the distribution of the various values
by entropy [14]; intuitively, the higher the inconsistency, the
higher the entropy. If we denote by S̄(d) the set of sources
that provide item d, and by S̄(d, v) the set of sources that
provide value v on d, we compute the entropy on d as
$$E(d) = -\sum_{v \in \bar{V}(d)} \frac{|\bar{S}(d,v)|}{|\bar{S}(d)|} \log \frac{|\bar{S}(d,v)|}{|\bar{S}(d)|}. \qquad (1)$$
• Deviation: For data items with numerical values we additionally measure the difference of the values by deviation.
Among different values for d, we choose the dominant value $v_0$ as the one with the largest number of providers; that is, $v_0 = \arg\max_{v \in \bar{V}(d)} |\bar{S}(d, v)|$. We compute the deviation for d as the relative deviation w.r.t. $v_0$:

$$D(d) = \sqrt{\frac{1}{|\bar{V}(d)|} \sum_{v \in \bar{V}(d)} \left(\frac{v - v_0}{v_0}\right)^2}. \qquad (2)$$
We measure deviation for time similarly but use absolute difference by minute, since the scale is not a concern there.
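For concreteness, here is a sketch of the three measures (ours; it implements Equations (1) and (2) with log base 2, which matches the later remark that two uniformly distributed values give entropy 1; values holds one provided value per source for a single numerical data item):

```python
import math
from collections import Counter

def inconsistency_measures(values):
    counts = Counter(values)                 # |S(d, v)| for each value v
    n = len(values)                          # |S(d)|
    num_values = len(counts)                 # |V(d)|
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())  # Eq. (1)
    v0, _ = counts.most_common(1)[0]         # dominant value (assumed nonzero)
    deviation = math.sqrt(                   # Eq. (2): relative deviation w.r.t. v0
        sum(((v - v0) / v0) ** 2 for v in counts) / num_values)
    return num_values, entropy, deviation
```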
Table 3: Value inconsistency on attributes. The numbers in parentheses are those when we exclude source StockSmart.

Number of values:
          Attribute w. low incons.   Number        Attribute w. high incons.   Number
Stock     Previous close             1.14 (1.14)   Volume                      7.42 (6.55)
          Today's high               1.98 (1.18)   P/E                         6.89 (6.89)
          Today's low                1.98 (1.18)   Market cap                  6.39 (6.39)
          Last price                 2.21 (1.33)   EPS                         5.43 (5.43)
          Open price                 2.29 (1.29)   Yield                       4.85 (4.12)
Flight    Scheduled depart           1.10          Actual depart               1.98
          Arrival gate               1.18          Scheduled arrival           1.65
          Depart gate                1.19          Actual arrival              1.60

Entropy:
Stock     Previous close             0.04 (0.04)   P/E                         1.49 (1.49)
          Today's high               0.13 (0.05)   Market cap                  1.39 (1.39)
          Today's low                0.13 (0.05)   EPS                         1.17 (1.17)
          Last price                 0.15 (0.07)   Volume                      1.02 (0.94)
          Open price                 0.19 (0.09)   Yield                       0.90 (0.90)
Flight    Scheduled depart           0.05          Actual depart               0.60
          Depart gate                0.10          Actual arrival              0.31
          Arrival gate               0.11          Scheduled arrival           0.26

Deviation:
Stock     Last price                 0.03 (0.02)   Volume                      2.96 (2.96)
          Yield                      0.18 (0.18)   52wk low price              1.88 (1.88)
          Change %                   0.19 (0.19)   Dividend                    1.22 (1.22)
          Today's high               0.33 (0.32)   EPS                         0.81 (0.81)
          Today's low                0.35 (0.33)   P/E                         0.73 (0.73)
Flight    Scheduled depart           9.35 min      Actual depart               15.14 min
          Scheduled arrival          12.76 min     Actual arrival              14.96 min
We have just defined dominant values, denoted by v0 . Regarding
them, we also consider the following two measures.
• Dominance factor: The percentage of the sources that provide $v_0$ among all providers of d; that is, $F(d) = \frac{|\bar{S}(d, v_0)|}{|\bar{S}(d)|}$.
• Precision of dominant values: The percentage of data items
on which the dominant value is true (i.e., the same as the
value in the gold standard).
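Continuing the sketch above, the dominance factor and the precision of dominant values could be computed as follows (illustrative; gold is an assumed mapping from data item to its gold-standard value):

```python
from collections import Counter

def dominant_value(values):
    """Return (v0, F(d)): the dominant value and its dominance factor."""
    counts = Counter(values)
    v0, c0 = counts.most_common(1)[0]
    return v0, c0 / len(values)

def precision_of_dominant(items, gold):
    """items: {data_item: list of values, one per providing source}."""
    hits = [dominant_value(vals)[0] == gold[d]
            for d, vals in items.items() if d in gold]
    return sum(hits) / len(hits)
```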
Before describing our results, we first clarify two issues regarding data processing.
• Tolerance: We wish to be fairly tolerant to slightly different
values. For time we are tolerant to 10-minute difference. For
numerical values, we consider all values that are provided
for each particular attribute A, denoted by V̄ (A), and take
the median; we are tolerant to a difference of
$$\tau(A) = \alpha \cdot \mathrm{Median}(\bar{V}(A)), \qquad (3)$$

where α is a predefined tolerance factor, set to .01 by default.
• Bucketing: When we measure value distribution, we group values whose difference falls in our tolerance. Given a numerical data item d of attribute A, we start with the dominant value $v_0$ and use the following buckets: $\ldots, (v_0 - \frac{3\tau(A)}{2}, v_0 - \frac{\tau(A)}{2}], (v_0 - \frac{\tau(A)}{2}, v_0 + \frac{\tau(A)}{2}], (v_0 + \frac{\tau(A)}{2}, v_0 + \frac{3\tau(A)}{2}], \ldots$
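A sketch of this bucketing (ours; tau is the tolerance τ(A) from Equation (3), and bucket 0 is the interval around the dominant value):

```python
import math

def bucket_index(v, v0, tau):
    """Bucket k covers (v0 + (2k-1)*tau/2, v0 + (2k+1)*tau/2];
    bucket 0 is (v0 - tau/2, v0 + tau/2]."""
    return math.ceil((v - v0) / tau - 0.5)
```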
Inconsistency of values: Figure 4 shows the distributions of inconsistency by different measures for different domains and Table 3
lists the attributes with the highest or lowest inconsistency.
Stock: For the Stock domain, even with bucketing, the number of
different values for a data item ranges from 1 to 13, with an average of 3.7. Only 17% of the data items have a single value, the largest percentage of items (30%) have two values, and
39% have more than three values. However, we observe one source
(StockSmart) that stopped refreshing data after June 1st, 2011; if
we exclude its data, 37% of the data items have a single value, 16% have
two, and 36% have more than three. The entropy shows that even
though there are often multiple values, very often one of them is
dominant among others. In fact, while we observe inconsistency
on 83% items, there are 42% items whose entropy is less than .2
and 76% items whose entropy is less than 1 (recall that the maximum entropy for two values, happening under uniform distribution,
is 1). After we exclude StockSmart, entropy on some attributes is
even lower. Finally, we observe that for 64% of the numerical data
items the deviation is within .1; however, for 14% of the items the
deviation is above .5, indicating a big discrepancy.
The lists of highest- and lowest-inconsistency attributes are consistent w.r.t. number-of-values and entropy, with slight changes on
the ordering. The lists w.r.t. deviation are less consistent with the
other lists. For some attributes such as Dividend and 52-week
low price, although there are not that many different values, the
provided values can differ a lot in the magnitude. Indeed, different sources can apply different semantics for these two attributes:
Dividend can be computed for different periods–year, half-year,
quarter, etc; 52-week low price may or may not include the price
of the current day. For Volume, the high deviation is caused by 10
symbols that have terminated–some sources map these symbols to
other symbols; for example, after termination of “SYBASE”, symbol “SY” is mapped to “SALVEPAR” by a few sources. When we
remove these 10 symbols, the deviation drops to only .28. Interestingly, Yield has high entropy but low deviation, because its values
are typically quite small and the difference is also very small. We
observe that real-time values often have a lower inconsistency than
statistical values, because there is often more semantics ambiguity
for statistical values.
Flight: Value inconsistency is much lower for the Flight domain.
The number of different values ranges from 1 to 5 and the average is 1.45. For 61% of the data items there is a single value after
bucketing and for 93% of the data items there are at most two values. For 96% of the items the entropy is less than 1.0.
However, when different times are provided for departure or arrival, they can differ a lot: 46% of the data items have a deviation
above 5 minutes, while 20% have a deviation above 10 minutes.
Among different attributes, the scheduled departure time and
gate information have the lowest inconsistency, and as expected,
the actual departure/arrival time have the highest inconsistency. The
average deviations for actual departure and arrival time are as large
as 15 minutes.
Reasons for inconsistency: To understand inconsistency of values,
for each domain we randomly chose 20 data items and in addition
considered the 5 data items with the largest number-of-values, and
manually checked each of them to find the possible reasons. Figure 6 shows the various reasons for different domains.
For the Stock domain, we observe five reasons. (1) In many cases
(46%) the inconsistency is due to semantics ambiguity. We consider semantics ambiguity as the reason if ambiguity is possible
for the particular attribute and we observe inconsistency between
values provided by the source and the dominant values on a large
fraction of items of that attribute; we have given examples of ambiguity for Dividend and 52-week low price earlier. (2) The reason
can also be instance ambiguity (6%), where a source interprets one
stock symbol differently from the majority of sources; this happens
mainly for stock symbols that terminated at some point. Recall that
instance ambiguity results in the high deviation on Volume. (3)
Another major reason is out-of-date data (34%): at the point when
we collected data, the data were not up-to-date; for two thirds of
the cases the data were updated hours ago, and for one third of the
cases the data had not been refreshed for days. (4) There is one error on data unit: the majority reported 76M while one source reported 76B. (5) Finally, there are four cases (11%) where we could not determine the reason and it seems to be purely erroneous data.

[Figure 4: Value inconsistency: distribution of number of values, entropy of values, and deviation of numerical values.]

[Figure 5: Screenshots of three flight sources: FlightView, FlightAware, Orbitz.]
For the Flight domain, we observe only three reasons. (1) Semantics ambiguity causes 33% of inconsistency: some source may
report takeoff time as departure time and landing time as arrival
time, while most sources report the time of leaving the gate or arriving at the gate. (2) Out-of-date data causes 11% of the inconsistency; for example, even when a flight is already canceled, a source
might still report its actual departure and arrival time (the latter is
marked as “estimated”). (3) Pure errors seem to cause most of the
inconsistency (56%). For example, Figure 5 shows three sources
providing different scheduled departure time and arrival time for
Flight AA119 on 12/8/2011; according to the airline website, the
real scheduled time is 6:15pm for departure and 9:40pm for arrival. For scheduled departure time, FlightView and FlightAware
provide the correct time while Orbitz provides a wrong one. For
scheduled arrival time, all three sources provide different times;
FlightView again provides the correct one, while the time provided
by FlightAware is unreasonable (it typically takes around 6 hours
to fly from the east coast to the west coast in the US). Indeed, we
found that FlightAware often gives wrong scheduled arrival time; if
we remove it, the average number of values for Scheduled arrival
drops from 1.65 to 1.31.
Dominant values: We now focus on the dominant values, those
with the largest number of providers for a given data item. Similarly, we can define the second dominant value, etc. Figure 7 plots
the distribution of the dominance factors and the precision of the
dominant values with respect to different dominance factors.
For the Stock domain, we observe that on 42% of the data items
the dominant values are supported by over 90% of the sources, and
on 73% of the data items the dominant values are supported by over
half of the sources. For these 73% data items, 98% of the dominant
values are consistent with the various gold standards. However,
when the dominance factor drops, the precision is also much lower.
For 9% of the data items with dominance factor of .4, the consistency already drops to 84% w.r.t. Majority200 gold standard (lower
for other gold standards). For 7% of the data items where the dominance factor is .1, the precision w.r.t. Majority200 for the dominant
value, the second dominant value, and the third dominant value is
.43, .33, and .12 respectively (meaning that for 12% of the data
items none of the top-3 values is true). In general, the precision
w.r.t. Majority200 is higher than that w.r.t. Majority100, meaning
a higher precision on the 100 symbols outside the NASDAQ index. Also, the precision w.r.t. Majority100 is higher than that w.r.t.
NASDAQ; indeed, we found that NASDAQ contains 174 values that
are not provided by any other source on the same items.
For the Flight domain, more data items have a higher dominance
factor–42% data items have a dominance factor of over .9, and 82%
have a dominance factor of over .5. However, for these 82% items
the dominant values have a lower precision: only 88% are consistent with the gold standard. Actually for the 11% data items whose
dominance factor falls in [.5, .6), the precision is only 50% for the
dominant value. As we show later, this is because some wrong
values are copied between sources and become dominant.
Summary and comparison: Overall we observe a fairly high inconsistency of values on the same data item: for Stock and Flight
the average entropy is .58 and .24, and the average deviation is 13.4
and 13.1 respectively. The inconsistency can vary from attributes
to attributes. There are different reasons for the inconsistency, including ambiguity, out-of-date data, and pure errors. For the Stock
domain, half of the inconsistency is because of ambiguity, one third
is because of out-of-date data, and the rest is because of erroneous
data. For the Flight domain, 56% of the inconsistency is because
of erroneous data.
If we choose dominant values as the true value (this is essentially
the VOTE strategy, as we explain in Section 4), we can obtain a precision of 0.908 for Stock (w.r.t. Majority200) and 0.864 for Flight.
We observe that dominant values with a high dominance factor are
typically correct, but the precision can quickly drop when this factor decreases. Interestingly, the Flight domain has a lower inconsistency but meanwhile a lower precision for dominant values, possibly because of copying on wrong values, as we show later.
Finally, we observe different precisions w.r.t. NASDAQ, Majority100 and Majority200 on Stock domain. This is inevitable because we had to trust certain sources for the gold standard but every
source can make mistakes. In the rest of the paper we use Majority200 as the gold standard for Stock.
3.3 Source accuracy
Next, we examine the accuracy of the sources over time. Given
a source S, we consider the following two measures.
• Source accuracy: We compute accuracy of S as the percentage of its provided true values among all its data items appearing in the gold standard.
• Accuracy deviation: We compute the standard deviation of the accuracy of S over a period of time. We denote by T̄ the time points in a period, by A(t) the accuracy of S at time t ∈ T̄, and by Â the mean accuracy over T̄. The deviation is computed by $\sqrt{\frac{1}{|\bar{T}|}\sum_{t \in \bar{T}} (A(t) - \hat{A})^2}$.

[Figure 6: Reasons for value inconsistency.]

[Figure 7: Dominant values.]

Table 4: Accuracy and coverage of authoritative sources.

          Source            Accuracy   Coverage
Stock     Google Finance    .94        .82
          Yahoo! Finance    .93        .81
          NASDAQ            .92        .84
          MSN Money         .91        .89
          Bloomberg         .83        .81
Flight    Orbitz            .98        .87
          Travelocity       .95        .71
          Airport average   .94        .03
Source accuracy: Figure 8(a) shows the distribution of source accuracy in different domains. Table 4 lists the accuracy and item-level coverage of some authoritative sources.
In the Stock domain, the accuracy varies from .54 to .97 (except
StockSmart, which has accuracy .06), with an average of .86. Only
35% sources have an accuracy above .9, and 3 sources (5%) have
an accuracy below .7, which is quite low. Among the five popular
financial sources, four have an accuracy above .9, but Bloomberg
has an accuracy of only .83 because it may apply different semantics on some statistical attributes such as EPS, P/E and Yield. All
authoritative sources have a coverage between .8 and .9.
In the Flight domain, we consider sources excluding the three
official airline websites (their data are used as gold standard). The
accuracy varies from .43 to .99, with an average of .80. There are
40% of the sources with an accuracy above .9, but 10 sources (29%)
have an accuracy below .7. The average accuracy of airport sources
is .94, but their average coverage is only .03. Authoritative sources
like Orbitz and Travelocity all have quite high accuracy (above .9),
but Travelocity has low coverage (.71).
Accuracy deviation: Figure 8(b) shows the accuracy deviation of
the sources in a one-month period, and Figure 8(c) shows the precision of the dominant values over time.
In the Stock domain, we observe that for 4 sources the accuracy
varies tremendously (standard deviation over .1) and the highest
standard deviation is as high as .33. For 59% of the sources the
accuracy is quite steady (standard deviation below .05). We did
not observe any common peaks or dips on particular days. The
precision of the dominant values ranges from .9 to .97, and the
average is .92. The day-by-day precision is also fairly smooth, with
some exceptions on a few days.
In the Flight domain, we observe that for 1 source the accuracy
varies tremendously (deviation .11), and for 60% sources the accuracy is quite steady (deviation below .05). The precision of the
dominant values ranges from .86 to .89, and the average is .87.
Table 5: Potential copying between sources.

          Remarks              Size   Schema sim   Object sim   Value sim   Avg accu
Stock     Depen claimed        11     1            .99          .99         .92
          Depen claimed        2      1            1            .99         .75
Flight    Depen claimed        5      .80          1            1           .71
          Query redirection    4      .83          1            1           .53
          Depen claimed        3      1            1            1           .92
          Embedded interface   2      1            1            1           .93
          Embedded interface   2      1            1            1           .61
Summary and comparison: We observe that the accuracy of the
sources can vary a lot. On average the accuracy is not too high:
.86 for Stock and .80 for Flight. Even authoritative sources may
not have very high accuracy. We also observe that the accuracy is
fairly steady in general. On average the standard deviation is 0.06
for Stock and 0.05 for Flight, and for about half of the sources the
deviation is below .05 over time.
3.4 Potential copying
Just as copying is common between webpage texts, blogs, etc.,
we also observe copying between deep-web sources; that is, one
source obtains some or all of its data from another source, while
possibly adding some new data independently. We next report the
potential copying we found in our data collections (Table 5) and
study how that would affect precision of the dominant values. For
each group S̄ of sources with copying, we compute the following
measures.
• Schema commonality: We measure the commonality of schema as the average Jaccard similarity between the sets of provided attributes on each pair of sources. If we denote by Ā(S) the set of global attributes that S provides, we compute the schema commonality of S̄ as $\mathrm{Avg}_{S,S' \in \bar{S}, S \neq S'}\, \frac{|\bar{A}(S) \cap \bar{A}(S')|}{|\bar{A}(S) \cup \bar{A}(S')|}$.
• Object commonality: Object commonality is also measured
by average Jaccard similarity but between the sets of provided objects.
• Value commonality: The average percentage of common values over all shared data items between each source pair.
• Average accuracy: The average source accuracy.
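These commonality measures are plain pairwise Jaccard and overlap computations; a hedged sketch (ours, with assumed input shapes):

```python
from itertools import combinations

def avg_jaccard(sets):
    """Average Jaccard similarity over all pairs of sets (schemas or object sets)."""
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

def value_commonality(data_a, data_b):
    """data_*: {data_item: value}. Fraction of common values on shared items."""
    shared = data_a.keys() & data_b.keys()
    return sum(data_a[d] == data_b[d] for d in shared) / len(shared)
```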
On the Stock domain, we found two groups of sources with potential copying. The first group contains 11 sources, with exactly
the same webpage layout, schema, and highly similar data. These
sources all derive their data from Financial Content, a market data
service company, and their data are quite accurate (.92 accuracy).
The second group contains 2 sources, also with exactly the same
schema and data; the two websites are indeed claimed to be merged
in 2009. However, their data have an accuracy of only .75. For each
group, we keep only one randomly selected source and remove the
rest of the sources; this would increase the precision of dominant
values from .908 to .923.
[Figure 8: Source accuracy and deviation over time. (a) Distribution of source accuracy; (b) accuracy deviation over time; (c) dominant values over time.]
On the Flight domain, we found five groups of sources with potential copying. Among them, two directly claim partnership by
including the logo of other sources; one re-directs its queries; and
two embed the query interface of other sources. Sources in the
largest two groups provide slightly different sets of attributes, but
exactly the same flights, and the same data for all overlapping data
items. Sources in other groups provide almost the same schema
and data. Accuracy of sources in these groups vary from .53 to .93.
After we removed the copiers and kept only one randomly selected
source in each group, the precision of dominant values is increased
significantly, from .864 to .927.
Summary and comparison: We do observe copying between
deep-web sources in each domain. In some cases the copying is
claimed explicitly, and in other cases it is detected by observing embedded interface or query redirection. For the copying that we have
observed, while the sources may provide slightly different schemas,
they provide almost the same objects and the same values. The accuracy of the original sources may not be high, ranging from .75 to
.92 for Stock, and from .53 to .93 for Flight. Because the Flight domain contains more low-accuracy sources with copying, removing
the copied sources improves the precision of the dominant values
more significantly than in the Stock domain.
4. DATA FUSION
As we have shown in Section 3, deep-web data from different
sources can vary significantly and there can be a lot of conflicts.
Data fusion aims at resolving conflicts and finding the true values. A basic fusion strategy that considers the dominant value (i.e.,
the value with the largest number of providers) as the truth works
well when the dominant value is provided by a large percentage of
sources (i.e., a high dominance factor), but fails quite often otherwise. Recall that in the Stock domain, the precision of dominant
values is 90.8%, meaning that on around 1500 data items we would
conclude with wrong values. Recently many advanced fusion techniques have been proposed to improve the precision of truth discovery [2, 3, 6, 7, 8, 9, 12, 13, 15, 16, 17, 18, 19].
In this section we answer the following three questions.
1. Are the advanced fusion techniques effective? In other words,
do they perform (significantly) better than simply taking the
dominant values or taking all data provided by the best source
(assuming we know which source it is)?
2. Which fusion method is the best? In other words, is there a
method that works better than others on all or most data sets?
3. Which intuitions for fusion are effective? In other words, is
each intuition that has been adopted for fusion effective?
This section first presents an overview of the proposed fusion
methods (Section 4.1) and then compares their performance on our
data collections (Section 4.2).
4.1 Review of data-fusion methods
In our data collections each source provides at most one value
on a data item and each data item is associated with a single true
value. We next review existing fusion methods suitable for this
context. Before we jump into descriptions of each method, we first
enumerate the many insights that have been considered in fusion.
• Number of providers: A value that is provided by a large
number of sources is considered more likely to be true.
• Trustworthiness of providers: A value that is provided by
trustworthy sources is considered more likely to be true.
• Difficulty of data items: The error rate on each particular data
item is also considered in the decision.
• Similarity of values: The provider of a value v is also considered as a partial provider of values similar to v.
• Formatting of values: The provider of a value v is also considered as a partial provider of a value that subsumes v. For
example, if a source typically rounds to million and provides “8M”, it is also considered as a partial provider of
“7,528,396”.
• Copying relationships: A copied value is ignored in the decision.
All fusion methods more or less take a voting approach; that
is, accumulating votes from providers for each value on the same
data item and choosing the value with the highest vote as the true
one. The vote count of a source is often a function of the trustworthiness of the source. Since source trustworthiness is typically
unknown a-priori, they proceed in an iterative fashion: computing
value vote and source trustworthiness in each round until the results converge. We now briefly describe given a data item d, how
each fusion method computes the vote count of each value v on d
and the trustworthiness of each source s. In [11] we summarized
equations applied in each method.
VOTE: Voting is the simplest strategy that takes the dominant value
as the true value; thus, its performance is the same as the precision
of the dominant values. There is no need for iteration.
HUB [10]: Inspired by measuring web page authority based on
analysis of Web links, in HUB the vote of a value is computed
as the sum of the trustworthiness of its providers, while the trustworthiness of a source is computed as the sum of the votes of its
provided values. Note that in this method the trustworthiness of a
source is also affected by the number of its provided values. Normalization is performed to prevent source trustworthiness and value
vote counts from growing in an unbounded manner.
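A minimal sketch of this HUB-style iteration (ours; observations maps each source to its {data_item: value} claims, and max-normalization stands in for the normalization mentioned above):

```python
def hub_fusion(observations, iterations=20):
    """observations: {source: {item: value}}. Returns the chosen value per item."""
    trust = {s: 1.0 for s in observations}
    for _ in range(iterations):
        # Vote of a value = sum of the trustworthiness of its providers.
        votes = {}
        for s, claims in observations.items():
            for item, v in claims.items():
                votes[(item, v)] = votes.get((item, v), 0.0) + trust[s]
        norm = max(votes.values())
        votes = {k: c / norm for k, c in votes.items()}
        # Trustworthiness of a source = sum of the votes of its values
        # (so it also grows with the number of values the source provides).
        trust = {s: sum(votes[(i, v)] for i, v in claims.items())
                 for s, claims in observations.items()}
        tnorm = max(trust.values())
        trust = {s: t / tnorm for s, t in trust.items()}
    # Pick the highest-vote value for each item.
    best = {}
    for (item, v), c in votes.items():
        if item not in best or c > best[item][1]:
            best[item] = (v, c)
    return {item: v for item, (v, c) in best.items()}
```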
AVGLOG [12]: This method is similar to HUB but decreases the effect of the number of provided values by taking average and logarithm. Again, normalization is required.

INVEST [12]: A source "invests" its trustworthiness uniformly among its provided values. The vote of a value grows non-linearly with respect to the sum of the invested trustworthiness from its providers.
The trustworthiness of source s is computed by accumulating the vote of each provided value v weighted by s's contribution among all contributions to v. Again, normalization is required.

Table 6: Summary of data-fusion methods. X indicates that the method considers the particular evidence.

Category           Method            #Providers   Source     Item       Value   Value     Copying
                                                  trustw.    trustw.    sim.    format.
Baseline           VOTE              X
Web-link based     HUB               X            X
                   AVGLOG            X            X
                   INVEST            X            X
                   POOLEDINVEST      X            X
IR based           2-ESTIMATES       X            X
                   3-ESTIMATES       X            X          X
                   COSINE            X            X
Bayesian based     TRUTHFINDER       X            X                     X
                   ACCUPR            X            X
                   ACCUSIM           X            X                     X
                   ACCUFORMAT        X            X                     X       X
Copying affected   ACCUCOPY          X            X                     X       X         X
POOLEDINVEST [12]: This method is similar to INVEST but the
vote count of each value on item d is then linearly scaled such that
the total vote count on d equals the accumulated investment on d.
With this linear scaling, normalization is not required any more.
COSINE [9]: This method considers the values as a vector: for
value v of data item d, if source s provides a value v, the corresponding position has value 1; if s provides another value on d,
the position has value -1; if s does not provide d, the position has
value 0. Similarly the vectors are defined for selected true values.
COSINE computes the trustworthiness of a source as the cosine similarity between the vector of its provided values and the vector of
the (probabilistically) selected values. To improve stability, it sets
the new trustworthiness as a linear combination of the old trustworthiness and the newly computed one.
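As a concrete reading of this definition, here is a small sketch (ours; it uses a deterministic truth assignment in place of the probabilistic selection described above):

```python
import math

def cosine_trust(claims, truth, items_values):
    """claims/truth: {item: value}; items_values: all (item, value) pairs.
    Returns the cosine similarity between the source vector and the truth vector."""
    def coord(assign, item, value):
        if item not in assign:
            return 0.0                        # source does not provide this item
        return 1.0 if assign[item] == value else -1.0
    dot = norm_a = norm_b = 0.0
    for item, value in items_values:
        a = coord(claims, item, value)
        b = coord(truth, item, value)
        dot += a * b
        norm_a += a * a
        norm_b += b * b
    return dot / math.sqrt(norm_a * norm_b)   # assumes both vectors are nonzero
```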
2-ESTIMATES [9]: 2-ESTIMATES also computes source trustworthiness by aggregating value votes. It differs from HUB in two
ways. First, if source s provides value v on d, it considers that s
votes against other values on d and applies a complement vote on
those values. Second, it averages the vote counts instead of summing them up. This method requires a complex normalization for
the vote counts and trustworthiness to the whole range of [0, 1].
3-ESTIMATES [9]: 3-ESTIMATES improves over 2-ESTIMATES by
considering also trustworthiness on each value, representing the
likelihood that a vote on this value being correct. This measure is
computed iteratively together with source trustworthiness and value
vote count and similar normalization is applied.
TRUTHFINDER [17]: This method applies Bayesian analysis and
computes the probability of a value being true conditioned on the
observed providers. Essentially, instead of accumulating trustworthiness, the vote count takes the product of the trustworthiness. In
addition, TRUTHFINDER considers similarity between values and
enhances the vote count of a value by those from its similar values
weighted by the similarity.
ACCUPR [6]: ACCUPR also applies Bayesian analysis. It differs from TRUTHFINDER in that it takes into consideration that different values provided on the same data item are disjoint and their probabilities should sum up to 1; in other words, like 2-ESTIMATES, 3-ESTIMATES and COSINE, if a source s provides v′ ≠ v on item d, s is considered to indeed vote against v. To make the Bayesian
analysis possible, it assumes that there are N false values in the
domain of d and they are uniformly distributed.
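Under these assumptions the Bayesian analysis reduces to an additive vote count per source, C(s) = ln(N·A(s)/(1−A(s))) with A(s) the accuracy of s; a hedged one-round sketch (ours; the iterative re-estimation of accuracies from the current decisions is omitted):

```python
import math

def accu_vote_count(accuracy, n_false=100):
    """Vote count of a source under the ACCU model: ln(N*A / (1-A)).
    Requires 0 < accuracy < 1; n_false is the assumed number of false values."""
    return math.log(n_false * accuracy / (1.0 - accuracy))

def accupr_round(observations, accuracy):
    """One round: pick per item the value with the highest summed vote count.
    observations: {source: {item: value}}; accuracy: {source: A(s)}."""
    votes = {}
    for s, claims in observations.items():
        for item, v in claims.items():
            votes.setdefault(item, {})
            votes[item][v] = votes[item].get(v, 0.0) + accu_vote_count(accuracy[s])
    return {item: max(vs, key=vs.get) for item, vs in votes.items()}
```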
ACCUSIM [6]: ACCUSIM augments ACCUPR by considering also value similarity in the same way as TRUTHFINDER does.

ACCUFORMAT: ACCUFORMAT augments ACCUSIM by considering also formatting of values as we have described.

ACCUCOPY [6]: ACCUCOPY augments ACCUFORMAT by considering the copying relationships between the sources and weighting the vote count from a source s by the probability that s provides the particular value independently. In our implementation we applied the copy detection techniques in [6], which treat sharing false values as strong evidence of copying.
Table 6 summarizes the features of different fusion methods. We can categorize them into five categories.

• Baseline: The basic voting strategy.

• Web-link based: The methods are inspired by measuring webpage authority based on Web links, including HUB, AVGLOG, INVEST and POOLEDINVEST.

• IR based: The methods are inspired by similarity measures in Information Retrieval, including COSINE, 2-ESTIMATES and 3-ESTIMATES.

• Bayesian based: The methods are based on Bayesian analysis, including TRUTHFINDER, ACCUPR, ACCUSIM, and ACCUFORMAT.

• Copying affected: The vote count computation discounts votes from copied values, including ACCUCOPY.
Finally, note that in each method we can distinguish trustworthiness for each attribute. For example, ACCUFORMATATTR distinguishes the trustworthiness for each attribute whereas ACCUFORMAT uses an overall trustworthiness for all attributes.
4.2 Fusion performance evaluation
We now evaluate the performance of various fusion methods on
our data sets. We focus on five measures.
• Precision: The precision is computed as the percentage of
the output values that are consistent with a gold standard.
• Recall: The recall is computed as the percentage of the values in the gold standard being output as correct. Note that
when we have fused all sources (so output all data items),
the recall is equivalent to the precision.
• Trustworthiness deviation: Recall that except VOTE, each method computes some trustworthiness measure of a source. We sampled the trustworthiness of each source with respect to a gold standard as it is defined in the method, and compared it with the trustworthiness computed by the method at convergence. In particular, given a source s ∈ S, we denote by $T_{sample}(s)$ its sampled trustworthiness and by $T_{compute}(s)$ its computed trustworthiness, and compute the deviation as

$$dev(S) = \sqrt{\frac{1}{|S|} \sum_{s \in S} \left(T_{sample}(s) - T_{compute}(s)\right)^2}. \qquad (4)$$

• Trustworthiness difference: The difference is computed as the average computed trustworthiness for all sources minus the average sampled trustworthiness.

• Efficiency: Efficiency is measured by the execution time on a Windows machine with an Intel Core i5 processor (3.2GHz, 4MB cache, 4.8 GT/s QPI).

Table 7: Precision of data-fusion methods on one snapshot of data. Highest precisions are in bold font and other top-3 precisions are in bold italic font.

                                        Stock                               Flight
Category           Method            prec w.   prec w/o.   Trust   Trust    prec w.   prec w/o.   Trust   Trust
                                     trust     trust       dev     diff     trust     trust       dev     diff
Baseline           VOTE              -         .908        -       -        -         .864        -       -
Web-link based     HUB               .913      .907        .11     .08      .939      .857        .2      .14
                   AVGLOG            .910      .899        .17     -.13     .919      .839        .24     .001
                   INVEST            .924      .764        .39     -.31     .945      .754        .29     -.12
                   POOLEDINVEST      .924      .856        1.29    .29      .945      .921        17.26   7.45
IR based           2-ESTIMATES       .910      .903        .15     -.14     .87       .754        .46     -.35
                   3-ESTIMATES       .910      .905        .16     -.15     .87       .708        .95     -.94
                   COSINE            .910      .900        .21     -.17     .87       .791        .48     -.41
Bayesian based     TRUTHFINDER       .923      .911        .15     .12      .957      .793        .25     .16
                   ACCUPR            .910      .899        .14     -.11     .91       .868        .16     -.06
                   ACCUSIM           .918      .913        .17     -.16     .903      .844        .2      -.09
                   ACCUFORMAT        .918      .911        .17     -.16     .903      .844        .2      -.09
                   ACCUSIMATTR       .950      .929        .17     -.16     .952      .833        .19     -.08
                   ACCUFORMATATTR    .948      .930        .17     -.16     .952      .833        .19     -.08
Copying affected   ACCUCOPY          .958      .892        .28     -.11     .960      .943        .16     -.14
Precision on one snapshot: We first consider data collected on
a particular day and use the same snapshots as in Section 3. For
each data set, we computed the coverage and accuracy of each
source with respect to the gold standard (as reported in Section 3),
and then ordered the sources by the product of coverage and accuracy (i.e., recall). We started with one source and gradually added
sources according to the ordering, and measured the recall. We report the following results. First, Table 7 shows the final precision
(i.e., recall) with and without giving the sampled source trustworthiness as input, and the trustworthiness deviation and difference
for each method in each domain. Second, Figure 9 shows the recall
as we added sources on each domain; to avoid cluttering, for each
category of fusion methods, we only plotted for the method with
the highest final recall. Third, Table 8 compares pairs of methods
where the second was intended to improve over the first. The table shows for each pair how many errors by the first method are
corrected by the second and how many new errors are introduced.
Fourth, to understand how the advanced fusion methods improve
over the baseline VOTE, Figure 10 compares the precision of VOTE
and the best fusion method in each domain with respect to dominance factor. Fifth, Figure 11 categorizes the reasons of mistakes
for a randomly sampled 20 errors by the best fusion method for
each domain.
Stock data: As shown in Table 7, for the Stock data ACCUFORMATATTR obtains the best results without input trustworthiness and it improves over VOTE by 2.4% (corresponding to about 350 data items). As shown in Figure 10, the main improvement occurs on the data items with dominance factor lower than .5. Note that on this data set the highest recall from a single source is .93, exactly the same as that of the best fusion results. From Figure 9 we observe that as sources are added, for most methods the recall peaks at the 5th source and then gradually decreases; also, we observe some big change for 3-ESTIMATES at the 11th-16th sources.
We next compare the various fusion methods. For this data set,
only Bayesian based methods can perform better than VOTE; among
other methods, Web-link based methods perform worst, then ACCUCOPY, then IR based methods (Table 7). ACCUCOPY does not perform well because it considers copying as likely between many pairs of sources in this data set; the major reason is that the copy-detection technique in [5] does not take into account value similarity, so it treats values highly similar to the truth still as wrong and considers sharing such values as strong evidence for copying. From Table 8, we observe that considering formatting and distinguishing trustworthiness for different attributes improve the precision on this data set, while considering trustworthiness at the data-item level (3-ESTIMATES) does not help much.
We now examine how well we estimate source trustworthiness
and the effect on fusion. If we give the sampled source trustworthiness as input (so no need for iteration) and also ignore copiers
in Table 5 when applying ACCUCOPY (note that there may be other copying that we do not know), ACCUCOPY performs the best (Table 7). Note that for all methods, giving the sampled trustworthiness improves the results. However, for most methods except INVEST, POOLEDINVEST and ACCUCOPY, the improvement is very small; indeed, for these three methods we observe a big trustworthiness deviation. Finally, for most methods except HUB, POOLEDINVEST and TRUTHFINDER, the computed trustworthiness is lower than the sampled one on average. This makes sense because when we make mistakes, we compute lower trustworthiness for most sources. TRUTHFINDER tends to compute very high accuracy, on average .97, 14% higher than the sampled ones.
Finally, we randomly selected 20 data items on which ACCUFORMATATTR makes mistakes for examination (Figure 11). We found that among them, for 4 items ACCUFORMATATTR actually selects a value with finer granularity so the results cannot be considered as wrong. Among the rest, we would be able to fix 7 of
them if we know sampled source trustworthiness, and fix 2 more if
we are given in addition the copying relationships. For the remaining 7 items, for 1 item a lot of similar false values are provided,
for 1 item the selected value is provided by high-accuracy sources,
for 3 items the selected value is provided by more than half of the
sources, and for 2 items there is no value that is dominant while
the ground truth is neither provided by more sources nor by more
accurate sources than any other value.
[Figure 9: Fusion recall as sources are added.]

[Figure 10: Fusion precision vs. dominance factor.]
Table 8: Comparison of fusion methods.

Basic method      Advanced method    Stock                      Flight
                                     #Fixed   #New    ∆Prec     #Fixed   #New    ∆Prec
HUB               AVGLOG             3        25      -.008     2        12      -.018
INVEST            POOLEDINVEST       376      121     +.09      101      10      +.167
2-ESTIMATES       3-ESTIMATES        6        2       +.002     70       95      -.046
TRUTHFINDER       ACCUSIM            37       32      +.002     29       1       +.051
ACCUPR            ACCUSIM            70       31      +.014     1        14      -.024
ACCUSIM           ACCUSIMATTR        47       3       +.016     5        11      -.011
ACCUSIMATTR       ACCUFORMATATTR     7        5       +.001     0        0       0
ACCUFORMATATTR    ACCUCOPY           33       136     -.038     70       10      +.11
Flight data: As shown in Table 7, on the Flight data ACCUCOPY obtains the best results without input trustworthiness, and it improves over VOTE by 9% (corresponding to about 550 data items, half of the mistakes made by VOTE). ACCUCOPY does not have as many false positives in copy detection as on the Stock data because none of the attributes here is numerical, so value similarity is not a potential problem (recall that [6] reports good results on a domain with non-numerical values as well: Book). As shown in Figure 10, ACCUCOPY significantly improves the precision on data items with dominance factor in [.4, .7), because it ignores copied values in fusion. Note that on this data set the highest recall from a single source is .91, 3.4% lower than the best fusion results. From Figure 9 we observe that as sources are added, for most methods the recall peaks at the 9th source and then drops a lot after low-quality copiers are added, but for ACCUCOPY and POOLEDINVEST the recall almost flattens out after the 9th source; we also observe a big drop for COSINE at the 14th source.
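To make the reading of Figure 10 concrete, the dominance factor of a data item can be computed as the fraction of its providers that supply the most common value; the following is a minimal sketch of how such a plot can be produced, where the bucket edges are our assumption.

    from collections import Counter

    def dominance_factor(values):
        # values: list of values provided for one data item, one per source.
        # Returns the fraction of providers supplying the most common value.
        counts = Counter(values)
        return max(counts.values()) / len(values)

    def precision_by_dominance(item_values, fused, gold, edges=(0, .2, .4, .7, 1.01)):
        # Bucket items by dominance factor and measure fusion precision per
        # bucket, as in a plot like Figure 10.
        buckets = {i: [0, 0] for i in range(len(edges) - 1)}  # bucket -> [correct, total]
        for item, values in item_values.items():
            d = dominance_factor(values)
            b = next(i for i in range(len(edges) - 1) if edges[i] <= d < edges[i + 1])
            buckets[b][1] += 1
            if fused.get(item) == gold.get(item):
                buckets[b][0] += 1
        return {b: (c / t if t else None) for b, (c, t) in buckets.items()}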
Among the other methods, only POOLEDINVEST and ACCUPR perform better than VOTE (Table 7). In fact, we observe that all methods perform better than VOTE if the sampled trustworthiness is given as input, showing that the problem lies in trustworthiness computation: in this data set some groups of sources with copying dominate the values and are considered accurate, while other sources that provide the true values are then considered less accurate. This shows that if we are biased by low-quality copiers, considering source trustworthiness can bring even worse results. Interestingly, POOLEDINVEST obtains the second best results on the Flight data (but the second worst results on the Stock data). Also, we observe that considering similarity and formatting, or distinguishing trustworthiness for each attribute, does not improve the results for this data set (Table 8).
If we take input trustworthiness, ACCUCOPY performs the best (Table 7). All methods perform better with input trustworthiness, and the improvement is big. As we have said, this is mainly because of the bias from copied values. Again, except for HUB, POOLEDINVEST, and TRUTHFINDER, all other methods compute much lower trustworthiness than the sampled ones.

Finally, we randomly selected 20 data items on which ACCUCOPY makes mistakes for examination (Figure 11).
!"#$%&
'"#$
()*)+,-.$/-)01.02-3*20456$72*3)$
'()*+"&
89:0)+4;)$503;5<=05>4-);;$
!"#$
?=5$+=-;4@)04-.$+=00)+5$+=:64-.$
%&#$
'&#$
&"#$
&#$
&#$
'"#$
∆Prec
-.018
+.167
-.046
+.051
-.024
-.011
0
+.11
%&#$
&#$
'"#$
(494*20$AB2*;)A$72*3);$20)$:0=74@)@$
AC2*;)A$72*3)$:0=74@)@$D6$>4.>1
2++302+6$;=30+);$
AC2*;)A$72*3)$@=94-2-5$
?=$=-)$72*3)$@=94-2-5$
Figure 11: Error analysis of the best fusion method.
We found that we would be able to fix 10 of them if we knew precise source trustworthiness, and fix 2 more if we knew the correct copying relationships. For the remaining 8 items: for 1 item a lot of similar false values are provided; for 7 items the selected value is provided by more than half of the sources (the value provided by the airline website is in the minority, provided by at most 3 other sources).
Precision vs. efficiency: Next, we examined the efficiency of the fusion methods. Figure 12 plots the efficiency and precision of each method for each domain.

On the Stock data, VOTE finished in less than 1 second; 7 methods finished in 1-10 seconds; 4 methods, namely INVEST, POOLEDINVEST, 3-ESTIMATES, and COSINE, finished in 10-100 seconds but did not obtain higher precision; ACCUSIMATTR and ACCUFORMATATTR finished in 115 and 235 seconds respectively while obtaining the highest precision; finally, ACCUCOPY finished in 855 seconds, as it additionally computes the copying probability between each pair of sources in each round, but its precision is low.
On the Flight data, which contains fewer sources and fewer data items than Stock, 4 methods including VOTE finished in less than 1 second; 9 methods finished in 1-10 seconds; INVEST and ACCUFORMATATTR finished in 11.7 and 17.3 seconds respectively but did not obtain better results; ACCUCOPY finished in 17 seconds and obtained the highest precision. Note that on this data set ACCUCOPY did not spend much longer than ACCUFORMATATTR even though it additionally computes copying probabilities, because (1) there are fewer sources and (2) it converges in fewer rounds.
Precision over time: Finally, we ran the different fusion methods on data sets collected on different days. Table 9 shows for each method a summary, including average precision, minimum precision, and standard deviation of fusion precision over time.
[Two scatter plots, one per domain, of fusion precision (y-axis) versus execution time in seconds (x-axis, log scale), one point per method.]
Figure 12: Fusion precision vs. efficiency.
Table 9: Precision of data-fusion methods on data over one month. Font usage is similar to Table 7.

Category          Method            Stock                       Flight
                                    Avg     Min     Deviation   Avg     Min     Deviation
Baseline          VOTE              .922    .898    .014        .887    .861    .028
Web-link based    HUB               .925    .895    .015        .885    .850    .027
                  AVGLOG            .921    .895    .015        .868    .838    .029
                  INVEST            .797    .764    .027        .786    .748    .032
                  POOLEDINVEST      .871    .831    .015        .979    .921    .013
IR based          2-ESTIMATES       .910    .811    .026        .639    .588    .052
                  3-ESTIMATES       .923    .897    .014        .718    .638    .034
                  COSINE            .923    .894    .015        .880    .786    .086
Bayesian based    TRUTHFINDER       .930    .909    .013        .818    .777    .031
                  ACCUPR            .922    .893    .015        .893    .861    .030
                  ACCUSIM           .932    .913    .012        .866    .833    .032
                  ACCUFORMAT        .932    .911    .012        .866    .833    .032
                  ACCUSIMATTR       .941    .921    .011        .956    .833    .050
                  ACCUFORMATATTR    .941    .924    .010        .956    .833    .050
Copying affected  ACCUCOPY          .884    .801    .036        .987    .943    .010
Our observations are in general consistent with the results on one snapshot of the data. ACCUFORMATATTR is the best for the Stock domain, whereas ACCUCOPY is the best for the Flight domain. Indeed, the best fusion method for Stock obtains a precision as high as .941 on average, whereas the number is .987 for Flight. The major difference from the observations on the snapshots is that ACCUFORMATATTR and ACCUSIMATTR outperform VOTE on average in the Flight domain. Finally, we observe higher deviation for Flight than for Stock, caused by the varying quality of copied data; we also observe a quite high deviation for the COSINE model on the Flight data.
Summary and comparison: We found that in most data sets the naive voting results have an even lower recall than the highest recall from a single source, while the best fusion method improves over the highest source recall on average. We obtain very high precision for Flight (.987) and a reasonable precision for Stock (.941). Note, however, that for Stock the improvement of recall over the single source with the highest recall is only marginal. Also, on all data snapshots we observe that fusing a few high-recall sources (5 for Stock, 9 for Flight) obtains the highest recall, while adding more sources afterwards can only hurt (reducing recall by 4% for Stock and by .4% for Flight on the snapshot). Among the mistakes, we found that about 50% can be fixed with correct knowledge of source trustworthiness and copying; for 10% the selected values have a finer granularity than the ground truth (so are not erroneous); and for the remaining 40% we do not observe strong evidence in the data supporting the ground truth.
The Stock data and the Flight data represent two types of data sets. The type represented by Flight has copying mainly between low-accuracy sources. On such data sets, considering source accuracy without considering copying can obtain results with even lower precision,
while incorporating knowledge about copying can significantly improve the results. The data sets represented by Stock have copying mainly among high-accuracy sources. In this case, ignoring copying does not seem to hurt the fusion results much, whereas considering copying should further improve the results; this is shown by the fact that VOTE improves from .908 to .923 when excluding copiers, and that ACCUCOPY obtains the highest precision (.958) among all methods when we take the sampled source accuracy and the discovered copying as input. Note, however, that the low performance of ACCUCOPY on Stock is because the copy-detection method does not handle similar values well, so it generates many false positives in copy detection. We note that the other differences between the two domains do not seem to affect the results significantly (e.g., despite higher heterogeneity and more numerical values on Stock, most methods obtain better results on the Stock data), so we expect that our observations can generalize to other data sets.
Among the different fusion methods, we did not observe one that definitely dominates the others on all data sets. Similarly, for the fusion-method pairs listed in Table 8, it is not clear that the advanced method would definitely improve over the basic method on all data sets, except for INVEST vs. POOLEDINVEST and TRUTHFINDER vs. ACCUSIM. For example, distinguishing trustworthiness for different attributes helps on the Stock data but not on the Flight data. However, ACCUSIMATTR and ACCUFORMATATTR in general obtain higher precision than most other methods in both domains. Typically, more complex fusion methods achieve a higher fusion precision at the expense of a (much) longer execution time; this is affordable for off-line fusion. Certainly, longer execution time does not guarantee better results.
The fusion results without input trustworthiness depend both on how well the model performs when source trustworthiness is given and on how well the model can estimate source trustworthiness. In general, the lower the trustworthiness deviation, the higher the fusion precision, but there are also some exceptions.
5.
FUTURE RESEARCH DIRECTIONS
Based on our observations described in Sections 3-4, we next point out several research directions to improve data fusion and data integration in general.

Improving fusion: First, considering source trustworthiness appears to be promising and can often improve over naive voting when there is no bias from copiers. However, we often do not know source trustworthiness a priori. Currently most proposed methods start from a default accuracy for each source and then iteratively refine the accuracy. However, trustworthiness computed in this way may not be precise, and it appears that knowing precise trustworthiness could fix nearly half of the mistakes in the best fusion results. Can we start with seed trustworthiness better than the currently employed default values to improve fusion results? For example, the seed can come from sampling, or be based on results on the data items where data are fairly consistent.
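As one possible instantiation of this idea (our own sketch, not a method evaluated above), a seed accuracy can be estimated per source from only the consistent data items, where nearly all providers agree and the majority value thus serves as a cheap pseudo-gold standard.

    from collections import Counter, defaultdict

    def seed_accuracies(items, agreement=0.9):
        # items: dict mapping data item -> {source: value}.
        # Uses only items where at least `agreement` of the providers agree
        # on a single value; each source's seed accuracy is its agreement
        # rate with the majority value on those items.
        hits = defaultdict(int)
        seen = defaultdict(int)
        for votes in items.values():
            value, support = Counter(votes.values()).most_common(1)[0]
            if support / len(votes) < agreement:
                continue                          # skip controversial items
            for src, val in votes.items():
                seen[src] += 1
                hits[src] += (val == value)
        return {src: hits[src] / seen[src] for src in seen}

Sources that never appear on a consistent item are absent from the result and would fall back to a default accuracy.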
Second, we observed that different fractions of data from the same source can have different quality. The fusion results have shown the promise of distinguishing the quality of different attributes. On the other hand, one can imagine that data from one source may have different quality for data items of different categories; for example, a source may provide precise data for UA flights but low-quality data for AA flights. Can we automatically detect such differences and distinguish source quality for different categories of data to improve fusion results?
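One way to realize this (a minimal sketch, assuming each data item carries a category label such as the operating airline) is to keep a separate accuracy counter per (source, category) pair, backing off to the source's overall accuracy when a category has too little data.

    from collections import defaultdict

    def per_category_accuracy(claims, truths, category_of, min_support=20):
        # claims: dict (item, source) -> value; truths: item -> chosen value;
        # category_of: function item -> category label.
        # Returns acc[(source, category)], backing off to acc[(source, None)]
        # (the source's overall accuracy) for sparse categories.
        hits = defaultdict(int)
        seen = defaultdict(int)
        for (item, src), val in claims.items():
            for key in ((src, category_of(item)), (src, None)):
                seen[key] += 1
                hits[key] += (truths.get(item) == val)
        acc = {}
        for (src, cat), n in seen.items():
            if cat is not None and n < min_support:
                acc[(src, cat)] = hits[(src, None)] / seen[(src, None)]
            else:
                acc[(src, cat)] = hits[(src, cat)] / n
        return acc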
Third, we neither observed one fusion method that always dominates the others, nor observed, between a basic method and a proposed improvement, that the latter always beats the former. Can we combine the results of different fusion models to get better results?
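The simplest such combination (our illustration, not a method we evaluated) is a meta-vote over the outputs of several fusion models, optionally weighted by each model's precision on a labeled sample.

    from collections import defaultdict

    def ensemble_fusion(model_outputs, weights=None):
        # model_outputs: list of dicts, each mapping item -> fused value,
        # one dict per fusion model; weights: optional per-model weights
        # (e.g., precision on a held-out sample). Returns a combined
        # item -> value mapping by weighted vote across models.
        weights = weights or [1.0] * len(model_outputs)
        scores = defaultdict(lambda: defaultdict(float))
        for out, w in zip(model_outputs, weights):
            for item, value in out.items():
                scores[item][value] += w
        return {item: max(vals, key=vals.get) for item, vals in scores.items()}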
Fourth, for both data sets we assumed that there is a single true value for each data item; but in the presence of semantic ambiguity, one can argue that for each semantics there is a true value, so there are multiple truths. Current work that considers the precision and recall of sources for fusion [19] does not apply here, because each source typically applies a single semantics to each data item. Can we effectively find all correct values that fit at least one of the semantics and distinguish them from false values?
Improving integration: First, source copying not only appears promising for improving data fusion (ACCUCOPY obtains the highest precision on Flight), but also has many other potentials to improve various aspects of data integration [1]. However, the copy-detection method proposed in [6] falls short in the presence of numerical values, as it ignores value similarity and granularity. Can we develop more robust copy-detection methods in such a context? In addition, copy detection appears to be quite time-consuming. Can we improve the scalability of copy detection for Web-scale data?
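As a starting point for similarity-robust copy detection, the sketch below scores the overlap between two sources while treating near-equal numerical values as matches, so that rounded or reformatted copies still count; it illustrates only the underlying signal, not the full Bayesian model of [6].

    def similar(u, v, rel_tol=0.01):
        # Two values match if equal, or numerically within rel_tol relative
        # difference (handles rounding and granularity differences).
        if u == v:
            return True
        try:
            a, b = float(u), float(v)
        except (TypeError, ValueError):
            return False
        return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1e-9)

    def overlap_score(claims_a, claims_b):
        # claims_a, claims_b: dicts item -> value for two sources.
        # Fraction of commonly covered items whose values match under
        # `similar`; in [6]-style models, high overlap on *uncommon* (likely
        # false) values is the real evidence of copying.
        common = set(claims_a) & set(claims_b)
        if not common:
            return 0.0
        return sum(similar(claims_a[i], claims_b[i]) for i in common) / len(common)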
Second, even though we have tried our best to resolve heterogeneity at the schema level and the instance level manually, we still observed that 50% of value conflicts are caused by ambiguity. In fact, observing a lot of conflicts on an attribute from one source is a red flag for the correctness of the schema mapping, and observing a lot of conflicts on an object from one source is a red flag for the correctness of the instance mapping. Can we combine schema mapping, record linkage, and data fusion to improve the results of all of them?
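The red-flag heuristic itself is easy to operationalize (a sketch; the threshold is our assumption): for each source, compute the fraction of its claims on each attribute or object that conflict with the fused values, and flag the outliers for mapping review.

    from collections import defaultdict

    def flag_suspicious(claims, truths, key_of, threshold=0.5):
        # claims: dict (item, source) -> value; truths: item -> fused value;
        # key_of: maps (item, source) to the unit to audit, e.g.
        # lambda item, src: (src, attribute_of(item)) for schema mapping, or
        # lambda item, src: (src, object_of(item)) for instance mapping.
        # Returns the units whose conflict rate with the fused values
        # exceeds `threshold`, as candidates for a mapping error.
        conflicts = defaultdict(int)
        total = defaultdict(int)
        for (item, src), val in claims.items():
            k = key_of(item, src)
            total[k] += 1
            conflicts[k] += (truths.get(item) != val)
        return {k for k in total if conflicts[k] / total[k] > threshold}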
Third, on both data sets we observed that fusion over a few high-recall sources obtains the highest recall, whereas fusion over all sources obtains a lower recall. Such quality deterioration can also happen because of mistakes in instance de-duplication and schema mapping. This calls for source selection: can we automatically select a subset of sources that leads to the best integration results?
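A natural baseline for source selection (our sketch; it assumes a labeled sample on which fusion quality can be scored) is greedy forward selection: repeatedly add the source whose inclusion most improves fusion quality, and stop when no source helps.

    def greedy_source_selection(sources, evaluate):
        # sources: list of source identifiers; evaluate: function taking a
        # set of sources and returning a fusion-quality score (e.g., recall
        # on a labeled sample). Greedily grows the selected set while the
        # score improves.
        selected, best = set(), float("-inf")
        improved = True
        while improved:
            improved = False
            for s in set(sources) - selected:
                score = evaluate(selected | {s})
                if score > best:
                    best, best_src, improved = score, s, True
            if improved:
                selected.add(best_src)
        return selected, best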
Improving evaluation: Whether for data fusion, instance de-duplication, or schema mapping, we often need to evaluate the results of applying particular techniques. One major challenge in evaluation is constructing the gold standard. In our experiments our gold standards trust data from certain sources, but as we observed, this sometimes puts wrong values or coarse-grained values into the gold standard. Can we improve gold-standard construction, and can we capture our uncertainty about some data items in the gold standard? Other questions related to improving evaluation include automatically finding and explaining the reasons for mistakes and the reasons for inconsistency of data or schemas.
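One lightweight way to capture such uncertainty (a sketch under our own conventions, not the evaluation protocol used above) is to let the gold standard map each item to a set of acceptable values, for example the exact truth plus coarser-grained variants, and report precision accordingly.

    def precision_with_uncertainty(fused, gold):
        # fused: item -> selected value; gold: item -> set of acceptable
        # values. Returns the fraction of evaluated items whose selected
        # value is acceptable; items absent from the gold standard are
        # skipped rather than counted as wrong.
        evaluated = [item for item in fused if item in gold]
        if not evaluated:
            return None
        hits = sum(fused[item] in gold[item] for item in evaluated)
        return hits / len(evaluated)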
6.
CONCLUSIONS
This paper is the first to try to understand the correctness of data in the Deep Web. We collected data in two domains where we believed the data should be fairly clean; to our surprise, we observed quite high inconsistency in the data and found many sources of low quality. We also applied state-of-the-art data fusion methods to understand whether current techniques can successfully resolve value conflicts and find the truth. While these methods show good potential, there is obvious room for improvement, and we have suggested several promising directions for future work.
7.
REFERENCES
[1] L. Berti-Equille, A. D. Sarma, X. L. Dong, A. Marian, and
D. Srivastava. Sailing the information ocean with awareness of
currents: Discovery and application of source dependence. In CIDR,
2009.
[2] L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Probabilistic
models to reconcile complex data from inaccurate data sources. In
CAiSE, 2010.
[3] J. Bleiholder and F. Naumann. Data fusion. ACM Computing
Surveys, 41(1):1–41, 2008.
[4] N. Dalvi, A. Machanavajjhala, and B. Pang. An analysis of structured
data on the web. PVLDB, 5:680–691, 2012.
[5] X. L. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Global
detection of complex copying relationships between sources.
PVLDB, 2010.
[6] X. L. Dong, L. Berti-Equille, and D. Srivastava. Integrating
conflicting data: the role of source dependence. PVLDB, 2(1), 2009.
[7] X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and
copying detection in a dynamic world. PVLDB, 2(1), 2009.
[8] X. L. Dong and F. Naumann. Data fusion–resolving data conflicts for
integration. PVLDB, 2009.
[9] A. Galland, S. Abiteboul, A. Marian, and P. Senellart. Corroborating
information from disagreeing views. In WSDM, 2010.
[10] J. M. Kleinberg. Authoritative sources in a hyperlinked environment.
In SODA, 1998.
[11] X. Li, X. L. Dong, K. B. Lyons, W. Meng, and D. Srivastava. Truth
finding on the deep web: Is the problem solved?
http://www2.research.att.com/~lunadong/publication/webfusion_report.pdf.
[12] J. Pasternack and D. Roth. Knowing what to believe (when you
already know something). In COLING, pages 877–885, 2010.
[13] J. Pasternack and D. Roth. Making better informed trust decisions
with generalized fact-finding. In IJCAI, pages 2324–2329, 2011.
[14] D. Srivastava and S. Venkatasubramanian. Information theory for
data management. PVLDB, 2(2):1662–1663, 2009.
[15] M. Wu and A. Marian. Corroborating answers from multiple web
sources. In Proc. of the WebDB Workshop, 2007.
[16] M. Wu and A. Marian. A framework for corroborating answers from
multiple web sources. Inf. Syst., 36(2):431–449, 2011.
[17] X. Yin, J. Han, and P. S. Yu. Truth discovery with multiple
conflicting information providers on the web. IEEE Trans. Knowl.
Data Eng., 20:796–808, 2008.
[18] X. Yin and W. Tan. Semi-supervised truth discovery. In WWW, pages
217–226, 2011.
[19] B. Zhao, B. I. P. Rubinstein, J. Gemmell, and J. Han. A bayesian
approach to discovering truth from conflicting sources for data
integration. PVLDB, 5(6):550–561, 2012.