A Confidence-Aware Approach for
Truth Discovery on Long-Tail Data
Qi Li¹, Yaliang Li¹, Jing Gao¹, Lu Su¹, Bo Zhao²,
Murat Demirbas¹, Wei Fan³, and Jiawei Han⁴
¹ SUNY Buffalo, Buffalo, NY, USA
² LinkedIn, San Francisco, CA, USA
³ Baidu Research Big Data Lab, China
⁴ University of Illinois, Urbana, IL, USA
Which of these square numbers also happens to be the sum of two smaller square numbers?
Options: 16, 25, 36, 49
Audience poll: A 50%, B 30%, C 19%, D 1%
Source: https://www.youtube.com/watch?v=BbX44YSsQ2I
Problem Description
• Our task is to aggregate the information that different sources provide about the same entities, taking each source's degree of reliability into account.
Truth Discovery
• Principle
– Infer both truth and source reliability from the data
• A source is reliable if it provides many pieces of true information
• A piece of information is likely to be true if it is provided by many reliable sources
Long-Tail Phenomenon
Existing Work
• Existing methods
– Tackle different challenges in truth discovery
• Source correlations, source costs, streaming data, ……
• Limitation when most sources make only a few claims
– Source weights are proportional to the accuracy of the sources
• When a source makes very few claims, the estimate of its accuracy is unreliable.
Overview of Our Work
• A confidence-aware approach
– not only estimates source reliability
– but also considers the confidence interval of the estimation
Aggregation
• Assume that each source has a weight $w_s$
• To aggregate the various information, weighted combination is adopted:
$x_n^{*} = \frac{\sum_{s \in S_n} w_s \cdot x_n^{s}}{\sum_{s \in S_n} w_s}$
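The weighted combination can be sketched in a few lines of Python. The function name `aggregate`, the source ids, and the numbers are illustrative assumptions, not part of the original slides.

```python
def aggregate(claims, weights):
    """Weighted average of the claims made about one entity.

    claims:  dict mapping source id -> claimed value x_n^s
    weights: dict mapping source id -> positive weight w_s
    """
    # Only sources that actually claim a value for this entity (S_n) contribute.
    total_weight = sum(weights[s] for s in claims)
    return sum(weights[s] * x for s, x in claims.items()) / total_weight


# Hypothetical claims about a single entity from three sources.
claims = {"A": 10.0, "B": 12.0, "C": 20.0}
weights = {"A": 3.0, "B": 2.0, "C": 1.0}
truth = aggregate(claims, weights)  # (30 + 24 + 20) / 6
```

Because source C carries the smallest weight, its outlying claim pulls the aggregate far less than it would in a plain average.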
Model the Error Distribution
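• Error made by source $s$: $\epsilon_s \sim N(0, \sigma_s^2)$
• Assume that sources are independent
• Since $\epsilon_{combine} = \frac{\sum_{s \in S} w_s \epsilon_s}{\sum_{s \in S} w_s}$, we have $\epsilon_{combine} \sim N\left(0, \frac{\sum_{s \in S} w_s^2 \sigma_s^2}{\left(\sum_{s \in S} w_s\right)^2}\right)$
• Without loss of generality, we constrain $\sum_{s \in S} w_s = 1$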
Minimize the Variance of Errors
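• Goal:
– want the variance of $\epsilon_{combine}$ to be as small as possible
• Optimization: $\min_{\{w_s\}} \sum_{s \in S} w_s^2 \sigma_s^2$ subject to $\sum_{s \in S} w_s = 1$, $w_s > 0$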
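One way to see where the minimum lies is a standard Lagrange-multiplier argument. The derivation below is a sketch under the assumptions already stated (independent sources, weights summing to 1); the multiplier $\lambda$ is introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{L}(w, \lambda) &= \sum_{s \in S} w_s^2 \sigma_s^2 + \lambda \Big( 1 - \sum_{s \in S} w_s \Big) \\
\frac{\partial \mathcal{L}}{\partial w_s} &= 2 w_s \sigma_s^2 - \lambda = 0
  \quad\Longrightarrow\quad w_s = \frac{\lambda}{2 \sigma_s^2} \propto \frac{1}{\sigma_s^2} \\
\sum_{s \in S} w_s = 1 &\quad\Longrightarrow\quad
  w_s = \frac{1/\sigma_s^2}{\sum_{s' \in S} 1/\sigma_{s'}^2}
\end{aligned}
```

The resulting weights are positive automatically, so a source with a smaller error variance always receives a larger weight.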
How to Estimate Variance
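We can estimate the variance of each source using a formulation similar to the sample variance:
$\hat{\sigma}_s^2 = \frac{1}{|N_s|} \sum_{n \in N_s} \left( x_n^{s} - x_n^{*(0)} \right)^2$
where $N_s$ is the set of entities on which source $s$ makes claims and $x_n^{*(0)}$ is the initial truth.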
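A minimal sketch of this estimate, with hypothetical entity ids and values. How the initial truths are obtained is not specified here, so they are simply given as inputs.

```python
def estimate_variance(claims, initial_truth):
    """Mean squared deviation of one source's claims from the initial truths.

    claims:        dict mapping entity id -> value claimed by the source
    initial_truth: dict mapping entity id -> initial truth estimate
    """
    squared_errors = [(x - initial_truth[n]) ** 2 for n, x in claims.items()]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data: one source with three claims, one with a single claim.
initial_truth = {"e1": 10.0, "e2": 5.0, "e3": 8.0}
active_source = {"e1": 11.0, "e2": 4.0, "e3": 8.0}  # errors +1, -1, 0
tail_source = {"e1": 10.0}                          # one claim, exactly right

var_active = estimate_variance(active_source, initial_truth)
var_tail = estimate_variance(tail_source, initial_truth)
```

The single-claim source ends up with an estimated variance of zero, which a plain inverse-variance weighting would treat as perfect reliability. This illustrates why point estimates are fragile on long-tail data.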
Estimate CI of Variance
• The estimate is inaccurate when the number of samples is small.
• Find a range of values that can act as good estimates.
• Calculate the confidence interval based on
$\frac{|N_s| \, \hat{\sigma}_s^2}{\sigma_s^2} \sim \chi^2(|N_s|)$
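The chi-square relation can be turned into an interval for the true variance. The sketch below uses only the standard library: it evaluates the chi-square CDF through the series for the regularized lower incomplete gamma function and inverts it by bisection. The helper names, the two-sided convention (probability `alpha / 2` in each tail), and the example numbers are assumptions made for illustration.

```python
import math


def chi2_cdf(x, dof):
    """Chi-square CDF, via the series for the regularized lower
    incomplete gamma function P(dof / 2, x / 2)."""
    if x <= 0:
        return 0.0
    a, y = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))


def chi2_quantile(p, dof):
    """Value q with chi2_cdf(q, dof) == p (for 0 < p < 1), by bisection."""
    lo, hi = 0.0, max(1.0, float(dof))
    while chi2_cdf(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def variance_interval(sum_sq_err, n_claims, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for a source's variance.

    sum_sq_err is the sum of squared deviations from the initial truths,
    i.e. |N_s| times the estimated variance.
    """
    lower = sum_sq_err / chi2_quantile(1.0 - alpha / 2.0, n_claims)
    upper = sum_sq_err / chi2_quantile(alpha / 2.0, n_claims)
    return lower, upper


# Same mean squared error (1.0), very different numbers of claims.
few = variance_interval(2.0, 2)
many = variance_interval(20.0, 20)
```

With only two claims the interval is far wider than with twenty, even though both sources have the same estimated variance.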
Example
Example on calculating confidence interval
How to Estimate Variance
• Consider the worst plausible scenario for $\sigma_s^2$
• Use the upper bound of the 95% confidence interval of $\sigma_s^2$:
$u_s^2 = \frac{\sum_{n \in N_s} \left( x_n^{s} - x_n^{*(0)} \right)^2}{\chi^2_{(0.05,\, |N_s|)}}$
CATD
• Closed-form solution: $w_s \propto \frac{1}{u_s^2} = \frac{\chi^2_{(0.05,\, |N_s|)}}{\sum_{n \in N_s} \left( x_n^{s} - x_n^{*(0)} \right)^2}$
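The closed-form weight can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the chi-square quantile is computed with standard-library helpers (series CDF plus bisection), the weights are normalized to sum to 1, and the source names and claims are hypothetical.

```python
import math


def chi2_cdf(x, dof):
    """Chi-square CDF via the regularized lower incomplete gamma series."""
    if x <= 0:
        return 0.0
    a, y = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))


def chi2_quantile(p, dof):
    """Value q with chi2_cdf(q, dof) == p (for 0 < p < 1), by bisection."""
    lo, hi = 0.0, max(1.0, float(dof))
    while chi2_cdf(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def catd_weights(claims_by_source, initial_truth, alpha=0.05):
    """Weights proportional to chi2_(alpha, |N_s|) / sum of squared errors."""
    raw = {}
    for s, claims in claims_by_source.items():
        sse = sum((x - initial_truth[n]) ** 2 for n, x in claims.items())
        # A source whose claims all equal the initial truths has sse == 0
        # and would need special handling; this sketch does not cover it.
        raw[s] = chi2_quantile(alpha, len(claims)) / sse
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}


# Two hypothetical sources with the same mean squared error (1.0):
# "active" makes ten claims, "tail" makes only one.
initial_truth = {f"e{i}": 0.0 for i in range(10)}
claims_by_source = {
    "active": {f"e{i}": (1.0 if i % 2 == 0 else -1.0) for i in range(10)},
    "tail": {"e0": 1.0},
}
weights = catd_weights(claims_by_source, initial_truth)
```

A plain inverse-variance rule would weight the two sources equally. Here the ten-claim source receives almost all of the weight, because a single claim gives very little evidence about the tail source's reliability.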
Example
Example on calculating source weight
Performance on Game Data
Question level   Majority Voting   CATD
1                0.0297            0.0132
2                0.0305            0.0271
3                0.0414            0.0276
4                0.0507            0.0290
5                0.0672            0.0435
6                0.1101            0.0596
7                0.1016            0.0481
8                0.3043            0.1304
9                0.3737            0.1414
10               0.5227            0.2045
Performance on Game Data
Comparison on Game dataset
Summary
• Truth Discovery on long-tail data
– Most sources provide only a few claims, and only a few sources make many claims.
– By adopting effective estimators based on the confidence interval, CATD appropriately estimates source reliability for sources with different levels of participation.