VLDB review summary


A Confidence-Aware Approach for Truth Discovery on Long-Tail Data

Qi Li (1), Yaliang Li (1), Jing Gao (1), Lu Su (1), Bo Zhao (2), Murat Demirbas (1), Wei Fan (3), and Jiawei Han (4)

(1) SUNY Buffalo, Buffalo, NY, USA

(2) LinkedIn, San Francisco, CA, USA

(3) Baidu Research Big Data Lab, China

(4) University of Illinois, Urbana, IL, USA

1

[Figure: poll results for the quiz question "Which of these square numbers also happens to be the sum of two smaller square numbers?" Answer options: 16, 25, 36, 49. Votes shown for A, B, C, D: 50%, 30%, 19%, 1%. Source: https://www.youtube.com/watch?v=BbX44YSsQ2I]

2


Problem Description

• Our task is to aggregate the information that different sources provide about the same entities, taking each source's degree of reliability into account.

Truth Discovery

4

Truth Discovery

• Principle

– Infer both truth and source reliability from the data

• A source is reliable if it provides many pieces of true information

• A piece of information is likely to be true if it is provided by many reliable sources

5

Long-Tail Phenomenon

6

Existing Work

• Existing methods

– Tackle different challenges in truth discovery

• Source correlations, source costs, streaming data, ...

• Limitation when most sources make only a few claims

– Source weights are proportional to the accuracy of the sources

• When the number of claims from a source is quite small, the estimate of its accuracy is unreliable.

7

Overview of Our Work

• A confidence-aware approach

– not only estimates source reliability

– but also considers the confidence interval of the estimation

8

Aggregation

• Assume that each source $s$ has a weight $w_s$

• To aggregate the various information, weighted combination is adopted:

$$x_n^{*} = \frac{\sum_{s \in S_n} w_s \, x_n^{s}}{\sum_{s \in S_n} w_s}$$
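The weighted combination can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `aggregate`, the dictionary layout, and the sample values are all assumptions made for the example.

```python
def aggregate(claims, weights):
    """Weighted average of the values claimed for one entity n.

    claims  -- dict: source id -> numeric value claimed for this entity
               (its keys play the role of S_n)
    weights -- dict: source id -> non-negative weight w_s
    """
    total_weight = sum(weights[s] for s in claims)
    if total_weight <= 0:
        raise ValueError("at least one claiming source needs a positive weight")
    return sum(weights[s] * x for s, x in claims.items()) / total_weight


# Hypothetical data: three sources report a value for the same entity.
claims = {"a": 10.0, "b": 12.0, "c": 20.0}
weights = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 5.0}  # "d" made no claim here
estimate = aggregate(claims, weights)  # (3*10 + 2*12 + 1*20) / (3 + 2 + 1)
```

Only the sources that actually made a claim on the entity enter the sum, so the weight of source "d" has no effect on this estimate.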

9

Model the Error Distribution

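• Assume that sources are independent

Error made by source $s$: $\epsilon_s \sim N(0, \sigma_s^2)$

• Since $\epsilon_{combine} = \dfrac{\sum_{s \in S} w_s \epsilon_s}{\sum_{s \in S} w_s}$, we have

$$\epsilon_{combine} \sim N\left(0, \frac{\sum_{s \in S} w_s^2 \sigma_s^2}{\left(\sum_{s \in S} w_s\right)^2}\right)$$

Without loss of generality, we constrain $\sum_{s \in S} w_s = 1$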
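The variance of the combined error follows from the independence assumption in one short step. The derivation below is a sketch that uses only the symbols already defined on this slide.

```latex
\begin{aligned}
\operatorname{Var}(\epsilon_{combine})
  &= \operatorname{Var}\!\left(\frac{\sum_{s \in S} w_s \epsilon_s}{\sum_{s \in S} w_s}\right)
   = \frac{\sum_{s \in S} w_s^2 \operatorname{Var}(\epsilon_s)}{\left(\sum_{s \in S} w_s\right)^2}
   = \frac{\sum_{s \in S} w_s^2 \sigma_s^2}{\left(\sum_{s \in S} w_s\right)^2} \\
  &= \sum_{s \in S} w_s^2 \sigma_s^2
   \qquad \text{when } \sum_{s \in S} w_s = 1 .
\end{aligned}
```

The cross terms vanish because the errors of independent sources are uncorrelated, and rescaling all weights by a common factor leaves the ratio unchanged, which is why the normalization costs no generality.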

10

Minimize the Variance of Errors

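• Goal:

– want the variance of $\epsilon_{combine}$ to be as small as possible

• Optimization

$$\min_{\{w_s\}} \; \sum_{s \in S} w_s^2 \sigma_s^2 \quad \text{s.t.} \quad \sum_{s \in S} w_s = 1, \; w_s > 0$$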
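Minimizing $\sum_{s \in S} w_s^2 \sigma_s^2$ under the constraint $\sum_{s \in S} w_s = 1$ can be worked out with a Lagrange multiplier. The steps below are a sketch of that standard argument; the multiplier $\lambda$ is notation introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L}(w, \lambda) &= \sum_{s \in S} w_s^2 \sigma_s^2 - \lambda \left( \sum_{s \in S} w_s - 1 \right) \\
\frac{\partial \mathcal{L}}{\partial w_s} &= 2 w_s \sigma_s^2 - \lambda = 0
  \;\Longrightarrow\; w_s = \frac{\lambda}{2 \sigma_s^2} \\
\sum_{s \in S} w_s = 1 &\;\Longrightarrow\;
  w_s = \frac{1 / \sigma_s^2}{\sum_{s' \in S} 1 / \sigma_{s'}^2}
  \;\propto\; \frac{1}{\sigma_s^2}
\end{aligned}
```

Each weight is therefore inversely proportional to the error variance of its source, and the resulting weights are positive whenever every $\sigma_s^2$ is positive and finite.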

11

How to Estimate Variance

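We can estimate the variance of each source using a formulation similar to the sample variance:

$$\hat{\sigma}_s^2 = \frac{1}{|N_s|} \sum_{n \in N_s} \left( x_n^{s} - x_n^{*(0)} \right)^2$$

where $N_s$ is the set of entities on which source $s$ makes claims and $x_n^{*(0)}$ is the initial truth.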
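A minimal Python sketch of this estimate is given below. The function name `estimate_variance`, the dictionary layout, and the numbers are assumptions chosen for illustration; how the initial truth is obtained is not specified here, so it is simply passed in.

```python
def estimate_variance(source_claims, initial_truth):
    """Mean squared deviation of one source's claims from the initial truth.

    source_claims -- dict: entity id -> value claimed by this source (N_s)
    initial_truth -- dict: entity id -> initial truth estimate x_n^{*(0)}
    """
    if not source_claims:
        raise ValueError("the source must make at least one claim")
    squared_errors = [(x - initial_truth[n]) ** 2 for n, x in source_claims.items()]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical initial truths and two sources with different participation.
initial_truth = {"e1": 10.0, "e2": 5.0, "e3": 7.0}
var_two_claims = estimate_variance({"e1": 11.0, "e2": 3.0}, initial_truth)  # (1 + 4) / 2
var_one_claim = estimate_variance({"e3": 7.0}, initial_truth)               # single lucky claim
```

The single-claim source ends up with an estimated variance of zero, which illustrates why a point estimate from very few claims is hard to trust.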

12

Estimate CI of Variance

• The estimate is not accurate with a small number of samples.

• Find a range of values that can act as good estimates.

• Calculate the confidence interval based on

$$\frac{|N_s| \, \hat{\sigma}_s^2}{\sigma_s^2} \sim \chi^2(|N_s|)$$
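The chi-square relation can be turned into an interval for $\sigma_s^2$ by dividing the sum of squared errors by an upper and a lower chi-square quantile. The sketch below uses only the Python standard library; the helper names `chi2_cdf`, `chi2_quantile`, and `variance_confidence_interval`, the two-sided level `alpha`, and the sample numbers are assumptions made for illustration. The quantile is found by bisection on a series expansion of the chi-square CDF, which is one simple way to avoid an external statistics package.

```python
import math


def chi2_cdf(x, k):
    """Chi-square CDF with k degrees of freedom, via the power series of the
    regularized lower incomplete gamma function P(k/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, k):
    """Value q with P(chi2_k <= q) = p, found by bisection on chi2_cdf."""
    lo, hi = 0.0, float(k)
    while chi2_cdf(hi, k) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def variance_confidence_interval(squared_errors, alpha=0.05):
    """Two-sided (1 - alpha) interval for the error variance of one source."""
    n = len(squared_errors)
    total = sum(squared_errors)  # equals |N_s| times the estimated variance
    lower = total / chi2_quantile(1.0 - alpha / 2.0, n)
    upper = total / chi2_quantile(alpha / 2.0, n)
    return lower, upper


# Hypothetical sources with the same average squared error (2.5).
low_few, up_few = variance_confidence_interval([1.0, 4.0])      # 2 claims
low_many, up_many = variance_confidence_interval([2.5] * 20)    # 20 claims
```

With two claims the interval spans more than two orders of magnitude, while twenty claims with the same average error give a far narrower range, which matches the motivation for treating long-tail sources with extra caution.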

13

Example

Example on calculating confidence interval

14


How to estimate variance

• Consider the possibly worst scenario of $\sigma_s^2$

• Use the upper bound of the 95% confidence interval of $\sigma_s^2$:

$$u_s^2 = \frac{\sum_{n \in N_s} \left( x_n^{s} - x_n^{*(0)} \right)^2}{\chi^2_{(0.05, |N_s|)}}$$

17

CATD

• Closed-form solution:

$$w_s \propto \frac{1}{u_s^2} = \frac{\chi^2_{(0.05, |N_s|)}}{\sum_{n \in N_s} \left( x_n^{s} - x_n^{*(0)} \right)^2}$$
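The closed-form weight can be sketched with the Python standard library as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `chi2_cdf`, `chi2_quantile`, and `catd_weights` are hypothetical, the chi-square subscript is read as a lower-tail probability, and `tail` defaults to 0.05 only to mirror the subscript on the slide. The `floor` guard against a zero sum of squared errors is also an assumption added for the example.

```python
import math


def chi2_cdf(x, k):
    """Chi-square CDF (k degrees of freedom) via the incomplete gamma series."""
    if x <= 0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, k):
    """Value q with P(chi2_k <= q) = p, found by bisection."""
    lo, hi = 0.0, float(k)
    while chi2_cdf(hi, k) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if chi2_cdf(mid, k) < p else (lo, mid)
    return (lo + hi) / 2.0


def catd_weights(squared_errors_by_source, tail=0.05, floor=1e-12):
    """Normalized weights proportional to chi2 quantile / sum of squared errors.

    squared_errors_by_source -- dict: source id -> list of squared deviations
                                from the initial truth, one per claim
    tail  -- lower-tail probability of the chi-square quantile (assumption)
    floor -- hypothetical guard so a zero error sum does not divide by zero
    """
    raw = {
        s: chi2_quantile(tail, len(errs)) / max(sum(errs), floor)
        for s, errs in squared_errors_by_source.items()
    }
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}


# Hypothetical sources with identical average squared error (1.0) but
# very different participation: 20 claims versus 2 claims.
weights = catd_weights({"head": [1.0] * 20, "tail": [1.0] * 2})
```

Although both sources show the same average error, the source with twenty claims receives the larger share of the weight, because the chi-square quantile grows with the number of claims and the upper bound on its variance is correspondingly tighter.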

18

Example

Example on calculating source weight

19


Performance on Game Data

Question level   Majority Voting   CATD
1                0.0297            0.0132
2                0.0305            0.0271
3                0.0414            0.0276
4                0.0507            0.0290
5                0.0672            0.0435
6                0.1101            0.0596
7                0.1016            0.0481
8                0.3043            0.1304
9                0.3737            0.1414
10               0.5227            0.2045

22

Performance on Game Data

Comparison on Game dataset

23

Summary

• Truth Discovery on long-tail data

– Most sources provide only a few claims, and only a few sources make plenty of claims.

– By adopting effective estimators based on the confidence interval, CATD appropriately estimates source reliability for sources with different levels of participation.

24
