Online Ordering of Overlapping Data Sources
Mariam Salloum (YP.com)
Xin Luna Dong (Google)
Divesh Srivastava (AT&T Research)
Vassilis J. Tsotras (UC Riverside)
VLDB 2014 - Hangzhou, China
Motivation for Source Ordering
“Would like all listings in Pasadena, California?”
• For online query answering, users want results as soon as possible.
• For some domains, there are hundreds to thousands of relevant data sources.*
  * Dalvi et al., VLDB 2012.
• Not all sources can be queried in parallel, because of bandwidth limitations and similar constraints.
• Hence, we must consider the order in which sources are queried.
Source Ordering
• 5 out of 120 possible orderings are shown.
• Orderings are compared by the area-under-the-curve measure.
[Figure: 5 Source Venn Diagram (sources A, B, C, D, E)]
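The area-under-the-curve comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the answer sets are hypothetical (they are not the region counts of the slide's diagram), every probe is assumed to cost one unit, and the curve is taken to be the number of distinct answers seen after each probe. The names `SOURCES` and `area_under_curve` are made up for this sketch.

```python
from itertools import permutations

# Hypothetical answer sets for five sources; not the data behind the slide.
SOURCES = {
    "A": {1, 2, 3, 4, 5},
    "B": {4, 5, 6},
    "C": {6, 7},
    "D": {1, 8},
    "E": {9},
}

def area_under_curve(order, sources=SOURCES):
    """Sum of distinct answers seen after each probe (unit cost per probe)."""
    seen, area = set(), 0
    for name in order:
        seen |= sources[name]   # answers returned so far
        area += len(seen)       # height of the curve after this probe
    return area

orderings = list(permutations(SOURCES))        # 5! = 120 orderings
best = max(orderings, key=area_under_curve)    # exhaustive search, feasible only for tiny inputs
```

Enumerating all orderings works for five sources but grows factorially, which is why the later slides turn to a greedy strategy.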
Challenges
• Source ordering needs to consider three factors:
  • Coverage – number of answers provided by a source.
  • Overlap – percentage of answers shared between sources.
  • Cost – monetary or latency cost incurred when connecting to a source or retrieving answers from it.
• Challenges
  • Gathering all coverage and overlap statistics is infeasible.
    • 20 sources => about 1 million overlap statistics
    • 30 sources => about 1 billion overlap statistics
  • Such statistics are typically stale.
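The growth quoted above can be checked with a short calculation. The sketch assumes one overlap statistic per subset of at least two sources, which gives 2^n - n - 1 statistics for n sources; the function name is hypothetical.

```python
def overlap_statistic_count(n):
    """Subsets of n sources with at least two members: 2^n - n - 1."""
    return 2**n - n - 1

for n in (5, 20, 30):
    print(n, overlap_statistic_count(n))
```

Under this assumption, 20 sources give roughly 1.05 million statistics and 30 sources roughly 1.07 billion, in line with the figures on the slide.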
OASIS: Online Query Answering System
We consider 3 problems:
• Overlap Estimation – Given a partial set of overlap statistics, how to estimate the overlap statistics that are not known.
• Source Ordering – How to order sources to maximize the area-under-the-curve.
• Statistics Enrichment – How to select additional ‘unknown’ statistics to improve the accuracy of Overlap Estimation and, in turn, improve Source Ordering.
Basic Overlap Estimation Solution
• Given coverage and a partial set of overlap statistics, formulate constraints:
  Ex. P(A ∩ B) = 0.30
  Ex. P(A ∩ B ∩ C ∩ D) = 0.03
  P(A ∩ B) = ABC’D’E’ + ABC’D’E + ABC’DE’ + ABC’DE + ABCD’E’ + ABCD’E + ABCDE’ + ABCDE = 0.30
  P(A ∩ B ∩ C ∩ D) = ABCDE’ + ABCDE = 0.03
• Find the MaxEnt solution under the given constraints.
  • Provides the highest likelihood under the given constraints with no additional assumptions.
  • Changes smoothly with the addition of, or a change in, statistics.
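The two example constraints can be turned into a small maximum-entropy computation. The sketch below uses iterative proportional fitting starting from the uniform distribution, a standard way to satisfy constraints of this form; it is not necessarily the solver used in OASIS. Coverage constraints are left out to keep it short, and the names `maxent` and `SOURCES` are hypothetical.

```python
from itertools import product

SOURCES = "ABCDE"

def maxent(constraints, sweeps=100):
    """Max-entropy distribution over the 2^5 membership patterns.

    constraints: list of (sources, target), meaning
    P(answer is in every listed source) = target.
    """
    worlds = list(product((False, True), repeat=len(SOURCES)))
    p = {w: 1.0 / len(worlds) for w in worlds}   # start from uniform
    for _ in range(sweeps):
        for required, target in constraints:
            idx = [SOURCES.index(s) for s in required]
            inside = {w for w in worlds if all(w[i] for i in idx)}
            mass = sum(p[w] for w in inside)
            for w in worlds:
                # Rescale so that this constraint holds exactly.
                if w in inside:
                    p[w] *= target / mass
                else:
                    p[w] *= (1.0 - target) / (1.0 - mass)
    return p

p = maxent([("AB", 0.30), ("ABCD", 0.03)])
abcde_bar = (True, True, True, True, False)      # the variable ABCDE’
p_ab = sum(v for w, v in p.items() if w[0] and w[1])
```

With only these two constraints, the 0.03 is split evenly between ABCDE’ and ABCDE, and the remaining 0.27 of P(A ∩ B) is spread evenly over the other six patterns that contain both A and B.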
Overlap Estimation (Cont.)
• Challenges
  • Formulating the problem exactly requires the definition of 2^n variables, where n is the number of data sources.
  • Ex. 30 sources = about 1 billion variables.
• Observation
  • The number of non-zero variables should not exceed the number of answers, which is usually much smaller than 2^n.
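The observation can be illustrated on toy data: each answer falls into exactly one membership pattern, so the number of patterns actually in use is bounded by the number of answers. The answer sets and the function name below are hypothetical.

```python
def nonzero_variables(sources):
    """Return the set of membership patterns that at least one answer falls into."""
    names = sorted(sources)
    answers = set().union(*sources.values())
    return {tuple(a in sources[n] for n in names) for a in answers}

# Hypothetical answer sets for five sources.
sources = {
    "A": {1, 2, 3, 4, 5},
    "B": {4, 5, 6},
    "C": {6, 7},
    "D": {1, 8},
    "E": {9},
}
patterns = nonzero_variables(sources)
num_answers = len(set().union(*sources.values()))
```

Here 9 answers occupy 7 distinct patterns out of 2^5 = 32 possible ones; the gap widens quickly as the number of sources grows.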
Scalable Overlap Estimation Solution
1) Define constraints using a subset of variables with high cardinality.
Given Statistics
P(A)    P(A ∩ B)
P(B)    P(A ∩ D)
P(C)    P(A ∩ B ∩ C ∩ D)
P(D)
P(E)
V = {AB’C’D’E’, A’BC’D’E’, A’B’CD’E’, A’B’C’DE’, A’B’C’D’E, ABC’D’E’, AB’C’DE’, ABCDE’, A’B’C’D’E’}
2) Solve MaxEnt problem
3) Include additional variables that are expected to have high cardinality, and remove variables whose value is close to zero.
4) Repeat the procedure until no new variables are added.
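The initial variable set V in step 1 appears consistent with a simple rule: for each given statistic, take the variable in which exactly the listed sources contain the answer, then add the variable for answers found in no source. That reading is an assumption drawn from the example, and the helper name is hypothetical. The sketch writes the complement mark as a plain apostrophe.

```python
def initial_variables(statistics, sources="ABCDE"):
    """One variable per given statistic, plus the 'in no source' variable."""
    def variable(present):
        # A source letter is primed when the answer is absent from that source.
        return "".join(s if s in present else s + "'" for s in sources)
    return [variable(stat) for stat in statistics] + [variable("")]

# Given statistics from the slide: five coverages and three overlaps.
given = ["A", "B", "C", "D", "E", "AB", "AD", "ABCD"]
V = initial_variables(given)
```

Steps 3 and 4 would then grow or prune this list between MaxEnt runs; how candidates are scored is not specified on the slide, so it is left out here.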
Source Ordering
• An optimal ordering of sources returns answers as fast as possible, measured by the area-under-the-curve.
• Since finding an optimal ordering is NP-hard, we propose a greedy algorithm that orders sources by the highest ratio of residual coverage to cost.
• We propose two source ordering strategies:
  • STATIC Ordering
  • DYNAMIC Ordering
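The greedy rule can be sketched as follows. To keep the example self-contained it uses the actual answer sets to compute residual coverage, whereas the approach in the talk relies on estimates derived from the MaxEnt solution. The data and the function name are hypothetical.

```python
def greedy_order(sources, cost):
    """Repeatedly pick the source with the highest (new answers / cost) ratio."""
    seen, order = set(), []
    remaining = dict(sources)
    while remaining:
        # Residual coverage = answers of the source not yet returned by earlier probes.
        best = max(remaining, key=lambda s: len(remaining[s] - seen) / cost[s])
        order.append(best)
        seen |= remaining.pop(best)
    return order

sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}}   # hypothetical answer sets
cost = {"A": 2.0, "B": 1.0, "C": 2.0}                     # hypothetical costs
order = greedy_order(sources, cost)
```

B goes first because it returns three answers at cost 1; A follows because its two remaining new answers at cost 2 still beat C's single answer at the same cost.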
STATIC Ordering
Solve the MaxEnt problem.
Select the next source with the highest ratio of residual coverage to cost.
Probe the selected source.
Iterate until the threshold is reached.
DYNAMIC Ordering
Solve the MaxEnt problem.
Select the next source to probe.
Probe the selected source.
Compute additional statistics.
Iterate until the threshold is reached.
Statistics Enrichment
• The Statistics Enrichment component chooses additional ‘unknown’ statistics with the goal of improving source ordering.
• Incorporating additional statistics into Static and Dynamic ordering:
  • STATIC+ Ordering
  • DYNAMIC+ Ordering

                                   Not adaptable   Adaptable
Requests additional statistics     STATIC+         DYNAMIC+
No additional statistics           STATIC          DYNAMIC
Experimental Evaluation
• Data Set
  • Snapshot of Computer Science book listings from AbeBooks.com
  • 1,028 bookstores (sources)
  • 1,256 unique books / 25,347 book records in total
  • Cost: fixed 356 ms source-connection cost and 0.3 ms per-tuple cost (based on empirical tests)
• Ordering Strategies
  • STATIC / STATIC+
  • DYNAMIC / DYNAMIC+
  • Random: Randomly choose an order of the sources
  • Coverage: Order the sources in decreasing order of their coverage
  • Baseline: Naïve usage of given coverage and overlap statistics
  • FullKnowledge: Greedy algorithm with accurate and complete set of coverage and overlap statistics
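The cost model from the data set description and the two simplest baselines can be written down directly. The constants come from the slide; the function names, the store names, and the record counts are hypothetical.

```python
import random

CONNECT_MS = 356.0     # fixed source-connection cost (from the slide)
PER_TUPLE_MS = 0.3     # cost per retrieved tuple (from the slide)

def probe_cost_ms(num_tuples):
    """Cost of probing one source that returns num_tuples records."""
    return CONNECT_MS + PER_TUPLE_MS * num_tuples

def coverage_order(coverage):
    """Coverage baseline: sources in decreasing order of their coverage."""
    return sorted(coverage, key=coverage.get, reverse=True)

def random_order(coverage, seed=0):
    """Random baseline: a shuffled copy of the source names."""
    names = list(coverage)
    random.Random(seed).shuffle(names)
    return names

coverage = {"store1": 40, "store2": 250, "store3": 5}   # hypothetical record counts
```

With these numbers the connection cost dominates for small sources: a store returning 5 records costs about 357.5 ms, only slightly less than one returning 40.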
Evaluation of Algorithms
• DYNAMIC yields a larger area-under-the-curve than STATIC and probes fewer sources to reach 90% coverage.
• DYNAMIC+ and STATIC+ perform better than their DYNAMIC and STATIC counterparts.
Conclusions
• The proposed Overlap Estimation method generates good overlap estimates for the purpose of source ordering.
• An adaptive ordering strategy (DYNAMIC ordering) generates a better source ordering than a static ordering strategy.
• Incorporating new statistics (whether accurate, approximate, or stale) can improve source ordering (DYNAMIC+).
• As long as the statistic selection procedure is fast, incorporating new statistics on-the-fly can improve source ordering.
Thank You
Questions?