Presentation

Online Ordering of Overlapping Data Sources
Mariam Salloum (YP.com), Xin Luna Dong (Google), Divesh Srivastava (AT&T Research), Vassilis J. Tsotras (UC Riverside)
VLDB 2014 - Hangzhou, China

Motivation for Source Ordering
"Would like all listings in Pasadena, California?"
• For online query answering, users want results as soon as possible.
• For some domains, there are hundreds to thousands of relevant data sources.*
• Not all sources can be queried in parallel because of bandwidth limitations, etc.
• Hence, we must consider the order in which sources are queried.
* Dalvi et al., VLDB 2012.

Source Ordering
• 5 out of 120 possible orderings are shown.
• Orderings are compared by the area-under-the-curve measure.
[Figure: Source Venn Diagram of sources A, B, C, D and E]

Challenges
• Source ordering needs to consider three factors:
  - Coverage: number of answers provided by a source.
  - Overlap: percentage of overlapping answers between sources.
  - Cost: monetary or latency cost incurred when connecting to or retrieving answers from a source.
• Challenges:
  - Gathering all coverage and overlap statistics is infeasible.
    20 sources => 1 million overlap statistics
    30 sources => 1 billion overlap statistics
  - Such statistics are typically stale.

OASIS: Online Query Answering System
We consider 3 problems:
• Overlap Estimation: Given a partial set of overlap statistics, how to estimate the overlap statistics that are not known.
• Source Ordering: How to order sources to maximize the area-under-the-curve.
• Statistics Enrichment: How to select additional 'unknown' statistics to improve the accuracy of Overlap Estimation and in turn improve Source Ordering.

Basic Overlap Estimation Solution
• Given coverage and a partial set of overlap statistics, formulate constraints:
  Ex. P(A ∩ B) = 0.30
      ABC'D'E' + ABC'D'E + ABC'DE' + ABC'DE + ABCD'E' + ABCD'E + ABCDE' + ABCDE = 0.30
  Ex. P(A ∩ B ∩ C ∩ D) = 0.03
      ABCDE' + ABCDE = 0.03
• Find the MaxEnt solution under the given constraints.
  - Provides highest likelihood under given constraints with no additional assumptions.
  - Changes smoothly with addition/change in statistics.

Overlap Estimation (Cont.)
• Challenges
  - Formulating the problem exactly requires the definition of 2^n variables, where n is the number of data sources.
  - Ex. 30 sources = 1 billion variables.
• Observation
  - The number of non-zero variables should not exceed the number of answers, which is usually much smaller than 2^n.

Scalable Overlap Estimation Solution
1) Define constraints using a subset of variables with high cardinality.
   Given statistics: P(A), P(B), P(C), P(D), P(E), P(A ∩ B), P(A ∩ D), P(A ∩ B ∩ C ∩ D)
   V = {AB'C'D'E', A'BC'D'E', A'B'CD'E', A'B'C'DE', A'B'C'D'E, ABC'D'E', AB'C'DE', ABCDE', A'B'C'D'E'}
2) Solve the MaxEnt problem.
3) Include additional variables that are expected to have high cardinality, and remove variables whose value is close to zero.
4) Repeat the procedure until no new variables are added.

Source Ordering
• An optimal ordering of sources returns answers as fast as possible, measured by the area-under-the-curve.
• Since finding an optimal ordering is NP-hard, we propose a greedy algorithm that orders sources by the highest ratio of residual coverage to cost.
• We propose two source ordering strategies:
  - STATIC Ordering
  - DYNAMIC Ordering

STATIC Ordering
1) Solve the MaxEnt problem.
2) Select the next source with the highest residual coverage over cost ratio.
3) Probe the selected source.
4) Iterate until the threshold is reached.
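
The greedy selection in the STATIC steps above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the region probabilities in `regions` (standing in for the MaxEnt output), the `costs` values, and the function names are all hypothetical. The area-under-the-curve is computed here as a step function in which a source's answers arrive when its probe finishes, which is also an assumption.

```python
# Hypothetical MaxEnt output: each key is the exact set of sources that
# return an answer, each value is the fraction of all answers in that region.
regions = {
    frozenset("A"): 0.20,
    frozenset("B"): 0.10,
    frozenset("AB"): 0.30,
    frozenset("C"): 0.25,
    frozenset("BC"): 0.15,
}
# Hypothetical per-source probing costs (arbitrary time units).
costs = {"A": 1.0, "B": 1.0, "C": 2.0}


def residual_coverage(source, probed, regions):
    """Fraction of answers in `source` not returned by any probed source."""
    return sum(p for region, p in regions.items()
               if source in region and not (region & probed))


def greedy_order(regions, costs, threshold=1.0):
    """Repeatedly pick the source with the best residual coverage / cost."""
    probed, order, covered = set(), [], 0.0
    remaining = set(costs)
    while remaining and covered < threshold:
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(remaining),
                   key=lambda s: residual_coverage(s, probed, regions) / costs[s])
        covered += residual_coverage(best, probed, regions)
        order.append(best)
        probed.add(best)
        remaining.discard(best)
    return order


def area_under_curve(order, regions, costs):
    """Area under the coverage-vs-time step curve for a given ordering."""
    area, covered, probed = 0.0, 0.0, set()
    for s in order:
        area += covered * costs[s]  # coverage stays flat while s is probed
        covered += residual_coverage(s, probed, regions)
        probed.add(s)
    return area


order = greedy_order(regions, costs)
print(order)  # -> ['B', 'A', 'C']
print(round(area_under_curve(order, regions, costs), 2))
```

With these made-up numbers, B is picked first because it has the highest coverage per unit cost; A then beats C because most of C's remaining answers cost twice as much to retrieve.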

DYNAMIC Ordering
1) Solve the MaxEnt problem.
2) Select the next source to probe.
3) Probe the selected source.
4) Compute additional statistics.
5) Iterate until the threshold is reached.

Statistics Enrichment
• The Statistics Enrichment component chooses additional 'unknown' statistics with the goal of improving source ordering.
• Incorporating additional statistics into Static and Dynamic ordering:
  - STATIC+ Ordering
  - DYNAMIC+ Ordering

                                       Adaptable: no    Adaptable: yes
  Requests additional statistics: yes  STATIC+          DYNAMIC+
  Requests additional statistics: no   STATIC           DYNAMIC

Experimental Evaluation
• Data Set
  - Snapshot of Computer Science book listings from AbeBooks.com
  - 1,028 bookstores (sources)
  - 1,256 unique books / 25,347 book records in total
  - Cost: fixed 356 ms source-connection cost and 0.3 ms per-tuple cost (based on empirical tests)
• Ordering Strategies
  - STATIC / STATIC+
  - DYNAMIC / DYNAMIC+
  - Random: Randomly choose an order of the sources
  - Coverage: Order the sources in decreasing order of their coverage
  - Baseline: Naïve usage of given coverage and overlap statistics
  - FullKnowledge: Greedy algorithm with an accurate and complete set of coverage and overlap statistics

Evaluation of Algorithms
• DYNAMIC yields a larger area-under-the-curve, and probes fewer sources to get 90% coverage, than STATIC.
• DYNAMIC+/STATIC+ perform better than their DYNAMIC/STATIC counterparts.

Conclusions
• The proposed Overlap Estimation method generates good overlap estimates for the purpose of source ordering.
• An adaptive ordering strategy (DYNAMIC ordering) generates a better source ordering than a static ordering strategy.
• Incorporating new statistics (whether accurate, approximate, or stale) can improve source ordering (DYNAMIC+).
• As long as the statistic selection procedure is fast, incorporating new statistics on-the-fly can improve source ordering.

Thank You
Questions?
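
Supplementary sketch for the Basic Overlap Estimation slide: one way to find a MaxEnt distribution over the 2^n region variables, given coverage and a few overlap statistics, is iterative proportional fitting. This is a swapped-in textbook technique chosen for brevity; the slides do not say which solver the authors use. Only P(A ∩ B) = 0.30 comes from the slides. The three-source setup, the other statistic values, and all function names are hypothetical, and the statistics are assumed to be mutually consistent.

```python
from itertools import product

sources = ["A", "B", "C"]

# Known statistics: P(intersection of the listed sources).
# P(A ∩ B) = 0.30 is the slide's example; the rest are made-up values.
stats = {
    frozenset("A"): 0.50,
    frozenset("B"): 0.40,
    frozenset("C"): 0.30,
    frozenset("AB"): 0.30,
}

# One variable per region: the exact subset of sources containing an answer.
# The empty set plays the role of A'B'C' (answers held by no source).
variables = [frozenset(s for s, bit in zip(sources, bits) if bit)
             for bits in product([0, 1], repeat=len(sources))]


def prob(subset, p):
    """P(intersection of `subset`): total mass of regions that include it."""
    return sum(mass for region, mass in p.items() if subset <= region)


def maxent(variables, stats, sweeps=200):
    """Iterative proportional fitting starting from the uniform distribution."""
    p = {v: 1.0 / len(variables) for v in variables}
    for _ in range(sweeps):
        for subset, target in stats.items():
            current = prob(subset, p)
            for region in p:
                if subset <= region:
                    # Rescale regions inside the constraint to hit the target.
                    p[region] *= target / current
                else:
                    # Rescale the rest so the total mass stays 1.
                    p[region] *= (1.0 - target) / (1.0 - current)
    return p


p = maxent(variables, stats)
# Estimate an overlap that was not given as a statistic.
print(round(prob(frozenset("AC"), p), 4))
```

Because no given statistic links C to A or B, the MaxEnt estimate treats C as independent of them, so the estimated P(A ∩ C) approaches P(A) * P(C). Enumerating all 2^n regions only works for a handful of sources, which is the scalability problem the later slides address.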
