Online Ordering of Overlapping Data Sources

Mariam Salloum (YP.com)
Xin Luna Dong (Google)
Divesh Srivastava (AT&T Research)
Vassilis J. Tsotras (UC Riverside)

VLDB 2014 - Hangzhou, China

Motivation for Source Ordering

- Example: a user would like all listings in Pasadena, California.
- For online query answering, users want results as soon as possible.
- For some domains, there are hundreds to thousands of relevant data sources.*
- Not all sources can be queried in parallel, due to bandwidth limitations, etc.
- Hence, we must consider the order in which sources are queried.

* Dalvi et al., VLDB 2012.

Source Ordering

[Figure: Source Venn Diagram for five sources A, B, C, D, and E]

- 5 out of 120 possible orderings are shown.
- Orderings are compared by the area-under-the-curve measure.

Challenges

Source ordering needs to consider three factors:
- Coverage: number of answers provided by a source.
- Overlap: percentage of overlapping answers between sources.
- Cost: monetary or latency cost incurred when connecting to or retrieving answers from a source.

Challenges:
- Gathering all coverage and overlap statistics is infeasible.
    20 sources => about 1 million overlap statistics
    30 sources => about 1 billion overlap statistics
- Such statistics are typically stale.

OASIS: Online Query Answering System

We consider 3 problems:
- Overlap Estimation: given a partial set of overlap statistics, how to estimate the overlap statistics that are not known.
- Source Ordering: how to order sources to maximize the area-under-the-curve.
- Statistics Enrichment: how to select additional 'unknown' statistics to improve the accuracy of Overlap Estimation and, in turn, improve Source Ordering.

Basic Overlap Estimation Solution

Given coverage and a partial set of overlap statistics, formulate constraints. Each term below is one variable: the fraction of answers provided by exactly the unprimed sources (a prime marks a source that does not provide the answer).

Ex. P(A ∩ B) = 0.30
    ABC'D'E' + ABC'D'E + ABC'DE' + ABC'DE + ABCD'E' + ABCD'E + ABCDE' + ABCDE = 0.30

Ex. P(A ∩ B ∩ C ∩ D) = 0.03
    ABCDE' + ABCDE = 0.03

Find the MaxEnt solution under the given constraints. The MaxEnt solution:
- Provides the highest likelihood under the given constraints with no additional assumptions.
- Changes smoothly with the addition of, or a change in, statistics.

Overlap Estimation (Cont.)

Challenges:
- Formulating the problem exactly requires the definition of 2^n variables, where n is the number of data sources.
    Ex. 30 sources = about 1 billion variables.

Observation:
- The number of non-zero variables should not exceed the number of answers, which is usually much smaller than 2^n.

Scalable Overlap Estimation Solution

1) Define constraints using a subset of variables with high cardinality.

   Given statistics:
     Coverage: P(A), P(B), P(C), P(D), P(E)
     Overlap:  P(A ∩ B), P(A ∩ D), P(A ∩ B ∩ C ∩ D)

   V = {AB'C'D'E', A'BC'D'E', A'B'CD'E', A'B'C'DE', A'B'C'D'E, ABC'D'E', AB'C'DE', ABCDE', A'B'C'D'E'}

2) Solve the MaxEnt problem.

3) Include additional variables that are expected to have high cardinality, and remove variables whose value is close to zero.

4) Repeat the procedure until no new variables are added.

Source Ordering

- An optimal ordering of sources returns answers as fast as possible, as measured by the area-under-the-curve.
- Since finding an optimal ordering is NP-hard, we propose a greedy algorithm that orders sources by the highest ratio of residual coverage to cost.
- We propose two source ordering strategies:
    STATIC Ordering
    DYNAMIC Ordering

STATIC Ordering

1) Solve the MaxEnt problem.
2) Select the next source with the highest ratio of residual coverage to cost.
3) Probe the selected source.
4) Iterate until the threshold is reached.
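
The greedy selection rule can be illustrated with a minimal sketch. It is not the OASIS implementation: it assumes the answer set of every source is known exactly, whereas the slides estimate residual coverage from MaxEnt overlap estimates, and it runs until all sources are ordered rather than stopping at a threshold. The names `greedy_order` and `area_under_curve`, the step-curve definition of the area, and the toy data are all hypothetical.

```python
def greedy_order(sources, costs):
    """Order sources by residual coverage / cost (assumes exact answer sets).

    sources: dict mapping source name -> set of answers it provides
    costs:   dict mapping source name -> positive cost of probing it
    """
    remaining = dict(sources)
    covered = set()
    order = []
    while remaining:
        # Residual coverage = answers not already returned by earlier sources.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda s: len(remaining[s] - covered) / costs[s],
        )
        order.append(best)
        covered |= remaining.pop(best)
    return order


def area_under_curve(order, sources, costs):
    """Area under the step curve of distinct answers returned vs. elapsed cost.

    Assumption: all answers of a source arrive when its probe finishes.
    """
    total_time = sum(costs[s] for s in order)
    covered, elapsed, area = set(), 0.0, 0.0
    for s in order:
        elapsed += costs[s]
        new_answers = len(sources[s] - covered)
        covered |= sources[s]
        # Each new answer counts from its arrival time until the end.
        area += new_answers * (total_time - elapsed)
    return area


# Hypothetical toy data, not taken from the slides.
sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}, "D": {1, 2}}
costs = {"A": 2.0, "B": 1.0, "C": 1.0, "D": 1.0}

greedy = greedy_order(sources, costs)
by_coverage = sorted(sources, key=lambda s: len(sources[s]), reverse=True)
print(greedy, area_under_curve(greedy, sources, costs))
print(by_coverage, area_under_curve(by_coverage, sources, costs))
```

On this toy input the greedy order probes the cheap, mostly non-overlapping sources first and obtains a larger area than ordering by coverage alone, which is the effect the area-under-the-curve measure is meant to capture.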

DYNAMIC Ordering

1) Solve the MaxEnt problem.
2) Select the next source to probe.
3) Probe the selected source.
4) Compute additional statistics.
5) Iterate until the threshold is reached.

Statistics Enrichment

- The Statistics Enrichment component chooses additional 'unknown' statistics with the goal of improving source ordering.
- Incorporating additional statistics into STATIC and DYNAMIC ordering gives:
    STATIC+ Ordering
    DYNAMIC+ Ordering

                                         Adaptable? No    Adaptable? Yes
  Requests additional statistics? Yes    STATIC+          DYNAMIC+
  Requests additional statistics? No     STATIC           DYNAMIC

Experimental Evaluation

Data Set:
- Snapshot of Computer Science book listings from AbeBooks.com
- 1,028 bookstores (sources)
- 1,256 unique books / 25,347 book records in total
- Cost: fixed 356 ms source-connection cost and 0.3 ms per-tuple cost (based on empirical tests)

Ordering Strategies:
- STATIC / STATIC+
- DYNAMIC / DYNAMIC+
- Random: randomly choose an order of the sources.
- Coverage: order the sources in decreasing order of their coverage.
- Baseline: naive usage of the given coverage and overlap statistics.
- FullKnowledge: greedy algorithm with an accurate and complete set of coverage and overlap statistics.

Evaluation of Algorithms

- DYNAMIC yields a larger area-under-the-curve than STATIC, and probes fewer sources to reach 90% coverage.
- DYNAMIC+ and STATIC+ perform better than their DYNAMIC and STATIC counterparts.

Conclusions

- The proposed Overlap Estimation method generates good overlap estimates for the purpose of source ordering.
- An adaptive ordering strategy (DYNAMIC ordering) generates a better source ordering than a static ordering strategy.
- Incorporating new statistics (whether accurate, approximate, or stale) can improve source ordering (DYNAMIC+).
- As long as the statistic selection procedure is fast, incorporating new statistics on the fly can improve source ordering.

Thank You

Questions?