On Synopses for Distinct-Value Estimation Under Multiset Operations
Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis (IBM Almaden Research Center)
Rainer Gemulla (Technische Universität Dresden)
SIGMOD 07

Introduction
- Estimating the number of distinct values (DV) is crucial for:
  - Data integration and cleaning, e.g. schema discovery, duplicate detection
  - Query optimization
  - Network monitoring
  - Materialized view selection for data cubes
- Exact DV computation is impractical
  - Sort/scan/count or hash table
  - Problem: poor scalability
- Approximate DV "synopses"
  - 25-year-old literature
  - Hashing-based techniques

Motivation: A Synopsis Warehouse
- Full-scale warehouse of data partitions, paired with a warehouse of synopses S(1,1), S(1,2), ..., S(n,m); synopses can be combined into synopses of compound partitions (e.g. S(1-2,3-7), S(*,*))
- Goal: discover partition characteristics and relationships to other partitions
  - Keys, functional dependencies, similarity metrics (Jaccard)
- Similar to Bellman [DJMS02]
- Accuracy challenge: small synopsis sizes, many distinct values

Outline
- Background on the KMV synopsis
- An unbiased, low-variance DV estimator
  - Optimality
  - Asymptotic error analysis for synopsis sizing
- Compound partitions
  - Union, intersection, set difference
  - Multiset difference: AKMV synopses
  - Deletions
- Empirical evaluation

K-Min Value (KMV) Synopsis
[Figure: the values of a partition (a, b, ..., a, a) are hashed to points on [0,1], roughly 1/D apart for D distinct values; the k smallest hash values U(1), U(2), ..., U(k) form the synopsis]
- Hashing effectively drops the D distinct values uniformly onto [0,1]
- KMV synopsis: L = {U(1), U(2), ..., U(k)}
- Leads naturally to the basic estimator [BJK+02]:
  - E[U(k)] ≈ k/D, so D̂_k^BE = k / U(k)
  - All classic estimators approximate the basic estimator
- Expected construction cost: O(N + k log k log D); space: O(k log D)

Contributions: New Synopses & Estimators
- Better estimators for classic KMV synopses
  - Better accuracy: unbiased, low mean-square error
  - Exact error bounds (in paper)
  - Asymptotic error bounds for sizing the synopses
- Augmented KMV synopsis (AKMV)
  - Permits DV estimates for compound partitions
  - Can handle deletions and incremental updates
- [Figure: synopses S_A and S_B are combined directly into a synopsis S_(A op B) of the compound partition A op B]

Unbiased DV Estimator from KMV Synopsis
- Unbiased estimator [Cohen97]: D̂_k^UB = (k−1) / U(k), with E[D̂_k^UB] = D
- Exact error analysis based on the theory of order statistics
- Asymptotically optimal as k becomes large (MLE theory)
- Analysis with many DVs
  - Theorem: E[((D̂_k^UB − D)/D)^2] ≈ 1/(k−2)
  - Proof idea: show that the spacings U(i) − U(i−1) are approximately exponential for large D, then apply [Cohen97]
- Use the above formula to size synopses a priori

(Multiset) Union of Partitions
- [Figure: the k-min synopses L_A and L_B on [0,1] are merged into the k smallest values of L_A ∪ L_B, whose largest element is U(k)]
- Combine the KMV synopses: L = L_A ⊕ L_B, the k smallest values in L_A ∪ L_B
- Theorem: L is a KMV synopsis of the multiset union A ∪ B
- So the previous unbiased estimator applies: D̂_k^UB = (k−1) / U(k)
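The KMV machinery so far (one-pass construction, the basic and unbiased estimators, and the union combine) can be sketched in Python. This is a minimal illustration under assumptions: the truncated-SHA-1 hash, the function names, and the heap layout are illustrative choices, not the authors' implementation.

```python
import hashlib
import heapq

def kmv_hash(value):
    """Illustrative hash mapping a value to a pseudo-uniform point in (0, 1)."""
    x = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
    return (x + 1) / (2**64 + 1)

def kmv_synopsis(items, k):
    """One-pass construction: keep the k smallest distinct hash values.

    A max-heap of negated values holds the current minima, so most items
    cost O(1); only values that improve the synopsis pay the O(log k) update.
    """
    heap, in_heap = [], set()          # -heap[0] is the current k-th minimum
    for v in items:
        u = kmv_hash(v)
        if u in in_heap:
            continue                   # duplicate value: synopsis unchanged
        if len(heap) < k:
            heapq.heappush(heap, -u)
            in_heap.add(u)
        elif u < -heap[0]:             # improves the k-th minimum
            in_heap.discard(-heapq.heappushpop(heap, -u))
            in_heap.add(u)
    return sorted(-x for x in heap)    # U(1) <= U(2) <= ... <= U(k)

def basic_estimate(synopsis):
    """Basic estimator [BJK+02]: E[U(k)] ~ k/D suggests D_hat = k / U(k)."""
    return len(synopsis) / synopsis[-1]

def unbiased_estimate(synopsis):
    """Unbiased estimator: D_hat = (k - 1) / U(k)."""
    return (len(synopsis) - 1) / synopsis[-1]

def combine_union(syn_a, syn_b, k):
    """Synopsis of a multiset union: the k smallest values of L_A union L_B."""
    return sorted(set(syn_a) | set(syn_b))[:k]
```

For example, `unbiased_estimate(combine_union(kmv_synopsis(A, k), kmv_synopsis(B, k), k))` estimates the number of distinct values in the union of partitions A and B without revisiting the base data.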
(Multiset) Intersection of Partitions
- L = L_A ⊕ L_B, as for the union (contains k elements)
- K = number of values in L that also lie in D(A ∩ B)
  - Theorem: K can be computed from L_A and L_B alone
- Note: L corresponds to a uniform random sample of the DVs in A ∪ B
- ρ = K/k estimates the Jaccard distance D(A ∩ B) / D(A ∪ B)
- Unbiased estimator of the number of DVs in the intersection:
  - D̂_∩ = (K/k) · (k−1)/U(k), which estimates D(A ∩ B)
- See the paper for the variance of this estimator
- Extends to general compound partitions built from ordinary set operations

Multiset Differences: AKMV Synopsis
- Augment the KMV synopsis with multiplicity counters: L+ = (L, c)
  - Space: O(k log D + k log M), where M is the maximum multiplicity
- Proceed almost exactly as before, i.e. L+(E \ F) = (L_E ⊕ L_F, (c_E − c_F)+)
- Unbiased DV estimator: (K+/k) · (k−1)/U(k), where K+ is the number of positive counters
- Closure property: for G = E op F, L+_G = (L_E ⊕ L_F, h_op(c_E, c_F))
- Can also handle deletions

Accuracy Comparison
[Figure: bar chart of average ARE (y-axis, 0 to 0.1) for the Unbiased-KMV, SDLogLog, Sample-Counting, and Baseline estimators]

Compound Partitions
[Figure: histogram of ARE (x-axis, 0 to 0.15) versus frequency (y-axis, 0 to 500) for Unbiased-KMV and SDLogLog on intersections, unions, and Jaccard distance]

Conclusions
- DV estimation for a scalable, flexible synopsis warehouse
- Better estimators for classic KMV synopses
- DV estimation for compound partitions via AKMV synopses
  - Closure property
- Theoretical contributions
  - Order statistics for exact/asymptotic error analysis
  - Asymptotic efficiency via MLE theory
- A new spin on an old problem
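As a concrete companion to the intersection slide, here is a hedged Python sketch of the Jaccard and intersection-size estimators. The hash function and names are illustrative assumptions; only the formulas ρ = K/k and D̂_∩ = (K/k)·(k−1)/U(k) come from the slides.

```python
import hashlib

def _ih(v):
    """Illustrative hash to a pseudo-uniform point in (0, 1)."""
    x = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
    return (x + 1) / (2**64 + 1)

def kmv(values, k):
    """KMV synopsis: the k smallest distinct hash values."""
    return sorted({_ih(v) for v in values})[:k]

def intersection_estimates(LA, LB, k):
    """Jaccard and intersection-size estimates from two size-k KMV synopses.

    L is the combined synopsis of the union; K counts values of L that appear
    in both L_A and L_B, which by the theorem on the slide identifies exactly
    the hash values of D(A intersect B) that landed in L.
    """
    L = sorted(set(LA) | set(LB))[:k]
    both = set(LA) & set(LB)
    K = sum(1 for u in L if u in both)
    jaccard = K / k                                # estimates D(A∩B)/D(A∪B)
    dv_intersection = (K / k) * (k - 1) / L[-1]    # estimates D(A∩B)
    return jaccard, dv_intersection
```

Note that both estimates need only the two synopses, never the base partitions, which is what makes the synopsis-warehouse scenario workable.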
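The AKMV construction for multiset differences can likewise be sketched. This is a minimal illustration, assuming both input synopses have size k; the hash and the function names are hypothetical, while the counter rule (c_E − c_F)+ and the estimator (K+/k)·(k−1)/U(k) are the ones from the slides.

```python
import hashlib

def akmv_hash(v):
    """Illustrative hash to a pseudo-uniform point in (0, 1)."""
    x = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
    return (x + 1) / (2**64 + 1)

def akmv(values, k):
    """AKMV synopsis (L, c): the k smallest distinct hash values, each with
    a multiplicity counter."""
    counts = {}
    for v in values:
        u = akmv_hash(v)
        counts[u] = counts.get(u, 0) + 1
    L = sorted(counts)[:k]
    return L, {u: counts[u] for u in L}

def akmv_difference(syn_e, syn_f, k):
    """AKMV synopsis of the multiset difference E \\ F: combine the hash
    lists as for union, and clamp subtracted counters at zero, (c_E - c_F)+."""
    (LE, cE), (LF, cF) = syn_e, syn_f
    L = sorted(set(LE) | set(LF))[:k]
    return L, {u: max(cE.get(u, 0) - cF.get(u, 0), 0) for u in L}

def akmv_estimate(syn, k):
    """Unbiased DV estimate (K+/k) * (k-1) / U(k); K+ counts positive counters."""
    L, c = syn
    k_plus = sum(1 for u in L if c[u] > 0)
    return (k_plus / k) * (k - 1) / L[-1]
```

A deletion is simply a counter decrement on the synopsis, which is how AKMV supports incremental updates without rescanning the partition.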