On Synopses for Distinct-Value Estimation Under Multiset Operations
Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis (IBM Almaden Research Center)
Rainer Gemulla (Technische Universität Dresden)
SIGMOD 07

Introduction
- Estimating the number of distinct values (DV) is crucial for:
  - Data integration and cleaning, e.g. schema discovery, duplicate detection
  - Query optimization
  - Network monitoring
  - Materialized view selection for data cubes
- Exact DV computation is impractical
  - Sort/scan/count or hash table
  - Problem: poor scalability
- Approximate DV "synopses"
  - 25-year-old literature
  - Hashing-based techniques

Motivation: A Synopsis Warehouse
- Full-scale warehouse of data partitions, paired with a warehouse of synopses S(1,1), S(1,2), ..., S(n,m); synopses can be combined into synopses of compound partitions (e.g. S(1-2,3-7), S(*,*))
- Goal: discover partition characteristics and relationships to other partitions
  - Keys, functional dependencies, similarity metrics (Jaccard)
- Similar to Bellman [DJMS02]
- Accuracy challenge: small synopsis sizes, many distinct values

Outline
- Background on the KMV synopsis
- An unbiased, low-variance DV estimator
  - Optimality
  - Asymptotic error analysis for synopsis sizing
- Compound partitions
  - Union, intersection, set difference
  - Multiset difference: AKMV synopses
  - Deletions
- Empirical evaluation

K-Min Value (KMV) Synopsis
[Figure: the values of a partition (a, b, ..., a, a) are hashed to points on [0,1], roughly 1/D apart for D distinct values; the k smallest hash values U(1), U(2), ..., U(k) form the synopsis]
- Hashing effectively drops the D distinct values uniformly onto [0,1]
- KMV synopsis: L = {U(1), U(2), ..., U(k)}
- Leads naturally to the basic estimator [BJK+02]:
  - E[U(k)] ≈ k/D, so D̂_k^BE = k / U(k)
  - All classic estimators approximate the basic estimator
- Expected construction cost: O(N + k log k log D); space: O(k log D)

Contributions: New Synopses & Estimators
- Better estimators for classic KMV synopses
  - Better accuracy: unbiased, low mean-square error
  - Exact error bounds (in paper)
  - Asymptotic error bounds for sizing the synopses
- Augmented KMV synopsis (AKMV)
  - Permits DV estimates for compound partitions
  - Can handle deletions and incremental updates
- [Figure: synopses S_A and S_B are combined directly into a synopsis S_(A op B) of the compound partition A op B]

Unbiased DV Estimator from KMV Synopsis
- Unbiased estimator [Cohen97]: D̂_k^UB = (k−1) / U(k), with E[D̂_k^UB] = D
- Exact error analysis based on the theory of order statistics
- Asymptotically optimal as k becomes large (MLE theory)
- Analysis with many DVs
  - Theorem: E[((D̂_k^UB − D)/D)^2] ≈ 1/(k−2)
  - Proof idea: show that the spacings U(i) − U(i−1) are approximately exponential for large D, then apply [Cohen97]
- Use the above formula to size synopses a priori

(Multiset) Union of Partitions
- [Figure: the k-min synopses L_A and L_B on [0,1] are merged into the k smallest values of L_A ∪ L_B, whose largest element is U(k)]
- Combine the KMV synopses: L = L_A ⊕ L_B, the k smallest values in L_A ∪ L_B
- Theorem: L is a KMV synopsis of the multiset union A ∪ B
- So the previous unbiased estimator applies: D̂_k^UB = (k−1) / U(k)
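The KMV machinery so far (one-pass construction, the basic and unbiased estimators, and the union combine) can be sketched in Python. This is a minimal illustration under assumptions: the truncated-SHA-1 hash, the function names, and the heap layout are illustrative choices, not the authors' implementation.

```python
import hashlib
import heapq

def kmv_hash(value):
    """Illustrative hash mapping a value to a pseudo-uniform point in (0, 1)."""
    x = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
    return (x + 1) / (2**64 + 1)

def kmv_synopsis(items, k):
    """One-pass construction: keep the k smallest distinct hash values.

    A max-heap of negated values holds the current minima, so most items
    cost O(1); only values that improve the synopsis pay the O(log k) update.
    """
    heap, in_heap = [], set()          # -heap[0] is the current k-th minimum
    for v in items:
        u = kmv_hash(v)
        if u in in_heap:
            continue                   # duplicate value: synopsis unchanged
        if len(heap) < k:
            heapq.heappush(heap, -u)
            in_heap.add(u)
        elif u < -heap[0]:             # improves the k-th minimum
            in_heap.discard(-heapq.heappushpop(heap, -u))
            in_heap.add(u)
    return sorted(-x for x in heap)    # U(1) <= U(2) <= ... <= U(k)

def basic_estimate(synopsis):
    """Basic estimator [BJK+02]: E[U(k)] ~ k/D suggests D_hat = k / U(k)."""
    return len(synopsis) / synopsis[-1]

def unbiased_estimate(synopsis):
    """Unbiased estimator: D_hat = (k - 1) / U(k)."""
    return (len(synopsis) - 1) / synopsis[-1]

def combine_union(syn_a, syn_b, k):
    """Synopsis of a multiset union: the k smallest values of L_A union L_B."""
    return sorted(set(syn_a) | set(syn_b))[:k]
```

For example, `unbiased_estimate(combine_union(kmv_synopsis(A, k), kmv_synopsis(B, k), k))` estimates the number of distinct values in the union of partitions A and B without revisiting the base data.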
(Multiset) Intersection of Partitions
- L = L_A ⊕ L_B, as for the union (contains k elements)
- K = number of values in L that also lie in D(A ∩ B)
  - Theorem: K can be computed from L_A and L_B alone
- Note: L corresponds to a uniform random sample of the DVs in A ∪ B
- ρ = K/k estimates the Jaccard distance D(A ∩ B) / D(A ∪ B)
- Unbiased estimator of the number of DVs in the intersection:
  - D̂_∩ = (K/k) · (k−1)/U(k), which estimates D(A ∩ B)
- See the paper for the variance of this estimator
- Extends to general compound partitions built from ordinary set operations

Multiset Differences: AKMV Synopsis
- Augment the KMV synopsis with multiplicity counters: L+ = (L, c)
  - Space: O(k log D + k log M), where M is the maximum multiplicity
- Proceed almost exactly as before, i.e. L+(E \ F) = (L_E ⊕ L_F, (c_E − c_F)+)
- Unbiased DV estimator: (K+/k) · (k−1)/U(k), where K+ is the number of positive counters
- Closure property: for G = E op F, L+_G = (L_E ⊕ L_F, h_op(c_E, c_F))
- Can also handle deletions

Accuracy Comparison
[Figure: bar chart of average ARE (y-axis, 0 to 0.1) for the Unbiased-KMV, SDLogLog, Sample-Counting, and Baseline estimators]

Compound Partitions
[Figure: histogram of ARE (x-axis, 0 to 0.15) versus frequency (y-axis, 0 to 500) for Unbiased-KMV and SDLogLog on intersections, unions, and Jaccard distance]

Conclusions
- DV estimation for a scalable, flexible synopsis warehouse
- Better estimators for classic KMV synopses
- DV estimation for compound partitions via AKMV synopses
  - Closure property
- Theoretical contributions
  - Order statistics for exact/asymptotic error analysis
  - Asymptotic efficiency via MLE theory
- A new spin on an old problem
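As a concrete companion to the intersection slide, here is a hedged Python sketch of the Jaccard and intersection-size estimators. The hash function and names are illustrative assumptions; only the formulas ρ = K/k and D̂_∩ = (K/k)·(k−1)/U(k) come from the slides.

```python
import hashlib

def _ih(v):
    """Illustrative hash to a pseudo-uniform point in (0, 1)."""
    x = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
    return (x + 1) / (2**64 + 1)

def kmv(values, k):
    """KMV synopsis: the k smallest distinct hash values."""
    return sorted({_ih(v) for v in values})[:k]

def intersection_estimates(LA, LB, k):
    """Jaccard and intersection-size estimates from two size-k KMV synopses.

    L is the combined synopsis of the union; K counts values of L that appear
    in both L_A and L_B, which by the theorem on the slide identifies exactly
    the hash values of D(A intersect B) that landed in L.
    """
    L = sorted(set(LA) | set(LB))[:k]
    both = set(LA) & set(LB)
    K = sum(1 for u in L if u in both)
    jaccard = K / k                                # estimates D(A∩B)/D(A∪B)
    dv_intersection = (K / k) * (k - 1) / L[-1]    # estimates D(A∩B)
    return jaccard, dv_intersection
```

Note that both estimates need only the two synopses, never the base partitions, which is what makes the synopsis-warehouse scenario workable.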
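The AKMV construction for multiset differences can likewise be sketched. This is a minimal illustration, assuming both input synopses have size k; the hash and the function names are hypothetical, while the counter rule (c_E − c_F)+ and the estimator (K+/k)·(k−1)/U(k) are the ones from the slides.

```python
import hashlib

def akmv_hash(v):
    """Illustrative hash to a pseudo-uniform point in (0, 1)."""
    x = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
    return (x + 1) / (2**64 + 1)

def akmv(values, k):
    """AKMV synopsis (L, c): the k smallest distinct hash values, each with
    a multiplicity counter."""
    counts = {}
    for v in values:
        u = akmv_hash(v)
        counts[u] = counts.get(u, 0) + 1
    L = sorted(counts)[:k]
    return L, {u: counts[u] for u in L}

def akmv_difference(syn_e, syn_f, k):
    """AKMV synopsis of the multiset difference E \\ F: combine the hash
    lists as for union, and clamp subtracted counters at zero, (c_E - c_F)+."""
    (LE, cE), (LF, cF) = syn_e, syn_f
    L = sorted(set(LE) | set(LF))[:k]
    return L, {u: max(cE.get(u, 0) - cF.get(u, 0), 0) for u in L}

def akmv_estimate(syn, k):
    """Unbiased DV estimate (K+/k) * (k-1) / U(k); K+ counts positive counters."""
    L, c = syn
    k_plus = sum(1 for u in L if c[u] > 0)
    return (k_plus / k) * (k - 1) / L[-1]
```

A deletion is simply a counter decrement on the synopsis, which is how AKMV supports incremental updates without rescanning the partition.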