On Synopses for Distinct-Value Estimation Under Multiset Operations

Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis (IBM Almaden Research Center)
Rainer Gemulla (Technische Universität Dresden)
Introduction

- Estimating the number of distinct values (DVs) is crucial for:
  - Data integration & cleaning, e.g., schema discovery, duplicate detection
  - Query optimization
  - Network monitoring
  - Materialized view selection for datacubes
- Exact DV computation is impractical
  - Sort/scan/count or hash table
  - Problem: bad scalability
- Approximate DV "synopses"
  - 25-year-old literature
  - Hashing-based techniques
IBM Almaden Research Center
Technische Universität Dresden
SIGMOD 07
2
Motivation: A Synopsis Warehouse
[Figure: a full-scale warehouse of data partitions; each partition carries a synopsis (S1,1, S1,2, ..., Sn,m) kept in a warehouse of synopses; synopses can be combined, e.g., S1-2,3-7, S*,*, etc.]

- Goal: discover partition characteristics & relationships to other partitions
  - Keys, functional dependencies, similarity metrics (Jaccard)
- Similar to Bellman [DJMS02]
- Accuracy challenge: small synopsis sizes, many distinct values
Outline

- Background on the KMV synopsis
- An unbiased low-variance DV estimator
  - Optimality
  - Asymptotic error analysis for synopsis sizing
- Compound Partitions
  - Union, intersection, set difference
  - Multiset difference: AKMV synopses
  - Deletions
- Empirical Evaluation
K-Min Value (KMV) Synopsis
[Figure: each value of the partition is hashed onto [0, 1]; with D distinct values, the hashed points are spread roughly 1/D apart, and the synopsis keeps the k smallest hash values U(1), U(2), ..., U(k).]

- Hashing = dropping the DVs uniformly on [0, 1]
- KMV synopsis: L = {U(1), U(2), ..., U(k)}
- Leads naturally to the basic estimator [BJK+02]
  - Basic estimator: E[U(k)] ≈ k/D, so D̂_k^BE = k/U(k)
  - All classic estimators approximate the basic estimator
- Expected construction cost: O(N + k log log D)
- Space: O(k log D)
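The KMV construction above can be sketched in Python. This is a minimal illustration, assuming an arbitrary 64-bit hash (here SHA-1 truncated, purely for demonstration); the function and parameter names are illustrative, not from the paper:

```python
import hashlib
import heapq

def kmv_synopsis(values, k):
    """Keep the k smallest hash values seen so far (a KMV synopsis).

    Hashing spreads the distinct values roughly uniformly over [0, 1);
    duplicates hash to the same point, so only distinct values matter.
    A max-heap of negated values evicts the currently largest retained hash.
    """
    heap = []        # negated hash values: -heap[0] is the largest kept
    in_heap = set()  # hash values currently retained, for deduplication
    for v in values:
        # Illustrative hash: SHA-1 truncated to 64 bits, scaled into [0, 1).
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        u = h / 2**64
        if u in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -u)
            in_heap.add(u)
        elif u < -heap[0]:
            in_heap.discard(-heapq.heappushpop(heap, -u))
            in_heap.add(u)
    return sorted(-x for x in heap)  # U(1) <= U(2) <= ... <= U(k)

def basic_estimator(synopsis):
    """Basic estimator [BJK+02]: E[U(k)] ~ k/D, so D-hat = k / U(k)."""
    return len(synopsis) / synopsis[-1]
```

With, say, 3,000 rows over 1,000 distinct values and k = 64, the estimate should land near 1,000, with relative error on the order of 1/sqrt(k).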
Contributions: New Synopses & Estimators

- Better estimators for classic KMV synopses
  - Better accuracy: unbiased, low mean-square error
  - Exact error bounds (in paper)
  - Asymptotic error bounds for sizing the synopses
- Augmented KMV synopsis (AKMV)
  - Permits DV estimates for compound partitions
  - Can handle deletions and incremental updates

[Figure: synopses S_A and S_B of partitions A and B are combined into a synopsis S_(A op B) of the compound partition A op B.]
Unbiased DV Estimator from KMV Synopsis

- Unbiased estimator [Cohen97]: D̂_k^UB = (k - 1)/U(k)
- Exact error analysis based on theory of order statistics
- Asymptotically optimal as k becomes large (MLE theory)
- Analysis with many DVs
  - Theorem: E[(D̂_k^UB - D)^2] ≈ D^2/(k - 2)
  - Proof: show that U(i) - U(i-1) is approximately exponential for large D, then use [Cohen97]
  - Use the above formula to size synopses a priori
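The theorem directly yields an a-priori sizing rule: the relative RMSE is roughly 1/sqrt(k - 2), so a target relative error eps needs k ≈ 2 + 1/eps^2. A sketch (function names are illustrative, not from the paper):

```python
import math

def unbiased_dv_estimate(synopsis):
    """Unbiased estimator [Cohen97]: D-hat = (k - 1) / U(k)."""
    k = len(synopsis)
    return (k - 1) / synopsis[-1]

def size_synopsis(eps):
    """Choose k a priori from E[(D-hat - D)^2] ~ D^2 / (k - 2):
    relative RMSE ~ 1/sqrt(k - 2), hence k ~ 2 + 1/eps^2."""
    return math.ceil(2 + 1.0 / eps**2)
```

For a 10% target error this gives k = 102; notably, the required synopsis size does not depend on D itself, which is what makes a-priori sizing possible.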
Outline

- Background on the KMV synopsis
- An unbiased low-variance DV estimator
  - Optimality
  - Asymptotic error analysis for synopsis sizing
- Compound Partitions
  - Union, intersection, set difference
  - Multiset difference: AKMV synopses
  - Deletions
- Empirical Evaluation
(Multiset) Union of Partitions
[Figure: the k-min hash values of L_A and of L_B each lie on [0, 1]; merging them and keeping the k smallest values (largest retained value U(k)) yields the combined synopsis.]

- Combine KMV synopses: L = L_A ⊕ L_B (the k smallest values in L_A ∪ L_B)
- Theorem: L is a KMV synopsis of A ∪ B
- Can use the previous unbiased estimator: D̂_k^UB = (k - 1)/U(k)
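The combine step is pure synopsis arithmetic and never touches the base data. A sketch (names illustrative):

```python
def combine_kmv(la, lb, k=None):
    """L = L_A (+) L_B: the k smallest distinct hash values in L_A union L_B.
    By the theorem above, L is itself a KMV synopsis of the multiset union."""
    if k is None:
        k = min(len(la), len(lb))
    return sorted(set(la) | set(lb))[:k]

def union_dv_estimate(la, lb):
    """Apply the unbiased estimator (k - 1) / U(k) to the combined synopsis."""
    l = combine_kmv(la, lb)
    return (len(l) - 1) / l[-1]
```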
(Multiset) Intersection of Partitions

- L = L_A ⊕ L_B as with union (contains k elements)
  - Note: L corresponds to a uniform random sample of the DVs in A ∪ B
- K = # values in L that are also in D(A ∩ B)
  - Theorem: K can be computed from L_A and L_B alone
- K/k estimates the Jaccard distance: D(A ∩ B)/D(A ∪ B)
- D̂_∪ = (k - 1)/U(k) estimates D_∪ = D(A ∪ B)
- Unbiased estimator of #DVs in the intersection: D̂_∩ = (K/k) · (k - 1)/U(k)
- See paper for variance of estimator
- Can extend to general compound partitions from ordinary set operations
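The steps above can be sketched locally from the two synopses: a value of L that hashes a member of both partitions must appear in both L_A and L_B, so K needs no access to the base data. A sketch (names illustrative):

```python
def intersection_estimates(la, lb):
    """Jaccard and intersection-DV estimates from two KMV synopses alone."""
    sa, sb = set(la), set(lb)
    k = min(len(la), len(lb))
    l = sorted(sa | sb)[:k]                # L = L_A (+) L_B
    big_k = sum(1 for u in l if u in sa and u in sb)
    jaccard = big_k / k                    # estimates D(A n B) / D(A u B)
    d_union = (k - 1) / l[-1]              # estimates D(A u B)
    return jaccard, d_union, jaccard * d_union  # last: estimates D(A n B)
```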
Multiset Differences: AKMV Synopsis

- Augment the KMV synopsis with multiplicity counters: L+ = (L, c)
  - Space: O(k log D + k log M), where M = max multiplicity
- Proceed almost exactly as before, i.e., L+(E \ F) = (L_E ⊕ L_F, (c_E - c_F)+)
- Unbiased DV estimator: D̂ = (K_G/k) · (k - 1)/U(k), where K_G is the # of positive counters
- Closure property: for G = E op F, L+_G = (L_E ⊕ L_F, h_op(c_E, c_F))

[Figure: AKMV synopses L+_E = (L_E, c_E) and L+_F = (L_F, c_F) are combined into L+_G for the compound partition G = E op F.]

- Can also handle deletions
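The counter arithmetic for multiset difference and the resulting estimator can be sketched as follows; representing (L, c) as a sorted list plus a dict is an illustrative choice, not the paper's data layout:

```python
def akmv_difference(e_plus, f_plus, k):
    r"""L+(E \ F) = (L_E (+) L_F, (c_E - c_F)+): combine the hash lists,
    subtract counters, and clamp at zero. A hash value absent from one
    synopsis but below its cutoff cannot occur in that partition, so its
    counter defaults to 0."""
    le, ce = e_plus  # (sorted hash values, {hash value: multiplicity})
    lf, cf = f_plus
    l = sorted(set(le) | set(lf))[:k]
    c = [max(ce.get(u, 0) - cf.get(u, 0), 0) for u in l]
    return l, c

def akmv_dv_estimate(l_plus):
    """Unbiased DV estimate (K_G / k) * (k - 1) / U(k),
    with K_G = number of positive counters."""
    l, c = l_plus
    k = len(l)
    k_g = sum(1 for x in c if x > 0)
    return (k_g / k) * (k - 1) / l[-1]
```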
Accuracy Comparison
[Figure: bar chart of average ARE (scale 0 to 0.1) for Unbiased-KMV, SDLogLog, Sample-Counting, and Baseline.]
Compound Partitions
[Figure: histogram of ARE (0 to 0.15) over compound partitions, comparing Unbiased-KMV and SDLogLog on intersections, unions, and Jaccard estimates.]
Conclusions

- DV estimation for a scalable, flexible synopsis warehouse
- Better estimators for classic KMV synopses
- DV estimation for compound partitions via AKMV synopses
  - Closure property
- Theoretical contributions
  - Order statistics for exact/asymptotic error analysis
  - Asymptotic efficiency via MLE theory
- A new spin on an old problem