End-biased Samples for Join Cardinality Estimation Cristian Estan, Jeffrey F. Naughton

Computer Sciences Department
University of Wisconsin-Madison
Problem description
• Estimating join size
  - Not restricted to key-foreign key joins
  - Based on summaries of the two tables, computed separately
• Two main contributions of this paper
  - A new type of summary based on a special type of sampling
  - An extensive experimental comparison of many types of summaries
We can get more accurate estimates!
• [AGMS99] showed that on certain data sets
  - All summaries give inaccurate estimates
  - Estimates based on random sampling are within a constant factor of the bound
• We show that
  - On other data sets, our estimates are significantly more accurate than those obtained with random sampling
  - No known summary gives estimates more accurate than all others for every data set
Overview
• End-biased samples
  - Theoretical comparison against other sampling-based methods
• Experimental comparison against sketches and histograms
Building the end-biased samples
• If the frequency of every value is known for both tables → exact join size
• We keep a sample of this data
  - Sampling probability proportional to frequency [DLT01]
  - Sampling decisions correlated by using a shared hash function [F90], [DG00], [EKMV04]
[Figure: frequencies of the values of the join attribute in table A and in table B, each shown as (value, frequency) with its sampling probability p, for sampling threshold T=4: (c,10) p=1; (d,1) p=0.25; (g,1) p=0.25; (g,1) p=0.25; (m,2) p=0.5; (m,1) p=0.25; (s,5) p=1; (r,7) p=1; (t,1) p=0.25; (z,1) p=0.25]
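The construction on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the names `shared_hash`, `exact_join_size` and `end_biased_sample`, the SHA-256-based hash, the selection rule "keep v when hash(v) < frequency / threshold", and the toy tables are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def shared_hash(value, seed="ebs"):
    """Map a value to a pseudo-random float in [0, 1); identical for both tables."""
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def exact_join_size(freq_a, freq_b):
    """Exact equi-join size when every frequency is known: sum of a_v * b_v."""
    return sum(a * freq_b[v] for v, a in freq_a.items() if v in freq_b)


def end_biased_sample(freq, threshold):
    """Keep (value, frequency) with probability min(1, frequency / threshold).

    Assumed rule: a value is kept when its shared hash falls below
    frequency / threshold, so values at or above the threshold are always kept.
    """
    return {v: f for v, f in freq.items()
            if f >= threshold or shared_hash(v) < f / threshold}


# Toy join-attribute columns, made up for illustration.
freq_a = Counter(["c"] * 10 + ["s"] * 5 + ["m"] * 2 + ["d", "g"])
freq_b = Counter(["r"] * 7 + ["c"] * 3 + ["m", "g", "t", "z"])

sample_a = end_biased_sample(freq_a, threshold=4)
sample_b = end_biased_sample(freq_b, threshold=4)
print(exact_join_size(freq_a, freq_b), sample_a, sample_b)
```

Because both tables use the same hash, a rare value that survives in one sample is likely to survive in the other as well, which is what correlates the two sampling decisions.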
Estimating join size
• Let a_v be the frequency of value v in table A, b_v its frequency in table B, and p_v the probability that v is selected into both samples
• Sum the contributions a_v b_v / p_v of the values present in both samples to estimate the join size
  - If a_v ≥ T_a and b_v ≥ T_b, p_v = 1
  - If a_v ≥ T_a and b_v < T_b, p_v = b_v / T_b
  - If a_v < T_a and b_v ≥ T_b, p_v = a_v / T_a
  - If a_v < T_a and b_v < T_b, p_v = min(a_v / T_a, b_v / T_b)
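The estimator above can be written compactly, since the four cases collapse to p_v = min(1, a_v/T_a, b_v/T_b). The sketch below assumes each sample is a dictionary from value to its exact frequency in that table; the function name `estimate_join_size` and the toy samples are hypothetical.

```python
def estimate_join_size(sample_a, sample_b, t_a, t_b):
    """Estimate the join size from two end-biased samples.

    sample_a / sample_b map value -> exact frequency in that table;
    t_a / t_b are the sampling thresholds used for each table.
    """
    estimate = 0.0
    for v in sample_a.keys() & sample_b.keys():
        a_v, b_v = sample_a[v], sample_b[v]
        # Probability that v was selected into both samples (all four cases).
        p_v = min(1.0, a_v / t_a, b_v / t_b)
        estimate += a_v * b_v / p_v
    return estimate


# Toy samples, made up for illustration.
sample_a = {"c": 10, "m": 2, "s": 5}
sample_b = {"c": 5, "m": 1, "r": 7}
print(estimate_join_size(sample_a, sample_b, t_a=4, t_b=4))  # 58.0
```

Here c is frequent in both samples (p_v = 1) and contributes 10 · 5 = 50, while m has p_v = min(2/4, 1/4) = 0.25 and contributes 2 · 1 / 0.25 = 8.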
Why correlate the samples?
• Example: two tables, each with 1,000 values appearing once; 50 values are common to both tables
  - We sample with probability 1/10
  - Sample size ~ 100 for each table
• Comparison

                          Correlated      Uncorrelated
  p_v                     0.1             0.01
  Common values sampled   ~ 4, 5 or 6     ~ 0 or 1
  Join size estimate      40, 50 or 60    0 or 100
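A small simulation can make the comparison concrete. The sketch below is an illustration under stated assumptions, not the experiment from the paper: it models only the 50 common values (the other values never contribute to the estimate), draws one uniform number per value for the correlated case and two independent ones for the uncorrelated case, and uses a hypothetical function name `simulate`.

```python
import random
import statistics


def simulate(correlated, trials=2000, common=50, p=0.1, seed=0):
    """Return (mean, standard deviation) of the join size estimate over many trials."""
    rng = random.Random(seed)
    # A common value lands in both samples with probability p (correlated)
    # or p * p (independent sampling decisions).
    p_both = p if correlated else p * p
    estimates = []
    for _ in range(trials):
        hits = 0
        for _ in range(common):
            u_a = rng.random()
            u_b = u_a if correlated else rng.random()
            if u_a < p and u_b < p:
                hits += 1
        # Each sampled common value contributes 1 * 1 / p_both.
        estimates.append(hits / p_both)
    return statistics.mean(estimates), statistics.pstdev(estimates)


for label, flag in (("correlated", True), ("uncorrelated", False)):
    mean, spread = simulate(correlated=flag)
    print(f"{label}: mean estimate {mean:.1f}, standard deviation {spread:.1f}")
```

Both variants are unbiased (the true join size is 50), but the uncorrelated estimate moves in steps of 100 and therefore has a much larger spread.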
Comparison of sampling methods
Accuracy of estimates of join size

  Type of values dominating the join   Random sampling   Counting samples   End-biased samples
  Frequent in both relations           Good              Very good          Perfect
  Frequent in one relation             Bad               Bad                Bad
  Infrequent in both relations         Bad               Bad                Good
Overview
• End-biased samples
  - Theoretical comparison against other sampling-based methods
• Experimental comparison against sketches and histograms
Experimental methodology
• Randomly generated tables with ~ 1,000,000 tuples
• Explored multiple configurations
• Varied the “peakedness” of the distribution
• Varied memory budget from 204 to 659,456 words
• Varied the amount of correlation between tables
  - Uncorrelated: tables generated independently
  - Positively correlated: frequent values are likely to be the same in both tables
  - Negatively correlated: frequent values are unlikely to be the same in the two tables
• 1,000 runs for each configuration
Summaries compared
• End-biased samples
• End-biased equi-depth histograms [PC84]
• Sketches [AGMS99], [DGGR02], [GGR04]
• Concise samples [GM98]
• Counting samples [GM98]
Comparison with histograms
Comparison with sketches
Memory comparison
Qualitative comparison
Columns: Advantage | Sketches | End-biased samples
• Streaming updates: X
• Simple configuration: X
• Selection on join attribute: X
Conclusions
• End-biased samples and sketches are the best summaries for the join size estimation problem addressed in this paper
• End-biased samples are compelling if
  - Selections on the join attribute are required
  - Summaries must be very concise
  - The frequencies of join attributes in the two tables are strongly correlated
Questions?
Thank you!
Scripts and results for experiments available at
http://www.cs.wisc.edu/~estan/ebs.tar.gz
Estimating the join size
Related work – sampling methods
• [GM98] Concise samples, counting samples
• [DLT01] Smart sampling
• [F90], [EKMV04] Using a hash function to select the values used as a summary of the data
Related work – join size estimation
• Histograms
• Multidimensional histograms
• [GG02], [GK04] Wavelets
• [AGMS99], [DGGR02], [GGR04] Sketches
Variance of join size estimate
• See the paper for the variance analysis.