End-biased Samples for Join Cardinality Estimation
Cristian Estan, Jeffrey F. Naughton
Computer Sciences Department
University of Wisconsin-Madison

Problem description
- Estimating join size
  - Not restricted to key-foreign key joins
  - Based on summaries of the two tables computed separately
- Two main contributions of this paper
  - Proposing a new type of summaries based on a special type of sampling
  - Extensive experimental comparison of many types of summaries

We can get more accurate estimates!
- [AGMS99] showed that on certain data sets
  - All summaries give inaccurate estimates
  - Estimates based on random sampling are within constant factor of bound
- We show that
  - On other data sets, our estimates significantly more accurate than those with random sampling
  - No known summaries give estimates more accurate than all others for every data set

Overview
- End-biased samples
- Theoretical comparison against other sampling-based methods
- Experimental comparison against sketches and histograms

Building the end-biased samples
- If frequency of every value known for both tables → exact join size
- We keep a sample of this data
  - Sampling probability proportional to frequency [DLT01]
  - Sampling decisions correlated by using a shared hash function [F90],[DG00],[EKMV04]
Figure: frequency of values of join attribute in table A and frequency of values of join attribute in table B, sampling threshold T=4
Entries shown: (c,10) p=1; (d,1) p=0.25; (g,1) p=0.25; (g,1) p=0.25; (m,2) p=0.5; (m,1) p=0.25; (s,5) p=1; (r,7) p=1; (t,1) p=0.25; (z,1) p=0.25

Estimating join size
- Let a_v be the frequency of value v in table A, b_v in B and p_v the probability that v is selected into both samples
- Sum contribution of values in both samples (a_v b_v / p_v) to estimate join size
  - If a_v ≥ T_a and b_v ≥ T_b, p_v = 1
  - If a_v ≥ T_a and b_v < T_b, p_v = b_v/T_b
  - If a_v < T_a and b_v ≥ T_b, p_v = a_v/T_a
  - If a_v < T_a and b_v < T_b, p_v = min(a_v/T_a, b_v/T_b)

Why correlate the samples?
- Example: tables with 1000 values appearing once, 50 values common to both tables
- We sample with probability 1/10
  - Sample size ~ 100 for each table
Comparison               Correlated      Uncorrelated
p_v                      0.1             0.01
Common values sampled    ~ 4, 5 or 6     ~ 0 or 1
Join size estimate       40, 50 or 60    0 or 100

Comparison of sampling methods
Accuracy of estimates of join size
Type of values dominating the join   Random sampling   Counting samples   End-biased samples
Frequent in both relations           Good              Very good          Perfect
Frequent in one relation             Bad               Bad                Bad
Infrequent in both relations         Bad               Bad                Good

Overview
- End-biased samples
- Theoretical comparison against other sampling-based methods
- Experimental comparison against sketches and histograms

Experimental methodology
- Randomly generated tables with ~ 1,000,000 tuples
- Explored multiple configurations
  - Varied the “peakedness” of the distribution
  - Varied memory budget from 204 to 659,456 words
  - Varied the amount of correlation between tables
    - Uncorrelated – tables generated independently
    - Positively correlated – frequent values likely same in both tables
    - Negatively correlated – unlikely frequent values same in the two tables
- 1,000 runs for each configuration

Summaries compared
- End-biased samples
- End-biased equi-depth histograms [PC84]
- Sketches [AGMS99],[DGGR02],[GGR04]
- Concise samples [GM98]
- Counting samples [GM98]

Comparison with histograms

Comparison with sketches

Memory comparison

Qualitative comparison
Advantage                     Sketches   End-biased samples
Streaming updates             X
Simple configuration                     X
Selection on join attribute              X

Conclusions
- End-biased samples and sketches are the best summaries for the join size estimation problem addressed in this paper
- End-biased samples are compelling if
  - Selections on the join attribute are required
  - Summaries must be very concise
  - The frequencies of join attributes in the two tables are strongly correlated

Questions?
Thank you!
Scripts and results for experiments available at http://www.cs.wisc.edu/~estan/ebs.tar.gz

Estimating the join size

Related work – sampling methods
- [GM98] concise samples, counting samples
- [DLT01] smart sampling
- [F90],[EKMV04] using a hash function to select values used as summary of data

Related work – join size estimation
- Histograms
- Multidimensional histograms
- [GG02],[GK04] Wavelets
- [AGMS99],[DGGR02],[GGR04] Sketches

Variance of join size estimate
- No slide, point to the paper.
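
Sketch: sampling rule and estimator
The sampling rule from "Building the end-biased samples" and the estimator from "Estimating join size" can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not the code from the experiment scripts: the function names, the SHA-256 based shared hash, and the seed string are all hypothetical choices made for illustration. It assumes each table is given as a map from join-attribute value to frequency, keeps a value with probability min(1, frequency/T), and uses the same hash for both tables so that the sampling decisions are correlated.

```python
import hashlib


def shared_hash(value, seed="ebs"):
    """Map a value to a number in [0, 1). Using the same function (and seed)
    for both tables is what correlates the two samples. Hypothetical choice
    of hash; any hash shared by both tables would play the same role."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    # Keep 53 bits so the float result is strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def end_biased_sample(freqs, threshold, seed="ebs"):
    """Keep (value, frequency) with probability min(1, frequency / threshold)."""
    return {
        v: f
        for v, f in freqs.items()
        if f >= threshold or shared_hash(v, seed) < f / threshold
    }


def selection_probability(a_v, b_v, t_a, t_b):
    """p_v: probability that v lands in both samples.
    Covers the four cases on the slide in one expression."""
    return min(1.0, a_v / t_a, b_v / t_b)


def estimate_join_size(sample_a, sample_b, t_a, t_b):
    """Sum a_v * b_v / p_v over the values present in both samples."""
    total = 0.0
    for v in sample_a.keys() & sample_b.keys():
        a_v, b_v = sample_a[v], sample_b[v]
        total += a_v * b_v / selection_probability(a_v, b_v, t_a, t_b)
    return total


if __name__ == "__main__":
    # Made-up frequencies, for illustration only.
    table_a = {"c": 10, "g": 1, "m": 2, "s": 5}
    table_b = {"g": 1, "m": 1, "r": 7, "s": 3}
    t = 4
    sample_a = end_biased_sample(table_a, t)
    sample_b = end_biased_sample(table_b, t)
    print(estimate_join_size(sample_a, sample_b, t, t))
```

Because both samples test the same hash value against a_v/T_a and b_v/T_b, a value is in both samples exactly when the hash falls below the smaller of the two ratios, which is why p_v reduces to a single min expression. With thresholds of 1 every value is kept and the estimate equals the exact join size.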