Distance Exponent: A New Concept for
Selectivity Estimation in Metric Trees
(Extended Abstract)
Caetano Traina Jr. 1
Dept. of Computer Science
Sao Paulo University at S. Carlos
caetano@icmc.sc.usp.br

Agma J. M. Traina 1
Dept. of Computer Science
Sao Paulo University at S. Carlos
agma@icmc.sc.usp.br

Christos Faloutsos 2
Dept. of Computer Science
Carnegie Mellon University
christos@cs.cmu.edu

1 - Research supported by FAPESP (Brazil), under Grants Nº 98/05556-5
and 98/0559-7. On leave at Carnegie Mellon University.
2 - This material is based upon work supported by the NSF under Grants Nº
IRI-9625428, DMS-9873442, IIS-9817496, and IIS-9910606, and by the
Defense Advanced Research Projects Agency under Contract Nº
N66001-97-C-8517. Additional funding was provided by donations from
NEC and Intel. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not necessarily
reflect the views of the National Science Foundation, DARPA, or other
funding parties.

Selectivity estimation is an important issue in RDBMS query
optimization, both for answering multidimensional queries and
for supporting data mining and data compression.
This work focuses on estimating the selectivity of range
queries in metric data sets indexed by a metric structure.
Metric data sets include vector, or dimensional, data sets
as a special case. We developed a new method to estimate
(a) the number of qualifying objects and (b) the number of
disk accesses required to answer the queries.

Our major contribution is the discovery of an empirical
‘power law’ that the cumulative distribution function of the
distances seems to obey. Although the majority of studies
on the subject use the uniformity assumption, we found
that the great majority of real data sets have a non-uniform
distribution. From the discovered law we derive an analysis
of the distance distribution of metric data sets. This is the
first analysis of distance distributions for real metric data
sets. Thus, we can state that, given a set of N objects in a
metric data set with distance function d(x,y), the average
number of distances less than a radius r follows a power
law, i.e., the average number of neighbors within a given
distance r is proportional to r raised to a constant exponent,
as shown in Figure 1.

Whenever a data set provides a metric for evaluating
the distance between pairs of its objects, a graph of its
distance distribution (the distance plot) can always be drawn,
even if the data set does not have a spatial property.

We call the slope of this distance plot, drawn in log-log scale,
the “distance exponent”. It is the exponent of our power law,
and we show that it plays a relevant role in the analysis of
real metric data sets. Specifically, we show (a) how to
exploit the distance exponent to derive formulas for
selectivity estimation of range queries and (b) how to
compute it quickly from a metric index tree.

Figure 1. Distance plot for a face vector data set.

For spatial data sets, the distance exponent approaches the
intrinsic dimensionality of the data set, so for any data set,
it can be seen as a measure of how the information in the
data set is correlated.

The formulas use the distance exponent, and they
hold for metric and vector data sets alike. Our formulas are
useful for query optimization, as well as for identifying
suboptimally constructed index trees.
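
As a rough illustration of how a fitted power law can drive a selectivity estimate, the sketch below builds the distance plot from all pairwise distances, fits a straight line to it in log-log scale, and uses the fitted exponent and constant to predict the average number of objects returned by a range query. This is not the authors' set of formulas; the function names (`pairwise_distances`, `avg_neighbors`, `fit_power_law`), the synthetic 2-dimensional uniform points, and the chosen radii are all assumptions made for this example.

```python
import math
import random


def pairwise_distances(objects, dist):
    """Return all N*(N-1)/2 distances between distinct objects."""
    return [dist(objects[i], objects[j])
            for i in range(len(objects))
            for j in range(i + 1, len(objects))]


def avg_neighbors(distances, n_objects, r):
    """Average number of other objects within distance r of an object."""
    pairs = sum(1 for d in distances if d <= r)
    return 2.0 * pairs / n_objects  # each pair counts for both endpoints


def fit_power_law(distances, n_objects, radii):
    """Least-squares line through (log r, log avg_neighbors(r)).

    Returns (exponent, constant) such that
    avg_neighbors(r) is approximately constant * r ** exponent.
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(avg_neighbors(distances, n_objects, r)) for r in radii]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, math.exp(my - slope * mx)


# Hypothetical data: 600 points drawn uniformly from the unit square.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(600)]
distances = pairwise_distances(points, math.dist)

# Fit over radii that are small compared with the extent of the data.
radii = [0.02 * 1.3 ** k for k in range(7)]  # 0.02 up to about 0.097
exponent, constant = fit_power_law(distances, len(points), radii)

# Estimated versus measured answer size for a range query of radius 0.05.
r_query = 0.05
estimated = constant * r_query ** exponent
actual = avg_neighbors(distances, len(points), r_query)
print(f"exponent ~ {exponent:.2f}; "
      f"estimated {estimated:.1f} vs actual {actual:.1f} neighbors")
```

Because `dist` is passed in as a parameter, the same code would work with any metric, for example an edit distance over words; only the range of radii used for the fit would need to be chosen to suit the data.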

We performed several experiments on many real data
sets (road intersections of U.S. counties, vectors of
characteristics extracted from face matching systems, sets
of words, and even distance matrices provided by a
commercial face recognition package, where the distance
function is undisclosed) and synthetic data sets (Sierpinski
triangle, 2-dimensional uniform distributions and
2-dimensional lines). Excluding sub-optimal trees, our
selectivity estimation formulas are accurate, with relative
errors from 4% to 17%, and always within one standard
deviation of the analytical results. Moreover, we
also developed a quick algorithm to estimate the “distance
exponent”, which gives excellent accuracy and saves orders
of magnitude in computation time compared to the
traditional distance calculation. The details are in the
corresponding Technical Report CMU-CS-99-110, available
at Carnegie Mellon University.
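
One way to get a feel for the link between the distance exponent and intrinsic dimensionality is to try it on synthetic sets like those listed above. The following sketch uses a brute-force pairwise computation, not the quick tree-based algorithm of the paper, to fit the exponent for points on a line embedded in the plane and for points spread uniformly over a square; if the claim in the text holds, one would expect values close to 1 and 2 respectively. The helper `distance_exponent`, the sample sizes, and the radii are illustrative assumptions.

```python
import math
import random


def distance_exponent(points, radii):
    """Slope of log(number of pairs within r) versus log(r), by least squares."""
    n = len(points)
    dists = [math.dist(points[i], points[j])
             for i in range(n) for j in range(i + 1, n)]
    xs = [math.log(r) for r in radii]
    ys = [math.log(sum(1 for d in dists if d <= r)) for r in radii]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


rng = random.Random(1)
n = 500

# A 1-dimensional object (a diagonal line) embedded in 2-dimensional space.
line = [(t, t) for t in (rng.random() for _ in range(n))]

# Points that fill a 2-dimensional region uniformly.
square = [(rng.random(), rng.random()) for _ in range(n)]

# Radii kept small relative to the data extent to limit border effects.
radii = [0.03 * 1.25 ** k for k in range(6)]  # 0.03 up to about 0.09

exp_line = distance_exponent(line, radii)
exp_square = distance_exponent(square, radii)
print(f"line: {exp_line:.2f}, square: {exp_square:.2f}")
```

The brute-force step computes every pairwise distance, so its cost grows quadratically with the number of objects; this is the kind of cost the quick algorithm mentioned in the text is meant to avoid.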