Distance Exponent: A New Concept for Selectivity Estimation in Metric Trees (Extended Abstract)

Caetano Traina Jr. (1), Dept. of Computer Science, Sao Paulo University at S. Carlos, caetano@icmc.sc.usp.br
Agma J. M. Traina (1), Dept. of Computer Science, Sao Paulo University at S. Carlos, agma@icmc.sc.usp.br
Christos Faloutsos (2), Dept. of Computer Science, Carnegie Mellon University, christos@cs.cmu.edu

1 - Research supported by FAPESP (Brazil), under Grants Nº 98/05556-5 and 98/0559-7. On leave at Carnegie Mellon University.
2 - This material is based upon work supported by the NSF under Grants Nº IRI-9625428, DMS-9873442, IIS-9817496, and IIS-9910606, and by the Defense Advanced Research Projects Agency under Contract Nº N66001-97-C-8517. Additional funding was provided by donations from NEC and Intel. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, DARPA, or other funding parties.

Selectivity estimation is an important issue in RDBMS query optimization, both for answering multidimensional queries and for supporting data mining and data compression. This work focuses on estimating the selectivity of range queries over metric data sets indexed by a metric structure. Metric data sets include vector (dimensional) data sets as a special case. We developed a new method to estimate (a) the number of qualifying objects and (b) the number of disk accesses required to answer the queries.

Our major contribution is the discovery of an empirical "power law" that the cumulative distribution function of the distances seems to obey. Although the majority of studies on the subject use the uniformity assumption, we found that the great majority of real data sets have a non-uniform distribution. From the discovered law we derive an analysis of the distance distribution of metric data sets. This is the first analysis of distance distributions for real metric data sets.

Thus, we can state that, given a set of N objects in a metric data set with distance function d(x,y), the average number of distances less than a radius r follows a power law; that is, the average number of neighbors within a given distance r is proportional to r raised to a constant exponent, as shown in Figure 1. Whenever a data set has a metric for evaluating the distance between pairs of its objects, a graph depicting this relation (the distance plot) can always be drawn, even if the data set has no spatial property. We call the slope of this distance plot, drawn in log-log scale, the "distance exponent".

Figure 1. Distance plot for a face vector data set.

The distance exponent is the exponent of our power law, and we show that it plays a relevant role in the analysis of real metric data sets. Specifically, we show (a) how to exploit the distance exponent to derive formulas for selectivity estimation of range queries and (b) how to compute it quickly from a metric index tree. For spatial data sets, the distance exponent approaches the intrinsic dimensionality of the data set, so for any data set it can be seen as a measure of how the information in the data set is correlated. The formulas use the distance exponent, and they hold for metric and vector data sets alike. Our formulas are useful for query optimization, as well as for identifying suboptimally constructed index trees.

We performed several experiments on many real data sets (road intersections of U.S. counties, vectors of characteristics extracted from face matching systems, sets of words, and even distance matrices provided by a commercial face recognition package, where the distance function is undisclosed) and synthetic data sets (Sierpinski triangle, 2-dimensional uniform distributions, and 2-dimensional lines).
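To make the distance plot concrete, the following is a minimal sketch, not the authors' implementation. It counts by brute force the pairs of objects within each radius, then fits a straight line to the log-log points; the slope is taken as the distance exponent. The function names, the radius range, and the two synthetic data sets (points on a 2-dimensional line and a 2-dimensional uniform distribution) are assumptions chosen for illustration, and the Euclidean distance stands in for a generic metric d(x,y).

```python
import numpy as np

def distance_plot(points, radii):
    """Return, for each radius, the number of object pairs at distance <= radius."""
    # Brute-force Euclidean distances between all pairs (assumed metric).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle, diagonal excluded).
    upper = np.triu_indices(len(points), k=1)
    pair_dists = np.sort(dists[upper])
    # searchsorted with side="right" counts the distances <= each radius.
    return np.searchsorted(pair_dists, radii, side="right")

def fit_distance_exponent(radii, counts):
    """Least-squares line through the log-log distance plot: (slope, constant)."""
    mask = counts > 0  # log(0) is undefined, so skip empty radii
    slope, intercept = np.polyfit(np.log(radii[mask]), np.log(counts[mask]), 1)
    return slope, np.exp(intercept)

rng = np.random.default_rng(0)
t = rng.random(1000)
line = np.column_stack([t, 0.5 * t])   # hypothetical data: points on a 2-D line
square = rng.random((1000, 2))         # hypothetical data: 2-D uniform points
radii = np.logspace(-2, -1, 10)        # assumed range of small radii

d_line, _ = fit_distance_exponent(radii, distance_plot(line, radii))
d_square, _ = fit_distance_exponent(radii, distance_plot(square, radii))
print(f"line: {d_line:.2f}  uniform square: {d_square:.2f}")
```

Under these assumptions the fitted slope should come out close to 1 for the line and close to 2 for the uniform square, in keeping with the statement that the distance exponent approaches the intrinsic dimensionality for spatial data. The quadratic cost of this brute-force version is exactly what a faster estimate computed from the index tree would avoid.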
Excluding suboptimal trees, our selectivity estimation formulas are accurate, with relative errors from 4% to 17%, and always within one standard deviation of the analytical results. Moreover, we also developed a quick algorithm to estimate the distance exponent, which gives excellent accuracy and saves orders of magnitude in computation time compared to the traditional distance calculation. The details are in the corresponding Technical Report CMU-CS-99-110, available at Carnegie Mellon University.
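As a rough illustration of how a fitted power law could serve selectivity estimation, the sketch below fits avg_neighbors(r) ≈ k * r^exponent on a synthetic data set and compares the prediction with the brute-force count at one query radius. This is an assumption-laden example, not the formulas of the technical report: the function names, the synthetic 2-dimensional uniform data, the fitting range, and the query radius are all hypothetical, and the estimate covers only the number of qualifying objects, not disk accesses.

```python
import numpy as np

def pairwise_distances(points):
    """Brute-force Euclidean distance matrix (assumed metric)."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

def average_neighbors(dists, r):
    """Average number of other objects within distance r of an object."""
    n = len(dists)
    # Subtract n to drop the zero self-distances on the diagonal.
    return ((dists <= r).sum() - n) / n

def fit_power_law(dists, radii):
    """Fit avg_neighbors(r) ~ k * r**exponent by least squares in log-log scale."""
    avg = np.array([average_neighbors(dists, r) for r in radii])
    slope, intercept = np.polyfit(np.log(radii), np.log(avg), 1)
    return np.exp(intercept), slope

def estimate_selectivity(k, exponent, r):
    """Predicted average number of qualifying objects for a range query of radius r."""
    return k * r ** exponent

rng = np.random.default_rng(1)
data = rng.random((1200, 2))            # hypothetical 2-D uniform data set
dists = pairwise_distances(data)
k, exponent = fit_power_law(dists, np.logspace(-2, -1, 10))

query_radius = 0.05                     # assumed query radius
estimated = estimate_selectivity(k, exponent, query_radius)
actual = average_neighbors(dists, query_radius)
relative_error = abs(estimated - actual) / actual
print(f"estimated={estimated:.2f} actual={actual:.2f} rel.err={relative_error:.1%}")
```

Once k and the exponent are known, each estimate costs a single power evaluation, which is why a cheap way to obtain the exponent matters for a query optimizer.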