IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000

Voronoi Networks and Their Probability of Misclassification

K. Krishna, M. A. L. Thathachar, Fellow, IEEE, and K. R. Ramakrishnan

Abstract—Nearest neighbor classifiers that use all the training samples for classification require large memory and demand large online testing computation. To reduce the memory requirements and the computation cost, many algorithms have been developed that perform nearest neighbor classification using only a small number of representative samples obtained from the training set. We call the classification model underlying all these algorithms Voronoi networks (Vnets), because these algorithms discretize the feature space into Voronoi regions and assign the samples in each region to a class. In this paper we analyze the generalization capabilities of these networks by bounding the generalization error. The class of problems that can be "efficiently" solved by Vnets is characterized by the extent to which the set of points on the decision boundaries fills the feature space, thus quantifying how efficiently a problem can be solved using Vnets. We show that Vnets asymptotically converge to the Bayes classifier with arbitrarily high probability provided the number of representative samples grows slower than the square root of the number of training samples, and we also give the optimal growth rate of the number of representative samples. We redo the analysis for decision tree (DT) classifiers and compare them with Vnets. The bias/variance dilemma and the curse of dimensionality with respect to Vnets and DTs are also discussed.

Index Terms—Neural networks, pattern recognition, statistical learning theory.

Manuscript received August 26, 1999; revised July 11, 2000. K. Krishna was with the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. He is now with IBM India Research Laboratories, New Delhi, India. M. A. L. Thathachar and K. R. Ramakrishnan are with the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India.

I. INTRODUCTION

ONE OF THE most popular and simple pattern classifiers is the nearest neighbor classifier (NNC). NNC assigns to an unclassified sample the class label of the nearest sample in the set of training examples. Though there is no training phase, NNC needs sufficient memory to store all the training samples and needs to compute, for every test sample, its distance from each training sample. There is a class of algorithms that overcomes this problem. These algorithms find a small number of representative samples using the training set and use these samples to perform nearest neighbor classification. Some of the algorithms in this class find a set of representative samples, which is a proper subset of the training set, such that the classification error over the training samples is minimized [1]. The other algorithms obtain the representative samples, which need not be a subset of the training set, by applying an iterative algorithm on the training set [2]–[7]. The popular learning vector quantization (LVQ) algorithm [3] belongs to this category of algorithms. All these algorithms, though they need a training phase, have been shown to perform better than NNC in many cases, apart from reducing the computational effort and memory requirements.
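To make the memory and computation contrast concrete, the following sketch compares a full 1-NN classifier with one that keeps only a few representative samples. It is our illustration, not one of the algorithms of [1]–[7]: the two-Gaussian data, the random per-class choice of prototypes, and all names in the code are assumptions made only for this example.

```python
import numpy as np

def nn_classify(queries, samples, labels):
    """1-NN: label each query by the label of its nearest stored sample."""
    # Pairwise squared Euclidean distances, shape (n_queries, n_samples).
    d2 = ((queries[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    return labels[d2.argmin(axis=1)]

rng = np.random.default_rng(0)
# Two-class toy data: Gaussian blobs centered at (0, 0) and (2, 2).
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.repeat([0, 1], 500)
Xtest = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
ytest = np.repeat([0, 1], 200)

# Full NNC: all 1000 training samples must be stored and scanned per query.
err_full = np.mean(nn_classify(Xtest, X, y) != ytest)

# Prototype-based classifier: a few randomly chosen samples per class stand in
# for the carefully chosen representatives of [1]-[7].
protos, proto_labels = [], []
for c in (0, 1):
    Xc = X[y == c]
    idx = rng.choice(len(Xc), 2, replace=False)
    protos.append(Xc[idx])
    proto_labels.append(np.full(2, c))
protos, proto_labels = np.vstack(protos), np.concatenate(proto_labels)
err_proto = np.mean(nn_classify(Xtest, protos, proto_labels) != ytest)

print(f"full NNC error {err_full:.3f} with {len(X)} stored samples")
print(f"prototype NNC error {err_proto:.3f} with {len(protos)} stored samples")
```

Even this crude selection often does about as well as the full NNC here while storing far fewer samples; the algorithms of [1]–[7] choose the representatives far more carefully.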
All the above-mentioned algorithms share the same classification model, i.e., they perform NN classification using some representative samples instead of all the training samples. This classification model is called the Voronoi network (Vnet) in this paper because it discretizes the feature space into Voronoi regions [8] and assigns the samples in each region to a class. Voronoi regions are formed by a set of points in the feature space: each Voronoi region consists of those points of the space that are closer to one particular point of the set than to any other point in the set. From now on, the representative samples are referred to as the Voronoi centers (Vcenters). In this paper, we analyze the capabilities of Vnets to generalize from a set of training samples to unseen test samples and address some issues in their design pertaining to the solution of pattern classification (PC) problems.

A. A Note on the Name "Voronoi Network"

The considered classification model is referred to by various names, which have mostly originated from different learning algorithms. For example, we have the LVQ classifier [3], the nearest-prototype or multiple-prototype classifier [6], the prototype-based NNC [4], the nearest-neighbor-based multilayer perceptron (NN-MLP) [5], etc. One of the most popular among these names is the LVQ classifier. However, LVQ is also used to refer to a class of clustering algorithms (see, for example, [9]), though it was originally used to refer to a class of supervised learning algorithms [3]. In view of the above facts, a need was felt to denote the classification model underlying the above algorithms by an appropriate name. Since the fundamental building block of the classification model is a Voronoi region, the classification model is referred to as a Vnet in this paper.

B. Generalization Error of Vnets

Since the Bayes decision rule (BDR) is the best map that can be learned, it is proper to judge the generalization capability of a rule by the error between the rule and the BDR, which is referred to as the generalization error (GE) in this paper. GE is decided by two factors in the case of Vnets, as in any other neural network. The first factor is the representational capacity of the network: there could be some error between the BDR and the map which is the best among those generated by Vnets with a fixed number of Vcenters. To make this error small, the space of maps generated by Vnets must have sufficient power to represent or closely approximate the class of BDRs. This error is called the approximation error (AE). Second, the Vnet learned from a finite number of training samples might be far from the best Vnet. The reason for this error is the limited information available on the underlying distribution of samples because of the finite number of training samples. This error is called the estimation error (EE).

Previously, these two issues have been addressed in isolation [10]–[13] as well as together [14]–[16] for some classes of networks. In this paper, we address both these issues for Vnets under a common framework as in [14], [16]; however, the analysis in [14], [16] deals with the problem of learning real-valued functions. The approximation capabilities and the behavior of the approximation error of various function approximation schemes, neural networks in particular, have been analyzed in the past (for example, see [10] and [11]). In all these studies the functions are assumed to be real valued, and the problem of PC, where the functions map to a finite set of class labels, is treated as a special case [11].
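In the notation made precise in Sections III and IV ($f^*$ for the BDR, $f_k^*$ for the best map generated by a Vnet with $k$ Vcenters, $\hat{f}_{k,N}$ for the Vnet learned from $N$ training samples, and $R$ for the expected risk), the two factors just described amount to the following splitting of the GE, stated formally as (9) below:

$\underbrace{R(\hat{f}_{k,N}) - R(f^*)}_{\text{GE}} \;=\; \underbrace{R(f_k^*) - R(f^*)}_{\text{AE}} \;+\; \underbrace{R(\hat{f}_{k,N}) - R(f_k^*)}_{\text{EE}}.$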
In classical approximation theory, the rate of convergence of AE with respect to the number of parameters of the approximating function is obtained in terms of the degree of smoothness of the function that is to be approximated. In the case of PC problems, no such characterization appears to have been done in the past. In this paper, the rate of convergence of the AE of maps that take values in a finite set of symbols, viz., the maps associated with PC problems, is characterized by the extent to which the decision boundaries fill the domain of the maps. One of the contributions of the present work is that the above-mentioned property is captured by a function called the $\epsilon$-coarse index function, whose value in the limit is the generalized fractal dimension under a given probability measure. The approximation error made by Vnets is bounded in terms of this function by exploiting the nice geometric properties of Voronoi regions. This analysis, the first of its kind to the best of our knowledge, gives insights into the sources of errors made by Vnets.

Blumer et al. [17] analyzed the estimation error in learning subsets of a domain using the results on uniform convergence of empirical processes from [18], [19] under the framework of Valiant's [20] probably approximately correct (PAC) learning. Haussler [13] generalized the analysis to learning real-valued functions, and Ben-David et al. [21] to $\{0, \ldots, n\}$-valued functions. In this paper, a bound on the estimation error is derived by considering a class of $[0, 1]$-valued functions derived from the class of maps generated by Vnets, using the results from [13]. Elsewhere [7], we have also derived a bound on the estimation error by bounding a generalized (VC-type) dimension of the class of maps generated by Vnets using the results from [21], and found that the former approach gives better results.

The main theorem of this paper gives a bound on the generalization error of Vnets, obtained by separately bounding the components corresponding to AE and EE. As stated earlier, this result characterizes the approximating capabilities of maps taking values in a finite set of class labels, which is used to find the rate of decrease of the AE of Vnets. Furthermore, it gives a sufficient condition on the relative growth of the number of Vcenters in a Vnet with respect to the number of training samples for the asymptotic convergence of the probability of misclassification to the Bayes error, thus proving the asymptotic convergence of Vnets to a corresponding BDR under a fairly large class of probability measures. As a corollary, the optimal rate of growth of the number of Vcenters is derived so as to minimize GE. The theorem also sheds light on the tradeoff between the number of Vcenters and the number of training samples, which has been referred to as the bias/variance dilemma in [15]. The phenomenon of "curse of dimensionality" in Vnets is also analyzed.

The decision tree (DT) is another popular classification model [22], [23] that shares a structure similar to that of the Vnet. A DT is a binary tree where each nonleaf node contains a split rule that decides whether a sample belongs to the left or the right subtree, and each leaf node is assigned to a class. The bound obtained for Vnets can be rederived for DTs too. A bound on GE is obtained for DTs and is compared with that obtained for Vnets. This kind of analysis suggests a basis for the analytical comparison of any two different classification models.
Moreover, it appears that the observations made with respect to Vnets are applicable to other classification models too.

This paper is organized as follows. The following section develops the Vnet classification model. The generalization error in Vnets is formulated in Section III. Section IV contains the main result, viz., a bound on GE. Proofs of most of the results stated in this section are relegated to the Appendix for clarity of presentation. Various implications of the main result are discussed in Section V. The bound on GE is rederived for DTs and compared with that of Vnets in Section VI. Section VII makes some comments on, and discusses some possible extensions of, the presented results. Finally, the paper ends with a summary in Section VIII.

II. VORONOI NETWORKS

The task of pattern recognition is to learn a mapping from the feature space to the set of class labels. The learning takes place with respect to a set of training samples.¹ The mapping that is learned should be able to map unknown samples (samples that are not present in the training set) into their correct class labels. The ultimate aim of learning is to find a map that minimizes the average misclassification over the feature space. This is a well-studied topic in the pattern recognition literature [24]. The map that has the least average misclassification error is the BDR. A Vnet approximates the Bayes decision regions by Voronoi regions: it discretizes the feature space into disjoint convex sets, viz., Voronoi regions, and assigns each region to a class label.

¹The word "samples" is used to refer to a pair of a feature vector and its class label as well as to a feature vector alone. The particular reference will be clear from the context of usage.

Let $\mathcal{X} \subseteq \mathbb{R}^n$ denote the feature space and $\mathcal{Y} = \{1, \ldots, c\}$ the set of class labels. Each Voronoi region is specified by a point, viz., the Vcenter, in $\mathcal{X}$. Let $M = \{m_1, \ldots, m_k\} \subset \mathcal{X}$ be a set of Vcenters. Then the Vnet is defined as

$g_{M,L}(x) = \sum_{i=1}^{k} l_i \, \mathbf{1}_{V_i}(x)$   (1)

where $L = (l_1, \ldots, l_k)$, $l_i \in \mathcal{Y}$, and $\mathbf{1}_{V_i}$ is the indicator function of the set $V_i$ defined below. Let

$\tilde{V}_i = \{x \in \mathcal{X} : \|x - m_i\| \le \|x - m_j\|, \ j = 1, \ldots, k\}$

where $\|\cdot\|$ is the Euclidean norm, and let $S_i$ be the part of the boundary of $\tilde{V}_i$ consisting of points equidistant from $m_i$ and from some $m_j$ with $j < i$; i.e., if a point is equidistant from two or more Vcenters, it is assigned to the region with the least index. Then define

$V_i = \tilde{V}_i \setminus S_i, \qquad i = 1, \ldots, k.$   (2)

Observe that the $V_i$ are disjoint convex sets² and that they partition $\mathcal{X}$.

²The sets $S_i$ have been defined only to make the definition precise. If the underlying probability measure is absolutely continuous with respect to the Lebesgue measure, then these $S_i$'s can be ignored.

A. Learning Vnets

Learning an input–output map using a Vnet involves finding the Vcenters $M$ and the labels $L$ from a set of training samples $\zeta_N = \{(x_j, y_j) : j = 1, \ldots, N\}$, such that the misclassification error of the resulting map over the training set is minimized. It may be noted that NNC is an instance of a Vnet, where $k = N$ and the training samples themselves serve as the Vcenters.

The empirical risk, or the average misclassification error over $\zeta_N$, of a Vnet with Vcenters and labels specified by $M$ and $L$ is given by

$\nu_N(M, L) = \frac{1}{N} \sum_{j=1}^{N} d\bigl(g_{M,L}(x_j), y_j\bigr)$   (3)

where $g_{M,L}$ is as defined in (1) and $d$ is a discrete metric on $\mathcal{Y}$ given by

$d(y, y') = \begin{cases} 0, & \text{if } y = y' \\ 1, & \text{if } y \ne y'. \end{cases}$   (4)

The learning problem is to find the pair $(M, L)$ that minimizes $\nu_N(M, L)$. It turns out that for a fixed $M$, the vector $L$ that minimizes $\nu_N(M, L)$ is uniquely determined. So the problem of finding $(M, L)$, in effect, boils down to finding $M$ alone. As mentioned in the previous section, there exist many algorithms that try to find $M$.

As in the case of any classification model, there exists a tradeoff between the average misclassification over the training set and the probability of misclassification, which reflects the generalization capabilities of Vnets. For example, one can easily show using a simple example that a network with $k = N$ has a higher probability of misclassification than an optimal net with $k < N$, when $k$ is properly chosen [7]. Thus the problem of finding $k$ in Vnets is related to the well-studied structural risk minimization in neural networks [18]. The results presented in this paper throw some light on this issue.
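The remark that, for fixed Vcenters, the minimizing label vector is determined can be made concrete: assigning each Voronoi region the majority class of the training samples that fall in it minimizes (3). The sketch below is ours; the data, the random choice of centers, and the handling of empty regions (defaulted to class 0) are illustrative assumptions rather than part of the paper's algorithms.

```python
import numpy as np

def assign_regions(X, centers):
    """Index of the nearest Vcenter for every sample (ties broken by least
    index, which is what argmin does)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def optimal_labels(X, y, centers, n_classes):
    """For fixed Vcenters, the empirical-risk-minimizing labels: the majority
    class among the training samples falling in each Voronoi region."""
    region = assign_regions(X, centers)
    labels = np.zeros(len(centers), dtype=int)
    for i in range(len(centers)):
        votes = np.bincount(y[region == i], minlength=n_classes)
        labels[i] = votes.argmax()          # empty regions default to class 0
    return labels

def empirical_risk(X, y, centers, labels):
    """Average misclassification over the training set, i.e., (3) with the
    discrete metric (4)."""
    return np.mean(labels[assign_regions(X, centers)] != y)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(3, 1, (300, 2))])
y = np.repeat([0, 1], 300)

k = 6
centers = X[rng.choice(len(X), k, replace=False)]   # illustrative center choice
labels = optimal_labels(X, y, centers, n_classes=2)
print("empirical risk of this Vnet:", empirical_risk(X, y, centers, labels))
```

Minimizing over the centers $M$ themselves is the hard part of the learning problem, and is what the algorithms cited in Section I address.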
III. PROBLEM FORMULATION

Let $\mathcal{X} \subseteq \mathbb{R}^n$ be the feature space and $\mathcal{Y} = \{1, \ldots, c\}$ be the finite set of class labels. Let $\mathcal{B}$ be a $\sigma$-field of subsets of $\mathcal{X}$ and let $2^{\mathcal{Y}}$ be the power set of $\mathcal{Y}$. Let $\mathcal{S}$ be the smallest $\sigma$-algebra in $\mathcal{X} \times \mathcal{Y}$ which contains every set of the form $B \times \{i\}$, $B \in \mathcal{B}$, $i \in \mathcal{Y}$. Let $P$ be a probability measure on $\mathcal{S}$, and let $P_x$ and $P_y$ be the probability measures induced by $P$ on $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $f$ be a measurable map from $\mathcal{X}$ into $\mathcal{Y}$. Then the average misclassification error, referred to as the expected risk, incurred using $f$ is given by

$R(f) = \int_{\mathcal{X} \times \mathcal{Y}} d\bigl(f(x), y\bigr) \, dP(x, y).$

The aim of learning is to find an $f$ that minimizes $R(f)$. Let $f^*$ be a BDR that minimizes $R(\cdot)$, i.e.,

$f^*(x) = \sum_{i \in \mathcal{Y}} i \, \mathbf{1}_{C_i}(x)$   (5)

where the Bayes decision regions $C_i \subseteq \mathcal{X}$ are appropriately defined. Since we are interested in analyzing how closely the learned rule approaches a BDR, the class of BDRs that can be "efficiently" approximated by the rules generated by Vnets is characterized, instead of characterizing the space of probability measures.

Since $P$ is unknown, $R(f)$ is unknown. The only source of information available is that of random samples drawn according to $P$. Let $\zeta_N = \{(x_j, y_j) : j = 1, \ldots, N\}$ be the set of training samples randomly drawn according to $P$. Using this data set, the expected risk can be approximated by the empirical risk

$R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{j=1}^{N} d\bigl(f(x_j), y_j\bigr).$   (6)

A common strategy is to estimate the BDR as the mapping that minimizes the empirical risk.

Remark III.1: It is to be noted that, under fairly general assumptions, the empirical risk converges in probability to the expected risk for each given $f$, and not for all $f$ simultaneously. Therefore, it is not guaranteed that the minimizer of the empirical risk will converge to the minimizer of the expected risk as $N$ tends to infinity. Therefore, as pointed out and analyzed in the fundamental work of Vapnik and Chervonenkis [18], [19], the notion of uniform convergence in probability has to be introduced; it will be discussed later.

The standard practice in pattern recognition using neural networks is that the number of parameters of the network is fixed and $R_{\mathrm{emp}}$ is minimized over the parameter space, i.e., over the space of maps generated by networks with a fixed number of parameters. In the case of Vnets, the number of parameters is proportional to the number of Voronoi regions, or the number of hidden units. Let $\mathcal{F}_k$ be the space of maps generated by Vnets with $k$ hidden units, i.e.,

$\mathcal{F}_k = \bigl\{ g_{M,L} : M \in \mathcal{X}^k, \ L \in \mathcal{Y}^k \bigr\}$, with the $V_i$ as in (2).   (7)

[From now on, a Vnet is denoted by $g$, omitting the variables $M$ and $L$.] Then $R_{\mathrm{emp}}$ is minimized over $\mathcal{F}_k$, i.e., $f^*$ is approximated by the function $\hat{f}_{k,N}$ defined as

$\hat{f}_{k,N} = \arg\min_{g \in \mathcal{F}_k} R_{\mathrm{emp}}(g).$   (8)

Assuming that the problem of learning from $\zeta_N$, viz., the minimization of $R_{\mathrm{emp}}$ over $\mathcal{F}_k$, is solved, we are interested in finding out the "goodness" of $\hat{f}_{k,N}$, apart from finding out the class of problems that can be solved "efficiently" using Vnets. The problem of finding $\hat{f}_{k,N}$ is addressed by the various learning algorithms mentioned in Section I. The goodness of $\hat{f}_{k,N}$ is measured in terms of its expected risk $R(\hat{f}_{k,N})$, which is equal to its probability of misclassification. Since $f^*$ is the best possible mapping that minimizes the probability of misclassification, $R(f^*) \le R(\hat{f}_{k,N})$. The above issues are analyzed below by bounding the difference $R(\hat{f}_{k,N}) - R(f^*)$, which is referred to as the generalization error (GE).
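The quantities $R(f)$ and $R_{\mathrm{emp}}(f)$ can be illustrated on a problem where the BDR is available in closed form. The sketch below is a toy example of ours: two equally likely classes with unit-variance Gaussian class conditionals centered at 0 and 2, for which the BDR thresholds at $x = 1$ and the Bayes error is $\Phi(-1) \approx 0.159$.

```python
import numpy as np
from math import erf, sqrt

def bayes_rule(x):
    """BDR f* for two equally likely classes N(0,1) and N(2,1): threshold at 1."""
    return (x >= 1.0).astype(int)

def phi(z):                       # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

expected_risk = phi(-1.0)         # R(f*) = Bayes error, about 0.1587

rng = np.random.default_rng(2)
for N in (100, 1000, 100000):
    y = rng.integers(0, 2, N)                 # labels with equal priors
    x = rng.normal(2.0 * y, 1.0)              # class conditional N(0,1) or N(2,1)
    emp_risk = np.mean(bayes_rule(x) != y)    # R_emp(f*) on this sample, cf. (6)
    print(f"N={N:6d}  R_emp(f*)={emp_risk:.4f}  R(f*)={expected_risk:.4f}")
```

This illustrates the pointwise convergence mentioned in Remark III.1 for one fixed $f$; the uniform convergence over all of $\mathcal{F}_k$ that learning requires is the subject of Section IV-B.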
Remark III.2: One expects $\hat{f}_{k,N}$ to become a better and better estimate as $k$ and $N$ go to infinity. In fact, when $N$ increases, the estimate of the expected risk improves, and hence so does the estimator $\hat{f}_{k,N}$. But when $k$ increases, the number of parameters to be estimated increases with the same amount of data, and hence the estimate of the expected risk deteriorates. In this situation, the estimate can be improved by seeking more data. Therefore, $k$ and $N$ have to grow as functions of each other for convergence to occur. The relation between $k$ and $N$ for convergence is derived as a corollary of the main theorem of this paper.

IV. BOUND ON GENERALIZATION ERROR

There are two factors involved in the approximation of $f^*$ by $\hat{f}_{k,N}$, viz., how accurately $f^*$ can be approximated by maps with $k$ Voronoi regions, and, given data of size $N$, how accurately $\hat{f}_{k,N}$ approximates the function in $\mathcal{F}_k$ closest to $f^*$ with respect to the metric $d_{P_x}$ [defined below in (10)]. To study these factors, let $f_k^*$ be the mapping with $k$ Vcenters having the least expected risk, i.e.,

$f_k^* = \arg\min_{g \in \mathcal{F}_k} R(g).$

Then it follows from the definitions that $R(f^*) \le R(f_k^*) \le R(\hat{f}_{k,N})$, and

$R(\hat{f}_{k,N}) - R(f^*) = \bigl[R(f_k^*) - R(f^*)\bigr] + \bigl[R(\hat{f}_{k,N}) - R(f_k^*)\bigr].$   (9)

The first term on the right-hand side (RHS) of (9) is referred to as the AE, since it is the error made in approximating $f^*$ by the functions in $\mathcal{F}_k$. The second term is referred to as the EE, since $\hat{f}_{k,N}$ uses the $N$ observations $\zeta_N$ to estimate the Voronoi centers associated with $f_k^*$ by minimizing the empirical risk $R_{\mathrm{emp}}$. We bound the generalization error by bounding these two terms separately.

A. Bound on Approximation Error

Notice that the approximation error does not depend on $N$; it depends on $k$ and on $f^*$. Therefore, to bound the approximation error, the space of BDRs for which the approximation error asymptotically goes to zero with $k$ is characterized. In particular, our interest is in those $f^*$ for which the approximation error goes to zero faster than $k^{-r}$, for some $r > 0$, as $k$ goes to infinity.

This problem is analyzed using a metric defined on the space of maps from $\mathcal{X}$ into $\mathcal{Y}$. For a given probability measure $P_x$ on $\mathcal{X}$, the discrete metric $d$ defined in (4) can be extended to a pseudometric $d_{P_x}$ on this space as follows:

$d_{P_x}(f, g) = \int_{\mathcal{X}} d\bigl(f(x), g(x)\bigr) \, dP_x(x).$   (10)

It may be noted that

$R(g) - R(f^*) \le d_{P_x}(g, f^*)$ for every measurable $g$.   (11)

Therefore, in this section, $\min_{g \in \mathcal{F}_k} d_{P_x}(g, f^*)$ is bounded in order to bound the AE; the bound depends on $f^*$, especially on the $\epsilon$-coarse index function of $f^*$. Before formally deriving the bound, the intuition behind defining the $\epsilon$-coarse index function of a mapping is described below.

Any two neighboring Vcenters generate a hyperplanar decision boundary. Since such hyperplanes can efficiently approximate any surface that is smooth enough, it is intuitive that a BDR with smooth decision boundary surfaces can be efficiently approximated by Vnets. On the other hand, considering the contribution of a Voronoi region to the average misclassification error, if a Voronoi region is a proper subset of $C_i$ [see (5)] for some $i$, then the error made in this region is zero. Therefore, the number of such Voronoi regions, if each region is equally probable, should reflect how close the mapping generated by the Vnet is to the BDR being approximated, and hence the approximation error. The relation between this number and the smoothness of the decision boundary surfaces is brought out in Section V.

Definition IV.1: Let $f$ be a measurable function from $\mathcal{X}$ into $\mathcal{Y}$. Denote by $C_i = f^{-1}(i)$, $i \in \mathcal{Y}$, the preimages of $f$ corresponding to the classes. Let $P_x$ be a probability measure on $\mathcal{X}$. Then a set $A \in \mathcal{B}$ is said to be $\epsilon$-pure with respect to $f$ if $P_x(A \cap C_i) \ge (1 - \epsilon) P_x(A)$ for some $i \in \mathcal{Y}$; $A$ is said to be $\epsilon$-impure if it is not $\epsilon$-pure. Let $\{S_j\}$ be a set of hypercubes formed by hyperplanes such that $k_0$ of the hyperplanes are perpendicular to the $i$th axis, $i = 1, \ldots, n$. If $P_x$ is absolutely continuous with respect to the Lebesgue measure (denoted $P_x \ll \lambda$), then the hyperplanes are chosen such that the probability of each hypercube is less than $\epsilon$.
Othby erwise, are chosen such that probability of each hypercube is except those containing points of positive probaless than . If a hypercube contains a point bility measure greater than then the probability of such a hyperof probability . Then, the minimum number of -imcube is utmost , minimum over all possible , represented by pure sets in , is called the -coarse index function of . Theorem IV.1: Let be a probability measure on . , there exist a such Then for all measurable that Proof: This theorem is proved by constructing a , for a given , that satisfies the above bound. For some and , since (10) on . Let KRISHNA et al.: VORONOI NETWORKS AND THEIR PROBABILITY OF MISCLASSIFICATION then, . Therefore, if then, minimizes . It is to be noted here that if is a -pure set with respect to , for all except for , then . Therefore 1365 estimate of expectation of a class of random variables to their true values below. Let be the class of -valued functions by the metric , viz., defined on (14) Now, are appropriately chosen to bound the error. For a containing -impure sets given , consider a with respect to . Since these hypercubes can be obtained from Vcenters, choose ; for . then, can be bounded by Now observe that if because, partition and in the worst case, . If is not absolutely continuous with respect to and is a -impure set containing , then a point with probability measure greater than . From the above construction and observations, we get thus proving the theorem. The above theorem, because of (11), implies that for a given and , there exists a in such that (12) Therefore, the rate at which this bound approaches zero as tends to infinity, depends on , which in turn depends to the generalon the underlying and . We relate ized fractal dimension (defined later) of the set of points on the decision boundaries of and characterize the rate of decrease of the bound in terms of this dimension in Section V. B. Bound on Estimation Error is a random variable, Since bounded only in a probabilistic sense, i.e., with probability is (13) Let Then, EE can be bounded if is bounded. We bound and consequently bound using results from statistical learning theory [25]. to converge to as one gets more and One expects converges to uniformly more data. This happens if in probability [18]. The following lemma gives a relationship . between the estimation error and , Proposition IV.2: [25] If then, . Thus EE is bounded if the difference between the empirical . This problem risk and true risk is bounded uniformly over is converted to that of analyzing the convergence of empirical , it may be noted that the empirical expecThen for any , , and the true expectation of based on , tation of under the probability distribution . Bounding the error between empirical and true expectations of a class of random variables is a well-studied topic [19]. Let (15) Then the following proposition directly follows from the defiand and from Proposition IV.2. nitions of then, . Proposition IV.3: If in order to bound Therefore, it is enough to bound . Let (16) to is same as the conThen, the convergence of vergence of empirical probability to true probability of the can corresponding subset of . It may be noted that be bounded using the VC-dimension of [19]. Ben-David et al. [21] gave a relation between the VC-dimension of and can be bounded by the -dimension of . Therefore can also be bounded, bounding -dimension of . as a family of real valued functions and using considering as given in [13]. 
We have derived the pseudodimension of the bounds using both the techniques [7] and found that the results obtained using the latter approach are better than those obtained using the former. The derivation of the bound using the pseudodimension of is given below which follows along similar line to the one given in [16]. Some definitions and are results that are useful in the derivation of bound on given in Appendix A. Consider the family of functions defined in (14). Then Theorem A.4 implies (17) -restriction of given by : , the average metric, and the -covering number of as defined in Appendix A. The following lemma, whose proof has been relegated to Appendix B for continuity of , presentation, relates the metric capacities of , and the expectation of the covering and . number of Lemma IV.4: . where is the 1366 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000 Using the above lemma and taking supremum on both sides defined in (15) can be bounded of (17) over all , the as Theorem IV.7: With probability holds:3 , the following bound (18) , is decomposed as the To find a bound on be the class composition of two classes of functions. Let to , generated by -hidden neurons of of functions from , and is the -dimensional Vnet where, unit vector in the th dimension i.e., th component of is 1 and the rest of the components are zeros. Then (20) Proof: For bound for defined in Theorem IV.6, substituting the in (18) we get (19) are defined as in (2). Let be the class of funcwhere tions, from to , generated by the single output neuron of the . Then, : network. Note that , . The following lemma bounds in . The proof of this lemma is terms of the metric capacity of given in Appendix B. Lemma IV.5: From Proposition IV.3, if (21) : This implies that if, the above inequality (21) is satisfied. Therefore, the following inequality should be satisfied: Since is a vector of index functions of some class of subsets, some results from Appendix A can be used to derive a , thus bounding . bound on Theorem IV.6: where . Proof: Let be the class of subsets of the form defined : , then where, in Section II. Let : , . Then, . from Theorem A.5, we get, Because is a space of the indicator functions, . Since is a class of subsets where each subset is half spaces, Lemmas A.1 and an intersection of at most . Substituting this in (34) for , we A.2 implies . Therefore, because of get . The Theorem A.5, above result follows from Lemma IV.5 on substitution of the . bound for (22) It is later shown that the above inequality is satisfied if (23), shown at the bottom of the page, is true. Let . , the inequality (22) is satisfied for the above value Since of ; thus proving the theorem. C. The Main Theorem Theorem IV.8: Let be a probability measure on be a BDR corresponding to and Let -coarse index function of . Let . be the : 3In general, these bounds are known to give very large numbers if different variables in the bounds are substituted with their values. However, they are qualitatively tight in the sense that there exist some distributions for which the bound is attained. Therefore, in this paper, the bound is found only in the order notation. (23) KRISHNA et al.: VORONOI NETWORKS AND THEIR PROBABILITY OF MISCLASSIFICATION cording to bility be a set of independent training examples drawn ac. Let be as defined in (8). Then, with proba- 1367 the efficiency of learning by Vnets. This analysis also gives an insight into the approximation capabilities of Vnets. 
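The space-filling idea can also be probed numerically. The sketch below counts, on uniform grids of increasing resolution, the cells that are impure in the sense of our reading of Definition IV.1 (no single class accounts for all but an $\epsilon$-fraction of a cell's probability); the disc-shaped decision boundary, the Monte Carlo purity estimate, and the uniform measure on $[0, 1]^2$ are all illustrative assumptions of ours.

```python
import numpy as np

def f_star(x):
    """Toy BDR on [0,1]^2: class 1 inside a disc of radius 0.3, else class 0."""
    return (((x - 0.5) ** 2).sum(-1) <= 0.09).astype(int)

def impure_cell_count(grid, eps, samples_per_cell=200, seed=0):
    """Count grid cells that are eps-impure under the uniform measure on
    [0,1]^2, estimating each cell's class proportions by Monte Carlo."""
    rng = np.random.default_rng(seed)
    h = 1.0 / grid
    impure = 0
    for i in range(grid):
        for j in range(grid):
            low = np.array([i * h, j * h])
            pts = low + h * rng.random((samples_per_cell, 2))
            frac = f_star(pts).mean()            # fraction of class 1 in the cell
            if min(frac, 1.0 - frac) > eps:      # no class dominates up to eps
                impure += 1
    return impure

eps = 0.05
for grid in (8, 16, 32, 64):
    print(f"grid {grid:3d}x{grid:<3d}  eps-impure cells: {impure_cell_count(grid, eps):4d}")
```

For a smooth one-dimensional boundary in the plane the impure-cell count grows roughly linearly with the grid resolution, while a boundary that fills space more thoroughly makes the count, and hence the generalized fractal dimension defined next, larger.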
Definition V.1: Let be a probability measure on and be a measurable function. Then define is where -impure is the -ball around in . Let if (24) Proof: The theorem follows from (9), Theorem IV.1 and Theorem IV.7. V. IMPLICATIONS OF THE THEOREM In this section, various implications of the main theorem of this paper are discussed. A. Convergence of Vnet to Bayes Decision Rule Under the assumption that AE asymptotically converges to zero with , which is justified in the next section, the discussion and then, the in the sequel implies that, if bound on GE in (24) asymptotically goes to zero. Substituting in the expression for [refer (13)] in (20) and letting tend to infinity, we get otherwise. , (25) , if the limit exists, is called the Then, with respect to , generalized fractal dimension (GFD) of . denoted by GFD is called so because, it is a generalization of the fractal dimension, which is same as the upper entropy index defined in [26]. The fractal dimension is defined on the compact sets using the Lebesgue measure. Here, GFD is defined with regard to and . However, the above definition can be appropriately modified for defining the dimension on any arbitrary sets with regard to a given . In this modified definition, is equal to the fractal dimension of , when has a compact support and is uniform. , we get Writing the bound on AE in terms of (26) Therefore, Vnet asymptotically converges to a BDR with arbitrary high probability if the number of Voronoi regions grow slower than the square root of the number of training samples. B. How Efficiently can Vnets Solve a Problem? Efficiency of learning a given pattern recognition task, , from a finite set of training samples can be measured by the to . The larger the rate rate of convergence of the more the efficiency. It follows from (24) that, for a fixed number of Vcenters , as the number of training samples increases, the bound on EE monotonically goes to zero where as the bound on AE remains constant since it is independent of . Therefore, the rate of convergence of to also depends on the rate of decrease of the bound on AE, and hence AE, to zero. The only parameter that depends on the problem (refer Definition IV.1), in under consideration viz., the bound on GE is due to the bound on AE and the bound on is analyzed to find EE is independent of . So, always lies in the interval It may be observed that . The above inequality (26) says that the nearer stays to the slower is the rate of decrease of the bound on AE with . To analyze further, assume has a compact support and is more the value uniform. Then higher the filling property of and hence more the difficulty to approximate by of Vnets. For a given , if ’s distribution over the regions neighis less, then it makes small boring the points of making the rate of decrease of bound large. In conclusion, as it should be expected, the approximation of a mapping by Vnets and the probability depends on the space filling property of distribution over the regions around it. C. Optimal Growth Rate of It is shown above that for to converge to the number of Voronoi regions should grow slower than the square root of the number of training samples. Then the question arises; does there exist an optimal growth rate of with respect to ? between By the optimal growth, we mean the relation and such that for a fixed number of training samples , number of Vcenters minimizes the bound the Vnet with is quite intuitive from on GE. The existence of such a 1368 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 
6, NOVEMBER 2000 Remark III.2. This is reflected in the bound on GE (24). To get , the minimum of the bound an explicit expression for on GE, which is assumed to be close to the minimum of GE, is found by solving the following equation: increase in the optimal rate of cannot compensate AE beterm in the exponent of in the bound AE. cause of the is In fact, for higher dimensions the optimal growth rate of almost independent of the dimension of because, the optimal rate reaches very close to 1/6 even for moderate values of . E. Tradeoff Between where . Applying the derivative and neglecting the less significant terms we get After some algebra we get that the estimate of , is proportional to , for large (27) The above expression (27) suggests one more interesting beis more, havior of the optimal growth rate of . When i.e., the problem under consideration is more difficult (refer the previous sub-sections), the optimal growth rate of is higher. That is, the number of Vcenters in Vnets can grow faster as the number of training samples increase, when solving more difficult problems. This is indeed desirable, because larger s make the AE small. and Theorem IV.8 suggests that there exist various choices of and for a given specifications of confidence and accuracy parameters. This is due to the tradeoff between AE and EE. When a large number of centers and less number of training samples are chosen, the resulting EE could be high, but this can be compensated by low AE and vice versa, in order to keep GE constant. If the availability of the training samples is expensive then, one could use more number of centers with the same bound on GE. Or, if the network size is expensive one could use large number of training samples. Thus the economics of this tradeoff would yield a suitable values for and to solve the pattern recognition problem with the required accuracy and confidence. Geman et al., termed this trade off as bias/variance dilemma [15]. F. Sample Complexities Finally, we consider the bounds on sample complexity of Vnets. As defined in Section IV-B, sample complexity reflects the number of samples sufficient to learn a mapping with a given accuracy and confidence requirements. From the value of defined in Theorem IV.6, the expression in (22) implies that if D. Dependence on the Dimension of It is known from the theory of linear and nonlinear widths [27] that if the target function has variables and a degree of smoothness , one should not expect to find that an approxima, when the number tion error goes to zero faster than of parameters that have to be estimated is proportional to . Here the degree of smoothness is a measure of how constrained the class of functions are; for example, the number of derivatives that are uniformly bounded or the number of derivatives that are integrable or square integrable. Therefore, from the classical approximation theory, one expects that unless certain constraints are imposed on the class of functions to be approximated, the rate of convergence will dramatically slow down as the dimension increases, showing the phenomenon known as the “curse of dimensionality.” . The term reflects the frequency Let in (26), of variation of on its domain and the term in a way, characterizes its smoothness. In case of Vnets, AE . Therefore, the above goes to zero as fast as conclusions from the classical approximation theory are valid as smoothness parameter. in case of Vnets also with It is interesting to note the following observation at this point. 
, the opThe expression in (27) suggests that when . timal growth rate of increases from 1/8 to 1/6 as can be increased at a faster rate for higher dimenThat is, sions maintaining the optimal performance. Nevertheless, the VI. BOUND ON GE OF DECISION TREES The classification model that shares a similar structure, but not the same, with Vnets is the DT classifier [22]. A DT is a binary tree. Each nonleaf node in the tree is associated with a split rule that decides whether a given pattern will go to the left or right subtree. A class label is assigned to each leaf node implying that all the patterns landing in that leaf node are to be classified into that class. When the feature space is a subset of , then the class of split rules considered could be the class of hyper surfaces that are either parallel or oblique to the axes. The DT’s using the former class of split rules are called Axesparallel DTs and those using the later are oblique DTs. There are many learning algorithms for these DTs (for example, [28]). Better and better learning algorithms are still being discovered. The learnability of this class of classifiers has been analyzed and bounds on sample complexities have been found by bounding the VC-dimension of DTs which are used for two-class pattern classification problems [29], [30]. The analysis presented in Section IV can be easily reworked to the class of maps generated by DTs. A brief sketch of the proof of the theorem for DTs is given below. As expected, the approximating behavior of DTs is quite similar to that of Vnets KRISHNA et al.: VORONOI NETWORKS AND THEIR PROBABILITY OF MISCLASSIFICATION except that the number of Vcenters is replaced by the number of leaf nodes. To find a bound on EE, DT is considered as a combination of (number of leaf nodes) number of indicator functions where the set associated with each indicator function is an intersection of (the depth of the tree) half spaces. Let be the class of mappings generated by DTs for a fixed and . Then, the and corresponding to DTs are (28) where is as defined in (6), and 1369 the same procedure as in the previous section, we get that the number of leaf nodes should be proportional to where , for the bound on GE to be optimal. This rate is faster than that of in Vnets [refer (27)]. The difference is obvious because each element in is assumed to be half spaces whereas each subset in case an intersection of of DT is assumed to be an intersection of half spaces and the derived bounds are worst case bounds. VII. DISCUSSION where is the average misclassification error. . Theorem VI.1: Let be a probability measure on be a BDR corresponding to and be the Let -coarse index function of . Let : be a set of independent training examples drawn acbe as defined in (28). Then, with probacording to . Let bility (29) A Sketch of Proof: The first part of the above inequality -coarse index function of . follows from the definition of beTo get the second part, we need to bound cause, following similar arguments as in Section IV-B: (30) is a combination indicator functions each is associSince ated with an intersection of half spaces, Theorem IV.6 implies that where . Substituting this bound in (30) and after some algebra we get the second part of the inequality in the statement of the theorem. From the above theorem, one can say that the efficiency of solving a problem, the dependence of GE on the dimension of , and the tradeoff between and in case of DTs are the same as in Vnets. 
One can analyze the optimal growth rate of the depth of a tree and the number of leaf nodes in a tree. It is observed . This that the optimal rate of growth of is a function of is to be expected because the number of subsets one can get in a tree of depth is . But, the optimal rate of growth of is almost similar to that of ; it differs only in the rate. Following In this section, we make some comments on the nature of results obtained along with some possible extensions of the results. The main result of this paper has been developed on a framework similar to Valiant’s PAC framework for learning and along the lines of those given in [16]. However, the result is not entirely distribution free. The bound on AE is dependent on the underlying probability measure in terms of the properties of -coarse index function of . a corresponding BDR, viz., Moreover, in this paper, we have started from an approximation scheme and found the space of problems that can be efficiently solved by the scheme as opposed to the other way round viz., from a class of problems to an optimal class of approximation schemes. As previously mentioned, the bounds obtained using the PAC framework are known to be very loose. Therefore, no attempt has been made in this paper to find the exact numerical bounds. Rather, these bounds have been used to analyze the behavior of GE with respect to various parameters in terms of bounds on the growth of with respect to , etc. In this paper, we have assumed the existence of an algorithm , which minimizes , for a and a training that finds . We guess that the problem of finding is NP-hard set since its counterpart in unsupervised learning, viz., clustering, is NP-hard and the present problem is very similar to clustering. (A formal verification of this conjecture needs to be done.) Recently, an algorithm based on evolutionary algorithms has been with proposed and shown to asymptotically converge to probability one [7]. This algorithm may converge to a suboptimal if run for a finite time, which is the case in reality. Nevertheless, it is empirically shown to converge to a better optimum than that obtained by LVQ3 algorithm [3]. The behavior of generalization error of the networks with sigmoidal units [11] and of the RBF networks [16] have been analyzed considering them as function approximation schemes. It appears that finding a bound on GE for these networks when used as pattern classifiers is difficult. First, the class of decision regions that can be obtained using the functions generated by these networks as discriminating functions could be very difficult to characterize. However, the conclusions arrived from the analysis presented in Section IV-A should, in general, be applicable to these networks too. It is interesting to investigate whether the techniques similar to the one given in Section IV-B 1370 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000 can be used in bounding the EE of these networks. The problem may arise when calculating the metric entropy of the space of mappings in the last stage of the network where the vectors of real numbers are mapped to class labels. One has to address two interrelated issues in comparing two classification models. One issue is the average misclassification error and second, the time taken to arrive at a particular instance of the model. The first issue has been analyzed in case of Vnets and DTs in the present paper. 
The analysis could be extended to MLPs in which each perceptron uses a hard-limiting function; because the fundamental structure in the associated decision regions is a hyperplane. It is interesting if a similar analysis can be done on other types of MLPs and RBF networks that are popularly used in PC models. However, it appears to be difficult to access the computation needed to arrive at a globally optimal classifier with existing tools in computation complexity. If one succeeds in precisely finding the computation requirements of learning these networks, then these classification models can be compared purely on an analytical basis and the comparison then holds for a wide class of PC problems irrespective of the specific learning algorithm. VIII. SUMMARY We have considered the classification model that discretizes the feature space into Voronoi regions and assigns samples in each region to a class label. We have called this model as “Vnets.” The error between a rule and a BDR corresponding to the problem under consideration has been considered as the generalization error. GE has been split into AE and EE and these factors have been individually bounded, thus bounding GE. We have analyzed the approximating behavior of Vnets, in turn their efficiency in solving a problem from the bound on AE. We have shown that the efficiency of Vnet depends on the extent to which Bayes decision boundaries fill the feature space. The bound on EE has been derived using two techniques and the relative merits of these two techniques have been discussed. Putting together the bounds on AE and EE, we have shown that Vnets converge to the Bayes classifier with arbitrarily high probability provided the number of Vcenters grow slower than the square root of the number of training samples. All these results have been reworked for DTs. The optimal growth rate of the number of Voronoi centers for optimum GE has been calculated for Vnets as well as DTs. The main result of this paper also quantitatively explains the bias/variance dilemma and the curse of dimensionality that is generally observed in classification models. APPENDIX I SOME DEFINITIONS AND RESULTS The analysis of the process of learning a concept can be done by analyzing the uniform convergence of the associated empirical process as shown in Section IV-B for a class of concepts. The classes of concepts that are of interest in this paper are the class of subsets of , and the class of real valued functions. In each of the above classes, the uniform convergence of the empirical process depends on the growth function associated with the concept class. The growth function in turn depends on a number (dimension) that is specific to a concept class. In this section, we define these dimensions for the above mentioned two concept classes and give the relation between this dimension and the property of uniform convergence for the class of real valued functions in a series of theorems and lemmas. Readers are referred to [25] and [31] for more details. A. VC Dimension of Subsets of The following notation is used in this section. Let be be a collection of subsets of a measurable space and let . is said to be shattered by Definition A.1: A subset if for every partition of into disjoint subsets and , such that and . Then there exists , is the Vapnik–Chervonenkis (VC) dimension of , the maximum cardinality of any subset of shattered by , or if arbitrary large subsets can be shattered. 
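To make Definition A.1 concrete for the half spaces that appear in the next two lemmas, the sketch below checks, for every labeling of a small planar point set, whether some half space realizes it, by solving a linear feasibility problem. The example is ours and assumes SciPy is available; exhibiting one four-point set that cannot be shattered does not by itself prove the VC dimension, which requires considering all four-point sets.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, signs):
    """True if some half space {x : w.x + b > 0} realizes the +/- labeling,
    checked via feasibility of  signs_i * (w.x_i + b) >= 1  (variables w1, w2, b)."""
    A_ub = -signs[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True if half spaces in the plane realize all 2^m labelings of `points`."""
    return all(separable(points, np.array(s))
               for s in itertools.product([-1.0, 1.0], repeat=len(points)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print("3 points in general position shattered:", shattered(triangle))  # True
print("4 corners of a square shattered:", shattered(square))           # False (XOR labeling fails)
```

The outcome is consistent with the standard fact that half spaces in the plane have VC dimension three.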
The following two lemmas give the VC dimension of the class and that of the class of subsets of which of half spaces in are intersections of elements belong to a concept class of a given VC dimension. These results are useful in deriving the bounds for Vnets. be a concept Lemma A.1: ([17, Lemma 3.2.3]) Let . class of finite VC dimension, viz, , let : For all or : , . Then , . Lemma A.2: ([32]) Let be the class of half spaces in then, . B. Learning the Class Definition A.2: For . For of Real Valued Functions let, if else let, let, is said to be full if there , where , and for . Any set such that exists : . be a family of functions from a set into . Any Let is said to be shattered by , if sequence : , is full. its -restriction, The largest such that there exists a sequence of points which are shattered by is said to be the pseudodimension of de. If arbitrarily long finite sequences are shatnoted by is infinity. tered, then Definition A.3: Let be a subset of a metric space with . Then a set is called an -cover of with metric if for every , there exists respect to the metric satisfying . The size of the smallest some -cover is called covering number of and is denoted by . A set is called an -separated if for all , . The packing number of a set is the size of the largest -separated subset of . Let be a metric on defined as , , ( is used in the subscript metric as in [25]). Let be a to represent average : KRISHNA et al.: VORONOI NETWORKS AND THEIR PROBABILITY OF MISCLASSIFICATION pseudometric on a space of functions from to by a probability distribution on , defined as , induced Theorem A.3: ([13]) Let be a family of functions from a , where for some . set into Let be a probability distribution on . Then for all (31) Theorem A.4: ([13], [25]) Let be a family of functions into [0, 1]. Let be a sequence of training from a set examples randomly drawn according to a distribution . Then, , for all 1371 where is the Dirac delta function and . Let , then ment of is the th ele- . Therefore, because of the isometry. Hence the first inequality follows from (33) and by taking expectation on both sides with respect to . on . Let be the marginal Fix a distribution and , respectively. Let distribution with respect to be an -cover for with respect to the metric such . Then, it is easy to show that that : is an -cover for with re. Then, spect to the metric . Taking supremum over all we get the second inequality in the lemma. Proof of Lemma IV.5: Fix a distribution on . Let be -cover for with respect to the metric such . Then, : that can be shown to be an -cover for with . Hence respect to (32) is the restriction of to the data set , that is: : . The original version of the above theorem is given in [13]; an improved version, with less conservative bounds, is given depends on the probability in [25]. is a random set that depends on . To distribution , since get a bound that is independent of , a quantity called metric capacity is defined and the error is bounded in terms of this. Definition A.4: The metric capacity of is defined as Where (33) where the supremum is taken over all the probability distributions defined on . Since the right-hand side of (31) is independent of and for all , we have, under the same hypothesis of Theorem A.3 (34) We use the following theorem in Theorem IV.6. Theorem A.5: ([13]) Let , where is a class of functions from . Then : to . 
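Before moving to the proofs, the covering and packing numbers of Definition A.3 can be illustrated by the standard greedy construction, sketched below on a random point cloud in the unit square; the point cloud, the chosen values of $\epsilon$, and the names in the code are illustrative.

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedy construction: keep a point only if it is more than eps away from
    every point kept so far. The kept set is eps-separated and, at the same time,
    an eps-cover of `points` (every discarded point lies within eps of a kept one)."""
    kept = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in kept):
            kept.append(p)
    return np.array(kept)

rng = np.random.default_rng(3)
cloud = rng.random((2000, 2))          # points in the unit square
for eps in (0.4, 0.2, 0.1, 0.05):
    c = greedy_cover(cloud, eps)
    print(f"eps={eps:0.2f}: greedy cover/packing size {len(c):4d}"
          f"  (order 1/eps^2 = {round(1 / eps ** 2):4d} for the unit square)")
```

The size of the kept set lies between the covering number and the packing number of Definition A.3, and estimates of exactly this kind feed the metric-capacity bounds used in Theorem A.4 and Lemma IV.4.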
APPENDIX II PROOFS OF LEMMA IV.4 AND LEMMA IV.5 Proofs of Lemma IV.4: Given a sequence pled according to the distribution , define ical distribution on , i.e., in , samas the empir, Since this holds for any lemma is proved. , taking supremum over all , the REFERENCES [1] B. V. Dasarathy, Ed., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Comput. Soc. Press, 1991. [2] C.-L. Chang, “Finding prototypes for nearest neighbor classifiers,” IEEE Trans. Comput., vol. C-23, pp. 1179–1184, Nov. 1974. [3] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, pp. 1464–1480, Sept. 1990. [4] C. Decaestecker, “Finding prototypes for nearest neighbor classification by means of gradient descent and deterministic annealing,” Pattern Recognition, vol. 30, no. 2, pp. 281–288, Feb. 1997. [5] Q. Zhao, “Stable on-line evolutionary learning of NN-MLP,” IEEE Trans. Neural Networks, vol. 8, pp. 1371–1378, Nov 1997. [6] J. C. Bezdek, T. R. Reichherzer, G. S. Lim, and Y. Attikiouzel, “Multiple-prototype classifier design,” IEEE Trans. Syst., Man, Cybern. C, vol. 28, pp. 67–79, Feb. 1998. [7] K. Krishna, “Hybrid evolutionary algorithms for supervised and unsupervised learning,” Ph.D. dissertation, Dept. Elect. Eng., Indian Inst. Sci., Bangalore, India, Aug. 1998. [8] G. Voronoi, “Recherches sur les paralleloedres primitives,” J. Reine Angew. Math., vol. 134, pp. 198–287, 1908. [9] N. R. Pal, J. C. Bezdek, and E. C. K. Tsao, “Generalized clustering networks and Kohonen’s self-organizing scheme,” IEEE Trans. Neural Networks, vol. 4, pp. 549–557, July 1993. [10] G. Cybenko, “Approximation by superposition of sigmoidal functions,” Math. Contr., Syst, Signals, vol. 2, no. 4, pp. 303–314, 1989. [11] A. R. Barron, “Approximation and estimation bounds for artificial neural networks,” Dept. Statist., Univ. Illinois Urbana-Champaign, Champaign, IL, Tech. Rep. 58, Mar. 1991. [12] E. Baum and D. Haussler, “What size net gives valid generalization?,” Neural Comput., vol. 1, no. 1, pp. 151–160, 1989. [13] D. Haussler, “Decision theoretic generalizations of the PAC model for neural net and other learning applications,” Inform. Comput., vol. 100, pp. 78–150, 1992. 1372 [14] H. White, “Connectionist nonparametric regression: Multilayer perceptron can learn arbitrary mappings,” Neural Networks, vol. 3, pp. 535–549, 1990. [15] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Comput., vol. 4, pp. 1–58, 1992. [16] P. Niyogi and F. Girosi, “On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions,” Artificial Intell. Lab., Massachusetts Inst. Technol., Cambridge, MA, A. I. Memo 1467, Feb. 1994. [17] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, “Learnability and the Vapnik–Chervonenkis dimension,” J. Assoc. Comput. Machinery, vol. 36, no. 4, pp. 926–965, Oct. 1989. [18] V. N. Vapnik, Estimation of Dependences Based on Empirical Data. Berlin, Germany: Springer-Verlag, 1982. [19] V. N. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Theory Probability and its Applications, vol. 12, no. 2, pp. 264–280, 1971. [20] L. G. Valiant, “A theory of the learnable,” Commun. Assoc. Comput. Machinery, vol. 27, no. 11, pp. 1134–1143, 1984. [21] S. Ben-David, N. Cesa-Bianchi, D. Haussler, and P. M. Long, “Characterizations of learnability for classes of f0; . . . ; ng-valued functions,” J. Comput. Syst. Sci., vol. 
50, pp. 74–86, 1995. [22] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984. [23] J. R. Quinlan, “Decision trees and decision making,” IEEE Trans. Syst., Man, Cybern., vol. 20, pp. 339–346, May 1968. [24] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973. [25] M. Vidyasagar, A Theory of Learning and Generalization With Applications to Neural Networks and Control Systems. Berlin, Germany: Springer-Verlag, 1996. [26] G. A. Edgar, Measure, Topology and Fractal Geometry. New York: Springer-Verlag, 1990. [27] G. G. Lorentz, Approximation of Functions. New York: Chelsea, 1986. [28] S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,” J. Artificial Intell. Res., vol. 2, pp. 1–32, 1994. [29] D. Haussler, “Quantifying inductive bias: AI learning algorithms and Valiant’s learning framework,” Artificial Intell., vol. 36, pp. 177–221, 1988. [30] A. Ehrenfeucht and D. Haussler, “Learning decision trees from random examples,” Inform. Comput., vol. 83, no. 3, pp. 247–251, 1989. [31] B. K. Natarajan, Machine Learning: A Theoretical Approach. San Mateo, CA: Morgan Kaufmann, 1991. [32] R. S. Wenocur and R. M. Dudley, “Some special Vapnik–Chervonenkis classes,” Discrete Math., vol. 33, pp. 313–318, 1981. IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 6, NOVEMBER 2000 K. Krishna received the B.Sc. degree in physics from Nagarjuna University, India, in 1986. He received the M.E. and Ph.D. degrees in electrical engineering from the Indian Institute of Science, Bangalore, India, in 1993 and 1999, respectively. He has been a Research Staff Member in the IBM India Research Laboratory, New Delhi, India, since December 1998. His present research interests include machine learning, statistical learning theory, evolutionary algorithms, and data and web mining. Dr. Krishna received the Alfred Hay Medal for the best graduating student in electrical engineering in 1993 from the Indian Institute of Science. M. A. L. Thathachar (SM’79–F’91) received the B.E. degree in electrical engineering from the University of Mysore, India, in 1959, the M.E. degree in power engineering, and the Ph.D. degree in control systems from the Indian Institute of Science, Bangalore, India, in 1961 and 1968, respectively. He was a Member of the Faculty of Indian Institute of Technology, Madras, from 1961 to 1964. Since 1964, he has been with Indian Institute of Science, Bangalore, where currently he is a Professor in the Department of Electrical Engineering. He has been a Visiting Professor at Yale University, New Haven, CT, Michigan State University, East Lansing, Concordia University, Montreal, PQ, Canada, and the National University of Singapore. His current research interests include learning automata, neural networks, and fuzzy systems. Dr. Thathachar is a Fellow of Indian National Science Academy, the Indian Academy of Sciences, and the Indian National Academy of Engineering. K. R. Ramakrishnan received the B.S., M.S., and Ph.D. degrees from in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. He is currently an Associate Professor in the same department. His research interests include signal processing, computer vision, medical imaging, and multimedia.