Voronoi Networks and Their Probability of
Misclassification
K. Krishna, M. A. L. Thathachar, Fellow, IEEE, and K. R. Ramakrishnan
Abstract—Nearest neighbor classifiers that use all the training
samples for classification require large memory and demand large
online testing computation. To reduce the memory requirements
and the computation cost, many algorithms have been developed
that perform nearest neighbor classification using only a small
number of representative samples obtained from the training set.
We call the classification model underlying all these algorithms
Voronoi networks (Vnets), because these algorithms discretize
the feature space into Voronoi regions and assign the samples in
each region to a class. In this paper we analyze the generalization
capabilities of these networks by bounding the generalization
error. The class of problems that can be “efficiently” solved by
Vnets is characterized by the extent to which the set of points on
the decision boundaries fill the feature space, thus quantifying
how efficiently a problem can be solved using Vnets. We show
that Vnets asymptotically converge to the Bayes classifier with
arbitrarily high probability, provided the number of representative samples grows more slowly than the square root of the number
of training samples, and we also give the optimal growth rate of
the number of representative samples. We redo the analysis for
decision tree (DT) classifiers and compare them with Vnets.
The bias/variance dilemma and the curse of dimensionality with
respect to Vnets and DTs are also discussed.
Index Terms—Neural networks, pattern recognition, statistical
learning theory.
I. INTRODUCTION
One of the most popular and simple pattern classifiers
is the nearest neighbor classifier (NNC). The NNC assigns to
an unclassified sample the class label of the nearest sample in
the set of training examples. Though there is no training phase,
NNC needs sufficient memory to store all the training samples
and needs to compute, for every test sample, its distance from
each training sample. There is a class of algorithms that overcomes this problem. These algorithms find a small number of
representative samples using the training set and use these samples to perform nearest neighbor classification. Some of the algorithms in this class find a set of representative samples, which
is a proper subset of the training set, such that the classification
error over the training samples is minimized [1]. The other algorithms obtain representative samples, which need not be a
subset of the training set, by applying an iterative algorithm to
the training set [2]–[7]. The popular learning vector quantization (LVQ) algorithm [3] belongs to this category of algorithms.
Manuscript received August 26, 1999; revised July 11, 2000.
K. Krishna was with the Department of Electrical Engineering, Indian
Institute of Science, Bangalore, India. He is now with IBM India Research Laboratories, New Delhi, India.
M. A. L. Thathachar and K. R. Ramakrishnan are with the Department of
Electrical Engineering, Indian Institute of Science, Bangalore, India.
Publisher Item Identifier S 1045-9227(00)09847-7.
All these algorithms, though they need a training phase, have been shown
to perform better than the NNC in many cases, apart from reducing
the computational effort and memory requirements.
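To make the above description concrete, the following is a minimal sketch (in Python, with hypothetical names such as `prototypes` and `labels`; it is not any particular algorithm from [1]–[7]) of how a small set of representative samples replaces the full training set in nearest neighbor classification.

```python
import numpy as np

def nn_classify(x, prototypes, labels):
    """Assign to x the class label of the nearest representative sample.

    prototypes: (k, d) array of representative samples
    labels:     (k,) array of their class labels
    """
    distances = np.linalg.norm(prototypes - x, axis=1)  # Euclidean distances
    return labels[np.argmin(distances)]                 # label of the nearest prototype

# Example: two representative samples per class instead of the full training set
prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
labels     = np.array([0, 0, 1, 1])
print(nn_classify(np.array([0.8, 0.7]), prototypes, labels))  # -> 0
print(nn_classify(np.array([4.6, 4.2]), prototypes, labels))  # -> 1
```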
All the above-mentioned algorithms share the same classification model, i.e., they perform NN classification using some representative samples instead of all the training samples. This classification model is called the Voronoi network (Vnet) in this paper
because it discretizes the feature space into Voronoi regions [8]
and assigns samples in each region to a class. Voronoi regions
are generated by a set of points in the feature space: each Voronoi
region consists of those points of the space that are closer to one
particular point of the set than to any other point in the set. From now on, the representative
samples are referred to as the Voronoi centers (Vcenters). In this
paper, we analyze the capabilities of Vnets to generalize from a
set of training samples to unseen test samples and address some
issues in their design pertaining to the solution of pattern classification (PC) problems.
A. A Note on the Name “Voronoi Network”
The considered classification model is referred to by various
names which have mostly originated from different learning
algorithms. For example, we have LVQ classifier [3], nearest
prototype or multiple prototype classifier [6], prototype-based
NNC [4], nearest-neighbor-based multilayer perceptron
(NN-MLP) [5], etc. One of the most popular among these
names is the LVQ classifier. However, LVQ is also used to refer
to the class of clustering algorithms (see, for example, [9])
though it was originally used to refer to a class of supervised
learning algorithms [3]. In view of the above facts, a need was
felt to denote the classification model underlying the above
algorithms by an appropriate name. Since the fundamental
building block for the classification model is a Voronoi region,
the classification model is referred to as Vnet in this paper.
B. Generalization Error of Vnets
Since the Bayes decision rule (BDR) is the best map that can be
learned, it is proper to judge the generalization capability of a
rule by the error between the rule and the BDR, which is referred to as the generalization error (GE) in this paper. In the case of Vnets, as in any other neural
network, GE is determined by two factors. One of the factors is the representational capacity of
the network: there could be some error between the BDR and
the map that is the best among those generated by Vnets
with a fixed number of Vcenters. To make this error small, the
space of maps generated by Vnets must have sufficient power to
represent or closely approximate the class of BDRs. This error is
called the approximation error (AE). Second, the Vnet learned
from a finite number of training samples might be far from this
best Vnet. The reason for this error is the limited information available about the underlying distribution of the samples, since
only a finite number of training samples is available. This error is called the estimation error (EE). Previously, these two issues have been addressed in isolation [10]–[13] as well as together [14]–[16] for
some classes of networks. In this paper, we address both these
issues for Vnets under a common framework as in [14], [16];
however, the analysis in [14], [16] deals with the problem of
learning real-valued functions.
The approximation capabilities and the behavior of the approximation error of various function approximation schemes, neural
networks in particular, have been analyzed in the past (for example, see [10] and [11]). In all these studies the functions are
assumed to be real valued and the problem of PC, where the
functions map to a finite set of class labels, is considered as a
special case [11]. In classical approximation theory, the rate of
convergence of AE with respect to the number of parameters of
the approximating function is obtained in terms of the degree
of smoothness of the function that is to be approximated. In the
case of PC problems, no such characterization appears
to have been given in the past. In this paper, the rate of convergence of the AE of maps that take values in a finite set of
symbols, viz., the maps associated with PC problems, is
characterized by the extent to which the decision boundaries fill the
domain of the maps. One of the contributions of the present
work is that this property is captured by a function, called the $\epsilon$-coarse index function, whose value at infinity
is the generalized fractal dimension under a given probability
measure $P$. The approximation error made by Vnets is bounded
in terms of this function by exploiting the nice geometric properties of Voronoi regions. This analysis, the first of its kind to the best
of our knowledge, gives insights into the sources of errors made
by Vnets.
Blumer et al. [17] analyzed the estimation error in learning
subsets of the instance space using the results on uniform convergence of empirical processes from [18], [19] under the framework of Valiant's
[20] probably approximately correct (PAC) learning. Haussler
[13] generalized the analysis to learning real-valued functions,
and Ben-David et al. [21] to $\{0, \ldots, n\}$-valued functions. In
this paper, a bound on the estimation error is derived by
considering a class of $[0, 1]$-valued functions that is derived
from the class of maps generated by Vnets, using the results from
[13]. Elsewhere [7], we have also derived a bound on the estimation
error by bounding a generalized dimension of the class of maps generated
by Vnets using the results from [21], and found that the former
approach gave better results.
The main theorem of this paper gives a bound on the generalization error of Vnets, obtained by separately bounding the components corresponding to AE and EE. As stated earlier, this result characterizes the approximating capabilities of maps taking
values in a finite set of class labels, which is used to find the
rate of decrease of the AE of Vnets. Furthermore, it gives a sufficient condition on the relative growth of the number of Vcenters
in a Vnet with respect to the number of training samples for the
asymptotic convergence of the probability of misclassification
to the Bayes error, thus proving the asymptotic convergence of
Vnets to a corresponding BDR under a fairly large class of probability measures. As a corollary, the optimal rate of growth of
the number of Vcenters is derived so as to minimize GE. The
theorem also sheds light on the tradeoff between the number of
Vcenters and the number of training samples which has been referred to as the bias-variance dilemma in [15]. The phenomenon
of “curse of dimensionality” in Vnets is also analyzed.
Decision tree (DT) is another popular classification model
[22], [23] that shares a structure similar to that of Vnet. DT is
a binary tree in which each nonleaf node contains a split rule that
decides whether a sample belongs to the left or the right subtree,
and each leaf node is assigned to a class. The bound obtained
for the Vnets can be rederived for DTs too. A bound on GE is
obtained for DTs and is compared with that obtained for Vnets.
This kind of analysis suggests a basis for analytical comparison
of any two different classification models. Moreover, it appears
that the observations made with respect to Vnets are applicable
to other classification models too.
This paper is organized as follows. The following section develops the Vnet classification model. The generalization error of
Vnets is formulated in Section III. Section IV contains the main
result, viz., a bound on GE. Proofs of most of the results stated
in this section are relegated to the Appendix for clarity of presentation. Various implications of the main result are discussed in
Section V. The bound on GE for DTs is rederived and is compared with that of Vnets in Section VI. Section VII makes some
comments on and discusses some possible extensions of the presented results. Finally, the paper ends with a summary in Section VIII.
II. VORONOI NETWORKS
The task of pattern recognition is to learn a mapping from the
feature space to the set of class labels. The learning takes place
with respect to a set of training samples.1 The mapping that is
learned should be able to map unknown samples (samples that
are not present in the training set) into their correct class labels.
The ultimate aim of learning is to find a map that minimizes the
average misclassification over the feature space. This is a well
studied topic in pattern recognition literature [24]. The map that
has the least average misclassification error is the BDR. A Vnet
approximates the Bayes decision regions by Voronoi regions.
A Vnet discretizes the feature space into disjoint convex sets,
viz., Voronoi regions, and assigns each region to a class label. Let
$\mathcal{X} \subseteq \mathbb{R}^d$ denote the feature space and $\mathcal{Y} = \{1, 2, \ldots, c\}$ the
set of class labels. Each Voronoi region is specified by a point,
viz., a Vcenter, in $\mathcal{X}$. Let $\mu = (\mu_1, \ldots, \mu_k)$ be a set of $k$ Vcenters. Then a Vnet is defined as

$g_{\mu,\lambda}(x) = \sum_{i=1}^{k} \lambda_i \, 1_{V_i}(x) \qquad (1)$

where $\lambda = (\lambda_1, \ldots, \lambda_k)$, $\lambda_i \in \mathcal{Y}$, is the vector of class labels assigned to the Voronoi regions, $1_{V_i}$ is the indicator function of the set $V_i$, and the sets $V_i$ (and the boundary sets $S_i$) are as defined below.
1The word “samples” is used to refer to a pair of a feature vector and its class
label as well as to a feature vector alone. The particular reference would be clear
from the context of usage.
For $i = 1, \ldots, k$, let $U_i = \{x \in \mathcal{X} : \|x - \mu_i\| \le \|x - \mu_j\| \ \forall j\}$,
where $\|\cdot\|$ is the Euclidean norm, and let $S_i$ denote the part of the boundary of
$U_i$ shared with some $U_j$, $j < i$; i.e., if $x$ is equidistant from two or more Vcenters, then $x$ is
assigned to the region with the least index. Then define

$V_i = U_i \setminus S_i. \qquad (2)$

The $V_i$ are convex sets. Observe that $V_1, \ldots, V_k$ partition $\mathcal{X}$
into $k$ disjoint convex sets,2 i.e., $V_i \cap V_j = \emptyset$ for $i \neq j$ and $\bigcup_{i=1}^{k} V_i = \mathcal{X}$.
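As a concrete illustration of (1) and (2), the following sketch (using the notation reconstructed above; `centers` plays the role of $\mu$ and `region_labels` the role of $\lambda$) evaluates a Vnet by finding the Voronoi region that contains a point and returning that region's class label. Ties are broken toward the least index, as in the definition of the sets $S_i$.

```python
import numpy as np

def vnet_predict(x, centers, region_labels):
    """Evaluate g_{mu,lambda}(x): return the label of the Voronoi region containing x.

    centers:       (k, d) array of Vcenters mu_1, ..., mu_k
    region_labels: (k,) array of class labels lambda_1, ..., lambda_k
    """
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared Euclidean distances
    i = int(np.argmin(d2))                    # argmin returns the least index on ties,
    return region_labels[i]                   # matching the tie-breaking rule for S_i

# A Vnet with three Voronoi regions assigned to two classes
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
region_labels = np.array([1, 2, 1])
print(vnet_predict(np.array([1.9, 0.2]), centers, region_labels))  # -> 2
```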
A. Learning Vnets
Learning an input–output map using a Vnet involves finding
the Vcenters $\mu = (\mu_1, \ldots, \mu_k)$, together with the labels $\lambda$, from a set of $n$ training samples $\mathcal{D} = \{(x_j, y_j) : j = 1, \ldots, n\}$, such that the
misclassification error of the resulting map over the training set
is minimized. It may be noted that the NNC is an instance of a Vnet,
with $k = n$, $\mu_j = x_j$, and $\lambda_j = y_j$.
The empirical risk, or the average misclassification error over
$\mathcal{D}$, of a Vnet with Vcenters $\mu$ and labels $\lambda$ is given by

$\hat{R}_n(\mu, \lambda) = \frac{1}{n} \sum_{j=1}^{n} \rho\bigl(g_{\mu,\lambda}(x_j), y_j\bigr) \qquad (3)$

where $g_{\mu,\lambda}$ is as defined in (1) and $\rho$ is the discrete metric
on $\mathcal{Y}$ given by

$\rho(y, y') = \begin{cases} 0, & \text{if } y = y' \\ 1, & \text{if } y \neq y'. \end{cases} \qquad (4)$

The learning problem is to find the pair $(\mu, \lambda)$ that minimizes
$\hat{R}_n(\mu, \lambda)$. It turns out that for a fixed $\mu$, the vector $\lambda$ that minimizes
$\hat{R}_n(\mu, \lambda)$ is uniquely determined, so the problem of finding
$(\mu, \lambda)$, in effect, boils down to finding $\mu$ alone.
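A short sketch of this remark, under the reconstructed notation above: for fixed Vcenters, the label of each region that minimizes the empirical risk (3) can be obtained region by region (here by a majority vote of the training samples falling in the region; empty regions are given an arbitrary label).

```python
import numpy as np
from collections import Counter

def fit_labels_and_risk(centers, X, y):
    """For fixed Vcenters, choose each region's label by majority vote of the
    training samples assigned to it; return (lambda, empirical risk as in (3))."""
    assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    lam = np.zeros(len(centers), dtype=int)
    for i in range(len(centers)):
        members = y[assign == i]
        lam[i] = Counter(members).most_common(1)[0][0] if len(members) else 0
    emp_risk = float(np.mean(lam[assign] != y))   # average 0/1 loss over the training set
    return lam, emp_risk

# Toy data: two clusters, one per class
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [4.0, 4.0], [4.2, 3.9]])
y = np.array([0, 0, 0, 0, 1, 1])
centers = np.array([[0.5, 0.5], [4.0, 4.0]])
print(fit_labels_and_risk(centers, X, y))   # -> (array([0, 1]), 0.0)
```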
As mentioned in the previous section, there exist many algorithms that try to find $\mu$. As in the case of any classification
model, there exists a tradeoff between the average misclassification error over the training set and the probability of misclassification, which reflects the generalization capability of Vnets. For
example, one can easily show, using a simple example, that a
network with $k = n$ has a higher probability of misclassification than an optimal net with $k < n$, when $k$ is properly
chosen [7]. Thus the problem of finding $k$ in Vnets is related
to the well-studied problem of structural risk minimization in neural networks
[18]. The results presented in this paper throw some light on this
issue.
III. PROBLEM FORMULATION
Let
be the feature space and
be the finite set of class labels. Denote
as and
as
. Let
be a -field of subsets of and
be the power set
be the smallest -algebra in
of . Let
which contains every set of the form
,
, and
. Let be a probability measure on
. Let
,
be the probability measures induced by on
,
and
, respectively. Let be a
measurable map from
and
into . Then the average misclassification error, referred to
as expected risk, incurred using is given by
2The sets $S_i$ have been defined to make the definition precise. If the underlying probability measure is absolutely continuous with respect to the Lebesgue measure, then
these $S_i$'s can be ignored.
The aim of learning is to find a $g$ that minimizes $R(g)$. Let

$g^*(x) = i \ \text{ if } x \in A_i, \quad i = 1, \ldots, c \qquad (5)$

be a BDR that minimizes $R(\cdot)$, i.e., $R(g^*) = \min_g R(g)$, where
the Bayes decision regions $A_1, \ldots, A_c$ are appropriately defined. Since we are interested in analyzing how well the learned rule approaches a BDR, the class of
BDRs that can be “efficiently” approximated by the rules generated by Vnets is characterized, instead of characterizing the
space of probability measures.
Since $P$ is unknown, $R(g)$ is unknown. The only source of information available is that of random samples drawn according to
$P$. Let $\mathcal{D} = \{z_j = (x_j, y_j) : j = 1, \ldots, n\}$ be the set of $n$ training samples
randomly drawn according to $P$. Using this data set, the
expected risk can be approximated by the empirical risk

$\hat{R}_n(g) = \frac{1}{n} \sum_{j=1}^{n} \rho\bigl(g(x_j), y_j\bigr). \qquad (6)$

A common strategy is to estimate the BDR as the mapping that
minimizes the empirical risk.
Remark III.1: It is to be noted that, under fairly general assumptions, the empirical risk converges in probability to the expected risk for each given $g$, and not for all $g$ simultaneously.
Therefore, it is not guaranteed that the minimum of the empirical risk will converge to the minimum of the expected risk as
$n$ tends to infinity. Therefore, as pointed out and analyzed in the
fundamental work of Vapnik and Chervonenkis [18], [19], the notion of uniform convergence in probability has to be introduced;
it will be discussed later.
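The pointwise convergence mentioned in Remark III.1 can be visualized with a small simulation. The sketch below is an illustration only, with a made-up one-dimensional two-class distribution: it estimates the empirical risk (6) of one fixed rule for increasing $n$ and compares it with the expected risk estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Two classes on the real line: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1), equally likely."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)
    return x, y

def g(x):
    """A fixed decision rule: threshold at 0."""
    return (x > 0).astype(int)

# Expected risk of g, estimated once with a very large Monte Carlo sample
xb, yb = sample(1_000_000)
expected_risk = float(np.mean(g(xb) != yb))

# Empirical risk on training sets of increasing size approaches the expected risk
for n in [10, 100, 1000, 10000]:
    x, y = sample(n)
    print(n, float(np.mean(g(x) != y)), "vs", round(expected_risk, 3))
```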
The standard practice in pattern recognition using neural networks is that the number of parameters of the network is fixed
and $\hat{R}_n$ is minimized over the parameter space, i.e., over the
space of maps generated by networks with a fixed number of
parameters. In the case of Vnets, the number of parameters is
proportional to the number of Voronoi regions, or hidden units. Let $\mathcal{G}_k$ be the space of maps generated by Vnets
with $k$ hidden units, i.e.,

$\mathcal{G}_k = \{ g_{\mu,\lambda} : \mu \in \mathcal{X}^k, \ \lambda \in \mathcal{Y}^k \} \qquad (7)$

with the $V_i$ as in (2). [From now on, a Vnet is denoted simply by $g$, omitting the variables
$\mu$ and $\lambda$.] Then $\hat{R}_n$ is minimized over $\mathcal{G}_k$, i.e., $g^*$ is approximated by the function $\hat{g}_{k,n}$ defined as

$\hat{g}_{k,n} = \arg\min_{g \in \mathcal{G}_k} \hat{R}_n(g). \qquad (8)$

Assuming that the problem of learning $\hat{g}_{k,n}$ from $\mathcal{D}$, viz., the
minimization of $\hat{R}_n$ over $\mathcal{G}_k$, is solved, we are interested in
finding out the “goodness” of $\hat{g}_{k,n}$, apart from finding out the
class of problems that can be solved “efficiently” using Vnets.
The problem of finding $\hat{g}_{k,n}$ is addressed by the various learning
algorithms mentioned in Section I. The goodness of $\hat{g}_{k,n}$ is
measured in terms of its expected risk $R(\hat{g}_{k,n})$, which is equal
to its probability of misclassification. Since $g^*$ is the best possible mapping, i.e., it minimizes the probability of misclassification, $R(\hat{g}_{k,n}) \ge R(g^*)$. The above issues are analyzed below
by bounding the difference $R(\hat{g}_{k,n}) - R(g^*)$, which is referred
to as the generalization error (GE).
Remark III.2: One expects $\hat{g}_{k,n}$ to become a better and
better estimate as $k$ and $n$ go to infinity. In fact, when $n$ increases, the estimate of the expected risk, and hence the
estimator $\hat{g}_{k,n}$, improves. But when $k$ increases, the number
of parameters to be estimated increases with the same amount
of data, and hence the estimate of the expected risk deteriorates.
In this situation, the estimate can be improved by seeking more
data. Therefore, $k$ and $n$ have to grow as functions of each
other for convergence to occur. The relation between $k$ and $n$
required for convergence is derived as a corollary of the main theorem of this paper.
IV. BOUND ON GENERALIZATION ERROR
There are two factors involved in the approximation of $g^*$
by $\hat{g}_{k,n}$: viz., how accurately $g^*$ can be approximated by
maps with $k$ Voronoi regions and, given a data set of size $n$, how
accurately $\hat{g}_{k,n}$ approximates the function in $\mathcal{G}_k$ closest
to $g^*$ with respect to the metric $d_{P_{\mathcal{X}}}$ [defined below in (10)] for
a given $P$. To study these factors, let $g_k^*$ be the mapping with
$k$ Vcenters having the least expected risk, i.e.,

$g_k^* = \arg\min_{g \in \mathcal{G}_k} R(g).$

Then it follows from the definitions that $R(g^*) \le R(g_k^*) \le R(\hat{g}_{k,n})$, and

$R(\hat{g}_{k,n}) - R(g^*) = \bigl[R(g_k^*) - R(g^*)\bigr] + \bigl[R(\hat{g}_{k,n}) - R(g_k^*)\bigr]. \qquad (9)$

The first term on the right-hand side (RHS) of (9) is referred
to as the AE since it is the error in approximating $g^*$ by the
functions in $\mathcal{G}_k$. The second term is referred to as the EE since
$\hat{g}_{k,n}$ uses the $n$ observations in $\mathcal{D}$ to estimate the Voronoi centers associated with $g_k^*$ by minimizing the empirical risk $\hat{R}_n$.
We bound the generalization error by bounding these two terms
separately.
A. Bound on Approximation Error
Notice that the approximation error does not depend on
$n$; it depends only on $k$. Therefore, to bound the approximation error, the space of BDRs for which the approximation
error asymptotically goes to zero with $k$ is characterized. In
particular, our interest is in those $g^*$ for which the approximation
error is bounded by a quantity that goes to zero faster than some power
$k^{-\alpha}$, $\alpha > 0$, as $k$ goes to infinity.
This problem is analyzed using a metric defined on the space
of maps from $\mathcal{X}$ into $\mathcal{Y}$. For a given probability measure
$P_{\mathcal{X}}$ on $\mathcal{X}$, the discrete metric $\rho$, defined in (4), can be
extended to a pseudometric $d_{P_{\mathcal{X}}}$ on this space as follows:

$d_{P_{\mathcal{X}}}(g, g') = \int_{\mathcal{X}} \rho\bigl(g(x), g'(x)\bigr) \, dP_{\mathcal{X}}(x). \qquad (10)$

It may be noted that, for any $g$,

$R(g) - R(g^*) \le d_{P_{\mathcal{X}}}(g, g^*). \qquad (11)$

Therefore, in this section, $d_{P_{\mathcal{X}}}(g_k^*, g^*)$ is bounded to bound
the AE. This bound depends on $g^*$, especially on the $\epsilon$-coarse
index function of $g^*$. Before formally deriving the bound, the
intuition behind defining the $\epsilon$-coarse index function of a mapping
is described below.
Any two neighboring Vcenters generate a hyperplanar decision boundary. Since these hyperplanes can efficiently approximate any surface that is smooth enough, it is intuitive
that a BDR with smooth decision boundary surfaces can be efficiently approximated by Vnets. On the other hand, considering
the contribution of a Voronoi region to the average misclassification error, if a Voronoi region is a proper subset of some $A_i$ [see (5)],
then the error made in this region is zero. Therefore,
the number of such Voronoi regions, if each region is equally
probable, should reflect how close the mapping generated by
the Vnet is to the BDR being approximated, and hence the
approximation error. The relation between this number and the
smoothness of the decision boundary surfaces is brought out in
Section V.
Definition IV.1: Let $f$ be a measurable function from $\mathcal{X}$ into $\mathcal{Y}$,
and let $A_1, \ldots, A_c$ denote the preimages of $f$ corresponding to the
classes, i.e., $A_i = \{x \in \mathcal{X} : f(x) = i\}$. Let $P_{\mathcal{X}}$ be a probability
measure on $\mathcal{X}$. Then a set $B \subseteq \mathcal{X}$ is said to be $f$-pure
with respect to $P_{\mathcal{X}}$ if $P_{\mathcal{X}}(B \cap A_i) = P_{\mathcal{X}}(B)$ for some $i$; $B$ is said
to be $f$-impure if $B$ is not $f$-pure.
Let $\mathcal{H}$ be a set of hypercubes formed by
axis-parallel hyperplanes, with $m_i + 1$ hyperplanes perpendicular to the $i$th axis, $i = 1, \ldots, d$. If $P_{\mathcal{X}}$ is absolutely continuous with respect to the Lebesgue measure (denoted
by $\nu$), then the hyperplanes are chosen such that the probability of each hypercube is less than $\epsilon$. Otherwise, the hyperplanes are chosen such that the probability of each hypercube is
less than $\epsilon$, except for those containing points of probability
measure greater than $\epsilon$; if a hypercube contains a point
of probability greater than $\epsilon$, then the probability of such a hypercube is at most the probability of that point plus $\epsilon$. Then, the minimum number of $f$-impure sets in $\mathcal{H}$, the minimum being taken over all admissible $\mathcal{H}$, is denoted by $N_f(\epsilon)$ and is called the $\epsilon$-coarse index function of $f$.
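The $\epsilon$-coarse index function can be estimated numerically for simple examples. The sketch below is an illustration only, using the notation $N_f(\epsilon)$ as reconstructed above and a uniform grid on the unit square (a special case of the hypercube partitions in Definition IV.1 when the measure is uniform); it counts the grid cells that contain points of both classes for a toy curved decision boundary.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(points):
    """Toy two-class BDR on the unit square: class depends on the side of a curved boundary."""
    return (points[:, 1] > 0.5 + 0.2 * np.sin(4 * np.pi * points[:, 0])).astype(int)

def impure_cell_count(m, samples_per_cell=200):
    """Count cells of an m x m uniform grid that contain points of both classes.
    With a uniform measure on the unit square each cell has probability 1/m^2,
    so this mimics N_f(eps) for eps = 1/m^2 (up to the minimization over grids)."""
    impure = 0
    for i in range(m):
        for j in range(m):
            low = np.array([i, j]) / m
            pts = low + rng.random((samples_per_cell, 2)) / m   # random points inside the cell
            labels = f(pts)
            impure += int(labels.min() != labels.max())          # both classes present
    return impure

for m in [4, 8, 16, 32]:
    print(f"m={m:3d}  eps={1 / m**2:.4f}  impure cells ~ {impure_cell_count(m)}")
```

For a smooth one-dimensional boundary in the plane, the count grows roughly linearly with $m$, which is the kind of growth the generalized fractal dimension of Section V is meant to capture.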
Theorem IV.1: Let $P$ be a probability measure on $(\mathcal{Z}, \mathcal{S})$.
Then, for every measurable $f : \mathcal{X} \to \mathcal{Y}$ and every $\epsilon > 0$, there exists a Vnet $g$, with one Vcenter per hypercube of a partition as in Definition IV.1, such
that

$d_{P_{\mathcal{X}}}(f, g) \le \epsilon \, N_f(\epsilon).$

Proof: This theorem is proved by constructing, for
a given $f$ and $\epsilon$, a $g$ that satisfies the above bound. For a fixed set of Vcenters $\mu$ and labels $\lambda$, the error contributed by a Voronoi region $V_i$ is $P_{\mathcal{X}}(V_i) - P_{\mathcal{X}}(V_i \cap A_{\lambda_i})$;
hence the label $\lambda_i$ that minimizes this contribution is the class whose region carries the largest probability mass within $V_i$. It is to be
noted here that if $V_i$ is an $f$-pure set with respect to $P_{\mathcal{X}}$, i.e., $P_{\mathcal{X}}(V_i \cap A_j) = 0$
for all $j$ except one, then the error contributed by $V_i$ is zero. Therefore
the hypercubes of Definition IV.1 are an appropriate choice for
bounding the error. For a
given $\epsilon$, consider a set of hypercubes $\mathcal{H}$ containing $N_f(\epsilon)$ $f$-impure sets
with respect to $P_{\mathcal{X}}$. Since these hypercubes can be obtained as the Voronoi regions of a suitably placed set of
Vcenters, choose the Vcenters accordingly and let $\lambda_i$ be the class carrying the largest probability mass in the $i$th hypercube.
Now observe that the hypercubes partition $\mathcal{X}$, that the $f$-pure hypercubes contribute no error, and that, in the worst case, each $f$-impure hypercube contributes an error of at most its own probability, which is at most $\epsilon$. If $P_{\mathcal{X}}$ is not absolutely
continuous with respect to $\nu$ and an $f$-impure hypercube contains
a point of probability measure greater than $\epsilon$, then assigning that hypercube the class of this point again leaves an error of at most $\epsilon$. From the above construction and observations, we get
$d_{P_{\mathcal{X}}}(f, g) \le \epsilon N_f(\epsilon)$,
thus proving the theorem.
The above theorem, because of (11), implies that for a given
$k$ and $P$, there exists a $g$ in $\mathcal{G}_k$ such that

$R(g) - R(f) \le \epsilon \, N_f(\epsilon) \qquad (12)$

for any $\epsilon$ admitting a partition as in Definition IV.1 with at most $k$ hypercubes. Therefore, the rate at which this bound approaches zero as $k$
tends to infinity depends on $N_f(\epsilon)$, which in turn depends
on the underlying $P$ and $f$. We relate $N_f(\epsilon)$ to the generalized fractal dimension (defined later) of the set of points on the
decision boundaries of $f$ and characterize the rate of decrease
of the bound in terms of this dimension in Section V.
B. Bound on Estimation Error
Since $R(\hat{g}_{k,n})$ is a random variable, the EE can be
bounded only in a probabilistic sense, i.e., with probability at least $1 - \delta$
the EE is at most a quantity $\epsilon(k, n, \delta)$ defined through (13). The EE can therefore be bounded if the deviation between the expected risk and the empirical risk is bounded uniformly over $\mathcal{G}_k$. We bound
this uniform deviation, and consequently bound $\epsilon(k, n, \delta)$, using results
from statistical learning theory [25].
One expects $\hat{R}_n(g)$ to converge to $R(g)$ as one gets more and
more data. This happens if $\hat{R}_n$ converges to $R$ uniformly
in probability over $\mathcal{G}_k$ [18]. The following proposition gives the relationship
between the estimation error and the uniform deviation $\sup_{g \in \mathcal{G}_k} |R(g) - \hat{R}_n(g)|$.
Proposition IV.2: [25] If $\sup_{g \in \mathcal{G}_k} |R(g) - \hat{R}_n(g)| \le \epsilon$,
then $R(\hat{g}_{k,n}) - R(g_k^*) \le 2\epsilon$.
Thus the EE is bounded if the difference between the empirical
risk and the true risk is bounded uniformly over $\mathcal{G}_k$. This problem
is converted to that of analyzing the convergence of the empirical
estimates of the expectations of a class of random variables to their
true values, as follows. Let $\mathcal{F}$ be the class of $[0, 1]$-valued functions
on $\mathcal{Z}$ derived from $\mathcal{G}_k$ through the metric $\rho$, viz.,

$\mathcal{F} = \{ f_g : f_g(z) = \rho\bigl(g(x), y\bigr), \ g \in \mathcal{G}_k \}. \qquad (14)$

Then, for any $g \in \mathcal{G}_k$, the empirical expectation of $f_g$ based on $\mathcal{D}$ is $\hat{R}_n(g)$, and the true expectation of $f_g$ under the probability distribution $P$ is $R(g)$. Bounding the error between the empirical and true expectations of a class of random variables is a well-studied topic [19].
Let $q(k, n, \epsilon)$ denote the probability that the empirical expectations of the functions in $\mathcal{F}$ deviate from their true expectations by more than $\epsilon$, uniformly over $\mathcal{F}$ (15). Then the following proposition directly follows from the definitions of $\epsilon(k, n, \delta)$ and $q(k, n, \epsilon)$ and from Proposition IV.2.
Proposition IV.3: If $q(k, n, \epsilon) \le \delta$, then $\epsilon(k, n, \delta) \le 2\epsilon$.
Therefore, it is enough to bound $q(k, n, \epsilon)$ in order to bound the EE. Let

$\mathcal{E} = \{ E_g : g \in \mathcal{G}_k \}, \quad E_g = \{ z = (x, y) \in \mathcal{Z} : g(x) \neq y \}. \qquad (16)$
Since each $f_g$ is the indicator function of the set $E_g$, the convergence of the empirical expectations of the functions in $\mathcal{F}$ to their true expectations is the same as the convergence of the empirical probabilities to the true probabilities of the
corresponding subsets of $\mathcal{Z}$. It may be noted that $q(k, n, \epsilon)$ can
be bounded using the VC-dimension of $\mathcal{E}$ [19]. Ben-David et
al. [21] gave a relation between the VC-dimension of $\mathcal{E}$ and
a generalized dimension of $\mathcal{G}_k$; therefore $q(k, n, \epsilon)$ can be bounded by
bounding this dimension of $\mathcal{G}_k$. It can also be bounded by
considering $\mathcal{F}$ as a family of real-valued functions and using
the pseudodimension of $\mathcal{F}$, as given in [13]. We have derived
the bounds using both techniques [7] and found that the
results obtained using the latter approach are better than those
obtained using the former. The derivation of the bound using
the pseudodimension of $\mathcal{F}$ is given below; it follows along
lines similar to those of [16]. Some definitions and
results that are useful in the derivation of the bound on $q(k, n, \epsilon)$ are
given in Appendix A.
Consider the family of functions $\mathcal{F}$ defined in (14). Then Theorem A.4 implies the bound (17) on $q(k, n, \epsilon)$, in terms of the expectation of the covering number of the $\bar{z}$-restriction of $\mathcal{F}$, $\mathcal{F}_{|\bar{z}} = \{(f(z_1), \ldots, f(z_n)) : f \in \mathcal{F}\}$, under the average ($L^1$) metric defined in Appendix A. The following lemma, whose
proof has been relegated to Appendix B for continuity of
presentation, relates the expectation of this covering number to the metric capacities of $\mathcal{F}$ and of $\mathcal{G}_k$.
Lemma IV.4: The expectation of the covering number of $\mathcal{F}_{|\bar{z}}$ is bounded by the metric capacity of $\mathcal{F}$, which in turn is bounded by the metric capacity of $\mathcal{G}_k$.
Using the above lemma and taking the supremum on both sides
of (17) over all $P$, the quantity $q(k, n, \epsilon)$ defined in (15) can be bounded,
as in (18), in terms of the metric capacity of $\mathcal{G}_k$.
To find a bound on the metric capacity of $\mathcal{G}_k$, $\mathcal{G}_k$ is decomposed as the
composition of two classes of functions. Let $\mathcal{A}_k$ be the class
of functions from $\mathcal{X}$ to $\{e_1, \ldots, e_k\}$ generated by the $k$ hidden neurons of a
Vnet, where $e_i$ is the $k$-dimensional
unit vector in the $i$th dimension, i.e., the $i$th component of $e_i$ is one
and the rest of the components are zeros; each $a_\mu \in \mathcal{A}_k$ maps $x$ to $e_i$ if $x \in V_i$,
where the $V_i$ are defined as in (2). Let $\Lambda_k$ be the class of functions, from $\{e_1, \ldots, e_k\}$ to $\mathcal{Y}$, generated by the single output neuron of the
network. Note that every $g \in \mathcal{G}_k$ can be written as $g = \lambda \circ a_\mu$ with $\lambda \in \Lambda_k$ and $a_\mu \in \mathcal{A}_k$. The following lemma bounds the metric capacity of $\mathcal{G}_k$ in
terms of the metric capacity of $\mathcal{A}_k$. The proof of this lemma is
given in Appendix B.
Lemma IV.5: The metric capacity of $\mathcal{G}_k$ is bounded in terms of the metric capacity of $\mathcal{A}_k$.
Since each element of $\mathcal{A}_k$ is a vector of indicator functions of some class of subsets, some results from Appendix A can be used to derive a
bound on the metric capacity of $\mathcal{A}_k$, thus bounding that of $\mathcal{G}_k$.
Theorem IV.6: The metric capacity of $\mathcal{G}_k$ is bounded by a quantity that grows at most polynomially in $1/\epsilon$, with an exponent determined by $k$ and $d$.
Proof: Let $\mathcal{C}$ be the class of subsets of the form defined
in Section II, i.e., the Voronoi regions, and note that $\mathcal{A}_k$ is determined by the indicator functions of the sets in $\mathcal{C}$. Because this is a space of indicator functions, its pseudodimension equals the VC dimension of $\mathcal{C}$. Since $\mathcal{C}$ is a class of subsets where each subset is
an intersection of at most $k - 1$ half spaces, Lemmas A.1 and
A.2 imply that this VC dimension is at most $2(d + 1)(k - 1)\log\bigl(3(k - 1)\bigr)$. Substituting this in (34), we obtain a bound on the metric capacity of $\mathcal{A}_k$ and, therefore, because of
Theorem A.5 and Lemma IV.5, the stated bound on the metric capacity of $\mathcal{G}_k$.
Theorem IV.7: With probability $1 - \delta$, the bound (20) on the EE holds.3
Proof: Substituting the bound of Theorem IV.6 for the metric capacity of $\mathcal{G}_k$ in (18), we get (19). From Proposition IV.3, the EE is at most $2\epsilon$ with probability $1 - \delta$ if the right-hand side of (19) is at most $\delta$, i.e., if the inequality (21) is satisfied; this in turn requires the inequality (22) to be satisfied. It is shown that (22) is satisfied if (23) is true; choosing $\epsilon$ as the value determined by (23) therefore satisfies (22), thus proving the theorem.
3In general, these bounds are known to give very large numbers if the different
variables in the bounds are substituted with their values. However, they are qualitatively tight in the sense that there exist some distributions for which the bound
is attained. Therefore, in this paper, the bound is found only in order notation.
C. The Main Theorem
Theorem IV.8: Let $P$ be a probability measure on $(\mathcal{Z}, \mathcal{S})$.
Let $g^*$ be a BDR corresponding to $P$ and let $N_{g^*}(\epsilon)$ be the
$\epsilon$-coarse index function of $g^*$. Let $\mathcal{D} = \{z_j : j = 1, \ldots, n\}$
be a set of independent training examples drawn according to $P$, and let $\hat{g}_{k,n}$ be as defined in (8). Then, with probability at least $1 - \delta$, the generalization error $R(\hat{g}_{k,n}) - R(g^*)$ is bounded by the sum of the AE bound of Theorem IV.1 and the EE bound of Theorem IV.7. (24)
Proof: The theorem follows from (9), Theorem IV.1, and
Theorem IV.7.
V. IMPLICATIONS OF THE THEOREM
In this section, various implications of the main theorem of
this paper are discussed.
A. Convergence of Vnet to Bayes Decision Rule
Under the assumption that the AE asymptotically converges to
zero with $k$, which is justified in the next section, the discussion
in the sequel implies that if $k \to \infty$ and $k$ grows more slowly than $\sqrt{n}$, then the
bound on GE in (24) asymptotically goes to zero. Substituting
this growth condition into the expression for the EE bound [refer to (13)] in (20)
and letting $n$ tend to infinity, the EE bound vanishes.
Therefore, a Vnet asymptotically converges to a BDR with arbitrarily high probability if the number of Voronoi regions grows
more slowly than the square root of the number of training samples.
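In symbols (using the notation reconstructed above), the convergence statement of this subsection can be summarized as follows; the constants in the EE bound are omitted, as the paper works only in order notation.

```latex
% Convergence of Vnets to the Bayes decision rule (schematic form)
% k(n): number of Vcenters used with n training samples
\[
  k(n) \to \infty
  \quad\text{and}\quad
  \frac{k(n)}{\sqrt{n}} \to 0
  \;\;\Longrightarrow\;\;
  R\bigl(\hat{g}_{k(n),n}\bigr) \xrightarrow{\;P\;} R(g^{*}),
\]
where the first condition drives the approximation error to zero and the
second drives the estimation error to zero.
```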
B. How Efficiently can Vnets Solve a Problem?
The efficiency of learning a given pattern recognition task $g^*$
from a finite set of training samples can be measured by the
rate of convergence of $R(\hat{g}_{k,n})$ to $R(g^*)$: the larger the rate,
the greater the efficiency. It follows from (24) that, for a fixed
number of Vcenters $k$, as the number of training samples $n$
increases, the bound on EE monotonically goes to zero, whereas the bound on AE remains constant since it is independent of
$n$. Therefore, the rate of convergence of $R(\hat{g}_{k,n})$ to $R(g^*)$ also
depends on the rate of decrease of the bound on AE, and hence
of the AE, to zero. The only parameter in the bound on GE that depends on the problem
under consideration, viz., $N_{g^*}(\epsilon)$ (refer to Definition IV.1), comes from the bound on AE; the bound on
EE is independent of $g^*$. So, $N_{g^*}(\epsilon)$ is analyzed to find
the efficiency of learning by Vnets. This analysis also gives an
insight into the approximation capabilities of Vnets.
Definition V.1: Let $P_{\mathcal{X}}$ be a probability measure on $\mathcal{X}$ and let
$f$ be a measurable function from $\mathcal{X}$ into $\mathcal{Y}$. Define $D_f$ as the set of points $x \in \mathcal{X}$ such that $B(x, \epsilon)$, the $\epsilon$-ball around $x$ in $\mathcal{X}$, is $f$-impure for every $\epsilon > 0$, and let (25) denote the growth exponent of $N_f(\epsilon)$ as $\epsilon$ goes to zero.
Then this exponent, if the limit defining it exists, is called the
generalized fractal dimension (GFD) of $D_f$ with respect to $P_{\mathcal{X}}$,
denoted by $\mathrm{GFD}_{P_{\mathcal{X}}}(D_f)$. It is called so because it is a generalization of the fractal
dimension, which is the same as the upper entropy index defined
in [26]. The fractal dimension is defined on compact sets
using the Lebesgue measure; here, the GFD is defined with regard
to $P_{\mathcal{X}}$ and $f$. However, the above definition can be appropriately
modified to define the dimension of an arbitrary set with
regard to a given $P_{\mathcal{X}}$. In this modified definition, $\mathrm{GFD}_{P_{\mathcal{X}}}(D_f)$ is equal to the fractal
dimension of $D_f$ when $P_{\mathcal{X}}$ has a compact support and is uniform.
Writing the bound on AE in terms of $\mathrm{GFD}_{P_{\mathcal{X}}}(D_{g^*})$, we get the rate given in (26).
It may be observed that $\mathrm{GFD}_{P_{\mathcal{X}}}(D_{g^*})$ always lies between 0 and $d$, and (26) says that the nearer the GFD
stays to $d$, the slower is the rate of decrease of the bound on AE
with $k$.
To analyze this further, assume that $P_{\mathcal{X}}$ has a compact support and is
uniform. Then the higher the space-filling property of $D_{g^*}$, the larger the value
of the GFD, and hence the greater the difficulty of approximating $g^*$ by
Vnets. For a given $D_{g^*}$, if the probability mass of the regions neighboring the points of $D_{g^*}$ is small, then $N_{g^*}(\epsilon)$ is small,
making the rate of decrease of the bound large. In conclusion, as
should be expected, the approximation of a mapping by Vnets
depends on the space-filling property of its decision boundaries and on the probability
distribution over the regions around them.
C. Optimal Growth Rate of $k$
It is shown above that, for $R(\hat{g}_{k,n})$ to converge to $R(g^*)$, the
number of Voronoi regions should grow more slowly than the square
root of the number of training samples. Then the question arises:
does there exist an optimal growth rate of $k$ with respect to $n$?
By the optimal growth, we mean the relation $k_{\mathrm{opt}}(n)$ between
$k$ and $n$ such that, for a fixed number of training samples $n$,
the Vnet with $k_{\mathrm{opt}}(n)$ Vcenters minimizes the bound
on GE. The existence of such a $k_{\mathrm{opt}}(n)$ is quite intuitive from
Remark III.2. This is reflected in the bound on GE (24). To get
an explicit expression for $k_{\mathrm{opt}}(n)$, the minimum of the bound
on GE, which is assumed to be close to the minimum of GE itself, is
found by setting the derivative of the bound with respect to $k$ equal to zero. Applying the derivative, neglecting the less significant terms, and after some algebra, we get that, for large $n$, the estimate of $k_{\mathrm{opt}}(n)$ is proportional to the power of $n$ given in (27).
The expression (27) suggests one more interesting behavior of the optimal growth rate of $k$. When the GFD is larger,
i.e., the problem under consideration is more difficult (refer to the
previous subsections), the optimal growth rate of $k$ is higher.
That is, the number of Vcenters in a Vnet can grow faster as the
number of training samples increases when solving more difficult problems. This is indeed desirable, because larger $k$'s make
the AE small.
D. Dependence on the Dimension of $\mathcal{X}$
It is known from the theory of linear and nonlinear widths
[27] that if the target function has $d$ variables and a degree of
smoothness $s$, one should not expect the approximation error to go to zero faster than $O(m^{-s/d})$, where $m$ is the number
of parameters that have to be estimated. Here
the degree of smoothness is a measure of how constrained the
class of functions is; for example, the number of derivatives
that are uniformly bounded or the number of derivatives that are
integrable or square integrable. Therefore, from classical approximation theory, one expects that unless certain constraints
are imposed on the class of functions to be approximated, the
rate of convergence will dramatically slow down as the dimension increases, showing the phenomenon known as the “curse
of dimensionality.”
The GFD of the decision boundaries reflects the frequency
of variation of $g^*$ on its domain, and the exponent appearing in (26),
in a way, characterizes its smoothness. In the case of Vnets, the AE
goes to zero at the rate given by (26). Therefore, the above
conclusions from classical approximation theory are valid
in the case of Vnets also, with this exponent playing the role of the smoothness parameter.
It is interesting to note the following observation at this point.
The expression in (27) suggests that the optimal growth rate of $k$ increases from 1/8 to 1/6 as $d \to \infty$.
That is, $k$ can be increased at a faster rate for higher dimensions while maintaining the optimal performance. Nevertheless, the
increase in the optimal rate of $k$ cannot compensate the AE because of the dimension-dependent term in the exponent of $k$ in the bound on AE.
In fact, for higher dimensions the optimal growth rate of $k$ is
almost independent of the dimension of $\mathcal{X}$ because the optimal
rate comes very close to 1/6 even for moderate values of $d$.
E. Tradeoff Between $k$ and $n$
Theorem IV.8 suggests that there exist various choices of $k$
and $n$ for given specifications of the confidence and accuracy parameters. This is due to the tradeoff between AE and EE. When a
large number of centers and a small number of training samples are
chosen, the resulting EE could be high, but this can be compensated by a low AE, and vice versa, in order to keep GE constant.
If the training samples are expensive to obtain, one
could use a larger number of centers with the same bound on GE;
or, if the network size is expensive, one could use a larger number
of training samples. Thus the economics of this tradeoff would
yield suitable values of $k$ and $n$ to solve the pattern recognition problem with the required accuracy and confidence. Geman
et al. termed this tradeoff the bias/variance dilemma [15].
F. Sample Complexities
Finally, we consider the bounds on the sample complexity of
Vnets. As defined in Section IV-B, the sample complexity reflects
the number of samples sufficient to learn a mapping with
given accuracy and confidence requirements. From the quantities defined in Theorem IV.6, the expression in (22) yields the number of training samples sufficient for the EE to be within a prescribed accuracy and confidence.
VI. BOUND ON GE OF DECISION TREES
The classification model that shares a similar, though not identical, structure with Vnets is the DT classifier [22]. A DT is a
binary tree. Each nonleaf node in the tree is associated with
a split rule that decides whether a given pattern will go to the
left or the right subtree. A class label is assigned to each leaf node,
implying that all the patterns landing in that leaf node are to be
classified into that class. When the feature space is a subset of
$\mathbb{R}^d$, the class of split rules considered could be the class
of hyperplanes that are either parallel or oblique to the axes.
The DTs using the former class of split rules are called axis-parallel DTs and those using the latter are called oblique DTs. There
are many learning algorithms for these DTs (for example, [28]),
and better learning algorithms are still being discovered.
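To make the split-rule and leaf-label structure concrete, the following is a minimal hard-coded axis-parallel DT (an illustration only, not any algorithm from [28]); a nonleaf node routes a sample by comparing one feature with a threshold, and a leaf returns its class.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Nonleaf: split on feature `axis` at `threshold`; leaf: `label` is set.
    axis: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None

def dt_classify(node, x):
    """Route a sample down the tree until a leaf is reached; return its class."""
    while node.label is None:
        node = node.left if x[node.axis] <= node.threshold else node.right
    return node.label

# A depth-2 axis-parallel tree over 2-D features
tree = Node(axis=0, threshold=0.5,
            left=Node(label=0),
            right=Node(axis=1, threshold=1.0,
                       left=Node(label=1),
                       right=Node(label=2)))
print(dt_classify(tree, [0.2, 3.0]))  # -> 0
print(dt_classify(tree, [0.9, 0.4]))  # -> 1
print(dt_classify(tree, [0.9, 2.0]))  # -> 2
```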
The learnability of this class of classifiers has been analyzed and bounds on sample complexities have been found
by bounding the VC-dimension of DTs which are used for
two-class pattern classification problems [29], [30]. The
analysis presented in Section IV can be easily reworked for
the class of maps generated by DTs. A brief sketch of the
proof of the theorem for DTs is given below. As expected, the
approximating behavior of DTs is quite similar to that of Vnets
except that the number of Vcenters is replaced by the number
of leaf nodes. To find a bound on the EE, a DT is considered as a
combination of $l$ (the number of leaf nodes) indicator
functions, where the set associated with each indicator function
is an intersection of $h$ (the depth of the tree) half spaces. Let
$\mathcal{T}_{l,h}$ be the class of mappings generated by DTs for a fixed $l$
and $h$. Then the empirical risk minimizer and the best mapping in $\mathcal{T}_{l,h}$ corresponding to DTs are defined as in (28), where the empirical risk is as defined in (6) and the expected risk is the average misclassification error.
Theorem VI.1: Let $P$ be a probability measure on $(\mathcal{Z}, \mathcal{S})$.
Let $g^*$ be a BDR corresponding to $P$ and let $N_{g^*}(\epsilon)$ be the
$\epsilon$-coarse index function of $g^*$. Let $\mathcal{D} = \{z_j : j = 1, \ldots, n\}$ be a set of independent training examples drawn according to $P$, and let the DT learned from $\mathcal{D}$ be as defined in (28). Then, with probability at least $1 - \delta$, the bound in (29) holds.
A Sketch of the Proof: The first part of the inequality in (29)
follows from the definition of the $\epsilon$-coarse index function of $g^*$.
To get the second part, we need to bound the metric capacity of $\mathcal{T}_{l,h}$ because, following arguments similar to those in Section IV-B, the EE of DTs can be bounded as in (30). Since a DT is a combination of $l$ indicator functions, each associated with an intersection of $h$ half spaces, Theorem IV.6 implies
a bound on this metric capacity. Substituting this
bound in (30) and after some algebra, we get the second part of
the inequality in the statement of the theorem.
From the above theorem, one can say that the efficiency of
solving a problem, the dependence of GE on the dimension of
$\mathcal{X}$, and the tradeoff between the number of leaf nodes and $n$ in the case of DTs are the same
as in Vnets. One can also analyze the optimal growth rates of the depth
of a tree and of the number of leaf nodes in a tree. It is observed
that the optimal rate of growth of the depth $h$ is a function of the logarithm of the number of leaf nodes; this
is to be expected because the number of subsets one can get in
a tree of depth $h$ is $2^h$. The optimal rate of growth of the number of leaf nodes $l$, however, is
almost similar to that of $k$ in Vnets; it differs only in the rate. Following
the same procedure as in the previous section, we get that the
number $l$ of leaf nodes should be proportional to the power of $n$ indicated by the analysis, for the bound on GE to be optimal.
This rate is faster than that of $k$ in Vnets [refer to (27)]. The difference is to be expected because each Voronoi region is assumed to be
an intersection of at most $k - 1$ half spaces, whereas each subset in the case
of a DT is assumed to be an intersection of only $h$ half spaces, and the
derived bounds are worst-case bounds.
VII. DISCUSSION
In this section, we make some comments on the nature of
results obtained along with some possible extensions of the results.
The main result of this paper has been developed in a framework similar to Valiant's PAC framework for learning, and along
the lines of [16]. However, the result is not entirely distribution free: the bound on AE depends on the
underlying probability measure through the properties of
a corresponding BDR, viz., its $\epsilon$-coarse index function.
Moreover, in this paper, we have started from an approximation
scheme and found the space of problems that can be efficiently
solved by the scheme, as opposed to the other way round, viz.,
from a class of problems to an optimal class of approximation
schemes.
As previously mentioned, the bounds obtained using the PAC
framework are known to be very loose. Therefore, no attempt
has been made in this paper to find the exact numerical bounds.
Rather, these bounds have been used to analyze the behavior of
GE with respect to various parameters, in terms of bounds on the
growth of $k$ with respect to $n$, etc.
In this paper, we have assumed the existence of an algorithm
that finds $\hat{g}_{k,n}$, which minimizes $\hat{R}_n$, for a given $k$ and a training
set $\mathcal{D}$. We conjecture that the problem of finding $\hat{g}_{k,n}$ is NP-hard,
since its counterpart in unsupervised learning, viz., clustering,
is NP-hard and the present problem is very similar to clustering.
(A formal verification of this conjecture needs to be done.) Recently, an algorithm based on evolutionary algorithms has been
proposed and shown to asymptotically converge to $\hat{g}_{k,n}$ with
probability one [7]. This algorithm may converge to a suboptimal solution if run for a finite time, which is the case in reality. Nevertheless, it is empirically shown to converge to a better optimum
than that obtained by the LVQ3 algorithm [3].
The behavior of the generalization error of networks with sigmoidal units [11] and of RBF networks [16] has been analyzed by considering them as function approximation schemes. It
appears that finding a bound on GE for these networks when
used as pattern classifiers is difficult. First, the class of decision
regions that can be obtained using the functions generated by
these networks as discriminating functions could be very difficult to characterize. However, the conclusions arrived at from the
analysis presented in Section IV-A should, in general, be applicable to these networks too. It is interesting to investigate
whether techniques similar to the one given in Section IV-B
can be used in bounding the EE of these networks. The problem
may arise when calculating the metric entropy of the space of
mappings in the last stage of the network where the vectors of
real numbers are mapped to class labels.
One has to address two interrelated issues in comparing two
classification models. One issue is the average misclassification
error; the second is the time taken to arrive at a particular instance
of the model. The first issue has been analyzed for Vnets
and DTs in the present paper. The analysis could be extended to
MLPs in which each perceptron uses a hard-limiting function,
because the fundamental structure in the associated decision regions is a hyperplane. It would be interesting to see whether a similar analysis can be
done on other types of MLPs and RBF networks that are popularly used as PC models. However, it appears to be difficult to
assess the computation needed to arrive at a globally optimal
classifier with existing tools in computational complexity. If one
succeeds in precisely finding the computational requirements of
learning these networks, then these classification models can be
compared purely on an analytical basis, and the comparison then
holds for a wide class of PC problems irrespective of the specific learning algorithm.
VIII. SUMMARY
We have considered the classification model that discretizes
the feature space into Voronoi regions and assigns samples
in each region to a class label. We have called this model
“Vnets.” The error between a rule and a BDR corresponding
to the problem under consideration has been considered as the
generalization error. GE has been split into AE and EE and
these factors have been individually bounded, thus bounding
GE. We have analyzed the approximating behavior of Vnets
and, in turn, their efficiency in solving a problem, from the bound on
AE. We have shown that the efficiency of a Vnet depends on the
extent to which Bayes decision boundaries fill the feature space.
The bound on EE has been derived using two techniques and
the relative merits of these two techniques have been discussed.
Putting together the bounds on AE and EE, we have shown
that Vnets converge to the Bayes classifier with arbitrarily high
probability provided the number of Vcenters grows more slowly than
the square root of the number of training samples. All these
results have been reworked for DTs. The optimal growth rate
of the number of Voronoi centers for optimum GE has been
calculated for Vnets as well as DTs. The main result of this
paper also quantitatively explains the bias/variance dilemma
and the curse of dimensionality that is generally observed in
classification models.
APPENDIX I
SOME DEFINITIONS AND RESULTS
The analysis of the process of learning a concept can be done
by analyzing the uniform convergence of the associated empirical process as shown in Section IV-B for a class of concepts.
The classes of concepts that are of interest in this paper are the
class of subsets of the instance space and the class of real-valued functions. In each of the above classes, the uniform convergence of
the empirical process depends on the growth function associated with the concept class. The growth function in turn depends
on a number (dimension) that is specific to a concept class. In
this section, we define these dimensions for the above mentioned two concept classes and give the relation between this dimension and the property of uniform convergence for the class
of real valued functions in a series of theorems and lemmas.
Readers are referred to [25] and [31] for more details.
A. VC Dimension of a Class of Subsets
The following notation is used in this section. Let $(\mathcal{X}, \mathcal{B})$ be
a measurable space and let $\mathcal{C}$ be a collection of subsets of $\mathcal{X}$.
Definition A.1: A subset $A \subseteq \mathcal{X}$ is said to be shattered by $\mathcal{C}$
if for every partition of $A$ into disjoint subsets $A_1$ and $A_2$,
there exists $C \in \mathcal{C}$ such that $A_1 \subseteq C$ and $A_2 \cap C = \emptyset$. Then
the Vapnik–Chervonenkis (VC) dimension of $\mathcal{C}$, denoted $\mathrm{VCdim}(\mathcal{C})$, is
the maximum cardinality of any subset of $\mathcal{X}$ shattered by $\mathcal{C}$, or $\infty$
if arbitrarily large subsets can be shattered.
The following two lemmas give the VC dimension of the class
of half spaces in $\mathbb{R}^d$ and that of the class of subsets of $\mathcal{X}$ which
are intersections of elements belonging to a concept class of a
given VC dimension. These results are useful in deriving the
bounds for Vnets.
Lemma A.1: ([17, Lemma 3.2.3]) Let $\mathcal{C}$ be a concept
class of finite VC dimension $d_{\mathcal{C}} \ge 1$. For all $s \ge 1$, let $\mathcal{C}_s$ be the class of all intersections (or all unions) of at most $s$ elements of $\mathcal{C}$. Then
$\mathrm{VCdim}(\mathcal{C}_s) \le 2 d_{\mathcal{C}}\, s \log(3s)$.
Lemma A.2: ([32]) Let $\mathcal{C}$ be the class of half spaces in $\mathbb{R}^d$;
then $\mathrm{VCdim}(\mathcal{C}) = d + 1$.
B. Learning the Class of Real-Valued Functions
Definition A.2: For $x, y \in \mathbb{R}$, let $\mathrm{sgn}(x - y) = 1$ if $x \ge y$ and $0$ otherwise. For $\bar{x} = (x_1, \ldots, x_m)$ and $\bar{y} = (y_1, \ldots, y_m)$ in $\mathbb{R}^m$, let $\mathrm{sgn}(\bar{x} - \bar{y}) = (\mathrm{sgn}(x_1 - y_1), \ldots, \mathrm{sgn}(x_m - y_m))$. A set $A \subseteq \mathbb{R}^m$ is said to be full if there
exists $\bar{y} \in \mathbb{R}^m$ such that $\{\mathrm{sgn}(\bar{x} - \bar{y}) : \bar{x} \in A\} = \{0, 1\}^m$.
Let $\mathcal{F}$ be a family of functions from a set $\mathcal{Z}$ into $\mathbb{R}$. Any
sequence $(z_1, \ldots, z_m)$ of points of $\mathcal{Z}$ is said to be shattered by $\mathcal{F}$ if
its $\mathcal{F}$-restriction, $\{(f(z_1), \ldots, f(z_m)) : f \in \mathcal{F}\}$, is full.
The largest $m$ such that there exists a sequence of $m$ points which
is shattered by $\mathcal{F}$ is said to be the pseudodimension of $\mathcal{F}$, denoted by $\mathrm{Pdim}(\mathcal{F})$. If arbitrarily long finite sequences are shattered, then $\mathrm{Pdim}(\mathcal{F})$ is infinity.
Definition A.3: Let $A$ be a subset of a metric space with
metric $\sigma$. Then a set $B \subseteq A$ is called an $\epsilon$-cover of $A$ with
respect to the metric $\sigma$ if for every $a \in A$ there exists
some $b \in B$ satisfying $\sigma(a, b) \le \epsilon$. The size of the smallest
$\epsilon$-cover is called the covering number of $A$ and is denoted by
$\mathcal{N}(\epsilon, A, \sigma)$. A set $B \subseteq A$ is called $\epsilon$-separated if for all distinct $a, b \in B$,
$\sigma(a, b) > \epsilon$. The packing number $\mathcal{M}(\epsilon, A, \sigma)$
of a set $A$ is the size of the largest $\epsilon$-separated subset of $A$.
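The covering and packing numbers of Definition A.3 can be approximated numerically for finite point sets. The sketch below (an illustration only, under the Euclidean metric) greedily builds an $\epsilon$-separated subset of a point set; by construction the result is simultaneously an $\epsilon$-cover of the set, which is the standard fact relating the two quantities.

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedily pick points pairwise more than eps apart; every remaining point is
    within eps of some picked point, so the result is eps-separated and an eps-cover."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in net):
            net.append(p)
    return np.array(net)

rng = np.random.default_rng(2)
A = rng.random((500, 2))            # 500 points in the unit square
for eps in [0.4, 0.2, 0.1]:
    net = greedy_eps_net(A, eps)
    print(f"eps={eps}: cover/packing size estimate = {len(net)}")
```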
Let $d_{L^1(\bar{z})}$ be a metric on $\mathbb{R}^m$ defined as
$d_{L^1(\bar{z})}(\bar{x}, \bar{y}) = \frac{1}{m}\sum_{i=1}^{m} |x_i - y_i|$
($L^1$ is used in the subscript to represent the average metric, as in [25]). Let $d_{L^1(P)}$ be a
pseudometric on a space of functions from $\mathcal{Z}$ into $\mathbb{R}$, induced
by a probability distribution $P$ on $\mathcal{Z}$, defined as
$d_{L^1(P)}(f, g) = \int_{\mathcal{Z}} |f(z) - g(z)| \, dP(z)$.
Theorem A.3: ([13]) Let $\mathcal{F}$ be a family of functions from a
set $\mathcal{Z}$ into $[0, M]$, with $\mathrm{Pdim}(\mathcal{F})$ finite, and let $P$ be a probability distribution on $\mathcal{Z}$. Then for all $\epsilon > 0$ the covering number $\mathcal{N}(\epsilon, \mathcal{F}, d_{L^1(P)})$ satisfies the bound (31), which depends only on $\epsilon$, $M$, and $\mathrm{Pdim}(\mathcal{F})$.
Theorem A.4: ([13], [25]) Let $\mathcal{F}$ be a family of functions
from a set $\mathcal{Z}$ into $[0, 1]$, and let $\bar{z} = (z_1, \ldots, z_n)$ be a sequence of training
examples randomly drawn according to a distribution $P$. Then,
for all $\epsilon > 0$, the probability that the empirical means of the functions in $\mathcal{F}$ deviate from their true expectations by more than $\epsilon$, uniformly over $\mathcal{F}$, is bounded as in (32) in terms of the expectation of the covering number $\mathcal{N}(\epsilon, \mathcal{F}_{|\bar{z}}, d_{L^1(\bar{z})})$, where $\mathcal{F}_{|\bar{z}} = \{(f(z_1), \ldots, f(z_n)) : f \in \mathcal{F}\}$ is the restriction of $\mathcal{F}$ to the data set $\bar{z}$.
The original version of the above theorem is given in [13];
an improved version, with less conservative bounds, is given
in [25]. Since $\mathcal{F}_{|\bar{z}}$ is a random set that depends on $P$, the bound in (32) depends on the probability
distribution $P$. To get a bound that is independent of $P$, a quantity called the metric
capacity is defined and the error is bounded in terms of it.
Definition A.4: The metric capacity of $\mathcal{F}$ is defined as

$\mathcal{C}(\epsilon, \mathcal{F}) = \sup_{P} \mathcal{N}\bigl(\epsilon, \mathcal{F}, d_{L^1(P)}\bigr) \qquad (33)$

where the supremum is taken over all the probability distributions defined on $\mathcal{Z}$.
Since the right-hand side of (31) is independent of $P$, we have, under
the same hypotheses as in Theorem A.3, the corresponding distribution-free bound (34) on the metric capacity of $\mathcal{F}$.
We use the following theorem in Theorem IV.6.
Theorem A.5: ([13]) The pseudodimension of a class of functions obtained by composing a fixed nondecreasing map with every function of a given class is at most the pseudodimension of the given class.

APPENDIX II
PROOFS OF LEMMA IV.4 AND LEMMA IV.5

Proof of Lemma IV.4: Given a sequence $\bar{z} = (z_1, \ldots, z_n)$ in $\mathcal{Z}$ sampled according to the distribution $P$, define $\hat{P}_{\bar{z}}$ as the empirical distribution on $\mathcal{Z}$, i.e., $\hat{P}_{\bar{z}} = \frac{1}{n}\sum_{i=1}^{n} \delta_{z_i}$, where $\delta_{z_i}$ is the Dirac measure at the $i$th element of $\bar{z}$. The map that takes $f \in \mathcal{F}$ to its restriction $f_{|\bar{z}}$ is an isometry between $(\mathcal{F}, d_{L^1(\hat{P}_{\bar{z}})})$ and $(\mathcal{F}_{|\bar{z}}, d_{L^1(\bar{z})})$, so the two covering numbers coincide. Hence the first inequality of the lemma follows from (33) by taking expectations on both sides with respect to $P$.
For the second inequality, fix a distribution $P$ on $\mathcal{Z}$ and let $P_{\mathcal{X}}$ and $P_{\mathcal{Y}}$ be the marginal
distributions with respect to $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $B$
be an $\epsilon$-cover for $\mathcal{G}_k$ with respect to the metric $d_{L^1(P_{\mathcal{X}})}$. Then it is easy to show that $\{f_g : g \in B\}$ is an $\epsilon$-cover for $\mathcal{F}$ with respect to the metric $d_{L^1(P)}$, so that $\mathcal{N}(\epsilon, \mathcal{F}, d_{L^1(P)}) \le \mathcal{N}(\epsilon, \mathcal{G}_k, d_{L^1(P_{\mathcal{X}})})$. Taking the supremum over
all $P$, we get the second inequality in the lemma.
Proof of Lemma IV.5: Fix a distribution $P_{\mathcal{X}}$ on $\mathcal{X}$ and let $B$ be
an $\epsilon$-cover for $\mathcal{A}_k$ with respect to the metric $d_{L^1(P_{\mathcal{X}})}$. Then the class obtained by composing the output map with the elements of $B$
can be shown to be an $\epsilon$-cover for $\mathcal{G}_k$ with
respect to $d_{L^1(P_{\mathcal{X}})}$; hence the covering number of $\mathcal{G}_k$ is bounded by that of $\mathcal{A}_k$.
Since this holds for any $P_{\mathcal{X}}$, taking the supremum over all $P_{\mathcal{X}}$, the
lemma is proved.
REFERENCES
[1] B. V. Dasarathy, Ed., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Comput. Soc. Press,
1991.
[2] C.-L. Chang, “Finding prototypes for nearest neighbor classifiers,” IEEE
Trans. Comput., vol. C-23, pp. 1179–1184, Nov. 1974.
[3] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, pp.
1464–1480, Sept. 1990.
[4] C. Decaestecker, “Finding prototypes for nearest neighbor classification by means of gradient descent and deterministic annealing,” Pattern
Recognition, vol. 30, no. 2, pp. 281–288, Feb. 1997.
[5] Q. Zhao, “Stable on-line evolutionary learning of NN-MLP,” IEEE
Trans. Neural Networks, vol. 8, pp. 1371–1378, Nov 1997.
[6] J. C. Bezdek, T. R. Reichherzer, G. S. Lim, and Y. Attikiouzel, “Multiple-prototype classifier design,” IEEE Trans. Syst., Man, Cybern. C,
vol. 28, pp. 67–79, Feb. 1998.
[7] K. Krishna, “Hybrid evolutionary algorithms for supervised and unsupervised learning,” Ph.D. dissertation, Dept. Elect. Eng., Indian Inst.
Sci., Bangalore, India, Aug. 1998.
[8] G. Voronoi, “Recherches sur les paralleloedres primitives,” J. Reine
Angew. Math., vol. 134, pp. 198–287, 1908.
[9] N. R. Pal, J. C. Bezdek, and E. C. K. Tsao, “Generalized clustering networks and Kohonen’s self-organizing scheme,” IEEE Trans. Neural Networks, vol. 4, pp. 549–557, July 1993.
[10] G. Cybenko, “Approximation by superposition of sigmoidal functions,”
Math. Contr., Syst, Signals, vol. 2, no. 4, pp. 303–314, 1989.
[11] A. R. Barron, “Approximation and estimation bounds for artificial
neural networks,” Dept. Statist., Univ. Illinois Urbana-Champaign,
Champaign, IL, Tech. Rep. 58, Mar. 1991.
[12] E. Baum and D. Haussler, “What size net gives valid generalization?,”
Neural Comput., vol. 1, no. 1, pp. 151–160, 1989.
[13] D. Haussler, “Decision theoretic generalizations of the PAC model for
neural net and other learning applications,” Inform. Comput., vol. 100,
pp. 78–150, 1992.
[14] H. White, “Connectionist nonparametric regression: Multilayer perceptron can learn arbitrary mappings,” Neural Networks, vol. 3, pp.
535–549, 1990.
[15] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the
bias/variance dilemma,” Neural Comput., vol. 4, pp. 1–58, 1992.
[16] P. Niyogi and F. Girosi, “On the relationship between generalization
error, hypothesis complexity, and sample complexity for radial basis
functions,” Artificial Intell. Lab., Massachusetts Inst. Technol., Cambridge, MA, A. I. Memo 1467, Feb. 1994.
[17] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, “Learnability and the Vapnik–Chervonenkis dimension,” J. Assoc. Comput.
Machinery, vol. 36, no. 4, pp. 926–965, Oct. 1989.
[18] V. N. Vapnik, Estimation of Dependences Based on Empirical
Data. Berlin, Germany: Springer-Verlag, 1982.
[19] V. N. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of
relative frequencies of events to their probabilities,” Theory Probability
and its Applications, vol. 12, no. 2, pp. 264–280, 1971.
[20] L. G. Valiant, “A theory of the learnable,” Commun. Assoc. Comput.
Machinery, vol. 27, no. 11, pp. 1134–1143, 1984.
[21] S. Ben-David, N. Cesa-Bianchi, D. Haussler, and P. M. Long, “Characterizations of learnability for classes of f0; . . . ; ng-valued functions,”
J. Comput. Syst. Sci., vol. 50, pp. 74–86, 1995.
[22] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification
and Regression Trees. Belmont, CA: Wadsworth, 1984.
[23] J. R. Quinlan, “Decision trees and decision making,” IEEE Trans. Syst.,
Man, Cybern., vol. 20, pp. 339–346, 1990.
[24] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[25] M. Vidyasagar, A Theory of Learning and Generalization With Applications to Neural Networks and Control Systems. Berlin, Germany:
Springer-Verlag, 1996.
[26] G. A. Edgar, Measure, Topology and Fractal Geometry. New York:
Springer-Verlag, 1990.
[27] G. G. Lorentz, Approximation of Functions. New York: Chelsea, 1986.
[28] S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of
oblique decision trees,” J. Artificial Intell. Res., vol. 2, pp. 1–32, 1994.
[29] D. Haussler, “Quantifying inductive bias: AI learning algorithms and
Valiant’s learning framework,” Artificial Intell., vol. 36, pp. 177–221,
1988.
[30] A. Ehrenfeucht and D. Haussler, “Learning decision trees from random
examples,” Inform. Comput., vol. 83, no. 3, pp. 247–251, 1989.
[31] B. K. Natarajan, Machine Learning: A Theoretical Approach. San
Mateo, CA: Morgan Kaufmann, 1991.
[32] R. S. Wenocur and R. M. Dudley, “Some special Vapnik–Chervonenkis
classes,” Discrete Math., vol. 33, pp. 313–318, 1981.
K. Krishna received the B.Sc. degree in physics
from Nagarjuna University, India, in 1986. He
received the M.E. and Ph.D. degrees in electrical
engineering from the Indian Institute of Science,
Bangalore, India, in 1993 and 1999, respectively.
He has been a Research Staff Member in the
IBM India Research Laboratory, New Delhi, India,
since December 1998. His present research interests
include machine learning, statistical learning theory,
evolutionary algorithms, and data and web mining.
Dr. Krishna received the Alfred Hay Medal for the
best graduating student in electrical engineering in 1993 from the Indian Institute of Science.
M. A. L. Thathachar (SM’79–F’91) received
the B.E. degree in electrical engineering from the
University of Mysore, India, in 1959, the M.E.
degree in power engineering, and the Ph.D. degree in
control systems from the Indian Institute of Science,
Bangalore, India, in 1961 and 1968, respectively.
He was a Member of the Faculty of Indian Institute
of Technology, Madras, from 1961 to 1964. Since
1964, he has been with Indian Institute of Science,
Bangalore, where currently he is a Professor in the
Department of Electrical Engineering. He has been a
Visiting Professor at Yale University, New Haven, CT, Michigan State University, East Lansing, Concordia University, Montreal, PQ, Canada, and the National University of Singapore. His current research interests include learning
automata, neural networks, and fuzzy systems.
Dr. Thathachar is a Fellow of Indian National Science Academy, the Indian
Academy of Sciences, and the Indian National Academy of Engineering.
K. R. Ramakrishnan received the B.S., M.S., and
Ph.D. degrees from the Department of Electrical
Engineering, Indian Institute of Science, Bangalore,
India.
He is currently an Associate Professor in the same
department. His research interests include signal processing, computer vision, medical imaging, and multimedia.