Multicategory ψ-Learning
Yufeng Liu and Xiaotong Shen
Summary

In binary classification, margin-based techniques usually deliver high performance. As a result, a multicategory problem is often treated as a sequence of binary classifications. In the absence of a dominating class, this treatment may be suboptimal and may yield poor performance, as is the case for the Support Vector Machine (SVM). We propose a novel multicategory generalization of ψ-learning that treats all classes simultaneously. The new generalization eliminates this potential problem and, at the same time, retains the desirable properties of its binary counterpart. We develop a statistical learning theory for the proposed methodology and obtain fast convergence rates for both linear and nonlinear learning examples. The operational characteristics of the method are demonstrated via simulation. Our results indicate that the proposed methodology can deliver accurate class prediction and is more robust against extreme observations than its SVM counterpart.

Key Words and Phrases: Generalization error, nonconvex minimization, supervised learning, support vectors.
1 Introduction

Classification has become increasingly important as a means for facilitating information extraction. Among binary classification techniques, significant developments have been seen in margin-based methodologies, including the Support Vector Machine (SVM; Boser, Guyon, and Vapnik, 1992; Cortes and Vapnik, 1995), Penalized Logistic Regression (PLR; Lin et al., 2000), the Import Vector Machine (IVM; Zhu and Hastie, 2001), and Distance Weighted Discrimination (DWD; Marron and Todd, 2002).

[Author footnote: Yufeng Liu is Assistant Professor, Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, CB 3260, Chapel Hill, NC 27599 (Email: yfliu@email.unc.edu). He would like to thank Professor George Fisherman for his helpful comments. Xiaotong Shen is Professor, School of Statistics, University of Minnesota, 224 Church Street S.E., Minneapolis, MN 55455 (Email: xshen@stat.umn.edu). His research was supported in part by NSF grants IIS-0328802 and DMS-0072635. The authors would like to thank the editor, the associate editor, and two anonymous referees for their helpful comments and suggestions.]
Among the many margin-based techniques, the ones that focus on estimating the decision boundary tend to yield higher performance than those that focus on conditional probabilities, because the former is an easier problem than the latter. For instance, the binary SVM directly estimates the Bayes classifier sign(P(Y = +1 | x) − 1/2) rather than P(Y = +1 | x) itself, with input vector x and class label Y ∈ {±1}, as shown in Lin (2002). However, this aspect of the methodology makes its generalization to the multicategory case highly nontrivial. One popular approach, known as "one-versus-rest", solves k binary problems via sequential training. As argued by Lee, Lin, and Wahba (2004), an approach of this sort performs poorly in the absence of a dominating class, since in that case the conditional probability of each class is no greater than 1/2.
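A toy calculation (our own illustration, not from the paper) makes the difficulty concrete: when no class has conditional probability above 1/2, every one-versus-rest subproblem votes against its class, while the Bayes rule remains well defined.

```python
# Hypothetical conditional probabilities P(Y = j | x) at a fixed x with k = 3
# and no dominating class: every probability is below 1/2.
probs = [0.40, 0.35, 0.25]

# Each one-versus-rest binary rule sign(P(Y = j | x) - 1/2) votes "rest":
one_vs_rest_votes = [p > 0.5 for p in probs]

# The Bayes rule, in contrast, simply picks the argmax:
bayes_choice = max(range(3), key=lambda j: probs[j])

print(one_vs_rest_votes)  # [False, False, False]: no class claims x
print(bayes_choice)       # 0: the Bayes rule still selects class 1 (index 0)
```

No binary subproblem produces a positive vote, so the combined one-versus-rest classifier has no principled way to assign x, even though the argmax is unambiguous.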
Shen, Tseng, Zhang, and Wong (2003) proposed another margin-based technique, called ψ-learning, which replaces the convex SVM loss function by a nonconvex ψ-loss function. They show that more accurate class prediction can be achieved while the margin interpretation is retained. The present article generalizes binary ψ-learning to the multicategory case. Since ψ-learning, like SVM, does not directly yield P(Y = +1 | x), we need to take a new approach. To treat all classes simultaneously, we generalize the concepts of margins and support vectors via multiple comparisons among different classes. Multicategory ψ-learning has the advantage of retaining the desired properties of its binary counterpart without suffering from the aforementioned difficulty of one-versus-rest SVM with regard to the dominating class.

To provide insight into multicategory ψ-learning, we develop a statistical learning theory. Specifically, the theory quantifies the performance of multicategory ψ-learning with respect to the choice of tuning parameters, the size of the training sample, and the number of classes involved in classification. It also indicates that our multicategory ψ-learning directly estimates the true decision boundary regardless of the presence or absence of a dominating class.

Simulation experiments indicate that ψ-learning outperforms its counterpart SVM in generalization, as in the binary case. Moreover, multicategory ψ-learning is more robust than its counterpart SVM against extreme instances that are wrongly classified. Interestingly, in linear learning problems it exhibits behavior with respect to the tuning parameter that resembles nonlinear learning problems, which differs from the binary case.
Section 2.1 motivates our approach. Section 2.2 describes our proposal for multicategory ψ-learning, and Section 2.3 briefly discusses computational issues. Section 3 studies the statistical properties of the proposed methodology and develops its statistical learning theory. Section 4 presents numerical examples, followed by conclusions and discussions in Section 5. The Appendix contains the lemmas and technical proofs.
2 Methodology
The primary goal of classification is to predict the class label Y for a given input vector x ∈ S via a classifier, where S is an input space. For k-class classification, a classifier partitions S into k disjoint and exhaustive regions S_1, ..., S_k, with S_j corresponding to class j. A good classifier is one that predicts the class index Y for a given x accurately, as measured by its accuracy of prediction.

Before proceeding, let x ∈ S ⊂ ℝ^d be an input vector and y an output (label) variable. We code y as {1, ..., k} and define f = (f_1, ..., f_k) as a decision function vector. Here f_j, mapping from S to ℝ, represents class j, j = 1, ..., k. The classifier argmax_{j=1,...,k} f_j(x), induced by f, is employed to assign a label to any input vector x ∈ S. In other words, x ∈ S is assigned to the class with the highest value of f_j(x), which indicates the strength of evidence that x belongs to class j. A classifier is trained via a training sample {(x_i, y_i), i = 1, ..., n}, independently and identically distributed according to an unknown probability distribution P(x, y). Throughout the paper, we use X and Y to denote random variables and x and y to represent the corresponding observations.
The generalization error (GE) quantifies the accuracy of generalization and is defined as Err(f) = P[Y ≠ argmax_j f_j(X)], the probability of misclassifying a new input vector X. To simplify the expression, we introduce g(f(x), y) = (f_y(x) − f_1(x), ..., f_y(x) − f_{y−1}(x), f_y(x) − f_{y+1}(x), ..., f_y(x) − f_k(x)), which performs the multiple comparisons of class y versus the remaining classes. The vector g(f(x), y) captures the unique feature of a multicategory problem and is directly related to the generalized margins to be introduced shortly. Furthermore, for u = (u_1, ..., u_{k−1}), we define the multivariate sign function sign(u) = 1 if u_min = min(u_1, ..., u_{k−1}) > 0, and −1 if u_min ≤ 0. With sign(·) and g(f(x), y) in place, f indicates correct classification for any given instance (x, y) if g(f(x), y) > 0_{k−1}, where 0_{k−1} is the (k − 1)-dimensional vector of 0's. Consequently, the GE reduces to Err(f) = (1/2) E[1 − sign(g(f(X), Y))], with the empirical generalization error (EGE) (2n)^{−1} Σ_{i=1}^{n} (1 − sign(g(f(x_i), y_i))).
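The quantities above translate directly into code. The following sketch (our own, using a hypothetical linear decision vector for k = 3) computes g, the multivariate sign, and the EGE.

```python
import numpy as np

def g(f_x, y):
    """Multiple-comparison vector g(f(x), y): f_y(x) minus every other component.
    f_x is the length-k vector (f_1(x), ..., f_k(x)); y is a 1-based label."""
    k = len(f_x)
    return np.array([f_x[y - 1] - f_x[l] for l in range(k) if l != y - 1])

def msign(u):
    """Multivariate sign: +1 if min(u) > 0, else -1."""
    return 1.0 if np.min(u) > 0 else -1.0

def empirical_ge(f, xs, ys):
    """EGE = (2n)^{-1} * sum_i (1 - sign(g(f(x_i), y_i)))."""
    n = len(xs)
    return sum(1.0 - msign(g(f(x), y)) for x, y in zip(xs, ys)) / (2.0 * n)

# A hypothetical linear decision vector obeying the sum-to-zero convention:
f = lambda x: np.array([x[0], -x[0] / 2 + x[1], -x[0] / 2 - x[1]])
xs = [np.array([1.0, 0.0]), np.array([-1.0, 1.0]), np.array([-1.0, -1.0])]
ys = [1, 2, 1]  # the third instance is misclassified by construction
print(empirical_ge(f, xs, ys))  # 1/3: one of the three instances is misclassified
```

Each misclassified instance contributes 2 to the sum, so dividing by 2n turns the sum into a misclassification rate.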
For motivation, we first discuss our setting in the binary case, and then generalize it to the multicategory case. In particular, we review binary ψ-learning with the usual coding {−1, 1}, and then derive it via the coding {1, 2}.
2.1 Motivation
With y ∈ {±1}, a margin-based classifier estimates a single function f and uses sign(f) as the classification rule. Within the regularization framework, it solves argmin_f J(f) + C Σ_{i=1}^{n} l(y_i f(x_i)), where J(f), a regularization term, controls the complexity of f; the loss function l measures the data fit; and C > 0 is a tuning parameter balancing the two terms. For example, SVM uses the hinge loss l(u) = [1 − u]_+, where [v]_+ = v if v ≥ 0 and 0 otherwise; PLR and IVM adopt the logistic loss l(u) = log(1 + e^{−u}); and the ψ-loss can be any nonincreasing function satisfying R ≥ ψ(u) > 0 if u ∈ (0, τ) and ψ(u) = 1 − sign(u) otherwise, where τ ∈ (0, 1] and R > 0. For simplicity, we discuss the linear case in which f(x) = w^T x + b, with w ∈ ℝ^d and b ∈ ℝ, represents a d-dimensional hyperplane. In this case, J(f) = (1/2)‖w‖² is defined by the geometric margin 2/‖w‖, the vertical Euclidean distance between the hyperplanes f = ±1. Here y_i f(x_i) is the functional margin of instance (x_i, y_i).
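For concreteness, here is a sketch (our own) of the three binary losses just described; the ψ shown is one member of the family, with R = 2 and linear decay on (0, τ), and is an assumption rather than the paper's unique choice.

```python
import math

def hinge(u):
    """SVM loss: [1 - u]_+."""
    return max(1.0 - u, 0.0)

def logistic(u):
    """PLR/IVM loss: log(1 + e^{-u})."""
    return math.log1p(math.exp(-u))

def psi_binary(u, tau=1.0):
    """One member of the binary psi-family (R = 2): equals 1 - sign(u)
    outside (0, tau) and decays linearly from 2 to 0 on (0, tau)."""
    if u >= tau:
        return 0.0
    if u <= 0.0:
        return 2.0  # 1 - sign(u), with sign(u) = -1 for u <= 0
    return 2.0 * (1.0 - u / tau)

# Unlike the hinge loss, psi is bounded: a badly misclassified instance
# (large negative functional margin) costs 2, not an arbitrarily large amount.
print(hinge(-10.0), psi_binary(-10.0))  # 11.0 2.0
```

This boundedness is what drives the robustness against extreme observations reported in Section 4.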
For linear binary ψ-learning with coding {1, 2}, we now derive a parallel formulation using the argmax rule, by noting that x is classified as class 2 if f_2(x) > f_1(x) and as class 1 otherwise, where f_j(x) = w_j^T x + b_j, j = 1, 2. Evidently, this classification rule depends only on sign((f_2 − f_1)(x)). To eliminate redundancy in (f_1, f_2), we invoke the sum-to-zero constraint f_1 + f_2 = 0. This type of constraint was previously used by Guermeur (2002) and Lee et al. (2004) in two different SVM formulations. Under this constraint, ‖w_1‖ = ‖w_2‖. Binary ψ-learning then solves

\[
\min_{b_1, b_2, w_1, w_2} \Big( \frac{1}{2} \sum_{j=1}^{2} \|w_j\|^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to} \quad \sum_{j=1}^{2} f_j(x) = 0 \ \ \forall x \in S, \tag{1}
\]

where g(f(x_i), y_i) = f_{y_i}(x_i) − f_{3−y_i}(x_i).

With coding {1, 2}, instances from classes 1 and 2 that lie, respectively, in the halfspaces {x : g(f(x), 2) ≥ −1} and {x : g(f(x), 2) ≤ 1} are defined as "support vectors". In the separable case, the support vectors are instances on the hyperplanes g(f(x), 2) = ±1. Furthermore, the functional margin of (x_i, y_i) can be defined as g(f(x_i), y_i), indicating the correctness and strength of the classification of x_i by f.
2.2 Multicategory ψ-Learning
As suggested in Shen et al. (2003), the role of a binary ψ-function is twofold. First, it eliminates the scaling problem of the sign function, which is scale invariant. Second, with a positive penalty defined by the positive value of ψ(u) for u ∈ (0, τ), it pushes correctly classified instances away from the boundary. As a remark, we note that 1 − sign as a loss is numerically undesirable, since the solution f is then approximately 0 under regularization.

Using the coding {1, ..., k}, we define multivariate ψ-functions on k − 1 arguments as follows:

\[
R \ge \psi(u) > 0 \ \text{if } u_{\min} \in (0, \tau); \qquad \psi(u) = 1 - \mathrm{sign}(u) \ \text{otherwise}, \tag{2}
\]

where 0 < τ ≤ 1 and 0 < R ≤ 2 are constants, and ψ(u) is nonincreasing in u_min. We note that this multivariate version preserves the desired properties of its univariate counterpart. In particular, the multivariate ψ assigns a positive penalty to any instance with min(g(f(x_i), y_i)) ∈ (0, τ) to eliminate the scaling problem. To utilize our computational strategy based on a difference convex (d.c.) decomposition, we use a specific ψ in the implementation:

\[
\psi(u) = \begin{cases} 0 & \text{if } u_{\min} \ge 1, \\ 2(1 - u_{\min}) & \text{if } 0 \le u_{\min} < 1, \\ 2 & \text{if } u_{\min} < 0. \end{cases} \tag{3}
\]

A plot of this ψ function for k = 3 is displayed in Figure 1.
Insert Figure 1 about here
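In code, the specific ψ of equation (3) depends on its k − 1 arguments only through their minimum. The sketch below (our own) evaluates it in the three regimes for k = 3.

```python
def psi(u):
    """The specific multivariate psi-loss of equation (3): a function of
    u_min = min(u_1, ..., u_{k-1}) only."""
    umin = min(u)
    if umin >= 1.0:
        return 0.0
    if umin < 0.0:
        return 2.0
    return 2.0 * (1.0 - umin)

# For k = 3 the argument is the 2-vector of pairwise differences g(f(x), y):
print(psi([1.2, 3.0]))   # 0.0 : correctly classified with margin >= 1
print(psi([0.5, 2.0]))   # 1.0 : correct but inside the margin, penalized
print(psi([-0.3, 2.0]))  # 2.0 : misclassified, but the penalty is bounded
```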
Linear multicategory ψ-learning solves min_{b,w} ((1/2) Σ_{j=1}^{k} ‖w_j‖² + C Σ_{i=1}^{n} ψ(g(f(x_i), y_i))) subject to Σ_{j=1}^{k} f_j(x) = 0 for all x ∈ S, where w = vec(w_1, ..., w_k) is a kd-dimensional vector with (d(i_2 − 1) + i_1)-th element w_{i_2}(i_1), and b = (b_1, ..., b_k)^T ∈ ℝ^k. By Theorem 2.1 of Liu et al. (2005), the minimization with the sum-to-zero constraint for all x ∈ S is equivalent to that with the constraint imposed on the n training inputs {x_i, i = 1, ..., n} only. That is, the infinite constraint Σ_{j=1}^{k} f_j(x) = 0 for all x ∈ S can be reduced to Σ_{j=1}^{k} b_j 1_n + X Σ_{j=1}^{k} w_j = 0, where X = (x_1, ..., x_n)^T is the design matrix and 1_n is an n-dimensional vector of 1's. This yields linear multicategory ψ-learning:

\[
\min_{b, w} \Big( \frac{1}{2} \sum_{j=1}^{k} \|w_j\|^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to} \quad \sum_{j=1}^{k} b_j 1_n + X \sum_{j=1}^{k} w_j = 0, \tag{4}
\]

where the value of C > 0 in (4) reflects the relative importance of the geometric margin and the EGE.
In the present context, we define the generalized functional margin of an instance (x_i, y_i) as min(g(f(x_i), y_i)), and the generalized geometric margin as γ = min_{1≤j_1<j_2≤k} γ_{j_1 j_2}, with γ_{j_1 j_2} = 2/‖w_{j_1} − w_{j_2}‖, the vertical Euclidean distance between the hyperplanes f_{j_1} − f_{j_2} = ±1. Here γ_{j_1 j_2} measures the separation between classes j_1 and j_2; see Figure 2 for an illustration of the role of γ. When k = 2, (4) reduces to the binary case of Shen et al. (2003). As a technical remark, we note that (4) uses Σ_{j=1}^{k} ‖w_j‖² rather than max_{1≤j_1<j_2≤k} ‖w_{j_1} − w_{j_2}‖² in the minimization, because Σ_{j=1}^{k} ‖w_j‖² plays a similar role to max_{1≤j_1<j_2≤k} ‖w_{j_1} − w_{j_2}‖² and is easier to implement.
Insert Figure 2 about here
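Both generalized margins are straightforward to compute; the following sketch (our own, with hypothetical weight vectors) evaluates them.

```python
import numpy as np
from itertools import combinations

def generalized_functional_margin(f_x, y):
    """min over l != y of f_y(x) - f_l(x); positive iff (x, y) is classified correctly."""
    k = len(f_x)
    return min(f_x[y - 1] - f_x[l] for l in range(k) if l != y - 1)

def generalized_geometric_margin(W):
    """gamma = min over pairs j1 < j2 of 2 / ||w_{j1} - w_{j2}||, rows of W being the w_j."""
    return min(2.0 / np.linalg.norm(W[j1] - W[j2])
               for j1, j2 in combinations(range(len(W)), 2))

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical w_1, w_2, w_3
print(generalized_geometric_margin(W))                # min over the 3 pairwise margins
print(generalized_functional_margin([2.0, 0.5, -2.5], y=1))  # 1.5
```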
Kernel-based learning can be achieved via a proper kernel K(·, ·), mapping from S × S to ℝ. The kernel is required to satisfy Mercer's condition (Mercer, 1909), which ensures that the kernel matrix K is positive definite, where K is the n × n matrix with i_1 i_2-th element K(x_{i_1}, x_{i_2}). Then each f_j can be represented as h_j(x) + b_j, with h_j(x) = Σ_{i=1}^{n} v_{ji} K(x_i, x), by the theory of reproducing kernel Hilbert spaces; c.f. Wahba (1998). Kernel-based multicategory ψ-learning then solves

\[
\min_{b, v} \Big( \frac{1}{2} \sum_{j=1}^{k} \|h_j\|_{H_K}^2 + C \sum_{i=1}^{n} \psi\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to} \quad \sum_{j=1}^{k} b_j 1_n + K \sum_{j=1}^{k} v_j = 0, \tag{5}
\]

where v_j = (v_{j1}, ..., v_{jn})^T and v = vec(v_1, ..., v_k). Using the reproducing kernel property, ‖h_j‖²_{H_K} can be written as v_j^T K v_j.
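The kernel representation and the reduced sum-to-zero constraint can be checked numerically; the sketch below (our own, with hypothetical coefficients chosen so that each term of the constraint vanishes) uses a Gaussian kernel.

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    """n x n Gaussian kernel matrix: K[i1, i2] = exp(-||x_i1 - x_i2||^2 / sigma^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
n, d, k = 5, 2, 3
X = rng.normal(size=(n, d))
K = gram_matrix(X)

# Hypothetical coefficients v_j and intercepts b_j satisfying the reduced
# constraint sum_j b_j 1_n + K sum_j v_j = 0, here in the simplest way:
# each of the two terms is made to vanish separately.
V = rng.normal(size=(k, n))
V[-1] = -V[:-1].sum(axis=0)      # forces sum_j v_j = 0
b = np.array([0.5, -0.2, -0.3])  # sums to 0

F = V @ K + b[:, None]                  # F[j, i] = f_j(x_i) = h_j(x_i) + b_j
print(np.allclose(F.sum(axis=0), 0.0))  # True: sum_j f_j vanishes on the data
print((V * (V @ K)).sum(axis=1))        # the RKHS penalties v_j^T K v_j
```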
The concept of support vectors can also be extended to multicategory problems. In the separable case, the instances on the boundaries of the polyhedrons D_j are the support vectors, where the polyhedron D_j is the solution set of a finite system of linear inequalities defined by min(g(f(x), j)) ≥ 1. In the nonseparable case, the instances belonging to class j that do not fall inside D_j are the support vectors.
2.3 Computational Development of ψ-Learning
To treat the nonconvex minimization involved in (4) and (5), we utilize state-of-the-art technology in global optimization: the difference convex algorithm (DCA) of An and Tao (1997). We refer to Liu et al. (2005) for the details of the algorithm.

The key to efficient computation is the d.c. decomposition ψ = ψ_1 + ψ_2, where ψ_1(u) = 0 if u_min ≥ 1 and 2(1 − u_min) otherwise, and ψ_2(u) = 0 if u_min ≥ 0 and 2u_min otherwise. Here ψ_1 can be viewed as a multivariate generalization of the univariate hinge loss. This d.c. decomposition connects the ψ-loss to the hinge loss ψ_1 of SVM. In fact, the multivariate ψ mimics the GE defined by 1 − sign, while the generalized hinge loss ψ_1 is a convex upper envelope of 1 − sign. With this d.c. decomposition, ψ corrects the bias introduced by the imposed convexity of ψ_1, and is expected to yield higher generalization accuracy.
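A minimal numerical check (our own, not the paper's code) confirms that the stated decomposition reproduces the specific ψ-loss of equation (3), with ψ_1 convex and ψ_2 concave in u_min.

```python
def psi(u):
    """The specific psi-loss of equation (3)."""
    umin = min(u)
    return 0.0 if umin >= 1 else (2.0 if umin < 0 else 2.0 * (1.0 - umin))

def psi1(u):
    """Generalized hinge loss: the convex part of the d.c. decomposition."""
    umin = min(u)
    return 0.0 if umin >= 1 else 2.0 * (1.0 - umin)

def psi2(u):
    """Concave correction term: 0 for umin >= 0, 2*umin otherwise."""
    umin = min(u)
    return 0.0 if umin >= 0 else 2.0 * umin

# Verify psi = psi1 + psi2 on a grid of 2-dimensional arguments (k = 3):
grid = [(-2.0 + 0.1 * i, -1.5 + 0.2 * j) for i in range(40) for j in range(20)]
print(all(abs(psi(u) - (psi1(u) + psi2(u))) < 1e-12 for u in grid))  # True
```

DCA then proceeds by repeatedly linearizing the concave part ψ_2 at the current iterate and solving the resulting convex, SVM-like subproblem.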
3 Statistical Learning Theory
In the literature, there has been considerable interest in the generalization accuracy of margin-based classifiers. In the binary case, Lin (2000) investigated rates of convergence of the SVM with a spline kernel. Bartlett, Jordan, and McAuliffe (2003) studied rates of convergence for certain convex margin losses. Shen et al. (2003) derived a learning theory for ψ-learning. Zhang (2004) obtained consistency for general convex margin-based losses. For the multicategory case, Zhang (2004b) has recently studied consistency of several large-margin classifiers using convex losses. To our knowledge, no results are available for rates of convergence in the multicategory case. In this section, we quantify the generalization error rates of the proposed multicategory ψ-learning, as measured by the Bayesian regret, to be introduced.
3.1 Statistical Properties
The generalization performance of a classifier defined by f is measured by the Bayesian regret e(f, f̄) = Err(f) − Err(f̄) ≥ 0, the difference between the actual performance and the ideal performance. Here f̄ is the Bayes rule, yielding the ideal performance attainable if the true distribution of (X, Y) were known in advance, obtained by minimizing Err(f) = (1/2) E[1 − sign(g(f(X), Y))] with respect to all f, with g(f(x), j) = {f_j(x) − f_l(x), l ≠ j}. Note that the Bayes rule is not unique, because any f̄ satisfying argmax_j f̄_j(x) = argmax_j P_j(x), with P_j(x) = P(Y = j | x), yields the minimum. Without loss of generality, we use the specific f̄ = (f̄_1, ..., f̄_k) with f̄_j(x) = ((k − 1)/k) I(sign(P_j(x) − P_l(x), l ≠ j) = 1) − (1/k) I(sign(P_j(x) − P_l(x), l ≠ j) ≠ 1) in what follows; that is, f̄_l(x) = (k − 1)/k if l = argmax_j P_j(x), and −1/k otherwise.
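As a quick sanity check (our own sketch), this specific Bayes vector can be built directly from the conditional probabilities; it satisfies the sum-to-zero convention, and each pairwise difference from the argmax coordinate equals 1.

```python
def bayes_f(probs):
    """The specific Bayes decision vector used here: (k-1)/k in the argmax
    coordinate of (P_1(x), ..., P_k(x)) and -1/k elsewhere."""
    k = len(probs)
    jmax = max(range(k), key=lambda j: probs[j])
    return [(k - 1) / k if j == jmax else -1.0 / k for j in range(k)]

fbar = bayes_f([0.2, 0.5, 0.3])  # hypothetical conditional probabilities, k = 3
print(fbar)                      # [-1/3, 2/3, -1/3]
print(sum(fbar))                 # 0.0: the sum-to-zero convention holds
print(fbar[1] - fbar[0])         # each pairwise difference from the argmax equals 1
```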
Theorem 3.1 below gives expressions for the Bayesian regret, which are critical for establishing our learning theory.
Theorem 3.1. For any decision function vector f,

\[
e(f, \bar f) = \frac{1}{2} E\Big[\sum_{j=1}^{k} P_j(X)\big(\mathrm{sign}(\bar g(\bar f(X), j)) - \mathrm{sign}(g(f(X), j))\big)\Big] \tag{6}
\]
\[
= E\big[\max_j P_j(X) - P_{\arg\max_j f_j(X)}(X)\big] \ge 0
\]
\[
= E\Big[\sum_{j \ne l} |P_l(X) - P_j(X)|\, I\big(\mathrm{sign}(\bar g(\bar f(X), l)) = 1,\ \mathrm{sign}(g(f(X), j)) = 1\big)\Big], \tag{7}
\]

where ḡ(f̄(x), j) = {f̄_j(x) − f̄_l(x), l ≠ j}.
Equation (6) in Theorem 3.1 expresses e(f, f̄) as a weighted sum of the individual misclassification errors, weighted by the conditional probabilities P_j(X). Equation (7) expresses e(f, f̄) in terms of the misclassification resulting from the k(k − 1)/2 multiple pairwise comparisons.
Equation (7) suggests that a multicategory problem differs dramatically from its binary counterpart. For a binary problem, (7) reduces to e(f_2, f̄_2) = E[|P_2(X) − 1/2| |sign(f_2(X)) − sign(f̄_2(X))|], because P_2(x) − P_1(x) = 2(P_2(x) − 1/2). This means that a comparison between P_1(x) and P_2(x) in the binary case is equivalent to examining whether P_2(x) exceeds 1/2. For a multicategory problem, however, this no longer holds, since multiple pairwise comparisons are necessary to determine the argmax. In fact, there may not exist a dominating class; that is, max_l P_l(x) < 1/2 for some x ∈ S. Therefore, k comparisons of the P_j(x) with 1/2 may not suffice to determine the correct classification rule. Indeed, the issue of the existence of a dominating class matters in the multicategory case but not in the binary case.
The ultimate goal of classification is to minimize E[1 − sign(g(f(X), Y))]. To avoid the scale-invariance problem of the sign function, we employ the ψ-loss as a surrogate and minimize E[ψ(g(f(X), Y))] instead. The following theorem says that a ψ-loss yields the same Bayes rule as the 1 − sign loss; thus, consistency of multicategory ψ-learning can be established.
Theorem 3.2. The Bayes decision vector f̄ satisfies ḡ_min(f̄(x), argmax_{j=1,...,k} P_j(x)) = 1, where ḡ_min is the minimum of the k − 1 elements of the vector ḡ. For any ψ satisfying (2), f̄ minimizes both E[ψ(g(f(X), Y))] and E[1 − sign(g(f(X), Y))], in the sense that E[ψ(g(f(X), Y))] ≥ E[ψ(ḡ(f̄(X), Y))] = E[1 − sign(ḡ(f̄(X), Y))] ≤ E[1 − sign(g(f(X), Y))] for any f. Furthermore, the minimizers of E[ψ(g(f(X), Y))] and E[1 − sign(g(f(X), Y))] are not unique; e.g., c f̄ is also a minimizer of both quantities for any c ≥ 1.
Theorem 3.2 says that ψ-learning estimates the Bayes classifier defined by f̄, as opposed to the conditional probabilities (P_1(x), ..., P_k(x)), and that the ψ-loss plays the same role as 1 − sign. Furthermore, the optimal performance of f̄, with ḡ_min(f̄(x), argmax_j P_j(x)) = 1, is realized via the ψ-loss function although it differs from 1 − sign.
3.2 Statistical Learning Theory
Let F be a class of candidate function vectors, which is allowed to depend on n. Note that the Bayes decision function f̄ is not required to belong to F. For any function vector f ∈ F, classification is performed by partitioning S into the k disjoint sets (G_{f_1}, ..., G_{f_k}) = ({x : sign(g(f(x), 1)) = 1}, ..., {x : sign(g(f(x), k)) = 1}).

In this section, we generalize the learning theory of Shen et al. (2003) to the multicategory case. Our learning theory quantifies the magnitude of e(f, f̄) as a function of n, k, the tuning parameter C, and the complexity of the class of candidate classification partitions G(F) = {(G_{f_1}, ..., G_{f_k}) : f ∈ F} induced by F.
Denote by e_ψ(f, f̄) = (1/2)(E[ψ(g(f(X), Y))] − E[ψ(ḡ(f̄(X), Y))]) the approximation error, which measures the degree of approximation of G(F) to (G_{f̄_1}, ..., G_{f̄_k}). Let J_0 = max(J(f_0), 1). The following technical assumptions are made.

Assumption A: (Approximation error) For some positive sequence s_n → 0 as n → ∞, there exists f_0 ∈ F such that e_ψ(f_0, f̄) ≤ s_n; equivalently, inf_{f ∈ F} e_ψ(f, f̄) ≤ s_n. Like F, f_0 may depend on n.

Assumption B: (Boundary behavior) There exist constants 0 < α ≤ +∞ and c_1 > 0 such that P(X ∈ S : max_l P_l(X) − P_{j ≠ argmax_l P_l(X)}(X) < 2δ) ≤ c_1 δ^α for any small δ ≥ 0.

Assumption B describes the behavior of the conditional probabilities P_j near the decision boundary {x ∈ S : max_l P_l(x) = P_j(x) for some j ≠ argmax_l P_l(x), j ∈ {1, ..., k}}. It is equivalent to P(X ∈ S : max_l P_l(X) − second max_l P_l(X) < 2δ) ≤ c_1 δ^α, by the fact that {X : max_l P_l(X) − P_{j ≠ argmax_l P_l(X)}(X) < 2δ} ⊂ {X : max_l P_l(X) − second max_l P_l(X) < 2δ}.
To specify Assumption C, we define the metric entropy for partitions. For a class of partitions B = {(B_1, ..., B_k) : B_j ∩ B_l = ∅ for all j ≠ l, ∪_{1≤j≤k} B_j = S} and any ε > 0, call {(G^v_{j1}, G^u_{j1}), ..., (G^v_{jm}, G^u_{jm})}, j = 1, ..., k, an ε-bracketing set of B if for any (G_1, ..., G_k) ∈ B there exists an h such that G^v_{jh} ⊂ G_j ⊂ G^u_{jh} for each j, and

\[
\max_{1 \le h \le m} \max_{1 \le j \le k} P\big(G^u_{jh} \,\Delta\, G^v_{jh}\big) \le \varepsilon, \tag{8}
\]

where G^u_{jh} Δ G^v_{jh} is the set difference between G^u_{jh} and G^v_{jh}. The metric entropy H_B(ε, B) of B with bracketing is then defined as the logarithm of the cardinality of the ε-bracketing set of B of the smallest size.
Let F(ℓ) = {f ∈ F : J(f) ≤ ℓ} ⊂ F and G(ℓ) = {(G_{f_1}, ..., G_{f_k}) : f ∈ F(ℓ)} ⊂ G(F). Then G(ℓ) is the set of classification partitions under the regularization J(f) ≤ ℓ. For instance, J(f) is (1/2) Σ_j ‖w_j‖² in (4), or (1/2) Σ_j ‖h_j‖²_{H_K} in (5). To measure the complexity of G(ℓ) via the metric entropy, the following assumption is made.

Assumption C: (Metric entropy for partitions) For some positive constants c_i, i = 2, 3, 4, there exists some ε_n > 0 such that

\[
\sup_{\ell \ge 2} \phi(\varepsilon_n, \ell) \le c_2 n^{1/2}, \tag{9}
\]

where φ(ε_n, ℓ) = ∫_{c_4 L}^{c_3^{1/2} L^{α/(2(α+1))}} H_B^{1/2}(u²/4, G(ℓ)) du / L and L = L(ε_n, C, ℓ) = min(ε_n² + (Cn)^{−1}(ℓ/2 − 1) J_0, 1).
Assumption D: (ψ-function) The ψ-function satisfies (2).

As a technical remark, we note that to simplify the function entropy calculation in Assumption C required in Theorem 3.4, an additional condition on the ψ-function may be imposed. For instance, we may restrict the ψ-loss functions in (2) to satisfy a multivariate Lipschitz condition:

\[
|\psi(u^*) - \psi(u^{**})| \le D\, |u^*_{\min} - u^{**}_{\min}|, \tag{10}
\]

where D > 0 is a constant. Condition (10) is satisfied by the specific ψ function in (3), with D = 2. This aspect is illustrated in Example 3.3.2. However, (10) is irrelevant to the set entropy in Assumption C required in Theorem 3.3; see Example 3.3.1.
Theorem 3.3. Suppose that Assumptions A-D are met. Then, for any ψ-learning classifier argmax(f̂), there exists a constant c_5 > 0 such that

\[
P\big(e(\hat f, \bar f) \ge \delta_n^2\big) \le 3.5\, \exp\Big(-c_5\, n\, (nC)^{-\frac{\alpha+2}{\alpha+1}} J_0^{\frac{\alpha+2}{\alpha+1}}\Big),
\]

provided that Cn ≥ 2δ_n^{−2} J_0, where δ_n² = min(max(ε_n², 2s_n), 1).

Corollary 3.1. Under the assumptions of Theorem 3.3,

\[
|e(\hat f, \bar f)| = O_p(\delta_n^2), \qquad E|e(\hat f, \bar f)| = O(\delta_n^2),
\]

provided that n^{−1/(α+1)} (C^{−1} J_0)^{(α+2)/(α+1)} is bounded away from zero.
To obtain the error rate δ_n² in Theorem 3.3, we need to compute the metric entropy of G(ℓ). This may not be easy, because G(ℓ) is induced by the class of functions F(ℓ). Moreover, it is also of interest to establish an upper bound on e(f̂, f̄) using the corresponding function entropy as opposed to the set entropy. In what follows, we develop such results in Theorem 3.4.
To proceed, we define the L_2-metric entropy with bracketing for F as follows. For any ε > 0, call {(g^v_1, g^u_1), ..., (g^v_m, g^u_m)} an ε-bracketing set of F if for any g ∈ F there is an h such that g^v_h ≤ g ≤ g^u_h and max_{1≤h≤m} ‖g^u_h − g^v_h‖_2 ≤ ε, where ‖·‖_2 is the usual L_2-norm, defined by ‖g‖_2² = ∫ g² dP. Then the L_2-metric entropy of F with bracketing, H_B(ε, F), is defined as the logarithm of the cardinality of the ε-bracketing set of the smallest size. Now define the new function set F_ψ(ℓ) = {ψ(g(f(x), y)) − ψ(g_0(f_0(x), y)) : f ∈ F(ℓ)} and φ*(ε*_n, ℓ) = ∫_{c_4 L*}^{c_3^{1/2} L*^{α/(2(α+1))}} H_B^{1/2}(u, F_ψ(ℓ)) du / L*, with L* = min(ε*_n² + (Cn)^{−1}(ℓ/2 − 1) J_0, 1).
Theorem 3.4. Suppose that Assumptions A-D are met, with φ*(ε*_n, ℓ) replacing φ(ε_n, ℓ) in Assumption C. Then, for any ψ-learning classifier argmax(f̂), there exists a constant c_5 > 0 such that

\[
P\big(e(\hat f, \bar f) \ge \delta_n^{*2}\big) \le P\big(e_\psi(\hat f, \bar f) \ge \delta_n^{*2}\big) \le 3.5\, \exp\Big(-c_5\, n\, (nC)^{-\frac{\alpha+2}{\alpha+1}} J_0^{\frac{\alpha+2}{\alpha+1}}\Big),
\]

provided that Cn ≥ 2δ*_n^{−2} J_0, where δ*_n² = min(max(ε*_n², 2s_n), 1).

Corollary 3.2. Under the assumptions of Theorem 3.4,

\[
|e_\psi(\hat f, \bar f)| = O_p(\delta_n^{*2}), \qquad E|e_\psi(\hat f, \bar f)| = O(\delta_n^{*2}),
\]

provided that n^{−1/(α+1)} (C^{−1} J_0)^{(α+2)/(α+1)} is bounded away from zero.
Note that e_ψ(f̂, f̄) ≥ e(f̂, f̄). The rate δ*_n² obtained from Theorem 3.4 using the metric entropy for functions yields an upper bound on e(f̂, f̄); thus e(f̂, f̄) ≤ min(δ_n², δ*_n²) with probability tending to 1, by Theorems 3.3-3.4. In application, one may calculate either δ_n² or δ*_n², depending on which entropy is easier to compute.
Theorems 3.3-3.4 reveal distinct characteristics of multicategory problems, although they cover the binary case. First, a multicategory problem generally has a higher level of complexity, and hence the number of classes k may have an impact on performance. In fact, Theorems 3.3-3.4 permit studying the dependency of e(f̂, f̄) on k and n simultaneously; see Examples 3.3.1 and 3.3.2. Second, some properties of binary linear learning no longer hold in the multicategory case when k > 2. For instance, the decision boundaries generated by linear learning with k > 2 can be piecewise-linear hyperplanes.
3.3 Illustrative Examples
To illustrate our learning theory, we study specific learning examples and apply the theory to derive error bounds for multicategory ψ-learning.
3.3.1. Linear classification: We consider linear classification with a class of k hyperplanes, F = {f : f_j(x) = w_j^T x + b_j, Σ_{j=1}^{k} f_j = 0, x ∈ S = [0, 1]^d}, where d is a constant. To generate the training sample, we specify P(Y = j) = 1/k, and P(x | Y = j) = k − 1 for {x : x_1 ∈ [(j − 1)/k, j/k)} and 1/(k − 1) otherwise, where x_1 is the first coordinate of x. Then the Bayes classifier yields the sets {x : x_1 ∈ [0, 1/k)}, ..., {x : x_1 ∈ [(k − 1)/k, 1]} for the corresponding k classes.
We now verify Assumptions A-C. For Assumption A, it is easy to find f_t = (w_{11} x_1 + b_1, ..., w_{1k} x_1 + b_k) such that the w_{1j}'s are increasing, Σ_{j=1}^{k} w_{1j} = 0, Σ_{j=1}^{k} b_j = 0, and w_{1j} j/k + b_j = w_{1,j+1} j/k + b_{j+1}, j = 1, ..., k − 1. Let f_0 = n f_t ∈ F; then e_ψ(f_0, f̄) ≤ s_n = c_1 n^{−1} for some constant c_1 > 0. This implies Assumption A with s_n = c_1 n^{−1}. Assumption B is satisfied with α = +∞, since P(X ∈ S : max_l P_l(X) − P_j(X) < 2δ) = P(X_1 ∈ {1/k, ..., (k − 1)/k}) = 0 for any sufficiently small δ > 0. To verify Assumption C, we note that H_B(u, G(ℓ)) ≤ O(k² log(k/u)) for any given ℓ, by Lemma 1. Let φ_1(ε_n, ℓ) = c_3 (k² log(k/L^{1/2}))^{1/2} / L^{1/2}, where L = min(ε_n² + (Cn)^{−1}(ℓ/2 − 1), 1). This in turn yields sup_{ℓ≥2} φ(ε_n, ℓ) ≤ φ_1(ε_n, 2) = c (k² log(k/ε_n))^{1/2} / ε_n for some c > 0, and a rate ε_n = (k² log n / n)^{1/2} when C/J_0 ∼ δ_n^{−2} n^{−1} ∼ (k² log n)^{−1}, provided that k² log n / n → 0.

By Corollary 3.1, we conclude that e(f̂, f̄) ≤ O(k² log n / n) except on a set of probability tending to zero, and that E[e(f̂, f̄)] ≤ O(k² log n / n), when k² log n / n → 0 as n → ∞. It is interesting to note that E[e(f̂, f̄)] ≤ O(n^{−1} log n) when k is a fixed constant. This conclusion holds generally for any ψ-function satisfying Assumption D.
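To make this toy distribution concrete, here is a sketch of a sampler (our own illustration, with hypothetical helper names) together with an empirical check that the Bayes rule based on the strip containing x_1 attains accuracy (k − 1)/k.

```python
import numpy as np

def sample_example_331(n, k=3, d=2, rng=None):
    """Sample from the distribution of Example 3.3.1 (our reading of it):
    Y is uniform on {1, ..., k}; given Y = j, X1 has density k-1 on the strip
    [(j-1)/k, j/k) and density 1/(k-1) elsewhere on [0, 1]; the remaining
    coordinates are uniform on [0, 1]."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = rng.integers(1, k + 1, size=n)
    x = rng.uniform(size=(n, d))
    lo = (y - 1) / k
    # Strip mass: width 1/k times density k-1, i.e. probability (k-1)/k.
    in_strip = rng.uniform(size=n) < (k - 1) / k
    x1_strip = lo + rng.uniform(size=n) / k
    # Outside the favored strip: uniform on [0,1] minus [lo, lo + 1/k).
    u = rng.uniform(size=n) * (k - 1) / k
    x1_out = np.where(u < lo, u, u + 1.0 / k)
    x[:, 0] = np.where(in_strip, x1_strip, x1_out)
    return x, y

x, y = sample_example_331(20000)
bayes_pred = np.floor(x[:, 0] * 3).astype(int) + 1  # strip index = Bayes label
bayes_pred = np.minimum(bayes_pred, 3)              # guard the x1 = 1 edge case
print((bayes_pred == y).mean())  # close to 2/3, the Bayes accuracy for k = 3
```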
3.3.2. Gaussian-kernel classification: In this example, we consider nonlinear learning with the same P(x, y) as in Example 3.3.1. Let F = {f : f_j(x) = Σ_{i=1}^{n} v_{ji} K(x_i, x) + b_j, Σ_{j=1}^{k} f_j = 0, x ∈ S = [0, 1]^d}, with Gaussian kernel K(s, t) = exp(−‖s − t‖²/σ²).

For Assumption A, we note that F is a rich function space when n is large. In fact, any continuous function can be well approximated by Gaussian-kernel representations under the sup-norm; c.f. Steinwart (2001). Thus there exists an f_t = (f_{1t}, ..., f_{kt}) ∈ F such that f_{jt}(x) ≥ 0 for x_1 ∈ [(j − 1)/k, j/k] and f_{jt}(x) < 0 otherwise. With the choice f_0 = ε*_n^{−2} f_t, we have e_ψ(f_0, f̄) ≤ s_n = c_1 ε*_n², where c_1 is a constant and ε*_n is defined below. Assumption B is satisfied with α = +∞, as in Example 3.3.1. In this case, the metric entropy of F_ψ(ℓ) appears to be easier to compute, so we apply Theorem 3.4 to obtain the convergence rate. Consider any ψ-function in (2) that satisfies (10). Then, by Lemma 2, H_B(u, F_ψ(ℓ)) ≤ O(k (log(ℓ/u))^{d+1}) for any given ℓ. Let φ*_1(ε*_n, ℓ) = c_3 (k (log(ℓ/L^{1/2}))^{d+1})^{1/2} / L^{1/2}, where L = min(ε*_n² + (Cn)^{−1}(ℓ/2 − 1), 1). Then sup_{ℓ≥2} φ*(ε*_n, ℓ) ≤ φ*_1(ε*_n, 2) = c (k (log(1/ε*_n))^{d+1})^{1/2} / ε*_n for some c > 0. Solving (9) yields the rate ε*_n = (k (log(n k^{−1}))^{d+1} / n)^{1/2} when C/J_0 ∼ δ*_n^{−2} n^{−1} ∼ (k (log(n k^{−1}))^{d+1})^{−1}, under the condition that k (log(n k^{−1}))^{d+1} / n → 0 as n → ∞.
By Theorem 3.4, we conclude that e(f̂, f̄) ≤ e_ψ(f̂, f̄) ≤ O(k (log(n k^{−1}))^{d+1} / n) except on a set of probability tending to zero. By Corollary 3.2, E[e(f̂, f̄)] ≤ O(k (log(n k^{−1}))^{d+1} / n). The resulting rate reflects the dependence on the class number k. If k is treated as a fixed constant, then E[e(f̂, f̄)] ≤ O(n^{−1}(log n)^{d+1}). This conclusion holds generally for any ψ-function satisfying Assumption D and Condition (10).
In summary, Examples 3.3.1-3.3.2 provide insight into the generalization error of the proposed methodology. In view of the n^{−1} lower-bound result in the binary case (c.f. Tsybakov, 2004), we conjecture that the rates obtained in Examples 3.3.1 and 3.3.2 are nearly optimal, although, to our knowledge, a lower-bound result for general classifiers has not yet been established in the multicategory case. Further investigation is necessary.
4 Numerical Examples
In this section, we examine the performance of multicategory ψ-learning in terms of generalization and compare it with its counterpart SVM. In the literature, there are a number of different multicategory SVM generalizations; for instance, Lee et al. (2004), Crammer and Singer (2001), and Weston and Watkins (1998), among others. To make a fair comparison, we use a version of multicategory SVM that is parallel to our multicategory ψ-learning. Specifically, we replace the ψ function in (4) and (5) by ψ_1. For the linear case, this version of multicategory SVM solves

\[
\min_{b, w} \Big( \frac{1}{2} \sum_{j=1}^{k} \|w_j\|^2 + C \sum_{i=1}^{n} \psi_1\big(g(f(x_i), y_i)\big) \Big) \quad \text{subject to} \quad \sum_{j=1}^{k} b_j 1_n + X \sum_{j=1}^{k} w_j = 0. \tag{11}
\]
This version of SVM is closely related to that of Crammer and Singer (2001). In their formulation, however, all b_j's are set to 0 rather than employing the sum-to-zero constraint, in contrast to (11). As argued by Guermeur (2002), the sum-to-zero constraint is necessary to ensure uniqueness of the solution when a k-dimensional vector of decision functions with intercepts b_j is used for a k-class problem.
4.1 Simulation
Two linear examples are considered. In these examples, the GE is approximated by the testing error on a testing sample independent of training. In what follows, all calculations are carried out using the IMSL C routines.
Three-class linear problem. The training data are generated as follows. First, generate pairs (t_1, t_2) from a bivariate t-distribution with ν degrees of freedom, where ν = 1 and 3 in Examples 1 and 2, respectively. Second, randomly assign a label from {1, 2, 3} to each (t_1, t_2). Third, calculate (x_1, x_2) as x_1 = t_1 + a_1 and x_2 = t_2 + a_2, with the three different values (a_1, a_2) = (√3, 1), (−√3, 1), (0, −2) for the corresponding classes 1-3. In these examples, the testing and Bayes errors are computed via independent testing samples of size 10^6 for classifiers obtained from training samples of size 150.
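The generating recipe above can be sketched as follows (our own hypothetical reconstruction, not the paper's IMSL code), using the standard construction of a bivariate t variate as a normal pair divided by the square root of an independent scaled chi-square variate.

```python
import numpy as np

def make_three_class(n, df, rng=None):
    """Simulation sketch: bivariate t noise with `df` degrees of freedom,
    shifted by the class means (sqrt(3), 1), (-sqrt(3), 1), (0, -2);
    labels assigned uniformly at random."""
    rng = np.random.default_rng(1) if rng is None else rng
    centers = np.array([[np.sqrt(3.0), 1.0], [-np.sqrt(3.0), 1.0], [0.0, -2.0]])
    y = rng.integers(0, 3, size=n)
    # Bivariate t_df: standard normal pair divided by sqrt(chi^2_df / df).
    z = rng.standard_normal((n, 2))
    w = np.sqrt(rng.chisquare(df, size=n) / df)
    x = z / w[:, None] + centers[y]
    return x, y + 1

x, y = make_three_class(150, df=1)  # df = 1: heavy tails, no first moment
print(x.shape)                      # (150, 2)
```

With df = 1 the noise is bivariate Cauchy, which produces the extreme outliers that make this example hard for an unbounded loss.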
To eliminate the dependence on C, we maximize the performance of ψ-learning and SVM by optimizing C over a discrete set in [10^{-3}, 10^3]. For each method, the testing error at the optimal C is averaged over 100 repeated simulations. The simulation results are summarized in Table 1.
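The tuning step can be sketched as a simple grid search. Here `train_and_eval` is a hypothetical callback standing in for fitting either method at a given C and returning its testing error; the grid size is our choice.

```python
import numpy as np

def best_test_error(train_and_eval, grid=np.logspace(-3, 3, 13)):
    # Evaluate each C on a log-spaced grid over [10^-3, 10^3] and keep
    # the smallest testing error, mirroring how each method is reported
    # at its optimal C.
    errors = [train_and_eval(C) for C in grid]
    i = int(np.argmin(errors))
    return grid[i], errors[i]
```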
Insert Table 1 about here
As shown in Table 1, ψ-learning usually has a smaller testing error, and thus better generalization, than its SVM counterpart. The amount of improvement, however, varies across examples. In Example 1, where the t-distribution has one degree of freedom, the improvement of multicategory ψ-learning over SVM is 43.22%. In Example 2, with three degrees of freedom, it decreases to 20.41%. Further, ψ-learning yields a smaller number of support vectors. This suggests that ψ-learning has an even sparser solution than SVM, and hence a stronger ability for data reduction. On a related matter, SVM fails to achieve data reduction in Example 1, since almost all instances are support vectors, in contrast to the much smaller number of support vectors for ψ-learning. One plausible explanation is that the first moment of the standard bivariate t-distribution does not exist in this case, and thus the corresponding SVM does not work well. In general, any classifier with an unbounded loss, such as SVM, may suffer from extreme outliers as in this example. This reinforces our view that ψ-learning is more robust against outliers.
4.2 Application
We now examine the performance of ψ-learning and its SVM counterpart on the benchmark example letter, obtained from Statlog. In this example, each sample contains 16 primitive numerical attributes converted from a letter image, with a response variable representing 26 categories. The main goal is to identify each letter image as one of the 26 capital letters of the English alphabet. A detailed description can be found at www.liacc.up.pt/ML/statlog/datasets/letter/letter.doc.html.
For illustration, we use the data for the letters D, O, and Q, with 805, 753, and 783 cases, respectively. A random sample of n = 200 is selected for training, leaving the rest for testing. For each training dataset, we seek the best performance of linear ψ-learning and SVM over a set of C-values in [10^{-3}, 10^3]. The results corresponding to the smallest testing error for each method in ten different cases are reported in Table 2. Since the Bayes error is unknown, the improvement of ψ-learning over SVM is computed via (T(SVM) − T(ψ))/T(SVM).
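For reference, the reported improvement in this section is the simple relative reduction in testing error; a one-line sketch:

```python
def improvement(t_svm, t_psi):
    # Relative improvement (T(SVM) - T(psi)) / T(SVM), where T(.) denotes
    # the testing error of a method (Bayes error unknown here).
    return (t_svm - t_psi) / t_svm
```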
Insert Table 2 about here
Table 2 indicates that multicategory ψ-learning has a smaller testing error than its SVM counterpart, although the amount of improvement varies from sample to sample. In addition, on average, multicategory ψ-learning has a smaller number of support vectors than SVM. In conclusion, ψ-learning generalizes better and achieves further data reduction compared with SVM in this example.
5 Discussion
In this article, we propose a new methodology that generalizes ψ-learning from the binary case to the multicategory case. A statistical learning theory is developed for ψ-learning in terms of the Bayesian regret. In simulations, we show that the proposed methodology performs well and is more robust against outliers than its SVM counterpart. In addition, we discover some interesting phenomena that do not arise in the binary case.

Recently, there has been considerable interest in studying the variable selection problem using the L_1 norm in place of the conventional L_2 norm. In the binary case, Zhu et al. (2003) studied properties of the L_1 SVM and showed that the corresponding regularized solution path is piecewise linear. It is therefore natural to investigate variable selection for L_1 ψ-learning.

Further developments are necessary to make multicategory ψ-learning more useful in practice, particularly methodologies for a data-driven choice of C, variable selection, the regularized solution path, and nonstandard situations including unequal loss assignments.
Appendix
Proof of Theorem 3.1: By the definition of Err(f), it is easy to obtain via conditioning that e(f, f̄) = (1/2) E[Σ_{l=1}^k P_l(X)(sign(ḡ(f̄(X), l)) − sign(g(f(X), l)))]. It then suffices to consider the situation in which sign(ḡ(f̄(X), l)) − sign(g(f(X), l)) is nonzero, that is, when the two classifiers disagree. Equivalently, for any given X = x, we can write e(f, f̄) using all possible distinct classifications produced by f̄ and f jointly, where sign(ḡ(f̄(x), l)) = 1 and sign(g(f(x), j)) = 1 imply that f̄ classifies x into class l while f classifies x into class j, for 1 ≤ l ≠ j ≤ k. Thus we have

e(f, f̄) = E[ Σ_{l=1}^k Σ_{j≠l} (P_l(X) − P_j(X)) I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) ]
        = E[ Σ_{l=1}^k Σ_{j≠l} |P_l(X) − P_j(X)| I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) ],

where the second equality follows from the fact that f̄ is the optimal (Bayes) decision function vector, so that P_l(X) ≥ P_j(X) whenever sign(ḡ(f̄(X), l)) = 1. The desired result then follows.
Proof of Theorem 3.2: Write E[1 − sign(g(f(X), Y)) | X = x] as Σ_{j=1}^k (1 − sign(g(f(x), j))) P_j(x) = 1 − Σ_{j=1}^k sign(g(f(x), j)) P_j(x). Note that, for any given x, one and only one of the sign(g(f(x), j)) can be 1, with the rest equal to −1. Consequently, E[1 − sign(g(f(X), Y))] is minimized when sign(g(f(x), argmax_j f̄_j(x))) = 1, i.e., f = f̄. Evidently, the minimizer is not unique, as c f̄ for c ≥ 1 is also a minimizer. The desired result then follows from the facts that ψ(u) ≥ 1 − sign(u) and ψ(ḡ) = 1 − sign(ḡ).
Proof of Theorem 3.3: Before proceeding, we introduce some notation. Let l̃_ψ(f, Z_i) = l_ψ(f, Z_i) + λJ(f) be the cost function to be minimized, as in (4) or (5), where l_ψ(f, Z_i) = ψ(g(f(X_i), Y_i)) and λ = 1/(Cn). Let l̃(f, Z_i) = l(f, Z_i) + λJ(f), where l(f, Z_i) = 1 − sign(g(f(X_i), Y_i)). Define the scaled empirical process E_n(l̃(f, Z) − l̃_ψ(f_0, Z)) as

n^{-1} Σ_{i=1}^n ( l̃(f, Z_i) − l̃_ψ(f_0, Z_i) − E[l̃(f, Z_i) − l̃_ψ(f_0, Z_i)] ) = E_n[l(f, Z) − l_ψ(f_0, Z)],

where Z = (X, Y). Let A_{i,j} = {f ∈ F : 2^{i-1} δ_n^2 ≤ e(f, f̄) < 2^i δ_n^2, 2^{j-1} J_0 ≤ J(f) < 2^j J_0} and A_{i,0} = {f ∈ F : 2^{i-1} δ_n^2 ≤ e(f, f̄) < 2^i δ_n^2, J(f) < J_0}, for j = 1, 2, … and i = 1, 2, …. Without loss of generality, we assume J(f_0) ≥ 1 and max(ε_n^2, 2s_n) < 1 in the sequel.

The proof uses the treatment of Shen et al. (2003) and Shen (1998), together with the results in Theorem 3.1 and Assumption B. In what follows, we omit any detail that can be referred to the proof of Theorem 1 of Shen et al. (2003).
Using the connection between e(f̂, f̄) and the cost function as in Shen et al. (2003), we have

P(e(f̂, f̄) ≥ δ_n^2) ≤ P*( sup_{{f∈F: e(f, f̄) ≥ δ_n^2}} n^{-1} Σ_{i=1}^n (l̃_ψ(f_0, Z_i) − l̃(f, Z_i)) ≥ 0 ) = I,

where P* denotes the outer probability measure.
To bound I, it suffices to bound P(A_{i,j}) for each i, j = 1, …. To this end, we need some inequalities regarding the first and second moments of l̃(f, Z) − l̃_ψ(f_0, Z) for f ∈ A_{i,j}.

For the first moment, note that E[l(f, Z) − l_ψ(f_0, Z)] = E[l(f, Z) − l_ψ(f̄, Z)] − E[l_ψ(f_0, Z) − l_ψ(f̄, Z)], which is equal to 2(e(f, f̄) − e_ψ(f_0, f̄)) since E l_ψ(f̄, Z) = E l(f̄, Z) by Theorem 3.2. By Assumption A and the definition of δ_n^2, 2e_ψ(f_0, f̄) ≤ 2s_n ≤ δ_n^2. Then, using the assumption that J_0 λ ≤ δ_n^2/2, we have, for any integers i, j ≥ 1,

inf_{A_{i,j}} E(l̃(f, Z) − l̃_ψ(f_0, Z)) ≥ M(i, j) = 2^{i-1} δ_n^2 + λ(2^{j-1} − 1)J(f_0),   (12)

and

inf_{A_{i,0}} E(l̃(f, Z) − l̃_ψ(f_0, Z)) ≥ (2^{i-1} − 1/2) δ_n^2 ≥ M(i, 0) = 2^{i-2} δ_n^2,   (13)

where the fact that 2^i − 1 ≥ 2^{i-1} has been used.
For the second moment, it follows from Theorem 3.1 and Assumption B that, for any f ∈ F,

e(f, f̄) = E[ Σ_{l=1}^k Σ_{j≠l} |P_l(X) − P_j(X)| I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) ]
        ≥ 2δ E[ Σ_{l=1}^k Σ_{j≠l} I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) I(|P_l(X) − P_j(X)| ≥ 2δ) ]
        ≥ δ ( E[ 2 Σ_{l=1}^k Σ_{j≠l} I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) ] − 2c_1 δ^α )
        = (1/2)(4c_1)^{-1/α} E[ 2 Σ_{l=1}^k Σ_{j≠l} I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) ]^{(α+1)/α}   (14)

with the choice δ = ( E[2 Σ_{l=1}^k Σ_{j≠l} I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1)]/(4c_1) )^{1/α}.
Now we establish a connection between the first and second moments. By Theorem 3.2, E[ψ(ḡ(f̄(X), Y)) − (1 − sign(ḡ(f̄(X), Y)))] = 0. Since ψ(u) ≥ 1 − sign(u) for any u ∈ R^{k-1}, E|ψ(g_0(f_0(X), Y)) − (1 − sign(g_0(f_0(X), Y)))| = E[ψ(g_0(f_0(X), Y)) − (1 − sign(g_0(f_0(X), Y)))] ≤ 2e_ψ(f_0, f̄). By the triangle inequality,

E[l(f, Z) − l_ψ(f_0, Z)]^2 ≤ 2E|1 − sign(g(f(X), Y)) − ψ(g_0(f_0(X), Y))| ≤ 2( 2e_ψ(f_0, f̄) + E|sign(ḡ(f̄(X), Y)) − sign(g(f(X), Y))| + E|sign(ḡ(f̄(X), Y)) − sign(g_0(f_0(X), Y))| ).   (15)
Note that for any f ∈ F,

E|sign(ḡ(f̄(X), Y)) − sign(g(f(X), Y))|
  = E[ Σ_{l=1}^k I(Y = l) |sign(ḡ(f̄(X), l)) − sign(g(f(X), l))| ]
  = E[ 2 Σ_{l=1}^k I(Y = l) Σ_{j≠l} I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) ]
  ≤ E[ 2 Σ_{l=1}^k Σ_{j≠l} I(sign(ḡ(f̄(X), l)) = 1, sign(g(f(X), j)) = 1) ].
This, together with (14), implies that

E|sign(ḡ(f̄(X), Y)) − sign(g(f(X), Y))| ≤ c* e(f, f̄)^{α/(α+1)},   (16)

where c* = 2^{α/(α+1)}(4c_1)^{1/(α+1)}. For any f ∈ A_{i,j}, e(f, f̄)^{α/(α+1)} ≥ (2^{-1} δ_n^2)^{α/(α+1)} ≥ 2^{-1} δ_n^2 ≥ s_n ≥ e_ψ(f_0, f̄); this and e(f, f̄) ≥ e(f_0, f̄), together with (15) and (16), imply that

E[l(f, Z) − l_ψ(f_0, Z)]^2 ≤ 2( 2e_ψ(f_0, f̄) + c*(e(f, f̄)^{α/(α+1)} + e(f_0, f̄)^{α/(α+1)}) ) ≤ c_3′ (e(f, f̄)/2)^{α/(α+1)},

with c_3′ = 16c_1^{1/(α+1)} + 8. Consequently, for i = 1, … and j = 0, 1, …,

sup_{A_{i,j}} E[l_ψ(f_0, Z) − l(f, Z)]^2 ≤ c_3′ (2^{i-1} δ_n^2)^{α/(α+1)} ≤ c_3 M(i, j)^{α/(α+1)} = v(i, j)^2,

where c_3 = 2c_3′.
We are now ready to bound I. Using the assumption that J_0 λ ≤ δ_n^2/2, together with (12) and (13), we have I ≤ Σ_{i≥1, j≥0} P*( sup_{A_{i,j}} E_n(l_ψ(f_0, Z) − l(f, Z)) ≥ M(i, j) ). By definition, l_ψ(f_0, Z) and l(f, Z) lie between 0 and 2. Then E[l_ψ(f_0, Z) − l(f, Z)]^2 ≤ 4 and E_n(l_ψ(f_0, Z) − l(f, Z)) ≤ 4. For convenience, we scale the empirical process by the constant t = (4c_3^{1/2})^{-1} in what follows. Then

I ≤ Σ_{i,j} P*( sup_{A_{i,j}} E_n(t[l_ψ(f_0, Z) − l(f, Z)]) ≥ M_c(i, j) ) + Σ_i P*( sup_{A_{i,0}} E_n(t[l_ψ(f_0, Z) − l(f, Z)]) ≥ M_c(i, 0) ) = I_1 + I_2   (17)

and sup_{A_{i,j}} E[l_ψ(f_0, Z) − l(f, Z)]^2 ≤ v_c(i, j)^2, where v_c(i, j) = min(t^{1/2} v(i, j), 1) and M_c(i, j) = min(tM(i, j), c_3^{-1/2}). Note that v_c(i, j) < 1 implies M_c(i, j) = tM(i, j).
Next we bound I_1 and I_2 separately. For I_1, we verify the required conditions (4.5)-(4.7) of Theorem 3 in Shen and Wong (1994). To compute the metric entropy in (4.7) there, we construct a bracketing function of l_ψ(f_0, Z) − l(f, Z). Denote an ε-bracketing set for {(G_{f_1}, …, G_{f_k}) : f ∈ A_{i,j}} by {(G^v_{p1}, …, G^v_{pm}), (G^u_{p1}, …, G^u_{pm})}, p = 1, …, k. Let s^v_{ph}(x) be −1 if x ∈ G^u_{ph} and 1 otherwise, and let s^u_{ph}(x) be −1 if x ∈ G^v_{ph} and 1 otherwise, for p = 1, …, k and h = 1, …, m. Then {(s^v_{p1}, …, s^v_{pm}), (s^u_{p1}, …, s^u_{pm})} forms an ε-bracketing function of −sign(g(f(x), p)) for f ∈ A_{i,j} and p = 1, …, k. This implies that for any ε ≥ 0 and f ∈ A_{i,j}, there exists an h (1 ≤ h ≤ m) such that l^v_h(z) ≤ l(f, z) − l_ψ(f_0, z) ≤ l^u_h(z) for any z = (x, y), where l^u_h(z) = 1 + Σ_{p=1}^k s^u_{ph}(x) I(y = p) − l_ψ(f_0, z), l^v_h(z) = 1 + Σ_{p=1}^k s^v_{ph}(x) I(y = p) − l_ψ(f_0, z), and

(E[l^u_h − l^v_h]^2)^{1/2} = ( Σ_{p=1}^k E[(s^u_{ph}(X) − s^v_{ph}(X)) I(Y = p)]^2 )^{1/2} ≤ 2( max_p P(G^u_{ph} Δ G^v_{ph}) )^{1/2} ≤ 2ε^{1/2}.

So (E[l^u_h − l^v_h]^2)^{1/2} ≤ min(2ε^{1/2}, 2). Hence H_B(ε, F*(2^j)) ≤ H(ε^2/4, G(2^j)) for any ε > 0 and j = 0, …, where F*(2^j) = {l(f, z) − l_ψ(f_0, z) : f ∈ F, J(f) ≤ 2^j}. Using the fact that ∫_{aM_c(i,j)}^{v_c(i,j)} H_B^{1/2}(u^2/4, G(2^j)) du / M_c(i, j) is non-increasing in i and M_c(i, j), i = 1, …, we have

∫_{aM_c(i,j)}^{v_c(i,j)} H_B^{1/2}(u^2/4, G(2^j)) du / M_c(i, j) ≤ ∫_{aM_c(1,j)}^{c_3^{1/2} M_c(1,j)^{α/(2(α+1))}} H_B^{1/2}(u^2/4, G(2^j)) du / M_c(1, j) ≤ φ(ε_n, 2^j),

where a = ϵ/32 with ϵ defined below. Thus (4.7) of Shen and Wong (1994) holds with M = n^{1/2} M_c(i, j) and v = v_c(i, j)^2, and so does (4.5). In addition, with T = 1,

M_c(i, j)/v_c(i, j)^2 ≤ max(c_3^{-1/2}, c_3^{-(2α+3)/(2α+2)}) = c_3^{-1/2} ≤ ϵ/(4T)

implies (4.6) with ϵ = 4c_3^{-1/2} < 1.
Note that 0 < δ_n ≤ 1 and λJ_0 ≤ δ_n^2/2. Using a similar argument as in Shen et al. (2003), an application of Theorem 3 of Shen and Wong (1994) yields

I_1 ≤ 3 exp(−c_5 n(λJ(f_0))^{(α+2)/(α+1)}) / [1 − exp(−c_5 n(λJ(f_0))^{(α+2)/(α+1)})]^2.

Here and in the sequel, c_5 is a positive generic constant. I_2 can be bounded similarly. Finally, I ≤ 6 exp(−c_5 n(λJ(f_0))^{(α+2)/(α+1)}) / [1 − exp(−c_5 n(λJ(f_0))^{(α+2)/(α+1)})]^2. This implies that I^{1/2} ≤ (5/2 + I^{1/2}) exp(−c_5 n(λJ(f_0))^{(α+2)/(α+1)}). The result then follows from the fact that I ≤ I^{1/2} ≤ 1.
Proof of Corollary 3.1: The result follows from the assumptions and the exponential
inequality in Theorem 3.3.
Proof of Theorem 3.4: The proof is similar to that of Theorem 3.3. For simplicity, we only sketch the parts that require modification. Consider the scaled empirical process E_n(l̃_ψ(f, Z) − l̃_ψ(f_0, Z)) and let A_{i,j} = {f ∈ F : 2^{i-1} δ_n^{*2} ≤ e_ψ(f, f̄) < 2^i δ_n^{*2}, 2^{j-1} J_0 ≤ J(f) < 2^j J_0} and A_{i,0} = {f ∈ F : 2^{i-1} δ_n^{*2} ≤ e_ψ(f, f̄) < 2^i δ_n^{*2}, J(f) < J_0}, for j = 1, 2, … and i = 1, 2, …. Using an analogous argument, we have

P(e_ψ(f̂, f̄) ≥ δ_n^{*2}) ≤ P*( sup_{{f∈F: e_ψ(f, f̄) ≥ δ_n^{*2}}} n^{-1} Σ_{i=1}^n (l̃_ψ(f_0, Z_i) − l̃_ψ(f, Z_i)) ≥ 0 ) = I.

To bound I, we consider the first and second moments of l̃_ψ(f, Z) − l̃_ψ(f_0, Z) for f ∈ A_{i,j}. For the first moment, it is straightforward to show that, for any integers i, j ≥ 1, inf_{A_{i,j}} E(l̃_ψ(f, Z) − l̃_ψ(f_0, Z)) ≥ M(i, j) = 2^{i-1} δ_n^{*2} + λ(2^{j-1} − 1)J(f_0), and inf_{A_{i,0}} E(l̃_ψ(f, Z) − l̃_ψ(f_0, Z)) ≥ M(i, 0) = 2^{i-2} δ_n^{*2}.
For the second moment, e_ψ(f, f̄) = e(f, f̄) + (1/2) E[ψ(g(f(X), Y)) I(g(f(X), Y) ∈ (0, τ))] and e_ψ(f, f̄) ≤ 1. Thus

(1/2) E[ψ(g(f(X), Y)) I(g(f(X), Y) ∈ (0, τ))] ≤ e_ψ(f, f̄) ≤ e_ψ(f, f̄)^{α/(α+1)}.   (18)

For any f ∈ A_{i,j}, e_ψ(f, f̄) ≥ 2^{-1} δ_n^{*2} ≥ s_n ≥ e_ψ(f_0, f̄); this, together with (16) and (18), implies that

E[l_ψ(f, Z) − l_ψ(f_0, Z)]^2 ≤ 2E|sign(g(f(X), Y)) − sign(g_0(f_0(X), Y))| + 2E[ψ(g_0(f_0(X), Y)) I(g_0(f_0(X), Y) ∈ (0, τ))] + 2E[ψ(g(f(X), Y)) I(g(f(X), Y) ∈ (0, τ))]
  ≤ 2( c*[e_ψ(f, f̄)^{α/(α+1)} + e_ψ(f_0, f̄)^{α/(α+1)}] ) + 4[e_ψ(f, f̄)^{α/(α+1)} + e_ψ(f_0, f̄)^{α/(α+1)}]
  ≤ c_3′ (e_ψ(f, f̄)/2)^{α/(α+1)},

with c_3′ = 16c_1^{1/(α+1)} + 8. Therefore, sup_{A_{i,j}} E(l_ψ(f_0, Z) − l_ψ(f, Z))^2 ≤ c_3 M(i, j)^{α/(α+1)} = v(i, j)^2 for i = 1, … and j = 0, 1, …, where c_3 = 2c_3′.
To bound I, note that I ≤ I_1 + I_2, where I_1 = Σ_{i,j} P*( sup_{A_{i,j}} E_n(l_ψ(f_0, Z) − l_ψ(f, Z)) ≥ M(i, j) ) and I_2 = Σ_i P*( sup_{A_{i,0}} E_n(l_ψ(f_0, Z) − l_ψ(f, Z)) ≥ M(i, 0) ). We can then bound I_1 and I_2 separately. Using the fact that ∫_{aM(i,j)}^{v(i,j)} H_B^{1/2}(u, F_ψ(2^j)) du / M(i, j) is non-increasing in i and M(i, j), i = 1, …, we have ∫_{aM(i,j)}^{v(i,j)} H_B^{1/2}(u, F_ψ(2^j)) du / M(i, j) ≤ φ*(ε_n^*, 2^j). The result then follows from the same argument as in the proof of Theorem 3.3.
Proof of Corollary 3.2: The result follows from the assumptions and the exponential
inequality in Theorem 3.4.
Lemma 1 (metric entropy in Example 3.3.1): Under the assumptions of Example 3.3.1, we have

H_B(ε, G(ℓ)) ≤ O(k^2 log(k/ε)).
Proof: Let (G_1, …, G_k) be a classification partition induced by f, and let G_{j_1 j_2} = {x ∈ S : f_{j_1}(x) − f_{j_2}(x) > 0}, j_1 ≠ j_2 ∈ {1, …, k}. We first construct a bracket for G_{j_1 j_2}. To this end, we determine the d points at which the plane f_{j_1} − f_{j_2} = 0 intersects d out of the d2^{d-1} edges of the cube [0, 1]^d. For each of these d points, we use a bracket of length ε* to cover it on the edge to which the point belongs. Given an edge, the covering number for this point is no greater than 1/ε*. Hence the covering number for the d points on d of the d2^{d-1} edges is at most C(d2^{d-1}, d)(1/ε*)^d.
After the d intersection points of f_{j_1} − f_{j_2} = 0 with the edges of S are covered, we connect the endpoints of the d brackets to form bracketing planes v_{j_1 j_2} = 0 and u_{j_1 j_2} = 0 such that {x : v_{j_1 j_2} > 0} ⊂ {x : f_{j_1} − f_{j_2} > 0} ⊂ {x : u_{j_1 j_2} > 0}. Since the longest segment in S has length √d, corresponding to the diagonal between (0, …, 0) and (1, …, 1), we have P(x : v_{j_1 j_2} < 0 < u_{j_1 j_2}) ≤ (√d)^{d-1} ε*, since x is uniformly distributed on S. Consequently, G^v_{j_1 j_2} ⊂ G_{j_1 j_2} ⊂ G^u_{j_1 j_2} and P(G^v_{j_1 j_2} Δ G^u_{j_1 j_2}) ≤ (√d)^{d-1} ε*, where G^v_{j_1 j_2} = {x : v_{j_1 j_2} > 0} and G^u_{j_1 j_2} = {x : u_{j_1 j_2} > 0}. Since G_{j_1} = ∩_{j_2} G_{j_1 j_2}, we have G^v_{j_1} ⊂ G_{j_1} ⊂ G^u_{j_1} and P(G^v_{j_1} Δ G^u_{j_1}) ≤ P(∪_{j_2} G^v_{j_1 j_2} Δ G^u_{j_1 j_2}) ≤ (k − 1)(√d)^{d-1} ε*, where G^v_{j_1} = ∩_{j_2} G^v_{j_1 j_2} and G^u_{j_1} = ∩_{j_2} G^u_{j_1 j_2}, j_1 ≠ j_2 ∈ {1, …, k}.

With ε = (k − 1)(√d)^{d-1} ε*, {(G^v_1, G^u_1), …, (G^v_k, G^u_k)} satisfies max_{j_1} P(G^v_{j_1} Δ G^u_{j_1}) ≤ ε and thus forms an ε-bracketing set for (G_1, …, G_k). Therefore the ε-covering number for all partitions induced by f is at most [C(d2^{d-1}, d)((k − 1)(√d)^{d-1}/ε)^d]^{k(k-1)}. Since d is a constant, the bracketing metric entropy H_B(ε, G(ℓ)) is bounded by O(k^2 log(k/ε)) for any ℓ, yielding the desired result.
Lemma 2 (metric entropy in Example 3.3.2): Under the assumptions of Example 3.3.2, we have

H_B(ε, F_ψ(ℓ)) ≤ O(k (log(ℓ/ε))^{d+1}).
Proof: To obtain an upper bound for H_B(ε, F_ψ(ℓ)), we use the sup-norm entropy bound for a single function class in Zhou (2002), namely H_∞(ε, F(ℓ)) ≤ O((log(ℓ/ε))^{d+1}) under the L_∞ metric ‖g‖_∞ = sup_{x∈S} |g(x)|. Consider an arbitrary function vector f = (f_1, …, f_k) ∈ F(ℓ). The metric entropy for all k-dimensional function vectors in F(ℓ) is then bounded by O(k (log(ℓ/ε))^{d+1}), since k functions need to be covered simultaneously. Let [f^v_j, f^u_j] be an ε-bracket for f_j. Then [f^v_j − f^u_l, f^u_j − f^v_l] forms a 2ε-bracket for f_j − f_l. Denote g^v_j = min_{l∈{1,…,k}∖{j}} (f^v_j − f^u_l) and g^u_j = min_{l∈{1,…,k}∖{j}} (f^u_j − f^v_l). Then [g^v_j, g^u_j] is a 2ε-bracket for g_min(f, j) = min_{l≠j} (f_j − f_l). Consequently, ψ(g^v_j) ≥ ψ(g_min(f, j)) ≥ ψ(g^u_j) by the non-increasing property of the ψ function. By (10), we have |ψ(g^v_j) − ψ(g^u_j)| ≤ 2Dε. Since g_min(f, y) = Σ_{j=1}^k I(y = j) g_min(f, j), we have g_min(f, y) ∈ [Σ_{j=1}^k I(y = j) g^v_j, Σ_{j=1}^k I(y = j) g^u_j] and |ψ(Σ_{j=1}^k I(y = j) g^v_j) − ψ(Σ_{j=1}^k I(y = j) g^u_j)| ≤ 2Dε. Consequently, [ψ(Σ_{j=1}^k I(y = j) g^u_j(x)) − ψ(g_0(f_0(x), y)), ψ(Σ_{j=1}^k I(y = j) g^v_j(x)) − ψ(g_0(f_0(x), y))] forms a bracket of length 2Dε for ψ(g(f(x), y)) − ψ(g_0(f_0(x), y)). The desired result then follows.
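The bracketing step above can be sanity-checked numerically. The following is a minimal sketch with our own function names: coordinatewise brackets [f^v_j, f^u_j] induce a bracket for g_min(f, j) = min_{l≠j}(f_j − f_l), exactly as constructed in the proof.

```python
import numpy as np

def gmin(f, j):
    # g_min(f, j) = min_{l != j} (f_j - f_l) for a score vector f.
    f = np.asarray(f, dtype=float)
    return (f[j] - np.delete(f, j)).min()

def bracket_gmin(fv, fu, j):
    # Bracket [g^v_j, g^u_j] induced by coordinatewise brackets [f^v, f^u]:
    # g^v_j = min_{l != j} (f^v_j - f^u_l), g^u_j = min_{l != j} (f^u_j - f^v_l).
    fv = np.asarray(fv, dtype=float)
    fu = np.asarray(fu, dtype=float)
    lo = (fv[j] - np.delete(fu, j)).min()
    hi = (fu[j] - np.delete(fv, j)).min()
    return lo, hi
```

Any f with f^v ≤ f ≤ f^u coordinatewise then satisfies g^v_j ≤ g_min(f, j) ≤ g^u_j for every class j.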
References
An, H. L. T., and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J. Global Optim., 11, 253-285.
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2003). Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley.
Boser, B., Guyon, I., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. The Fifth Annual Conference on Computational Learning Theory, Pittsburgh: ACM, 142-152.
Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-279.
Crammer, K., and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.
Guermeur, Y. (2002). Combining discriminant models with new multiclass SVMs. Pattern Analysis and Applications (PAA), 5, 168-179.
Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc., 99, 67-81.
Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Annals of Statistics, 28, 1570-1600.
Lin, Y. (2000). Some asymptotic properties of the support vector machine. Technical Report 1029, Department of Statistics, University of Wisconsin-Madison.
Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259-275.
Liu, Y., Shen, X., and Doss, H. (2005). Multicategory ψ-learning and support vector machine: computational tools. J. Comput. Graph. Statist., 14, 219-236.
Mammen, E., and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist., 27, 1808-1829.
Marron, J. S., and Todd, M. J. (2002). Distance weighted discrimination. Technical Report 1339, School of Operations Research and Industrial Engineering, Cornell University.
Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A, 209, 415-446.
Shen, X. (1998). On the method of penalization. Statistica Sinica, 8, 337-357.
Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. J. Amer. Statist. Assoc., 98, 724-734.
Shen, X., and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist., 22, 580-615.
Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67-93.
Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32, 135-166.
Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds), Advances in Kernel Methods: Support Vector Learning, MIT Press, 125-143.
Weston, J., and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56-85.
Zhang, T. (2004b). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225-1251.
Zhou, D. X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739-767.
Zhu, J., and Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 185-205.
Zhu, J., Hastie, T., Rosset, S., and Tibshirani, R. (2003). 1-norm support vector machines. Neural Information Processing Systems, 16.
Table 1: Testing and training errors, and their ê(·, f̄), for SVM and ψ-learning using the best C in Examples 1 and 2 with n = 150, averaged over 100 simulation replications, with standard errors in parentheses. In Example 1 (d.f. = 1), the Bayes error is 0.2470 and the improvement of ψ-learning over SVM is 43.22%. In Example 2 (d.f. = 3), the Bayes error is 0.1456 and the improvement of ψ-learning over SVM is 20.41%. Here the improvement of ψ-learning over SVM is defined by (T(SVM) − T(ψ))/ê(SVM, f̄), where ê(·, f̄) = T(·) − Bayes error and T(·) denotes the testing error of a given method.

Example   Method   Training (s.e.)   Testing (s.e.)    ê(·, f̄) (s.e.)   No. SV (s.e.)
d.f.=1    SVM      0.4002 (0.1469)   0.4305 (0.1405)   0.1835 (0.1405)   141.76 (10.97)
          ψ-L      0.3199 (0.1237)   0.3494 (0.1209)   0.1024 (0.1209)   64.64 (15.43)
d.f.=3    SVM      0.1447 (0.0267)   0.1505 (0.0045)   0.0049 (0.0045)   71.81 (11.02)
          ψ-L      0.1429 (0.0285)   0.1495 (0.0033)   0.0039 (0.0033)   41.29 (13.51)
Table 2: Testing errors for the problem letter. Each training dataset is of size 200, selected from a total of 2341 samples.

Case   SVM    ψ-L    Improvement (%)
1      .083   .079    3.39
2      .073   .063   12.24
3      .086   .076   11.41
4      .072   .072    0.00
5      .088   .085    3.74
6      .077   .073    5.45
7      .075   .072    4.39
8      .079   .075    5.92
9      .093   .091    1.51
10     .090   .086    4.11
Average #SVs: SVM 51.1, ψ-L 40.8
Figure 1: Perspective plot of the 3-class ψ function defined in (3), with u_1 and u_2 plotted over [−2, 2] and ψ(u_1, u_2) ranging over [0, 2].
Figure 2: Illustration of the concept of margins and support vectors in a 3-class separable example. The instances for classes 1-3 fall respectively into the polyhedra D_j, j = 1, 2, 3, where D_1 = {x : f_1(x) − f_2(x) ≥ 1, f_1(x) − f_3(x) ≥ 1}, D_2 = {x : f_2(x) − f_1(x) ≥ 1, f_2(x) − f_3(x) ≥ 1}, and D_3 = {x : f_3(x) − f_1(x) ≥ 1, f_3(x) − f_2(x) ≥ 1}. The generalized geometric margin γ, defined as min{γ_12, γ_13, γ_23}, is maximized to obtain the decision boundary. There are five support vectors on the boundaries of the three polyhedra: one from class 1, one from class 2, and three from class 3.