Multicategory ψ-Learning*

Yufeng Liu and Xiaotong Shen

*Yufeng Liu is Assistant Professor, Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, CB 3260, Chapel Hill, NC 27599 (Email: yliu@email.unc.edu). He would like to thank Professor George Fisherman for his helpful comments. Xiaotong Shen is Professor, School of Statistics, University of Minnesota, 224 Church Street S.E., Minneapolis, MN 55455 (Email: xshen@stat.umn.edu). His research was supported in part by NSF grants IIS-0328802 and DMS-0072635. The authors would like to thank the editor, the associate editor, and two anonymous referees for their helpful comments and suggestions.

Summary

In binary classification, margin-based techniques usually deliver high performance. As a result, a multicategory problem is often treated as a sequence of binary classifications. In the absence of a dominating class, this treatment may be suboptimal and may yield poor performance, as for the Support Vector Machine (SVM). We propose a novel multicategory generalization of ψ-learning that treats all classes simultaneously. The new generalization eliminates this potential problem and, at the same time, retains the desirable properties of its binary counterpart. We develop a statistical learning theory for the proposed methodology and obtain fast convergence rates for both linear and nonlinear learning examples. The operational characteristics of the method are demonstrated via simulation. Our results indicate that the proposed methodology can deliver accurate class prediction and is more robust against extreme observations than its SVM counterpart.

Key Words and Phrases: Generalization error, nonconvex minimization, supervised learning, support vectors.

1 Introduction

Classification has become increasingly important as a means for facilitating information extraction. Among binary classification techniques, significant developments have been seen in margin-based methodologies, including the Support Vector Machine (SVM; Boser, Guyon, and Vapnik, 1992; Cortes and Vapnik, 1995), Penalized Logistic Regression (PLR; Lin et al., 2000), the Import Vector Machine (IVM; Zhu and Hastie, 2001), and Distance Weighted Discrimination (DWD; Marron and Todd, 2002).

Among margin-based techniques, those that focus on estimating the decision boundary yield higher performance than those that focus on conditional probabilities, because the former is an easier problem than the latter. For instance, the binary SVM directly estimates the Bayes classifier $\mathrm{sign}(P(Y=+1\mid x)-1/2)$ rather than $P(Y=+1\mid x)$ itself, with input vector $x$ and class label $Y\in\{\pm 1\}$, as shown in Lin (2002). However, this aspect of the methodology makes its generalization to the multicategory case highly nontrivial. One popular approach, known as "one-versus-rest", solves $k$ binary problems via sequential training. As argued by Lee, Lin, and Wahba (2004), an approach of this sort performs poorly in the absence of a dominating class, since the conditional probability of each class is then no greater than $1/2$.

Shen, Tseng, Zhang, and Wong (2003) proposed another margin-based technique called ψ-learning, which replaces the convex SVM loss function by a nonconvex ψ-loss function. They show that more accurate class prediction can be achieved while the margin interpretation is retained. The present article generalizes binary ψ-learning to the multicategory case. Since ψ-learning, like the SVM, does not directly yield $P(Y=+1\mid x)$, we need to take a new approach. To treat all classes simultaneously, we generalize the concepts of margins and support vectors via multiple comparisons among different classes. Multicategory ψ-learning retains the desired properties of its binary counterpart while not suffering from the aforementioned difficulty of the one-versus-rest SVM with regard to the dominating class.
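To make the dominating-class issue concrete, the following small sketch (with hypothetical conditional probabilities of our own choosing, not from the paper) contrasts the one-versus-rest rule with the simultaneous argmax rule at a single input $x$ where no class has conditional probability above $1/2$:

```python
import numpy as np

# Hypothetical conditional class probabilities at one input x, with k = 3
# and no dominating class: every P(Y = j | x) is below 1/2.
p = np.array([0.40, 0.35, 0.25])

# One-versus-rest: the j-th binary problem targets sign(P(Y=j|x) - 1/2).
one_vs_rest = np.sign(p - 0.5)
# Every binary subproblem votes "rest", so no class is claimed for x.

# Simultaneous (argmax) rule: compare all classes at once.
bayes_class = int(np.argmax(p)) + 1   # the Bayes rule still picks class 1
```

Here all three one-versus-rest signs are $-1$, so the sequential scheme claims no class for $x$, whereas the simultaneous rule recovers the Bayes choice.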
To provide insight into multicategory ψ-learning, we develop a statistical learning theory. Specifically, the theory quantifies the performance of multicategory ψ-learning with respect to the choice of tuning parameters, the size of the training sample, and the number of classes involved in classification. It also indicates that multicategory ψ-learning directly estimates the true decision boundary regardless of the presence or absence of a dominating class. Simulation experiments indicate that ψ-learning outperforms its SVM counterpart in generalization, as in the binary case. Moreover, multicategory ψ-learning is more robust than the SVM against extreme instances that are wrongly classified. Interestingly, in linear learning problems it exhibits behavior with respect to the tuning parameter that resembles nonlinear learning, which differs from the binary case.

Section 2.1 motivates our approach. Section 2.2 describes our proposal for multicategory ψ-learning, and Section 2.3 briefly discusses computational issues. Section 3 studies the statistical properties of the proposed methodology and develops its statistical learning theory. Section 4 presents numerical examples, followed by conclusions and discussion in Section 5. The Appendix contains the lemmas and technical proofs.

2 Methodology

The primary goal of classification is to predict the class label $Y$ for a given input vector $x\in S$ via a classifier, where $S$ is an input space. For $k$-class classification, a classifier partitions $S$ into $k$ disjoint and exhaustive regions $S_1,\dots,S_k$, with $S_j$ corresponding to class $j$. A good classifier is one that predicts the class index $Y$ for a given $x$ accurately, as measured by its accuracy of prediction. Before proceeding, let $x\in S\subset\mathbb{R}^d$ be an input vector and $y$ be an output (label) variable coded as $\{1,\dots,k\}$, and define $f=(f_1,\dots,f_k)$ as a decision function vector.
Here $f_j$, mapping from $S$ to $\mathbb{R}$, represents class $j$, $j=1,\dots,k$. The classifier $\mathrm{argmax}_{j=1,\dots,k}f_j(x)$ induced by $f$ is employed to assign a label to any input vector $x\in S$. In other words, $x\in S$ is assigned to the class with the highest value of $f_j(x)$, which indicates the strength of evidence that $x$ belongs to class $j$. A classifier is trained via a training sample $\{(x_i,y_i);\ i=1,\dots,n\}$, independently and identically distributed according to an unknown probability distribution $P(x,y)$. Throughout the paper, we use $X$ and $Y$ to denote random variables and $x$ and $y$ to represent the corresponding observations.

The generalization error (GE) quantifies the accuracy of generalization and is defined as $\mathrm{Err}(f)=P[Y\neq\mathrm{argmax}_j f_j(X)]$, the probability of misclassifying a new input vector $X$. To simplify the expression, we introduce $g(f(x),y)=(f_y(x)-f_1(x),\dots,f_y(x)-f_{y-1}(x),\ f_y(x)-f_{y+1}(x),\dots,f_y(x)-f_k(x))$, which performs multiple comparisons of class $y$ against the remaining classes. The vector $g(f(x),y)$ captures the distinctive feature of a multicategory problem and is directly related to the generalized margins to be introduced shortly. Furthermore, for $u=(u_1,\dots,u_{k-1})$, we define the multivariate sign function $\mathrm{sign}(u)=1$ if $u_{\min}=\min(u_1,\dots,u_{k-1})>0$, and $-1$ if $u_{\min}\le 0$. With $\mathrm{sign}(\cdot)$ and $g(f(x),y)$ in place, $f$ indicates correct classification of a given instance $(x,y)$ if $g(f(x),y)>0_{k-1}$, where $0_{k-1}$ is the $(k-1)$-dimensional vector of zeros. Consequently, the GE reduces to $\mathrm{Err}(f)=\frac{1}{2}E[1-\mathrm{sign}(g(f(X),Y))]$, with empirical generalization error (EGE) $(2n)^{-1}\sum_{i=1}^{n}(1-\mathrm{sign}(g(f(x_i),y_i)))$.

For motivation, we first discuss our setting in the binary case and then generalize it to the multicategory case. In particular, we review binary ψ-learning with the usual coding $\{-1,1\}$, and then derive it via the coding $\{1,2\}$.

2.1 Motivation

With $y\in\{\pm 1\}$, a margin-based classifier estimates a single function $f$ and uses $\mathrm{sign}(f)$ as the classification rule.
Within the regularization framework, it solves $\mathrm{argmin}_f\ J(f)+C\sum_{i=1}^{n}l(y_i f(x_i))$, where $J(f)$, a regularization term, controls the complexity of $f$; the loss function $l$ measures the data fit; and $C>0$ is a tuning parameter balancing the two terms. For example, the SVM uses the hinge loss $l(u)=[1-u]_+$, where $[v]_+=v$ if $v\ge 0$ and $0$ otherwise; PLR and IVM adopt the logistic loss $l(u)=\log(1+e^{-u})$; and the ψ-loss can be any nonincreasing function satisfying $R\ge\psi(u)>0$ if $u\in(0,\tau)$ and $\psi(u)=1-\mathrm{sign}(u)$ otherwise, where $\tau\in(0,1]$ and $R>0$. For simplicity, we discuss the linear case in which $f(x)=w^T x+b$, with $w\in\mathbb{R}^d$ and $b\in\mathbb{R}$, represents a $d$-dimensional hyperplane. In this case, $J(f)=\frac{1}{2}\|w\|^2$ is defined through the geometric margin $\frac{2}{\|w\|}$, the vertical Euclidean distance between the hyperplanes $f=\pm 1$. Here $y_i f(x_i)$ is the functional margin of instance $(x_i,y_i)$.

For linear binary ψ-learning with coding $\{1,2\}$, we now derive a parallel formulation using the argmax rule, noting that $x$ is classified as class 2 if $f_2(x)>f_1(x)$ and as class 1 otherwise, where $f_j(x)=w_j^T x+b_j$, $j=1,2$. Evidently, this classification rule depends only on $\mathrm{sign}((f_2-f_1)(x))$. To eliminate redundancy in $(f_1,f_2)$, we invoke the sum-to-zero constraint $f_1+f_2=0$. This type of constraint was previously used by Guermeur (2002) and Lee et al. (2004) in two different SVM formulations. Under this constraint, $\|w_1\|=\|w_2\|$. Binary ψ-learning then solves

$$\min_{b_1,b_2,w_1,w_2}\ \Big(\frac{1}{2}\sum_{j=1}^{2}\|w_j\|^2 + C\sum_{i=1}^{n}\psi\big(g(f(x_i),y_i)\big)\Big)\quad\text{subject to}\ \sum_{j=1}^{2}f_j(x)=0\ \ \forall x\in S,\qquad (1)$$

where $g(f(x_i),y_i)=f_{y_i}(x_i)-f_{3-y_i}(x_i)$. With coding $\{1,2\}$, instances from classes 1 and 2 that lie respectively in the halfspaces $\{x:\ g(f(x),2)\ge -1\}$ and $\{x:\ g(f(x),2)\le 1\}$ are defined as "support vectors". In the separable case, the support vectors are the instances on the hyperplanes $g(f(x),2)=\pm 1$. Furthermore, the functional margin of $(x_i,y_i)$ can be defined as $g(f(x_i),y_i)$, indicating the correctness and strength of classification of $x_i$ by $f$.
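For reference, the binary margin losses discussed above can be sketched as follows. The ψ shown is the univariate version of the specific choice in (3), with $\tau=1$ and $R=2$, and the margin helper illustrates the coding-$\{1,2\}$ identity $g(f(x_i),y_i)=f_{y_i}(x_i)-f_{3-y_i}(x_i)$ under the sum-to-zero constraint. This is our illustrative sketch, not the authors' code:

```python
import numpy as np

def hinge(u):                 # SVM hinge loss [1 - u]_+
    return np.maximum(1.0 - u, 0.0)

def logistic(u):              # PLR / IVM logistic loss log(1 + e^{-u})
    return np.log1p(np.exp(-u))

def psi(u):                   # univariate psi-loss of (3): tau = 1, R = 2
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1.0, 0.0, np.where(u < 0.0, 2.0, 2.0 * (1.0 - u)))

# Functional margin under coding {1, 2} with the constraint f1 + f2 = 0:
# g(f(x_i), y_i) = f_{y_i}(x_i) - f_{3 - y_i}(x_i).
def margin(f2_value, y):      # f2_value = f_2(x); label y in {1, 2}
    f = {1: -f2_value, 2: f2_value}   # f1 = -f2 by the sum-to-zero constraint
    return f[y] - f[3 - y]
```

Note that ψ agrees with $1-\mathrm{sign}(u)$ outside $(0,1)$ but, unlike it, penalizes correctly classified points with small margin, pushing them away from the boundary.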
2.2 Multicategory ψ-Learning

As suggested in Shen et al. (2003), the role of the binary ψ-function is twofold. First, it eliminates the scaling problem of the sign function, which is scale invariant. Second, through the positive penalty $\psi(u)>0$ for $u\in(0,\tau)$, it pushes correctly classified instances away from the boundary. As a remark, we note that $1-\mathrm{sign}$ as a loss is numerically undesirable, since the solution $f$ is then approximately $0$ under regularization.

Using the coding $\{1,\dots,k\}$, we define multivariate ψ-functions on $k-1$ arguments as follows:

$$R\ge\psi(u)>0\ \ \text{if}\ u_{\min}\in(0,\tau);\qquad \psi(u)=1-\mathrm{sign}(u)\ \ \text{otherwise},\qquad (2)$$

where $0<\tau\le 1$ and $0<R\le 2$ are constants, and $\psi(u)$ is nonincreasing in $u_{\min}$. This multivariate version preserves the desired properties of its univariate counterpart. In particular, the multivariate ψ assigns a positive penalty to any instance with $\min(g(f(x_i),y_i))\in(0,\tau)$ to eliminate the scaling problem. To utilize our computational strategy based on a difference convex (d.c.) decomposition, we use a specific ψ in implementation:

$$\psi(u)=\begin{cases} 0 & \text{if } u_{\min}\ge 1;\\ 2 & \text{if } u_{\min}<0;\\ 2(1-u_{\min}) & \text{if } 0\le u_{\min}<1.\end{cases}\qquad (3)$$

A plot of this ψ-function for $k=3$ is displayed in Figure 1.

Insert Figure 1 about here

Linear multicategory ψ-learning solves $\min_{b,w}\big(\frac{1}{2}\sum_{j=1}^{k}\|w_j\|^2+C\sum_{i=1}^{n}\psi(g(f(x_i),y_i))\big)$ subject to $\sum_{j=1}^{k}f_j(x)=0$ for all $x\in S$, where $w=\mathrm{vec}(w_1,\dots,w_k)$ is a $kd$-dimensional vector with $(d(i_2-1)+i_1)$-th element $w_{i_2}(i_1)$, and $b=(b_1,\dots,b_k)^T\in\mathbb{R}^k$. By Theorem 2.1 of Liu et al. (2005), the minimization under the sum-to-zero constraint for all $x\in S$ is equivalent to that under the constraint for the $n$ training inputs $\{x_i;\ i=1,\dots,n\}$ only. That is, the infinite set of constraints $\sum_{j=1}^{k}f_j(x)=0$ for all $x\in S$ can be reduced to $\sum_{j=1}^{k}b_j 1_n+X\sum_{j=1}^{k}w_j=0$, where $X=(x_1,\dots,x_n)^T$ is the design matrix and $1_n$ is the $n$-dimensional vector of ones.
This yields linear multicategory ψ-learning:

$$\min_{b,w}\ \Big(\frac{1}{2}\sum_{j=1}^{k}\|w_j\|^2 + C\sum_{i=1}^{n}\psi\big(g(f(x_i),y_i)\big)\Big)\quad\text{subject to}\ \sum_{j=1}^{k}b_j 1_n + X\sum_{j=1}^{k}w_j = 0,\qquad (4)$$

where the value of $C$ ($C>0$) in (4) reflects the relative importance of the geometric margin and the EGE.

In the present context, we define the generalized functional margin of an instance $(x_i,y_i)$ as $\min(g(f(x_i),y_i))$, and the generalized geometric margin as $\gamma=\min_{1\le j_1<j_2\le k}\gamma_{j_1 j_2}$, with $\gamma_{j_1 j_2}=\frac{2}{\|w_{j_1}-w_{j_2}\|}$ the vertical Euclidean distance between the hyperplanes $f_{j_1}-f_{j_2}=\pm 1$. Here $\gamma_{j_1 j_2}$ measures the separation between classes $j_1$ and $j_2$; see Figure 2 for an illustration of the role of $\gamma$. When $k=2$, (4) reduces to the binary case of Shen et al. (2003). As a technical remark, we note that (4) uses $\sum_{j=1}^{k}\|w_j\|^2$ rather than $\max_{1\le j_1<j_2\le k}\|w_{j_1}-w_{j_2}\|^2$ in the minimization, because $\sum_{j=1}^{k}\|w_j\|^2$ plays a similar role to $\max_{1\le j_1<j_2\le k}\|w_{j_1}-w_{j_2}\|^2$ and is easier to implement.

Insert Figure 2 about here

Kernel-based learning can be achieved via a proper kernel $K(\cdot,\cdot)$ mapping from $S\times S$ to $\mathbb{R}$. The kernel is required to satisfy Mercer's condition (Mercer, 1909), which ensures that the kernel matrix $K$ is positive definite, where $K$ is the $n\times n$ matrix with $i_1 i_2$-th element $K(x_{i_1},x_{i_2})$. Each $f_j$ can then be represented as $h_j(x)+b_j$ with $h_j(x)=\sum_{i=1}^{n}v_{ji}K(x_i,x)$, by the theory of reproducing kernel Hilbert spaces; cf. Wahba (1998). Kernel-based multicategory ψ-learning then solves

$$\min_{b,v}\ \Big(\frac{1}{2}\sum_{j=1}^{k}\|h_j\|_{H_K}^2 + C\sum_{i=1}^{n}\psi\big(g(f(x_i),y_i)\big)\Big)\quad\text{subject to}\ \sum_{j=1}^{k}b_j 1_n + K\sum_{j=1}^{k}v_j = 0,\qquad (5)$$

where $v_j=(v_{j1},\dots,v_{jn})^T$ and $v=\mathrm{vec}(v_1,\dots,v_k)$. Using the reproducing kernel property, $\|h_j\|_{H_K}^2$ can be written as $v_j^T K v_j$.

The concept of support vectors can also be extended to multicategory problems. In the separable case, the instances on the boundaries of the polyhedra $D_j$ are the support vectors, where the polyhedron $D_j$ is the solution set of a finite system of linear inequalities defined by $\min(g(f(x),j))\ge 1$.
In the nonseparable case, the instances belonging to class $j$ that do not fall inside $D_j$ are the support vectors.

2.3 Computational Development of ψ-Learning

To treat the nonconvex minimization involved in (4) and (5), we utilize a state-of-the-art technique in global optimization, the difference convex algorithm (DCA) of An and Tao (1997); we refer to Liu et al. (2005) for algorithmic details. The key to efficient computation is a d.c. decomposition $\psi=\psi_1+\psi_2$, where $\psi_1(u)=0$ if $u_{\min}\ge 1$ and $2(1-u_{\min})$ otherwise, and $\psi_2(u)=0$ if $u_{\min}\ge 0$ and $2u_{\min}$ otherwise. Here $\psi_1$ can be viewed as a multivariate generalization of the univariate hinge loss. This d.c. decomposition connects the ψ-loss to the hinge loss $\psi_1$ of the SVM. In fact, the multivariate ψ mimics the GE defined through $1-\mathrm{sign}$, while the generalized hinge loss $\psi_1$ is a convex upper envelope of $1-\mathrm{sign}$. Through this d.c. decomposition, ψ corrects the bias introduced by the imposed convexity of $\psi_1$ and is expected to yield higher generalization accuracy.

3 Statistical Learning Theory

In the literature there has been considerable interest in the generalization accuracy of margin-based classifiers. In the binary case, Lin (2000) investigated rates of convergence of the SVM with a spline kernel. Bartlett, Jordan, and McAuliffe (2003) studied rates of convergence for certain convex margin losses. Shen et al. (2003) derived a learning theory for ψ-learning. Zhang (2004) obtained consistency for general convex margin-based losses. For the multicategory case, Zhang (2004b) recently studied consistency of several large-margin classifiers using convex losses. To our knowledge, no results are available on rates of convergence in the multicategory case. In this section, we quantify the generalization error rates of the proposed multicategory ψ-learning, as measured by the Bayesian regret, to be introduced below.
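Before turning to the theory, the d.c. decomposition of Section 2.3 can be checked numerically. The following sketch (ours, using the specific ψ of (3)) implements ψ, its convex part $\psi_1$ (the generalized hinge), and its concave part $\psi_2$ on generalized-margin vectors, and verifies $\psi=\psi_1+\psi_2$ on random inputs:

```python
import numpy as np

def u_min(u):                      # smallest component of the margin vector
    return np.min(u, axis=-1)

def psi(u):                        # multivariate psi-loss of (3)
    m = u_min(u)
    return np.where(m >= 1.0, 0.0, np.where(m < 0.0, 2.0, 2.0 * (1.0 - m)))

def psi1(u):                       # convex part: the generalized hinge loss
    m = u_min(u)
    return np.where(m >= 1.0, 0.0, 2.0 * (1.0 - m))

def psi2(u):                       # concave part
    m = u_min(u)
    return np.where(m >= 0.0, 0.0, 2.0 * m)

# Check psi = psi1 + psi2 on a grid of margin vectors; k = 3 classes give
# k - 1 = 2 pairwise comparisons per instance.
rng = np.random.default_rng(0)
U = rng.uniform(-2.0, 2.0, size=(1000, 2))
assert np.allclose(psi(U), psi1(U) + psi2(U))
```

On each branch the identity is immediate: for $u_{\min}\ge 1$ both parts vanish; for $0\le u_{\min}<1$ only $\psi_1=2(1-u_{\min})$ contributes; and for $u_{\min}<0$ the sum is $2(1-u_{\min})+2u_{\min}=2$.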
3.1 Statistical Properties

The generalization performance of a classifier defined by $f$ is measured by the Bayesian regret $e(f,\bar f)=\mathrm{Err}(f)-\mathrm{Err}(\bar f)\ge 0$, the difference between the actual performance and the ideal performance. Here $\bar f$ is the Bayes rule, yielding the ideal performance if the true distribution of $(X,Y)$ were known in advance; it is obtained by minimizing $\mathrm{Err}(f)=\frac{1}{2}E[1-\mathrm{sign}(g(f(X),Y))]$ over all $f$, with $g(f(x),j)=\{f_j(x)-f_l(x);\ l\neq j\}$. Note that the Bayes rule is not unique, because any $\bar f$ satisfying $\mathrm{argmax}_j\bar f_j(x)=\mathrm{argmax}_j P_j(x)$, with $P_j(x)=P(Y=j\mid x)$, attains the minimum. Without loss of generality, we use the specific $\bar f=(\bar f_1,\dots,\bar f_k)$ with $\bar f_j(x)=\frac{k-1}{k}I\big(\mathrm{sign}(P_j(x)-P_l(x);\ l\neq j)=1\big)-\frac{1}{k}I\big(\mathrm{sign}(P_j(x)-P_l(x);\ l\neq j)\neq 1\big)$ in what follows; that is, $\bar f_l(x)=\frac{k-1}{k}$ if $l=\mathrm{argmax}_j P_j(x)$, and $-\frac{1}{k}$ otherwise. Theorem 3.1 below gives expressions for the Bayesian regret, which are critical for establishing our learning theory.

Theorem 3.1. For any decision function vector $f$,

$$e(f,\bar f)=\frac{1}{2}E\Big[\sum_{j=1}^{k}P_j(X)\big(\mathrm{sign}(\bar g(\bar f(X),j))-\mathrm{sign}(g(f(X),j))\big)\Big]\qquad (6)$$
$$=E\big[\max_j P_j(X)-P_{\mathrm{argmax}_j f_j(X)}(X)\big]\ \ge\ 0$$
$$=E\Big[\sum_{j\neq l}\big|P_l(X)-P_j(X)\big|\,I\big(\mathrm{sign}(\bar g(\bar f(X),l))=1,\ \mathrm{sign}(g(f(X),j))=1\big)\Big],\qquad (7)$$

where $\bar g(\bar f(x),j)=\{\bar f_j(x)-\bar f_l(x);\ l\neq j\}$.

Equation (6) in Theorem 3.1 expresses $e(f,\bar f)$ as a weighted sum of the individual misclassification errors, weighted by the conditional probabilities $P_j(X)$. Equation (7) expresses $e(f,\bar f)$ in terms of the misclassification resulting from the $\binom{k}{2}$ multiple comparisons, and suggests that a multicategory problem differs dramatically from its binary counterpart. For a binary problem, (7) reduces to $e(f_2,\bar f_2)=E\big[|P_2(X)-1/2|\,|\mathrm{sign}(f_2(X))-\mathrm{sign}(\bar f_2(X))|\big]$, because $P_2(x)-P_1(x)=2(P_2(x)-1/2)$. This means that a comparison between $P_1(x)$ and $P_2(x)$ in the binary case is equivalent to examining whether $P_2(x)$ exceeds $1/2$.
For a multicategory problem, however, this no longer holds, since multiple pairwise comparisons are necessary to determine the argmax. In fact, a dominating class may not exist; that is, $\max_l P_l(x)<1/2$ for some $x\in S$. Therefore, $k$ comparisons of the $P_j(x)$ with $1/2$ may not suffice to determine the correct classification rule. Indeed, the existence of a dominating class is an issue in the multicategory case but not in the binary case.

The ultimate goal of classification is to minimize $E[1-\mathrm{sign}(g(f(X),Y))]$. To avoid the scale-invariance problem of the sign function, we use the ψ-loss as a surrogate and minimize $E[\psi(g(f(X),Y))]$ instead. The following theorem says that a ψ-loss yields the same Bayes rule as the $1-\mathrm{sign}$ loss; consistency of multicategory ψ-learning can thus be established.

Theorem 3.2. The Bayes decision function vector $\bar f$ satisfies $\bar g_{\min}(\bar f(x),\mathrm{argmax}_{j=1,\dots,k}P_j(x))=1$, where $\bar g_{\min}$ is the minimum of the $k-1$ elements of the vector $\bar g$. For any ψ satisfying (2), $\bar f$ minimizes both $E[\psi(g(f(X),Y))]$ and $E[1-\mathrm{sign}(g(f(X),Y))]$, in the sense that $E[\psi(g(f(X),Y))]\ge E[\psi(\bar g(\bar f(X),Y))]=E[1-\mathrm{sign}(\bar g(\bar f(X),Y))]\le E[1-\mathrm{sign}(g(f(X),Y))]$ for any $f$. Furthermore, the minimizers of $E[\psi(g(f(X),Y))]$ and $E[1-\mathrm{sign}(g(f(X),Y))]$ are not unique; for example, $c\bar f$ is also a minimizer of both quantities for any $c\ge 1$.

Theorem 3.2 says that ψ-learning estimates the Bayes classifier defined by $\bar f$, as opposed to the conditional probabilities $(P_1(x),\dots,P_k(x))$, and that ψ plays the same role as $1-\mathrm{sign}$. Furthermore, the optimal performance of $\bar f$, with $\bar g_{\min}(\bar f(x),\mathrm{argmax}_j P_j(x))=1$, is realized via the ψ-loss function even though ψ differs from $1-\mathrm{sign}$.

3.2 Statistical Learning Theory

Let $F$ be a function class of candidate decision function vectors, which is allowed to depend on $n$. Note that the Bayes decision function $\bar f$ is not required to belong to $F$.
For any function vector $f\in F$, classification is performed by partitioning $S$ into $k$ disjoint sets $(G_{f_1},\dots,G_{f_k})=(\{x:\ \mathrm{sign}(g(f(x),1))=1\},\dots,\{x:\ \mathrm{sign}(g(f(x),k))=1\})$. In this section, we generalize the learning theory of Shen et al. (2003) to the multicategory case. Our learning theory quantifies the magnitude of $e(f,\bar f)$ as a function of $n$, $k$, the tuning parameter $C$, and the complexity of the class of candidate classification partitions $G(F)=\{(G_{f_1},\dots,G_{f_k});\ f\in F\}$ induced by $F$. Denote the approximation error by $e_\psi(f,\bar f)=\frac{1}{2}\big(E\psi(g(f(X),Y))-E\psi(\bar g(\bar f(X),Y))\big)$, which measures the degree of approximation of $G(F)$ to $(G_{\bar f_1},\dots,G_{\bar f_k})$. Let $J_0=\max(J(f_0),1)$. The following technical assumptions are made.

Assumption A (Approximation error): For some positive sequence $s_n\to 0$ as $n\to\infty$, there exists $f_0\in F$ such that $e_\psi(f_0,\bar f)\le s_n$; equivalently, $\inf_{f\in F}e_\psi(f,\bar f)\le s_n$. Like $F$, $f_0$ may depend on $n$.

Assumption B (Boundary behavior): There exist constants $0<\alpha\le+\infty$ and $c_1>0$ such that $P\big(X\in S:\ \max_l P_l(X)-P_{j\neq\mathrm{argmax}_l P_l(X)}(X)<2\delta\big)\le c_1\delta^{\alpha}$ for any small $\delta\ge 0$.

Assumption B describes the behavior of the conditional probabilities $P_j$ near the decision boundary $\{x\in S:\ \max_l P_l(x)=P_j(x)\ \text{for some}\ j\neq\mathrm{argmax}_l P_l(x)\}$. It is equivalent to $P\big(X\in S:\ \max_l P_l(X)-\mathrm{second\,max}_l\,P_l(X)<2\delta\big)\le c_1\delta^{\alpha}$, by the fact that $\{X:\ \max_l P_l(X)-P_{j\neq\mathrm{argmax}_l P_l(X)}(X)<2\delta\}\subset\{X:\ \max_l P_l(X)-\mathrm{second\,max}_l\,P_l(X)<2\delta\}$.

To specify Assumption C, we define a metric entropy for partitions. For a class of partitions $B=\{(B_1,\dots,B_k):\ B_j\cap B_l=\emptyset\ \forall j\neq l,\ \cup_{1\le j\le k}B_j=S\}$ and any $\varepsilon>0$, call $\{(G^v_{j1},G^u_{j1}),\dots,(G^v_{jm},G^u_{jm})\}$, $j=1,\dots,k$, an $\varepsilon$-bracketing set of $B$ if for any $(G_1,\dots,G_k)\in B$ there exists an $h$ such that $G^v_{jh}\subset G_j\subset G^u_{jh}$ for all $j$ and

$$\max_{1\le h\le m}\ \max_{1\le j\le k}\ P\big(G^u_{jh}\,\Delta\,G^v_{jh}\big)\le\varepsilon,\qquad (8)$$

where $G^u_{jh}\,\Delta\,G^v_{jh}$ denotes the set difference between $G^u_{jh}$ and $G^v_{jh}$.
The metric entropy $H_B(\varepsilon,B)$ of $B$ with bracketing is then defined as the logarithm of the cardinality of the smallest $\varepsilon$-bracketing set of $B$. Let $F(\ell)=\{f\in F:\ J(f)\le\ell\}\subset F$ and $G(\ell)=\{(G_{f_1},\dots,G_{f_k});\ f\in F(\ell)\}\subset G(F)$. Then $G(\ell)$ is the set of classification partitions under the regularization $J(f)\le\ell$. For instance, $J(f)$ is $\frac{1}{2}\sum_j\|w_j\|^2$ in (4) or $\frac{1}{2}\sum_j\|h_j\|^2_{H_K}$ in (5). To measure the complexity of $G(\ell)$ via the metric entropy, the following assumption is made.

Assumption C (Metric entropy for partitions): For some positive constants $c_i$, $i=2,3,4$, there exists some $\varepsilon_n>0$ such that

$$\sup_{\ell\ge 2}\ \phi(\varepsilon_n,\ell)\le c_2 n^{1/2},\qquad (9)$$

where $\phi(\varepsilon_n,\ell)=\int_{c_4 L}^{c_3^{1/2}L^{\alpha/(2(\alpha+1))}}H_B^{1/2}\big(u^2/4,\ G(\ell)\big)\,du\big/L$ and $L=L(\varepsilon_n,C,\ell)=\min\big(\varepsilon_n^2+(Cn)^{-1}(\ell/2-1)J_0,\ 1\big)$.

Assumption D (ψ-function): The ψ-function satisfies (2).

As a technical remark, we note that, to simplify the function entropy calculation in Assumption C required for Theorem 3.4, an additional condition on the ψ-function may be imposed. For instance, we may restrict the ψ-loss functions in (2) to satisfy a multivariate Lipschitz condition:

$$|\psi(u^{*})-\psi(u^{**})|\le D\,|u^{*}_{\min}-u^{**}_{\min}|,\qquad (10)$$

where $D>0$ is a constant. Condition (10) is satisfied by the specific ψ-function in (3), with $D=2$. This aspect is illustrated in Example 3.3.2. However, (10) is irrelevant to the set entropy in Assumption C required for Theorem 3.3; see Example 3.3.1.

Theorem 3.3. Suppose that Assumptions A-D are met. Then, for any ψ-learning classifier $\mathrm{argmax}(\hat f)$, there exists a constant $c_5>0$ such that

$$P\big(e(\hat f,\bar f)\ge\delta_n^2\big)\le 3.5\exp\Big(-c_5\,n\,(nC)^{-\frac{\alpha+2}{\alpha+1}}J_0^{\frac{\alpha+2}{\alpha+1}}\Big),$$

provided that $Cn\ge 2\delta_n^{-2}J_0$, where $\delta_n^2=\min\big(\max(\varepsilon_n^2,2s_n),\ 1\big)$.

Corollary 3.1. Under the assumptions of Theorem 3.3, $|e(\hat f,\bar f)|=O_p(\delta_n^2)$ and $E|e(\hat f,\bar f)|=O(\delta_n^2)$, provided that $n^{-\frac{1}{\alpha+1}}(C^{-1}J_0)^{\frac{\alpha+2}{\alpha+1}}$ is bounded away from zero.

To obtain the error rate $\delta_n^2$ in Theorem 3.3, we need to compute the metric entropy for $G(\ell)$.
Computing the metric entropy for partitions may not be easy, because $G(\ell)$ is induced by the class of functions $F(\ell)$. Moreover, it is also of interest to establish an upper bound on $e(\hat f,\bar f)$ using the corresponding function entropy rather than the set entropy. In what follows, we develop such results in Theorem 3.4.

To proceed, we define the $L_2$-metric entropy with bracketing for $F$ as follows. For any $\varepsilon>0$, call $\{(g^v_1,g^u_1),\dots,(g^v_m,g^u_m)\}$ an $\varepsilon$-bracketing set if for any $g\in F$ there is an $h$ such that $g^v_h\le g\le g^u_h$ and $\max_{1\le h\le m}\|g^u_h-g^v_h\|_2\le\varepsilon$, where $\|\cdot\|_2$ is the usual $L_2$-norm, defined via $\|g\|_2^2=\int g^2\,dP$. Then the $L_2$-metric entropy of $F$ with bracketing, $H_B(\varepsilon,F)$, is defined as the logarithm of the cardinality of the smallest $\varepsilon$-bracketing set. Now define a new function set $F_\psi(\ell)=\{\psi(g(f(x),y))-\psi(g(f_0(x),y)):\ f\in F(\ell)\}$ and $\phi^{*}(\varepsilon^{*}_n,\ell)=\int_{c_4 L^{*}}^{c_3^{1/2}L^{*\,\alpha/(2(\alpha+1))}}H_B^{1/2}\big(u,\ F_\psi(\ell)\big)\,du\big/L^{*}$ with $L^{*}=\min\big(\varepsilon_n^{*2}+(Cn)^{-1}(\ell/2-1)J_0,\ 1\big)$.

Theorem 3.4. Suppose that Assumptions A-D are met with $\phi^{*}(\varepsilon^{*}_n,\ell)$ replacing $\phi(\varepsilon_n,\ell)$ in Assumption C. Then, for any ψ-learning classifier $\mathrm{argmax}(\hat f)$, there exists a constant $c_5>0$ such that

$$P\big(e(\hat f,\bar f)\ge\delta_n^{*2}\big)\le P\big(e_\psi(\hat f,\bar f)\ge\delta_n^{*2}\big)\le 3.5\exp\Big(-c_5\,n\,(nC)^{-\frac{\alpha+2}{\alpha+1}}J_0^{\frac{\alpha+2}{\alpha+1}}\Big),$$

provided that $Cn\ge 2\delta_n^{*-2}J_0$, where $\delta_n^{*2}=\min\big(\max(\varepsilon_n^{*2},2s_n),\ 1\big)$.

Corollary 3.2. Under the assumptions of Theorem 3.4, $|e_\psi(\hat f,\bar f)|=O_p(\delta_n^{*2})$ and $E|e_\psi(\hat f,\bar f)|=O(\delta_n^{*2})$, provided that $n^{-\frac{1}{\alpha+1}}(C^{-1}J_0)^{\frac{\alpha+2}{\alpha+1}}$ is bounded away from zero.

Note that $e_\psi(\hat f,\bar f)\ge e(\hat f,\bar f)$. The rate $\delta_n^{*2}$ obtained from Theorem 3.4 using the metric entropy for functions yields an upper bound on $e(\hat f,\bar f)$; thus $e(\hat f,\bar f)\le\min(\delta_n^2,\delta_n^{*2})$ with probability tending to one, by Theorems 3.3-3.4. In applications, one may calculate either $\delta_n^2$ or $\delta_n^{*2}$, depending on which entropy is easier to compute.

Theorems 3.3-3.4 reveal distinct characteristics of multicategory problems, although they cover the binary case.
First, a multicategory problem generally has a higher level of complexity, so the number of classes $k$ may affect performance. In fact, Theorems 3.3-3.4 permit studying the dependence of $e(\hat f,\bar f)$ on $k$ and $n$ simultaneously; see Examples 3.3.1 and 3.3.2. Second, some properties of binary linear learning no longer hold in the multicategory case when $k>2$. For instance, the decision boundaries generated by linear learning with $k>2$ can be piecewise-linear hyperplanes.

3.3 Illustrative Examples

To illustrate our learning theory, we study specific learning examples and apply the theory to derive error bounds for multicategory ψ-learning.

3.3.1 Linear classification: We consider linear classification with the class of $k$ hyperplanes $F=\{f:\ f_j(x)=w_j^T x+b_j,\ \sum_{j=1}^{k}f_j=0,\ x\in S=[0,1]^d\}$, where $d$ is a constant. To generate the training sample, we specify $P(Y=j)=1/k$ and $P(x\mid Y=j)=k-1$ for $\{x:\ x_1\in[(j-1)/k,\ j/k)\}$ and $1/(k-1)$ otherwise, where $x_1$ is the first coordinate of $x$. The Bayes classifier then yields the sets $\{x:\ x_1\in[0,1/k)\},\dots,\{x:\ x_1\in[(k-1)/k,1]\}$ for the corresponding $k$ classes.

We now verify Assumptions A-C. For Assumption A, it is easy to find $f_t=(w_{11}x_1+b_1,\dots,w_{1k}x_1+b_k)$ such that the $w_{1j}$ are increasing, $\sum_{j=1}^{k}w_{1j}=0$, $\sum_{j=1}^{k}b_j=0$, and $w_{1j}\,j/k+b_j=w_{1,j+1}\,j/k+b_{j+1}$, $j=1,\dots,k-1$. Let $f_0=n f_t\in F$; then $e_\psi(f_0,\bar f)\le s_n=c_1 n^{-1}$ for some constant $c_1>0$. This implies Assumption A with $s_n=c_1 n^{-1}$. Assumption B is satisfied with $\alpha=+\infty$, since $P\big(X\in S:\ \max_l P_l(X)-P_j(X)<2\delta\big)=P\big(X_1\in\{1/k,\dots,(k-1)/k\}\big)=0$ for any sufficiently small $\delta>0$. To verify Assumption C, we note that $H_B(u,G(\ell))\le O\big(k^2\log(k/u)\big)$ for any given $\ell$, by Lemma 1. Let $\phi_1(\varepsilon_n,\ell)=c_3\big(k^2\log(k/L^{1/2})\big)^{1/2}/L^{1/2}$, where $L=\min\big(\varepsilon_n^2+(Cn)^{-1}(\ell/2-1),\ 1\big)$.
This in turn yields $\sup_{\ell\ge 2}\phi(\varepsilon_n,\ell)\le\phi_1(\varepsilon_n,2)=c\big(k^2\log(k/\varepsilon_n)\big)^{1/2}/\varepsilon_n$ for some $c>0$, and the rate $\varepsilon_n=\big(\frac{k^2\log n}{n}\big)^{1/2}$ when $C/J_0\sim\delta_n^{-2}n^{-1}\sim\frac{1}{k^2\log n}$, provided that $\frac{k^2\log n}{n}\to 0$. By Corollary 3.1, we conclude that $e(\hat f,\bar f)\le O\big(\frac{k^2\log n}{n}\big)$ except on a set of probability tending to zero, and $Ee(\hat f,\bar f)\le O\big(\frac{k^2\log n}{n}\big)$, when $\frac{k^2\log n}{n}\to 0$ as $n\to\infty$. It is interesting to note that $Ee(\hat f,\bar f)\le O(n^{-1}\log n)$ when $k$ is a fixed constant. This conclusion holds for any ψ-function satisfying Assumption D.

3.3.2 Gaussian-kernel classification: In this example, we consider nonlinear learning with the same $P(x,y)$ as in Example 3.3.1. Let $F=\{f:\ f_j(x)=\sum_{i=1}^{n}v_{ji}K(x_i,x)+b_j,\ \sum_{j=1}^{k}f_j=0,\ x\in S=[0,1]^d\}$ with the Gaussian kernel $K(s,t)=\exp(-\|s-t\|^2/\sigma^2)$. For Assumption A, we note that $F$ is a rich function space when $n$ is large. In fact, any continuous function can be well approximated by Gaussian-kernel representations under the sup-norm; cf. Steinwart (2001). Thus there exists $f_t=(f_{1t},\dots,f_{kt})\in F$ such that $f_{jt}(x)\ge 0$ for $x_1\in[(j-1)/k,\ j/k]$ and $f_{jt}(x)<0$ otherwise. With the choice $f_0=\varepsilon_n^{-2}f_t$, we have $e_\psi(f_0,\bar f)\le s_n=c_1\varepsilon_n^2$, where $c_1$ is a constant and $\varepsilon_n$ is defined below. Assumption B is satisfied with $\alpha=+\infty$, as in Example 3.3.1.

In this case, the metric entropy of $F_\psi(\ell)$ appears to be easier to compute, so we apply Theorem 3.4 to obtain the convergence rate. Consider any ψ-function in (2) that satisfies (10). Then, by Lemma 2, $H_B(u,F_\psi(\ell))\le O\big(k(\log(\ell/u))^{d+1}\big)$ for any given $\ell$. Let $\phi_1^{*}(\varepsilon_n^{*},\ell)=c_3\big(k(\log(\ell/L^{1/2}))^{d+1}\big)^{1/2}/L^{1/2}$, where $L=\min\big(\varepsilon_n^{*2}+(Cn)^{-1}(\ell/2-1),\ 1\big)$. Then $\sup_{\ell\ge 2}\phi^{*}(\varepsilon_n^{*},\ell)\le\phi_1^{*}(\varepsilon_n^{*},2)=c\big(k(\log(1/\varepsilon_n^{*}))^{d+1}\big)^{1/2}/\varepsilon_n^{*}$ for some $c>0$. Solving (9) yields the rate $\varepsilon_n^{*}=\big(\frac{k(\log(nk^{-1}))^{d+1}}{n}\big)^{1/2}$ when $C/J_0\sim\delta_n^{*-2}n^{-1}\sim\frac{1}{k(\log(nk^{-1}))^{d+1}}$, under the condition that $\frac{k(\log(nk^{-1}))^{d+1}}{n}\to 0$ as $n\to\infty$. By Theorem 3.4, we conclude that $e(\hat f,\bar f)\le e_\psi(\hat f,\bar f)\le O\big(\frac{k(\log(nk^{-1}))^{d+1}}{n}\big)$ except on a set of probability tending to zero.
By Corollary 3.2, $Ee(\hat f,\bar f)\le O\big(\frac{k(\log(nk^{-1}))^{d+1}}{n}\big)$. The resulting rate reflects the dependence on the class number $k$. If $k$ is treated as a fixed constant, then $Ee(\hat f,\bar f)\le O\big(n^{-1}(\log n)^{d+1}\big)$. This conclusion holds for any ψ-function satisfying Assumption D and Condition (10).

In summary, Examples 3.3.1-3.3.2 provide insight into the generalization error of the proposed methodology. In view of the $n^{-1}$ lower-bound result in the binary case (cf. Tsybakov, 2004), we conjecture that the rates obtained in Examples 3.3.1 and 3.3.2 are nearly optimal, although, to our knowledge, a lower-bound result for general classifiers has not yet been established in the multicategory case. Further investigation is necessary.

4 Numerical Examples

In this section, we examine the generalization performance of multicategory ψ-learning and compare it with its SVM counterpart. In the literature there are a number of different multicategory SVM generalizations, for instance Lee et al. (2004), Crammer and Singer (2001), and Weston and Watkins (1998), among others. To make a fair comparison, we use a version of multicategory SVM that parallels our multicategory ψ-learning; specifically, we replace the ψ-function in (4) and (5) by $\psi_1$. For the linear case, this version of multicategory SVM solves

$$\min_{b,w}\ \Big(\frac{1}{2}\sum_{j=1}^{k}\|w_j\|^2 + C\sum_{i=1}^{n}\psi_1\big(g(f(x_i),y_i)\big)\Big)\quad\text{subject to}\ \sum_{j=1}^{k}b_j 1_n + X\sum_{j=1}^{k}w_j = 0.\qquad (11)$$

This version of the SVM is closely related to that of Crammer and Singer (2001). In their formulation, all $b_j$ are set to $0$ rather than subjected to the sum-to-zero constraint, in contrast to (11). As argued by Guermeur (2002), the sum-to-zero constraint is necessary to ensure uniqueness of the solution when a $k$-dimensional vector of decision functions with intercepts $b_j$ is used for a $k$-class problem.

4.1 Simulation

Two linear examples are considered.
In these examples, the GE is approximated by the testing error on a testing sample independent of training. All calculations are carried out using the IMSL C routines.

Three-class linear problem. The training data are generated as follows. First, generate pairs $(t_1,t_2)$ from a bivariate $t$-distribution with $\nu$ degrees of freedom, where $\nu=1,3$ in Examples 1 and 2, respectively. Second, randomly assign a label from $\{1,2,3\}$ to each $(t_1,t_2)$. Third, calculate $(x_1,x_2)$ as $x_1=t_1+a_1$ and $x_2=t_2+a_2$, with $(a_1,a_2)=(\sqrt{3},1)$, $(-\sqrt{3},1)$, $(0,-2)$ for classes 1-3, respectively. In these examples, the testing and Bayes errors are computed via independent testing samples of size $10^6$ for classifiers obtained from training samples of size 150. To eliminate the dependence on $C$, we maximize the performance of ψ-learning and the SVM by optimizing $C$ over a discrete set in $[10^{-3},10^{3}]$. For each method, the testing error at the optimal $C$ is averaged over 100 repeated simulations. The simulation results are summarized in Table 1.

Insert Table 1 about here

As shown in Table 1, ψ-learning usually has a smaller testing error, and thus better generalization, than its SVM counterpart. The amount of improvement, however, varies across examples. In Example 1, the improvement of multicategory ψ-learning over the SVM is 43.22% when the corresponding $t$-distribution has one degree of freedom; in Example 2, it decreases to 20.41% when the $t$-distribution with 3 degrees of freedom is employed. Further, ψ-learning yields a smaller number of support vectors. This suggests that ψ-learning has an even more "sparse" solution than the SVM, and hence a stronger ability for data reduction. On a related matter, the SVM fails to achieve data reduction in Example 1, since almost all the instances are support vectors, in contrast to the much smaller number of support vectors for ψ-learning.
One plausible explanation is that the first moment of the standard bivariate $t$-distribution with one degree of freedom does not exist, and thus the corresponding SVM does not work well. In general, any classifier with an unbounded loss, such as SVM, may suffer from extreme outliers, as in this example. This reinforces our view that $\psi$-learning is more robust against outliers.

4.2 Application

We now examine the performance of $\psi$-learning and its SVM counterpart on the benchmark example letter, obtained from Statlog. In this example, each sample contains 16 primitive numerical attributes converted from a letter image, with a response variable representing 26 categories. The main goal is to identify each letter image as one of the 26 capital letters of the English alphabet. A detailed description can be found at www.liacc.up.pt/ML/statlog/datasets/letter/letter.doc.html. For illustration, we use the data for the letters D, O, and Q, with 805, 753, and 783 cases, respectively. A random sample of $n = 200$ is selected for training, leaving the rest for testing. For each training dataset, we seek the best performance of linear $\psi$-learning and SVM over a set of $C$-values in $[10^{-3}, 10^3]$. The results corresponding to the smallest testing error for each method, in ten different cases, are reported in Table 2. Since the Bayes error is unknown, the improvement of $\psi$-learning over SVM is computed as $(T(\mathrm{SVM}) - T(\psi))/T(\mathrm{SVM})$.

Insert Table 2 about here

Table 2 indicates that multicategory $\psi$-learning has a smaller testing error than its SVM counterpart, although the amount of improvement varies from sample to sample. In addition, on average, multicategory $\psi$-learning uses a smaller number of support vectors than SVM. In conclusion, $\psi$-learning generalizes better and achieves further data reduction relative to SVM in this example.

5 Discussion

In this article, we propose a new methodology that generalizes $\psi$-learning from the binary case to the multicategory case.
A statistical learning theory is developed for $\psi$-learning in terms of the Bayesian regret. In simulations, we show that the proposed methodology performs well and is more robust against outliers than its SVM counterpart. In addition, we discover some interesting phenomena that do not occur in the binary case.

Recently, there has been considerable interest in studying the variable selection problem using the $L_1$ norm in place of the conventional $L_2$ norm. In the binary case, Zhu et al. (2003) studied properties of the $L_1$ SVM and showed that the corresponding regularized solution path is piecewise linear. It is therefore natural to investigate variable selection via $L_1$ $\psi$-learning. Further developments are necessary to make multicategory $\psi$-learning more useful in practice, particularly methodologies for a data-driven choice of $C$, variable selection, the regularized solution path, and nonstandard situations such as unequal loss assignments.

Appendix

Proof of Theorem 3.1: By the definition of $\mathrm{Err}(f)$, it is easy to obtain via conditioning that $e(f, \bar f) = \frac{1}{2}E[\sum_{l=1}^k P_l(X)(\mathrm{sign}(\bar g(\bar f(X), l)) - \mathrm{sign}(g(f(X), l)))]$. It then suffices to consider the case in which $\mathrm{sign}(\bar g(\bar f(X), l)) - \mathrm{sign}(g(f(X), l))$ is nonzero, that is, when the two classifiers disagree. Equivalently, for any given $X = x$, we can write $e(f, \bar f)$ over all possible distinct classifications produced by $\bar f$ and $f$ jointly, where $\mathrm{sign}(\bar g(\bar f(x), l)) = 1$ and $\mathrm{sign}(g(f(x), j)) = 1$ imply that $\bar f$ classifies $x$ into class $l$ while $f$ classifies $x$ into class $j$, for $1 \le l \ne j \le k$. Thus we have

$$e(f, \bar f) = E\Big[\sum_{l=1}^k \sum_{j \ne l} (P_l(X) - P_j(X))\, I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big] = E\Big[\sum_{l=1}^k \sum_{j \ne l} |P_l(X) - P_j(X)|\, I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big],$$

where the second equality follows from the fact that $\bar f$ is the optimal (Bayes) decision function vector, so that $P_l(X) \ge P_j(X)$ whenever $\mathrm{sign}(\bar g(\bar f(X), l)) = 1$. The desired result then follows.
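Theorem 3.1 expresses the regret as an expectation of pairwise probability gaps over the points where the two classifiers disagree. Taking $e(f, \bar f)$ as the difference in misclassification error, as in the Bayesian-regret setup, the identity can be checked numerically on a toy discrete design; the class probabilities below are hypothetical.

```python
import numpy as np

# Toy check: e(f, fbar) equals the sum over disagreement points of P_l - P_j,
# for k = 3 classes and four equally likely values of X.
P = np.array([[0.6, 0.3, 0.1],     # P_l(x) for each of the 4 design points
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7],
              [0.4, 0.4, 0.2]])
px = np.full(4, 0.25)              # P(X = x)

f_assign    = np.array([1, 1, 2, 0])   # classes chosen by some classifier f
fbar_assign = P.argmax(axis=1)         # Bayes rule fbar

idx = np.arange(4)
err_f    = np.sum(px * (1.0 - P[idx, f_assign]))      # Err(f)
err_fbar = np.sum(px * (1.0 - P[idx, fbar_assign]))   # Err(fbar)

# Right-hand side of Theorem 3.1: P_l - P_j summed over disagreement points
rhs = np.sum(px * (P[idx, fbar_assign] - P[idx, f_assign]))
assert np.isclose(err_f - err_fbar, rhs)
```

Points where the two rules agree contribute zero to the right-hand side, exactly as the proof's reduction to disagreement events requires.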
Proof of Theorem 3.2: Write $E[1 - \mathrm{sign}(g(f(X), Y)) \mid X = x]$ as $\sum_{j=1}^k (1 - \mathrm{sign}(g(f(x), j))) P_j(x) = 1 - \sum_{j=1}^k \mathrm{sign}(g(f(x), j)) P_j(x)$. Note that, for any given $x$, one and only one of the $\mathrm{sign}(g(f(x), j))$ can be 1, with the rest equal to $-1$. Consequently, $E[1 - \mathrm{sign}(g(f(X), Y))]$ is minimized when $\mathrm{sign}(g(f(x), \arg\max_j \bar f_j(x))) = 1$, that is, $f = \bar f$. Evidently, the minimizer is not unique, as $c\bar f$ for $c \ge 1$ is also a minimizer. The desired result then follows from the facts that $\psi(u) \ge 1 - \mathrm{sign}(u)$ and $\psi(\bar g) = 1 - \mathrm{sign}(\bar g)$.

Proof of Theorem 3.3: Before proceeding, we introduce some notation. Let $\tilde l_\psi(f, Z_i) = l_\psi(f, Z_i) + \lambda J(f)$ be the cost function to be minimized, as in (4) or (5), where $l_\psi(f, Z_i) = \psi(g(f(X_i), Y_i))$ and $\lambda = 1/(Cn)$. Let $\tilde l(f, Z_i) = l(f, Z_i) + \lambda J(f)$, where $l(f, Z_i) = 1 - \mathrm{sign}(g(f(X_i), Y_i))$. Define the scaled empirical process $E_n(\tilde l(f, Z) - \tilde l_\psi(f_0, Z))$ as

$$n^{-1}\sum_{i=1}^n \big(\tilde l(f, Z_i) - \tilde l_\psi(f_0, Z_i) - E[\tilde l(f, Z_i) - \tilde l_\psi(f_0, Z_i)]\big) = E_n[l(f, Z) - l_\psi(f_0, Z)],$$

where $Z = (X, Y)$. Let $A_{i,j} = \{f \in \mathcal F : 2^{i-1}\delta_n^2 \le e(f, \bar f) < 2^i\delta_n^2,\ 2^{j-1}J_0 \le J(f) < 2^j J_0\}$ and $A_{i,0} = \{f \in \mathcal F : 2^{i-1}\delta_n^2 \le e(f, \bar f) < 2^i\delta_n^2,\ J(f) < J_0\}$, for $j = 1, 2, \dots$ and $i = 1, 2, \dots$. Without loss of generality, we assume $J(f_0) \ge 1$ and $\max(\varepsilon_n^2, 2s_n) < 1$ in the sequel. The proof uses the treatment of Shen et al. (2003) and Shen (1998), together with the results in Theorem 3.1 and Assumption B. In what follows, we omit details that can be found in the proof of Theorem 1 of Shen et al. (2003). Using the connection between $e(\hat f, \bar f)$ and the cost function as in Shen et al. (2003), we have

$$P(e(\hat f, \bar f) \ge \delta_n^2) \le P^*\Big(\sup_{\{f \in \mathcal F : e(f, \bar f) \ge \delta_n^2\}} n^{-1}\sum_{i=1}^n (\tilde l_\psi(f_0, Z_i) - \tilde l(f, Z_i)) \ge 0\Big) = I,$$

where $P^*$ denotes the outer probability measure. To bound $I$, it suffices to bound $P(A_{i,j})$ for each $i, j = 1, \dots$. To this end, we need some inequalities for the first and second moments of $\tilde l(f, Z) - \tilde l_\psi(f_0, Z)$ for $f \in A_{i,j}$.
For the first moment, note that $E[l(f, Z) - l_\psi(f_0, Z)] = E[l(f, Z) - l_\psi(\bar f, Z)] - E[l_\psi(f_0, Z) - l_\psi(\bar f, Z)]$, which equals $2(e(f, \bar f) - e_\psi(f_0, \bar f))$ since $E l_\psi(\bar f, Z) = E l(\bar f, Z)$ by Theorem 3.2. By Assumption A and the definition of $\delta_n^2$, $2e_\psi(f_0, \bar f) \le 2s_n \le \delta_n^2$. Then, using the assumption that $J_0\lambda \le \delta_n^2/2$, we have, for any integers $i, j \ge 1$,

$$\inf_{A_{i,j}} E(\tilde l(f, Z) - \tilde l_\psi(f_0, Z)) \ge M(i, j) = 2^{i-1}\delta_n^2 + \lambda(2^{j-1} - 1)J(f_0), \tag{12}$$

and

$$\inf_{A_{i,0}} E(\tilde l(f, Z) - \tilde l_\psi(f_0, Z)) \ge (2^{i-1} - 1/2)\delta_n^2 \ge M(i, 0) = 2^{i-2}\delta_n^2, \tag{13}$$

where the fact that $2^i - 1 \ge 2^{i-1}$ has been used.

For the second moment, it follows from Theorem 3.1 and Assumption B that, for any $f \in \mathcal F$,

$$e(f, \bar f) = E\Big[\sum_{l=1}^k\sum_{j\ne l} |P_l(X) - P_j(X)|\, I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big] \ge 2\delta\, E\Big[\sum_{l=1}^k\sum_{j\ne l} I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\, I(|P_l(X) - P_j(X)| \ge 2\delta)\Big] \ge \delta\Big(E\Big[2\sum_{l=1}^k\sum_{j\ne l} I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big] - 2c_1\delta^\alpha\Big) = \frac{1}{2}(4c_1)^{-1/\alpha}\, E\Big[2\sum_{l=1}^k\sum_{j\ne l} I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big]^{(\alpha+1)/\alpha} \tag{14}$$

with the choice $\delta = \big(E[2\sum_{l=1}^k\sum_{j\ne l} I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)]/(4c_1)\big)^{1/\alpha}$.

Now we establish a connection between the first and second moments. By Theorem 3.2, $E[\psi(\bar g(\bar f(X), Y)) - (1 - \mathrm{sign}(\bar g(\bar f(X), Y)))] = 0$. Since $\psi(u) \ge 1 - \mathrm{sign}(u)$ for any $u \in R^{k-1}$,

$$E|\psi(g_0(f_0(X), Y)) - (1 - \mathrm{sign}(g_0(f_0(X), Y)))| = E[\psi(g_0(f_0(X), Y)) - (1 - \mathrm{sign}(g_0(f_0(X), Y)))] \le 2e_\psi(f_0, \bar f).$$
By the triangle inequality,

$$E[l(f, Z) - l_\psi(f_0, Z)]^2 \le 2E|1 - \mathrm{sign}(g(f(X), Y)) - \psi(g_0(f_0(X), Y))| \le 2\big(2e_\psi(f_0, \bar f) + E|\mathrm{sign}(\bar g(\bar f(X), Y)) - \mathrm{sign}(g(f(X), Y))| + E|\mathrm{sign}(\bar g(\bar f(X), Y)) - \mathrm{sign}(g_0(f_0(X), Y))|\big). \tag{15}$$

Note that, for any $f \in \mathcal F$,

$$E|\mathrm{sign}(\bar g(\bar f(X), Y)) - \mathrm{sign}(g(f(X), Y))| = E\Big[\sum_{l=1}^k I(Y = l)\,|\mathrm{sign}(\bar g(\bar f(X), l)) - \mathrm{sign}(g(f(X), l))|\Big] = E\Big[2\sum_{l=1}^k I(Y = l)\sum_{j\ne l} I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big] \le E\Big[2\sum_{l=1}^k\sum_{j\ne l} I(\mathrm{sign}(\bar g(\bar f(X), l)) = 1, \mathrm{sign}(g(f(X), j)) = 1)\Big].$$

This, together with (14), implies that

$$E|\mathrm{sign}(\bar g(\bar f(X), Y)) - \mathrm{sign}(g(f(X), Y))| \le c^*\, e(f, \bar f)^{\alpha/(\alpha+1)}, \tag{16}$$

where $c^* = 2^{\alpha/(\alpha+1)}(4c_1)^{1/(\alpha+1)}$. For any $f \in A_{i,j}$, the facts that $e(f, \bar f)^{\alpha/(\alpha+1)} \ge (2^{-1}\delta_n^2)^{\alpha/(\alpha+1)} \ge 2^{-1}\delta_n^2 \ge s_n \ge e_\psi(f_0, \bar f)$ and $e(f, \bar f) \ge e(f_0, \bar f)$, together with (15) and (16), imply that

$$E[l(f, Z) - l_\psi(f_0, Z)]^2 \le 2\big(2e_\psi(f_0, \bar f) + c^*(e(f, \bar f)^{\alpha/(\alpha+1)} + e(f_0, \bar f)^{\alpha/(\alpha+1)})\big) \le c_3'\,(e(f, \bar f)/2)^{\alpha/(\alpha+1)},$$

with $c_3' = 16c_1^{1/(\alpha+1)} + 8$. Consequently, for $i = 1, \dots$ and $j = 0, 1, \dots$,

$$\sup_{A_{i,j}} E[l_\psi(f_0, Z) - l(f, Z)]^2 \le c_3'(2^{i-1}\delta_n^2)^{\alpha/(\alpha+1)} \le c_3 M(i, j)^{\alpha/(\alpha+1)} = v(i, j)^2,$$

where $c_3 = 2c_3'$.

We are now ready to bound $I$. Using the assumption that $J_0\lambda \le \delta_n^2/2$, together with (12) and (13), we have $I \le \sum_{i\ge 1, j\ge 0} P^*\big(\sup_{A_{i,j}} E_n(l_\psi(f_0, Z) - l(f, Z)) \ge M(i, j)\big)$. By definition, $l_\psi(f_0, Z)$ and $l(f, Z)$ are bounded between 0 and 2. Then $E[l_\psi(f_0, Z) - l(f, Z)]^2 \le 4$ and $E_n(l_\psi(f_0, Z) - l(f, Z)) \le 4$. For convenience, we scale the empirical process by the constant $t = (4c_3^{1/2})^{-1}$ in what follows. Then

$$I \le \sum_{i,j} P^*\Big(\sup_{A_{i,j}} E_n(t[l_\psi(f_0, Z) - l(f, Z)]) \ge M_c(i, j)\Big) + \sum_i P^*\Big(\sup_{A_{i,0}} E_n(t[l_\psi(f_0, Z) - l(f, Z)]) \ge M_c(i, 0)\Big) = I_1 + I_2 \tag{17}$$

and $\sup_{A_{i,j}} E[l_\psi(f_0, Z) - l(f, Z)]^2 \le v_c(i, j)^2$, where $v_c(i, j) = \min(t^{1/2}v(i, j), 1)$ and $M_c(i, j) = \min(tM(i, j), c_3^{-1/2})$. Note that $v_c(i, j) < 1$ implies $M_c(i, j) = tM(i, j)$. Next we bound $I_1$ and $I_2$ separately. For $I_1$, we verify the required conditions (4.5)-(4.7) in Theorem 3 of Shen and Wong (1994). To compute the metric entropy in (4.7) there, we need to construct a bracketing function of $l_\psi(f_0, Z) - l(f, Z)$.
Denote an $\varepsilon$-bracketing set for $\{(G_{f_1}, \dots, G_{f_k}) : f \in A_{i,j}\}$ by $\{(G^v_{p1}, \dots, G^v_{pm}), (G^u_{p1}, \dots, G^u_{pm})\}$, $p = 1, \dots, k$. Let $s^v_{ph}(x)$ be $-1$ if $x \in G^u_{ph}$ and 1 otherwise, and let $s^u_{ph}(x)$ be $-1$ if $x \in G^v_{ph}$ and 1 otherwise, for $p = 1, \dots, k$ and $h = 1, \dots, m$. Then $\{(s^v_{p1}, \dots, s^v_{pm}), (s^u_{p1}, \dots, s^u_{pm})\}$ forms an $\varepsilon$-bracketing function of $-\mathrm{sign}(g(f(x), p))$ for $f \in A_{i,j}$ and $p = 1, \dots, k$. This implies that, for any $\varepsilon \ge 0$ and $f \in A_{i,j}$, there exists an $h$ ($1 \le h \le m$) such that $l^v_h(z) \le l(f, z) - l_\psi(f_0, z) \le l^u_h(z)$ for any $z = (x, y)$, where $l^u_h(z) = 1 + \sum_{p=1}^k s^u_{ph}(x)I(y = p) - l_\psi(f_0, z)$, $l^v_h(z) = 1 + \sum_{p=1}^k s^v_{ph}(x)I(y = p) - l_\psi(f_0, z)$, and $(E[l^u_h - l^v_h]^2)^{1/2} = \big(\sum_{p=1}^k E[(s^u_{ph}(x) - s^v_{ph}(x))I(y = p)]^2\big)^{1/2} \le 2\big(\max_p P(G^u_{ph}\,\Delta\,G^v_{ph})\big)^{1/2} \le 2\varepsilon^{1/2}$. So $(E[l^u_h - l^v_h]^2)^{1/2} \le \min(2\varepsilon^{1/2}, 2)$. Hence $H_B(\varepsilon, \mathcal F^*(2^j)) \le H(\varepsilon^2/4, \mathcal G(2^j))$ for any $\varepsilon > 0$ and $j = 0, \dots$, where $\mathcal F^*(2^j) = \{l(f, z) - l_\psi(f_0, z) : f \in \mathcal F, J(f) \le 2^j\}$.

Using the fact that $\int_{aM_c(i,j)}^{v_c(i,j)} H_B^{1/2}(u^2/4, \mathcal G(2^j))\,du\,/\,M_c(i, j)$ is non-increasing in $i$ and $M_c(i, j)$, $i = 1, \dots$, we have

$$\int_{aM_c(i,j)}^{v_c(i,j)} H_B^{1/2}(u^2/4, \mathcal G(2^j))\,du\,/\,M_c(i, j) \le \int_{aM_c(1,j)}^{c_3^{1/2}M_c(1,j)^{\alpha/(2(\alpha+1))}} H_B^{1/2}(u^2/4, \mathcal G(2^j))\,du\,/\,M_c(1, j) \le \phi(\varepsilon_n, 2^j),$$

where $a = \epsilon/32$ with $\epsilon$ defined below. Thus (4.7) of Shen and Wong (1994) holds with $M = n^{1/2}M_c(i, j)$ and $v = v_c(i, j)^2$, and so does (4.5). In addition, with $T = 1$, $M_c(i, j)/v_c(i, j)^2 \le \max(c_3^{-1/2}, c_3^{-(2\alpha+3)/(2\alpha+2)}) = c_3^{-1/2} \le \epsilon/(4T)$ implies (4.6) with $\epsilon = 4c_3^{-1/2} < 1$. Note that $0 < \delta_n \le 1$ and $\lambda J_0 \le \delta_n^2/2$. Using a similar argument as in Shen et al. (2003), an application of Theorem 3 of Shen and Wong (1994) yields

$$I_1 \le 3\exp\big(-c_5 n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)}\big)\big/\big[1 - \exp\big(-c_5 n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)}\big)\big]^2.$$

Here and in the sequel, $c_5$ is a positive generic constant. $I_2$ can be bounded similarly. Finally, $I \le 6\exp(-c_5 n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)})/[1 - \exp(-c_5 n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)})]^2$. This implies that $I^{1/2} \le (5/2 + I^{1/2})\exp(-c_5 n(\lambda J(f_0))^{(\alpha+2)/(\alpha+1)})$. The result then follows from the fact that $I \le I^{1/2} \le 1$.
Proof of Corollary 3.1: The result follows from the assumptions and the exponential inequality in Theorem 3.3.

Proof of Theorem 3.4: The proof is similar to that of Theorem 3.3; for simplicity, we sketch only the parts that require modification. Consider the scaled empirical process $E_n(\tilde l_\psi(f, Z) - \tilde l_\psi(f_0, Z))$ and let $A_{i,j} = \{f \in \mathcal F : 2^{i-1}\delta_n^{*2} \le e_\psi(f, \bar f) < 2^i\delta_n^{*2},\ 2^{j-1}J_0 \le J(f) < 2^j J_0\}$ and $A_{i,0} = \{f \in \mathcal F : 2^{i-1}\delta_n^{*2} \le e_\psi(f, \bar f) < 2^i\delta_n^{*2},\ J(f) < J_0\}$, for $j = 1, 2, \dots$ and $i = 1, 2, \dots$. Using an analogous argument, we have

$$P(e_\psi(\hat f, \bar f) \ge \delta_n^{*2}) \le P^*\Big(\sup_{\{f \in \mathcal F : e_\psi(f, \bar f) \ge \delta_n^{*2}\}} n^{-1}\sum_{i=1}^n (\tilde l_\psi(f_0, Z_i) - \tilde l_\psi(f, Z_i)) \ge 0\Big) = I.$$

To bound $I$, we consider the first and second moments of $\tilde l_\psi(f, Z) - \tilde l_\psi(f_0, Z)$ for $f \in A_{i,j}$. For the first moment, it is straightforward to show that, for any integers $i, j \ge 1$, $\inf_{A_{i,j}} E(\tilde l_\psi(f, Z) - \tilde l_\psi(f_0, Z)) \ge M(i, j) = 2^{i-1}\delta_n^{*2} + \lambda(2^{j-1} - 1)J(f_0)$ and $\inf_{A_{i,0}} E(\tilde l_\psi(f, Z) - \tilde l_\psi(f_0, Z)) \ge M(i, 0) = 2^{i-2}\delta_n^{*2}$. For the second moment, $e_\psi(f, \bar f) = e(f, \bar f) + \frac{1}{2}E[\psi(g(f(X), Y))I(g(f(X), Y) \in (0, \tau))]$ and $e_\psi(f, \bar f) \le 1$. Thus

$$\frac{1}{2}E[\psi(g(f(X), Y))I(g(f(X), Y) \in (0, \tau))] \le e_\psi(f, \bar f) \le (e_\psi(f, \bar f))^{\alpha/(\alpha+1)}. \tag{18}$$

For any $f \in A_{i,j}$, the fact that $e_\psi(f, \bar f) \ge 2^{-1}\delta_n^{*2} \ge s_n \ge e_\psi(f_0, \bar f)$, together with (16) and (18), implies that

$$E[l_\psi(f, Z) - l_\psi(f_0, Z)]^2 \le 2E|\mathrm{sign}(g(f(X), Y)) - \mathrm{sign}(g_0(f_0(X), Y))| + 2E[\psi(g_0(f_0(X), Y))I(g_0(f_0(X), Y) \in (0, \tau))] + 2E[\psi(g(f(X), Y))I(g(f(X), Y) \in (0, \tau))] \le 2\big(c^*[e_\psi(f, \bar f)^{\alpha/(\alpha+1)} + e_\psi(f_0, \bar f)^{\alpha/(\alpha+1)}]\big) + 4[e_\psi(f, \bar f)^{\alpha/(\alpha+1)} + e_\psi(f_0, \bar f)^{\alpha/(\alpha+1)}] \le c_3'\,(e_\psi(f, \bar f)/2)^{\alpha/(\alpha+1)},$$

with $c_3' = 16c_1^{1/(\alpha+1)} + 8$. Therefore, $\sup_{A_{i,j}} E(l_\psi(f_0, Z) - l_\psi(f, Z))^2 \le c_3 M(i, j)^{\alpha/(\alpha+1)} = v(i, j)^2$ for $i = 1, \dots$ and $j = 0, 1, \dots$, where $c_3 = 2c_3'$. To bound $I$, note that $I \le I_1 + I_2$, where $I_1 = \sum_{i,j} P^*\big(\sup_{A_{i,j}} E_n(l_\psi(f_0, Z) - l_\psi(f, Z)) \ge M(i, j)\big)$ and $I_2 = \sum_i P^*\big(\sup_{A_{i,0}} E_n(l_\psi(f_0, Z) - l_\psi(f, Z)) \ge M(i, 0)\big)$. We can thus bound $I_1$ and $I_2$ separately.
Using the fact that $\int_{aM(i,j)}^{v(i,j)} H_B^{1/2}(u, \mathcal F_\psi(2^j))\,du\,/\,M(i, j)$ is non-increasing in $i$ and $M(i, j)$, $i = 1, \dots$, we have $\int_{aM(i,j)}^{v(i,j)} H_B^{1/2}(u, \mathcal F_\psi(2^j))\,du\,/\,M(i, j) \le \phi^*(\varepsilon^*_n, 2^j)$. The result then follows from the same argument as in the proof of Theorem 3.3.

Proof of Corollary 3.2: The result follows from the assumptions and the exponential inequality in Theorem 3.4.

Lemma 1 (metric entropy in Example 3.3.1): Under the assumptions of Example 3.3.1, we have $H_B(\varepsilon, \mathcal G(\ell)) \le O(k^2\log(k/\varepsilon))$.

Proof: Let $(G_1, \dots, G_k)$ be a classification partition induced by $f$, and let $G_{j_1 j_2} = \{x : f_{j_1} - f_{j_2} > 0, x \in S\}$ for $j_1 \ne j_2 \in \{1, \dots, k\}$. We first construct a bracket for $G_{j_1 j_2}$. To this end, we determine the $d$ points at which the plane $f_{j_1} - f_{j_2} = 0$ intersects $d$ of the $d2^{d-1}$ edges of the cube $[0, 1]^d$. Each of these $d$ points is covered by a bracket of length $\varepsilon^*$ on the edge to which the point belongs. Given an edge, the covering number for this point is no greater than $1/\varepsilon^*$; hence the covering number for the $d$ points on $d$ of the $d2^{d-1}$ edges is at most $\binom{d2^{d-1}}{d}(1/\varepsilon^*)^d$. After the $d$ intersecting points of $f_{j_1} - f_{j_2} = 0$ on the edges of $S$ are covered, we connect the endpoints of the $d$ brackets to form bracketing planes $v_{j_1 j_2} = 0$ and $u_{j_1 j_2} = 0$ such that $\{x : v_{j_1 j_2} > 0\} \subset \{x : f_{j_1} - f_{j_2} > 0\} \subset \{x : u_{j_1 j_2} > 0\}$. Since the longest segment in $S$ has length $\sqrt d$, corresponding to the diagonal between $(0, \dots, 0)$ and $(1, \dots, 1)$, we have $P(x : v_{j_1 j_2} < 0 < u_{j_1 j_2}) \le (\sqrt d)^{d-1}\varepsilon^*$ because $x$ is uniformly distributed on $S$. Consequently, $G^v_{j_1 j_2} \subset G_{j_1 j_2} \subset G^u_{j_1 j_2}$ and $P(G^v_{j_1 j_2}\,\Delta\,G^u_{j_1 j_2}) \le (\sqrt d)^{d-1}\varepsilon^*$, where $G^v_{j_1 j_2} = \{x : v_{j_1 j_2} > 0\}$ and $G^u_{j_1 j_2} = \{x : u_{j_1 j_2} > 0\}$. Since $G_{j_1} = \cap_{j_2} G_{j_1 j_2}$, we have $G^v_{j_1} \subset G_{j_1} \subset G^u_{j_1}$ and $P(G^v_{j_1}\,\Delta\,G^u_{j_1}) \le P(\cup_{j_2} G^v_{j_1 j_2}\,\Delta\,G^u_{j_1 j_2}) \le (k - 1)(\sqrt d)^{d-1}\varepsilon^*$, where $G^v_{j_1} = \cap_{j_2} G^v_{j_1 j_2}$ and $G^u_{j_1} = \cap_{j_2} G^u_{j_1 j_2}$, $j_1 \ne j_2 \in \{1, \dots, k\}$.
With $\varepsilon = (k - 1)(\sqrt d)^{d-1}\varepsilon^*$, the collection $\{(G^v_1, G^u_1), \dots, (G^v_k, G^u_k)\}$ satisfies $\max_{j_1} P(G^v_{j_1}\,\Delta\,G^u_{j_1}) \le \varepsilon$ and thus forms an $\varepsilon$-bracketing set for $(G_1, \dots, G_k)$. Therefore, the $\varepsilon$-covering number for all partitions induced by $f$ is at most $\big[\binom{d2^{d-1}}{d}\big((k-1)(\sqrt d)^{d-1}/\varepsilon\big)^d\big]^{k(k-1)}$. Since $d$ is a constant, the bracketing metric entropy $H_B(\varepsilon, \mathcal G(\ell))$ is bounded by $O(k^2\log(k/\varepsilon))$ for any $\ell$, yielding the desired result.

Lemma 2 (metric entropy in Example 3.3.2): Under the assumptions of Example 3.3.2, we have $H_B(\varepsilon, \mathcal F_\psi(\ell)) \le O(k(\log(\ell/\varepsilon))^{d+1})$.

Proof: To obtain an upper bound for $H_B(\varepsilon, \mathcal F_\psi(\ell))$, we use the sup-norm entropy bound for a single function set in Zhou (2002), that is, $H_\infty(\varepsilon, \mathcal F(\ell)) \le O((\log(\ell/\varepsilon))^{d+1})$ under the $L_\infty$ metric $\|g\|_\infty = \sup_{x \in S}|g(x)|$. Consider an arbitrary function vector $f = (f_1, \dots, f_k) \in \mathcal F(\ell)$. The metric entropy for all $k$-dimensional function vectors in $\mathcal F(\ell)$ is then bounded by $O(k(\log(\ell/\varepsilon))^{d+1})$, since $k$ functions must be covered simultaneously. Let $[f^v_j, f^u_j]$ be an $\varepsilon$-bracket for $f_j$. Then $[f^v_j - f^u_l, f^u_j - f^v_l]$ forms a $2\varepsilon$-bracket for $f_j - f_l$. Denote $g^v_j = \min_{l \in \{1,\dots,k\}\setminus\{j\}}(f^v_j - f^u_l)$ and $g^u_j = \min_{l \in \{1,\dots,k\}\setminus\{j\}}(f^u_j - f^v_l)$. Then $[g^v_j, g^u_j]$ becomes a $2\varepsilon$-bracket for $g_{\min}(f, j) = \min_{l \ne j}(f_j - f_l)$. Consequently, $\psi(g^v_j) \ge \psi(g_{\min}(f, j)) \ge \psi(g^u_j)$ by the non-increasing property of the $\psi$ function. By (10), we have $|\psi(g^v_j) - \psi(g^u_j)| \le 2D\varepsilon$. Since $g_{\min}(f, y) = \sum_{j=1}^k I(y = j)g_{\min}(f, j)$, we have $g_{\min}(f, y) \in [\sum_{j=1}^k I(y = j)g^v_j, \sum_{j=1}^k I(y = j)g^u_j]$ and $|\psi(\sum_{j=1}^k I(y = j)g^v_j) - \psi(\sum_{j=1}^k I(y = j)g^u_j)| \le 2D\varepsilon$. Consequently, $[\psi(\sum_{j=1}^k I(y = j)g^u_j(x)) - \psi(g_0(f_0(x), y)), \psi(\sum_{j=1}^k I(y = j)g^v_j(x)) - \psi(g_0(f_0(x), y))]$ forms a bracket of length $2D\varepsilon$ for $\psi(g(f(x), y)) - \psi(g_0(f_0(x), y))$. The desired result then follows.

References

An, H. L. T., and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J. Global Optim., 11, 253-285.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2003).
Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley.

Boser, B., Guyon, I., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. The Fifth Annual Conference on Computational Learning Theory, Pittsburgh: ACM, 142-152.

Cortes, C., and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.

Crammer, K., and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.

Guermeur, Y. (2002). Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications (PAA), 5, 168-179.

Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory Support Vector Machines: theory and application to the classification of microarray data and satellite radiance data. J. Amer. Statist. Assoc., 99, 67-81.

Lin, X., Wahba, G., Xiang, D., Gao, F., Klein, R., and Klein, B. (2000). Smoothing spline ANOVA models for large data sets with Bernoulli observations and the randomized GACV. Annals of Statistics, 28, 1570-1600.

Lin, Y. (2000). Some asymptotic properties of the support vector machine. Technical Report 1029, Department of Statistics, University of Wisconsin-Madison.

Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259-275.

Liu, Y., Shen, X., and Doss, H. (2005). Multicategory ψ-learning and support vector machine: computational tools. J. Comput. Graph. Statist., 14, 219-236.

Mammen, E., and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist., 27, 1808-1829.

Marron, J. S., and Todd, M. J. (2002). Distance Weighted Discrimination. Technical Report 1339, School of Operations Research and Industrial Engineering, Cornell University.

Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London A, 209, 415-446.

Shen, X.
(1998). On the method of penalization. Statistica Sinica, 8, 337-357.

Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. J. Amer. Statist. Assoc., 98, 724-734.

Shen, X., and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist., 22, 580-615.

Steinwart, I. (2001). On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2, 67-93.

Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32, 135-166.

Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and the randomized GACV. In: B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, 125-143.

Weston, J., and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.

Zhang, T. (2004a). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56-85.

Zhang, T. (2004b). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225-1251.

Zhou, D. X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739-767.

Zhu, J., and Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 185-205.

Zhu, J., Hastie, T., Rosset, S., and Tibshirani, R. (2003). 1-norm support vector machines. Neural Information Processing Systems, 16.

Table 1: Training errors, testing errors, and their $\hat e(\cdot, \bar f)$ for SVM and ψ-learning at the best $C$ in Examples 1 and 2 with $n = 150$, averaged over 100 simulation replications, with standard errors in parentheses. In Example 1 (d.f. = 1), the Bayes error is 0.2470 and the improvement of ψ-learning over SVM is 43.22%.
In Example 2 (d.f. = 3), the Bayes error is 0.1456 and the improvement of ψ-learning over SVM is 20.41%. Here, the improvement of ψ-learning over SVM is defined by $(T(\mathrm{SVM}) - T(\psi))/\hat e(\mathrm{SVM}, \bar f)$, where $\hat e(\cdot, \bar f) = T(\cdot) - \text{Bayes error}$ and $T(\cdot)$ denotes the testing error of a given method.

Example   Method  Training (s.e.)  Testing (s.e.)   $\hat e(\cdot,\bar f)$ (s.e.)  No. SV (s.e.)
d.f. = 1  SVM     0.4002 (0.1469)  0.4305 (0.1405)  0.1835 (0.1405)                141.76 (10.97)
          ψ-L     0.3199 (0.1237)  0.3494 (0.1209)  0.1024 (0.1209)                 64.64 (15.43)
d.f. = 3  SVM     0.1447 (0.0267)  0.1505 (0.0045)  0.0049 (0.0045)                 71.81 (11.02)
          ψ-L     0.1429 (0.0285)  0.1495 (0.0033)  0.0039 (0.0033)                 41.29 (13.51)

Table 2: Testing errors for the problem letter. Each training dataset is of size 200, selected from a total of 2341 samples.

Case  SVM   ψ-L   Improvement (%)
1     .083  .079   3.39
2     .073  .063  12.24
3     .086  .076  11.41
4     .072  .072   0
5     .088  .085   3.74
6     .077  .073   5.45
7     .075  .072   4.39
8     .079  .075   5.92
9     .093  .091   1.51
10    .090  .086   4.11
Average #SVs: SVM 51.1, ψ-L 40.8

Figure 1: Perspective plot of the 3-class ψ function defined in (3).

Figure 2: Illustration of the concept of margins and support vectors in a 3-class separable example. The instances for classes 1-3 fall respectively into the polyhedra $D_j$, $j = 1, 2, 3$, where $D_1 = \{x : f_1(x) - f_2(x) \ge 1, f_1(x) - f_3(x) \ge 1\}$, $D_2 = \{x : f_2(x) - f_1(x) \ge 1, f_2(x) - f_3(x) \ge 1\}$, and $D_3 = \{x : f_3(x) - f_1(x) \ge 1, f_3(x) - f_2(x) \ge 1\}$. The generalized geometric margin $\gamma$, defined as $\min\{\gamma_{12}, \gamma_{13}, \gamma_{23}\}$, is maximized to obtain the decision boundary. There are five support vectors on the boundaries of the three polyhedra: one from class 1, one from class 2, and three from class 3.