Sample-Separation-Margin Based Minimum Classification Error Training of Pattern Classifiers with Quadratic Discriminant Functions

Yongqiang Wang (1,2), Qiang Huo (1)
(1) Microsoft Research Asia, Beijing, China
(2) The University of Hong Kong, Hong Kong, China
(qianghuo@microsoft.com)
ICASSP-2010, Dallas, Texas, U.S.A., March 14-19, 2010

Outline
• Background
• What's our new approach
• How does it work
• Conclusions

Background of Minimum Classification Error (MCE) Formulation for Pattern Classification
• Pioneered by Amari and Tsypkin in the late 1960s
  – S. Amari, "A theory of adaptive pattern classifiers," IEEE Trans. on Electronic Computers, Vol. EC-16, No. 3, pp. 299-307, 1967.
  – Y. Z. Tsypkin, Adaptation and Learning in Automatic Systems, 1971.
  – Y. Z. Tsypkin, Foundations of the Theory of Learning Systems, 1973.
• Proposed originally for supervised online adaptation of a pattern classifier
  – to minimize the expected risk (cost)
  – via a sequential probabilistic descent (PD) algorithm
• Extended by Juang and Katagiri in the early 1990s
  – B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. on Signal Processing, Vol. 40, No. 12, pp. 3043-3054, 1992.

MCE Formulation by Juang and Katagiri (1)
• Define a proper discriminant function of an observation for each pattern class
• To enable a maximum-discriminant decision rule for pattern classification
• Largely an art and application dependent

MCE Formulation by Juang and Katagiri (2)
• Define a misclassification measure for each observation
  – to embed the decision process in the overall MCE formulation
  – to characterize the degree of confidence (or margin) in making the decision for this observation
  – a differentiable function of the classifier parameters
• A popular choice (Juang & Katagiri, 1992):
    d_k(x; Λ) = -g_k(x; Λ) + [ (1/(M-1)) Σ_{j≠k} g_j(x; Λ)^η ]^(1/η),
  where g_j(x; Λ) is the discriminant function of class j, M is the number of classes, and η is a positive constant
• Many possible ways => which one is better? => an open problem!

MCE Formulation by Juang and Katagiri (3)
• Define a loss (cost) function for each observation
  – a differentiable and monotonically increasing function of the misclassification measure
  – many possibilities => the sigmoid function is the most popular choice for approximating the classification error
• MCE training via minimizing
  – the empirical average loss (cost) by an appropriate optimization procedure, e.g., gradient descent (GD), Quickprop, Rprop, etc., or
  – the expected loss (cost) by a sequential probabilistic descent (PD) algorithm (a.k.a. GPD)

Some Remarks
• Combinations of different choices for each of the previous three steps and optimization methods lead to various MCE training algorithms.
• The power of MCE training has been demonstrated by many research groups for different pattern classifiers in different applications.
• How to improve the generalization capability of an MCE-trained classifier?

One Possible Solution: SSM-based MCE Training
• Sample Separation Margin (SSM)
  – defined as the smallest distance of an observation to the classification boundary formed by the true class and the most competing class
  – has a closed-form solution for a piecewise linear classifier (illustrated in the sketch below)
• Define the misclassification measure as the negative SSM
  – the other parts of the formulation are the same as in "traditional" MCE
• A happy result
  – minimized empirical error rate, and
  – improved generalization
    • correctly recognized training samples have a large margin from the decision boundaries!
• For more info:
  – T. He and Q. Huo, "A study of a new misclassification measure for minimum classification error training of prototype-based pattern classifiers," in Proc. ICPR-2008.
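To make the closed-form linear case concrete, here is a minimal sketch. It is not the authors' implementation: it assumes a prototype-based classifier with one Euclidean prototype per class, and the names (ssm_prototype, mce_loss) and the smoothing constant alpha are illustrative choices rather than notation from the paper.

```python
import numpy as np

def ssm_prototype(x, prototypes, true_label):
    """Signed distance from x to the linear boundary between the true-class
    prototype and the most competing prototype (closed form)."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)   # squared distances to all prototypes
    rivals = [k for k in range(len(prototypes)) if k != true_label]
    q = rivals[int(np.argmin(d2[rivals]))]       # index of the most competing class
    m_p, m_q = prototypes[true_label], prototypes[q]
    # Distance of x to the perpendicular-bisector hyperplane of (m_p, m_q);
    # positive when x is correctly classified, negative otherwise.
    return (d2[q] - d2[true_label]) / (2.0 * np.linalg.norm(m_p - m_q))

def mce_loss(x, prototypes, true_label, alpha=1.0):
    """Sigmoid loss on the misclassification measure d = -SSM."""
    d = -ssm_prototype(x, prototypes, true_label)
    return 1.0 / (1.0 + np.exp(-alpha * d))

# Example: three 2-D prototypes, one per class
prototypes = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
x = np.array([0.4, 0.1])
print(ssm_prototype(x, prototypes, true_label=0), mce_loss(x, prototypes, true_label=0))
```

For a correctly classified sample the SSM is positive, so minimizing the sigmoid loss of d = -SSM both reduces errors and pushes correctly classified samples away from the decision boundary, which is the "happy result" noted above.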
What's New in This Study?
• Extend SSM-based MCE training to pattern classifiers with a quadratic discriminant function (QDF)
  – no closed-form solution to calculate the SSM
• Demonstrate its effectiveness on a large-scale Chinese handwriting recognition task
  – the modified QDF (MQDF) is widely used in state-of-the-art Chinese handwriting recognition systems

Two Technical Issues
• How to calculate the SSM efficiently?
  – formulated as a nonlinear programming problem
  – can be solved efficiently because it is a quadratically constrained quadratic programming (QCQP) problem with a very special structure:
    • a convex objective function with one quadratic equality constraint
  – a rough illustration is sketched after the concluding slide
• How to calculate the derivative of the SSM?
  – using a technique known as sensitivity analysis in nonlinear programming
  – calculated by reusing the solution to the problem in Eq. (1)
• Please refer to our paper for details

Experimental Setup
• Vocabulary:
  – 6763 simplified Chinese characters
• Dataset:
  – Training: 9,447,328 character samples
    • # of samples per class: 952 to 5,600
  – Testing: 614,369 character samples
    • [Chart: distribution of writing styles in the testing data: Regular 30%, Cursive 70%]
• Feature extraction:
  – 512 "8-directional" features
  – LDA used to reduce the dimension to 128
• Use an MQDF for each character class
  – # of retained eigenvectors (K): 5 and 10
• SSM-based MCE training:
  – use the maximum likelihood (ML) trained model as the seed model
  – update mean vectors only in MCE training
  – optimize the MCE objective function by batch-mode Quickprop (20 epochs)

Experimental Results (1)
• MQDF, K=5
  [Bar chart: character error rates (%) of ML, MCE, and SSM-MCE on the Regular and Cursive subsets; values as in the table below]

  Method     Regular (error in %)   Cursive (error in %)
  ML         1.73                   8.34
  MCE        1.29                   7.34
  SSM-MCE    1.19                   7.00

Experimental Results (2)
• MQDF, K=10
  [Bar chart: the same comparison for K=10; values as in the table below]

  Method     Regular (error in %)   Cursive (error in %)
  ML         1.39                   7.03
  MCE        1.30                   6.54
  SSM-MCE    1.07                   6.29

Experimental Results (3)
• Histogram of SSMs on the training set
  – SSM-based MCE-trained classifier vs. the conventional MCE-trained one
  – training samples are pushed away from the decision boundaries
  – the bigger the SSM, the better the generalization

Conclusion and Discussions
• SSM-based MCE training offers an implicit way of minimizing the empirical error rate and maximizing the sample separation margin simultaneously
  – verified for quadratic classifiers in this study
  – verified for piecewise linear classifiers previously (He & Huo, ICPR-2008)
• Ongoing and future work
  – SSM-based MCE training for discriminative feature extraction
  – SSM-based MCE training for more flexible classifiers based on GMMs and HMMs
  – searching for other (hopefully better) ways to combine MCE training with maximum margin training
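As a rough illustration of the QCQP mentioned under "Two Technical Issues": computing the SSM under quadratic discriminant functions means finding the point on the decision boundary g_p(y) = g_q(y), between the true class p and its strongest rival q, that is nearest to the sample x. The sketch below is not the authors' solver; it uses a plain Gaussian QDF instead of the MQDF and a general-purpose SLSQP optimizer instead of the structure-exploiting method in the paper, and all names (quad_discriminant, ssm_qdf, params) are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def quad_discriminant(y, mu, Sigma_inv, log_det_Sigma):
    """Gaussian QDF score (class-independent additive constants dropped)."""
    diff = y - mu
    return -0.5 * diff @ Sigma_inv @ diff - 0.5 * log_det_Sigma

def ssm_qdf(x, p, q, params):
    """Signed smallest distance from x to the boundary between classes p and q.

    params[k] = (mu_k, Sigma_inv_k, log_det_Sigma_k). The value is positive
    when x lies on the correct (class-p) side of the boundary.
    """
    def boundary(y):
        return quad_discriminant(y, *params[p]) - quad_discriminant(y, *params[q])

    res = minimize(fun=lambda y: np.sum((y - x) ** 2),               # squared distance to x
                   x0=np.asarray(x, dtype=float),
                   jac=lambda y: 2.0 * (y - x),
                   constraints=[{'type': 'eq', 'fun': boundary}],    # stay on the boundary
                   method='SLSQP')
    margin = float(np.linalg.norm(res.x - x))
    return margin if boundary(x) > 0 else -margin
```

The paper instead exploits the special structure (a convex objective with one quadratic equality constraint) to solve this problem efficiently, and obtains the derivative of the SSM needed for Quickprop from the same solution via sensitivity analysis; the sketch only illustrates what quantity is being computed.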