
Sample-Separation-Margin Based Minimum Classification Error Training of Pattern Classifiers with Quadratic Discriminant Functions
Yongqiang Wang (1,2), Qiang Huo (1)
(1) Microsoft Research Asia, Beijing, China
(2) The University of Hong Kong, Hong Kong, China
(qianghuo@microsoft.com)
ICASSP-2010, Dallas, Texas, U.S.A., March 14-19, 2010
Outline
• Background
• What’s our new approach
• How does it work
• Conclusions
Background of Minimum Classification Error (MCE) Formulation for Pattern Classification
• Pioneered by Amari and Tsypkin in the late 1960s
– S. Amari, “A theory of adaptive pattern classifiers,” IEEE Trans. on Electronic Computers, Vol. EC-16, No. 3, pp. 299-307, 1967.
– Y. Z. Tsypkin, Adaptation and Learning in Automatic Systems, 1971.
– Y. Z. Tsypkin, Foundations of the Theory of Learning Systems, 1973.
• Proposed originally for supervised online adaptation of a pattern classifier
– to minimize the expected risk (cost)
– via a sequential probabilistic descent (PD) algorithm
• Extended by Juang and Katagiri in the early 1990s
– B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Trans. on Signal Processing, Vol. 40, No. 12, pp. 3043-3054, 1992.
MCE Formulation by Juang and Katagiri (1)
• Define a proper discriminant function of an observation for each pattern class
• To enable a maximum discriminant decision rule for pattern classification (sketched below)
• Largely an art and application dependent
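As a sketch in standard notation (ours; the slides' original formulas are not shown in this text), with discriminant functions $g_k(x; \Lambda)$ for classes $k = 1, \dots, M$ and classifier parameters $\Lambda$, the maximum discriminant decision rule is
\[
C(x) = \arg\max_{k}\; g_k(x; \Lambda),
\]
i.e., an observation $x$ is assigned to the class whose discriminant function scores it highest.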
MCE Formulation by Juang and Katagiri (2)
• Define a misclassification measure for each observation
– to embed the decision process in the overall MCE formulation
– to characterize the degree of confidence (or margin) in making decision for this
observation
– a differentiable function of the classifier parameters
• A popular choice:
\[
d_k(x; \Lambda) = -\,g_k(x; \Lambda) + \log\!\left[\frac{1}{M-1}\sum_{j \neq k} \exp\big(\eta\, g_j(x; \Lambda)\big)\right]^{1/\eta},
\]
where $k$ is the true class of $x$, $M$ is the number of classes, and $\eta > 0$ controls how the competing classes are weighted (as $\eta \to \infty$, only the most competing class is kept); $d_k > 0$ indicates a misclassification
• Many possible ways => which one is better? => an open problem!
MCE Formulation by Juang and Katagiri (3)
• Define a loss (cost) function for each observation
– a differentiable and monotonically increasing function of the misclassification
measure
– many possibilities => sigmoid function most popular for approximating MCE
• MCE training via minimizing (both criteria sketched below)
– the empirical average loss (cost), by an appropriate optimization procedure, e.g., gradient descent (GD), Quickprop, Rprop, etc., or
– the expected loss (cost), by a sequential probabilistic descent (PD) algorithm (a.k.a. generalized probabilistic descent, GPD)
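A sketch in standard notation (ours, not the slides' original equations): a common loss is the sigmoid of the misclassification measure, and the two training criteria are its empirical average and its expectation,
\[
\ell_k(x; \Lambda) = \frac{1}{1 + \exp\big(-\alpha\, d_k(x; \Lambda) + \beta\big)}, \qquad \alpha > 0,
\]
\[
L_{\mathrm{emp}}(\Lambda) = \frac{1}{N}\sum_{n=1}^{N} \ell_{k_n}(x_n; \Lambda), \qquad
L_{\mathrm{exp}}(\Lambda) = \mathbb{E}_{x}\big[\ell_{k(x)}(x; \Lambda)\big],
\]
where $x_n$ is the $n$-th of $N$ training samples, $k_n$ its true class label, $\alpha$ the sigmoid steepness, and $\beta$ an offset; $L_{\mathrm{emp}}$ is minimized by batch optimizers such as GD, Quickprop, or Rprop, while $L_{\mathrm{exp}}$ is minimized sequentially by probabilistic descent.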
Some Remarks
• Combinations of different choices for each of the previous three steps
and optimization methods lead to various MCE training algorithms.
• The power of MCE training has been demonstrated by many research
groups for different pattern classifiers in different applications.
• How to improve the generalization capability of an MCE-trained
classifier?
One Possible Solution: SSM-based MCE Training
• Sample Separation Margin (SSM)
– Defined as the smallest distance from an observation to the classification boundary formed by the true class and its most competing class (see the sketch after this slide's bullets)
– There is a closed-form solution for piecewise linear classifiers
• Define the misclassification measure as the negative SSM
– The other parts of the formulation are the same as in “traditional” MCE
• A happy result:
– Minimized empirical error rate, and
– Improved generalization
• Correctly recognized training samples have a large margin from the decision
boundaries!
• For more info:
– T. He and Q. Huo, “A study of a new misclassification measure for minimum classification error training of prototype-based pattern classifiers,” in Proc. ICPR-2008.
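A sketch of the SSM in our notation (the original slide equations are not reproduced here): for a training sample $x$ with true class $r$ and most competing class $q$, the SSM is the distance from $x$ to the decision boundary between the two classes,
\[
\mathrm{SSM}(x) = \min_{y}\; \|x - y\| \quad \text{subject to} \quad g_r(y; \Lambda) = g_q(y; \Lambda).
\]
For a (piecewise) linear classifier with $g_r(y) - g_q(y) = w^{\top} y + b$, this is the familiar point-to-hyperplane distance $|w^{\top} x + b| / \|w\|$, which is the closed-form solution mentioned above; the SSM-based misclassification measure is then the negative SSM (up to the sign convention used for misclassified samples).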
What’s New in This Study?
• Extend SSM-based MCE training to pattern classifiers with a quadratic discriminant function (QDF)
– No closed-form solution to calculate the SSM
• Demonstrate its effectiveness on a large-scale Chinese handwriting recognition task
– The modified QDF (MQDF) is widely used in state-of-the-art Chinese handwriting recognition systems
Two Technical Issues
• How to calculate the SSM efficiently? (problem sketched after this slide's bullets)
– Formulated as a nonlinear programming problem
– Can be solved efficiently because it is a quadratically constrained quadratic programming (QCQP) problem with a very special structure:
• A convex objective function with one quadratic equality constraint
• How to calculate the derivative of the SSM?
– Using a technique known as sensitivity analysis in nonlinear programming
– Calculated by using the optimal solution of the SSM optimization problem (sketched below)
• Please refer to our paper for details
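A sketch of the two steps in our notation (the slides' original equation is not reproduced here). With QDFs for the true class $r$ and the most competing class $q$, the SSM computation becomes
\[
\mathrm{SSM}(x) = \min_{y}\; \|x - y\| \quad \text{subject to} \quad g_r(y; \Lambda) - g_q(y; \Lambda) = 0,
\]
where the (squared) objective is convex quadratic and the single constraint is a quadratic equality because each QDF is quadratic in $y$; this is the special QCQP structure that allows an efficient solution. Given the optimal boundary point $y^{\ast}$ and its Lagrange multiplier, sensitivity analysis (in effect, the envelope theorem applied to the Lagrangian) yields the derivative of the optimal value, and hence of the SSM, with respect to the classifier parameters $\Lambda$, which is what gradient-based MCE optimization needs.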
Experimental Setup
• Vocabulary:
– 6763 simplified Chinese characters
• Dataset:
– Training: 9,447,328 character samples
• # of samples per class: 952 – 5,600
– Testing: 614,369 character samples
• [Pie chart: distribution of writing styles in testing data, Regular vs. Cursive, 30% / 70% split]
• Feature extraction:
– 512 “8-directional features”
– Use LDA to reduce dimension to 128
• Use MQDF for each character class (standard MQDF form sketched below)
– # of retained eigenvectors K: 5 and 10
• SSM-based MCE Training
– Use maximum likelihood (ML) trained model as seed model
– Update mean vectors only in MCE training
– Optimize MCE objective function by batch-mode Quickprop (20 epochs)
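For reference, the standard MQDF form (due to Kimura et al.; notation ours and assumed to match the paper's setup) for character class $k$ is
\[
g_k(x) = \sum_{i=1}^{K} \frac{\big[\phi_{ki}^{\top}(x - \mu_k)\big]^{2}}{\lambda_{ki}}
 + \frac{1}{\delta_k}\Big( \|x - \mu_k\|^{2} - \sum_{i=1}^{K} \big[\phi_{ki}^{\top}(x - \mu_k)\big]^{2} \Big)
 + \sum_{i=1}^{K} \log \lambda_{ki} + (D - K)\log \delta_k ,
\]
where $\mu_k$ is the class mean, $\lambda_{ki}$ and $\phi_{ki}$ are the $i$-th largest eigenvalue and the corresponding eigenvector of the class covariance matrix, $\delta_k$ replaces the discarded minor eigenvalues, $D = 128$ is the feature dimension, and $K \in \{5, 10\}$ is the number of retained eigenvectors; classification picks the class with the smallest $g_k(x)$ (equivalently, $-g_k$ serves as the discriminant function), and the MCE training here updates only the mean vectors $\mu_k$.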
Experimental Results (1)
• MQDF, K=5
[Bar chart: error rates (%) of the ML, MCE, and SSM-MCE trained classifiers on Regular and Cursive test data; values given in the table below]
          Regular (error in %)   Cursive (error in %)
ML                1.73                  8.34
MCE               1.29                  7.34
SSM-MCE           1.19                  7.00
Experimental Results (2)
• MQDF, K=10
[Bar chart: error rates (%) of the ML, MCE, and SSM-MCE trained classifiers on Regular and Cursive test data; values given in the table below]
          Regular (error in %)   Cursive (error in %)
ML                1.39                  7.03
MCE               1.30                  6.54
SSM-MCE           1.07                  6.29
Experimental Results (3)
• Histogram of SSMs on training set
– SSM-based MCE-trained classifier vs. conventional MCE-trained one
– Training samples are pushed away from decision boundaries
– The bigger the SSM, the better the generalization
Conclusion and Discussions
• SSM-based MCE training offers an implicit way of minimizing the empirical error rate and maximizing the sample separation margin simultaneously
– Verified for quadratic classifiers in this study
– Verified for piecewise linear classifiers previously (He&Huo, ICPR-2008)
• Ongoing and future work
– SSM-based MCE training for discriminative feature extraction
– SSM-based MCE training for more flexible classifiers based on GMM and HMM
– Searching for other (hopefully better) methods to combine MCE training and
maximum margin training