Methods

We briefly describe here all the classification methods that we have used in this study. They
include classical linear discriminant analysis in various forms, nearest neighbor classifiers,
logistic regression, classification trees (CART) and aggregated tree classifiers such as random
forests, as well as machine learning approaches such as the support vector machine. We selected these
particular algorithms for several reasons. First, they are popular with data analysts, machine
learning researchers, and statisticians. Second, they typically are competitive off the shelf (they
usually perform relatively well with no parameter tuning). Third, they all can produce accurate
class-probability estimates. But there are factors other than accuracy which contribute to the
merits of a given classifier. These include simplicity and insight gained into the predictive
structure of the data. Linear discriminant methods are easy to implement and usually have low
error rates, but often ignore interactions between variables. Nearest-neighbor classifiers are
simple, intuitive, and also have low error rates compared to more sophisticated classifiers. While
they are able to incorporate interactions between variables, they do so in a "black-box" way and
give very little insight into the structure of the data. In contrast, logistic regression and
classification trees are capable of exploiting and revealing interactions between variables.
Logistic regression and induction trees are easy to interpret and yield information on the
relationship between predictor variables and responses by performing stepwise variable selection.
However, classification trees tend to be unstable and lacking in accuracy. By "stable"
classification we mean that, on average, the classification accuracy is not substantially affected by
small perturbations to the training data. The accuracy of classification trees can be greatly
improved by aggregation (bagging or boosting). As more data become available, one can expect
to observe an improvement in the performance of aggregated classifiers relative to simpler
classifiers, as trees should be able to correctly identify interactions.
1 Classification methods used in our study
A brief introduction to each method is given below.
Fisher's linear discriminant analysis (FLDA)
FLDA is a classification method that projects high-dimensional data onto a line and performs
classification in this one-dimensional space. The projection maximizes the distance between the
means of the two classes while minimizing the variance within each class. Maximizing this
criterion yields a closed-form solution that involves the inverse of a covariance-like matrix. FLDA
assumes (1) a normal (Gaussian) distribution of the observations and (2) equal class covariance
matrices. Diagonal linear and quadratic discriminant analysis (DLDA, DQDA) are related
discriminant rules that assume particular (diagonal) forms for the class covariance matrices.
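As a concrete illustration (not part of the original study), the short Python sketch below fits a pooled-covariance linear discriminant and a diagonal-covariance rule with the scikit-learn library; the simulated data and all settings are placeholders, and Gaussian naive Bayes is used as a stand-in for DQDA because it fits class-specific diagonal covariance matrices.

# Illustrative sketch only: discriminant analysis on simulated placeholder data.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB  # Gaussian naive Bayes acts as DQDA (diagonal covariances)

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
flda = LinearDiscriminantAnalysis().fit(X, y)   # pooled (common) covariance matrix, as in FLDA
dqda = GaussianNB().fit(X, y)                   # class-specific diagonal covariances
print(flda.predict(X[:3]), dqda.predict_proba(X[:3]))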
Logistic regression (LOGISTIC)
Logistic regression is a supervised method for two- or multi-class classification problems
(Hosmer and Lemeshow, 1989). Though a different model is used, it can be shown that logistic
discrimination and Fisher discrimination yield the same form of linear rule when the predictors are
sampled from multivariate normal distributions with a common covariance matrix.
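A minimal sketch, assuming the scikit-learn library and simulated placeholder data (neither taken from the study), is as follows; the solver settings are arbitrary.

# Illustrative sketch only: logistic regression and its class-probability estimates.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
logit = LogisticRegression(max_iter=1000).fit(X, y)   # two-class logistic discrimination
print(logit.predict_proba(X[:3]))                     # estimated class probabilities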
k nearest neighbor (kNN)
kNN is an intuitive method that classifies unlabeled examples based on their similarity to
examples in the training set. For each new case, it finds the k closest examples in the
training set and assigns the class that appears most frequently among these k neighbors.
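A brief illustration under the same assumptions as above (scikit-learn, placeholder data); k = 3 is an arbitrary choice and not a value used in the study.

# Illustrative sketch only: k-nearest-neighbor classification by majority vote.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # stores the training examples
print(knn.predict(X[:3]))   # majority vote among the 3 closest training examples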
CART and aggregating classifiers (random forests)
CART is a form of tree-based classification obtained through binary recursive
partitioning. In this study we consider CART-based classification and random forests, an
aggregation of CART trees. In random forests, proposed by Breiman (2001), successive
trees are constructed independently by CART, each using a bootstrap sample of the data set. In the end,
a simple majority vote is taken for prediction. Random forests add an additional layer of
randomness. In addition to constructing each tree using a different bootstrap sample of the data,
random forests change how the classification trees are constructed. In standard trees, each node is
split using the best split among all variables. In a random forest, each node is split using the best
among a subset of variables randomly chosen at that node. This somewhat counterintuitive
strategy turns out to perform very well compared to many other classifiers, including discriminant
analysis and support vector machines, and is robust against overfitting.
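The following Python sketch (again with the scikit-learn library and placeholder data, not the study's code) contrasts a single CART tree with a random forest; n_estimators and max_features are illustrative values only, with max_features controlling the random subset of candidate variables examined at each node.

# Illustrative sketch only: a single CART tree versus a random forest of bootstrapped trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # placeholder data
cart = DecisionTreeClassifier(random_state=0).fit(X, y)        # each node split over all variables
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)      # each node split over a random subset
print(cart.predict(X[:3]), forest.predict_proba(X[:3]))        # forest prediction averages the trees' votes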
Support vector machine (SVM-linear)
Among numerous classification methods, the support vector machine (SVM) is a popular choice
and has attracted much attention in recent years. As an important large margin classifier, SVM
was originally proposed by Vapnik et al. (Boser et al., 1992; Vapnik, 1998) using the idea of
searching for the optimal separating hyperplane with maximum margin. For further details,
please refer to the monograph by Hastie et al. (2001).
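A minimal linear-kernel SVM sketch, under the same assumptions (scikit-learn, simulated placeholder data); the regularization value C = 1.0 is arbitrary.

# Illustrative sketch only: a linear support vector machine (maximum-margin hyperplane).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
svm = SVC(kernel="linear", C=1.0).fit(X, y)   # C is an arbitrary regularization choice
print(svm.predict(X[:3]))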
2 Feature selection and performance evaluation
Feature selection
In general, feature (variable) selection is an extremely important aspect of classification problems,
since the features selected are used to build the classifier. Careful consideration should be given
to the problem of feature subset selection. This of course amounts to reducing the number of
variables used to construct a prediction rule for a given learning algorithm. There are several
reasons for performing feature reduction, the most important one here being to determine whether
pharyngeal sensitivity can be considered a good predictor.
One should also avoid including too many features. Indeed, it is known that as model complexity is
increased with more variables added to a given model, the proportion of training samples
(individuals) misclassified may decrease, but the misclassification rate of new samples
(generalization error) will eventually begin to increase; this latter effect is the result of
overfitting the model with the training data (Hastie et al., 2001; McLachlan, 1992; Theodoridis,
1999; Xing, 2002; Xiong et al., 2001).
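As an illustration of such feature reduction (a sketch only, assuming the scikit-learn library and placeholder data rather than the study's variables), a simple univariate filter can retain the highest-scoring variables; k = 5 is an arbitrary value. Note that, to avoid a biased error estimate, any such selection should be repeated inside each cross-validation fold rather than applied once to the full data set.

# Illustrative sketch only: univariate feature (variable) selection before building a rule.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=100, n_features=20, random_state=0)  # placeholder data
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)   # keep the 5 highest-scoring variables
X_reduced = selector.transform(X)
print(X_reduced.shape, selector.get_support(indices=True))    # indices of the retained variables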
Performance evaluation
One approach to estimate the error rate of the prediction rule would be to apply the rule to a
"held-out" test set randomly selected from among the training set samples. As an alternative to
the hold-out approach, cross-validation (CV) is very often used, especially when one does not
have the luxury of withholding part of a dataset as an independent test set and possibly even
another part as a validation set. Further, the repeatability of results on new data can be assessed
with this approach. In general, all CV approaches can fall under the "K-fold CV" heading. Here,
the training set of samples is divided into K non-overlapping subsets of (roughly) the same size.
One of the K subsets is "held-out" for testing, the prediction rule is trained on the remaining K - 1
subsets, and an estimate of the error rate can then be obtained by applying each stage's
prediction rule to its corresponding test set. This process repeats K times, such that each subset is
treated once as the test set, and the average of the resulting K error rate estimates forms the
K-fold CV error rate. The whole K-fold CV process can be repeated multiple times, using
different partitions of the data each run and averaging the results, to obtain more reliable
estimates. At the expense of increased computational cost, repeated (e.g., 10-run) CV has been
recommended as the procedure of choice for assessing the predictive accuracy of classification rules
(Braga-Neto and Dougherty, 2004; Kohavi, 1995).
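A short Python sketch of repeated K-fold CV, assuming the scikit-learn library, placeholder data, and an arbitrary base classifier (none of which are taken from the study); 10 repeats of 10-fold CV mirror the repeated 10-run scheme mentioned above.

# Illustrative sketch only: repeated stratified 10-fold cross-validation of a classification rule.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)   # 10 runs of 10-fold CV
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)  # accuracy per fold and run
print(1.0 - scores.mean())   # averaged K-fold CV error rate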
The K-fold CV error is nearly unbiased, but it can be highly variable. This proved to be the case
in the present context. Suitably defined bootstrap procedures can reduce the variability of the
K-fold CV error in addition to providing a direct assessment of variability for estimated parameters
in the prediction rule. As discussed by Efron and Tibshirani, a bootstrap smoothing of leave-one-out
cross-validation is given by the leave-one-out bootstrap error B1, which predicts the error at a
point x_j only from bootstrap samples that do not contain x_j. Because each bootstrap sample
contains, on average, only a proportion 0.632 of the distinct original data points, B1 is upwardly
biased; Efron therefore proposed the 0.632 estimator, a weighted combination of B1 and the
resubstitution error, which is the one we have used for the present study.
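The sketch below illustrates the leave-one-out bootstrap error B1 and the 0.632 combination (0.368 times the resubstitution error plus 0.632 times B1); the scikit-learn library, the placeholder data, the kNN base rule, and B = 200 bootstrap replicates are all illustrative assumptions, not details of the study.

# Illustrative sketch only: leave-one-out bootstrap error B1 and Efron's 0.632 estimator.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=10, random_state=0)  # placeholder data
clf = KNeighborsClassifier(n_neighbors=3)                                # arbitrary base rule
n, B, rng = len(y), 200, np.random.default_rng(0)

errors = np.zeros(n)   # error counts at each point x_j, from rules trained without x_j
counts = np.zeros(n)   # number of bootstrap samples leaving each point out
for _ in range(B):
    idx = rng.integers(0, n, n)                    # bootstrap sample drawn with replacement
    out = np.setdiff1d(np.arange(n), idx)          # points not contained in this sample
    if out.size == 0:
        continue
    pred = clf.fit(X[idx], y[idx]).predict(X[out])
    errors[out] += (pred != y[out])
    counts[out] += 1

b1 = np.mean(errors[counts > 0] / counts[counts > 0])   # leave-one-out bootstrap error B1
resub = np.mean(clf.fit(X, y).predict(X) != y)          # resubstitution (training) error
print(0.368 * resub + 0.632 * b1)                       # the 0.632 estimator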