Methods

We briefly describe here all the classification methods that we have used in this study. They
include classical linear discriminant analysis in various forms, nearest neighbor classifiers,
logistic regression, classification trees (CART) and aggregated tree classifiers such as random
forests, as well as machine learning approaches such as the support vector machine. We selected these
particular algorithms for several reasons. First, they are popular with data analysts, machine
learning researchers, and statisticians. Second, they typically are competitive off the shelf (they
usually perform relatively well with no parameter tuning). Third, they all can produce accurate
class-probability estimates. But there are factors other than accuracy which contribute to the
merits of a given classifier. These include simplicity and insight gained into the predictive
structure of the data. Linear discriminant methods are easy to implement and usually have low
error rates, but often ignore interactions between variables. Nearest-neighbor classifiers are
simple, intuitive, and also have low error rates compared to more sophisticated classifiers. While
they are able to incorporate interactions between variables, they do so in a "black-box" way and
give very little insight into the structure of the data. In contrast, logistic regression and
classification trees are capable of exploiting and revealing interactions between variables.
Logistic regression and induction trees are easy to interpret and yield information on the
relationship between predictor variables and responses by performing stepwise variable selection.
However, classification trees tend to be unstable and lacking in accuracy. By "stable"
classification we mean that, on average, the classification accuracy is not substantially affected by
small perturbations to the training data. The accuracy of classification trees can be greatly
improved by aggregation (bagging or boosting). As more data become available, one can expect
to observe an improvement in the performance of aggregated classifiers relative to simpler
classifiers, as trees should be able to correctly identify interactions.
1 Classification methods used in our study
A brief introduction to each method is given below.
Fisher's linear discriminant analysis (FLDA)
FLDA is a classification method that projects high-dimensional data onto a line and performs
classification in this one-dimensional space. The projection maximizes the distance between the
means of the two classes while minimizing the variance within each class. Maximizing this
criterion yields a closed-form solution that involves the inverse of a covariance-like matrix. FLDA
assumes (1) a normal (Gaussian) distribution of the observations and (2) equal class covariance
matrices. Diagonal linear and quadratic discriminant analysis (DLDA, DQDA) are related
discriminant rules that assume particular (diagonal) forms for the class covariance matrices.
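As a concrete illustration (not part of the original study), the short Python sketch below fits a pooled-covariance linear discriminant and a diagonal-covariance rule with the scikit-learn library; the simulated data and all settings are placeholders, and Gaussian naive Bayes is used as a stand-in for DQDA because it fits class-specific diagonal covariance matrices.

# Illustrative sketch only: discriminant analysis on simulated placeholder data.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB  # Gaussian naive Bayes acts as DQDA (diagonal covariances)

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
flda = LinearDiscriminantAnalysis().fit(X, y)   # pooled (common) covariance matrix, as in FLDA
dqda = GaussianNB().fit(X, y)                   # class-specific diagonal covariances
print(flda.predict(X[:3]), dqda.predict_proba(X[:3]))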
Logistic regression (LOGISTIC)
Logistic regression is a supervised method for two- or multi-class classification problems
(Hosmer and Lemeshow, 1989). Though a different model is used, it can be shown that logistic
discrimination and Fisher discrimination yield the same form of linear rule when the predictors are
sampled from multivariate normal distributions with a common covariance matrix.
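A minimal sketch, assuming the scikit-learn library and simulated placeholder data (neither taken from the study), is as follows; the solver settings are arbitrary.

# Illustrative sketch only: logistic regression and its class-probability estimates.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
logit = LogisticRegression(max_iter=1000).fit(X, y)   # two-class logistic discrimination
print(logit.predict_proba(X[:3]))                     # estimated class probabilities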
k nearest neighbor (kNN)
kNN is an intuitive method that classifies unlabeled examples based on their similarity to
examples in the training set. For each new case, it finds the k closest examples in the
training set and assigns the class that appears most frequently among these k neighbors.
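A brief illustration under the same assumptions as above (scikit-learn, placeholder data); k = 3 is an arbitrary choice and not a value used in the study.

# Illustrative sketch only: k-nearest-neighbor classification by majority vote.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # stores the training examples
print(knn.predict(X[:3]))   # majority vote among the 3 closest training examples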
CART and aggregating classifiers (random forests)
CART is a form of tree-based classification obtained through binary recursive
partitioning. In this study we consider CART-based classification and random forests, an
aggregation of CART trees. In random forests, proposed by Breiman (2001), successive
trees are constructed independently by CART, each using a bootstrap sample of the data set. In the end,
a simple majority vote is taken for prediction. Random forests add an additional layer of
randomness. In addition to constructing each tree using a different bootstrap sample of the data,
random forests change how the classification trees are constructed. In standard trees, each node is
split using the best split among all variables. In a random forest, each node is split using the best
among a subset of variables randomly chosen at that node. This somewhat counterintuitive
strategy turns out to perform very well compared to many other classifiers, including discriminant
analysis and support vector machines, and is robust against overfitting.
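The following Python sketch (again with the scikit-learn library and placeholder data, not the study's code) contrasts a single CART tree with a random forest; n_estimators and max_features are illustrative values only, with max_features controlling the random subset of candidate variables examined at each node.

# Illustrative sketch only: a single CART tree versus a random forest of bootstrapped trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # placeholder data
cart = DecisionTreeClassifier(random_state=0).fit(X, y)        # each node split over all variables
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)      # each node split over a random subset
print(cart.predict(X[:3]), forest.predict_proba(X[:3]))        # forest prediction averages the trees' votes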
Support vector machine (SVM-linear)
Among numerous classification methods, the support vector machine (SVM) is a popular choice
and has attracted much attention in recent years. As an important large margin classifier, SVM
was originally proposed by Vapnik et al. (Boser et al., 1992; Vapnik, 1998) using the idea of
searching for the optimal separating hyperplane with maximum margin. For further details,
please refer to the monograph by Hastie et al. (2001).
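A minimal linear-kernel SVM sketch, under the same assumptions (scikit-learn, simulated placeholder data); the regularization value C = 1.0 is arbitrary.

# Illustrative sketch only: a linear support vector machine (maximum-margin hyperplane).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
svm = SVC(kernel="linear", C=1.0).fit(X, y)   # C is an arbitrary regularization choice
print(svm.predict(X[:3]))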
2 Feature selection and performance evaluation
Feature selection
In general, feature (variable) selection is an extremely important aspect of classification problems,
since the features selected are used to build the classifier. Careful consideration should be given
to the problem of feature subset selection. This of course amounts to reducing the number of
variables used to construct a prediction rule for a given learning algorithm. There are several
reasons for performing feature reduction, the most important one here being to determine whether
pharyngeal sensitivity can be considered a good predictor.
One should also avoid including too many features. Indeed, it is known that as model complexity is
increased with more variables added to a given model, the proportion of training samples
(individuals) misclassified may decrease, but the misclassification rate of new samples
(generalization error) will eventually begin to increase; this latter effect is the result of
overfitting the model with the training data (Hastie et al., 2001; McLachlan, 1992; Theodoridis,
1999; Xing, 2002; Xiong et al., 2001).
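As an illustration of such feature reduction (a sketch only, assuming the scikit-learn library and placeholder data rather than the study's variables), a simple univariate filter can retain the highest-scoring variables; k = 5 is an arbitrary value. Note that, to avoid a biased error estimate, any such selection should be repeated inside each cross-validation fold rather than applied once to the full data set.

# Illustrative sketch only: univariate feature (variable) selection before building a rule.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=100, n_features=20, random_state=0)  # placeholder data
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)   # keep the 5 highest-scoring variables
X_reduced = selector.transform(X)
print(X_reduced.shape, selector.get_support(indices=True))    # indices of the retained variables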
Performance evaluation
One approach to estimate the error rate of the prediction rule would be to apply the rule to a
"held-out" test set randomly selected from among the training set samples. As an alternative to
the hold-out approach, cross-validation (CV) is very often used, especially when one does not
have the luxury of withholding part of a dataset as an independent test set and possibly even
another part as a validation set. Further, the repeatability of results on new data can be assessed
with this approach. In general, all CV approaches can fall under the "K-fold CV" heading. Here,
the training set of samples is divided into K non-overlapping subsets of (roughly) the same size.
One of the K subsets is "held-out" for testing, the prediction rule is trained on the remaining K - 1
subsets, and an estimate of the error rate can then be obtained by applying each stage's
prediction rule to its corresponding test set. This process repeats K times, such that each subset is
treated once as the test set, and the average of the resulting K error rate estimates forms the
K-fold CV error rate. The whole K-fold CV process can be repeated multiple times, using
different partitions of the data each run and averaging the results, to obtain more reliable
estimates. At the expense of increased computational cost, repeated (e.g., 10-run) CV has been
recommended as the procedure of choice for assessing the predictive accuracy of classification rules
(Braga-Neto and Dougherty, 2004; Kohavi, 1995).
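A short Python sketch of repeated K-fold CV, assuming the scikit-learn library, placeholder data, and an arbitrary base classifier (none of which are taken from the study); 10 repeats of 10-fold CV mirror the repeated 10-run scheme mentioned above.

# Illustrative sketch only: repeated stratified 10-fold cross-validation of a classification rule.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=10, random_state=0)  # placeholder data
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)   # 10 runs of 10-fold CV
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)  # accuracy per fold and run
print(1.0 - scores.mean())   # averaged K-fold CV error rate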
The K-fold CV error is nearly unbiased, but it can be highly variable. This proved to be the case
in the present context. Suitably defined bootstrap procedures can reduce the variability of the
K-fold CV error in addition to providing a direct assessment of variability for estimated parameters
in the prediction rule. As discussed by Efron and Tibshirani, a bootstrap smoothing of leave-one-out
cross-validation is given by the leave-one-out bootstrap error B1, which predicts the error at a
point x_j only from bootstrap samples that do not contain x_j. Because each bootstrap sample
contains, on average, only a proportion 0.632 of the distinct original data points, B1 is upwardly
biased; Efron therefore proposed the 0.632 estimator, a weighted combination of B1 and the
resubstitution error, which is the one we have used for the present study.
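The sketch below illustrates the leave-one-out bootstrap error B1 and the 0.632 combination (0.368 times the resubstitution error plus 0.632 times B1); the scikit-learn library, the placeholder data, the kNN base rule, and B = 200 bootstrap replicates are all illustrative assumptions, not details of the study.

# Illustrative sketch only: leave-one-out bootstrap error B1 and Efron's 0.632 estimator.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=10, random_state=0)  # placeholder data
clf = KNeighborsClassifier(n_neighbors=3)                                # arbitrary base rule
n, B, rng = len(y), 200, np.random.default_rng(0)

errors = np.zeros(n)   # error counts at each point x_j, from rules trained without x_j
counts = np.zeros(n)   # number of bootstrap samples leaving each point out
for _ in range(B):
    idx = rng.integers(0, n, n)                    # bootstrap sample drawn with replacement
    out = np.setdiff1d(np.arange(n), idx)          # points not contained in this sample
    if out.size == 0:
        continue
    pred = clf.fit(X[idx], y[idx]).predict(X[out])
    errors[out] += (pred != y[out])
    counts[out] += 1

b1 = np.mean(errors[counts > 0] / counts[counts > 0])   # leave-one-out bootstrap error B1
resub = np.mean(clf.fit(X, y).predict(X) != y)          # resubstitution (training) error
print(0.368 * resub + 0.632 * b1)                       # the 0.632 estimator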