Presenter: Yanlin Wu
Advisor: Professor Geman
Date: 10/17/2006

Is cross-validation valid for small-sample classification?
Ulisses M. Braga-Neto and Edward R. Dougherty

Background
- What is the classification problem?
- How do we evaluate the accuracy of a classifier, i.e. measure the error of a classification model?
- Different error-measuring methods
- Things to pay attention to

Classification problem
- In statistical pattern recognition we have a feature vector $X \in \mathbb{R}^d$ and a label $Y$ that takes numerical values representing the different classes; for a two-class problem, $Y \in \{0, 1\}$.
- A classifier is a function $g: \mathbb{R}^d \to \{0, 1\}$.
- The error rate of $g$ is $\varepsilon[g] = P[g(X) \neq Y] = E(|Y - g(X)|)$.
- Bayes classifier: $g_{\mathrm{BAY}}(x) = 1$ if $P(Y = 1 \mid X = x) > 1/2$ and $g_{\mathrm{BAY}}(x) = 0$ otherwise.
- For any $g$, $\varepsilon[g_{\mathrm{BAY}}] \le \varepsilon[g]$, so $g_{\mathrm{BAY}}$ is the optimal classifier.

Training data
- The feature-label distribution $F$ is unknown, so a training data set $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is used to design a classifier.
- A classification rule is a mapping $g: \{\mathbb{R}^d \times \{0,1\}\}^n \times \mathbb{R}^d \to \{0,1\}$; it maps the training data $S_n$ into the designed classifier $g(S_n, \cdot)$.
- The true error of a designed classifier is its error rate given the fixed training data set: $\varepsilon_n[g(S_n, \cdot)] = E_F(|Y - g(S_n, X)|)$, where $E_F$ denotes expectation with respect to $F$.

Training data (continued)
- The expected error rate over the data is $E[\varepsilon_n] = E_{F_n}\!\left[E_F(|Y - g(S_n, X)|)\right]$, where $F_n$ is the joint distribution of the data $S_n$. It is also called the unconditional error of the classification rule.
- Question: how can we measure the true error of a model when we do not have access to the universe of observations to test it on, i.e. we do not know $F$?
- Answer: error estimation methods have been developed.

Error estimation techniques
- For these methods, the final model is built on all n available observations, and the same observations are then used (in different ways) to estimate its error.
- Types: resubstitution, holdout method, cross-validation, bootstrap.

Resubstitution
- Reuse the same training sample to measure the error:
$$\hat{\varepsilon}_{\mathrm{resub}} = \frac{1}{n} \sum_{i=1}^{n} |y_i - g(S_n, x_i)|$$
- This estimate tends to be biased low and can be made arbitrarily close to zero by overfitting the model, since the same data are reused to measure the error.

Holdout method
- For large samples, randomly choose a subset $S_t \subset S_n$ of size $n_t$ as test data, design the classifier on $S_n \setminus S_t$, and estimate its error by applying it to $S_t$.
- This is an unbiased estimator of $E[\varepsilon_{n - n_t}]$, with respect to expectation over $S_n$.

Holdout method comments
- The estimate can be slightly biased high because not all n observations are used to build the classifier; this bias tends to decrease as n increases.
- The choice of what percentage of the n observations goes to the test set is important, and it is also affected by n.
- The holdout method can be run multiple times, with the accuracy estimates from all runs averaged.
- Impractical with small samples.

Cross-validation
- Algorithm: split the data into k mutually exclusive subsets (folds) $S^{(i)}$, then build the model on k-1 of them and measure the error on the remaining one. Each subset acts as the test set once, and the error estimate is the average of these k error measures:
$$\hat{\varepsilon}_{\mathrm{cv}k} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n/k} \left|y_j^{(i)} - g\!\left(S_n \setminus S^{(i)}, x_j^{(i)}\right)\right|$$
- When k = n, this is called "leave-one-out cross-validation".
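To make the estimator concrete, here is a minimal NumPy sketch of the k-fold cross-validation error estimate defined above (an illustration, not code from the paper); the `train_fn` interface and the function name are hypothetical, and any classifier with this design-then-predict shape could be plugged in.

```python
import numpy as np

def cv_error(X, y, train_fn, k=5, rng=None):
    """k-fold cross-validation estimate of the classification error.

    train_fn(X_train, y_train) must return a designed classifier, i.e. a
    function mapping an array of feature vectors to predicted labels.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)                 # k mutually exclusive subsets
    errors = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        g = train_fn(X[train_idx], y[train_idx])   # design on the other k-1 folds
        errors += np.sum(g(X[test_idx]) != y[test_idx])
    return errors / n                              # average over all held-out points

# Leave-one-out cross-validation is the special case k = n:
# loo_estimate = cv_error(X, y, train_fn, k=len(y))
```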
Cross-validation (continued)
- Stratified cross-validation: the classes are represented in each fold in the same proportion as in the original data; there is evidence that this improves the estimator.
- The k-fold cross-validation estimator is unbiased as an estimator of $E[\varepsilon_{n - n/k}]$.
- The leave-one-out estimator is nearly unbiased as an estimator of $E[\varepsilon_n]$.

Cross-validation comments
- May be biased high, for the same reason as the holdout method.
- Often used when n is small, where the holdout method would be even more biased.
- Very computationally intensive, especially for large k and n.

Bootstrap method
- Based on the notion of an "empirical distribution" $F^*$, which puts mass 1/n on each of the n data points.
- A bootstrap sample $S_n^*$ from $F^*$ consists of n equally likely draws with replacement from the original data $S_n$.
- The probability that any given data point will not appear in $S_n^*$ is $(1 - 1/n)^n \approx e^{-1}$ when $n \gg 1$.
- A bootstrap sample of size n therefore contains on average $(1 - e^{-1})\,n \approx 0.632\,n$ of the original data points.

Bootstrap method (continued)
- Bootstrap zero estimator: $\hat{\varepsilon}_0 = E_{F^*}\!\left(|Y - g(S_n^*, X)| : (X, Y) \in S_n \setminus S_n^*\right)$.
- In practice, $E_{F^*}$ has to be approximated by a sample mean based on independent replicates $S_n^{*b}$, $b = 1, \ldots, B$, where B is recommended to be between 25 and 200:
$$\hat{\varepsilon}_0 = \frac{\sum_{b=1}^{B} \sum_{i=1}^{n} \left|y_i - g(S_n^{*b}, x_i)\right| I_{P_i^{*b} = 0}}{\sum_{b=1}^{B} \sum_{i=1}^{n} I_{P_i^{*b} = 0}}$$
where $P_i^{*b}$ is the actual proportion of times the data point $(x_i, y_i)$ appears in $S_n^{*b}$.

Bootstrap method (continued)
- The bootstrap zero estimator tends to be a high-biased estimator of $E[\varepsilon_n]$.
- The 0.632 bootstrap estimator tries to correct this bias: $\hat{\varepsilon}_{b632} = (1 - 0.632)\,\hat{\varepsilon}_{\mathrm{resub}} + 0.632\,\hat{\varepsilon}_0$.
- Bias-corrected bootstrap estimator:
$$\hat{\varepsilon}_{bbc} = \hat{\varepsilon}_{\mathrm{resub}} + \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n} \sum_{i=1}^{n} \left(\frac{1}{n} - P_i^{*b}\right) \left|y_i - g(S_n^{*b}, x_i)\right|$$

Bootstrap comments
- Computationally intensive; the choice of B is important.
- Tends to be slightly more accurate than cross-validation in some situations, but tends to have greater variance than resubstitution.
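For concreteness, a minimal NumPy sketch of the resubstitution and 0.632 bootstrap estimators described above (an illustration, not the authors' code); it reuses the hypothetical `train_fn` interface from the cross-validation sketch.

```python
import numpy as np

def resub_error(X, y, train_fn):
    """Resubstitution estimate: design and test on the same sample."""
    g = train_fn(X, y)
    return np.mean(g(X) != y)

def b632_error(X, y, train_fn, B=100, rng=None):
    """0.632 bootstrap estimate: 0.368 * resubstitution + 0.632 * bootstrap zero."""
    rng = np.random.default_rng(rng)
    n = len(y)
    err_sum, n_left_out = 0.0, 0
    for _ in range(B):
        boot = rng.integers(0, n, size=n)            # n draws with replacement
        out = np.setdiff1d(np.arange(n), boot)       # points not in the bootstrap sample
        if len(out) == 0:
            continue
        g = train_fn(X[boot], y[boot])
        err_sum += np.sum(g(X[out]) != y[out])
        n_left_out += len(out)
    e0 = err_sum / n_left_out                        # bootstrap zero estimator
    return 0.368 * resub_error(X, y, train_fn) + 0.632 * e0
```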
Classification procedure
- Assess gene expression with microarrays.
- Determine genes whose expression levels can be used as classifier variables.
- Apply a rule to design the classifier from the sample data.
- Apply an error estimation procedure.

Error estimation challenges
- What if the number of training samples is remarkably small? Error estimation is greatly affected by small samples.
- A dilemma: unbiasedness or small variance? Small variance is preferred: an unbiased estimator with large variance is of little use.

Error estimators under small samples
- Holdout: impractical with small samples.
- Resubstitution: always low-biased.
- Cross-validation: has higher variance than resubstitution or the bootstrap. This variance problem makes its use questionable for very small samples.

Variability affecting error estimation
- There is the internal variance $\mathrm{Var}[\hat{\varepsilon} \mid S_n]$ and the variability due to the random training sample; the latter is much larger than the internal variance.
- Error-counting estimates, such as resubstitution and cross-validation, can only change in increments of 1/n.

Variability (continued)
- In cross-validation, the test samples are not independent, which adds variance to the estimate.
- Surrogate problem: the originally designed classifier is assessed in terms of surrogate classifiers, designed by applying the classification rule to reduced data. If these surrogate classifiers differ too much from the original classifier too often, the estimate may be far from the true error rate.

Experimental setup
- Classification rules: linear discriminant analysis (LDA), 3-nearest-neighbor (3NN), decision trees (CART).
- Error estimators: resubstitution (resub); cross-validation: leave-one-out (loo), 5-fold (cv5), 10-fold (cv10) and repeated 10-fold (cv10r); bootstrap: 0.632 bootstrap (b632) and the bias-corrected bootstrap (bbc).

Study terms of error estimators
- Study the performance of an error estimator $\hat{\varepsilon}$ via the distribution of $\varepsilon_n - \hat{\varepsilon}$, the deviation distribution of the error estimator.
- Estimator bias: $E[\varepsilon_n - \hat{\varepsilon}]$.
- Confidence we can have in estimates from actual samples: $\mathrm{Var}[\varepsilon_n - \hat{\varepsilon}]$.
- Root-mean-square (RMS) error: $\sqrt{E[(\varepsilon_n - \hat{\varepsilon})^2]}$.
- Quartiles of the deviation distribution: less affected by outliers than the mean.

Linear discriminant analysis (LDA)
- We need the class posteriors $\Pr(G \mid X)$ for optimal classification.
- Suppose $f_k(x)$ is the class-conditional density of X in class $G = k$, and let $\pi_k$ be the prior probability of class k, with $\sum_{k=1}^{K} \pi_k = 1$. A simple application of Bayes' theorem gives
$$\Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$$
- Suppose we model each class density as multivariate Gaussian:
$$f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$$

LDA (continued)
- Assume the classes share a common covariance matrix, $\Sigma_k = \Sigma$ for all k; this gives LDA.
- In comparing two classes k and l, it is sufficient to look at the log-ratio
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \log \frac{f_k(x)}{f_l(x)} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l)$$
- The linear discriminant function $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$ is an equivalent description of the decision rule, with $G(x) = \arg\max_k \delta_k(x)$.

LDA (continued)
- Estimate the parameters of the Gaussian distributions:
  1. $\hat{\pi}_k = N_k / N$, where $N_k$ is the number of class-k observations.
  2. $\hat{\mu}_k = \sum_{g_i = k} x_i / N_k$.
  3. $\hat{\Sigma} = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T / (N - K)$.
- Figure 1: three Gaussian distributions with the same covariance and different means; contours of constant density enclosing 95% of the probability of each class, with the Bayes decision boundaries.
- Figure 2: 30 samples drawn from each Gaussian distribution, with the fitted LDA decision boundaries.

kNN: nearest-neighbor methods
- Nearest-neighbor methods use the observations in the training set closest in input space to x to form $\hat{Y}$. Specifically, the k-nearest-neighbor fit for $\hat{Y}$ is
$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample.
- Closeness implies a metric, which for the moment we assume is Euclidean distance (other distances can also be defined).
- In words: find the k observations with $x_i$ closest to x in input space, and average their responses.
- Figures 1-3: 15-nearest-neighbor, 1-nearest-neighbor and 7-nearest-neighbor classifiers. Figure 4: misclassification curves (training size = 20, test size = 10000).

Decision tree (CART)
- Decide how to split (conditional Gini index or conditional entropy).
- Decide when to stop splitting.
- Decide how to prune the tree: using the training sample, pessimistic pruning, minimal error pruning, error-based pruning or cost-complexity pruning; using a separate pruning sample, reduced-error pruning.

Simulation (synthetic data)
- Six sample sizes: 20 to 120 in increments of 20.
- Total experimental conditions: 3 x 6 x 6 = 108.
- For each experimental condition and sample size, compute the empirical deviation distribution using 1000 replications with different sample data drawn from an underlying model.
- True error: computed exactly for LDA; by Monte Carlo computation for 3NN and CART.
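As a rough illustration of how one such replication might be computed (not the authors' setup), the sketch below draws a stratified sample from a simple mean-shifted Gaussian model, designs an LDA classifier with scikit-learn, and returns the true error together with its leave-one-out estimate. The model parameters, the use of scikit-learn, and the Monte Carlo approximation of the LDA true error (which the authors compute exactly) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def one_replication(n=20, p=2, delta=1.0, n_test=10000, rng=None):
    """One Monte Carlo replication: draw a stratified training sample from a
    two-class Gaussian model, design an LDA classifier, and return the
    (approximate) true error together with its leave-one-out estimate."""
    rng = np.random.default_rng(rng)

    def draw(m):
        y = np.repeat([0, 1], m // 2)                           # equal class sizes
        X = rng.normal(size=(len(y), p)) + delta * y[:, None]   # class-1 mean shifted by delta
        return X, y

    X, y = draw(n)
    X_test, y_test = draw(n_test)

    # True error approximated on a large independent test set
    # (the paper computes it exactly for LDA).
    g = LinearDiscriminantAnalysis().fit(X, y)
    true_err = np.mean(g.predict(X_test) != y_test)

    # Leave-one-out cross-validation estimate on the training sample.
    miss = 0
    for i in range(n):
        keep = np.arange(n) != i
        gi = LinearDiscriminantAnalysis().fit(X[keep], y[keep])
        miss += int(gi.predict(X[i:i + 1])[0] != y[i])
    return true_err, miss / n

# The empirical deviation distribution is then the distribution of
# true_err - loo_estimate over, e.g., 1000 independent replications.
```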
Simulation (synthetic data)
- Figure: empirical deviation distributions for selected simulations (synthetic data); beta fits, n = 20.

Simulation (synthetic data)
- Figure: estimator variance as a function of sample size for selected simulations.

Simulation (synthetic data)
- Cross-validation: slightly high-biased; its main drawback is high variability, and it also tends to produce large outliers.
- Resubstitution is low-biased, but shows smaller variance than cross-validation.
- The 0.632 bootstrap proved to be the best overall.
- Computational cost also needs to be considered.

Simulation (patient data)
- Microarrays from breast tumor samples from 295 patients: 115 good-prognosis, 180 poor-prognosis.
- Use log-ratio gene expression values associated with the top p = 2 and top p = 5 genes.
- In each case, 1000 observations of size n = 20 and n = 40 were drawn independently from the pool of 295 microarrays; sampling was stratified.
- True error for each observation of size n: a holdout estimate, where the 295 - n sample points not drawn are used as the test set (a good approximation given the large test sample).

Simulation (patient data)
- Figure: empirical deviation distributions for selected simulations (patient data); beta fits, n = 20.

Simulation (patient data)
- The observations are not independent, but only weakly dependent.
- The results obtained with the patient data confirm the general conclusions obtained with the synthetic data.

Conclusion
- Cross-validation error estimation is much less biased than resubstitution, but has excessive variance.
- Bootstrap methods provide improved performance with respect to variance, but at a high computational cost and often with increased bias (though much less than resubstitution).

My own opinion
- Since the underlying distribution is unknown, the true error can only be assessed through the training sample. If the number of training samples is very small, or the sampling used to obtain them is not carried out correctly, the training sample cannot represent the population; in that case, the classifiers and error estimates based on such a small sample cannot provide useful information about the underlying classification problem.

Outlier sums statistic method
Robert Tibshirani, Trevor Hastie

Background
- What is an outlier?
- Common methods to detect outliers.
- Outliers in cancer gene studies.
- The t-statistic in outlier studies.
- COPA (Cancer Outlier Profile Analysis).

What is an outlier?
- Definition: an outlier is an unusual value in a dataset, one that does not fit the typical pattern of the data.
- Sources of outliers: recording or measurement errors; natural variation of the data (valid data).

Outlier analysis
- Issues: if an outlier is a true error and not dealt with, results can be severely biased; if an outlier is valid data and is removed, valuable information about important patterns in the data is lost.
- Objective: identify outliers, then decide how to deal with them.

Outlier detection
- Visual inspection of the data: not applicable for large, complex datasets.
- Automated methods: normal distribution-based method, median absolute deviation, distance-based method, and others.

Normal distribution-based method
- Works on one variable at a time, $X_k$, k = 1, ..., p, and assumes a normal distribution for each variable.
- Algorithm: let $x_{ik}$ be the i-th observation's value for variable $X_k$ (i = 1, ..., n). Compute the sample mean $\bar{X}_k = \frac{1}{n}\sum_i x_{ik}$, the sample standard deviation $s_k = \sqrt{\sum_i (x_{ik} - \bar{X}_k)^2 / (n - 1)}$, and then $z_{ik} = (x_{ik} - \bar{X}_k)/s_k$ for each i.
- Label $x_{ik}$ an outlier if $|z_{ik}| > 3$; about 0.25% of values will be labeled if the normality assumption is correct. A minimal sketch follows below.
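A minimal NumPy sketch of this z-score rule, applied one variable (column) at a time; the function name is hypothetical, and the 3-standard-deviation cutoff is the one quoted above.

```python
import numpy as np

def zscore_outliers(X, cutoff=3.0):
    """Flag entries of X (n observations x p variables) whose z-score
    exceeds the cutoff in absolute value, one variable at a time."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)      # sample standard deviation
    z = (X - mean) / sd
    return np.abs(z) > cutoff       # boolean mask of flagged values
```

The robust and MAD variants described next simply swap in a trimmed mean or median for `mean` and a trimmed standard deviation or scaled MAD for `sd`.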
Normal distribution-based method (comments)
- Very dependent on the assumption of normality.
- $\bar{X}_k$ and $s_k$ are themselves not robust to outliers: positive outliers pull $\bar{X}_k$ up, and many outliers inflate $s_k$, so the $z_{ik}$ values become small when there are real outliers in the data and fewer outliers are detected.
- Handles only numeric-valued variables (the same holds for the other methods).

Robust normal method
- Deals with the robustness problem: same as the normal distribution-based method, but use a trimmed mean or median instead of $\bar{X}_k$ and a trimmed standard deviation instead of $s_k$.
- Calculate $z_{ik} = (x_{ik} - \bar{X}_k^R)/s_k^R$ and still use the $|z_{ik}| > 3$ labeling rule (the R superscript denotes the robust versions of the mean and standard deviation).

Median absolute deviation (MAD)
- Another method for dealing with the robustness problem: use the median as a robust estimate of the mean and the MAD as a robust estimate of the standard deviation.
- Calculate $D_{ik} = |x_{ik} - \mathrm{median}(X_k)|$ for i = 1, ..., n, and $\mathrm{MAD} = \mathrm{median}(D_{1k}, \ldots, D_{nk})$.
- Calculate the modified z value: $z_{ik} = (x_{ik} - \mathrm{median}(X_k)) / (1.4826 \cdot \mathrm{MAD})$.
- Label $x_{ik}$ as an outlier if $|z_{ik}| > 3.5$.
- Note: the factor 1.4826 is used because $E[1.4826 \cdot \mathrm{MAD}] \approx \sigma$ for normally distributed data.

Distance-based method
- Non-parametric (no assumption of normality) and multidimensional: detects outliers across all attributes at once, instead of one attribute at a time.
- Algorithm: calculate the distance between all pairs of observations, e.g. the Euclidean distance from observation i to observation j, $d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$; label observation i an outlier if fewer than r% of the observations lie within distance d of i (r and d are parameters).

Distance-based method (comments)
- Computationally intensive, particularly for larger samples, and the time required for each distance calculation grows with the number of attributes.
- The choices of d and r are not obvious; trying different values for a particular dataset further increases the computation.

Outliers in cancer studies
- In cancer studies, mutations can amplify or turn off gene expression in only a minority of samples, i.e. they produce outliers.
- The t-statistic may yield a high false discovery rate (FDR) when trying to detect changes that occur in only a small number of samples.
- COPA and PPST have been developed. Is there a better method?

t-statistic method
- For two normal populations $N(\theta_1, \sigma^2)$ and $N(\theta_2, \sigma^2)$, the standardized difference between the sample means follows a t distribution.
- Algorithm: let $x_{ij}$ be the expression value for gene i and sample j, with two groups: (1) normal and (2) disease. Compute a two-sample t-statistic for each gene, $T_i = (\bar{x}_{i2} - \bar{x}_{i1})/s_i$, where $\bar{x}_{ik}$ is the mean of gene i in group k and $s_i$ is the pooled within-group standard deviation of gene i.
- Call a gene significant if $|T_i|$ exceeds some threshold c.
- Use permutations of the sample labels to estimate the false discovery rate (FDR) for different values of c, as sketched below.
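A rough NumPy sketch of this gene-wise t-statistic with a permutation estimate of the FDR (an illustration, not the paper's implementation); it assumes `X` is a genes x samples matrix and `groups` is a vector of 1s (normal) and 2s (disease), and the function names are hypothetical.

```python
import numpy as np

def t_stats(X, groups):
    """Two-sample t-statistic T_i = (mean in group 2 - mean in group 1) / s_i
    for every gene (row of X), with s_i the pooled within-group std deviation."""
    groups = np.asarray(groups)
    X1, X2 = X[:, groups == 1], X[:, groups == 2]
    n1, n2 = X1.shape[1], X2.shape[1]
    sp = np.sqrt(((n1 - 1) * X1.var(axis=1, ddof=1) +
                  (n2 - 1) * X2.var(axis=1, ddof=1)) / (n1 + n2 - 2))
    # (a sqrt(1/n1 + 1/n2) factor is often included; it is constant across
    #  label permutations, so it does not change the permutation ranking)
    return (X2.mean(axis=1) - X1.mean(axis=1)) / sp

def permutation_fdr(X, groups, c, n_perm=200, rng=None):
    """Estimate the FDR of the rule |T_i| > c by permuting the sample labels."""
    rng = np.random.default_rng(rng)
    called = np.sum(np.abs(t_stats(X, groups)) > c)
    null_calls = [np.sum(np.abs(t_stats(X, rng.permutation(groups))) > c)
                  for _ in range(n_perm)]
    return np.mean(null_calls) / max(called, 1)   # expected false calls / observed calls
```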
t-statistic method (comments)
- The t-statistic method is a normal distribution-based method, so it is strongly affected by outliers, and it has no procedure for dealing with them.
- It is therefore not well suited to cancer studies in which mutations occur in only a minority of samples.

COPA
- Cancer Outlier Profile Analysis. Algorithm:
- Gene expression values are median-centered, setting each gene's median expression value to zero.
- The median absolute deviation (MAD) is calculated and scaled to 1 by dividing each gene expression value by its MAD.
- The 75th, 90th and 95th percentiles of the transformed expression values are tabulated for each gene, and genes are then rank-ordered by their percentile scores, providing a prioritized list of outlier profiles.

COPA (continued)
- The median and MAD are used for the transformation, as opposed to the mean and standard deviation, so that outlier expression values do not unduly influence the distribution estimates and are thus preserved post-normalization.

Outlier-sum statistic
- Idea: improve performance when "abnormal" gene expression occurs in only a small number of samples.
- Propose another method besides COPA and compare the two.

Outlier-sum statistic (algorithm)
- Let $\mathrm{med}_i$ and $\mathrm{mad}_i$ be the median and median absolute deviation of the values for gene i, and standardize each gene: $x'_{ij} = (x_{ij} - \mathrm{med}_i)/\mathrm{mad}_i$.
- Let $q_i(r)$ be the r-th percentile of the $x'_{ij}$ values for gene i, and let the interquartile range be $\mathrm{IQR}(i) = q_i(75) - q_i(25)$. Values greater than the limit $q_i(75) + \mathrm{IQR}(i)$ are defined to be outliers.
- The outlier-sum statistic is defined to be the sum of the values in the disease group $C_2$ that are beyond this limit:
$$W_i = \sum_{j \in C_2} x'_{ij}\; I\!\left[x'_{ij} > q_i(75) + \mathrm{IQR}(i)\right]$$
- In real applications one might expect negative as well as positive outliers, hence also define
$$W'_i = \sum_{j \in C_2} x'_{ij}\; I\!\left[x'_{ij} < q_i(25) - \mathrm{IQR}(i)\right]$$
- Set the outlier-sum statistic to the larger of $W_i$ and $W'_i$ in absolute value; this is called the "two-sided outlier-sum statistic". A minimal sketch follows below.
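A minimal NumPy sketch of the two-sided outlier-sum statistic for a single gene (the helper name is hypothetical; `disease` is assumed to be a boolean mask selecting the group-2 samples).

```python
import numpy as np

def outlier_sum(x, disease):
    """Two-sided outlier-sum statistic for one gene's expression values x."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    xs = (x - med) / mad                    # median/MAD standardization
    q25, q75 = np.percentile(xs, [25, 75])
    iqr = q75 - q25
    d = xs[disease]                         # disease-group values only
    w_hi = np.sum(d[d > q75 + iqr])         # sum of high outliers
    w_lo = np.sum(d[d < q25 - iqr])         # sum of low outliers
    return w_hi if abs(w_hi) >= abs(w_lo) else w_lo
```

Computed gene by gene, the statistic can then be ranked or assessed by permutation, as in the simulation study and skin-data application that follow.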
Simulation study
- Generate 1000 genes and 30 samples, all values drawn from a standard normal distribution.
- Add 2 units to gene 1 for k of the samples in the second group.
- Compute the p-values and compare the median, mean and standard deviation of the p-values across the different methods.

Simulation result
- For k = 15 (all samples in group 2 differentially expressed), the t-statistic performs best; this continues until k = 8.
- For smaller values of k, the outlier-sum statistic yields better results than COPA and the t-statistic.

Application to the skin data
- 12625 genes and 58 cancer patients: 14 with radiation sensitivity and 44 without; the group of 44 is used as the normal class.
- Apply the outlier-sum statistic within the SAM (Significance Analysis of Microarrays) approach.

Experiment result
- The outlier-sum statistic has a lower false discovery rate (FDR) near the right of the plot, but the FDR there may be too high for it to be useful in practice.
- Top 12 genes called by the outlier-sum statistic.

Conclusion
- The outlier-sum statistic exhibits better performance than simple t-statistic thresholding and COPA when some gene expressions are unusually high in some, but not all, samples.
- Otherwise, the t-statistic performs well.

My point of view
- More test cases are needed to see how well this method works in practice.
- In the simulation study, the values are drawn from a standard normal distribution. Will the variance of this distribution affect the simulation result, given that exactly 2 units were added to simulate the abnormal gene expression?
- In the simulation, only one gene was made an outlier gene. What if there is more than one? In other words, if another gene also exhibits unusually high expression in some samples but is irrelevant to the classification problem, will it affect the outlier-sum statistic?

References
- U. M. Braga-Neto and E. R. Dougherty, Bioinformatics 20, 374-380.
- S. A. Tomlins et al., Science 310, 644-648.
- Robert Tibshirani and Trevor Hastie, Biostatistics Advance Access, May 15, 2006.
- The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer.
- Class notes from "Data Mining", by Paul Maiste.
- Introduction to Data Mining, by Pang-Ning Tan et al., Addison-Wesley.
- Class notes from "Machine Learning", by Donald Geman.