Introduction to Classification Issues in Microarray Data Analysis
Jane Fridlyand and Jean Yee Hwa Yang
University of California, San Francisco
Microarray Workshop, Elsinore, Denmark, May 17-21, 2004

Brief Overview of the Life-Cycle

Life Cycle
Biological question → Experimental design → Microarray experiment → Quality measurement (failed experiments return to the experiment step; passing data move on) → Image analysis → Pre-processing → Analysis (estimation, testing, clustering, discrimination) → Biological verification and interpretation

• The steps outlined in the "Life Cycle" need to be carefully thought through and re-adjusted for each data type/platform combination. Experimental design will determine which questions can be asked, and answered, once the data are collected.
• "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of." (Sir R. A. Fisher)

Different Technologies
• SAGE
• Nylon membrane
• Illumina Bead Array
• GeneChip (Affymetrix)
• cDNA microarray
• Agilent: long-oligo ink jet
• CGH

Some statistical issues
• Designing gene expression experiments.
• Acquiring the raw data: image analysis.
• Assessing the quality of the data.
• Summarizing and removing artifacts from the data.
• Interpretation and analysis of the data:
- discovering which genes are differentially expressed,
- discovering which genes exhibit interesting expression patterns,
- detection of gene regulatory mechanisms,
- and many others.
For a review see Smyth, Yang and Speed, "Statistical issues in microarray data analysis", in: Functional Genomics: Methods and Protocols, Methods in Molecular Biology, Humana Press, March 2003. There are lots of other bioinformatics issues as well.

Image analysis, quality assessment, pre-processing
• Short-oligonucleotide chip data (CEL, CDF files): quality assessment, background correction, probe-level normalization, probe set summary.
• Two-color spotted array data (gpr, gal files): quality assessment with diagnostic plots, background correction, array normalization.
• Array CGH data (UCSF spot files): quality assessment with diagnostic plots, background correction, clone summary, array normalization.
Analysis (on the resulting probes-by-samples matrix of log-ratios or log-intensities)
• Analysis of expression data: identification of differentially expressed (D.E.) genes (estimation and testing), clustering, and discrimination.

Linear Models
Specific examples: t-tests, F-tests, empirical Bayes, SAM.
Examples:
• Identify differentially expressed genes among two or more tumor subtypes or different cell treatments.
• Look for genes that have different time profiles between different mutants.
• Look for genes associated with survival.

Clustering
Algorithms:
• hierarchical clustering,
• self-organizing maps,
• partitioning around medoids (PAM).
Examples:
• We can cluster cell samples (columns), e.g. to identify new / unknown tumor subclasses or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-expressed genes.

Discrimination
Learning set: B-ALL, T-ALL, AML; a new sample (?) is to be classified.
Questions:
• Can we identify groups of genes that are predictive of a particular class of tumors?
• Can I use the expression profile of cancer patients to predict survival?
Classification rules
• DLDA or DQDA
• k-nearest neighbor (kNN)
• Support vector machine (SVM)
• Classification tree
Example tree: if Gene 1 has Mi1 < -0.67, predict AML; otherwise, if Gene 2 has Mi2 > 0.18, predict B-ALL, else T-ALL.

Annotation
Example record for one clone:
• RIKEN ID: ZX00049O01; GenBank accession: AV128498.
• Nucleotide sequence: TCGTTCCATTTTTCTTTAGGGGGTCTTTCCCCGTCTTGGGGGGGAGGAAAAGTTCTGCTGCCCTGATTATGAACTCTATAATAGAGTATATAGCTTTTGTACCTTTTTTACAGGAAGGTGCTTTCTGTAATCATGTGATGTATATTAAACTTTTTATAAAAGTTAACATTTTGCATAATAAACCATTTTTG
• Name: inhibitor of DNA binding 3; LocusLink: 15903; MGD: MGI:96398; UniGene: Mm.110; gene symbol: Idb3; map position: chromosome 4, 66.0 cM.
• Literature and databases: Swiss-Prot P20109; GO GO:0000122, GO:0005634, GO:0019904; PubMed 12858547, 2000388, etc.; Bay Genomics ES cells; biochemical pathways (KEGG).

What is your question?
• What are the target genes for my knock-out gene? Look for genes that have different time profiles between different cell types. (Gene discovery, differential expression.)
• Is a specified group of genes all up-regulated under a specified condition? (Gene set, differential expression.)
• Can I use the expression profile of cancer patients to predict survival? Can we identify groups of genes that are predictive of a particular class of tumors? (Class prediction, classification.)
• Are there tumor subtypes not previously identified? Are there groups of co-expressed genes? (Class discovery, clustering.)
• Detection of gene regulatory mechanisms. Do my genes group into previously undiscovered pathways? (Clustering.) Often expression data alone are not enough; sequence and other information must be incorporated.

Classification

cDNA gene expression data
Data on G genes for n mRNA samples: genes in rows, samples in columns. For example:
Gene 1: 0.46, 0.30, 0.80, 1.51, 0.90, ...
Gene 2: -0.10, 0.49, 0.24, 0.06, 0.46, ...
Gene 3: 0.15, 0.74, 0.04, 0.10, 0.20, ...
Gene 4: -0.45, -1.03, -0.79, -0.56, -0.32, ...
Gene 5: -0.06, 1.06, 1.35, 1.09, -1.09, ...
Gene expression level of gene i in mRNA sample j = (normalized) log(red intensity / green intensity).

Classification
• Task: assign objects to classes (groups) on the basis of measurements made on the objects.
• Unsupervised: classes are unknown; we want to discover them from the data (cluster analysis).
• Supervised: classes are predefined; we want to use a (training or learning) set of labeled objects to form a classifier for classifying future observations.

Example: tumor classification
• Reliable and precise classification is essential for successful cancer treatment.
• Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables.
• Uncertainties in diagnosis remain; it is likely that existing classes are heterogeneous.
• Characterize molecular variation among tumors by monitoring gene expression (microarrays).
• Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes).

Tumor classification using gene expression data
Three main types of statistical problems are associated with tumor classification:
• identification of new/unknown tumor classes using gene expression profiles (unsupervised learning: clustering);
• classification of malignancies into known classes (supervised learning: discrimination);
• identification of "marker" genes that characterize the different tumor classes (feature or variable selection).

Clustering

Generic clustering tasks
• Estimating the number of clusters.
• Assigning each object to a cluster.
• Assessing the strength/confidence of cluster assignments for individual objects.
• Assessing cluster homogeneity.

What to cluster
• Samples: to discover novel subtypes of the existing groups, or entirely new partitions. Their utility needs to be confirmed with other types of data, e.g. clinical information.
• Genes: to discover groups of co-regulated genes/ESTs, and to use these groups to infer function, where it is unknown, from members of the groups with known function.

Basic principles of clustering
Aim: to group observations or variables that are "similar" based on predefined criteria.
Issues:
• Which genes / arrays to use?
• Which similarity or dissimilarity measure?
• Which method to use to join clusters/observations?
• Which clustering algorithm?
• How to validate the resulting clusters?
It is advisable to reduce the number of genes from the full set to some more manageable number before clustering. The basis for this reduction is usually quite context-specific and varies depending on what is being clustered, genes or arrays.

Clustering of genes
Array data → for each gene, calculate a summary statistic and/or adjusted p-values → set of candidate DE genes → choose a similarity metric and clustering algorithm → clustering → descriptive interpretation → biological verification.

Clustering of samples and genes
Array data → set of samples to cluster, plus set of genes to use in clustering (do NOT use the class labels when determining this set) → choose a similarity metric and clustering algorithm → clustering → descriptive interpretation of the genes separating novel subgroups of the samples → validation of the clusters with clinical data.

Which similarity or dissimilarity measure?
• A metric is a measure of the similarity or dissimilarity between two data objects.
• Two main classes of metric:
- Correlation coefficients (similarity): compare the shapes of expression curves. Types of correlation:
  - centered
  - un-centered
  - rank correlation
- Distance metrics (dissimilarity):
  - city-block (Manhattan) distance
  - Euclidean distance

Correlation (a measure between -1 and 1)
• Pearson correlation coefficient (centered correlation):
r = (1/(n-1)) Σ_{i=1}^{n} [(x_i - x̄)/S_x] [(y_i - ȳ)/S_y],
where S_x and S_y are the standard deviations of x and y.
• Others include Spearman's and Kendall's rank correlations.
• You can use the absolute correlation to capture both positive and negative correlation.

Potential pitfalls
(Figure omitted: two profiles with correlation = 1 despite very different absolute expression levels; correlation ignores differences in mean level and scale.)

Distance metrics
• City-block (Manhattan) distance:
- sum of absolute differences across dimensions: d(X, Y) = Σ_i |x_i - y_i|
- less sensitive to outliers
- diamond-shaped clusters
• Euclidean distance:
- the most commonly used distance: d(X, Y) = sqrt( Σ_i (x_i - y_i)² )
- sphere-shaped clusters
- corresponds to the geometric distance in multidimensional space
where gene X = (x1, ..., xn) and gene Y = (y1, ..., yn).
(Figure omitted: points X and Y plotted against Condition 1 and Condition 2 for each metric.)

Euclidean vs correlation (I)
• Euclidean distance
• Correlation
(Figures omitted.)

How to Compute Group Similarity?
Four popular methods: given two groups g1 and g2,
• single-link: s(g1, g2) = similarity of the closest pair;
• complete-link: s(g1, g2) = similarity of the farthest pair;
• average-link: s(g1, g2) = average similarity over all pairs;
• centroid: s(g1, g2) = distance between the centroids of the two clusters.

Distance between clusters: examples of clustering methods
• Single (nearest neighbor): leads to "cluster chains".
• Complete (farthest neighbor): leads to small, compact clusters.
• Centroid: distance between centroids.
• Average (mean) linkage.

Comparison of the three methods
• Single-link: "loose" clusters; individual decision, sensitive to outliers.
• Complete-link: "tight" clusters; individual decision, sensitive to outliers.
• Average-link or centroid: "in between"; group decision, insensitive to outliers.
• Which one is best? It depends on what you need!

Clustering algorithms
• Clustering algorithms come in two basic flavors: hierarchical and partitioning.

Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimizing the within-cluster sums of squares. Ideally, dissimilarity between clusters is maximized while dissimilarity within clusters is minimized.
• Examples:
- k-means, self-organizing maps (SOM), PAM, etc.;
- fuzzy methods (each object is assigned a probability of being in each cluster): these need a stochastic model, e.g. Gaussian mixtures.
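As an illustration of the group-similarity rules above, here is a minimal pure-Python sketch (our own illustration, not workshop code; the function names are hypothetical, and Euclidean distance stands in for a generic dissimilarity, so "closest pair" means smallest distance):

```python
# Single-, complete- and average-link dissimilarity between two groups
# of points, using Euclidean distance as the pairwise dissimilarity.
from math import dist  # Euclidean distance between coordinate tuples (Python 3.8+)

def single_link(g1, g2):
    # dissimilarity of the closest cross-group pair
    return min(dist(x, y) for x in g1 for y in g2)

def complete_link(g1, g2):
    # dissimilarity of the farthest cross-group pair
    return max(dist(x, y) for x in g1 for y in g2)

def average_link(g1, g2):
    # average dissimilarity over all cross-group pairs
    return sum(dist(x, y) for x in g1 for y in g2) / (len(g1) * len(g2))

g1 = [(0.0, 0.0), (1.0, 0.0)]
g2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(g1, g2))    # 2.0: closest pair (1,0)-(3,0)
print(complete_link(g1, g2))  # 5.0: farthest pair (0,0)-(5,0)
print(average_link(g1, g2))   # 3.5: (3 + 5 + 2 + 4) / 4
```

On this toy example the three rules already disagree, which is exactly why single-link chains and complete-link compacts, as the comparison slide notes.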
Partitioning methods: K = 2 (figure omitted)

Partitioning methods: K = 4 (figure omitted)

Example of a partitioning algorithm: K-means or PAM (Partitioning Around Medoids)
• Given a similarity function,
• start with k randomly selected data points,
• assume they are the centroids (medoids) of k clusters,
• assign every data point to the cluster whose centroid (medoid) is closest to it,
• recompute the centroid (medoid) of each cluster,
• repeat this process until the similarity-based objective function converges.

Mixture model for clustering
P(X) = π1 P(X|Cluster1) + π2 P(X|Cluster2) + π3 P(X|Cluster3), with X | Cluster_i ~ N(μ_i, σ_i²).

Mixture model estimation
• Likelihood function (generally Gaussian):
p(x) = Σ_{i=1}^{k} π_i (1 / (σ_i √(2π))) exp( -(x - μ_i)² / (2σ_i²) )
• Parameters: e.g. π_i, μ_i, σ_i.
• Estimation uses the EM algorithm, similar to a "soft" k-means.
• The number of clusters can be determined using a model-selection criterion, e.g. BIC (Raftery and Fraley, 1998).

Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained by cutting the tree at some level.
• The tree can be built in two distinct ways:
- bottom-up: agglomerative clustering (usually used);
- top-down: divisive clustering.
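The reallocation loop on the K-means/PAM slide above can be sketched as follows (a minimal illustration under our own naming, not workshop code; it uses cluster means, i.e. k-means rather than PAM's medoids, and takes the initial centroids as an argument instead of picking them at random):

```python
# Minimal k-means sketch: assign each point to its nearest centroid,
# recompute centroids as cluster means, repeat until nothing moves.
from math import dist

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # assignment step: index of the nearest centroid for each point
        labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # update step: mean of the points assigned to each cluster
        new = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(centroids[j])  # keep an empty cluster's centroid
        if new == centroids:              # converged: no centroid moved
            return labels, centroids
        centroids = new
    return labels, centroids

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, cents = kmeans(pts, [(0.0, 0.0), (5.0, 5.0)])
print(labels)  # [0, 0, 1, 1]
print(cents)   # [(0.0, 0.5), (5.0, 5.5)]
```

PAM would differ only in the update step (choose the member point minimizing total dissimilarity instead of the mean), which is why it tolerates arbitrary dissimilarities and is less sensitive to outliers.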
Agglomerative methods
• Start with n mRNA-sample (or G gene) clusters.
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity that reflects the shape of the clusters. The distance between clusters is defined by the method used (e.g., with complete linkage it is the distance between the farthest pair of points in the two clusters).

Divisive methods
• Start with only one cluster.
• At each step, split a cluster into two parts.
• Advantage: obtains the main structure of the data (i.e. focuses on the upper levels of the dendrogram).
• Disadvantage: computational difficulties when considering all possible divisions into two groups.
Divisive methods are rarely used in microarray data analysis.

Illustration of points in two-dimensional space
(Figure omitted: five points labeled 1-5 are merged agglomeratively, e.g. {1,5}, then {1,2,5} and {3,4}, and finally {1,2,3,4,5}, yielding the dendrogram.)

Tree re-ordering?
(Figure omitted: the same dendrogram drawn with its leaves in two different orders, e.g. 1 5 2 3 4 versus 2 1 5 3 4; the tree structure is unchanged.)

Partitioning vs. hierarchical
Partitioning:
• Advantages: optimal for certain criteria; objects are automatically assigned to clusters.
• Disadvantages: needs an initial k; often requires long computation times; all objects are forced into a cluster.
Hierarchical:
• Advantages: faster computation; visual.
• Disadvantages: unrelated objects are eventually joined; rigid, cannot correct later for erroneous decisions made earlier; clusters are hard to define, one still needs to know "where to cut".
Note that hierarchical clustering results may be used as starting points for partitioning or model-based algorithms.

Clustering microarray data
• Clustering leads to readily interpretable figures and can be helpful for identifying patterns in time or space. Examples:
• We can cluster cell samples (columns), e.g. to identify new / unknown tumor classes or cell subtypes using gene expression profiles.
• We can cluster genes (rows), e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes.
• We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.

Estimating the number of clusters using the silhouette (see PAM)
Define the silhouette width of an observation as s = (b - a) / max(a, b), where a is its average dissimilarity to the other points in its own cluster and b is the smallest average dissimilarity to the points of any other cluster. Intuitively, objects with large s are well clustered, while those with small s tend to lie between clusters.
How many clusters? Perform clustering for a sequence of numbers of clusters k and choose the k with the largest average silhouette width.
The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.

Estimating the number of clusters with the silhouette (ctd)
Compute the average silhouette for k = 3 and compare it with the results for other values of k.

Estimating the number of clusters using a reference distribution
Idea: define a goodness-of-clustering score to minimize, e.g. the pooled within-cluster sum of squares (WSS) around the cluster means, reflecting the compactness of the clusters:
W_k = Σ_{r=1}^{k} D_r / (2 n_r),
where n_r is the number of points in cluster r and D_r is the sum of all pairwise distances within it. The gap statistic for k clusters is then defined as
Gap_n(k) = E*_n[log(W_k)] - log(W_k),
where E*_n denotes expectation under a sample of size n from a reference distribution. The reference distribution can be generated either parametrically (e.g. from a multivariate uniform distribution) or non-parametrically (e.g. by sampling from the marginal distributions of the variables). The first local maximum is chosen as the number of clusters (the actual rule is slightly more complicated) (Tibshirani et al., 2001).

Estimating the number of clusters
There are other resampling-based (e.g.
Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper (1978) and Dudoit and Fridlyand (2002)). The bottom line is that none of them work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework. It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for the clustering.

Confidence in individual cluster assignments
We want to assign to each observation a confidence of belonging to its assigned cluster.
• Model-based clustering: natural probabilistic interpretation.
• Partitioning methods: silhouette.
• Dudoit and Fridlyand (2003) present a resampling-based approach that assigns confidence as the proportion of resampling runs in which an observation ends up in its assigned cluster.

Tight clustering (genes)
Identifies small, stable gene clusters without attempting to cluster all the genes. It therefore does not necessitate estimating the number of clusters or assigning every point to a cluster, which aids the interpretability and validity of the results (Tseng et al., 2003).
Algorithm, for a sequence of k > k0:
1. Identify the sets of genes that are consistently grouped together when the observations (genes) are repeatedly sub-sampled. Order these sets by size and consider the top q largest sets for each k.
2. Stop when, for (k, k+1), the two sets are nearly identical; take the set corresponding to k+1 and remove it from the dataset.
3. Set k0 = k0 - 1 and repeat the procedure.

Two-way clustering of genes and samples
This refers to methods that use samples and genes simultaneously to extract information. These methods are not yet well developed.
Some examples of these approaches include block clustering (Hartigan, 1972), which repeatedly rearranges rows and columns to obtain the largest reduction in the total within-block variance, and plaid models (Lazzeroni and Owen, 2002). Friedman and Meulman (2002) present an algorithm that clusters samples based on subsets of attributes, i.e. each group of samples may be characterized by a different gene set.

Applications of clustering to microarray data
Alizadeh et al. (2000), "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling". The authors demonstrated that three subtypes of lymphoma (FL, CLL and DLBCL) can be distinguished by their genetic signatures. They went on to ask whether it is possible to identify novel subtypes of DLBCL, a group known to be heterogeneous in terms of survival (40% of patients recur after treatment). There were 81 cases in total, about 40 of them DLBCL.

Clustering both cell samples and genes
(Figure from Alizadeh et al., Nature, February 2000, "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling".)

Clustering cell samples: discovering sub-groups
(Figure from Alizadeh et al., Nature, 2000.)

Attempt at validation of the DLBCL subgroups
(Figure from Alizadeh et al., Nature, 2000.)

Clustering genes: finding different patterns in the data
Yeast cell cycle (Cho et al., 1998); a 6 × 5 SOM with 828 genes. (Figure from Tamayo et al., PNAS, 1999.)

Summary
Which clustering method should I use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- Do I want hard or soft boundaries between clusters?
Keep in mind:
- Clustering cannot NOT work. That is, every clustering method will return clusters.
- Clustering helps to group and order information and is a visualization tool for learning about the data. However, clustering results do not provide biological "proof".
- Clustering is generally used as an exploratory, hypothesis-generating tool.

Discrimination
• Classification procedure
• Feature selection
• Performance assessment
• Comparison study

Basic principles of discrimination
• Each object is associated with a class label (or response) Y ∈ {1, 2, ..., K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, ..., XG).
Aim: predict Y from X.
(Figure omitted: objects with predefined classes {1, 2, ..., K}; e.g. an object with class label Y = 2 and feature vector X = {red, square}, where the features are {colour, shape}; a classification rule predicts Y for a new X.)

Discrimination and allocation
Discrimination: a learning set (data with known classes) is fed to a classification technique to produce a classification rule. Prediction: data with unknown classes are passed through the classification rule to obtain class assignments.

Learning set with classes predefined by clinical outcome
Bad prognosis: recurrence < 5 yrs; good prognosis: no metastasis for > 5 yrs; a new array (?) is to be classified. Objects: arrays; feature vectors: gene expression.
Reference: L. van 't Veer et al. (2002), "Gene expression profiling predicts clinical outcome of breast cancer", Nature, January.

Learning set with classes predefined by tumor type
B-ALL, T-ALL, AML; a new array (?) is to be classified. Objects: arrays; feature vectors: gene expression.
Reference: Golub et al. (1999), "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science 286(5439): 531-537.

Classification rule and performance assessment (e.g. cross-validation)
A classification rule involves:
- the classification procedure,
- feature selection,
- parameters (pre-determined or estimable), the distance measure, and aggregation methods.
• One can think of the classification rule as a black box; some methods provide more insight into the box.
• Performance assessment needs to be carried out for every classification rule.

Classifiers
• A predictor or classifier partitions the space of gene expression profiles into K disjoint subsets, A1, ..., AK, such that for a sample with expression profile X = (X1, ..., XG) ∈ Ak the predicted class is k.
• Classifiers are built from a learning set (LS) L = (X1, Y1), ..., (Xn, Yn).
• A classifier C built from a learning set L is a map C(., L): X → {1, 2, ..., K}.
• The predicted class for an observation X is C(X, L) = k if X ∈ Ak.

Decision theory (I)
• Classification can be viewed as statistical decision theory: we must decide which of the classes an object belongs to.
• Use the observed feature vector X to aid in decision making.
• Denote the population proportion of objects of class k by π_k = p(Y = k).
• Assume objects in class k have feature vectors with density p_k(X) = p(X | Y = k).

Decision theory (II)
• One criterion for assessing classifier quality is the misclassification rate, p(C(X) ≠ Y).
• A loss function L(i, j) quantifies the loss incurred by erroneously classifying a member of class i as class j.
• The risk function R(C) for a classifier is the expected (average) loss: R(C) = E[L(Y, C(X))].

Decision theory (III)
• Typically L(i, i) = 0.
• In many cases one can assume a symmetric loss with L(i, j) = 1 for i ≠ j (so that different types of errors are equivalent). In this case, the risk is simply the misclassification probability.
• There are important examples, such as diagnosis, where the loss function is not symmetric.

Classification rule: maximum likelihood discriminant rule
• A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest.
• For known class-conditional densities p_k(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by C(X) = argmax_k p_k(X).

Gaussian ML discriminant rules
• For multivariate Gaussian (normal) class densities X | Y = k ~ N(μ_k, Σ_k), the ML classifier is
C(X) = argmin_k { (X - μ_k) Σ_k^{-1} (X - μ_k)' + log |Σ_k| }.
• In general, this is a quadratic rule (quadratic discriminant analysis, or QDA).
• In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities.

ML discriminant rules: special cases
• [DLDA] Diagonal linear discriminant analysis: class densities have the same diagonal covariance matrix Δ = diag(s1², ..., sp²).
• [DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices Δ_k = diag(s1k², ..., spk²).
Note: the weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (with a different variance calculation).

The logistic regression model
2-class case: log[p / (1 - p)] = α + βᵗ X + e, where p is the probability that the event Y occurs given the observed gene expression pattern, p(Y = 1 | X); p / (1 - p) is the odds, and ln[p / (1 - p)] is the log-odds, or "logit".
This can easily be generalized to multiclass outcomes and to more general dependences than linear. Logistic regression also makes fewer assumptions on the marginal distribution of the variables; however, the results are generally very similar to LDA (Hastie et al., 2003).

Nearest-neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest-neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
- find the k observations in the learning set closest to X;
- predict the class of X by majority vote, i.e. choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later).

Nearest-neighbor rule
(Figure omitted.)

Classification trees
• Partition the feature space into a set of rectangles, then fit a simple model in each one.
• Binary tree-structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself.
• Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier.

Classification tree (example)
If Gene 1 has Mi1 < -0.67, predict class 0; otherwise, if Gene 2 has Mi2 > 0.18, predict class 2, else class 1. This corresponds to splitting the (Gene 1, Gene 2) plane at Gene 1 = -0.67 and Gene 2 = 0.18.

Three aspects of tree construction
• Split selection rule: for example, at each node choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping: for example, grow a large tree, prune it to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate.
• Class assignment: for example, for each terminal node choose the class minimizing the resubstitution estimate of the misclassification probability, given that a case falls into this node.

Another component of a classification rule: aggregating classifiers
Resample the training set (X1, X2, ..., X100) many times, build one classifier per resample (classifier 1, ..., classifier 500), and combine them into an aggregate classifier. Examples: bagging, boosting, random forests.

Aggregating classifiers: bagging
Each resample X*1, X*2, ..., X*100 from the training set (arrays) yields a tree, and the trees vote on a test sample; e.g. if 90% of the trees vote class 1 and 10% vote class 2, the aggregate prediction is class 1.

Classification with SVMs
A generalization of the idea of separating hyperplanes in the original space.
Linear boundaries between classes in a higher-dimensional space lead to non-linear boundaries in the original space.

Other classifiers include...
• Neural networks
• Projection pursuit
• Bayesian belief networks
• ...

Why select features?
• It can lead to better classification performance, by removing variables that are noise with respect to the outcome.
• It may provide useful insights into the etiology of a disease.
• It can eventually lead to diagnostic tests (e.g. a "breast cancer chip").

Why select features?
(Figure omitted: correlation plots for the 3-class leukemia data, with no feature selection, selection based on variance, and top-100 feature selection; scale from -1 to +1.)

Approaches to feature selection
• Methods fall into three basic categories:
- filter methods,
- wrapper methods,
- embedded methods.
• The simplest and most frequently used are the filter methods.

Filter methods (R^p → feature selection → R^s → classifier design, with s << p)
• Features are scored independently and the top s are used by the classifier.
• Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
• Easy to interpret; can provide some insight into disease markers.

Problems with filter methods
• Redundancy in the selected features: features are considered independently and are not assessed on whether they contribute new information.
• Interactions among features generally cannot be explicitly incorporated (though some filter methods are smarter than others).
• The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than with others.

Dimension reduction: a variant on filter methods
• Rather than retaining a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g.
PCA, etc.).
• The problem is that we are no longer dealing with one feature at a time, but with a linear (or possibly more complicated) combination of all features. This may be good enough for a black box, but how does one build a diagnostic chip on a "supergene"? (Even though we don't want to confuse the two tasks.)
• These methods tend not to work better than simple filter methods.

Wrapper methods (R^p → feature selection → R^s → classifier design, with s << p)
• Iterative approach: many feature subsets are scored based on classification performance, and the best is used.
• Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc.

Problems with wrapper methods
• Computationally expensive: for each feature subset considered, a classifier must be built and evaluated.
• No exhaustive search is possible (there are 2^p subsets to consider): generally greedy algorithms only.
• Easy to overfit.

Embedded methods
• Attempt to jointly or simultaneously train both the classifier and the feature subset.
• Often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features.
• Intuitively appealing. Some examples: tree-building algorithms, shrinkage methods (LDA, kNN).

Performance assessment
• Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the initial classifier-building phase.
• One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
• Performance of the classifier can be assessed by:
- cross-validation,
- Test set estimation.
- Independent testing on a future dataset.
92 Microarray Workshop

Diagram of performance assessment
[Diagram: a classifier is built on the training set and evaluated either on the training set itself (resubstitution estimation) or on an independent test set (test set estimation).]
93 Microarray Workshop

Performance assessment (I)
• Resubstitution estimation: error rate on the learning set.
- Problem: downward bias.
• Test set estimation: 1) divide the learning set into two subsets, L and T; build the classifier on L and compute the error rate on T; or 2) build the classifier on the training set (L) and compute the error rate on an independent test set (T).
- L and T must be independent and identically distributed (i.i.d.).
- Problem: reduced effective sample size.
94 Microarray Workshop

Diagram of performance assessment
[Diagram: as above, with an additional path in which the learning set is repeatedly split into training and test sets for cross-validation.]
95 Microarray Workshop

Performance assessment (II)
• V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build each classifier leaving one subset out, compute the error rate on the left-out subset, and average over the V subsets.
- Bias-variance trade-off: smaller V can give larger bias but smaller variance.
- Computationally intensive.
• Leave-one-out cross-validation (LOOCV): the special case V = n. Works well for stable classifiers (kNN, LDA, SVM).
96 Microarray Workshop

Performance assessment (III)
• It is common practice to do feature selection on the full learning set, then use CV only for model building and classification.
• However, usually the features are unknown and the intended inference includes feature selection. Then CV estimates as above tend to be downward biased.
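The downward bias from selecting features before cross-validating can be demonstrated on pure noise. A minimal numpy sketch (simulated data; a nearest-centroid classifier stands in for DLDA, and the t-like score is an illustrative choice, not the slides' exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 2000, 20            # samples, features, features kept
X = rng.standard_normal((n, p))   # pure noise: no real class signal
y = np.repeat([0, 1], n // 2)     # arbitrary 2-class labels

def top_t_features(X, y, s):
    """Score each feature by a two-sample t-like statistic; keep the top s."""
    diff = X[y == 1].mean(0) - X[y == 0].mean(0)
    spread = X[y == 1].std(0) + X[y == 0].std(0) + 1e-12
    return np.argsort(-np.abs(diff / spread))[:s]

def cv_error(X, y, folds, select_inside):
    """5-fold CV error of a nearest-centroid classifier.
    select_inside=False mimics the flawed practice: features are chosen
    once on ALL samples before cross-validation."""
    if not select_inside:
        X = X[:, top_t_features(X, y, s)]
    wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        # Honest version: redo feature selection inside each training fold.
        Xf = X[:, top_t_features(X[train], y[train], s)] if select_inside else X
        c0 = Xf[train][y[train] == 0].mean(0)
        c1 = Xf[train][y[train] == 1].mean(0)
        pred = (((Xf[test] - c1) ** 2).sum(1)
                < ((Xf[test] - c0) ** 2).sum(1)).astype(int)
        wrong += int(np.sum(pred != y[test]))
    return wrong / len(y)

folds = np.array_split(rng.permutation(n), 5)
err_biased = cv_error(X, y, folds, select_inside=False)  # optimistic
err_honest = cv_error(X, y, folds, select_inside=True)   # near 0.5, as it should be
print(err_biased, err_honest)
```

On data with no signal the honest CV error hovers around 0.5, while selecting features on the full set first makes the same classifier look genuinely discriminative — the effect analyzed by Ambroise and McLachlan (reference 10).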
• Features (variables) should be selected only from the learning set used to build the model (and not from the entire set).
97 Microarray Workshop

Some performance assessment quantities
Assume a 2-class problem:
class 1 = no event ~ null hypothesis
class 2 = event ~ alternative hypothesis
All quantities are estimated on the available dataset (the test set, if available).
• Misclassification error rate: proportion of misclassified samples.
• Lift: proportion of correct class 2 predictions divided by the proportion of class 2 cases:
Prob(class 2 is true | class 2 is detected) / Prob(class is 2).
• Odds ratio: measure of association between true and predicted labels.
98 Microarray Workshop

Some performance assessment quantities (ctd)
• Sensitivity: proportion of correct class 2 predictions.
Prob(detect class 2 | class 2 is true) ~ power.
• Specificity: proportion of correct class 1 predictions.
Prob(declare class 1 | class 1 is true) = 1 – Prob(detect class 2 | class 1 is true) ~ 1 – type I error.
• Positive Predictive Value (PPV): proportion of class 2 cases among predicted class 2 cases (should be applicable to the population).
Prob(class 2 is true | class 2 is detected)
= Prob(detect class 2 | class 2 is true) x Prob(class 2 is true) / Prob(detect class 2)
= sensitivity x Prob(class is 2) / [sensitivity x Prob(class is 2) + (1 – specificity) x (1 – Prob(class is 2))].
Note that PPV is the only quantity explicitly incorporating population proportions, i.e., the prevalence of class 2 in the population of interest (Prob(class is 2)), as well as sensitivity and specificity. If the prevalence is low, the specificity of the test has to be very high for the test to be clinically useful.
99 Microarray Workshop

Comparison study
• Leukemia data – Golub et al. (1999):
- n = 72 samples,
- G = 3,571 genes,
- 3 classes (B-cell ALL, T-cell ALL, AML).
• Reference: S. Dudoit, J. Fridlyand, and T. P. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data.
Journal of the American Statistical Association, Vol. 97, No. 457, pp. 77-87.
100 Microarray Workshop

[Figure: Leukemia data, 3 classes: test set error rates over 150 LS/TS runs.]
101 Microarray Workshop

Results
• In the main comparison, NN and DLDA had the smallest error rates.
• Aggregation improved the performance of CART classifiers.
• For the leukemia datasets, increasing the number of genes to G = 200 did not greatly affect the performance of the various classifiers.
102 Microarray Workshop

Comparison study – discussion (I)
• “Diagonal” LDA: ignoring correlation between genes helped here. Unlike classification trees and nearest neighbors, DLDA is unable to take gene interactions into account.
• Classification trees are capable of handling and revealing interactions between variables. In addition, aggregated classifiers have useful by-products: prediction votes and variable importance statistics.
• Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into the mechanisms underlying the class distinctions.
103 Microarray Workshop

Summary (I)
• Bias-variance trade-off. Simple classifiers do well on small datasets. As the number of samples increases, we expect classifiers capable of considering higher-order interactions (and aggregated classifiers) to have an edge.
• Cross-validation. It is of utmost importance to cross-validate every parameter that has been chosen based on the data, including meta-parameters:
- which and how many features,
- how many neighbors,
- pooled or unpooled variance,
- the classifier itself.
If this is not done, it is possible to wrongly declare discrimination power where there is none.
104 Microarray Workshop

Summary (II)
• Generalization error rate estimation. It is necessary to keep the sampling scheme in mind.
• Thousands and thousands of independent samples from a variety of sources are needed to be able to assess the true performance of a classifier.
• We are not at that point yet with microarray studies. The van ’t Veer et al. (2002) study is probably the only study to date with ~300 test samples.
105 Microarray Workshop

Case studies
[Diagram: learning set → feature selection → classification rule, judged good or bad.]
• Feature selection: correlation with class labels, very similar to a t-test; cross-validation was used to select 70 genes.
• 295 samples selected from the Netherlands Cancer Institute tissue bank (1984–1995).
• Results: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.
• Agendia (formed by researchers from the Netherlands Cancer Institute) started in Oct 2003:
1) 5,000 subjects [Health Council of the Netherlands],
2) 5,000 subjects [New York-based Avon Foundation].
Custom arrays are made by Agilent, including the 70 genes + 1,000 controls.
References:
1. Retrospective study: L. van ’t Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.
2. Cohort study: M. van de Vijver et al. A gene expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, Dec 2002.
3. Prospective trials, Aug 2003: clinical trials, http://www.agendia.com/
106 Microarray Workshop

Some examples of wrong answers and questions in microarray data analysis
107 Microarray Workshop

Life Cycle
Biological question → Experimental design → Microarray experiment → Quality measurement (failed experiments return to the experiment step) → Image analysis → Normalization → Analysis (estimation, testing, clustering, discrimination) → Biological verification and interpretation
108 Microarray Workshop

Experimental design
Proper randomization is essential in experimental design.
Question: Build a predictor to diagnose ovarian cancer.
Design: Tissue from normal women and ovarian cancer patients arrives at different times.
Issues: Complete confounding between tissue type and time of processing. This phenomenon is very common in the absence of a carefully thought-through design.
Post-mortem diagnosis: lack of randomization.
109 Microarray Workshop

Clustering I
The procedure should not bias results towards desired conclusions.
Question: Do expression data cluster according to survival status?
Design: Identify genes with a high t-statistic for comparing short and long survivors. Use these genes to cluster the samples. Get excited that the samples cluster according to survival status.
Issues: The genes were already selected based on survival status. It would therefore be surprising if the samples did *not* cluster according to survival.
Conclusion: None is possible with respect to clustering, as variable selection was driven by the class distinction.
110 Microarray Workshop

Clustering II
P-values for differential expression are only valid when the class labels are independent of the current dataset.
Question: Identify genes distinguishing among “interesting” subgroups.
Design: Cluster the samples into K groups. For each gene, compute an F-statistic and its associated p-value to test for differential expression among the subgroups.
Issues: The same data were used to create the groups and to test for differential expression – the p-values are invalid.
Conclusion: None with respect to differential-expression p-values. Nevertheless, it is possible to select genes with high values of the statistic and test hypotheses about functional enrichment with, e.g., Gene Ontology. One can also cluster these genes and use the results to generate new hypotheses.
111 Microarray Workshop

Prediction I: estimating misclassification error
Performance of a classifier on future samples needs to be assessed while taking population proportions into account.
Question: Build a classifier to predict a rare (1/100) subclass of cancer and estimate its misclassification rate in the population.
Design: Retrospectively collect equal numbers of the rare and common subtypes and build a classifier. Estimate its future performance using cross-validation on the collected set.
Issues: Population proportions of the two types differ from the proportions in the study. For instance, if 0/50 of the rare subtype and 10/50 of the common subtype were misclassified (10/100), then in the population we expect to observe 1 rare instance and 99 common ones, and will misclassify approximately 20/100 samples.
Conclusion: If a dataset is not representative of the population distributions, one needs to think hard about how to do the “translation” (e.g., Positive Predictive Value on future samples vs. specificity and sensitivity on the current ones). A significant association (odds ratio) between predicted and observed class labels is a strong indication that the data contain signal with respect to the outcome of interest.
112 Microarray Workshop

Estimating PPV
Petricoin et al. published an article in The Lancet (2002) titled “Use of proteomic patterns in serum to identify ovarian cancer”. There they claimed to have developed a predictor, based on the proteomic signature of the samples, with a 94% Positive Predictive Value for detecting ovarian cancer. Recall that PPV is the proportion of true cancers among those flagged by the classifier. There were many issues with their data collection and analysis, but the main one, which generated a lot of correspondence, was that, taking into account the population prevalence of ovarian cancer, the resulting PPV was 0.8% rather than the 94% stated by the authors. Note that the specificity of a classifier has to be almost perfect to be clinically useful when the disease is rare.
113 Microarray Workshop

Prediction II: Prevalence vs PPV (ctd)
(Adapted from the comment in The Lancet by Rockhill.)

PPV (%) by prevalence and specificity:

Prevalence      Specificity 90%    95%    99%    99.9%
50%                         91     95     99     99.9
43%                         88     94*    99     99.9
10%                         53     69     92     99
1%                           9     17     50     91
0.1%                         1      2      9     50
One per 2500                 0.4    0.8**  4     29

Assumes a constant sensitivity of 100%.
*PPV reported by Petricoin et al. (2002).
**Correct PPV assuming the prevalence of ovarian cancer in the general population is 1/2500.
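The table's entries follow directly from the PPV formula on the earlier slide; a quick sketch (assuming, as in the table, a constant sensitivity of 100%):

```python
# PPV from prevalence and specificity via Bayes' rule,
# with sensitivity fixed at 1.0 as in the table above.
def ppv(prevalence, specificity, sensitivity=1.0):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Petricoin et al.'s study proportions: 43% prevalence, 95% specificity.
print(round(100 * ppv(0.43, 0.95)))         # 94 -- the reported PPV
# General-population prevalence of ovarian cancer: about 1/2500.
print(round(100 * ppv(1 / 2500, 0.95), 1))  # 0.8 -- the corrected PPV
```

The same classifier drops from 94% to 0.8% PPV purely because the prevalence changes, which is the point of the table.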
Note that discovering discriminatory power is not the same as demonstrating the clinical utility of a classifier.
114 Microarray Workshop

Acknowledgements
SFGH
• Agnes Paquet
• David Erle
• Andrea Barczak
• UCSF Sandler Genomics Core Facility
UCSF/CBMB
• Ajay Jain
• Mark Segal
• UCSF Cancer Center Array Core
• Jain Lab
UCB
• Terry Speed
• Sandrine Dudoit
115 Microarray Workshop

Some references
1. Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning”, Springer, 2001.
2. Speed (editor), “Statistical Analysis of Gene Expression Microarray Data”, Chapman & Hall/CRC, 2003.
3. Alizadeh et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling”, Nature, 2000.
4. van ’t Veer et al., “Gene expression profiling predicts clinical outcome of breast cancer”, Nature, 2002.
5. van de Vijver et al., “A gene-expression signature as a predictor of survival in breast cancer”, NEJM, 2002.
6. Petricoin et al., “Use of proteomic patterns in serum to identify ovarian cancer”, Lancet, 2002 (and relevant correspondence).
7. Golub et al., “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring”, Science, 1999.
8. Cho et al., “A genome-wide transcriptional analysis of the mitotic cell cycle”, Mol. Cell, 1999.
9. Dudoit et al., “Comparison of discrimination methods for the classification of tumors using gene expression data”, JASA, 2002.
116 Microarray Workshop

Some references (ctd)
10. Ambroise and McLachlan, “Selection bias in gene extraction on the basis of microarray gene-expression data”, PNAS, 2002.
11. Tibshirani et al., “Estimating the number of clusters in a dataset via the gap statistic”, Tech Report, Stanford, 2000.
12. Tseng et al., “Tight clustering: a resampling-based approach for identifying stable and tight patterns in data”, Tech Report, 2003.
13. Dudoit and Fridlyand, “A prediction-based resampling method for estimating the number of clusters in a dataset”, Genome Biology, 2002.
14.
Dudoit and Fridlyand, “Bagging to improve the accuracy of a clustering procedure”, Bioinformatics, 2003.
15. Kaufman and Rousseeuw, “Clustering by means of medoids”, Elsevier/North-Holland, 1987.
16. See the many articles by Leo Breiman on aggregation.
117 Microarray Workshop