MIT CBCL/AI — Class 19: Bioinformatics
S. Mukherjee, R. Rifkin, G. Yeo, and T. Poggio

What is bioinformatics?

Pre-1995: the application of computing technology to provide statistical and database solutions to problems in molecular biology.

Post-1995: defining and addressing problems in molecular biology using methodologies from statistics and computer science. Examples: the genome project, genome-wide analysis/screening of disease, genetic regulatory networks, analysis of expression data.

Central dogma

The central dogma of biology: DNA → RNA (pre-mRNA → mRNA) → protein.

Basic molecular biology

DNA: CGAACAAACCTCGAACCTGCT
→ (transcription)
mRNA: GCU UGU UUA CGA
→ (translation)
Polypeptide: Ala Cys Leu Arg

Less basic molecular biology

Transcription → end modification → splicing → transport → translation.

Biological information

Sequence information: DNA → (transcription) → pre-RNA → (splicing) → mRNA → (translation, stability) → protein → (transport, localization).

Quantitative information: microarrays, RT-PCR, protein arrays, yeast two-hybrid assays, chemical screens.

Sequence problems

• Splice sites and branch points in eukaryotic pre-mRNA
• Gene finding in prokaryotes and eukaryotes
• Promoter recognition (transcription and termination)
• Protein structure prediction
• Protein function prediction
• Protein family classification

Gene expression problems

• Predict tissue morphology
• Predict treatment/drug effect
• Infer metabolic pathways
• Infer disease pathways
• Infer developmental pathways

Microarray technology

Basic idea: the state of the cell is determined by its proteins. A gene codes for a protein, which is assembled via mRNA, so measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein. The number of mRNA copies is the expression level of a gene. Microarray technology lets us measure the expression of thousands of genes at once (cf. the Northern blot, which measures one gene at a time). We measure the expression of thousands of genes under different experimental conditions and ask what is different and why; a minimal sketch of this kind of comparison follows.
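To make "ask what is different and why" concrete, here is a minimal sketch (not from the slides) that ranks genes by the signal-to-noise statistic of Golub et al., which these slides use later for gene ranking. The expression matrix, sample sizes, and the planted differential genes are hypothetical placeholders.

```python
import numpy as np

def signal_to_noise(X, y):
    """Rank genes by the Golub et al. signal-to-noise statistic:
    s2n_g = (mu0_g - mu1_g) / (sigma0_g + sigma1_g),
    where mu/sigma are the per-class mean and std of gene g.
    X: (samples x genes) expression matrix; y: binary labels in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    s0, s1 = X0.std(axis=0), X1.std(axis=0)
    return (mu0 - mu1) / (s0 + s1 + 1e-12)  # epsilon guards against division by zero

# Hypothetical data: 20 samples (10 per condition), 7000 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 7000))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, :5] += 2.0  # make the first 5 genes differentially expressed

s2n = signal_to_noise(X, y)
top = np.argsort(-np.abs(s2n))[:10]  # most informative genes, by |s2n|
print("Top genes by |signal-to-noise|:", top)
```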
Microarray technology

[Figure (Ramaswamy and Golub, JCO): the two array technologies. Oligonucleotide arrays: RNA from a biological sample is labeled (PE) and hybridized to an array built by oligonucleotide synthesis. cDNA arrays: test and reference samples are labeled with Cy3 and Cy5 and hybridized to an array of PCR products spotted from a cDNA clone library.]

Microarray technology

[Figure: oligonucleotide vs. cDNA arrays (Lockhart and Winzeler, 2000).]

Microarray experiment

[Figure: a yeast microarray experiment.]

Analytic challenge

When the science is not well understood, resort to statistics: infer cancer genetics by analyzing microarray data from tumors.

• Ultimate goal: discover the genetic pathways of cancers.
• Immediate goal: build models that discriminate tumor types or treatment outcomes, and determine the genes used in the model.
• Basic difficulty: few examples (20-100) and high dimensionality (7,000-16,000 genes measured for each sample); an ill-posed problem.
• Curse of dimensionality: far too few examples for so many dimensions to predict accurately.

Cancer classification

38 training examples of myeloid (AML) and lymphoblastic (ALL) leukemias, measured on the Affymetrix Hu6800 array (7,128 genes including control genes); 34 examples to test the classifier. Results: 33/34 correct.

[Figure: test data plotted by d, the perpendicular distance from the separating hyperplane.]

Coregulation and kernels

Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression levels.

Size of feature space: with 7,000 genes, the quadratic feature space has about 24 million features, so the fact that the feature space is never computed explicitly is important.

Two-gene example, with genes measuring Sonic hedgehog and TrkC:

$x = (e_{sh}, e_{Trk})$
$\phi(x) = \left(e_{sh}^2,\; e_{Trk}^2,\; \sqrt{2}\,e_{Trk}e_{sh},\; \sqrt{2}\,e_{sh},\; \sqrt{2}\,e_{Trk},\; 1\right)$
$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = (x_i \cdot x_j + 1)^2$

Gene coregulation

A nonlinear SVM helps when the most informative genes are removed ("informative" as ranked by signal-to-noise, Golub et al.). Errors by polynomial kernel degree:

Genes removed | 1st order | 2nd order | 3rd order
0             | 1         | 1         | 1
10            | 2         | 1         | 1
20            | 3         | 2         | 1
30            | 3         | 3         | 2
40            | 3         | 3         | 2
50            | 3         | 2         | 2
100           | 3         | 3         | 2
200           | 3         | 3         | 3
1500          | 7         | 7         | 8

Rejecting samples

Golub et al. classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes. We need to introduce the concept of rejects to the SVM.

[Figure: two-gene plot (g1, g2) with Cancer and Normal regions separated by a Reject band.]

Rejecting samples

[Figure.]

Estimating a CDF

[Figure.]

The regularized solution

[Figure.]

Rejections for SVMs

[Figure: P(c = 1 | d) plotted against 1/d; the 95% confidence level (p = .05) corresponds to d = .107.]

Results with rejections

Results: 31 correct, 3 rejected, of which 1 is an error.

[Figure: test data plotted by distance d from the hyperplane.]

Gene selection

• SVMs as stated use all genes/features.
• Molecular biologists and oncologists seem convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes that are most important in discriminating.
• Practical reasons: a clinical device measuring thousands of genes is not financially practical.
• Possible performance improvement.
• Wrapper methods for gene/feature selection.

Results with gene selection

AML vs. ALL: with 40 genes, 34/34 correct, 0 rejects; with 5 genes, 31/31 correct, 3 rejects of which 1 is an error. B vs. T cells for ALL: with 10 genes, 33/33 correct, 0 rejects.

[Figures: test data plotted by distance d from the hyperplane.]

Two feature selection algorithms

Recursive feature elimination (RFE): based on perturbation analysis; eliminate the genes that perturb the margin the least.

Optimize leave-one-out (LOO): based on optimizing the leave-one-out error of an SVM; the leave-one-out error is an almost unbiased estimate of generalization error.

Recursive feature elimination

1. Solve the SVM problem for the weight vector w.
2. Rank-order the elements of w by absolute value.
3. Discard the input features/genes corresponding to the vector elements with small absolute magnitude (e.g., the smallest 10%).
4. Retrain the SVM on the reduced gene set and go to step (2).

A minimal sketch of this loop follows.
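A minimal RFE sketch, assuming scikit-learn's linear `SVC` (the slides do not name an implementation); the demo data are hypothetical, and the 10% drop fraction mirrors step (3) above.

```python
import numpy as np
from sklearn.svm import SVC

def rfe_linear_svm(X, y, n_keep=40, drop_frac=0.10, C=1.0):
    """Recursive feature elimination with a linear SVM:
    train, rank features by |w|, drop the smallest ~10%, retrain."""
    active = np.arange(X.shape[1])  # indices of surviving genes
    while active.size > n_keep:
        clf = SVC(kernel="linear", C=C).fit(X[:, active], y)
        w = np.abs(clf.coef_).ravel()  # one weight per surviving gene
        # Drop ~drop_frac of features, but never overshoot n_keep.
        n_drop = max(1, min(int(drop_frac * active.size), active.size - n_keep))
        keep = np.argsort(w)[n_drop:]  # discard the smallest |w|
        active = active[np.sort(keep)]
    return active

# Hypothetical data: 38 samples, 500 genes, 5 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 500))
y = np.array([0] * 27 + [1] * 11)
X[y == 1, :5] += 1.5

selected = rfe_linear_svm(X, y, n_keep=40)
print("Selected genes:", selected[:10], "...")
```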
Optimizing the LOO

Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, by searching over all possible subsets of n features for the ones that minimize the bound.

When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables. The rescaling can be done in the input space or in a "principal components" space.

Pictorial demonstration

[Figure: rescaling the features (x1, x2) to minimize the LOO bound shrinks R²/M² from R²/M² > 1 toward R²/M² = 1, where M = R.]

Three LOO bounds

• Radius-margin bound: simple to compute and continuous; very loose, but often tracks the LOO error well.
• Jaakkola-Haussler bound: somewhat tighter and simple to compute; discontinuous, so it needs to be smoothed; valid only for SVMs with no b term.
• Span bound: tight as a Britney Spears outfit; complicated to compute; discontinuous, so it needs to be smoothed.

Classification function with scaling

We add a scaling parameter σ to the SVM, which scales the genes; genes corresponding to small σ_j are removed. The SVM function has the form

$f(x) = \sum_{i \in SV} y_i \alpha_i K_\sigma(x, x_i) + b,$

where $K_\sigma(x, y) = K(x \ast \sigma,\, y \ast \sigma)$, $\sigma \in \mathbb{R}^n$, and $u \ast v$ denotes element-by-element multiplication.

SVM and other functionals

The α's are computed by maximizing the quadratic form

$W^2(\alpha, \sigma) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \left(K_\sigma(x_i, x_j) + \tfrac{1}{C}\delta_{ij}\right), \quad \alpha_i \ge 0.$

For the radius around the data, maximize

$R^2(\beta, \sigma) = \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j), \quad \beta_i \ge 0 \;\text{and}\; \sum_i \beta_i = 1.$

For the variance around the data,

$V^2(\sigma) = \frac{1}{\ell} \sum_i K_\sigma(x_i, x_i) - \frac{1}{\ell^2} \sum_{i,j} K_\sigma(x_i, x_j).$

Remember the bound on the LOO error:

$L(D) \le T = \frac{1}{\ell}\, W^2(\alpha, \sigma)\, R^2(\beta, \sigma) \quad \text{or} \quad T = \frac{1}{\ell}\, W^2(\alpha, \sigma)\, V^2(\sigma).$

Algorithm

The following steps are used to compute α and σ:
1. Initialize σ = (1, ..., 1).
2. Solve the standard SVM problem: α(σ) = argmax_α W²(α, σ).
3. Minimize the error estimate T with respect to σ with a gradient step.
4. If a local minimum of T has not been reached, go to step 3.
5. Discard the dimensions corresponding to small elements of σ and return to step 2.

Computing gradients

The gradient with respect to the margin:

$\frac{\partial W^2}{\partial \sigma_f} = -\sum_{i,j=1}^{\ell} y_i y_j \alpha_i \alpha_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$

The gradient with respect to the radius:

$\frac{\partial R^2}{\partial \sigma_f} = \sum_{i=1}^{\ell} \beta_i \frac{\partial K_\sigma(x_i, x_i)}{\partial \sigma_f} - \sum_{i,j=1}^{\ell} \beta_i \beta_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$

The gradient with respect to the span:

$\frac{\partial S_i^2}{\partial \sigma_f} = S_i^4 \left( \tilde{K}_{SV}^{-1} \frac{\partial \tilde{K}_{SV}}{\partial \sigma_f} \tilde{K}_{SV}^{-1} \right)_{ii}, \quad \text{where } \tilde{K}_{SV} = \begin{pmatrix} K_{SV} & \mathbf{1} \\ \mathbf{1}^T & 0 \end{pmatrix}.$

Toy data

[Figure: error rate vs. number of samples for a linear problem with 6 relevant dimensions out of 202, and a nonlinear problem with 2 relevant dimensions out of 52.]

Molecular classification of cancer

Dataset                     | Total Samples | Class 0     | Class 1
Leukemia Morphology (train) | 38            | 27 ALL      | 11 AML
Leukemia Morphology (test)  | 34            | 20 ALL      | 14 AML
Leukemia Lineage (ALL)      | 23            | 15 B-Cell   | 8 T-Cell
Leukemia Outcome (AML)      | 15            | 8 Low risk  | 7 High risk
Lymphoma Morphology         | 77            | 19 FSC      | 58 DLCL
Lymphoma Outcome            | 58            | 22 Low risk | 36 High risk
Brain Morphology            | 41            | 14 Glioma   | 27 MD
Brain Outcome               | 50            | 38 Low risk | 12 High risk

Hierarchy of difficulty:
1. Histological differences: normal vs. malignant, skin vs. brain.
2. Morphologies: different leukemia types, ALL vs. AML.
3. Lineage: B-cell vs. T-cell, follicular vs. large B-cell lymphoma.
4. Outcome: treatment outcome, relapse, or drug sensitivity.

The tables below compare an SVM against weighted voting (WV, Golub et al.) and k-NN; a sketch of the weighted-voting rule follows.
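For reference when reading the tables, a minimal sketch of the weighted-voting (WV) classifier of Golub et al.: each selected gene casts a vote weighted by its signal-to-noise statistic. Details such as normalization may differ from the original; the usage lines assume hypothetical `X_train`/`X_test` matrices.

```python
import numpy as np

def wv_fit(X, y, n_genes=50):
    """Weighted voting (Golub et al.), a sketch: gene g gets weight
    a_g = (mu0_g - mu1_g) / (s0_g + s1_g) and decision point
    b_g = (mu0_g + mu1_g) / 2; the n_genes with largest |a_g| vote."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s0, s1 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    a = (mu0 - mu1) / (s0 + s1 + 1e-12)
    b = (mu0 + mu1) / 2.0
    genes = np.argsort(-np.abs(a))[:n_genes]
    return genes, a[genes], b[genes]

def wv_predict(X, genes, a, b):
    votes = a * (X[:, genes] - b)  # one signed vote per selected gene
    total = votes.sum(axis=1)      # positive total -> class 0
    return np.where(total > 0, 0, 1)

# Hypothetical usage on (samples x genes) expression matrices:
# genes, a, b = wv_fit(X_train, y_train, n_genes=50)
# y_hat = wv_predict(X_test, genes, a, b)
```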
Morphology classification

Dataset                                 | Algorithm | Total Samples | Total errors | Class 1 errors | Class 0 errors | Number of Genes
Leukemia Morphology (test), AML vs. ALL | SVM       | 35            | 0/35         | 0/21           | 0/14           | 40
                                        | WV        | 35            | 2/35         | 1/21           | 1/14           | 50
                                        | k-NN      | 35            | 3/35         | 1/21           | 2/14           | 10
Leukemia Lineage (ALL), B vs. T         | SVM       | 23            | 0/23         | 0/15           | 0/8            | 10
                                        | WV        | 23            | 0/23         | 0/15           | 0/8            | 9
                                        | k-NN      | 23            | 0/23         | 0/15           | 0/8            | 10
Lymphoma, FS vs. DLCL                   | SVM       | 77            | 4/77         | 2/32           | 2/35           | 200
                                        | WV        | 77            | 6/77         | 1/32           | 5/35           | 30
                                        | k-NN      | 77            | 3/77         | 1/32           | 2/35           | 250
Brain, MD vs. Glioma                    | SVM       | 41            | 1/41         | 1/27           | 0/14           | 100
                                        | WV        | 41            | 1/41         | 1/27           | 0/14           | 3
                                        | k-NN      | 41            | 0/41         | 0/27           | 0/14           | 5

Outcome classification

Dataset                         | Algorithm | Total Samples | Total errors | Class 1 errors | Class 0 errors | Number of Genes
Lymphoma, LBC treatment outcome | SVM       | 58            | 13/58        | 3/32           | 10/26          | 100
                                | WV        | 58            | 15/58        | 5/32           | 10/26          | 12
                                | k-NN      | 58            | 15/58        | 8/32           | 7/26           | 15
Brain, MD treatment outcome     | SVM       | 50            | 7/50         | 6/12           | 1/38           | 50
                                | WV        | 50            | 13/50        | 6/12           | 7/38           | 6
                                | k-NN      | 50            | 10/50        | 6/12           | 4/38           | 5

Outcome classification

Error rates ignore temporal information such as when a patient dies; survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.

[Figure: Kaplan-Meier survival curves for the predicted groups; lymphoma p-val = 0.0015, medulloblastoma p-val = 0.00039.]

Multi-tumor classification

Tumor        | Abbrev. | Total | Train | Test
Breast       | B       | 11    | 8     | 3
Prostate     | P       | 10    | 8     | 2
Lung         | L       | 11    | 8     | 3
Colorectal   | CR      | 13    | 8     | 5
Lymphoma     | Ly      | 22    | 16    | 6
Bladder      | Bl      | 11    | 8     | 3
Melanoma     | M       | 10    | 8     | 2
Uterus       | U       | 10    | 8     | 3
Leukemia     | Le      | 30    | 24    | 6
Renal        | R       | 11    | 8     | 3
Pancreas     | PA      | 11    | 8     | 3
Ovary        | Ov      | 11    | 8     | 3
Mesothelioma | MS      | 11    | 8     | 3
Brain        | C       | 20    | 16    | 4

Note that most of these tumor samples came from secondary sites and were not taken at the tissue of origin.

Clustering is not accurate

CNS, lymphoma, and leukemia tumors separate; the adenocarcinomas do not separate.

Multi-tumor classification

Combination approaches: all pairs, and one versus all (OVA).

[Figure: encoding classes R, B, G, Y by binary splits, e.g. B+R vs. the rest → +1, +1, -1, -1 and G+R vs. the rest → +1, -1, +1, -1.]

Supervised methodology

[Figure 2: the OVA scheme. A test sample from the gene expression dataset is run through each OVA classifier (Breast, CNS, ..., Prostate); each classifier outputs a confidence (the signed distance to its hyperplane, e.g. from -2 to +2 for breast vs. all other tumors), and the final multiclass call is the class with the highest OVA prediction strength, e.g. Breast (high confidence).] A minimal sketch of the OVA call appears after the poorly differentiated results below.

Well differentiated tumors

[Figures: accuracy and fraction of calls vs. confidence (low to high) for train/cross-validation and Test 1; correct calls concentrate at high confidence and errors at low confidence; accuracy of first, top-2, and top-3 prediction calls.]

Dataset | Sample Type         | Validation Method | Sample Number | Total Accuracy | High conf. fraction | High conf. accuracy | Low conf. fraction | Low conf. accuracy
Train   | Well differentiated | Cross-val.        | 144           | 78%            | 80%                 | 90%                 | 20%                | 28%
Test 1  | Well differentiated | Train/test        | 54            | 78%            | 78%                 | 83%                 | 22%                | 58%

Feature selection hurts performance

[Figure.]

Poorly differentiated tumors

[Figures: accuracy and fraction of calls vs. confidence for the poorly differentiated test set.]

Dataset | Sample Type           | Validation Method | Sample Number | Total Accuracy | High conf. fraction | High conf. accuracy | Low conf. fraction | Low conf. accuracy
Test    | Poorly differentiated | Train/test        | 20            | 30%            | 50%                 | 50%                 | 50%                | 10%
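A minimal sketch of the OVA scheme described above, assuming scikit-learn's linear `SVC` (an implementation choice, not necessarily what produced these results): train one classifier per tumor class and call the class whose classifier is most confident.

```python
import numpy as np
from sklearn.svm import SVC

def ova_fit(X, y, classes, C=1.0):
    """Train one linear SVM per class: that class (+1) vs. all others (-1)."""
    return {c: SVC(kernel="linear", C=C).fit(X, (y == c).astype(int) * 2 - 1)
            for c in classes}

def ova_predict(X, models):
    """Final call = class with the highest OVA prediction strength
    (signed distance to that class's hyperplane)."""
    classes = list(models)
    conf = np.column_stack([models[c].decision_function(X) for c in classes])
    calls = [classes[i] for i in conf.argmax(axis=1)]
    strength = conf.max(axis=1)  # low values correspond to low-confidence calls
    return calls, strength

# Hypothetical usage with the tumor abbreviations from the table above:
# models = ova_fit(X_train, y_train, classes=["B", "P", "L", "CR"])
# calls, strength = ova_predict(X_test, models)
```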
Predicting sample sizes

For a given number of existing samples, how significant is the performance of a classifier? Does the classifier work at all? Are the results better than what one would expect by chance?

Assuming we know the answers to the previous questions for a number of sample sizes: what would the performance of the classifier be when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?

Predicting sample sizes

Subsampling procedure: from the input dataset of ℓ samples, draw train/test realizations (i = 1, 2, ..., T1) at training-set sizes n1, n2, ..., ℓ; train a classifier on each realization, test it on the held-out samples, and average the error rates to obtain e_{n1}, e_{n2}, ..., e_ℓ. A significance test (comparison with random predictors) labels each sample size as significant or not. Fitting the significant points with the learning-curve model

$e(n) = a\, n^{-b}$

gives an error rate that decays with sample size; this model can be used to estimate sample-size requirements. A sketch of the fit appears after the figures below.

Statistical significance

[Figures a), b): the statistical significance of the tumor vs. non-tumor classification for a) 15 samples and b) 30 samples.]

Learning curves

[Figure: tumor vs. normal.]

Learning curves

[Figures a), b): tumor vs. normal.]

Learning curves

[Figure: AML vs. ALL.]

Learning curves

[Figure: colon cancer.]

Learning curves

[Figure: brain treatment outcome.]
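A minimal sketch of the learning-curve fit, assuming the power-law model e(n) = a·n^(-b) reconstructed above (the exact model form on the original slide is unclear); the subsampled error rates are hypothetical.

```python
import numpy as np

# Hypothetical averaged error rates from the subsampling procedure.
n = np.array([10, 15, 20, 25, 30])            # training-set sizes
e = np.array([0.35, 0.28, 0.24, 0.21, 0.19])  # mean test error at each size

# Fit e(n) = a * n**(-b) by least squares in log-log space:
# log e = log a - b * log n.
slope, intercept = np.polyfit(np.log(n), np.log(e), 1)
a, b = np.exp(intercept), -slope
print(f"fit: e(n) = {a:.2f} * n^(-{b:.2f})")

# Extrapolate the samples needed to reach a target error rate:
# a * n**(-b) = e_target  =>  n = (a / e_target)**(1 / b).
e_target = 0.10
n_needed = (a / e_target) ** (1.0 / b)
print(f"estimated samples for {e_target:.0%} error: {n_needed:.0f}")
```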