Data Mining in Genomics: the Dawn of Personalized Medicine
Gregory Piatetsky-Shapiro, KDnuggets, www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003

Overview
- Data Mining and Knowledge Discovery
- Genomics and Microarrays
- Microarray Data Mining

Trends Leading to the Data Flood
- More data is generated:
  - bank, telecom, and other business transactions ...
  - scientific data: astronomy, biology, etc.
  - web, text, and e-commerce
- More data is captured:
  - storage technology is faster and cheaper
  - DBMSs are capable of handling bigger databases

Knowledge Discovery Process
[Diagram: the knowledge discovery process runs from Raw Data, through Integration into a Data Warehouse, selection of Target Data, transformation into Transformed Data, and mining of Patterns and Rules, to Interpretation & Evaluation that yields Knowledge and Understanding.]

Major Data Mining Tasks
- Classification: predicting an item's class
- Clustering: finding clusters in data
- Associations: e.g., A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Estimation: predicting a continuous value
- Deviation Detection: finding changes
- Link Analysis: finding relationships

Major Application Areas for Data Mining Solutions
- Advertising
- Bioinformatics
- Customer Relationship Management (CRM)
- Database Marketing
- Fraud Detection
- eCommerce
- Health Care
- Investment/Securities
- Manufacturing, Process Control
- Sports and Entertainment
- Telecommunications
- Web

Genome, DNA & Gene Expression
- An organism's genome is the "program" for making the organism, encoded in DNA
- Human DNA has about 30,000-35,000 genes
- A gene is a segment of DNA that specifies how to make a protein
- Cells are different because of differential gene expression
- About 40% of human genes are expressed at any one time
- Microarray devices measure gene expression

Molecular Biology Overview
[Diagram: cell, nucleus, chromosome, gene (DNA), gene expression into mRNA (single strand), and protein. Graphics courtesy of the National Human Genome Research Institute.]

Affymetrix Microarrays
- Chip is 1.28 cm across, with 50 um probe cells
- ~10^7 oligonucleotides: half perfectly match mRNA (PM), half have one mismatch (MM)
- Gene expression is computed from the PM and MM intensities

Affymetrix Microarray Raw Image
[Scanner image; an enlarged section of the raw image is converted to raw data:]
  Gene              Value
  D26528_at          193
  D26561_cds1_at     -70
  D26561_cds2_at     144
  D26561_cds3_at      33
  D26579_at          318
  D26598_at         1764
  D26599_at         1537
  D26600_at         1204
  D28114_at          707

Microarray Potential Applications
- New and better molecular diagnostics
- New molecular targets for therapy: few new drugs, large pipeline, ...
- Outcome depends on genetic signature: which treatment is best?
- Fundamental biological discovery: finding and refining biological pathways
- Personalized medicine?!

Microarray Data Mining Challenges
- Avoiding false positives, due to:
  - too few records (samples), usually < 100
  - too many columns (genes), usually > 1,000
- Model needs to be robust in the presence of noise
- For reliability we need large gene sets; for diagnostics or drug targets, we need small gene sets
- Estimate class probability
- Model needs to be explainable to biologists

False Positives in Astronomy
[Cartoon, used with permission.]

CATs: Clementine Application Templates
- CATs are examples of complete data mining processes
- Microarray CAT: Preparation, MultiClass, 2-Class, Clustering

Key Ideas
- Capture the complete process
- Cross-validation loop with feature selection inside (sketched in the code below)
- Randomization to select significant genes
- Internal iterative feature selection loop
- For each class, separate selection of an optimal gene set
- Neural nets: robust in the presence of noise
- Bagging of neural nets
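The "cross-validation loop with feature selection inside" and the T-value gene ranking used throughout the talk can be sketched as follows. This is a minimal illustration, not the Clementine streams actually used: the function names, the Welch-style t-statistic, and the small scikit-learn MLP standing in for the bagged neural nets are my own assumptions.

```python
# Sketch: outer cross-validation with gene (feature) selection done INSIDE each
# fold, so no information from held-out samples leaks into the selection.
# Assumes a 2-class problem; X is samples x genes, y is a 0/1 numpy array.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

def t_values(X, y):
    """Two-sample t-statistic of the mean difference for every gene (column)."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return num / (den + 1e-12)

def select_genes(X, y, n_per_class=10):
    """Pick an equal number of top genes per class (most negative / most positive t)."""
    order = np.argsort(t_values(X, y))
    return np.concatenate([order[:n_per_class], order[-n_per_class:]])

def cv_error(X, y, n_per_class=10, n_folds=10, seed=0):
    """Average error over stratified folds; selection is redone on each training split."""
    errors = []
    for train, test in StratifiedKFold(n_folds, shuffle=True, random_state=seed).split(X, y):
        genes = select_genes(X[train], y[train], n_per_class)   # selection inside the fold
        clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=seed)
        clf.fit(X[train][:, genes], y[train])
        errors.append(np.mean(clf.predict(X[test][:, genes]) != y[test]))
    return float(np.mean(errors))
```

The iterative wrapper described later in the talk would simply call cv_error for n_per_class in 1, 2, ..., 10, 20, ..., 100, average 10+ cross-validation runs, and keep the gene-set size with the lowest average error.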
Microarray Classification
[Diagram: the data is split into train data and test data; feature and parameter selection and model building are done on the train data, and the resulting model is evaluated on the test data.]

Classification: External X-Validation
[Diagram: the gene data is first split into a train set and a final test set; feature and parameter selection, model building, and evaluation run on the train set only; the final model is then applied to the final test set to produce the final results.]

Measuring False Positives with Randomization
- Start from the table of gene values with the true class labels (classes 1 and 2)
- Randomize the class labels 500 times and recompute the gene T-values each time
- The bottom 1% of the randomized T-value distribution falls at T = -2.08
- Select potentially interesting genes at the 1% level

Gene Reduction Improves Classification
- Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small number of records and a large number of genes
- Classification accuracy improves if we first reduce the number of genes by a linear method, e.g., T-values of the mean difference
- Heuristic: select an equal number of genes from each class
- Then apply a favorite machine learning algorithm

Iterative Wrapper Approach to Selecting the Best Gene Set
- Test models using the top 1, 2, 3, ..., 10, 20, 30, 40, ..., 100 genes, with cross-validation
- Heuristic 1: evaluate the errors for each class; select the number of genes per class that minimizes the error for that class
- For randomized algorithms, average 10+ cross-validation runs
- Select the gene set with the lowest average error

Clementine Stream for Subset Selection by X-Validation
[Screenshot of the Clementine stream.]

Microarrays: ALL/AML Example
- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
- 72 examples (38 train, 34 test), about 7,000 genes
- Well-studied (CAMDA-2000), a good test example
[Images of ALL and AML samples: visually similar, but genetically very different.]

Gene Subset Selection: One Cross-Validation Run
[Chart: average error of 10-fold cross-validation (0-30%) vs number of genes per class (1 to 40), for a single cross-validation run.]

Gene Subset Selection: Multiple Cross-Validation Runs
- For the ALL/AML data, 10 genes per class had the lowest error (< 1%)
- [Chart] The point in the center is the average error from 10 cross-validation runs; bars indicate one standard deviation above and below

ALL/AML: Results on the Test Data
- Genes were selected and the model trained on the train set ONLY
- The best net with the top 10 genes per class (20 overall) was applied to the test data (34 samples): 33 correct predictions (97% accuracy), 1 error on sample 66 (actual class AML, net prediction ALL)
- Other methods consistently misclassify sample 66; was it misclassified by a pathologist?

Pediatric Brain Tumour Data
- 92 samples, 5 classes (MED, MGL, RHB, EPD, JPA), from U. of Chicago Children's Hospital
- Outer cross-validation with gene selection inside the loop
- Ranking by absolute T-test value (selects the top positive and negative genes)
- Select the best genes by adjusted error for each class
- Bagging of 100 neural nets

Selecting the Best Gene Set
- Minimizing the combined error for all classes is not optimal
[Chart: average, high, and low error rates across all classes vs genes per class.]

Error Rates for Each Class
[Chart: error rate vs genes per class, one curve per class.]

Evaluating One Network
Averaged over 100 networks:
  Class   Error rate
  MED      2.1%
  MGL     17%
  RHB     24%
  EPD      9%
  JPA     19%
  *ALL*    8.3%

Bagging 100 Networks
  Class   Individual error rate   Bag error rate   Bag avg confidence
  MED      2.1%                    2% (0)*          98%
  MGL     17%                     10%               83%
  RHB     24%                     11%               76%
  EPD      9%                      0                91%
  JPA     19%                      0                81%
  *ALL*    8.3%                    3% (2)*          92%
* Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
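Following up on the bagging results above, here is a minimal sketch of "bagging of neural nets": train many small networks on bootstrap resamples of the training set and average their class probabilities. scikit-learn is used here as a stand-in for the talk's Clementine setup; 100 estimators mirrors the slide, and everything else (network size, iteration limit) is an illustrative assumption.

```python
# Sketch: bag of 100 small neural nets; predictions and class probabilities
# are averaged over the bag.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

def bagged_net(n_estimators=100, seed=0):
    base = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=seed)
    # Each net is trained on a bootstrap sample of the training data.
    return BaggingClassifier(base, n_estimators=n_estimators, random_state=seed)

def predict_with_confidence(model, X):
    """Return the predicted class and the bag's average confidence in it."""
    proba = model.predict_proba(X)                 # mean class probability over the bag
    labels = model.classes_[proba.argmax(axis=1)]
    return labels, proba.max(axis=1)

# Usage sketch (X_train, y_train, X_test, genes assumed to exist):
#   model = bagged_net().fit(X_train[:, genes], y_train)
#   labels, confidence = predict_with_confidence(model, X_test[:, genes])
```

My reading of the table above is that the "Bag avg confidence" column is this kind of averaged probability for the predicted class.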
AF1q: a New Marker for Medulloblastoma?
- AF1Q: ALL1-fused gene from chromosome 1q; a transmembrane protein
- Related to leukemia (3 PubMed entries) but not to medulloblastoma

Future Directions for Microarray Analysis
- Algorithms optimized for small samples
- Integration with other data: biological networks, medical text, protein data
- Cost-sensitive classification algorithms: the error cost depends on the outcome (we don't want to miss a treatable cancer), treatment side effects, etc. (a small sketch follows below)

Acknowledgements
- Eric Bremer, Children's Hospital (Chicago) & Northwestern U.
- Greg Cooper, U. Pittsburgh
- Tom Khabaza, SPSS
- Sridhar Ramaswamy, MIT/Whitehead Institute
- Pablo Tamayo, MIT/Whitehead Institute

Thank You
Further resources:
- Data Mining: www.KDnuggets.com
- Microarrays: www.KDnuggets.com/websites/microarray.html
- Contact: Gregory Piatetsky-Shapiro: www.kdnuggets.com/gps.html

© 2003 KDnuggets
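As a closing illustration of the cost-sensitive classification direction mentioned above: instead of predicting the most probable class, predict the class with the lowest expected cost. The class names and cost values below are hypothetical, chosen only to show the mechanism; they are not from the talk.

```python
# Sketch: minimum-expected-cost decision rule.
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i (hypothetical numbers)
cost = np.array([
    [0.0, 10.0],   # true "treatable cancer": missing it is made very costly
    [1.0,  0.0],   # true "benign": a false alarm costs much less
])

def min_expected_cost_decision(class_probs, cost):
    """Pick, per sample, the class whose expected cost under the model's probabilities is lowest."""
    expected = class_probs @ cost      # (n_samples, n_classes) expected cost of each prediction
    return expected.argmin(axis=1)

# Example: a sample the model thinks is only 20% likely to be a treatable cancer
# is still flagged, because 0.2 * 10 > 0.8 * 1 for the "benign" prediction.
probs = np.array([[0.2, 0.8]])
print(min_expected_cost_decision(probs, cost))   # -> [0]
```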