Data Mining in Genomics: the Dawn of Personalized Medicine
Gregory Piatetsky-Shapiro, KDnuggets, www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003

Overview
- Data Mining and Knowledge Discovery
- Genomics and Microarrays
- Microarray Data Mining

Trends Leading to the Data Flood
- More data is generated:
  - bank, telecom, and other business transactions ...
  - scientific data: astronomy, biology, etc.
  - web, text, and e-commerce
- More data is captured:
  - storage technology is faster and cheaper
  - DBMSs are capable of handling bigger databases

Knowledge Discovery Process
[Diagram: the knowledge discovery process runs from Raw Data, through Integration into a Data Warehouse, selection of Target Data, transformation into Transformed Data, and mining of Patterns and Rules, to Interpretation & Evaluation that yields Knowledge and Understanding.]

Major Data Mining Tasks
- Classification: predicting an item's class
- Clustering: finding clusters in data
- Associations: e.g., A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Estimation: predicting a continuous value
- Deviation Detection: finding changes
- Link Analysis: finding relationships

Major Application Areas for Data Mining Solutions
- Advertising
- Bioinformatics
- Customer Relationship Management (CRM)
- Database Marketing
- Fraud Detection
- eCommerce
- Health Care
- Investment/Securities
- Manufacturing, Process Control
- Sports and Entertainment
- Telecommunications
- Web

Genome, DNA & Gene Expression
- An organism's genome is the "program" for making the organism, encoded in DNA
- Human DNA has about 30,000-35,000 genes
- A gene is a segment of DNA that specifies how to make a protein
- Cells are different because of differential gene expression
- About 40% of human genes are expressed at any one time
- Microarray devices measure gene expression

Molecular Biology Overview
[Diagram: cell, nucleus, chromosome, gene (DNA), gene expression into mRNA (single strand), and protein. Graphics courtesy of the National Human Genome Research Institute.]

Affymetrix Microarrays
- Chip is 1.28 cm across, with 50 um probe cells
- ~10^7 oligonucleotides: half perfectly match mRNA (PM), half have one mismatch (MM)
- Gene expression is computed from the PM and MM intensities

Affymetrix Microarray Raw Image
[Scanner image; an enlarged section of the raw image is converted to raw data:]
  Gene              Value
  D26528_at          193
  D26561_cds1_at     -70
  D26561_cds2_at     144
  D26561_cds3_at      33
  D26579_at          318
  D26598_at         1764
  D26599_at         1537
  D26600_at         1204
  D28114_at          707

Microarray Potential Applications
- New and better molecular diagnostics
- New molecular targets for therapy: few new drugs, large pipeline, ...
- Outcome depends on genetic signature: which treatment is best?
- Fundamental biological discovery: finding and refining biological pathways
- Personalized medicine?!

Microarray Data Mining Challenges
- Avoiding false positives, due to:
  - too few records (samples), usually < 100
  - too many columns (genes), usually > 1,000
- Model needs to be robust in the presence of noise
- For reliability we need large gene sets; for diagnostics or drug targets, we need small gene sets
- Estimate class probability
- Model needs to be explainable to biologists

False Positives in Astronomy
[Cartoon, used with permission.]

CATs: Clementine Application Templates
- CATs are examples of complete data mining processes
- Microarray CAT: Preparation, MultiClass, 2-Class, Clustering

Key Ideas
- Capture the complete process
- Cross-validation loop with feature selection inside (sketched in the code below)
- Randomization to select significant genes
- Internal iterative feature selection loop
- For each class, separate selection of an optimal gene set
- Neural nets: robust in the presence of noise
- Bagging of neural nets
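The "cross-validation loop with feature selection inside" and the T-value gene ranking used throughout the talk can be sketched as follows. This is a minimal illustration, not the Clementine streams actually used: the function names, the Welch-style t-statistic, and the small scikit-learn MLP standing in for the bagged neural nets are my own assumptions.

```python
# Sketch: outer cross-validation with gene (feature) selection done INSIDE each
# fold, so no information from held-out samples leaks into the selection.
# Assumes a 2-class problem; X is samples x genes, y is a 0/1 numpy array.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

def t_values(X, y):
    """Two-sample t-statistic of the mean difference for every gene (column)."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return num / (den + 1e-12)

def select_genes(X, y, n_per_class=10):
    """Pick an equal number of top genes per class (most negative / most positive t)."""
    order = np.argsort(t_values(X, y))
    return np.concatenate([order[:n_per_class], order[-n_per_class:]])

def cv_error(X, y, n_per_class=10, n_folds=10, seed=0):
    """Average error over stratified folds; selection is redone on each training split."""
    errors = []
    for train, test in StratifiedKFold(n_folds, shuffle=True, random_state=seed).split(X, y):
        genes = select_genes(X[train], y[train], n_per_class)   # selection inside the fold
        clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=seed)
        clf.fit(X[train][:, genes], y[train])
        errors.append(np.mean(clf.predict(X[test][:, genes]) != y[test]))
    return float(np.mean(errors))
```

The iterative wrapper described later in the talk would simply call cv_error for n_per_class in 1, 2, ..., 10, 20, ..., 100, average 10+ cross-validation runs, and keep the gene-set size with the lowest average error.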
Microarray Classification
[Diagram: the data is split into train data and test data; feature and parameter selection and model building are done on the train data, and the resulting model is evaluated on the test data.]

Classification: External X-Validation
[Diagram: the gene data is first split into a train set and a final test set; feature and parameter selection, model building, and evaluation run on the train set only; the final model is then applied to the final test set to produce the final results.]

Measuring False Positives with Randomization
- Start from the table of gene values with the true class labels (classes 1 and 2)
- Randomize the class labels 500 times and recompute the gene T-values each time
- The bottom 1% of the randomized T-value distribution falls at T = -2.08
- Select potentially interesting genes at the 1% level

Gene Reduction Improves Classification
- Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small number of records and a large number of genes
- Classification accuracy improves if we first reduce the number of genes by a linear method, e.g., T-values of the mean difference
- Heuristic: select an equal number of genes from each class
- Then apply a favorite machine learning algorithm

Iterative Wrapper Approach to Selecting the Best Gene Set
- Test models using the top 1, 2, 3, ..., 10, 20, 30, 40, ..., 100 genes, with cross-validation
- Heuristic 1: evaluate the errors for each class; select the number of genes per class that minimizes the error for that class
- For randomized algorithms, average 10+ cross-validation runs
- Select the gene set with the lowest average error

Clementine Stream for Subset Selection by X-Validation
[Screenshot of the Clementine stream.]

Microarrays: ALL/AML Example
- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
- 72 examples (38 train, 34 test), about 7,000 genes
- Well-studied (CAMDA-2000), a good test example
[Images of ALL and AML samples: visually similar, but genetically very different.]

Gene Subset Selection: One Cross-Validation Run
[Chart: average error of 10-fold cross-validation (0-30%) vs number of genes per class (1 to 40), for a single cross-validation run.]

Gene Subset Selection: Multiple Cross-Validation Runs
- For the ALL/AML data, 10 genes per class had the lowest error (< 1%)
- [Chart] The point in the center is the average error from 10 cross-validation runs; bars indicate one standard deviation above and below

ALL/AML: Results on the Test Data
- Genes were selected and the model trained on the train set ONLY
- The best net with the top 10 genes per class (20 overall) was applied to the test data (34 samples): 33 correct predictions (97% accuracy), 1 error on sample 66 (actual class AML, net prediction ALL)
- Other methods consistently misclassify sample 66; was it misclassified by a pathologist?

Pediatric Brain Tumour Data
- 92 samples, 5 classes (MED, MGL, RHB, EPD, JPA), from U. of Chicago Children's Hospital
- Outer cross-validation with gene selection inside the loop
- Ranking by absolute T-test value (selects the top positive and negative genes)
- Select the best genes by adjusted error for each class
- Bagging of 100 neural nets

Selecting the Best Gene Set
- Minimizing the combined error for all classes is not optimal
[Chart: average, high, and low error rates across all classes vs genes per class.]

Error Rates for Each Class
[Chart: error rate vs genes per class, one curve per class.]

Evaluating One Network
Averaged over 100 networks:
  Class   Error rate
  MED      2.1%
  MGL     17%
  RHB     24%
  EPD      9%
  JPA     19%
  *ALL*    8.3%

Bagging 100 Networks
  Class   Individual error rate   Bag error rate   Bag avg confidence
  MED      2.1%                    2% (0)*          98%
  MGL     17%                     10%               83%
  RHB     24%                     11%               76%
  EPD      9%                      0                91%
  JPA     19%                      0                81%
  *ALL*    8.3%                    3% (2)*          92%
* Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
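Following up on the bagging results above, here is a minimal sketch of "bagging of neural nets": train many small networks on bootstrap resamples of the training set and average their class probabilities. scikit-learn is used here as a stand-in for the talk's Clementine setup; 100 estimators mirrors the slide, and everything else (network size, iteration limit) is an illustrative assumption.

```python
# Sketch: bag of 100 small neural nets; predictions and class probabilities
# are averaged over the bag.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

def bagged_net(n_estimators=100, seed=0):
    base = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=seed)
    # Each net is trained on a bootstrap sample of the training data.
    return BaggingClassifier(base, n_estimators=n_estimators, random_state=seed)

def predict_with_confidence(model, X):
    """Return the predicted class and the bag's average confidence in it."""
    proba = model.predict_proba(X)                 # mean class probability over the bag
    labels = model.classes_[proba.argmax(axis=1)]
    return labels, proba.max(axis=1)

# Usage sketch (X_train, y_train, X_test, genes assumed to exist):
#   model = bagged_net().fit(X_train[:, genes], y_train)
#   labels, confidence = predict_with_confidence(model, X_test[:, genes])
```

My reading of the table above is that the "Bag avg confidence" column is this kind of averaged probability for the predicted class.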
AF1q: a New Marker for Medulloblastoma?
- AF1Q: ALL1-fused gene from chromosome 1q; a transmembrane protein
- Related to leukemia (3 PubMed entries) but not to medulloblastoma

Future Directions for Microarray Analysis
- Algorithms optimized for small samples
- Integration with other data: biological networks, medical text, protein data
- Cost-sensitive classification algorithms: the error cost depends on the outcome (we don't want to miss a treatable cancer), treatment side effects, etc. (a small sketch follows below)

Acknowledgements
- Eric Bremer, Children's Hospital (Chicago) & Northwestern U.
- Greg Cooper, U. Pittsburgh
- Tom Khabaza, SPSS
- Sridhar Ramaswamy, MIT/Whitehead Institute
- Pablo Tamayo, MIT/Whitehead Institute

Thank You
Further resources:
- Data Mining: www.KDnuggets.com
- Microarrays: www.KDnuggets.com/websites/microarray.html
- Contact: Gregory Piatetsky-Shapiro: www.kdnuggets.com/gps.html

© 2003 KDnuggets
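As a closing illustration of the cost-sensitive classification direction mentioned above: instead of predicting the most probable class, predict the class with the lowest expected cost. The class names and cost values below are hypothetical, chosen only to show the mechanism; they are not from the talk.

```python
# Sketch: minimum-expected-cost decision rule.
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i (hypothetical numbers)
cost = np.array([
    [0.0, 10.0],   # true "treatable cancer": missing it is made very costly
    [1.0,  0.0],   # true "benign": a false alarm costs much less
])

def min_expected_cost_decision(class_probs, cost):
    """Pick, per sample, the class whose expected cost under the model's probabilities is lowest."""
    expected = class_probs @ cost      # (n_samples, n_classes) expected cost of each prediction
    return expected.argmin(axis=1)

# Example: a sample the model thinks is only 20% likely to be a treatable cancer
# is still flagged, because 0.2 * 10 > 0.8 * 1 for the "benign" prediction.
probs = np.array([[0.2, 0.8]])
print(min_expected_cost_decision(probs, cost))   # -> [0]
```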