Novel Uses for Machine Learning and Other Computational Methods in the Design and Interpretation of Genetic Microarrays
Michael Molla
University of Wisconsin-Madison
Ph.D. Thesis Defense
August 23, 2007

My Research Topics
- Text Mining (Molla et al., Info. Sci. 2002; UW Tech. Report, 2004)
- Gene Chip Design (Choosing Probes) (Tobler, Molla et al., ISMB 2002)
- SNP Detection (Molla et al., CSB 2004; Albert et al., Nature Methods 2005)
- Model Simulation for Evolutionary Biology (Haag & Molla, Evolution 2005)
- Genome Copy-Number Segmentation (paper in preparation)
- The Pretraining Algorithm
- Direct Genomic Selection (paper submitted)
Topics for this talk: Genome Copy-Number Segmentation, The Pretraining Algorithm, Direct Genomic Selection

My Main Focus
- DNA microarrays, also known as "gene chips," have come into prominence
- Novel applications of machine learning can help to solve varied and important problems in this domain
- Solving these will advance science and change the practice of health care
[Image: NimbleGen glass slide microarray. Image courtesy of NimbleGen Systems.]

Background
Oligonucleotide Microarrays

Oligonucleotide Microarrays
- Specific probes are synthesized at known spots on the chip's surface
- Probes are complementary to the genetic material to be measured (known as the sample)
- If the probes do their job, labeled sample can be detected on the surface of the chip
[Figure labels: probes, surface, sample]

An Oligonucleotide Microarray
[Image: an oligonucleotide microarray]

Part 1
DETECTING COPY-NUMBER VARIATION

Motivation
- Gene copy-number variation is a common form of genetic variation in individuals
- Specific variations can be predictive of diseases, including cancer

Comparative Genomic Hybridization (CGH)
To find regions of copy-number variation:
- Tile probes across a genomic region
    Region Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…
    Complement:      CATCGATCGTAATCGTACCGGTCAGTAC…
    Probe 1:         CATCGATCGTAATCGTACCGGTCA
    Probe 2:          ATCGATCGTAATCGTACCGGTCAG
    Probe 3:           TCGATCGTAATCGTACCGGTCAGT
    …
- Two identical chips
- Expose each to one sample
- Record log ratio of signals

Comparative Genomic Hybridization (CGH)
[Figure: Sample 1 DNA and Sample 2 DNA are each applied to one of two identical tiling chips]
Probe Intensity Ratio = (Sample 2 Intensity) / (Sample 1 Intensity)

A Segmentation
[Figure: log ratio (-2.0 to 2.5) versus probe number (0 to 10,000), showing probes and fitted segments]

A More Typical Segmentation
[Figure: log ratio (-1.5 to 1.5) versus probe number (0 to 10,000), showing probes and fitted segments]

Our segMNT Method (paper in preparation)
- Identify candidate breakpoints
- Apply dynamic programming to the reduced set
- Use a permutation test to choose the best number of segments

Choose Candidate Breakpoints
- Use a t-test to identify candidate breakpoints
[Figure: probe values for probes 1 to 20; four candidate breakpoints divide the probes into Regions 1 to 5, each shown with its mean probe value]

Objective Function for Segmentation (from Lipson et al., 2005)
The score of a segmentation is the sum of its per-segment scores:
\[ \sum_{j=1}^{k} \frac{\sum_{i=1}^{n_j} x_{ji}}{\sqrt{n_j}} \]
where k = number of segments, x_{ji} = intensity value for the ith probe of segment j, and n_j = number of probes in segment j.
[The radical follows the interval score of the cited paper; it is not legible in the extracted slide text.]

Apply Dynamic Programming
Score for 1 segment covering the regions from one region to another:
          to 1  to 2  to 3  to 4  to 5
from 1    0.01  0.07  0.33  0.63  0.56
from 2          0.09  0.42  0.75  0.64
from 3                0.39  0.73  0.57
from 4                      0.34  0.23
from 5                            0.01

Apply Dynamic Programming
Score for N segments covering region 1 up to a given region:
          to 1  to 2  to 3  to 4  to 5
N=1       0.01  0.07  0.33  0.63  0.56
N=2             0.10  0.43  0.76  0.65
N=3                   0.49  0.83  0.77
N=4                         0.83  0.84
N=5                               0.84

Permutation Test
[Figure: segMNT applied to the original probe values and to permuted probe values; log ratio (-1.5 to 1.5) versus probe number (0 to 10,000)]
Score of the best path to region 5, and the gain from each added segment:
              Original probe values    Permuted probe values
              score    gain            score    gain
1-seg path    0.56                     0.56
2-seg path    0.65     0.09            0.62     0.06
3-seg path    0.77     0.12            0.70     0.08
4-seg path    0.84     0.07            0.77     0.07
5-seg path    0.84     0.00            0.85     0.08

How Do We Evaluate?
- Synthetic data
    - Readily available
    - We know the answer
- Real data
    - Segment a human sample
    - Compare to the Database of Genomic Variants

Segmentation Comparison: Synthetic Data
[Figure: average error (log scale, 1.E-04 to 1.E+00) versus noise level in data (1 to 10) for DNAcopy, StepGram, and segMNT]

Comparison to Known Variants
[Figure: F-measure comparison; two scatter plots of segMNT F-measure against StepGram F-measure and against DNAcopy F-measure, all axes 0.00 to 0.25]

Part 2
MY PRETRAINING ALGORITHM

Motivating Project: SNP Finding (Molla et al., CSB 2004)
- Given the results of a single resequencing chip
    - The intensities of all of the probes
    - The reference sequence is known
- Decide where the SNPs are in the genome
    - Each p-group represents a base position
    - Separate non-conformers into SNPs and noise

A Resequencing Chip
[Figure: a resequencing chip with p-groups 1 to 17 laid out along the (complement of the) reference sequence …ATCCTAGCGTACGATCT…, one probe per base (A, C, G, T) in each p-group; a light square marks a high-intensity probe. Inset "Normalized Intensity": the intensities of p-group 9 (9A, 9G, 9C, 9T) are sorted and normalized; Example 9 has Category = Non-Conformer. Legend: Conformer, Non-Conformer.]

My Pretraining Generalization
[Figure: feature space containing conformers and non-conformers, with one group labeled "Probably Bad Data" and another labeled "A Likely SNP"]

The Analogy
- Conformer vs. Non-Conformer is a key feature
- Trying to predict a key feature helps the learner
- What if we did not know which were key features?
- Try to predict them all

Background: Supervised Learning / The Pretraining Algorithm
Given
- Examples whose feature values are known and whose categories are known
Augment the feature set (the step that pretraining adds)
Do
- Predict the categories of examples whose feature values are known but whose categories are not

Standard Formulation
FEATURES               CATEGORY
A      B      C        D
True   True   True     False
True   True   False    True
True   False  True     True
True   False  False    False
False  True   True     True
False  True   False    False
False  False  True     False
False  False  False    True

Pretraining Step 1: Predict the First Feature
The learner predicts A; the category D is ignored. The prediction result becomes the new column preA.
A      B      C      preA     D
True   True   True   Correct  False
True   True   False  Wrong    True
True   False  True   Correct  True
True   False  False  Correct  False
False  True   True   Correct  True
False  True   False  Correct  False
False  False  True   Correct  False
False  False  False  Correct  True

Pretraining Step 2: Predict the Second Feature
The learner predicts B; the prediction result becomes the new column preB.
A      B      C      preA     preB     D
True   True   True   Correct  Correct  False
True   True   False  Wrong    Correct  True
True   False  True   Correct  Wrong    True
True   False  False  Correct  Correct  False
False  True   True   Correct  Correct  True
False  True   False  Correct  Correct  False
False  False  True   Correct  Correct  False
False  False  False  Correct  Wrong    True

Pretraining Step 3: Predict the Third Feature
The learner predicts C; the prediction result becomes the new column preC.
A      B      C      preA     preB     preC     D
True   True   True   Correct  Correct  Correct  False
True   True   False  Wrong    Correct  Correct  True
True   False  True   Correct  Wrong    Correct  True
True   False  False  Correct  Correct  Correct  False
False  True   True   Correct  Correct  Wrong    True
False  True   False  Correct  Correct  Correct  False
False  False  True   Correct  Correct  Correct  False
False  False  False  Correct  Wrong    Correct  True

Pretraining Formulation
FEATURES                                        CATEGORY
A      B      C      preA     preB     preC     D
True   True   True   Correct  Correct  Correct  False
True   True   False  Wrong    Correct  Correct  True
True   False  True   Correct  Wrong    Correct  True
True   False  False  Correct  Correct  Correct  False
False  True   True   Correct  Correct  Wrong    True
False  True   False  Correct  Correct  Correct  False
False  False  True   Correct  Correct  Correct  False
False  False  False  Correct  Wrong    Correct  True

Using Feature Predictions as Features
- The prediction error allows us to differentiate these positive examples from the negative examples
[Figure: feature prediction error (0 to 10) versus feature value (0 to 10) for positive and negative examples]

Error Rates on 3 UCI Datasets
[Figure: error rate (0% to 7%) for baseline, pretrained, and non-pretrained learners on the Wisconsin Breast Cancer, Hypothyroid, and Splice Junctions datasets]

Future Work
- Develop other methods to identify key features
- Develop an SVM kernel to directly make use of the nearby cluster hypothesis

Part 3
PROBES FOR DIRECT GENOMIC SELECTION

Probes: Good vs. Bad
[Figure: a good probe and a bad probe; red = probe, green = sample]

Our Work: Tobler, Molla et al., ISMB 2002
- Categorize probes as good or bad based on signal intensity
- Choose features to represent probes
    - All features come directly from the probe sequence
- Apply established machine-learning algorithms with cross-validation
    - Train on categorized examples
    - Test on examples with category hidden

New Application: Genome Enrichment
[Figure: a good probe capturing a sequenceable target; red = probe, green = sample]

Microarray Sequence Capture
[Figure: genomic DNA (exons and intergenic/intron regions) is applied to array probes, yielding a pool of targeted fragments]

Directed Sequencing
- Works like a filter
- Isolate a specific subset of the genome for sequencing
- Allow higher sequence coverage across the subset at a lower cost

Exon Capture Proof of Principle
- Collaboration with Baylor College of Medicine
[Figure: sequence coverage versus genomic position across the target region; average 7-fold sequencing coverage]

Conclusion
- Developed methods to improve
    - Probe selection for sequence capture
    - Low-cost SNP finding
    - Genome copy-number identification
- Major increases in efficiency for molecular biology
- A new machine-learning algorithm
- Helped to show that machine learning can solve varied problems in molecular biology

Acknowledgements
Advisor: Jude Shavlik
Committee Members: David Page, Fred Blattner, Mark Craven, Charles Dyer
Colleagues: Mark Rich, Ameet Soni, Lisa Torrey, Frank DiMaio, Jesse Davis, Sean McIlwain, Adam Smith, Yue Pan, Burr Settles, Mike Waddell, Louis Oliphant, Keith Noto, Trevor Walker, Irene Ong, Rich Maclin, Soumya Ray, Joseph Bockhorst
NimbleGen Systems: Roland Green, Todd Richmond, Nan Jiang, Jacob Kitzman, Tom Albert
Baylor College of Medicine
Grants: NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, NIH 5 T32 GM08349
and many others who helped

Last But Not Least…
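Backup sketch for Part 1: the dynamic-programming step of the segMNT method might look roughly like the following. This is a minimal illustration under stated assumptions, not the segMNT implementation. The names `interval_score` and `best_segmentations`, the toy log ratios, and the candidate list are all hypothetical. The per-segment score |sum| / sqrt(n) is an assumption modeled on the interval score of Lipson et al., with an absolute value added so that losses score as well as gains. The t-test that proposes candidates and the permutation test that picks the number of segments are not shown.

```python
from math import sqrt


def interval_score(values):
    """Assumed per-segment score: |sum of values| / sqrt(number of values)."""
    return abs(sum(values)) / sqrt(len(values))


def best_segmentations(values, candidates, max_segments, score=interval_score):
    """Return {k: (total score, breakpoints)} for k = 1..max_segments.

    Segment boundaries may fall only on the candidate breakpoints, so the
    dynamic program runs over regions rather than over individual probes.
    """
    n = len(values)
    bounds = sorted({0, n} | {c for c in candidates if 0 < c < n})
    m = len(bounds) - 1  # number of regions between consecutive boundaries
    neg_inf = float("-inf")
    # best[k][b]: best total score for regions 0..b-1 split into exactly k segments
    best = [[neg_inf] * (m + 1) for _ in range(max_segments + 1)]
    back = [[0] * (m + 1) for _ in range(max_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for b in range(k, m + 1):
            for a in range(k - 1, b):  # the last segment covers regions a..b-1
                if best[k - 1][a] == neg_inf:
                    continue
                total = best[k - 1][a] + score(values[bounds[a]:bounds[b]])
                if total > best[k][b]:
                    best[k][b], back[k][b] = total, a
    results = {}
    for k in range(1, min(max_segments, m) + 1):
        cuts, b = [], m
        for kk in range(k, 0, -1):  # walk back through the chosen segment starts
            b = back[kk][b]
            cuts.append(bounds[b])
        results[k] = (best[k][m], sorted(cuts)[1:])  # drop the leading 0
    return results


# Toy log ratios with a raised stretch at probes 4..7 (hypothetical data);
# the candidate breakpoints stand in for what a t-test might propose.
log_ratios = [0.1, -0.1, 0.0, 0.1, 1.0, 1.1, 0.9, 1.0, 0.0, -0.1, 0.1, 0.0]
results = best_segmentations(log_ratios, candidates=[2, 4, 8, 10], max_segments=3)
print(results[3][1])  # prints [4, 8]
```

Restricting boundaries to the candidate breakpoints keeps the table small: five regions instead of thousands of probes. The scores returned for each k could then be compared with scores from permuted probe values to choose the number of segments, as on the Permutation Test slide.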
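Backup sketch for Part 2: the pretraining steps (predict each feature in turn with the category ignored, then add the prediction result as a new feature) could be written roughly as below. This is a sketch under stated assumptions rather than the thesis implementation. The names `pretrain` and `knn_leave_one_out` and the toy rows are hypothetical. The slides do not say which learner is used or how predictions on training examples are obtained, so a leave-one-out 3-nearest-neighbor vote stands in for the learner. Correct and Wrong are represented as True and False.

```python
from collections import Counter


def knn_leave_one_out(inputs, targets, k=3):
    """Stand-in learner (an assumption): predict each example's target by
    majority vote among its k nearest other examples, using Hamming distance."""
    predictions = []
    for i, x in enumerate(inputs):
        others = [j for j in range(len(inputs)) if j != i]
        others.sort(key=lambda j: sum(a != b for a, b in zip(x, inputs[j])))
        votes = Counter(targets[j] for j in others[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


def pretrain(rows, learner=knn_leave_one_out):
    """Augment each row with one 'prediction was correct' flag per feature.

    `rows` holds feature values only; the category is left out, mirroring the
    CATEGORY IGNORED label on the pretraining slides.
    """
    augmented = [list(r) for r in rows]
    for f in range(len(rows[0])):
        inputs = [r[:f] + r[f + 1:] for r in rows]  # every feature except f
        targets = [r[f] for r in rows]              # feature f is the target
        predictions = learner(inputs, targets)
        for out, p, t in zip(augmented, predictions, targets):
            out.append(p == t)                      # True = Correct, False = Wrong
    return augmented


# Toy data (hypothetical): the three features usually agree; the last row does not.
rows = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (0, 0, 0), (1, 1, 1), (0, 0, 0), (1, 1, 0)]
augmented = pretrain(rows)
print(augmented[6])  # prints [1, 1, 0, True, True, False]
```

The final learner would then be trained on the augmented rows together with the categories. In the toy data only the unusual last row receives a Wrong flag, which is the kind of signal the Pretraining Formulation table shows for the examples whose category is True.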