Evaluating Machine Learning Approaches for Aiding Probe Selection for Gene-Expression Arrays
J. Tobler, M. Molla, J. Shavlik, University of Wisconsin-Madison
M. Molla, E. Nuwaysir, R. Green, NimbleGen Systems Inc.

Oligonucleotide Microarrays
- Specific probes are synthesized at known spots on the chip's surface
- Probes are complementary to the RNA of the genes to be measured
- A typical gene (1 kb+) is MUCH longer than a typical probe (24 bases)

Probes: Good vs. Bad
- [Figure: blue = probe, red = sample; a good probe hybridizes strongly to its target, a bad probe does not]

Probe-Picking Method Needed
- Hybridization characteristics differ between probes
- A probe set represents a very small subset of a gene
- Accurate measurement of expression requires a good probe set

Related Work
- Use known hybridization characteristics (Lockhart et al. 1996)
- Melting-point (Tm) predictions (Kurata and Suyama 1999; Li and Stormo 2001)
- Stable secondary structure (Kurata and Suyama 1999)

Our Approach
- Apply established machine-learning algorithms
  - Train on categorized examples
  - Test on examples with the category hidden
- Choose features to represent probes
- Categorize probes as good or bad

The Features
- fracA, fracC, fracG, fracT: the fraction of A, C, G, or T in the 24-mer
- fracAA, fracAC, fracAG, fracAT, fracCA, fracCC, fracCG, fracCT, fracGA, fracGC, fracGG, fracGT, fracTA, fracTC, fracTG, fracTT: the fraction of each of these dimers in the 24-mer
- n1, n2, ..., n24: the particular nucleotide (A, C, G, or T) at the specified position in the 24-mer
- d1, d2, ..., d23: the particular dimer (AA, AC, ..., TT) at the specified position in the 24-mer

The Data
- Tilings of 8 genes (from E. coli and B. subtilis)
- Every possible probe (~10,000 probes)
- Genes known to be expressed in the sample

Example tiling:
  Gene sequence: GTAGCTAGCATTAGCATGGCCAGTCATG...
  Complement:    CATCGATCGTAATCGTACCGGTCAGTAC...
  Probe 1: CATCGATCGTAATCGTACCGGTCA
  Probe 2: ATCGATCGTAATCGTACCGGTCAG
  Probe 3: TCGATCGTAATCGTACCGGTCAGT
  ...

Our Microarray
- [Figure: image of the tiled microarray]

Defining Our Categories
- Low intensity = BAD probes (45%)
- Mid intensity = not used in the training set (23%)
- High intensity = GOOD probes (32%)
- [Figure: histogram of normalized probe intensity from 0 to 1.0, with cutoffs near 0.05 and 0.15 separating the three regions]

The Machine Learning Techniques
- Naïve Bayes (Mitchell 1997)
- Neural networks (Rumelhart et al. 1995)
- Decision trees (Quinlan 1996)
- The predictions of each learner can be interpreted probabilistically

Naïve Bayes
- Assumes conditional independence between the features
- Makes judgments about test-set examples using conditional-probability estimates from the training set
- For each example in the test set, evaluate the ratio:

    P(high) * prod_i P(feature_i = value_i | high)
    ----------------------------------------------
    P(low)  * prod_i P(feature_i = value_i | low)

Neural Network
- Inputs use a 1-of-n encoding: each probe position contributes four input units (A, C, G, T), exactly one of which is active
- Illustrated with probe length = 3 and example probe sequence "CAG": units A1 C1 G1 T1, A2 C2 G2 T2, A3 C3 G3 T3 feed through weighted connections to an output classifying the probe as good or bad

Decision Tree
- Automatically builds a tree of rules
- [Figure: example tree with internal tests on fracC, fracG, fracT, fracTC, fracAC, n14, ... and leaves labeled Good Probe / Bad Probe]
- The information gain of a feature F on a set of examples S is:

    InformationGain(S, F) = Entropy(S) - sum over v in Values(F) of (|S_v| / |S|) * Entropy(S_v)

Normalized Information Gain
- [Figure: normalized information gain (0.0 to 1.0) per feature, shown for the probe-composition features (fracA ... fracTT), the base-position features (n1 ... n24), and the dimer-position features (d1 ... d23)]

Cross-Validation
- Leave-one-out testing: for each gene (of the 8):
  - Train on all but this gene
  - Test on this gene
  - Record the result
  - Forget what was learned
- Average results across the 8 test genes
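The feature encoding above can be sketched directly; a minimal Python sketch (the function name `probe_features` is mine, not from the slides):

```python
# Sketch of the probe features described above: base and dimer
# composition fractions plus per-position nucleotide/dimer identities.
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT

def probe_features(probe):
    """Encode a 24-mer probe as the slide deck's 67 features."""
    assert len(probe) == 24
    feats = {}
    # fracA ... fracT: fraction of each base in the 24-mer
    for b in BASES:
        feats["frac" + b] = probe.count(b) / len(probe)
    # fracAA ... fracTT: fraction of each dimer among the 23 overlapping dimers
    dimers = [probe[i:i + 2] for i in range(len(probe) - 1)]
    for d in DIMERS:
        feats["frac" + d] = dimers.count(d) / len(dimers)
    # n1 ... n24: the nucleotide at each position
    for i, b in enumerate(probe, start=1):
        feats[f"n{i}"] = b
    # d1 ... d23: the dimer starting at each position
    for i, d in enumerate(dimers, start=1):
        feats[f"d{i}"] = d
    return feats

probe = "CATCGATCGTAATCGTACCGGTCA"  # first probe from the tiling example
f = probe_features(probe)
```

This yields 4 + 16 + 24 + 23 = 67 features per probe, matching the four feature families in the table.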
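The Naïve Bayes ratio above can be evaluated from simple training-set counts; a minimal sketch (function names are mine, and the small `eps` floor for unseen feature values is my addition, not something the slides specify):

```python
# Sketch of the Naive Bayes ratio P(high)*prod P(f=v|high) /
# (P(low)*prod P(f=v|low)), estimated from labeled training examples.
from collections import defaultdict

def train_naive_bayes(examples, labels):
    """examples: list of feature dicts; labels: 'high' or 'low'."""
    priors = defaultdict(float)
    counts = defaultdict(lambda: defaultdict(float))
    for x, y in zip(examples, labels):
        priors[y] += 1
        for fv in x.items():
            counts[y][fv] += 1
    n = len(labels)
    return {
        "prior": {y: c / n for y, c in priors.items()},
        "cond": {y: {fv: c / priors[y] for fv, c in fvs.items()}
                 for y, fvs in counts.items()},
    }

def nb_ratio(model, x, eps=1e-6):
    """Ratio > 1 predicts a high-intensity (good) probe."""
    num = model["prior"].get("high", eps)
    den = model["prior"].get("low", eps)
    for fv in x.items():
        num *= model["cond"].get("high", {}).get(fv, eps)
        den *= model["cond"].get("low", {}).get(fv, eps)
    return num / den

model = train_naive_bayes(
    [{"n1": "A"}, {"n1": "A"}, {"n1": "C"}, {"n1": "C"}],
    ["high", "high", "low", "low"])
```

The conditional-independence assumption shows up in the code as a plain product over per-feature conditionals.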
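The information-gain formula used to grow the decision tree can be written out directly; a minimal sketch (function names are mine):

```python
# InformationGain(S, F) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v),
# where `values` holds feature F's value for each example in S.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [y for fv, y in zip(values, labels) if fv == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```

A feature that splits a 50/50 good/bad set perfectly has gain 1.0 (one full bit); an uninformative feature has gain 0.0.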
Typical Probe-Intensity Prediction Across a Short Region
- [Figure: actual normalized probe intensity vs. starting nucleotide position (650-700) for 24-mer probes]
- [Figure: the same region with predictions from the neural network, naïve Bayes, and the decision tree overlaid on the actual intensities]

Probe-Picking Results
- [Figure: number of probes selected with intensity >= the 90th percentile vs. number of probes selected (0-20), for a perfect selector]
- [Figure: the same plot comparing the neural network, naïve Bayes, the decision tree, and primer melting point against the perfect selector]

Current and Future Directions
- Consider more features: folding patterns, melting point
- Feature selection
- Evaluate specificity along with sensitivity (i.e., consider false positives)
- Evaluate probe selection plus gene calling
- Try more ML techniques: SVMs, ensembles, ...

Take-Home Message
- Machine learning does a good job on this part of the probe-selection problem
  - It is easy to collect a large number of training examples
  - Easily measured features work well
- Intelligent probe selection can increase microarray accuracy and efficiency

Acknowledgements
- NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array
- Darryl Roy for helping create the training data
- Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349

Thanks
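The probe-picking curves count, for each k, how many of the k highest-ranked probes actually have measured intensity at or above the 90th percentile. A minimal sketch of that evaluation (function name and the simple percentile estimate are mine, not the slides'):

```python
# For k = 1..max_k, count how many of the k top-scoring probes have
# measured intensity at or above the 90th percentile of all intensities.
def picking_curve(scores, intensities, max_k=20):
    n = len(intensities)
    cutoff = sorted(intensities)[int(0.9 * (n - 1))]  # crude 90th percentile
    # Rank probes by predicted score, best first
    ranked = [i for _, i in sorted(zip(scores, intensities), reverse=True)]
    return [sum(x >= cutoff for x in ranked[:k]) for k in range(1, max_k + 1)]

# A "perfect selector" scores each probe by its true intensity,
# so its curve rises one-for-one until the good probes run out.
intensities = list(range(100))
perfect = picking_curve(intensities, intensities)
```

A learner's curve can then be plotted against the perfect selector's diagonal, as in the results figures.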