
Evaluating Machine Learning
Approaches for
Aiding Probe Selection for
Gene-Expression Arrays
J. Tobler, M. Molla, J. Shavlik
University of Wisconsin-Madison
M. Molla, E. Nuwaysir, R. Green
NimbleGen Systems, Inc.
Oligonucleotide Microarrays
Specific probes synthesized at a known spot on the chip's surface
Probes complementary to the RNA of genes to be measured
Typical gene (1 kb+) MUCH longer than typical probe (24 bases)
[Diagram: probes attached to the chip surface]
Probes: Good vs. Bad
[Diagram: good probe vs. bad probe hybridization; blue = probe, red = sample]
Probe-Picking Method Needed
Hybridization characteristics differ between probes
Probe set represents a very small subset of the gene
Accurate measurement of expression requires a good probe set
Related Work
Use known hybridization characteristics
Lockhart et al. 1996
Melting point (Tm) predictions (rough sketch after this slide)
Kurata and Suyama 1999
Li and Stormo 2001
Stable secondary structure
Kurata and Suyama 1999
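The cited melting-point approaches rely on thermodynamic models; purely as a rough illustration of the idea (not the method from those papers), the Wallace rule approximates Tm for a short oligo as 2*(#A + #T) + 4*(#G + #C). The function name below is illustrative.

```python
def wallace_tm(probe: str) -> float:
    """Rough melting-temperature estimate (deg C) for a short oligo
    using the Wallace rule: 2*(#A + #T) + 4*(#G + #C).
    Real probe-design tools use nearest-neighbor thermodynamic models."""
    p = probe.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2.0 * at + 4.0 * gc

print(wallace_tm("CATCGATCGTAATCGTACCGGTCA"))   # 72.0 for this 24-mer
```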
Our Approach
Apply established machine-learning algorithms
  Train on categorized examples
  Test on examples with the category hidden
Choose features to represent probes
Categorize probes as good or bad
The Features
Feature name: Description
fracA, fracC, fracG, fracT: the fraction of A, C, G, or T in the 24-mer
fracAA, fracAC, fracAG, fracAT, fracCA, fracCC, fracCG, fracCT, fracGA, fracGC, fracGG, fracGT, fracTA, fracTC, fracTG, fracTT: the fraction of each of these dimers in the 24-mer
n1, n2, …, n24: the particular nucleotide (A, C, G, or T) at the specified position in the 24-mer
d1, d2, …, d23: the particular dimer (AA, AC, …, TT) at the specified position in the 24-mer
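A minimal sketch of how these features could be computed for a single probe; the function and variable names are illustrative, not taken from the original work.

```python
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]

def probe_features(probe: str) -> dict:
    """Composition and positional features for one probe (a 24-mer in the paper)."""
    probe = probe.upper()
    n = len(probe)
    feats = {}
    # fracA ... fracT: single-base composition
    for b in BASES:
        feats[f"frac{b}"] = probe.count(b) / n
    # fracAA ... fracTT: overlapping-dimer composition
    dimers = [probe[i:i + 2] for i in range(n - 1)]
    for d in DIMERS:
        feats[f"frac{d}"] = dimers.count(d) / (n - 1)
    # n1 ... n24: the nucleotide at each position
    for i, b in enumerate(probe, start=1):
        feats[f"n{i}"] = b
    # d1 ... d23: the dimer starting at each position
    for i, d in enumerate(dimers, start=1):
        feats[f"d{i}"] = d
    return feats

feats = probe_features("CATCGATCGTAATCGTACCGGTCA")
print(feats["fracC"], feats["n14"], feats["d1"])   # 0.2916..., C, CA
```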
The Data
Tilings of 8 genes (from E. coli and B. subtilis)
  Every possible probe (~10,000 probes)
  Genes known to be expressed in the sample
Gene sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…
Complement:    CATCGATCGTAATCGTACCGGTCAGTAC…
Probe 1: CATCGATCGTAATCGTACCGGTCA
Probe 2:  ATCGATCGTAATCGTACCGGTCAG
Probe 3:   TCGATCGTAATCGTACCGGTCAGT
…
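A minimal sketch of generating such a tiling, assuming the base-by-base complement shown above and a step of one nucleotide; names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def tile_probes(gene: str, probe_len: int = 24) -> list[str]:
    """Every possible probe: the probe_len-mer of the complementary
    strand starting at each position along the gene."""
    comp = gene.upper().translate(COMPLEMENT)
    return [comp[i:i + probe_len] for i in range(len(comp) - probe_len + 1)]

probes = tile_probes("GTAGCTAGCATTAGCATGGCCAGTCATG")
print(probes[0])   # CATCGATCGTAATCGTACCGGTCA (Probe 1 above)
```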
Our Microarray
Defining our Categories
[Histogram: frequency of probes versus normalized probe intensity (0 to 1.0). Low intensity (below about 0.05) = BAD probes (45%); mid-intensity = not used in training set (23%); high intensity (above about 0.15) = GOOD probes (32%).]
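A minimal sketch of assigning those training categories, assuming cutoffs at the 0.05 and 0.15 marks on the histogram's intensity axis; the cutoff values and names are illustrative.

```python
def label_probe(normalized_intensity: float,
                low_cut: float = 0.05, high_cut: float = 0.15) -> str:
    """Map a normalized probe intensity to a training-set label.
    Mid-intensity probes are excluded from training."""
    if normalized_intensity < low_cut:
        return "bad"      # low intensity (~45% of probes)
    if normalized_intensity >= high_cut:
        return "good"     # high intensity (~32% of probes)
    return "unused"       # mid-intensity (~23%), not used for training

print(label_probe(0.02), label_probe(0.40))   # bad good
```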
The Machine Learning Techniques
Naïve Bayes (Mitchell 1997)
Neural Networks (Rumelhart et al. 1995)
Decision Trees (Quinlan 1996)
Can interpret predictions of each learner probabilistically
Naïve Bayes
Assumes conditional independence between features
Makes judgments about test-set examples based on conditional-probability estimates made on the training set
Naïve Bayes
For each example in the test set, evaluate the following ratio (sketch below):

\[
\frac{P(\mathrm{high}) \prod_i P(\mathrm{feature}_i = \mathrm{value}_i \mid \mathrm{high})}
     {P(\mathrm{low}) \prod_i P(\mathrm{feature}_i = \mathrm{value}_i \mid \mathrm{low})}
\]
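A minimal sketch of computing that ratio under the conditional-independence assumption, with the conditional-probability tables estimated from the training set; the data-structure layout and names are illustrative.

```python
def naive_bayes_ratio(example: dict, p_high: float, p_low: float,
                      cond_high: dict, cond_low: dict) -> float:
    """P(high) * prod_i P(feature_i = value_i | high)
       divided by
       P(low)  * prod_i P(feature_i = value_i | low).
    cond_high[feature][value] and cond_low[feature][value] hold the
    conditional-probability estimates made on the training set.
    A ratio above 1 predicts a good (high-intensity) probe."""
    num, den = p_high, p_low
    for feature, value in example.items():
        num *= cond_high[feature][value]
        den *= cond_low[feature][value]
    return num / den
```

In practice the products are usually accumulated as sums of log probabilities to avoid floating-point underflow across many features.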
Neural Network
(1-of-n encoding with probe length = 3)
Example probe sequence: "CAG"
[Network diagram: twelve input units A1, C1, G1, T1, A2, C2, G2, T2, A3, C3, G3, T3 (four per position, with the unit for the base present set to 1) connect through weights to an output that predicts Good or Bad.]
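A minimal sketch of the 1-of-n input encoding illustrated above; names are illustrative.

```python
BASES = "ACGT"

def one_of_n_encode(probe: str) -> list[float]:
    """Four input units per position (A, C, G, T), exactly one set to 1.0.
    A 3-mer yields 12 inputs; the 24-mers used in the paper yield 96."""
    units = []
    for base in probe.upper():
        units.extend(1.0 if base == b else 0.0 for b in BASES)
    return units

print(one_of_n_encode("CAG"))
# C1..T1 = 0,1,0,0;  A2..T2 = 1,0,0,0;  A3..T3 = 0,0,1,0
```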
Decision Tree
Automatically builds a tree of rules
[Tree diagram: internal nodes test features such as fracC, fracG, fracT, fracTC, fracAC, and n14, branching on Low/High values (or A, C, G, T for n14); leaves predict Good Probe or Bad Probe.]
Decision Tree
The information gain of a feature, F, is:
\[
\mathrm{InformationGain}(S, F) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(F)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)
\]
Information Gain per Feature
[Bar charts of normalized information gain (0.0 to 1.0 scale) for each feature: one panel for the probe-composition features (fracC, fracA, fracG, fracAA, fracT, and the remaining single-base and dimer fractions), one for the base-position features (n1 to n24), and one for the dimer-position features (d1 to d23).]
Cross-Validation
Leave-one-out testing (sketch after this slide):
For each gene (of the 8):
  Train on all the other genes
  Test on this gene
  Record the result
  Forget what was learned
Average results across the 8 test genes
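A minimal sketch of that leave-one-gene-out loop; the learner interface and names are illustrative.

```python
def leave_one_gene_out(probes_by_gene: dict, train_fn, test_fn) -> float:
    """probes_by_gene maps each gene name to its labeled probe examples.
    For each of the 8 genes: train on the other 7, test on the held-out
    gene, record the result, then discard the model ('forget what was
    learned').  Returns the average result across the test genes."""
    results = []
    for held_out in probes_by_gene:
        training = [ex for gene, examples in probes_by_gene.items()
                    if gene != held_out for ex in examples]
        model = train_fn(training)
        results.append(test_fn(model, probes_by_gene[held_out]))
    return sum(results) / len(results)
```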
Typical Probe-Intensity Prediction Across a Short Region
[Plot: actual normalized probe intensity (0 to 1) versus the starting nucleotide position of each 24-mer probe, for positions roughly 650 to 700.]
Typical Probe-Intensity Prediction Across a Short Region (continued)
[Plot: the same region, with the neural network, naïve Bayes, and decision tree predictions overlaid on the actual normalized probe intensities.]
Probe-Picking Results
[Plot: number of probes selected with intensity >= 90th percentile (y-axis, 0 to 20) versus number of probes selected (x-axis, 0 to 20); the Perfect Selector, for which every selected probe is in the top decile, marks the upper bound.]
Probe-Picking Results (continued)
[Plot: same axes, comparing the neural network, naïve Bayes, and decision tree selectors, plus a primer-melting-point selector, against the Perfect Selector; the metric is sketched below.]
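A minimal sketch of the selection metric on these axes: take a learner's k highest-scoring probes for a gene and count how many have measured intensity at or above the gene's 90th-percentile intensity. The percentile rule and names are illustrative.

```python
def top_decile_hits(scores: list[float], intensities: list[float], k: int) -> int:
    """Number of the k highest-scoring probes whose measured intensity
    is >= the 90th-percentile intensity for the gene."""
    # 90th-percentile threshold via a simple nearest-rank rule
    ranked = sorted(intensities)
    threshold = ranked[int(0.9 * (len(ranked) - 1))]
    # indices of the k probes the learner scores highest
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(intensities[i] >= threshold for i in top_k)
```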
Current and Future Directions
Consider more features
  Folding patterns
  Melting point
Feature selection
Evaluate specificity along with sensitivity
  i.e., consider false positives
Evaluate probe selection + gene calling
Try more ML techniques
  SVMs, ensembles, …
Take-Home Message
Machine learning does a good job on this part of the probe-selection problem
  Easy to collect a large number of training examples
  Easily measured features work well
Intelligent probe selection can increase microarray accuracy and efficiency
Acknowledgements
NimbleGen Systems, Inc. for providing the intensities from the eight tiled genes measured on their maskless array.
Darryl Roy for helping to create the training data.
Grants NIH 2 R44 HG02193-02, NLM 1 R01 LM07050-01, NSF IRI-9502990, NIH 2 P30 CA14520-29, and NIH 5 T32 GM08349.
Thanks