Novel Uses for Machine Learning and Other
Computational Methods in the Design and
Interpretation of Genetic Microarrays
Michael Molla
University of Wisconsin-Madison
Ph.D. Thesis Defense
August 23, 2007
1
My Research Topics

• Text Mining (Molla et al., Info. Sci. 2002; UW Tech. Report 2004)
• Gene Chip Design (Choosing Probes) (Tobler, Molla et al., ISMB 2002)
• SNP Detection (Molla et al., CSB 2004; Albert et al., Nature Methods 2005)
• Model Simulation for Evolutionary Biology (Haag & Molla, Evolution 2005)
• Genome Copy-Number Segmentation (paper in preparation)
• The Pretraining Algorithm
• Direct Genomic Selection (paper submitted)

Topics for this talk: the last three (Genome Copy-Number Segmentation, the Pretraining Algorithm, and Direct Genomic Selection).
2
My Main Focus

• DNA microarrays, also known as "gene chips," have come into prominence
• Novel applications of machine learning can help to solve varied and important problems in this domain
• Solving these problems will advance science and change the practice of health care

[Image: a NimbleGen glass-slide microarray. Image courtesy of NimbleGen Systems.]
3
Background
Oligonucleotide Microarrays
4
Oligonucleotide Microarrays

• Specific probes are synthesized at known spots on the chip's surface
• The probes are complementary to the genetic material to be measured (known as the sample)
• If the probes do their job, labeled sample can be detected on the surface of the chip

[Diagram labels: probes, surface, sample]
5
An Oligonucleotide Microarray
6
Part 1
DETECTING COPY-NUMBER VARIATION
7
Motivation

• Gene copy-number variation is a common form of genetic variation in individuals
• Specific variations can be predictive of diseases, including cancer
8
Comparative Genomic Hybridization (CGH)

• Goal: find regions of copy-number variation

Region Sequence: GTAGCTAGCATTAGCATGGCCAGTCATG…
Complement:      CATCGATCGTAATCGTACCGGTCAGTAC…

• Tile probes across a genomic region (a code sketch follows this slide):

Probe 1: CATCGATCGTAATCGTACCGGTCA
Probe 2:  ATCGATCGTAATCGTACCGGTCAG
Probe 3:   TCGATCGTAATCGTACCGGTCAGT

• Build two identical chips and expose each to one sample
• Record the log ratio of the signals
9
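The tiling shown above is mechanical enough to sketch in Python. This is a minimal illustration, assuming step-1 tiling of 24-mers as in the example; tile_probes is a hypothetical helper, not NimbleGen's design code:

```python
def tile_probes(sequence, probe_len=24, step=1):
    """Complement the region, then slide a probe_len window along it."""
    comp = sequence.translate(str.maketrans("ACGT", "TGCA"))
    return [comp[i:i + probe_len]
            for i in range(0, len(comp) - probe_len + 1, step)]

probes = tile_probes("GTAGCTAGCATTAGCATGGCCAGTCATG")
# probes[0] == "CATCGATCGTAATCGTACCGGTCA"  (Probe 1 above)
# probes[1] == "ATCGATCGTAATCGTACCGGTCAG"  (Probe 2 above)
```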
Comparative Genomic Hybridization (CGH)

[Diagram: Sample 1 DNA and Sample 2 DNA are each hybridized to one of two identical tiling chips]

Probe Intensity Ratio = (Sample 2 Intensity) / (Sample 1 Intensity)

(A sketch of the per-probe log-ratio computation follows this slide.)
10
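A minimal sketch of that ratio, assuming the two chips' probe intensities arrive as numeric arrays; the function name and the pseudocount guard are illustrative assumptions:

```python
import numpy as np

def probe_log_ratios(sample1_intensity, sample2_intensity, pseudocount=1.0):
    """Per-probe log2(sample 2 / sample 1); the pseudocount keeps
    dark probes from causing division by zero."""
    s1 = np.asarray(sample1_intensity, dtype=float) + pseudocount
    s2 = np.asarray(sample2_intensity, dtype=float) + pseudocount
    return np.log2(s2 / s1)

# A region duplicated in sample 2 gives log ratios near log2(2) = 1.
ratios = probe_log_ratios([500, 520, 480], [1010, 1050, 940])
```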
A Segmentation

[Plot: log ratio (-2.0 to 2.5) vs. probe number (0 to 10,000); points are probes, horizontal lines are the fitted segments]
11
A More Typical Segmentation

[Plot: log ratio (-1.5 to 1.5) vs. probe number (0 to 10,000); probes and fitted segments]
12
Our segMNT Method (paper in preparation)

• Identify candidate breakpoints
• Apply dynamic programming to the reduced set
• Use a permutation test to choose the best number of segments
13
Choose Candidate Breakpoints

Use a t-test to identify candidate breakpoints (a code sketch follows this slide).

[Chart: probe values for probes 1 through 20; positions where the t-test fires are marked "Candidate", dividing the probes into Regions 1 through 5, each drawn at its mean probe value (roughly 0.00, 0.50, 1.00, 0.50, 0.00)]
14
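One plausible reading of this step in Python, using SciPy's two-sample t-test on the probes flanking each position. The window size and significance threshold are illustrative guesses, not segMNT's actual settings:

```python
import numpy as np
from scipy import stats

def candidate_breakpoints(values, window=4, alpha=0.01):
    """Flag positions where the probes just before and just after
    appear to have different means (two-sample t-test)."""
    v = np.asarray(values, dtype=float)
    candidates = []
    for i in range(window, len(v) - window + 1):
        left, right = v[i - window:i], v[i:i + window]
        _, p = stats.ttest_ind(left, right)
        if p < alpha:
            candidates.append(i)  # a new segment may start at probe i
    return candidates
```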
Objective Function for Segmentation
from Lipson et al., 2005

The per-segment score of a segmentation:

$$\text{score} = \sum_{j=1}^{k} \frac{\sum_{i=1}^{n_j} x_{ji}}{\sqrt{n_j}}$$

where k = number of segments, x_ji = intensity value for the ith probe of segment j, and n_j = number of probes in segment j. (A code sketch follows this slide.)
15
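A direct transcription of the score as reconstructed above; segmentation_score and the breakpoints-as-segment-start-indices convention are illustrative:

```python
import numpy as np

def segmentation_score(values, breakpoints):
    """Sum of per-segment scores: each segment contributes the sum of
    its probe values divided by the square root of its length."""
    v = np.asarray(values, dtype=float)
    bounds = [0, *breakpoints, len(v)]  # segment j covers bounds[j]:bounds[j+1]
    return sum(v[a:b].sum() / np.sqrt(b - a)
               for a, b in zip(bounds, bounds[1:]))
```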
Apply Dynamic Programming

Score for one segment spanning a run of candidate regions (row = first region, column = last region):

          to 1   to 2   to 3   to 4   to 5
from 1    0.01   0.07   0.33   0.63   0.56
from 2           0.09   0.42   0.75   0.64
from 3                  0.39   0.73   0.57
from 4                         0.34   0.23
from 5                                0.01
16
Apply Dynamic Programming

Best score using N segments to cover regions 1 through j (a sketch of this dynamic program follows this slide):

        to 1   to 2   to 3   to 4   to 5
N=1     0.01   0.07   0.33   0.63   0.56
N=2            0.10   0.43   0.76   0.65
N=3                   0.49   0.83   0.77
N=4                          0.83   0.84
N=5                                 0.84
17
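A sketch of the dynamic program behind these tables: best[n][j] extends some (n-1)-segment solution with one segment that ends at candidate region j. The per-segment score reuses the objective above; the code is an illustrative reconstruction, not segMNT itself:

```python
import numpy as np

def best_segmentations(values, breakpoints, max_segments):
    """best[n][j]: best score covering candidate regions 1..j with n
    segments, where regions are the stretches between candidate
    breakpoints. Returns the best full-coverage score for n = 1..max."""
    bounds = [0, *breakpoints, len(values)]
    n_regions = len(bounds) - 1

    def seg_score(a, b):  # one segment spanning regions a+1 .. b
        x = np.asarray(values[bounds[a]:bounds[b]], dtype=float)
        return x.sum() / np.sqrt(len(x))

    NEG = float("-inf")
    best = [[NEG] * (n_regions + 1) for _ in range(max_segments + 1)]
    best[0][0] = 0.0
    for n in range(1, max_segments + 1):
        for j in range(1, n_regions + 1):
            best[n][j] = max(
                (best[n - 1][i] + seg_score(i, j)
                 for i in range(j) if best[n - 1][i] > NEG),
                default=NEG,
            )
    return [best[n][n_regions] for n in range(1, max_segments + 1)]
```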
Permutation Test

[Plots: segMNT applied to permuted probe values (left) and to the real data (right); log ratio vs. probe number, 0 to 10,000]

Best-path score (and gain per added segment):

Path          Permuted values     Real data
1-seg path    0.56                0.56
2-seg path    0.62 (+0.06)        0.65 (+0.09)
3-seg path    0.70 (+0.08)        0.77 (+0.12)
4-seg path    0.77 (+0.07)        0.84 (+0.12)
5-seg path    0.85 (+0.08)        0.84 (+0.00)

On the real data the gains outpace the permuted gains only up to a point; once adding a segment gains no more than it does on permuted values, the extra segment is not justified. (A sketch of this rule follows this slide.)
18
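A sketch of that stopping rule, reusing best_segmentations from the previous sketch. One simplification to flag: a faithful implementation would re-derive candidate breakpoints on each permuted series, whereas this sketch keeps them fixed:

```python
import numpy as np

def choose_num_segments(values, breakpoints, max_segments, n_perms=20,
                        rng=None):
    """Add segments only while the real score gain beats the average
    gain obtained on permuted probe values (chance structure)."""
    rng = np.random.default_rng(0) if rng is None else rng
    real = best_segmentations(values, breakpoints, max_segments)
    permuted = np.mean(
        [best_segmentations(rng.permutation(values), breakpoints,
                            max_segments) for _ in range(n_perms)],
        axis=0,
    )
    for n in range(1, max_segments):
        if real[n] - real[n - 1] <= permuted[n] - permuted[n - 1]:
            return n  # the (n+1)th segment adds no more than chance
    return max_segments
```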
How Do We Evaluate?

• Synthetic data (a sketch follows this slide)
  - Readily available
  - We know the answer
• Real data
  - Segment a human sample
  - Compare to the Database of Genomic Variants
19
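The synthetic side of this evaluation might look like the following; the step-function generator and the mean-absolute-error metric are assumptions, since the slides do not define the exact generator or the "average error" measure:

```python
import numpy as np

def synthetic_profile(breaks, means, n_probes, noise, rng):
    """A known step function (the answer) plus Gaussian probe noise."""
    truth = np.zeros(n_probes)
    edges = [0, *breaks, n_probes]
    for (a, b), m in zip(zip(edges, edges[1:]), means):
        truth[a:b] = m
    return truth, truth + rng.normal(0.0, noise, n_probes)

def average_error(fitted, truth):
    """Mean absolute difference between fitted and true values."""
    return float(np.mean(np.abs(fitted - truth)))

rng = np.random.default_rng(0)
truth, observed = synthetic_profile([3000, 7000], [0.0, 1.0, -0.5],
                                    10_000, noise=0.3, rng=rng)
```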
Segmentation Comparison: Synthetic Data

[Chart: average error (log scale, 1e-04 to 1e+00) vs. noise level in the data (1 to 10) for DNAcopy, StepGram, and segMNT]
20
Comparison to Known Variants: F-Measure Comparison

[Scatter plots: F-measure of segMNT (y-axis, 0.00 to 0.25) against StepGram and against DNAcopy (x-axes, 0.00 to 0.25)]
21
Part 2
MY PRETRAINING ALGORITHM
22
Motivating Project: SNP Finding
(Molla et al., CSB 2004)

• Given the results of a single resequencing chip
  - the intensities of all of the probes
  - the reference sequence is known
• Decide where the SNPs are in the genome
  - each p-group represents a base position
  - separate non-conformers into SNPs and noise
23
A Resequencing Chip

[Figure: the (complement of the) reference sequence …ATCCTAGCGTACGATCT… with a p-group of four probes (A, C, G, T) per base position; a light square is a high-intensity probe. Raw intensities (0 to 70,000) are sorted and normalized within each p-group. When the reference base's probe is the brightest of its p-group, the position is a conformer; in the example, p-group 9 is a non-conformer. A sketch of this call follows this slide.]
24
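A small illustration of the conformer / non-conformer call described above, assuming the chip's readout arrives as an (n_positions x 4) intensity array with one probe per base per p-group; the normalization and names are illustrative, not the CSB 2004 implementation:

```python
import numpy as np

BASES = "ACGT"

def classify_p_groups(intensities, reference):
    """Per position: conformer if the reference base's probe is the
    brightest of its p-group, else non-conformer (a SNP or noise)."""
    x = np.asarray(intensities, dtype=float)
    x = x / x.sum(axis=1, keepdims=True)      # normalize each p-group
    brightest = x.argmax(axis=1)
    ref_idx = np.array([BASES.index(b) for b in reference])
    return np.where(brightest == ref_idx, "conformer", "non-conformer")
```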
My Pretraining Generalization

[Feature-space diagram: conformers and non-conformers form clusters; within the non-conformers, one cluster is probably bad data and another is a likely SNP]
25
The Analogy

• Conformer vs. non-conformer is a key feature
• Trying to predict a key feature helps the learner
• What if we did not know which were key features?
• Try to predict them all
26
Background: Supervised Learning

• Given: examples whose feature values are known and whose categories are known
• Do: predict the categories of examples whose feature values are known but whose categories are not

The Pretraining Algorithm adds a step in between: augment the feature set.
27
Standard Formulation

FEATURES                 CATEGORY
A      B      C          D
True   True   True       False
True   True   False      True
True   False  True       True
True   False  False      False
False  True   True       True
False  True   False      False
False  False  True       False
False  False  False      True
28
Pretraining Step 1: Predict the First Feature

The learner predicts feature A from the other features; the category is ignored. A new column, preA, records whether each prediction was correct.

FEATURES                 PREDICTION RESULT    CATEGORY
A      B      C          preA                 D
True   True   True       Correct              False
True   True   False      Wrong                True
True   False  True       Correct              True
True   False  False      Correct              False
False  True   True       Correct              True
False  True   False      Correct              False
False  False  True       Correct              False
False  False  False      Correct              True
29
Pretraining Step 2: Predict the Second Feature

The learner now predicts feature B from the other original features; the category and the earlier preA column are ignored. A preB column records the results.

FEATURES                 PREDICTION RESULT    CATEGORY
A      B      C          preA     preB        D
True   True   True       Correct  Correct     False
True   True   False      Wrong    Correct     True
True   False  True       Correct  Wrong       True
True   False  False      Correct  Correct     False
False  True   True       Correct  Correct     True
False  True   False      Correct  Correct     False
False  False  True       Correct  Correct     False
False  False  False      Correct  Wrong       True
30
Pretraining Step 3: Predict the Third Feature

Finally, the learner predicts feature C, adding a preC column.

FEATURES                 PREDICTION RESULT             CATEGORY
A      B      C          preA     preB     preC        D
True   True   True       Correct  Correct  Correct     False
True   True   False      Wrong    Correct  Correct     True
True   False  True       Correct  Wrong    Correct     True
True   False  False      Correct  Correct  Correct     False
False  True   True       Correct  Correct  Wrong       True
False  True   False      Correct  Correct  Correct     False
False  False  True       Correct  Correct  Correct     False
False  False  False      Correct  Wrong    Correct     True
31
Pretraining Formulation

The augmented feature set: the original features plus the prediction-result columns, now used to learn the category. (A code sketch of the full procedure follows this slide.)

FEATURES (original + prediction results)               CATEGORY
A      B      C          preA     preB     preC        D
True   True   True       Correct  Correct  Correct     False
True   True   False      Wrong    Correct  Correct     True
True   False  True       Correct  Wrong    Correct     True
True   False  False      Correct  Correct  Correct     False
False  True   True       Correct  Correct  Wrong       True
False  True   False      Correct  Correct  Correct     False
False  False  True       Correct  Correct  Correct     False
False  False  False      Correct  Wrong    Correct     True
32
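The procedure on slides 28-32 can be sketched with scikit-learn as below. The decision-tree learner, the cross-validated prediction, and cv=4 are illustrative stand-ins rather than the thesis's actual protocol:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def pretrain_features(X, make_learner=DecisionTreeClassifier):
    """Append one correct/wrong column per original feature: each
    feature is predicted from the remaining features (the category
    is never consulted), and the prediction result is recorded."""
    X = np.asarray(X)
    pre_cols = []
    for f in range(X.shape[1]):
        rest = np.delete(X, f, axis=1)
        preds = cross_val_predict(make_learner(), rest, X[:, f], cv=4)
        pre_cols.append((preds == X[:, f]).astype(int))  # 1 = Correct
    return np.hstack([X, np.column_stack(pre_cols)])
```

The augmented matrix is then handed to an ordinary learner to predict the category, exactly as in the standard formulation.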
Using Feature Predictions as Features

[Scatter plot: feature prediction error (0 to 10) vs. feature value (0 to 10). The positive and negative examples overlap in feature value, but the prediction error allows us to differentiate these positive examples from the negative examples.]
33
Error Rates on 3 UCI Datasets

[Bar chart: error rate (0% to 7%) of the pretrained and non-pretrained learners, plus a baseline, on the Wisconsin Breast Cancer, Hypothyroid, and Splice Junctions datasets]
34
Future Work

• Develop other methods to identify key features
• Develop an SVM kernel to directly make use of the nearby-cluster hypothesis
35
Part 3
PROBES FOR DIRECT GENOMIC SELECTION
36
Probes: Good vs. Bad

[Diagram: red = probe, green = sample; the good probe binds labeled sample, the bad probe does not]
37
Our Work
(Tobler, Molla et al., ISMB 2002)

• Categorize probes as good or bad based on signal intensity
• Choose features to represent probes
  - all features come directly from the probe sequence
• Apply established machine-learning algorithms with cross-validation (a code sketch follows this slide)
  - train on categorized examples
  - test on examples with the category hidden
38
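A toy version of this pipeline. The base-count and (non-overlapping) dinucleotide-count features and the naive Bayes learner are illustrative choices, not the exact feature set or algorithms of the ISMB 2002 paper:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def probe_features(seq):
    """Features drawn only from the probe sequence itself."""
    bases = "ACGT"
    counts = [seq.count(b) for b in bases]
    dinucs = [seq.count(a + b) for a in bases for b in bases]
    return counts + dinucs

def cross_validated_accuracy(probes, labels):
    """probes: sequences; labels: 1 = good (bright), 0 = bad."""
    X = np.array([probe_features(p) for p in probes])
    return cross_val_score(MultinomialNB(), X, labels, cv=5).mean()
```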
New Application: Genome Enrichment

[Diagram: red = probe, green = sample; a good probe captures a sequenceable target]
39
Microarray Sequence Capture

[Diagram: genomic DNA, with exons and intergenic/intron regions, is hybridized to array probes, yielding a pool of targeted fragments]
40
Directed Sequencing

• Works like a filter
• Isolates a specific subset of the genome for sequencing
• Allows higher sequence coverage across that subset at a lower cost
41
Exon Capture Proof of Principle
Collaboration with Baylor College of Medicine

[Plot: sequence coverage vs. genomic position across the target region; average 7-fold sequencing coverage]
42
Conclusion

• Developed methods to improve
  - probe selection for sequence capture
  - low-cost SNP finding
  - genome copy-number identification
• Major increases in efficiency for molecular biology
• A new machine-learning algorithm
• Helped to show that machine learning can solve varied problems in molecular biology
43
Acknowledgements

• Advisor
  - Jude Shavlik
• Committee Members
  - David Page
  - Fred Blattner
  - Mark Craven
  - Charles Dyer
• Colleagues
  - Mark Rich, Ameet Soni, Lisa Torrey, Frank DiMaio, Jesse Davis, Sean McIlwain, Adam Smith, Yue Pan, Burr Settles, Mike Waddell, Louis Oliphant, Keith Noto, Trevor Walker, Irene Ong, Rich Maclin, Soumya Ray, Joseph Bockhorst
• NimbleGen Systems
  - Roland Green
  - Todd Richmond
  - Nan Jiang
  - Jacob Kitzman
  - Tom Albert
• Baylor College of Medicine
• Grants
  - NIH 2 R44 HG02193-02
  - NLM 1 R01 LM07050-01
  - NSF IRI-9502990
  - NIH 2 P30 CA14520-29
  - NIH 5 T32 GM08349

…and many others who helped
44
Last But Not Least…
45