
Interpreting Microarray Expression Data
Using Text Annotating the Genes
Michael Molla, Peter Andreae, Jeremy Glasner, Frederick Blattner, Jude Shavlik
University of Wisconsin – Madison
The Basic Task
Given
Microarray Expression Data &
Text Annotations of Genes
Generate
Model of Expression
Motivation
• Lots of Data Available on the Internet
– Microarray Expression Data
– Text Annotations of Genes
• Maybe we can Make the Scientist’s Job Easier
– Generate a Model of Expression Automatically
– Easier First Step for the Human
Microarray Expression Data
• Each spot represents a gene in E. coli
• Colors Indicate Up- or Down-Regulation
Under Antibiotic Shock
• For our Purposes, 3 Classes
– Up-Regulated
– Down-Regulated
– No-Change
Microarray Expression Data
From “Genome-Wide Expression in Escherichia coli K-12”, Blattner et al., 1999
Our Microarray Experiment
• 4290 genes
• 574 up-regulated
• 333 down-regulated
• 2747 un-regulated
• 636 not enough signal
Text Annotations of Genes
• The text from a sample SwissProt entry (b1382)
– The “description” field
HYPOTHETICAL 6.8 KDA PROTEIN IN
LDHA-FEAR INTERGENIC REGION
– The “keyword” field
HYPOTHETICAL PROTEIN
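As a minimal sketch (not from the slides), annotation text like this can be turned into the bag-of-words features used by the models that follow; the field handling, function name, and regex tokenizer are illustrative assumptions:

```python
import re

def annotation_words(description: str, keywords: str) -> set:
    """Tokenize a gene's SwissProt description and keyword fields into a
    single set of uppercase words (a bag-of-words feature set).
    The simple regex tokenizer is an illustrative choice, not the
    preprocessing actually used in this work."""
    text = f"{description} {keywords}".upper()
    return set(re.findall(r"[A-Z0-9.\-]+", text))

# Example with the b1382 entry shown above:
print(annotation_words(
    "HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION",
    "HYPOTHETICAL PROTEIN"))
```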
Sample Rules From a Model for
Up-Regulation
• IF
– The annotation contains FLAGELLAR AND
does NOT contain HYPOTHETICAL
OR
– The annotation contains BIOSYNTHESIS
• THEN
– The gene is up-regulated
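Read as code, a rule like this is simply a boolean test over the words in a gene's annotation; the sketch below (function name and tokenization are illustrative, not part of the learned model) shows the sample rule applied to annotation text:

```python
def up_regulated(annotation: str) -> bool:
    """Hypothetical encoding of the sample rule above: a gene is predicted
    up-regulated if its annotation contains FLAGELLAR but not HYPOTHETICAL,
    or contains BIOSYNTHESIS."""
    words = set(annotation.upper().split())
    return ("FLAGELLAR" in words and "HYPOTHETICAL" not in words) \
        or "BIOSYNTHESIS" in words

# The SwissProt description shown earlier would not fire this rule:
print(up_regulated("HYPOTHETICAL 6.8 KDA PROTEIN IN LDHA-FEAR INTERGENIC REGION"))
```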
Why use Machine Learning?
• Machine learning is concerned with building models automatically from available data
• Informed by the text data, the learner can make a first-pass model for the scientist
Desired Properties of a Model
• Accurate
– Measure with cross validation
• Comprehensible
– Measure with model size
• Stable to Small Changes in the Data
– Measure with random subsampling
Approaches
• Naïve Bayes
– Statistical method
– Uses all of the words (present or absent)
• PFOIL
– Covering algorithm
– Chooses words to use one at a time
Naïve Bayes
For each word wi, there are two likelihood ratios (lr):
lr (wi present) = p(wi present | up) / p(wi present | down)
lr (wi absent) = p(wi absent | up) / p(wi absent | down)
For each annotation, the lrs are multiplied together to form a single lr for the gene:
lr(gene) = lr(w1 X) × lr(w2 X) × … × lr(wn X)
where X is either present or absent, according to whether each word appears in the annotation.
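As a concrete illustration, here is a minimal Python sketch of these likelihood ratios, assuming each gene's annotation is a set of words and each training gene is labeled "up" or "down"; the Laplace smoothing and all function names are assumptions added for the sketch, not part of the slides:

```python
def word_likelihood_ratios(annotations, labels, vocab, smooth=1.0):
    """Estimate lr(wi present) and lr(wi absent) from labeled genes.

    annotations: list of word sets, one per gene
    labels:      parallel list of "up" / "down" labels
    Returns two dicts mapping each word to its presence and absence
    likelihood ratios. Laplace smoothing avoids division by zero and is
    an assumption of this sketch.
    """
    up = [a for a, y in zip(annotations, labels) if y == "up"]
    down = [a for a, y in zip(annotations, labels) if y == "down"]
    present, absent = {}, {}
    for w in vocab:
        p_up = (sum(w in a for a in up) + smooth) / (len(up) + 2 * smooth)
        p_down = (sum(w in a for a in down) + smooth) / (len(down) + 2 * smooth)
        present[w] = p_up / p_down
        absent[w] = (1.0 - p_up) / (1.0 - p_down)
    return present, absent

def gene_likelihood_ratio(annotation, vocab, present, absent):
    """Multiply lr(wi = X) over all words, where X is whether wi appears
    in the annotation. lr > 1 favors up-regulation, lr < 1 favors down."""
    lr = 1.0
    for w in vocab:
        lr *= present[w] if w in annotation else absent[w]
    return lr
```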
PFOIL
• Learns rules from the data
• Produces multiple if-then rules
• Builds rules by adding one word at a time
• Produces easy-to-interpret models
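The following is a simplified Python sketch of the covering idea, not the exact PFOIL algorithm: PFOIL scores candidate words with a FOIL-style information gain, which is replaced here by simple precision for brevity, and all names are illustrative.

```python
def learn_rules(annotations, labels, vocab):
    """Greedy covering sketch in the spirit of PFOIL.

    Each rule is a conjunction of words that must all appear in the
    annotation. Words are added one at a time; positives covered by the
    finished rule are removed and another rule is learned, until no
    positives remain or no useful rule can be built.
    """
    positives = [a for a, y in zip(annotations, labels) if y == "up"]
    negatives = [a for a, y in zip(annotations, labels) if y == "down"]
    rules = []
    while positives:
        rule, pos, neg = set(), list(positives), list(negatives)
        while neg:  # grow the rule until it covers no negatives
            candidates = [w for w in vocab if w not in rule]
            if not candidates:
                break
            best = max(candidates,
                       key=lambda w: sum(w in a for a in pos) /
                                     (sum(w in a for a in pos) +
                                      sum(w in a for a in neg) + 1e-9))
            if not any(best in a for a in pos):  # no positive coverage left
                break
            rule.add(best)
            pos = [a for a in pos if best in a]
            neg = [a for a in neg if best in a]
        if not rule or not pos:
            break
        rules.append(rule)
        positives = [a for a in positives if not rule <= a]
    return rules
```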
Accuracy/Comprehensibility Tradeoff
[Plot: testset error rate (0–50%) vs. number of words in the model (0–100) for the Baseline, Naive Bayes, and PFOIL.]
Stabilized PFOIL
• Repeatedly run PFOIL on randomly
sampled subsets
• For each word, count the number of models
it appears in
• Restrict PFOIL to only those words that
appear in a minimum of m models
• Rerun PFOIL with only those words
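A Python sketch of this stabilization loop, reusing the learn_rules sketch above; n_runs, sample_fraction, and the subsampling details are illustrative assumptions rather than the settings used in the experiments:

```python
import random

def stabilized_pfoil(annotations, labels, vocab, n_runs=20, m=10,
                     sample_fraction=0.7, seed=0):
    """Run the rule learner on random subsamples, count how many of the
    resulting rule sets each word appears in, then relearn using only the
    words that occur in at least m of those models."""
    rng = random.Random(seed)
    counts = {w: 0 for w in vocab}
    data = list(zip(annotations, labels))
    for _ in range(n_runs):
        sample = rng.sample(data, int(sample_fraction * len(data)))
        anns, labs = zip(*sample)
        rule_set = learn_rules(list(anns), list(labs), vocab)
        used = set().union(*rule_set) if rule_set else set()
        for w in used:
            counts[w] += 1
    restricted_vocab = [w for w in vocab if counts[w] >= m]
    return learn_rules(annotations, labels, restricted_vocab)
```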
Stability Measure
After running the algorithm N times to
generate N rule sets:
stability = ( Σ over wi in U of count(wi) / N ) / |U|
Where:
U = the set of words appearing in any rule set
count(wi) = number of rule sets containing word wi
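A Python sketch of one way to compute such a measure from the definitions above; the exact formula is a reconstruction from those definitions, not taken verbatim from the slides:

```python
def stability(rule_sets):
    """Average, over every word used in any of the N rule sets, of the
    fraction of rule sets that contain it. 1.0 means every word that is
    used anywhere appears in every rule set."""
    n = len(rule_sets)
    words_per_set = [set().union(*rs) if rs else set() for rs in rule_sets]
    universe = set().union(*words_per_set)   # U: words appearing in any rule set
    if not universe:
        return 0.0
    return sum(sum(w in ws for ws in words_per_set)   # count(wi) summed over U
               for w in universe) / (n * len(universe))
```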
Accuracy/Stability Tradeoff
[Plot: testset error rate (0–50%) and stability (0.0–1.0) vs. the value of m (0–50), comparing the stabilized PFOIL error rate, stabilized PFOIL stability, and unstabilized PFOIL stability.]
Discussion
• The tradeoffs in accuracy are not very severe
– vs. stability
– vs. comprehensibility
• PFOIL is not as good at characterizing the data
– suggests not many dependencies
– suggests a need for “softer” rules
Future Directions
• M of N rules
• Permutation Test
• More Sources of Text Data
Take-Home Message
• This is just a first step toward an aid for
understanding expression data
• Make expression models based on text instead of DNA sequence
Acknowledgements
• This research was funded by the following
grants:
NLM 1 R01 LM07050-01,
NSF IRI-9502990,
NIH 2 P30 CA14520-29, and
NIH 5 T32 GM08349.