BIOI7791 Spring 2005
Projects in bioinformatics:
natural language processing
March 25, 2005
© Kevin Cohen
Information extraction
(versus “natural language understanding”)
• Natural language understanding
– Doing everything that we (humans) do with text
• Information extraction
– Find finite set of kinds of facts
Natural language understanding
• How many phosphates are there on myo-inositol 1,4,5-trisphosphate?
• Was this a good experiment?
• Does myo-inositol 1,4,5-trisphosphate have more than one isoform? If yes, list them.
• What is the molecular weight of FKBP12?
We investigated the interaction of the 12 kDa FK506-binding
protein (FKBP12) with two ryanodine-receptor isoforms
(RyR1 and RyR3) and with two myo-inositol 1,4,5-trisphosphate (IP3) receptor isoforms (IP3R1 and IP3R3).
Using glutathione S-transferase (GST)-FKBP12 affinity
chromatography, we could efficiently extract RyR1 (42+/-7%
of the solubilized RyR1) from terminal cisternae of skeletal
muscle as well as RyR3 (32+/-4% of the solubilized RyR3)
from RyR3-overexpressing HEK-293 cells. These
interactions were completely abolished by FK506 (20
microM) but were largely unaffected by RyR-channel
modulators. In contrast, neither IP3R1 nor IP3R3 from
various sources, including rabbit cerebellum, A7r5 smooth-muscle cells and IP3R-overexpressing Sf9 insect cells from
Spodoptera frugiperda, were retained on the GST-FKBP12
matrix. Moreover, immunoprecipitation experiments
Information extraction
• PROTEIN_BINDING_EVENT_1
– Protein_1 RyR1
– Protein_2 FKBP12
• We obtained a 45 kDa fragment of
RyR1 that bound to the GST-FKBP12
matrix, indicating that it retained all
requirements for FKBP12 binding.
Information extraction in
computational biology
• 1999
– First rule-based system (Blaschke et al., ISMB)
– First learning-based system (Craven and
Kumlien, ISMB)
• Today
– Varying levels of syntactic and semantic
sophistication
– Both rule- and learning-based systems still
being built
That first rule-based system
• Regular patterns, hand-crafted
– …of words
– Gene_1 interacts with Gene_2
• No POS, no shallow parsing, no nothing
• Good performance claimed in an extremely
restricted domain
– Two signalling pathways in Drosophila
• Problems: anaphora, EI, …
Automated extraction of
information on protein-protein interactions from
the biological literature
Ono, Toshihide; Haretsugu
Hishigaki; Akira Tanigami; and
Toshihisa Takagi (2001)
Bioinformatics 17(2):155-161
Why this paper?
• Obvious approach
Why this paper?
• Used only stuff we’ve already talked
about
– Part-of-speech tagging
– stemming
• Special sauce
Problem
• Populate a database of protein-protein interactions
The template
• Interaction
– Protein 1
– Protein 2
Null hypothesis
• All that’s necessary for information
extraction is pattern matching.
Pattern matching
• No structure
• No meaning
• No inference
• No learning
This is a little bit more sophisticated
than that, since it makes reference to
non-explicit properties of the text
• Part of speech
• Word stems
Simplest approach
• /$protein interacts with $protein/
Simplest approach
• /($protein) interacts with ($protein)/
• $interaction{$sentenceNumber}{P1} = $1;
• $interaction{$sentenceNumber}{P2} = $2;
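As a rough illustration, the Perl-style pattern on this slide can be rendered in Python. This is only a sketch: the `PROTEIN` sub-pattern (a single alphanumeric token) and the function name `extract_interactions` are assumptions made for the example, not part of any published system.

```python
import re

# Hypothetical stand-in for $protein: one token of letters and digits.
# A real system would plug in names found by a tagger or a dictionary.
PROTEIN = r"[A-Za-z][A-Za-z0-9]*"
PATTERN = re.compile(rf"({PROTEIN}) interacts with ({PROTEIN})")

def extract_interactions(sentences):
    """Map sentence number -> {"P1": ..., "P2": ...} for each matching sentence."""
    interaction = {}
    for number, sentence in enumerate(sentences):
        match = PATTERN.search(sentence)
        if match:
            # Mirrors $interaction{$sentenceNumber}{P1} = $1, {P2} = $2.
            interaction[number] = {"P1": match.group(1), "P2": match.group(2)}
    return interaction

sentences = [
    "STD1 interacts with TBP in yeast.",
    "The experiment was repeated twice.",
]
found = extract_interactions(sentences)
```

Only the first sentence fills the template; anything phrased differently ("interacts directly with", "interaction between") is missed, which is the weakness of the simplest approach.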
What they actually did
Selection of target text
1. Identification of protein names
2. Process compound/complex sentences
3. Recognize protein-protein interactions
Selection of target text
• Query PubMed:
– (“yeast” OR “E coli”)
– AND “protein” AND “interaction”
– AND protein binding[MH]
• Manual filtering:
– interact/interacts/interacted etc.
– associate
– bind
– complex
– At least 2 gene names
Identification of protein
names
• Dictionary look-up
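A dictionary look-up can be sketched in a few lines. The name set below is a hypothetical toy dictionary assembled from examples in these slides, and the tokenization is deliberately naive; a real dictionary would come from a curated resource.

```python
import re

# Hypothetical toy dictionary, for illustration only.
PROTEIN_NAMES = {"STD1", "TBP", "Ste4p", "DMC1", "Rad52p", "Rad54p"}

def find_protein_names(sentence, dictionary=PROTEIN_NAMES):
    """Return dictionary entries found in the sentence, in order of appearance."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [token for token in tokens if token in dictionary]

def has_two_names(sentence, dictionary=PROTEIN_NAMES):
    """Check the 'at least 2 gene names' condition on a candidate sentence."""
    return len(set(find_protein_names(sentence, dictionary))) >= 2

example = "DMC1 does not interact in the two-hybrid assay with Rad52p or Rad54p."
names = find_protein_names(example)
```

Exact-match look-up of this kind misses multiword names and spelling variants, and it cannot tell a gene symbol from an identical common English word.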
Process compound/complex
sentences
Special sauce
Process compound/complex
sentences
• Preprocessing: POS tagging
– In biology, multiply and divide mean the
same thing.
Process compound/complex
sentences
• Preprocessing: POS tagging
– In/IN biology/NOUN ,/, multiply/VERB
and/CC divide/VERB mean/VERB the/DT
same/ADJ thing/NOUN ./.
– Brill tagger
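To work with tagged text downstream, the word/TAG notation has to be split back into pairs. A minimal sketch, assuming whitespace-separated tokens and a slash before each tag (the helper name `parse_tagged` is made up for this example):

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the LAST slash, so ',/,' and './.' work too.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = ("In/IN biology/NOUN ,/, multiply/VERB and/CC divide/VERB "
          "mean/VERB the/DT same/ADJ thing/NOUN ./.")
pairs = parse_tagged(tagged)
verbs = [word for word, tag in pairs if tag == "VERB"]
```

Here `verbs` picks out "multiply", "divide" and "mean", which is the kind of non-explicit property that plain word patterns cannot see.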
Process compound/complex
sentences
• Rule 1:
P1 [(, CC DT)|(, IN)|:|;] P2
• Rule 2:
P3 VB1 P4 VB2 CC P5
Process compound/complex
sentences
• Rule 1:
P1 [(, CC DT)|(, IN)|:|;] P2
Output:
• P1
• P2
Process compound/complex
sentences
• The gpa1 mutant blocked stable association of
Ste4p with the plasma membrane, and the ste18
mutant blocked stable association of Ste4p with
both plasma membranes and internal membranes.
• The gpa1 mutant blocked stable association of
Ste4p with the plasma membrane
• ste18 mutant blocked stable association of Ste4p
with both plasma membranes and internal
membranes.
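Rule 1 can be sketched as a search for a delimiter tag sequence in the tagged sentence. The tags below were assigned by hand for illustration (not produced by a tagger), the `(, IN)` delimiter is an assumed reading of the rule's second alternative, and only the first delimiter found is used.

```python
# Delimiters as tag sequences; ":" and ";" are assumed to be tagged as themselves.
DELIMITERS = [[",", "CC", "DT"], [",", "IN"], [":"], [";"]]

def split_rule1(tagged):
    """Split a tagged sentence into two clauses at the first Rule 1 delimiter."""
    tags = [tag for _, tag in tagged]
    for i in range(len(tags)):
        for delim in DELIMITERS:
            if tags[i:i + len(delim)] == delim:
                left = [word for word, _ in tagged[:i]]
                right = [word for word, _ in tagged[i + len(delim):]]
                return [left, right]
    return [[word for word, _ in tagged]]  # no delimiter: leave as one clause

# Hand-tagged version of the example sentence (tags are an assumption).
text = ("The/DT gpa1/NN mutant/NN blocked/VBD stable/JJ association/NN of/IN "
        "Ste4p/NN with/IN the/DT plasma/NN membrane/NN ,/, and/CC the/DT "
        "ste18/NN mutant/NN blocked/VBD stable/JJ association/NN of/IN "
        "Ste4p/NN with/IN both/DT plasma/NN membranes/NNS and/CC "
        "internal/JJ membranes/NNS ./.")
tagged = [tuple(item.rsplit("/", 1)) for item in text.split()]
clauses = split_rule1(tagged)
```

The delimiter tokens themselves (", and the") are dropped, so the second clause starts at "ste18", as on the slide.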
Process compound/complex
sentences
Why?
Process compound/complex
sentences
• association of A (with|and) B
• The gpa1 mutant blocked stable association
of Ste4p with the plasma membrane, and
the ste18 mutant blocked stable association
of Ste4p with both plasma membranes and
internal membranes.
Process compound/complex
sentences
DISCLAIMER
Process compound/complex
sentences
• Rule 2:
P3 VB1 P4 VB2 CC P5
Output:
• P3 VB1 P4
• P3 VB2 P5
Process compound/complex
sentences
• STD1 interacts directly with the TBP and
modulates transcription of the SUC2 gene
of S. cerevisiae.
• STD1 interacts directly with the TBP
• STD1 modulates transcription of the SUC2
gene of S. cerevisiae.
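Rule 2 can be sketched the same way: find the first verb, then a conjunction that is directly followed by a second verb, and copy the subject in front of the second verb phrase. The tags are hand-assigned for illustration, and the sketch follows the word order of the example sentence (conjunction before the second verb).

```python
def split_rule2(tagged):
    """Split 'P3 VB1 P4 and VB2 P5' into 'P3 VB1 P4' and 'P3 VB2 P5'."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    verbs = [i for i, tag in enumerate(tags) if tag.startswith("VB")]
    if not verbs:
        return [words]
    first_verb = verbs[0]
    for i in range(first_verb + 1, len(tags) - 1):
        if tags[i] == "CC" and tags[i + 1].startswith("VB"):
            subject = words[:first_verb]       # P3, shared by both clauses
            return [words[:i], subject + words[i + 1:]]
    return [words]  # pattern not found: leave the sentence unchanged

# Hand-tagged version of the example sentence (tags are an assumption).
text = ("STD1/NN interacts/VBZ directly/RB with/IN the/DT TBP/NN and/CC "
        "modulates/VBZ transcription/NN of/IN the/DT SUC2/NN gene/NN of/IN "
        "S./NN cerevisiae/NN ./.")
tagged = [tuple(item.rsplit("/", 1)) for item in text.split()]
clauses = split_rule2(tagged)
```

After the split, each clause contains one verb and its own arguments, so a pattern like "A interact with B" no longer has to skip over unrelated material.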
Recognition of interaction
• INTERACT
– A interact with B
– interaction of A (with|and) B
– interaction (between|among) A and B
– A-B interaction
– A and B interact
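The five INTERACT patterns translate fairly directly into regular expressions. In this sketch `NAME` is a hypothetical stand-in for an already-recognized protein name, and the input is assumed to be stemmed (hence "interact" rather than "interacts").

```python
import re

NAME = r"[A-Za-z0-9]+"  # hypothetical stand-in for a recognized protein name

INTERACT_PATTERNS = [
    rf"(?P<a>{NAME}) interact with (?P<b>{NAME})",
    rf"interaction of (?P<a>{NAME}) (?:with|and) (?P<b>{NAME})",
    rf"interaction (?:between|among) (?P<a>{NAME}) and (?P<b>{NAME})",
    rf"(?P<a>{NAME})-(?P<b>{NAME}) interaction",
    rf"(?P<a>{NAME}) and (?P<b>{NAME}) interact",
]
COMPILED = [re.compile(pattern) for pattern in INTERACT_PATTERNS]

def match_interaction(sentence):
    """Return (A, B) for the first pattern that matches, or None."""
    for pattern in COMPILED:
        match = pattern.search(sentence)
        if match:
            return match.group("a"), match.group("b")
    return None
```

For example, `match_interaction("interaction between RyR1 and FKBP12")` fills the template with RyR1 and FKBP12, while a sentence with no interaction keyword returns `None`.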
Recognition of interaction
• INTERACT
• ASSOCIATE
• BIND
• COMPLEX
Recognition of interaction
• Preprocessing: stemming
– Porter stemmer
– interacts -> interact
– interacted -> interact
– interacting -> interact
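The effect of stemming on the keyword can be imitated with crude suffix stripping. This is not the Porter algorithm, only a toy stand-in that happens to handle the three forms above; the suffix list and the minimum stem length are assumptions.

```python
SUFFIXES = ("ing", "ed", "s")  # tried in this order

def crude_stem(word):
    """Strip one common inflectional suffix; NOT the Porter stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ["interacts", "interacted", "interacting"]]
```

All three forms reduce to "interact", so one pattern covers them. A real stemmer applies many more rules and may produce different stems for other words.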
Recognition of interaction
DISCLAIMER
Recognition of interaction
Special sauce
Recognition of interaction
Negation
• DMC1 does not interact in the two-hybrid assay
with Rad52p or Rad54p.
• protein1.* not (interact|associate|bind|complex).*protein2
Recognition of interaction
• protein1.* not (interact|associate|bind|complex).*protein2
• protein1.* PATTERN.* but not protein2
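The two negation patterns can be sketched as a filter applied to a candidate protein pair. In this sketch the bare keyword alternation stands in for PATTERN in the second rule, which is a simplifying assumption; the function names are made up for the example.

```python
import re

KEYWORDS = r"(?:interact|associate|bind|complex)"

def negation_patterns(protein1, protein2):
    """Build both negation patterns for one ordered protein pair."""
    p1, p2 = re.escape(protein1), re.escape(protein2)
    return [
        re.compile(rf"{p1}.* not {KEYWORDS}.*{p2}"),
        # The keyword alternation stands in for PATTERN here (assumption).
        re.compile(rf"{p1}.*{KEYWORDS}.* but not {p2}"),
    ]

def is_negated(sentence, protein1, protein2):
    """True if either negation pattern matches the sentence for this pair."""
    return any(p.search(sentence) for p in negation_patterns(protein1, protein2))

sentence = "DMC1 does not interact in the two-hybrid assay with Rad52p or Rad54p."
negated = is_negated(sentence, "DMC1", "Rad52p")
```

A pair flagged this way would be kept out of the database even though the sentence matches an interaction keyword.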
Results
• Corpus
– 2 sets of sentences
– 834 yeast
– 752 E. coli
Results
• Precision: if you found it, was it right?
Precision = TP/(TP + FP)
Results
• Recall: if it was there, did you find it?
Recall = TP/(TP + FN)
Results
Key word     TP     TP+TN   TP+FP   P      R
Interact     198    222     206     96.1   89.1
Associate    55     68      61      90.2   80.9
Bind         103    119     108     95.3   86.6
Complex      152    176     164     92.7   86.4
TOTAL        508    585     539     94.5   86.8
Results
“…we calculate recall and precision based on the
following formula:
recall = TP/(TP + TN)
…where TP [and] TP+TN indicate as follows:
TP = the number of sentences extracted correctly by
our method;
TP + TN = the total number of sentences containing
information on protein-protein interactions”
(157)
Results
• Can we calculate the truth? No, but:
– TN = (TP + TN) – TP = 585 – 508 = 77
– FP = (TP + FP) – TP = 539 – 508 = 31
– FN = Total – TP – TN – FP = 834 – 508 – 77 – 31 = 218
Results
Recall = 508/(508 + 218) = .70
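The arithmetic on these slides can be re-run from the per-keyword counts in the results table. The sketch below uses only numbers that appear in the slides; the total of 834 is the yeast sentence count, as in the calculation above, and the function names are made up for the example.

```python
# (TP, "TP+TN" as labelled in the paper, TP+FP) per keyword, from the results table.
COUNTS = {
    "Interact": (198, 222, 206),
    "Associate": (55, 68, 61),
    "Bind": (103, 119, 108),
    "Complex": (152, 176, 164),
}
TOTAL_SENTENCES = 834  # yeast sentences, the total used on the slide

def paper_recall(tp, tp_plus_tn):
    """Recall as the paper writes it: TP / (TP + TN)."""
    return tp / tp_plus_tn

def standard_recall(tp, fn):
    """Recall as usually defined: TP / (TP + FN)."""
    return tp / (tp + fn)

tp = sum(row[0] for row in COUNTS.values())
tp_plus_tn = sum(row[1] for row in COUNTS.values())
tp_plus_fp = sum(row[2] for row in COUNTS.values())

tn = tp_plus_tn - tp                  # what the paper's labelling implies
fp = tp_plus_fp - tp
fn = TOTAL_SENTENCES - tp - tn - fp   # whatever is left over

reported = paper_recall(tp, tp_plus_tn)
recomputed = standard_recall(tp, fn)
```

Taking the paper's labels at face value, `reported` is about 0.87 while `recomputed` is about 0.70, which is the gap the slide points out.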
Results
Maybe they just labelled it wrong?
Results
What does FN even mean when you
hand-selected your input?
If this is so great, why
aren’t we done yet?
- W. John Wilbur
How could you improve on this?
This week’s assignment
• Read Craven and Kumlien (1999)
• Mail me your paper review form one hour
before class
Paper review form
• problem being addressed
– general:
– specific:
Paper review form
• approach taken
– rule-based
– machine learning
  • Bayesian
  • HMM
  • SVM
  • decision tree
– data
– pre-processing
  • stemming
  • POS tagging
  • shallow parsing
– post-processing
Paper review form
• results
• speculation about future directions
• what do they say about things that I’m
interested in?
• What were the size and contents of the
evaluation corpus?
• what do they make available?
Paper review form
• Specific questions about molecular biology EI:
What did they exclude?
• multiword
• isomorphic with common English words
• anything but symbols
• What’s the set of entity classes?
• What’s the set of relations?
• Quotable quotes: