BIOI7791 Spring 2005
Projects in bioinformatics: natural language processing
March 25, 2005
© Kevin Cohen

Information extraction (versus “natural language understanding”)
• Natural language understanding
  – Doing everything that we (humans) do with text
• Information extraction
  – Find a finite set of kinds of facts

Natural language understanding
• How many phosphates are there on myo-inositol 1,4,5-trisphosphate?
• Was this a good experiment?
• What is the molecular weight of FKBP12?
• Does myo-inositol 1,4,5-trisphosphate have more than one isoform? If yes, list them.

We investigated the interaction of the 12 kDa FK506-binding protein (FKBP12) with two ryanodine-receptor isoforms (RyR1 and RyR3) and with two myo-inositol 1,4,5-trisphosphate (IP3) receptor isoforms (IP3R1 and IP3R3). Using glutathione S-transferase (GST)-FKBP12 affinity chromatography, we could efficiently extract RyR1 (42+/-7% of the solubilized RyR1) from terminal cisternae of skeletal muscle as well as RyR3 (32+/-4% of the solubilized RyR3) from RyR3-overexpressing HEK-293 cells. These interactions were completely abolished by FK506 (20 microM) but were largely unaffected by RyR-channel modulators. In contrast, neither IP3R1 nor IP3R3 from various sources, including rabbit cerebellum, A7r5 smooth-muscle cells and IP3R-overexpressing Sf9 insect cells from Spodoptera frugiperda, were retained on the GST-FKBP12 matrix. Moreover, immunoprecipitation experiments

Information extraction
• PROTEIN_BINDING_EVENT_1
  – Protein_1: RyR1
  – Protein_2: FKBP12
• We obtained a 45 kDa fragment of RyR1 that bound to the GST-FKBP12 matrix, indicating that it retained all requirements for FKBP12 binding.
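The filled template above can be pictured as a small record. This is only an illustrative Python sketch: the class name, the field names, and the extra `evidence` field are assumptions made for the example, not part of any system discussed in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinBindingEvent:
    """One filled template: a binding event between two proteins."""
    protein_1: str
    protein_2: str
    evidence: str  # source sentence; added here for illustration only


# The example from the slide, filled in by hand.
event_1 = ProteinBindingEvent(
    protein_1="RyR1",
    protein_2="FKBP12",
    evidence=(
        "We obtained a 45 kDa fragment of RyR1 that bound to the "
        "GST-FKBP12 matrix, indicating that it retained all "
        "requirements for FKBP12 binding."
    ),
)

print(event_1.protein_1, "binds", event_1.protein_2)
```

The point of the fixed record is the contrast with natural language understanding: an extraction system only has to fill these slots, not answer arbitrary questions about the text.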
Information extraction in computational biology
• 1999
  – First rule-based system (Blaschke et al., ISMB)
  – First learning-based system (Craven and Kumlien, ISMB)
• Today
  – Varying levels of syntactic and semantic sophistication
  – Both rule- and learning-based systems still being built

That first rule-based system
• Regular patterns, hand-crafted
  – …of words
  – Gene_1 interacts with Gene_2
• No POS, no shallow parsing, no nothing
• Good performance claimed in an extremely restricted domain
  – Two signalling pathways in Drosophila
• Problems: anaphora, EI, …

Automated extraction of information on protein-protein interactions from the biological literature
Ono, Toshihide; Haretsugu Hishigaki; Akira Tanigami; and Toshihisa Takagi (2001) Bioinformatics 17(2):155-161

Why this paper?
• Obvious approach
• Used only stuff we’ve already talked about
  – Part-of-speech tagging
  – Stemming
• Special sauce

Problem
• Populate a database of protein-protein interactions

The template
• Interaction
  – Protein 1
  – Protein 2

Null hypothesis
• All that’s necessary for information extraction is pattern matching.

Pattern matching
• No structure
• No meaning
• No inference
• No learning
This paper is a little more sophisticated than that, since it refers to non-explicit properties of the text:
• Part of speech
• Word stems

Simplest approach
• /$protein interacts with $protein/
• /($protein) interacts with ($protein)/
• $interaction{$sentenceNumber}{P1} = $1;
• $interaction{$sentenceNumber}{P2} = $2;

What they actually did
Selection of target text
1. Identification of protein names
2. Process compound/complex sentences
3. Recognize protein-protein interactions

Selection of target text
• Query PubMed:
  – (“yeast” OR “E coli”)
  – AND “protein” AND “interaction”
  – AND protein binding[MH]
• Manual filtering:
  – interact/interacts/interacted etc.
  – associate
  – bind
  – complex
  – At least 2 gene names

Identification of protein names
• Dictionary look-up

Process compound/complex sentences
Special sauce

Process compound/complex sentences
• Preprocessing: POS tagging
  – In biology, multiply and divide mean the same thing.
  – In/IN biology/NOUN ,/, multiply/VERB and/CC divide/VERB mean/VERB the/DT same/ADJ thing/NOUN ./.
  – Brill tagger

Process compound/complex sentences
• Rule 1: P1 [(, CC DT)|(, IN|:|;] P2
• Rule 2: P3 VB1 P4 VB2 CC P5

Process compound/complex sentences
• Rule 1: P1 [(, CC DT)|(, IN|:|;] P2
Output:
• P1
• P2

Process compound/complex sentences
• The gpa1 mutant blocked stable association of Ste4p with the plasma membrane, and the ste18 mutant blocked stable association of Ste4p with both plasma membranes and internal membranes.
Output:
• The gpa1 mutant blocked stable association of Ste4p with the plasma membrane
• ste18 mutant blocked stable association of Ste4p with both plasma membranes and internal membranes.

Process compound/complex sentences
Why?
• association of A (with|and) B
• The gpa1 mutant blocked stable association of Ste4p with the plasma membrane, and the ste18 mutant blocked stable association of Ste4p with both plasma membranes and internal membranes.
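Rule 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it handles only the ", CC DT" alternative of the rule, the function names are made up for the example, and the POS tags on the example sentence were assigned by hand (Penn-style) rather than produced by the Brill tagger.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


def split_rule1(tagged):
    """Split at the first ', CC DT' sequence and return the clauses.

    Only the first alternative of Rule 1 is handled. The separator
    tokens (comma, conjunction, determiner) are dropped from the output.
    """
    for i in range(len(tagged) - 2):
        word, _ = tagged[i]
        _, tag_b = tagged[i + 1]
        _, tag_c = tagged[i + 2]
        if word == "," and tag_b == "CC" and tag_c == "DT":
            left = " ".join(w for w, _ in tagged[:i])
            right = " ".join(w for w, _ in tagged[i + 3:])
            return [left, right]
    # No separator found: return the sentence unsplit.
    return [" ".join(w for w, _ in tagged)]


# Hand-tagged version of the example sentence (tags are an assumption).
TAGGED = (
    "The/DT gpa1/NN mutant/NN blocked/VBD stable/JJ association/NN of/IN "
    "Ste4p/NN with/IN the/DT plasma/NN membrane/NN ,/, and/CC the/DT "
    "ste18/NN mutant/NN blocked/VBD stable/JJ association/NN of/IN "
    "Ste4p/NN with/IN both/DT plasma/NN membranes/NNS and/CC "
    "internal/JJ membranes/NNS ./."
)

for clause in split_rule1(parse_tagged(TAGGED)):
    print(clause)
```

The later "and/CC internal/JJ" is not a split point because no comma precedes it, which is why the second clause keeps "both plasma membranes and internal membranes" intact.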
Process compound/complex sentences
DISCLAIMER

Process compound/complex sentences
• Rule 2: P3 VB1 P4 VB2 CC P5
Output:
• P3 VB1 P4
• P3 VB2 P5

Process compound/complex sentences
• STD1 interacts directly with the TBP and modulates transcription of the SUC2 gene of S. cerevisiae.
Output:
• STD1 interacts directly with the TBP
• STD1 modulates transcription of the SUC2 gene of S. cerevisiae.

Recognition of interaction
• INTERACT
  – A interact with B
  – interaction of A (with|and) B
  – interaction (between|among) A and B
  – A-B interaction
  – A and B interact

Recognition of interaction
• INTERACT
• ASSOCIATE
• BIND
• COMPLEX

Recognition of interaction
• Preprocessing: stemming
  – Porter stemmer
  – interacts -> interact
  – interacted -> interact
  – interacting -> interact

Recognition of interaction
DISCLAIMER

Recognition of interaction
Special sauce
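The stem-then-match idea behind the INTERACT patterns can be sketched as follows. Everything here is an assumption made for illustration: the protein list stands in for the dictionary look-up, `toy_stem` is a crude suffix stripper and not the Porter algorithm, and the `GAP` allowance for a few intervening words is a guess at how loosely such patterns would need to match.

```python
import re

# Hypothetical stand-in for the protein-name dictionary look-up.
PROTEINS = ["STD1", "TBP", "RyR1", "FKBP12"]
P = "(" + "|".join(re.escape(name) for name in PROTEINS) + ")"
GAP = r"(?: \w+){0,3}"  # assumption: allow up to three intervening words

# The five INTERACT patterns from the slide, in the same order.
INTERACT_PATTERNS = [
    re.compile(P + r" interact\b" + GAP + " with" + GAP + " " + P),
    re.compile("interaction of " + P + " (?:with|and) " + P),
    re.compile("interaction (?:between|among) " + P + " and " + P),
    re.compile(P + "-" + P + " interaction"),
    re.compile(P + " and " + P + " interact"),
]


def toy_stem(word):
    """Crude suffix stripping; a stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def normalize(sentence):
    """Strip punctuation and stem every token that is not a protein name."""
    tokens = []
    for raw in sentence.split():
        token = raw.strip(".,;:()")
        if token:
            tokens.append(token if token in PROTEINS else toy_stem(token))
    return " ".join(tokens)


def find_interaction(sentence):
    """Return (protein_1, protein_2) for the first matching pattern, else None."""
    text = normalize(sentence)
    for pattern in INTERACT_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1), match.group(2)
    return None


print(find_interaction("STD1 interacts directly with the TBP"))
```

Stemming first means one pattern covers interacts, interacted and interacting; the ASSOCIATE, BIND and COMPLEX families would need their own pattern lists built the same way.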
Recognition of interaction
Negation
• DMC1 does not interact in the two-hybrid assay with Rad52p or Rad54p.
• protein1.* not (interact|associate|bind|complex).*protein2
• protein1.* PATTERN.* but not protein2

Results
• Corpus
  – 2 sets of sentences
  – 834 yeast
  – 752 E. coli

Results
• Precision: if you found it, was it right?
  Precision = TP / (TP + FP)

Results
• Recall: if it was there, did you find it?
  Recall = TP / (TP + FN)

Results
Key word    TP    TP+TN   TP+FP   P (%)   R (%)
Interact    198   222     206     96.1    89.1
Associate   55    68      61      90.2    80.9
Bind        103   119     108     95.3    86.6
Complex     152   176     164     92.7    86.4
TOTAL       508   585     539     94.5    86.8

Results
“…we calculate recall and precision based on the following formula: recall = TP/(TP + TN) …where TP [and] TP+TN indicate as follows: TP = the number of sentences extracted correctly by our method; TP + TN = the total number of sentences containing information on protein-protein interactions” (157)

Results
• Can we calculate the truth? No, but:
  TN = (TP + TN) – TP = 585 – 508 = 77
  FP = (TP + FP) – TP = 539 – 508 = 31
  FN = Total – TP – TN – FP = 834 – 508 – 77 – 31 = 218

Results
Recall = 508/(508 + 218) = .70

Results
Maybe they just labelled it wrong?

Results
What does FN even mean when you hand-selected your input?

If this is so great, why aren’t we done yet?
- W. John Wilbur

How could you improve on this?

This week’s assignment
• Read Craven and Kumlien (1999)
• Mail me your paper review form one hour before class

Paper review form
• problem being addressed
  – general:
  – specific:

Paper review form
• approach taken
  – rule-based
  – machine learning
    • Bayesian
    • HMM
    • SVM
    • decision tree
  – data
  – pre-processing
    • stemming
    • POS tagging
    • shallow parsing
  – post-processing

Paper review form
• results
• speculation about future directions
• what do they say about things that I’m interested in?
• What were the size and contents of the evaluation corpus?
• what do they make available?
Paper review form
• Specific questions about molecular biology EI:
  – What did they exclude?
    • multiword
    • isomorphic with common English words
    • anything but symbols
• What’s the set of entity classes?
• What’s the set of relations?
• Quotable quotes:
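Going back to the Results slides, the arithmetic there can be checked with a short Python sketch. The counts are the TOTAL row and the 834-sentence yeast set from those slides; the variable and function names are made up for the example, and the paper's column labels are taken at face value, as the slides do.

```python
def precision(tp, fp):
    """If you found it, was it right?"""
    return tp / (tp + fp)


def recall(tp, fn):
    """If it was there, did you find it?"""
    return tp / (tp + fn)


TOTAL_SENTENCES = 834  # yeast sentence set
TP = 508
TP_PLUS_TN = 585       # the paper's "TP + TN" column
TP_PLUS_FP = 539       # the paper's "TP + FP" column

# Back out the individual cells, reading the labels literally.
TN = TP_PLUS_TN - TP
FP = TP_PLUS_FP - TP
FN = TOTAL_SENTENCES - TP - TN - FP

paper_recall = TP / TP_PLUS_TN     # recall as defined in the paper
standard_recall = recall(TP, FN)   # recall as defined on the Recall slide

print("TN, FP, FN:", TN, FP, FN)
print("precision: %.3f" % precision(TP, FP))
print("recall (paper's formula): %.3f" % paper_recall)
print("recall (TP / (TP + FN)): %.3f" % standard_recall)
```

The two recall figures differ only because of which count goes in the denominator, which is the question the "Maybe they just labelled it wrong?" slide raises.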