1 GeneTUC: Natural Language Understanding in Medical Text
PhD Defense
Rune Sætre
June 27th 2006

2 Overview
• Motivation
• Thesis Work
  – Overview (Diploma Thesis)
  – Idea (Paper 1 and 2)
  – Bioogle (Paper 3, 4 and 5)
  – GeneTUC (Paper 6)
• Results, Related Work and Discussion
• Comments and Questions by Jong C. Park and Eivind Hovig

3 Motivation
[Figure: PubMed journal abstracts per day, by publication year. Y-axis: 0 to 1800 abstracts per day; x-axis: publication years 1950 to 2004. Source: http://www.ncbi.nlm.nih.gov/PubMed/]

4 …Motivation
• Biomedical researchers publish almost 2000 abstracts per day in MEDLINE
  – Computers are needed to automatically find all (recall), and only (precision), the relevant information
• Future Solution: GeneTUC
  – TUC: The Understanding Computer
  – BusTUC works for Natural Language queries about buses in Trondheim
  – GeneTUC uses full-parsing to extract knowledge from MEDLINE
  – After parsing the input, GeneTUC can answer simple questions about protein and gene interactions and other facts from the text

5 Challenge: Medical language
• Example Input Sentences:
  – Subsequently, activated CREB activates transcription of genes essential for proper germ cell differentiation.+
  – Indeed, Ca2+/calmodulin binds a complex of RGS4 and a transition state analog of Galpha i1-GDP-AlF4-.*
• Medical language is not always natural language:
  – Complex grammar
  – Invention of new words/names every day
+ PMID: 11988318
* BioCreative1 Example, PMID 11988318

6 GeneTUC Research Overview
[Figure: timeline from 2001 to 2006 placing the GeneTUC Diploma Thesis, the NLU review and papers I to VI of the GeneTUC Project across the fields IR, NLP, IE and QA/NLU. Legend: "n": presented in Part II as research paper #n; "Title": presented in separate PhD Research School document.]

7 Thesis Work
• GeneTUC Diploma Work
  – Literature Review: NLU in Medicine
  – GeneTUC: Full-parsing of MEDLINE Abstracts
• PhD Papers:
  1 Unitex: Local Grammars
  2 ProtChew: Automatic Protein Name Recognition
  3 Alchymoogle: Automatic Entity Annotation
  4 gProt: Automatic Protein Interaction Annotation
  5 WebProt: Online gProt Experiments
  6 GeneTUC: GENIA corpus experiments

8 TUC Introduction
• Chat-80, Prat-89, HSQL
• 1991: The Understanding Computer
• 1996: BusTUC (atb.no/bussorakelet/)
• 2000: GeneTUC, diploma project
• 2001-2006: GeneTUC has been my PhD project

9 GeneTUC System Architecture
• MEDLINE: Abstracts
• GO: GeneOntology
• TUC: The Understanding Computer
• DB: TQL DataBase
• HGNC: HUGO Gene Nomenclature Committee
• WordNet: Ontology
[Diagram: GeneTUC system with the components Query, Answer, MEDLINE, HGNC, GO, WordNet, TUC and DB]

10 WordNet 2.0
• Online lexical reference system
  – Nouns, verbs, adjectives and adverbs
• Inspired by psycholinguistic theories of human lexical memory
  – Organized into synonym sets, each representing one underlying lexical concept
• Different relations link the synonym sets
  – E.g. hypernyms, hyponyms, holonyms, synonyms, coordinate terms, domain

11 Nomenclature, HUGO
• HUGO Gene Nomenclature Committee
  – Approve a gene name and symbol for each known human gene
  – Stored in the Human Gene Nomenclature Database
  – Approved 13,000 symbols (20-30,000 human genes)
  – Each symbol is unique
  – Each gene is only given one approved gene symbol
  – Similar names used, e.g. in mouse gene research
  – Efforts are made to use a symbol acceptable to workers in the field
• Facilitates electronic data retrieval from publications

12 Gene Ontology
• Heterarchy
  – Molecular Terms
• Controlled Vocabulary
• Function, Process and Location
[Diagram: GO branching into Molecular Function, Biological Process and Cellular Component]

13 GeneTUC Parser
• Top-Down, left to right
• Greedy Heuristics
• Semantic Constraints
  – Interact(Agent: RGS4)
  – The rock grows
[Parse tree for "Rgs4 interacts with calmodulin": S at the root with NP and VP below it; lower node labels N, PP, V, PP, P, N]

14 Screenshot Example
    E: rgs4 interacts with calmodulin.
    ........................................................................
    % TQL:
    rgs4 isa protein
    calmodulin isa protein
    interact/rgs4/sk(1)
    srel/with/thing/calmodulin/sk(1)
    event/real/sk(1)
    ........................................................................
    E: calmodulin interacts with cck.
    ........................................................................
    % TQL:
    cck isa gene
    interact/calmodulin/sk(3)
    srel/with/thing/cck/sk(3)
    event/real/sk(3)
    ........................................................................
[Diagram: RGS4, Calmodulin, CCK]

15 Screenshot Example ctd.
    E: does rgs4 interact with cck?
    ...............................................................
    % TQL:
    [test::(rgs4 isa protein,
            cck isa gene,
            interact/rgs4/A,
            srel/with/thing/cck/A,
            event/real/A)]
    ................................................................
    Yes
    ................................................................
• A transitive rule
  – ProteinA interacts with ProteinB and ProteinB interacts with ProteinC ==> ProteinA interacts with ProteinC
[Diagram: RGS4, Calmodulin, CCK]

16 Dictionary
• GeneTUC does not perform very well without a complete dictionary
• Current Solution: Bioogle can build a dictionary

17 Bioogle (Paper III)
• Current ontology: 275 medical terms
• Connect Unknowns to these Concepts
• Query syntax
  – "Unknown is (an|a)"
• Parse results until a hit is found (or not)
  – "Pentagastrin is a synthetic peptide containing the five terminal amino acids of gastrin."
• Result: 104 of 200 terms were correctly classified

18 Relations: GeneTUC Ontology
[Diagram: ontology with the relations AKO, Is-A and Has_A over the nodes Thing, Set, Compound, Activity, Family, Substance, Gastrin, Hormone, Peptide and Pentagastrin]

19 Google API Search
• 1000 queries per user per day
• Free to use for everybody
• Can be programmed with SOAP in most languages
  – Simple Object Access Protocol
• Results are handled automatically
• Alexa (Amazon) has implemented a similar service*
  – $1 per processor hour
  – $1 per gigabyte/year of user storage
  – $1 per 50 gigabytes of data processed
  – $1 per gigabyte uploaded/downloaded
* http://news.bbc.co.uk/1/hi/technology/4530978.stm

20 Paper IV: gProt
• What about protein interactions?
• Protein Interaction
  – Protein → Protein
  – BioCreAtIvE1: Protein → Set of GeneOntology Terms
• Find publicly known interactions for a given protein, using Google as the main source for new knowledge
  – Query: "proteinX VerbY"
  – Example: "Gastrin activates"

21 Paper IV: gProt

22
[Screenshot: Google search results]
Gastrin activates nuclear factor {kappa}B (NF{kappa}B) through a ...
  Conclusions: Gastrin activates NF {kappa} B via a PKC dependent pathway which involves I {kappa} B kinase, NF {kappa} B inducing kinase, and TRAF6. ...
  gut.bmjjournals.com/cgi/content/abstract/52/6/813
Gastrin activates nuclear factor {kappa}B (NF{kappa}B) through a ...
  gut.bmjjournals.com/cgi/reprint/52/6/813
Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  BACKGROUND: We previously reported that gastrin induces expression of CXC chemokines through activat...
  www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12740336&dopt=Abstract
Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  CONCLUSIONS: Gastrin activates NFkappaB via a PKC dependent pathway which involves IkappaB kinase, NFkappaB inducing kinase, and TRAF6. MeSH Terms: ...
  www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12740336&dopt=Citation
Gastrin activates nuclear factor kappaB (NFkappaB) through a ...
  iHOP - Information Hyperlinked over Proteins · Gastrin activates nuclear factor kappaB (NFkappaB) through a protein kinase C dependent pathway involving ...
  www.pdg.cnb.uam.es/UniPub/iHOP/gp/9705030.html
Gast - Gastrin precursor
  Gastrin activates rat stomach histidine decarboxylase via cholecystokinin-B/gastrin receptors. Abstract-863492. Gastrin activated transcription through a ...
  www.pdg.cnb.uam.es/UniPub/iHOP/gg/121191.html
Anatomy & Physiology Lecture Outlines
  aa. gastrin activates gastric juice secretion & gastric smooth muscle "churning" bb. gastrin activates gastroileal reflex which moves chyme from ileum to ...
  www.gwc.maricopa.edu/class/bio202/digestlc.htm

23 Paper IV: gProt
• Results, 2000 facts
[Pie chart: Errors 26 %, Proteins 17 %, GO-terms 57 %]

24 Paper V: WebProt
• Online Implementation, bigger experiment
• Can Annotate Protein Interactions with 70% precision
• Tested the effect of source filtering
  – 90% precision, but recall dropping to 70%

25 Google as a source
Domain                 Count  Description
nih.gov                239    PubMed Central Collection of Journals, Books and MEDLINE
jbc.org                196    Biological Chemistry, Journal
physiology.org         143    American Physiological Society, Collection of Journals
endojournals.org       110    Endocrine Society, Collection of Journals
asm.org                83     American Society for Microbiology, Collection of Journals
ahajournals.org        71     American Heart Association, Collection of Journals
nature.com             69     Nature, same as npgjournals.com, Collection of Journals
ingentaconnect.com     55     Ingenta Online Publisher, Collection of Journals
aacrjournals.org       55     Cancer Research Journal
jimmunol.org           51     Immunology, Journal
karger.com             48     Karger Medical and Scientific Publisher, Big Collection of Journals
pnas.org               44     National Academy of Sciences USA, Proceedings
ac.uk                  42     MOLECULAR AND CELLULAR BIOLOGY, Journal
bloodjournal.org       40     American Society of Hematology, Blood Journal
uam.es                 39     Information Hyperlinked over Proteins (iHOP), Network
aspetjournals.org      38     Molecular Pharmacology Journal
oxfordjournals.org     33     Human Molecular Genetics Journal
blackwell-synergy.com  32     Neurochemistry, Journal
jcb.org                32     Cell Biology, Journal
biochemj.org           30     Biochemical Journal
npgjournals.com        30     Collection including European Molecular Biology Organization Journal
Sum                    1480
4660 facts total from WebProt

26 WebProt
[Figure: Precision, Recall and F-measure in percent (0 % to 100 %) plotted against Hit Limit (1 to 12, 20, 30)]

27 Screenshot WebProt

28 Paper VI: GeneTUC Results
• Can parse 60% of test input sentences in the GENIA corpus (500 abstracts),
  – With 86% accuracy on the POS-tagging
  – Bracketing Precision and Recall scores of 70.6% and 53.9%
• And answer simple questions about the parsed sentences

29 Evalb scores (Paper VI)
Sent.                  Matched  Bracket     Cross           Correct  Tag
Len.   Recall  Prec.   Bracket  gold  test  Bracket  Words  Tags     Accuracy
17     73.33   91.67   11       15    12    0        17     15       88.24
12     60.00   75.00   6        10    8     0        12     12       100.00
15     69.23   90.00   9        13    10    0        15     13       86.67
14     40.00   57.14   4        10    7     1        14     12       85.71
29     40.00   58.82   10       25    17    3        29     25       86.21
12     14.29   16.67   1        7     6     3        12     10       83.33
20     22.22   40.00   4        18    10    0        20     17       85.00
23     18.18   25.00   4        22    16    9        23     20       86.96
32     51.61   69.57   16       31    23    2        32     28       87.50
23     13.33   14.29   2        15    14    7        23     15       65.22
       40.36   54.47   67       166   123   4        197    167      84.77

30 Summary
• 6 papers describing the steps needed to show that GeneTUC can handle medical text
• 60% parsing success-rate may not be enough for a commercial application,
  – But the fact that it improved from just 10% in 2001 is very promising
• Once the parsing success-rate is good enough, GeneTUC can be tested on Question-Answering
  – There is a need for a good public dataset that allows measuring and comparing between different QA systems (Future Work)

31 Acknowledgements
• Biologists:
  – Astrid Lægreid, Kamilla Stunes, Kristine Misund, Liv Thommesen, Tonje Strømmen Steigedal
• Computer Scientists:
  – Tore Amble, Arne Halaas, Amund Tveit, Martin Ranang, Harald Søvik, Yoshimasa Tsuruoka, Anders Andenæs, Tor-Kristian Jenssen, Franz Günthner, Jun'ichi Tsujii, Jörg Cassens, Waclaw Kusnierczyk, Tore Bruland, Peep Küngas, Magnus Lie Hetland, Morten Hartman, Hallgeir Bergum, Jo Kristian Bergum, Frode Jünge, Heri Ramampiaro, Rolv Inge Seehuus, Per Kristian Lehre, Clemens Marschner, Petra Maier, Holger Bosk, Sebastian Nagel, Mariya Vitusevych, Jin-Dong Kim, Hong-Woo Chun, Takashi Ninomiya, Yusuke Miyao, Frode Høyvik, Henrik Tveit, Jian Su and others

32 Questions and Comments
• Associate professor Jong C. Park
  – Computer Science Division,
  – Korea Advanced Institute of Science and Technology (KAIST),
  – Daejeon, South Korea
• Professor Eivind Hovig
  – Department of Tumor Biology,
  – Institute for Cancer Research,
  – The Norwegian Radium Hospital

33 Thesis Work
• GeneTUC Project
  – Use TUC in the Medical Text Domain
• Use Google (Bioogle) to Recognize Unknown Entities
  – Galpha(i1)-GDP-AlF(4)(-), Ca2+, Gastrin
• Use Google (WebProt) to do Automatic Annotation
  – Mapping (BioCreative):
    • From Gene/Protein to Set of GeneOntology Terms

34 Motivation
• Natural language is natural
  – Talking computers
  – Voice as input
• Repetitive tasks should be automated!
  – Information Extraction is trivial, if you know what to look for

35 0: GeneTUC Diploma Work
• NLU Review 2002
  – GENIA: HPSG
  – Park et al.: CCG-parsing
• Numbers?

36 Paper I: Local Grammars
• Maurice Gross:
  – there are more than 10^50 ways to build a sentence with at most twenty words*
* Gross (1997). Construction of Local Grammars

37 Paper II: ProtChew
• Protein Names
  – Galpha(i1)-GDP-AlF(4)(-)
  – Gastrin
  – …
• Idea: Automatic Extraction
  – Based on existing dictionaries and machine learning
• Results?
[Figure labels: ÷, Protein-related Tokens, Part of Protein Name Tokens, +]

38 evalb
[4] OUTPUT FORMAT FROM THE SCORER
The scorer gives individual scores for each sentence, for example:

  Sent.                        Matched  Bracket   Cross        Correct Tag
 ID  Len.  Stat. Recal  Prec.  Bracket gold test Bracket Words  Tags Accracy
============================================================================
   1    8    0  100.00 100.00     5      5    5      0      6     5    83.33

At the end of the output the === Summary === section gives statistics for all sentences, and for sentences <=40 words in length.
The summary contains the following information:
i) Number of sentences -- total number of sentences.
ii) Number of Error/Skip sentences -- should both be 0 if there is no problem with the parsed/gold files.
iii) Number of valid sentences = Number of sentences - Number of Error/Skip sentences
iv) Bracketing recall = (number of correct constituents) / (number of constituents in the goldfile)
v) Bracketing precision = (number of correct constituents) / (number of constituents in the parsed file)
vi) Complete match = percentage of sentences where recall and precision are both 100%.
vii) Average cross = (#const crossing a goldfile constituent) / (number of sentences)
viii) No crossing = percentage of sentences which have 0 crossing brackets.
ix) 2 or less crossing = percentage of sentences which have <=2 crossing brackets.
x) Tagging accuracy = percentage of correct POS tags (but see [5].3 for exact details of what is counted).

39 Remember
• Present one paper at a time
• Summary results and related work also at the end
Include tables for parsing, comparison with others, etc.
An example of a complex sentence with the gtb tree. Reference the table. Compare brackets.
Include the WebProt screenshot.
Related work!! PhD presentation. Related work. Lexiquest, 40 verbs, what is the F-score?
From Tore: Why only 50%? Is it semantics or grammar that makes 50% fail?

40 Dr. Carl-Fredrik Sørensen (50 min; me: time / 2)
5 min intro, state-of-the-art
5 min definitions NLU
10 min thesis/papers overview and Research Questions
15 min three themes and contributions. Evaluation of the work
10 min future work
Proof of concept. It can be implemented. Next step?
Industry... Results are trusted
Academic... Results are validated through understanding the research process.
Dennings, proof of concept
Research question...soon....Moore's law
Proof of performance
Shift the work to biologists
Medline growth graph. Figure... Everything is published.
Background: http://www.coli.uni-saarland.de/~hansu/what_is_cl.html
Schopenhauer: imagine how clever a wise man would be, if he knew everything in his books.
Inter-annotator agreement in gprot, maybe 80 percent precision is enough?!
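
Worked example for the F-score question in the notes: a minimal Python sketch, not part of the original slides, that recomputes the bottom row of the Evalb scores table from its raw counts (67 matched brackets, 166 gold, 123 test, 167 correct tags out of 197 words) using the evalb definitions of bracketing recall, bracketing precision and tagging accuracy. The F-measure is an addition here: it is the usual harmonic mean of precision and recall and does not appear in that table. The names `BracketCounts`, `percent`, `f_measure` and `bracket_scores` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BracketCounts:
    """Raw counts as reported in one evalb row (hypothetical container)."""
    matched: int       # constituents present in both parse and gold tree
    gold: int          # constituents in the gold file
    test: int          # constituents in the parsed (test) file
    words: int         # number of words scored for tagging
    correct_tags: int  # number of correct POS tags


def percent(numerator: float, denominator: float) -> float:
    """Ratio as a percentage rounded to two decimals; 0.0 if undefined."""
    return round(100.0 * numerator / denominator, 2) if denominator else 0.0


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def bracket_scores(c: BracketCounts) -> dict:
    """Bracketing recall/precision and tag accuracy as defined by evalb,
    plus the F-measure (an addition, not a column of the slide table)."""
    recall = c.matched / c.gold if c.gold else 0.0
    precision = c.matched / c.test if c.test else 0.0
    return {
        "recall": percent(c.matched, c.gold),
        "precision": percent(c.matched, c.test),
        "f_measure": round(100.0 * f_measure(precision, recall), 2),
        "tag_accuracy": percent(c.correct_tags, c.words),
    }


# Totals row of the Evalb scores table.
totals = BracketCounts(matched=67, gold=166, test=123, words=197, correct_tags=167)
scores = bracket_scores(totals)
print(scores)

# Source-filtered WebProt figures from the Paper V slide: 90% precision, 70% recall.
webprot_f = round(100.0 * f_measure(0.90, 0.70), 2)
print(webprot_f)
```

With these counts the sketch reproduces the 40.36 recall, 54.47 precision and 84.77 tag accuracy shown in the totals row. The same `f_measure` helper applied to the source-filtered WebProt figures gives a single number that can be set against the 80 percent precision question in the notes.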