R T U New York State Center of Excellence in Bioinformatics & Life Sciences

VUB Leerstoel 2009-2010
Theme: Ontology for Ontologies, theory and applications
Ontologies and Natural Language Understanding
May 20, 2010; 17h00-19h00
Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Room D2.01
Prof. Werner CEUSTERS, MD
Ontology Research Group, Center of Excellence in Bioinformatics and Life Sciences, and Department of Psychiatry, University at Buffalo, NY, USA

Context of this lecture series
[Diagram of related fields: Knowledge Representation, Informatics, Linguistics, Computational Linguistics, Medical Natural Language Understanding, Electronic Health Records, Translational Research, Medicine, Biology, Ontology, Philosophy, Realism-Based Ontology, Referent Tracking, Pharmacogenomics, Pharmacology, Performing Arts, Defense & Intelligence]

Today's topic
[Diagram of fields: Informatics, Linguistics, Computational Linguistics, Medical Natural Language Understanding, Electronic Health Records, Medicine, Realism-Based Ontology]
• May 20: ontologies and Natural Language Understanding

Amazing technology
• A human being with function-enhancing electronic implants
• A tiny scanner capable of detecting bodily anomalies
• A ‘doctor’ who is in fact some sort of computer program capable of making medical diagnoses
• Flawless communication between a human and a computer

Or not amazing?
… towards a bionic eye
http://bionicvision.org.au/

Or not?
… mobile diagnostics
• SilhouetteMobile™: scans and stores information about a wound's width and depth, which helps nurses track healing over time as new tissue fills in the injury
• GlucoPack™: reads and transmits glucose readings

Or not?
… Transhumanism
• Max More: "Philosophies of life that seek the continuation and acceleration of the evolution of intelligent life beyond its currently human form and human limitations by means of science and technology, guided by life-promoting principles and values."

Beyond natural evolution …

to … mind uploading?
Ray Kurzweil receives National Medal of Technology (1999).

But for today:
How to communicate with computers naturally?
The supercomputer HAL from 2001: A Space Odyssey.

Michael Scott’s solution
http://aboulet.files.wordpress.com/2007/05/traveling-salesmen1.jpg

Better: a combination of various technologies

My interest in NLU: the medical informatics dogma
• Fact: computers can only deal with a structured representation of reality:
  – structured data:
    • relational databases, spreadsheets
  – structured information:
    • XML simulates context
  – structured knowledge:
    • rule-based knowledge systems
• Conclusion: a need for structured data entry

Structured data entry
• Current technical solutions:
  – rigid data entry forms
  – coding and classification systems
• But:
  – the description of biological variability requires the flexibility of natural language, and it is generally desirable not to interfere with the traditional manner of medical recording (Wiederhold, 1980)
  – initiatives to facilitate the entry of narrative data have focused on the control rather than the ease of data entry (Tanghe, 1997)

Drawbacks of structured data entry
• Loss of information
  – qualitatively
    • limited expressiveness of coding and classification systems, controlled vocabularies, and “traditional” medical terminologies
    • use of purpose-oriented systems
      – don’t use data for another purpose than originally foreseen (J VDL)
  – quantitatively
    • too time-consuming to code all information manually
• Speech recognition and structured data entry forms are not best friends

The pillars of healthcare informatics
• Clinical language
  – medical narrative
• Clinical terminologies
  – coding and classification systems
  – nomenclatures
  – formal
    ontologies
• Electronic Healthcare Record Systems

The possibilities
• Text-based EHCRS able to generate structured data
• An EHCR exclusively built around a collection of coded data generated out of free text
• A multimedia EHCRS with clinical narrative registration and structured data generation
• A multimedia EHCRS with structured data entry and text generation
• An EHCR exclusively built around texts generated out of controlled vocabularies
• An EHCR exclusively built around a collection of structured data able to generate text

Main issues of MNLU
• Medical natural language understanding is:
  – making computers understand medical language
  – allowing computers to turn unstructured texts into structured information
• Medical NLU is NOT:
  – medical reasoning performed by computers
  – reducing the richness of clinical language to a closed set of codes

Typical examples of MNLU
• contextual spell checking
• information retrieval
  – topic selection
  – relevance ranking
• coding and classification
• software agents for clinical studies
• unstructured data registration for structured reporting

Areas for application of MNLU
• Coding patient data
• Structured information extraction from unstructured clinical notes
• Clinical protocols and guidelines
• Assessing patient eligibility for clinical trial entry
• Triggering and alerts
• Linking case descriptions to scientific literature
• Easy access to content
• ...
  towards a medical semantic web

A wealth of communication related applications (1)
• Speech as input:
  – voice recognition:
    • who is the sender?
  – speech recognition:
    • dictation: what is the corresponding text?
      – irrespective of meaning
    • command and control
    • language learning (pronunciation checking)
    • question answering
    • spoken natural language understanding

A wealth of communication related applications (2)
• Text as input:
  – speech generation (text-to-speech)
  – spell checking
  – grammar checking
  – plagiarism detection
  – indexing
  – semantic indexing
  – topic detection
    • document retrieval
      – return documents that tell me when Bonaparte was born
    • information retrieval
      – find in documents the date Bonaparte was born and return only the date
  – clinical coding

Speech generation (1)
She lives near the highway where three lives were lost.

Speech generation (2)
Chapter III is about Henry III.
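
The "Chapter III is about Henry III" example can be turned into a small sketch of the text normalisation step that a text-to-speech front end needs before any sound is produced. This is only an illustration under stated assumptions, not a description of any particular TTS system: the function name `expand_roman`, the word list `SECTION_WORDS` and the two context rules are hypothetical, and the verb/noun reading of "lives" in the first example would need part-of-speech information that is not modelled here.

```python
import re

# "I" is left out on purpose: as a bare token it is usually the pronoun.
ROMAN = {"II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6, "VII": 7, "VIII": 8}
CARDINAL = {2: "two", 3: "three", 4: "four", 5: "five",
            6: "six", 7: "seven", 8: "eight"}
ORDINAL = {2: "the second", 3: "the third", 4: "the fourth", 5: "the fifth",
           6: "the sixth", 7: "the seventh", 8: "the eighth"}
# Hypothetical list of words after which a numeral is read as a cardinal.
SECTION_WORDS = {"chapter", "part", "section", "volume", "act"}


def expand_roman(text):
    """Spell out Roman numerals using only the preceding word as context."""
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        # Separate the numeral from any trailing punctuation.
        m = re.fullmatch(r"([IVX]+)([.,;:!?]*)", tok)
        prev = tokens[i - 1] if i > 0 else ""
        if m and m.group(1) in ROMAN and prev:
            n = ROMAN[m.group(1)]
            if prev.lower() in SECTION_WORDS:
                # "Chapter III" -> "Chapter three"
                out.append(CARDINAL[n] + m.group(2))
                continue
            if prev[0].isupper():
                # Capitalised word before the numeral, assumed to be a name:
                # "Henry III" -> "Henry the third"
                out.append(ORDINAL[n] + m.group(2))
                continue
        out.append(tok)
    return " ".join(out)


print(expand_roman("Chapter III is about Henry III."))
```

The same surface string receives two different readings purely from its left neighbour, which is the kind of shallow contextual rule that statistical models generalise.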

Text-to-speech basics
http://upload.wikimedia.org/wikipedia/en/a/af/Festival_TTS_Telugu.jpg

Simple speech recognition algorithm
[Diagram, from the INRIA Parole project. Labels: raw speech, signal analysis, acoustic models, sequential constraints, train, speech frames, acoustic analysis, frame scores, time alignment, word sequence, segmentation]

Dialogue systems with automatic translation
http://www.oxygen.lcs.mit.edu/images/Speech.jpg

The disambiguation problem
• Some examples:
  – ‘lives’: from ‘to live’ or plural of ‘life’
  – ‘III’: as ‘three’ or ‘the third’
  – ‘bow’: the weapon or from ‘to bow’
• Statistical models (n-grams):
  – most often sufficient
  – quite fast analysis
• Syntactic analysis
• Semantic analysis (deep or shallow)

A toy ontology for communication (1)
• Patterned particular (PP):
  – piece of text: combination of characters
  – sound wave
  – series of signs in sign language, smoke
  – combination and sequence of smells ?
• A sender generates a PP with the intention to provoke something in some receiver; the PP thereby becomes a linguistic patterned particular (LPP)
  – standard messages, questions, commands
    • carry meaning directly encoded in the message
  – poems, lies, deceptions, nonsense:
    • no or partial directly encoded information
• Being a PP is not sufficient to be an LPP. There has to be a sender!
  – a bird or insect flying in a pattern that looks like an LPP in some language

A toy ontology for communication (2)
• Aboutness relation from certain elementary LPPs to real world entities when created under certain circumstances
  – ‘me’, ‘I’, ‘mine’
  – ‘current’, ‘president’, ‘United States’, ‘king’, ‘France’
• Pattern types
  – morphologic, syntactic, semantic and discourse conventions
    • ‘current President of the United States’
    • ‘current king of France’

A toy ontology for communication (3)
• Questionable entities:
  – ‘propositions’
    • sort of factual, linguistically undressed statements about the world
  – ‘bare meanings’

Text analysis
‘The doctor checks Seven of Nine’s blood pressure’

Syntactic analysis
[Parse tree: [sentence [noun phrase [det The] [noun doctor]] [verb phrase [verb checks] [noun phrase [det the] [compound noun blood pressure] [prepositional phrase [prep of] [noun phrase [person name Seven of Nine]]]]]]]

Semantic analysis
[Same parse tree with semantic annotations: ‘checks’ is a checking; ‘The doctor’ is a person; ‘blood pressure’ is a clinical sign; ‘Seven of Nine’ is a person. Relations: checking has-agent person (the doctor); checking has-object clinical sign; clinical sign belongs-to person (Seven of Nine)]

The doctor uses an instrument
[Parse tree: [sentence [noun phrase [det The] [noun doctor]] [verb phrase [verb examines] [noun phrase [det the] [noun patient]] [prep with] [noun phrase [det a] [noun hammer]]]]. Roles of the checking: agent (the doctor), object (the patient), instrument (a hammer)]

Here the patient has the hammer !
[Parse tree: [sentence [noun phrase [det The] [noun doctor]] [verb phrase [verb examines] [noun phrase [noun phrase [det the] [noun patient]] [prepositional phrase [prep with] [noun phrase [det a] [noun hammer]]]]]]. Roles of the checking: agent (the doctor), object (the patient with a hammer)]

The problem of reference
• ‘The surgeon examined Maria. She found a small tumor on the left side of her liver. She had it removed three weeks later.’
• Ambiguities:
  – whom does the first ‘she’ denote: the surgeon or Maria ?
  – on whose liver was the tumor found ?
  – whom does the second ‘she’ denote: the surgeon or Maria ?
  – what was removed: the tumor or the liver ?
• Here ontology can come to our aid.

Ontologies and NLP
• A two-way collaboration:
  – using NLP techniques to assist the development of ontologies,
  – using ontologies to make better NLP applications,
  – bootstrapping: NLP applications that require ontologies in some stage and intend to make these ontologies better.

NLU as assistive technology for ontology development

C-Tex: corpus-based term extraction
• Based on Deniz Yuret’s PhD thesis
• good news: (a particular) language independent automatic linguistic knowledge extractor
  – relationships between words
  – grammar generation
  – term extraction
  – synonym / homonym detector (???)
• bad news:
  – large corpora required (occ > 500 * different tokens)
  – big PC required (3.000.000 words/day, DOS, PII-350)

C-Tex: term extraction
TERM                              Occurrences (5679 reps)
magnetic resonance                100
san francisco                     12
invasive fungal sinusitis         7
rhinosinusitis disability index   3
intensive care unit               178
food allergy                      31
th1 and th2                       32
positron emission                 29

C-Tex grammar induction
• Sentence encountered:
• Sentence analyzed:

C-Tex’s linguistic principles
• Words in natural language sentences:
  – tend to collocate with a certain strength,
  – are not linked in circular ways,
  – have links that don’t cross.

C-Tex processing
[Animation in seven steps over the sentence "I saw a man carry a telescope"; link labels shown at each step:]
  1. s1, s2, s3, s4, s5, s6
  2. s1, s2, s3, s4, s5, s6, s7
  3. s1, s3, s4, s5, s6, s7, s8
  4. s1, s7, s8, s9, s10, s11
  5. s1, s7, s8, s9, s10, s11, s12
  6. s1, s7, s9, s10, s11, s12
  7. s1, s7, s9, s10, s11, s12

Advantages
• Defining the required coverage for a given domain, by
  –
    listing the terms that need to receive a description in the ontology (= inverse annotation)
  – listing the relationships that need to be named
• Catch up mechanism:
  – things already done don’t need to be done again
  – if a C-Tex without prior knowledge works fine, one with ontological knowledge should work even better
• Builds a grammar

Drawbacks
• very slow
• very sensitive to repeatedly seeing the same documents
  – requires very careful training set development

Gap Finder and Web Agent

“Domain specific” word detection
Indiana Irving JAMA Janus Johannes Kanno Kd Kern Knowles L.M. LBF4-bind LBF6-binding LMP-1-express LMP-1-positive LPS LTR-Cat Laurent Lenny Leung Lewis Lim Listeria monocytogenes
Indianapolis Ito Jaffe Japan Johannsen Kaplan Keegan Kimble Ko LAV LBF4-binding LD LMP-1-induce LMP-1-transfect LT Laine Lechler Lenoir Levels Ley Lin Liu
Inoue Iwanaga Jain Jk-bind Johnson Karin Keller Kirsch Kozma LBF3-bind LBF5-and LFA LMP-1-mediate LN LTR Lane Lee Leonard Levine Li Ling Loisel
Irani J Jama Jk-binding K Kaye Kennedy Kishimoto L LBF3-binding LBF6-bind LMP LMP-1-negative LOH LTR-CAT Lanes Left Lett Levy Liebowitz Listeria London

Kohonen clustering
[Two figure slides]

Statistical relationship discovery
[Table with columns: context, term, role, value term. The context is "EU 6th VAT Directive" in all 25 rows. Terms grouped by role:]
  ACTOR-OF:   member state
  ACT-UPON:   condition, criterion, member state (6 rows)
  CAUSED-BY:  accommodation, committee, service (2 rows), supply of service
  HAS_ACTION: accompany, achieve, acquire, allow, animal, apply, authorise (2 rows), avoid, breeding, calculating
[value term column as extracted, row alignment lost: necessary measure purpose document amount condition method national currency of ecu period rules similar establishment commission agricultural holdings taxable person aim luggage exemption goods identification animals vat member suspension fraud boars turnover]

The ‘clique’ approach
• A clique in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge.
• A clique is maximal iff it is not part of a larger clique.

Building cliques out of n-grams
Tony Veale. Categories, Cliques and Analogies in Creative Information/Knowledge Management. ICON 2009, Hyderabad, India.
http://ltrc.iiit.ac.in/icon_archives/ICON2009/Presentations/Keynote/Categories%20and%20Cliques.pdf

Sorts of cliques in linguistic corpora
Tony Veale. Categories, Cliques and Analogies in Creative Information/Knowledge Management. ICON 2009, Hyderabad, India.
http://ltrc.iiit.ac.in/icon_archives/ICON2009/Presentations/Keynote/Categories%20and%20Cliques.pdf

Category and hierarchy generation
Tony Veale. Categories, Cliques and Analogies in Creative Information/Knowledge Management. ICON 2009, Hyderabad, India.
http://ltrc.iiit.ac.in/icon_archives/ICON2009/Presentations/Keynote/Categories%20and%20Cliques.pdf

Ontology to improve natural language understanding

Understanding content (1)
We see:
  “John Doe has a pyogenic granuloma of the left thumb”
The machine sees:
  John Doe has a pyogenic granuloma of the left thumb

Understanding content (2): The XML misunderstanding
We see:
  <record>
    <patient>John Doe</patient>
    <diagnosis>pyogenic granuloma of the left thumb</diagnosis>
  </record>
The machine sees:
  <record>
    <subject> John Doe </subject>
    <diagnosis> pyogenic granuloma of the left thumb </diagnosis>
  </record>

Requirements for NLU
1. Knowledge about terms and how they are used in valid constructions within natural language;
2. Knowledge about the world, i.e. how the referents denoted by the terms interrelate in reality and in given types of context;
3. An algorithm that:
   a. is able to calculate a language user’s representation of that part of the world described in the utterances that are the subject of the analysis;
   b. can track the ways in which people express what does NOT represent anything in reality (e.g. for medico-legal reasons).
Only a realist ontology (and not an ontology that deals with “alternative realities”) permits correct disambiguation between 3a and 3b.
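
The XML misunderstanding shown earlier can be made concrete with a few lines of code. The sketch below is only an illustration: the function name `shape` and the opaque tag names `a`, `b`, `c` are hypothetical. It shows that a generic XML parser recovers nesting and character data, and that the tag names which look meaningful to a human reader play no role in what it recovers.

```python
import xml.etree.ElementTree as ET

# The record as a human reader sees it: tag names look meaningful.
human_view = (
    "<record>"
    "<patient>John Doe</patient>"
    "<diagnosis>pyogenic granuloma of the left thumb</diagnosis>"
    "</record>"
)

# The same record with arbitrary tag names (hypothetical labels).
opaque_view = (
    "<a>"
    "<b>John Doe</b>"
    "<c>pyogenic granuloma of the left thumb</c>"
    "</a>"
)


def shape(xml_text):
    """Return what the parser recovers apart from the tag names:
    for each child of the root, its number of sub-elements and its text."""
    root = ET.fromstring(xml_text)
    return [(len(child), child.text) for child in root]


# Both documents have exactly the same structure and content.
print(shape(human_view) == shape(opaque_view))
```

Whatever distinguishes `patient` from `b` has to come from knowledge outside the document, which is where requirements 1 and 2 above come in.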

Exploit the relationships along the vertices
[Triangle diagram with vertices: concept, language, referents. Annotations:]
• Halliday’s systemic functional grammar
• “The structures of language are partially determined by our conceptualisation of the world.” Halliday
• “No mental representation without language” Fodor
• Aristotelian realism
• “Meaning is located in the interaction between living beings and the environment” James J. Gibson, Ecological Realism in Psychology
• “Baboons and humans have different cut-off points for discerning "same" objects because our verbal expression for "same" makes the idea of "same" more restrictive.” Fagot and Wasserman (Centre for Research in Cognitive Neuroscience in Marseille)

The content
[Diagram. Labels: Language A (Lexicon, Grammar), Language B (Lexicon, Grammar), Others ..., Proprietary Terminologies, ICPC, SNOMED, ICD, MEDRA, Formal Domain Ontology, Linguistic Ontology]

Use of spatial logics
[Hierarchy diagram of spatial relations; the hierarchy itself is lost, the relation names are:]
HAS-OVERLAPPING-REGION
HAS-PARTIAL-SPATIAL-OVERLAP
IS-SPATIAL-PART-OF
IS-PROPER-SPAT.-PART-OF
IS-TANG.-SPAT.-PART-OF
IS-NON-TANG.-SPAT.-PART-OF
IS-SPAT.-EQUIV.-OF
HAS-SPATIAL-PART
HAS-PROPER-SPATIAL-PART
HAS-TANG.-SPAT.-PART
HAS-NON-TANG.-SPAT.-PART
HAS-DISCRETED-REGION
HAS-SPATIAL-POINT-REFERENCE
HAS-CONNECTING-REGION
HAS-DISCONNECTED-REGION
HAS-EXTERNAL-CONNECTING-REGION
IS-PARTLY-IN-CONVEX-HULL-OF
IS-INSIDE-CONVEX-HULL-OF
IS-OUTSIDE-CONVEX-HULL-OF
IS-GEO-INSIDE-OF
IS-TOPO-INSIDE-OF

Example: (canonical) joint anatomy
• joint HAS-HOLE joint space
• joint capsule IS-OUTER-LAYER-OF joint
• meniscus
  – IS-INCOMPLETE-FILLER-OF joint space
  – IS-TOPO-INSIDE joint capsule
  – IS-NON-TANGENTIAL-MATERIAL-PART-OF joint
• joint
  – IS-CONNECTOR-OF bone X
  – IS-CONNECTOR-OF bone Y
• synovia
  – IS-INCOMPLETE-FILLER-OF joint space
• synovial membrane IS-BONAFIDE-BOUNDARY-OF joint space

Linguistic, domain and BFO-based RUs
[Diagram. Nodes: Generalised Possession, Healthcare phenomenon, Human being, Having a healthcare phenomenon, Patient, Patient at risk, Patient at risk for osteoporosis, Risk Factor, Risk factor for osteoporosis, Osteoporosis. Relations: subclass-of, Has-possessor, Has-possessed, Is-possessor-of, Has-Healthcare-phenomenon, Is-Risk-Factor-Of. The links are numbered 1 to 4.]

Value of the three sorts of RUs
• Linguistic:
  – capture the way language is used
• Domain:
  – capture how domain experts conceptualize the domain
    • is in part reflected by the way they talk about the domain
• BFO-based:
  – capture how matters are believed to be, without referring to linguistic or domain RUs except when they denote the same thing

One should try to maximize the number of BFO-based Representational Units
• In this case: base RUs on the Ontology of General Medical Science
  – healthcare phenomenon → bodily feature ?
  – risk factor → disposition ?
  – osteoporosis → disorder, disease, path. process ?

MNLU: the general idea
[Diagram: Text → Result. Labels: Keywords, ICD-Codes, Discharge letter, MedLine abstracts, English patient record, French patient record, Surgery report, Protocol checking]

MNLU: some requirements
[Diagram: Text → Processor → Result, with Domain representation and Goal representation feeding the Processor]

Linguistic Application Components
[Diagram: Text → Processor → Result, with Domain representation, Linguistic Knowledge, Task Knowledge and Goal representation]

Implements Rector’s ‘Clean separation of knowledge’
• Conceptual knowledge: the knowledge of sensible domain concepts
• Knowledge of definitions and criteria: how to determine if a concept applies to a particular instance
• Surface linguistic knowledge: how to express the concepts in any given language
• Knowledge of classification and coding systems: how an expression has been classified by such a system
• Pragmatic knowledge: what users usually say or think, what they consider important, how to integrate in software

What does this mean for applications?
[Diagram: Text → Processor → Result, with Domain representation, Linguistic Knowledge, Task Knowledge and Goal representation. Further labels: Discourse Knowledge, Linguistic Knowledge, Coding rules, Information, English P.Rec, French P.Rec, Reports, Keywords, ICD-Codes, Completeness]

Halliday’s systemic functional grammar
• A “complete” theory for NLU
  – constructivistic basis: “language construes human experience”
    – English: It is raining
    – Chinese: The sky drops water
  • hence: natural languages are instances of generic schemes
  – macro-structure of documents
    • derive a “structural formula”
  – micro-structure of documents
    • lexical cohesion
    • in-conjunction analysis

General Principle of Semantic Mapping
1. Semantic constraints are associated with:
   a) lexemes, or
   b) syntactic classes which generalize over lexemes.
2. A word inherits all constraints associated with each of the syntactic classes it instantiates, as well as any associated with the lexeme itself.
3. Where the lexicon provides multiple semantic interpretations of a word, these are tried in order until one applies (e.g., “with” can be interpreted as HAS_HC_PHENOMENON, HAS_INSTRUMENT, etc.).

Lexicon-specified Mapping
• “Lexsem rules” fix the RU that a particular term can map to.
        lexsem <string>   <wordclass> <concept>
  e.g., lexsem “present”  verb        CONSULTATION_PROCESS
• The <string> element defines the root form of the lexeme, so the above example will also be applicable for “presents” and “presenting”.
• The <wordclass> element distinguishes cases of lexical ambiguity, e.g., “present” as a noun.
• Where a lexeme is polysemic, multiple lexsem entries are provided.
• In some cases, a lexeme provides not only an RU, but some structure as well, e.g.,
    lexsem "since" preposition {} {Head.Sem.HAS-CEN-OCCURENCE-SINCE PPHead.Sem}
  (meaning: the concept expressed by the syntactic dominator of “since” is linked by a HAS-CEN-OCCURENCE-SINCE relation to the RU expressed by the NP following “since”)

Syntax-specified Mapping
Two reasons for associating mapping information with syntactic features:
• The syntactic feature represents a generalisation over a set of lexemes (e.g., the syntactic feature human-surname contains the mapping information for all surnames).
• The syntactic feature represents a syntactic configuration which itself implies meaning (e.g., passive is not a feature of a word but of a configuration of words).
Syntactic constraints are of two types:
• Specify the class a particular role filler must have (whether syntactic element or conceptual), e.g., Sem.Actor: human (Sem.Actor is a role-chain, meaning “the Actor slot of the Sem slot”)
• Specify that the fillers of two role-chains are the same, e.g., Sem.Actor = Subj.Sem
Logical combinations of syntactic constraints are possible:
    {and {Head.Sem: COMPLAINING_PROCESS}
         {Head.Sem.HAS_SAYING PPHead.Sem} }
(‘or’ and ‘not’ are also possible)

RUs involved in analyzing “Mr. Smith”
[Diagram with three layers: Ontology, Instance, Text. Labels: Material Entity, human, male human, name, Is-assigned-name-of (twice), MrSmith, Mr Smith, “Smith”]

“Mr Smith” analysed syntactically, and features used to drive mapping
• The Orth slot of a word gives its surface string.
• The append( ) operator joins together its arguments as a single string.
[Feature network diagram. Systems: HUMAN-NAME-TYPE, TITLED-HUMAN-TYPE, HUMAN-NAME-TYPE3, HUMAN-NAME-TYPE4. Features and constraints as extracted: human-name (Sem: human); human-surname; human-firstname; titled-human (Title: title; Title -2); untitled-human; female-titled (Title: female-title; Sem: female-human); male-titled (Title: male-title; Sem: male-human); genderless-titled; prenamed-provided (Prename: human-firstname; Prename -1; Sem.Assigned_Name = append{Prenam.Orth, Orth}); prename-not-provided (Sem.Assigned_Name = Orth)]

Analysis of “an 83-year-old man”
[Diagram with three layers: Ontology, Instance, Syntax. Ontology: Dom-ent, human age state, human age, human, male human, linked by HAS-WE-STATE and P-TYPE. Instance: X1 HAS-WE-STATE X2 P-TYPE X3. Syntax: Deict "An", Epith "83-year-old", "man"]

Syntactic-Semantic mapping
• Lexicon:
    lexsem “man” noun MALE_HUMAN
    lexsem “$int$-year-old” adjective HUMAN_AGE_STATE
• Syntax: one of the constraints (shown in red) on the feature ‘pre-qualified’ (which introduces the Epith role) fits:

Example of a bootstrapping approach

Syntactic relationship “discovery” process
• Text processed successively by:
  – paragrapher
  – segmenter
    • sentence detection
    • tokenisation
    • rewriting of abbreviations
    • identification of relevant sentences
  – parser
  – reference resolver
  – relationship discoverer

Text to be processed
Sphingosine 1-phosphate induces expression of early growth response-1 and fibroblast growth factor-2 through mechanism involving extracellular signal-regulated kinase in astroglial cells.
Sato K, Ishikawa K, Ui M, Okajima F.
Laboratory of Signal Transduction, Institute for Molecular and Cellular Regulation, Gunma University, 3-39-15 Showa-machi, Maebashi, Japan.
kosato@akagi.sb.gunma-u.ac.jp
In rat type I astrocytes and C6 glioma cells, sphingosine 1-phosphate (S1P) clearly induced the expression of fibroblast growth factor-2 (FGF-2) mRNA to an extent comparable to that achieved by platelet-derived growth factor (PDGF) and endothelin. In C6 cells, Western blotting showed that S1P also induced expression of early growth response-1 (Egr-1), one of the immediate early gene products and an essential transcriptional factor for FGF-2 expression. On the other hand, sphingosine, a substrate for sphingosine kinase which forms intracellular S1P, was a very weak activator for the expression of either FGF-2 or Egr-1. The S1P-induced Egr-1 expression was partially inhibited by treatment of the cells with either calphostin C, an inhibitor of protein kinase C (PKC), or pertussis toxin (PTX), and completely inhibited by the combination of these agents. Essentially, the same inhibitory pattern by these agents has been observed for S1P-induced extracellular signal-regulated kinase (ERK) activation. The S1P-induced expression of Egr-1 was also completely inhibited in association with complete inhibition of ERK by PD 98059, an ERK kinase inhibitor. Thus, the S1P-induced activation of the Egr-1/FGF-2 system may be mediated through ERK activation, which may involve at least two signaling pathways, i.e., a PTX-sensitive G-protein-dependent pathway and a PKC-dependent pathway.
PMID: 10640689 [PubMed - indexed for MEDLINE]

Paragrapher output

Segmenter output

Re-use of resolved abbr.
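
As an illustration of what resolving abbreviations in such an abstract can involve, here is a minimal sketch. It is not the segmenter used in the work presented here; it swaps in a simplified version of the character-matching heuristic published by Schwartz and Hearst for "long form (SHORT)" patterns. The function names `best_long_form` and `find_abbreviations`, the regular expression and the window size are assumptions made for this example.

```python
import re


def best_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`.

    Returns a tail of `candidate` that contains the characters of `short`
    in order, the first one at the start of a word, or None if no match.
    """
    s_i = len(short) - 1
    l_i = len(candidate) - 1
    while s_i >= 0:
        c = short[s_i].lower()
        if c.isalnum():
            # Move left until the character matches; the first character of
            # the short form must additionally sit at the start of a word.
            while l_i >= 0 and (
                candidate[l_i].lower() != c
                or (s_i == 0 and l_i > 0 and candidate[l_i - 1].isalnum())
            ):
                l_i -= 1
            if l_i < 0:
                return None
            l_i -= 1
        s_i -= 1
    # Extend the match to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l_i + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Map each parenthesised short form to its guessed long form."""
    found = {}
    for m in re.finditer(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)", text):
        short = m.group(1)
        words = text[: m.start()].split()
        # Only look at a few words before the parenthesis.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            found[short] = long_form
    return found


sentence = ("either calphostin C, an inhibitor of protein kinase C (PKC), "
            "or pertussis toxin (PTX)")
print(find_abbreviations(sentence))
```

Once such a mapping is available, later occurrences of the short form in the same abstract can be read with the long form in mind, which is the re-use step named on the slide above.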
Parser output

Reference resolution

Domain-specific CUE-words
// verbs, in all inflected forms, that relate a subject to an object
if (domain.equals("PROTEINS"))
    subjObjVerbs_ar = new Object[]
        {"abolish", "abolishes", "abolished", "abolishing",
         "accompany", "accompanies", "accompanied", "accompanying",
         "acetylate", "acetylates", "acetylated", "acetylating",
         "activate", "activates", "activated", "activating",
         "affect", "affects", "affected", "affecting",
         ....}
// nouns that take their arguments through "of" and "by"
if (domain.equals("PROTEINS"))
    ofByNouns_ar = new Object[]
        {"acetylation", "activation", "affection", "aggregation", "altering",
         "amelioration", "antagonization", "association", "augmentation",
         "binding", "blocking", "blockage",.... }

Inter-protein relationship “discovery”
• Leptin rapidly inhibits hypothalamic neuropeptide Y secretion and stimulates corticotropin-releasing hormone secretion in adrenalectomized mice.
  – (leptin)-INHIBITS-(hypothalamic neuropeptide Y secretion)
  – (leptin)-INHIBITS-(neuropeptide Y)

... special patterns
• These results indicate that oTP-1 may prevent luteolysis by inhibiting development of endometrial responsiveness to oxytocin and, therefore, reduce oxytocin-induced synthesis of IP3 and PGF2 alpha.
  – (oxytocin)-CAUSES-(synthesis of IP3 and PGF2 alpha)
  – (oxytocin)-CAUSES-(pgf2 alpha)

From syntactic modification to subsumption
• (adj)-(noun) :: Cadj-noun IS_A Cnoun
  – steroid hormone IS_A hormone
  – fetal liver IS_A liver
• BUT not:
  – binding factor IS_A factor
  – total protein IS_A protein
  – two domain IS_A domain
• Usefulness ?
  – relationship with the Cadj

NLU in the GALEN project

The place of Galen
[Diagram: Processor, with Text, Linguistic Knowledge, Task Knowledge, Domain representation, Goal representation and Result]

The processor at work ...
[Diagram: Processor, with Text, Linguistic Knowledge, Task Knowledge, Domain representation, Meaning Representation, Goal Representation, Goal representation and Result]

Some claims by Galen (+)
• Europe-wide endeavour
• Result of work by highly competent researchers and developers
• Clean knowledge kernel of pure medical terminology
• Totally independent from any source or target system
• Openness
• Development not affordable by one single entity

NLP applications around Galen
[Diagram: C-Tex, MultiTale and Cassandra, with Text, Linguistic Knowledge, Linguistic Representation, Galen terminological Knowledge and Meaning Representation]

MultiTale: synsem - tagging
Dura was incised in linear fashion and the scar around the inlet of the reservoir was dissected out until the ventricular catheter was exposed and withdrawn under direct vision.
<Clause type="surg">
  <Segment role="do" semtype="anat" syntax="sg" meaning="T-A1120">Dura </Segment>
  <Segment role="action" semtype="open" syntax="papa" meaning="P101000">
    <SegConst.1 syntax="past">was </SegConst.1>
    <SegConst.1 role="action" semtype="open" syntax="papa" meaning="P101000">incised </SegConst.1>
  </Segment>
  <Segment syntax="prep">in </Segment>
  <Segment semtype="manner" syntax="adjnoun" meaning="(G-A148,G-D430)">
    <SegConst.1 semtype="mod" syntax="adj" meaning="G-A148">linear </SegConst.1>
    <SegConst.1 semtype="manner" syntax="sg" meaning="G-D430">fashion </SegConst.1>
  </Segment>
</Clause>

MultiTale-II: Galen-ready linguistic representation
valgising osteotomy of humerus
({valgising}5(osteotomy)1{[of]3(humerus)2}4)22

Pre- and postmarker | Relationship with the GALEN ontology (exhaustive) | Relationship with natural language phenomena (examples)
[…] | link | explicit in prepositions, or implicit in adjectives
{…} | criterion | adjectives, adverbial constructions
(…) | descriptor / concept | nouns, idioms
@…# | co-ordination | “and”, “or”
\…/ | not represented in GALEN | function words such as articles, possessive pronouns, etc.
<…> | criterion modifier | adverbs

Cassandra-II: from LR to CR
({valgising}5(osteotomy)1{[of]3(humerus)2}4)22
((cutting)21 {[TO_ACHIEVE]6((Deed:valgising)7 {[ACTS_ON]17(Pathology:pathologicalposture)18}19)20}5 {[ACTS_ON]3(Anatomy:humerus)2}4)22

Linguistic versus Conceptual repr. (1)
(excision)35 {[of]111 ((cicatrix)2120 {[of]216 (skin)474}0)0}0
(debridement)82 {[of]142 ({palmar}1785 (skin)474)0}0

RefId | Prototype | Conceptual repr. | Linguistic repr.
35 | excision | excising | excising
82 | debridement | debriding | debriding
111 | of | ACTS_ON | THEME
142 | of | ACTS_ON | SOURCE
216 | of | HAS_LOCATION | SOURCE
474 | skin | skin | skin
1785 | palmar | IS_PART_OF(palm) | LOCATIVE(palm)
2120 | cicatrix | cicatrix | cicatrix

Linguistic versus Conceptual repr. (2)
The Galen view
  – ResourseManagementProcess
  – InstallingProcess
  – LiquidInstallingProcess
  – Filling
  – Injecting
The linguistic semantic view
  – To install <theme> [ in <goal> ]
  – To fill <goal> [ with <theme> ]
  – To inject <theme> [ in <goal> ]
  – To inject <goal>

Semantic Indexing with and without using ontology

Goals of Semantic Indexing
1. How to identify in a running text those “components that carry meaning” ?
2. How to assess how relevant these components are in the context of the entire document ?
  – aboutness or characterizing power (NLM MetaMap)
  – topic

Statistics-based systems
• do not possess explicit domain knowledge,
• can only identify words or multi-word units in texts,
  – based on individual document statistics
  – based on corpus statistics
• project these on implicitly constructed concepts that are mathematically justifiable, but that do not necessarily correspond with metaphysical reality,
• are capable of finding those components that qualify as topic markers,
• are poor at identifying all components.

“Concept”-based systems
• use explicitly defined concepts to which words, terms or phrases are attached as known grammaticalizations in a specific language.
• “attachment” may be
  – lexically realised
  – grammatically realised
    • using syntactic grammar and/or semantic grammar
• tend to identify many components,
• perform less well at finding the topics.
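
The contrast between the two families can be made concrete with a minimal sketch. Everything in it is an assumption for illustration: the class IndexingSketch, the toy lexicon and the concept identifiers (C_ASPIRIN, C_MYOCARDIAL_INFARCTION) are hypothetical and come from no terminology mentioned on the slides. The statistics-style method only counts surface words; the concept-style method uses lexically realised attachment, a longest-match lookup of one- or multi-word terms in a lexicon, so that synonyms count towards the same concept.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingSketch {

    // Lower-case the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Statistics-style view: in-document frequency of surface words.
    public static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokenize(text)) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Concept-style view: longest-match lookup of terms (up to maxTermLength
    // tokens) in a lexicon that maps terms to concept identifiers.
    public static Map<String, Integer> conceptCounts(String text,
                                                     Map<String, String> lexicon,
                                                     int maxTermLength) {
        List<String> tokens = tokenize(text);
        Map<String, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < tokens.size()) {
            int consumed = 1; // skip one token when nothing matches
            for (int n = Math.min(maxTermLength, tokens.size() - i); n >= 1; n--) {
                String term = String.join(" ", tokens.subList(i, i + n));
                String concept = lexicon.get(term);
                if (concept != null) {
                    counts.merge(concept, 1, Integer::sum);
                    consumed = n;
                    break;
                }
            }
            i += consumed;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical lexicon: two synonymous terms attached to one concept.
        Map<String, String> lexicon = new LinkedHashMap<>();
        lexicon.put("aspirin", "C_ASPIRIN");
        lexicon.put("heart attack", "C_MYOCARDIAL_INFARCTION");
        lexicon.put("myocardial infarction", "C_MYOCARDIAL_INFARCTION");

        String text = "Aspirin after heart attack: aspirin lowers the risk "
                + "of myocardial infarction.";
        System.out.println(wordCounts(text));
        System.out.println(conceptCounts(text, lexicon, 2));
    }
}
```

In this toy text the word counts treat "heart attack" and "myocardial infarction" as unrelated words, while the concept counts fold both onto one identifier. The sketch says nothing about ranking topics, which is where the slides place the weakness of concept-based systems.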
TeSSI®: Terminology Supported Semantic Indexing
• Based on LinkBase®:
  – formal ontologies dealing with time, mereology, partonomy, ... (Smith, Varzi, Cohn, ...)
  – domain ontology structured according to the way languages are influenced by semantics (Bateman)
  – linking towards multiple 3rd party terminologies, classification systems, ontologies, ...
  – multi-lingual
• Combines in-document statistics with spreading activation enforcement in LinkBase®
• Implemented as a server

Architectural Overview
[Diagram labels: LinkBase Database; JDBC; Java; LinkFactory Server; Business Objects; RMI, Corba, Soap; LAN, WAN, Internet; Unix Workstation, PC, Mac; LinkFactory Workbench; Concept tree; Linktype tree; Criteria / Full definitions; Translate; ...; TeSSI Server; Index; Server]

Phrase extraction

Disambiguation

Coding

Intermediate conclusions
• Good results
  – (shown by means of recall/precision studies based on OHSUMED)
• BUT:
  – important effort in building an appropriate ontology
    • (we can live with that because we did it already for healthcare)
• Is there a risk that this effort would ever lose its value ?
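
The TeSSI slide above mentions combining in-document statistics with spreading activation in LinkBase®. How such a combination could work is sketched below under loudly stated assumptions: the class SpreadingActivationSketch, the toy link table, the decay factor and the number of rounds are all hypothetical, and nothing here reflects the actual TeSSI or LinkBase® implementation. Concepts found in the document start with their in-document frequency as activation, and each round passes a decayed share of that activation to linked concepts.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SpreadingActivationSketch {

    // initial: concept -> in-document frequency
    // links:   concept -> directly linked concepts in a (toy) ontology
    public static Map<String, Double> spread(Map<String, Double> initial,
                                             Map<String, List<String>> links,
                                             double decay, int rounds) {
        Map<String, Double> activation = new LinkedHashMap<>(initial);
        Map<String, Double> frontier = new LinkedHashMap<>(initial);
        for (int r = 0; r < rounds && !frontier.isEmpty(); r++) {
            Map<String, Double> next = new LinkedHashMap<>();
            // Each active concept passes a decayed share to its neighbours.
            for (Map.Entry<String, Double> e : frontier.entrySet()) {
                List<String> neighbours =
                        links.getOrDefault(e.getKey(), Collections.<String>emptyList());
                for (String n : neighbours) {
                    next.merge(n, e.getValue() * decay, Double::sum);
                }
            }
            // Accumulate what was received in this round.
            for (Map.Entry<String, Double> e : next.entrySet()) {
                activation.merge(e.getKey(), e.getValue(), Double::sum);
            }
            frontier = next;
        }
        return activation;
    }

    public static void main(String[] args) {
        // Hypothetical document statistics and ontology links.
        Map<String, Double> found = new LinkedHashMap<>();
        found.put("aspirin", 2.0);
        found.put("ibuprofen", 1.0);

        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("aspirin", Arrays.asList("analgesic"));
        links.put("ibuprofen", Arrays.asList("analgesic"));

        System.out.println(spread(found, links, 0.5, 2));
    }
}
```

In this toy run the concept "analgesic" never occurs in the document, yet it collects activation from both drugs. That is one way an ontology with associative links could surface a topic that frequency counts alone would miss.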
A “statistics only system”
[Figure: output for a 22 page full paper compared with output for the ABSTRACT ONLY]

How far can these systems go ?
• Some positive characteristics:
  – Do not require detailed domain knowledge
  – Are language independent
  – Are able to find complex multi-word units
• Some negative (?) characteristics:
  – seem to be dependent on document length
  – unclear how to link to existing terminologies
    • (find “words” instead of “concepts”)

To find this out
• Select from OHSUMED 29 abstracts with stated high relevance for 5 concepts, hence supposed to cover the same topic;
• Sort abstracts in ascending order with respect to document length;
• Concatenate documents to get even larger documents;
• Perform a forecast analysis;
• Compare TeSSI with statistics based system.

Word, concept and node identification per document (real)
[Chart: count of words, concepts or nodes (logarithmic axis, 1 to 10000) against document number (1 to 9); series: Words, Nodes, Concepts]

Absolute Concept/Node identification (real)
[Chart: nr of nodes or concepts (0 to 1800) against word count (0 to 5000)]

Relative Concept/Node identification (real)
[Chart: relative identification (0 to 0,4) against nr of words (0 to 5000); series: concepts, nodes]

Concept/Node identification % (forecast)
[Chart: relative identification (0 to 0,4) against nr of words (0 to 120.000.000); series: concepts, nodes]

Conclusions
• The “ontological approach” that accepts language as a medium of communication provides a very good basis for NLU if associative relationships are prominently present.
  – Hierarchies are not enough
• In-document (and even corpus) statistics provide additional information but have an upper bound if used without domain information.
  – Detail and explicitness at the level of concept and relationships determine indexing performance

The End