Natural Language Processing: Linguistic Structure
Ian Lewin, Rebholz Group, EBI

Hard Times
"Now, what I want is, Facts" (not documents)
(Mr. Gradgrind in Hard Times by Charles Dickens)

Learning Objectives
To understand:
- key concepts in linguistic analysis
- the variety of options in linguistic processing
- their use in recent IE systems, and their potential
The PPI task: given a full-text article, identify pairs of interacting proteins.

Contents
1 Introduction
2 Linguistic Structure and Analysis
  - Varieties of syntactic rules
  - Syntactic categories
  - Ambiguity and parse selection
  - Dependency analyses
3 Use of syntax in selected IE systems
4 Summary

Introduction: Reprise of Part One
Documents are "bags of words" and named entities.

A document
Dickkopf-1 (Dkk-1) is a potent head inducer in Xenopus. This effect can be attributed to its capability to specifically inhibit Wnt/beta-catenin signalling. Recent data point to a crucial role for Dkk-1 in the control of programmed cell death during vertebrate limb development. In this paper, we present a comparative expression analysis of Dkk-1, Bmp-4 and Sox-9 as well as data on the regulation of Dkk-1 by Wnt. Finally, we summarize the current knowledge of its potential function in the developing limb and present a model of how the interplay of the Bmp, Fgf and Wnt signalling pathways might differentially regulate programmed cell death versus chondrogenic differentiation in limb mesodermal cells.

A bag of words
dkk-1: 4, wnt: 3, limb: 3, data: 2, present: 2, death: 2, cell: 2, ...

Linguistic structure
Documents are not just bags of words:
- documents have structure
- sentences have structure
- words have structure

Traditional components of language
Phonetics, phonology, morphology, the lexicon, syntax, semantics, pragmatics and discourse.

Example: an arithmetic grammar
E → identifier
E → E + E
E → (E)
E → E * E
[slide shows a parse tree for "1 + 2 * 3"]

Grammars and structure
The grammar assigns "1 + 2 * 3" two different structures:
[slide shows two parse trees: one grouping (1 + 2) * 3, the other 1 + (2 * 3)]

Varieties of syntactic rules
- There are varieties of syntax
- They have different properties
- You can take an engineering perspective or a theoretical perspective

Regular expressions
Regular expressions are one way to capture patterns in language:
- a matches "a", and ab matches "ab"
- a(b|c)d matches "abd" and "acd"
- (ab)+ matches "ab", "abab", "ababab", ...
- a(b?)c matches "ac" and "abc"
The expressions are built recursively:
- if a and b are expressions, so is ab
- if a and b are expressions, so is a|b
- if a is an expression, so are a*, a?, a+, ...

Tools
Lots of tools support regular expressions: java.util.regex; perl, python, ruby, ...; emacs; Microsoft Word; monq; ...
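Since several of the tools above expose the same constructs, the slide's examples can be checked directly. A minimal sketch in Python's re module (the patterns and test strings come from the slide; the script itself is only illustrative):

    import re

    # Each (pattern, strings) pair comes from the examples above.
    # fullmatch requires the whole string to be in the pattern's coverage.
    tests = [
        ("a(b|c)d", ["abd", "acd"]),            # alternation: b or c
        ("(ab)+", ["ab", "abab", "ababab"]),    # one or more repetitions
        ("a(b?)c", ["ac", "abc"]),              # optional b
    ]
    for pattern, strings in tests:
        for s in strings:
            assert re.fullmatch(pattern, s), (pattern, s)
    print("all examples matched")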
The Chomsky family: symbol rewrite grammars
- a set of rules X0, ..., Xi → Y0, ..., Yj over a set of non-terminal symbols (NT) and a set of terminal symbols (T)
- a distinguished non-terminal symbol (the start symbol)
- a notion of derivation
Different restrictions on the rules give different types of language.

Chomsky's family
Type 3: X → Y0 or X → Y0 Y1, where Y0 ∈ T and Y1 ∈ NT
Type 2: X → Y0, ..., Yn, where Yi ∈ (T ∪ NT)
Type 1: X0 A X1 → X0 B0 ... Bn X1, where A ∈ NT and Bi ∈ (T ∪ NT)
Type 0: any symbol rewrite grammar

Generative power and computation
Different notations have different properties, just as different programming languages do, including:
- generative power
- recognizability
- computational efficiency; availability of tools
- descriptive elegance

Properties of Type 3
Regular expressions (Type 3) are recognizable by finite-state machines:
- fast: strings in the coverage are recognizable in linear time
- it is decidable whether a string is in the coverage
- well-studied algorithms exist for manipulating them
- supported in lots of software
- the least expressive class (no centre embedding)
- ... but not always descriptively elegant

Type 3 usage
- used extensively in phonology and morphology, and elsewhere (e.g. sentence boundary detection)
- very good at simple pattern matching (e.g. codes of various sorts)

Context-free grammars
CFGs are Type 2 in the Chomsky hierarchy. Each rule takes the form:
X → Y0, ..., Yn

Simple CFG example
S → NP VP
VP → Vi
VP → Vt NP
VP → VP PP
NP → DT NN
NP → NP PP
PP → P NP
Vi → stopped
Vt → inhibited
NN → process
NN → reaction
NN → drug
DT → the
P → with
P → in

The process stopped
The drug inhibited the reaction
The reaction inhibited the process with the drug
* The drug inhibited

Properties of CFGs
CFGs are also well understood mathematically:
- a string is recognizable in O(n^3) (the CYK algorithm)
- the union of two CFGs is context-free (but not, in general, the intersection)
- given two CFGs A and B, L(A) = L(B) is not decidable
- used extensively in programming language definitions
- lots of supporting tools
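To make the O(n^3) CYK recognition concrete, here is a minimal sketch in Python over the toy grammar from the "Simple CFG example" slide. The grammar is hand-converted towards Chomsky normal form by folding the unary rule VP → Vi into the lexicon; this is an illustration, not a production recognizer:

    # CYK recognizer for (a CNF-ish version of) the toy grammar above.
    lexical = {
        "stopped": {"Vi", "VP"}, "inhibited": {"Vt"},
        "process": {"NN"}, "reaction": {"NN"}, "drug": {"NN"},
        "the": {"DT"}, "with": {"P"}, "in": {"P"},
    }
    binary = {
        ("NP", "VP"): "S", ("Vt", "NP"): "VP", ("VP", "PP"): "VP",
        ("DT", "NN"): "NP", ("NP", "PP"): "NP", ("P", "NP"): "PP",
    }

    def cyk_recognize(words):
        n = len(words)
        # chart[i][j] holds the categories spanning words[i..j] inclusive
        chart = [[set() for _ in range(n)] for _ in range(n)]
        for i, w in enumerate(words):
            chart[i][i] = set(lexical.get(w, set()))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):  # split point
                    for left in chart[i][k]:
                        for right in chart[k + 1][j]:
                            if (left, right) in binary:
                                chart[i][j].add(binary[(left, right)])
        return "S" in chart[0][n - 1]

    print(cyk_recognize("the drug inhibited the reaction".split()))  # True
    print(cyk_recognize("the drug inhibited".split()))               # False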
CFG descriptive inadequacies
X → Y0, ..., Yn
The X and the Yi are atomic symbols, quite unrelated to each other.
- Lexical features: singular, plural, 1st person, mass, ... nouns are all just "nouns"
- Phrasal categories: the fact that the heads of NP, VP, PP, ... are N, V, P, ... is no accident
- Coordination: how can a CFG express the fact that like coordinates with like?

Beyond CFG
Replace unstructured names by structured ones:
np[case=acc, per=3] → him
np[case=nom, per=3] → he
np[per=3] → John
np[per=A] → John

S:[] -> NP:[per=P num=N] VP:[per=P num=N]
NP:[per=P num=N pronom=yes] -> N:[per=P num=N pronom=yes]
VP:[per=P num=N] -> V:[per=P num=N val=intrans]
VP:[per=P num=N] -> V:[per=P num=N val=trans] NP:[]

Beyond CFG: computational implications
These grammars often have a context-free backbone but add an information-merging operation (unification).
"Unification" grammars are at least Type 1 and, possibly, in the worst case, recognition is undecidable.
In practice, they seem "ok", but you might not want to parse Medline too often.
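The information-merging operation can be sketched for the simplest possible case: flat feature structures represented as Python dicts, with "?"-prefixed values standing in for variables like P and N in the rules above. Real unification grammars use recursive feature structures and proper variable binding, so this is only a toy illustration of why np[case=acc] cannot fill a nominative slot:

    def unify(fs1, fs2):
        """Merge two flat feature structures (dicts of string values);
        return None on a feature clash. A value starting with '?' is a
        variable that unifies with anything (no binding is tracked)."""
        result = dict(fs1)
        for feat, val in fs2.items():
            if feat not in result or result[feat].startswith("?"):
                result[feat] = val
            elif val.startswith("?") or result[feat] == val:
                pass  # compatible
            else:
                return None  # clash, e.g. case=acc vs case=nom
        return result

    # np[per=3] (John) unifies with a subject slot np[per=?P num=?N]:
    print(unify({"per": "3"}, {"per": "?P", "num": "?N"}))
    # ... but him (case=acc) clashes with a nominative subject slot:
    print(unify({"case": "acc", "per": "3"}, {"case": "nom"}))  # None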
Grammatical categories
What are the categories?
- Linguists explain the distribution of sequences of words by appealing to the notion of category.
- Engineers (e.g. programming language designers) can design their own.

Parts of speech: word-level categories
The major word-level categories are N, V, A and P.
Traditionally, nouns denote things, verbs actions, ...

Linguists prefer other criteria, especially distributional ones. Identify a frame such as:
"They/it can ___"
Claim: only items of category X (verb, in base form) can fill the frame.

Nouns
N: "___ can be a pain in the neck"
Dogs/John/Linguistics can be a pain in the neck
* Happily/swam/under can be a pain in the neck

Adjective/Adverb
Adj: "He is very ___"
He is very green/tall/large.
* He is very the/dogs/swam
Adv: "She treats him ___"
She treats him badly/politely/happily
* She treats him green/swam/dogs

Phrase-level categories
The major phrasal categories are NP, VP, PP and AP.
Dogs/John/Linguistics can be a pain in the neck
The mixture / Two alternative splicings / The protein bound at position 5 / ... can be a pain in the neck
Phrases have a key, obligatory element called the "head". The head of an NP is an N, the head of a VP is a V, and so forth.

How to diagnose a constituent
Substitution, movement, insertion, omission, coordination.

Examples 1
KPT inhibits the nociceptive effect of IL-1 in mice

Examples 2
Blocking of dickkopf-1 counteracts limb truncations and microphthalmia
dickkopf-1 is a negative regulator of normal bone homeostasis
Conjunction, noun compounding and PP-attachment are pervasive introducers of ambiguity.

Ambiguity is pervasive
Ambiguity is pervasive, and a good source of humour:
1 Children make tasty snacks
2 Police Help Dog Bite Victim
3 British Left Waffles on Falkland Islands
4 Prostitutes appeal to the Pope
5 Attenborough writes book on penguins
6 New vaccine may contain rabies
7 Prime Minister Used to Be a Woman

Example of the state of the art
Enju is a unification grammar with a context-free backbone.
http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
Intravenous interleukin-8 inhibits granulocyte emigration from rabbit mesenteric venules without altering L-selectin expression or leukocyte rolling. (Ley K et al., J Immunol, 1993 Dec 1)

Choosing the most likely parse
Can we add probabilities to our grammars to aid parse selection?
S → NP VP        1.0
VP → Vi          0.4
VP → Vt NP       0.4
VP → VP PP       0.2
NP → DT NN       0.3
NP → NP PP       0.7
PP → P NP        1.0
Vi → stopped     1.0
Vt → inhibited   1.0
NN → process     0.7
NN → reaction    0.2
NN → drug        0.1
DT → the         1.0
P → with         0.5
P → in           0.5

The probability of a sentence
The probability of a derivation is the product of the probabilities of the rules used in it:
p(D) = ∏_{X → Y ∈ D} p(X → Y | X)
There is even a reasonably efficient method (the Viterbi algorithm) for picking the most probable tree without actually enumerating them all.

Parameter estimation
Where do the probabilities in the grammar come from? If we have a tagged corpus, then we can count in the corpus.

Example
[slide shows a small treebank of parse trees in which NP rewrites sometimes as Det N and sometimes as PN]
p(NP → Det N | NP) = 0.6
p(NP → PN | NP) = 0.4

Bias
PCFGs do give a notion of parse probability, and they enable you to rank parses. But they have a built-in bias:
- shorter derivations are more probable, ceteris paribus
- probabilities are based on only a small amount of structural information
- probabilities take no account of lexical dependencies

Lexical influences
[slide shows the two bracketings of "lung inflammatory response": [lung [inflammatory response]] and [[lung inflammatory] response]]
An SCFG assigns the same probability to these two trees; but "inflammatory response" is a likely lexical association.

Lexical influences
[slide shows the two bracketings of "fresh gas-flow rates": [fresh [gas-flow rates]] and [[fresh gas-flow] rates]]
An SCFG assigns the same probability to these two trees; but "fresh gas-flow" is a likely lexical association.

Lexical influences
HIV gag-1 modulates protease activity by a cleavage site
[slide shows the parse in which the PP "by a cleavage site" attaches to the VP "modulates protease activity"]
"modulate ... by" is a likely lexical association.

Extending the probabilistic conditioning
PCFGs are poor because the only conditioning is p(X → Y | X). So condition on more:
- e.g. more ancestors than the immediate parent
- e.g. lexical material

Going beyond PCFGs
The "best" form of statistical parser is an active research issue. They all need training material, which is expensive to come by. And the more features one adds, the sparser the data ...
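As a worked instance of the formula p(D) = ∏ p(X → Y | X), a derivation's probability is computed by straightforward multiplication. A minimal sketch in Python, using the rule probabilities from the "Choosing the most likely parse" slide:

    # Rule probabilities from the toy PCFG above, keyed as (lhs, rhs).
    pcfg = {
        ("S", ("NP", "VP")): 1.0,
        ("VP", ("Vi",)): 0.4, ("VP", ("Vt", "NP")): 0.4, ("VP", ("VP", "PP")): 0.2,
        ("NP", ("DT", "NN")): 0.3, ("NP", ("NP", "PP")): 0.7,
        ("PP", ("P", "NP")): 1.0,
        ("Vi", ("stopped",)): 1.0, ("Vt", ("inhibited",)): 1.0,
        ("NN", ("process",)): 0.7, ("NN", ("reaction",)): 0.2, ("NN", ("drug",)): 0.1,
        ("DT", ("the",)): 1.0, ("P", ("with",)): 0.5, ("P", ("in",)): 0.5,
    }

    def derivation_prob(rules):
        """p(D) = product of p(X -> Y | X) over the rules used in D."""
        p = 1.0
        for rule in rules:
            p *= pcfg[rule]
        return p

    # Derivation of "the drug inhibited the reaction":
    d = [("S", ("NP", "VP")),
         ("NP", ("DT", "NN")), ("DT", ("the",)), ("NN", ("drug",)),
         ("VP", ("Vt", "NP")), ("Vt", ("inhibited",)),
         ("NP", ("DT", "NN")), ("DT", ("the",)), ("NN", ("reaction",))]
    print(f"{derivation_prob(d):.2e}")  # 0.3 * 0.1 * 0.4 * 0.3 * 0.2 = 7.20e-04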
Dependency analyses
Sentence structure is represented as binary asymmetric relations between words. Each relation has a head and a modifier. Grammatical relations include subject-of and object-of.

Dependency example
[slide shows a dependency graph]

Example parse
In the absence of stg, these cells are inappropriately recruited.
(|dobj| |In:1_II| |absence:3_NN1|)
(|det| |absence:3_NN1| |the:2_AT|)
(|iobj| |absence:3_NN1| |of:4_IO|)
(|dobj| |of:4_IO| |stg:5_NP1|)
(|det| |cell+s:8_NN2| |these:7_DD2|)
(|ncmod| _ |recruit+ed:11_VVN| |In:1_II|)
(|ncsubj| |recruit+ed:11_VVN| |cell+s:8_NN2| _)
(|aux| |recruit+ed:11_VVN| |be+:9_VBR|)
(|passive| |recruit+ed:11_VVN|)

Advantages of dependency representations
- Dependency links are closer to semantics: subject-of, ... is (nearly) the right vocabulary
- No need to read them off a tree structure
- May facilitate comparison of different analysers
- Later processors only need to search a flat list of bi-lexical relations

Dependency parsing
There are specialist dependency parsers; or there are translators from trees to grammatical relations.
A search engine that exploits grammatical relations: http://www-tsujii.is.s.u-tokyo.ac.jp/medie/

Use of syntax in selected IE systems

IE in bioinformatics
Motto: get the facts, not the documents.
BioCreative 2 (2007) is the most recent competitive evaluation of work in this field.
The PPI task: given a full-text article, identify pairs of interacting proteins:
- identify strings that refer to proteins
- identify the proteins referred to by those strings
- identify pairs that are claimed to be interacting

Some bounds
OntoGene in BioCreative II (Rinaldi et al., Genome Biology, 2008):
- generate all pairs of proteins in the abstract? recall of 35% but precision < 1%
- generate all pairs in the same sentence? recall of 21%; precision 8%

OntoGene approach
Hand-coded rules over dependency parses. X interacts with Y if:
1 subj(X, H) & prep-obj(Y, H) & prot(X) & prot(Y)
2 subj(X, H) & prep-obj(Z, H) & conj(Y, Z) & prot(X) & prot(Y)
Recall of 25%; precision 26%.
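To show what such hand-coded rules can look like operationally, here is a minimal sketch in Python of rule 1 (subj(X, H) & prep-obj(Y, H) & prot(X) & prot(Y)). The triple format and the example sentence are hypothetical simplifications, not OntoGene's actual representation:

    # A parse as a set of (relation, dependent, head) triples, plus the
    # set of tokens recognised as proteins. Hypothetical input for a
    # sentence like "ProtA interacts with ProtB".
    parse = {("subj", "ProtA", "interacts"),
             ("prep-obj", "ProtB", "interacts")}
    proteins = {"ProtA", "ProtB"}

    def rule1(parse, proteins):
        """X interacts with Y if subj(X,H) & prep-obj(Y,H) & prot(X) & prot(Y)."""
        pairs = set()
        for rel1, x, h1 in parse:
            for rel2, y, h2 in parse:
                if (rel1 == "subj" and rel2 == "prep-obj" and h1 == h2
                        and x in proteins and y in proteins):
                    pairs.add((x, y))
        return pairs

    print(rule1(parse, proteins))  # {('ProtA', 'ProtB')}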
Protein corral
A web application that combines IR and IE over Medline. It retrieves Medline abstracts, similarly to PubMed, then analyses each one and summarizes the protein associations it finds:
- PPI: NLP pattern matching
- Co3/Tri: co-occurrence (2 proteins and a verb)
- Co/Bi: co-occurrence (2 proteins only)
http://www.ebi.ac.uk/Rebholz-srv/pcorral/

Akane approach
"AKANE system: protein-protein interaction pairs in the BioCreative2 challenge, PPI-IPS subtask" (Saetre et al., Proc. BioCreative II, 2007):
- assume any sentence mentioning two proteins known to interact actually describes that interaction
- parse with Enju
- extract the smallest tree covering both proteins
- use this as a pattern if its overall accuracy on training data is "ok"
- remove pairs that don't share a species
Recall of 19%; precision 11%.

Colorado system: a specialist "semantic parser"
Some example rules:
NUCLEUS := nucleus, nuclei, nuclear
PROTEIN-TRANSPORT := ([TRANSPORTED-ENTITY dep:x] _ [ACTION-TRANSPORT head:x])
    @ (from {det}? [TRANSPORT-ORIGIN])
    @ (to {det}? [TRANSPORT-DESTINATION]);
Recall of 31%; precision 39%.

Edinburgh TXM system
"Adapting a relation extraction pipeline for the BioCreative II task" (Grover et al., Proc. BioCreative II, 2007):
- propose ALL pairs of proteins within n sentences of each other
- filter using a MaxEnt classifier with features such as:
  - the two words surrounding each protein name
  - the distance between the protein names
  - the interaction word
  - the distance from the interaction word to the nearest protein
  - various pattern matches, including:
    PROT WORD{0,2} INTERACTION WORD{0,2} PROT
    PROT word{0,1} PROT
    PROT WORD{0,2} "complex"

Edinburgh TXM system: document filter
Filter again using document-level features such as:
- the frequency of interaction mentions in the document
- whether most interaction mentions are extra-sentential
- whether the title contains a mention
- whether the species of both proteins are the same
Recall of 30%; precision 28%.

Summary
Documents are more than bags of words.
Syntactic engines vary:
- regular expression engines
- context-free engines
- "advanced linguistic" engines
- special-purpose ("semantic") engines
IE applications exploit, variously:
- different syntactic engines
- patterns over engine outputs
- statistics
- machine learning
The future?