Issues in Computational Linguistics: Grammar Engineering
Dick Crouch and Tracy King

Outline
– What is a deep grammar?
– How to engineer them:
  – robustness
  – integrating shallow resources
  – ambiguity
  – writing efficient grammars
  – real world data

What is a shallow grammar?
– often trained automatically from marked-up corpora
– part of speech tagging
– chunking
– trees

POS tagging and Chunking
– Part of speech tagging:
    I/PRP saw/VBD her/PRP duck/VB ./PUNCT
    I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT
– Chunking:
  – general chunking
      [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]. (Abney)
  – NP chunking
      [NP President Clinton] visited [NP the Hermitage] in [NP Leningrad]

Treebank grammars
– Phrase structure tree (c-structure)
– Annotations for heads, grammatical functions
– Collins parser output

Deep grammars
– Provide detailed syntactic/semantic analyses
  – LFG (ParGram), HPSG (LinGO, Matrix)
  – grammatical functions, tense, number, etc.
      Mary wants to leave.
        subj(want~1, Mary~3)
        comp(want~1, leave~2)
        subj(leave~2, Mary~3)
        tense(leave~2, present)
– Usually manually constructed
  – linguistically motivated rules

Why would you want one?
– Meaning-sensitive applications
  – overkill for many NLP applications
– Applications which use shallow methods for English may not be able to for "free" word order languages
  – many grammatical functions can be read off of trees in English
      SUBJ: NP sister to VP        [S [NP Mary] [VP left]]
      OBJ: first NP sister to V    [S [NP Mary] [VP saw [NP John]]]
  – need other information in German, Japanese, etc.

Deep analysis matters… if you care about the answer
– Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident.
– Question: Who flew to Chicago?
– Candidate answers:
    division        closest noun                            shallow but wrong
    head            next closest
    V.P. Philips    next
    delegation      furthest away, but subject of "flew"    deep and right

Applications of Language Engineering
[Figure: applications plotted by domain coverage (narrow to broad) against functionality (low to high), ranging from manually-tagged keyword search, post-search sifting, document base management and autonomous knowledge filtering (Alta Vista, AskJeeves, Google) through restricted dialogue (Microsoft Paperclip) to knowledge fusion, good translation, useful summary and natural dialogue]

Traditional Problems
– Time consuming and expensive to write
– Not robust
  – want output for any input
– Ambiguous
– Slow
– Other gating items for applications that need deep grammars

Why deep analysis is difficult
– Languages are hard to describe
  – meaning depends on complex properties of words and sequences
  – different languages rely on different properties
  – errors and disfluencies
– Languages are hard to compute
  – expensive to recognize complex patterns
  – sentences are ambiguous
  – ambiguities multiply: explosion in time and space

How to overcome this
– Engineer the deep grammars
  – theoretical vs. practical
  – what is good enough?
– Integrate shallow techniques into deep grammars
– Experience based on broad-coverage LFG grammars (ParGram project)

Robustness: Sources of Brittleness
– Missing vocabulary
  – you can't list all the proper names in the world
– Missing constructions
  – there are many constructions theoretical linguistics rarely considers (e.g. dates, company names)
  – easy to miss even core constructions
– Ungrammatical input
  – real world text is not always perfect
  – sometimes it is really horrendous

Real world input
– Other weak blue-chip issues included Chevron, which went down 2 to 64 7/8 in Big Board composite trading of 1.3 million shares; Goodyear Tire & Rubber, off 1 1/2 to 46 3/4, and American Express, down 3/4 to 37 1/4. (WSJ, section 13)
– ``The croaker's done gone from the hook – (WSJ, section 13)
– (SOLUTION 27000 20) Without tag P-248 the W7F3 fuse is located in the rear of the machine by the charge power supply (PL3 C14 item 15. (Eureka copier repair tip)
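The "Who flew to Chicago?" contrast above can be made concrete with a small sketch. The Python snippet below is purely illustrative and not part of the original materials: the token list, the noun set, and the single subj(...) dependency are hand-written stand-ins for real tagger and parser output, in the style of the subj/comp relations shown earlier.

    # Toy contrast: shallow proximity heuristic vs. deep dependency lookup
    # for "Who flew to Chicago?".  Analyses are hand-written stand-ins.
    SENTENCE = ("A delegation led by Vice President Philips , head of the "
                "chemical division , flew to Chicago a week after the incident .")
    NOUNS = {"delegation", "Philips", "head", "division", "week", "incident"}

    def shallow_answer(tokens, verb):
        # Answer with the candidate noun closest to (and preceding) the verb.
        verb_pos = tokens.index(verb)
        candidates = [(verb_pos - i, tok) for i, tok in enumerate(tokens)
                      if i < verb_pos and tok in NOUNS]
        return min(candidates)[1]

    # Deep analysis: grammatical functions, as in subj(fly~1, delegation~2).
    DEPENDENCIES = {("subj", "flew"): "delegation"}

    def deep_answer(verb):
        return DEPENDENCIES[("subj", verb)]

    tokens = SENTENCE.split()
    print(shallow_answer(tokens, "flew"))   # division   (closest noun, wrong)
    print(deep_answer("flew"))              # delegation (subject of flew, right)

The point is only that the shallow answer falls out of string position, while the deep answer falls out of the grammatical analysis.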
Missing vocabulary
– Build vocabulary based on input from shallow methods
  – fast
  – extensive
  – accurate
– Finite-state morphologies
– Part of speech taggers

Finite State Morphologies
– falls -> fall +Noun +Pl
           fall +Verb +Pres +3sg
– Mary -> Mary +Prop +Giv +Fem +Sg
– vienne -> venir +SubjP +SG {+P1|+P3} +Verb
– Build lexical entries on the fly from the morphological information
  – have a canonicalized stem form
  – have significant grammatical information
  – do not have subcategorization

Building lexical entries
– Lexical entries:
    -unknown  N      @(COMMON-NOUN %stem).
    +Noun     N-SFX  @(PERS 3).
    +Pl       N-NUM  @(NUM pl).
– Rule:
    NOUN -> N N-SFX N-NUM.
– Templates:
  – COMMON-NOUN :: (^ PRED)='%stem' (^ NTYPE)=common
  – PERS(3) :: (^ PERS)=3
  – NUM(pl) :: (^ NUM)=pl

Building lexical entries
– F-structure for falls:
    [ PRED  'fall'
      NTYPE common
      PERS  3
      NUM   pl ]
– C-structure for falls:
    Noun
      N      fall
      N-SFX  +Noun
      N-NUM  +Pl

Guessing words
– Use an FST guesser if the morphology doesn't know the word
  – capitalized words can be proper nouns
    » Saakashvili -> Saakashvili +Noun +Proper +Guessed
  – -ed words can be past tense verbs or adjectives
    » fumped -> fump +Verb +Past +Guessed
                fumped +Adj +Deverbal +Guessed
– Languages with more morphology allow for better guessers

Using the lexicons
– Rank the lexical lookup:
  1. overt entry in the lexicon
  2. entry built from information from the morphology
  3. entry built from information from the guesser
– Use the most reliable information
– Fall back only as necessary

Missing constructions
– Even large hand-written grammars are not complete
  – new constructions, especially with new corpora
  – unusual constructions
– Generally longer sentences fail
  – one error can destroy the parse
– Build up as much as you can; stitch the pieces together

Grammar engineering approach
– First try to get a complete parse
– If that fails, build up chunks that get complete parses
– Have a fall back for things without even chunk parses
– Link these chunks and fall backs together in a single structure

Fragment chunks: sample output
– Input: the the dog appears.
– Split into:
  – "token" the
  – sentence "the dog appears"
  – ignore the period
– [C-structure and f-structure figures for the fragment parse]

Ungrammatical input
– Real world text contains ungrammatical input
  – typos
  – run-ons
  – cut and paste errors
– Deep grammars tend to cover only grammatical input
– Two strategies:
  – robustness techniques: guesser/fragments
  – dispreferred rules for ungrammatical structures

Rules for ungrammatical structures
– Common errors can be coded in the rules
  – want to know that an error occurred (e.g., a feature in the f-structure)
– Disprefer parses of ungrammatical structures
  – tools for the grammar writer to rank rules
  – two+ pass system:
    1. standard rules
    2. rules for known ungrammatical constructions
    3. default fall back rules

Sample ungrammatical structures
– Mismatched subject-verb agreement:
    Verb3Sg = { SUBJ PERS = 3
                SUBJ NUM = sg
              | BadVAgr }
– Missing copula:
    VPcop ==> { Vcop: ^=!
              | e: (^ PRED)='NullBe<(^ SUBJ)(^XCOMP)>'
                   MissingCopularVerb }
              { NP: (^ XCOMP)=!
              | AP: (^ XCOMP)=!
              | … }
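The multi-pass idea can be sketched outside XLE as well. The snippet below is a toy illustration in Python, assuming the NLTK library; the grammar and sentences are invented, and the missing-copula rule is only enabled on the second pass, so its parses are effectively dispreferred and the returned label records that the error occurred.

    # Toy multi-pass parsing: standard rules first, then a dispreferred rule
    # for a known ungrammatical construction (missing copula), then fragments.
    import nltk

    STANDARD = """
    S -> NP VP
    NP -> 'she' | 'the' 'dog'
    VP -> V NP | VCOP AP
    V -> 'sees'
    VCOP -> 'is'
    AP -> 'happy'
    """

    # Extra rule licensing a missing copula; kept out of the first pass.
    RELAXED = STANDARD + "VP -> AP\n"

    def parse(sentence):
        tokens = sentence.split()
        for label, rules in [("standard", STANDARD), ("missing copula", RELAXED)]:
            parser = nltk.ChartParser(nltk.CFG.fromstring(rules))
            trees = list(parser.parse(tokens))
            if trees:
                return label, trees[0]     # record which pass produced the parse
        return "fragments", None           # no parse at all: chunk/fragment stage

    print(parse("she is happy"))   # parsed by the standard rules
    print(parse("she happy"))      # only the dispreferred missing-copula rule applies
    print(parse("the happy"))      # neither pass parses: fall through to fragments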
Robustness summary
– Integrate shallow methods
  – for lexical items
  – morphologies
  – guessers
– Fall back techniques
  – for missing constructions
  – fragment grammar
  – dispreferred rules

Ambiguity
– Deep grammars are massively ambiguous
– Example: 700 sentences from section 23 of the WSJ
  – average # of words: 19.6
  – average # of optimal parses: 684
    » for 1-10 word sentences: 3.8
    » for 11-20 word sentences: 25.2
    » for 50-60 word sentences: 12,888

Managing Ambiguity
– Use packing to parse and manipulate the ambiguities efficiently (more tomorrow)
– Trim early with shallow markup
  – fewer parses to choose from
  – faster parse time
– Choose the most probable parse for applications that need a single input

Shallow markup
– Part of speech marking as a filter
    I saw her duck/VB.
  – accuracy of the tagger (very good for English)
  – can use partial tagging (verbs and nouns)
– Named entities
    <company>Goldman, Sachs & Co.</company> bought IBM.
  – good for proper names and times
  – internal structure is hard to parse
– Fall back technique if it fails
  – slows parsing
  – accuracy vs. speed

Example shallow markup: named entities
– Allow the tokenizer to accept marked-up input:
    parse {<person>Mr. Thejskt Thejs</person> arrived.}
– Tokenized string (TB marks a token boundary):
    Mr. Thejskt Thejs TB +NEperson
    Mr(TB). TB Thejskt TB Thejs
    TB arrived TB . TB
– Add lexical entries and rules for NE tags
– [Resulting c-structure and f-structure figures]

Results for shallow markup (Kaplan and King 2003; slashed figures are Full/All)

                % Full parses   Optimal sol'ns   Best F-sc   Time %
    Unmarked    76              482/1753         82/79       65/100
    Named ent   78              263/1477         86/84       60/91
    POS tag     62              248/1916         76/72       40/48
    Lab brk     65              158/774          85/79       19/31

Choosing the most probable parse
– Applications may want one input
  – or at least just a handful
– Use stochastic methods to choose
  – efficient (XLE English grammar: 5% of parse time)
– Need training data
  – partially labelled data is ok
      [NP-SBJ They] see [NP-OBJ the girl with the telescope]

Run-time performance
– Many deep grammars are slow
– Techniques depend on the system
  – LFG: exploit the context-free backbone; ambiguity packing techniques
– Speed vs. accuracy trade-off
  – remove/disprefer peripheral rules
  – remove fall backs for shallow markup

Development expense
– Grammar porting
– Starter grammar
– Induced grammar bootstrapping
– How cheap are shallow grammars?
  – training data can be expensive to produce

Grammar porting
– Use an existing grammar as the base for a new language
– Languages must be typologically similar
  – Japanese-Korean
  – Balkan
– Lexical porting via bilingual dictionaries
– Main work is in testing and evaluation

Starter grammar
– Provide basic rules and templates
  – including for robustness techniques
– The grammar writer:
  – chooses among them
  – refines them
– Grammar Matrix for HPSG

Grammar induction
– Induce a core grammar from a treebank
  – compile rule generalizations
  – threshold rare rules
  – hand-augment with features and fallback techniques
– Requires:
  – an induction program
  – existing resources (a treebank)

Conclusions
– Grammar engineering makes deep grammars feasible
  – robustness techniques
  – integration of shallow methods
– Many current applications can use shallow grammars
– Fast, accurate, broad-coverage deep grammars enable new applications
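As a closing illustration of the grammar-induction idea above, here is a minimal sketch assuming the NLTK library; the two trees are toy stand-ins for a real treebank and the threshold value is arbitrary. Rules used in the trees are counted, rare rules are dropped, and the remainder forms a small probabilistic core grammar that would then be hand-augmented with features and the fallback techniques described earlier.

    # Toy grammar induction: count treebank rules, threshold rare ones,
    # and build a small probabilistic core grammar from what remains.
    from collections import Counter
    from nltk import Nonterminal, Tree, induce_pcfg

    TREEBANK = [
        Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))"),
        Tree.fromstring("(S (NP (NNP Mary)) (VP (VBD saw) (NP (DT the) (NN dog))))"),
    ]
    THRESHOLD = 2    # keep only rules seen at least this often

    counts = Counter(p for tree in TREEBANK for p in tree.productions())
    kept = [p for p, n in counts.items() if n >= THRESHOLD for _ in range(n)]
    core = induce_pcfg(Nonterminal("S"), kept)
    print(core)    # the thresholded core, ready for hand augmentation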