Issues in Computational Linguistics: Grammar Engineering
Dick Crouch and Tracy King

Outline
– What is a deep grammar?
– How to engineer them:
  – robustness
  – integrating shallow resources
  – ambiguity
  – writing efficient grammars
  – real world data

What is a shallow grammar?
– often trained automatically from marked-up corpora
– part of speech tagging
– chunking
– trees

POS tagging and Chunking
– Part of speech tagging:
    I/PRP saw/VBD her/PRP duck/VB ./PUNCT
    I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT
– Chunking:
  – general chunking
      [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]. (Abney)
  – NP chunking
      [NP President Clinton] visited [NP the Hermitage] in [NP Leningrad]

Treebank grammars
– Phrase structure tree (c-structure)
– Annotations for heads, grammatical functions
– Collins parser output

Deep grammars
– Provide detailed syntactic/semantic analyses
  – LFG (ParGram), HPSG (LinGO, Matrix)
  – grammatical functions, tense, number, etc.
      Mary wants to leave.
        subj(want~1, Mary~3)
        comp(want~1, leave~2)
        subj(leave~2, Mary~3)
        tense(leave~2, present)
– Usually manually constructed
  – linguistically motivated rules

Why would you want one?
– Meaning-sensitive applications
  – overkill for many NLP applications
– Applications which use shallow methods for English may not be able to for "free" word order languages
  – many grammatical functions can be read off of trees in English
      SUBJ: NP sister to VP        [S [NP Mary] [VP left]]
      OBJ: first NP sister to V    [S [NP Mary] [VP saw [NP John]]]
  – need other information in German, Japanese, etc.

Deep analysis matters… if you care about the answer
– Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident.
– Question: Who flew to Chicago?
– Candidate answers:
    division        closest noun                            shallow but wrong
    head            next closest
    V.P. Philips    next
    delegation      furthest away, but subject of "flew"    deep and right

Applications of Language Engineering
[Figure: applications plotted by domain coverage (narrow to broad) against functionality (low to high), ranging from manually-tagged keyword search, post-search sifting, document base management and autonomous knowledge filtering (Alta Vista, AskJeeves, Google) through restricted dialogue (Microsoft Paperclip) to knowledge fusion, good translation, useful summary and natural dialogue]

Traditional Problems
– Time consuming and expensive to write
– Not robust
  – want output for any input
– Ambiguous
– Slow
– Other gating items for applications that need deep grammars

Why deep analysis is difficult
– Languages are hard to describe
  – meaning depends on complex properties of words and sequences
  – different languages rely on different properties
  – errors and disfluencies
– Languages are hard to compute
  – expensive to recognize complex patterns
  – sentences are ambiguous
  – ambiguities multiply: explosion in time and space

How to overcome this
– Engineer the deep grammars
  – theoretical vs. practical
  – what is good enough?
– Integrate shallow techniques into deep grammars
– Experience based on broad-coverage LFG grammars (ParGram project)

Robustness: Sources of Brittleness
– Missing vocabulary
  – you can't list all the proper names in the world
– Missing constructions
  – there are many constructions theoretical linguistics rarely considers (e.g. dates, company names)
  – easy to miss even core constructions
– Ungrammatical input
  – real world text is not always perfect
  – sometimes it is really horrendous

Real world input
– Other weak blue-chip issues included Chevron, which went down 2 to 64 7/8 in Big Board composite trading of 1.3 million shares; Goodyear Tire & Rubber, off 1 1/2 to 46 3/4, and American Express, down 3/4 to 37 1/4. (WSJ, section 13)
– ``The croaker's done gone from the hook – (WSJ, section 13)
– (SOLUTION 27000 20) Without tag P-248 the W7F3 fuse is located in the rear of the machine by the charge power supply (PL3 C14 item 15. (Eureka copier repair tip)
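The "Who flew to Chicago?" contrast above can be made concrete with a small sketch. The Python snippet below is purely illustrative and not part of the original materials: the token list, the noun set, and the single subj(...) dependency are hand-written stand-ins for real tagger and parser output, in the style of the subj/comp relations shown earlier.

    # Toy contrast: shallow proximity heuristic vs. deep dependency lookup
    # for "Who flew to Chicago?".  Analyses are hand-written stand-ins.
    SENTENCE = ("A delegation led by Vice President Philips , head of the "
                "chemical division , flew to Chicago a week after the incident .")
    NOUNS = {"delegation", "Philips", "head", "division", "week", "incident"}

    def shallow_answer(tokens, verb):
        # Answer with the candidate noun closest to (and preceding) the verb.
        verb_pos = tokens.index(verb)
        candidates = [(verb_pos - i, tok) for i, tok in enumerate(tokens)
                      if i < verb_pos and tok in NOUNS]
        return min(candidates)[1]

    # Deep analysis: grammatical functions, as in subj(fly~1, delegation~2).
    DEPENDENCIES = {("subj", "flew"): "delegation"}

    def deep_answer(verb):
        return DEPENDENCIES[("subj", verb)]

    tokens = SENTENCE.split()
    print(shallow_answer(tokens, "flew"))   # division   (closest noun, wrong)
    print(deep_answer("flew"))              # delegation (subject of flew, right)

The point is only that the shallow answer falls out of string position, while the deep answer falls out of the grammatical analysis.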
Missing vocabulary
– Build vocabulary based on input from shallow methods
  – fast
  – extensive
  – accurate
– Finite-state morphologies
– Part of speech taggers

Finite State Morphologies
– falls -> fall +Noun +Pl
           fall +Verb +Pres +3sg
– Mary -> Mary +Prop +Giv +Fem +Sg
– vienne -> venir +SubjP +SG {+P1|+P3} +Verb
– Build lexical entries on the fly from the morphological information
  – have a canonicalized stem form
  – have significant grammatical information
  – do not have subcategorization

Building lexical entries
– Lexical entries:
    -unknown  N      @(COMMON-NOUN %stem).
    +Noun     N-SFX  @(PERS 3).
    +Pl       N-NUM  @(NUM pl).
– Rule:
    NOUN -> N N-SFX N-NUM.
– Templates:
  – COMMON-NOUN :: (^ PRED)='%stem' (^ NTYPE)=common
  – PERS(3) :: (^ PERS)=3
  – NUM(pl) :: (^ NUM)=pl

Building lexical entries
– F-structure for falls:
    [ PRED  'fall'
      NTYPE common
      PERS  3
      NUM   pl ]
– C-structure for falls:
    Noun
      N      fall
      N-SFX  +Noun
      N-NUM  +Pl

Guessing words
– Use an FST guesser if the morphology doesn't know the word
  – capitalized words can be proper nouns
    » Saakashvili -> Saakashvili +Noun +Proper +Guessed
  – -ed words can be past tense verbs or adjectives
    » fumped -> fump +Verb +Past +Guessed
                fumped +Adj +Deverbal +Guessed
– Languages with more morphology allow for better guessers

Using the lexicons
– Rank the lexical lookup:
  1. overt entry in the lexicon
  2. entry built from information from the morphology
  3. entry built from information from the guesser
– Use the most reliable information
– Fall back only as necessary

Missing constructions
– Even large hand-written grammars are not complete
  – new constructions, especially with new corpora
  – unusual constructions
– Generally longer sentences fail
  – one error can destroy the parse
– Build up as much as you can; stitch the pieces together

Grammar engineering approach
– First try to get a complete parse
– If that fails, build up chunks that get complete parses
– Have a fall back for things without even chunk parses
– Link these chunks and fall backs together in a single structure

Fragment chunks: sample output
– Input: the the dog appears.
– Split into:
  – "token" the
  – sentence "the dog appears"
  – ignore the period
– [C-structure and f-structure figures for the fragment parse]

Ungrammatical input
– Real world text contains ungrammatical input
  – typos
  – run-ons
  – cut and paste errors
– Deep grammars tend to cover only grammatical input
– Two strategies:
  – robustness techniques: guesser/fragments
  – dispreferred rules for ungrammatical structures

Rules for ungrammatical structures
– Common errors can be coded in the rules
  – want to know that an error occurred (e.g., a feature in the f-structure)
– Disprefer parses of ungrammatical structures
  – tools for the grammar writer to rank rules
  – two+ pass system:
    1. standard rules
    2. rules for known ungrammatical constructions
    3. default fall back rules

Sample ungrammatical structures
– Mismatched subject-verb agreement:
    Verb3Sg = { SUBJ PERS = 3
                SUBJ NUM = sg
              | BadVAgr }
– Missing copula:
    VPcop ==> { Vcop: ^=!
              | e: (^ PRED)='NullBe<(^ SUBJ)(^XCOMP)>'
                   MissingCopularVerb }
              { NP: (^ XCOMP)=!
              | AP: (^ XCOMP)=!
              | … }
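The multi-pass idea can be sketched outside XLE as well. The snippet below is a toy illustration in Python, assuming the NLTK library; the grammar and sentences are invented, and the missing-copula rule is only enabled on the second pass, so its parses are effectively dispreferred and the returned label records that the error occurred.

    # Toy multi-pass parsing: standard rules first, then a dispreferred rule
    # for a known ungrammatical construction (missing copula), then fragments.
    import nltk

    STANDARD = """
    S -> NP VP
    NP -> 'she' | 'the' 'dog'
    VP -> V NP | VCOP AP
    V -> 'sees'
    VCOP -> 'is'
    AP -> 'happy'
    """

    # Extra rule licensing a missing copula; kept out of the first pass.
    RELAXED = STANDARD + "VP -> AP\n"

    def parse(sentence):
        tokens = sentence.split()
        for label, rules in [("standard", STANDARD), ("missing copula", RELAXED)]:
            parser = nltk.ChartParser(nltk.CFG.fromstring(rules))
            trees = list(parser.parse(tokens))
            if trees:
                return label, trees[0]     # record which pass produced the parse
        return "fragments", None           # no parse at all: chunk/fragment stage

    print(parse("she is happy"))   # parsed by the standard rules
    print(parse("she happy"))      # only the dispreferred missing-copula rule applies
    print(parse("the happy"))      # neither pass parses: fall through to fragments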
Robustness summary
– Integrate shallow methods
  – for lexical items
  – morphologies
  – guessers
– Fall back techniques
  – for missing constructions
  – fragment grammar
  – dispreferred rules

Ambiguity
– Deep grammars are massively ambiguous
– Example: 700 sentences from section 23 of the WSJ
  – average # of words: 19.6
  – average # of optimal parses: 684
    » for 1-10 word sentences: 3.8
    » for 11-20 word sentences: 25.2
    » for 50-60 word sentences: 12,888

Managing Ambiguity
– Use packing to parse and manipulate the ambiguities efficiently (more tomorrow)
– Trim early with shallow markup
  – fewer parses to choose from
  – faster parse time
– Choose the most probable parse for applications that need a single input

Shallow markup
– Part of speech marking as a filter
    I saw her duck/VB.
  – accuracy of the tagger (very good for English)
  – can use partial tagging (verbs and nouns)
– Named entities
    <company>Goldman, Sachs & Co.</company> bought IBM.
  – good for proper names and times
  – internal structure is hard to parse
– Fall back technique if it fails
  – slows parsing
  – accuracy vs. speed

Example shallow markup: named entities
– Allow the tokenizer to accept marked-up input:
    parse {<person>Mr. Thejskt Thejs</person> arrived.}
– Tokenized string (TB marks a token boundary):
    Mr. Thejskt Thejs TB +NEperson
    Mr(TB). TB Thejskt TB Thejs
    TB arrived TB . TB
– Add lexical entries and rules for NE tags
– [Resulting c-structure and f-structure figures]

Results for shallow markup (Kaplan and King 2003; slashed figures are Full/All)

                % Full parses   Optimal sol'ns   Best F-sc   Time %
    Unmarked    76              482/1753         82/79       65/100
    Named ent   78              263/1477         86/84       60/91
    POS tag     62              248/1916         76/72       40/48
    Lab brk     65              158/774          85/79       19/31

Choosing the most probable parse
– Applications may want one input
  – or at least just a handful
– Use stochastic methods to choose
  – efficient (XLE English grammar: 5% of parse time)
– Need training data
  – partially labelled data is ok
      [NP-SBJ They] see [NP-OBJ the girl with the telescope]

Run-time performance
– Many deep grammars are slow
– Techniques depend on the system
  – LFG: exploit the context-free backbone; ambiguity packing techniques
– Speed vs. accuracy trade-off
  – remove/disprefer peripheral rules
  – remove fall backs for shallow markup

Development expense
– Grammar porting
– Starter grammar
– Induced grammar bootstrapping
– How cheap are shallow grammars?
  – training data can be expensive to produce

Grammar porting
– Use an existing grammar as the base for a new language
– Languages must be typologically similar
  – Japanese-Korean
  – Balkan
– Lexical porting via bilingual dictionaries
– Main work is in testing and evaluation

Starter grammar
– Provide basic rules and templates
  – including for robustness techniques
– The grammar writer:
  – chooses among them
  – refines them
– Grammar Matrix for HPSG

Grammar induction
– Induce a core grammar from a treebank
  – compile rule generalizations
  – threshold rare rules
  – hand-augment with features and fallback techniques
– Requires:
  – an induction program
  – existing resources (a treebank)

Conclusions
– Grammar engineering makes deep grammars feasible
  – robustness techniques
  – integration of shallow methods
– Many current applications can use shallow grammars
– Fast, accurate, broad-coverage deep grammars enable new applications
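As a closing illustration of the grammar-induction idea above, here is a minimal sketch assuming the NLTK library; the two trees are toy stand-ins for a real treebank and the threshold value is arbitrary. Rules used in the trees are counted, rare rules are dropped, and the remainder forms a small probabilistic core grammar that would then be hand-augmented with features and the fallback techniques described earlier.

    # Toy grammar induction: count treebank rules, threshold rare ones,
    # and build a small probabilistic core grammar from what remains.
    from collections import Counter
    from nltk import Nonterminal, Tree, induce_pcfg

    TREEBANK = [
        Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))"),
        Tree.fromstring("(S (NP (NNP Mary)) (VP (VBD saw) (NP (DT the) (NN dog))))"),
    ]
    THRESHOLD = 2    # keep only rules seen at least this often

    counts = Counter(p for tree in TREEBANK for p in tree.productions())
    kept = [p for p, n in counts.items() if n >= THRESHOLD for _ in range(n)]
    core = induce_pcfg(Nonterminal("S"), kept)
    print(core)    # the thresholded core, ready for hand augmentation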