Treebanks, parsing, etc.

Syntax and computers
• Parsing: input is sentence, output is tree (or equivalent representation)
• Browsing:
  – Finding particular syntactic structures within a corpus of sentences
  – Finding sentences that match a particular syntactic construction
• Information retrieval, machine translation, speech recognition, etc.

Why parsing is difficult: Newspaper headlines
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Stolen Painting Found by Tree
• Local High School Dropouts Cut in Half
• Red Tape Holds Up New Bridges
• Clinton Wins on Budget, but More Lies Ahead
• Hospitals Are Sued by 7 Foot Doctors
• Kids Make Nutritious Snacks

Ambiguous headlines
POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
DRUNK GETS NINE MONTHS IN VIOLIN CASE
FARMER BILL DIES IN HOUSE
IRAQI HEAD SEEKS ARMS
PROSTITUTES APPEAL TO POPE
BRITISH LEFT WAFFLES ON FALKLAND ISLANDS
LUNG CANCER IN WOMEN MUSHROOMS
TEACHER STRIKES IDLE KIDS
ENRAGED COW INJURES FARMER WITH AXE
JUVENILE COURT TO TRY SHOOTING DEFENDANT
TWO SOVIET SHIPS COLLIDE, ONE DIES

WordNet subcat frames
 1  Something ----s
 2  Somebody ----s
 3  It is ----ing
 4  Something is ----ing PP
 5  Something ----s something Adjective/Noun
 6  Something ----s Adjective/Noun
 7  Somebody ----s Adjective
 8  Somebody ----s something
 9  Somebody ----s somebody
10  Something ----s somebody
11  Something ----s something
12  Something ----s to somebody
13  Somebody ----s on something
14  Somebody ----s somebody something
15  Somebody ----s something to somebody
16  Somebody ----s something from somebody
17  Somebody ----s somebody with something
18  Somebody ----s somebody of something
19  Somebody ----s something on somebody
20  Somebody ----s somebody PP
21  Somebody ----s something PP
22  Somebody ----s PP
23  Somebody's (body part) ----s
24  Somebody ----s somebody to INFINITIVE
25  Somebody ----s somebody INFINITIVE
26  Somebody ----s that CLAUSE
27  Somebody ----s to somebody
28  Somebody ----s to INFINITIVE
29  Somebody ----s whether INFINITIVE
30  Somebody ----s somebody into V-ing something
31  Somebody ----s something with something
32  Somebody ----s INFINITIVE
33  Somebody ----s VERB-ing
34  It ----s that CLAUSE
35  Something ----s INFINITIVE
(Soar 2003 Tutorial)

English LCS lexicon
• Theta-grid information for verbs
• Derive ucat features
  – used to build syntactic structure
• Co-referenced with WordNet2.0
  – theta-grids are aligned with ucat features and word sense information

English LCS lexicon data
10.6.a#1#_ag_th,mod-poss(of)#exonerate#exonerate#exonerate#exonerate+ed# (2.0,00874318_exonerate%2:32:00::)

"10.6.a"
:NAME "Verbs of Possessional Deprivation: Cheat Verbs / -of"
WORDS (absolve acquit balk bereave bilk bleed burgle cheat cleanse con cull cure defraud denude deplete depopulate deprive despoil disabuse disarm disencumber dispossess divest drain ease exonerate fleece free gull milk mulct pardon plunder purge purify ransack relieve render rid rifle rob sap strip swindle unburden void wean)
THETA_ROLES ((1 "_ag_th,mod-poss()") (1 "_ag_th,mod-poss(from)") (1 "_ag_th,mod-poss(of)"))
SENTENCES "He !!+ed the people (of their rights); He !!+ed him of his sins"

Doing syntax with computers
• To do this you need a grammar
• So where do grammars come from?
  – Grammar engineering
    • Lovingly hand-crafted decades-long efforts by humans to write grammars (typically in some particular grammar formalism of interest to the linguists developing the grammar).
  – TreeBanks
    • Semi-automatically generated sets of parse trees for the sentences in some corpus. Typically in a generic lowest common denominator formalism (of no particular interest to any modern linguist).

TreeBanks
• TreeBanks provide a grammar (of a sort).
• Hence they provide the training data for various computer applications that use syntax
• But they can also provide useful data for more purely linguistic pursuits.
  – You might have a theory about whether or not something can happen in a particular language.
  – Or a theory about the contexts in which something can happen.
  – TreeBanks can give you the means to explore those theories, if you can formulate the questions in the right way and get the data you need.

A Penn Treebank sentence
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by) (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in) (NP (DT that) (NN market)))))))
    (. .)))

Equivalent representations
• PS tree (phrase-markers)
• Bracketed labeling
• Automaton
• F-structure

Bracketed labeling
[IP [NP [Det The] [N dog]] [VP [V barked] [PP [P at] [NP [Det the] [N boy]]]]].

An automaton
[Figure: an automaton]

F-structure
[Figure: an F-structure]

Time to be flexible!
• We have learned a way to diagram parse trees; it involves certain assumptions
• Not everybody agrees with all of these assumptions
• In fact, very few people agree on very many specifics at all
• Syntax resources reflect this diversity
• Hence the need to be flexible

[Figures: "flight" examples]

Classical grammar engineering
• Write rules with associated lexicon
  Rules            Lexicon
  S  → NP VP       NN  → interest
  NP → (DT) NN     NNS → rates
  NP → NN NNS      NNS → raises
  NP → NNP         VBP → interest
  VP → V NP        VBZ → rates
  – Simple 10 rule grammar: 592 parses for some ambiguous sentences
  – Real-size broad-coverage grammar: millions of parses for a complicated sentence

A simple grammar
  S  → NP VP    1.0      NP → NP PP        0.4
  VP → V NP     0.7      NP → astronomers  0.1
  VP → VP PP    0.3      NP → ears         0.18
  PP → P NP     1.0      NP → saw          0.04
  P  → with     1.0      NP → stars        0.18
  V  → saw      1.0      NP → telescope    0.1

Ambiguity
• Tree for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)
Local V/N ambiguities

Ambiguity
• Local ambiguity means that we have to deal with multiple plausible choices during the parsing process.
• Global ambiguity means that the grammar can't tell us which of several (many?) possible parses is the correct one.

Two possible PP attachments

Sample treebank parse

Sample treebank sentence

Sample NP rules

Example (11/2/2011, CSCI 5832 Spring 2006)

How many rules?

A sample parsed sentence

Not just newswire…

PP attachment ambiguity (German)

PP attachment in Chinese

Sample trees

Searching treebank corpora
• Online
  – The Treebank Tool Suite
  – The VISL website
  – The NCLT website
• Offline
  – Treebank corpus
  – Search utilities: tgrep, tregex, etc.

tgrep

TiGer Treebank
[Figure: TiGer tree for "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen." ("Next year the government wants to implement its reform plans.")]
• node labels: phrase categories (S, VP, PP, NP)
• edge labels: syntactic functions (HD, SB, OC, MO, OA, AC, NK)
• crossing branches for discontinuous constituency types
• annotation on word level: part-of-speech, morphology, lemmata
  Word         POS      Lemma
  Im           APPRART  in
  nächsten     ADJA     nahe
  Jahr         NN       Jahr
  will         VMFIN    wollen
  die          ART      die
  Regierung    NN       Regierung
  ihre         PPOSAT   ihr
  Reformpläne  NN       Plan
  umsetzen     VVINF    umsetzen
  .            $.
  Morphology tags are also shown in the figure (e.g. 3.Sg.Pres.Ind for will, Nom.Sg.Fem for Regierung, Acc.Pl.Masc for Reformpläne, Inf for umsetzen).

Parallel treebanks
• Translation training and studies
• Machine translation (MT) research & development

Aligning parses
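
To make the idea of searching a treebank concrete, here is a minimal Python sketch that reads the Penn Treebank sentence shown earlier and finds every NP that immediately dominates a PP, which is roughly what a tgrep or Tregex pattern such as NP < PP asks for. It is only an illustration under stated assumptions: the function names (tokenize, parse, subtrees, leaves, category, immediately_dominates) are hypothetical helpers written for this example, not part of tgrep or any treebank toolkit, and function tags such as -SBJ and -LOC are simply stripped before matching.

```python
import re

# The Penn Treebank sentence from the slides, re-indented for readability.
PTB_SENTENCE = """
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by) (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in) (NP (DT that) (NN market)))))))
    (. .)))
"""

def tokenize(text):
    """Split a bracketed tree into '(', ')' and label/word tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are sub-tuples (phrases) or plain strings (words).
    The outer Penn Treebank wrapper "( ... )" gets the empty label "".
    """
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()

def subtrees(tree):
    """Yield every phrase node in pre-order (top-down, left to right)."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def leaves(tree):
    """Return the words (including traces such as *) under a node."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def category(label):
    """Strip function tags: NP-SBJ -> NP, PP-LOC -> PP."""
    return label.split("-")[0]

def immediately_dominates(tree, parent, child):
    """Nodes of category `parent` that have a direct child of category `child`."""
    return [
        n for n in subtrees(tree)
        if category(n[0]) == parent
        and any(isinstance(c, tuple) and category(c[0]) == child for c in n[1])
    ]

tree = parse(tokenize(PTB_SENTENCE))
matches = immediately_dominates(tree, "NP", "PP")
for m in matches:
    print(" ".join(leaves(m)))
```

On this one sentence the query returns three NPs: "a round of similar increases by other lenders against Arizona real estate loans", the NP nested inside it starting at "similar increases", and "a continuing decline in that market". Real search utilities add many more relations (precedence, sisterhood, negation) and run over whole corpora, but the basic operation is the same kind of tree walk.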