Introduction to Syntax, with Part-of-Speech Tagging
Owen Rambow
September 17 & 19

Admin Stuff
• These slides are available at
  o http://www.cs.columbia.edu/~rambow/teaching.html
• For Eliza in the homework, you can use a tagger or chunker if you want – details at:
  o http://www.cs.columbia.edu/~ani/cs4705.html
• Special office hours (Ani): today after class, tomorrow at 10am in CEPSR 721

Statistical POS Tagging
• Want to choose the most likely string of tags (T), given the string of words (W)
• W = w_1, w_2, …, w_n
• T = t_1, t_2, …, t_n
• I.e., want argmax_T p(T | W)
• Problem: sparse data

Statistical POS Tagging (ctd)
• p(T|W) = p(T,W) / p(W) = p(W|T) p(T) / p(W)
• argmax_T p(T|W) = argmax_T p(W|T) p(T) / p(W) = argmax_T p(W|T) p(T)
  (p(W) does not depend on T, so it can be dropped from the argmax)

Statistical POS Tagging (ctd)
p(T) = p(t_1, t_2, …, t_{n-1}, t_n)
     = p(t_n | t_1, …, t_{n-1}) p(t_1, …, t_{n-1})
     = p(t_n | t_1, …, t_{n-1}) p(t_{n-1} | t_1, …, t_{n-2}) p(t_1, …, t_{n-2})
     = ∏_i p(t_i | t_1, …, t_{i-1})
     ≈ ∏_i p(t_i | t_{i-2}, t_{i-1})     ← trigram (n-gram) approximation

Statistical POS Tagging (ctd)
p(W|T) = p(w_1, w_2, …, w_n | t_1, t_2, …, t_n)
       = ∏_i p(w_i | w_1, …, w_{i-1}, t_1, t_2, …, t_n)
       ≈ ∏_i p(w_i | t_i)

Statistical POS Tagging (ctd)
argmax_T p(T|W) = argmax_T p(W|T) p(T)
                ≈ argmax_T ∏_i p(w_i | t_i) p(t_i | t_{i-2}, t_{i-1})
• Relatively easy to get data for parameter estimation (next slide)
• But: need smoothing for unseen words
• Easy to determine the argmax (Viterbi algorithm, in time linear in sentence length)

Probability Estimation for Trigram POS Tagging
• Maximum-likelihood estimation:
  o p'(w_i | t_i) = c(w_i, t_i) / c(t_i)
  o p'(t_i | t_{i-2}, t_{i-1}) = c(t_{i-2}, t_{i-1}, t_i) / c(t_{i-2}, t_{i-1})
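The estimation and decoding steps above fit in a short program. Below is a minimal sketch, assuming a toy training corpus given as lists of (word, tag) pairs; the names train and viterbi and the <s> padding convention are illustrative choices, not from the lecture, and the smoothing the slides call for is omitted, so an unseen word zeroes out every path.

```python
from collections import defaultdict

START = "<s>"  # hypothetical padding tag so the first words have a trigram history

def train(tagged_sentences):
    """MLE from the slide: p'(w|t) = c(w,t)/c(t), p'(t|u,v) = c(u,v,t)/c(u,v)."""
    emit_c, tag_c = defaultdict(int), defaultdict(int)
    tri_c, bi_c = defaultdict(int), defaultdict(int)
    for sent in tagged_sentences:                 # sent = [(word, tag), ...]
        tags = [START, START] + [t for _, t in sent]
        for w, t in sent:
            emit_c[(w, t)] += 1
            tag_c[t] += 1
        for u, v, t in zip(tags, tags[1:], tags[2:]):
            tri_c[(u, v, t)] += 1
            bi_c[(u, v)] += 1
    emit = {(w, t): n / tag_c[t] for (w, t), n in emit_c.items()}
    trans = {(u, v, t): n / bi_c[(u, v)] for (u, v, t), n in tri_c.items()}
    return emit, trans, sorted(tag_c)

def viterbi(words, emit, trans, tagset):
    """argmax_T of prod_i p(w_i|t_i) p(t_i|t_{i-2},t_{i-1}); linear in len(words)."""
    pi = {(START, START): 1.0}   # pi[(u, v)] = best score of a path ending in u, v
    back = []
    for w in words:
        new_pi, bp = {}, {}
        for (u, v), score in pi.items():
            for t in tagset:
                # no smoothing: an unseen (word, tag) or tag trigram scores 0
                q = score * trans.get((u, v, t), 0.0) * emit.get((w, t), 0.0)
                if q > new_pi.get((v, t), 0.0):
                    new_pi[(v, t)] = q
                    bp[(v, t)] = u          # remember the tag two positions back
        back.append(bp)
        pi = new_pi
    tags = list(max(pi, key=pi.get))        # best final tag pair
    for i in range(len(words) - 1, 1, -1):  # follow back-pointers right to left
        tags.insert(0, back[i][(tags[0], tags[1])])
    return tags[-len(words):]               # strip <s> padding on short inputs
```

Representing the decoder state as the tag pair (t_{i-1}, t_i) turns the trigram dependency into a first-order Markov chain, which is what makes the runtime linear in sentence length, as claimed above.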
Statistical POS Tagging
• Method common to many tasks in speech & NLP
• "Noisy Channel Model", Hidden Markov Model

Back to Syntax
• (((the/Det) boy/N) likes/V ((a/Det) girl/N))
[Phrase-structure tree: (S (NP (DetP the) boy) likes (NP (DetP a) girl)); nonterminal symbols = constituents, terminal symbols = words]

Phrase Structure and Dependency Structure
[Phrase-structure tree (S (NP (DetP the) boy) likes (NP (DetP a) girl)) side by side with the dependency tree: likes/V dominates boy/N and girl/N; boy/N dominates the/Det; girl/N dominates a/Det]

Types of Dependency
[Dependency tree with labeled arcs: likes/V –Subj→ boy/N, –Obj→ girl/N, –Adj(unct)→ sometimes/Adv; boy/N –Fw→ the/Det, –Adj→ small/Adj; small/Adj –Adj→ very/Adv; girl/N –Fw→ a/Det]

Grammatical Relations
• Types of relations between words
  o Arguments: subject, object, indirect object, prepositional object
  o Adjuncts: temporal, locative, causal, manner, …
  o Function words

Subcategorization
• List of the arguments of a word (typically, a verb), with features about their realization (POS, perhaps case, verb form, etc.)
• In canonical order: Subject-Object-IndObj
• Example:
  o like: N-N, N-V(to-inf)
  o see: N, N-N, N-N-V(inf)
• Note: J&M talk about subcategorization only within the VP

Where is the VP?
[Two candidate trees: flat (S (NP (DetP the) boy) likes (NP (DetP a) girl)) vs. (S (NP (DetP the) boy) (VP likes (NP (DetP a) girl)))]

Where is the VP?
• The existence of VP is a linguistic (empirical) claim, not a methodological claim
• Semantic evidence???
• Syntactic evidence:
  o VP-fronting (and quickly clean the carpet he did!)
  o VP-ellipsis (He cleaned the carpets quickly, and so did she)
  o Can have adjuncts before and after the VP, but not inside it (He often eats beans, *He eats often beans)
• Note: in all right-branching structures, the issue is different again

Penn Treebank, Again
• Syntactically annotated corpus (phrase structure)
• The PTB is not naturally occurring data!
• Represents a particular linguistic theory (but a fairly "vanilla" one)
• Particularities:
  o Very indirect representation of grammatical relations (need for head percolation tables)
  o Completely flat structure in the NP (brown bag lunch, pink-and-yellow child seat)
  o Has flat Ss, flat VPs

Context-Free Grammars
• Defined in formal language theory (comp sci)
• Terminals, nonterminals, start symbol, rules
• String-rewriting system
• Start with the start symbol, rewrite using the rules, done when only terminals are left

CFG: Example
• Rules:
  o S → NP VP
  o VP → V NP
  o NP → Det N | AdjP NP
  o AdjP → Adj | Adv AdjP
  o N → boy | girl
  o V → sees | likes
  o Adj → big | small
  o Adv → very
  o Det → a | the
• Example string: the very small boy likes a girl

Derivations of CFGs
• String-rewriting system: we derive a string (= derived structure)
• But the derivation history is represented by a phrase-structure tree (= derivation structure)!

Grammar Equivalence and Normal Form
• Can have different grammars that generate the same set of strings (weak equivalence)
• Can have different grammars that have the same set of derivation trees (strong equivalence)

Nobody Uses CFGs Only (Except Intro NLP Courses)
• All major syntactic theories (Chomsky, LFG, HPSG, TAG-based theories) represent both phrase structure and dependency, in one way or another
• All successful parsers currently use statistics about phrase structure and about dependency

Massive Ambiguity of Syntax
• For a standard sentence, and a grammar with wide coverage, there are 1000s of derivations!
• Example:
  o The large head master told the man that he gave money and shares in a letter on Wednesday

Some Syntactic Constructions
• Wh-movement
• Control
• Raising
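To connect the CFG slides back to code, here is a minimal sketch of the string-rewriting view: a naive top-down recognizer for the toy grammar on the CFG: Example slide above. The dict encoding and the names spans and recognize are illustrative choices, not from the lecture; note that the rules as transcribed never combine Det with an AdjP, so the demonstration uses a shorter sentence than the slide's example string.

```python
# Toy grammar from the "CFG: Example" slide, encoded as a dict mapping each
# nonterminal to its alternative right-hand sides.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "VP":   [["V", "NP"]],
    "NP":   [["Det", "N"], ["AdjP", "NP"]],
    "AdjP": [["Adj"], ["Adv", "AdjP"]],
    "N":    [["boy"], ["girl"]],
    "V":    [["sees"], ["likes"]],
    "Adj":  [["big"], ["small"]],
    "Adv":  [["very"]],
    "Det":  [["a"], ["the"]],
}

def spans(symbol, words, i):
    """All end positions j such that symbol rewrites to words[i:j]."""
    if symbol not in GRAMMAR:  # terminal: must match the next word
        return {i + 1} if i < len(words) and words[i] == symbol else set()
    ends = set()
    for rhs in GRAMMAR[symbol]:
        frontier = {i}
        for sym in rhs:        # rewrite the right-hand side left to right
            frontier = {j for k in frontier for j in spans(sym, words, k)}
        ends |= frontier
    return ends

def recognize(sentence):
    """True iff the start symbol S derives the whole sentence."""
    words = sentence.split()
    return len(words) in spans("S", words, 0)

print(recognize("the boy likes a girl"))   # True
print(recognize("boy the likes a girl"))   # False
```

Each call to spans rewrites one symbol, so a successful recognition traces exactly the kind of derivation described on the Derivations of CFGs slide, with the recursion tree playing the role of the phrase-structure tree.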