What is Syntax?

Introduction to Syntax, with
Part-of-Speech Tagging
Owen Rambow
September 17 & 19
Admin Stuff
• These slides available at
o http://www.cs.columbia.edu/~rambow/teaching.html
• For Eliza in homework, you can use a tagger or
chunker, if you want – details at:
o http://www.cs.columbia.edu/~ani/cs4705.html
• Special office hours (Ani): today after class,
tomorrow at 10am in CEPSR 721
Statistical POS Tagging
• Want to choose most likely string of
tags (T), given the string of words (W)
• W = w1, w2, …, wn
• T = t1, t2, …, tn
• I.e., want argmaxT p(T | W)
• Problem: sparse data
Statistical POS Tagging (ctd)
• p(T|W) = p(T,W) / p(W)
= p(W|T) p(T) / p(W)
• argmaxT p(T|W)
= argmaxT p(W|T) p(T) / p(W)
= argmaxT p(W|T) p(T)   (p(W) does not depend on T)
Statistical POS Tagging (ctd)
p(T) = p(t1, t2, …, tn-1, tn)
= p(tn | t1, …, tn-1) p(t1, …, tn-1)
= p(tn | t1, …, tn-1) p(tn-1 | t1, …, tn-2) p(t1, …, tn-2)
= Π_i p(ti | t1, …, ti-1)
≈ Π_i p(ti | ti-2, ti-1)   (trigram approximation; in general, n-gram)
Statistical POS Tagging (ctd)
p(W|T) = p(w1, w2, …, wn | t1, t2, …, tn)
= Π_i p(wi | w1, …, wi-1, t1, t2, …, tn)
≈ Π_i p(wi | ti)
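The two approximations (trigram tag model and per-word emission) can be combined to score one candidate tag sequence. The following is a minimal sketch, not the course's implementation: the function name `log_score`, the padding tag `<s>`, and all probability values are made up for illustration. Logs are summed instead of multiplying probabilities, to avoid underflow on long sentences.

```python
import math

START = "<s>"  # hypothetical padding tag for the two positions before t1

def log_score(words, tags, emit, trans):
    """Return log of prod_i p(w_i | t_i) * p(t_i | t_{i-2}, t_{i-1})."""
    padded = [START, START] + list(tags)
    total = 0.0
    for i, (w, t) in enumerate(zip(words, tags)):
        t2, t1 = padded[i], padded[i + 1]  # the two preceding tags
        total += math.log(emit[(w, t)]) + math.log(trans[(t2, t1, t)])
    return total

# Toy, made-up probabilities (assumptions, not estimated from any corpus)
emit = {("the", "Det"): 0.5, ("boy", "N"): 0.1, ("likes", "V"): 0.2}
trans = {
    (START, START, "Det"): 0.6,
    (START, "Det", "N"): 0.9,
    ("Det", "N", "V"): 0.5,
}

score = log_score(["the", "boy", "likes"], ["Det", "N", "V"], emit, trans)
```

Choosing the best tagging then means maximizing this score over all candidate tag sequences.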
Statistical POS Tagging (ctd)
argmaxT p(T|W) = argmaxT p(W|T) p(T)
≈ argmaxT Π_i p(wi | ti) p(ti | ti-2, ti-1)
• Relatively easy to get data for parameter
estimation (next slide)
• But: need smoothing for unseen words
• Easy to determine the argmax (Viterbi
algorithm in time linear in sentence length)
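The Viterbi idea can be sketched as follows. To keep the example short, this is a simplification to a bigram model, p(ti | ti-1), rather than the trigram model on the slide; a trigram version would use pairs of tags as states. The function name `viterbi`, the dictionary layout, and all probabilities are assumptions for illustration, and there is no smoothing (missing entries count as probability 0).

```python
import math

def viterbi(words, tags, emit, trans, start):
    """Most likely tag sequence under a bigram HMM (simplified sketch).

    emit[(word, tag)]  = p(word | tag)
    trans[(prev, tag)] = p(tag | prev)
    start[tag]         = p(tag) for the first position
    """
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of tagging words[:i+1] so that it ends in t
    best = [{t: lg(start.get(t, 0)) + lg(emit.get((words[0], t), 0)) for t in tags}]
    back = []  # back[i][t]: best previous tag for tag t at position i+1
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + lg(trans.get((p, t), 0)))
            scores[t] = best[-1][prev] + lg(trans.get((prev, t), 0)) + lg(emit.get((w, t), 0))
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy, made-up model
tags = ["Det", "N", "V"]
start = {"Det": 0.8, "N": 0.1, "V": 0.1}
trans = {("Det", "N"): 0.9, ("Det", "V"): 0.1, ("N", "V"): 0.6, ("N", "N"): 0.3,
         ("N", "Det"): 0.1, ("V", "Det"): 0.6, ("V", "N"): 0.4}
emit = {("the", "Det"): 0.5, ("a", "Det"): 0.5, ("boy", "N"): 0.4,
        ("girl", "N"): 0.4, ("likes", "V"): 0.5, ("likes", "N"): 0.2}

result = viterbi("the boy likes a girl".split(), tags, emit, trans, start)
```

The outer loop runs once per word, which gives the linear dependence on sentence length; the work per word is quadratic in the number of tags.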
Probability Estimation
for trigram POS Tagging
Maximum-Likelihood Estimation
• p’ ( wi | ti ) = c( wi, ti ) / c( ti )
• p’ ( ti | ti-2, ti-1 ) = c( ti-2, ti-1, ti ) / c( ti-2, ti-1 )
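The two count ratios can be computed directly from a tagged corpus. A minimal sketch, assuming a corpus given as lists of (word, tag) pairs; the function name `estimate`, the padding tag `<s>`, and the two-sentence corpus are hypothetical. The context count c(ti-2, ti-1) is taken only over positions that have a following tag, which is one possible convention.

```python
from collections import Counter

START = "<s>"  # hypothetical padding tag for sentence-initial contexts

def estimate(tagged_sentences):
    """Maximum-likelihood estimates for p'(w|t) and p'(t | t-2, t-1)."""
    tag_count = Counter()       # c(t_i)
    word_tag_count = Counter()  # c(w_i, t_i)
    context_count = Counter()   # c(t_{i-2}, t_{i-1})
    trigram_count = Counter()   # c(t_{i-2}, t_{i-1}, t_i)
    for sentence in tagged_sentences:
        for w, t in sentence:
            tag_count[t] += 1
            word_tag_count[(w, t)] += 1
        tags = [START, START] + [t for _, t in sentence]
        for i in range(2, len(tags)):
            context_count[(tags[i - 2], tags[i - 1])] += 1
            trigram_count[(tags[i - 2], tags[i - 1], tags[i])] += 1
    emit = {(w, t): c / tag_count[t] for (w, t), c in word_tag_count.items()}
    trans = {tri: c / context_count[tri[:2]] for tri, c in trigram_count.items()}
    return emit, trans

# Tiny made-up corpus
corpus = [
    [("the", "Det"), ("boy", "N"), ("likes", "V"), ("a", "Det"), ("girl", "N")],
    [("a", "Det"), ("girl", "N"), ("sees", "V"), ("the", "Det"), ("boy", "N")],
]
emit, trans = estimate(corpus)
```

Any (word, tag) pair or tag trigram absent from the corpus gets no entry at all here, which is exactly the case smoothing has to handle.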
Statistical POS Tagging
• Method common to many tasks in
speech & NLP
• “Noisy Channel Model”, Hidden Markov
Model
Back to Syntax
• (((the/Det) boy/N) likes/V ((a/Det) girl/N))
Phrase-structure tree:
[S [NP [DetP the] boy] likes [NP [DetP a] girl]]
• Nonterminal symbols = constituents
• Terminal symbols = words
Phrase Structure and
Dependency Structure
Phrase structure:
[S [NP [DetP the] boy] likes [NP [DetP a] girl]]
Dependency structure:
likes/V
  boy/N
    the/Det
  girl/N
    a/Det
Types of Dependency
likes/V
  Adj(unct): sometimes/Adv
  Subj: boy/N
    Fw: the/Det
    Adj: small/Adj
      Adj: very/Adv
  Obj: girl/N
    Fw: a/Det
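One reading of the labeled dependency diagram above can be written down as a list of (dependent, relation, head) triples. This is a sketch of a possible data structure only; the triple layout and the helper names `root` and `arguments` are assumptions, and the attachment of each word follows one interpretation of the diagram.

```python
from collections import defaultdict

# (dependent, relation, head); "Adj" is the slide's abbreviation for adjunct
arcs = [
    ("sometimes/Adv", "Adj", "likes/V"),
    ("boy/N", "Subj", "likes/V"),
    ("the/Det", "Fw", "boy/N"),
    ("small/Adj", "Adj", "boy/N"),
    ("very/Adv", "Adj", "small/Adj"),
    ("girl/N", "Obj", "likes/V"),
    ("a/Det", "Fw", "girl/N"),
]

def dependents(arcs):
    """Map each head to its list of (relation, dependent) pairs."""
    by_head = defaultdict(list)
    for dep, rel, head in arcs:
        by_head[head].append((rel, dep))
    return by_head

def root(arcs):
    """The one word that is a head but never a dependent."""
    heads = {head for _, _, head in arcs}
    deps = {dep for dep, _, _ in arcs}
    (r,) = heads - deps
    return r

def arguments(arcs, head):
    """Dependents of `head` that are arguments (here: Subj or Obj)."""
    return [(rel, dep) for rel, dep in dependents(arcs)[head] if rel in {"Subj", "Obj"}]
```

For example, `root(arcs)` picks out the verb, and `arguments(arcs, "likes/V")` separates the subject and object from the adjunct.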
Grammatical Relations
• Types of relations between words
o Arguments: subject, object, indirect object,
prepositional object
o Adjuncts: temporal, locative, causal,
manner, …
o Function Words
Subcategorization
• List of arguments of a word (typically, a
verb), with features about realization
(POS, perhaps case, verb form etc)
• In canonical order Subject-Object-IndObj
• Example:
o like: N-N, N-V(to-inf)
o see: N, N-N, N-N-V(inf)
• Note: J&M talk about subcategorization
only within VP
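The subcategorization frames in the example can be stored as a small lexicon. A minimal sketch, assuming each frame is a tuple of argument realizations in canonical order with the subject first; the names `SUBCAT` and `licenses` are hypothetical.

```python
# Frames copied from the example: each tuple lists the arguments in
# canonical order (subject first), with their realization.
SUBCAT = {
    "like": [("N", "N"), ("N", "V(to-inf)")],
    "see": [("N",), ("N", "N"), ("N", "N", "V(inf)")],
}

def licenses(verb, realized):
    """True if the verb has a frame matching the realized arguments."""
    return tuple(realized) in SUBCAT.get(verb, [])
```

Under this lexicon, `licenses("see", ["N"])` holds (subject only), while `licenses("like", ["N"])` does not, because every listed frame for "like" has a second argument.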
Where is the VP?
Without VP:
[S [NP [DetP the] boy] likes [NP [DetP a] girl]]
With VP:
[S [NP [DetP the] boy] [VP likes [NP [DetP a] girl]]]
Where is the VP?
• Existence of VP is a linguistic (empirical)
claim, not a methodological claim
• Semantic evidence???
• Syntactic evidence
o VP-fronting (and quickly clean the carpet he did!)
o VP-ellipsis (He cleaned the carpets quickly, and so
did she)
o Can have adjuncts before and after VP, but not in
VP (He often eats beans, *he eats often beans)
• Note: in all right-branching structures, issue is
different again
Penn Treebank, Again
• Syntactically annotated corpus (phrase
structure)
• PTB is not naturally occurring data!
• Represents a particular linguistic theory (but
a fairly “vanilla” one)
• Particularities
o Very indirect representation of grammatical
relations (need for head percolation tables)
o Completely flat structure in NP (brown bag lunch,
pink-and-yellow child seat)
o Has flat Ss, flat VPs
Context-Free Grammars
• Defined in formal language theory
(comp sci)
• Terminals, nonterminals, start symbol,
rules
• String-rewriting system
• Start with start symbol, rewrite using
rules, done when only terminals left
CFG: Example
• Rules
o S → NP VP
o VP → V NP
o NP → Det N | AdjP NP
o AdjP → Adj | Adv AdjP
o N → boy | girl
o V → sees | likes
o Adj → big | small
o Adv → very
o Det → a | the
the very small boy likes a girl
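The rules can be tried out with a small top-down parser. This is a sketch only: the names `GRAMMAR`, `parse`, `expand`, and `trees` are hypothetical, the rules are copied exactly as listed, and the approach relies on this grammar having no left-recursive rules (otherwise it would not terminate).

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["Det", "N"], ["AdjP", "NP"]],
    "AdjP": [["Adj"], ["Adv", "AdjP"]],
    "N": [["boy"], ["girl"]],
    "V": [["sees"], ["likes"]],
    "Adj": [["big"], ["small"]],
    "Adv": [["very"]],
    "Det": [["a"], ["the"]],
}

def parse(symbol, words, start):
    """Yield (tree, end) for every way `symbol` derives words[start:end]."""
    if symbol not in GRAMMAR:  # terminal: must match the next word
        if start < len(words) and words[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, words, start, [])

def expand(symbol, rhs, words, pos, children):
    """Match the remaining right-hand-side symbols left to right."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], words, pos):
        yield from expand(symbol, rhs[1:], words, nxt, children + [child])

def trees(sentence):
    """All derivation trees for the sentence, rooted in the start symbol S."""
    words = sentence.split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]
```

Each returned tree is a nested tuple such as `("NP", ("Det", "the"), ("N", "boy"))`. Note that with the rules exactly as listed, AdjP can only attach outside the determiner (NP → AdjP NP), so this sketch finds a tree for `very small the boy likes a girl` but none for `the very small boy likes a girl`.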
Derivations of CFGs
• String rewriting system: we derive a
string (=derived structure)
• But derivation history represented by
phrase-structure tree (=derivation
structure)!
Grammar Equivalence and
Normal Form
• Can have different grammars that
generate same set of strings (weak
equivalence)
• Can have different grammars that have
same set of derivation trees (strong
equivalence)
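A toy pair of grammars illustrates the difference. Both generate the strings a, a a, a a a, …, so they generate the same set of strings, but one builds right-branching trees and the other left-branching trees, so their derivation trees differ. The sketch below only checks the string sets up to a length bound; the names `language`, `RIGHT`, and `LEFT` are hypothetical, and the enumeration assumes no rule has an empty right-hand side.

```python
def language(grammar, start, max_len):
    """All terminal strings of length <= max_len derivable from `start`.

    Assumes no empty right-hand sides, so sentential forms never shrink
    and any form longer than max_len can be discarded.
    """
    done = set()
    seen = {(start,)}
    agenda = [(start,)]
    while agenda:
        form = agenda.pop()
        # Index of the leftmost nonterminal, if any
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            done.add(" ".join(form))
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                agenda.append(new)
    return done

RIGHT = {"S": [["a", "S"], ["a"]]}  # right-branching trees
LEFT = {"S": [["S", "a"], ["a"]]}   # left-branching trees

same_strings = language(RIGHT, "S", 4) == language(LEFT, "S", 4)
```

A bounded check like this can only show agreement up to the chosen length; it is evidence for, not a proof of, the two grammars generating the same string set.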
Nobody Uses CFGs Only
(Except Intro NLP Courses)
o All major syntactic theories (Chomsky, LFG,
HPSG, TAG-based theories) represent both
phrase structure and dependency, in one
way or another
o All successful parsers currently use
statistics about phrase structure and about
dependency
Massive Ambiguity of Syntax
• For a standard sentence, and a
grammar with wide coverage, there are
1000s of derivations!
• Example:
o The large head master told the man that
he gave money and shares in a letter on
Wednesday
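A rough feel for how fast the number of analyses can grow: the sketch below counts the distinct binary-branching tree shapes over n words (these counts are the Catalan numbers). This is an illustration of the growth rate only, not a count of derivations for the example sentence; a real grammar licenses only some of these shapes, and the function name `binary_trees` is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def binary_trees(n):
    """Number of distinct binary-branching tree shapes over n leaves."""
    if n == 1:
        return 1
    # Split the n leaves into a left part of k leaves and a right part of n - k
    return sum(binary_trees(k) * binary_trees(n - k) for k in range(1, n))
```

Already for ten words there are 4862 such shapes, so a wide-coverage grammar that allows even a small fraction of them ends up with a large number of derivations.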
Some Syntactic Constructions:
Wh-Movement
Control
Raising