
Corpora and Statistical Methods
Lecture 11
Albert Gatt
Part 1
Probabilistic Context-Free Grammars and beyond
Context-free grammars: reminder
 Many NLP parsing applications rely on the CFG formalism
 Definition:
 A CFG is a 4-tuple (N, Σ, P, S):
 N = a set of non-terminal symbols (e.g. NP, VP)
 Σ = a set of terminals (e.g. words)

N and Σ are disjoint
 P = a set of productions of the form A → β
 A ∈ N
 β ∈ (N ∪ Σ)* (any string of terminals and non-terminals)
 S = a designated start symbol (usually, “sentence”)
CFG Example
 S → NP VP
 S → Aux NP VP
 NP → Det Nom
 NP → Proper-Noun
 Det → that | the | a
 …
Probabilistic CFGs
 A CFG where each production has an associated probability
 A PCFG is a 5-tuple (N, Σ, P, S, D):
 D : P → [0,1] is a function assigning a probability to each rule in P
 usually, probabilities are obtained from a corpus
 most widely used corpus is the Penn Treebank
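As an illustration (assuming NLTK is available; the grammar and the probabilities below are invented for the example, not estimated from the Penn Treebank), a PCFG can be written down by attaching a probability to every rule:

```python
import nltk

# Toy PCFG: every rule carries a probability, and for each non-terminal
# the probabilities of its rules sum to 1.
grammar = nltk.PCFG.fromstring("""
    S   -> NP VP    [1.0]
    NP  -> Det N    [0.6]
    NP  -> 'Matt'   [0.4]
    VP  -> V NP     [0.7]
    VP  -> V        [0.3]
    Det -> 'the'    [1.0]
    N   -> 'dog'    [1.0]
    V   -> 'walks'  [1.0]
""")
print(grammar)
```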
The Penn Treebank
 English sentences annotated with syntax trees
 built at the University of Pennsylvania
 40,000 sentences, about a million words
 text from the Wall Street Journal
 Other treebanks exist for other languages (e.g. NEGRA for
German)
Example tree: building a tree with rules
 Penn Treebank-style parse for “Mr Vinken is chairman of Elsevier”:
 (S (NP (NNP Mr) (NNP Vinken))
    (VP (VBZ is)
        (NP (NP (NN chairman))
            (PP (IN of)
                (NP (NNP Elsevier))))))
 Rules applied in the derivation include:
 S → NP VP
 NP → NNP NNP
 NNP → Mr
 NNP → Vinken
 …
Characteristics of PCFGs
 In a PCFG, the probability P(A → β) expresses the likelihood that the non-
terminal A will expand as β.
 e.g. the likelihood that S → NP VP
 (as opposed to S → VP, or S → NP VP PP, or …)
 can be interpreted as a conditional probability:
 probability of the expansion, given the LHS non-terminal
 P(A → β) = P(A → β | A)
 Therefore, for any non-terminal A, the probabilities of all rules of the form A →
β must sum to 1
 If this is the case, we say the PCFG is consistent
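A quick sketch of that normalisation check, on a hypothetical list of (LHS, RHS, probability) triples (the numbers are invented):

```python
from collections import defaultdict

# For every non-terminal A, the probabilities of all rules A -> beta
# should sum to 1.
rules = [
    ("S",  ("NP", "VP"),        0.8),
    ("S",  ("Aux", "NP", "VP"), 0.2),
    ("NP", ("Det", "Nom"),      0.6),
    ("NP", ("Proper-Noun",),    0.4),
]

totals = defaultdict(float)
for lhs, rhs, prob in rules:
    totals[lhs] += prob

for lhs, total in totals.items():
    status = "OK" if abs(total - 1.0) < 1e-9 else "not normalised"
    print(lhs, total, status)
```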
Uses of probabilities in parsing
 Disambiguation: given n legal parses of a string, which is the most likely?
 e.g. PP-attachment ambiguity can be resolved this way
 Speed: parsing is a search problem
 search through space of possible applicable derivations
 search space can be pruned by focusing on the most likely sub-parses of a
parse
 A parser can also be used as a language model, to determine the probability of a
sentence
 typical use in speech recognition, where input utterance can be “heard” as
several possible sentences
Using PCFG probabilities
 PCFG assigns a probability to every parse-tree t of a string W
 e.g. every possible parse (derivation) of a sentence recognised by the
grammar
 Notation:
 G = a PCFG
 s = a sentence
 t = a particular tree under our grammar
 t consists of several nodes n
 each node is generated by applying some rule r
Probability of a tree vs. a sentence
P(t, s) = \prod_{n \in t} p(r(n)) = P(t)
 simply the product of the probabilities of every rule (node)
that gives rise to t (i.e. the derivation of t)
 this is both the joint probability of t and s, and the probability of t
alone
 why is P(t,s) = P(t)?
P(t, s) = P(t) \cdot P(s \mid t)
 but P(s|t) must be 1, since the tree t is a parse of all the
words of s
P(t, s) = P(t) \cdot 1 = P(t)
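A minimal sketch of this product, using hypothetical rule probabilities (the numbers are invented) and trees encoded as nested tuples:

```python
# P(t) is the product of the probabilities of the rules applied at each node of t.
rule_probs = {
    ("S", ("NP", "VP")):   0.8,
    ("NP", ("NNP",)):      0.3,
    ("VP", ("VBZ",)):      0.2,
    ("NNP", ("Matt",)):    0.01,
    ("VBZ", ("walks",)):   0.05,
}

def tree_prob(tree):
    """tree is a nested tuple (label, child, ...); leaves are plain strings."""
    if isinstance(tree, str):              # terminals contribute no rule
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_probs[(label, rhs)]           # probability of expanding this node
    for child in children:
        p *= tree_prob(child)              # multiply over all nodes of the tree
    return p

t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBZ", "walks")))
print(tree_prob(t))   # 0.8 * 0.3 * 0.2 * 0.01 * 0.05
```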
Picking the best parse in a PCFG
 A sentence will usually have several parses
 we usually want them ranked, or only want the n-best parses
 we need to focus on P(t|s,G)
 probability of a parse, given our sentence and our grammar
 definition of the best parse for s:
\hat{t} = \arg\max_{t} P(t \mid s, G)
Picking the best parse in a PCFG
 Problem: t can have multiple derivations
 e.g. expand left-corner nodes first, expand right-corner nodes first
etc
 so P(t|s,G) should be estimated by summing over all possible
derivations
 Fortunately, derivation order makes no difference to the final
probabilities.
 can assume a “canonical derivation” d of t
 P(t) =def P(d)
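For example, NLTK's ViterbiParser returns the single most probable tree for a sentence under a PCFG (a sketch, assuming NLTK is installed; the toy grammar is invented):

```python
import nltk

# Toy grammar; every non-terminal's rule probabilities sum to 1.
grammar = nltk.PCFG.fromstring("""
    S  -> NP VP    [1.0]
    NP -> 'Matt'   [1.0]
    VP -> V        [1.0]
    V  -> 'walks'  [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse(["Matt", "walks"]):
    print(tree, tree.prob())   # the most probable tree and its probability
```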
Probability of a sentence
 Simply the sum of probabilities of all parses of that sentence
 since s is only a sentence if it’s recognised by G, i.e. if there is some t
for s under G
P(s) = \sum_{t} P(s, t) = \sum_{\{t :\, yield(t) = s\}} p(t)
 i.e. summing over all those trees which “yield” s
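In code, given some parser that enumerates every tree for a sentence (parses_of is a hypothetical function here), P(s) is just the sum of the tree probabilities, e.g. reusing tree_prob from the sketch above:

```python
# P(s) = sum of P(t) over all trees t whose yield is s.
def sentence_prob(sentence, parses_of):
    return sum(tree_prob(t) for t in parses_of(sentence))
```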
Flaws I: Structural independence
 Probability of a rule r expanding node n depends only on n.
 Independent of other non-terminals
 Example:
 P(NP → Pro) is independent of where the NP is in the sentence
 but we know that NP → Pro is much more likely in subject position
 Francis et al (1999) using the Switchboard corpus:
 91% of subjects are pronouns;
 only 34% of objects are pronouns
Flaws II: lexical independence
 vanilla PCFGs ignore lexical material
 e.g. P(VP → V NP PP) independent of the head of NP or PP or
lexical head V
 Examples:
 prepositional phrase attachment preferences depend on lexical items;
cf:
 dump [sacks into a bin]
 dump [sacks] [into a bin] (preferred parse)
 coordination ambiguity:
 [dogs in houses] and [cats]
 [dogs] [in houses and cats]
Weakening the independence assumptions in PCFGs
Lexicalised PCFGs
 Attempt to weaken the lexical independence assumption.
 Most common technique:
 mark each phrasal node with the lexical material of its head (N, V, etc.)
 this is based on the idea that the most crucial lexical
dependencies are between head and dependent
 E.g.: Charniak 1997, Collins 1999
Lexicalised PCFGs: Matt walks
 Makes probabilities partly dependent on lexical content.
 P(VP → VBD | VP) becomes:
 P(VP → VBD | VP, h(VP) = walk)
 NB: normally, we can’t assume that all heads
of a phrase of category C are equally
probable.
 Lexicalised tree:
 (S(walks) (NP(Matt) (NNP(Matt) Matt))
           (VP(walk) (VBD(walk) walks)))
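A sketch of the head-marking step on the nested-tuple trees used earlier: each non-terminal label is rewritten as label(headword), with the head word percolated up from a designated head child. The head_child table is a toy stand-in for real head-finding rules (e.g. those of Collins 1999), and the surface form walks is used as the head rather than the lemma shown on the slide.

```python
# Hypothetical head rules: which child of each rule is the head.
head_child = {
    ("S",  ("NP", "VP")): 1,   # the VP heads S -> NP VP
    ("NP", ("NNP",)):     0,
    ("VP", ("VBD",)):     0,
}

def lexicalise(tree):
    """tree: (label, child, ...) with string leaves.
    Returns (lexicalised tree, head word)."""
    if isinstance(tree, str):
        return tree, tree                           # a word is its own head
    label, *children = tree
    new_children, heads = zip(*(lexicalise(c) for c in children))
    if len(children) == 1 and isinstance(children[0], str):
        head = heads[0]                             # pre-terminal: the word itself
    else:
        rhs = tuple(c[0] for c in children)
        head = heads[head_child[(label, rhs)]]      # head comes from the head child
    return (f"{label}({head})", *new_children), head

tree = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBD", "walks")))
print(lexicalise(tree)[0])
# ('S(walks)', ('NP(Matt)', ('NNP(Matt)', 'Matt')), ('VP(walks)', ('VBD(walks)', 'walks')))
```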
Practical problems for lexicalised PCFGs
 data sparseness: we don’t necessarily see all heads of all
phrasal categories often enough in the training data
 flawed assumptions: lexical dependencies occur elsewhere,
not just between head and complement
 I got the easier problem of the two to solve
 of the two and to solve become more likely because of the pre-head modifier
easier
Structural context
 The simple way: calculate p(t|s,G) based on rules in the
canonical derivation d of t
 assumes that p(t) is independent of the derivation
 could condition on more structural context
 but then we could lose the notion of a canonical derivation, i.e.
P(t) could really depend on the derivation!
Structural context: probability of a
derivation history
 How to calculate P(t) based on a derivation d?
 Observation:
P(d) = P(S \xrightarrow{r_1} \alpha_1 \xrightarrow{r_2} \alpha_2 \xrightarrow{r_3} \ldots \xrightarrow{r_m} s)
 (probability that a sequence of m rewrite rules in a derivation yields s)
 can use the chain rule for multiplication:
P(d) = \prod_{i=1}^{m} P(r_i \mid r_1, \ldots, r_{i-1})
Approach 2: parent annotation
 Annotate each node with its parent in the parse tree.
 E.g. if NP has parent S, then rename NP to NP^S
 Can partly account for dependencies such as subject-of
 (NP^S is a subject, NP^VP is an object)
 Example tree for “Matt walks”:
 (S (NP^S (NNP^NP Matt))
    (VP^S (VBD^VP walks)))
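A sketch of the parent-annotation transform on the same nested-tuple trees (an illustration, not the lecture's own code):

```python
# Append "^parent" to every non-terminal label except the root.
def parent_annotate(tree, parent=None):
    if isinstance(tree, str):
        return tree                                  # leave the words unchanged
    label, *children = tree
    new_label = f"{label}^{parent}" if parent else label
    return (new_label, *(parent_annotate(c, label) for c in children))

t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBD", "walks")))
print(parent_annotate(t))
# ('S', ('NP^S', ('NNP^NP', 'Matt')), ('VP^S', ('VBD^VP', 'walks')))
```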
The main point
 Different parsing approaches differ mainly in what they
condition their probabilities on
Other grammar formalisms
Phrase structure vs. Dependency grammar
 PCFGs are in the tradition of phrase-structure grammars
 Dependency grammar describes syntax in terms of
dependencies between words
 no non-terminals or phrasal nodes
 only lexical nodes with links between them
 links are labelled, labels from a finite list
Dependency Grammar
 Example: dependency structure for “I gave him my address”:
 <ROOT> --main--> GAVE
 GAVE --subj:--> I
 GAVE --dat:--> him
 GAVE --obj:--> address
 address --attr:--> MY
Dependency grammar
 Often used now in probabilistic parsing
 Advantages:
 directly encode lexical dependencies
 therefore, disambiguation decisions take lexical material into account
directly
 dependencies are a way of decomposing phrase-structure rules (PSRs) and their probability
estimates
 estimating probability of dependencies between 2 words is less likely
to lead to data sparseness problems
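For instance, a dependency analysis can be stored as simple (head, label, dependent) triples, which makes lexical conditioning direct (a sketch, using the example above):

```python
# Dependency triples for "I gave him my address".
dependencies = [
    ("<ROOT>",  "main", "gave"),
    ("gave",    "subj", "I"),
    ("gave",    "dat",  "him"),
    ("gave",    "obj",  "address"),
    ("address", "attr", "my"),
]

# e.g. estimating P(label | head) only needs counts over word pairs,
# which is less sparse than counting whole phrase-structure rules.
```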
Summary
 We’ve taken a tour of PCFGs
 crucial notion: what the probability of a rule is conditioned on
 flaws in PCFGs: independence assumptions
 several proposals to go beyond these flaws
 dependency grammars are an alternative formalism