Corpora and Statistical Methods
Lecture 11
Albert Gatt
Part 1
Probabilistic Context-Free Grammars and beyond
Context-free grammars: reminder
Many NLP parsing applications rely on the CFG formalism
Definition:
CFG is a 4-tuple: (N,Σ,P,S):
N = a set of non-terminal symbols (e.g. NP, VP)
Σ = a set of terminals (e.g. words)
N and Σ are disjoint
P = a set of productions of the form A → β
A ∈ N
β ∈ (N ∪ Σ)* (any string of terminals and non-terminals)
S = a designated start symbol (usually, “sentence”)
CFG Example
S → NP VP
S → Aux NP VP
NP → Det Nom
NP → Proper-Noun
Det → that | the | a
…
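The 4-tuple definition can be made concrete in code. The following is a minimal sketch in Python of the example fragment above; the data layout (tuples for rules) is an illustrative choice, not a standard API.

```python
# A minimal sketch of the CFG 4-tuple (N, Sigma, P, S) in Python,
# mirroring the example fragment above.
N = {"S", "NP", "VP", "Aux", "Det", "Nom", "Proper-Noun"}  # non-terminals
Sigma = {"that", "the", "a"}                               # terminals (words)
S = "S"                                                    # start symbol

# P: productions A -> beta, where beta is a string over (N | Sigma)*
P = [
    ("S",  ("NP", "VP")),
    ("S",  ("Aux", "NP", "VP")),
    ("NP", ("Det", "Nom")),
    ("NP", ("Proper-Noun",)),
    ("Det", ("that",)), ("Det", ("the",)), ("Det", ("a",)),
]

# The definition requires N and Sigma to be disjoint, every LHS to be
# in N, and every RHS symbol to be in N | Sigma:
assert N.isdisjoint(Sigma)
assert all(lhs in N for lhs, _ in P)
assert all(sym in N | Sigma for _, rhs in P for sym in rhs)
```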
Probabilistic CFGs
A CFG where each production has an associated probability
PCFG is a 5-tuple: (N,Σ,P,S, D):
D: P → [0, 1], a function assigning each rule in P a probability
usually, probabilities are obtained from a corpus
most widely used corpus is the Penn Treebank
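Obtaining the probabilities from a corpus typically means relative-frequency (maximum-likelihood) estimation: P(A → β) = count(A → β) / count(A). A sketch, with invented counts standing in for treebank statistics:

```python
from collections import Counter

# Maximum-likelihood rule probabilities from (invented) treebank counts:
# P(A -> beta) = count(A -> beta) / count(A)
rule_counts = Counter({
    ("S",  ("NP", "VP")):        800,
    ("S",  ("Aux", "NP", "VP")): 200,
    ("NP", ("Det", "Nom")):      600,
    ("NP", ("Proper-Noun",)):    400,
})

# count(A) = total count of rules expanding A
lhs_counts = Counter()
for (lhs, rhs), n in rule_counts.items():
    lhs_counts[lhs] += n

rule_prob = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
```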
The Penn Treebank
English sentences annotated with syntax trees
built at the University of Pennsylvania
40,000 sentences, about a million words
text from the Wall Street Journal
Other treebanks exist for other languages (e.g. NEGRA for
German)
Example tree
Building a tree: rules
Tree for “Mr Vinken is chairman of Elsevier”:
[S [NP [NNP Mr] [NNP Vinken]]
   [VP [VBZ is]
       [NP [NP [NN chairman]]
           [PP [IN of] [NP [NNP Elsevier]]]]]]
Rules applied:
S → NP VP
NP → NNP NNP
NNP → Mr
NNP → Vinken
…
Characteristics of PCFGs
In a PCFG, the probability P(A → β) expresses the likelihood that the non-terminal A will expand as β.
e.g. the likelihood that S → NP VP
(as opposed to S → VP, or S → NP VP PP, or…)
can be interpreted as a conditional probability:
probability of the expansion, given the LHS non-terminal
P(A → β) = P(A → β | A)
Therefore, for any non-terminal A, the probabilities of all rules of the form A → β must sum to 1
If this is the case, we say the PCFG is consistent
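The consistency condition is easy to check mechanically. A sketch, with rules stored as (LHS, RHS, probability) triples and invented probabilities:

```python
from collections import defaultdict

# Check PCFG consistency: for every non-terminal A, the probabilities
# of all rules A -> beta must sum to 1.  Probabilities are invented.
rules = [
    ("S",  ("NP", "VP"),        0.8),
    ("S",  ("Aux", "NP", "VP"), 0.2),
    ("NP", ("Det", "Nom"),      0.6),
    ("NP", ("Proper-Noun",),    0.4),
]

def is_consistent(rules, tol=1e-9):
    # Sum rule probabilities per LHS non-terminal
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    # Every non-terminal's expansions must sum to 1 (up to float tolerance)
    return all(abs(total - 1.0) < tol for total in totals.values())
```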
Uses of probabilities in parsing
Disambiguation: given n legal parses of a string, which is the most likely?
e.g. PP-attachment ambiguity can be resolved this way
Speed: parsing is a search problem
search through space of possible applicable derivations
search space can be pruned by focusing on the most likely sub-parses of a
parse
Parser can be used as a model to determine the probability of a sentence,
given a parse
typical use in speech recognition, where input utterance can be “heard” as
several possible sentences
Using PCFG probabilities
PCFG assigns a probability to every parse-tree t of a string W
e.g. every possible parse (derivation) of a sentence recognised by the
grammar
Notation:
G = a PCFG
s = a sentence
t = a particular tree under our grammar
t consists of several nodes n
each node is generated by applying some rule r
Probability of a tree vs. a sentence
P(t, s) = ∏_{n ∈ t} p(r(n)) = P(t)
simply the multiplication of the probability of every rule (node) that gives rise to t (i.e. the derivation of t)
this is both the joint probability of t and s, and the probability of t alone
why?
P(t, s) = P(t) P(s | t)
But P(s | t) must be 1, since the tree t is a parse of all the words of s
P(t, s) = P(t) × 1 = P(t)
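The product over rules can be computed by a simple recursion over the tree. A sketch, with trees as nested (label, children…) tuples, string leaves, and invented rule probabilities:

```python
# P(t, s) = P(t) = product of the probabilities of the rule applied at
# every internal node of t.  Rule probabilities are invented.
rule_prob = {
    ("S",   ("NP", "VP")):   1.0,
    ("NP",  ("NNP", "NNP")): 0.3,
    ("VP",  ("VBZ",)):       0.2,
    ("NNP", ("Mr",)):        0.01,
    ("NNP", ("Vinken",)):    0.005,
    ("VBZ", ("is",)):        0.1,
}

def tree_prob(tree):
    """Multiply the probability of the rule used at each node of the tree."""
    label, *children = tree
    # The RHS of this node's rule: child labels (or the word, at a leaf)
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]
    for c in children:
        if isinstance(c, tuple):       # recurse into non-leaf children
            p *= tree_prob(c)
    return p

t = ("S",
     ("NP", ("NNP", "Mr"), ("NNP", "Vinken")),
     ("VP", ("VBZ", "is")))
```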
Picking the best parse in a PCFG
A sentence will usually have several parses
we usually want them ranked, or only want the n-best parses
we need to focus on P(t|s,G)
probability of a parse, given our sentence and our grammar
definition of the best parse for s:
t̂ = argmax_t P(t | s, G)
Picking the best parse in a PCFG
Problem: t can have multiple derivations
e.g. expand left-corner nodes first, or right-corner nodes first, etc.
so P(t|s,G) should be estimated by summing over all possible
derivations
Fortunately, derivation order makes no difference to the final
probabilities.
can assume a “canonical derivation” d of t
P(t) =def P(d)
Probability of a sentence
Simply the sum of probabilities of all parses of that sentence
since s is only a sentence if it’s recognised by G, i.e. if there is some t
for s under G
P(s) = ∑_{t : yield(t) = s} P(s, t) = ∑_{t : yield(t) = s} p(t)
where {t : yield(t) = s} is the set of all those trees which “yield” s
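In code, once each candidate parse has been scored, the sentence probability is just a filtered sum over yields. A trivial sketch with invented trees and numbers (the two attachments echo the dump sacks example discussed later):

```python
# P(s) = sum of P(t) over all parses t whose yield is s.
# Each candidate parse is stored as (yield, precomputed probability);
# the parses and numbers below are invented for illustration.
candidates = [
    ("dump sacks into a bin", 6.0e-7),  # V [NP] [PP] attachment
    ("dump sacks into a bin", 2.0e-7),  # V [NP PP] attachment
    ("dump sacks",            1.0e-4),  # a parse of a different sentence
]

def sentence_prob(s, candidates):
    """Sum the probabilities of all parses whose yield is exactly s."""
    return sum(p for y, p in candidates if y == s)

def best_parse_prob(s, candidates):
    """The argmax used for disambiguation: the single most likely parse."""
    return max(p for y, p in candidates if y == s)
```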
Flaws I: Structural independence
Probability of a rule r expanding node n depends only on n.
Independent of other non-terminals
Example:
P(NP → Pro) is independent of where the NP is in the sentence
but we know that NP → Pro is much more likely in subject position
Francis et al. (1999), using the Switchboard corpus:
91% of subjects are pronouns;
only 34% of objects are pronouns
Flaws II: lexical independence
vanilla PCFGs ignore lexical material
e.g. P(VP → V NP PP) is independent of the head of the NP or PP, or of the lexical head V
Examples:
prepositional phrase attachment preferences depend on lexical items;
cf:
dump [sacks into a bin]
dump [sacks] [into a bin] (preferred parse)
coordination ambiguity:
[dogs in houses] and [cats]
[dogs] [in houses and cats]
Weakening the independence assumptions in PCFGs
Lexicalised PCFGs
Attempt to weaken the lexical independence assumption.
Most common technique:
mark each phrasal head (N,V, etc) with the lexical material
this is based on the idea that the most crucial lexical
dependencies are between head and dependent
E.g.: Charniak 1997, Collins 1999
Lexicalised PCFGs: Matt walks
Makes probabilities partly dependent on lexical content.
P(VP → VBZ | VP) becomes:
P(VP → VBZ | VP, h(VP) = walks)
NB: normally, we can’t assume that all heads of a phrase of category C are equally probable.
[S(walks) [NP(Matt) [NNP(Matt) Matt]] [VP(walks) [VBZ(walks) walks]]]
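Lexicalisation amounts to propagating a head word bottom-up from a designated head child at each node. A sketch; the head-child table below is a toy stand-in for real head-finding rules (e.g. those of Collins 1999):

```python
# Annotate each phrasal node with its head word, propagated bottom-up
# from a designated head child.  The table is a toy stand-in for real
# head-finding rules.
head_child = {"S": "VP", "NP": "NNP", "VP": "VBZ"}

def lexicalise(tree):
    """tree = (label, children...) with string leaves.
    Returns ((label, head_word), lexicalised_children...)."""
    label, *children = tree
    if isinstance(children[0], str):
        # Pre-terminal: the head word is the word itself
        return ((label, children[0]), children[0])
    lex_children = [lexicalise(c) for c in children]
    # Take the head word from the designated head child
    wanted = head_child[label]
    head = next(c[0][1] for c in lex_children if c[0][0] == wanted)
    return ((label, head), *lex_children)

t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBZ", "walks")))
```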
Practical problems for lexicalised PCFGs
data sparseness: we don’t necessarily see all heads of all
phrasal categories often enough in the training data
flawed assumptions: lexical dependencies occur elsewhere, not just between head and complement
e.g. I got [the easier problem of the two to solve]
here, of the two and to solve become more likely because of the pre-head modifier easier
Structural context
The simple way: calculate P(t|s, G) based on the rules in the canonical derivation d of t
this assumes that P(t) is independent of the derivation
could condition on more structural context
but then we could lose the notion of a canonical derivation, i.e.
P(t) could really depend on the derivation!
Structural context: probability of a derivation history
How to calculate P(t) based on a derivation d?
Observation:
P(d) = P(S ⇒_{r1} α1 ⇒_{r2} α2 ⇒ … ⇒_{rm} αm = s)
(probability that a sequence of m rewrite rules in a derivation yields s)
can use the chain rule for multiplication:
P(d) = ∏_{i=1}^{m} P(r_i | r_1, …, r_{i−1})
Approach 2: parent annotation
Annotate each node with its parent in the parse tree.
E.g. if NP has parent S, then rename NP to NP^S
Can partly account for dependencies such as subject-of (NP^S is a subject, NP^VP is an object)
[S [NP^S [NNP^NP Matt]] [VP^S [VBZ^VP walks]]]
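The annotation itself is a simple tree transform. A sketch, with trees as nested (label, children…) tuples and string leaves:

```python
# Parent annotation: relabel every non-root internal node "X" as
# "X^Parent", where Parent is the (unannotated) label of its parent.
def parent_annotate(tree, parent=None):
    label, *children = tree
    new_label = f"{label}^{parent}" if parent is not None else label
    if isinstance(children[0], str):
        # Pre-terminal: annotate the label, keep the word unchanged
        return (new_label, children[0])
    # Children see this node's ORIGINAL label as their parent
    return (new_label, *(parent_annotate(c, label) for c in children))

t = ("S", ("NP", ("NNP", "Matt")), ("VP", ("VBZ", "walks")))
```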
The main point
Many different parsing approaches differ on what they
condition their probabilities on
Other grammar formalisms
Phrase structure vs. Dependency grammar
PCFGs are in the tradition of phrase-structure grammars
Dependency grammar describes syntax in terms of
dependencies between words
no non-terminals or phrasal nodes
only lexical nodes with links between them
links are labelled, labels from a finite list
Dependency Grammar
<ROOT>
main
GAVE
obj:
subj:
dat:
I
him
address
attr:
MY
Dependency grammar
Often used now in probabilistic parsing
Advantages:
directly encode lexical dependencies
therefore, disambiguation decisions take lexical material into account
directly
dependencies are a way of decomposing PSRs and their probability
estimates
estimating probability of dependencies between 2 words is less likely
to lead to data sparseness problems
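The decomposition into word-to-word links also makes the probability model very direct: one can estimate P(label, dependent | head) by relative frequency over (head, label, dependent) triples. A sketch using the example sentence above, with invented counts:

```python
from collections import Counter

# Relative-frequency estimates over dependency triples
# (head, label, dependent); counts are invented for illustration.
deps = Counter({
    ("gave",    "subj", "I"):       40,
    ("gave",    "obj",  "address"): 25,
    ("gave",    "dat",  "him"):     30,
    ("address", "attr", "my"):      10,
})

# Total count of dependents for each head word
head_totals = Counter()
for (head, label, dep), n in deps.items():
    head_totals[head] += n

def dep_prob(head, label, dep):
    """P(label, dependent | head) by relative frequency."""
    return deps[(head, label, dep)] / head_totals[head]
```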
Summary
We’ve taken a tour of PCFGs
crucial notion: what the probability of a rule is conditioned on
flaws in PCFGs: independence assumptions
several proposals to go beyond these flaws
dependency grammars are an alternative formalism