Statistical Parsing
Xinyu Dai
2015-11
Outline
- Review of parsing and Context-Free Grammar (CFG)
- Ambiguities in natural language parsing
- Probabilistic CFG
Parsing
- Programming languages: analyze a token sequence to determine its structure according to the grammar
  - Context-free grammar
- Why parsing:
  - Report syntax errors, recover from errors for further processing
  - Construct a parse tree for the rest of the compiler
  - ...
- Methods:
  - Top-down and bottom-up
  - Ambiguities: look ahead ...
Parsing
- Natural language: analyze a sentence to determine its structure according to the grammar
  - Context-free grammar
- Why parsing:
  - The parse tree can represent the structure of a sentence
    - Phrases
    - Relations: between phrases and within a phrase
  - Improves many NLP applications
    - High precision question answering systems (Pasca and Harabagiu, SIGIR 2001)
    - Improving biological named entity extraction (Finkel et al., JNLPBA 2004)
    - Syntactically based sentence compression (Lin and Wilbur, Inf. Retr. 2007)
    - Extracting people's opinions about products (Bloom et al., NAACL 2007)
- Methods: top-down or bottom-up
Context-Free Grammar (CFG)
- G = (T, N, S, R)
  - T is a set of terminals
  - N is a set of non-terminals
  - S is the start symbol (one of the non-terminals)
  - R is a set of rules/productions of the form X → γ, where X is a non-terminal and γ is a sequence of terminals and non-terminals (possibly the empty sequence)
- A grammar G generates a language L.
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP PP
NP → N
NP → e
NP → N N
PP → P NP
N → cats
N → claws
N → people
N → scratch
V → scratch
P → with
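As a concrete illustration (not part of the original slides), the grammar above can be written down and run with NLTK, the toolkit listed in the references; this sketch omits the empty production NP → e for simplicity, and the example sentence is just one built from this lexicon.

```python
import nltk

# The slide's grammar, minus the empty NP production (a simplifying assumption)
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    VP -> V NP | V NP PP
    NP -> NP PP | N | N N
    PP -> P NP
    N  -> 'cats' | 'claws' | 'people' | 'scratch'
    V  -> 'scratch'
    P  -> 'with'
""")

# A chart parser handles the left-recursive rule NP -> NP PP without looping
parser = nltk.ChartParser(grammar)
for tree in parser.parse("cats scratch people with claws".split()):
    print(tree)   # two parses: the PP attaches to the NP or to the VP
```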
Top down parsing
- Top-down parsing is goal-directed.
- A top-down parser starts with a list of constituents to be built. It rewrites the goals in the goal list by matching one against the LHS of a grammar rule and expanding it with the RHS, attempting to match the sentence to be derived.
- If a goal can be rewritten in several ways, there is a choice of which rule to apply (a search problem).
- Can use depth-first or breadth-first search, and goal ordering.
Top down parsing
Bottom-up parsing
- Bottom-up parsing is data-directed.
- The initial goal list of a bottom-up parser is the string to be parsed. If a sequence in the goal list matches the RHS of a rule, then this sequence may be replaced by the LHS of the rule.
- Parsing is finished when the goal list contains just the start category.
- If the RHS of several rules match the goal list, there is a choice of which rule to apply (a search problem).
- Can use depth-first or breadth-first search, and goal ordering.
- The standard presentation is as shift-reduce parsing.
Shift-reduce parsing: one path
Problems with top-down or bottom-up
- Left recursion
- Many different rules for the same LHS
- Locally possible, globally impossible
- Repeated work: anywhere there is common substructure
- ...
Repeated work…
Classical NLP Parsing
- Wrote symbolic grammar and lexicon
- Used top-down or bottom-up methods to parse
  - Minimal grammar on the "Fed raises" sentence: 36 parses
  - Real-size broad-coverage grammar: millions of parses
Ambiguities: PP Attachment
Classical NLP Parsing:
The problem and its solution
- Very constrained grammars attempt to limit unlikely/weird parses for sentences
  - But the attempt makes the grammars not robust: many sentences have no parse
- A less constrained grammar can parse more sentences
  - But simple sentences end up with ever more parses
- Solution: we need mechanisms that allow us to find the most likely parse(s)
  - Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences, but still quickly find the best parse(s)
Probabilistic or stochastic context-free grammars (PCFG)
- G = (T, N, S, R, P)
  - T is a set of terminals
  - N is a set of non-terminals
  - S is the start symbol (one of the non-terminals)
  - R is a set of rules/productions of the form X → γ, where X is a non-terminal and γ is a sequence of terminals and non-terminals (possibly the empty sequence)
  - P(R) gives the probability of each rule:
      ∀ X ∈ N:  Σ_{X→γ} P(X → γ) = 1
- A grammar G generates a language model L:
      Σ_{γ ∈ T*} P(γ) = 1
PCFG - Notation
- w_1n = w_1 ... w_n = the word sequence from 1 to n (a sentence of length n)
- w_ab = the subsequence w_a ... w_b
- N^j_ab = the non-terminal N^j dominating w_a ... w_b
- We write P(N^i → ζ^j) to mean P(N^i → ζ^j | N^i)
- We will want to calculate max_t P(t | w_ab), i.e., the most probable tree t with t ⇒* w_ab
The probability of trees and strings
- P(t) -- the probability of a tree is the product of the probabilities of the rules used to generate it.
- P(w_1n) -- the probability of the string is the sum of the probabilities of the trees which have that string as their yield:
      P(w_1n) = Σ_j P(w_1n, t_j) = Σ_j P(t_j),  where t_j is a parse of w_1n
A Simple PCFG
Trees
Tree and String Probabilities
- w_15 = astronomers saw stars with ears
- P(t1) = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0009072
- P(t2) = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18 * 1.0 * 1.0 * 0.18 = 0.0006804
- P(w_15) = P(t1) + P(t2) = 0.0009072 + 0.0006804 = 0.0015876
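These numbers can be checked with NLTK's ViterbiParser; the grammar string below is my reconstruction of the toy "astronomers" PCFG (the standard Manning and Schütze example), written so that its rule probabilities are consistent with the products shown above.

```python
import nltk

# PCFG consistent with the probabilities used on this slide (reconstructed)
grammar = nltk.PCFG.fromstring("""
    S  -> NP VP [1.0]
    VP -> V NP [0.7]
    VP -> VP PP [0.3]
    NP -> NP PP [0.4]
    NP -> 'astronomers' [0.1]
    NP -> 'ears' [0.18]
    NP -> 'saw' [0.04]
    NP -> 'stars' [0.18]
    NP -> 'telescopes' [0.1]
    PP -> P NP [1.0]
    V  -> 'saw' [1.0]
    P  -> 'with' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("astronomers saw stars with ears".split()):
    print(tree, tree.prob())   # the best parse t1, with probability about 0.0009072
```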
Have a rest
- Grammar?
- Probability (learning)?
- Decoding?
- ...
Grammar - Treebank
Treebank
- Going into it, building a treebank seems a lot slower and less useful than building a grammar
- But a treebank gives us many things
  - Reusability of the labor
  - Broad coverage
  - Frequencies and distributional information
  - A way to evaluate systems
- Treebank grammars can be enormous
  - The raw grammar has ~10K states, excluding the lexicon
  - Better parsers usually make the grammars larger, not smaller
Treebank Grammars
- Can take a grammar right off the trees
Chomsky Normal Form
- All rules are of the form X → Y Z or X → w
  - N-ary rules introduce new non-terminals
  - Unaries / empties are "promoted"
- In practice it's kind of a pain:
  - Reconstructing n-aries is easy
  - Reconstructing unaries is trickier
  - The straightforward transformations don't preserve tree scores
- Makes parsing algorithms simpler!
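To make the n-ary case concrete, here is a minimal binarization sketch of my own (not from the slides): each rule with more than two symbols on the RHS is split by introducing intermediate non-terminals.

```python
def binarize(rules):
    """Turn n-ary rules (lhs, rhs_tuple) into binary ones by introducing
    intermediate symbols such as 'VP|<NP-PP>'. Unary and binary rules pass
    through unchanged; unaries would still need separate removal for strict CNF."""
    out = []
    for lhs, rhs in rules:
        rhs = list(rhs)
        while len(rhs) > 2:
            new_sym = f"{lhs}|<{'-'.join(rhs[1:])}>"
            out.append((lhs, (rhs[0], new_sym)))   # peel off the leftmost child
            lhs, rhs = new_sym, rhs[1:]
        out.append((lhs, tuple(rhs)))
    return out

print(binarize([("VP", ("V", "NP", "PP")), ("NP", ("N",))]))
# [('VP', ('V', 'VP|<NP-PP>')), ('VP|<NP-PP>', ('NP', 'PP')), ('NP', ('N',))]
```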
Have a rest
- Grammar?
- Probability (learning)?
- Decoding?
- ...
Parsing
- Decoding:
      argmax_t P(t | w_1n, G)
- Learning probabilities:
      argmax_G P(w_1n | G)   (MLE)
      P(w_1n | G) = Σ_j P(w_1n, t_j | G) = Σ_j P(t_j | G),  where t_j is a parse of w_1n
- In general, we cannot efficiently calculate the probability of a string by simply summing the probabilities of all possible parse trees for the sentence, as there will be exponentially many of them.
- The inside algorithm and the outside algorithm are efficient ways to do this.
Inside algorithm
- Inside variable: α_i,j(A) = P(w_ij | A, G)
- Inside probability: the probability of a particular (sub)string, given that the string is dominated by a particular non-terminal.
(figure: a rule A → B C covering w_i ... w_j inside a tree rooted in S, with B covering w_i ... w_k and C covering w_k+1 ... w_j)
* We will only consider Chomsky Normal Form grammars, which have only rules of the form A → B C and A → w.
Inside probability calculation
For i < j:
  α_i,j(A) = P(w_i ... w_j | A)
           = Σ_{B,C,k} P(w_i ... w_k, B, w_k+1 ... w_j, C | A)
           = Σ_{B,C,k} P(B, C | A) * P(w_i ... w_k | A, B, C) * P(w_k+1 ... w_j | w_i ... w_k, A, B, C)
           = Σ_{B,C,k} P(B, C | A) * P(w_i ... w_k | B) * P(w_k+1 ... w_j | C)
           = Σ_{B,C,k} P(A → B C) * α_i,k(B) * α_k+1,j(C)
(using the assumptions of place invariance, context-freeness and ancestor-freeness)
For i = j:
  α_i,i(A) = P(A → w_i)
Find probability of a string using inside probability
- Input: G, string W = w_1 w_2 ... w_n
- Output: P(w_1 w_2 ... w_n | G) = α_1,n(S)
- Algorithm:
  - Initialization: α_i,i(A) = P(A → w_i), for all A ∈ N, 1 ≤ i ≤ n
  - Induction: α_i,j(A) = Σ_{B,C ∈ N} Σ_{i ≤ k < j} P(A → B C) * α_i,k(B) * α_k+1,j(C)
  - Termination: P(S ⇒* w_1 ... w_n | G) = α_1,n(S)
- With inside probabilities, we get the probability of W as the sum over all parse trees generated from S.
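A direct transcription of this algorithm into Python may help; this is a sketch of my own (not from the slides), using the CNF "astronomers" PCFG from the earlier example so the result can be checked against the slide's figure.

```python
from collections import defaultdict

# CNF PCFG from the "astronomers saw stars with ears" example
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4, ("PP", "P", "NP"): 1.0}     # A -> B C : prob
lexical = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18, ("NP", "saw"): 0.04,
           ("NP", "stars"): 0.18, ("NP", "telescopes"): 0.1,
           ("V", "saw"): 1.0, ("P", "with"): 1.0}              # A -> w : prob

def inside(words):
    n = len(words)
    alpha = defaultdict(float)                  # alpha[(i, j, A)] = P(w_i ... w_j | A)
    for i, w in enumerate(words, start=1):      # initialization: alpha_ii(A) = P(A -> w_i)
        for (A, word), p in lexical.items():
            if word == w:
                alpha[(i, i, A)] += p
    for span in range(2, n + 1):                # induction over increasing span widths
        for i in range(1, n - span + 2):
            j = i + span - 1
            for (A, B, C), p in binary.items():
                for k in range(i, j):           # all split points i <= k < j
                    alpha[(i, j, A)] += p * alpha[(i, k, B)] * alpha[(k + 1, j, C)]
    return alpha

words = "astronomers saw stars with ears".split()
alpha = inside(words)
print(alpha[(1, len(words), "S")])   # about 0.0015876, matching P(w_15) on the slide
```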
Outside Probability
- Outside variable: β_i,j(A) = P(w_1(i-1), A, w_(j+1)n | G)
- Outside probability: the probability of generating a particular non-terminal and the sequences of words outside its domination (starting from the start non-terminal).
(figure: the two cases in the recursion, with A as the right child of C beside a sibling B covering w_h ... w_i-1, and A as the left child of C beside a sibling B covering w_j+1 ... w_k)
Outside probability calculation
Base case:
  β_1,n(A) = 1 if A = S, 0 otherwise
Recursion:
  β_i,j(A) = P(w_1 ... w_i-1, A, w_j+1 ... w_n | G)
           = Σ_{B,C, k>j} P(w_1 ... w_i-1, C, w_k+1 ... w_n) * P(C → A B) * P(B ⇒ w_j+1 ... w_k)
           + Σ_{B,C, h<i} P(w_1 ... w_h-1, C, w_j+1 ... w_n) * P(C → B A) * P(B ⇒ w_h ... w_i-1)
           = Σ_{B,C, k>j} β_i,k(C) * P(C → A B) * α_j+1,k(B)
           + Σ_{B,C, h<i} β_h,j(C) * P(C → B A) * α_h,i-1(B)
Find probability of a string using outside probability
For any fixed position k:
  P(w_1n | G) = Σ_A P(w_1 ... w_k-1, A, w_k+1 ... w_n | G) * P(A → w_k)
              = Σ_A β_k,k(A) * P(A → w_k)
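A self-contained sketch of the outside recursion (again my own illustration, not the slides' code) on the same toy grammar; it reuses an inside chart and checks the identity above at one word position.

```python
from collections import defaultdict

# CNF "astronomers" PCFG used earlier in the slides
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4, ("PP", "P", "NP"): 1.0}
lexical = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18, ("NP", "saw"): 0.04,
           ("NP", "stars"): 0.18, ("NP", "telescopes"): 0.1,
           ("V", "saw"): 1.0, ("P", "with"): 1.0}
nonterminals = {A for (A, _, _) in binary} | {A for (A, _) in lexical}

def inside(words):
    n, a = len(words), defaultdict(float)
    for i, w in enumerate(words, 1):                 # alpha_ii(A) = P(A -> w_i)
        for (A, word), p in lexical.items():
            if word == w:
                a[(i, i, A)] += p
    for span in range(2, n + 1):                     # widening spans
        for i in range(1, n - span + 2):
            j = i + span - 1
            for (A, B, C), p in binary.items():
                for k in range(i, j):
                    a[(i, j, A)] += p * a[(i, k, B)] * a[(k + 1, j, C)]
    return a

def outside(words, a):
    n, b = len(words), defaultdict(float)
    b[(1, n, "S")] = 1.0                             # base case: only S spans the whole sentence
    for span in range(n - 1, 0, -1):                 # narrower spans depend on wider ones
        for i in range(1, n - span + 2):
            j = i + span - 1
            for A in nonterminals:
                total = 0.0
                for (Par, L, R), p in binary.items():
                    if L == A:                       # A is the left child:  Par -> A R
                        for k in range(j + 1, n + 1):
                            total += b[(i, k, Par)] * p * a[(j + 1, k, R)]
                    if R == A:                       # A is the right child: Par -> L A
                        for h in range(1, i):
                            total += b[(h, j, Par)] * p * a[(h, i - 1, L)]
                b[(i, j, A)] = total
    return b

words = "astronomers saw stars with ears".split()
a = inside(words)
b = outside(words, a)
k = 3                                                # any position k works
p_string = sum(b[(k, k, A)] * lexical.get((A, words[k - 1]), 0.0) for A in nonterminals)
print(p_string, a[(1, len(words), "S")])             # both about 0.0015876
```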
Viterbi algorithm
- argmax_t P(t | w_1n, G)
- Viterbi variable: δ_i,j(A)
- δ_i,j(A) is the highest inside probability of a parse of w_i ... w_j dominated by A.
Find the most probable parse using Viterbi algorithm
- Input: G, string W = w_1 w_2 ... w_n
- Output: t* (the most probable parse tree of W)
- Algorithm:
  - Initialization: δ_i,i(A) = P(A → w_i), for all A ∈ N, 1 ≤ i ≤ n
  - Induction: for 1 ≤ j ≤ n-1, 1 ≤ i ≤ n-j, calculate
      δ_i,i+j(A) = max_{B,C ∈ N; i ≤ k < i+j} P(A → B C) * δ_i,k(B) * δ_k+1,i+j(C)
      Δ_i,i+j(A) = argmax_{B,C,k; i ≤ k < i+j} P(A → B C) * δ_i,k(B) * δ_k+1,i+j(C)
  - Termination: P(t*) = δ_1,n(S)
  - S is the designated start symbol of t*; recover t* by tracing back the Δ_1,n(S) backpointers.
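The same chart as the inside algorithm, with max in place of sum and backpointers for recovering t*; again a sketch of my own on the toy grammar, not code from the slides.

```python
from collections import defaultdict

binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4, ("PP", "P", "NP"): 1.0}
lexical = {("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18, ("NP", "saw"): 0.04,
           ("NP", "stars"): 0.18, ("NP", "telescopes"): 0.1,
           ("V", "saw"): 1.0, ("P", "with"): 1.0}

def viterbi_parse(words):
    n = len(words)
    delta = defaultdict(float)   # delta[(i, j, A)] = best probability of A over w_i..w_j
    back = {}                    # backpointers (Delta) for recovering the best tree
    for i, w in enumerate(words, start=1):
        for (A, word), p in lexical.items():
            if word == w:
                delta[(i, i, A)] = p
                back[(i, i, A)] = w
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            j = i + span - 1
            for (A, B, C), p in binary.items():
                for k in range(i, j):
                    score = p * delta[(i, k, B)] * delta[(k + 1, j, C)]
                    if score > delta[(i, j, A)]:
                        delta[(i, j, A)] = score
                        back[(i, j, A)] = (k, B, C)
    return delta, back

def build_tree(back, i, j, A):
    entry = back[(i, j, A)]
    if isinstance(entry, str):               # lexical backpointer: A -> w
        return (A, entry)
    k, B, C = entry
    return (A, build_tree(back, i, k, B), build_tree(back, k + 1, j, C))

words = "astronomers saw stars with ears".split()
delta, back = viterbi_parse(words)
print(delta[(1, 5, "S")])                    # about 0.0009072 = P(t1), the best parse
print(build_tree(back, 1, 5, "S"))
```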
Have a rest
- Grammar?
- Probability (learning)?
- Decoding?
- ...
Learning Probabilities
- Maximum likelihood estimation
- Training from a treebank
  - Relative frequency
- If we have no treebank
  - Expectation maximization (EM) estimation: inside & outside probabilities
Relative frequency
- From a treebank
- To get the probability for a particular VP rule, just count all the times the rule is used and divide by the number of VPs overall
- Estimate rule probability:
      P(A → α) = Number(A → α) / Σ_γ Number(A → γ)
- Example:
      P(VP → Vi) = Number(VP → Vi) / [Number(VP → Vi) + Number(VP → VT NP) + Number(VP → VP PP)]
Rule         Count    Probability (relative frequency)
S → NP VP    3        3 / 3 = 1.0
NP → Pro     2        2 / 9 = 0.22
NP → Det N   6        6 / 9 = 0.67
NP → NP PP   1        1 / 9 = 0.11
VP → V NP    3        3 / 4 = 0.75
VP → VP PP   1        1 / 4 = 0.25
PP → P NP    2        2 / 2 = 1.0
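In code, relative-frequency estimation is just counting productions; a sketch assuming a tiny hand-written "treebank" of bracketed trees (the two example sentences below are made up), plus NLTK's induce_pcfg, which performs the same estimate.

```python
import nltk
from collections import Counter

# A toy "treebank" of bracketed parses (hypothetical example sentences)
treebank = [
    "(S (NP (Pro I)) (VP (V saw) (NP (Det the) (N dog))))",
    "(S (NP (Det the) (N dog)) (VP (V bit) (NP (Pro me))))",
]
trees = [nltk.Tree.fromstring(s) for s in treebank]
productions = [p for t in trees for p in t.productions()]

# Relative frequency: count each rule, divide by the total count of its LHS
rule_counts = Counter(productions)
lhs_counts = Counter(p.lhs() for p in productions)
for rule, c in sorted(rule_counts.items(), key=str):
    print(rule, c / lhs_counts[rule.lhs()])

# NLTK's induce_pcfg performs the same maximum likelihood estimate
grammar = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)
print(grammar)
```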
Expectation maximization estimation: inside & outside probabilities
- Estimate rule probability through EM:
      P(A → α) = Expected-Number(A → α) / Σ_γ Expected-Number(A → γ)
- The expected number of occurrences of A → B C is the expected co-occurrence count of the symbols A, B and C.
- An efficient method for estimating rule probabilities from unparsed data is known as the inside-outside algorithm.
- Like the HMM forwards/backwards training procedure, it is an analogous application of the Baum-Welch (EM) algorithm.
Rule probability with EM
      P(A → B C) = Number(A → B C) / Σ_γ Number(A → γ)
                 ≈ [Σ_{1 ≤ i ≤ k < j ≤ n} β_i,j(A) * P(A → B C) * α_i,k(B) * α_k+1,j(C)] / [Σ_{1 ≤ i ≤ j ≤ n} β_i,j(A) * α_i,j(A)]
Algorithm of expectation maximization estimation
- Initialization: assign a random value to each P(A → α), ensuring Σ_α P(A → α) = 1; call this PCFG G_i, i = 0
- Repeat the following steps until convergence:
  - Calculate the expected number of occurrences of each rule
  - Re-estimate each rule probability according to the expected occurrences, obtaining the new PCFG G_i+1
Review Decoding
- argmax_t P(t | w_1n, G)
- Viterbi variable: δ_i,j(A)
- δ_i,j(A) is the highest inside probability of a parse of w_i ... w_j dominated by A.
A Recursive Parser
- Will this parser work?
- Why or why not?
- Memory requirements?
A Memoized Parser
- Question: if there are unary rules like X → Y, how do we handle them?
A Bottom-Up Parser (CKY)
Memory
- How much memory does this require?
  - Have to store the score cache
  - Cache size: |symbols| * n^2 doubles
  - For the plain treebank grammar: X ~ 20K, n = 40, double ~ 8 bytes, so ~256MB
  - Big, but workable.
- Pruning: beams
  - score[X][i][j] can get too large
  - Can keep beams (truncated maps score[i][j]) which only store the best few scores for the span [i, j]
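A minimal sketch of what pruning one chart cell to a beam might look like (my own illustration; the symbol names and scores below are made up):

```python
import heapq

def prune_to_beam(cell, beam_size=5):
    """Keep only the beam_size highest-scoring symbols for one span [i, j].
    `cell` maps nonterminal symbols to their current best scores."""
    return dict(heapq.nlargest(beam_size, cell.items(), key=lambda kv: kv[1]))

cell = {"NP": 1e-4, "VP": 2e-5, "S": 3e-7, "X1": 1e-12, "X2": 5e-13}
print(prune_to_beam(cell, beam_size=3))   # keeps only NP, VP and S
```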
Beam Search
Time
- How much time will it take to parse?
  - For each diff (<= n)
    - For each i (<= n)
      - For each rule X → Y Z
        - For each split point k
- Total time: |rules| * n^3
- Something like 5 sec for an unoptimized parse of a 20-word sentence
Evaluating Parsing Accuracy
- Most sentences are not given a completely correct parse by any currently existing parser.
- Standardly, for Penn Treebank parsing, evaluation is done in terms of the percentage of correct constituents (labeled spans).
- A constituent is a triple [label, start, end], all of which must be in the true parse for the constituent to be marked correct.
Evaluating Constituent Accuracy
- Let C be the number of correct constituents produced by the parser over the test set, M the total number of constituents produced, and N the total in the correct (gold) version
- Precision = C / M
- Recall = C / N
- Both are important measures.
Parser Evaluation Using the Penn Treebank
- Labeled precision and recall of constituents

Counting Constituents

A parser says:          The Treebank says:
  S   1  7                S   1  7
  NP  1  1                NP  1  1
  VP  2  7                VP  2  7
  NP  3  4                NP  3  7
  PP  5  7                NP  3  4
  NP  6  7                PP  5  7
                          NP  6  7
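With the triples written this way, labeled precision and recall reduce to a multiset intersection over (label, start, end); a small sketch of my own (the span lists below follow my reading of the table above, so treat them as illustrative):

```python
from collections import Counter

def evaluate(guess_spans, gold_spans):
    """Labeled constituent precision / recall / F1 over (label, start, end) triples."""
    guess, gold = Counter(guess_spans), Counter(gold_spans)
    correct = sum((guess & gold).values())       # constituents found in both
    precision = correct / sum(guess.values())
    recall = correct / sum(gold.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

parser_says = [("S", 1, 7), ("NP", 1, 1), ("VP", 2, 7),
               ("NP", 3, 4), ("PP", 5, 7), ("NP", 6, 7)]
treebank_says = [("S", 1, 7), ("NP", 1, 1), ("VP", 2, 7), ("NP", 3, 7),
                 ("NP", 3, 4), ("PP", 5, 7), ("NP", 6, 7)]
print(evaluate(parser_says, treebank_says))   # precision 6/6 = 1.0, recall 6/7 ~ 0.86
```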
Review the probability of a parsing tree
- The probability of a tree is the product of the probabilities of the rules used to generate it.
- But not every NP expansion can fill every NP slot
  - A grammar with symbols like "NP" won't be context-free
  - Statistically, conditional independence is too strong
Non-Independence
- Independence assumptions are often too strong.
  - Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
  - Also: the subject and object expansions are correlated!
Grammar Refinement
- The Game of Designing a Grammar
  - Annotation refines base treebank symbols to improve the statistical fit of the grammar
  - Structural annotation [Johnson '98, Klein and Manning '03]
Grammar Refinement
- The Game of Designing a Grammar
  - Annotation refines base treebank symbols to improve the statistical fit of the grammar
  - Head lexicalization [Collins '99, Charniak '00]
Lexicalized PCFG
  p(V NP NP | VP)         = 0.00151
  p(V NP NP | VP(said))   = 0.00001
  p(V NP NP | VP(gave))   = 0.01980
Lexicalized parsing
- Central to the lexicalization model is the idea that the strong lexical dependencies are between heads and their dependents.
- E.g., "hit the girl with long hair with a hammer"
  - We might expect dependencies between
    - "girl" and "hair"
    - "hit" and "hammer"
    - (and "long" and "hair", although this is not so useful here)
Heads in CFGs
- Each rule has a head
  - The head is one of the items on the RHS
- What is the head? Hard to define, but roughly:
  - The "most important" piece of the constituent
  - The sub-constituent that carries the meaning of the constituent
- The notion of headedness appears in LFG, X-bar syntax, etc.
Head Example
Rules:
  S → NP VP
  NP → Pro
  VP → V NP
  NP → Det N
Lexicalized PCFG
- When we extract a PCFG from the Penn Treebank, we don't know the heads of the rules
- We can use a set of heuristics to find the head of each rule
- Example from a head table for Penn Treebank grammars, to find the head of a VP:
  - If the VP contains a V, the head is the leftmost V;
  - Else, if the VP contains an MD, the head is the leftmost MD;
  - Else, if the VP contains a VP, the head is the leftmost VP;
  - Else, the head is the leftmost child
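The VP head rule above translates directly into a few lines of code; a sketch of my own (labels are simplified here, whereas real head tables use Penn Treebank tags such as VB, VBD, MD):

```python
def vp_head_index(children):
    """Index of the head child of a VP, following the slide's heuristic:
    leftmost V, else leftmost MD, else leftmost VP, else the leftmost child."""
    for target in ("V", "MD", "VP"):
        for idx, label in enumerate(children):
            if label == target:          # leftmost child with this label
                return idx
    return 0                              # default: leftmost child

print(vp_head_index(["ADVP", "V", "NP", "PP"]))   # 1 -> the V child is the head
print(vp_head_index(["ADVP", "NP"]))              # 0 -> fall back to the leftmost child
```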
Lexicalized PCFG
- Once we know the head of each rule, we can determine the headword of each constituent
  - Start from the bottom
  - The headword of a constituent is the headword of its head
Exploiting Heads
- Condition the probability of a rule expansion on the head.
- Condition the probability of the head on
  - the type of phrase
  - the head of the "mother phrase"
- In effect, the probability of a parse is some version of
      Π P(head | phead) * P(rule | head)
  over all the constituents of the parse, which we use to choose the most probable parse.
Charniak’s Approach of
Lexical PCFG Parsing
T  arg max P(t | s)
t
P(t | s)   P(r (n))
Normal PCFG
nt
P(t | s)   P(r (n) | c(n), h(n))* P(h(n) | c(n), h(m(n)))
nt
71
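To make the decomposition concrete, here is a tiny hedged sketch of how a tree would be scored under it; the probability tables and the example tree below are entirely made up for illustration, not estimates from any treebank.

```python
from math import prod

# Hypothetical conditional probability tables (illustrative values only)
p_rule = {("S",  "S -> NP VP",  "bought"): 0.9,
          ("NP", "NP -> Det N", "book"):   0.4,
          ("VP", "VP -> V NP",  "bought"): 0.5}
p_head = {("S",  "bought", None):     0.01,   # root head: no mother headword
          ("NP", "book",   "bought"): 0.02,
          ("VP", "bought", "bought"): 0.30}

def lexicalized_tree_prob(nodes):
    """P(t|s) = product over internal nodes n of
       P(r(n) | c(n), h(n)) * P(h(n) | c(n), h(m(n)))."""
    return prod(p_rule[(c, r, h)] * p_head[(c, h, mh)] for c, r, h, mh in nodes)

# One internal node per tuple: (category, rule, headword, mother's headword)
tree_nodes = [("S",  "S -> NP VP",  "bought", None),
              ("NP", "NP -> Det N", "book",   "bought"),
              ("VP", "VP -> V NP",  "bought", "bought")]
print(lexicalized_tree_prob(tree_nodes))   # 0.9*0.01 * 0.4*0.02 * 0.5*0.30 = 1.08e-05
```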
Reference
- Slides from Stanford
  - http://nlp.stanford.edu/courses/lsa354/
- Slides from UC Berkeley
  - http://www.cs.berkeley.edu/~klein/cs288/sp10/
- A useful tool: NLTK
  - http://nltk.org/api/nltk.parse.html