Grammar and Parsing

Trees, Grammars, and Parsing
Most slides are taken or adapted from slides by Chris Manning and Dan Klein.
Parse Trees
From latent state sequences to latent tree structures (edges and nodes)
Types of Trees
There are several ways to add tree structures to
sentences. We will consider 2:
- Phrase structure (constituency) trees
- Dependency trees
1. Phrase structure
• Phrase structure trees organize
sentences into constituents or
brackets.
• Each constituent gets a label.
• The constituents are nested in
a tree form.
• Linguists can and do argue
about the details.
• Lots of ambiguity …
Constituency Tests
• How do we know what nodes go in the tree?
• Classic constituency tests:
– Substitution by proform
– Question answers
– Semantic grounds
• Coherence
• Reference
• Idioms
– Dislocation
– Conjunction
• Cross-linguistic arguments
Conflicting Tests
Constituency isn’t always clear.
• Phonological Reduction:
– I will go  I’ll go
– I want to go  I wanna go
– a le centre  au centre
• Coordination
– He went to and came from the store.
2. Dependency structure
• Dependency structure shows which words depend on (modify or are
arguments of) which other words.
[Dependency tree for "The boy put the tortoise on the rug": put is the root; boy, tortoise, and on depend on put; rug depends on on; the determiners depend on their nouns.]
Classical NLP: Parsing
• Write symbolic or logical rules
• Use deduction systems to prove parses from words
  – Minimal grammar on "Fed" sentence: 36 parses
  – Simple, 10-rule grammar: 592 parses
  – Real-size grammar: many millions of parses
  – With hand-built grammar, ~30% of sentences have no parse
• This scales very badly.
– Hard to produce enough rules for every variation of language (coverage)
– Many, many parses for each valid sentence (disambiguation)
Ambiguity examples
The bad effects of V/N ambiguities
Ambiguities: PP Attachment
Attachments
• I cleaned the dishes from dinner.
• I cleaned the dishes with detergent.
• I cleaned the dishes in my pajamas.
• I cleaned the dishes in the sink.
Syntactic Ambiguities 1
• Prepositional Phrases
They cooked the beans in the pot on the stove with handles.
• Particle vs. Preposition
The puppy tore up the staircase.
• Complement Structure
The tourists objected to the guide that they couldn’t hear.
She knows you like the back of her hand.
• Gerund vs. Participial Adjective
Visiting relatives can be boring.
Changing schedules frequently confused passengers.
Syntactic Ambiguities 2
• Modifier scope within NPs
impractical design requirements
plastic cup holder
• Multiple gap constructions
The chicken is ready to eat.
The contractors are rich enough to sue.
• Coordination scope
Small rats and mice can squeeze into holes or cracks in
the wall.
Classical NLP Parsing:
The problem and its solution
• Very constrained grammars attempt to limit unlikely/weird
parses for sentences
– But the attempt makes the grammars not robust: many
sentences have no parse
• A less constrained grammar can parse more sentences
– But simple sentences end up with ever more parses
• Solution: We need mechanisms that allow us to find the
most likely parse(s)
– Statistical parsing lets us work with very loose grammars that admit millions of parses for sentences but still quickly find the best parse(s)
Polynomial-time Parsing with
Context Free Grammars
Parsing
Computational task:
Given a set of grammar rules and a sentence, find a
valid parse of the sentence (efficiently)
Naively, you could try all possible trees until you get to
a parse tree that conforms to the grammar rules,
that has “S” at the root, and that has the right
words at the leaves.
But that takes exponential time in the number of words.
Aspects of parsing
• Running a grammar backwards to find possible structures for a sentence
• Parsing can be viewed as a search problem
• Parsing is a hidden data problem
• For the moment, we want to examine all structures for a string of words
• We can do this bottom-up or top-down
– This distinction is independent of depth-first or breadth-first search – we can combine them either way
– We search by building a search tree, which is distinct from the parse tree
Human parsing
• Humans often do ambiguity maintenance
– Have the police … eaten their supper?
–                … come in and look around.
–                … taken out and shot.
• But humans also commit early and are “garden
pathed”:
– The man who hunts ducks out on weekends.
– The cotton shirts are made from grows in Mississippi.
– The horse raced past the barn fell.
A phrase structure grammar
S → NP VP
VP → V NP
VP → V NP PP
NP → NP PP
NP → N
NP → ε (empty)
NP → N N
PP → P NP
N → cats
N → claws
N → people
N → scratch
V → scratch
P → with
• By convention, S is the start symbol, but in the PTB, we
have an extra node at the top (ROOT, TOP)
Phrase structure grammars = context-free grammars
• G = (T, N, S, R)
– T is set of terminals
– N is set of nonterminals
• For NLP, we usually distinguish a set P ⊂ N of preterminals, which always rewrite as terminals
• S is the start symbol (one of the nonterminals)
• R is a set of rules/productions of the form X → γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
• A grammar G generates a language L.
Probabilistic or stochastic context-free
grammars (PCFGs)
• G = (T, N, S, R, P)
– T is set of terminals
– N is set of nonterminals
• For NLP, we usually distinguish a set P ⊂ N of preterminals, which always rewrite as terminals
• S is the start symbol (one of the nonterminals)
• R is a set of rules/productions of the form X → γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
• P(R) gives the probability of each rule:

      ∀ X ∈ N:   Σ_{X → γ ∈ R} P(X → γ) = 1

• A grammar G generates a language model L.
(A quick sanity check of this normalization constraint appears below.)
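A minimal Python sketch of that check; the rule table here is a hypothetical fragment, not the grammar from these slides:

from collections import defaultdict

# Hypothetical PCFG fragment: (LHS, RHS) -> probability.
rules = {
    ("S",  ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.4,
    ("NP", ("astronomers",)): 0.6,
}

def check_normalization(rules, tol=1e-9):
    """For every nonterminal X, check that the probabilities of all rules X -> gamma sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return {lhs: abs(total - 1.0) < tol for lhs, total in totals.items()}

print(check_normalization(rules))   # {'S': True, 'VP': True, 'NP': True}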
Soundness and completeness
• A parser is sound if every parse it returns is valid/correct
• A parser terminates if it is guaranteed to not go off into
an infinite loop
• A parser is complete if for any given grammar and
sentence, it is sound, produces every valid parse for that
sentence, and terminates
• (For many purposes, we settle for sound but incomplete
parsers: e.g., probabilistic parsers that return a k-best
list.)
Top-down parsing
• Top-down parsing is goal directed
• A top-down parser starts with a list of constituents to
be built. The top-down parser rewrites the goals in the
goal list by matching one against the LHS of the
grammar rules, and expanding it with the RHS,
attempting to match the sentence to be derived.
• If a goal can be rewritten in several ways, then there is
a choice of which rule to apply (search problem)
• Can use depth-first or breadth-first search, and goal
ordering.
Top-down parsing
Problems with top-down parsing
• Left recursive rules
• A top-down parser will do badly if there are many different rules for the same LHS. Consider if there are 600 rules for S, 599 of which start with NP, but one of which starts with V, and the sentence starts with V.
• Useless work: expands things that are possible top-down but not there
• Top-down parsers do well if there is useful grammar-driven control: search is directed by the grammar
• Top-down is hopeless for rewriting parts of speech (preterminals) with words (terminals). In practice that is always done bottom-up as lexical lookup.
• Repeated work: anywhere there is common substructure
Repeated work…
Bottom-up parsing
• Bottom-up parsing is data directed
• The initial goal list of a bottom-up parser is the string to be parsed. If a sequence in the goal list matches the RHS of a rule, then this sequence may be replaced by the LHS of the rule.
• Parsing is finished when the goal list contains just the start category.
• If the RHS of several rules match the goal list, then there is a choice of which rule to apply (search problem)
• Can use depth-first or breadth-first search, and goal ordering.
• The standard presentation is as shift-reduce parsing.
Problems with bottom-up parsing
• Unable to deal with empty categories: termination
problem, unless rewriting empties as constituents is
somehow restricted (but then it's generally incomplete)
• Useless work: locally possible, but globally impossible.
• Inefficient when there is great lexical ambiguity (grammar-driven control might help here)
• Conversely, it is data-directed: it attempts to parse the
words that are there.
• Repeated work: anywhere there is common substructure
PCFGs – Notation
• w1n = w1 … wn = the word sequence from 1 to n (sentence of length n)
• wab = the subsequence wa … wb
• Njab = the nonterminal Nj dominating wa … wb
• We'll write P(Ni → ζj) to mean P(Ni → ζj | Ni)
• We'll want to calculate max_t P(t ⇒* wab)
The probability of trees and strings
• P(w1n, t) – the probability of a tree is the product of the probabilities of the rules used to generate it:

      P(w1n, t) = ∏_{R = X→A B ∈ t} P(R) · ∏_{R = X→wi ∈ t} P(R)

• P(w1n) – the probability of the string is the sum of the probabilities of the trees which have that string as their yield:

      P(w1n) = Σ_t P(w1n, t), where t ranges over the parses of w1n
Example: A Simple PCFG
(in Chomsky Normal Form)
  S  → NP VP    1.0        NP → NP PP        0.4
  VP → V NP     0.7        NP → astronomers  0.1
  VP → VP PP    0.3        NP → ears         0.18
  PP → P NP     1.0        NP → saw          0.04
  P  → with     1.0        NP → stars        0.18
  V  → saw      1.0        NP → telescope    0.1
Tree and String Probabilities
• w15 = astronomers saw stars with ears
• P(t1) = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18
* 1.0 * 1.0 * 0.18
= 0.0009072
• P(t2) = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18
* 1.0 * 1.0 * 0.18
= 0.0006804
• P(w15) = P(t1) + P(t2)
         = 0.0009072 + 0.0006804
         = 0.0015876
(A short script reproducing these numbers follows.)
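A minimal sketch in Python, assuming trees are written as nested tuples (label, child, …) with plain strings at the leaves; the rule probabilities are the ones in the table above:

RULE_P = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,  ("VP", ("VP", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
    ("P", ("with",)): 1.0,     ("V", ("saw",)): 1.0,
    ("NP", ("NP", "PP")): 0.4, ("NP", ("astronomers",)): 0.1,
    ("NP", ("ears",)): 0.18,   ("NP", ("saw",)): 0.04,
    ("NP", ("stars",)): 0.18,  ("NP", ("telescope",)): 0.1,
}

def tree_prob(t):
    """P(tree) = product of the probabilities of the rules used to generate it."""
    label, children = t[0], t[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_P[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

t1 = ("S", ("NP", "astronomers"),
      ("VP", ("V", "saw"),
             ("NP", ("NP", "stars"), ("PP", ("P", "with"), ("NP", "ears")))))
t2 = ("S", ("NP", "astronomers"),
      ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
             ("PP", ("P", "with"), ("NP", "ears"))))

print(tree_prob(t1))                  # ≈ 0.0009072
print(tree_prob(t2))                  # ≈ 0.0006804
print(tree_prob(t1) + tree_prob(t2))  # P(w15) ≈ 0.0015876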
Chomsky Normal Form
• All rules are of the form X  Y Z or X  w.
• A transformation to this form doesn’t change the
weak generative capacity of CFGs.
– With some extra book-keeping in symbol names, you
can even reconstruct the same trees with a
detransform
– Unaries/empties are removed recursively
– N-ary rules introduce new nonterminals:
• VP  V NP PP becomes VP  V @VP-V and @VP-V  NP PP
• In practice it’s a pain
– Reconstructing n-aries is easy
– Reconstructing unaries can be trickier
• But it makes parsing easier/more efficient
Treebank binarization
[Pipeline: n-ary trees in the treebank → TreeAnnotations.annotateTree → binary trees → lexicon and grammar → parsing (TODO: CKY parsing).]
An example: before binarization…
[Tree for "cats scratch people with claws":
  (ROOT (S (NP (N cats))
           (VP (V scratch) (NP (N people)) (PP (P with) (NP (N claws))))))]

After binarization..
[The same tree with intermediate @-symbols inserted so that no rule has more than two children on the right-hand side:
  (ROOT (S (NP (N cats))
           (@S->_NP (VP (V scratch)
                        (@VP->_V (NP (N people))
                                 (@VP->_V_NP (PP (P with)
                                                 (@PP->_P (NP (N claws))))))))))]

Step by step:
• Binary rule: PP → P NP was already binary, yet it still receives an intermediate symbol, becoming PP → P @PP->_P with @PP->_P → NP.
  Seems redundant? Reason: it makes it easier to see how to make finite-order horizontal markovizations – it's like a finite automaton (explained later).
• Ternary rule: VP → V NP PP becomes VP → V @VP->_V, @VP->_V → NP @VP->_V_NP, and @VP->_V_NP → PP. The symbol @VP->_V_NP remembers 2 siblings (V and NP).
• If there's a rule VP → V NP PP PP, @VP->_V_NP_PP will exist.
(A small sketch of this transformation follows.)
Treebank: empties and unaries
[Figure: the tree for the one-word sentence "Atone" under successive preprocessing steps: the original PTB tree (with function tags S-HLN and NP-SUBJ and an empty -NONE- subject), then NoFuncTags, then NoEmpties, then NoUnaries (shown with "High" and "Low" variants of unary removal).]
CKY Parsing (aka, CYK)
Cocke–Younger–Kasami (CYK or CKY) parsing is a
dynamic programming solution to identifying a valid
parse for a sentence.
Dynamic programming: simplifying a complicated
problem by breaking it down into simpler
subproblems in a recursive manner
CKY – Basic Idea
Let the input be a string S consisting of n words (symbols): a1 … an.
Let the grammar contain r nonterminal symbols R1 … Rr, of which the subset Rs is the set of start symbols.
Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
At each step, the algorithm sets P[i,j,k] to true if the subsequence of words (span) starting at i and of length j can be generated from Rk.
We start with spans of length 1 (individual words) and then proceed to increasingly larger spans, determining which ones are valid given the smaller spans that have already been processed.
CKY Algorithm
  For each i = 1 to n
    For each unit production Rj → ai
      set P[i,1,j] = true
  For each i = 2 to n                     -- length of span
    For each j = 1 to n-i+1               -- start of span
      For each k = 1 to i-1               -- partition of span
        For each production RA → RB RC
          If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
  If any P[1,n,x] is true, for x ranging over the indices of the start symbols Rs,
    then S is a member of the language
    else S is not a member of the language
CKY In Action
http://www.diotavelli.net/people/void/demos/cky.html
Probabilistic CKY
(This version doesn’t handle unaries)
Input: words, grammar.
Output: most likely parse, and its probability.
  For each left = 1 to #words                        // initialize: all length-1 spans (individual words)
    For each unit production Rj → words[left,left+1]
      set score[left,1,j] = P(Rj → words[left,left+1])
  For each span = 2 to #words                        // induction: increasing span length
    For each left = 1 to #words-span+1               // start of span
      For each mid = 1 to span-1                     // partition of span
        For each production RA → RB RC
          If score[left,mid,B] > 0 and score[left+mid,span-mid,C] > 0
            score = score[left,mid,B] * score[left+mid,span-mid,C] * P(RA → RB RC)
            If score > score[left,span,A]
              score[left,span,A] = score
              back[left,span,A] = (B, C, mid)
  Set parent = argmax over start symbols RS of score[1,#words,RS]
  Set score  = score[1,#words,parent]
  Return [score, buildTree(parent, 1, #words, back)]
buildTree
Input: root, left, span, backpointers
Output: tree

  Set tree.symbol = root
  If span = 1                                        // base case
    Set tree.child = words[left,left+1]
  Else                                               // recurse
    Set (B, C, mid) = backpointers[left, span, root]
    Set tree.leftChild  = buildTree(B, left, mid, backpointers)
    Set tree.rightChild = buildTree(C, left+mid, span-mid, backpointers)
  Return tree
Not shown: back pointer entries.
[Chart layout for the example sentence "cats scratch walls with claws": cells score[left][span], where left indexes the starting word (1 = cats, …, 5 = claws) and span is the number of words covered. Span-1 cells hold the preterminals; the top cell, covering all five words, holds the possible solutions for the whole sentence.]
Initialization
The span-1 cells are filled from the lexicon (preterminal probabilities):

  cats:     N → cats 0.1      V → cats 0.01
  scratch:  N → scratch 0.1   V → scratch 0.2
  walls:    N → walls 0.2     V → walls 0.01
  with:     P → with 0.5
  claws:    N → claws 0.1     V → claws 0.2
Induction: Span 2
Grammar probabilities used for the binary combinations:

  P(S → N V) = 0.1       P(VP → V N) = 0.1
  P(NP → N N) = 0.1      P(VP → V V) = 0.005
  P(NP → N P) = 0.01     P(VP → V P) = 0.02
  P(PP → P N) = 0.1

For span = 2, left = 1, mid = 1 ("cats scratch") the cell receives
  S → N V     0.002      (= 0.1 × 0.1 × 0.2)
  NP → N N    0.001
  VP → V N    0.0001
  VP → V V    0.00001
The same step is repeated for left = 2, 3, 4: "scratch walls" gets its own S, NP,
and VP entries, "walls with" gets NP → N P and VP → V P entries, and "with claws" gets
  PP → P N    0.005.
6
Induction: Span 4
6
5
Grammar Probabilities
P(S->N V) = .1
P(S->N VP) = .2
…
VP-> …
V @VP_V
6e-6
V NP
4e-6
S->N VP 2e-6
VP->V NP
2e-6
S->N VP 1e-6
VP->V NP
2e-7
Span 4:
P(NP->N N) = .1
P(NP->N PP) = .1
P(NP-> N P) = .01
Span = 4
Left = 2
Mid = 1
Span
4
3
P(VP->V N) = .1
P(VP->V NP) = .2
P(VP->V V) = .005
P(VP-> V P) = .02
P(VP->V PP) = .1
P(VP->V @VP_V) = .3
P(VP->VP PP) = .1
NP->N PP
1e-4
VP->V PP
5e-6
@VP_V->
N PP 1e-4
S->N V .002
S->N V .002
NP->N P
NP->N N .001 NP->N N .001 .000001
VP->V N.0001 VP->V N.0001 VP->V P
PP->P N
.005
.00001
2
N->cats 0.1
V->cats 0.01
N->scratch .1 N->walls .2
V->scratch .2 V->walls .01
P->with .5
N->claws .1
V->claws .2
P(@VP_V -> N PP) = .1
P(PP-> P N) = .1
Preterminals:
1
1
cats
2
scratch 3
walls
Left
4 with
5
claws
6
Induction: Span 4
6
5
Grammar Probabilities
P(S->N V) = .1
P(S->N VP) = .2
…
VP-> …
V @VP_V
6e-6
V NP
4e-6
VP PP
5e-8
S->N VP 2e-6
VP->V NP
2e-6
S->N VP 1e-6
VP->V NP
2e-7
Span 4:
Span
4
3
P(NP->N N) = .1
P(NP->N PP) = .1
P(NP-> N P) = .01
Span = 4
Left = 2
Mid = 2
P(VP->V N) = .1
P(VP->V NP) = .2
P(VP->V V) = .005
P(VP-> V P) = .02
P(VP->V PP) = .1
P(VP->V @VP_V) = .3
P(VP->VP PP) = .1
NP->N PP
1e-4
VP->V PP
5e-6
@VP_V->
N PP 1e-4
S->N V .002
S->N V .002
NP->N P
NP->N N .001 NP->N N .001 .000001
VP->V N.0001 VP->V N.0001 VP->V P
PP->P N
.005
.00001
2
N->cats 0.1
V->cats 0.01
N->scratch .1 N->walls .2
V->scratch .2 V->walls .01
P->with .5
N->claws .1
V->claws .2
P(@VP_V -> N PP) = .1
P(PP-> P N) = .1
Preterminals:
1
1
cats
2
scratch 3
walls
Left
4 with
5
claws
6
Induction: Span 4
6
5
Grammar Probabilities
P(S->N V) = .1
P(S->N VP) = .2
…
VP-> …
V @VP_V
6e-6
V NP
4e-6
VP PP
5e-8
S->N VP 2e-6
VP->V NP
2e-6
S->N VP 1e-6
VP->V NP
2e-7
Span 4:
Span
4
3
P(NP->N N) = .1
P(NP->N PP) = .1
P(NP-> N P) = .01
Span = 4
Left = 2
Mid = 3
P(VP->V N) = .1
P(VP->V NP) = .2
P(VP->V V) = .005
P(VP-> V P) = .02
P(VP->V PP) = .1
P(VP->V @VP_V) = .3
P(VP->VP PP) = .1
NP->N PP
1e-4
VP->V PP
5e-6
@VP_V->
N PP 1e-4
S->N V .002
S->N V .002
NP->N P
NP->N N .001 NP->N N .001 .000001
VP->V N.0001 VP->V N.0001 VP->V P
PP->P N
.005
.00001
2
N->cats 0.1
V->cats 0.01
N->scratch .1 N->walls .2
V->scratch .2 V->walls .01
P->with .5
N->claws .1
V->claws .2
P(@VP_V -> N PP) = .1
P(PP-> P N) = .1
Preterminals:
1
1
cats
2
scratch 3
walls
Left
4 with
5
claws
6
Induction: Span 4
6
5
Grammar Probabilities
P(S->N V) = .1
P(S->N VP) = .2
…
VP-> …
V @VP_V
6e-6
V NP
4e-6
VP PP
5e-8
S->N VP 2e-6
VP->V NP
2e-6
S->N VP 1e-6
VP->V NP
2e-7
Span 4:
Span
4
3
P(NP->N N) = .1
P(NP->N PP) = .1
P(NP-> N P) = .01
Span = 4
Left = 2
P(VP->V N) = .1
P(VP->V NP) = .2
P(VP->V V) = .005
P(VP-> V P) = .02
P(VP->V PP) = .1
P(VP->V @VP_V) = .3
P(VP->VP PP) = .1
NP->N PP
1e-4
VP->V PP
5e-6
@VP_V->
N PP 1e-4
S->N V .002
S->N V .002
NP->N P
NP->N N .001 NP->N N .001 .000001
VP->V N.0001 VP->V N.0001 VP->V P
PP->P N
.005
.00001
2
N->cats 0.1
V->cats 0.01
N->scratch .1 N->walls .2
V->scratch .2 V->walls .01
P->with .5
N->claws .1
V->claws .2
P(@VP_V -> N PP) = .1
P(PP-> P N) = .1
Preterminals:
1
1
cats
2
scratch 3
walls
Left
4 with
5
claws
6
Induction: Span 4
6
5
Grammar Probabilities
P(S->N V) = .1
P(S->N VP) = .2
…
VP->
V @VP_V
6e-6
Span 4:
back =
(left = V, right =
@VP_V, mid=1)
4
S->N VP 2e-6
VP->V NP
2e-6
Span
P(NP->N N) = .1
P(NP->N PP) = .1
P(NP-> N P) = .01
Span = 4
Left = 2
S->N VP 1e-6
VP->V NP
2e-7
3
P(VP->V N) = .1
P(VP->V NP) = .2
P(VP->V V) = .005
P(VP-> V P) = .02
P(VP->V PP) = .1
P(VP->V @VP_V) = .3
P(VP->VP PP) = .1
NP->N PP
1e-4
VP->V PP
5e-6
@VP_V->
N PP 1e-4
S->N V .002
S->N V .002
NP->N P
NP->N N .001 NP->N N .001 .000001
VP->V N.0001 VP->V N.0001 VP->V P
PP->P N
.005
.00001
2
N->cats 0.1
V->cats 0.01
N->scratch .1 N->walls .2
V->scratch .2 V->walls .01
P->with .5
N->claws .1
V->claws .2
P(@VP_V -> N PP) = .1
P(PP-> P N) = .1
Preterminals:
1
1
cats
2
scratch 3
walls
Left
4 with
5
claws
6
Final Chart
The top cell of the completed chart contains
  S → N VP   1.2e-7    back = (left = N, right = VP, mid = 1)
so the most likely parse of the whole sentence has probability 1.2e-7, and its
tree is recovered by following the backpointers.
Corresponding Tree
(This is different from the tree shown before, because this one doesn't include unaries.)

  (S (N cats)
     (VP (V scratch)
         (@VP_V (N walls)
                (PP (P with) (N claws)))))

probability = score = 1.2e-7
(A short usage example reproducing this result with the earlier pcky sketch follows.)
Extended CKY parsing
• Unaries can be incorporated into the algorithm
– Messy, but doesn't increase algorithmic complexity (a sketch of one approach follows this slide)
• Empties can be incorporated
– Use fenceposts
– Doesn’t increase complexity; essentially like unaries
• Binarization is vital
– Without binarization, you don’t get parsing cubic in the
length of the sentence
• Binarization may be an explicit transformation or implicit in how the parser works (Earley-style dotted rules), but it's always there.
Efficient CKY parsing
• CKY parsing can be made very fast (!), partly due to the
simplicity of the structures used.
– But that means a lot of the speed comes from engineering
details
– And a little from cleverer filtering
– Store chart as a (ragged) 3-dimensional array of floats (log probabilities)
  • score[start][end][category]  (see the sketch after this slide)
– For treebank grammars the load is high enough that you don’t really gain
from lists of things that were possible
– 50wds: (50x50)/2x(1000 to 20000)x4 bytes = 5–100MB for parse triangle.
Large (can move to beam for span[i][j]).
– Use int to represent categories/words (Index)
Efficient CKY parsing
• Provide efficient grammar/lexicon accessors:
– E.g., return list of rules with this left child category
– Iterate over left child, check for zero (Neg. inf.) prob of
X:[i,j] (abort loop), otherwise get rules with X on left
– Some X:[i,j] can be filtered based on the input string
• Not enough space to complete a long flat rule?
• No word in the string can be a CC?
– Using a lexicon of possible POS for words gives a lot of constraint
rather than allowing all POS for words
• Cf. later discussion of figures-of-merit/A* heuristics
Runtime in practice: super-cubic!
[Plot: parsing time (seconds, 0–360) vs. sentence length (0–50); the best-fit exponent is 3.47.]
• Super-cubic in practice! Why?
How good are PCFGs?
• Robust (usually admit everything, but with low probability)
• Partial solution for grammar ambiguity: a PCFG gives some idea of
the plausibility of a sentence
• But not so good because the independence assumptions are too
strong
• Give a probabilistic language model
– But in a simple case it performs worse than a trigram model
• The problem seems to be that it lacks the lexicalization of a trigram model
Parser Evaluation
Evaluating Parsing Accuracy
• Most sentences are not given a completely
correct parse by any currently existing parsers.
• Standardly for Penn Treebank parsing, evaluation
is done in terms of the percentage of correct
constituents (labeled spans).
• A constituent is a triple [label, start, finish]; all three must match a constituent of the true parse for it to be counted as correct.
Evaluating Constituent Accuracy: LP/LR
measure
• Let C be the number of correct constituents produced by the parser over
the test set, M be the total number of constituents produced, and N be
the total in the correct version [microaveraged]
• Precision = C/M
• Recall = C/N
• It is possible to artificially inflate either one.
• Thus people typically give the F-measure (harmonic mean) of the two.
Not a big issue here; like average.
• This isn't necessarily a great measure … many people, myself included, think dependency accuracy would be better. (A small sketch of the computation follows.)
Extensions to basic PCFG Parsing
Many, many possibilities
• Tree Annotations
– Lexicalization
– Grandparent, sibling, etc. annotations
– Manual label splitting
– Latent label splitting
• Horizontal and Vertical Markovization
• Discriminative Reranking
Putting words into PCFGs
• A PCFG uses the actual words only to determine the probability of
parts-of-speech (the preterminals)
• In many cases we need to know about words to choose a parse
• The head word of a phrase gives a good representation of the
phrase’s structure and meaning
– Attachment ambiguities
The astronomer saw the moon with the telescope
– Coordination
the dogs in the house and the cats
– Subcategorization frames
put versus like
(Head) Lexicalization
• put takes both an NP and a PP
– Sue put [ the book ]NP [ on the table ]PP
– * Sue put [ the book ]NP
– * Sue put [ on the table ]PP
• like usually takes an NP and not a PP
– Sue likes [ the book ]NP
– * Sue likes [ on the table ]PP
• We can’t tell this if we just have a VP with a verb,
but we can if we know which verb it is
(Head) Lexicalization
• Collins 1997, Charniak 1997
• Puts the properties of words into a PCFG
[Lexicalized tree for "Sue walked into the store":
  (S-walked (NP-Sue Sue)
            (VP-walked (V-walked walked)
                       (PP-into (P-into into)
                                (NP-store (DT-the the) (NP-store store)))))]
(A small sketch of head propagation follows.)
Lexicalized Parsing was seen as the
breakthrough of the late 90s
• Eugene Charniak, 2000 JHU workshop: “To do
better, it is necessary to condition probabilities
on the actual words of the sentence. This makes
the probabilities much tighter:
– p(VP  V NP NP)
– p(VP  V NP NP | said)
– p(VP  V NP NP | gave)
= 0.00151
= 0.00001
= 0.01980
”
• Michael Collins, 2003 COLT tutorial: “Lexicalized
Probabilistic Context-Free Grammars … perform
vastly better than PCFGs (88% vs. 73% accuracy)”
Michael Collins (2003, COLT)
Klein and Manning, "Accurate Unlexicalized Parsing":
PCFGs and Independence
• The symbols in a PCFG define independence
assumptions:
  S → NP VP
  NP → DT NN
[Figure: a tree fragment with an S node expanding to NP and VP.]
– At any node, the material inside that node is
independent of the material outside that node,
given the label of that node.
– Any information that statistically connects
behavior inside and outside a node must flow
through that node.
Michael Collins (2003, COLT)
Non-Independence I
• Independence assumptions are often too strong.
[Chart: relative frequency of NP expansions (NP → NP PP, NP → DT NN, NP → PRP) for all NPs, NPs under S, and NPs under VP; pronoun (PRP) expansions are far more common for NPs under S (subjects), while NP PP expansions are far more common under VP (objects).]
• Example: the expansion of an NP is highly dependent
on the parent of the NP (i.e., subjects vs. objects).
Non-Independence II
• Who cares?
– NB, HMMs, all make false assumptions!
– For generation, consequences would be obvious.
– For parsing, does it impact accuracy?
• Symptoms of overly strong assumptions:
– Rewrites get used where they don’t belong.
– Rewrites get used too often or too rarely.
(In the PTB, this construction is for possessives.)
Breaking Up the Symbols
• We can relax independence assumptions by
encoding dependencies into the PCFG symbols:
  – Parent annotation [Johnson 98]
  – Marking possessive NPs
• What are the most useful features to encode?
Annotations
• Annotations split the grammar categories into
sub-categories.
• Conditioning on history vs. annotating
– P(NP^S  PRP) is a lot like P(NP  PRP | S)
– P(NP-POS  NNP POS) isn’t history conditioning.
• Feature grammars vs. annotation
– Can think of a symbol like NP^NP-POS as
NP [parent:NP, +POS]
• After parsing with an annotated grammar, the
annotations are then stripped for evaluation.
Lexicalization
• Lexical heads are important for certain classes of
ambiguities (e.g., PP attachment):
• Lexicalizing grammar creates a much larger grammar.
– Sophisticated smoothing needed
– Smarter parsing algorithms needed
– More data needed
• How necessary is lexicalization?
– Bilexical vs. monolexical selection
– Closed vs. open class lexicalization
Experimental Setup
• Corpus: Penn Treebank, WSJ
    Training:      sections 02-21
    Development:   section 22 (first 20 files)
    Test:          section 23
• Accuracy – F1: harmonic mean of per-node labeled precision and recall.
• Size – number of symbols in grammar.
  – Passive / complete symbols: NP, NP^S
  – Active / incomplete symbols: NP → NP CC •
Experimental Process
• We’ll take a highly conservative approach:
– Annotate as sparingly as possible
– Highest accuracy with fewest symbols
– Error-driven, manual hill-climb, adding one
annotation type at a time
Unlexicalized PCFGs
• What do we mean by an “unlexicalized” PCFG?
– Grammar rules are not systematically specified down to
the level of lexical items
• NP-stocks is not allowed
• NP^S-CC is fine
– Closed vs. open class words (NP^S-the)
• Long tradition in linguistics of using function words as features
or markers for selection
• Contrary to the bilexical idea of semantic heads
• Open-class selection really a proxy for semantics
• Honesty checks:
– Number of symbols: keep the grammar very small
– No smoothing: over-annotating is a real danger
Horizontal Markovization
• Horizontal Markovization merges states.
[Charts: labeled F1 (≈70–74%) and grammar size (up to ≈12,000 symbols) as a function of horizontal Markov order (0, 1, 2v, 2, inf).]
Vertical Markovization
• Vertical Markov order: rewrites depend on the past k ancestor nodes (cf. parent annotation); compare Order 1 vs. Order 2.
[Charts: labeled F1 (≈72–79%) and grammar size (up to ≈25,000 symbols) as a function of vertical Markov order (1, 2v, 2, 3v, 3).]
Vertical and Horizontal
[3-D charts: F1 (≈66–80%) and grammar size (up to ≈25,000 symbols) over combinations of horizontal Markov order (0, 1, 2v, 2, inf) and vertical Markov order (1, 2, 3).]
• Examples:
  – Raw treebank:  v=1, h=∞
  – Johnson 98:    v=2, h=∞
  – Collins 99:    v=2, h=2
  – Best F1:       v=3, h=2v

  Model           F1     Size
  Base: v=h=2v    77.8   7.5K
Unary Splits
• Problem: unary rewrites are used to transmute categories so that a high-probability rule can be used.
• Solution: mark unary rewrite sites with -U.

  Annotation   F1     Size
  Base         77.8   7.5K
  UNARY        78.3   8.0K
Tag Splits
• Problem: treebank tags are too coarse.
• Example: sentential, PP, and other prepositions are all marked IN.
• Partial solution: subdivide the IN tag.

  Annotation   F1     Size
  Previous     78.3   8.0K
  SPLIT-IN     80.3   8.1K
Other Tag Splits
• UNARY-DT: mark demonstratives as DT^U ("the X" vs. "those")         F1 80.4, Size 8.1K
• UNARY-RB: mark phrasal adverbs as RB^U ("quickly" vs. "very")       F1 80.5, Size 8.1K
• TAG-PA: mark tags with non-canonical parents ("not" is an RB^VP)    F1 81.2, Size 8.5K
• SPLIT-AUX: mark auxiliary verbs with -AUX [cf. Charniak 97]         F1 81.6, Size 9.0K
• SPLIT-CC: separate "but" and "&" from other conjunctions            F1 81.7, Size 9.1K
• SPLIT-%: "%" gets its own tag                                       F1 81.8, Size 9.3K
Treebank Splits
• The treebank comes with annotations (e.g., -LOC, -SUBJ, etc.).
  – The whole set together hurt the baseline.
  – Some (-SUBJ) were less effective than our equivalents.
  – One in particular was very useful (NP-TMP) when pushed down to the head tag.
  – We marked gapped S nodes as well.

  Annotation   F1     Size
  Previous     81.8   9.3K
  NP-TMP       82.2   9.6K
  GAPPED-S     82.3   9.7K
Yield Splits
• Problem: sometimes the behavior of a category depends on something inside its future yield.
• Examples:
  – Possessive NPs
  – Finite vs. infinite VPs
  – Lexical heads!
• Solution: annotate future elements into nodes.

  Annotation   F1     Size
  Previous     82.3   9.7K
  POSS-NP      83.1   9.8K
  SPLIT-VP     85.7   10.5K
Distance / Recursion Splits
• Problem: vanilla PCFGs cannot distinguish attachment heights.
  [Figure: a PP can attach high, to a VP node that dominates a verb (-v), or low, to an NP node.]
• Solution: mark a property of higher or lower sites:
  – Contains a verb.
  – Is (non)-recursive.
    • Base NPs [cf. Collins 99]
    • Right-recursive NPs

  Annotation     F1     Size
  Previous       85.7   10.5K
  BASE-NP        86.0   11.7K
  DOMINATES-V    86.9   14.1K
  RIGHT-REC-NP   87.0   15.2K
A Fully Annotated Tree
Final Test Set Results

  Parser         LP     LR     F1     CB     0 CB
  Magerman 95    84.9   84.6   84.7   1.26   56.6
  Collins 96     86.3   85.8   86.0   1.14   59.9
  Klein & M 03   86.9   85.7   86.3   1.10   60.3
  Charniak 97    87.4   87.5   87.4   1.00   62.1
  Collins 99     88.7   88.6   88.6   0.90   67.1

• Beats "first generation" lexicalized parsers.
Bilexical statistics are used often
[Bikel 2004]
• The 1.49% use of bilexical dependencies suggests they don't play much of a role in parsing
• But the parser pursues many (very) incorrect theories
• So, instead of asking how often the decoder can use bigram probability on average, ask how often it does so while pursuing its top-scoring theory
• Answer the question by having the parser constrain-parse its own output:
  – train as normal on §§02–21
  – parse §00
  – feed the resulting parse trees back in as constraints
• The percentage of the time the parser made use of bigram statistics shot up to 28.8%
• So bilexical statistics are used often, but their use barely affects overall parsing accuracy
• Exploratory data analysis suggests an explanation:
  – distributions that include head words are usually sufficiently similar to those that do not as to make almost no difference in terms of accuracy
Charniak (2000) NAACL:
A Maximum-Entropy-Inspired Parser
• There was nothing maximum entropy about it. It was a cleverly
smoothed generative model
• Smoothes estimates by smoothing the ratio of conditional terms (which are a bit like maxent features):

      P(t | l, lp, tp, lg) / P(t | l, lp, tp)
• Biggest improvement is actually that generative model predicts head
tag first and then does P(w|t,…)
– Like Collins (1999)
• Markovizes rules similarly to Collins (1999)
• Gets 90.1% LP/LR F score on sentences ≤ 40 wds
Petrov and Klein (2006):
Learning Latent Annotations
Can you automatically find good symbols?
• Brackets are known
• Base categories are known
• Induce subcategories
• Clever split/merge category refinement
EM algorithm, like Forward–Backward for HMMs, but constrained by the tree.
[Figure: inside/outside decomposition over the parse of "He was right", with latent subcategories X1 … X7 at the tree nodes.]
Number of phrasal subcategories
[Chart: number of automatically induced subcategories (0–40) per phrasal category, ordered from fewest to most: LST, ROOT, X, WHADJP, RRC, SBARQ, INTJ, WHADVP, UCP, NAC, FRAG, CONJP, SQ, WHPP, PRT, SINV, NX, PRN, WHNP, QP, SBAR, ADJP, S, ADVP, PP, VP, NP.]
POS tag splits, commonest words:
effectively a class-based model

• Proper nouns (NNP):
  NNP-14: Oct., Nov., Sept.
  NNP-12: John, Robert, James
  NNP-2:  J., E., L.
  NNP-1:  Bush, Noriega, Peters
  NNP-15: New, San, Wall
  NNP-3:  York, Francisco, Street
• Personal pronouns (PRP):
  PRP-0: It, He, I
  PRP-1: it, he, they
  PRP-2: it, them, him
The Latest Parsing Results…

  Parser                                              F1 (≤ 40 words)   F1 (all words)
  Klein & Manning unlexicalized, 2003                 86.3              85.7
  Matsuzaki et al. simple EM latent states, 2005      86.7              86.1
  Charniak generative ("maxent inspired"), 2000       90.1              89.5
  Petrov and Klein, NAACL 2007                        90.6              90.1
  Charniak & Johnson discriminative reranker, 2005    92.0              91.4