Corpora and Statistical Methods
Lecture 11
Albert Gatt
Part 2
Statistical parsing
Preliminary issues
How parsers are evaluated
Evaluation
 The issue:
 what objective criterion are we trying to maximise?
 i.e. under what objective function can I say that my parser does “well” (and how well?)
 we need a gold standard
 Possibilities:
 strict match of the candidate parse against the gold standard
 match of the components of the candidate parse against the gold standard components
Evaluation
 A classic set of evaluation metrics is PARSEVAL
 an initiative to compare parsers on the same data
 not initially concerned with stochastic parsers
 evaluates parser output piece by piece
 Main components:
 compares gold standard tree to parser tree
 typically, gold standard is the tree in a treebank
 computes:
 precision
 recall
 crossing brackets
PARSEVAL: labeled recall

labeled recall = # correct nodes in candidate parse / # nodes in treebank parse

 Correct node = node in the candidate parse which:
 has the same node label
 (labels were originally omitted from PARSEVAL to avoid theoretical conflicts)
 spans the same words
PARSEVAL: labeled precision

labeled precision = # correct nodes in candidate parse / # nodes in candidate parse

 The proportion of correctly labelled and correctly spanning nodes in the candidate.
Combining Precision and Recall

 As usual, precision and recall can be combined into a single F-measure:

F = 1 / (α · (1/P) + (1 − α) · (1/R))
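A minimal sketch of how labelled precision, recall and the F-measure above could be computed, assuming each parse is represented as a collection of (label, start, end) constituents. The function name and representation are illustrative, not part of the original PARSEVAL tooling.

# PARSEVAL-style labelled precision/recall/F-measure (sketch).
from collections import Counter

def parseval_scores(candidate, gold, alpha=0.5):
    """candidate, gold: iterables of (label, start, end) tuples."""
    cand = Counter(candidate)
    ref = Counter(gold)
    # A candidate node is correct if a gold node with the same label and span exists.
    correct = sum((cand & ref).values())
    precision = correct / sum(cand.values())
    recall = correct / sum(ref.values())
    # F-measure as on the slide: F = 1 / (alpha/P + (1 - alpha)/R)
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f

# Example: candidate differs from the gold standard in one bracket.
gold = [("S", 0, 3), ("NP", 0, 2), ("N", 2, 3)]
cand = [("S", 0, 3), ("NP", 1, 3), ("N", 2, 3)]
print(parseval_scores(cand, gold))   # (0.666..., 0.666..., 0.666...)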
PARSEVAL: crossed brackets
 the number of brackets in the candidate parse which cross brackets in the treebank parse
 e.g. the treebank has ((X Y) Z) and the candidate has (X (Y Z))
 Unlike precision/recall, this is an objective function to minimise
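A sketch of the crossing-brackets count under the same (start, end) span representation as above: two spans cross when they overlap without either containing the other. Names are again illustrative.

def crossing_brackets(candidate_spans, gold_spans):
    # A candidate bracket (i, j) crosses a gold bracket (k, l) if the spans
    # overlap but neither contains the other.
    def crosses(a, b):
        (i, j), (k, l) = a, b
        return (i < k < j < l) or (k < i < l < j)
    return sum(any(crosses(c, g) for g in gold_spans) for c in candidate_spans)

# Treebank ((X Y) Z) = spans (0,2),(0,3); candidate (X (Y Z)) = (1,3),(0,3)
print(crossing_brackets([(1, 3), (0, 3)], [(0, 2), (0, 3)]))   # 1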
Current performance
 Current parsers achieve:
 ca. 90% precision
 >90% recall
 1% cross-bracketed constituents
Some issues with PARSEVAL
1.
These measures evaluate parses at the level of individual decisions (nodes); they ignore the difficulty of getting a globally correct solution by carrying out a correct sequence of decisions.
2.
Success on crossing brackets depends on the kind of parse trees used: the Penn Treebank has very flat trees (not much embedding), so the likelihood of crossed brackets decreases.
3.
In PARSEVAL, if a constituent is attached lower in the tree than in the gold standard, all its daughters are counted wrong.
Probabilistic parsing with PCFGs
The basic algorithm
The basic PCFG parsing algorithm
 Many statistical parsers use a version of the CYK algorithm.
 Assumptions:
 CFG productions are in Chomsky Normal Form:
 A → B C
 A → a
 Use indices between words:
 Book the flight through Houston
 (0) Book (1) the (2) flight (3) through (4) Houston (5)
 Procedure (bottom-up):
 Traverse input sentence left-to-right
 Use a chart to store constituents and their span + their probability.
Probabilistic CYK: example PCFG
 S → NP VP [.80]
 NP → Det N [.30]
 VP → V NP [.20]
 V → includes [.05]
 Det → the [.4]
 Det → a [.4]
 N → meal [.01]
 N → flight [.02]
Probabilistic CYK: initialisation
 The flight includes a meal.
//Lexical lookup:
for j = 1 to length(string) do:
    chart[j-1, j] := {X : X -> word in G}
    //Syntactic lookup:
    for i = j-2 down to 0 do:
        chart[i, j] := {}
        for k = i+1 to j-1 do:
            for each A -> B C in G do:
                if B in chart[i, k] and C in chart[k, j]:
                    chart[i, j] := chart[i, j] U {A}
[Chart: empty chart over word positions 0-5]
Probabilistic CYK: lexical step
 The flight includes a meal.
[Chart after lexical lookup of "the": adds (0,1) Det .4]
Probabilistic CYK: lexical step
 The flight includes a meal.
[Chart after lexical lookup of "flight": adds (1,2) N .02]
Probabilistic CYK: syntactic step
 The flight includes a meal.
[Chart after syntactic step: adds (0,2) NP .0024 (= .30 × .4 × .02)]
Probabilistic CYK: lexical step
 The flight includes a meal.
[Chart after lexical lookup of "includes": adds (2,3) V .05]
Probabilistic CYK: lexical step
 The flight includes a meal.
[Chart after lexical lookup of "a": adds (3,4) Det .4]
Probabilistic CYK: lexical step
 The flight includes a meal.
[Chart after lexical lookup of "meal": adds (4,5) N .01]
Probabilistic CYK: syntactic step
 The flight includes a meal.
[Chart after syntactic step: adds (3,5) NP .001]
Probabilistic CYK: syntactic step
 The flight includes a meal.
[Chart after syntactic step: adds (2,5) VP .00001 (= .20 × .05 × .001)]
Probabilistic CYK: syntactic step
 The flight includes a meal.
[Chart after syntactic step: adds (0,5) S .0000000192 (= .80 × .0024 × .00001)]
Probabilistic CYK: summary
 Cells in chart hold probabilities
 Bottom-up procedure computes probability of a parse
incrementally.
 To obtain parse trees, cells need to be augmented with
backpointers.
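A minimal Python sketch of the probabilistic CYK procedure just summarised, using the example grammar from the preceding slides; the data structures and helper names are illustrative. Each cell keeps the best probability per non-terminal, and backpointers let the parse tree be recovered.

# Probabilistic CYK with backpointers over the example PCFG (sketch).
lexical = {             # A -> word : probability
    ("V", "includes"): .05, ("Det", "the"): .4, ("Det", "a"): .4,
    ("N", "meal"): .01, ("N", "flight"): .02,
}
binary = {              # A -> B C : probability
    ("S", "NP", "VP"): .80, ("NP", "Det", "N"): .30, ("VP", "V", "NP"): .20,
}

def pcyk(words):
    n = len(words)
    chart = {}          # (i, j, A) -> best probability of A spanning words[i:j]
    back = {}           # (i, j, A) -> backpointer: word, or (k, B, C) split
    # Lexical step: fill the width-1 cells.
    for j, w in enumerate(words, start=1):
        for (A, word), p in lexical.items():
            if word == w:
                chart[(j - 1, j, A)] = p
                back[(j - 1, j, A)] = w
    # Syntactic step: fill wider cells bottom-up, keeping the best split per label.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if (i, k, B) in chart and (k, j, C) in chart:
                        prob = p * chart[(i, k, B)] * chart[(k, j, C)]
                        if prob > chart.get((i, j, A), 0.0):
                            chart[(i, j, A)] = prob
                            back[(i, j, A)] = (k, B, C)
    return chart, back

def tree(back, i, j, A):
    # Follow backpointers to recover the best parse rooted in A over span (i, j).
    bp = back[(i, j, A)]
    if isinstance(bp, str):
        return (A, bp)
    k, B, C = bp
    return (A, tree(back, i, k, B), tree(back, k, j, C))

words = "the flight includes a meal".split()
chart, back = pcyk(words)
print(chart[(0, 5, "S")])   # about 2.3e-8 (slides show 1.92e-8 because the (3,5) NP cell is rounded to .001)
print(tree(back, 0, 5, "S"))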
Probabilistic parsing with lexicalised PCFGs
Main approaches (focus on Collins (1997,1999))
see also: Charniak (1997)
Unlexicalised PCFG Estimation
 Charniak (1996) used Penn Treebank POS and phrasal
categories to induce a maximum likelihood PCFG
 only used relative frequency of local trees as the estimates for rule
probabilities
 did not apply smoothing or any other techniques
 Works surprisingly well:
 80.4% recall; 78.8% precision (crossed brackets not estimated)
 Suggests that most parsing decisions are mundane and can be handled well by an unlexicalised PCFG
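A minimal sketch of this kind of relative-frequency (maximum likelihood) estimation from treebank local trees, with no smoothing; the tree encoding and function names are assumptions made for illustration, not Charniak's actual system.

# MLE PCFG estimation by relative frequency of local trees (sketch).
from collections import Counter, defaultdict

def count_rules(tree, counts):
    """tree: (label, child, child, ...) nested tuples; leaves are word strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, counts)

def mle_pcfg(treebank):
    counts = Counter()
    for t in treebank:
        count_rules(t, counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    # P(A -> rhs) = Count(A -> rhs) / Count(A)
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

treebank = [("S", ("NP", ("N", "Sue")), ("VP", ("V", "left")))]
print(mle_pcfg(treebank))   # each rule seen once per LHS, so probability 1.0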
Probabilistic lexicalised PCFGs
 Standard format of lexicalised rules:
 associate the head word with each non-terminal
 e.g. dumped sacks into
 VP(dumped) → VBD(dumped) NP(sacks) PP(into)
 associate the head tag with each non-terminal
 VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
 Types of rules:
 lexical rules expand pre-terminals to words:
 e.g. NNS(sacks,NNS) → sacks
 probability is always 1
 internal rules expand non-terminals
 e.g. VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
Estimating probabilities
 Non-generative model:
 take an MLE estimate of the probability of the entire rule:

P(rule) = Count(VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)) / Count(VP(dumped,VBD))

 non-generative models suffer from serious data sparseness problems
 Generative model:
 estimate the probability of a rule by breaking it up into sub-rules.
Collins Model 1
 Main idea:
 represent CFG rules as expansions into Head + left modifiers + right modifiers:

LHS → STOP Ln Ln-1 … L1 H R1 … Rn-1 Rn STOP

 Li/Ri is of the form L/R(word, tag); e.g. NP(sacks,NNS)
 STOP: special symbol indicating the left/right boundary.
 Parsing:
 Given the LHS, generate the head of the rule, then the left modifiers (until STOP) and the right modifiers (until STOP), inside-out.
 Each step has a probability.
Collins Model 1: example
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht):
   PH(H(hw,ht) | Parent, hw, ht)
   = PH(VBD(dumped,VBD) | VP(dumped,VBD))
Collins Model 1: example
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht):
   PH(H(hw,ht) | Parent, hw, ht)
2. Left modifiers:
   ∏ over i = 1 … n+1 of PL(Li(lwi,lti) | Parent, H, hw, ht)
   = PL(STOP | VP(dumped,VBD), VBD(dumped,VBD))
Collins Model 1: example
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht):
   PH(H(hw,ht) | Parent, hw, ht)
2. Left modifiers:
   ∏ over i = 1 … n+1 of PL(Li(lwi,lti) | Parent, H, hw, ht)
3. Right modifiers:
   ∏ over i = 1 … n+1 of PR(Ri(rwi,rti) | Parent, H, hw, ht)
   = PR(NP(sacks,NNS) | VP(dumped,VBD), VBD(dumped,VBD))
   × PR(PP(into,IN) | VP(dumped,VBD), VBD(dumped,VBD))
   × PR(STOP | VP(dumped,VBD), VBD(dumped,VBD))
Collins Model 1: example
VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
1. Head H(hw,ht):
   PH(H(hw,ht) | Parent, hw, ht)
2. Left modifiers:
   ∏ over i = 1 … n+1 of PL(Li(lwi,lti) | Parent, H, hw, ht)
3. Right modifiers:
   ∏ over i = 1 … n+1 of PR(Ri(rwi,rti) | Parent, H, hw, ht)
4. Total probability: multiplication of (1) – (3)
Variations on Model 1: distance
 Collins proposed to extend the rules by conditioning on the distance of modifiers from the head:

PL(Li(lwi,lti) | P, H, hw, ht, distanceL(i-1))
PR(Ri(rwi,rti) | P, H, hw, ht, distanceR(i-1))

 the distance is a function of the yield of the modifiers seen so far.
[Diagram: the distance used for the R2 probability is computed over the words under R1]
Using a distance function
 Simplest kind of distance function is a tuple of binary features:
 Is the string of length 0?
 Does the string contain a verb?
 …
 Example uses:
 if the string has length 0, PR should be higher:
 English is right-branching & most right modifiers are adjacent to the head verb
 if string contains a verb, PR should be higher:
 accounts for preference to attach dependencies to main verb
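A sketch of a distance function of this kind, returning a tuple of binary features computed over the material already generated between the head and the current modifier; the verb tag set and the choice of features are illustrative assumptions.

# Distance as a tuple of binary features over the intervening surface string (sketch).
VERB_TAGS = {"VB", "VBD", "VBZ", "VBP", "VBN", "VBG"}

def distance_features(intervening):
    """intervening: list of (word, tag) pairs between the head and the current modifier."""
    is_adjacent = len(intervening) == 0          # is the string of length 0?
    contains_verb = any(tag in VERB_TAGS for _, tag in intervening)
    return (is_adjacent, contains_verb)

# Modifier directly after the head verb vs. one with a verb in between:
print(distance_features([]))                                   # (True, False)
print(distance_features([("sacks", "NNS"), ("were", "VBD")]))  # (False, True)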
Further additions
 Collins Model 2:
 subcategorisation preferences
 distinction between complements and adjuncts.
 Collins Model 3:
 augmented to deal with long-distance (WH-) dependencies.
Smoothing and backoff
 Rules may condition on words that never occur in the training data.
 Collins used a 3-level backoff model, combined using linear interpolation:
1. use the head word:
   PR(Ri(rwi,rti) | Parent, H, hw, ht)
2. use the head tag:
   PR(Ri(rwi,rti) | Parent, H, ht)
3. parent only:
   PR(Ri(rwi,rti) | Parent)
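A sketch of combining the three backoff levels by linear interpolation. The estimates and interpolation weights below are placeholders; Collins derives the weights from counts, which is omitted here.

# Linear interpolation of three backoff estimates (sketch).
def interpolated_prob(e1, e2, e3, lambda1=0.6, lambda2=0.3):
    """e1: estimate conditioned on the head word; e2: on the head tag; e3: on the parent only."""
    lambda3 = 1.0 - lambda1 - lambda2
    return lambda1 * e1 + lambda2 * e2 + lambda3 * e3

# Head word never seen with this modifier in training: e1 = 0, so back off to e2 and e3.
print(interpolated_prob(0.0, 0.05, 0.02))   # 0.6*0 + 0.3*0.05 + 0.1*0.02 = 0.017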
Other parsing approaches
Data-oriented parsing
 Alternative to “grammar-based” models
 does not attempt to derive a grammar from a treebank
 treebank data is stored as fragments of trees
 parser uses whichever trees seem to be useful
Data-oriented parsing
 Suppose we want to parse Sue heard Jim.
 The corpus contains the following potentially useful fragments:
[Diagram: corpus tree fragments that the parser can combine to give a parse]
Data-oriented Parsing
 Multiple fundamentally distinct derivations of a single tree.
 Parse using Monte Carlo simulation methods:
 randomly produce a large sample of derivations
 use these to find the most probable parse
 disadvantage: needs very large samples to make parses accurate,
therefore potentially slow
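The sampling idea can be sketched generically, assuming some sample_derivation function that randomly assembles corpus fragments into a tree; the toy sampler below merely stands in for such a DOP sampler.

# Monte Carlo parse selection: sample derivations, return the most frequent tree (sketch).
from collections import Counter
import random

def most_probable_parse(sample_derivation, sentence, n_samples=10_000):
    counts = Counter(sample_derivation(sentence) for _ in range(n_samples))
    tree, _ = counts.most_common(1)[0]
    return tree

# Toy stand-in sampler: two competing analyses with a 70/30 preference.
def toy_sampler(sentence):
    return random.choices(["(S (NP Sue) (VP (V heard) (NP Jim)))",
                           "(S (NP Sue) (VP (V heard) Jim))"], weights=[.7, .3])[0]

print(most_probable_parse(toy_sampler, "Sue heard Jim"))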
Data-oriented parsing vs. PCFGs
 Possible advantages:
 using partial trees directly accounts for lexical dependencies
 also accounts for multi-word expressions and idioms (e.g. take
advantage of)
 while PCFG rules only represent trees of depth 1, DOP fragments can represent trees of arbitrary depth
 Similarities to PCFG:
 tree fragments could be equivalent to PCFG rules
 probabilities estimated for grammar rules are exactly the same as for
tree fragments
History Based Grammars (HBG)
 General idea: any derivational step can be influenced by any
earlier derivational step
 (Black et al. 1993)
 the probability of expanding the current node is conditioned on all previous nodes along the path from the root
History Based Grammars (HBG)
 Black et al lexicalise their grammar.
 every phrasal node inherits 2 words:
 its lexical head H1
 a secondary head H2, deemed to be useful
 e.g. the PP “in the bank” might have H1 = in and H2 = bank
 Every non-terminal is also assigned:
 a syntactic category (Syn), e.g. PP
 a semantic category (Sem), e.g. with-Data
 An index I indicates which child of the parent node is being expanded
HBG Example (Black et al. 1993)
[Figure: example HBG parse tree, not reproduced]
History Based Grammars (HBG)
 Estimation of the probability of a rule R:

P(Syn, Sem, R, H1, H2 | Synp, Semp, Rp, I, H1p, H2p)
 probability of:
 the current rule R to be applied
 its Syn and Sem category
 its heads H1 and H2
 conditioned on:
 Syn and Sem of parent node
 the rule that gave rise to the parent
 the index of this child relative to the parent
 the heads H1 and H2 of the parent
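A sketch of estimating this conditional by relative frequency over (outcome, history) events collected from derivations. The event encoding is an illustrative assumption, and Black et al. (1993) estimate such distributions with decision-tree models rather than the raw counts shown here.

# Relative-frequency estimate of the HBG conditional (sketch).
from collections import Counter

outcome_counts = Counter()    # counts of ((Syn, Sem, R, H1, H2), history)
history_counts = Counter()    # counts of history = (Syn_p, Sem_p, R_p, I, H1_p, H2_p)

def observe(outcome, history):
    outcome_counts[(outcome, history)] += 1
    history_counts[history] += 1

def hbg_prob(outcome, history):
    if history_counts[history] == 0:
        return 0.0
    return outcome_counts[(outcome, history)] / history_counts[history]

# Hypothetical event: expanding the 2nd child of a PP with parent heads (in, bank).
history = ("PP", "with-Data", "PP -> P NP", 2, "in", "bank")
observe(("NP", "Data", "NP -> Det N", "bank", "bank"), history)
print(hbg_prob(("NP", "Data", "NP -> Det N", "bank", "bank"), history))   # 1.0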
Summary
 This concludes our overview of statistical parsing
 We’ve looked at three important models
 Also considered basic search techniques and algorithms