Probabilistic and Lexicalized Parsing
Probabilistic CFGs
• Weighted CFGs
– Attach weights to rules of CFG
– Compute weights of derivations
– Use weights to pick preferred parses
• Utility: pruning and ordering the search space, disambiguation, and language modeling for ASR.
• Parsing with weighted grammars (like weighted FAs)
– T* = argmax_T W(T, S)
• Probabilistic CFGs are one form of weighted
CFGs.
Probability Model
• Rule Probability:
– Attach probabilities to grammar rules
– Expansions for a given non-terminal sum to 1
R1: VP → V        .55
R2: VP → V NP     .40
R3: VP → V NP NP  .05
– Estimate the probabilities from annotated corpora
P(R1)=counts(R1)/counts(VP)
• Derivation Probability:
– Derivation T = {R1 … Rn}
– Probability of a derivation: P(T) = ∏_{i=1..n} P(Ri)
– Most likely parse: T* = argmax_T P(T)
– Probability of a sentence: P(S) = Σ_T P(T, S)
• Sum over all possible derivations for the sentence
• Note the independence assumption: Parse probability does not
change based on where the rule is expanded.
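A minimal sketch of both computations, using the hypothetical rule counts implied by the R1–R3 example rather than a real treebank:

```python
from collections import Counter
from math import prod

# Hypothetical rule counts, as they might be collected from an annotated corpus.
rule_counts = Counter({
    ("VP", ("V",)): 55,
    ("VP", ("V", "NP")): 40,
    ("VP", ("V", "NP", "NP")): 5,
})

# P(A -> beta) = count(A -> beta) / count(A); expansions of each non-terminal sum to 1.
lhs_totals = Counter()
for (lhs, _), c in rule_counts.items():
    lhs_totals[lhs] += c
rule_prob = {rule: c / lhs_totals[rule[0]] for rule, c in rule_counts.items()}

def derivation_prob(rules_used):
    """P(T): product of the probabilities of the rules used in the derivation."""
    return prod(rule_prob[r] for r in rules_used)

print(rule_prob[("VP", ("V", "NP"))])          # 0.4
print(derivation_prob([("VP", ("V", "NP"))]))  # 0.4
```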
Structural ambiguity
• S → NP VP
• VP → V NP
• NP → NP PP
• VP → VP PP
• PP → P NP
• NP → John | Mary | Denver
• V → called
• P → from
John called Mary from Denver
The two parses (VP attachment vs. NP attachment of the PP):
[S [NP John] [VP [VP [V called] [NP Mary]] [PP [P from] [NP Denver]]]]
[S [NP John] [VP [V called] [NP [NP Mary] [PP [P from] [NP Denver]]]]]
Cocke-Younger-Kasami Parser
• Bottom-up parser with top-down filtering
• Start state(s): (A, i, i+1) for each A → w_{i+1}
• End state: (S, 0, n), where n is the input size
• Next-state rule
– (B, i, k), (C, k, j) ⇒ (A, i, j) if A → BC
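A minimal sketch of this recognizer in Python, assuming the toy CNF grammar from the structural-ambiguity slide (the `lexical` and `binary` dictionaries are illustrative, not part of the original):

```python
# Toy grammar from the structural-ambiguity example, in CNF.
lexical = {"John": {"NP"}, "Mary": {"NP"}, "Denver": {"NP"},
           "called": {"V"}, "from": {"P"}}
binary = [("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
          ("NP", "NP", "PP"), ("PP", "P", "NP")]

def cky_recognize(words):
    n = len(words)
    # chart[(i, j)] = set of non-terminals A such that state (A, i, j) is reachable
    chart = {(i, j): set() for i in range(n) for j in range(i + 1, n + 1)}
    # Start states: (A, i, i+1) for each A -> w_{i+1}
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= lexical.get(w, set())
    # Next-state rule: (B, i, k), (C, k, j) => (A, i, j) if A -> BC
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c in binary:
                    if b in chart[(i, k)] and c in chart[(k, j)]:
                        chart[(i, j)].add(a)
    return "S" in chart[(0, n)]   # end state (S, 0, n)

print(cky_recognize("John called Mary from Denver".split()))  # True
```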
Example
John called Mary from Denver
Base Case: A → w
[Chart: the diagonal cells are filled from the lexicon: NP over John, V over called, NP over Mary, P over from, NP over Denver.]
Recursive Cases: A → BC
[Chart: successive snapshots fill longer spans by combining adjacent cells (empty cells are marked X): VP over "called Mary" (VP → V NP), PP over "from Denver" (PP → P NP), NP over "Mary from Denver" (NP → NP PP), S over "John called Mary" (S → NP VP), two VP analyses VP1 and VP2 over "called Mary from Denver" (VP → V NP and VP → VP PP), and finally S over the whole input, the end state (S, 0, n).]
Probabilistic CKY
• Assign probabilities to constituents as they are
completed and placed in the table
• Computing the probability
P(A, i, j) = Σ_{A → BC} P(A → BC, i, j)
P(A → BC, i, j) = P(B, i, k) * P(C, k, j) * P(A → BC)
– Since we are interested in the max P(S,0,n)
• Use the max probability for each constituent
• Maintain back-pointers to recover the parse.
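A minimal sketch of this Viterbi-style computation, reusing the toy grammar above with made-up (but normalized) rule probabilities:

```python
# Made-up rule probabilities; expansions of each LHS sum to 1.
lexical = {"John": [("NP", 0.3)], "Mary": [("NP", 0.3)], "Denver": [("NP", 0.2)],
           "called": [("V", 1.0)], "from": [("P", 1.0)]}
binary = [("S", "NP", "VP", 1.0), ("VP", "V", "NP", 0.7), ("VP", "VP", "PP", 0.3),
          ("NP", "NP", "PP", 0.2), ("PP", "P", "NP", 1.0)]

def pcky(words):
    n = len(words)
    best, back = {}, {}   # best probability per constituent; back-pointers to recover the parse
    for i, w in enumerate(words):
        best[(i, i + 1)] = {a: p for a, p in lexical.get(w, [])}
        back[(i, i + 1)] = {a: w for a, _ in lexical.get(w, [])}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            best[(i, j)], back[(i, j)] = {}, {}
            for k in range(i + 1, j):
                for a, b, c, p_rule in binary:
                    if b in best[(i, k)] and c in best[(k, j)]:
                        # P(A -> BC, i, j) = P(B, i, k) * P(C, k, j) * P(A -> BC)
                        p = best[(i, k)][b] * best[(k, j)][c] * p_rule
                        if p > best[(i, j)].get(a, 0.0):   # keep only the max
                            best[(i, j)][a] = p
                            back[(i, j)][a] = (k, b, c)
    return best[(0, n)].get("S", 0.0), back

prob, back = pcky("John called Mary from Denver".split())
print(prob)   # max P(S, 0, n); follow back-pointers from (0, n, "S") to recover the parse
```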
Problems with PCFGs
• The probability model we’re using is just based on the rules
in the derivation.
• Lexical insensitivity:
– Doesn’t use the words in any real way
– Structural disambiguation is lexically driven
• PP attachment often depends on the verb, its object, and the
preposition
• I ate pickles with a fork.
• I ate pickles with relish.
• Context insensitivity of the derivation
– Doesn’t take into account where in the derivation a rule is used
• Pronouns more often subjects than objects
• She hates Mary.
• Mary hates her.
• Solution: Lexicalization
– Add lexical information to each rule
An example of lexical information: Heads
• Make use of notion of the head of a phrase
– Head of an NP is a noun
– Head of a VP is the main verb
– Head of a PP is its preposition
• Each LHS of a rule in the PCFG has a lexical
item
• Each RHS non-terminal has a lexical item.
– One of the lexical items is shared with the LHS.
• If R is the set of binary branching rules in the CFG, the lexicalized CFG has O(2 * |Σ| * |R|) rules (Σ the vocabulary)
• Unary rules: O(|Σ| * |R|)
Example (correct parse)
Attribute grammar
Example (less preferred)
Computing Lexicalized Rule Probabilities
• We started with rule probabilities
– VP → V NP PP
P(rule | VP)
• E.g., count of this rule divided by the number of VPs in
a treebank
• Now we want lexicalized probabilities
– VP(dumped) → V(dumped) NP(sacks) PP(in)
– P(rule | VP ^ dumped is the verb ^ sacks is the head of the NP ^ in is the head of the PP)
– Not likely to have significant counts in any
treebank
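A minimal sketch of the estimation problem, with a hypothetical list of head-annotated rule occurrences standing in for a treebank:

```python
from collections import Counter

# Hypothetical head-annotated rule occurrences extracted from a treebank.
events = [
    ("VP", "dumped", ("V(dumped)", "NP(sacks)", "PP(in)")),
    ("VP", "ate", ("V(ate)", "NP(spaghetti)")),
    # ... millions more occurrences in a real treebank
]

rule_counts = Counter(events)
lhs_counts = Counter((lhs, head) for lhs, head, _ in events)

# P(rule | VP ^ its head words): most such events occur once or never,
# so this maximum-likelihood estimate is unreliable without smoothing.
e = ("VP", "dumped", ("V(dumped)", "NP(sacks)", "PP(in)"))
print(rule_counts[e] / lhs_counts[("VP", "dumped")])
```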
Another Example
• Consider the VPs
– Ate spaghetti with gusto
– Ate spaghetti with marinara
• The dependency is not between mother and child.
[VP(ate) [VP(ate) [V ate] [NP spaghetti]] [PP(with) with gusto]]
[VP(ate) [V ate] [NP(spaghetti) [NP spaghetti] [PP(with) with marinara]]]
Log-linear models for Parsing
• Why restrict the conditioning to the elements of a rule?
– Use even larger context
– Word sequence, word types, sub-tree context, etc.
• In general, compute P(y|x), where f_i(x, y) tests a property of the context and λ_i is the weight of that feature.
P(y|x) = exp(Σ_i λ_i f_i(x, y)) / Σ_{y'∈Y} exp(Σ_i λ_i f_i(x, y'))
• Use these as scores in the CKY algorithm to find
the best scoring parse.
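A minimal sketch of that normalization, with hypothetical feature functions and weights (the λ_i would normally be learned from data):

```python
from math import exp

# Hypothetical binary feature functions over a (context, candidate-parse) pair.
features = [
    lambda x, y: 1.0 if y == "VP-attach" and "fork" in x else 0.0,
    lambda x, y: 1.0 if y == "NP-attach" and "relish" in x else 0.0,
]
weights = [1.5, 2.0]   # the lambda_i

def score(x, y):
    """Unnormalized score: exp(sum_i lambda_i * f_i(x, y))."""
    return exp(sum(w * f(x, y) for w, f in zip(weights, features)))

def p(y, x, candidates):
    """P(y|x): score of y divided by the sum of scores over all candidates y'."""
    return score(x, y) / sum(score(x, yp) for yp in candidates)

cands = ["VP-attach", "NP-attach"]
print(p("VP-attach", "I ate pickles with a fork", cands))  # ~0.82
```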
Supertagging: Almost parsing
Poachers now control the underground trade
[Figure: each word is paired with its set of candidate supertags (elementary trees), e.g. NP and S(NP VP) trees for poachers, Adv trees for now, transitive S(NP VP(V NP)) trees for control, Det for the, Adj and N trees for underground, and N/NP trees for trade. Selecting one supertag per word resolves most parsing decisions, leaving only attachment, hence "almost parsing".]
Summary
• Parsing context-free grammars
– Top-down and Bottom-up parsers
– Mixed approaches (CKY, Earley parsers)
• Preferences over parses using probabilities
– Parsing with PCFG and PCKY algorithms
• Enriching the probability model
– Lexicalization
– Log-linear models for parsing