Parsing with Compositional Vector Grammars

Socher, Bauer, Manning, Ng 2013
Problem
• How can we parse a sentence and create a
dense representation of it?
– N-grams have obvious problems, the most important being sparsity
• Can we resolve syntactic ambiguity with
context? “They ate udon with forks” vs “They
ate udon with chicken”
Standard Recursive Neural Net
[Figure: a recursive net over "I like green eggs". The word vectors Vector(I), Vector(like), Vector(green), Vector(eggs) are combined pairwise by the shared matrix W_main, e.g. Vector(I-like) and then Vector((I-like)-green), and a classifier assigns a score to each parent vector.]
Standard Recursive Neural Net
π‘π‘Žπ‘ = 𝑓(π‘Šπ‘šπ‘Žπ‘–π‘›
π‘Ž
+ 𝑏)
𝑏
Where 𝑓 βˆ™ is usually tanh(βˆ™) or logistic()
In other words, stack the two word vectors, multiply by the matrix W, and you get a parent vector of the same dimensionality as the children a and b.
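A minimal NumPy sketch of this composition step (the dimensionality, the random initialization, and names like W_main are illustrative, not the paper's code):

```python
import numpy as np

d = 50                                              # word-vector dimensionality (illustrative)
rng = np.random.default_rng(0)
W_main = rng.normal(scale=0.01, size=(d, 2 * d))    # single shared composition matrix
bias = np.zeros(d)

def compose(a, b):
    """Stack the two child vectors, multiply by W_main, add the bias, squash with tanh."""
    stacked = np.concatenate([a, b])                # shape (2d,)
    return np.tanh(W_main @ stacked + bias)         # parent vector, shape (d,)

# e.g. p1 = compose(vec["I"], vec["like"]); p2 = compose(p1, vec["green"]); ...
```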
Syntactically Untied RNN
[Figure: the same sentence with PCFG part-of-speech tags (I/N, like/V, green/Adj, eggs/N). First, parse the lower level with the PCFG; then each pair of children is combined by a weight matrix chosen by their categories, e.g. W^(N,V) for Vector(I-like) and W^(Adj,N) for Vector(green-eggs), and a classifier assigns a score to each parent vector.]
Syntactically Untied RNN
π‘π‘Žπ‘ = 𝑓(π‘Šπ‘1,𝑐2
π‘Ž
+ 𝑏)
𝑏
The weight matrix is determined by the PCFG parse categories of the children a and b (there is one matrix per category combination).
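A sketch of the syntactically untied composition, keying the weight matrix on the children's categories (the category set and the 0.5·[I I] initialization are illustrative assumptions):

```python
import numpy as np

d = 50
rng = np.random.default_rng(1)

def init_matrix(d, noise=1e-3):
    # roughly average the two children: 0.5 * [I I] plus small random noise
    return 0.5 * np.hstack([np.eye(d), np.eye(d)]) + rng.normal(scale=noise, size=(d, 2 * d))

# one composition matrix per (left-category, right-category) pair seen in the PCFG
W = {pair: init_matrix(d) for pair in [("N", "V"), ("Adj", "N"), ("NP", "VP")]}
bias = np.zeros(d)

def compose_su(a, cat_a, b, cat_b):
    """Pick the weight matrix by the syntactic categories of the two children."""
    stacked = np.concatenate([a, b])
    return np.tanh(W[(cat_a, cat_b)] @ stacked + bias)
```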
Examples: Composition Matrices
• Notice that Socher initializes them with two stacked identity matrices (in the absence of other information, the composition should just average the children)
Learning the Weights
• Errors are backpropagated through structure
(Goller and Kuchler, 1996)
For the logistic nonlinearity, the weight derivative at a node is

$\frac{\partial E}{\partial W_{ij}} = \underbrace{x_i}_{\text{input}} \; \underbrace{y_j(1-y_j)}_{f'(x)} \; \underbrace{(y_j - \hat{y}_j)}_{\delta}$

where $x_i$ is the node's input, $y_j$ its output, and $\hat{y}_j$ the target.
Weight derivatives are additive across branches! (Not obvious; a good proof/explanation is in Socher, 2014.)
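A sketch of one backprop-through-structure step at a single tree node, assuming tanh units and a shared W; the key point from the slide is that every node's contribution adds into the same gradient accumulators:

```python
import numpy as np

def backprop_node(parent, a, b, delta_parent, W, dW, dbias):
    """Accumulate gradients at one node and return the error signals for its children."""
    stacked = np.concatenate([a, b])
    delta = delta_parent * (1.0 - parent ** 2)      # tanh'(x) = 1 - tanh(x)^2
    dW += np.outer(delta, stacked)                  # contributions from all nodes simply add
    dbias += delta
    down = W.T @ delta                              # error passed down to the children
    return down[: a.shape[0]], down[a.shape[0]:]
```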
Tricks
• Our good friend, AdaGrad (diagonal variant):
$\theta_t = \theta_{t-1} - \dfrac{\alpha}{\sqrt{\sum_{\tau=1}^{t-1} g_\tau^2} + c}\, g_t$ (elementwise; a sketch of this update follows the list)
• Initialize matrices with identity + small random noise
• Uses Collobert and Weston (2008) word
embeddings to start
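A sketch of the diagonal AdaGrad update from the first bullet (the step size alpha and the stability constant c are illustrative):

```python
import numpy as np

class AdaGrad:
    """Diagonal AdaGrad: per-parameter step sizes shrink with the gradient history."""
    def __init__(self, shape, alpha=0.1, c=1e-6):
        self.alpha, self.c = alpha, c
        self.sq_sum = np.zeros(shape)               # running sum of squared gradients

    def step(self, theta, g):
        self.sq_sum += g ** 2
        return theta - self.alpha * g / (np.sqrt(self.sq_sum) + self.c)   # elementwise
```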
Learning the Tree
• We want the score of the correct parse tree to be better than that of every incorrect tree by a margin:
𝑠(𝐢𝑉𝐺 πœƒ, π‘₯𝑖 , 𝑦𝑖 ≥ 𝑠 𝐢𝑉𝐺 πœƒ, π‘₯𝑖 , 𝑦
+ βˆ†(𝑦𝑖 , 𝑦)
(Correct Parse Trees are Given in the Training
Set)
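A sketch of this max-margin objective for one training sentence; score_fn and delta (the structured margin, e.g. a count of wrong spans) are placeholders, not the paper's code:

```python
def margin_loss(score_fn, x_i, y_i, candidate_trees, delta):
    """Hinge loss: the correct tree must beat every candidate by its margin delta."""
    s_correct = score_fn(x_i, y_i)
    violations = [score_fn(x_i, y) + delta(y_i, y) - s_correct
                  for y in candidate_trees if y != y_i]
    return max(0.0, max(violations, default=0.0))
```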
Finding the Best Tree (inference)
• Want to find the parse tree with the max score (which is the sum of the scores of all subtrees)
• Too expensive to try every combination
• Trick: use a non-RNN method (a PCFG with the CKY algorithm) to select the best 200 trees, then beam-search these trees with the RNN.
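A sketch of this two-stage inference, simplified to re-ranking the PCFG's k-best list with the CVG score (pcfg_kbest and cvg_score are placeholder functions):

```python
def parse(sentence, pcfg_kbest, cvg_score, k=200):
    """Propose candidate trees with a cheap PCFG (CKY k-best), then re-rank with the CVG score."""
    candidates = pcfg_kbest(sentence, k)
    return max(candidates, key=lambda tree: cvg_score(sentence, tree))
```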
Model Comparisons (WSJ Dataset)
[Table: F1 for parse labels on the WSJ dataset, comparing standard parsers with Socher's model.]
Analysis of Errors
Conclusions:
• Not the best model, but fast
• No hand engineered features
• Huge number of parameters:
$d \cdot \text{vocab} + 2d \cdot d \cdot n_{\text{comb}} + d \cdot \text{class} + d$
• Notice that Socher can't make the standard RNN perform better than the PCFG; there is a pattern here: most of the papers from this group involve very creative modifications to the standard RNN (SU-RNN, RNTN, RNN + max pooling).
• The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network (RNTN). Subsequent work showed that the RNTN performs better (in different settings) than the SU-RNN.