Parsing with Compositional Vector Grammars
Socher, Bauer, Manning, Ng (2013)

Problem
• How can we parse a sentence and create a dense representation of it?
  – N-grams have obvious problems, the most important being sparsity.
• Can we resolve syntactic ambiguity with context? "They ate udon with forks" vs. "They ate udon with chicken"

Standard Recursive Neural Net
[Figure: binary parse tree over "I like green eggs". Word vectors at the leaves are combined pairwise through a single matrix W into parent vectors (e.g. Vector(I-like), Vector((I-like)-green)), and a classifier produces a score at each node.]
• $p_{ab} = f(W[a;b] + bias)$, where $f(h)$ is usually $\tanh(h)$ or $\mathrm{logistic}(h)$.
• In other words, stack the two child vectors and multiply through a matrix W, and you get a parent vector of the same dimensionality as the children a and b.

Syntactically Untied RNN
[Figure: the same sentence, but the lower level is first parsed with a PCFG (I/N, like/V, green/Adj, eggs/N); each composition then uses a matrix specific to the pair of categories, e.g. $W^{(Adj,N)}$ or $W^{(N,V)}$, and a classifier again scores each node.]
• $p_{ab} = f(W^{(C_1,C_2)}[a;b] + bias)$
• The weight matrix is determined by the PCFG categories $C_1, C_2$ of a and b (you have one matrix per category combination). A small sketch of this composition step appears at the end of these notes.

Examples: Composition Matrices
• Notice that he initializes them with two stacked identity matrices (in the absence of other information, the parent should just average its children).

Learning the Weights
• Errors are backpropagated through structure (Goller and Küchler, 1996).
• For a logistic unit, the weight gradient is input times $f'$ times the error signal $\delta$:
  $\frac{\partial E}{\partial w_{ij}} = x_i \, y_j(1-y_j) \, (y_j - t_j)$ (for logistic $f$)
• Weight derivatives are additive across branches! (Not obvious; there is a good proof/explanation in Socher, 2014.)

Tricks
• Our good friend AdaGrad (diagonal variant), applied elementwise:
  $\theta_t = \theta_{t-1} - \dfrac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}} \, g_t$
• Initialize matrices with identity + small random noise.
• Uses Collobert and Weston (2008) word embeddings to start.

Learning the Tree
• We want the score of the correct parse tree to beat every incorrect tree by a margin:
  $s(\mathrm{CVG}(\theta, x_i, y_i)) \ge s(\mathrm{CVG}(\theta, x_i, y)) + \Delta(y_i, y)$ for all trees $y$
• (Correct parse trees are given in the training set.)

Finding the Best Tree (inference)
• We want to find the parse tree with the max score (which is the sum of the scores of all subtrees).
• Too expensive to try every combination.
• Trick: use a non-RNN method (the base PCFG with the CKY algorithm) to select the best 200 trees, then beam-search these candidates with the RNN (see the re-ranking sketch at the end of these notes).

Model Comparisons (WSJ Dataset)
[Table: F1 for parse labels on the WSJ, comparing Socher's model against baseline parsers.]

Analysis of Errors

Conclusions
• Not the best model, but fast.
• No hand-engineered features.
• Huge number of parameters: $n \cdot |vocab| + 2n \cdot n \cdot |rules| + n \cdot |rules| + n$
• Notice that Socher can't make the standard RNN perform better than the PCFG: there is a pattern here. Most of the papers from this group involve very creative modifications to the standard RNN (SU-RNN, RNTN, RNN + max pooling).
• The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network; subsequent work showed that the RNTN performed better (in different situations) than the SU-RNN.
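
Sketch: SU-RNN Composition and Node Scoring
A minimal NumPy sketch (not the authors' code) of the composition step $p = f(W^{(C_1,C_2)}[a;b] + bias)$ and a per-node score, assuming one $(n \times 2n)$ matrix, bias, and scoring vector per PCFG category pair and the identity-plus-noise initialization mentioned above. The dimension n, the category pairs, and all names (init_W, compose_node, params) are illustrative, not from the paper.

import numpy as np

n = 50                                   # embedding / hidden dimension (arbitrary choice)
rng = np.random.default_rng(0)

def init_W(n, rng, noise=0.01):
    # Two stacked identities (halved) plus small noise: with no other information,
    # the parent starts out as roughly the average of its two children.
    return np.hstack([np.eye(n), np.eye(n)]) / 2.0 + noise * rng.standard_normal((n, 2 * n))

# One composition matrix, bias, and scoring vector per (left, right) category pair.
params = {
    ("Adj", "N"): (init_W(n, rng), np.zeros(n), 0.01 * rng.standard_normal(n)),
    ("N", "VP"):  (init_W(n, rng), np.zeros(n), 0.01 * rng.standard_normal(n)),
}

def compose_node(a, b, left_cat, right_cat):
    # p = f(W^{(C1,C2)} [a; b] + bias); node score = v^{(C1,C2)} . p
    W, bias, v = params[(left_cat, right_cat)]
    p = np.tanh(W @ np.concatenate([a, b]) + bias)
    return p, float(v @ p)

# Usage: compose "green" (Adj) with "eggs" (N) into a parent vector and a score.
green, eggs = rng.standard_normal(n), rng.standard_normal(n)
parent, score = compose_node(green, eggs, "Adj", "N")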
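
Sketch: Diagonal AdaGrad
A small sketch of the elementwise (diagonal) AdaGrad update from the "Tricks" slide. The class name is illustrative and the epsilon term is my addition for numerical safety, not something stated in the notes.

import numpy as np

class AdaGrad:
    def __init__(self, shape, alpha=0.01, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.hist = np.zeros(shape)          # running sum of squared gradients

    def step(self, theta, grad):
        # theta_t = theta_{t-1} - alpha * g_t / sqrt(sum of past g^2), elementwise
        self.hist += grad ** 2
        return theta - self.alpha * grad / (np.sqrt(self.hist) + self.eps)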
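
Sketch: Re-ranking CKY Candidates and the Margin Objective
A simplified sketch of the inference trick and of the max-margin condition from the "Learning the Tree" slide (the paper's actual beam search works over spans, so this is only the idea). pcfg_k_best, score_tree, and delta are hypothetical stand-ins for the base PCFG/CKY k-best parser, the recursive-net tree score (sum of node scores), and the structured margin $\Delta$.

def rerank(sentence, pcfg_k_best, score_tree, k=200):
    # Take the top-k trees from the cheap parser, keep the one the RNN scores highest.
    candidates = pcfg_k_best(sentence, k)
    return max(candidates, key=score_tree)

def margin_loss(correct_tree, candidates, score_tree, delta):
    # Structured hinge: we want s(correct) >= s(y) + Delta(correct, y) for every candidate y.
    s_correct = score_tree(correct_tree)
    violations = [score_tree(y) + delta(correct_tree, y) - s_correct
                  for y in candidates if y != correct_tree]
    return max(0.0, max(violations)) if violations else 0.0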