Parsing with Compositional Vector Grammars
Socher, Bauer, Manning, Ng (2013)

Problem
• How can we parse a sentence and create a dense representation of it?
  – N-grams have obvious problems, the most important being sparsity.
• Can we resolve syntactic ambiguity with context? "They ate udon with forks" vs. "They ate udon with chicken"

Standard Recursive Neural Net
[Figure: binary parse tree over "I like green eggs". Word vectors at the leaves are combined pairwise through a single matrix W into parent vectors (e.g. Vector(I-like), Vector((I-like)-green)), and a classifier produces a score at each node.]
• $p_{ab} = f(W[a;b] + bias)$, where $f(h)$ is usually $\tanh(h)$ or $\mathrm{logistic}(h)$.
• In other words, stack the two child vectors and multiply through a matrix W, and you get a parent vector of the same dimensionality as the children a and b.

Syntactically Untied RNN
[Figure: the same sentence, but the lower level is first parsed with a PCFG (I/N, like/V, green/Adj, eggs/N); each composition then uses a matrix specific to the pair of categories, e.g. $W^{(Adj,N)}$ or $W^{(N,V)}$, and a classifier again scores each node.]
• $p_{ab} = f(W^{(C_1,C_2)}[a;b] + bias)$
• The weight matrix is determined by the PCFG categories $C_1, C_2$ of a and b (you have one matrix per category combination). A small sketch of this composition step appears at the end of these notes.

Examples: Composition Matrices
• Notice that he initializes them with two stacked identity matrices (in the absence of other information, the parent should just average its children).

Learning the Weights
• Errors are backpropagated through structure (Goller and Küchler, 1996).
• For a logistic unit, the weight gradient is input times $f'$ times the error signal $\delta$:
  $\frac{\partial E}{\partial w_{ij}} = x_i \, y_j(1-y_j) \, (y_j - t_j)$ (for logistic $f$)
• Weight derivatives are additive across branches! (Not obvious; there is a good proof/explanation in Socher, 2014.)

Tricks
• Our good friend AdaGrad (diagonal variant), applied elementwise:
  $\theta_t = \theta_{t-1} - \dfrac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}} \, g_t$
• Initialize matrices with identity + small random noise.
• Uses Collobert and Weston (2008) word embeddings to start.

Learning the Tree
• We want the score of the correct parse tree to beat every incorrect tree by a margin:
  $s(\mathrm{CVG}(\theta, x_i, y_i)) \ge s(\mathrm{CVG}(\theta, x_i, y)) + \Delta(y_i, y)$ for all trees $y$
• (Correct parse trees are given in the training set.)

Finding the Best Tree (inference)
• We want to find the parse tree with the max score (which is the sum of the scores of all subtrees).
• Too expensive to try every combination.
• Trick: use a non-RNN method (the base PCFG with the CKY algorithm) to select the best 200 trees, then beam-search these candidates with the RNN (see the re-ranking sketch at the end of these notes).

Model Comparisons (WSJ Dataset)
[Table: F1 for parse labels on the WSJ, comparing Socher's model against baseline parsers.]

Analysis of Errors

Conclusions
• Not the best model, but fast.
• No hand-engineered features.
• Huge number of parameters: $n \cdot |vocab| + 2n \cdot n \cdot |rules| + n \cdot |rules| + n$
• Notice that Socher can't make the standard RNN perform better than the PCFG: there is a pattern here. Most of the papers from this group involve very creative modifications to the standard RNN (SU-RNN, RNTN, RNN + max pooling).
• The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network; subsequent work showed that the RNTN performed better (in different situations) than the SU-RNN.
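
Sketch: SU-RNN Composition and Node Scoring
A minimal NumPy sketch (not the authors' code) of the composition step $p = f(W^{(C_1,C_2)}[a;b] + bias)$ and a per-node score, assuming one $(n \times 2n)$ matrix, bias, and scoring vector per PCFG category pair and the identity-plus-noise initialization mentioned above. The dimension n, the category pairs, and all names (init_W, compose_node, params) are illustrative, not from the paper.

import numpy as np

n = 50                                   # embedding / hidden dimension (arbitrary choice)
rng = np.random.default_rng(0)

def init_W(n, rng, noise=0.01):
    # Two stacked identities (halved) plus small noise: with no other information,
    # the parent starts out as roughly the average of its two children.
    return np.hstack([np.eye(n), np.eye(n)]) / 2.0 + noise * rng.standard_normal((n, 2 * n))

# One composition matrix, bias, and scoring vector per (left, right) category pair.
params = {
    ("Adj", "N"): (init_W(n, rng), np.zeros(n), 0.01 * rng.standard_normal(n)),
    ("N", "VP"):  (init_W(n, rng), np.zeros(n), 0.01 * rng.standard_normal(n)),
}

def compose_node(a, b, left_cat, right_cat):
    # p = f(W^{(C1,C2)} [a; b] + bias); node score = v^{(C1,C2)} . p
    W, bias, v = params[(left_cat, right_cat)]
    p = np.tanh(W @ np.concatenate([a, b]) + bias)
    return p, float(v @ p)

# Usage: compose "green" (Adj) with "eggs" (N) into a parent vector and a score.
green, eggs = rng.standard_normal(n), rng.standard_normal(n)
parent, score = compose_node(green, eggs, "Adj", "N")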
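
Sketch: Diagonal AdaGrad
A small sketch of the elementwise (diagonal) AdaGrad update from the "Tricks" slide. The class name is illustrative and the epsilon term is my addition for numerical safety, not something stated in the notes.

import numpy as np

class AdaGrad:
    def __init__(self, shape, alpha=0.01, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.hist = np.zeros(shape)          # running sum of squared gradients

    def step(self, theta, grad):
        # theta_t = theta_{t-1} - alpha * g_t / sqrt(sum of past g^2), elementwise
        self.hist += grad ** 2
        return theta - self.alpha * grad / (np.sqrt(self.hist) + self.eps)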
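
Sketch: Re-ranking CKY Candidates and the Margin Objective
A simplified sketch of the inference trick and of the max-margin condition from the "Learning the Tree" slide (the paper's actual beam search works over spans, so this is only the idea). pcfg_k_best, score_tree, and delta are hypothetical stand-ins for the base PCFG/CKY k-best parser, the recursive-net tree score (sum of node scores), and the structured margin $\Delta$.

def rerank(sentence, pcfg_k_best, score_tree, k=200):
    # Take the top-k trees from the cheap parser, keep the one the RNN scores highest.
    candidates = pcfg_k_best(sentence, k)
    return max(candidates, key=score_tree)

def margin_loss(correct_tree, candidates, score_tree, delta):
    # Structured hinge: we want s(correct) >= s(y) + Delta(correct, y) for every candidate y.
    s_correct = score_tree(correct_tree)
    violations = [score_tree(y) + delta(correct_tree, y) - s_correct
                  for y in candidates if y != correct_tree]
    return max(0.0, max(violations)) if violations else 0.0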