Linguistic Regularities in Sparse and Explicit Word Representations
Omer Levy and Yoav Goldberg, CoNLL 2014

A distance measure between words?  s(w_i, w_j)
• Document clustering
• Machine translation
• Information retrieval
• Question answering
• …
[Figure: a news snippet ("Astronomers have been able to capture and record the moment when a massive star begins to blow itself apart.") filed under Space Science rather than International Politics]
Vector representation?

Vector Representation
• Explicit
• Continuous ~ Embeddings ~ Neural
• Example analogy: Russia is to Moscow as Japan is to Tokyo
Figures from (Mikolov et al., NAACL, 2013) and (Turian et al., ACL, 2010)

Continuous vector representation
• Interesting properties: directional similarity
• Analogy between words:
  woman − man ≈ queen − king
  king − man + woman ≈ queen
• Older continuous vector representations:
  • Latent Semantic Analysis
  • Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003)
  • etc.
• A recent result (Levy and Goldberg, NIPS, 2014) mathematically showed that the two representations are "almost" equivalent.
Figure from (Mikolov et al., NAACL, 2013)

Goals of the paper
• Analogy: a is to a* as b is to b*
• With simple vector arithmetic: a − a* ≈ b − b*
• Given 3 words: a is to a* as b is to ?
• Can we solve analogy problems with the explicit representation?
• Compare the performance of the
  • continuous (neural) representation
  • explicit representation
• Baroni et al. (ACL, 2014) showed that embeddings outperform explicit representations.

Explicit representation
• Term-context matrix:

                 computer  data  pinch  result  sugar
   apricot           0       0     1      0       1
   pineapple         0       0     1      0       1
   digital           2       1     0      1       0
   information       1       6     0      4       0

• apricot: (computer: 0, data: 0, pinch: 1, …) ∈ ℝ^|vocab|, |vocab| ≈ 100,000
• Sparse!
• Pointwise Mutual Information:
  PMI(word, context) = log₂ [ P(word, context) / (P(word) P(context)) ]
• Positive PMI (PPMI): replace the negative PMI values with zero.

Continuous vector representation
• Example: Italy: (−7.35, 9.42, 0.88, …) ∈ ℝ^100
• Continuous values and dense
• How to create? Example: the skip-gram model (Mikolov et al., 2013, arXiv:1301.3781)

  L = (1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

  p(w_i | w_j) = exp(v_i · v_j) / Σ_k exp(v_k · v_j)

• Problem: the denominator sums over the entire vocabulary.
• Hard to calculate the gradient.

Formulating the analogy objective
• Given 3 words: a is to a* as b is to ?
• Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖)
• Prediction: argmax_{b*} cos(b*, b − a + a*)
• If using normalized vectors, this equals:
  argmax_{b*} [ cos(b*, b) − cos(b*, a) + cos(b*, a*) ]
• Alternative? Instead of adding similarities, multiply them:
  argmax_{b*} [ cos(b*, b) · cos(b*, a*) / cos(b*, a) ]

Corpora
• MSR: ~8,000 syntactic analogies
• Google: ~19,000 syntactic and semantic analogies

   a        a*       b        b*
   good     better   rough    ?
   better   good     rougher  ?
   good     best     rough    ?
   …        …        …        …

• Learn the different representations from the same corpus: very important!
  (See the recent controversy over inaccurate evaluations of the GloVe word representation (Pennington et al., EMNLP 2014).)

Embedding vs Explicit (Round 1)
[Bar chart: analogy accuracy with the additive objective]
• MSR: Embedding 54%, Explicit 29%
• Google: Embedding 63%, Explicit 45%
• Many analogies are recovered by the explicit representation, but many more by the embedding.

Using the multiplicative form
• Given 3 words: a is to a* as b is to ?
• Cosine similarity: cos(u, v) = (u · v) / (‖u‖ ‖v‖)
• Additive objective:
  argmax_{b*} [ cos(b*, b) − cos(b*, a) + cos(b*, a*) ]
• Multiplicative objective:
  argmax_{b*} [ cos(b*, b) · cos(b*, a*) / cos(b*, a) ]
(Both objectives, and the PPMI vectors they can run on, are sketched in code below.)
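To make the two objectives concrete, here is a minimal Python sketch of an analogy solver that scores every candidate b* under both forms. It is only an illustration of the idea, not the paper's code: the toy vocabulary, the random stand-in vectors, the (x + 1)/2 shift of the cosines before multiplying, and the small eps in the denominator are all assumptions made for the sketch.

# Minimal sketch (not from the slides): solving "a is to a* as b is to ?" with the
# additive and the multiplicative objective over any matrix of word vectors;
# dense embeddings or sparse PPMI rows work the same way once unit-normalized.
# The vocabulary and the random vectors below are made up for illustration.
import numpy as np

vocab = ["king", "queen", "man", "woman", "apple"]
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
V = rng.normal(size=(len(vocab), 50))          # stand-in for real word vectors
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize the rows

def cos_all(w):
    """Cosine similarity of every vocabulary word to word w (rows are unit vectors)."""
    return V @ V[idx[w]]

def analogy(a, a_star, b, objective="mul", eps=1e-3):
    """Return the b* maximizing the chosen analogy objective."""
    sim_a, sim_a_star, sim_b = cos_all(a), cos_all(a_star), cos_all(b)
    if objective == "add":   # cos(b*, b) - cos(b*, a) + cos(b*, a*)
        scores = sim_b - sim_a + sim_a_star
    else:                    # cos(b*, b) * cos(b*, a*) / (cos(b*, a) + eps)
        shift = lambda x: (x + 1) / 2   # move cosines into [0, 1] before multiplying
        scores = shift(sim_b) * shift(sim_a_star) / (shift(sim_a) + eps)
    for w in (a, a_star, b):            # never return one of the three query words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "king", "woman", objective="add"))
print(analogy("man", "king", "woman", objective="mul"))

The multiplicative branch also suggests why the England − London + Baghdad example below goes wrong under addition: once the similarities are multiplied, a single very large term such as cos(·, Baghdad) can no longer dominate the score on its own.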
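The explicit vectors such a solver operates on can be built exactly as described on the "Explicit representation" slide. Below is a minimal sketch of that construction, reusing the slide's own toy term-context counts; the variable names and the handling of unseen pairs are choices made for the sketch.

# Minimal sketch (not from the slides): PPMI vectors from the toy term-context
# counts shown on the "Explicit representation" slide.  A real vocabulary would
# have on the order of 100,000 context dimensions and be extremely sparse.
import numpy as np

contexts = ["computer", "data", "pinch", "result", "sugar"]
words = ["apricot", "pineapple", "digital", "information"]
counts = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
], dtype=float)

total = counts.sum()
p_wc = counts / total                              # P(word, context)
p_w = counts.sum(axis=1, keepdims=True) / total    # P(word)
p_c = counts.sum(axis=0, keepdims=True) / total    # P(context)

with np.errstate(divide="ignore"):                 # log2(0) -> -inf for unseen pairs
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)                        # Positive PMI: clip negatives (and -inf) to 0

print(dict(zip(contexts, ppmi[words.index("apricot")].round(2))))

Unit-normalized rows of ppmi are exactly the kind of sparse explicit vectors the paper compares against the dense embeddings.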
A relatively weak justification
• England − London + Baghdad = ?  (expected answer: Iraq)
• With the additive objective:
  cos(Iraq, England) − cos(Iraq, London) + cos(Iraq, Baghdad) = 0.15 − 0.13 + 0.63 = 0.65
  cos(Mosul, England) − cos(Mosul, London) + cos(Mosul, Baghdad) = 0.13 − 0.14 + 0.75 = 0.74
• The single large cos(·, Baghdad) term dominates the sum, so the additive objective returns Mosul instead of Iraq; multiplying the similarities balances the three terms.

Embedding vs Explicit (Round 2)
[Bar charts: analogy accuracy with the additive (Add) vs multiplicative (Mul) objectives]
• Embedding: MSR Add 54%, Mul 59%; Google Add 63%, Mul 67%
• Explicit: MSR Add 29%, Mul 57%; Google Add 45%, Mul 68%
• With the multiplicative objective:
  MSR: Embedding 59% vs Explicit 57%
  Google: Embedding 67% vs Explicit 68%

Summary
• On analogies, the continuous (neural) representation is not magical.
• Analogies are possible in the explicit representation.
• On analogies, the explicit representation can be as good as the continuous (neural) representation.
• The objective (function) matters!

Agreement between representations

   Dataset   Both correct   Both wrong   Only embedding correct   Only explicit correct
   MSR       43.97%         28.06%       15.12%                   12.85%
   Google    57.12%         22.17%        9.59%                   11.12%

Comparison of objectives

   Objective        Representation   MSR      Google
   Additive         Embedding        53.98%   62.70%
   Additive         Explicit         29.04%   45.05%
   Multiplicative   Embedding        59.09%   66.72%
   Multiplicative   Explicit         56.83%   68.24%

Recurrent Neural Network
• Recurrent neural network language model (Mikolov et al., NAACL, 2013)
• Input-output relations:
  s(t) = f(W s(t−1) + U w(t))
  y(t) = g(V s(t))
• w(t) ∈ ℝ^|vocab|, s(t) ∈ ℝ^m, y(t) ∈ ℝ^|vocab|
• U ∈ ℝ^{m×|vocab|}, W ∈ ℝ^{m×m}, V ∈ ℝ^{|vocab|×m}

Continuous vector representation (recap)
• Skip-gram model (Mikolov et al., 2013, arXiv:1301.3781):

  L = (1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

  p(w_i | w_j) = exp(v_i · v_j) / Σ_k exp(v_k · v_j)

• Problem: the denominator sums over the entire vocabulary; hard to calculate the gradient.
• Change the objective:
  p(w_i | w_j) = σ(v_i · v_j) = 1 / (1 + exp(−v_i · v_j))
• This has a trivial solution (make all vectors identical with very large norms)!
• Introduce artificial negative instances (negative sampling); a code sketch follows the references:

  L_ij = log σ(v_i · v_j) + Σ_{k=1..K} E_{w_k ∼ P_n(w)} [ log σ(−v_k · v_j) ]

References
Slides from: http://levyomer.wordpress.com/2014/04/25/linguistic-regularities-in-sparse-and-explicit-word-representations/
Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and their Compositionality." Advances in Neural Information Processing Systems. 2013.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." HLT-NAACL. 2013.
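As a supplement to the skip-gram and negative-sampling slides above, here is a minimal sketch of one stochastic update of the negative-sampling objective L_ij for a single observed (word, context) pair. It is an assumed, simplified setup for illustration (random initialization, uniform noise sampling instead of word2vec's smoothed unigram distribution, made-up indices in the final call), not the original word2vec implementation.

# Minimal sketch (assumed setup, not the original word2vec code): one gradient-ascent
# step on  log sigma(v_c . v_w) + sum_k log sigma(-v_k . v_w)  for an observed pair.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, K, LR = 1000, 100, 5, 0.025            # vocab size, dims, negatives, learning rate
W_word = rng.normal(scale=0.1, size=(VOCAB, DIM))  # word vectors v_w
W_ctx = rng.normal(scale=0.1, size=(VOCAB, DIM))   # context vectors v_c

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(word, context):
    """Update the vectors for one (word, context) pair plus K sampled negatives."""
    v_w = W_word[word]
    negatives = rng.integers(0, VOCAB, size=K)     # noise words (uniform here, for simplicity)
    # observed pair: push sigma(v_c . v_w) toward 1
    g = 1.0 - sigmoid(W_ctx[context] @ v_w)
    grad_w = g * W_ctx[context]
    W_ctx[context] += LR * g * v_w
    # sampled negatives: push sigma(v_k . v_w) toward 0
    for k in negatives:
        g = -sigmoid(W_ctx[k] @ v_w)
        grad_w += g * W_ctx[k]
        W_ctx[k] += LR * g * v_w
    W_word[word] += LR * grad_w

sgns_step(word=3, context=17)                      # hypothetical indices, just to show the call

Each call raises σ(v_c · v_w) for the observed pair and lowers σ(v_k · v_w) for the K sampled negatives, which is what lets the model avoid the full-vocabulary softmax denominator from the earlier slide.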