Linguistic Regularities in Sparse and Explicit Word Representations
CoNLL 2014
Omer Levy and Yoav Goldberg
• Example document: "Astronomers have been able to capture and record the moment when a massive star begins to blow itself apart." (topic: Space Science, not International Politics)
• A distance measure between words? d(w_i, w_j)
• Useful for:
  • Document clustering
  • Machine translation
  • Information retrieval
  • Question answering
  • …
• Vector representation?
Vector Representation
• Explicit (sparse)
• Continuous (~ embeddings, ~ neural)
• Example: Russia is to Moscow as Japan is to Tokyo
Figures from (Mikolov et al., NAACL 2013) and (Turian et al., ACL 2010)
Continuous vector representation
• Interesting properties: directional similarity
• Analogy between words:

  woman − man ≈ queen − king
  king − man + woman ≈ queen

• Older continuous vector representations:
  • Latent Semantic Analysis
  • Latent Dirichlet Allocation (Blei, Ng, Jordan, 2003)
  • etc.
• A recent result (Levy and Goldberg, NIPS 2014) mathematically showed that the two representations are "almost" equivalent.
Figure from (Mikolov et al., NAACL 2013)
Goals of the paper
• Analogy: a is to a* as b is to b*
• With simple vector arithmetic: a − a* = b − b*
• Given 3 words: a is to a* as b is to ?
• Can we solve analogy problems with the explicit representation?
• Compare the performance of:
  • Continuous (neural) representation
  • Explicit representation
• Baroni et al. (ACL 2014) showed that embeddings outperform explicit representations.
Explicit representation
• Term-context vectors:

                 computer  data  pinch  result  sugar
  apricot            0       0     1      0       1
  pineapple          0       0     1      0       1
  digital            2       1     0      1       0
  information        1       6     0      4       0

• apricot: (computer: 0, data: 0, pinch: 1, …) ∈ ℝ^|Vocab|, |Vocab| ≈ 100,000
• Sparse!
• Pointwise Mutual Information:

  PMI(word, context) = log2 [ P(word, context) / (P(word) · P(context)) ]

• Positive PMI (PPMI): replace the negative PMI values with zero.
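• A minimal sketch (not from the slides) of building a PPMI matrix from term-context counts like the table above; the toy counts mirror that table, while in practice the matrix is huge and sparse:

import numpy as np

# Toy term-context count matrix (rows: words, columns: contexts), mirroring the table above.
words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
counts = np.array([
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [2, 1, 0, 1, 0],
    [1, 6, 0, 4, 0],
], dtype=float)

total = counts.sum()
p_wc = counts / total                              # joint P(word, context)
p_w  = counts.sum(axis=1, keepdims=True) / total   # marginal P(word)
p_c  = counts.sum(axis=0, keepdims=True) / total   # marginal P(context)

with np.errstate(divide="ignore"):                 # log2(0) -> -inf for zero counts
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                          # positive PMI: clip negatives (and -inf) to zero
print(ppmi.round(2))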
Continuous vector representation
• Example: Italy: (−7.35, 9.42, 0.88, …) ∈ ℝ^100
• Continuous values and dense
• How to create?
• Example: Skip-gram model (Mikolov et al., arXiv:1301.3781):

  L = (1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

  p(w_i | w_j) = exp(v_i · v_j) / Σ_k exp(v_k · v_j)

• Problem: the denominator sums over the entire vocabulary.
• Hard to calculate the gradient.
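• A minimal sketch (toy sizes, random vectors) of why this is expensive: evaluating p(w_i | w_j) for a single pair already touches every word vector in the vocabulary:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 100_000, 100                  # toy sizes for illustration
V = rng.normal(size=(vocab_size, dim))          # one vector v_k per vocabulary word

def skipgram_softmax_prob(i, j, V):
    """p(w_i | w_j) = exp(v_i . v_j) / sum_k exp(v_k . v_j)."""
    scores = V @ V[j]                           # v_k . v_j for every k: O(|Vocab| * dim) work
    scores -= scores.max()                      # stabilize exp() numerically
    exp_scores = np.exp(scores)
    return exp_scores[i] / exp_scores.sum()

print(skipgram_softmax_prob(42, 7, V))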
Formulating the analogy objective
• Given 3 words: a is to a* as b is to ?
• Cosine similarity:

  cos(a, b) = (a · b) / (‖a‖ · ‖b‖)

• Prediction:

  arg max_{b*} cos(b*, b − a + a*)

• If using normalized vectors:

  arg max_{b*} [cos(b*, b) − cos(b*, a) + cos(b*, a*)]

• Alternative? Instead of adding similarities, multiply them!

  arg max_{b*} [cos(b*, b) · cos(b*, a*) / cos(b*, a)]
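• A minimal sketch (not the paper's code) of both objectives over a row-normalized embedding matrix; shifting the cosines into [0, 1] and the small epsilon in the multiplicative form are implementation choices added here to keep the product and division well-behaved:

import numpy as np

def solve_analogy(emb, vocab, a, a_star, b, method="add", eps=1e-3):
    """Return the word b* maximizing the additive or multiplicative objective.

    emb:   array of shape (|Vocab|, dim) with L2-normalized rows
    vocab: list of words aligned with the rows of emb
    """
    idx = {w: i for i, w in enumerate(vocab)}
    sim_a, sim_a_star, sim_b = (emb @ emb[idx[w]] for w in (a, a_star, b))  # cosines, since rows are unit norm
    if method == "add":
        scores = sim_b - sim_a + sim_a_star                  # cos(b*,b) - cos(b*,a) + cos(b*,a*)
    else:
        scores = (((sim_b + 1) / 2) * ((sim_a_star + 1) / 2)
                  / ((sim_a + 1) / 2 + eps))                 # cos(b*,b) * cos(b*,a*) / cos(b*,a)
    for w in (a, a_star, b):                                 # never return one of the three query words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy usage with random unit vectors (only to show the call shape; the answer is meaningless here).
rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "tokyo", "japan"]
emb = rng.normal(size=(len(vocab), 50))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(solve_analogy(emb, vocab, "man", "woman", "king", method="mul"))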
Corpora
• MSR: ~8,000 syntactic analogies
• Google: ~19,000 syntactic and semantic analogies

  a        b        a*        b*
  good     better   rough     ?
  better   good     rougher   ?
  good     best     rough     ?
  …        …        …         …

• Learn different representations from the same corpus: very important!
  (Recent controversies over inaccurate evaluations of the "GloVe" word representation (Pennington et al., EMNLP 2014).)
Embedding vs. Explicit (Round 1)
• Accuracy with the additive objective:

              MSR    Google
  Embedding   54%    63%
  Explicit    29%    45%

• Many analogies are recovered by the explicit representation, but many more by the embedding.
Using the multiplicative form
• Given 3 words: a is to a* as b is to ?
• Cosine similarity:

  cos(a, b) = (a · b) / (‖a‖ · ‖b‖)

• Additive objective:

  arg max_{b*} [cos(b*, b) − cos(b*, a) + cos(b*, a*)]

• Multiplicative objective:

  arg max_{b*} [cos(b*, b) · cos(b*, a*) / cos(b*, a)]
A relatively weak justification
• England − London + Baghdad = ?  (expected: Iraq)

  cos(Iraq, England) − cos(Iraq, London) + cos(Iraq, Baghdad)    =  0.15 − 0.13 + 0.63  =  0.65
  cos(Mosul, England) − cos(Mosul, London) + cos(Mosul, Baghdad) =  0.13 − 0.14 + 0.75  =  0.74

• The additive objective lets one large term (here cos(·, Baghdad)) dominate, so it incorrectly prefers Mosul over Iraq.
Embedding vs. Explicit (Round 2)
• Accuracy with the additive (Add) vs. multiplicative (Mul) objective:

                   Embedding          Explicit
                   MSR     Google     MSR     Google
  Add              54%     63%        29%     45%
  Mul              59%     67%        57%     68%

• With the multiplicative objective, the two representations are nearly tied:

              MSR    Google
  Embedding   59%    67%
  Explicit    57%    68%
Summary
• On analogies, the continuous (neural) representation is not magical.
• Analogies are possible in the explicit representation too.
• On analogies, the explicit representation can be as good as the continuous (neural) representation.
• The objective (function) matters!
Agreement between representations

  Dataset   Both correct   Both wrong   Only embedding correct   Only explicit correct
  MSR       43.97%         28.06%       15.12%                   12.85%
  Google    57.12%         22.17%        9.59%                   11.12%
Comparison of objectives

  Objective        Representation   MSR       Google
  Additive         Embedding        53.98%    62.70%
  Additive         Explicit         29.04%    45.05%
  Multiplicative   Embedding        59.09%    66.72%
  Multiplicative   Explicit         56.83%    68.24%
Recurrent Neural Network
• Recurrent neural network language model (Mikolov et al., NAACL 2013)
• Input-output relations:

  y(t) = g(V s(t))
  s(t) = f(W s(t−1) + U w(t))

• y(t) ∈ ℝ^|Vocab|
• w(t) ∈ ℝ^|Vocab| (one-hot encoding of the word at time t)
• s(t) ∈ ℝ^d
• U ∈ ℝ^(d×|Vocab|)
• W ∈ ℝ^(d×d)
• V ∈ ℝ^(|Vocab|×d)
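• A minimal sketch of one forward step with these shapes, taking f = tanh and g = softmax (all sizes and the random weights are toy assumptions):

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 50, 8                               # toy sizes

U = rng.normal(scale=0.1, size=(d, vocab_size))     # input-to-hidden
W = rng.normal(scale=0.1, size=(d, d))              # hidden-to-hidden (recurrence)
V = rng.normal(scale=0.1, size=(vocab_size, d))     # hidden-to-output

def step(s_prev, word_id):
    w = np.zeros(vocab_size)
    w[word_id] = 1.0                                # one-hot w(t)
    s = np.tanh(W @ s_prev + U @ w)                 # s(t) = f(W s(t-1) + U w(t))
    logits = V @ s                                  # V s(t)
    y = np.exp(logits - logits.max())
    y /= y.sum()                                    # g = softmax: distribution over the next word
    return s, y

s = np.zeros(d)
for word_id in [3, 17, 42]:                         # toy word-id sequence
    s, y = step(s, word_id)
print(y.shape, round(float(y.sum()), 6))            # (50,) 1.0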
Continuous vector representation: negative sampling
• Recall the skip-gram objective:

  L = (1/T) Σ_{t=1..T} Σ_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)

  p(w_i | w_j) = exp(v_i · v_j) / Σ_k exp(v_k · v_j)

• Problem: the denominator sums over the entire vocabulary, so the gradient is hard to calculate.
• Change the objective:

  p(w_i | w_j) = 1 / (1 + exp(−v_i · v_j))

• Has a trivial solution (e.g., make all vectors identical with large norms, so every probability approaches 1)!
• Introduce artificial negative instances (negative sampling):

  L_ij = log σ(v_i · v_j) + Σ_{l=1..k} E_{w_l ∼ P_n(w)} [log σ(−v_l · v_j)]
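• A minimal sketch of this per-pair loss with k sampled negatives; the toy uniform noise distribution P_n(w) and all sizes are assumptions for the example:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k = 1_000, 100, 5
V = rng.normal(scale=0.1, size=(vocab_size, dim))        # word vectors v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(i, j, V, noise_probs, k=5):
    """Negative of L_ij = log sigma(v_i . v_j) + sum_{l=1..k} log sigma(-v_l . v_j), with w_l ~ P_n(w)."""
    positive = np.log(sigmoid(V[i] @ V[j]))              # observed (word, context) pair
    negatives = rng.choice(vocab_size, size=k, p=noise_probs)
    negative = np.log(sigmoid(-(V[negatives] @ V[j]))).sum()
    return -(positive + negative)                        # minimize the negated objective

noise_probs = np.full(vocab_size, 1.0 / vocab_size)      # toy uniform P_n(w)
print(sgns_loss(3, 7, V, noise_probs, k))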
References
Slides from: http://levyomer.wordpress.com/2014/04/25/linguistic-regularities-in-sparse-and-explicit-word-representations/
Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and their Compositionality." Advances in Neural Information Processing Systems (NIPS). 2013.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." HLT-NAACL. 2013.