Explorations in vector space
The continuous-bag-of-words model from word2vec

Jesper Segeblad
911031-4151
January 2016
Contents

1 Introduction
  1.1 Purpose
2 The continuous bag of words model
  2.1 Measuring similarity
  2.2 Evaluating the models
  2.3 Extracting training samples
  2.4 Hyperparameters
3 Experiments
4 Results
5 Conclusions
1 Introduction
Representing words as vectors in a relatively low-dimensional space has in the last few years become increasingly popular in the natural language processing community. In this representation, we can think of words as points in a vector space where semantically similar words lie close to each other. This approach has several advantages over representing words as atomic symbols. Words do have relations to one another, and a representation that can capture these relations can be very beneficial to many natural language applications. Traditionally, these models have used word co-occurrence matrices, counting how many times words co-occur in some text corpus. Other ways of creating these vectors have also been explored, such as trying to predict a word from the words that surround it.
Although these models differ in many ways, they are all trained on raw, unannotated text, and rest on the distributional hypothesis: the hypothesis that words with similar meanings occur in similar contexts (Sahlgren, 2008). This is the assumption that drives these models and also makes them successful.
Word2vec is a highly popular software package that provides two algorithms for creating such word vectors, the skip-gram and continuous-bag-of-words models (Mikolov et al., 2013a). The main difference between the two algorithms is the objective: while the skip-gram model tries to predict the surrounding words given a target word, the continuous-bag-of-words model tries to predict the target word given the surrounding words. In a follow-up paper, the authors recommended using the skip-gram model with negative sampling (Mikolov et al., 2013b), and that is the model from word2vec that has gained the most attention. Less attention has been given to the continuous-bag-of-words model. This does not make it less interesting, however, and it is the model that will be studied in greater detail here.
1.1 Purpose
The purpose here is to gain a deeper understanding of the continuous-bag-of-words model from word2vec: both of the theory and math behind the model and why it actually works, and of how it might be implemented in practice.
2 The continuous bag of words model
The continuous bag of words model (CBOW) is inspired by neural network architectures, with an input layer, a "hidden" layer, and an output layer. Like neural networks, it is trained using gradient descent and backpropagation, a common technique for training such models (Russell and Norvig, 2009). It does, however, lack the non-linear activation function at the hidden layer traditionally used in neural networks, and instead passes on the linear "activation" of the hidden layer (from now on called the projection layer, as in the original article by Mikolov et al. (2013a)) to the output.
First of all, we need a vocabulary, which contains the words that we want to create word vectors for. The input layer has the dimension of the size of the vocabulary (denote this size by V). Between the input layer and the projection layer we have a matrix of dimensions V × N, where N is the desired dimensionality of the word vectors. This matrix will be called W_in. Each row in this matrix corresponds to a word in our vocabulary. Between the projection and output layer we have a matrix of size N × V, which we call W_out. With these matrices we have two vector representations of each word in the vocabulary, one coming from the rows of W_in and one coming from the columns of W_out. These will be referred to as the input vectors and output vectors respectively, following Rong (2014). When the model is initialized, all these word vectors have random values, and the objective is to find better values for them so that words that are semantically similar have similar vectors.
We can think of the input as a number of one-hot encoded column vectors, meaning that one of the entries in each vector is 1 while the rest are 0. How many of these vectors are taken as input is decided by the context size.
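As a small illustration (a sketch of my own in NumPy, not word2vec code, with illustrative variable names), multiplying a one-hot vector by the V × N input matrix simply selects the corresponding row, which is what the projection step described below relies on:

import numpy as np

V, N = 5, 3                                   # toy vocabulary size and vector dimensionality
W_in = np.random.uniform(-1 / (2 * N), 1 / (2 * N), size=(V, N))

word_index = 2                                # index of some word in the vocabulary
x = np.zeros(V)
x[word_index] = 1.0                           # one-hot encoding of that word

projected = x @ W_in                          # multiplying by the input weight matrix ...
looked_up = W_in[word_index]                  # ... is the same as copying the word's row

assert np.allclose(projected, looked_up)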
The goal of the model is to predict the word at the output layer given the context words fed into the model at the input layer. This is done by modeling the conditional probability of the output word given the input words, p(word_out | words_in).
The model takes two inputs at this stage: the context words and the target word. The context words are projected onto the projection layer by combining their respective input vectors, taken from the rows of the input weight matrix, or more formally, by multiplying them with the input weight matrix. But since they are one-hot encoded, this is equivalent to just "copying" them. They are combined by taking the average vector (denote this average by h).
$h = \frac{1}{C}\left(v_{w_1} + v_{w_2} + \dots + v_{w_C}\right)$   (1)
Here, C is the number of words in the context and v_{w_i} is the input vector of the i:th word in the context. We thus end up with an average vector of dimensionality N. With this vector we can compute a "score" for each output word (Rong, 2014) (denote this score u_j).
$u_j = W_{out_j}^{T} \cdot h$   (2)
W_{out_j} is the j:th column of the output matrix, the output vector of word j. It is transposed from a column vector to a row vector so that we can multiply it by the averaged vector h. With this score we can compute the conditional probability of this being the actual output word using the softmax function.
$P(word_{out} \mid words_{in}) = y_j = \frac{\exp(u_j)}{\sum_{k=1}^{V} \exp(W_{out_k}^{T} \cdot h)}$   (3)
The score u_j is exponentiated and divided by the sum of the exponentiated scores of all other words in the vocabulary, giving us a probability of this being the actual output word. The next step is to update the weights of the model based on the error of the prediction. Since we expect only one output word, the objective is to maximize equation 3 for the actual output word. Given this, we have a corresponding loss function E = −log p(word_out | words_in) (Rong, 2014). The error at the output layer can be computed as e_j = y_j − t_j, where y_j is the prediction of the model and t_j is the actual probability of this word being
the output word; t_j is 1 if word j is the actual output word and 0 otherwise. To update the weights between the projection and output layer, this error is multiplied by h, the averaged vector of the input words, and by the learning rate (denoted η). This results in another vector, which is then subtracted from the output vector of the word.
$W_{out_j}^{T} = W_{out_j}^{T} - \eta \cdot e_j \cdot h$   (4)
When the output vectors have been updated, the error is propagated backwards to update the input vectors of the words in the context. In order to do this, the sum over the vocabulary of all output vectors, weighted by the prediction error, is calculated.
$EH = \sum_{j=1}^{V} e_j \cdot W_{out_j}$   (5)
This gives an N-dimensional vector EH, which is used to update the input vectors of the words in the context.

$W_{in_i} = W_{in_i} - \eta \cdot \frac{1}{C} \cdot EH \quad \text{for } i \text{ in context}$   (6)

Here, each vector (i.e. row of the input matrix W_in) of the words in the context is updated by subtracting the "averaged error" multiplied by the learning rate (η).
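To make these steps concrete, the following is a minimal sketch of how equations (1)-(6) can be implemented with NumPy. The function name is my own and not taken from word2vec; it assumes W_in is stored as a V × N array and W_out as an N × V array, as in the description above, and that context_ids and target_id are vocabulary indices.

import numpy as np

def cbow_train_step(W_in, W_out, context_ids, target_id, lr):
    # One update for a single (context words, target word) pair, following eqs. (1)-(6).
    # W_in is V x N (rows = input vectors); W_out is N x V (columns = output vectors).
    # For simplicity, a word repeated in the context is only updated once here.
    C = len(context_ids)
    h = W_in[context_ids].sum(axis=0) / C      # eq. (1): average of the context input vectors
    u = W_out.T @ h                            # eq. (2): one score per vocabulary word
    y = np.exp(u - u.max())                    # (shifted for numerical stability)
    y /= y.sum()                               # eq. (3): softmax over the vocabulary
    e = y.copy()
    e[target_id] -= 1.0                        # error e_j = y_j - t_j
    EH = W_out @ e                             # eq. (5): error-weighted sum of output vectors
    W_out -= lr * np.outer(h, e)               # eq. (4): move every output vector
    W_in[context_ids] -= lr * EH / C           # eq. (6): move the context input vectors
    return -np.log(y[target_id])               # loss E = -log p(word_out | words_in)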
The computations for updating the vectors are rather costly, since for each possible output word we have to compute its probability and compare it to the actual probability (0 or 1). The number of output probabilities is the same as the number of words in the vocabulary, meaning that the execution time becomes quite substantial if we have a large vocabulary. With a vocabulary of one million words, a total of one million such computations would have to be made for just one training example. There are, however, some optimization tricks that can be applied, such as hierarchical softmax and negative sampling, described in Mikolov et al. (2013b) (though in the context of the skip-gram model). Hierarchical softmax makes use of a binary tree that represents the output layer. Negative sampling reformulates the task from a multi-class classification problem into a binary one: given a training sample, the task is to predict whether the word and its context come from the real training data or have been sampled from a noise distribution.
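As a rough sketch of the negative sampling idea (my own simplified version, not the word2vec implementation; how the noise words noise_ids are sampled is left out and assumed to be given), each update only touches the true target and a handful of noise words instead of the whole vocabulary:

import numpy as np

def cbow_negative_sampling_step(W_in, W_out, context_ids, target_id, noise_ids, lr):
    # Binary objective: push the true target's score up and the scores of a few
    # sampled noise words down, instead of normalizing over the whole vocabulary.
    h = W_in[context_ids].mean(axis=0)
    EH = np.zeros_like(h)
    for j, label in [(target_id, 1.0)] + [(k, 0.0) for k in noise_ids]:
        p = 1.0 / (1.0 + np.exp(-(W_out[:, j] @ h)))   # sigmoid of the word's score
        g = p - label                                   # prediction error for this word
        EH += g * W_out[:, j]
        W_out[:, j] -= lr * g * h
    W_in[context_ids] -= lr * EH / len(context_ids)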
While the math behind this can seem rather complex and involved, the intuitive understanding of
the model is somewhat simpler and builds on the distributional hypothesis previously mentioned. The
objective of the model is to maximize the probability of the output target word given the context words,
and the weights of the model (the word vectors) are moved in order to maximize this probability. If the
probability of a word is overestimated, the input vectors will move further away from the output vector.
And if the probability is underestimated, the input vectors will move closer to the output vectors (Rong,
2014).
2.1 Measuring similarity
With the resulting vectors, the distance between the words in the vocabulary can be measured. If we think of the words as points in the vector space, words that are close to each other should have a similar meaning (given that our model actually works and that our assumption about word similarity being reflected by distributional properties holds).
Mikolov et al. (2013a) use cosine similarity to measure the similarity between word vectors. Geometrically, cosine similarity is the cosine of the angle between two vectors, and it is defined as the dot product divided by the product of their respective lengths (Jurafsky and Martin, 2009).
$\text{cos\_sim}(v, u) = \frac{v \cdot u}{|v| \cdot |u|}$   (7)
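In NumPy, equation (7) can be computed directly (a trivial sketch):

import numpy as np

def cos_sim(v, u):
    # Equation (7): dot product divided by the product of the vector lengths
    return (v @ u) / (np.linalg.norm(v) * np.linalg.norm(u))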
2.2 Evaluating the models
These types of vector representations are often evaluated on various word similarity tasks. Mikolov et al. (2013a) evaluate the models on two types of similarity tasks, semantic and syntactic, and on one sentence completion task. The similarity tasks are questions of the type "a is to b as c is to d", where the task is to fill in d. This is accomplished with vector addition and subtraction: X = vec(b) − vec(a) + vec(c). With this resulting vector X, a search over all the vectors in the vocabulary can be made to find the vector that is closest to it. The sentence completion task consists of trying to complete a sentence that has one missing word. A list is given with five words that can all be said to be reasonable choices (Mikolov et al., 2013a).
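A sketch of how such an analogy query could be answered, assuming a vectors matrix with one word vector per row and a vocab dictionary mapping words to row indices (both names are illustrative, not part of word2vec):

import numpy as np

def analogy(a, b, c, vectors, vocab):
    # "a is to b as c is to ?": look for the word whose vector is closest
    # (by cosine similarity) to vec(b) - vec(a) + vec(c).
    x = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
    sims = (vectors @ x) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(x))
    index_to_word = {i: w for w, i in vocab.items()}
    for idx in np.argsort(-sims):               # best match first
        if index_to_word[idx] not in (a, b, c): # skip the query words themselves
            return index_to_word[idx]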
2.3 Extracting training samples
Since these models are trained on unannotated text, obtaining large amounts of training data is no problem. From this text, training samples can be extracted. These training samples are words together with the contexts they appear in (from here on called word and context pairs). The context is defined as a symmetric window around the focus word. One can view this as sliding a window over the text collection and looking at a word and the n words around it. If, for example, a window size of 2 is chosen, the two preceding and the two following words are used as the context. Exactly how many words should be used as context is not well established, and different window sizes capture different types of semantic similarity (Goldberg, 2015). This might be something to choose depending on the task that the resulting vectors are to be used for.
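A sketch of this windowing, assuming sentences are already tokenized into lists of words (the function name is illustrative):

def extract_pairs(sentence, window=2):
    # Slide a symmetric window over the tokens and collect (context, target) pairs.
    pairs = []
    for i, target in enumerate(sentence):
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        if context:                              # at sentence edges the window is truncated
            pairs.append((context, target))
    return pairs

# With window=2, the context of "juice" in "drink apple juice there"
# becomes ["drink", "apple", "there"].
print(extract_pairs("drink apple juice there".split()))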
Goldberg (2015) also describes some types of preprocessing that can be applied before extracting the word-context pairs, which can include removing words that appear too frequently or too infrequently, and removing sentences that are either too long or too short.
2.4 Hyperparameters
The continuous-bag-of-words model relies on a few hyperparameters. They won't be covered in much detail, but the most important ones, those used in the implementation later on, are described here.
First, we have to decide on the dimensionality of the resulting word vectors (N in the previous description). The results presented in Mikolov et al. (2013a) indicate that a higher dimensionality might give better word representations, since vectors of higher dimensionality performed better on the tasks they were evaluated on.
Another parameter is how the vectors are initialized. Since they are initialized to random values, one has to decide how these random values are chosen. The approach used in word2vec is to use values sampled uniformly between −1/(2N) and 1/(2N) (Goldberg, 2015).
Yet another hyperparameter is the learning rate. Mikolov et al. (2013a) used an initial learning rate of 0.025 and decreased it during training so that it approached zero towards the end.
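A sketch of these choices in NumPy (the vocabulary size is illustrative, and the linear decay is just one way to let the learning rate approach zero; the exact schedule in word2vec may differ in detail):

import numpy as np

V, N = 10000, 100                    # illustrative vocabulary size and vector dimensionality
initial_lr = 0.025                   # initial learning rate used by Mikolov et al. (2013a)

# Uniform initialization in [-1/(2N), 1/(2N)]
W_in = np.random.uniform(-1 / (2 * N), 1 / (2 * N), size=(V, N))
W_out = np.random.uniform(-1 / (2 * N), 1 / (2 * N), size=(N, V))

def learning_rate(step, total_steps):
    # Linear decay towards zero over training (one possible schedule).
    return initial_lr * max(1.0 - step / total_steps, 0.0001)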
3 Experiments
To get a clearer understanding of the inner workings of the model, a small implementation was made in Python together with the NumPy library (Van Der Walt et al., 2011). This implementation is rather simple and not very efficient, because of the costly weight updating described above, and none of the optimization tricks were implemented.
The model as implemented is initialized together with a vocabulary. This vocabulary is a Python dictionary with the words that we want to create word vectors for. Each word also has a corresponding index number, which is how we know which vector is assigned to which word. The weight matrices are initialized randomly with values between −1/(2N) and 1/(2N) drawn from a uniform distribution, as in the original implementation. The learning rate was set to a default value of 0.1, and it is not adjusted during training of the model. The default size of the projection layer is set to 100.
As the original intention was to train the model on the first 1,000,000 tokens from Swedish Wikipedia, some methods to preprocess that corpus were implemented. This preprocessing removes all words appearing fewer than 10 times in the corpus, so that they appear neither as target nor as context words. This was done before extracting the word and context pairs used as training samples. In practice this means that the context window is expanded. The context window size was set to the two previous words and the two following words. For words that appear at the beginning of a sentence and have no preceding words, only the following words are used as context words. The same holds for words at the end of a sentence. Following Levy et al. (2015), sentences shorter than 5 words were also removed. This resulted in around 9,000 unique words and 55,000 sentences for training. Even with that relatively small amount of words, training was too slow to complete over the acquired 1,000,000 words from Swedish Wikipedia.
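A sketch of this preprocessing, assuming sentences are already tokenized into lists of words (the function name and the exact order of the filtering steps are my assumptions; the thresholds come from the description above):

from collections import Counter

def preprocess(sentences, min_count=10, min_length=5):
    # sentences: a list of token lists.
    # Drop sentences shorter than min_length and remove words occurring fewer than
    # min_count times, so that they appear neither as target nor as context words
    # (which in effect widens the window around the remaining words).
    sentences = [s for s in sentences if len(s) >= min_length]
    counts = Counter(word for s in sentences for word in s)
    filtered = [[w for w in s if counts[w] >= min_count] for s in sentences]
    return [s for s in filtered if s]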
These limitations make it hard to evaluate how well the implemented model actually performs in the way originally intended. The code necessary for doing so was implemented, however. Instead, a small testing corpus was built to see how well the implementation performed and whether the resulting vector similarities make sense. This small testing corpus can be found below. These experiments were done with a model having the default parameters described above. The window size was not set to 2+2, however; instead a window size of 1+1 was used, i.e. one word to the left and one word to the right.
drink apple juice there.
drink orange juice there.
drink tomato juice there.
drink beer there.
drink milk there.
they eat beans here.
they eat meat here.
they eat pasta here.
they eat pork here.
they eat rice here.
we play football outside.
we play ice hockey outside.
we play soccer outside.
we play golf outside.
As can be seen, this testing corpus is quite small. The small size is made up for by doing multiple
passes over the training data.
4 Results
Some results can be found in the table below. These results were obtained using a vector dimensionality of 100 and 200 passes over the training data. This high number of passes was necessary because of the small size of the training data. Although these results might not be the most interesting, they do reflect some regularities in the training data. If we, for example, look at the first three entries in the table, we see that the closest words are words separated by one other word; in other words, they are words that appear in the same contexts.
Word     Closest word    Cosine similarity
play     outside         0.987
drink    juice           0.919
eat      here            0.999
juice    drink           0.919
apple    there           0.454
beans    they            0.412
To see the effect of a smaller vector dimensionality, another test was done with this parameter set to 50 instead of the default 100. These results can be seen below. The only thing that has changed is the cosine similarities (and not even by that much); the closest words are still the same.
Word     Closest word    Cosine similarity
play     outside         0.982
drink    juice           0.968
eat      here            0.998
juice    drink           0.968
apple    there           0.347
beans    they            0.531
200 passes over the training data is quite a lot. To see the effects of a smaller number of passes, this number was reduced to 50. The results are presented below.
Word     Closest word    Cosine similarity
play     outside         0.353
drink    juice           0.497
eat      here            0.689
juice    drink           0.497
apple    pork            0.208
beans    they            0.315
The cosine similarities are significantly lower. The most interesting thing, however, is that the nearest word to apple is now pork, and they are not even that close. This is quite surprising since they do not share any context in the training data. It might be attributed to the small number of actual training instances and the smaller number of passes over the data.
5 Conclusions
Vector representations of words are very popular among natural language processing researchers and
practitioners today. They perform very well on a number of tasks and can capture many types of word
similarities. Here, the continuous-bag-of-words model from the popular word2vec package has been
implemented. Even though the model could not be tested as originally intended, the tests that were
made did show that the resulting word vectors could reflect regularities in the training data. And while
the provided implementation does not scale to large vocabularies, it does provide a starting point for
exploring the model further. One such exploration might be to look at the optimizations regarding weight
updating.
References
Yoav Goldberg. A primer on neural network models for natural language processing. arXiv preprint
arXiv:1510.00726, 2015.
Daniel Jurafsky and James H Martin. Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics, and Speech Recognition, volume 21. Pearson Education, 2nd edition, 2009. ISBN 0130950696. doi: 10.1162/089120100750105975.
Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned
from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225,
2015.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations
in vector space. In Proceedings of Workshop at ICLR, 2013a.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of
words and phrases and their compositionality. In Advances in neural information processing systems,
pages 3111–3119, 2013b.
Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.
Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 3rd
edition, 2009. ISBN 9780136042594.
Magnus Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–53, 2008.
Stéfan van der Walt, S Chris Colbert, and Gaël Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.