Index-learning of unsupervised low dimensional embeddings
Ben Graham
September 15, 2014
Dept of Statistics, University of Warwick, CV4 7AL, UK b.graham@warwick.ac.uk
Abstract
We introduce a simple unsupervised learning method for creating low-dimensional embeddings. Autoencoders work by simultaneously learning how to encode the input to a low dimensional representation and how to decode that representation to reconstruct the original input; the need to be able to reconstruct the input places a significant limit on the complexity of what can be learnt. The main benefit of our approach is that it works on datasets where creating a low dimensional representation requires throwing away so much information that it is unreasonable to attempt to reconstruct the input from any kind of low dimensional representation.
1 Low-dimensional embeddings

1.1 Autoencoders
A great deal of research has gone into constructing low dimensional embeddings for datasets, for
example [1, 2]. The intuitive idea is that the data, for example 28 × 28 = 784 dimensional MNIST
images [3], really live on a lower dimensional manifold, so it should be possible to project to a lower
dimensional space without losing any important information. Techniques such as Hessian free
optimization and Nesterov’s Accelerated Gradient, helped by pretrained or carefully chosen initial
weights, allow high fidelity autoencoders to be trained for datasets such as MNIST. Essentially
an autoencoder consists of two parts: in the specific case of MNIST you might have an encoder
784N-1000N-500N-250N-30N, and then a decoder 30N-250N-500N-1000N-784N. Backpropagation is used to minimize the L2-distance between the input and the output.
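To make the layer shapes concrete, here is a minimal numpy sketch of one forward pass through such an autoencoder and its L2 reconstruction loss (an illustration only, not the code of any of the cited papers; the tanh activations and random weights are assumptions):

import numpy as np

# Encoder 784-1000-500-250-30 and its mirror-image decoder 30-250-500-1000-784.
sizes = [784, 1000, 500, 250, 30]
enc = [np.random.randn(a, b) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])]
dec = [np.random.randn(b, a) * 0.01 for a, b in zip(sizes[:-1], sizes[1:])][::-1]

def forward(x, weights):
    for W in weights:
        x = np.tanh(x @ W)
    return x

x = np.random.rand(64, 784)                       # a minibatch of flattened images
code = forward(x, enc)                            # 64 x 30 low dimensional codes
recon = forward(code, dec)                        # 64 x 784 reconstruction
loss = np.mean(np.sum((x - recon) ** 2, axis=1))  # L2 reconstruction error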
However, MNIST digits are smooth by construction: they are the product of a Gaussian blur applied to NIST digits. So it is feasible to reconstruct the digits almost exactly. However, if you add noise to the images, the high dimensional noise cannot be contained within a 30 dimensional signal, so the autoencoder will reconstruct noise-free versions of the images. In that scenario,
removing artificially added noise is probably a good thing. However, suppose you have videos that
at any given point may or may not contain a cat in the frame. To create an embedding into one
dimension (with cats on one side and not-cats on the other) is possible, but it requires a very
carefully constructed autoencoder and massive computing power [4]. Essentially, the difficulty is
that cats are complicated creatures and they tend not to look like an average of cat pictures.
1.2 Index learning
The aim of this paper is to provide an alternative unsupervised learning algorithm to compete with
autoencoders. The basic idea is so simple it is rather surprising it works at all. We will first give a
very rough sketch of the main idea, and then later on we will explain the two optimizations needed
to make it practical to use.
With MNIST in mind, consider making a small mistake when performing supervised training of an artificial neural network (ANN) to solve the MNIST classification problem. Imagine
that you accidentally deleted the 60,000 training labels, and replaced them with the numbers
0, 1, 2, 3, . . . , 59998, 59999. It now looks like we have 60,000 classes, with exactly one training sample per class. To solve the modified classification problem you might build an ANN with architecture
784N-1000N-500N-250N-30N-60000N
with softmax being used at the top level to create a probability distribution on the space of 60,000
“classes”. The one-hot encoded training labels form the identity matrix with size 60,000×60,000.
The backpropagation algorithm is normally a form of supervised learning, but here we are really
doing unsupervised training, as we are using the wrong labels.
What can we say about the training process? Each iteration will be quite slow, due to the
large output layer size. However, the number of parameters in the top layer is large, 30×60,000, so the top layer should be able to learn the training data quite easily. We call this network an
index-learner, as it learns the indices of the training data.
What properties can we expect the trained network to have? Given a picture of a ‘1’ as input,
the output will be a probability distribution over the training images. The distribution should give
higher probability to the indices of similar looking digits (mostly other 1s and a few 7s perhaps)
and lower probability to the indices of the other training data. In order to do this, assuming the
top layer hasn’t ’overfit’ the training data, the network needs to compress the input down to a 30
dimensional quantity in the top hidden layer. Thus although we have trained the full network, it
becomes interesting when we chop off the top layer to give an encoder of the form 784N-1000N-500N-250N-30N, one without a matching decoder.
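In code, the "mistake" amounts to nothing more than replacing the labels with the sample indices before handing the data to an ordinary softmax classifier; a minimal sketch (train_softmax_classifier is a hypothetical stand-in for any standard supervised training routine):

import numpy as np

X_train = np.random.rand(60000, 784)   # stand-in for the MNIST training images
y_index = np.arange(len(X_train))      # "labels" 0, 1, ..., 59999: one class per sample
# model = train_softmax_classifier(X_train, y_index, n_classes=len(X_train))
# Afterwards, discard the 60,000-way softmax layer and keep the encoder below it.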
1.3 Autoencoders vs. index-learners: a toy example
To give a concrete example of why index-learning might be better than autoencoding, consider a
toy classification problem: it is motivated perhaps by the idea of genetic data on a DNA strand
or by texture in a picture. Consider a collection of binary sequences of length 100. Imagine each
sequence is the output of a symmetric two-state Markov chain on the set {0, 1}; at each step the Markov chain jumps with probability p and holds still with probability 1 − p. Suppose further that half the sequences come from the class p = 0.4, and the other half come from the class p = 0.6. For
example,
p = 0.4
101001010111011010111011100101011000110110011000001000100101 . . .
p = 0.6
001000111111000111101000100000000011010000010010011110000101 . . .
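The paper does not give code for the toy data, but a minimal numpy sketch of the generating process might look like this (sequence length and class proportions as described above):

import numpy as np

def markov_sequence(p, length=100):
    # Symmetric two-state Markov chain on {0, 1}: at each step the chain
    # flips with probability p and holds still with probability 1 - p.
    x = np.empty(length, dtype=int)
    x[0] = np.random.randint(2)
    flips = np.random.rand(length - 1) < p
    for t in range(1, length):
        x[t] = x[t - 1] ^ flips[t - 1]
    return x

# 10,000 sequences, half from each class.
data = np.array([markov_sequence(p) for p in [0.4] * 5000 + [0.6] * 5000])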
This is a toy problem as, with high probability, simply counting the number of flips is enough
to determine the class. Thus it is easy to build by hand a feed-forward ANN that can solve the classification problem. However, just because we can hand-craft a solution does not mean that we can easily learn a solution without supervision.
First, consider the supervised version of the toy classification problem; say we have 10,000
training sequences, and we also have 10,000 class labels. A standard fully connected ANN can learn
the problem and generalize to test data. It is possible the ANN will overfit the training data and
give very poor test performance, but that can be fixed by using dropout [5]. Alternatively, if you
know the data has a one dimensional structure then a suitable convolutional neural network with
shared weights will also work well [6].
Now consider the toy problem as an unsupervised learning problem: you just have 10,000 training
samples. An autoencoder cannot be expected to work. The entropy of each training sample is close to 100 bits. In L2-distance, an autoencoder with minimum layer size much less than 100 cannot learn to do much better than making a null prediction, i.e. predicting that each bit is half on and half off. [Theoretically the autoencoder has infinite information capacity, but that is not very relevant in practice.] Similarly, PCA will not work: the principal components will be essentially random and uninformative. A restricted Boltzmann machine trained by contrastive divergence will only be able to learn the training distribution if the hidden layer size is at least as large as the visible layer. On the other hand, consider using an index-learner. If we can train it successfully, the encoding portion
of the network will separate out the two classes. The softmax layer will then put its probability
onto the p = 0.4 training sequences if the input sequence is of type p = 0.4, and vice versa for
p = 0.6.
2 Index-learning encoders
We will now explain how to train index-learners efficiently. We will look at the specific case of MNIST; the ideas should apply to many unsupervised learning problems where the corresponding
supervised learning problem is easy for ANNs. We start with an ANN architecture capable of
solving the training-index learning problem, say
784N-1000N-500N-250N-30N-60,000N.
Alternatively you could consider a convolutional architecture such as
(1 × 28 × 28)N-20C5-MP2-50C5-MP2-500N-30N-60000N.
What is important is that
• the final hidden layer is small, say 30N or 100N,
• the output layer is a softmax classifier, and
• the number of outputs is the size of the training set.
The softmax function
$$\mathrm{softmax}(x_1, x_2, \dots) = \left( \frac{e^{x_1}}{e^{x_1} + e^{x_2} + \cdots},\; \frac{e^{x_2}}{e^{x_1} + e^{x_2} + \cdots},\; \dots \right)$$
has two important properties:
• It always produces a probability distribution.
• If we apply softmax to a subset of the x_1, x_2, . . . , then we get the original probability distribution conditioned on taking a value in the subset.
The index labels we are trying to learn, in one-hot encoded form, are given by the 60,000 × 60,000
identity matrix. An important property of the identity matrix is that if you form a submatrix by
selecting
• a subset of the rows, and
• a subset of the columns, and
• the two subsets are the same,
then the submatrix is a smaller identity matrix. We use these properties of the softmax and identity matrix to do a doubled-up form of minibatch gradient descent. To prevent overfitting in the softmax layer, which would relieve the rest of the network of the need to learn a good encoding, we use an autoregressive version of dropout.
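Both facts are easy to check numerically; the following sketch (illustration only) verifies them for a small example:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                     # subtract max for numerical stability
    return e / e.sum()

x = np.random.randn(8)
subset = [1, 4, 6]

# Softmax restricted to a subset equals the full softmax conditioned on the subset.
full = softmax(x)
assert np.allclose(softmax(x[subset]), full[subset] / full[subset].sum())

# The same rows and columns of the identity matrix form a smaller identity matrix.
I = np.eye(8)
assert np.allclose(I[np.ix_(subset, subset)], np.eye(len(subset)))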
2.1 Doubled-up minibatch gradient descent
Let W_1, . . . , W_m denote parameter matrices corresponding to the ANN; for example W_m is a 30 × 60000 matrix. To produce a minibatch, pick a random collection of indices, say i = (i_1, . . . , i_n) ∈ {1, . . . , 60000}^n. In normal minibatch gradient descent, the different elements of the minibatch only
interact additively to form an average gradient. In doubled-up minibatch gradient descent, they
also interact in a competitive way when the softmax function is applied. There are three parts to
the minibatch:
• The input, with size n × 784, formed by selecting the rows i_1, . . . , i_n from the training set in the normal way.
• A minibatch ANN. Let W_m[:, i] denote the 30 × n submatrix of W_m corresponding to the columns i_1, . . . , i_n. Using W_1, . . . , W_{m−1} and W_m[:, i] we can form a sub-ANN of the form:
784N-1000N-500N-250N-30N-nN.
• The one-hot encoded target distributions. This is just the n × n identity matrix, which we are thinking of as a submatrix of the 60,000 × 60,000 identity matrix.
The minibatch ANN performs a subset of the index-learning problem. It tries to guess how the n minibatch inputs correspond to the n minibatch indices i. Performing backpropagation on the minibatch ANN allows us to take a gradient descent step with respect to W_1, . . . , W_{m−1} and W_m[:, i]. Notice that by updating W_m[:, i] we are also updating a fraction of the entries of W_m.
If the larger ANN is perfectly trained, then the minibatch ANN will also perform perfectly. It is
not immediately obvious that the reverse holds, that by training on minibatches the larger network
learns anything. What we are doing is analogous to dropout, albeit applied to the final layer of the
ANN, rather than to the input and hidden layers. In practice, this procedure seems to work well. It is also easy to program: it is a fairly trivial modification to the normal ANN training procedure, and the minibatch updates take no longer to run than you would expect for a traditional feed-forward ANN used for supervised training (i.e. 784N-1000N-500N-250N-10N).
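A sketch of a single doubled-up minibatch step is given below (illustration only, not the released Theano code; encode() stands in for the 784N-...-30N part of the network, and only the gradient for the sliced top-layer weights W_m[:, i] is written out, whereas in practice the same error signal is also backpropagated through the encoder):

import numpy as np

n, N = 100, 60000
W_m = 0.01 * np.random.randn(30, N)              # top-layer weight matrix

def encode(x):
    return np.random.randn(len(x), 30)           # placeholder for the encoder output

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

i = np.random.choice(N, size=n, replace=False)   # minibatch indices i_1, ..., i_n
x = np.random.rand(n, 784)                       # stands in for the rows i of the training set
h = encode(x)                                    # n x 30 codes
p = softmax_rows(h @ W_m[:, i])                  # n x n predicted distributions
targets = np.eye(n)                              # the n x n sub-identity matrix
grad = h.T @ (p - targets) / n                   # cross-entropy gradient w.r.t. W_m[:, i]
W_m[:, i] -= 0.1 * grad                          # update only the selected columns of W_m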
2.2 Autoregressive dropout
It is useful to perform a type of dropout in the classification layer that allows learning while
preventing overfitting. Say that each entry in W_m is either on or off. Entries are on with probability
p (typically p = 0.5). Every time an entry is used for training, with probability q (typically q = 0.01)
the entry is re-randomized, so once again it is on with probability p, independently of its previous
state. When an entry is switched off, its value is set to zero. When it is switched on, it has the opportunity to be updated. Like regular dropout, this encourages robustness in the final hidden layer. The time coherence is needed as each column of W_m is accessed relatively infrequently, only once per iteration through the whole training set.
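A sketch of how the persistent on/off states might be stored and re-randomized (illustration only; the exact bookkeeping in the released code may differ):

import numpy as np

p, q = 0.5, 0.01
on = np.random.rand(30, 60000) < p               # persistent on/off state for each entry of W_m

def dropped_out_columns(W_m, i):
    # Each time an entry is used, with probability q its state is re-randomized,
    # becoming 'on' again with probability p independently of its previous state.
    rerandomize = np.random.rand(30, len(i)) < q
    on[:, i] = np.where(rerandomize, np.random.rand(30, len(i)) < p, on[:, i])
    # Switched-off entries act as zero; switched-on entries can be updated.
    return W_m[:, i] * on[:, i]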
Given our use of dropout, we have not tried to keep the dimension of the ‘low dimensional
embedding’ as small as possible. Rather, it seems to be more effective to use a larger encoding size,
and then use PCA to reduce the dimension further.
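For example, a 100-dimensional index-learner embedding can be projected down for visualization or further processing with off-the-shelf PCA (scikit-learn here is an assumption; the paper does not say which implementation was used):

import numpy as np
from sklearn.decomposition import PCA

codes = np.random.randn(10000, 100)              # stand-in for the learnt 100-dimensional embeddings
codes_2d = PCA(n_components=2).fit_transform(codes)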
3 Experiments
Experiments were performed using Python and Theano [7]; the code will be released shortly. We used tanh for our non-linear function, except for the Twenty Newsgroups dataset where we just used a form of L2-normalization. Training was carried out using Nesterov's accelerated form of gradient descent. We chose the minibatch size to reflect the number of classes; that way the ratio of positive to negative cases in the softmax layer mirrors supervised training.
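For reference, a generic Nesterov-style momentum update for one parameter matrix looks like the sketch below (the learning rate and momentum values are placeholders; the paper does not state the hyper-parameters used):

def nesterov_step(W, velocity, grad_at, lr=0.1, mu=0.9):
    # Evaluate the gradient at the look-ahead point W + mu * velocity,
    # then blend it into the velocity and take a step. W and velocity are
    # numpy arrays of the same shape; grad_at returns the gradient at a point.
    g = grad_at(W + mu * velocity)
    velocity = mu * velocity - lr * g
    return W + velocity, velocity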
Index-learners are in a sense a novel form of neural network. We therefore had no prior knowledge about the best values for meta-parameters such as network size. Techniques for training recurrent neural networks and autoencoders have developed over many years. It is possible there is similar scope for improving the training procedures for index-learners.
3.1 The toy problem
We used a 1D-convolutional index-learner
10C3-MP2-20C2-MP2-30C2-MP2-40C2-MP2-300N-100N-10,000N
with a batch size of 10. The network quickly separates the two classes; 100N refers to the low dimensional representation size. Given the probabilistic definition of the two classes, they do actually overlap, so perfect separation cannot be expected. See Figure 1.
3.2 MNIST
We trained a fully connected 784N-500N-500N-500N-100N-60,000N index-learner on the MNIST digits dataset. In Figure 2 we show four pairs of digits: 0s and 1s; 1s and 7s; 6s and 8s; and 4s and 9s. It is not very surprising that 0s and 1s separate out neatly. Most of the 1s and 7s, and 6s and 8s, are distinct. However, there is a lot of confusion between the 4s and 9s.
In Figure 3 we show some of the features learnt in the input layer. They look similar to the
ones you get when you train an ANN in a supervised manner. This shows that the index-learner is
learning discriminative features in the absence of labels or decoding/reconstruction.
We also trained a convolutional index-learner inspired by LeNet-5,
(1 × 28 × 28)N-20C5-MP2-50C5-MP2-500N-100N-60,000N.
Figure 1: The toy problem: The first two PCA components for the test set embeddings trained
using a 1d convolutional index-learner.
The results were pretty similar to the results for the fully connected index-learner above.
3.3 20 Newsgroups
We looked at the word frequencies for the most common 2,000 words in the Twenty Newsgroups
dataset [8]. We split the dataset into a training set of size 11300 and a test set of size 2000. We
trained a
2000N-500N-100N-11300N
index-learner on the training data from all twenty newsgroups. We then looked at test embeddings
for three particular newsgroups. See Figure 4. We chose two newsgroups (A and T) that we thought would be similar, and one newsgroup that we thought would be rather different (C). Looking
at the embeddings, we can see a difference between A and C, and T and C, but no real difference
between A and T.
4 Conclusions
Index-learners may be an alternative to autoencoders in certain problem domains. They have the
potential to be robust with respect to high entropy training data. The remaining challenge is to gain an improved understanding of how best to design and train index-learners for larger datasets.
They may also be useful as a means of pretraining. If you have a lot of unlabelled data and a small amount of labelled data, you could perform index-learning first, and then replace the index-learner's softmax function with a supervised-learning softmax function for fine-tuning.
Figure 2: For pairs, 0,1; 1,7; 6,8 and 4,9; the union of the MNIST test digits in both classes plotted
using the 100-dimensional embedding followed by PCA.
Figure 3: Some of the input-level filters learnt by a fully connected index-learning encoder trained
on 28 × 28 MNIST digits.
Figure 4: A=alt.atheism, C=comp.graphics, T=talk.religion.misc; A and T are related so they are
hard to distinguish, but they can both be separated from C.
References
[1] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[2] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance
of initialization and momentum in deep learning. In ICML, volume 28 of JMLR Proceedings,
pages 1139–1147. JMLR.org, 2013.
[3] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[4] Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.
[5] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR,
abs/1207.0580, 2012.
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[7] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU
and GPU math expression compiler. In Proceedings of the Python for Scientific Computing
Conference (SciPy), June 2010.
[8] The 20 Newsgroups data set. http://qwone.com/~jason/20Newsgroups/.