Index-learning of unsupervised low dimensional embeddings

Ben Graham
Dept of Statistics, University of Warwick, CV4 7AL, UK
b.graham@warwick.ac.uk

September 15, 2014

Abstract

We introduce a simple unsupervised learning method for creating low-dimensional embeddings. Autoencoders work by simultaneously learning how to encode the input to a low-dimensional representation and how to decode that representation to reconstruct the original input; the need to be able to reconstruct the input places a significant limit on the complexity of what can be learnt. The main benefit of our approach is that it works on datasets where creating a low-dimensional representation requires throwing away so much information that it is unreasonable to attempt to reconstruct the input from any kind of low-dimensional representation.

1 Low-dimensional embeddings

1.1 Autoencoders

A great deal of research has gone into constructing low-dimensional embeddings for datasets, for example [1, 2]. The intuitive idea is that the data, for example the 28 × 28 = 784-dimensional MNIST images [3], really live on a lower-dimensional manifold, so it should be possible to project to a lower-dimensional space without losing any important information. Techniques such as Hessian-free optimization and Nesterov's accelerated gradient, helped by pretrained or carefully chosen initial weights, allow high-fidelity autoencoders to be trained for datasets such as MNIST. Essentially an autoencoder consists of two parts: in the specific case of MNIST you might have an encoder 784N-1000N-500N-250N-30N, and then a decoder 30N-250N-1000N-784N. Backpropagation is used to minimize the L2-distance between the input and the output.

MNIST digits are smooth by construction: they are the product of a Gaussian blur applied to NIST digits, so it is feasible to reconstruct the digits almost exactly. However, if you add noise to the images, the high-dimensional noise cannot be contained within a 30-dimensional signal, so the autoencoder will reconstruct noise-free versions of the images. In that scenario, removing artificially added noise is probably a good thing. However, suppose you have videos that at any given point may or may not contain a cat in the frame. Creating an embedding into one dimension (with cats on one side and not-cats on the other) is possible, but it requires a very carefully constructed autoencoder and massive computing power [4]. Essentially, the difficulty is that cats are complicated creatures, and they tend not to look like an average of cat pictures.

1.2 Index learning

The aim of this paper is to provide an alternative unsupervised learning algorithm to compete with autoencoders. The basic idea is so simple it is rather surprising that it works at all. We will first give a very rough sketch of the main idea, and later on we will explain the two optimizations needed to make it practical to use.

With MNIST in mind, consider making a small mistake when performing supervised training of an artificial neural network (ANN) to solve the MNIST classification problem. Imagine that you accidentally deleted the 60,000 training labels and replaced them with the numbers 0, 1, 2, 3, ..., 59998, 59999. It now looks like we have 60,000 classes, with exactly one training sample per class. To solve the modified classification problem you might build an ANN with architecture 784N-1000N-500N-250N-30N-60000N, with softmax used at the top level to create a probability distribution on the space of 60,000 "classes".
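To make the relabelling concrete, here is a minimal sketch (the variable names are illustrative, and the network itself is left abstract):

```python
import numpy as np

# The true labels are discarded; each training image simply gets its own index as its "class".
index_labels = np.arange(60000)      # 0, 1, 2, ..., 59999

# The index-learning problem is then an ordinary 60,000-way classification problem with one
# training sample per class, e.g. an ANN of shape 784N-1000N-500N-250N-30N-60000N trained
# with a softmax output layer and the usual cross-entropy loss against index_labels.
n_outputs = len(index_labels)        # size of the softmax layer = size of the training set
```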
The one-hot encoded training labels then form the 60,000 × 60,000 identity matrix. The backpropagation algorithm is normally a form of supervised learning, but here we are really doing unsupervised training, as we are using the wrong labels.

What can we say about the training process? Each iteration will be quite slow, due to the large size of the output layer. However, the number of parameters in the top layer is also large, 30 × 60,000, so the top layer should be able to learn the training data quite easily. We call this network an index-learner, as it learns the indices of the training data.

What properties can we expect the trained network to have? Given a picture of a '1' as input, the output will be a probability distribution over the training images. The distribution should give higher probability to the indices of similar-looking digits (mostly other 1s, and perhaps a few 7s) and lower probability to the indices of the rest of the training data. In order to do this, assuming the top layer has not 'overfit' the training data, the network needs to compress the input down to a 30-dimensional quantity in the top hidden layer. Thus, although we have trained the full network, it becomes interesting when we chop off the top layer to give an encoder of the form 784N-1000N-500N-250N-30N, one without a matching decoder.

1.3 Autoencoders vs. index-learners: a toy example

To give a concrete example of why index-learning might be better than autoencoding, consider a toy classification problem, motivated perhaps by the idea of genetic data on a DNA strand, or by texture in a picture. Consider a collection of binary sequences of length 100. Imagine each sequence is the output of a symmetric two-state Markov chain on the set {0, 1}: at each step the Markov chain jumps with probability p and holds still with probability 1 − p. Suppose further that half the sequences come from the class p = 0.4, and the other half come from the class p = 0.6. For example:

p = 0.4: 101001010111011010111011100101011000110110011000001000100101...
p = 0.6: 001000111111000111101000100000000011010000010010011110000101...

This is a toy problem as, with high probability, simply counting the number of flips is enough to determine the class. Thus it is easy to build by hand a feed-forward ANN that can solve the classification problem. However, just because we can hand-craft a solution does not mean that we can easily learn a solution without supervision.
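For concreteness, here is a minimal sketch of how such a training set could be generated (using numpy, with the sequence length and flip probabilities described above; this is not the code used for the experiments in Section 3):

```python
import numpy as np

rng = np.random.default_rng(0)

def markov_sequence(p, length=100):
    """Sample one binary sequence from a symmetric two-state Markov chain on {0, 1}
    that jumps with probability p and holds still with probability 1 - p."""
    seq = np.empty(length, dtype=np.int8)
    seq[0] = rng.integers(2)                              # uniform initial state
    flips = (rng.random(length - 1) < p).astype(np.int8)  # 1 = jump, 0 = hold
    for t in range(1, length):
        seq[t] = seq[t - 1] ^ flips[t - 1]                # XOR applies the jump
    return seq

# 10,000 unlabelled training sequences, half from each class, in random order.
X_train = np.array([markov_sequence(0.4) for _ in range(5000)] +
                   [markov_sequence(0.6) for _ in range(5000)])
rng.shuffle(X_train)
```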
First, consider the supervised version of the toy classification problem: say we have 10,000 training sequences, and we also have the 10,000 class labels. A standard fully connected ANN can learn the problem and generalize to test data. It is possible the ANN will overfit the training data and give very poor test performance, but that can be fixed by using dropout [5]. Alternatively, if you know the data has a one-dimensional structure, then a suitable convolutional neural network with shared weights will also work well [6].

Now consider the toy problem as an unsupervised learning problem: you just have the 10,000 training samples. An autoencoder cannot be expected to work. The entropy of each training sample is close to 100 bits. In L2-distance, an autoencoder with a minimum layer size much less than 100 cannot learn to do much better than making a null prediction, i.e. predicting that each bit is half on and half off. (Theoretically the autoencoder has infinite information capacity, but that is not very relevant in practice.) Similarly, PCA will not work: the principal components will be essentially random and uninformative. A restricted Boltzmann machine trained by contrastive divergence will only be able to learn the training distribution if the hidden layer size is at least as large as the visible layer.

On the other hand, consider using an index-learner. If we can train it successfully, the encoding portion of the network will separate out the two classes. The softmax layer will then put its probability onto the p = 0.4 training sequences if the input sequence is of type p = 0.4, and vice versa for p = 0.6.

2 Index-learning encoders

We will now explain how to train index-learners efficiently. We will look at the specific case of MNIST; the ideas should apply to many unsupervised learning problems where the corresponding supervised learning problem is easy for ANNs. We start with an ANN architecture capable of solving the training-index learning problem, say 784N-1000N-500N-250N-30N-60,000N. Alternatively, you could consider a convolutional architecture such as (1 × 28 × 28)N-20C5-MP2-50C5-MP2-500N-30N-60000N. What is important is that

• the final hidden layer is small, say 30N or 100N,
• the output layer is a softmax classifier, and
• the number of outputs is the size of the training set.

The softmax function

softmax(x_1, x_2, ...) = ( e^{x_1} / (e^{x_1} + e^{x_2} + ...), e^{x_2} / (e^{x_1} + e^{x_2} + ...), ... )

has two important properties:

• It always produces a probability distribution.
• If we apply softmax to a subset of the x_1, x_2, ..., then we get the original probability distribution conditioned on taking a value in that subset.

The index labels we are trying to learn, in one-hot encoded form, are given by the 60,000 × 60,000 identity matrix. An important property of the identity matrix is that if you form a submatrix by selecting

• a subset of the rows, and
• the same subset of the columns,

then the submatrix is a smaller identity matrix. We use these properties of the softmax and the identity matrix to perform a doubled-up form of minibatch gradient descent. To prevent overfitting in the softmax layer, which would remove the pressure on the rest of the network to learn a good encoding, we use an autoregressive version of dropout.

2.1 Doubled-up minibatch gradient descent

Let W_1, ..., W_m denote the parameter matrices of the ANN; for example, W_m is a 30 × 60,000 matrix. To produce a minibatch, pick a random collection of indices, say i = (i_1, ..., i_n) ∈ {1, ..., 60000}^n. In normal minibatch gradient descent, the different elements of the minibatch only interact additively, to form an average gradient. In doubled-up minibatch gradient descent, they also interact in a competitive way when the softmax function is applied. There are three parts to the minibatch:

• The input, of size n × 784, formed by selecting the rows i_1, ..., i_n of the training set in the normal way.
• A minibatch ANN. Let W_m[:, i] denote the 30 × n submatrix of W_m corresponding to the columns i_1, ..., i_n. Using W_1, ..., W_{m-1} and W_m[:, i] we can form a sub-ANN of the form 784N-1000N-500N-250N-30N-nN.
• The one-hot encoded target distributions. This is just the n × n identity matrix, which we think of as a submatrix of the 60,000 × 60,000 identity matrix.

The minibatch ANN performs a subset of the index-learning problem: it tries to guess how the n minibatch inputs correspond to the n minibatch indices i.
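As a concrete illustration, here is a minimal numpy sketch of one doubled-up minibatch step. To keep it short it uses a single hidden layer (784N-30N-60000N rather than the deeper networks above), a plain SGD update, and no dropout; it is not the Theano code used for the experiments in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, d_in, d_code = 60000, 784, 30
W1 = rng.normal(0.0, 0.01, size=(d_in, d_code))    # encoder weights (W_1, ..., W_{m-1} collapsed into one layer)
Wm = rng.normal(0.0, 0.01, size=(d_code, n_train)) # top layer W_m: one column per training index

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)           # subtract row maxima for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def doubled_up_step(X_train, W1, Wm, n=100, lr=0.1):
    """One doubled-up minibatch step; W1 and Wm are updated in place."""
    i = rng.choice(n_train, size=n, replace=False)  # minibatch indices i_1, ..., i_n
    X = X_train[i]                                  # n x 784 input rows
    H = np.tanh(X @ W1)                             # n x 30 codes in the final hidden layer
    P = softmax(H @ Wm[:, i])                       # n x n output of the minibatch sub-ANN
    T = np.eye(n)                                   # targets: the n x n identity submatrix
    dZ = (P - T) / n                                # cross-entropy gradient at the softmax layer
    dWm_i = H.T @ dZ                                # gradient w.r.t. Wm[:, i] only
    dW1 = X.T @ ((dZ @ Wm[:, i].T) * (1.0 - H**2))  # backpropagate through tanh to the encoder
    W1 -= lr * dW1
    Wm[:, i] -= lr * dWm_i                          # only the n selected columns of Wm are touched
    return -np.mean(np.log(P[np.arange(n), np.arange(n)] + 1e-12))  # minibatch loss
```

Repeating doubled_up_step over many random minibatches trains the encoder weights on every step, and the columns of W_m a few at a time.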
Performing backpropagation on the minibatch ANN allows us to take a gradient descent step with respect to W_1, ..., W_{m-1} and W_m[:, i]. Notice that by updating W_m[:, i] we are also updating a fraction of the entries of W_m. If the larger ANN is perfectly trained, then the minibatch ANN will also perform perfectly. It is not immediately obvious that the reverse holds, i.e. that by training on minibatches the larger network learns anything. What we are doing is analogous to dropout, albeit applied to the final layer of the ANN rather than to the input and hidden layers. In practice, this procedure seems to work well. It is also easy to program, being a fairly trivial modification of the normal ANN training procedure, and the minibatch updates take no longer to run than you would expect for a traditional feed-forward ANN used for supervised training (e.g. 784N-1000N-500N-250N-10N).

2.2 Autoregressive dropout

It is useful to perform a type of dropout in the classification layer that allows learning while preventing overfitting. Say that each entry in W_m is either on or off. Entries are on with probability p (typically p = 0.5). Every time an entry is used for training, with probability q (typically q = 0.01) the entry is re-randomized, so that once again it is on with probability p, independently of its previous state. When an entry is switched off, its value is set to zero. When it is switched on, it has the opportunity to be updated. Like regular dropout, this encourages robustness in the final hidden layer. The time coherence is needed as each column of W_m is accessed relatively infrequently, only once per iteration through the whole training set.

Given our use of dropout, we have not tried to keep the dimension of the 'low-dimensional embedding' as small as possible. Rather, it seems to be more effective to use a larger encoding size, and then use PCA to reduce the dimension further.
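The autoregressive dropout state can be kept as a persistent boolean mask over the entries of W_m. Here is a minimal sketch, continuing the numpy sketch from Section 2.1; how exactly the mask is wired into the weight matrix is an implementation choice, and the comments show one plausible way.

```python
import numpy as np

rng = np.random.default_rng(0)

p, q = 0.5, 0.01                                  # P(entry is on) and per-use re-randomization probability
d_code, n_train = 30, 60000
mask = rng.random((d_code, n_train)) < p          # persistent on/off state for every entry of Wm

def refresh_mask(mask, i):
    """Autoregressive dropout update for the columns i of Wm used by the current minibatch:
    each used entry is re-randomized with probability q, becoming 'on' again with
    probability p, independently of its previous state."""
    cols = mask[:, i]                             # copy of the used columns
    rerandomize = rng.random(cols.shape) < q
    cols[rerandomize] = rng.random(rerandomize.sum()) < p
    mask[:, i] = cols                             # write the refreshed states back
    return mask

# Inside a doubled-up minibatch step one would then use
#     mask = refresh_mask(mask, i)
#     P = softmax(H @ (Wm[:, i] * mask[:, i]))    # 'off' entries contribute zero
# and apply the gradient update only to the 'on' entries:
#     Wm[:, i] -= lr * (dWm_i * mask[:, i])
```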
3 Experiments

Experiments were performed using Python and Theano [7]; the code will be released shortly. We used tanh as our non-linear function, except for the Twenty Newsgroups dataset, where we just used a form of L2-normalization. Training was carried out using Nesterov's accelerated form of gradient descent. We chose the minibatch size to reflect the number of classes; that way the ratio of positive to negative cases in the softmax layer mirrors supervised training.

Index-learners are in a sense a novel form of neural network, so we had no prior knowledge about the best values for meta-parameters such as network size. Techniques for training recurrent neural networks and autoencoders have been developed over many years; it is possible there is similar scope for improving the training procedures for index-learners.

3.1 The toy problem

We used a 1D-convolutional index-learner, 10C3-MP2-20C2-MP2-30C2-MP2-40C2-MP2-300N-100N-10,000N, with a batch size of 10; here 100N is the size of the low-dimensional representation. The network quickly separates the two classes. Given the probabilistic definition of the two classes, they do actually overlap, so perfect separation cannot be expected. See Figure 1.

Figure 1: The toy problem: the first two PCA components for the test-set embeddings trained using a 1d convolutional index-learner.

3.2 MNIST

We trained a fully connected 784N-500N-500N-500N-100N-60,000N index-learner on the MNIST digits dataset. In Figure 2 we show four pairs of digits: 0s and 1s, 1s and 7s, 6s and 8s, and 4s and 9s. It is not very surprising that the 0s and 1s separate out neatly. Most of the 1s and 7s, and of the 6s and 8s, are distinct. However, there is a lot of confusion between the 4s and 9s.

Figure 2: For the pairs 0,1; 1,7; 6,8; and 4,9: the union of the MNIST test digits in both classes, plotted using the 100-dimensional embedding followed by PCA.

In Figure 3 we show some of the features learnt in the input layer. They look similar to the ones you get when you train an ANN in a supervised manner. This shows that the index-learner is learning discriminative features in the absence of labels or decoding/reconstruction.

Figure 3: Some of the input-level filters learnt by a fully connected index-learning encoder trained on 28 × 28 MNIST digits.

We also trained a convolutional index-learner inspired by LeNet-5, (1 × 28 × 28)N-20C5-MP2-50C5-MP2-500N-100N-60,000N. The results were pretty similar to those for the fully connected index-learner above.

3.3 20 Newsgroups

We looked at the word frequencies of the 2,000 most common words in the Twenty Newsgroups dataset [8]. We split the dataset into a training set of size 11,300 and a test set of size 2,000. We trained a 2000N-500N-100N-11300N index-learner on the training data from all twenty newsgroups. We then looked at the test embeddings for three particular newsgroups; see Figure 4. We chose two newsgroups (A and T) that we thought would be similar, and one newsgroup (C) that we thought would be rather different. Looking at the embeddings, we can see a difference between A and C, and between T and C, but no real difference between A and T.

Figure 4: A = alt.atheism, C = comp.graphics, T = talk.religion.misc; A and T are related, so they are hard to distinguish, but they can both be separated from C.

4 Conclusions

Index-learners may be an alternative to autoencoders in certain problem domains. They have the potential to be robust with respect to high-entropy training data. The remaining challenge is to gain an improved understanding of how best to design and train index-learners for larger datasets.

They may also be useful as a means of pretraining. If you have a lot of unlabelled data and a small amount of labelled data, you could perform index-learning first, and then replace the index-learner's softmax layer with a supervised-learning softmax layer for fine-tuning.

References

[1] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313, 2006.

[2] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.

[3] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[4] Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.

[5] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[7] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

[8] The 20 Newsgroups data set. http://qwone.com/~jason/20Newsgroups/.