Deep Learning Introduction
Eder Santana
Neural Networks for Signal Processing, EEL 6504 - Spring 2016

Introduction
This introduction is divided into three parts:
1. On the need of deep architectures
2. Problems training deep architectures
3. Recent advances
Thus, we will cover the pros and cons of deep learning. Each topic contains links to reference papers in the speaker notes.

On the need of deep architectures
The success of machine learning applications (regression, classification, etc.) is tied to the quality of the data representation.
- Remember the Iris dataset: some of the dimensions were easier to classify than others.
For complex problems where no single dimension of the raw data is discriminative enough, domain knowledge for feature representation is fundamental.
- For example, frequency-domain representations for speech recognition.

On the need of deep architectures
But how much can we do with generic priors? Can we design a single algorithm that can be successfully applied to different data domains with minor modifications?
Deep Learning is tightly coupled with the need to learn better representations of data.

On the need of deep architectures
To illustrate the problem of learning representations, let us remember what each layer in an MLP does. Each line represents a processing element. The topmost layer is a classifier.
Figure: MLP with 3 layers, showing the input-to-first-hidden and output layers.

On the need of deep architectures
Each layer learns a feature representation of the layer below. The final goal of those representations is to optimize the criterion being backpropagated. Assume the problem of classification: the practical observation is that the number of neurons (connections, weights) required by a single-hidden-layer MLP is larger than the number required by a deep MLP. (A numpy sketch illustrating this layer-by-layer view follows the visualization slides below.)
"Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network." - Cohen et al.

On the need of deep architectures: Learning image multi-scale invariances
In image classification, we want to recognize objects under different viewing angles and scales.
Images from the Microsoft ASIRRA dataset. Dogs vs. Cats dataset on Kaggle: https://www.kaggle.com/c/dogs-vs-cats

On the need of deep architectures: Learning image multi-scale invariances
What does the human brain's image processing pipeline look like?
Yamins and DiCarlo, 2016

On the need of deep architectures: Learning image multi-scale invariances
What do we learn on each layer? Example: an unsupervised convolutional autoencoder applied to faces.
Lee et al., ICML 2009

On the need of deep architectures: Learning image multi-scale invariances
What do we learn on each layer of AlexNet (Zeiler variation)?
Krizhevsky et al., 2012

On the need of deep architectures: Learning image multi-scale invariances
What do we learn on each layer of AlexNet (Zeiler variation)? Showing preferred stimulus and the images that drive the highest activation, layer by layer.
Zeiler and Fergus, 2013
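To make the layer-by-layer view above concrete, here is a minimal numpy sketch of a 3-layer MLP forward pass. It is not from the slides: the array sizes, the tanh hidden units, and the softmax output are illustrative assumptions. Each hidden layer re-represents the output of the layer below, and the topmost layer is the classifier.

```python
import numpy as np

np.random.seed(0)

def layer(x, W, b, activation=np.tanh):
    """One processing layer: affine map followed by a pointwise nonlinearity."""
    return activation(x @ W + b)

def softmax(z):
    """Topmost layer: turns scores into class probabilities."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.random.randn(4, 16)                       # a batch of 4 raw inputs with 16 dimensions

# Two hidden layers learn successive feature representations of the raw input...
W1, b1 = 0.1 * np.random.randn(16, 32), np.zeros(32)
W2, b2 = 0.1 * np.random.randn(32, 32), np.zeros(32)
# ...and the output layer is a linear + softmax classifier sitting on top of them.
W3, b3 = 0.1 * np.random.randn(32, 3), np.zeros(3)

h1 = layer(x, W1, b1)             # representation learned by the first hidden layer
h2 = layer(h1, W2, b2)            # representation of the representation (features of features)
y = layer(h2, W3, b3, softmax)    # class probabilities; training backpropagates a criterion through all layers
print(y.shape)                    # (4, 3)
```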
On the need of deep architectures: Learning image multi-scale invariances
Visualizing a t-SNE embedding: http://cs.stanford.edu/people/karpathy/cnnembed/

On the need of deep architectures: Learning invariances in music and speech data
How do you do that using what we already know?

On the need of deep architectures: Learning invariances in music and speech data
"A spectrogram is an image and we know convnets." But what about the time component?

On the need of deep architectures: Learning invariances in music and speech data
Speech recognition (before):
- Get speech "windows": phonemes
- Represent phonemes using a Gaussian mixture model or a Deep Belief Network (deep autoencoder)
- Connect the windows: model phoneme transitions using a Hidden Markov Model
Yan Zhang

On the need of deep architectures: Learning invariances in music and speech data
Speech recognition (after, end-to-end deep learning):
- No windowing
- The first layers are convolutional
- The middle layers are bidirectional recurrent neural networks that represent time context (future classes)
- The cost function is Connectionist Temporal Classification (CTC), which allows sequence inputs
Hannun et al., 2014

On the need of deep architectures: Learning invariances in music and speech data
Content-based music recommendation.
Sander Dieleman, 2014: http://benanne.github.io/2014/08/05/spotify-cnns.html

On the need of deep architectures: Learning representations for text
How should a computer represent a character or a word? Signal processing is about algebra, but how can we represent something like "bread" + "butter" or "king" - "crown"? We need to represent words as vectors! Vector algebra is well defined.

On the need of deep architectures: Learning representations for text
Process (a numpy sketch of this process follows at the end of these text slides):
- Represent each word by a unique id (one-hot encoding, sparse): x
- Initialize a word-to-vector embedding (dense): B
- Get the word representation in the new space: y = Bx
- Each y is just a column of B
- Adapt B. How?

On the need of deep architectures: Learning representations for text
Mikolov et al., 2013

On the need of deep architectures: Learning representations for text
It also works at the character level.
Andrej Karpathy, 2015

On the need of deep architectures: Learning representations for text
It also works at the character level: a Paul Graham generator (one character at a time!)
"The surprised in investors weren't going to raise money. I'm not the company with the time there are all interesting quickly, don't have to get off the same programmers. There's a super-angel round fundraising, why do you can do. If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you're also the founders will part of users' affords that and an alternation to the idea. [2] Don't work at first member to see the way kids will seem in advance of a bad successful startup. And if you have to act the big company too."
Andrej Karpathy, 2015
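Here is a minimal numpy sketch of the one-hot / embedding-lookup process listed above. The toy vocabulary, embedding size, and random initialization are illustrative assumptions; in practice B is adapted by backpropagation, e.g. with a word2vec-style objective as in Mikolov et al., 2013.

```python
import numpy as np

np.random.seed(0)

vocab = ["bread", "butter", "king", "crown", "queen"]   # toy vocabulary (illustrative)
word_to_id = {w: i for i, w in enumerate(vocab)}

d = 3                                    # embedding dimension (illustrative)
B = np.random.randn(d, len(vocab))       # dense word-to-vector embedding, one column per word

def one_hot(word):
    """Sparse representation: a unique id per word."""
    x = np.zeros(len(vocab))
    x[word_to_id[word]] = 1.0
    return x

# y = Bx simply selects the column of B for that word,
# so in practice we index B directly instead of multiplying.
x = one_hot("bread")
y = B @ x
assert np.allclose(y, B[:, word_to_id["bread"]])

# Once B has been adapted, vector algebra on words becomes meaningful,
# e.g. comparing B @ one_hot("king") - B @ one_hot("crown") with other word vectors.
print(y)
```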
Problems training deep architectures: Is this free lunch?
Note: some of the following problems have been addressed to some extent. In red we show possible ways to address them.
- Feature design -> neural network architecture design (experience): how many neurons per layer? Convolutions? Max-pooling? Recurrent?
- Computational cost (GPUs, ASICs): several matrix multiplications and convolutions, and they are not all embarrassingly parallel.
- Curse of dimensionality (big data): the number of trainable weights sums up to millions.

Problems training deep architectures: Is this free lunch?
- Non-convex problem (problem?): local minima problems seemed to be due to bad initialization and wrong activation function choices.
- Vanishing gradients in backpropagation (improved SGD, LSTM)

Problems training deep architectures: Is this free lunch?
- Need for different learning rates in different layers (improved SGD)
- The output layer needs a stable representation to draw a separation surface (remember our example above)

Recent advances (or why now?)
Since deep learning is "just" neural networks, why is it happening only now?
- Community acceptance (neural networks were labeled brute force and unreliable due to non-convex optimization; data descriptors + kernel machines were the preferred approach): https://plus.google.com/+YannLeCunPhD/posts/gurGyczzsJ7
- Faster computers with more memory
- Larger datasets
Also, there were A LOT of incremental contributions. To the beginner this may look like black magic, superstition, or any other term for "not science".

Recent advances
- Weight initialization techniques
- Activation functions (ReLU, maxout, etc.)
- Regularization (dropout, batch normalization, multi-task learning)
- Data augmentation
- Optimization algorithms (momentum, Adagrad, RMSprop, Adam, etc.)
- Architectures (recurrent, convolutional, Inception, residual, etc.)
- Deep learning software (CUDA, Caffe, Theano, TensorFlow, Keras, Torch, etc.)
- Deep learning hardware (CPU clusters, GPUs, ASICs, phones, etc.)

Recent advances: Weight initialization
- Too large: neurons get stuck in the low-gradient range
- Too small: neurons are too correlated, which slows down learning
- Glorot uniform initialization: Wij ~ U[-sqrt(6/(n_in + n_out)), +sqrt(6/(n_in + n_out))], where n_in and n_out are the layer's fan-in and fan-out

Recent advances: Activation functions
Softplus and ReLU: unlike saturating sigmoids, their gradient does not vanish for x > 0 (softplus is a smooth approximation of ReLU).

Recent advances: Regularization (Dropout)
"If you're not overfitting, your network isn't big enough" - attributed to Geoff Hinton
How to design a neural network:
Before:
- Start with a small network
- Increase the number of neurons (and rarely the number of layers) until you start overfitting
Now:
- Start with a network big enough to overfit
- Increase regularization and data augmentation until you stop overfitting

Recent advances: Regularization (Dropout)
Dropout: randomly multiply the outputs of the network by zero during training. At test time, rescale the learned weights down. (A numpy sketch follows the dropout slides below.)
Layer output f(w^T x):             [0.5, 0, 0, 0.7, 0.01, 0.2]
Dropout mask with probability 0.5: [0,   1, 0, 1,   0,    1  ]
New layer output after masking:    [0,   0, 0, 0.7, 0,    0.2]

Recent advances: Regularization (Dropout)
Dropout: randomly multiply the outputs of the network by zero during training. At test time, rescale the learned weights down.
Layer output f(w^T x):                        [0.5, 0,   0, 0.7,  0.01, 0.2]
Dropout mask with probability 0.5, rescaled:  [0,   0.5, 0, 0.5,  0,    0.5]
New layer output after masking and rescaling: [0,   0,   0, 0.35, 0,    0.1]
Note that there can be different dropout probabilities! We just sample and rescale accordingly.

Recent advances: Regularization (Dropout)
Dropout prevents co-adaptation of features. Intuition: during training we add noise to the gradient so the weights don't learn to rely on each other as much. We train them to work alone and bring friends at test time.
(Figure: training vs. test networks.)
Hinton et al., 2013
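A minimal numpy sketch of the masking and rescaling illustrated above, using the slide's toy vector and a keep probability of 0.5. (The popular "inverted dropout" variant instead divides by the keep probability during training and leaves test time untouched; that aside is not from the slides.)

```python
import numpy as np

np.random.seed(1)

p_keep = 0.5                                      # keep probability from the slide's example
h = np.array([0.5, 0.0, 0.0, 0.7, 0.01, 0.2])     # layer output f(w^T x) from the slide

def dropout_train(h, p_keep):
    """Training: zero each output independently with probability 1 - p_keep."""
    mask = (np.random.rand(*h.shape) < p_keep).astype(h.dtype)
    return h * mask

def dropout_test(h, p_keep):
    """Test: keep every unit but rescale, so the expected activation matches training."""
    return h * p_keep

print(dropout_train(h, p_keep))   # random: some entries are zeroed, the rest kept as-is
print(dropout_test(h, p_keep))    # [0.25, 0, 0, 0.35, 0.005, 0.1]
```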
Recent advances: Regularization (Batch Normalization)
Let us remember the error surface for MSE and how correlated features impair learning:
J = E[(d - w^T x)^2] = E[d^2] + w^T C w - 2 w^T P
where C is the input autocorrelation matrix and P is the cross-correlation vector between the input and the desired response. This cost is quadratic in w, with its curvature determined by C. Let us focus on the quadratic term for the 2D case:
w^T C w = (w1 C11 + w2 C21) w1 + (w1 C12 + w2 C22) w2 = w1^2 C11 + w2^2 C22 + w1 w2 (C12 + C21)

Recent advances: Regularization (Batch Normalization)
w^T C w = w1^2 C11 + w2^2 C22 + w1 w2 (C12 + C21)
Let us visualize contour (level-set) plots: correlated inputs = "narrow valleys" in the learning surface!
Figure: three contour plots of the quadratic term:
- C21 C12 = 0, same variance for x1 and x2: no correlation on the input; we learn as fast in all directions.
- C21 C12 = 0, but x2 has higher variance than x1.
- C21 C12 > 0, and x2 has higher variance than x1.

Recent advances: Regularization (Batch Normalization)
Also, the eigenvalue spread of C defines the maximum learning rate, the misadjustment, the convergence time, etc. Learning and correlated inputs are not friends!
Batch normalization learns to "remove the mean and divide by the standard deviation" after every layer. There is a version for convolutional layers too! In both cases, we learn a moving-average approximation of the mean and variance that we use at test time. (It is not fair to compute statistics on your test batches. Don't do that in your homework or publications.) A numpy sketch is given after the data augmentation slides.

Recent advances: Data augmentation (noise)
For deep learning, the more data the better! We already saw one way to increase the virtual size of our dataset: adding noise!
- Dropout on the input
- Adding Gaussian noise (works better for speech data)

Recent advances: Data augmentation (data distortion)
There are also small transformations of the input that help neural networks generalize better (see the numpy sketch at the end):
- Vertical-axis flip
- Multiple crops (changes scale and position)
- Rotation (changes the viewing angle)
And they can all be combined for even more augmentation.
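As referenced in the batch normalization slides above, here is a minimal numpy sketch of the idea: normalize each feature with batch statistics during training while tracking moving averages, and reuse those moving averages at test time. The momentum value, the class name, and the omission of the learnable per-feature scale and shift (gamma, beta) of the full method are simplifying assumptions for illustration.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization for (batch, features) activations."""
    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.running_mean = np.zeros(n_features)   # moving-average statistics for test time
        self.running_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, h, training):
        if training:
            mean, var = h.mean(axis=0), h.var(axis=0)
            # update the moving averages that will be used at test time
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # never compute statistics on the test batch itself
            mean, var = self.running_mean, self.running_var
        # the full method also applies a learnable per-feature scale and shift here
        return (h - mean) / np.sqrt(var + self.eps)

np.random.seed(0)
bn = BatchNorm1D(4)
for _ in range(200):                               # training batches update the running statistics
    bn(3.0 + 2.0 * np.random.randn(32, 4), training=True)

h_test = 3.0 + 2.0 * np.random.randn(8, 4)
print(bn(h_test, training=False).mean(axis=0))     # approximately zero mean after normalization
```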
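And a minimal numpy sketch of the data distortions listed above applied to a single image array. The image size, crop size, and noise level are illustrative; rotation would normally be done with an image library and is left out.

```python
import numpy as np

np.random.seed(0)
image = np.random.rand(64, 64, 3)                # toy H x W x C image in [0, 1] (illustrative)

def vertical_axis_flip(img):
    """Mirror the image about its vertical axis (left-right flip)."""
    return img[:, ::-1, :]

def random_crop(img, size=56):
    """Take a random size x size patch: changes apparent scale and position."""
    h, w, _ = img.shape
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size, :]

def add_gaussian_noise(img, std=0.05):
    """Additive Gaussian noise: another way to grow the 'virtual' dataset size."""
    return np.clip(img + std * np.random.randn(*img.shape), 0.0, 1.0)

# The distortions compose, giving many virtual variants of one labeled example.
augmented = add_gaussian_noise(random_crop(vertical_axis_flip(image)))
print(augmented.shape)                           # (56, 56, 3)
```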