Deep Learning

Why?
[Figure: word error rate of speech recognition systems, 1992-2012, for read and conversational speech (log scale, 6.3%-100%). Source: Huang et al., Communications of the ACM, January 2014]
[Figure: error rates of the entries in the Large Scale Visual Recognition Challenge 2012: ISI, OXFORD_VGG, XRCE/INRIA, University of Amsterdam, LEAR-XRCE, SuperVision]

Recent dedicated venues:
• the 2013 International Conference on Learning Representations
• the 2013 ICASSP special session on New Types of Deep Neural Network Learning for Speech Recognition and Related Applications
• the 2013 ICML Workshop for Audio, Speech, and Language Processing
• the 2012, 2011, and 2010 NIPS Workshops on Deep Learning and Unsupervised Feature Learning
• the 2013 ICML Workshop on Representation Learning Challenges
• the 2012 ICML Workshop on Representation Learning
• the 2011 ICML Workshop on Learning Architectures, Representations, and Optimization for Speech and Visual Information Processing
• the 2009 ICML Workshop on Learning Feature Hierarchies
• the 2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications
• the 2012 ICASSP deep learning tutorial
• the special section on Deep Learning for Speech and Language Processing in IEEE Trans. Audio, Speech, and Language Processing (January 2012)
• the special issue on Learning Deep Architectures in ...

"A fast learning algorithm for deep belief nets" -- Hinton et al., 2006
"Reducing the dimensionality of data with neural networks" -- Hinton & Salakhutdinov, 2006
Geoffrey Hinton, University of Toronto

How?

Shallow learning
• SVM
• Linear & kernel regression
• Hidden Markov Models (HMM)
• Gaussian Mixture Models (GMM)
• Single-hidden-layer MLP
• ...
Limited modeling capability for complex concepts; cannot make use of unlabeled data.

Neural Networks
• Machine learning: extracting knowledge from high-dimensional data
• Classification; input: features of the data
• Supervised vs. unsupervised learning: labeled vs. unlabeled data
• Built from neurons

Multi Layer Perceptron
• Multiple layers, feed-forward, fully connected weights
• Input units i (e.g. [X1, X2, X3]), hidden units j with weights v_ij, output units k with weights w_jk (e.g. [Y1, Y2]), 1-of-N output coding
• Each unit sums its weighted inputs, z = Σ_i x_i w_ij, and applies the logistic activation a = 1 / (1 + e^-z)

Backpropagation
• Minimize the error of the calculated output at the output units k
• Adjust the weights w_jk (and v_ij in the layer below)
• Procedure: forward phase, then backpropagation of the errors
• Gradient descent
• Each sample is presented over multiple epochs

Best Practice
• Normalization: prevents very large weights and oscillation
• Overfitting / generalisation: use a validation set and early stopping
• Mini-batch learning: update the weights with multiple input vectors combined

Problems with Backpropagation
• With multiple hidden layers it gets stuck in local optima (the weights start from random positions)
• Slow convergence to the optimum; a large training set is needed
• Uses only labeled data, but most data is unlabeled

Generative Approach: Restricted Boltzmann Machines
• Unsupervised: finds complex regularities in the training data
• Bipartite graph with a visible layer and a hidden layer, connected by weights w_ij
• Binary stochastic units: on/off with a probability
• One iteration: update the hidden units, then reconstruct the visible units
• p(h_j = 1) = 1 / (1 + e^(-Σ_{i∈vis} v_i w_ij))
• Training goal: maximum likelihood of the training data

Restricted Boltzmann Machines
• Find the latent factors of the data set
• Training goal: the most probable reproduction of the (unlabeled) training data
• Adjust the weights so that the input data gets maximum probability

Training: Contrastive Divergence
• t = 0 (data): measure the correlations <v_i h_j>_0 ; t = 1 (reconstruction): measure <v_i h_j>_1
• Weight update: Δw_ij = ε ( <v_i h_j>_0 - <v_i h_j>_1 )
• Start with a training vector on the visible units.
• Update all the hidden units in parallel.
• Update all the visible units in parallel to get a "reconstruction".
• Update the hidden units again.
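The update rule above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation of one CD-1 step for a binary RBM, not code from the slides: the names (n_visible, n_hidden, lr, cd1_step) are invented for the example and the bias terms are left out for brevity.

```python
# Minimal sketch of one CD-1 update for a binary RBM (biases omitted).
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden, lr = 256, 50, 0.1            # e.g. 16x16 pixels, 50 feature units
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W):
    """One contrastive-divergence step: data -> hidden -> reconstruction -> hidden."""
    # p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij)
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample binary hidden states

    # Reconstruct the visible units from the sampled hidden states
    p_v1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)

    # Update the hidden units again on the reconstruction
    p_h1 = sigmoid(v1 @ W)

    # Delta w_ij = eps * ( <v_i h_j>_0 - <v_i h_j>_1 )
    return lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))

v0 = (rng.random(n_visible) < 0.5).astype(float)          # a stand-in binary training vector
W += cd1_step(v0, W)
```

In practice this step is repeated over many training vectors (or mini-batches), which is exactly the "reality vs. reconstruction" weight increment/decrement illustrated with the handwritten 2s below.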
Example: Handwritten 2s
• 16 x 16 pixel images; 50 binary neurons that learn features
• Data (reality): increment the weights between an active pixel and an active feature
• Reconstruction: decrement the weights between an active pixel and an active feature
• The final 50 x 256 weights: each unit grabs a different feature

Example: Reconstruction
• New test image from the digit class the model was trained on: compare the data with the reconstruction from the activated binary features
• Image from an unfamiliar digit class: the network tries to see every image as a 2

Deep Architecture
• Backpropagation and RBMs as building blocks
• Multiple hidden layers
• Motivation (why go deep?): approximate complex decision boundaries with fewer computational units for the same functional mapping
• Hierarchical learning: increasingly complex features
• Works well in different domains: vision, audio, ...

Hierarchical Learning
• Natural progression from low-level to high-level structure, as seen in natural complexity
• Easier to monitor what is being learnt and to guide the machine to better subspaces

Stacked RBMs
• Learn one layer at a time by stacking RBMs: first train an RBM (weights W1) on the input v, copy the binary hidden states h1 for each training case, then train the next RBM (weights W2) on them, and compose the two RBM models into a single DBN model
• Treat this as "pre-training" that finds a good initial set of weights, which can then be fine-tuned by a local search procedure
• Backpropagation can be used to fine-tune the model to be better at discrimination

Uses: Dimensionality reduction
• Use a stacked RBM as a deep auto-encoder
  1. Train the RBMs with images as input & output
  2. Limit one layer to a few dimensions
• Information has to pass through the narrow middle layer
• Olivetti face data, 25 x 25 pixel images, reconstructed from 30 dimensions (625 -> 30): original vs. deep autoencoder vs. PCA
• 804,414 Reuters news stories, reduced to 2 dimensions: PCA vs. deep autoencoder

Uses: Classification
• Unlabeled data is readily available; example: images from the web
  1. Download 10,000,000 images
  2. Train a 9-layer DNN
  3. Concepts are formed by the DNN
• 70% better than the previous state of the art
"Building High-level Features Using Large Scale Unsupervised Learning" -- Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng

Uses: AI
• Enduro, Atari 2600: expert human player 368 points, deep learning agent 661 points
"Playing Atari with Deep Reinforcement Learning" -- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller

Uses: Generative (Demo)

How to use it
• Home page of Geoffrey Hinton: https://www.cs.toronto.edu/~hinton/
• Portal: http://deeplearning.net/
• Accord.NET: http://accord-framework.net/
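To connect the pointers above back to the "Stacked RBMs" slide, here is a minimal sketch of greedy layer-wise training of two RBMs with a classifier on top, assuming scikit-learn is available (BernoulliRBM, Pipeline and LogisticRegression are standard scikit-learn classes; the layer sizes and learning rates are arbitrary choices for the example). The logistic regression only stands in for the discriminative step; unlike the backpropagation fine-tuning described on the slides, it does not adjust the RBM weights.

```python
# Greedy layer-wise training of two stacked RBMs, with a classifier on top.
# Illustrative sketch only; hyperparameters are arbitrary.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel intensities into [0, 1] for the Bernoulli units

# The Pipeline fits rbm1 on the raw pixels, rbm2 on rbm1's hidden
# activations, and finally the classifier on rbm2's activations --
# i.e. one layer at a time, as on the "Stacked RBMs" slide.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=50, learning_rate=0.06, n_iter=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```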