Hands-on course in deep neural networks for vision

Instructors: Michal Irani, Ronen Basri
Teaching Assistants: Ita Lifshitz, Ethan Fetaya, Amir Rosenfeld
Course website: http://www.wisdom.weizmann.ac.il/~vision/courses/2016_1/DNN/index.html

Schedule (fall semester)
• 4/11/2015: Introduction
• 18/11/2015: Tutorial on Caffe, exercise 1
• 9/12/2015: Q&A on exercise 1
• 16/12/2015: Submission of exercise 1 (no class)
• 6/1/2016: Student presentations of recent work
• 27/1/2016: Project selection

Nerve cells in a microscope
• 1836: First microscopic image of a nerve cell (Valentin)
• 1838: First visualization of axons (Remak)
• 1862: First description of the neuromuscular junction (Kühne)
• 1873: Introduction of the silver-chromate staining technique (Golgi)
• 1888: Birth of the neuron doctrine: the nervous system is made up of independent cells (Cajal)
• 1891: The term "neuron" is coined (Waldeyer)
• 1897: Concept of the synapse (Sherrington)
• 1906: Nobel Prize: Cajal and Golgi
• 1921: Nobel Prize: Sherrington

(Figures: nerve cell, synapse, action potential)

• The human brain contains ~86 billion nerve cells
• The DNA cannot encode a different function for each cell
• Therefore:
  • Each cell must perform a simple computation
  • The overall computations are achieved by ensembles of neurons ("connectionism")
  • These computations can be changed dynamically by learning ("plasticity")

A Logical Calculus of the Ideas Immanent in Nervous Activity
Warren S. McCulloch and Walter Pitts, Bulletin of Mathematical Biophysics, Volume 5, 1943

"We shall make the following physical assumptions for our calculus.
1. The activity of the neuron is an 'all-or-none' process.
2. A certain fixed number of synapses must be excited within the period of latent addition in order to excite a neuron at any time, and this number is independent of previous activity and position on the neuron.
3. The only significant delay within the nervous system is synaptic delay.
4. The activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time.
5. The structure of the net does not change with time."

The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain
Frank Rosenblatt, Psychological Review, Vol. 65, No. 6, 1958

A simplified view of neural computation
• A nerve cell accumulates electric potential at its dendrites
• This accumulation is due to the flux of neurotransmitters from neighboring (input) neurons
• The strength of this potential depends on the efficacy of the synapse, and it can be positive ("excitatory") or negative ("inhibitory")
• Once sufficient electric potential is accumulated (exceeding a certain threshold), one or more action potentials (spikes) are produced; they travel through the nerve axon and affect nearby (output) neurons

The perceptron
• Input 𝑥 = (𝑥_1, …, 𝑥_d)
• Weights 𝑤 = (𝑤_1, …, 𝑤_d)
• Output:
  𝑓(𝑥) = 1 if 𝑤^T 𝑥 + 𝑏 > 0, and 0 otherwise
  (equivalently, the unit fires when 𝑤^T 𝑥 exceeds the threshold −𝑏)
• This is a linear classifier
• Implemented with one layer of weights plus a threshold (Heaviside step activation)
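Since the exercises in this course are programming-based, here is a minimal NumPy sketch of the perceptron defined above, together with Rosenblatt's classic error-correction learning rule. The AND data, learning rate, and epoch count are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def perceptron_output(w, b, x):
    """Heaviside-threshold unit: fires iff w^T x + b > 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Classic perceptron learning rule on a toy AND problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])           # AND is linearly separable
w, b, eta = np.zeros(2), 0.0, 0.1    # weights, bias, learning rate

for epoch in range(20):
    for xi, yi in zip(X, y):
        err = yi - perceptron_output(w, b, xi)
        w = w + eta * err * xi       # nudge weights toward the target
        b = b + eta * err

print([perceptron_output(w, b, xi) for xi in X])  # expected [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop reaches a perfect classifier; on XOR it would never converge, which motivates the multi-layer networks of the next slides.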
Multi-layer perceptron
• Linear classifiers are very limited; e.g., XOR configurations cannot be classified
• Solution: a multilayer, feed-forward perceptron
• Architecture: input 𝑥_1, …, 𝑥_d; hidden layer ℎ_1, …, ℎ_m; hidden layer ℎ′_1, …, ℎ′_m′; output 𝑦_1, …, 𝑦_k
• ℎ = 𝜎(𝑊𝑥), ℎ′ = 𝜎(𝑊′ℎ), and so on, up to 𝑦 = 𝜎(… 𝜎(𝑊𝑥) …)
• The non-linear 𝜎 is called the "activation function"

Activation functions
• Heaviside (threshold)
• Sigmoid: tanh(𝑧), or the logistic function 1/(1 + e^−𝑧)
• Rectified linear unit (ReLU): max(𝑧, 0)
• Max pooling: max(ℎ_1, ℎ_2)

Multi-layer perceptron
• A perceptron with one hidden layer (of sufficiently many units) can approximate any smooth function to arbitrary accuracy

Supervised learning
• Objective: given 𝑁 labeled training examples (𝑥_i, 𝑦_i), i = 1, …, 𝑁, learn a map from input to output, 𝑦 = 𝑓(𝑥)
• Types of problems:
  • Classification: each input is mapped to one of a discrete set of labels, e.g., {person, bike, car}
  • Regression: each input is mapped to a number, e.g., a viewing angle
• How? Given a network architecture, find a set of weights that minimizes a loss function on the training data, e.g., the negative log-likelihood of the class given the input
• Generalization vs. overfitting

Generalization vs. overfitting
• We want to minimize the loss on the test data, but the test data is not available in training
• Use a validation set to make sure you don't overfit
(Figures: an overfit curve in the (𝑥, 𝑦) plane; training loss keeps decreasing with iterations while validation loss eventually rises)

Training DNNs: back-propagation
• A network is a function 𝑦 = 𝑓(𝑥; 𝑤)
• Objective: modify 𝑤 to improve the prediction of 𝑓 on the training (and validation) data
• The quality of 𝑓 is measured via the loss function ℒ(𝑥, 𝑦, 𝑤)
• Back-propagation: minimize ℒ by gradient descent, 𝑤 ← 𝑤 − 𝜂𝛻_𝑤 ℒ
• The gradient is computed in a backward pass by applying the chain rule

Calculating the gradient
• Loss: ℒ(ŷ) = ½ (ŷ − 𝑦)², where ŷ = 𝑓(𝑥; 𝑤)
• Activation: logistic, φ(𝑧) = 1/(1 + e^−𝑧) (note: φ′(𝑧) = φ(𝑧)(1 − φ(𝑧)))
• For a weight 𝑤_ij connecting input 𝑥_i to hidden unit 𝑥′_j, where 𝑥′_j = φ(𝑤_1j 𝑥_1 + ⋯ + 𝑤_dj 𝑥_d):
  ∂ℒ/∂𝑤_ij = (∂ℒ/∂𝑥′_j)(∂𝑥′_j/∂𝑤_ij) = (∂ℒ/∂𝑥′_j) 𝑥′_j (1 − 𝑥′_j) 𝑥_i
• For a node 𝑥_i feeding units 𝑥′_1, …, 𝑥′_d′:
  ∂ℒ/∂𝑥_i = Σ_k=1..d′ (∂ℒ/∂𝑥′_k)(∂𝑥′_k/∂𝑥_i) = Σ_k=1..d′ (∂ℒ/∂𝑥′_k) 𝑥′_k (1 − 𝑥′_k) 𝑤_ik
• These derivatives are computed recursively from the top, starting with ∂ℒ/∂ŷ_i = ŷ_i − 𝑦_i
(Figure: a network with weights 𝑤_11, …, 𝑤_dd′ from layer 𝑥 to layer 𝑥′, and 𝑤′_11, …, 𝑤′_d′k from 𝑥′ to 𝑦)

Training algorithm
• Initialize with random weights
• Forward pass: given a training input vector, apply the network and store all intermediate results
• Backward pass: starting from the top, recursively use the chain rule to calculate the derivatives ∂ℒ/∂𝑥_i for all nodes, and use them to calculate ∂ℒ/∂𝑤_ij for all edges
• Repeat for all training vectors; the gradient 𝛻_𝑤 ℒ collects, for each edge, the sum of ∂ℒ/∂𝑤_ij over the training vectors
• Weight update: 𝑤 ← 𝑤 − 𝜂𝛻_𝑤 ℒ

Stochastic gradient descent
• Gradient descent requires computing the gradient of the (full) loss over all of the training data at every step; with large training sets this is expensive
• Approach: compute the gradient over a sample ("mini-batch"), usually by re-shuffling the training set; going once through the entire training data is called an epoch
• If the learning rate decreases appropriately, then under mild assumptions this converges almost surely to a local minimum
• Momentum: pass the gradient from one iteration to the next (with decay), i.e., 𝑤 ← 𝑤 − 𝜂𝛻_𝑤 ℒ^(t) − 𝜂′𝛻_𝑤 ℒ^(t−1)

Regularization: dropout
• At training time we randomly eliminate half of the nodes in the network
• At test time we use the full network, but each weight is halved
• This spreads the representation of the data over multiple nodes
• Informally, it is equivalent to training many different networks and using their average response
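A minimal NumPy sketch of the train/test asymmetry just described; the drop probability of 0.5 matches the slide, while the function names and array shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    """Training pass: randomly zero out half of the activations."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def dropout_test(h, p_drop=0.5):
    """Test pass: keep all nodes but scale by the keep probability
    (equivalent to halving each weight when p_drop = 0.5)."""
    return h * (1.0 - p_drop)
```

Scaling at test time keeps the expected activation equal to what the dropped-out network saw during training, which is why the full network can be used unchanged apart from the halved weights.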
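Pulling the preceding slides together (the backward-pass recursions, mini-batch SGD, and the momentum update 𝑤 ← 𝑤 − 𝜂𝛻_𝑤 ℒ^(t) − 𝜂′𝛻_𝑤 ℒ^(t−1)), here is a small end-to-end training sketch for a one-hidden-layer network with logistic activations and squared loss. All layer sizes, learning rates, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and synthetic regression data (not from the slides).
d, m, k, N = 4, 8, 2, 256
X = rng.standard_normal((N, d))
Y = logistic(X @ rng.standard_normal((d, k)))      # targets in (0, 1)

W1 = 0.1 * rng.standard_normal((d, m))             # input -> hidden weights
W2 = 0.1 * rng.standard_normal((m, k))             # hidden -> output weights
g1_prev, g2_prev = np.zeros_like(W1), np.zeros_like(W2)  # last gradients
eta, eta_p, batch = 0.5, 0.25, 32                  # rates eta, eta'; batch size

for epoch in range(200):
    perm = rng.permutation(N)                      # re-shuffle the training set
    for s in range(0, N, batch):
        idx = perm[s:s + batch]
        x, y = X[idx], Y[idx]

        # Forward pass: store the intermediate activations.
        h = logistic(x @ W1)                       # hidden units x'
        y_hat = logistic(h @ W2)                   # network output

        # Backward pass (chain rule), starting from dL/dy_hat = y_hat - y
        # and using phi'(z) = phi(z) (1 - phi(z)).
        d2 = (y_hat - y) * y_hat * (1 - y_hat)
        g2 = h.T @ d2 / len(idx)
        d1 = (d2 @ W2.T) * h * (1 - h)             # recursion back to the hidden layer
        g1 = x.T @ d1 / len(idx)

        # Momentum update: w <- w - eta * grad_t - eta' * grad_{t-1}.
        W1 -= eta * g1 + eta_p * g1_prev
        W2 -= eta * g2 + eta_p * g2_prev
        g1_prev, g2_prev = g1, g2

loss = 0.5 * np.mean((logistic(logistic(X @ W1) @ W2) - Y) ** 2)
print("mean squared loss after training:", loss)
```

With these settings the printed loss should fall well below its initial value; the exact number depends on the random seed.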
Invariance
• Invariance to translation, rotation, and scale, as well as to some transformations of intensity, is often desired
• It can be achieved by perturbing the training set (data augmentation); this is useful for small transformations but expensive
• Translation invariance is further achieved with max pooling and/or convolution

Invariance: convolutional neural nets
• Translation invariance can be achieved by conv-nets
• A conv-net learns linear filters at the first level and non-linear ones higher up
• SIFT is an example of a useful non-linear filter, but a learned filter may be more suitable
• Inspired by the local receptive fields in the V1 visual cortex area
• Conv-nets were applied successfully to digit recognition in the 90's (LeCun et al. 1998), but at the time they did not scale well to other kinds of images
(Figure: a convolutional layer from 𝑥_1, …, 𝑥_d to 𝑥′_1, …, 𝑥′_d′; weight-sharing: same color ⇒ same weight)

Alex-net (Krizhevsky, Sutskever, Hinton 2012)
• Trained and tested on ImageNet: 1.2M training images, 50K validation, 150K test; 1000 categories
• Loss: softmax. The top layer has 𝐶 nodes, 𝑧_1, …, 𝑧_C (here 𝐶 = 1000 categories). The softmax function renormalizes them:
  𝜎_j(𝒛) = e^𝑧_j / Σ_k=1..C e^𝑧_k
• The network maximizes the multinomial logistic regression objective, i.e., it minimizes
  ℒ = − Σ_i=1..N log 𝜎_y_i(𝒙_i)
  over the training images 𝒙_i with class labels 𝑦_i
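A small NumPy sketch of the softmax and the negative-log-likelihood loss defined above. The logits and labels are made-up values, and subtracting the row maximum is a standard numerical-stability trick, not something from the slides.

```python
import numpy as np

def softmax(z):
    """sigma_j(z) = exp(z_j) / sum_k exp(z_k), computed stably."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nll_loss(logits, labels):
    """Multinomial logistic regression objective:
    L = -sum_i log sigma_{y_i}(x_i)."""
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).sum()

# Made-up batch: 3 images, C = 5 classes.
logits = np.array([[ 2.0, 0.1, -1.0,  0.3, 0.0],
                   [ 0.2, 1.5,  0.1, -0.3, 0.4],
                   [-0.5, 0.0,  0.2,  2.2, 0.1]])
labels = np.array([0, 1, 3])
print(nll_loss(logits, labels))   # lower is better
```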
Alex-net
• Activation: ReLU (figure: training curves, tanh vs. ReLU)
• Data augmentation:
  • Translation
  • Horizontal reflection
  • Gray-level shifts along principal components
• Dropout
• Momentum
(Figure: the network architecture)

Alex-net: results
• ILSVRC-2010
• ILSVRC-2012:

Team name               | Error (5 guesses) | Description
SuperVision             | 0.15315           | Using extra training data from ImageNet 2011
SuperVision             | 0.16422           | Using only supplied training data
ISI                     | 0.26172           | Weighted sum of scores: SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV
OXFORD_VGG              | 0.26979           | Mixed selection from high-level SVM and baseline scores
XRCE/INRIA              | 0.27058           | Baseline: SVM trained on Fisher Vectors over dense SIFT and color statistics
University of Amsterdam | 0.29576           |

• More recent networks have reduced the error to ~7%

Alex-net: results (Krizhevsky et al. 2012)
(Figure: example classifications)

Applications
• Image classification
• Face recognition
• Object detection
• Object tracking
• Low-level vision:
  • Optical flow, stereo
  • Super-resolution, de-blurring
  • Edge detection
  • Image segmentation
• Attention and saliency
• Image and video captioning
• …

Face recognition
• Google's conv-net is trained on 260M face images
• It achieved 99.63% accuracy on LFW (a face-comparison database), and some of its mistakes turned out to be labeling mistakes
• Available in Google Photos
(Schroff et al. 2015)

Object detection
• The latest methods achieve an average precision of about 60% on PASCAL VOC 2007 and 44% on ImageNet ILSVRC 2014 (He et al. 2015)

Unsupervised learning
• Find structure in data
• Types of problems:
  • Clustering
  • Density estimation
  • Dimensionality reduction

Auto-encoders
• Produce a compact representation of the training data
• Analogous to PCA
• Note that the identity transformation may be a valid (but undesired) solution
• Initialize by training a Restricted Boltzmann Machine
(Figure: input 𝑥_1, …, 𝑥_d; hidden layer ℎ_1, …, ℎ_m; output 𝑦_1, …, 𝑦_d; trained so that "output = input")

Recurrent networks: Hopfield net (John Hopfield, 1982)
• Fully connected network
• Time dynamics: starting from an input, apply the network repeatedly until convergence
• Update: 𝑠_i = sign(Σ_j≠i 𝑤_ij 𝑠_j + 𝜃_i); this minimizes an Ising-model energy
• Weights are set to store preferred states
• "Associative memory" (content-addressable): denoising and completion
(Figure: a fully connected network over 𝑥_1, …, 𝑥_d)

Recurrent networks
• Used, e.g., for image annotation, i.e., producing a descriptive sentence for an input image
(Figure: the network takes the image and its current state and emits the next word)

Recurrent networks (unrolled)
• Unrolled over time, the same network emits the first word, then the second word, and so on
(Kiros et al. 2014)

Neural network development packages
• Caffe: http://caffe.berkeleyvision.org/
• MatConvNet: http://www.vlfeat.org/matconvnet/
• Torch: https://github.com/torch/torch7/wiki/Cheatsheet
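To make the Hopfield update rule above concrete, here is a minimal NumPy sketch of Hebbian storage and asynchronous recall. The pattern contents, network size, and zero thresholds 𝜃_i are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 16
patterns = rng.choice([-1, 1], size=(2, d))      # preferred states to store

# Hebbian storage: w_ij = sum over patterns of s_i * s_j, no self-connections.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)

def recall(s, theta=0.0, sweeps=10):
    """Asynchronous updates s_i = sign(sum_{j != i} w_ij s_j + theta_i);
    with symmetric W and zero diagonal, each update cannot increase
    the Ising energy -1/2 s^T W s, so the dynamics converge."""
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(d):
            s[i] = 1 if W[i] @ s + theta > 0 else -1
    return s

# Denoising ("associative memory"): flip a few bits of a stored pattern.
noisy = patterns[0].copy()
noisy[:3] *= -1
print(np.array_equal(recall(noisy), patterns[0]))   # likely True
```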