Image Classification with Deep Neural Networks
Greg Schoeninger

The Problem (General)
• Transform raw input, given as pixels, into higher-level representations.
• Edges, local shapes and colors, object parts, etc.
• How do we as humans recognize objects in scenes?

The Problem (Simplified)
• Train a neural network to recognize 4 classes of images.
• STL-10 data set (Stanford)
• 100,000 unlabeled images
• 5,000 labeled images

Complexity
• Natural images have a high dimensionality.
• Varying position, orientation, lighting, etc. (factors of variation)
• Many different features could be considered (edges, colors, SIFT, Gabor filters, ...).

Deep Architectures
• Learn feature hierarchies, from lower-level features to higher-level ones.
• Do not rely on hand-engineered features.
• Inspired by the depth of the brain.
• Natural images are "stationary" – features learned in one part of an image can be applied to other parts.
• Invariant to small changes in the input (translation, rotation, etc.).

Deep Architectures Continued
• The number of variations in the input is greater than the number of training examples.
• We now have sufficient computational power.
• Unsupervised learning is performed locally at each level.
• Minimal supervised learning at the end.
• Learn good properties and representations of images, then learn what combinations of these properties are called (labels).

Solution (Overview)
• Self-taught learning with a sparse autoencoder.
• Convolution.
• Mean pooling of features.
• Softmax regression of the pooled features for classification.
• Unsupervised Feature Learning and Deep Learning (UFLDL) – Stanford.

Simple Neuron

Neural Network
• Hook up neurons so that the output of one neuron goes into the input of another.
• 3 input units, 3 hidden units, 1 output unit.
• Notation – (x, a, W, b, l, h(x))

Neural Network Activations
• x – input
• a – activations
• z – total weighted sum of inputs plus the bias
• W – parameters (weights) associated with the connection between unit j in layer l and unit i in layer l + 1
• h(x) – hypothesis, a real-valued output

Forward Propagation
• a(1) = x
• z(l+1) = W(l) a(l) + b(l)
• a(l+1) = f(z(l+1)), where f is the activation function (sigmoid)

Bias Units
• Bias unit – enables the activation function to be shifted as well as scaled.

Gradient Descent and Back Propagation
• Batch gradient descent.
• Try to minimize the cost function J(W, b; x, y).

Gradient Descent
• Initialize the weights (W) and biases (b) to small random values near zero.
• alpha = learning rate.
• Back propagation is an efficient way to calculate the partial derivatives of J(W, b).

Back Propagation
• Given a training example (x, y), run forward propagation to compute all the activations, including the final hypothesis.
• Then, for each neuron i in layer l, compute an error term delta that measures how responsible that node was for any errors in the output.

Back Propagation
• Perform a feed-forward pass to calculate the activations.
• For each unit in the final output layer, set the delta term. This is just the error between the output nodes and the true expected values.
• Work backwards from the output layer to the first hidden layer and set the delta terms: a weighted average of the error terms of the nodes that use a(l) as an input.
• Use these delta terms to calculate the partial derivatives with respect to the weights and biases.

Gradient Descent and Back Prop
• Set the initial weights and biases to random values close to 0.
• Go through the training examples and use back propagation to compute the error terms (delta).
• Accumulate the changes to the weights and biases from their respective delta terms.
• Update the parameters, minimizing the error. (A code sketch of these steps follows below.)
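To make the forward propagation and back propagation slides concrete, here is a small NumPy sketch for the 3-3-1 network described above. This is an illustrative reconstruction, not code from the talk: the squared-error cost, the single-example update, the learning rate value, and all variable names are assumptions, and the slides' batch gradient descent would average these gradients over the whole training set.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Small random initial parameters, as on the gradient descent slide.
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.01, size=(3, 3))   # weights: input layer -> hidden layer
    b1 = np.zeros(3)                           # hidden-layer bias
    W2 = rng.normal(scale=0.01, size=(1, 3))   # weights: hidden layer -> output layer
    b2 = np.zeros(1)                           # output-layer bias

    def forward(x):
        """Forward propagation: a(1) = x, z(l+1) = W(l) a(l) + b(l), a(l+1) = f(z(l+1))."""
        a1 = x
        z2 = W1 @ a1 + b1
        a2 = sigmoid(z2)
        z3 = W2 @ a2 + b2
        a3 = sigmoid(z3)            # h(x), the hypothesis
        return a1, a2, a3

    def backprop_step(x, y, alpha=0.1):
        """One gradient descent update from a single example (x, y); alpha is an assumed value."""
        global W1, b1, W2, b2
        a1, a2, a3 = forward(x)
        # Output-layer error term: how responsible the output node is for the error,
        # scaled by the sigmoid derivative f'(z) = a * (1 - a).
        delta3 = -(y - a3) * a3 * (1 - a3)
        # Hidden-layer error terms: weighted sum of downstream deltas, times f'(z(2)).
        delta2 = (W2.T @ delta3) * a2 * (1 - a2)
        # Partial derivatives of J(W, b; x, y) with respect to the weights and biases.
        grad_W2, grad_b2 = np.outer(delta3, a2), delta3
        grad_W1, grad_b1 = np.outer(delta2, a1), delta2
        # Parameter update (batch gradient descent would average gradients over all examples).
        W2 -= alpha * grad_W2
        b2 -= alpha * grad_b2
        W1 -= alpha * grad_W1
        b1 -= alpha * grad_b1

    x = np.array([0.5, -0.2, 0.1])   # one 3-dimensional input
    y = np.array([1.0])              # its target value
    backprop_step(x, y)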
Autoencoders
• Unsupervised training.
• Set the target values equal to the inputs (the identity function).

Autoencoders with a Sparsity Constraint
• Make sure that the average activation of each hidden unit over the training set is constrained to the sparsity parameter rho.
• Add an extra penalty term to the overall cost function, based on the KL divergence between a Bernoulli random variable with mean rho and one with mean rho-hat, the observed average activation. (See the sparsity-penalty sketch at the end of this section.)

Autoencoders Continued
• Autoencoders learn what input image would most likely cause an activation.
• Each hidden unit is now learning to look for certain features.
• Example: training an autoencoder on 10x10 whitened image patches with 100 hidden units.

Linear Decoders (Sparse Autoencoder Variation)
• Some neurons use a different activation function.
• The sigmoid activation function constrains the range of inputs and outputs to [0, 1].
• Linear activation function: set a(3) = z(3) instead of a(3) = f(z(3)) for the output layer (the identity function).
• The output is now a linear function of the hidden unit activations.

Simplified Gradients and Back Propagation
• New activation function, so the gradients change for the output units.
• y = x is the desired output.
• f(z) = z
• f'(z) = 1
• The hidden layer still uses the sigmoid activation, f'(z(2)).

ZCA Whitening
• The goal is to make the input data less redundant.
  – Pixels are highly correlated with nearby pixels and weakly correlated with faraway pixels. This is similar to how we think the biological eye processes images.
  – Adjacent pixels are perceived to have similar values, so it is inefficient to transmit every single pixel separately.
• We are not interested in the overall brightness, so subtract the mean value for normalization.

PCA and ZCA Whitening
• Subtract the mean value of all patches from each input patch.
• sigma – the covariance matrix (since x now has zero mean).
• Compute the eigenvectors of sigma using [U, S, V] = svd(sigma):
  – U = eigenvectors
  – S = eigenvalues
  – V = transpose(U)
• You can reduce the dimensionality by keeping only the top k eigenvalues of the data. (See the whitening sketch at the end of this section.)

Linear Decoder Implementation
• Learn color image patch features; flatten the intensities from each channel into a vector.
• 100,000 random 8x8 image patches from 13,000 96x96 color images (cats, dogs, deer, airplanes, birds, horses, monkeys, ships, trucks).
• 192 (8 * 8 * 3) input units.
• 400 hidden units.
• 192 output units.
• 0.035 sparsity parameter.

Convolution
• We have learned features over random 8x8 patches from large images.
• Convolve these feature detectors over a new large image.
  – This gives us a different feature activation value at each location of the image.
• Run an 8x8 window over a 64x64 image to get sets of 57x57 convolved features (400 sets in our case).

Convolution Implementation
• Compute activations for every 8x8 patch in the new image.
• Loop through the features and convolve the image with each feature using MATLAB's conv2 function over the "valid" region.
• Then run the resulting convolved image, plus the bias for this feature, through the sigmoid function to get the activations. (See the convolution sketch at the end of this section.)

Pooling
• In theory we could run the convolved features straight through a classifier, but this is computationally challenging.
  – 57 * 57 * 400 = 1,299,600 features per example.
• Aggregate statistics of the features over windows.
• Mean pooling or max pooling.
• poolDim = 19, so each 57x57 feature map is reduced to a 3x3 grid of pooled values. (See the pooling sketch at the end of this section.)

Classification
• We can now use the pooled features for classification.
• Softmax regression
  – Supervised.
  – Similar to logistic regression (binary classification), but we can have multiple class labels.
  – Compute the probability of a label given an input. (A softmax sketch follows below.)
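The sparsity constraint can be written down in a few lines. The sketch below is a hedged illustration of the KL-divergence penalty, not the talk's implementation: the penalty weight beta, the matrix layout of the hidden activations, and the random example data are assumptions; only the 0.035 sparsity parameter comes from the slides.

    import numpy as np

    def sparsity_penalty(hidden_activations, rho=0.035, beta=3.0):
        """KL-divergence penalty pushing each hidden unit's average activation toward rho.

        hidden_activations: (num_examples, num_hidden) matrix of a(2) values.
        beta: weight of the sparsity term in the overall cost (assumed value).
        """
        rho_hat = hidden_activations.mean(axis=0)   # average activation of each hidden unit
        kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
        return beta * kl.sum()

    # Example: 100 training examples, 400 hidden units (as in the linear decoder slide).
    acts = np.random.default_rng(0).uniform(0.01, 0.99, size=(100, 400))
    penalty = sparsity_penalty(acts)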
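The PCA/ZCA whitening steps map to a short NumPy routine. This sketch follows the svd(sigma) recipe from the slides, but the epsilon regularizer, the choice of which mean to subtract, and the data shapes are assumptions rather than the author's MATLAB code.

    import numpy as np

    def zca_whiten(X, epsilon=0.1):
        """ZCA-whiten a data matrix X of shape (num_features, num_patches).

        Follows the slide steps: subtract the mean, form the covariance matrix sigma,
        take its SVD ([U, S, V] = svd(sigma)), rescale by the eigenvalues, and rotate
        back. epsilon is a small regularizer (assumed value).
        """
        X = X - X.mean(axis=1, keepdims=True)       # subtract the mean value over all patches (per feature)
        sigma = (X @ X.T) / X.shape[1]              # covariance matrix, since X now has zero mean
        U, S, _ = np.linalg.svd(sigma)              # U = eigenvectors, S = eigenvalues
        # ZCA: rotate into the eigenbasis, rescale by 1/sqrt(eigenvalue), rotate back with U.
        return U @ np.diag(1.0 / np.sqrt(S + epsilon)) @ U.T @ X

    # Example: flattened 8x8x3 color patches give 192 features per patch.
    patches = np.random.default_rng(0).random((192, 1000))
    white = zca_whiten(patches)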
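The convolution step, implemented on the slides with MATLAB's conv2 over the "valid" region, could look like this in Python. The SciPy call stands in for conv2, the random image and feature are placeholders, and the whitening preprocessing that the real pipeline would fold into the features is omitted.

    import numpy as np
    from scipy.signal import convolve2d

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def convolve_feature(image, feature, bias):
        """Convolve one learned 8x8 feature over a 64x64 image channel.

        A "valid" convolution of a 64x64 image with an 8x8 kernel gives a 57x57
        map of responses; adding the feature's bias and applying the sigmoid
        gives that feature's activation at every location.
        """
        # conv2-style convolution flips the kernel, so flip the feature first
        # to get a correlation-style filter response (the detector applied as-is).
        flipped = np.flipud(np.fliplr(feature))
        responses = convolve2d(image, flipped, mode="valid")
        return sigmoid(responses + bias)

    # Placeholder data: one image channel and one learned feature channel.
    rng = np.random.default_rng(0)
    activations = convolve_feature(rng.random((64, 64)), rng.random((8, 8)), bias=0.0)
    assert activations.shape == (57, 57)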
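Mean pooling with poolDim = 19, turning each 57x57 convolved feature map into a 3x3 grid, might be sketched as follows; the reshape-based implementation is one of several equivalent ways to do it.

    import numpy as np

    def mean_pool(convolved, pool_dim=19):
        """Average each non-overlapping pool_dim x pool_dim region.

        A 57x57 convolved feature map with pool_dim = 19 becomes a 3x3 grid
        of pooled features (57 / 19 = 3).
        """
        rows, cols = convolved.shape
        out_r, out_c = rows // pool_dim, cols // pool_dim
        trimmed = convolved[:out_r * pool_dim, :out_c * pool_dim]
        # Reshape into (out_r, pool_dim, out_c, pool_dim) blocks and average each block.
        blocks = trimmed.reshape(out_r, pool_dim, out_c, pool_dim)
        return blocks.mean(axis=(1, 3))

    convolved = np.random.default_rng(0).random((57, 57))
    pooled = mean_pool(convolved)
    assert pooled.shape == (3, 3)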
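Finally, the classification step: a hedged sketch of softmax regression over the pooled features, including the weight decay term discussed on the next slide. The parameter shapes, the weight decay value, and the example data are placeholders, and the gradient / L-BFGS optimization code is omitted.

    import numpy as np

    def softmax_probs(theta, X):
        """P(y = k | x) for each class k; theta is (num_classes, num_features), X is (num_examples, num_features)."""
        scores = X @ theta.T
        scores = scores - scores.max(axis=1, keepdims=True)   # subtract the max for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum(axis=1, keepdims=True)

    def softmax_cost(theta, X, y, weight_decay=1e-4):
        """Negative log-likelihood J(theta) plus a weight decay term.

        The weight decay term (lambda/2 * sum(theta^2)) makes the minimum unique.
        weight_decay = 1e-4 is an assumed value, not from the slides.
        """
        probs = softmax_probs(theta, X)
        m = X.shape[0]
        log_likelihood = -np.log(probs[np.arange(m), y]).mean()
        return log_likelihood + 0.5 * weight_decay * np.sum(theta ** 2)

    # Example: 4 classes, and 400 features each pooled to a 3x3 grid = 3600 pooled values per image.
    rng = np.random.default_rng(0)
    theta = rng.normal(scale=0.01, size=(4, 3600))
    X = rng.random((10, 3600))
    y = rng.integers(0, 4, size=10)
    cost = softmax_cost(theta, X, y)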
Softmax Regression
• There is no closed-form way to solve for the minimum of J(theta).
• Use gradient descent or L-BFGS to solve for the minimum.
• Add a weight decay parameter to guarantee convergence to a unique solution.

Demo

Architecture Overview
• Self-taught learning with a sparse autoencoder.
  – Preprocessed with ZCA whitening.
• Use the learned features for convolution over the large image.
• Pool the convolutions to reduce dimensionality and overfitting.
• Softmax regression for classification.
• An example of self-taught learning.

Layers of Depth
• Deep networks have multiple hidden layers.
  – Remember that our autoencoder had 1 hidden layer.
  – You can stack autoencoders to achieve greater depth.
  – Ditch the "decoding" layer and attach the encoder to the next layer or to the classifier. (See the closing sketch at the end of this deck.)

Questions?

Sources
• http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
• http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf
• http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
• http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
• http://www.cs.toronto.edu/~hinton/absps/ranzato_cvpr2011.pdf
• http://www.cs.toronto.edu/~hinton/science.pdf
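As a closing sketch of the "Layers of Depth" slide: stacking autoencoders means keeping only each trained encoder (ditching its decoding layer) and feeding its activations into the next layer or into the classifier. The layer sizes below (192 -> 400 -> 200 -> 4) and the random weights are made up for illustration and are not the talk's actual architecture.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)

    # Encoder weights from two separately trained autoencoders (decoding layers discarded)
    # and a softmax classifier on top. The sizes 192 -> 400 -> 200 -> 4 are illustrative only.
    W1, b1 = rng.normal(scale=0.01, size=(400, 192)), np.zeros(400)
    W2, b2 = rng.normal(scale=0.01, size=(200, 400)), np.zeros(200)
    theta = rng.normal(scale=0.01, size=(4, 200))

    def stacked_forward(x):
        h1 = sigmoid(W1 @ x + b1)      # features from the first encoder
        h2 = sigmoid(W2 @ h1 + b2)     # higher-level features from the second encoder
        scores = theta @ h2            # softmax classifier scores for the 4 classes
        e = np.exp(scores - scores.max())
        return e / e.sum()             # class probabilities

    probs = stacked_forward(rng.random(192))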