Image Classification with Deep Neural Networks
Greg Schoeninger
The Problem (General)
• Transform raw pixel input into higher level representations.
• Edges, local shapes and colors, object parts, etc.
• How do we as humans recognize objects in scenes?
The Problem (simplified)
• Train a neural network to recognize 4 classes of images.
• STL‐10 Data Set (Stanford)
• 100,000 Unlabeled Image Patches
• 5,000 Labeled Images
• Natural images have a high dimensionality.
• Varying position, orientation, lighting, etc. (Factors of variation)
• Many different features could be considered. (Edges, colors, SIFT, Gabor filters, etc.)
Deep Architectures
• Learn feature hierarchies from lower level features to higher level ones.
• Do not rely on hand engineered features.
• Inspired by the depth of the brain.
• Natural images are “stationary” ‐ features learned in one part can be applied to others.
• Invariant to small changes in input (translation, rotation, etc.)
Deep Architectures Continued
• The number of variations in the input is greater than the number of training examples.
• We now have sufficient computational power.
• Unsupervised learning performed locally at each level.
• Minimal supervised learning at the end.
• Learn good properties and representations of images, then learn what combinations of these properties are called (labels).
Solution (Overview)
• Self taught learning with a sparse auto encoder.
• Convolution
• Mean Pooling of features
• Softmax regression on pooled features for classification.
• Unsupervised Feature Learning and Deep Learning ‐ Stanford
Simple Neuron
Neural Network
• Hook up neurons so that output of a neuron goes into input of another.
• 3 input units, 3 hidden units, 1 output unit.
• Notation – (x, a, w, b, l, h(x))
Neural Networks Activations
• x – Input
• a – Activations
• z – Total weighted sum of inputs and bias
• W – parameters or weights associated with the connections between unit j in layer l, and unit i in layer l + 1.
• h(x) – Hypothesis, real number output.
Forward Propagation
• a(1) = x
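A minimal sketch of forward propagation for the 3-3-1 network described above, assuming a sigmoid activation f and the usual recurrences z(l+1) = W(l) a(l) + b(l) and a(l+1) = f(z(l+1)). The weight values, the `params` layout and the function names are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, a_prev):
    """Compute z = W a_prev + b and a = f(z) for one layer.

    W is a list of rows: W[i][j] connects unit j in the previous layer
    to unit i in this layer.
    """
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    a = [sigmoid(z_i) for z_i in z]
    return z, a


def forward(params, x):
    """Run forward propagation; params is a list of (W, b) pairs."""
    a = x  # a(1) = x
    for W, b in params:
        _, a = layer_forward(W, b, a)
    return a  # h(x), the activations of the final layer


# Hypothetical weights for a 3-3-1 network (illustration only).
params = [
    ([[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [0.2, 0.2, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.5, -0.5, 0.25]], [0.0]),
]
h = forward(params, [1.0, 0.0, -1.0])
```

Because the output unit is a sigmoid, `h` is a one-element list holding a value strictly between 0 and 1.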
Bias Units
• Bias unit – enables activation function to be shifted as well as scaled.
Gradient Descent and Back Propagation
• Batch Gradient Descent.
• Try to minimize cost function J(W, b; x, y)
Gradient Descent
• Initialize weights (W) and biases (b) to small random values near zero.
• alpha = learning rate.
• Back propagation is an efficient way to calculate the partial derivatives of J(W, b)
Back Propagation
• Given a training example (x,y), run forward propagation to compute all the activations, including final hypothesis.
• Then for each neuron (i) in layer (L) compute an error term delta that measures how responsible each node was for any errors in the output.
Back Propagation
• Perform feed forward pass to calculate activations
• For each output unit in the final output layer set the delta term. This is just the error between the output nodes and the true expected values.
• Work backwards from the output layer to the first hidden layer and set the delta terms. Each delta is a weighted average of the error terms of the nodes that use a(l) as an input.
• Use these delta terms to calculate the partial derivatives of weights and biases.
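The four steps above can be written compactly as follows. This is a sketch that assumes the squared-error cost J(W, b; x, y) = 1/2 ||h(x) - y||^2 and writes n_l for the index of the output layer; both are assumptions about notation, since the slide's own equations are not in the text.

```latex
\begin{align*}
\delta^{(n_l)} &= -\left(y - a^{(n_l)}\right) \odot f'\!\left(z^{(n_l)}\right) \\
\delta^{(l)} &= \left(\left(W^{(l)}\right)^{T} \delta^{(l+1)}\right) \odot f'\!\left(z^{(l)}\right),
  \quad l = n_l - 1, \dots, 2 \\
\frac{\partial J}{\partial W^{(l)}} &= \delta^{(l+1)} \left(a^{(l)}\right)^{T} \\
\frac{\partial J}{\partial b^{(l)}} &= \delta^{(l+1)}
\end{align*}
```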
Gradient Descent and Back Prop
• Set initial weights and biases to random values close to 0.
• Go through the training examples and use back propagation to compute the error terms (delta)
• Set the change in the weights and biases by adding their respective delta terms.
• Update the parameters, minimizing the error
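The loop above can be sketched in plain Python for a network with one hidden layer and a single sigmoid output. The toy data (logical OR), the layer sizes, the learning rate, the iteration count and all function names are assumptions chosen for illustration; weight decay is omitted to keep the sketch short.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(W1, b1, W2, b2, x):
    """Return hidden activations a2 and output a3 for one example."""
    a2 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    a3 = sigmoid(sum(w * a for w, a in zip(W2, a2)) + b2)
    return a2, a3


def cost(W1, b1, W2, b2, data):
    """Average squared-error cost: mean of 0.5 * (h(x) - y)^2."""
    return sum(0.5 * (forward(W1, b1, W2, b2, x)[1] - y) ** 2
               for x, y in data) / len(data)


def train(data, hidden=2, alpha=1.0, iterations=2000, seed=0):
    rng = random.Random(seed)
    n_in, m = len(data[0][0]), len(data)
    # Initialize weights and biases to small random values near zero.
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    W2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    b2 = rng.uniform(-0.1, 0.1)
    initial = cost(W1, b1, W2, b2, data)
    for _ in range(iterations):
        # Accumulators for the batch gradient.
        gW1 = [[0.0] * n_in for _ in range(hidden)]
        gb1 = [0.0] * hidden
        gW2 = [0.0] * hidden
        gb2 = 0.0
        for x, y in data:
            a2, a3 = forward(W1, b1, W2, b2, x)
            # Output delta: error times the sigmoid derivative a3 * (1 - a3).
            d3 = (a3 - y) * a3 * (1.0 - a3)
            # Hidden deltas: output delta weighted by the outgoing connection.
            d2 = [W2[j] * d3 * a2[j] * (1.0 - a2[j]) for j in range(hidden)]
            gb2 += d3
            for j in range(hidden):
                gW2[j] += d3 * a2[j]
                gb1[j] += d2[j]
                for k in range(n_in):
                    gW1[j][k] += d2[j] * x[k]
        # Gradient descent step using the gradient averaged over the batch.
        b2 -= alpha * gb2 / m
        for j in range(hidden):
            W2[j] -= alpha * gW2[j] / m
            b1[j] -= alpha * gb1[j] / m
            for k in range(n_in):
                W1[j][k] -= alpha * gW1[j][k] / m
    return initial, cost(W1, b1, W2, b2, data)


# Toy data (logical OR), chosen only for illustration.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
initial_cost, final_cost = train(data)
```

Comparing `initial_cost` with `final_cost` shows whether the updates are actually reducing the error on the training set.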
Auto encoders
• Unsupervised training
• Set target values equal to the inputs (identity function)
Auto encoders with sparsity constraint
• Make sure that the average activation of each hidden unit over the training set is constrained to a small sparsity parameter p (usually written as the Greek letter rho).
• Add extra penalty to the overall cost function – based on KL divergence of Bernoulli random variable.
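A short sketch of the sparsity penalty, assuming the standard KL divergence between a Bernoulli variable with mean `rho` (the target sparsity, the slide's p) and one with mean `rho_hat` (the measured average activation of a hidden unit). The weighting factor `beta` and the function names are assumptions, not taken from the slides.

```python
import math


def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_penalty(rho, mean_activations, beta=1.0):
    """Penalty added to the cost: beta * sum_j KL(rho || rho_hat_j)."""
    return beta * sum(kl_divergence(rho, r) for r in mean_activations)


# Hypothetical average activations for three hidden units.
penalty = sparsity_penalty(0.05, [0.05, 0.2, 0.5])
```

The penalty is zero for a unit whose average activation equals the target and grows as the average drifts away from it.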
Auto encoder continued
• Auto encoders learn what input image would most likely cause an activation.
• Each hidden unit is now learning to look for certain features.
• Example of training auto encoder on 10x10 whitened image patches, with 100 hidden units.
Linear Decoders (Sparse Auto Encoder Variation)
• Some neurons use a different activation function.
• Sigmoid activation function constrains the range of inputs and outputs to [0,1]
• Linear activation function: Set a(3) = z(3) instead of a(3) = f(z(3)) for the output layer. (Identity function)
• Output is now linear function of hidden unit activations.
Simplified Gradients and Back propagation
• New activation function, so the gradients change for output units.
• y = x is the desired output.
• f(z) = z
• f’(z) = 1
• Hidden layer still uses the sigmoid activation f’(z(2))
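Written out, the simplified error terms look like this. The sketch assumes the squared-error cost and a sparsity term weighted by a factor beta, with rho the target sparsity and rho-hat the measured average activation of the hidden units; beta and rho-hat are assumed names that the slides do not define.

```latex
\begin{align*}
\delta^{(3)} &= -\left(y - a^{(3)}\right)
  && \text{since } f'(z^{(3)}) = 1 \\
\delta^{(2)} &= \left(\left(W^{(2)}\right)^{T} \delta^{(3)}
  + \beta\left(-\frac{\rho}{\hat{\rho}} + \frac{1 - \rho}{1 - \hat{\rho}}\right)\right)
  \odot f'\!\left(z^{(2)}\right)
  && \text{sigmoid hidden layer}
\end{align*}
```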
ZCA Whitening
• Goal is to make input data less redundant.
– Pixels are highly correlated to nearby pixels, and weakly correlated to faraway pixels. Similar model to how we think the biological eye processes images.
– Adjacent pixels will be perceived to have similar values, so it is inefficient to transmit every single pixel separately.
• Not interested in the overall brightness, subtract mean value for normalization.
PCA and ZCA Whitening
• Subtract the mean value of all patches from input patch.
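• Sigma – the covariance matrix of the patches, which can be computed directly from x because x now has zero mean.
• Compute the eigenvectors of sigma using:
[U,S,V] = svd(sigma)
U = eigenvectors of sigma (one per column)
S = diagonal matrix of the eigenvalues
V = U, because sigma is symmetric (sigma = U*S*U')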
• You can reduce the dimensionality by keeping only the top (k) eigenvectors, those with the largest eigenvalues.
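For reference, the whitening transforms built from U and the eigenvalues can be written as below. This is a sketch under stated assumptions: m is the number of zero-mean patches, lambda_i is the i-th diagonal entry of S, and epsilon is a small regularization constant that the slides do not mention.

```latex
\begin{align*}
\Sigma &= \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^{T} \\
x_{\mathrm{rot}} &= U^{T} x \\
x_{\mathrm{PCAwhite},i} &= \frac{x_{\mathrm{rot},i}}{\sqrt{\lambda_i + \epsilon}} \\
x_{\mathrm{ZCAwhite}} &= U \, x_{\mathrm{PCAwhite}}
\end{align*}
```

ZCA whitening rotates the PCA-whitened data back with U, so the result stays as close as possible to the original pixel space.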
Linear Decoder Implementation
• Learn color image patch features, flatten intensities from each channel into vector.
• 100,000 8x8 random image patches from 13,000 96x96 color images. (Cats, dogs, deer, airplanes, birds, horses, monkeys, ships, trucks).
• 192 (8*8*3) input units
• 400 hidden units
• 192 output units
• 0.035 sparsity parameter.
• We have learned features over random 8x8 patches from large images.
• Convolve these feature detectors on a new large image.
– This gives us different feature activation values at each location of the image.
• Run 8x8 window over 64x64 image to get sets of 57x57 convolved features (400 sets in our case).
Convolution Implementation
• Compute activations for every 8x8 patch in new image.
• Loop through the features and convolve the image with each feature using MATLAB's conv2 function over the “valid” region.
• Then run the resulting convolved image plus the bias for this feature through the sigmoid function to get the activations.
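The three steps above can be sketched in plain Python for one feature and a single-channel square image. This is an illustration rather than the MATLAB implementation: it computes the sliding-window dot product directly, whereas conv2 performs a true convolution and so would need the feature flipped first to give the same values. The image and feature contents are placeholder values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def convolve_valid(image, feature, bias):
    """Slide `feature` over `image` ("valid" region only), add the bias,
    and apply the sigmoid to get the feature activations.

    `image` and `feature` are square lists of rows (single channel).
    The output side length is image_dim - patch_dim + 1.
    """
    n, p = len(image), len(feature)
    out_dim = n - p + 1
    out = []
    for r in range(out_dim):
        row = []
        for c in range(out_dim):
            z = bias
            for i in range(p):
                for j in range(p):
                    z += feature[i][j] * image[r + i][c + j]
            row.append(sigmoid(z))
        out.append(row)
    return out


# Placeholder 64x64 image and 8x8 feature (illustration only).
image = [[0.0] * 64 for _ in range(64)]
feature = [[0.01] * 8 for _ in range(8)]
activations = convolve_valid(image, feature, bias=0.0)
```

For a 64x64 image and an 8x8 feature the result is 57x57, since 64 - 8 + 1 = 57; repeating this for each of the 400 learned features gives the 400 convolved feature maps.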
• In theory we could run the convolved features right through a classifier – but this is computationally challenging.
– 57*57*400 = 1,299,600 features per example.
• Aggregate statistics of features over windows.
• Mean pooling or Max pooling
• PoolDim = 19, so each 57x57 convolved feature map is pooled down to 3x3.
• We can now use the pooled features for classification.
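A minimal sketch of mean pooling over non-overlapping windows, assuming a square convolved feature map whose side is a multiple of the pooling size. The function name and the placeholder input are assumptions for illustration.

```python
def mean_pool(convolved, pool_dim):
    """Average non-overlapping pool_dim x pool_dim windows of a square map."""
    out_dim = len(convolved) // pool_dim
    pooled = []
    for r in range(out_dim):
        row = []
        for c in range(out_dim):
            total = 0.0
            for i in range(pool_dim):
                for j in range(pool_dim):
                    total += convolved[r * pool_dim + i][c * pool_dim + j]
            row.append(total / (pool_dim * pool_dim))
        pooled.append(row)
    return pooled


# Placeholder 57x57 convolved feature map (illustration only).
convolved = [[1.0] * 57 for _ in range(57)]
pooled = mean_pool(convolved, 19)

# Feature counts per example before and after pooling, for 400 features.
features_before = 57 * 57 * 400
features_after = len(pooled) * len(pooled[0]) * 400
```

With a pooling size of 19, the 1,299,600 convolved features per example shrink to 3 * 3 * 400 = 3,600 pooled features.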
• Softmax Regression
– Supervised
– Similar to logistic regression (binary classification) but we can have multiple class labels.
– Compute the probability of a label given an input.
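A short sketch of how softmax regression turns class scores into label probabilities, assuming one weight vector per class. The `theta` values and the two-feature input are hypothetical; in the pipeline described here the input would be the pooled feature vector and there would be 4 classes.

```python
import math


def softmax(scores):
    """Convert class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(theta, x):
    """theta[k] is the weight vector for class k; returns P(y = k | x)."""
    scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
    return softmax(scores)


# Hypothetical parameters: 4 classes, 2 input features.
theta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = predict_proba(theta, [0.5, -0.5])
predicted_class = probs.index(max(probs))
```

With two classes this reduces to logistic regression, which is the sense in which softmax generalizes it.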
Softmax Regression
• There is no closed‐form way to solve for the minimum of J(theta).
• Use gradient descent or L‐BFGS to solve for the minimum.
• Add a weight decay parameter to guarantee convergence to a unique solution.
Architecture Overview
• Self taught learning with sparse auto encoder.
– Preprocessed with ZCA whitening.
• Use learned features for convolution on large image.
• Pool convolutions to reduce dimensionality and over fitting.
• Softmax regression for classification.
Example of self taught learning.
Layers of Depth
• Deep networks have multiple hidden layers
– Remember our auto encoder had 1 hidden layer.
– You can stack auto encoders to achieve greater depth.
– Ditch the “decoding” layer and attach to next layer or classifier.
• http://deeplearningworkshopnips2010.files.wordpress.