Overview of Back Propagation
Shuiwang Ji
A Sample Network
Forward Operation
• The general feed-forward operation is:
Back Propagation Algorithm
• The hidden to output weights can be learned by minimizing the error
• The power of back-propagation is that it allows us to calculate an
effective error for each hidden unit, and thus derive a learning rule
for the input-to-hidden weights
• We consider the error function:
• The update rule is:
Hidden-to-output Weights
The chain rule:
The sensitivity of unit k is:
Overall, the derivative is:
Input-to-hidden Weights
The chain rule:
The real back propagation:
Overall the rule is:
Back Propagation of Sensitivity
1. The sensitivity at a hidden unit is proportional to the weighted sum of
the sensitivities at the output units
2. The output unit sensitivities are thus propagated “back” to the hidden
Training Hierarchical Feed-forward
Visual Recognition Models Using
Transfer Learning from Pseudo-Tasks
Kai Yu
Presented by Shuiwang Ji
Transfer Learning
• Transfer learning, also known as multi-task learning, is a mechanism
that improves generalization by leveraging shared domain-specific
information contained in related tasks
• In the setting considered in this paper, all tasks share the same input
General Formulation
The main task to be learnt has index m with training examples
A neural network has a natural architecture to tackle this learning
problem by minimizing:
General Formulation
• The
is learned by additionally introducing pseudo auxiliary
tasks, each represented by learning the input-output pairs:
• Then the regularization term becomes
• A Bayesian perspective (skipped)
CNN for Transfer Learning
Input: 140x140 pixel images, including R/G/B channels and additionally two
channels Dx and Dy, which are the horizontal and vertical gradients of gray
C1 layer: 16 filters of size 16 by 16
P1 layer: max pooling over each 5 by 5 neighborhood
C2 layer: 256 filters of size 6 by 6, connections with sparsity 0.5 between
the 16 dimensions of P1 layer and the 256 dimensions of C2 layer
P2 layer: max pooling over each 5 by 5 neighborhood
Output layer: full connections between (256 by 4 by 4) P2 features and
Generating Pseudo Tasks
1. The pseudo-task is constructed by sampling a random 2D patch and
using it as a template to form a local 2D filter that operates on every
training image. The value assigned to an image under this task is
taken to be the maximum over the result of this 2D convolution
2. brittle to scale, translation, and slight intensity variations
Generating Pseudo Tasks
• Applying Gabor filters of 4 orientations and 16 scales result in 64
feature maps of size 104*104 for each image
• Max-pooling operation is performed first within each nonoverlapping 4*4 neighborhood and then within each band of two
successive scales resulting in 32 feature maps of size 26*26 for
each image
• An set of K RBF filter of size 7*7 with 4 orientations are then
sampled and used as the parameters of the pseudo-tasks, resulting
in 8 feature maps of size 20*20
• Finally, max pooling is performed on the result across all the scales
and within every non-overlapping 10*10 neighborhood, giving a 2*2
feature map which constitutes the value of this image under this
• Obtained 4*K pseudo-tasks (K actual random patches, each
operating at a different quadrant of the image)
Object Class Recognition and
Localization Using Sparse
Features with Limited Receptive
Fields, IJCV, in press
Results on Caltech-101
0.18 second for testing one image (the forward operation)
Gender and Ethnicity Recognition
First-layer Features
Convergence Rate