Overview of Back Propagation Algorithm Shuiwang Ji A Sample Network Forward Operation • The general feed-forward operation is: Back Propagation Algorithm • The hidden to output weights can be learned by minimizing the error • The power of back-propagation is that it allows us to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights • We consider the error function: • The update rule is: Hidden-to-output Weights The chain rule: The sensitivity of unit k is: and Overall, the derivative is: Input-to-hidden Weights The chain rule: The real back propagation: Overall the rule is: Back Propagation of Sensitivity 1. The sensitivity at a hidden unit is proportional to the weighted sum of the sensitivities at the output units 2. The output unit sensitivities are thus propagated “back” to the hidden units Training Hierarchical Feed-forward Visual Recognition Models Using Transfer Learning from Pseudo-Tasks ECCV’08 Kai Yu Presented by Shuiwang Ji Transfer Learning • Transfer learning, also known as multi-task learning, is a mechanism that improves generalization by leveraging shared domain-specific information contained in related tasks • In the setting considered in this paper, all tasks share the same input space General Formulation • The main task to be learnt has index m with training examples • A neural network has a natural architecture to tackle this learning problem by minimizing: General Formulation • The is learned by additionally introducing pseudo auxiliary tasks, each represented by learning the input-output pairs: • Then the regularization term becomes • A Bayesian perspective (skipped) CNN for Transfer Learning • • • • • • Input: 140x140 pixel images, including R/G/B channels and additionally two channels Dx and Dy, which are the horizontal and vertical gradients of gray intensities C1 layer: 16 filters of size 16 by 16 P1 layer: max pooling over each 5 by 5 neighborhood C2 layer: 256 filters of size 6 by 6, connections with sparsity 0.5 between the 16 dimensions of P1 layer and the 256 dimensions of C2 layer P2 layer: max pooling over each 5 by 5 neighborhood Output layer: full connections between (256 by 4 by 4) P2 features and outputs Generating Pseudo Tasks 1. The pseudo-task is constructed by sampling a random 2D patch and using it as a template to form a local 2D filter that operates on every training image. The value assigned to an image under this task is taken to be the maximum over the result of this 2D convolution operation 2. brittle to scale, translation, and slight intensity variations Generating Pseudo Tasks • Applying Gabor filters of 4 orientations and 16 scales result in 64 feature maps of size 104*104 for each image • Max-pooling operation is performed first within each nonoverlapping 4*4 neighborhood and then within each band of two successive scales resulting in 32 feature maps of size 26*26 for each image • An set of K RBF filter of size 7*7 with 4 orientations are then sampled and used as the parameters of the pseudo-tasks, resulting in 8 feature maps of size 20*20 • Finally, max pooling is performed on the result across all the scales and within every non-overlapping 10*10 neighborhood, giving a 2*2 feature map which constitutes the value of this image under this pseudo-task • Obtained 4*K pseudo-tasks (K actual random patches, each operating at a different quadrant of the image) Object Class Recognition and Localization Using Sparse Features with Limited Receptive Fields, IJCV, in press Results on Caltech-101 0.18 second for testing one image (the forward operation) Gender and Ethnicity Recognition First-layer Features Convergence Rate