MIT Department of Brain and Cognitive Sciences
9.641J, Spring 2005 - Introduction to Neural Networks
Instructor: Professor Sebastian Seung

Backpropagation learning

Simple vs. multilayer perceptron

Hidden layer problem
• Radical change for the supervised learning problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.

Generalized delta rule
• The delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers.

Outline
• The algorithm
• Derivation as a gradient algorithm
• Sensitivity lemma

Multilayer perceptron
• L layers of weights and biases
• L+1 layers of neurons

x^0 \xrightarrow{W^1, b^1} x^1 \xrightarrow{W^2, b^2} \cdots \xrightarrow{W^L, b^L} x^L

x_i^l = f\left( \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l \right)

Reward function
• Depends on the activity of the output layer only: R(x^L).
• Maximize reward with respect to weights and biases.

Example: squared error
• Square of desired minus actual output, with a minus sign.

R(x^L) = -\frac{1}{2} \sum_{i=1}^{n_L} (d_i - x_i^L)^2

Forward pass
For l = 1 to L,

u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l, \qquad x_i^l = f(u_i^l)

Sensitivity computation
• The sensitivity is also called "delta."

s_i^L = f'(u_i^L) \frac{\partial R}{\partial x_i^L} = f'(u_i^L)(d_i - x_i^L)

Backward pass
For l = L to 2,

s_j^{l-1} = f'(u_j^{l-1}) \sum_{i=1}^{n_l} s_i^l W_{ij}^l

Learning update
• In any order,

\Delta W_{ij}^l = \eta\, s_i^l x_j^{l-1}, \qquad \Delta b_i^l = \eta\, s_i^l

Backprop is a gradient update
• Consider R as a function of the weights and biases.

\frac{\partial R}{\partial W_{ij}^l} = s_i^l x_j^{l-1}, \qquad \frac{\partial R}{\partial b_i^l} = s_i^l

\Delta W_{ij}^l = \eta \frac{\partial R}{\partial W_{ij}^l}, \qquad \Delta b_i^l = \eta \frac{\partial R}{\partial b_i^l}

Sensitivity lemma
• Sensitivity matrix = outer product of
  – sensitivity vector
  – activity vector

\frac{\partial R}{\partial W_{ij}^l} = \frac{\partial R}{\partial b_i^l} x_j^{l-1}

• The sensitivity vector is sufficient.
• Generalization of "delta."

Coordinate transformation

u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l f(u_j^{l-1}) + b_i^l

\frac{\partial u_i^l}{\partial u_j^{l-1}} = W_{ij}^l f'(u_j^{l-1}), \qquad \frac{\partial R}{\partial u_i^l} = \frac{\partial R}{\partial b_i^l}

Output layer

x_i^L = f(u_i^L), \qquad u_i^L = \sum_j W_{ij}^L x_j^{L-1} + b_i^L

\frac{\partial R}{\partial b_i^L} = f'(u_i^L) \frac{\partial R}{\partial x_i^L}

Chain rule
• Composition of two functions: u^{l-1} \to R versus u^{l-1} \to u^l \to R.

\frac{\partial R}{\partial u_j^{l-1}} = \sum_i \frac{\partial R}{\partial u_i^l} \frac{\partial u_i^l}{\partial u_j^{l-1}}

\frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l} W_{ij}^l f'(u_j^{l-1})

Computational complexity
• Naïve estimate:
  – network output: order N
  – each component of the gradient: order N
  – N components: order N^2
• With backprop: order N

Biological plausibility
• Local: pre- and postsynaptic variables.

x_j^{l-1} \xrightarrow{W_{ij}^l} x_i^l, \qquad s_j^{l-1} \xleftarrow{W_{ij}^l} s_i^l

• Forward and backward passes use the same weights.
• Extra set of variables.

Backprop for brain modeling
• Backprop may not be a plausible account of learning in the brain.
• But perhaps the networks created by it are similar to biological neural networks.
• Zipser and Andersen:
  – train the network
  – compare hidden neurons with those found in the brain

LeNet
• Weight sharing
• Sigmoidal neurons
• Learn binary outputs

Machine learning revolution
• Gradient following
  – or other hill-climbing methods
• Empirical error minimization
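
For concreteness, here is a minimal NumPy sketch of the forward pass, sensitivity computation, backward pass, and learning update described in the slides above. The function and variable names (forward, backward, update, layer_sizes, and so on) are illustrative rather than taken from the lecture, the nonlinearity f is assumed to be the logistic sigmoid, and the step follows the slides' convention of gradient ascent on the reward R rather than descent on an error.

import numpy as np

def f(u):
    # Logistic sigmoid nonlinearity (an assumption; the lecture leaves f generic).
    return 1.0 / (1.0 + np.exp(-u))

def f_prime(u):
    # Derivative f'(u) of the sigmoid.
    s = f(u)
    return s * (1.0 - s)

def init_network(layer_sizes, rng):
    # Weights W^l of shape (n_l, n_{l-1}) and biases b^l of shape (n_l,), l = 1..L.
    W = [0.1 * rng.standard_normal((m, n))
         for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
    b = [np.zeros(m) for m in layer_sizes[1:]]
    return W, b

def forward(x0, W, b):
    # Forward pass: u^l = W^l x^{l-1} + b^l and x^l = f(u^l), for l = 1..L.
    xs, us = [x0], []
    for Wl, bl in zip(W, b):
        u = Wl @ xs[-1] + bl
        us.append(u)
        xs.append(f(u))
    return xs, us

def backward(d, xs, us, W):
    # Sensitivity computation and backward pass for the squared-error reward:
    # s^L = f'(u^L)(d - x^L), then s^{l-1} = f'(u^{l-1}) (W^l)^T s^l for l = L..2.
    s = [None] * len(W)
    s[-1] = f_prime(us[-1]) * (d - xs[-1])
    for l in range(len(W) - 1, 0, -1):
        s[l - 1] = f_prime(us[l - 1]) * (W[l].T @ s[l])
    return s

def update(W, b, xs, s, eta):
    # Learning update (gradient ascent on R): dW^l = eta s^l (x^{l-1})^T, db^l = eta s^l.
    for l in range(len(W)):
        W[l] += eta * np.outer(s[l], xs[l])
        b[l] += eta * s[l]

# One training step on a random input/target pair (the sizes are arbitrary).
rng = np.random.default_rng(0)
W, b = init_network([4, 5, 3], rng)
x0, d = rng.random(4), rng.random(3)
xs, us = forward(x0, W, b)
update(W, b, xs, backward(d, xs, us, W), eta=0.1)

Because the slides maximize the reward R, the update adds +η times the gradient; with an error E = -R, the same code would subtract it.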
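
To connect this with the "Backprop is a gradient update" and "Computational complexity" slides, a finite-difference check can confirm that the sensitivities really are the partial derivatives ∂R/∂b_i^l, as the sensitivity lemma states. The snippet below is a sketch that reuses the functions and variables defined above; the layer and unit indices checked are arbitrary.

def reward(x0, d, W, b):
    # Squared-error reward R(x^L) = -1/2 sum_i (d_i - x_i^L)^2.
    xs, _ = forward(x0, W, b)
    return -0.5 * np.sum((d - xs[-1]) ** 2)

xs, us = forward(x0, W, b)
s = backward(d, xs, us, W)

l, i, eps = 0, 2, 1e-6            # check one arbitrary bias b_i^l
base = reward(x0, d, W, b)
b[l][i] += eps
numeric = (reward(x0, d, W, b) - base) / eps
b[l][i] -= eps

print(numeric, s[l][i])           # the two values should agree closely

Each finite-difference component costs a full forward pass, so estimating all N parameter derivatives this way is order N^2, while a single backward pass delivers them all in order N, which is the point of the complexity slide.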