MIT Department of Brain and Cognitive Sciences
9.641J, Spring 2005 - Introduction to Neural Networks
Instructor: Professor Sebastian Seung
Backpropagation learning
Simple vs. multilayer perceptron
Hidden layer problem
• Radical change for the supervised learning problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.
Generalized delta rule
• The delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers.
Outline
• The algorithm
• Derivation as a gradient algorithm
• Sensitivity lemma
Multilayer perceptron
• L layers of weights and biases
• L+1 layers of neurons
x^0 \xrightarrow{W^1,\, b^1} x^1 \xrightarrow{W^2,\, b^2} \cdots \xrightarrow{W^L,\, b^L} x^L

x_i^l = f\left( \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l \right)
Reward function
• Depends on activity of the output layer only:

R(x^L)

• Maximize reward with respect to weights and biases.
Example: squared error
• Square of desired minus actual output, with a minus sign:

R(x^L) = -\frac{1}{2} \sum_{i=1}^{n_L} (d_i - x_i^L)^2
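In code, this reward is a one-liner (a plain-Python illustration, not code from the lecture):

```python
def reward(d, x):
    # R(x^L) = -1/2 * sum_i (d_i - x_i^L)^2
    # Maximized (at zero) when the output x exactly matches the desired values d.
    return -0.5 * sum((di - xi) ** 2 for di, xi in zip(d, x))
```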
Forward pass
For l = 1 to L,
u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l

x_i^l = f(u_i^l)
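The forward pass can be sketched in plain Python. This is a minimal illustration, not code from the lecture; the tanh nonlinearity and the list-of-rows weight layout are assumptions:

```python
import math

def forward(x0, weights, biases, f=math.tanh):
    """Forward pass: for l = 1..L, u^l = W^l x^{l-1} + b^l and x^l = f(u^l).
    weights[l-1] is an n_l x n_{l-1} matrix (list of rows); biases[l-1] a length-n_l list.
    Returns per-layer pre-activations u and activations x (us[0] is None, xs[0] = x0)."""
    xs, us = [x0], [None]
    for W, b in zip(weights, biases):
        u = [sum(Wij * xj for Wij, xj in zip(row, xs[-1])) + bi
             for row, bi in zip(W, b)]
        us.append(u)
        xs.append([f(ui) for ui in u])
    return us, xs
```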
Sensitivity computation
• The sensitivity is also called “delta.”
s_i^L = f'(u_i^L) \frac{\partial R}{\partial x_i^L} = f'(u_i^L)\,(d_i - x_i^L)
Backward pass
For l = L to 2,

s_j^{l-1} = f'(u_j^{l-1}) \sum_{i=1}^{n_l} s_i^l W_{ij}^l
Learning update
• The updates may be applied in any order.
\Delta W_{ij}^l = \eta s_i^l x_j^{l-1}

\Delta b_i^l = \eta s_i^l
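A matching backward pass and learning update, continuing the pure-Python sketch (again an illustration under assumed tanh units, taking the per-layer u and x values produced by the forward pass):

```python
import math

def tanh_prime(u):
    # derivative of tanh: f'(u) = 1 - tanh(u)^2
    return 1.0 - math.tanh(u) ** 2

def backward_and_update(us, xs, weights, biases, d, eta, fprime=tanh_prime):
    """Backward pass for the squared-error reward R(x^L) = -1/2 sum_i (d_i - x_i^L)^2,
    then the learning updates dW_ij^l = eta s_i^l x_j^{l-1}, db_i^l = eta s_i^l.
    weights and biases are modified in place."""
    L = len(weights)
    # output-layer sensitivity: s_i^L = f'(u_i^L) (d_i - x_i^L)
    s = [fprime(u) * (di - xi) for u, di, xi in zip(us[L], d, xs[L])]
    for l in range(L, 0, -1):
        W, b = weights[l - 1], biases[l - 1]
        # propagate sensitivities first, using the pre-update weights:
        # s_j^{l-1} = f'(u_j^{l-1}) sum_i s_i^l W_ij^l
        if l > 1:
            s_prev = [fprime(us[l - 1][j]) * sum(s[i] * W[i][j] for i in range(len(s)))
                      for j in range(len(us[l - 1]))]
        # then apply the gradient updates for layer l
        for i, si in enumerate(s):
            b[i] += eta * si
            for j, xj in enumerate(xs[l - 1]):
                W[i][j] += eta * si * xj
        if l > 1:
            s = s_prev
```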
Backprop is a gradient update
• Consider R as a function of the weights and biases.
\frac{\partial R}{\partial W_{ij}^l} = s_i^l x_j^{l-1} \qquad \frac{\partial R}{\partial b_i^l} = s_i^l

\Delta W_{ij}^l = \eta \frac{\partial R}{\partial W_{ij}^l} \qquad \Delta b_i^l = \eta \frac{\partial R}{\partial b_i^l}
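Because backprop computes exact partial derivatives, it can be checked against a finite-difference estimate. A minimal check on a single tanh unit with no bias (a hypothetical example; the numbers are arbitrary):

```python
import math

def reward_of_weight(w, x, d):
    # R as a function of the weight: R(w) = -1/2 (d - f(w x))^2, with f = tanh
    return -0.5 * (d - math.tanh(w * x)) ** 2

w, x, d, eps = 0.3, 0.8, 1.0, 1e-6

# backprop gradient: dR/dw = s * x, with s = f'(u) (d - f(u)) and u = w x
u = w * x
s = (1.0 - math.tanh(u) ** 2) * (d - math.tanh(u))
grad_bp = s * x

# central finite-difference estimate of the same derivative
grad_fd = (reward_of_weight(w + eps, x, d) - reward_of_weight(w - eps, x, d)) / (2 * eps)

assert abs(grad_bp - grad_fd) < 1e-6
```

Note that the finite-difference method needs one extra reward evaluation per parameter, which is why the naive gradient costs order N² while backprop costs order N.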
Sensitivity lemma
• Sensitivity matrix = outer product
– sensitivity vector
– activity vector
\frac{\partial R}{\partial W_{ij}^l} = \frac{\partial R}{\partial b_i^l}\, x_j^{l-1}
• The sensitivity vector is sufficient.
• Generalization of “delta.”
Coordinate transformation
u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l f(u_j^{l-1}) + b_i^l

\frac{\partial u_i^l}{\partial u_j^{l-1}} = W_{ij}^l f'(u_j^{l-1})

\frac{\partial R}{\partial u_i^l} = \frac{\partial R}{\partial b_i^l}
Output layer
x_i^L = f(u_i^L) \qquad u_i^L = \sum_j W_{ij}^L x_j^{L-1} + b_i^L

\frac{\partial R}{\partial b_i^L} = f'(u_i^L) \frac{\partial R}{\partial x_i^L}
Chain rule
• Composition of two functions:

u^{l-1} \to R
u^{l-1} \to u^l \to R

\frac{\partial R}{\partial u_j^{l-1}} = \sum_i \frac{\partial R}{\partial u_i^l} \frac{\partial u_i^l}{\partial u_j^{l-1}}

\frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l} W_{ij}^l f'(u_j^{l-1})
Computational complexity
• Naïve estimate
– network output: order N
– each component of the gradient: order N
– N components: order N²
• With backprop: order N
Biological plausibility
• Local: pre- and postsynaptic variables
x_j^{l-1} \xrightarrow{\;W_{ij}^l\;} x_i^l \qquad s_j^{l-1} \xleftarrow{\;W_{ij}^l\;} s_i^l
• Forward and backward passes use the same weights.
• Requires an extra set of variables (the sensitivities).
Backprop for brain modeling
• Backprop may not be a plausible
account of learning in the brain.
• But perhaps the networks created by it
are similar to biological neural networks.
• Zipser and Andersen:
– train network
– compare hidden neurons with those found
in the brain.
LeNet
• Weight-sharing
• Sigmoidal neurons
• Learn binary outputs
Machine learning revolution
• Gradient following
– or other hill-climbing methods
• Empirical error minimization