MIT Department of Brain and Cognitive Sciences
9.641J, Spring 2005 - Introduction to Neural Networks
Instructor: Professor Sebastian Seung
Backpropagation learning
Simple vs. multilayer perceptron
Hidden layer problem
• Radical change for the supervised learning problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.
Generalized delta rule
• Delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers.
Outline
• The algorithm
• Derivation as a gradient algorithm
• Sensitivity lemma
Multilayer perceptron
• L layers of weights and biases
• L+1 layers of neurons
x^0 \xrightarrow{W^1,\, b^1} x^1 \xrightarrow{W^2,\, b^2} \cdots \xrightarrow{W^L,\, b^L} x^L

x_i^l = f\left( \sum_{j=1}^{n_{l-1}} W_{ij}^l\, x_j^{l-1} + b_i^l \right)
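As an illustration of the definition above, here is a minimal NumPy sketch of the parameter shapes it implies; the layer sizes, the choice f = tanh, and the random initialization are assumptions made for the example, not part of the lecture.

import numpy as np

# Hypothetical layer sizes n_0, n_1, n_2 (so L = 2 layers of weights).
n = [3, 4, 2]
L = len(n) - 1

rng = np.random.default_rng(0)
# W[l] has shape (n_l, n_{l-1}); b[l] has shape (n_l,).
# Index 0 is left unused so that W[l], b[l] match the 1-based layer index l.
W = [None] + [rng.standard_normal((n[l], n[l - 1])) for l in range(1, L + 1)]
b = [None] + [np.zeros(n[l]) for l in range(1, L + 1)]

f = np.tanh                    # assumed activation function

# One layer's activation: x^l = f(W^l x^{l-1} + b^l)
x0 = rng.standard_normal(n[0])
x1 = f(W[1] @ x0 + b[1])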
Reward function
• Depends on activity of the output layer only:
R(x^L)
• Maximize reward with respect to weights and biases.
Example: squared error
• The squared difference between desired and actual output, with a minus sign:
R(x^L) = -\frac{1}{2} \sum_{i=1}^{n_L} \left( d_i - x_i^L \right)^2
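A worked instance of this reward, with numbers invented purely for illustration: take n_L = 2, desired output d = (1, 0), and actual output x^L = (0.8, 0.3). Then

R(x^L) = -\tfrac{1}{2}\left[ (1 - 0.8)^2 + (0 - 0.3)^2 \right] = -\tfrac{1}{2}(0.04 + 0.09) = -0.065

Maximizing R (driving it toward zero) is the same as minimizing the squared error.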
Forward pass
For l = 1 to L,
u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l\, x_j^{l-1} + b_i^l
x_i^l = f(u_i^l)
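A minimal sketch of this loop in NumPy, again assuming tanh as the activation and small invented layer sizes:

import numpy as np

rng = np.random.default_rng(0)
n = [3, 4, 2]                      # assumed layer sizes n_0, n_1, n_2
L = len(n) - 1
W = [None] + [rng.standard_normal((n[l], n[l - 1])) for l in range(1, L + 1)]
b = [None] + [np.zeros(n[l]) for l in range(1, L + 1)]
f = np.tanh                        # assumed activation

# Forward pass: for l = 1 to L,  u^l = W^l x^{l-1} + b^l,  x^l = f(u^l).
x = [rng.standard_normal(n[0])]    # x[0] is the input x^0
u = [None]                         # u[l] is stored for the backward pass
for l in range(1, L + 1):
    u.append(W[l] @ x[l - 1] + b[l])
    x.append(f(u[l]))
# x[L] is the network output x^L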
Sensitivity computation
• The sensitivity is also called “delta.”
s_i^L = f'(u_i^L)\, \frac{\partial R}{\partial x_i^L}
For the squared-error reward, this becomes
s_i^L = f'(u_i^L)\, (d_i - x_i^L)
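In code, with the squared-error reward and the assumed f = tanh (so f'(u) = 1 - tanh²(u)), the output sensitivity is one line; the arrays below are hypothetical values standing in for a particular network's u^L, x^L, and d:

import numpy as np

u_L = np.array([0.5, -0.2])        # hypothetical pre-activations u^L of the output layer
x_L = np.tanh(u_L)                 # activities x^L = f(u^L), with f = tanh assumed
d = np.array([1.0, 0.0])           # desired output

f_prime = 1.0 - np.tanh(u_L) ** 2  # f'(u^L) for f = tanh
s_L = f_prime * (d - x_L)          # s_i^L = f'(u_i^L) (d_i - x_i^L)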
Backward pass
For l = L down to 2,
s_j^{l-1} = f'(u_j^{l-1}) \sum_{i=1}^{n_l} s_i^l\, W_{ij}^l
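A sketch of this backward recursion in NumPy; the network setup and forward pass are repeated compactly so the block runs on its own, with tanh and the squared-error reward as assumed choices:

import numpy as np

rng = np.random.default_rng(0)
n = [3, 4, 2]                                 # assumed layer sizes
L = len(n) - 1
W = [None] + [rng.standard_normal((n[l], n[l - 1])) for l in range(1, L + 1)]
b = [None] + [np.zeros(n[l]) for l in range(1, L + 1)]
d = np.array([1.0, 0.0])                      # desired output

def f(u):
    return np.tanh(u)                         # assumed activation

def f_prime(u):
    return 1.0 - np.tanh(u) ** 2              # its derivative

# Forward pass (as on the earlier slide), storing u^l and x^l.
x = [rng.standard_normal(n[0])]
u = [None]
for l in range(1, L + 1):
    u.append(W[l] @ x[l - 1] + b[l])
    x.append(f(u[l]))

# Output sensitivity, then the backward pass: for l = L down to 2,
# s^{l-1} = f'(u^{l-1}) * (W^l)^T s^l   (the sum over i written as a matrix product).
s = [None] * (L + 1)
s[L] = f_prime(u[L]) * (d - x[L])
for l in range(L, 1, -1):
    s[l - 1] = f_prime(u[l - 1]) * (W[l].T @ s[l])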
Learning update
• The updates may be applied in any order.
\Delta W_{ij}^l = \eta\, s_i^l\, x_j^{l-1}
\Delta b_i^l = \eta\, s_i^l
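As a consistency check (not on the slides): for a single layer of weights, L = 1, with the squared-error reward, the sensitivity is s_i^1 = f'(u_i^1)(d_i - x_i^1), so the update reduces to the ordinary delta rule for the output layer,

\Delta W_{ij}^1 = \eta\, f'(u_i^1)\, (d_i - x_i^1)\, x_j^0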
Backprop is a gradient update
• Consider R as a function of the weights and biases.
\frac{\partial R}{\partial W_{ij}^l} = s_i^l\, x_j^{l-1}
\frac{\partial R}{\partial b_i^l} = s_i^l
\Delta W_{ij}^l = \eta\, \frac{\partial R}{\partial W_{ij}^l}
\Delta b_i^l = \eta\, \frac{\partial R}{\partial b_i^l}
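One way to see this concretely is to compare the backprop quantity s_i^l x_j^{l-1} against a finite-difference estimate of ∂R/∂W_{ij}^l. The sketch below does this for one weight of a small network; the layer sizes, tanh activation, and squared-error reward are again assumptions made for the example.

import numpy as np

rng = np.random.default_rng(1)
n = [3, 4, 2]                                 # assumed layer sizes
L = len(n) - 1
W = [None] + [rng.standard_normal((n[l], n[l - 1])) for l in range(1, L + 1)]
b = [None] + [np.zeros(n[l]) for l in range(1, L + 1)]
x0 = rng.standard_normal(n[0])                # input
d = np.array([1.0, 0.0])                      # desired output

def f(u):
    return np.tanh(u)

def f_prime(u):
    return 1.0 - np.tanh(u) ** 2

def reward(W, b):
    """Forward pass followed by the squared-error reward R(x^L)."""
    x = x0
    for l in range(1, L + 1):
        x = f(W[l] @ x + b[l])
    return -0.5 * np.sum((d - x) ** 2)

def backprop_gradients(W, b):
    """Backprop estimate of dR/dW^l, namely outer(s^l, x^{l-1})."""
    x, u = [x0], [None]
    for l in range(1, L + 1):
        u.append(W[l] @ x[l - 1] + b[l])
        x.append(f(u[l]))
    s = [None] * (L + 1)
    s[L] = f_prime(u[L]) * (d - x[L])
    for l in range(L, 1, -1):
        s[l - 1] = f_prime(u[l - 1]) * (W[l].T @ s[l])
    return [None] + [np.outer(s[l], x[l - 1]) for l in range(1, L + 1)]

dW = backprop_gradients(W, b)

# Finite-difference estimate of dR/dW^1_{00} for comparison.
eps = 1e-6
W_pert = [None] + [w.copy() for w in W[1:]]
W_pert[1][0, 0] += eps
numeric = (reward(W_pert, b) - reward(W, b)) / eps
print(numeric, dW[1][0, 0])                   # the two numbers should agree closely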
Sensitivity lemma
• Sensitivity matrix = outer product
– sensitivity vector
– activity vector
\frac{\partial R}{\partial W_{ij}^l} = \frac{\partial R}{\partial b_i^l}\, x_j^{l-1}
• The sensitivity vector is sufficient.
• Generalization of “delta.”
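A one-line derivation of the lemma, using the fact that W_{ij}^l enters R only through u_i^l = \sum_j W_{ij}^l x_j^{l-1} + b_i^l, while b_i^l enters that same expression with coefficient 1:

\frac{\partial R}{\partial W_{ij}^l}
 = \frac{\partial R}{\partial u_i^l}\, \frac{\partial u_i^l}{\partial W_{ij}^l}
 = \frac{\partial R}{\partial u_i^l}\, x_j^{l-1}
 = \frac{\partial R}{\partial b_i^l}\, x_j^{l-1}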
Coordinate transformation
u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l\, f(u_j^{l-1}) + b_i^l
\frac{\partial u_i^l}{\partial u_j^{l-1}} = W_{ij}^l\, f'(u_j^{l-1})
\frac{\partial R}{\partial u_i^l} = \frac{\partial R}{\partial b_i^l}
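Combined with the sensitivity lemma, the last identity says that the quantity propagated by the backward pass is exactly this derivative:

s_i^l = \frac{\partial R}{\partial b_i^l} = \frac{\partial R}{\partial u_i^l}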
Output layer
x_i^L = f(u_i^L)
u_i^L = \sum_j W_{ij}^L\, x_j^{L-1} + b_i^L
\frac{\partial R}{\partial b_i^L} = f'(u_i^L)\, \frac{\partial R}{\partial x_i^L}
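This is the base case of the recursion: since x_i^L = f(u_i^L) and b_i^L enters only through u_i^L, the chain rule gives

\frac{\partial R}{\partial b_i^L} = \frac{\partial R}{\partial u_i^L} = f'(u_i^L)\, \frac{\partial R}{\partial x_i^L} = s_i^L,

which is exactly the output sensitivity computed earlier.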
Chain rule
• Composition of two functions:
u^{l-1} \to R \quad \text{viewed as} \quad u^{l-1} \to u^l \to R
\frac{\partial R}{\partial u_j^{l-1}} = \sum_i \frac{\partial R}{\partial u_i^l}\, \frac{\partial u_i^l}{\partial u_j^{l-1}}
\frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l}\, W_{ij}^l\, f'(u_j^{l-1})
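Substituting \partial u_i^l / \partial u_j^{l-1} = W_{ij}^l f'(u_j^{l-1}) from the coordinate-transformation slide and writing s for \partial R / \partial b recovers the backward pass of the algorithm:

s_j^{l-1} = f'(u_j^{l-1}) \sum_i s_i^l\, W_{ij}^l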
Computational complexity
• Naïve estimate
– network output: order N
– each component of the gradient: order N
– N components: order N²
• With backprop: order N
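A rough worked instance of this estimate, with the parameter count invented for illustration: for a network with N ≈ 10^6 weights, estimating each gradient component by perturbing one parameter and rerunning the network costs about N forward passes of order N operations each,

\text{finite differences: } \sim N^2 \approx 10^{12} \text{ operations}, \qquad \text{backprop: } \sim N \approx 10^6 \text{ operations (one forward and one backward pass).}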
Biological plausibility
• Local: pre- and postsynaptic variables
x_j^{l-1} \xrightarrow{\, W_{ij}^l \,} x_i^l, \qquad s_j^{l-1} \xleftarrow{\, W_{ij}^l \,} s_i^l
• Forward and backward passes use the same weights.
• Requires an extra set of variables (the sensitivities).
Backprop for brain modeling
• Backprop may not be a plausible account of learning in the brain.
• But perhaps the networks created by it are similar to biological neural networks.
• Zipser and Andersen:
– train network
– compare hidden neurons with those found in the brain
LeNet
• Weight-sharing
• Sigmoidal neurons
• Learn binary outputs
Machine learning revolution
• Gradient following
– or other hill-climbing methods
• Empirical error minimization