Connectionist Models: Backprop
Jerome Feldman
CS182/CogSci110/Ling109
Spring 2008
Hebb’s rule is not sufficient

What happens if the neural circuit fires perfectly, but the result is
very bad for the animal, like eating something sickening?
A pure invocation of Hebb’s rule would strengthen all participating
connections, which can’t be good.
 On the other hand, it isn’t right to weaken all the active connections
involved; much of the activity was just recognizing the situation – we
would like to change only those connections that led to the wrong
decision.


No one knows how to specify a learning rule that will change
exactly the offending connections when an error occurs.

Computer systems, and presumably nature as well, rely upon statistical
learning rules that tend to make the right changes over time. More in
later lectures.
Hebb’s rule is insufficient

[Diagram: a circuit with units for tastebud, "tastes rotten", "eats food", "drinks water", and "gets sick".]

Should you "punish" all the connections?
Models of Learning
• Hebbian – coincidence
• Supervised – correction (backprop)
• Recruitment – one-trial
• Reinforcement Learning – delayed reward
• Unsupervised – similarity
Abstract Neuron

A unit computes a weighted sum of its inputs $i_1, \ldots, i_n$ (with weights $w_1, \ldots, w_n$ and a bias input $i_0 = 1$ with weight $w_0$) and passes it through a threshold activation function:

$net = \sum_{i=0}^{n} w_i\, i_i$

$y = \begin{cases} 1 & \text{if } net > 0 \\ 0 & \text{otherwise} \end{cases}$
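To make the abstract neuron concrete, here is a minimal Python sketch (the function name and example weights are mine, not from the slides):

```python
def threshold_unit(weights, inputs):
    """Abstract neuron: weighted sum followed by a threshold (step) activation.

    weights[0] is the bias weight w0; its input i0 is fixed at 1.
    """
    net = sum(w * x for w, x in zip(weights, [1.0] + list(inputs)))
    return 1 if net > 0 else 0

# Example: weights chosen so the unit computes Boolean AND of two inputs.
and_weights = [-1.5, 1.0, 1.0]   # w0 (bias), w1, w2
print([threshold_unit(and_weights, (a, b)) for a in (0, 1) for b in (0, 1)])
# -> [0, 0, 0, 1]
```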
Boolean XOR

input x1  input x2  output
   0         0        0
   0         1        1
   1         0        1
   1         1        0

XOR is computed by a two-layer network of threshold units: hidden unit h1 is an OR unit (threshold 0.5) and hidden unit h2 is an AND unit (threshold 1.5), each receiving x1 and x2 with weight 1. The output unit o (threshold 0.5) receives h1 with weight +1 and h2 with weight -1.
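That wiring can be checked directly; a short sketch, assuming the thresholds and ±1 weights read off the slide (a threshold of θ is implemented as a bias of -θ):

```python
def step(net):
    return 1 if net > 0 else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)     # OR unit, threshold 0.5
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)     # AND unit, threshold 1.5
    return step(1.0 * h1 - 1.0 * h2 - 0.5)   # output: OR and not AND, threshold 0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))       # reproduces the XOR truth table
```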
Supervised Learning – Backprop

How do we train the weights of the network?

Basic concepts:
• Use a continuous, differentiable activation function (sigmoid)
• Use the idea of gradient descent on the error surface
• Extend to multiple layers

Backprop

To learn on data which is not linearly separable:
• Build multiple-layer networks (hidden layer)
• Use a sigmoid squashing function instead of a step function.
Tasks
• Unconstrained pattern classification
• Credit assessment
• Digit classification
• Speech recognition
• Function approximation
• Learning control
• Stock prediction
Sigmoid Squashing Function

$y = \frac{1}{1 + e^{-net}}$, where $net = \sum_{i=0}^{n} w_i\, y_i$

Inputs $y_1, \ldots, y_n$ with weights $w_1, \ldots, w_n$; $y_0 = 1$ is the bias input with weight $w_0$.
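A minimal sketch of the sigmoid unit in Python (function names are mine):

```python
import math

def sigmoid(net):
    """Squashing function: maps any real net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_unit(weights, inputs):
    """net = sum_i w_i * y_i, with y_0 = 1 as the bias input."""
    net = sum(w * y for w, y in zip(weights, [1.0] + list(inputs)))
    return sigmoid(net)

print(sigmoid(0.5))   # ≈ 0.6225, the value that appears in the worked example later
```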
The Sigmoid Function

[Plot of y = f(net): the output saturates near 0 for large negative net and near 1 for large positive net; the steep region around net = 0 is where the unit is most sensitive to its input.]
Gradient Descent

Learning rule – gradient descent on the squared error.

Learn the $w_i$'s that minimize the squared error

$E[\mathbf{w}] = \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$

where $O$ is the output layer.
Gradient Descent

Gradient: $\nabla E[\mathbf{w}] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$

Training rule: $\Delta \mathbf{w} = -\eta\, \nabla E[\mathbf{w}]$

i.e. $\Delta w_i = -\eta\, \frac{\partial E}{\partial w_i} = -\eta\, \frac{\partial}{\partial w_i}\, \frac{1}{2} \sum_{k \in O} (t_k - o_k)^2$
Gradient Descent

[Plot: the error surface over two weights, with the global minimum marked ("this is your goal"). It should really be 4-D (3 weights), but you get the idea.]
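For a single sigmoid output unit this gradient descent reduces to the delta rule; here is a hedged sketch (the learning rate, epoch count, and OR training data are my choices for illustration):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Training data for Boolean OR: ((i1, i2), target)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.5, 0.8, 0.6]    # [bias w0, w1, w2]
eta = 0.5              # learning rate

for epoch in range(1000):
    for (i1, i2), t in data:
        y = sigmoid(w[0] + w[1] * i1 + w[2] * i2)
        delta = (t - y) * y * (1 - y)        # -dE/dnet for E = 1/2 (t - y)^2
        for j, x in enumerate((1.0, i1, i2)):
            w[j] += eta * delta * x          # step downhill on the error surface

print([round(sigmoid(w[0] + w[1] * a + w[2] * b), 2) for (a, b), _ in data])
# outputs approach the targets [0, 1, 1, 1]
```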
Backpropagation Algorithm

Generalization to multiple layers and multiple output units.
Backprop Details
Here we go…

The output layer

Indices: input units k feed hidden units j through weights $W_{jk}$; hidden units j feed output units i through weights $W_{ij}$. $y_i$ is the output and $t_i$ the target.

$E = \text{Error} = \frac{1}{2} \sum_i (t_i - y_i)^2$

Update rule, with learning rate $\eta$:

$W_{ij} \leftarrow W_{ij} - \eta\, \frac{\partial E}{\partial W_{ij}}$

$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial x_i}\,\frac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i)\cdot f'(x_i)\cdot y_j$

Nice property of sigmoids: the derivative of the sigmoid is just $f'(x_i) = y_i(1 - y_i)$, so

$\Delta W_{ij} = -\eta \cdot \big[-(t_i - y_i)\big]\cdot y_i(1 - y_i)\cdot y_j = -\eta\,(-y_j)\,\delta_i$

where $\delta_i = (t_i - y_i)\, y_i (1 - y_i)$.
The hidden layer

$E = \text{Error} = \frac{1}{2} \sum_i (t_i - y_i)^2$

$W_{jk} \leftarrow W_{jk} - \eta\, \frac{\partial E}{\partial W_{jk}}$

$\frac{\partial E}{\partial W_{jk}} = \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial x_j}\,\frac{\partial x_j}{\partial W_{jk}}$

$\frac{\partial E}{\partial y_j} = \sum_i \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial x_i}\,\frac{\partial x_i}{\partial y_j} = \sum_i -(t_i - y_i)\cdot f'(x_i)\cdot W_{ij}$

$\frac{\partial E}{\partial W_{jk}} = \Big[\sum_i -(t_i - y_i)\cdot f'(x_i)\cdot W_{ij}\Big]\cdot f'(x_j)\cdot y_k$

$\Delta W_{jk} = -\eta\,\Big[-\sum_i (t_i - y_i)\, y_i(1 - y_i)\, W_{ij}\Big]\, y_j(1 - y_j)\, y_k = -\eta\,(-y_k)\,\delta_j$

where $\delta_j = \Big[\sum_i (t_i - y_i)\, y_i(1 - y_i)\, W_{ij}\Big]\, y_j(1 - y_j) = \Big[\sum_i W_{ij}\,\delta_i\Big]\, y_j(1 - y_j)$.
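The two delta formulas translate directly into small helper functions; a sketch (names mine), keeping the convention that $W_{ij}$ runs from hidden unit j to output unit i:

```python
def delta_output(t_i, y_i):
    """delta_i = (t_i - y_i) * y_i * (1 - y_i) for an output unit."""
    return (t_i - y_i) * y_i * (1 - y_i)

def delta_hidden(y_j, output_deltas, w_ij_column):
    """delta_j = (sum_i W_ij * delta_i) * y_j * (1 - y_j) for a hidden unit,
    where w_ij_column[i] is the weight from hidden unit j to output unit i."""
    back_error = sum(w * d for w, d in zip(w_ij_column, output_deltas))
    return back_error * y_j * (1 - y_j)

# e.g. one output unit (target 0, output 0.6224) fed by a hidden unit with y_j = 0.7
d_out = [delta_output(0.0, 0.6224)]
print(delta_hidden(0.7, d_out, [0.3]))   # a small negative delta
```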
Let’s just do an example

A single sigmoid unit with inputs $i_1 = 0$, $i_2 = 0$, a bias input $b = 1$, and weights $w_{01} = 0.8$, $w_{02} = 0.6$, $w_{0b} = 0.5$:

$x_0 = w_{01} i_1 + w_{02} i_2 + w_{0b} b = 0.5$
$y_0 = 1/(1 + e^{-0.5}) = 0.6224$

Target function (logical OR):

i1  i2  target
 0   0    0
 0   1    1
 1   0    1
 1   1    1

$E = \frac{1}{2}(t_0 - y_0)^2 = \frac{1}{2}(0 - 0.6224)^2 = 0.1937$

Using $\Delta W_{ij} = -\eta\,(-y_j)\,\delta_i$ with $\delta_i = (t_i - y_i)\, y_i(1 - y_i)$:

$\delta_0 = (0 - 0.6224)\cdot 0.6224 \cdot (1 - 0.6224) = -0.1463$

$\Delta W_{01} = -\eta\,(-i_1)\,\delta_0 = 0$ and $\Delta W_{02} = -\eta\,(-i_2)\,\delta_0 = 0$, since $i_1 = i_2 = 0$.

$\Delta W_{0b} = -\eta\,(-b)\,\delta_0$; with learning rate $\eta = 0.5$, $\Delta W_{0b} = 0.5 \times (-0.1463) = -0.0731$, so the bias weight moves from 0.5 to about 0.427.
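The arithmetic above can be checked with a few lines of Python (just the slide's numbers, nothing new):

```python
import math

i1, i2, b = 0.0, 0.0, 1.0
w01, w02, w0b = 0.8, 0.6, 0.5
eta, t0 = 0.5, 0.0                         # learning rate and target for input (0, 0)

x0 = w01 * i1 + w02 * i2 + w0b * b         # net input = 0.5
y0 = 1.0 / (1.0 + math.exp(-x0))           # ≈ 0.6224
E = 0.5 * (t0 - y0) ** 2                   # ≈ 0.1937
delta0 = (t0 - y0) * y0 * (1 - y0)         # ≈ -0.1463
dW0b = eta * b * delta0                    # ≈ -0.0731 (dW01 = dW02 = 0 since i1 = i2 = 0)
print(round(y0, 4), round(E, 4), round(delta0, 4), round(dW0b, 4))
```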
An informal account of BackProp

For each pattern in the training set:
1. Compute the error at the output nodes
2. Compute $\Delta w$ for each weight in the 2nd layer
3. Compute $\delta$ (the generalized error expression) for the hidden units
4. Compute $\Delta w$ for each weight in the 1st layer

After amassing $\Delta w$ for all the weights, change each weight a little bit, as determined by the learning rate:

$\Delta w_{ij} = \eta\, \delta_i^p\, o_j^p$
Backpropagation Algorithm

Initialize all weights to small random numbers.
For each training example do:
• For each hidden unit h: $y_h = \sigma\big(\sum_i w_{hi}\, x_i\big)$
• For each output unit k: $y_k = \sigma\big(\sum_h w_{kh}\, y_h\big)$
• For each output unit k: $\delta_k = y_k (1 - y_k)(t_k - y_k)$
• For each hidden unit h: $\delta_h = y_h (1 - y_h) \sum_k w_{kh}\, \delta_k$
• Update each network weight $w_{ij}$: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, with $\Delta w_{ij} = \eta\, \delta_j\, x_{ij}$

[Diagram: activations flow forward through the network; errors (the $\delta$'s) propagate backward.]
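Putting the pseudocode together, here is a hedged end-to-end sketch in Python/numpy that trains one hidden layer on XOR; the network size, learning rate, and epoch count are my choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

n_in, n_hid, n_out, eta = 2, 3, 1, 0.5
W_h = rng.uniform(-0.5, 0.5, (n_in + 1, n_hid))        # +1 row for the bias input
W_o = rng.uniform(-0.5, 0.5, (n_hid + 1, n_out))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(10000):
    for x, t in zip(X, T):
        x_b = np.append(x, 1.0)                        # forward pass ("activations")
        y_h = sigmoid(x_b @ W_h)
        y_h_b = np.append(y_h, 1.0)
        y_o = sigmoid(y_h_b @ W_o)
        delta_o = (t - y_o) * y_o * (1 - y_o)          # backward pass ("errors")
        delta_h = (W_o[:-1] @ delta_o) * y_h * (1 - y_h)
        W_o += eta * np.outer(y_h_b, delta_o)          # weight updates
        W_h += eta * np.outer(x_b, delta_h)

for x in X:
    y_h = sigmoid(np.append(x, 1.0) @ W_h)
    print(x, sigmoid(np.append(y_h, 1.0) @ W_o))
# outputs should end up close to 0, 1, 1, 0 (though backprop can get stuck in a local minimum)
```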
Momentum term

The speed of learning is governed by the learning rate.
• If the rate is low, convergence is slow.
• If the rate is too high, the error oscillates without reaching the minimum.

Adding a momentum term blends in the previous weight change:

$\Delta w_{ij}(n) = \alpha\, \Delta w_{ij}(n-1) + \eta\, \delta_i(n)\, y_j(n), \qquad 0 \le \alpha < 1$

• Momentum tends to smooth out small fluctuations in the weight changes.
• Momentum accelerates the descent in steady downhill directions.
• Momentum has a stabilizing effect in directions that oscillate in time.
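In code, the momentum term just blends the previous weight change into the new one; a minimal sketch (variable names mine):

```python
def momentum_update(w, prev_dw, grad_term, eta=0.25, alpha=0.9):
    """One weight update: dw = alpha * prev_dw + eta * grad_term,
    where grad_term plays the role of delta_i * y_j."""
    dw = alpha * prev_dw + eta * grad_term
    return w + dw, dw          # remember dw for the next step

w, prev = 0.3, 0.0
for grad in (0.1, 0.1, 0.1):   # a steady downhill direction...
    w, prev = momentum_update(w, prev, grad)
print(w)                       # ...is accelerated relative to using eta alone
```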
Convergence

• May get stuck in local minima
• Weights may diverge
…but often works well in practice.

Representation power:
• 2-layer networks: any continuous function
• 3-layer networks: any function
Pattern Separation and NN Architecture

Overfitting and Generalization

Too many hidden nodes tends to overfit.
Stopping criteria

Sensible stopping criteria:
• Total mean squared error change: back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
• Generalization-based criterion: after each epoch the NN is tested for generalization; if the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set used for testing the network's generalization is not used for updating the weights. (A minimal sketch of this criterion follows below.)
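A hedged sketch of the generalization-based criterion (early stopping); `train_one_epoch` and `validation_error` stand for whatever training loop and held-out split you are using, so they are placeholders, not a fixed API:

```python
def early_stopping_train(train_one_epoch, validation_error,
                         max_epochs=1000, patience=10):
    """Stop when the validation (generalization) error has not improved
    for `patience` consecutive epochs; return the best epoch and error."""
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()            # weight updates use the training split only
        err = validation_error()     # held-out data, never used for the updates
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_err
```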
Overfitting in ANNs
Summary

• Multiple-layer feed-forward networks
• Replace the step function with the sigmoid (differentiable) function
• Learn weights by gradient descent on the error function
• Backpropagation algorithm for learning
• Avoid overfitting by early stopping
ALVINN drives 70mph on highways
Use MLP Neural Networks when …

• (vectored) real inputs, (vectored) real outputs
• You're not interested in understanding how it works
• Long training times are acceptable
• Short execution (prediction) times are required
• Robust to noise in the dataset
Applications of FFNN

Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems.
• Recognizing printed or handwritten characters
• Face recognition
• Classification of loan applications into credit-worthy and non-credit-worthy groups
• Analysis of sonar and radar signals to determine the nature of the source

Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), in particular functions whose input is a sequence of measurements over time (time series).
Extensions of Backprop Nets

Recurrent architectures
• Backprop through time
• Elman nets & Jordan nets

[Diagrams: in an Elman net the context units hold a copy of the previous hidden-layer activations (copy weight 1); in a Jordan net the context units hold a copy of the previous output, decayed by a factor α.]

• We update the context as we receive input.
• In Jordan nets we model "forgetting" as well.
• The recurrent (copy) connections have fixed weights.
• You can train these networks using good ol' backprop (a minimal Elman-style sketch follows below).
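A minimal Elman-style sketch in Python/numpy: the context units hold a copy of the previous hidden activations (a fixed copy connection of weight 1) and feed back into the hidden layer through ordinary trainable weights. The sizes and names below are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 5, 2
W_xh = rng.normal(0.0, 0.1, (n_in, n_hid))    # input   -> hidden (trainable)
W_ch = rng.normal(0.0, 0.1, (n_hid, n_hid))   # context -> hidden (trainable)
W_hy = rng.normal(0.0, 0.1, (n_hid, n_out))   # hidden  -> output (trainable)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_forward(inputs):
    """Run a sequence through the net; the context starts at zero and is
    overwritten with a copy of the hidden activations after every step."""
    context = np.zeros(n_hid)
    outputs = []
    for x in inputs:
        hidden = sigmoid(x @ W_xh + context @ W_ch)
        outputs.append(sigmoid(hidden @ W_hy))
        context = hidden.copy()                # the fixed-weight copy connection
    return outputs

sequence = [rng.normal(size=n_in) for _ in range(4)]
print(elman_forward(sequence))
```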
Recurrent Backprop

[Diagram: a small recurrent net with units a, b, c connected by weights w1–w4, unrolled for 3 iterations into a feed-forward net containing three copies of each unit and of each weight.]

We'll pretend to step through the network one iteration at a time, then backprop as usual, but average equivalent weights (e.g. all 3 copies of the same edge in the unrolled network).
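As a toy illustration of "backprop as usual, but average equivalent weights", here is a sketch using a single linear recurrent unit h_t = w·h_{t-1} + x_t unrolled for 3 iterations; the linear unit and the numbers are my simplification, not the slide's network:

```python
# Unrolled recurrence h_t = w * h_{t-1} + x_t, loss E = 1/2 * (h_3 - target)^2.
w, h0, target, eta = 0.5, 1.0, 0.0, 0.1
xs = [0.2, -0.1, 0.3]

hs = [h0]                                  # forward pass through the unrolled net
for x in xs:
    hs.append(w * hs[-1] + x)
err = hs[-1] - target

# Backward pass: each unrolled copy of w gets its own gradient,
# dE/dw_k = err * w^(3-k) * h_{k-1} for the copy used at step k.
grads = [err * (w ** (len(xs) - k)) * hs[k - 1] for k in range(1, len(xs) + 1)]

# Tie the equivalent weights: average their gradients and take one step.
w -= eta * sum(grads) / len(grads)
print(grads, w)
```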