Back-Propagation Algorithm
Perceptron
 Gradient Descent
 Multi-layered neural network
 Back-Propagation
 More on Back-Propagation
 Examples

Inner-product
net  w, x || w ||  || x || cos( )
n
net   w i  x i

i1


A measure of the projection of one vector
onto another
Activation function
o = f(net) = f\left(\sum_{i=1}^{n} w_i x_i\right)

f(x) := \mathrm{sgn}(x) = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}
f(x) := \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}

f(x) := \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } -0.5 < x < 0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}

sigmoid function

f(x) := \sigma(x) = \frac{1}{1 + e^{-ax}}
Gradient Descent

To understand, consider a simpler linear unit, where

o = \sum_{i=0}^{n} w_i x_i


Let's learn the w_i that minimize the squared error over the training set D = \{(x^1,t^1), (x^2,t^2), \ldots, (x^d,t^d), \ldots, (x^m,t^m)\} (t for target).
Error for different hypotheses, plotted for w_0 and w_1 (dim 2):

We want to move the weight vector in the direction that decreases E:

w_i = w_i + \Delta w_i, \qquad w = w + \Delta w

Differentiating E:

Update rule for gradient descent:

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) \, x_{id}
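A sketch of this batch update rule applied to a linear unit (numpy assumed; eta plays the role of η, and the toy data set is made up for illustration):

```python
import numpy as np

def gradient_descent_linear(X, t, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x.
    X: (m, n) training inputs, t: (m,) targets."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                   # outputs o_d for all training examples
        w += eta * X.T @ (t - o)    # Delta w_i = eta * sum_d (t_d - o_d) x_id
    return w

# Illustrative toy data: the first column is a constant bias input x0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 2.0, 3.0])
print(gradient_descent_linear(X, t))   # approaches the exact solution w = [1, 1]
```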

Stochastic Approximation to Gradient Descent

\Delta w_i = \eta (t - o) \, x_i

The gradient descent training rule updates the weights by summing over all training examples in D.
Stochastic gradient descent approximates gradient descent by updating the weights incrementally:
calculate the error for each example and update immediately.
Known as the delta rule or LMS (least mean square) weight update.

Adaline rule, used for adaptive filters, Widrow and Hoff (1960)
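A sketch of this incremental delta rule (LMS) in Python, using the same kind of toy data as the previous sketch; here w is updated after every single example rather than after a full pass over D:

```python
import numpy as np

def delta_rule(X, t, eta=0.05, epochs=200):
    """Incremental LMS / delta-rule training of a linear unit o = w . x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):      # one example at a time
            o = w @ x_d                 # output for this example
            w += eta * (t_d - o) * x_d  # Delta w_i = eta (t - o) x_i
    return w

# Illustrative toy data: first column is a constant bias input x0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 2.0, 3.0])
print(delta_rule(X, t))   # approaches w = [1, 1]
```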
XOR problem and Perceptron

By Minsky and Papert in the mid-1960s
Multi-layer Networks



The limitations of the simple perceptron do not apply to feed-forward networks with intermediate or "hidden" nonlinear units.
A network with just one hidden layer can represent any Boolean function.
The great power of multi-layer networks was realized long ago,
but it was only in the eighties that it was shown how to make them learn.

Multiple layers of cascaded linear units still produce only linear functions.
We search for networks capable of representing nonlinear functions:
units should use nonlinear activation functions.
Examples of nonlinear activation functions
XOR-example
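A hand-wired sketch (illustrative, not from the slides) of how a single hidden layer of step units can represent XOR: one hidden unit computes OR, the other NAND, and the output unit ANDs them. The weights are chosen by hand, not learned.

```python
import numpy as np

def step(x):
    return np.where(x >= 0.0, 1.0, 0.0)

def xor_net(x1, x2):
    # Hidden unit 1: OR  (fires if x1 + x2 >= 0.5)
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    # Hidden unit 2: NAND (fires unless both inputs are on)
    h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5)
    # Output unit: AND of the two hidden units
    return step(1.0 * h1 + 1.0 * h2 - 1.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_net(a, b)))   # prints the XOR truth table
```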


Back-propagation is a learning algorithm for
multi-layer neural networks
It was invented independently several times




Bryson and Ho [1969]
Werbos [1974]
Parker [1985]
Rumelhart et al. [1986]
Parallel Distributed Processing, Vol. 1: Foundations
David E. Rumelhart, James L. McClelland and the PDP Research Group
"What makes people smarter than computers? These volumes by a pioneering neurocomputing ..."
Back-propagation

The algorithm gives a prescription for changing the weights w_ij in any feed-forward network so as to learn a training set of input-output pairs \{x^d, t^d\}

We consider a simple two-layer network
(Figure: two-layer network with inputs x_1, \ldots, x_5 (generic x_k), hidden units V_j, and output units o_i.)

Given the pattern x^d, hidden unit j receives a net input

net_j^d = \sum_{k=1}^{5} w_{jk} x_k^d

and produces the output

V_j^d = f(net_j^d) = f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)


Output unit i thus receives

net_i^d = \sum_{j=1}^{3} W_{ij} V_j^d = \sum_{j=1}^{3} \left( W_{ij} \, f\!\left(\sum_{k=1}^{5} w_{jk} x_k^d\right) \right)

and produces the final output

o_i^d = f(net_i^d) = f\left(\sum_{j=1}^{3} W_{ij} V_j^d\right) = f\left(\sum_{j=1}^{3} \left( W_{ij} \, f\!\left(\sum_{k=1}^{5} w_{jk} x_k^d\right) \right)\right)

Our usual error function

For l outputs and m input-output pairs \{x^d, t^d\}:

E[w] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} (t_i^d - o_i^d)^2

In our example E becomes

E[w] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)^2

E[w] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left( t_i^d - f\!\left(\sum_{j=1}^{3} W_{ij} \, f\!\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right) \right)^2


E[w] is differentiable given f is differentiable
Gradient descent can be applied

For the hidden-to-output connections the gradient descent rule gives:

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(net_i^d) \, V_j^d

Defining \delta_i^d = f'(net_i^d)(t_i^d - o_i^d):

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d V_j^d

For the input-to-hidden connections w_jk we must differentiate with respect to w_jk.

Using the chain rule we obtain:

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \frac{\partial V_j^d}{\partial w_{jk}}

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d) \, f'(net_i^d) \, W_{ij} \, f'(net_j^d) \, x_k^d

With \delta_i^d = f'(net_i^d)(t_i^d - o_i^d):

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d W_{ij} \, f'(net_j^d) \, x_k^d

Defining \delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij} \delta_i^d:

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d x_k^d

Summary of the two update rules:

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d V_j^d

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d x_k^d



We obtain the same form as before, but with a different definition of \delta.

In general, with an arbitrary number of layers, the back-propagation update rule always has the form

\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{output} \cdot V_{input}

where "output" and "input" refer to the two ends of the connection concerned,
V stands for the appropriate input (hidden unit or real input x^d),
and \delta depends on the layer concerned.

The equation

\delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij} \delta_i^d

allows us to determine \delta for a given hidden unit V_j in terms of the \delta's of the output units o_i.
The coefficients are the usual forward weights, but the errors \delta are propagated backward:
hence back-propagation.
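A self-contained sketch of these delta computations and the resulting updates for a single pattern in the 5-3-2 example network; the weights, input, and target values are illustrative, and f is the sigmoid so that f'(net) = f(net)(1 - f(net)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=(3, 5))   # input-to-hidden weights w_jk
W = rng.normal(scale=0.1, size=(2, 3))   # hidden-to-output weights W_ij
x = rng.normal(size=5)                   # one input pattern x^d
t = np.array([1.0, 0.0])                 # its target t^d
eta = 0.5

# Forward pass
V = sigmoid(w @ x)                       # hidden outputs V_j^d
o = sigmoid(W @ V)                       # network outputs o_i^d

# Output deltas: delta_i = f'(net_i)(t_i - o_i) = o_i (1 - o_i)(t_i - o_i)
delta_out = o * (1.0 - o) * (t - o)
# Hidden deltas: delta_j = f'(net_j) * sum_i W_ij delta_i (errors propagated backward)
delta_hidden = V * (1.0 - V) * (W.T @ delta_out)

# Weight updates: Delta W_ij = eta * delta_i * V_j,  Delta w_jk = eta * delta_j * x_k
W += eta * np.outer(delta_out, V)
w += eta * np.outer(delta_hidden, x)
```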



We have to use a nonlinear, differentiable activation function.

Examples:

f(x) = \sigma(x) = \frac{1}{1 + e^{-\beta x}}, \qquad f'(x) = \sigma'(x) = \beta \, \sigma(x) (1 - \sigma(x))

f(x) = \tanh(\beta x), \qquad f'(x) = \beta \, (1 - f(x)^2)
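These two activation/derivative pairs in Python (a minimal sketch; beta plays the role of β above):

```python
import numpy as np

def sigmoid(x, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * x))

def sigmoid_prime(x, beta=1.0):
    s = sigmoid(x, beta)
    return beta * s * (1.0 - s)      # beta * sigma(x) * (1 - sigma(x))

def tanh_act(x, beta=1.0):
    return np.tanh(beta * x)

def tanh_prime(x, beta=1.0):
    return beta * (1.0 - np.tanh(beta * x) ** 2)
```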
Consider a network with M layers, m = 1, 2, \ldots, M.
 V_i^m denotes the output of the ith unit of the mth layer
 V_i^0 is a synonym for x_i, the ith input
 The superscript m refers to layers, not patterns
 w_{ij}^m means the connection from V_j^{m-1} to V_i^m

Stochastic Back-Propagation Algorithm (mostly used)

1. Initialize the weights to small random values.
2. Choose a pattern x_k^d and apply it to the input layer: V_k^0 = x_k^d for all k.
3. Propagate the signal through the network:
   V_i^m = f(net_i^m) = f\left(\sum_j w_{ij}^m V_j^{m-1}\right)
4. Compute the deltas for the output layer:
   \delta_i^M = f'(net_i^M)(t_i^d - V_i^M)
5. Compute the deltas for the preceding layers, for m = M, M-1, \ldots, 2:
   \delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m \delta_j^m
6. Update all connections:
   \Delta w_{ij}^m = \eta \, \delta_i^m V_j^{m-1}, \qquad w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}^m
7. Go to 2 and repeat for the next pattern.
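A sketch of steps 1-7 in Python for an arbitrary list of layer sizes, with the sigmoid as activation (so f'(net) = V(1 - V)); the XOR data at the end, the constant bias input, and all names are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, T, layer_sizes, eta=0.5, epochs=10000, seed=0):
    """Stochastic back-propagation for a fully connected feed-forward net.
    layer_sizes = [n_inputs, n_hidden_1, ..., n_outputs]."""
    rng = np.random.default_rng(seed)
    # 1. Initialize the weights to small random values
    weights = [rng.normal(scale=0.5, size=(n_out, n_in))
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    for _ in range(epochs):
        for x, t in zip(X, T):                      # 2. choose a pattern
            # 3. Propagate the signal through the network
            V = [x]
            for Wm in weights:
                V.append(sigmoid(Wm @ V[-1]))
            # 4. Deltas for the output layer: f'(net)(t - V), with f' = V(1 - V)
            delta = V[-1] * (1.0 - V[-1]) * (t - V[-1])
            # 5. + 6. Deltas for the preceding layers and weight updates
            for m in range(len(weights) - 1, -1, -1):
                delta_prev = V[m] * (1.0 - V[m]) * (weights[m].T @ delta)
                weights[m] += eta * np.outer(delta, V[m])   # Delta w = eta * delta * V
                delta = delta_prev                          # (unused for the input layer)
    return weights

def predict(weights, x):
    for Wm in weights:
        x = sigmoid(Wm @ x)
    return x

# Illustrative: learn XOR with one hidden layer; the constant third input acts as a bias.
# Convergence depends on the random initialization (local minima), so more epochs
# or another seed may be needed.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
ws = train_backprop(X, T, layer_sizes=[3, 4, 1])
print([float(predict(ws, x)[0]) for x in X])   # should approach 0, 1, 1, 0
```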
More on Back-Propagation
Gradient descent over the entire network weight vector
 Easily generalized to arbitrary directed graphs
 Will find a local, not necessarily global, error minimum

In practice it often works well (can run multiple times)

Gradient descent can be very slow if \eta is too small, and can oscillate widely if \eta is too large
Often a weight momentum \alpha is included:

\Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \, \Delta w_{pq}(t)

The momentum parameter \alpha is chosen between 0 and 1; 0.9 is a good value.
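A minimal sketch of this momentum update for one weight array; dE_dw stands for \partial E / \partial w_{pq} (its computation is omitted) and the names are illustrative:

```python
import numpy as np

def momentum_update(w, dE_dw, prev_delta, eta=0.1, alpha=0.9):
    """Gradient step with momentum:
    Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)."""
    delta = -eta * dE_dw + alpha * prev_delta
    return w + delta, delta

# Illustrative usage: keep `prev_delta` (the previous Delta w) between updates
w = np.zeros((2, 3))
prev_delta = np.zeros_like(w)
dE_dw = np.ones_like(w)          # placeholder gradient
w, prev_delta = momentum_update(w, dE_dw, prev_delta)
```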

Minimizes error over the training examples

Will it generalize well?

Training can take thousands of iterations; it is slow!

Using the network after training is very fast
Convergence of Backpropagation

Gradient descent to some local minimum





Perhaps not global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights
Nature of convergence



Initialize weights near zero
Therefore, initial networks are near-linear
Increasingly non-linear functions become possible as training progresses
Expressive Capabilities of
ANNs

Boolean functions:



Every Boolean function can be represented by a network with a single hidden layer,
but it might require a number of hidden units exponential in the number of inputs.
Continuous functions:
Every bounded continuous function can be approximated with arbitrarily small error
by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
Any function can be approximated to arbitrary accuracy by a network with two
hidden layers [Cybenko 1988].
NETtalk (Sejnowski et al., 1987)
Prediction
Perceptron
 Gradient Descent
 Multi-layered neural network
 Back-Propagation
 More on Back-Propagation
 Examples


RBF Networks, Support Vector Machines