Part 7: Neural Networks & Learning
Neural Networks and Back-Propagation Learning
11/2/05

Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Feedforward multilayer networks
Feedforward Network
[Figure: input layer → hidden layers → output layer]

Credit Assignment Problem
[Figure: feedforward network whose output is compared against a desired output]
How do we adjust the weights of the hidden layers?

Adaptive System
[Figure: system $S$ with control parameters $P_1, \ldots, P_k, \ldots, P_m$, an evaluation function $F$ (fitness, figure of merit), and a control algorithm $C$ that adjusts the parameters]

Gradient
$\frac{\partial F}{\partial P_k}$ measures how $F$ is altered by variation of $P_k$
$\nabla F = \left( \frac{\partial F}{\partial P_1}, \ldots, \frac{\partial F}{\partial P_k}, \ldots, \frac{\partial F}{\partial P_m} \right)^{\mathsf T}$
$\nabla F$ points in the direction of maximum increase in $F$
Gradient Ascent on Fitness Surface
[Figure: fitness surface between + and – regions, with a gradient-ascent trajectory climbing toward the peak]

Gradient Ascent by Discrete Steps
[Figure: the continuous ascent approximated by a sequence of discrete steps on the fitness surface]

Gradient Ascent is Local, But Not Shortest
[Figure: the ascent path follows the local gradient rather than the direct route to the peak]

Gradient Ascent Process
$\dot{P} = \eta\, \nabla F(P)$
Change in fitness:
$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{dP_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k\, \dot{P}_k = \nabla F \cdot \dot{P}$
$\dot{F} = \nabla F \cdot \eta\, \nabla F = \eta\, \| \nabla F \|^2 \ge 0$
Therefore gradient ascent increases fitness (until it reaches a 0 gradient)

General Ascent in Fitness
Note that any adaptive process $P(t)$ will increase fitness provided:
$0 < \dot{F} = \nabla F \cdot \dot{P} = \| \nabla F \|\, \| \dot{P} \| \cos\varphi$
where $\varphi$ is the angle between $\nabla F$ and $\dot{P}$
Hence we need $\cos\varphi > 0$, i.e. $\varphi < 90°$

General Ascent on Fitness Surface
[Figure: ascent trajectories on the fitness surface, all within 90° of the gradient direction]
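The discrete-step approximation of $\dot{P} = \eta \nabla F(P)$ can be sketched in a few lines. A minimal Python sketch, assuming an illustrative quadratic fitness $F(P) = -(P_1 - 1)^2 - (P_2 + 2)^2$ (not from the slides) whose single maximum is at $(1, -2)$:

```python
import numpy as np

def grad_F(P):
    # Gradient of the illustrative fitness F(P) = -(P1 - 1)^2 - (P2 + 2)^2,
    # whose single maximum is at P = (1, -2).
    return np.array([-2.0 * (P[0] - 1.0), -2.0 * (P[1] + 2.0)])

def gradient_ascent(P, eta=0.1, steps=100):
    # Discrete approximation of dP/dt = eta * grad F(P)
    for _ in range(steps):
        P = P + eta * grad_F(P)
    return P

P_final = gradient_ascent(np.array([5.0, 5.0]))
# P_final approaches the maximum (1, -2) as the number of steps grows
```

Each step moves in the direction of maximum increase of $F$, so the fitness is non-decreasing along the trajectory, as the $\dot{F} \ge 0$ argument above guarantees.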
Fitness as Minimum Error
Suppose for Q different inputs we have target outputs $t^1, \ldots, t^Q$
Suppose for parameters $P$ the corresponding actual outputs are $y^1, \ldots, y^Q$
Suppose $D(t, y) \in [0, \infty)$ measures the difference between target & actual outputs
Let $E^q = D(t^q, y^q)$ be the error on the qth sample
Let $F(P) = -\sum_{q=1}^{Q} E^q(P) = -\sum_{q=1}^{Q} D[t^q, y^q(P)]$

Gradient of Fitness
$\nabla F = -\nabla \sum_q E^q = -\sum_q \nabla E^q$
$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(t^q, y^q) = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k} = \nabla_{y^q} D(t^q, y^q) \cdot \frac{\partial y^q}{\partial P_k}$

Jacobian Matrix
Define Jacobian matrix $J^q = \begin{pmatrix} \partial y_1^q / \partial P_1 & \cdots & \partial y_1^q / \partial P_m \\ \vdots & \ddots & \vdots \\ \partial y_n^q / \partial P_1 & \cdots & \partial y_n^q / \partial P_m \end{pmatrix}$
Note $J^q \in \mathbb{R}^{n \times m}$ and $\nabla D(t^q, y^q) \in \mathbb{R}^{n \times 1}$
Since $(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k}$,
$\nabla E^q = (J^q)^{\mathsf T}\, \nabla D(t^q, y^q)$

Derivative of Squared Euclidean Distance
Suppose $D(t, y) = \| t - y \|^2 = \sum_i (t_i - y_i)^2$
$\frac{\partial D(t, y)}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j) = 2 (y_j - t_j)$
$\therefore \frac{d D(t, y)}{d y} = 2 (y - t)$

Gradient of Error on qth Input
$\frac{\partial E^q}{\partial P_k} = \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k} = 2 (y^q - t^q) \cdot \frac{\partial y^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}$
$\nabla E^q = 2\, (J^q)^{\mathsf T} (y^q - t^q)$

Recap
$\dot{P} = \eta \sum_q (J^q)^{\mathsf T} (t^q - y^q)$
To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which says how the jth output varies with the kth parameter (given the qth input)
The Jacobian depends on the specific form of the system, in this case a feedforward neural network
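The identity $\nabla E^q = 2 (J^q)^{\mathsf T} (y^q - t^q)$ can be checked numerically. A small sketch for a system whose outputs depend linearly on the parameters, $y = M P$, so that the Jacobian is simply $J = M$ (the matrix and vectors are arbitrary illustrative values, not from the slides):

```python
import numpy as np

# Illustrative linear system: y = M @ P, so the Jacobian dy/dP is simply M
M = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])   # n = 3 outputs, m = 2 parameters
t = np.array([1.0, 0.0, 2.0])                          # target output
P = np.array([0.3, -0.7])                              # current parameters

def E(P):
    # Squared Euclidean error D(t, y) = ||t - y||^2
    y = M @ P
    return np.sum((t - y) ** 2)

# Analytic gradient: grad E = 2 J^T (y - t), with J = M
grad_analytic = 2.0 * M.T @ (M @ P - t)

# Central finite-difference gradient for comparison
eps = 1e-6
grad_fd = np.array([
    (E(P + eps * np.eye(2)[k]) - E(P - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])
# grad_fd agrees with grad_analytic to within finite-difference error
```

The same check works for any differentiable system; only the Jacobian changes.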
Multilayer Notation
[Figure: $x^q \to W^1 \to s^1 \to W^2 \to s^2 \to \cdots \to W^{L-2} \to s^{L-1} \to W^{L-1} \to s^L = y^q$]

Notation
• L layers of neurons labeled 1, …, L
• $N_l$ neurons in layer l
• $s^l$ = vector of outputs from neurons in layer l
• input layer $s^1 = x^q$ (the input pattern)
• output layer $s^L = y^q$ (the actual output)
• $W^l$ = weights between layers l and l+1
• Problem: find how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ (l = 1, …, L–1)

Typical Neuron
[Figure: inputs $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ with weights $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$ feed the net input $h_i^l$, which produces the output $s_i^l$]

Equations
Net input: $h_i = \sum_{j=1}^{n} w_{ij} s_j$, i.e. $h = W s$
Neuron output: $s_i = \sigma(h_i)$, i.e. $s = \sigma(h)$
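The layer equations $h = W s$, $s = \sigma(h)$ amount to a short loop over the weight matrices. A minimal forward-pass sketch (the layer sizes and random weights are illustrative assumptions):

```python
import numpy as np

def sigma(h):
    # Logistic sigmoid, applied componentwise
    return 1.0 / (1.0 + np.exp(-h))

def forward(weights, x):
    # weights is the list [W^1, ..., W^{L-1}]; s^1 = x is the input layer
    s = x
    for W in weights:
        h = W @ s          # net input: h^{l+1} = W^l s^l
        s = sigma(h)       # neuron output: s^{l+1} = sigma(h^{l+1})
    return s               # s^L = y, the network output

rng = np.random.default_rng(0)
# Illustrative 3-layer network: 4 inputs, 5 hidden neurons, 2 outputs
weights = [rng.standard_normal((5, 4)), rng.standard_normal((2, 5))]
y = forward(weights, np.array([1.0, 0.0, -1.0, 0.5]))
# y is a length-2 vector of outputs, each in (0, 1) because of the sigmoid
```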
Error Back-Propagation
We will compute $\frac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer (l = L–1) and working back to earlier layers (l = L–2, …, 1)

Delta Values
Convenient to break the derivatives by the chain rule:
$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$
Let $\delta_i^l = \frac{\partial E^q}{\partial h_i^l}$
So $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$
Output-Layer Neuron
[Figure: inputs $s_1^{L-1}, \ldots, s_j^{L-1}, \ldots, s_N^{L-1}$ with weights $W_{i1}^{L-1}, \ldots, W_{ij}^{L-1}, \ldots, W_{iN}^{L-1}$ feed $h_i^L$, producing $s_i^L = y_i^q$, which is compared with the target $t_i^q$ to give $E^q$]

Output-Layer Derivatives (1)
$\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q) \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q)\, \sigma'(h_i^L)$

Output-Layer Derivatives (2)
$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$
$\therefore \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$, where $\delta_i^L = 2 (s_i^L - t_i^q)\, \sigma'(h_i^L)$

Hidden-Layer Neuron
[Figure: inputs $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ with weights $W_{i1}^{l-1}, \ldots, W_{iN}^{l-1}$ feed $h_i^l$, producing $s_i^l$, which feeds forward through $W_{1i}^l, \ldots, W_{ki}^l, \ldots, W_{Ni}^l$ to $s_1^{l+1}, \ldots, s_k^{l+1}, \ldots, s_N^{l+1}$, all of which contribute to $E^q$]

Hidden-Layer Derivatives (1)
Recall $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$
$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}$
$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = \frac{\partial (W_{ki}^l s_i^l)}{\partial h_i^l} = W_{ki}^l \frac{d \sigma(h_i^l)}{d h_i^l} = W_{ki}^l\, \sigma'(h_i^l)$
$\therefore \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l\, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Hidden-Layer Derivatives (2)
$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \frac{d\, W_{ij}^{l-1} s_j^{l-1}}{d\, W_{ij}^{l-1}} = s_j^{l-1}$
$\therefore \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$, where $\delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
Derivative of Sigmoid
Suppose $s = \sigma(h) = \frac{1}{1 + \exp(-h)}$ (logistic sigmoid)
$D_h s = D_h [1 + \exp(-h)]^{-1} = -[1 + \exp(-h)]^{-2}\, D_h (1 + e^{-h})$
$= -(1 + e^{-h})^{-2} (-e^{-h}) = \frac{e^{-h}}{(1 + e^{-h})^2}$
$= \frac{1}{1 + e^{-h}} \cdot \frac{e^{-h}}{1 + e^{-h}} = s \left( \frac{1 + e^{-h}}{1 + e^{-h}} - \frac{1}{1 + e^{-h}} \right)$
$= s (1 - s)$

Summary of Back-Propagation Algorithm
Output layer: $\delta_i^L = 2 s_i^L (1 - s_i^L)(s_i^L - t_i^q)$
$\frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$
Hidden layers: $\delta_i^l = s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$
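The summary formulas translate almost line for line into code. A compact sketch for one training pair, using the logistic sigmoid and squared error from the slides (the layer sizes and data are illustrative assumptions); a finite-difference comparison on one weight serves as a sanity check:

```python
import numpy as np

def sigma(h):
    return 1.0 / (1.0 + np.exp(-h))

def backprop(weights, x, t):
    # Forward pass, remembering each layer's output s^l (s[0] = x is the input layer)
    s = [x]
    for W in weights:
        s.append(sigma(W @ s[-1]))
    # Output layer: delta^L = 2 s^L (1 - s^L) (s^L - t)
    delta = 2.0 * s[-1] * (1.0 - s[-1]) * (s[-1] - t)
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        # dE/dW_ij = delta_i s_j, i.e. the outer product of delta and the layer input
        grads[l] = np.outer(delta, s[l])
        if l > 0:
            # Hidden layer: delta^l = s^l (1 - s^l) * (W^l)^T delta^{l+1}
            delta = s[l] * (1.0 - s[l]) * (weights[l].T @ delta)
    return grads

rng = np.random.default_rng(1)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((2, 3))]
x, t = np.array([0.5, -1.0]), np.array([1.0, 0.0])
grads = backprop(weights, x, t)

# Finite-difference check on a single weight
def error(ws):
    s = x
    for W in ws:
        s = sigma(W @ s)
    return np.sum((s - t) ** 2)

eps = 1e-6
Wp = [w.copy() for w in weights]; Wp[0][0, 0] += eps
Wm = [w.copy() for w in weights]; Wm[0][0, 0] -= eps
fd = (error(Wp) - error(Wm)) / (2 * eps)
# fd matches grads[0][0, 0] to within finite-difference error
```

Note that one backward pass yields the derivatives with respect to every weight in the network, which is exactly what makes back-propagation efficient.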
Output-Layer Computation
[Figure: output neuron with inputs $s_1^{L-1}, \ldots, s_j^{L-1}, \ldots, s_N^{L-1}$ and the weight update]
$\Delta W_{ij}^{L-1} = \eta\, \delta_i^L s_j^{L-1}$, where $\delta_i^L = 2 s_i^L (1 - s_i^L)(t_i^q - s_i^L)$
(the factor $t_i^q - s_i^L$ has the opposite sign from the error derivative because the update ascends the fitness $F = -\sum_q E^q$, i.e. descends the error)

Hidden-Layer Computation
[Figure: hidden neuron receiving back-propagated deltas $\delta_1^{l+1}, \ldots, \delta_k^{l+1}, \ldots, \delta_N^{l+1}$ through the weights $W_{ki}^l$, and the weight update]
$\Delta W_{ij}^{l-1} = \eta\, \delta_i^l s_j^{l-1}$, where $\delta_i^l = s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Training Procedures
• Batch Learning
  – on each epoch (pass through all the training pairs), weight changes for all patterns are accumulated
  – weight matrices updated at end of epoch
  – accurate computation of gradient
• Online Learning
  – weights are updated after back-prop of each training pair
  – usually randomize order for each epoch
  – approximation of gradient
• Doesn't make much difference

Summation of Error Surfaces
[Figure: the total error surface E as the sum of the per-pattern surfaces E1 and E2]
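The difference between the two procedures is only where the update happens. A sketch of both epoch loops, using an illustrative per-pattern error $E^q(P) = \|P - c_q\|^2$ (a stand-in for back-propagated gradients, not from the slides), whose summed surface $E$ has its minimum at the mean of the $c_q$:

```python
import numpy as np

# Illustrative per-pattern errors E^q(P) = ||P - c_q||^2; the summed error
# surface E = sum_q E^q has its minimum at the mean of the c_q
C = np.array([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]])   # one row per training pattern

def grad_Eq(P, c):
    return 2.0 * (P - c)

def batch_epoch(P, eta):
    # Accumulate the weight changes over all patterns, update once at epoch
    # end: the exact gradient of the summed error surface E
    total = np.zeros_like(P)
    for c in C:
        total += grad_Eq(P, c)
    return P - eta * total

def online_epoch(P, eta, rng):
    # Update after each pattern, in randomized order: each step follows the
    # gradient of a single E^q, an approximation of the gradient of E
    for c in C[rng.permutation(len(C))]:
        P = P - eta * grad_Eq(P, c)
    return P

rng = np.random.default_rng(0)
P_batch = P_online = np.array([10.0, -10.0])
for _ in range(200):
    P_batch = batch_epoch(P_batch, eta=0.05)
    P_online = online_epoch(P_online, eta=0.05, rng=rng)
# Both end up near the minimum of E at the mean of the c_q, (2, 2);
# online wanders slightly around it, batch converges exactly
```

This illustrates the "doesn't make much difference" point: with a small learning rate the online trajectory hovers close to the batch solution.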
Gradient Computation in Batch Learning
[Figure: each weight update follows the exact gradient of the summed error surface E = E1 + E2]

Gradient Computation in Online Learning
[Figure: each weight update follows the gradient of a single pattern's error surface, E1 or E2, alternating between them]
The Golden Rule of Neural Nets
Neural Networks are the second-best way to do everything!