B." Neural Network Learning! 10/22/08! 1!

advertisement
B."
Neural Network Learning!
10/22/08!
1!
Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Feedforward multilayer networks

Feedforward Network
[Figure: feedforward network with an input layer, several hidden layers, and an output layer]

Typical Artificial Neuron
[Figure: neuron with inputs, connection weights, threshold, and output]

Typical Artificial Neuron
[Figure: a linear combination of the inputs produces the net input (local field), which passes through the activation function]

Equations

Net input:
$h_i = \left(\sum_{j=1}^{n} w_{ij} s_j\right) - \theta$
$\mathbf{h} = \mathbf{W}\mathbf{s} - \boldsymbol{\theta}$

Neuron output:
$s'_i = \sigma(h_i)$
$\mathbf{s}' = \sigma(\mathbf{h})$

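A minimal sketch of these equations in Python/NumPy, assuming a logistic sigmoid as the activation function; the array names and sizes here are illustrative, not part of the original slides.

```python
import numpy as np

def sigmoid(h):
    """Logistic sigmoid activation (one possible choice of sigma)."""
    return 1.0 / (1.0 + np.exp(-h))

def layer_output(W, s, theta):
    """Net input h = W s - theta, then elementwise activation s' = sigma(h)."""
    h = W @ s - theta            # net input (local field) for each neuron
    return sigmoid(h)            # neuron outputs

# Example: 3 neurons receiving 4 inputs
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # connection weights
s = rng.normal(size=4)           # inputs from the previous layer
theta = np.zeros(3)              # thresholds
print(layer_output(W, s, theta))
```
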
Single-Layer Perceptron
[Figure: a single layer of perceptrons connecting the inputs directly to the outputs]

Variables
[Figure: inputs $x_1, \dots, x_j, \dots, x_n$ with weights $w_1, \dots, w_j, \dots, w_n$ feed a summation $\Sigma$; the net input $h$ passes through the activation $\sigma$ to produce the output $y$; $\theta$ is the threshold]

Single Layer Perceptron Equations

Binary threshold activation function:
$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$

Hence,
$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}$

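A minimal sketch of this decision rule, assuming NumPy; the weights, threshold, and input pattern are purely illustrative.

```python
import numpy as np

def perceptron_output(w, x, theta):
    """Binary threshold unit: y = 1 if w.x > theta, else 0."""
    return 1 if np.dot(w, x) > theta else 0

w = np.array([0.5, -0.3, 0.8])    # illustrative weights
theta = 0.2                        # illustrative threshold
x = np.array([1.0, 0.0, 1.0])     # one input pattern
print(perceptron_output(w, x, theta))   # -> 1, since w.x = 1.3 > 0.2
```
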
2D Weight Vector
[Figure: weight vector $\mathbf{w}$ in the $(w_1, w_2)$ plane; $v$ is the projection of $\mathbf{x}$ onto $\mathbf{w}$; the line $\mathbf{w} \cdot \mathbf{x} = \theta$ separates the + region from the – region]

$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\phi$

$\cos\phi = \dfrac{v}{\|\mathbf{x}\|}$, so $\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, v$

$\mathbf{w} \cdot \mathbf{x} > \theta \;\Leftrightarrow\; \|\mathbf{w}\| \, v > \theta \;\Leftrightarrow\; v > \theta / \|\mathbf{w}\|$

N-Dimensional Weight Vector
[Figure: the weight vector $\mathbf{w}$ is the normal vector of the separating hyperplane, with the + region on one side and the – region on the other]

Goal of Perceptron Learning
• Suppose we have training patterns $\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^P$ with corresponding desired outputs $y^1, y^2, \dots, y^P$
• where $\mathbf{x}^p \in \{0,1\}^n$, $y^p \in \{0,1\}$
• We want to find $\mathbf{w}$, $\theta$ such that $y^p = \Theta(\mathbf{w} \cdot \mathbf{x}^p - \theta)$ for $p = 1, \dots, P$

Treating Threshold as Weight
[Figure: the same neuron diagram, with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, net input $h$, activation, and output $y$]

$h = \left(\sum_{j=1}^{n} w_j x_j\right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$

Treating Threshold as Weight
[Figure: the threshold is absorbed as an extra input $x_0 = -1$ with weight $w_0 = \theta$]

$h = \left(\sum_{j=1}^{n} w_j x_j\right) - \theta$

Let $x_0 = -1$ and $w_0 = \theta$. Then

$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$

Augmented Vectors

$\tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad \tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}$

We want $y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p)$, for $p = 1, \dots, P$

Reformulation as Positive Examples

We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.

Want $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0$ for negative.

Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive, $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative.

Want $\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$, for $p = 1, \dots, P$

This is a hyperplane through the origin with all $\mathbf{z}^p$ on one side.

Adjustment of Weight Vector
[Figure: the vectors $\mathbf{z}^1, \dots, \mathbf{z}^{11}$ scattered around the origin; the weight vector must be adjusted so that all of them lie on its positive side]

Outline of"
Perceptron Learning Algorithm!
1.! initialize weight vector randomly!
2.! until all patterns classified correctly, do:!
a)! for p = 1, …, P do:!
1)! if zp classified correctly, do nothing!
2)! else adjust weight vector to be closer to correct
classification!
10/22/08!
18!
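A minimal sketch of this loop in Python/NumPy, assuming the augmented vectors (leading $-1$ component) and the $\mathbf{z}^p$ reformulation from the previous slides; the learning rate and the example data are illustrative.

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000, seed=0):
    """Perceptron learning on patterns X (P x n) with binary targets y (P,)."""
    rng = np.random.default_rng(seed)
    P, n = X.shape
    # Augment: x0 = -1 absorbs the threshold as weight w0 = theta
    X_aug = np.hstack([-np.ones((P, 1)), X])
    # Reformulate: z^p = x~^p for positive examples, -x~^p for negative
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)
    w = rng.normal(size=n + 1)               # 1. random initial weights
    for _ in range(max_epochs):              # 2. until all classified correctly
        errors = 0
        for zp in Z:                         #    a) for each pattern
            if np.dot(w, zp) <= 0:           #       misclassified?
                w += eta * zp                #       move w toward zp
                errors += 1
        if errors == 0:
            break
    return w

# Example: learn a linearly separable function (here, logical OR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
w = perceptron_learn(X, y)
print((np.hstack([-np.ones((4, 1)), X]) @ w > 0).astype(int))  # -> [0 1 1 1]
```
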
Weight Adjustment
[Figure: a misclassified $\mathbf{z}^p$ pulls the weight vector toward it; adding $\eta \mathbf{z}^p$ to $\tilde{\mathbf{w}}$ gives $\tilde{\mathbf{w}}'$, and a further step gives $\tilde{\mathbf{w}}''$]

Improvement in Performance

If $\tilde{\mathbf{w}} \cdot \mathbf{z}^p < 0$,

$\tilde{\mathbf{w}}' \cdot \mathbf{z}^p = (\tilde{\mathbf{w}} + \eta \mathbf{z}^p) \cdot \mathbf{z}^p$
$= \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \mathbf{z}^p \cdot \mathbf{z}^p$
$= \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \|\mathbf{z}^p\|^2$
$> \tilde{\mathbf{w}} \cdot \mathbf{z}^p$

Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate)
• Note: only applies if positive & negative examples are linearly separable

NetLogo Simulation of Perceptron Learning
Run Perceptron.nlogo

Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates (see the sketch below)
• Therefore an MLP can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert criticism of perceptrons
• No one succeeded in developing an MLP learning algorithm

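A minimal sketch of perceptrons acting as logic gates, with hand-picked (illustrative, not unique) weights and thresholds; combining two layers of such gates handles XOR, which a single perceptron cannot.

```python
import numpy as np

def threshold_unit(w, x, theta):
    """Binary threshold neuron: 1 if w.x > theta, else 0."""
    return int(np.dot(w, x) > theta)

# Hand-picked weights/thresholds for common gates
AND = lambda a, b: threshold_unit([1, 1], [a, b], 1.5)
OR  = lambda a, b: threshold_unit([1, 1], [a, b], 0.5)
NOT = lambda a:    threshold_unit([-1],   [a],   -0.5)

# XOR is not linearly separable, but a two-layer combination handles it
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR(a, b))
```
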
Credit Assignment Problem
[Figure: feedforward network with input layer, hidden layers, and output layer; the desired output is compared against the output layer]
How do we adjust the weights of the hidden layers?

NetLogo Demonstration of"
Back-Propagation Learning!
Run Artificial Neural Net.nlogo!
10/22/08!
25!
Adaptive System
[Figure: a system S with control parameters $P_1, \dots, P_k, \dots, P_m$; an evaluation function F (fitness, figure of merit) measures its performance, and a control algorithm C adjusts the parameters]

Gradient

$\dfrac{\partial F}{\partial P_k}$ measures how $F$ is altered by variation of $P_k$

$\nabla F = \begin{pmatrix} \dfrac{\partial F}{\partial P_1} \\ \vdots \\ \dfrac{\partial F}{\partial P_k} \\ \vdots \\ \dfrac{\partial F}{\partial P_m} \end{pmatrix}$

$\nabla F$ points in the direction of maximum increase in $F$

Gradient Ascent"
on Fitness Surface!
(F!
+!
gradient ascent!
10/22/08!
–
28!
Gradient Ascent"
by Discrete Steps!
(F!
+!
–
10/22/08!
29!
Gradient Ascent is Local"
But Not Shortest!
+!
–
10/22/08!
30!
Gradient Ascent Process

$\dot{\mathbf{P}} = \eta \nabla F(\mathbf{P})$

Change in fitness:

$\dot{F} = \dfrac{\mathrm{d}F}{\mathrm{d}t} = \sum_{k=1}^{m} \dfrac{\partial F}{\partial P_k} \dfrac{\mathrm{d}P_k}{\mathrm{d}t} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}$

$\dot{F} = \nabla F \cdot \eta \nabla F = \eta \|\nabla F\|^2 \ge 0$

Therefore gradient ascent increases fitness (until it reaches a 0 gradient).

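A minimal numerical sketch of this process, assuming a simple quadratic fitness $F(\mathbf{P}) = -\|\mathbf{P} - \mathbf{P}^*\|^2$ whose gradient is known in closed form; the learning rate and $\mathbf{P}^*$ are illustrative.

```python
import numpy as np

# Illustrative fitness: a quadratic bowl with its peak at P_star
P_star = np.array([2.0, -1.0])
F = lambda P: -np.sum((P - P_star) ** 2)
grad_F = lambda P: -2.0 * (P - P_star)      # closed-form gradient of F

eta = 0.1                                    # learning rate
P = np.zeros(2)                              # initial parameters
for step in range(50):
    P = P + eta * grad_F(P)                  # discrete gradient-ascent step
print(P, F(P))                               # P approaches P_star, F approaches 0
```
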
General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:

$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi$

where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.

Hence we need $\cos\varphi > 0$, or $\varphi < 90°$.

General Ascent"
on Fitness Surface!
+!
(F!
–
10/22/08!
33!
Fitness as Minimum Error

Suppose for $Q$ different inputs we have target outputs $\mathbf{t}^1, \dots, \mathbf{t}^Q$.

Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \dots, \mathbf{y}^Q$.

Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target & actual outputs.

Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be the error on the $q$th sample.

Let $F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]$

Gradient of Fitness

$\nabla F = \nabla\left(-\sum_q E^q\right) = -\sum_q \nabla E^q$

$\dfrac{\partial E^q}{\partial P_k} = \dfrac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \dfrac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \dfrac{\partial y_j^q}{\partial P_k}$

$= \dfrac{\mathrm{d} D(\mathbf{t}^q, \mathbf{y}^q)}{\mathrm{d}\mathbf{y}^q} \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k} = \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}$

Jacobian Matrix

Define the Jacobian matrix $J^q = \begin{pmatrix} \dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m} \end{pmatrix}$

Note $J^q \in \mathbb{R}^{n \times m}$ and $\nabla D(\mathbf{t}^q, \mathbf{y}^q) \in \mathbb{R}^{n \times 1}$.

Since $(\nabla E^q)_k = \dfrac{\partial E^q}{\partial P_k} = \sum_j \dfrac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \dfrac{\partial y_j^q}{\partial P_k}$,

$\nabla E^q = (J^q)^{\mathsf T} \nabla D(\mathbf{t}^q, \mathbf{y}^q)$

Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$

$\dfrac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j} = \dfrac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \sum_i \dfrac{\partial (t_i - y_i)^2}{\partial y_j} = \dfrac{\mathrm{d}(t_j - y_j)^2}{\mathrm{d}y_j} = -2(t_j - y_j)$

$\therefore \dfrac{\mathrm{d} D(\mathbf{t}, \mathbf{y})}{\mathrm{d}\mathbf{y}} = 2(\mathbf{y} - \mathbf{t})$

Gradient of Error on qth Input

$\dfrac{\partial E^q}{\partial P_k} = \dfrac{\mathrm{d} D(\mathbf{t}^q, \mathbf{y}^q)}{\mathrm{d}\mathbf{y}^q} \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}$

$= 2(\mathbf{y}^q - \mathbf{t}^q) \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}$

$= 2 \sum_j (y_j^q - t_j^q) \dfrac{\partial y_j^q}{\partial P_k}$

$\nabla E^q = 2 (J^q)^{\mathsf T} (\mathbf{y}^q - \mathbf{t}^q)$

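A minimal numerical check of this identity, assuming a toy linear model $\mathbf{y} = W\mathbf{P}$ (so the Jacobian is just $W$) and a randomly chosen target; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4                        # n outputs, m parameters
W = rng.normal(size=(n, m))        # toy model: y(P) = W P, so Jacobian J = W
t = rng.normal(size=n)             # target output
P = rng.normal(size=m)             # current parameters

y = W @ P
E = lambda P: np.sum((t - W @ P) ** 2)          # squared error on this input
grad_formula = 2.0 * W.T @ (y - t)              # 2 (J^T)(y - t)

# Finite-difference check of each component of the gradient
eps = 1e-6
grad_numeric = np.array([(E(P + eps * np.eye(m)[k]) - E(P - eps * np.eye(m)[k])) / (2 * eps)
                         for k in range(m)])
print(np.allclose(grad_formula, grad_numeric, atol=1e-5))   # -> True
```
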
Recap

$\dot{\mathbf{P}} = \eta \sum_q (J^q)^{\mathsf T} (\mathbf{t}^q - \mathbf{y}^q)$

To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\dfrac{\partial y_j^q}{\partial P_k}$, which say how the $j$th output varies with the $k$th parameter (given the $q$th input).

The Jacobian depends on the specific form of the system, in this case a feedforward neural network.

Multilayer Notation
[Figure: the input $\mathbf{x}^q$ passes through weight matrices $W^1, W^2, \dots, W^{L-2}, W^{L-1}$, producing layer outputs $\mathbf{s}^1, \mathbf{s}^2, \dots, \mathbf{s}^{L-1}, \mathbf{s}^L = \mathbf{y}^q$]

Notation
• $L$ layers of neurons labeled $1, \dots, L$
• $N_l$ neurons in layer $l$
• $\mathbf{s}^l$ = vector of outputs from neurons in layer $l$
• input layer $\mathbf{s}^1 = \mathbf{x}^q$ (the input pattern)
• output layer $\mathbf{s}^L = \mathbf{y}^q$ (the actual output)
• $W^l$ = weights between layers $l$ and $l+1$
• Problem: find how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ ($l = 1, \dots, L-1$)

Typical Neuron
[Figure: neuron $i$ in layer $l$ receives the inputs $s_1^{l-1}, \dots, s_j^{l-1}, \dots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \dots, W_{ij}^{l-1}, \dots, W_{iN}^{l-1}$; their weighted sum is the net input $h_i^l$, and the activation $\sigma$ gives the output $s_i^l$]

Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \dots, 1$).

Delta Values

It is convenient to break the derivatives up by the chain rule:

$\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \dfrac{\partial E^q}{\partial h_i^l} \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$

So $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Output-Layer Neuron
[Figure: output neuron $i$ receives $s_1^{L-1}, \dots, s_j^{L-1}, \dots, s_N^{L-1}$ through weights $W_{i1}^{L-1}, \dots, W_{ij}^{L-1}, \dots, W_{iN}^{L-1}$; its output $s_i^L = y_i^q$ is compared with the target $t_i^q$ to give the error $E^q$]

Output-Layer Derivatives (1)

$\delta_i^L = \dfrac{\partial E^q}{\partial h_i^L} = \dfrac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2$

$= \dfrac{\mathrm{d}(s_i^L - t_i^q)^2}{\mathrm{d}h_i^L} = 2(s_i^L - t_i^q) \dfrac{\mathrm{d}s_i^L}{\mathrm{d}h_i^L}$

$= 2(s_i^L - t_i^q) \, \sigma'(h_i^L)$

Output-Layer Derivatives (2)

$\dfrac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \dfrac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$

$\therefore \dfrac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$, where $\delta_i^L = 2(s_i^L - t_i^q) \, \sigma'(h_i^L)$

Hidden-Layer Neuron
[Figure: hidden neuron $i$ in layer $l$ receives $s_1^{l-1}, \dots, s_j^{l-1}, \dots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \dots, W_{ij}^{l-1}, \dots, W_{iN}^{l-1}$; its output $s_i^l$ feeds the neurons of layer $l+1$ through weights $W_{1i}^l, \dots, W_{ki}^l, \dots, W_{Ni}^l$, whose outputs $s_1^{l+1}, \dots, s_k^{l+1}, \dots, s_N^{l+1}$ determine the error $E^q$]

Hidden-Layer Derivatives (1)

Recall $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

$\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l} = \sum_k \dfrac{\partial E^q}{\partial h_k^{l+1}} \dfrac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \dfrac{\partial h_k^{l+1}}{\partial h_i^l}$

$\dfrac{\partial h_k^{l+1}}{\partial h_i^l} = \dfrac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l \dfrac{\partial s_i^l}{\partial h_i^l} = W_{ki}^l \dfrac{\mathrm{d}\sigma(h_i^l)}{\mathrm{d}h_i^l} = W_{ki}^l \, \sigma'(h_i^l)$

$\therefore \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Hidden-Layer Derivatives (2)

$\dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \dfrac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \dfrac{\mathrm{d}\, W_{ij}^{l-1} s_j^{l-1}}{\mathrm{d}W_{ij}^{l-1}} = s_j^{l-1}$

$\therefore \dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$, where $\delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid)

$D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2} D_h (1 + e^{-\alpha h})$

$= -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h}) = \alpha \dfrac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2}$

$= \alpha \dfrac{1}{1 + e^{-\alpha h}} \dfrac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \dfrac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \dfrac{1}{1 + e^{-\alpha h}} \right)$

$= \alpha s (1 - s)$

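A minimal numerical check of the identity $\sigma'(h) = \alpha\, s(1-s)$, with an illustrative choice of $\alpha$ and a finite-difference comparison.

```python
import numpy as np

alpha = 1.5                                         # illustrative gain
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))  # logistic sigmoid

h = np.linspace(-4, 4, 9)
s = sigma(h)
deriv_formula = alpha * s * (1.0 - s)               # alpha * s * (1 - s)

eps = 1e-6
deriv_numeric = (sigma(h + eps) - sigma(h - eps)) / (2 * eps)
print(np.allclose(deriv_formula, deriv_numeric, atol=1e-8))   # -> True
```
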
Summary of Back-Propagation Algorithm

Output layer: $\delta_i^L = 2\alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q)$

$\dfrac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$

Hidden layers: $\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

$\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$

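A minimal sketch of these update rules for one training pair, assuming a fully connected network with logistic units in every layer and no separate bias terms; $\alpha$, $\eta$, and the layer sizes are illustrative, and the sign convention matches the later computation slides (weights move opposite the error gradient).

```python
import numpy as np

alpha, eta = 1.0, 0.5
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))

def backprop_step(Ws, x, t):
    """One back-propagation update for a single training pair (x, t).
    Ws[l] maps layer l to layer l+1; logistic units in every layer."""
    # Forward pass: store each layer's output s^l
    s = [x]
    for W in Ws:
        s.append(sigma(W @ s[-1]))
    # Output layer: delta^L = 2 alpha s^L (1 - s^L) (s^L - t)
    delta = 2 * alpha * s[-1] * (1 - s[-1]) * (s[-1] - t)
    # Work backward through the layers
    for l in reversed(range(len(Ws))):
        grad = np.outer(delta, s[l])            # dE/dW^l = delta^{l+1} (s^l)^T
        if l > 0:                               # hidden delta for the next step back
            delta = alpha * s[l] * (1 - s[l]) * (Ws[l].T @ delta)
        Ws[l] -= eta * grad                     # move weights down the error gradient
    return Ws

# Tiny illustrative 2-3-1 network trained on one pattern
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
x, t = np.array([1.0, 0.0]), np.array([1.0])
for _ in range(200):
    Ws = backprop_step(Ws, x, t)
print(sigma(Ws[1] @ sigma(Ws[0] @ x)))          # output approaches the target 1.0
```
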
Output-Layer Computation

$\Delta W_{ij}^{L-1} = \eta \, \delta_i^L s_j^{L-1}$

$\delta_i^L = 2\alpha s_i^L (1 - s_i^L)(t_i^q - s_i^L)$

[Figure: signal-flow diagram of the output-layer computation: the output $s_i^L = y_i^q$ is compared with the target $t_i^q$, the difference is scaled by $2\alpha s_i^L (1 - s_i^L)$ to give $\delta_i^L$, and the weight change $\Delta W_{ij}^{L-1}$ is the product $\eta \, \delta_i^L s_j^{L-1}$]

Hidden-Layer Computation

$\Delta W_{ij}^{l-1} = \eta \, \delta_i^l s_j^{l-1}$

$\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

[Figure: signal-flow diagram of the hidden-layer computation: the deltas $\delta_1^{l+1}, \dots, \delta_k^{l+1}, \dots, \delta_N^{l+1}$ of the next layer are propagated back through the weights $W_{ki}^l$, summed, and scaled by $\alpha s_i^l (1 - s_i^l)$ to give $\delta_i^l$; the weight change $\Delta W_{ij}^{l-1}$ is the product $\eta \, \delta_i^l s_j^{l-1}$]

Training Procedures
• Batch Learning
  – on each epoch (pass through all the training pairs),
  – weight changes for all patterns are accumulated
  – weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online Learning
  – weights are updated after back-propagation of each training pair
  – the order is usually randomized for each epoch
  – approximation of the gradient
• In practice it doesn't make much difference (both procedures are sketched below)

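A schematic sketch of the two procedures, assuming a `gradient(W, x, t)` function that returns $\partial E^q/\partial W$ for one training pair (as derived above); the function and the example data are placeholders, not part of the original slides.

```python
import numpy as np

def train_batch(W, pairs, gradient, eta=0.1, epochs=100):
    """Batch learning: accumulate gradients over the epoch, update once."""
    for _ in range(epochs):
        total_grad = sum(gradient(W, x, t) for x, t in pairs)
        W = W - eta * total_grad            # one accurate gradient step per epoch
    return W

def train_online(W, pairs, gradient, eta=0.1, epochs=100, seed=0):
    """Online learning: update after each pair, in a freshly shuffled order."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(pairs))
        for idx in order:
            x, t = pairs[idx]
            W = W - eta * gradient(W, x, t) # approximate, per-pattern gradient step
    return W

# Tiny illustrative use: linear model y = W x with squared error
pairs = [(np.array([1.0, 0.0]), np.array([2.0])),
         (np.array([0.0, 1.0]), np.array([-1.0]))]
grad = lambda W, x, t: np.outer(2 * (W @ x - t), x)   # dE/dW for this model
W0 = np.zeros((1, 2))
print(train_batch(W0, pairs, grad), train_online(W0, pairs, grad))
```
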
Summation of Error Surfaces
[Figure: the total error surface E is the sum of the per-pattern error surfaces E1, E2, …]

Gradient Computation"
in Batch Learning!
E!
E1!
E2!
10/22/08!
57!
Gradient Computation"
in Online Learning!
E!
E1!
E2!
10/22/08!
58!
Testing Generalization
[Figure: the available data, drawn from the domain, is split into training data and test data]

Problem of Rote Learning
[Figure: error vs. epoch; the error on the training data keeps decreasing, but the error on the test data eventually rises again; stop training where the test error is lowest]

Improving Generalization
[Figure: the available data, drawn from the domain, is split into training data, validation data, and test data]

A Few Random Tips
• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize (see the sketch below)
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

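As a small illustration of the standardization tip, a sketch assuming the features are stored in a NumPy array with one row per pattern; the scaling statistics come from the training data and are reused for the test data.

```python
import numpy as np

def standardize(train, test):
    """Zero-mean, unit-variance scaling; statistics come from training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-12        # guard against constant features
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # illustrative raw features
test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))
train_s, test_s = standardize(train, test)
print(train_s.mean(axis=0).round(3), train_s.std(axis=0).round(3))  # ~0 and ~1
```
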
Beyond Back-Propagation
• Adaptive Learning Rate
• Adaptive Architecture
  – Add/delete hidden neurons
  – Add/delete hidden layers
• Radial Basis Function Networks
• Etc., etc., etc. …

The Golden Rule of Neural Nets
Neural networks are the second-best way to do everything!