Part 4A: Neural Network Learning
IV. Neural Network Learning
10/25/09
Supervised Learning

•  Produce desired outputs for training inputs
•  Generalize reasonably & appropriately to other inputs
•  Good example: pattern recognition
•  Feedforward multilayer networks

Feedforward Network

[Figure: a feedforward network; an input layer feeds one or more hidden layers, which feed an output layer.]

Typical Artificial Neuron

[Figure: a neuron receives its inputs through weighted connections (the connection weights).]

[Figure: the neuron forms a linear combination of its inputs, subtracts a threshold to give the net input (local field), and passes the result through an activation function to produce its output.]
Equations

Net input:
$h_i = \left( \sum_{j=1}^n w_{ij} s_j \right) - \theta$
$\mathbf{h} = \mathbf{W}\mathbf{s} - \boldsymbol{\theta}$

Neuron output:
$s'_i = \sigma(h_i)$
$\mathbf{s}' = \sigma(\mathbf{h})$

Single-Layer Perceptron

[Figure: a single layer of threshold units, each connected to all of the inputs.]
Variables

[Figure: inputs $x_1, \ldots, x_j, \ldots, x_n$ enter through weights $w_1, \ldots, w_j, \ldots, w_n$; the summation $\Sigma$ produces $h$, which the threshold function $\Theta$ (with threshold $\theta$) converts to the output $y$.]

Single-Layer Perceptron Equations

Binary threshold activation function:
$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$

Hence,
$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}$
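To make these equations concrete, here is a minimal Python/NumPy sketch of a single threshold unit (the function and variable names are illustrative, not part of the slides):

```python
import numpy as np

def perceptron_output(w, x, theta):
    """y = Theta(w . x - theta) with a binary threshold activation."""
    h = np.dot(w, x) - theta       # net input (local field)
    return 1 if h > 0 else 0       # Theta(h): 1 if h > 0, else 0

# Example: with w = [1, 1] and theta = 1.5 the unit computes logical AND
w = np.array([1.0, 1.0])
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_output(w, np.array(x), theta=1.5))
```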
2D Weight Vector

$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\varphi$

Let $v = \|\mathbf{x}\| \cos\varphi$, the length of the projection of $\mathbf{x}$ onto $\mathbf{w}$. Then:

$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, v$

$\mathbf{w} \cdot \mathbf{x} > \theta \;\Leftrightarrow\; \|\mathbf{w}\| \, v > \theta \;\Leftrightarrow\; v > \theta / \|\mathbf{w}\|$

[Figure: the weight vector $\mathbf{w}$ in the $(w_1, w_2)$ plane, with an input $\mathbf{x}$ at angle $\varphi$; inputs whose projection $v$ exceeds $\theta / \|\mathbf{w}\|$ fall on the + side of the decision line, the rest on the – side.]

N-Dimensional Weight Vector

[Figure: in n dimensions, $\mathbf{w}$ is the normal vector of the separating hyperplane, with the + region on the side $\mathbf{w}$ points toward and the – region on the other side.]
Goal of Perceptron Learning

•  Suppose we have training patterns $x^1, x^2, \ldots, x^P$ with corresponding desired outputs $y^1, y^2, \ldots, y^P$,
•  where $x^p \in \{0,1\}^n$, $y^p \in \{0,1\}$.
•  We want to find $\mathbf{w}, \theta$ such that $y^p = \Theta(\mathbf{w} \cdot \mathbf{x}^p - \theta)$ for $p = 1, \ldots, P$.

Treating Threshold as Weight

$h = \left( \sum_{j=1}^n w_j x_j \right) - \theta = -\theta + \sum_{j=1}^n w_j x_j$

Let $x_0 = -1$ and $w_0 = \theta$. Then:

$h = w_0 x_0 + \sum_{j=1}^n w_j x_j = \sum_{j=0}^n w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$

[Figure: the same neuron as before, but with the threshold $\theta$ absorbed as the weight $w_0 = \theta$ on an extra constant input $x_0 = -1$.]

Augmented Vectors

$\tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad \tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}$

We want $y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p)$, for $p = 1, \ldots, P$.
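A short NumPy sketch of the augmentation step (hypothetical names; `X` holds the patterns $x^p$ as rows):

```python
import numpy as np

def augment(X, w, theta):
    """Fold the threshold into the weights: prepend x_0 = -1 to each
    pattern and w_0 = theta to the weight vector, so that
    Theta(w . x - theta) == Theta(w_aug . x_aug)."""
    X_aug = np.hstack([-np.ones((X.shape[0], 1)), X])  # x~ = (-1, x_1..x_n)
    w_aug = np.concatenate([[theta], w])               # w~ = (theta, w_1..w_n)
    return X_aug, w_aug
```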
Reformulation as Positive Examples

We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.

Want $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0$ for negative.

Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive, $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative.

Want $\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$, for $p = 1, \ldots, P$.

That is, we seek a hyperplane through the origin with all the $\mathbf{z}^p$ on one side.

Adjustment of Weight Vector

[Figure: sample vectors $z^1, \ldots, z^{11}$ scattered around the origin; learning rotates the weight vector until all of them lie on its positive side.]
Outline of Perceptron Learning Algorithm

1.  initialize the weight vector randomly
2.  until all patterns are classified correctly, do:
    a)  for $p = 1, \ldots, P$ do:
        1)  if $\mathbf{z}^p$ is classified correctly, do nothing
        2)  else adjust the weight vector to be closer to correct classification (a Python sketch of the full algorithm follows the Perceptron Learning Theorem below)

Weight Adjustment

[Figure: a misclassified $\mathbf{z}^p$ pulls the weight vector toward it in steps: $\tilde{\mathbf{w}}' = \tilde{\mathbf{w}} + \eta \mathbf{z}^p$, then $\tilde{\mathbf{w}}'' = \tilde{\mathbf{w}}' + \eta \mathbf{z}^p$, and so on.]

Improvement in Performance

If $\tilde{\mathbf{w}} \cdot \mathbf{z}^p < 0$,

$\tilde{\mathbf{w}}' \cdot \mathbf{z}^p = (\tilde{\mathbf{w}} + \eta \mathbf{z}^p) \cdot \mathbf{z}^p = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \mathbf{z}^p \cdot \mathbf{z}^p = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \|\mathbf{z}^p\|^2 > \tilde{\mathbf{w}} \cdot \mathbf{z}^p$

Perceptron Learning Theorem

•  If there is a set of weights that will solve the problem,
•  then the PLA will eventually find it
•  (for a sufficiently small learning rate).
•  Note: this applies only if the positive & negative examples are linearly separable.
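As promised above, a minimal NumPy sketch of the whole algorithm, using the z-vector reformulation and the additive adjustment $\tilde{\mathbf{w}}' = \tilde{\mathbf{w}} + \eta \mathbf{z}^p$; the misclassification test follows the slides' convention ($\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$ counts as correct), and all names are illustrative:

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000):
    """Perceptron learning. X: P x n array of patterns;
    y: length-P array of {0,1} desired outputs."""
    P, n = X.shape
    X_aug = np.hstack([-np.ones((P, 1)), X])       # x~ = (-1, x_1, ..., x_n)
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)   # z = x~ (pos) or -x~ (neg)
    w = np.random.randn(n + 1)                     # 1. initialize randomly
    for _ in range(max_epochs):                    # 2. until all correct
        errors = 0
        for z in Z:                                #    a) for p = 1, ..., P
            if np.dot(w, z) < 0:                   #       misclassified?
                w += eta * z                       #       w' = w + eta z^p
                errors += 1
        if errors == 0:                            # all patterns classified
            break
    return w

# Example: AND is linearly separable, so the PLA converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w = perceptron_learn(X, np.array([0, 0, 0, 1]))
```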
NetLogo Simulation of Perceptron Learning

Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons

•  Perceptrons can function as logic gates
•  Therefore an MLP can form intersections, unions, & differences of linearly separable regions
•  Classes can be arbitrary hyperpolyhedra
•  Minsky & Papert's criticism of perceptrons
•  At that time, no one had succeeded in developing an MLP learning algorithm
Hyperpolyhedral Classes

[Figure: a class region built as a hyperpolyhedron from intersections, unions, and differences of half-spaces.]

Credit Assignment Problem

[Figure: a multilayer network (input layer, hidden layers, output layer) whose output is compared with a desired output.]

How do we adjust the weights of the hidden layers?
NetLogo Demonstration of Back-Propagation Learning

Run Artificial Neural Net.nlogo

Adaptive System

[Figure: a system S with control parameters $P_1, \ldots, P_k, \ldots, P_m$, set by a control algorithm C, and an evaluation function F (fitness, figure of merit) measuring the system's performance.]
Gradient

$\dfrac{\partial F}{\partial P_k}$ measures how F is altered by variation of $P_k$

$\nabla F = \begin{pmatrix} \partial F / \partial P_1 \\ \vdots \\ \partial F / \partial P_k \\ \vdots \\ \partial F / \partial P_m \end{pmatrix}$

$\nabla F$ points in the direction of maximum local increase in F.

Gradient Ascent on Fitness Surface

[Figure: a fitness surface with a maximum (+) and a minimum (–); gradient ascent follows $\nabla F$ uphill.]
Gradient Ascent by Discrete Steps

[Figure: a trajectory of finite steps, each in the direction of $\nabla F$, climbing toward the maximum (+).]

Gradient Ascent is Local But Not Shortest

[Figure: the gradient-ascent path follows the locally steepest direction and therefore curves; it is generally not the shortest path to the maximum.]
Gradient Ascent Process

$\dot{\mathbf{P}} = \eta \, \nabla F(\mathbf{P})$

Change in fitness:

$\dot{F} = \dfrac{dF}{dt} = \sum_{k=1}^m \dfrac{\partial F}{\partial P_k} \dfrac{dP_k}{dt} = \sum_{k=1}^m (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}$

$\dot{F} = \nabla F \cdot \eta \nabla F = \eta \, \|\nabla F\|^2 \ge 0$

Therefore gradient ascent increases fitness (until it reaches a zero gradient).

General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:

$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi$

where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.

Hence we need $\cos\varphi > 0$, i.e. $\varphi < 90°$.
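In practice the continuous process above is approximated by the discrete steps of the earlier "Gradient Ascent by Discrete Steps" slide, $\Delta \mathbf{P} = \eta \nabla F(\mathbf{P})$. A small Python sketch with a toy fitness function (chosen only for illustration):

```python
import numpy as np

def gradient_ascent(grad_F, P, eta=0.1, steps=100):
    """Discrete-step gradient ascent: P <- P + eta * grad F(P)."""
    for _ in range(steps):
        P = P + eta * grad_F(P)
    return P

# Toy fitness F(P) = -(P_1^2 + P_2^2), so grad F = -2P; the maximum is at 0
grad_F = lambda P: -2.0 * P
print(gradient_ascent(grad_F, np.array([3.0, -2.0])))  # -> approx. [0, 0]
```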
General Ascent on Fitness Surface

[Figure: any trajectory whose direction stays within 90° of $\nabla F$ still climbs the fitness surface toward the maximum (+).]

Fitness as Minimum Error

Suppose for Q different inputs we have target outputs $\mathbf{t}^1, \ldots, \mathbf{t}^Q$.

Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \ldots, \mathbf{y}^Q$.

Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target & actual outputs.

Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be the error on the qth sample.

Let $F(\mathbf{P}) = -\sum_{q=1}^Q E^q(\mathbf{P}) = -\sum_{q=1}^Q D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]$
Gradient of Fitness

$\nabla F = \nabla \left( -\sum_q E^q \right) = -\sum_q \nabla E^q$

$\dfrac{\partial E^q}{\partial P_k} = \dfrac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \dfrac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \dfrac{\partial y_j^q}{\partial P_k} = \dfrac{d\,D(\mathbf{t}^q, \mathbf{y}^q)}{d\,\mathbf{y}^q} \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k} = \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k}$

Jacobian Matrix

Define the Jacobian matrix $J^q = \begin{pmatrix} \dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m} \end{pmatrix}$

Note that $J^q \in \Re^{n \times m}$ and $\nabla D(\mathbf{t}^q, \mathbf{y}^q) \in \Re^{n \times 1}$.

Since $(\nabla E^q)_k = \dfrac{\partial E^q}{\partial P_k} = \sum_j \dfrac{\partial y_j^q}{\partial P_k} \dfrac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q}$,

$\therefore \; \nabla E^q = (J^q)^{\mathsf{T}} \, \nabla D(\mathbf{t}^q, \mathbf{y}^q)$

Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$

$\dfrac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j} = \dfrac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \sum_i \dfrac{\partial (t_i - y_i)^2}{\partial y_j} = \dfrac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j)$

$\therefore \; \dfrac{d\,D(\mathbf{t}, \mathbf{y})}{d\,\mathbf{y}} = 2 (\mathbf{y} - \mathbf{t})$

Gradient of Error on qth Input

$\dfrac{\partial E^q}{\partial P_k} = \dfrac{d\,D(\mathbf{t}^q, \mathbf{y}^q)}{d\,\mathbf{y}^q} \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k} = 2 (\mathbf{y}^q - \mathbf{t}^q) \cdot \dfrac{\partial \mathbf{y}^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \dfrac{\partial y_j^q}{\partial P_k}$

$\therefore \; \nabla E^q = 2 (J^q)^{\mathsf{T}} (\mathbf{y}^q - \mathbf{t}^q)$
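As a sketch of how these formulas fit together, the following hypothetical NumPy fragment estimates the Jacobian $J^q$ by finite differences and forms $\nabla E^q = 2 (J^q)^{\mathsf{T}} (\mathbf{y}^q - \mathbf{t}^q)$; back-propagation, developed below, computes the same quantity analytically and far more efficiently:

```python
import numpy as np

def error_gradient(f, P, x, t, eps=1e-6):
    """Gradient of E = ||t - y||^2 w.r.t. the parameters P, where
    y = f(P, x); the Jacobian is estimated by finite differences."""
    y = f(P, x)
    J = np.zeros((y.size, P.size))            # J[j, k] = dy_j / dP_k
    for k in range(P.size):
        dP = np.zeros_like(P)
        dP[k] = eps
        J[:, k] = (f(P + dP, x) - y) / eps
    return 2.0 * J.T @ (y - t)                # grad E = 2 J^T (y - t)
```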
Recap

$\dot{\mathbf{P}} = \eta \sum_q (J^q)^{\mathsf{T}} (\mathbf{t}^q - \mathbf{y}^q)$

To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which say how the jth output varies with the kth parameter (given the qth input).

The Jacobian depends on the specific form of the system, in this case a feedforward neural network.

Multilayer Notation

[Figure: the input $\mathbf{x}^q = \mathbf{s}^1$ passes through weight matrices $W^1, W^2, \ldots, W^{L-2}, W^{L-1}$, producing layer outputs $\mathbf{s}^2, \ldots, \mathbf{s}^{L-1}$ and finally $\mathbf{s}^L = \mathbf{y}^q$.]
Notation

•  L layers of neurons, labeled $1, \ldots, L$
•  $N_l$ neurons in layer l
•  $\mathbf{s}^l$ = vector of outputs from the neurons in layer l
•  input layer: $\mathbf{s}^1 = \mathbf{x}^q$ (the input pattern)
•  output layer: $\mathbf{s}^L = \mathbf{y}^q$ (the actual output)
•  $W^l$ = weights between layers l and l+1
•  Problem: find how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ ($l = 1, \ldots, L-1$)

Typical Neuron

[Figure: neuron i in layer l receives $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$; the summation $\Sigma$ gives $h_i^l$ and the activation function $\sigma$ gives $s_i^l$.]
Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \ldots, 1$).

Delta Values

It is convenient to break the derivatives apart by the chain rule:

$\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \dfrac{\partial E^q}{\partial h_i^l} \, \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$

So $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Output-Layer Neuron

[Figure: output neuron i receives $s_1^{L-1}, \ldots, s_j^{L-1}, \ldots, s_N^{L-1}$ through weights $W_{i1}^{L-1}, \ldots, W_{ij}^{L-1}, \ldots, W_{iN}^{L-1}$; the summation $\Sigma$ gives $h_i^L$, the activation $\sigma$ gives $s_i^L = y_i^q$, which is compared with the target $t_i^q$ to give the error $E^q$.]

Output-Layer Derivatives (1)

$\delta_i^L = \dfrac{\partial E^q}{\partial h_i^L} = \dfrac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \dfrac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q) \, \dfrac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$
Output-Layer Derivatives (2)

$\dfrac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \dfrac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$

$\therefore \; \dfrac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L \, s_j^{L-1}$, where $\delta_i^L = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$

Hidden-Layer Neuron

[Figure: hidden neuron i in layer l receives $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \ldots, W_{iN}^{l-1}$; its output $s_i^l$ fans out through weights $W_{1i}^l, \ldots, W_{ki}^l, \ldots, W_{Ni}^l$ to the neurons of layer l+1, whose outputs $s_1^{l+1}, \ldots, s_k^{l+1}, \ldots, s_N^{l+1}$ feed into the error $E^q$.]
Hidden-Layer Derivatives (1)

Recall $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

$\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l} = \sum_k \dfrac{\partial E^q}{\partial h_k^{l+1}} \dfrac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \, \dfrac{\partial h_k^{l+1}}{\partial h_i^l}$

$\dfrac{\partial h_k^{l+1}}{\partial h_i^l} = \dfrac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l \, \dfrac{\partial s_i^l}{\partial h_i^l} = W_{ki}^l \, \dfrac{d \sigma(h_i^l)}{d h_i^l} = W_{ki}^l \, \sigma'(h_i^l)$

$\therefore \; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Hidden-Layer Derivatives (2)

$\dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \dfrac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \dfrac{d\, W_{ij}^{l-1} s_j^{l-1}}{d\, W_{ij}^{l-1}} = s_j^{l-1}$

$\therefore \; \dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, s_j^{l-1}$, where $\delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid)

$D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2} D_h (1 + e^{-\alpha h})$
$= -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h}) = \alpha \dfrac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2}$
$= \alpha \dfrac{1}{1 + e^{-\alpha h}} \dfrac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \dfrac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \dfrac{1}{1 + e^{-\alpha h}} \right)$
$= \alpha s (1 - s)$

Summary of Back-Propagation Algorithm

Output layer:
$\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q)$
$\dfrac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$

Hidden layers:
$\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
$\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$
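A compact NumPy sketch of the algorithm as summarized above, for a fully connected network of logistic neurons. It uses the $(t - s)$ form of the deltas from the computation slides that follow, so that the update $\Delta W = \eta \delta s$ descends the error surface; all names are illustrative:

```python
import numpy as np

def sigmoid(h, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * h))

def backprop_update(Ws, x, t, eta=0.5, alpha=1.0):
    """One back-propagation step. Ws: list of weight matrices
    W^1 ... W^{L-1}; x: input pattern; t: target output."""
    # Forward pass: s^1 = x, s^{l+1} = sigma(W^l s^l)
    ss = [x]
    for W in Ws:
        ss.append(sigmoid(W @ ss[-1], alpha))
    # Output layer: delta^L = 2 alpha s^L (1 - s^L) (t - s^L)
    delta = 2 * alpha * ss[-1] * (1 - ss[-1]) * (t - ss[-1])
    # Backward pass over W^{L-1}, ..., W^1
    for l in reversed(range(len(Ws))):
        dW = eta * np.outer(delta, ss[l])      # dW_ij = eta delta_i s_j
        if l > 0:
            # delta_i^l = alpha s_i (1 - s_i) sum_k delta_k^{l+1} W_ki^l
            # (computed with the weights *before* this layer's update)
            delta = alpha * ss[l] * (1 - ss[l]) * (Ws[l].T @ delta)
        Ws[l] = Ws[l] + dW
    return Ws
```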
Output-Layer Computation

$\Delta W_{ij}^{L-1} = \eta \, \delta_i^L \, s_j^{L-1}, \qquad \delta_i^L = 2 \alpha \, s_i^L (1 - s_i^L)(t_i^q - s_i^L)$

[Figure: the output-layer update assembled from local signals: the neuron's inputs $s_j^{L-1}$, its output $s_i^L = y_i^q$, the target $t_i^q$, and the factors $2\alpha$, $\eta$, and $1 - s_i^L$ combined by multipliers.]

Hidden-Layer Computation

$\Delta W_{ij}^{l-1} = \eta \, \delta_i^l \, s_j^{l-1}, \qquad \delta_i^l = \alpha \, s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

[Figure: the hidden-layer update assembled from local signals: the inputs $s_j^{l-1}$, the output $s_i^l$, and the deltas $\delta_1^{l+1}, \ldots, \delta_k^{l+1}, \ldots, \delta_N^{l+1}$ fed back through the weights $W_{1i}^l, \ldots, W_{ki}^l, \ldots, W_{Ni}^l$ and summed.]
Training Procedures

•  Batch Learning
  –  on each epoch (pass through all the training pairs),
  –  weight changes for all patterns are accumulated
  –  weight matrices are updated at the end of the epoch
  –  accurate computation of the gradient
•  Online Learning
  –  weights are updated after back-propagation of each training pair
  –  usually randomize the order for each epoch
  –  approximation of the gradient
•  In practice it doesn't make much difference (see the code sketch below)

Summation of Error Surfaces

[Figure: the total error surface E is the sum of the per-pattern error surfaces E1, E2, ….]
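A sketch contrasting the two procedures; `update(Ws, x, t)` is an assumed helper returning the list of per-pattern weight increments (e.g., a variant of `backprop_update` above that returns the $\Delta W$'s instead of applying them):

```python
import random

def train_batch(Ws, data, update, epochs=100):
    """Batch learning: accumulate the weight changes for all training
    pairs and apply them at the end of each epoch (accurate gradient)."""
    for _ in range(epochs):
        dWs = [0 * W for W in Ws]
        for x, t in data:
            dWs = [a + d for a, d in zip(dWs, update(Ws, x, t))]
        Ws = [W + dW for W, dW in zip(Ws, dWs)]
    return Ws

def train_online(Ws, data, update, epochs=100):
    """Online learning: update the weights after back-propagation of
    each training pair, in randomized order (approximate gradient)."""
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)
        for x, t in data:
            Ws = [W + dW for W, dW in zip(Ws, update(Ws, x, t))]
    return Ws
```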
Gradient Computation in Batch Learning

[Figure: the batch gradient is taken on the summed error surface E = E1 + E2 + …, giving the true downhill direction for the total error.]

Gradient Computation in Online Learning

[Figure: each online step follows the gradient of a single pattern's error surface (E1, then E2, …), only approximating the gradient of the total error E.]
Testing Generalization

[Figure: from the problem domain, the available data is split into training data (used to adjust the weights) and test data (held out to measure generalization).]

Problem of Rote Learning

[Figure: error versus training epoch; the error on the training data keeps decreasing, while the error on the test data eventually begins to rise; training should stop at the minimum of the test-data error ("stop training here").]
Improving Generalization

[Figure: the available data from the domain is now split three ways: training data, validation data, and test data.]

A Few Random Tips

•  Too few neurons and the ANN may not be able to decrease the error enough
•  Too many neurons can lead to rote learning
•  Preprocess data to:
  –  standardize
  –  eliminate irrelevant information
  –  capture invariances
  –  keep relevant information
•  If stuck in a local minimum, restart with different random weights
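The validation split supports the early-stopping rule from the rote-learning slide ("stop training here"). A sketch, with `train_epoch` and `error` as assumed helpers that run one training epoch and measure error on a data set:

```python
def train_early_stopping(Ws, train, valid, train_epoch, error,
                         patience=10, max_epochs=1000):
    """Stop when the validation error has not improved for `patience`
    epochs, and return the best weights seen."""
    best_err, best_Ws, waited = float("inf"), Ws, 0
    for _ in range(max_epochs):
        Ws = train_epoch(Ws, train)     # one pass through the training data
        err = error(Ws, valid)          # monitor error on validation data
        if err < best_err:
            best_err, best_Ws, waited = err, Ws, 0
        else:
            waited += 1
            if waited >= patience:
                break                   # validation-error curve has turned up
    return best_Ws
```

(The test data is still held out; it is used only once, to report the final generalization error.)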
Run Example BP Learning

Beyond Back-Propagation

•  Adaptive Learning Rate
•  Adaptive Architecture
  –  Add/delete hidden neurons
  –  Add/delete hidden layers
•  Radial Basis Function Networks
•  Recurrent BP
•  Etc., etc., etc. …
What is the Power of Artificial Neural Networks?

•  With respect to Turing machines?
•  As function approximators?

Can ANNs Exceed the "Turing Limit"?

•  There are many results, which depend sensitively on assumptions; for example:
  –  Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
  –  Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
  –  Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
  –  Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP//log*) (Siegelmann '99)
•  But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation
A Universal Approximation Theorem

Suppose f is a continuous function on $[0,1]^n$.

Suppose $\sigma$ is a nonconstant, bounded, monotone increasing real function on $\Re$.

For any $\varepsilon > 0$, there is an m such that $\exists \mathbf{a} \in \Re^m, \mathbf{b} \in \Re^m, W \in \Re^{m \times n}$ such that if

$F(x_1, \ldots, x_n) = \sum_{i=1}^m a_i \, \sigma\!\left( \sum_{j=1}^n W_{ij} x_j + b_i \right)$

[i.e., $F(\mathbf{x}) = \mathbf{a} \cdot \sigma(W\mathbf{x} + \mathbf{b})$]

then $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in [0,1]^n$.

(see, e.g., Haykin, Neural Networks, 2/e, pp. 208–9)

•  Conclusion: One hidden layer is sufficient to approximate any continuous function arbitrarily closely.

One Hidden Layer is Sufficient

[Figure: inputs $x_1, \ldots, x_n$ feed m hidden units $\Sigma\sigma$ with biases $b_1, \ldots, b_m$ and weights $W_{11}, \ldots, W_{mn}$; their outputs are combined with coefficients $a_1, a_2, \ldots, a_m$ by a linear output unit $\Sigma$.]
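A small numerical illustration of the form $F(\mathbf{x}) = \mathbf{a} \cdot \sigma(W\mathbf{x} + \mathbf{b})$: random hidden-layer weights, with the output coefficients $\mathbf{a}$ fitted by least squares to a target function on [0,1]. This illustrates the architecture, not the theorem's proof; all constants here are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # target: continuous on [0, 1]
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))   # nonconstant, bounded, monotone
m = 50                                       # number of hidden units

W = rng.normal(scale=10.0, size=m)           # hidden weights (n = 1 input)
b = rng.normal(scale=10.0, size=m)           # hidden biases

x = np.linspace(0.0, 1.0, 200)
H = sigma(np.outer(x, W) + b)                # hidden outputs, shape (200, m)
a, *_ = np.linalg.lstsq(H, f(x), rcond=None) # fit output coefficients a
print(np.max(np.abs(H @ a - f(x))))          # max |F(x) - f(x)| is small
```

Increasing m lets the maximum error be driven below any chosen $\varepsilon$, as the theorem asserts.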
The Golden Rule of Neural Nets

Neural Networks are the second-best way to do everything!