Part 4A: Artificial Neural Network Learning

IV. Neural Network Learning

A. Artificial Neural Network Learning

Supervised Learning
•  Produce desired outputs for training inputs
•  Generalize reasonably & appropriately to other inputs
•  Good example: pattern recognition
•  Feedforward multilayer networks

Feedforward Network
[Figure: feedforward network with an input layer, several hidden layers, and an output layer]

Typical Artificial Neuron
[Figure: neuron showing its inputs, connection weights, threshold, and output]

Typical Artificial Neuron
[Figure: the same neuron labeled with the linear combination forming the net input (local field) and the activation function producing the output]

Equations

Net input:
$$h_i = \Bigl(\sum_{j=1}^{n} w_{ij} s_j\Bigr) - \theta_i, \qquad \mathbf{h} = \mathbf{W}\mathbf{s} - \boldsymbol{\theta}$$

Neuron output:
$$s'_i = \sigma(h_i), \qquad \mathbf{s}' = \sigma(\mathbf{h})$$

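As a concrete illustration, here is a minimal NumPy sketch of these two equations for a single layer. The array names and the choice of a logistic sigmoid for σ are assumptions for the example, not part of the slides.

```python
import numpy as np

def layer_output(W, s, theta, sigma=lambda h: 1.0 / (1.0 + np.exp(-h))):
    """Net input h = W s - theta, then elementwise activation s' = sigma(h)."""
    h = W @ s - theta            # net input (local field) of each neuron
    return sigma(h)              # neuron outputs

# Example: 3 inputs feeding a layer of 2 neurons
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.8, -0.5]])
s = np.array([1.0, 0.0, 1.0])
theta = np.array([0.2, -0.1])
print(layer_output(W, s, theta))
```
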
Single-Layer Perceptron
[Figure: single-layer perceptron, with inputs connected directly to a layer of output neurons and no hidden layers]

Variables
[Figure: single perceptron with inputs x1, …, xj, …, xn, weights w1, …, wj, …, wn, a summation Σ producing the net input h, a threshold θ, a step function Θ, and output y]

Single-Layer Perceptron Equations

Binary threshold activation function:
$$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$$

Hence,
$$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases}
  = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}$$

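A minimal sketch of this decision rule in NumPy; the function name and the AND example are illustrative assumptions.

```python
import numpy as np

def perceptron_output(w, x, theta):
    """Binary threshold unit: y = Theta(w . x - theta)."""
    return 1 if np.dot(w, x) > theta else 0

# Example: a perceptron computing logical AND of two binary inputs
w, theta = np.array([1.0, 1.0]), 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron_output(w, np.array(x, dtype=float), theta))
```
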
2D Weight Vector
[Figure: weight vector w drawn in the (w1, w2) plane; an input x makes angle φ with w, and v is the length of the projection of x onto w; the decision line perpendicular to w separates the + and – regions]

$$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\varphi, \qquad \cos\varphi = \frac{v}{\|\mathbf{x}\|}$$

$$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, v$$

$$\mathbf{w} \cdot \mathbf{x} > \theta \;\Leftrightarrow\; \|\mathbf{w}\| \, v > \theta \;\Leftrightarrow\; v > \theta / \|\mathbf{w}\|$$

N-Dimensional Weight Vector
[Figure: the weight vector w is the normal vector of the separating hyperplane, with the + region on the side w points toward and the – region on the other side]

Goal of Perceptron Learning
•  Suppose we have training patterns x^1, x^2, …, x^P with corresponding desired outputs y^1, y^2, …, y^P
•  where x^p ∈ {0, 1}^n, y^p ∈ {0, 1}
•  We want to find w, θ such that y^p = Θ(w ⋅ x^p − θ) for p = 1, …, P

Treating Threshold as Weight
[Figure: perceptron with inputs x1, …, xj, …, xn, weights w1, …, wj, …, wn, and threshold θ]

$$h = \Bigl(\sum_{j=1}^{n} w_j x_j\Bigr) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$$

Treating Threshold as Weight
[Figure: the same perceptron with an additional input x0 = –1 whose weight is w0 = θ]

Let $x_0 = -1$ and $w_0 = \theta$. Then
$$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$$

Augmented Vectors

$$\tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad
\tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}$$

We want $y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p)$ for $p = 1, \ldots, P$.

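A minimal NumPy sketch of this augmentation (variable names are illustrative):

```python
import numpy as np

def augment_weights(w, theta):
    """w_tilde = (theta, w1, ..., wn)."""
    return np.concatenate(([theta], w))

def augment_input(x):
    """x_tilde = (-1, x1, ..., xn), so w_tilde . x_tilde = w . x - theta."""
    return np.concatenate(([-1.0], x))

w, theta = np.array([1.0, 1.0]), 1.5
x = np.array([1.0, 0.0])
print(np.dot(augment_weights(w, theta), augment_input(x)),
      np.dot(w, x) - theta)   # equal by construction
```
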
Reformulation as Positive Examples

We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.
Want $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0$ for negative.

Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive examples and $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative examples.
Want $\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$ for $p = 1, \ldots, P$.

This is a hyperplane through the origin with all the $\mathbf{z}^p$ on one side.

Adjustment of Weight Vector
[Figure: vectors z1, …, z11 scattered around the plane; the weight vector must be adjusted until all of them lie on its positive side]

Outline of Perceptron Learning Algorithm
1.  initialize weight vector randomly
2.  until all patterns classified correctly, do:
    a)  for p = 1, …, P do:
        1)  if z^p classified correctly, do nothing
        2)  else adjust weight vector to be closer to correct classification

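A minimal NumPy sketch of this algorithm in the z-vector formulation above; the learning rate, the random initialization, and the epoch cap are assumptions added so the sketch always terminates.

```python
import numpy as np

def perceptron_learn(Z, eta=0.1, max_epochs=1000, rng=np.random.default_rng(0)):
    """Find w_tilde with w_tilde . z >= 0 for every row z of Z (if separable)."""
    w = rng.normal(size=Z.shape[1])        # 1. initialize weight vector randomly
    for _ in range(max_epochs):            # 2. until all patterns classified correctly
        errors = 0
        for z in Z:                        #    a) for p = 1, ..., P
            if np.dot(w, z) < 0:           #       misclassified:
                w = w + eta * z            #       move w toward z^p
                errors += 1
        if errors == 0:
            break
    return w

# z-vectors for AND: z = x_tilde for positive examples, -x_tilde for negative ones
Z = np.array([[ 1.0,  0.0,  0.0],   # x=(0,0), y=0
              [ 1.0,  0.0, -1.0],   # x=(0,1), y=0
              [ 1.0, -1.0,  0.0],   # x=(1,0), y=0
              [-1.0,  1.0,  1.0]])  # x=(1,1), y=1
w_tilde = perceptron_learn(Z)
print(w_tilde, Z @ w_tilde >= 0)
```
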
Weight Adjustment
[Figure: a misclassified z^p is added to the weight vector, scaled by η; repeated additions of ηz^p take w̃ to w̃′ and then w̃″, rotating the weight vector toward z^p]

Improvement in Performance

If $\tilde{\mathbf{w}} \cdot \mathbf{z}^p < 0$, then
$$\tilde{\mathbf{w}}' \cdot \mathbf{z}^p = (\tilde{\mathbf{w}} + \eta \mathbf{z}^p) \cdot \mathbf{z}^p
= \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \mathbf{z}^p \cdot \mathbf{z}^p
= \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \|\mathbf{z}^p\|^2
> \tilde{\mathbf{w}} \cdot \mathbf{z}^p$$

Perceptron Learning Theorem
•  If there is a set of weights that will solve the problem,
•  then the PLA will eventually find it
•  (for a sufficiently small learning rate)
•  Note: only applies if positive & negative examples are linearly separable

NetLogo Simulation of Perceptron Learning

Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons
•  Perceptrons can function as logic gates
•  Therefore an MLP can form intersections, unions, and differences of linearly separable regions
•  Classes can be arbitrary hyperpolyhedra
•  Minsky & Papert criticism of perceptrons
•  At the time, no one had succeeded in developing an MLP learning algorithm

Hyperpolyhedral Classes
[Figure: classes represented as hyperpolyhedral regions of the input space]

Credit Assignment Problem
[Figure: feedforward network with input layer, hidden layers, and output layer; the actual output is compared with the desired output]

How do we adjust the weights of the hidden layers?

NetLogo Demonstration of Back-Propagation Learning

Run Artificial Neural Net.nlogo

Adaptive System
[Figure: a system S with control parameters P1, …, Pk, …, Pm, an evaluation function F (fitness, figure of merit), and a control algorithm C that adjusts the parameters]

Gradient

$\dfrac{\partial F}{\partial P_k}$ measures how $F$ is altered by variation of $P_k$.

$$\nabla F = \begin{pmatrix} \partial F / \partial P_1 \\ \vdots \\ \partial F / \partial P_k \\ \vdots \\ \partial F / \partial P_m \end{pmatrix}$$

$\nabla F$ points in the direction of maximum local increase in $F$.

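As an illustration of this definition, here is a minimal finite-difference sketch that estimates ∇F numerically for any fitness function F(P); the step size and the example quadratic fitness are assumptions for the demo.

```python
import numpy as np

def numeric_gradient(F, P, eps=1e-6):
    """Estimate each dF/dP_k by a central difference."""
    grad = np.zeros_like(P)
    for k in range(len(P)):
        dP = np.zeros_like(P)
        dP[k] = eps
        grad[k] = (F(P + dP) - F(P - dP)) / (2 * eps)
    return grad

# Example fitness: a smooth bump peaked at P = (1, 2)
F = lambda P: -((P[0] - 1.0) ** 2 + (P[1] - 2.0) ** 2)
print(numeric_gradient(F, np.array([0.0, 0.0])))   # approximately (2, 4)
```
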
Gradient Ascent on Fitness Surface
[Figure: fitness surface with a + peak and a – valley; gradient ascent follows ∇F uphill]

Gradient Ascent by Discrete Steps
[Figure: the same fitness surface climbed in a sequence of discrete steps along the local gradient]

Gradient Ascent is Local But Not Shortest
[Figure: the gradient-ascent path curves as it climbs; it always moves locally uphill but is not the shortest path to the peak]

Gradient Ascent Process

$$\dot{\mathbf{P}} = \eta \, \nabla F(\mathbf{P})$$

Change in fitness:
$$\dot{F} = \frac{d F}{d t} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{d P_k}{d t} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}$$

$$\dot{F} = \nabla F \cdot \eta \nabla F = \eta \, \|\nabla F\|^2 \ge 0$$

Therefore gradient ascent increases fitness (until it reaches a point of zero gradient).

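A minimal sketch of the discrete-step version of this process; the example fitness, its hand-coded gradient, the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

def gradient_ascent(grad_F, P0, eta=0.1, steps=100):
    """Discrete gradient ascent: P <- P + eta * grad F(P)."""
    P = np.array(P0, dtype=float)
    for _ in range(steps):
        P += eta * grad_F(P)
    return P

# Example: F(P) = -(P1 - 1)^2 - (P2 - 2)^2, so grad F = (-2(P1 - 1), -2(P2 - 2))
grad_F = lambda P: np.array([-2 * (P[0] - 1.0), -2 * (P[1] - 2.0)])
print(gradient_ascent(grad_F, [0.0, 0.0]))   # converges toward the peak at (1, 2)
```
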
General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:
$$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi$$
where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.

Hence we need $\cos\varphi > 0$, i.e. $\varphi < 90°$.

General Ascent on Fitness Surface
[Figure: fitness surface with a + peak and a – valley; any path whose direction stays within 90° of ∇F climbs the surface]

Fitness as Minimum Error

Suppose for $Q$ different inputs we have target outputs $\mathbf{t}^1, \ldots, \mathbf{t}^Q$.
Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \ldots, \mathbf{y}^Q$.
Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target and actual outputs.
Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be the error on the $q$th sample.

Let
$$F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D\bigl[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})\bigr]$$

Gradient of Fitness

$$\nabla F = \nabla\Bigl(-\sum_q E^q\Bigr) = -\sum_q \nabla E^q$$

$$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q)
= \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k}
= \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}
= \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}$$

Jacobian Matrix

Define the Jacobian matrix
$$\mathbf{J}^q = \begin{pmatrix}
\dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m}
\end{pmatrix}$$

Note $\mathbf{J}^q \in \mathbb{R}^{n \times m}$ and $\nabla D(\mathbf{t}^q, \mathbf{y}^q) \in \mathbb{R}^{n \times 1}$.

Since
$$(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial y_j^q}{\partial P_k} \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q},$$

$$\therefore \nabla E^q = (\mathbf{J}^q)^{\mathsf T} \, \nabla D(\mathbf{t}^q, \mathbf{y}^q)$$

Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$. Then

$$\frac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j}
= \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2
= \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j}
= \frac{d (t_j - y_j)^2}{d y_j}
= -2 (t_j - y_j)$$

$$\therefore \frac{d D(\mathbf{t}, \mathbf{y})}{d \mathbf{y}} = 2 (\mathbf{y} - \mathbf{t})$$

Gradient of Error on qth Input

$$\frac{\partial E^q}{\partial P_k}
= \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}
= 2 (\mathbf{y}^q - \mathbf{t}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}
= 2 \sum_j (y_j^q - t_j^q) \, \frac{\partial y_j^q}{\partial P_k}$$

$$\nabla E^q = 2 \, (\mathbf{J}^q)^{\mathsf T} (\mathbf{y}^q - \mathbf{t}^q)$$

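A minimal NumPy check of this formula on a toy linear model y(P) = A P, where the Jacobian is simply A; the matrix and vectors are illustrative assumptions.

```python
import numpy as np

# Toy model: y(P) = A @ P, so the Jacobian dy/dP is just A (n x m)
A = np.array([[1.0,  2.0],
              [0.0,  1.0],
              [3.0, -1.0]])
P = np.array([0.5, -0.5])
t = np.array([1.0, 0.0, 2.0])

y = A @ P
grad_E = 2 * A.T @ (y - t)        # gradient of E = ||t - y||^2 via 2 J^T (y - t)
print(grad_E)

# Finite-difference check of the same gradient
eps = 1e-6
E = lambda P: np.sum((t - A @ P) ** 2)
fd = np.array([(E(P + eps * np.eye(2)[k]) - E(P - eps * np.eye(2)[k])) / (2 * eps)
               for k in range(2)])
print(fd)                          # matches grad_E
```
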
Recap

$$\dot{\mathbf{P}} = \eta \sum_q (\mathbf{J}^q)^{\mathsf T} (\mathbf{t}^q - \mathbf{y}^q)$$

To know how to decrease the differences between actual and desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which says how the $j$th output varies with the $k$th parameter (given the $q$th input).

The Jacobian depends on the specific form of the system, in this case a feedforward neural network.

Multilayer Notation
[Figure: layers s^1, s^2, …, s^(L–1), s^L connected by weight matrices W^1, W^2, …, W^(L–2), W^(L–1); the input x^q enters as s^1 and the output y^q emerges as s^L]

Notation
•  L layers of neurons, labeled 1, …, L
•  N_l neurons in layer l
•  s^l = vector of outputs from neurons in layer l
•  input layer: s^1 = x^q (the input pattern)
•  output layer: s^L = y^q (the actual output)
•  W^l = weights between layers l and l+1
•  Problem: find out how the outputs y_i^q vary with the weights W_jk^l (l = 1, …, L–1)

Typical Neuron
[Figure: neuron i in layer l receives inputs s_1^(l–1), …, s_j^(l–1), …, s_N^(l–1) through weights W_i1^(l–1), …, W_ij^(l–1), …, W_iN^(l–1); the summation Σ gives the net input h_i^l and the activation σ gives the output s_i^l]

Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^{l}}$ starting with the last layer ($l = L - 1$) and working back to earlier layers ($l = L - 2, \ldots, 1$).

Delta Values

It is convenient to break the derivatives up by the chain rule:
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Let
$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l}$$

So
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Output-Layer Neuron
[Figure: output neuron i receives s_1^(L–1), …, s_j^(L–1), …, s_N^(L–1) through weights W_i1^(L–1), …, W_ij^(L–1), …, W_iN^(L–1); the summation gives h_i^L, the activation σ gives s_i^L = y_i^q, which is compared with the target t_i^q to give the error E^q]

Output-Layer Derivatives (1)

$$\delta_i^L = \frac{\partial E^q}{\partial h_i^L}
= \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2
= \frac{d (s_i^L - t_i^q)^2}{d h_i^L}
= 2 (s_i^L - t_i^q) \frac{d s_i^L}{d h_i^L}
= 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$$

Output-Layer Derivatives (2)

$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}}
= \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1}
= s_j^{L-1}$$

$$\therefore \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L \, s_j^{L-1},
\qquad \text{where } \delta_i^L = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$$

Hidden-Layer Neuron
[Figure: hidden neuron i in layer l receives s_1^(l–1), …, s_j^(l–1), …, s_N^(l–1) through weights W_i1^(l–1), …, W_ij^(l–1), …, W_iN^(l–1), producing h_i^l and s_i^l; its output feeds the neurons of layer l+1 through weights W_1i^l, …, W_ki^l, …, W_Ni^l and so influences the error E^q]

Hidden-Layer Derivatives (1)

Recall
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l}
= \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l}
= \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}$$

$$\frac{\partial h_k^{l+1}}{\partial h_i^l}
= \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l}
= \frac{\partial W_{ki}^l s_i^l}{\partial h_i^l}
= W_{ki}^l \frac{d \sigma(h_i^l)}{d h_i^l}
= W_{ki}^l \, \sigma'(h_i^l)$$

$$\therefore \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Hidden-Layer Derivatives (2)

$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}
= \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1}
= \frac{d \, W_{ij}^{l-1} s_j^{l-1}}{d \, W_{ij}^{l-1}}
= s_j^{l-1}$$

$$\therefore \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, s_j^{l-1},
\qquad \text{where } \delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid). Then

$$D_h s = D_h \bigl[1 + \exp(-\alpha h)\bigr]^{-1}
= -\bigl[1 + \exp(-\alpha h)\bigr]^{-2} D_h (1 + e^{-\alpha h})
= -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h})$$

$$= \alpha \frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2}
= \alpha \, \frac{1}{1 + e^{-\alpha h}} \, \frac{e^{-\alpha h}}{1 + e^{-\alpha h}}
= \alpha s \left( \frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}} \right)
= \alpha s (1 - s)$$

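A minimal sketch of the logistic sigmoid and its derivative in the α s (1 – s) form, with a finite-difference check; the values of α and the test point are illustrative.

```python
import numpy as np

def sigmoid(h, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * h))

def sigmoid_prime(h, alpha=1.0):
    """sigma'(h) = alpha * s * (1 - s), with s = sigma(h)."""
    s = sigmoid(h, alpha)
    return alpha * s * (1.0 - s)

h, eps = 0.3, 1e-6
print(sigmoid_prime(h))
print((sigmoid(h + eps) - sigmoid(h - eps)) / (2 * eps))   # numerically the same
```
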
Summary of Back-Propagation Algorithm

Output layer:
$$\delta_i^L = 2 \alpha \, s_i^L (1 - s_i^L)(s_i^L - t_i^q), \qquad
\frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L \, s_j^{L-1}$$

Hidden layers:
$$\delta_i^l = \alpha \, s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l, \qquad
\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, s_j^{l-1}$$

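A minimal NumPy sketch of these formulas for a network with one hidden layer, trained online on a toy XOR problem. The architecture, learning rate, α = 1, treating thresholds as weights on a constant –1 input, and the number of epochs are assumptions made to keep the example short; convergence from a particular random initialization is not guaranteed.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, eta = 1.0, 0.5

def sigma(h):                                   # logistic sigmoid with alpha = 1
    return 1.0 / (1.0 + np.exp(-alpha * h))

# Toy data: XOR, with a constant -1 appended so thresholds are learned as weights
X = np.array([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = rng.normal(scale=0.5, size=(3, 5))         # input (3) -> hidden (5)
W2 = rng.normal(scale=0.5, size=(5, 1))         # hidden (5) -> output (1)

for epoch in range(10000):
    for x, t in zip(X, T):
        # Forward pass
        s1 = x
        h2 = s1 @ W1; s2 = sigma(h2)            # hidden layer
        h3 = s2 @ W2; s3 = sigma(h3)            # output layer

        # Delta values from the summary above
        d3 = 2 * alpha * s3 * (1 - s3) * (s3 - t)     # output layer
        d2 = alpha * s2 * (1 - s2) * (W2 @ d3)        # hidden layer

        # Gradient descent: dE/dW_ij = delta_i * s_j of the previous layer
        W2 -= eta * np.outer(s2, d3)
        W1 -= eta * np.outer(s1, d2)

print(np.round(sigma(sigma(X @ W1) @ W2), 2))   # should approach the XOR targets
```
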
Output-Layer Computation

$$\Delta W_{ij}^{L-1} = \eta \, \delta_i^L \, s_j^{L-1}, \qquad
\delta_i^L = 2 \alpha \, s_i^L (1 - s_i^L)(t_i^q - s_i^L)$$

(The sign of $\delta_i^L$ is flipped relative to the summary slide because the weight change descends the error: $\Delta W_{ij}^{L-1} = -\eta \, \partial E^q / \partial W_{ij}^{L-1}$.)

[Figure: circuit computing the output-layer weight change from s_j^(L–1), the output s_i^L = y_i^q, the target t_i^q, the factors 1 – s_i^L and 2α, and the learning rate η]

Hidden-Layer Computation

$$\Delta W_{ij}^{l-1} = \eta \, \delta_i^l \, s_j^{l-1}, \qquad
\delta_i^l = \alpha \, s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

[Figure: circuit computing the hidden-layer weight change from s_j^(l–1), the output s_i^l, the factor 1 – s_i^l, α, the learning rate η, and the back-propagated sum of δ_1^(l+1), …, δ_k^(l+1), …, δ_N^(l+1) weighted by W_1i^l, …, W_ki^l, …, W_Ni^l]

Training Procedures
•  Batch Learning
  –  on each epoch (pass through all the training pairs),
  –  weight changes for all patterns are accumulated
  –  weight matrices are updated at the end of the epoch
  –  accurate computation of the gradient
•  Online Learning
  –  weights are updated after back-propagation of each training pair
  –  usually randomize the order for each epoch
  –  approximation of the gradient
•  In practice it doesn't make much difference

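A minimal sketch contrasting the two update schedules for a generic gradient function; the linear-unit demo at the end is an illustrative assumption, standing in for the back-propagation gradients above.

```python
import numpy as np

def train_batch(W, data, grad_fn, eta=0.1, epochs=100):
    """Accumulate gradients over all training pairs, update once per epoch."""
    for _ in range(epochs):
        total = sum(grad_fn(W, x, t) for x, t in data)
        W = W - eta * total
    return W

def train_online(W, data, grad_fn, eta=0.1, epochs=100, rng=np.random.default_rng(0)):
    """Update after each training pair, in a freshly shuffled order every epoch."""
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            x, t = data[idx]
            W = W - eta * grad_fn(W, x, t)
    return W

# Demo: a single linear unit y = w . x with squared error, gradient 2(y - t)x
grad_fn = lambda w, x, t: 2 * (np.dot(w, x) - t) * x
data = [(np.array([1.0, 0.0]), 2.0), (np.array([0.0, 1.0]), -1.0)]
w0 = np.zeros(2)
print(train_batch(w0, data, grad_fn), train_online(w0, data, grad_fn))  # both near (2, -1)
```
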
Summation of Error Surfaces
[Figure: the total error surface E is the sum of the per-pattern error surfaces E1 and E2]

Gradient Computation in Batch Learning
[Figure: in batch learning each step follows the gradient of the summed error surface E = E1 + E2]

Gradient Computation in Online Learning
[Figure: in online learning successive steps follow the gradients of the individual error surfaces E1 and E2, approximating the batch gradient]

Testing Generalization
[Figure: the available data, sampled from the domain, is split into training data and test data]

Problem of Rote Learning
[Figure: error versus epoch; the error on the training data keeps decreasing, while the error on the test data eventually rises again, so training should stop at the minimum of the test-data error]

Improving Generalization
[Figure: the available data, sampled from the domain, is split into training data, validation data, and test data]

A Few Random Tips
•  Too few neurons and the ANN may not be able to decrease the error enough
•  Too many neurons can lead to rote learning
•  Preprocess data to:
  –  standardize
  –  eliminate irrelevant information
  –  capture invariances
  –  keep relevant information
•  If stuck in a local minimum, restart with different random weights

Run Example BP Learning

Beyond Back-Propagation
•  Adaptive Learning Rate
•  Adaptive Architecture
  –  Add/delete hidden neurons
  –  Add/delete hidden layers
•  Radial Basis Function Networks
•  Recurrent BP
•  Etc., etc., etc. …

What is the Power of Artificial Neural Networks?
•  With respect to Turing machines?
•  As function approximators?

Can ANNs Exceed the “Turing Limit”?
•  There are many results, which depend sensitively on assumptions; for example:
•  Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag ‘94)
•  Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag ‘99)
•  Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann ‘99)
•  Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann ‘99)
•  But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation

A Universal Approximation Theorem

Suppose $f$ is a continuous function on $[0,1]^n$, and suppose $\sigma$ is a nonconstant, bounded, monotone increasing real function on $\mathbb{R}$.

For any $\varepsilon > 0$, there is an $m$ such that there exist $\mathbf{a} \in \mathbb{R}^m$, $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{W} \in \mathbb{R}^{m \times n}$ such that if
$$F(x_1, \ldots, x_n) = \sum_{i=1}^{m} a_i \, \sigma\Bigl(\sum_{j=1}^{n} W_{ij} x_j + b_i\Bigr)
\qquad \bigl[\text{i.e., } F(\mathbf{x}) = \mathbf{a} \cdot \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})\bigr]$$
then $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in [0,1]^n$.

(see, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–9)

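A minimal sketch of the approximating form F(x) = a · σ(Wx + b) as a NumPy function; the random parameters are placeholders, since the theorem only asserts that suitable a, W, b exist for a given f and ε (tanh is used as one admissible σ).

```python
import numpy as np

def F(x, a, W, b, sigma=np.tanh):
    """One-hidden-layer approximator: F(x) = a . sigma(W x + b)."""
    return a @ sigma(W @ x + b)

# Placeholder parameters for n = 2 inputs and m = 8 hidden units
rng = np.random.default_rng(0)
m, n = 8, 2
a, W, b = rng.normal(size=m), rng.normal(size=(m, n)), rng.normal(size=m)
print(F(np.array([0.3, 0.7]), a, W, b))
```
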
One Hidden Layer is Sufficient
•  Conclusion: one hidden layer is sufficient to approximate any continuous function arbitrarily closely
[Figure: network with inputs x1, …, xn and a constant input 1 (for the biases b1, …), a single hidden layer of m sigmoid units with weights W (up to Wmn), and output weights a1, a2, …, am feeding a final summation]

The Golden Rule of Neural Nets

Neural Networks are the second-best way to do everything!