Document 11911343

advertisement
Part 4A: Artificial Neural Network Learning
2/23/12
A.
Artificial Neural Network
Learning
IV. Neural Network Learning
1
2/23/12
input
layer
2/23/12
3
. . .
output
layer
hidden
layers
2/23/12
Typical Artificial Neuron
4
Typical Artificial Neuron
connection
weights
inputs
. . .
. . .
•  Produce desired outputs for training inputs
•  Generalize reasonably & appropriately to
other inputs
•  Good example: pattern recognition
•  Feedforward multilayer networks
. . .
Feedforward Network
. . .
Supervised Learning
2
. . .
2/23/12
linear
combination
activation
function
output
net input
(local field)
threshold
2/23/12
5
2/23/12
6
1
Part 4A: Artificial Neural Network Learning
2/23/12
Equations
Single-Layer Perceptron
. . .
sʹ′i = σ ( hi )
Neuron output:
. . .
⎛ n
⎞
hi = ⎜⎜ ∑ w ij s j ⎟⎟ − θ
⎝ j=1
⎠
h = Ws − θ
Net input:
sʹ′ = σ (h)
€
2/23/12
7
2/23/12
8
€
Single Layer Perceptron
Equations
Variables
Binary threshold activation function :
⎧1, if h > 0
σ ( h ) = Θ( h ) = ⎨
⎩0, if h ≤ 0
x1
w1
xj
Σ
wj
h
y
Θ
wn
⎧1, if ∑ w x > θ
j j
j
Hence, y = ⎨
⎩0, otherwise
€
xn
⎧1, if w ⋅ x > θ
= ⎨
⎩0, if w ⋅ x ≤ θ
θ
2/23/12
9
2/23/12
10
€
2D Weight Vector
N-Dimensional Weight Vector
w2
w ⋅ x = w x cos φ
€
x
–
v
cos φ =
x
+
+
w
w
separating
hyperplane
w⋅ x = w v
φ
€
€
v
w⋅ x > θ
⇔ w v >θ
w1
–
θ
w
⇔v >θ w
2/23/12
€
normal
vector
11
2/23/12
12
€
2
Part 4A: Artificial Neural Network Learning
2/23/12
Treating Threshold as Weight
Goal of Perceptron Learning
⎛ n
⎞
h = ⎜⎜∑ w j x j ⎟⎟ − θ
⎝ j=1
⎠
•  Suppose we have training patterns x1, x2,
…, xP with corresponding desired outputs
y1, y2, …, yP •  where xp ∈ {0, 1}n, yp ∈ {0, 1}
•  We want to find w, θ such that
yp = Θ(w⋅xp – θ) for p = 1, …, P 2/23/12
x1
w1
xj
14
⎛ θ ⎞
⎜ ⎟
w
˜ = ⎜ 1 ⎟
w
⎜  ⎟
⎜ ⎟
⎝ w n ⎠
= −θ + ∑ w j x j
wj
wn
xn
j=1
h
Σ
Θ
y
⎛ −1⎞
⎜ p ⎟
x1
x˜ p = ⎜ ⎟
⎜  ⎟
⎜ p ⎟
⎝ x n ⎠
€
Let x 0 = −1 and w 0 = θ
n
j=1
˜ ⋅ x˜ p ), p = 1,…,P
We want y p = Θ( w
n
˜ ⋅ x˜
h = w 0 x 0 + ∑ w j x j =∑ w j x j = w
j= 0
€
2/23/12
€
θ
Augmented Vectors
n
xj
y
Θ
2/23/12
⎛ n
⎞
h = ⎜⎜∑ w j x j ⎟⎟ − θ
⎝ j=1
⎠
θ
= w0
w1
wj
wn
Treating Threshold as Weight
x1
j=1
h
Σ
xn
13
x0 =
–1
n
= −θ + ∑ w j x j
€
15
€
2/23/12
16
€
€
Reformulation as Positive
Examples
Adjustment of Weight Vector
We have positive (y p = 1) and negative (y p = 0) examples
z5
z1
˜ ⋅ x˜ p > 0 for positive, w
˜ ⋅ x˜ p ≤ 0 for negative
Want w
€
€
€
€
z9
z10
11
z z6
p
p
p
z2
p
Let z = x˜ for positive, z = −x˜ for negative
z3
z8
z4
z7
˜ ⋅ z p ≥ 0, for p = 1,…,P
Want w
Hyperplane through origin with all z p on one side
2/23/12
17
2/23/12
18
€
3
Part 4A: Artificial Neural Network Learning
2/23/12
Outline of
Perceptron Learning Algorithm
Weight Adjustment
1.  initialize weight vector randomly
zp
2.  until all patterns classified correctly, do:
€ €
€
€
€
a)  for p = 1, …, P do:
1)  if zp classified correctly, do nothing
€
2)  else adjust weight vector to be closer to correct
classification
2/23/12
19
Improvement in Performance
˜ ⋅ z p + ηz p ⋅ z p
=w
2
˜ ⋅ zp
>w
2/23/12
20
•  If there is a set of weights that will solve the
problem,
•  then the PLA will eventually find it
•  (for a sufficiently small learning rate)
•  Note: only applies if positive & negative
examples are linearly separable
˜ ʹ′ ⋅ z p = ( w
˜ + ηz p ) ⋅ z p
w
˜ ⋅ zp + η zp
=w
2/23/12
Perceptron Learning Theorem
˜ ⋅ z p < 0,
If w
€
p
˜ ʹ′ʹ′ ηz
w
ηz p
˜ ʹ′
w
˜
w
21
2/23/12
22
€
Classification Power of
Multilayer Perceptrons
•  Perceptrons can function as logic gates
•  Therefore MLP can form intersections,
unions, differences of linearly-separable
regions
•  Classes can be arbitrary hyperpolyhedra
•  Minsky & Papert criticism of perceptrons
•  No one succeeded in developing a MLP
learning algorithm
NetLogo Simulation of
Perceptron Learning
Run Perceptron-Geometry.nlogo
2/23/12
23
2/23/12
24
4
Part 4A: Artificial Neural Network Learning
2/23/12
Credit Assignment Problem
Hyperpolyhedral Classes
2/23/12
25
. . .
. . .
. . .
. . .
. . .
input
layer
. . .
. . .
How do we adjust the weights of the hidden layers?
Desired
output
output
layer
hidden
layers
2/23/12
26
Adaptive System
NetLogo Demonstration of
Back-Propagation Learning
System
Evaluation Function
(Fitness, Figure of Merit)
S
F
Run Artificial Neural Net.nlogo
P1
…
Pk
…
Pm
Control Parameters
2/23/12
27
Control
Algorithm
C
2/23/12
28
Gradient Ascent
on Fitness Surface
Gradient
∂F
measures how F is altered by variation of Pk
∂Pk
⎛ ∂F
⎞
⎜ ∂P1 ⎟
⎜  ⎟
⎜
⎟
∇F = ⎜ ∂F ∂P ⎟
k
⎜  ⎟
⎜
⎟
⎜∂F
⎟
⎝ ∂Pm ⎠
€
∇F
+
gradient ascent
–
∇F points in direction of maximum local increase in F
2/23/12
€
29
2/23/12
30
€
5
Part 4A: Artificial Neural Network Learning
2/23/12
Gradient Ascent
by Discrete Steps
Gradient Ascent is Local
But Not Shortest
∇F
+
+
–
2/23/12
–
31
2/23/12
Gradient Ascent Process
General Ascent in Fitness
P˙ = η∇F (P)
Note that any adaptive process P( t ) will increase
Change in fitness :
m ∂F d P
m
dF
k
F˙ =
=∑
= ∑ (∇F ) k P˙k
k=1
k=1
∂Pk d t
€d t
˙
˙
F = ∇F ⋅ P
F˙ = ∇F ⋅ η∇F = η ∇F
fitness provided :
0 < F˙ = ∇F ⋅ P˙ = ∇F P˙ cosϕ
where ϕ is angle between ∇F and P˙
Hence we need cos ϕ > 0
≥0
€
€
€
2
32
or ϕ < 90 
Therefore gradient ascent increases fitness
(until reaches 0 gradient)
2/23/12
33
2/23/12
34
€
General Ascent
on Fitness Surface
Fitness as Minimum Error
Suppose for Q different inputs we have target outputs t1,…,t Q
Suppose for parameters P the corresponding actual outputs
+
∇F
€
are y1,…,y Q
Suppose D(t,y ) ∈ [0,∞) measures difference between
–
€
target & actual outputs
Let E q = D(t q ,y q ) be error on qth sample
€
2/23/12
35
€
Q
Q
Let F (P) = −∑ E q (P) = −∑ D[t q ,y q (P)]
q=1
2/23/12
q=1
36
€
6
Part 4A: Artificial Neural Network Learning
2/23/12
Jacobian Matrix
Gradient of Fitness
⎛∂y1q
⎞
∂y1q
⎜
∂P1 
∂Pm ⎟
⎜
Define Jacobian matrix J =


 ⎟
⎜∂y q
⎟
∂
y nq
n
⎜
∂P1 
∂Pm ⎟⎠
⎝
⎛
⎞
∇F = ∇⎜⎜ −∑ E q ⎟⎟ = −∑ ∇E q
⎝ q
⎠
q
q
∂D(t q ,y q ) ∂y qj
∂E q
∂
=
D(t q ,y q ) = ∑
∂Pk ∂Pk
∂y qj
∂Pk
j
€
Note J q ∈ ℜ n×m and ∇D(t q ,y q ) ∈ ℜ n×1
d D(t q ,y q ) ∂y q
⋅
d yq
∂Pk
q
q
q
= ∇ D t ,y ⋅ ∂y
€
=
€
€
yq
2/23/12
(
)
€
T
37
€
2
2
i
i
i
∂D(t − y ) ∂
∂ (t − y )
=
∑ (ti − y i )2 = ∑ i∂y i
∂y j
∂y j i
j
i
€
€
2/23/12
38
€
Derivative of Squared Euclidean
Distance
Suppose D(t,y ) = t − y = ∑ ( t − y )
=
q
q
∂y q ∂D(t ,y )
∂E q
=∑ j
,
q
∂Pk
∂y j
j ∂Pk
∴∇E q = ( J q ) ∇D(t q ,y q )
∂Pk
€
€
Since (∇E q ) k =
d( t j − y j )
dyj
Gradient of Error on qth Input
d D(t ,y ) ∂y
∂E
=
⋅
q
q
∂Pk
dy
q
q
q
∂Pk
∂y q
= 2( y q − t q ) ⋅
∂Pk
2
= 2∑ ( y qj − t qj )
2
j
= −2( t j − y j )
∂y qj
∂Pk
T
d D(t,y )
= 2( y − t )
dy
€
2/23/12
∇E q = 2( J q ) ( y q − t q )
∴
€
39
2/23/12
40
€
€
Recap
P˙ = η∑ ( J ) (t − y )
q T
q
Multilayer Notation
q
q
To know how to decrease the differences between
actual & desired outputs,
€
we need to know elements of Jacobian,
xq
∂y qj
∂Pk ,
which says how jth output varies with kth parameter
(given the qth input)
W1
s1
W2
s2
WL–2
WL–1
sL–1
yq
sL
The Jacobian depends on the specific form of the system,
in this case, a feedforward neural network
€
2/23/12
41
2/23/12
42
7
Part 4A: Artificial Neural Network Learning
2/23/12
Typical Neuron
Notation
• 
• 
• 
• 
• 
• 
• 
L layers of neurons labeled 1, …, L Nl neurons in layer l sl = vector of outputs from neurons in layer l input layer s1 = xq (the input pattern)
output layer sL = yq (the actual output)
Wl = weights between layers l and l+1
Problem: find out how outputs yiq vary with
weights Wjkl (l = 1, …, L–1)
2/23/12
s1l–1
sjl–1
Wi1l–1
Wijl–1
Σ
WiN
hil
sil
σ
l–1
sNl–1
43
2/23/12
44
Error Back-Propagation
Delta Values
Convenient to break derivatives by chain rule :
We will compute
∂E q
starting with last layer (l = L −1)
∂W ijl
∂E q
∂E q ∂hil
=
∂W ijl−1 ∂hil ∂W ijl−1
and working back to earlier layers (l = L − 2,…,1)
Let δil =
€
So
2/23/12
45
€
δiL =
hiL
σ
tiq
siL = yiq
=
WiNL–1
sNL–1
2/23/12
46
Output-Layer Derivatives (1)
s1L–1
Wi1L–1
WijL–1
Σ
∂E q
∂hil
= δil
l−1
∂W ij
∂W ijl−1
2/23/12
Output-Layer Neuron
sjL–1
∂E q
∂hil
Eq
2
∂E q
∂
=
∑ (skL − tkq )
∂hiL ∂hiL k
d( siL − t iq )
dh
L
i
2
= 2( siL − t iq )
d siL
d hiL
= 2( siL − t iq )σ ʹ′( hiL )
47
2/23/12
48
€
8
Part 4A: Artificial Neural Network Learning
2/23/12
Hidden-Layer Neuron
Output-Layer Derivatives (2)
∂hiL
∂
=
∑W ikL−1skL−1 = sL−1
j
∂W ijL−1 ∂W ijL−1 k
sjl–1
∂E q
∴
= δiL sL−1
j
∂W ijL−1
€
s1l
s1l–1
W1il
Wi1l–1
Wijl–1
Σ
WiN
2/23/12
49
sil
σ
Wkil
skl+1
Eq
l–1
WNil
sNl–1
where δiL = 2( siL − t iq )σ ʹ′( hiL )
hil
s1l+1
sNl
sNl+1
2/23/12
50
€
Hidden-Layer Derivatives (1)
Recall
Hidden-Layer Derivatives (2)
∂E q
∂hil
= δil
l−1
∂W ij
∂W ijl−1
l−1 l−1
dW s
∂hil
∂
=
∑W ikl−1skl−1 = dWij l−1j = sl−1j
∂W ijl−1 ∂W ijl−1 k
ij
∂E q
∂E q ∂h l +1
∂h l +1
δ = l = ∑ l +1 k l = ∑δkl +1 k l
∂h i
∂
h
∂
h
∂h i
k
i
k
k
l
i
€
l l
d σ ( hil )
∂hkl +1 ∂ ∑m W km sm ∂W kil sil
=
=
= W kil
= W kilσ ʹ′( hil )
l
l
l
∂h i
∂h i
∂h i
d hil
€
€
€
∴
where δil = σ ʹ′( hil )∑δkl +1W kil
∴ δil = ∑δkl +1W kilσ ʹ′( hil ) = σ ʹ′( hil )∑δkl +1W kil
k
∂E q
= δil sl−1
j
∂W ijl−1
k
k
2/23/12
51
2/23/12
52
€
€
Summary of Back-Propagation
Algorithm
Derivative of Sigmoid
Suppose s = σ ( h ) =
1
(logistic sigmoid)
1+ exp(−αh )
−1
Output layer : δiL = 2αsiL (1− siL )( siL − t iq )
−2
∂E q
= δiL sL−1
j
∂W ijL−1
Dh s = Dh [1+ exp(−αh )] = −[1+ exp(−αh )] Dh (1+ e−αh )
€
= −(1+ e−αh )
−2
(−αe ) = α
−αh
e
−αh
−αh 2
(1+ e )
Hidden layers : δil = αsil (1− sil )∑δkl +1W kil
⎛ 1+ e−αh
1
e−αh
1 ⎞
=α
= αs⎜
−
⎟
−αh
1+ e−αh 1+ e−αh
1+ e−αh ⎠
⎝ 1+ e
k
€
∂E q
= δil sl−1
j
∂W ijl−1
= αs(1− s)
2/23/12
€
53
2/23/12
54
€
9
Part 4A: Artificial Neural Network Learning
2/23/12
Hidden-Layer Computation
Output-Layer Computation
ΔW
s1L–1
L−1
ij
L L−1
i j
= ηδ s
L–1
sjL–1
Wi1
WijL–1
€
Σ
hiL
sNl–1
L
i
)( t
q
i
L
i
−s
)
hil
Σ
55
×
€
k
2/23/12
Wkil
×
WNil
1–
δil = αsil (1− sil )∑δkl +1W kil
2/23/12
sil
σ
δil
2α
s1l+1
δ1l+1
l–1
WiNl–1
×
η
×
δ = 2αs (1− s
L
i
–
Wi1
W l–1
sjl–1
€ ij
l–1
tiq
1–
δiL
L
i
=
yiq
W1il
ΔWij WiNL–1
η
×
sNL–1
siL
σ
ΔWijL–1
ΔW ijl−1 = ηδil sl−1
j
s1l–1
Eq
skl+1
δkl+1
sNl+1
Σ
δNl+1
α
56
€
Training Procedures
Summation of Error Surfaces
•  Batch Learning
–  on each epoch (pass through all the training pairs),
–  weight changes for all patterns accumulated
–  weight matrices updated at end of epoch
–  accurate computation of gradient
E
E1
E2
•  Online Learning
–  weight are updated after back-prop of each training pair
–  usually randomize order for each epoch
–  approximation of gradient
•  Doesn’t make much difference
2/23/12
57
2/23/12
58
Gradient Computation
in Batch Learning
Gradient Computation
in Online Learning
E
E
E1
E1
E2
2/23/12
E2
59
2/23/12
60
10
Part 4A: Artificial Neural Network Learning
2/23/12
Testing Generalization
Problem of Rote Learning
error
error on
test data
Available
Data
Training
Data
Domain
error on
training
data
Test
Data
epoch
stop training here
2/23/12
61
2/23/12
A Few Random Tips
Improving Generalization
Training
Data
Available
Data
62
•  Too few neurons and the ANN may not be able to
decrease the error enough
•  Too many neurons can lead to rote learning
•  Preprocess data to:
–  standardize
–  eliminate irrelevant information
–  capture invariances
–  keep relevant information
Domain
Test Data
Validation Data
•  If stuck in local min., restart with different random
weights
2/23/12
63
2/23/12
64
Beyond Back-Propagation
•  Adaptive Learning Rate
•  Adaptive Architecture
Run Example BP Learning
–  Add/delete hidden neurons
–  Add/delete hidden layers
•  Radial Basis Function Networks
•  Recurrent BP
•  Etc., etc., etc.…
2/23/12
65
2/23/12
66
11
Part 4A: Artificial Neural Network Learning
2/23/12
Can ANNs Exceed the “Turing Limit”?
What is the Power of
Artificial Neural Networks?
•  There are many results, which depend sensitively on
assumptions; for example:
•  Finite NNs with real-valued weights have super-Turing
power (Siegelmann & Sontag ‘94)
•  Recurrent nets with Gaussian noise have sub-Turing power
(Maass & Sontag ‘99)
•  Finite recurrent nets with real weights can recognize all
languages, and thus are super-Turing (Siegelmann ‘99)
•  Stochastic nets with rational weights have super-Turing
power (but only P/POLY, BPP/log*) (Siegelmann ‘99)
•  But computing classes of functions is not a very relevant
way to evaluate the capabilities of neural computation
•  With respect to Turing machines?
•  As function approximators?
2/23/12
67
A Universal Approximation Theorem
Suppose f is a continuous function on [0,1]
Suppose σ is a nonconstant, bounded,
monotone increasing real function on ℜ.
€
•  Conclusion: One hidden layer is sufficient
to approximate any continuous function
arbitrarily closely
∃a ∈ ℜ m , b ∈ ℜ n , W ∈ ℜ m×n such that if
m
⎛ n
⎞
F ( x1,…, x n ) = ∑ aiσ ⎜⎜ ∑W ij x j + b j ⎟⎟
i=1
⎝ j=1
⎠
€
1
2/23/12
€
xn
n
(see, e.g., Haykin, N.Nets 2/e, 208–9)
b1
x1
[i.e., F (x) = a ⋅ σ (Wx + b)]
€then F ( x ) − f ( x ) < ε for all x ∈ [0,1]
68
One Hidden Layer is Sufficient
n
For any ε > 0, there is an m such that
€
2/23/12
69
2/23/12
Σ
σ
Σ
σ
a1
a2
Σ
am
Wmn
Σ
σ
70
€
The Golden Rule of Neural Nets
Neural Networks are the
second-best way
to do everything!
2/23/12
IVB
71
12
Download