
II. Neural Network Learning
2014/1/17
A. Neural Network Learning
Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Feedforward multilayer networks
Feedforward Network
[Figure: feedforward network with an input layer, several hidden layers, and an output layer]
Typical Artificial Neuron
[Figure: neuron with inputs, connection weights, threshold, and output]
Typical Artificial Neuron
[Figure: linear combination of weighted inputs gives the net input (local field), which passes through the activation function]
Equations

Net input:
$$h_i = \left(\sum_{j=1}^{n} w_{ij} s_j\right) - \theta_i, \qquad \mathbf{h} = \mathbf{W}\mathbf{s} - \boldsymbol{\theta}$$

Neuron output:
$$s'_i = \sigma(h_i), \qquad \mathbf{s}' = \sigma(\mathbf{h})$$
Single-Layer Perceptron
[Figure: single-layer perceptron: input units fully connected to output units]
Variables
[Figure: perceptron with inputs x_1 … x_n, weights w_1 … w_n, summation Σ giving net input h, threshold θ, step function Θ, and output y]
Single-Layer Perceptron Equations

Binary threshold activation function:
$$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$$

Hence,
$$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} \;=\; \begin{cases} 1, & \text{if } \mathbf{w}\cdot\mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w}\cdot\mathbf{x} \le \theta \end{cases}$$
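The perceptron output rule can be sketched in a few lines of Python. The weights and threshold below implement a 2-input AND gate; they are hand-picked for illustration, not taken from the slides.

```python
import numpy as np

def perceptron_output(w, x, theta):
    """Binary threshold unit: y = Theta(w . x - theta)."""
    return 1 if np.dot(w, x) > theta else 0

# Hand-chosen weights and threshold realizing an AND gate
w = np.array([1.0, 1.0])
theta = 1.5
print([perceptron_output(w, np.array(x), theta)
       for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 0, 0, 1]
```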
2D Weight Vector

$$\mathbf{w}\cdot\mathbf{x} = \|\mathbf{w}\|\,\|\mathbf{x}\|\cos\varphi$$

Let $v = \|\mathbf{x}\|\cos\varphi$ (the length of the projection of $\mathbf{x}$ onto $\mathbf{w}$). Then:

$$\mathbf{w}\cdot\mathbf{x} = \|\mathbf{w}\|\,v$$

$$\mathbf{w}\cdot\mathbf{x} > \theta \;\Leftrightarrow\; \|\mathbf{w}\|\,v > \theta \;\Leftrightarrow\; v > \theta/\|\mathbf{w}\|$$

[Figure: in the (w_1, w_2) plane, the decision line lies perpendicular to w at distance θ/‖w‖ from the origin, separating the + region from the − region]
N-Dimensional Weight Vector
[Figure: weight vector w as the normal vector to the separating hyperplane, with + and − half-spaces]
Goal of Perceptron Learning
• Suppose we have training patterns x^1, x^2, …, x^P with corresponding desired outputs y^1, y^2, …, y^P
• where x^p ∈ {0, 1}^n, y^p ∈ {0, 1}
• We want to find w, θ such that y^p = Θ(w⋅x^p − θ) for p = 1, …, P
Treating Threshold as Weight

$$h = \left(\sum_{j=1}^{n} w_j x_j\right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$$

[Figure: perceptron with inputs x_1 … x_n, weights w_1 … w_n, net input h, threshold θ, step function Θ, and output y]
Treating Threshold as Weight

$$h = \left(\sum_{j=1}^{n} w_j x_j\right) - \theta$$

Let $x_0 = -1$ and $w_0 = \theta$. Then

$$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}$$

[Figure: the same perceptron with an extra constant input x_0 = −1 whose weight w_0 = θ absorbs the threshold]
Augmented Vectors

$$\tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad \tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}$$

We want $y^p = \Theta(\tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}^p)$, for $p = 1, \ldots, P$.
Reformulation as Positive Examples

We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.
Want $\tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}^p \le 0$ for negative.
Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive, $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative.
Want $\tilde{\mathbf{w}}\cdot\mathbf{z}^p \ge 0$, for $p = 1, \ldots, P$.
This is a hyperplane through the origin with all $\mathbf{z}^p$ on one side.
Adjustment of Weight Vector
[Figure: example vectors z^1 … z^11 scattered around the weight vector]
Outline of Perceptron Learning Algorithm
1. initialize weight vector randomly
2. until all patterns classified correctly, do:
   a) for p = 1, …, P do:
      1) if z^p classified correctly, do nothing
      2) else adjust weight vector to be closer to correct classification
Weight Adjustment
[Figure: a misclassified z^p pulls the weight vector toward it: w̃′ = w̃ + ηz^p, then w̃″ = w̃′ + ηz^p, each step rotating the weight vector closer to z^p]
Improvement in Performance

$$\tilde{\mathbf{w}}'\cdot\mathbf{z}^p = (\tilde{\mathbf{w}} + \eta\mathbf{z}^p)\cdot\mathbf{z}^p = \tilde{\mathbf{w}}\cdot\mathbf{z}^p + \eta\,\mathbf{z}^p\cdot\mathbf{z}^p = \tilde{\mathbf{w}}\cdot\mathbf{z}^p + \eta\,\|\mathbf{z}^p\|^2 > \tilde{\mathbf{w}}\cdot\mathbf{z}^p$$
Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate)
• Note: only applies if positive & negative examples are linearly separable
NetLogo Simulation of Perceptron Learning
Run Perceptron-Geometry.nlogo
Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates
• Therefore an MLP can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert criticism of perceptrons
• At the time, no one had succeeded in developing an MLP learning algorithm
Hyperpolyhedral Classes
[Figure: class regions formed as intersections of half-spaces]
Credit Assignment Problem
[Figure: feedforward network with input layer, hidden layers, output layer, and the desired output compared at the output layer]
How do we adjust the weights of the hidden layers?
NetLogo Demonstration of Back-Propagation Learning
Run Artificial Neural Net.nlogo
Adaptive System
[Figure: system S with control parameters P_1 … P_k … P_m, an evaluation function F (fitness, figure of merit), and a control algorithm C that adjusts the parameters]
Gradient

$\partial F / \partial P_k$ measures how $F$ is altered by variation of $P_k$.

$$\nabla F = \begin{pmatrix} \partial F/\partial P_1 \\ \vdots \\ \partial F/\partial P_k \\ \vdots \\ \partial F/\partial P_m \end{pmatrix}$$

$\nabla F$ points in the direction of maximum local increase in $F$.
Gradient Ascent on Fitness Surface
[Figure: fitness surface with a gradient-ascent path following ∇F from the − region toward the + region]
Gradient Ascent by Discrete Steps
[Figure: discrete steps along ∇F on the fitness surface]
Gradient Ascent is Local But Not Shortest
[Figure: the gradient-ascent path curves toward the maximum rather than taking the straight-line route]
Gradient Ascent Process

$$\dot{\mathbf{P}} = \eta\,\nabla F(\mathbf{P})$$

Change in fitness:

$$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m}\frac{\partial F}{\partial P_k}\frac{dP_k}{dt} = \sum_{k=1}^{m}(\nabla F)_k\,\dot{P}_k = \nabla F\cdot\dot{\mathbf{P}}$$

$$\dot{F} = \nabla F\cdot\eta\,\nabla F = \eta\,\|\nabla F\|^2 \ge 0$$

Therefore gradient ascent increases fitness (until it reaches a zero gradient).
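A discretized version of this process is easy to demonstrate. The fitness function below is my own toy example (a concave paraboloid with maximum at (1, −2)), not from the slides.

```python
import numpy as np

# Gradient ascent on the toy fitness F(P) = -(P1 - 1)^2 - (P2 + 2)^2,
# whose unique maximum is at P = (1, -2).
def grad_F(P):
    return np.array([-2 * (P[0] - 1), -2 * (P[1] + 2)])

P = np.zeros(2)
eta = 0.1
for _ in range(200):
    P = P + eta * grad_F(P)    # discrete step: P <- P + eta * grad F(P)

print(np.round(P, 3))          # converges to the maximizer (1, -2)
```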
General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:

$$0 < \dot{F} = \nabla F\cdot\dot{\mathbf{P}} = \|\nabla F\|\,\|\dot{\mathbf{P}}\|\cos\varphi$$

where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.

Hence we need $\cos\varphi > 0$, i.e., $\varphi < 90°$.
General Ascent on Fitness Surface
[Figure: ascent paths on the fitness surface that stay within 90° of ∇F]
Fitness as Minimum Error

Suppose for $Q$ different inputs we have target outputs $\mathbf{t}^1, \ldots, \mathbf{t}^Q$.
Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \ldots, \mathbf{y}^Q$.
Suppose $D(\mathbf{t},\mathbf{y}) \in [0,\infty)$ measures the difference between target & actual outputs.
Let $E^q = D(\mathbf{t}^q,\mathbf{y}^q)$ be the error on the qth sample.

$$\text{Let } F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[\mathbf{t}^q,\mathbf{y}^q(\mathbf{P})]$$
Gradient of Fitness

$$\nabla F = \nabla\!\left(-\sum_q E^q\right) = -\sum_q \nabla E^q$$

$$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k}D(\mathbf{t}^q,\mathbf{y}^q) = \sum_j\frac{\partial D(\mathbf{t}^q,\mathbf{y}^q)}{\partial y_j^q}\,\frac{\partial y_j^q}{\partial P_k} = \frac{d\,D(\mathbf{t}^q,\mathbf{y}^q)}{d\mathbf{y}^q}\cdot\frac{\partial\mathbf{y}^q}{\partial P_k} = \nabla_{\mathbf{y}^q}D(\mathbf{t}^q,\mathbf{y}^q)\cdot\frac{\partial\mathbf{y}^q}{\partial P_k}$$
Jacobian Matrix

Define the Jacobian matrix

$$\mathbf{J}^q = \begin{pmatrix}\dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m}\end{pmatrix}$$

Note $\mathbf{J}^q\in\Re^{n\times m}$ and $\nabla D(\mathbf{t}^q,\mathbf{y}^q)\in\Re^{n\times 1}$.

Since $(\nabla E^q)_k = \dfrac{\partial E^q}{\partial P_k} = \sum_j \dfrac{\partial D(\mathbf{t}^q,\mathbf{y}^q)}{\partial y_j^q}\,\dfrac{\partial y_j^q}{\partial P_k}$,

$$\therefore\ \nabla E^q = (\mathbf{J}^q)^{\mathsf T}\,\nabla D(\mathbf{t}^q,\mathbf{y}^q)$$
Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t},\mathbf{y}) = \|\mathbf{t}-\mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$.

$$\frac{\partial D(\mathbf{t},\mathbf{y})}{\partial y_j} = \frac{\partial}{\partial y_j}\sum_i (t_i - y_i)^2 = \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j} = \frac{d(t_j - y_j)^2}{d y_j} = -2(t_j - y_j)$$

$$\therefore\ \frac{d\,D(\mathbf{t},\mathbf{y})}{d\mathbf{y}} = 2(\mathbf{y}-\mathbf{t})$$
Gradient of Error on qth Input

$$\frac{\partial E^q}{\partial P_k} = \frac{d\,D(\mathbf{t}^q,\mathbf{y}^q)}{d\mathbf{y}^q}\cdot\frac{\partial\mathbf{y}^q}{\partial P_k} = 2(\mathbf{y}^q-\mathbf{t}^q)\cdot\frac{\partial\mathbf{y}^q}{\partial P_k} = 2\sum_j\left(y_j^q - t_j^q\right)\frac{\partial y_j^q}{\partial P_k}$$

$$\nabla E^q = 2\,(\mathbf{J}^q)^{\mathsf T}\,(\mathbf{y}^q - \mathbf{t}^q)$$
Recap

$$\dot{\mathbf{P}} = \eta\sum_q(\mathbf{J}^q)^{\mathsf T}(\mathbf{t}^q - \mathbf{y}^q)$$

To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q/\partial P_k$, which says how the jth output varies with the kth parameter (given the qth input).

The Jacobian depends on the specific form of the system, in this case a feedforward neural network.
Multilayer Notation
[Figure: layers connected in sequence, x^q = s^1 → s^2 → … → s^{L−1} → s^L = y^q, by weight matrices W^1, W^2, …, W^{L−2}, W^{L−1}]
Notation
• L layers of neurons labeled 1, …, L
• N_l neurons in layer l
• s^l = vector of outputs from neurons in layer l
• input layer s^1 = x^q (the input pattern)
• output layer s^L = y^q (the actual output)
• W^l = weights between layers l and l+1
• Problem: find out how the outputs y_i^q vary with the weights W_jk^l (l = 1, …, L−1)
Typical Neuron
[Figure: neuron i in layer l receives s_1^{l−1} … s_N^{l−1} through weights W_{i1}^{l−1} … W_{iN}^{l−1}; Σ produces the net input h_i^l, and σ produces the output s_i^l]
Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \ldots, 1$).
Delta Values

It is convenient to break the derivatives by the chain rule:

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l}\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$. So

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$
Output-Layer Neuron
[Figure: output neuron i with inputs s_1^{L−1} … s_N^{L−1} through weights W_{ij}^{L−1}; net input h_i^L, output s_i^L = y_i^q, compared with the target t_i^q to give the error E^q]
Output-Layer Derivatives (1)

$$\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L}\sum_k \left(s_k^L - t_k^q\right)^2 = \frac{d\left(s_i^L - t_i^q\right)^2}{d h_i^L} = 2\left(s_i^L - t_i^q\right)\frac{d s_i^L}{d h_i^L} = 2\left(s_i^L - t_i^q\right)\sigma'(h_i^L)$$
Output-Layer Derivatives (2)

$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}}\sum_k W_{ik}^{L-1}\,s_k^{L-1} = s_j^{L-1}$$

$$\therefore\ \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\,s_j^{L-1}, \quad \text{where } \delta_i^L = 2\left(s_i^L - t_i^q\right)\sigma'(h_i^L)$$
Hidden-Layer Neuron
[Figure: hidden neuron i in layer l with inputs s_j^{l−1} through weights W_{ij}^{l−1}; its output s_i^l fans out through weights W_{ki}^l to the neurons of layer l+1, which feed the error E^q]
Hidden-Layer Derivatives (1)

Recall

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}}\,\frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1}\,\frac{\partial h_k^{l+1}}{\partial h_i^l}$$

$$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = \frac{\partial W_{ki}^l s_i^l}{\partial h_i^l} = W_{ki}^l\,\frac{d\,\sigma(h_i^l)}{d h_i^l} = W_{ki}^l\,\sigma'(h_i^l)$$

$$\therefore\ \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l\,\sigma'(h_i^l) = \sigma'(h_i^l)\sum_k \delta_k^{l+1} W_{ki}^l$$
Hidden-Layer Derivatives (2)

$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}}\sum_k W_{ik}^{l-1}\,s_k^{l-1} = \frac{d\,W_{ij}^{l-1} s_j^{l-1}}{d W_{ij}^{l-1}} = s_j^{l-1}$$

$$\therefore\ \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,s_j^{l-1}, \quad \text{where } \delta_i^l = \sigma'(h_i^l)\sum_k \delta_k^{l+1} W_{ki}^l$$
Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid).

$$D_h s = D_h\left[1 + \exp(-\alpha h)\right]^{-1} = -\left[1 + \exp(-\alpha h)\right]^{-2} D_h\left(1 + e^{-\alpha h}\right) = -\left(1 + e^{-\alpha h}\right)^{-2}\left(-\alpha e^{-\alpha h}\right) = \alpha\,\frac{e^{-\alpha h}}{\left(1 + e^{-\alpha h}\right)^2}$$

$$= \alpha\,\frac{1}{1 + e^{-\alpha h}}\cdot\frac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s\left(\frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}}\right) = \alpha s(1 - s)$$
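The identity σ′(h) = αs(1 − s) is easy to verify numerically against a central finite difference (the gain α and the test point h are arbitrary choices):

```python
import numpy as np

# Numerical check of sigma'(h) = alpha * s * (1 - s) for the logistic sigmoid
alpha = 2.0
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))

h = 0.7
eps = 1e-6
numeric = (sigma(h + eps) - sigma(h - eps)) / (2 * eps)  # central difference
s = sigma(h)
analytic = alpha * s * (1 - s)
print(abs(numeric - analytic) < 1e-8)  # → True
```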
Summary of Back-Propagation Algorithm

Output layer:

$$\delta_i^L = 2\alpha\,s_i^L\left(1 - s_i^L\right)\left(s_i^L - t_i^q\right), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\,s_j^{L-1}$$

Hidden layers:

$$\delta_i^l = \alpha\,s_i^l\left(1 - s_i^l\right)\sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,s_j^{l-1}$$
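These delta formulas can be checked against finite-difference gradients on a tiny network. The 2-3-2 shapes, input, and target below are my own arbitrary test values, not from the slides; the point is that the analytic gradients δ_i s_j agree with numerical ones.

```python
import numpy as np

# Verify the back-propagation formulas on a tiny 2-3-2 network
# (logistic sigmoid with gain alpha, squared-error loss).
rng = np.random.default_rng(0)
alpha = 1.5
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))

W1 = rng.normal(size=(3, 2))   # W1[i, j]: input j -> hidden i
W2 = rng.normal(size=(2, 3))   # W2[i, j]: hidden j -> output i
x = np.array([0.2, -0.4])
t = np.array([1.0, 0.0])

def error(W1, W2):
    s2 = sigma(W1 @ x)
    s3 = sigma(W2 @ s2)
    return np.sum((s3 - t) ** 2), s2, s3

E, s2, s3 = error(W1, W2)
d3 = 2 * alpha * s3 * (1 - s3) * (s3 - t)   # output-layer delta
d2 = alpha * s2 * (1 - s2) * (W2.T @ d3)    # hidden-layer delta
gW2 = np.outer(d3, s2)                      # dE/dW2_ij = delta_i * s_j
gW1 = np.outer(d2, x)                       # dE/dW1_ij = delta_i * x_j

# finite-difference check on one weight of each matrix
eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
num1 = (error(W1p, W2)[0] - E) / eps
W2p = W2.copy(); W2p[1, 2] += eps
num2 = (error(W1, W2p)[0] - E) / eps
print(abs(num1 - gW1[0, 1]) < 1e-4, abs(num2 - gW2[1, 2]) < 1e-4)
# → True True
```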
Output-Layer Computation

$$\Delta W_{ij}^{L-1} = \eta\,\delta_i^L\,s_j^{L-1}$$

[Figure: output neuron augmented with the delta computation δ_i^L = 2αs_i^L(1 − s_i^L)(t_i^q − s_i^L) feeding the weight update ΔW_{ij}^{L−1}; note the sign: since ΔW = −η ∂E^q/∂W, the update uses (t_i^q − s_i^L) rather than (s_i^L − t_i^q)]
Hidden-Layer Computation

$$\Delta W_{ij}^{l-1} = \eta\,\delta_i^l\,s_j^{l-1}$$

[Figure: hidden neuron augmented with the delta computation δ_i^l = αs_i^l(1 − s_i^l) Σ_k δ_k^{l+1} W_{ki}^l, combining the deltas back-propagated from layer l+1 into the weight update ΔW_{ij}^{l−1}]
Training Procedures
• Batch Learning
  – on each epoch (pass through all the training pairs), weight changes for all patterns are accumulated
  – weight matrices updated at end of epoch
  – accurate computation of gradient
• Online Learning
  – weights are updated after back-prop of each training pair
  – usually randomize order for each epoch
  – approximation of gradient
• In practice it doesn't make much difference
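The structural difference between the two procedures can be shown on a toy problem. The one-parameter least-squares model below is my own illustration (not from the slides); both loops converge to the same answer here.

```python
import numpy as np

# Batch vs. online updates fitting y = a*x (true a = 2) with squared error
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X
eta = 0.01

a_batch = 0.0
for epoch in range(500):
    # accumulate weight changes over the whole epoch, then update once
    grad = sum(2 * (a_batch * x - y) * x for x, y in zip(X, Y))
    a_batch -= eta * grad

a_online = 0.0
for epoch in range(500):
    # update immediately after each training pair
    for x, y in zip(X, Y):
        a_online -= eta * 2 * (a_online * x - y) * x

print(round(a_batch, 3), round(a_online, 3))  # both ≈ 2.0
```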
Summation of Error Surfaces
[Figure: total error surface E as the sum of per-pattern error surfaces E1 and E2]
Gradient Computation in Batch Learning
[Figure: batch learning descends the summed error surface E = E1 + E2]
Gradient Computation in Online Learning
[Figure: online learning alternates descent steps on the individual surfaces E1 and E2]
Testing Generalization
[Figure: available data, drawn from the domain, split into training data and test data]
Problem of Rote Learning
[Figure: error vs. epoch; error on the training data keeps falling while error on the test data begins to rise; stop training at the minimum of the test-data error]
Improving Generalization
[Figure: available data, drawn from the domain, split into training data, validation data, and test data]
A Few Random Tips
• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights
Run Example BP Learning
Beyond Back-Propagation
• Adaptive Learning Rate
• Adaptive Architecture
– Add/delete hidden neurons
– Add/delete hidden layers
• Radial Basis Function Networks
• Recurrent BP
• Etc., etc., etc.…
What is the Power of
Artificial Neural Networks?
• With respect to Turing machines?
• As function approximators?
Can ANNs Exceed the "Turing Limit"?
• There are many results, which depend sensitively on assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
• Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation
A Universal Approximation Theorem

Suppose $f$ is a continuous function on $[0,1]^n$.
Suppose $\sigma$ is a nonconstant, bounded, monotone-increasing real function on $\Re$.

For any $\varepsilon > 0$, there is an $m$ such that $\exists\,\mathbf{a}\in\Re^m,\ \mathbf{b}\in\Re^m,\ \mathbf{W}\in\Re^{m\times n}$ such that if

$$F(x_1,\ldots,x_n) = \sum_{i=1}^{m} a_i\,\sigma\!\left(\sum_{j=1}^{n} W_{ij}\,x_j + b_i\right)$$

[i.e., $F(\mathbf{x}) = \mathbf{a}\cdot\sigma(\mathbf{W}\mathbf{x}+\mathbf{b})$]

then $\left|F(\mathbf{x}) - f(\mathbf{x})\right| < \varepsilon$ for all $\mathbf{x}\in[0,1]^n$.

(see, e.g., Haykin, Neural Networks 2/e, pp. 208–9)
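The theorem's form F(x) = a · σ(Wx + b) is easy to demonstrate in one dimension: place random sigmoid "steps" on [0, 1] and fit the output weights a by least squares. The target function, hidden-layer size, and weight scales below are my own illustrative choices, not part of the theorem.

```python
import numpy as np

# One-hidden-layer approximation F(x) = a . sigma(W x + b) of a
# continuous f on [0, 1], with random hidden weights and
# least-squares output weights.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)              # target function on [0, 1]
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))       # bounded, monotone, nonconstant

m = 50                                           # hidden units
W = rng.uniform(5.0, 30.0, size=m)               # n = 1, so W is a vector
b = -W * rng.uniform(0.0, 1.0, size=m)           # sigmoid step centered at -b/W
x = np.linspace(0.0, 1.0, 200)

Phi = sigma(np.outer(x, W) + b)                  # 200 x m features sigma(Wx + b)
a, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)   # fit output weights a
err = np.max(np.abs(Phi @ a - f(x)))
print(err < 0.1)  # sup-norm error on the grid is small for moderate m
```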
One Hidden Layer is Sufficient
• Conclusion: One hidden layer is sufficient to approximate any continuous function arbitrarily closely
[Figure: inputs x_1 … x_n and a constant input 1 (carrying the biases b_i) feed m hidden σ units through weights W; a linear output unit sums the hidden outputs with weights a_1, a_2, …, a_m]
The Golden Rule of Neural Nets
Neural networks are the second-best way to do everything!