IV. Neural Networks and Learning A. Artificial Neural Net Learning Supervised Learning

Part 4A: Neural Network Learning
3/23/16
IV. Neural Networks and Learning
A. Artificial Neural Net Learning
Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to
other inputs
• Good example: pattern recognition
• Feedforward multilayer networks
Feedforward Network
[Figure: feedforward network with an input layer, hidden layers, and an output layer]
Typical Artificial Neuron
[Figure: a neuron with inputs, connection weights, a threshold, and an output. A linear combination of the inputs gives the net input (local field), which passes through an activation function.]
Equations
Net input:
$$h_i = \left(\sum_{j=1}^{n} w_{ij} s_j\right) - \theta, \qquad \mathbf{h} = \mathbf{W}\mathbf{s} - \boldsymbol{\theta}$$
Neuron output:
$$s'_i = \sigma(h_i), \qquad \mathbf{s}' = \sigma(\mathbf{h})$$
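The net-input and output equations for one layer can be sketched in plain Python as follows. This is only an illustration: the logistic function stands in for the activation σ, and the names `net_input`, `layer_output`, and the example numbers are assumptions, not part of the slides.

```python
import math

def net_input(W, s, theta):
    """Net input h_i = sum_j W[i][j] * s[j] - theta[i] for every neuron i."""
    return [sum(w_ij * s_j for w_ij, s_j in zip(row, s)) - th
            for row, th in zip(W, theta)]

def logistic(h):
    """Logistic sigmoid, used here as one possible activation function."""
    return 1.0 / (1.0 + math.exp(-h))

def layer_output(W, s, theta, sigma=logistic):
    """Neuron outputs s'_i = sigma(h_i)."""
    return [sigma(h) for h in net_input(W, s, theta)]

# Hypothetical example: two neurons, two inputs.
W = [[1.0, -1.0], [0.5, 0.5]]
s = [1.0, 2.0]
theta = [0.0, 0.5]
h = net_input(W, s, theta)       # net inputs of the two neurons
out = layer_output(W, s, theta)  # their outputs after the activation
```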
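Single-Layer Perceptron
[Figure: single-layer perceptron]
Variables
[Figure: inputs $x_1, \ldots, x_j, \ldots, x_n$ with weights $w_1, \ldots, w_j, \ldots, w_n$ feed a summation Σ that produces the net input $h$; the step function Θ with threshold θ produces the output $y$]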
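Single Layer Perceptron Equations
Binary threshold activation function:
$$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$$
Hence,
$$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} \;=\; \begin{cases} 1, & \text{if } \mathbf{w}\cdot\mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w}\cdot\mathbf{x} \le \theta \end{cases}$$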
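2D Weight Vector
[Figure: weight vector $\mathbf{w}$ in the $(w_1, w_2)$ plane and an input $\mathbf{x}$ at angle φ to $\mathbf{w}$; $v$ is the length of the projection of $\mathbf{x}$ onto $\mathbf{w}$; the + and – regions lie on either side of the decision boundary]
$$\mathbf{w}\cdot\mathbf{x} = \|\mathbf{w}\|\,\|\mathbf{x}\|\cos\phi, \qquad \cos\phi = \frac{v}{\|\mathbf{x}\|}, \qquad \text{so } \mathbf{w}\cdot\mathbf{x} = \|\mathbf{w}\|\,v$$
$$\mathbf{w}\cdot\mathbf{x} > \theta \iff \|\mathbf{w}\|\,v > \theta \iff v > \theta / \|\mathbf{w}\|$$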
N-Dimensional Weight Vector
[Figure: $\mathbf{w}$ is the normal vector of the separating hyperplane, with the + region on one side and the – region on the other]
Goal of Perceptron Learning
• Suppose we have training patterns $\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^P$ with corresponding desired outputs $y^1, y^2, \ldots, y^P$
• where $\mathbf{x}^p \in \{0,1\}^n$, $y^p \in \{0,1\}$
• We want to find $\mathbf{w}, \theta$ such that $y^p = \Theta(\mathbf{w}\cdot\mathbf{x}^p - \theta)$ for $p = 1, \ldots, P$
Treating Threshold as Weight
$$h = \left(\sum_{j=1}^{n} w_j x_j\right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$$
[Figure: the perceptron redrawn with the threshold as an extra weight $w_0 = -\theta$ on a constant input $x_0 = 1$]
Let $x_0 = 1$ and $w_0 = -\theta$. Then
$$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}$$
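Augmented Vectors
$$\tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad \tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}$$
(Placing θ and −1 in the first components contributes the same term, −θ, as the choice $x_0 = 1$, $w_0 = -\theta$.)
We want $y^p = \Theta(\tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}^p)$, $p = 1, \ldots, P$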
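Reformulation as Positive Examples
We have positive ($y^p = 1$) and negative ($y^p = 0$) examples
Want $\tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}}\cdot\tilde{\mathbf{x}}^p \le 0$ for negative
Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive, $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative
Want $\tilde{\mathbf{w}}\cdot\mathbf{z}^p \ge 0$, for $p = 1, \ldots, P$
Hyperplane through origin with all $\mathbf{z}^p$ on one side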
Adjustment of Weight Vector
[Figure: the vectors $\mathbf{z}^1, \ldots, \mathbf{z}^{11}$ scattered in the plane]
Outline of Perceptron Learning Algorithm
1. initialize weight vector randomly
2. until all patterns classified correctly, do:
   a) for p = 1, …, P do:
      1) if $\mathbf{z}^p$ classified correctly, do nothing
      2) else adjust weight vector to be closer to correct classification
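The outline of the perceptron learning algorithm can be sketched as below. This is a minimal illustration under stated assumptions: it uses the augmented convention $x_0 = 1$, $w_0 = -\theta$, adjusts the weights by adding $\eta \mathbf{z}^p$, and treats $\tilde{\mathbf{w}}\cdot\mathbf{z}^p \le 0$ as a misclassification so that positive examples end up strictly positive. The function names, the learning rate, the epoch limit, and the AND training set are all hypothetical choices.

```python
import random

def train_perceptron(patterns, targets, eta=0.1, max_epochs=1000, seed=0):
    """Return augmented weights [w0, w1, ..., wn] with w0 = -theta, or None."""
    rng = random.Random(seed)
    n = len(patterns[0])
    # 1. initialize weight vector randomly
    w = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]
    # z^p = augmented x^p for positive examples, its negation for negative ones
    zs = []
    for x, y in zip(patterns, targets):
        x_aug = [1.0] + list(x)
        zs.append(x_aug if y == 1 else [-v for v in x_aug])
    # 2. repeat until a full pass makes no corrections
    for _ in range(max_epochs):
        corrections = 0
        for z in zs:
            if sum(wi * zi for wi, zi in zip(w, z)) <= 0:   # misclassified
                w = [wi + eta * zi for wi, zi in zip(w, z)]  # move toward z^p
                corrections += 1
        if corrections == 0:
            return w
    return None  # no separating weights found within max_epochs

def classify(w, x):
    """Theta(w . x - theta) written with the augmented weight vector."""
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else 0

# Hypothetical training set: logical AND, which is linearly separable.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 0, 0, 1]
w_and = train_perceptron(X, Y)
```

If the patterns are not linearly separable (XOR, for example), the loop never reaches a correction-free pass and the sketch returns `None` after `max_epochs`.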
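Weight Adjustment
[Figure: adding $\eta\mathbf{z}^p$ to $\tilde{\mathbf{w}}$ gives $\tilde{\mathbf{w}}'$, and adding $\eta\mathbf{z}^p$ again gives $\tilde{\mathbf{w}}''$, each step turning the weight vector toward $\mathbf{z}^p$]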
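Improvement in Performance
$$\begin{aligned} \tilde{\mathbf{w}}'\cdot\mathbf{z}^p &= (\tilde{\mathbf{w}} + \eta\mathbf{z}^p)\cdot\mathbf{z}^p \\ &= \tilde{\mathbf{w}}\cdot\mathbf{z}^p + \eta\,\mathbf{z}^p\cdot\mathbf{z}^p \\ &= \tilde{\mathbf{w}}\cdot\mathbf{z}^p + \eta\,\|\mathbf{z}^p\|^2 \\ &> \tilde{\mathbf{w}}\cdot\mathbf{z}^p \end{aligned}$$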
Perceptron Learning Theorem
• If there is a set of weights that will solve the
problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate)
• Note: only applies if positive & negative
examples are linearly separable
NetLogo Simulation of
Perceptron Learning
Run Perceptron-Geometry.nlogo
Classification Power of
Multilayer Perceptrons
• Perceptrons can function as logic gates
• Therefore MLP can form intersections,
unions, differences of linearly-separable
regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert criticism of perceptrons
• At the time, no one had succeeded in developing an MLP learning algorithm
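As an illustration of perceptrons acting as logic gates, the sketch below builds AND, OR, and NOT from single threshold units and composes them into XOR, which no single perceptron can compute. The particular weights and thresholds are one hypothetical choice among many.

```python
def perceptron(w, theta):
    """Threshold unit: outputs 1 if w . x > theta, else 0."""
    return lambda *x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0

AND = perceptron([1, 1], 1.5)    # fires only when both inputs are 1
OR = perceptron([1, 1], 0.5)     # fires when at least one input is 1
NOT = perceptron([-1], -0.5)     # fires only when the input is 0

def XOR(a, b):
    """Two-layer combination: (a OR b) AND NOT (a AND b)."""
    return AND(OR(a, b), NOT(AND(a, b)))
```

XOR here is the difference of two linearly separable regions (the OR region minus the AND region), which is the kind of combination the bullets above describe.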
Hyperpolyhedral Classes
Credit Assignment Problem
[Figure: multilayer network with an input layer, hidden layers, and an output layer; the output is compared with the desired output]
How do we adjust the weights of the hidden layers?
NetLogo Demonstration of
Back-Propagation Learning
Run Artificial Neural Net.nlogo
Adaptive System
[Figure: a system S with control parameters $P_1, \ldots, P_k, \ldots, P_m$; an evaluation function F (fitness, figure of merit) scores the system; a control algorithm C adjusts the control parameters]
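Gradient
$\dfrac{\partial F}{\partial P_k}$ measures how F is altered by variation of $P_k$
$$\nabla F = \begin{pmatrix} \partial F / \partial P_1 \\ \vdots \\ \partial F / \partial P_k \\ \vdots \\ \partial F / \partial P_m \end{pmatrix}$$
$\nabla F$ points in direction of maximum local increase in F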
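Gradient Ascent on Fitness Surface
[Figure: fitness surface with + and – regions; the gradient ascent path follows ∇F]
Gradient Ascent by Discrete Steps
[Figure: the same surface, with the ascent taken in discrete steps along ∇F]
Gradient Ascent is Local But Not Shortest
[Figure: a gradient ascent path from the – region to the + region that is not the shortest route]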
Gradient Ascent Process
$$\dot{\mathbf{P}} = \eta\,\nabla F(\mathbf{P})$$
Change in fitness:
$$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k}\frac{dP_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k\,\dot{P}_k$$
$$\dot{F} = \nabla F\cdot\dot{\mathbf{P}}$$
$$\dot{F} = \nabla F\cdot\eta\,\nabla F = \eta\,\|\nabla F\|^2 \ge 0$$
Therefore gradient ascent increases fitness (until reaches 0 gradient)
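A discrete-step version of the gradient ascent process can be sketched as follows. The fitness function here is a made-up concave quadratic with its peak at (1, −2), and the step size and iteration count are arbitrary assumptions; the point is only that each step $\mathbf{P} \leftarrow \mathbf{P} + \eta\nabla F(\mathbf{P})$ does not decrease F when η is small enough.

```python
def fitness(P):
    """Hypothetical fitness surface with a single peak at P = (1, -2)."""
    return -(P[0] - 1.0) ** 2 - 2.0 * (P[1] + 2.0) ** 2

def gradient(P):
    """Gradient of the hypothetical fitness: (dF/dP1, dF/dP2)."""
    return [-2.0 * (P[0] - 1.0), -4.0 * (P[1] + 2.0)]

def gradient_ascent(P, eta=0.1, steps=200):
    """Take discrete steps P <- P + eta * grad F(P), recording F after each."""
    history = [fitness(P)]
    for _ in range(steps):
        g = gradient(P)
        P = [p + eta * gk for p, gk in zip(P, g)]
        history.append(fitness(P))
    return P, history

P_final, F_history = gradient_ascent([0.0, 0.0])
```

With too large an η the discrete steps can overshoot the peak, which the continuous-time argument above does not capture.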
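General Ascent in Fitness
Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:
$$0 < \dot{F} = \nabla F\cdot\dot{\mathbf{P}} = \|\nabla F\|\,\|\dot{\mathbf{P}}\|\cos\varphi$$
where φ is angle between $\nabla F$ and $\dot{\mathbf{P}}$
Hence we need $\cos\varphi > 0$, or $|\varphi| < 90^\circ$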
General Ascent on Fitness Surface
[Figure: fitness surface with + and – regions; any path whose direction stays within 90° of ∇F ascends]
Fitness as Minimum Error
Suppose for Q different inputs we have target outputs $\mathbf{t}^1, \ldots, \mathbf{t}^Q$
Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \ldots, \mathbf{y}^Q$
Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures difference between target & actual outputs
Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be error on qth sample
Let
$$F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]$$
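Gradient of Fitness
$$\nabla F = \nabla\left(-\sum_q E^q\right) = -\sum_q \nabla E^q$$
$$\begin{aligned} \frac{\partial E^q}{\partial P_k} &= \frac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q}\,\frac{\partial y_j^q}{\partial P_k} \\ &= \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d\mathbf{y}^q}\cdot\frac{\partial \mathbf{y}^q}{\partial P_k} \\ &= \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q)\cdot\frac{\partial \mathbf{y}^q}{\partial P_k} \end{aligned}$$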
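Jacobian Matrix
Define Jacobian matrix
$$\mathbf{J}^q = \begin{pmatrix} \dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m} \end{pmatrix}$$
Note $\mathbf{J}^q \in \Re^{n\times m}$ and $\nabla D(\mathbf{t}^q, \mathbf{y}^q) \in \Re^{n\times 1}$
Since
$$(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial y_j^q}{\partial P_k}\,\frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q},$$
$$\therefore\; \nabla E^q = (\mathbf{J}^q)^T\,\nabla D(\mathbf{t}^q, \mathbf{y}^q)$$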
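Derivative of Squared Euclidean Distance
Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$
$$\frac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j} = \frac{\partial}{\partial y_j}\sum_i (t_i - y_i)^2 = \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j} = \frac{d (t_j - y_j)^2}{d y_j} = -2(t_j - y_j)$$
$$\therefore\; \frac{d D(\mathbf{t}, \mathbf{y})}{d\mathbf{y}} = 2(\mathbf{y} - \mathbf{t})$$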
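Gradient of Error on qth Input
$$\begin{aligned} \frac{\partial E^q}{\partial P_k} &= \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d\mathbf{y}^q}\cdot\frac{\partial \mathbf{y}^q}{\partial P_k} \\ &= 2(\mathbf{y}^q - \mathbf{t}^q)\cdot\frac{\partial \mathbf{y}^q}{\partial P_k} \\ &= 2\sum_j (y_j^q - t_j^q)\,\frac{\partial y_j^q}{\partial P_k} \end{aligned}$$
$$\nabla E^q = 2(\mathbf{J}^q)^T(\mathbf{y}^q - \mathbf{t}^q)$$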
Recap
$$\dot{\mathbf{P}} = \eta\sum_q (\mathbf{J}^q)^T(\mathbf{t}^q - \mathbf{y}^q)$$
(the constant factor 2 from $\nabla E^q$ is absorbed into η)
To know how to decrease the differences between actual & desired outputs, we need to know elements of Jacobian, $\partial y_j^q / \partial P_k$, which says how jth output varies with kth parameter (given the qth input)
The Jacobian depends on the specific form of the system, in this case, a feedforward neural network
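The relation $\nabla E^q = 2(\mathbf{J}^q)^T(\mathbf{y}^q - \mathbf{t}^q)$ can be checked numerically on a toy system. In the sketch below the system $\mathbf{y}(\mathbf{P})$, its hand-written Jacobian, and the target are all invented for illustration; the analytic gradient is compared with a central finite difference.

```python
def system(P):
    """Hypothetical system with two parameters and two outputs."""
    return [P[0] * P[1], P[0] + P[1] ** 2]

def jacobian(P):
    """J[j][k] = d y_j / d P_k for the hypothetical system above."""
    return [[P[1], P[0]],
            [1.0, 2.0 * P[1]]]

def error(P, t):
    """Squared Euclidean distance between target t and output y(P)."""
    return sum((tj - yj) ** 2 for tj, yj in zip(t, system(P)))

def analytic_gradient(P, t):
    """grad E = 2 J^T (y - t)."""
    y, J = system(P), jacobian(P)
    diff = [yj - tj for yj, tj in zip(y, t)]
    return [2.0 * sum(J[j][k] * diff[j] for j in range(len(y)))
            for k in range(len(P))]

def numeric_gradient(P, t, eps=1e-6):
    """Central finite-difference estimate of grad E."""
    grad = []
    for k in range(len(P)):
        up = list(P); up[k] += eps
        down = list(P); down[k] -= eps
        grad.append((error(up, t) - error(down, t)) / (2.0 * eps))
    return grad

P0, target = [0.5, -1.5], [1.0, 2.0]
g_analytic = analytic_gradient(P0, target)
g_numeric = numeric_gradient(P0, target)
```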
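Multilayer Notation
[Figure: the input $\mathbf{x}^q$ enters as $\mathbf{s}^1$; weight matrices $\mathbf{W}^1, \mathbf{W}^2, \ldots, \mathbf{W}^{L-2}, \mathbf{W}^{L-1}$ connect successive layer outputs $\mathbf{s}^1, \mathbf{s}^2, \ldots, \mathbf{s}^{L-1}, \mathbf{s}^L$; the output $\mathbf{y}^q$ is $\mathbf{s}^L$]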
Notation
• L layers of neurons labeled 1, …, L
• $N_l$ neurons in layer l
• $\mathbf{s}^l$ = vector of outputs from neurons in layer l
• input layer $\mathbf{s}^1 = \mathbf{x}^q$ (the input pattern)
• output layer $\mathbf{s}^L = \mathbf{y}^q$ (the actual output)
• $\mathbf{W}^l$ = weights between layers l and l+1
• Problem: find out how outputs $y_i^q$ vary with weights $W_{jk}^l$ (l = 1, …, L–1)
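Typical Neuron
[Figure: neuron i in layer l. The outputs $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$ of layer l−1, weighted by $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$, are summed (Σ) to give $h_i^l$, and σ gives the output $s_i^l$]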
Error Back-Propagation
We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \ldots, 1$)
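Delta Values
Convenient to break derivatives by chain rule:
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l}\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$
Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$
So
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$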
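Output-Layer Neuron
[Figure: output neuron i. The outputs $s_1^{L-1}, \ldots, s_j^{L-1}, \ldots, s_N^{L-1}$, weighted by $W_{i1}^{L-1}, \ldots, W_{ij}^{L-1}, \ldots, W_{iN}^{L-1}$, are summed (Σ) to give $h_i^L$; σ gives $s_i^L = y_i^q$, which is compared with the target $t_i^q$ to give $E^q$]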
Output-Layer Derivatives (1)
$$\begin{aligned} \delta_i^L &= \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L}\sum_k (s_k^L - t_k^q)^2 \\ &= \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2(s_i^L - t_i^q)\,\frac{d s_i^L}{d h_i^L} \\ &= 2(s_i^L - t_i^q)\,\sigma'(h_i^L) \end{aligned}$$
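Output-Layer Derivatives (2)
$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}}\sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$$
$$\therefore\; \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}, \quad \text{where } \delta_i^L = 2(s_i^L - t_i^q)\,\sigma'(h_i^L)$$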
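Hidden-Layer Neuron
[Figure: hidden neuron i in layer l. The outputs $s_1^{l-1}, \ldots, s_j^{l-1}, \ldots, s_N^{l-1}$, weighted by $W_{i1}^{l-1}, \ldots, W_{ij}^{l-1}, \ldots, W_{iN}^{l-1}$, are summed (Σ) to give $h_i^l$; σ gives $s_i^l$, which feeds the next-layer neurons $s_1^{l+1}, \ldots, s_k^{l+1}, \ldots, s_N^{l+1}$ through weights $W_{1i}^l, \ldots, W_{ki}^l, \ldots, W_{Ni}^l$ and so influences $E^q$]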
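Hidden-Layer Derivatives (1)
Recall
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$
$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}}\,\frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1}\,\frac{\partial h_k^{l+1}}{\partial h_i^l}$$
$$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = \frac{\partial\, W_{ki}^l s_i^l}{\partial h_i^l} = W_{ki}^l\,\frac{d\,\sigma(h_i^l)}{d h_i^l} = W_{ki}^l\,\sigma'(h_i^l)$$
$$\therefore\; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l\,\sigma'(h_i^l) = \sigma'(h_i^l)\sum_k \delta_k^{l+1} W_{ki}^l$$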
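Hidden-Layer Derivatives (2)
$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}}\sum_k W_{ik}^{l-1} s_k^{l-1} = \frac{d\, W_{ij}^{l-1} s_j^{l-1}}{d W_{ij}^{l-1}} = s_j^{l-1}$$
$$\therefore\; \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}, \quad \text{where } \delta_i^l = \sigma'(h_i^l)\sum_k \delta_k^{l+1} W_{ki}^l$$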
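Derivative of Sigmoid
Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid)
$$\begin{aligned} D_h s &= D_h\left[1 + \exp(-\alpha h)\right]^{-1} = -\left[1 + \exp(-\alpha h)\right]^{-2} D_h\left(1 + e^{-\alpha h}\right) \\ &= -\left(1 + e^{-\alpha h}\right)^{-2}\left(-\alpha e^{-\alpha h}\right) = \alpha\,\frac{e^{-\alpha h}}{\left(1 + e^{-\alpha h}\right)^2} \\ &= \alpha\,\frac{1}{1 + e^{-\alpha h}}\,\frac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s\left(\frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}}\right) \\ &= \alpha s(1 - s) \end{aligned}$$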
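The identity $\sigma'(h) = \alpha s(1 - s)$ is easy to confirm numerically. The sketch below compares it with a central finite difference; the function names, the test points, and the step size `eps` are illustrative choices.

```python
import math

def sigmoid(h, alpha=1.0):
    """Logistic sigmoid s = 1 / (1 + exp(-alpha * h))."""
    return 1.0 / (1.0 + math.exp(-alpha * h))

def sigmoid_prime(h, alpha=1.0):
    """Derivative from the closed form alpha * s * (1 - s)."""
    s = sigmoid(h, alpha)
    return alpha * s * (1.0 - s)

def sigmoid_prime_numeric(h, alpha=1.0, eps=1e-6):
    """Central finite-difference estimate of the derivative."""
    return (sigmoid(h + eps, alpha) - sigmoid(h - eps, alpha)) / (2.0 * eps)

# Largest disagreement over a few sample points, with alpha = 2.
worst = max(abs(sigmoid_prime(h, 2.0) - sigmoid_prime_numeric(h, 2.0))
            for h in (-3.0, -0.5, 0.0, 0.7, 2.5))
```

The closed form matters in practice because the derivative is obtained from the already-computed output s, with no extra exponential.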
Summary of Back-Propagation Algorithm
Output layer:
$$\delta_i^L = 2\alpha s_i^L(1 - s_i^L)(s_i^L - t_i^q), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}$$
Hidden layers:
$$\delta_i^l = \alpha s_i^l(1 - s_i^l)\sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}$$
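The summary formulas can be turned into a short back-propagation sketch and checked against finite differences. Assumptions made here: the network is stored as a list of weight matrices (nested lists), layer indices are 0-based so `weights[l]` maps `outputs[l]` to `outputs[l + 1]`, there are no threshold terms (as in the derivation above), and the small example network is invented.

```python
import copy
import math

def forward(weights, x, alpha=1.0):
    """Return the outputs of every layer; outputs[0] is the input pattern."""
    outputs = [list(x)]
    for W in weights:
        s_prev = outputs[-1]
        h = [sum(w * s for w, s in zip(row, s_prev)) for row in W]
        outputs.append([1.0 / (1.0 + math.exp(-alpha * hi)) for hi in h])
    return outputs

def error(weights, x, t, alpha=1.0):
    """E = sum_i (s_i^L - t_i)^2 for one pattern."""
    y = forward(weights, x, alpha)[-1]
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t))

def backprop(weights, x, t, alpha=1.0):
    """Return dE/dW for every weight matrix using the delta recurrences."""
    outputs = forward(weights, x, alpha)
    # Output layer: delta_i = 2 * alpha * s_i * (1 - s_i) * (s_i - t_i)
    delta = [2.0 * alpha * s * (1.0 - s) * (s - ti)
             for s, ti in zip(outputs[-1], t)]
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        s_prev = outputs[l]
        # dE/dW_ij = delta_i * s_j (s_j from the layer feeding this one)
        grads[l] = [[d * sj for sj in s_prev] for d in delta]
        if l > 0:
            # Hidden layer: delta_i = alpha*s_i*(1-s_i) * sum_k delta_k W_ki
            W = weights[l]
            delta = [alpha * si * (1.0 - si) *
                     sum(delta[k] * W[k][i] for k in range(len(W)))
                     for i, si in enumerate(s_prev)]
    return grads

def numeric_gradient(weights, x, t, alpha=1.0, eps=1e-6):
    """Central finite differences on every weight, for checking only."""
    grads = []
    for l, W in enumerate(weights):
        G = []
        for i, row in enumerate(W):
            G_row = []
            for j in range(len(row)):
                up = copy.deepcopy(weights); up[l][i][j] += eps
                down = copy.deepcopy(weights); down[l][i][j] -= eps
                G_row.append((error(up, x, t, alpha) -
                              error(down, x, t, alpha)) / (2.0 * eps))
            G.append(G_row)
        grads.append(G)
    return grads

# Hypothetical 2-3-1 network and one training pair.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
x, t = [1.0, 0.5], [1.0]
analytic = backprop(weights, x, t)
numeric = numeric_gradient(weights, x, t)
max_diff = max(abs(a - b)
               for A, N in zip(analytic, numeric)
               for ra, rn in zip(A, N)
               for a, b in zip(ra, rn))
```

The backward loop visits each weight matrix once, so the full gradient costs about as much as one forward pass, whereas the finite-difference check needs two forward passes per weight.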
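Output-Layer Computation
$$\Delta W_{ij}^{L-1} = \eta\,\delta_i^L\, s_j^{L-1}$$
$$\delta_i^L = 2\alpha s_i^L(1 - s_i^L)(t_i^q - s_i^L)$$
(Here δ uses $t_i^q - s_i^L$, the negative of the error derivative, so that adding ΔW decreases the error.)
[Figure: data flow for the output-layer update. The difference between $s_i^L = y_i^q$ and the target $t_i^q$ is multiplied by $s_i^L$, by $1 - s_i^L$, and by 2α to form $\delta_i^L$; $\delta_i^L$ is multiplied by η and by $s_j^{L-1}$ to give $\Delta W_{ij}^{L-1}$]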
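Hidden-Layer Computation
$$\Delta W_{ij}^{l-1} = \eta\,\delta_i^l\, s_j^{l-1}$$
$$\delta_i^l = \alpha s_i^l(1 - s_i^l)\sum_k \delta_k^{l+1} W_{ki}^l$$
[Figure: data flow for the hidden-layer update. The deltas $\delta_1^{l+1}, \ldots, \delta_k^{l+1}, \ldots, \delta_N^{l+1}$ of the next layer are weighted by $W_{1i}^l, \ldots, W_{ki}^l, \ldots, W_{Ni}^l$ and summed (Σ); the sum is multiplied by $s_i^l$, by $1 - s_i^l$, and by α to form $\delta_i^l$; $\delta_i^l$ is multiplied by η and by $s_j^{l-1}$ to give $\Delta W_{ij}^{l-1}$]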
Training Procedures
• Batch Learning
– on each epoch (pass through all the training pairs), weight changes for all patterns accumulated
– weight matrices updated at end of epoch
– accurate computation of gradient
• Online Learning
– weights are updated after back-prop of each training pair
– usually randomize order for each epoch
– approximation of gradient
• In practice, the choice between the two doesn’t make much difference
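The difference between the two procedures is only in when the weights change, which the sketch below tries to make concrete. To keep it short it trains a single linear unit $y = \mathbf{w}\cdot\mathbf{x}$ on squared error rather than a full network; that model, the data, the learning rate, and the epoch count are all assumptions for illustration.

```python
import random

def grad(w, x, t):
    """Gradient of (w . x - t)^2 with respect to w for a linear unit."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [2.0 * (y - t) * xi for xi in x]

def batch_epoch(w, pairs, eta):
    """Accumulate the gradient over all pairs, then update once."""
    total = [0.0] * len(w)
    for x, t in pairs:
        total = [a + g for a, g in zip(total, grad(w, x, t))]
    return [wi - eta * gi for wi, gi in zip(w, total)]

def online_epoch(w, pairs, eta, rng):
    """Update after every pair, visiting the pairs in random order."""
    order = list(pairs)
    rng.shuffle(order)
    for x, t in order:
        w = [wi - eta * gi for wi, gi in zip(w, grad(w, x, t))]
    return w

# Hypothetical data generated by t = 2*x1 - x2.
pairs = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0),
         ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
rng = random.Random(0)
w_batch, w_online = [0.0, 0.0], [0.0, 0.0]
for _ in range(200):
    w_batch = batch_epoch(w_batch, pairs, eta=0.05)
    w_online = online_epoch(w_online, pairs, eta=0.05, rng=rng)
```

On this small, noise-free problem both procedures approach the same weights, consistent with the remark that the choice usually matters little.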
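Summation of Error Surfaces
[Figure: error surfaces $E^1$ and $E^2$ and their sum $E$]
Gradient Computation in Batch Learning
[Figure: the gradient is computed on the summed surface $E$]
Gradient Computation in Online Learning
[Figure: the gradient is computed on the individual surfaces $E^1$ and $E^2$ in turn, approximating the gradient on $E$]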
Testing Generalization
[Figure: the available data, a subset of the domain, is split into training data and test data]
Problem of Rote Learning
[Figure: error vs. epoch, with one curve for error on training data and one for error on test data; a marker reads "stop training here"]
Improving Generalization
[Figure: the available data, a subset of the domain, is split into training data, validation data, and test data]
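One common way to act on these figures is to split the available data three ways and stop training where the validation error is lowest. The sketch below is an assumption-laden illustration: the split fractions, the `patience` rule, and the sample error values are invented, and the slides do not prescribe a specific stopping rule.

```python
import random

def split_data(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split into training, validation, and test sets."""
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def early_stopping_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error, giving up once
    `patience` epochs pass without improvement."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

train, val, test = split_data(range(10))
# Hypothetical validation-error curve that falls and then rises.
stop_at = early_stopping_epoch([0.9, 0.7, 0.5, 0.45, 0.47, 0.5, 0.6, 0.4])
```

The test set is left untouched during this selection so that it still gives an unbiased estimate of generalization.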
A Few Random Tips
• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
– standardize
– eliminate irrelevant information
– capture invariances
– keep relevant information
• If stuck in local min., restart with different random weights
Run Example BP Learning
Beyond Back-Propagation
• Adaptive Learning Rate
• Adaptive Architecture
– Add/delete hidden neurons
– Add/delete hidden layers
• Radial Basis Function Networks
• Recurrent BP
• Etc., etc., etc.…
Deep Belief Networks
• Inspired by hierarchical representations in
mammalian sensory systems
• Use “deep” (multilayer) feed-forward nets
• Layers self-organize to represent input at
progressively more abstract, task-relevant levels
• Supervised training (e.g., BP) can be used to tune
network performance.
• Each layer is a Restricted Boltzmann Machine
Restricted Boltzmann Machine
• Goal: hidden units
become model of
input domain
• Should capture
statistics of input
• Evaluate by testing its
ability to reproduce
input statistics
• Change weights to
decrease difference
(fig. from wikipedia)
Unsupervised RBM Learning
• Stochastic binary units
• Assume bias units $x_0 = y_0 = 1$
• Set $y_i$ with probability $\sigma\left(\sum_j W_{ij} x_j\right)$
• Set $x'_j$ with probability $\sigma\left(\sum_i W_{ij} y_i\right)$
• Set $y'_i$ with probability $\sigma\left(\sum_j W_{ij} x'_j\right)$
• After several cycles of sampling, update weights based on statistics:
$$\Delta W_{ij} = \eta\left(y_i x_j - y'_i x'_j\right)$$
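A rough sketch of one such update is given below. It makes several simplifying assumptions that are not in the slides: a single sampling cycle per pattern, statistics averaged over a small batch, and no bias units. The function names and the toy data are hypothetical.

```python
import math
import random

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def sample(probs, rng):
    """Stochastic binary units: each is 1 with the given probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(W, x):
    """P(y_i = 1) = sigma(sum_j W[i][j] * x[j])."""
    return [sigmoid(sum(W[i][j] * x[j] for j in range(len(x))))
            for i in range(len(W))]

def visible_probs(W, y):
    """P(x_j = 1) = sigma(sum_i W[i][j] * y[i])."""
    return [sigmoid(sum(W[i][j] * y[i] for i in range(len(W))))
            for j in range(len(W[0]))]

def rbm_update(W, data, eta, rng):
    """One weight update from y_i x_j - y'_i x'_j, averaged over the data."""
    m, n = len(W), len(W[0])
    stats = [[0.0] * n for _ in range(m)]
    for x in data:
        y = sample(hidden_probs(W, x), rng)         # hidden from input
        x_rec = sample(visible_probs(W, y), rng)    # reconstructed input x'
        y_rec = sample(hidden_probs(W, x_rec), rng) # hidden from x'
        for i in range(m):
            for j in range(n):
                stats[i][j] += y[i] * x[j] - y_rec[i] * x_rec[j]
    return [[W[i][j] + eta * stats[i][j] / len(data) for j in range(n)]
            for i in range(m)]

# Hypothetical toy data: 4 visible units, 2 hidden units.
rng = random.Random(0)
W0 = [[0.0] * 4 for _ in range(2)]
data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0]]
W1 = rbm_update(W0, data, eta=0.1, rng=rng)
```

Because every term $y_i x_j - y'_i x'_j$ lies in {−1, 0, 1}, each weight moves by at most η per update in this sketch.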
Training a DBN Network
• Present inputs and do RBM learning with
first hidden layer to develop model
• When converged, do RBM learning
between first and second hidden layers to
develop higher-level model
• Continue until all weight layers trained
• May further train with BP or other
supervised learning algorithms
What is the Power of
Artificial Neural Networks?
• With respect to Turing machines?
• As function approximators?
Can ANNs Exceed the “Turing Limit”?
• There are many results, which depend sensitively on
assumptions; for example:
• Finite NNs with real-valued weights have super-Turing
power (Siegelmann & Sontag ‘94)
• Recurrent nets with Gaussian noise have sub-Turing power
(Maass & Sontag ‘99)
• Finite recurrent nets with real weights can recognize all
languages, and thus are super-Turing (Siegelmann ‘99)
• Stochastic nets with rational weights have super-Turing
power (but only P/POLY, BPP/log* ) (Siegelmann ‘99)
• But computing classes of functions is not a very relevant
way to evaluate the capabilities of neural computation
A Universal Approximation Theorem
Suppose f is a continuous function on $[0,1]^n$
Suppose σ is a nonconstant, bounded, monotone increasing real function on ℜ.
For any ε > 0, there is an m such that $\exists\, \mathbf{a} \in \Re^m$, $\mathbf{b} \in \Re^m$, $\mathbf{W} \in \Re^{m\times n}$ such that if
$$F(x_1, \ldots, x_n) = \sum_{i=1}^{m} a_i\,\sigma\left(\sum_{j=1}^{n} W_{ij} x_j + b_i\right)$$
[i.e., $F(\mathbf{x}) = \mathbf{a}\cdot\sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$]
then $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in [0,1]^n$
(see, e.g., Haykin, N.Nets 2/e, 208–9)
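The form $F(\mathbf{x}) = \mathbf{a}\cdot\sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$ can be evaluated directly, as in the sketch below. The logistic σ and the hand-picked parameters are assumptions for illustration; the example uses two steep sigmoids to build a "bump" on [0.25, 0.75], the kind of building block from which such approximations can be assembled. It does not construct the approximation the theorem guarantees.

```python
import math

def logistic(h):
    return 1.0 / (1.0 + math.exp(-h))

def F(x, a, W, b, sigma=logistic):
    """One-hidden-layer network: sum_i a_i * sigma(sum_j W_ij x_j + b_i)."""
    return sum(a_i * sigma(sum(w * xj for w, xj in zip(row, x)) + b_i)
               for a_i, row, b_i in zip(a, W, b))

# Hypothetical parameters (m = 2 hidden units, n = 1 input):
# a rising step at x = 0.25 minus a rising step at x = 0.75.
a = [1.0, -1.0]
W = [[50.0], [50.0]]
b = [-50.0 * 0.25, -50.0 * 0.75]

inside = F([0.5], a, W, b)     # close to 1 in the middle of the bump
left = F([0.0], a, W, b)       # close to 0 to the left of it
right = F([1.0], a, W, b)      # close to 0 to the right of it
```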
One Hidden Layer is Sufficient
• Conclusion: One hidden layer is sufficient
to approximate any continuous function
arbitrarily closely
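[Figure: inputs $x_1, \ldots, x_n$ and a constant input 1 (carrying biases such as $b_1$) feed m hidden units (Σσ) through weights up to $W_{mn}$; a linear output unit (Σ) combines the hidden outputs with weights $a_1, a_2, \ldots, a_m$]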
The Golden Rule of Neural Nets
Neural Networks are the
second-best way
to do everything!