IV. Neural Networks and Learning
Part 4A: Neural Network Learning

A. Artificial Neural Net Learning

Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition

Feedforward Network

• Feedforward multilayer networks

[Figure: a feedforward network with an input layer, several hidden layers, and an output layer.]

Typical Artificial Neuron

[Figure: a single artificial neuron. Inputs arrive through connection weights; a linear combination of the weighted inputs, relative to a threshold, gives the net input (local field); an activation function applied to the net input produces the output.]
Equations

Net input:
$h_i = \left( \sum_{j=1}^{n} w_{ij} s_j \right) - \theta_i$
$\mathbf{h} = W\mathbf{s} - \boldsymbol{\theta}$

Neuron output:
$s'_i = \sigma(h_i)$
$\mathbf{s}' = \sigma(\mathbf{h})$

Single-Layer Perceptron

[Figure: a single-layer perceptron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, summation $\Sigma$, threshold $\theta$, step activation $\Theta$, and output $y$.]
Variables

[Figure: perceptron variables. Inputs $x_1, \dots, x_j, \dots, x_n$ feed through weights $w_1, \dots, w_j, \dots, w_n$ into the summation $\Sigma$, producing net input $h$; the step function $\Theta$ with threshold $\theta$ produces the output $y$.]

Single-Layer Perceptron Equations

Binary threshold activation function:
$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$

Hence,
$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}$
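To make the decision rule concrete, here is a minimal NumPy sketch of a binary threshold unit; the weights and threshold are illustrative choices, not values from the slides:

```python
import numpy as np

def perceptron_output(w, x, theta):
    """Binary threshold unit: y = 1 if w.x > theta, else 0."""
    return 1 if np.dot(w, x) > theta else 0

# Illustrative example: a 2-input unit computing logical AND.
w = np.array([1.0, 1.0])
theta = 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(w, np.array(x), theta))
```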
2D Weight Vector

$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\phi$

Let $v = \|\mathbf{x}\| \cos\phi$ (the length of the projection of $\mathbf{x}$ on $\mathbf{w}$). Then

$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, v$

$\mathbf{w} \cdot \mathbf{x} > \theta \;\Leftrightarrow\; \|\mathbf{w}\| \, v > \theta \;\Leftrightarrow\; v > \theta / \|\mathbf{w}\|$

[Figure: in the $(w_1, w_2)$ plane, the weight vector $\mathbf{w}$ and an input $\mathbf{x}$ at angle $\phi$; inputs whose projection $v$ onto $\mathbf{w}$ exceeds $\theta / \|\mathbf{w}\|$ fall on the + side of the decision line, the rest on the – side.]

N-Dimensional Weight Vector

[Figure: in n dimensions, $\mathbf{w}$ is the normal vector of a separating hyperplane, with + examples on one side and – examples on the other.]
Goal of Perceptron Learning

• Suppose we have training patterns $x^1, x^2, \dots, x^P$ with corresponding desired outputs $y^1, y^2, \dots, y^P$
• where $x^p \in \{0,1\}^n$, $y^p \in \{0,1\}$
• We want to find $\mathbf{w}$, $\theta$ such that $y^p = \Theta(\mathbf{w} \cdot \mathbf{x}^p - \theta)$ for $p = 1, \dots, P$

Treating Threshold as Weight

$h = \left( \sum_{j=1}^{n} w_j x_j \right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$

Let $x_0 = 1$ and $w_0 = -\theta$. Then

$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$

[Figure: the same perceptron diagram, with the threshold replaced by an extra input $x_0$ carrying weight $w_0 = -\theta$.]

Augmented Vectors

$\tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}, \qquad \tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}$

(Either sign convention works, since $(-1)\cdot\theta = 1\cdot(-\theta)$.)

We want $y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p)$, $p = 1, \dots, P$
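A minimal sketch of the augmentation step in NumPy (the function and array names are mine, not the slides'):

```python
import numpy as np

def augment_patterns(X):
    """Prepend a constant -1 component to every pattern, so the
    threshold becomes the weight w0 = theta of the extra input."""
    ones = -np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

X = np.array([[0, 1], [1, 1]])      # two 2-bit patterns
print(augment_patterns(X))          # [[-1, 0, 1], [-1, 1, 1]]
```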
Reformulation as Positive Examples

We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.

Want $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0$ for negative.

Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive, $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative.

Want $\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$, for $p = 1, \dots, P$

This is a hyperplane through the origin with all $\mathbf{z}^p$ on one side.

Adjustment of Weight Vector

[Figure: the transformed examples $z^1, \dots, z^{10}$ scattered around the origin; the weight vector must be adjusted so that all of them lie on its positive side.]
Outline of Perceptron Learning Algorithm

1. initialize weight vector randomly
2. until all patterns classified correctly, do:
   a) for p = 1, …, P do:
      1) if $\mathbf{z}^p$ classified correctly, do nothing
      2) else adjust weight vector to be closer to correct classification

Weight Adjustment

[Figure: a misclassified $\mathbf{z}^p$ pulls the weight vector toward itself in steps of $\eta \mathbf{z}^p$: first $\tilde{\mathbf{w}}' = \tilde{\mathbf{w}} + \eta \mathbf{z}^p$, then $\tilde{\mathbf{w}}'' = \tilde{\mathbf{w}}' + \eta \mathbf{z}^p$, and so on.]

Improvement in Performance

$\tilde{\mathbf{w}}' \cdot \mathbf{z}^p = (\tilde{\mathbf{w}} + \eta \mathbf{z}^p) \cdot \mathbf{z}^p = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \mathbf{z}^p \cdot \mathbf{z}^p = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \|\mathbf{z}^p\|^2 > \tilde{\mathbf{w}} \cdot \mathbf{z}^p$

Perceptron Learning Theorem

• If there is a set of weights that will solve the problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate)
• Note: only applies if the positive & negative examples are linearly separable
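A runnable NumPy sketch of the algorithm outlined above, using the $\mathbf{z}^p$ reformulation; the learning rate, seed, and training data are illustrative:

```python
import numpy as np

def perceptron_learning(Z, eta=0.1, max_epochs=1000):
    """Perceptron learning on reformulated examples z^p:
    find w_tilde with w_tilde . z^p >= 0 for all p."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=Z.shape[1])        # 1. initialize randomly
    for _ in range(max_epochs):            # 2. until all correct
        all_correct = True
        for z in Z:                        # a) for each pattern
            if np.dot(w, z) < 0:           # misclassified?
                w = w + eta * z            # move w closer to z^p
                all_correct = False
        if all_correct:
            return w
    raise RuntimeError("no separating weights found "
                       "(examples may not be linearly separable)")

# Illustrative use: the AND function, patterns augmented with x0 = -1,
# negative examples negated.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
X_aug = np.hstack([-np.ones((4, 1)), X])
Z = np.where(y[:, None] == 1, X_aug, -X_aug)
print(perceptron_learning(Z))
```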
NetLogo Simulation of Perceptron Learning

Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons

• Perceptrons can function as logic gates
• Therefore an MLP can form intersections, unions, and differences of linearly-separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert's criticism of perceptrons: no one had succeeded in developing a learning algorithm for MLPs
Hyperpolyhedral Classes

[Figure: class regions formed as hyperpolyhedra from combinations of linearly-separable regions.]

Credit Assignment Problem

How do we adjust the weights of the hidden layers?

[Figure: a feedforward network with input layer, hidden layers, and output layer; the desired output is known only at the output layer.]
NetLogo Demonstration of Back-Propagation Learning

Run Artificial Neural Net.nlogo

Adaptive System

[Figure: a system S with control parameters $P_1, \dots, P_k, \dots, P_m$, set by a control algorithm C. An evaluation function F (fitness, figure of merit) measures the system's performance and is fed back to C.]
Gradient

$\frac{\partial F}{\partial P_k}$ measures how F is altered by variation of $P_k$

$\nabla F = \begin{pmatrix} \partial F / \partial P_1 \\ \vdots \\ \partial F / \partial P_k \\ \vdots \\ \partial F / \partial P_m \end{pmatrix}$

$\nabla F$ points in the direction of maximum local increase in F

Gradient Ascent on Fitness Surface

[Figure: a fitness surface with a + peak and a – valley; gradient ascent follows $\nabla F$ uphill toward the peak.]
Gradient Ascent by Discrete Steps

[Figure: discrete steps along $\nabla F$ climb the fitness surface toward the + peak.]

Gradient Ascent is Local But Not Shortest

[Figure: the gradient-ascent path follows the locally steepest direction, which in general is not the shortest path to the peak.]
Gradient Ascent Process

$\dot{\mathbf{P}} = \eta \, \nabla F(\mathbf{P})$

Change in fitness:
$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{d P_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k$
$\dot{F} = \nabla F \cdot \dot{\mathbf{P}}$
$\dot{F} = \nabla F \cdot \eta \nabla F = \eta \|\nabla F\|^2 \ge 0$

Therefore gradient ascent increases fitness (until it reaches a 0 gradient).

General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:
$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi$
where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.

Hence we need $\cos\varphi > 0$, or $\varphi < 90°$.
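A discrete-time sketch of the ascent process, assuming an illustrative fitness function that is not from the slides:

```python
import numpy as np

def gradient_ascent(grad_F, P0, eta=0.1, steps=100):
    """Discrete gradient ascent: P <- P + eta * grad F(P)."""
    P = np.asarray(P0, dtype=float)
    for _ in range(steps):
        P = P + eta * grad_F(P)
    return P

# Illustrative fitness F(P) = -(P1^2 + P2^2), maximized at the origin.
grad_F = lambda P: -2 * P
print(gradient_ascent(grad_F, [3.0, -4.0]))   # approaches [0, 0]
```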
General Ascent on Fitness Surface

[Figure: any trajectory whose direction stays within 90° of $\nabla F$ climbs the fitness surface toward the + peak.]

Fitness as Minimum Error

Suppose for Q different inputs we have target outputs $\mathbf{t}^1, \dots, \mathbf{t}^Q$.

Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \dots, \mathbf{y}^Q$.

Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target & actual outputs.

Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be the error on the qth sample.

Let $F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]$
Gradient of Fitness

$\nabla F = \nabla\left( -\sum_q E^q \right) = -\sum_q \nabla E^q$

$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = \nabla_{y^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}$

Jacobian Matrix

Define the Jacobian matrix
$\mathbf{J}^q = \begin{pmatrix} \partial y_1^q / \partial P_1 & \cdots & \partial y_1^q / \partial P_m \\ \vdots & \ddots & \vdots \\ \partial y_n^q / \partial P_1 & \cdots & \partial y_n^q / \partial P_m \end{pmatrix}$

Note $\mathbf{J}^q \in \mathbb{R}^{n \times m}$ and $\nabla D(\mathbf{t}^q, \mathbf{y}^q) \in \mathbb{R}^{n \times 1}$.

Since $(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial y_j^q}{\partial P_k} \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q}$,

$\therefore \; \nabla E^q = (\mathbf{J}^q)^{\mathsf{T}} \, \nabla D(\mathbf{t}^q, \mathbf{y}^q)$
Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$

$\frac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j} = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j)$

$\therefore \; \frac{d D(\mathbf{t}, \mathbf{y})}{d \mathbf{y}} = 2 (\mathbf{y} - \mathbf{t})$

Gradient of Error on qth Input

$\frac{\partial E^q}{\partial P_k} = \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 (\mathbf{y}^q - \mathbf{t}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}$

$\nabla E^q = 2 \, (\mathbf{J}^q)^{\mathsf{T}} (\mathbf{y}^q - \mathbf{t}^q)$
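A small numerical sanity check of $\nabla E^q = 2(\mathbf{J}^q)^{\mathsf{T}}(\mathbf{y}^q - \mathbf{t}^q)$, using an illustrative linear system $\mathbf{y}(\mathbf{P}) = A\mathbf{P}$ whose Jacobian is simply $A$; the system and the numbers are mine, not the slides':

```python
import numpy as np

# Illustrative system: y(P) = A @ P, so the Jacobian J = A.
A = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, -1.0]])        # n=2 outputs, m=3 parameters
t = np.array([1.0, 0.0])                # target output
P = np.array([0.3, -0.2, 0.7])          # current parameters

E = lambda P: np.sum((A @ P - t) ** 2)  # squared-distance error

grad_formula = 2 * A.T @ (A @ P - t)    # 2 J^T (y - t)

# Finite-difference gradient for comparison.
eps = 1e-6
grad_fd = np.array([(E(P + eps * np.eye(3)[k]) - E(P - eps * np.eye(3)[k]))
                    / (2 * eps) for k in range(3)])
print(np.allclose(grad_formula, grad_fd))   # True
```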
Recap

$\dot{\mathbf{P}} = \eta \sum_q (\mathbf{J}^q)^{\mathsf{T}} (\mathbf{t}^q - \mathbf{y}^q)$

To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which says how the jth output varies with the kth parameter (given the qth input).

The Jacobian depends on the specific form of the system, in this case, a feedforward neural network.

Multilayer Notation

[Figure: $\mathbf{x}^q \to W^1 \to \mathbf{s}^1 \to W^2 \to \mathbf{s}^2 \to \cdots \to W^{L-2} \to \mathbf{s}^{L-1} \to W^{L-1} \to \mathbf{s}^L = \mathbf{y}^q$]
Notation

• L layers of neurons labeled 1, …, L
• $N_l$ neurons in layer l
• $\mathbf{s}^l$ = vector of outputs from neurons in layer l
• input layer: $\mathbf{s}^1 = \mathbf{x}^q$ (the input pattern)
• output layer: $\mathbf{s}^L = \mathbf{y}^q$ (the actual output)
• $W^l$ = weights between layers l and l+1
• Problem: find out how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ ($l = 1, \dots, L-1$)

Typical Neuron

[Figure: neuron i in layer l receives $s_1^{l-1}, \dots, s_j^{l-1}, \dots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \dots, W_{ij}^{l-1}, \dots, W_{iN}^{l-1}$; the sum $\Sigma$ gives the net input $h_i^l$, and $\sigma$ gives the output $s_i^l$.]
Error Back-Propagation

We will compute $\frac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \dots, 1$).

Delta Values

It is convenient to break the derivatives apart by the chain rule:
$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Let $\delta_i^l = \frac{\partial E^q}{\partial h_i^l}$

So $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

Output-Layer Neuron

[Figure: output neuron i receives $s_1^{L-1}, \dots, s_j^{L-1}, \dots, s_N^{L-1}$ through weights $W_{i1}^{L-1}, \dots, W_{iN}^{L-1}$; the sum gives $h_i^L$, $\sigma$ gives $s_i^L = y_i^q$, which is compared with the target $t_i^q$ to give $E^q$.]

Output-Layer Derivatives (1)

$\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q) \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$
Output-Layer Derivatives (2)

$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$

$\therefore \; \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$, where $\delta_i^L = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$

Hidden-Layer Neuron

[Figure: hidden neuron i in layer l receives $s_1^{l-1}, \dots, s_j^{l-1}, \dots, s_N^{l-1}$ through weights $W_{ij}^{l-1}$; its output $s_i^l$ feeds every neuron k of layer l+1 through the weights $W_{1i}^l, \dots, W_{ki}^l, \dots, W_{Ni}^l$, and so influences $E^q$ through all of $s_1^{l+1}, \dots, s_k^{l+1}, \dots, s_N^{l+1}$.]
Hidden-Layer Derivatives (1)

Recall $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$

$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}$

$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l \frac{\partial s_i^l}{\partial h_i^l} = W_{ki}^l \frac{d \sigma(h_i^l)}{d h_i^l} = W_{ki}^l \, \sigma'(h_i^l)$

$\therefore \; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Hidden-Layer Derivatives (2)

$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \frac{d \, W_{ij}^{l-1} s_j^{l-1}}{d W_{ij}^{l-1}} = s_j^{l-1}$

$\therefore \; \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$, where $\delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
Derivative of Sigmoid

Suppose $s = \sigma(h) = \frac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid)

$D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2} D_h (1 + e^{-\alpha h}) = -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h}) = \alpha \frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2}$

$= \alpha \frac{1}{1 + e^{-\alpha h}} \frac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}} \right) = \alpha s (1 - s)$

Summary of Back-Propagation Algorithm

Output layer:
$\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q)$
$\frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$

Hidden layers:
$\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$
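A compact NumPy sketch of these update rules for one training pair, assuming the logistic sigmoid and an illustrative two-weight-layer network; the shapes, names, and learning rate are my choices:

```python
import numpy as np

def sigmoid(h, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * h))

def backprop_pair(W, x, t, eta=0.5, alpha=1.0):
    """One back-propagation step for a single (x, t) pair.
    W is a list of weight matrices W[l]: layer l -> layer l+1."""
    # Forward pass: s[l] is the output vector of layer l.
    s = [x]
    for Wl in W:
        s.append(sigmoid(Wl @ s[-1], alpha))
    # Output-layer delta: 2*alpha*s(1-s)(s - t).
    delta = 2 * alpha * s[-1] * (1 - s[-1]) * (s[-1] - t)
    # Backward pass: dE/dW[l] = outer(delta, s[l]); update is -eta * dE/dW.
    for l in reversed(range(len(W))):
        grad = np.outer(delta, s[l])
        if l > 0:   # hidden delta: alpha*s(1-s) * sum_k delta_k W_ki
            delta = alpha * s[l] * (1 - s[l]) * (W[l].T @ delta)
        W[l] = W[l] - eta * grad
    return W

# Illustrative use: 2 inputs, 3 hidden neurons, 1 output.
rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
W = backprop_pair(W, x=np.array([0.0, 1.0]), t=np.array([1.0]))
```

Note the sign convention: here the weight change is $-\eta \, \partial E^q / \partial W$, which matches the slides' $\Delta W_{ij} = \eta \delta_i s_j$ with the target-minus-output form of $\delta_i^L$ used on the next slide.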
Output-Layer Computation

$\Delta W_{ij}^{L-1} = \eta \, \delta_i^L s_j^{L-1}$, where $\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(t_i^q - s_i^L)$

[Figure: the forward signals $s_j^{L-1}$, $h_i^L$, $s_i^L = y_i^q$ are combined with the target $t_i^q$; the products $2\alpha \, s_i^L (1 - s_i^L)(t_i^q - s_i^L)$ form $\delta_i^L$, which multiplied by $\eta$ and $s_j^{L-1}$ gives $\Delta W_{ij}^{L-1}$.]

Hidden-Layer Computation

$\Delta W_{ij}^{l-1} = \eta \, \delta_i^l s_j^{l-1}$, where $\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

[Figure: the deltas $\delta_1^{l+1}, \dots, \delta_k^{l+1}, \dots, \delta_N^{l+1}$ propagate back through the weights $W_{ki}^l$ and are scaled by $\alpha s_i^l (1 - s_i^l)$ to form $\delta_i^l$; then $\Delta W_{ij}^{l-1} = \eta \, \delta_i^l s_j^{l-1}$.]
Training Procedures

• Batch Learning
  – on each epoch (pass through all the training pairs), weight changes for all patterns are accumulated
  – weight matrices updated at end of epoch
  – accurate computation of gradient
• Online Learning
  – weights are updated after back-prop of each training pair
  – usually randomize order for each epoch
  – approximation of gradient
• In practice it doesn't make much difference (see the sketch below)

Summation of Error Surfaces

[Figure: the total error surface E is the sum of the per-pattern error surfaces E1, E2, ….]
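To make the batch/online distinction concrete, here is a minimal sketch of the two epoch loops; the model (a single linear neuron with squared error) and all settings are illustrative stand-ins:

```python
import numpy as np

def pair_grad(w, x, t):
    """Gradient of E = (w.x - t)^2 with respect to w."""
    return 2 * (np.dot(w, x) - t) * x

def batch_epoch(w, X, T, eta):
    """Accumulate gradients over all pairs; update once at epoch end."""
    total = sum(pair_grad(w, x, t) for x, t in zip(X, T))
    return w - eta * total

def online_epoch(w, X, T, eta, rng):
    """Update after each pair, in a freshly randomized order."""
    for p in rng.permutation(len(X)):
        w = w - eta * pair_grad(w, X[p], T[p])
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3)); T = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    w = online_epoch(w, X, T, eta=0.01, rng=rng)
print(w)   # approaches [1, -2, 0.5]
```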
Gradient Computation in Batch Learning

[Figure: each batch step follows the gradient of the summed error surface E = E1 + E2 + ….]

Gradient Computation in Online Learning

[Figure: each online step follows the gradient of a single per-pattern surface E1, E2, …, approximating the gradient of E.]
Testing Generalization

[Figure: the Available Data, drawn from the Domain, is split into Training Data and Test Data.]

Problem of Rote Learning

[Figure: error vs. epoch. The error on the training data keeps decreasing, while the error on the test data eventually rises again; stop training at the minimum of the test error.]
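A minimal early-stopping loop capturing the "stop training here" rule; `train_epoch` and `test_error` are hypothetical stand-ins for one epoch of training and held-out evaluation:

```python
def train_with_early_stopping(w, train_epoch, test_error,
                              max_epochs=1000, patience=10):
    """Stop when the held-out error has not improved for `patience`
    epochs; return the best weights seen.  Hypothetical helpers:
    train_epoch(w) -> new w, test_error(w) -> float."""
    best_w, best_err, since_best = w, test_error(w), 0
    for _ in range(max_epochs):
        w = train_epoch(w)
        err = test_error(w)
        if err < best_err:
            best_w, best_err, since_best = w, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break          # test error rising: stop training here
    return best_w
```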
Improving Generalization

[Figure: the Available Data, drawn from the Domain, is split three ways: Training Data, Validation Data, and Test Data.]

A Few Random Tips

• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights
Run Example BP Learning

Beyond Back-Propagation

• Adaptive Learning Rate
• Adaptive Architecture
  – Add/delete hidden neurons
  – Add/delete hidden layers
• Radial Basis Function Networks
• Recurrent BP
• Etc., etc., etc.…
Deep Belief Networks

• Inspired by hierarchical representations in mammalian sensory systems
• Use “deep” (multilayer) feed-forward nets
• Layers self-organize to represent input at progressively more abstract, task-relevant levels
• Supervised training (e.g., BP) can be used to tune network performance
• Each layer is a Restricted Boltzmann Machine

Restricted Boltzmann Machine

• Goal: hidden units become a model of the input domain
• Should capture statistics of input
• Evaluate by testing its ability to reproduce input statistics
• Change weights to decrease the difference

[Figure: a bipartite network of visible and hidden units with no within-layer connections (fig. from Wikipedia).]
Unsupervised RBM Learning

• Stochastic binary units
• Assume bias units $x_0 = y_0 = 1$
• Set $y_i$ with probability $\sigma\left( \sum_j W_{ij} x_j \right)$
• Set $x'_j$ with probability $\sigma\left( \sum_i W_{ij} y_i \right)$
• Set $y'_i$ with probability $\sigma\left( \sum_j W_{ij} x'_j \right)$
• After several cycles of sampling, update weights based on statistics:
  $\Delta W_{ij} = \eta \left( \langle y_i x_j \rangle - \langle y'_i x'_j \rangle \right)$

Training a DBN Network

• Present inputs and do RBM learning with the first hidden layer to develop a model
• When converged, do RBM learning between the first and second hidden layers to develop a higher-level model
• Continue until all weight layers are trained
• May further train with BP or other supervised learning algorithms
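A NumPy sketch of one such weight update with a single Gibbs cycle (i.e., CD-1) over a batch of inputs; the data and shapes are illustrative, and the bias units are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def rbm_step(W, X, eta=0.1):
    """One CD-1 weight update for a visible batch X (rows = samples)."""
    Y  = sample(sigma(X @ W.T))   # y_i  ~ sigma(sum_j W_ij x_j)
    Xp = sample(sigma(Y @ W))     # x'_j ~ sigma(sum_i W_ij y_i)
    Yp = sigma(Xp @ W.T)          # y'_i (probabilities suffice here)
    # DeltaW_ij = eta * ( <y_i x_j> - <y'_i x'_j> ), averaged over batch
    return W + eta * (Y.T @ X - Yp.T @ Xp) / len(X)

# Illustrative use: 6 visible units, 4 hidden units, random binary data.
W = 0.01 * rng.normal(size=(4, 6))
X = (rng.random((32, 6)) < 0.5).astype(float)
for _ in range(100):
    W = rbm_step(W, X, eta=0.1)
```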
What is the Power of Artificial Neural Networks?

• With respect to Turing machines?
• As function approximators?

Can ANNs Exceed the “Turing Limit”?

• There are many results, which depend sensitively on assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag ’94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag ’99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann ’99)
• Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann ’99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation
A Universal Approximation Theorem

Suppose f is a continuous function on $[0,1]^n$, and suppose $\sigma$ is a nonconstant, bounded, monotone increasing real function on $\mathbb{R}$.

For any $\varepsilon > 0$, there is an m such that $\exists \, \mathbf{a} \in \mathbb{R}^m$, $\mathbf{b} \in \mathbb{R}^m$, $W \in \mathbb{R}^{m \times n}$ such that if

$F(x_1, \dots, x_n) = \sum_{i=1}^{m} a_i \, \sigma\left( \sum_{j=1}^{n} W_{ij} x_j + b_i \right)$

[i.e., $F(\mathbf{x}) = \mathbf{a} \cdot \sigma(W\mathbf{x} + \mathbf{b})$]

then $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in [0,1]^n$.

(see, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–9)

One Hidden Layer is Sufficient

• Conclusion: One hidden layer is sufficient to approximate any continuous function arbitrarily closely

[Figure: inputs $x_1, \dots, x_n$ feed m hidden $\Sigma\sigma$ units with biases $b_1, \dots, b_m$ and weights $W_{ij}$; the hidden outputs, weighted by $a_1, a_2, \dots, a_m$, are summed by a single linear output unit $\Sigma$.]
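As an illustration of the theorem (not a proof), the sketch below fits $F(x) = \mathbf{a} \cdot \sigma(Wx + \mathbf{b})$ to a continuous target on $[0,1]$ by plain gradient descent; the target function, network size, and training settings are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))   # bounded, monotone, nonconstant

f = lambda x: np.sin(2 * np.pi * x)          # continuous target on [0,1]
x = np.linspace(0, 1, 200)

m = 20                                        # number of hidden units
W, b, a = rng.normal(size=m), rng.normal(size=m), np.zeros(m)

eta = 0.05
for _ in range(20000):
    H = sigma(np.outer(x, W) + b)            # hidden activations, (200, m)
    F = H @ a                                # F(x) = a . sigma(Wx + b)
    err = F - f(x)
    a -= eta * H.T @ err / len(x)
    dH = (err[:, None] * a) * H * (1 - H)    # backprop through sigma
    W -= eta * (dH * x[:, None]).sum(0) / len(x)
    b -= eta * dH.sum(0) / len(x)
print(np.max(np.abs(F - f(x))))              # sup-norm error shrinks
```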
The Golden Rule of Neural Nets

Neural Networks are the second-best way to do everything!