Part 4A: Neural Network Learning
3/11/15

II. Neural Network Learning

A. Supervised Learning
Supervised Learning
•  Produce desired outputs for training inputs
•  Generalize reasonably & appropriately to other inputs
•  Good example: pattern recognition
•  Feedforward multilayer networks
Feedforward Network
[Figure: feedforward network with an input layer, several hidden layers, and an output layer]
Typical Artificial Neuron
[Figure: neuron with inputs, connection weights, threshold, and output]
Typical Artificial Neuron
[Figure: the same neuron, labeling the linear combination, the net input (local field), and the activation function]
Equations

Net input:
h_i = (Σ_{j=1}^n w_{ij} s_j) − θ_i        h = Ws − θ

Neuron output:
s′_i = σ(h_i)        s′ = σ(h)
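These equations translate directly into a couple of lines of NumPy. A minimal sketch; the weights, inputs, thresholds, and the choice of tanh as σ are illustrative, not from the slides:

```python
import numpy as np

def layer(W, s, theta, sigma=np.tanh):
    """Net input h = W s - theta; neuron outputs s' = sigma(h)."""
    h = W @ s - theta       # h_i = sum_j W_ij s_j - theta_i
    return sigma(h)         # apply activation function componentwise

W = np.array([[0.5, -0.2],
              [0.1,  0.4]])
s = np.array([1.0, 2.0])
theta = np.array([0.0, 0.1])
out = layer(W, s, theta)    # sigma applied to net input h = (0.1, 0.8)
```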
Single-Layer Perceptron
[Figure: single-layer perceptron: n inputs feeding one output neuron]
Variables
[Figure: inputs x_1, …, x_j, …, x_n with weights w_1, …, w_j, …, w_n feeding Σ; net input h, threshold θ, activation Θ, output y]
Single-Layer Perceptron Equations

Binary threshold activation function:
σ(h) = Θ(h) = { 1, if h > 0;  0, if h ≤ 0 }

Hence, y = { 1, if Σ_j w_j x_j > θ;  0, otherwise }
          = { 1, if w · x > θ;  0, if w · x ≤ θ }
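A sketch of this binary-threshold unit in NumPy; the AND-gate weights below are a standard illustration, not taken from the slides:

```python
import numpy as np

def perceptron(w, theta, x):
    """y = 1 if w.x > theta, else 0 (binary threshold activation)."""
    return 1 if np.dot(w, x) > theta else 0

# With w = (1, 1) and theta = 1.5 the unit computes logical AND
w, theta = np.array([1.0, 1.0]), 1.5
outputs = [perceptron(w, theta, np.array(x))
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
# outputs == [0, 0, 0, 1]
```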
2D Weight Vector

w · x = ‖w‖ ‖x‖ cos φ
Let v = ‖x‖ cos φ (the length of the projection of x onto w); then
w · x = ‖w‖ v
w · x > θ  ⇔  ‖w‖ v > θ  ⇔  v > θ / ‖w‖

[Figure: weight vector w in the (w_1, w_2) plane; the decision line at distance θ/‖w‖ along w separates the + and − regions]
N-Dimensional Weight Vector
[Figure: w is the normal vector of the separating hyperplane, with + on one side and − on the other]
Goal of Perceptron Learning
•  Suppose we have training patterns x^1, x^2, …, x^P with corresponding desired outputs y^1, y^2, …, y^P,
•  where x^p ∈ {0, 1}^n, y^p ∈ {0, 1}
•  We want to find w, θ such that y^p = Θ(w · x^p − θ) for p = 1, …, P
Treating Threshold as Weight

h = (Σ_{j=1}^n w_j x_j) − θ = −θ + Σ_{j=1}^n w_j x_j

[Figure: perceptron with inputs x_1, …, x_n, weights w_1, …, w_n, and an explicit threshold θ]
Treating Threshold as Weight

h = (Σ_{j=1}^n w_j x_j) − θ = −θ + Σ_{j=1}^n w_j x_j

Let x_0 = −1 and w_0 = θ. Then
h = w_0 x_0 + Σ_{j=1}^n w_j x_j = Σ_{j=0}^n w_j x_j = w̃ · x̃

[Figure: the same perceptron with an extra input x_0 = −1 whose weight is w_0 = θ]
Augmented Vectors

w̃ = (θ, w_1, …, w_n)^T        x̃^p = (−1, x_1^p, …, x_n^p)^T

We want y^p = Θ(w̃ · x̃^p), p = 1, …, P
Reformulation as Positive Examples

We have positive (y^p = 1) and negative (y^p = 0) examples.
Want w̃ · x̃^p > 0 for positive, w̃ · x̃^p ≤ 0 for negative.

Let z^p = x̃^p for positive, z^p = −x̃^p for negative.
Want w̃ · z^p ≥ 0, for p = 1, …, P.

This is a hyperplane through the origin with all z^p on one side.
Adjustment of Weight Vector
[Figure: patterns z^1, …, z^11 scattered around the origin; the weight vector must be rotated until all of them lie on its positive side]
Outline of Perceptron Learning Algorithm

1.  initialize weight vector randomly
2.  until all patterns classified correctly, do:
    a)  for p = 1, …, P do:
        1)  if z^p classified correctly, do nothing
        2)  else adjust weight vector to be closer to correct classification
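The outline above, in NumPy, operating on the sign-flipped augmented patterns z^p from the previous slides. The OR-gate data, learning rate, and epoch limit are illustrative choices:

```python
import numpy as np

def perceptron_learning(Z, eta=1.0, max_epochs=100, seed=0):
    """Find w with w.z^p > 0 for every row z^p of Z, by the PLA."""
    w = np.random.default_rng(seed).normal(size=Z.shape[1])  # 1. random init
    for _ in range(max_epochs):                # 2. until all correct
        all_correct = True
        for zp in Z:                           # a) for each pattern
            if w @ zp <= 0:                    # misclassified:
                w = w + eta * zp               # move w toward zp
                all_correct = False
        if all_correct:
            return w
    return w

# OR gate: augmented patterns x~ = (-1, x1, x2); z = x~ for positive (y = 1),
# z = -x~ for negative (y = 0)
Z = np.array([[-1, 0, 1], [-1, 1, 0], [-1, 1, 1],   # positive examples
              [ 1, 0, 0]], dtype=float)             # negated negative example
w = perceptron_learning(Z)
```

Since the OR patterns are linearly separable, the convergence theorem on the next slides guarantees this loop terminates with all patterns on the positive side.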
Weight Adjustment
[Figure: misclassified pattern z^p; adding ηz^p moves w̃ to w̃′ and then w̃″, rotating the weight vector toward z^p]
Improvement in Performance

w̃′ · z^p = (w̃ + ηz^p) · z^p
         = w̃ · z^p + ηz^p · z^p
         = w̃ · z^p + η‖z^p‖²
         > w̃ · z^p
Perceptron Learning Theorem
•  If there is a set of weights that will solve the problem,
•  then the PLA will eventually find it
•  (for a sufficiently small learning rate)
•  Note: only applies if positive & negative examples are linearly separable
NetLogo Simulation of
Perceptron Learning
Run Perceptron-Geometry.nlogo
Classification Power of Multilayer Perceptrons
•  Perceptrons can function as logic gates
•  Therefore an MLP can form intersections, unions, and differences of linearly-separable regions
•  Classes can be arbitrary hyperpolyhedra
•  Minsky & Papert criticism of perceptrons
•  At the time, no one had succeeded in developing an MLP learning algorithm
Hyperpolyhedral Classes
[Figure: class regions bounded by intersecting hyperplanes]
Credit Assignment Problem
[Figure: multilayer network with input layer, hidden layers, and output layer, compared against the desired output]

How do we adjust the weights of the hidden layers?
NetLogo Demonstration of
Back-Propagation Learning
Run Artificial Neural Net.nlogo
Adaptive System
[Figure: a System with state S and control parameters P_1 … P_k … P_m; an Evaluation Function (Fitness, Figure of Merit) F; and a Control Algorithm C that adjusts the Control Parameters]
Gradient

∂F/∂P_k measures how F is altered by variation of P_k

∇F = (∂F/∂P_1, …, ∂F/∂P_k, …, ∂F/∂P_m)^T

∇F points in direction of maximum local increase in F
Gradient Ascent on Fitness Surface
[Figure: fitness surface with ∇F arrows pointing uphill from − toward +]
Gradient Ascent by Discrete Steps
[Figure: a zig-zag path of discrete steps following ∇F uphill]
Gradient Ascent is Local But Not Shortest
[Figure: the gradient path is locally steepest at each point, but not the shortest route to the peak]
Gradient Ascent Process

Ṗ = η∇F(P)

Change in fitness:
Ḟ = dF/dt = Σ_{k=1}^m (∂F/∂P_k)(dP_k/dt) = Σ_{k=1}^m (∇F)_k Ṗ_k
Ḟ = ∇F · Ṗ
Ḟ = ∇F · η∇F = η‖∇F‖² ≥ 0

Therefore gradient ascent increases fitness (until it reaches a zero gradient)
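The process Ṗ = η∇F(P), discretized as repeated steps P ← P + η∇F(P). A sketch with a toy quadratic fitness; the fitness function, learning rate, and step count are illustrative:

```python
import numpy as np

def gradient_ascent(grad_F, P0, eta=0.1, steps=200):
    """Discrete-step gradient ascent: P <- P + eta * grad F(P)."""
    P = np.array(P0, dtype=float)
    for _ in range(steps):
        P = P + eta * grad_F(P)   # each step increases F (for small eta)
    return P

# F(P) = -(P1 - 1)^2 - (P2 + 2)^2 has its maximum at (1, -2)
grad_F = lambda P: np.array([-2 * (P[0] - 1), -2 * (P[1] + 2)])
P_max = gradient_ascent(grad_F, [0.0, 0.0])   # converges toward (1, -2)
```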
General Ascent in Fitness

Note that any adaptive process P(t) will increase fitness provided:
0 < Ḟ = ∇F · Ṗ = ‖∇F‖ ‖Ṗ‖ cos φ
where φ is the angle between ∇F and Ṗ.

Hence we need cos φ > 0, or φ < 90°.
General Ascent on Fitness Surface
[Figure: trajectories climbing the fitness surface that stay within 90° of ∇F]
Fitness as Minimum Error

Suppose for Q different inputs we have target outputs t^1, …, t^Q.
Suppose for parameters P the corresponding actual outputs are y^1, …, y^Q.
Suppose D(t, y) ∈ [0, ∞) measures the difference between target & actual outputs.
Let E^q = D(t^q, y^q) be the error on the qth sample.

Let F(P) = −Σ_{q=1}^Q E^q(P) = −Σ_{q=1}^Q D[t^q, y^q(P)]
Gradient of Fitness

∇F = ∇(−Σ_q E^q) = −Σ_q ∇E^q

∂E^q/∂P_k = ∂D(t^q, y^q)/∂P_k = Σ_j (∂D(t^q, y^q)/∂y_j^q)(∂y_j^q/∂P_k)
          = dD(t^q, y^q)/dy^q · ∂y^q/∂P_k
          = ∇_{y^q} D(t^q, y^q) · ∂y^q/∂P_k
Jacobian Matrix

Define the Jacobian matrix J^q, whose (j, k) entry is ∂y_j^q/∂P_k:
the rows run over the n outputs, the columns over the m parameters.

Note J^q ∈ ℜ^{n×m} and ∇D(t^q, y^q) ∈ ℜ^{n×1}.

Since (∇E^q)_k = ∂E^q/∂P_k = Σ_j (∂y_j^q/∂P_k)(∂D(t^q, y^q)/∂y_j^q),

∴ ∇E^q = (J^q)^T ∇D(t^q, y^q)
Derivative of Squared Euclidean Distance

Suppose D(t, y) = ‖t − y‖² = Σ_i (t_i − y_i)²

∂D(t, y)/∂y_j = ∂/∂y_j Σ_i (t_i − y_i)² = Σ_i ∂(t_i − y_i)²/∂y_j
             = d(t_j − y_j)²/dy_j = −2(t_j − y_j)

∴ dD(t, y)/dy = 2(y − t)
Gradient of Error on qth Input

∂E^q/∂P_k = dD(t^q, y^q)/dy^q · ∂y^q/∂P_k
          = 2(y^q − t^q) · ∂y^q/∂P_k
          = 2Σ_j (y_j^q − t_j^q) ∂y_j^q/∂P_k

∇E^q = 2(J^q)^T (y^q − t^q)
Recap

Ṗ = η Σ_q (J^q)^T (t^q − y^q)

To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, ∂y_j^q/∂P_k, which says how the jth output varies with the kth parameter (given the qth input).

The Jacobian depends on the specific form of the system, in this case, a feedforward neural network.
Multilayer Notation
[Figure: x^q = s^1 —W^1→ s^2 —W^2→ … —W^{L−2}→ s^{L−1} —W^{L−1}→ s^L = y^q]
Notation
•  L layers of neurons labeled 1, …, L
•  N_l neurons in layer l
•  s^l = vector of outputs from neurons in layer l
•  input layer s^1 = x^q (the input pattern)
•  output layer s^L = y^q (the actual output)
•  W^l = weights between layers l and l+1
•  Problem: find out how the outputs y_i^q vary with the weights W_ij^l (l = 1, …, L−1)
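With this notation the forward pass is a one-line loop. A sketch; the layer sizes, random weights, and the choice of the logistic σ are illustrative:

```python
import numpy as np

sigma = lambda h: 1.0 / (1.0 + np.exp(-h))   # logistic activation

def forward(x, Ws):
    """s^1 = x^q; s^{l+1} = sigma(W^l s^l); returns all layer outputs."""
    s = [np.asarray(x, dtype=float)]
    for W in Ws:
        s.append(sigma(W @ s[-1]))
    return s        # s[-1] is the actual output y^q = s^L

# three layers with N_1 = 2, N_2 = 3, N_3 = 1
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
s = forward([0.5, -0.3], Ws)
```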
Typical Neuron
[Figure: neuron i in layer l receives s_1^{l−1}, …, s_j^{l−1}, …, s_N^{l−1} through weights W_i1^{l−1}, …, W_ij^{l−1}, …, W_iN^{l−1}; Σ forms h_i^l and σ yields s_i^l]
Error Back-Propagation

We will compute ∂E^q/∂W_ij^l starting with the last layer (l = L−1)
and working back to earlier layers (l = L−2, …, 1).
Delta Values

Convenient to break derivatives by chain rule:
∂E^q/∂W_ij^{l−1} = (∂E^q/∂h_i^l)(∂h_i^l/∂W_ij^{l−1})

Let δ_i^l = ∂E^q/∂h_i^l

So ∂E^q/∂W_ij^{l−1} = δ_i^l ∂h_i^l/∂W_ij^{l−1}
Output-Layer Neuron
[Figure: output neuron i receives s_1^{L−1}, …, s_j^{L−1}, …, s_N^{L−1} through weights W_i1^{L−1}, …, W_ij^{L−1}, …, W_iN^{L−1}; Σ forms h_i^L, σ yields s_i^L = y_i^q, which is compared with the target t_i^q to give E^q]
Output-Layer Derivatives (1)

δ_i^L = ∂E^q/∂h_i^L = ∂/∂h_i^L Σ_k (s_k^L − t_k^q)²
      = d(s_i^L − t_i^q)²/dh_i^L
      = 2(s_i^L − t_i^q) ds_i^L/dh_i^L
      = 2(s_i^L − t_i^q) σ′(h_i^L)
Output-Layer Derivatives (2)

∂h_i^L/∂W_ij^{L−1} = ∂/∂W_ij^{L−1} Σ_k W_ik^{L−1} s_k^{L−1} = s_j^{L−1}

∴ ∂E^q/∂W_ij^{L−1} = δ_i^L s_j^{L−1}
where δ_i^L = 2(s_i^L − t_i^q) σ′(h_i^L)
Hidden-Layer Neuron
[Figure: hidden neuron i in layer l receives s_1^{l−1}, …, s_j^{l−1}, …, s_N^{l−1} through weights W_i1^{l−1}, …, W_ij^{l−1}, …, W_iN^{l−1}; its output s_i^l feeds neurons s_1^{l+1}, …, s_k^{l+1}, …, s_N^{l+1} through weights W_1i^l, …, W_ki^l, …, W_Ni^l, all of which affect E^q]
Hidden-Layer Derivatives (1)

Recall ∂E^q/∂W_ij^{l−1} = δ_i^l ∂h_i^l/∂W_ij^{l−1}

δ_i^l = ∂E^q/∂h_i^l = Σ_k (∂E^q/∂h_k^{l+1})(∂h_k^{l+1}/∂h_i^l) = Σ_k δ_k^{l+1} ∂h_k^{l+1}/∂h_i^l

∂h_k^{l+1}/∂h_i^l = ∂(Σ_m W_km^l s_m^l)/∂h_i^l = ∂(W_ki^l s_i^l)/∂h_i^l = W_ki^l dσ(h_i^l)/dh_i^l = W_ki^l σ′(h_i^l)

∴ δ_i^l = Σ_k δ_k^{l+1} W_ki^l σ′(h_i^l) = σ′(h_i^l) Σ_k δ_k^{l+1} W_ki^l
Hidden-Layer Derivatives (2)

∂h_i^l/∂W_ij^{l−1} = ∂/∂W_ij^{l−1} Σ_k W_ik^{l−1} s_k^{l−1} = d(W_ij^{l−1} s_j^{l−1})/dW_ij^{l−1} = s_j^{l−1}

∴ ∂E^q/∂W_ij^{l−1} = δ_i^l s_j^{l−1}
where δ_i^l = σ′(h_i^l) Σ_k δ_k^{l+1} W_ki^l
Derivative of Sigmoid

Suppose s = σ(h) = 1 / (1 + exp(−αh))  (logistic sigmoid)

D_h s = D_h [1 + exp(−αh)]^{−1} = −[1 + exp(−αh)]^{−2} D_h (1 + e^{−αh})
      = −(1 + e^{−αh})^{−2} (−α e^{−αh})
      = α e^{−αh} / (1 + e^{−αh})²
      = α [1 / (1 + e^{−αh})] [e^{−αh} / (1 + e^{−αh})]
      = αs [(1 + e^{−αh}) / (1 + e^{−αh}) − 1 / (1 + e^{−αh})]
      = αs(1 − s)
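The identity σ′(h) = αs(1 − s) is easy to check numerically against a central difference; the value of α and the test point are arbitrary choices:

```python
import numpy as np

alpha = 1.5
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))

h = 0.7
s = sigma(h)
analytic = alpha * s * (1 - s)                           # alpha*s*(1-s)
eps = 1e-6
numeric = (sigma(h + eps) - sigma(h - eps)) / (2 * eps)  # central difference
# analytic and numeric agree to well within 1e-8
```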
Summary of Back-Propagation Algorithm

Output layer:  δ_i^L = 2α s_i^L (1 − s_i^L)(s_i^L − t_i^q)
               ∂E^q/∂W_ij^{L−1} = δ_i^L s_j^{L−1}

Hidden layers: δ_i^l = α s_i^l (1 − s_i^l) Σ_k δ_k^{l+1} W_ki^l
               ∂E^q/∂W_ij^{l−1} = δ_i^l s_j^{l−1}
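The whole algorithm fits in a short function. A sketch that implements these formulas for E^q = Σ_k (s_k^L − t_k^q)² with the logistic sigmoid; the network shape, weights, and α = 1 are illustrative:

```python
import numpy as np

alpha = 1.0
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))

def backprop_grads(x, t, Ws):
    """Gradients dE^q/dW^l for E^q = sum_k (s_k^L - t_k)^2."""
    s = [np.asarray(x, dtype=float)]
    for W in Ws:                                   # forward pass
        s.append(sigma(W @ s[-1]))
    # output layer: delta_i^L = 2 alpha s(1-s)(s - t)
    delta = 2 * alpha * s[-1] * (1 - s[-1]) * (s[-1] - np.asarray(t))
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(delta, s[l])           # dE/dW_ij = delta_i s_j
        if l > 0:                                  # hidden layers:
            delta = alpha * s[l] * (1 - s[l]) * (Ws[l].T @ delta)
    return grads
```

Gradient descent then updates each weight matrix by W^l ← W^l − η · dE^q/dW^l.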
Output-Layer Computation

ΔW_ij^{L−1} = η δ_i^L s_j^{L−1}
δ_i^L = 2α s_i^L (1 − s_i^L)(t_i^q − s_i^L)

[Figure: dataflow diagram of this weight-update computation; note the sign (t_i^q − s_i^L), so the update descends the error]
Hidden-Layer Computation

ΔW_ij^{l−1} = η δ_i^l s_j^{l−1}
δ_i^l = α s_i^l (1 − s_i^l) Σ_k δ_k^{l+1} W_ki^l

[Figure: dataflow diagram of this weight-update computation; the δ_k^{l+1} values from layer l+1 are combined through the weights W_ki^l]
Training Procedures
•  Batch Learning
   –  on each epoch (pass through all the training pairs),
   –  weight changes for all patterns are accumulated
   –  weight matrices updated at end of epoch
   –  accurate computation of gradient
•  Online Learning
   –  weights are updated after back-prop of each training pair
   –  usually randomize order for each epoch
   –  approximation of gradient
•  In practice it doesn't make much difference
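The two procedures differ only in where the weight update happens. A sketch on a toy per-pattern gradient; the quadratic error and all constants are illustrative:

```python
import numpy as np

def train(patterns, grad, W, eta=0.1, epochs=200, online=True, seed=0):
    """Online: update after each pattern; batch: accumulate over the epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(patterns))   # randomized order each epoch
        total = np.zeros_like(W)
        for p in order:
            g = grad(W, patterns[p])
            if online:
                W = W - eta * g                  # immediate (approximate) step
            else:
                total = total + g                # accumulate exact gradient
        if not online:
            W = W - eta * total                  # one update per epoch
    return W

# toy problem: E_p(W) = (W - p)^2 / 2, so the optimum is the mean of patterns
grad = lambda W, p: W - p
patterns = [np.array([1.0]), np.array([3.0])]
W_batch = train(patterns, grad, np.array([0.0]), online=False)
W_online = train(patterns, grad, np.array([0.0]), online=True)
# both end up near the optimum W = 2; online hovers around it
```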
Summation of Error Surfaces
[Figure: total error surface E as the sum of per-pattern surfaces E1 and E2]

Gradient Computation in Batch Learning
[Figure: batch learning steps down the summed surface E = E1 + E2]

Gradient Computation in Online Learning
[Figure: online learning alternates steps down E1 and E2, approximating descent on E]
Testing Generalization
[Figure: the Available Data sampled from the Domain is split into Training Data and Test Data]
Problem of Rote Learning
[Figure: error vs. epoch; error on training data keeps decreasing while error on test data eventually rises — stop training where the test-data error is minimal]
Improving Generalization
[Figure: the Available Data is split three ways: Training Data, Validation Data, and Test Data]
A Few Random Tips
•  Too few neurons and the ANN may not be able to decrease the error enough
•  Too many neurons can lead to rote learning
•  Preprocess data to:
   –  standardize
   –  eliminate irrelevant information
   –  capture invariances
   –  keep relevant information
•  If stuck in a local minimum, restart with different random weights
Run Example BP Learning
Beyond Back-Propagation
•  Adaptive Learning Rate
•  Adaptive Architecture
–  Add/delete hidden neurons
–  Add/delete hidden layers
•  Radial Basis Function Networks
•  Recurrent BP
•  Etc., etc., etc.…
Deep Belief Networks
•  Inspired by hierarchical representations in mammalian sensory systems
•  Use “deep” (multilayer) feed-forward nets
•  Layers self-organize to represent input at progressively more abstract, task-relevant levels
•  Supervised training (e.g., BP) can be used to tune network performance
•  Each layer is a Restricted Boltzmann Machine
Restricted Boltzmann Machine
•  Goal: hidden units become a model of the input domain
•  Should capture statistics of the input
•  Evaluate by testing its ability to reproduce input statistics
•  Change weights to decrease the difference
[Figure: bipartite graph of visible and hidden units (fig. from Wikipedia)]
Unsupervised RBM Learning
•  Stochastic binary units
•  Assume bias units x_0 = y_0 = 1
•  Set y_i with probability σ(Σ_j W_ij x_j)
•  Set x′_j with probability σ(Σ_i W_ij y_i)
•  Set y′_i with probability σ(Σ_j W_ij x′_j)
•  After several cycles of sampling, update weights based on statistics:
   ΔW_ij = η(⟨y_i x_j⟩ − ⟨y′_i x′_j⟩)
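A one-step version of this sampling cycle ("CD-1"), estimating ΔW_ij = η(⟨y_i x_j⟩ − ⟨y′_i x′_j⟩) from a single pattern. The shapes, the use of probabilities rather than samples in the statistics, and folding the biases into W via the x_0 = y_0 = 1 units are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda u: 1.0 / (1.0 + np.exp(-u))

def cd1_update(W, x, eta=0.1):
    """One contrastive-divergence weight update for a binary RBM."""
    p_y = sigma(W @ x)                                  # P(y_i = 1 | x)
    y = (rng.random(p_y.shape) < p_y).astype(float)     # sample hidden units
    p_x1 = sigma(W.T @ y)                               # P(x'_j = 1 | y)
    x1 = (rng.random(p_x1.shape) < p_x1).astype(float)  # reconstruct input
    p_y1 = sigma(W @ x1)                                # P(y'_i = 1 | x')
    # <y_i x_j> - <y'_i x'_j>, one-sample estimate using probabilities
    return W + eta * (np.outer(p_y, x) - np.outer(p_y1, x1))

W = np.zeros((2, 3))                   # 2 hidden units, 3 visible units
W = cd1_update(W, np.array([1.0, 0.0, 1.0]))
```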
Training a DBN Network
•  Present inputs and do RBM learning with the first hidden layer to develop a model
•  When converged, do RBM learning between the first and second hidden layers to develop a higher-level model
•  Continue until all weight layers are trained
•  May further train with BP or other supervised learning algorithms
What is the Power of
Artificial Neural Networks?
•  With respect to Turing machines?
•  As function approximators?
Can ANNs Exceed the “Turing Limit”?
•  There are many results, which depend sensitively on assumptions; for example:
•  Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag ’94)
•  Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag ’99)
•  Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann ’99)
•  Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann ’99)
•  But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation
A Universal Approximation Theorem

Suppose f is a continuous function on [0,1]^n.
Suppose σ is a nonconstant, bounded, monotone increasing real function on ℜ.

For any ε > 0, there is an m such that
∃ a ∈ ℜ^m, b ∈ ℜ^m, W ∈ ℜ^{m×n} such that if

F(x_1, …, x_n) = Σ_{i=1}^m a_i σ(Σ_{j=1}^n W_ij x_j + b_i)

[i.e., F(x) = a · σ(Wx + b)]

then |F(x) − f(x)| < ε for all x ∈ [0,1]^n.

(see, e.g., Haykin, N.Nets 2/e, 208–9)
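The approximator in the theorem is just a one-hidden-layer forward pass. A sketch of F(x) = a · σ(Wx + b), with the logistic function as one valid choice of bounded, monotone σ; all parameter values are illustrative:

```python
import numpy as np

def F(x, a, W, b, sigma=lambda u: 1.0 / (1.0 + np.exp(-u))):
    """F(x) = a . sigma(W x + b): m hidden units, n inputs, scalar output."""
    return float(a @ sigma(W @ np.asarray(x, dtype=float) + b))

# m = 3 hidden units, n = 2 inputs; with W = 0, b = 0 each unit outputs 1/2
a = np.ones(3)
W = np.zeros((3, 2))
b = np.zeros(3)
value = F([0.3, 0.7], a, W, b)   # 3 * 0.5 = 1.5
```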
One Hidden Layer is Sufficient
•  Conclusion: One hidden layer is sufficient to approximate any continuous function arbitrarily closely
[Figure: network with inputs x_1, …, x_n and bias input 1, hidden Σσ units with weights W and biases b_1, …, b_m, and a linear output Σ with coefficients a_1, a_2, …, a_m]
The Golden Rule of Neural Nets

Neural Networks are the second-best way to do everything!