
2.160 System Identification, Estimation, and Learning
Lecture Notes No. 13
March 22, 2006
8. Neural Networks
8.1 Physiological Background
Neuro-physiology
• A human brain has approximately 14 billion neurons, of roughly 50 different kinds … uniform
• Massively-parallel, distributed processing
Very different from a computer (a Turing machine)
Image removed due to copyright reasons.
McCulloch and Pitts, Neuron Model 1943
Donald Hebb, Hebbian Rule, 1949
…Synapse reinforcement learning
Rosenblatt, 1959
…The perceptron convergence theorem
The Hebbian Rule
[Figure: a neuron model. Inputs x_1, x_2, …, x_i, …, x_n enter through synapses with weights w_1, w_2, …, w_i, …, w_n, are summed (∑), and pass through f to produce the output y.]
When the input x_i fires and the output y also fires, the i-th synapse w_i is reinforced: the electric conductivity increases at w_i.
[Figure: the unit's output ŷ = g(z) plotted against z, an S-shaped curve saturating at 1: a logistic, or sigmoid, function.]

(1)  z = \sum_{i=1}^{n} w_i x_i ,   ŷ = g(z)

Error:  e = ŷ - y = g(z) - y

Gradient Descent method

(2)  \Delta w_i = -\rho \cdot \mathrm{grad}_{w_i} e^2 = -\rho \, 2e \, \frac{\partial e}{\partial w_i} ,   ρ: learning rate

(3)  ∴ \Delta w_i = -2 \rho \, g' e \, x_i ,  where

(4)  g(z) = \frac{1}{1 + e^{-z}}

Rule:  Δw_i ∝ (Input x_i) · (Error)  … Supervised Learning

Replacing e by ŷ yields the Hebbian Rule:  Δw_i ∝ (Input x_i) · (Output ŷ)  … Unsupervised Learning
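To make these update rules concrete, the following is a minimal sketch of one supervised (delta-rule) step and one Hebbian step for a single logistic unit; the function names and the choice ρ = 0.1 are illustrative, not part of the notes.

import numpy as np

def logistic(z):
    # (4): g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_step(w, x, y, rho=0.1):
    """One supervised (delta-rule) update for a single logistic unit.

    w : (n,) weight vector, x : (n,) input vector, y : target output.
    """
    z = np.dot(w, x)                    # (1): z = sum_i w_i x_i
    y_hat = logistic(z)                 # prediction y_hat = g(z)
    e = y_hat - y                       # error e = g(z) - y
    g_prime = y_hat * (1.0 - y_hat)     # derivative of the logistic function
    dw = -2.0 * rho * g_prime * e * x   # (3): dw_i = -2 rho g' e x_i
    return w + dw

def hebbian_step(w, x, rho=0.1):
    """Unsupervised Hebbian update: dw_i proportional to (input x_i) * (output y_hat)."""
    y_hat = logistic(np.dot(w, x))
    return w + rho * y_hat * x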
8.2 Stochastic Approximation
Consider a linear output function for ŷ = g(z):

(5)  \hat{y} = \sum_{i=1}^{n} w_i x_i

Given N sample data \{ (y^j, x_1^j, \ldots, x_n^j) \mid j = 1, \ldots, N \}  (Training Data), find w_1, …, w_n that minimize
(6)  J_N = \frac{1}{N} \sum_{j=1}^{N} \left( \hat{y}^j - y^j \right)^2

(7)  \Delta w_i = -\rho \cdot \mathrm{grad}_{w_i} J_N = -\frac{2\rho}{N} \sum_{j=1}^{N} \left( \hat{y}^j - y^j \right) \frac{\partial \hat{y}^j}{\partial w_i}
This method requires storing the gradient (\hat{y}^j - y^j) \frac{\partial \hat{y}^j}{\partial w_i} for all the sample data j = 1, …, N; it is a type of batch processing.
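As an illustration of the batch update (7), here is a minimal numpy sketch, assuming the N training inputs are stacked as rows of a matrix X; the names X, y, and batch_gradient_step are assumptions for this example.

import numpy as np

def batch_gradient_step(w, X, y, rho=0.01):
    """One batch update (7): w <- w - rho * grad_w J_N.

    X : (N, n) array whose j-th row is the input x^j,
    y : (N,) array of outputs y^j, w : (n,) weight vector.
    """
    y_hat = X @ w                                 # (5): predictions for all N samples
    grad = (2.0 / len(y)) * X.T @ (y_hat - y)     # gradient of J_N in (6)
    return w - rho * grad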
A simpler method is to apply the weight update Δw_i every time a training datum is presented.
(8)  \Delta w_i[k] = \rho \, \delta[k] \, x_i[k]    for the k-th presentation

(9)  \delta[k] = y[k] - \sum_{l} w_l[k] \, x_l[k]

where y[k] is the correct output for the training data presented at the k-th time, and \sum_{l} w_l[k] x_l[k] is the predicted output based on the weights w_i[k] for the training data presented at the k-th time.
Learning procedure
Present all the N training data in any sequence, and repeat the N presentations, called an
“epoch”, many times… Recycling.
[Diagram: the N training pairs (x[1], y[1]), …, (x[N], y[N]) are presented repeatedly over epoch 1, epoch 2, epoch 3, epoch 4, …, epoch p, each epoch producing N predictions.]
This procedure is called the Widrow-Hoff algorithm.
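A minimal sketch of this Widrow-Hoff procedure with a constant learning rate, recycling the N training pairs over many epochs; the array layout and the values rho = 0.05 and epochs = 100 are assumptions for illustration.

import numpy as np

def widrow_hoff(X, y, rho=0.05, epochs=100):
    """Train a linear unit with the per-sample rule (8)-(9).

    X : (N, n) inputs, y : (N,) target outputs, rho : constant learning rate.
    Each pass over the N samples is one 'epoch'; the data are recycled.
    """
    N, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for k in range(N):                    # present the N samples in sequence
            delta = y[k] - np.dot(w, X[k])    # (9): correct minus predicted output
            w += rho * delta * X[k]           # (8): dw_i = rho * delta * x_i
    return w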
Convergence: As the recycling is repeated infinitely many times, does the weight vector converge to the optimal one, \arg\min_{w} J_N(w_1, \ldots, w_n)? … Consistency.
If a constant learning rate ρ > 0 is used, the weights do not converge unless \min J_N = 0.
[Figures: left, contours of J_N in the (w_1, w_2) weight space W with the minimum point W_0; right, a decaying learning-rate schedule ρ_k, e.g. ρ_k = constant/k, plotted against k.]

If the learning rate is varied, e.g. ρ_k = constant/k, convergence can be guaranteed. This is a special case of the Method of Stochastic Approximation.
Expected Loss Function

(10)  E[L(w)] = \int L(x, y \mid w) \, p(x) \, dx

(11)  L(x, y \mid w) = \frac{1}{2} \left( y - \hat{y}(x \mid w) \right)^2

The stochastic approximation procedure for minimizing this expected loss function with respect to the weights w_1, …, w_n is

(12)  w_i[k+1] = w_i[k] - \rho[k] \frac{\partial}{\partial w_i} L(x[k], y[k] \mid w[k])

where x[k] is the training data presented at the k-th iteration. This estimate is proved consistent if the learning rate ρ[k] satisfies:

(13)  1) \lim_{k \to \infty} \rho[k] = 0    2) \lim_{k \to \infty} \sum_{i=1}^{k} \rho[i] = +\infty

(14)  \lim_{k \to \infty} \sum_{i=1}^{k} \rho[i]^2 < \infty

Condition (13) prevents the weights from converging so fast that the error would remain forever uncorrected; condition (14) ensures that random fluctuations are eventually suppressed:

\lim_{k \to \infty} E[(w_i[k] - w_{i0})^2] = 0

i.e., the estimated weights converge to their optimal values with probability 1 (Robbins and Monro, 1951).
This Stochastic Approximation method in general needs more presentations of data, i.e. its convergence is slower than that of batch processing, but the computation is very simple.
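The following sketch implements the per-sample update (12) with the decaying schedule ρ[k] = c/k, which satisfies conditions (13) and (14); the random presentation order and the names used here are assumptions for illustration.

import numpy as np

def stochastic_approximation(X, y, c=1.0, passes=50):
    """Per-sample updates (12) for the quadratic loss (11), with rho[k] = c/k."""
    rng = np.random.default_rng()
    N, n = X.shape
    w = np.zeros(n)
    k = 0
    for _ in range(passes):
        for j in rng.permutation(N):       # present the data in random order
            k += 1
            rho_k = c / k                  # rho[k] -> 0, sum rho = inf, sum rho^2 < inf
            e = np.dot(w, X[j]) - y[j]     # (yhat - y) for the linear unit (5)
            w -= rho_k * e * X[j]          # (12): dL/dw_i = (yhat - y) x_i
    return w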
8.3 Multi-Layer Perceptrons
The Exclusive OR Problem
Input X1   Input X2   Output y
    0          0          0
    0          1          1
    1          0          1
    1          1          0
Can a single neural unit (perceptron) with weights w_1, w_2, w_3 produce the XOR truth table? No, it cannot.
[Figure: a single perceptron with inputs x_1, x_2 and a constant input 1, weights w_1, w_2, w_3, a summation ∑, and threshold g(z); in the x_1 - x_2 plane the line z = 0 separates Class 1 from Class 0.]

(15)  z = w_1 x_1 + w_2 x_2 + w_3

(16)  g(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}

Set z = 0; then 0 = w_1 x_1 + w_2 x_2 + w_3 represents a straight line in the x_1 - x_2 plane.
Class 0 and class 1 cannot be
separated by a straight line. …
Not linearly separable.
Consider a nonlinear function in lieu of (15):

(17)  z = f(x_1, x_2) = x_1 + x_2 - 2 x_1 x_2 - \frac{1}{3}

f(0,0) = -\frac{1}{3} < 0 ,  f(1,1) = -\frac{1}{3} < 0  … Class 0
f(1,0) = f(0,1) = \frac{2}{3} > 0  … Class 1

[Figure: the regions z > 0 and z < 0 of (17) in the x_1 - x_2 plane.]

Next, replace x_1 x_2 by a new variable x_3:

(18)  z = x_1 + x_2 - 2 x_3 - \frac{1}{3}

This is now a linear function of x_1, x_2, x_3: Linearly Separable.
Hidden Units
• Augment the original input patterns
• Decode the input and generate an internal representation
[Figure: a two-layer network realizing (18). A hidden unit computes x_3 = g(x_1 + x_2 - 1.5) from inputs x_1, x_2 (weights +1, +1) and a constant input 1 (weight -1.5); the output unit computes y = g(x_1 + x_2 - 2 x_3 - 1/3). The hidden unit is not directly visible from the output. A companion plot shows the line z = 0 and the regions z > 0 and z < 0 in the x_1 - x_2 plane.]
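The hand-wired network above can be checked directly. A minimal sketch using the weights read off the figure (+1, +1 and bias -1.5 into the hidden unit; +1, +1, -2 and bias -1/3 into the output unit) with the hard threshold (16) for g:

def g(z):
    # hard threshold (16): 1 if z > 0 else 0
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # hidden unit: fires only when both inputs are 1 (an AND of x1 and x2)
    x3 = g(x1 + x2 - 1.5)
    # output unit implements (18): z = x1 + x2 - 2*x3 - 1/3
    return g(x1 + x2 - 2 * x3 - 1.0 / 3.0)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # reproduces the XOR truth table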
Extending this argument, we arrive at a multi-layer network having multiple hidden
layers between input and output layers.
Multi-Layer Perceptron
[Figure: a multi-layer perceptron. Layer 0 is the input layer; layers 1, 2, …, m, … are hidden layers; layer M is the output layer.]
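As a sketch of how such a layered network computes, the following forward pass propagates an input through M layers of logistic units; the representation of the weights (one matrix per layer, with an extra bias column) is an assumption for illustration, and no training is shown.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights):
    """Propagate an input (layer 0) through the hidden layers to the output layer M.

    weights : list of M matrices; weights[m] maps layer m to layer m+1 and has one
              extra column that multiplies a constant bias input of 1.
    """
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = logistic(W @ np.append(a, 1.0))    # z = W [a; 1],  a = g(z)
    return a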