Artificial Neural Networks
[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
Threshold units
Gradient descent
Multilayer networks
Backpropagation
Hidden layer representations
Example: Face Recognition
Advanced topics
lecture slides for textbook
Machine Learning, T. Mitchell, McGraw Hill, 1997
Connectionist Models
Consider humans:
Neuron switching time ~ 0.001 second
Number of neurons ~ 10^10
Connections per neuron ~ 10^4 - 10^5
Scene recognition time ~ 0.1 second
100 inference steps doesn't seem like enough
→ much parallel computation
Properties of artificial neural nets (ANN's):
Many neuron-like threshold switching units
Many weighted interconnections among units
Highly parallel, distributed process
Emphasis on tuning weights automatically
When to Consider Neural Networks
Input is high-dimensional discrete or real-valued
(e.g. raw sensor input)
Output is discrete or real valued
Output is a vector of values
Possibly noisy data
Form of target function is unknown
Human readability of result is unimportant
Examples:
Speech phoneme recognition [Waibel]
Image classification [Kanade, Baluja, Rowley]
Financial prediction
ALVINN drives 70 mph on highways
[Figure: the ALVINN network — a 30x32 sensor input retina feeding 4 hidden units, which feed 30 output units spanning steering directions from Sharp Left through Straight Ahead to Sharp Right]
Perceptron
[Figure: perceptron — inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a constant input x_0 = 1 with weight w_0, feeding a threshold unit]

net = \sum_{i=0}^{n} w_i x_i

o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}

o(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}
Sometimes we'll use simpler vector notation:
o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}
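As an illustration (not part of the original slides), a minimal Python sketch of the threshold output above, with the bias handled through x_0 = 1:

    import numpy as np

    def perceptron_output(w, x):
        """Threshold unit output: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.

        w: weights [w0, w1, ..., wn] (w[0] is the weight for the constant x0 = 1).
        x: input vector [x1, ..., xn].
        """
        net = w[0] + np.dot(w[1:], x)
        return 1 if net > 0 else -1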
Decision Surface of a Perceptron
[Figure: (a) a set of + and - examples in the x_1-x_2 plane separated by a perceptron decision surface (a line); (b) a configuration of + and - examples that is not linearly separable]
Represents some useful functions
What weights represent g(x_1, x_2) = AND(x_1, x_2)?
But some functions not representable
e.g., not linearly separable
Therefore, we'll want networks of these...
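For a concrete example (these particular weights are one standard choice, not given on the slide): w_0 = -0.8, w_1 = w_2 = 0.5 makes a perceptron over {0, 1} inputs compute AND, since the net input exceeds 0 only when both inputs are 1. A quick check in Python:

    import numpy as np

    # Illustrative weights for AND: [w0, w1, w2] = [-0.8, 0.5, 0.5]
    w_and = np.array([-0.8, 0.5, 0.5])
    for x1 in (0, 1):
        for x2 in (0, 1):
            net = w_and[0] + w_and[1] * x1 + w_and[2] * x2
            print(x1, x2, 1 if net > 0 else -1)   # outputs +1 only for (1, 1)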
Perceptron training rule
w_i \leftarrow w_i + \Delta w_i

where

\Delta w_i = \eta (t - o) x_i

Where:
  t = c(\vec{x}) is target value
  o is perceptron output
  \eta is small constant (e.g., 0.1) called learning rate
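A minimal Python sketch of this rule (my own illustration; the fixed epoch count and zero initialization are assumptions, not from the slide):

    import numpy as np

    def train_perceptron(examples, eta=0.1, epochs=100):
        """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

        examples: list of (x, t) pairs with x a 1-D numpy array and t in {-1, +1}.
        Returns weights [w0, w1, ..., wn], where w0 multiplies the constant x0 = 1.
        """
        w = np.zeros(len(examples[0][0]) + 1)
        for _ in range(epochs):
            for x, t in examples:
                o = 1 if w[0] + np.dot(w[1:], x) > 0 else -1
                w[0]  += eta * (t - o)          # x0 = 1
                w[1:] += eta * (t - o) * x
        return w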
Perceptron training rule
Can prove it will converge
If training data is linearly separable
and \eta sufficiently small
Gradient Descent
To understand, consider simpler linear unit, where

o = w_0 + w_1 x_1 + \cdots + w_n x_n

Let's learn w_i's that minimize the squared error

E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

Where D is set of training examples
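In Python, the linear unit and its squared error look like this (my own sketch; examples stands for the training set D as a list of (x, t) pairs):

    import numpy as np

    def linear_output(w, x):
        """o = w0 + w1*x1 + ... + wn*xn, with w[0] the bias weight for x0 = 1."""
        return w[0] + np.dot(w[1:], x)

    def squared_error(w, examples):
        """E[w] = 1/2 * sum over d in D of (t_d - o_d)^2."""
        return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)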
Gradient Descent
[Figure: the error surface E[\vec{w}] for a linear unit plotted over weights w_0 and w_1 — a parabolic surface with a single global minimum]
Gradient

\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

Training rule:

\Delta \vec{w} = -\eta \nabla E[\vec{w}]

i.e.,

\Delta w_i = -\eta \frac{\partial E}{\partial w_i}
Gradient Descent
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2

= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2

= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)

= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)

\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})
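The last line transcribed into Python (my own sketch, same (x, t) pair convention as before, with w[0] the bias weight so x_{0,d} = 1):

    import numpy as np

    def error_gradient(w, examples):
        """dE/dw_i = sum over d of (t_d - o_d) * (-x_{i,d}) for the linear unit."""
        grad = np.zeros_like(w)
        for x, t in examples:
            o = w[0] + np.dot(w[1:], x)        # linear unit output o_d
            grad[0]  += (t - o) * (-1.0)       # x_{0,d} = 1
            grad[1:] += (t - o) * (-x)
        return grad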
Gradient Descent
GRADIENT-DESCENT(training_examples, \eta)

Each training example is a pair of the form \langle \vec{x}, t \rangle, where \vec{x} is the vector of input values, and t is the target output value. \eta is the learning rate (e.g., 0.05).

Initialize each w_i to some small random value
Until the termination condition is met, Do
  Initialize each \Delta w_i to zero.
  For each \langle \vec{x}, t \rangle in training_examples, Do
    Input the instance \vec{x} to the unit and compute the output o
    For each linear unit weight w_i, Do
      \Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i
  For each linear unit weight w_i, Do
    w_i \leftarrow w_i + \Delta w_i
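A runnable Python sketch of GRADIENT-DESCENT (my own illustration; the slide leaves the termination condition open, so a fixed epoch count stands in for it here):

    import numpy as np

    def gradient_descent(training_examples, eta=0.05, epochs=1000):
        """Batch gradient descent for a linear unit.

        training_examples: list of (x, t) pairs, x a 1-D numpy array.
        Returns weights [w0, w1, ..., wn], with w0 the weight for the constant x0 = 1.
        """
        n = len(training_examples[0][0])
        w = np.random.uniform(-0.05, 0.05, n + 1)   # small random initial values
        for _ in range(epochs):                     # stand-in termination condition
            delta_w = np.zeros(n + 1)
            for x, t in training_examples:
                o = w[0] + np.dot(w[1:], x)         # compute the output o
                delta_w[0]  += eta * (t - o)        # x0 = 1
                delta_w[1:] += eta * (t - o) * x
            w += delta_w                            # w_i <- w_i + Delta w_i
        return w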
Summary
Perceptron training rule guaranteed to succeed if
  Training examples are linearly separable
  Sufficiently small learning rate \eta

Linear unit training rule uses gradient descent
  Guaranteed to converge to hypothesis with minimum squared error
  Given sufficiently small learning rate \eta
  Even when training data contains noise
  Even when training data not separable by H
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied
  1. Compute the gradient \nabla E_D[\vec{w}]
  2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]

Incremental mode Gradient Descent:
Do until satisfied
  For each training example d in D
    1. Compute the gradient \nabla E_d[\vec{w}]
    2. \vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]

E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if \eta made small enough
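The incremental version in Python (my own sketch): the only change from the batch sketch above is that the weights move after every single example.

    import numpy as np

    def incremental_gradient_descent(training_examples, eta=0.05, epochs=1000):
        """Incremental (stochastic) gradient descent for the linear unit."""
        n = len(training_examples[0][0])
        w = np.random.uniform(-0.05, 0.05, n + 1)
        for _ in range(epochs):
            for x, t in training_examples:          # one gradient step per example d
                o = w[0] + np.dot(w[1:], x)
                w[0]  += eta * (t - o)              # descend on E_d; x0 = 1
                w[1:] += eta * (t - o) * x
        return w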
Multilayer Networks of Sigmoid Units
[Figure: the decision regions learned by a multilayer network for distinguishing the vowel sounds in "head", "hid", "who'd", and "hood", plotted over the formant frequencies F1 and F2]
Sigmoid Unit
[Figure: sigmoid unit — inputs x_1, ..., x_n with weights w_1, ..., w_n, plus a constant input x_0 = 1 with weight w_0]

net = \sum_{i=0}^{n} w_i x_i

o = \sigma(net) = \frac{1}{1 + e^{-net}}

\sigma(x) is the sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}

Nice property: \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))
We can derive gradient descent rules to train
  One sigmoid unit
  Multilayer networks of sigmoid units → Backpropagation
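The sigmoid and its derivative in Python (my own sketch):

    import numpy as np

    def sigmoid(x):
        """sigma(x) = 1 / (1 + e^(-x))."""
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(x):
        """d sigma(x) / dx = sigma(x) * (1 - sigma(x))."""
        s = sigmoid(x)
        return s * (1.0 - s)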
Error Gradient for a Sigmoid Unit
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2

= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)

= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)

= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}

But we know:

\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)

\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}

So:

\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}
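The final expression as Python for one sigmoid unit (my own sketch; w[0] is the bias weight for x_0 = 1):

    import numpy as np

    def sigmoid_unit_gradient(w, examples):
        """dE/dw_i = -sum over d of (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}."""
        grad = np.zeros_like(w)
        for x, t in examples:
            net = w[0] + np.dot(w[1:], x)
            o = 1.0 / (1.0 + np.exp(-net))         # sigmoid output o_d
            grad[0]  -= (t - o) * o * (1.0 - o)    # x_{0,d} = 1
            grad[1:] -= (t - o) * o * (1.0 - o) * x
        return grad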
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    1. Input the training example to the network and compute the network outputs
    2. For each output unit k
       \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
    3. For each hidden unit h
       \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k
    4. Update each network weight w_{i,j}
       w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}
       where \Delta w_{i,j} = \eta \, \delta_j \, x_{i,j}
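A compact Python sketch of this loop for a network with a single hidden layer (my own illustration; the array layout, fixed epoch count, and learning rate are assumptions, and biases are handled by prepending a constant 1 to each layer's input):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_train(examples, n_hidden, eta=0.3, epochs=5000, seed=0):
        """Stochastic backpropagation for a 1-hidden-layer sigmoid network.

        examples: list of (x, t) pairs of 1-D numpy arrays, targets in (0, 1).
        Returns (W1, W2); column 0 of each weight matrix holds the bias weights.
        """
        rng = np.random.default_rng(seed)
        n_in, n_out = len(examples[0][0]), len(examples[0][1])
        W1 = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))
        W2 = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))
        for _ in range(epochs):                                # "until satisfied"
            for x, t in examples:
                h = sigmoid(W1 @ np.concatenate(([1.0], x)))   # hidden outputs
                o = sigmoid(W2 @ np.concatenate(([1.0], h)))   # network outputs
                delta_o = o * (1 - o) * (t - o)                    # output units k
                delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)    # hidden units h
                W2 += eta * np.outer(delta_o, np.concatenate(([1.0], h)))
                W1 += eta * np.outer(delta_h, np.concatenate(([1.0], x)))
        return W1, W2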
More on Backpropagation
Gradient descent over entire network weight
vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global error minimum
  In practice, often works well (can run multiple times)
Often include weight momentum \alpha: \Delta w_{i,j}(n) = \eta \, \delta_j \, x_{i,j} + \alpha \, \Delta w_{i,j}(n-1) (see the sketch below)
Minimizes error over training examples
  Will it generalize well to subsequent examples?
Training can take thousands of iterations → slow!
Using network after training is very fast
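A sketch of the momentum rule named above (my own illustration; momentum_update is a hypothetical helper, and prev_delta_W is the \Delta w(n-1) carried over from the previous update):

    import numpy as np

    def momentum_update(W, delta, x, prev_delta_W, eta=0.3, alpha=0.9):
        """Delta w(n) = eta * delta_j * x_{i,j} + alpha * Delta w(n-1)."""
        delta_W = eta * np.outer(delta, x) + alpha * prev_delta_W
        return W + delta_W, delta_W   # updated weights, and Delta w(n) for next time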
Learning Hidden Layer Representations
A target function:

Input    →  Output
10000000 →  10000000
01000000 →  01000000
00100000 →  00100000
00010000 →  00010000
00001000 →  00001000
00000100 →  00000100
00000010 →  00000010
00000001 →  00000001
Can this be learned??
Learning Hidden Layer Representations
A network:

[Figure: an 8 x 3 x 8 network — 8 input units, 3 hidden units, 8 output units]
Learned hidden layer representation:

Input    →  Hidden Values  →  Output
10000000 →  .89 .04 .08    →  10000000
01000000 →  .01 .11 .88    →  01000000
00100000 →  .01 .97 .27    →  00100000
00010000 →  .99 .97 .71    →  00010000
00001000 →  .03 .05 .02    →  00001000
00000100 →  .22 .99 .99    →  00000100
00000010 →  .80 .01 .98    →  00000010
00000001 →  .60 .94 .01    →  00000001
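A self-contained sketch of this 8 x 3 x 8 experiment, using the same updates as the backpropagation sketch earlier (my own illustration; the learning rate, epoch count, and weight ranges are my choices, not values given on these slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    W1 = rng.uniform(-0.05, 0.05, (3, 9))         # 8 inputs + bias -> 3 hidden units
    W2 = rng.uniform(-0.05, 0.05, (8, 4))         # 3 hidden + bias -> 8 outputs
    examples = [np.eye(8)[i] for i in range(8)]   # input = target = one-hot vector

    for _ in range(5000):
        for x in examples:
            h = sigmoid(W1 @ np.concatenate(([1.0], x)))
            o = sigmoid(W2 @ np.concatenate(([1.0], h)))
            delta_o = o * (1 - o) * (x - o)                  # target t = x
            delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)
            W2 += 0.3 * np.outer(delta_o, np.concatenate(([1.0], h)))
            W1 += 0.3 * np.outer(delta_h, np.concatenate(([1.0], x)))

    for x in examples:                            # inspect each learned hidden encoding
        h = sigmoid(W1 @ np.concatenate(([1.0], x)))
        print(x.astype(int), np.round(h, 2))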
Training
Sum of squared errors for each output unit
[Plot: error (0 to 0.9) for each output unit versus training iterations (0 to 2500)]
Training
Hidden unit encoding for input 01000000
[Plot: the three hidden unit values (0 to 1) for input 01000000 versus training iterations (0 to 2500)]
Training
Weights from inputs to one hidden unit
[Plot: the weights from the inputs to one hidden unit (roughly -5 to 4) versus training iterations (0 to 2500)]
Convergence of Backpropagation
Gradient descent to some local minimum
Perhaps not global minimum...
Add momentum
Stochastic gradient descent
Train multiple nets with different initial weights
Nature of convergence
Initialize weights near zero
Therefore, initial networks near-linear
Increasingly non-linear functions possible as
training progresses
Expressive Capabilities of ANNs
Boolean functions:
Every boolean function can be represented by
network with single hidden layer
but might require exponential (in number of
inputs) hidden units
Continuous functions:
Every bounded continuous function can be
approximated with arbitrarily small error, by
network with one hidden layer [Cybenko 1989;
Hornik et al. 1989]
Any function can be approximated to arbitrary
accuracy by a network with two hidden layers
[Cybenko 1988].
Overfitting in ANNs
Error versus weight updates (example 1)
[Plot: training set error and validation set error (roughly 0.002 to 0.01) versus number of weight updates (0 to 20000)]
Error versus weight updates (example 2)
[Plot: training set error and validation set error (roughly 0.01 to 0.08) versus number of weight updates (0 to 6000)]
Neural Nets for Face Recognition
[Figure: network with a 30x32 image input and four output units (left, strt, rght, up), with typical input images shown]
90% accurate learning head pose, and recognizing 1-of-20 faces
Learned Hidden Unit Weights
[Figure: the same network, with the learned weights from the 30x32 image inputs into each hidden unit displayed as images, alongside typical input images]
http://www.cs.cmu.edu/tom/faces.html
Alternative Error Functions
Penalize large weights (see the sketch after this slide):

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

Train on target slopes as well as values:

E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]
Tie together weights:
e.g., in phoneme recognition network
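A sketch of how the weight penalty above changes a gradient step (my own illustration; gamma is the penalty coefficient and grad_E is the gradient of the unpenalized error, however it is computed):

    import numpy as np

    def weight_decay_step(W, grad_E, eta=0.3, gamma=0.001):
        """Gradient step for E + gamma * sum_{i,j} w_ji^2: the penalty contributes
        2 * gamma * W to the gradient, shrinking every weight toward zero."""
        return W - eta * (grad_E + 2.0 * gamma * W)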
Recurrent Networks
[Figure: (a) a feedforward network computing y(t+1) from input x(t); (b) a recurrent network that also takes context input c(t) fed back from the network itself; (c) the same recurrent network unfolded in time over x(t), c(t); x(t-1), c(t-1); x(t-2), c(t-2), producing y(t+1), y(t), y(t-1)]