Regression and Classification with Neural Networks

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Copyright © 2001, 2003, Andrew W. Moore
Sep 25th, 2001
Linear Regression
DATASET
  inputs      outputs
  x1 = 1      y1 = 1
  x2 = 3      y2 = 2.2
  x3 = 2      y3 = 2
  x4 = 1.5    y4 = 1.9
  x5 = 4      y5 = 3.1

(The figure plots these points with a fitted line of slope w, marked by a unit run along the x-axis.)
Linear regression assumes that the expected value of
the output given an input, E[y|x], is linear.
Simplest case: Out(x) = wx for some unknown w.
Given the data, we can estimate w.
1-parameter linear regression
Assume that the data is formed by
  y_i = w x_i + noise_i
where…
• the noise signals are independent
• the noise has a normal distribution with mean 0 and unknown variance σ²
Then P(y | w, x) has a normal distribution with
• mean wx
• variance σ²
Bayesian Linear Regression
P(y | w, x) = Normal(mean wx, var σ²)
We have a set of datapoints (x_1, y_1), (x_2, y_2), …, (x_n, y_n) which are EVIDENCE about w.
We want to infer w from the data:
  P(w | x_1, x_2, …, x_n, y_1, y_2, …, y_n)
• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.
Maximum likelihood estimation of w
Asks the question:
“For which value of w is this data most likely to have
happened?”
<=>
For what w is
  P(y_1, y_2, …, y_n | x_1, x_2, …, x_n, w) maximized?
<=>
For what w is
  ∏_{i=1}^{n} P(y_i | w, x_i) maximized?
For what w is
  ∏_{i=1}^{n} P(y_i | w, x_i) maximized?
For what w is
  ∏_{i=1}^{n} exp( −½ ((y_i − w x_i)/σ)² ) maximized?
For what w is
  ∑_{i=1}^{n} −½ ((y_i − w x_i)/σ)² maximized?
For what w is
  ∑_{i=1}^{n} (y_i − w x_i)² minimized?
Linear Regression
The maximum likelihood w is the one that minimizes the sum of squares of the residuals. (The figure plots E(w) against w.)

  E(w) = ∑_i (y_i − w x_i)²
       = (∑_i y_i²) − (2 ∑_i x_i y_i) w + (∑_i x_i²) w²

We want to minimize a quadratic function of w.
Linear Regression
Easy to show the sum of squares is minimized when

  w = (∑_i x_i y_i) / (∑_i x_i²)

The maximum likelihood model is
  Out(x) = wx
We can use it for prediction.
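For concreteness, here is a minimal sketch (my own illustration, not part of the original slides) of the closed form w = Σ_i x_i y_i / Σ_i x_i², applied to the five-point dataset shown earlier:

```python
# 1-parameter MLE: w = sum(x*y) / sum(x^2)
x = [1.0, 3.0, 2.0, 1.5, 4.0]   # inputs from the earlier DATASET slide
y = [1.0, 2.2, 2.0, 1.9, 3.1]   # outputs

w = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def out(x_new):
    """Maximum-likelihood model Out(x) = w * x."""
    return w * x_new

print(f"w = {w:.4f}, Out(2.5) = {out(2.5):.4f}")
```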
Linear Regression
Easy to show the sum of squares is minimized when

  w = (∑_i x_i y_i) / (∑_i x_i²)

The maximum likelihood model is
  Out(x) = wx
We can use it for prediction.

Note: In Bayesian stats you’d have ended up with a probability distribution over w (the figure sketches p(w) against w), and predictions would have given a probability distribution over the expected output. Often useful to know your confidence. Max likelihood can give some kinds of confidence too.
Multivariate Regression
What if the inputs are vectors?

(The figure shows a 2-d input example: datapoints scattered in the (x1, x2) plane, each labelled with its output value.)

Dataset has form
  x_1   y_1
  x_2   y_2
  x_3   y_3
   :     :
  x_R   y_R
Multivariate Regression
Write matrix X (one row per datapoint) and vector Y thus:

  X = ( x_11  x_12  …  x_1m )        Y = ( y_1 )
      ( x_21  x_22  …  x_2m )            ( y_2 )
      (  :                :  )            (  :  )
      ( x_R1  x_R2  …  x_Rm )            ( y_R )

i.e. row i of X is the input vector x_i. (There are R datapoints; each input has m components.)

The linear regression model assumes a vector w such that
  Out(x) = wᵀx = w_1 x[1] + w_2 x[2] + … + w_m x[m]

The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)
Multivariate Regression
(The same matrices X and Y, model Out(x) = wᵀx, and estimate w = (XᵀX)⁻¹(XᵀY) as on the previous slide.)

IMPORTANT EXERCISE: PROVE IT !!!!!
Multivariate Regression (con’t)
The max. likelihood w is  w = (XᵀX)⁻¹(XᵀY)

XᵀX is an m × m matrix: its i,j’th element is  ∑_{k=1}^{R} x_ki x_kj

XᵀY is an m-element vector: its i’th element is  ∑_{k=1}^{R} x_ki y_k
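The closed form above translates directly into a few lines of linear algebra. A sketch (an illustration, not from the slides; the toy data values are made up), using numpy and solving the normal equations rather than explicitly inverting XᵀX:

```python
import numpy as np

# Toy data: R datapoints, each input has m components.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])          # R x m
Y = np.array([3.0, 2.5, 4.0, 7.0])  # R

# Max. likelihood weights: w = (X^T X)^{-1} (X^T Y).
# Solving the linear system is numerically preferable to forming the inverse.
w = np.linalg.solve(X.T @ X, X.T @ Y)

def out(x):
    """Linear model Out(x) = w^T x."""
    return w @ x

print(w, out(np.array([2.0, 2.0])))
```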
What about a constant term?
We may expect
linear data that does
not go through the
origin.
Statisticians and
Neural Net Folks all
agree on a simple
obvious hack.
Can you guess??
The constant term
• The trick is to create a fake input “X0” that always takes the value 1

Before:                    After:
  X1  X2 | Y                 X0  X1  X2 | Y
   2   4 | 16                 1   2   4 | 16
   3   4 | 17                 1   3   4 | 17
   5   5 | 20                 1   5   5 | 20

Before:  Y = w1 X1 + w2 X2
  …has to be a poor model

After:  Y = w0 X0 + w1 X1 + w2 X2
          = w0 + w1 X1 + w2 X2
  …has a fine constant term

In this example, you should be able to see the MLE w0, w1 and w2 by inspection.
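As a concrete check (an illustrative sketch, not from the slides), appending the constant column and solving the least-squares problem fits this tiny table exactly:

```python
import numpy as np

# The small table from this slide.
X = np.array([[2.0, 4.0],
              [3.0, 4.0],
              [5.0, 5.0]])
Y = np.array([16.0, 17.0, 20.0])

# The hack: prepend a fake input X0 that always takes the value 1.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares / max.-likelihood weights (w0, w1, w2).
w, *_ = np.linalg.lstsq(X1, Y, rcond=None)
print(w)   # w0=10, w1=1, w2=1 reproduces Y exactly: 10 + X1 + X2
```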
Regression with varying noise
• Suppose you know the variance of the noise that was added to each datapoint.

  x_i    y_i    σ_i²
  ½      ½      4
  1      1      1
  2      1      1/4
  2      3      4
  3      2      1/4

(The figure plots each datapoint with an error bar of width σ_i.)

Assume  y_i ~ N(w x_i, σ_i²)

What’s the MLE estimate of w?
MLE estimation with varying noise
argmax_w  log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ_1², σ_2², …, σ_R², w)

  = argmin_w  ∑_{i=1}^{R} (y_i − w x_i)² / σ_i²
        (Assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying.)

  = the w such that  ∑_{i=1}^{R} x_i (y_i − w x_i) / σ_i² = 0
        (Setting dLL/dw equal to zero.)

  = ( ∑_{i=1}^{R} x_i y_i / σ_i² ) / ( ∑_{i=1}^{R} x_i² / σ_i² )
        (Trivial algebra.)
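A small sketch of this closed form (illustrative only; the datapoints are the table from the previous slide, as reconstructed here):

```python
# Weighted 1-parameter MLE: w = sum(x*y/var) / sum(x^2/var)
x   = [0.5, 1.0, 2.0, 2.0, 3.0]
y   = [0.5, 1.0, 1.0, 3.0, 2.0]
var = [4.0, 1.0, 0.25, 4.0, 0.25]   # sigma_i^2 for each datapoint

num = sum(xi * yi / v for xi, yi, v in zip(x, y, var))
den = sum(xi * xi / v for xi, v in zip(x, var))
w = num / den
print(f"w = {w:.4f}")   # low-variance points dominate the estimate
```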
This is Weighted Regression
• We are asking to minimize the weighted sum of squares

  argmin_w  ∑_{i=1}^{R} (y_i − w x_i)² / σ_i²

(The figure shows the same datapoints and error bars as before.)

where the weight for the i’th datapoint is 1/σ_i².
Weighted Multivariate Regression
With W = diag(1/σ_1, 1/σ_2, …, 1/σ_R), the max. likelihood w is
  w = ((WX)ᵀ(WX))⁻¹ ((WX)ᵀ(WY))

(WX)ᵀ(WX) is an m × m matrix: its i,j’th element is  ∑_{k=1}^{R} x_ki x_kj / σ_k²

(WX)ᵀ(WY) is an m-element vector: its i’th element is  ∑_{k=1}^{R} x_ki y_k / σ_k²
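A compact way to compute this (a sketch under the assumption, stated above, that W = diag(1/σ_k)) is to rescale each row of X and Y by 1/σ_k and then reuse the ordinary normal equations:

```python
import numpy as np

def weighted_regression(X, Y, sigma2):
    """Max.-likelihood weights when datapoint k has noise variance sigma2[k]."""
    W = np.diag(1.0 / np.sqrt(sigma2))   # W = diag(1/sigma_k)
    WX, WY = W @ X, W @ Y
    return np.linalg.solve(WX.T @ WX, WX.T @ WY)

# Toy example (values are illustrative, not from the slides).
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 2.0], [4.0, 1.5]])
Y = np.array([1.2, 2.1, 3.9, 4.2])
sigma2 = np.array([0.25, 1.0, 0.25, 4.0])
print(weighted_regression(X, Y, sigma2))
```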
Non-linear Regression
• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

  x_i    y_i
  ½      ½
  1      2.5
  2      3
  3      2
  3      3

(The figure plots the datapoints.)

Assume  y_i ~ N(√(w + x_i), σ²)

What’s the MLE estimate of w?
Non-linear MLE estimation
argmax_w  log p(y_1, y_2, …, y_R | x_1, x_2, …, x_R, σ, w)

  = argmin_w  ∑_{i=1}^{R} ( y_i − √(w + x_i) )²
        (Assuming i.i.d. and then plugging in the equation for the Gaussian and simplifying.)

  = the w such that  ∑_{i=1}^{R} ( y_i − √(w + x_i) ) / √(w + x_i)  = 0
        (Setting dLL/dw equal to zero.)
Non-linear MLE estimation
(The same derivation as above, repeated.)

We’re down the algebraic toilet.
So guess what we do?
Non-linear MLE estimation
(The derivation from the previous slide, once more; the stationarity equation has no tidy closed form.)

Common (but not only) approach: Numerical Solutions:
• Line Search
• Simulated Annealing
• Gradient Descent
• Conjugate Gradient
• Levenberg-Marquardt
• Newton’s Method

Also, special-purpose statistical-optimization-specific tricks such as E.M. (See the Gaussian Mixtures lecture for an introduction.)
GRADIENT DESCENT
Suppose we have a scalar function
  f(w): ℜ → ℜ
We want to find a local minimum.
Assume our current weight is w.

GRADIENT DESCENT RULE:
  w ← w − η ∂f(w)/∂w

η is called the LEARNING RATE. A small positive number, e.g. η = 0.05.
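As an illustration (my own sketch, not from the slides), here is the rule applied to the non-linear MLE problem above, minimizing f(w) = ∑_i (y_i − √(w + x_i))² using the derivative worked out on the earlier slide; the data is the small table as reconstructed above:

```python
import math

# Data as reconstructed from the non-linear regression slide above.
x = [0.5, 1.0, 2.0, 3.0, 3.0]
y = [0.5, 2.5, 3.0, 2.0, 3.0]

def dfdw(w):
    """d/dw of sum_i (y_i - sqrt(w + x_i))^2, from the earlier derivation."""
    return sum(-(yi - math.sqrt(w + xi)) / math.sqrt(w + xi) for xi, yi in zip(x, y))

eta = 0.05          # the LEARNING RATE
w = 1.0             # arbitrary starting weight
for _ in range(1000):
    w = w - eta * dfdw(w)   # gradient descent rule: w <- w - eta * df/dw

print(f"w = {w:.4f}")
```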
GRADIENT DESCENT
(The same rule as above.) Recall Andrew’s favorite default value for anything: η = 0.05.

QUESTION: Justify the Gradient Descent Rule
Gradient Descent in “m” Dimensions
Given  f(w): ℜ^m → ℜ

  ∇f(w) = ( ∂f(w)/∂w_1, …, ∂f(w)/∂w_m )ᵀ

∇f(w) points in the direction of steepest ascent; |∇f(w)| is the gradient in that direction.

GRADIENT DESCENT RULE:
  w ← w − η ∇f(w)

Equivalently,
  w_j ← w_j − η ∂f(w)/∂w_j      …where w_j is the jth weight

“just like a linear feedback system”
What’s all this got to do with Neural
Nets, then, eh??
For supervised learning, neural nets are also models with
vectors of w parameters in them. They are now called
weights.
As before, we want to compute the weights to minimize the sum-of-squared residuals.
Which turns out, under the “Gaussian i.i.d. noise” assumption, to be max. likelihood.
Instead of explicitly solving for the max. likelihood weights, we use GRADIENT DESCENT to SEARCH for them.

“Why?” you ask, a querulous expression in your eyes.
“Aha!!” I reply: “We’ll see later.”
Linear Perceptrons
They are multivariate linear models:
Out(x) = wTx
And “training” consists of minimizing sum-of-squared residuals
by gradient descent.
  E = ∑_k ( Out(x_k) − y_k )²  =  ∑_k ( wᵀx_k − y_k )²
QUESTION: Derive the perceptron training rule.
Linear Perceptron Training Rule
  E = ∑_{k=1}^{R} ( y_k − wᵀx_k )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

  w_j ← w_j − η ∂E/∂w_j

So what’s ∂E/∂w_j ?
Linear Perceptron Training Rule
  E = ∑_{k=1}^{R} ( y_k − wᵀx_k )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

  w_j ← w_j − η ∂E/∂w_j

So what’s ∂E/∂w_j ?

  ∂E/∂w_j = ∑_{k=1}^{R} ∂/∂w_j ( y_k − wᵀx_k )²
          = ∑_{k=1}^{R} 2 ( y_k − wᵀx_k ) ∂/∂w_j ( y_k − wᵀx_k )
          = −2 ∑_{k=1}^{R} δ_k ∂/∂w_j wᵀx_k        …where δ_k = y_k − wᵀx_k
          = −2 ∑_{k=1}^{R} δ_k ∂/∂w_j ∑_{i=1}^{m} w_i x_ki
          = −2 ∑_{k=1}^{R} δ_k x_kj
Linear Perceptron Training Rule
  E = ∑_{k=1}^{R} ( y_k − wᵀx_k )²

Gradient descent tells us we should update w thusly if we wish to minimize E:

  w_j ← w_j − η ∂E/∂w_j      …where…      ∂E/∂w_j = −2 ∑_{k=1}^{R} δ_k x_kj

So:
  w_j ← w_j + 2η ∑_{k=1}^{R} δ_k x_kj

We frequently neglect the 2 (meaning we halve the learning rate).
The “Batch” perceptron algorithm
1) Randomly initialize weights w_1, w_2, …, w_m
2) Get your dataset (append 1’s to the inputs if you don’t want to go through the origin).
3) for i = 1 to R:   δ_i := y_i − wᵀx_i
4) for j = 1 to m:   w_j ← w_j + η ∑_{i=1}^{R} δ_i x_ij
5) if ∑_i δ_i² stops improving then stop. Else loop back to 3.
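A minimal, illustrative implementation of these five steps (the function name, stopping tolerance, and example learning rate are my own choices, not from the slides):

```python
import numpy as np

def batch_perceptron(X, Y, eta=0.05, max_iters=10_000, tol=1e-9):
    """Batch gradient-descent training of a linear perceptron Out(x) = w.x.

    X: R x m array of inputs (append a column of 1's beforehand if you
       don't want the fit to go through the origin).
    Y: length-R array of target outputs.
    eta must be small enough for stability on your data.
    """
    R, m = X.shape
    w = np.random.randn(m) * 0.01          # 1) randomly initialize weights
    prev_sse = np.inf
    for _ in range(max_iters):
        delta = Y - X @ w                  # 3) delta_i = y_i - w.x_i
        w = w + eta * X.T @ delta          # 4) w_j += eta * sum_i delta_i * x_ij
        sse = float(delta @ delta)         # 5) sum of squared deltas
        if prev_sse - sse < tol:           #    stop when it stops improving
            break
        prev_sse = sse
    return w

# Usage: append the constant input and train on the first slide's dataset.
X = np.array([[1.0], [3.0], [2.0], [1.5], [4.0]])
Y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])
X1 = np.hstack([np.ones((len(X), 1)), X])
print(batch_perceptron(X1, Y))   # approximate intercept and slope
```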
  δ_i ← y_i − wᵀx_i
  w_j ← w_j + η δ_i x_ij

A RULE KNOWN BY MANY NAMES:
• The LMS rule
• The delta rule
• The adaline rule
• The Widrow-Hoff rule
• Classical conditioning
If data is voluminous and arrives fast
Input-output pairs (x,y) come streaming in very
quickly. THEN
Don’t bother remembering old ones.
Just keep using new ones.
  observe (x, y)
  δ ← y − wᵀx
  ∀j:  w_j ← w_j + η δ x_j
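A sketch of this online (streaming) update (illustrative; the `stream` argument stands in for whatever source supplies the (x, y) pairs):

```python
import numpy as np

def online_lms(stream, m, eta=0.01):
    """Online LMS / delta-rule update: process each (x, y) once and discard it."""
    w = np.zeros(m)
    for x, y in stream:                 # observe (x, y)
        delta = y - w @ x               # delta <- y - w.x
        w = w + eta * delta * x         # for all j: w_j <- w_j + eta * delta * x_j
    return w

# e.g. online_lms(zip(X_stream, y_stream), m=3), where X_stream/y_stream are
# hypothetical iterables of input vectors and outputs.
```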
Gradient Descent vs Matrix Inversion
for Linear Perceptrons
GD Advantages (MI disadvantages):
• Biologically plausible
• With very very many attributes each iteration costs only O(mR). If
fewer than m iterations needed we’ve beaten Matrix Inversion
• More easily parallelizable (or implementable in wetware)?
GD Disadvantages (MI advantages):
• It’s moronic
• It’s essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations
• If m is small it’s especially outrageous. If m is large then the direct matrix inversion method gets fiddly but not impossible if you want to be efficient.
• Hard to choose a good learning rate
• Matrix inversion takes predictable time. You can’t be sure when
gradient descent will stop.
Gradient Descent vs Matrix Inversion for Linear Perceptrons
(The same pros and cons as above.)
But we’ll see that soon: GD has an important extra trick up its sleeve.
Perceptrons for Classification
What if all outputs are 0’s or 1’s?
(The figure shows two example datasets of 0/1 outputs against x.)
We can do a linear fit.
Our prediction is 0 if Out(x) ≤ ½, and 1 if Out(x) > ½.
WHAT’S THE BIG PROBLEM WITH THIS???
Perceptrons for Classification
(The same slide repeated, with the figure annotated: Blue = Out(x), the linear fit; Green = the resulting classification.)
Classification with Perceptrons I
Don’t minimize  ∑_i ( y_i − wᵀx_i )².
Minimize the number of misclassifications instead:   [Assume outputs are +1 & −1, not +1 & 0]

  ∑_i ( y_i − Round(wᵀx_i) )

where Round(x) = −1 if x < 0, and 1 if x ≥ 0.

NOTE: CUTE & NON-OBVIOUS WHY THIS WORKS!!

The gradient descent rule can be changed to:
  if (x_i, y_i) correctly classified:   don’t change
  if wrongly predicted as 1:            w ← w − x_i
  if wrongly predicted as −1:           w ← w + x_i
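A small sketch of this modified rule (illustrative; the loop structure, epoch cap, and stopping condition are my own choices):

```python
import numpy as np

def perceptron_classify(X, Y, max_epochs=100):
    """Classic perceptron rule for labels in {+1, -1}; X is R x m."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(X, Y):
            pred = 1 if w @ x >= 0 else -1     # Round(w.x)
            if pred != y:                      # only update on mistakes
                w = w + x if y == 1 else w - x # wrongly predicted as -1 / as 1
                mistakes += 1
        if mistakes == 0:                      # everything correctly classified
            break
    return w
```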
Classification with Perceptrons II:
Sigmoid Functions
(Left figure: a least squares fit, which is useless for classification. Right figure: this fit would classify much better, but it is not a least squares fit.)
Classification with Perceptrons II: Sigmoid Functions
(The same two figures as above.)

SOLUTION: Instead of
  Out(x) = wᵀx
we’ll use
  Out(x) = g(wᵀx)
where g(x): ℜ → (0,1) is a squashing function.
The Sigmoid
  g(h) = 1 / (1 + exp(−h))

Note that if you rotate this curve through 180° centered on (0, ½) you get the same curve, i.e. g(h) = 1 − g(−h).
Can you prove this?
The Sigmoid
  g(h) = 1 / (1 + exp(−h))

Now we choose w to minimize

  ∑_{i=1}^{R} [ y_i − Out(x_i) ]²  =  ∑_{i=1}^{R} [ y_i − g(wᵀx_i) ]²
Linear Perceptron Classification
Regions
(The figure shows labelled points in the (X1, X2) plane: a cluster of 0’s and a cluster of 1’s.)

We’ll use the model
  Out(x) = g( wᵀ(x, 1) ) = g( w_1 x_1 + w_2 x_2 + w_0 )

Which region of the diagram is classified with +1, and which with 0??
Gradient descent with sigmoid on a perceptron
First, notice that g′(x) = g(x)(1 − g(x)).

Because g(x) = 1/(1 + e^(−x)),

  g′(x) = e^(−x) / (1 + e^(−x))²
        = [1/(1 + e^(−x))] · [1 − 1/(1 + e^(−x))]
        = g(x)(1 − g(x))

With
  Out(x) = g( ∑_k w_k x_k )
the error is
  E = ∑_i ( y_i − g( ∑_k w_k x_ik ) )²

so

  ∂E/∂w_j = ∑_i 2 ( y_i − g( ∑_k w_k x_ik ) ) · ( −∂/∂w_j g( ∑_k w_k x_ik ) )
          = ∑_i −2 δ_i g′(net_i) ∂/∂w_j ∑_k w_k x_ik
          = ∑_i −2 δ_i g(net_i)(1 − g(net_i)) x_ij

  where δ_i = y_i − Out(x_i)  and  net_i = ∑_k w_k x_ik

The sigmoid perceptron update rule:

  w_j ← w_j + η ∑_{i=1}^{R} δ_i g_i (1 − g_i) x_ij

  where  g_i = g( ∑_{j=1}^{m} w_j x_ij )  and  δ_i = y_i − g_i
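A compact sketch of this update rule (an illustration; the toy data, learning rate, and iteration count are made up):

```python
import numpy as np

def g(h):
    """The sigmoid squashing function."""
    return 1.0 / (1.0 + np.exp(-h))

def train_sigmoid_perceptron(X, Y, eta=0.5, iters=5000):
    """Batch gradient descent on E = sum_i (y_i - g(w.x_i))^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        gi = g(X @ w)                       # g_i = g(sum_j w_j x_ij)
        delta = Y - gi                      # delta_i = y_i - g_i
        w = w + eta * X.T @ (delta * gi * (1 - gi))   # the update rule above
    return w

# Toy 2-d classification data with a constant input appended.
X = np.array([[1, 0.2, 0.3], [1, 0.4, 0.1], [1, 0.8, 0.9], [1, 0.7, 0.6]])
Y = np.array([0.0, 0.0, 1.0, 1.0])
w = train_sigmoid_perceptron(X, Y)
print(np.round(g(X @ w), 2))    # outputs pushed towards the 0/1 targets
```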
Other Things about Perceptrons
• Invented and popularized by Rosenblatt (1962)
• Even with sigmoid nonlinearity, correct
convergence is guaranteed
• Stable behavior for overconstrained and
underconstrained problems
Perceptrons and Boolean Functions
If inputs are all 0’s and 1’s and outputs are all 0’s and 1’s…

• Can learn the function x1 ∧ x2
• Can learn the function x1 ∨ x2
(The small figures show the corresponding decision boundaries in the (X1, X2) unit square.)
• Can learn any conjunction of literals, e.g.  x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5

QUESTION: WHY?
Perceptrons and Boolean Functions
• Can learn any disjunction of literals, e.g.  x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5
• Can learn the majority function
    f(x1, x2, …, xn) = 1 if n/2 or more of the xi’s are 1
                       0 if fewer than n/2 of the xi’s are 1
• What about the exclusive-or function?
    f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)
Multilayer Networks
The class of functions representable by perceptrons is limited:

  Out(x) = g( wᵀx ) = g( ∑_j w_j x_j )

Use a wider representation!

  Out(x) = g( ∑_j W_j g( ∑_k w_jk x_k ) )

This is a non-linear function
  of a linear combination
  of non-linear functions
  of linear combinations of inputs.
A 1-HIDDEN LAYER NET
  NINPUTS = 2      NHIDDEN = 3

(The figure shows inputs x1, x2 feeding three hidden units v1, v2, v3 through weights w_jk, and the hidden units feeding the output through weights W1, W2, W3.)

  v_1 = g( ∑_{k=1}^{NINS} w_1k x_k )
  v_2 = g( ∑_{k=1}^{NINS} w_2k x_k )
  v_3 = g( ∑_{k=1}^{NINS} w_3k x_k )

  Out = g( ∑_{k=1}^{NHID} W_k v_k )
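A direct transcription of this forward pass (a sketch; the particular weight values are arbitrary):

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def forward(x, w, W):
    """1-hidden-layer net: w is NHIDDEN x NINPUTS, W has NHIDDEN entries."""
    v = g(w @ x)          # v_j = g(sum_k w_jk * x_k)
    return g(W @ v)       # Out = g(sum_k W_k * v_k)

# NINPUTS = 2, NHIDDEN = 3, arbitrary example weights.
w = np.array([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]])
W = np.array([1.0, -1.0, 0.5])
print(forward(np.array([1.0, 2.0]), w, W))
```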
OTHER NEURAL NETS
(The figures sketch two variants: a net with 2 hidden layers plus a constant “1” input term, and a net with “JUMP” connections running directly from the inputs to the output.)

2-Hidden layers + Constant Term

“JUMP” CONNECTIONS:
  Out = g( ∑_{k=1}^{NINS} w_0k x_k + ∑_{k=1}^{NHID} W_k v_k )
Backpropagation



  Out(x) = g( ∑_j W_j g( ∑_k w_jk x_k ) )

Find a set of weights {W_j}, {w_jk} to minimize

  ∑_i ( y_i − Out(x_i) )²

by gradient descent.

That’s it!
That’s the backpropagation algorithm.
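Spelled out for the 1-hidden-layer net above, that gradient descent looks like this. The sketch below is my own illustration (sum-of-squares error, sigmoid units everywhere, the factor of 2 folded into η); the toy XOR data, learning rate, and iteration count are assumptions, not the slides' own choices:

```python
import numpy as np

def g(h):
    return 1.0 / (1.0 + np.exp(-h))

def backprop_train(X, Y, n_hidden=4, eta=1.0, iters=50_000, seed=0):
    """Gradient descent on E = sum_i (y_i - Out(x_i))^2 for a 1-hidden-layer net."""
    rng = np.random.default_rng(seed)
    R, m = X.shape
    w = rng.normal(scale=0.5, size=(n_hidden, m))   # input -> hidden weights w_jk
    W = rng.normal(scale=0.5, size=n_hidden)        # hidden -> output weights W_j
    for _ in range(iters):
        v = g(X @ w.T)                   # hidden activations, R x n_hidden
        out = g(v @ W)                   # network outputs, length R
        delta = Y - out                  # residuals
        # Chain rule through the output sigmoid and then the hidden sigmoids.
        d_out = delta * out * (1 - out)              # R
        grad_W = v.T @ d_out                         # n_hidden
        d_hidden = np.outer(d_out, W) * v * (1 - v)  # R x n_hidden
        grad_w = d_hidden.T @ X                      # n_hidden x m
        W += eta * grad_W
        w += eta * grad_w
    return w, W

# XOR-style toy problem with a constant input appended (illustrative only).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])
w, W = backprop_train(X, Y)
print(np.round(g(g(X @ w.T) @ W), 2))
# Usually approaches the XOR targets 0, 1, 1, 0; as the next slide notes,
# convergence to the global minimum is not guaranteed.
```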
Backpropagation Convergence
Convergence to a global minimum is not guaranteed.
• In practice, this is not a problem, apparently.
• Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.

IMPLEMENTING BACKPROP: Differentiate the monster sum-square residual. Write down the Gradient Descent Rule. It turns out to be easier & computationally efficient to use lots of local variables with names like h_j, o_k, v_j, net_i, etc…
Choosing the learning rate
• This is a subtle art.
• Too small: can take days instead of minutes
to converge
• Too large: diverges (MSE gets larger and
larger while the weights increase and
usually oscillate)
• Sometimes the “just right” value is hard to
find.
Learning-rate problems
(Figure reproduced from J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, 1994.)
Improving Simple Gradient Descent
Momentum
Don’t just change weights according to the current datapoint. Re-use changes from earlier iterations.

Let Δw(t) = the weight changes at time t.
Let −η ∂E/∂w be the change we would make with regular gradient descent.
Instead we use

  Δw(t+1) = −η ∂E/∂w + α Δw(t)        (α is the momentum parameter)
  w(t+1) = w(t) + Δw(t)

Momentum damps oscillations.
A hack? Well, maybe.
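A sketch of the momentum update (illustrative; `grad` stands in for whatever computes ∂E/∂w for your model, and the example function is made up):

```python
import numpy as np

def gd_with_momentum(grad, w0, eta=0.05, alpha=0.9, iters=1000):
    """Gradient descent where each step re-uses a fraction alpha of the previous change."""
    w = np.array(w0, dtype=float)
    dw = np.zeros_like(w)                 # the previous weight change
    for _ in range(iters):
        dw = -eta * grad(w) + alpha * dw  # delta-w = -eta dE/dw + alpha * (previous delta-w)
        w = w + dw                        # apply the change
    return w

# Example: minimize a narrow quadratic bowl E(w) = w0^2 + 10*w1^2.
print(gd_with_momentum(lambda w: np.array([2 * w[0], 20 * w[1]]), [3.0, 2.0]))
```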
Momentum illustration
(Figure only.)
Improving Simple Gradient Descent
Newton’s method

  E(w + h) = E(w) + hᵀ ∂E/∂w + ½ hᵀ (∂²E/∂w²) h + O(|h|³)

If we neglect the O(h³) terms, this is a quadratic form.

Quadratic form fun facts:
  If y = c + bᵀx − ½ xᵀA x, and A is SPD, then
  x_opt = A⁻¹b is the value of x that maximizes y.
Improving Simple Gradient Descent
Newton’s method

  E(w + h) = E(w) + hᵀ ∂E/∂w + ½ hᵀ (∂²E/∂w²) h + O(|h|³)

If we neglect the O(h³) terms, this is a quadratic form, and the minimizing step is

  w ← w − (∂²E/∂w²)⁻¹ ∂E/∂w

This should send us directly to the global minimum if the function is truly quadratic.
And it might get us close if it’s locally quadratic-ish.
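A sketch of one Newton step (illustrative; `grad` and `hessian` stand in for whatever computes ∂E/∂w and ∂²E/∂w² for your model):

```python
import numpy as np

def newton_step(w, grad, hessian):
    """w <- w - (d2E/dw2)^{-1} dE/dw, solving the linear system instead of inverting."""
    return w - np.linalg.solve(hessian(w), grad(w))

# On a truly quadratic E(w) = w0^2 + 10*w1^2, one step lands on the minimum.
grad = lambda w: np.array([2 * w[0], 20 * w[1]])
hessian = lambda w: np.diag([2.0, 20.0])
print(newton_step(np.array([3.0, 2.0]), grad, hessian))   # -> [0, 0]
```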
Improving Simple Gradient Descent
Newton’s method
(The same expansion and update rule as above, with a caveat.)

BUT (and it’s a big but)…
That second derivative matrix can be expensive and fiddly to compute.
If we’re not already in the quadratic bowl, we’ll go nuts.
Improving Simple Gradient Descent
Conjugate Gradient
Another method which attempts to exploit the “local quadratic bowl” assumption, but does so while only needing to use ∂E/∂w and not ∂²E/∂w².

It is also more stable than Newton’s method if the local quadratic bowl assumption is violated.

It’s complicated, outside our scope, but it often works well. More details in Numerical Recipes in C.
BEST GENERALIZATION
Intuitively, you want to use the smallest,
simplest net that seems to fit the data.
HOW TO FORMALIZE THIS INTUITION?
1. Don’t. Just use intuition
2. Bayesian Methods Get it Right
3. Statistical Analysis explains what’s going on
4. Cross-validation
(Discussed in the next lecture.)
What You Should Know
• How to implement multivariate least-squares linear regression.
• Derivation of least squares as the max. likelihood estimator of linear coefficients
• The general gradient descent rule
What You Should Know
• Perceptrons
  → Linear output, least squares
  → Sigmoid output, least squares
• Multilayer nets
  → The idea behind back prop
  → Awareness of better minimization methods
• Generalization. What it means.
APPLICATIONS
To Discuss:
• What can non-linear regression be useful for?
• What can neural nets (used as non-linear
regressors) be useful for?
• What are the advantages of N. Nets for
nonlinear regression?
• What are the disadvantages?
Other Uses of Neural Nets…
• Time series with recurrent nets
• Unsupervised learning (clustering, principal components, and non-linear versions thereof)
• Combinatorial optimization with Hopfield
nets, Boltzmann Machines
• Evaluation function learning (in
reinforcement learning)