CS 476: Networks of Neural Computation
WK2 – Perceptron
Dr. Stathis Kasderidis
Dept. of Computer Science
University of Crete
Spring Semester, 2009
Contents
•Elements of Optimisation Theory
  •Definitions
  •Properties of Quadratic Functions
  •Model Algorithm for Smooth Functions
  •Classical Optimisation Methods
    •1st Derivative Methods
    •2nd Derivative Methods
    •Methods for Quadratic Functions
    •Other Methods
Contents II
•Perceptron Model
•Convergence Theorem of Perceptron
•Conclusions
Definitions
•Types of Optimisation Problems:
  •Unconstrained
  •Linear Constraints
  •Non-linear Constraints
•General nonlinear constrained optimisation problem definition:
•NCP:
  minimise F(x), x ∈ ℝ^m
  subject to ci(x) = 0, i = 1,…,m'
             ci(x) ≥ 0, i = m'+1,…,m
Definitions II
•Strong local minimum: A point x* is a SLM of NCP if there exists δ > 0 such that:
  •A1: F(x) is defined in N(x*, δ); and
  •A2: F(x*) < F(y) for all y ∈ N(x*, δ), y ≠ x*
•Weak local minimum: A point x* is a WLM of NCP if there exists δ > 0 such that:
  •B1: F(x) is defined in N(x*, δ);
  •B2: F(x*) ≤ F(y) for all y ∈ N(x*, δ); and
  •B3: x* is not a strong local minimum
Definitions III
•UCP:
  minimise F(x), x ∈ ℝ^m
•Necessary conditions for a minimum of UCP:
  •C1: ||g(x*)|| = 0, i.e. x* is a stationary point;
  •C2: G(x*) is positive semi-definite.
•Sufficient conditions for a minimum of UCP:
  •D1: ||g(x*)|| = 0;
  •D2: G(x*) is positive definite.
Definitions IV
•Assume that we expand F in its Taylor series:
  F(x* + εp) = F(x*) + εpTg(x*) + (1/2)ε²pTG(x* + θεp)p
  where x, p ∈ ℝ^m, 0 ≤ θ ≤ 1, and ε is a positive scalar
•Any vector p which satisfies:
  pTg(x*) < 0
  is called a descent direction at x*
Properties of Quadratic Functions
•Assume that a quadratic function is given by:
  Φ(x) = cTx + (1/2)xTGx
  for some constant vector c and a constant symmetric matrix G (the Hessian matrix of Φ)
•The definition of Φ implies the following relation between Φ(x+αp) and Φ(x) for any vector x and scalar α:
  Φ(x+αp) = Φ(x) + αpT(Gx + c) + (1/2)α²pTGp
Properties of Quadratic Functions I
•The function Φ has a stationary point when:
  ∇Φ(x*) = Gx* + c = 0
•Consequently a stationary point must satisfy the system of linear equations:
  Gx* = -c
•The system might have:
  •No solutions (c is not a linear combination of the columns of G)
  •Many solutions (if G is singular)
  •A unique solution (if G is non-singular)
Properties of Quadratic Functions II
•If x* is a stationary point it follows that:
  Φ(x* + αp) = Φ(x*) + (1/2)α²pTGp
•Hence the behaviour of Φ in a neighbourhood of x* is determined by the matrix G. Let λj and uj denote the j-th eigenvalue and eigenvector of G. By definition:
  Guj = λjuj
•The symmetry of G implies that the set {uj}, j=1..m, is orthonormal. So, when p is equal to uj:
  Φ(x* + αuj) = Φ(x*) + (1/2)α²λj
Properties of Quadratic Functions III
•Thus the change in Φ when moving away from x* along the direction of uj depends on the sign of λj:
  •If λj > 0, Φ strictly increases as |α| increases
  •If λj < 0, Φ is monotonically decreasing as |α| increases
  •If λj = 0, the value of Φ remains constant when moving along any direction parallel to uj
Properties of Quadratic Functions IV
•If G is positive definite, x* is the global minimum of Φ
•If G is positive semi-definite, a stationary point (if it exists) is a weak local minimum
•If G is indefinite and non-singular, x* is a saddle point and Φ is unbounded above and below
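•As a minimal sketch (in Python with NumPy, using a small hypothetical G and c), the stationary point of Φ can be found by solving Gx* = -c and classified through the eigenvalues of G:

```python
import numpy as np

# Hypothetical quadratic Phi(x) = c^T x + 0.5 x^T G x
G = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # constant symmetric matrix (the Hessian of Phi)
c = np.array([-1.0, 2.0])

# A stationary point satisfies G x* = -c; unique here, since G is non-singular
x_star = np.linalg.solve(G, -c)

# The eigenvalues of G determine the nature of the stationary point
eigvals = np.linalg.eigvalsh(G)
if np.all(eigvals > 0):
    kind = "global minimum (G positive definite)"
elif np.all(eigvals >= 0):
    kind = "weak local minimum (G positive semi-definite)"
else:
    kind = "saddle point (G indefinite)"

print(x_star, kind)
```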
Model Algorithm for Smooth Functions
•Algorithm U (model algorithm for m-dimensional unconstrained minimisation)
•Let xk be the current estimate of x*
•U1: [Test for convergence] If the conditions are satisfied, the algorithm terminates with xk as the solution;
•U2: [Compute a search direction] Compute a non-zero m-vector pk, the direction of search
Model Algorithm for Smooth Functions I
•U3: [Compute a step length] Compute a positive scalar αk, the step length, for which it holds that:
  F(xk + αkpk) < F(xk)
•U4: [Update the estimate of the minimum] Set:
  xk+1 ← xk + αkpk, k ← k+1
  and go back to step U1.
•To satisfy the descent condition (Fk+1 < Fk), pk should be a descent direction:
  gkTpk < 0
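•A minimal sketch of Algorithm U (in Python with NumPy; the convergence test, the backtracking step-length rule and the pluggable direction are simple placeholder choices, not part of the original algorithm statement):

```python
import numpy as np

def algorithm_u(F, grad, search_direction, x0, tol=1e-6, max_iter=200):
    """Model algorithm for m-dimensional unconstrained minimisation (U1-U4)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        # U1: test for convergence (here simply a small gradient norm)
        if np.linalg.norm(g) < tol:
            break
        # U2: compute a non-zero search direction p_k (must satisfy g_k^T p_k < 0)
        p = search_direction(x, g)
        # U3: compute a step length alpha_k with F(x_k + alpha_k p_k) < F(x_k)
        alpha = 1.0
        while F(x + alpha * p) >= F(x) and alpha > 1e-12:
            alpha *= 0.5                      # naive backtracking
        # U4: update the estimate of the minimum and repeat
        x = x + alpha * p
    return x
```

•Passing search_direction=lambda x, g: -g reproduces the steepest descent choice of the next slide; a Newton-type rule can be plugged in the same way.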
Classical Optimisation Methods: 1st derivative
•1st Derivative Methods:
•A linear approximation to F about xk is:
  F(xk + p) = F(xk) + gkTp
•Steepest Descent Method: Select pk as:
  pk = -gk
•It can be shown that:
  F(xk+1) - F(x*) ≤ [(λmax - λmin)/(λmax + λmin)]² [F(xk) - F(x*)]
                  = [(κ-1)/(κ+1)]² [F(xk) - F(x*)]
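•A brief numerical illustration of the quoted bound (Python with NumPy, hypothetical Hessian): the contraction factor [(κ-1)/(κ+1)]² approaches 1 as κ grows, which is why steepest descent slows down on ill-conditioned problems:

```python
import numpy as np

G = np.array([[10.0, 0.0],
              [0.0,  1.0]])               # hypothetical Hessian of a quadratic F
lam = np.linalg.eigvalsh(G)
kappa = lam[-1] / lam[0]                   # spectral condition number lambda_max / lambda_min

factor = ((kappa - 1.0) / (kappa + 1.0)) ** 2
print(f"kappa = {kappa:.1f}, worst-case contraction factor = {factor:.3f}")
# kappa = 10 gives a factor of about 0.669; kappa = 1000 gives about 0.996,
# i.e. almost no progress per iteration.
```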
Classical Optimisation Methods I: 1st derivative
•κ is the spectral condition number of G
•The steepest descent method can be very slow if κ is large!
•Other 1st derivative methods include:
  •Discrete Newton method: Approximates the Hessian, G, with finite differences of the gradient g
  •Quasi-Newton methods: Approximate the curvature:
    skTGsk ≈ (g(xk+sk) - g(xk))Tsk
Classical Optimisation Methods II: 2nd derivative
•2nd Derivative Methods:
•A quadratic approximation to F about xk is:
  F(xk + p) = F(xk) + gkTp + (1/2)pTGkp
•The function is minimised by finding the minimum of the quadratic function Φ:
  Φ(p) = gkTp + (1/2)pTGkp
•This has a solution that satisfies the linear system:
  Gkpk = -gk
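•A minimal sketch of this step (Python with NumPy, hypothetical gk and Gk); the resulting direction can serve as the search-direction rule of Algorithm U:

```python
import numpy as np

def newton_direction(G_k, g_k):
    """Solve G_k p_k = -g_k for the Newton direction."""
    return np.linalg.solve(G_k, -g_k)

# Hypothetical gradient and Hessian at the current iterate x_k
g_k = np.array([1.0, -2.0])
G_k = np.array([[3.0, 1.0],
                [1.0, 2.0]])
p_k = newton_direction(G_k, g_k)
print(p_k, g_k @ p_k)   # g_k^T p_k < 0 here, since G_k is positive definite
```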
Classical Optimisation Methods III: 2nd derivative
•The vector pk in the previous equation is called the Newton direction and the method is called the Newton method
•Conjugate Gradient Descent: Another 2nd derivative method
Classical Optimisation Methods IV: Methods for Quadratics
•In many problems the function F(x) is a sum of squares:
  F(x) = (1/2) Σi=1..m fi(x)² = (1/2)||f(x)||²
•The i-th component of the m-vector f(x) is the function fi(x), and ||f(x)|| is called the residual at x.
•Problems of this type appear in nonlinear parameter estimation. Assume that x ∈ ℝ^n is a parameter vector and {ti} is the set of independent variables. Then the least-squares problem is:
  minimise Σi=1..m [Φ(x, ti) - F(ti)]², x ∈ ℝ^n
Classical Optimisation Methods V: Methods for Quadratics
•where:
  fi(x) = Φ(x, ti) - yi
  and Y = F(t) is the "true" function
  •yi are the desired responses
•In the Least Squares Problem the gradient, g, and the Hessian, G, have a special structure.
•Assume that the Jacobian matrix of f(x) is denoted by J(x) (an m×n matrix) and let the matrix Gi(x) denote the Hessian of fi(x). Then:
Classical Optimisation Methods VI: Methods for Quadratics
  g(x) = J(x)Tf(x)
  and
  G(x) = J(x)TJ(x) + Q(x)
•where Q is:
  Q(x) = Σi=1..m fi(x)Gi(x)
•We observe that the Hessian is a special combination of first and second order information.
•Least-squares methods are based on the premise that eventually the first order term J(x)TJ(x) dominates the second order one, Q(x)
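•A hedged illustration of this structure (Python with NumPy, using a hypothetical model Φ(x, t) = x0·exp(x1·t) and made-up data points):

```python
import numpy as np

# Hypothetical model Phi(x, t) = x0 * exp(x1 * t) with made-up data (t_i, y_i)
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.6, 7.5, 20.0])

def residuals(x):
    """f_i(x) = Phi(x, t_i) - y_i."""
    return x[0] * np.exp(x[1] * t) - y

def jacobian(x):
    """J(x): m-by-n matrix of partial derivatives of f with respect to x."""
    return np.column_stack([np.exp(x[1] * t),               # d f_i / d x0
                            x[0] * t * np.exp(x[1] * t)])   # d f_i / d x1

x = np.array([1.0, 0.5])
f = residuals(x)
J = jacobian(x)

g = J.T @ f           # gradient  g(x) = J(x)^T f(x)
G_approx = J.T @ J    # first-order part of the Hessian; Q(x) is neglected
```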
Classical Optimisation Methods VII: Methods for Quadratics
•The Gauss-Newton Method:
•Let xk denote the current estimate of the solution; a quantity subscripted by k will denote that quantity evaluated at xk. From the Newton direction we get:
  (JkTJk + Qk)pk = -JkTfk
•Let the vector pN denote the Newton direction. If ||fk|| → 0 as xk → x*, the matrix Qk → 0. Thus the Newton direction can be approximated by the solution of the equations:
  JkTJkpk = -JkTfk
Classical Optimisation Methods VIII: Methods for Quadratics
•The solution of the above problem is given by the solution to the linear least-squares problem:
  minimise (1/2)||Jkp + fk||², p ∈ ℝ^n
•The solution is unique if Jk has full rank. The vector pGN which solves the linear problem is called the Gauss-Newton direction. This vector approximates the Newton direction pN as ||Qk|| → 0
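•A minimal Gauss-Newton iteration sketch (Python with NumPy, reusing the hypothetical residuals and jacobian functions sketched earlier); the linear least-squares subproblem is solved with numpy.linalg.lstsq:

```python
import numpy as np

def gauss_newton(residuals, jacobian, x0, n_iter=20, tol=1e-8):
    """Iterate x_{k+1} = x_k + p_GN, where p_GN minimises 0.5 ||J_k p + f_k||^2."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        f = residuals(x)
        J = jacobian(x)
        # Gauss-Newton direction: least-squares solution of J_k p = -f_k
        p, *_ = np.linalg.lstsq(J, -f, rcond=None)
        x = x + p
        if np.linalg.norm(p) < tol:
            break
    return x
```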
Classical Optimisation Methods IX: Methods for Quadratics
•The Levenberg-Marquardt Method:
•In this method the search direction is defined as the solution to the equations:
  (JkTJk + λkI)pk = -JkTfk
•where λk is a non-negative scalar. A unit step is taken along pk, i.e.
  xk+1 ← xk + pk
•It can be shown that for some scalar Δ, related to λk, the vector pk is the solution to the constrained subproblem:
Classical Optimisation Methods X: Methods for Quadratics
  minimise (1/2)||Jkp + fk||², p ∈ ℝ^n
  subject to ||p||₂ ≤ Δ
•If λk = 0, pk is the Gauss-Newton direction;
•If λk → ∞, ||pk|| → 0 and pk becomes parallel to the steepest descent direction
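•A sketch of a single Levenberg-Marquardt step under the same assumptions (Python with NumPy, hypothetical Jk and fk); the strategy for adjusting λk between steps is only outlined in the comments:

```python
import numpy as np

def lm_step(J_k, f_k, lam_k):
    """Solve (J_k^T J_k + lam_k I) p_k = -J_k^T f_k for the LM search direction."""
    n = J_k.shape[1]
    A = J_k.T @ J_k + lam_k * np.eye(n)
    return np.linalg.solve(A, -J_k.T @ f_k)

# In a full method, lam_k is typically decreased after a successful step
# (F decreases) and increased after an unsuccessful one; a unit step
# x_{k+1} = x_k + p_k is then taken along p_k.
```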
Other Methods
•Other methods are based on function values only. This category includes methods such as:
  •Genetic Algorithms
  •Simulated Annealing
  •Tabu Search
  •Guided Local Search
  •etc.
Additional References
•Practical Optimization, P. Gill, W. Murray, M. Wright, Academic Press, 1981.
•Numerical Recipes in C/C++, Press et al., Cambridge University Press, 1988.
Perceptron
•The model was created by Rosenblatt
•It uses the nonlinear neuron of McCulloch-Pitts (which uses the Sgn as transfer function)
Perceptron Output
•The output y is calculated by:
  y = Sgn(Σi=1..m wixi + b) = Sgn(wTx + b)
•where Sgn(·) is defined as:
  Sgn(x) = +1 if x > 0
           -1 if x ≤ 0
•The perceptron classifies an input vector x ∈ ℝ^m to one of two classes, C1 or C2
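•A minimal sketch of the perceptron output (Python with NumPy), following the sign convention of these slides, where wTx + b = 0 is assigned to class C2:

```python
import numpy as np

def sgn(v):
    """Sgn(v) = +1 if v > 0, -1 otherwise (the v = 0 case goes to class C2)."""
    return 1 if v > 0 else -1

def perceptron_output(w, x, b):
    """y = Sgn(w^T x + b); +1 maps to class C1, -1 to class C2."""
    return sgn(np.dot(w, x) + b)
```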
Decision Boundary
•The case presented above, where a hyperplane wTx + b = 0 separates the two classes, is called the case of linearly separable classes
Learning Rule
•Assume that vectors are drawn from two classes C1 and C2, i.e.
  •x1(1), x1(2), x1(3), … belong to C1; and
  •x2(1), x2(2), x2(3), … belong to C2
•Assume also that we redefine the vectors x(n) and w(n) so as to include the bias, i.e.
  •x(n) = [+1, x1(n), …, xm(n)]T; and
  •w(n) = [b(n), w1(n), …, wm(n)]T
  with x, w ∈ ℝ^(m+1)
Learning Rule II
•Then there should exist a weight vector w such that:
  •wTx > 0, when x belongs to class C1; and
  •wTx ≤ 0, when x belongs to class C2
•We have arbitrarily chosen to include the case wTx = 0 in class C2
•The algorithm for adapting the weights of the perceptron may be formulated as follows:
Learning Rule III
1. If the nth member of the training set, x(n), is correctly classified by the current weight vector w(n), no correction is needed, i.e.
  •w(n+1) = w(n), if wTx(n) > 0 and x(n) belongs to class C1;
  •w(n+1) = w(n), if wTx(n) ≤ 0 and x(n) belongs to class C2;
Learning Rule IV
2. Otherwise, the weight vector is updated according to the rule:
  •w(n+1) = w(n) - η(n)x(n), if wTx(n) > 0 and x(n) belongs to class C2;
  •w(n+1) = w(n) + η(n)x(n), if wTx(n) ≤ 0 and x(n) belongs to class C1;
•The parameter η(n) is called the learning rate and controls the adjustment to the weight vector at iteration n.
Learning Rule V
•If we assume that the desired response is given by:
  d(n) = +1 if x(n) belongs to class C1
         -1 if x(n) belongs to class C2
•then we can re-write the adaptation rule in the form of an error-correction learning rule:
  w(n+1) = w(n) + η[d(n) - y(n)]x(n)
•where e(n) = [d(n) - y(n)] is the error signal
•The learning rate η is a positive constant in the range 0 < η ≤ 1.
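•A one-step sketch of the error-correction rule (Python with NumPy), assuming the augmented vectors of the previous slides so that the bias is folded into w and x:

```python
import numpy as np

def perceptron_update(w, x, d, eta=0.1):
    """w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n), with 0 < eta <= 1."""
    y = 1 if np.dot(w, x) > 0 else -1   # Sgn(w^T x); ties assigned to class C2
    return w + eta * (d - y) * x
```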
Learning Rule VI
•When we assign a value to η in the range (0,1] we must keep in mind two conflicting requirements:
  •Averaging of past inputs to provide stable weight estimates, which requires a small η
  •Fast adaptation with respect to real changes in the underlying distributions of the process responsible for the generation of the input vector x, which requires a large η
Summary of the Perceptron Algorithm
•Variables and Parameters:
  •x(n) = (m+1)-by-1 input vector = [+1, x1(n), …, xm(n)]T
  •w(n) = (m+1)-by-1 weight vector = [b(n), w1(n), …, wm(n)]T
  •b(n) = bias
  •y(n) = actual response
  •d(n) = desired response
  •η = learning rate in (0,1]
Summary of the Perceptron Algorithm I
1. Initialisation: Set w(0) = 0. Then perform the following computations for time steps n = 1, 2, …
2. Activation: At time step n, activate the perceptron by applying the input vector x(n) and desired response d(n).
3. Computation of Actual Response: Compute the actual response of the perceptron by using:
  y(n) = Sgn(wT(n)x(n))
  where Sgn(·) is the signum function.
Summary of the Perceptron Algorithm II
4. Adaptation of Weight Vector: Update the weight vector of the perceptron by using:
  w(n+1) = w(n) + η[d(n) - y(n)]x(n)
  where:
  d(n) = +1 if x(n) belongs to class C1
         -1 if x(n) belongs to class C2
5. Continuation: Increment the time step n by one and go back to step 2.
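•Putting steps 1–5 together, a minimal training-loop sketch (Python with NumPy, with a hypothetical list of (x, d) pairs where each x is already augmented with the leading +1 entry):

```python
import numpy as np

def train_perceptron(samples, eta=0.5, max_epochs=100):
    """samples: list of (x, d), x an (m+1)-vector [+1, x1, ..., xm], d in {-1, +1}."""
    w = np.zeros(len(samples[0][0]))            # 1. Initialisation: w(0) = 0
    for _ in range(max_epochs):
        corrections = 0
        for x, d in samples:                    # 2. Activation: apply x(n), d(n)
            x = np.asarray(x, dtype=float)
            y = 1 if np.dot(w, x) > 0 else -1   # 3. Actual response y(n) = Sgn(w^T x)
            if y != d:
                w = w + eta * (d - y) * x       # 4. Adaptation of the weight vector
                corrections += 1
        if corrections == 0:                    # stop once an epoch needs no corrections
            break
    return w                                    # 5. Continuation handled by the loops
```

•The extra stopping test (an epoch with no corrections) is a practical convenience; the algorithm as stated above simply continues for n = 1, 2, …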
Perceptron Convergence Theorem
•We present a proof of the fact that the perceptron needs only a finite number of steps in order to converge (i.e. to find the correct weight vector, if this exists)
•We assume that w(0) = 0. If this is not the case the proof still stands, but the number of steps needed for convergence increases or decreases
•Assume that the vectors drawn from the two classes C1 and C2 form two subsets, i.e.
  •H1 = {x1(1), x1(2), x1(3), …}, belonging to C1; and
  •H2 = {x2(1), x2(2), x2(3), …}, belonging to C2
Perceptron Convergence Theorem I
•Suppose that w(n)Tx(n) < 0 for n = 1, 2, … while the input vector belongs to H1, i.e. the perceptron misclassifies these inputs
•Thus for η(n) = 1 the weight update equation is applied and can be written as:
  w(n+1) = w(n) + x(n), for x(n) belonging to class C1
•Given the initial condition w(0) = 0, we can solve the above equation iteratively and obtain the result:
  w(n+1) = x(1) + x(2) + … + x(n)    (E.1)
Perceptron Convergence Theorem II
•Since classes C1 and C2 are linearly separable, there exists a w0 such that w0Tx(n) > 0 for all vectors x(1), x(2), …, x(n) belonging to H1. For a fixed solution w0 we can define a positive number α as:
  α = min_{x(n) ∈ H1} w0Tx(n)
•Multiplying both sides of E.1 with w0T we get:
  w0Tw(n+1) = w0Tx(1) + w0Tx(2) + … + w0Tx(n)
•So we have finally:
  w0Tw(n+1) ≥ nα    (E.2)
Perceptron Convergence Theorem III
•We use the Cauchy-Schwarz inequality for two vectors, which states that:
  ||w0||² ||w(n+1)||² ≥ [w0Tw(n+1)]²
•where ||·|| denotes the Euclidean norm of the vector and the inner product w0Tw(n+1) is a scalar. Then from E.2 we get:
  ||w0||² ||w(n+1)||² ≥ [w0Tw(n+1)]² ≥ n²α²
  or alternatively:
  ||w(n+1)||² ≥ n²α² / ||w0||²    (E.3)
Perceptron Convergence Theorem IV
•Now using:
  w(k+1) = w(k) + x(k), for x(k) belonging to class C1, k = 1, …, n
•and taking the Euclidean norm we get:
  ||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2w(k)Tx(k)
•which, under the assumption of wrong classification (i.e. w(k)Tx(k) < 0), leads to:
  ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²
Perceptron Convergence Theorem V
•Or finally to:
  ||w(k+1)||² - ||w(k)||² ≤ ||x(k)||²
•Adding all these inequalities for k = 1, …, n and using the initial condition w(0) = 0 we get:
  ||w(n+1)||² ≤ Σk=1..n ||x(k)||² ≤ nβ    (E.4)
•where β is a positive number defined as:
  β = max_{x(k) ∈ H1} ||x(k)||²    (E.5)
Perceptron Convergence Theorem VI
•E.4 states that the Euclidean norm of the vector w(n+1) grows at most linearly with the number of iterations n.
•But this result is in conflict with E.3 for large enough n.
•Thus we can state that n cannot be larger than some value nmax for which E.3 and E.4 are simultaneously satisfied with the equality sign. That is, nmax is the solution of the equation:
  nmax²α² / ||w0||² = nmax β
Perceptron Convergence Theorem VII
•Solving for nmax we get:
  nmax = β ||w0||² / α²
•This proves that the perceptron algorithm will terminate after a finite number of steps. However, observe that there exists no unique value for nmax, due to the non-uniqueness of w0
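•A small numerical sketch of this bound (Python with NumPy, with a hypothetical separating vector w0 and a hypothetical subset H1): α and β are computed from their definitions and the bound follows directly:

```python
import numpy as np

# Hypothetical separating weight vector and inputs from class C1
w0 = np.array([1.0, 2.0, -1.0])
H1 = [np.array([0.5, 1.0, 0.0]),
      np.array([1.0, 0.5, -0.5]),
      np.array([0.2, 1.5, 0.1])]

alpha = min(np.dot(w0, x) for x in H1)       # alpha = min_{x in H1} w0^T x
beta = max(np.dot(x, x) for x in H1)         # beta  = max_{x in H1} ||x||^2
n_max = beta * np.dot(w0, w0) / alpha ** 2   # upper bound on the number of corrections

print(alpha, beta, n_max)
```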
Conclusions
•There are many optimisation methods. In order of decreasing power, they are:
  •2nd derivative methods: e.g. Newton
  •1st derivative methods: e.g. Quasi-Newton
  •Function value based methods: e.g. Genetic algorithms
•The perceptron is a model which classifies an input vector into one of two exclusive classes, C1 or C2
•The perceptron uses an error-correction style rule for weight update
Conclusions I
•The perceptron learning rule converges in a finite number of steps