Feedforward neural networks
 l N 1  N   l N  2  N 1
 n
1   
Net  x , w     wi   wij
    wlm   
 i 1
 m 1
  
 j 1

1
x
 x   2
1
1  e x
x
The free parameters of mapping (1) are often referred to as weights. They can be changed in the course of the adaptation (or learning) process in order to "tune" the network for performing a specific task.
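To make mapping (1) concrete, here is a minimal Python/NumPy sketch of how such a network is evaluated (the layer sizes, the random weights and the function names are our own illustrative assumptions, not part of the lecture): hidden layers apply the sigmoid, while the output layer is a plain weighted sum.

    import numpy as np

    def phi(x):
        # bipolar sigmoid used as the nonlinearity in (1)
        return 2.0 / (1.0 + np.exp(-x)) - 1.0

    def net(x, weights):
        # 'weights' is a list of weight matrices, one per layer;
        # hidden layers apply the sigmoid, the output layer is linear as in (1)
        a = x
        for W in weights[:-1]:
            a = phi(W @ a)
        return weights[-1] @ a

    # hypothetical sizes: 3 inputs, hidden layers of 5 and 4 neurons, 1 output
    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(5, 3)), rng.normal(size=(4, 5)), rng.normal(size=(1, 4))]
    print(net(np.array([0.2, -0.7, 1.0]), weights))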
As was mentioned before, when solving engineering tasks by neural networks, we are faced with two fundamental questions:
- Representational (or approximation) capability, i.e. how many tasks can be solved by the net;
- Learning capability, i.e. how to adapt the weights in order to solve a specific task.
1., Representation capabilities of feedforward networks
As was detailed before, Net(x, w) spans a function space denoted by NN. Namely, every particular choice of the weight vector w results in a concrete function Net(x, w) for which Net(x, w) ∈ NN.
The tasks which are to be represented by a neural network are selected from a function space F, the functions of which are defined over an input space X.
The fundamental question of representational capability can be posed as follows:
In what function space F is the space NN uniformly dense? This relation is denoted by NN \subset_D F and is fully spelt out as follows:

for each F(x) ∈ F and ε > 0 there exists a w* for which

\left\| \mathrm{Net}(\mathbf{x},\mathbf{w}^*) - F(\mathbf{x}) \right\| < \varepsilon ,

where \|\cdot\| denotes a norm used in F. For example, if F = L^2 then

\| f \| = \left( \int_X f^2(\mathbf{x}) \, dx_1 \cdots dx_n \right)^{1/2} ,

or if F = C then \| f \| = \max_{\mathbf{x} \in X} | f(\mathbf{x}) |, etc.
If the function class F turns out to be a large one, then neural networks can solve a large number of problems. On the other hand, if F is small, then there is little use in seeking neural representations of engineering tasks, as their applicability is rather limited.
First let us focus our attention on one-layer neural networks given by the following mapping:

\mathrm{Net}(\mathbf{x},\mathbf{w}) = \sum_i w_i^{(2)} \,\varphi\!\left( \sum_j w_{ij}^{(1)} x_j \right)    (2)
Theorem 1. (Hornik, Stinchcombe, White ’89)
The class of one-layer neural networks is uniformly dense in L^p, namely NN \subset_D L^p.
In other words, every function in L^p can be arbitrarily closely approximated by a neural net. More precisely, for each F(x) ∈ L^p, i.e.

\int_X |F(\mathbf{x})|^p \, d\mathbf{x} < \infty ,

and for each ε > 0 there exists a w* such that

\int_X \left| F(\mathbf{x}) - \sum_i w_i^{(2)} \,\varphi\!\left( \sum_j w_{ij}^{(1)} x_j \right) \right|^p dx_1 \cdots dx_n < \varepsilon .
Since L^p is a rather large space, the theorem implies that almost any engineering task can be solved by a one-layer neural network.
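Theorem 1 only asserts the existence of a suitable w*; it does not say how to find it. As a rough numerical illustration (our own construction, not part of the theorem: fixed random first-layer weights, a bias term treated as an extra constant input, and output weights fitted by least squares), a one-layer net of the form (2) can be fitted to a 1D target and its empirical L^2 error measured:

    import numpy as np

    rng = np.random.default_rng(1)
    xs = np.linspace(-1.0, 1.0, 400).reshape(-1, 1)      # samples of the input space X
    target = np.sin(3 * np.pi * xs[:, 0])                # target function F(x)

    def phi(z):
        return 2.0 / (1.0 + np.exp(-z)) - 1.0            # sigmoid nonlinearity

    H = 40                                               # hypothetical number of hidden neurons
    W1 = rng.normal(scale=4.0, size=(H, 1))              # fixed random first-layer weights w_ij^(1)
    b1 = rng.uniform(-3.0, 3.0, size=H)                  # bias terms (a constant input in (2))
    hidden = phi(xs @ W1.T + b1)                         # hidden-layer outputs, shape (400, H)
    w2, *_ = np.linalg.lstsq(hidden, target, rcond=None) # least-squares output weights w_i^(2)

    err = np.sqrt(np.mean((hidden @ w2 - target) ** 2))
    print("empirical L2 error:", err)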
The proof of Theorem 1 draws heavily on functional analysis and is based on the Hahn-Banach theorem. Since it is outside the focus of this course, the interested reader is referred to xxx.
Let us now define a two-layer neural network given as follows:
\mathrm{Net}^{(2)}(\mathbf{x},\mathbf{w}) = \sum_i w_i^{(3)} \,\mathrm{sgn}\!\left( \sum_j w_{ij}^{(2)} \,\mathrm{sgn}\!\left( \sum_l w_{jl}^{(1)} x_l \right) \right)
Theorem 2. (Blum & Li):
\mathrm{Net}^{(2)}(\mathbf{x},\mathbf{w}) is uniformly dense in L^2, namely NN^{(2)} \subset_D L^2. In other words, for each F(x) ∈ L^2, i.e.

\int_X F^2(\mathbf{x}) \, d\mathbf{x} < \infty ,

and for any arbitrary ε > 0 there exists a w*:

\int_X \left( F(\mathbf{x}) - \sum_i w_i^{(3)} \,\mathrm{sgn}\!\left( \sum_j w_{ij}^{(2)} \,\mathrm{sgn}\!\left( \sum_l w_{jl}^{(1)} x_l \right) \right) \right)^2 dx_1 \cdots dx_n < \varepsilon .
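A minimal sketch of evaluating such a threshold network (only the structure, two sgn layers followed by a linear output layer, follows the formula of Theorem 2; the sizes and weight values are arbitrary placeholders):

    import numpy as np

    def net2(x, W1, W2, w3):
        # two sgn hidden layers and a linear output layer, as in Net^(2)(x, w)
        h1 = np.sign(W1 @ x)      # inner layer:  sgn( sum_l w_jl^(1) x_l )
        h2 = np.sign(W2 @ h1)     # second layer: sgn( sum_j w_ij^(2) h1_j )
        return w3 @ h2            # output:       sum_i w_i^(3) h2_i

    # hypothetical sizes: 2 inputs, 4 and 3 hidden threshold units, scalar output
    rng = np.random.default_rng(2)
    W1 = rng.normal(size=(4, 2))
    W2 = rng.normal(size=(3, 4))
    w3 = rng.normal(size=3)
    print(net2(np.array([0.5, -1.0]), W1, W2, w3))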
Proof:
Here we just give the outline of the proof following the reasoning of Blum & Li.
First let us introduce the class of step functions, denoted by S, i.e. all functions of the form f(\mathbf{x}) = \sum_i \alpha_i I_{\Omega_i}(\mathbf{x}) \in S.
From elementary integration theory it is clear that S is uniformly dense in L^1, namely every function in L^1 can be approximated by an appropriate step function. In the 1D case this is illustrated by the following figure:

[Figure: a 1D function approximated by a step function]
In general the approximation is done in two steps.
1.,
Given F x   L1 we define a partition on the set  over which F x  is given, with such
a property that
  i    i   i   and    F x    i  i d x  
b
a
i
i
The partition cell \Omega_i can be represented by a neural net as

\mathrm{sgn}\!\left( \sum_i a_i \,\mathrm{sgn}\!\left( \sum_j b_{ij} x_j \right) \right)    (3)
E.g. if \Omega is two-dimensional, then \Omega_i is separated out by n linear hyperplanes, which then should be AND-ed.

[Figure: a two-dimensional cell \Omega_i bounded by linear hyperplanes in the (x_1, x_2) plane]
Therefore the inner sgn functions in (3) represent the linear hyperplanes needed for the
separation, whereas the outer sgn function implements the required AND relationship.
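As a concrete illustration (our own hypothetical example, with bias terms written out explicitly), the sketch below represents a triangular cell \Omega_i in the plane: each inner sgn unit reports on which side of one hyperplane the point lies, and the outer sgn unit AND-s the answers with the weights discussed in the Remark below.

    import numpy as np

    # three half-planes a_j . x + c_j >= 0 whose intersection is a triangle (hypothetical cell Omega_i)
    A = np.array([[ 1.0,  0.0],      # x1 >= 0
                  [ 0.0,  1.0],      # x2 >= 0
                  [-1.0, -1.0]])     # x1 + x2 <= 1
    c = np.array([0.0, 0.0, 1.0])

    def indicator(x):
        sides = np.sign(A @ x + c)                  # inner sgn units: +1 on the "inside" of each hyperplane
        n = len(sides)
        return np.sign(np.sum(sides) - (n - 0.5))   # outer sgn unit AND-ing the three answers

    print(indicator(np.array([0.2, 0.3])))    # +1: inside the cell
    print(indicator(np.array([0.9, 0.9])))    # -1: outside the cell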
Remark:
A neuron with a sgn threshold function performs an AND function if all of its input weights are 1 and its threshold is n - 0.5:

[Figure: threshold neuron with inputs x_1, ..., x_n, unit weights, threshold n - 0.5 and output y]

y = \mathrm{sgn}\!\left( \sum_{j=1}^{n} x_j - (n - 0.5) \right)
A neuron with a sgn threshold function performs an OR function if all of its input weights are 1 and its threshold is -(n - 0.5):

[Figure: threshold neuron with inputs x_1, ..., x_n, unit weights, threshold -(n - 0.5) and output y]

y = \mathrm{sgn}\!\left( \sum_{j=1}^{n} x_j + (n - 0.5) \right)
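The two formulas can be checked directly for ±1 inputs, which is the coding the sgn units themselves produce (the check below is our own addition):

    import numpy as np
    from itertools import product

    n = 3
    for x in product([-1, 1], repeat=n):      # bipolar inputs, as produced by sgn units
        s = sum(x)
        and_out = np.sign(s - (n - 0.5))      # +1 only if every input is +1
        or_out  = np.sign(s + (n - 0.5))      # +1 if at least one input is +1
        print(x, int(and_out), int(or_out))

For ±1 inputs the AND output is +1 only in the all-ones row, while the OR output is -1 only in the all-minus-ones row.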
Now, since every \Omega_i can be represented by a corresponding \mathrm{sgn}\!\left( \sum_i a_i \,\mathrm{sgn}\!\left( \sum_j b_{ij} x_j \right) \right), the remaining step in the approximation is to set the weights \alpha_i in (xxx), which can take the values \alpha_i = F(\mathbf{x}_i), a representative value of F on \Omega_i.
Since the step functions are uniformly dense in L^2 as well (S \subset_D L^2), this construction can be extended to L^2.
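A minimal 1D sketch of the construction (the target function, the interval and the number of cells are hypothetical): the interval is partitioned into cells \Omega_i, each weight is set to \alpha_i = F(x_i) at a representative point, and the L^1 error of the resulting step function is measured.

    import numpy as np

    F = lambda x: np.exp(-x) * np.sin(4 * x)       # hypothetical target in L1 over [0, 3]
    a, b, cells = 0.0, 3.0, 30
    edges = np.linspace(a, b, cells + 1)           # partition {Omega_i} of [a, b]
    centers = 0.5 * (edges[:-1] + edges[1:])       # representative points x_i
    alphas = F(centers)                            # alpha_i = F(x_i)

    xs = np.linspace(a, b, 3000)
    idx = np.minimum(np.searchsorted(edges, xs, side="right") - 1, cells - 1)
    step = alphas[idx]                             # the step function sum_i alpha_i I_{Omega_i}(x)

    dx = xs[1] - xs[0]
    print("L1 error of the step approximation:", np.sum(np.abs(step - F(xs))) * dx)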
Constructive approximation
The results listed before only claim that one or two layer networks can represent almost any task.
When implementing a network for solving a concrete task, however, one has to know how big a network is needed. Thus the question of representation, perceived from an engineering point of view, boils down to the following:
Given a function F(x) to be represented and an error ε > 0, what is the number of neurons which suffices to represent F(x)?
The objective which we set here fully coincides with the underlying problem of constructive analysis, where not only must the existence of a certain decomposition be proven but its minimum complexity should also be pointed out.
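One empirical way to approach this question, reusing the random-feature fit sketched earlier (the target, the tolerance ε and the least-squares fitting are our own assumptions, not a method from the lecture), is to increase the number of hidden neurons L until the measured error drops below ε:

    import numpy as np

    rng = np.random.default_rng(3)
    xs = np.linspace(-1.0, 1.0, 500).reshape(-1, 1)
    target = np.sin(3 * np.pi * xs[:, 0])                     # the function F(x) to be represented
    phi = lambda z: 2.0 / (1.0 + np.exp(-z)) - 1.0
    eps = 0.05                                                # required accuracy (hypothetical)

    for L in range(5, 201, 5):                                # grow the hidden layer until the error is below eps
        W1 = rng.normal(scale=4.0, size=(L, 1))
        b1 = rng.uniform(-3.0, 3.0, size=L)
        hidden = phi(xs @ W1.T + b1)
        w2, *_ = np.linalg.lstsq(hidden, target, rcond=None)
        err = np.sqrt(np.mean((hidden @ w2 - target) ** 2))
        if err < eps:
            print(L, "hidden neurons give empirical error", round(err, 4))
            break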
Complexity theory of one layer neural networks
In this section we try to assess the number of neurons needed in a one-layer network to implement the mapping F(x) with an error ε. Our basic tool will be Fourier analysis; therefore the results given here are only valid in L^2, but not generally in L^p.
First we introduce the notion of multivariable truncated Fourier series as follows:
S_{M_1, \ldots, M_n}(\mathbf{x}) = \sum_{k_1 = -M_1}^{M_1} \cdots \sum_{k_n = -M_n}^{M_n} \hat f(k_1, \ldots, k_n) \left[ \cos\!\left( \sum_{j=1}^{n} \frac{2\pi}{T_j} k_j x_j \right) + i \sin\!\left( \sum_{j=1}^{n} \frac{2\pi}{T_j} k_j x_j \right) \right]
Here n denotes the dimension of the vector x, while \hat f(k_1, \ldots, k_n) are the corresponding Fourier coefficients of the function F(x), given as

\hat f(k_1, \ldots, k_n) = \int \cdots \int F(x_1, \ldots, x_n)\, e^{-i \sum_{j=1}^{n} \frac{2\pi}{T_j} k_j x_j} \, dx_1 \cdots dx_n .
As is well known from the theory of orthogonal series in L^2, if M_1, \ldots, M_n \to \infty then

\int \left| S_{M_1, \ldots, M_n}(x_1, \ldots, x_n) - F(x_1, \ldots, x_n) \right|^2 dx_1 \cdots dx_n \to 0 .
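A small numerical check of this convergence in two dimensions (the target function, the grid and the period T_j = 2π are hypothetical; the coefficients are obtained by discretising the integral above):

    import numpy as np

    N = 64                                              # grid points per dimension
    t = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
    X1, X2 = np.meshgrid(t, t, indexing="ij")
    F = np.sign(np.sin(X1)) * np.cos(X2)                # hypothetical 2-D target with period T_j = 2*pi

    def truncated_series(M):
        S = np.zeros_like(F, dtype=complex)
        for k1 in range(-M, M + 1):
            for k2 in range(-M, M + 1):
                basis = np.exp(1j * (k1 * X1 + k2 * X2))
                fhat = np.sum(F * np.conj(basis)) / N**2    # discretised Fourier coefficient
                S += fhat * basis
        return S.real

    dA = (2 * np.pi / N) ** 2
    for M in (1, 3, 7, 15):
        err = np.sqrt(np.sum((truncated_series(M) - F) ** 2) * dA)
        print("M =", M, " L2 error =", round(err, 4))

The printed L^2 error shrinks as M grows, in line with the statement above.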
Based on the multivariable Fourier series we can state the following theorem.
Theorem:
If F x   L2   is of bounded variation, with total variation V f , then it can be represented by an
on-layer network as
L
h
2  
1 
F x    wk   wkj x j 
k 1
 j 1


2  1 V
f
8n F V f n  
8 

where L 
n2
16
n



2
n
n


 1

 ,  is the error of approximation.