Feed-forward neural networks

Lecture 3
Feed-forward neural networks
(multilayer perceptron)
Slide 1
A joke that doc. Beňušková sent me last week from New Zealand; together with Peter Tiňo she founded the tradition of lectures on and research into neural networks at this faculty.
Slide 2
Feed-forward neural networks
(multilayer perceptron)
the logical neuron (McCulloch and Pitts, 1943)
the perceptron (Rosenblatt, 1957)
feed-forward neural networks (Rumelhart, 1986)
rapid development of neural networks (subsymbolic AI, connectionism)
Slide 3
David Rumelhart
D. E. Rumelhart, G. E. Hinton, and R. J. Williams: Learning representations by back-propagating errors. Nature, 323 (1986), 533-536.
D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, editors: Parallel Distributed Processing, volumes I and II. MIT Press, Cambridge, MA, 1986.
Slide 4
Formal definition of a multilayer perceptron
Let us consider an oriented graph G=(V,E) determined as an ordered couple
of a vertex set V and an edge set E.
[Figure: (A) an acyclic oriented graph and (B) a cyclic oriented graph, each with vertices v1-v6 and edges e1-e8.]
Slide 5
Theorem. An oriented graph G is acyclic iff it may be indexed so that

$\forall e = (v_i, v_j) \in E: \; i < j$
Vertices of an oriented graph can be classified as follows:
(1) Input vertices, which are incident only with outgoing edges.
(2) Hidden vertices, which are incident simultaneously with at least one incoming edge and at least one outgoing edge.
(3) Output vertices, which are incident only with incoming edges.
[Figure: an acyclic oriented graph with input neurons v1, v2, hidden neurons v3, v4, v5, and output neuron v6.]
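As a small illustration of this classification, the following Python sketch (the function and variable names are our own, not part of the lecture) splits the vertices of an oriented graph into input, hidden, and output vertices from its edge list; the example edge set is chosen to mirror the six-neuron network used later in the lecture.

```python
# A sketch (illustrative names): split the vertices of an oriented graph G = (V, E)
# into input, hidden, and output vertices according to the rules above.

def classify_vertices(vertices, edges):
    """vertices: iterable of vertex ids; edges: iterable of oriented edges (i, j), i -> j."""
    has_in = {v: False for v in vertices}    # vertex is incident with an incoming edge
    has_out = {v: False for v in vertices}   # vertex is incident with an outgoing edge
    for i, j in edges:
        has_out[i] = True
        has_in[j] = True
    inputs = [v for v in vertices if has_out[v] and not has_in[v]]
    hidden = [v for v in vertices if has_in[v] and has_out[v]]
    outputs = [v for v in vertices if has_in[v] and not has_out[v]]
    return inputs, hidden, outputs

# Example edge set chosen to mirror the six-neuron network used later in the lecture.
V = [1, 2, 3, 4, 5, 6]
E = [(1, 3), (2, 3), (1, 4), (2, 5), (3, 5), (3, 6), (4, 6), (5, 6)]
print(classify_vertices(V, E))   # -> ([1, 2], [3, 4, 5], [6])
```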
Slide 6
The vertex set V is unambiguously divided into three disjoint subsets

$V = V_I \cup V_H \cup V_O$

that are composed of input, hidden, and output vertices, respectively. Assuming that the graph G is canonically indexed and that it is composed of n input vertices, h hidden vertices, and m output vertices, the three vertex subsets are simply determined by

$V_I = \{1, 2, \dots, n\}, \quad |V_I| = n$
$V_H = \{n+1, n+2, \dots, n+h\}, \quad |V_H| = h$
$V_O = \{n+h+1, n+h+2, \dots, n+h+m\}, \quad |V_O| = m$
Slide 7
Vertices and edges of the oriented graph G are evaluated by real numbers. Each oriented edge e = (v_i, v_j) ∈ E is evaluated by a real number w_ji called the weight. Each vertex v_i ∈ V is evaluated by a real number x_i called the activity, and each hidden or output vertex v_i ∈ V is evaluated by a real number ϑ_i called the threshold.
The activities of hidden and output vertices are determined as follows

$x_i = t(\xi_i), \qquad \xi_i = \sum_{j \in \Gamma^{-1}(i)} w_{ij}\, x_j + \vartheta_i$

where t(·) is a transfer function. In our forthcoming considerations we will assume that its analytical form is specified as the sigmoid function.
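A minimal sketch of this activity rule in Python, assuming the sigmoid t(ξ) = 1/(1 + e^(-ξ)) as the transfer function and a dictionary w[i] that maps each predecessor j of neuron i to the weight w_ij (both representations are our own choice, not prescribed by the lecture):

```python
import math

def sigmoid(xi):
    """Sigmoid transfer function t(xi) = 1 / (1 + exp(-xi))."""
    return 1.0 / (1.0 + math.exp(-xi))

def activity(i, x, w, theta):
    """x_i = t(sum_j w_ij x_j + theta_i), the sum running over the predecessors j of i.

    x: activities of the predecessors; w[i]: dict {j: w_ij}; theta[i]: threshold of neuron i.
    """
    xi = sum(w_ij * x[j] for j, w_ij in w[i].items()) + theta[i]
    return sigmoid(xi)
```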
Slide 8
Definition. A feed-forward neural network is determined as an ordered triple

$N = (G, w, \vartheta)$

where G is an acyclic, oriented, and connected graph and w and ϑ are the weights and thresholds assigned to the edges (connections) and vertices (neurons) of the graph. We say that the graph G specifies a topology (or architecture) of the neural network N, and that the weights w and thresholds ϑ specify the parameters of the neural network.
The activities of neurons form a vector x = (x_1, x_2, ..., x_p). This vector can formally be divided into three subvectors that are composed of input, hidden, and output activities

$x = x_I \cup x_H \cup x_O$
Slide 9
A neural network N = (G, w, ϑ) with fixed weights and thresholds can be considered as a parametric function

$G_{w,\vartheta}: \mathbb{R}^n \to [0,1]^m$

[Diagram: the input activities x_I enter the graph G with parameters w, ϑ and produce the output activities x_O; the hidden activities x_H remain internal.]

This function assigns to an input activity vector x_I a vector of output activities x_O

$x_O = G(x_I; w, \vartheta)$

The hidden activities are not explicitly present in this expression; they play only the role of byproducts.
Slide 10
How do we calculate the activities of a neural network N = (G, w, ϑ)?
(1) We postulate that the activities of the input neurons are kept fixed; in other words, the vector of input activities x_I is an input to the neural network. Usually its entries correspond to the so-called descriptors that specify a classified pattern.
(2) The activities of hidden and output neurons are calculated by a simple recurrent procedure based on the fact that the topology of the neural network is determined by an acyclic oriented graph G

$x_i = f_i(x_1, x_2, \dots, x_{i-1}) \qquad (i = n+1, n+2, \dots, p)$

[Figure: a layered network with input layer L0 (neurons v1, v2), hidden layers L1 (v3, v4) and L2 (v5), and output layer L3 (v6).]

Layer L0: x_1 = input constant, x_2 = input constant
Layer L1: x_3 = t(w_31 x_1 + w_32 x_2 + ϑ_3), x_4 = t(w_41 x_1 + ϑ_4)
Layer L2: x_5 = t(w_53 x_3 + w_52 x_2 + ϑ_5)
Layer L3: x_6 = t(w_63 x_3 + w_64 x_4 + w_65 x_5 + ϑ_6)
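The recurrent procedure of point (2) can be sketched as follows; the numerical weights and thresholds are made-up illustration values, and the dictionary representation of w and ϑ is our own convention:

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(x_input, w, theta, order):
    """Compute the activities of an acyclic network neuron by neuron.

    x_input: {input neuron: activity}; w: {i: {j: w_ij}} over the incoming edges of i;
    theta: {i: threshold}; order: hidden and output neurons listed so that every
    predecessor of a neuron appears earlier (the canonical indexing of the acyclic graph).
    """
    x = dict(x_input)
    for i in order:
        xi = sum(w_ij * x[j] for j, w_ij in w[i].items()) + theta[i]
        x[i] = sigmoid(xi)
    return x

# The six-neuron example above; the numerical parameter values are made up.
w = {3: {1: 0.5, 2: -0.3}, 4: {1: 0.8}, 5: {3: 1.2, 2: 0.4}, 6: {3: -0.7, 4: 0.9, 5: 0.6}}
theta = {3: 0.1, 4: -0.2, 5: 0.0, 6: 0.3}
print(forward({1: 1.0, 2: 0.0}, w, theta, order=[3, 4, 5, 6]))
```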
Slide 11
Unfortunately, if the graph G is not acyclic (it contains oriented cycles), then the above simple recurrent calculation of activities is inapplicable: at some stage of the algorithm we would need an activity that has not yet been calculated. The activities are determined by

$x_i = f_i(x_1, x_2, \dots, x_n, x_{n+1}, \dots, x_p) \qquad (i = n+1, n+2, \dots, p)$

These equations are coupled; their solution (if any) can be constructed by making use of an iterative procedure, which is started from initial activities, and the calculated activities are used as an input for the calculation of new activities, etc.

$x_i^{(t+1)} = f_i\left(x_1, x_2, \dots, x_n, x_{n+1}^{(t)}, \dots, x_p^{(t)}\right) \qquad (i = n+1, n+2, \dots, p)$

This iterative procedure is repeated until the difference between the new and old activities is smaller than a prescribed precision.
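A hedged sketch of this iterative procedure; the convergence tolerance eps, the cap max_steps, and the initial value 0.5 for the unknown activities are arbitrary choices, not taken from the lecture:

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def iterate_activities(x_input, w, theta, neurons, eps=1e-6, max_steps=1000):
    """Fixed-point iteration x_i^(t+1) = t(sum_j w_ij x_j^(t) + theta_i) for a cyclic network.

    neurons: the hidden and output neurons; the loop stops when the largest change of
    any activity drops below the prescribed precision eps.
    """
    x = dict(x_input)
    for i in neurons:
        x[i] = 0.5                          # arbitrary initial activities
    for _ in range(max_steps):
        x_new = dict(x_input)
        for i in neurons:
            xi = sum(w_ij * x[j] for j, w_ij in w[i].items()) + theta[i]
            x_new[i] = sigmoid(xi)
        change = max(abs(x_new[i] - x[i]) for i in neurons)
        x = x_new
        if change < eps:
            break
    return x
```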
Slide 12
Learning (adaptation process) of feed-forward neural networks
The learning of a feed-forward neural network consists in looking for such weights and thresholds that the produced output activities (as a response to the input activities) are closely related to the required ones.
First of all we have to define the training set, composed of examples of input-activity vectors and the required output-activity vectors

$\mathcal{A} = \left\{ A_k = \left( x_I^{(k)}, \hat{x}_O^{(k)} \right); \; k = 1, 2, \dots, P \right\}$
The error objective function is determined by

$E_k(w,\vartheta) = \frac{1}{2}\left\| x_O^{(k)} - \hat{x}_O^{(k)} \right\|^2 = \frac{1}{2}\left\| G\!\left(x_I^{(k)}; w, \vartheta\right) - \hat{x}_O^{(k)} \right\|^2 = \frac{1}{2}\sum_i g_i^2$

$g_i = \begin{cases} x_i - \hat{x}_i^{(k)} & (\text{for } i \in V_O) \\ 0 & (\text{otherwise}) \end{cases}$
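In code, the error objective of one training pattern might be computed as below; x is the dictionary of all activities after a forward pass and x_hat_O maps output neurons to their required activities (the names are illustrative):

```python
def output_errors(x, x_hat_O):
    """g_i = x_i - x_hat_i for output neurons (g_i = 0 elsewhere and does not contribute)."""
    return {i: x[i] - x_hat_O[i] for i in x_hat_O}

def pattern_error(x, x_hat_O):
    """E_k(w, theta) = 1/2 * sum_i g_i^2 for one training pattern."""
    return 0.5 * sum(g * g for g in output_errors(x, x_hat_O).values())
```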
Slide 13
The total error objective function over the training set is

$E(w,\vartheta) = \sum_{k=1}^{P} E_k(w,\vartheta)$

The learning process of the given feed-forward neural network N = (G, w, ϑ), specified by the graph G and with unknown weights and thresholds, is realized by the following minimization process

$\left( w_{opt}, \vartheta_{opt} \right) = \arg\min_{w,\vartheta} E(w,\vartheta)$

If we know the optimal weights and thresholds (those that minimize the error objective function), then the active process is a calculation of the output activities as a response to the input activities, for the parameters determined by the learning (adaptation) process

$x_O = G\!\left( x_I; w_{opt}, \vartheta_{opt} \right)$
Slide 14
The active process of neural networks is usually used for the classification or prediction of unknown patterns that are described only by input activities (descriptors that specify the structure of the patterns). In order to quantify this process we have to introduce the so-called test set, composed of examples of patterns specified by an input-activity vector and a required output-activity vector

$\mathcal{A}_{test} = \left\{ A_k = \left( y_I^{(k)}, \hat{y}_O^{(k)} \right); \; k = 1, 2, \dots, q \right\}$

An analogue of the error objective function is

$E_k^{(test)}\!\left(w_{opt},\vartheta_{opt}\right) = \frac{1}{2}\left\| y_O^{(k)} - \hat{y}_O^{(k)} \right\|^2 = \frac{1}{2}\left\| G\!\left( y_I^{(k)}; w_{opt}, \vartheta_{opt} \right) - \hat{y}_O^{(k)} \right\|^2$

$E^{(test)}\!\left(w_{opt},\vartheta_{opt}\right) = \sum_{k=1}^{q} E_k^{(test)}\!\left(w_{opt},\vartheta_{opt}\right)$
Slide 15
We say that an adapted neural network correctly interprets the patterns from the test set if this objective function is sufficiently small. If this requirement is not fulfilled, then there exists an example from the test set which is incorrectly interpreted.
Gradient of the error objective function
The partial derivatives of the error objective function (defined over the training set) with respect to the weights and thresholds are determined by

$\frac{\partial E_k}{\partial w_{ij}} = \frac{\partial E_k}{\partial x_i}\, \frac{\partial x_i}{\partial w_{ij}} = \frac{\partial E_k}{\partial x_i}\, t'(\xi_i)\, x_j$

$\frac{\partial E_k}{\partial \vartheta_i} = \frac{\partial E_k}{\partial x_i}\, \frac{\partial x_i}{\partial \vartheta_i} = \frac{\partial E_k}{\partial x_i}\, t'(\xi_i)$

where the partial derivative ∂x_i/∂w_{ij} is simply calculated by making use of the relation x_i = t(ξ_i), so that ∂x_i/∂w_{ij} = t'(ξ_i) ∂ξ_i/∂w_{ij} = t'(ξ_i) x_j. Similar considerations are applicable also for the partial derivative ∂x_i/∂ϑ_i.
Slide 16
If we compare both these equations we obtain a simple relationship between the partial derivatives ∂E_k/∂w_{ij} and ∂E_k/∂ϑ_i

$\frac{\partial E_k}{\partial w_{ij}} = \frac{\partial E_k}{\partial \vartheta_i}\, x_j$

Let us study the partial derivative ∂E_k/∂x_i; its calculation depends on whether the index i corresponds to an output neuron or to a hidden neuron

$\frac{\partial E_k}{\partial x_i} = g_i \qquad (\text{for } i \in V_O)$

$\frac{\partial E_k}{\partial x_i} = \sum_l \frac{\partial E_k}{\partial x_l}\, \frac{\partial x_l}{\partial x_i} \qquad (\text{for } i \in V_H)$

where the summation runs over all neurons that are successors of the i-th neuron. The second formula results from an application of the well-known theorem called the chain rule for the calculation of partial derivatives of a composite function. The last expressions can be unified into one expression
Slide 17
$\frac{\partial E_k}{\partial x_i} = g_i + \sum_l \frac{\partial E_k}{\partial x_l}\, \frac{\partial x_l}{\partial x_i}$

where it is necessary to remember that the summation on the r.h.s. vanishes if the i-th neuron has no successors, that is, for output neurons.
The final expression for the partial derivatives of the error objective function with respect to the thresholds is

$\frac{\partial E_k}{\partial \vartheta_i} = t'(\xi_i) \left( g_i + \sum_l \frac{\partial E_k}{\partial \vartheta_l}\, w_{li} \right) \qquad (\text{for } i \in V_H \cup V_O)$

(1) In general, the above formula for the calculation of the partial derivatives of the objective function can be characterized as a system of linear equations whose solution determines the partial derivatives ∂E_k/∂ϑ_i.
(2) For feed-forward neural networks the above formula can be solved recurrently. Starting from the top layer L_t, we calculate all the partial derivatives assigned to output neurons, ∂E_k/∂ϑ_i = t'(ξ_i) g_i. In the next step we calculate the partial derivatives from the layer L_{t-1}; for their calculation we need to know the partial derivatives from the layer L_t, which were calculated in the previous step, etc. This recurrent approach to the calculation of the partial derivatives is therefore called back propagation.
(3) Knowing all the partial derivatives ∂E_k/∂ϑ_i, we may simply calculate the partial derivatives ∂E_k/∂w_{ij}.
Slide 18
(4) An analogue of the above formula was initially derived by Rumelhart et al. in 1986; this work is considered in the literature as one of the milestones of the development of the theory of neural networks. They demonstrated that multilayer perceptrons, together with the back-propagation method for the calculation of gradients of error objective functions, are able to overcome the boundaries of simple perceptrons stated by Minsky and Papert, that is, to classify correctly all patterns and not only those that are linearly separable.
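A sketch of the recurrent scheme of remarks (2)-(3) for an acyclic network, reusing the dictionary conventions of the earlier forward-pass sketch; it assumes the sigmoid transfer function, for which t'(ξ_i) = x_i (1 - x_i):

```python
def backprop_gradients(x, x_hat_O, w, theta_neurons):
    """Recurrent (back-propagation) evaluation of dE_k/dtheta_i and dE_k/dw_ij.

    x: all activities from a forward pass; x_hat_O: required output activities;
    w: {i: {j: w_ij}} over incoming edges; theta_neurons: hidden and output neurons in
    canonical order (they are processed backwards, from the top layer down).
    For the sigmoid transfer function t'(xi_i) = x_i (1 - x_i).
    """
    # successors of each neuron j, i.e. the neurons l with an edge j -> l of weight w_lj
    succ = {j: [] for j in theta_neurons}
    for l, incoming in w.items():
        for j in incoming:
            if j in succ:
                succ[j].append(l)
    dE_dtheta, dE_dw = {}, {}
    for i in reversed(theta_neurons):
        g_i = x[i] - x_hat_O[i] if i in x_hat_O else 0.0
        downstream = sum(dE_dtheta[l] * w[l][i] for l in succ[i])
        dE_dtheta[i] = x[i] * (1.0 - x[i]) * (g_i + downstream)
        dE_dw[i] = {j: dE_dtheta[i] * x[j] for j in w[i]}
    return dE_dtheta, dE_dw
```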
[Figure: back propagation through the network with input neurons v1, v2, hidden neurons v3, v4, v5, and output neuron v6.]
Theorem. For feed-forward neural networks the first partial derivatives of the error
objective function Ek are determined recurrently by
Slide 19
$\frac{\partial E_k}{\partial \vartheta_i} = t'(\xi_i) \left( g_i + \sum_l \frac{\partial E_k}{\partial \vartheta_l}\, w_{li} \right)$

$\frac{\partial E_k}{\partial w_{ij}} = \frac{\partial E_k}{\partial \vartheta_i}\, x_j$

where in order to calculate ∂E_k/∂ϑ_i we have to know either g_i (for i ∈ V_O) or the partial derivatives ∂E_k/∂ϑ_l (for the successors l of the i-th neuron).
We have to emphasize that in the course of the derivation of the above formula we never used the assumption that the considered neural network corresponds to an acyclic graph. This means that the formula is correct also for neural networks assigned to graphs that are either cyclic or acyclic.
For neural networks assigned to cyclic graphs the above discussed recurrent approach to the calculation of the partial derivatives is inapplicable. For this type of neural networks the partial derivatives are not determined recurrently, but only as a solution of coupled linear equations
$\frac{\partial E_k}{\partial \vartheta_i} - \sum_l \frac{\partial E_k}{\partial \vartheta_l}\, w_{li}\, x_i' = x_i'\, g_i$

Its matrix form is

$\left( \mathbf{1} - \mathrm{diag}(x')\, w^T \right) e_k' = a$

where $e_k' = \left( \partial E_k/\partial\vartheta_1, \dots, \partial E_k/\partial\vartheta_p \right)^T$ is a column vector composed of the first derivatives of the error objective function E_k with respect to the thresholds, $\mathrm{diag}(x') = \left( \delta_{ij}\, x_i' \right)$ is a diagonal matrix composed of the first derivatives of the activities, and $a = \left( g_1 x_1', \dots, g_p x_p' \right)^T$ is a column vector composed of the products $g_i x_i'$.
Slide 20
Assuming that the matrix $\mathbf{1} - \mathrm{diag}(x')\, w^T$ is nonsingular, the solution is determined by

$e_k' = \left( \mathbf{1} - \mathrm{diag}(x')\, w^T \right)^{-1} a$

This means that, for neural networks represented either by a cyclic or by an acyclic oriented graph, the first partial derivatives of E_k are determined as a solution of the above matrix equation.
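For a cyclic network the same gradients can be obtained directly from the matrix equation above; a minimal numpy sketch (the dense-matrix representation and all names are assumptions of the sketch, not of the lecture):

```python
import numpy as np

def gradient_wrt_thresholds(w, x1, g):
    """Solve (1 - diag(x') w^T) e_k' = a with a_i = g_i x_i'.

    w: p x p weight matrix with w[i, j] = w_ij (zero where there is no edge j -> i);
    x1: first derivatives x_i' of the activities; g: output errors g_i (zero elsewhere).
    Returns the vector of partial derivatives dE_k/dtheta_i.
    """
    p = len(g)
    M = np.eye(p) - np.diag(x1) @ w.T
    return np.linalg.solve(M, g * x1)
```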
The present approach is simply generalized also for the calculation of the partial derivatives of the total error objective function determined over all training patterns

$\frac{\partial E}{\partial w_{ij}} = \sum_{k=1}^{P} \frac{\partial E_k}{\partial w_{ij}} \qquad \text{and} \qquad \frac{\partial E}{\partial \vartheta_i} = \sum_{k=1}^{P} \frac{\partial E_k}{\partial \vartheta_i}$

If we know the partial derivatives, then the batch adaptation process is simply realized by a steepest-descent optimization accelerated by a momentum term

$w_{ij}^{(t+1)} = w_{ij}^{(t)} - \lambda\, \frac{\partial E}{\partial w_{ij}} + \mu\, \Delta w_{ij}^{(t)}$

$\vartheta_i^{(t+1)} = \vartheta_i^{(t)} - \lambda\, \frac{\partial E}{\partial \vartheta_i} + \mu\, \Delta \vartheta_i^{(t)}$

where the learning parameter λ > 0 should be sufficiently small (usually λ = 0.01-0.1) to ensure a monotonous convergence of the optimization method. The initial values of the weights $w_{ij}^{(0)}$ and thresholds $\vartheta_i^{(0)}$ are randomly generated.
Slide 21
The last terms in the above formulae correspond to momentum terms, determined as the difference of terms from the last two iterations, $\Delta w_{ij}^{(t)} = w_{ij}^{(t)} - w_{ij}^{(t-1)}$ and $\Delta \vartheta_i^{(t)} = \vartheta_i^{(t)} - \vartheta_i^{(t-1)}$. The momentum terms may be important at the initial stage of the optimization as a simple tool to escape local minima; the value of the momentum parameter μ is usually 0.5-0.7.
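One batch update of this steepest-descent rule with a momentum term might look as follows (the parameter names lam and mu and the dictionary representation are illustrative):

```python
def momentum_step(params, grads, prev_delta, lam=0.05, mu=0.6):
    """One batch steepest-descent update accelerated by a momentum term.

    params, grads, prev_delta: dicts keyed by a parameter id (a weight w_ij or a
    threshold theta_i); lam is the learning parameter (typically 0.01-0.1) and mu the
    momentum parameter (typically 0.5-0.7).
    """
    new_params, new_delta = {}, {}
    for key, value in params.items():
        delta = -lam * grads[key] + mu * prev_delta.get(key, 0.0)
        new_params[key] = value + delta
        new_delta[key] = delta                 # becomes the momentum term of the next step
    return new_params, new_delta
```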
Hessian of the error objective function
One of the most efficient optimization techniques is the Newton optimization method, based on the following recurrent updating formula

$x_{k+1} = x_k - H^{-1}(x_k)\, \mathrm{grad}\, f(x_k)$

where H(x_k) is the Hessian matrix composed of the partial derivatives of second order and f(x) is the objective function to be minimized. This recurrent scheme is stopped when the norm of the gradient is smaller than a prescribed precision; the obtained solution x* is a minimum for a positive definite Hessian H(x*).
Three types of partial derivatives of second order will be calculated
Slide 22
$\frac{\partial^2 E_k}{\partial \vartheta_a \partial \vartheta_i}, \qquad \frac{\partial^2 E_k}{\partial w_{ab} \partial \vartheta_i}, \qquad \frac{\partial^2 E_k}{\partial w_{ab} \partial w_{ij}}$

The partial derivatives $\partial^2 E_k/\partial\vartheta_a\partial\vartheta_i$ (where $i, a \in V_H \cup V_O$) can be calculated as follows

$\frac{\partial^2 E_k}{\partial \vartheta_a \partial \vartheta_i} = \frac{\partial}{\partial \vartheta_a}\left( \frac{\partial E_k}{\partial \vartheta_i} \right) = \frac{\partial}{\partial \vartheta_a}\left[ t'(\xi_i)\left( g_i + \sum_l \frac{\partial E_k}{\partial \vartheta_l}\, w_{li} \right) \right]$

$= \delta_{ia}\, x_i'' \left( g_i + \sum_l \frac{\partial E_k}{\partial \vartheta_l}\, w_{li} \right) + x_i' \left( \delta_{ia}\, \sigma_i^{(O)}\, x_i' + \sum_l \frac{\partial^2 E_k}{\partial \vartheta_l \partial \vartheta_a}\, w_{li} \right)$

where

$\delta_{ia} = \begin{cases} 1 & (\text{if } i = a) \\ 0 & (\text{if } i \neq a) \end{cases} \qquad \text{and} \qquad \sigma_i^{(O)} = \begin{cases} 1 & (\text{if } i \in V_O) \\ 0 & (\text{if } i \notin V_O) \end{cases}$
Slide 23
The symbols $x_i'$ and $x_i''$ are the first and second derivatives of the activities assigned to hidden or output neurons. This formula allows a recurrent "back propagation" calculation of the second partial derivatives $\partial^2 E_k/\partial\vartheta_a\partial\vartheta_i$.
The partial derivatives $\partial^2 E_k/\partial w_{ab}\partial\vartheta_i$ are calculated immediately from the above two formulae; we get

$\frac{\partial^2 E_k}{\partial w_{ab} \partial \vartheta_i} = \frac{\partial}{\partial \vartheta_i}\left( \frac{\partial E_k}{\partial w_{ab}} \right) = \frac{\partial}{\partial \vartheta_i}\left( \frac{\partial E_k}{\partial \vartheta_a}\, x_b \right) = \frac{\partial^2 E_k}{\partial \vartheta_i \partial \vartheta_a}\, x_b + \delta_{ib}\, x_b'\, \frac{\partial E_k}{\partial \vartheta_a}$

In a similar way we may calculate the partial derivatives $\partial^2 E_k/\partial w_{ab}\partial w_{ij}$; we get
Slide 24
$\frac{\partial^2 E_k}{\partial w_{ab} \partial w_{ij}} = \frac{\partial}{\partial w_{ab}}\left( \frac{\partial E_k}{\partial w_{ij}} \right) = \frac{\partial}{\partial w_{ab}}\left( \frac{\partial E_k}{\partial \vartheta_i}\, x_j \right) = \frac{\partial^2 E_k}{\partial w_{ab} \partial \vartheta_i}\, x_j + \frac{\partial E_k}{\partial \vartheta_i}\, \frac{\partial x_j}{\partial w_{ab}}$

where $\partial x_j/\partial w_{ab} = t'(\xi_j)\, \delta_{ja}\, x_b = \delta_{ja}\, x_j'\, x_b$, so that

$\frac{\partial^2 E_k}{\partial w_{ab} \partial w_{ij}} = \frac{\partial^2 E_k}{\partial \vartheta_i \partial \vartheta_a}\, x_b\, x_j + \delta_{ib}\, \frac{\partial E_k}{\partial \vartheta_a}\, x_b'\, x_j + \delta_{ja}\, \frac{\partial E_k}{\partial \vartheta_i}\, x_j'\, x_b$

A calculation of the partial derivatives $\partial^2 E_k/\partial w_{ab}\partial\vartheta_i$ and $\partial^2 E_k/\partial w_{ab}\partial w_{ij}$ therefore requires only the first partial derivatives $\partial E_k/\partial\vartheta_i$ and the second partial derivatives $\partial^2 E_k/\partial\vartheta_i\partial\vartheta_a$.
Theorem. For feed-forward neural networks the second partial derivatives of the error objective
function Ek are determined by
Slide 25
$\frac{\partial^2 E_k}{\partial \vartheta_a \partial \vartheta_i} = \delta_{ia}\left( g_i + \sum_l \frac{\partial E_k}{\partial \vartheta_l}\, w_{li} \right) x_i'' + \left( \delta_{ia}\, \sigma_i^{(O)}\, x_i' + \sum_l \frac{\partial^2 E_k}{\partial \vartheta_l \partial \vartheta_a}\, w_{li} \right) x_i'$

$\frac{\partial^2 E_k}{\partial w_{ab} \partial \vartheta_i} = \frac{\partial^2 E_k}{\partial \vartheta_i \partial \vartheta_a}\, x_b + \delta_{ib}\, \frac{\partial E_k}{\partial \vartheta_a}\, x_b'$

$\frac{\partial^2 E_k}{\partial w_{ab} \partial w_{ij}} = \frac{\partial^2 E_k}{\partial \vartheta_i \partial \vartheta_a}\, x_b\, x_j + \delta_{ib}\, \frac{\partial E_k}{\partial \vartheta_a}\, x_b'\, x_j + \delta_{ja}\, \frac{\partial E_k}{\partial \vartheta_i}\, x_j'\, x_b$

where the partial derivatives $\partial^2 E_k/\partial\vartheta_i\partial\vartheta_a$ may be calculated recurrently in a back-propagation manner, whereas the other two partial derivatives $\partial^2 E_k/\partial w_{ab}\partial\vartheta_i$ and $\partial^2 E_k/\partial w_{ab}\partial w_{ij}$ are calculated directly.
The partial derivatives of second order of the total objective function determined over all patterns from the training set are determined as summations of the partial derivatives of the objective functions E_k
Slide 26
$\frac{\partial^2 E}{\partial \vartheta_a \partial \vartheta_i} = \sum_{k=1}^{P} \frac{\partial^2 E_k}{\partial \vartheta_a \partial \vartheta_i}, \qquad \frac{\partial^2 E}{\partial \vartheta_i \partial w_{ab}} = \sum_{k=1}^{P} \frac{\partial^2 E_k}{\partial \vartheta_i \partial w_{ab}}, \qquad \frac{\partial^2 E}{\partial w_{ij} \partial w_{ab}} = \sum_{k=1}^{P} \frac{\partial^2 E_k}{\partial w_{ij} \partial w_{ab}}$
An adaptation process of neural networks realized in the framework of the Newton optimization method is based on the following recurrent updating formulae

$w_{ij}^{(t+1)} = w_{ij}^{(t)} - \sum_{ab}\left[ H^{(t)} \right]^{-1}_{(ij)(ab)} \left( \frac{\partial E}{\partial w_{ab}} \right)^{(t)} - \sum_{a}\left[ H^{(t)} \right]^{-1}_{(ij)(a)} \left( \frac{\partial E}{\partial \vartheta_a} \right)^{(t)}$

$\vartheta_i^{(t+1)} = \vartheta_i^{(t)} - \sum_{ab}\left[ H^{(t)} \right]^{-1}_{(i)(ab)} \left( \frac{\partial E}{\partial w_{ab}} \right)^{(t)} - \sum_{a}\left[ H^{(t)} \right]^{-1}_{(i)(a)} \left( \frac{\partial E}{\partial \vartheta_a} \right)^{(t)}$

where $H^{(t)}$ is the Hessian matrix of E at iteration t, composed of the three types of second partial derivatives above, and the indices (ij), (ab), (i), (a) label its weight and threshold blocks.
Slide 27
In a similar way as in the previous part of this lecture, where we studied the first partial derivatives of the error objective function E_k, the second partial derivatives were derived without using any assumption that the neural network corresponds to an acyclic graph. Of course, the assumption that the neural network is acyclic considerably simplifies the calculation of the second partial derivatives $\partial^2 E_k/\partial\vartheta_i\partial\vartheta_a$ by the recurrent method. For cyclic neural networks this recurrent approach is inapplicable; the partial derivatives are determined by a system of linear equations

$\frac{\partial^2 E_k}{\partial \vartheta_a \partial \vartheta_i} - \sum_l w_{li}\, x_i'\, \frac{\partial^2 E_k}{\partial \vartheta_l \partial \vartheta_a} = \delta_{ia}\left[ x_i''\left( g_i + \sum_l \frac{\partial E_k}{\partial \vartheta_l}\, w_{li} \right) + \sigma_i^{(O)}\, x_i'^{\,2} \right]$

This relation may be rewritten in matrix form as follows

$E''\left( \mathbf{1} - w\, \mathrm{diag}(x') \right) = A$

where E'' is a symmetric matrix composed of the partial derivatives $\partial^2 E_k/\partial\vartheta_i\partial\vartheta_a$ and $A = (A_{ai})$ is a diagonal matrix. Assuming that the matrix $\mathbf{1} - w\,\mathrm{diag}(x')$ is nonsingular, the matrix E'' composed of the second partial derivatives $\partial^2 E_k/\partial\vartheta_i\partial\vartheta_a$ is determined explicitly by

$E'' = A\left( \mathbf{1} - w\, \mathrm{diag}(x') \right)^{-1}$
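The matrix solution E'' = A(1 - w diag(x'))^{-1} can be sketched directly in numpy; all array names are illustrative and the first derivatives dE_k/dϑ are assumed to be available, for example from the linear system of Slide 20:

```python
import numpy as np

def second_derivatives_wrt_thresholds(w, x1, x2, g, e1, sigma_O):
    """E'' = A (1 - w diag(x'))^{-1} for a (possibly cyclic) network with p neurons.

    w: p x p weight matrix (w[i, j] = w_ij); x1, x2: first and second derivatives of the
    activities; g: output errors g_i; e1: first derivatives dE_k/dtheta_i; sigma_O: 0/1
    indicator of output neurons.
    """
    p = len(g)
    A = np.diag(x2 * (g + w.T @ e1) + sigma_O * x1 ** 2)   # diagonal right-hand side
    return A @ np.linalg.inv(np.eye(p) - w @ np.diag(x1))
```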
Slide 28
Illustrative example: the Boolean function XOR
The effectiveness of neural networks with hidden neurons is illustrated by the Boolean function XOR, which cannot be correctly classified by a simple perceptron without hidden neurons.
The neural network used will contain three layers: the first layer is composed of input neurons, the second one of hidden neurons, and the third (last) one of output neurons.
The hidden neurons create the so-called inner representation, which is already linearly separable; this is the main reason why neural networks with hidden neurons are most frequently used in the whole theory of neural networks.
Slide 29
The parameters of the adaptation process are λ = 0.1 and μ = 0.5; after 400 iterations the objective function achieved the value E = 0.031.

[Figure: the XOR network with input neurons 1 and 2, hidden neurons 3 and 4, and output neuron 5.]
Slide 30
Activities of the neurons of the feed-forward network composed of five neurons, two of which are hidden (the last column is the required output activity):

No.   x1     x2     x3     x4     x5     x̂5
1     0.00   0.00   0.96   0.08   0.06   0.00
2     0.00   1.00   1.00   0.89   0.95   1.00
3     1.00   0.00   0.06   0.00   0.94   1.00
4     1.00   1.00   0.96   0.07   0.05   0.00

The hidden activities of the XOR problem are linearly separable.
[Figure: the XOR patterns plotted in the input-activity plane (x1, x2) and in the hidden-activity plane (x3, x4); in the hidden plane the two classes (regions A and B) are linearly separable.]
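A self-contained sketch that trains a 2-2-1 sigmoid network on XOR with the batch steepest-descent/momentum rule derived above, using the slide's parameters λ = 0.1 and μ = 0.5; the random initialisation, the number of iterations, and the exact architecture are assumptions of the sketch (the network in the figure may also contain direct input-output connections), and an unlucky initialisation can end in a local minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set: input activities and the required output activity.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

def t(xi):                                   # sigmoid transfer function
    return 1.0 / (1.0 + np.exp(-xi))

# 2 input, 2 hidden and 1 output neuron; parameters randomly initialised.
W1 = rng.uniform(-1.0, 1.0, (2, 2))          # W1[h, i] = weight from input i to hidden h
th1 = rng.uniform(-1.0, 1.0, 2)
W2 = rng.uniform(-1.0, 1.0, 2)               # weights from the hidden neurons to the output
th2 = rng.uniform(-1.0, 1.0)
lam, mu = 0.1, 0.5                           # learning and momentum parameters from the slide
dW1 = np.zeros_like(W1); dth1 = np.zeros_like(th1)
dW2 = np.zeros_like(W2); dth2 = 0.0

for epoch in range(20000):
    H = t(X @ W1.T + th1)                    # hidden activities
    o = t(H @ W2 + th2)                      # output activity
    g = o - y                                # g_i = x_i - x_hat_i
    d_out = o * (1 - o) * g                  # dE_k/dtheta for the output neuron, per pattern
    d_hid = H * (1 - H) * np.outer(d_out, W2)   # back-propagated dE_k/dtheta for hidden neurons
    # batch steepest descent with momentum
    dW2 = -lam * H.T @ d_out + mu * dW2
    dth2 = -lam * d_out.sum() + mu * dth2
    dW1 = -lam * d_hid.T @ X + mu * dW1
    dth1 = -lam * d_hid.sum(0) + mu * dth1
    W2 += dW2; th2 += dth2; W1 += dW1; th1 += dth1

# Should approach [0, 1, 1, 0]; with an unlucky seed more iterations may be needed.
print(np.round(t(t(X @ W1.T + th1) @ W2 + th2), 2))
```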
Newton's method
Slide 31
We will approximately solve the equation

$\mathrm{grad}\, f(x) = 0$

Let $x = x_o + \delta$, where $\delta = (\delta_1, \delta_2, \dots, \delta_n)$; then

$\mathrm{grad}\, f(x_o + \delta) = 0$

The left-hand side is rewritten with the help of the Taylor expansion

$\mathrm{grad}\, f(x_o) + H(x_o)\, \delta + \dots = 0$

If we consider only the first two terms of the expansion and assume that the Hessian H(x_o) is a nonsingular matrix, then

$\delta = -H^{-1}(x_o)\, \mathrm{grad}\, f(x_o)$

Remark: the displacement δ is formally determined in two equivalent ways, either as the solution of the system of linear equations

$H(x_o)\, \delta = -\mathrm{grad}\, f(x_o)$

or by means of the inverse matrix

$\delta = -H^{-1}(x_o)\, \mathrm{grad}\, f(x_o)$
Slide 32
The solution of the equation grad f(x) = 0 then has the approximate form

$x = x_o - H^{-1}(x_o)\, \mathrm{grad}\, f(x_o)$

This solution serves as the basis for the construction of the recurrent formula that underlies the Newton optimization method

$x_{k+1} = x_k - H^{-1}(x_k)\, \mathrm{grad}\, f(x_k)$

for k = 0, 1, 2, ..., where x_o is a given initial solution. The recurrent application of this formula is terminated when $\| x_{k+1} - x_k \| \le \varepsilon$, where ε > 0 is the prescribed precision of the required solution.
Slide 33
Algorithm of Newton's method

read(xo, kmax, ε);
k:=0; norm:=∞; x:=xo;
while (k<kmax) and (norm>ε) do
begin k:=k+1;
  solve the SLE H(x)δ = -grad f(x);
  x':=x+δ;
  norm:=|x-x'|;
  x:=x';
end;
write(x, f(x));
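A runnable counterpart of this pseudocode in Python/numpy; the quadratic test function at the end is only an illustration:

```python
import numpy as np

def newton(grad_f, hess_f, x0, eps=1e-8, k_max=100):
    """Newton's method: repeat x <- x + delta with H(x) delta = -grad f(x) until |x' - x| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(k_max):
        delta = np.linalg.solve(hess_f(x), -grad_f(x))   # solve the SLE instead of inverting H
        x_new = x + delta
        if np.linalg.norm(x_new - x) <= eps:
            return x_new
        x = x_new
    return x

# Illustration: minimise f(x, y) = (x - 1)^2 + 10 (y + 2)^2.
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton(grad, hess, x0=[5.0, 5.0]))   # -> [ 1. -2.]
```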
Slide 34
Remarks
(1) Newton's method finds a solution of the equation grad f(x) = 0, i.e. it finds the stationary points of the function f(x). These stationary points are classified by means of the Hessian H(x): if x is a stationary point and the Hessian H(x) is positive definite, then the function f(x) has a minimum at the point x.
(2) The function f(x) must be twice differentiable on the whole R^n; we compute the gradient and the Hessian of the function.
(3) The calculation of the Hessian may be very demanding in time and memory, in particular for functions of many (several hundred) variables.
Slide 35
Linearized Newton method
For a certain type of minimized function the calculation of the Hessian can be substantially simplified:

$f(x) = \sum_{k=1}^{p} g_k^2(x)$

For this function the following property holds

$f(x^*) = 0 \;\Longleftrightarrow\; \left( \forall k \in [1,p]: \; g_k(x^*) = 0 \right)$

i.e. a necessary and sufficient condition for an x* with f(x*) = 0 is that g_k(x*) = 0 for every k ∈ [1,p].
Slide 36
Calculation of the Hessian for the function $f(x) = \sum_{k=1}^{p} g_k^2(x)$:

$\frac{\partial f(x)}{\partial x_i} = 2 \sum_{k=1}^{p} g_k(x)\, \frac{\partial g_k(x)}{\partial x_i}$

$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = 2 \sum_{k=1}^{p} \left( \frac{\partial g_k(x)}{\partial x_i}\, \frac{\partial g_k(x)}{\partial x_j} + g_k(x)\, \frac{\partial^2 g_k(x)}{\partial x_i \partial x_j} \right)$

We will assume that the vector x is close to x*, so that g_k(x) ≈ 0 for k = 1, 2, ..., p; an approximate expression for the elements of the Hessian then has the form

$H_{ij}(x) = \frac{\partial^2 f(x)}{\partial x_i \partial x_j} \approx 2 \sum_{k=1}^{p} \frac{\partial g_k(x)}{\partial x_i}\, \frac{\partial g_k(x)}{\partial x_j}$

In the matrix formalism this expression has the form

$H(x) \approx 2 \sum_{k=1}^{p} \mathrm{grad}\, g_k(x)\, \left( \mathrm{grad}\, g_k(x) \right)^T$
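The linearized (Gauss-Newton-type) Hessian can be sketched as follows; grad_g is assumed to return the list of gradients grad g_k(x), and the two linear g_k in the example are arbitrary:

```python
import numpy as np

def gauss_newton_hessian(grad_g, x):
    """H(x) ~ 2 * sum_k grad g_k(x) grad g_k(x)^T, the second-derivative terms being dropped."""
    J = np.array(grad_g(x))        # rows are the gradients grad g_k(x)
    return 2.0 * J.T @ J

# Illustration with g_1(x, y) = x + y - 3 and g_2(x, y) = x - y.
grad_g = lambda x: [np.array([1.0, 1.0]), np.array([1.0, -1.0])]
print(gauss_newton_hessian(grad_g, np.array([0.0, 0.0])))   # [[4. 0.] [0. 4.]]
```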
Slide 37