Lecture 3: Feed-forward neural networks (multilayer perceptron)

Slide 1

A joke sent to me last week by doc. Beňušková from New Zealand, who together with Petr Tiňo founded the tradition of lectures and research on neural networks at this faculty.

Slide 2

Feed-forward neural networks (multilayer perceptron)
- logical neuron (McCulloch and Pitts, 1943)
- perceptron (Rosenblatt, 1958)
- feed-forward neural networks (Rumelhart, 1986)
- rapid development of neural networks (subsymbolic AI, connectionism)

Slide 3

David Rumelhart
D. E. Rumelhart, G. E. Hinton, and R. J. Williams: Learning representations by back-propagating errors. Nature, 323 (1986), 533-536.
D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, editors: Parallel Distributed Processing, volumes I and II. MIT Press, Cambridge, MA, 1986.

Slide 4

Formal definition of a multilayer perceptron
Let us consider an oriented graph G=(V,E) determined as an ordered couple of a vertex set V and an edge set E.
[Figure: (A) an acyclic oriented graph and (B) a cyclic oriented graph, each with vertices v1-v6 and edges e1-e8]

Slide 5

Theorem. An oriented graph G is acyclic iff its vertices may be indexed so that every edge goes from a lower index to a higher one, (∀e=(v_i,v_j)∈E): i<j.
Vertices of an oriented graph can be classified as follows:
(1) Input vertices that are incident only with outgoing edges.
(2) Hidden vertices that are simultaneously incident with at least one incoming edge and at least one outgoing edge.
(3) Output vertices that are incident only with incoming edges.
[Figure: input neurons v1, v2; hidden neurons v3, v4, v5; output neuron v6]

Slide 6

The vertex set V is unambiguously divided into three disjoint subsets, V = V_I ∪ V_H ∪ V_O, composed of input, hidden, and output vertices, respectively.
Assuming that the graph G is canonically indexed and is composed of n input vertices, h hidden vertices, and m output vertices, the three vertex subsets are simply determined by

V_I = {1, 2, ..., n},                |V_I| = n
V_H = {n+1, n+2, ..., n+h},          |V_H| = h
V_O = {n+h+1, n+h+2, ..., n+h+m},    |V_O| = m

[Figure: input neurons v1, v2; hidden neurons v3, v4, v5; output neuron v6]

Slide 7

Vertices and edges of the oriented graph G are evaluated by real numbers. Each oriented edge e=(v_i,v_j)∈E is evaluated by a real number w_ji called the weight. Each vertex v_i∈V is evaluated by a real number x_i called the activity, and each hidden or output vertex v_i∈V is evaluated by a real number θ_i called the threshold. The activities of hidden and output vertices are determined as follows:

x_i = t(ξ_i),  ξ_i = Σ_{j=1}^{i-1} w_ij x_j + θ_i

where t(ξ) is a transfer function. In our forthcoming considerations we will assume that its analytical form is specified as the sigmoid function, t(ξ) = 1/(1+e^{-ξ}).

Slide 8

Definition. A feed-forward neural network is determined as an ordered triple N = (G, w, θ), where G is an acyclic, oriented, and connected graph and w and θ are the weights and thresholds assigned to the edges (connections) and vertices (neurons) of the graph. We say that the graph G specifies a topology (or architecture) of the neural network N, and that the weights w and thresholds θ specify parameters of the neural network. Activities of neurons form a vector x = (x_1, x_2, ..., x_p). This vector can be formally divided into three subvectors composed of input, hidden, and output activities:

x = (x_I, x_H, x_O)

Slide 9

A neural network N = (G, w, θ) with fixed weights and thresholds can be considered as a parametric function

G_{w,θ}: R^n → (0,1)^m,  x_I ↦ x_O

This function assigns to an input activity vector x_I a vector of output activities x_O:

x_O = G(x_I; w, θ)

Hidden activities are not explicitly present in this expression; they play only the role of byproducts.

Slide 10

How to calculate activities of a neural network N = (G, w, θ)?
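The activity rule above can be sketched in a few lines of Python (a minimal illustration; the `activity` helper and its concrete weights are assumptions for the example, while the sigmoid form follows the definition on this slide):

```python
import math

def sigmoid(xi):
    # transfer function t(xi) = 1 / (1 + e^(-xi)), values always in (0, 1)
    return 1.0 / (1.0 + math.exp(-xi))

def activity(weights, inputs, theta):
    # x_i = t( sum_j w_ij * x_j + theta_i )
    xi = sum(w * x for w, x in zip(weights, inputs)) + theta
    return sigmoid(xi)

# a neuron with two predecessors (hypothetical weights and threshold)
x = activity([0.5, -0.3], [1.0, 0.0], 0.1)
print(0.0 < x < 1.0)  # True: sigmoid activities lie in (0, 1)
```

Because the sigmoid is bounded, the output vector of the whole network lies in (0,1)^m, as stated in the parametric-function definition above.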
(1) We shall postulate that activities of input neurons are kept fixed; in other words, the vector of input activities x_I is the input to the neural network. Usually its entries correspond to the so-called descriptors that specify a classified pattern.
(2) Activities of hidden and output neurons are calculated by a simple recurrent procedure based on the fact that the topology of the neural network is determined by an acyclic oriented graph G:

x_i = f_i(x_1, x_2, ..., x_{i-1})    (for i = n+1, n+2, ..., p)

[Figure: the network v1-v6 arranged in layers L0-L3]
Layer L0: x1 = input constant, x2 = input constant
Layer L1: x3 = t(w31x1 + w32x2 + θ3), x4 = t(w41x1 + θ4)
Layer L2: x5 = t(w53x3 + w52x2 + θ5)
Layer L3: x6 = t(w63x3 + w64x4 + w65x5 + θ6)

Slide 11

Unfortunately, if the graph G is not acyclic (it contains oriented cycles), then the above simple recurrent calculation of activities is inapplicable: at some stage of the algorithm we would have to know an activity that has not yet been calculated. The activities are determined by

x_i = f_i(x_1, x_2, ..., x_n, x_{n+1}, ..., x_p)    (for i = n+1, n+2, ..., p)

These equations are coupled; their solution (if any) can be constructed by making use of an iterative procedure that is started from initial activities, where the calculated activities are used as an input for the calculation of new activities, etc.:

x_i^{(t+1)} = f_i(x_1, x_2, ..., x_n, x_{n+1}^{(t)}, ..., x_p^{(t)})    (for i = n+1, n+2, ..., p)

This iterative procedure is repeated until the difference between new and old activities is smaller than a prescribed precision.

Slide 12

Learning (adaptation process) of feed-forward neural networks
Learning of a feed-forward neural network consists in looking for such weights and thresholds that the produced output activities (as a response to input activities) are closely related to the required ones.
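The recurrent procedure of step (2) can be sketched as follows (a minimal illustration assuming canonical indexing, so each neuron depends only on lower-indexed ones; the four-neuron topology and its weights are hypothetical):

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def forward(n, p, w, theta, x_input):
    """Compute all activities of a canonically indexed feed-forward net.

    w[i] maps predecessor index j to the weight w_ij of edge (v_j, v_i).
    Neurons 1..n are inputs; n+1..p are hidden/output.
    """
    x = {i + 1: x_input[i] for i in range(n)}   # fixed input activities
    for i in range(n + 1, p + 1):               # canonical order: all j < i
        xi = theta[i] + sum(wij * x[j] for j, wij in w.get(i, {}).items())
        x[i] = sigmoid(xi)
    return x

# tiny example: 2 inputs, hidden neuron 3, output neuron 4
w = {3: {1: 1.0, 2: 1.0}, 4: {3: 2.0}}
theta = {3: -0.5, 4: -1.0}
x = forward(n=2, p=4, w=w, theta=theta, x_input=[1.0, 0.0])
print(all(0.0 < x[i] < 1.0 for i in (3, 4)))  # True
```

For a cyclic graph this single pass would not suffice; one would instead iterate the same update until the activities stop changing, as described on Slide 11.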
First of all we have to define the training set, composed of examples of an input-activity vector and the required output-activity vector:

A_train = { (x_I^k, x̂_O^k); k = 1, 2, ..., p }

The error objective function is determined by

E_k(w,θ) = (1/2) ||x_O^k − x̂_O^k||² = (1/2) ||G(x_I^k; w,θ) − x̂_O^k||² = (1/2) Σ_i g_i²

where g_i = x_i − x̂_i for i ∈ V_O and g_i = 0 otherwise.

Slide 13

E(w,θ) = Σ_{k=1}^{p} E_k(w,θ)

The learning process of the given feed-forward neural network N=(G,w,θ), specified by the graph G and with unknown weights and thresholds, is realized by the following minimization process:

(w_opt, θ_opt) = arg min_{w,θ} E(w,θ)

If we know the optimal weights and thresholds (those that minimize the error objective function), then the active process is a calculation of output activities as a response to input activities, with parameters determined by the learning (adaptation) process:

x_O = G(x_I; w_opt, θ_opt)

Slide 14

The active process of neural networks is usually used for classification or prediction of unknown patterns that are described only by input activities (descriptors that specify the structure of patterns). In order to quantify this process we have to introduce the so-called test set, composed of examples of patterns specified by an input-activity vector and a required output-activity vector:

A_test = { (y_I^k, ŷ_O^k); k = 1, 2, ..., q }

An analogue of the error objective function is

E_k^test(w_opt, θ_opt) = (1/2) ||y_O^k − ŷ_O^k||² = (1/2) ||G(y_I^k; w_opt, θ_opt) − ŷ_O^k||²

E^test(w_opt, θ_opt) = Σ_{k=1}^{q} E_k^test(w_opt, θ_opt)

Slide 15

We say that an adapted neural network correctly interprets patterns from the test set if this objective function is sufficiently small. If this requirement is not fulfilled, then there exists an example from the test set which is incorrectly interpreted.
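The training-set error above is straightforward to express in code (a minimal sketch; `model` stands for any function realizing G(x_I; w, θ), and the constant model below is a hypothetical stand-in):

```python
def pattern_error(x_out, x_required):
    # E_k = 1/2 * sum_i (x_i - x̂_i)^2 over output neurons
    return 0.5 * sum((x - r) ** 2 for x, r in zip(x_out, x_required))

def total_error(model, training_set):
    # E = sum_k E_k over the whole training set
    return sum(pattern_error(model(x_in), x_req) for x_in, x_req in training_set)

# hypothetical "network" that ignores its input and always answers 0.5
constant_model = lambda x_in: [0.5]
data = [([0.0, 0.0], [0.0]), ([1.0, 1.0], [1.0])]
print(total_error(constant_model, data))  # 0.125 + 0.125 = 0.25
```

The test-set error E^test is the same computation carried out over A_test with the parameters frozen at (w_opt, θ_opt).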
Gradient of the error objective function
Partial derivatives of the error objective function (defined over a training set) with respect to weights and thresholds are determined by

∂E_k/∂w_ij = (∂E_k/∂x_i)(∂x_i/∂w_ij) = (∂E_k/∂x_i) t'(ξ_i) x_j
∂E_k/∂θ_i = (∂E_k/∂x_i)(∂x_i/∂θ_i) = (∂E_k/∂x_i) t'(ξ_i)

where the partial derivative ∂x_i/∂w_ij is simply calculated by making use of the relation x_i = t(ξ_i); then ∂x_i/∂w_ij = t'(ξ_i) ∂ξ_i/∂w_ij = t'(ξ_i) x_j. Similar considerations are applicable also for the partial

Slide 16

derivative ∂x_i/∂θ_i. If we compare both these equations, we get a simple relationship between the partial derivatives ∂E_k/∂w_ij and ∂E_k/∂θ_i:

∂E_k/∂w_ij = (∂E_k/∂θ_i) x_j

Let us study the partial derivative ∂E_k/∂x_i; its calculation depends on whether the index i corresponds to an output neuron or a hidden neuron:

∂E_k/∂x_i = g_i                                  (for i ∈ V_O)
∂E_k/∂x_i = Σ_l (∂E_k/∂x_l)(∂x_l/∂x_i)           (for i ∈ V_H)

where the summation runs over all neurons l that are successors of the i-th neuron. The second formula results from an application of the well-known chain rule for partial derivatives of a composite function. The last expressions can be unified into one expression:

Slide 17

∂E_k/∂x_i = g_i + Σ_l (∂E_k/∂x_l)(∂x_l/∂x_i)

where it is necessary to remember that the summation on the r.h.s. vanishes if the i-th neuron has no successors, i.e. for output neurons. The final expression for the partial derivatives of the error objective function with respect to the thresholds is

∂E_k/∂θ_i = t'(ξ_i) (g_i + Σ_l (∂E_k/∂θ_l) w_li)    (for i ∈ V_H ∪ V_O)

(1) In general, the above formula for the partial derivatives of the objective function can be characterized as a system of linear equations whose solution determines the partial derivatives ∂E_k/∂θ_i.
(2) For feed-forward neural networks the above formula can be solved recurrently. Starting from the top layer L_t, we calculate all partial derivatives assigned to output neurons, ∂E_k/∂θ_i = t'(ξ_i) g_i. In the next step we calculate the partial derivatives from the layer L_{t-1}; for their calculation we need to know the partial derivatives from the layer L_t, which were calculated in the previous step, etc.
Therefore this recurrent approach to the calculation of partial derivatives is called back propagation.
(3) Knowing all partial derivatives ∂E_k/∂θ_i, we may simply calculate the partial derivatives ∂E_k/∂w_ij.

Slide 18

(4) An analogue of the above formula was initially derived by Rumelhart et al. in 1986; this work is considered in the literature as one of the milestones of the development of the theory of neural networks. They demonstrated that multilayer perceptrons, together with the back-propagation method for the calculation of gradients of error objective functions, are able to overcome the boundaries of simple perceptrons stated by Minsky and Papert, that is, to classify correctly all patterns and not only those that are linearly separable.
[Figure: back propagation proceeds from the output neuron v6 through the hidden neurons v3, v4, v5 down to the input neurons v1, v2]

Theorem. For feed-forward neural networks the first partial derivatives of the error objective function E_k are determined recurrently by

Slide 19

∂E_k/∂θ_i = t'(ξ_i) (g_i + Σ_l (∂E_k/∂θ_l) w_li)
∂E_k/∂w_ij = (∂E_k/∂θ_i) x_j

where in order to calculate ∂E_k/∂θ_i we have to know either g_i (for i ∈ V_O) or the partial derivatives ∂E_k/∂θ_l (for the successors l of i). We have to emphasize that in the course of the derivation of the above formula we never used the assumption that the considered neural network corresponds to an acyclic graph. This means that the formula is correct also for neural networks assigned to graphs that are either cyclic or acyclic. For neural networks assigned to cyclic graphs, however, the above discussed recurrent approach to the calculation of partial derivatives is inapplicable.
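The two recurrences in the theorem can be sketched directly in Python (a minimal illustration on a hypothetical two-layer chain network; `dtheta[i]` stands for ∂E_k/∂θ_i, `dw[(i, j)]` for ∂E_k/∂w_ij, and for the sigmoid t'(ξ_i) = x_i(1 − x_i)):

```python
import math

def sigmoid(xi):
    return 1.0 / (1.0 + math.exp(-xi))

def backprop(order, w, x, required):
    """Gradients for one pattern via the back-propagation recurrences.

    order: hidden/output neuron indices in canonical order; x: all activities;
    required[i]: target activity for output neuron i; w[i][j] is w_ij.
    """
    dtheta, dw = {}, {}
    for i in reversed(order):                  # top layer first
        g_i = x[i] - required[i] if i in required else 0.0
        # sum over successors l, i.e. neurons l whose edge list contains i
        s = sum(dtheta[l] * wl[i] for l, wl in w.items()
                if i in wl and l in dtheta)
        dtheta[i] = x[i] * (1.0 - x[i]) * (g_i + s)
        for j in w.get(i, {}):
            dw[(i, j)] = dtheta[i] * x[j]      # dE/dw_ij = dE/dtheta_i * x_j
    return dtheta, dw

# hypothetical net: inputs 1,2 -> hidden 3 -> output 4
w = {3: {1: 1.0, 2: 1.0}, 4: {3: 2.0}}
theta = {3: -0.5, 4: -1.0}
x = {1: 1.0, 2: 0.0}
for i in (3, 4):
    x[i] = sigmoid(theta[i] + sum(wij * x[j] for j, wij in w[i].items()))
dtheta, dw = backprop([3, 4], w, x, required={4: 1.0})
print(dtheta[4] < 0.0)  # True: the output is below its target of 1.0
```

Note that the loop visits neurons in reverse canonical order, so the derivatives of all successors are always available, exactly as point (2) of the previous slide requires.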
For this type of neural networks the partial derivatives are not determined recurrently, but only as a solution of a system of coupled linear equations:

∂E_k/∂θ_i − x'_i Σ_l w_li (∂E_k/∂θ_l) = x'_i g_i

Its matrix form is

(1 − diag(x') w^T) e_k = a

where e_k = (∂E_k/∂θ_1, ..., ∂E_k/∂θ_p)^T is a column vector composed of the first derivatives of the error objective function E_k with respect to the thresholds, diag(x') = (δ_ij x'_i) is a diagonal matrix composed of

Slide 20

the first derivatives of the activities, and a = (g_1 x'_1, ..., g_p x'_p)^T is a column vector composed of the products g_i x'_i. Assuming that the matrix (1 − diag(x') w^T) is nonsingular, the solution is determined by

e_k = (1 − diag(x') w^T)^{-1} a

This means that for neural networks represented by either a cyclic or an acyclic oriented graph, the first partial derivatives of E_k are determined as a solution of the above matrix equation. The present approach is simply generalized also to the calculation of partial derivatives of the total error objective function determined over all training patterns:

∂E/∂w_ij = Σ_{k=1}^{p} ∂E_k/∂w_ij  and  ∂E/∂θ_i = Σ_{k=1}^{p} ∂E_k/∂θ_i

If we know the partial derivatives, then the batch adaptation process is simply realized by a steepest-descent optimization accelerated by a momentum term:

w_ij(t+1) = w_ij(t) − λ (∂E/∂w_ij) + μ Δw_ij(t)
θ_i(t+1) = θ_i(t) − λ (∂E/∂θ_i) + μ Δθ_i(t)

where the learning parameter λ>0 should be sufficiently small (usually λ = 0.01-0.1) to ensure a monotonous convergence of the optimization method. The initial values of the weights w_ij(0) and thresholds

Slide 21

θ_i(0) are randomly generated. The last terms in the above formulae correspond to momentum terms, determined as differences of terms from the last two iterations, Δw_ij(t) = w_ij(t) − w_ij(t−1) and Δθ_i(t) = θ_i(t) − θ_i(t−1). The momentum terms may be important at the initial stage of the optimization as a simple tool for escaping local minima; the value of the momentum parameter is usually μ = 0.5-0.7.
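The steepest-descent update with momentum can be sketched as follows (a minimal illustration on a hypothetical one-dimensional quadratic error; `lam` and `mu` play the roles of λ and μ):

```python
def descend(grad, w0, lam=0.1, mu=0.5, steps=100):
    # w(t+1) = w(t) - lam * dE/dw + mu * (w(t) - w(t-1))
    w_prev, w = w0, w0
    for _ in range(steps):
        w_next = w - lam * grad(w) + mu * (w - w_prev)
        w_prev, w = w, w_next
    return w

# hypothetical error E(w) = (w - 3)^2, so grad E = 2 (w - 3); minimum at w = 3
w_star = descend(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(abs(w_star - 3.0) < 1e-6)  # True: converged close to the minimum
```

In a real network `grad` would return the batch gradient Σ_k ∂E_k/∂w accumulated over the training set, and the same update would be applied to every weight and threshold.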
Hessian of the error objective function
One of the most efficient optimization techniques is the Newton method, based on the following recurrent updating formula:

x_{k+1} = x_k − H^{-1}(x_k) grad f(x_k)

where H(x_k) is the Hessian matrix composed of second-order partial derivatives and f(x) is the objective function to be minimized. This recurrent scheme is stopped when the norm of the gradient is smaller than a prescribed precision; the obtained solution x* is a minimum if the Hessian H(x*) is positive definite. Three types of second-order partial derivatives will be calculated:

Slide 22

∂²E_k/∂θ_a∂θ_i,  ∂²E_k/∂w_ab∂θ_i,  ∂²E_k/∂w_ab∂w_ij

The partial derivatives ∂²E_k/∂θ_a∂θ_i (where i, a ∈ V_H ∪ V_O) can be calculated by differentiating the back-propagation formula ∂E_k/∂θ_i = x'_i (g_i + Σ_l w_li ∂E_k/∂θ_l) with respect to θ_a:

∂²E_k/∂θ_a∂θ_i = δ_ia x''_i (g_i + Σ_l w_li ∂E_k/∂θ_l) + x'_i (δ_ia χ_i x'_i + Σ_l w_li ∂²E_k/∂θ_a∂θ_l)

where

δ_ia = 1 (if i = a), 0 (if i ≠ a)
χ_i = 1 (if i ∈ V_O), 0 (if i ∉ V_O)

Slide 23

The symbols x'_i and x''_i are the first and second derivatives of the activities assigned to hidden or output neurons. This formula allows a recurrent "back propagation" calculation of the second partial derivatives ∂²E_k/∂θ_a∂θ_i. The partial derivatives ∂²E_k/∂w_ab∂θ_i are calculated immediately from the above two formulae; we get

∂²E_k/∂w_ab∂θ_i = ∂/∂θ_i ((∂E_k/∂θ_a) x_b) = (∂²E_k/∂θ_a∂θ_i) x_b + δ_ib x'_b (∂E_k/∂θ_a)

In a similar way we may calculate the partial derivatives ∂²E_k/∂w_ab∂w_ij; we get

Slide 24

∂²E_k/∂w_ab∂w_ij = (∂²E_k/∂θ_a∂θ_i) x_b x_j + δ_ib x'_b x_j (∂E_k/∂θ_a) + δ_ja x'_j x_b (∂E_k/∂θ_i)

The calculation of the partial derivatives ∂²E_k/∂w_ab∂θ_i and ∂²E_k/∂w_ab∂w_ij requires only the first partial derivatives ∂E_k/∂θ_i and the second partial derivatives ∂²E_k/∂θ_a∂θ_i. Theorem.
For feed-forward neural networks the second partial derivatives of the error objective function E_k are determined by

Slide 25

∂²E_k/∂θ_a∂θ_i = δ_ia x''_i (g_i + Σ_l w_li ∂E_k/∂θ_l) + x'_i (δ_ia χ_i x'_i + Σ_l w_li ∂²E_k/∂θ_a∂θ_l)
∂²E_k/∂w_ab∂θ_i = (∂²E_k/∂θ_a∂θ_i) x_b + δ_ib x'_b (∂E_k/∂θ_a)
∂²E_k/∂w_ab∂w_ij = (∂²E_k/∂θ_a∂θ_i) x_b x_j + δ_ib x'_b x_j (∂E_k/∂θ_a) + δ_ja x'_j x_b (∂E_k/∂θ_i)

where the partial derivatives ∂²E_k/∂θ_a∂θ_i may be calculated recurrently in a back-propagation manner, whereas the other two partial derivatives, ∂²E_k/∂w_ab∂θ_i and ∂²E_k/∂w_ab∂w_ij, are calculated directly. Second-order partial derivatives of the total objective function determined over all patterns from the training set are determined as summations of the partial derivatives of the objective functions E_k:

Slide 26

∂²E/∂θ_a∂θ_i = Σ_{k=1}^{p} ∂²E_k/∂θ_a∂θ_i
∂²E/∂θ_i∂w_ab = Σ_{k=1}^{p} ∂²E_k/∂θ_i∂w_ab
∂²E/∂w_ij∂w_ab = Σ_{k=1}^{p} ∂²E_k/∂w_ij∂w_ab

An adaptation process of neural networks realized in the framework of the Newton optimization method is based on the following recurrent updating formulae:

w_ij(t+1) = w_ij(t) − Σ_{ab} (H^{-1})_{ij,ab} (∂E/∂w_ab) − Σ_{a} (H^{-1})_{ij,a} (∂E/∂θ_a)
θ_i(t+1) = θ_i(t) − Σ_{ab} (H^{-1})_{i,ab} (∂E/∂w_ab) − Σ_{a} (H^{-1})_{i,a} (∂E/∂θ_a)

Slide 27

In a similar way as in the previous part of this lecture, where we studied the first partial derivatives of the error objective function E_k, the second partial derivatives were derived without using any assumption that the neural network corresponds to an acyclic graph. Of course, the assumption that the network is acyclic considerably simplifies the calculation of the second partial derivatives ∂²E_k/∂θ_a∂θ_i by the recurrent method. For cyclic neural networks this recurrent approach is inapplicable; the partial derivatives are determined by a system of linear equations:

∂²E_k/∂θ_a∂θ_i − x'_i Σ_l w_li (∂²E_k/∂θ_a∂θ_l) = δ_ia (x''_i (g_i + Σ_l w_li ∂E_k/∂θ_l) + χ_i (x'_i)²)

This relation may be rewritten in a matrix form as

E'' (1 − w diag(x')) = A

where E'' is a symmetric matrix composed of the partial derivatives ∂²E_k/∂θ_i∂θ_a and A = (A_ai) is a diagonal matrix.
Assuming that the matrix (1 − w diag(x')) is nonsingular, the matrix E'' composed of the second partial derivatives ∂²E_k/∂θ_i∂θ_a is determined explicitly by

E'' = A (1 − w diag(x'))^{-1}

This means that, for neural networks represented by either a cyclic or an acyclic oriented graph, the second partial derivatives of E_k are determined as a solution of the above matrix equation.

Slide 28

Illustrative example: the Boolean function XOR
The effectiveness of neural networks with hidden neurons is illustrated by the Boolean function XOR, which cannot be correctly classified by a simple perceptron without hidden neurons. The neural network used here contains three layers: the first layer is composed of input neurons, the second of hidden neurons, and the third (last) of output neurons. Hidden neurons create the so-called inner representation, which is already linearly separable; this is the main reason why neural networks with hidden neurons are most frequently used in the whole theory of neural networks.

Slide 29

The parameters of the adaptation process are λ=0.1 and μ=0.5; after 400 iterations the objective function achieved the value E=0.031.
[Figure: a three-layer network with input neurons 1, 2, hidden neurons 3, 4, and output neuron 5]

Slide 30

Activities of the neurons of the feed-forward network composed of five neurons, two of them hidden:

No. | x1   | x2   | x3   | x4   | x5   | x̂5
1   | 0.00 | 0.00 | 0.96 | 0.08 | 0.06 | 0.00
2   | 0.00 | 1.00 | 1.00 | 0.89 | 0.95 | 1.00
3   | 1.00 | 0.00 | 0.06 | 0.00 | 0.94 | 1.00
4   | 1.00 | 1.00 | 0.96 | 0.07 | 0.05 | 0.00

The hidden activities (x3, x4) of the XOR problem are linearly separable.
[Figure: (A) the input patterns in the (x1, x2) plane; (B) their images in the (x3, x4) plane of hidden activities]

Newton's method

Slide 31

We will approximately solve the equation

grad f(x) = 0

Let x = x_o + ε, where ε = (ε_1, ε_2, ..., ε_n); then

grad f(x_o + ε) = 0

The left-hand side is expanded by making use of the Taylor expansion:

grad f(x_o) + H(x_o) ε + ... = 0
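The XOR experiment can be reproduced with a small script (a sketch, not the original program from the lecture: a hypothetical 2-2-1 architecture with sigmoid units, per-pattern steepest descent with the slide's λ=0.1, no momentum, and a fixed random seed):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# weights/thresholds of a 2-2-1 net: hidden neurons 3, 4 and output neuron 5
w = {(3, 1): random.uniform(-1, 1), (3, 2): random.uniform(-1, 1),
     (4, 1): random.uniform(-1, 1), (4, 2): random.uniform(-1, 1),
     (5, 3): random.uniform(-1, 1), (5, 4): random.uniform(-1, 1)}
th = {3: 0.0, 4: 0.0, 5: 0.0}
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def forward(x1, x2):
    x3 = sigmoid(w[(3, 1)] * x1 + w[(3, 2)] * x2 + th[3])
    x4 = sigmoid(w[(4, 1)] * x1 + w[(4, 2)] * x2 + th[4])
    x5 = sigmoid(w[(5, 3)] * x3 + w[(5, 4)] * x4 + th[5])
    return x3, x4, x5

def total_error():
    return sum(0.5 * (forward(x1, x2)[2] - t) ** 2 for (x1, x2), t in data)

e_start = total_error()
lam = 0.1
for _ in range(10000):
    for (x1, x2), t in data:
        x3, x4, x5 = forward(x1, x2)
        d5 = x5 * (1 - x5) * (x5 - t)          # dE/dtheta_5
        d3 = x3 * (1 - x3) * d5 * w[(5, 3)]    # back-propagated deltas
        d4 = x4 * (1 - x4) * d5 * w[(5, 4)]
        for (i, j), xj, di in [((5, 3), x3, d5), ((5, 4), x4, d5),
                               ((3, 1), x1, d3), ((3, 2), x2, d3),
                               ((4, 1), x1, d4), ((4, 2), x2, d4)]:
            w[(i, j)] -= lam * di * xj
        th[5] -= lam * d5; th[3] -= lam * d3; th[4] -= lam * d4
print(total_error() < e_start)  # True: the objective function decreased
```

With a successful run, the hidden pair (x3, x4) ends up arranged as in the table above: the two patterns of class 0 and the two patterns of class 1 become linearly separable in the hidden-activity plane.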
If we consider only the first two terms of the expansion and assume that the Hessian H(x_o) is a nonsingular matrix, then

ε = −H^{-1}(x_o) grad f(x_o)

Remark: The displacement ε is formally determined in two equivalent ways: either as the solution of the system of linear equations

H(x_o) ε = −grad f(x_o)

or by means of the inverse matrix,

ε = −H^{-1}(x_o) grad f(x_o)

Slide 32

The solution of the equation grad f(x) = 0 then has the approximate form

x = x_o − H^{-1}(x_o) grad f(x_o)

This solution serves as the basis for the construction of the recurrent formula that underlies the Newton optimization method:

x_{k+1} = x_k − H^{-1}(x_k) grad f(x_k)

for k = 0, 1, 2, ..., where x_o is a given initial solution. The recurrent application of this formula is terminated when |x_{k+1} − x_k| < ε, where ε>0 is the prescribed precision of the required solution.

Slide 33

Algorithm of the Newton method

read(xo, kmax, eps);
k := 0; norm := infinity; x := xo;
while (k < kmax) and (norm > eps) do
begin
  k := k + 1;
  solve SLE H(x)*delta = -grad f(x);
  x' := x + delta;
  norm := |x - x'|;
  x := x';
end;
write(x, f(x));

Slide 34

Remarks
(1) The Newton method finds a solution of the equation grad f(x) = 0, i.e. it finds the stationary points of the function f(x). These stationary points are classified by means of the Hessian H(x): if x is a stationary point and the Hessian H(x) is positive definite, then the function f(x) has a minimum at x.
(2) The function f(x) must be twice differentiable on the whole R^n; we compute the gradient and the Hessian of the function.
(3) The computation of the Hessian may be very demanding in time and memory, namely for functions of many (several hundred) variables.

Slide 35

Linearized Newton method
For a certain type of minimized function the computation of the Hessian can be substantially simplified:

f(x) = Σ_{k=1}^{p} g_k²(x)

For this function the following property holds:

f(x) = 0  ⇔  (∀k ∈ [1,p]): g_k(x) = 0

i.e. a necessary and sufficient condition for the existence of an x* with f(x*) = 0 is that g_k(x*) = 0 for every k ∈ [1,p].
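The algorithm of Slide 33 can be sketched in Python (a minimal two-dimensional illustration; the quadratic-plus-quartic objective is hypothetical, and the linear system H·δ = −grad f is solved directly with the 2×2 inverse):

```python
def newton(grad, hess, x0, kmax=50, eps=1e-10):
    # x_{k+1} = x_k - H^{-1}(x_k) grad f(x_k), stopped when the step is tiny
    x, y = x0
    for _ in range(kmax):
        gx, gy = grad(x, y)
        (a, b), (c, d) = hess(x, y)
        det = a * d - b * c                    # invert the 2x2 Hessian
        dx = -( d * gx - b * gy) / det
        dy = -(-c * gx + a * gy) / det
        x, y = x + dx, y + dy
        if (dx * dx + dy * dy) ** 0.5 < eps:   # |x_{k+1} - x_k| < eps
            break
    return x, y

# f(x, y) = (x - 1)^2 + (y + 2)^4 has its minimum at (1, -2)
grad = lambda x, y: (2 * (x - 1), 4 * (y + 2) ** 3)
hess = lambda x, y: ((2, 0), (0, 12 * (y + 2) ** 2))
x, y = newton(grad, hess, (0.0, 0.0))
print(abs(x - 1) < 1e-6 and abs(y + 2) < 1e-3)  # True
```

The quadratic direction is solved exactly in one step, while the quartic direction only contracts by a constant factor per iteration, which is why the stopping test on the step norm (rather than a fixed iteration count) matters in practice.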
Slide 36

Computation of the Hessian for the function f(x) = Σ_{k=1}^{p} g_k²(x):

∂f(x)/∂x_i = 2 Σ_{k=1}^{p} g_k(x) (∂g_k(x)/∂x_i)

∂²f(x)/∂x_i∂x_j = 2 Σ_{k=1}^{p} ( (∂g_k(x)/∂x_i)(∂g_k(x)/∂x_j) + g_k(x) ∂²g_k(x)/∂x_i∂x_j )

We will assume that the vector x is close to x*; then g_k(x) ≈ 0 for k = 1, 2, ..., p, and the approximate expression for the elements of the Hessian has the form

H_ij(x) = ∂²f(x)/∂x_i∂x_j ≈ 2 Σ_{k=1}^{p} (∂g_k(x)/∂x_i)(∂g_k(x)/∂x_j)

In the matrix formalism this expression has the form

H(x) ≈ 2 Σ_{k=1}^{p} grad g_k(x) (grad g_k(x))^T

Slide 37
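The linearized Hessian can be sketched as follows (a minimal illustration; the residual gradients are hypothetical, and the outer-product sum follows the matrix formula above):

```python
def linearized_hessian(grads):
    # H ≈ 2 * sum_k grad g_k (grad g_k)^T, built from the residual gradients
    n = len(grads[0])
    H = [[0.0] * n for _ in range(n)]
    for g in grads:
        for i in range(n):
            for j in range(n):
                H[i][j] += 2.0 * g[i] * g[j]
    return H

# two residuals in R^2 with gradients (1, 0) and (1, 1)
H = linearized_hessian([(1.0, 0.0), (1.0, 1.0)])
print(H)  # [[4.0, 2.0], [2.0, 2.0]]
```

The approximation needs only first derivatives of the residuals g_k, which is exactly why it sidesteps the expensive second-derivative computation mentioned in remark (3) of the previous slide; the resulting matrix is symmetric and positive semidefinite by construction.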