3.4. How to organize the learning of networks? (translation by Agata Krawcewicz, hogcia@gmail.com)

The necessary changes to the values of the weight coefficients in each neuron are computed according to special rules (sometimes called network paradigms). The number of different rules used today, together with their variants, is truly enormous, because almost every researcher has tried to make his own contribution to the field of neural networks in the form of a new learning rule. Some idea of the scale of this problem is given by the collection of learning algorithms described in my book "Neural networks", to which I constantly refer the more inquisitive Readers for more detailed information (http://winntbg.bg.agh.edu.pl/skrypty/0001/). Here I will briefly discuss (without using mathematics, because that is the deal we made at the very beginning of this book!) two basic learning rules: the rule of steepest descent, which underlies most algorithms of learning with a teacher, and the Hebb rule, which defines the simplest example of learning without a teacher.

The rule of steepest descent is based on the fact that every neuron, having received particular signals at its inputs (from the network's inputs or from other neurons forming earlier stages of information processing), produces its own output signal using the knowledge it possesses in the form of previously established values of all the amplification factors (weights) of all its inputs and (possibly) the threshold. The ways in which neurons determine the values of their output signals on the basis of the input signals were discussed in more detail in the previous chapter. The value of the output signal determined by the neuron in a given step of the learning process is compared with the model answer given by the teacher in the learning set. If there is a divergence (and at the beginning of the learning process such a divergence will almost always appear, because how on earth should the neuron know what we want from it?)
- the neuron determines the difference between its own output signal and the value of the signal which, according to the teacher, would be correct, and it also establishes (by means of the method of steepest descent, which I will soon explain) how to change the values of the weights so that this error diminishes most quickly.

In further considerations it will be useful to know the notion of the error surface, which we will now introduce and discuss in detail. You already know that the operation of the network depends on the values of the weight coefficients of the neurons which are its elements. If you know the set of all weight coefficients appearing in all neurons of the whole network, then you know how the network will behave. In particular, you can show such a network (in turn) all the example tasks available to you, together with their solutions, which form the learning set. Every time the network produces its own answer to the question asked, you can compare it with the pattern of the correct answer found in the learning set, determining the error which the network has committed. The measure of this error is usually the difference between the value of the result delivered by the network and the value of the result read from the learning set. To rate the overall performance of a network with a given set of weight coefficients in its neurons, we usually use the sum of the squares of the errors committed by the network for every case from the learning set. Before summing, the errors are squared to avoid the effect of positive and negative errors compensating each other; moreover, squaring means that an especially "heavy penalty" is imposed on the network for large errors (a twice greater error gives a four times greater component in the created sum).

Fig. 3.4. Method of error surface forming. Discussion in the text. Please observe the yellow rectangle and yellow cloud first, and the navy blue rectangle and cloud next.

Look at figure 3.4.
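The sum-of-squares error measure just described can be sketched in a few lines of code; the function name and the example values below are my own illustration, not notation from the book.

```python
def network_error(network_outputs, correct_answers):
    """Total error of a network over the learning set: the sum of
    squared differences between the network's answers and the
    teacher's answers.  Squaring prevents positive and negative
    errors from cancelling each other, and it penalises large
    errors especially heavily."""
    return sum((out - ans) ** 2
               for out, ans in zip(network_outputs, correct_answers))

# A twice greater error contributes a four times greater component:
network_error([1.0], [0.0])   # contributes 1.0 to the sum
network_error([2.0], [0.0])   # contributes 4.0 to the sum
```

Note also that errors of +1 and -1 do not cancel: each adds a full 1.0 to the total.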
Figure 3.4 shows a situation which could take place in a network so small that it would have only two weight coefficients. Such small neural networks do not exist, but imagine that you have one, because only for such a very small network will we succeed in picturing its behaviour without entering difficult, multidimensional spaces. Every state of better or worse learning of this network is associated with some point on the horizontal (light-blue) surface visible in the figure, its coordinates being the two weight coefficients under consideration. Imagine now that you have set in the network the values of the weights corresponding to the location of the red point on the surface. Examining such a network by means of all elements of the learning set, you will find the total value of the error of this network, and at the red point you will place a (red!) arrow pointing upwards, whose height is the calculated value of the error (according to the description of the vertical axis in the figure). Next, choose other values of the weights, marking another position of the point on the surface (navy blue), and perform the same actions, obtaining the navy blue pointer. And now imagine that you perform these actions for all combinations of the weight coefficients, that is to say for all points of the light-blue surface. In some places the errors will be greater, in others smaller, which you would be able to see (if you had the patience to examine your network so many times) in the form of an error surface spread over the surface of the varied weights. An example of such a surface is shown in fig. 3.5.

Fig. 3.5.
Example of error surface formed during the neural network learning

As you can see, there are many "knolls" on this surface: those are the places in which the network commits especially many errors, so such places should be avoided. There are also many deep valleys, which we find very interesting, because there, at the bottom of such a valley, the neural network commits few errors, that is to say it solves its tasks especially well. How should we find such a valley? Well, you should consider learning of the neural network as a multistage process. During this process you will try, step by step, to improve the values of the weights in the network, exchanging old (worse) sets of weights, which cause the network to commit a large error, for new ones which you hope (but are not certain) are better. Look at figure 3.6, where I have tried to illustrate this.

Fig. 3.6. Illustration of neural network learning process as "sliding down" on the error surface

You begin from the situation shown in the bottom left corner of the figure: you have some old set (a vector) of weights, marked on the surface of the network's parameters with a yellow circle. For this vector of weights you find the error which the network makes, and you "land" on the error surface at the place marked with a yellow arrow in the upper left corner of the figure. Nothing good can be said about this situation: the error is very high, so the network temporarily has very bad parameters. It is necessary to improve this. How? Well, exactly. Methods of learning neural networks can find in which direction it is necessary to change the weight coefficients in order to obtain a diminution of the error. This direction of the quickest descent of the error is marked in fig. 3.6 with a large black pointer.
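The "examine the network for every combination of the two weights" experiment described for fig. 3.4, and the search for the valley, can be simulated for a hypothetical two-input linear neuron. Everything here (the tiny learning set, the grid, all names) is my own assumption, a sketch rather than anything taken from the book.

```python
import numpy as np

# A made-up learning set for a single two-input linear neuron:
# (input signals, teacher's correct answer)
examples = [(np.array([1.0, 0.0]), 1.0),
            (np.array([0.0, 1.0]), -1.0),
            (np.array([1.0, 1.0]), 0.0)]

def total_error(w1, w2):
    """Sum of squared errors of the neuron with weights (w1, w2)
    over all examples of the learning set."""
    w = np.array([w1, w2])
    return sum((np.dot(w, x) - target) ** 2 for x, target in examples)

# Height of the error surface over a grid of weight pairs:
w1_axis = np.linspace(-2.0, 2.0, 41)
w2_axis = np.linspace(-2.0, 2.0, 41)
surface = np.array([[total_error(a, b) for b in w2_axis] for a in w1_axis])

# The "valley" is the grid point where the surface is lowest:
i, j = np.unravel_index(surface.argmin(), surface.shape)
best_w1, best_w2 = w1_axis[i], w2_axis[j]
```

Plotting `surface` over the two weight axes would give a picture in the spirit of fig. 3.5; of course, examining every weight combination like this is only feasible for such a toy network, which is exactly why learning methods take small directed steps instead.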
Unfortunately, the details of how the learning methods do this cannot be explained without using complicated mathematics and such notions as the gradient or the partial derivative; however, the conclusions of these quite complicated mathematical considerations are simple enough. Every neuron in the network modifies its weight coefficients (and possibly its threshold) using two simple rules: first, the weights are changed the more strongly, the greater the error that was detected; second, the weights connected with those inputs on which large input signals appeared are changed more than the weights of those inputs on which the input signal was small. In practice these basic rules still need several additional corrections (in a moment I will say more about them); however, the described outline of the learning method is surely clear. Knowing the error committed by the neuron and knowing its input signals, you can easily foresee how its weights will change. Notice also how very logical and sensible these mathematically derived rules are. For example, a faultless reaction of the neuron to a given input signal should of course leave its weights unchanged, because they have led us to success. And this is exactly what happens! Notice that a network using the described methods in practice stops the learning process by itself when it is already well trained, because small errors cause only minimal, "cosmetic" corrections of the weights. Equally logical is the rule making the size of the correction depend on the size of the input signal delivered through the considered weight: those inputs on which greater signals appeared had a greater influence on the result of the neuron's activity, which proved to be incorrect, so it is necessary to "discipline" them more strongly.
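The two rules above can be written down as one weight-update step for a single linear neuron (this form is often called the delta rule). The sketch below is my own illustration under that assumption; the function name, the learning rate, and the example numbers are not from the book.

```python
import numpy as np

def delta_rule_step(weights, inputs, target, learning_rate=0.1):
    """One learning step for a single linear neuron.
    Both rules from the text appear in one formula:
    - a greater error produces a greater weight change;
    - inputs carrying larger signals get larger weight changes."""
    output = np.dot(weights, inputs)   # the neuron's own answer
    error = target - output            # teacher's answer minus neuron's answer
    new_weights = weights + learning_rate * error * inputs
    return new_weights, error

weights = np.array([0.2, -0.5, 0.0])
inputs = np.array([1.0, 0.5, 0.0])   # third input is zero in this example
new_w, err = delta_rule_step(weights, inputs, target=1.0)
```

Notice that the weight of the third input, whose signal was zero, is left untouched by the step, and that the neuron's answer after the correction is closer to the teacher's answer than before.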
In particular, the described algorithm means that for the inputs to which no signals were given at the given moment (during the calculations they had zero values), the corresponding weights are not changed, because we do not know whether they are good or not: they did not participate in the creation of the current (surely incorrect, if it needs improving) output signal. Returning to the presentation of one step of the learning process, shown in fig. 3.6, notice what happens next. Having found the direction of the quickest descent of the error, the learning algorithm makes a move in the space of weights, consisting in exchanging the old (worse) vector of weights for a new (better) one. This move causes us to "slide down" the error surface to a new point, most often situated lower, that is to say bringing the network nearer to the longed-for valley in which the errors are smallest and the solutions of the given tasks most perfect. Such an optimistic scenario of gradual and efficient movement toward the place where the errors are smallest is shown in figure 3.7.

Fig. 3.7. Searching for (and finding!) network parameters (weight coefficients for all neurons) guaranteeing the minimal value of the error during supervised learning.
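The whole multistage "sliding down" process of figs. 3.6 and 3.7 can be sketched as many such small steps applied one after another. Again, the single linear neuron, the learning set, and the step size below are my own assumptions, not an example from the book.

```python
import numpy as np

# A made-up learning set: (input signals, teacher's correct answer)
learning_set = [(np.array([1.0, 0.0]), 1.0),
                (np.array([0.0, 1.0]), -1.0)]

def total_error(w):
    """Height of the error surface at the weight vector w."""
    return sum((np.dot(w, x) - target) ** 2 for x, target in learning_set)

weights = np.array([2.0, 2.0])   # a bad starting point, high on the error surface
learning_rate = 0.2

start_error = total_error(weights)
for step in range(50):
    for x, target in learning_set:
        error = target - np.dot(weights, x)            # teacher's answer minus neuron's
        weights = weights + learning_rate * error * x  # one small step downhill
final_error = total_error(weights)
# After the steps, final_error is far smaller than start_error:
# the weights have slid down into a valley of the error surface.
```

Printing `total_error(weights)` after every pass would show the optimistic scenario of fig. 3.7: the error shrinks step by step, and the corrections become ever more "cosmetic" as the network approaches the bottom of the valley.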