2.13. How many neurons do you need to get a well-working network?
(Translated by Anastasiya Zharkova, nastia_zhar@wp.pl)

It follows from the above remarks that the broadest possibilities for future use are offered by networks which have at least a three-layer structure, with an input layer that receives signals, a hidden layer that extracts those characteristics of the input signals that are needed, and an output layer which makes final decisions and provides a solution. Within this structure some elements are determined: the number of input and output elements, as well as the principle of connecting successive layers. However, there are certain variable elements which you have to determine yourself. These are: the number of hidden layers (one or more) and the number of elements in the hidden layer (or layers) (fig. 2.40).

Fig. 2.40. The most important problem during neural network design concerns the number of hidden neurons

Despite many years of development of this technology, no precise theory of neural networks has yet been formulated, so these elements are usually chosen arbitrarily or by trial and error. It may happen that the network author's idea about how many hidden neurons should be used and how they should be organized (e.g. as one hidden layer or as several such layers) is not quite correct. Nevertheless, this should not have a crucial impact on the network's operation, because during the learning process the network always has a chance to correct possible errors of its structure by choosing appropriate connection parameters.

Still, we must specifically warn our readers here about two types of mistakes which trap many researchers of neural networks (especially beginners). The first mistake consists in designing a network with too few elements: when there is no hidden layer or there are too few neurons, the learning process may fail miserably, since the network will not have any chance to imitate in its (too scanty) structure all the details and nuances of the problem being solved. Later, I will give you specific examples illustrating the fact that a network which is too small and too primitive cannot deal with certain tasks even if one teaches it very thoroughly and for a very long time. Simply put, neural networks are sometimes like people: not all of them are talented enough to solve a particular problem. Luckily, there is always an easy way to check how intelligent a network is, because you can see its structure and count the neurons, since the measure of a network's capability is essentially the number of its hidden neurons. With humans it is more difficult!

Unfortunately, despite the freedom in building bigger or smaller networks, it sometimes happens that the intelligence of one's network is too low. This always results in a failure to use such a network for a specific purpose, because such a 'neural dummy' with too few hidden neurons will never succeed in the tasks it has been given, no matter how much you toil at trying to teach it something. However, the network's intelligence shouldn't be 'overdone' either. The effect of a network's excessive intelligence is not a greater capacity for dealing with the problems it has been set, but an astonishing phenomenon: instead of diligently acquiring useful knowledge the network begins to fool its teacher and, consequently, it doesn't learn at all! It may sound incredible at first, but it is true.
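To make the design choice tangible, here is a minimal sketch (not taken from the book) of the three-layer structure discussed above, written in Python with NumPy. The input and output sizes are dictated by the task itself; the only free decision left to the designer is n_hidden, the number of neurons in the hidden layer. The function names and sizes are invented purely for illustration.

```python
# A minimal sketch of the input-hidden-output structure. The task fixes the
# number of inputs and outputs; n_hidden is the designer's own choice.
import numpy as np

def make_network(n_inputs, n_hidden, n_outputs, seed=0):
    """Create random weight matrices for an input-hidden-output network."""
    rng = np.random.default_rng(seed)
    return {
        "hidden": rng.normal(scale=0.5, size=(n_hidden, n_inputs)),   # hidden layer weights
        "output": rng.normal(scale=0.5, size=(n_outputs, n_hidden)),  # output layer weights
    }

def forward(net, x):
    """Propagate an input vector x through the network."""
    hidden_signals = np.tanh(net["hidden"] @ x)      # features extracted by hidden neurons
    return np.tanh(net["output"] @ hidden_signals)   # final decision of the output layer

# The task here fixes 4 inputs and 1 output; how many hidden neurons to use is up to us.
net_small = make_network(n_inputs=4, n_hidden=2,  n_outputs=1)   # possibly "too scanty"
net_large = make_network(n_inputs=4, n_hidden=50, n_outputs=1)   # possibly "too intelligent"
print(forward(net_small, np.ones(4)), forward(net_large, np.ones(4)))
```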
A network with too many hidden layers or too many elements in its hidden layers tends to simplify the task and, as a result, it 'cuts corners' whenever possible. In order for you to understand this, I have to explain briefly how the network's learning process takes place (fig. 2.41).

Fig. 2.41. A very simplified schema of the neural network learning process

You will learn about the details of this process in one of the later chapters, but for now I must say that one teaches a network by providing it with input signals for which the correct solutions are known, because they are included in the learning data. For each given set of input data the network tries to offer its own output solution. Generally, the network's suggestions differ from the correct solutions provided in the teaching data, so after comparing the solution worked out by the network with the correct exemplar solution in the teaching data, it becomes clear how big a mistake the network has made. On the basis of this error evaluation, the network's teaching algorithm changes the weights of all its neurons so that in the future the same mistakes will not be repeated.

The above model of the learning process indicates that the network aims at making no mistakes when it is presented with the teaching data. Therefore a network which learns well seeks such a rule for processing input signals as would allow it to arrive at correct solutions. When a network discovers this rule, it can perform the tasks from the teaching data it has been provided with, as well as other similar tasks which it will be given during ordinary use. We say then that the network demonstrates an ability to learn and to generalize the learning results, and we treat it as a success.

Unfortunately, a network which is too intelligent, that is one which has a considerable memory in the form of a large number of hidden neurons (together with their adjustable weight sets), can easily avoid mistakes during the learning process by memorizing the whole collection of teaching data. It then achieves great success in learning within an astonishingly short time, because it knows and gives the correct answer to every question. However, with this method of 'photographing' the teaching data, a network which is learning from provided examples of correct solutions makes no attempt at generalizing the acquired information, and tries instead to achieve success by meticulously memorizing rules like "this input implies this output". Such incorrect operation of a network manifests itself in the fact that it quickly and thoroughly learns the whole of the so-called teaching sequence (i.e. the set of examples used to show the network how it should perform the tasks it is given), but it fails embarrassingly at the first test, that is at a task from a similar class but slightly different from the tasks presented in the process of learning. For instance, teaching a network to recognize letters brings immediate success (the network recognizes all the letters it is shown), but the network completely fails to recognize letters in a different handwriting or font (all outputs are zero), or recognizes them incorrectly. In such cases a closer analysis of the network's knowledge reveals that the network has memorized many simple rules like "if these two pixels are lit, and there are five zeros, letter A should be recognized". Of course, such crude rules do not stand the test of a new task, and the network falls short of our expectations.
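The teaching loop described above (show an example, let the network answer, measure the error against the known correct answer, and nudge the weights so the same mistake becomes less likely) can be sketched very compactly. The sketch below is only an illustration of the loop, not the book's algorithm: for brevity it trains a single linear neuron with the simple delta rule; the multilayer version follows the same scheme and is covered in a later chapter. The toy task (learning to sum three inputs) is invented for this example.

```python
# A minimal sketch of the teaching loop: answer, compare, correct the weights.
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=3)                 # adjustable connection parameters

# Teaching data: inputs together with the correct answers (here: their sum).
teaching_inputs = rng.normal(size=(20, 3))
correct_answers = teaching_inputs.sum(axis=1)

learning_rate = 0.05
for epoch in range(200):                     # examples are shown many times over
    total_error = 0.0
    for x, target in zip(teaching_inputs, correct_answers):
        answer = weights @ x                 # the network's own suggestion
        error = target - answer              # how big was the mistake?
        weights += learning_rate * error * x # change the weights accordingly
        total_error += error ** 2
    if epoch % 50 == 0:
        print(f"epoch {epoch:3d}, summed squared error = {total_error:.4f}")
```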
The described symptom of 'learning by heart' is not displayed by networks with a smaller hidden layer, because limited memory forces the network to do its best and, using the few elements of its hidden layer, to work out such rules of processing the input signal as can be applied correctly in more than one instance of a required answer. In such cases the learning process is usually considerably slower and more tedious (the examples the network needs in order to learn have to be presented more times, often a few hundred or a few thousand times), but the final effect is usually much better. After a correctly conducted learning process has been finished, and the network works well with the basic learning examples, one has the right to suppose that it will also cope with similar (but not identical) tasks presented during a test. This is not always the case, but it is often true, and it must be the basis for our expectations regarding the use of the network.

To sum up, it is important to remember the following rule: do not expect miracles. The presumption that an uncomplicated network with few hidden neurons will succeed in a complicated task is rather unrealistic. However, too many hidden layers or too many neurons also lead to a significant decline in the quality of the learning process. The optimal size of the hidden layer lies somewhere between these two extremes.

Fig. 2.42. Example of the relation between the number of hidden neurons and the errors made by the network

Figure 2.42 shows (on the basis of particular computer simulations) the relation between the errors made by networks and the number of their hidden neurons. It proves that there are many networks which operate with almost the same efficiency despite different numbers of hidden neurons, so it is not so difficult to hit such a broad target. Nevertheless, one must avoid the extremes (i.e. networks that are too big or too small). Especially harmful are additional (excessive) hidden layers, and it comes as no surprise that a network with fewer hidden layers often produces better results (because one can teach it more thoroughly) than a theoretically better network with more hidden layers, where the teaching process 'gets stuck' in the excess of details. Therefore one should use networks with one or (but only as an exception!) two hidden layers, and fight down the temptation to use networks with more hidden layers by fasting and cold baths.
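For readers who would like to see a relation of the kind shown in fig. 2.42 for themselves, here is a small experiment one can run. It is a sketch under stated assumptions, not the simulation from the book: it assumes scikit-learn is available, and the toy task (a noisy sine) and the list of hidden-layer sizes are invented purely for illustration.

```python
# Sweep the number of hidden neurons and compare errors on the teaching
# sequence with errors on new, slightly different examples (the "test").
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(40, 1))
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.1, size=40)   # teaching sequence
X_test  = rng.uniform(-3, 3, size=(200, 1))
y_test  = np.sin(X_test).ravel()                                     # the "first test"

for n_hidden in (1, 3, 10, 50, 200):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=5000, random_state=0)
    net.fit(X_train, y_train)
    err_train = mean_squared_error(y_train, net.predict(X_train))
    err_test  = mean_squared_error(y_test,  net.predict(X_test))
    # Too few hidden neurons: both errors tend to stay large (a "neural dummy").
    # Too many: the teaching error may look excellent while the test error grows.
    print(f"{n_hidden:4d} hidden neurons: train error {err_train:.4f}, test error {err_test:.4f}")
```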