2.13. How many neurons do you need to get a well-working network?
(Translated by Anastasiya Zharkova, nastia_zhar@wp.pl)
It follows from the above remarks that the broadest possibilities for future use are offered by
networks with at least a three-layer structure: an input layer that receives signals, a
hidden layer that extracts those characteristics of the input signals that are needed, and an output
layer that makes the final decisions and provides the solution. Within this structure some
elements are fixed: the number of input and output elements, as well as the principle of
connecting successive layers. However, there are certain variable elements which you have to
determine yourself: the number of hidden layers (one or more) and the number
of elements in the hidden layer (or layers) (fig. 2.40).
Fig. 2.40. The most important problem in neural network design concerns the number of
hidden neurons
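To make the structure described above concrete, here is a minimal sketch of a three-layer network. It is my own illustration, not something taken from the text: the NumPy library, the layer sizes and the sigmoid function are assumptions chosen only for the example. Signals enter the input layer, the hidden layer transforms them, and the output layer produces the final answer.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical layer sizes: 4 inputs, 6 hidden neurons, 2 outputs.
    n_inputs, n_hidden, n_outputs = 4, 6, 2

    # Weight matrices connecting successive layers (every element of one layer
    # is connected to every element of the next layer).
    W_hidden = rng.normal(size=(n_inputs, n_hidden))
    W_output = rng.normal(size=(n_hidden, n_outputs))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x):
        # The hidden layer extracts the needed characteristics of the input signal;
        # the output layer turns them into the final decision.
        hidden = sigmoid(x @ W_hidden)
        return sigmoid(hidden @ W_output)

    print(forward(rng.normal(size=n_inputs)))

The fixed elements (numbers of inputs and outputs, full connections between successive layers) appear here as the shapes of the weight matrices; the variable element is n_hidden, which is exactly the quantity this section is about.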
Despite many years of development of this technology, no precise theory of neural networks
has yet been formulated, so these elements are usually chosen arbitrarily or by
trial and error. It may happen that the network author's idea about how many
hidden neurons should be used and how they should be organized (e.g. as one hidden layer
or as several such layers) will not be quite correct. Nevertheless, this shouldn't have a crucial
impact on the network's operation, because during the learning process the network always has a
chance to correct possible errors in its structure by choosing appropriate connection
parameters. Still, we must specifically warn our readers here about two types of mistakes
which trap many researchers of neural networks (especially beginners).
The first mistake consists in designing a network with too few elements: when there is no
hidden layer or there are too few neurons, the learning process may fail miserably, since the
network will have no chance to reflect in its (too scanty) structure all the details and
nuances of the problem being solved. Later, I will give you specific examples
illustrating the fact that a network which is too small and too primitive cannot deal with
certain tasks even if one teaches it very thoroughly and for a very long time. Neural
networks are simply sometimes like people: not all of them are talented enough to solve a particular
problem. Luckily, there is always an easy way to check how intelligent a network is, because
you can see its structure and count its neurons, and the measure of a network's capability is
simply the number of its hidden neurons. With humans it is more difficult!
Unfortunately, despite the freedom to build bigger or smaller networks, it sometimes
happens that the intelligence of one's network is too low. This always results in a failure to
use such a network for a specific purpose, because such a 'neural dummy' with too few
hidden neurons will never succeed in the tasks it has been given, no matter how hard you
toil at trying to teach it something.
However, a network's intelligence shouldn't be 'overdone' either. The effect of a network's
excessive intelligence is not a greater capacity for dealing with the problems it has been set,
but an astonishing phenomenon: instead of diligently acquiring useful knowledge, the network
begins to fool its teacher and, consequently, it doesn't learn at all! It may sound incredible at
first, but it is true. A network with too many hidden layers or too many elements in its hidden
layers tends to simplify the task and, as a result, it 'cuts corners' whenever possible. In order
for you to understand this, I have to explain briefly how a network's learning process takes place
(fig. 2.41).
Fig. 2.41. A very simplified scheme of the neural network learning process
You will learn about the details of this process in a later chapter, but right now I
must say that one teaches a network by providing it with input signals for which the correct
solutions are known, because they are included in the learning data. For each given set of
input data the network tries to offer its own output solution. Generally, the network's
suggestions differ from the correct solutions provided in the teaching data, so after comparing the
solution worked out by the network with the correct exemplar solution in the teaching data,
it becomes clear how big the mistake made by the network was. On the basis of this error
evaluation, the network's teaching algorithm changes the weights of all its neurons so that in
the future the same mistakes will not be repeated.
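As a rough illustration of that loop, here is my own minimal sketch (a single sigmoid neuron and an arbitrary example, not anything prescribed in the text): present an input, compare the network's answer with the known correct answer, and shift the weights so that the error shrinks.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([0.5, -1.0, 0.3])   # one example from the teaching data
    target = 1.0                     # its known correct answer
    w = np.array([0.1, 0.2, -0.1])   # adjustable weights of a single neuron
    learning_rate = 0.5

    for step in range(200):
        y = sigmoid(w @ x)                            # the network's own suggestion
        error = target - y                            # how big was the mistake?
        w += learning_rate * error * y * (1 - y) * x  # correct the weights accordingly

    print("answer after learning:", sigmoid(w @ x))   # much closer to the target

A full network repeats the same cycle, only the error has to be propagated back through the hidden layers to reach every adjustable weight.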
The above model of the learning process indicates that the network aims at committing no mistakes
when it is presented with the teaching data. Therefore a network which learns well seeks a
rule for processing the input signals that allows it to arrive at the correct solutions. When a
network discovers this rule, it can perform the tasks from the teaching data it has been provided
with, as well as other similar tasks which it will be given during ordinary use. We then say
that the network demonstrates an ability to learn and to generalize the learning results, and we
treat this as a success.
Unfortunately, a network which is too intelligent, that is one which has a considerable
memory in the form of a large number of hidden neurons (together with their adjustable
weight sets), can easily avoid mistakes during the learning process by memorizing the whole
collection of teaching data. It then achieves great success in learning within an astonishingly
short time, because it knows and gives the correct answer to every question. However, with this
method of 'photographing' the teaching data, the network learning from the provided
examples of correct solutions makes no attempt to generalize the acquired information, and
tries instead to achieve success by meticulously memorizing rules like “this input implies this
output”.
Such incorrect operation of a network manifests itself in the fact that it quickly and
thoroughly learns the whole of the so-called teaching sequence (i.e. the set of examples used
to show the network how it should perform the tasks it is given), but it fails embarrassingly at
the first test, that is, at a task from a similar class but slightly different from the tasks presented
in the process of learning. For instance, teaching a network to recognize letters brings
immediate success (the network recognizes all the letters it is shown), but the network then fails
completely to recognize letters in a different handwriting or font (all outputs are zero), or
recognizes them incorrectly. In such cases a closer analysis of the network's knowledge reveals
that the network has memorized many simple rules like “if two pixels are lit here, and there
are five zeros there, the letter A should be recognized”. Of course, such crude rules do not stand the test
of a new task, and the network falls short of our expectations.
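The 'learning by heart' symptom is easy to reproduce and to detect. The sketch below is only an illustrative assumption of mine (the scikit-learn library, the digits dataset and the chosen sizes are not from the text): a network with a very generous hidden layer is taught on a handful of examples, which it recalls perfectly, while a separate set of examples it has never seen exposes how little has actually been generalized.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)

    # Keep only 30 teaching examples, so that memorizing them is easy.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=30, random_state=0)

    # A very large hidden layer: plenty of memory for 'photographing' the data.
    net = MLPClassifier(hidden_layer_sizes=(500,), max_iter=5000, random_state=0)
    net.fit(X_train, y_train)

    print("recall of the teaching set:", net.score(X_train, y_train))  # typically perfect
    print("success on unseen examples:", net.score(X_test, y_test))    # much lower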
The described symptom of 'learning by heart' is not displayed by networks with a smaller
hidden layer, because limited memory forces the network to do its best and, using the few
elements of its hidden layer, to work out rules for processing the input signal that hold for
more than one instance of the required answer. In such cases the learning
process is usually considerably slower and more tedious (the examples the network needs in
order to learn have to be presented many times – often a few hundred or a few thousand
times), though the final effect is usually much better. After a correctly conducted learning
process has been finished, and the network works well with the base learning examples, one has the
right to suppose that it will also cope with similar (but not identical) tasks presented during a
test. This is not always the case, but it is often true, and it must be the basis for our expectations
regarding the use of the network.
To sum up, it is important to remember the following rule: do not expect miracles, so the
presumption that an uncomplicated network with few hidden neurons will succeed at a
complicated task is rather unrealistic. However, too many hidden layers or too many neurons
also lead to a significant decline in the learning process. The optimal size of the hidden layer lies
somewhere between these two extremes.
Fig. 2.42. An example of the relation between the number of hidden neurons and the errors made by the network
Figure 2.42 shows (on the basis of particular computer simulations) the relation between the
errors made by networks and the number of their hidden neurons. It demonstrates that there are
many networks which operate with almost the same efficiency despite different numbers of
hidden neurons – so it is not that difficult to hit such a broad target. Nevertheless, one
must avoid the extremes (i.e. networks that are too big or too small). Especially harmful are
additional (excessive) hidden layers, and it comes as no surprise that a network with fewer
hidden layers often produces better results (because one can teach it more thoroughly) than a
theoretically better network with more hidden layers, where the teaching process 'gets stuck'
in an excess of details. Therefore one should use networks with one or (but only as an
exception!) two hidden layers, and fight down the temptation to use networks with more
hidden layers by fasting and cold baths.
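An experiment of the kind behind Fig. 2.42 is easy to set up yourself. The following sketch is again only my own illustrative assumption (scikit-learn, the digits dataset and the chosen sizes are not from the text): otherwise identical networks, differing only in the size of their single hidden layer, are taught on the same data, and their errors are compared on the teaching set and on examples withheld for testing.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Sweep over hidden-layer sizes, from clearly too small to clearly too big.
    for n_hidden in (1, 2, 5, 15, 50, 200):
        net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                            random_state=0)
        net.fit(X_train, y_train)
        print(f"{n_hidden:4d} hidden neurons:"
              f" teaching error {1 - net.score(X_train, y_train):.3f},"
              f" test error {1 - net.score(X_test, y_test):.3f}")

The printed test errors can then be compared with the shape of the curve sketched in Fig. 2.42, keeping in mind that the exact numbers depend on the data and on chance.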