Multi-layer Neural Networks

That is, multiple layers of links. Must involve what are called "hidden" nodes - nothing to do
with security, this just means they are not "visible" from the Input or Output sides - rather they
are inside the network somewhere.
These allow more complex classifications. Consider the network:
The 3 hidden nodes each draw a line and fire if the input point is on one side of the line.
The output node could be a 3-dimensional AND gate - fire if all 3 hidden nodes fire.
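For example, here is a rough Python sketch of such a 2-layer net. The three lines and all weights are invented for illustration (they are not the ones in the figure); the hidden nodes test which side of each line the point is on, and the output node ANDs them.

# Sketch of a 2-layer net: 3 hidden perceptrons each test one line,
# the output node is a 3-input AND gate.
# (The three lines here are arbitrary examples, not the ones from the figure.)

def step(weighted_sum, threshold):
    # Hard-threshold perceptron: fire (1) if the weighted sum exceeds the threshold.
    return 1 if weighted_sum > threshold else 0

def hidden_layer(x, y):
    # Each hidden node fires if (x, y) is on one side of its line.
    h1 = step(1 * x + 0 * y, 0)     # fires if x > 0
    h2 = step(0 * x + 1 * y, 0)     # fires if y > 0
    h3 = step(-1 * x - 1 * y, -4)   # fires if x + y < 4
    return h1, h2, h3

def output_node(h1, h2, h3):
    # 3-dimensional AND gate: weights 1, 1, 1 and threshold 2.5.
    return step(h1 + h2 + h3, 2.5)

def classify(x, y):
    return output_node(*hidden_layer(x, y))

print(classify(1, 1))    # inside the triangular region  -> 1
print(classify(5, 5))    # outside (x + y >= 4)          -> 0
print(classify(-1, 2))   # outside (x <= 0)              -> 0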
3-dimensional AND gate perceptron
Consider the 3-d cube defined by the points: (0,0,0), (0,1,0), (0,0,1), (0,1,1), (1,0,0), (1,1,0),
(1,0,1), (1,1,1)
A 3-dimensional AND gate perceptron needs to separate, with a 2-d plane, the corner point (1,1,1)
from the other 7 corners of the cube - this is possible.
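For instance, the plane x1 + x2 + x3 = 2.5 does the job: a perceptron with weights 1, 1, 1 and threshold 2.5 fires only for (1,1,1). A quick check in Python (this plane is just one possible choice):

# Weights (1, 1, 1) with threshold 2.5 fire only for the corner (1, 1, 1).
from itertools import product

for corner in product([0, 1], repeat=3):
    fires = sum(corner) > 2.5
    print(corner, "->", int(fires))
# Only (1, 1, 1) sums to 3 > 2.5; every other corner sums to at most 2.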
Exercise
Construct a triangular area from 3 intersecting lines in the 2-dimensional plane.
1. Define the lines exactly (i.e. express them in terms of y = ax + b).
2. Define the weights and thresholds of a network that will fire only for points within the
triangular area:
1. Define the weights and thresholds for the 3 hidden nodes.
2. Define the weights and threshold for the output node.
Disjoint areas
To only fire when the point is in one of the 2 disjoint areas:
we use the following net (Just Hidden and Output layers shown. Weights shown on connections.
Thresholds circled on nodes.):
Q. Do an alternative version using 4 perceptrons, 2 AND gates and a final OR gate.
3-layer network can classify any arbitrary shape in n dimensions
A 2-layer network can classify points inside a region bounded by any n arbitrary lines (n hidden
units plus an AND function).
i.e. Can classify:
1. any regular polygon
2. any convex polygon
3. any convex set to any level of granularity required (just add more lines)
To classify a concave polygon (e.g. a concave star-shaped polygon), compose it out of adjacent
disjoint convex shapes and an OR function. 3-layer network can do this.
3-layer network can classify any number of disjoint convex or concave shapes. Use 2-layer
networks to classify each convex region to any level of granularity required (just add more lines,
and more disjoint areas), and an OR gate.
Then, like the bed/table/chair network above, we can have a net that fires one output for one
complex shape, another output for another arbitrary complex shape.
And we can do this with shapes in n dimensions, not just 2 or 3.
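As a rough Python sketch of this construction (shown in 2 dimensions for readability; the two square regions and all weights are hand-picked stand-ins for arbitrary convex shapes):

# 3-layer net for two disjoint convex regions:
# layer 1: line-testing perceptrons, layer 2: one AND per convex region,
# layer 3: an OR over the regions.

def step(s, t):
    return 1 if s > t else 0

def and_gate(bits):
    return step(sum(bits), len(bits) - 0.5)

def or_gate(bits):
    return step(sum(bits), 0.5)

def in_region_a(x, y):
    # Square 0 < x < 1, 0 < y < 1: four line tests ANDed together.
    lines = [step(x, 0), step(-x, -1), step(y, 0), step(-y, -1)]
    return and_gate(lines)

def in_region_b(x, y):
    # Square 3 < x < 4, 3 < y < 4.
    lines = [step(x, 3), step(-x, -4), step(y, 3), step(-y, -4)]
    return and_gate(lines)

def classify(x, y):
    return or_gate([in_region_a(x, y), in_region_b(x, y)])

print(classify(0.5, 0.5))   # inside region A -> 1
print(classify(3.5, 3.2))   # inside region B -> 1
print(classify(2.0, 2.0))   # in neither      -> 0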
Multi-layer network for XOR
Recall XOR.
2 connections in first layer not shown (weight = 0).
Exercise - Show that this implements XOR
We have multiple divisions. Basically, we use the 1.5 node to divide (1,1) from the others. We
use the 0.5 nodes to split off (0,0) from the others. And then we combine the outputs so that the
net fires for (1,0) and (0,1), but not for (0,0) or (1,1).
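Here is one set of weights and thresholds consistent with that description (the exact weights in the figure may differ). Note the two first-layer connections with weight 0, as mentioned above.

def step(s, t):
    return 1 if s > t else 0

def xor_net(x1, x2):
    h_and   = step(1 * x1 + 1 * x2, 1.5)    # the "1.5 node": fires only for (1,1)
    h_left  = step(1 * x1 + 0 * x2, 0.5)    # a "0.5 node": fires if x1 = 1 (x2 weight is 0)
    h_right = step(0 * x1 + 1 * x2, 0.5)    # a "0.5 node": fires if x2 = 1 (x1 weight is 0)
    # Output: fire if either input node fires, unless the "1.5 node" vetoes it.
    return step(1 * h_left + 1 * h_right - 2 * h_and, 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
# 0 0 -> 0,  0 1 -> 1,  1 0 -> 1,  1 1 -> 0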
Question - How did we design this?
Answer - We don't want to. Nets wouldn't be popular if you had to.
We want to learn these weights.
Also, the network is meant to represent an unknown f, not a known f.
Q. Do an alternative XOR using 2 perceptrons and an AND gate.
Timeline
1950s-60s
Learning rules for single-layer nets.
Also interesting to note that HLLs (high-level languages) were primitive or non-existent. Computer models often
focused on raw hardware/brain/network models.
1969
Perceptrons, Minsky and Papert. Limitations of single-layer nets apparent.
People (in fact, Minsky and Papert themselves!) were aware that multi-layer nets could
perform more complex classifications. That was not the problem. The problem was the
absence of a learning rule for multi-layer nets.
Meanwhile, HLLs had been invented, and people were excited about them, seeing them
as possibly the language of the mind. Connectionism went into decline. Explicit,
symbolic approaches dominant in AI. Logic. HLLs.
1974
Back-propagation (learning rule for multi-layer nets) invented by Werbos, but not noticed.
1986
Back-prop popularised. Connectionism takes off again.
Also HLLs no longer so exciting. Seen as part of Computer Engineering, little to do with
brain. Increased interest in numeric and statistical models.
Computer Science may still turn out to be the language for describing what we are (*),
just not necessarily HLL-like Computer Science.
(*) See this gung-ho talk by Minsky (and local copy) at ALife V, 1996 - "Computer Science is
not about computers. It's the first time in 5000 years that we've begun to have ways to describe
the kinds of machinery that we are."
Multi-layer Networks - Notation we will use
Note: Not a great drawing. Can actually have multiple output nodes.
Each node computes y = σ( Σi wi xi - t ),
where σ(x) = 1 / (1 + e^-x) is the sigmoid function.
Input can be a vector.
There may be any number of hidden nodes.
Output can be a vector too.
Typically fully-connected. But remember that if a weight becomes zero, then that connection
may as well not exist. Learning algorithm may learn to set one of the connection weights to zero.
i.e. We start fully-connected, and learning algorithm learns to drop some connections.
To be precise, by making some of its input weights wij zero or near-zero, the hidden node decides
to specialise only on certain inputs. The hidden node is then said to "represent" this set of
inputs.
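A minimal sketch of this notation in Python (the layer sizes and all weights below are invented, purely to show the computation):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, threshold):
    # y = sigmoid( sum_i wi xi - t )
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) - threshold)

def forward(x, hidden_params, output_params):
    # hidden_params / output_params: lists of (weights, threshold), one per node.
    hidden = [node_output(x, w, t) for w, t in hidden_params]
    return [node_output(hidden, w, t) for w, t in output_params]

# Tiny fully-connected example: 2 inputs, 2 hidden nodes, 1 output node.
hidden_params = [([0.5, -0.3], 0.1), ([1.2, 0.7], -0.4)]
output_params = [([1.0, -1.5], 0.2)]
print(forward([0.9, 0.2], hidden_params, output_params))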
Supervised Learning - How it works
1. Send in an input x.
2. Run it through the network to generate an output y.
3. Tell the machine what the "right" answer for x actually is (we have a large number of
these known Input-Output exemplars).
4. Comparing the right answer with y gives an error quantity.
5. Use this error to modify the weights and thresholds so that next time x is sent in it will
produce an answer nearer to the correct one.
6. The trick is that at the same time as adjusting the network to make it give a better answer
for x, we are adjusting the weights and thresholds to make it give better answers for other
inputs. These adjustments may interfere with each other!
But of course, interference is what we want! If different inputs didn't interfere with each other, it
wouldn't be a generalisation (able to make predictions for inputs never seen before). It would be
a lookup table (unable to give an answer for inputs never seen before).
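A sketch of this loop for a single sigmoid node, using the simple gradient ("delta") rule for step 5; the exemplars and all numbers are illustrative. (The corresponding update rule for a full multi-layer net is back-propagation, which appears in the timeline above.)

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(exemplars, n_inputs, epochs=5000, rate=0.5):
    weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    threshold = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in exemplars:                        # 1. send in an input x
            s = sum(w * xi for w, xi in zip(weights, x)) - threshold
            y = sigmoid(s)                                 # 2. run it through to get output y
            error = target - y                             # 3-4. compare with the right answer
            grad = error * y * (1 - y)                     # slope of the sigmoid at this point
            weights = [w + rate * grad * xi for w, xi in zip(weights, x)]
            threshold -= rate * grad                       # 5. nudge weights and threshold
    return weights, threshold

# Learn a simple linearly-separable function (AND) from its exemplars.
and_exemplars = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, t = train(and_exemplars, 2)
for x, target in and_exemplars:
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) - t)
    print(x, round(y, 2), "target", target)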
Memory, Interference and Forgetting
We need interference so we can generate a "good guess" for unseen data.
But it does seem strange that, having been told the correct answer for x, we do not simply return
this answer exactly anytime we see x in the future. Why "forget" anything that we once knew?
Surely forgetting things is simply a disadvantage.
We could have an extra lookup table on the side, to store the results of every input about
which we knew the exact output, and only consult the network for new, unseen input.
However, this may be of limited use since inputs may never actually be seen twice. e.g. Input of
continuous real numbers in robot senses. Angle = 250.432 degrees, Angle = 250.441 degrees,
etc. Consider this in n dimensions: every dimension would need to match, to 3 decimal places.
If exact same input never seen twice, our lookup-table grows forever (not finite-size data
structure) and is never used. Even if it is (rarely) used, consider computational cost of searching
it.

We could ensure that inputs are seen multiple times by making the input space more
coarse-grained, e.g. Angle is one of N, S, E or W.
But this of course pre-judges how the input space should be broken up, which is exactly
the job the neural net is trying to solve!


In fact, even the decision to cut to 3 decimal places (a decision probably made by the
robot sensor manufacturer) is already an a priori classification.
We can't actually have real numbers in real-world engineering. Even in software only,
floating point numbers have a finite no. of decimal places.
Another problem with lookup tables for inputs seen before - exemplars may contradict each
other, especially if they come from the real world. Suppose over time we see two different
Input-Output exemplar pairs (x,y) and (x,z). When presented with x, do we return y or z?
Learning to divide up the input space
In the above, if the feedback the neural net gets is the same in the area 240 - 260 degrees, then it
will develop weights and thresholds so that any continuous value in this zone generates roughly
the same output.
On the other hand, if it receives different feedback in the zone around 245 - 255 degrees than
outside that zone, then it will develop weights that lead to a (perhaps steep) threshold being
crossed at 245, and one type of output generated, and another threshold being crossed at 255, and
another type of output generated.
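A hand-built illustration of this second case (the weights are chosen by hand here, purely to show the kind of solution learning could arrive at): two steep sigmoid hidden nodes put thresholds near 245 and 255, and the output node fires only between them.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output(angle):
    above_245 = sigmoid(5.0 * (angle - 245))   # ~0 below 245 degrees, ~1 above
    above_255 = sigmoid(5.0 * (angle - 255))   # ~0 below 255 degrees, ~1 above
    # Output node: fire when we are above 245 but not yet above 255.
    return sigmoid(10.0 * above_245 - 10.0 * above_255 - 5.0)

for angle in (240, 247, 250.432, 254, 258):
    print(angle, round(output(angle), 3))
# High output only for angles in roughly the 245 - 255 zone.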
The network can learn to classify any area of the multi-dimensional input space in this way. This
is especially useful for:
(a) Where we do not know how to sub-divide the input space in advance.
(b) Especially where the input space is multi-dimensional. Humans are good at dividing
up 1-dimensional space, but terrible at visualising divisions in n-dimensional space.
Each zone can generate completely different outputs.
Learning the design
We asked: How did we design the XOR network? In fact, we don't have to design it. We can
repeatedly present the network with exemplars:
Input 1:   0   1   0   1
Input 2:   0   0   1   1
Output:    0   1   1   0
and it will learn those weights and thresholds! (or at least, some set of weights and thresholds
that implement XOR)
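A minimal demonstration in Python: a 2-2-1 sigmoid net whose 9 weights and thresholds are learned from the four exemplars above, here by crude numerical gradient descent rather than back-propagation, just to show that the design can be learned. (With an unlucky random start it can settle on a poor solution; change the seed and re-run if so.)

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 2-2-1 net, its 9 parameters kept in one flat list:
# [w_h1_x1, w_h1_x2, t_h1, w_h2_x1, w_h2_x2, t_h2, w_o_h1, w_o_h2, t_o]
def forward(p, x1, x2):
    h1 = sigmoid(p[0] * x1 + p[1] * x2 - p[2])
    h2 = sigmoid(p[3] * x1 + p[4] * x2 - p[5])
    return sigmoid(p[6] * h1 + p[7] * h2 - p[8])

exemplars = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]

def total_error(p):
    return sum((target - forward(p, *x)) ** 2 for x, target in exemplars)

random.seed(1)
params = [random.uniform(-1, 1) for _ in range(9)]
rate, eps = 0.5, 1e-4
for _ in range(10000):
    base = total_error(params)
    grads = []
    for i in range(9):
        bumped = params[:]
        bumped[i] += eps
        grads.append((total_error(bumped) - base) / eps)
    params = [p - rate * g for p, g in zip(params, grads)]

for x, target in exemplars:
    print(x, round(forward(params, *x), 2), "target", target)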