Multi-layer Neural Networks

That is, networks with multiple layers of links. These must involve what are called "hidden" nodes - nothing to do with security, it just means they are not "visible" from the Input or Output sides - rather they are somewhere inside the network. Hidden nodes allow more complex classifications.

Consider the network: The 3 hidden nodes each draw a line and fire if the input point is on one side of the line. The output node could be a 3-dimensional AND gate - fire if all 3 hidden nodes fire.

3-dimensional AND gate perceptron

Consider the 3-d cube defined by the points: (0,0,0), (0,1,0), (0,0,1), (0,1,1), (1,0,0), (1,1,0), (1,0,1), (1,1,1). A 3-dimensional AND gate perceptron needs to separate, with a 2-d plane, the corner point (1,1,1) from the other points in the cube - this is possible. For example, the plane x + y + z = 2.5 (i.e. weights (1,1,1) and threshold 2.5) works, since the coordinates sum to 3 at (1,1,1) and to at most 2 at every other corner.

Exercise

Construct a triangular area from 3 intersecting lines in the 2-dimensional plane.

1. Define the lines exactly (i.e. express them in terms of y = ax + b).
2. Define the weights and thresholds of a network that will fire only for points within the triangular area:
   1. Define the weights and thresholds for the 3 hidden nodes.
   2. Define the weights and threshold for the output node.

Disjoint areas

To fire only when the point is in one of the 2 disjoint areas, we use the following net (just the Hidden and Output layers shown; weights shown on connections, thresholds circled on nodes):

Q. Do an alternative version using 4 perceptrons, 2 AND gates and a final OR gate.

3-layer network can classify any arbitrary shape in n dimensions

A 2-layer network can classify points inside any n arbitrary lines (n hidden units plus an AND function). i.e. Can classify:

1. any regular polygon
2. any convex polygon
3. any convex set, to any level of granularity required (just add more lines)

(A code sketch of this construction - a triangle built from 3 line perceptrons plus an AND output - is given below.)

To classify a concave polygon (e.g. a concave star-shaped polygon), compose it out of adjacent disjoint convex shapes and an OR function. A 3-layer network can do this.

A 3-layer network can classify any number of disjoint convex or concave shapes. Use 2-layer networks to classify each convex region to any level of granularity required (just add more lines, and more disjoint areas), plus an OR gate. Then, like the bed/table/chair network above, we can have a net that fires one output for one complex shape, and another output for another arbitrary complex shape. And we can do this with shapes in n dimensions, not just 2 or 3.
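To make the "n lines plus an AND gate" construction concrete, here is a minimal sketch in Python. The particular triangle (bounded by the lines y = 0, x = 0 and x + y = 1) and all of the weights and thresholds are choices made for this illustration only - they are not the values from the exercise or from the figures above.

```python
def perceptron(weights, threshold, inputs):
    """Classic threshold unit: fire (1) if the weighted sum exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

def inside_triangle(x, y):
    """2-layer network: 3 'line' perceptrons plus an AND output node."""
    # Hidden layer: each node fires if the point is on the inside of one line.
    h1 = perceptron((0, 1), 0, (x, y))       # fires if y > 0      (above the line y = 0)
    h2 = perceptron((1, 0), 0, (x, y))       # fires if x > 0      (right of the line x = 0)
    h3 = perceptron((-1, -1), -1, (x, y))    # fires if x + y < 1  (below the line x + y = 1)
    # Output layer: a 3-dimensional AND gate - fires only if all 3 hidden nodes fire.
    return perceptron((1, 1, 1), 2.5, (h1, h2, h3))

print(inside_triangle(0.2, 0.2))   # 1 - inside the triangle
print(inside_triangle(0.8, 0.8))   # 0 - outside (beyond the line x + y = 1)
print(inside_triangle(-0.1, 0.5))  # 0 - outside (left of the line x = 0)
```

Swapping the AND output for an OR node (e.g. weights (1,1) and threshold 0.5 over two such triangle detectors) gives the disjoint-areas construction above.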
Multi-layer network for XOR

Recall XOR. 2 connections in the first layer are not shown (weight = 0).

Exercise - Show that this implements XOR.

We have multiple divisions. Basically, we use the 1.5 node to divide (1,1) from the others. We use the 0.5 nodes to split off (0,0) from the others. And then we combine the outputs to split off (1,1) from (1,0) and (0,1).

Question - How did we design this?
Answer - We don't want to. Nets wouldn't be popular if you had to. We want to learn these weights. Also, the network is meant to represent an unknown f, not a known f.

Q. Do an alternative XOR using 2 perceptrons and an AND gate.

Timeline

1950s-60s - Learning rules for single-layer nets. Also interesting to note that HLLs were primitive or non-existent. Computer models often focused on raw hardware/brain/network models.

1969 - Perceptrons, Minsky and Papert. Limitations of single-layer nets become apparent. People (in fact, Minsky and Papert themselves!) were aware that multi-layer nets could perform more complex classifications. That was not the problem. The problem was the absence of a learning rule for multi-layer nets. Meanwhile, HLLs had been invented, and people were excited about them, seeing them as possibly the language of the mind. Connectionism went into decline. Explicit, symbolic approaches became dominant in AI. Logic. HLLs.

1974 - Back-propagation (a learning rule for multi-layer nets) invented by Werbos, but not noticed.

1986 - Back-prop popularised. Connectionism takes off again. Also, HLLs no longer so exciting - seen as part of Computer Engineering, with little to do with the brain. Increased interest in numeric and statistical models. Computer Science may still turn out to be the language for describing what we are (*), just not necessarily HLL-like Computer Science.

(*) See this gung-ho talk by Minsky (and local copy) at ALife V, 1996 - "Computer Science is not about computers. It's the first time in 5000 years that we've begun to have ways to describe the kinds of machinery that we are."

The sigmoid function

σ(x) = 1 / (1 + e^-x)

Multi-layer Networks - Notation we will use

Note: Not a great drawing. The network can actually have multiple output nodes. Each node computes y = σ( Σi wi xi - t ), where σ is the sigmoid function above.

The input can be a vector. There may be any number of hidden nodes. The output can be a vector too.

Typically fully-connected. But remember that if a weight becomes zero, then that connection may as well not exist. The learning algorithm may learn to set one of the connection weights to zero. i.e. We start fully-connected, and the learning algorithm learns to drop some connections. To be precise, by making some of its input weights wij zero or near-zero, a hidden node decides to specialise only on certain inputs. The hidden node is then said to "represent" this set of inputs.

Supervised Learning - How it works

1. Send in an input x.
2. Run it through the network to generate an output y.
3. Tell the machine what the "right" answer for x actually is (we have a large number of these known Input-Output exemplars).
4. Comparing the right answer with y gives an error quantity.
5. Use this error to modify the weights and thresholds so that next time x is sent in it will produce an answer nearer to the correct one.
6. The trick is that at the same time as adjusting the network to give a better answer for x, we are adjusting the weights and thresholds to give better answers for other inputs. These adjustments may interfere with each other! But of course, interference is what we want! If different inputs didn't interfere with each other, it wouldn't be a generalisation (able to make predictions for inputs never seen before). It would be a lookup table (unable to give an answer for inputs never seen before).
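To make steps 1-5 concrete, here is a minimal sketch of a single update step for one sigmoid unit, using gradient descent on a squared error. The single-unit setting, the squared-error measure and the learning rate of 0.5 are assumptions made for this illustration; back-propagation applies the same idea through the hidden layers of a multi-layer network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_step(weights, threshold, x, target, learning_rate=0.5):
    """One supervised-learning step for a single sigmoid unit (steps 1-5 above)."""
    # 2. Run the input through the unit to generate an output y.
    activation = sum(w * xi for w, xi in zip(weights, x)) - threshold
    y = sigmoid(activation)
    # 3.-4. Compare y with the known right answer to get an error quantity.
    error = target - y
    # 5. Nudge each weight (and the threshold) downhill on the squared error,
    #    scaled by the slope of the sigmoid at the current activation.
    delta = error * y * (1 - y)
    new_weights = [w + learning_rate * delta * xi for w, xi in zip(weights, x)]
    new_threshold = threshold - learning_rate * delta  # threshold acts like a weight on a constant -1 input
    return new_weights, new_threshold, y

# Repeatedly presenting the same exemplar drives the output towards the target.
w, t = [0.2, -0.4], 0.1
for _ in range(1000):
    w, t, y = update_step(w, t, x=(1.0, 0.0), target=1.0)
print(round(y, 3))   # output has moved close to the target 1.0
```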
Memory, Interference and Forgetting

We need interference so we can generate a "good guess" for unseen data. But it does seem strange that, having been told the correct answer for x, we do not simply return this answer exactly any time we see x in the future. Why "forget" anything that we once knew? Surely forgetting things is simply a disadvantage.

We could have an extra lookup table on the side, to store the results of every input for which we knew the exact output, and only consult the network for new, unseen input. However, this may be of limited use, since inputs may never actually be seen twice. e.g. Input of continuous real numbers in robot senses: Angle = 250.432 degrees, Angle = 250.441 degrees, etc. Consider when there are n dimensions - we need every dimension to be the same, to 3 decimal places. If the exact same input is never seen twice, our lookup table grows forever (it is not a finite-size data structure) and is never used. Even if it is (rarely) used, consider the computational cost of searching it.

We could ensure that inputs are seen multiple times by making the input space more coarse-grained, e.g. Angle is one of N, S, E or W. But this of course pre-judges how the input space should be broken up, which is exactly the job the neural net is trying to solve! In fact, even the decision to cut to 3 decimal places (a decision probably made by the robot sensor manufacturer) is already an a priori classification. We can't actually have real numbers in real-world engineering. Even in software alone, floating-point numbers have a finite number of decimal places.

Another problem with lookup tables for inputs seen before: exemplars may contradict each other, especially if they come from the real world. Say we see over time two different Input-Output exemplar pairs (x,y) and (x,z). Presented with x, do we return y or z?

Learning to divide up the input space

In the above, if the feedback the neural net gets is the same in the area 240-260 degrees, then it will develop weights and thresholds so that any continuous value in this zone generates roughly the same output. On the other hand, if it receives different feedback inside the zone around 245-255 degrees than outside that zone, then it will develop weights that lead to a (perhaps steep) threshold being crossed at 245, with one type of output generated, and another threshold being crossed at 255, with another type of output generated.

The network can learn to classify any area of the multi-dimensional input space in this way. This is especially useful: (a) where we do not know how to sub-divide the input space in advance; and (b) especially where the input space is multi-dimensional. Humans are good at dividing up 1-dimensional space, but terrible at visualising divisions in n-dimensional space. Each zone can generate completely different outputs.

Learning the design

We asked: How did we design the XOR network? In fact, we don't have to design it. We can repeatedly present the network with exemplars:

Input 1:   0   1   0   1
Input 2:   0   0   1   1
Output:    0   1   1   0

and it will learn those weights and thresholds! (Or at least, some set of weights and thresholds that implement XOR.)
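As a sketch of how this learning could go, here is a small back-propagation loop in Python that trains a 2-2-1 sigmoid network on the four exemplars above. The architecture, learning rate, epoch count and random seed are all assumptions made for this illustration, and the hard thresholds are replaced by sigmoid units (with the threshold folded in as a bias) so that error gradients exist. With an unlucky random start the net can settle in a local minimum; re-running with a different seed usually finds a set of weights that implements XOR.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The XOR exemplars from the table above.
exemplars = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 0)]

random.seed(0)
# 2 inputs -> 2 hidden nodes -> 1 output node; small random initial weights.
w_hid = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(2)]
b_hid = [random.uniform(-1, 1) for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(2)]
b_out = random.uniform(-1, 1)

lr = 0.5
for epoch in range(20000):
    for (x1, x2), target in exemplars:
        # Forward pass.
        h = [sigmoid(w_hid[j][0] * x1 + w_hid[j][1] * x2 + b_hid[j]) for j in range(2)]
        y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + b_out)
        # Backward pass: propagate the error back through the layers.
        d_out = (target - y) * y * (1 - y)
        d_hid = [d_out * w_out[j] * h[j] * (1 - h[j]) for j in range(2)]
        # Weight and bias updates.
        for j in range(2):
            w_out[j] += lr * d_out * h[j]
            w_hid[j][0] += lr * d_hid[j] * x1
            w_hid[j][1] += lr * d_hid[j] * x2
            b_hid[j] += lr * d_hid[j]
        b_out += lr * d_out

# After training, the network's outputs should approximate the XOR table.
for (x1, x2), target in exemplars:
    h = [sigmoid(w_hid[j][0] * x1 + w_hid[j][1] * x2 + b_hid[j]) for j in range(2)]
    y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + b_out)
    print((x1, x2), "target", target, "output", round(y, 3))
```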