http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/pca.html

CS-449: Neural Networks
Fall 99
Instructor: Genevieve Orr
Willamette University
Lecture Notes prepared by Genevieve Orr, Nici Schraudolph, and Fred Cummins

Course content

Summary

Our goal is to introduce students to a powerful class of model, the Neural Network. In fact, this is a broad term which includes many diverse models and approaches. We will first motivate networks by analogy to the brain. The analogy is loose, but serves to introduce the idea of parallel and distributed computation. We then introduce one kind of network in detail: the feedforward network trained by backpropagation of error. We discuss model architectures, training methods and data representation issues. We hope to cover everything you need to know to get backpropagation working for you. A range of applications and extensions to the basic model will be presented in the final section of the module.

Lecture 1: Introduction
Questions; Motivation and Applications; Computation in the brain; Artificial neuron models; Linear regression; Linear neural networks; Multi-layer networks; Error Backpropagation

Lecture 2: Classification
Introduction; Perceptron Learning; Delta Learning; Doing it Right

Lecture 3: Optimizing Linear Networks
Weights and Learning Rates; Summary

Lecture 4: The Backprop Toolbox
2-Layer Networks and Backprop; Noise and Overtraining; Momentum; Delta-Bar-Delta; Many layer Networks and Backprop; Backprop: an example; Overfitting and regularization; Growing and pruning networks; Preconditioning the network; Momentum; Delta-Bar-Delta

Lecture 5: Unsupervised Learning
Introduction; Linear Compression (PCA); NonLinear Compression; Competitive Learning; Kohonen Self-Organizing Nets

Lecture 6: Reinforcement Learning
Introduction; Components of RL; Terminology and Bellman's Equation

Lecture 7: Advanced Topics
Learning rate adaptation; Classification; Non-supervised learning; Time-Delay Neural Networks; Recurrent neural networks; Real-Time Recurrent Learning; Dynamics of RNNs; Long Short-Term Memory

Review for Midterm: Linear Nets; Non-linear Nets

Links

Tutorials:
o The Nervous System - a very nice introduction, many pictures
o Neural Java - a neural network tutorial with Java applets
o Web Sim - a Java neural network simulator
o A book chapter describing the Backpropagation Algorithm (Postscript)
o A short set of pages showing how a simple backprop net learns to recognize the digits 0-9, with C code
o Reinforcement Learning - A Tutorial

Simulators and code:
o Web Sim: Java neural network simulator
o Brainwave: a Java based simulator
o tlearn: Windows, Macintosh and Unix implementation of backprop and variants. Written in C.
o PDP++: C++ software with every conceivable bell and whistle. Unix only. The manual also makes a good tutorial.

Data Sources:
o UCI machine learning database
o ai-faq/neural-nets data source list
o Handwritten Digits

Related stuff of interest:
o A page of neural network links
o Tesauro's backgammon network
o Lego Lab at University of Aarhus

Questions

1. What tasks are machines good at doing that humans are not?
2. What tasks are humans good at doing that machines are not?
3. What tasks are both good at?
4. What does it mean to learn?
5. How is learning related to intelligence?
6. What does it mean to be intelligent? Do you believe a machine will ever be built that exhibits intelligence?
7. Have the above definitions changed over time?
8. If a computer were intelligent, how would you know?
9. What does it mean to be conscious?
10. Can one be intelligent and not conscious or vice versa?

Neural networks were started about 50 years ago. Their early abilities were exaggerated, casting doubts on the field as a whole. There is a recent renewed interest in the field, however, because of new techniques and a better theoretical understanding of their capabilities.
Motivation for neural networks:
o Scientists are challenged to use machines more effectively for tasks currently solved by humans.
o Symbolic rules don't reflect processes actually used by humans.
o Traditional computing excels in many areas, but not in others.

Types of Applications

Machine learning: Having a computer program itself from a set of examples so you don't have to program it yourself. This will be a strong focus of this course: neural networks that learn from a set of examples.
o Optimization: given a set of constraints and a cost function, how do you find an optimal solution? E.g. the traveling salesman problem.
o Classification: grouping patterns into classes, e.g. handwritten characters into letters.
o Associative memory: recalling a memory based on a partial match.
o Regression: function mapping.

Cognitive science:
Modelling higher level reasoning:
o language
o problem solving
Modelling lower level reasoning:
o vision
o audition
o speech recognition
o speech generation

Neurobiology: Modelling how the brain works.
o neuron-level
o higher levels: vision, hearing, etc. Overlaps with cognitive folks.

Mathematics: Nonparametric statistical analysis and regression.

Philosophy: Can human souls/behavior be explained in terms of symbols, or does it require something lower level, like a neurally based model?

Where are neural networks being used?

o Signal processing: suppressing line noise, adaptive echo canceling, blind source separation.
o Control: e.g. backing up a truck: cab position, rear position, and match with the dock get converted to steering instructions. Manufacturing plants use neural networks for controlling automated machines. Siemens successfully uses neural networks for process automation in basic industries; e.g., in rolling mill control more than 100 neural networks do their job, 24 hours a day.
o Robotics: navigation, vision recognition.
o Pattern recognition, e.g. recognizing handwritten characters: the current version of Apple's Newton uses a neural net.
o Medicine, e.g. storing medical records based on case information.
o Speech production: reading text aloud (NETtalk).
o Speech recognition.
o Vision: face recognition, edge detection, visual search engines.
o Business: e.g. rules for mortgage decisions are extracted from past decisions made by experienced evaluators, resulting in a network that has a high level of agreement with human experts.
o Financial applications: time series analysis, stock market prediction.
o Data compression: speech signal, image, e.g. faces.
o Game playing: backgammon, chess, go, ...

Computation in the brain

The brain - that's my second most favourite organ! - Woody Allen

The Brain as an Information Processing System

The human brain contains about 10 billion nerve cells, or neurons. On average, each neuron is connected to other neurons through about 10 000 synapses. (The actual figures vary greatly, depending on the local neuroanatomy.) The brain's network of neurons forms a massively parallel information processing system. This contrasts with conventional computers, in which a single processor executes a single series of instructions.

Against this, consider the time taken for each elementary operation: neurons typically operate at a maximum rate of about 100 Hz, while a conventional CPU carries out several hundred million machine level operations per second. Despite being built with very slow hardware, the brain has quite remarkable capabilities:
o Its performance tends to degrade gracefully under partial damage. In contrast, most programs and engineered systems are brittle: if you remove some arbitrary parts, very likely the whole will cease to function.
o It can learn (reorganize itself) from experience. This means that partial recovery from damage is possible if healthy units can learn to take over the functions previously carried out by the damaged areas.
o It performs massively parallel computations extremely efficiently. For example, complex visual perception occurs within less than 100 ms, that is, 10 processing steps at 100 Hz!
o It supports our intelligence and self-awareness. (Nobody knows yet how this occurs.)

            processing         element    energy       processing   style of
            elements           size       use          speed        computation
brain       10^14 synapses     10^-6 m    30 W         100 Hz       parallel, distributed
computer    10^8 transistors   10^-6 m    30 W (CPU)   10^9 Hz      serial, centralized

As a discipline of Artificial Intelligence, Neural Networks attempt to bring computers a little closer to the brain's capabilities by imitating certain aspects of information processing in the brain, in a highly simplified way.

Neural Networks in the Brain

The brain is not homogeneous. At the largest anatomical scale, we distinguish cortex, midbrain, brainstem, and cerebellum. Each of these can be hierarchically subdivided into many regions, and areas within each region, either according to the anatomical structure of the neural networks within it, or according to the function performed by them.

The overall pattern of projections (bundles of neural connections) between areas is extremely complex, and only partially known. The best mapped (and largest) system in the human brain is the visual system, where the first 10 or 11 processing stages have been identified. We distinguish feedforward projections that go from earlier processing stages (near the sensory input) to later ones (near the motor output), from feedback connections that go in the opposite direction.

In addition to these long-range connections, neurons also link up with many thousands of their neighbours. In this way they form very dense, complex local networks.

Neurons and Synapses

The basic computational unit in the nervous system is the nerve cell, or neuron. A neuron has:
o Dendrites (inputs)
o Cell body
o Axon (output)

A neuron receives input from other neurons (typically many thousands). Inputs sum (approximately).
Once input exceeds a critical level, the neuron discharges a spike - an electrical pulse that travels from the body, down the axon, to the next neuron(s) (or other receptors). This spiking event is also called depolarization, and is followed by a refractory period, during which the neuron is unable to fire.

The axon endings (Output Zone) almost touch the dendrites or cell body of the next neuron. Transmission of an electrical signal from one neuron to the next is effected by neurotransmitters, chemicals which are released from the first neuron and which bind to receptors in the second. This link is called a synapse. The extent to which the signal from one neuron is passed on to the next depends on many factors, e.g. the amount of neurotransmitter available, the number and arrangement of receptors, the amount of neurotransmitter reabsorbed, etc.

Synaptic Learning

Brains learn. Of course. From what we know of neuronal structures, one way brains learn is by altering the strengths of connections between neurons, and by adding or deleting connections between neurons. Furthermore, they learn "on-line", based on experience, and typically without the benefit of a benevolent teacher.

The efficacy of a synapse can change as a result of experience, providing both memory and learning through long-term potentiation. One way this happens is through release of more neurotransmitter. Many other changes may also be involved.

Long-term Potentiation: An enduring (>1 hour) increase in synaptic efficacy that results from high-frequency stimulation of an afferent (input) pathway.

Hebb's Postulate: "When an axon of cell A... excite[s] cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency as one of the cells firing B is increased."
Bliss and Lomo discovered LTP in the hippocampus in 1973.

Points to note about LTP:
o Synapses become more or less important over time (plasticity)
o LTP is based on experience
o LTP is based only on local information (Hebb's postulate)

Summary

The following properties of nervous systems will be of particular interest in our neurally-inspired models:
o parallel, distributed information processing
o high degree of connectivity among basic units
o connections are modifiable based on experience
o learning is a constant process, and usually unsupervised
o learning is based only on local information
o performance degrades gracefully if some units are removed
o etc.

Further surfing: The Nervous System - a great introduction, many pictures

Artificial Neuron Models

Computational neurobiologists have constructed very elaborate computer models of neurons in order to run detailed simulations of particular circuits in the brain. As Computer Scientists, we are more interested in the general properties of neural networks, independent of how they are actually "implemented" in the brain. This means that we can use much simpler, abstract "neurons", which (hopefully) capture the essence of neural computation even if they leave out many of the details of how biological neurons work.

People have implemented model neurons in hardware as electronic circuits, often integrated on VLSI chips. Remember though that computers run much faster than brains - we can therefore run fairly large networks of simple model neurons as software simulations in reasonable time. This has obvious advantages over having to use special "neural" computer hardware.

A Simple Artificial Neuron

Our basic computational element (model neuron) is often called a node or unit. It receives input from some other units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to model synaptic learning.
The unit computes some function f of the weighted sum of its inputs:

$y_i = f\left(\sum_j w_{ij}\, y_j\right)$

Its output, in turn, can serve as input to other units.
o The weighted sum $\sum_j w_{ij} y_j$ is called the net input to unit i, often written $net_i$.
o Note that $w_{ij}$ refers to the weight from unit j to unit i (not the other way around).
o The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit.

Maple examples of activation functions.

Linear Regression

Fitting a Model to Data

Consider the data below (for more complete auto data, see data description, raw data, and maple plots):

(Fig. 1)

Each dot in the figure provides information about the weight (x-axis, units: U.S. pounds) and fuel consumption (y-axis, units: miles per gallon) for one of 74 cars (data from 1979). Clearly weight and fuel consumption are linked, so that, in general, heavier cars use more fuel.

Now suppose we are given the weight of a 75th car, and asked to predict how much fuel it will use, based on the above data. Such questions can be answered by using a model - a short mathematical description - of the data (see also optical illusions). The simplest useful model here is of the form

$y = w_1 x + w_0$   (1)

This is a linear model: in an xy-plot, equation 1 describes a straight line with slope $w_1$ and intercept $w_0$ with the y-axis, as shown in Fig. 2. (Note that we have rescaled the coordinate axes; this does not change the problem in any fundamental way.)

How do we choose the two parameters $w_0$ and $w_1$ of our model? Clearly, any straight line drawn somehow through the data could be used as a predictor, but some lines will do a better job than others. The line in Fig. 2 is certainly not a good model: for most cars, it will predict too much fuel consumption for a given weight.

(Fig. 2)

The Loss Function

In order to make precise what we mean by being a "good predictor", we define a loss (also called objective or error) function E over the model parameters. A popular choice for E is the sum-squared error:

$E = \frac{1}{2} \sum_i (t_i - y_i)^2$   (2)

In words, it is the sum over all points i in our data set of the squared difference between the target value $t_i$ (here: actual fuel consumption) and the model's prediction $y_i$, calculated from the input value $x_i$ (here: weight of the car) by equation 1. (The factor 1/2 only simplifies the derivative; it does not move the minimum.) For a linear model, the sum-squared error is a quadratic function of the model parameters. Figure 3 shows E for a range of values of $w_0$ and $w_1$. Figure 4 shows the same function as a contour plot.

(Fig. 3) (Fig. 4)

Minimizing the Loss

The loss function E provides us with an objective measure of predictive error for a specific choice of model parameters. We can thus restate our goal of finding the best (linear) model as finding the values for the model parameters that minimize E.

For linear models, linear regression provides a direct way to compute these optimal model parameters. (See any statistics textbook for details.) However, this analytical approach does not generalize to nonlinear models (which we will get to by the end of this lecture). Even though the solution cannot be calculated explicitly in that case, the problem can still be solved by an iterative numerical technique called gradient descent. It works as follows:

1. Choose some (random) initial values for the model parameters.
2. Calculate the gradient G of the error function with respect to each model parameter.
3. Change the model parameters so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e., in the direction of -G.
4. Repeat steps 2 and 3 until G gets close to zero.

How does this work? The gradient of E gives us the direction in which the loss function has the steepest upward slope at the current setting of the weights w.
In order to decrease E, we take a small step in the opposite direction, -G (Fig. 5).

(Fig. 5)

By repeating this over and over, we move "downhill" in E until we reach a minimum, where G = 0, so that no further progress is possible (Fig. 6).

(Fig. 6)

Fig. 7 shows the best linear model for our car data, found by this procedure.

(Fig. 7)

It's a neural network!

Our linear model of equation 1 can in fact be implemented by the simple neural network shown in Fig. 8. It consists of a bias unit, an input unit, and a linear output unit. The input unit makes external input x (here: the weight of a car) available to the network, while the bias unit always has a constant output of 1. The output unit computes the sum:

$y_2 = y_1 w_{21} + 1.0\, w_{20}$   (3)

It is easy to see that this is equivalent to equation 1, with $w_{21}$ implementing the slope of the straight line, and $w_{20}$ its intercept with the y-axis.

(Fig. 8)

Linear Neural Networks

Multiple regression

Our car example showed how we could discover an optimal linear function for predicting one variable (fuel consumption) from one other (weight). Suppose now that we are also given one or more additional variables which could be useful as predictors. Our simple neural network model can easily be extended to this case by adding more input units (Fig. 1).

Similarly, we may want to predict more than one variable from the data that we're given. This can easily be accommodated by adding more output units (Fig. 2). The loss function for a network with multiple outputs is obtained simply by adding the loss for each output unit together. The network now has a typical layered structure: a layer of input units (and the bias), connected by a layer of weights to a layer of output units.

(Fig. 1) (Fig. 2)

Computing the gradient

In order to train neural networks such as the ones shown above by gradient descent, we need to be able to compute the gradient G of the loss function with respect to each weight $w_{ij}$ of the network. It tells us how a small change in that weight will affect the overall error E. We begin by splitting the loss function into separate terms for each point p in the training data:

$E = \sum_p E^p, \qquad E^p = \frac{1}{2} \sum_o (t_o^p - y_o^p)^2$   (1)

where o ranges over the output units of the network. (Note that we use the superscript p to denote the training point - this is not an exponentiation!) Since differentiation and summation are interchangeable, we can likewise split the gradient into separate components for each training point:

$G = \frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_p E^p = \sum_p \frac{\partial E^p}{\partial w_{ij}}$   (2)

In what follows, we describe the computation of the gradient for a single data point, omitting the superscript p in order to make the notation easier to follow.

First use the chain rule to decompose the gradient into two factors:

$\frac{\partial E}{\partial w_{oi}} = \frac{\partial E}{\partial y_o} \, \frac{\partial y_o}{\partial w_{oi}}$   (3)

The first factor can be obtained by differentiating Eqn. 1 above:

$\frac{\partial E}{\partial y_o} = -(t_o - y_o)$   (4)

Using $y_o = \sum_j w_{oj}\, y_j$, the second factor becomes

$\frac{\partial y_o}{\partial w_{oi}} = \frac{\partial}{\partial w_{oi}} \sum_j w_{oj}\, y_j = y_i$   (5)

Putting the pieces (equations 3-5) back together, we obtain

$\frac{\partial E}{\partial w_{oi}} = -(t_o - y_o)\, y_i$   (6)

To find the gradient G for the entire data set, we sum at each weight the contribution given by equation 6 over all the data points. We can then subtract a small proportion µ (called the learning rate) of G from the weights to perform gradient descent.

The Gradient Descent Algorithm

1. Initialize all weights to small random values.
2. REPEAT until done
   1. For each weight $w_{ij}$ set $\Delta w_{ij} := 0$
   2. For each data point $(x, t)^p$
      1. set input units to x
      2. compute value of output units
      3. For each weight $w_{ij}$ set $\Delta w_{ij} := \Delta w_{ij} + (t_i - y_i)\, y_j$
   3. For each weight $w_{ij}$ set $w_{ij} := w_{ij} + \mu\, \Delta w_{ij}$

The algorithm terminates once we are at, or sufficiently near to, the minimum of the error function, where G = 0. We say then that the algorithm has converged.
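The batch algorithm above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `train_linear_network`, the fixed number of epochs standing in for "REPEAT until done", the initial weight range, and the toy data are all made up for this example. The accumulated change for each weight is $(t_i - y_i)\, y_j$, i.e. the negative of equation 6.

```python
import random

def train_linear_network(data, n_in, n_out, mu=0.05, epochs=500, seed=0):
    """Batch gradient descent for a single-layer linear network.

    data: list of (x, t) pairs; x has n_in values, t has n_out values.
    Returns w, where w[i][j] is the weight from input j to output i and
    column n_in holds the weights from the bias unit.
    """
    rng = random.Random(seed)
    # Step 1: initialize all weights to small random values.
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)]
         for _ in range(n_out)]
    # Step 2: a fixed epoch count stands in for "REPEAT until done".
    for _ in range(epochs):
        # Step 2.1: reset the accumulated weight changes.
        dw = [[0.0] * (n_in + 1) for _ in range(n_out)]
        # Step 2.2: accumulate the contribution of every data point.
        for x, t in data:
            y_in = list(x) + [1.0]  # input units plus the bias unit
            for i in range(n_out):
                y_i = sum(w[i][j] * y_in[j] for j in range(n_in + 1))
                for j in range(n_in + 1):
                    dw[i][j] += (t[i] - y_i) * y_in[j]
        # Step 2.3: take a small step against the gradient.
        for i in range(n_out):
            for j in range(n_in + 1):
                w[i][j] += mu * dw[i][j]
    return w

# Toy data generated from t = 2x + 1 (invented for illustration only).
toy = [([k / 4.0], [2.0 * (k / 4.0) + 1.0]) for k in range(-4, 5)]
w = train_linear_network(toy, n_in=1, n_out=1)
slope, intercept = w[0]
```

On this noise-free toy set the weights should settle at the slope and intercept that generated the data. A practical version would stop when the gradient is small rather than after a fixed number of epochs.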
In summary:

                                  general case                                          linear network
Training data                     (x,t)                                                 (x,t)
Model parameters                  w                                                     w
Model                             $y = g(w,x)$                                          $y_i = \sum_j w_{ij}\, y_j$
Error function                    $E(y,t)$                                              $E = \frac{1}{2} \sum_p \sum_i (t_i^p - y_i^p)^2$
Gradient with respect to w_ij     $\partial E / \partial w_{ij}$                        $-(t_i - y_i)\, y_j$
Weight update rule                $w_{ij} := w_{ij} - \mu\, \partial E / \partial w_{ij}$   $w_{ij} := w_{ij} + \mu\, (t_i - y_i)\, y_j$

The Learning Rate

An important consideration is the learning rate µ, which determines by how much we change the weights w at each step. If µ is too small, the algorithm will take a long time to converge (Fig. 3).

(Fig. 3)

Conversely, if µ is too large, we may end up bouncing around the error surface out of control - the algorithm diverges (Fig. 4). This usually ends with an overflow error in the computer's floating-point arithmetic.

(Fig. 4)

Batch vs. Online Learning

Above we have accumulated the gradient contributions for all data points in the training set before updating the weights. This method is often referred to as batch learning. An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.

(Fig. 5)

Online learning has a number of advantages:
o it is often much faster, especially when the training set is redundant (contains many similar data points),
o it can be used when there is no fixed training set (new data keeps coming in),
o it is better at tracking nonstationary environments (where the best model gradually changes over time),
o the noise in the gradient can help to escape from local minima (which are a problem for gradient descent in nonlinear models).

These advantages are, however, bought at a price: many powerful optimization techniques (such as conjugate and second-order gradient methods, support vector machines, Bayesian methods, etc.), which we will not talk about in this course, are batch methods that cannot be used online.
(Of course this also means that in order to implement batch learning really well, one has to learn an awful lot about these rather complicated methods!)

A compromise between batch and online learning is the use of "mini-batches": the weights are updated after every n data points, where n is greater than 1 but smaller than the training set size.

In order to keep things simple, we will focus very much on online learning, where plain gradient descent is among the best available techniques. Online learning is also highly suitable for implementing things such as reactive control strategies in adaptive agents, and should thus fit in well with the rest of your course.

Multi-layer networks

A nonlinear problem

Consider again the best linear fit we found for the car data. Notice that the data points are not evenly distributed around the line: for low weights, we see more miles per gallon than our model predicts. In fact, it looks as if a simple curve might fit these data better than the straight line. We can enable our neural network to do such curve fitting by giving it an additional node which has a suitably curved (nonlinear) activation function. A useful function for this purpose is the S-shaped hyperbolic tangent (tanh) function (Fig. 1).

(Fig. 1) (Fig. 2)

Fig. 2 shows our new network: an extra node (unit 2) with tanh activation function has been inserted between input and output. Since such a node is "hidden" inside the network, it is commonly called a hidden unit. Note that the hidden unit also has a weight from the bias unit. In general, all non-input neural network units have such a bias weight. For simplicity, the bias unit and weights are usually omitted from neural network diagrams - unless it's explicitly stated otherwise, you should always assume that they are there.

(Fig. 3)

When this network is trained by gradient descent on the car data, it learns to fit the tanh function to the data (Fig. 3). Each of the four weights in the network plays a particular role in this process: the two bias weights shift the tanh function in the x- and y-direction, respectively, while the other two weights scale it along those two directions. Fig. 2 gives the weight values that produced the solution shown in Fig. 3.

Hidden Layers

One can argue that in the example above we have cheated by picking a hidden unit activation function that could fit the data well. What would we do if the data looks like this (Fig. 4)?

(Fig. 4) (Relative concentration of NO and NO2 in exhaust fumes as a function of the richness of the ethanol/air mixture burned in a car engine.)

Obviously a single tanh function can't fit this data at all. We could cook up a special activation function for each data set we encounter, but that would defeat our purpose of learning to model the data. We would like to have a general, non-linear function approximation method which would allow us to fit any given data set, no matter what it looks like.

(Fig. 5)

Fortunately there is a very simple solution: add more hidden units! In fact, a network with just two hidden units using the tanh function (Fig. 5) can fit the data in Fig. 4 quite well - can you see how? The fit can be further improved by adding yet more units to the hidden layer. Note, however, that having too large a hidden layer - or too many hidden layers - can degrade the network's performance (more on this later). In general, one shouldn't use more hidden units than necessary to solve a given problem. (One way to ensure this is to start training with a very small network. If gradient descent fails to find a satisfactory solution, grow the network by adding a hidden unit, and repeat.)

Theoretical results indicate that given enough hidden units, a network like the one in Fig. 5 can approximate any reasonable function to any required degree of accuracy. In other words, any reasonable function can be approximated arbitrarily well by a linear combination of tanh functions: tanh is a universal basis function. Many functions form a universal basis; the two classes of activation functions commonly used in neural networks are the sigmoidal (S-shaped) basis functions (to which tanh belongs), and the radial basis functions.

Error Backpropagation

We have already seen how to train linear networks by gradient descent. In trying to do the same for multi-layer networks we encounter a difficulty: we don't have any target values for the hidden units. This seems to be an insurmountable problem - how could we tell the hidden units just what to do? This unsolved question was in fact the reason why neural networks fell out of favor after an initial period of high popularity in the 1950s. It took 30 years before the error backpropagation (or in short: backprop) algorithm popularized a way to train hidden units, leading to a new wave of neural network research and applications.

(Fig. 1)

In principle, backprop provides a way to train networks with any number of hidden units arranged in any number of layers. (There are clear practical limits, which we will discuss later.) In fact, the network does not have to be organized in layers - any pattern of connectivity that permits a partial ordering of the nodes from input to output is allowed. In other words, there must be a way to order the units such that all connections go from "earlier" (closer to the input) to "later" ones (closer to the output).
This is equivalent to stating that their connection pattern must not contain any cycles. Networks that respect this constraint are called feedforward networks; their connection pattern forms a directed acyclic graph or dag. The Algorithm We want to train a multi-layer feedforward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x,t). The vector x represents a pattern of input to the network, and the vector t the corresponding target (desired output). As we have seen before, the overall gradient with respect to the entire training set is just the sum of the gradients for each pattern; in what follows we will therefore describe how to compute the gradient for just a single training pattern. As before, we will number the units, and denote the weight from unit j to unit i by wij. 1. Definitions: o the error signal for unit j: o the (negative) gradient for weight wij: o the set of nodes anterior to unit i: o the set of nodes posterior to unit j: 2. The gradient. As we did for linear networks before, we expand the gradient into two factors by use of the chain rule: The first factor is the error of unit i. The second is Putting the two together, we get . To compute this gradient, we thus need to know the activity and the error for all relevant nodes in the network. 3. Forward activaction. The activity of the input units is determined by the network's external input x. For all other units, the activity is propagated forward: Note that before the activity of unit i can be calculated, the activity of all its anterior nodes (forming the set Ai) must be known. Since feedforward networks do not contain cycles, there is an ordering of nodes from input to output that respects this condition. 4. Calculating output error. Assuming that we are using the sum-squared loss the error for output unit o is simply 5. Error backpropagation. 
For hidden units, we must propagate the error back from the output nodes (hence the name of the algorithm). Again using the chain rule, we can expand the error of a hidden unit in terms of its posterior nodes: $\delta_j = \sum_{i \in P_j} \left(-\frac{\partial E}{\partial \mathrm{net}_i}\right) \frac{\partial \mathrm{net}_i}{\partial y_j} \, \frac{\partial y_j}{\partial \mathrm{net}_j}$. Of the three factors inside the sum, the first is just the error $\delta_i$ of node i. The second is $\frac{\partial \mathrm{net}_i}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_{k \in A_i} w_{ik} y_k = w_{ij}$, while the third is the derivative of node j's activation function: $\frac{\partial y_j}{\partial \mathrm{net}_j} = f_j'(\mathrm{net}_j)$. For hidden units h that use the tanh activation function, we can make use of the special identity $\tanh'(u) = 1 - \tanh^2(u)$, giving us $f_h'(\mathrm{net}_h) = 1 - y_h^2$. Putting all the pieces together we get $\delta_j = f_j'(\mathrm{net}_j) \sum_{i \in P_j} \delta_i \, w_{ij}$. Note that in order to calculate the error for unit j, we must first know the error of all its posterior nodes (forming the set $P_j$). Again, as long as there are no cycles in the network, there is an ordering of nodes from the output back to the input that respects this condition. For example, we can simply use the reverse of the order in which activity was propagated forward.

Matrix Form

For layered feedforward networks that are fully connected - that is, each node in a given layer connects to every node in the next layer - it is often more convenient to write the backprop algorithm in matrix notation rather than using the more general graph form given above. In this notation, the bias weights, net inputs, activations, and error signals for all units in a layer are combined into vectors, while all the non-bias weights from one layer to the next form a matrix W. Layers are numbered from 0 (the input layer) to L (the output layer). The backprop algorithm then looks as follows:

1. Initialize the input layer: $\mathbf{y}_0 = \mathbf{x}$
2. Propagate activity forward: $\mathbf{y}_l = f_l(W_l \, \mathbf{y}_{l-1} + \mathbf{b}_l)$ for l = 1, 2, ..., L, where $\mathbf{b}_l$ is the vector of bias weights.
3. Calculate the error in the output layer: $\boldsymbol{\delta}_L = \mathbf{t} - \mathbf{y}_L$
4. Backpropagate the error: $\boldsymbol{\delta}_l = (W_{l+1}^T \, \boldsymbol{\delta}_{l+1}) \odot f_l'(\mathbf{net}_l)$ for l = L-1, L-2, ..., 1, where T is the matrix transposition operator and $\odot$ denotes element-wise multiplication.
5. Update the weights and biases: $\Delta W_l = \boldsymbol{\delta}_l \, \mathbf{y}_{l-1}^T$, $\Delta \mathbf{b}_l = \boldsymbol{\delta}_l$

You can see that this notation is significantly more compact than the graph form, even though it describes exactly the same sequence of operations.
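The layered procedure above can be sketched in plain Python. This is a minimal illustration, not the course's reference code: it assumes tanh hidden layers, a linear output layer and the sum-squared loss, and the names `init_network`, `forward`, `backward` and `train_step`, as well as the initial weight range, are choices made for this sketch only.

```python
import math
import random


def init_network(sizes, rng):
    """Create weights[l][i][j] and biases[l][i] for layer sizes [n0, ..., nL]."""
    weights, biases = [], []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        r = 1.0 / math.sqrt(n_in)  # assumed range for small random initial weights
        weights.append([[rng.uniform(-r, r) for _ in range(n_in)] for _ in range(n_out)])
        biases.append([rng.uniform(-r, r) for _ in range(n_out)])
    return weights, biases


def forward(weights, biases, x):
    """Return activations ys[0..L]; hidden layers use tanh, the output layer is linear."""
    ys = [list(x)]
    last = len(weights) - 1
    for l, (W, b) in enumerate(zip(weights, biases)):
        net = [sum(w * y for w, y in zip(row, ys[-1])) + bi for row, bi in zip(W, b)]
        ys.append(net if l == last else [math.tanh(n) for n in net])
    return ys


def backward(weights, ys, t):
    """Return error signals; deltas[l] belongs to the layer stored in ys[l + 1]."""
    L = len(weights)
    deltas = [None] * L
    deltas[L - 1] = [ti - yi for ti, yi in zip(t, ys[L])]  # output error: t - y
    for l in range(L - 2, -1, -1):
        W_next = weights[l + 1]
        # (W^T delta) for the layer above, one entry per unit of this layer
        back = [sum(W_next[i][j] * deltas[l + 1][i] for i in range(len(W_next)))
                for j in range(len(ys[l + 1]))]
        # multiply element-wise by the tanh derivative 1 - y^2
        deltas[l] = [b * (1.0 - y * y) for b, y in zip(back, ys[l + 1])]
    return deltas


def train_step(weights, biases, x, t, mu):
    """One online update; returns the loss measured before the update."""
    ys = forward(weights, biases, x)
    deltas = backward(weights, ys, t)
    for l in range(len(weights)):
        for i, d in enumerate(deltas[l]):
            biases[l][i] += mu * d               # bias update: delta
            for j, y in enumerate(ys[l]):
                weights[l][i][j] += mu * d * y   # weight update: delta * y^T
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, ys[-1]))


rng = random.Random(0)
weights, biases = init_network([2, 3, 1], rng)
for _ in range(100):
    loss = train_step(weights, biases, [0.5, -0.3], [0.2], mu=0.1)
```

Calling `train_step` once per pattern corresponds to online learning; summing the updates over all patterns before applying them would give the batch version.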
Backpropagation of error: an example

We will now show an example of a backprop network as it learns to model the highly nonlinear data we encountered before. The left hand panel shows the data to be modeled. The right hand panel shows a network with two hidden units, each with a tanh nonlinear activation function. The output unit computes a linear combination of the two functions (1) where (2) and (3)

To begin with, we set the weights, a..g, to random initial values in the range [-1,1]. Each hidden unit is thus computing a random tanh function. The next figure shows the initial two activation functions and the output of the network, which is their sum plus a negative constant. (If you have difficulty making out the line types, the top two curves are the tanh functions, the one at the bottom is the network output.)

We now train the network (learning rate 0.3), updating the weights after each pattern (online learning). After we have been through the entire dataset 10 times (10 training epochs), the functions computed look like this (the output is the middle curve):

After 20 epochs, we have (output is the humpbacked curve):

and after 27 epochs we have a pretty good fit to the data:

As the activation functions are stretched, scaled and shifted by the changing weights, we hope that the error of the model is dropping. In the next figure we plot the total sum squared error over all 88 patterns of the data as a function of training epoch. Four training runs are shown, with different weight initialization each time:

You can see that the path to the solution differs each time, both because we start from a different point in weight space, and because the order in which patterns are presented is random. Nonetheless, all training curves go down monotonically, and all reach about the same level of overall error.
Overfitting

In the previous example we used a network with two hidden units. Just by looking at the data, it was possible to guess that two tanh functions would do a pretty good job of fitting the data. In general, however, we may not know how many hidden units, or equivalently, how many weights, we will need to produce a reasonable approximation to the data. Furthermore, we usually seek a model of the data which will give us, on average, the best possible predictions for novel data. This goal can conflict with the simpler task of modelling a specific training set well. In this section we will look at some techniques for preventing our model from becoming too powerful (overfitting). In the next, we address the related question of selecting an appropriate architecture with just the right number of trainable parameters.

Bias-Variance trade-off

Consider the two fitted functions below. The data points (circles) have all been generated from a smooth function, h(x), with some added noise. Obviously, we want to end up with a model which approximates h(x), given a specific set of data y(x) generated as:

$y(x) = h(x) + \epsilon$ (1)

In the left hand panel we try to fit the points using a function g(x) which has too few parameters: a straight line. The model has the virtue of being simple; there are only two free parameters. However, it does not do a good job of fitting the data, and would not do well in predicting new data points. We say that the simpler model has a high bias.

The right hand panel shows a model which has been fitted using too many free parameters. It does an excellent job of fitting the data points, as the error at the data points is close to zero. However it would not do a good job of predicting h(x) for new values of x. We say that the model has a high variance. The model does not reflect the structure which we expect to be present in any data set generated by equation (1) above.
Clearly what we want is something in between: a model which is powerful enough to represent the underlying structure of the data (h(x)), but not so powerful that it faithfully models the noise associated with this particular data sample.

The bias-variance trade-off is most likely to become a problem if we have relatively few data points. In the opposite case, where we have essentially an infinite number of data points (as in continuous online learning), we are not usually in danger of overfitting the data, as the noise associated with any single data point plays a vanishingly small role in our overall fit. The following techniques therefore apply to situations in which we have a finite data set, and, typically, where we wish to train in batch mode.

Preventing overfitting

Early stopping

One of the simplest and most widely used means of avoiding overfitting is to divide the data into two sets: a training set and a validation set. We train using only the training data. Every now and then, however, we stop training, and test network performance on the independent validation set. No weight updates are made during this test! As the validation data is independent of the training data, network performance is a good measure of generalization, and as long as the network is learning the underlying structure of the data (h(x) above), performance on the validation set will improve with training. Once the network stops learning things which are expected to be true of any data sample and learns things which are true only of this sample (epsilon in Eqn 1 above), performance on the validation set will stop improving, and will typically get worse.

Schematic learning curves showing error on the training and validation sets are shown below. To avoid overfitting, we simply stop training at time t, where performance on the validation set is optimal.
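The early stopping procedure can be sketched as a small training loop. This is a minimal sketch under stated assumptions: `train_epoch` and `validation_error` are hypothetical callables supplied by the user, and the `patience` parameter (how many epochs to wait for a new best validation error before giving up) is a common practical addition that the text above does not prescribe.

```python
import copy


def early_stopping(model, train_epoch, validation_error, max_epochs=1000, patience=5):
    """Train until the validation error stops improving; return the best model seen."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch(model)             # weight updates use the training set only
        err = validation_error(model)  # evaluation only: no weight updates here
        if err < best_error:
            best_error, best_epoch = err, epoch
            best_model = copy.deepcopy(model)  # remember the weights at the optimum
        elif epoch - best_epoch >= patience:
            break                      # validation error has stopped improving
    return best_model, best_epoch, best_error
```

Keeping a copy of the best model means that the network returned is the one from the epoch with the lowest validation error, rather than the slightly overtrained one at the moment training was halted.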
One detail of note when using early stopping: if we wish to test the trained network on a set of independent data to measure its ability to generalize, we need a third, independent, test set. This is because we used the validation set to decide when to stop training, and thus our trained network is no longer entirely independent of the validation set. The requirement of independent training, validation and test sets means that early stopping can only be used in a data-rich situation.

Weight decay

The over-fitted function above shows a high degree of curvature, while the linear function is maximally smooth. Regularization refers to a set of techniques which help to ensure that the function computed by the network is no more curved than necessary. This is achieved by adding a penalty to the error function, giving:

$\tilde{E} = E + \nu \, \Omega$ (2)

One possible form of the regularizer comes from the informal observation that an over-fitted mapping with regions of large curvature requires large weights. We thus penalize large weights by choosing

$\Omega = \frac{1}{2} \sum_i w_i^2$ (3)

Using this modified error function, the weights are now updated as

$\Delta w_i = -\mu \frac{\partial E}{\partial w_i} - \mu \nu \, w_i$ (4)

where the right hand term causes the weight to decrease as a function of its own size. In the absence of any input, all weights will tend to decrease exponentially, hence the term "weight decay".

Training with noise

A final method which can often help to reduce the importance of the specific noise characteristics associated with a particular data sample is to add an extra small amount of noise (a small random value with mean value of zero) to each input. Each time a specific input pattern x is presented, we add a different random number, and use the noisy pattern instead.

At first, this may seem a rather odd thing to do: to deliberately corrupt one's own data. However, perhaps you can see that it will now be difficult for the network to approximate any specific data point too closely.
In practice, training with added noise has indeed been shown to reduce overfitting and thus improve generalization in some situations.

If we have a finite training set, another way of introducing noise into the training process is to use online training, that is, updating weights after every pattern presentation, and to randomly reorder the patterns at the end of each training epoch. In this manner, each weight update is based on a noisy estimate of the true gradient.

Growing and Pruning Networks

The neural network modeler is faced with a huge array of models and training regimes from which to select. This course can only serve to introduce you to the most common and general models. However, even after deciding, for example, to train a simple feedforward network, using some specific form of gradient descent, with tanh nodes in a single hidden layer, an important question remains: how big a network should we choose? How many hidden units, or, relatedly, how many weights?

By way of an example, the nonlinear data which formed our first example can be fitted very well using 40 tanh functions. Learning with 40 hidden units is considerably harder than learning with 2, and takes significantly longer. The resulting fit is no better (as measured by the sum squared error) than the 2-unit model.

The most usual answer is not necessarily the best: we guess an appropriate number (as we did above). Another common solution is to try out several network sizes, and select the most promising. Neither of these methods is very principled. Two more rigorous classes of methods are available, however. We can either start with a network which we know to be too small, and iteratively add units and weights, or we can train an oversized network and remove units/weights from the final network. We will look briefly at each of these approaches.
Growing networks

The simplest form of network growing algorithm starts with a small network, say one with only a single hidden unit. The network is trained until the improvement in the error over one epoch falls below some threshold. We then add an additional hidden unit, with weights from inputs and to outputs. We initialize the new weights randomly and resume training. The process continues until no significant gain is achieved by adding an extra unit. The process is illustrated below.

Cascade correlation

Beyond simply having too many parameters (danger of overfitting), there is a problem with large networks which has been called the herd effect. Imagine we have a task which is essentially decomposable into two sub-tasks A and B. We have a number of hidden units and randomly weighted connections. If task A is responsible for most of the error signal arriving at the hidden units, there will be a tendency for all units to simultaneously try to solve A. Once the error attributable to A has been reduced, error from subtask B will predominate, and all units will now try to solve that, leading to an increase again in the error from A. Eventually, due mainly to the randomness in the weight initialization, the herd will split and different units will address different sub-problems, but this may take considerable time.

To get around this problem, Fahlman (1991) proposed an algorithm called cascade correlation which begins with a minimal network having just input and output units. Training a single layer requires no back-propagation of error and can be done very efficiently. At some point further training will not produce much improvement. If network performance is satisfactory, training can be stopped. If not, there must be some remaining error which we wish to reduce some more. This is done by adding a new hidden unit to the network, as described in the next paragraph. The new unit is added, its input weights are frozen (i.e.
they will no longer be changed) and all output weights are once again trained. This is repeated until the error is small enough (or until we give up).

To add a hidden unit, we begin with a candidate unit and provide it with incoming connections from the input units and from all existing hidden units. We do not yet give it any outgoing connections. The new unit's input weights are trained by a process similar to gradient descent. Specifically, we seek to maximize the covariance between v, the new unit's value, and $E_o$, the output error at output unit o. We define S as:

$S = \sum_o \left| \sum_p (v_p - \bar{v}) (E_{p,o} - \bar{E}_o) \right|$ (1)

where o ranges over the output units and p ranges over the input patterns. The terms $\bar{v}$ and $\bar{E}_o$ are the mean values of v and $E_o$ over all patterns. Performing gradient ascent on the partial derivative of S with respect to the candidate's input weights (we will skip the explicit formula here) ensures that we end up with a unit whose activation is maximally correlated (positively or negatively) with the remaining error. Once we have maximized S, we freeze the input weights, and install the unit in the network as described above. The whole process is illustrated below.

In (1) we train the weights from input to output. In (2), we add a candidate unit and train its weights to maximize the correlation with the error. In (3) we retrain the output layer, (4) we train the input weights for another hidden unit, (5) retrain the output layer, etc. Because we train only one layer at a time, training is very quick. What is more, because the weights feeding into each hidden unit do not change once the unit has been added, it is possible to record and store the activations of the hidden units for each pattern, and reuse these values without recomputation in later epochs.

Pruning networks

An alternative approach to growing networks is to start with a relatively large network and then remove weights so as to arrive at an optimal network architecture. The usual procedure is as follows:

1. Train a large, densely connected network with a standard training algorithm.
2. Examine the trained network to assess the relative importance of the weights.
3. Remove the least important weight(s).
4. Retrain the pruned network.
5. Repeat steps 2-4 until satisfied.

Deciding which are the least important weights is a difficult issue for which several heuristic approaches are possible. We can estimate the amount by which the error function E changes for a small change in each weight. The computational form for this estimate would take us a little too far here. Various forms of this technique have been called optimal brain damage and optimal brain surgeon.

Preconditioning the Network

Ill-Conditioning

In the preceding section on overfitting, we have seen what can happen when the network learns a given set of data too well. Unfortunately a far more frequent problem encountered by backpropagation users is just the opposite: that the network does not learn well at all! This is usually due to ill-conditioning of the network.

(Fig. 1a)

Recall that gradient descent requires a reasonable learning rate to work well: if it is too low (Fig. 1a), convergence will be very slow; set it too high, and the network will diverge (Fig. 1b).

(Fig. 1b)

Unfortunately the best learning rate is typically different for each weight in the network! Sometimes these differences are small enough for a single, global compromise learning rate to work well; other times not. We call a network ill-conditioned if it requires learning rates for its weights that differ by so much that there is no global rate at which the network learns reasonably well. The error function for such a network is characterized by long, narrow valleys:

(Fig. 2)

(Mathematically, ill-conditioning is characterized by a high condition number. The condition number is the ratio between the largest and the smallest eigenvalue of the network's Hessian.
The Hessian is the matrix of second derivatives of the loss function with respect to the weights. Although it is possible to calculate the Hessian for a multi-layer network and determine its condition number explicitly, it is a rather complicated procedure, and rarely done.)

Ill-conditioning in neural networks can be caused by the training data, the network's architecture, and/or its initial weights. Typical problems are: having large inputs or target values, having both large and small layers in the network, having more than one hidden layer, and having initial weights that are too large or too small. This should make it clear that ill-conditioning is a very common problem indeed! In what follows, we look at each possible source of ill-conditioning, and describe a simple method to remove the problem. Since these methods are all used before training of the network begins, we refer to them as preconditioning techniques.

Normalizing Inputs and Targets

(Fig. 3)

Recall the simple linear network (Fig. 3) we first used to learn the car data set. When we presented the best linear fit, we had rescaled both the x (input) and y (target) axes. Why did we do this? Consider what would happen if we used the original data directly instead: the input (weight of the car) would be quite large - over 3000 (pounds) on average. To map such large inputs onto the far smaller targets, the weight from input to output must become quite small - about -0.01. Now assume that we are 10% (0.001) away from the optimal value. This would cause an error of (typically) 3000*0.001 = 3 at the output. At learning rate µ, the weight change resulting from this error would be µ*3*3000 = 9000 µ. For stable convergence, this should be smaller than the distance to the weight's optimal value: 9000 µ < 0.001, giving us approximately µ < $10^{-7}$, a very small learning rate.
(And this is for online learning - for batch learning, where the weight changes for several patterns are added up, the learning rate would have to be even smaller!)

Why should such a small learning rate be a problem? Consider that the bias unit has a constant output of 1. A bias weight that is, say, 0.1 away from its optimal value would therefore have a gradient of 0.1. At a learning rate of $10^{-7}$, however, it would take 10 million steps to move the bias weight by this distance! This is a clear case of ill-conditioning caused by the vastly different scale of input and bias values. The solution is simple: normalize the input, so that it has an average of zero and a standard deviation of one. Normalization is a two-step process. To normalize a variable, first

1. (centering) subtract its average, then
2. (scaling) divide by its standard deviation.

Note that for our purposes it is not really necessary to calculate the mean and standard deviation of each input exactly - approximate values are perfectly sufficient. (In the case of the car data, the "mean" of 3000 and "standard deviation" of 1000 were simply guessed after looking at the data plot.) This means that in situations where the training data is not known in advance, estimates based on either prior knowledge or a small sample of the data are usually good enough. If the data is a time series x(t), you may also want to consider using the first differences x(t) - x(t-1) as network inputs instead; they have zero mean as long as x(t) is stationary. Whichever way you do it, remember that you should always normalize the inputs, and normalize the targets.

To see why the target values should also be normalized, consider the network we've used to fit a sigmoid to the car data (Fig. 4). If the target values were those found in the original data, the weight from hidden to output unit would have to be 10 times larger. The error signal propagated back to the hidden unit would thus be multiplied by 17 along the way.
In order to compensate for this, the global learning rate would have to be lowered correspondingly, slowing down the weights that go directly to the output unit. Thus while large inputs cause ill-conditioning by leading to very small weights, large targets do so by leading to very large weights.

(Fig. 4)

Finally, notice that the argument for normalizing the inputs can also be applied to the hidden units (which after all look like inputs to their posterior nodes). Ideally, we would like hidden unit activations as well to have a mean of zero and a standard deviation of one. Since the weights into hidden units keep changing during training, however, it would be rather hard to predict their mean and standard deviation accurately! Fortunately we can rely on our tanh activation function to keep things reasonably well-conditioned: its range from -1 to +1 implies that the standard deviation cannot exceed 1, while its symmetry about zero means that the mean will typically be relatively small. Furthermore, its maximum derivative is also 1, so that backpropagated errors will be neither magnified nor attenuated more than necessary.

Note: For historical reasons, many people use the logistic sigmoid $f(u) = 1/(1 + e^{-u})$ as activation function for hidden units. This function is closely related to tanh (in fact, $f(u) = \tanh(u/2)/2 + 0.5$) but has a smaller, asymmetric range (from 0 to 1), and a maximum derivative of 0.25. We will later encounter a legitimate use for this function, but as activation function for hidden units it tends to worsen the network's conditioning.

Thus do not use the logistic sigmoid $f(u) = 1/(1 + e^{-u})$ as activation function for hidden units. Use tanh instead: your network will be better conditioned.

Initializing the Weights

Before training, the network weights are initialized to small random values. The random values are usually drawn from a uniform distribution over the range [-r,r]. What should r be?
If the initial weights are too small, both activation and error signals will die out along their way through the network. Conversely, if they are too large, the tanh function of the hidden units will saturate - be very close to its asymptotic value of +/-1. This means that its derivative will be close to zero, blocking any backpropagated error signals from passing through the node; this is sometimes called paralysis of the node.

To avoid either extreme, we would initially like the hidden units' net input to be approximately normalized. We do not know the inputs to the node, but we do know that they're approximately normalized - that's what we ensured in the previous section. It seems reasonable then to model the expected inputs as independent, normalized random variables. This means that their variances add, so we can write

$\mathrm{Var}(\mathrm{net}_i) = \mathrm{Var}\left(\sum_{j \in A_i} w_{ij} y_j\right) = \sum_{j \in A_i} w_{ij}^2 \, \mathrm{Var}(y_j) \approx \sum_{j \in A_i} w_{ij}^2 \le |A_i| \, r^2$

since the initial weights are in the range [-r,r]. To ensure that $\mathrm{Var}(\mathrm{net}_i)$ is at most 1, we can thus set r to the inverse of the square root of the fan-in $|A_i|$ of the node - the number of weights coming into it: initialize weight $w_{ij}$ to a uniformly random value in the range $[-r_i, r_i]$, where $r_i = 1/\sqrt{|A_i|}$.

Setting Local Learning Rates

Above we have seen that the architecture of the network - specifically, the fan-in of its nodes - determines the range within which its weights should be initialized. The architecture also affects how the error signal scales up or down as it is backpropagated through the network. Modelling the error signals as independent random variables (and using $|f_j'| \le 1$ for tanh), we have

$\mathrm{Var}(\delta_j) \le \sum_{i \in P_j} w_{ij}^2 \, \mathrm{Var}(\delta_i) \le \sum_{i \in P_j} r_i^2 \, \mathrm{Var}(\delta_i) = \sum_{i \in P_j} \frac{\mathrm{Var}(\delta_i)}{|A_i|}$

Let us define a new variable v for each hidden or output node, proportional to the (estimated) variance of its error signal divided by its fan-in. We can calculate all the v by a backpropagation procedure:

- for all output nodes o, set $v_o = 1/|A_o|$
- backpropagate: for all hidden nodes j, calculate $v_j = \frac{1}{|A_j|} \sum_{i \in P_j} v_i$

Since the activations in the network are already normalized, we can expect the gradient for weight $w_{ij}$ to scale with the square root of the corresponding error signal's variance, $v_i |A_i|$.
The resulting weight change, however, should be commensurate with the characteristic size of the weight, which is given by $r_i$. To achieve this, set the learning rate $\mu_i$ (used for all weights $w_{ij}$ into node i) to

$\mu_i = \frac{\mu \, r_i}{\sqrt{v_i |A_i|}} = \frac{\mu}{|A_i| \sqrt{v_i}}$

If you follow all the points we have made in this section before the start of training, you should have a reasonably well-conditioned network that can be trained effectively. It remains to determine a good global learning rate µ. This must be done by trial and error; a good first guess (on the high side) would be the inverse of the square root of the batch size (by a similar argument as we have made above), or 1 for online learning. If this leads to divergence, reduce µ and try again.

Momentum and Learning Rate Adaptation

Local Minima

In gradient descent we start at some point on the error function defined over the weights, and attempt to move to the global minimum of the function. In the simplified function of Fig 1a the situation is simple. Any step in a downward direction will take us closer to the global minimum. For real problems, however, error surfaces are typically complex, and may more resemble the situation shown in Fig 1b. Here there are numerous local minima, and the ball is shown trapped in one such minimum. Progress here is only possible by climbing higher before descending to the global minimum.

(Fig. 1a) (Fig. 1b)

We have already mentioned one way to escape a local minimum: use online learning. The noise in the stochastic error surface is likely to bounce the network out of local minima as long as they are not too severe.

Momentum

Another technique that can help the network out of local minima is the use of a momentum term. This is probably the most popular extension of the backprop algorithm; it is hard to find cases where this is not used.
With momentum m, the weight update at a given time t becomes

$\Delta w_{ij}(t) = \mu_i \, \delta_i \, y_j + m \, \Delta w_{ij}(t-1)$ (1)

where $\Delta w_{ij}(t)$ here stands for the change applied to weight $w_{ij}$ at time t, and 0 < m < 1 is a new global parameter which must be determined by trial and error. Momentum simply adds a fraction m of the previous weight update to the current one. When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum. It is therefore often necessary to reduce the global learning rate µ when using a lot of momentum (m close to 1). If you combine a high learning rate with a lot of momentum, you will rush past the minimum with huge steps!

When the gradient keeps changing direction, momentum will smooth out the variations. This is particularly useful when the network is not well-conditioned. In such cases the error surface has substantially different curvature along different directions, leading to the formation of long narrow valleys. For most points on the surface, the gradient does not point towards the minimum, and successive steps of gradient descent can oscillate from one side to the other, progressing only very slowly to the minimum (Fig. 2a). Fig. 2b shows how the addition of momentum helps to speed up convergence to the minimum by damping these oscillations.

(Fig. 2a) (Fig. 2b)

To illustrate this effect in practice, we trained 20 networks on a simple problem (4-2-4 encoding), both with and without momentum. The mean training times (in epochs) were

momentum    training time (epochs)
0           217
0.9         95

Learning Rate Adaptation

In the section on preconditioning, we have employed simple heuristics to arrive at reasonable guesses for the global and local learning rates. It is possible to refine these values significantly once training has commenced, and the network's response to the data can be observed. We will now introduce a few methods that can do so automatically by adapting the learning rates during training.

Bold Driver

A useful batch method for adapting the global learning rate µ is the bold driver algorithm.
Its operation is simple: after each epoch, compare the network's loss E(t) to its previous value, E(t-1). If the error has decreased, increase µ by a small proportion (typically 1%-5%). If the error has increased by more than a tiny proportion (say, $10^{-10}$), however, undo the last weight change, and decrease µ sharply - typically by 50%. Thus bold driver will keep growing µ slowly until it finds itself taking a step that has clearly gone too far up onto the opposite slope of the error function. Since this means that the network has arrived in a tricky area of the error surface, it makes sense to reduce the step size quite drastically at this point.

Annealing

Unfortunately bold driver cannot be used in this form for online learning: the stochastic fluctuations in E(t) would hopelessly confuse the algorithm. If we keep µ fixed, however, these same fluctuations prevent the network from ever properly converging to the minimum - instead we end up randomly dancing around it. In order to actually reach the minimum, and stay there, we must anneal (gradually lower) the global learning rate. A simple, non-adaptive annealing schedule for this purpose is the search-then-converge schedule

$\mu(t) = \mu(0) / (1 + t/T)$ (2)

Its name derives from the fact that it keeps µ nearly constant for the first T training patterns, allowing the network to find the general location of the minimum, before annealing it at a (very slow) pace that is known from theory to guarantee convergence to the minimum. The characteristic time T of this schedule is a new free parameter that must be determined by trial and error.

Local Rate Adaptation

If we are willing to be a little more sophisticated, we can go a lot further than the above global methods. First let us define an online weight update that uses a local, time-varying learning rate for each weight:

$w_{ij}(t+1) = w_{ij}(t) + \mu_{ij}(t) \, \Delta w_{ij}(t)$ (3)

where $\Delta w_{ij}(t) = -\partial E(t) / \partial w_{ij}(t)$ is the negative gradient at time t. The idea is to adapt these local learning rates by gradient descent, while simultaneously adapting the weights.
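Before turning to the details of local rate adaptation, the two global schemes described above, bold driver and the search-then-converge schedule, can be sketched in a few lines. This is an illustrative sketch only: the function names, the default growth factor of 5% and the quadratic toy loss in the usage lines are assumptions made for the example, and the weights are held in a plain list.

```python
def bold_driver_descent(loss, grad, w, mu, epochs, grow=1.05, shrink=0.5, tol=1e-10):
    """Batch gradient descent with the bold driver rule for the global rate mu."""
    prev = loss(w)
    for _ in range(epochs):
        candidate = [wi - mu * gi for wi, gi in zip(w, grad(w))]
        new = loss(candidate)
        if new > prev * (1.0 + tol):
            mu *= shrink            # error went up: discard the step, cut mu sharply
        else:
            if new < prev:
                mu *= grow          # error went down: grow mu by a small proportion
            w, prev = candidate, new
    return w, mu


def search_then_converge(mu0, t, T):
    """Annealed learning rate mu(t) = mu(0) / (1 + t/T)."""
    return mu0 / (1.0 + t / T)


# Toy usage on E(w) = sum of squares, whose gradient is 2w.
w, mu = bold_driver_descent(lambda w: sum(x * x for x in w),
                            lambda w: [2.0 * x for x in w],
                            [1.0, -2.0], mu=0.1, epochs=100)
```

Rejecting the candidate weights is the "undo the last weight change" step; the schedule function would be called once per pattern in online learning, with t counting the patterns seen so far.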
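At time t, we would like to change the learning rate (before changing the weight) such that the loss E(t+1) at the next time step is reduced. Writing $\Delta w_{ij}(t) = -\partial E(t) / \partial w_{ij}(t)$ for the negative gradient, so that the weight update (3) reads $w_{ij}(t+1) = w_{ij}(t) + \mu_{ij}(t) \, \Delta w_{ij}(t)$, the gradient we need is

$\frac{\partial E(t+1)}{\partial \mu_{ij}(t)} = \frac{\partial E(t+1)}{\partial w_{ij}(t+1)} \, \frac{\partial w_{ij}(t+1)}{\partial \mu_{ij}(t)} = -\Delta w_{ij}(t+1) \, \Delta w_{ij}(t)$ (4)

Ordinary gradient descent in $\mu_{ij}$, using the meta-learning rate q (a new global parameter), would give

$\mu_{ij}(t) = \mu_{ij}(t-1) + q \, \Delta w_{ij}(t) \, \Delta w_{ij}(t-1)$ (5)

which is (4) with the time index shifted by one, so that only gradients that have already been computed are used. We can already see that this would work in a similar fashion to momentum: increase the learning rate as long as the gradient keeps pointing in the same direction, but decrease it when you land on the opposite slope of the loss function.

Problem: $\mu_{ij}$ might become negative! Also, the step size should be proportional to $\mu_{ij}$ so that it can be adapted over several orders of magnitude. This can be achieved by performing the gradient descent on $\log(\mu_{ij})$ instead:

$\log \mu_{ij}(t) = \log \mu_{ij}(t-1) + q \, \Delta w_{ij}(t) \, \Delta w_{ij}(t-1)$ (6)

Exponentiating this gives

$\mu_{ij}(t) = \mu_{ij}(t-1) \, e^{q \, \Delta w_{ij}(t) \, \Delta w_{ij}(t-1)} \approx \mu_{ij}(t-1) \, \max(0.5, \; 1 + q \, \Delta w_{ij}(t) \, \Delta w_{ij}(t-1))$ (7)

where the approximation serves to avoid an expensive exp function call. The multiplier is limited below by 0.5 to guard against very small (or even negative) factors.

Problem: the gradient is noisy; the product of two of them will be even noisier - the learning rate will bounce around a lot. A popular way to reduce the stochasticity is to replace the gradient at the previous time step (t-1) by an exponential average of past gradients. The exponential average of a time series u(t) is defined as

$\bar{u}(t) = m \, \bar{u}(t-1) + (1 - m) \, u(t)$ (8)

where 0 < m < 1 is a new global parameter.

Problem: if the gradient is ill-conditioned, the product of two gradients will be even worse - the condition number is squared. We will need to normalize the step sizes in some way. A radical solution is to throw away the magnitude of the step, and just keep the sign, giving

$\mu_{ij}(t) = \mu_{ij}(t-1) \, r^{\,\mathrm{sign}(\Delta w_{ij}(t) \, \overline{\Delta w}_{ij}(t-1))}$ (9)

where $r = e^q$. This works fine for batch learning, but...

(Fig. 3)

Problem: Nonlinear normalizers such as the sign function lead to systematic errors in stochastic gradient descent (Fig. 3): a skewed but zero-mean gradient distribution (typical for stochastic equilibrium) is mapped to a normalized distribution with non-zero mean.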
To avoid the problems this is causing, we need a linear normalizer for online learning. A good method is to divide the step by an exponential average of the squared gradient. This gives (10)

Problem: successive training patterns may be correlated, causing the product of stochastic gradients to behave strangely. The exponential averaging does help to get rid of short-term correlations, but it cannot deal with input that exhibits correlations across long periods of time. If you are iterating over a fixed training set, make sure you permute (shuffle) it before each iteration to destroy any correlations. This may not be possible in a true online learning situation, where training data is received one pattern at a time.

To show that all these equations actually do something useful, here is a typical set of online learning curves (in postscript) for a difficult benchmark problem, given either uncorrelated training patterns, or patterns with strong short-term or long-term correlations. In these figures "momentum" corresponds to using equation (1) above, and "s-ALAP" to equation (10). "ALAP" is like "s-ALAP" but without the exponential averaging of past gradients, while "ELK1" and "SMD" are more advanced methods (developed by one of us).

Classification

Discriminants

Neural networks can also be used to classify data. Unlike regression problems, where the goal is to produce a particular output value for a given input, classification problems require us to label each data point as belonging to one of n classes. Neural networks can do this by learning a discriminant function which separates the classes. For example, a network with a single linear output can solve a two-class problem by learning a discriminant function which is greater than zero for one class, and less than zero for the other. Fig. 6 shows two such two-class problems, with filled dots belonging to one class, and unfilled dots to the other.
In each case, a line is drawn where the discriminant function that separates the two classes is zero.

(Fig. 6)

On the left side, a straight line can serve as a discriminant: we can place the line such that all filled dots lie on one side, and all unfilled ones lie on the other. The classes are said to be linearly separable. Such problems can be learned by neural networks without any hidden units. On the right side, a highly non-linear function is required to ensure class separation. This problem can be solved only by a neural network with hidden units.

Binomial

To use a neural network for classification, we need to construct an equivalent function approximation problem by assigning a target value for each class. For a binomial (two-class) problem we can use a network with a single output y, and binary target values: 1 for one class, and 0 for the other. We can thus interpret the network's output as an estimate of the probability that a given pattern belongs to the '1' class. To classify a new pattern after training, we then employ the maximum likelihood discriminant, y > 0.5.

A network with linear output used in this fashion, however, will expend a lot of its effort on getting the target values exactly right for its training points - when all we actually care about is the correct positioning of the discriminant. The solution is to use an activation function at the output that saturates at the two target values: such a function will be close to the target value for any net input that is sufficiently large and has the correct sign. Specifically, we use the logistic sigmoid function

f(net) = 1/(1 + e^(-net))

Given the probabilistic interpretation, a network output of, say, 0.01 for a pattern that is actually in the '1' class is a much more serious error than, say, 0.1. Unfortunately the sum-squared loss function makes almost no distinction between these two cases. A loss function that is appropriate for dealing with probabilities is the cross-entropy error.
For the two-class case, with target t and output y, it is given by

E = -[t ln(y) + (1 - t) ln(1 - y)]

When logistic output units and cross-entropy error are used together in backpropagation learning, the error signal for the output unit becomes just the difference between target and output:

delta = t - y

In other words, implementing cross-entropy error for this case amounts to nothing more than omitting the f'(net) factor that the error signal would otherwise get multiplied by. This is not an accident, but indicative of a deeper mathematical connection: cross-entropy error and logistic outputs are the "correct" combination to use for binomial probabilities, just like linear outputs and sum-squared error are for scalar values.

Multinomial

If we have multiple independent binary attributes by which to classify the data, we can use a network with multiple logistic outputs and cross-entropy error. For multinomial classification problems (1-of-n, where n > 2) we use a network with n outputs, one corresponding to each class, and target values of 1 for the correct class, and 0 otherwise. Since these targets are not independent of each other, however, it is no longer appropriate to use logistic output units. The correct generalization of the logistic sigmoid to the multinomial case is the softmax activation function:

y_i = e^(net_i) / sum_o e^(net_o)

where o ranges over the n output units. The cross-entropy error for such an output layer is given by

E = -sum_o t_o ln(y_o)

Since all the nodes in a softmax output layer interact (the value of each node depends on the values of all the others), the derivative of the cross-entropy error is difficult to calculate. Fortunately, it again simplifies to

delta_i = t_i - y_i

so we don't have to worry about it.

Non-Supervised Learning

It is possible to use neural networks to learn about data that contains neither target outputs nor class labels.
There are many tricks for getting error signals in such non-supervised settings; here we'll briefly discuss a few of the most common approaches: autoassociation, time series prediction, and reinforcement learning.

Autoassociation

Autoassociation is based on a simple idea: if you have inputs but no targets, just use the inputs as targets. An autoassociator network thus tries to learn the identity function. This is only non-trivial if the hidden layer forms an information bottleneck - that is, it contains fewer units than the input (output) layer, so that the network must perform dimensionality reduction (a form of data compression).

A linear autoassociator trained with sum-squared error in effect performs principal component analysis (PCA), a well-known statistical technique. PCA extracts the subspace (directions) of highest variance from the data. As was the case with regression, the linear neural network offers no direct advantage over known statistical methods, but it does suggest an interesting nonlinear generalization. The nonlinear autoassociator includes a hidden layer in both the encoder and the decoder part of the network. Together with the linear bottleneck layer, this gives a network with at least 3 hidden layers. Such a deep network should be preconditioned if it is to learn successfully.

Time Series Prediction

When the input data x forms a temporal series, an important task is to predict the next point: the weather tomorrow, the stock market 5 minutes from now, and so on. We can (attempt to) do this with a feedforward network by using time-delay embedding: at time t, we give the network x(t), x(t-1), ... x(t-d) as input, and try to predict x(t+1) at the output. After propagating activity forward to make the prediction, we wait for the actual value of x(t+1) to come in before calculating and backpropagating the error. Like all neural network architecture parameters, the dimension d of the embedding is an important but difficult choice.
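The time-delay embedding described above can be sketched in a few lines of Python. The helper name `delay_embed` and the list-of-lists layout are assumptions for illustration; each input row holds x(t), x(t-1), ... x(t-d), and the matching target is x(t+1).

```python
def delay_embed(series, d):
    """Build (inputs, targets) pairs for one-step-ahead prediction.

    inputs[n]  = [x(t), x(t-1), ..., x(t-d)]
    targets[n] = x(t+1)
    for every t at which both the full window and the next value exist.
    """
    inputs, targets = [], []
    for t in range(d, len(series) - 1):
        inputs.append([series[t - k] for k in range(d + 1)])
        targets.append(series[t + 1])
    return inputs, targets


# Example: a short ramp with embedding dimension d = 2.
X, T = delay_embed([0, 1, 2, 3, 4], d=2)
print(X)  # [[2, 1, 0], [3, 2, 1]]
print(T)  # [3, 4]
```

Each row of `X` would then be presented to an ordinary feedforward network, with the corresponding entry of `T` as its target.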
A more powerful (but also more complicated) way to model a time series is to use recurrent neural networks.

Reinforcement Learning

Sometimes we are faced with the problem of delayed reward: rather than being told the correct answer for each input pattern immediately, we may only occasionally get a positive or negative reinforcement signal to tell us whether the entire sequence of actions leading up to this was good or bad. Reinforcement learning provides ways to get a continuous error signal in such situations.

Q-learning associates an expected utility (the Q-value) with each action possible in a particular state. If at time t we are in state s(t) and decide to perform action a(t), the corresponding Q-value is updated as follows:

Q(s(t), a(t)) = r(t) + gamma max_a Q(s(t+1), a)

where r(t) is the instantaneous reward resulting from our action, s(t+1) is the state that it led to, a ranges over all possible actions in that state, and gamma <= 1 is a discount factor that leads us to prefer instantaneous over delayed rewards.

A common way to implement Q-learning for small problems is to maintain a table of Q-values for all possible state/action pairs. For large problems, however, it is often impossible to keep such a large table in memory, let alone learn its entries in reasonable time. In such cases a neural network can provide a compact approximation of the Q-value function. Such a network takes the state s(t) as its input, and has an output y_a for each possible action. To learn the Q-value Q(s(t), a(t)), it uses the right-hand side of the above Q-iteration as a target. Note that since we require the network's outputs at time t+1 in order to calculate its error signal at time t, we must keep a one-step memory of all input and hidden node activity, as well as the most recent action. The error signal is applied only to the output corresponding to that action; all other output nodes receive no error (they are "don't cares").

TD-learning is a variation that assigns utility values to states alone rather than state/action pairs.
This means that search must be used to determine the value of the best successor state. TD(λ) replaces the one-step memory with an exponential average of the network's gradient; this is similar to momentum, and can help speed the transport of delayed reward signals across large temporal distances.

One of the most successful applications of neural networks is TD-Gammon, a network that used TD(λ) to learn the game of backgammon from scratch, by playing only against itself. TD-Gammon is now the world's strongest backgammon program, and plays at the level of human grandmasters.

Learning Time Sequences

There are many tasks that require learning a temporal sequence of events. These problems can be broken into 3 distinct types of tasks:

Sequence Recognition: Produce a particular output pattern when a specific input sequence is seen. Applications: speech recognition
Sequence Reproduction: Generate the rest of a sequence when the network sees only part of the sequence. Applications: time series prediction (stock market, sun spots, etc)
Temporal Association: Produce a particular output sequence in response to a specific input sequence. Applications: speech generation

Some of the methods that are used include

Tapped Delay Lines (time delay networks)
Context Units (e.g. Elman Nets, Jordan Nets)
Back propagation through time (BPTT)
Real Time Recurrent Learning (RTRL)

Tapped Delay Lines / Time Delay Neural Networks

This is one of the simplest ways of performing sequence recognition, because conventional backpropagation algorithms can be used. Downsides: memory is limited by the length of the tapped delay line. If a large number of input units is needed, then computation can be slow and many examples are needed.

A simple extension to this is to allow non-uniform sampling:

x_i(t) = x(t - ω_i)

where ω_i is the integer delay associated with component i. Thus if there are n input units, the memory is not limited simply to the previous n timesteps.
Another extension is for each "input" to be a convolution of the original input sequence. In the case of the delay line memories: Other variations for c are shown graphically below: This figure is taken from "Neural Net Architectures for Temporal Sequence Processing", by Mike Mozer.

Recurrent Networks I

Consider the following two networks:

(Fig. 1)

The network on the left is a simple feed forward network of the kind we have already met. The right hand network has an additional connection from the hidden unit to itself. What difference could this seemingly small change to the network make?

Each time a pattern is presented, the unit computes its activation just as in a feed forward network. However, its net input now contains a term which reflects the state of the network (the hidden unit activation) before the pattern was seen. When we present subsequent patterns, the hidden and output units' states will be a function of everything the network has seen so far. The network behavior is based on its history, and so we must think of pattern presentation as it happens in time.

Network topology

Once we allow feedback connections, our network topology becomes very free: we can connect any unit to any other, even to itself. Two of our basic requirements for computing activations and errors in the network are now violated. When computing activations, we required that before computing y_i, we had to know the activations of all units in the posterior set of nodes, P_i. For computing errors, we required that before computing δ_i, we had to know the errors of all units in its anterior set of nodes, A_i.

For an arbitrary unit in a recurrent network, we now define its activation at time t as:

y_i(t) = f_i(net_i(t-1))

At each time step, therefore, activation propagates forward through one layer of connections only.
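To make the update y_i(t) = f_i(net_i(t-1)) concrete, here is a small Python sketch. The three-unit chain, the tanh activation, and the way external input is simply added to the net input are assumptions chosen for illustration; the point is only that a pulse injected at the first unit needs one time step per connection to reach the last one.

```python
import math


def recurrent_step(W, y_prev, x_prev, f=math.tanh):
    """Synchronous update: y_i(t) = f(net_i(t-1)).

    net_i(t-1) is computed from the previous activations y_prev and the
    previous external input x_prev (assumed here to be added directly).
    """
    n = len(y_prev)
    return [
        f(sum(W[i][j] * y_prev[j] for j in range(n)) + x_prev[i])
        for i in range(n)
    ]


# Hypothetical chain 0 -> 1 -> 2: W[i][j] is the weight to unit i from unit j.
W = [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

y = [0.0, 0.0, 0.0]
history = []
for t in range(3):
    # A single input pulse to unit 0 at the first step, nothing afterwards.
    x = [1.0, 0.0, 0.0] if t == 0 else [0.0, 0.0, 0.0]
    y = recurrent_step(W, y, x)
    history.append(y)

for t, state in enumerate(history, start=1):
    print(t, state)
```

After the first step only unit 0 is active, after the second only unit 1, and after the third only unit 2: the activation advances through exactly one layer of connections per time step.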
Once some level of activation is present in the network, it will continue to flow around the units, even in the absence of any new input whatsoever. We can now present the network with a time series of inputs, and require that it produce an output based on this series. These networks can be used to model many new kinds of problems; however, they also present us with many new and difficult issues in training.

Before we address the new issues in training and operation of recurrent neural networks, let us first look at some sample tasks which have been attempted (or solved) by such networks.

Learning formal grammars

Given a set of strings S, each composed of a series of symbols, identify the strings which belong to a language L. A simple example: L = {a^n b^n} is the language composed of strings of any number of a's, followed by the same number of b's. Strings belonging to the language include aaabbb, ab, aaaaaabbbbbb. Strings not belonging to the language include aabbb, abb, etc. A common benchmark is the language defined by the Reber grammar. Strings which belong to a language L are said to be grammatical and are ungrammatical otherwise.

Speech recognition

In some of the best speech recognition systems built so far, speech is first presented as a series of spectral slices to a recurrent network. Each output of the network represents the probability of a specific phone (speech sound, e.g. /i/, /p/, etc), given both present and recent input. The probabilities are then interpreted by a Hidden Markov Model which tries to recognize the whole utterance. Details are provided here.

Music composition

A recurrent network can be trained by presenting it with the notes of a musical score. Its task is to predict the next note. Obviously this is impossible to do perfectly, but the network learns that some notes are more likely to occur in one context than another. Training, for example, on a lot of music by J. S.
Bach, we can then seed the network with a musical phrase, let it predict the next note, feed this back in as input, and repeat, generating new music. Music generated in this fashion typically sounds fairly convincing at a very local scale, i.e. within a short phrase. At a larger scale, however, the compositions wander randomly from key to key, and no global coherence arises. This is an interesting area for further work. The original work is described here.

The Simple Recurrent Network

One way to meet these requirements is illustrated below in a network known variously as an Elman network (after Jeff Elman, the originator), or as a Simple Recurrent Network. At each time step, a copy of the hidden layer units is made to a copy layer. Processing is done as follows:

1. Copy inputs for time t to the input units
2. Compute hidden unit activations using net input from input units and from copy layer
3. Compute output unit activations as usual
4. Copy new hidden unit activations to copy layer

In computing the activation, we have eliminated cycles, and so our requirement that the activations of all posterior nodes be known is met. Likewise, in computing errors, all trainable weights are feed forward only, so we can apply the standard backpropagation algorithm as before. The weights from the copy layer to the hidden layer play a special role in error computation. The error signal they receive comes from the hidden units, and so depends on the error at the hidden units at time t. The activations in the copy layer, however, are just the activations of the hidden units at time t-1. Thus, in training, we are considering a gradient of an error function which is determined by the activations at the present and the previous time steps.

A generalization of this approach is to copy the input and hidden unit activations for a number of previous timesteps. The more context (copy layers) we maintain, the more history we are explicitly including in our gradient computation.
This approach has become known as Back Propagation Through Time. It can be seen as an approximation to the ideal of computing a gradient which takes into consideration not just the most recent inputs, but all inputs seen so far by the network. The figure below illustrates one version of the process: The inputs and hidden unit activations at the last three time steps are stored. The solid arrows show how each set of activations is determined from the input and hidden unit activations on the previous time step. A backward pass, illustrated by the dashed arrows, is performed to determine a separate value of delta (the error of a unit with respect to its net input) for each unit at each time step. Because each earlier layer is a copy of the layer one level up, we introduce the new constraint that the weights at each level be identical. Then the partial derivative of the negative error with respect to w_ij is simply the sum of the partials calculated for the copy of w_ij between each two layers.

Elman networks and their generalization, Back Propagation Through Time, both seek to approximate the computation of a gradient based on all past inputs, while retaining the standard back prop algorithm. BPTT has been used in a number of applications (e.g. ECG modeling). The main task is to produce particular output sequences in response to specific input sequences. The downside of BPTT is that it requires a large amount of storage, computation, and training examples in order to work well. In the next section we will see how we can compute the true temporal gradient using a method known as Real Time Recurrent Learning.

Real Time Recurrent Learning

In deriving a gradient-based update rule for recurrent networks, we now leave the network connectivity almost completely unconstrained.
We simply suppose that we have a set of input units, I = {xk(t), 0<k<m}, and a set of other units, U = {yk(t), 0<k<n}, which can be hidden or output units. To index an arbitrary unit in the network we can use (1) Let W be the weight matrix with n rows and n+m columns, where wi,j is the weight to unit i (which is in U) from unit j (which is in I or U). Units compute their activations in the now familiar way, by first computing the weighted sum of their inputs: (2) where the only new element in the formula is the introduction of the temporal index t. Units then compute some non-linear function of their net input yk(t+1) = fk(netk(t)) (3) Usually, both hidden and output units will have non-linear activation functions. Note that external input at time t does not influence the output of any unit until time t+1. The network is thus a discrete dynamical system. Some of the units in U are output units, for which a target is defined. A target may not be defined for every single input however. For example, if we are presenting a string to the network to be classified as either grammatical or ungrammatical, we may provide a target only for the last symbol in the string. In defining an error over the outputs, therefore, we need to make the error time dependent too, so that it can be undefined (or 0) for an output unit for which no target exists at present. Let T(t) be the set of indices k in U for which there exists a target value dk(t) at time t. We are forced to use the notation dk instead of t here, as t now refers to time. 
Let the error at the output units be (4) and define our error function for a single time step as (5) The error function we wish to minimize is the sum of this error over all past steps of the network (6) Now, because the total error is the sum of all previous errors and the error at this time step, the gradient of the total error is likewise the sum of the gradient for this time step and the gradient for previous steps (7) As a time series is presented to the network, we can accumulate the values of the gradient, or equivalently, of the weight changes. We thus keep track of the value (8) After the network has been presented with the whole series, we alter each weight w_ij by (9) We therefore need an algorithm that computes (10) at each time step t. Since we know e_k(t) at all times (the difference between our targets and outputs), we only need to find a way to compute the second factor, the derivative of y_k(t) with respect to w_ij.

IMPORTANT

The key to understanding RTRL is to appreciate what this factor expresses. It is essentially a measure of the sensitivity of the value of the output of unit k at time t to a small change in the value of w_ij, taking into account the effect of such a change in the weight over the entire network trajectory from t0 to t. Note that w_ij does not have to be connected to unit k. Thus this algorithm is non-local, in that we need to consider the effect of a change at one place in the network on the values computed at an entirely different place. Make sure you understand this before you dive into the derivation given next.

Derivation of the sensitivity factor

This is given here for completeness, for those who wish perhaps to implement RTRL. Make sure you at least know what role the factor plays in computing the gradient. From Equations 2 and 3, we get (11) where δ denotes the Kronecker delta (12) [Exercise: Derive Equation 11 from Equations 2 and 3] Because input signals do not depend on the weights in the network, (13) Equation 11 becomes: (14) This is a recursive equation.
That is, if we know the value of the left hand side for time 0, we can compute the value for time 1, and use that value to compute the value at time 2, etc. Because we assume that our starting state (t = 0) is independent of the weights, we have (15) These equations hold for all appropriate i, j and k. We therefore need to define the values (16) for every time step t and all appropriate i, j and k. We start with the initial condition

p_ij^k(t0) = 0 (17)

and compute at each time step (18) The algorithm then consists of computing, at each time step t, the quantities p_ij^k(t) using equations 16 and 17, and then using the differences between targets and actual outputs to compute weight changes (19) and the overall correction to be applied to w_ij is given by (20)

Dynamics and RNNs

Consider the recurrent network illustrated below. A single input unit is connected to each of the three "hidden" units. Each hidden unit in turn is connected to itself and the other hidden units. As in the RTRL derivation, we do not distinguish now between hidden and output units. Any activation which enters the network through the input node can flow around from one unit to the other, potentially forever. Weights less than 1.0 will exponentially reduce the activation, weights larger than 1.0 will cause it to increase. The non-linear activation functions of the hidden units will hopefully prevent it from growing without bound.

As we have three hidden units, their activation at any given time t describes a point in a 3-dimensional state space. We can visualize the temporal evolution of the network state by watching the state evolve over time. In the absence of input, or in the presence of a steady-state input, a network will usually approach a fixed point attractor. Other behaviors are possible, however. Networks can be trained to oscillate in regular fashion, and chaotic behavior has also been observed.
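The fixed point behavior can be watched directly in a small simulation. The following Python sketch runs three fully interconnected tanh units with no input; the weight values and the starting state are made up for illustration, and were chosen small (each row of absolute weights sums to less than 1) so that the state must contract towards the origin.

```python
import math

# Hypothetical recurrent weights among three hidden units.
# W[i][j] is the weight to unit i from unit j; all are below 1.0 in size.
W = [[0.5, -0.2, 0.1],
     [0.1, 0.4, -0.3],
     [-0.2, 0.1, 0.3]]


def step(y):
    """One time step without external input: y_i <- tanh(sum_j W[i][j] y_j)."""
    return [math.tanh(sum(w * v for w, v in zip(row, y))) for row in W]


y = [0.9, -0.5, 0.3]        # arbitrary starting point in state space
trajectory = [y]
for _ in range(100):
    y = step(y)
    trajectory.append(y)

# Largest activation magnitude at the start and at the end of the run.
print(max(abs(v) for v in trajectory[0]))
print(max(abs(v) for v in trajectory[-1]))
```

Plotting the rows of `trajectory` as points in 3-dimensional space would show the state spiralling into the fixed point at the origin. Larger weights can produce other attractors, which this sketch does not attempt to show.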
The development of architectures and algorithms to generate specific forms of dynamic behavior is still an active research area.

Some limitations of gradient methods and RNNs

The simple recurrent network computed a gradient based on the present state of the network and its state one time step ago. Using Back Prop Through Time, we could compute a gradient based on some finite n time steps of network operation. RTRL provided a way of computing the true gradient based on the complete network history from time 0 to the present. Is this perfection?

Unfortunately not. With feedforward networks which have a large number of layers, the weights which are closest to the output are the easiest to train. This is no surprise, as their contribution to the network error is direct and easily measurable. Every time we back propagate an error one layer further back, however, our estimate of the contribution of a particular weight to the observed error becomes more indirect. You can think of error flowing in at the top of the network in distinct streams. Each back propagation step dilutes the error, mixing up error from distinct sources, until, far back in the network, it becomes virtually impossible to tell who is responsible for what. The error signal has become completely diluted.

With RTRL and BPTT we face a similar problem. Error is now propagated back in time, but each time step is exactly equivalent to propagating through an additional layer of a feed forward network. The result, of course, is that it becomes very difficult to assess the importance of the network state at times which lie far back in the past. Typically, gradient based networks cannot reliably use information which lies more than about 10 time steps in the past. If you now imagine an attempt to use a recurrent neural network in a real life situation, e.g.
monitoring an industrial process, where data are presented as a time series at some realistic sampling rate (say 100 Hz), it becomes clear that these networks are of limited use. The next section shows a recent model which tries to address this problem.

Long Short-Term Memory

In a recurrent network, information is stored in two distinct ways. The activations of the units are a function of the recent history of the model, and so form a short-term memory. The weights too form a memory, as they are modified based on experience, but the timescale of the weight change is much slower than that of the activations. We call this a long-term memory. The Long Short-Term Memory model [1] is an attempt to allow the unit activations to retain important information over a much longer period of time than the 10 to 12 time steps which is the limit of RTRL or BPTT models.

The figure below shows a maximally simple LSTM network, with a single input, a single output, and a single memory block in place of the familiar hidden unit. Each block has two associated gate units (details below). Each layer may, of course, have multiple units or blocks. In a typical configuration, the first layer of weights is provided from input to the blocks and gates. There are then recurrent connections from one block to other blocks and gates. Finally there are weights from the blocks to the outputs. The next figure shows the memory block in more detail.

The hidden units of a conventional recurrent neural network have now been replaced by memory blocks, each of which contains one or more memory cells. At the heart of the cell is a simple linear unit with a single self-recurrent connection with weight set to 1.0.
In the absence of any other input, this connection serves to preserve the cell's current state from one moment to the next. In addition to the self-recurrent connection, cells receive input from input units and other cells and gates. While the cells are responsible for maintaining information over long periods of time, the responsibility for deciding what information to store, and when to apply that information, lies with an input and an output gating unit, respectively.

The input to the cell is passed through a non-linear squashing function (g(x), typically the logistic function, scaled to lie within [-2,2]), and the result is then multiplied by the output of the input gating unit. The activation of the gate ranges over [0,1], so if its activation is near zero, nothing can enter the cell. Only if the input gate is sufficiently active is the signal allowed in. Similarly, nothing emerges from the cell unless the output gate is active. As the internal cell state is maintained in a linear unit, its activation range is unbounded, and so the cell output is again squashed when it is released (h(x), typical range [-1,1]). The gates themselves are nothing more than conventional units with sigmoidal activation functions ranging over [0,1], and they each receive input from the network input units and from other cells.

Thus we have:

Cell output: y_cj(t) = y_outj(t) h(s_cj(t))

where y_outj(t) is the activation of the output gate, and the state s_cj(t) is given by

s_cj(0) = 0, and s_cj(t) = s_cj(t-1) + y_inj(t) g(net_cj(t)) for t > 0.

This division of responsibility (the input gate decides what to store, the cell stores information, and the output gate decides when that information is to be applied) has the effect that salient events can be remembered over arbitrarily long periods of time. Equipped with several such memory blocks, the network can effectively attend to events at multiple time scales.
Network training uses a combination of RTRL and BPTT, and we won't go into the details here. However, consider an error signal being passed back from the output unit. If it is allowed into the cell (as determined by the activation of the output gate), it is now trapped, and it gets passed back through the self-recurrent connection indefinitely. It can only affect the incoming weights, however, if it is allowed to pass by the input gate.

On selected problems, an LSTM network can retain information over arbitrarily long periods of time; over 1000 time steps in some cases. This gives it a significant advantage over RTRL and BPTT networks on many problems. For example, a Simple Recurrent Network can learn the Reber Grammar, but not the Embedded Reber Grammar. An RTRL network can sometimes, but not always, learn the Embedded Reber Grammar after about 100 000 training sequences. LSTM always solves the Embedded problem, usually after about 10 000 sequence presentations.

One of us is currently training LSTM networks to distinguish between different spoken languages based on speech prosody (roughly: the melody and rhythm of speech).

References

[1] Hochreiter, Sepp and Schmidhuber, Juergen (1997). "Long Short-Term Memory", Neural Computation, Vol 9 (8), pp. 1735-1780.

Some possible project topics

The following are a few suggestions for project topics. They are all presented in a fairly open-ended fashion here. Potential projects will need to be decided in detail depending on your interests and the size of the planned project. In each case, the final goals and requirements will have to be decided upon together with Professor Colombetti and us. You are also free to suggest topics of your own. Please bear in mind that Nic and Fred are both normally in Lugano. They are, of course, contactable by email.

Implement Real Time Recurrent Learning (Neural Computation, 1, 270-280, 1989). Code up your own implementation of RTRL.
Reproduce the results on temporal XOR and sine wave oscillation. Examine the effects of continuous and discrete periodic inputs on a network trained to oscillate. In what way does the network entrain to an external signal?

8-3-8 Encoder

Implement a feedforward network with one hidden layer and batch backpropagation, and either momentum or the bold driver method. Set it up as an autoencoder with 8 inputs/outputs and 3 hidden units, and train it on the 8 binary patterns that consist of a single '1' and 7 zeroes. Find values for the free parameters that give you fast, reliable convergence, then compare the speed of learning and ultimate performance for the following cases:

o linear outputs, sum-squared error
o logistic outputs, sum-squared error
o logistic outputs, cross-entropy error
o softmax outputs, cross-entropy error

Online Learning

Write a program that generates training data for a neural network, such that the function the network must learn to approximate changes periodically. Then implement a neural network that obtains its training patterns from this generator, and performs online learning with local learning rate adaptation on it. Compare the network's performance for various values of the meta-learning rate.

Tic-Tac-Toe

Implement a network that uses Q-learning to learn the game of tic-tac-toe (see figure) from self-play.

[Figure: Tic-tac-toe]

Applications

You may want to consider applying a neural network as part of a project that relates to other parts of your course, for example in building or simulating a reactive agent.

Your suggestion here

Have you a favourite dataset you wish to model? Time series data to predict? Pattern recognition problem? Let us know.

Pattern Classification and Single Layer Networks: Chapter 2

Intro

We have just seen how a network can be trained to perform linear regression. That is, given a set of inputs (x) and output/target values (y), the network finds the best linear mapping from x to y.
Given an x value that we have not seen, our trained network can predict what the most likely y value will be. The ability to (correctly) predict the output for an input the network has not seen is called generalization.

This style of learning is referred to as supervised learning (or learning with a teacher) because we are given the target values. Later we will see examples of unsupervised learning, which is used for finding patterns in the data rather than modeling input/output mappings.

We now step away from linear regression for a moment and look at another type of supervised learning problem called pattern classification. We start by considering only single layer networks.

Pattern classification

A classic example of pattern classification is letter recognition. We are given, for example, a set of pixel values associated with an image of a letter. We want the computer to determine what letter it is. The pixel values are referred to as the inputs or the decision variables, and the letter categories are referred to as classes.

Now, a given letter such as "A" can look quite different depending on the type of font that is used or, in the case of handwritten letters, different people's handwriting. Thus, there will be a range of values for the decision variables that map to the same class. That is, if we plot the values of the decision variables, different regions will correspond to different classes.

Example 1: Two Classes (class 0 and class 1), Two Inputs (x1 and x2). See also: Neural Java 2 Class Problem

Example 2: Another example (see data description, data, Maple plots): class = types of iris; decision variables = sepal and petal sizes

Example 3: example of zipcode digits in Maple

Single layer Networks for Pattern Classification

We can apply a similar approach as in linear regression where the targets are now the classes. Note that the outputs are no longer continuous but rather take on discrete values.

Two Classes: What does the network look like?
If there are just 2 classes we only need 1 output node. The target is 1 if the example is in, say, class 1, and the target is 0 (or -1) if the example is in class 0. It seems reasonable that we use a binary step function to guarantee an appropriate output value.

Training Methods:

We will discuss two kinds of methods for training single-layer networks that do pattern classification:

* Perceptron - guaranteed to find the right weights if they exist
* The Adaline (uses Delta Rule) - can easily be generalized to multi-layer nets (nonlinear problems)

But how do we know if the right weights exist at all? Let's look to see what a single layer architecture can do.

Single Layer with a Binary Step Function

Consider a network with 2 inputs and 1 output node (2 classes). The net output of the network is a linear function of the weights and the inputs:

net = W X = x1 w1 + x2 w2
y = f(net)

The equation x1 w1 + x2 w2 = 0 defines a straight line through the input space:

x2 = -(w1/w2) x1

This is a line through the origin with slope -w1/w2.

Bias

What if the line dividing the 2 classes does not go through the origin? Adding a bias b to the net input shifts the line away from the origin: x1 w1 + x2 w2 + b = 0, i.e. x2 = -(w1/w2) x1 - b/w2.

Other interesting geometric points to note:

The weight vector (w1, w2) is normal to the decision boundary. Proof: Suppose z1 and z2 are points on the decision boundary. Then W z1 + b = 0 and W z2 + b = 0, so W (z1 - z2) = 0. Since z1 - z2 lies along the boundary, W is perpendicular to it.

Linear Separability

Classification problems for which there is a line that exactly separates the classes are called linearly separable. Single layer networks are only able to solve linearly separable problems. Most real world problems are not linearly separable.

The Perceptron

The perceptron learning rule is a method for finding the weights in a network. We consider the problem of supervised learning for classification although other types of problems can also be solved. A nice feature of the perceptron learning rule is that if there exists a set of weights that solves the problem, then the perceptron will find these weights.
This is true for either binary or bipolar representations.

Assumptions:

* We have a single layer network whose output is, as before, output = f(net) = f(W X), where f is a binary step function whose values are ±1.
* We assume that the bias is treated as just an extra input whose value is 1.
* p = number of training examples (x, t), where t = +1 or -1.

Geometric Interpretation:

With this binary function f, the problem reduces to finding weights such that

sign(W X) = t

That is, the weights must be chosen so that the projection of pattern X onto W has the same sign as the target t. But the boundary between positive and negative projections is just the plane W X = 0, i.e. the same decision boundary we saw before.

The Perceptron Algorithm

1. Initialize the weights (either to zero or to a small random value).
2. Pick a learning rate μ (this is a number between 0 and 1).
3. Until the stopping condition is satisfied (e.g. weights don't change):
   For each training pattern (x, t):
   o compute the output activation y = f(w x)
   o if y = t, don't change the weights
   o if y != t, update the weights: w(new) = w(old) + 2 μ t x. Equivalently, w(new) = w(old) + μ (t - y) x can be applied for all t, since t - y = 0 whenever the pattern is already classified correctly.

Consider what happens below when the training pattern p1 or p2 is chosen. Before updating the weight W, we note that both p1 and p2 are incorrectly classified (the red dashed line is the decision boundary). Suppose we choose p1 to update the weights, as in the picture below on the left. P1 has target value t=1, so the weight is moved a small amount in the direction of p1. Suppose instead we choose p2 to update the weights. P2 has target value t=-1, so the weight is moved a small amount in the direction of -p2. In either case, the new boundary (blue dashed line) is better than before.

Comments on Perceptron

The choice of learning rate does not matter because it just changes the scaling of w.
The decision surface (for 2 inputs and one bias) has equation:

x2 = -(w1/w2) x1 - w3/w2

where we have defined w3 to be the bias: W = (w1, w2, b) = (w1, w2, w3). From this we see that the equation remains the same if W is scaled by a constant.

The perceptron is guaranteed to converge in a finite number of steps if the problem is separable. It may be unstable if the problem is not separable.

Come to class for the proof! Outline: Find a lower bound L(k) for |w|^2 as a function of iteration k. Then find an upper bound U(k) for |w|^2. Then show that the lower bound grows at a faster rate than the upper bound. Since the lower bound can't be larger than the upper bound, there must be a finite k such that the weight is no longer updated. However, this can only happen if all patterns are correctly classified.

Perceptron Decision Boundaries

Two Layer Net: The above is not the most general region. Here, we have assumed the top layer is an AND function.

Problem: In general, for the 2- and 3-layer cases, there is no simple way to determine the weights.

Delta Rule

Also known by the names:

* Adaline Rule
* Widrow-Hoff Rule
* Least Mean Squares (LMS) Rule

Change from Perceptron:

* Replace the step function in the network with a continuous (differentiable) activation function, e.g. linear.
* For classification problems, use the step function only to determine the class and not to update the weights.

Note: this is the same algorithm we saw for regression. All that really differs is how the classes are determined.

Delta Rule: Training by Gradient Descent Revisited

Construct a cost function E that measures how well the network has learned. For example (one output node):

$$E = \frac{1}{2n} \sum_{i=1}^{n} (t_i - y_i)^2$$

where
n = number of examples
$t_i$ = desired target value associated with the i-th example
$y_i$ = output of the network when the i-th input pattern is presented to the network

To train the network, we adjust the weights in the network so as to decrease the cost (this is where we require differentiability).
This is called gradient descent.

Algorithm

1. Initialize the weights with some small random value.
2. Until E is within the desired tolerance, update the weights according to

$$W(new) = W(old) - \mu \frac{\partial E}{\partial W}$$

where E is evaluated at W(old), $\mu$ is the learning rate, and the gradient (for a linear output $y_i = W x_i$) is

$$\frac{\partial E}{\partial W} = -\frac{1}{n} \sum_{i=1}^{n} (t_i - y_i)\, x_i$$

More than Two Classes

If there are more than 2 classes we could still use the same network, but instead of having a binary target we can let the target take on discrete values. For example, if there are 5 classes, we could have t=1,2,3,4,5 or t= -2,-1,0,1,2. It turns out, however, that the network has a much easier time if we have one output per class. We can think of each output node as trying to solve a binary problem (the input is either in the given class or it isn't).

Doing Classification Correctly

The Old Way

When there are more than 2 classes, we have so far suggested doing the following:

* Assign one output node to each class.
* Set the target value of each node to be 1 if it is the correct class and 0 otherwise.
* Use a linear network with a mean squared error function.
* Determine the network class prediction by picking the output node with the largest value.

There are problems with this method. First, there is a disconnect between the definition of the error function and the determination of the class. A minimum error does not necessarily produce the network with the largest number of correct predictions.

By varying the above method a little bit we can remove this inconsistency. Let us start by changing the interpretation of the output:

The New Way

New Interpretation: The output $y_i$ is interpreted as the probability that i is the correct class. This means that:

* The output of each node must be between 0 and 1.
* The sum of the outputs over all nodes must be equal to 1.

How do we achieve this? There are several things to vary.

We can vary the activation function, for example, by using a sigmoid. Sigmoids range continuously between 0 and 1. Is a sigmoid a good choice?
We can vary the cost function. We need not use mean squared error (MSE). What are our other options?

To decide, let's start by thinking about what makes sense intuitively. With a linear network using gradient descent on an MSE function, we found that the weight updates were proportional to the error (t-y). This seems to make sense. If we use a sigmoid activation function, we obtain a more complicated formula:

$$\Delta w \propto (t - y)\, y\,(1 - y)\, x$$

See derivatives of activation functions to see where this comes from. This is not quite what we want. It turns out that there is a better error function/activation function combination that gives us what we want.

Error Function: Cross Entropy is defined as

$$E = -\sum_{j=1}^{c} t_j \ln y_j$$

where c is the number of classes (i.e. the number of output nodes). This equation comes from information theory and is often applied when the outputs (y) are interpreted as probabilities. We won't worry about where it comes from, but let's see if it makes sense for certain special cases.

* Suppose the network is trained perfectly so that the targets exactly match the network output. Suppose class 3 is chosen. This means that the output of node 3 is 1 (i.e. the probability is 1 that 3 is correct) and the outputs of the other nodes are 0 (i.e. the probability is 0 that class != 3 is correct). In this case, do you see that the above equation is 0, as desired?
* Suppose the network gives an output of y=.5 for all of the outputs, i.e. there is complete uncertainty about which is the correct class. It turns out that E has a maximum value in this case.

Thus, the more uncertain the network is, the larger the error E. This is as it should be.

Activation function: Softmax is defined as

$$f_i = \frac{e^{net_i}}{\sum_{j=1}^{c} e^{net_j}}$$

where $f_i$ is the activation function of the i-th output node and c is the number of classes. Note that this has the following good properties:

* it is always a number between 0 and 1
* when combined with the error function it gives a weight update proportional to (t-y):

$$-\frac{\partial E}{\partial w_{rs}} = \sum_{j=1}^{c} t_j\,(\delta_{jr} - y_r)\, x_s$$

where $\delta_{ij} = 1$ if i=j and zero otherwise.
Note that if r is the correct class then $t_r = 1$ and the RHS of the above equation reduces to $(t_r - y_r) x_s$. If q != r is the correct class then $t_r = 0$ and the above also reduces to $(t_r - y_r) x_s$. Thus we have

$$\Delta w_{rs} \propto (t_r - y_r)\, x_s$$

Look familiar?

Optimal Weight and Learning Rates for Linear Networks

Regression Revisited

Suppose we are given a set of data (x(1),y(1)), (x(2),y(2)), ..., (x(p),y(p)). If we assume that the underlying function g relating x to y is linear, then finding the best line that fits the data (linear regression) can be done algebraically. The solution is based on minimizing the squared error (Cost) between the network output and the data:

$$E = \frac{1}{2p} \sum_{i=1}^{p} \left( y^{(i)} - w\, x^{(i)} \right)^2$$

where the network output is y = w x.

Finding the best set of weights

1 input, 1 output, 1 weight:

$$\frac{dE}{dw} = -\frac{1}{p} \sum_{i=1}^{p} \left( y^{(i)} - w\, x^{(i)} \right) x^{(i)}$$

But the derivative of E is zero at the minimum, so we can solve for $w_{opt}$:

$$w_{opt} = \frac{\sum_i x^{(i)} y^{(i)}}{\sum_i (x^{(i)})^2} = \frac{\langle x y \rangle}{\langle x^2 \rangle}$$

n inputs, m outputs: nm weights

The same analysis can be done in the multi-dimensional case except that now everything becomes matrices:

$$w_{opt} = \rho\, H^{-1}, \qquad H = \langle x x^T \rangle, \qquad \rho = \langle y x^T \rangle$$

where $w_{opt}$ is an m x n matrix, H is an n x n matrix and $\rho$ is an m x n matrix.

Matrix inversion is an expensive operation. Also, if the input dimension, n, is very large then H is huge and may not even be possible to compute. If we are not able to compute the inverse Hessian, or if we don't want to spend the time, then we can use gradient descent.

Gradient Descent: Picking the Best Learning Rate

For linear networks, E is quadratic, so we can write

$$E(w) = E(w_0) + (w - w_0)^T \nabla E(w_0) + \tfrac{1}{2} (w - w_0)^T H\, (w - w_0)$$

But this is just a Taylor series expansion of E(w) about $w_0$. Now, suppose we want to determine the optimal weight, $w_{opt}$. We can differentiate E(w) and evaluate the result at $w_{opt}$, noting that $E'(w_{opt})$ is zero:

$$0 = \nabla E(w_0) + H\, (w_{opt} - w_0)$$

Solving for $w_{opt}$ we obtain:

$$w_{opt} = w_0 - H^{-1} \nabla E(w_0)$$

Comparing this to the update equation, we find that the learning "rate" that takes us directly to the minimum is equal to the inverse Hessian, which is a matrix and not a scalar. Why do we need a matrix?

2-D Example

Curvature axes aligned with the coordinate axes:

$$\Delta w_1 = -\mu_1 \frac{\partial E}{\partial w_1}, \qquad \Delta w_2 = -\mu_2 \frac{\partial E}{\partial w_2}$$

or in matrix form:

$$\Delta w = -\begin{pmatrix} \mu_1 & 0 \\ 0 & \mu_2 \end{pmatrix} \nabla E$$

$\mu_1$ and $\mu_2$ are inversely related to the size of the curvature along each axis.
Using the above learning rate matrix has the effect of scaling the gradient differently along each axis to make the surface "look" spherical.

If the axes are not aligned with the coordinate axes, then we need a full matrix of learning rates. This matrix is just the inverse Hessian. In general, $H^{-1}$ is not diagonal. We can obtain the curvature along each axis, however, by computing the eigenvalues of H. Anyone remember what eigenvalues are?

Taking a Step Back

We have been spending a lot of time on some pretty tough math. Why? Because training a network can take a long time if you just blindly apply the basic algorithms. There are techniques that can improve the rate of convergence by orders of magnitude. However, understanding these techniques requires a deep understanding of the underlying characteristics of the problem (i.e. the mathematics). Knowing what speed-up techniques to apply can make the difference between having a net that takes 100 iterations to train vs. 10000 iterations to train (assuming it trains at all).

The previous slides are trying to make the following point for linear networks (i.e. those networks whose cost function is a quadratic function of the weights):

1. The shape of the cost surface has a significant effect on how fast a net can learn. Ideally, we want a spherically symmetric surface.
2. The correlation matrix is defined as the average over all inputs of $x x^T$.
3. The Hessian is the second derivative of E with respect to w. For linear nets, the Hessian is the same as the correlation matrix.
4. The Hessian tells you about the shape of the cost surface:
5. The eigenvalues of H are a measure of the steepness of the surface along the curvature directions.
6. A large eigenvalue => steep curvature => need a small learning rate.
7. The learning rate should be proportional to 1/eigenvalue.
8.
If we are forced to use a single learning rate for all weights, then we must use a learning rate that will not cause divergence along the steep directions (large eigenvalue directions). Thus, we must choose a learning rate that is on the order of $1/\lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue.
9. If we can use a matrix of learning rates, this matrix is proportional to $H^{-1}$.
10. For real problems (i.e. nonlinear), you don't know the eigenvalues so you just have to guess. Of course, there are algorithms that will estimate $\lambda_{max}$. We won't be considering these here.
11. An alternative solution to speeding up learning is to transform the inputs (that is, x -> Px, for some transformation matrix P) so that the resulting correlation matrix, $\langle (Px)(Px)^T \rangle$, is equal to the identity.
12. The above suggestions are only really true for linear networks. However, the cost surface of nonlinear networks can be modeled as a quadratic in the vicinity of the current weight. We can then apply similar techniques to those above; however, they will only be approximations.
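To make points 6 to 9 concrete, here is a small sketch on a two-weight quadratic cost $E = \frac{1}{2} w^T H w$. The matrix H, the starting weights and the step counts are made-up values for illustration only; the eigenvalues are computed with the closed form for a symmetric 2x2 matrix.

```python
import math

def eigenvalues_sym_2x2(H):
    # Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]].
    a, b, d = H[0][0], H[0][1], H[1][1]
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    return mean + radius, mean - radius

def gradient(H, w):
    # For E = 1/2 w^T H w the gradient is H w.
    return [H[0][0] * w[0] + H[0][1] * w[1],
            H[1][0] * w[0] + H[1][1] * w[1]]

def descend(H, w, mu, steps):
    # Plain gradient descent with a single scalar learning rate mu.
    for _ in range(steps):
        g = gradient(H, w)
        w = [w[0] - mu * g[0], w[1] - mu * g[1]]
    return w

def inverse_2x2(H):
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [[H[1][1] / det, -H[0][1] / det],
            [-H[1][0] / det, H[0][0] / det]]

def norm(w):
    return math.hypot(w[0], w[1])

H = [[5.5, 4.5], [4.5, 5.5]]          # hypothetical Hessian, eigenvalues 10 and 1
lam_max, lam_min = eigenvalues_sym_2x2(H)
w0 = [1.0, 0.0]

w_safe = descend(H, w0, mu=1.0 / lam_max, steps=200)   # converges, slowly along the flat direction
w_bad = descend(H, w0, mu=2.5 / lam_max, steps=50)     # too large: diverges along the steep direction

# Using the inverse Hessian as a learning rate matrix reaches the minimum in one step.
Hinv = inverse_2x2(H)
g0 = gradient(H, w0)
w_newton = [w0[0] - (Hinv[0][0] * g0[0] + Hinv[0][1] * g0[1]),
            w0[1] - (Hinv[1][0] * g0[0] + Hinv[1][1] * g0[1])]

print(lam_max, lam_min, norm(w_safe), norm(w_bad), norm(w_newton))
```

With the scalar rate, the flat direction (eigenvalue 1) shrinks by only a factor of 0.9 per step, which is why the ratio of largest to smallest eigenvalue governs how slow training is.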
Summary of Linear Nets

Characteristics of Networks

* number of layers
* number of nodes per layer
* activation function (linear, binary, softmax)
* error function (mean squared error (MSE), cross entropy)
* type of learning algorithms (gradient descent, perceptron, delta rule)

Types of Applications and Associated Nets

Regression:
o uses a one-layer linear network (activation function is identity)
o uses MSE cost function
o uses gradient descent learning

Classification - Perceptron Learning
o uses a one-layer network with a binary step activation function
o uses MSE cost function
o uses the perceptron learning algorithm (identical with gradient descent when targets are +1 and -1)

Classification - Delta Rule
o uses a one-layer network with a linear activation function
o uses MSE cost function
o uses gradient descent
o the network chooses the class by picking the output node with the largest output

Classification - Gradient Descent (the right way)
o uses a one-layer network with a softmax activation function
o uses the cross entropy error function
o outputs are interpreted as probabilities
o the network chooses the class with the highest probability

Modes of Learning for Gradient Descent

Batch
o At each iteration, the gradient is computed by averaging over all inputs.

Online (stochastic)
o At each iteration, the gradient is estimated by picking one (or a small number) of inputs.
o Because the gradient is only being estimated, there is a lot of noise in the weight updates. The error comes down quickly but then tends to jiggle around. To remove this noise one can switch to batch at the point where the error levels out, or continue to use online but decrease the learning rate (called annealing the learning rate). One way to anneal is to use $\mu = \mu_0 / t$, where $\mu_0$ is the original learning rate and t is the number of timesteps after annealing is turned on.
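The effect of annealing can be seen in a deliberately tiny setting: a single weight w with constant input x = 1, trained online on noisy targets. This is a sketch under assumed values (targets scattered around 2, $\mu_0 = 1$, annealing switched on from the first step), and the helper names are hypothetical. In this special case the annealed rule $\mu = \mu_0/t$ makes w exactly the running mean of the targets seen so far, while a constant rate keeps jiggling.

```python
import random

def online_weight_history(targets, mu0, anneal):
    """Online delta-rule updates for one weight w with constant input x = 1."""
    w = 0.0
    history = []
    for step, t in enumerate(targets, start=1):
        mu = mu0 / step if anneal else mu0   # annealed rate mu0/t, or constant mu0
        w = w + mu * (t - w) * 1.0           # delta rule: w += mu * (t - y) * x, with y = w * x
        history.append(w)
    return history

def jiggle(history, last=100):
    # Spread of the weight over the last few updates.
    tail = history[-last:]
    return max(tail) - min(tail)

rng = random.Random(0)                       # fixed seed so the run is repeatable
targets = [2.0 + rng.uniform(-1.0, 1.0) for _ in range(2000)]

constant = online_weight_history(targets, mu0=0.5, anneal=False)
annealed = online_weight_history(targets, mu0=1.0, anneal=True)

print("constant rate: final w = %.4f, jiggle = %.4f" % (constant[-1], jiggle(constant)))
print("annealed rate: final w = %.4f, jiggle = %.4f" % (annealed[-1], jiggle(annealed)))
```

The price of annealing is that the weight responds more and more slowly to new data, so it suits stationary problems better than ones where the target function changes over time.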
Picking Learning Rates

* Learning rates that are too big cause the algorithm to diverge.
* Learning rates that are too small cause the algorithm to converge very slowly.
* The optimal learning rate for linear networks is $H^{-1}$, where H is the Hessian, defined as the second derivative of the cost function with respect to the weights. Unfortunately, this is a matrix whose inverse can be costly to compute.
* The best learning rate for batch is the inverse Hessian.

More details if you are interested:

o The next best thing is to use a separate learning rate for each weight. If the Hessian is diagonal these learning rates are just one over the eigenvalues of the Hessian. Fat chance that the Hessian is diagonal though!
o If using a single scalar learning rate, then the best one to use is 1 over the largest eigenvalue of the Hessian. There are fairly inexpensive algorithms for estimating this. However, many people just use the ol' brute force method of picking the learning rate: trial and error.
o For linear networks the Hessian is $\langle x x^T \rangle$ and is independent of the weights. For nonlinear networks (i.e. any network that has an activation function that isn't the identity), the Hessian depends on the value of the weights and so changes every time the weights are updated. Arrgh! That is why people love the trial and error approach.

Limitations of Linear Networks

* For regression, we can only fit a straight line through the data points. Many problems are not linear.
* For classification, we can only lay down linear boundaries between classes. This is inadequate for most real world problems.

Where do we go next? Multilayer Nonlinear Networks!

Multilayer Networks and Backpropagation

Introduction

Much early research in networks was abandoned because of the severe limitations of single layer linear networks. Multilayer networks were not "discovered" until much later but even then there were no good training algorithms.
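It was not until the 1980s that backpropagation became widely known.

People in the field joke about this because backprop is really just applying the chain rule to compute the gradient of the cost function. How many years should it take to rediscover the chain rule? Of course, it isn't really this simple. Backprop also refers to the very efficient method that was discovered for computing the gradient.

Note: Multilayer nets are much harder to train than single layer networks. That is, convergence is much slower and speed-up techniques are more complicated.

Method of Training: Backpropagation

Define a cost function (e.g. mean square error), here written for a single pattern:

$$E = \frac{1}{2} \sum_{j=1}^{m} (t_j - y_j)^2$$

where the activation y at the output layer is given by

$$y_j = f_2(net_j) = f_2\Big( \sum_{i} W_{ji}\, z_i \Big)$$

and where
z is the activation at the hidden nodes, $z_i = f_1(net_i) = f_1\big( \sum_k w_{ik}\, x_k \big)$
$f_2$ is the activation function at the output nodes
$f_1$ is the activation function at the hidden nodes.

Written out more explicitly, the cost function is

$$E = \frac{1}{2} \sum_{j} \Big( t_j - f_2\Big( \sum_i W_{ji}\, z_i \Big) \Big)^2$$

or all at once:

$$E = \frac{1}{2} \sum_{j} \Big( t_j - f_2\Big( \sum_i W_{ji}\, f_1\Big( \sum_k w_{ik}\, x_k \Big) \Big) \Big)^2$$

Computing the gradient: for the hidden-to-output weights:

$$\frac{\partial E}{\partial W_{ji}} = -(t_j - y_j)\, f_2'(net_j)\, z_i$$

the gradient: for the input-to-hidden weights:

$$\frac{\partial E}{\partial w_{ik}} = -\sum_{j} (t_j - y_j)\, f_2'(net_j)\, W_{ji}\, f_1'(net_i)\, x_k$$

Summary of Gradients

hidden-to-output weights:

$$\frac{\partial E}{\partial W_{ji}} = -\delta^2_j\, z_i \quad \text{where} \quad \delta^2_j = (t_j - y_j)\, f_2'(net_j)$$

input-to-hidden:

$$\frac{\partial E}{\partial w_{ik}} = -\delta^1_i\, x_k \quad \text{where} \quad \delta^1_i = f_1'(net_i) \sum_{j} W_{ji}\, \delta^2_j$$

Implementing Backprop

Create variables for:
* the weights W and w,
* the net input to each hidden and output node, $net_i$
* the activation of each hidden and output node, $y_i = f(net_i)$
* the "error" at each node, $\delta_i$

For each input pattern k:

Step 1: Forward Propagation
Compute $net_i$ and $y_i$ for each hidden node, i=1,...,h.
Compute $net_j$ and $y_j$ for each output node, j=1,...,m.

Step 2: Backward Propagation
Compute the $\delta^2$'s for each output node, j=1,...,m.
Compute the $\delta^1$'s for each hidden node, i=1,...,h.

Step 3: Accumulate gradients over the input patterns (batch):
$\Delta W_{ji} \leftarrow \Delta W_{ji} + \delta^2_j\, z_i$, and $\Delta w_{ik} \leftarrow \Delta w_{ik} + \delta^1_i\, x_k$.

Step 4: After doing steps 1 to 3 for all patterns, we can now update the weights:
$W_{ji} \leftarrow W_{ji} + \mu\, \Delta W_{ji}$, and $w_{ik} \leftarrow w_{ik} + \mu\, \Delta w_{ik}$.

Networks with more than 2 layers

The above learning procedure (backpropagation) can easily be extended to networks with any number of layers.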
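Steps 1 to 4 can be sketched in plain Python as follows. This is an illustrative sketch, not the course's reference implementation: it assumes logistic hidden units ($f_1$), linear output units ($f_2$), a per-pattern cost of one half the squared error, a bias supplied as an extra input fixed to 1, and gradients averaged over the batch. The weight values and the XOR-style data are made up for the demo.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w, W, x):
    """Step 1: hidden activations z (logistic f1) and outputs y (linear f2).
    w[i][k] are input-to-hidden weights, W[j][i] are hidden-to-output weights."""
    z = [logistic(sum(w_ik * x_k for w_ik, x_k in zip(w_i, x))) for w_i in w]
    y = [sum(W_ji * z_i for W_ji, z_i in zip(W_j, z)) for W_j in W]
    return z, y

def cost(w, W, x, t):
    _, y = forward(w, W, x)
    return 0.5 * sum((t_j - y_j) ** 2 for t_j, y_j in zip(t, y))

def backprop(w, W, x, t):
    """Steps 1-2 for one pattern; returns (dE/dw, dE/dW)."""
    z, y = forward(w, W, x)
    delta2 = [t_j - y_j for t_j, y_j in zip(t, y)]                  # f2 linear, so f2' = 1
    delta1 = [z_i * (1.0 - z_i) * sum(W[j][i] * delta2[j] for j in range(len(W)))
              for i, z_i in enumerate(z)]                           # f1 logistic, so f1' = z(1 - z)
    grad_W = [[-d2 * z_i for z_i in z] for d2 in delta2]
    grad_w = [[-d1 * x_k for x_k in x] for d1 in delta1]
    return grad_w, grad_W

def train_epoch(w, W, data, mu):
    """Steps 3-4: accumulate gradients over all patterns, then update once."""
    acc_w = [[0.0] * len(row) for row in w]
    acc_W = [[0.0] * len(row) for row in W]
    for x, t in data:
        g_w, g_W = backprop(w, W, x, t)
        for i, row in enumerate(g_w):
            for k, g in enumerate(row):
                acc_w[i][k] += g
        for j, row in enumerate(g_W):
            for i, g in enumerate(row):
                acc_W[j][i] += g
    n = len(data)
    new_w = [[w[i][k] - mu * acc_w[i][k] / n for k in range(len(w[i]))] for i in range(len(w))]
    new_W = [[W[j][i] - mu * acc_W[j][i] / n for i in range(len(W[j]))] for j in range(len(W))]
    return new_w, new_W

def total_cost(w, W, data):
    return sum(cost(w, W, x, t) for x, t in data)

# Hypothetical starting weights: 3 inputs (bias, a, b), 2 hidden nodes, 1 output.
w = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
W = [[0.7, -0.8]]
data = [([1.0, a, b], [float(a != b)]) for a in (0.0, 1.0) for b in (0.0, 1.0)]

before = total_cost(w, W, data)
for _ in range(50):
    w, W = train_epoch(w, W, data, mu=0.1)
after = total_cost(w, W, data)
print("cost before: %.4f, after 50 batch epochs: %.4f" % (before, after))
```

A useful habit when writing backprop by hand is to compare the returned gradients with finite differences of `cost`, perturbing one weight at a time; the two should agree to several decimal places.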
Online vs Batch for Non-Linear Networks

Making a Lot of Noise

Disadvantage of Noise in Online Updates

We have seen that online can often be much faster than batch early in the training process. However, the noise in the updates causes the network to bounce around near the minimum and never converge to the very bottom.

Solution: anneal the learning rate, or switch to batch once the error levels out.

The Advantage of Noise

In linear networks the cost function is in the nice shape of a bowl. There is a single minimum. In nonlinear networks, however, the cost surface can be very complex. There can be many minima, valleys and plateaus, which make training very difficult. Batch gradient descent will simply move to the bottom of the local minimum it randomly starts in. If it is on a plateau, the gradient may be very small and so learning takes a very long time.

Valleys are common when using sigmoids. Consider what happens when sigmoids are added. Below, the green sigmoid is added to the blue to obtain the red. Now, look what can happen in 2 dimensions. We obtain a valley that can be difficult to escape from.

The noise in online makes it possible to escape from local minima and plateaus. It can help somewhat with valleys as well.

Too Much of a Good Thing: OverTraining

The good news is that multilayer networks can approximate any smooth function as long as you have enough hidden nodes.

The bad news is that this added flexibility can cause the network to learn the noise in the data.

Consider regression and classification problems where you have a collection of noisy data. The solid line is the "true" function or class boundary and the +'s and o's are the data. If you have lots of hidden nodes you may find that the network "discovers" the function (dotted lines) given below. In the above example, the network has not only learned the function but it has also learned the noise present in the data. When the net has learned the noise, we say it has overtrained.
The reason for this name is that as a net trains it first learns the rough structure of the data. As it continues to learn, it will pick up the details (i.e. the noise).

Generalization

Why is overtraining a problem? The whole purpose of training these nets is to be able to predict the function output (regression) or class (classification) for inputs that the net has never seen before (i.e. was not trained on). A network is said to generalize well if it can accurately predict the correct output on data it has never seen.

Preventing Overtraining

There are several ways to prevent overtraining:

* Training for less time. The method for doing this is called early stopping.
* Reducing the number of hidden nodes reduces the number of parameters (weights) so that the net is not able to learn as much detail. Problems are
  * what is the right number of nodes?
  * there is reason to believe that better solutions can be found by having too many hidden nodes rather than too few. Often, it is better to start with a big net, train, and then carefully prune the net so that it is smaller (one version of pruning is called optimal brain damage).
* Instead of reducing the number of weights, people put constraints on the weights so that there are effectively fewer parameters. One example of this is weight decay. Weight decay pushes the weights toward zero. Note that this corresponds to the linear region of the sigmoid.

Momentum

We saw that if the cost surface is not spherical, learning can be quite slow because the learning rate must be kept small to prevent divergence along the steep curvature directions.

One way to solve this is to use the inverse Hessian (the Hessian being the correlation matrix for linear nets) as the learning rate matrix. This can be problematic because the Hessian can be a large matrix that is difficult to invert. Also, for multilayer networks, the Hessian is not constant (i.e. it changes as the weights change).
Recomputing the inverse Hessian at each iteration would be prohibitively expensive and not worth the extra computation. However, a much simpler approach is to add a momentum term:

$$w(t+1) = w(t) - \mu \frac{\partial E}{\partial w} + \beta\, \big( w(t) - w(t-1) \big)$$

where w(t) is the weight at the t-th iteration. Written another way,

$$\Delta w(t+1) = -\mu \frac{\partial E}{\partial w} + \beta\, \Delta w(t)$$

where $\Delta w(t) = w(t) - w(t-1)$. Thus, the amount you change the weight is proportional to the negative gradient plus the previous weight change. $\beta$ is called the momentum parameter and must satisfy $0 \le \beta < 1$.

Momentum Example

Consider the oscillatory behavior shown above. The gradient changes sign at each step. By adding in a small amount of the previous weight change, we can lessen the oscillations. Suppose $\mu = 0.8$, w(0) = 10, and $E = w^2$, so that $w_{min} = 0$ and dE/dw = 2w.

No Momentum ($\beta = 0$):
t = 0: Δw(1) = -0.8 dE/dw = -0.8 (20) = -16, w(1) = 10 - 16 = -6
t = 1: Δw(2) = -0.8 dE/dw = -0.8 (-12) = 9.6, w(2) = -6 + 9.6 = 3.6
t = 2: Δw(3) = -0.8 dE/dw = -0.8 (7.2) = -5.76, w(3) = 3.6 - 5.76 = -2.16

With Momentum ($\beta = 0.1$):
t = 0: Δw(1) = -0.8 dE/dw + 0.1 Δw(0) = -0.8 (20) + 0.1 (0) = -16, w(1) = 10 - 16 = -6
t = 1: Δw(2) = -0.8 dE/dw + 0.1 Δw(1) = -0.8 (-12) + 0.1 (-16) = 8, w(2) = -6 + 8 = 2
t = 2: Δw(3) = -0.8 dE/dw + 0.1 Δw(2) = -0.8 (4) + 0.1 (8) = -2.4, w(3) = 2 - 2.4 = -0.4

Delta-Bar-Delta (Jacobs)

Since the cost surface for multi-layer networks can be complex, choosing a learning rate can be difficult. What works in one location of the cost surface may not work well in another location. Delta-Bar-Delta is a heuristic algorithm for modifying the learning rate as training progresses:

* Each weight has its own learning rate.
* For each weight: the gradient at the current timestep is compared with the gradient at the previous step (actually, previous gradients are averaged).
* If the gradient is in the same direction the learning rate is increased.
* If the gradient is in the opposite direction the learning rate is decreased.
* Should be used with batch only.
Let $g_{ij}(t)$ = gradient of E wrt $w_{ij}$ at time t, then define the averaged gradient

$$\bar{g}_{ij}(t) = (1 - \theta)\, g_{ij}(t) + \theta\, \bar{g}_{ij}(t-1)$$

Then the learning rate $\mu_{ij}$ for weight $w_{ij}$ at time t+1 is given by

$$\mu_{ij}(t+1) = \begin{cases} \mu_{ij}(t) + \kappa & \text{if } \bar{g}_{ij}(t-1)\, g_{ij}(t) > 0 \\ (1 - \phi)\, \mu_{ij}(t) & \text{if } \bar{g}_{ij}(t-1)\, g_{ij}(t) < 0 \\ \mu_{ij}(t) & \text{otherwise} \end{cases}$$

where $\kappa$, $\phi$, and $\theta$ are chosen by hand.

Downsides:

* Knowing how to choose the parameters $\kappa$, $\phi$, and $\theta$ is not easy.
* Doesn't work for online.

Unsupervised Learning

Up until now we have discussed how to train nets given a training set of input and target values. The target value is often called the teacher signal because it represents the "right answer", i.e. what the output of the net should be. Training with a teacher signal is called supervised learning.

We can also train nets on inputs where there is no teacher signal. The purpose might be to

* discover underlying structure of the data
* encode the data
* compress the data
* transform the data

This kind of learning is called unsupervised learning because there is no explicit teacher signal.

Examples of unsupervised learning

* Hebbian learning: w(t+1) = w(t) + μ y(t) x(t). This moves w toward infinity in the direction of the eigenvector with the largest eigenvalue of the correlation matrix. A more stable version is Oja's rule: w(t+1) = w(t) + μ (x(t) - y(t) w(t)) y(t)
* principal component analysis
* competitive learning
* vector quantization

Linear Data Compression

Goal: To find a low dimensional representation of the data

Example: Saving Space

In general, the data does not lie perfectly on a linear subspace. In this case, some information is lost when the data is compressed. The problem here is to find the compression direction that results in the least amount of information being lost.

Principal Component Analysis (PCA)

The first principal direction corresponds to the direction of largest variance of the data. It is the eigenvector associated with the largest eigenvalue of the correlation matrix ($\langle x x^T \rangle$).
If we have n dimensional data, we can compress it down to m dimensions by projecting it onto the space spanned by the eigenvectors of the m largest eigenvalues. The method used for finding these directions is called Principal Component Analysis (PCA).

Finding the Principal Components using an Autoassociative Network

An autoassociative network is a network whose inputs and targets are the same. That is, the net must find a mapping from an input to itself.

Why do this? Well, when the number of hidden nodes is smaller than the number of input nodes, the network is forced to learn an efficient low dimensional representation of the data.

See Maple example of the above network.

Example: Image Compression (Cottrell et al, 87)

* 64 inputs: 8x8 pixel regions of an image specified to 8 bit precision
* 16 hidden units
* 64 outputs: targets = inputs

Trained on randomly selected patches of an image (150,000 training steps). It was then tested on the entire image, patch by patch, using the entire set of non-overlapping patches. See "Fundamentals of Artificial Neural Networks", Hassoun, pp. 247-253.

They found that nonlinearity in the hidden units gave no advantage (this was later confirmed theoretically).

Nonlinear Compression Techniques

Two layer networks perform a projection of the data onto a linear subspace. In this case, the encoding and decoding portions of the network are really single layer linear networks.

This works well in some cases. However, many datasets lie on lower dimensional subspaces that are not linear.

Example: A helix is 1-D; however, it does not lie on a 1-D linear subspace.

To solve this problem we can let the encoding and decoding portions each be multilayer networks. In this way we obtain nonlinear projections of the data.
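To see why a linear projection struggles with the helix, the sketch below applies plain linear PCA to points sampled from a helix and measures how much variance is lost when keeping m = 1, 2 or 3 principal directions. The helix parameters are made up, the data is centred first, and the eigenvectors of $\langle x x^T \rangle$ are found by power iteration with orthogonalisation against the directions already found; that numerical method is chosen here only to keep the example dependency-free.

```python
import math

def helix(n=200, turns=3.0):
    # Points on a helix: intrinsically 1-D (one parameter s) but curled through 3-D.
    pts = []
    for i in range(n):
        s = i / (n - 1)
        a = 2.0 * math.pi * turns * s
        pts.append([math.cos(a), math.sin(a), s])
    return pts

def center(pts):
    n, d = len(pts), len(pts[0])
    mean = [sum(p[k] for p in pts) / n for k in range(d)]
    return [[p[k] - mean[k] for k in range(d)] for p in pts]

def correlation_matrix(pts):
    # <x x^T>: average outer product over all (centred) inputs.
    n, d = len(pts), len(pts[0])
    return [[sum(p[r] * p[c] for p in pts) / n for c in range(d)] for r in range(d)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def principal_directions(C, m, iters=2000):
    """Top-m eigenvectors of C by power iteration, kept orthogonal to earlier ones."""
    d = len(C)
    found = []
    for _ in range(m):
        v = [1.0 / (k + 1) for k in range(d)]        # arbitrary starting vector
        for _ in range(iters):
            v = [dot(row, v) for row in C]           # multiply by C
            for u in found:                          # remove components already found
                proj = dot(u, v)
                v = [a - proj * b for a, b in zip(v, u)]
            length = math.sqrt(dot(v, v))
            v = [a / length for a in v]
        found.append(v)
    return found

def reconstruction_error(pts, dirs):
    """Mean squared distance between each point and its projection onto span(dirs)."""
    total = 0.0
    for p in pts:
        approx = [0.0] * len(p)
        for v in dirs:
            c = dot(p, v)
            approx = [a + c * b for a, b in zip(approx, v)]
        total += sum((a - b) ** 2 for a, b in zip(p, approx))
    return total / len(pts)

data = center(helix())
C = correlation_matrix(data)
dirs = principal_directions(C, 3)
total_var = reconstruction_error(data, [])           # nothing kept: error equals total variance

for m in (1, 2, 3):
    err = reconstruction_error(data, dirs[:m])
    print("m = %d: fraction of variance lost = %.3f" % (m, err / total_var))
```

Even though the helix is parameterised by a single number, one linear component discards a large share of the variance here, which is the motivation for making the encoder and decoder nonlinear.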
5-Layer Networks:

Example: Hemisphere (from Fast Nonlinear Dimension Reduction, Nanda Kambhatla, NIPS93)

Compressing a hemisphere onto 2 dimensions

Example: Faces (from Fast Nonlinear Dimension Reduction, Nanda Kambhatla, NIPS93)

In the examples below, the original images consisted of 64x64 8-bit/pixel grayscale images. The first 50 principal components were extracted to form the image you see on the left. This was reduced to 5 dimensions using linear PCA to obtain the image in the center. The same image on the left was also reduced to 5 dimensions using a 5-layer (50-40-5-40-50) network to produce the image on the right.

[Figure: Face 1 and Face 2, each shown with 50 principal components, 5 principal components, and 5 nonlinear components]

Simple Competitive Learning

In competitive networks, output units compete for the right to respond.

Goal: a method of clustering. Divide the data into a number of clusters such that the inputs in the same cluster are in some sense similar.

A basic competitive learning network has one layer of input nodes and one layer of output nodes. Binary valued outputs are often (but not always) used. There are as many output nodes as there are classes.

Often (but not always) there are lateral inhibitory connections between the output nodes. (In simulations, the function of the lateral connections can be replaced with a different algorithm.)

The output units are also often called grandmother cells. The term grandmother cell comes from discussions as to whether your brain might contain cells that fire only when you encounter your maternal grandmother, or whether such higher level concepts are more distributed.

Vector Quantization (VQ)

Vector quantization is one example of competitive learning. The goal here is to have the network "discover" structure in the data by finding how the data is clustered. The results can be used for data encoding and compression.
One such method for doing this is called vector quantization. In vector quantization, we assume there is a codebook which is defined by a set of M prototype vectors. (M is chosen by the user and the initial prototype vectors are chosen arbitrarily.) An input belongs to cluster i if i is the index of the closest prototype (closest in the sense of the normal Euclidean distance). This has the effect of dividing up the input space into a Voronoi tessellation.

Implementing Vector Quantization with a Network

Algorithm:
1. Choose the number of clusters M.
2. Initialize the prototypes $w^*_1, \ldots, w^*_M$ (one simple method for doing this is to randomly choose M vectors from the input data).
3. Repeat until the stopping criterion is satisfied:
   o Randomly pick an input x.
   o Determine the "winning" node k by finding the prototype vector that satisfies $|w^*_k - x| \le |w^*_i - x|$ (for all i). Note: if the prototypes are normalized, this is equivalent to maximizing $w^*_i \cdot x$.
   o Update only the winning prototype weights according to $w^*_k(\text{new}) = w^*_k(\text{old}) + \mu\,(x - w^*_k(\text{old}))$, where $\mu$ is the learning rate.

This is called the standard competitive learning rule. See Maple Example.

VQ and Data Compression

Vector quantization can be used for (lossy) data compression. If we are sending information over a phone line, we
o initially send the codebook vectors
o for each input, send the index of the class that the input belongs to

For a large amount of data, this can be a significant reduction. If M=64, then it takes only 6 bits to encode the index. If the data itself consists of floating point numbers (4 bytes = 32 bits), there is a reduction of about 81% ( 100*(1 - 6/32) = 81.25 ).

Learning Vector Quantization (LVQ)

This is a supervised version of vector quantization. Classes are predefined and we have a set of labelled data. The goal is to determine a set of prototypes that best represent each class.

Kohonen's Self-Organizing Map (SOM)

Kohonen's SOMs are a type of unsupervised learning.
The goal is to discover some underlying structure of the data. However, the kind of structure we are looking for is very different from that found by, say, PCA or vector quantization.

Kohonen's SOM is called a topology-preserving map because there is a topological structure imposed on the nodes in the network. A topological map is simply a mapping that preserves neighborhood relations.

In the nets we have studied so far, we have ignored the geometrical arrangement of output nodes. Each node in a given layer has been identical in that each is connected with all of the nodes in the upper and/or lower layer. We are now going to take into consideration the physical arrangement of these nodes. Nodes that are "close" together are going to interact differently than nodes that are "far" apart.

What do we mean by "close" and "far"? We can think of organizing the output nodes in a line or in a planar configuration.

The goal is to train the net so that nearby outputs correspond to nearby inputs. E.g. if $x_1$ and $x_2$ are two input vectors and $t_1$ and $t_2$ are the locations of the corresponding winning output nodes, then $t_1$ and $t_2$ should be close if $x_1$ and $x_2$ are similar. A network that performs this kind of mapping is called a feature map.

In the brain, neurons tend to cluster in groups. The connections within the group are much more numerous than the connections with neurons outside of the group. Kohonen's network tries to mimic this in a simple way.

Algorithm for Kohonen's Self-Organizing Map

o Assume output nodes are connected in an array (usually 1 or 2 dimensional).
o Assume that the network is fully connected: all nodes in the input layer are connected to all nodes in the output layer.
o Use the competitive learning algorithm as follows:
   o Randomly choose an input vector x.
   o Determine the "winning" output node i, where $w_i$ is the weight vector connecting the inputs to output node i: $|w_i - x| \le |w_k - x|$ for all k. Note: the above equation is equivalent to $w_i \cdot x \ge w_k \cdot x$ only if the weights are normalized.
Given the winning node i, the weight update is

$$\Delta w_k = \mu\, \Lambda(i,k)\, (x - w_k) \quad \text{for all } k$$

where $\mu$ is the learning rate and $\Lambda(i,k)$ is called the neighborhood function. It has value 1 when i=k and falls off with the distance $|r_k - r_i|$ between units i and k in the output array. Thus, units close to the winner, as well as the winner itself, have their weights updated appreciably. Weights associated with far away output nodes do not change significantly. It is here that the topological information is supplied. Nearby units receive similar updates and thus end up responding to nearby input patterns.

The above rule drags the weight vector $w_i$ and the weights of nearby units towards the input x.

An example of the neighborhood function is

$$\Lambda(i,k) = \exp\left( -\frac{|r_k - r_i|^2}{2\sigma^2} \right)$$

where $\sigma^2$ is the width parameter that can gradually be decreased over time.

Reinforcement Learning

Learning with a Critic

In supervised learning we have assumed that there is a target output value for each input value. However, in many situations, there is less detailed information available. In extreme situations, there is only a single bit of information after a long sequence of inputs telling whether the output is right or wrong. Reinforcement learning is one method developed to deal with such situations.

Reinforcement learning (RL) is a kind of supervised learning in that some feedback from the environment is given. However, the feedback signal is only evaluative, not instructive. Reinforcement learning is often called learning with a critic as opposed to learning with a teacher.

Learning from Interaction

Humans learn by interacting with the environment. When a baby plays, it waves its arms around, touches things, tastes things, etc. There is no explicit teacher but there is a sensori-motor connection to its environment. Such a connection provides information about cause and effect, the consequences of actions, and what to do to achieve goals.

Learning from interaction with our environment is a fundamental idea underlying most theories of learning.
RL has rich roots in the psychology of animal learning, from which it gets its name.

The growing interest in RL comes in part from the desire to build intelligent systems that must operate in dynamically changing real-world environments. Robotics is the common example.

Environment

In RL, it is common to think explicitly of a network functioning in an environment. The environment supplies inputs to the network, receives output, and then provides a reinforcement signal.

In the most general case, the environment may itself be governed by a complicated dynamical process. Both reinforcement signals and input patterns may depend arbitrarily on the past history of the network's output.

The classic problem is in game theory, where the "environment" is actually another player or players.

Temporal Credit Assignment Problem

A network designed to play chess would receive a reinforcement signal (win or lose) after a long sequence of moves. The question that arises is: How do we assign credit or blame individually to each move in a sequence that leads to an eventual victory or loss?

This is called the temporal credit assignment problem, in contrast with the structural credit assignment problem where we must attribute network error to different weights.

Learning and Planning

So far in this course we have not discussed the issue of planning. The networks we have seen are simply learning a direct relationship between an input and an output. RL is our first look at networks that in some sense decide a course of action by considering possible future actions before they are actually experienced.

Related Work

RL is closely related to
o dynamic programming methods
o state-space planning methods used in AI

Exploration vs Exploitation

RL is learning what to do (how to map situations to actions) so as to maximize a scalar reward signal.
There are two important features:
o trial-and-error search: the learner is not told what actions to take
o delayed reward: actions can affect not only the immediate reward but also all subsequent rewards

There is always a trade-off between
o exploration: discovering new actions, and
o exploitation: using what the learner currently knows to obtain a reward

Components of Reinforcement Learning

Reinforcement learning has 3 basic components:
o agent: the learner or the decision maker
o environment: everything the agent interacts with, i.e. everything outside the agent
o actions: what the agent can do

Each action is associated with a reward. The objective is for the agent to choose actions so as to maximize the expected reward over some period of time.

Example: The n-Armed Bandit (Java Simulation)

There are n levers that can be pulled. The action at each step is to choose a lever to pull. The rewards are the payoffs for hitting the jackpot. Each arm has some average reward, called its value. If you know the values then the solution is trivial: always pick the lever with the largest value.

What if you don't know the values of any of the arms? What is the best approach for estimating the values while at the same time maximizing your reward?

Greedy Approach: Policy: Always pick the arm with the largest estimated value. This is called exploiting your current knowledge.

Non-Greedy Approach: If you select a nongreedy action then you are said to be exploring.

Balanced Approach: Choose a balance between exploration and exploitation. The balance partly depends on how many plays you get. If you have 1 play then the best approach is exploitation. However, if there are many plays you will need some combination. With some exploration, the reward will be lower in the short term but higher in the long run.
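The trade-off described above can be sketched with a small simulation. Everything specific here is an assumption made for illustration: the arm values in `values`, the unit-variance Gaussian payoffs, the exploration probability `epsilon` (pick a random arm with that probability, otherwise pick the arm with the largest estimate), and the function name `run_bandit`. Estimated values are simple running averages of the rewards observed for each arm.

```python
import random


def run_bandit(true_values, plays, epsilon, seed=0):
    """Play an n-armed bandit, exploring with probability epsilon.

    Returns the average reward per play. Rewards are drawn from a
    unit-variance Gaussian around each arm's true value (an assumption
    of this sketch, not something specified in the notes).
    """
    rng = random.Random(seed)
    n = len(true_values)
    estimates = [0.0] * n  # estimated value of each arm
    counts = [0] * n       # number of times each arm was pulled
    total = 0.0
    for _ in range(plays):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                           # explore
        else:
            arm = max(range(n), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_values[arm], 1.0)
        counts[arm] += 1
        # Running average of the rewards seen so far for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / plays


values = [0.2, 0.5, 1.5, 0.0, 1.0]  # hypothetical true arm values
averages = {}
for eps in (0.0, 0.1):
    runs = [run_bandit(values, 1000, eps, seed=s) for s in range(100)]
    averages[eps] = sum(runs) / len(runs)
    print("epsilon =", eps, "average reward per play:", round(averages[eps], 3))
```

Setting `epsilon` to 0 gives the purely greedy policy, which can lock onto a poor arm early and never discover the better ones. A small positive `epsilon` gives up a little reward on exploratory plays but typically ends with a higher average over many plays.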
Let:
o $Q^*(a)$ = true value of taking action a
o $Q_t(a)$ = estimated value of taking action a = (sum of rewards received when a was taken) / (number of times a was taken)

As t -> infinity, $Q_t(a) \to Q^*(a)$.

Example: A simple policy would be to take the greedy choice most of the time but every now and then (with probability $\epsilon$), randomly select an action. How do we choose $\epsilon$?

Components of the Agent

A reinforcement learning agent generally has 4 basic components:
o a policy
o a reward function
o a value function
o a model of the environment

Policy

The policy is the decision making function of the agent. It specifies what action the agent should take in any of the situations it might encounter. This is the core of the agent. The other components serve only to change and improve the policy.

Reward Function

The reward function defines the goal of the RL agent. It maps the state of the environment to a single number, a reward, indicating the intrinsic desirability of the state. The agent's objective is to maximize the total reward it receives in the long run.

Value Function

The value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward the agent can expect to accumulate over the future when starting from the current state. Rewards determine immediate desirability while values indicate long term desirability. In analogy to humans, rewards are immediate pleasure (if high reward) or pain (if low), whereas values correspond to a more refined, far-sighted judgement of how pleased or displeased we are that our environment is in a particular state.

Most of the methods we will discuss are centered around forming and improving approximate value functions.

Model

The model of the environment or external world should mimic the behavior of the environment. For example, given a situation and action, the model might predict the resultant next state and next reward. The model often takes up the largest storage space.
If there are S states and A actions then a complete model will take up space proportional to S x S x A because it maps state-action pairs to probability distributions over states. By contrast, the reward and value functions might just map states to real numbers and thus be of size S.

Terminology

Reinforcement Learning is about learning a mapping from states to a probability distribution over actions. This is called the policy.

o Policy = $\pi(s,a)$ = probability of taking action a when in state s
o $S$ = set of all states (assume finite)
o $s_t$ = state at time t
o $A(s_t)$ = set of all possible actions given the agent is in state $s_t \in S$
o $a_t$ = action at time t
o $r_t \in \mathbb{R}$ (reals) = reward at time t

At each timestep t = 1, 2, 3, ... the agent finds itself in a state $s_t \in S$ and on that basis chooses an action $a_t \in A(s_t)$. One timestep later, the agent receives a reward $r_{t+1}$ and finds itself in a new state $s_{t+1}$.

The return, $ret_t$, is the total reward received starting at time t+1:

$$ret_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_f$$

where $r_f$ is the reward at the final time step (f can be infinite), and the discounted return is

$$ret_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots$$

where $0 \le \gamma \le 1$ is called the discount factor.

We assume that the number of states and actions is finite. We then define the state transition probabilities to be:

$$P^a_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s,\ a_t = a)$$

This is just the probability of transitioning from state s to state s' when action a has been taken.

Expected Rewards

$$R^a_{ss'} = E[\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,]$$

The value function for policy $\pi$ is

$$V^\pi(s) = E_\pi[\, ret_t \mid s_t = s \,]$$

The action-value function for policy $\pi$ is

$$Q^\pi(s,a) = E_\pi[\, ret_t \mid s_t = s,\ a_t = a \,]$$

Bellman's Equation for $V^\pi(s)$ (recursion on $V^\pi(s)$), using the discounted return, is

$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma\, V^\pi(s') \right]$$

Bellman Optimality Equations

Goal: Find the policy that gives the greatest return over the long run.

We say a policy $\pi$ is better than or equal to policy $\pi'$ if $V^\pi(s) \ge V^{\pi'}(s)$ for all s. There is always at least one policy that is better than or equal to all other policies. Such a policy is called an optimal policy and is denoted by $\pi^*$.
Its corresponding value function is called $V^*$:

$$V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s), \quad \text{for all } s$$

and the optimal action-value function is

$$Q^*(s,a) = Q^{\pi^*}(s,a) = \max_\pi Q^\pi(s,a), \quad \text{for all } s, a$$

The Bellman optimality equation is then

$$V^*(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma\, V^*(s') \right]$$

where $P^a_{ss'}$ are the state transition probabilities, $R^a_{ss'}$ are the expected rewards, and $\gamma$ is the discount factor.

This equation has a unique solution. It is a system of |S| equations in |S| unknowns. If P and R were known then, in principle, it could be solved using some method for solving systems of nonlinear equations.

Once $V^*$ is known, the optimal policy is determined by always choosing, in each state, the action that attains the maximum in the equation above, i.e. the action that leads to the largest expected $V^*$.

Summary of Nonlinear Networks and Applications

Backpropagation
o Implementing backprop
o characteristics of cost surfaces

Activation Functions
o linear
o threshold: binary, bipolar
o sigmoid: bipolar (symmetric), sigmoid
o softmax

Cost Functions
o Mean Squared Error (MSE)
o Cross Entropy

Improving Generalization
o using noise to improve learning, annealing
o what does it mean to overtrain?
o early stopping
o weight decay
o pruning (e.g. optimal brain damage)

Speed-up Techniques
o momentum
o delta-bar-delta

Unsupervised Learning
o Dimension Reduction for Compression using Autoassociative Networks
   o Principal Component Analysis (PCA) using 3 layer nets
   o Nonlinear PCA using 5-layer nets
o Clustering for Compression
o Kohonen's Self-Organizing Maps (SOMs)

Misc Terminology
o correlation matrix vs Hessian
o linear separability
o bias
o decision boundary
o clustering
o dimension reduction
o overtraining

Experimental Design
o What techniques would you use to understand the data? (graphing data, examining correlation matrix, dimension reduction, ...)
o What type of architecture would you use? (number of layers, number of nodes, activation functions) Why?
o What learning algorithm would you use (speed-up technique)? Why?
o What do you do to ensure the net is trained adequately?
(but not overtrained)

http://diwww.epfl.ch/w3mantra/tutorial/english/index.html

Neural Java: Neural Networks Tutorial with Java Applets

Introduction

Neural Java is a series of exercises and demos. Each exercise consists of a short introduction, a small demonstration program written in Java (Java Applet), and a series of questions which are intended as an invitation to play with the programs and explore the possibilities of different algorithms.

The aim of the applets is to illustrate the dynamics of different artificial neural networks. Emphasis has been put on visualization and interactive interfaces. The Java Applets are not intended for and not useful for large-scale applications! Users interested in application programs should use other simulators.

The list below covers standard neural network algorithms like BackProp, Kohonen, and the Hopfield model. It also includes some models that are more biological, and features visualizations of the Hodgkin-Huxley and the integrate-and-fire models.

Additional material

The following are available for download:
o Spiking Neuron Models (W. Gerstner and W. Kistler, Cambridge University Press 2002)
o Supervised Learning for Neural Networks: a tutorial with JAVA exercises (W. Gerstner)

See also Some Competitive Learning Methods (Bernd Fritzke)

Exercises

Some links have an image to their right indicating that the applet can be downloaded and run locally. A second image indicates that the source code of the applet can be downloaded. To download, you must first agree to the GNU General Public Licence. If you agree, follow the instructions here to download and install the applets.

Single Neurons
1. Artificial Neuron.
2. McCulloch-Pitts Neuron.
3. Spiking Neuron. (Requires Swing).
4. Hodgkin-Huxley Model.
5. Axons and Action Potential Propagation.

Supervised Learning

Single-layer networks (simple perceptrons)
1. Perceptron Learning Rule.
2. Adaline, Perceptron and Backpropagation.

Multi-layer networks
1.
Multi-layer Perceptron (with neuron outputs in {0;1}).
2. Multi-layer Perceptron (with neuron outputs in {-1;1}).
3. Multi-layer Perceptron and C language.
4. Generalization in Multi-layer Perceptrons (with neuron outputs in {0;1}).
5. Generalization in Multi-layer Perceptrons (with neuron outputs in {-1;1}).
6. Optical Character Recognition with Multi-layer Perceptron.
7. Prediction with Multi-layer Perceptron.

Density Estimation and Interpolation
1. Radial Basis Function Network.
2. Gaussian Mixture Model / EM.
3. Mixture model, using unlabeled data.

Unsupervised Learning
1. Principal Component Analysis.
2. PCA for Character Recognition.
3. Competitive Learning Methods.

Reinforcement Learning
1. Blackjack and Reinforcement Learning.

Network Dynamics
1. Hopfield Network.
2. Pseudoinverse Network.
3. Network of spiking neurons. (Requires Swing).
4. Retina Simulation. (Runs very slowly with some Netscape versions).

Miniproject
Miniproject for Postgraduate Training

Useful links

URL: http://diwww.epfl.ch/mantra/tutorial/english/
Last updated: 06-October-2000 by Sébastien Baehni

http://www.leemon.com/websim/index.html

WebSim 1.30

Overview

4 July 1998: WebSim is a general simulator for neural networks, reinforcement learning, fractals, etc. WebSim has been designed for extendability, so it is easy to add more functions as needed.
WebSim modules now exist so that a simulation can be performed by using any combination of the following modules:

o Learning Algorithms: supervised learning, TD(Lambda), Q-learning, value iteration, advantage learning.
o Function approximators: lookup table, linear function approximator, multilayer perceptron (with a wide variety of squashing functions), radial basis functions (also with a wide variety of squashing functions). These can also be combined in series, and they all know how to calculate both gradients and Hessians for those learning algorithms that use first or second derivatives.
o Gradient-descent methods: backprop (with momentum), conjugate gradient. These can run in either incremental mode (changing weights after each training example), epoch-wise (changing after one pass through all examples), or in batches (changing after each N examples). These can either use the true gradient (residual gradient algorithms), a false gradient ignoring generalization in the function approximator (direct algorithms), or a linear combination in between (residual algorithms).
o Graphics: 2D plots, 3D plots, contour plots. These can show the function learned, the value function learned (optimal Q-value/Advantage in each state), or the policy.

WebSim is best viewed under a JDK 1.1 browser, rather than JDK 1.0. This means that Netscape 4.05 will not work, but Netscape 4.06 and higher will work. Recent versions of MSIE and HotJava also work fine.

WebSim Demos

o Backprop learning of a nonlinear function by a sigmoidal multilayer perceptron.
o Gradient descent on the sum of two mean-squared-error functions to satisfy both simultaneously.
o Conjugate gradient learning of an ill-conditioned linear function by a linear function approximator.
o Value Iteration demo. Gridworld experiment with a lookup table.
o Show the names of all threads that are currently running.
o Grids that can be used as backgrounds for 2D plots in WebSim.
o 3D VRML scene controlled by a WebSim experiment.
This works with Netscape using the Cosmo Player beta 3a plugin. Both beta 5 and beta 3a are available for download, but only beta 3a supports Java. This beta Cosmo is slow and crashes frequently, but it does give a hint of what the final release will look like.

More Complex Demos

The following examples require security privileges that may not be available on all Java systems. For example, one that writes to a file on the hard drive will work under the JDK or Cafe or HotJava, but not under the current version of Netscape or Internet Explorer.

o Create a short summary BNF description of the language WebSim parses, and send it to standard out (requires access to the hard drive).
o Create a long, well-commented BNF description of the language WebSim parses, and send it to standard out (requires access to the hard drive).
o Create a sequence of GIFs showing the 3D graph change as backprop learns a nonlinear function (requires access to the hard drive).
o Load a data set from the Internet and do supervised learning on it (requires security settings that allow ftp connections to the University of California at Irvine).

Speed Improvements

WebSim is fast under the WinNT Symantec jit version 2.0 (beta 33). The slowest part of the code will be something like a large matrix multiply. For some matrix operations, Java is as fast as C, and for others it is slower. On tasks such as graphics, Internet and file access, and most math operations, Java will be as fast as optimized C, since those operations are done in either the operating system or the Java runtime libraries, which are written in C. The Symantec jit has been licensed by Netscape for their browser and Sun for its JDK, so this type of speed should soon be widely available, at least on PCs.

Since a machine-learning problem with a large neural net spends most of its time in the matrix library, it can be useful to have WebSim use compiled C code for the matrix operations.
WebSim has been written to automatically check whether it is in a runtime environment that allows native code (precompiled C code) to be linked in, and if so it automatically checks the hard drive to see if there is any matrix code there. If there is, it automatically uses that for all the matrix operations, increasing the speed of some of them. To use this feature, you don't have to modify WebSim; simply place the compiled code at the appropriate place on the hard drive. On a PC, download Matrix.DLL into the directory c:\windows (for Win3.1 or Win95) or c:\WinNT.0 (for WinNT). That's it. The code will automatically be used by WebSim. WebSim will not have to be changed to use native code on other platforms. To go back to pure Java, just rename the DLL file or move it to another directory. The source is in this directory and is plain ANSI C, so it can be compiled for any platform using the .h files that come with the JDK or Cafe.

Download

The following information about WebSim is currently available for download or reading online:
o The full source code
o The automatically-generated documentation for the source code
o The short and long BNF description of the WebSim language
o An intro to writing WebSim classes
o The code revision history
o A zip file containing all of the above

WebSim is best viewed under a JDK 1.1 browser, rather than JDK 1.0. Most browsers (including MSIE and HotJava) are 1.1. The standard Netscape 4.05 browser is still 1.0, but a 1.1 version of Netscape 4.05 is available for free download.

Where to Get More Information

This version of WebSim is equivalent to a late beta; it doesn't crash, but it is still undergoing major changes. This code is (c) 1996-1998 by the respective authors, is freeware, and may be freely distributed. If modifications are made, please say so in the comments. WebSim was designed and the core code was written by Leemon Baird. Other major portions of WebSim were written by Mance Harmon, Scott Weaver, and Ansgar Laubsch.
WebSim also uses freeware utility code downloaded from the Web, including Nicholas Paldino's ftp code, Ernest Friedman-Hill's Postscript saving code, and Jef Poskanzer's GIF saving code. See their Web sites for restrictions on the use of their code. If you find WebSim useful, please send e-mail to Mance Harmon so he can keep everyone informed as changes are made. If you write new WebSim modules and are willing to make them freeware too, let him know and we'll add them to this archive or add a link to your archive.