Artificial Neuron Models Computational neurobiologists have constructed very elaborate computer models of neurons in order to run detailed simulations of particular circuits in the brain. As Computer Scientists, we are more interested in the general properties of neural networks, independent of how they are actually "implemented" in the brain. This means that we can use much simpler, abstract "neurons", which (hopefully) capture the essence of neural computation even if they leave out much of the details of how biological neurons work. People have implemented model neurons in hardware as electronic circuits, often integrated on VLSI chips. Remember though that computers run much faster than brains we can therefore run fairly large networks of simple model neurons as software simulations in reasonable time. This has obvious advantages over having to use special "neural" computer hardware. A Simple Artificial Neuron Our basic computational element (model neuron) is often called a node or unit. It receives input from some other units, or perhaps from an external source. Each input has an associated weight w, which can be modified so as to model synaptic learning. The unit computes some function f of the weighted sum of its inputs: y i f ( wij y j ) j Its output, in turn, can serve as input to other units. The weighted sum w ij y j is called the net input to unit i, often written neti. j Note that wij refers to the weight from unit j to unit i (not the other way around). The function f is the unit's activation function. In the simplest case, f is the identity function, and the unit's output is just its net input. This is called a linear unit. -1- Linear Regression Fitting a Model to Data Consider the data below (for more complete auto data, see data description, raw data, and maple plots): (Fig. 1) Each dot in the figure provides information about the weight (x-axis, units: U.S. pounds) and fuel consumption (y-axis, units: miles per gallon) for one of 74 cars (data from 1979). Clearly weight and fuel consumption are linked, so that, in general, heavier cars use more fuel. Now suppose we are given the weight of a 75th car, and asked to predict how much fuel it will use, based on the above data. Such questions can be answered by using a model - a short mathematical description - of the data (see also optical illusions). The simplest useful model here is of the form y = w1 x + w0 (1) This is a linear model: in an xy-plot, equation 1 describes a straight line with slope w1 and intercept w0 with the y-axis, as shown in Fig. 2. (Note that we have rescaled the coordinate axes - this does not change the problem in any fundamental way.) How do we choose the two parameters w0 and w1 of our model? Clearly, any straight line drawn somehow through the data could be used as a predictor, but some lines will do a better job than others. The line in Fig. 2 is certainly not a good model: for most cars, it will predict too much fuel consumption for a given weight. -2- (Fig. 2) The Loss Function In order to make precise what we mean by being a "good predictor", we define a loss (also called objective or error) function E over the model parameters. A popular choice for E is the sum-squared error: (2) In words, it is the sum over all points i in our data set of the squared difference between the target value ti (here: actual fuel consumption) and the model's prediction yi, calculated from the input value xi (here: weight of the car) by equation 1. For a linear model, the sum-sqaured error is a quadratic function of the model parameters. Figure 3 shows E for a range of values of w0 and w1. Figure 4 shows the same functions as a contour plot. (Fig. 3) (Fig. 4) -3- Minimizing the Loss The loss function E provides us with an objective measure of predictive error for a specific choice of model parameters. We can thus restate our goal of finding the best (linear) model as finding the values for the model parameters that minimize E. For linear models, linear regression provides a direct way to compute these optimal model parameters. (See any statistics textbook for details.) However, this analytical approach does not generalize to nonlinear models (which we will get to by the end of this lecture). Even though the solution cannot be calculated explicitly in that case, the problem can still be solved by an iterative numerical technique called gradient descent. It works as follows: 1. Choose some (random) initial values for the model parameters. 2. Calculate the gradient G of the error function with respect to each model parameter. 3. Change the model parameters so that we move a short distance in the direction of the greatest rate of decrease of the error, i.e., in the direction of -G. 4. Repeat steps 2 and 3 until G gets close to zero. How does this work? The gradient of E gives us the direction in which the loss function at the current settting of the w has the steepest slope. In order to decrease E, we take a small step in the opposite direction, -G (Fig. 5). (Fig. 5) By repeating this over and over, we move "downhill" in E until we reach a minimum, where G = 0, so that no further progress is possible (Fig. 6). (Fig. 6) Fig. 7 shows the best linear model for our car data, found by this procedure. -4- (Fig. 7) It's a neural network! Our linear model of equation 1 can in fact be implemented by the simple neural network shown in Fig. 8. It consists of a bias unit, an input unit, and a linear output unit. The input unit makes external input x (here: the weight of a car) available to the network, while the bias unit always has a constant output of 1. The output unit computes the sum: y2 = y1 w21 + 1.0 w20 (3) It is easy to see that this is equivalent to equation 1, with w21 implementing the slope of the straight line, and w20 its intercept with the y-axis. (Fig. 8) Linear Neural Networks Multiple regression Our car example showed how we could discover an optimal linear function for predicting one variable (fuel consumption) from one other (weight). Suppose now that we are also given one or more additional variables which could be useful as predictors. Our simple -5- neural network model can easily be extended to this case by adding more input units (Fig. 1). Similarly, we may want to predict more than one variable from the data that we're given. This can easily be accommodated by adding more output units (Fig. 2). The loss function for a network with multiple outputs is obtained simply by adding the loss for each output unit together. The network now has a typical layered structure: a layer of input units (and the bias), connected by a layer of weights to a layer of output units. (Fig. 1) (Fig. 2) Computing the gradient In order to train neural networks such as the ones shown above by gradient descent, we need to be able to compute the gradient G of the loss function with respect to each weight wij of the network. It tells us how a small change in that weight will affect the overall error E. We begin by splitting the loss function into separate terms for each point p in the training data: (1) where o ranges over the output units of the network. (Note that we use the superscript p to denote the training point - this is not an exponentiation!) Since differentiation and summation are interchangeable, we can likewise split the gradient into separate components for each training point: (2) In what follows, we describe the computation of the gradient for a single data point, omitting the superscript p in order to make the notation easier to follow. First use the chain rule to decompose the gradient into two factors: -6- (3) The first factor can be obtained by differentiating Eqn. 1 above: (4) Using y o woj y j , the second factor becomes j (5) Putting the pieces (equations 3-5) back together, we obtain (6) To find the gradient G for the entire data set, we sum at each weight the contribution given by equation 6 over all the data points. We can then subtract a small proportion µ (called the learning rate) of G from the weights to perform gradient descent. The Gradient Descent Algorithm 1. Initialize all weights to small random values. 2. REPEAT until done 1. For each weight wij set wij : 0 2. For each data point (x, t)p 1. set input units to x 2. compute value of output units 3. For each weight wij set wij : wij (t i y i ) y j 3. For each weight wij set wij : wij wij The algorithm terminates once we are at, or sufficiently near to, the minimum of the error function, where G = 0. We say then that the algorithm has converged. The Learning Rate -7- An important consideration is the learning rate µ, which determines by how much we change the weights w at each step. If µ is too small, the algorithm will take a long time to converge (Fig. 3). (Fig. 3) (Fig. 4) Conversely, if µ is too large, we may end up bouncing around the error surface out of control - the algorithm diverges (Fig. 4). This usually ends with an overflow error in the computer's floating-point arithmetic. Multi-layer networks A nonlinear problem Consider again the best linear fit we found for the car data. Notice that the data points are not evenly distributed around the line: for low weights, we see more miles per gallon than our model predicts. In fact, it looks as if a simple curve might fit these data better than the straight line. We can enable our neural network to do such curve fitting by giving it an additional node which has a suitably curved (nonlinear) activation function. A useful function for this purpose is the S-shaped hyperbolic tangent (tanh) function (Fig. 1). (Fig. 1) (Fig. 2) Fig. 2 shows our new network: an extra node (unit 2) with tanh activation function has been inserted between input and output. Since such a node is "hidden" inside the network, -8- it is commonly called a hidden unit. Note that the hidden unit also has a weight from the bias unit. In general, all non-input neural network units have such a bias weight. For simplicity, the bias unit and weights are usually omitted from neural network diagrams unless it's explicitly stated otherwise, you should always assume that they are there. (Fig. 3) When this network is trained by gradient descent on the car data, it learns to fit the tanh function to the data (Fig. 3). Each of the four weights in the network plays a particular role in this process: the two bias weights shift the tanh function in the x- and y-direction, respectively, while the other two weights scale it along those two directions. Fig. 2 gives the weight values that produced the solution shown in Fig. 3. Hidden Layers One can argue that in the example above we have cheated by picking a hidden unit activation function that could fit the data well. What would we do if the data looks like this (Fig. 4)? (Fig. 4) (Relative concentration of NO and NO2 in exhaust fumes as a function of the richness of the ethanol/air mixture burned in a car engine.) Obviously the tanh function can't fit this data at all. We could cook up a special activation function for each data set we encounter, but that would defeat our purpose of learning to model the data. We would like to have a general, non-linear function approximation method which would allow us to fit any given data set, no matter how it looks like. -9- (Fig. 5) Fortunately there is a very simple solution: add more hidden units! In fact, a network with just two hidden units using the tanh function (Fig. 5) can fit the data in Fig. 4 quite well can you see how? The fit can be further improved by adding yet more units to the hidden layer. Note, however, that having too large a hidden layer - or too many hidden layers can degrade the network's performance (more on this later). In general, one shouldn't use more hidden units than necessary to solve a given problem. (One way to ensure this is to start training with a very small network. If gradient descent fails to find a satisfactory solution, grow the network by adding a hidden unit, and repeat.) Theoretical results indicate that given enough hidden units, a network like the one in Fig. 5 can approximate any reasonable function to any required degree of accuracy. In other words, any function can be expressed as a linear combination of tanh functions: tanh is a universal basis function. Many functions form a universal basis; the two classes of activation functions commonly used in neural networks are the sigmoidal (S-shaped) basis functions (to which tanh belongs), and the radial basis functions. Error Backpropagation We have already seen how to train linear networks by gradient descent. In trying to do the same for multi-layer networks we encounter a difficulty: we don't have any target values for the hidden units. This seems to be an insurmountable problem - how could we tell the hidden units just what to do? This unsolved question was in fact the reason why neural networks fell out of favor after an initial period of high popularity in the 1950s. It took 30 years before the error backpropagation (or in short: backprop) algorithm popularized a way to train hidden units, leading to a new wave of neural network research and applications. (Fig. 1) In principle, backprop provides a way to train networks with any number of hidden units arranged in any number of layers. (There are clear practical limits, which we will discuss - 10 - later.) In fact, the network does not have to be organized in layers - any pattern of connectivity that permits a partial ordering of the nodes from input to output is allowed. In other words, there must be a way to order the units such that all connections go from "earlier" (closer to the input) to "later" ones (closer to the output). This is equivalent to stating that their connection pattern must not contain any cycles. Networks that respect this constraint are called feedforward networks; their connection pattern forms a directed acyclic graph or dag. The Algorithm We want to train a multi-layer feedforward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x,t). The vector x represents a pattern of input to the network, and the vector t the corresponding target (desired output). As we have seen before, the overall gradient with respect to the entire training set is just the sum of the gradients for each pattern; in what follows we will therefore describe how to compute the gradient for just a single training pattern. As before, we will number the units, and denote the weight from unit j to unit i by wij. 1. Definitions: o the error signal for unit j: o the (negative) gradient for weight wij: o the set of nodes anterior to unit i: o the set of nodes posterior to unit j: 2. The gradient. As we did for linear networks before, we expand the gradient into two factors by use of the chain rule: The first factor is the error of unit i. The second is Putting the two together, we get . To compute this gradient, we thus need to know the activity and the error for all relevant nodes in the network. 3. Forward activation. The activity of the input units is determined by the network's external input x. For all other units, the activity is propagated forward: - 11 - Note that before the activity of unit i can be calculated, the activity of all its anterior nodes (forming the set Ai) must be known. Since feedforward networks do not contain cycles, there is an ordering of nodes from input to output that respects this condition. 4. Calculating output error. Assuming that we are using the sum-squared loss the error for output unit o is simply 5. Error backpropagation. For hidden units, we must propagate the error back from the output nodes (hence the name of the algorithm). Again using the chain rule, we can expand the error of a hidden unit in terms of its posterior nodes: Of the three factors inside the sum, the first is just the error of node i. The second is while the third is the derivative of node j's activation function: For hidden units h that use the tanh activation function, we can make use of the special identity tanh(u)' = 1 - tanh(u)2, giving us Putting all the pieces together we get Note that in order to calculate the error for unit j, we must first know the error of all its posterior nodes (forming the set Pj). Again, as long as there are no cycles in the network, there is an ordering of nodes from the output back to the input that respects this condition. For example, we can simply use the reverse of the order in which activity was propagated forward. Matrix Form For layered feedforward networks that are fully connected - that is, each node in a given layer connects to every node in the next layer - it is often more convenient to write the backprop algorithm in matrix notation rather than using more general graph form given above. In this notation, the biases weights, net inputs, activations, and error signals for all - 12 - units in a layer are combined into vectors, while all the non-bias weights from one layer to the next form a matrix W. Layers are numbered from 0 (the input layer) to L (the output layer). The backprop algorithm then looks as follows: 1. Initialize the input layer: 2. Propagate activity forward: for l = 1, 2, ..., L, where bl is the vector of bias weights. 3. Calculate the error in the output layer: 4. Backpropagate the error: for l = L-1, L-2, ..., 1, where T is the matrix transposition operator. 5. Update the weights and biases: You can see that this notation is significantly more compact than the graph form, even though it describes exactly the same sequence of operations. Backpropagation of error: an example We will now show an example of a backprop network as it learns to model the highly nonlinear data we encountered before. - 13 - The left hand panel shows the data to be modeled. The right hand panel shows a network with two hidden units, each with a tanh nonlinear activation function. The output unit computes a linear combination of the two functions yo f tanh 1( x) g tanh 2( x) a (1) Where tanh 1( x) tanh( dx b) (2) and tanh 2( x) tanh( ex c) (3) To begin with, we set the weights, a..g, to random initial values in the range [-1,1]. Each hidden unit is thus computing a random tanh function. The next figure shows the initial two activation functions and the output of the network, which is their sum plus a negative constant. (If you have difficulty making out the line types, the top two curves are the tanh functions, the one at the bottom is the network output). As the activation functions are stretched, scaled and shifted by the changing weights, we hope that the error of the model is dropping. In the next figure we plot the total sum squared error over all 88 patterns of the data as a function of training epoch. Four training runs are shown, with different weight initialization each time: By Genevieve Orr, of Willamette University, http://www.willamette.edu/~gorr/classes/cs449/intro.html - 14 - found at