Introduction to Machine Learning
CSE474/574: Neural Networks
Varun Chandola <chandola@buffalo.edu>

Contents

1 Extending Perceptrons
2 Multi Layered Perceptrons
  2.1 Generalizing to Multiple Labels
  2.2 An Example of a Multilayer Neural Network
  2.3 Properties of Sigmoid Function
  2.4 Motivation for Using Non-linear Surfaces
3 Feed Forward Neural Networks
4 Backpropagation
  4.1 Derivation of the Backpropagation Rules
5 Final Algorithm
6 Wrapping up Neural Networks
7 Bias Variance Tradeoff

1 Extending Perceptrons

• Questions:
  – Why not work with the thresholded perceptron?
    ∗ Not differentiable
  – How to learn non-linear surfaces?
  – How to generalize to multiple outputs, or to a numeric output?

The reason we do not use the thresholded perceptron is that its error function is not differentiable. To understand this, recall that to compute the gradient for perceptron learning we take the partial derivative of the error with respect to every component of the weight vector:
\[
\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_j \left(y_j - \mathbf{w}^\top \mathbf{x}_j\right)^2
\]
If we use the thresholded perceptron, we must replace $\mathbf{w}^\top \mathbf{x}_j$ with $o$ in the above equation, where $o = -1$ if $\mathbf{w}^\top \mathbf{x}_j < 0$ and $o = 1$ otherwise. Since $o$ is not smooth, the error function is not differentiable. Hence we work with the unthresholded perceptron unit.

2 Multi Layered Perceptrons

2.1 Generalizing to Multiple Labels

• Distinguishing between multiple categories
• Solution: add another layer, giving a multi layer neural network

[Figure: a network with an input layer ($x_1, \ldots, x_5$), a hidden layer, and an output layer ($o_1, \ldots, o_4$).]

Multi-class classification is more broadly applicable than binary classification. Applications include handwritten digit recognition, robotics, etc.

2.2 An Example of a Multilayer Neural Network

One significant application of neural networks in the area of autonomous driving is ALVINN, or Autonomous Land Vehicle In a Neural Network, developed at Carnegie Mellon University in the early 1990s. More information is available at http://www.ri.cmu.edu/research_project_detail.html?project_id=160&menu_id=261. ALVINN is a perception system which learns to control the NAVLAB vehicles by watching a person drive. ALVINN's architecture consists of a single hidden layer back-propagation network. The input layer of the network is a 30x32 unit two dimensional "retina" which receives input from the vehicle's video camera. Each input unit is fully connected to a layer of five hidden units, which are in turn fully connected to a layer of 30 output units. The output layer is a linear representation of the direction the vehicle should travel in order to keep the vehicle on the road.

What kind of unit should each node of such a network be?

• Linear Unit
• Perceptron Unit
• Sigmoid Unit
  – Smooth, differentiable threshold function:
\[
o = \sigma(net) = \frac{1}{1 + e^{-net}}, \qquad net = \mathbf{w}^\top \mathbf{x}
\]
  – Non-linear output
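To make the sigmoid unit concrete, here is a minimal Python sketch of a single unit computing $o = \sigma(\mathbf{w}^\top \mathbf{x})$. The function names are illustrative, not from the notes; the steepness parameter $k$ anticipates the variant $e^{-ky}$ discussed in Section 2.3.

```python
import numpy as np

def sigmoid(net, k=1.0):
    """Sigmoid threshold function; k controls the steepness of the curve (Sec. 2.3)."""
    return 1.0 / (1.0 + np.exp(-k * net))

def sigmoid_unit_output(w, x):
    """Output o = sigma(w^T x) of a single sigmoid unit.

    w : weight vector with d+1 components (the first is the bias weight)
    x : input vector with a leading 1.0 as the bias input x0
    """
    return sigmoid(np.dot(w, x))

# Example: a unit with two real inputs plus the bias input.
w = np.array([0.5, -1.0, 2.0])    # [w0 (bias), w1, w2]
x = np.array([1.0, 0.3, 0.8])     # [x0 = 1, x1, x2]
print(sigmoid_unit_output(w, x))  # a value strictly between 0 and 1
```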
2.3 Properties of Sigmoid Function

[Figure: plots of $f_s(x)$, $f_t(x)$, and $f_h(x)$, comparing the smooth sigmoid output with harder threshold outputs.]

The threshold output in the case of the sigmoid unit is continuous and smooth, as opposed to that of a perceptron unit or a linear unit. A useful property of the sigmoid is that its derivative can be expressed in terms of the function itself:
\[
\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))
\]
One can also use $e^{-ky}$ instead of $e^{-y}$, where $k$ controls the "steepness" of the threshold curve.

2.4 Motivation for Using Non-linear Surfaces

As mentioned earlier, the perceptron unit cannot be used as it is not differentiable. The linear unit is differentiable, but it only learns linear discriminating surfaces. So, to learn non-linear surfaces, we need to use a non-linear differentiable unit such as the sigmoid.

3 Feed Forward Neural Networks

A multi-layer network consists of:

• $d + 1$ input nodes (including a bias node)
• $m + 1$ hidden nodes
• $k$ output nodes
• At the hidden nodes: weight vectors $\mathbf{w}_j$, $1 \le j \le m$, $\mathbf{w}_j \in \mathbb{R}^{d+1}$
• At the output nodes: weight vectors $\mathbf{w}_l$, $1 \le l \le k$, $\mathbf{w}_l \in \mathbb{R}^{m+1}$

Such a multi-layer neural network is used in a feed forward mode, i.e., information flows in one direction only (forward). Each hidden node "collects" the inputs from all input nodes, computes a weighted sum of the inputs, and applies the sigmoid function to the weighted sum. The output of each hidden node is forwarded to every output node. Each output node "collects" its inputs (from the hidden layer nodes), computes a weighted sum, and applies the sigmoid function to obtain the final output. The class corresponding to the output node with the largest output value is assigned as the predicted class for the input. For implementation, one can represent the weights as two matrices, $W^{(1)}$ of size $m \times (d+1)$ and $W^{(2)}$ of size $k \times (m+1)$.
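To make the feed forward computation concrete, here is a minimal Python sketch using the two-matrix representation just described ($W^{(1)}$ and $W^{(2)}$ become the arrays W1 and W2; the function name forward is illustrative). The input to each layer is augmented with a leading 1 for the bias term.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W1, W2):
    """One feed forward pass through the two-layer network.

    x  : input vector with d components (without the bias entry)
    W1 : m x (d+1) weight matrix for the hidden layer
    W2 : k x (m+1) weight matrix for the output layer
    Returns the hidden outputs (length m) and the final outputs (length k).
    """
    x_aug = np.concatenate(([1.0], x))   # prepend the bias input
    h = sigmoid(W1 @ x_aug)              # weighted sums + sigmoid at hidden nodes
    h_aug = np.concatenate(([1.0], h))   # prepend the bias for the output layer
    o = sigmoid(W2 @ h_aug)              # weighted sums + sigmoid at output nodes
    return h, o

# The predicted class is the output node with the largest value.
rng = np.random.default_rng(0)
d, m, k = 5, 3, 4
W1 = rng.normal(scale=0.1, size=(m, d + 1))
W2 = rng.normal(scale=0.1, size=(k, m + 1))
h, o = forward(rng.normal(size=d), W1, W2)
print(np.argmax(o))
```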
4 Backpropagation

• Assume that the network structure is predetermined (number of hidden nodes and interconnections)
• Error function for a given input example:
\[
J = \frac{1}{2}\sum_{l=1}^{k} (y_l - o_l)^2
\]
• $y_l$ - target value associated with the $l$th class for input $\mathbf{x}$: $y_l = 1$ when $l$ is the true class for $\mathbf{x}$, and $0$ (or $-1$) otherwise
• $o_l$ - predicted output value at the $l$th output node for $\mathbf{x}$

[Figure: an example network for a task where the learning problem is to recognize 10 different vowel sounds from audio input; the raw sound signal is compressed into two features using spectral analysis.]

What are we learning? The weight vectors for all output and hidden nodes that minimize $J$.

The first question that comes to mind is: why not use a standard gradient descent based minimization, like the one we saw for learning a single perceptron unit? The reason is that the output at every output node ($o_l$) depends directly on the weights associated with the output nodes, but not on the weights at the hidden nodes. Conversely, the input values are "used" by the hidden nodes and are not "visible" to the output nodes. To learn all the weights simultaneously, direct minimization is not possible; a method such as backpropagation needs to be employed. The overall procedure is:

1. Initialize all weights to small values
2. For each training example ⟨x, y⟩:
  (a) Propagate the input forward through the network
  (b) Propagate the errors backward through the network

Gradient Descent

• Move in the opposite direction of the gradient of the error: $-\eta \nabla J$, with
\[
J = \frac{1}{2}\sum_{l=1}^{k} (y_l - o_l)^2
\]
• What is the gradient computed with respect to?
  – The weight vectors: $m$ at the hidden nodes and $k$ at the output nodes, i.e., $\mathbf{w}_j$ ($j = 1, \ldots, m$) and $\mathbf{w}_l$ ($l = 1, \ldots, k$)
• Need to compute:
\[
\nabla J = \begin{bmatrix} \frac{\partial J}{\partial \mathbf{w}_1} \\ \vdots \\ \frac{\partial J}{\partial \mathbf{w}_{m+k}} \end{bmatrix}, \qquad
\frac{\partial J}{\partial \mathbf{w}_p} = \begin{bmatrix} \frac{\partial J}{\partial w_{p1}} \\ \frac{\partial J}{\partial w_{p2}} \\ \vdots \end{bmatrix}
\]
• Update rules: $\mathbf{w}_j \leftarrow \mathbf{w}_j - \eta \frac{\partial J}{\partial \mathbf{w}_j}$ and $\mathbf{w}_l \leftarrow \mathbf{w}_l - \eta \frac{\partial J}{\partial \mathbf{w}_l}$

4.1 Derivation of the Backpropagation Rules

Note that the above algorithm is similar in spirit to the Stochastic Gradient Descent algorithm, since it examines one input example at a time and updates the weights. For a given example $(\mathbf{x}, \mathbf{y})$, let the error between the target and the output be denoted by $J$:
\[
J = \frac{1}{2}\sum_{l=1}^{k} (y_l - o_l)^2
\]
Clearly, the weights at each node (hidden and output) need to be updated, using the gradient descent rule, in the negative direction of the gradient of $J$. Thus each weight is updated as follows:
\[
w_{pi} \leftarrow w_{pi} - \eta \frac{\partial J}{\partial w_{pi}}
\]
where $w_{pi}$ denotes the $i$th component of the $p$th weight vector at either the hidden layer or the output layer; equivalently, it is the weight on the connection to the $p$th node from its $i$th input.

• Consider any weight $w_{pq}$.
• Let $u_{pq}$ be the $q$th element of the input vector coming in to the $p$th unit.

Observation 1. Weight $w_{pq}$ is connected to $J$ through $net_p = \sum_i w_{pi} u_{pi}$:
\[
\frac{\partial J}{\partial w_{pq}} = \frac{\partial J}{\partial net_p}\frac{\partial net_p}{\partial w_{pq}} = \frac{\partial J}{\partial net_p}\, u_{pq}
\]

Observation 2. $net_l$ for an output node is connected to $J$ only through the output value of that node, $o_l$:
\[
\frac{\partial J}{\partial net_l} = \frac{\partial J}{\partial o_l}\frac{\partial o_l}{\partial net_l}
\]
The first term above can be computed as:
\[
\frac{\partial J}{\partial o_l} = \frac{\partial}{\partial o_l}\,\frac{1}{2}\sum_{l'=1}^{k} (y_{l'} - o_{l'})^2
\]
Only the entry for $l' = l$ in the summation on the right hand side has a non-zero derivative. This results in:
\[
\frac{\partial J}{\partial o_l} = \frac{\partial}{\partial o_l}\,\frac{1}{2}(y_l - o_l)^2 = -(y_l - o_l)
\]
The second term in the chain rule can be computed as:
\[
\frac{\partial o_l}{\partial net_l} = \frac{\partial \sigma(net_l)}{\partial net_l} = o_l(1 - o_l)
\]
where the last step uses the fact that $o_l$ is the output of a sigmoid. Using the above results, one obtains:
\[
\frac{\partial J}{\partial net_l} = -(y_l - o_l)\,o_l(1 - o_l) = -\delta_l
\]
where $\delta_l = (y_l - o_l)\,o_l(1 - o_l)$. Finally, combining this with Observation 1, the partial derivative of the error with respect to an output layer weight $w_{lj}$ is:
\[
\frac{\partial J}{\partial w_{lj}} = -\delta_l u_{lj}
\]

Update Rule for Output Units:
\[
w_{lj} \leftarrow w_{lj} + \eta\,\delta_l u_{lj}, \qquad \delta_l = (y_l - o_l)\,o_l(1 - o_l)
\]

• Question: what is $u_{lj}$ for the $l$th output node? It is the $j$th input to the $l$th output node, i.e., the output coming from the $j$th hidden node.
• Question: what is $u_{ji}$ for the $j$th hidden node? It is the $i$th input to the $j$th hidden node, i.e., the $i$th attribute value of the input, $x_i$.

Observation 3. $net_j$ for a hidden node is connected to $J$ through all the output nodes:
\[
\frac{\partial J}{\partial net_j} = \sum_{l=1}^{k} \frac{\partial J}{\partial net_l}\frac{\partial net_l}{\partial net_j}
\]
Remember that we have already computed the first term on the right hand side for the output nodes, $\frac{\partial J}{\partial net_l} = -\delta_l$. This gives us:
\[
\frac{\partial J}{\partial net_j}
= \sum_{l=1}^{k} -\delta_l \frac{\partial net_l}{\partial net_j}
= \sum_{l=1}^{k} -\delta_l \frac{\partial net_l}{\partial o_j}\frac{\partial o_j}{\partial net_j}
= \sum_{l=1}^{k} -\delta_l w_{lj}\frac{\partial o_j}{\partial net_j}
= \sum_{l=1}^{k} -\delta_l w_{lj}\, o_j(1 - o_j)
= -\delta_j
\]
where
\[
\delta_j = o_j(1 - o_j)\sum_{l=1}^{k}\delta_l w_{lj}
\]
Thus the gradient becomes:
\[
\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial net_j}\, u_{ji}
= -o_j(1 - o_j)\Big(\sum_{l=1}^{k}\delta_l w_{lj}\Big) u_{ji} = -\delta_j u_{ji}
\]

Update Rule for Hidden Units:
\[
w_{ji} \leftarrow w_{ji} + \eta\,\delta_j u_{ji}
\]

5 Final Algorithm

• While not converged:
  – Move forward to compute the outputs at the hidden and output nodes
  – Move backward to propagate the errors back:
    ∗ Compute the δ errors at the output nodes
    ∗ Compute the δ errors at the hidden nodes
  – Update all weights according to the weight update equations

6 Wrapping up Neural Networks

• The error function contains many local minima
• No guarantee of convergence
  – Not a "big" issue in practical deployments
• Improving backpropagation:
  – Adding momentum
  – Using stochastic gradient descent
  – Training multiple times using different initializations

Adding momentum to the learning process refers to adding an "inertia" term which keeps the current update of a weight similar to the update made in the previous round:
\[
\Delta w_{ji}(n) = \eta\,\delta_j u_{ji} + \alpha\,\Delta w_{ji}(n-1)
\]
where $n$ indexes the update round and $\alpha \in [0, 1)$ is the momentum parameter. A sketch of one complete training step, with momentum, follows.
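Below is a minimal Python sketch, in the same style as the earlier snippets, of one backpropagation step for a single example: the forward pass, the δ computations from Section 4.1, and the momentum update. The function name and the values η = 0.1, α = 0.9 are illustrative assumptions, not prescribed by the notes.

```python
import numpy as np

def backprop_step(x, y, W1, W2, dW1, dW2, eta=0.1, alpha=0.9):
    """One backpropagation step for a single training example (x, y).

    W1, W2   : hidden (m x (d+1)) and output (k x (m+1)) weight matrices
    dW1, dW2 : updates from the previous round, used for the momentum term
    eta      : learning rate; alpha : momentum parameter
    Returns the updated weights and the new updates (for the next round).
    """
    # Forward pass (Section 3).
    x_aug = np.concatenate(([1.0], x))
    h = 1.0 / (1.0 + np.exp(-(W1 @ x_aug)))
    h_aug = np.concatenate(([1.0], h))
    o = 1.0 / (1.0 + np.exp(-(W2 @ h_aug)))

    # Backward pass: delta errors at the output and hidden nodes (Section 4.1).
    delta_out = (y - o) * o * (1.0 - o)                    # delta_l at each output node
    delta_hid = h * (1.0 - h) * (W2[:, 1:].T @ delta_out)  # delta_j; skip the bias column of W2

    # Updates with momentum: dw(n) = eta * delta * u + alpha * dw(n-1).
    dW2 = eta * np.outer(delta_out, h_aug) + alpha * dW2
    dW1 = eta * np.outer(delta_hid, x_aug) + alpha * dW1
    return W1 + dW1, W2 + dW2, dW1, dW2
```

Calling backprop_step repeatedly over the training examples, until the error stops decreasing, is the "while not converged" loop of Section 5.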
7 Bias Variance Tradeoff

• Neural networks are universal function approximators
  – By making the model more complex (increasing the number of hidden layers or $m$), one can lower the training error
• Is the model with the least training error the best model?
  – The simple answer is no!
  – Risk of overfitting ("chasing the data")
  – Overfitting leads to high generalization error

High Variance - Low Bias
• "Chases the data"
• The model parameters change significantly when the training data is changed; hence the term high variance
• Very low training error
• Poor performance on unseen data

Low Variance - High Bias
• Less sensitive to the training data
• Higher training error
• Better performance on unseen data

A general rule of thumb: if two models give similar training error, choose the simpler model. What is "simple" for a neural network? Small weights in the weight matrices. Why? If the weights in the weight vectors at each node are large, the resulting discriminating surface learnt by the neural network will be highly non-linear; if the weights are smaller, the surface will be smoother (and hence simpler). We can therefore penalize solutions in which the weights are large by introducing a penalty term in the objective function. This is known as regularization.

Regularization for Backpropagation:
\[
\tilde{J} = J + \frac{\lambda}{2n}\left(\sum_{j=1}^{m}\sum_{i=1}^{d+1}\big(w^{(1)}_{ji}\big)^2 + \sum_{l=1}^{k}\sum_{j=1}^{m+1}\big(w^{(2)}_{lj}\big)^2\right)
\]
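As a closing sketch (reusing the W1 and W2 arrays from the earlier snippets; the function name is illustrative), the regularized objective can be computed as follows. In gradient descent, the penalty simply adds a $(\lambda/n)\,w$ "weight decay" term to each weight's gradient.

```python
import numpy as np

def regularized_error(J, W1, W2, lam, n):
    """Penalized objective: J~ = J + (lam / 2n) * (sum of all squared weights).

    J   : unregularized error on the training set
    W1  : m x (d+1) hidden layer weight matrix
    W2  : k x (m+1) output layer weight matrix
    lam : regularization strength (lambda)
    n   : number of training examples
    """
    penalty = np.sum(W1 ** 2) + np.sum(W2 ** 2)
    return J + (lam / (2.0 * n)) * penalty

# In the weight updates, the penalty contributes an extra (lam / n) * w term:
#   W1 <- W1 - eta * (grad_W1 + (lam / n) * W1)
#   W2 <- W2 - eta * (grad_W2 + (lam / n) * W2)
```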