CSCI3230 Introduction to Neural Network II
Liu Pengfei
Week 13, Winter 2015

Outline
1. Backward Propagation Algorithm
   1.1 Optimization Model & Gradient Descent
   1.2 Multilayer Networks of Sigmoid Units & Backward Propagation
2. Practical Issues
   2.1 Over-fitting
   2.2 Local Minima

Backward Propagation Algorithm

How do we update the weights to minimize the error between the target and the output?

Optimization Model (Error Minimization)

Given observed data, find a network that gives the same output as the observed target variable(s). That is, given the topology of the network (the number of layers, the number of neurons, and their connections), find a set of weights that minimizes the error function over the set of training examples.

Gradient Descent

The activation function is smooth, so the error $E(W)$ is a smooth function of the weights $W$ and we can use calculus. But $E(W)$ is difficult to minimize analytically, so we use an iterative approach.

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of an approximate gradient) of the function at the current point. For a smooth function $f(x)$, the gradient $\nabla f(x)$ is the direction in which $f$ increases most rapidly, so we step against it,

$$x_{t+1} = x_t - \mu \nabla f(x_t),$$

and repeat until $x$ converges. (Figure: geometric interpretation of the gradient.)

Gradient Descent Training Rule

Applied to the weights of a network with error $E$, the update is

$$w_i \leftarrow w_i - \mu \frac{\partial E}{\partial w_i},$$

so computing the partial derivatives is the key. (Figure: a worked gradient-descent example.)

Sigmoid Units

For a single sigmoid unit, define:
- input $x$ with components $x_i$
- net input $net = \sum_i w_i x_i$
- sigmoid output $sig = \dfrac{1}{1 + e^{-net}}$
- error $E = \frac{1}{2}(t - sig)^2$, where $t$ is the target

By the chain rule,

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial sig} \cdot \frac{\partial sig}{\partial net} \cdot \frac{\partial net}{\partial w_i}.$$

The three factors are

$$\frac{\partial E}{\partial sig} = -(t - sig), \qquad
\frac{\partial sig}{\partial net} = \frac{e^{-net}}{(1 + e^{-net})^2} = sig(1 - sig), \qquad
\frac{\partial net}{\partial w_i} = x_i,$$

so

$$\frac{\partial E}{\partial w_i} = -(t - sig)\, sig\,(1 - sig)\, x_i.$$

Knowing how to update $w_i$ in a single sigmoid unit, we can move on to multilayer networks.

Multilayer Networks of Sigmoid Units

For an output neuron, the update is the same as for a single sigmoid neuron. For a hidden neuron, the error is the weighted sum of the errors of the output neurons it connects to:

$$Err_j = \sum_{k\ \mathrm{connected\ to}\ j} w_{kj}\, Err_k.$$

(Figure: a hidden neuron whose output feeds the net input, and hence the error, of each output neuron.)

Back-propagation Algorithm

BP learns the weights for a multilayer network, given a network with a fixed set of units and interconnections. It employs gradient descent in an attempt to minimize the squared error between the network output values and the target values for these outputs. Learning has two stages:
- Forward stage: calculate the outputs given the input pattern $x$.
- Backward stage: update the weights by calculating the deltas.

The error function for BP is defined as the sum of the squared errors over all the output units $k$ for all the training examples $d$:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{d} \sum_{k} (t_{kd} - o_{kd})^2.$$

This error surface can have multiple local minima: gradient descent is guaranteed to move toward some local minimum, but there is no guarantee of reaching the global minimum.

Derivation of the BP Rule

Case 1: rule for output unit weights. For an output unit $k$ with output $o_k$ and target $t_k$, exactly as in the single-unit derivation,

$$\delta_k = (t_k - o_k)\, o_k (1 - o_k).$$

Case 2: rule for hidden unit weights. A hidden unit $j$ affects the error only through the output units it feeds, so by the multivariate chain rule,

$$\delta_j = o_j (1 - o_j) \sum_{k} w_{kj}\, \delta_k.$$

In both cases the weight on the connection from input $i$ into unit $j$ is updated by

$$w_{ji} \leftarrow w_{ji} + \mu\, \delta_j\, x_{ji},$$

where $x_{ji}$ is the $i$-th input to unit $j$. The full process is therefore: a forward pass to compute all outputs, a backward pass to compute $\delta_k$ for the output layer and then $\delta_j$ for the hidden layers, and finally the weight updates. (Code sketches of the single-unit update and of the full process follow.)
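To make the single-unit rule concrete, here is a minimal Python/NumPy sketch of gradient descent on one sigmoid unit. It is an illustration, not code from the course: the OR-style toy data, the learning rate mu = 0.5, and the epoch count are all assumptions made here.

```python
import numpy as np

def sigmoid(net):
    """Logistic activation 1 / (1 + e^{-net})."""
    return 1.0 / (1.0 + np.exp(-net))

# Toy data (assumed for illustration): inputs and OR-like targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 1.0, 1.0, 1.0])

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=2)   # small random initial weights
mu = 0.5                             # learning rate (assumed)

for epoch in range(1000):
    for x_i, t_i in zip(X, t):
        net = np.dot(w, x_i)         # net = sum_i w_i * x_i
        out = sigmoid(net)           # sig = 1 / (1 + e^{-net})
        # Chain rule from the derivation above:
        # dE/dw_i = dE/dsig * dsig/dnet * dnet/dw_i
        #         = -(t - sig) * sig * (1 - sig) * x_i
        grad = -(t_i - out) * out * (1.0 - out) * x_i
        w = w - mu * grad            # step against the gradient
```

Each inner step applies the three chain-rule factors from the derivation; stepping against the gradient is exactly the training rule $w_i \leftarrow w_i - \mu\, \partial E / \partial w_i$.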
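Extending the same rule to a multilayer network gives the two-stage BP process. The sketch below is again an illustrative assumption, not the course's reference implementation: one hidden layer of sigmoid units, a bias handled as a constant extra input, XOR as the toy task, and arbitrary layer sizes and hyperparameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of sigmoid units; bias is a constant 1 input to each layer.
n_in, n_hid, n_out = 2, 3, 1
rng = np.random.default_rng(1)
W1 = rng.uniform(-0.5, 0.5, size=(n_hid, n_in + 1))   # input (+bias) -> hidden
W2 = rng.uniform(-0.5, 0.5, size=(n_out, n_hid + 1))  # hidden (+bias) -> output
mu = 0.5                                              # learning rate (assumed)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])            # XOR targets (assumed task)

for epoch in range(10000):
    for x, t in zip(X, T):
        # Forward stage: calculate outputs given the input pattern x.
        x_b = np.append(x, 1.0)            # add bias input
        h = sigmoid(W1 @ x_b)              # hidden activations
        h_b = np.append(h, 1.0)            # add bias to hidden layer
        o = sigmoid(W2 @ h_b)              # network output

        # Backward stage: compute deltas, output layer first.
        # Case 1: delta_k = (t_k - o_k) * o_k * (1 - o_k)
        delta_o = (t - o) * o * (1.0 - o)
        # Case 2: delta_j = o_j (1 - o_j) * sum_k w_kj delta_k
        # (bias column of W2 excluded: the bias has no incoming weights)
        delta_h = h * (1.0 - h) * (W2[:, :n_hid].T @ delta_o)

        # Weight updates: w <- w + mu * delta * input to that layer.
        W2 += mu * np.outer(delta_o, h_b)
        W1 += mu * np.outer(delta_h, x_b)
```

The `delta_o` line is Case 1 of the derivation, the `delta_h` line is Case 2, and the two `+=` lines are the common update $w \leftarrow w + \mu\,\delta\,x$.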
Practical Issues: Over-fitting and Local Minima

Outline
1. Over-fitting
   1.1 Splitting
   1.2 Early stopping
   1.3 Cross-validation
2. Local Minima
   2.1 Randomize initial weights & train multiple times
   2.2 Tune the learning rate appropriately

Over-fitting

Why not simply choose the method with the best fit to the data? The motivation of learning is to build a model of the underlying knowledge and apply it to predict unseen data: "generalization". Over-fitting is the opposite: the model performs well on the training data but poorly on unseen data, because it has memorized the training data: "specialization". (Figure: a normal fit versus an over-fitted one.)

Preventing Over-fitting

Goal: to train a neural network with the BEST generalization power. Three questions arise:
1. When should the algorithm stop?
2. How can we evaluate the neural network fairly?
3. How can we select the neural network with the best generalization power?

Splitting Method

1. Randomly shuffle the data.
2. Divide the data into two sets: a training set (70% of the data) and a validation set (30% of the data).
3. Train the network (update the weights) using the training set.
4. Evaluate the network (do not update the weights) using the validation set.
5. If a stopping criterion is met, quit; otherwise repeat steps 3-4.

However, there is a problem: since only part of the data is used for validation, errors in the training methodology may, coincidentally, not be reflected by the validation data. Analogy: a child knows only addition, not subtraction, but because the maths quiz happened to ask no subtraction questions, he still scored 100%.

Early Stopping

Any large enough neural network will over-fit if over-trained. If we know when to stop training, we can prevent over-fitting. Criteria for early stopping:
1. The validation error increases.
2. The weights no longer change much.
3. The training epoch exceeds a maximum number of iterations, a constant defined by you. (One epoch means all the data, both training and validation, has been used once.)

(A code sketch of splitting with early stopping appears after the cross-validation section.)

Cross-validation

Randomly shuffle the data set and divide it into k folds. In each round, one fold is kept aside as the test set and the remaining folds form the learning set; each round yields an error, and the mean error is a low-biased error estimator.

10-fold example: divide the input data into 10 parts p1, p2, ..., p10; for i = 1 to 10, train on the parts other than p_i, validate on p_i, and record performance_i; finally take the average of performance_1 through performance_10:

              Fold 1   Fold 2   ...   Fold 10   Mean
Accuracy       0.99     0.97    ...    0.87     0.90
Precision      0.90     0.92    ...    0.95     0.92
Recall         0.85     0.84    ...    0.90     0.91
F-measure      0.86     0.84    ...    0.86     0.87

How to choose the best model? Keep a part of the data as a testing set and evaluate each fold's model on this secretly kept data set. For example, by F-measure on the testing data:

Fold 1's model: 0.90
Fold 2's model: 0.92
...
Fold 10's model: 0.99
Best model: Fold 10's model.

The data is therefore split three ways: training and validation (used inside cross-validation) and testing (used only for the final comparison).
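As a concrete illustration of the splitting method with early stopping, here is a hedged Python sketch. The helpers `train_one_epoch` and `validation_error` are hypothetical callbacks standing in for whatever network-training code is used; the 70/30 split follows the slides, while `patience` (how many rising-error epochs to tolerate before stopping) is an extra assumption made here.

```python
import numpy as np

def train_with_early_stopping(X, T, train_one_epoch, validation_error,
                              max_epochs=500, patience=10):
    """Splitting method plus early stopping (hypothetical callbacks).

    X, T are NumPy arrays of inputs and targets. train_one_epoch updates
    the network weights; validation_error only evaluates, never updates.
    """
    rng = np.random.default_rng(42)
    idx = rng.permutation(len(X))                  # 1. random shuffle
    n_train = int(0.7 * len(X))                    # 2. 70% training ...
    train, valid = idx[:n_train], idx[n_train:]    #    ... 30% validation

    best_err, best_epoch, bad_epochs = np.inf, 0, 0
    for epoch in range(max_epochs):                # criterion 3: epoch cap
        train_one_epoch(X[train], T[train])        # 3. update weights
        err = validation_error(X[valid], T[valid]) # 4. evaluate only
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:             # criterion 1: rising error
                break                              # 5. stop
    return best_epoch, best_err
```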
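And a matching sketch of k-fold cross-validation. `build_and_train` and `evaluate` are again hypothetical callbacks; k = 10 mirrors the 10-fold example above, and the returned mean plays the role of the low-biased error estimator.

```python
import numpy as np

def k_fold_cv(X, T, build_and_train, evaluate, k=10):
    """k-fold cross-validation: each part p_i is used for validation
    exactly once; the k performances are averaged."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)        # parts p_1, p_2, ..., p_k

    scores = []
    for i in range(k):
        valid = folds[i]                  # use p_i for validation
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = build_and_train(X[train], T[train])
        scores.append(evaluate(model, X[valid], T[valid]))
    return float(np.mean(scores))         # mean performance over the folds
```

To choose the best model as on the slide, one would additionally hold out a testing set and compare the k fold models on it.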
Local Minima

$$w_{i,j} \leftarrow w_{i,j} - \mu \frac{\partial E}{\partial w_{i,j}}$$

Gradient descent will not always guarantee a global minimum: we may get stuck in a local minimum. (Figure: an error surface with a global minimum and a local minimum.) Can you think of some ways to solve the problem?

1. Randomize the initial weights and train multiple times.
2. Tune the learning rate appropriately.
3. Any more?

References

Backward Propagation Tutorial: http://bi.snu.ac.kr/Courses/g-ai01/Chapter5-NN.pdf
CS-449: Neural Networks: http://www.willamette.edu/~gorr/classes/cs449/intro.html
Cross-validation for detecting and preventing over-fitting: http://www.autonlab.org/tutorials/overfit10.pdf