Multi-Layer Feedforward Neural Networks
CAP5615 Intro. to Neural Networks
Xingquan (Hill) Zhu

Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition

Multi-layer NN
• Between the input and output layers there are hidden layers, as illustrated below.
  – Hidden nodes do not directly send outputs to the external environment.
• Multi-layer NNs overcome the limitation of single-layer NNs: they can handle non-linearly separable learning tasks.
[Figure: a network with an input layer, a hidden layer, and an output layer.]

XOR problem
• The two classes, green and red, cannot be separated by one line, but they can be separated by two lines.
• The NN below, with two hidden nodes, realizes this non-linear separation: each hidden node represents one of the two separating lines.
[Figure: the four XOR points at (±1, ±1) in the (x1, x2) plane, and a network with inputs x1, x2, two hidden nodes y1, y2, and an output node z.]

Types of decision regions
• A network with a single node realizes a half plane bounded by the hyperplane w0 + w1·x1 + w2·x2 = 0.
• A one-hidden-layer network realizes a convex region: each hidden node realizes one of the lines bounding the convex region.
• A two-hidden-layer network realizes the union of three convex regions: each box represents a one-hidden-layer network realizing one convex region.

Different Non-Linearly Separable Problems
Structure      | Types of decision regions
Single-layer   | Half plane bounded by a hyperplane
Two-layer      | Convex open or closed regions
Three-layer    | Arbitrary (complexity limited by the number of nodes)
(The remaining columns of the original table illustrate, for each structure, the exclusive-OR problem, class separation, and the most general region shapes.)

Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition

FFNN NEURON MODEL
• The classical learning algorithm of FFNN is based on the gradient descent method.
• The activation functions used in FFNN are continuous functions of the weights, differentiable everywhere.
• A typical activation function is the sigmoid function:
    φ(v_j) = 1 / (1 + e^{-a v_j}),  with a > 0,
  where v_j = Σ_i w_ji y_i, w_ji is the weight of the link from node i to node j, and y_i is the output of node i.
[Figure: sigmoid curves φ(v_j) over v_j in [-10, 10] for increasing values of a.]
• When a approaches 0, φ tends to a linear function.
• When a tends to infinity, φ tends to the step function.

FFNN MODEL
• x_ij: the input from node i to node j
• w_ij: the weight from node i to node j
  – Δw_ij: the weight update amount from node i to node j
• o_k: the output of node k

The objective of multi-layer NN
• The error of output neuron j after the activation of the network on the n-th training example (x(n), d(n)) is:
    e_j(n) = d_j(n) - o_j(n)
• The network error is the sum of the squared errors of the output neurons:
    E(n) = (1/2) Σ_{j ∈ output nodes} e_j²(n)
• The total mean squared error is the average of the network errors over the N training examples:
    E(W) = (1/N) Σ_{n=1}^{N} E(n) = (1/2N) Σ_n Σ_j (d_j(n) - o_j(n))²

Feed forward NN
[Figure: a feed-forward network.]

Idea: Credit assignment problem
• The problem of assigning 'credit' or 'blame' to the individual elements (hidden units) involved in forming the overall response of a learning system.
• In neural networks, the problem amounts to distributing the network error over the weights.
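To make the neuron model and error definitions above concrete, here is a minimal NumPy sketch of a single-hidden-layer forward pass with sigmoid activations and the per-example network error. It is only an illustration under stated assumptions; the function names, the weight-matrix layout (one row per neuron), and the convention that x[0] = 1 is the bias input are choices made here, not part of the course's code.

```python
import numpy as np

def sigmoid(v, a=1.0):
    """Sigmoid activation phi(v) = 1 / (1 + exp(-a*v)); a controls the slope."""
    return 1.0 / (1.0 + np.exp(-a * v))

def forward(x, W_hidden, W_output, a=1.0):
    """One forward pass of a single-hidden-layer FFNN.

    x        : input vector, with x[0] = 1.0 acting as the bias input
    W_hidden : (n_hidden, n_inputs) weights of the hidden layer (n_inputs counts the bias)
    W_output : (n_outputs, n_hidden + 1) weights of the output layer
    """
    v_hidden = W_hidden @ x                       # v_j = sum_i w_ji * y_i
    o_hidden = sigmoid(v_hidden, a)
    o_hidden = np.concatenate(([1.0], o_hidden))  # prepend a bias unit for the next layer
    v_out = W_output @ o_hidden
    return sigmoid(v_out, a)

def network_error(d, o):
    """E(n) = 1/2 * sum_j e_j(n)^2, with e_j = d_j - o_j."""
    e = d - o
    return 0.5 * np.sum(e ** 2)
```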
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition

Training: Backprop algorithm
• Searches for weight values that minimize the total error of the network over the set of training examples.
• Repeats the following two passes:
  – Forward pass: compute the outputs of all units in the network and the error of the output layer.
  – Backward pass: the network error is used to update the weights (credit assignment problem).
• Starting at the output layer, the error is propagated backwards through the network, layer by layer, by recursively computing the local gradient of each neuron.

Backprop
[Figure: forward step = network activation and error computation; backward step = error propagation.]
• Backprop adjusts the weights of the NN in order to minimize the network's total mean squared error.

BP Example: XOR
[Figure: a 2-2-1 network with bias input X0 and inputs X1, X2 feeding hidden nodes a and b, which feed the output node c, with weights w0a, w1a, w2a, w0b, w1b, w2b, w0c, wac, wbc.]
• XOR training data:
    X0: 1 1 1 1
    X1: 0 0 1 1
    X2: 0 1 0 1
    Y : 0 1 1 0
• η = 0.5;  o_x = 1 / (1 + e^{-v_x})
• Forward pass for the instance {(1, 0, 0), 0}:
  – Neuron a: w0a = 0.34, w1a = 0.13, w2a = -0.92;  v_a = 0.34,  o_a = 0.58
  – Neuron b: w0b = -0.12, w1b = 0.57, w2b = -0.33;  v_b = -0.12,  o_b = 0.47
  – Neuron c: w0c = -0.99, wac = 0.16, wbc = 0.75;  v_c = -0.54,  o_c = 0.37
• Local gradients:
  – δ_c = o_c(1 - o_c)(t_c - o_c) = 0.37·(1 - 0.37)·(0 - 0.37) ≈ -0.085
  – δ_a = o_a(1 - o_a) Σ_k w_ak δ_k = 0.58·(1 - 0.58)·0.16·(-0.085) ≈ -0.003
  – δ_b = o_b(1 - o_b) Σ_k w_bk δ_k = 0.47·(1 - 0.47)·0.75·(-0.085) ≈ -0.016
• Weight changes (Δw_ij = η δ_j x_i):
  – Δw0a = 0.5·(-0.003)·1 = -0.0015;  Δw1a = 0.5·(-0.003)·0 = 0;  Δw2a = 0.5·(-0.003)·0 = 0
  – Δw0b = 0.5·(-0.016)·1 = -0.008;   Δw1b = 0.5·(-0.016)·0 = 0;  Δw2b = 0.5·(-0.016)·0 = 0
  – Δw0c = 0.5·(-0.085)·1 ≈ -0.043;   Δwac = 0.5·(-0.085)·0.58 ≈ -0.025;  Δwbc = 0.5·(-0.085)·0.47 ≈ -0.020
• Weight updating (w ← w + Δw):
  – Neuron a: w0a = 0.34 - 0.0015 = 0.3385;  w1a = 0.13 + 0;  w2a = -0.92 + 0
  – Neuron b: w0b = -0.12 - 0.008;  w1b = 0.57 + 0;  w2b = -0.33 + 0
  – Neuron c: w0c = -0.99 - 0.043;  wac = 0.16 - 0.025;  wbc = 0.75 - 0.020
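The worked example above can be checked numerically with a few lines of NumPy. This is only an illustrative sketch (not the course's C implementation); grouping the slide's weights into the vectors w_a, w_b, w_c is an assumption made here for readability.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

eta = 0.5
x = np.array([1.0, 0.0, 0.0])            # instance (x0=1 bias, x1, x2)
t_c = 0.0                                # target output

w_a = np.array([0.34, 0.13, -0.92])      # w0a, w1a, w2a
w_b = np.array([-0.12, 0.57, -0.33])     # w0b, w1b, w2b
w_c = np.array([-0.99, 0.16, 0.75])      # w0c, wac, wbc

# Forward pass
o_a = sigmoid(w_a @ x)                   # ~0.58
o_b = sigmoid(w_b @ x)                   # ~0.47
h = np.array([1.0, o_a, o_b])            # bias plus hidden outputs feed neuron c
o_c = sigmoid(w_c @ h)                   # ~0.37

# Local gradients
delta_c = o_c * (1 - o_c) * (t_c - o_c)          # ~ -0.085
delta_a = o_a * (1 - o_a) * w_c[1] * delta_c     # ~ -0.003
delta_b = o_b * (1 - o_b) * w_c[2] * delta_c     # ~ -0.016

# Weight updates: delta_w = eta * delta * input
w_a += eta * delta_a * x
w_b += eta * delta_b * x
w_c += eta * delta_c * h
print(o_a, o_b, o_c, delta_a, delta_b, delta_c)
```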
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition

Weight Update Rule
• The Backprop weight update rule is based on the gradient descent method: take a step in the direction yielding the maximum decrease of the network error E. This direction is the opposite of the gradient of E:
    Δw_ij = -η ∂E(W)/∂w_ij,    w_ij ← w_ij + Δw_ij

Weight Update Rule (continued)
• The input of neuron j is v_j = Σ_{i=0,...,m} w_ij x_i.
• Using the chain rule we can write:
    ∂E(W)/∂w_ij = ∂E(W)/∂v_j · ∂v_j/∂w_ij
• Moreover, if we define the local gradient of neuron j as
    δ_j = -∂E(W)/∂v_j,
  then, since v_j = Σ_i w_ij x_i gives ∂v_j/∂w_ij = x_i, we get
    Δw_ij = η δ_j x_i

Weight update
• So we have to compute the local gradient δ_j = -∂E(W)/∂v_j of each neuron.

Two different rules for two cases
• j is an output neuron (the green ones)
• j is a hidden neuron (the brown ones)
[Figure: a network with an input layer, a hidden layer, and an output layer.]

Weight update of output neuron
• If j is an output neuron, then using the chain rule we obtain:
    δ_j = -∂E(W)/∂v_j = -∂E(W)/∂e_j · ∂e_j/∂o_j · ∂o_j/∂v_j = -e_j·(-1)·φ'(v_j) = e_j φ'(v_j),
  because e_j = d_j - o_j and o_j = φ(v_j).
• Substituting δ_j into Δw_ij = η δ_j x_i (and using φ'(v_j) = o_j(1 - o_j) for the sigmoid with a = 1) we get, for j an output neuron:
    Δw_ij = η (d_j - o_j) o_j (1 - o_j) x_i

Weight update of hidden neuron
• Let C be the set of neurons of the output layer (the layer fed by j). Then
    δ_j = -∂E(W)/∂v_j = -Σ_{k∈C} ∂E(W)/∂v_k · ∂v_k/∂v_j
• Using the chain rule: ∂v_k/∂v_j = ∂v_k/∂o_j · ∂o_j/∂v_j.
• Because v_k = Σ_{i=0,...,m} w_ik x_i, and the output o_j of node j is one of the inputs of node k, we have ∂v_k/∂o_j = w_jk; moreover ∂o_j/∂v_j = φ'(v_j). Then
    δ_j = φ'(v_j) Σ_{k∈C} δ_k w_jk = o_j(1 - o_j) Σ_{k∈C} δ_k w_jk
• Substituting δ_j into Δw_ij = η δ_j x_i we get, for j a hidden node:
    Δw_ij = η x_i o_j (1 - o_j) Σ_k w_jk δ_k,   with k ranging over the nodes of the next layer

Error backpropagation
[Figure: flow graph showing how errors are backpropagated to hidden neuron j: each output neuron k contributes δ_k = φ'(v_k) e_k, and neuron j collects δ_j = φ'(v_j) Σ_k w_jk δ_k through the weights w_j1, ..., w_jk, ..., w_jm.]

Summary: Delta Rule
• Delta rule: Δw_ij = η δ_j x_i, where
    δ_j = φ'(v_j)(d_j - o_j)         if j is an output node
    δ_j = φ'(v_j) Σ_k w_jk δ_k       if j is a hidden node (k ranging over the nodes of the next layer)
• For sigmoid activation functions, φ'(v_j) = a o_j (1 - o_j).

Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition

Network training
Two types of network training:
• Incremental mode (on-line, stochastic, or per-observation): weights are updated after each instance is presented.
• Batch mode (off-line or per-epoch): weights are updated after all the patterns are presented.

Backprop algorithm: incremental mode
    n = 1; initialize w(n) randomly;
    while (stopping criterion not satisfied and n < max_iterations)
        for each example (x, d)
            - run the network with input x and compute the output y
            - update the weights in backward order, starting from those of the output layer:
              w_ij ← w_ij + Δw_ij, with Δw_ij computed using the (generalized) Delta rule
        end-for
        n = n + 1;
    end-while
• Choose a randomized ordering for selecting the examples in the training set in order to avoid poor performance. (A Python sketch of this loop appears at the end of this part.)

Backprop algorithm: batch mode
• In batch mode the weights are updated only after all examples have been processed, using the formula
    w_ij ← w_ij + Σ_{x ∈ training set} Δw_ij(x)
• The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied.

Stopping criteria
• Sensible stopping criteria:
  – Total mean squared error change: Backprop is considered to have converged when the absolute rate of change of the average squared error per epoch is sufficiently small (typically in the range [0.01, 0.1]).
  – Generalization-based criterion: after each epoch the NN is tested for generalization using a separate set of examples (validation set). If the generalization performance is adequate, then stop.

Use of the Available Data Set for Training
The available data set is normally split into three sets:
• Training set: used to update the weights. Patterns in this set are repeatedly presented in random order; the weight update equations are applied after a certain number of patterns.
• Validation set: used to decide when to stop training, only by monitoring the error.
• Test set: used to test the performance of the neural network. It should not be used as part of the neural network development cycle.
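As a sketch of the incremental-mode pseudocode above, the loop below assumes a hypothetical network object with forward(x) and backward(x, d, eta) methods that apply the generalized Delta rule; those method names, and the stopping test on the change in average squared error, are illustrative assumptions rather than the course's interface.

```python
import numpy as np

def train_incremental(net, data, eta=0.5, max_epochs=1000, tol=0.01):
    """Incremental (on-line) backprop, following the pseudocode above.

    net  : hypothetical object with forward(x) -> output and backward(x, d, eta)
           that updates the weights layer by layer with the Delta rule.
    data : list of (x, d) training pairs as NumPy arrays.
    """
    rng = np.random.default_rng(0)
    prev_error = None
    for epoch in range(max_epochs):
        order = rng.permutation(len(data))      # randomized ordering avoids poor performance
        for idx in order:
            x, d = data[idx]
            y = net.forward(x)                  # forward pass
            net.backward(x, d, eta)             # backward pass: credit assignment
        # stopping criterion: small change in the average squared error per epoch
        error = np.mean([0.5 * np.sum((d - net.forward(x)) ** 2) for x, d in data])
        if prev_error is not None and abs(prev_error - error) < tol:
            break
        prev_error = error
    return net
```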
Early Stopping for Good Generalization
• Running too many epochs may overtrain the network, resulting in overfitting and poor generalization.
• Keep a hold-out validation set and test accuracy after every epoch. Maintain the weights of the network that performs best on the validation set, and stop training when the validation error increases beyond that point.
[Figure: training-set and validation-set error versus number of epochs.]

Model Selection by Cross-validation
• Too few hidden units prevent the network from adequately fitting the data and learning the concept.
• Too many hidden units lead to overfitting.
• Similar cross-validation methods can be used to determine an appropriate number of hidden units, using the validation error to select the model with the optimal number of hidden layers and nodes.
[Figure: training-set and validation-set error versus number of epochs.]

NN Design
• Data representation
• Network topology
• Network parameters
• Training

Data Representation
• Data representation depends on the problem. In general, NNs work on continuous (real-valued) attributes, so symbolic attributes are encoded into continuous ones.
• Attributes of different types may have different ranges of values, which affects the training process. Normalization may be used, for example scaling each attribute to values between 0 and 1:
    x_i' = (x_i - min_i) / (max_i - min_i)
  for each value x_i of attribute i, where min_i and max_i are the minimum and maximum values of that attribute over the training set. (A short code sketch appears at the end of this part.)

Network Topology
• The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.
• Two types of adaptive algorithms can be used:
  – Start from a large network and successively remove neurons and links until network performance degrades.
  – Begin with a small network and introduce new neurons until performance is satisfactory.

Network Parameters
• How are the weights initialized?
• How is the learning rate chosen?
• How many hidden layers and how many neurons?
• How many examples in the training set?

Initialization of Weights
• In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5.
• If some inputs are much larger than others, random initialization may bias the network to give much more importance to the larger inputs. In such a case, weights can be initialized as follows:
    w_ij = 1 / (2m |x_i|),   i = 1, ..., m,  j = 1, ..., n    for the weights from the inputs to the first layer
    w_jk = 1 / (2n |Σ_i w_ij x_i|)                            for the weights from the first to the second layer

Choice of Learning Rate
• The right value of η depends on the application. Values between 0.1 and 0.9 have been used in many applications.

Size of Training Set
• Rule of thumb: the number of training examples should be at least five to ten times the number of weights of the network.
• Other rule: N ≥ |W| / (1 - a), where |W| is the number of weights and a is the expected accuracy.

Applications of FFNN
• Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning tasks.
  – Recognizing printed or handwritten characters
  – Face recognition
  – Classification of loan applications into credit-worthy and non-credit-worthy groups
  – Analysis of sonar and radar signals to determine the nature of the source of a signal
• Regression and forecasting: FFNN can be applied to learn non-linear functions (regression) and, in particular, functions whose inputs are a sequence of measurements over time (time series).
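A small sketch of the normalization formula and the training-set-size rule of thumb above, in NumPy; the function names and the guard for constant columns are choices made here for illustration.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each attribute (column) of X to [0, 1] over the training set:
    x' = (x - min_i) / (max_i - min_i)."""
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero for constant columns
    return (X - mins) / span, mins, maxs

def min_training_examples(n_weights, expected_accuracy):
    """Rule of thumb from above: N >= |W| / (1 - a)."""
    return int(np.ceil(n_weights / (1.0 - expected_accuracy)))
```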
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition

Categorical attributes and multiple classes
• A categorical attribute is usually decomposed into a series of (0, 1) continuous attributes, one per value, indicating whether that attribute value is present or not.
• Each class corresponds to one output node; the desired output of the node is "1" for any instance belonging to this class (otherwise "0").
  – For each test instance, the final class label is determined by the output node with the maximum output value.

A generalized delta rule
• If η is small, the algorithm learns the weights very slowly; if η is large, the large weight changes may cause unstable behavior, with oscillations of the weight values.
• A technique for tackling this problem is the introduction of a momentum term in the delta rule, which takes previous updates into account. We obtain the following generalized Delta rule:
    Δw_ij(n) = α Δw_ij(n-1) + η δ_j(n) x_i(n),   with momentum constant 0 ≤ α < 1.
  The momentum term accelerates the descent in steady downhill directions.

Neural Nets for object recognition from images
• Objective: identify interesting objects in input images.
• Face recognition
  – Locate faces, happy/sad faces, gender, face pose, orientation
  – Recognize specific faces: authorization
• Vehicle recognition (traffic control or safe-driving assistance)
  – Passenger car, van, pickup, bus, truck
• Traffic sign detection
• Challenges
  – Image size (100x100, 10240x10240)
  – Object size, pose, and orientation
  – Illumination

Example: Face Detection Challenges
• Pose variation
• Lighting condition variation
• Facial expression variation

Normal procedures: Training
• Training (identify your problem and build a specific model)
  – Build the training dataset
    • Isolate sample images: images containing faces
    • Extract regions containing the objects: regions containing faces
    • Normalize (size and illumination), e.g., to 200x200
    • Select counter-class examples: non-face regions
  – Determine the neural net
    • The input layer is determined by the input images: e.g., a 200x200 image requires 40,000 input dimensions, each containing a value between 0 and 255
    • Neural net architecture: a three-layer FF NN (two hidden layers) is common practice
    • The output layer is determined by the learning problem: binary or multi-class classification
  – Train the neural net

Normal procedures: Test
• Given a test image:
  – Select a small region (considering all possible object locations and sizes)
    • Scan from the top left to the bottom right
    • Sample at different scale levels
  – Feed the region into the network and determine whether it contains the object or not
  – Repeat the above process, which is time consuming (see the scanning sketch at the end of this part)

CMU Neural Nets for Face Pose Recognition
• Head pose (1-of-4): 90% accuracy
• Face recognition (1-of-20): 90% accuracy

Neural Net Based Face Detection
• Large training set of faces and a small set of non-faces
• The training set of non-faces is built up automatically:
  – Start from a set of images with no faces
  – Every 'face' detected in them is added to the non-face training set.
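The following is a rough sketch of the sliding-window test procedure described above ("Normal procedures: Test"). The classify callable standing in for the trained network, the window size, scales, step, decision threshold, and the crude nearest-neighbour resize are all illustrative assumptions, not part of the CMU code.

```python
import numpy as np

def scan_image(image, classify, window=200, scales=(1.0, 0.5, 0.25), step=20):
    """Scan every location from top-left to bottom-right at several scales and
    ask the network whether each region contains the object.

    image    : 2-D grayscale array with pixel values in 0..255
    classify : hypothetical callable mapping a flattened (window x window)
               patch, scaled to [0, 1], to a score in [0, 1]
    """
    detections = []
    for s in scales:
        # crude nearest-neighbour resize to simulate sampling at different scale levels
        h, w = image.shape
        rows = (np.arange(int(h * s)) / s).astype(int)
        cols = (np.arange(int(w * s)) / s).astype(int)
        scaled = image[np.ix_(rows, cols)]
        H, W = scaled.shape
        for r in range(0, H - window + 1, step):
            for c in range(0, W - window + 1, step):
                patch = scaled[r:r + window, c:c + window].astype(float) / 255.0
                if classify(patch.ravel()) > 0.5:       # region classified as containing the object
                    detections.append((s, r, c))
    return detections
```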
Traffic sign detection
• Demo: http://www.mathworks.com/products/demos/videoimage/traffic_sign/vipwarningsigns.html
• Intelligent traffic light control systems
  – Instead of using loop detectors (like metal detectors)
  – Use surveillance video to detect vehicles and bicycles

Vehicle Detection
• Intelligent vehicles aim at improving driving safety with machine vision techniques
• http://www.mobileye.com/visionRange.shtml

Term Project (1)
• Modify the CMU face recognition source code to train a classifier for one type of image classification problem
  – You identify your own objective (make your objective unique): gender recognition, kid/adult recognition, etc.
  – Available source code (C, Unix)
  – Team: maximum of 3 members
  – Due date: April 30
  – A written report (3-page minimum) covering:
    • Your objective
    • System architecture
    • Experimental results

Alternative choice (2)
• Alternatively, you can propose your own term project.
• Requirements:
  – Must relate to neural networks and classification
  – Must have a clear objective
  – Must involve programming work
  – Must have experimental assessment results
  – Must have a written report (3-page minimum)
  – Send me your proposal by April 4.

CMU NN face recognition source code
• Dr. Tom Mitchell (Machine Learning)
  – http://www.cs.cmu.edu/~tom/faces.html
• What is available?
  – Image dataset: different classes of images (pose, expression, glasses, etc.) in PGM format
  – Complete C source code
    • PGM image read/write
    • 3-layer feed-forward neural network architecture
    • Backpropagation learning algorithm
    • Weight visualization
  – Documentation: a 13-page document listing the details of the datasets and the source code

Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
  – FF NN model
  – Backpropagation (BP) Algorithm
  – BP rules derivation
  – Practical Issues of FFNN
• FFNN for Face Recognition