Multi-layer feed-forward NN ARCHITECTURE
[Figure: a multi-layer feed-forward network with an input layer, a hidden layer and an output layer.]

A solution for the XOR problem

  x1   x2   x1 XOR x2
  -1   -1      -1
  -1    1       1
   1   -1       1
   1    1      -1

[Figure: a two-layer network solving XOR, with weights $+1$ and $-1$ and a threshold of $0.1$ at the output unit.]
$\varphi(v) = +1$ if $v > 0$, $-1$ if $v \le 0$ is the sign function.

NEURON MODEL
• Sigmoid function: $\varphi(v_j) = \dfrac{1}{1 + e^{-a v_j}}$, where $v_j = \sum_{i=0,\dots,m} w_{ji}\, y_i$ is the induced local field of neuron j.
• Most common form of activation function.
• Approaches a threshold function as the slope parameter $a$ increases.
• Differentiable (important in theory).
[Figure: plots of $\varphi(v_j)$ for increasing values of $a$.]

LEARNING ALGORITHM
• Back-propagation algorithm:
  – Forward step: function signals.
  – Backward step: error signals.
• It adjusts the weights of the NN in order to minimize the average squared error.

Average Squared Error
• Error signal of output neuron j at presentation of the n-th training example:
  $e_j(n) = d_j(n) - y_j(n)$
• Total error energy at time n:
  $E(n) = \tfrac{1}{2} \sum_{j \in C} e_j^2(n)$, where C is the set of neurons in the output layer.
• Average squared error (a measure of learning performance):
  $E_{AV} = \tfrac{1}{N} \sum_{n=1}^{N} E(n)$, where N is the size of the training set.
• Goal: adjust the weights of the NN to minimize $E_{AV}$.

Notation
• $e_j$: error at the output of neuron j
• $y_j$: output of neuron j
• $v_j = \sum_{i=0,\dots,m} w_{ji}\, y_i$: induced local field of neuron j

Weight Update
• The update rule is based on the gradient descent method: take a step in the direction yielding the maximum decrease of E:
  $\Delta w_{ji} = -\eta \, \dfrac{\partial E}{\partial w_{ji}}$
  i.e. a step in the direction opposite to the gradient, where $w_{ji}$ is the weight associated to the link from neuron i to neuron j.

Define the Local Gradient of neuron j
• Local gradient: $\delta_j = -\dfrac{\partial E}{\partial v_j}$
• We obtain $\delta_j = e_j \, \varphi'(v_j)$ because, with $e_j(n) = d_j(n) - y_j(n)$ and $E(n) = \tfrac{1}{2}\sum_{j \in C} e_j^2(n)$,
  $\dfrac{\partial E}{\partial v_j} = \dfrac{\partial E}{\partial e_j}\,\dfrac{\partial e_j}{\partial y_j}\,\dfrac{\partial y_j}{\partial v_j} = e_j \cdot (-1) \cdot \varphi'(v_j)$

Update Rule
• We obtain $\Delta w_{ji} = \eta \, \delta_j \, y_i$ because
  $\dfrac{\partial E}{\partial w_{ji}} = \dfrac{\partial E}{\partial v_j}\,\dfrac{\partial v_j}{\partial w_{ji}}$, with $\dfrac{\partial E}{\partial v_j} = -\delta_j$, $\dfrac{\partial v_j}{\partial w_{ji}} = y_i$, and $\Delta w_{ji} = -\eta \, \dfrac{\partial E}{\partial w_{ji}}$.

Compute the local gradient of neuron j
• The key factor is the calculation of $e_j$.
• There are two cases:
  – Case 1: j is an output neuron.
  – Case 2: j is a hidden neuron.

Error $e_j$ of an output neuron
• Case 1: j is an output neuron. Then $e_j = d_j - y_j$ and
  $\delta_j = (d_j - y_j)\,\varphi'(v_j)$.

Local gradient of a hidden neuron
• Case 2: j is a hidden neuron.
• The local gradient of neuron j is recursively determined in terms of the local gradients of all neurons to which neuron j is directly connected.

Use the Chain Rule
$\delta_j = -\dfrac{\partial E}{\partial v_j} = -\dfrac{\partial E}{\partial y_j}\,\dfrac{\partial y_j}{\partial v_j} = -\dfrac{\partial E}{\partial y_j}\,\varphi'(v_j)$
From $E(n) = \tfrac{1}{2}\sum_{k \in C} e_k^2(n)$ and $e_k(n) = d_k(n) - y_k(n)$ we obtain
$\dfrac{\partial E}{\partial y_j} = \sum_{k \in C} e_k \dfrac{\partial e_k}{\partial y_j} = \sum_{k \in C} e_k \,\dfrac{\partial e_k}{\partial v_k}\,\dfrac{\partial v_k}{\partial y_j} = -\sum_{k \in C} e_k \, \varphi'(v_k)\, w_{kj}$,
since $\dfrac{\partial e_k}{\partial v_k} = -\varphi'(v_k)$ and, with $v_k = \sum_j w_{kj}\, y_j$, $\dfrac{\partial v_k}{\partial y_j} = w_{kj}$.

Local Gradient of hidden neuron j
• Hence $\delta_j = \varphi'(v_j) \sum_{k \in C} \delta_k \, w_{kj}$.
[Figure: signal-flow graph of the back-propagation of error signals to neuron j, through the weights $w_{1j}, \dots, w_{kj}, \dots, w_{mj}$ and the local gradients $\delta_1, \dots, \delta_k, \dots, \delta_m$.]

Delta Rule
• Delta rule: $\Delta w_{ji} = \eta \, \delta_j \, y_i$, with
  $\delta_j = \varphi'(v_j)\,(d_j - y_j)$  IF j is an output node
  $\delta_j = \varphi'(v_j) \sum_{k \in C} \delta_k\, w_{kj}$  IF j is a hidden node
  where C is the set of neurons in the layer following the one containing j.

Local Gradient of neurons
• For the sigmoid activation, $\varphi'(v_j) = a\, y_j (1 - y_j)$ with $a > 0$, hence
  $\delta_j = a\, y_j (1 - y_j) \sum_k \delta_k\, w_{kj}$  if j is a hidden node
  $\delta_j = a\, y_j (1 - y_j)\,(d_j - y_j)$  if j is an output node

Backpropagation algorithm
• Two phases of computation:
  – Forward pass: run the NN and compute the error for each neuron of the output layer.
  – Backward pass: start at the output layer and pass the errors backwards through the network, layer by layer, by recursively computing the local gradient of each neuron.
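As a concrete illustration of the two passes and of the delta rule above, here is a minimal NumPy sketch of a single training step for a network with one hidden layer of sigmoid units. It is a sketch under assumptions: the names (backprop_step, W1, W2, x, d, eta) are illustrative and do not appear in the slides, and biases are omitted.

```python
import numpy as np

def sigmoid(v, a=1.0):
    # phi(v) = 1 / (1 + exp(-a*v))
    return 1.0 / (1.0 + np.exp(-a * v))

def backprop_step(W1, W2, x, d, eta=0.5, a=1.0):
    """One forward/backward computation and delta-rule update (no biases, no momentum)."""
    # Forward pass: propagate function signals
    y1 = sigmoid(W1 @ x, a)        # hidden-layer outputs
    y2 = sigmoid(W2 @ y1, a)       # output-layer outputs

    # Backward pass: propagate error signals and compute local gradients
    e = d - y2                                       # e_j = d_j - y_j
    delta2 = a * y2 * (1 - y2) * e                   # output nodes: a*y*(1-y)*(d-y)
    delta1 = a * y1 * (1 - y1) * (W2.T @ delta2)     # hidden nodes: a*y*(1-y)*sum_k delta_k*w_kj

    # Delta rule: w_ji <- w_ji + eta * delta_j * y_i
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return 0.5 * np.sum(e ** 2)                      # instantaneous error energy E(n)
```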
TRAINING
• Sequential mode (on-line, pattern, or stochastic mode):
  – (x(1), d(1)) is presented, a sequence of forward and backward computations is performed, and the weights are updated using the delta rule.
  – The same is done for (x(2), d(2)), …, (x(N), d(N)).

Training
• The learning process continues on an epoch-by-epoch basis until the stopping condition is satisfied.
• From one epoch to the next, choose a randomized ordering for selecting the examples of the training set.

Stopping criteria
• Sensible stopping criteria:
  – Average squared error change: back-prop is considered to have converged when the absolute rate of change of the average squared error per epoch is sufficiently small (typically in the range [0.01, 0.1]).
  – Generalization-based criterion: after each epoch the NN is tested for generalization. If the generalization performance is adequate, then stop.

Generalization
• Generalization: the NN generalizes well if the I/O mapping computed by the network is nearly correct for new data (test set).
• Factors that influence generalization:
  – the size of the training set,
  – the architecture of the NN,
  – the complexity of the problem at hand.
• Overfitting (overtraining): when the NN learns too many I/O examples it may end up memorizing the training data.

Expressive capabilities of NN
Boolean functions:
• Every boolean function can be represented by a network with a single hidden layer,
• but it might require a number of hidden units that is exponential in the number of inputs.
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer.
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers.

Generalized delta rule
• If η is small: slow rate of learning. If η is large: large changes of the weights, and the NN can become unstable (oscillatory).
• Method to overcome this drawback: include a momentum term in the delta rule.
• Generalized delta rule: $\Delta w_{ji}(n) = \alpha \, \Delta w_{ji}(n-1) + \eta \, \delta_j(n) \, y_i(n)$, where $\alpha$ is the momentum constant.

Generalized delta rule
• The momentum accelerates the descent in steady downhill directions.
• The momentum has a stabilizing effect in directions that oscillate in time.

η adaptation
Heuristics for accelerating the convergence of the back-prop algorithm through learning-rate adaptation:
• Heuristic 1: every weight should have its own η.
• Heuristic 2: every η should be allowed to vary from one iteration to the next.

NN DESIGN
• Data representation
• Network topology
• Network parameters
• Training
• Validation

Setting the parameters
• How are the weights initialized?
• How is the learning rate chosen?
• How many hidden layers and how many neurons?
• How many examples in the training set?

Initialization of weights
• In general, initial weights are randomly chosen, with typical values between -1.0 and 1.0 or -0.5 and 0.5.
• If some inputs are much larger than others, random initialization may bias the network to give much more importance to the larger inputs. In such a case, the weights can be initialized as follows:
  $w_{ji} = \dfrac{1}{2N} \sum_{i=1,\dots,N} \dfrac{1}{|x_i|}$  for the weights from the input to the first layer,
  $w_{kj} = \dfrac{1}{2N} \sum_{i=1,\dots,N} \dfrac{1}{\varphi\!\left(\sum_i w_{ji}\, x_i\right)}$  for the weights from the first to the second layer.

Choice of learning rate
• The right value of η depends on the application. Values between 0.1 and 0.9 have been used in many applications.
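The sketch below ties together several of the slides above: sequential (on-line) training with a randomized example ordering per epoch, the generalized delta rule with a momentum term, and the average-squared-error-change stopping criterion. It is a hedged illustration only; the names (train_sequential, X, D, alpha, tol) are assumptions, not part of the original material.

```python
import numpy as np

def sigmoid(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def train_sequential(W1, W2, X, D, eta=0.3, alpha=0.9, a=1.0, epochs=100, tol=0.01):
    """Sequential (on-line) training with momentum; stops when the average
    squared error changes by less than `tol` from one epoch to the next."""
    dW1 = np.zeros_like(W1)   # previous weight changes, used by the momentum term
    dW2 = np.zeros_like(W2)
    prev_eav = np.inf
    for epoch in range(epochs):
        total = 0.0
        for n in np.random.permutation(len(X)):   # randomized ordering per epoch
            x, d = X[n], D[n]
            # forward pass
            y1 = sigmoid(W1 @ x, a)
            y2 = sigmoid(W2 @ y1, a)
            # backward pass: local gradients
            e = d - y2
            delta2 = a * y2 * (1 - y2) * e
            delta1 = a * y1 * (1 - y1) * (W2.T @ delta2)
            # generalized delta rule: dw(n) = alpha*dw(n-1) + eta*delta*y
            dW2 = alpha * dW2 + eta * np.outer(delta2, y1)
            dW1 = alpha * dW1 + eta * np.outer(delta1, x)
            W2 += dW2
            W1 += dW1
            total += 0.5 * np.sum(e ** 2)
        eav = total / len(X)                      # average squared error E_AV
        if abs(prev_eav - eav) < tol:             # error-change stopping criterion
            break
        prev_eav = eav
    return W1, W2
```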
• Other heuristics adapt η during training, as described in the previous slides.

How many layers and neurons
• The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.
• Two types of adaptive algorithms can be used:
  – start from a large network and successively remove some neurons and links until network performance degrades;
  – begin with a small network and introduce new neurons until performance is satisfactory.

How many examples
• Rule of thumb: the number of training examples should be at least five to ten times the number of weights of the network.
• Other rule: $N \ge \dfrac{|W|}{1 - a}$, where $|W|$ is the number of weights and $a$ is the expected accuracy on the test set.
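As a quick numeric check of the second rule: a network with |W| = 80 weights and an expected test-set accuracy a = 0.9 would need at least 80 / (1 - 0.9) = 800 training examples. The tiny helper below is a hypothetical illustration, not part of the slides.

```python
import math

def min_training_examples(num_weights, expected_accuracy):
    """Rule of thumb: N >= |W| / (1 - a)."""
    return math.ceil(num_weights / (1.0 - expected_accuracy))

print(min_training_examples(80, 0.9))   # -> 800
```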