Machine Learning Lecture 3: ADALINE and the Delta Rule

The ADALINE and Delta Rule
• Adaptive Linear Element (ADALINE) vs. Perceptron
  [Figure: both units take the inputs x1, …, xn through the weights w1, …, wn, plus a bias weight w0, and produce an output o.]
  – Both units first form the weighted sum $R = w_0 + \sum_{i=1}^{n} w_i x_i$.
  – The perceptron thresholds this sum: $o = \operatorname{sign}(R) = \begin{cases} +1, & \text{if } R \ge 0 \\ -1, & \text{otherwise} \end{cases}$
  – The ADALINE outputs the sum directly: $o = w_0 + \sum_{i=1}^{n} w_i x_i$
  – When the problem is not linearly separable, the perceptron will fail to converge.
  – The ADALINE can overcome this difficulty by finding a best-fit approximation to the target.

The ADALINE Error Function
• We have training pairs (X(k), d(k)), k = 1, 2, …, K, where K is the number of training samples. The training error measures the difference between the output of the ADALINE and the desired target.
• The error is defined as
    $E(W) = \frac{1}{2} \sum_{k=1}^{K} \big(d(k) - o(k)\big)^2$
  where $o(k) = W^{T} X(k) = w_0 + \sum_{i=1}^{n} w_i x_i(k)$ is the output obtained by presenting the training input X(k).
• The smaller E(W) is, the closer the approximation.
• We need to find the W, based on the given training set, that minimizes the error E(W).
  [Figure: the error surface E(W) plotted over the weight space.]

The Gradient Descent Rule
• An intuition – before we formally derive the gradient descent rule, here is an intuition of what we should be doing.
  [Figure: a one-dimensional error surface E(w) with a randomly chosen initial weight w(0), the new weight w(new), and the slope of the surface at w(0) marked.]
  – Start from a randomly chosen initial weight w(0).
  – We want to move w(0) to a new value w(new) such that E(w(new)) < E(w(0)).
  – Which direction should we move w(0), and how do we know? The sign of the gradient $\frac{\partial E}{\partial w}\big|_{w(0)}$ tells us: move w against the sign of the slope.
  – This intuition leads to the update $w_{new} = w_{old} - \eta \operatorname{sign}\!\left(\frac{\partial E}{\partial w}\Big|_{w_{old}}\right)$, where η is a small step size.
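To make the error function and the sign-of-the-gradient intuition concrete, here is a minimal Python sketch (not from the lecture) for a hypothetical one-weight ADALINE on made-up data; the function names, the data and the step size are illustrative assumptions.

import numpy as np

# Hypothetical one-weight ADALINE o(k) = w * x(k) (bias omitted to keep it one-dimensional)
x = np.array([-1.0, -0.5, 0.5, 1.0])   # made-up training inputs
d = 2.0 * x                            # made-up targets; the best weight is w = 2

def error(w):
    # E(w) = 1/2 * sum_k (d(k) - o(k))^2 with o(k) = w * x(k)
    return 0.5 * np.sum((d - w * x) ** 2)

def gradient(w):
    # dE/dw = -sum_k (d(k) - o(k)) * x(k)
    return -np.sum((d - w * x) * x)

w, eta = -1.0, 0.1                      # w(0) chosen arbitrarily; eta is a small step size
for _ in range(40):
    w = w - eta * np.sign(gradient(w))  # move w against the sign of the slope at w
print(w, error(w))                      # w walks toward the minimiser w = 2 and then hovers around it

Running it shows E(w) shrinking as w is stepped against the sign of the slope, which is exactly the picture on the error-surface slides.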
The Gradient Descent Rule
• Formal derivation of gradient descent
  – The gradient of E is a vector whose components are the partial derivatives of E with respect to each of the weights:
    $\nabla E(W) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$
  – The gradient specifies the direction that produces the steepest increase in E.
  – The negative of this vector gives the direction of steepest decrease.
• The gradient descent training rule is
    $W \leftarrow W - \eta \nabla E(W)$,  i.e.  $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$
  where η is the training rate.
• Gradient of the ADALINE error function:
    $\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{k=1}^{K} \big(d(k) - o(k)\big)^2$
    $= \frac{1}{2} \sum_{k=1}^{K} 2 \big(d(k) - o(k)\big) \frac{\partial}{\partial w_i} \big(d(k) - o(k)\big)$
    $= \sum_{k=1}^{K} \big(d(k) - o(k)\big) \frac{\partial}{\partial w_i} \left( d(k) - w_0 - \sum_{j=1}^{n} w_j x_j(k) \right)$
    $= -\sum_{k=1}^{K} \big(d(k) - o(k)\big)\, x_i(k)$
• ADALINE weight updating using the gradient descent rule:
    $w_i \leftarrow w_i + \eta \sum_{k=1}^{K} \big(d(k) - o(k)\big)\, x_i(k)$

• Gradient descent training procedure (batch mode)
  – Initialise the w_i to small values, e.g. in the range (-1, 1), and choose a learning rate, e.g. η = 0.2.
  – Until the termination condition is met, do:
    • For all training sample pairs (X(k), d(k)), input the instances X(k) and compute
        $\Delta_i = \sum_{k=1}^{K} \big(d(k) - o(k)\big)\, x_i(k)$
    • For each weight w_i, do  $w_i \leftarrow w_i + \eta \Delta_i$
  – Batch mode: the gradients are accumulated over ALL samples first, then the weights are updated.

Stochastic (Incremental) Gradient Descent
• Also called online mode, Least Mean Square (LMS), Widrow–Hoff, and the Delta Rule
  – Initialise the w_i to small values, e.g. in the range (-1, 1), and choose a learning rate, e.g. η = 0.01 (it should be smaller than in batch mode).
  – Until the termination condition is met, do:
    • For EACH training sample pair (X(k), d(k)), compute
        $\Delta_i = \big(d(k) - o(k)\big)\, x_i(k)$
    • For each weight w_i, do  $w_i \leftarrow w_i + \eta \Delta_i$
  – Online mode: the gradient is calculated for EACH sample, then the weights are updated.
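As a concrete illustration of the two procedures above, here is a minimal Python sketch (not part of the lecture) of batch-mode gradient descent and the incremental delta rule for an ADALINE; the data, the function names and the epoch counts are illustrative assumptions.

import numpy as np

def train_batch(X, d, W, eta=0.2, epochs=50):
    # Batch mode: accumulate the gradient over ALL samples, then update the weights once per epoch.
    for _ in range(epochs):
        o = X @ W                        # o(k) = w0 + sum_i w_i x_i(k); X has a leading column of ones for w0
        W = W + eta * (X.T @ (d - o))    # w_i <- w_i + eta * sum_k (d(k) - o(k)) x_i(k)
    return W

def train_delta(X, d, W, eta=0.01, epochs=50):
    # Online mode (LMS / Widrow-Hoff / delta rule): update the weights after EACH sample.
    for _ in range(epochs):
        for k in range(len(d)):
            o_k = X[k] @ W
            W = W + eta * (d[k] - o_k) * X[k]   # w_i <- w_i + eta * (d(k) - o(k)) x_i(k)
    return W

# Made-up training set: three samples, two inputs, plus a leading 1 in each row for the bias weight w0.
X = np.array([[1.0,  0.2, -0.4],
              [1.0, -0.7,  0.5],
              [1.0,  0.9,  0.1]])
d = np.array([0.5, -0.3, 1.0])
W0 = np.array([0.1, -0.1, 0.05])         # small initial weights in (-1, 1)

print(train_batch(X, d, W0.copy()))
print(train_delta(X, d, W0.copy()))

The only difference between the two functions is where the weight update happens: outside the loop over samples (batch) or inside it (incremental), which is why the incremental version typically uses a smaller learning rate.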
Training Iterations, Epochs
• Training is an iterative process; the training samples have to be used repeatedly.
• Assuming we have K training samples [(X(k), d(k)), k = 1, 2, …, K], an epoch is the presentation of all K samples for training once.
  – First epoch: present the training samples (X(1), d(1)), (X(2), d(2)), …, (X(K), d(K)).
  – Second epoch: present the training samples (X(K), d(K)), (X(K-1), d(K-1)), …, (X(1), d(1)).
  – Note that the order in which the training samples are presented can (and normally should) differ from epoch to epoch.
• Normally, training will take many epochs to complete.

Termination of Training
• To terminate training, there are normally two criteria:
  – stop when a pre-set number of training epochs is reached;
  – stop when the error $E(W) = \frac{1}{2} \sum_{k=1}^{K} \big(d(k) - o(k)\big)^2$ is smaller than a pre-set value.

Gradient Descent Training
• A worked example: a two-input ADALINE with weights W0, W1, W2 (W0 is the bias weight) is trained on the following samples (a code sketch of this example follows the tutorial questions):

    x1   x2    d
    -1   -1   -1
    -1   +1   +1
    +1   -1   +1
    +1   +1   +1

• Initialisation: W0(0) = 0.1; W1(0) = 0.2; W2(0) = 0.3; η = 0.5

Further Readings
• T. M. Mitchell, Machine Learning, McGraw-Hill International Edition, 1997, Chapter 4
• Any other relevant books/papers

Tutorial/Exercise Questions
1. Derive a gradient descent training rule for a single unit with output
     $y = w_0 + w_1 x_1 + w_1 x_1^2 + w_2 x_2 + w_2 x_2^2 + \cdots + w_n x_n + w_n x_n^2$
2. A network consisting of two ADALINE units N1 and N2 is shown as follows. Derive a delta training rule for all the weights.
   [Figure: N1 takes the inputs +1, x1 and x2 through the weights w0, w1 and w2; N2 takes a bias input +1 and the output of N1 through the weights w3 and w4, and produces the output y.]
3. The connection weights of a two-input ADALINE at time n have the following values:
     w0(n) = -0.5,  w1(n) = 0.1,  w2(n) = -0.3
   The training sample at time n is x1(n) = 0.6, x2(n) = 0.8, and the corresponding desired output is d(n) = 1.
   a) Based on the Least-Mean-Square (LMS) algorithm, derive the learning equations for each weight at time n.
   b) Assuming a learning rate of 0.1, compute the weights at time (n+1): w0(n+1), w1(n+1) and w2(n+1).
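For the worked example above, the slides give only the training set, the initial weights and the learning rate, and do not say whether the batch or the incremental update is intended. The following Python sketch is one possible way to carry it out: it traces a single epoch of incremental (delta rule) updates with the given data and initialisation, printing the weights after each sample, much as one would by hand.

import numpy as np

# Worked-example data from the lecture: each row is (1, x1, x2); the leading 1 multiplies the bias weight W0.
X = np.array([[1, -1, -1],
              [1, -1,  1],
              [1,  1, -1],
              [1,  1,  1]], dtype=float)
d = np.array([-1, 1, 1, 1], dtype=float)

W = np.array([0.1, 0.2, 0.3])            # W0(0), W1(0), W2(0)
eta = 0.5

# One epoch of incremental (delta rule) updates, printing the weights after each sample.
for k in range(len(d)):
    o_k = X[k] @ W                       # ADALINE output for sample k
    W = W + eta * (d[k] - o_k) * X[k]    # w_i <- w_i + eta * (d(k) - o(k)) * x_i(k)
    print(f"after sample {k + 1}: W = {W}")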