8 March 2010
Boris Mirkin
Computational Intelligence and Data Visualization
http://www.dcs.bbk.ac.uk/~mirkin/advanced

Topics for this and next lecture:
Correlation: Supervised learning
o Statement of the problem
o Linear regression
o Linear discrimination
o Gradient optimisation for learning
o Linear regression with the steepest descent
o Perceptron
o Artificial neuron
o Neural Network with one hidden layer
o Error back propagation for learning weights
o Data standardization in NN learning
o Prediction of Iris sepal sizes
o Decision trees
o Bayes approach and Naïve Bayes classifier
o SVM and kernel

Correlation: Supervised Learning

Problem: Given N pairs $(x_i, u_i)$ (observed at entities $i = 1, \dots, N$) in which $x_i = (x_{i1}, \dots, x_{ip})$ are predictor/input vectors (dimension p) and $u_i = (u_{i1}, \dots, u_{iq})$ are target/output vectors (dimension q), build a decision rule $\hat{u} = F(x)$ such that the difference between the computed $\hat{u}$ and the observed target vector u, given x, is minimal over the class of admissible rules F.

Specifically, let us take a look at the iris.dat data set.

[Figure 0.1: sepal and petal in an Iris flower]

This popular data set describes 150 Iris specimens, representing three taxa of Iris flowers, I Iris setosa (diploid), II Iris versicolor (tetraploid) and III Iris virginica (hexaploid), 50 specimens from each. Each specimen is measured on four morphological variables: sepal length (w1), sepal width (w2), petal length (w3), and petal width (w4) (see Figure 0.1).

Table 0.3. Iris data: Iris specimens measured over four features each (three from each taxon shown).
     I Iris setosa          II Iris versicolor     III Iris virginica
#    w1   w2   w3   w4      w1   w2   w3   w4      w1   w2   w3   w4
1    5.1  3.5  1.4  0.3     6.4  3.2  4.5  1.5     6.3  3.3  6.0  2.5
2    4.4  3.2  1.3  0.2     5.5  2.4  3.8  1.1     6.7  3.3  5.7  2.1
3    4.4  3.0  1.3  0.2     5.7  2.9  4.2  1.3     7.2  3.6  6.1  2.5

Assume, for illustrative purposes, that sepal is easy to measure and petal is not. This mimics a real-world data collection. I would like to have a rule for predicting petal measures from those of sepal.

Why (and how) should one restrict the class of admissible rules F? A big question, a shaky answer. Take a look at the 2D regression problem: pairs (x,u) are observed at N entities.

[Figure: seven observed points in the (x, u) plane, with interpolating curves and a blue straight line]

We have N=7 points on the Figure, which thus can be exactly fitted by a polynomial of 6th order, $u = p(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6$. Indeed, the 7 points give 7 equations $u_i = p(x_i)$ ($i = 1, \dots, 7$) that exactly determine the 7 coefficients of p(x), provided the $x_i$ are all different. Polynomial p(x), on whose graph all observations lie, has no predictive power: beyond the range, the curve may go either course (like those shown). The blue straight line fits none of the points but expresses the tendency and should be preferred. If there is no theoretical motivation, it is hard to tell what class of Fs to use.

Occam's razor: William Ockham (c. 1285–1349): “Entities should not be multiplied unnecessarily.” (“All things being equal, the simplest explanation tends to be the best one.”)
Interpretation: “Principle of maximum parsimony (i.e., economy)”
My motto: “The simpler a theory, the more cases it covers”

A decision rule is an algorithm, not necessarily expressed as an analytic function. Different approaches depending on assumptions of: Data Flow, Type of Target, Type of Rule, Criterion.
Data flow: Entities $i = 1, \dots, N$ come one by one (incremental, or on-line, learning) or are all known at once (batch mode).

Type of target (Quantitative/Categorical) and rule:
Regression: u quantitative, q=1 or more
- Linear regression (F is linear)
- Decision tree (F is tree-like)
- Neural Nets (F is a net structure, general)
- Evolutionary algorithms (F is general)
Pattern recognition (classification): u binary
- Discrimination (F is linear)
- Support vector machine (SVM) (F is linear)
- Logistic regression (F is a probability, exponential-linear)
- Neural Nets (F is general)
- Evolutionary algorithms (F is general)

Criterion: least-squares, maximum likelihood, error count

Frequently, the difference between u and $\hat{u}$ is measured with the squared error,
$E = \langle u - \hat{u}, u - \hat{u} \rangle = \langle u - F(x), u - F(x) \rangle$   (1)
and then it is E that must be minimised over all admissible Fs. In the non-linear case, this is not easy to do. First consider the linear case.

Linear (multiple) regression (q=1, least squares)

Rule assumption: $u = w_1 x_1 + w_2 x_2 + \dots + w_p x_p + w_0$ where $w_0, w_1, \dots, w_p$ are unknown constants. For any entity $i = 1, \dots, N$ the rule-computed value $\hat{u}_i = w_1 x_{i1} + w_2 x_{i2} + \dots + w_p x_{ip} + w_0$ differs from the observed one by $d_i = |\hat{u}_i - u_i|$. To find $w_1, w_2, \dots, w_p, w_0$, minimise
$D^2 = \sum_i d_i^2 = \sum_i (u_i - w_1 x_{i1} - w_2 x_{i2} - \dots - w_p x_{ip} - w_0)^2$   (1)
over all possible vectors $w = (w_0, w_1, \dots, w_p)$.

To make the problem uniform, a fictitious feature $x_0$ is introduced such that all its values are 1: $x_{i0} = 1$ for all $i = 1, \dots, N$. Then the criterion $D^2$ involves no separate intercept, just the inner products $\langle w, x_i \rangle$ where $w = (w_0, w_1, \dots, w_p)$ and $x_i = (x_{i0}, x_{i1}, x_{i2}, \dots, x_{ip})$ are (p+1)-dimensional vectors, w unknown, $x_i$ known. From now on, the intercept in (1) is not written separately because of this convention.

Solution: $D^2$ is just the squared Euclidean distance between the N-dimensional target feature column $u = (u_i)$ and the vector $\hat{u} = Xw$ whose components are $\hat{u}_i = \langle w, x_i \rangle$.
Here X is the N x (p+1) matrix whose rows are $x_i$ (augmented with the component $x_{i0} = 1$, thus being (p+1)-dimensional) so that Xw is the matrix algebra product of X and w. Vectors defined as Xw for all possible w form a vector space of dimension at most p+1 (exactly p+1 when the columns of X are linearly independent), the X-span. Problem reformulated: given u, find its projection $\hat{u}$ in the X-span space. The global solution to the problem is linear, defined by X:
$\hat{u} = P_X u$   (2)
where $P_X$ is the so-called orthogonal projection matrix, of size N x N:
$P_X = X (X^T X)^{-1} X^T$, so that $\hat{u} = X (X^T X)^{-1} X^T u$.
Matrix $P_X$ projects every N-dimensional vector u to its nearest match in the X-span space. Equation (2) implies that the optimal w is $w = (X^T X)^{-1} X^T u$. The inverse $(X^T X)^{-1}$ does not necessarily exist: it exists only when the columns of X are linearly independent.

Linear discrimination

The linear discrimination problem differs only in that the values $u_i$ are binary, either “yes” or “no”: a classification, not regression, problem. To make it quantitative, define $u_i = 1$ if i belongs to the “yes” class and $u_i = -1$ if i belongs to the “no” class. The intercept is referred to, in the context of the discrimination/classification problem, as bias.

On the Figure, entities $(x_1, x_2, u)$ are presented by stars * at u=1 and circles at u=-1. Vector w represents the solution described above; the dashed line represents the set of all x that are orthogonal to w, $\langle w, x \rangle = 0$, which is the separating hyperplane. The Figure shows a relatively rare situation in which the two patterns can be separated by a hyperplane: linear separability.

Linear discriminant decision rule: if $\hat{u}_i = \langle w, x_i \rangle > 0$, predict $\mathring{u}_i = 1$; if $\hat{u}_i = \langle w, x_i \rangle < 0$, predict $\mathring{u}_i = -1$; that is, $\mathring{u}_i = \mathrm{sign}(\langle w, x_i \rangle)$ (sign(a) = 1 when a > 0, = -1 when a < 0, and = 0 when a = 0).

A general approach to optimization of any function: Gradient optimisation (the steepest ascent/descent, or hill-climbing) of any function f(z) of a multidimensional variable z: given an initial state $z = z_0$, do a sequence of iterations to move to a better z location.
Each iteration updates the z-value:
$z(\text{new}) = z(\text{old}) \pm \mu \cdot \mathrm{grad}(f(z(\text{old})))$   (2)
where + applies if f is maximised, and -, if minimised. Here
- grad(f(z)) stands for the vector of partial derivatives of f with respect to the components of z. It is known from calculus that the vector grad(f(z)) shows the direction of the steepest rise of function f at point z. Accordingly, -grad(f(z)) shows the steepest descent direction. We do not discuss here how the grad vector can be estimated at any point.
- the value $\mu$ controls the length of the change of z in (2) and should be small (to guarantee not over-jumping), but not too small (to guarantee changes when grad(f(z(old))) becomes too small; indeed grad(f(z(old))) = 0 if z(old) is an optimum).

Q: For those who know partial derivatives, and thus know that the gradient is the vector of the partial derivatives of the function. What is the gradient of function $f(x_1, x_2) = x_1^2 + x_2^2$? Function $f(x_1, x_2) = (x_1 - 1)^2 + 3(x_2 - 4)^2$? Function $f(z_1, z_2) = 3 z_1^2 + (1 - z_2)^4$?
A: $(2x_1, 2x_2)$; $(2(x_1 - 1), 6(x_2 - 4))$; $(6 z_1, -4(1 - z_2)^3)$.

What is good about the gradient method is that it can be interpreted within the machine learning paradigm in which entities come one by one, and go (like “cats”). This is why it is often used even when a global optimisation algorithm is known.

Square error minimisation using steepest descent (for linear regression)

Problem: build a linear rule for predicting scalar u using p input variables x, $u = w_1 x_1 + w_2 x_2 + \dots + w_p x_p + w_0 x_0$ where $x_0 = 1$ for any entity.
0. Initialise weights w randomly.
1. For each training instance $(x_i, u_i)$
   a. Compute grad($E_i(w)$) where $E_i(w)$ is the part of criterion E in (1) related to the instance:
      $E_i = (u_i - w_1 x_{i1} - w_2 x_{i2} - \dots - w_p x_{ip} - w_0 x_{i0})^2$
      From calculus, the t-th component of the gradient is
      $\partial E_i / \partial w_t = -2 (u_i - \hat{u}_i) x_{it}$ (t = 0, 1, …, p)
   b. Update weights w according to equation $w(\text{new}) = w - \mu \, \mathrm{grad}(E_i(w))$, so that
      $w_t(\text{new}) = w_t + \mu (u_i - \hat{u}_i) x_{it}$
      (here $\mu$ is put rather than $2\mu$ because it is arbitrary).
2. If $w(\text{new}) \approx w(\text{old})$, stop; otherwise go to 1 with w = w(new).

Perceptron algorithm (for linear discrimination)

Problem: build a linear rule for predicting scalar 1/-1 u using p input variables x, $u = w_1 x_1 + w_2 x_2 + \dots + w_p x_p + w_0 x_0$ where $x_0 = 1$ for any entity (pattern recognition: u=1 for the pattern, u=-1 for not).
0. Initialise weights w randomly or to zero.
1. For each training instance $(x_i, u_i)$
   a. compute $\mathring{u}_i = \mathrm{sign}(\langle w, x_i \rangle)$
   b. if $\mathring{u}_i \neq u_i$, update weights w according to equation
      $w(\text{new}) = w(\text{old}) + \mu (u_i - \mathring{u}_i) x_i$
      where $\mu$, $0 < \mu < 1$, is the so-called learning rate.
2. Stopping rule: $w(\text{new}) \approx w(\text{old})$.

The perceptron is proven to converge, in a finite number of updates, to a w that separates the patterns when they are linearly separable.

The perceptron is a slightly modified form of the conventional anti-gradient minimisation algorithm: the partial derivative of $E_i$ with respect to $w_t$ is equal to $-2(u_i - \hat{u}_i) x_{it}$, which is similar to the term $(u_i - \mathring{u}_i) x_{it}$ used in the perceptron learning rule. Thus, the innovation: change the continuous $\hat{u}_i$ for the discrete $\mathring{u}_i = \mathrm{sign}(\hat{u}_i)$ in the anti-gradient process.

Neuron and artificial neuron

A neuron cell fires an output when its summary input becomes higher than a threshold. A dendrite brings the signal in, the axon passes it out, and the firing occurs via a synapse, a gap between neurons, that makes the threshold.

The decision rule $\mathring{u}_i = \mathrm{sign}(\hat{u}_i)$ can be interpreted in terms of an artificial neuron: features $x_i$ are input signals (from other neurons), weights $w_t$ are the wiring (axon) features, the bias $w_0$ is the firing threshold, and sign() is the neuron activation function. This way, the perceptron is an example of nature-inspired computation.

An artificial neuron is a model: a set of inputs (corresponding to x-features), wiring weights, and an activation function involving a firing threshold.
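The perceptron steps above can be sketched in MatLab as follows. This is only a minimal illustration under assumptions that are not in the lecture: a tiny hand-made separable data set, zero initial weights, a learning rate of 0.5, and a cap of 100 epochs; all variable names are hypothetical.

```matlab
% Perceptron sketch on a toy, linearly separable data set (assumed data).
X = [ 2  1;  3  2;  2  3; -1 -1; -2 -1; -1 -3];  % inputs, one row per entity
u = [ 1;  1;  1; -1; -1; -1];                    % targets: 1 / -1
N = size(X, 1);
X = [X ones(N, 1)];        % fictitious feature x0 = 1 (bias), stored last
w = zeros(3, 1);           % step 0: initialise weights to zero
mu = 0.5;                  % learning rate, 0 < mu < 1 (assumed value)
maxEpochs = 100;           % safety cap (an assumption, not in the lecture)
for epoch = 1:maxEpochs
    changed = false;
    for i = 1:N
        xi = X(i, :);
        pred = sign(xi * w);                 % step 1a: discrete prediction
        if pred ~= u(i)                      % step 1b: update on a mistake
            w = w + mu * (u(i) - pred) * xi';
            changed = true;
        end
    end
    if ~changed            % step 2: no weight changed over a whole pass
        break;
    end
end
predAll = sign(X * w);           % predictions with the final weights
nErrors = sum(predAll ~= u);     % number of misclassified entities
```

On data that are not linearly separable the inner updates would never stop on their own, which is why the sketch carries the epoch cap.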
Two popular activation functions, besides the sign function $\mathring{u}_i = \mathrm{sign}(\hat{u}_i)$, are the linear activation function, $\mathring{u}_i = \hat{u}_i$ (we considered it when discussing the steepest descent) and the sigmoid activation function $\mathring{u}_i = s(\hat{u}_i)$ where
$s(x) = 1 / (1 + e^{-x})$,   (1)
which is a smooth analogue to the sign function. This function's output is always between 0 and 1. To imitate the perceptron with its sign(x) output, between -1 and 1, take
$th(x) = 2 s(x) - 1 = 2 (1 + e^{-x})^{-1} - 1$   (2)
This function is usually referred to as the hyperbolic tangent (strictly, it equals tanh(x/2), the hyperbolic tangent with a halved argument). In contrast to sigmoid s(x), th(x) is symmetric about the origin: th(-x) = -th(x), like sign(x), which can be useful in some contexts.

An artificial neuron is a linear function supplied with a non-linear threshold rule.

Multi-layer neural nets

Neuron nets with one hidden layer (proven to suffice)

0. Problem: Iris features are in pairs: the size (length and width) of sepals (features 1, 2) and that of petals (features 3, 4). It is likely that the sepal sizes and petal sizes are related.

Data: iris.dat, 150 x 4, /advanced/ml/Data/iris.dat

Consider, at any Iris specimen $x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4})$, $i = 1, \dots, 150$, $x = (x_{i1}, x_{i2})$ (sepal) as input and $u = (x_{i3}, x_{i4})$ (petal) as output. Find F such that $u \approx F(x)$.

1. Model: One-hidden-layer NN

Build F as a neural network of three layers:
(a) input layer that accepts $x = (x_{i1}, x_{i2})$ and bias $x_0 = 1$ (see the previous lecture),
(b) output layer producing estimate $\hat{u}$ for output $u = (x_{i3}, x_{i4})$, and
(c) intermediate, or hidden, layer to allow more flexibility in the space of feasible functions F (hidden because it is not seen from the outside).

This structure (Figure 1) is generic in NN theory; it has been proven, for instance, that such a structure can exactly learn any subset of the set of entities. Moreover, any pre-specified continuous u = F(x) can be approximated with such a one-hidden-layer network, if the number of hidden neurons is large enough (Cybenko 1989).
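Returning to the two activation functions defined earlier in this section, their stated properties can be checked numerically. The MatLab sketch below is an illustration only (the grid of x values and the variable names are assumptions): it evaluates s(x) and th(x) = 2s(x) - 1 and measures how far th is from being odd and from MatLab's built-in tanh taken at x/2.

```matlab
s  = @(x) 1 ./ (1 + exp(-x));   % sigmoid, output in (0, 1)
th = @(x) 2 * s(x) - 1;         % symmetric sigmoid, output in (-1, 1)
x = -5:0.5:5;                   % evaluation grid (assumed)
oddGap  = max(abs(th(-x) + th(x)));        % zero if th(-x) = -th(x)
tanhGap = max(abs(th(x) - tanh(x / 2)));   % zero if th(x) = tanh(x/2)
inRange = all(s(x) > 0 & s(x) < 1) && all(abs(th(x)) < 1);
```

Both gaps come out at rounding-error level, so th can be computed either from the sigmoid or as tanh(x/2).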
[Figure 1: network diagram. Input layer I (linear) with nodes I1 ($x_1$), I2 ($x_2$), I3 ($x_0 = 1$); hidden layer II (sigmoid) with nodes II1, II2, II3; output layer III (linear) with nodes III1 ($\hat{u}_1$), III2 ($\hat{u}_2$); weights $w_{ij}$ link layer I to layer II and weights $v_{jk}$ link layer II to layer III.]

Figure 1. A feed-forward network with 2 input and 2 output features (no feedback loops). Layers: input (I, indexed by i), output (III, indexed by k) and hidden (II, indexed by j).

Weights I to II form the 3x3 matrix $W = (w_{ij})$, i = I1, I2, I3, j = II1, II2, II3.
Weights II to III form the 3x2 matrix $V = (v_{jk})$, j = II1, II2, II3, k = III1, III2.
Layers I and III are assumed to give the identical (linear) transformation; the hidden layer (II) applies sigmoid s(x) or th(x).

2. Formula for NN transformation F:

Node j of hidden layer II:
Input: $z_j = w_{1j} x_1 + w_{2j} x_2 + w_{3j} x_3$, which is the j-th component of vector $z = \sum_i x_i w_{ij} = x W$, where x is the 1x3 input vector and $W = (w_{ij})$ is the 3x3 weight matrix. ($x_3$ represents $x_0$ on the network scheme.)
Output: $\varphi(z_j)$, j = 1, 2, 3, where $\varphi$ is the activation function, sigmoid s(x) or th(x).

Node k of output layer III:
Output = Input, $\sum_j v_{jk} \varphi(z_j)$, which is the k-th component of the matrix product $\hat{u} = \varphi(z) V$.

Thus, the NN on Figure 1 transforms input x into output $\hat{u}$ as:
$\hat{u} = \varphi(x W) V$   (3)
If matrices W, V are known, (3) expresses, and computes, the unknown function u = F(x) in terms of $\varphi$, W, and V.

3. Learning problem

Find weight matrices W and V minimising the squared difference between observed u and $\hat{u}$ found with (3),
$E = d(u, \hat{u}) = \langle u - \varphi(x W) V, \; u - \varphi(x W) V \rangle$,   (4)
over the training entity set. (A very much non-linear problem!)

4. Learning weights with error back propagation

4.1. Updating formula.

In NN applications, learning weights W and V minimising E is done with back-propagation that imitates the gradient descent. It runs iterations of updating V and W, each based on the data of an entity (in our case, one of 150 Iris specimens), with the input values in $x = (x_i)$ and output values in $u = (u_k)$.
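Before turning to the updates themselves, transformation (3), from which every iteration starts, can be sketched in MatLab. The weights, the input and the target below are arbitrary illustrative numbers, not values from the lecture, and the activation is assumed to be th(x).

```matlab
phi = @(z) 2 ./ (1 + exp(-z)) - 1;   % th(z), activation of the hidden layer
x = [0.5 -0.2 1];                    % input (x1, x2) plus bias x3 = x0 = 1
W = [ 0.1 -0.3  0.2;                 % 3x3 weights, input -> hidden (made up)
      0.4  0.1 -0.5;
     -0.2  0.3  0.1];
V = [ 0.3 -0.1;                      % 3x2 weights, hidden -> output (made up)
     -0.2  0.4;
      0.5  0.2];
u = [0.1 0.3];                       % observed target (made up)
z    = x * W;                        % inputs of the hidden nodes, 1x3
uhat = phi(z) * V;                   % equation (3): uhat = phi(x*W)*V, 1x2
e    = u - uhat;                     % error vector
E    = e * e';                       % squared error (4) for this instance
```

Because x is a row vector, the same two lines for z and uhat also work unchanged for a whole N x 3 data matrix, giving one estimate per row.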
An update moves V and W in the anti-gradient direction:
$V(\text{new}) = V(\text{old}) - \mu \, gV$, $W(\text{new}) = W(\text{old}) - \mu \, gW$   (5)
where $\mu$ is the learning rate (step size) and gV, gW are the parts of the gradient of the error function E in (4) related to matrices V and W. Specifically, the error function is
$E = [(u_1 - \hat{u}_1)^2 + (u_2 - \hat{u}_2)^2] / 2$   (6)
where $e_1 = u_1 - \hat{u}_1$ and $e_2 = u_2 - \hat{u}_2$ are differences between the actual and predicted outputs. The division by 2 is made to avoid factor 2 in the derivatives of E. Also, this E resembles the so-called potential energy function in physics. Equation (5) can be rewritten component-wise:
$v_{jk}(\text{new}) = v_{jk}(\text{old}) - \mu \, \partial E / \partial v_{jk}$, $w_{ij}(\text{new}) = w_{ij}(\text{old}) - \mu \, \partial E / \partial w_{ij}$ ($i \in I$, $j \in II$, $k \in III$)   (5')

To make this computable, one must express the derivatives explicitly. This can be done with the so-called chain rule in calculus: the derivative of a superposition (composition) of functions is equal to the product of derivatives of these functions. For example, if f(x) = g(p(x)), then $f'(x) = g'(p(x)) \, p'(x)$ where ' denotes the derivative.

In particular, $\partial E / \partial v_{jk} = -(u_k - \hat{u}_k) \, \partial \hat{u}_k / \partial v_{jk}$. Since $\hat{u}_k = \sum_j \varphi(z_j) v_{jk}$, the derivative $\partial \hat{u}_k / \partial v_{jk} = \varphi(z_j)$; thus,
$\partial E / \partial v_{jk} = -(u_k - \hat{u}_k) \, \varphi(z_j)$.   (7)

The derivative $\partial E / \partial w_{ij}$ refers to the layer of W, the previous one, which requires more chain derivations. Specifically, $\partial E / \partial w_{ij} = \sum_k [-(u_k - \hat{u}_k) \, \partial \hat{u}_k / \partial w_{ij}]$. Since $\hat{u}_k = \sum_j \varphi(\sum_i x_i w_{ij}) v_{jk}$, the derivative $\partial \hat{u}_k / \partial w_{ij} = v_{jk} \, \varphi'(\sum_i x_i w_{ij}) \, x_i$.

Now we need the derivatives of the activation functions, sigmoid or hyperbolic tangent, which are simple polynomials of the functions themselves:
$s'(x) = [(1 + e^{-x})^{-1}]' = (-1)(1 + e^{-x})^{-2} (e^{-x})' = (-1)(1 + e^{-x})^{-2} e^{-x} (-1) = (1 + e^{-x})^{-2} e^{-x} = s(x)(1 - s(x))$   (8)
$th'(x) = [2 s(x) - 1]' = 2 s'(x) = 2 s(x)(1 - s(x)) = (1 + th(x))(1 - th(x)) / 2$

The final expression for the derivatives:
$\partial E / \partial w_{ij} = -\sum_k [(u_k - \hat{u}_k) v_{jk}] \, \varphi'(z_j) \, x_i$   (9)
where $\varphi'(z_j)$ is given by (8) or by the formula for th'(x) following it, depending on what activation function is used.

Equations (5), (7) and (9) lead to the following rule for the processing of an instance in the back-propagation algorithm.

4.2. Instance Processing:

1. Forward computation (of the output $\hat{u}$ and error).
Given matrices V and W, upon receiving an instance (x,u), the estimate $\hat{u}$ of vector u is computed according to the neural network as formalised in equation (3), and the error $e = u - \hat{u}$ is calculated.

2. Error back-propagation (for estimation of the gradient). Each neuron receives the relevant error estimate, which is $-e_k = -(u_k - \hat{u}_k)$ from (7) for output neurons k (k = III1, III2) or $-\sum_k [(u_k - \hat{u}_k) v_{jk}]$ from (9) for hidden neurons j (j = II1, II2, II3) [this can be seen as the sum of errors arriving from the output neurons according to the corresponding synapse weights], and adjusts that to the derivative (7) or (9) by multiplying it by its local data depending on the source signal, which is $\varphi(z_j)$ for neuron k's source j in (7), and $\varphi'(z_j) x_i$ for neuron j's source i in (9). The results constitute matrices gV and gW, respectively.

3. Weights update. Matrices V and W are updated according to formula (5) or, equivalently, (5').

What is nice in this procedure is that the computation is done locally, so that every neuron processes only the data that are available to this neuron, first from the previous layers, then backward, from the output layer. In particular, the algorithm does not change if the number of hidden neurons is changed from h=3, in Figure 1, to any other integer h. Procedure 4.2 can be easily extended to any feed-forward network, with many hidden layers as well.

Procedure 1-3 is performed for all available entities in a random order, which constitutes an epoch. Typically, one epoch is not enough for matrices V and W to get stabilised. Thus, a number of epochs is executed, until the matrices are stabilised. Since, in practical calculations, this may take ages to achieve, other stopping criteria can be utilised. One such criterion is that the difference between the average values (over iterations within an epoch) of the error function (6) in successive epochs becomes smaller than a pre-specified threshold, such as 0.0001.
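As a check on steps 1-3 of the instance processing, the MatLab sketch below runs them once and compares the back-propagated derivative for one weight with a finite-difference estimate of the error function E = [(u1 - û1)^2 + (u2 - û2)^2]/2. All numbers (weights, instance, step sizes) are made up for illustration, the activation is assumed to be th(x), and the variable names are hypothetical.

```matlab
phi  = @(z) 2 ./ (1 + exp(-z)) - 1;            % th(z)
dphi = @(z) (1 + phi(z)) .* (1 - phi(z)) / 2;  % th'(z)
Efun = @(x, u, W, V) sum((u - phi(x * W) * V).^2) / 2;  % half squared error
x = [0.5 -0.2 1];  u = [0.1 0.3];              % one instance (made up)
W = [0.1 -0.3 0.2; 0.4 0.1 -0.5; -0.2 0.3 0.1];
V = [0.3 -0.1; -0.2 0.4; 0.5 0.2];
% 1. forward computation
z = x * W;
e = u - phi(z) * V;
% 2. error back-propagation
gV = -phi(z)' * e;                   % dE/dv_jk = -(u_k - uhat_k) phi(z_j)
gW = -x' * ((V * e')' .* dphi(z));   % dE/dw_ij = -sum_k[e_k v_jk] phi'(z_j) x_i
% finite-difference check of one entry of gW (entry (2,3) chosen arbitrarily)
d = 1e-6;
Wp = W;  Wp(2, 3) = Wp(2, 3) + d;
numGrad = (Efun(x, u, Wp, V) - Efun(x, u, W, V)) / d;
gap = abs(numGrad - gW(2, 3));
% 3. weights update with learning rate mu (assumed value)
mu = 0.1;
Ebefore = Efun(x, u, W, V);
V = V - mu * gV;
W = W - mu * gW;
Eafter = Efun(x, u, W, V);
```

A small gap indicates that the analytic derivative and the numerical one agree; comparing Ebefore with Eafter shows whether the chosen step size actually reduced the error on this instance.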
Yet another criterion is halting the process when the number of epochs applied to the data reaches a pre-specified threshold such as 10,000. An explicit formulation of the back-propagation algorithm is as follows.

5. Back propagation algorithm for NN on Fig. 1 (for Iris data set which is available as a whole).

A. Initialise weight matrices $W = (w_{ij})$ and $V = (v_{jk})$ by using the random normal distribution N(0,1) with the mean at 0 and the variance 1.
B. Choose the data standardisation option amounting to selection of the shift and scale coefficients, $a_v$ and $b_v$ for each feature v, so that every data entry, $x_{iv}$, is transformed to $y_{iv} = (x_{iv} - a_v) / b_v$.
C. Formulate the Halt criterion as explained above and run a loop over epochs.
D. Randomise the order of entities [in MatLab, with the command randperm(N)] and run a loop of the 4.2. Instance Processing in that order (an epoch).
E. If the Halt criterion is met, end the computation and output results: W, V, $\hat{u}$, e, and E. Otherwise, execute D again.

Back propagation should be executed with a re-sampling scheme, such as k-fold cross-validation, to provide estimates of the variation of the results with respect to changes in the data.

6. Data standardisation for NN learning

Due to specifics of the binary target variables and activation functions, such as th(x) and sign(x), that have -1 and 1 as the boundaries, the data are frequently pre-processed, in the NN context, to make every feature's range lie between -1 and 1: take $b_v$ equal to the half-range, $b_v = (M_v - m_v)/2$, and the shift coefficient $a_v$ equal to the mid-range, $a_v = (M_v + m_v)/2$. Here $M_v$ denotes the maximum and $m_v$ the minimum of feature v.

7. Performing computations for Iris data

The MatLab code below implements the back propagation algorithm for Iris data set stored in \ml\Data as file iris.dat.
    % nnbpm.m for learning petal from sepal in iris data
    % modified with the hyperbolic tangent in the hidden layer
    % data normalisation to [-1,1] interval
    %-------------- preparing input and output data ----------------
    da=load('Data\iris.dat');
    [n,m]=size(da);
    %-------------- normalise to [-1,1] scale ----------------------
    mr=max(da);
    ml=min(da);
    ra=mr-ml;                 % ranges
    ba=mr+ml;                 % twice the mid-ranges
    tda=2*da-ones(n,1)*ba;
    dan=tda./(ones(n,1)*ra);
    dan=10*dan;               % make the scale 10-fold to see things clearly
    output=dan(:,[3 4]);      % petal sizes
    input=dan(:,[1 2]);       % sepal sizes
    input(:,3)=10;            % bias component, 10-fold
    %-------------- initialise the network -------------------------
    h=3;                      % the number of hidden neurons
    W=randn(3,h)              % initialising wij weights
    V=randn(h,2)              % initialising vjk weights
    W0=W;                     % to store if this is good
    V0=V;                     % to store if this is good
    count=0;                  % counter of epochs
    stopp=0;                  % halt-condition negative
    %-------------- looping epochs ---------------------------------
    while(stopp==0)
      mede=zeros(1,2);        % mean error to be stored after an epoch
      ror=randperm(n);        % randomly ordering
      %------------ looping entities in the random order ror
      for ii=1:n
        x=input(ror(ii),:);   % current instance's input
        u=output(ror(ii),:);  % current instance's output
        %---------- forward pass (to calculate response ov and error)
        ow=x*W;               % summary action of inputs in the hidden layer
        o1=1+exp(-ow);
        oow=ones(1,h)./o1;    % sigmoid transformation
        oow=2*oow-1;          % symmetric sigmoid output of the hidden layer
        ov=oow*V;             % output of the output layer
        err=u-ov;             % the error
        mede=mede+abs(err)/n; % the average absolute error
        %---------- error back-propagating
        gV=-oow'*err;         % gradients of matrix V
        t1=V*err';            % error propagated to the hidden layer
        t2=(1-oow).*(1+oow)/2; % the symmetric sigmoid's derivative
        t3=t2.*t1';           % error multiplied by the derivative
        gW=-x'*t3;            % gradients of matrix W
        %---------- change of the weights
        mu=0.00001;           % learn. rate; greater values hinder convergence
        V=V-mu*gV;
        W=W-mu*gW;
      end;                    % of the entity loop
      %------------ halt: stop-condition ---------------------------
      count=count+1;
      ss=mean(mede);
      if ss<0.01 || count>=30000  % small error or the number of epochs
        stopp=1;
      end;
      if rem(count,200)==0
        count
        mede
      end                     % these are to watch results every 200-th epoch
    end;
    V0
    W0                        % these are to copy initial weights if results are good

This program leads to the average errors presented below at different numbers of hidden neurons h (note the feature ranges are equal to 20 here):

h     |e1|    |e2|
3     1.11    1.76
6     1.00    1.69
10    0.97    1.63

One can see an improvement, but not so great!

Home-work:
1. Find values of E for the errors reported in the Table above.
2. Take a look at what happens if the data are not normalised.
3. What happens if the learning rate is increased, or decreased, ten times?
4. Extend the table above for different numbers of hidden neurons.
5. Try petal sizes as input with sepal sizes as output.
6. Adapt this code to other data such as studn.dat.
7. Modify this code to involve the sigmoid activation function.
8. Find a way to improve the convergence of the process, for instance, with adaptive changes in the step size values.

Decision Trees: a structure used for prediction of quantitative features (regression tree) or nominal features (classification tree). Each node corresponds to a subset of entities (the root to the set of all entities I), and its children are the subset's parts defined by a single predictor feature x. Each terminal node is assigned an individual target feature value u.

Example: Author-defined clusters of eight Companies (u is the product)

Sector
  Util./Ind.: Ecom
      No:  A
      Yes: B
  Retail: C

Figure 1. Decision tree for three product-defined classes of Companies defined by categorical features.

NSup
  <4: ShareP
      > 30: A
      < 30: B
  4 or more: C

Figure 2. Decision tree for three product-defined classes of Companies defined by quantitative features.
Decision trees:

Advantages
- Interpretability
- Computation efficiency
Drawbacks
- Simplistic
- Imprecise

Algorithm: Take a node and a feature value(s) and split the corresponding subset accordingly.

Issues (classification tree):
Stop: Whether any node should be split at all
Select: Which node of the tree and by which feature to split
Score: Chi-squared (CHAID in SPSS package), Entropy (C4.5 package), Change of Gini coefficient (CART package)
Assign: What target class k to assign to a terminal node x: Conventionally, k* at which p(k/x) is maximised over k.

I suggest: This is ok when p(k) is about 10%-30%. Otherwise, use comparison between p(k/x) and p(k). Specifically,
(i) If p(k) is of the order of 50%, then the absolute Quetelet index a(k/x) = p(k/x) - p(k) should be used;
(ii) If p(k) is of the order of 1% or less, the relative Quetelet index q(k/x) = [p(k/x) - p(k)]/p(k) should be employed.
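The Assign step above can be illustrated with a short MatLab sketch. The class counts below are hypothetical, chosen only to show how p(k), p(k/x) and the absolute and relative Quetelet indexes are computed for one terminal node x; the variable names are assumptions as well.

```matlab
nAll  = [500 300 200];       % class counts over all entities (hypothetical)
nNode = [ 20  30  50];       % class counts within terminal node x (hypothetical)
pk  = nAll  / sum(nAll);     % p(k), class proportions in the whole set
pkx = nNode / sum(nNode);    % p(k/x), class proportions within the node
a = pkx - pk;                % absolute Quetelet index a(k/x)
q = (pkx - pk) ./ pk;        % relative Quetelet index q(k/x)
[~, kConv] = max(pkx);       % conventional choice: most frequent class in x
[~, kRel]  = max(q);         % class with the largest relative increase
```

With these counts the third class has p(k) = 0.2 but p(k/x) = 0.5, so both the conventional rule and the relative index pick it; with rarer classes the two choices can differ.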