unit #4 — Giansalvo EXIN Cirrincione

Single-layer networks
Single-layer networks directly compute linear discriminant functions from the training set, without the need to determine probability densities first.

Linear discriminant functions: two classes
The discriminant is
$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$
and $\mathbf{x}$ is assigned to class $C_1$ if $y(\mathbf{x}) > 0$ and to class $C_2$ if $y(\mathbf{x}) < 0$. The decision boundary $y(\mathbf{x}) = 0$ is a $(d-1)$-dimensional hyperplane in the $d$-dimensional input space.

Linear discriminant functions: several classes
Use one discriminant function per class,
$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0} = \sum_{i=1}^{d} w_{ki} x_i + w_{k0}$,
and assign $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$. The boundary separating classes $C_k$ and $C_j$ is given by
$(\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} + (w_{k0} - w_{j0}) = 0$.

The decision regions are always simply connected and convex. Consider two points $\mathbf{x}_A$ and $\mathbf{x}_B$ lying in region $R_k$, so that $y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A)$ and $y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$ for all $j \neq k$, and define
$\hat{\mathbf{x}} = \alpha \mathbf{x}_A + (1-\alpha)\mathbf{x}_B$, with $0 \le \alpha \le 1$.
By linearity of the discriminants,
$y_k(\hat{\mathbf{x}}) = \alpha\, y_k(\mathbf{x}_A) + (1-\alpha)\, y_k(\mathbf{x}_B)$,
and hence $y_k(\hat{\mathbf{x}}) > y_j(\hat{\mathbf{x}})$ for all $j \neq k$, so $\hat{\mathbf{x}}$ also lies in $R_k$.

Logistic discrimination
Apply a monotonic activation function $g$ to the linear sum, $y(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x} + w_0)$. Since $g$ is monotonic, the decision boundary is still linear. For two classes with Gaussian class-conditional densities sharing a common covariance matrix $\Sigma_1 = \Sigma_2 = \Sigma$, the posterior probability can be written as
$P(C_1 \mid \mathbf{x}) = g(a)$, with $a = \ln\dfrac{p(\mathbf{x}\mid C_1)\, P(C_1)}{p(\mathbf{x}\mid C_2)\, P(C_2)}$,
where $a$ is a linear function of $\mathbf{x}$ and $g$ is the logistic sigmoid
$g(a) = \dfrac{1}{1 + e^{-a}}$.
The use of the logistic sigmoid activation function therefore allows the outputs of the discriminant to be interpreted as posterior probabilities.

Binary input vectors
Let $P_{ki}$ denote the probability that the input $x_i$ takes the value 1 when the input vector is drawn from class $C_k$; the corresponding probability that $x_i = 0$ is then $1 - P_{ki}$:
$p(x_i \mid C_k) = P_{ki}^{x_i} (1 - P_{ki})^{1 - x_i}$   (Bernoulli distribution).
Assuming the input variables are statistically independent, the probability for the complete input vector is
$p(\mathbf{x} \mid C_k) = \prod_{i=1}^{d} P_{ki}^{x_i} (1 - P_{ki})^{1 - x_i}$.
Taking $y_k(\mathbf{x}) = \ln p(\mathbf{x}\mid C_k) + \ln P(C_k)$ gives
$y_k(\mathbf{x}) = \sum_{i=1}^{d} w_{ki} x_i + w_{k0}$,
where
$w_{ki} = \ln P_{ki} - \ln(1 - P_{ki})$, $\qquad w_{k0} = \sum_{i=1}^{d} \ln(1 - P_{ki}) + \ln P(C_k)$.
Thus linear discriminant functions also arise when the input variables are binary.

For the two-class problem with independent Bernoulli class-conditional densities,
$P(C_1 \mid \mathbf{x}) = g(\mathbf{w}^T \mathbf{x} + w_0)$,
where $g(a)$ is the logistic sigmoid and
$w_i = \ln\dfrac{P_{1i}(1 - P_{2i})}{P_{2i}(1 - P_{1i})}$, $\qquad w_0 = \sum_{i=1}^{d} \ln\dfrac{1 - P_{1i}}{1 - P_{2i}} + \ln\dfrac{P(C_1)}{P(C_2)}$.
Both for normally distributed and for Bernoulli-distributed class-conditional densities, the posterior probabilities are therefore obtained by a logistic single-layer network. (homework)

Generalized discriminant functions
Use fixed non-linear basis functions $\phi_j(\mathbf{x})$, with an extra basis function $\phi_0 \equiv 1$ whose weight plays the role of the bias:
$y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj}\, \phi_j(\mathbf{x})$.
With a suitable choice of basis functions, such a model can approximate any CONTINUOUS functional transformation to arbitrary accuracy.

Training
Three approaches are considered: least-squares techniques, perceptron learning, and the Fisher discriminant.

Sum-of-squares error function
$E(\mathbf{w}) = \dfrac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\bigl(y_k(\mathbf{x}^n) - t_k^n\bigr)^2$,
where $t_k^n$ are the target values. Since the model is linear in the weights, $E$ is quadratic in the weights and its minimum can be found in closed form.

Geometrical interpretation of least squares: the least-squares solution corresponds to a projection of the target data onto the column space of the design matrix.

Pseudo-inverse solution
Setting the derivatives of $E$ with respect to the weights to zero gives the normal equations
$\Phi^T \Phi\, W^T = \Phi^T T$,
where $T$ is the $N \times c$ matrix of targets, $\Phi$ is the $N \times (M+1)$ design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}^n)$, and $W$ is the $c \times (M+1)$ weight matrix. The solution is
$W^T = \Phi^{\dagger} T$, with $\Phi^{\dagger} = (\Phi^T\Phi)^{-1}\Phi^T$
the pseudo-inverse of $\Phi$. If $\Phi^T\Phi$ is singular, the pseudo-inverse is defined through the limit $\Phi^{\dagger} = \lim_{\epsilon\to 0}(\Phi^T\Phi + \epsilon I)^{-1}\Phi^T$, which always exists. (A small numerical sketch of this solution is given after the homework at the end of this part.)

Bias
If the biases are treated separately, their optimal values are
$w_{k0} = \bar{t}_k - \sum_{j=1}^{M} w_{kj}\, \bar{\phi}_j$,
i.e. the role of the biases is to compensate for the difference between the averages (over the data set) of the target values and the averages of the output vectors.

Gradient descent
Group all of the parameters (weights and biases) together to form a single weight vector $\mathbf{w}$. In batch mode the weights are updated using the gradient of the error evaluated over the whole training set,
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \,\nabla E\big|_{\mathbf{w}^{(\tau)}}$,
while in sequential mode the update is made pattern by pattern using $\nabla E^n$. If the learning rate $\eta$ is chosen correctly (in particular, if it decreases suitably during training), sequential gradient descent becomes the Robbins-Monro procedure for finding the root of the regression function.

Differentiable non-linear activation functions
For $y_k = g(a_k)$ with $a_k = \sum_j w_{kj}\phi_j(\mathbf{x})$, the batch gradient is
$\dfrac{\partial E}{\partial w_{kj}} = \sum_n \bigl(y_k^n - t_k^n\bigr)\, g'(a_k^n)\, \phi_j^n$;
for the logistic sigmoid, $g'(a) = g(a)\bigl(1 - g(a)\bigr)$.

homework
Generate and plot a set of data points in two dimensions, drawn from two classes, each of which is described by a Gaussian class-conditional density function. Implement the gradient descent algorithm for training a logistic discriminant, and plot the decision boundary at regular intervals during the training procedure on the same graph as the data. Explore the effect of choosing different values for the learning rate. Compare the behaviour of the sequential and batch weight update procedures.
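A minimal sketch of this homework in batch mode, assuming NumPy and Matplotlib are available. The Gaussian means and covariances, the learning rate, the iteration count, and the plotting interval are illustrative choices, not prescribed by the text; the gradient follows the $(y - t)\,g'(a)\,\phi$ form given above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two Gaussian classes in two dimensions (illustrative parameters).
N = 100
x1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), N)   # class C1, target t = 1
x2 = rng.multivariate_normal([2.5, 2.5], np.eye(2), N)   # class C2, target t = 0
X = np.vstack([x1, x2])
t = np.concatenate([np.ones(N), np.zeros(N)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Logistic discriminant y = g(w^T x + w0), trained by batch gradient descent
# on the sum-of-squares error, using dE/da = (y - t) g'(a) with g' = g(1 - g).
w = np.zeros(2)
w0 = 0.0
eta = 0.05        # learning rate; the homework asks to explore different values

plt.scatter(*x1.T, marker='o', label='C1')
plt.scatter(*x2.T, marker='x', label='C2')
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)

for it in range(1, 501):
    a = X @ w + w0
    y = sigmoid(a)
    delta = (y - t) * y * (1.0 - y)
    w  -= eta * X.T @ delta
    w0 -= eta * delta.sum()
    if it % 100 == 0 and abs(w[1]) > 1e-12:
        # decision boundary: w^T x + w0 = 0
        plt.plot(xs, -(w[0] * xs + w0) / w[1], label=f'iteration {it}')

plt.legend()
plt.show()
```

A sequential version would apply the same update using one pattern at a time inside the loop, which is what the homework asks to compare against the batch behaviour.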
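Returning to the pseudo-inverse (least-squares) solution discussed earlier, here is a small numerical sketch, again assuming NumPy. The toy data, the identity basis functions (plus $\phi_0 = 1$), and the 1-of-c target coding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: N patterns in d = 2 dimensions, c = 2 classes, 1-of-c targets.
N, d, c = 200, 2, 2
X = np.vstack([rng.normal(0.0, 1.0, (N // 2, d)),
               rng.normal(3.0, 1.0, (N // 2, d))])
T = np.zeros((N, c))
T[:N // 2, 0] = 1.0          # class C1
T[N // 2:, 1] = 1.0          # class C2

# Design matrix Phi, shape N x (M+1): identity basis functions plus the bias column.
Phi = np.hstack([np.ones((N, 1)), X])

# Normal equations  Phi^T Phi W^T = Phi^T T,  solved via the pseudo-inverse of Phi.
W_T = np.linalg.pinv(Phi) @ T          # (M+1) x c, i.e. W^T
Y = Phi @ W_T                          # network outputs, N x c

pred = Y.argmax(axis=1)
true = T.argmax(axis=1)
print("training accuracy:", (pred == true).mean())
```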
The perceptron
The perceptron was applied to classification problems in which the inputs are usually binary images of characters or simple shapes. The input pixels are first transformed by fixed processing elements, each with fixed weights connected to a random subset of the input pixels; a single adaptive layer of weights follows.

Defining the error function in terms of the total number of misclassifications over the training set (or, more generally, through a loss matrix) gives an error function that is piecewise constant with respect to the weights, so gradient descent cannot be applied. Instead, use the targets
$t^n = +1$ if $\mathbf{x}^n \in C_1$, $\qquad t^n = -1$ if $\mathbf{x}^n \in C_2$,
so that every pattern should satisfy $\mathbf{w}^T\boldsymbol{\phi}^n t^n > 0$, and minimize the perceptron criterion
$E^{\mathrm{perc}}(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T \boldsymbol{\phi}^n t^n$,
where $\mathcal{M}$ is the set of misclassified patterns. This criterion is proportional to the sum of the absolute distances of the misclassified input patterns to the decision boundary, and it is continuous and piecewise linear in the weights.

Applying the sequential gradient descent rule to the perceptron criterion gives, for each misclassified pattern,
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\, \boldsymbol{\phi}^n t^n$.
In other words: cycle through all of the patterns in the training set and test each pattern in turn using the current set of weight values. If the pattern is correctly classified, do nothing; otherwise add the pattern vector to the weight vector if the pattern is labelled class $C_1$, or subtract the pattern vector from the weight vector if the pattern is labelled class $C_2$. The value of $\eta$ is unimportant, since changing it is equivalent to a re-scaling of the weights and biases. (A minimal sketch of this learning rule is given after this part.)

The perceptron convergence theorem
For any data set which is linearly separable, the perceptron learning rule is guaranteed to find a solution in a finite number of steps. The proof assumes that a solution vector $\hat{\mathbf{w}}$ exists (so that $\hat{\mathbf{w}}^T\boldsymbol{\phi}^n t^n > 0$ for every pattern) and null initial conditions $\mathbf{w}^{(0)} = \mathbf{0}$; it then shows that after $\tau$ updates $\mathbf{w}^T\hat{\mathbf{w}}$ grows at least linearly with $\tau$, while $\|\mathbf{w}\|^2$ grows at most linearly with $\tau$, which is only possible for a finite number of updates.

Homework: prove that, for arbitrary vectors $\mathbf{w}$ and $\hat{\mathbf{w}}$, the equality stated in the exercise is satisfied. Hence, show that an upper limit on the number of weight updates needed for convergence of the perceptron algorithm is given by the bound stated there.

If the data set happens not to be linearly separable, then the learning algorithm will never terminate, and if we arbitrarily stop the learning process there is no guarantee that the weight vector found will generalize well for new data. Possible remedies are to let $\eta$ decrease during the training process, or to use the pocket algorithm, which involves retaining a copy (in one's pocket) of the set of weights which has so far survived unchanged for the longest number of pattern presentations.

Limitations of the perceptron
Even though the data set of input patterns may not be linearly separable in the input space, it can become linearly separable in the $\phi$-space. However, this typically requires the number and complexity of the $\phi_j$'s to grow very rapidly (typically exponentially). One way of limiting this complexity is to restrict each unit's receptive field, giving the diameter-limited perceptron.
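A minimal sketch of the perceptron learning rule described above, assuming NumPy. The toy data and the fixed number of epochs are illustrative choices (the rule terminates by itself only if the data are linearly separable).

```python
import numpy as np

rng = np.random.default_rng(2)

# Linearly separable toy data with targets t = +1 (C1) and t = -1 (C2).
N = 50
x1 = rng.normal([-2.0, -2.0], 1.0, (N, 2))
x2 = rng.normal([+2.0, +2.0], 1.0, (N, 2))
phi = np.hstack([np.ones((2 * N, 1)), np.vstack([x1, x2])])  # phi_0 = 1 (bias)
t = np.concatenate([np.ones(N), -np.ones(N)])

w = np.zeros(phi.shape[1])   # null initial conditions
eta = 1.0                    # the value of eta only re-scales the weights

for epoch in range(100):
    errors = 0
    for phi_n, t_n in zip(phi, t):
        if t_n * (w @ phi_n) <= 0:       # misclassified (or on the boundary)
            w += eta * t_n * phi_n       # add/subtract the pattern vector
            errors += 1
    if errors == 0:                      # converged: every pattern is correct
        print(f"converged after {epoch + 1} epochs, w = {w}")
        break
```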
Fisher's linear discriminant
Fisher's linear discriminant performs an optimal linear dimensionality reduction: project the input onto one dimension,
$y = \mathbf{w}^T \mathbf{x}$ (no bias),
and select a projection which maximizes the class separation. The data set contains $N_1$ points of class $C_1$ and $N_2$ points of class $C_2$, with class means $\mathbf{m}_1$ and $\mathbf{m}_2$; the class mean of the projected data from class $C_k$ is $m_k = \mathbf{w}^T\mathbf{m}_k$. Simply maximizing the difference between the projected class means, subject to $\mathbf{w}$ having unit length (otherwise the difference can be made arbitrarily large just by increasing the magnitude of $\mathbf{w}$), is a constrained optimization whose solution is $\mathbf{w} \propto (\mathbf{m}_2 - \mathbf{m}_1)$. The Fisher approach instead maximizes a function which represents the difference between the projected class means, normalized by a measure of the within-class scatter along the direction of $\mathbf{w}$.

The within-class scatter of the transformed data from class $C_k$ is described by the within-class covariance
$s_k^2 = \sum_{n \in C_k} (y^n - m_k)^2$,
and the Fisher criterion is
$J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \dfrac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$,
where the between-class covariance matrix is $S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$ and the within-class covariance matrix is $S_W = \sum_{n \in C_1} (\mathbf{x}^n - \mathbf{m}_1)(\mathbf{x}^n - \mathbf{m}_1)^T + \sum_{n \in C_2} (\mathbf{x}^n - \mathbf{m}_2)(\mathbf{x}^n - \mathbf{m}_2)^T$.

Maximizing $J(\mathbf{w})$ leads to a generalized eigenvector problem: $\mathbf{w}$ is the principal eigenvector (PC) of $S_W^{-1} S_B$. Since $S_B \mathbf{w}$ always lies in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$, the solution can be written directly as
$\mathbf{w} \propto S_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$.

The projected data can subsequently be used to construct a discriminant, by choosing a threshold $y_0$ so that we classify a new point as belonging to $C_1$ if $y(\mathbf{x}) \ge y_0$ and classify it as belonging to $C_2$ otherwise. Note that $y = \mathbf{w}^T\mathbf{x}$ is the sum of a set of random variables, so we may invoke the central limit theorem and model the class-conditional density functions $p(y \mid C_k)$ using normal distributions. Once we have obtained a suitable weight vector and a threshold, the procedure for deciding the class of a new vector is identical to that of the perceptron network, so the Fisher criterion can be viewed as a learning law for the single-layer network.

Fisher's linear discriminant: relation to the least-squares approach
Minimize the sum-of-squares error with the targets
$t^n = \dfrac{N}{N_1}$ if $\mathbf{x}^n \in C_1$, $\qquad t^n = -\dfrac{N}{N_2}$ if $\mathbf{x}^n \in C_2$.
The resulting weight vector then coincides, up to an unimportant scale factor, with the Fisher direction, while the bias gives the classification threshold: $w_0 = -\mathbf{w}^T\mathbf{m}$, where $\mathbf{m}$ is the mean of the complete data set.

Fisher's linear discriminant: several classes
For more than two classes, extract $d'$ linear features
$\mathbf{y} = W\mathbf{x}$,
where $W$ is a $d' \times d$ matrix of weights. The within-class covariance matrix generalizes to $S_W = \sum_k \sum_{n \in C_k} (\mathbf{x}^n - \mathbf{m}_k)(\mathbf{x}^n - \mathbf{m}_k)^T$, with a corresponding between-class covariance matrix $S_B$, and analogous covariance matrices can be defined in the projected $d'$-dimensional $\mathbf{y}$-space. One possible criterion is built from these projected covariance matrices; maximizing it gives weights equal to the $d'$ principal eigenvectors (PCs) of $S_W^{-1} S_B$. Since $S_B$ has rank at most $(c-1)$, this criterion is unable to find more than $(c-1)$ linear features.
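A minimal sketch of the two-class Fisher direction $\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$, assuming NumPy. The toy data and the mid-point threshold are illustrative choices; the text suggests choosing $y_0$ more carefully by modelling $p(y \mid C_k)$ as Gaussians via the central limit theorem.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two toy classes in d = 2 dimensions.
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], 80)
X2 = rng.multivariate_normal([3.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], 120)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class covariance (scatter) matrix S_W = S_1 + S_2.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction w proportional to S_W^{-1} (m2 - m1).
w = np.linalg.solve(S_W, m2 - m1)

# Project the data and pick a simple threshold y0 (midpoint of the projected means).
y1, y2 = X1 @ w, X2 @ w
y0 = 0.5 * (y1.mean() + y2.mean())

# With this w pointing from m1 towards m2, classify as C2 if y(x) > y0, else C1.
acc = ((y1 < y0).mean() + (y2 > y0).mean()) / 2
print("projection w =", w, " threshold y0 =", y0, " training accuracy ~", acc)
```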