Introduction to Pattern Recognition for Human ICT
Linear discriminant functions
2014. 11. 7
Hyunki Hong

Contents
• Perceptron learning
• Minimum squared error (MSE) solution
• Least-mean-squares (LMS) rule
• Ho-Kashyap procedure

Linear discriminant functions
• The objective of this lecture is to present methods for learning linear discriminant functions of the form
  g(x) = w^T x + w0,
  where w is the weight vector and w0 is the threshold weight or bias.
  – The discriminant functions will be derived in a non-parametric fashion; that is, no assumptions will be made about the underlying densities.
  – For convenience, we will focus on the binary classification problem.
  – Extension to the multi-category case can be achieved by
    1. using ωi / not-ωi dichotomies (one class against all the others), or
    2. using ωi / ωj dichotomies (one class against another).
  Dichotomy: splitting a whole into exactly two non-overlapping parts; a procedure in which a whole is divided into two parts.

Gradient descent
• Gradient descent is a general method for function minimization.
  – Recall that the minimum of a function J(x) is defined by the zeros of its gradient, ∇J(x) = 0.
    1. Only in special cases does this minimization problem have a closed-form solution.
    2. In other cases, a closed-form solution may exist but is numerically ill-posed or impractical (e.g., because of memory requirements).
  – Gradient descent finds the minimum iteratively by moving in the direction of steepest descent:
    x(k+1) = x(k) − η ∇J(x(k)),
    where η is a learning rate.

Perceptron learning
• Let us now consider the problem of learning a binary classifier with a linear discriminant function.
  – As usual, assume we have a dataset X = {x(1), x(2), …, x(N)} containing examples from the two classes.
  – For convenience, we absorb the intercept w0 by augmenting the feature vector x with an additional constant dimension, so that g(x) = w^T x + w0 = a^T y with a = [w0, w] and y = [1, x].
  – Keep in mind that our objective is to find a vector a such that a^T y > 0 for examples from class ω1 and a^T y < 0 for examples from class ω2.
  – To simplify the derivation, we will "normalize" the training set by replacing all examples from class ω2 by their negatives.
  – This allows us to ignore class labels and look for a vector a such that a^T y > 0 for every example.
• To find a solution we must first define an objective function J(a).
  – A good choice is what is known as the perceptron criterion function
    J_P(a) = Σ_{y ∈ Y_M} (−a^T y),
    where Y_M is the set of examples misclassified by a. Note that J_P(a) is non-negative, since a^T y < 0 for all misclassified samples.
• To find the minimum of this function, we use gradient descent.
  – The gradient is defined by ∇J_P(a) = Σ_{y ∈ Y_M} (−y),
  – and the gradient descent update rule becomes
    a(k+1) = a(k) + η Σ_{y ∈ Y_M} y.
  – This is known as the perceptron batch update rule.
    1. The weight vector may also be updated in an "on-line" fashion, that is, after the presentation of each individual example:
       a(k+1) = a(k) + η y(k),
       where y(k) is an example that has been misclassified by a(k).

Perceptron learning
• If the classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution.
  – Some versions of the perceptron rule use a variable learning rate η(k).
  – In this case, convergence is guaranteed only under certain conditions.
• However, if the two classes are not linearly separable, the perceptron rule will not converge.
  – Since no weight vector a can correctly classify every sample in a non-separable dataset, the corrections in the perceptron rule will never cease.
  – One ad hoc solution to this problem is to enforce convergence by using a variable learning rate η(k) that approaches zero as k → ∞.
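A minimal MATLAB sketch of the on-line perceptron rule described above, assuming the data matrix Y already holds the augmented samples [1, x] as rows with the class-ω2 rows negated ("normalized" training set). The function name perceptron_online and the arguments eta and maxEpochs are illustrative, not part of the lecture.

% Save as perceptron_online.m
function a = perceptron_online(Y, eta, maxEpochs)
  % Y: N-by-(D+1) matrix of augmented, normalized samples; eta: learning rate
  [N, D] = size(Y);
  a = zeros(D, 1);                   % initial weight vector
  for epoch = 1:maxEpochs
    nErrors = 0;
    for i = 1:N
      y = Y(i, :)';                  % one augmented, normalized sample
      if a' * y <= 0                 % misclassified (or on the boundary): a'y should be > 0
        a = a + eta * y;             % on-line perceptron update a(k+1) = a(k) + eta*y(k)
        nErrors = nErrors + 1;
      end
    end
    if nErrors == 0                  % all samples satisfy a'y > 0: stop
      break;
    end
  end
end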
Minimum Squared Error (MSE) solution
• The classical MSE criterion provides an alternative to the perceptron rule.
  – The perceptron rule seeks a weight vector a such that a^T y(i) > 0. The perceptron rule only considers misclassified samples, since these are the only ones that violate the above inequality.
  – Instead, the MSE criterion looks for a solution to the equalities a^T y(i) = b(i), where b(i) are some pre-specified target values (e.g., class labels). As a result, the MSE solution uses ALL of the samples in the training set.
• The system of equations solved by MSE is Ya = b,
  – where a is the weight vector, each row in Y is a training example, and each row in b is the corresponding target value.
    1. For consistency, we will continue to assume that examples from class ω2 have been replaced by their negatives, although this is not a requirement for the MSE solution.
• An exact solution to Ya = b can sometimes be found.
  – If the number of (independent) equations (N) is equal to the number of unknowns (D+1), the exact solution is defined by a = Y^{-1} b.
  – In practice, however, Y will be singular, so its inverse does not exist: Y will commonly have more rows (examples) than columns (unknowns), which yields an over-determined system for which an exact solution cannot be found.
• The solution in this case is to minimize some function of the error between the model (Ya) and the desired output (b).
  – In particular, MSE seeks to minimize the sum of squared errors
    J_MSE(a) = ||Ya − b||^2 = Σ_i (a^T y(i) − b(i))^2,
    whose minimum, as usual, can be found by setting its gradient to zero.

The pseudo-inverse solution
• The gradient of the objective function is
  ∇J_MSE(a) = 2 Y^T (Ya − b),
  – with zeros defined by Y^T Y a = Y^T b.
  – Notice that Y^T Y is now a square matrix!
• If Y^T Y is nonsingular, the MSE solution becomes
  a = (Y^T Y)^{-1} Y^T b = Y† b,
  – where Y† = (Y^T Y)^{-1} Y^T is known as the pseudo-inverse of Y, since Y† Y = I. Note that, in general, Y Y† ≠ I.

Least-mean-squares solution
• The objective function J_MSE(a) can also be minimized with a gradient descent procedure.
  – This avoids the problems that arise when Y^T Y is singular.
  – In addition, it avoids the need to work with large matrices.
• Looking at the expression for the gradient, the obvious update rule is
  a(k+1) = a(k) + η(k) Y^T (b − Y a(k)).
  – It can be shown that if η(k) = η(1)/k, where η(1) is any positive constant, this rule generates a sequence of vectors that converges to a solution of Y^T (Ya − b) = 0.
  – The storage requirements of this algorithm can be reduced by considering each sample sequentially:
    a(k+1) = a(k) + η(k) (b(k) − a(k)^T y(k)) y(k).
  – This is known as the Widrow-Hoff, least-mean-squares (LMS), or delta rule (a sketch of the pseudo-inverse and LMS solutions follows the summary below).

Numerical example

Summary: perceptron vs. MSE
• Perceptron rule
  – The perceptron rule always finds a solution if the classes are linearly separable, but it does not converge if the classes are non-separable.
• MSE criterion
  – The MSE solution has guaranteed convergence, but it may not find a separating hyperplane even when the classes are linearly separable. Notice that MSE tries to minimize the sum of the squares of the distances of the training data to the separating hyperplane, as opposed to finding this hyperplane.
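To make the pseudo-inverse and LMS expressions above concrete, here is a minimal MATLAB sketch, assuming an augmented, normalized data matrix Y and a target vector b. The toy data, the variable names (a_mse, eta1), and the 50-epoch limit are illustrative only, not part of the lecture.

% 1) Pseudo-inverse (batch) MSE solution: a = (Y'Y)^(-1) Y'b = pinv(Y)*b
Y = [ 1  1  2;            % class w1 examples, augmented with a leading 1
      1  2  0;
     -1 -3 -1;            % class w2 examples, augmented and negated
     -1 -2 -3 ];
b = ones(4, 1);           % pre-specified target values (e.g., a margin of 1)
a_mse = pinv(Y) * b;      % pinv(Y) equals (Y'*Y)\Y' when Y'*Y is nonsingular

% 2) Sequential LMS (Widrow-Hoff) rule: a(k+1) = a(k) + eta(k)*(b(k) - a(k)'y(k))*y(k)
a = zeros(3, 1);
eta1 = 0.1;               % initial learning rate eta(1)
k = 1;
for epoch = 1:50
  for i = 1:size(Y, 1)
    y = Y(i, :)';
    a = a + (eta1/k) * (b(i) - a' * y) * y;   % decaying rate eta(k) = eta(1)/k
    k = k + 1;
  end
end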
Learning contents
• Linear discriminant functions: discriminant functions and pattern recognition, forms of discriminant functions, discriminant functions and decision regions
• Learning discriminant functions: the least squares method, perceptron learning
• Linear discriminant function experiments in MATLAB

1) Discriminant functions
• Binary classification problem: a single discriminant function g(x)
• Multi-class classification problem: one discriminant function gi(x) per class

2) Forms of discriminant functions
1. Linear discriminant functions (straight lines; hyperplanes in higher-dimensional spaces)
• n-dimensional input feature vector x = [x1, x2, …, xn]^T
• Linear discriminant function of class Ci: gi(x) = wi^T x + wi0
  – wi: vector that determines the orientation of the hyperplane (the normal vector of the hyperplane)
  – wi0: bias that determines the distance from the origin x = 0 to the line (hyperplane)
  – wi, wi0: parameters that determine the position of the decision boundary (in two dimensions: the equation of a line)

3) Forms of discriminant functions
2. Polynomial discriminant functions (a more general form)
• Quadratic polynomial discriminant function for class Ci: gi(x) = x^T Wi x + wi^T x + wi0
  – Wi: matrix of coefficients of the quadratic terms formed from pairs of elements of x; in n dimensions it has n(n+1)/2 distinct elements.
  – For n-dimensional input data, the number of parameters to estimate is n(n+1)/2 + n + 1.
  – For a p-th order polynomial the parameter count grows as O(n^p): the computational load increases and the accuracy of the estimates becomes limited.

4) Forms of discriminant functions
3. Discriminant functions built from a linear combination of nonlinear basis functions
  → resolves both the restrictions of the linear classifier and the parameter-count problem of polynomial classifiers.
• Choosing the form of the discriminant: trigonometric functions, Gaussian functions, hyperbolic tangent functions, sigmoid functions, …

5) Forms of discriminant functions
[Figure: discriminant functions and the resulting decision boundaries — (a) linear function, plus panels showing a quadratic polynomial and a trigonometric function]

6) Discriminant functions and decision regions
[Figure: decision regions ω1 and ω2 for two classes C1 and C2, panels (a) and (b)]

7) Discriminant functions and decision regions
• Decide whether the input feature belongs to each class: the one-to-all method (one class against all the others).
[Figure: regions defined by g1(x) = 0, g2(x) = 0, g3(x) = 0, with ambiguous regions such as "g1 or g2?" and "g1 or g3?"]
• Undecided region: all discriminant functions are negative.
• Ambiguous region: two discriminant functions are positive.

8) Discriminant functions and decision regions
• A discriminant function for every (possible) pair of classes: the one-to-one (pairwise) method.
• With m classes, m(m−1)/2 discriminants are needed.
[Figure: pairwise boundaries g12(x) = 0, g13(x) = 0, g23(x) = 0 for classes C1, C2, C3 and regions ω1, ω2, ω3]
• Undecided region: all discriminant functions are negative.

9) Discriminant functions and decision regions
• Decision rule based on one discriminant function per class: assign the input to the class whose discriminant gi(x) is largest.
• Simultaneous learning of the discriminant functions is required.
[Figure: per-class discriminants g1(x) = 0, g2(x) = 0, g3(x) = 0 and pairwise boundaries gi(x) − gj(x) = 0 separating ω1, ω2, ω3]

Learning discriminant functions
1) Least squares method
• Training data set and parameter vector.
• Target output of discriminant gi: if sample ξj belongs to class Ci, the target output is tji = 1.
• Objective function ("error function") of discriminant gi: the squared difference between the discriminant output and the target output → the "least squares method".

2) Least squares method
• With a large input dimension and many data points, the matrix computation takes a long time.
• "Sequential iterative algorithm": repeatedly modify the parameters, one data point at a time, in the direction that reduces the squared error,
  → that is, follow the gradient of the error function in the direction of decreasing error.

3) Least squares method
• Parameter change and parameter update equation (the parameter value at step τ+1), with a learning rate.
• "Least Mean Square (LMS) algorithm": find the linear function that best approximates the mapping from the inputs to the desired outputs.
• For a discriminant function based on a linear combination of basis functions: use the basis-function mapping φ(x) instead of the input vector x.

4) Perceptron learning
• Rosenblatt
• A foundational artificial neural network model built by mimicking the way the human brain processes information.
• The i-th output yi is determined for an input x; the decision rule is the same as that of a linear discriminant function.

5) Perceptron learning
• Classification problem with M patterns (classes):
  – Define one output per class, i.e., M perceptron discriminant functions in total.
  – Adjust the parameters using the training data so that the desired outputs are produced: when an input belonging to the i-th class is presented, define the target output ti so that only the corresponding output yi is 1 and all the others are 0 (a sketch of this M-output perceptron follows item 6 below).
• Parameter update equation.

6) Perceptron learning
• Example of how the decision boundary changes during perceptron learning: "·" belongs to class C1 with target value 1, "o" belongs to class C2 with target value 0.
[Figure: panels (a)–(c) showing the weight vector w(0) → w(1) → w(2) and the decision boundary between ω1 and ω2 as samples x(1) and x(2) are presented]
(a) The target output t for x(1) is 0, but the current output is 1:
    w(1) = w(0) + η(ti − yi) x(1) = −η x(1).
(b) This case is the opposite (target 1, current output 0):
    w(2) = w(1) + η(ti − yi) x(2) = η x(2).
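A minimal MATLAB sketch of the M-class perceptron described in 5) above: one output (one augmented weight vector) per class, one-of-M (one-hot) target vectors, and the update w ← w + η(ti − yi)x. The toy data and all names (X, T, W, Xa, eta) are illustrative only, not part of the lecture code.

M = 3; N = 60; D = 2;                          % classes, samples, input dimension
X = [randn(20,2); randn(20,2)+repmat([3 3],20,1); randn(20,2)+repmat([-3 3],20,1)];
labels = [ones(20,1); 2*ones(20,1); 3*ones(20,1)];
T = zeros(N, M);                               % one-hot targets: only t_i = 1 for the true class i
T(sub2ind([N M], (1:N)', labels)) = 1;

Xa = [ones(N,1) X];                            % augment inputs with a constant 1
W = zeros(D+1, M);                             % one augmented weight vector per class
eta = 0.1;                                     % learning rate
for epoch = 1:20
  for n = 1:N
    y = double(Xa(n,:)*W >= 0);                % thresholded output of each of the M perceptrons
    e = T(n,:) - y;                            % error (t - y) for every output
    W = W + eta * Xa(n,:)' * e;                % perceptron update for all outputs at once
  end
end
[~, pred] = max(Xa*W, [], 2);                  % assign each sample to the largest discriminant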
7) Perceptron learning
• Perceptron classifier experiment in MATLAB

1) Data generation and training results

NumD = 10;                                      % number of samples per class
% generate the training data
train_c1 = randn(NumD,2);
train_c2 = randn(NumD,2) + repmat([2.5,2.5],NumD,1);
train = [train_c1; train_c2];
train_out(1:NumD,1) = zeros(NumD,1);            % targets: class C1 -> 0
train_out(1+NumD:2*NumD,1) = zeros(NumD,1)+1;   % targets: class C2 -> 1
plot(train(1:NumD,1), train(1:NumD,2),'r*'); hold on
plot(train(1+NumD:2*NumD,1), train(1+NumD:2*NumD,2),'o');

[Figure: the generated training data]

2) Perceptron training program

Mstep = 5;                              % set the maximum number of iterations
INP = 2; OUT = 1;                       % set the input/output dimensions
w = rand(INP,1)*0.4 - 0.2;              % initialize the parameters
wo = rand(1)*0.4 - 0.2;
a = [-3:0.1:6];
plot(a, (-w(1,1)*a - wo)/w(2,1));       % plot the initial decision boundary
eta = 0.5;                              % set the learning rate
for j = 2:Mstep                         % repeat learning over all training data
  for i = 1:NumD*2
    x = train(i,:);
    ry = train_out(i,1);
    y = stepf(x*w + wo);                % perceptron output (stepf: unit step function)
    e = ry - y;                         % compute the error
    E(i,1) = e;
    dw = eta*e*x';                      % compute the parameter corrections
    dwo = eta*e*1;
    w = w + dw;                         % update the parameters
    wo = wo + dwo;
  end
  plot(a, (-w(1,1)*a - wo)/w(2,1),'r'); % plot the updated decision boundary
end

[Figure: change of the output decision boundary as learning proceeds]
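The training program above calls stepf, which is not a MATLAB built-in and is not listed in the lecture; it is assumed here to be a user-supplied unit step (threshold) activation. A minimal definition consistent with the 0/1 target values used above could be saved as stepf.m:

function y = stepf(v)
% stepf: assumed unit step activation -- returns 1 where v >= 0 and 0 otherwise
y = double(v >= 0);
end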