Introduction to Pattern Recognition for Human ICT
Support Vector Machines
2014. 11. 28
Hyunki Hong

Contents
• Introduction
• Optimal separating hyperplanes
• Kuhn-Tucker Theorem
• Support Vectors
• Non-linear SVMs and kernel methods

Introduction
• Consider the familiar problem of learning a binary classifier from data.
  – Assume a given dataset $(X, Y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where the goal is to learn a function $y = f(x)$ that will correctly classify unseen examples.
• How do we find such a function?
  – By optimizing some measure of performance of the learned model.
• What is a good measure of performance?
  – A good measure is the expected risk
    $R[f] = \int C(f(x), y)\, p(x, y)\, dx\, dy$,
    where $C(f(x), y)$ is a suitable cost function, e.g., $C(f(x), y) = (f(x) - y)^2$.
  – Unfortunately, the risk cannot be measured directly, since the underlying pdf $p(x, y)$ is unknown. Instead, we typically use the risk over the training set, also known as the empirical risk:
    $R_{emp}[f] = \frac{1}{N} \sum_{i=1}^{N} C(f(x_i), y_i)$.
• Empirical Risk Minimization
  – A formal term for a simple concept: find the function $f(x)$ that minimizes the average risk on the training set.
  – Minimizing the empirical risk is not a bad thing to do, provided that sufficient training data is available, since the law of large numbers ensures that the empirical risk asymptotically converges to the expected risk as $N \to \infty$.
    (Note on "asymptotic": in applied mathematics, asymptotic analysis is used to build numerical methods that approximate equation solutions.)
• How do we avoid overfitting?
  – By controlling model complexity. Intuitively, we should prefer the simplest model that explains the data.

Optimal separating hyperplanes
• Problem statement
  – Consider the problem of finding a separating hyperplane for a linearly separable dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $x \in \mathbb{R}^D$, $y \in \{-1, +1\}$.
  – Which of the infinitely many separating hyperplanes should we choose?
    1. Intuitively, a hyperplane that passes too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set.
    2. Instead, it seems reasonable to expect that a hyperplane that is farthest from all training examples will have better generalization capabilities.
  – Therefore, the optimal separating hyperplane will be the one with the largest margin, which is defined as the minimum distance from an example to the decision surface.
• Solution
  – Since we want to maximize the margin, let us express it as a function of the weight vector and bias of the separating hyperplane.
  – From geometry, the distance between a point $x$ and a plane $(w, b)$ is
    $d(x) = \dfrac{|w^T x + b|}{\|w\|}$.
  – Since the optimal hyperplane has infinitely many equivalent solutions (obtained by rescaling the weight vector and bias), we choose the one for which the discriminant function becomes one for the training examples closest to the boundary, $|w^T x + b| = 1$. This is known as the canonical hyperplane.
  – Therefore, the distance from the closest example to the boundary is $1/\|w\|$,
  – and the margin becomes $m = 2/\|w\|$.
• Therefore, the problem of maximizing the margin is equivalent to
    minimize $J(w) = \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$, $i = 1, \ldots, N$.
  – Notice that $J(w)$ is a quadratic function, which means that there exists a single global minimum and no local minima.
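To make the distance and margin expressions above concrete, the following is a minimal MATLAB sketch. The hyperplane parameters and data points are hypothetical, chosen so that two of the points lie exactly on the canonical planes $|w^T x + b| = 1$.

% Minimal sketch (hypothetical values): point-to-hyperplane distance and
% the margin of a canonical hyperplane w'*x + b = 0.
w = [2; 1]; b = -1;              % hypothetical hyperplane parameters
X = [1 0; 0 0; 2 2];             % three 2-D points, one per row
d = abs(X*w + b) / norm(w)       % distances of the points to the hyperplane
% The first two points satisfy |w'*x + b| = 1 (they lie on the canonical
% planes), so their distance is 1/||w|| and the margin is:
margin = 2 / norm(w)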
• To solve this problem, we will use classical Lagrangian optimization techniques.
  – We first present the Kuhn-Tucker Theorem, which provides an essential result for the interpretation of Support Vector Machines.

Kuhn-Tucker Theorem
• Given an optimization problem with convex domain $\Omega \subseteq \mathbb{R}^N$:
    minimize $f(z)$, $z \in \Omega$, subject to $g_i(z) \le 0$ $(i = 1, \ldots, k)$ and $h_j(z) = 0$ $(j = 1, \ldots, m)$,
  – with $f \in C^1$ convex and $g_i$, $h_j$ affine, necessary and sufficient conditions for a normal point $z^*$ to be an optimum are the existence of $\alpha^*$, $\beta^*$ such that
    $\dfrac{\partial L(z^*, \alpha^*, \beta^*)}{\partial z} = 0, \qquad \dfrac{\partial L(z^*, \alpha^*, \beta^*)}{\partial \beta} = 0,$
    $\alpha_i^* g_i(z^*) = 0, \qquad g_i(z^*) \le 0, \qquad \alpha_i^* \ge 0 \quad (i = 1, \ldots, k).$
  – $L(z, \alpha, \beta) = f(z) + \sum_i \alpha_i g_i(z) + \sum_j \beta_j h_j(z)$ is known as a generalized Lagrangian function.
  – The third condition is known as the Karush-Kuhn-Tucker (KKT) complementarity condition.
    1. It implies that $\alpha_i \ge 0$ for active constraints and $\alpha_i = 0$ for inactive constraints.
    2. As we will see in a minute, the KKT condition allows us to identify the training examples that define the largest-margin hyperplane. These examples will be known as Support Vectors.
• Solution
  – To simplify the primal problem, we eliminate the primal variables $(w, b)$ using the first Kuhn-Tucker condition $\partial L_P/\partial z = 0$. The primal Lagrangian of the margin-maximization problem is
    $L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \quad \alpha_i \ge 0.$
  – Differentiating $L_P(w, b, \alpha)$ with respect to $w$ and $b$, and setting the derivatives to zero, yields
    $\dfrac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \dfrac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0.$
  – Expansion of $L_P$ yields
    $L_P = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i w^T x_i - b \sum_{i=1}^{N} \alpha_i y_i + \sum_{i=1}^{N} \alpha_i.$
  – Using the optimality condition $\partial L_P/\partial w = 0$, the first term in $L_P$ can be expressed as
    $\frac{1}{2} w^T w = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j.$
  – The second term in $L_P$ can be expressed in the same way:
    $\sum_{i=1}^{N} \alpha_i y_i w^T x_i = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j.$
  – The third term in $L_P$ is zero by virtue of the optimality condition $\partial L_P/\partial b = 0$.
  – Merging these expressions together, we obtain the problem
    maximize $L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
  – subject to the (simpler) constraints $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
  – This is known as the Lagrangian dual problem.
• Comments
  – We have transformed the problem of finding a saddle point of $L_P(w, b, \alpha)$ into the easier one of maximizing $L_D(\alpha)$. Notice that $L_D(\alpha)$ depends on the Lagrange multipliers $\alpha$, not on $(w, b)$.
  – The primal problem scales with dimensionality ($w$ has one coefficient for each dimension), whereas the dual problem scales with the amount of training data (there is one Lagrange multiplier per example).
  – Moreover, in $L_D(\alpha)$ the training data appear only as dot products $x_i^T x_j$.

Support Vectors
• The KKT complementarity condition states that, for every point in the training set, the following equality must hold:
    $\alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0, \quad i = 1, \ldots, N.$
  – Therefore, for each example either $\alpha_i = 0$ or $y_i(w^T x_i + b) - 1 = 0$ must hold.
    1. Those points for which $\alpha_i > 0$ must then lie on one of the two hyperplanes that define the largest margin (the term $y_i(w^T x_i + b) - 1$ becomes zero only on these hyperplanes).
    2. These points are known as the Support Vectors.
    3. All the other points must have $\alpha_i = 0$.
  – Note that only the SVs contribute to defining the optimal hyperplane: $w = \sum_{i \in SV} \alpha_i y_i x_i$.
    1. NOTE: the bias term $b$ is found from the KKT complementarity condition on the support vectors.
    2. Therefore, the complete dataset could be replaced by only the support vectors, and the separating hyperplane would be the same.
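The next sketch illustrates the dual problem and the role of the support vectors on a small, hypothetical, linearly separable set; the toy data, the $10^{-6}$ tolerance, and the variable names are assumptions, and the lecture's own implementation (with kernels and slack variables) appears later.

% Minimal sketch: solve the hard-margin dual max L_D(alpha) with quadprog
% for a hypothetical linearly separable toy set, then recover (w, b).
X = [1 1; 2 2; -1 -1; -2 -2];        % toy training points, one per row
y = [1; 1; -1; -1];                  % class labels in {-1, +1}
N = size(X, 1);
H = (y*y') .* (X*X') + 1e-10*eye(N); % quadratic term of -L_D (regularized)
f = -ones(N, 1);                     % quadprog minimizes 0.5*a'*H*a + f'*a
a = quadprog(H, f, [], [], y', 0, zeros(N,1), []);  % alpha_i >= 0, sum_i alpha_i*y_i = 0
w  = X' * (a .* y);                  % w = sum_i alpha_i y_i x_i
sv = find(a > 1e-6);                 % support vectors: alpha_i > 0
b  = mean(y(sv) - X(sv,:)*w);        % bias from y_i(w'*x_i + b) = 1 on the SVs

Only the components of a that belong to the support vectors come out nonzero, which is exactly the complementarity property discussed above.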
Lecture contents
• Classification with a linear hyperplane
• SVM (Support Vector Machine)
• SVM with slack variables
• Kernel methods
• Experiments using MATLAB

Overfitting and generalization error
• Training error: the difference between the outputs for the training data and the desired outputs.
• Generalization error: the error on data that were not used for training.
• Relationship between the complexity of the learning system and the generalization error:
  (Figure (a)-(c): comparison of a linear and a nonlinear classifier. On the training data the more complex nonlinear classifier attains the lower training error, but on unseen data its generalization error can be larger than that of the linear classifier, e.g., when the training data do not properly represent the overall data distribution.)
• To avoid overfitting and reduce the generalization error, it is important to control the complexity of the learning system appropriately.

Classification with a linear hyperplane
• Decision boundary for an input $x$: the linear hyperplane $w^T x + w_0 = 0$, with discriminant function $f(x)$ such that
  $f(x) = +1 \Rightarrow x \in C_1$, $\quad f(x) = -1 \Rightarrow x \in C_2$.
• Many different linear decision boundaries achieve the minimum training error. Which one should we choose?

Maximum-margin classifier
• Among the many linear decision boundaries, the SVM looks for the optimal boundary, the one that minimizes the generalization error, by introducing the concept of a margin and using it to define the objective function for training.
• Margin: the distance from the training example closest to the decision boundary to that boundary.
• Support vector: a training example located closest to the decision boundary.
  (Figure (a), (b): two decision boundaries with different margins; the support vectors determine the margin.)
• Small generalization error → large separation between the classes → maximize the margin!! ⇒ the "maximum-margin classifier", the SVM.
• Linear decision boundary (hyperplane): $w^T x + w_0 = 0$.
• Distance from a point $x$ to the decision boundary: $d = \dfrac{|w^T x + w_0|}{\|w\|}$.

Optimal separating hyperplane and the maximum margin
• The SVM finds the optimal parameters, the weight vector $w$ and the bias $w_0$, that maximize the margin of the decision boundary (the optimal separating hyperplane), including the case of an $m$-dimensional input.
• Formulation of the maximum margin:
  (Figure: the plus plane $w^T x + w_0 = +1$ on the $C_1$ side and the minus plane $w^T x + w_0 = -1$ on the $C_2$ side each lie at distance $1/\|w\|$ from the boundary, so the margin is $M = 2/\|w\|$.)
• To maximize the margin, minimize $\|w\|$ (equivalently $\frac{1}{2}\|w\|^2$); the support vectors satisfy the constraints with equality.
• The linear classifier parameters are optimized using the training data.

SVM training
• Given the training data set $\{(x_i, y_i)\}_{i=1}^{N}$, the parameters $w$, $w_0$ to be estimated must satisfy
  $y_i(w^T x_i + w_0) \ge 1, \quad i = 1, \ldots, N.$
• The objective function to be minimized and these constraints are combined into the Lagrangian function
  $J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + w_0) - 1 \right],$
  which is minimized with respect to the parameters $w$, $w_0$ and maximized with respect to the $\alpha_i$.
• Dual problem of $J(w, w_0, \alpha)$: instead of optimizing $J(w, w_0, \alpha)$ directly, optimize
  $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
  subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
• $Q(\alpha)$ is a quadratic function of the $\alpha_i$, so the solution $\hat{\alpha}_i$ can be obtained easily with quadratic programming → from it, $w$ and $w_0$ are estimated.
• Most data points lie away from the decision boundary and satisfy $y_i(w^T x_i + w_0) - 1 > 0$; for these points the only value of $\hat{\alpha}_i$ consistent with the optimum is 0.
  – The number of data points that must be stored for classification and the amount of computation are therefore drastically reduced.

Classification with the SVM
• Substituting the estimated parameters into the decision function,
  $f(x) = \operatorname{sign}\!\left( \sum_{i \in SV} \hat{\alpha}_i y_i\, x_i^T x + w_0 \right),$
  $f(x) = +1 \Rightarrow x \in C_1$, $\quad f(x) = -1 \Rightarrow x \in C_2$.

SVM with slack variables
• To handle data that are not linearly separable, slack variables $\xi_i$ are introduced: for a misclassified example, $\xi_i$ is its distance to the decision boundary of its own class.
  (Figure: examples with slack variables $\xi_i$, $\xi_j$, $\xi_k$.)
• Classification condition: $y_i(w^T x_i + w_0) \ge 1 - \xi_i$, $\;\xi_i \ge 0$.
• The larger the value of a slack variable, the more severe the misclassification that is allowed!
• The constant $C$ determines how much misclassification is tolerated: as $C$ grows, large $\xi_i$ are suppressed, so fewer misclassification errors are allowed.
• Objective function to be minimized: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$.
• The Lagrangian is differentiated with respect to $w$ and $w_0$ as before; new Lagrange multipliers $\beta_i$ are added to keep the slack variables nonnegative.
• Once the $\alpha_i$ are determined, the values of $w$ and $w_0$ are exactly the same as in the case without slack variables!
  (Figure: linearly separable case vs. linearly non-separable case.)

Kernel methods
• A nonlinear classification problem can be handled by transforming the low-dimensional input $x$ into a point $\Phi(x)$ in a high-dimensional space.
  (Figure (a), (b): 2-D data and their mapping into a higher-dimensional feature space by $\Phi$.)
• Once re-expressed in the high-dimensional space, the data can be classified with a simple linear classifier.
• However, side effects such as a large increase in computation arise → this is resolved with the "kernel trick".

Kernel methods and the SVM
• Assume the $n$-dimensional input data $x$ are transformed into $m$-dimensional ($m \gg 1$) feature data $\Phi(x)$ and then classified with an SVM.
• The SVM computations use not the individual vectors $\Phi(x)$ but only the inner products $\Phi(x) \cdot \Phi(y)$ of pairs of vectors.
• Instead of the high-dimensional mapping $\Phi(x)$, the inner product $\Phi(x) \cdot \Phi(y)$ is defined and used as a single function $k(x, y)$: the "kernel function".
• In the Lagrangian used for parameter estimation, $\Phi(x)$ is used as the input in place of $x$; expressing the problem with the kernel function, the classification function can be written in terms of the estimated parameters:
  $f(x) = \operatorname{sign}\!\left( \sum_{i \in SV} \hat{\alpha}_i y_i\, k(x_i, x) + w_0 \right).$
• The high-dimensional mapping linearizes the nonlinear problem, while the kernel keeps the computational cost low.
• Representative kernel functions: linear $k(x, y) = x^T y$; polynomial $k(x, y) = (x^T y + b)^d$; Gaussian $k(x, y) = \exp\!\left( -\|x - y\|^2 / 2\sigma^2 \right)$.
• SVM with slack variables and kernels: the two extensions are combined in the implementation below.

Experiments using MATLAB
• Training and pattern-classification experiment with an SVM: number of data points = 30, classes $C_1$ and $C_2$.
• MATLAB function quadprog(): by quadratic programming, it finds the parameters $\alpha$ that minimize the objective function while satisfying the given constraints.
• Once the specific matrices and vectors used as its inputs have been set up, they are supplied to the function to find $\alpha$; the corresponding support vectors are then found by selecting the components of $\alpha$ that are greater than zero.
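Before the implementation, a small MATLAB sketch of the kernel idea itself; the input vectors and the explicit mapping phi are illustrative assumptions. For the 2-D polynomial kernel $(x^T y + 1)^2$, evaluating $k(x, y)$ directly in the input space gives the same value as an inner product in a 6-dimensional feature space, so $\Phi(x)$ never has to be formed explicitly.

% Minimal sketch: the kernel trick for the polynomial kernel (x*y' + 1)^2 in 2-D.
phi = @(v) [1, sqrt(2)*v(1), sqrt(2)*v(2), v(1)^2, v(2)^2, sqrt(2)*v(1)*v(2)];
x = [1 2]; y = [3 -1];               % hypothetical 2-D inputs (row vectors)
k_direct  = (x*y' + 1)^2;            % kernel evaluated in the input space
k_feature = phi(x) * phi(y)';        % same value via the explicit 6-D mapping
% k_direct and k_feature are equal, so the SVM only ever needs k(x, y).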
MATLAB implementation: kernel function SVM_kernel

function [out] = SVM_kernel(X1, X2, fmode, hyp)
% SVM_kernel: kernel matrix between the rows of X1 and the rows of X2.
% fmode selects the kernel; hyp holds its hyperparameters.
if (fmode==1)                        % linear kernel
    out = X1*X2';
end
if (fmode==2)                        % polynomial kernel: (x'*y + hyp(1))^hyp(2)
    out = (X1*X2' + hyp(1)).^hyp(2);
end
if (fmode==3)                        % Gaussian kernel with width hyp(1)
    for i = 1:size(X1,1)
        for j = 1:size(X2,1)
            x = X1(i,:); y = X2(j,:);
            out(i,j) = exp(-(x-y)*(x-y)'/(2*hyp(1)*hyp(1)));
        end
    end
end

MATLAB implementation: SVM classifier

% Build and apply a kernel SVM classifier for 2-D data
load data12_8                              % load the data (X, Y)
N = size(Y,1);                             % N: number of data points
kmode = 2; hyp(1) = 1.0; hyp(2) = 2.0;     % choose the kernel and its parameters
%%%%% Define the quadratic-programming objective
%% minimize: 0.5*a'*H*a + f'*a
%% equality constraint: Aeq*a = beq
%% bounds: lb <= a <= ub
KXX = SVM_kernel(X, X, kmode, hyp);        % kernel matrix
H = diag(Y)*KXX*diag(Y) + 1e-10*eye(N);    % quadratic term of the objective
f = (-1)*ones(N,1);                        % linear term of the objective
Aeq = Y'; beq = 0;                         % equality constraint: sum(alpha.*Y) = 0
lb = zeros(N,1); ub = zeros(N,1) + 10^3;   % bounds 0 <= alpha <= C (C = 10^3)

MATLAB implementation: SVM classifier (continued)

%%%%% Call the QP solver --> estimate alpha, then the bias w0
[alpha, fval, exitflag, output] = quadprog(H, f, [], [], Aeq, beq, lb, ub);
svi = find(abs(alpha) > 10^(-3));          % indices of the support vectors (alpha > 0)
svx = X(svi,:);                            % store the support vectors
ksvx = SVM_kernel(svx, svx, kmode, hyp);   % kernel matrix between support vectors
for i = 1:size(svi,1)                      % compute the bias w0 from each support vector
    svw(i) = Y(svi(i)) - sum(alpha(svi).*Y(svi).*ksvx(:,i));
end
w0 = mean(svw);

MATLAB implementation: SVM classifier (continued)

%%%%% Compute the classification results
for i = 1:N
    xt = X(i,:);
    ksvxt = SVM_kernel(svx, xt, kmode, hyp);
    fx(i,1) = sign(sum(alpha(svi).*Y(svi).*ksvxt) + w0);
end
Cerr = sum(abs(Y-fx))/(2*N);
%%%%% Print and save the results
fprintf(1,'Number of Support Vectors: %d\n', size(svi,1));
fprintf(1,'Classification error rate: %.3f\n', Cerr);
posId = find(Y>0); negId = find(Y<0);      % class index sets (used by the plotting script)
save SVMres12_8 X Y posId negId svi svx alpha w0 kmode hyp

MATLAB implementation: program that draws the decision boundary

load SVMres12_8                            % load the data and the trained parameters
xM = max(X); xm = min(X);                  % set the region in which to draw the boundary
S1 = [floor(xm(1)):0.1:ceil(xM(1))];
S2 = [floor(xm(2)):0.1:ceil(xM(2))];
for i = 1:size(S1,2)                       % compute the output for each input in the region
    for j = 1:size(S2,2)
        xt = [S1(i), S2(j)];
        ksvxt = SVM_kernel(svx, xt, kmode, hyp);
        G(i,j) = (sum(alpha(svi).*Y(svi).*ksvxt) + w0);
        F(i,j) = sign(G(i,j));
    end
end
[X1, X2] = meshgrid(S1, S2);
figure(kmode);
contour(X1', X2', F); hold on              % draw the decision boundary
contour(X1', X2', G);                      % draw contours of the discriminant function
posId = find(Y>0); negId = find(Y<0);
plot(X(posId,1), X(posId,2), 'd'); hold on % plot the data
plot(X(negId,1), X(negId,2), 'o');
plot(X(svi,1), X(svi,2), 'g+');            % mark the support vectors

(Figure: decision boundary and support vectors (7) obtained with the linear kernel.)
(Figure: decision boundaries and support vectors with (a) the polynomial kernel, b = 1, d = 2, 5 support vectors, and (b) the Gaussian kernel, σ = 1, 9 support vectors.)
(Figure: change of the decision boundary with the Gaussian kernel width, (a) σ = 0.2, (b) σ = 0.5, (c) σ = 2.0, (d) σ = 5.0.)

Karush-Kuhn-Tucker (KKT) conditions
• In mathematical optimization, the Karush-Kuhn-Tucker (KKT) conditions are first-order necessary conditions for a solution of a nonlinear programming problem to be optimal, provided that some regularity conditions are satisfied.
• By allowing inequality constraints, the KKT approach to nonlinear programming generalizes the method of Lagrange multipliers, which allows only equality constraints.
• The system of equations corresponding to the KKT conditions is usually not solved directly, except in the few special cases where a closed-form solution can be derived analytically. In general, many optimization algorithms can be interpreted as methods for numerically solving the KKT system of equations.

Margin-maximization conditions
• Applying the KKT conditions to the SVM optimization problem.
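As a sketch of this last point, the KKT conditions stated earlier can be written out for the hard-margin SVM problem; the soft-margin version with slack variables additionally bounds the multipliers, $0 \le \alpha_i \le C$, which is the box constraint passed to quadprog above.

\begin{align*}
&\text{minimize } \tfrac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w^T x_i + w_0) \ge 1, \qquad
L_P = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + w_0) - 1 \right] \\
&\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad
\frac{\partial L_P}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \\
&\alpha_i \left[ y_i(w^T x_i + w_0) - 1 \right] = 0, \qquad
y_i(w^T x_i + w_0) - 1 \ge 0, \qquad \alpha_i \ge 0 \quad (i = 1, \ldots, N)
\end{align*}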