Introduction to Pattern Recognition for Human ICT
Support Vector Machines
2014. 11. 28
Hyunki Hong

Contents
• Introduction
• Optimal separating hyperplanes
• Kuhn-Tucker Theorem
• Support Vectors
• Non-linear SVMs and kernel methods

Introduction
• Consider the familiar problem of learning a binary classifier from data.
  – Assume a given dataset $(X, Y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where the goal is to learn a function $y = f(x)$ that will correctly classify unseen examples.
• How do we find such a function?
  – By optimizing some measure of performance of the learned model.
• What is a good measure of performance?
  – A good measure is the expected risk
    $R[f] = \int C(f(x), y)\, p(x, y)\, dx\, dy$,
    where $C(f(x), y)$ is a suitable cost function, e.g., $C(f(x), y) = (f(x) - y)^2$.
  – Unfortunately, the risk cannot be measured directly, since the underlying pdf $p(x, y)$ is unknown. Instead, we typically use the risk over the training set, also known as the empirical risk:
    $R_{emp}[f] = \frac{1}{N} \sum_{i=1}^{N} C(f(x_i), y_i)$.
• Empirical Risk Minimization
  – A formal term for a simple concept: find the function $f(x)$ that minimizes the average risk on the training set.
  – Minimizing the empirical risk is not a bad thing to do, provided that sufficient training data is available, since the law of large numbers ensures that the empirical risk asymptotically converges to the expected risk as $N \to \infty$.
    (Note on "asymptotic": in applied mathematics, asymptotic analysis is used to build numerical methods that approximate equation solutions.)
• How do we avoid overfitting?
  – By controlling model complexity. Intuitively, we should prefer the simplest model that explains the data.

Optimal separating hyperplanes
• Problem statement
  – Consider the problem of finding a separating hyperplane for a linearly separable dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $x \in \mathbb{R}^D$, $y \in \{-1, +1\}$.
  – Which of the infinitely many separating hyperplanes should we choose?
    1. Intuitively, a hyperplane that passes too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set.
    2. Instead, it seems reasonable to expect that a hyperplane that is farthest from all training examples will have better generalization capabilities.
  – Therefore, the optimal separating hyperplane will be the one with the largest margin, which is defined as the minimum distance from an example to the decision surface.
• Solution
  – Since we want to maximize the margin, let us express it as a function of the weight vector and bias of the separating hyperplane.
  – From geometry, the distance between a point $x$ and a plane $(w, b)$ is
    $d(x) = \dfrac{|w^T x + b|}{\|w\|}$.
  – Since the optimal hyperplane has infinitely many equivalent solutions (obtained by rescaling the weight vector and bias), we choose the one for which the discriminant function becomes one for the training examples closest to the boundary, $|w^T x + b| = 1$. This is known as the canonical hyperplane.
  – Therefore, the distance from the closest example to the boundary is $1/\|w\|$,
  – and the margin becomes $m = 2/\|w\|$.
• Therefore, the problem of maximizing the margin is equivalent to
    minimize $J(w) = \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) \geq 1$, $i = 1, \ldots, N$.
  – Notice that $J(w)$ is a quadratic function, which means that there exists a single global minimum and no local minima.
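To make the distance and margin expressions above concrete, the following is a minimal MATLAB sketch. The hyperplane parameters and data points are hypothetical, chosen so that two of the points lie exactly on the canonical planes $|w^T x + b| = 1$.

% Minimal sketch (hypothetical values): point-to-hyperplane distance and
% the margin of a canonical hyperplane w'*x + b = 0.
w = [2; 1]; b = -1;              % hypothetical hyperplane parameters
X = [1 0; 0 0; 2 2];             % three 2-D points, one per row
d = abs(X*w + b) / norm(w)       % distances of the points to the hyperplane
% The first two points satisfy |w'*x + b| = 1 (they lie on the canonical
% planes), so their distance is 1/||w|| and the margin is:
margin = 2 / norm(w)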
• To solve this problem, we will use classical Lagrangian optimization techniques.
  – We first present the Kuhn-Tucker Theorem, which provides an essential result for the interpretation of Support Vector Machines.

Kuhn-Tucker Theorem
• Given an optimization problem with convex domain $\Omega \subseteq \mathbb{R}^N$:
    minimize $f(z)$, $z \in \Omega$, subject to $g_i(z) \le 0$ $(i = 1, \ldots, k)$ and $h_j(z) = 0$ $(j = 1, \ldots, m)$,
  – with $f \in C^1$ convex and $g_i$, $h_j$ affine, necessary and sufficient conditions for a normal point $z^*$ to be an optimum are the existence of $\alpha^*$, $\beta^*$ such that
    $\dfrac{\partial L(z^*, \alpha^*, \beta^*)}{\partial z} = 0, \qquad \dfrac{\partial L(z^*, \alpha^*, \beta^*)}{\partial \beta} = 0,$
    $\alpha_i^* g_i(z^*) = 0, \qquad g_i(z^*) \le 0, \qquad \alpha_i^* \ge 0 \quad (i = 1, \ldots, k).$
  – $L(z, \alpha, \beta) = f(z) + \sum_i \alpha_i g_i(z) + \sum_j \beta_j h_j(z)$ is known as a generalized Lagrangian function.
  – The third condition is known as the Karush-Kuhn-Tucker (KKT) complementarity condition.
    1. It implies that $\alpha_i \ge 0$ for active constraints and $\alpha_i = 0$ for inactive constraints.
    2. As we will see in a minute, the KKT condition allows us to identify the training examples that define the largest-margin hyperplane. These examples will be known as Support Vectors.
• Solution
  – To simplify the primal problem, we eliminate the primal variables $(w, b)$ using the first Kuhn-Tucker condition $\partial L_P/\partial z = 0$. The primal Lagrangian of the margin-maximization problem is
    $L_P(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right], \quad \alpha_i \ge 0.$
  – Differentiating $L_P(w, b, \alpha)$ with respect to $w$ and $b$, and setting the derivatives to zero, yields
    $\dfrac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \dfrac{\partial L_P}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0.$
  – Expansion of $L_P$ yields
    $L_P = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i w^T x_i - b \sum_{i=1}^{N} \alpha_i y_i + \sum_{i=1}^{N} \alpha_i.$
  – Using the optimality condition $\partial L_P/\partial w = 0$, the first term in $L_P$ can be expressed as
    $\frac{1}{2} w^T w = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j.$
  – The second term in $L_P$ can be expressed in the same way:
    $\sum_{i=1}^{N} \alpha_i y_i w^T x_i = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j.$
  – The third term in $L_P$ is zero by virtue of the optimality condition $\partial L_P/\partial b = 0$.
  – Merging these expressions together, we obtain the problem
    maximize $L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
  – subject to the (simpler) constraints $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
  – This is known as the Lagrangian dual problem.
• Comments
  – We have transformed the problem of finding a saddle point of $L_P(w, b, \alpha)$ into the easier one of maximizing $L_D(\alpha)$. Notice that $L_D(\alpha)$ depends on the Lagrange multipliers $\alpha$, not on $(w, b)$.
  – The primal problem scales with dimensionality ($w$ has one coefficient for each dimension), whereas the dual problem scales with the amount of training data (there is one Lagrange multiplier per example).
  – Moreover, in $L_D(\alpha)$ the training data appear only as dot products $x_i^T x_j$.

Support Vectors
• The KKT complementarity condition states that, for every point in the training set, the following equality must hold:
    $\alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0, \quad i = 1, \ldots, N.$
  – Therefore, for each example either $\alpha_i = 0$ or $y_i(w^T x_i + b) - 1 = 0$ must hold.
    1. Those points for which $\alpha_i > 0$ must then lie on one of the two hyperplanes that define the largest margin (the term $y_i(w^T x_i + b) - 1$ becomes zero only on these hyperplanes).
    2. These points are known as the Support Vectors.
    3. All the other points must have $\alpha_i = 0$.
  – Note that only the SVs contribute to defining the optimal hyperplane: $w = \sum_{i \in SV} \alpha_i y_i x_i$.
    1. NOTE: the bias term $b$ is found from the KKT complementarity condition on the support vectors.
    2. Therefore, the complete dataset could be replaced by only the support vectors, and the separating hyperplane would be the same.
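The next sketch illustrates the dual problem and the role of the support vectors on a small, hypothetical, linearly separable set; the toy data, the $10^{-6}$ tolerance, and the variable names are assumptions, and the lecture's own implementation (with kernels and slack variables) appears later.

% Minimal sketch: solve the hard-margin dual max L_D(alpha) with quadprog
% for a hypothetical linearly separable toy set, then recover (w, b).
X = [1 1; 2 2; -1 -1; -2 -2];        % toy training points, one per row
y = [1; 1; -1; -1];                  % class labels in {-1, +1}
N = size(X, 1);
H = (y*y') .* (X*X') + 1e-10*eye(N); % quadratic term of -L_D (regularized)
f = -ones(N, 1);                     % quadprog minimizes 0.5*a'*H*a + f'*a
a = quadprog(H, f, [], [], y', 0, zeros(N,1), []);  % alpha_i >= 0, sum_i alpha_i*y_i = 0
w  = X' * (a .* y);                  % w = sum_i alpha_i y_i x_i
sv = find(a > 1e-6);                 % support vectors: alpha_i > 0
b  = mean(y(sv) - X(sv,:)*w);        % bias from y_i(w'*x_i + b) = 1 on the SVs

Only the components of a that belong to the support vectors come out nonzero, which is exactly the complementarity property discussed above.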
Lecture contents
• Classification with a linear hyperplane
• SVM (Support Vector Machine)
• SVM with slack variables
• Kernel methods
• Experiments using MATLAB

Overfitting and generalization error
• Training error: the difference between the outputs for the training data and the desired outputs.
• Generalization error: the error on data that were not used for training.
• Relationship between the complexity of the learning system and the generalization error:
  (Figure (a)-(c): comparison of a linear and a nonlinear classifier. On the training data the more complex nonlinear classifier attains the lower training error, but on unseen data its generalization error can be larger than that of the linear classifier, e.g., when the training data do not properly represent the overall data distribution.)
• To avoid overfitting and reduce the generalization error, it is important to control the complexity of the learning system appropriately.

Classification with a linear hyperplane
• Decision boundary for an input $x$: the linear hyperplane $w^T x + w_0 = 0$, with discriminant function $f(x)$ such that
  $f(x) = +1 \Rightarrow x \in C_1$, $\quad f(x) = -1 \Rightarrow x \in C_2$.
• Many different linear decision boundaries achieve the minimum training error. Which one should we choose?

Maximum-margin classifier
• Among the many linear decision boundaries, the SVM looks for the optimal boundary, the one that minimizes the generalization error, by introducing the concept of a margin and using it to define the objective function for training.
• Margin: the distance from the training example closest to the decision boundary to that boundary.
• Support vector: a training example located closest to the decision boundary.
  (Figure (a), (b): two decision boundaries with different margins; the support vectors determine the margin.)
• Small generalization error → large separation between the classes → maximize the margin!! ⇒ the "maximum-margin classifier", the SVM.
• Linear decision boundary (hyperplane): $w^T x + w_0 = 0$.
• Distance from a point $x$ to the decision boundary: $d = \dfrac{|w^T x + w_0|}{\|w\|}$.

Optimal separating hyperplane and the maximum margin
• The SVM finds the optimal parameters, the weight vector $w$ and the bias $w_0$, that maximize the margin of the decision boundary (the optimal separating hyperplane), including the case of an $m$-dimensional input.
• Formulation of the maximum margin:
  (Figure: the plus plane $w^T x + w_0 = +1$ on the $C_1$ side and the minus plane $w^T x + w_0 = -1$ on the $C_2$ side each lie at distance $1/\|w\|$ from the boundary, so the margin is $M = 2/\|w\|$.)
• To maximize the margin, minimize $\|w\|$ (equivalently $\frac{1}{2}\|w\|^2$); the support vectors satisfy the constraints with equality.
• The linear classifier parameters are optimized using the training data.

SVM training
• Given the training data set $\{(x_i, y_i)\}_{i=1}^{N}$, the parameters $w$, $w_0$ to be estimated must satisfy
  $y_i(w^T x_i + w_0) \ge 1, \quad i = 1, \ldots, N.$
• The objective function to be minimized and these constraints are combined into the Lagrangian function
  $J(w, w_0, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + w_0) - 1 \right],$
  which is minimized with respect to the parameters $w$, $w_0$ and maximized with respect to the $\alpha_i$.
• Dual problem of $J(w, w_0, \alpha)$: instead of optimizing $J(w, w_0, \alpha)$ directly, optimize
  $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
  subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
• $Q(\alpha)$ is a quadratic function of the $\alpha_i$, so the solution $\hat{\alpha}_i$ can be obtained easily with quadratic programming → from it, $w$ and $w_0$ are estimated.
• Most data points lie away from the decision boundary and satisfy $y_i(w^T x_i + w_0) - 1 > 0$; for these points the only value of $\hat{\alpha}_i$ consistent with the optimum is 0.
  – The number of data points that must be stored for classification and the amount of computation are therefore drastically reduced.

Classification with the SVM
• Substituting the estimated parameters into the decision function,
  $f(x) = \operatorname{sign}\!\left( \sum_{i \in SV} \hat{\alpha}_i y_i\, x_i^T x + w_0 \right),$
  $f(x) = +1 \Rightarrow x \in C_1$, $\quad f(x) = -1 \Rightarrow x \in C_2$.

SVM with slack variables
• To handle data that are not linearly separable, slack variables $\xi_i$ are introduced: for a misclassified example, $\xi_i$ is its distance to the decision boundary of its own class.
  (Figure: examples with slack variables $\xi_i$, $\xi_j$, $\xi_k$.)
• Classification condition: $y_i(w^T x_i + w_0) \ge 1 - \xi_i$, $\;\xi_i \ge 0$.
• The larger the value of a slack variable, the more severe the misclassification that is allowed!
• The constant $C$ determines how much misclassification is tolerated: as $C$ grows, large $\xi_i$ are suppressed, so fewer misclassification errors are allowed.
• Objective function to be minimized: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$.
• The Lagrangian is differentiated with respect to $w$ and $w_0$ as before; new Lagrange multipliers $\beta_i$ are added to keep the slack variables nonnegative.
• Once the $\alpha_i$ are determined, the values of $w$ and $w_0$ are exactly the same as in the case without slack variables!
  (Figure: linearly separable case vs. linearly non-separable case.)

Kernel methods
• A nonlinear classification problem can be handled by transforming the low-dimensional input $x$ into a point $\Phi(x)$ in a high-dimensional space.
  (Figure (a), (b): 2-D data and their mapping into a higher-dimensional feature space by $\Phi$.)
• Once re-expressed in the high-dimensional space, the data can be classified with a simple linear classifier.
• However, side effects such as a large increase in computation arise → this is resolved with the "kernel trick".

Kernel methods and the SVM
• Assume the $n$-dimensional input data $x$ are transformed into $m$-dimensional ($m \gg 1$) feature data $\Phi(x)$ and then classified with an SVM.
• The SVM computations use not the individual vectors $\Phi(x)$ but only the inner products $\Phi(x) \cdot \Phi(y)$ of pairs of vectors.
• Instead of the high-dimensional mapping $\Phi(x)$, the inner product $\Phi(x) \cdot \Phi(y)$ is defined and used as a single function $k(x, y)$: the "kernel function".
• In the Lagrangian used for parameter estimation, $\Phi(x)$ is used as the input in place of $x$; expressing the problem with the kernel function, the classification function can be written in terms of the estimated parameters:
  $f(x) = \operatorname{sign}\!\left( \sum_{i \in SV} \hat{\alpha}_i y_i\, k(x_i, x) + w_0 \right).$
• The high-dimensional mapping linearizes the nonlinear problem, while the kernel keeps the computational cost low.
• Representative kernel functions: linear $k(x, y) = x^T y$; polynomial $k(x, y) = (x^T y + b)^d$; Gaussian $k(x, y) = \exp\!\left( -\|x - y\|^2 / 2\sigma^2 \right)$.
• SVM with slack variables and kernels: the two extensions are combined in the implementation below.

Experiments using MATLAB
• Training and pattern-classification experiment with an SVM: number of data points = 30, classes $C_1$ and $C_2$.
• MATLAB function quadprog(): by quadratic programming, it finds the parameters $\alpha$ that minimize the objective function while satisfying the given constraints.
• Once the specific matrices and vectors used as its inputs have been set up, they are supplied to the function to find $\alpha$; the corresponding support vectors are then found by selecting the components of $\alpha$ that are greater than zero.
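Before the implementation, a small MATLAB sketch of the kernel idea itself; the input vectors and the explicit mapping phi are illustrative assumptions. For the 2-D polynomial kernel $(x^T y + 1)^2$, evaluating $k(x, y)$ directly in the input space gives the same value as an inner product in a 6-dimensional feature space, so $\Phi(x)$ never has to be formed explicitly.

% Minimal sketch: the kernel trick for the polynomial kernel (x*y' + 1)^2 in 2-D.
phi = @(v) [1, sqrt(2)*v(1), sqrt(2)*v(2), v(1)^2, v(2)^2, sqrt(2)*v(1)*v(2)];
x = [1 2]; y = [3 -1];               % hypothetical 2-D inputs (row vectors)
k_direct  = (x*y' + 1)^2;            % kernel evaluated in the input space
k_feature = phi(x) * phi(y)';        % same value via the explicit 6-D mapping
% k_direct and k_feature are equal, so the SVM only ever needs k(x, y).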
MATLAB implementation: kernel function SVM_kernel

function [out] = SVM_kernel(X1, X2, fmode, hyp)
% SVM_kernel: kernel matrix between the rows of X1 and the rows of X2.
% fmode selects the kernel; hyp holds its hyperparameters.
if (fmode==1)                        % linear kernel
    out = X1*X2';
end
if (fmode==2)                        % polynomial kernel: (x'*y + hyp(1))^hyp(2)
    out = (X1*X2' + hyp(1)).^hyp(2);
end
if (fmode==3)                        % Gaussian kernel with width hyp(1)
    for i = 1:size(X1,1)
        for j = 1:size(X2,1)
            x = X1(i,:); y = X2(j,:);
            out(i,j) = exp(-(x-y)*(x-y)'/(2*hyp(1)*hyp(1)));
        end
    end
end

MATLAB implementation: SVM classifier

% Build and apply a kernel SVM classifier for 2-D data
load data12_8                              % load the data (X, Y)
N = size(Y,1);                             % N: number of data points
kmode = 2; hyp(1) = 1.0; hyp(2) = 2.0;     % choose the kernel and its parameters
%%%%% Define the quadratic-programming objective
%% minimize: 0.5*a'*H*a + f'*a
%% equality constraint: Aeq*a = beq
%% bounds: lb <= a <= ub
KXX = SVM_kernel(X, X, kmode, hyp);        % kernel matrix
H = diag(Y)*KXX*diag(Y) + 1e-10*eye(N);    % quadratic term of the objective
f = (-1)*ones(N,1);                        % linear term of the objective
Aeq = Y'; beq = 0;                         % equality constraint: sum(alpha.*Y) = 0
lb = zeros(N,1); ub = zeros(N,1) + 10^3;   % bounds 0 <= alpha <= C (C = 10^3)

MATLAB implementation: SVM classifier (continued)

%%%%% Call the QP solver --> estimate alpha, then the bias w0
[alpha, fval, exitflag, output] = quadprog(H, f, [], [], Aeq, beq, lb, ub);
svi = find(abs(alpha) > 10^(-3));          % indices of the support vectors (alpha > 0)
svx = X(svi,:);                            % store the support vectors
ksvx = SVM_kernel(svx, svx, kmode, hyp);   % kernel matrix between support vectors
for i = 1:size(svi,1)                      % compute the bias w0 from each support vector
    svw(i) = Y(svi(i)) - sum(alpha(svi).*Y(svi).*ksvx(:,i));
end
w0 = mean(svw);

MATLAB implementation: SVM classifier (continued)

%%%%% Compute the classification results
for i = 1:N
    xt = X(i,:);
    ksvxt = SVM_kernel(svx, xt, kmode, hyp);
    fx(i,1) = sign(sum(alpha(svi).*Y(svi).*ksvxt) + w0);
end
Cerr = sum(abs(Y-fx))/(2*N);
%%%%% Print and save the results
fprintf(1,'Number of Support Vectors: %d\n', size(svi,1));
fprintf(1,'Classification error rate: %.3f\n', Cerr);
posId = find(Y>0); negId = find(Y<0);      % class index sets (used by the plotting script)
save SVMres12_8 X Y posId negId svi svx alpha w0 kmode hyp

MATLAB implementation: program that draws the decision boundary

load SVMres12_8                            % load the data and the trained parameters
xM = max(X); xm = min(X);                  % set the region in which to draw the boundary
S1 = [floor(xm(1)):0.1:ceil(xM(1))];
S2 = [floor(xm(2)):0.1:ceil(xM(2))];
for i = 1:size(S1,2)                       % compute the output for each input in the region
    for j = 1:size(S2,2)
        xt = [S1(i), S2(j)];
        ksvxt = SVM_kernel(svx, xt, kmode, hyp);
        G(i,j) = (sum(alpha(svi).*Y(svi).*ksvxt) + w0);
        F(i,j) = sign(G(i,j));
    end
end
[X1, X2] = meshgrid(S1, S2);
figure(kmode);
contour(X1', X2', F); hold on              % draw the decision boundary
contour(X1', X2', G);                      % draw contours of the discriminant function
posId = find(Y>0); negId = find(Y<0);
plot(X(posId,1), X(posId,2), 'd'); hold on % plot the data
plot(X(negId,1), X(negId,2), 'o');
plot(X(svi,1), X(svi,2), 'g+');            % mark the support vectors

(Figure: decision boundary and support vectors (7) obtained with the linear kernel.)
(Figure: decision boundaries and support vectors with (a) the polynomial kernel, b = 1, d = 2, 5 support vectors, and (b) the Gaussian kernel, σ = 1, 9 support vectors.)
(Figure: change of the decision boundary with the Gaussian kernel width, (a) σ = 0.2, (b) σ = 0.5, (c) σ = 2.0, (d) σ = 5.0.)

Karush-Kuhn-Tucker (KKT) conditions
• In mathematical optimization, the Karush-Kuhn-Tucker (KKT) conditions are first-order necessary conditions for a solution of a nonlinear programming problem to be optimal, provided that some regularity conditions are satisfied.
• By allowing inequality constraints, the KKT approach to nonlinear programming generalizes the method of Lagrange multipliers, which allows only equality constraints.
• The system of equations corresponding to the KKT conditions is usually not solved directly, except in the few special cases where a closed-form solution can be derived analytically. In general, many optimization algorithms can be interpreted as methods for numerically solving the KKT system of equations.

Margin-maximization conditions
• Applying the KKT conditions to the SVM optimization problem.
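As a sketch of this last point, the KKT conditions stated earlier can be written out for the hard-margin SVM problem; the soft-margin version with slack variables additionally bounds the multipliers, $0 \le \alpha_i \le C$, which is the box constraint passed to quadprog above.

\begin{align*}
&\text{minimize } \tfrac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w^T x_i + w_0) \ge 1, \qquad
L_P = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + w_0) - 1 \right] \\
&\frac{\partial L_P}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad
\frac{\partial L_P}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \\
&\alpha_i \left[ y_i(w^T x_i + w_0) - 1 \right] = 0, \qquad
y_i(w^T x_i + w_0) - 1 \ge 0, \qquad \alpha_i \ge 0 \quad (i = 1, \ldots, N)
\end{align*}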