Chapter 6 Support Vector Machines
Instructor: Chuan-Yu Chang, Ph.D. (張傳育 博士)
E-mail: chuanyu@yuntech.edu.tw  Tel: (05)5342601 ext. 4337  Office: ES709
Medical Image Processing Lab., Graduate School of Computer Science & Information Engineering

Introduction
- SVMs can be used to solve pattern-classification and nonlinear-regression problems. Basically, an SVM is a linear machine with a number of desirable properties.
- The main idea of a support vector machine is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized.
- The support vector machine is an approximate implementation of the method of structural risk minimization: for separable patterns, an SVM makes the first term of Eq. (2.101) zero and minimizes the second term.
- A unique property of the SVM is that it provides good generalization performance without incorporating problem-domain knowledge.
- Central to the construction of the SVM learning algorithm is the inner-product kernel between a support vector \mathbf{x}_i and an input vector \mathbf{x}. Depending on how this kernel is generated, learning machines with different kinds of nonlinear decision surfaces can be constructed.

Background
[Figure: geometry of the hyperplane \mathbf{w}^T \mathbf{x} + b = 0 in the (x_1, x_2) plane.]
For any two points \mathbf{p} and \mathbf{q} lying on the hyperplane, \mathbf{w}^T \mathbf{p} + b = 0 and \mathbf{w}^T \mathbf{q} + b = 0, so that \mathbf{w}^T (\mathbf{p} - \mathbf{q}) = 0; the weight vector \mathbf{w} is therefore normal to the hyperplane.

Any point \mathbf{x} may be decomposed as
  \mathbf{x} = \mathbf{x}_p + r \frac{\mathbf{w}}{\lVert \mathbf{w} \rVert}
where \mathbf{x}_p is the normal projection of \mathbf{x} onto the hyperplane and r is the algebraic distance of \mathbf{x} from it. Defining the discriminant function
  g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b
and noting that g(\mathbf{x}_p) = 0, we obtain
  g(\mathbf{x}) = r \lVert \mathbf{w} \rVert \quad\text{or}\quad r = \frac{g(\mathbf{x})}{\lVert \mathbf{w} \rVert}

Optimal Hyperplane for Linearly Separable Patterns
Consider the training sample \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}, where the desired response d_i is +1 or -1 and the two classes are linearly separable. The separating hyperplane is defined by
  \mathbf{w}^T \mathbf{x} + b = 0    (6.1)
where \mathbf{x} is the input vector, \mathbf{w} is an adjustable weight vector, and b is a bias. The separation may then be described by
  \mathbf{w}^T \mathbf{x}_i + b \ge 0 \quad\text{for } d_i = +1
  \mathbf{w}^T \mathbf{x}_i + b < 0 \quad\text{for } d_i = -1    (6.2)

For a given weight vector \mathbf{w} and bias b, the distance between the hyperplane defined by Eq. (6.1) and the closest data point is called the margin of separation, denoted by \rho. The goal of the SVM is to find the particular hyperplane for which the margin of separation \rho is maximized; this decision surface is referred to as the optimal hyperplane. Let \mathbf{w}_0 and b_0 denote the optimal weight vector and bias, respectively. The optimal hyperplane, representing a multidimensional linear decision surface in the input space, is
  \mathbf{w}_0^T \mathbf{x} + b_0 = 0    (6.3)

[Figure: Class 1 and Class 2 separated by the hyperplane \mathbf{w}^T \mathbf{x} + b = 0, bounded by the margin planes \mathbf{w}^T \mathbf{x} + b = \pm 1.] Many linear classifiers (hyperplanes) separate the data; however, only one achieves maximum separation. Which one should we choose?

The algebraic measure of the distance from \mathbf{x} to the optimal hyperplane is given by the discriminant function
  g(\mathbf{x}) = \mathbf{w}_0^T \mathbf{x} + b_0    (6.4)
Writing \mathbf{x} = \mathbf{x}_p + r \, \mathbf{w}_0 / \lVert \mathbf{w}_0 \rVert and using g(\mathbf{x}_p) = 0, which follows from Eq. (6.3), the desired algebraic distance is
  g(\mathbf{x}) = r \lVert \mathbf{w}_0 \rVert \quad\text{or}\quad r = \frac{g(\mathbf{x})}{\lVert \mathbf{w}_0 \rVert}    (6.5)
In particular, the distance from the origin to the optimal hyperplane is b_0 / \lVert \mathbf{w}_0 \rVert.

Our goal is to find the parameters \mathbf{w}_0 and b_0 of the optimal hyperplane, given the training set T = \{(\mathbf{x}_i, d_i)\}. From Fig. 6.2, the pair (\mathbf{w}_0, b_0) must satisfy the conditions
  \mathbf{w}_0^T \mathbf{x}_i + b_0 \ge +1 \quad\text{for } d_i = +1
  \mathbf{w}_0^T \mathbf{x}_i + b_0 \le -1 \quad\text{for } d_i = -1    (6.6)
The particular data points (\mathbf{x}_i, d_i) for which the first or second line of Eq. (6.6) is satisfied with the equality sign are called support vectors. The support vectors are those data points that lie closest to the decision surface and are therefore the most difficult to classify.
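To make the geometry of Eqs. (6.4)-(6.5) concrete, here is a minimal NumPy sketch; the weight vector, bias, and test point are illustrative assumptions, not values from the text:

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative values only).
w = np.array([2.0, 1.0])   # weight vector, normal to the hyperplane
b = -1.0                   # bias

def discriminant(x, w, b):
    """g(x) = w^T x + b, Eq. (6.4)."""
    return w @ x + b

def signed_distance(x, w, b):
    """Algebraic distance r = g(x) / ||w||, Eq. (6.5).
    Positive on the d = +1 side of the hyperplane, negative on the d = -1 side."""
    return discriminant(x, w, b) / np.linalg.norm(w)

x = np.array([1.0, 2.0])          # an arbitrary test point
print(signed_distance(x, w, b))   # distance from x to the hyperplane
```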
Consider a support vector \mathbf{x}^{(s)}, for which d^{(s)} = +1. By definition,
  g(\mathbf{x}^{(s)}) = \mathbf{w}_0^T \mathbf{x}^{(s)} + b_0 = \mp 1 \quad\text{for } d^{(s)} = \mp 1    (6.7)
From Eq. (6.5), the algebraic distance from the support vector \mathbf{x}^{(s)} to the optimal hyperplane is
  r = \frac{g(\mathbf{x}^{(s)})}{\lVert \mathbf{w}_0 \rVert} = \begin{cases} 1/\lVert \mathbf{w}_0 \rVert & \text{if } d^{(s)} = +1 \\ -1/\lVert \mathbf{w}_0 \rVert & \text{if } d^{(s)} = -1 \end{cases}    (6.8)
The margin of separation between the two classes is therefore
  \rho = 2r = \frac{2}{\lVert \mathbf{w}_0 \rVert}    (6.9)
where the factor of 2 arises because the margin extends in both directions from the hyperplane. Maximizing the margin of separation \rho is thus equivalent to minimizing the Euclidean norm of the weight vector \mathbf{w}.

Quadratic Optimization for Finding the Optimal Hyperplane
Our goal is to use the training sample T = \{(\mathbf{x}_i, d_i)\}_{i=1}^{N} to find an optimal hyperplane subject to the constraint obtained by combining the two conditions of Eq. (6.6):
  d_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 \quad\text{for } i = 1, 2, \ldots, N    (6.10)
The constrained optimization problem may then be stated as: given the training sample, find the values of \mathbf{w} and b that minimize the cost function \Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}, subject to the constraints of Eq. (6.10). (If a data point violates the condition of Eq. (6.10), the margin of separation is said to be soft.)

This constrained optimization problem can be solved with the method of Lagrange multipliers. Construct the Lagrangian function
  J(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \alpha_i \left[ d_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right]    (6.11)
where the nonnegative variables \alpha_i are called Lagrange multipliers. The solution is determined by the saddle point of J(\mathbf{w}, b, \boldsymbol{\alpha}): J has to be minimized with respect to \mathbf{w} and b, and maximized with respect to \boldsymbol{\alpha}. Differentiating J(\mathbf{w}, b, \boldsymbol{\alpha}) with respect to \mathbf{w} and b and setting the results to zero yields the two conditions of optimality
  \mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \mathbf{x}_i    (6.12)
  \sum_{i=1}^{N} \alpha_i d_i = 0    (6.13)

At the saddle point, for each Lagrange multiplier \alpha_i, the product of that multiplier with its corresponding constraint vanishes:
  \alpha_i \left[ d_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 \right] = 0 \quad\text{for } i = 1, 2, \ldots, N    (6.14)
Only the multipliers for which the constraint of Eq. (6.14) is met exactly can assume nonzero values. This property follows from the Kuhn-Tucker conditions of optimization theory. (A data point lying exactly on the decision surface could belong to either class.)

The dual problem and the primal problem have the same optimal value.
Duality theorem: (i) If the primal problem has an optimal solution, the dual problem also has an optimal solution, and the corresponding optimal values are equal. (ii) In order for \mathbf{w}_o to be an optimal primal solution and \boldsymbol{\alpha}_o to be an optimal dual solution, it is necessary and sufficient that \mathbf{w}_o is feasible for the primal problem, and
  \Phi(\mathbf{w}_o) = J(\mathbf{w}_o, b_o, \boldsymbol{\alpha}_o) = \min_{\mathbf{w}} J(\mathbf{w}, b_o, \boldsymbol{\alpha}_o)

To postulate the dual problem from the primal one, first expand Eq. (6.11) term by term:
  J(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{i=1}^{N} \alpha_i d_i \mathbf{w}^T \mathbf{x}_i - b \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i    (6.15)
The third term on the right-hand side vanishes by virtue of Eq. (6.13). Substituting Eq. (6.12) for \mathbf{w} in Eq. (6.15), and setting J(\mathbf{w}, b, \boldsymbol{\alpha}) = Q(\boldsymbol{\alpha}), we obtain
  Q(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j \mathbf{x}_i^T \mathbf{x}_j    (6.16)

The Dual Problem
The dual problem is cast entirely in terms of the training data (all N samples) and, unlike the primal problem, involves only the Lagrange multipliers \alpha_i: given the training sample \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}, find the Lagrange multipliers \{\alpha_i\}_{i=1}^{N} that maximize the objective function Q(\boldsymbol{\alpha}) of Eq. (6.16), subject to the constraints (1) \sum_{i=1}^{N} \alpha_i d_i = 0 and (2) \alpha_i \ge 0 for i = 1, 2, \ldots, N.

Having determined the optimal Lagrange multipliers \alpha_{o,i} from the dual problem, the optimal weight vector follows from Eq. (6.12):
  \mathbf{w}_0 = \sum_{i=1}^{N} \alpha_{o,i} d_i \mathbf{x}_i    (6.17)
and, using Eq. (6.7), the optimal bias is
  b_0 = 1 - \mathbf{w}_0^T \mathbf{x}^{(s)} \quad\text{for } d^{(s)} = +1    (6.18)
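The dual of Eq. (6.16) is a quadratic program. As a minimal sketch (assuming SciPy is available; the toy data X, d are assumptions, and a production SVM would use a dedicated QP or SMO solver rather than a general-purpose optimizer), one can maximize Q(α) numerically and then recover w_0 and b_0 from Eqs. (6.17) and (6.18):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values, not from the text).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^T x_j, the quadratic term of Q(alpha) in Eq. (6.16).
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(alpha):
    # Negative of Eq. (6.16), since scipy minimizes rather than maximizes.
    return 0.5 * alpha @ H @ alpha - alpha.sum()

cons = {"type": "eq", "fun": lambda a: a @ d}   # Eq. (6.13): sum_i alpha_i d_i = 0
bounds = [(0.0, None)] * N                      # alpha_i >= 0
res = minimize(neg_Q, np.zeros(N), method="SLSQP", bounds=bounds, constraints=cons)
alpha = res.x

w0 = (alpha * d) @ X                            # Eq. (6.17)
s = int(np.argmax(alpha * (d > 0)))             # a support vector with d = +1
b0 = 1.0 - w0 @ X[s]                            # Eq. (6.18)
print(w0, b0)
```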
Optimal Hyperplane for Nonseparable Patterns
The preceding discussion assumed linearly separable patterns; we now consider nonseparable patterns, allowing some patterns to fall inside the region of separation. To this end, a set of nonnegative scalar variables \{\xi_i\}_{i=1}^{N}, called slack variables, is introduced into the definition of the separating hyperplane:
  d_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N
The slack variable \xi_i measures the deviation of a data point from the ideal condition of pattern separability:
- For 0 \le \xi_i \le 1, the data point falls inside the region of separation but on the correct side of the decision surface; it is still classified correctly.
- For \xi_i > 1, the data point falls on the wrong side of the decision surface and is misclassified.

Our goal is to find a separating hyperplane for which the average misclassification error over the whole training set is minimized. Since minimizing the number of errors directly is computationally difficult, we use the approximating cost functional
  \Phi(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C \sum_{i=1}^{N} \xi_i    (6.23)
The first term in Eq. (6.23) is related to minimizing the VC dimension of the support vector machine. The second term is an upper bound on the number of test errors. The parameter C is user-determined, either (1) experimentally or (2) analytically.

For this soft classification, the primal problem for the nonseparable case is stated as: given the training sample \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}, find the values of \mathbf{w}, b, and \{\xi_i\} that minimize the cost functional of Eq. (6.23), subject to the constraints d_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i and \xi_i \ge 0 for i = 1, 2, \ldots, N.

Applying the method of Lagrange multipliers and proceeding in a manner similar to Section 6.2, the dual problem for nonseparable patterns becomes: find the Lagrange multipliers \{\alpha_i\}_{i=1}^{N} that maximize Q(\boldsymbol{\alpha}) of Eq. (6.16), subject to (1) \sum_{i=1}^{N} \alpha_i d_i = 0 and (2) 0 \le \alpha_i \le C for i = 1, 2, \ldots, N. Note that the slack variables \xi_i no longer appear in the dual problem.

The optimal weight vector is
  \mathbf{w}_0 = \sum_{i=1}^{N_s} \alpha_{o,i} d_i \mathbf{x}_i    (6.24)
where N_s is the number of support vectors. The optimal bias is determined from the Kuhn-Tucker conditions
  \alpha_i \left[ d_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 + \xi_i \right] = 0, \quad i = 1, 2, \ldots, N    (6.25)
  \mu_i \xi_i = 0, \quad i = 1, 2, \ldots, N    (6.26)
where the \mu_i are the Lagrange multipliers associated with the constraints \xi_i \ge 0. At the saddle point, the derivative of the primal Lagrangian with respect to \xi_i is zero, which yields
  \alpha_i + \mu_i = C    (6.27)
  \xi_i = 0 \quad\text{if } \alpha_i < C    (6.28)
We may take any data point in the training sample for which 0 < \alpha_{o,i} < C (and hence \xi_i = 0) and use Eq. (6.25) to determine the optimal bias b_o. From a numerical perspective, however, it is better to average over all such data points, as illustrated by the sketch below.
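The role of the slack variables can be made explicit: at the optimum, \xi_i = \max(0, 1 - d_i(\mathbf{w}^T \mathbf{x}_i + b)). A small sketch (the hyperplane parameters and points are hypothetical) that evaluates the slacks and classifies the three cases:

```python
import numpy as np

# Hypothetical soft-margin solution (w, b) and toy points -- illustrative only.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0], [0.8, 0.8], [0.2, 0.2]])
d = np.array([1.0, 1.0, 1.0])

# Slack of each point w.r.t. the constraint d_i (w^T x_i + b) >= 1 - xi_i.
xi = np.maximum(0.0, 1.0 - d * (X @ w + b))

for xi_i in xi:
    if xi_i == 0:
        print("on or outside the margin: correctly classified (xi = 0)")
    elif xi_i <= 1:
        print("inside the region of separation, correct side (0 < xi <= 1)")
    else:
        print("wrong side of the decision surface: misclassified (xi > 1)")
```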
How to Build a Support Vector Machine for Pattern Recognition
The idea of a support vector machine hinges on two mathematical operations:
1. Nonlinear mapping of an input vector into a high-dimensional feature space. By Cover's theorem on the separability of patterns, a multidimensional space may be transformed into a new feature space where the patterns are linearly separable with high probability.
2. Construction of an optimal hyperplane for separating the features discovered in step 1. This separating hyperplane is defined as a linear function of vectors drawn from the feature space.
[Figure: the nonlinear map \varphi(\cdot) takes \mathbf{x}_i in the input (data) space to \varphi(\mathbf{x}_i) in the feature space.]

Inner-Product Kernel
Let \mathbf{x} denote a vector drawn from the m_0-dimensional input space, and let \{\varphi_j(\mathbf{x})\}_{j=1}^{m_1} denote a set of nonlinear transformations from the input space to an m_1-dimensional feature space, with \varphi_j(\mathbf{x}) defined a priori for all j. We may then define a hyperplane
  \sum_{j=1}^{m_1} w_j \varphi_j(\mathbf{x}) + b = 0    (6.29)
where \{w_j\}_{j=1}^{m_1} denotes a set of linear weights connecting the feature space to the output space and b is the bias. Equivalently,
  \sum_{j=0}^{m_1} w_j \varphi_j(\mathbf{x}) = 0    (6.30)
where \varphi_j(\mathbf{x}) denotes the feature-space image induced by the input vector \mathbf{x}:
  \boldsymbol{\varphi}(\mathbf{x}) = [\varphi_0(\mathbf{x}), \varphi_1(\mathbf{x}), \ldots, \varphi_{m_1}(\mathbf{x})]^T    (6.31)
with, by definition,
  \varphi_0(\mathbf{x}) = 1 \quad\text{for all } \mathbf{x}    (6.32)
so that the weight w_0 represents the bias b. The decision surface (a hyperplane in the feature space) is then
  \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}) = 0    (6.33)
Adapting Eq. (6.12) to this situation by replacing the input vector \mathbf{x}_i with its feature-space image \boldsymbol{\varphi}(\mathbf{x}_i), we have
  \mathbf{w} = \sum_{i=1}^{N} \alpha_i d_i \boldsymbol{\varphi}(\mathbf{x}_i)    (6.34)

Substituting Eq. (6.34) into Eq. (6.33), the decision surface in the feature space becomes
  \sum_{i=1}^{N} \alpha_i d_i \boldsymbol{\varphi}^T(\mathbf{x}_i) \boldsymbol{\varphi}(\mathbf{x}) = 0    (6.35)
The term \boldsymbol{\varphi}^T(\mathbf{x}_i)\boldsymbol{\varphi}(\mathbf{x}) represents the inner product of two vectors induced in the feature space by the input vector \mathbf{x} and the input pattern \mathbf{x}_i. We therefore introduce the inner-product kernel
  K(\mathbf{x}, \mathbf{x}_i) = \boldsymbol{\varphi}^T(\mathbf{x}) \boldsymbol{\varphi}(\mathbf{x}_i) = \sum_{j=0}^{m_1} \varphi_j(\mathbf{x}) \varphi_j(\mathbf{x}_i) \quad\text{for } i = 1, 2, \ldots, N    (6.36)
The inner-product kernel is a symmetric function of its arguments:
  K(\mathbf{x}, \mathbf{x}_i) = K(\mathbf{x}_i, \mathbf{x}) \quad\text{for all } i    (6.37)
Substituting Eq. (6.36) into Eq. (6.35), the optimal decision surface may be written as
  \sum_{i=1}^{N} \alpha_i d_i K(\mathbf{x}, \mathbf{x}_i) = 0    (6.38)

Optimal Design of a Support Vector Machine
The inner-product kernel of Eq. (6.36) allows us to construct a decision surface that is nonlinear in the input space but linear in the feature space. The dual form of the constrained optimization of a support vector machine is stated as: given the training sample \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}, find the Lagrange multipliers \{\alpha_i\}_{i=1}^{N} that maximize the objective function
  Q(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(\mathbf{x}_i, \mathbf{x}_j)    (6.40)
subject to the constraints:
  (1) \sum_{i=1}^{N} \alpha_i d_i = 0
  (2) 0 \le \alpha_i \le C for i = 1, 2, \ldots, N

We may view K(\mathbf{x}_i, \mathbf{x}_j) as the ij-th element of the N \times N symmetric matrix
  \mathbf{K} = \{K(\mathbf{x}_i, \mathbf{x}_j)\}_{(i,j)=1}^{N}    (6.41)
Once the optimal values \alpha_{o,i} of the Lagrange multipliers have been obtained, the corresponding optimal linear weight vector follows from Eq. (6.17), with \mathbf{x}_i replaced by \boldsymbol{\varphi}(\mathbf{x}_i):
  \mathbf{w}_0 = \sum_{i=1}^{N} \alpha_{o,i} d_i \boldsymbol{\varphi}(\mathbf{x}_i)    (6.42)

[Figure: architecture of the support vector machine.]

Example: XOR Problem
Input vectors \mathbf{x}: (-1, -1), (-1, +1), (+1, -1), (+1, +1); desired responses d: -1, +1, +1, -1. Let
  K(\mathbf{x}, \mathbf{x}_i) = (1 + \mathbf{x}^T \mathbf{x}_i)^2    (6.43)
with \mathbf{x} = [x_1, x_2]^T and \mathbf{x}_i = [x_{i1}, x_{i2}]^T. Substituting into Eq. (6.43) and expanding gives
  K(\mathbf{x}, \mathbf{x}_i) = 1 + x_1^2 x_{i1}^2 + 2 x_1 x_2 x_{i1} x_{i2} + x_2^2 x_{i2}^2 + 2 x_1 x_{i1} + 2 x_2 x_{i2}
which corresponds to the feature map
  \boldsymbol{\varphi}(\mathbf{x}) = [1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2]^T

ε-Insensitive Loss Function
The MLP and RBF networks of Chapters 4 and 5 are optimized with a quadratic loss function. The least-squares estimator, however, is very sensitive to outliers, so we need a robust estimator that is insensitive to small changes in the model. When the additive noise has a probability density function that is symmetric about the origin, the minimax procedure for solving this nonlinear regression problem minimizes the absolute error
  L(d, y) = |d - y|    (6.44)
where d is the desired response and y is the estimator output. This is extended to the ε-insensitive loss function
  L_\varepsilon(d, y) = \begin{cases} |d - y| - \varepsilon, & \text{for } |d - y| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases}    (6.45)

Support Vector Machines for Nonlinear Regression
Consider the nonlinear regression model
  d = f(\mathbf{x}) + v    (6.46)
where d is a scalar, \mathbf{x} is a vector, and both the scalar-valued nonlinear function f(\mathbf{x}) and the statistics of the noise v are unknown. All that is available is a set of training data \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}, from which an estimate of d must be found. Let the estimate of d be y, expanded in terms of a set of nonlinear basis functions \{\varphi_j(\mathbf{x})\}_{j=0}^{m_1}:
  y = \sum_{j=0}^{m_1} w_j \varphi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x})    (6.47)
where \boldsymbol{\varphi}(\mathbf{x}) = [\varphi_0(\mathbf{x}), \varphi_1(\mathbf{x}), \ldots, \varphi_{m_1}(\mathbf{x})]^T and \mathbf{w} = [w_0, w_1, \ldots, w_{m_1}]^T. Assuming \varphi_0(\mathbf{x}) = 1, the weight w_0 represents the bias b.

The requirement is to minimize the empirical risk
  R_{\mathrm{emp}} = \frac{1}{N} \sum_{i=1}^{N} L_\varepsilon(d_i, y_i)    (6.48)
subject to the inequality
  \lVert \mathbf{w} \rVert^2 \le c_0    (6.49)
where c_0 is a constant and the ε-insensitive loss L_\varepsilon(d_i, y_i) is as defined in Eq. (6.45). Introduce two sets of nonnegative slack variables \{\xi_i\}_{i=1}^{N} and \{\xi_i'\}_{i=1}^{N}, defined by
  d_i - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i) \le \varepsilon + \xi_i, \quad i = 1, 2, \ldots, N    (6.50)
  \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i) - d_i \le \varepsilon + \xi_i', \quad i = 1, 2, \ldots, N    (6.51)
  \xi_i \ge 0, \quad i = 1, 2, \ldots, N    (6.52)
  \xi_i' \ge 0, \quad i = 1, 2, \ldots, N    (6.53)
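The two regression slacks of Eqs. (6.50)-(6.53) decompose the ε-insensitive loss of Eq. (6.45): at most one of \xi_i, \xi_i' is nonzero per point, and their sum equals L_\varepsilon(d_i, y_i). A minimal sketch (the targets d, outputs y, and \varepsilon are illustrative assumptions) verifying this:

```python
import numpy as np

def eps_insensitive_loss(d, y, eps):
    """L_eps(d, y) of Eq. (6.45): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(d - y) - eps)

def slacks(d, y, eps):
    """Slack pair of Eqs. (6.50)-(6.53) at the optimum:
    xi measures how far d lies above the tube, xi' how far below."""
    xi = np.maximum(0.0, d - y - eps)     # Eq. (6.50) active
    xi_p = np.maximum(0.0, y - d - eps)   # Eq. (6.51) active
    return xi, xi_p

d = np.array([1.0, 2.0, 3.0])             # desired responses
y = np.array([1.05, 2.5, 2.0])            # hypothetical model outputs
eps = 0.1
xi, xi_p = slacks(d, y, eps)
assert np.allclose(eps_insensitive_loss(d, y, eps), xi + xi_p)
```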
The constrained optimization problem is then equivalent to minimizing the cost functional
  \Phi(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}') = C \sum_{i=1}^{N} (\xi_i + \xi_i') + \frac{1}{2}\mathbf{w}^T\mathbf{w}    (6.54)
subject to the constraints of Eqs. (6.50)-(6.53). By including the term \mathbf{w}^T\mathbf{w}/2, the constraint of Eq. (6.49) no longer needs to be imposed explicitly. The constant C in Eq. (6.54) is a user-specified parameter. Accordingly, we define the Lagrangian function
  J(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\xi}', \boldsymbol{\alpha}, \boldsymbol{\alpha}', \boldsymbol{\gamma}, \boldsymbol{\gamma}')
    = C \sum_{i=1}^{N} (\xi_i + \xi_i') + \frac{1}{2}\mathbf{w}^T\mathbf{w}
    - \sum_{i=1}^{N} \alpha_i \left[ \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i) - d_i + \varepsilon + \xi_i \right]
    - \sum_{i=1}^{N} \alpha_i' \left[ d_i - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i) + \varepsilon + \xi_i' \right]
    - \sum_{i=1}^{N} (\gamma_i \xi_i + \gamma_i' \xi_i')    (6.55)

Setting the partial derivatives of Eq. (6.55) with respect to \mathbf{w}, \boldsymbol{\xi}, and \boldsymbol{\xi}' to zero and rearranging, we obtain
  \mathbf{w} = \sum_{i=1}^{N} (\alpha_i - \alpha_i') \boldsymbol{\varphi}(\mathbf{x}_i)    (6.56)
  \gamma_i = C - \alpha_i    (6.57)
  \gamma_i' = C - \alpha_i'    (6.58)
Substituting Eqs. (6.56)-(6.58) into Eq. (6.55) and simplifying yields
  Q(\alpha_i, \alpha_i') = \sum_{i=1}^{N} d_i (\alpha_i - \alpha_i') - \varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i')
    - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i')(\alpha_j - \alpha_j') K(\mathbf{x}_i, \mathbf{x}_j)    (6.59)
where K(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\varphi}^T(\mathbf{x}_i) \boldsymbol{\varphi}(\mathbf{x}_j).

Dual Problem for Nonlinear Regression
Given the training sample \{(\mathbf{x}_i, d_i)\}_{i=1}^{N}, find the Lagrange multipliers \{\alpha_i\}_{i=1}^{N} and \{\alpha_i'\}_{i=1}^{N} that maximize the objective function Q(\alpha_i, \alpha_i') of Eq. (6.59), subject to the constraints:
  (1) \sum_{i=1}^{N} (\alpha_i - \alpha_i') = 0
  (2) 0 \le \alpha_i \le C and 0 \le \alpha_i' \le C for i = 1, 2, \ldots, N
where C is a user-specified constant.

The approximating function realized by the support vector machine is then
  F(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}) = \sum_{i=1}^{N} (\alpha_i - \alpha_i') K(\mathbf{x}, \mathbf{x}_i)
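As a closing illustration, here is a minimal sketch of evaluating the regression machine's output F(x, w) = Σ_i (α_i − α_i') K(x, x_i); the kernel choice (the polynomial kernel of Eq. (6.43)), the training inputs, and the dual coefficients beta_i = α_i − α_i' are assumptions for demonstration, as they would normally come from solving the dual problem above:

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Inner-product kernel K(x, z) = (1 + x^T z)^degree, as in Eq. (6.43)."""
    return (1.0 + x @ z) ** degree

def svr_predict(x, X_train, beta, kernel):
    """F(x, w) = sum_i (alpha_i - alpha_i') K(x, x_i), with beta_i = alpha_i - alpha_i'."""
    return sum(b_i * kernel(x, x_i) for b_i, x_i in zip(beta, X_train))

# Hypothetical training inputs and dual coefficients (illustrative values only).
X_train = np.array([[0.0], [1.0], [2.0]])
beta = np.array([0.3, -0.1, 0.05])
x = np.array([1.5])
print(svr_predict(x, X_train, beta, poly_kernel))
```

Note that because \varphi_0(\mathbf{x}) = 1, the constant term inside the polynomial kernel absorbs the bias, which is why no separate b appears in this form of F(x, w).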