Chapter 6
Support Vector Machines
Instructor: Chuan-Yu Chang, Ph.D.
E-mail: chuanyu@yuntech.edu.tw
Tel: (05)5342601 ext. 4337
Office: ES709
Medical Image Processing Lab.
Graduate School of Computer Science & Information Engineering
Introduction
• SVMs can be used to solve pattern classification and nonlinear regression problems.
• Fundamentally, an SVM is a linear machine with many desirable properties.
• The main idea of a support vector machine is to construct a hyperplane as the decision surface in such a way that the margin of separation between positive and negative examples is maximized.
• The support vector machine is an approximate implementation of the method of structural risk minimization.
• For separable patterns, the SVM makes the first term of Eq. (2.101) zero and minimizes the second term.
• A unique property of the SVM is that it delivers good generalization performance without incorporating problem-domain knowledge.
• Central to the construction of the SVM algorithm is the inner-product kernel between a support vector x_i and an input vector x.
• Depending on how this inner-product kernel is generated, one obtains learning machines with different nonlinear decision surfaces.
Background
[Figure: a straight line in the (x1, x2)-plane with intercepts x1 = q and x2 = p, and normal vector w.]

A line with intercepts q (on the x1-axis) and p (on the x2-axis) satisfies
$$\frac{x_1}{q} + \frac{x_2}{p} = 1 \;\Longleftrightarrow\; p\,x_1 + q\,x_2 - pq = 0 \;\Longleftrightarrow\; \begin{pmatrix} p & q \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} - pq = 0$$
that is, $w^T x + b = 0$ with $w = (p, q)^T$ and $b = -pq$.
Background (cont.)
[Figure: geometric interpretation. A point x is decomposed as x = x_p + r w/||w||, where x_p is its normal projection onto the hyperplane and r is its algebraic distance from the plane.]

Any point x may be written as
$$x = x_p + r\,\frac{w}{\|w\|}$$
Let the discriminant function be
$$g(x) = w^T x + b = w^T\!\left(x_p + r\,\frac{w}{\|w\|}\right) + b$$
Since $g(x_p) = 0$, it follows that
$$g(x) = r\,\|w\| \qquad\text{or}\qquad r = \frac{g(x)}{\|w\|}$$
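A quick numerical check of r = g(x)/||w|| (a minimal sketch; the hyperplane parameters and the test point are made-up values):

```python
import numpy as np

# Made-up hyperplane w^T x + b = 0 and a test point x
w = np.array([3.0, 4.0])   # normal vector, ||w|| = 5
b = -5.0
x = np.array([4.0, 2.0])

g = w @ x + b              # discriminant g(x) = w^T x + b = 15
r = g / np.linalg.norm(w)  # signed algebraic distance to the plane
print(r)                   # 3.0: x lies on the positive side, 3 units away
```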
Optimal Hyperplane for Linearly Separable Patterns
Consider the training sample {(x_i, d_i)}_{i=1}^N, where the desired response d_i is +1 or -1 and the two classes are linearly separable. The equation of the separating hyperplane is
$$w^T x + b = 0 \qquad (6.1)$$
where x is an input vector, w is an adjustable weight vector, and b is a bias. The separation problem may then be stated as
$$w^T x_i + b \ge 0 \quad \text{for } d_i = +1$$
$$w^T x_i + b < 0 \quad \text{for } d_i = -1 \qquad (6.2)$$
Optimal Hyperplane for Linearly Separable Patterns
• Given the weight vector w and bias b, the separation between the hyperplane defined by Eq. (6.1) and the closest data point is called the margin of separation, denoted by ρ.
• The goal of an SVM is to find the particular hyperplane for which the margin of separation ρ is maximized.
• The decision surface is then referred to as the optimal hyperplane.
• Let w_0 and b_0 denote the optimum weight vector and bias, respectively.
• The optimal hyperplane, representing a multidimensional linear decision surface in the input space, is then
$$w_0^T x + b_0 = 0 \qquad (6.3)$$
Optimal Hyperplane for Linearly Separable Patterns
[Figure: two classes of points in the (x1, x2)-plane, with several candidate separating lines and the three parallel hyperplanes w^T x + b = +1, w^T x + b = 0, and w^T x + b = -1.]

Many linear classifiers (hyperplanes) can separate the data. However, only one achieves maximum separation. Which one should we choose?
Optimal Hyperplane for Linearly Separable Patterns
The algebraic measure of the distance from x to the optimal hyperplane is given by the discriminant function
$$g(x) = w_0^T x + b_0 \qquad (6.4)$$
Writing x = x_p + r w_0/||w_0|| and noting that g(x_p) = 0 because of Eq. (6.3), the desired algebraic distance is
$$g(x) = r\,\|w_0\| \qquad\text{or}\qquad r = \frac{g(x)}{\|w_0\|} \qquad (6.5)$$
In particular, the distance from the origin to the optimal hyperplane is $b_0 / \|w_0\|$.
Optimal Hyperplane for Linearly Separable Patterns
• Our goal is to find the parameters w_o and b_o of the optimal hyperplane, given the training set T = {(x_i, d_i)}.
• From Fig. 6.2, the pair (w_o, b_o) must satisfy the conditions
$$w_0^T x_i + b_0 \ge +1 \quad \text{for } d_i = +1$$
$$w_0^T x_i + b_0 \le -1 \quad \text{for } d_i = -1 \qquad (6.6)$$
• The particular data points (x_i, d_i) for which the first or second line of Eq. (6.6) is satisfied with the equality sign are called support vectors.
• The support vectors are those data points that lie closest to the decision surface and are therefore the most difficult to classify.
Optimal Hyperplane for Linearly Separable Patterns
Consider a support vector x^(s) for which d^(s) = +1. By definition,
$$g(x^{(s)}) = w_0^T x^{(s)} + b_0 = \mp 1 \quad \text{for } d^{(s)} = \mp 1 \qquad (6.7)$$
By Eq. (6.5), the algebraic distance from the support vector x^(s) to the optimal hyperplane is
$$r = \frac{g(x^{(s)})}{\|w_0\|} = \begin{cases} \dfrac{1}{\|w_0\|} & \text{if } d^{(s)} = +1 \\[4pt] \dfrac{-1}{\|w_0\|} & \text{if } d^{(s)} = -1 \end{cases} \qquad (6.8)$$
The margin of separation between the two classes is therefore
$$\rho = 2r = \frac{2}{\|w_0\|} \qquad (6.9)$$
(The factor 2 arises because the margin extends by r on both sides of the hyperplane.) Maximizing the margin ρ is thus equivalent to minimizing the Euclidean norm of the weight vector w: maximum ρ implies minimum ||w_0||.
Optimal Hyperplane for Linearly Separable Patterns
• Quadratic Optimization for Finding the Optimal Hyperplane
• Our goal is to use the training sample T = {(x_i, d_i)}_{i=1}^N to find an optimal hyperplane subject to the constraint obtained by combining the two lines of Eq. (6.6):
$$d_i (w^T x_i + b) \ge 1 \quad \text{for } i = 1, 2, \ldots, N \qquad (6.10)$$
• The constrained optimization problem may then be stated as: find the weight vector w that minimizes Φ(w) = (1/2) w^T w, subject to the constraints of Eq. (6.10). (If a data point violates the condition in Eq. (6.10), the margin of separation is said to be soft.)
Optimal Hyperplane for Linearly Separable Patterns
• The constrained optimization problem of Eq. (6.10) can be solved using the method of Lagrange multipliers.
• Construct the Lagrangian function
$$J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ d_i (w^T x_i + b) - 1 \right] \qquad (6.11)$$
where the nonnegative variables α_i are called Lagrange multipliers.
• The solution to this problem is determined by the saddle point of J(w, b, α): J has to be minimized with respect to w and b, and maximized with respect to α. We therefore differentiate Eq. (6.11) with respect to w and b and set the results to zero, obtaining
$$w = \sum_{i=1}^{N} \alpha_i d_i x_i \qquad (6.12)$$
$$\sum_{i=1}^{N} \alpha_i d_i = 0 \qquad (6.13)$$
Optimal Hyperplane for Linearly Separable Patterns
• At the saddle point, for each Lagrange multiplier α_i, the product of that multiplier with its corresponding constraint vanishes:
$$\alpha_i \left[ d_i (w^T x_i + b) - 1 \right] = 0 \quad \text{for } i = 1, 2, \ldots, N \qquad (6.14)$$
• Only the multipliers whose constraints are satisfied with equality in Eq. (6.14) can assume nonzero values. This property follows from the Kuhn-Tucker conditions of optimization theory. (At the saddle point the partial derivatives are zero; a point lying exactly on the decision surface could belong to either class.)
• The dual problem and the primal problem have the same optimal solution.
• Duality theorem:
  • If the primal problem has an optimal solution, the dual problem also has an optimal solution, and the corresponding optimal values are equal.
  • In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem, and
$$\Phi(w_o) = J(w_o, b_o, \alpha_o) = \min_w J(w, b_o, \alpha_o)$$
Optimal Hyperplane for Linearly Separable Patterns
• To postulate the dual problem for our primal problem, first expand Eq. (6.11) term by term:
$$J(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i - b \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i \qquad (6.15)$$
• The third term on the right-hand side is zero by virtue of Eq. (6.13).
• Substituting Eq. (6.12) for w in Eq. (6.15) and setting J(w, b, α) = Q(α), we obtain after simplification
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_i^T x_j \qquad (6.16)$$
Optimal Hyperplane for Linearly Separable Patterns
• The Dual Problem
• The dual problem is stated entirely in terms of the training data (all N points) and the multipliers α_i: compared with the primal problem, only the α_i have to be solved for.
• Having determined the optimum Lagrange multipliers α_{o,i} from the dual problem, the optimum weight vector w_0 follows from Eq. (6.12):
$$w_0 = \sum_{i=1}^{N} \alpha_{o,i} d_i x_i \qquad (6.17)$$
• Using Eq. (6.7) with a support vector x^(s), the optimum bias is
$$b_0 = 1 - w_0^T x^{(s)} \quad \text{for } d^{(s)} = +1 \qquad (6.18)$$
Optimal hyperplane for non-separable patterns
• The preceding discussion dealt with linearly separable patterns. We now consider nonseparable patterns, allowing some patterns to fall inside the region of separation:
$$d_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N \qquad (6.19)$$
• The ξ_i are called slack variables: a set of nonnegative scalar variables {ξ_i}_{i=1}^N added to the definition of the hyperplane.
• For 0 ≤ ξ_i ≤ 1, the data point falls inside the region of separation but on the correct side of the decision surface, and is still classified correctly.
• For ξ_i > 1, the data point falls on the wrong side of the separating hyperplane and is classified incorrectly.
Optimal hyperplane for non-separable patterns
• Our goal is to find a separating hyperplane for which the misclassification error, averaged over the whole training set, is minimized. Ideally we would penalize each point with ξ_i > 1 (incorrect) while ignoring points with 0 < ξ_i ≤ 1 (correct, but possibly inside the margin); for computational convenience, this is approximated by minimizing
$$\Phi(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \qquad (6.23)$$
• The first term in Eq. (6.23) is related to minimizing the VC dimension of the support vector machine.
• The second term is an upper bound on the number of test errors.
• The parameter C is determined by the user, either (1) experimentally or (2) analytically.
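In practice, the soft-margin problem of Eq. (6.23) is solved by library implementations. A minimal sketch with scikit-learn (an illustrative choice, not prescribed by the text), showing how the user-chosen C trades margin width against training violations on made-up, overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: not linearly separable
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
d = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, d)
    margin = 2 / np.linalg.norm(clf.coef_)   # rho = 2 / ||w||, Eq. (6.9)
    print(f"C={C}: margin={margin:.2f}, "
          f"support vectors={clf.n_support_.sum()}")
# Small C favors a wide margin with many violations;
# large C favors a narrow margin with few violations.
```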
Optimal hyperplane for non-separable patterns
Then, for soft classification (the nonseparable case), the primal problem is stated as: given the training sample {(x_i, d_i)}_{i=1}^N, find w, b, and {ξ_i} that minimize the cost functional Φ(w, ξ) of Eq. (6.23), subject to d_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for i = 1, 2, ..., N.
Optimal hyperplane for non-separable patterns
• Applying the method of Lagrange multipliers in a manner similar to Section 6.2, the dual problem for nonseparable patterns is: maximize the same objective Q(α) of Eq. (6.16), subject to Σ_i α_i d_i = 0 and the box constraint 0 ≤ α_i ≤ C for i = 1, 2, ..., N.
• Note that the slack variables ξ_i no longer appear in the dual problem.
Optimal hyperplane for non-separable patterns
• The optimum solution for the weight vector w is
$$w_0 = \sum_{i=1}^{N_s} \alpha_{o,i} d_i x_i \qquad (6.24)$$
where N_s is the number of support vectors.
• To determine the optimum bias, we use the Kuhn-Tucker conditions:
$$\alpha_i \left[ d_i (w^T x_i + b) - 1 + \xi_i \right] = 0, \quad i = 1, 2, \ldots, N \qquad (6.25)$$
$$\mu_i \xi_i = 0, \quad i = 1, 2, \ldots, N \qquad (6.26)$$
where the μ_i are the Lagrange multipliers that enforce ξ_i ≥ 0.
• At the saddle point, the derivative of the Lagrangian function of the primal problem with respect to ξ_i is zero, which yields
$$\alpha_i + \mu_i = C \qquad (6.27)$$
$$\xi_i = 0 \quad \text{if} \quad \alpha_i < C \qquad (6.28)$$
• We may take any data point in the training sample for which 0 < α_{o,i} < C (and hence ξ_i = 0) and use Eq. (6.25) to determine the optimum bias b_o. However, from a numerical standpoint it is better to average over all such data points, as in the sketch below.
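A minimal sketch of this averaging step (the helper name is ours, and alphas, X, d are assumed to come from a box-constrained dual solver such as the one sketched earlier):

```python
import numpy as np

def optimal_bias(alphas, X, d, C, tol=1e-8):
    """Average b over all points with 0 < alpha_i < C (so that xi_i = 0)."""
    w = (alphas * d) @ X                             # Eq. (6.24)
    on_margin = (alphas > tol) & (alphas < C - tol)  # xi_i = 0 for these points
    # Eq. (6.25) with xi_i = 0 gives d_i (w^T x_i + b) = 1,
    # i.e. b = d_i - w^T x_i since d_i = +/-1.
    return np.mean(d[on_margin] - X[on_margin] @ w)
```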
How to Build a Support Vector Machine for Pattern Recognition
• The idea of an SVM hinges on two mathematical operations:
1. Nonlinear mapping of an input vector into a high-dimensional feature space.
   • This is justified by Cover's theorem on the separability of patterns: a multidimensional space may be transformed into a new feature space in which the patterns are linearly separable with high probability.
2. Construction of an optimal hyperplane for separating the features found in step 1.
   • This separating hyperplane is a linear function of the vectors drawn from the feature space. See the sketch after the figure below.

[Figure: the nonlinear map φ(·) carries a point x_i in the input (data) space to its image φ(x_i) in the feature space.]
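To make step 1 concrete, a minimal sketch (assuming, for illustration, the degree-2 polynomial feature map that reappears in the XOR example later in this chapter): the XOR patterns, inseparable in the 2-D input space, become linearly separable after the mapping.

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, 1, 1, -1])          # XOR labels: not linearly separable in 2-D

def phi(x):
    # Feature map induced by the polynomial kernel K(x, xi) = (1 + x^T xi)^2
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s*x1, s*x2, s*x1*x2, x1**2, x2**2])

Z = np.array([phi(x) for x in X])     # images in the 6-D feature space
w = np.array([0, 0, 0, -1.0, 0, 0])   # a separating normal: g(z) = -sqrt(2) x1 x2
print(np.sign(Z @ w))                 # [-1.  1.  1. -1.] -- matches d exactly
```

Up to scale, this w is the same solution the chapter's XOR example derives analytically.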
How to Build a Support Vector Machine for Pattern
Recognition (cont.)
• Inner-Product Kernel
• Let x denote a vector drawn from the input space of dimension m_0, and let {φ_j(x)}_{j=1}^{m_1} denote a set of nonlinear transformations from the input space to a feature space of dimension m_1.
• It is assumed that φ_j(x) is defined a priori for all j.
• We may then define a hyperplane acting as the decision surface:
$$\sum_{j=1}^{m_1} w_j \varphi_j(x) + b = 0 \qquad (6.29)$$
where {w_j}_{j=1}^{m_1} denotes a set of linear weights connecting the feature space to the output space and b is the bias. Defining φ_0(x) = 1 so that w_0 denotes the bias b, this can be written more compactly as
$$\sum_{j=0}^{m_1} w_j \varphi_j(x) = 0 \qquad (6.30)$$
How to Build a Support Vector Machine for Pattern
Recognition (cont.)
• The vector φ(x) denotes the image induced in the feature space by the input vector x:
$$\varphi(x) = \left[ \varphi_0(x), \varphi_1(x), \ldots, \varphi_{m_1}(x) \right]^T \qquad (6.31)$$
• By definition,
$$\varphi_0(x) = 1 \quad \text{for all } x \qquad (6.32)$$
• The decision surface (hyperplane) is therefore
$$w^T \varphi(x) = 0 \qquad (6.33)$$
• Adapting Eq. (6.12) to this situation by replacing the input vector x_i with its feature-space image φ(x_i), we obtain
$$w = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i) \qquad (6.34)$$
How to Build a Support Vector Machine for Pattern
Recognition (cont.)
• Substituting Eq. (6.34) into Eq. (6.33) defines the decision surface in the feature space:
$$\sum_{i=1}^{N} \alpha_i d_i \varphi^T(x_i)\, \varphi(x) = 0 \qquad (6.35)$$
• The term φ^T(x_i)φ(x) represents the inner product of two vectors induced in the feature space by the input vector x and the input pattern x_i. We may therefore introduce the inner-product kernel
$$K(x, x_i) = \varphi^T(x)\, \varphi(x_i) = \sum_{j=0}^{m_1} \varphi_j(x)\, \varphi_j(x_i) \quad \text{for } i = 1, 2, \ldots, N \qquad (6.36)$$
• The inner-product kernel is a symmetric function of its arguments:
$$K(x, x_i) = K(x_i, x) \quad \text{for all } i \qquad (6.37)$$
• Substituting Eq. (6.36) into Eq. (6.35) gives
$$\sum_{i=1}^{N} \alpha_i d_i K(x, x_i) = 0 \qquad (6.38)$$
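A quick numeric check of Eqs. (6.36) and (6.37), using as an assumed example the polynomial kernel K(x, x_i) = (1 + x^T x_i)^2 that appears later in the XOR example (the test vectors are made up):

```python
import numpy as np

def K(x, xi):
    # Polynomial inner-product kernel K(x, xi) = (1 + x^T xi)^2
    return (1.0 + x @ xi) ** 2

def phi(x):
    # The feature map that induces this kernel (phi_0 = 1 plus m1 = 5 terms)
    s = np.sqrt(2.0)
    return np.array([1.0, s*x[0], s*x[1], s*x[0]*x[1], x[0]**2, x[1]**2])

x, xi = np.array([0.5, -1.0]), np.array([2.0, 1.0])
print(K(x, xi) == K(xi, x))                    # True: symmetry, Eq. (6.37)
print(np.isclose(K(x, xi), phi(x) @ phi(xi)))  # True: the expansion in Eq. (6.36)
```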
How to Build a Support Vector Machine for Pattern
Recognition (cont.)
• Optimal Design of a Support Vector Machine
• The inner-product kernel of Eq. (6.36) allows us to construct a decision surface that is nonlinear in the input space, but whose image in the feature space is linear.
• The dual form of the constrained optimization of a support vector machine is as follows:
• Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(x_i, x_j) \qquad (6.40)$$
subject to the constraints:
(1) $\sum_{i=1}^{N} \alpha_i d_i = 0$
(2) $0 \le \alpha_i \le C \quad \text{for } i = 1, 2, \ldots, N$
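Library solvers handle this kernelized dual directly. A minimal sketch with scikit-learn (an illustrative choice, not mandated by the text), using the polynomial kernel of the XOR example below; in scikit-learn the "poly" kernel computes (gamma x^T x_i + coef0)^degree, so gamma = coef0 = 1 and degree = 2 reproduces (1 + x^T x_i)^2:

```python
import numpy as np
from sklearn.svm import SVC

# XOR data (see the worked example below): nonlinear in the input space
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, 1, 1, -1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, d)
print(clf.predict(X))   # [-1  1  1 -1]: all four patterns separated
print(clf.dual_coef_)   # the nonzero alpha_i * d_i of Eq. (6.40)
```

The very large C approximates the hard-margin machine of Eq. (6.40) with the box constraint effectively inactive.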
How to Build a Support Vector Machine for Pattern
Recognition (cont.)
• We may view K(x_i, x_j) as the ij-th element of the N-by-N symmetric matrix K:
$$\mathbf{K} = \left\{ K(x_i, x_j) \right\}_{(i,j)=1}^{N} \qquad (6.41)$$
• Having found the optimum values α_{o,i} of the Lagrange multipliers, the corresponding optimum linear weight vector w_o follows from Eq. (6.17):
$$w_0 = \sum_{i=1}^{N} \alpha_{o,i} d_i \varphi(x_i) \qquad (6.42)$$
Example: XOR Problem
• Input vectors x: (-1, -1), (-1, +1), (+1, -1), (+1, +1)
• Desired responses d: -1, +1, +1, -1
• Let
$$K(x, x_i) = (1 + x^T x_i)^2, \qquad x = [x_1, x_2]^T, \quad x_i = [x_{i1}, x_{i2}]^T$$
• Substituting the above into Eq. (6.43) and expanding yields the Gram matrix and the dual solution, checked numerically below.
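A sketch verifying the chapter's XOR numbers: the Gram matrix has 9 on the diagonal and 1 off it, and all four Lagrange multipliers come out to 1/8. Solving the unconstrained stationarity condition is legitimate here because the equality constraint of Eq. (6.13) happens to hold at the solution.

```python
import numpy as np

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1.0, 1.0, 1.0, -1.0])

K = (1 + X @ X.T) ** 2   # Gram matrix of Eq. (6.41): diag 9, off-diagonal 1
H = np.outer(d, d) * K   # H_ij = d_i d_j K(x_i, x_j)

# Stationarity of Q(alpha) = sum(alpha) - 0.5 alpha^T H alpha gives H alpha = 1
alpha = np.linalg.solve(H, np.ones(4))
print(alpha)             # [0.125 0.125 0.125 0.125]: all four points are SVs
print(alpha @ d)         # 0.0: the constraint of Eq. (6.13) is satisfied
```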
ε-insensitive Loss Function
• The MLP and RBF networks of Chapters 4 and 5 use a quadratic loss function for optimization.
• However, the least-squares estimator is very sensitive to outliers; we therefore need a robust estimator that is insensitive to small changes in the model.
• When the additive noise has a probability density function that is symmetric about the origin, minimizing the absolute error provides a minimax procedure for solving the nonlinear regression problem:
$$L(d, y) = |d - y| \qquad (6.44)$$
where d is the desired response and y is the estimator output. Extending this to tolerate errors smaller than a prescribed ε gives the ε-insensitive loss function
$$L_\epsilon(d, y) = \begin{cases} |d - y| - \epsilon & \text{for } |d - y| \ge \epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (6.45)$$
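A direct transcription of Eq. (6.45) (a minimal sketch; the function name and test values are ours):

```python
import numpy as np

def eps_insensitive_loss(d, y, eps):
    # L_eps(d, y) = |d - y| - eps if |d - y| >= eps, else 0   (Eq. 6.45)
    return np.maximum(np.abs(d - y) - eps, 0.0)

print(eps_insensitive_loss(np.array([1.0, 1.0]),
                           np.array([1.05, 2.0]), eps=0.1))
# [0.  0.9]: errors inside the eps-tube cost nothing
```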
Support Vector Machines for Nonlinear Regression
• Consider a nonlinear regression model
$$d = f(x) + v \qquad (6.46)$$
where d is a scalar, x is a vector, and both the scalar-valued nonlinear function f(x) and the statistics of the noise v are unknown. All we have is a set of training data {(x_i, d_i)}_{i=1}^N, from which we must estimate d.
• Let y denote the estimate of d, expanded in terms of a set of nonlinear basis functions {φ_j(x)}_{j=0}^{m_1}:
$$y = \sum_{j=0}^{m_1} w_j \varphi_j(x) = w^T \varphi(x) \qquad (6.47)$$
where
$$\varphi(x) = \left[ \varphi_0(x), \varphi_1(x), \ldots, \varphi_{m_1}(x) \right]^T, \qquad w = \left[ w_0, w_1, \ldots, w_{m_1} \right]^T$$
• Assuming φ_0(x) = 1, the weight w_0 represents the bias b.
Support Vector Machines for Nonlinear Regression
• We need to minimize the empirical risk
$$R_{emp} = \frac{1}{N} \sum_{i=1}^{N} L_\epsilon(d_i, y_i) \qquad (6.48)$$
subject to the constraint
$$\|w\|^2 \le c_0 \qquad (6.49)$$
where c_0 is a constant and L_ε(d_i, y_i) is the ε-insensitive loss function defined in Eq. (6.45).
• Introduce two sets of nonnegative slack variables {ξ_i}_{i=1}^N and {ξ'_i}_{i=1}^N defined by:
$$d_i - w^T \varphi(x_i) \le \epsilon + \xi_i, \quad i = 1, 2, \ldots, N \qquad (6.50)$$
$$w^T \varphi(x_i) - d_i \le \epsilon + \xi_i', \quad i = 1, 2, \ldots, N \qquad (6.51)$$
$$\xi_i \ge 0, \quad i = 1, 2, \ldots, N \qquad (6.52)$$
$$\xi_i' \ge 0, \quad i = 1, 2, \ldots, N \qquad (6.53)$$
Support Vector Machines for Nonlinear Regression
• The constrained optimization problem is then equivalent to minimizing the cost functional
$$\Phi(w, \xi, \xi') = C \sum_{i=1}^{N} (\xi_i + \xi_i') + \frac{1}{2} w^T w \qquad (6.54)$$
subject to the constraints of Eqs. (6.50) through (6.53).
• The inclusion of the term w^T w / 2 removes the need to impose the condition of Eq. (6.49); the constant C in Eq. (6.54) is a user-specified parameter.
• We may therefore define the Lagrangian function
$$J(w, \xi, \xi', \alpha, \alpha', \gamma, \gamma') = C \sum_{i=1}^{N} (\xi_i + \xi_i') + \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ w^T \varphi(x_i) - d_i + \epsilon + \xi_i \right] - \sum_{i=1}^{N} \alpha_i' \left[ d_i - w^T \varphi(x_i) + \epsilon + \xi_i' \right] - \sum_{i=1}^{N} (\gamma_i \xi_i + \gamma_i' \xi_i') \qquad (6.55)$$
Support Vector Machines for Nonlinear Regression
• Differentiating Eq. (6.55) with respect to w, ξ, and ξ', setting the results to zero, and rearranging, we obtain
$$w = \sum_{i=1}^{N} (\alpha_i - \alpha_i')\, \varphi(x_i) \qquad (6.56)$$
$$\gamma_i = C - \alpha_i \qquad (6.57)$$
$$\gamma_i' = C - \alpha_i' \qquad (6.58)$$
• Substituting Eqs. (6.56) through (6.58) into Eq. (6.55) and simplifying, we get
$$Q(\alpha_i, \alpha_i') = \sum_{i=1}^{N} d_i (\alpha_i - \alpha_i') - \epsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i') - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i')(\alpha_j - \alpha_j')\, K(x_i, x_j) \qquad (6.59)$$
where $K(x_i, x_j) = \varphi^T(x_i)\, \varphi(x_j)$.
Support Vector Machines for Nonlinear Regression
• Dual problem for nonlinear regression:
• Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N and {α'_i}_{i=1}^N that maximize the objective function
$$Q(\alpha_i, \alpha_i') = \sum_{i=1}^{N} d_i (\alpha_i - \alpha_i') - \epsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i') - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i')(\alpha_j - \alpha_j')\, K(x_i, x_j)$$
subject to the constraints:
(1) $\sum_{i=1}^{N} (\alpha_i - \alpha_i') = 0$
(2) $0 \le \alpha_i \le C$ and $0 \le \alpha_i' \le C$ for $i = 1, 2, \ldots, N$
where C is a user-specified constant.
The approximating function is then
$$F(x, w) = w^T \varphi(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i')\, K(x, x_i)$$