Medical Image Processing Lab, Graduate School of Computer Science & Information Engineering

Chapter 3
Single-Layer Perceptrons
Instructor: Chuan-Yu Chang, Ph.D. (張傳育 博士)
E-mail: chuanyu@yuntech.edu.tw
Tel: (05)5342601 ext. 4337
Office: ES709
Adaptive Filter Problem
• In a dynamic system whose mathematical characterization is unknown, all we have is a set of labeled input-output data generated by the system.
• That is, when an m-dimensional input x(i) is applied to the system, the system produces a corresponding output d(i).
• The external behavior of the system can therefore be described by the data set
  T : { x(i), d(i) ; i = 1, 2, ..., n, ... }    (3.1)
  where x(i) = [x_1(i), x_2(i), ..., x_m(i)]^T
[Figure: unknown dynamic system with inputs x_1(i), x_2(i), ..., x_m(i) and output d(i)]
Adaptive Filter Problem (cont.)
The problem is how to design a multiple-input, single-output model of the unknown system.
The neural model operates under the influence of
an algorithm that controls necessary adjustments to
the synaptic weights of the neuron.
• The algorithm starts from an arbitrary setting of the neuron’s
synaptic weights.
• Adjustments to the synaptic weights, in response to statistical
variations in the system’s behavior, are made on a
continuous basis.
• Computations of adjustments to the synaptic weights are
completed inside a time interval that is one sampling period
long.
The adaptive model consists of two continuous processes:
• Filtering process, which involves the computation of two signals:
  an output y(i) and an error signal e(i).
• Adaptive process:
  automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i).
Adaptive Filter Problem (cont.)
The output y(i) is the same as the induced local field
v(i)
y(i) = v(i) = \sum_{k=1}^{m} w_k(i) x_k(i)    (3.2)
Eq. (3.2) can be expressed as an inner product of vectors:
y(i) = x^T(i) w(i)    (3.3)
where w(i) = [w_1(i), w_2(i), ..., w_m(i)]^T
[Figure: adaptive filter built around a linear neuron - inputs x_k(i), synaptic weights w_k(i), output y(i) = v(i), compared with the desired response d(i) to form the error e(i)]
The neuron's output y(i) is compared with the corresponding desired response d(i):
e(i) = d(i) - y(i)    (3.4)
Unconstrained Optimization Techniques
• If a cost function E(w) is continuously differentiable with respect to the weight vector w, the goal of the adaptive filtering algorithm is to choose the weight vector w with the smallest cost.
• If the optimal weight vector is w*, it must satisfy
  E(w*) ≤ E(w)  for all w    (3.5)
• Minimize the cost function E(w) with respect to the weight vector w.
• The necessary condition for optimality is
  ∇E(w*) = 0    (3.7)
  where the gradient operator is
  ∇ = [∂/∂w_1, ∂/∂w_2, ..., ∂/∂w_m]^T    (3.8)
  and the gradient vector of the cost function is
  ∇E(w) = [∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_m]^T    (3.9)
Unconstrained Optimization Techniques
Local iterative descent
• Starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function E(w) is reduced at each iteration of the algorithm:
  E(w(n+1)) < E(w(n))    (3.10)
  where w(n) is the old value of the weight vector and w(n+1) is its updated value.
• We hope that the algorithm will eventually converge onto the optimal solution w*.
Method of Steepest Descent
The successive adjustments applied to the weight vector
w are in the direction of steepest descent, that is in a
direction opposite to the gradient vector
g = ∇E(w)    (3.11)
The steepest descent algorithm is formally described by
w(n+1) = w(n) - η g(n)    (3.12)
where η is the stepsize (learning-rate) parameter. The correction applied by the algorithm is
Δw(n) = w(n+1) - w(n) = -η g(n)    (3.13)
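As a concrete illustration of Eqs. (3.12)-(3.13), here is a minimal NumPy sketch (not part of the original slides); it assumes the user supplies a gradient function, and the quadratic cost in the example is purely illustrative:

```python
import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iter=100):
    """Iterate w(n+1) = w(n) - eta * g(n), as in Eq. (3.12)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        g = grad(w)          # gradient vector g(n) evaluated at w(n)
        w = w - eta * g      # correction Delta w(n) = -eta * g(n), Eq. (3.13)
    return w

# Example: minimize E(w) = 0.5 * ||w - [1, 2]||^2, whose gradient is w - [1, 2]
w_min = steepest_descent(lambda w: w - np.array([1.0, 2.0]), w0=np.zeros(2))
```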
Method of Steepest Descent (cont.)
To show that the steepest descent algorithm satisfies the condition of Eq. (3.10), we use a first-order Taylor series expansion around w(n) to approximate E(w(n+1)):
E(w(n+1)) ≈ E(w(n)) + g^T(n) Δw(n)
Substituting Eq. (3.13) into this expression gives
E(w(n+1)) ≈ E(w(n)) - η g^T(n) g(n) = E(w(n)) - η ||g(n)||²
which shows that, for a small positive learning rate η, the cost function decreases at each iteration of the algorithm.
Method of Steepest Descent (cont.)
The method of steepest descent converges to
the optimal solution w* slowly.
The learning-rate parameter η has a serious influence on its convergence behavior.
• When η is small, the transient response of the algorithm is overdamped: the trajectory traced by w(n) follows a smooth path in the w-plane.
• When η is large, the transient response of the algorithm is underdamped: the trajectory of w(n) follows a zigzagging (oscillatory) path.
• When η exceeds a certain critical value, the algorithm becomes unstable.
Method of Steepest Descent (cont.)
• Here F is assumed to be defined on the plane, and its graph has a bowl shape.
• The blue curves are the contour lines, that is, the regions on which the value of F is constant.
• A red arrow originating at a point shows the direction of the negative gradient at that point.
• Note that the (negative) gradient at a point is perpendicular to the contour line going through that point.
• We see that gradient descent leads us to the bottom of the bowl, that is, to the point where the value of the function F is minimal.
Method of Steepest Descent (cont.)
• Newton's method
• The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n).
• This minimization is performed at each iteration of the algorithm.
• Using a second-order Taylor series expansion of the cost function around the point w(n):
  ΔE(w(n)) = E(w(n+1)) - E(w(n)) ≈ g^T(n) Δw(n) + (1/2) Δw^T(n) H(n) Δw(n)    (3.14)
• g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n).
• The matrix H(n) is the m-by-m Hessian matrix of E(w).
Method of Steepest Descent (cont.)
• The Hessian of E(w) is defined by
H = ∇²E(w) =
\begin{bmatrix}
∂²E/∂w_1²    & ∂²E/∂w_1∂w_2 & \cdots & ∂²E/∂w_1∂w_m \\
∂²E/∂w_2∂w_1 & ∂²E/∂w_2²    & \cdots & ∂²E/∂w_2∂w_m \\
\vdots       & \vdots       &        & \vdots       \\
∂²E/∂w_m∂w_1 & ∂²E/∂w_m∂w_2 & \cdots & ∂²E/∂w_m²
\end{bmatrix}    (3.15)
• From Eq. (3.15), the cost function E(w) must be twice continuously differentiable with respect to w.
• Differentiating Eq. (3.14) with respect to Δw, the change ΔE(w) is minimized when
  g(n) + H(n) Δw(n) = 0
• Solving for Δw(n) gives
  Δw(n) = -H^{-1}(n) g(n)
• That is,
  w(n+1) = w(n) + Δw(n) = w(n) - H^{-1}(n) g(n)    (3.16)
• Note: the Hessian H(n) has to be a positive definite matrix for all n, but there is no guarantee that H(n) is positive definite at every iteration of the algorithm.
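A minimal sketch of the Newton update of Eq. (3.16) (not from the original slides), assuming the user supplies callables for the gradient and the Hessian and that the Hessian is positive definite at the current point:

```python
import numpy as np

def newton_step(w, grad, hess):
    """One Newton iteration w(n+1) = w(n) - H^{-1}(n) g(n), as in Eq. (3.16)."""
    g = grad(w)                        # m-by-1 gradient vector at w(n)
    H = hess(w)                        # m-by-m Hessian matrix at w(n), assumed positive definite
    return w - np.linalg.solve(H, g)   # solve H * delta = g rather than inverting H explicitly
```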
Method of Steepest Descent (cont.)
• Gauss-Newton Method
• Let the cost function be the sum of error squares:
  E(w) = (1/2) \sum_{i=1}^{n} e²(i)    (3.17)
• The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), the dependence of e(i) on w can be linearized as
  e'(i, w) = e(i) + [∂e(i)/∂w]^T_{w=w(n)} (w - w(n)),   i = 1, 2, ..., n    (3.18)
• In matrix notation,
  e'(n, w) = e(n) + J(n)(w - w(n))    (3.19)
  where the error vector is
  e(n) = [e(1), e(2), ..., e(n)]^T
Method of Steepest Descent (cont.)
• J(n) is the n-by-m Jacobian matrix of e(n):
J(n) =
\begin{bmatrix}
∂e(1)/∂w_1 & ∂e(1)/∂w_2 & \cdots & ∂e(1)/∂w_m \\
∂e(2)/∂w_1 & ∂e(2)/∂w_2 & \cdots & ∂e(2)/∂w_m \\
\vdots     & \vdots     &        & \vdots     \\
∂e(n)/∂w_1 & ∂e(n)/∂w_2 & \cdots & ∂e(n)/∂w_m
\end{bmatrix}_{w = w(n)}    (3.20)
• The Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n), where
  ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)]
• The updated weight vector w(n+1) is then defined by
  w(n+1) = arg min_w { (1/2) ||e'(n, w)||² }    (3.21)
Method of Steepest Descent (cont.)
• Using Eq. (3.19) to evaluate the squared Euclidean norm of e'(n, w), we get
  (1/2)||e'(n, w)||² = (1/2)||e(n)||² + e^T(n) J(n)(w - w(n)) + (1/2)(w - w(n))^T J^T(n) J(n)(w - w(n))
• Differentiating this expression with respect to w and setting the result equal to zero gives
  J^T(n) e(n) + J^T(n) J(n)(w - w(n)) = 0
• Solving for w yields
  w(n+1) = w(n) - (J^T(n) J(n))^{-1} J^T(n) e(n)    (3.22)
• The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n), but J^T(n) J(n) must be nonsingular.
Method of Steepest Descent (cont.)
• There is no guarantee that this condition will always hold, so:
  • Add the diagonal matrix dI to the matrix J^T(n) J(n).
  • The parameter d is a small positive constant.
• On this basis, the Gauss-Newton method is implemented in the slightly modified form
  w(n+1) = w(n) - (J^T(n) J(n) + d I)^{-1} J^T(n) e(n)    (3.23)
• The effect of this modification is progressively reduced as the number of iterations, n, is increased.
• Eq. (3.23) is the solution that minimizes the following modified cost function:
  E(w) = (1/2) [ \sum_{i=1}^{n} e²(i) + d ||w - w(0)||² ]    (3.24)
  where w(0) is the initial value of the weight vector w.
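A minimal sketch of one modified Gauss-Newton step, Eq. (3.23) (not from the original slides); it assumes the caller provides the current error vector e(n) and Jacobian J(n):

```python
import numpy as np

def gauss_newton_step(w, e, J, d=1e-3):
    """One modified Gauss-Newton step, Eq. (3.23): w <- w - (J^T J + d I)^{-1} J^T e."""
    A = J.T @ J + d * np.eye(J.shape[1])    # diagonal loading keeps the matrix nonsingular
    return w - np.linalg.solve(A, J.T @ e)
```

Setting d = 0 recovers the pure Gauss-Newton step of Eq. (3.22), which is valid only when J^T(n) J(n) is nonsingular.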
Linear Least-Squares Filter
• The linear least-squares filter has two distinctive characteristics:
  • The single neuron around which it is built is linear.
  • The cost function E(w) used to design the filter consists of the sum of error squares.
• Using Eqs. (3.3) and (3.4), the error vector can be expressed as
  e(n) = d(n) - [x(1), x(2), ..., x(n)]^T w(n) = d(n) - X(n) w(n)    (3.25)
  where d(n) is the n-by-1 desired response vector
  d(n) = [d(1), d(2), ..., d(n)]^T
  and X(n) is the n-by-m data matrix
  X(n) = [x(1), x(2), ..., x(n)]^T
Linear Least-Squares Filter (cont.)
Differentiating Eq. (3.25) with respect to w(n) yields the gradient matrix
∇e(n) = -X^T(n)
Correspondingly, the Jacobian of e(n) is
J(n) = -X(n)    (3.26)
Substituting Eqs. (3.25) and (3.26) into Eq. (3.22) gives
w(n+1) = w(n) + (X^T(n) X(n))^{-1} X^T(n) [d(n) - X(n) w(n)]
       = (X^T(n) X(n))^{-1} X^T(n) d(n)    (3.27)
Define the pseudoinverse of the data matrix X(n):
X^+(n) = (X^T(n) X(n))^{-1} X^T(n)    (3.28)
Hence Eq. (3.27) can be rewritten as
w(n+1) = X^+(n) d(n)    (3.29)
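A minimal NumPy sketch of the pseudoinverse solution of Eq. (3.29) (not from the original slides; the example data below are hypothetical):

```python
import numpy as np

def linear_least_squares(X, d):
    """Linear least-squares filter weights, Eq. (3.29): w = X^+ d."""
    return np.linalg.pinv(X) @ d    # X: n-by-m data matrix, d: n-by-1 desired response vector

# Hypothetical example: 200 samples of a 3-tap linear system observed in light noise
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
d = X @ np.array([0.5, -1.0, 2.0]) + 0.01 * rng.standard_normal(200)
w_hat = linear_least_squares(X, d)  # close to [0.5, -1.0, 2.0]
```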
Linear Least-Squares Filter (cont.)
• Wiener Filter:
  • The input vector x(i) and desired response d(i) are drawn from an ergodic environment.
  • We may then substitute long-term sample averages (time averages) for expectations (ensemble averages).
  • An ergodic environment can be described by its second-order statistics:
    the correlation matrix of the input vector x(i), R_x, and
    the cross-correlation vector between the input vector x(i) and the desired response d(i), r_xd.
R_x = E[x(i) x^T(i)] = lim_{n→∞} (1/n) \sum_{i=1}^{n} x(i) x^T(i) = lim_{n→∞} (1/n) X^T(n) X(n)    (3.30)
r_xd = E[x(i) d(i)] = lim_{n→∞} (1/n) \sum_{i=1}^{n} x(i) d(i) = lim_{n→∞} (1/n) X^T(n) d(n)    (3.31)
where E denotes the statistical expectation operator.
Linear Least-Squares Filter (cont.)
• Accordingly, we may reformulate the linear least-squares solution of Eq. (3.27) as
  w_0 = lim_{n→∞} w(n+1) = lim_{n→∞} (X^T(n) X(n))^{-1} X^T(n) d(n)
      = [lim_{n→∞} (1/n) X^T(n) X(n)]^{-1} [lim_{n→∞} (1/n) X^T(n) d(n)]
      = R_x^{-1} r_xd    (3.32)
The weight vector w0 is called the Wiener solution to the linear
optimum filtering problem.
• For an ergodic process, the linear least-squares filter asymptotically approaches the Wiener filter as the number of observations approaches infinity.
• However, the second-order statistics are not available in many important situations encountered in practice.
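A minimal sketch (not from the original slides) of the Wiener solution of Eq. (3.32), replacing the expectations of Eqs. (3.30)-(3.31) with sample averages over the rows of X:

```python
import numpy as np

def wiener_solution(X, d):
    """Wiener solution w0 = Rx^{-1} r_xd, Eqs. (3.30)-(3.32), using sample averages."""
    n = X.shape[0]
    Rx = (X.T @ X) / n          # sample estimate of the correlation matrix R_x
    rxd = (X.T @ d) / n         # sample estimate of the cross-correlation vector r_xd
    return np.linalg.solve(Rx, rxd)
```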
Least-Mean-Square Algorithm
• The LMS algorithm is based on the use of instantaneous values for the cost function
  E(w) = (1/2) e²(n)    (3.33)
  where e(n) is the error signal measured at time n.
• Differentiating E(w) with respect to the weight vector w yields
  ∂E(w)/∂w = e(n) ∂e(n)/∂w    (3.34)
22
Least-Mean-Square Algorithm (cont.)
• Since the error signal is
  e(n) = d(n) - x^T(n) w(n)    (3.35)
  we have
  ∂e(n)/∂w(n) = -x(n)
• Hence Eq. (3.34) can be rewritten as
  ∂E(w)/∂w(n) = -x(n) e(n)
• Using this as an estimate of the gradient vector, we get
  ĝ(n) = -x(n) e(n)    (3.36)
• Substituting this estimate into the steepest descent update of Eq. (3.12), the LMS algorithm can be written as
  ŵ(n+1) = ŵ(n) + η x(n) e(n)    (3.37)
Least-Mean-Square Algorithm (cont.)
• Summary of the LMS Algorithm
Training sample:
• Input signal vector: x(n)
• Desired response: d(n)
User-selected parameter: η
Initialization:
• Set ŵ(0) = 0
Computation:
• For n = 1, 2, ..., compute
  e(n) = d(n) - ŵ^T(n) x(n)
  ŵ(n+1) = ŵ(n) + η x(n) e(n)
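A minimal sketch of the LMS summary above (not from the original slides); it assumes the training samples are the rows of X with desired responses d, and makes a single pass over the data:

```python
import numpy as np

def lms(X, d, eta=0.01):
    """LMS algorithm: one pass over the samples in X (rows) and d."""
    n_samples, m = X.shape
    w = np.zeros(m)                 # initialization: w(0) = 0
    for n in range(n_samples):
        e = d[n] - w @ X[n]         # error signal e(n) = d(n) - w^T(n) x(n)
        w = w + eta * X[n] * e      # update w(n+1) = w(n) + eta * x(n) * e(n), Eq. (3.37)
    return w
```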
Least-Mean-Square Algorithm (cont.)
• Signal-flow graph representation of the LMS algorithm
Combining Eqs. (3.35) and (3.37), the evolution of the weight vector in the LMS algorithm can be expressed as
ŵ(n+1) = ŵ(n) + η x(n) [d(n) - x^T(n) ŵ(n)]
       = [I - η x(n) x^T(n)] ŵ(n) + η x(n) d(n)    (3.38)
where I is the identity matrix.
Accordingly, we may write ŵ(n) = z^{-1}[ŵ(n+1)], where z^{-1} is the unit-time delay operator.
[Figure 3.3: signal-flow graph of the LMS algorithm - a feedback loop in which η x(n) d(n) is summed with -η x(n) x^T(n) ŵ(n), and ŵ(n+1) is fed back through the unit-delay branch z^{-1}I to produce ŵ(n)]
Convergence Considerations of the LMS Algorithm
From control theory, we know that the stability of a feedback system is determined by the parameters of its feedback loop.
• From Fig. 3.3, the feedback loop of the LMS algorithm involves two parameters: the learning rate η and the input vector x(n).
Convergence criterion of the LMS algorithm:
• Convergence in the mean square:
  E[e²(n)] → constant as n → ∞    (3.41)
• Assumptions:
  • The successive input vectors x(1), x(2), ... are statistically independent of one another.
  • At time n, the input vector x(n) is statistically independent of all previous samples of the desired response, d(1), d(2), ..., d(n-1).
  • At time n, the desired response d(n) depends on x(n).
  • x(n) and d(n) are drawn from Gaussian distributions.
Convergence Considerations of the LMS Algorithm
By invoking the elements of independence theory and assuming that the learning-rate parameter η is sufficiently small, the LMS algorithm is convergent in the mean square provided that η satisfies the condition
0 < η < 2 / λ_max    (3.42)
where λ_max is the largest eigenvalue of the correlation matrix R_x.
However, in practical applications of the LMS algorithm, knowledge of λ_max is typically not available. To overcome this difficulty, the trace of R_x may be used as a conservative estimate of λ_max, and Eq. (3.42) becomes
0 < η < 2 / tr[R_x]    (3.43)
Convergence Considerations of the LMS Algorithm
By definition, the trace of a square matrix is
equal to the sum of its diagonal elements.
Each diagonal element of the correlation matrix Rx
equals the mean-square value of the corresponding
sensor input
Hence Eq. (3.43) can be restated as
0 < η < 2 / (sum of the mean-square values of the sensor inputs)    (3.44)
Provided the learning rate satisfies this condition, the LMS algorithm is guaranteed to converge in the mean square (which implies convergence in the mean).
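A small sketch (not from the original slides) that estimates the conservative bound of Eqs. (3.43)-(3.44) from a matrix of input samples:

```python
import numpy as np

def lms_eta_bound(X):
    """Conservative learning-rate bound 0 < eta < 2 / tr[Rx], Eqs. (3.43)-(3.44)."""
    Rx = (X.T @ X) / X.shape[0]     # sample correlation matrix of the inputs (rows of X)
    return 2.0 / np.trace(Rx)       # tr[Rx] = sum of mean-square values of the sensor inputs
```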
Virtues and Limitations of the LMS algorithm
• Virtues of the LMS algorithm
  • Simplicity
  • Robustness: small model uncertainty and small disturbances can only result in small estimation errors.
• Limitations of the LMS algorithm
  • Slow rate of convergence: typically requires a number of iterations equal to about 10 times the dimensionality of the input space to reach a steady-state condition.
  • Sensitivity to variations in the eigenstructure of the input: the LMS algorithm is sensitive to variations in the condition number (eigenvalue spread) defined by
    χ(R_x) = λ_max / λ_min    (3.45)
    When the condition number χ(R_x) is high, the sensitivity of the LMS algorithm becomes acute.
Learning Curves
• Learning curve
  • A plot of the mean-square value of the estimation error, E_av(n), versus the number of iterations, n.
• Rate of convergence
  • Defined as the number of iterations, n, required for E_av(n) to decrease to some arbitrarily chosen value, such as 10 percent of the initial value E_av(0).
• Misadjustment
  • A measure of how close the adaptive filter is to optimality in the mean-square-error sense.
Learning Curves (cont.)
• Misadjustment is defined as
  M = (E(∞) - E_min) / E_min = E(∞)/E_min - 1    (3.46)
  where E_min denotes the minimum mean-square error produced by the Wiener filter, designed on the basis of known values of the correlation matrix R_x and the cross-correlation vector r_xd.
• The misadjustment M of the LMS algorithm is directly proportional to the learning-rate parameter η.
• The average time constant τ_av is inversely proportional to the learning-rate parameter η.
• If the learning-rate parameter is reduced so as to reduce the misadjustment, then the settling time of the LMS algorithm is increased.
• Careful attention must therefore be given to the choice of the learning-rate parameter η in the design of the LMS algorithm in order to produce a satisfactory overall performance.
Learning-rate Annealing Schedules
During computation, the LMS algorithm can set the learning rate in several ways:
• Constant:
  η(n) = η_0 for all n
• Time-varying learning rate (Robbins, 1951):
  η(n) = c / n
  where c is a constant. When c is large, there is a danger of parameter blowup for small n.
• Search-then-converge schedule (Darken and Moody, 1992):
  η(n) = η_0 / (1 + n/τ)
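A minimal sketch of the three schedules listed above (not from the original slides; the default values of η_0, c, and τ are arbitrary placeholders):

```python
def eta_constant(n, eta0=0.01):
    """Constant schedule: eta(n) = eta0 for all n."""
    return eta0

def eta_robbins(n, c=1.0):
    """Time-varying schedule (Robbins): eta(n) = c / n; large c risks blowup for small n."""
    return c / n

def eta_search_then_converge(n, eta0=0.01, tau=100.0):
    """Search-then-converge schedule (Darken and Moody): eta(n) = eta0 / (1 + n / tau)."""
    return eta0 / (1.0 + n / tau)
```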
Learning-rate Annealing Schedules
[Figure: learning-rate annealing schedules]
Perceptron
McCulloch-Pitts model
The perceptron consists of a linear combiner
followed by a hard limiter (signum function).
The summing node of the neuronal model
computes a linear combination of the inputs
applied to its synapses, and also incorporates
an externally applied bias.
The resulting sum is applied to a hard limiter.
The neuron produces an output equal to +1 if
the hard limiter input is positive, and -1 if it is
negative.
Perceptron (cont.)
The synaptic weights of the perceptron are denoted by w1,
w2,…,wm.
The inputs applied to the perceptron are denoted by x1,
x2,…,xm.
The bias is denoted by b.
The induced local field of the neuron is
v = \sum_{i=1}^{m} w_i x_i + b    (3.50)
[Figure: signal-flow graph of the perceptron - inputs x_1, x_2, ..., x_m with weights w_1, w_2, ..., w_m, bias b, hard limiter φ(v), and output y]
Perceptron (cont.)
The goal of the perceptron is to correctly classify the externally applied stimuli (x_1, x_2, ..., x_m) into one of two classes, C1 or C2.
• The decision rule for the classification is to assign the point represented by the inputs (x_1, x_2, ..., x_m) to class C1 if the perceptron output y is +1 and to class C2 if it is -1.
• In its simplest form, the perceptron produces two decision regions separated by a hyperplane defined by
  \sum_{i=1}^{m} w_i x_i + b = 0    (3.51)
• The synaptic weights (w_1, w_2, ..., w_m) of the perceptron can be adapted on an iteration-by-iteration basis.
[Figure: decision boundary (hyperplane) in a two-dimensional (x_1, x_2) signal space, separating class C1 from class C2]
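A minimal sketch of the decision rule implied by Eqs. (3.50)-(3.51) (not from the original slides):

```python
import numpy as np

def perceptron_classify(x, w, b):
    """Assign +1 (class C1) if v = w^T x + b > 0, otherwise -1 (class C2)."""
    v = np.dot(w, x) + b    # induced local field, Eq. (3.50)
    return 1 if v > 0 else -1
```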
Perceptron Convergence Theorem
• Following Fig. 3.8 (where the bias of Fig. 3.6 is treated as a fixed input), the (m+1)-by-1 input vector and weight vector can be written as
  x(n) = [+1, x_1(n), x_2(n), ..., x_m(n)]^T
  w(n) = [b(n), w_1(n), w_2(n), ..., w_m(n)]^T
• The induced local field of the neuron is then defined as
  v(n) = \sum_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)    (3.52)
  where w_0(n) is the bias b(n).
[Figure: equivalent signal-flow graph of the perceptron with the fixed input x_0 = +1 and weight w_0 = b, followed by the hard limiter φ(v) producing the output y]
For w^T x = 0, the coordinates (x_1, x_2, ..., x_m) trace out a hyperplane that separates the inputs into two classes.
Perceptron Convergence Theorem (cont.)
The patterns to be classified must be sufficiently separated from each other to ensure that a separating hyperplane (decision boundary) exists.
[Figure: decision boundary between class C1 and class C2 - the classes must be linearly separable for such a boundary to exist]
Perceptron Convergence Theorem (cont.)
• Suppose the input variables of the perceptron come from two linearly separable classes: the subset X1 = {x1(1), x1(2), ...} and the subset X2 = {x2(1), x2(2), ...}, whose union forms the complete training set X.
• Training the classifier on X1 and X2 adjusts the weight vector w so that the two classes C1 and C2 are linearly separated; that is, there exists a weight vector w such that
  w^T x > 0 for every input vector x belonging to class C1
  w^T x ≤ 0 for every input vector x belonging to class C2    (3.53)
Perceptron Convergence Theorem (cont.)
The algorithm for adapting the weight vector of the elementary perceptron is formulated as follows:
• If the nth member of the training set, x(n), is correctly classified by the weight vector w(n), no correction is made to the weight vector of the perceptron:
  w(n+1) = w(n)  if w^T(n) x(n) > 0 and x(n) belongs to class C1
  w(n+1) = w(n)  if w^T(n) x(n) ≤ 0 and x(n) belongs to class C2    (3.54)
• Otherwise, the weight vector of the perceptron is updated in accordance with the rule
  w(n+1) = w(n) - η(n) x(n)  if w^T(n) x(n) > 0 and x(n) belongs to class C2  (x(n) belongs to C2 but was misclassified as C1)
  w(n+1) = w(n) + η(n) x(n)  if w^T(n) x(n) ≤ 0 and x(n) belongs to class C1  (x(n) belongs to C1 but was misclassified as C2)    (3.55)
Perceptron Convergence Theorem (cont.)
• We now prove the convergence of the fixed-increment adaptation rule for η = 1.
Suppose the initial condition is w(0) = 0, that w^T(n) x(n) < 0 for n = 1, 2, ..., and that the input vector x(n) belongs to the subset X1 (that is, the perceptron incorrectly classifies x(1), x(2), ... into the second class). With η(n) = 1, the second line of Eq. (3.55) gives
w(n+1) = w(n) + x(n)  for x(n) belonging to class C1    (3.56)
Given the initial condition w(0) = 0, w(n+1) is obtained by successively accumulating x(n):
w(n+1) = x(1) + x(2) + ... + x(n)    (3.57)
Perceptron Convergence Theorem (cont.)
Since classes C1 and C2 are assumed to be linearly separable, there exists a solution w_0 such that w_0^T x(n) > 0 for all input vectors x(1), ..., x(n) belonging to the subset X1. We may therefore define a positive number
α = min_{x(n) ∈ C1} w_0^T x(n)    (3.58)
Multiplying both sides of Eq. (3.57) by the row vector w_0^T gives (a sum of n terms)
w_0^T w(n+1) = w_0^T x(1) + w_0^T x(2) + ... + w_0^T x(n)
Hence, by the definition in Eq. (3.58),
w_0^T w(n+1) ≥ n α    (3.59)
Perceptron Convergence Theorem (cont.)
By the Cauchy-Schwarz inequality,
||w_0||² ||w(n+1)||² ≥ [w_0^T w(n+1)]²    (3.60)
Substituting Eq. (3.59) into Eq. (3.60) gives
||w_0||² ||w(n+1)||² ≥ n² α²
or, equivalently,
||w(n+1)||² ≥ n² α² / ||w_0||²    (3.61)
Perceptron Convergence Theorem (cont.)
Next, rewrite Eq. (3.56) with k in place of n:
w(k+1) = w(k) + x(k)  for k = 1, ..., n and x(k) ∈ X1    (3.62)
Taking the squared Euclidean norm of both sides of Eq. (3.62) and expanding, we get
||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w^T(k) x(k)    (3.63)
Since the perceptron was assumed to misclassify the vector x(k) belonging to C1, we have w^T(k) x(k) < 0, so Eq. (3.63) implies
||w(k+1)||² ≤ ||w(k)||² + ||x(k)||²
or, rearranging terms,
||w(k+1)||² - ||w(k)||² ≤ ||x(k)||²,   k = 1, ..., n    (3.64)
Perceptron Convergence Theorem (cont.)
Applying the initial condition w(0) = 0 and summing these inequalities over k = 1, ..., n, we get
||w(n+1)||² ≤ \sum_{k=1}^{n} ||x(k)||² ≤ n β    (3.65)
where
β = max_{x(k) ∈ X1} ||x(k)||²    (3.66)
Eqs. (3.65) and (3.61) cannot both hold once n exceeds a certain value n_max, where n_max satisfies both with equality:
n_max² α² / ||w_0||² = n_max β
Solving for n_max gives
n_max = β ||w_0||² / α²    (3.67)
Hence the perceptron must stop adjusting its synaptic weights after at most n_max iterations.
Perceptron Convergence Theorem (cont.)
Thus, for η(n) = 1 for all n and w(0) = 0, the perceptron adjusts its synaptic weights for at most n_max iterations.
Fixed-increment convergence theorem of the perceptron:
• Let the subsets of training vectors X1 and X2 be linearly separable.
• Let the inputs presented to the perceptron originate from these two subsets.
• The perceptron converges after some n_0 iterations, in the sense that
  w(n_0) = w(n_0 + 1) = w(n_0 + 2) = ...
  is a solution vector for n_0 ≤ n_max.
Perceptron Convergence Theorem (cont.)
Absolute error-correction procedure for adaptation of a
single-layer perceptron
 n x n xn   w n xn 
T
T
Each pattern is presented repeatedly to the perceptron
until that pattern is classified correctly.
The use of an initial value w(0) merely results in a
decrease or increase in the number of iterations required
to converge, depending on how w(0) relates to the
solution w0.
Perceptron Convergence Theorem (cont.)
• Summary of the Perceptron Convergence Algorithm
  • Initialization. Set w(0) = 0. Then perform the following computations for time steps n = 1, 2, ...
  • Activation. Activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
  • Computation of Actual Response. Compute the actual response of the perceptron:
    y(n) = sgn[w^T(n) x(n)]
  • Adaptation of Weight Vector. Update the weight vector of the perceptron:
    w(n+1) = w(n) + η [d(n) - y(n)] x(n)
    where
    d(n) = +1 if x(n) belongs to class C1
    d(n) = -1 if x(n) belongs to class C2
  • Continuation. Increment time step n by one and go back to step 2.
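A minimal sketch of the error-correction learning rule summarized above (not from the original slides); it assumes the training patterns are the rows of X with labels d in {+1, -1}, and that the bias is absorbed as a fixed +1 input:

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, n_epochs=100):
    """Perceptron error-correction rule: w <- w + eta * [d(n) - y(n)] * x(n)."""
    n_samples, m = X.shape
    Xb = np.hstack([np.ones((n_samples, 1)), X])   # prepend fixed input +1 so that w[0] acts as the bias b
    w = np.zeros(m + 1)                            # initialization: w(0) = 0
    for _ in range(n_epochs):
        for n in range(n_samples):
            y = 1.0 if w @ Xb[n] > 0 else -1.0     # y(n) = sgn[w^T(n) x(n)]
            w = w + eta * (d[n] - y) * Xb[n]       # no change when x(n) is already classified correctly
    return w
```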
Relation between the perceptron and Bayes Classifier for a
Gaussian Environment
• Bayes Classifier
  • The goal is to minimize the average risk R.
  • For a two-class problem, the average risk is defined as
R = c11 p1 ∫_{X1} f_x(x|C1) dx + c22 p2 ∫_{X2} f_x(x|C2) dx
  + c21 p1 ∫_{X2} f_x(x|C1) dx + c12 p2 ∫_{X1} f_x(x|C2) dx    (3.72)
(the first two terms correspond to correct decisions, the last two to incorrect decisions)
• p_i: a priori probability that the observation vector x is drawn from subspace X_i.
• c_ij: cost of deciding in favor of class C_i (represented by subspace X_i) when class C_j is true.
• f_x(x|C_i): conditional probability density function of the random vector X, given that it is drawn from class C_i.
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
Since every observation vector x must be assigned to either C1 or C2, the total observation space is
X = X1 ∪ X2    (3.73)
Accordingly, Eq. (3.72) can be rewritten as
R = c11 p1 ∫_{X1} f_x(x|C1) dx + c22 p2 ∫_{X-X1} f_x(x|C2) dx
  + c21 p1 ∫_{X-X1} f_x(x|C1) dx + c12 p2 ∫_{X1} f_x(x|C2) dx    (3.74)
where c11 < c21 and c22 < c12. We also observe the fact that
∫_X f_x(x|C1) dx = ∫_X f_x(x|C2) dx = 1    (3.75)
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
Hence, expanding Eq. (3.74) and simplifying, we obtain
R = c21 p1 + c22 p2 + ∫_{X1} [ p2 (c12 - c22) f_x(x|C2) - p1 (c21 - c11) f_x(x|C1) ] dx    (3.76)
The first two terms of Eq. (3.76) represent a fixed cost.
Since we want to minimize the average risk R, Eq. (3.76) leads to the following strategy:
• If the integrand for an observation vector x is negative, x should be assigned to X1 (class 1, C1).
• If the integrand for an observation vector x is positive, x should be assigned to X2 (class 2, C2).
• If the integrand is zero, x has no effect on the average risk R and may be assigned to either class; here it is assigned to X2 (class 2).
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
• Based on the foregoing, the Bayes classifier can be defined as follows:
If the condition
p1 (c21 - c11) f_x(x|C1) > p2 (c12 - c22) f_x(x|C2)
holds, assign the observation vector x to subspace X1 (class C1). Otherwise, assign x to X2 (class C2).
For convenience of presentation, rearrange this condition and define the likelihood ratio
Λ(x) = f_x(x|C1) / f_x(x|C2)    (3.77)
and the threshold
ξ = p2 (c12 - c22) / [p1 (c21 - c11)]    (3.78)
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
The Bayes classifier can then be restated as follows:
If, for an observation vector x, the likelihood ratio Λ(x) is greater than the threshold ξ, assign x to class C1. Otherwise, assign it to class C2.
[Figure: two equivalent implementations of the Bayes classifier - (a) a likelihood-ratio computer producing Λ(x), followed by a comparator against ξ; (b) a log-likelihood-ratio computer producing log Λ(x), followed by a comparator against log ξ]
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
• Bayes Classifier for a Gaussian Distribution
Assume
Class C1: E[X] = μ1,  E[(X - μ1)(X - μ1)^T] = C
Class C2: E[X] = μ2,  E[(X - μ2)(X - μ2)^T] = C
(Because the variables are correlated, the covariance matrix C is nondiagonal; C is assumed to be nonsingular, so its inverse C^{-1} exists.)
The conditional probability density function of X is
f_x(x|C_i) = 1 / [(2π)^{m/2} (det C)^{1/2}] exp[ -(1/2)(x - μ_i)^T C^{-1} (x - μ_i) ],  i = 1, 2    (3.79)
• Assume the two classes are equiprobable:
  p1 = p2 = 1/2    (3.80)
• Assume the costs of misclassification are equal and the costs of correct classification are zero:
  c21 = c12 and c11 = c22 = 0    (3.81)
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
Substituting Eq. (3.79) into Eq. (3.77) and taking the logarithm, we get
log Λ(x) = -(1/2)(x - μ1)^T C^{-1}(x - μ1) + (1/2)(x - μ2)^T C^{-1}(x - μ2)
         = (μ1 - μ2)^T C^{-1} x + (1/2)(μ2^T C^{-1} μ2 - μ1^T C^{-1} μ1)    (3.82)
Substituting Eqs. (3.80) and (3.81) into Eq. (3.78) and taking the logarithm gives
log ξ = 0    (3.83)     (threshold ξ = 1)
The Bayes classifier expressed by Eqs. (3.82) and (3.83) can therefore be described as the linear classifier
y = w^T x + b    (3.84)
where
y = log Λ(x)    (3.85)
w = C^{-1} (μ1 - μ2)    (3.86)
b = (1/2)(μ2^T C^{-1} μ2 - μ1^T C^{-1} μ1)    (3.87)
Comparing Eq. (3.51) with Eq. (3.84) shows that the Bayes classifier is analogous to the linear classifier realized by the perceptron.
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
The classifier consists of a linear combiner with
weight vector w and bias b
Based on Eq. (3.84), the log-likelihood test for the two-class problem can be stated as follows:
If the output y of the linear combiner is positive, assign the observation vector x to class C1; otherwise, assign it to class C2.
[Figure: linear combiner with inputs x_1, x_2, ..., x_m, weights w_1, w_2, ..., w_m, bias b, and output y]
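A minimal sketch (not from the original slides) that computes the weight vector and bias of Eqs. (3.86)-(3.87) for two equal-covariance Gaussian classes:

```python
import numpy as np

def bayes_gaussian_linear_classifier(mu1, mu2, C):
    """Weight vector and bias of the Bayes classifier for equal-covariance Gaussians."""
    C_inv = np.linalg.inv(C)
    w = C_inv @ (mu1 - mu2)                              # Eq. (3.86)
    b = 0.5 * (mu2 @ C_inv @ mu2 - mu1 @ C_inv @ mu1)    # Eq. (3.87)
    return w, b

# Decision rule, Eq. (3.84): assign x to class C1 if y = w^T x + b > 0, otherwise to class C2.
```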
Relation between the perceptron and Bayes
Classifier for a Gaussian Environment (cont.)
• Perceptron vs. Bayes classifier for a Gaussian environment
  • The perceptron operates on the premise that the patterns to be classified are linearly separable. The Gaussian distributions of the two classes assumed in the derivation of the Bayes classifier certainly do overlap each other and are therefore not separable.
  • The Bayes classifier minimizes the probability of classification error. It always positions the decision boundary at the point where the Gaussian distributions for the two classes C1 and C2 cross each other.
  • Nonparametric vs. parametric: the perceptron convergence algorithm is both adaptive and simple to implement, whereas the Bayes classifier is more complex to compute and requires more memory.
Two overlapping, one-dimensional
Gaussian distributions
Ergodic process (From Wikipedia)
• In probability theory, a stationary ergodic process is a stochastic process which exhibits both stationarity and ergodicity. In essence this implies that the random process will not change its statistical properties with time.
• Stationarity is the property of a random process which guarantees that its statistical properties, such as the mean value, its moments and variance, will not change over time. A stationary process is one whose probability distribution is the same at all times.
• Several sub-types of stationarity are defined: first-order, second-order, nth-order, wide-sense and strict-sense.
• An ergodic process is one which conforms to the ergodic theorem. The theorem allows the time average of a conforming process to equal the ensemble average. In practice this means that statistical sampling can be performed at one instant across a group of identical processes or sampled over time on a single process with no change in the measured result.
Taylor series, (From Wikipedia)
• Taylor series
  • A Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point. It may be regarded as the limit of the Taylor polynomials.
  • The Taylor series of a real or complex function f that is infinitely differentiable in a neighborhood of a real or complex number a is the power series
    f(a) + (f'(a)/1!)(x - a) + (f''(a)/2!)(x - a)² + (f'''(a)/3!)(x - a)³ + ...
    which in a more compact form can be written as
    f(x) = \sum_{n=0}^{∞} (f^{(n)}(a)/n!) (x - a)^n