Chapter 3 Single-Layer Perceptrons
Instructor: Chuan-Yu Chang, Ph.D. (張傳育 博士)
E-mail: chuanyu@yuntech.edu.tw  Tel: (05)5342601 ext. 4337  Office: ES709
Medical Image Processing Lab. (醫學影像處理實驗室), Graduate School of Computer Science & Information Engineering (資訊工程所)

Adaptive Filter Problem

Consider a dynamic system whose mathematical characterization is unknown. All that we know about the system is a set of labeled input-output data generated by the system: when an m-dimensional input x(i) is applied to the system, the system responds with the corresponding output d(i). The external behavior of the system is therefore described by the data set

  T: {x(i), d(i); i = 1, 2, ..., n, ...}    (3.1)

where

  x(i) = [x_1(i), x_2(i), ..., x_m(i)]^T

Adaptive Filter Problem (cont.)

The problem is how to design a multiple-input, single-output model of the unknown system. The neural model operates under the influence of an algorithm that controls necessary adjustments to the synaptic weights of the neuron:
- The algorithm starts from an arbitrary setting of the neuron's synaptic weights.
- Adjustments to the synaptic weights, in response to statistical variations in the system's behavior, are made on a continuous basis.
- Computations of adjustments to the synaptic weights are completed inside a time interval that is one sampling period long.

The adaptive model consists of two continuous processes:
- Filtering process, which involves the computation of two signals: an output y(i) and an error signal e(i).
- Adaptive process: automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i).

Adaptive Filter Problem (cont.)

The output y(i) is the same as the induced local field v(i):

  y(i) = v(i) = \sum_{k=1}^{m} w_k(i) x_k(i)    (3.2)

Eq. (3.2) can be written in inner-product form as

  y(i) = x^T(i) w(i)    (3.3)

where w(i) = [w_1(i), w_2(i), ..., w_m(i)]^T. The neuron's output y(i) is compared with the corresponding desired output d(i), yielding the error signal

  e(i) = d(i) - y(i)    (3.4)
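A minimal sketch of one filtering step of Eqs. (3.3)-(3.4), assuming NumPy; the function name `filter_step` and all sample numbers are illustrative, not from the text.

```python
import numpy as np

def filter_step(w, x, d):
    """Compute the output y(i) = x(i)^T w(i) (Eq. 3.3) and the
    error signal e(i) = d(i) - y(i) (Eq. 3.4)."""
    y = x @ w
    return y, d - y

w = np.array([0.5, -0.2, 0.1])   # hypothetical synaptic weights
x = np.array([1.0, 2.0, 3.0])    # hypothetical input vector x(i)
y, e = filter_step(w, x, d=1.0)  # y == 0.4, e == 0.6 (up to rounding)
```

The adaptive process (discussed next) would then feed e back into an update of w.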
Unconstrained Optimization Techniques

Let E(w) be a cost function that is continuously differentiable with respect to the weight vector w. The goal of the adaptive filtering algorithm is to choose a weight vector w with the smallest possible cost. The optimal weight vector w* must satisfy

  E(w*) \le E(w)    (3.5)

that is, minimize the cost function E(w) with respect to the weight vector w. The necessary condition for optimality is

  \nabla E(w*) = 0    (3.7)

where the gradient operator is

  \nabla = [\partial/\partial w_1, \partial/\partial w_2, ..., \partial/\partial w_m]^T    (3.8)

and the gradient vector of the cost function is

  \nabla E(w) = [\partial E/\partial w_1, \partial E/\partial w_2, ..., \partial E/\partial w_m]^T    (3.9)

Unconstrained Optimization Techniques (cont.)

Local iterative descent: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function E(w) is reduced at each iteration of the algorithm:

  E(w(n+1)) < E(w(n))    (3.10)

where w(n) is the old value of the weight vector and w(n+1) is its updated value. We hope that the algorithm will eventually converge onto the optimal solution w*.

Method of Steepest Descent

The successive adjustments applied to the weight vector w are in the direction of steepest descent, that is, in a direction opposite to the gradient vector

  g = \nabla E(w)    (3.11)

The steepest-descent algorithm is formally described by

  w(n+1) = w(n) - \eta g(n)    (3.12)

where \eta is the step-size, or learning-rate, parameter. The correction applied by the algorithm is

  \Delta w(n) = w(n+1) - w(n) = -\eta g(n)    (3.13)

Method of Steepest Descent (cont.)

To show that the steepest-descent algorithm satisfies the condition of Eq. (3.10), we approximate E(w(n+1)) by a first-order Taylor series expansion around w(n):

  E(w(n+1)) \approx E(w(n)) + g^T(n) \Delta w(n)

Substituting Eq. (3.13) into this expansion gives

  E(w(n+1)) \approx E(w(n)) - \eta g^T(n) g(n) = E(w(n)) - \eta \|g(n)\|^2

so for a positive learning rate \eta the cost function decreases from one iteration to the next, provided \eta is small enough for the first-order approximation to be valid.
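The update of Eq. (3.12) can be sketched on a simple quadratic bowl E(w) = (1/2) w^T A w, for which the gradient is g = A w; the matrix A, the starting point, and the step size below are illustrative assumptions.

```python
import numpy as np

# Illustrative positive definite matrix defining the bowl E(w) = 0.5 w^T A w
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])

def steepest_descent(w0, eta, n_iters):
    w = w0.copy()
    for _ in range(n_iters):
        g = A @ w           # gradient vector g = A w (Eq. 3.11)
        w = w - eta * g     # Eq. (3.12): step opposite the gradient
    return w

w = steepest_descent(np.array([4.0, -3.0]), eta=0.1, n_iters=200)
# w approaches the minimizer w* = 0
```

With eta = 0.1 each weight component is multiplied by (1 - eta * lambda_i) per step, so the trajectory decays smoothly, illustrating the overdamped regime discussed next.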
Method of Steepest Descent (cont.)

The method of steepest descent converges to the optimal solution w* slowly. The learning-rate parameter \eta has a serious influence on its convergence behavior:
- When \eta is small, the transient response of the algorithm is overdamped: the trajectory traced by w(n) follows a smooth path in the w-plane.
- When \eta is large, the transient response of the algorithm is underdamped: the trajectory of w(n) follows a zigzagging (oscillatory) path.
- When \eta exceeds a certain critical value, the algorithm becomes unstable.

[Figure: gradient descent on a bowl-shaped function F.] Here F is assumed to be defined on the plane, and its graph has a bowl shape. The blue curves are the contour lines, that is, the regions on which the value of F is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is perpendicular to the contour line going through that point. We see that gradient descent leads us to the bottom of the bowl, that is, to the point where the value of the function F is minimal.

Method of Steepest Descent (cont.)

Newton's method: minimize the quadratic approximation of the cost function E(w) around the current point w(n); this minimization is performed at each iteration of the algorithm. Using a second-order Taylor series expansion of the cost function around the point w(n),

  \Delta E(w(n)) = E(w(n+1)) - E(w(n)) \approx g^T(n) \Delta w(n) + (1/2) \Delta w^T(n) H(n) \Delta w(n)    (3.14)

Here g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n), and the matrix H(n) is the m-by-m Hessian matrix of E(w).
Method of Steepest Descent (cont.)

The Hessian of E(w) is defined as the m-by-m matrix of second partial derivatives,

  H = \nabla^2 E(w) = [ \partial^2 E / \partial w_i \partial w_j ]_{i,j=1}^{m}    (3.15)

From Eq. (3.15) we see that the cost function E(w) must be twice differentiable with respect to w. Differentiating Eq. (3.14) with respect to \Delta w, the change \Delta E(w) is minimized when

  g(n) + H(n) \Delta w(n) = 0

which can be solved to give

  \Delta w(n) = -H^{-1}(n) g(n)

that is,

  w(n+1) = w(n) + \Delta w(n) = w(n) - H^{-1}(n) g(n)    (3.16)

Newton's method requires the Hessian H(n) to be a positive definite matrix for all n; there is no guarantee that H(n) is positive definite at every iteration of the algorithm.

Method of Steepest Descent (cont.)

Gauss-Newton method: let the cost function be the sum of error squares,

  E(w) = (1/2) \sum_{i=1}^{n} e^2(i)    (3.17)

The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w as

  e'(i, w) = e(i) + [\partial e(i)/\partial w]^T_{w=w(n)} (w - w(n)),  i = 1, 2, ..., n    (3.18)

In matrix form,

  e'(n, w) = e(n) + J(n) (w - w(n))    (3.19)

where the error vector is

  e(n) = [e(1), e(2), ..., e(n)]^T

Method of Steepest Descent (cont.)

J(n) is the n-by-m Jacobian matrix of e(n), whose (i, j) entry is \partial e(i)/\partial w_j, evaluated at w = w(n):

  J(n) = [ \partial e(i)/\partial w_j ]_{w=w(n)},  i = 1, ..., n,  j = 1, ..., m    (3.20)

The Jacobian J(n) is the transpose of the m-by-n gradient matrix \nabla e(n), where \nabla e(n) = [\nabla e(1), \nabla e(2), ..., \nabla e(n)]. The updated weight vector w(n+1) is then defined by

  w(n+1) = \arg\min_w { (1/2) \|e'(n, w)\|^2 }    (3.21)

Method of Steepest Descent (cont.)

Using Eq. (3.19) to evaluate the squared Euclidean norm of e'(n, w), we get

  (1/2) \|e'(n, w)\|^2 = (1/2) \|e(n)\|^2 + e^T(n) J(n) (w - w(n)) + (1/2) (w - w(n))^T J^T(n) J(n) (w - w(n))

Differentiating this expression with respect to w and setting the result to zero yields

  J^T(n) e(n) + J^T(n) J(n) (w - w(n)) = 0

which can be solved to give

  w(n+1) = w(n) - (J^T(n) J(n))^{-1} J^T(n) e(n)    (3.22)

The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n), but J^T(n) J(n) must be nonsingular.
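The iteration of Eq. (3.22) can be sketched on a small nonlinear least-squares problem; the model y = a exp(b t), the synthetic data, and the starting point below are illustrative assumptions, not from the text.

```python
import numpy as np

# Illustrative data generated from the model y = a * exp(b * t) with (a, b) = (2, -1)
t = np.linspace(0.0, 2.0, 20)
d = 2.0 * np.exp(-1.0 * t)

def gauss_newton(w, n_iters=50):
    for _ in range(n_iters):
        a, b = w
        r = np.exp(b * t)
        e = d - a * r                          # error vector e(n)
        J = np.column_stack([-r, -a * t * r])  # Jacobian of e w.r.t. (a, b), Eq. (3.20)
        # w(n+1) = w(n) - (J^T J)^{-1} J^T e, Eq. (3.22), via a linear solve
        w = w - np.linalg.solve(J.T @ J, J.T @ e)
    return w

w = gauss_newton(np.array([1.0, -0.5]))  # converges toward (a, b) = (2, -1)
```

Because the residual is zero at the solution, the iteration converges rapidly from this starting point; solving the normal equations is used here instead of forming the inverse explicitly.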
Method of Steepest Descent (cont.)

There is no guarantee that the nonsingularity condition on J^T(n) J(n) will always hold. To guard against this possibility, add the diagonal matrix \delta I to the matrix J^T(n) J(n), where the parameter \delta is a small positive constant. On this basis, the Gauss-Newton method is implemented in the slightly modified form

  w(n+1) = w(n) - (J^T(n) J(n) + \delta I)^{-1} J^T(n) e(n)    (3.23)

The effect of this modification is progressively reduced as the number of iterations, n, is increased. Eq. (3.23) is the solution of the modified cost function

  E(w) = (1/2) { \delta \|w - w(0)\|^2 + \sum_{i=1}^{n} e^2(i) }    (3.24)

where w(0) is the initial value of the weight vector.

Linear Least-Squares Filter

The linear least-squares filter has two distinctive characteristics:
- The single neuron around which it is built is linear.
- The cost function E(w) used to design the filter consists of the sum of error squares.

Using Eqs. (3.3) and (3.4), the error vector can therefore be expressed as

  e(n) = d(n) - [x(1), x(2), ..., x(n)]^T w(n) = d(n) - X(n) w(n)    (3.25)

where d(n) is the n-by-1 desired response vector

  d(n) = [d(1), d(2), ..., d(n)]^T

and X(n) is the n-by-m data matrix

  X(n) = [x(1), x(2), ..., x(n)]^T

Linear Least-Squares Filter (cont.)

Differentiating Eq. (3.25) with respect to w(n) gives the gradient matrix

  \nabla e(n) = -X^T(n)

so the Jacobian of e(n) is

  J(n) = -X(n)    (3.26)

Substituting Eqs. (3.25) and (3.26) into (3.22) gives

  w(n+1) = w(n) + (X^T(n) X(n))^{-1} X^T(n) [d(n) - X(n) w(n)]
         = (X^T(n) X(n))^{-1} X^T(n) d(n)    (3.27)

Defining the pseudoinverse of the data matrix X(n) as

  X^+(n) = (X^T(n) X(n))^{-1} X^T(n)    (3.28)

Eq. (3.27) can be rewritten compactly as

  w(n+1) = X^+(n) d(n)    (3.29)

Linear Least-Squares Filter (cont.)

Wiener filter: suppose the input vector x(i) and desired response d(i) are drawn from an ergodic environment. We may then substitute long-term sample averages (time averages) for expectations (ensemble averages). An ergodic environment can be described by its second-order statistics:
- the correlation matrix of the input vector x(i), denoted by R_x;
- the cross-correlation vector between the input vector x(i) and the desired response d(i), denoted by r_xd.
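The batch least-squares solution of Eqs. (3.27)-(3.29) can be sketched as follows, assuming NumPy; the data matrix and desired responses are an illustrative example.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])      # n-by-m data matrix X(n)
d = np.array([1.0, 2.0, 3.0])   # n-by-1 desired response vector d(n)

# pseudoinverse X^+ = (X^T X)^{-1} X^T, Eq. (3.28)
X_pinv = np.linalg.inv(X.T @ X) @ X.T
w = X_pinv @ d                  # w(n+1) = X^+ d, Eq. (3.29)

w_alt = np.linalg.pinv(X) @ d   # NumPy's pseudoinverse gives the same answer
# here w == [1, 2], and d = X w exactly since d lies in the column space of X
```

In practice `np.linalg.pinv` (or `np.linalg.lstsq`) is preferred over forming (X^T X)^{-1} explicitly, for numerical stability.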
These second-order statistics are given by

  R_x = E[x(i) x^T(i)] = \lim_{n\to\infty} (1/n) \sum_{i=1}^{n} x(i) x^T(i) = \lim_{n\to\infty} (1/n) X^T(n) X(n)    (3.30)

  r_xd = E[x(i) d(i)] = \lim_{n\to\infty} (1/n) \sum_{i=1}^{n} x(i) d(i) = \lim_{n\to\infty} (1/n) X^T(n) d(n)    (3.31)

where E denotes the statistical expectation operator.

Linear Least-Squares Filter (cont.)

Accordingly, we may reformulate the linear least-squares solution of Eq. (3.27) as

  w_0 = \lim_{n\to\infty} w(n+1)
      = \lim_{n\to\infty} (X^T(n) X(n))^{-1} X^T(n) d(n)
      = [\lim_{n\to\infty} (1/n) X^T(n) X(n)]^{-1} [\lim_{n\to\infty} (1/n) X^T(n) d(n)]
      = R_x^{-1} r_xd    (3.32)

The weight vector w_0 is called the Wiener solution to the linear optimum filtering problem. For an ergodic process, the linear least-squares filter asymptotically approaches the Wiener filter as the number of observations approaches infinity. However, the second-order statistics are not available in many important situations encountered in practice.

Least-Mean-Square Algorithm

The LMS algorithm is based on the use of instantaneous values for the cost function:

  E(w) = (1/2) e^2(n)    (3.33)

where e(n) is the error signal at time n. Differentiating E(w) with respect to the weight vector w gives

  \partial E(w)/\partial w = e(n) \partial e(n)/\partial w    (3.34)

Least-Mean-Square Algorithm (cont.)

Since

  e(n) = d(n) - x^T(n) w(n)    (3.35)

we have

  \partial e(n)/\partial w(n) = -x(n)

so Eq. (3.34) can be rewritten as

  \partial E(w)/\partial w(n) = -x(n) e(n)

Using this quantity as an estimate of the gradient vector,

  \hat{g}(n) = -x(n) e(n)    (3.36)

and substituting it into the steepest-descent update of Eq. (3.12), the LMS algorithm can be written as

  \hat{w}(n+1) = \hat{w}(n) + \eta x(n) e(n)    (3.37)

Least-Mean-Square Algorithm (cont.)

Summary of the LMS algorithm:
- Training sample: input signal vector x(n); desired response d(n).
- User-selected parameter: \eta.
- Initialization: set \hat{w}(0) = 0.
- Computation: for n = 1, 2, ..., compute
    e(n) = d(n) - \hat{w}^T(n) x(n)
    \hat{w}(n+1) = \hat{w}(n) + \eta x(n) e(n)
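The summary above can be sketched directly in code; a minimal noise-free system-identification example assuming NumPy, where the unknown weight vector `w_true` and the Gaussian inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([0.3, -0.1, 0.7])  # hypothetical unknown system
eta = 0.05                           # learning rate (must satisfy Eq. 3.42)

w = np.zeros(3)                      # Initialization: w(0) = 0
for _ in range(5000):
    x = rng.standard_normal(3)       # input signal vector x(n)
    d = x @ w_true                   # desired response d(n), noise-free here
    e = d - w @ x                    # e(n) = d(n) - w^T(n) x(n)
    w = w + eta * x * e              # w(n+1) = w(n) + eta x(n) e(n), Eq. (3.37)
# w is now very close to w_true
```

With standard-normal inputs R_x is the identity, so eta = 0.05 comfortably satisfies the stability condition discussed below.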
Least-Mean-Square Algorithm (cont.)

Signal-flow graph representation of the LMS algorithm: combining Eqs. (3.35) and (3.37), the evolution of the weight vector in the LMS algorithm can be written as

  \hat{w}(n+1) = \hat{w}(n) + \eta x(n) [d(n) - x^T(n) \hat{w}(n)]
              = [I - \eta x(n) x^T(n)] \hat{w}(n) + \eta x(n) d(n)    (3.38)

where I is the identity matrix. In the signal-flow graph we use

  \hat{w}(n) = z^{-1} \hat{w}(n+1)

where z^{-1} is the unit-delay operator; the LMS algorithm thus forms a feedback loop around the weight vector.

Convergence Considerations of the LMS Algorithm

From control theory we know that the stability of a feedback system is determined by the parameters of its feedback loop. From Fig. 3.3, the feedback loop of the LMS algorithm contains two parameters: the learning rate \eta and the input vector x(n). The convergence criterion for the LMS algorithm is convergence in the mean square:

  E[e^2(n)] \to constant as n \to \infty    (3.41)

The analysis rests on the following assumptions:
- The successive input vectors x(1), x(2), ... are statistically independent of one another.
- At time n, the input vector x(n) is statistically independent of all previous samples of the desired response, d(1), d(2), ..., d(n-1).
- At time n, the desired response d(n) depends on x(n).
- x(n) and d(n) are drawn from Gaussian distributions.

Convergence Considerations of the LMS Algorithm (cont.)

By invoking the elements of independence theory and assuming that the learning-rate parameter \eta is sufficiently small, the LMS algorithm is convergent in the mean square provided that \eta satisfies the condition

  0 < \eta < 2/\lambda_max    (3.42)

where \lambda_max is the largest eigenvalue of the correlation matrix R_x. In practical applications of the LMS algorithm, however, knowledge of \lambda_max is not available. To overcome this difficulty, the trace of R_x may be used as a conservative estimate for \lambda_max, in which case Eq. (3.42) can be rewritten as

  0 < \eta < 2/tr[R_x]    (3.43)

Convergence Considerations of the LMS Algorithm (cont.)

By definition, the trace of a square matrix is equal to the sum of its diagonal elements, and each diagonal element of the correlation matrix R_x equals the mean-square value of the corresponding sensor input. Therefore Eq. (3.43) can be rewritten as

  0 < \eta < 2 / (sum of mean-square values of the sensor inputs)    (3.44)

Provided the learning rate \eta satisfies this condition, the LMS algorithm is guaranteed to converge in the mean square (which also implies convergence in the mean).
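The two step-size bounds of Eqs. (3.42)-(3.43) can be compared numerically; a small sketch assuming NumPy, with an illustrative correlation matrix.

```python
import numpy as np

# Illustrative 2-by-2 correlation matrix R_x
R_x = np.array([[2.0, 0.5],
                [0.5, 1.0]])

lam_max = np.linalg.eigvalsh(R_x).max()  # largest eigenvalue of R_x
bound_eig = 2.0 / lam_max                # exact bound of Eq. (3.42)
bound_tr = 2.0 / np.trace(R_x)           # conservative bound of Eq. (3.43)

# Since tr(R_x) >= lambda_max for a correlation matrix, the trace-based
# bound is always the smaller (safer) of the two: bound_tr <= bound_eig.
```

The trace can be estimated from the input powers alone (Eq. 3.44), which is why it is the practical choice.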
Virtues and Limitations of the LMS Algorithm

Virtues of the LMS algorithm:
- Simplicity.
- Robustness: small model uncertainty and small disturbances can only result in small estimation errors.

Limitations of the LMS algorithm:
- Slow rate of convergence: the algorithm typically requires a number of iterations equal to about ten times the dimensionality of the input space to reach a steady-state condition.
- Sensitivity to variations in the eigenstructure of the input: the LMS algorithm is sensitive to variations in the condition number (eigenvalue spread) of the correlation matrix R_x, defined by

  \chi(R_x) = \lambda_max / \lambda_min    (3.45)

When the condition number \chi(R_x) is high, the sensitivity of the LMS algorithm becomes acute.

Learning Curves

- Learning curve: a plot of the mean-square value of the estimation error, E_av(n), versus the number of iterations, n.
- Rate of convergence: defined as the number of iterations, n, required for E_av(n) to decrease to some arbitrarily chosen value, such as 10 percent of the initial value E_av(0).
- Misadjustment: a measure of how close the adaptive filter is to optimality in the mean-square-error sense.

Learning Curves (cont.)

Misadjustment is defined as

  M = (E(\infty) - E_min)/E_min = E(\infty)/E_min - 1    (3.46)

where E_min denotes the minimum mean-square error produced by the Wiener filter, designed on the basis of known values of the correlation matrix R_x and the cross-correlation vector r_xd.
- The misadjustment M of the LMS algorithm is directly proportional to the learning-rate parameter \eta.
- The average time constant \tau_av is inversely proportional to the learning-rate parameter \eta.
- Hence, if the learning-rate parameter is reduced so as to reduce the misadjustment, the settling time of the LMS algorithm is increased.
Careful attention must therefore be given to the choice of the learning-rate parameter \eta in the design of the LMS algorithm in order to produce a satisfactory overall performance.

Learning-rate Annealing Schedules

The learning rate of the LMS algorithm can be scheduled in several ways:
- Constant: \eta(n) = \eta_0 for all n.
- Time-varying (Robbins, 1951): \eta(n) = c/n, where c is a constant. When c is large, there is a danger of parameter blowup for small n.
- Search-then-converge schedule (Darken and Moody, 1992):

  \eta(n) = \eta_0 / (1 + n/\tau)

[Figure: comparison of the learning-rate annealing schedules.]

Perceptron

McCulloch-Pitts model: the perceptron consists of a linear combiner followed by a hard limiter (signum function). The summing node of the neuronal model computes a linear combination of the inputs applied to its synapses, and also incorporates an externally applied bias. The resulting sum is applied to the hard limiter: the neuron produces an output equal to +1 if the hard-limiter input is positive, and -1 if it is negative.

Perceptron (cont.)

The synaptic weights of the perceptron are denoted by w_1, w_2, ..., w_m; the inputs applied to the perceptron are denoted by x_1, x_2, ..., x_m; and the externally applied bias is denoted by b. The induced local field of the neuron is

  v = \sum_{i=1}^{m} w_i x_i + b    (3.50)

Perceptron (cont.)
The goal of the perceptron is to correctly classify the set of externally applied stimuli (x_1, x_2, ..., x_m) into one of two classes, C_1 or C_2. The decision rule for the classification is to assign the point represented by the inputs (x_1, x_2, ..., x_m) to class C_1 if the perceptron output y is +1 and to class C_2 if it is -1. In the simplest form of the perceptron there are two decision regions separated by a hyperplane defined by

  \sum_{i=1}^{m} w_i x_i + b = 0    (3.51)

The synaptic weights w_1, w_2, ..., w_m of the perceptron can be adapted on an iteration-by-iteration basis.

Perceptron Convergence Theorem

Following Fig. 3.8 (in which the bias of Fig. 3.6 is treated as a fixed input x_0 = +1), the (m+1)-by-1 input vector and weight vector can be written as

  x(n) = [+1, x_1(n), x_2(n), ..., x_m(n)]^T
  w(n) = [b(n), w_1(n), w_2(n), ..., w_m(n)]^T

The induced local field of the neuron is then defined as

  v(n) = \sum_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n)    (3.52)

where w_0(n) is the bias b(n). The equation w^T x = 0, plotted in the coordinates (x_1, x_2, ..., x_m), traces a hyperplane that separates the inputs into two classes.

Perceptron Convergence Theorem (cont.)

For such a separating hyperplane (decision boundary) to exist, the patterns to be classified must be sufficiently separated from each other. [Figure: a pair of linearly separable patterns versus a pair of patterns that are not linearly separable.]

Perceptron Convergence Theorem (cont.)

Suppose the input variables of the perceptron originate from two linearly separable classes: the subset X_1 = {x_1(1), x_1(2), ...} and the subset X_2 = {x_2(1), x_2(2), ...}, whose union constitutes the complete training set X. Training the classifier on X_1 and X_2 adjusts the weight vector w so that the two classes C_1 and C_2 are linearly separated; that is, there exists a weight vector w such that

  w^T x > 0 for every input vector x belonging to class C_1
  w^T x \le 0 for every input vector x belonging to class C_2    (3.53)
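To make the iteration-by-iteration adaptation concrete, here is a minimal training sketch, assuming NumPy, on a small hypothetical linearly separable data set; it uses the error-correction form of the update, w(n+1) = w(n) + eta [d(n) - y(n)] x(n), summarized later in this chapter, with the bias absorbed as the fixed input x_0 = +1 of Eq. (3.52). All data values are illustrative.

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = +1
    w = np.zeros(Xb.shape[1])                      # initialization: w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x_n, d_n in zip(Xb, d):
            y_n = 1.0 if w @ x_n > 0 else -1.0     # hard-limiter output
            w += eta * (d_n - y_n) * x_n           # error-correction update
            errors += int(y_n != d_n)
        if errors == 0:                            # every pattern classified
            break
    return w

# class C1 (+1) clustered around (2, 2); class C2 (-1) around (-2, -2)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, d)
preds = np.where(np.hstack([np.ones((4, 1)), X]) @ w > 0, 1.0, -1.0)
# preds agrees with d on every training pattern
```

Because the data are linearly separable, the convergence theorem proved next guarantees that the loop terminates after finitely many corrections.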
Perceptron Convergence Theorem (cont.)

The algorithm for adapting the weight vector of the elementary perceptron is formulated as follows. If the nth member of the training set, x(n), is correctly classified by the weight vector w(n), no correction is made to the weight vector of the perceptron:

  w(n+1) = w(n) if w^T(n) x(n) > 0 and x(n) belongs to class C_1
  w(n+1) = w(n) if w^T(n) x(n) \le 0 and x(n) belongs to class C_2    (3.54)

Otherwise, the weight vector of the perceptron is updated in accordance with the rule

  w(n+1) = w(n) - \eta(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class C_2
  w(n+1) = w(n) + \eta(n) x(n) if w^T(n) x(n) \le 0 and x(n) belongs to class C_1    (3.55)

(In the first case of Eq. (3.55), x(n) should be assigned to C_2 but is misclassified as C_1; in the second case, x(n) should be assigned to C_1 but is misclassified as C_2.)

Perceptron Convergence Theorem (cont.)

We now prove the convergence of the fixed-increment adaptation rule for \eta = 1. Assume the initial condition w(0) = 0, and that w^T(n) x(n) < 0 for n = 1, 2, ..., with the input vector x(n) belonging to the subset X_1 (that is, the perceptron misclassifies x(1), x(2), ... as belonging to the second class). With \eta(n) = 1, the second line of Eq. (3.55) gives

  w(n+1) = w(n) + x(n) for x(n) belonging to class C_1    (3.56)

Given the initial condition w(0) = 0, w(n+1) is obtained by successively accumulating the inputs x(n):

  w(n+1) = x(1) + x(2) + ... + x(n)    (3.57)

Perceptron Convergence Theorem (cont.)

Since the classes C_1 and C_2 are assumed to be linearly separable, there exists a solution w_0 for which w_0^T x(n) > 0 for all the input vectors x(1), ..., x(n) belonging to the subset X_1. We may therefore define a positive number

  \alpha = \min_{x(n) \in X_1} w_0^T x(n)    (3.58)

Multiplying both sides of Eq. (3.57) by w_0^T gives

  w_0^T w(n+1) = w_0^T x(1) + w_0^T x(2) + ... + w_0^T x(n)

Since each of the n terms on the right-hand side is at least \alpha, the definition in Eq. (3.58) yields

  w_0^T w(n+1) \ge n \alpha    (3.59)

Perceptron Convergence Theorem (cont.)

By the Cauchy-Schwarz inequality,

  \|w_0\|^2 \|w(n+1)\|^2 \ge [w_0^T w(n+1)]^2    (3.60)

Substituting Eq. (3.59) into Eq. (3.60) gives

  \|w_0\|^2 \|w(n+1)\|^2 \ge n^2 \alpha^2

or

  \|w(n+1)\|^2 \ge n^2 \alpha^2 / \|w_0\|^2    (3.61)

Perceptron Convergence Theorem (cont.)
Next, Eq. (3.56) is rewritten with k in place of n:

  w(k+1) = w(k) + x(k) for k = 1, ..., n and x(k) \in X_1    (3.62)

Taking the squared Euclidean norm of both sides of Eq. (3.62) and expanding gives

  \|w(k+1)\|^2 = \|w(k)\|^2 + \|x(k)\|^2 + 2 w^T(k) x(k)    (3.63)

Since by assumption the perceptron misclassifies the vectors x(k) belonging to C_1 as members of C_2, we have w^T(k) x(k) < 0, so Eq. (3.63) implies

  \|w(k+1)\|^2 \le \|w(k)\|^2 + \|x(k)\|^2

or, rearranging,

  \|w(k+1)\|^2 - \|w(k)\|^2 \le \|x(k)\|^2,  k = 1, ..., n    (3.64)

Perceptron Convergence Theorem (cont.)

Applying the initial condition w(0) = 0 and summing the inequalities over k = 1, ..., n, we get

  \|w(n+1)\|^2 \le \sum_{k=1}^{n} \|x(k)\|^2 \le n \beta    (3.65)

where

  \beta = \max_{x(k) \in X_1} \|x(k)\|^2    (3.66)

Eqs. (3.65) and (3.61) conflict for sufficiently large n, so n cannot exceed some value n_max at which both are satisfied with the equality sign:

  n_max^2 \alpha^2 / \|w_0\|^2 = n_max \beta

Solving for n_max,

  n_max = \beta \|w_0\|^2 / \alpha^2    (3.67)

The perceptron must therefore stop adjusting its synaptic weights after at most n_max iterations.

Perceptron Convergence Theorem (cont.)

Thus, for \eta(n) = 1 for all n and w(0) = 0, the perceptron adjusts its synaptic weights for at most n_max iterations.

Fixed-increment convergence theorem of the perceptron: let the subsets of training vectors X_1 and X_2 be linearly separable, and let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n_0 iterations, in the sense that

  w(n_0) = w(n_0 + 1) = w(n_0 + 2) = ...

is a solution vector for n_0 \le n_max.

Perceptron Convergence Theorem (cont.)

Absolute error-correction procedure for adaptation of a single-layer perceptron: choose \eta(n) to be the smallest integer for which

  \eta(n) \|x(n)\|^2 > |w^T(n) x(n)|

With this procedure, each pattern is presented repeatedly to the perceptron until that pattern is classified correctly. The use of an initial value w(0) other than zero merely results in a decrease or increase in the number of iterations required to converge, depending on how w(0) relates to the solution w_0.

Perceptron Convergence Theorem (cont.)

Summary of the perceptron convergence theorem:
1. Initialization: set w(0) = 0.
2. Activation: at time step n = 1, 2, ..., activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
3. Computation of actual response: compute the actual response of the perceptron,

  y(n) = sgn[w^T(n) x(n)]

4. Adaptation of weight vector: update the weight vector of the perceptron,

  w(n+1) = w(n) + \eta [d(n) - y(n)] x(n)

where

  d(n) = +1 if x(n) belongs to class C_1
  d(n) = -1 if x(n) belongs to class C_2

5. Continuation: increment time step n by one and go back to step 2.

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment

Bayes classifier: the goal is to minimize the average risk R. For a two-class problem, the average risk is defined as

  R = c_{11} p_1 \int_{X_1} f_x(x|C_1) dx + c_{22} p_2 \int_{X_2} f_x(x|C_2) dx    (correct decisions)
    + c_{21} p_1 \int_{X_2} f_x(x|C_1) dx + c_{12} p_2 \int_{X_1} f_x(x|C_2) dx    (incorrect decisions)    (3.72)

where
- p_i is the a priori probability that the observation vector x is drawn from subspace X_i;
- c_{ij} is the cost of deciding in favor of class C_i (represented by subspace X_i) when class C_j is true;
- f_x(x|C_i) is the conditional probability density function of the random vector X given class C_i.

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Since each observation vector x must be assigned to either C_1 or C_2, the two subspaces partition the observation space:

  X = X_1 + X_2    (3.73)

Accordingly, Eq. (3.72) can be rewritten as

  R = c_{11} p_1 \int_{X_1} f_x(x|C_1) dx + c_{22} p_2 \int_{X - X_1} f_x(x|C_2) dx
    + c_{21} p_1 \int_{X - X_1} f_x(x|C_1) dx + c_{12} p_2 \int_{X_1} f_x(x|C_2) dx    (3.74)

where c_{11} < c_{21} and c_{22} < c_{12}. We also observe the fact that

  \int_X f_x(x|C_1) dx = \int_X f_x(x|C_2) dx = 1    (3.75)

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)
Hence, expanding Eq. (3.74) and using Eq. (3.75), the average risk simplifies to

  R = c_{21} p_1 + c_{22} p_2 + \int_{X_1} [ p_2 (c_{12} - c_{22}) f_x(x|C_2) - p_1 (c_{21} - c_{11}) f_x(x|C_1) ] dx    (3.76)

The first two terms of Eq. (3.76) represent a fixed cost. Since the objective is to minimize the average risk R, the following strategy can be deduced from Eq. (3.76):
- If the integrand is negative at the observation vector x, then x should be assigned to X_1 (class C_1).
- If the integrand is positive at x, then x should be assigned to X_2 (class C_2).
- If the integrand is zero at x, then x has no effect on the average risk R and may be assigned to either class; here we assign it to X_2 (class C_2).

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Accordingly, the Bayes classifier may be defined as follows: if the condition

  p_1 (c_{21} - c_{11}) f_x(x|C_1) > p_2 (c_{12} - c_{22}) f_x(x|C_2)

holds, assign the observation vector x to subspace X_1 (class C_1); otherwise, assign x to X_2 (class C_2). For convenience, rearrange this condition and define the likelihood ratio

  \Lambda(x) = f_x(x|C_1) / f_x(x|C_2)    (3.77)

and the threshold

  \xi = p_2 (c_{12} - c_{22}) / [p_1 (c_{21} - c_{11})]    (3.78)

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)

The Bayes classifier can then be restated: for an observation vector x, if the likelihood ratio \Lambda(x) is greater than the threshold \xi, assign x to class C_1; otherwise, assign it to class C_2. [Figure: the Bayes classifier as a likelihood-ratio computer followed by a comparator. An equivalent implementation computes log \Lambda(x) and compares it with log \xi.]

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)
Bayes classifier for a Gaussian distribution: assume the two classes share a common covariance matrix,

  Class C_1: E[X] = \mu_1,  E[(X - \mu_1)(X - \mu_1)^T] = C
  Class C_2: E[X] = \mu_2,  E[(X - \mu_2)(X - \mu_2)^T] = C

Since the elements of X are correlated, the covariance matrix C is not a diagonal matrix; C is assumed to be nonsingular, so C^{-1} exists. The conditional probability density function of X is then

  f_x(x|C_i) = (2\pi)^{-m/2} (det C)^{-1/2} exp[ -(1/2) (x - \mu_i)^T C^{-1} (x - \mu_i) ],  i = 1, 2    (3.79)

We further assume that the two classes are equiprobable,

  p_1 = p_2 = 1/2    (3.80)

and that misclassifications carry the same cost while correct classifications carry no cost,

  c_{21} = c_{12} and c_{11} = c_{22} = 0    (3.81)

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Substituting Eq. (3.79) into Eq. (3.77) and taking the logarithm gives

  log \Lambda(x) = -(1/2) (x - \mu_1)^T C^{-1} (x - \mu_1) + (1/2) (x - \mu_2)^T C^{-1} (x - \mu_2)
                = (\mu_1 - \mu_2)^T C^{-1} x + (1/2) (\mu_2^T C^{-1} \mu_2 - \mu_1^T C^{-1} \mu_1)    (3.82)

Substituting Eqs. (3.80) and (3.81) into Eq. (3.78) gives the threshold \xi = 1, so

  log \xi = 0    (3.83)

Eqs. (3.82) and (3.83) show that the Bayes classifier for this problem reduces to the linear classifier

  y = w^T x + b    (3.84)

where

  y = log \Lambda(x)    (3.85)
  w = C^{-1} (\mu_1 - \mu_2)    (3.86)
  b = (1/2) (\mu_2^T C^{-1} \mu_2 - \mu_1^T C^{-1} \mu_1)    (3.87)

Comparing Eq. (3.84) with Eq. (3.51), we see that the Bayes classifier takes the same form as the linear classifier embodied in the perceptron.

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)

The classifier consists of a linear combiner with weight vector w and bias b. On the basis of Eq. (3.84), the log-likelihood test for the two-class problem may be stated as follows: if the output y of the linear combiner is positive, assign the observation vector x to class C_1; otherwise, assign it to class C_2.

Relation Between the Perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Perceptron vs. Bayes classifier for a Gaussian environment: the perceptron operates on the premise that the patterns to be classified are linearly separable, whereas the Gaussian distributions of the two patterns assumed in the derivation of the Bayes classifier certainly do overlap each other and are therefore not separable.
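A concrete sketch of the linear classifier of Eqs. (3.84)-(3.87), assuming NumPy; the class means and the identity covariance matrix below are illustrative choices.

```python
import numpy as np

mu1 = np.array([1.0, 1.0])    # illustrative mean of class C1
mu2 = np.array([-1.0, -1.0])  # illustrative mean of class C2
C = np.eye(2)                 # common covariance matrix of the two classes
C_inv = np.linalg.inv(C)

w = C_inv @ (mu1 - mu2)                             # Eq. (3.86)
b = 0.5 * (mu2 @ C_inv @ mu2 - mu1 @ C_inv @ mu1)   # Eq. (3.87)

def classify(x):
    """Assign x to class C1 if y = w^T x + b > 0, otherwise to C2 (Eq. 3.84)."""
    return 1 if w @ x + b > 0 else 2

# classify(np.array([2.0, 0.0]))   -> 1
# classify(np.array([-0.5, -1.5])) -> 2
```

With equal means' magnitudes and identity covariance, w = (2, 2) and b = 0, so the decision boundary is the line x_1 + x_2 = 0, midway between the two class means.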
The Bayes classifier minimizes the probability of classification error; it always positions the decision boundary at the point where the Gaussian distributions for the two classes C_1 and C_2 cross each other. A further contrast is nonparametric vs. parametric: the perceptron convergence algorithm is both adaptive and simple to implement, whereas the Bayes classifier is computationally more demanding and requires more memory.

[Figure: two overlapping, one-dimensional Gaussian distributions.]

Ergodic Process (from Wikipedia)

In probability theory, a stationary ergodic process is a stochastic process which exhibits both stationarity and ergodicity. In essence, this implies that the random process will not change its statistical properties with time.
- Stationarity is the property of a random process which guarantees that its statistical properties, such as the mean value, its moments, and variance, will not change over time. A stationary process is one whose probability distribution is the same at all times. Several sub-types of stationarity are defined: first-order, second-order, nth-order, wide-sense, and strict-sense.
- An ergodic process is one which conforms to the ergodic theorem. The theorem allows the time average of a conforming process to equal the ensemble average. In practice, this means that statistical sampling can be performed at one instant across a group of identical processes, or sampled over time on a single process, with no change in the measured result.

Taylor Series (from Wikipedia)

A Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point. It may be regarded as the limit of the Taylor polynomials.
The Taylor series of a real or complex function f that is infinitely differentiable in a neighborhood of a real or complex number a is the power series

  f(a) + (f'(a)/1!)(x - a) + (f''(a)/2!)(x - a)^2 + (f'''(a)/3!)(x - a)^3 + ...

which in a more compact form can be written as

  \sum_{n=0}^{\infty} (f^{(n)}(a)/n!) (x - a)^n
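As a quick numeric illustration of the series above, the partial sums for exp(x) around a = 0 (where every derivative equals 1) approach the true value; the choice of exp and x = 1.0 is illustrative.

```python
import math

def exp_taylor(x, n_terms):
    """Partial sum sum_{n=0}^{n_terms-1} x^n / n! of the Taylor series of exp
    around a = 0, where f^(n)(0) = 1 for all n."""
    return sum(x**n / math.factorial(n) for n in range(n_terms))

approx = exp_taylor(1.0, 12)
# approx agrees with math.e to better than 1e-8
```

The remainder after 12 terms is bounded by roughly 1/12!, which is why so few terms already give eight-plus correct decimal places.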