Radial Basis Function Networks
M.W. Mak

1. Introduction
2. Finding RBF Parameters
3. Decision Surface of RBF Networks
4. Comparison between RBF and BP

1. Introduction

MLPs are highly non-linear in the parameter space, so they must be trained by gradient descent, which can become trapped in local minima. RBF networks avoid this problem by dividing the learning into two independent processes: first finding the basis-function parameters, then solving for the output weights.

RBF networks implement the function

$$s(\mathbf{x}) = w_0 + \sum_{i=1}^{M} w_i\,\phi_i(\|\mathbf{x} - \mathbf{c}_i\|)$$

- The weights $w_i$, the basis functions $\phi_i$, and the centers $\mathbf{c}_i$ can be determined separately.
- This yields a fast learning algorithm.

Basis function types:

- $\phi(r) = r^2 \log(r)$ (thin-plate spline)
- $\phi(r) = \exp\left(-\dfrac{r^2}{2\sigma^2}\right)$ (Gaussian)
- $\phi(r) = \sqrt{r^2 + \sigma^2}$ (multiquadric)
- $\phi(r) = \dfrac{1}{\sqrt{r^2 + \sigma^2}}$ (inverse multiquadric)

For Gaussian basis functions,

$$s(\mathbf{x}_p) = w_0 + \sum_{i=1}^{M} w_i\,\phi_i(\|\mathbf{x}_p - \mathbf{c}_i\|)
= w_0 + \sum_{i=1}^{M} w_i \exp\left(-\sum_{j=1}^{n}\frac{(x_{pj} - c_{ij})^2}{2\sigma_{ij}^2}\right).$$

Assuming the variances across all dimensions are equal,

$$s(\mathbf{x}_p) = w_0 + \sum_{i=1}^{M} w_i \exp\left(-\frac{1}{2\sigma_i^2}\sum_{j=1}^{n}(x_{pj} - c_{ij})^2\right).$$

To write this in matrix form, let $a_{pi} = \phi_i(\|\mathbf{x}_p - \mathbf{c}_i\|)$, so that

$$s(\mathbf{x}_p) = \sum_{i=0}^{M} w_i a_{pi}, \qquad a_{p0} = 1,$$

$$\begin{bmatrix} s(\mathbf{x}_1) \\ s(\mathbf{x}_2) \\ \vdots \\ s(\mathbf{x}_N) \end{bmatrix}
= \begin{bmatrix} 1 & a_{11} & a_{12} & \cdots & a_{1M} \\ 1 & a_{21} & a_{22} & \cdots & a_{2M} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & a_{N1} & a_{N2} & \cdots & a_{NM} \end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_M \end{bmatrix}$$

$$\mathbf{s} = A\mathbf{w} \quad\Rightarrow\quad \mathbf{w} = A^{-1}\mathbf{s} \quad \text{(when $A$ is square and invertible).}$$

2. Finding the RBF Parameters

Use the K-means algorithm to find the centers $\mathbf{c}_i$.

K-means Algorithm
Step 1: K initial cluster centers are chosen randomly from the samples to form K groups.
Step 2: Each new sample is added to the group whose mean is closest to it.
Step 3: Adjust the mean of that group to take account of the new point.
Step 4: Repeat Step 2 until the distance between the old means and the new means of all clusters is smaller than a predefined tolerance.

Outcome: There are K clusters whose means represent the centroids of the clusters.
Advantages: (1) It is a fast and simple algorithm. (2) It reduces the effect of noisy samples.

Use the K-nearest-neighbor rule to find the function widths:

$$\sigma_i^2 = \frac{1}{K}\sum_{k=1}^{K}\|\mathbf{c}_k - \mathbf{c}_i\|^2,$$

where $\mathbf{c}_k$ is the k-th nearest neighbor of $\mathbf{c}_i$. The objective is to cover the training points so that a smooth fit of the training samples can be achieved.

[Figure: Centers and widths found by K-means and K-NN.]

Determine the weights $\mathbf{w}$ by the least-squares method: minimize

$$E = \sum_{p=1}^{N}\left[d_p - \sum_{j=0}^{M} w_j\,\phi_j(\|\mathbf{x}_p - \mathbf{c}_j\|)\right]^2,$$

where $d_p$ is the desired output for pattern $p$. In matrix form, $E = (\mathbf{d} - A\mathbf{w})^T(\mathbf{d} - A\mathbf{w})$. Setting $\partial E/\partial\mathbf{w} = 0$ gives

$$\mathbf{w} = (A^T A)^{-1} A^T \mathbf{d}.$$

Derivation: let $E$ be the total squared error between the actual outputs and the target outputs $\mathbf{d} = [d_1\ d_2\ \cdots\ d_N]^T$:

$$E = (\mathbf{d} - A\mathbf{w})^T(\mathbf{d} - A\mathbf{w})
= \mathbf{d}^T\mathbf{d} - \mathbf{w}^T A^T\mathbf{d} - \mathbf{d}^T A\mathbf{w} + \mathbf{w}^T A^T A\mathbf{w}$$

$$\frac{\partial E}{\partial\mathbf{w}} = -2A^T\mathbf{d} + 2A^T A\mathbf{w} = 0
\quad\Rightarrow\quad A^T A\mathbf{w} = A^T\mathbf{d}
\quad\Rightarrow\quad \mathbf{w} = (A^T A)^{-1} A^T\mathbf{d}$$

Note that

$$\frac{\partial}{\partial\mathbf{x}}(\mathbf{x}^T\mathbf{y}) = \mathbf{y}, \qquad
\frac{\partial}{\partial\mathbf{x}}(\mathbf{x}^T A\mathbf{y}) = A\mathbf{y}, \qquad
\frac{\partial}{\partial\mathbf{x}}(\mathbf{x}^T A\mathbf{x}) = (A + A^T)\mathbf{x}.$$

Problems:
(1) The normal equations are susceptible to round-off error.
(2) There is no solution if $A^T A$ is singular.
(3) If $A^T A$ is close to singular, we get very large components in $\mathbf{w}$.

Reasons:
(1) Inaccuracy in forming $A^T A$.
(2) If $A$ is ill-conditioned, a small change in $A$ introduces a large change in $(A^T A)^{-1}$.
(3) If $A^T A$ is close to singular, dependent columns exist in $A^T A$, e.g. two parallel straight lines.

Singular matrix example:

$$\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

If the two lines are nearly parallel, they intersect at infinity, i.e. $x_0 \to \infty,\ y_0 \to \infty$ or $x_0 \to -\infty,\ y_0 \to -\infty$. The magnitude of the solution therefore becomes very large, and overflow will occur. The effect of the large components could only be cancelled out if the machine precision were infinite.
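To see this numerically, here is a minimal NumPy sketch (an illustration, not code from the slides; the matrix entries are chosen to make two almost-parallel lines). It solves the normal equations directly and shows both the huge condition number of $A^T A$ and the very large solution components:

```python
import numpy as np

# Two almost-parallel lines: the second row of A is nearly twice
# the first, so A (and hence A^T A) is close to singular.
A = np.array([[1.0, 2.0],
              [2.0, 4.00001]])
d = np.array([0.0, 1.0])

AtA = A.T @ A
w = np.linalg.solve(AtA, A.T @ d)   # normal-equations solution (A^T A)^{-1} A^T d

print(np.linalg.cond(AtA))          # ~1e13: A^T A is nearly singular
print(w)                            # [-2e5, 1e5]: components of huge magnitude
```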
If the machine precision is finite, we get a large error. For example, in exact arithmetic

$$\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} 2\times 10^{38} \\ -10^{38} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix},$$

but with finite machine precision,

$$\begin{bmatrix} 1 & 2 \\ 2 & 4.00001 \end{bmatrix}\begin{bmatrix} 2\times 10^{38} \\ -10^{38} \end{bmatrix} = \begin{bmatrix} 0 \\ -10^{33} \end{bmatrix}.$$

Solution: Singular Value Decomposition.

RBF learning process:

K-means clusters the inputs $\mathbf{x}_p$ to give the centers $\mathbf{c}_i$; the K-nearest-neighbor rule gives the widths $\sigma_i$; the basis functions $\phi_i(\|\mathbf{x}_p - \mathbf{c}_i\|)$ form the matrix $A$; linear regression then gives the weights $w_i$.

RBF learning by gradient descent

Let

$$\phi_i(\mathbf{x}_p) = \exp\left(-\sum_{j=1}^{n}\frac{(x_{pj} - c_{ij})^2}{2\sigma_{ij}^2}\right)
\quad\text{and}\quad e(\mathbf{x}_p) = d(\mathbf{x}_p) - s(\mathbf{x}_p),$$

so that

$$E = \frac{1}{2}\sum_{p=1}^{N} e(\mathbf{x}_p)^2.$$

Applying $\partial E/\partial w_i$, $\partial E/\partial\sigma_{ij}$, and $\partial E/\partial c_{ij}$, we obtain the following update equations:

$$w_i(t+1) = w_i(t) + \eta_w \sum_{p=1}^{N} e(\mathbf{x}_p)\,\phi_i(\mathbf{x}_p) \qquad i = 1, 2, \ldots, M$$

$$w_i(t+1) = w_i(t) + \eta_w \sum_{p=1}^{N} e(\mathbf{x}_p) \qquad i = 0$$

$$\sigma_{ij}(t+1) = \sigma_{ij}(t) + \eta_\sigma \sum_{p=1}^{N} e(\mathbf{x}_p)\,w_i\,\phi_i(\mathbf{x}_p)\,\frac{(x_{pj} - c_{ij})^2}{\sigma_{ij}(t)^3}$$

$$c_{ij}(t+1) = c_{ij}(t) + \eta_c \sum_{p=1}^{N} e(\mathbf{x}_p)\,w_i\,\phi_i(\mathbf{x}_p)\,\frac{x_{pj} - c_{ij}}{\sigma_{ij}(t)^2}$$

Elliptical Basis Function Networks

[Figure: EBF network architecture with inputs $x_1, \ldots, x_n$, M hidden units, and outputs $y_1(\mathbf{x}), \ldots, y_K(\mathbf{x})$.]

$$y_k(\mathbf{x}_p) = \sum_{j=0}^{J} w_{kj}\,\phi_j(\mathbf{x}_p), \qquad
\phi_j(\mathbf{x}_p) = \exp\left\{-\frac{1}{2}(\mathbf{x}_p - \boldsymbol{\mu}_j)^T \Sigma_j^{-1}(\mathbf{x}_p - \boldsymbol{\mu}_j)\right\}$$

where the $\boldsymbol{\mu}_j$ are the function centers and the $\Sigma_j$ are the covariance matrices.

K-means and sample covariance:

$$\text{K-means:}\quad \boldsymbol{\mu}_j = \frac{1}{N_j}\sum_{\mathbf{x}_k \in X_j} \mathbf{x}_k$$

$$\text{Sample covariance:}\quad \Sigma_j = \frac{1}{N_j}\sum_{\mathbf{x}_k \in X_j}(\mathbf{x}_k - \boldsymbol{\mu}_j)(\mathbf{x}_k - \boldsymbol{\mu}_j)^T$$

Alternatively, the EM algorithm can be used to estimate the centers and covariance matrices.

EBF vs. RBF networks

[Figure: Decision boundaries on a two-class problem: RBFN with 4 centers (left) vs. EBFN with 4 centers (right).]

[Figure: Output 1 of an EBF network (bias, no rescale, gamma=1) on the 'nxor.ebf 4.Y.N.1.dat' data, shown as a surface plot.]

RBFN for Pattern Classification

An MLP separates classes with hyperplanes; an RBF network separates them with kernel functions.

The probability density function (also called the conditional density function or likelihood) of the k-th class is denoted $p(\mathbf{x} \mid C_k)$.

According to Bayes' theorem, the posterior probability is

$$P(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\,P(C_k)}{p(\mathbf{x})},$$

where $P(C_k)$ is the prior probability and $p(\mathbf{x}) = \sum_r p(\mathbf{x} \mid C_r)\,P(C_r)$.

It is possible to use a common pool of M basis functions, labeled by an index j, to represent all of the class-conditional densities:

$$p(\mathbf{x} \mid C_k) = \sum_{j=1}^{M} p(\mathbf{x} \mid j)\,P(j \mid C_k).$$

Substituting this mixture into Bayes' theorem and defining $P(j) = \sum_{k'} P(j \mid C_{k'})\,P(C_{k'})$,

$$P(C_k \mid \mathbf{x}) = \frac{\sum_{j=1}^{M} p(\mathbf{x} \mid j)\,P(j \mid C_k)\,P(C_k)}{\sum_{j'=1}^{M} p(\mathbf{x} \mid j')\,P(j')}
= \sum_{j=1}^{M}\frac{P(j \mid C_k)\,P(C_k)}{P(j)}\cdot\frac{p(\mathbf{x} \mid j)\,P(j)}{\sum_{j'=1}^{M} p(\mathbf{x} \mid j')\,P(j')}$$

$$= \sum_{j=1}^{M} P(C_k \mid j)\,P(j \mid \mathbf{x}) = \sum_{j=1}^{M} w_{kj}\,\phi_j(\mathbf{x}) \qquad \text{(no bias term)}$$

- Hidden node's output: $\phi_j(\mathbf{x}) = P(j \mid \mathbf{x})$, the posterior probability of the j-th set of features given the input.
- Weight: $w_{kj} = P(C_k \mid j)$, the posterior probability of class membership given the presence of the j-th set of features.

Comparison of RBF and MLP

                            RBF networks                         MLP
  Learning speed            Very fast                            Very slow
  Convergence               Almost guaranteed                    Not guaranteed
  Response time             Slow                                 Fast
  Memory requirement        Very large                           Small
  Hardware implementation   IBM ZISC036                          Voice Direct 364
                            (www-5.ibm.com/fr/cdlab/zisc.html)   (www.sensoryinc.com)
                            Nestor Ni1000
  Generalization            Usually better                       Usually poorer

To learn more about NN hardware, see http://www.particle.kth.se/~lindsey/HardwareNNWCourse/home.html
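To tie the two-stage learning procedure together, here is a minimal end-to-end NumPy sketch (an illustration, not code from the course: the helper names, the toy $\sin(x)$ regression data, and the choice of K = 2 neighbors for the widths are all assumptions). The output weights are found with np.linalg.lstsq, which solves the least-squares problem by SVD, the remedy recommended above for a near-singular $A^T A$:

```python
import numpy as np

def kmeans(X, K, n_iter=100, tol=1e-6, seed=0):
    """Steps 1-4 of the K-means algorithm above: random initial centers,
    then alternate nearest-mean assignment and mean updates until the
    centers move less than a tolerance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.linalg.norm(new - centers) < tol:
            return new
        centers = new
    return centers

def knn_widths(centers, K=2):
    """sigma_i^2 = (1/K) * sum of squared distances to the K nearest centers."""
    d2 = ((centers[:, None] - centers[None]) ** 2).sum(-1)
    d2.sort(axis=1)                    # column 0 is each center's distance to itself
    return np.sqrt(d2[:, 1:K + 1].mean(axis=1))

def design_matrix(X, centers, sigmas):
    """A[p, i] = phi_i(||x_p - c_i||), with a leading column of ones for w_0."""
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2.0 * sigmas ** 2))
    return np.hstack([np.ones((len(X), 1)), Phi])

# Toy usage: fit y = sin(x) with 8 Gaussian basis functions.
X = np.linspace(0, 2 * np.pi, 50)[:, None]
d = np.sin(X).ravel()
centers = kmeans(X, K=8)
sigmas = knn_widths(centers)
A = design_matrix(X, centers, sigmas)
w, *_ = np.linalg.lstsq(A, d, rcond=None)   # SVD-based least squares
print(np.abs(A @ w - d).max())              # maximum training error
```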
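The gradient-descent refinement can be sketched in the same style. The function below performs one batch update of $w_i$, $c_{ij}$, and $\sigma_{ij}$ following the four update equations given earlier (again an illustrative implementation; the array layout and learning rates are assumptions):

```python
import numpy as np

def rbf_gradient_step(X, d, w, C, S, lr_w=0.01, lr_c=0.01, lr_s=0.01):
    """One batch gradient-descent update. X is (N, n), d is (N,),
    w is (M+1,) with w[0] the bias, and C, S are (M, n) arrays
    holding c_ij and sigma_ij."""
    diff = X[:, None, :] - C[None]                       # x_pj - c_ij, shape (N, M, n)
    phi = np.exp(-(diff ** 2 / (2.0 * S ** 2)).sum(-1))  # phi_i(x_p), shape (N, M)
    e = d - (w[0] + phi @ w[1:])                         # e(x_p) = d_p - s(x_p)
    w[0] += lr_w * e.sum()                               # bias update (i = 0)
    w[1:] += lr_w * (phi.T @ e)                          # weight updates (i = 1..M)
    g = (e[:, None] * w[1:] * phi)[:, :, None]           # common factor e * w_i * phi_i
    C += lr_c * (g * diff / S ** 2).sum(axis=0)          # center updates
    S += lr_s * (g * diff ** 2 / S ** 3).sum(axis=0)     # width updates
    return w, C, S
```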
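Finally, the probabilistic interpretation of the classifier can also be made concrete. The sketch below assumes isotropic Gaussians for $p(\mathbf{x} \mid j)$ and takes the mixture parameters as given (in practice they would come from K-means or EM as described above); the hidden outputs are the component posteriors $P(j \mid \mathbf{x})$ and the weights are $P(C_k \mid j)$:

```python
import numpy as np

def class_posteriors(x, mu, sigma, Pj, W):
    """P(C_k|x) = sum_j P(C_k|j) P(j|x).
    mu is (M, n), sigma and Pj are (M,), and W[k, j] = P(C_k|j).
    Isotropic Gaussians for p(x|j) are an illustrative assumption."""
    n = mu.shape[1]
    d2 = ((x - mu) ** 2).sum(-1)                                     # ||x - mu_j||^2
    px_j = np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (n / 2)
    phi = px_j * Pj / (px_j * Pj).sum()                              # phi_j(x) = P(j|x)
    return W @ phi            # P(C_k|x); sums to 1 when each column of W does
```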