Information Theoretic Learning
Jose C. Principe, Yiwen Wang
Computational NeuroEngineering Laboratory, Electrical and Computer Engineering Department, University of Florida
www.cnel.ufl.edu | principe@ufl.edu

Acknowledgments: Dr. Deniz Erdogmus; students Puskal Pokharel, Weifeng Liu, Jianwu Xu, Kyu-Hwa Jeong, Sudhir Rao, Seungju Han. Supported by NSF ECS-0300340 and 0601271 (Neuroengineering program).

Resources
CNEL website (www.cnel.ufl.edu): from the front page, go to ITL resources (tutorial, examples, MATLAB code) and publications.
Deniz Erdogmus and Jose Principe, "From Linear Adaptive Filtering to Nonlinear Information Processing," IEEE Signal Processing Magazine, November 2006.

Outline
• Motivation
• Renyi's entropy definition
• A sample-by-sample estimator for entropy
• Projections based on mutual information
• Applications: optimal filtering, classification, clustering
• Conclusions

Data is everywhere!
Remote sensing, biomedical applications, wireless communications, speech processing, sensor arrays.

From Data to Models
Optimal adaptive models: an adaptive system y = f(x, w) maps the input data x to an output y, which is compared with the desired data d to form an error e; a learning algorithm adjusts the weights to optimize a cost function of the error.

From Linear to Nonlinear Mappings
• Wiener showed us how to compute optimal linear projections; the LMS/RLS algorithms showed us how to find the Wiener solution sample by sample.
• Neural networks brought us the ability to work non-parametrically with nonlinear function approximators:
  linear regression → nonlinear regression
  optimum linear filtering → TLFNs
  linear projections (PCA) → principal curves
  linear discriminant analysis → MLPs

Adapting Linear and Nonlinear Models
The goal of learning is to optimize the performance of the parametric mapper according to some cost function.
• In classification, minimize the probability of error.
• In regression, minimize the error of the fit.
The most widely used cost function has been the mean square error (MSE). It provides the maximum likelihood solution when the error is Gaussian distributed; in nonlinear systems this is hardly ever the case.

Beyond Second Order Statistics
• We submit that the goal of learning should be to transfer as much information as possible from the inputs to the weights of the system, whether the setting is supervised or unsupervised.
• As such, the learning criterion should be based on entropy (a single data source) or divergence (multiple data sources).
• The challenge is therefore to find nonparametric, sample-by-sample estimators for these quantities.

ITL: Unifying Learning Scheme
Supervised and unsupervised learning are normally treated differently, but there is no need to do so. One can define a general class of cost functions based on information theory that applies to both learning schemes. The cost function (to be minimized, maximized, or nulled) is based on:
1. Entropy — a single group of random variables
2. Divergence — two or more groups of random variables

ITL: Unifying Learning Scheme
Function approximation: minimize the error entropy.
Classification: minimize the error entropy, or maximize the mutual information between class labels and outputs.
Jaynes' MaxEnt: maximize the output entropy.
Linsker's maximum information transfer: maximize the mutual information between input and output.
Optimal feature extraction: maximize the mutual information between the desired signal and the output.
Independent component analysis: maximize the output entropy, or minimize the mutual information among the outputs.

In block-diagram form: the input signal X passes through the learning system Y = q(X, W); the output signal Y (and, in supervised settings, the desired signal D) feeds an information measure I(Y, D) that drives the optimization.

Information Theory
Information theory is a probabilistic description of random variables that quantifies the very essence of the communication process. It has been instrumental in the design and quantification of communication systems, and it provides a quantitative and consistent framework to describe processes with partial knowledge (uncertainty).

Not all random events are equally random: a narrow pdf and a broad pdf describe very different degrees of uncertainty. How do we quantify this fact? Shannon proposed the concept of ENTROPY.

Formulation of Shannon's Entropy
Hartley information (1928): a large probability carries little information ($p_X(x)\to 1 \Rightarrow S_H \to 0$), a small probability carries much information ($p_X(x)\to 0 \Rightarrow S_H \to \infty$). Two identical channels should have twice the capacity of one, so the logarithm is the natural measure for additivity, $g\big(p_X(x)^2\big) = 2\,g\big(p_X(x)\big)$, which gives $S_H = -\log_2 p_X(x)$.

Shannon's entropy is the expected value of the Hartley information:
$$H_S(X) = E[S_H] = -\sum_x p_X(x)\log p_X(x).$$
In communications it quantifies the ultimate data compression (H is the channel capacity for asymptotically error-free communication); more generally it is a measure of (relative) uncertainty. Shannon used a principled, axiomatic approach to define entropy.

Review of Information Theory
Shannon entropy: $S(p_k) = -\log p_k$, $H(P) = \sum_k p_k\,S(p_k)$.
Mutual information: $I(x,y) = H(x) + H(y) - H(x,y) = H(x) - H(x|y) = H(y) - H(y|x)$.
Kullback-Leibler divergence: $K(f,g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx$, and in particular $I(x,y) = \int\!\!\int f_{xy}(x,y)\log\frac{f_{xy}(x,y)}{f_x(x)f_y(y)}\,dx\,dy$.
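As a quick numerical check of these discrete-variable definitions, the short Python sketch below computes Shannon entropy, mutual information, and the KL divergence directly from a joint probability table; the table values are made up purely for illustration.

    import numpy as np

    def shannon_entropy(p):
        """H(P) = -sum_k p_k log p_k (natural log), skipping zero cells."""
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    def kl_divergence(f, g):
        """K(f, g) = sum_k f_k log(f_k / g_k) for discrete pmfs."""
        mask = f > 0
        return np.sum(f[mask] * np.log(f[mask] / g[mask]))

    # Hypothetical 2x2 joint pmf of (X, Y), for illustration only.
    p_xy = np.array([[0.3, 0.1],
                     [0.2, 0.4]])
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)

    # I(X;Y) = H(X) + H(Y) - H(X,Y), and also KL(p_xy || p_x p_y).
    mi_from_entropies = shannon_entropy(p_x) + shannon_entropy(p_y) - shannon_entropy(p_xy.ravel())
    mi_from_kl = kl_divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())
    print(mi_from_entropies, mi_from_kl)   # both expressions give the same value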
Properties of Shannon's Entropy
Discrete random variables:
• H(X) ≥ 0
• H(X) ≤ log N, with equality iff X is uniform
• H(Y|X) ≤ H(Y), with equality iff X and Y are independent
• H(X,Y) = H(X) + H(Y|X)
Continuous random variables:
• Replace the summation with an integral (differential entropy).
• Minimum entropy: a sum of delta functions.
• Maximum entropy: for fixed variance, the Gaussian; for fixed upper/lower limits, the uniform.

Properties of Mutual Information
$$I_S(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X),$$
$$I_S(X;Y) = I_S(Y;X), \qquad I_S(X;X) = H_S(X).$$
[Venn-diagram view: H_S(X,Y) is covered by H_S(X|Y), I_S(X;Y) and H_S(Y|X).]

A Different View of Entropy
• Shannon's entropy: $H_S(X) = -\sum_x p_X(x)\log p_X(x)$, or $H_S(X) = -\int p_X(x)\log p_X(x)\,dx$ for continuous variables.
• Renyi's entropy: $H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_x p_X^\alpha(x)$, or $H_\alpha(X) = \frac{1}{1-\alpha}\log\int p_X^\alpha(x)\,dx$. Renyi's entropy becomes Shannon's as $\alpha\to 1$.
• Fisher's information (a local quantity): $H_f(X) = \int \frac{\big(p_X'(x)\big)^2}{p_X(x)}\,dx$.

Renyi's Entropy and the Norm of the pdf
Define the $\alpha$-norm of the pdf, $V_\alpha = \int f_Y^\alpha(y)\,dy$. The entropies in terms of $V_\alpha$ are
$$H_{R_\alpha}(y) = \frac{1}{1-\alpha}\log V_\alpha, \qquad H_S(y) = \lim_{\alpha\to 1}\frac{1}{1-\alpha}\log V_\alpha = \lim_{\alpha\to 1}\frac{1}{1-\alpha}(V_\alpha - 1).$$

Geometrical Illustration of Entropy
On the probability simplex $\sum_{k=1}^n p_k = 1$, the $\alpha$-entropy is (up to the $\frac{1}{1-\alpha}\log$ transformation) the $\alpha$-norm of the probability vector $p = (p_1, p_2, p_3, \dots)$ raised to the power $\alpha$; values $\alpha > 1$ and $\alpha < 1$ weight the corners and the center of the simplex differently.

Properties of Renyi's Entropy
(a) Continuous function of all the probabilities.
(b) Permutationally symmetric.
(c) H(1/n, ..., 1/n) is an increasing function of n.
(d) Recursivity: $H(p_1, p_2, \dots, p_n) = H(p_1 + p_2, p_3, \dots, p_n) + (p_1 + p_2)\,H\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right)$.
(e) Additivity: H(p·q) = H(p) + H(q) if p and q are independent.
Shannon's entropy satisfies (a)-(e); Renyi's entropy satisfies (a), (b), (c) and (e), but not (d).

Renyi's entropy provides both a lower and an upper bound for the probability of error in classification, of the form
$$P(e) \ge \frac{H_\alpha(W|M) - H_\alpha(e)}{\log(N_c - 1)},$$
with a matching upper bound that involves $H_\alpha(W|M)$, $H_\alpha(e)$ and $\min_k H_\alpha(W|e, m_k)$. Shannon's entropy, by contrast, provides only the lower bound (Fano's inequality, which is the tightest such bound):
$$P(e) \ge \frac{H_S(W|M) - H_S(e)}{\log(N_c - 1)}.$$

Nonparametric Entropy Estimators
(Only continuous variables are interesting here.)
• Plug-in estimates: integral estimates, resubstitution estimates, splitting-data estimates, cross-validation estimates
• Sample-spacing estimates
• Nearest-neighbor distance estimates

Parzen Window Method
Put a kernel over each sample, add them and normalize; entropy then becomes a function of a continuous random variable:
$$\hat f_X(x) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\big(x - a(i)\big), \qquad \{a(i)\,|\,i = 1,2,\dots,N\},$$
with the Gaussian kernel
$$G_\Sigma(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\!\left(-\tfrac12 x^T\Sigma^{-1}x\right), \qquad \Sigma = \sigma^2 I.$$
A kernel is a positive function that integrates to 1 and peaks at the sample location (e.g., the Gaussian).

[Figures: Parzen estimates of a Laplacian and a uniform pdf for N = 10 and N = 1000 samples.]

The Parzen estimator is smooth and fits the true density arbitrarily closely as N → ∞ and σ → 0. It suffers from the curse of dimensionality: the previous pictures are for d = 1, and for a linear increase in d an exponential increase in N is required for an equally "good" approximation. In ITL, however, we use Parzen windows not to estimate the pdf itself but to estimate the first moment of the pdf.

Renyi's Quadratic Entropy Estimation
Quadratic entropy (α = 2) and the information potential:
$$H_2(X) = -\log V_2(X), \qquad V_2(X) = \int p_X^2(x)\,dx.$$
Use Parzen window pdf estimation with a (symmetric) Gaussian kernel. Information potential: think of the samples as particles (as in a gravitational or electrostatic field) that interact with each other through a law given by the kernel shape.
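Before turning the information potential into a pairwise sum, here is the Parzen construction itself in a few lines of Python; a minimal one-dimensional sketch, with the sample data and kernel size chosen arbitrarily for illustration.

    import numpy as np

    def parzen_pdf(x_grid, samples, sigma):
        """Parzen estimate f(x) = (1/N) sum_i G_sigma(x - a_i) with a Gaussian kernel."""
        diffs = x_grid[:, None] - samples[None, :]            # (grid, N) pairwise differences
        kernels = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
        return kernels.mean(axis=1)

    samples = np.random.laplace(size=1000)                    # illustrative data
    x_grid = np.linspace(-6, 6, 300)
    f_hat = parzen_pdf(x_grid, samples, sigma=0.3)
    print(f_hat.sum() * (x_grid[1] - x_grid[0]))              # Riemann sum, should be close to 1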
IP as an Estimator of Quadratic Entropy
Plugging the Parzen estimate into the information potential:
$$V_2(X) = \int \hat f_X^2(x)\,dx = \int\!\left(\frac{1}{N}\sum_{i=1}^N G_\sigma(x-a_i)\right)\!\left(\frac{1}{N}\sum_{j=1}^N G_\sigma(x-a_j)\right)dx = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\int G_\sigma(x-a_i)\,G_\sigma(x-a_j)\,dx = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}(a_i - a_j).$$
There is NO approximation in computing the information potential for α = 2 besides the choice of the kernel. This result is the kernel trick used in support vector machines: we never explicitly estimate the pdf, which greatly improves the applicability of the method.

Information Force (IF)
Between two information particles (IPTs):
$$\frac{\partial}{\partial a_i} G_{\sigma\sqrt2}(a_i - a_j) = -\frac{a_i - a_j}{2\sigma^2}\,G_{\sigma\sqrt2}(a_i - a_j).$$
Overall force on sample $a_i$:
$$\frac{\partial V(X)}{\partial a_i} = -\frac{1}{2\sigma^2 N^2}\sum_{j=1}^N G_{\sigma\sqrt2}(a_i - a_j)\,(a_i - a_j).$$

Calculation of IP & IF
With the pairwise distances $d_{ij} = a_i - a_j$ and kernel values $v_{ij} = G_{\sigma\sqrt2}(d_{ij})$ arranged in matrices $D = [d_{ij}]$ and $v = [v_{ij}]$:
$$V = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N v_{ij}, \qquad v_i = \frac{1}{N}\sum_{j=1}^N v_{ij}, \qquad V = \frac{1}{N}\sum_{i=1}^N v_i, \qquad f_i = -\frac{1}{2\sigma^2 N^2}\sum_{j=1}^N v_{ij}\,d_{ij}.$$

Central "Moments" and Their Estimators
Mean: $E[X] = \int x\,p_X(x)\,dx \approx \frac{1}{N}\sum_i a(i)$.
Variance: $E[(X-E[X])^2] = \int (x - E[x])^2 p_X(x)\,dx \approx \frac{1}{N}\sum_i \big(a(i) - \hat E[X]\big)^2$.
Shannon entropy: $-E[\log p_X(X)] = -\int p_X(x)\log p_X(x)\,dx \approx -\frac{1}{N}\sum_i \log\!\left(\frac{1}{N}\sum_j G_\sigma\big(a(i)-a(j)\big)\right)$.
Quadratic Renyi entropy: $-\log E[p_X(X)] = -\log\int p_X^2(x)\,dx \approx -\log\frac{1}{N^2}\sum_i\sum_j G_{\sigma\sqrt2}\big(a(i)-a(j)\big)$.

Which of the Two Extremes?
Must the pdf estimate be accurate for practical ITL? Or does ITL (minimization/maximization) not require an accurate pdf estimate at all? Neither extreme is quite right, but the issue is still not fully characterized.

Extension to Any Kernel
• We do not need to use Gaussian kernels in the Parzen estimator.
• Any kernel that is symmetric and differentiable can be used (κ(0) > 0, κ'(0) = 0 and κ''(0) < 0).
• We normally work with kernels scaled from a unit-size kernel: $\kappa_\sigma(x) = \frac{1}{\sigma}\kappa(x/\sigma)$.

Extension to Any α
• Redefine the information potential as
$$V_\alpha(X) = \int p_X^\alpha(x)\,dx = E\big[p_X^{\alpha-1}(X)\big] \approx \frac{1}{N}\sum_i \hat p_X^{\alpha-1}\big(a(i)\big).$$
• Using the Parzen estimator we obtain
$$\hat V_\alpha(X) = \frac{1}{N^\alpha}\sum_j\left(\sum_i \kappa_\sigma\big(a(j)-a(i)\big)\right)^{\alpha-1}.$$
For α = 2 this corresponds exactly to the quadratic estimator with the proper kernel width.

Extension to Any α and Kernel
• The α-information force is
$$\hat F_\alpha(x_j) = (\alpha-1)\,\hat f_X^{\alpha-2}(x_j)\,F_2(x_j),$$
where $F_2(X)$ is the quadratic IF. Hence the "fundamental" definitions are the quadratic IP and IF, and the "natural" kernel is the Gaussian.

How to Select the Kernel Size
• Different values of σ produce different entropy estimates. We suggest using $3\sigma \approx 0.1\times$ the dynamic range of the data (interaction among roughly 10 samples).
• Or use Silverman's rule, $\sigma = 0.9\,A\,N^{-1/5}$, where A is the minimum of the empirical standard deviation and the interquartile range of the data scaled by 1.34.
The kernel size is just a scale parameter.

Kullback-Leibler Divergence
The KL divergence measures the "distance" between pdfs (Csiszar and Amari); it is also known as relative entropy, cross entropy, or information for discrimination:
$$D_{KL}(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx.$$
Notice the similarity to $H_S(X) = -\int f_X(x)\log f_X(x)\,dx$.
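Returning to the pairwise estimators above, the following Python sketch computes the information potential, the quadratic Renyi entropy, and the information forces for a 1-D sample set, with the kernel size set by Silverman's rule; variable names and data are illustrative choices of mine.

    import numpy as np

    def silverman_sigma(x):
        """Silverman's rule: 0.9 * min(std, IQR/1.34) * N^(-1/5)."""
        iqr = np.subtract(*np.percentile(x, [75, 25]))
        a = min(np.std(x), iqr / 1.34)
        return 0.9 * a * len(x) ** (-1 / 5)

    def ip_and_forces(x, sigma):
        """V2 = (1/N^2) sum_ij G_{sigma*sqrt2}(x_i - x_j), H2 = -log V2, and the
        per-sample forces -(1/(2 sigma^2 N^2)) sum_j v_ij d_ij."""
        d = x[:, None] - x[None, :]                   # pairwise differences d_ij
        s2 = 2 * sigma ** 2                           # variance of the kernel G_{sigma*sqrt(2)}
        v = np.exp(-d ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
        V2 = v.mean()                                 # information potential
        H2 = -np.log(V2)                              # quadratic Renyi entropy
        forces = -(v * d).sum(axis=1) / (s2 * len(x) ** 2)
        return V2, H2, forces

    x = np.random.randn(200)
    sigma = silverman_sigma(x)
    V2, H2, F = ip_and_forces(x, sigma)
    print(V2, H2, F.shape)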
Mutual Information & KL Divergence
Shannon's mutual information:
$$I_S(X_1;X_2) = \int\!\!\int f_{X_1X_2}(x_1,x_2)\,\log\frac{f_{X_1X_2}(x_1,x_2)}{f_{X_1}(x_1)\,f_{X_2}(x_2)}\,dx_1\,dx_2.$$
Kullback-Leibler divergence:
$$D_{KL}(f\|g) = \int f(x)\log\frac{f(x)}{g(x)}\,dx, \qquad I_S(X_1;X_2) = D_{KL}\!\big(f_{X_1X_2}(x_1,x_2)\,\|\,f_{X_1}(x_1)f_{X_2}(x_2)\big).$$
Statistical independence corresponds to $f_{X_1X_2}(x_1,x_2) = f_{X_1}(x_1)\,f_{X_2}(x_2)$.

The KL Divergence is NOT a Distance
Ideally a distance satisfies non-negativity, identity (zero only if the pdfs are equal), symmetry, and the triangle inequality. In reality the KL divergence satisfies $D(f,g)\ge 0$ with $D(f,g)=0$ iff $f=g$, but it is not symmetric, $D(f,g)\ne D(g,f)$, and $D(f_1,f_2) + D(f_2,f_3)$ does NOT bound $D(f_1,f_3)$.

New Divergences and Quadratic Mutual Information
Euclidean distance between pdfs and the corresponding quadratic mutual information (ED-QMI):
$$D_{ED}(f\|g) = \int\big(f(x)-g(x)\big)^2dx, \qquad I_{ED}(X_1;X_2) = D_{ED}\big(f_{X_1X_2}\,\|\,f_{X_1}f_{X_2}\big).$$
Cauchy-Schwarz divergence and CS-QMI:
$$D_{CS}(f\|g) = \log\frac{\int f^2(x)\,dx\;\int g^2(x)\,dx}{\left(\int f(x)g(x)\,dx\right)^2}, \qquad I_{CS}(X_1;X_2) = D_{CS}\big(f_{X_1X_2}\,\|\,f_{X_1}f_{X_2}\big).$$

Geometrical Explanation of MI
Define
$$V_J = \int\!\!\int f_{X_1X_2}^2(x_1,x_2)\,dx_1dx_2, \quad V_M = \int\!\!\int \big(f_{X_1}(x_1)f_{X_2}(x_2)\big)^2dx_1dx_2, \quad V_C = \int\!\!\int f_{X_1X_2}(x_1,x_2)\,f_{X_1}(x_1)f_{X_2}(x_2)\,dx_1dx_2.$$
Then
$$I_{ED} = V_J - 2V_C + V_M, \qquad I_{CS} = \log V_J - 2\log V_C + \log V_M = -\log\cos^2\theta, \quad \cos\theta = \frac{V_C}{\sqrt{V_J V_M}},$$
so I_ED behaves like a Euclidean distance and I_CS like an angle between the joint pdf and the product of marginals, while I_S is the K-L divergence between them.

One Example
For a discrete 2x2 example with marginals $P_{X_1} = (0.6, 0.4)$ and joint probabilities $P_{11}, P_{12}, P_{21}, P_{22}$:
[Figures: surfaces of I_S, I_ED and I_CS as functions of P_11 and P_21; all three vanish at independence but have different shapes.]

QMI and Cross Information Potential Estimators
With Parzen window pdf estimation on paired samples $a(i) = \big(a_1(i), a_2(i)\big)^T$, $i = 1,\dots,N$:
$$\hat f_{X_1X_2}(x_1,x_2) = \frac{1}{N}\sum_{i=1}^N G_\sigma\big(x_1-a_1(i)\big)G_\sigma\big(x_2-a_2(i)\big), \quad \hat f_{X_k}(x_k) = \frac{1}{N}\sum_{i=1}^N G_\sigma\big(x_k-a_k(i)\big),\; k=1,2.$$
Define for each marginal k = 1, 2 the pairwise quantities
$$d_k(i,j) = a_k(i) - a_k(j), \quad v_k(i,j) = G_{\sigma\sqrt2}\big(d_k(i,j)\big), \quad v_k(i) = \frac{1}{N}\sum_j v_k(i,j), \quad V_k = \frac{1}{N}\sum_i v_k(i).$$
The sample estimators of the QMIs are then
$$\hat I_{ED}(x_1,x_2) = \hat V_{ED} = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N v_1(i,j)\,v_2(i,j) \;-\; \frac{2}{N}\sum_{i=1}^N v_1(i)\,v_2(i) \;+\; V_1V_2,$$
$$\hat I_{CS}(x_1,x_2) = \hat V_{CS} = \log\frac{\left(\frac{1}{N^2}\sum_i\sum_j v_1(i,j)\,v_2(i,j)\right)V_1V_2}{\left(\frac{1}{N}\sum_i v_1(i)\,v_2(i)\right)^2}.$$

ED-QMI and Cross Information Potential Forces
Defining $c_k(i,j) = v_k(i,j) - v_k(i) - v_k(j) + V_k$ for k = 1, 2, the ED-QMI and the force it exerts on sample $a_k(i)$ take the same pairwise form as the single-variable IP and IF:
$$\hat V_{ED} = \frac{1}{N^2}\sum_i\sum_j c_l(i,j)\,v_k(i,j), \qquad f_k(i) = \frac{\partial \hat V_{ED}}{\partial a_k(i)} = -\frac{1}{2\sigma^2 N^2}\sum_j c_l(i,j)\,v_k(i,j)\,d_k(i,j), \quad i=1,\dots,N,\; k=1,2,\; l\ne k.$$

Renyi's Divergence and Mutual Information
Renyi's mutual information,
$$I_{R_\alpha}(y) = \frac{1}{\alpha-1}\log\int_{-\infty}^{\infty}\frac{f_Y^\alpha(y)}{\left(\prod_{i=1}^{N_s} f_{Y_i}(y_i)\right)^{\alpha-1}}\,dy,$$
does not obey the well-known Shannon relation $I_S(x,y) = H_S(x) + H_S(y) - H_S(x,y)$.

Renyi's Divergence Approximation
An approximation to Renyi's MI is the difference between the sum of marginal entropies and the joint entropy,
$$\sum_{i=1}^{N_s} H_{R_\alpha}(y_i) - H_{R_\alpha}(y) = \frac{1}{\alpha-1}\log\frac{\int_{-\infty}^{\infty} f_Y^\alpha(y)\,dy}{\prod_{i=1}^{N_s}\int_{-\infty}^{\infty} f_{Y_i}^\alpha(y_i)\,dy_i}.$$
Although different from Renyi's MI, it has the same minima, so the sum of the marginals can be used to minimize mutual information.
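A small Python sketch of the ED-QMI and CS-QMI sample estimators defined above, computed from a pair of 1-D sample streams; the data and kernel size are illustrative choices of mine.

    import numpy as np

    def pairwise_kernel(a, sigma):
        """v_ij = G_{sigma*sqrt(2)}(a_i - a_j) for a 1-D sample vector a."""
        d = a[:, None] - a[None, :]
        s2 = 2 * sigma ** 2
        return np.exp(-d ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

    def quadratic_mutual_information(a1, a2, sigma):
        """Sample estimators of ED-QMI and CS-QMI between paired samples a1(i), a2(i)."""
        v1, v2 = pairwise_kernel(a1, sigma), pairwise_kernel(a2, sigma)
        VJ = (v1 * v2).mean()                               # (1/N^2) sum_ij v1_ij v2_ij
        Vc = (v1.mean(axis=1) * v2.mean(axis=1)).mean()     # (1/N) sum_i v1_i v2_i
        VM = v1.mean() * v2.mean()                          # V1 * V2
        I_ed = VJ - 2 * Vc + VM
        I_cs = np.log(VJ * VM / Vc ** 2)
        return I_ed, I_cs

    a1 = np.random.randn(300)
    a2 = 0.8 * a1 + 0.2 * np.random.randn(300)              # statistically dependent pair
    print(quadratic_mutual_information(a1, a2, sigma=0.5))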
From Data to Models: Least Squares
Optimal adaptive models and least squares: the adaptive system maps the input data x to an output that is compared with the desired data z, and the weights are adapted to minimize
$$J_w(e) = E\big[(z - f(x,w))^2\big], \qquad \frac{\partial J(e)}{\partial w} = \frac{\partial J(e)}{\partial e}\frac{\partial e}{\partial w} = 0 \;\Rightarrow\; \frac{\partial E[e^2]}{\partial w} = -2\,E[e\,x] = 0.$$

Model Based Inference
Alternatively, the problem of finding optimal parameters can be framed as model-based inference. The desired response z can be thought of as being created by an unknown transformation of the input vector x, and the problem is characterized by the joint pdf p(x,z). The role of optimization is then to minimize the Kullback-Leibler divergence between the true joint and the estimated one:
$$\min_w J(w) = \int\!\!\int p(x,z)\log\frac{p(x,z)}{\tilde p_w(x,z)}\,dx\,dz.$$
If we write z = f(x) + e with the error independent of x, this is equivalent to minimizing the error entropy,
$$\min_w H_S(e) = -\int p_w(e)\log p_w(e)\,de.$$

Error Entropy Criterion
Information-theoretic learning is exactly a set of tools to solve this minimization problem. Note that this differs from the use of information theory in communications:
• We are interested in continuous random variables.
• We cannot assume Gaussianity, so we need nonparametric estimators.
• We are interested in using gradients, so the estimators must be smooth; we use Parzen estimators.
Since a monotonic function does not affect the location of the optimum, we will use the information potential instead of Renyi's entropy most of the time:
$$H_{R_2}(E) = -\log V(E), \qquad V(E) = E[p(e)].$$

Properties of Entropy Learning with the Information Potential
• The IP with Gaussian kernels preserves the global minimum/maximum of Renyi's entropy.
• The global minimum/maximum of Renyi's entropy coincides with Shannon's (super-Gaussian case).
• Around the global minimum, Renyi's entropy cost (of any order) has the same eigenvalues as Shannon's.
• The global minimum degenerates into a line, because entropy is insensitive to the mean of the error.

Error Entropy Criterion: Batch Gradient
We use iterative steepest-ascent optimization of the IP, $w(n+1) = w(n) + \eta\,\nabla V_2(n)$. Given a batch of N error samples the IP is
$$\hat V_2(E) = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}(e_i - e_j),$$
and for an FIR filter the gradient with respect to weight $w_k$ becomes
$$\nabla_k V_2(n) = \frac{1}{2\sigma^2 N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}\big(e(n-i)-e(n-j)\big)\big(e(n-i)-e(n-j)\big)\big(x_k(n-i)-x_k(n-j)\big).$$
This is easily extended to any α and any kernel using the general IP expressions; for the FIR filter,
$$\nabla_k V_\alpha(n) = \frac{\alpha-1}{N^\alpha}\sum_{i=1}^N\left(\sum_{j=1}^N \kappa\big(e(n-i)-e(n-j)\big)\right)^{\alpha-2}\sum_{j=1}^N \kappa'\big(e(n-i)-e(n-j)\big)\big(x_k(n-j)-x_k(n-i)\big).$$

Comparing Quadratic Entropy and MSE
• The IP does not yield a convex performance surface, even for the FIR. For adaptation problems that yield zero or small errors there is a parabolic approximation around the minimum.
• The largest eigenvalue of the second-order approximation of the performance surface is smaller than for MSE (it approaches zero for large kernel sizes), so step sizes can be larger while still converging.

[Figure: weight tracks on the contours of the information potential in the (w1, w2) plane.]

Consider for simplicity the 1-D case and approximate the Gaussian kernel by its second-order Taylor series, $G_\sigma(x) = c\,e^{-x^2/2\sigma^2} \approx c\,(1 - x^2/2\sigma^2)$. Then
$$\max \hat V_{2,\sigma}(e) \approx \frac{1}{N^2}\sum_i\sum_j c\left(1 - \frac{(e_i-e_j)^2}{2\sigma^2}\right) = c - \frac{c}{2\sigma^2 N^2}\sum_i\sum_j (e_i-e_j)^2,$$
and
$$\min\frac{1}{2\sigma^2 N^2}\sum_i\sum_j (e_i-e_j)^2 = \min\frac{1}{\sigma^2 N}\left(\sum_i e_i^2 - \frac{1}{N}\Big(\sum_i e_i\Big)^2\right), \qquad \mathrm{MSE}(e) = \frac{1}{N}\sum_i e_i^2.$$
When the error is small with respect to the kernel size, quadratic entropy training is therefore equivalent to a biased MSE.
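A minimal Python sketch of the batch MEE gradient just derived, applied to an FIR filter in a system-identification setting; the data, noise model, kernel size and step size are illustrative assumptions of mine, not the authors' experiment.

    import numpy as np

    def mee_fir_train(x, d, n_taps=4, sigma=1.0, lr=0.5, n_epochs=150):
        """Batch MEE: gradient ascent on the information potential V2 of the errors
        of an FIR filter. A sketch of the pairwise gradient described above."""
        N = len(x) - n_taps
        # Input matrix X[n, k] = x(n - k) for the FIR taps, aligned with the desired signal.
        X = np.array([x[n + n_taps - 1 - np.arange(n_taps)] for n in range(N)])
        dd = d[n_taps - 1 : n_taps - 1 + N]
        w = np.zeros(n_taps)
        for _ in range(n_epochs):
            e = dd - X @ w
            de = e[:, None] - e[None, :]                  # e_i - e_j
            G = np.exp(-de ** 2 / (4 * sigma ** 2))       # kernel of width sigma*sqrt(2), unnormalized
            dX = X[:, None, :] - X[None, :, :]            # x_k(i) - x_k(j)
            grad = ((G * de)[:, :, None] * dX).sum(axis=(0, 1)) / (2 * sigma ** 2 * N ** 2)
            w += lr * grad                                # ascend V2 = descend the error entropy
        return w

    # Illustrative unknown FIR system with heavy-tailed observation noise.
    np.random.seed(0)
    x = np.random.randn(500)
    w_true = np.array([1.0, 0.5, -0.3, 0.1])
    d = np.convolve(x, w_true)[: len(x)] + 0.05 * np.random.standard_t(df=2, size=len(x))
    print(mee_fir_train(x, d))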
Comparing Quadratic Entropy and MSE (continued)
The kernel size produces a dilation in weight space, i.e., it controls the region of weight space where the second-order approximation of the entropy cost function is valid.

Implications of Entropy Learning
We have shown theoretically that:
• Regardless of the entropy order, increasing the kernel size widens the valley around the optimal solution by decreasing the absolute values of the (negative) eigenvalues of the Hessian of the information-potential criterion.
• The effect of the entropy order on the eigenvalues of the Hessian depends on the value of the kernel size.
The batch algorithm just presented is called the Minimum Error Entropy (MEE) algorithm. The cost function is totally independent of the mapper, so it can be applied generically. The algorithm is batch and has a complexity of O(N²), because estimating entropy requires pairwise interactions.

Entropy Learning Algorithms
MSE criterion → error entropy criterion counterparts:
• Steepest descent → Minimum Error Entropy (MEE), MEE-RIG (recursive information gradient)
• LMS → MEE-SIG (stochastic information gradient)
• LMF → MEE-SAS (self-adjusting step size)
• NLMS → NMEE
• RLS → MEE fixed point

MEE - Stochastic Information Gradient (SIG)
Dropping the outer expectation E[·] and substituting the required pdf by its Parzen estimate over the most recent M samples, the information potential estimate at time n becomes
$$\hat V\big(e(n)\big) = \frac{1}{M}\sum_{i=n-M}^{n-1}\kappa_\sigma\big(e(n)-e(i)\big).$$
Substituting in the gradient equation,
$$\frac{\partial \hat V_\alpha\big(e(n)\big)}{\partial w_k} = \frac{\alpha-1}{M^{\alpha-1}}\left(\sum_{i=n-M}^{n-1}\kappa\big(e(n)-e(i)\big)\right)^{\alpha-2}\sum_{i=n-M}^{n-1}\kappa'\big(e(n)-e(i)\big)\big(x_k(i)-x_k(n)\big),$$
and for α = 2 with Gaussian kernels,
$$\frac{\partial \hat V\big(e(n)\big)}{\partial w_k} = \frac{1}{2\sigma^2 M}\sum_{i=n-M}^{n-1} G_\sigma\big(e(n)-e(i)\big)\big(e(n)-e(i)\big)\big(x_k(i)-x_k(n)\big).$$
We have shown that for the linear case SIG converges in the mean to the true value of the gradient, so it has the same properties as LMS.

SIG and Shannon's Entropy
The SIG has another very interesting property: applied to Renyi's entropy, it yields an estimator that is on average the gradient of Shannon's entropy,
$$\frac{\partial \hat H_{\alpha,n}}{\partial w_k} = \frac{\sum_{i=n-M}^{n-1}\kappa_\sigma'\big(e(n)-e(i)\big)\big(x_k(n)-x_k(i)\big)}{\sum_{i=n-M}^{n-1}\kappa_\sigma\big(e(n)-e(i)\big)},$$
since $\hat H_{S,n} = -\log\left(\frac{1}{L}\sum_{i=n-L}^{n-1}\kappa_\sigma\big(e(n)-e(i)\big)\right)$ estimates $H_S(e) = -E[\log f_e(e)]$ and
$$E\!\left[\frac{\partial \hat H_{S,n}}{\partial w_k}\right] = E\!\left[\frac{\sum_{i=n-L}^{n-1}\kappa_\sigma'\big(e(n)-e(i)\big)\big(x_k(n)-x_k(i)\big)}{\sum_{i=n-L}^{n-1}\kappa_\sigma\big(e(n)-e(i)\big)}\right].$$

SIG for Supervised Linear Filters
• We can derive an "LMS-like" algorithm using Renyi's entropy: $w_k(n+1) = w_k(n) - \eta\,\frac{\partial H(X)}{\partial w_k}$.
• For α = 2, Gaussian kernels ($G_\sigma'(x) = -x\,G_\sigma(x)/\sigma^2$) and M = 1 we get
$$\frac{\partial \hat H_2(X)}{\partial w_k} = -\frac{1}{2\sigma^2}\,(e_n - e_{n-1})\big(x_k(n) - x_k(n-1)\big).$$

Relation Between SIG and Hebbian Learning
• For Gaussian kernels and M = 1, the expression for maximizing the output entropy of a linear combiner also becomes very simple:
$$\frac{\partial \hat H(y_n)}{\partial w} = -\frac{G_\sigma'(y_n - y_{n-1})}{G_\sigma(y_n - y_{n-1})}\,(x_n - x_{n-1}) = \frac{1}{\sigma^2}\,(y_n - y_{n-1})(x_n - x_{n-1}).$$
We see that SIG gives rise to a sample-by-sample adaptation rule that is Hebbian between consecutive samples!

Does SIG Work?
Consider a two-dimensional random variable (d = 2) whose x-axis component is uniform and whose y-axis component is Gaussian, with sample covariance equal to the identity matrix.
[Figure: scatter plot of the 2-D samples.]
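Before the results, here is a minimal Python sketch of the supervised MEE-SIG update described a few slides back: an LMS-like, sample-by-sample rule using the last M errors. Parameter values, the buffer handling, and the toy system are my own illustrative assumptions.

    import numpy as np

    def sig_filter(x, d, n_taps=4, sigma=0.5, M=10, lr=0.05):
        """Online MEE-SIG for an FIR filter: ascend the stochastic information
        gradient of V2 computed over the last M errors (alpha = 2, Gaussian kernel)."""
        w = np.zeros(n_taps)
        err_buf, tap_buf = [], []
        for n in range(n_taps - 1, len(x)):
            u = x[n - n_taps + 1 : n + 1][::-1]          # tap vector [x(n), ..., x(n-L+1)]
            e = d[n] - w @ u
            if err_buf:
                de = e - np.array(err_buf)               # e(n) - e(i)
                du = np.array(tap_buf) - u               # x(i) - x(n)
                G = np.exp(-de ** 2 / (2 * sigma ** 2))  # Gaussian kernel values
                grad = (G * de) @ du / (2 * sigma ** 2 * len(err_buf))
                w += lr * grad                           # ascend the IP = descend the error entropy
            err_buf.append(e)
            tap_buf.append(u)
            if len(err_buf) > M:
                err_buf.pop(0)
                tap_buf.pop(0)
        return w

    np.random.seed(1)
    x = np.random.randn(2000)
    d = np.convolve(x, [1.0, -0.5, 0.25])[: len(x)]      # unknown FIR system to identify
    print(sig_filter(x, d, n_taps=3))                    # weights should approach [1.0, -0.5, 0.25]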
Does SIG Work? (continued)
• We generated 50 samples of the 2-D distribution from the previous slide and adapted a single projection y = w1 x1 + w2 x2, constrained to w1² + w2² = 1, by maximizing the output entropy with SIG.
• [Figures: output entropy and direction of the weight vector versus training epochs, for Gaussian- and Cauchy-distributed components.]
• PCA would converge to an arbitrary direction (the covariance is the identity), but SIG consistently found the 90-degree direction, i.e., the Gaussian (maximum-entropy) component.

Adaptation of Linear Systems with Divergence
Exemplified for QMI-ED:
$$\hat V_{ED} = V_J - 2V_C + V_1V_2 = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N v_1(i,j)\,v_2(i,j) - \frac{2}{N}\sum_{i=1}^N v_1(i)\,v_2(i) + V_1V_2.$$
Taking the sensitivity with respect to the weights,
$$\frac{\partial V_{ED}}{\partial w_{kj}} = \frac{\partial V_{ED}}{\partial y_j(n)}\frac{\partial y_j(n)}{\partial w_{kj}} = \left(\frac{\partial V_J}{\partial y_j(n)} - 2\frac{\partial V_C}{\partial y_j(n)} + \frac{\partial (V_1V_2)}{\partial y_j(n)}\right)\frac{\partial y_j(n)}{\partial w_{kj}}.$$
This is a straightforward extension, because the potential fields and their gradients add up.

MEE for Nonlinear Systems
Consider the error signal and think of the information particles as errors of a nonlinear mapper (such as an MLP). How can we train the MLP? Use the information force as the injected error, and then apply the backpropagation algorithm:
$$\frac{\partial J}{\partial w_{ij}} = \sum_{p=1}^{k}\sum_{n=1}^{N}\frac{\partial J}{\partial e_p(n)}\frac{\partial e_p(n)}{\partial w_{ij}}.$$
This methodology extends naturally to the IP.

Global Optimization by Annealing the Kernel Size
We have a way to avoid local minima in non-convex performance surfaces:
1. Start with a large kernel size and adapt to reach the minimum.
2. Decrease the kernel size to decrease the bias and adapt again.
3. Repeat until the kernel size is compatible with the data.
Kernel-size annealing is equivalent to the method of convolution smoothing in global optimization. Hence, as long as the annealing rate is slow enough, the information potential provides a way to avoid local minima and reach the global minimum.

Advanced Search Methods
If advanced search methods are needed, care must be taken when extending them to the error entropy criterion. The problems are basically related to the definition of trust regions and the adaptation of the kernel size. We have studied the scaled conjugate gradient method and the Levenberg-Marquardt algorithm and implemented modifications that lead to consistent results.

Fast Gauss Transform
The Fast Gauss Transform (FGT) is an example of a fast algorithm to approximately compute matrix-vector products A d whose matrix elements are $a_{ij} = \phi(x_i - x_j)$, with φ a Gaussian function. The basic idea is to cluster the data and expand the Gaussian in Hermite polynomials to divide and conquer the complexity (multipole method):
$$\exp\!\left(-\frac{(y_j - y_i)^2}{4\sigma^2}\right) \approx \sum_{n=0}^{p-1}\frac{1}{n!}\left(\frac{y_i - y_C}{2\sigma}\right)^n h_n\!\left(\frac{y_j - y_C}{2\sigma}\right), \qquad h_n(y) = (-1)^n\frac{d^n}{dy^n}e^{-y^2}.$$
A greedy clustering algorithm, farthest-point clustering, is normally used because it can be computed in O(kN) time (k = number of clusters). The information potential for 1-D data becomes, up to the kernel normalization constant,
$$\hat V(y) \propto \frac{1}{N^2}\sum_{j=1}^N\sum_B\sum_{n=0}^{p-1}\frac{1}{n!}\,C_n(B)\,h_n\!\left(\frac{y_j - y_{C_B}}{2\sigma}\right), \qquad C_n(B) = \sum_{y_i\in B}\left(\frac{y_i - y_{C_B}}{2\sigma}\right)^n,$$
and this can be computed in O(kpN) time (p is the degree of the approximation).

Where Does ITL Present Advantages?
ITL presents advantages when the signals one is dealing with are non-Gaussian. This occurs in two major areas of signal processing: non-Gaussian noise (outliers) and nonlinear filtering.

Error Entropy Criterion and M-Estimation
Let us review the mean square error. We are interested in quantifying how different two random variables are, so what we normally do is compute
$$E\big[(x-y)^2\big] = \int\!\!\int (x-y)^2\,p(x,y)\,dx\,dy,$$
and we hope that the pdf decreases exponentially away from the x = y line!
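Returning to the kernel-annealing procedure described above, here is a minimal sketch of the wrapper loop in Python. The inner step reuses the batch MEE gradient for a linear model purely for brevity; the same schedule applies to nonlinear mappers, and the schedule values, step-size scaling, and function names are illustrative assumptions of mine.

    import numpy as np

    def ip_gradient_step(w, X, d, sigma, lr):
        """One batch MEE step at kernel size sigma (same pairwise gradient as before);
        the step is scaled by sigma^2 to keep the update stable as the kernel shrinks."""
        e = d - X @ w
        de = e[:, None] - e[None, :]
        G = np.exp(-de ** 2 / (4 * sigma ** 2))           # kernel of width sigma*sqrt(2)
        dX = X[:, None, :] - X[None, :, :]
        grad = ((G * de)[:, :, None] * dX).sum(axis=(0, 1)) / (2 * sigma ** 2 * len(e) ** 2)
        return w + lr * sigma ** 2 * grad

    def mee_with_annealing(X, d, sigma_start=2.0, sigma_end=0.1, n_stages=8, steps=100, lr=0.5):
        """Kernel-size annealing: start with a large sigma (smooth, nearly convex cost),
        adapt, shrink sigma geometrically, and adapt again."""
        w = np.zeros(X.shape[1])
        for sigma in np.geomspace(sigma_start, sigma_end, n_stages):
            for _ in range(steps):
                w = ip_gradient_step(w, X, d, sigma, lr)
        return w

    np.random.seed(0)
    X = np.random.randn(300, 3)
    d = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * np.random.randn(300)
    print(mee_with_annealing(X, d))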
Error Entropy Criterion and M-Estimation: Correntropy
Let us define a new criterion, the correntropy criterion,
$$\hat V(X,Y) = \frac{1}{N}\sum_{i=1}^N \kappa_\sigma(x_i - y_i) = \frac{1}{N}\sum_{i=1}^N \kappa_\sigma(e_i).$$
When we maximize correntropy we are maximizing the probability density of the error at the origin of the space, since using Parzen windows we obtain
$$\hat p_E(e) = \frac{1}{N}\sum_{i=1}^N \kappa_\sigma(e - e_i),$$
and evaluating it at the origin e = 0 gives $\hat V(X,Y) = \hat p_E(0)$.

The error correntropy criterion is very much related to the error entropy criterion: if we take all first-order differences of the samples, $dx_{ij} = x_i - x_j$, and construct the vectors $DX = (dx_{11}, dx_{12}, \dots, dx_{21}, dx_{22}, \dots, dx_{NN})$ and likewise $DY$, then the correntropy between the difference vectors is
$$\hat V(DX, DY) = \frac{1}{N^2}\sum_{j=1}^N\sum_{i=1}^N \kappa_\sigma(dx_{ij} - dy_{ij}),$$
and since $dx_{ij} - dy_{ij} = (x_i - x_j) - (y_i - y_j) = (x_i - y_i) - (x_j - y_j) = e_i - e_j$, we obtain
$$\hat V(DX, DY) = \frac{1}{N^2}\sum_{j=1}^N\sum_{i=1}^N \kappa_\sigma(e_i - e_j) = IP(E).$$

Correntropy Induced Metric (CIM)
The interesting point is that correntropy induces a metric while error entropy does not. A metric satisfies: (1) non-negativity, $d(X,Y)\ge 0$; (2) identity, d(x,y) = 0 if and only if x = y; (3) symmetry, d(x,y) = d(y,x); (4) the triangle inequality, $d(X,Z)\le d(X,Y)+d(Y,Z)$. We define the correntropy induced metric as
$$CIM(X,Y) = \big(V(0,0) - V(X,Y)\big)^{1/2}.$$
[Figure: in two dimensions, the contours of the CIM distance to the origin are circular near the origin and saturate far from it.]

Correntropy as M-Estimation
Let us place the error correntropy criterion in the framework of M-estimation (Huber). Define $\rho(e) = \big(1 - \exp(-e^2/2\sigma^2)\big)/\sqrt{2\pi}\sigma$. Then (1) ρ(e) ≥ 0; (2) ρ(0) = 0; (3) ρ(e) = ρ(-e); (4) ρ(e_i) ≥ ρ(e_j) for |e_i| ≥ |e_j|. Maximizing correntropy is therefore equivalent to the M-estimation problem
$$\min\sum_{i=1}^N \rho(e_i),$$
which is a weighted least squares problem, $\min\sum_{i=1}^N w(e_i)\,e_i^2$, with $w(e) = \exp(-e^2/2\sigma^2)/\sqrt{2\pi}\,\sigma^3$.

Error Entropy as M-Estimation
Alternatively, if we define $IPM(X,Y) = \big(V(0,0) - IP(e)\big)^{1/2}$ we obtain a pseudo-metric, and MEE is equivalent to the M-estimation problem
$$\min\sum_{j=1}^N\sum_{i=1}^N \rho(de_{ij}).$$

Case Study: Regression with Outliers
Assume the data are generated by a linear model corrupted by noise drawn from the mixture of Gaussians $p_Z(z) = 0.9\,N(0, 0.1) + 0.1\,N(4, 0.1)$. Fifty Monte Carlo runs were performed with a linear model (known order and best kernel size).

EXAMPLE: Revisiting Adaptive Filtering
Prediction setup: the input x(n) feeds both the unknown system and a TDNN; the difference of their outputs is the error e(n) used by the criterion. The TDNN has 6 delays in the input layer, 3 hidden PEs in the hidden layer, and 1 linear PE in the output layer.

Mackey-Glass Time Series Prediction
Mackey-Glass time series with delay τ = 30:
$$\frac{dx(t)}{dt} = -0.1\,x(t) + \frac{0.2\,x(t-\tau)}{1 + x^{10}(t-\tau)}.$$

Nonlinear Prediction Training
Two methods are compared: (1) minimization of MSE; (2) minimization of the quadratic Renyi's error entropy (QREE). It has been shown analytically that minimization of QREE is equivalent to minimizing the divergence between the desired signal and the output of the mapper, and that the Parzen estimator preserves the extrema of the cost function.

Training/Testing Details
The number of hidden-layer PEs was chosen for best performance. Training used 200 samples of MG30, conjugate gradient optimization, stopping based on cross-validation, and 1000 initial conditions (the best final error was kept). The kernel size was set at σ = 0.01. Testing is done on 10,000 new samples.
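For the regression-with-outliers case study above, here is a minimal Python sketch of correntropy-based fitting solved as an iteratively reweighted least squares problem, with the Gaussian weights downweighting the outliers; the data generation and kernel size are illustrative choices, not the authors' Monte Carlo setup.

    import numpy as np

    def mcc_regression(X, y, sigma=0.5, n_iters=30):
        """Maximum-correntropy linear regression via iteratively reweighted least squares:
        weights w(e) = exp(-e^2 / 2 sigma^2) suppress large (outlier) residuals."""
        w = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary LS as the starting point
        for _ in range(n_iters):
            e = y - X @ w
            q = np.exp(-e ** 2 / (2 * sigma ** 2))        # per-sample correntropy weights
            Xq = X * q[:, None]
            w = np.linalg.solve(X.T @ Xq, Xq.T @ y)       # weighted normal equations
        return w

    # Regression corrupted by 10% large positive outliers, in the spirit of the case study.
    np.random.seed(0)
    N = 200
    X = np.column_stack([np.random.randn(N), np.ones(N)])
    noise = np.where(np.random.rand(N) < 0.1,
                     4 + 0.1 * np.random.randn(N),
                     0.1 * np.random.randn(N))
    y = X @ np.array([2.0, -1.0]) + noise
    print("LS :", np.linalg.lstsq(X, y, rcond=None)[0])
    print("MCC:", mcc_regression(X, y))                   # much closer to [2.0, -1.0]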
Mackey-Glass Results
[Figure: amplitude histograms (probability densities) of the original signal and of the predictions trained with MSE and with the entropy criterion; the entropy-trained predictor matches the data density more closely.]
[Figure: error distributions for MSE and entropy training; the entropy-trained errors are more concentrated around zero.]
[Figure: effect of kernel annealing on the training error across iterations.]

Example II: Optimal Feature Extraction
Question: how do we project data to a subspace while preserving discriminability? Answer: by maximizing the mutual information between the desired responses and the output of the (nonlinear) mapper.

The feature extractor and the classifier can be trained independently or jointly; which is better? Another goal is to find out which method of feature extraction produces better classification.

There are two possible ITL methods for feature extraction:
1. Maximizing the mutual information between the feature extractor output and the class labels, using QMI-ED or QMI-CS.
2. Approximating mutual information by a difference of entropy terms.
The advantage of the latter is that we do not need to estimate the joint distribution; the computation is less intensive and perhaps more accurate.

We consider classifiers that are invariant under invertible linear transformations, to reduce the number of free parameters. Using the IP together with the approximation of MI yields the MRMI-SIG algorithm, whose rotation angles are updated with a SIG-type gradient rule.

Two different classifiers are used: Bayes-G (Gaussian) and Bayes-NP (nonparametric, using Parzen). Two methods of training are compared:
1. Training the feature extractor first, using PCA, MRMI, or QMI-ED.
2. Training both together, using minimum classification error (MCE), MSE, or feature ranking on a validation set (Fr-V).
The data are several sets from the UCI repository.
[Tables: classification results on the UCI data sets.]

Classification with QMI (2-D Feature Space)
The projector weights are chosen as $w_{opt} = \arg\max_w I(y; d)$, where y is the projection of the input images x and d is the class identity; the information potential field supplies the forces that are back-propagated through the mapper.

SAR / Automatic Target Recognition
MSTAR public-release database, three-class problem: BMP2, BTR70, T72. The input images are 64x64, the output space is 2-D, and a likelihood-ratio classifier is computed in the output space.

Confusion matrix (counts):
              BMP2   BTR70   T72
    BMP2       289      2     19
    BTR70        3    104      0
    T72          8      5    294
Comparison of classification accuracy (Pcc): ITL 94.89%, SVM 94.60%, templates 90.40%.
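A toy Python sketch of the QMI-based feature-extraction idea: learn a single linear feature by gradient ascent on the ED-QMI between the projection and the class labels (labels enter through a 0/1 co-membership kernel). This is inspired by the approach above, not the exact MRMI-SIG update; the data, kernel size and step rule are my own illustrative assumptions.

    import numpy as np

    def qmi_ed_projection(X, labels, sigma=0.5, lr=0.05, n_epochs=400):
        """Learn w for y = X @ w by ascending the ED-QMI between y and the labels."""
        N, dim = X.shape
        L = (labels[:, None] == labels[None, :]).astype(float)        # label co-membership l_ij
        c = L - L.mean(axis=1, keepdims=True) - L.mean(axis=0, keepdims=True) + L.mean()
        w = np.random.randn(dim)
        w /= np.linalg.norm(w)
        s2 = 2 * sigma ** 2
        for _ in range(n_epochs):
            y = X @ w
            dy = y[:, None] - y[None, :]
            G = np.exp(-dy ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)  # v_ij for the projection
            force = -2 * (c * G * dy).sum(axis=1) / (s2 * N ** 2)      # dI_ED/dy_i (information force)
            g = force @ X                                              # chain rule through y = Xw
            w += lr * g / (np.linalg.norm(g) + 1e-12)                  # normalized-gradient ascent step
            w /= np.linalg.norm(w)                                     # keep the projection normalized
        return w

    # Illustrative 2-class data whose discriminative direction is the second coordinate.
    np.random.seed(0)
    X0 = np.random.randn(100, 3)
    X1 = np.random.randn(100, 3); X1[:, 1] += 3.0
    X = np.vstack([X0, X1]); labels = np.array([0] * 100 + [1] * 100)
    print(qmi_ed_projection(X, labels))   # should align (up to sign) with the second coordinate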
Clustering Evaluation Function
Perhaps the best area in which to apply ITL concepts is unsupervised learning, where the structure of the data is paramount for the analysis goal. Indeed, most clustering, vector quantization and even projection algorithms use some form of similarity metric, and ITL can provide similarity beyond second-order statistics.

As mentioned earlier, $\int p(x)q(x)\,dx$ is called the cross information potential, and it measures a form of "distance" between p(x) and q(x). Using the Parzen estimator this yields
$$CIP(p,q) = \frac{1}{N_1N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} G\big(x_i - x_j, 2\sigma^2\big), \qquad x_i \sim p(x),\; x_j \sim q(x).$$
This can be written in a more condensed way with a membership function, and it is called the Clustering Evaluation Function (CEF):
$$CEF(p,q) = \frac{1}{2N_1N_2}\sum_{i=1}^{N}\sum_{j=1}^{N} M(x_i, x_j)\,G_{\sigma\sqrt2}(x_i - x_j), \qquad M(x_i,x_j) = M_1(x_i)\,M_2(x_j), \qquad M_1(x_i) = 1 \iff x_i \in p(x).$$
Remember that the CEF is the numerator of the argument of the log in $D_{CS}$:
$$D_{CEFnorm}(p,q) = -\ln\frac{\int p(x)q(x)\,dx}{\sqrt{\int p^2(x)\,dx\,\int q^2(x)\,dx}} = -\ln\frac{CEF(p,q)}{\sqrt{\int p^2(x)\,dx\,\int q^2(x)\,dx}}.$$
If we use just the numerator it is NOT a distance, but for evaluating cluster assignments it can be used for simplicity.
[Figure: comparison of clustering criteria (CEF, Renyi, J-divergence, normalized CEF, Chernoff, Bhattacharyya) on an example.]

Example: clustering assignment with the CEF on synthetic data, with the kernel variance selected for best results. Results on the Iris data (confusion matrix):
              Class1   Class2   Class3
    Class1       50        0        0
    Class2        0       42        8
    Class3        0        6       44

The information potential can also be used as preprocessing to help image segmentation in brain MRI (low-contrast imagery). The resulting image is then clustered into three tissue classes:
                  white (%)   gray (%)   CSF (%)
    5.5 years        31.45       57.97     10.56
    7.5 years        33.37       55.73     10.89
    8 years          36.39       54.21      9.39
    10.2 years       37.60       50.87     11.51

Clustering Based on D_CS
We also developed a sample-by-sample clustering algorithm based on the Cauchy-Schwarz distance, using a Lagrange-multiplier formulation that ends up being a variable-step-size algorithm for each direction of the space. The optimization is constrained to valid membership assignments; in order to use Lagrange multipliers we construct a smooth membership function, obtained as mi(k) = vi(k)², for which the two optimization problems are equivalent, and we solve the resulting Lagrangian. A fixed-point algorithm solves for the multipliers λi, and the solution is independent of the order of presentation. To avoid local minima, kernel annealing is required.
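To close, here is a minimal Python sketch of the cross information potential and a Cauchy-Schwarz-style cluster evaluation between two sample clusters, in the spirit of the CEF above; the data and kernel size are illustrative choices of mine.

    import numpy as np

    def cross_information_potential(A, B, sigma):
        """CIP(p, q) = (1/(N1*N2)) sum_i sum_j G(a_i - b_j, 2*sigma^2) between two
        clusters of d-dimensional samples A (N1 x d) and B (N2 x d)."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)      # squared pairwise distances
        s2 = 2 * sigma ** 2
        dim = A.shape[1]
        return np.mean(np.exp(-d2 / (2 * s2)) / (2 * np.pi * s2) ** (dim / 2))

    def cs_cluster_cost(A, B, sigma):
        """Normalized (Cauchy-Schwarz style) evaluation: -log CIP(A,B)/sqrt(V(A)V(B)).
        A smaller CIP (well-separated clusters) gives a larger value."""
        cip = cross_information_potential(A, B, sigma)
        va = cross_information_potential(A, A, sigma)                 # within-cluster potentials
        vb = cross_information_potential(B, B, sigma)
        return -np.log(cip / np.sqrt(va * vb))

    # Two illustrative Gaussian blobs; the better-separated pair scores higher.
    np.random.seed(0)
    A = np.random.randn(100, 2)
    B = np.random.randn(100, 2) + np.array([4.0, 0.0])
    print(cs_cluster_cost(A, B, sigma=1.0), cs_cluster_cost(A, A + 0.1, sigma=1.0))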