Correntropy as a Generalized Similarity Measure
Jose C. Principe
Computational NeuroEngineering Laboratory, Electrical and Computer Engineering Department, University of Florida
www.cnel.ufl.edu, principe@cnel.ufl.edu

Acknowledgments
Dr. John Fisher, Dr. Dongxin Xu, Dr. Ken Hild, Dr. Deniz Erdogmus
My students: Puskal Pokharel, Weifeng Liu, Jianwu Xu, Kyu-Hwa Jeong, Sudhir Rao, Seungju Han
NSF ECS-0300340 and 0601271 (Neuroengineering program) and DARPA

Outline
• Information Theoretic Learning
• A RKHS for ITL
• Correntropy as Generalized Correlation
• Applications: matched filtering, Wiener filtering and regression, nonlinear PCA

Information Filtering: From Data to Information
Information filters: given data pairs {x_i, d_i}, adapt an optimal system f(x, w) using information measures. [Diagram: data x feeds the adaptive system, whose output is compared with the desired data d; the error e drives an information measure used for adaptation.]
• Embed information in the weights of the adaptive system.
• More formally, use optimization to perform Bayesian estimation.
Erdogmus D., Principe J., "From Linear Adaptive Filtering to Nonlinear Information Processing", IEEE Signal Processing Magazine, November 2006.
CNEL website, ITL resource: www.cnel.ufl.edu

What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence. Its centerpiece is a non-parametric estimator of entropy that:
• does not require an explicit estimate of the pdf;
• uses the Parzen window method, which is known to be consistent and efficient;
• is smooth;
• is readily integrated in conventional gradient-descent learning;
• provides a link between information theory and kernel learning.

ITL is a different way of thinking about data quantification
Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb's postulate of learning) into second-order statistical equivalents. ITL replaces second-order moments with a geometric, statistical interpretation of data in probability spaces:
• variance by entropy;
• correlation by correntropy;
• mean square error (MSE) by minimum error entropy (MEE);
• distances in data space by distances in probability spaces.
ITL fully exploits the structure of reproducing kernel Hilbert spaces (RKHS).

Information Theoretic Learning: Entropy
Not all random variables (r.v.) are equally random! [Figure: two densities f_X(x), one broad and one concentrated.] Entropy quantifies the degree of uncertainty in a r.v. Claude Shannon defined entropy as
$$H_S(X) = -\sum_x p_X(x)\log p_X(x), \qquad H_S(X) = -\int f_X(x)\log f_X(x)\,dx.$$

Information Theoretic Learning: Renyi's Entropy
$$H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_x p_X^\alpha(x), \qquad H_\alpha(X) = \frac{1}{1-\alpha}\log\int f_X^\alpha(x)\,dx.$$
Renyi's entropy equals Shannon's as $\alpha \to 1$. It can be written through the $\alpha$-norm of the pmf/pdf, $\|p\|_\alpha = \big(\sum_k p_k^\alpha\big)^{1/\alpha}$. Defining the information potential
$$V_\alpha(X) = \sum_k p_k^\alpha = \|p\|_\alpha^\alpha, \qquad V_\alpha(X) = \int f_X^\alpha(x)\,dx,$$
Renyi's entropy becomes $H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha(X)$, i.e. the logarithm of the $\alpha$-norm of the pdf raised to the power $\alpha$.

Information Theoretic Learning: Norm of the PDF (Information Potential)
$V_2(X)$, built from the 2-norm of the pdf (the information potential), is one of the central concepts in ITL.
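Two of the claims above — that $V_2$ is the expected value of the density, and that Renyi's entropy reduces to Shannon's as $\alpha \to 1$ — follow from standard one-line arguments, reproduced here only as a reading aid:
$$V_2(X) = \int f_X^2(x)\,dx = \int f_X(x)\,f_X(x)\,dx = E_X\big[f_X(X)\big],$$
and, by L'Hôpital's rule applied to the ratio defining $H_\alpha$,
$$\lim_{\alpha\to 1} H_\alpha(X) = \lim_{\alpha\to 1}\frac{\log\int f_X^\alpha(x)\,dx}{1-\alpha} = \lim_{\alpha\to 1}\frac{\int f_X^\alpha(x)\log f_X(x)\,dx\,\big/\int f_X^\alpha(x)\,dx}{-1} = -\int f_X(x)\log f_X(x)\,dx = H_S(X).$$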
This is why the expected value of the pdf, $V_2 = E[p(x)]$, and its estimators sit at the center of ITL: they connect information theory (Renyi's entropy), cost functions for adaptive systems, and reproducing kernel Hilbert spaces.

Information Theoretic Learning: Parzen windowing
Given only samples $\{x_1,\dots,x_N\}$ drawn from a distribution $p(x)$,
$$\hat p(x) = \frac{1}{N}\sum_{i=1}^N G_\sigma(x-x_i),$$
where $G_\sigma$ is the (Gaussian) kernel function. Convergence: $\hat p(x) \to p(x) * G_{\sigma(N)}(x)$ as $N\to\infty$, provided $\sigma(N)\to 0$ and $N\sigma(N)\to\infty$. [Figure: Parzen estimates for N = 10 and N = 1000 samples compared with the true density f_X(x).]

Information Theoretic Learning: Information Potential
Order-2 entropy and Gaussian kernels:
$$\hat H_2(X) = -\log\int\hat p^2(x)\,dx = -\log\int\Big(\frac1N\sum_{i=1}^N G_\sigma(x-x_i)\Big)^2 dx = -\log\frac{1}{N^2}\sum_{j}\sum_{i}\int G_\sigma(x-x_j)G_\sigma(x-x_i)\,dx = -\log\frac{1}{N^2}\sum_{j}\sum_{i} G_{\sigma\sqrt2}(x_j-x_i).$$
The estimator is built from pairwise interactions between samples, O(N^2). The information potential estimator is $\hat V_2(X) = \frac{1}{N^2}\sum_j\sum_i G_{\sigma\sqrt2}(x_j-x_i)$, and $\hat p(x)$ provides a potential field over the space of the samples, parameterized by the kernel size.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.

Information Theoretic Learning: Central Moment Estimators
• Mean: $E[X] \approx \frac1N\sum_{i=1}^N x_i$
• Variance: $E[(X-E[X])^2] \approx \frac1N\sum_{i=1}^N\big(x_i - \frac1N\sum_j x_j\big)^2$
• 2-norm of the pdf (information potential): $E[f(X)] \approx \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}(x_i-x_j)$
• Renyi's quadratic entropy: $-\log E[f(X)] \approx -\log\Big(\frac{1}{N^2}\sum_i\sum_j G_{\sigma\sqrt2}(x_i-x_j)\Big)$
The kernel size plays the role of a scale parameter.

Information Theoretic Learning: Information Force
In adaptation, samples become information particles that interact through information forces.
• Information potential field: $\hat V_2(x_j) = \frac1N\sum_i G_{\sigma\sqrt2}(x_j - x_i)$
• Information force: $\dfrac{\partial\hat V_2}{\partial x_j} = \frac1N\sum_i G'_{\sigma\sqrt2}(x_j - x_i)$
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000. Erdogmus, Principe, Hild, Natural Computing, 2002.

Information Theoretic Learning: Error Entropy Criterion (EEC)
We use iterative algorithms for the optimization of a linear system with steepest descent,
$$w(n+1) = w(n) + \eta\,\nabla\hat V_2(n).$$
Given a batch of N error samples the information potential is
$$\hat V_2(E) = \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}(e_i - e_j),$$
and for an FIR filter the gradient becomes
$$\nabla_k\hat V_2(n) = \frac{\partial\hat V_2(e(n))}{\partial w_k} = -\frac{1}{2\sigma^2 N^2}\sum_{i=1}^N\sum_{j=1}^N G_{\sigma\sqrt2}\big(e(n-i)-e(n-j)\big)\,\big(e(n-i)-e(n-j)\big)\,\big(x_k(n-j)-x_k(n-i)\big),$$
since $\partial\big(e(n-i)-e(n-j)\big)/\partial w_k = -\big(y(n-i)-y(n-j)\big)/\partial w_k$-type terms reduce to differences of the inputs $x_k$.

Information Theoretic Learning: Backpropagation of Information Forces
[Diagram: input → adaptive system → output; the output and the desired response feed the IT criterion; an adjoint network turns the information forces into weight updates.]
$$\frac{\partial J}{\partial w_{ij}} = \sum_{p=1}^{k}\sum_{n=1}^{N}\frac{\partial J}{\partial e_p(n)}\,\frac{\partial e_p(n)}{\partial w_{ij}}$$
Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.

Information Theoretic Learning: Any α, any kernel
Parzen windowing with an arbitrary kernel: $\hat p(x) = \frac1N\sum_{i=1}^N K_\sigma(x - x_i)$. Recall Renyi's entropy,
$$H_\alpha(X) = \frac{1}{1-\alpha}\log\int p^\alpha(x)\,dx = \frac{1}{1-\alpha}\log E_X\big[p^{\alpha-1}(X)\big].$$
Approximating $E_X[\cdot]$ by the empirical mean gives the nonparametric α-entropy estimator,
$$\hat H_\alpha(X) = \frac{1}{1-\alpha}\log\frac{1}{N^\alpha}\sum_{j=1}^N\Big(\sum_{i=1}^N K_\sigma(x_j - x_i)\Big)^{\alpha-1},$$
again built from pairwise interactions between samples.
Erdogmus, Principe, IEEE Transactions on Neural Networks, 2002.
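To make the pairwise-sum estimators above concrete, here is a minimal NumPy sketch of the quadratic information potential, Renyi's quadratic entropy, and the information forces for one-dimensional samples. The function names and the Silverman kernel-size choice are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Gaussian kernel G_sigma(u) used as the Parzen window."""
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def information_potential(x, sigma):
    """V2_hat = (1/N^2) sum_i sum_j G_{sigma*sqrt(2)}(x_j - x_i).
    The sqrt(2) factor comes from convolving two Gaussian windows."""
    x = np.asarray(x, dtype=float)
    diffs = x[:, None] - x[None, :]           # all pairwise differences, O(N^2)
    return gaussian_kernel(diffs, sigma * np.sqrt(2)).mean()

def renyi_quadratic_entropy(x, sigma):
    """H2_hat = -log V2_hat."""
    return -np.log(information_potential(x, sigma))

def information_forces(x, sigma):
    """dV2/dx_j: derivative of the potential field felt by each sample."""
    x = np.asarray(x, dtype=float)
    s2 = sigma * np.sqrt(2)
    diffs = x[:, None] - x[None, :]           # entry (j, i) is x_j - x_i
    k = gaussian_kernel(diffs, s2)
    # d/dx_j G_{s2}(x_j - x_i) = -(x_j - x_i) / s2^2 * G_{s2}(x_j - x_i)
    return (-(diffs / s2**2) * k).mean(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(size=500)
    sigma = 1.06 * samples.std() * len(samples) ** (-1 / 5)   # Silverman's rule of thumb
    print("V2 ", information_potential(samples, sigma))
    print("H2 ", renyi_quadratic_entropy(samples, sigma))
    print("mean |force|", np.abs(information_forces(samples, sigma)).mean())
```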
Information Theoretic Learning: Quadratic divergence measures
• Kullback-Leibler divergence: $D_{KL}(X;Y) = \int p(x)\log\dfrac{p(x)}{q(x)}\,dx$
• Renyi's divergence: $D_\alpha(X;Y) = \dfrac{1}{\alpha-1}\log\int p(x)\Big(\dfrac{p(x)}{q(x)}\Big)^{\alpha-1}dx$
• Euclidean distance: $D_E(X;Y) = \int\big(p(x)-q(x)\big)^2dx$
• Cauchy-Schwarz divergence: $D_C(X;Y) = -\log\dfrac{\big(\int p(x)q(x)\,dx\big)^2}{\int p^2(x)\,dx\,\int q^2(x)\,dx}$
Mutual information is a special case (the divergence between the joint density and the product of the marginals).

Information Theoretic Learning: How to estimate the Euclidean distance
$$D_E(p;q) = \int\big(p(x)-q(x)\big)^2dx = \int p^2(x)\,dx - 2\int p(x)q(x)\,dx + \int q^2(x)\,dx,$$
which is estimated from samples as
$$\hat D_E(p;q) = \frac{1}{N_p^2}\sum_{i=1}^{N_p}\sum_{j=1}^{N_p}G_{\sigma\sqrt2}(x_i-x_j) - \frac{2}{N_pN_q}\sum_{i=1}^{N_p}\sum_{a=1}^{N_q}G_{\sigma\sqrt2}(x_i-x_a) + \frac{1}{N_q^2}\sum_{a=1}^{N_q}\sum_{b=1}^{N_q}G_{\sigma\sqrt2}(x_a-x_b).$$
$\int p(x)q(x)\,dx$ is called the cross information potential (CIP), so $D_{ED}$ can be readily computed with the information potential; likewise for the Cauchy-Schwarz divergence and for the quadratic mutual information (see the sketch at the end of this kernel-methods review).
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.

Information Theoretic Learning: Unifies supervised and unsupervised learning
[Diagram: input signal X → learning system Y = q(X, W) → output signal Y; an optimization block drives an information measure I(Y, D) or H(Y), with switches selecting the desired signal D.]
• Switch 1: filtering / classification
• Switch 2: feature extraction (also with entropy)
• Switch 3: InfoMax, ICA, clustering, NLPCA

ITL — Applications
Feature extraction, system identification, blind source separation, clustering. The ITL resource at www.cnel.ufl.edu has examples and Matlab code.

Reproducing Kernel Hilbert Spaces as a Tool for Nonlinear System Analysis
Fundamentals of Kernel Methods
Kernel methods are a very important class of algorithms for nonlinear optimal signal processing and machine learning. Effectively they are shallow (one-layer) neural networks (RBFs, for the Gaussian kernel). They exploit the linear structure of reproducing kernel Hilbert spaces (RKHS) with very efficient computation. ANY signal processing algorithm expressed in terms of inner products has, in principle, an equivalent representation in an RKHS, and may correspond to a nonlinear operation in the input space. Solutions may be analytic instead of adaptive when the linear structure is used.

Fundamentals of Kernel Methods: Definition
A Hilbert space is a space of functions f(·). Given a continuous, symmetric, positive-definite kernel $\kappa: U\times U\to\mathbb R$, a mapping Φ, and an inner product $\langle\cdot,\cdot\rangle_H$, a RKHS H is the closure of the span of all Φ(u), with
• reproducing property: $\langle f, \Phi(u)\rangle_H = f(u)$
• kernel trick: $\langle\Phi(u_1),\Phi(u_2)\rangle_H = \kappa(u_1,u_2)$
• induced norm: $\|f\|_H^2 = \langle f, f\rangle_H$

Fundamentals of Kernel Methods: RKHS proposed by Parzen
For a stochastic process {X_t, t ∈ T}, with T an index set, the autocorrelation function $R_X: T\times T\to\mathbb R$ defined as $R_X(t,s) = E[X_tX_s]$ is symmetric and positive definite, and thus induces a RKHS by the Moore-Aronszajn theorem, denoted RKHS_P. Further, there is a kernel mapping Θ such that $R_X(t,s) = \langle\Theta(t),\Theta(s)\rangle_P$. This definition provided a functional-analysis perspective for Gaussian processes; RKHS_P brought understanding but no new results.

Fundamentals of Kernel Methods: RKHS induced by the Gaussian kernel
The Gaussian kernel
$$k(x,x') = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(x-x')^2}{2\sigma^2}\Big)$$
is symmetric and positive definite, and thus induces a RKHS on a sample set {x_1, ..., x_N} of reals, denoted RKHS_K. Further, by Mercer's theorem, a kernel mapping Φ can be constructed which transforms data from the input space to RKHS_K, where $k(x-x_i) = \langle\Phi(x),\Phi(x_i)\rangle_K$ and $\langle\cdot,\cdot\rangle_K$ denotes the inner product in RKHS_K.
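Before moving on to the ITL RKHS, here is the sketch promised above: plug-in estimators of the Euclidean and Cauchy-Schwarz divergences built from Gaussian-kernel information potentials. The function names are illustrative, not from the slides.

```python
import numpy as np

def pairwise_ip(x, y, sigma):
    """(Cross) information potential: (1/(Nx*Ny)) sum_i sum_j G_{sigma*sqrt(2)}(x_i - y_j)."""
    x = np.asarray(x, float)[:, None]
    y = np.asarray(y, float)[None, :]
    s2 = sigma * np.sqrt(2)
    return np.mean(np.exp(-(x - y) ** 2 / (2 * s2**2)) / (np.sqrt(2 * np.pi) * s2))

def euclidean_divergence(x, y, sigma):
    """D_ED(p;q) = int p^2 - 2 int p*q + int q^2, each term estimated from samples."""
    return pairwise_ip(x, x, sigma) - 2 * pairwise_ip(x, y, sigma) + pairwise_ip(y, y, sigma)

def cauchy_schwarz_divergence(x, y, sigma):
    """D_CS(p;q) = -log[(int p*q)^2 / (int p^2 * int q^2)]."""
    vxy = pairwise_ip(x, y, sigma)
    return -np.log(vxy**2 / (pairwise_ip(x, x, sigma) * pairwise_ip(y, y, sigma)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.normal(0.0, 1.0, 400)
    b = rng.normal(1.5, 1.0, 400)
    print("D_ED ", euclidean_divergence(a, b, sigma=0.5))
    print("D_CS ", cauchy_schwarz_divergence(a, b, sigma=0.5))
    print("D_CS(a,a)", cauchy_schwarz_divergence(a, a, sigma=0.5))  # exactly 0 for identical samples
```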
A RKHS for ITL: RKHS induced by the cross information potential
Let E be the set of all square-integrable one-dimensional probability density functions, i.e., $f_i(x)\in E$, $i\in I$, with $\int f_i^2(x)\,dx < \infty$ and I an index set. Form a linear manifold (similar to the simplex) $\sum_i\alpha_i f_i(x)$, close the set, and define a proper inner product
$$\langle f_i(x), f_j(x)\rangle_{L_2} = \int f_i(x)f_j(x)\,dx.$$
$L_2(E)$ is a Hilbert space but it is not reproducing. However, define the bivariate function on $L_2(E)$, the cross information potential (CIP),
$$V(f_i,f_j) = \int f_i(x)f_j(x)\,dx.$$
One can show that the CIP is a positive-definite function, so it defines a RKHS, $H_V$. Moreover, there is a congruence between $L_2(E)$ and $H_V$.

A RKHS for ITL: ITL cost functions in RKHS_V
• Cross information potential (the natural distance in $H_V$): $\int f(x)g(x)\,dx = \langle V(f,\cdot), V(g,\cdot)\rangle_{H_V}$
• Information potential (the norm, i.e. mean square, in $H_V$): $\int f(x)f(x)\,dx = \langle V(f,\cdot),V(f,\cdot)\rangle_{H_V} = \|V(f,\cdot)\|^2$
• Euclidean distance and QMI: $D_{ED}(f,g) = \|V(f,\cdot)-V(g,\cdot)\|^2$ and $QMI_{ED}(f_{1,2}, f_1f_2) = \|V(f_{1,2},\cdot) - V(f_1f_2,\cdot)\|^2$
• Cauchy-Schwarz divergence and QMI: $D_{CS}(f,g) = -\log\dfrac{\langle V(f,\cdot),V(g,\cdot)\rangle_{H_V}}{\|V(f,\cdot)\|\,\|V(g,\cdot)\|} = -\log(\cos\theta)$ and $QMI_{CS}(f_{1,2},f_1f_2) = -\log\dfrac{\langle V(f_{1,2},\cdot),V(f_1f_2,\cdot)\rangle_{H_V}}{\|V(f_{1,2},\cdot)\|\,\|V(f_1f_2,\cdot)\|}$
Second-order statistics in $H_V$ become higher-order statistics of the data (e.g. MSE in $H_V$ includes higher-order statistics of the data). Members of $H_V$ are deterministic quantities even when x is a random variable.

A RKHS for ITL: Relation between ITL and kernel methods through H_V
There is a very tight relationship between $H_V$ and $H_K$: by ensemble averaging of $H_K$ quantities we get estimators of the $H_V$ statistical quantities. Therefore statistics in kernel space can be computed by ITL operators.

A RKHS for ITL: Statistics in RKHS_V
Gretton defined the empirical mean operator
$$\mu(X) = \frac1N\sum_{i=1}^N\Phi(X_i) = \frac1N\sum_{i=1}^N k(\cdot, X_i), \qquad \mu_X(u) = E_X\big[k(X,u)\big] = \sum_{n=0}^{\infty}\frac{(-1)^n}{2^n\sigma^{2n}n!}\,E_X\big[(X-u)^{2n}\big],$$
so μ contains the information on all the even moments (Gaussian kernel) and can be used as a STATISTIC representing the probability law. He also defined the Maximum Mean Discrepancy,
$$\mathrm{MMD} = \|\mu(X)-\mu(Y)\|_H^2 = \frac{1}{N_X^2}\sum_{i=1}^{N_X}\sum_{j=1}^{N_X}k(X_i,X_j) - \frac{2}{N_XN_Y}\sum_{i=1}^{N_X}\sum_{a=1}^{N_Y}k(X_i,Y_a) + \frac{1}{N_Y^2}\sum_{a=1}^{N_Y}\sum_{b=1}^{N_Y}k(Y_a,Y_b).$$
These are respectively $V(f,\cdot)$ and $D_{ED}(f,g)$ of ITL; the ITL estimators yield the same expressions.

Correntropy: A new generalized similarity measure
Correlation is one of the most widely used functions in signal processing, but correlation only quantifies similarity fully if the random variables are Gaussian distributed. Can we define a new function that measures similarity but is not restricted to second-order statistics? Use the kernel framework.

Correntropy: Definition
Define the correntropy of a stationary random process {x_t} as
$$V_x(t,s) = E\big[\kappa_\sigma(x_t - x_s)\big] = \int \kappa_\sigma(x_t - x_s)\,p(x)\,dx.$$
The name correntropy comes from the fact that its average over the lags (or over the dimensions) is the information potential, the argument of Renyi's quadratic entropy. For a strictly stationary and ergodic random process,
$$\hat V_m = \frac1N\sum_{n=1}^N\kappa_\sigma(x_n - x_{n-m}).$$
Correntropy can also be defined for pairs of random variables.
Santamaria I., Pokharel P., Principe J., "Generalized Correlation Function: Definition, Properties and Application to Blind Equalization", IEEE Trans. Signal Processing, vol. 54, no. 6, pp. 2187-2197, 2006.

How does it look like? [Figure: the correntropy function of a sinewave plotted across lags.]
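A toy sketch of the autocorrentropy estimator $\hat V(m)$ applied to a noisy sinewave, which reproduces the kind of plot referenced above; the signal parameters and kernel size are arbitrary choices, not those of the slide.

```python
import numpy as np

def autocorrentropy(x, max_lag, sigma):
    """V_hat(m) = mean_n kappa_sigma(x[n] - x[n-m]) with a Gaussian kernel."""
    x = np.asarray(x, float)
    v = np.empty(max_lag + 1)
    for m in range(max_lag + 1):
        d = x[m:] - x[:x.size - m]            # for m = 0 this is identically zero
        v[m] = np.mean(np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))
    return v

if __name__ == "__main__":
    n = np.arange(2000)
    x = np.sin(2 * np.pi * 0.05 * n) + 0.1 * np.random.default_rng(2).normal(size=n.size)
    v = autocorrentropy(x, max_lag=100, sigma=0.5)
    # V(0) equals kappa(0) = 1/(sqrt(2*pi)*sigma), the maximum at the origin,
    # and the mean over lags approximates the information potential of the
    # process, which is where the name "correntropy" comes from.
    print("V(0) =", v[0], " mean over lags =", v.mean())
```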
Correntropy: Properties
• It has a maximum at the origin ($1/\sqrt{2\pi}\sigma$).
• It is a symmetric positive function.
• Its mean value is the information potential.
• Correntropy includes higher-order moments of the data:
$$V(s,t) = \sum_{n=0}^{\infty}\frac{(-1)^n}{2^n\sigma^{2n}n!}\,E\big[(x_s-x_t)^{2n}\big].$$
• The matrix whose elements are the correntropy at different lags is Toeplitz.

Correntropy as a cost function versus MSE
$$\mathrm{MSE}(X,Y) = E\big[(X-Y)^2\big] = \iint (x-y)^2 f_{XY}(x,y)\,dx\,dy = \int e^2 f_E(e)\,de,$$
$$V(X,Y) = E\big[\kappa(X-Y)\big] = \iint \kappa(x-y) f_{XY}(x,y)\,dx\,dy = \int \kappa(e) f_E(e)\,de.$$

Correntropy induces a metric (CIM) in the sample space,
$$\mathrm{CIM}(X,Y) = \big(V(0,0) - V(X,Y)\big)^{1/2},$$
so correntropy can be used as an alternative similarity criterion in the space of samples.
Liu W., Pokharel P., Principe J., "Correntropy: Properties and Applications in Non-Gaussian Signal Processing", IEEE Trans. Signal Processing, vol. 55, no. 11, pp. 5286-5298, 2007.

Correntropy criterion implements M-estimation of robust statistics
M-estimation is a generalized maximum-likelihood method:
$$\hat\theta = \arg\min_\theta\sum_{i=1}^N\rho(x_i,\theta) \quad\text{or}\quad \sum_{i=1}^N\psi(x_i,\hat\theta_M) = 0.$$
In adaptation, the weighted least-squares problem is $\min\sum_{i=1}^N w(e_i)\,e_i^2$ with $w(e) = \rho'(e)/e$. When
$$\rho(e) = \big(1-\exp(-e^2/2\sigma^2)\big)\big/\sqrt{2\pi}\sigma,$$
this leads to maximizing the correntropy of the error at the origin:
$$\min\sum_{i=1}^N\rho(e_i) = \min\sum_{i=1}^N\frac{1-\exp(-e_i^2/2\sigma^2)}{\sqrt{2\pi}\sigma} \;\Longleftrightarrow\; \max\sum_{i=1}^N\frac{\exp(-e_i^2/2\sigma^2)}{\sqrt{2\pi}\sigma} = \max\sum_{i=1}^N\kappa_\sigma(e_i).$$
[Figure: nonlinear regression with outliers (Middleton noise model) using a polynomial approximator, comparing the fits obtained with the two costs.]

Correntropy: centered correntropy, correntropy coefficient, and parametric versions
Define the centered correntropy
$$U(X,Y) = E_{X,Y}\big[\kappa(X-Y)\big] - E_XE_Y\big[\kappa(X-Y)\big] = \iint \kappa(x-y)\,\{dF_{XY}(x,y) - dF_X(x)\,dF_Y(y)\},$$
$$\hat U(x,y) = \frac1N\sum_{i=1}^N\kappa(x_i-y_i) - \frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N\kappa(x_i-y_j).$$
Define the correntropy coefficient
$$\eta(X,Y) = \frac{U(X,Y)}{\sqrt{U(X,X)\,U(Y,Y)}}, \qquad \hat\eta(x,y) = \frac{\hat U(x,y)}{\sqrt{\hat U(x,x)\,\hat U(y,y)}}.$$
Define the parametric correntropy, with $a,b\in\mathbb R$, $a\neq 0$,
$$V_{a,b}(X,Y) = E_{X,Y}\big[\kappa(aX+b-Y)\big] = \int\kappa(ax+b-y)\,dF_{XY}(x,y),$$
the parametric centered correntropy
$$U_{a,b}(X,Y) = E_{X,Y}\big[\kappa(aX+b-Y)\big] - E_XE_Y\big[\kappa(aX+b-Y)\big],$$
and the parametric correntropy coefficient $\eta_{a,b}(X,Y) = \eta(aX+b, Y)$.
Rao M., Xu J., Seth S., Chen Y., Tagare M., Principe J., "Correntropy Dependence Measure", submitted to Metrika.

Correntropy: Correntropy Dependence Measure
Theorem: Given two random variables X and Y, the parametric centered correntropy $U_{a,b}(X,Y) = 0$ for all a, b in R if and only if X and Y are independent.
Theorem: Given two random variables X and Y, the parametric correntropy coefficient $\eta_{a,b}(X,Y) = 1$ for certain $a = a_0$ and $b = b_0$ if and only if $Y = a_0X + b_0$.
Definition: Given two random variables X and Y, the Correntropy Dependence Measure is defined as
$$\Gamma(X,Y) = \sup_{a,b\in\mathbb R,\ a\neq 0}\ \eta_{a,b}(X,Y).$$
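A small sketch of the estimators $\hat U$ and $\hat\eta$ defined above, with a Gaussian kernel; the kernel size and the test signals are arbitrary choices.

```python
import numpy as np

def _kappa(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def centered_correntropy(x, y, sigma):
    """U_hat = (1/N) sum_i k(x_i - y_i) - (1/N^2) sum_i sum_j k(x_i - y_j)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    joint = _kappa(x - y, sigma).mean()
    marginal = _kappa(x[:, None] - y[None, :], sigma).mean()
    return joint - marginal

def correntropy_coefficient(x, y, sigma):
    """eta_hat = U(x,y) / sqrt(U(x,x) U(y,y)); 1 when y equals x, ~0 under independence.
    General linear relations y = a*x + b are handled by the parametric version."""
    return centered_correntropy(x, y, sigma) / np.sqrt(
        centered_correntropy(x, x, sigma) * centered_correntropy(y, y, sigma))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    x = rng.normal(size=1000)
    print(correntropy_coefficient(x, x + 0.05 * rng.normal(size=x.size), sigma=1.0))  # close to 1
    print(correntropy_coefficient(x, rng.normal(size=x.size), sigma=1.0))             # close to 0
```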
Statistics in RKHS_K
Bach and Jordan defined the kernel generalized variance
$$\mathrm{KGV} = \frac{\det \Sigma_{[XY][XY]}}{\det \Sigma_{XX}\,\Sigma_{YY}},$$
where $\Sigma_{XY}$ is the cross-covariance operator,
$$\Sigma_{XY} = E\big[\Phi_Y(Y)\otimes\Phi_X(X)\big] - E\big[\Phi_Y(Y)\big]\otimes E\big[\Phi_X(X)\big] = E\big[k_Y(\cdot,Y)\otimes k_X(\cdot,X)\big] - E\big[k_Y(\cdot,Y)\big]\otimes E\big[k_X(\cdot,X)\big],$$
which contains the information about all the higher-order correlations. Independence can be tested with the constrained covariance (COCO),
$$\mathrm{COCO} = \sup_{f,g\neq 0}\ \mathrm{Cov}\big[f(X), g(Y)\big].$$
Gretton proposed to restrict the functions to a RKHS and to solve the problem through the largest singular value, or through the trace (Frobenius norm), which gives HSIC; with $K_1$, $K_2$ the Gram matrices for f and g, the largest-singular-value form is
$$\max_{\alpha_1,\alpha_2}\ \frac{\alpha_1^T K_1K_2\,\alpha_2}{\sqrt{\alpha_1^T K_1^2\,\alpha_1}\ \sqrt{\alpha_2^T K_2^2\,\alpha_2}}.$$

RKHS induced by Correntropy: Definition
For a stochastic process {X_t, t ∈ T}, with T an index set, correntropy defined as $V_X(t,s) = E[\kappa(X_t - X_s)]$ is symmetric and positive definite. Thus it induces a new RKHS, denoted RKHS_C, and there is a kernel mapping Ψ such that $V_X(t,s) = \langle\Psi(t),\Psi(s)\rangle_C$. Any symmetric non-negative definite kernel is the covariance kernel of a random function, and vice versa (Parzen). Therefore, given a random function {X_t, t ∈ T}, there exists another random function {f_t, t ∈ T} (a Gaussian process) such that $E[f_tf_s] = V_X(t,s)$.

RKHS induced by Correntropy: Relation with RKHS_P
Correntropy estimates similarity in feature space: it transforms the data component-wise to feature space by f(·) and then takes the correlation,
$$V(x,y) = E\big[\kappa(x,y)\big] = E\big[f(x)\cdot f(y)\big],$$
where f(·) is a function that is also implicitly defined (as Φ was). RKHS_C is nonlinearly related to the input space (unlike RKHS_P), and therefore seems very appropriate to EXTEND traditional linear signal processing methods to nonlinear system identification.

RKHS induced by Correntropy: Relation with RKHS_K
The Gram matrix stores pairwise evaluations in feature space; it is a sample-by-sample evaluation of functions, so the Gram matrix is N×N (data size), whereas the correntropy matrix is L×L (space size). This means huge savings: if we have a million samples in a 5×5 input space, the Gram matrix is 10^6 × 10^6, while the correntropy matrix is 5×5, with each element computed from a million pairs (very much like an MA model of order 5 estimated with one million samples).

Applications of Correntropy: Matched Filtering
The matched filter computes the inner product between the received signal r(n) and the template s(n) ($R_{sr}(0)$). Hypotheses for the received signal:
$$H_0:\ r_k = n_k, \qquad H_1:\ r_k = s_k + n_k.$$
The correntropy MF computes the similarity value
$$\hat V_{rs}(0) = \frac1N\sum_{i=1}^N\kappa_\sigma(r_i - s_i),$$
so that
$$V_0 = \frac{1}{\sqrt{2\pi}\,\sigma N}\sum_{i=1}^N e^{-(s_i-n_i)^2/2\sigma^2}, \qquad V_1 = \frac{1}{\sqrt{2\pi}\,\sigma N}\sum_{i=1}^N e^{-n_i^2/2\sigma^2}.$$
(Patent pending.)

Applications of Correntropy: Matched Filtering, linear channels
[Figures: detection performance in white Gaussian noise and in impulsive noise. Template: binary sequence of length 20; kernel size chosen with Silverman's rule.]

Applications of Correntropy: Matched Filtering, alpha-stable noise
[Figures: detection performance in alpha-stable noise (α = 1.1) and the effect of the kernel size. Template: binary sequence of length 20; kernel size chosen with Silverman's rule.]
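A toy Monte Carlo sketch contrasting the linear matched-filter statistic with the correntropy MF statistic $\hat V_{rs}(0)$ under impulsive noise. The noise model, kernel size, and the crude separation index are illustrative assumptions, not the experimental setup used on the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20
template = rng.choice([-1.0, 1.0], size=N)         # binary template of length 20
sigma = 1.0                                         # kernel size (the slides use Silverman's rule)

def linear_mf(r, s):
    """Classical matched filter statistic: normalized inner product <r, s>."""
    return np.dot(r, s) / len(s)

def correntropy_mf(r, s, sigma):
    """V_rs(0) = (1/N) sum_i kappa_sigma(r_i - s_i)."""
    d = r - s
    return np.mean(np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))

def impulsive_noise(size):
    """Gaussian background plus occasional large impulses (a crude two-term mixture)."""
    n = 0.3 * rng.normal(size=size)
    hits = rng.random(size) < 0.1
    n[hits] += 10.0 * rng.normal(size=hits.sum())
    return n

# Monte Carlo: statistic under H1 (signal present) versus H0 (noise only)
lin_h1, lin_h0, cor_h1, cor_h0 = [], [], [], []
for _ in range(2000):
    noise = impulsive_noise(N)
    lin_h1.append(linear_mf(template + noise, template))
    lin_h0.append(linear_mf(noise, template))
    cor_h1.append(correntropy_mf(template + noise, template, sigma))
    cor_h0.append(correntropy_mf(noise, template, sigma))

def separation(h1, h0):
    """Crude detectability index: |mean gap| / pooled standard deviation."""
    h1, h0 = np.asarray(h1), np.asarray(h0)
    return abs(h1.mean() - h0.mean()) / np.sqrt(0.5 * (h1.var() + h0.var()))

print("linear MF separation     ", separation(lin_h1, lin_h0))
print("correntropy MF separation", separation(cor_h1, cor_h0))
```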
Applications of Correntropy: Minimum Average Correlation Energy (MACE) filters
A 2-dimensional image is treated as a d×1 column vector, and the conventional MACE filter is formulated in the frequency domain. With the training image matrix, the correlation energy of the i-th image is defined through a diagonal matrix whose elements are the magnitude squared of the associated elements of the i-th image, that is, its power spectrum. The MACE filter minimizes the average correlation energy over all training images while satisfying a constraint on the correlation value at the origin for each training image.

Applications of Correntropy: The Correntropy MACE filter
Exploiting the linear-space properties of the RKHS, the optimization problem is posed in the feature space. We do not know the feature map F explicitly, but the solution and the test output can be computed through correntropy (kernel) evaluations on the training and test images.

Applications of Correntropy: VMACE Simulation Results — Face recognition (MACE vs. Correntropy MACE)
• We used the facial expression database from the Advanced Multimedia Processing Lab at CMU: 13 subjects, each captured with 75 varying expressions; image size 64×64.
• False-class images have additive Gaussian noise (SNR = 10 dB).
• We report the results of the two most difficult cases, which produced the worst performance with the MACE method.
• The simulation results were obtained by averaging (Monte Carlo) over 100 different training sets, each consisting of 5 randomly chosen images.
• The kernel size is about 40% of the standard deviation of the input data.
[Figures: true-class and false-class images; VMACE ROC curves with different SNRs; CMACE on SAR/ATR under aspect-angle mismatch.]

Applications of Correntropy: Wiener Filtering
We can also formulate the Wiener filter as a correntropy filter, instead of as kernel regression, by directly exploiting the linear structure of the RKHS; however, there is still an approximation in the computation:
$$y(n) = F^T(n)\,V^{-1}\,\frac1N\sum_{k=1}^N d(k)F(k) = \frac1N\sum_{k=1}^N\sum_{j=0}^{L-1}\sum_{i=0}^{L-1} a_{ij}\,d(k)\,\big\{f(x(n-i))\cdot f(x(k-j))\big\} \approx \frac1N\sum_{k=1}^N\sum_{j=0}^{L-1}\sum_{i=0}^{L-1} a_{ij}\,d(k)\,\kappa\big(x(n-i), x(k-j)\big),$$
where F(n) is the transformed input vector and the $a_{ij}$ are the entries of $V^{-1}$, the inverse of the L×L correntropy matrix. (Patent pending.)

Applications of Correntropy: Wiener Filtering results
Prediction of the Mackey-Glass equation (a mildly chaotic system). [Figures: Wiener filter versus the correntropy filter; an RBF network with 50 centers versus the correntropy filter.]

Applications of Correntropy: Nonlinear temporal PCA
The Karhunen-Loève transform performs principal component analysis (PCA) of the autocorrelation of the random process. Let X be the L×N data matrix whose rows are delayed versions of the time series (an embedding of dimension L over N windows). Then $R = XX^T$ is the L×L Toeplitz autocorrelation matrix with entries r(0), r(1), ..., r(L-1), and $K = X^TX$ is the N×N Gram matrix. Writing $R = XX^T = UDD^TU^T$ and $K = X^TX = VD^TDV^T$, where D is the L×N matrix of singular values $\{\sqrt{\lambda_1},\dots,\sqrt{\lambda_L}\}$ on its diagonal, the two decompositions are linked by $U_i^TX = \sqrt{\lambda_i}\,V_i^T$, $i = 1,2,\dots,L$. The KL transform can therefore also be computed by decomposing the Gram matrix K directly.

Applications of Correntropy: Nonlinear KL transform
Since the autocorrelation function of the projected data in the RKHS is given by correntropy, we can directly construct K with correntropy. Example:
$$x(m) = A\sin(2\pi fm) + z(m), \qquad p_N(n) = 0.8\,N(0,0.1) + 0.1\,N(4,0.1) + 0.1\,N(-4,0.1).$$
Results (1,000 Monte Carlo runs, σ = 1):

A      VPCA (2nd PC)   PCA, N-by-N (N=256)   PCA, L-by-L (L=100)   PCA, L-by-L (L=4)
0.2    100%            15%                    3%                    8%
0.25   100%            27%                    6%                   17%
0.5    100%            99%                   47%                   90%

Applications of Correntropy: Correntropy Spectral Density
The CSD is a function of the kernel size, and shows the difference between the PSD (recovered for large σ) and the new spectral measure. [Figures: average normalized amplitude versus frequency for several kernel sizes, and time-bandwidth product versus kernel size.]

Applications of Correntropy: Correntropy-based correlograms
Correntropy can be used in computational auditory scene analysis (CASA), providing much better frequency resolution. [Figures: correlograms from a 30-channel cochlear model for a single vowel (pitch = 100 Hz), comparing the autocorrelation function with the autocorrentropy function across lags, and for two superimposed vowels from two speakers (f0 = 100 and 126 Hz).]
[Figure: ROC curves for noiseless (left) and noisy (right) double-vowel discrimination.]

Applications of RKHS: Tests for nonlinearity
When is the data generated by a nonlinear system? Use Fourier surrogates (take the FFT, randomize the phase, and invert). Because phase randomization preserves the power spectrum, compare a correntropy-based spectral statistic of the original and of the surrogates; if they are statistically different (at some confidence level), declare the model that generated the data nonlinear.
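A sketch of one way to compute a correntropy spectral density and a Fourier surrogate, under the assumption (not spelled out on the slides) that the CSD is taken as the DFT of the lag-centered autocorrentropy; the test signal, kernel size, and lag range are arbitrary.

```python
import numpy as np

def autocorrentropy(x, max_lag, sigma):
    x = np.asarray(x, float)
    return np.array([np.mean(np.exp(-(x[m:] - x[:x.size - m])**2 / (2 * sigma**2))
                             / (np.sqrt(2 * np.pi) * sigma)) for m in range(max_lag + 1)])

def correntropy_spectral_density(x, max_lag, sigma):
    """DFT magnitude of the lag-centered autocorrentropy (the centering subtracts
    the lag average, an estimate of the information potential)."""
    v = autocorrentropy(x, max_lag, sigma)
    return np.abs(np.fft.rfft(v - v.mean()))

def fourier_surrogate(x, rng):
    """Phase-randomized surrogate: same power spectrum, linearized dynamics."""
    X = np.fft.rfft(np.asarray(x, float))
    phases = np.exp(1j * rng.uniform(0, 2 * np.pi, X.size))
    phases[0] = 1.0                      # keep the mean
    if x.size % 2 == 0:
        phases[-1] = 1.0                 # keep the Nyquist bin real
    return np.fft.irfft(X * phases, n=len(x))

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    n = np.arange(4096)
    x = np.sin(2 * np.pi * 0.03 * n) ** 3 + 0.2 * rng.normal(size=n.size)  # a nonlinear waveform
    csd_x = correntropy_spectral_density(x, 256, sigma=0.5)
    csd_s = correntropy_spectral_density(fourier_surrogate(x, rng), 256, sigma=0.5)
    # A large discrepancy between the two CSDs (checked over many surrogates)
    # is the kind of evidence the surrogate test above looks for.
    print("max |CSD(x) - CSD(surrogate)| =", np.abs(csd_x - csd_s).max())
```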
Conclusions
• Information Theoretic Learning took us beyond Gaussian statistics and MSE as cost functions; ITL generalizes many of the statistical concepts we take for granted.
• Kernel methods implement shallow neural networks (RBFs) and easily extend the linear algorithms we all know.
• KLMS is a simple algorithm for on-line learning of nonlinear systems.
• Correntropy defines a new RKHS that seems to be very appropriate for nonlinear system identification and robust control.
• Correntropy may take us out of the local minimum of the (adaptive) design of optimum linear systems.
• For more information go to www.cnel.ufl.edu: the ITL resource has tutorials, demos and downloadable MATLAB code.

ITL — Applications: Nonlinear system identification
Minimize the information content of the residual error. [Diagram: input x and desired d; the adaptive system output y is compared with d and the error e is driven to minimum entropy.] Equivalently, this provides the best density matching between the output and the desired signals:
$$\min_w \frac{1}{1-\alpha}\log\int p_e^\alpha(\xi;w)\,d\xi \;\Longleftrightarrow\; \min_w \frac{1}{1-\alpha}\log\iint p_{xy}(\xi,\zeta;w)\Big(\frac{p_{xy}(\xi,\zeta;w)}{p_{xd}(\xi,\zeta)}\Big)^{\alpha-1}d\xi\,d\zeta.$$
Erdogmus, Principe, IEEE Trans. Signal Processing, 2002 (IEEE SP 2003 Young Author Award).

ITL — Applications: Time-series prediction
Chaotic Mackey-Glass (MG-30) series; compare two criteria, minimum squared error and minimum error entropy, on a TDNN (6+4+1). [Figures: probability density of the prediction error for MEE and MSE training, and the densities of the predicted signal values against the data.]

ITL — Applications (system identification · feature extraction · blind source separation · clustering)

ITL — Applications: Optimal feature extraction
• Data processing inequality: mutual information is monotonically nonincreasing along the processing chain.
• Classification error inequality: the error probability is bounded from below and above by the mutual information.
[Diagram: input x → feature extractor → features z; maximize the mutual information between z and the class labels d; the classifier output y is compared with d and information forces are backpropagated.]
PhD on feature extraction for sonar target recognition (2002). Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.

ITL — Applications: Extract 2 nonlinear features
64×64 SAR images of 3 vehicles: BMP2, BTR70, T72. [Figure: information forces during training.] Classification results, P(correct):

MI+LR       94.89%
SVM         94.60%
Templates   90.40%

Zhao, Xu and Principe, SPIE Automatic Target Recognition, 1999. Hild, Erdogmus, Principe, IJCNN Tutorial on ITL, 2003.

ITL — Applications (feature extraction · system identification · blind source separation · clustering)

ITL — Applications: Independent component analysis
Observations are generated by an unknown mixture of statistically independent unknown sources, $x_k = Hs_k$, and the mutual information of the outputs decomposes as
$$I(\mathbf z) = \sum_c H(z_c) - H(\mathbf z).$$
[Diagram: independent sources s → mixer H → mixed signals x → sphering W → uncorrelated signals z → rotation R → independent signals y.] Minimize the mutual information. Ken Hild II, PhD on blind source separation (2003).

ITL — Applications: On-line separation of mixed sounds
Off-line separation of a 10×10 mixture and on-line separation of a 3×3 mixture. [Figures: SIR (dB) versus data length (samples) for the off-line case and SIR versus time (thousands of samples) for the on-line case, comparing MeRMaId and MeRMaId-SIG with Infomax, FastICA and Comon; observed mixtures X1-X3 and separated outputs Z1-Z3.]
Hild, Erdogmus, Principe, IEEE Signal Processing Letters, 2001.
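The idea behind MeRMaId referenced above (minimize Renyi's mutual information after sphering) can be illustrated with a deliberately crude toy: grid-search the single rotation angle that minimizes the sum of marginal quadratic Renyi entropies. This is only a sketch of the principle, not the MeRMaId-SIG algorithm of Hild et al.; the sources, mixing matrix, kernel size, and grid search are all arbitrary choices.

```python
import numpy as np

def renyi_quadratic_entropy(x, sigma):
    """-log of the pairwise Gaussian-kernel information potential estimate."""
    d = x[:, None] - x[None, :]
    s2 = sigma * np.sqrt(2)
    return -np.log(np.mean(np.exp(-d**2 / (2 * s2**2)) / (np.sqrt(2 * np.pi) * s2)))

def whiten(x):
    """Sphering: zero mean and identity covariance (x is channels-by-samples)."""
    x = x - x.mean(axis=1, keepdims=True)
    evals, evecs = np.linalg.eigh(x @ x.T / x.shape[1])
    return (evecs / np.sqrt(evals)) @ evecs.T @ x

rng = np.random.default_rng(6)
T = 1000
s = np.vstack([np.sign(rng.normal(size=T)),          # two independent, non-Gaussian sources
               rng.uniform(-1.0, 1.0, size=T)])
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s            # unknown mixture
z = whiten(x)                                         # uncorrelated (sphered) signals

# After sphering, only a rotation remains to be identified.  The joint quadratic
# entropy is invariant under rotation, so the sum of the marginal entropies is
# used as a surrogate for the mutual information of the outputs.
angles = np.linspace(0.0, np.pi / 2, 46)
costs = []
for t in angles:
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    y = R @ z
    costs.append(renyi_quadratic_entropy(y[0], 0.25) + renyi_quadratic_entropy(y[1], 0.25))
print("estimated demixing rotation angle (rad):", angles[int(np.argmin(costs))])
```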
ITL — Applications (feature extraction · system identification · blind source separation · clustering)

ITL — Applications: Information theoretic clustering
Select clusters based on entropy and divergence: minimize the within-cluster entropy and maximize the between-cluster divergence. With a membership vector m defining the two clusters (densities p and q), the Cauchy-Schwarz-type cost is
$$\min_{\mathbf m}\ \frac{\int p(x)q(x)\,dx}{\big(\int p^2(x)\,dx\ \int q^2(x)\,dx\big)^{1/2}}.$$
Robert Jenssen, PhD on information theoretic clustering.
Jenssen, Erdogmus, Hild, Principe, Eltoft, IEEE Trans. Pattern Analysis and Machine Intelligence, 2005 (submitted).
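A minimal numerical sketch of the Cauchy-Schwarz clustering cost above, evaluated for a fixed membership vector on 1-D data with Gaussian kernels. The helper names and the toy data are illustrative, and the cited work optimizes the membership vector, which is not done here.

```python
import numpy as np

def _ip(a, b, sigma):
    """Cross information potential between two sample sets (Gaussian kernel)."""
    d = a[:, None] - b[None, :]
    s2 = sigma * np.sqrt(2)
    return np.mean(np.exp(-d**2 / (2 * s2**2)) / (np.sqrt(2 * np.pi) * s2))

def cs_cluster_cost(x, m, sigma):
    """int p*q / sqrt(int p^2 * int q^2) for the two clusters selected by membership m."""
    a, b = x[m], x[~m]
    return _ip(a, b, sigma) / np.sqrt(_ip(a, a, sigma) * _ip(b, b, sigma))

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
    true_m = np.arange(x.size) < 200          # the generating partition
    rand_m = rng.random(x.size) < 0.5         # a random partition for comparison
    print("cost of true partition  ", cs_cluster_cost(x, true_m, sigma=1.0))
    print("cost of random partition", cs_cluster_cost(x, rand_m, sigma=1.0))
```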