Correntropy as a Generalized Similarity Measure
Jose C. Principe
Computational NeuroEngineering Laboratory
Electrical and Computer Engineering Department
University of Florida
www.cnel.ufl.edu
principe@cnel.ufl.edu
Acknowledgments
Dr. John Fisher
Dr. Dongxin Xu
Dr. Ken Hild
Dr. Deniz Erdogmus
My students: Puskal Pokharel
Weifeng Liu
Jianwu Xu
Kyu-Hwa Jeong
Sudhir Rao
Seungju Han
NSF ECS – 0300340 and 0601271 (Neuroengineering
program) and DARPA
Outline
• Information Theoretic Learning
• A RKHS for ITL
• Correntropy as Generalized Correlation
• Applications
  - Matched filtering
  - Wiener filtering and Regression
  - Nonlinear PCA
Information Filtering: From Data to Information
Information filters: given data pairs {xi, di}
• Optimal adaptive systems
• Information measures
[Block diagram: the data x drives an adaptive system f(x,w); its output is compared with the desired data d to produce the error e, which feeds an information measure.]
Embed the information in the weights of the adaptive system.
More formally, use optimization to perform Bayesian estimation.
Information Filtering
Erdogmus D., Principe J., "From Linear Adaptive Filtering to Nonlinear Information Processing", IEEE Signal Processing Magazine, November 2006.
ITL resource on the CNEL website: www.cnel.ufl.edu
What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems
using criteria based on the information descriptors of
entropy and divergence.
The centerpiece is a non-parametric estimator of entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Is readily integrated in conventional gradient-descent learning
• Provides a link between information theory and kernel learning.
ITL is a different way of thinking about
data quantification
Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb's postulate of learning) into 2nd-order statistical equivalents.
ITL replaces 2nd-order moments with a geometric statistical interpretation of data in probability spaces:
Variance by entropy
Correlation by correntropy
Mean square error (MSE) by minimum error entropy (MEE)
Distances in data space by distances in probability spaces
It fully exploits the structure of RKHS.
Information Theoretic Learning
Entropy
Not all random variables (r.v.) are equally random!
[Figure: two example pdfs fX(x) with different spreads, plotted over x in [-5, 5].]
Entropy quantifies the degree of uncertainty in a r.v.
Claude Shannon defined entropy as
H_S(X) = -\sum_x p_X(x) \log p_X(x) \qquad\qquad H_S(X) = -\int f_X(x) \log f_X(x)\,dx
Information Theoretic Learning
Renyi’s Entropy
H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_x p_X^\alpha(x) \qquad\qquad H_\alpha(X) = \frac{1}{1-\alpha} \log \int f_X^\alpha(x)\,dx
Renyi's entropy equals Shannon's as \alpha \to 1.
Norm of the pdf (entropy \leftrightarrow \alpha-norm): the argument of the log is the \alpha-norm of p raised to the power \alpha,
\|p\|_\alpha = \left( \sum_{k=1}^{n} p_k^\alpha \right)^{1/\alpha}
[Figure: probability simplex p = (p_1, p_2, p_3), p_1 + p_2 + p_3 = 1, illustrating the \alpha-norm of a pmf.]
V_\alpha = \int f^\alpha(x)\,dx
V_\alpha is called the information potential.
Information Theoretic Learning
Norm of the PDF (Information Potential)
V_2(X), the 2-norm of the pdf (the information potential), is one of the central concepts in ITL.
[Diagram: E[p(x)] and its estimators connect three areas: information theory (Renyi's entropy), cost functions for adaptive systems, and reproducing kernel Hilbert spaces.]
Information Theoretic Learning
Parzen windowing
Given only samples drawn from a distribution: \{x_1, \ldots, x_N\} \sim p(x)
\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(x - x_i), \qquad G_\sigma \text{ is the kernel function}
Convergence: \lim_{N\to\infty} \hat{p}(x) = p(x) * G_{\sigma(N)}(x), provided that N\sigma(N) \to \infty.
[Figure: Parzen estimates of fX(x) for N = 10 and N = 1000 samples, plotted over x in [-5, 5].]
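As an illustration of the Parzen estimator above, here is a minimal NumPy sketch that evaluates p̂(x) on a grid of points; the function name parzen_pdf, the kernel size, and the toy data are illustrative choices, not part of the original slides.

```python
import numpy as np

def parzen_pdf(x_eval, samples, sigma):
    """Parzen window estimate of the pdf with a Gaussian kernel G_sigma."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))[:, None]   # (M, 1)
    samples = np.asarray(samples, dtype=float)[None, :]                # (1, N)
    kernels = np.exp(-(x_eval - samples) ** 2 / (2 * sigma ** 2)) \
              / (np.sqrt(2 * np.pi) * sigma)
    return kernels.mean(axis=1)      # average of the N kernels centered at the samples

# toy usage: estimate the density of 1000 Gaussian samples on a grid
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)
grid = np.linspace(-5.0, 5.0, 201)
pdf_hat = parzen_pdf(grid, data, sigma=0.5)
```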
Information Theoretic Learning
Information Potential
Order-2 entropy and Gaussian kernels:
\hat{H}_2(X) = -\log \int \hat{p}^2(x)\,dx = -\log \int \left( \frac{1}{N} \sum_{i=1}^{N} G_\sigma(x - x_i) \right)^2 dx
 = -\log \left( \frac{1}{N^2} \sum_j \sum_i \int G_\sigma(x - x_j) G_\sigma(x - x_i)\,dx \right)
 = -\log \left( \frac{1}{N^2} \sum_j \sum_i G_{\sigma\sqrt{2}}(x_j - x_i) \right)
Pairwise interactions between samples: O(N^2).
The argument of the log is the information potential estimator \hat{V}_2(X).
\hat{p}(x) provides a potential field over the space of the samples, parameterized by the kernel size \sigma.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
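The double-sum estimator maps directly to code. A minimal NumPy sketch, assuming a Gaussian kernel of size σ so that the pairwise kernel in the double sum has size σ√2 (the function names are illustrative):

```python
import numpy as np

def information_potential(x, sigma):
    """V2 estimator: (1/N^2) * double sum of G_{sigma*sqrt(2)}(x_j - x_i)."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]            # all pairwise differences, O(N^2)
    s2 = 2.0 * sigma ** 2                     # variance of the sigma*sqrt(2) kernel
    g = np.exp(-diff ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return g.mean()                           # (1/N^2) * double sum

def renyi_quadratic_entropy(x, sigma):
    """H2(X) = -log V2(X)."""
    return -np.log(information_potential(x, sigma))
```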
Information Theoretic Learning
Central Moments Estimators
Mean:
E[X] \approx \frac{1}{N} \sum_{i=1}^{N} x_i
Variance:
E[(X - E[X])^2] \approx \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \frac{1}{N} \sum_{j=1}^{N} x_j \right)^2
2-norm of the pdf (information potential), with \sigma acting as a scale:
E[f(x)] \approx \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_i - x_j)
Renyi's quadratic entropy:
-\log E[f(x)] \approx -\log \left( \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_i - x_j) \right)
Information Theoretic Learning
Information Force
In adaptation, samples become information particles
that interact through information forces.
Information potential:
\hat{V}_2(x_j) = \frac{1}{N} \sum_i G_{\sigma\sqrt{2}}(x_j - x_i)
Information force:
\frac{\partial \hat{V}_2}{\partial x_j} = \frac{1}{N} \sum_i G'_{\sigma\sqrt{2}}(x_j - x_i)
[Figure: samples x_i exert information forces on sample x_j.]
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
Erdogmus, Principe, Hild, Natural Computing, 2002.
Information Theoretic Learning
Error Entropy Criterion (EEC)
We will use iterative algorithms (steepest descent) for the optimization of a linear system:
w(n+1) = w(n) + \eta \nabla \hat{V}_2(n)
Given a batch of N samples, the IP is
\hat{V}_2(E) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(e_i - e_j)
For an FIR filter the gradient becomes
\nabla_k \hat{V}_2(n) = \frac{\partial \hat{V}_2(e(n))}{\partial w_k}
 = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G'_{\sigma\sqrt{2}}\big(e(n-i) - e(n-j)\big)\,\frac{\partial \big(e(n-i) - e(n-j)\big)}{\partial w_k}
 = \frac{2}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(e(n-i) - e(n-j)\big)\,\big(e(n-i) - e(n-j)\big)\,\big(x_k(n-j) - x_k(n-i)\big)
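A batch NumPy sketch of this gradient for an FIR filter follows. It applies the chain rule above with the constant 1/(2σ²) coming from the derivative of the σ√2 Gaussian written out explicitly; the zero-padding of the input and the function name are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def mee_fir_gradient(w, x, d, sigma):
    """Gradient of the quadratic information potential of the error, V2(e),
    with respect to the FIR taps w, over a batch of N samples.
    x and d are length-N input and desired signals."""
    L, N = len(w), len(d)
    xp = np.concatenate([np.zeros(L - 1), x])      # zero-pad the past of the input
    X = np.stack([xp[L - 1 - k : L - 1 - k + N] for k in range(L)], axis=1)  # X[n,k]=x[n-k]
    e = d - X @ w                                  # error samples
    de = e[:, None] - e[None, :]                   # e_i - e_j
    dX = X[:, None, :] - X[None, :, :]             # x_{i-k} - x_{j-k}, shape (N, N, L)
    s2 = 2.0 * sigma ** 2                          # variance of the sigma*sqrt(2) kernel
    g = np.exp(-de ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    # dV2/dw_k = 1/(2 sigma^2 N^2) * sum_ij G(e_i - e_j)(e_i - e_j)(x_{i-k} - x_{j-k})
    return np.einsum('ij,ijk->k', g * de, dX) / (s2 * N ** 2)

# steepest ascent on V2(e), i.e. steepest descent on the error entropy:
# w = w + eta * mee_fir_gradient(w, x, d, sigma)
```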
Information Theoretic Learning
Backpropagation of Information Forces
[Diagram: the input and desired signals feed an adaptive system; an IT criterion computes information forces from the output, which are backpropagated through the adjoint (dual) network to produce the weight updates.]
\frac{\partial J}{\partial w_{ij}} = \sum_{p=1}^{k} \sum_{n=1}^{N} \frac{\partial J}{\partial e_p(n)} \frac{\partial e_p(n)}{\partial w_{ij}}
Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
Information Theoretic Learning
Any \alpha, any kernel
Parzen windowing:
\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x - x_i)
Recall Renyi's entropy:
H_\alpha(X) = \frac{1}{1-\alpha} \log \int p^\alpha(x)\,dx = \frac{1}{1-\alpha} \log E_X\big[p^{\alpha-1}(X)\big]
Approximate E_X[\cdot] by the empirical mean to obtain the nonparametric \alpha-entropy estimator (pairwise interactions between samples):
\hat{H}_\alpha(X) = \frac{1}{1-\alpha} \log \left[ \frac{1}{N} \sum_{j=1}^{N} \left( \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x_j - x_i) \right)^{\alpha-1} \right]
Erdogmus, Principe, IEEE Transactions on Neural Networks, 2002.
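A direct transcription of this α-entropy estimator, assuming a Gaussian kernel for concreteness (the estimator accepts any kernel, and the function name is illustrative):

```python
import numpy as np

def renyi_entropy_alpha(x, sigma, alpha):
    """Nonparametric Renyi alpha-entropy estimator (alpha != 1), Gaussian kernel."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]
    k = np.exp(-diff ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    p_hat = k.mean(axis=1)                      # Parzen estimate at each sample x_j
    return np.log(np.mean(p_hat ** (alpha - 1))) / (1.0 - alpha)
```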
Information Theoretic Learning
Quadratic divergence measures
Kullback-Leibler divergence:
D_{KL}(X;Y) = \int p(x) \log \frac{p(x)}{q(x)}\,dx
Renyi's divergence:
D_\alpha(X;Y) = \frac{1}{\alpha-1} \log \int p(x) \left( \frac{p(x)}{q(x)} \right)^{\alpha-1} dx
Euclidean distance:
D_E(X;Y) = \int \big( p(x) - q(x) \big)^2\,dx
Cauchy-Schwarz divergence:
D_C(X;Y) = -\log \frac{\int p(x) q(x)\,dx}{\sqrt{\int p^2(x)\,dx \int q^2(x)\,dx}}
Mutual Information is a special case (distance between
the joint and the product of marginals)
Information Theoretic Learning
How to estimate the Euclidean distance
D_E(p;q) = \int \big( p(x) - q(x) \big)^2\,dx = \int p^2(x)\,dx - 2 \int p(x)q(x)\,dx + \int q^2(x)\,dx
\hat{D}_E(p;q) = \frac{1}{N_p^2} \sum_{i=1}^{N_p} \sum_{j=1}^{N_p} G_{\sigma\sqrt{2}}(x_i - x_j)
 - \frac{2}{N_p N_q} \sum_{i=1}^{N_p} \sum_{a=1}^{N_q} G_{\sigma\sqrt{2}}(x_i - x_a)
 + \frac{1}{N_q^2} \sum_{a=1}^{N_q} \sum_{b=1}^{N_q} G_{\sigma\sqrt{2}}(x_a - x_b)
\int p(x)q(x)\,dx is called the cross information potential (CIP).
So D_E can be readily computed with the information potential, and likewise the Cauchy-Schwarz divergence and the quadratic mutual information.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
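These plug-in estimators can be coded directly from the sums above. A NumPy sketch, assuming samples x ~ p and y ~ q, a Gaussian kernel of size σ (hence σ√2 in the pairwise terms), and the -log(CIP / sqrt(IP·IP)) convention for the Cauchy-Schwarz divergence used later in this deck; function names are illustrative:

```python
import numpy as np

def _cross_ip(x, y, sigma):
    """Cross information potential: (1/(Nx*Ny)) sum_ij G_{sigma*sqrt(2)}(x_i - y_j)."""
    diff = np.asarray(x, float)[:, None] - np.asarray(y, float)[None, :]
    s2 = 2.0 * sigma ** 2
    return (np.exp(-diff ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)).mean()

def euclidean_divergence(x, y, sigma):
    """D_ED(p, q) = V(p,p) - 2 V(p,q) + V(q,q)."""
    return _cross_ip(x, x, sigma) - 2 * _cross_ip(x, y, sigma) + _cross_ip(y, y, sigma)

def cauchy_schwarz_divergence(x, y, sigma):
    """D_CS(p, q) = -log( V(p,q) / sqrt(V(p,p) V(q,q)) )."""
    vxy = _cross_ip(x, y, sigma)
    return -np.log(vxy / np.sqrt(_cross_ip(x, x, sigma) * _cross_ip(y, y, sigma)))
```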
Information Theoretic Learning
Unifies supervised and unsupervised learning
[Diagram: a learning system Y = q(X, W) maps the input signal X to the output Y; an information measure I(Y, D) or H(Y) drives the optimization. Switch 1 (uses the desired signal D): filtering/classification and feature extraction (also with entropy). Switches 2 and 3 (no desired signal): InfoMax, ICA, clustering, NLPCA.]
ITL - Applications
Feature extraction
System identification
ITL
Blind source separation
Clustering
www.cnel.ufl.edu -> ITL has examples and Matlab code
Reproducing Kernel Hilbert Spaces as
a Tool for Nonlinear System Analysis
Fundamentals of Kernel Methods
Kernel methods are a very important class of algorithms
for nonlinear optimal signal processing and machine
learning. Effectively they are shallow (one layer)
neural networks (RBFs) for the Gaussian kernel.
They exploit the linear structure of Reproducing Kernel Hilbert
Spaces (RKHS) with very efficient computation.
ANY (!) SP algorithm expressed in terms of inner products has
in principle an equivalent representation in a RKHS, and may
correspond to a nonlinear operation in the input space.
Solutions may be analytic instead of adaptive, when the linear
structure is used.
Fundamentals of Kernel Methods
Definition
A Hilbert space is a space of functions f(·).
Given a continuous, symmetric, positive-definite kernel \kappa : U \times U \to R, a mapping \Phi, and an inner product \langle \cdot, \cdot \rangle_H,
a RKHS H is the closure of the span of all \Phi(u).
Reproducing property: \langle f, \Phi(u) \rangle_H = f(u)
Kernel trick: \langle \Phi(u_1), \Phi(u_2) \rangle_H = \kappa(u_1, u_2)
The induced norm: \| f \|_H^2 = \langle f, f \rangle_H
Fundamentals of Kernel Methods
RKHS proposed by Parzen
For a stochastic process {X_t, t ∈ T} with T being an index set, the autocorrelation function R_X : T×T → R defined as
R_X(t,s) = E[X_t X_s]
is symmetric and positive definite, and thus induces a RKHS by the Moore-Aronszajn theorem, denoted RKHS_P.
Further, there is a kernel mapping Θ such that
R_X(t,s) = \langle \Theta(t), \Theta(s) \rangle_P
This definition provided a functional analysis perspective for
Gaussian processes.
RKHSP brought understanding but no new results.
Fundamentals of Kernel Methods
RKHS induced by the Gaussian kernel
The Gaussian kernel is symmetric and positive definite:
k(x, x') = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - x')^2}{2\sigma^2} \right)
and thus induces a RKHS on a sample set {x_1, ..., x_N} of reals, denoted RKHS_K.
Further, by Mercer's theorem, a kernel mapping \Phi can be constructed which transforms data from the input space to RKHS_K, where
k(x - x_i) = \langle \Phi(x), \Phi(x_i) \rangle_K
and \langle \cdot, \cdot \rangle denotes the inner product in RKHS_K.
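A small sketch of this kernel trick with the Gaussian kernel above: every Gram-matrix entry is the inner product of the implicit feature maps, so algorithms written in terms of inner products never need Φ explicitly (function names are illustrative):

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def gram_matrix(samples, sigma):
    """N x N Gram matrix K[i, j] = <Phi(x_i), Phi(x_j)> = k(x_i, x_j)."""
    s = np.asarray(samples, dtype=float)
    return gaussian_kernel(s[:, None], s[None, :], sigma)
```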
A RKHS for ITL
RKHS induced by cross information potential
Let E be the set of all square-integrable one-dimensional probability density functions, i.e., f_i(x) ∈ E, i ∈ I, where \int f_i^2(x)\,dx < \infty and I is an index set. Then form a linear manifold (similar to the simplex)
\left\{ \sum_i \alpha_i f_i(x) \right\}
Close the set and define a proper inner product
\langle f_i(x), f_j(x) \rangle_{L_2} = \int f_i(x) f_j(x)\,dx
L_2(E) is a Hilbert space, but it is not reproducing. However, let us define the bivariate function on L_2(E), the cross information potential (CIP):
V(f_i, f_j) = \int f_i(x) f_j(x)\,dx
One can show that the CIP is a positive-definite function, so it defines a RKHS_V. Moreover, there is a congruence between L_2(E) and H_V.
A RKHS for ITL
ITL cost functions in RKHSV
Cross information potential (the natural similarity in H_V):
\int f(x) g(x)\,dx = \langle V(f,\cdot), V(g,\cdot) \rangle_{H_V}
Information potential (the norm (mean) square in H_V):
\int f(x) f(x)\,dx = \langle V(f,\cdot), V(f,\cdot) \rangle_{H_V} = \| V(f,\cdot) \|^2
Second-order statistics in H_V become higher-order statistics of the data (e.g. MSE in H_V includes HOS of the data). Members of H_V are deterministic quantities even when x is a r.v.
Euclidean distance and QMI:
D_{ED}(f,g) = \| V(f,\cdot) - V(g,\cdot) \|^2
QMI_{ED}(f_{1,2}, f_1 f_2) = \| V(f_{1,2},\cdot) - V(f_1 f_2,\cdot) \|^2
Cauchy-Schwarz divergence and QMI:
D_{CS}(f,g) = -\log \frac{\langle V(f,\cdot), V(g,\cdot) \rangle_{H_V}}{\| V(f,\cdot) \| \, \| V(g,\cdot) \|} = -\log(\cos\theta)
QMI_{CS}(f_{1,2}, f_1 f_2) = -\log \frac{\langle V(f_{1,2},\cdot), V(f_1 f_2,\cdot) \rangle_{H_V}}{\| V(f_{1,2},\cdot) \| \, \| V(f_1 f_2,\cdot) \|}
A RKHS for ITL
Relation between ITL and kernel methods through H_V
There is a very tight relationship between HV and HK: By ensemble
averaging of HK we get estimators for the HV statistical quantities.
Therefore statistics in kernel space can be computed by ITL
operators.
A RKHS for ITL
Statistics in RKHSV
Gretton defined the empirical mean operator as
\mu(X) = \frac{1}{N} \sum_{i=1}^{N} \Phi(X_i) = \frac{1}{N} \sum_{i=1}^{N} k(\cdot, X_i)
so that
\mu(X)(u) = E_X[k(X, u)] = \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n \sigma^{2n} n!}\, E_X \| X - u \|^{2n}
which contains the information on all the even moments (Gaussian kernel) and can be used as a STATISTIC representing the probability.
He also defined the maximum mean discrepancy as
MMD^2 = \| \mu(X) - \mu(Y) \|_H^2
 = \frac{1}{N_X^2} \sum_{i=1}^{N_X} \sum_{j=1}^{N_X} k(X_i, X_j) - \frac{2}{N_X N_Y} \sum_{i=1}^{N_X} \sum_{a=1}^{N_Y} k(X_i, Y_a) + \frac{1}{N_Y^2} \sum_{a=1}^{N_Y} \sum_{b=1}^{N_Y} k(Y_a, Y_b)
They are respectively V(f,.) and DED(f,g) of ITL. The ITL estimators
yield the same expressions as above.
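A sketch of the (biased) empirical MMD with a Gaussian kernel; up to the kernel normalization and the σ√2 width used in the ITL estimators, it is the same computation as the D_ED estimator shown earlier (the function name is an illustrative choice):

```python
import numpy as np

def mmd_squared(x, y, sigma):
    """Biased empirical squared MMD between samples x and y, Gaussian kernel."""
    def mean_kernel(a, b):
        d = np.asarray(a, float)[:, None] - np.asarray(b, float)[None, :]
        return np.exp(-d ** 2 / (2 * sigma ** 2)).mean()
    return mean_kernel(x, x) - 2 * mean_kernel(x, y) + mean_kernel(y, y)
```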
Correntropy:
A new generalized similarity measure
Correlation is one of the most widely used functions in
signal processing.
But, correlation only quantifies similarity fully if the
random variables are Gaussian distributed.
Can we define a new function that measures similarity
but it is not restricted to second order statistics?
Use the kernel framework
Correntropy:
A new generalized similarity measure
Define the correntropy of a stationary random process {x_t} as
V_x(t, s) = E[\kappa_\sigma(x_t - x_s)] = \int \kappa_\sigma(x_t - x_s)\,p(x)\,dx
The name correntropy comes from the fact that its average over the lags (or the dimensions) is the information potential (the argument of Renyi's quadratic entropy).
For a strictly stationary and ergodic r.p.,
\hat{V}(m) = \frac{1}{N} \sum_{n=1}^{N} \kappa_\sigma(x_n - x_{n-m})
Correntropy can also be defined for pairs of random
variables
Santamaria I., Pokharel P., Principe J., “Generalized Correlation Function: Definition, Properties and
Application to Blind Equalization”, IEEE Trans. Signal Proc. vol 54, no 6, pp 2187- 2186, 2006.
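A minimal estimator of the autocorrentropy function over a range of lags, assuming the normalized Gaussian kernel of size σ (the function name is illustrative). Its value at lag 0 is 1/(√(2π)σ), which matches the maximum at the origin listed in the properties that follow.

```python
import numpy as np

def autocorrentropy(x, max_lag, sigma):
    """Sample correntropy V(m) = E[k_sigma(x_n - x_{n-m})] for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    v = np.empty(max_lag + 1)
    for m in range(max_lag + 1):
        d = x[m:] - x[: len(x) - m]             # pairs separated by lag m
        v[m] = norm * np.exp(-d ** 2 / (2 * sigma ** 2)).mean()
    return v
```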
Correntropy:
A new generalized similarity measure
What does it look like? The sine wave
Correntropy:
A new generalized similarity measure
Properties of correntropy:
It has a maximum at the origin (equal to 1/(\sqrt{2\pi}\,\sigma) for the Gaussian kernel).
It is a symmetric positive function.
Its mean value is the information potential.
Correntropy includes higher-order moments of the data:
V(s,t) = \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n \sigma^{2n} n!}\, E \| x_s - x_t \|^{2n}
The matrix whose elements are the correntropy at different lags is Toeplitz.
Correntropy:
A new generalized similarity measure
Correntropy as a cost function versus MSE:
MSE(X,Y) = E[(X - Y)^2] = \int\!\!\int (x - y)^2 f_{XY}(x,y)\,dx\,dy = \int e^2 f_E(e)\,de
V(X,Y) = E[\kappa(X - Y)] = \int\!\!\int \kappa(x - y)\, f_{XY}(x,y)\,dx\,dy = \int \kappa(e) f_E(e)\,de
Correntropy:
A new generalized similarity measure
Correntropy induces a metric (CIM) in the sample space, defined by
CIM(X,Y) = \big( V(0,0) - V(X,Y) \big)^{1/2}
Therefore correntropy can
be used as an alternative
similarity criterion in the
space of samples.
Liu W., Pokharel P., Principe J., "Correntropy: Properties and Applications in Non-Gaussian Signal Processing", IEEE Trans. Signal Processing, vol. 55, no. 11, pp. 5286-5298.
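A direct sample-based sketch of the CIM for paired samples, assuming the normalized Gaussian kernel so that κ(0) = 1/(√(2π)σ); the function name is illustrative:

```python
import numpy as np

def cim(x, y, sigma):
    """Correntropy induced metric: CIM(X, Y) = sqrt(k(0) - V(X, Y))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    k0 = 1.0 / (np.sqrt(2 * np.pi) * sigma)              # kernel value at the origin
    v = k0 * np.exp(-(x - y) ** 2 / (2 * sigma ** 2)).mean()
    return np.sqrt(k0 - v)
```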
Correntropy:
A new generalized similarity measure
Correntropy criterion implements M estimation of robust statistics.
M estimation is a generalized maximum likelihood method.
\hat{\theta}_M = \arg\min_\theta \sum_{i=1}^{N} \rho(x_i, \theta) \quad \Longleftrightarrow \quad \sum_{i=1}^{N} \psi(x_i, \hat{\theta}_M) = 0, \qquad \psi = \frac{\partial \rho}{\partial \theta}
In adaptation, the weighted least-squares problem is defined as
\min_\theta \sum_{i=1}^{N} w(e_i)\, e_i^2, \qquad w(e) = \rho'(e)/e
When
\rho(e) = \big( 1 - \exp(-e^2 / 2\sigma^2) \big) / (\sqrt{2\pi}\,\sigma)
this leads to maximizing the correntropy of the error at the origin:
\min_\theta \sum_{i=1}^{N} \rho(e_i) = \min_\theta \sum_{i=1}^{N} \big( 1 - \exp(-e_i^2 / 2\sigma^2) \big) / (\sqrt{2\pi}\,\sigma)
\Longleftrightarrow \max_\theta \sum_{i=1}^{N} \exp(-e_i^2 / 2\sigma^2) / (\sqrt{2\pi}\,\sigma) = \max_\theta \sum_{i=1}^{N} \kappa_\sigma(e_i)
Correntropy:
A new generalized similarity measure
Nonlinear regression with outliers (Middleton model)
Polynomial approximator
Correntropy:
A new generalized similarity measure
Define the centered correntropy
U(X,Y) = E_{X,Y}[\kappa(X - Y)] - E_X E_Y[\kappa(X - Y)] = \int\!\!\int \kappa(x - y)\,\{ dF_{X,Y}(x,y) - dF_X(x)\,dF_Y(y) \}
\hat{U}(x,y) = \frac{1}{N} \sum_{i=1}^{N} \kappa(x_i - y_i) - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa(x_i - y_j)
Define the correntropy coefficient
\eta(X,Y) = \frac{U(X,Y)}{\sqrt{U(X,X)\,U(Y,Y)}}, \qquad \hat{\eta}(x,y) = \frac{\hat{U}(x,y)}{\sqrt{\hat{U}(x,x)\,\hat{U}(y,y)}}
Define the parametric correntropy, with a, b \in R and a \neq 0:
V_{a,b}(X,Y) = E_{X,Y}[\kappa(aX + b - Y)] = \int\!\!\int \kappa(ax + b - y)\,dF_{X,Y}(x,y)
Define the parametric centered correntropy
U_{a,b}(X,Y) = E_{X,Y}[\kappa(aX + b - Y)] - E_X E_Y[\kappa(aX + b - Y)]
Define the parametric correntropy coefficient
\eta_{a,b}(X,Y) = \eta(aX + b, Y)
Rao M., Xu J., Seth S., Chen Y., Tagare M., Principe J., “Correntropy Dependence Measure”, submitted to Metrika .
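The centered correntropy and correntropy coefficient estimators above, as a NumPy sketch with a Gaussian kernel (function names are illustrative):

```python
import numpy as np

def centered_correntropy(x, y, sigma):
    """U(X,Y) = E_{XY}[k(X - Y)] - E_X E_Y[k(X - Y)], sample version."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    joint = norm * np.exp(-(x - y) ** 2 / (2 * sigma ** 2)).mean()
    cross = norm * np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2)).mean()
    return joint - cross

def correntropy_coefficient(x, y, sigma):
    """eta(X,Y) = U(X,Y) / sqrt(U(X,X) U(Y,Y))."""
    return centered_correntropy(x, y, sigma) / np.sqrt(
        centered_correntropy(x, x, sigma) * centered_correntropy(y, y, sigma))
```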
Correntropy:
Correntropy Dependence Measure
Theorem: Given two random variables X and Y, the parametric centered correntropy U_{a,b}(X, Y) = 0 for all a, b in R if and only if X and Y are independent.
Theorem: Given two random variables X and Y, the parametric correntropy coefficient \eta_{a,b}(X, Y) = 1 for certain a = a_0 and b = b_0 if and only if Y = a_0 X + b_0.
Definition: Given two r.v. X and Y, the correntropy dependence measure is defined as
\Gamma(X,Y) = \sup_{a,b \in R,\ a \neq 0} \eta_{a,b}(X,Y)
Statistics in RKHSK
Bach and Jordan defined the kernel generalized variance as
KGV = \frac{\det \Sigma_{[XY][XY]}}{\det \Sigma_{XX} \Sigma_{YY}}
where \Sigma_{XY} is the cross-covariance operator
\Sigma_{XY} = E[\Phi_Y(Y) \otimes \Phi_X(X)] - E[\Phi_Y(Y)] \otimes E[\Phi_X(X)]
 = E[\langle k_Y(\cdot,Y) \otimes k_X(\cdot,X), \cdot \rangle] - \langle E[k_Y(\cdot,Y)] \otimes E[k_X(\cdot,X)], \cdot \rangle
It contains the information about all the higher-order correlations.
Independence by constrained covariance (COCO):
COCO = \sup_{\|f\| = \|g\| = 1} \mathrm{Cov}[f(X), g(Y)]
Gretton proposed to restrict the functions to a RKHS and solve the problem as the largest singular value, or the trace of the Frobenius norm, HSIC (the K are the Gram matrices for f, g):
HSIC = \max_{\beta_1, \beta_2} \frac{\beta_1^T K_1 K_2 \beta_2}{\sqrt{\beta_1^T K_1^2 \beta_1}\ \sqrt{\beta_2^T K_2^2 \beta_2}}
RKHS induced by Correntropy
Definition
For a stochastic process {X_t, t ∈ T}, with T being an index set, the correntropy defined as
V_X(t,s) = E[\kappa(X_t - X_s)]
is symmetric and positive definite.
Thus it induces a new RKHS, denoted RKHS_C. There is a kernel mapping \Psi such that
V_X(t,s) = \langle \Psi(t), \Psi(s) \rangle_C
Any symmetric non-negative kernel is the covariance kernel of a random function and vice versa (Parzen).
Therefore, given a random function {X_t, t ∈ T}, there exists another random function {f_t, t ∈ T} (a Gaussian process) such that
E[f_t f_s] = V_X(t,s)
RKHS induced by Correntropy
Relation with RKHSP
Correntropy estimates similarity in feature space. It transforms the data component-wise to feature space by f(·) and then takes the correlation:
V(x,y) = E[\kappa(x - y)] = E[f(x)\,f(y)]
f(·) is a function that is also only implicitly defined (as \Phi was).
RKHSC is nonlinearly related to the input space (unlike
RKHSP).
RKHSC seems very appropriate to EXTEND traditional
linear signal processing methods to nonlinear system
identification.
RKHS induced by Correntropy
Relation with RKHSK.
Gram matrix estimates pairwise evaluations in feature
space. It is a sample by sample evaluation of functions
Gram matrix is NxN (data size)
Correntropy matrix is LxL (space size)
This means huge savings:
If we have a million samples in a 5x5 input space, the Gram
matrix is 10^6 x 10^6.
The correntropy matrix is 5x5, but each element is
computed from a million pairs.
(very much like a MA model of order 5 estimated with
one million samples)
Applications of Correntropy
Matched Filtering
The matched filter computes the inner product between the received signal r(n) and the template s(n) (i.e., R_{sr}(0)).
The correntropy MF instead computes
V_{rs}(0) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(r_i - s_i)
Hypotheses for the received signal:
H_0: r_k = n_k \qquad\qquad H_1: r_k = s_k + n_k
Similarity values under the two hypotheses:
V_0 = \frac{1}{N \sqrt{2\pi}\,\sigma} \sum_{i=1}^{N} e^{-(s_i - n_i)^2 / 2\sigma^2} \qquad\qquad V_1 = \frac{1}{N \sqrt{2\pi}\,\sigma} \sum_{i=1}^{N} e^{-n_i^2 / 2\sigma^2}
(Patent pending)
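A sketch of the correntropy matched-filter statistic V_rs(0); a detector would compare it against a threshold chosen for the desired false-alarm rate (the threshold selection is not specified here, and the function name is illustrative):

```python
import numpy as np

def correntropy_matched_filter(r, s, sigma):
    """Correntropy MF statistic: V_rs(0) = (1/N) sum_i k_sigma(r_i - s_i).
    Large values indicate that the template s is present in the received signal r."""
    r, s = np.asarray(r, float), np.asarray(s, float)
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    return norm * np.exp(-(r - s) ** 2 / (2 * sigma ** 2)).mean()

# hypothetical usage: decide H1 if correntropy_matched_filter(r, s, sigma) > threshold
```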
Applications of Correntropy
Matched Filtering
Linear Channels
White Gaussian noise
Impulsive noise
Template: binary sequence of length 20; kernel size set using Silverman's rule.
Applications of Correntropy
Matched Filtering
Alpha-stable noise (α = 1.1), and the effect of the kernel size.
Template: binary sequence of length 20; kernel size set using Silverman's rule.
Applications of Correntropy
Minimum Average Correlation Energy filters
We consider 2-dimensional image data as a d x 1 column vector. The conventional MACE filter is formulated in the frequency domain.
• Training image matrix
• Correlation energy of the i-th image: involves a diagonal matrix whose elements are the magnitude squared of the associated elements, that is, the power spectrum of the image
• Correlation value at the origin
• Average energy over all training images
• The MACE filter minimizes the average energy while satisfying the constraint
Application of Correntropy
The Correntropy MACE filter
• Exploiting the linear-space properties of the RKHS, the optimization problem is posed in the feature space.
• Solution: the feature-space mapping is not known explicitly, but the quantities needed for the solution can still be computed.
• Test output.
Applications of Correntropy
VMACE Simulation Results
• Face recognition (MACE vs. correntropy MACE).
• True class images: we used the facial expression database from the Advanced Multimedia Processing Lab at CMU. The database consists of 13 subjects, whose facial images were captured with 75 varying expressions. The size of each image is 64x64.
• False class images: with additive Gaussian noise (SNR = 10 dB).
• We report the results of the two most difficult cases, which produced the worst performance with the MACE method.
• The simulation results were obtained by averaging (Monte Carlo approach) over 100 different training sets (each training set consists of 5 randomly chosen images).
• The kernel size is ~40% of the standard deviation of the input data.
Applications of Correntropy
VMACE ROC curves with different SNRs
Applications of Correntropy
CMACE on SAR/ATR – aspect angle mismatch
Applications of Correntropy
Wiener Filtering
We can also formulate the Wiener filter as a Correntropy filter
instead of kernel regression by directly exploiting the linear
structure of the RKHS. However, there is still an approximation
in the computation.
y(n) = F^T(n) \left[ V^{-1} \frac{1}{N} \sum_{k=1}^{N} d(k)\, F(k) \right]
 = \frac{1}{N} \sum_{k=1}^{N} \sum_{j=0}^{L-1} \sum_{i=0}^{L-1} a_{ij}\, f(n-i)\, f(k-j)\, d(k)
 = \frac{1}{N} \sum_{k=1}^{N} d(k) \left[ \sum_{j=0}^{L-1} \sum_{i=0}^{L-1} a_{ij}\, \{ f(n-i)\, f(k-j) \} \right]
 \approx \frac{1}{N} \sum_{k=1}^{N} d(k) \left[ \sum_{j=0}^{L-1} \sum_{i=0}^{L-1} a_{ij}\, K\big( x(n-i), x(k-j) \big) \right]
where the a_{ij} are the entries of V^{-1}.
(Patent pending)
Applications of Correntropy
Wiener Filtering
Prediction of the Mackey-Glass equation (a mildly chaotic system)
Wiener versus correntropy
RBF (50) versus correntropy
Applications of Correntropy
Nonlinear temporal PCA
The Karhunen-Loeve transform performs principal component analysis (PCA) of the autocorrelation of the r.p.
X = \begin{bmatrix} x(1) & \cdots & x(N) \\ \vdots & & \vdots \\ x(L) & \cdots & x(N+L-1) \end{bmatrix}_{L \times N}
R = XX^T = \begin{bmatrix} r(0) & r(1) & \cdots & r(L-1) \\ r(1) & r(0) & & \vdots \\ \vdots & & \ddots & r(1) \\ r(L-1) & \cdots & r(1) & r(0) \end{bmatrix}_{L \times L}
\qquad
K = X^T X = \begin{bmatrix} r(0) & r(1) & \cdots & r(N-1) \\ r(1) & r(0) & & \vdots \\ \vdots & & \ddots & r(1) \\ r(N-1) & \cdots & r(1) & r(0) \end{bmatrix}_{N \times N}
R = XX^T = U D D^T U^T \qquad\qquad K = X^T X = V D^T D V^T
D is an L x N "diagonal" matrix containing the singular values \{\sigma_1, \sigma_2, \ldots, \sigma_L\}, and
U_i^T X = \sigma_i V_i^T, \quad i = 1, 2, \ldots, L
KL can be also done by decomposing the Gram matrix K directly.
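A sketch of the correntropy-based temporal PCA described next: estimate the correntropy at lags 0..L-1, build the L-by-L Toeplitz correntropy matrix, and eigendecompose it. This is a plain-correntropy sketch for simplicity (in practice the centered correntropy may be preferable); the function name is illustrative.

```python
import numpy as np
from scipy.linalg import toeplitz, eigh

def correntropy_pca(x, L, sigma):
    """Eigendecompose the L x L Toeplitz correntropy matrix of the time series x."""
    x = np.asarray(x, dtype=float)
    norm = 1.0 / (np.sqrt(2 * np.pi) * sigma)
    v = np.empty(L)
    for m in range(L):                              # correntropy at lags 0..L-1
        d = x[m:] - x[: len(x) - m]
        v[m] = norm * np.exp(-d ** 2 / (2 * sigma ** 2)).mean()
    V = toeplitz(v)                                 # V[i, j] = v(|i - j|)
    eigvals, eigvecs = eigh(V)                      # returned in ascending order
    return eigvals[::-1], eigvecs[:, ::-1]          # principal components first
```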
Applications of Correntropy
Nonlinear KL transform
Since the autocorrelation function of the projected data in RKHS is
given by correntropy, we can directly construct K with correntropy.
Example:
x(m) = A \sin(2\pi f m) + z(m)
where
p_N(n) = 0.8\, N(0, 0.1) + 0.1\, N(4, 0.1) + 0.1\, N(-4, 0.1)
Detection of the sinusoidal component (2nd PC):
A      VPCA    PCA by N-by-N (N=256)   PCA by L-by-L (L=100)   PCA by L-by-L (L=4)
0.2    100%    15%                     3%                      8%
0.25   100%    27%                     6%                      17%
0.5    100%    99%                     47%                     90%
1,000 Monte Carlo runs; \sigma = 1.
Applications of Correntropy
Correntropy Spectral Density
The CSD is a function of the kernel size; it shows the difference between the PSD (recovered for large \sigma) and the new spectral measure.
[Figures: average normalized amplitude vs. frequency for several kernel sizes, and time-bandwidth product vs. kernel size.]
Applications of Correntropy
Correntropy based correlograms
Correntropy can be used in computational auditory scene analysis
(CASA), providing much better frequency resolution.
Figures show the correlogram from a 30-channel cochlea model for one vowel (pitch = 100 Hz).
Auto-correlation Function
Auto-correntropy Function
lags
Applications of Correntropy
Correntropy based correlograms
Correntropy can be used in computational auditory scene analysis
(CASA), providing much better frequency resolution.
Figures show the correlogram from a 30-channel cochlea model for two superimposed vowels from two speakers (f0 = 100 Hz and 126 Hz).
Auto-correlation Function
Auto-correntropy Function
lags
Applications of Correntropy
Correntropy based correlograms
ROC for noiseless (left) and noisy (right) double-vowel discrimination
Applications of RKHS
Tests for nonlinearity
When is the data generated by a nonlinear system?
Use Fourier surrogates (take FFT, randomize the phase, and invert)
Test whether the spectra of the original and of the surrogate are statistically different (with some confidence). If they are, then declare the model that generated the data nonlinear.
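A sketch of the Fourier (phase-randomized) surrogate generation step described above: the surrogate keeps the amplitude spectrum of the original series, and therefore its linear correlation structure, while destroying nonlinear structure. The function name and RNG handling are illustrative choices.

```python
import numpy as np

def fourier_surrogate(x, rng=None):
    """Phase-randomized surrogate: keep |FFT|, draw random phases, invert."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0                      # keep the DC bin real
    if len(x) % 2 == 0:
        phases[-1] = 0.0                 # keep the Nyquist bin real for even lengths
    surrogate = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(surrogate, n=len(x))
```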
Conclusions
Information Theoretic Learning took us beyond Gaussian
statistics and MSE as cost functions.
ITL generalizes many of the statistical concepts we take for granted.
Kernel methods implement shallow neural networks (RBFs) and
extend easily the linear algorithms we all know.
KLMS is a simple algorithm for on-line learning of nonlinear systems
Correntropy defines a new RKHS that seems to be very
appropriate for nonlinear system identification and robust control
Correntropy may take us out of the local minimum of the (adaptive)
design of optimum linear systems
For more information go to www.cnel.ufl.edu -> ITL resource for tutorials, demos, and downloadable MATLAB code.
ITL – Applications
Nonlinear system identification
Minimize information content of the residual error
[Diagram: the input x drives the adaptive system; its output y is compared with the desired response d, and the error e is minimized with the minimum error entropy criterion.]
Equivalently, this provides the best density matching between the output and the desired signals:
\min_w \frac{1}{1-\alpha} \log \int p_e^\alpha(\xi; w)\,d\xi \;=\; \min_w \frac{1}{1-\alpha} \log \int\!\!\int p_{xy}(\xi, \zeta; w) \left( \frac{p_{xy}(\xi, \zeta; w)}{p_{xd}(\xi, \zeta)} \right)^{\alpha-1} d\xi\, d\zeta
Erdogmus, Principe, IEEE Trans. Signal Processing, 2002. (IEEE SP 2003 Young Author Award)
ITL – Applications
Time-series prediction
Chaotic Mackey-Glass (MG-30) series
Compare 2 criteria:
Minimum squared-error
Minimum error entropy
[Figures: probability density of the error value for the MEE and MSE criteria, and probability density of the signal value for the data, MEE, and MSE predictors. System: TDNN (6+4+1).]
ITL - Applications
System identification
Feature extraction
ITL
Blind source separation
Clustering
ITL – Applications
Optimal feature extraction
Data processing inequality: Mutual information is monotonically nonincreasing.
Classification error inequality: Error probability is bounded from below
and above by the mutual information.
[Diagram: the input x is mapped to features z by maximizing the mutual information with the class labels d; a classifier maps z to y, which is compared with d; information forces are backpropagated.]
PhD on feature extraction for sonar target recognition (2002)
Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
ITL – Applications
Extract 2 nonlinear features
64x64 SAR images of 3 vehicles: BMP2, BTR70, T72
Information forces in training
Classification results:
Method       P(Correct)
MI+LR        94.89%
SVM          94.60%
Templates    90.40%
Zhao, Xu and Principe, SPIE Automatic Target Recognition, 1999.
Hild, Erdogmus, Principe, IJCNN Tutorial on ITL, 2003.
ITL - Applications
Feature extraction
System identification
ITL
Blind source separation
Clustering
ITL – Applications
Independent component analysis
Observations are generated by an unknown mixture
of statistically independent unknown sources.
x_k = H s_k \qquad\qquad I(\mathbf{z}) = \sum_{c=1}^{n} H(z_c) - H(\mathbf{z})
[Diagram: independent sources s are mixed by H into the observed signals x; sphering (R) produces uncorrelated signals z, and a rotation W produces the independent output signals y by minimizing the mutual information.]
Ken Hild II, PhD on blind source separation (2003).
ITL – Applications
On-line separation of mixed sounds
[Figures: SIR (dB) vs. data length for off-line separation of a 10x10 mixture (MeRMaId, Infomax, FastICA, Comon) and SIR (dB) vs. time for on-line separation of a 3x3 mixture (MeRMaId-SIG, Infomax, FastICA, Comon).]
[Audio demo: observed mixtures X1-X3 and separated outputs Z1-Z3.]
Hild, Erdogmus, Principe, IEEE Signal Processing Letters, 2001.
ITL - Applications
Feature extraction
System identification
ITL
Blind source separation
Clustering
ITL – Applications
Information theoretic clustering
Select clusters based on entropy and divergence:
Minimize within cluster entropy
Maximize between cluster divergence
\min_{m} \frac{\int p(x) q(x)\,dx}{\left( \int p^2(x)\,dx \int q^2(x)\,dx \right)^{1/2}}
where m is the membership vector; the \int p^2 and \int q^2 terms correspond to the within-cluster entropies, and the ratio is the (Cauchy-Schwarz) between-cluster divergence.
Robert Jenssen PhD on information theoretic clustering
Jenssen, Erdogmus, Hild, Principe, Eltoft, IEEE Trans. Pattern Analysis and Machine Intelligence, 2005. (submitted)