Information Theoretic Learning
Jose C. Principe
Yiwen Wang
Computational NeuroEngineering Laboratory
Electrical and Computer Engineering Department
University of Florida
www.cnel.ufl.edu
principe@cnel.ufl.edu
Acknowledgments
Dr. Deniz Erdogmus
My students: Puskal Pokharel, Weifeng Liu, Jianwu Xu, Kyu-Hwa Jeong, Sudhir Rao, Seungju Han
NSF ECS – 0300340 and 0601271
(Neuroengineering program)
Resources
CNEL Website www.cnel.ufl.edu
Front page, go to ITL resources
(tutorial, examples, MATLAB code)
Publications
Information Filtering:
Deniz Erdogmus and Jose Principe, "From Linear Adaptive Filtering to Nonlinear Information Processing," IEEE Signal Processing Magazine, November 2006
Outline
• Motivation
• Renyi’s entropy definition
• A sample by sample estimator for entropy
• Projections based on mutual information
• Applications
• Optimal Filtering
• Classification
• Clustering
• Conclusions
Data is everywhere!
Remote sensing • Biomedical applications • Wireless communications • Speech processing • Sensor arrays
From Data to Models
Optimal Adaptive Models:
[Block diagram: input data x drives an adaptive system y = f(x, w); its output is compared with the desired data d to form the error e, which feeds the cost function used by the learning algorithm to update w.]
From Linear to Nonlinear Mappings
• Wiener showed us how to compute optimal linear
projections. The LMS/RLS algorithms showed us how
to find the Wiener solution sample by sample.
• Neural networks brought us the ability to work
non-parametrically with nonlinear function approximators.
• Linear regression → nonlinear regression
• Optimum linear filtering → TLFNs
• Linear projections (PCA) → principal curves
• Linear discriminant analysis → MLPs
Adapting Linear and NonLinear Models
The goal of learning is to optimize the performance of
the parametric mapper according to some cost function.
• In classification, minimize the probability of error.
• In regression the goal is to minimize the error in the fit.
The cost function most widely used has been the mean
square error (MSE). It provides the Maximum Likelihood
solution when the error is Gaussian distributed.
In NONLINEAR systems this is hardly ever the case.
Beyond Second Order Statistics
• We submit that the goal of learning should be to transfer
as much information as possible from the inputs to the
weights of the system (no matter if unsupervised or
supervised).
• As such the learning criterion should be based on
entropy (single data source) or divergence (multiple data
sources).
• Hence the challenge is to find nonparametric, sample-by-sample estimators for these quantities.
ITL: Unifying Learning Scheme
Normally supervised and unsupervised learning are
treated differently, but there is no need to do so. One
can come up with a general class of cost functions
based on Information Theory that apply to both
learning schemes.
Cost function
(Minimize, Maximize, Nullify)
1. Entropy
• Single group of RV’s
2. Divergence
• Two or more groups of RV’s
ITL: Unifying Learning Scheme
• Function approximation → minimize error entropy
• Classification → minimize error entropy; maximize mutual information between class labels and outputs
• Jaynes' MaxEnt → maximize output entropy
• Linsker's maximum information transfer → maximize MI between input and output
• Optimal feature extraction → maximize MI between desired and output
• Independent component analysis → maximize output entropy; minimize mutual information among outputs
ITL: Unifying Learning Scheme
[Block diagram: the input signal X enters a learning system Y = q(X, W); the output signal Y and the desired signal D feed an information measure I(Y, D) that drives the optimization.]
Information Theory
Is a probabilistic description of random variables that
quantifies the very essence of the communication process.
It has been instrumental in the design and quantification of
communication systems.
Information theory provides a quantitative and consistent
framework to describe processes with partial knowledge
(uncertainty).
Information Theory
Not all the random events are equally random!
[Figure: two pdfs p(x), Case 1 and Case 2, with different spreads, illustrating different degrees of uncertainty.]
How to quantify this fact? Shannon proposed the
concept of ENTROPY
Formulation of Shannon’s Entropy
Hartley Information (1928)
Large probability → small information: $p_X(x) \to 1 \Rightarrow S_H \to 0$
Small probability → large information: $p_X(x) \to 0 \Rightarrow S_H \to \infty$
Two identical channels should have twice the capacity of one; the logarithm is the natural measure for additivity:
$g\big(p_X(x)^2\big) = 2\,g\big(p_X(x)\big)$
$S_H = -\log_2 p_X(x)$
Formulation of Shannon’s Entropy
Expected value of Hartley Information
Expected value of Hartley information:
$H_S(X) = E[S_H] = -\sum_x p_X(x)\,\log p_X(x)$
Communications: H sets the ultimate limit of data compression, and the channel capacity the rate for asymptotically error-free communication.
Measure of (relative) uncertainty.
Shannon used a principled (axiomatic) approach to define entropy.
Review of Information Theory
Shannon entropy:
$S(p_k) = -\log p_k, \qquad H(P) = \sum_k p_k\, S(p_k) = -\sum_k p_k \log p_k$
Mutual information:
$I(x, y) = H(x) + H(y) - H(x, y) = H(x) - H(x|y) = H(y) - H(y|x)$
Kullback-Leibler divergence:
$K(f, g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx, \qquad I(x, y) = \iint f_{xy}(x, y)\,\log\frac{f_{xy}(x, y)}{f_x(x)\, f_y(y)}\,dx\,dy$
Properties of Shannon’s Entropy
Discrete RVs:
• H(X) ≥ 0
• H(X) ≤ log N, with equality iff X is uniform
• H(Y|X) ≤ H(Y), with equality iff X, Y are independent
• H(X,Y) = H(X) + H(Y|X)
Continuous RVs:
• Replace the summation with an integral (differential entropy)
• Minimum entropy: a sum of delta functions
• Maximum entropy: fixed variance → Gaussian; fixed upper/lower limits → uniform
Properties of Mutual Information
$I_S(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
$I_S(X;Y) = I_S(Y;X)$
$I_S(X;X) = H_S(X)$
[Venn diagram: $H_S(X)$ and $H_S(Y)$ overlap in $I_S(X;Y)$; the remaining regions are $H_S(X|Y)$ and $H_S(Y|X)$, and the union is $H_S(X,Y)$.]
A Different View of Entropy
• Shannon's entropy
$H_S(X) = -\sum_x p_X(x)\log p_X(x), \qquad H_S(X) = -\int p_X(x)\log p_X(x)\,dx$
• Renyi's entropy
$H_\alpha(X) = \frac{1}{1-\alpha}\log\sum_x p_X^\alpha(x), \qquad H_\alpha(X) = \frac{1}{1-\alpha}\log\int p_X^\alpha(x)\,dx$
Renyi's entropy becomes Shannon's as $\alpha \to 1$.
• Fisher's entropy (local)
$H_f(X) = \int \frac{\big(\tfrac{\partial}{\partial x} p_X(x)\big)^2}{p_X(x)}\,dx$
Renyi’s Entropy
Norm of the pdf:
$\alpha\text{-norm} = \sqrt[\alpha]{V_\alpha}, \qquad V_\alpha = \int f_Y^\alpha(y)\,dy$
Entropies in terms of $V_\alpha$:
$H_{R_\alpha}(y) = \frac{1}{1-\alpha}\,\log V_\alpha$
$H_S(y) = \lim_{\alpha\to 1}\frac{1}{1-\alpha}\,\log V_\alpha = \lim_{\alpha\to 1}\frac{1}{1-\alpha}\,(V_\alpha - 1)$
Geometrical Illustration of α-Entropy
$\sum_{k=1}^{n} p_k^\alpha = \|p\|_\alpha^\alpha$ (the $\alpha$-norm of the probability vector $p = (p_1, p_2, p_3, \ldots)$ raised to the power $\alpha$), so the entropy is a monotonic function of the $\alpha$-norm.
[Figure: level sets of the $\alpha$-norm on the probability simplex (axes $p_1$, $p_2$) illustrate contours of constant entropy.]
Properties of Renyi’s Entropy
(a) Continuous function of all probabilities
(b) Permutationally symmetric
(c) H(1/n, ..., 1/n) is an increasing function of n
(d) Recursivity:
$H(p_1, \ldots, p_n) = H(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\, H\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)$
(e) Additivity: $H(p \cdot q) = H(p) + H(q)$ if p and q are independent

Property   (a)   (b)   (c)   (d)   (e)
Shannon    YES   YES   YES   YES   YES
Renyi      YES   YES   YES   NO    YES
Properties of Renyi’s entropy
Renyi's entropy provides both an upper and a lower bound on the probability of error in classification:
$\frac{H_\alpha(W|M) - H_\alpha(e)}{\log(N_c - 1)} \;\le\; P(e) \;\le\; \frac{H_\beta(W|M) - H_\beta(e)}{\min_k H_\beta(W\,|\,e, m_k)}, \qquad \alpha \ge 1,\ \beta \le 1$
unlike Shannon's entropy, which provides only a lower bound (Fano's inequality, which is the tightest bound):
$P(e) \;\ge\; \frac{H_S(W|M) - H_S(e)}{\log(N_c - 1)}$
Nonparametric Entropy Estimators
(Only continuous variables are interesting…)
• Plug-in estimates
  - Integral estimates
  - Resubstitution estimates
  - Splitting-data estimates
  - Cross-validation estimates
• Sample-spacing estimates
• Nearest-neighbor distances
Parzen Window Method
Put a kernel over the samples, normalize and add.
Entropy becomes a function of continuous RV.
$\hat f_X(x) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma\big(x - a(i)\big), \qquad \{a(i)\,|\, i = 1, 2, \ldots, N\}$
$G(x, \Sigma) = \frac{1}{(2\pi)^{d/2}\,\sigma^{d}}\, e^{-x^T x / (2\sigma^2)}, \qquad \Sigma = \sigma^2 I \ \text{(covariance matrix)}$
A kernel is a positive function that integrates to 1 and peaks at the sample location (e.g., the Gaussian).
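Below is a minimal numerical sketch of the Parzen estimate with a Gaussian kernel; the sample data, grid, and kernel size are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def parzen_pdf(x, samples, sigma):
    """Parzen (kernel) density estimate at points x from 1-D samples,
    using a Gaussian kernel of size sigma."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # query points as a column
    a = np.asarray(samples, dtype=float)[None, :]            # samples as a row
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)
    return norm * np.exp(-(x - a) ** 2 / (2.0 * sigma ** 2)).mean(axis=1)

# Example: estimate a Laplacian pdf from N = 1000 samples
rng = np.random.default_rng(0)
samples = rng.laplace(size=1000)
grid = np.linspace(-5.0, 5.0, 200)
pdf_hat = parzen_pdf(grid, samples, sigma=0.3)
```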
Parzen Windows
[Figure: Parzen estimates $\hat f_X(x)$ of a Laplacian and a uniform pdf for N = 10 and N = 1000 samples; with more samples the estimates approach the true densities.]
Parzen Windows
Smooth estimator
Arbitrarily close fit as N → ∞, σ → 0
Curse of Dimensionality
Previous pictures for d = 1 dimension
For a linear increase in d,
an exponential increase in N is required
for an equally “good” approximation
In ITL we use Parzen windows not to estimate the PDF
but to estimate the first moment of the PDF.
Renyi’s Quadratic Entropy Estimation
Quadratic entropy (α = 2) and the Information Potential:
$H_2(X) = -\log V_2(X), \qquad V_2(X) = \int p_X^2(x)\,dx$
Use Parzen window pdf estimation with a (symmetric) Gaussian kernel.
Information potential: think of the samples as particles (as in a gravitational or electrostatic field) that interact with the others according to a law given by the kernel shape.
IP as an Estimator of Quadratic Entropy
Information Potential (IP):
$V_2(X) = \int_{-\infty}^{+\infty} \hat f_X(x)^2\,dx = \int_{-\infty}^{+\infty} \Big(\frac{1}{N}\sum_{i=1}^{N} G_\sigma\big(x - a(i)\big)\Big)\Big(\frac{1}{N}\sum_{j=1}^{N} G_\sigma\big(x - a(j)\big)\Big)\,dx$
$= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\int_{-\infty}^{+\infty} G_\sigma\big(x - a(i)\big)\, G_\sigma\big(x - a(j)\big)\,dx = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G\big(a(i) - a(j),\; 2\sigma^2\big)$
IP as an Estimator of Quadratic Entropy
There is NO approximation in computing the Information Potential for α = 2 besides the choice of the kernel.
This result is the kernel trick used in Support Vector
Machines.
It means that we never explicitly estimate the PDF,
which improves greatly the applicability of the method.
Information Force (IF)
Between two information particles (IPTs):
$\frac{\partial}{\partial a(i)}\, G\big(a(i) - a(j),\, 2\sigma^2\big) = \frac{1}{2\sigma^2}\, G\big(a(i) - a(j),\, 2\sigma^2\big)\,\big(a(j) - a(i)\big)$
Overall force on sample a(i):
$\frac{\partial V(X)}{\partial a(i)} = \frac{-1}{2\sigma^2 N^2}\sum_{j=1}^{N} G\big(a(i) - a(j),\, 2\sigma^2\big)\,\big(a(i) - a(j)\big)$
Calculation of IP & IF
d ij = a  i  – a  j 
2
v ij = G d  ij  2  

 D =  d ij 

 v = v  ij  
a  1  a  2   a  j   a  N
a 1 
a 2 
a i  – a j 
a i

a N 
1 N N  
V = ------2   v ij
N i = 1j = 1
N
–
1
f i  = -----------v i j  d ij
2 2 
N  j=1
July 11, 2016
1
N


v i = ---- 
v ij 
N
j= 1
i = 1  N
N
1
V = ---- 
v  i
N
i =1
34
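A small NumPy sketch of these quantities for 1-D samples follows; the function names and the test data are illustrative, not from the slides:

```python
import numpy as np

def gaussian(d, var):
    """1-D Gaussian kernel with variance var evaluated at d."""
    return np.exp(-d ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def ip_and_if(a, sigma):
    """Information Potential V and Information Forces F for samples a."""
    a = np.asarray(a, dtype=float)
    N = a.size
    d = a[:, None] - a[None, :]              # d_ij = a(i) - a(j)
    v = gaussian(d, 2.0 * sigma ** 2)        # v_ij = G(d_ij, 2*sigma^2)
    V = v.sum() / N ** 2                     # information potential
    F = -(v * d).sum(axis=1) / (2.0 * sigma ** 2 * N ** 2)  # force on each sample
    return V, F

# Quadratic Renyi entropy estimate: H2 = -log V
a = np.random.default_rng(1).normal(size=200)
V, F = ip_and_if(a, sigma=0.5)
H2 = -np.log(V)
```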
Central “Moments”
Mean: $E[X] = \int x\, p_X(x)\,dx$
Variance: $E\big[(X - E[X])^2\big] = \int (x - E[X])^2\, p_X(x)\,dx$
Entropy (Shannon): $-E[\log p_X(X)] = -\int p_X(x)\log p_X(x)\,dx$
Entropy (Renyi, $\alpha = 2$): $-\log E[p_X(X)] = -\log\int p_X(x)^2\,dx$
Moment Estimation
Mean: $E[X] \approx \frac{1}{N}\sum_{i=1}^{N} a(i)$
Variance: $E\big[(X - E[X])^2\big] \approx \frac{1}{N}\sum_{i=1}^{N}\big(a(i) - \hat E[X]\big)^2$
Entropy (Shannon): $-E[\log p_X(X)] \approx -\frac{1}{N}\sum_{i=1}^{N}\log \hat f_X\big(a(i)\big) = -\frac{1}{N}\sum_{i=1}^{N}\log\Big(\frac{1}{N}\sum_{j=1}^{N} G\big(a(i) - a(j),\, \sigma^2\big)\Big)$
Entropy (Renyi, $\alpha = 2$): $-\log E[p_X(X)] \approx -\log\Big(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G\big(a(i) - a(j),\, 2\sigma^2\big)\Big)$
Which of the two Extremes?
Estimation of pdf must be accurate for practical ITL?
ITL (Minimization/maximization) doesn’t require an
accurate pdf estimate?
None of the above, but still not fully characterized
Extension to any kernel
• We do not need to use Gaussian kernels in
the Parzen estimator.
• Can use any kernel that is symmetric and
differentiable
(k(0) > 0, k’(0) = 0 and k”(0) < 0) .
• We normally work with kernels scaled from a unit-size kernel:
$\kappa_\sigma(x) = \frac{1}{\sigma}\,\kappa\!\left(\frac{x}{\sigma}\right)$
Extension to any α
• Redefine the Information Potential as
$V_\alpha(X) = \int p_X^\alpha(x)\,dx = E\big[p_X^{\alpha-1}(X)\big] \approx \frac{1}{N}\sum_i \hat p_X^{\alpha-1}\big(a(i)\big)$
• Using the Parzen estimator we obtain
$\hat V_\alpha(X) = \frac{1}{N}\sum_j \Big(\frac{1}{N}\sum_i \kappa_\sigma\big(a(j) - a(i)\big)\Big)^{\alpha-1}$
For $\alpha = 2$ this estimator corresponds exactly to the quadratic estimator with the proper kernel width $\sigma$.
Extension to any α, kernel
• The α-information potential:
$\hat V_\alpha(X) = \frac{1}{N}\sum_j \Big(\frac{1}{N}\sum_i \kappa_\sigma\big(a(j) - a(i)\big)\Big)^{\alpha-1}$
• The α-information force:
$F_\alpha(x_j) = (\alpha - 1)\, \hat f_X^{\,\alpha-2}(x_j)\, F_2(x_j)$
where $F_2(X)$ is the quadratic IF. Hence we see that the "fundamental" definitions are the quadratic IP and IF, and the "natural" kernel is the Gaussian.
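A hedged sketch of the α-information potential and the corresponding Renyi entropy estimate with a Gaussian kernel (function names are illustrative):

```python
import numpy as np

def alpha_information_potential(a, sigma, alpha):
    """V_alpha(X) ~= (1/N) * sum_j [ (1/N) * sum_i k_sigma(a_j - a_i) ]^(alpha-1)."""
    a = np.asarray(a, dtype=float)
    d = a[:, None] - a[None, :]
    k = np.exp(-d ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    f_hat = k.mean(axis=1)                     # Parzen estimate at each sample
    return np.mean(f_hat ** (alpha - 1))

def renyi_entropy(a, sigma, alpha):
    """Renyi entropy estimate H_alpha = log(V_alpha) / (1 - alpha)."""
    return np.log(alpha_information_potential(a, sigma, alpha)) / (1.0 - alpha)
```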
How to select the kernel size
• Different values of σ produce different entropy estimates. We suggest using 3σ ≈ 0.1 of the dynamic range (interaction among roughly 10 samples).
• Or use Silverman's rule:
$\sigma = 0.9\, A\, N^{-1/5}$
where A is the minimum of the empirical data standard deviation and the data interquartile range scaled by 1.34.
Kernel size is just a scale parameter
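A short sketch of Silverman's rule as stated above (the function name is illustrative):

```python
import numpy as np

def silverman_kernel_size(a):
    """Silverman's rule of thumb: sigma = 0.9 * A * N**(-1/5),
    with A = min(std, IQR / 1.34)."""
    a = np.asarray(a, dtype=float)
    N = a.size
    std = a.std(ddof=1)
    iqr = np.subtract(*np.percentile(a, [75, 25]))
    A = min(std, iqr / 1.34)
    return 0.9 * A * N ** (-1.0 / 5.0)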
Kullback Leibler Divergence
KL Divergence measures the “distance” between pdfs
(Csiszar and Amari)
Relative entropy
Cross entropy
Information for discrimination
$D_K(f \| g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx$
Notice the similarity to $H_S(X) = -\int f_X(x)\log f_X(x)\,dx$.
Mutual Information & KL Divergence
Shannon's mutual information:
$I_S(X_1, X_2) = \iint f_{X_1 X_2}(x_1, x_2)\,\log\frac{f_{X_1 X_2}(x_1, x_2)}{f_{X_1}(x_1)\, f_{X_2}(x_2)}\,dx_1\,dx_2$
Kullback-Leibler divergence:
$D_K(f \| g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx$
$I_S(X_1, X_2) = D_K\big(f_{X_1 X_2}(x_1, x_2)\,\big\|\, f_{X_1}(x_1)\, f_{X_2}(x_2)\big)$
Statistical independence: $f_{X_1 X_2}(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2)$
KL Divergence is NOT a distance
Ideally, for a distance:
• Non-negative
• Null only if the pdfs are equal
• Symmetric
• Triangle inequality
In reality,
$D(f, g) \ge 0$, with $D(f, g) = 0$ iff $f = g$
$D(f, g) \ne D(g, f)$ in general
$D(f_1, f_2) + D(f_2, f_3) \not\ge D(f_1, f_3)$ in general (no triangle inequality)
New Divergences and Quadratic Mutual Information
Euclidean distance between pdfs (quadratic mutual information, ED-QMI):
$D_{ED}(f, g) = \int \big(f(x) - g(x)\big)^2\,dx$
$I_{ED}(X_1, X_2) = D_{ED}\big(f_{X_1 X_2}(x_1, x_2),\ f_{X_1}(x_1)\, f_{X_2}(x_2)\big)$
Cauchy-Schwarz divergence and CS-QMI:
$D_{CS}(f, g) = \log\frac{\int f(x)^2\,dx\,\int g(x)^2\,dx}{\big(\int f(x)\,g(x)\,dx\big)^2}$
$I_{CS}(X_1, X_2) = D_{CS}\big(f_{X_1 X_2}(x_1, x_2),\ f_{X_1}(x_1)\, f_{X_2}(x_2)\big)$
Geometrical Explanation of MI
$V_J = \iint f_{X_1 X_2}(x_1, x_2)^2\,dx_1\,dx_2$
$V_M = \iint \big(f_{X_1}(x_1)\, f_{X_2}(x_2)\big)^2\,dx_1\,dx_2$
$V_c = \iint f_{X_1 X_2}(x_1, x_2)\, f_{X_1}(x_1)\, f_{X_2}(x_2)\,dx_1\,dx_2$
$I_{ED} = V_J - 2 V_c + V_M$ (Euclidean distance)
$I_{CS} = \log V_J - 2\log V_c + \log V_M = -\log\big(\cos^2\theta\big)$, where $V_c = \cos\theta\,\sqrt{V_J V_M}$
[Figure: $f_{X_1 X_2}$ and $f_{X_1} f_{X_2}$ viewed as vectors; $I_{ED}$ corresponds to the Euclidean distance between them, $I_{CS}$ to the angle $\theta$ between them, and $I_S$ to the K-L divergence.]
One Example
[Example: a 2×2 discrete joint pmf for $X_1$ and $X_2$ with marginal $P_{X_1} = (0.6,\ 0.4)$; the joint entries $P^X_{11}, P^X_{12}, P^X_{21}, P^X_{22}$ are varied and the resulting $I_S$, $I_{ED}$, and $I_{CS}$ are compared.]
One Example (continued)
[Figures: $I_S$, $I_{ED}$, and $I_{CS}$ plotted as surfaces over the joint probabilities $P^X_{11}$ and $P^X_{21}$.]
QMI and Cross Information Potential Estimators
Parzen window pdf estimation from joint samples $a(i) = \big(a_1(i), a_2(i)\big)^T$, $i = 1, \ldots, N$:
$\hat f_{X_1 X_2}(x_1, x_2) = \frac{1}{N}\sum_{i=1}^{N} G_{\sigma_1}\big(x_1 - a_1(i)\big)\, G_{\sigma_2}\big(x_2 - a_2(i)\big)$
$\hat f_{X_1}(x_1) = \frac{1}{N}\sum_{i=1}^{N} G_{\sigma_1}\big(x_1 - a_1(i)\big), \qquad \hat f_{X_2}(x_2) = \frac{1}{N}\sum_{i=1}^{N} G_{\sigma_2}\big(x_2 - a_2(i)\big)$
QMI and Cross Information Potential Estimators
For each marginal define pairwise differences and kernel evaluations:
$d_1(i,j) = a_1(i) - a_1(j), \qquad v_1(i,j) = G\big(d_1(i,j),\, 2\sigma_1^2\big)$
$d_2(i,j) = a_2(i) - a_2(j), \qquad v_2(i,j) = G\big(d_2(i,j),\, 2\sigma_2^2\big)$
$v_k(i) = \frac{1}{N}\sum_{j=1}^{N} v_k(i,j), \qquad V_k = \frac{1}{N}\sum_{i=1}^{N} v_k(i), \qquad k = 1, 2$
QMI and Cross Information Potential
$I_{ED} = V_J - 2 V_c + V_M, \qquad I_{CS} = \log V_J - 2\log V_c + \log V_M$
with
$V_J = \iint \hat f_{X_1 X_2}(x_1, x_2)^2\,dx_1\,dx_2, \quad V_M = \iint \big(\hat f_{X_1}(x_1)\,\hat f_{X_2}(x_2)\big)^2\,dx_1\,dx_2, \quad V_c = \iint \hat f_{X_1 X_2}\, \hat f_{X_1}\,\hat f_{X_2}\,dx_1\,dx_2$
In terms of the pairwise kernel evaluations:
$\hat I_{ED}(x_1, x_2) = V_{ED} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j) \;-\; \frac{2}{N}\sum_{i=1}^{N} v_1(i)\, v_2(i) \;+\; V_1 V_2$
$\hat I_{CS}(x_1, x_2) = V_{CS} = \log\frac{\Big(\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j)\Big)\, V_1 V_2}{\Big(\frac{1}{N}\sum_{i=1}^{N} v_1(i)\, v_2(i)\Big)^2}$
ED-QMI and Cross Information Potential
$C_k = [c_k(i,j)], \qquad c_k(i,j) = v_k(i,j) - v_k(i) - v_k(j) + V_k, \qquad k = 1, 2$
$V_{ED} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} c_l(i,j)\, v_k(i,j), \qquad l \ne k$
$\frac{\partial V_{ED}}{\partial a_k(i)} = \frac{-1}{2\sigma_k^2 N^2}\sum_{j=1}^{N} c_l(i,j)\, v_k(i,j)\, d_k(i,j), \qquad i = 1, \ldots, N, \quad k = 1, 2, \quad l \ne k$
Compare with the information potential and force for a single variable:
$V = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v_{ij}, \qquad F(i) = \frac{-1}{2\sigma^2 N^2}\sum_{j=1}^{N} v_{ij}\, d_{ij}, \qquad i = 1, \ldots, N$
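A hedged NumPy sketch of the ED-QMI estimator for two 1-D variables, computed from the pairwise kernel matrices as above (function names and kernel sizes are illustrative assumptions):

```python
import numpy as np

def gauss(d, var):
    return np.exp(-d ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def qmi_ed(a1, a2, sigma1, sigma2):
    """Euclidean-distance QMI estimate: V_J - 2*V_c + V_M."""
    a1, a2 = np.asarray(a1, float), np.asarray(a2, float)
    N = a1.size
    v1 = gauss(a1[:, None] - a1[None, :], 2.0 * sigma1 ** 2)
    v2 = gauss(a2[:, None] - a2[None, :], 2.0 * sigma2 ** 2)
    V_J = (v1 * v2).sum() / N ** 2                    # joint term
    V_c = (v1.mean(axis=1) * v2.mean(axis=1)).mean()  # cross term (1/N) sum_i v1(i) v2(i)
    V_M = v1.mean() * v2.mean()                       # marginal term V1 * V2
    return V_J - 2.0 * V_c + V_M
```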
Renyi’s Divergence
Renyi's mutual information:
$I_{R_\alpha}(y) = \frac{1}{\alpha - 1}\,\log\int \frac{f_Y^\alpha(y)}{\prod_{i=1}^{N_s} f_{Y_i}^{\alpha-1}(y_i)}\,dy$
It does not obey the well-known relation for Shannon mutual information,
$I_S(x, y) = H_S(x) + H_S(y) - H_S(x, y)$.
Renyi’s Divergence Approximation
Approximation to Renyi's MI:
$\sum_{i=1}^{N_s} H_{R_\alpha}(y_i) - H_{R_\alpha}(y) = \frac{1}{\alpha - 1}\,\log\frac{\int f_Y^\alpha(y)\,dy}{\prod_{i=1}^{N_s}\int f_{Y_i}^\alpha(y_i)\,dy_i}$
Although different, the two expressions have the same minima, so the sum of the marginal entropies can be used to minimize mutual information.
From Data to Models
Optimal adaptive models and least squares:
[Block diagram: input data x drives the adaptive system; its output is compared with the desired data z to form the error e, which feeds the cost function and learning algorithm.]
$J_w(e) = E\big[(z - f(x, w))^2\big]$
$\frac{\partial J(e)}{\partial w} = \frac{\partial J(e)}{\partial e}\,\frac{\partial e}{\partial w} = 0$
$\frac{\partial J(e)}{\partial w} = \frac{\partial E[e^2]}{\partial e}\,\frac{\partial e}{\partial w} = -2\,E[e\,x] = 0$
Model Based Inference
Alternatively, the problem of finding optimal parameters can be framed
as model based inference.
The desired response z can be thought as being created by an
unknown transformation of the input vector x, and the problem is
characterized by the joint pdf p(x,z).
The role of optimization is therefore to minimize the Kullback-Leibler divergence between the estimated joint pdf and the real one:
$\min_w J(w) = \int p(\mathbf{x}, z)\,\log\frac{p(\mathbf{x}, z)}{\tilde p_w(\mathbf{x}, z)}\,d\mathbf{x}\,dz$
If we write $z = f(\mathbf{x}) + e$ with the error independent of $\mathbf{x}$, then this is equivalent to
$\min_w H_S(e), \qquad H_S(e) = -\int p_w(e)\,\log p_w(e)\,de$
Error Entropy Criterion
Information Theoretic Learning is exactly a set of tools to solve this
minimization problem.
Note that this is different from the use of information theory in
communications. Here
We are interested in continuous random variables.
We cannot assume Gaussianity, so we need to use nonparametric estimators.
We are interested in using gradients, so estimators must be smooth.
We will use Parzen estimators. Since for optimization a monotonic
function does not affect the result, we will be using the Information
Potential instead of Renyi's entropy most of the time:
$H_{R_2}(E) = -\log V(E), \qquad V(E) = E[p(e)]$
Properties of Entropy Learning with Information
Potential
The IP with Gaussian kernels preserves the global
minimum/maximum of Renyi’s entropy.
The global minimum/maximum of Renyi’s entropy coincides with
Shannon’s entropy (super-Gaussian).
Around the global minimum, Renyi’s entropy cost (of any order) has
the same eigenvalues as Shannon’s.
The global minimum degenerates to a line (because entropy is
insensitive to the mean).
Error Entropy Criterion
We will use iterative algorithms for optimization of the steepest descent
type
$w(n+1) = w(n) + \eta\,\nabla V_2(n)$
(ascending the information potential of the error, i.e., descending its entropy). Given a batch of N samples the IP is
$V_2(E) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(e_i - e_j\big)$
For an FIR filter the gradient becomes
$\nabla_k V_2(n) = \frac{\partial V(e(n))}{\partial w_k} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\frac{\partial}{\partial w_k}\, G_{\sigma\sqrt{2}}\big(e(n-i) - e(n-j)\big)$
$\nabla_k V_2(n) = \frac{1}{2\sigma^2 N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sigma\sqrt{2}}\big(e(n-i) - e(n-j)\big)\,\big(e(n-i) - e(n-j)\big)\,\big(x_k(n-i) - x_k(n-j)\big)$
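A hedged sketch of one batch MEE update for an FIR filter, ascending the information potential of the error as in the gradient above (the names, the unnormalized kernel, and the step size are illustrative assumptions):

```python
import numpy as np

def mee_fir_update(w, X, d, sigma, eta):
    """One steepest-ascent step on the error information potential V2.
    X: (N, K) matrix whose rows are input tap vectors, d: (N,) desired response."""
    e = d - X @ w                                    # errors for the current weights
    de = e[:, None] - e[None, :]                     # e_i - e_j
    G = np.exp(-de ** 2 / (4.0 * sigma ** 2))        # Gaussian kernel, variance 2*sigma^2 (unnormalized)
    dX = X[:, None, :] - X[None, :, :]               # x(i) - x(j), shape (N, N, K)
    grad = ((G * de)[:, :, None] * dX).sum(axis=(0, 1))
    grad /= 2.0 * sigma ** 2 * e.size ** 2
    return w + eta * grad                            # maximize V2 (minimize error entropy)
```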
Error Entropy Criterion
This can be easily extended to any alpha and any kernel using the
expressions of IP.
For the FIR filter we get
$\nabla_k V_\alpha(n) = \frac{\alpha-1}{N^\alpha}\sum_{i=1}^{N}\Big(\sum_{j=1}^{N}\kappa_\sigma\big(e(n-i) - e(n-j)\big)\Big)^{\alpha-2}\sum_{j=1}^{N}\kappa_\sigma'\big(e(n-i) - e(n-j)\big)\,\big(x_k(n-j) - x_k(n-i)\big)$
Comparing Quadratic Entropy and MSE
• IP does not yield a convex performance surface even for the FIR.
For adaptation problems that yield zero or small errors, there is a
parabolic approximation around the minimum.
The largest eigenvalue of the
second order approximation of the
performance surface is smaller
than MSE (approaches zero with
large kernel sizes). So stepsizes
can be larger for convergence.
Weight Tracks on Contours of Information Potential
[Figure: weight tracks of the adaptation plotted on contours of the information potential in the (w1, w2) plane.]
Comparing Quadratic Entropy and MSE
• Consider for simplicity the 1-D case, and approximate
the Gaussian by its second-order Taylor series:
$G_\sigma(x) = c\, e^{-x^2/2\sigma^2} \approx c\,\big(1 - x^2/2\sigma^2\big)$
• We can show
$\max \hat V_{2,\sigma}(e) \approx \max \frac{1}{N^2}\sum_i\sum_j c\,\big(1 - (e_i - e_j)^2/2\sigma^2\big) = c - \frac{c}{2\sigma^2 N^2}\,\min\sum_i\sum_j (e_i - e_j)^2$
$\sum_i\sum_j (e_i - e_j)^2 = 2N\sum_i e_i^2 - 2\Big(\sum_i e_i\Big)^2 = 2N^2\Big(\mathrm{MSE}(e) - \bar e^{\,2}\Big)$
When the error is small w.r.t. the kernel size, quadratic
entropy training is equivalent to a biased MSE.
Comparing Quadratic Entropy and MSE
The kernel size produces a dilation in weight space, i.e., it controls the region in weight space where the second-order approximation of the entropy cost function is valid.
Implications of Entropy Learning
We have theoretically shown that:
Regardless of the entropy order, increasing the kernel size results in a
wider valley around the optimal solution by decreasing the absolute
values of the (negative) eigenvalues of the Hessian matrix of the
information potential criterion.
The effect of entropy order on the eigenvalues of the Hessian depends on
the value of the kernel size.
Implications of Entropy Learning
The batch algorithm just presented is called the Minimum Error
Entropy algorithm (MEE).
The cost function is totally independent of the mapper, so it can be
applied generically.
The algorithm is batch and has a complexity of O(N²).
To estimate entropy one needs pairwise interactions.
Entropy Learning Algorithms
MSE Criterion        Error Entropy Criterion
Steepest descent     Minimum Error Entropy (MEE)
                     MEE-RIG (recursive information gradient)
LMS                  MEE-SIG (stochastic information gradient)
LMF                  MEE-SAS (self-adjusting step size)
NLMS                 NMEE
RLS                  MEE fixed point
MEE- Stochastic Information Gradient (SIG)
• Dropping E[·] and substituting the required pdf by its Parzen estimate over the most recent M samples, at time n our information potential estimate becomes
$\hat V_\alpha(e(n)) = \Big(\frac{1}{M}\sum_{i=n-M}^{n-1}\kappa_\sigma\big(e_n - e_i\big)\Big)^{\alpha-1}$
If we substitute in the gradient equation,
$\frac{\partial \hat V_\alpha(e(n))}{\partial w_k} = \frac{\alpha-1}{M^{\alpha-1}}\Big(\sum_{i=n-M}^{n-1}\kappa_\sigma(e_n - e_i)\Big)^{\alpha-2}\sum_{i=n-M}^{n-1}\kappa_\sigma'(e_n - e_i)\,\big(x_k(i) - x_k(n)\big)$
For $\alpha = 2$ and a Gaussian kernel,
$\frac{\partial \hat V(e(n))}{\partial w_k} = \frac{1}{\sigma^2 M}\sum_{i=n-M}^{n-1} G_\sigma(e_n - e_i)\,\big(e_n - e_i\big)\,\big(x_k(n) - x_k(i)\big)$
We have shown that for the linear case, SIG converges in the
mean to the true value of the gradient. So it has the same
properties as LMS.
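A hedged sketch of an online SIG step for a linear (FIR) filter over a window of the M most recent samples (names and the unnormalized kernel are illustrative assumptions):

```python
import numpy as np

def sig_update(w, x_n, d_n, x_hist, e_hist, sigma, eta):
    """One stochastic information gradient (SIG) step for a linear filter.
    x_hist: the M previous tap vectors, e_hist: the M previous errors."""
    e_n = d_n - np.dot(w, x_n)
    grad = np.zeros_like(w)
    for x_i, e_i in zip(x_hist, e_hist):
        de = e_n - e_i
        g = np.exp(-de ** 2 / (2.0 * sigma ** 2))   # Gaussian kernel (unnormalized)
        grad += g * de * (x_n - x_i)
    grad /= sigma ** 2 * max(len(e_hist), 1)
    return w + eta * grad, e_n                      # ascend the information potential
```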
MEE- Stochastic Information Gradient (SIG)
The SIG has another very interesting property: applied to Renyi's entropy, it yields an estimator that is, on average, the gradient of Shannon's entropy.
$\frac{\partial \hat H_{\alpha,n}}{\partial w_k} = \frac{1}{1-\alpha}\,\frac{(\alpha-1)\big(\frac{1}{M}\sum_{i=n-M}^{n-1}\kappa_\sigma(e_n - e_i)\big)^{\alpha-2}\,\frac{1}{M}\sum_{i=n-M}^{n-1}\kappa_\sigma'(e_n - e_i)\,\big(x_k(i) - x_k(n)\big)}{\big(\frac{1}{M}\sum_{i=n-M}^{n-1}\kappa_\sigma(e_n - e_i)\big)^{\alpha-1}} = \frac{\sum_{i=n-M}^{n-1}\kappa_\sigma'(e_n - e_i)\,\big(x_k(n) - x_k(i)\big)}{\sum_{i=n-M}^{n-1}\kappa_\sigma(e_n - e_i)}$
In fact,
$\hat H_{S,n}(e) = -E\Big[\log\frac{1}{L}\sum_{i=n-L}^{n-1}\kappa_\sigma(e_n - e_i)\Big] \approx H_S(e) = -E[\log f_e(e)]$
and
$\frac{\partial \hat H_{S,n}}{\partial w_k} = E\left[\frac{\sum_{i=n-L}^{n-1}\kappa_\sigma'(e_n - e_i)\,\big(x_k(n) - x_k(i)\big)}{\sum_{i=n-L}^{n-1}\kappa_\sigma(e_n - e_i)}\right]$
so the SIG of Renyi's entropy is, on average, the gradient of Shannon's entropy.
SIG for Supervised Linear Filters
• We can derive an "LMS-like" algorithm using Renyi's entropy:
$w_{k+1} = w_k - \eta\left(\frac{\partial H(X)}{\partial w}\right)_k$
• For $\alpha = 2$, Gaussian kernels, $G_\sigma'(x) = -x\,G_\sigma(x)/\sigma^2$, and M = 1 we get
$\left(\frac{\partial H_2(X)}{\partial w}\right)_k = -\frac{1}{\sigma^2}\,(e_k - e_{k-1})\,(x_k - x_{k-1})$
Relation between SIG and Hebbian
learning
• For Gaussian kernels and M = 1, the expression to maximize the output entropy of a linear combiner also becomes very simple:
$\frac{\partial H(y_k)}{\partial w} = -\frac{\kappa_\sigma'(y_k - y_{k-1})}{\kappa_\sigma(y_k - y_{k-1})}\,(x_k - x_{k-1}) = \frac{1}{\sigma^2}\,(y_k - y_{k-1})\,(x_k - x_{k-1})$
We see that SIG gives rise to a sample-by-sample adaptation rule that is Hebbian-like between consecutive samples!
Does SIG work?
Consider a 2-D random variable whose x-axis component is uniform and whose y-axis component is Gaussian; the sample covariance is the identity matrix.
[Figure: scatter plot of the samples in the plane.]
Does SIG work?
• Generated 50 samples of the 2-D distribution from the previous slide and adapted the projection y = w1 x1 + w2 x2 (with w1² + w2² = 1) by maximizing the output entropy with SIG.
• PCA would converge to an arbitrary direction (the sample covariance is the identity), but SIG consistently found the 90-degree direction!
[Figures: panels labeled "Direction (Gaussian)", "Entropy (Gaussian)", and "Direction (Cauchy)" showing the weight-vector direction and output entropy versus training epochs.]
Adaptation of Linear Systems with Divergence
Exemplify for QMI-ED
$V_{ED} = V_J - 2 V_C + V_1 V_2 = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} v_1(i,j)\, v_2(i,j) - \frac{2}{N}\sum_{i=1}^{N} v_1(i)\, v_2(i) + V_1 V_2$
Taking the sensitivity with respect to the weights,
$\frac{\partial V_{ED}}{\partial w_{kj}} = \frac{\partial V_{ED}}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial w_{kj}} = \Big(\frac{\partial V_J}{\partial y_j(n)} - 2\,\frac{\partial V_C}{\partial y_j(n)} + \frac{\partial\,V_1 V_2}{\partial y_j(n)}\Big)\frac{\partial y_j(n)}{\partial w_{kj}}$
This is a straightforward extension because the potential fields and their gradients add up.
MEE for Nonlinear Systems
Consider the error signal. Think of the IPTs as
errors of a nonlinear mapper (such as the MLP).
How can we train the MLP? Use the IF as the injected error, then apply the backpropagation algorithm.
$\frac{\partial J}{\partial w_{ij}} = \sum_{p=1}^{k}\sum_{n=1}^{N}\frac{\partial J}{\partial e_p(n)}\,\frac{\partial e_p(n)}{\partial w_{ij}}$
This methodology extends naturally to the IP.
Global Optimization by Annealing Kernel Size
We have a way to avoid local minima in non-convex
performance surfaces:
1. Start with a large kernel size, and adapt to reach
minimum.
2. Decrease the kernel size to decrease the bias and
adapt again.
3. Repeat until the kernel size is compatible with the
data.
Kernel size annealing is equivalent to the method of convolution
smoothing in global optimization. Hence, as long as the annealing rate
is right (slow enough), the information potential provides a way to avoid
local minima and reach the global minimum.
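A minimal sketch of this annealing schedule wrapped around a generic update rule; the schedule, rates, and the update_fn interface are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def train_with_kernel_annealing(w, X, d, update_fn,
                                sigma_start=1.0, sigma_end=0.1,
                                n_stages=5, iters_per_stage=200, eta=0.1):
    """Anneal the kernel size from large to small, adapting at each stage.
    update_fn(w, X, d, sigma, eta) must return the updated weights."""
    for sigma in np.geomspace(sigma_start, sigma_end, n_stages):
        for _ in range(iters_per_stage):
            w = update_fn(w, X, d, sigma, eta)
    return w

# e.g. w = train_with_kernel_annealing(np.zeros(K), X, d, mee_fir_update)
# using the hypothetical mee_fir_update sketch shown earlier.
```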
Advanced Search Methods
If advanced search methods are needed, care must be taken when extending them to the error entropy criterion. Basically the problems are related to the definition of trust regions and the adaptation of the kernel size.
We have studied the scaled conjugate gradient method and the Levenberg-Marquardt algorithm and have implemented modifications that lead to consistent results.
Fast Gauss Transform
The Fast Gauss Transform (FGT) is an example of a fast
algorithm to approximately compute matrix (A)-vector (d) products where the matrix elements are $a_{ij} = \phi(x_i - x_j)$ with $\phi$ a Gaussian function.
The basic idea is to cluster the data and to expand the Gaussian in Hermite polynomials to divide and conquer the complexity (multipole method):
$\exp\!\left(-\frac{(y_j - y_i)^2}{4\sigma^2}\right) = \sum_{n=0}^{p-1}\frac{1}{n!}\left(\frac{y_i - y_C}{2\sigma}\right)^{n} h_n\!\left(\frac{y_j - y_C}{2\sigma}\right) + \varepsilon(p)$
$h_n(y) = (-1)^n\,\frac{d^n}{dy^n}\exp(-y^2)$
Fast Gauss Transform
A greedy clustering algorithm called farthest-point clustering is normally used (because it can be computed in O(kN) time, where k is the number of clusters).
The information potential for 1-D data becomes
$V(y) = \frac{1}{2 N^2 \sigma\sqrt{\pi}}\sum_{j=1}^{N}\sum_{B}\sum_{n=0}^{p-1}\frac{1}{n!}\, h_n\!\left(\frac{y_j - y_{C_B}}{2\sigma}\right) C_n(B)$
where
$C_n(B) = \sum_{y_i \in B}\left(\frac{y_i - y_{C_B}}{2\sigma}\right)^{n}$
and this can be computed in O(kpN) time (p is the degree of the approximation).
Where does ITL present Advantages?
ITL presents advantages when the signals one is
dealing with are non-Gaussian.
This occurs in basically two major areas in signal
processing:
Non-Gaussian Noise (outliers)
Nonlinear Filtering
Error Entropy Criterion and M estimation
Let us review the mean square error.
We are interested in quantifying how different two random variables are. So what we normally do is
$E[(x - y)^2] = \iint (x - y)^2\, p(x, y)\,dx\,dy$
We hope that the pdf decreases exponentially away from the x = y line!
Error Entropy Criterion and M estimation
Let us define a new criterion called the correntropy
criterion as
1
1
Vˆ ( X , Y )   k ( x  y )   k (e )
N
N
When we maximize correntropy we are maximizing
the probability density at the origin of the space,
since using Parzen windows we obtain
N
i 1
N
i
i
i 1
1 N
pˆ E (e)   k (e  ei )
N i 1
And evaluating it at the origin e=0
Vˆ ( X , Y )  pE (0)
i
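A hedged sketch of the correntropy criterion and a maximum-correntropy gradient step for a linear model (names, kernel normalization, and step size are illustrative assumptions):

```python
import numpy as np

def correntropy(e, sigma):
    """Sample correntropy of the error: mean Gaussian kernel of the errors."""
    e = np.asarray(e, dtype=float)
    return np.mean(np.exp(-e ** 2 / (2.0 * sigma ** 2)))

def mcc_step(w, X, d, sigma, eta):
    """One gradient-ascent step on correntropy for a linear model y = X @ w."""
    e = d - X @ w
    g = np.exp(-e ** 2 / (2.0 * sigma ** 2))   # kernel weights: outliers get small weight
    grad = (g * e) @ X / (sigma ** 2 * e.size)
    return w + eta * grad
```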
Error Entropy Criterion and M estimation
The error correntropy criterion is very much related to the error entropy criterion. If we take the first-order differences of the samples, $dx_{ij} = x_i - x_j$, and construct the vectors
$DX = (dx_{11}, dx_{12}, \ldots, dx_{21}, dx_{22}, \ldots, dx_{NN}), \qquad DY = (dy_{11}, dy_{12}, \ldots, dy_{NN})$
then the correntropy of the difference variables is
$V(DX, DY) = \frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N} k\big(dx_{ij} - dy_{ij}\big)$
and since
$dx_{ij} - dy_{ij} = x_i - x_j - (y_i - y_j) = (x_i - y_i) - (x_j - y_j) = e_i - e_j$
we obtain
$V(DX, DY) = \frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N} k\big(e_i - e_j\big) = IP(E)$
Error Entropy Criterion and M estimation
The interesting thing is that correntropy induces a metric while error entropy does not. A metric has the properties:
1) Non-negativity: $d(X, Y) \ge 0$
2) Identity: $d(x, y) = 0$ if and only if $x = y$
3) Symmetry: $d(x, y) = d(y, x)$
4) Triangle inequality: $d(X, Z) \le d(X, Y) + d(Y, Z)$
We define the correntropy induced metric as
$CIM(X, Y) = \big(V(0, 0) - V(X, Y)\big)^{1/2}$
Error Entropy Criterion and M estimation
In two dimensions, the contours of the CIM for distances to the origin are shown below.
[Figure: 2-D contours of the CIM.]
Error Entropy Criterion and M estimation
Let us put the error correntropy criterion in the
framework of M estimation (Huber). Define
 (e)  (1  exp(e2 / 2 2 )) / 2
Then
1)  (e)  0
2)  (0)  0
3)  (e)   (e)
4)  (ei )   (e j ) for | ei || e j | .
And therefore it becomes equivalent to following
M estimation problem min   (e ) which is a
weighted least squares min  w(e )e with
N
i 1 N
i
2
i 1
i
i
w(e)  exp(e2 / 2 2 ) / 2 3
Error Entropy Criterion and M estimation
Alternatively, if we define $IPM(X, Y) = \big(V(0, 0) - IP(e)\big)^{1/2}$ we obtain a pseudo-metric. MEE is equivalent to the M-estimation
$\min\sum_{j=1}^{N}\sum_{i=1}^{N}\rho\big(de_{ij}\big)$
Case Study: Regression with outliers
Assume that data are generated by a linear model corrupted by noise created by a mixture of Gaussians:
$p_Z(z) = 0.9\,N(0, 0.1) + 0.1\,N(4, 0.1)$
50 Monte Carlo runs were performed with a linear model (known order and best kernel size).
EXAMPLE: Revisiting Adaptive Filtering
[Block diagram: the input x(n) drives both an unknown system and a TDNN with a tapped delay line (z⁻¹); their outputs are compared to form the error e(n) fed to the criterion.]
TDNN: 6 delays in the input layer, 3 hidden PEs in the hidden layer, 1 linear output PE.
Desired response: y(n) = x(n).
Mackey-Glass time series Prediction
Mackey-Glass time series ($\tau = 30$):
$\frac{dx(t)}{dt} = -0.1\,x(t) + \frac{0.2\,x(t - \tau)}{1 + x^{10}(t - \tau)}$
Nonlinear prediction training
Two methods
1. Minimization of MSE
2. Minimization of the quadratic Renyi’s error
entropy (QREE)
It has been shown analytically that:
• Minimization of QREE is equivalent to minimizing the divergence between the desired signal and the output of the mapper
• The Parzen estimator preserves the extrema of the cost function
Training/Testing details
Hidden-layer PEs were chosen for best performance.
Training with 200 samples of MK30.
Conjugate gradient.
Stopping based on cross-validation.
1,000 initial conditions; pick the best error.
Kernel size was set at σ = 0.01.
Testing is done on 10,000 new samples.
Amplitude Histograms for Original and
Predicted Signals
[Figure: amplitude histograms (probability density vs. signal value) for the original data and for the entropy-trained and MSE-trained predictors.]
Error Distributions for MSE and Entropy
[Figure: probability density of the error value for the entropy-trained and the MSE-trained predictors.]
Effect of Kernel Annealing
[Figure: learning curves for different kernel annealing schedules, e.g., σ annealed from 10⁻¹ to 10⁻² and from 10⁻¹ to 10⁻³.]
Example II: Optimal Feature Extraction
Question:
How do we project data to a subspace preserving
discriminability?
Answer:
By maximizing the mutual information between desired
responses and the output of the (nonlinear) mapper.
Example II: Optimal Feature Extraction
The feature extractor and the classifier can be trained independently or jointly. Which is better?
Another goal is to find out which method of feature
extraction produces better classification
Example II: Optimal Feature Extraction
There are two possible ITL methods that can be used for
feature extraction:
1. Maximizing the mutual information between the
feature extractor output and the class labels and
using QMI-ED or QMI-CS
2. Approximating mutual information by a difference of
entropy terms
The advantage of the latter is that we do not need to
estimate the joint distribution. Computation is less
intensive and perhaps more accurate.
Example II: Optimal Feature Extraction
We consider here classifiers that are invariant under
invertible linear transformations to reduce the number
of free parameters.
Using the IP together with the approximation of MI yields the MRMI-SIG algorithm. [The defining expressions and the update rule for the angles appeared as equation images on the original slide.]
Example II: Optimal Feature Extraction
Two different classifiers are used: Bayes G (Gaussian)
and Bayes NP (nonparametric using Parzen).
Two methods of training are used:
1. Training the feature extractor first using PCA, MRMI
and QMI-ED.
2. Training both together using the Minimum classifier
error (MCE), MSE and feature ranking on a validation
set (Fr-V)
Example II: Optimal Feature Extraction
Data are several sets from the UCI repository.
Example II: Optimal Feature Extraction
[Results tables and figures from the original slides comparing the methods on the UCI data sets.]
Classification with QMI (2-D feature space)
$w_{opt} = \arg\max_w I(y; d)$
[Block diagram: images x are mapped to outputs y; the class identity d and the outputs y feed the information potential field, whose forces are back-propagated to adapt the mapper.]
SAR/Automatic Target Recognition
MSTAR Public Release Data Base
Three class problem:
BMP2, BTR70, T72.
Input Images are 64x64.
Output space is 2 D.
A likelihood ratio classifier is computed in the output
space.
SAR/Automatic Target Recognition
Confusion matrix (counts):
          BMP2   BTR70   T72
BMP2       289       2    19
BTR70        3     104     0
T72          8       5   294

Comparisons (Pcc):
ITL         94.89%
SVM         94.60%
Templates   90.40%
Clustering evaluation function
Perhaps the best area to apply ITL concepts is unsupervised
learning where the structure of the data is paramount for
the analysis goal.
Indeed most of the clustering, vector quantization and even
projection algorithms use some form of similarity metric.
ITL can provide similarity beyond second order statistics.
Clustering evaluation function
As we mentioned in the second lecture, $\int p(x)\,q(x)\,dx$ was called the cross information potential and it measures a form of "distance" between p(x) and q(x). Using the Parzen estimator this yields
$CIP(p, q) = \frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} G\big(x_i - x_j,\, 2\sigma^2\big), \qquad x_i \sim p(x),\ x_j \sim q(x)$
This can be written in a more condensed way with a membership function and it was called the Clustering Evaluation Function:
$CEF(p, q) = \frac{1}{2 N_1 N_2}\sum_{i=1}^{N}\sum_{j=1}^{N} M(x_i, x_j)\, G\big(x_i - x_j,\, 2\sigma^2\big)$
$M(x_i, x_j) = M_1(x_i) \oplus M_2(x_j), \qquad M_1(x_i) = 1 \iff x_i \in p(x)$
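A hedged sketch of the cross information potential between two sets of 1-D samples (function names are illustrative):

```python
import numpy as np

def cross_information_potential(xp, xq, sigma):
    """CIP(p, q): average Gaussian kernel between samples of cluster p and cluster q."""
    xp, xq = np.asarray(xp, float), np.asarray(xq, float)
    d = xp[:, None] - xq[None, :]                    # pairwise differences (1-D data)
    var = 2.0 * sigma ** 2
    G = np.exp(-d ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return G.mean()                                  # (1/(N1*N2)) * sum_ij G(x_i - x_j, 2*sigma^2)
```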
Clustering evaluation function
Remember that CEF is the numerator of the argument of the log in $D_{CS}$:
$D_{CEF_{norm}}(p, q) = -\ln\!\left(\frac{\int p(x)\,q(x)\,dx}{\sqrt{\int p^2(x)\,dx\;\int q^2(x)\,dx}}\right) = -\ln\!\left(\frac{CEF(p, q)}{\sqrt{\int p^2(x)\,dx\;\int q^2(x)\,dx}}\right)$
If we use just the numerator it is NOT a distance, but for evaluating clusters it can be utilized for simplicity.
[Figure: comparison of cluster-evaluation measures labeled cef, renyi, J_div, cefnorm, chernof, bhat.]
Clustering evaluation function
Example of clustering assignment with CEF on synthetic data (kernel variance selected for best results).
Results on the Iris data (confusion matrix):
          Class1  Class2  Class3
Class1        50       0       0
Class2         0      42       8
Class3         0       6      44
Clustering evaluation function
The information potential can be used as preprocessing to help image segmentation in brain MRI (low-contrast imagery). The resulting image is then clustered into three clusters.

Age          white (%)   gray (%)   CSF (%)
5.5 years        31.45      57.97     10.56
7.5 years        33.37      55.73     10.89
8 years          36.39      54.21      9.39
10.2 years       37.60      50.87     11.51
Clustering based on D_CS
We also developed a sample-by-sample algorithm for clustering with the Cauchy-Schwarz distance, based on a Lagrange multiplier formulation that ends up being a variable-stepsize algorithm for each direction in the space.
Clustering based on D CS
The optimization problem and its constraint (given as equation images on the original slides) are handled with Lagrange multipliers. To use them we construct a smooth membership function, obtained by $m_i(k) = v_i(k)^2$; the two optimization problems are equivalent, and the problem is solved through the Lagrangian.
Clustering based on D CS
There is a fixed-point algorithm that solves for the Lagrange multipliers $\lambda_i$, and the solution is independent of the order of presentation.
To avoid local minima,
kernel annealing is required.