On-Line Kernel Learning
Jose C. Principe and Weifeng Liu
Computational NeuroEngineering Laboratory
Electrical and Computer Engineering Department
University of Florida
www.cnel.ufl.edu
principe@cnel.ufl.edu
Acknowledgments
NSF ECS – 0300340 and 0601271
(Neuroengineering program)
Outline
Setting the On-line learning problem
 Optimal Signal Processing Fundamentals
On-Line learning in Kernel Space
 KLMS algorithm
 Well posedness analysis of KLMS
 Kernel Affine Projection Algorithms (KAPA)
Extended Recursive Least Squares
Active Learning
Problem Setting
Optimal Signal Processing seeks to find optimal models for time
series.
The linear model is well understood and widely applied. Optimal
linear filtering is regression in functional spaces, where the user
controls the size of the space by choosing the model order.
Problems are fourfold:
In many important applications data arrives in real time, one sample
at a time, so on-line learning methods are necessary.
Optimal algorithms must obey physical constraints: FLOPS, memory,
response time, battery power.
Application conditions may be non-stationary, i.e. the model must
continuously adapt to track changes.
It is unclear how to go beyond the linear model.
Although the optimal problem is the same as in machine learning,
constraints make the computational problem different.
Machine Learning
Assumption: Examples are drawn independently from an
unknown probability distribution P(x, y) that represents the
rules of Nature.
Expected Risk: $R(f) = \int L(f(x), y)\,dP(x, y)$
Empirical Risk: $\hat{R}_N(f) = \frac{1}{N}\sum_i L(f(x_i), y_i)$
We would like the $f^*$ that minimizes R(f) among all functions.
In general $f^* \notin F$.
The best we can have is $f_F^* \in F$ that minimizes R(f) inside F.
But P(x, y) is unknown by definition.
Instead we compute $f_N \in F$ that minimizes $\hat{R}_N(f)$.
Vapnik-Chervonenkis theory tells us when this can work, but
the optimization is computationally costly.
Exact estimation of $f_N$ is done through optimization.
Machine Learning Strategy
The errors in this process are
$R(f_N) - R(f^*) = \underbrace{\big[R(f_F^*) - R(f^*)\big]}_{\text{Approximation Error}} + \underbrace{\big[R(f_N) - R(f_F^*)\big]}_{\text{Estimation Error}}$
But the exact $f_N$ is normally hard to obtain, and since we already have
two errors, why not also tolerate an optimization error
$R(\tilde{f}_N) - R(f_N) \le \rho$
provided $\tilde{f}_N$ is computationally simpler to find? (Leon Bottou)
So the problem is to find F, N and $\rho$ for each problem.
Machine Learning Strategy
The optimality conditions in learning and optimization theories
are mathematically driven:
Learning theory favors cost functions that ensure a fast estimation
rate when the number of examples increases (small estimation error
bound).
Optimization theory favors superlinear algorithms (small
approximation error bound)
What about the computational cost of these optimal solutions, in
particular when the data sets are huge? Algorithmic complexity
should be as close as possible to O(N).
Change the design strategy: Since these solutions are never
optimal (non-reachable set of functions, empirical risk), goal
should be to get quickly to the neighborhood of the optimal
solution to save computation.
Learning Strategy in Biology
In Biology optimality is stated in relative terms: the best possible
response within a fixed time and with the available (finite)
resources.
Biological learning shares both constraints of small and large
learning theory problems, because it is limited by the number of
samples and also by the computation time.
Design strategies for optimal signal processing are more similar to the
biological framework than to the machine learning framework.
What matters is "how much the error decreases per sample for a
fixed memory/flop cost".
It is therefore no surprise that the most successful algorithm in
adaptive signal processing is the least mean square algorithm
(LMS), which never reaches the optimal solution, but is O(L) and
continuously tracks the optimal solution!
Extensions to Nonlinear Systems
Many algorithms exist to solve the on-line linear regression
problem:
LMS stochastic gradient descent
LMS-Newton handles eigenvalue spread, but is expensive
Recursive Least Squares (RLS) tracks the optimal solution with the
available data.
Nonlinear solutions either append nonlinearities to linear filters
(not optimal) or require the availability of all data (Volterra, neural
networks) and are not practical.
Kernel based methods offer a very interesting alternative to
neural networks.
Provided that the adaptation algorithm is written as an inner product,
one can take advantage of the “kernel trick”.
Nonlinear filters in the input space are obtained.
The primary advantage of doing gradient descent learning in RKHS
is that the performance surface is still quadratic, so there are no
local minima, while the filter now is nonlinear in the input space.
Adaptive Filtering Fundamentals
Adaptive Filter Framework
Filtering is regression in functional spaces (time series): $\min_w J(e(n), w)$.
The filter order defines the size of the subspace.
$f(u, w) = \sum_i w_i\, u(n-i)$
Cost: $J = E[e^2(n)]$
[Block diagram: the input data u(n) drives the adaptive system, whose output is compared with the desired data d(n) to form the error e(n) that adapts the system.]
The optimal solution is least squares, $w^* = R^{-1}p$, but now $R$ is the
autocorrelation matrix of the data input (over the lags) and $p$ is the
crosscorrelation vector:
$R = \begin{bmatrix} E[u(n)u(n)] & \cdots & E[u(n)u(n-L)] \\ \vdots & \ddots & \vdots \\ E[u(n-L)u(n)] & \cdots & E[u(n-L)u(n-L)] \end{bmatrix}$
$p^T = \big[\, E[u(n)d(n)] \;\; \cdots \;\; E[u(n-L)d(n)] \,\big]$
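As a quick numerical illustration, a minimal sketch (assuming NumPy; the toy FIR system, noise level and filter order are arbitrary choices, not from the tutorial) that estimates the Wiener solution from sample averages over the lags:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 3                                   # filter order (number of lags kept)
u = rng.standard_normal(5000)
d = np.convolve(u, [0.7, -0.3, 0.1], mode="full")[:len(u)]   # unknown system to identify

# lagged data matrix, row n = [u(n), u(n-1), ..., u(n-L+1)]
U = np.column_stack([u[L - 1 - k:len(u) - k] for k in range(L)])
dn = d[L - 1:]

R = U.T @ U / len(dn)                   # sample autocorrelation matrix (over the lags)
p = U.T @ dn / len(dn)                  # sample crosscorrelation vector
w_star = np.linalg.solve(R, p)          # w* = R^{-1} p, recovers about [0.7, -0.3, 0.1]
```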
On-Line Learning for Linear Filters
Notation:
$w_i$: weight estimate at time i (vector, dim = l)
$u_i$: input at time i (vector)
$y(i)$: filter output at time i (scalar)
$e(i)$: estimation error at time i (scalar)
$d(i)$: desired response at time i (scalar)
[Block diagram: the transversal filter $w_i$ maps $u_i$ to y(i); the adaptive weight-control mechanism is driven by e(i) = d(i) - y(i).]
The current estimate $w_i$ is computed in terms of the previous estimate, $w_{i-1}$, as
$w_i = w_{i-1} + G_i e_i$
where $e_i$ is the model prediction error (possibly a vector at iteration i, with desired response $d_i$) arising from the use of $w_{i-1}$, and $G_i$ is a gain matrix.
On-Line Learning for Linear Filters
The easiest technique is to search the performance surface J using gradient descent learning (batch):
$J = E[e^2(i)]$
$w_i = w_{i-1} - \eta\,\nabla J_{i-1}$, where $\eta$ is the step size
$\lim_{i\to\infty} E[w_i] = w^*$
[Figure: contours of the performance surface in the weight space (W1, W2) with the weight track produced by gradient descent.]
Gradient descent learning has well known compromises:
The step size $\eta$ must be smaller than $1/\lambda_{\max}$ (of R) for convergence.
The speed of adaptation is controlled by $\lambda_{\min}$.
So the eigenvalue spread of the signal autocorrelation matrix controls the speed of
adaptation.
The misadjustment (penalty w.r.t. the optimum error) is proportional to the step size,
so there is a fundamental compromise between adapting fast and keeping the
misadjustment small.
On-Line Learning for Linear Filters
Gradient descent learning for linear mappers also has great
properties:
It accepts an unbiased sample-by-sample estimator that is easy to
compute (O(L)), leading to the famous LMS algorithm:
$w_i = w_{i-1} + \eta\, u_i\, e(i)$
The LMS is a robust ($H^\infty$) estimation algorithm.
For small step sizes, the points visited during adaptation always
belong to the input data manifold (dimension L), since the algorithm
always moves in the opposite direction of the gradient.
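A minimal NumPy sketch of the LMS recursion (the filter order, step size and toy identification problem below are illustrative choices, not from the tutorial):

```python
import numpy as np

def lms(u, d, L=5, eta=0.05):
    """Least-mean-square adaptive filter (a minimal sketch).

    u: 1-D input signal, d: desired signal, L: number of taps, eta: step size.
    Returns the final weights and the a priori errors.
    """
    w = np.zeros(L)
    e = np.zeros(len(u))
    for i in range(L - 1, len(u)):
        ui = u[i - L + 1:i + 1][::-1]   # tap-delay vector [u(i), ..., u(i-L+1)]
        e[i] = d[i] - w @ ui            # e(i) = d(i) - w_{i-1}^T u_i
        w = w + eta * e[i] * ui         # w_i = w_{i-1} + eta * u_i * e(i)
    return w, e

# toy usage: identify a short FIR channel from noisy observations
rng = np.random.default_rng(0)
u = rng.standard_normal(2000)
d = np.convolve(u, [1.0, 0.5, -0.2], mode="full")[:len(u)] + 0.01 * rng.standard_normal(len(u))
w, e = lms(u, d, L=5, eta=0.05)
```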
On-Line Learning for Non-Linear Filters?
Can we generalize $w_i = w_{i-1} + G_i e_i$ to nonlinear models,
$y = w^T u \;\longrightarrow\; y = f(u),$
and create the nonlinear mapping incrementally?
$f_i = f_{i-1} + G_i e_i$
[Block diagram: a universal function approximator $f_i$ maps $u_i$ to y(i); the adaptive weight-control mechanism is driven by e(i) = d(i) - y(i).]
Non-Linear Methods - Traditional
(Fixed topologies)
Hammerstein and Wiener models
An explicit nonlinearity followed (preceded) by a linear filter
Nonlinearity is problem dependent
Do not possess universal approximation property
Multi-layer perceptrons (MLPs) with back-propagation
Non-convex optimization
Local minima
Least-mean-square for radial basis function (RBF) networks
Non-convex optimization for adjustment of centers
Local minima
Volterra models, Recurrent Networks, etc
Non-linear Methods with kernels
Universal approximation property (kernel dependent)
Convex optimization (no local minima)
Still easy to compute (kernel trick)
But require regularization
Sequential (On-line) Learning with Kernels
(Platt 1991) Resource-allocating networks
Heuristic
No convergence and well-posedness analysis
(Frieß 1999) Kernel Adaline
Formulated in a batch mode
Well-posedness not guaranteed
(Kivinen 2004) Regularized kernel LMS
With explicit regularization
Solution is usually biased
(Engel 2004) Kernel Recursive Least-Squares
(Vaerenbergh 2006) Sliding-window kernel recursive least-squares
Kernel methods
Moore-Aronszajn theorem
Every symmetric positive definite function of two real variables has
a unique Reproducing Kernel Hilbert Space (RKHS).
$\kappa(x, y) = \exp(-h\|x - y\|^2)$
Mercer's theorem
Let $\kappa(x, y)$ be symmetric positive definite. The kernel can be expanded in the series
$\kappa(x, y) = \sum_{i=1}^{m}\lambda_i\,\varphi_i(x)\varphi_i(y)$
Construct the transform as
$\Phi(x) = [\sqrt{\lambda_1}\varphi_1(x), \sqrt{\lambda_2}\varphi_2(x), \ldots, \sqrt{\lambda_m}\varphi_m(x)]^T$
Inner product:
$\Phi(x)^T\Phi(y) = \kappa(x, y)$
Basic idea of on-line kernel filtering
Transform data into a high dimensional feature space
Construct a linear model in the feature space F
$\varphi_i := \varphi(u_i), \qquad y = \langle\Omega, \varphi(u)\rangle_F$
Adapt the parameters iteratively with gradient information:
$\Omega_i = \Omega_{i-1} - \eta\,\nabla J_i$
Compute the output:
$f_i(u) = \langle\Omega_i, \varphi(u)\rangle_F = \sum_{j=1}^{m_i}a_j\,\kappa(u, c_j)$
Universal approximation theorem
For the Gaussian kernel and a sufficiently large $m_i$, $f_i(u)$ can
approximate any continuous input-output mapping arbitrarily closely in
the $L_p$ norm.
Network structure: growing RBF
[Figure: growing RBF network. The input u feeds kernel units centered at $c_1, c_2, \ldots, c_{m_i}$; their outputs, weighted by $a_1, a_2, \ldots, a_{m_i}$, are summed to give y.]
$\Omega_i = \Omega_{i-1} + \eta\, e(i)\,\varphi(u_i)$
$f_i = f_{i-1} + \eta\, e(i)\,\kappa(u_i, \cdot)$
Kernel Least-Mean-Square (KLMS)
Least-mean-square:
$w_0 = 0, \qquad e(i) = d(i) - w_{i-1}^T u_i, \qquad w_i = w_{i-1} + \eta\, u_i\, e(i)$
Transform the data into a high dimensional feature space F, $\varphi_i := \varphi(u_i)$:
$\Omega_0 = 0$
$e(i) = d(i) - \langle\Omega_{i-1}, \varphi(u_i)\rangle_F$
$\Omega_i = \Omega_{i-1} + \eta\,\varphi(u_i)\,e(i)$
so that
$\Omega_i = \eta\sum_{j=1}^{i}e(j)\varphi(u_j), \qquad f_i(u) = \langle\Omega_i, \varphi(u)\rangle_F = \eta\sum_{j=1}^{i}e(j)\kappa(u, u_j)$
Step by step:
$e(1) = d(1) - \langle\Omega_0, \varphi(u_1)\rangle_F = d(1)$
$\Omega_1 = \Omega_0 + \eta\varphi(u_1)e(1) = a_1\varphi(u_1)$
$e(2) = d(2) - \langle\Omega_1, \varphi(u_2)\rangle_F = d(2) - a_1\kappa(u_1, u_2)$
$\Omega_2 = \Omega_1 + \eta\varphi(u_2)e(2) = a_1\varphi(u_1) + a_2\varphi(u_2)$
RBF centers are the samples, and the weights are the errors!
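Below is a minimal NumPy sketch of this recursion; the Gaussian kernel parameter h, the step size and the toy sine target are illustrative choices, not taken from the tutorial:

```python
import numpy as np

def klms(U, d, eta=0.1, h=1.0):
    """Kernel LMS (a minimal sketch): every input becomes a center c_i = u_i
    and its weight is a_i = eta * e(i)."""
    def kernel(X, y):
        return np.exp(-h * np.sum((X - y) ** 2, axis=-1))

    centers, weights, errors = [], [], []
    for i in range(len(d)):
        # prediction with the current expansion f_{i-1}(u_i) = sum_j a_j kappa(u_i, c_j)
        y = np.dot(weights, kernel(np.asarray(centers), U[i])) if centers else 0.0
        e = d[i] - y
        centers.append(U[i])
        weights.append(eta * e)
        errors.append(e)
    return np.asarray(centers), np.asarray(weights), np.asarray(errors)

# toy usage: learn d = sin(3u) online from scalar inputs
rng = np.random.default_rng(1)
U = rng.uniform(-1, 1, size=(500, 1))
d = np.sin(3 * U[:, 0]) + 0.05 * rng.standard_normal(500)
C, A, E = klms(U, d, eta=0.2, h=5.0)
```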
KLMS- Nonlinear channel equalization
$f_i(u) = \langle\Omega_i, \varphi(u)\rangle_F = \eta\sum_{j=1}^{i}e(j)\kappa(u, u_j)$
[Figure: the symbol sequence $s_t$ passes through the nonlinear channel and the received signal $r_t$ feeds the growing RBF equalizer, which allocates a new unit at each sample with $c_{m_i} = u_i$ and $a_{m_i} = \eta\, e(i)$.]
Channel model:
$z_t = s_t + 0.5\, s_{t-1}, \qquad r_t = z_t - 0.9\, z_t^2 + n_\sigma$
Nonlinear channel equalization
Bit error rate (BER) for the nonlinear channel equalization, with $\kappa(u_i, u_j) = \exp(-0.1\|u_i - u_j\|^2)$:

Algorithms   | Linear LMS (η=0.005) | KLMS (η=0.1, no regularization) | RN (regularized, λ=1)
BER (σ = .1) | 0.162±0.014          | 0.020±0.012                     | 0.008±0.001
BER (σ = .4) | 0.177±0.012          | 0.058±0.008                     | 0.046±0.003
BER (σ = .8) | 0.218±0.012          | 0.130±0.010                     | 0.118±0.004

Complexity comparison:

Algorithms             | Linear LMS | KLMS | RN
Computation (training) | O(l)       | O(i) | O(i³)
Memory (training)      | O(l)       | O(i) | O(i²)
Computation (test)     | O(l)       | O(i) | O(i)
Memory (test)          | O(l)       | O(i) | O(i)
Why don’t we need to explicitly regularize the KLMS?
Regularization Techniques
Learning from finite data is ill-posed, and a priori
information (to enforce smoothness) is needed.
The key is to constrain the solution norm
In Bayesian model, the norm is the prior. (Gaussian process)
In statistical learning theory, the norm is associated with the
model capacity and hence the confidence of uniform
convergence! (VC dimension and structural risk minimization)
In numerical analysis, it relates to the algorithm stability
(condition number, singular values).
Regularized networks (least squares)
$J(\Omega) = \frac{1}{N}\sum_{i=1}^{N}\big(d(i) - \Omega^T\varphi_i\big)^2, \quad \text{subject to } \|\Omega\|^2 \le C$   (norm constraint!)
$J(\Omega) = \frac{1}{N}\sum_{i=1}^{N}\big(d(i) - \Omega^T\varphi_i\big)^2 + \lambda\|\Omega\|^2$   (Gaussian distributed prior)
$\Omega = \sum_{i=1}^{N}a_i\varphi_i, \qquad G_\varphi(i, j) = \varphi_i^T\varphi_j, \qquad (G_\varphi + \lambda I)a = d$   (condition number on matrix $G_\varphi$!)
With the singular value decomposition $H = [\varphi_1, \ldots, \varphi_N] = P\begin{bmatrix}S & 0\\ 0 & 0\end{bmatrix}Q^T$, $S = \mathrm{diag}\{s_1, s_2, \ldots, s_r\}$ (singular values), the solution is
$\Omega = P\,\mathrm{diag}\!\left(\frac{s_1}{s_1^2 + \lambda}, \ldots, \frac{s_r}{s_r^2 + \lambda}, 0, \ldots, 0\right)Q^T d$
Notice that if $\lambda = 0$, when $s_r$ is very small, $s_r/(s_r^2 + \lambda) = 1/s_r \to \infty$.
However if $\lambda > 0$, when $s_r$ is very small, $s_r/(s_r^2 + \lambda) \approx s_r/\lambda \to 0$.
KLMS and the Data Space
For finite data and using small step size theory:
Denote $\varphi_i = \varphi(u_i) \in R^m$ and $R_\varphi = \frac{1}{N}\sum_{i=1}^{N}\varphi_i\varphi_i^T = P\Lambda P^T$.
Assume the correlation matrix is singular, with
$\varsigma_1 \ge \ldots \ge \varsigma_k > \varsigma_{k+1} = \ldots = \varsigma_m = 0$
Denote $\Omega_i - \Omega^o = \sum_{n=1}^{m}\varepsilon_i(n)P_n$; then
$E[\varepsilon_i(n)] = (1 - \eta\varsigma_n)^i\,\varepsilon_0(n)$
$E[|\varepsilon_i(n)|^2] = \frac{\eta J_{\min}}{2 - \eta\varsigma_n} + (1 - \eta\varsigma_n)^{2i}\left(|\varepsilon_0(n)|^2 - \frac{\eta J_{\min}}{2 - \eta\varsigma_n}\right)$
So if $\varsigma_n = 0$,
$E[\varepsilon_i(n)] = \varepsilon_0(n)$ and $E[|\varepsilon_i(n)|^2] = |\varepsilon_0(n)|^2$
The weight vector is insensitive to the 0-eigenvalue directions.
Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008.
KLMS and the Data Space
Remember that
$J(\Omega_i) = E[|d - \Omega_i^T\varphi|^2]$
The 0-eigenvalue directions do not affect the MSE:
$J(\Omega_i) = J_{\min} + \frac{\eta J_{\min}}{2}\sum_{n=1}^{m}\varsigma_n + \sum_{n=1}^{m}\varsigma_n\left(|\varepsilon_0(n)|^2 - \frac{\eta J_{\min}}{2}\right)(1 - \eta\varsigma_n)^{2i}$
KLMS only finds solutions on the data subspace! It does
not care about the null space!
The minimum norm initialization for KLMS
The initialization $\Omega_0 = 0$ gives the minimum possible
norm solution.
$\Omega_i = \sum_{n=1}^{m}c_nP_n$
$\varsigma_1 \ge \ldots \ge \varsigma_k > 0, \qquad \varsigma_{k+1} = \ldots = \varsigma_m = 0$
$\|\Omega_i\|^2 = \sum_{n=1}^{k}\|c_n\|^2 + \sum_{n=k+1}^{m}\|c_n\|^2$
[Figure: geometric illustration of the minimum norm solution in the data subspace.]
Solution norm upper bound for KLMS
Assume the data model $d(i) = \Omega^{oT}\varphi(i) + v(i)$. Then the following inequalities hold:
$\|\Omega_N\|^2 \le \sigma_1\eta\left(\|\Omega^o\|^2 + 2\eta\|v\|^2\right)$, where $\sigma_1$ is the largest eigenvalue of $G_\varphi$.
$H^\infty$ robustness:
$\frac{\sum_{j=1}^{i}|e(j) - v(j)|^2}{\eta^{-1}\|\Omega^o\|^2 + \sum_{j=1}^{i-1}|v(j)|^2} < 1, \quad \text{for all } i = 1, 2, \ldots, N$
Triangular inequality:
$\|e\|^2 \le \eta^{-1}\|\Omega^o\|^2 + 2\|v\|^2$
The solution norm of KLMS is always upper bounded, i.e.
the algorithm is well posed in the sense of Hadamard.
Comparison of three techniques
Reg-function applied to each singular value $s_n$:
No regularization: $s_n^{-1}$
Tikhonov: $\left[s_n^2/(s_n^2 + \lambda)\right]\cdot s_n^{-1}$
PCA (truncated SVD): $s_n^{-1}$ if $s_n > \mathrm{th}$, and $0$ if $s_n \le \mathrm{th}$
KLMS: $\left[1 - (1 - \eta s_n^2/N)^N\right]\cdot s_n^{-1}$
[Figure: reg-functions of KLMS, Tikhonov and truncated SVD versus the singular value.]
The step size controls the reg-function in KLMS. This is a mathematical explanation for
early stopping in neural networks.
Liu W., Principe J. The Well-posedness Analysis of the Kernel Adaline, Proc WCCI, Hong-Kong, 2008
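A small sketch (NumPy, with illustrative values for λ, the truncation threshold, η and N) that evaluates the reg-functions above on a grid of singular values and shows that all of them except the unregularized one vanish for small singular values:

```python
import numpy as np

s = np.linspace(1e-3, 2.0, 400)         # grid of singular values
lam, th, eta, N = 0.1, 0.3, 0.1, 500    # illustrative (assumed) hyperparameters

no_reg   = 1.0 / s
tikhonov = s / (s**2 + lam)
tsvd     = np.where(s > th, 1.0 / s, 0.0)
klms     = (1.0 - (1.0 - eta * s**2 / N) ** N) / s

# for the smallest singular value: no_reg blows up, the other three go to ~0 (well posed)
for name, r in [("no reg", no_reg), ("Tikhonov", tikhonov), ("TSVD", tsvd), ("KLMS", klms)]:
    print(name, r[0])
```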
The big picture for gradient based learning
[Diagram: the family of gradient based algorithms. LMS (K=1) generalizes to the affine projection algorithm (APA) and, for K=i, to RLS; the Newton APA, leaky APA, leaky LMS (Kivinen 2004) and normalized LMS fill in the intermediate cases, together with the Adaline (Frieß 1999), RLS (Engel 2004), weighted RLS and extended RLS.]
We have kernelized versions of all of them.
The extended RLS is a model with states.
Liu W., Principe J., “Kernel Affine Projection Algorithms”, European J. of Signal Processing, ID 784292, 2008.
Affine projection algorithms
Least-mean-square (LMS):
$w_0 = 0, \qquad e(i) = d(i) - w_{i-1}^T u_i, \qquad w_i = w_{i-1} + \eta\, u_i\, e(i)$
Affine projection algorithms (APA):
APA uses the previous K samples to estimate the gradient and Hessian, whereas LMS uses just one, the present datum (instantaneous gradient).
$U_i = [u_i, u_{i-1}, \ldots, u_{i-K+1}]_{M\times K}, \qquad d_i = [d(i), d(i-1), \ldots, d(i-K+1)]^T$
APA:
$w_0, \qquad e_i = d_i - U_i^T w_{i-1}, \qquad w_i = w_{i-1} + \eta\, U_i e_i = w_{i-1} + \eta\sum_{k=i-K+1}^{i}e_i(k)\, u_k$
APA Newton:
$w_0, \qquad e_i = d_i - U_i^T w_{i-1}, \qquad w_i = w_{i-1} + \eta\,(\epsilon I + U_iU_i^T)^{-1}U_i e_i$
Therefore APA is a family of online gradient based algorithms of
intermediate complexity between the LMS and RLS.
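A minimal NumPy sketch of the Newton-type APA update above (the filter order, K, step size and the small regularizer eps are illustrative; the identity (eps*I + U U^T)^{-1} U = U (eps*I + U^T U)^{-1} is used to work with the K-by-K matrix instead of the M-by-M one):

```python
import numpy as np

def apa(u, d, L=5, K=8, eta=0.2, eps=1e-3):
    """Affine projection algorithm (illustrative sketch, Newton-type update)."""
    w = np.zeros(L)
    start = (L - 1) + (K - 1)
    for i in range(start, len(u)):
        # U_i = [u_i, u_{i-1}, ..., u_{i-K+1}], each column a tap-delay vector
        U = np.column_stack([u[i - k - L + 1:i - k + 1][::-1] for k in range(K)])
        di = np.array([d[i - k] for k in range(K)])
        e = di - U.T @ w                                        # e_i = d_i - U_i^T w_{i-1}
        w = w + eta * U @ np.linalg.solve(eps * np.eye(K) + U.T @ U, e)
    return w

# toy usage: identify a short FIR channel
rng = np.random.default_rng(0)
u = rng.standard_normal(3000)
d = np.convolve(u, [1.0, -0.4, 0.2], mode="full")[:len(u)]
w = apa(u, d, L=5, K=8, eta=0.2)
```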
Kernel Affine Projection Algorithms
KAPA 1 and 2 use the least squares cost, while KAPA 3 and 4 are regularized.
KAPA 1 and 3 use gradient descent, and KAPA 2 and 4 use the Newton update.
Note that KAPA 4 does not require the calculation of the error, by
rewriting KAPA 2 with the matrix inversion lemma and using the kernel trick.
Note that one does not have access to the weights, so a recursion is needed
as in KLMS.
Care must be taken to minimize computations.
KAPA-1
[Figure: growing RBF structure for KAPA-1. At each time step a new center is allocated and the most recent K coefficients are updated:]
$c_{m_i} = u_i, \qquad a_{m_i} = \eta\, e_i(i)$
$a_{m_i-1} \leftarrow a_{m_i-1} + \eta\, e_i(i-1)$
$\vdots$
$a_{m_i-K+1} \leftarrow a_{m_i-K+1} + \eta\, e_i(i-K+1)$
$f_i(u) = \langle\Omega_i, \varphi(u)\rangle_F = \sum_{j=1}^{i}a_j\,\kappa(u, u_j)$
Error reuse to save computation
To calculate the K errors is expensive (kernel evaluations):
$e_i(k) = d(k) - \varphi_k^T\Omega_{i-1}, \qquad (i - K + 1 \le k \le i)$
K times the computation? No: save the previous errors and reuse them,
$e_{i+1}(k) = d(k) - \varphi_k^T\Omega_i = d(k) - \varphi_k^T(\Omega_{i-1} + \eta\Phi_i e_i) = \big(d(k) - \varphi_k^T\Omega_{i-1}\big) - \eta\varphi_k^T\Phi_i e_i = e_i(k) - \eta\sum_{j=i-K+1}^{i}e_i(j)\,\varphi_k^T\varphi_j$
This still needs $e_{i+1}(i+1)$, but saves K error computations.
KAPA-2
KAPA-2: Newton's method
$\Phi_i = [\varphi_i, \varphi_{i-1}, \ldots, \varphi_{i-K+1}], \qquad d_i = [d(i), d(i-1), \ldots, d(i-K+1)]^T$
$\Omega_0 = 0$
$e_i = (\lambda I + \Phi_i^T\Phi_i)^{-1}(d_i - \Phi_i^T\Omega_{i-1})$
$\Omega_i = \Omega_{i-1} + \eta\,\Phi_i e_i$
How to invert the K-by-K matrix $(\lambda I + \Phi_i^T\Phi_i)$ and avoid $O(K^3)$?
Sliding window Gram matrix inversion
$\Phi_i = [\varphi_i, \varphi_{i-1}, \ldots, \varphi_{i-K+1}], \qquad Gr_i = \Phi_i^T\Phi_i$
Assume the inverse of the current regularized Gram matrix is known:
$Gr_i + \lambda I = \begin{bmatrix}a & b^T\\ b & D\end{bmatrix}, \qquad (Gr_i + \lambda I)^{-1} = \begin{bmatrix}e & f^T\\ f & H\end{bmatrix}$
When the window slides, the new matrix shares the block D:
$Gr_{i+1} + \lambda I = \begin{bmatrix}D & h\\ h^T & g\end{bmatrix}$
$D^{-1} = H - ff^T/e$
$s = (g - h^TD^{-1}h)^{-1}$   (Schur complement of D)
$(Gr_{i+1} + \lambda I)^{-1} = \begin{bmatrix}D^{-1} + (D^{-1}h)(D^{-1}h)^Ts & -(D^{-1}h)s\\ -(D^{-1}h)^Ts & s\end{bmatrix}$
Complexity is $O(K^2)$.
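The same bookkeeping can be sketched in a few lines of NumPy (the toy Gram matrix and λ below are illustrative; the function assumes only the block partitions defined above and checks itself against a direct inverse):

```python
import numpy as np

def sliding_gram_inverse(G_old_inv, h, g):
    """Slide the window of a regularized Gram matrix and update its inverse in O(K^2).

    G_old_inv: inverse of (Gr_i + lam*I), partitioned as [[e, f^T], [f, H]].
    h, g:      new column/corner of (Gr_{i+1} + lam*I) = [[D, h], [h^T, g]].
    """
    e = G_old_inv[0, 0]
    f = G_old_inv[1:, 0]
    H = G_old_inv[1:, 1:]
    D_inv = H - np.outer(f, f) / e               # inverse of the shared block D
    Dh = D_inv @ h
    s = 1.0 / (g - h @ Dh)                       # Schur complement of D
    return np.block([[D_inv + s * np.outer(Dh, Dh), -s * Dh[:, None]],
                     [-s * Dh[None, :],             np.array([[s]])]])

# sanity check against a direct inverse on a toy kernel Gram matrix
rng = np.random.default_rng(0)
K, lam = 6, 0.1
X = rng.standard_normal((K + 1, 3))
Gram = X @ X.T                                   # toy Gram of K+1 samples
G_old = Gram[:K, :K] + lam * np.eye(K)           # window over samples 0..K-1
G_new = Gram[1:, 1:] + lam * np.eye(K)           # window slides to samples 1..K
upd = sliding_gram_inverse(np.linalg.inv(G_old), G_new[:-1, -1], G_new[-1, -1])
assert np.allclose(upd, np.linalg.inv(G_new))
```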
Relations to other methods
[Table: computation complexity of the related methods.]
[Figure: prediction of the Mackey-Glass time series (L=10, K=10; K=50 for the sliding-window KRLS).]
Simulation 1: noise cancellation
n(i) ~ uniform [-0.5, 0.5]
[Block diagram: the noise source n(i) passes through the interference distortion function H(·) to give u(i), the input of the adaptive filter; the primary signal is d(i) = s(i) + n(i), and the filter output y(i) is subtracted from it.]
$u(i) = n(i) - 0.2u(i-1) - u(i-1)n(i-1) + 0.1n(i-1) + 0.4u(i-2) = H\big(n(i), n(i-1), u(i-1), u(i-2)\big)$
Simulation 1: noise cancellation
$\kappa(u(i), u(j)) = \exp(-\|u(i) - u(j)\|^2)$
[Figure: noise cancellation results, amplitude versus sample index i (2500-2600), for the noisy observation, NLMS, KLMS-1 and KAPA-2.]
Simulation-2: nonlinear channel equalization
[Block diagram: the symbol sequence $s_t$ passes through the nonlinear channel $z_t = s_t + 0.5 s_{t-1}$, $r_t = z_t - 0.9 z_t^2 + n_\sigma$, and the received signal $r_t$ feeds the equalizer.]
Simulation-2: nonlinear channel equalization
Regularization
The well-posedness discussion for the KLMS holds for
any other gradient descent method, like KAPA-1 and the
Kernel Adaline.
If the Newton method is used, additional regularization is
needed to invert the Hessian matrix, as in KAPA-2
and the normalized KLMS.
Recursive least squares embeds the regularization in
the initialization.
Extended Recursive Least-Squares
STATE model:
$x_{i+1} = F_i x_i + n_i$
$d_i = U_i^T x_i + v_i$
Start with $w_{0|-1}$ and $P_{0|-1} = \Pi^{-1}$.
Special cases:
• Tracking model (F is a time varying scalar): $x_{i+1} = \alpha x_i + n_i, \quad d(i) = u_i^T x_i + v(i)$
• Exponentially weighted RLS: $x_{i+1} = \alpha x_i, \quad d(i) = u_i^T x_i + v(i)$
• Standard RLS: $x_{i+1} = x_i, \quad d(i) = u_i^T x_i + v(i)$
Notation: $x_i$ is the state vector at time i; $w_{i|i-1}$ is the state estimate at time i using data up to i-1.
Recursive equations
The recursive update equations:
$w_{0|-1} = 0, \qquad P_{0|-1} = \lambda^{-1}\beta^{-1}I$
$r_e(i) = \lambda\beta^i + u_i^T P_{i|i-1} u_i$   (conversion factor)
$k_{p,i} = \alpha P_{i|i-1} u_i / r_e(i)$   (gain factor)
$e(i) = d(i) - u_i^T w_{i|i-1}$   (error)
$w_{i+1|i} = \alpha w_{i|i-1} + k_{p,i} e(i)$   (weight update)
$P_{i+1|i} = |\alpha|^2\left[P_{i|i-1} - P_{i|i-1}u_iu_i^TP_{i|i-1}/r_e(i)\right] + \lambda\beta^i q\,I$
Notice that
$u^T\hat{w}_{i+1|i} = \alpha\, u^T\hat{w}_{i|i-1} + \alpha\, u^TP_{i|i-1}u_i\, e(i)/r_e(i)$
If we have transformed data, how do we calculate $\varphi(u_k)^T P_{i|i-1}\varphi(u_j)$ for any k, i, j?
New Extended Recursive Least-squares
Theorem 1:
$P_{j|j-1} = \rho_{j-1}I - H_{j-1}^TQ_{j-1}H_{j-1}, \quad \forall j$
where $\rho_{j-1}$ is a scalar, $H_{j-1} = [u_0, \ldots, u_{j-1}]^T$ and $Q_{j-1}$ is a j-by-j matrix, for all j.
Proof (by mathematical induction):
$P_{0|-1} = \lambda^{-1}\beta^{-1}I, \qquad \rho_{-1} = \lambda^{-1}\beta^{-1}, \qquad Q_{-1} = 0$
$P_{i+1|i} = |\alpha|^2\left[P_{i|i-1} - \frac{P_{i|i-1}u_iu_i^TP_{i|i-1}}{r_e(i)}\right] + \lambda\beta^i q\,I$
$= |\alpha|^2\left[\rho_{i-1}I - H_{i-1}^TQ_{i-1}H_{i-1} - \frac{(\rho_{i-1}I - H_{i-1}^TQ_{i-1}H_{i-1})u_iu_i^T(\rho_{i-1}I - H_{i-1}^TQ_{i-1}H_{i-1})}{r_e(i)}\right] + \lambda\beta^i q\,I$
$= (|\alpha|^2\rho_{i-1} + \lambda\beta^i q)\,I - |\alpha|^2 H_i^T\begin{bmatrix}Q_{i-1} + f_{i-1,i}f_{i-1,i}^T r_e^{-1}(i) & -\rho_{i-1}f_{i-1,i}r_e^{-1}(i)\\ -\rho_{i-1}f_{i-1,i}^Tr_e^{-1}(i) & \rho_{i-1}^2 r_e^{-1}(i)\end{bmatrix}H_i$
where $f_{i-1,i} = Q_{i-1}H_{i-1}u_i$.
Liu W., Principe J., “Extended Recursive Least Squares in RKHS”, in Proc. 1st Workshop on Cognitive Signal Processing, Santorini, Greece, 2008.
New Extended Recursive Least-squares
Theorem 2:
$\hat{w}_{j|j-1} = H_{j-1}^T a_{j|j-1}, \quad \forall j$
where $H_{j-1} = [u_0, \ldots, u_{j-1}]^T$ and $a_{j|j-1}$ is a j-by-1 vector, for all j.
Proof (again by mathematical induction):
$\hat{w}_{0|-1} = 0, \qquad a_{0|-1} = 0$
$\hat{w}_{i+1|i} = \alpha\hat{w}_{i|i-1} + k_{p,i}e(i)$
$= \alpha H_{i-1}^T a_{i|i-1} + \alpha P_{i|i-1}u_i e(i)/r_e(i)$
$= \alpha H_{i-1}^T a_{i|i-1} + \alpha(\rho_{i-1}I - H_{i-1}^TQ_{i-1}H_{i-1})u_i e(i)/r_e(i)$
$= \alpha H_{i-1}^T a_{i|i-1} + \alpha\rho_{i-1}u_i e(i)/r_e(i) - \alpha H_{i-1}^T f_{i-1,i} e(i)/r_e(i)$
$= H_i^T\begin{bmatrix}\alpha a_{i|i-1} - \alpha f_{i-1,i}e(i)r_e^{-1}(i)\\ \alpha\rho_{i-1}r_e^{-1}(i)e(i)\end{bmatrix}$
Extended RLS:
$w_{0|-1} = 0, \qquad P_{0|-1} = \lambda^{-1}\beta^{-1}I$
$r_e(i) = \lambda\beta^i + u_i^TP_{i|i-1}u_i$
$k_{p,i} = \alpha P_{i|i-1}u_i/r_e(i)$
$e(i) = d(i) - u_i^Tw_{i|i-1}$
$w_{i+1|i} = \alpha w_{i|i-1} + k_{p,i}e(i)$
$P_{i+1|i} = |\alpha|^2\left[P_{i|i-1} - P_{i|i-1}u_iu_i^TP_{i|i-1}/r_e(i)\right] + \lambda\beta^i q\,I$
New Equations:
$a_{0|-1} = 0, \qquad \rho_{-1} = \lambda^{-1}\beta^{-1}, \qquad Q_{-1} = 0$
$k_{i-1,i} = H_{i-1}u_i$
$f_{i-1,i} = Q_{i-1}k_{i-1,i}$
$r_e(i) = \lambda\beta^i + \rho_{i-1}u_i^Tu_i - k_{i-1,i}^Tf_{i-1,i}$
$e(i) = d(i) - k_{i-1,i}^Ta_{i|i-1}$
$a_{i+1|i} = \begin{bmatrix}\alpha a_{i|i-1} - \alpha f_{i-1,i}r_e^{-1}(i)e(i)\\ \alpha\rho_{i-1}r_e^{-1}(i)e(i)\end{bmatrix}$
$\rho_i = |\alpha|^2\rho_{i-1} + \lambda\beta^i q$
$Q_i = |\alpha|^2\begin{bmatrix}Q_{i-1} + f_{i-1,i}f_{i-1,i}^Tr_e^{-1}(i) & -\rho_{i-1}f_{i-1,i}r_e^{-1}(i)\\ -\rho_{i-1}f_{i-1,i}^Tr_e^{-1}(i) & \rho_{i-1}^2r_e^{-1}(i)\end{bmatrix}$
Kernel Extended Recursive Least-squares
Initialize:
$a_{0|-1} = 0, \qquad \rho_{-1} = \lambda^{-1}\beta^{-1}, \qquad Q_{-1} = 0$
$k_{i-1,i} = [\kappa(u_0, u_i), \ldots, \kappa(u_{i-1}, u_i)]^T$
$f_{i-1,i} = Q_{i-1}k_{i-1,i}$
$r_e(i) = \lambda\beta^i + \rho_{i-1}\kappa(u_i, u_i) - k_{i-1,i}^Tf_{i-1,i}$
$e(i) = d(i) - k_{i-1,i}^Ta_{i|i-1}$
Update on the weights:
$a_{i+1|i} = \begin{bmatrix}\alpha a_{i|i-1} - \alpha f_{i-1,i}r_e^{-1}(i)e(i)\\ \alpha\rho_{i-1}r_e^{-1}(i)e(i)\end{bmatrix}$
Update on the P matrix:
$\rho_i = |\alpha|^2\rho_{i-1} + \lambda\beta^i q$
$Q_i = |\alpha|^2\begin{bmatrix}Q_{i-1} + f_{i-1,i}f_{i-1,i}^Tr_e^{-1}(i) & -\rho_{i-1}f_{i-1,i}r_e^{-1}(i)\\ -\rho_{i-1}f_{i-1,i}^Tr_e^{-1}(i) & \rho_{i-1}^2r_e^{-1}(i)\end{bmatrix}$
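A NumPy sketch of this recursion, under the sign and scaling conventions reconstructed above (α, β, λ, q are the state-model hyperparameters; the toy data and Gaussian kernel are illustrative, and this should be read as a bookkeeping illustration rather than the authors' reference implementation):

```python
import numpy as np

def ex_krls(U, d, kernel, alpha=1.0, beta=0.99, lam=0.1, q=1e-4):
    """Kernel extended RLS (illustrative sketch of the recursion above)."""
    a = np.zeros(0)                     # expansion coefficients a_{i|i-1}
    rho = 1.0 / (lam * beta)            # rho_{-1}
    Q = np.zeros((0, 0))                # Q_{-1}
    centers = []
    for i in range(len(d)):
        k = np.array([kernel(c, U[i]) for c in centers])      # k_{i-1,i}
        f = Q @ k                                             # f_{i-1,i}
        r_e = lam * beta**i + rho * kernel(U[i], U[i]) - k @ f
        e = d[i] - k @ a
        # weight update: grow the expansion by one term
        a = np.concatenate([alpha * a - alpha * f * e / r_e,
                            [alpha * rho * e / r_e]])
        # "P matrix" update in its (rho, Q) parameterization
        Q = abs(alpha)**2 * np.block(
            [[Q + np.outer(f, f) / r_e, (-rho * f / r_e)[:, None]],
             [(-rho * f / r_e)[None, :], np.array([[rho**2 / r_e]])]])
        rho = abs(alpha)**2 * rho + lam * beta**i * q
        centers.append(U[i])
    return centers, a                   # f(u) = sum_j a_j * kernel(u, centers[j])

def gauss(x, y, h=2.0):
    return float(np.exp(-h * np.sum((np.asarray(x) - np.asarray(y))**2)))

# toy usage
rng = np.random.default_rng(0)
U = rng.uniform(-1, 1, size=(200, 1))
d = np.sin(3 * U[:, 0])
centers, a = ex_krls(U, d, gauss)
```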
Kernel Ex-RLS
[Figure: growing RBF structure for the kernel extended RLS. At time i a new center is allocated and all previous weights are scaled and corrected:]
$c_{m_i} = u_i, \qquad a_{m_i} = \alpha\rho_{i-1}r_e^{-1}(i)e(i)$
$a_j \leftarrow \alpha a_j - \alpha f_{i-1,i}(j)\,r_e^{-1}(i)e(i), \qquad j = 1, \ldots, m_i - 1$
$f_i(u) = \langle\Omega_i, \varphi(u)\rangle_F = \sum_{j=1}^{i}a_j\,\kappa(u, u_j)$
Simulation-3: Lorenz time series prediction
Simulation 4: Rayleigh channel tracking
[Block diagram: the symbol sequence $s_t$ passes through a 5-tap Rayleigh multipath fading channel, a tanh nonlinearity and additive noise to produce $r_t$.]
Rayleigh channel tracking, with $\kappa(u_i, u_j) = \exp(-0.1\|u_i - u_j\|^2)$:

Algorithms          | MSE (dB) (noise variance 0.001, fD = 50 Hz) | MSE (dB) (noise variance 0.01, fD = 200 Hz)
ε-NLMS              | -13.51 | -9.39
RLS                 | -14.25 | -9.55
Extended RLS        | -14.26 | -10.01
Kernel RLS          | -20.36 | -12.74
Kernel extended RLS | -20.69 | -13.85
Computation complexity (at time or iteration i):

Algorithms             | Linear LMS | KLMS | KAPA    | ex-KRLS
Computation (training) | O(l)       | O(i) | O(i+K²) | O(i²)
Memory (training)      | O(l)       | O(i) | O(i+K)  | O(i²)
Computation (test)     | O(l)       | O(i) | O(i)    | O(i)
Memory (test)          | O(l)       | O(i) | O(i)    | O(i)
Active data selection
Why?
The kernel trick may seem a "free lunch"!
The price we pay is memory and pointwise evaluations of
the function.
Generalization (Occam's razor)
But remember that we are working in an on-line scenario,
so most of the existing methods need to be modified.
Problem statement
The learning system: $y(u; T(i))$
Already processed (your dictionary): $D(i) = \{u(j), d(j)\}_{j=1}^{i}$
A new data pair: $\{u(i+1), d(i+1)\}$
How much new information does it contain?
Is this the right question, or:
How much information does it contain with respect to the
learning system $y(u; T(i))$?
Previous Approaches
Novelty condition (Platt, 1991)
• Compute the distance to the current dictionary: $dis_1 = \min_{c_j\in D(i)}\|u(i+1) - c_j\|$
• If it is less than a threshold $\delta_1$, discard.
• If the prediction error $e(i+1) = d(i+1) - \varphi(i+1)^T\Omega(i)$ is larger than another threshold $\delta_2$, include the new center.
Approximate linear dependency (Engel, 2004)
• If the new input is (close to) a linear combination of the previous centers, discard:
$dis_2 = \min_{b}\big\|\varphi(u(i+1)) - \sum_{c_j\in D(i)}b_j\varphi(c_j)\big\|$
which is the Schur complement of the Gram matrix and fits KAPA 2 and
4 very well. The problem is the computational complexity.
Sparsification
A simpler procedure is the following. Define a cost to quantify
linear dependence:
$J_\Omega = \left\|\varphi_x(n) - \sum_{k=1}^{\mathrm{length}(r_{n-1})}\Omega_k\,\varphi_x(r_{n-1}(k))\right\|^2 = \varphi_x(n)^T\varphi_x(n) - 2\Omega^Tk_{n-1} + \Omega^TG_{n-1}\Omega$
where
$G_n = \begin{bmatrix}G_{n-1} & k_{n-1}\\ k_{n-1}^T & \varphi_x(n)^T\varphi_x(n)\end{bmatrix}$
Setting $\partial J_\Omega/\partial\Omega = 0$ gives the solution
$\Omega_{opt} = G_{n-1}^{-1}k_{n-1}, \qquad J_{opt} = \varphi_x(n)^T\varphi_x(n) - k_{n-1}^T\Omega_{opt}$
so one can use a threshold on J to decide if the new sample should be kept.
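As an illustration, a dictionary admission test based on this cost might look as follows (the threshold, the small jitter term and the batch solve are illustrative choices; a practical implementation would update the Gram inverse recursively instead of solving from scratch at each step):

```python
import numpy as np

def should_add(centers, u_new, kernel, threshold=1e-2):
    """Keep u_new only if it is not (approximately) linearly dependent on the dictionary.

    Computes J_opt = kappa(u,u) - k^T G^{-1} k for the candidate against the
    current centers and admits the sample when J_opt exceeds the threshold.
    """
    if not centers:
        return True
    k = np.array([kernel(c, u_new) for c in centers])                       # k_{n-1}
    G = np.array([[kernel(ci, cj) for cj in centers] for ci in centers])    # G_{n-1}
    omega = np.linalg.solve(G + 1e-8 * np.eye(len(G)), k)                   # G^{-1} k
    J_opt = kernel(u_new, u_new) - k @ omega
    return J_opt > threshold
```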
Information measure
Hartley and Shannon's definition of information: how much
information does the new pair contain?
$I(i+1) = -\ln p(u(i+1), d(i+1))$
Learning is unlike digital communications:
the machine never knows the joint distribution!
When the same message is presented to a learning
system, the information (the degree of uncertainty)
changes, because the system learned from the first
presentation!
Need to bring back MEANING into information theory!
Surprise as an Information measure
Learning is very much an experiment that we do in
the laboratory.
Fedorov (1972) proposed to measure the importance
of an experiment as the Kullback-Leibler distance
between the prior (the hypothesis we have) and the
posterior (the result after the measurement).
MacKay (1992) formulated this concept under a
Bayesian approach and it has become one of the key
concepts in active learning.
Surprise as an Information measure
Pfaffelhuber in 1972 formulated the concept of
subjective or redundant information for learning
systems as
$I_S(x) = -\log(q(x))$
where the PDF of the data is p(x) and q(x) is the learner's
subjective estimate of it.
Palm in 1981 defined surprise for a learning system
$y(u; T(i))$ as
$S_{T(i)}(u(i+1)) = CI(i+1) = -\ln p(u(i+1)\,|\,T(i))$
Shannon versus Surprise
Shannon                | Surprise
Objective              | Subjective
Receptor independent   | Receptor dependent (on time and agent)
Message is meaningless | Message has meaning for the agent
Evaluation of conditional information
Using Gaussian process theory, the conditional information decomposes into a
prediction-error term, a prediction-variance term and an input-distribution term,
interpreted below.
Interpretation of conditional information
Prediction error
Large error → large conditional information
Prediction variance
Small error, large variance → large CI
Large error, small variance → large CI (abnormal)
Input distribution
Rare occurrence → large CI
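A hedged sketch of how these three ingredients might be combined numerically; the exact expression lives in the slide formula that did not survive extraction, so the decomposition below (Gaussian predictive distribution for the output plus a kernel density estimate for the input) is only an assumed stand-in used to illustrate the interpretation:

```python
import numpy as np

def conditional_information(d_new, y_pred, var_pred, u_new, centers, h=1.0):
    """Illustrative surprise/CI score combining prediction error, prediction
    variance and input rarity (centers: 2-D array of dictionary inputs)."""
    # -ln p(d | u, T): squared prediction error weighted by the prediction variance
    nll_output = 0.5 * (d_new - y_pred) ** 2 / var_pred + 0.5 * np.log(2 * np.pi * var_pred)
    # -ln p(u | T): rare inputs are more surprising (kernel density over the dictionary)
    dists = np.sum((np.asarray(centers) - np.asarray(u_new)) ** 2, axis=1)
    p_u = np.mean(np.exp(-h * dists)) + 1e-12
    return nll_output - np.log(p_u)
```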
Input distribution
Memoryless assumption
Memoryless uniform assumption
Unknown desired signal
Average CI over the posterior distribution of the
output
Under the memoryless uniform assumption, this is equivalent to approximate linear dependency!
Redundant, abnormal and learnable
We still need to find a systematic way to select these
thresholds, which are hyperparameters.
Active online GP regression (AOGR)
Compute conditional information
If redundant
Throw away
If abnormal
Throw away (outlier examples)
Controlled gradient descent (non-stationary)
If learnable
Kernel recursive least squares (stationary)
Extended KRLS (non-stationary)
Simulation-5: nonlinear regression—learning curve
Simulation-5: nonlinear regression—redundancy removal
Simulation-5: nonlinear regression—abnormality detection
Simulation-6: Mackey-Glass time series prediction
Simulation-7: CO2 concentration forecasting
Redefinition of On-line Kernel Learning
Notice how problem constraints affected the form of the
learning algorithms.
On-line Learning: A process by which the free
parameters and the topology of a ‘learning system’ are
adapted through a process of stimulation by the
environment in which the system is embedded.
Error-correction learning + memory-based learning
What an interesting (biologically plausible?) combination.
Impacts on Machine Learning
KAPA algorithms can be very useful in large scale
learning problems.
Just sample the data randomly from the database and
apply the on-line learning algorithms.
There is an extra optimization error associated with
these methods, but they can easily be fit to the machine
constraints (memory, FLOPS) or the processing time
constraints (best solution in x seconds).
Wiley Book (2009)
Weifeng Liu
Jose Principe
Simon Haykin
On-Line Kernel Learning:
An adaptive filtering
perspective
Papers are available at
www.cnel.ufl.edu
Information Theoretic Learning (ITL)
Deniz Erdogmus and Jose Principe
From Linear
Adaptive Filtering
to Nonlinear
Information Processing
This class of algorithms can be extended to ITL cost
functions and also beyond regression (classification,
clustering, ICA, etc.). See the IEEE SP Magazine, Nov 2006,
or the ITL resource at www.cnel.ufl.edu