On-Line Kernel Learning
Jose C. Principe and Weifeng Liu
Computational NeuroEngineering Laboratory, Electrical and Computer Engineering Department, University of Florida
www.cnel.ufl.edu, principe@ufl.edu

Acknowledgments: NSF ECS-0300340 and 0601271 (Neuroengineering program)

Outline
- Setting the on-line learning problem
- Optimal signal processing fundamentals
- On-line learning in kernel space
- KLMS algorithm
- Well-posedness analysis of KLMS
- Kernel Affine Projection Algorithms (KAPA)
- Extended Recursive Least Squares
- Active learning

Problem Setting
Optimal signal processing seeks to find optimal models for time series. The linear model is well understood and widely applied. Optimal linear filtering is regression in functional spaces, where the user controls the size of the space by choosing the model order. The problems are fourfold:
- In many important applications data arrive in real time, one sample at a time, so on-line learning methods are necessary.
- Optimal algorithms must obey physical constraints: FLOPS, memory, response time, battery power.
- Application conditions may be non-stationary, i.e. the model must be continuously adapted to track changes.
- It is unclear how to go beyond the linear model.
Although the optimization problem is the same as in machine learning, these constraints make the computational problem different.

Machine Learning
Assumption: examples are drawn independently from an unknown probability distribution P(x, y) that represents the rules of Nature.
- Expected risk: R(f) = \int L(f(x), y) \, dP(x, y)
- Empirical risk: \hat{R}_N(f) = (1/N) \sum_i L(f(x_i), y_i)
We would like the f^* that minimizes R(f) among all functions, but in general f^* \notin F. The best we can have is the f_F^* \in F that minimizes R(f) inside F. But P(x, y) is unknown by definition, so instead we compute the f_N \in F that minimizes \hat{R}_N(f). Vapnik-Chervonenkis theory tells us when this can work, but the optimization is computationally costly. Exact estimation of f_N is done through optimization.

Machine Learning Strategy
The errors in this process are
R(f_N) - R(f^*) = [R(f_F^*) - R(f^*)] + [R(f_N) - R(f_F^*)] = approximation error + estimation error.
But the exact f_N is normally hard to obtain, and since we already have two errors, why not tolerate a third one (the optimization error),
R(\tilde{f}_N) \le R(f_N) + \rho,
provided \tilde{f}_N is computationally simpler to find? (Leon Bottou). So the problem is to find F, N and \rho for each problem.

Machine Learning Strategy
The optimality conditions in learning and optimization theories are mathematically driven:
- Learning theory favors cost functions that ensure a fast estimation rate when the number of examples increases (small estimation error bound).
- Optimization theory favors superlinear algorithms (small approximation error bound).
What about the computational cost of these optimal solutions, in particular when the data sets are huge? Algorithmic complexity should be as close as possible to O(N). Change the design strategy: since these solutions are never optimal anyway (non-reachable set of functions, empirical risk), the goal should be to get quickly to the neighborhood of the optimal solution, to save computation.

Learning Strategy in Biology
In biology, optimality is stated in relative terms: the best possible response within a fixed time and with the available (finite) resources. Biological learning shares the constraints of both small and large learning-theory problems, because it is limited by the number of samples and also by the computation time.
Design strategies for optimal signal processing are therefore closer to the biological framework than to the machine learning framework. What matters is "how much the error decreases per sample for a fixed memory/FLOP cost". It is then no surprise that the most successful algorithm in adaptive signal processing is the least mean square (LMS) algorithm, which never reaches the optimal solution but is O(L) and tracks the optimal solution continuously!

Extensions to Nonlinear Systems
Many algorithms exist to solve the on-line linear regression problem:
- LMS: stochastic gradient descent
- LMS-Newton: handles eigenvalue spread, but is expensive
- Recursive least squares (RLS): tracks the optimal solution with the available data
Nonlinear solutions either append nonlinearities to linear filters (not optimal) or require the availability of all the data (Volterra series, neural networks), and are not practical. Kernel-based methods offer a very interesting alternative to neural networks: provided that the adaptation algorithm is written in terms of inner products, one can take advantage of the "kernel trick", and nonlinear filters in the input space are obtained. The primary advantage of doing gradient-descent learning in an RKHS is that the performance surface is still quadratic, so there are no local minima, while the filter is now nonlinear in the input space.

Adaptive Filtering Fundamentals
Adaptive filter framework: filtering is regression in functional spaces (time series), and the filter order defines the size of the subspace. With input u(n), output y(n) = f(u, w) = \sum_i w_i u(n-i), desired response d(n), and error e(n) = d(n) - y(n), the cost is J = E[e^2(n)]. The optimal solution is the least squares solution w^* = R^{-1} p, but now R is the autocorrelation matrix of the input data (over the lags) and p is the cross-correlation vector:
R = [ E[u(n)u(n)]      ...  E[u(n)u(n-L)]
      ...              ...  ...
      E[u(n-L)u(n)]    ...  E[u(n-L)u(n-L)] ]
p^T = [ E[u(n)d(n)]  ...  E[u(n-L)d(n)] ].

On-Line Learning for Linear Filters
Notation:
- w_i : weight estimate at time i (vector, dim = L)
- u_i : input at time i (vector)
- e(i) : estimation error at time i (scalar)
- d(i) : desired response at time i (scalar)
- e_i : estimation error at iteration i (vector)
- d_i : desired response at iteration i (vector)
- G_i : gain matrix
(Figure: transversal filter with an adaptive weight-control mechanism driven by e(i) = d(i) - y(i).)
The current estimate w_i is computed in terms of the previous estimate w_{i-1} as
w_i = w_{i-1} + G_i e_i,
where e_i is the model prediction error arising from the use of w_{i-1} and G_i is a gain term.

On-Line Learning for Linear Filters
The easiest technique is to search the performance surface J with gradient-descent learning (batch):
w_i = w_{i-1} - \eta \nabla J_i,   J = E[e^2(i)],   \lim_{i \to \infty} E[w_i] = w^*.
(Figure: contour plot of the performance surface over the weights (W_1, W_2) with adaptation weight tracks; legend: MEE, FP-MEE.)
Gradient-descent learning has well-known compromises:
- The step size \eta must be smaller than 1/\lambda_{max} (of R) for convergence.
- The speed of adaptation is controlled by \lambda_{min}.
- So the eigenvalue spread of the signal autocorrelation matrix controls the speed of adaptation.
- The misadjustment (penalty w.r.t. the optimum error) is proportional to the step size, so there is a fundamental compromise between adapting fast and small misadjustment.

On-Line Learning for Linear Filters
Gradient-descent learning for linear mappers also has great properties: it accepts an unbiased sample-by-sample estimator that is easy to compute, O(L), leading to the famous LMS algorithm,
w_i = w_{i-1} + \eta u_i e(i).
The LMS is a robust (H^\infty) estimator.
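As a concrete reference point for the LMS update above, here is a minimal numerical sketch on a synthetic system-identification problem; the filter length, step size, and data model are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic system identification: unknown FIR filter, white input (illustrative).
L = 8                                   # filter order
w_true = rng.standard_normal(L)
N = 5000
u = rng.standard_normal(N)
d = np.convolve(u, w_true)[:N] + 0.01 * rng.standard_normal(N)

# LMS: w_i = w_{i-1} + eta * u_i * e(i),  e(i) = d(i) - w_{i-1}^T u_i
eta = 0.02                              # must be below 1/lambda_max of R for convergence
w = np.zeros(L)
mse = []
for i in range(L, N):
    u_i = u[i - L + 1:i + 1][::-1]      # current input vector, most recent sample first
    e = d[i] - w @ u_i                  # prediction error with the previous weights
    w = w + eta * u_i * e               # O(L) stochastic-gradient update
    mse.append(e ** 2)

print("final weight error:", np.linalg.norm(w - w_true))
print("steady-state MSE (last 500 samples):", np.mean(mse[-500:]))
```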
For small step sizes, the points visited during adaptation always belong to the input data manifold (dimension L), since the algorithm always moves in the direction opposite to the gradient.

On-Line Learning for Non-Linear Filters?
Can we generalize w_i = w_{i-1} + G_i e_i (with y = w^T u) to nonlinear models y = f(u), and create the nonlinear mapping incrementally,
f_i = f_{i-1} + G_i e_i ?
(Figure: a universal function approximator f_i with an adaptive weight-control mechanism driven by e(i) = d(i) - y(i).)

Non-Linear Methods - Traditional (fixed topologies)
- Hammerstein and Wiener models: an explicit nonlinearity followed (preceded) by a linear filter; the nonlinearity is problem dependent; they do not possess the universal approximation property.
- Multi-layer perceptrons (MLPs) with back-propagation: non-convex optimization, local minima.
- Least-mean-square for radial basis function (RBF) networks: non-convex optimization for the adjustment of the centers, local minima.
- Volterra models, recurrent networks, etc.

Non-Linear Methods with Kernels
- Universal approximation property (kernel dependent)
- Convex optimization (no local minima)
- Still easy to compute (kernel trick)
- But they require regularization

Sequential (On-line) Learning with Kernels
- (Platt 1991) Resource-allocating networks: heuristic; no convergence and well-posedness analysis.
- (Frieß 1999) Kernel adaline: formulated in batch mode; well-posedness not guaranteed.
- (Kivinen 2004) Regularized kernel LMS, with explicit regularization: the solution is usually biased.
- (Engel 2004) Kernel recursive least squares.
- (Van Vaerenbergh 2006) Sliding-window kernel recursive least squares.

Kernel Methods
Moore-Aronszajn theorem: every symmetric positive-definite function of two real variables has a unique reproducing kernel Hilbert space (RKHS). Example: the Gaussian kernel
\kappa(x, y) = \exp(-h \|x - y\|^2).
Mercer's theorem: let \kappa(x, y) be symmetric positive definite. The kernel can be expanded in the series
\kappa(x, y) = \sum_{i=1}^{m} \lambda_i \varphi_i(x) \varphi_i(y).
Construct the transform as
\varphi(x) = [\sqrt{\lambda_1}\,\varphi_1(x), \sqrt{\lambda_2}\,\varphi_2(x), ..., \sqrt{\lambda_m}\,\varphi_m(x)]^T,
so that the inner product reproduces the kernel: \varphi(x)^T \varphi(y) = \kappa(x, y).

Basic Idea of On-Line Kernel Filtering
- Transform the data into a high-dimensional feature space: \varphi_i := \varphi(u_i).
- Construct a linear model in the feature space F: y = \langle \Omega, \varphi(u) \rangle_F.
- Adapt the parameters iteratively with gradient information: \Omega_i = \Omega_{i-1} + \eta \nabla J_i.
- Compute the output: f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \sum_{j=1}^{m_i} a_j \kappa(u, c_j).
Universal approximation theorem: for the Gaussian kernel and a sufficiently large m_i, f_i(u) can approximate any continuous input-output mapping arbitrarily closely in the L_p norm.

Network Structure: Growing RBF
(Figure: radial basis function network with centers c_1, c_2, ..., c_{m_i} and coefficients a_1, a_2, ..., a_{m_i}; a new unit is allocated at every step.)
\Omega_i = \Omega_{i-1} + \eta e(i) \varphi(u_i)   which is equivalent to   f_i = f_{i-1} + \eta e(i) \kappa(u_i, \cdot).

Kernel Least-Mean-Square (KLMS)
Least-mean-square: w_i = w_{i-1} + \eta u_i e(i), with e(i) = d(i) - w_{i-1}^T u_i and w_0 = 0.
Transform the data into a high-dimensional feature space F:
\Omega_0 = 0
e(i) = d(i) - \langle \Omega_{i-1}, \varphi(u_i) \rangle_F
\Omega_i = \Omega_{i-1} + \eta e(i) \varphi(u_i).
Unrolling the recursion:
e(1) = d(1) - \langle \Omega_0, \varphi(u_1) \rangle_F = d(1)
\Omega_1 = \Omega_0 + \eta e(1) \varphi(u_1) = a_1 \varphi(u_1)
e(2) = d(2) - \langle \Omega_1, \varphi(u_2) \rangle_F = d(2) - a_1 \langle \varphi(u_1), \varphi(u_2) \rangle_F = d(2) - a_1 \kappa(u_1, u_2)
\Omega_2 = \Omega_1 + \eta e(2) \varphi(u_2) = a_1 \varphi(u_1) + a_2 \varphi(u_2)
...
\Omega_i = \eta \sum_{j=1}^{i} e(j) \varphi(u_j),   f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \eta \sum_{j=1}^{i} e(j) \kappa(u, u_j).
The RBF centers are the samples, and the weights ... are the errors!
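A minimal sketch of the KLMS recursion just derived, in which the centers are the samples and the coefficients are the scaled errors; the Gaussian kernel width, step size, and toy regression data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_kernel(x, y, h=1.0):
    """Gaussian kernel kappa(x, y) = exp(-h * ||x - y||^2)."""
    return np.exp(-h * np.sum((x - y) ** 2))

def klms_train(U, d, eta=0.2, h=1.0):
    """KLMS: f_i(u) = eta * sum_j e(j) * kappa(u, u_j)."""
    centers, coeffs, errors = [], [], []
    for u_i, d_i in zip(U, d):
        # prediction with the current expansion (an empty sum gives 0)
        y = sum(a * gauss_kernel(u_i, c, h) for a, c in zip(coeffs, centers))
        e = d_i - y
        centers.append(u_i)             # the new RBF center is the sample itself ...
        coeffs.append(eta * e)          # ... and its weight is eta times the error
        errors.append(e)
    return centers, coeffs, errors

def klms_predict(u, centers, coeffs, h=1.0):
    return sum(a * gauss_kernel(u, c, h) for a, c in zip(coeffs, centers))

# Toy nonlinear regression: d = sin(2*pi*u) + noise (illustrative)
U = rng.uniform(-1, 1, size=(500, 1))
d = np.sin(2 * np.pi * U[:, 0]) + 0.05 * rng.standard_normal(500)
centers, coeffs, errors = klms_train(U, d)
print("training MSE (last 100 samples):", np.mean(np.array(errors[-100:]) ** 2))
print("f(0.25) ~", klms_predict(np.array([0.25]), centers, coeffs))
```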
KLMS - Nonlinear Channel Equalization
f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \eta \sum_{j=1}^{i} e(j) \kappa(u, u_j); at each step a new center c_{m_i} = u_i is allocated with coefficient a_{m_i} = \eta e(i).
Channel model: z_t = s_t + 0.5 s_{t-1},  r_t = z_t - 0.9 z_t^2 + n_t, where n_t is additive noise of standard deviation \sigma.
Kernel: \kappa(u_i, u_j) = \exp(-0.1 \|u_i - u_j\|^2).

Algorithm                               | BER (\sigma = .1) | BER (\sigma = .4) | BER (\sigma = .8)
Linear LMS (\eta = 0.005)               | 0.162 +/- 0.014   | 0.177 +/- 0.012   | 0.218 +/- 0.012
KLMS (\eta = 0.1), no regularization    | 0.020 +/- 0.012   | 0.058 +/- 0.008   | 0.130 +/- 0.010
RN, regularized with \lambda = 1        | 0.008 +/- 0.001   | 0.046 +/- 0.003   | 0.118 +/- 0.004

Algorithm   | Computation (training) | Memory (training) | Computation (test) | Memory (test)
Linear LMS  | O(L)                   | O(L)              | O(L)               | O(L)
KLMS        | O(i)                   | O(i)              | O(i)               | O(i)
RN          | O(i^3)                 | O(i^2)            | O(i)               | O(i)

Why don't we need to explicitly regularize the KLMS?

Regularization Techniques
Learning from finite data is ill-posed, and a priori information, enforcing smoothness, is needed. The key is to constrain the norm of the solution:
- In a Bayesian model, the norm is the prior (Gaussian processes).
- In statistical learning theory, the norm is associated with the model capacity and hence with the confidence of uniform convergence (VC dimension and structural risk minimization).
- In numerical analysis, it relates to the stability of the algorithm (condition number, singular values).

Regularized Networks (Least Squares)
J(\Omega) = (1/N) \sum_{i=1}^{N} (d(i) - \Omega^T \varphi_i)^2, subject to \|\Omega\|^2 \le C
J(\Omega) = (1/N) \sum_{i=1}^{N} (d(i) - \Omega^T \varphi_i)^2 + \lambda \|\Omega\|^2   (norm constraint; Gaussian-distributed prior)
\Omega = \sum_{i=1}^{N} a_i \varphi_i,   G_\varphi(i, j) = \varphi_i^T \varphi_j,   (G_\varphi + \lambda I) a = d.
With the singular value decomposition H = [\varphi_1, ..., \varphi_N] = P [S 0; 0 0] Q^T, S = diag{s_1, s_2, ..., s_r}, the regularized solution is
\Omega = P \, diag( s_1/(s_1^2 + \lambda), ..., s_r/(s_r^2 + \lambda), 0, ..., 0 ) \, Q^T d.
The issue is the condition number of the matrix G_\varphi. Notice that if \lambda = 0 and s_r is very small, s_r/(s_r^2 + \lambda) = 1/s_r \to \infty; however, if \lambda > 0 and s_r is very small, s_r/(s_r^2 + \lambda) \approx s_r/\lambda \to 0.

KLMS and the Data Space
For finite data and using small-step-size theory: denote \varphi_i = \varphi(u_i) \in R^m and
R_\varphi = (1/N) \sum_{i=1}^{N} \varphi_i \varphi_i^T = P \Lambda P^T.
Assume the correlation matrix is singular, with \zeta_1 \ge ... \ge \zeta_k > \zeta_{k+1} = ... = \zeta_m = 0. Writing the weight-error vector in the eigenvector basis, \Omega_i - \Omega_o = \sum_{n=1}^{m} \varepsilon_i(n) P_n, small-step-size theory gives
E[\varepsilon_i(n)] = (1 - \eta \zeta_n)^i \varepsilon_0(n)
E[|\varepsilon_i(n)|^2] = \eta J_{min} / (2 - \eta \zeta_n) + (1 - \eta \zeta_n)^{2i} ( |\varepsilon_0(n)|^2 - \eta J_{min} / (2 - \eta \zeta_n) ).
So if \zeta_n = 0, then E[\varepsilon_i(n)] = \varepsilon_0(n) and E[|\varepsilon_i(n)|^2] = |\varepsilon_0(n)|^2: the weight vector is insensitive to the 0-eigenvalue directions.
Liu W., Pokharel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543-554, 2008.

KLMS and the Data Space
Remember that J(\Omega_i) = E[ |d - \Omega_i^T \varphi|^2 ]. The 0-eigenvalue directions do not affect the MSE:
J(\Omega_i) = J_{min} + \eta J_{min} \sum_{n=1}^{m} \zeta_n / (2 - \eta \zeta_n) + \sum_{n=1}^{m} \zeta_n ( |\varepsilon_0(n)|^2 - \eta J_{min} / (2 - \eta \zeta_n) ) (1 - \eta \zeta_n)^{2i}.
KLMS only finds solutions on the data subspace! It does not care about the null space!
Liu W., Pokharel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543-554, 2008.

The Minimum-Norm Initialization for KLMS
The initialization \Omega_0 = 0 gives the minimum possible norm solution. Writing \Omega_i = \sum_{n=1}^{m} c_n P_n with \zeta_1 \ge ... \ge \zeta_k > 0 = \zeta_{k+1} = ... = \zeta_m,
\|\Omega_i\|^2 = \sum_{n=1}^{k} \|c_n\|^2 + \sum_{n=k+1}^{m} \|c_n\|^2,
and the components along the null-space directions remain at zero.
Liu W., Pokharel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543-554, 2008.

Solution Norm Upper Bound for KLMS
Assume the data model d(i) = \langle \Omega_o, \varphi(u_i) \rangle + v(i). For small step sizes the following inequalities hold.
H^\infty robustness:
\sum_{j=1}^{i} |e(j) - v(j)|^2 / ( \eta^{-1} \|\Omega_o\|^2 + \sum_{j=1}^{i-1} |v(j)|^2 ) < 1, for all i = 1, 2, ..., N.
Triangle inequality: \|e\|_2 \le \eta^{-1/2} \|\Omega_o\| + 2 \|v\|_2.
Hence \|\Omega_N\|^2 \le \sigma_1 \eta ( \|\Omega_o\| + 2 \sqrt{\eta} \|v\| )^2, where \sigma_1 is the largest eigenvalue of G_\varphi.
The solution norm of KLMS is therefore always upper bounded, i.e. the algorithm is well-posed in the sense of Hadamard.
Liu W., Pokharel P., Principe J., "The Kernel LMS Algorithm", IEEE Trans. Signal Processing, vol. 56, no. 2, pp. 543-554, 2008.
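For comparison with KLMS, a minimal sketch of the regularized network (RN) solution (G_\varphi + \lambda I) a = d discussed above, i.e. batch kernel ridge regression; the kernel width, \lambda, and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def gram_matrix(X, h=1.0):
    """Gaussian Gram matrix G(i, j) = exp(-h * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-h * sq)

# Regularized network (batch least squares in the RKHS): (G + lambda*I) a = d
X = rng.uniform(-1, 1, size=(200, 1))
d = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(200)

lam = 1.0                               # lam = 0 recovers the ill-posed solution
G = gram_matrix(X, h=1.0)
a = np.linalg.solve(G + lam * np.eye(len(X)), d)

# Predict at new points u: f(u) = sum_i a_i * kappa(u, x_i), same h = 1.0
U_test = np.linspace(-1, 1, 5)[:, None]
K_test = np.exp(-1.0 * np.sum((U_test[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print("predictions:", K_test @ a)

# Conditioning: without lambda, tiny singular values of G blow up the solution norm.
s = np.linalg.svd(G, compute_uv=False)
print("condition number of G:", s[0] / s[-1])
print("solution norm with lambda = 1:", np.linalg.norm(a))
```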
Comparison of Three Techniques
Each technique can be written as a "regularization function" applied to the singular values s_n of the data matrix:
- No regularization: s_n^{-1}
- Tikhonov: [ s_n^2 / (s_n^2 + \lambda) ] s_n^{-1}
- PCA (truncated SVD): s_n^{-1} if s_n > threshold, 0 if s_n < threshold
- KLMS: [ 1 - (1 - \eta s_n^2 / N)^N ] s_n^{-1}
(Figure: reg-function versus singular value for KLMS, Tikhonov, and truncated SVD.)
The step size controls the reg-function in KLMS. This is a mathematical explanation for early stopping in neural networks.
Liu W., Principe J., "The Well-Posedness Analysis of the Kernel Adaline", Proc. WCCI, Hong Kong, 2008.

The Big Picture for Gradient-Based Learning
(Diagram: the family of gradient-based and least-squares algorithms - LMS, leaky LMS, normalized LMS, adaline, APA, APA-Newton, leaky APA, RLS, weighted RLS, and extended RLS - related through the window size (K = 1 recovers the LMS-type updates, K = i approaches RLS), with the kernelized counterparts already in the literature marked: Kivinen 2004, Frieß 1999, Engel 2004.) We have kernelized versions of all of them; the extended RLS is a model with states.
Liu W., Principe J., "Kernel Affine Projection Algorithms", EURASIP Journal on Advances in Signal Processing, Article ID 784292, 2008.

Affine Projection Algorithms
Least-mean-square (LMS): w_i = w_{i-1} + \eta u_i e(i), with e(i) = d(i) - w_{i-1}^T u_i and w_0 = 0.
Affine projection algorithms (APA) use the previous K samples to estimate the gradient and Hessian, whereas LMS uses just one, the present sample (instantaneous gradient). With
U_i = [u_i, u_{i-1}, ..., u_{i-K+1}]   (an M x K matrix),   d_i = [d(i), d(i-1), ..., d(i-K+1)]^T,
APA:         w_0 = 0,  e_i = d_i - U_i^T w_{i-1},  w_i = w_{i-1} + \eta U_i e_i = w_{i-1} + \eta \sum_{k=i-K+1}^{i} e_i(k) u_k
APA-Newton:  w_0 = 0,  e_i = d_i - U_i^T w_{i-1},  w_i = w_{i-1} + \eta ( \lambda I + U_i U_i^T )^{-1} U_i e_i.
Therefore APA is a family of on-line gradient-based algorithms of intermediate complexity between LMS and RLS.

Kernel Affine Projection Algorithms
(Table: the KAPA family of updates.) KAPA-1 and KAPA-2 use the least-squares cost, while KAPA-3 and KAPA-4 are regularized; KAPA-1 and KAPA-3 use gradient descent, and KAPA-2 and KAPA-4 use the Newton update. Note that KAPA-4 does not require the calculation of the error, obtained by rewriting KAPA-2 with the matrix inversion lemma and using the kernel trick. Note also that one does not have access to the weights, so a recursion on the expansion coefficients is needed, as in KLMS. Care must be taken to minimize computations.

KAPA-1
At each step the newest center is allocated and the K most recent coefficients are corrected:
c_{m_i} = u_i,   a_{m_i} = \eta e_i(i)
a_{m_i - 1}  <-  a_{m_i - 1} + \eta e_i(i-1)
...
a_{m_i - K + 1}  <-  a_{m_i - K + 1} + \eta e_i(i-K+1)
f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \sum_j a_j \kappa(u, u_j).

Error Reuse to Save Computation
Calculating the K errors is expensive (kernel evaluations): e_i(k) = d(k) - \varphi_k^T \Omega_{i-1} for i-K+1 \le k \le i. Does this require K times the computation? No: save the previous errors and use them,
e_{i+1}(k) = d(k) - \varphi_k^T \Omega_i = d(k) - \varphi_k^T ( \Omega_{i-1} + \eta \Phi_i e_i )
           = ( d(k) - \varphi_k^T \Omega_{i-1} ) - \eta \varphi_k^T \Phi_i e_i
           = e_i(k) - \eta \sum_{j=i-K+1}^{i} e_i(j) \varphi_k^T \varphi_j.
The newest error e_{i+1}(i+1) still has to be computed directly, but this saves K error computations.

KAPA-2
KAPA-2 is the Newton-method version. With \Phi_i = [\varphi_i, \varphi_{i-1}, ..., \varphi_{i-K+1}] and d_i = [d(i), d(i-1), ..., d(i-K+1)]^T:
\Omega_0 = 0
e_i = ( \lambda I + \Phi_i^T \Phi_i )^{-1} ( d_i - \Phi_i^T \Omega_{i-1} )
\Omega_i = \Omega_{i-1} + \eta \Phi_i e_i.
How do we invert the K-by-K matrix ( \lambda I + \Phi_i^T \Phi_i ) and avoid O(K^3)?
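Before turning to that question, here is a minimal sketch of KAPA-1 in its gradient form, which needs no matrix inversion; for clarity it recomputes the K errors directly instead of using the error-reuse recursion, and the kernel width, step size, and window size K are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def kernel(x, y, h=1.0):
    return np.exp(-h * np.sum((x - y) ** 2))

def kapa1_train(U, d, eta=0.2, K=5, h=1.0):
    """KAPA-1 (gradient form): at step i, allocate a new center and update the
    coefficients of the K most recent centers, with all K errors evaluated on
    the expansion from the previous iteration."""
    centers, coeffs = [], []
    for i in range(len(U)):
        centers.append(U[i])
        coeffs.append(0.0)                     # the new center starts at zero
        window = range(max(0, i - K + 1), i + 1)
        # K errors from Omega_{i-1}; the fresh center has coefficient 0, so it
        # does not contribute yet (costs O(K*i) kernel evaluations here; the
        # error-reuse trick in the text reduces this)
        errs = [d[k] - sum(a * kernel(U[k], c, h)
                           for a, c in zip(coeffs, centers))
                for k in window]
        for k, e_k in zip(window, errs):       # gradient step along phi(u_k)
            coeffs[k] += eta * e_k
    return centers, coeffs

def predict(u, centers, coeffs, h=1.0):
    return sum(a * kernel(u, c, h) for a, c in zip(coeffs, centers))

# Toy nonlinear regression (illustrative data)
U = rng.uniform(-1, 1, size=(300, 1))
d = np.sin(2 * np.pi * U[:, 0]) + 0.05 * rng.standard_normal(300)
centers, coeffs = kapa1_train(U, d)
resid = [d[i] - predict(U[i], centers, coeffs) for i in range(250, 300)]
print("MSE on the last 50 training points:", np.mean(np.square(resid)))
```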
Sliding-Window Gram Matrix Inversion
Let Gr_i = \Phi_i^T \Phi_i and write the regularized Gram matrix in block form as
Gr_i + \lambda I = [ a    b^T
                     b    D ],
and assume its inverse is known,
( Gr_i + \lambda I )^{-1} = [ e    f^T
                              f    H ].
When the window slides, the oldest sample leaves and a new one enters:
Gr_{i+1} + \lambda I = [ D      h
                         h^T    g ].
Using the Schur complement of D,
D^{-1} = H - f f^T / e
s = ( g - h^T D^{-1} h )^{-1}
( Gr_{i+1} + \lambda I )^{-1} = [ D^{-1} + (D^{-1} h)(D^{-1} h)^T s    -(D^{-1} h) s
                                  -(D^{-1} h)^T s                       s ].
The complexity is O(K^2).

(Slides: relations to other methods; computational complexity; prediction of the Mackey-Glass time series with L = 10, K = 10 and K = 50, compared with sliding-window (SW) KRLS.)

Simulation 1: Noise Cancellation
n(i) ~ uniform[-0.5, 0.5]. The primary signal is s(i) + n(i); the reference input to the adaptive filter is the noise passed through the interference distortion function H(.),
u(i) = H( n(i), n(i-1), u(i-1), u(i-2) ) = n(i) + 0.2 u(i-1) + u(i-1) n(i-1) + 0.1 n(i-1) + 0.4 u(i-2),
and the filter output y(i) is subtracted from the primary signal.
Kernel: \kappa(u(i), u(j)) = \exp( -\|u(i) - u(j)\|^2 ).
(Figure: amplitude traces of the noisy observation and of the NLMS, KLMS-1, and KAPA-2 outputs over samples 2500-2600.)

Simulation 2: Nonlinear Channel Equalization
The same nonlinear channel as before: z_t = s_t + 0.5 s_{t-1}, r_t = z_t - 0.9 z_t^2 + n_t. (Results figures.)

Regularization
The well-posedness discussion for KLMS holds for any other gradient-descent method, such as KAPA-1 and the kernel adaline. If the Newton method is used, additional regularization is needed to invert the Hessian matrix, as in KAPA-2 and normalized KLMS. Recursive least squares embeds the regularization in the initialization.

Extended Recursive Least-Squares
State-space model:
x_{i+1} = F_i x_i + n_i
d_i = U_i^T x_i + v_i
Start with w_{0|-1} and P_{0|-1}.
Special cases:
- Tracking model (F is a scalar \alpha): x_{i+1} = \alpha x_i + n_i,  d(i) = u_i^T x_i + v(i)
- Exponentially weighted RLS:            x_{i+1} = \alpha x_i,        d(i) = u_i^T x_i + v(i)
- Standard RLS:                          x_{i+1} = x_i,               d(i) = u_i^T x_i + v(i)
Notation: x_i is the state vector at time i, and w_{i|i-1} is the state estimate at time i using data up to time i-1.

Recursive Equations
w_{0|-1} = 0,   P_{0|-1} = \lambda^{-1} I
r_e(i) = \lambda^i + u_i^T P_{i|i-1} u_i          (conversion factor)
k_{p,i} = P_{i|i-1} u_i / r_e(i)                  (gain factor)
e(i) = d(i) - u_i^T w_{i|i-1}                     (error)
w_{i+1|i} = \alpha ( w_{i|i-1} + k_{p,i} e(i) )   (weight update)
P_{i+1|i} = |\alpha|^2 [ P_{i|i-1} - P_{i|i-1} u_i u_i^T P_{i|i-1} / r_e(i) ] + \lambda^i q I.
Notice that u^T \hat{w}_{i+1|i} = \alpha [ u^T \hat{w}_{i|i-1} + u^T P_{i|i-1} u_i e(i) / r_e(i) ]. If we have transformed data, how do we calculate \varphi(u_k)^T P_{i|i-1} \varphi(u_j) for any k, i, j?

New Extended Recursive Least-Squares
Theorem 1: P_{j|j-1} = r_{j-1} I - H_{j-1}^T Q_{j-1} H_{j-1} for all j, where r_{j-1} is a scalar, H_{j-1} = [u_0, ..., u_{j-1}]^T, and Q_{j-1} is a j x j matrix.
Proof: P_{0|-1} = \lambda^{-1} I, with r_{-1} = \lambda^{-1} and Q_{-1} = 0. By mathematical induction,
P_{i+1|i} = |\alpha|^2 [ P_{i|i-1} - P_{i|i-1} u_i u_i^T P_{i|i-1} / r_e(i) ] + \lambda^i q I
          = |\alpha|^2 [ ( r_{i-1} I - H_{i-1}^T Q_{i-1} H_{i-1} ) - ( r_{i-1} I - H_{i-1}^T Q_{i-1} H_{i-1} ) u_i u_i^T ( r_{i-1} I - H_{i-1}^T Q_{i-1} H_{i-1} ) / r_e(i) ] + \lambda^i q I
          = ( |\alpha|^2 r_{i-1} + \lambda^i q ) I - H_i^T |\alpha|^2 [ Q_{i-1} + f_{i-1,i} f_{i-1,i}^T r_e^{-1}(i)    -r_{i-1} f_{i-1,i} r_e^{-1}(i)
                                                                        -r_{i-1} f_{i-1,i}^T r_e^{-1}(i)              r_{i-1}^2 r_e^{-1}(i) ] H_i,
with f_{i-1,i} = Q_{i-1} H_{i-1} u_i, which has the form of the theorem with r_i = |\alpha|^2 r_{i-1} + \lambda^i q.
Liu W., Principe J., "Extended Recursive Least Squares in RKHS", in Proc. 1st Workshop on Cognitive Signal Processing, Santorini, Greece, 2008.

New Extended Recursive Least-Squares
Theorem 2: \hat{w}_{j|j-1} = H_{j-1}^T a_{j|j-1} for all j, where H_{j-1} = [u_0, ..., u_{j-1}]^T and a_{j|j-1} is a j x 1 vector.
Proof: \hat{w}_{0|-1} = 0 and a_{0|-1} = 0. Using \hat{w}_{i+1|i} = \alpha ( \hat{w}_{i|i-1} + k_{p,i} e(i) ), by mathematical induction again,
\hat{w}_{i+1|i} = \alpha ( H_{i-1}^T a_{i|i-1} + P_{i|i-1} u_i e(i) / r_e(i) )
              = \alpha ( H_{i-1}^T a_{i|i-1} + ( r_{i-1} I - H_{i-1}^T Q_{i-1} H_{i-1} ) u_i e(i) / r_e(i) )
              = \alpha ( H_{i-1}^T a_{i|i-1} + r_{i-1} u_i e(i) / r_e(i) - H_{i-1}^T f_{i-1,i} e(i) / r_e(i) )
              = H_i^T [ \alpha ( a_{i|i-1} - f_{i-1,i} e(i) r_e^{-1}(i) )
                        \alpha r_{i-1} r_e^{-1}(i) e(i) ],
which has the form of the theorem.

Extended RLS: New Equations
The extended RLS recursions can therefore be rewritten entirely in terms of inner products:
Initialization: w_{0|-1} = 0, P_{0|-1} = \lambda^{-1} I   becomes   a_{0|-1} = 0, r_{-1} = \lambda^{-1}, Q_{-1} = 0
k_{i-1,i} = H_{i-1} u_i,   f_{i-1,i} = Q_{i-1} k_{i-1,i}
r_e(i) = \lambda^i + u_i^T P_{i|i-1} u_i = \lambda^i + r_{i-1} u_i^T u_i - k_{i-1,i}^T f_{i-1,i}
e(i) = d(i) - u_i^T w_{i|i-1} = d(i) - k_{i-1,i}^T a_{i|i-1}
k_{p,i} = P_{i|i-1} u_i / r_e(i),   w_{i+1|i} = \alpha ( w_{i|i-1} + k_{p,i} e(i) )   becomes
a_{i+1|i} = [ \alpha ( a_{i|i-1} - f_{i-1,i} r_e^{-1}(i) e(i) )
              \alpha r_{i-1} r_e^{-1}(i) e(i) ]
P_{i+1|i} = |\alpha|^2 [ P_{i|i-1} - P_{i|i-1} u_i u_i^T P_{i|i-1} / r_e(i) ] + \lambda^i q I   becomes
r_i = |\alpha|^2 r_{i-1} + \lambda^i q
Q_i = |\alpha|^2 [ Q_{i-1} + f_{i-1,i} f_{i-1,i}^T r_e^{-1}(i)    -r_{i-1} f_{i-1,i} r_e^{-1}(i)
                   -r_{i-1} f_{i-1,i}^T r_e^{-1}(i)               r_{i-1}^2 r_e^{-1}(i) ].

Kernel Extended Recursive Least-Squares
Initialize a_{0|-1} = 0, r_{-1} = \lambda^{-1}, Q_{-1} = 0. At each iteration:
k_{i-1,i} = [ \kappa(u_0, u_i), ..., \kappa(u_{i-1}, u_i) ]^T
f_{i-1,i} = Q_{i-1} k_{i-1,i}
r_e(i) = \lambda^i + r_{i-1} \kappa(u_i, u_i) - k_{i-1,i}^T f_{i-1,i}
e(i) = d(i) - k_{i-1,i}^T a_{i|i-1}
Update of the weights:
a_{i+1|i} = [ \alpha ( a_{i|i-1} - f_{i-1,i} r_e^{-1}(i) e(i) )
              \alpha r_{i-1} r_e^{-1}(i) e(i) ]
Update of the "P matrix":
r_i = |\alpha|^2 r_{i-1} + \lambda^i q
Q_i = |\alpha|^2 [ Q_{i-1} + f_{i-1,i} f_{i-1,i}^T r_e^{-1}(i)    -r_{i-1} f_{i-1,i} r_e^{-1}(i)
                   -r_{i-1} f_{i-1,i}^T r_e^{-1}(i)               r_{i-1}^2 r_e^{-1}(i) ].

Kernel Ex-RLS Network
(Figure: growing RBF network.) A new center c_{m_i} = u_i is allocated with coefficient a_{m_i} = \alpha r_{i-1} r_e^{-1}(i) e(i), and every existing coefficient is updated,
a_j  <-  \alpha ( a_j - f_{i-1,i}(j) r_e^{-1}(i) e(i) ),   j = 1, ..., m_i - 1,
f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \sum_j a_j \kappa(u, u_j).

Simulation 3: Lorenz Time Series Prediction
(Results figures.)

Simulation 4: Rayleigh Channel Tracking
s_t passes through a 5-tap Rayleigh multipath fading channel, additive noise, and a tanh nonlinearity to produce r_t. Kernel: \kappa(u_i, u_j) = \exp( -0.1 \|u_i - u_j\|^2 ).

Algorithm            | MSE (dB), noise variance 0.001, f_D = 50 Hz | MSE (dB), noise variance 0.01, f_D = 200 Hz
epsilon-NLMS         | -13.51 | -9.39
RLS                  | -14.25 | -9.55
Extended RLS         | -14.26 | -10.01
Kernel RLS           | -20.36 | -12.74
Kernel extended RLS  | -20.69 | -13.85

Computational Complexity (at time or iteration i)
Algorithm              | Linear LMS | KLMS | KAPA       | ex-KRLS
Computation (training) | O(L)       | O(i) | O(i + K^2) | O(i^2)
Memory (training)      | O(L)       | O(i) | O(i + K)   | O(i^2)
Computation (test)     | O(L)       | O(i) | O(i)       | O(i)
Memory (test)          | O(L)       | O(i) | O(i)       | O(i)

Active Data Selection
Why? The kernel trick may seem a "free lunch", but the price we pay is memory and pointwise evaluations of the function, as well as generalization (Occam's razor). And remember that we are working in an on-line scenario, so most of the existing methods need to be modified.

Problem Statement
The learning system is y(u; T(i)), with the already processed data (the dictionary) D(i) = {u(j), d(j)}_{j=1}^{i} and a new data pair {u(i+1), d(i+1)}. How much new information does it contain? Is this the right question, or should we ask how much information it contains with respect to the learning system y(u; T(i))?

Previous Approaches
Novelty condition (Platt, 1991):
- Compute the distance to the current dictionary, dis = min_{c_j \in D(i)} \| u(i+1) - c_j \|.
- If it is less than a threshold \delta_1, discard the sample.
- If the prediction error e(i+1) = d(i+1) - \varphi(u(i+1))^T \Omega(i) is larger than another threshold \delta_2, include the new center.
Approximate linear dependency (Engel, 2004):
- If the new input is (approximately) a linear combination of the previous centers, discard it:
dis_2 = min_b \| \varphi(u(i+1)) - \sum_{c_j \in D(i)} b_j \varphi(c_j) \|,
which is the Schur complement of the Gram matrix and fits KAPA-2 and KAPA-4 very well. The problem is the computational complexity.
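A minimal sketch of the novelty condition described above, layered on a KLMS-type learner: a sample enters the dictionary only if it is far enough from the existing centers and poorly predicted. The thresholds \delta_1 and \delta_2, the step size, and the kernel width are illustrative hyperparameters.

```python
import numpy as np

def kernel(x, y, h=1.0):
    return np.exp(-h * np.sum((x - y) ** 2))

def novelty_criterion_update(u_new, d_new, centers, coeffs,
                             delta1=0.1, delta2=0.05, eta=0.2, h=1.0):
    """Platt-style novelty condition on top of a KLMS-type learner.

    Returns True if the sample was added to the dictionary."""
    y = sum(a * kernel(u_new, c, h) for a, c in zip(coeffs, centers))
    e = d_new - y
    if centers:
        dist = min(np.linalg.norm(u_new - c) for c in centers)
        # too close to an existing center, or already well predicted -> discard
        if dist < delta1 or abs(e) < delta2:
            return False
    centers.append(u_new)
    coeffs.append(eta * e)              # KLMS-style allocation for the new center
    return True

# Usage sketch (illustrative data)
rng = np.random.default_rng(4)
centers, coeffs = [], []
added = 0
for _ in range(1000):
    u = rng.uniform(-1, 1, size=1)
    d = np.sin(2 * np.pi * u[0]) + 0.05 * rng.standard_normal()
    added += novelty_criterion_update(u, d, centers, coeffs)
print("dictionary size:", len(centers), "out of 1000 samples")
```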
Sparsification
A simpler procedure is the following. Define a cost that quantifies the linear dependence of the new sample on the current dictionary,
J = min_\beta \| \varphi(x(n)) - \sum_{k=1}^{length(r_{n-1})} \beta_k \varphi(x(r_{n-1,k})) \|^2
  = \kappa(x(n), x(n)) - 2 \beta^T k_{n-1}(x(n)) + \beta^T G_{n-1} \beta,
where r_{n-1} indexes the current centers, k_{n-1}(x(n)) = [ \kappa(x(r_{n-1,1}), x(n)), ... ]^T, and G_{n-1} is their Gram matrix, so that
G_n = [ G_{n-1}            k_{n-1}(x(n))
        k_{n-1}(x(n))^T    \kappa(x(n), x(n)) ].
The solution is \beta_{opt} = G_{n-1}^{-1} k_{n-1}(x(n)), giving
J_{opt} = \kappa(x(n), x(n)) - k_{n-1}(x(n))^T \beta_{opt},
so one can use a threshold on J_{opt} to decide whether the new sample should be kept.

Information Measure
Hartley and Shannon's definition of information answers "how much information does it contain?":
I(i+1) = -\ln p( u(i+1), d(i+1) ).
But learning is unlike digital communications: the machine never knows the joint distribution! And when the same message is presented to a learning system, the information (the degree of uncertainty) changes, because the system learned from the first presentation. We need to bring MEANING back into information theory!

Surprise as an Information Measure
Learning is very much an experiment that we do in the laboratory. Fedorov (1972) proposed to measure the importance of an experiment as the Kullback-Leibler distance between the prior (the hypothesis we have) and the posterior (the result after the measurement). MacKay (1992) formulated this concept under a Bayesian approach, and it has become one of the key concepts in active learning.

Surprise as an Information Measure
Pfaffelhuber (1972) formulated the concept of subjective or redundant information for learning systems as
I_S(x) = -\log q(x),
where p(x) is the PDF of the data and q(x) is the learner's subjective estimate of it. Palm (1981) defined surprise for a learning system y(u; T(i)) as
S_{T(i)}( u(i+1) ) = CI(i+1) = -\ln p( u(i+1) | T(i) ).

Shannon versus Surprise
Shannon                | Surprise
Objective              | Subjective
Receptor independent   | Receptor dependent (on time and agent)
Message is meaningless | Message has meaning for the agent

Evaluation of Conditional Information
(Slide: the conditional information is evaluated with Gaussian process theory, which provides the predictive mean and variance in closed form.)

Interpretation of Conditional Information
- Prediction error: a large error implies large conditional information.
- Prediction variance: a small error with a large variance implies large CI; a large error with a small variance implies large CI (abnormal).
- Input distribution: a rare occurrence implies large CI.
For the input distribution one can use a memoryless assumption or a memoryless uniform assumption; for the unknown desired signal, average the CI over the posterior distribution of the output. Under the memoryless uniform assumption this is equivalent to approximate linear dependency! We still need a systematic way to select these thresholds, which are hyperparameters.

Active Online GP Regression (AOGR)
Compute the conditional information:
- If redundant: throw it away.
- If abnormal: throw it away (outlier examples), or use controlled gradient descent (non-stationary).
- If learnable: kernel recursive least squares (stationary), or extended KRLS (non-stationary).

Simulation 5: nonlinear regression - learning curve, redundancy removal, and abnormality detection. Simulation 6: Mackey-Glass time series prediction. Simulation 7: CO2 concentration forecasting. (Results figures.)

Redefinition of On-Line Kernel Learning
Notice how the problem constraints affected the form of the learning algorithms. On-line learning: a process by which the free parameters and the topology of a 'learning system' are adapted through a process of stimulation by the environment in which the system is embedded. It combines error-correction learning with memory-based learning: what an interesting (biologically plausible?) combination.
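Before closing, a minimal sketch of the sparsification test defined above, J_opt = \kappa(x, x) - k^T G^{-1} k over the current dictionary; the threshold, kernel width, and data are illustrative, and the Gram inverse is rebuilt from scratch for clarity rather than updated recursively.

```python
import numpy as np

def kernel(x, y, h=1.0):
    return np.exp(-h * np.sum((x - y) ** 2))

def sparsification_cost(x_new, dictionary, G_inv, h=1.0):
    """J_opt = kappa(x,x) - k^T G^{-1} k : distance of phi(x_new) to the span
    of the mapped dictionary (the Schur complement of the grown Gram matrix)."""
    k = np.array([kernel(x_new, c, h) for c in dictionary])
    return kernel(x_new, x_new, h) - k @ G_inv @ k

# Usage sketch (illustrative data and threshold)
rng = np.random.default_rng(5)
threshold = 0.1
dictionary = [rng.uniform(-1, 1, size=1)]
G_inv = np.array([[1.0 / kernel(dictionary[0], dictionary[0])]])

for _ in range(500):
    x = rng.uniform(-1, 1, size=1)
    J = sparsification_cost(x, dictionary, G_inv)
    if J > threshold:                   # nearly linearly independent -> keep it
        dictionary.append(x)
        # rebuild G^{-1} directly for clarity (a Schur-complement update would be cheaper)
        G = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        G_inv = np.linalg.inv(G + 1e-10 * np.eye(len(dictionary)))

print("dictionary size:", len(dictionary), "out of 500 candidate samples")
```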
Impacts on Machine Learning
KAPA algorithms can be very useful in large-scale learning problems: just sample the data randomly from the database and apply the on-line learning algorithms. There is an extra optimization error associated with these methods, but they can easily be fit to the machine constraints (memory, FLOPS) or to processing-time constraints (best solution in x seconds).

Wiley Book (2009)
Weifeng Liu, Jose Principe, Simon Haykin, "On-Line Kernel Learning: An Adaptive Filtering Perspective". Papers are available at www.cnel.ufl.edu.

Information Theoretic Learning (ITL)
Deniz Erdogmus and Jose Principe, "From Linear Adaptive Filtering to Nonlinear Information Processing". This class of algorithms can be extended to ITL cost functions and also beyond regression (classification, clustering, ICA, etc.). See the IEEE Signal Processing Magazine, Nov. 2006, or the ITL resource at www.cnel.ufl.edu.