Basics of Kernel Methods in Statistical Learning Theory
Mohammed Nasser
Professor, Department of Statistics, Rajshahi University
E-mail: mnasser.ru@gmail.com

Contents
• Glimpses of Historical Development
• Definition and Examples of Kernels
• Some Mathematical Properties of Kernels
• Construction of Kernels
• Heuristic Presentation of Kernel Methods
• Meaning of Kernels
• Mercer's Theorem and Its Latest Development
• Directions of Future Development
• Conclusion

Computer Scientists' Contribution to Statistics: Kernel Methods
Vladimir Vapnik and Jerome H. Friedman

Early History
• In 1900 Karl Pearson published his famous article on goodness of fit, judged one of the twelve best scientific articles of the twentieth century.
• In 1902 Jacques Hadamard pointed out that mathematical models of physical phenomena should have the properties that
  - a solution exists,
  - the solution is unique, and
  - the solution depends continuously on the data, in some reasonable topology
  (a well-posed problem).

Early History
• In 1940 Fréchet, a PhD student of Hadamard, strongly criticized the mean and standard deviation as measures of location and scale respectively. He expressed his belief that statistics should develop differently, but proposed no alternative.
• During the sixties and seventies Tukey, Huber and Hampel tried to develop robust statistics in order to remove the ill-posedness of classical statistics.
• Robustness means insensitivity to minor changes in both model and sample, high tolerance to major changes, and good performance at the model.
• The data-mining onslaught and the problems of non-linearity and non-vectorial data have made robust statistics somewhat less attractive.
• Let us see what kernel methods offer...

Recent History
• Support Vector Machines (SVMs), introduced at COLT-92 (Conference on Learning Theory), have been greatly developed since then.
• Result: a class of algorithms for pattern recognition (kernel machines).
• Now: a large and diverse community drawn from machine learning, optimization, statistics, neural networks, functional analysis, etc.
• Centralized website: www.kernel-machines.org
• First textbook (2000): see www.support-vector.net
• Now (2012): at least twenty books of different tastes are available on the international market.
• The book "The Elements of Statistical Learning" (2001) by Hastie, Tibshirani and Friedman went into a second edition within seven years.

More History
• David Hilbert used the German word "Kern" in his first paper on integral equations (Hilbert 1904).
• The mathematical result underlying the kernel trick, Mercer's theorem, is over a century old (Mercer 1909). It tells us that any "reasonable" kernel function corresponds to some feature space.
• Which kernels can be used to compute distances in feature spaces was worked out by Schoenberg (1938).
• Methods for representing kernels in linear spaces were first studied by Kolmogorov (1941) for a countable input domain.
• The representation of kernels in linear spaces in the general case was developed by Aronszajn (1950).
• Dunford and Schwartz (1963) showed that Mercer's theorem also holds for general compact spaces.

More History
• The use of Mercer's theorem for interpreting kernels as inner products in a feature space was introduced into machine learning by Aizerman, Braverman and Rozonoer (1964).
• Berg, Christensen and Ressel (1984) published a good monograph on the theory of kernels.
• Saitoh (1988) showed the connection between positivity (a "positive matrix" as defined in Aronszajn 1950) and the positive semi-definiteness of all kernel matrices formed on finite sets of points.
• Reproducing kernels were used extensively in machine learning and neural networks by Poggio and Girosi; see for example Poggio and Girosi (1990), a paper on radial basis function networks.
• The theory of kernels was also used in approximation and regularization theory, and the first chapter of Spline Models for Observational Data (Wahba 1990) gave a number of theoretical results on kernel functions.

Kernel Methods: Heuristic View
What is the common characteristic (structure) among the following statistical methods?
1. Principal component analysis
2. (Ridge) regression
3. Fisher discriminant analysis
4. Canonical correlation analysis
5. Singular value decomposition
6. Independent component analysis
Each has a kernel counterpart: KPCA, SVR, KFDA, KCCA, KICA.
They all consider linear combinations of the input vector,
    f(x) = w^T x,
and they make use of the concepts of length and dot product available in Euclidean space.

Kernel Methods: Heuristic View
• Linear learning typically has nice properties:
  - unique optimal solutions,
  - fast learning algorithms,
  - better statistical analysis.
• But there is one big problem: insufficient capacity. In many data sets linear methods fail to detect nonlinear relationships among the variables.
• The other demerit: they cannot handle non-vectorial data.

Data
• Vectors: collections of features, e.g. height, weight, blood pressure, age, ...; categorical variables can be mapped into vectors.
• Matrices: images, movies, remote sensing and satellite data (multispectral).
• Strings: documents, gene sequences.
• Structured objects: XML documents, graphs.

Kernel Methods: Heuristic View
Genome-wide data:
• mRNA expression data
• hydrophobicity data
• protein-protein interaction data
• sequence data (gene, protein)

[Figure: a nonlinear map φ carries the original (input) space into the feature space.]

Definition of Kernels
Definition: A finitely positive semi-definite function k : X × X → R is a symmetric function of its arguments for which every matrix K formed by restriction to a finite subset of points is positive semi-definite, K ⪰ 0. (A numerical check of this property is sketched at the end of this part.)
• It is a generalized dot product.
• It is not generally bilinear.
• But it obeys the Cauchy-Schwarz inequality.

Kernel Methods: Basic Ideas
A proper kernel,
    k(x, y) = \langle \phi(x), \phi(y) \rangle,
is always a kernel. When is the converse true?
Theorem (Aronszajn, 1950): A function k : X × X → R can be written as k(x, y) = \langle \phi(x), \phi(y) \rangle, where \phi : x \mapsto \phi(x) \in F is a feature map, iff k(x, y) satisfies the finitely positive semi-definiteness property.
We can therefore check whether k(x, y) is a proper kernel using only properties of k(x, y) itself, i.e. without needing to know the feature map. If the map itself is needed we may take the help of Mercer's theorem.

Kernel methods consist of two modules:
1) the choice of kernel (this is non-trivial);
2) the algorithm which takes kernels as input.
Modularity: any kernel can be used with any kernel algorithm.

Some kernels:
• Gaussian: k(x, y) = \exp(-\|x - y\|^2 / c)
• Polynomial: k(x, y) = \langle x, y \rangle^d
• Sigmoid: k(x, y) = \tanh(\langle x, y \rangle)
• Inverse multiquadric: k(x, y) = 1 / \sqrt{\|x - y\|^2 + c^2}

Some kernel algorithms:
• support vector machines
• Fisher discriminant analysis
• kernel regression
• kernel PCA
• kernel CCA

Kernel Construction
The set of kernels forms a closed convex cone: sums, non-negative scalings and pointwise limits of kernels are again kernels.
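The slides contain no code, but the finitely-positive-semi-definite property and the example kernels above can be checked numerically. The following is a minimal sketch of such a check (our own illustration, not part of the original slides; Python/NumPy, with function names and the random sample chosen by us):

```python
import numpy as np

def gaussian_kernel(X, Y, c=1.0):
    """k(x, y) = exp(-||x - y||^2 / c)."""
    sq_dist = (np.sum(X**2, axis=1)[:, None]
               + np.sum(Y**2, axis=1)[None, :]
               - 2.0 * X @ Y.T)
    return np.exp(-sq_dist / c)

def polynomial_kernel(X, Y, d=2):
    """k(x, y) = <x, y>^d."""
    return (X @ Y.T) ** d

def is_finitely_psd(K, tol=1e-10):
    """Check the defining property on a finite sample: K must be
    symmetric with no eigenvalue below zero (up to tolerance)."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 hypothetical points in R^3
for K in (gaussian_kernel(X, X), polynomial_kernel(X, X)):
    print(is_finitely_psd(K))           # True for both kernels
```

The same check applies to any candidate kernel: restrict it to a finite sample, form the matrix, and inspect the eigenvalues.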
Reproducing Kernel Hilbert Space
Let X be a set. A Hilbert space H consisting of functions on X is called a reproducing kernel Hilbert space (RKHS) if the evaluation functional
    e_x : H \to \mathbb{R}, \quad f \mapsto f(x),
is continuous for each x \in X.
Equivalently, H is an RKHS if and only if for every x \in X there exists k(\cdot, x) \in H (the reproducing kernel) such that
    \langle k(\cdot, x), f \rangle_H = f(x) \quad \text{for all } f \in H, \ x \in X
(by Riesz's lemma).

Reproducing Kernel Hilbert Space II
Theorem (construction of the RKHS): If k : X × X → R is positive definite, there uniquely exists an RKHS H_k on X such that
(1) k(\cdot, x) \in H_k for all x \in X;
(2) the linear hull of \{ k(\cdot, x) \mid x \in X \} is dense in H_k;
(3) k(\cdot, x) is a reproducing kernel of H_k, i.e.
    \langle k(\cdot, x), f \rangle_{H_k} = f(x) \quad \text{for all } f \in H_k, \ x \in X.
At this point we put no structure on X. To obtain better properties of the members g of H_k we have to put extra structure on X and assume additional properties of k.

Classification
Y = g(X), with g : X → Y.
• X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (trees, strings, ...), ...
• Y is discrete: {0,1} (binary), {1,...,k} (multi-class), trees etc. (structured).

Classification
[Diagram: X (anything: continuous, discrete or structured) feeds classifiers such as the perceptron, logistic regression, support vector machines, decision trees and random forests; the kernel trick provides the bridge from arbitrary inputs to these algorithms.]

Regression
Y = g(X), with g : X → Y.
• X can be anything: continuous (R, R^d, ...), discrete ({0,1}, {1,...,k}, ...), structured (trees, strings, ...), though not always.
• Y is continuous: R or R^d.

Regression
[Diagram: X (anything: continuous, discrete or structured) feeds regression methods such as the perceptron, ordinary regression, support vector regression and GLMs; again the kernel trick provides the bridge.]

Kernel Methods: Heuristic View
Steps for kernel methods, traditional or non-traditional:
    DATA MATRIX → kernel matrix K = [k(x_i, x_j)], a positive semi-definite matrix → pattern function f(x) = \sum_i \alpha_i k(x_i, x).
Why positive semi-definite? Which k?

[Figure: the original-space to feature-space picture again.]

Kernel Methods: Basic Ideas
The kernel methods approach is to stick with linear functions but work in a high-dimensional feature space. The expectation is that the feature space has a much higher dimension than the input space. The feature space carries an inner product, with
    k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.

Kernel Methods: Heuristic View
Form of functions
• So kernel methods use linear functions in a feature space.
• For regression this could be the function f(x) = \langle w, \phi(x) \rangle + b.
• For classification we additionally require thresholding, e.g. \operatorname{sign}(f(x)).

Kernel Methods: Heuristic View
Feature spaces:
    \phi : x \mapsto \phi(x), \quad \mathbb{R}^d \to F,
a non-linear mapping into F, where F may be
1. a high-dimensional space,
2. an infinite-dimensional countable space (\ell_2),
3. a function space (a Hilbert space).
Example: \phi(x, y) = (x^2, y^2, \sqrt{2}\, xy).

Kernel Methods: Heuristic View
Example
• Consider the mapping \phi(x_1, x_2) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2).
• A linear equation in this feature space,
    a x_1^2 + 0 \cdot x_1 x_2 + 0 \cdot x_2 x_1 + b x_2^2 = c,
  is actually an ellipse, i.e. a non-linear shape in the input space.

Kernel Methods: Heuristic View
Ridge regression (duality)
Problem:
    \min_w \sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda \|w\|^2   (target fit plus regularization).
Solution:
    w = (X^T X + \lambda I_d)^{-1} X^T y   (a d×d inverse)
      = X^T (X X^T + \lambda I_n)^{-1} y   (an n×n inverse)
      = X^T (G + \lambda I_n)^{-1} y = \sum_i \alpha_i x_i,
so that
    f(x) = w^T x = \sum_i \alpha_i \langle x_i, x \rangle,
a linear combination of the data, where G_{ij} = \langle x_i, x_j \rangle is the matrix of inner products of the observations. This is the dual representation (see the numerical sketch below).

Kernel Methods: Heuristic View
Kernel trick
Note: in the dual representation we used the Gram matrix to express the solution.
Kernel trick: replace x by \phi(x), so that
    G_{ij} = \langle x_i, x_j \rangle \quad \text{becomes} \quad G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j).
If we use algorithms that depend only on the Gram matrix G, then we never have to know (or compute) the actual features \phi(x).

Gist of Kernel Methods
Through the choice of a kernel function we choose a Hilbert space. We then apply the linear method in this new space, without increasing the computational complexity, using the mathematical niceties of this space.
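To make the dual representation concrete, here is a minimal kernel ridge regression sketch built directly from the formulas above, alpha = (G + lambda*I)^{-1} y and f(x) = sum_i alpha_i k(x_i, x). It is our own illustration, not from the slides; the Gaussian kernel width, the toy sine data and the function names are assumptions.

```python
import numpy as np

def k_gauss(A, B, c=1.0):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / c)."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-sq / c)

def fit_dual_ridge(X, y, lam=0.1, c=1.0):
    """alpha = (G + lam * I)^{-1} y, where G_ij = k(x_i, x_j)."""
    G = k_gauss(X, X, c)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, c=1.0):
    """f(x) = sum_i alpha_i k(x_i, x): only kernel evaluations are
    needed, never the feature map phi itself."""
    return k_gauss(X_new, X_train, c) @ alpha

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)   # noisy nonlinear target
alpha = fit_dual_ridge(X, y, lam=0.1, c=1.0)
print(predict(X, alpha, np.array([[0.5]])))       # close to sin(0.5)
```

Note that the code never forms φ(x): swapping k_gauss for any other kernel, say a string or graph kernel, changes nothing else, which is the modularity emphasized earlier.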
Kernels to Similarity
• Intuition of kernels as similarity measures:
    d(\phi(x), \phi(x'))^2 = \|\phi(x)\|^2 + \|\phi(x')\|^2 - 2 k(x, x').
• When the diagonal entries of the kernel (Gram) matrix are constant, kernels are directly related to similarities.
  For example, the Gaussian kernel
    K_G(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right).
• In general, it is useful to think of a kernel as a similarity measure.

Kernels to Distance
• Distance between two points x_1 and x_2 in feature space:
    d(x_1, x_2) = \|\phi(x_1) - \phi(x_2)\| = \sqrt{ k(x_1, x_1) + k(x_2, x_2) - 2 k(x_1, x_2) }.
• Distance between a point x_1 and a set S = \{x_1, \dots, x_n\} in feature space:
    d(x_1, S)^2 = k(x_1, x_1) - \frac{2}{n} \sum_{i=1}^n k(x_1, x_i) + \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k(x_i, x_j).

Kernel Methods: Heuristic View
Genome-wide data (again): mRNA expression data, hydrophobicity data, protein-protein interaction data, sequence data (gene, protein). For such data we often start from similarity scores rather than from vectors.

Similarity to Kernels
How can we make a similarity matrix positive semi-definite if it is not? For example,
    k_{3 \times 3} = \begin{pmatrix} 1 & 0.5 & 0.3 \\ 0.5 & 1 & 0.6 \\ 0.3 & 0.6 & 1 \end{pmatrix}.

From Similarity Scores to Kernels: Removal of Negative Eigenvalues
• Form the similarity matrix S, where the (i, j)-th entry of S denotes the similarity between the i-th and j-th data points.
• S is symmetric, but in general not positive semi-definite, i.e. S may have negative eigenvalues.
• Write S = U \Sigma U^T, where \Sigma = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) and \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r \ge 0 > \lambda_{r+1} \ge \dots \ge \lambda_n.
• Set K = U \tilde{\Sigma} U^T, where \tilde{\Sigma} = \operatorname{diag}(\lambda_1, \dots, \lambda_r, 0, \dots, 0).
(A small numerical sketch of this step is given at the end of this part.)

[Table: pairwise similarity scores s_{ij} between data objects x_1, ..., x_n and t_1, ..., t_n.]

Kernels as Measures of Function Regularity
Empirical risk functional and the problems of empirical risk minimization:
    R_{L, P_n}(g) = \int_{X \times Y} L(x, y, g(x)) \, dP_n(x, y)
                  = \int_X \left[ \int_Y L(x, y, g(x)) \, dP_n(y \mid x) \right] dP_n(x)
                  = \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, g(x_i)).

What Can We Do?
We can
• restrict the set of functions over which we minimize the empirical risk functional (structural risk minimization);
• modify the criterion to be minimized, e.g. by adding a penalty for "complicated" functions (regularization);
• or combine the two.

Best Approximation
[Figure: the best approximation \hat{f} of f \in H by an element of a subspace M is the orthogonal projection of f onto M.]

Best Approximation
• Assume M is finite dimensional with basis \{k_1, \dots, k_m\}, i.e. \hat{f} = a_1 k_1 + \dots + a_m k_m.
• The conditions \hat{f} \in M and f - \hat{f} \perp M give m conditions (i = 1, \dots, m):
    \langle k_i, f - (a_1 k_1 + \dots + a_m k_m) \rangle = 0,
  i.e.
    \langle k_i, f \rangle - a_1 \langle k_i, k_1 \rangle - \dots - a_m \langle k_i, k_m \rangle = 0.

RKHS Approximation
• In an RKHS with k_i = k(\cdot, x_i) we have \langle k_i, f \rangle = f(x_i), so fitting the data f(x_i) = y_i turns the m conditions into
    y_i - a_1 \langle k_i, k_1 \rangle - \dots - a_m \langle k_i, k_m \rangle = 0.
• We can then estimate the parameters using a = K^{-1} y.
• In practice K can be ill-conditioned, so we minimize instead
    \min_{f \in H_k} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|^2, \quad \lambda \ge 0,
  which gives a = (K + \lambda I)^{-1} y.

Approximation vs Estimation
[Figure: the target space contains the true function; the hypothesis space contains the best possible estimate and the actual estimate.]

How to Choose Kernels?
• There is no absolute rule for choosing the right kernel, adapted to a particular problem.
• The kernel should capture the desired similarity:
  - kernels for vectors: polynomial and Gaussian kernels;
  - string kernels (text documents);
  - diffusion kernels (graphs);
  - sequence kernels (protein, DNA, RNA).
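The "removal of negative eigenvalues" step described a few slides back (write S = U Σ U^T and zero out the negative eigenvalues) can be sketched as follows. This is our own minimal illustration, not part of the slides; the 3x3 similarity matrix is hypothetical and chosen only so that it has a negative eigenvalue.

```python
import numpy as np

def clip_to_kernel(S):
    """Spectral clipping: write S = U diag(lambda) U^T and set every
    negative eigenvalue to zero, giving a positive semi-definite K."""
    S = 0.5 * (S + S.T)                      # enforce symmetry
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.clip(lam, 0.0, None)) @ U.T

# A symmetric similarity matrix that is NOT positive semi-definite
# (hypothetical scores, chosen only to demonstrate the repair step).
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])
print(np.linalg.eigvalsh(S))   # the smallest eigenvalue is negative
K = clip_to_kernel(S)
print(np.linalg.eigvalsh(K))   # all eigenvalues are now >= 0
```

The clipped matrix K can then be fed to any kernel algorithm in place of a Gram matrix computed from an explicit kernel function.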
Kernel Selection
• Ideally, select the optimal kernel based on our prior knowledge of the problem domain.
• In practice, consider a family of kernels defined in a way that again reflects our prior expectations.
• Simple way: require only a limited amount of additional information from the training data.
• Elaborate way: combine label information.

Future Development
Mathematics:
• generalization of Mercer's theorem to pseudo-metric spaces;
• development of mathematical tools for multivariate regression.
Statistics:
• application of kernels in multivariate data depth;
• application of ideas from robust statistics;
• application of these methods to circular data;
• use of these methods to study nonlinear time series.

Acknowledgement
Jieping Ye
Department of Computer Science and Engineering, Arizona State University
http://www.public.asu.edu/~jye02
• http://www.kernel-machines.org/ : papers, software, workshops, conferences, etc.

Thank You