CIS526: Machine Learning
Lecture 8 (Oct 21, 2003)
Preparation help: Bo Han

Support Vector Machines

Last week, we discussed the construction of support vector machines on linearly separable training data. This is a very strong assumption that is unrealistic in most real-life applications.

Question: What should we do if the training data set is not linearly separable?

[Figure: training data in which the positive and negative examples overlap, so no hyperplane separates them perfectly]

Solution: Introduce slack variables $\xi_i$, $i = 1, 2, \dots, N$, to relax the constraint $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1$ to

$y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$

Ideally, one would prefer all slack variables to be zero, which would correspond to the linearly separable case. Therefore, the optimization problem for construction of an SVM on linearly nonseparable data is defined as: find $\mathbf{w}$ and $b$ that

minimize $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$

subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, for all $i$,

where $C > 0$ is an appropriately selected parameter. The additional term $C\sum_i \xi_i$ forces all slack variables to be as close to zero as possible.

Dual problem: As in the linearly separable case, this optimization problem can be converted to its dual problem: find $\boldsymbol{\alpha}$ that

maximizes $\sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \mathbf{x}_i^T\mathbf{x}_j$

subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$, $0 \le \alpha_i \le C$, for all $i$.

NOTE: The consequence of introducing the parameter C is that it constrains the range of acceptable values of the Lagrange multipliers $\alpha_i$. The most appropriate choice of C depends on the specific data set available.

SVM: Generalized solution

Problem: Support vector machines represented by a linear function f(x) (i.e. a separating hyperplane) have very limited representational power. As such, they would not be very useful in practical classification problems.

[Figure: data in the (X1, X2) plane where the positive examples are surrounded by negative ones, so no straight line can separate the classes]

Good News: With a slight modification, SVMs can solve highly nonlinear classification problems!!

Justification: Cover's Theorem. Suppose that data set D is nonlinearly separable in the original attribute space. The attribute space can be transformed into a new attribute space in which D is linearly separable!

Caveat: Cover's Theorem only proves the existence of a transformed attribute space that could solve the nonlinear problem. It does not provide a guideline for the construction of the attribute transformation!

Example 1. XOR problem

[Figure: XOR data in the (X1, X2) plane; positive examples occupy two opposite quadrants and negative examples the other two]

By constructing a new attribute $x_1' = x_1 x_2$, the XOR problem becomes linearly separable in the new attribute $x_1'$ (see the sketch after this section).

Example 2. Taylor expansion. The value of a multidimensional function F(x) at point x can be approximated as

$F(\mathbf{x}) \approx F(\mathbf{x}_0) + \nabla F(\mathbf{x}_0)^T(\mathbf{x}-\mathbf{x}_0) + \frac{1}{2}(\mathbf{x}-\mathbf{x}_0)^T \nabla^2 F(\mathbf{x}_0)(\mathbf{x}-\mathbf{x}_0) + O(\|\mathbf{x}-\mathbf{x}_0\|^3).$

Therefore, F(x) can be considered as a linear combination of complex attributes derived from the original ones,

$F(\mathbf{x}) \approx F(\mathbf{x}_0) + \sum_{i=1}^{m} a_i x_i + \sum_{i,j=1}^{m} a_{ij} x_i x_j + \sum_{i,j,k=1}^{m} a_{ijk} x_i x_j x_k + O(\|\mathbf{x}-\mathbf{x}_0\|^3),$

where the products $x_i x_j$ and $x_i x_j x_k$ act as new attributes.

Example 3. Second-order monomials derived from the original two-dimensional attribute space:

$(x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2).$

Example 4. Fifth-order monomials derived from the original 256-dimensional attribute space. There are $\binom{256+5-1}{5} \approx 10^{10}$ such monomials, which is an extremely high-dimensional attribute space!!

SVM and the curse of dimensionality: If the original attribute space is transformed into a very high-dimensional space, the likelihood of being able to solve the nonlinear classification problem increases. However, one is then likely to quickly encounter the curse-of-dimensionality problem. The strength of SVM lies in the theoretical justification that margin maximization is an effective mechanism for alleviating the curse of dimensionality (i.e. the SVM is the simplest classifier that solves the given classification problem). Therefore, SVMs are able to successfully solve classification problems with extremely high attribute dimensionality!!
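To make Example 1 (the XOR problem) concrete, here is a minimal sketch, not part of the original lecture, that assumes NumPy and scikit-learn are available. It builds the derived attribute x1' = x1*x2 and fits a soft-margin linear SVM, with the constant C playing the role of the slack-penalty parameter from the formulation above.

```python
import numpy as np
from sklearn.svm import SVC

# XOR data: not linearly separable in the original (x1, x2) space.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1])

# Derived attribute x1' = x1 * x2; in the space (x1, x2, x1*x2)
# the four points become linearly separable.
X_new = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# Soft-margin linear SVM; C is the slack-penalty constant.
clf = SVC(kernel="linear", C=10.0).fit(X_new, y)

print(clf.predict(X_new))          # -> [ 1 -1 -1  1], all four points correct
print(clf.coef_, clf.intercept_)   # hyperplane weight is concentrated on x1*x2
```

The same effect is what the monomial transformations of Examples 3 and 4 achieve in higher dimensions, and it motivates the kernel trick discussed next.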
SVM solution of classification:

Denote by $\Phi: \mathbb{R}^M \to F$ a mapping from the original M-dimensional attribute space to the high-dimensional attribute space F. By solving the following dual problem: find $\boldsymbol{\alpha}$ that

maximizes $\sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}_j)$

subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$, $0 \le \alpha_i \le C$, for all $i$,

the resulting SVM is of the form

$f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b = \sum_{i=1}^{N} \alpha_i y_i \Phi(\mathbf{x}_i)^T\Phi(\mathbf{x}) + b.$

Practical Problem: Although SVMs are successful in dealing with high-dimensional attribute spaces, the fact that SVM training scales linearly with the number of attributes, together with the limited memory space, could greatly limit the choice of the mapping $\Phi$.

Solution: the Kernel Trick

The kernel trick allows computing scalar products in the transformed space (e.g. $\Phi(\mathbf{x}_i)^T\Phi(\mathbf{x})$) directly in the original attribute space. It follows from Mercer's Theorem: there is a class of mappings $\Phi$ with the property

$\Phi(\mathbf{x})^T\Phi(\mathbf{y}) = K(\mathbf{x}, \mathbf{y}),$

where K is the corresponding kernel function.

Examples of kernel functions:
• Gaussian kernel: $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x}-\mathbf{y}\|^2/(2A)}$, where A is a constant
• Polynomial kernel: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^B$, where B is a constant

By introducing the kernel trick, the dual problem becomes: find $\boldsymbol{\alpha}$ that

maximizes $\sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$, $0 \le \alpha_i \le C$, for all $i$,

and the resulting SVM is

$f(\mathbf{x}) = \mathbf{w}^T\Phi(\mathbf{x}) + b = \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$

(a numerical sketch of this form is given at the end of this section).

Some practical issues with SVM

Modeling choices: When using one of the available SVM software packages or toolboxes, a user should choose (1) the kernel function (e.g. the Gaussian kernel) and its parameter(s) (e.g. the constant A), and (2) the constant C that controls the slack variables. Several choices should be examined on a validation set in order to find the best SVM.

Training speed: SVM training does not scale well with the size of the training data (it scales as O(N^3)). There are several solutions that offer a speed-up of the original SVM algorithm:
o chunking: start with a subset of D, build an SVM, apply it to all the data, add "problematic" data points to the training subset, remove the "nice" points, and repeat;
o decomposition: similar to chunking, but the size of the subset is kept constant;
o sequential minimal optimization (SMO): an extreme version of decomposition in which only 2 data points are used in each iteration.

Beyond binary classification: SVM-based solutions exist for problems other than binary classification:
o multi-class classification problems
o SVM for regression
o kernel PCA
o kernel Fisher discriminant
o clustering
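The sketch below, again an illustration rather than part of the lecture (the circular toy data, the value gamma = 0.5, and the use of scikit-learn are all assumptions), fits a Gaussian-kernel SVM and then rebuilds its decision function directly from the support vectors, the coefficients alpha_i*y_i, and the kernel, matching the form f(x) = sum_i alpha_i y_i K(x_i, x) + b derived above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy nonlinear problem: points inside the unit circle are +1, outside are -1.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

# Soft-margin SVM with a Gaussian (RBF) kernel; gamma corresponds to 1/(2A).
gamma = 0.5
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

def gaussian_kernel(A, B, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

# f(x) = sum_i alpha_i y_i K(x_i, x) + b, using only the support vectors
# (alpha_i = 0 for all other points); dual_coef_ stores alpha_i * y_i.
K = gaussian_kernel(clf.support_vectors_, X, gamma)
f_manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

print("matches library decision function:",
      np.allclose(f_manual, clf.decision_function(X)))
print("training accuracy:", np.mean(np.sign(f_manual) == y))
```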
Clustering

One of the most popular unsupervised learning problems.

Broad Definition: Grouping a collection of objects into subsets (or clusters) such that the objects assigned to the same cluster are more closely related to each other than to objects assigned to different clusters.

Examples of clustering applications:
• Marketing: discover distinctive customer groups and develop targeted marketing programs
• Image segmentation: identification of areas with similar land use from satellite data
• Portfolio optimization: discover groups of stock time series with different behaviors
• Gene expression data analysis: discover genes with similar expression profiles
• Dimensionality reduction: represent each object by its cluster membership (instead of by its original attributes)

[Figure: two-dimensional scatter plot showing two groups of points labeled Cluster 1 and Cluster 2]

Types of data suitable for clustering
• Standard data matrix representation: D = {x_i, i = 1, ..., N}, where x_i is an M-dimensional attribute vector and N is the number of data points.
• Proximity (or similarity) matrix: D = {d_ij, i, j = 1, ..., N}, where d_ij is the "similarity" between data points i and j. Most often the proximity matrix is symmetric, d_ij = d_ji, and the triangle inequality is satisfied, d_ij <= d_ik + d_kj. Sometimes the triangle inequality is not satisfied (e.g. when human judgment is used to compare the similarity between objects).

Types of clustering algorithms
• Partitioning algorithms (e.g. K-means clustering)
• Hierarchical algorithms (create a hierarchical decomposition of the objects)
• Probabilistic algorithms (assume that the clusters follow some probability distribution)

Similarity measures
The choice of similarity measure is critical for the success of clustering. It should be selected to conform to the user's perception of the similarity between objects.

Let us assume the data is available in the standard data matrix format. The most popular choices for the similarity (or distance) $d(\mathbf{x}_i, \mathbf{x}_j)$ between objects $\mathbf{x}_i$ and $\mathbf{x}_j$ are listed below (a short code sketch follows the list):

• Euclidean distance: $d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{M} (x_{ik} - x_{jk})^2}$

• Weighted Euclidean distance: $d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{\sum_{k=1}^{M} w_k (x_{ik} - x_{jk})^2}$

• Minkowski distance: $d(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{M} |x_{ik} - x_{jk}|^R \right)^{1/R}$

• Correlation coefficient: $d(\mathbf{x}_i, \mathbf{x}_j) = \dfrac{\sum_{k=1}^{M} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{M} (x_{ik} - \bar{x}_i)^2 \sum_{k=1}^{M} (x_{jk} - \bar{x}_j)^2}}$, where $\bar{x}_i = \frac{1}{M}\sum_{k=1}^{M} x_{ik}$.
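A minimal NumPy sketch of the similarity measures above; the two example vectors and the weight vector are arbitrary and only illustrate the formulas.

```python
import numpy as np

def euclidean(xi, xj):
    """Euclidean distance between two attribute vectors."""
    return np.sqrt(np.sum((xi - xj) ** 2))

def weighted_euclidean(xi, xj, w):
    """Euclidean distance with per-attribute weights w_k >= 0."""
    return np.sqrt(np.sum(w * (xi - xj) ** 2))

def minkowski(xi, xj, R):
    """Minkowski distance of order R (R=2 gives Euclidean, R=1 Manhattan)."""
    return np.sum(np.abs(xi - xj) ** R) ** (1.0 / R)

def correlation(xi, xj):
    """Correlation coefficient of the two attribute vectors
    (a similarity: +1 means perfectly correlated profiles)."""
    ci, cj = xi - xi.mean(), xj - xj.mean()
    return (ci @ cj) / np.sqrt((ci @ ci) * (cj @ cj))

# Arbitrary example objects with M = 4 attributes.
xi = np.array([1.0, 2.0, 3.0, 4.0])
xj = np.array([2.0, 2.5, 2.0, 5.0])
w  = np.array([1.0, 0.5, 0.5, 1.0])

print(euclidean(xi, xj), weighted_euclidean(xi, xj, w),
      minkowski(xi, xj, R=1), correlation(xi, xj))
```

With any of these measures, the full N x N proximity matrix can be filled in and handed to a partitioning or hierarchical clustering algorithm.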