CS267 TOPICS IN DATABASE SYSTEMS
PROF. T.Y. LIN
SUPPORT VECTOR MACHINE (SVM)
PARIN SHAH 007332832

SUPPORT VECTOR MACHINE

INTRODUCTION ::

The number of documents on the World Wide Web (Internet) is ever increasing. Classifying each of these documents by hand is not feasible, and a collection of this size cannot be managed without structure. We shall therefore discuss a few methods of organizing the data into a proper structure, and we shall look into the details of classifying new data into categories that already exist.

Support vector machines (SVMs) are a set of related supervised learning methods that can be used for:

Text classification.
Data analysis.
Pattern recognition.
Regression analysis.
Bio-informatics.
Signature and handwriting recognition.
E-mail spam categorization.

Supervised learning is the machine learning task of deducing a category from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and then predicts the output category for a given input.

For example, a teacher teaches a student to identify apples and oranges by describing some of their features. The next time the student sees an apple or an orange, he can easily classify it based on what he learned from the teacher. This is supervised learning. The student can identify the object only if it is an apple or an orange; if the object is a grape, the student cannot identify it.

A sparse matrix is a matrix in which many of the values are 0. Processing all of these zeros is time consuming and uses a lot of resources without contributing to the result. Such a matrix is therefore compressed into sparse data, which contains only the non-zero values of the sparse matrix. It is usually a 2-dimensional array that holds each non-zero value together with its position in the original matrix.
Storing the data in this sparse form compresses it easily, and this compression almost always results in significantly less storage use.

In my project I have used the Support Vector Machine (SVM) for text classification: each new input is classified into one of the given categories. SVM is not used to cluster data into new categories; it classifies data into categories that already exist.

UNDERSTANDING SUPPORT VECTOR MACHINE (SVM) FOR LINEARLY SEPARABLE DATA

Consider each document to be a single dot in the figure, where the color of a dot specifies its category. Here we have documents of two categories, and we have to find the boundary that separates them.

The margin of a linear classifier is the width by which the boundary can be widened before it hits a data point. The safest line to pick is the one with the largest margin between the two data sets. The data points that lie on the margin are known as support vectors.

The next step is to find the hyperplane that best separates the two categories. SVM does this by taking a set of points and separating them using mathematical formulas. From these we can find the positive and the negative hyperplane. The formulas for the hyperplanes are:

(w · x) + b = +1   (positive labels)
(w · x) + b = -1   (negative labels)
(w · x) + b =  0   (hyperplane)

From the equations above, using linear algebra, we can find the values of w and b. Thus we have a model that contains the solution for w and b, with the margin calculated as follows:

Margin = 2/√(w · w)

In SVM, this model is used to classify new data: with the solution for w and b, a new data point can be assigned to a category. The following figure demonstrates the margin and support vectors for linearly separable data.
Maximum margin and support vectors for the given data sets are shown in the figure.

UNDERSTANDING SUPPORT VECTOR MACHINE (SVM) FOR NON-LINEARLY SEPARABLE DATA

In the non-linearly separable case, the data lie in an input space where they cannot be separated by a linear hyperplane. To separate the data linearly, we have to map the points to a feature space using a kernel method. After the data are separated in the feature space, we can map the points back to the input space, where the boundary appears as a curved hyperplane. The following figure demonstrates the data flow of SVM.

In reality, you will find that most data sets are not so simple and well behaved. There will be some points on the wrong side of the boundary, points that lie far from their class, or points that are mixed together in a spiral or checkered pattern. Researchers have looked into these problems. To handle the few points that fall in the wrong class, SVM minimizes the following expression to create what is called a soft-margin hyperplane. In its standard form:

minimize (1/2)(w · w) + C Σ ξi
subject to yi((w · xi) + b) ≥ 1 - ξi and ξi ≥ 0 for every training point i

Here yi is the label of training point xi and ξi measures how far that point violates the margin. A higher value of C penalizes these violations more heavily and gives a narrower margin, whereas a lower value of C tolerates more violations and gives a wider margin.

TYPES OF KERNEL ::

Computing with points in the feature space can be very costly, because the feature space is typically very high-dimensional and can even be infinite-dimensional. The kernel function is used to reduce this cost. The reason is that the data points appear only in dot products, and the kernel function computes the inner product of two points in the feature space directly from their coordinates in the input space. So there is no need to map the points explicitly into the feature space. Using the kernel function, we can work with these inner products directly and still find the equivalent hyperplane. The kernel functions being developed for SVM are still a research topic.
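The point that a kernel returns the feature-space inner product without an explicit mapping can be checked on a small case. The sketch below assumes a 2-dimensional input and the kernel K(x,y) = (x · y)^2, chosen only for illustration; its explicit feature map is Φ(x) = (x1^2, √2·x1·x2, x2^2). The function names are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Kernel evaluated directly in the 2-D input space: K(x,y) = (x.y)^2.
   Illustrative choice of kernel, not taken from the report. */
double poly2_kernel(const double x[2], const double y[2])
{
    double d = x[0] * y[0] + x[1] * y[1];
    return d * d;
}

/* Explicit feature map belonging to that kernel:
   phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). */
void poly2_feature_map(const double x[2], double out[3])
{
    assert(x != NULL && out != NULL);
    out[0] = x[0] * x[0];
    out[1] = sqrt(2.0) * x[0] * x[1];
    out[2] = x[1] * x[1];
}

/* The costly route: map both points, then take the inner product
   in the 3-D feature space. */
double poly2_explicit(const double x[2], const double y[2])
{
    double px[3], py[3];
    poly2_feature_map(x, px);
    poly2_feature_map(y, py);
    return px[0] * py[0] + px[1] * py[1] + px[2] * py[2];
}
```

For x = (1, 2) and y = (3, 4), x · y = 11, so poly2_kernel returns 121, and poly2_explicit gives the same value up to rounding (9 + 48 + 64) while doing more work. The saving grows with the dimension of the feature space.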
No single kernel has been found that is universal for all kinds of data. Anybody can develop their own kernel depending upon requirements. The following are some basic types of kernel, given in their commonly used forms:

1.) Polynomial kernel with degree d: K(x,y) = (x' * y + 1)^d
2.) Radial basis function kernel with width σ: K(x,y) = exp(-||x - y||^2 / (2σ^2)). Closely related to radial basis function neural networks.
3.) Sigmoid kernel with parameters κ and θ: K(x,y) = tanh(κ * x' * y + θ)
4.) Linear kernel: K(x,y) = x' * y

STRENGTH OF KERNELS ::

The kernel is the trickiest and most important part of using SVM, because it creates the kernel matrix, which summarizes all the data. In practice, a low-degree polynomial kernel or a radial basis function kernel with a reasonable width is a good initial try for most applications. The linear kernel is considered the most important choice for text classification, because the feature dimension is already high enough. There is much ongoing research on estimating the kernel matrix.

SPARSE MATRIX AND SPARSE DATA ::

A sparse matrix is a matrix in which many of the values are 0. Processing all of these zeros is time consuming and uses a lot of resources without contributing to the result. Such a matrix is therefore compressed into sparse data, which contains only the non-zero values of the sparse matrix. It is usually a 2-dimensional array that holds each non-zero value together with its position in the original matrix. Storing the data in this sparse form almost always results in significantly less storage use.

In SVM, computation slows down when the training set contains many values whose term frequency is zero, because a lot of time is wasted processing these values. SVM algorithms speed up tremendously if the data is sparse, i.e. it contains many values that are 0. The reason is that SVM computes many dot products, and with sparse data these iterate only over the non-zero values. By using only the sparse data during its computation, SVM uses less memory and storage, and so the cost is also reduced.

Storing a sparse matrix

The simplest data structure for a matrix is a two-dimensional array. Each entry in the array represents an element a(i,j) of the matrix and can be accessed by the two indices i and j. For an m×n matrix, enough memory to store at least m×n entries is needed. Substantial reductions in memory can be realized by storing only the non-zero entries, which can yield huge savings compared to the simple approach. Different data structures can be used depending on the number and distribution of the non-zero entries.

Formats can be divided into two groups:

Those supporting efficient modification.
Those supporting efficient matrix operations.

The efficient modification group includes DOK, LIL, and COO and is typically used to construct the matrix. After the matrix is constructed, it is typically converted to a format such as CSR or CSC, which is more efficient for matrix operations.

Dictionary of keys (DOK)
DOK represents non-zero values as a dictionary mapping (row, column) tuples to values. It is a good method for constructing a sparse array, but poor for iterating over non-zero values in sorted order.

List of lists (LIL)
LIL stores one list per row, where each entry stores a column index and a value. Typically, these entries are kept sorted by column index for faster lookup.

Coordinate list (COO)
COO stores a list of (row, column, value) tuples. The entries are sorted (by row index, then by column index) to improve random access times.

Yale format
The Yale sparse matrix format stores an initial sparse m×n matrix M in row form, using three one-dimensional arrays. NNZ denotes the number of non-zero entries of M.
Array A has length NNZ and holds all the non-zero entries of M, in left-to-right, top-to-bottom (row-major) order.
Array IA has length m + 1. IA(i) contains the index in A of the first non-zero element of row i. Row i of the original matrix extends from A(IA(i)) to A(IA(i+1)-1), i.e. from the start of one row to the last index before the start of the next.
Array JA has length NNZ and contains the column index of each element of A.

As an example, take the following matrix and compute the three arrays for it:

[ 1 2 0 0 ]
[ 0 3 9 0 ]
[ 0 1 4 0 ]

Computing it, we get A = [ 1 2 3 9 1 4 ], IA = [ 0 2 4 6 ] and JA = [ 0 1 1 2 1 2 ].

ADVANTAGES AND DISADVANTAGES OF SUPPORT VECTOR MACHINE (SVM)

ADVANTAGES:

Support Vector Machines are very effective in high-dimensional spaces.
They are still found to be effective when the number of dimensions is greater than the number of samples.
They are memory efficient, because they use only a subset of the training points (the support vectors) as the deciding factors for classification.
They are versatile: different kernels can be defined for different decision functions, as long as they give correct results. Depending upon our requirements we can define our own kernel.
They are useful for small training samples.
Non-traditional data such as strings and trees can be used as input to SVM instead of feature vectors.

DISADVANTAGES:

If the number of features is much greater than the number of samples, the method is likely to give poor performance.
SVMs do not directly provide probability estimates, so these must be calculated using indirect techniques.
The user must select an appropriate kernel for the project according to its requirements.

DESCRIPTION OF THE EXAMPLE ::

As shown in the figure, these points lie on a 1-dimensional line and cannot be separated by a linear hyperplane.
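That no single threshold separates these points can be checked with a short sketch. It assumes the four points used in the source code of this report (x = 0, 1, 2, 3 with labels +1, -1, -1, +1); the function name separable_1d is hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Returns 1 if some threshold on the line puts every +1 point on one side
   and every -1 point on the other side, and 0 otherwise.
   Hypothetical helper, written only to illustrate the claim. */
int separable_1d(const double *x, const int *label, size_t n)
{
    double min_pos = DBL_MAX, max_pos = -DBL_MAX;
    double min_neg = DBL_MAX, max_neg = -DBL_MAX;

    for (size_t i = 0; i < n; i++) {
        assert(label[i] == 1 || label[i] == -1);
        if (label[i] == 1) {
            if (x[i] < min_pos) min_pos = x[i];
            if (x[i] > max_pos) max_pos = x[i];
        } else {
            if (x[i] < min_neg) min_neg = x[i];
            if (x[i] > max_neg) max_neg = x[i];
        }
    }
    /* Separable only if the two classes occupy non-overlapping intervals. */
    return max_pos < min_neg || max_neg < min_pos;
}
```

With x = {0, 1, 2, 3} and labels {+1, -1, -1, +1}, the positive points span the interval [0, 3], which contains the negative points at 1 and 2, so the function returns 0. This is why the points have to be mapped into a higher-dimensional feature space first.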
The following steps are followed:

1.) Map the points into feature space.
2.) Use the polynomial kernel Φ(X1) = (X1, X1^2) to map the points onto the two-dimensional plane.
3.) Compute the positive, negative and zero hyperplanes.
4.) We get the support vectors and the margin value from them.

From the hyperplanes and the margin we can classify a new input data set into different classes depending upon its values.

SOURCE CODE FOR 1-DIMENSIONAL LINEAR CLASSIFIER OF DATA IN SVM USING POLYNOMIAL KERNEL

#include <stdio.h>
#include <math.h>

/* determinant of a 3x3 matrix by cofactor expansion along the first row */
static double det3(double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

int main(void)
{
    /* each row of data_set is {class, x} */
    int data_set[4][2] = {{1, 0}, {-1, 1}, {-1, 2}, {1, 3}};
    int data_set_after_kernel[4][3];
    /* rows of data_set_after_kernel used to solve for w1, w2 and b */
    const int rows[3] = {0, 1, 3};
    /* right-hand sides of the positive, negative and zero planes */
    const double offsets[3] = {1.0, -1.0, 0.0};
    double m[3][3], t[3][3], sol[3], plane[3][4][2];
    double D, w1, w2, b, margin;
    int i, j, k, l, p;

    /* apply the polynomial mapping, giving the new data set as
       class, value(x), value(pow(x,2)) */
    for (i = 0; i < 4; i++) {
        data_set_after_kernel[i][0] = data_set[i][0];
        data_set_after_kernel[i][1] = data_set[i][1];
        data_set_after_kernel[i][2] = data_set[i][1] * data_set[i][1];
    }

    printf("\n");
    for (k = 0; k < 4; k++) {
        for (l = 0; l < 3; l++)
            printf("%d \t", data_set_after_kernel[k][l]);
        printf("\n");
    }

    /* plot these points in the feature space; to find the hyperplane
       we use the equation (w.x)+b=label for rows 0, 1 and 3:
       w1x1+w2x2+b=+1
       w1x1+w2x2+b=-1
       w1x1+w2x2+b=+1
       and compute D to solve for the 3 variables w1, w2 and b
       with Cramer's rule */
    for (i = 0; i < 3; i++) {
        m[i][0] = data_set_after_kernel[rows[i]][1];
        m[i][1] = data_set_after_kernel[rows[i]][2];
        m[i][2] = 1.0;
    }
    D = det3(m);

    /* replace column k of m with the class labels to get variable k */
    for (k = 0; k < 3; k++) {
        for (i = 0; i < 3; i++)
            for (j = 0; j < 3; j++)
                t[i][j] = (j == k) ? data_set_after_kernel[rows[i]][0] : m[i][j];
        sol[k] = det3(t) / D;
    }
    w1 = sol[0];
    w2 = sol[1];
    b = sol[2];

    printf("The value of w1 is: %f \n", w1);
    printf("The value of w2 is: %f \n", w2);
    printf("The value of b is: %f \n", b);

    /* points of the positive, negative and zero planes:
       w1x1+w2x2+b=offset, so x2=(offset-b-w1x1)/w2 */
    for (p = 0; p < 3; p++) {
        for (k = 0; k < 4; k++) {
            plane[p][k][0] = data_set_after_kernel[k][1];
            plane[p][k][1] = (offsets[p] - b - w1 * data_set_after_kernel[k][1]) / w2;
        }
    }

    /* print the hyperplane points: positive, negative, then zero plane */
    for (p = 0; p < 3; p++) {
        printf("\n");
        for (k = 0; k < 4; k++) {
            for (l = 0; l < 2; l++)
                printf("%f \t", plane[p][k][l]);
            printf("\n");
        }
    }

    /* calculate the margin for this data set with the formula
       2/SQRT(w.w) */
    margin = 2 / sqrt(pow(w1, 2) + pow(w2, 2));
    printf("\n The margin for the given dataset is : %f", margin);
    return 0;
}

REFERENCES ::

1.) http://xanadu.cs.sjsu.edu/~drtylin/classes/cs267/project/tam_ngo/
2.) http://www.wikipedia.com/
3.) http://www.support-vector.net/icml-tutorial.pdf/
4.) http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf/
5.) http://en.wikipedia.org/wiki/Support_vector_machine/
6.) http://en.wikipedia.org/wiki/Sparse_data/