Assignment 2
Introduction to Machine Learning (ELL784)
Anshul Thakur, 2015EEZ8076
March 2016

1 Part 1

We are provided a personalised input file that contains 3000 labeled data points, with 25 features each. The file contains 3000 rows, each row corresponding to one data point. Each row has 26 comma-separated values; the first 25 are the values of the features, and the last is the class label for that data point (there are 10 classes, denoted by the labels 0 to 9). Each data point is in fact a low-dimensional representation of an image.

• Learn an SVM classifier for these images, using just the given features, and thereby assess the usefulness of the different features.

First, we visualize the data in 2 dimensions using MATLAB utilities by computing various similarity measures. Scatter plots for the various similarity measures are shown below:

Figure 1: City Block Distance Metric
Figure 2: Standardized Euclidean Distance Metric
Figure 3: Mahalanobis Distance Metric
Figure 4: Cosine Distance Metric

For the remaining discussion, the Standardized Euclidean metric is used, unless explicitly noted.

1.1 Binary Classification

This problem was approached from two angles:

• One vs All: Randomly choose any one class as the target class T, and the rest as NotT.
• One vs One: Randomly choose any two classes and filter the data set for those two classes only. Train an SVM for them.

In all cases, C-SVM was used. Linear, Polynomial and RBF kernels were used and compared.

1.1.1 One vs All

Here, one class was randomly chosen from [0, 9] as the target class T, and the data set was relabeled as a sequence of indicator variables. C was set to 1 initially.
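The one-vs-all relabeling and the 2100/900 train/test split described above can be sketched as follows. This is an illustrative Python sketch, not the original MATLAB/libsvm code; the function names and the toy data are hypothetical.

```python
# Sketch: one-vs-all indicator relabeling and a deterministic train/test
# split (the report uses 2100 training and 900 test points out of 3000).

def one_vs_all_labels(labels, target):
    """Map each label to 1 if it equals the target class T, else 0 (NotT)."""
    return [1 if y == target else 0 for y in labels]

def train_test_split(points, labels, n_train):
    """Simple deterministic split: first n_train points form the training set."""
    return (points[:n_train], labels[:n_train],
            points[n_train:], labels[n_train:])

# Toy example with hypothetical data: 6 points, original labels in 0..9.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = [1, 3, 1, 7, 1, 0]

y_bin = one_vs_all_labels(y, target=1)        # indicator labels for T = 1
Xtr, ytr, Xte, yte = train_test_split(X, y_bin, n_train=4)
```

In practice one would shuffle before splitting; the report's actual split is done on the provided data file.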
One scatter plot for the case when T was chosen as 1 is shown below:

Figure 5: Full Data Set: Label 1 as target.

For a Linear SVM, the following statistics were obtained for this run:

Accuracy = 99.1111% (892/900) (classification)
Model Parameters: [0 0 3.0000 0.0400 0]
    nr_class:   2
    totalSV:    96
    rho:        -5.3395
    Label:      [0;1]
    sv_indices: [96x1 double]
    ProbA:      -1.8190
    ProbB:      0.0315
    nSV:        [51;45]
    sv_coef:    [96x1 double]
    SVs:        [96x25 double]

96 support vectors were found for the training set of 2100 data points, of which 45 support vectors were for classifying as T, and 51 as NotT. A scatter plot as shown below was obtained. The points originally labeled T are in blue and the remaining in red; the results of applying the model are shown as circles, where a black circle implies T and a green circle implies NotT.

Figure 6: Test Set: Label 1 as target.

For a Polynomial Kernel of degree 2, gamma 0.008, c 1, the following statistics were obtained in a sample run:

Accuracy = 99% (891/900) (classification)
Model Parameters: [0 1.0000 2.0000 0.0080 0]
    nr_class:   2
    totalSV:    535
    rho:        -0.9110
    Label:      [0;1]
    sv_indices: [535x1 double]
    ProbA:      -17.3957
    ProbB:      12.3468
    nSV:        [287;248]
    sv_coef:    [535x1 double]
    SVs:        [535x25 double]

For a Polynomial Kernel of degree 3, gamma 0.008, c 1, the following statistics were obtained in a sample run:

Accuracy = 99.3333% (894/900) (classification)
Model Parameters: [0 1.0000 3.0000 0.0080 0]
    nr_class:   2
    totalSV:    700
    rho:        -0.9978
    Label:      [0;1]
    sv_indices: [700x1 double]
    ProbA:      -220.4069
    ProbB:      215.0788
    nSV:        [473;227]
    sv_coef:    [700x1 double]
    SVs:        [700x25 double]

For a Radial Basis Function Kernel of gamma 0.008, c 1, the following statistics were obtained in a sample run:

Accuracy = 99.2222% (893/900) (classification)
Model Parameters: [0 2.0000 3.0000 0.0080 0]
    nr_class:   2
    totalSV:    232
    rho:        -2.8164
    Label:      [0;1]
    sv_indices: [232x1 double]
    ProbA:      -3.7329
    ProbB:      0.1077
    nSV:        [122;110]
    sv_coef:    [232x1 double]
    SVs:        [232x25 double]

Most of the kernels give similar performance over the data set for a random but reasonable choice of parameters. However, the linear model requires the fewest support vectors for this computation, followed by the RBF kernel. Note that the accuracy varied across runs; hence, a cross-validation approach was chosen. Since libsvm does not return a model when run in cross-validation mode, k-fold cross-validation for parameter tuning was implemented on our own. For the linear kernel, the value of C was varied from 0.001 to 2 in multiples of 3. Initially, cross-validation was used simply to report the accuracy of the model parameters discussed above over the entire range. The following values were obtained:

Kernel Type                  CV Accuracy     Test Accuracy
Linear Kernel                98.761905%      99.3333%
Polynomial Kernel (d=2)      98.952381%      99.5556%
Polynomial Kernel (d=3)      99.047619%      99.1111%
RBF                          98.857143%      99.2222%

Next, only the first 10 features were used for this classification problem, for all the kernel functions discussed above. A comparative table showing the accuracy estimates on test data for both cases is shown below.

Kernel Type                  Accuracy (Features=25)   Accuracy (Features=10)
Linear Kernel                99.3333%                 97.3333%
Polynomial Kernel (d=2)      99.5556%                 95.7778%
Polynomial Kernel (d=3)      99.1111%                 92%
RBF                          99.2222%                 98.4444%

Thus, the accuracy of prediction drops significantly for most kernel functions. The least affected is the Radial Basis kernel.

1.1.2 One vs One

In this interpretation of the problem statement, the data was filtered to contain only two labels, and the remaining data was removed from further evaluation. Other than that, a similar procedure was adopted for analysis. Three pairs of classes were chosen: (4, 6), (9, 7), (5, 0).
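The hand-rolled k-fold cross-validation mentioned above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the report's MATLAB implementation; `train_and_score` is a hypothetical stand-in for a call that trains an SVM on the training folds and returns the accuracy on the held-out fold.

```python
# Sketch: manual k-fold cross-validation for parameter tuning, needed
# because libsvm's built-in CV mode returns only an accuracy, not a model.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, train_and_score):
    """Average the held-out accuracy over the k folds."""
    folds = k_fold_indices(n, k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        scores.append(train_and_score(train, held_out))
    return sum(scores) / k
```

Repeating this for each candidate C (here, 0.001 to 2 in multiples of 3) and keeping the best average score gives the tuned parameter.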
The values for both the 10-feature and 25-feature computations are tabulated below:

Kernel Type                  Pair     Accuracy (Features=25)   Accuracy (Features=10)
Linear Kernel                (4, 6)   97.2222%                 95.2222%
                             (9, 7)   94.8889%                 91.4444%
                             (5, 0)   95.1111%                 91.2222%
Polynomial Kernel (d=2)      (4, 6)   97.3333%                 93.7778%
                             (9, 7)   97.1111%                 92.7778%
                             (5, 0)   96.6667%                 95.3333%
Polynomial Kernel (d=3)      (4, 6)   96.7778%                 94.7778%
                             (9, 7)   96.3333%                 92.5556%
                             (5, 0)   97%                      95.2222%
RBF                          (4, 6)   97.8889%                 95.3333%
                             (9, 7)   95.8889%                 94%
                             (5, 0)   97.8889%                 95.6667%

It is observed that the polynomial kernel of degree 3 maintains more or less the same accuracy across the chosen label pairs. This would be due to the increased complexity of the model, and also the higher number of support vectors needed. Further, a marked difference in prediction accuracy is seen between the 10-feature and full-feature trained models.

The parameter auto-tuning utility provided with libsvm was used to see the variation of misclassification scores as gamma and C were varied. The following curves were obtained, which easily highlight over- and underfitting cases, and also give a coarse sense of where the optimal hyperparameters might lie.

Figure 7: CV Accuracy Curves for RBF Kernel

1.2 Multiclass Classification

In this part, multi-class classification is done. libsvm supports multi-class classification, and the same is used here following the procedure above. For the same parameter values as above, the following accuracy statistics were obtained:

Kernel Type                  Accuracy (Features=25)   Accuracy (Features=10)
Linear Kernel                89.4444%                 84.4444%
Polynomial Kernel (d=2)      86.6667%                 60.6667%
Polynomial Kernel (d=3)      51.4444%                 58.1111%
RBF                          90.5556%                 82.5556%

For all further analysis, the Radial Basis Function kernel is used, unless otherwise stated. The accuracy with these parameters is quite poor; hence, a cross-validation approach was used for parameter tuning. 3-fold cross-validation was used.
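The coarse (C, gamma) search performed by libsvm's tuning utility can be sketched as a log-spaced grid search. This is an illustrative Python sketch, not libsvm's actual grid script; `cv_score` is a hypothetical stand-in for a cross-validated accuracy evaluation at the given parameters.

```python
# Sketch: coarse grid search over (C, gamma) on a logarithmic scale,
# keeping the pair with the best cross-validation score.

def log_grid(lo_exp, hi_exp, base=2.0):
    """Log-spaced candidate values base**lo_exp .. base**hi_exp."""
    return [base ** e for e in range(lo_exp, hi_exp + 1)]

def grid_search(cv_score, c_grid, g_grid):
    """Return (best_C, best_gamma, best_score) over the full grid."""
    best = (None, None, float("-inf"))
    for c in c_grid:
        for g in g_grid:
            score = cv_score(c, g)
            if score > best[2]:
                best = (c, g, score)
    return best
```

A fine search around the coarse optimum would then refine the parameters further; plotting `cv_score` over the grid gives curves like those in Figure 7.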
The range of parameters over which the search was done is as follows:

C = [0.1, 5]
gamma = [0.001, 0.9]

Using the best cross-validation score, the best parameters found were:

C = 2.82222
gamma = 0.01
Classification Accuracy (Cross Validation) = 91.2857
Test Set Accuracy = 93.2222

A similar score was obtained for a Linear Kernel. Another version of the same one-vs-rest approach was implemented, which seemed to give worse results for the same metric:

C = 0.1
gamma = 0.1
Classification Accuracy (Cross Validation) = 68.9048
Test Set Accuracy = 59

As a result, a one-vs-one multi-classification approach was also taken. In this approach, a total of 10C2, i.e. 45, models were trained and the cumulative score for each classification stored. At the end of the run, the class getting the highest vote was chosen as the real class of the data point. Another approach, keeping cumulative probability scores, was also tried but later discarded because of the following two issues:

• Many classes tended to give the same cumulative score; in that case, we needed to consult the maximal vote score anyway.
• For a test set where the actual labels are not available (as would be the case in Part B), probability values cannot be assigned to each class, since the one-vs-one filtering is then not possible. Consequently, on a blind data set, the prediction of values becomes more of a one-vs-all classification problem.

The voting approach gave better results over a wide range of values for C and gamma. Interpreting C as the inverse of the regularization parameter, a large value of C implies a model with less penalty on the coefficient values. Further, taking gamma as the inverse of the variance in the RBF kernel, a larger value of gamma implies a smaller spread of the RBF kernels, i.e. a stricter model.
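The one-vs-one voting scheme described above can be sketched as follows. This is a hedged Python sketch, not the report's MATLAB code; `pairwise_predict` is a hypothetical stand-in for a trained pairwise SVM that returns one of its two classes for a test point.

```python
from itertools import combinations

# Sketch: one-vs-one voting. One model is trained per pair of classes
# (10C2 = 45 models for 10 classes); each model casts one vote per test
# point, and the class with the most votes wins.

def ovo_vote(classes, pairwise_predict, x):
    """Return the class with the maximal vote count for test point x.

    Ties are broken by class order here; the report breaks cumulative-score
    ties by consulting the maximal vote score."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_predict(a, b, x)] += 1
    return max(classes, key=lambda c: votes[c])

# 10 classes give 10C2 = 45 pairwise models:
assert len(list(combinations(range(10), 2))) == 45
```

On a blind data set this scheme still works, since each pairwise model votes without needing the true label, which is exactly why it was preferred over the cumulative-probability variant.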
For a coarse search on parameters over the range quoted above, the accuracy scores are:

c = 1.68431
g = 0.0422
Cross Validation Score = 99.9048
Self Test: 100
Test Set: 100

The model seems to give excellent scores on these data sets. It is noted, however, that the final collective model set, which contains 10 models in the one-vs-rest method and 45 in the one-vs-one multi-class method, has rather poor individual scores on each individual set as compared to the case when one-vs-all was trained on filtered data sets in part 2 of this problem. Thus, the overall parameter tuning makes the individual models worse, but strengthens the accuracy of the collective model.

2 Part B

In this section, the data set was of a larger size, while the problem to be attempted was exactly as in the previous section. Consequently, the same two approaches, One-vs-All and One-vs-One multi-classification, were used.

For the One-vs-All classification method, the following scores were obtained:

c = 1.68431
g = 0.0422
Cross Validation Score = 99.9048
Self Test: 100
Test Set: 96.7999999996

While for the One-vs-One multi-classification method, the following scores were obtained:

c = 1
g = 0.0225
Cross Validation Score = 100
Self Test: 100
Test Set: 96.7999999996

To improve results, 10-fold cross-validation was employed, but the results did not vary. As is also cited in the libsvm documentation, the claim seems to be valid that there is not much performance gain in using a one-vs-one multi-classification method over one-vs-rest.