Assignment 2
Introduction to Machine Learning
ELL784
Anshul Thakur
2015EEZ8076
March 2016
1 Part 1
We are provided a personalised input file that contains 3000 labeled data points,
with 25 features each. This file contains 3000 rows, with each row corresponding
to a data point. Each row has 26 comma-separated values; the first 25 are the
values of the features, and the last is the class label for that data point (there
are 10 classes, denoted by the labels 0 to 9). Each data point is actually a
low-dimensional representation of an image.
• Learn an SVM classifier for these images, using just the given features,
and thereby assess the usefulness of the different features.
First, we try to visualize the data in two dimensions using MATLAB utilities, computing various similarity measures. Scatter plots for these measures are shown below:
[Scatter plot of the whole data set]
Figure 1: City Block Distance Metric
[Scatter plot of the whole data set]
Figure 2: Standardized Euclidean Distance Metric
[Scatter plot of the whole data set]
Figure 3: Mahalanobis Distance Metric.
[Scatter plot of the whole data set]
Figure 4: Cosine Distance Metric.
For the remaining discussion, the Standardized Euclidean metric is used unless explicitly noted otherwise.
1.1 Binary Classification
This problem was approached from two angles.
• One vs All: Randomly choose any one class as the target class T, and treat the rest as NotT.
• One vs One: Randomly choose any two classes and filter the data set down to those two classes only. Train an SVM on them.
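The two filterings above can be sketched as follows (synthetic data, not the provided file; the class choices T = 1 and the pair (4, 6) are just examples):

```python
# Minimal sketch of the two data filterings described above.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 25))
y = rng.integers(0, 10, size=300)      # labels 0..9, as in the assignment

# One vs All: indicator labels -- 1 for the target class T, 0 for NotT
T = 1
y_ova = (y == T).astype(int)

# One vs One: keep only the rows belonging to the two chosen classes
a, b = 4, 6
mask = (y == a) | (y == b)
X_ovo, y_ovo = X[mask], y[mask]

print(y_ova.sum(), X_ovo.shape)
```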
In all cases, C-SVM was used; Linear, Polynomial and RBF kernels were compared.
1.1.1 One vs All
Here, one class was randomly chosen from [0, 9] as the target class T and the data set was relabelled as a sequence of indicator variables. C was initially set to 1. A scatter plot for the case when T was chosen as 1 is shown below:
[Scatter plot of the whole data set]
Figure 5: Full Data Set: Label 1 as target.
For a Linear SVM, the following statistics were obtained for this run:
Accuracy = 99.1111% (892/900) (classification)
Model Parameters: [0 0 3.0000 0.0400 0]
nr_class: 2
totalSV: 96
rho: -5.3395
Label: [0;1]
sv_indices: [96x1 double]
ProbA: -1.8190
ProbB: 0.0315
nSV: [51;45]
sv_coef: [96x1 double]
SVs: [96x25 double]
96 support vectors were found for the training data of 2100 data points, of which 45 support vectors were for classifying as T, and 51 as NotT. The scatter plot shown below was obtained. Points originally labelled T are in blue and the remaining in red. The results of applying the model are shown as circles, where a black circle implies T and a green circle implies NotT.
[Scatter plot of the classification results]
Figure 6: Test Set: Label 1 as target.
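The run above used libsvm's C-SVC from MATLAB. A hedged sketch of the same step in Python, using scikit-learn's SVC (a wrapper around the same libsvm C-SVC), is given below; the data is synthetic, so the accuracy and support-vector counts will not match the report:

```python
# Sketch of a linear one-vs-all run: train on an indicator-labelled set,
# then read off the test accuracy and per-class support vector counts
# (n_support_ mirrors libsvm's nSV field, totalSV is its sum).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 25))
# synthetic "T vs NotT" labels driven by one feature plus noise
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)
X_train, y_train = X[:280], y[:280]
X_test, y_test = X[280:], y[280:]

clf = SVC(kernel="linear", C=1).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(acc, clf.n_support_, clf.n_support_.sum())
```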
For a Polynomial Kernel of degree 2, gamma 0.008, c 1, the following
statistics were obtained in a sample run:
Accuracy = 99% (891/900) (classification)
Model Parameters: [0 1.0000 2.0000 0.0080 0]
nr_class: 2
totalSV: 535
rho: -0.9110
Label: [0;1]
sv_indices: [535x1 double]
ProbA: -17.3957
ProbB: 12.3468
nSV: [287;248]
sv_coef: [535x1 double]
SVs: [535x25 double]
For a Polynomial Kernel of degree 3, gamma 0.008, c 1, the following
statistics were obtained in a sample run:
Accuracy = 99.3333% (894/900) (classification)
Model Parameters: [0 1.0000 3.0000 0.0080 0]
nr_class: 2
totalSV: 700
rho: -0.9978
Label: [0;1]
sv_indices: [700x1 double]
ProbA: -220.4069
ProbB: 215.0788
nSV: [473;227]
sv_coef: [700x1 double]
SVs: [700x25 double]
For a Radial Basis Function Kernel of gamma 0.008, c 1, the following
statistics were obtained in a sample run:
Accuracy = 99.2222% (893/900) (classification)
Model Parameters: [0 2.0000 3.0000 0.0080 0]
nr_class: 2
totalSV: 232
rho: -2.8164
Label: [0;1]
sv_indices: [232x1 double]
ProbA: -3.7329
ProbB: 0.1077
nSV: [122;110]
sv_coef: [232x1 double]
SVs: [232x25 double]
Most of the kernels give similar performance over the data set for a random but reasonable choice of parameters. However, the linear model requires the fewest support vectors for this computation, followed by the RBF kernel.
Note that the accuracy varied across runs. Hence, a cross-validation approach was chosen. Since libsvm does not return a model when run in cross-validation mode, k-fold cross-validation for parameter tuning was implemented on our own. For the linear kernel, the value of C was varied from 0.001 to 2 in multiples of 3. Initially, cross-validation was used just to report the accuracy of the model parameters discussed above over the entire range. The following values were obtained:
Kernel Type               CV Accuracy   Test Accuracy
Linear Kernel             98.761905%    99.3333%
Polynomial Kernel (d=2)   98.952381%    99.5556%
Polynomial Kernel (d=3)   99.047619%    99.1111%
RBF                       98.857143%    99.2222%
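The hand-rolled k-fold tuning of C described above can be sketched as follows (synthetic data; sklearn's KFold and SVC stand in for our MATLAB splitting code and libsvm):

```python
# Sketch of k-fold cross-validation for tuning C on a linear kernel,
# sweeping C from 0.001 upward in multiples of 3 (as in the text) and
# keeping the value with the best mean fold accuracy.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 25))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary labels

Cs, c = [], 0.001
while c <= 2:                 # 0.001, 0.003, 0.009, ..., 0.729
    Cs.append(c)
    c *= 3

best_C, best_acc = None, -1.0
for C in Cs:
    scores = []
    for tr, va in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
        clf = SVC(kernel="linear", C=C).fit(X[tr], y[tr])
        scores.append(clf.score(X[va], y[va]))
    acc = float(np.mean(scores))
    if acc > best_acc:
        best_C, best_acc = C, acc

print(best_C, best_acc)
```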
Next, only the first 10 features were used for this classification problem, for all the kernel functions discussed above. A comparative table showing the accuracy estimates on test data for both cases is shown below.
Kernel Type               Accuracy (Features=25)   Accuracy (Features=10)
Linear Kernel             99.3333%                 97.3333%
Polynomial Kernel (d=2)   99.5556%                 95.7778%
Polynomial Kernel (d=3)   99.1111%                 92%
RBF                       99.2222%                 98.4444%
Thus, it is seen that the accuracy of prediction drops significantly for most
Kernel Functions. The least affected is the Radial Basis Kernel.
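The feature-subset comparison above amounts to training the same model twice, once on all 25 columns and once on the first 10. A minimal sketch on synthetic data (the exact percentages are not reproducible, only the procedure):

```python
# Sketch of the 25-feature vs first-10-feature comparison with an RBF C-SVC.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 25))
y = (X.sum(axis=1) > 0).astype(int)   # signal spread over all 25 features
split = 280

accs = {}
for k in (25, 10):                    # train on the first k features only
    clf = SVC(kernel="rbf", gamma=0.008, C=1).fit(X[:split, :k], y[:split])
    accs[k] = clf.score(X[split:, :k], y[split:])
print(accs)
```

Since the synthetic signal depends on all 25 features, truncating to 10 discards information, mirroring the accuracy drop reported above.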
1.1.2 One vs One
In this interpretation of the problem statement, the data was filtered to contain only two labels, and the remaining data was removed from further evaluation. Other than that, a similar procedure was adopted for the analysis. Three pairs of classes were chosen: (4, 6), (9, 7), (5, 0). The values for both the 10-feature and 25-feature computations are tabulated below:
Kernel Type               Pair     Accuracy (Features=25)   Accuracy (Features=10)
Linear Kernel             (4, 6)   97.2222%                 95.2222%
                          (9, 7)   94.8889%                 91.4444%
                          (5, 0)   95.1111%                 91.2222%
Polynomial Kernel (d=2)   (4, 6)   97.3333%                 93.7778%
                          (9, 7)   97.1111%                 92.7778%
                          (5, 0)   96.6667%                 95.3333%
Polynomial Kernel (d=3)   (4, 6)   96.7778%                 94.7778%
                          (9, 7)   96.3333%                 92.5556%
                          (5, 0)   97%                      95.2222%
RBF                       (4, 6)   97.8889%                 95.3333%
                          (9, 7)   95.8889%                 94%
                          (5, 0)   97.8889%                 95.6667%
It is observed that the polynomial kernel of degree 3 maintains more or less the same accuracy across the label pairs used. This is likely due to the increased complexity of the model and the higher number of support vectors needed. Further, a marked difference in prediction accuracy is seen between the 10-feature and full-feature trained models.
The parameter auto-tuning utility provided with libsvm was used to see how the misclassification scores vary as gamma and C are varied. The following curves were obtained, which easily highlight over- and underfitting cases, and also give a coarse sense of where the optimal hyperparameters might lie.
Figure 7: CV Accuracy Curves for RBF Kernel
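The scan behind Figure 7 is a cross-validated sweep over the (C, gamma) grid (libsvm ships this as tools/grid.py). A sketch of the same idea with sklearn's GridSearchCV on synthetic data:

```python
# Sketch of a (C, gamma) grid scan for an RBF C-SVC; the grid values here
# are illustrative, not the ones used in the report.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 25))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # a non-linear synthetic target

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 5], "gamma": [0.001, 0.01, 0.1, 0.9]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
# grid.cv_results_["mean_test_score"] holds the surface plotted in Figure 7
```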
1.2 Multiclass Classification
In this part, multi-class classification is done. libsvm supports multi-class classification natively, and that support is used here, following the procedure above. For the same parameter values as above, the following accuracy statistics were obtained:
Kernel Type               Accuracy (Features=25)   Accuracy (Features=10)
Linear Kernel             89.4444%                 84.4444%
Polynomial Kernel (d=2)   86.6667%                 60.6667%
Polynomial Kernel (d=3)   51.4444%                 58.1111%
RBF                       90.5556%                 82.5556%
For all further analysis, the Radial Basis Function kernel is used unless otherwise stated. The accuracy for these parameters is rather poor, and hence a cross-validation approach was used for parameter tuning. 3-fold cross-validation was used. The parameter search was carried out over the following ranges:
C = [0.1, 5]
gamma = [0.001, 0.9]
Using the best cross-validation score, the best parameters found were:
C=2.82222
gamma=0.01
Classification Accuracy (Cross Validation)=91.2857
Test Set Accuracy = 93.2222
A similar score was obtained for a Linear kernel.
Another version of the same one-vs-rest approach was implemented, which seemed to give worse results for the same metric:
C=0.1
gamma=0.1
Classification Accuracy (Cross Validation)=68.9048
Test Set Accuracy = 59
As a result, a one-vs-one multi-class classification approach was also taken. In this approach, a total of (10 choose 2), i.e. 45, models were trained and the cumulative score for each classification stored. At the end of the run, the class getting the highest vote was chosen as the real class of the data point. Another approach of keeping cumulative probability scores was also taken, but it was later discarded because of the following two issues:
• Many classes tended to give the same cumulative score. In that case, we needed to fall back on the maximal vote score anyway.
• For a test set where the actual labels are not available (as is the case in Part B), probability values cannot be assigned to each class, since the one-vs-one filtering is then not possible. Consequently, on a blind data set, the prediction of values becomes more of a one-vs-all classification problem.
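The voting scheme retained above can be sketched as follows (synthetic data; sklearn's SVC stands in for the libsvm calls, and the C and gamma values are the ones quoted later in this section):

```python
# Sketch of the hand-rolled one-vs-one scheme: (10 choose 2) = 45 pairwise
# models, each casting one vote per test point; the most-voted class wins.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n_classes = 10
X = rng.normal(size=(500, 25))
y = rng.integers(0, n_classes, size=500)
X_test = rng.normal(size=(50, 25))

models = {}
for a, b in combinations(range(n_classes), 2):        # 45 class pairs
    mask = (y == a) | (y == b)                        # filter to the pair
    models[(a, b)] = SVC(kernel="rbf", C=1.68431,
                         gamma=0.0422).fit(X[mask], y[mask])

votes = np.zeros((len(X_test), n_classes), dtype=int)
for (a, b), clf in models.items():
    pred = clf.predict(X_test)                        # each model votes a or b
    for cls in (a, b):
        votes[:, cls] += (pred == cls)
pred_final = votes.argmax(axis=1)                     # highest vote wins
print(len(models), pred_final[:5])
```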
This approach gave better results for a wide range of values of C and gamma. Interpreting C as the inverse of the regularization parameter, a large value of C implies a model with less penalty on the coefficient values. Further, with gamma taken as the inverse of the variance in the RBF kernel, a larger value of gamma implies a smaller spread of the RBF kernels, and hence a stricter model.
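The gamma interpretation above can be checked numerically: for the RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2), a larger gamma makes the kernel decay faster with distance, so each support vector's influence is more local. A tiny sketch (the two gamma values are taken from the search range used above):

```python
# Numeric check: larger gamma => faster-decaying RBF kernel value.
import numpy as np

def rbf(x, xp, gamma):
    """RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.zeros(25), np.ones(25)      # squared distance is 25
k_small = rbf(x, xp, gamma=0.008)      # broad kernel, slow decay
k_large = rbf(x, xp, gamma=0.9)        # narrow kernel, near-zero here
print(k_small, k_large)
```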
For a coarse search over the parameter ranges quoted above, the accuracy scores are:
c=1.68431
g=0.0422
Cross Validation Score =99.9048
Self Test: 100
Test Set: 100
The model gives excellent scores on these data sets. It is noted, however, that the final collective model sets, containing 10 models in the one-vs-rest method and 45 in the one-vs-one multi-class method, have rather poor individual scores on each individual set as compared to the case when one-vs-all was trained on filtered data sets earlier in this problem. Thus, the overall parameter tuning makes the individual models worse, but strengthens the accuracy of the collective model.
2 Part B
In this section, the data set was of a larger size, while the problem to be attempted was exactly as in the previous section. Consequently, the same two approaches of One-vs-All and One-vs-One multi-classification were used. For the One-vs-All classification method, the following scores were obtained:
c=1.68431
g=0.0422
Cross Validation Score =99.9048
Self Test: 100
Test Set: 96.7999999996
While for the One-vs-One Multi-classification method, the following scores
were obtained:
c=1
g=0.0225
Cross Validation Score =100
Self Test: 100
Test Set: 96.7999999996
To improve the results, 10-fold cross-validation was employed, but the results did not vary.
As is also noted in the libsvm documentation, the claim seems valid that there isn't much performance gain in using a one-vs-one multi-classification method over one-vs-rest.