Michael Bode
Classification Homework
Overview and Experimental Setup
The three classification algorithms that I implemented for this assignment were the SVM-style
NORMA algorithm, Bayesian Linear Regression, and Exponentiated Gradient Descent. All
algorithms were implemented as online learners. To quantify the results of the
algorithms, I collected the data from all of the provided logs and randomized the order in
which the samples were presented to each of the algorithms. First I ran all the algorithms on
this large data set. Then I used half of the data for training the classifiers and used
the functions the classifiers had learned during training to classify the remaining half
of the data. Finally, I added between one and twenty random features to the data and ran each
of the classifiers on the data with the random features added.
The approach I took for implementing each of the classifiers, along with the experimental
results, is summarized in the following sections.
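
A minimal sketch of this evaluation harness is shown below. It is illustrative rather than the exact code used: load_all_logs is a hypothetical stand-in for reading the provided data logs, and the classifier object is assumed to expose the predict and update methods sketched in the sections that follow.

```python
import numpy as np

def add_random_features(X, k, rng):
    """Append k uniformly random features to every sample."""
    return np.hstack([X, rng.random((X.shape[0], k))])

def run_online(clf, X, y, update=True):
    """Present samples in order: predict first, then (optionally) learn."""
    correct = 0
    for x, label in zip(X, y):
        correct += (clf.predict(x) == label)
        if update:
            clf.update(x, label)
    return correct / len(X)

# X, y = load_all_logs()                 # hypothetical loader for the logs
# rng = np.random.default_rng()
# idx = rng.permutation(len(X))          # randomize presentation order
# X, y = X[idx], y[idx]
# rate_all = run_online(clf, X, y)       # one pass over the full data set
# n = len(X) // 2
# run_online(clf, X[:n], y[:n])          # train on the first half ...
# rate_test = run_online(clf, X[n:], y[n:], update=False)  # ... test on the rest
# X_aug = add_random_features(X, 20, rng)  # repeat with 1..20 random features
```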
NORMA-SVM
I implemented the SVM version of the NORMA algorithm using a three-class margin
loss function. That is, whenever the algorithm correctly classifies a sample by a margin
(over the other two classes), the loss is zero and the weights for the new kernel are all
zero. If the algorithm does not make the correct classification by a margin, the incorrect
class that violates the margin by the most is penalized: a kernel is placed with a negative
weight (η) in the incorrect class function and a kernel with a positive weight (η) is placed
in the correct class function. At each new timestep the weights of the existing kernels are
decreased by a nominal amount.
For my implementation, I used the standard radial basis kernel, exp(-||x − x_i||² / γ²),
where γ controls the size of the kernel.
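
To make the update concrete, here is a minimal sketch of this learner as I read it. The margin ρ is an assumption (its value is not stated above), as is the per-step decay factor of (1 − ηλ); the class structure and names are mine.

```python
import numpy as np

GAMMA, ETA, LAM, TAU = 0.7, 0.6, 0.001, 1000  # values from Parameter Tuning
RHO = 1.0  # margin; assumed value, not stated in the report

def rbf(x, xi, gamma=GAMMA):
    """Radial basis kernel exp(-||x - xi||^2 / gamma^2)."""
    d = x - xi
    return np.exp(-np.dot(d, d) / gamma**2)

class NormaSVM:
    def __init__(self, n_classes=3):
        self.n_classes = n_classes
        self.centers = []  # stored samples acting as kernel centers
        self.weights = []  # per-center weight vector, one entry per class

    def scores(self, x):
        f = np.zeros(self.n_classes)
        for c, w in zip(self.centers, self.weights):
            f += w * rbf(x, c)
        return f

    def predict(self, x):
        return int(np.argmax(self.scores(x)))

    def update(self, x, label):
        f = self.scores(x)
        for w in self.weights:        # decay all existing kernel weights
            w *= 1.0 - ETA * LAM
        others = [c for c in range(self.n_classes) if c != label]
        worst = max(others, key=lambda c: f[c])  # worst margin violator
        if f[label] - f[worst] < RHO:            # margin violated:
            w = np.zeros(self.n_classes)         # add a new kernel with
            w[label], w[worst] = ETA, -ETA       # +η / -η weights
            self.centers.append(np.asarray(x, dtype=float).copy())
            self.weights.append(w)
            if len(self.centers) > TAU:          # purge the oldest kernel
                self.centers.pop(0)
                self.weights.pop(0)
```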
Parameter Tuning
Since the dimensionality of the feature space was quite high for this problem, I thought
that the size of the history (τ) should be fairly large so that the feature space could be
fairly well populated. I started with a τ of the most recent 100 samples and ran the
algorithm on the data. Following the error gradient, I settled on τ = 1000 because the
gains beyond that were marginal, there is a risk of overfitting, and the speed of the
algorithm was starting to suffer.
Through a bit of binary searching, I found good parameters for the kernel size and learning
rate of γ = 0.7 and η = 0.6.
I allowed the “strength of prior” parameter to gradually decrease the weights of the
kernels such that, by the time a kernel is about to be purged from the history, its weight
is about half of its starting value. This led to a “strength of prior” value (λ) of 0.001.
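As a rough check: if each kernel's weight decays by the standard NORMA factor of (1 − ηλ)
per step, then with η = 0.6 and λ = 0.001 a kernel that survives the full history is scaled
by (1 − 0.0006)^1000 ≈ e^(−0.6) ≈ 0.55, i.e. roughly half of its starting weight by the time
it is purged at τ = 1000.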
Experimental Results
The results from the NORMA classifier are summarized in the table below. The table
shows the frequency of correct classifications as well as the frequencies of false positives
and false negatives for each of the classes. Each cell shows the count for when the entire
data set was used, followed (in parentheses) by the count recorded from running on the test
set only (after being trained on the training set).
Table 1: NORMA Classification Results
(counts on all data, with counts on the test set only in parentheses)

                                       Hand labeled     Hand labeled     Hand labeled
                                       Road             Vegetation       Obstacle
NORMA-SVM classification: Road         23419 (11977)    435 (591)        695 (499)
NORMA-SVM classification: Vegetation   1573 (490)       63159 (31137)    5512 (2245)
NORMA-SVM classification: Obstacle     24 (12)          246 (241)        4937 (2808)

Classification Rate: 91.52% (91.84%)
Bayesian Linear Regression
For the Bayesian Linear Regression classifier, I used a one-hot encoding to make the
traditionally two-class classifier perform as a three-class classifier. I implemented the
BLR as an online classifier that takes as input both a prior distribution on the weights
and the variance of the noise in the feature data.
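
A minimal sketch of this setup, assuming the standard recursive Bayesian linear regression posterior update with one-hot targets (the prior scale s and all names are my assumptions; σ is tuned below):

```python
import numpy as np

class OnlineBLR:
    """Sketch of an online Bayesian linear regression classifier."""

    def __init__(self, n_features, n_classes=3, s=1.0, sigma=0.5):
        # zero-mean spherical prior on the weights: w ~ N(0, s * I)
        self.A = np.eye(n_features) / s          # posterior precision
        self.B = np.zeros((n_features, n_classes))
        self.noise = sigma ** 2                  # variance of feature noise

    def update(self, x, label):
        t = np.zeros(self.B.shape[1])
        t[label] = 1.0                           # one-hot target vector
        self.A += np.outer(x, x) / self.noise    # accumulate precision
        self.B += np.outer(x, t) / self.noise

    def predict(self, x):
        # posterior mean weights, recomputed per call for simplicity
        W = np.linalg.solve(self.A, self.B)
        return int(np.argmax(x @ W))             # largest predicted target
```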
Parameter Tuning
The two main parameters to tune for the BLR classifier were the prior distribution on the
weights and the variance of the noise on the feature data (σ). I tuned the prior
distribution to be a zero-mean distribution with a spherical covariance (the covariance of
the distribution is s·I, where s is a scalar and I is an N × N identity matrix).
After a binary search I found an optimal value for σ of 0.5.
Experimental Results
The results of the BLR classification are shown in the table below. The format of the
table is the same as that of the NORMA table above.
Table 2: Bayesian Linear Regression Classification Results
(counts on all data, with counts on the test set only in parentheses)

                                 Hand labeled     Hand labeled     Hand labeled
                                 Road             Vegetation       Obstacle
BLR classification: Road         23554 (11760)    300 (143)        805 (389)
BLR classification: Vegetation   1423 (702)       63328 (31729)    5475 (2733)
BLR classification: Obstacle     39 (17)          212 (97)         4864 (2430)

Classification Rate: 91.75% (91.84%)
Exponentiated Gradient Descent
I implemented an exponentiated gradient classifier similar to the one asked about on the
midterm exam. This algorithm is similar to the Winnow algorithm, but it allows for
continuous features and renormalizes the weights after each step. For the three-class case,
three sets of weights are kept (one for each class), and the class with the largest inner
product between its weights and the feature vector is chosen. For incorrect classifications,
the weights are penalized by a factor of exp(-ηf), where f is the feature vector and η is
the learning rate. After each step the weights are then normalized to sum to one.
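
A sketch of this update, under my reading of the description (on a mistake the wrongly chosen class is penalized multiplicatively and, as one plausible reading, the correct class is boosted; each class's weights are renormalized after every step):

```python
import numpy as np

ETA = 0.1  # learning rate from the tuning section below

class ExpGradient:
    """Sketch of the three-class exponentiated gradient learner."""

    def __init__(self, n_features, n_classes=3):
        # uniform initial weights, w_i = 1/N for each class
        self.W = np.full((n_classes, n_features), 1.0 / n_features)

    def predict(self, f):
        # class whose weights have the largest inner product with f
        return int(np.argmax(self.W @ f))

    def update(self, f, label):
        pred = self.predict(f)
        if pred != label:
            self.W[pred] *= np.exp(-ETA * f)   # penalize the wrong class
            self.W[label] *= np.exp(ETA * f)   # boost the correct class (assumed)
        # renormalize each class's weights to sum to one after every step
        self.W /= self.W.sum(axis=1, keepdims=True)
```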
Parameter Tuning
The parameters that must be selected for this classifier are the learning rate and the initial
weight distribution. For my implementation I used a uniform distribution for the weights
(w_i = 1/N, where N is the number of features). Through tuning I found an optimal
learning rate of η = 0.1.
Experimental Results
The results of the Exponentiated Gradient classification are shown in the table below, in
the same format as the tables above.

Table 3: Exponentiated Gradient Classification Results
(counts on all data, with counts on the test set only in parentheses)

                                                    Hand labeled     Hand labeled     Hand labeled
                                                    Road             Vegetation       Obstacle
Exponentiated Gradient classification: Road         23410 (11589)    1152 (114)       485 (66)
Exponentiated Gradient classification: Vegetation   1010 (541)       58194 (31234)    4656 (2650)
Exponentiated Gradient classification: Obstacle     596 (349)        4494 (621)       6003 (2836)

Classification Rate: 87.61% (91.32%)
Conclusions
Effect of Data Hold Out
Since I implemented all of the classifiers as online learners, the intuition is that the
algorithms should not suffer when data is withheld and the trained classifiers are run on
the held-out data. The empirical evidence supports this intuition: the classification
rate was more or less unchanged when the classifiers were trained “offline” compared to
running on all the data.
Effect of Random Features
Random features in the feature set had different effects on the performance of the
classification algorithms. The graph below shows the effect on the correct
classification rate when a varying number of random features were added to the feature
set.

[Figure: correct classification rate vs. number of random features added]
As expected, the Bayesian Linear Regression and Exponentiated Gradient classifiers
were relatively immune to the addition of random features to the feature set. Even when the
random features outnumbered the relevant features, the correct classification rate of both
algorithms was essentially unchanged.
The NORMA algorithm, however, suffered moderately with the addition of random
features. This intuitively makes sense: as the dimensionality of the feature space
increases, the amount of history (τ) required to accurately populate this higher-dimensional
space must increase as well for performance to keep up. When τ is held constant, the
correct classification rate decreases.
Quality of Classification
Based on correct classification rate and robustness to random features, Bayesian
Linear Regression is the best of the three classifiers. However, since the number of
samples labeled as obstacles was small relative to the number of samples labeled road
or vegetation, the Bayesian Linear Regression classifier sacrifices accuracy in
distinguishing between obstacles and vegetation in order to optimize the classification
between road and vegetation. In fact, the BLR classifier misclassifies more obstacle
samples than it correctly classifies. This is clearly an undesirable property for an outdoor
robotic system, where the penalty for a false negative obstacle classification is much
higher than the penalty for confusing road and vegetation. Looking at the classifications
on individual data logs, the Exponentiated Gradient classifier produced a better separation
between the vegetation and obstacle classes, as illustrated in the images below.
[Figure: Hand labeled classification of Pole data log]
[Figure: Bayesian Linear Regression classification of Pole data log]
[Figure: Exponentiated Gradient classification of Pole data log]