Michael Bode
Classification Homework

Overview and Experimental Setup

The three classification algorithms I implemented for this assignment were NORMA (an SVM-style kernel method), Bayesian Linear Regression, and Exponentiated Gradient Descent. All three were implemented as online learners. To quantify the results, I collected the data from all of the provided logs and randomized the order in which the samples were presented to each algorithm. First I ran all of the algorithms on this full data set. Then I used half of the data to train the classifiers and used the functions they had approximated after training to classify the remaining half. Finally, I added one through twenty random features to the data and ran each classifier on the augmented data. The approach I took for each classifier and the experimental results are summarized in the following sections.

NORMA-SVM

I implemented the SVM version of the NORMA algorithm with a three-class margin loss function. Whenever the algorithm correctly classifies a sample by a margin (over the other two classes), the loss is zero and the weights for the new kernel are all zero. If the algorithm does not make the correct classification by a margin, the incorrect class that violates the margin the most is penalized: a kernel with a negative weight (−η) is placed in that class's function, and a kernel with a positive weight (+η) is placed in the correct class's function. At each new timestep the weights of all existing kernels are decreased by a nominal amount. For my implementation, I used the standard radial basis kernel

    k(x, x_i) = exp(−||x − x_i||^2 / γ^2)

where γ controls the size of the kernel.

Parameter Tuning

Since the dimensionality of the feature space was quite high for this problem, I thought the size of the history (τ) should be fairly large so that the feature space could be fairly well populated. I started with a τ of the most recent 100 samples and ran the algorithm on the data. Following the error downhill, I settled on τ = 1000: the gains beyond that were marginal, there is a risk of overfitting, and the speed of the algorithm was starting to suffer. Through a bit of binary searching, I found good values for the kernel size and learning rate of γ = 0.7 and η = 0.6. I set the "strength of prior" parameter (λ) so that the weights of the kernels decrease gradually; by the time a kernel is about to be purged from the history, its weight has decayed to (1 − λ)^τ = 0.999^1000 ≈ e^(−1) ≈ 0.37 of its starting value. This led to a "strength of prior" value of λ = 0.001.
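For concreteness, the following is a minimal Python sketch of the update just described, using the tuned values above as defaults. The margin value and the exact bookkeeping of the loss are not specified in this write-up, so they are assumptions here rather than the original implementation.

    import numpy as np
    from collections import deque

    class NormaSVM:
        """Online NORMA-style three-class kernel classifier (sketch)."""

        def __init__(self, eta=0.6, gamma=0.7, lam=0.001, tau=1000,
                     margin=1.0, n_classes=3):
            self.eta = eta              # learning rate
            self.gamma = gamma          # RBF kernel size
            self.lam = lam              # "strength of prior" (per-step decay)
            self.margin = margin        # assumed margin requirement
            self.n_classes = n_classes
            # History of (center, per-class weights); deque purges beyond tau.
            self.kernels = deque(maxlen=tau)

        def _k(self, x, xi):
            # Radial basis kernel: exp(-||x - xi||^2 / gamma^2)
            d = x - xi
            return np.exp(-np.dot(d, d) / self.gamma ** 2)

        def scores(self, x):
            s = np.zeros(self.n_classes)
            for xi, alpha in self.kernels:
                s += alpha * self._k(x, xi)
            return s

        def predict(self, x):
            return int(np.argmax(self.scores(x)))

        def update(self, x, y):
            s = self.scores(x)
            # Find the worst margin violator among the other two classes.
            others = np.delete(s, y)
            w = int(np.argmax(others))
            w = w if w < y else w + 1   # map back to the original class index
            # Decay every existing kernel weight by the nominal amount.
            for _, alpha in self.kernels:
                alpha *= 1.0 - self.lam
            # On a margin violation, place a new kernel: +eta for the true
            # class, -eta for the most-violating incorrect class.
            if s[y] - s[w] < self.margin:
                alpha = np.zeros(self.n_classes)
                alpha[y] = self.eta
                alpha[w] = -self.eta
                self.kernels.append((np.asarray(x, dtype=float).copy(), alpha))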
Experimental Results

The results from the NORMA classifier are summarized in Table 1. The table shows the number of correct classifications as well as the false positives and false negatives for each class. In each cell, the first number is the count when the entire data set was used; the number in parentheses is the count from running on the test set only, after training on the training set.

Table 1: NORMA Classification Results

                                           Hand labeled      Hand labeled      Hand labeled
                                           Road              Vegetation        Obstacle
    NORMA classification of Road           23419 (11977)       435 (591)         695 (499)
    NORMA classification of Vegetation      1573 (490)       63159 (31137)      5512 (2245)
    NORMA classification of Obstacle          24 (12)           246 (241)       4937 (2808)

    Classification rate: 91.52% on all data (91.84% on the test set)

Bayesian Linear Regression

For the Bayesian Linear Regression (BLR) classifier, I used a one-hot encoding of the class labels to make the traditionally two-class method perform as a three-class classifier: one regression output is maintained per class, and the class with the largest predicted output is chosen. I implemented the BLR as an online classifier that takes as input both a prior distribution on the weights and a variance for the noise in the feature data.

Parameter Tuning

The two main parameters to tune for the BLR classifier were the prior distribution on the weights and the variance of the noise on the feature data (σ). I tuned the prior to be a zero-mean distribution with spherical covariance (the covariance of the distribution is s·I, where s is a scalar and I is the N×N identity matrix). After a binary search I found an optimal value for σ of 0.5.

Experimental Results

The results of the BLR classification are shown in Table 2. The format of the table is the same as that of the NORMA table above.

Table 2: Bayesian Linear Regression Classification Results

                                         Hand labeled      Hand labeled      Hand labeled
                                         Road              Vegetation        Obstacle
    BLR classification of Road           23554 (11760)       300 (143)         805 (389)
    BLR classification of Vegetation      1423 (702)       63328 (31729)      5475 (2733)
    BLR classification of Obstacle          39 (17)           212 (97)        4864 (2430)

    Classification rate: 91.75% on all data (91.84% on the test set)

Exponentiated Gradient Descent

I implemented an exponentiated gradient classifier similar to the one asked about on the midterm exam. The algorithm is similar to Winnow, but it allows for continuous features and renormalizes the weights after each step. For the three-class case, three sets of weights are kept (one per class), and the class with the largest inner product between its weights and the feature vector is chosen. On an incorrect classification, the weights of the mistaken class are penalized componentwise by exp(−ηf), where f is the feature vector and η is the learning rate. After each step the weights are then renormalized to sum to one.

Parameter Tuning

The parameters that must be selected for this classifier are the learning rate and the initial weight distribution. For my implementation I used a uniform distribution for the weights (w_i = 1/N, where N is the number of features). Through tuning I found an optimal learning rate of η = 0.1.
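A minimal sketch of this update follows, assuming the "sum to one" normalization applies to each class's weight vector separately. The write-up specifies only the penalty on the mistaken class, so no boost is applied to the correct class here.

    import numpy as np

    class ExponentiatedGradient:
        """Three-class exponentiated gradient classifier (sketch)."""

        def __init__(self, n_features, n_classes=3, eta=0.1):
            self.eta = eta
            # Uniform initial weights: w_i = 1/N for every class.
            self.W = np.full((n_classes, n_features), 1.0 / n_features)

        def predict(self, f):
            # Class with the largest inner product of weights and features.
            return int(np.argmax(self.W @ f))

        def update(self, f, y):
            pred = self.predict(f)
            if pred != y:
                # Penalize the mistaken class componentwise:
                # w <- w * exp(-eta * f)
                self.W[pred] *= np.exp(-self.eta * np.asarray(f, dtype=float))
            # After each step, renormalize each class's weights to sum to one.
            self.W /= self.W.sum(axis=1, keepdims=True)
            return pred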
Experimental Results

The results of the exponentiated gradient classification are shown in Table 3, in the same format as the tables above.

Table 3: Exponentiated Gradient Classification Results

                                                            Hand labeled      Hand labeled      Hand labeled
                                                            Road              Vegetation        Obstacle
    Exponentiated Gradient classification of Road           23410 (11589)      1152 (114)         485 (66)
    Exponentiated Gradient classification of Vegetation      1010 (541)       58194 (31234)      4656 (2650)
    Exponentiated Gradient classification of Obstacle         596 (349)        4494 (621)        6003 (2836)

    Classification rate: 87.61% on all data (91.32% on the test set)

Conclusions

Effect of Data Hold-Out

Since I implemented all of the classifiers as online learners, the intuition is that they should not suffer when data is withheld and the trained classifiers are then run on the held-out data. The empirical evidence supports this intuition: for each classifier, the classification rate was more or less unchanged between training "offline" and then classifying the held-out half versus running online over all of the data.

Effect of Random Features

Random features in the feature set had different effects on the performance of the classification algorithms. The following graph shows the effect on the correct classification rate as a varying number of random features was added to the feature set.

[Figure: correct classification rate versus number of random features added]

As expected, the Bayesian Linear Regression and Exponentiated Gradient classifiers were relatively immune to the addition of random features. Even when the random features outnumbered the relevant features, the correct classification rate of both algorithms was essentially unchanged. The NORMA algorithm, however, suffered moderately as random features were added. This makes intuitive sense: as the dimensionality of the feature space increases, the amount of history (τ) required to populate the higher-dimensional space accurately must increase as well for performance to keep up. With τ held constant, the correct classification rate decreases.

Quality of Classification

Based on correct classification rate and robustness to random features, the Bayesian Linear Regression classifier is the best of the three. However, since the number of samples labeled as obstacles was small relative to the number labeled road or vegetation, the BLR classifier sacrifices accuracy in distinguishing obstacles from vegetation in order to optimize the separation between road and vegetation. In fact, the BLR classifier misclassifies more obstacle samples (805 + 5475 = 6280 on all data) than it correctly classifies (4864). This is clearly an undesirable effect for an outdoor robotic system, where the penalty for a false-negative obstacle classification is much higher than the penalty for confusing road with vegetation. Looking at the classifications on individual data logs, the Exponentiated Gradient classifier produced a better separation between the vegetation and obstacle classes, as illustrated in the images below.

[Figure: Hand Labeled Classification of Pole Data Log]
[Figure: Bayesian Linear Regression Classification of Pole Data Log]
[Figure: Exponentiated Gradient Classification of Pole Data Log]
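As a closing sketch, the random-feature experiment described above could be harnessed roughly as follows. The make_classifier factory (taking the new feature count) and the predict/update interface are hypothetical, matching the sketches earlier in this write-up rather than the original code.

    import numpy as np

    def classification_rate(clf, X, y):
        # Standard online evaluation: predict each sample before updating.
        correct = 0
        for x, label in zip(X, y):
            correct += int(clf.predict(x) == label)
            clf.update(x, label)
        return correct / len(y)

    def random_feature_sweep(make_classifier, X, y, max_extra=20, seed=0):
        # Append k = 1..max_extra irrelevant random features to every sample
        # and record the correct classification rate at each k.
        rng = np.random.default_rng(seed)
        rates = []
        for k in range(1, max_extra + 1):
            Xk = np.hstack([X, rng.random((X.shape[0], k))])
            rates.append(classification_rate(make_classifier(Xk.shape[1]), Xk, y))
        return rates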