Artificial Neural Networks [ANN]
Self-Supervised Scene Classification

Pablo de Vicente Abad
s20233489
June 2024

Contents

1 Introduction
2 Dataset
3 Classification Schemes
  3.1 Supervised Learning Scheme
  3.2 Self-Supervised Learning Scheme
    3.2.1 Gaussian Blurring
    3.2.2 Black and White Perturbation
  3.3 Scene Classification
4 Evaluation
  4.1 Training Configuration
  4.2 Overfitting
  4.3 Training Accuracy
    4.3.1 Supervised Learning
    4.3.2 Self-Supervised Learning - Gaussian Blurring
    4.3.3 Self-Supervised Learning - Black and White Perturbation
  4.4 Scene Classification - Self-Supervised vs. Supervised
5 Code
6 Annex I

1 Introduction

This document explores the performance of fully-supervised and self-supervised learning techniques for scene classification. We compare models trained on labeled data (fully-supervised) with models trained using a pretext task on unlabeled data (self-supervised). The document outlines the dataset used, details the model architectures and transformations, and describes the evaluation protocol.

2 Dataset

We use the 15-Scene Dataset, which contains 15 categories of different scenes. More information on the dataset can be found in ProjectDescription.pdf.

3 Classification Schemes

This section details the fully-supervised and self-supervised learning methods implemented in this assignment.

3.1 Supervised Learning Scheme

This section focuses on fine-tuning an EfficientNet-B0 model for scene classification. We leverage a pre-trained EfficientNet-B0 model and apply transformations to the input data. While the project description recommends training all layers, we experiment with freezing a subset of the initial layers. This allows us to observe the impact of transfer learning and to evaluate how freezing specific layers affects the model's performance.

3.2 Self-Supervised Learning Scheme

For the self-supervised classifiers, we start by loading a pre-trained EfficientNet-B0 model. As self-supervised classification is inherently more complex than supervised learning, we first train the feature-extraction layers on pretext tasks. The aim of these tasks is to force the model to learn finer-grained features instead of relying on simpler ones, which should make the model more generalizable and accurate.

3.2.1 Gaussian Blurring

This pretext task applies a Gaussian blur with a variable kernel size to each image, with kernel sizes drawn from {5, 7, 9, 11, 15}. We replace the classifier with one that predicts which kernel size was applied. To prevent overfitting, we significantly augment the input data: each image is presented with all five kernel sizes during training. This approach, while computationally expensive, helps ensure the model's ability to generalize to unseen data.

3.2.2 Black and White Perturbation

Similar to Gaussian blurring, this pretext task introduces a black or white square placed at a random position within the image. The new classifier, replacing the original one, is trained to predict whether the square is black or white.
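As an illustration of these two pretext tasks, the sketch below shows one way the labeled perturbations could be generated with torchvision. It is a minimal sketch only: the function names and the 32-pixel square size are illustrative assumptions, not taken from the submitted code.

    import random
    import torchvision.transforms.functional as TF

    KERNEL_SIZES = [5, 7, 9, 11, 15]  # one class label per kernel size

    def blur_pretext(img):
        # Pick one of the five kernel sizes and return (blurred image, label).
        label = random.randrange(len(KERNEL_SIZES))
        blurred = TF.gaussian_blur(img, kernel_size=KERNEL_SIZES[label])
        return blurred, label

    def square_pretext(img, size=32):
        # Paste a black (label 0) or white (label 1) square at a random position.
        # "img" is assumed to be a CHW tensor with values in [0, 1]; the square
        # size of 32 pixels is a hypothetical choice.
        label = random.randrange(2)
        _, h, w = img.shape
        top = random.randrange(h - size)
        left = random.randrange(w - size)
        img = img.clone()
        img[:, top:top + size, left:left + size] = float(label)
        return img, label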
3.3 Scene Classification

In the final stage, we evaluate the performance of the models trained on pretext tasks on the original scene classification problem. We replace the modified classifier head with the original one, designed for classifying the 15 scene categories. To leverage the pre-trained features while adapting to the specific task, we freeze the weights of the feature-extraction layers and only fine-tune the classifier head. This allows the model to learn task-specific decision boundaries while retaining the valuable feature representations learned from the pretext tasks.

For comparison, we load both the best model obtained through pretext-task training and the checkpoint of the same model at its third epoch. This comparison aims to quantify the performance improvement achieved through fine-tuning with the pre-trained features.

4 Evaluation

This section details the evaluation process for the trained models. We assess the performance of:

- The scene classification model trained with fully-supervised learning.
- The models trained on auxiliary (pretext) tasks and subsequently fine-tuned for scene classification.

The evaluation covers:

1. Training settings and justification for hyperparameter choices (learning rate, optimizer, etc.).
2. Classification accuracy reported in tables, along with justification for the epoch chosen for final evaluation.
3. Impact of pre-training on scene classification: performance for both perturbation tasks.
4. Comparison of fully-supervised vs. self-supervised approaches in terms of classification accuracy and the effectiveness of self-supervised learning.
5. Strategies employed to address underfitting and overfitting, with supporting evidence.

Due to limitations in GPU computational power, training iterations for this project were time-consuming. Even leveraging Google Colab's resources proved insufficient, and the models could not be run continuously on a personal computer. These factors limited the comprehensiveness of the testing phase. However, I believe the conducted tests are sufficient to draw meaningful conclusions, and I will justify each decision made throughout the project to support this claim.

4.1 Training Configuration

To ensure consistent evaluation across different models, all training runs were conducted with the following configuration (a minimal sketch of this shared setup is given after the list):

- Maximum epochs: 10. This limit was chosen to establish common ground for comparison while respecting computational constraints. Training on a single GTX 1050 Ti GPU took approximately 2.5 hours.
- Loss function: cross-entropy loss. This function is well established in the literature and was used as the loss function for all experiments.
- Validation set: all tests employed a validation set consisting of 10% of the dataset.
- Number of workers: to improve data-loading speed, we used two parallel workers. Hardware limitations prevented further scaling.
- Number of FC layers: initially, I planned to vary the number of fully connected layers and retrain the best model to allow a direct comparison. Due to time constraints, this was not possible, so all tests were conducted with a single fully connected layer.
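The sketch below shows one way this shared setup could be expressed in PyTorch, assuming the 15-Scene images are arranged in an ImageFolder layout. The function names, image size, and dropout default are illustrative assumptions, not taken from the submitted code.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, random_split
    from torchvision import datasets, models, transforms

    MAX_EPOCHS = 10      # common epoch limit for every run
    NUM_WORKERS = 2      # two parallel data-loading workers
    VAL_FRACTION = 0.10  # 10% of the dataset held out for validation

    def build_loaders(root, batch_size):
        # "root" is assumed to hold the 15-Scene images in an ImageFolder layout.
        tfms = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        full = datasets.ImageFolder(root, transform=tfms)
        n_val = int(len(full) * VAL_FRACTION)
        train_set, val_set = random_split(full, [len(full) - n_val, n_val])
        train = DataLoader(train_set, batch_size=batch_size, shuffle=True,
                           num_workers=NUM_WORKERS)
        val = DataLoader(val_set, batch_size=batch_size,
                         num_workers=NUM_WORKERS)
        return train, val

    def build_model(num_classes=15, dropout=0.5):
        model = models.efficientnet_b0(weights="IMAGENET1K_V1")
        # Replace the head with a single fully connected layer (plus dropout).
        model.classifier = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(1280, num_classes),
        )
        return model

    criterion = nn.CrossEntropyLoss()  # loss function for all experiments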
For testing purposes, we used four combinations of parameters, detailed in the table below. The rest of this section discusses the rationale behind each choice.

Supervised Learning   Optimizer  Batch Size  Dropout Probability  Learning Rate  Weight Decay  Train Ratio  Momentum (SGD only)
x_ty_v1               Adam       128         0.5                  0.0001         1.0E-04       0.7          -
x_ty_v2               Adam       64          0.3                  0.000001       1.0E-05       0.7          -
x_ty_v3               SGD        64          0.3                  0.000001       1.0E-05       0.7          0.9
x_ty_v4               SGD        128         0.3                  0.001          1.0E-06       0.5          0.95

The tests are named using the format x_ty_vn, where:

- x is either s or ss, denoting supervised or self-supervised learning, respectively.
- ty denotes the task the model was trained on (e.g. t1/t2 for the self-supervised models).
- vn indicates the training configuration, where n ranges from 1 to 4.

Optimizer: The first configuration we tested used the Adam optimizer. Adam combines the advantages of AdaGrad and RMSProp by adaptively adjusting the learning rate for each parameter, which makes it robust across different problems and reduces the need for tuning. However, it can sometimes lead to poorer generalization than simpler optimizers such as SGD, as this paper mentions, so we also included SGD in our tests.

Batch size: Ideally, the batch size would be as large as possible while maintaining sufficient variance. We opted for two values: 128 and 64. The performance loss from lowering the batch size is uncertain, but we compensate for it by decreasing the learning rate. One article advises not lowering the learning rate but instead increasing the batch size as much as possible for better performance; another article advises the contrary, stating that reducing the batch size can actually improve generalization. We therefore experimented with both batch sizes to observe their effect on our model.

Learning rate: As stated above, the learning rate works in tandem with the batch size. Different learning rates were used with both Adam and SGD (for SGD, momentum values of 0.9 and 0.95 were also tested), and these choices made a significant difference to the results obtained at the 10th epoch. A sketch of how the four configurations translate to PyTorch optimizers is shown below.
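As an illustration only, the four configurations from the table above could be constructed as follows; the helper name is hypothetical and not part of the submitted code.

    import torch

    def make_optimizer(model, config):
        # Map each x_ty_vn configuration to the optimizer settings in the table.
        if config == "v1":  # Adam, lr 1e-4, weight decay 1e-4
            return torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
        if config == "v2":  # Adam, lr 1e-6, weight decay 1e-5
            return torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=1e-5)
        if config == "v3":  # SGD, lr 1e-6, weight decay 1e-5, momentum 0.9
            return torch.optim.SGD(model.parameters(), lr=1e-6,
                                   weight_decay=1e-5, momentum=0.9)
        if config == "v4":  # SGD, lr 1e-3, weight decay 1e-6, momentum 0.95
            return torch.optim.SGD(model.parameters(), lr=1e-3,
                                   weight_decay=1e-6, momentum=0.95)
        raise ValueError(f"unknown configuration: {config}")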
4.2 Overfitting

We encountered overfitting problems at the beginning. Dropout was introduced after the first batch of preliminary testing as a way to alleviate these problems. We also introduced data augmentation techniques and a validation split, on which we base the model's performance during training. Weight decay was experimented with over a wide range of values (1e-4 to 1e-6); these values were inspired by a blog post by Dr. Jason Brownlee. L2 regularization helps prevent overfitting by penalizing large weights, encouraging the model to find a simpler function that fits the data. After these methods were introduced, we did not encounter any further overfitting during the project. Further fine-tuning could be explored to find alternative values.

4.3 Training Accuracy

4.3.1 Supervised Learning

During our supervised learning experiments, we conducted an initial round of testing with various configurations, but these yielded lower results than our official testing phase. Details of this preliminary testing can be found in the annex, though its outcomes are less significant. It is important to note that all models were stopped at the 10th epoch. This decision favored models with faster early-stage training, but it is noteworthy that all models were still improving at the final epoch. It is regrettable that we could not run the models for a longer duration. Notably, model s_t1_v4 exhibited a relatively high loss despite achieving good accuracy, suggesting potential for further optimization.

Additionally, model s_t1_v3 stands out as it achieved the lowest score of all the models trained, with a test accuracy of 12%. This was due to initializing SGD with a learning rate that was too low, resulting in slow convergence. The significant impact of small parameter adjustments can be observed here, particularly in comparison with model s_t1_v4. For future comparisons, we selected model s_t1_v1. Comprehensive results are provided in "Model Testing.xlsx", and additional information can be found in the corresponding .ipynb file for each task.

Figure 1: Validation results - supervised learning

4.3.2 Self-Supervised Learning - Gaussian Blurring

In the pretext task using Gaussian blurring, the v1 configuration yielded the best results. It achieved 95% accuracy in the final epoch and 88% at its third epoch, surpassing all other configurations. This significant difference is unexpected, and it raises the question of whether it relates to the nature of the data itself.

Figure 2: Validation results - Gaussian blurring

4.3.3 Self-Supervised Learning - Black and White Perturbation

In this task, we observed that the best-performing models did not necessarily reach their peak accuracy in the final epoch, although the differences were very small for most models. For ss_t1_v2 we observe an accuracy of 88.05% at the fourth epoch versus 87.39% at the tenth epoch, a marginal difference. As shown in Figure 3, the v4 model is the most accurate, by a small margin of 1% over both the v1 and v2 models.

Figure 3: Validation results - black and white perturbation

4.4 Scene Classification - Self-Supervised vs. Supervised

In this section we trained the scene classification head on top of the weights produced by the pretext-task models. We also compare the results obtained by the same model at both its third and its best epoch. Overall, scene classification benefited more from pre-training on the black and white perturbation task. These results can be seen in Figures 4 and 5. A sketch of the head replacement and feature freezing used in this stage is given after the figure captions below.

I was disappointed to find that the "hybrid" models did not outperform the supervised model. My assumption is that, because the pretext tasks' training was prematurely truncated, we did not fully leverage the potential of transfer learning. However, I was pleasantly surprised by the models trained up to the third epoch, which consistently achieved notable results across different metrics. These models not only surpassed those trained until the final epoch but also outperformed some of the supervised models. My hypothesis is that these models exhibit better generalization capabilities. In contrast, models trained until the final epoch might struggle to generalize effectively while still failing to fine-tune critical aspects, potentially leaving them in a transitional state.

Figure 4: Scene classification training results with the Gaussian blurring model, best epoch vs. third epoch

Figure 5: Scene classification training results with the black and white perturbation model, best epoch vs. third epoch
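The following is a minimal sketch of how this stage could be set up, assuming the pretext-trained weights were saved as a plain state_dict: the backbone is restored, its feature extractor is frozen, and a fresh 15-way head is attached. The checkpoint file name and helper name are hypothetical.

    import torch
    from torch import nn
    from torchvision import models

    def build_scene_classifier(checkpoint_path, num_classes=15, dropout=0.5):
        # Load the pretext-trained EfficientNet-B0; the pretext head (5-way or
        # 2-way) is discarded, so only backbone weights are restored.
        model = models.efficientnet_b0(weights=None)
        state = torch.load(checkpoint_path, map_location="cpu")
        backbone = {k: v for k, v in state.items()
                    if not k.startswith("classifier")}
        model.load_state_dict(backbone, strict=False)

        # Freeze the feature-extraction layers; only the new head is trained.
        for param in model.features.parameters():
            param.requires_grad = False

        model.classifier = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(1280, num_classes),
        )
        return model

    # Only the classifier parameters are passed to the optimizer (v1 settings).
    model = build_scene_classifier("ss_t1_v1_best.pt")  # hypothetical file name
    optimizer = torch.optim.Adam(model.classifier.parameters(),
                                 lr=1e-4, weight_decay=1e-4)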
5 Code

With the submission there are a number of files I want to mention briefly:

1. 00-docu/: This folder contains all relevant documentation for the project, including ".xlsx" files with the pre-testing and testing results.

2. 02-Code/self_supervised/: This folder contains all the code necessary to run the self-supervised tasks 1, 2, and 3. Each task has both a .ipynb and a .py file associated with it. The .py files contain all the functions required for training, saving, and loading models. Arguments are passed from the .ipynb files, which provide a comprehensive list of the parameters used.

3. 02-Code/supervised/: Similarly, this folder contains the code for the supervised task, following the same structure as the self-supervised folder.

6 Annex I

Auxiliary table containing some of the initial parameter tests; scores were lower across the board.

Supervised Learning   Optimizer  Batch Size  Dropout Probability  Learning Rate  Weight Decay  Train Ratio  Momentum (SGD only)  Best Epoch  Accuracy at Best Epoch (%)  Loss at Best Epoch  Test Accuracy (WA) (%)
model 1               SGD        128         0                    0.1            0             0.8          0.9                  8           65                          -                   -
model 2               SGD        128         0.5                  0.001          1.0E-04       0.9          0.10                 10          82                          -                   -
model 3               SGD        128         0.5                  0.001          1.0E-04       0.10         0.11                 10          75                          -                   -
model 4               SGD        128         0.5                  0.001          1.0E-04       0.11         0.10                 10          70                          -                   -

Table 1: Supervised learning parameters and results from the pre-testing of models.