

Artificial Neural Networks [ANN]
Self-Supervised Scene Classification
Pablo de Vicente Abad
s20233489
June 2024
Contents

1 Introduction
2 Dataset
3 Classification schemes
  3.1 Supervised Learning Scheme
  3.2 Self-Supervised Learning Scheme
    3.2.1 Gaussian Blurring
    3.2.2 Black and White Perturbation
  3.3 Scene Classification
4 Evaluation
  4.1 Training Configuration
  4.2 Overfitting
  4.3 Training Accuracy
    4.3.1 Supervised Learning
    4.3.2 Self-Supervised Learning - Gaussian Blurring
    4.3.3 Self-Supervised Learning - Black and White Perturbation
  4.4 Scene Classification: Self-Supervised vs. Supervised
5 Code
6 Annex I
1 Introduction
This document explores the performance of fully-supervised and self-supervised
learning techniques for scene classification. We compare models trained on
labeled data (fully-supervised) with those trained using a pretext task on unlabeled data (self-supervised). The document outlines the dataset used, details
the model architectures and transformations, and describes the evaluation protocol.
2 Dataset
We will be using the 15-Scene Dataset, which contains 15 different scene categories. More information on the dataset can be found in ProjectDescription.pdf.
3 Classification schemes
This section details the fully-supervised and self-supervised learning methods
to be implemented in this assignment.
3.1 Supervised Learning Scheme
This section focuses on fine-tuning an EfficientNet-B0 model for scene classification. We will leverage a pre-trained EfficientNet-B0 model and apply transformations to the input data.
While the project description recommends training all layers, we will experiment with freezing a subset of the initial layers. This approach allows us to
observe the impact of transfer learning and evaluate how freezing specific layers
affects the model's performance.
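As a rough sketch of this setup (assuming PyTorch and torchvision; the number of frozen blocks is a hypothetical choice for illustration, not the value used in the experiments), the model could be prepared as follows:

    import torch.nn as nn
    from torchvision import models

    NUM_SCENE_CLASSES = 15   # 15-Scene Dataset
    NUM_FROZEN_BLOCKS = 3    # hypothetical number of initial blocks to freeze

    # Load an EfficientNet-B0 pre-trained on ImageNet.
    model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)

    # Freeze the first few feature blocks to observe the effect of transfer learning.
    for block in model.features[:NUM_FROZEN_BLOCKS]:
        for param in block.parameters():
            param.requires_grad = False

    # Replace the 1000-class ImageNet head with a 15-class scene classifier.
    model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_SCENE_CLASSES)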
3.2 Self-Supervised Learning Scheme
For the self-supervised classifiers, we will start by loading a pre-trained EfficientNet-B0 model. As self-supervised classification is inherently more complex than supervised classification, we will train the feature-extraction layers on pretext
tasks. The aim of these tasks is to force the models to learn finer-grained
features instead of relying on simpler ones. This should make our model
more generalizable and accurate.
3.2.1 Gaussian Blurring
This pretext task consists of applying a Gaussian blur with a variable kernel size to each of the
images. The kernel sizes are drawn from {5, 7, 9, 11, 15}. We will also replace the
classifier with one that categorizes which kernel was applied.
To prevent overfitting, we will significantly augment the input data: each
image is presented with all five kernel sizes during training. This approach,
while computationally expensive, helps ensure the model's ability to generalize
to unseen data.
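A minimal sketch of this pretext task is given below (assuming PyTorch image tensors and torchvision's functional transforms; the helper name blur_pretext_samples is ours for illustration):

    import torchvision.transforms.functional as TF

    KERNEL_SIZES = [5, 7, 9, 11, 15]

    def blur_pretext_samples(image):
        # Return one (blurred_image, label) pair per kernel size; the label
        # is the index of the kernel that was applied.
        samples = []
        for label, k in enumerate(KERNEL_SIZES):
            blurred = TF.gaussian_blur(image, kernel_size=k)
            samples.append((blurred, label))
        return samples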
3.2.2 Black and White Perturbation
Similar to Gaussian blurring, this pretext task introduces a black or white square
randomly placed within the image. The new classifier, replacing the original one,
is trained to categorize whether the square is black or white.
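A possible implementation sketch of this perturbation is shown below (assuming C x H x W image tensors scaled to [0, 1]; the 32-pixel patch size and the helper name bw_perturb are illustrative assumptions, not values taken from the experiments):

    import random
    import torch

    SQUARE_SIZE = 32  # hypothetical patch size

    def bw_perturb(image: torch.Tensor):
        # Return (perturbed_image, label), where label 0 means a black square
        # was pasted into the image and label 1 means a white square.
        _, h, w = image.shape
        top = random.randint(0, h - SQUARE_SIZE)
        left = random.randint(0, w - SQUARE_SIZE)
        label = random.randint(0, 1)
        perturbed = image.clone()
        perturbed[:, top:top + SQUARE_SIZE, left:left + SQUARE_SIZE] = float(label)
        return perturbed, label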
3.3 Scene Classification
In the final stage, we evaluate the performance of the models trained on pretext tasks
on the original scene classification problem. We replace the modified classifier
head with the original one, designed for classifying the 15 scene categories.
To leverage the pre-trained features while adapting to the specific task, we freeze
the weights of the feature-extraction layers and only fine-tune the classifier head.
This approach allows the model to learn task-specific decision boundaries while
retaining the valuable feature representations learned from the pretext tasks.
For comparison, we load both our best model obtained through pretext-task
training and the checkpoint of that model at its third epoch. This comparison aims to
quantify the improvement in performance achieved through fine-tuning with the
pre-trained features.
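A minimal sketch of this stage, assuming the PyTorch EfficientNet-B0 layout used earlier (the helper name prepare_for_scene_classification is illustrative):

    import torch.nn as nn

    def prepare_for_scene_classification(model, num_classes=15):
        # Freeze all feature-extraction weights learned during the pretext task.
        for param in model.features.parameters():
            param.requires_grad = False
        # Swap the pretext head (5-way kernel or 2-way black/white) for a
        # 15-way scene classifier; only this layer will be fine-tuned.
        model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)
        return model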
4 Evaluation
This section details the evaluation process for the trained models. We will assess
the performance of the following:
- The scene classification model trained with fully-supervised learning.
- The models trained on pretext tasks and subsequently fine-tuned for scene
classification.
The evaluation will cover:
1. Training settings and justification for hyperparameter choices (learning
rate, optimizer, etc.).
2. Classification accuracy reported in tables, along with justification for the
chosen epoch for final evaluation.
3. Impact of pre-training on scene classification: performance for both perturbation tasks.
4. Comparison of fully-supervised vs. self-supervised approaches in terms of
classification accuracy and the effectiveness of self-supervised learning.
5. Strategies employed to address underfitting and overfitting, with supporting evidence.
Due to limitations in GPU computational power, training iterations for this
project were time-consuming. Even leveraging Google Colab's resources proved
insufficient, and running models continuously on a personal computer was not feasible. These factors
limited the comprehensiveness of the testing phase. However, I believe the
conducted tests are sufficient to draw meaningful conclusions. I will justify each
decision made throughout the project to support this claim.
4.1 Training Configuration
To ensure consistent evaluation across different models, all training runs were
conducted with the following configuration (a sketch of the shared setup is shown after this list):
- Maximum Epochs: 10. This limit was chosen to establish common ground
for comparison while considering computational constraints. Training on
a single GTX 1050 Ti GPU took approximately 2.5 hours.
- Loss Function: Cross-Entropy Loss. This function has been extensively
used and validated in the literature and was used as the loss function for all
experiments.
- Validation set: All tests employed a validation set consisting of 10% of the
dataset.
- Number of workers: To improve data loading speed, we used two parallel
workers. Hardware limitations restricted further scaling.
- Number of FC layers: Initially, I planned to change the number of fully
connected layers and retrain the best model to facilitate a direct comparison. Unfortunately, due to time constraints, this was not possible.
Consequently, all tests were conducted with a single fully connected layer.
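A minimal sketch of this shared setup (PyTorch assumed; dataset stands for the 15-Scene training data loaded elsewhere, and the batch size of 128 corresponds to one of the configurations below):

    import torch
    from torch.utils.data import DataLoader, random_split

    criterion = torch.nn.CrossEntropyLoss()

    # Hold out 10% of the data for validation.
    val_size = int(0.1 * len(dataset))
    train_set, val_set = random_split(dataset, [len(dataset) - val_size, val_size])

    # Two parallel workers for data loading.
    train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
    val_loader = DataLoader(val_set, batch_size=128, num_workers=2)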
For testing purposes, we used four combinations of different parameters, as
detailed in the following table. In this section, we will discuss the rationale
behind each option.
Supervised Learning

Configuration   Optimizer   Batch Size   Dropout Probability   Learning Rate   Weight Decay   Train Ratio   Momentum (SGD only)
x_ty_v1         Adam        128          0.5                   1.0E-04         1.0E-04        0.7           -
x_ty_v2         Adam        64           0.3                   1.0E-06         1.0E-05        0.7           -
x_ty_v3         SGD         64           0.3                   1.0E-06         1.0E-05        0.7           0.9
x_ty_v4         SGD         128          0.3                   1.0E-03         1.0E-06        0.5           0.95
The tests are named using the format x_ty_vn, where:
- x can be either s or ss, representing supervised or self-supervised learning,
respectively.
- ty denotes the task the model is named after (e.g., t1/t2 for the self-supervised models).
- vn indicates the training configuration, where n ranges from 1 to 4.
Optimizer: The first configuration we tested used the Adam optimizer.
Adam combines the advantages of AdaGrad and RMSProp by adaptively adjusting the learning rate for each parameter. This makes it robust to different
problems and requires less tuning. However, it can sometimes lead to poorer
generalization compared to simpler optimizers like SGD, as this paper mentions,
so we also included SGD in our tests.
Batch Size: Ideally, our batch size would be as large as possible while
maintaining sufficient variance. We opted for two values: 128 and 64. The
performance loss from lowering the batch size is uncertain, but we compensate
for this by decreasing the learning rate. According to the article, it is advised
not to lower the learning rate but to increase the batch size as much as possible
for better performance. However, the contrary was advised in this article, which
states that reducing the batch size can actually improve generalization performance. Therefore, we decided to experiment with both batch sizes to observe
their effects on our model.
Learning rate: As previously stated, the learning rate worked in tandem with
the batch size. A variable learning rate was used with both Adam and SGD (with
momentum values of 0.9 and 0.95 tested for SGD), which made a significant difference
in the results obtained at the 10th epoch.
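As an illustration, the optimizers for configurations v1 and v4 could be instantiated as follows (PyTorch assumed; model refers to the EfficientNet-B0 prepared earlier):

    import torch

    # v1: Adam with learning rate 1.0E-04 and weight decay 1.0E-04.
    optimizer_v1 = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

    # v4: SGD with learning rate 1.0E-03, weight decay 1.0E-06 and momentum 0.95.
    optimizer_v4 = torch.optim.SGD(model.parameters(), lr=1e-3,
                                   weight_decay=1e-6, momentum=0.95)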
4.2 Overfitting
We did encounter overfitting problems at the beginning. Dropout was introduced after the first batch of preliminary testing as a way to alleviate these
problems. Along the same lines, we also introduced some data augmentation techniques and used a validation set, on which we base our models' performance
during training. Weight decay was also experimented with over a wide range of
values (1e-4 to 1e-6). These values were inspired by a blog post by Dr. Jason
Brownlee. L2 regularization helps to prevent overfitting by penalizing
large weights, encouraging the model to find a simpler function that fits the
data.
After all these methods were introduced, we did not encounter any more
overfitting throughout the project. Further fine-tuning could be explored in order
to find alternative values.
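A sketch of the kind of augmentation and dropout adjustment described above (PyTorch/torchvision assumed; the specific transforms and the dropout value of 0.5 are illustrative, matching one of the configurations rather than all of them):

    import torch.nn as nn
    from torchvision import transforms

    # Illustrative augmentation pipeline applied to the training images.
    train_transforms = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])

    # Dropout in the EfficientNet-B0 classifier head (0.5 or 0.3 in our runs).
    model.classifier[0] = nn.Dropout(p=0.5)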
4.3 Training Accuracy
4.3.1 Supervised Learning
During our supervised learning experiments, we conducted an initial round of
testing with various configurations, but these yielded lower results compared to
our official testing phase. Details of this preliminary testing can be found in the
annex, though its outcomes are less significant.
It is important to note that all models were stopped at the 10th epoch. This
decision favored models that showed faster early-stage training, although all models were still improving at the final epoch; it is
regrettable that we could not run them for a longer duration. Notably,
model s_t1_v4 exhibited a relatively high loss despite achieving good accuracy, suggesting potential for further optimization.
Additionally, model s_t1_v3 stands out as it achieved the lowest score out
of all models trained, with a test accuracy of 12%. This was due to initializing
SGD with a learning rate that was too low, resulting in slow convergence. The
significant impact of small parameter adjustments can be observed, particularly
in comparison with model s_t1_v4.
For future comparisons, we have selected model s_t1_v1. Comprehensive
results are provided in "Model Testing.xlsx", and additional information can be
found in the corresponding .ipynb file for all tasks.
Figure 1: Validation results for supervised learning
4.3.2 Self-Supervised Learning - Gaussian Blurring
In the pretext task using Gaussian blurring, the v1 configuration yielded the
best results. It achieved 95% accuracy in the final epoch and 88% at its third
epoch, surpassing all other configurations. This significant difference is unexpected,
and it raises questions about whether it relates to the nature of the data itself.
Figure 2: Validation results for Gaussian Blurring
4.3.3 Self-Supervised Learning - Black and White Perturbation
In this task, we observed that the best-performing models were not necessarily at their peak accuracy in the final epoch, although the differences were
very small for most of the models. For ss_t1_v2 we observe an accuracy of
88.05% at the fourth epoch vs. 87.39% at the tenth epoch, which is a
marginal difference. We observe in Figure 3 that the v4 model is the most accurate, by a
small margin of about 1% over both the v1 and v2 models.
Figure 3: Validation results for Black and White Perturbation
4.4 Scene Classification: Self-Supervised vs. Supervised
In this section, we trained the scene classification classifier on top of the
weights produced by our pretext-task training. We also compare the results
obtained by the same model at both the third and the best epoch.
Overall, scene classification benefited more from training on the Black and
White Perturbation model. These results can be seen in Figures 4 and 5.
I was disappointed to find that the "hybrid" models did not outperform the
supervised model. My assumption is that, due to the pretext tasks' prematurely
truncated training, we did not fully leverage the potential of transfer learning.
However, I was pleasantly surprised by the models trained up to the third
epoch, which consistently achieved notable results across different metrics. These
models not only surpassed those trained until the final epoch but also outperformed some of the supervised models.
My hypothesis is that these models exhibit better generalization capabilities.
In contrast, models trained until the final epoch might struggle to generalize
effectively while also failing to fine-tune critical aspects, potentially leaving them in
a transitional state.
Figure 4: Scene classification training results with Gaussian Blurring pre-training, for the best-epoch and third-epoch models
Figure 5: Scene classification training results with Black and White Perturbation pre-training, for the best-epoch and third-epoch models
5 Code
With the submission, there is a list of files which I want to mention briefly:
1. 00-docu/: This folder contains all relevant documentation for the project,
including ".xlsx" files with pre-testing and testing results.
2. 02-Code/self_supervised/: This folder contains all the code necessary
to run the self-supervised model tasks 1, 2, and 3. Each task has both a
.ipynb and a .py file associated with it. The .py files contain all the required
functions for training, saving, and loading models. Arguments are passed
from the .ipynb files, which provide a comprehensive list of the parameters used.
3. 02-Code/supervised/: Similarly, this folder contains the code for the
supervised task, following the same structure as the self-supervised folder.
6 Annex I
Auxiliary table containing some of the initial parameter testing. Scores
were lower across the board.
Supervised Learning

                             model 1    model 2    model 3    model 4
Optimizer                    SGD        SGD        SGD        SGD
Batch Size                   128        128        128        128
Dropout Probability          0          0.5        0.5        0.5
Learning Rate                0.1        0.001      0.001      0.001
Weight Decay                 0          1.0E-04    1.0E-04    1.0E-04
Train Ratio                  0.8        0.9        0.10       0.11
Momentum (SGD only)          0.9        0.10       0.11       0.10
Number of Best Epoch         8          10         10         10
Accuracy at Best Epoch (%)   65         82         75         70
Loss at Best Epoch           -          -          -          -
Test Accuracy (WA) (%)       -          -          -          -

Table 1: Supervised learning parameters and results for the pre-testing models