FML - HW 2
Machine Learning Report on Classification
Using Feed Forward Neural Network
Srihari K B(23M0754), Sourav Rout(23M2152)
September 3, 2023
Abstract
This report presents the analysis of different design choices of hyperparameters and
their effects on classification tasks for both a simple (4-class) dataset and a digits
dataset. It also includes the impact of learning rate and the number of training epochs
on the loss and accuracy as well as results of using some regularization techniques to
avoid overfitting.
1 Introduction
The Feed Forward Neural Network (FFNN) stands as a fundamental architecture for various classification tasks. Our primary objective was to construct an accurate FFNN model for both of the given datasets, optimize hyperparameters such as learning rate and number of epochs, and analyze their effects on accuracy and loss. We also explored techniques to prevent overfitting, such as L2 regularization and dropout. We experimented with various dropout fraction values to investigate their effect on accuracy and loss at different epochs and finally chose the optimum rate for each layer.
2 Methodology
For each dataset, we follow these general steps:
1. Data Description:
The simple (4-class) dataset has two input features per sample, which we classify into 4 different classes. The digits dataset is relatively more complex, with 64 features per sample representing digits from 0 to 9, which are classified into the corresponding 10 classes.
2. Model Architecture:
The model architecture is designed for classification tasks using a Feed Forward Neural Network (FFNN) implemented in PyTorch Lightning. The code includes a base class LitGenericClassifier, which is inherited by a specific classifier subclass for each of the two classification problems.
2.1 Base Class (LitGenericClassifier)
Initialization: The base class is initialized with a learning rate (lr) parameter and sets up the loss function as cross-entropy loss.
Model Definition: The base class defines a placeholder for the model architecture using nn.Sequential(). The actual architecture is implemented in the derived classes.
Training Step: The training step method computes the loss and accuracy for a training batch. It applies a forward pass through the model, calculates the cross-entropy loss, and computes accuracy by comparing predicted labels with ground truth. Loss and accuracy values are logged for monitoring during training.
Validation Step: The validation step method is similar to the training step but is used for validation. It computes loss and accuracy for a validation batch and logs these values.
Test Step: The test step method computes loss and accuracy during testing. It is similar to the validation step but is used for evaluating the model's performance.
Prediction: The predict method predicts labels from input data.
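To make this structure concrete, the following is a minimal sketch of what such a base class could look like in PyTorch Lightning. Names beyond those mentioned above (e.g. the shared helper _shared_step) are our own illustrative choices, not necessarily the exact code used.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl


class LitGenericClassifier(pl.LightningModule):
    """Shared training/validation/test logic; subclasses fill in self.model."""

    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr
        self.loss_fn = nn.CrossEntropyLoss()   # cross-entropy loss
        self.model = nn.Sequential()           # placeholder architecture

    def forward(self, x):
        return self.model(x)

    def _shared_step(self, batch, stage):
        x, y = batch
        logits = self(x)
        loss = self.loss_fn(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log(f"{stage}_loss", loss)        # logged for monitoring
        self.log(f"{stage}_acc", acc)
        return loss

    def training_step(self, batch, batch_idx):
        return self._shared_step(batch, "train")

    def validation_step(self, batch, batch_idx):
        return self._shared_step(batch, "val")

    def test_step(self, batch, batch_idx):
        return self._shared_step(batch, "test")

    def predict(self, x):
        # Hard label predictions for raw inputs
        with torch.no_grad():
            return self(x).argmax(dim=1)
```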
2.1.1 Derived Class 1 (LitSimpleClassifier)
Model Architecture: This class customizes the model for the simple classification task. The architecture consists of multiple linear layers with ReLU activations:
Input: 2 features
Hidden Layers: 3 linear layers with ReLU activations
Output: 4 classes
Transform Input: The transform input method is a placeholder for any data preprocessing, although it is not implemented for this dataset.
Optimizer Configuration: It sets up the optimizer for this specific model using the defined learning rate.
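A corresponding sketch of the simple classifier, using the layer sizes reported in Section 3.1 (128, 256, 128); the exact widths and the identity transform shown here are illustrative assumptions:

```python
class LitSimpleClassifier(LitGenericClassifier):
    """FFNN for the simple dataset: 2 input features -> 4 output classes."""

    def __init__(self, lr=1e-3):
        super().__init__(lr=lr)
        self.model = nn.Sequential(
            nn.Linear(2, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 4),                 # logits for 4 classes
        )

    def transform_input(self, x):
        return x                               # no preprocessing needed here

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```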
2.1.2 Derived Class 2 (LitDigitsClassifier)
Model Architecture: This class customizes the model for the digits classification task. The architecture includes linear layers, ReLU activations, and dropout layers to prevent overfitting:
Input: 64 features
Hidden Layers: 3 linear layers with ReLU activations and dropout layers
Output: 10 classes (digits 0-9)
Transform Input: Similar to the previous class, the transform input method can be used for data preprocessing if needed.
Optimizer Configuration: It sets up the optimizer for this specific model using the defined learning rate.
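A matching sketch for the digits classifier, using the layer sizes from Section 3.2 (128, 128, 64); the dropout fraction of 0.2 is only an illustrative value, since several dropout fractions were tried:

```python
class LitDigitsClassifier(LitGenericClassifier):
    """FFNN for the digits dataset: 64 input features -> 10 output classes."""

    def __init__(self, lr=1e-3, dropout=0.2):
        super().__init__(lr=lr)
        self.model = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 10),                 # logits for digits 0-9
        )

    def transform_input(self, x):
        return x.float()                       # cast raw features if needed

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```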
3. Training Procedure: Each model is trained with mini-batch updates using the Adam optimizer; the batch size, learning rate, and number of training epochs are the main settings we vary, as detailed in Section 3. A minimal end-to-end training sketch follows.
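The sketch below is illustrative rather than our exact script: it assumes the digits data matches scikit-learn's load_digits (64 features, labels 0-9) and uses an assumed 80/20 train/validation split; the batch size, learning rate, and epoch count correspond to the best-performing row of Table 1.

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the 64-feature digits data (assumed equivalent to sklearn's load_digits)
X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

train_ds = TensorDataset(torch.tensor(X_tr, dtype=torch.float32),
                         torch.tensor(y_tr, dtype=torch.long))
val_ds = TensorDataset(torch.tensor(X_val, dtype=torch.float32),
                       torch.tensor(y_val, dtype=torch.long))

train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=128)

model = LitDigitsClassifier(lr=0.005)
trainer = pl.Trainer(max_epochs=250)           # tuned values from Table 1
trainer.fit(model, train_loader, val_loader)
```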
3 Hyperparameter Analysis
We investigated the impact of various hyperparameters on the model's performance, focusing on the number of hidden layers, activation functions, optimizer choice, loss function, batch size, and the number of training epochs, evaluated through metrics such as training and validation accuracy and loss.
3.1 Classification on Simple (4-class) Dataset
For the simple dataset, we performed the following analyses:
1. Learning Rate: We conducted experiments with different learning rates, ranging from 0.1 down to 0.005. Each experiment was trained for a fixed number of epochs to identify the best choice of learning rate.
2. Number of Epochs: We varied the number of training epochs while keeping the
learning rate constant. The epochs ranged from 50 to 250.
3. Number of Hidden Layers:
Hidden Layers: 3
Layer Sizes: 128, 256, 128
To analyze the effect of the number of hidden layers, we experimented with different depths. Adjusting the architecture in this way allowed us to assess the model's capacity to learn at different depths and to find how many layers were needed for good enough accuracy.
4. Batch Size: We tested batch sizes of 16, 32, 64, and 128. Larger batch sizes reduce the stochastic nature of the gradient updates during training.
5. Loss function: We currently use the Cross-Entropy Loss, suitable for classification
tasks. However, there are alternative loss functions such as Mean Squared Error (MSE)
or Hinge Loss. Modifying the loss function hyperparameter allows us to observe how
different loss functions affect training and convergence.
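As an illustration only (not our training configuration), these loss functions can be swapped in PyTorch with minimal code changes; note that MSE requires one-hot targets rather than integer class labels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ce_loss = nn.CrossEntropyLoss()     # our choice: logits vs. integer labels
hinge_loss = nn.MultiMarginLoss()   # multi-class hinge loss, same call signature

def mse_loss(logits, targets, num_classes=4):
    # MSE compares predicted probabilities against one-hot encoded labels
    probs = torch.softmax(logits, dim=1)
    return F.mse_loss(probs, F.one_hot(targets, num_classes).float())
```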
3.2 Classification on Digits Dataset
Similar to the simple dataset, we performed learning rate and number of epochs analyses for
the digits dataset.
1. Learning Rate: We conducted experiments with different learning rates, ranging from 0.1 down to 0.005. Each experiment was trained for a fixed number of epochs to identify the best choice of learning rate.
2. Number of Epochs: We varied the number of training epochs while keeping the
learning rate constant. The epochs ranged from 50 to 250.
3. Number of Hidden Layers:
Hidden Layers: 3
Layer Sizes: 128, 128, 64
To analyze the effect of the number of hidden layers, we experimented with different depths. Adjusting the architecture in this way allowed us to assess the model's capacity to learn at different depths and to find how many layers were needed for good enough accuracy.
4. Batch Size: We tested batch sizes of 16, 32, 64, and 128. Larger batch sizes reduce the stochastic nature of the gradient updates during training.
5. Loss function: We currently use the Cross-Entropy Loss, suitable for classification
tasks. However, there are alternative loss functions such as Mean Squared Error (MSE)
or Hinge Loss. Modifying the loss function hyperparameter allows us to observe how
different loss functions affect training and convergence.
4 Results
4.1 Classification on Simple (4-class) Dataset
The following figures illustrate the results of our hyperparameter analyses for the simple
dataset:
Figure 1: Analysis for Simple (4-class) Dataset - Part 1. (a) Steps vs. Training Loss; (b) Steps vs. Training Accuracy.
Figure 2: Analysis for Simple (4-class) Dataset - Part 2. (a) Steps vs. Validation Loss; (b) Steps vs. Validation Accuracy.
4.2 Classification on Digits Dataset
Similarly, we present the results of our experiments on the digits dataset:
Figure 3: Analysis for Digits Dataset - Part 1. (a) Steps vs. Training Loss; (b) Steps vs. Training Accuracy.
Figure 4: Analysis for Digits Dataset - Part 2. (a) Steps vs. Validation Loss; (b) Steps vs. Validation Accuracy.
5 Discussion
In this section, we discuss the findings from our hyperparameter analyses. We highlight the
following points:
5.1 Classification on Simple (4-class) Dataset
1. Best Epoch: We checked several epoch values for the model and, given the stochastic nature of training under the chosen hyperparameters, selected the epoch count at which the accuracy settles to an optimum value.
Choice of epochs: 50
2. Choice of Optimizer:
In the context of our PyTorch Lightning-based classification model, the choice of optimizer is a critical hyperparameter that significantly influences the training process and model convergence. After experimenting with several relevant optimizers, we decided to go with the Adam optimizer because of the following advantages:
Adaptive Learning Rate: Adam adapts the learning rate during training for each parameter. It computes an individual learning rate for each parameter, allowing for larger
updates for infrequently updated weights and smaller updates for frequently updated
weights.
Momentum: Adam incorporates the concept of momentum, which helps accelerate
convergence in the presence of gradients with varying magnitudes. The momentum
term keeps track of the exponentially weighted average of past gradients.
Bias Correction: Adam applies bias correction to the estimates of the first and second moments of the gradients. This correction helps mitigate the initialization bias towards zero, especially in the early stages of training. (A minimal sketch of the Adam update rule appears after this list.)
3. Steps to Prevent Overfitting:
We did not use any explicit regularization technique for the simple dataset. However, our choice of Adam as the optimizer helped prevent overfitting to some degree, since the combination of adaptive learning rates and momentum provides a mild regularization effect, potentially reducing the need for explicit dropout or weight decay.
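For intuition, here is a minimal NumPy sketch of the Adam update for a single parameter array, showing the momentum term, the adaptive per-parameter step, and the bias correction described above; the default constants are the usual Adam values, not something specific to our experiments.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad            # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction (counters init at zero)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return w, m, v
```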
5.2 Classification on Digits Dataset
1. Best Epoch: Compared to the simple dataset, the digits dataset was fairly complex to deal with, since it is a 10-class classification problem. We checked several epoch values for the model and, given the stochastic nature of training under the chosen hyperparameters, selected the epoch count at which the accuracy settles to an optimum value.
Choice of epochs: 250
2. Choice of Optimizer:
In the context of our PyTorch Lightning-based classification model, the choice of optimizer is a critical hyperparameter that significantly influences the training process and model convergence. After experimenting with several relevant optimizers, we decided to go with the Adam optimizer because of the following advantages:
Adaptive Learning Rate: Adam adapts the learning rate during training for each parameter. It computes an individual learning rate for each parameter, allowing for larger
updates for infrequently updated weights and smaller updates for frequently updated
weights.
Momentum: Adam incorporates the concept of momentum, which helps accelerate
convergence in the presence of gradients with varying magnitudes. The momentum
term keeps track of the exponentially weighted average of past gradients.
Bias Correction: Adam applies bias correction to the estimates of the first and second
moments of the gradients. This correction helps mitigate the initialization bias towards
zero, especially in the early stages of training.
Regularization Effect: The combination of adaptive learning rates and momentum provides a degree of regularization, potentially reducing the need for explicit dropout or weight decay.
3. Steps to Prevent Overfitting:
1. To prevent overfitting, we tried some of the common regularization techniques to see which choice works best for our classifier. Our first choice was L2 regularization, which encourages small weights by adding a penalty term to the loss function based on the squared L2 norm of the weight vector:

Regularized Loss = Loss + λ · ‖w‖₂²

However, it reduced both the training and the validation accuracy, so we moved on to other choices such as batch normalization, which had similar problems.
2. Our final choice of regularization was dropout, as it increased the accuracy and prevented overfitting by introducing randomness during training. By randomly deactivating a fraction of neurons in each training batch, dropout encourages the network to rely on different sets of features, making the model more robust. (A brief sketch of how both options would be enabled appears after Table 1.)
3. Our choice of Adam as the optimizer also helped prevent overfitting to some degree, since the combination of adaptive learning rates and momentum provides a degree of regularization.
4. Analysis with Different Hyperparameters:
Epochs   LR      Batch Size   Seed      Valid Loss   Valid Acc
20       0.001   16           6327983   0.1039       0.9750
20       0.01    64           6327983   0.1427       0.9650
20       0.02    128          2354078   0.1755       0.9650
20       0.1     16           6327983   2.3330       0.1100
50       0.001   128          2354078   0.0971       0.9725
50       0.005   16           6327983   0.0825       0.9775
50       0.01    128          6327983   0.1098       0.9800
50       0.02    32           2354078   0.2342       0.9425
50       0.05    16           2354078   2.3217       0.1100
50       0.1     128          6327983   2.3165       0.1050
100      0.001   128          2354078   0.1024       0.9800
100      0.002   128          2354078   0.0925       0.9775
100      0.005   16           6327983   0.0825       0.9775
100      0.01    16           2354078   0.1992       0.9525
100      0.02    64           6327983   0.2019       0.9575
100      0.05    16           6327983   1.7500       0.2825
100      0.1     128          6327983   2.3165       0.1050
250      0.005   128          812947    0.1241       0.9900

Table 1: Analysis on different hyperparameters of the digits dataset
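For reference, a brief sketch of how the two regularization options discussed above would be enabled in PyTorch; the λ value (weight_decay) and the dropout probability are illustrative placeholders, not our tuned settings.

```python
import torch
import torch.nn as nn

# A small stand-in model; in our experiments this role is played by the digits classifier.
# Dropout (our final choice) is inserted between layers inside the model definition.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p=0.2),
                      nn.Linear(128, 10))

# L2 regularization: Adam's weight_decay plays the role of lambda in the penalty term.
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=1e-4)
```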
6 Conclusion
In conclusion, our analysis of hyperparameters on classification tasks for both the simple (4-class) dataset and the digits dataset has provided valuable insights. We tuned the hyperparameters to obtain the best accuracy on each dataset and experimented with different optimizers and regularization techniques. We found that the choice of learning rate and the number of training epochs significantly impact the model's performance.
6.1 Classification on Simple (4-class) Dataset
For the simple dataset, we obtained the following results and choices of hyperparameters:
Validation Accuracy: 100%
Validation Loss: 0.022
No. of epochs: 50
Choice of optimizer: Adam
6.2 Classification on Digits Dataset
For the digits dataset, we obtained the following results and choices of hyperparameters:
Validation Accuracy: 99%
Validation Loss: 0.124
No. of epochs: 250
Choice of optimizer: Adam
Choice of regularization: Dropout
This analysis provides a foundation for making informed hyperparameter choices when
training neural networks for classification tasks.