FML - HW 2
Machine Learning Report on Classification Using Feed Forward Neural Networks

Srihari K B (23M0754), Sourav Rout (23M2152)
September 3, 2023

Abstract

This report analyses different choices of hyperparameters and their effect on classification performance for a simple (4-class) dataset and a digits dataset. It covers the impact of the learning rate and the number of training epochs on loss and accuracy, and reports results for regularization techniques used to avoid overfitting.

1 Introduction

The Feed Forward Neural Network (FFNN) is a fundamental architecture for classification tasks. Our primary objectives were to build an accurate FFNN model for both given datasets, to optimize hyperparameters such as the learning rate and the number of epochs, and to analyze their effect on accuracy and loss. We also explored techniques to prevent overfitting, such as L2 regularization and dropout. We experimented with several dropout fraction values to study their effect on accuracy and loss at different epochs, and finally chose the optimal rate for each layer.

2 Methodology

For each dataset, we follow these general steps:

1. Data Description: The simple (4-class) dataset has two input features per sample, which are classified into four classes. The digits dataset is more complex, with 64 features per sample representing digits from 0 to 9, classified into ten classes.

2. Model Architecture: The models are Feed Forward Neural Networks implemented in PyTorch Lightning. The code defines a base class, LitGenericClassifier, which is inherited by a specific classifier subclass for each of the two classification problems.

2.1 Base Class (LitGenericClassifier)

Initialization: The base class is initialized with a learning rate (lr) and sets up cross-entropy as the loss function.

Model Definition: The base class holds a placeholder model built with nn.Sequential(); the actual architecture is supplied by the derived classes.

Training Step: The training step method computes the loss and accuracy for a training batch. It runs a forward pass through the model, calculates the cross-entropy loss, and computes accuracy by comparing predicted labels with the ground truth. Loss and accuracy are logged for monitoring during training.

Validation Step: The validation step method mirrors the training step but is applied to a validation batch, logging validation loss and accuracy.

Test Step: The test step method computes loss and accuracy during testing and is used to evaluate the final model's performance.

Prediction: The predict method returns predicted labels for the given input data.

2.1.1 Derived Class 1 (LitSimpleClassifier)

Model Architecture: This class customizes the model for the simple classification task. The architecture consists of several linear layers with ReLU activations:
Input: 2 features
Hidden layers: 3 linear layers with ReLU activations
Output: 4 classes

Transform Input: The transform input method is a placeholder for data preprocessing; it is not used for this dataset.

Optimizer Configuration: The optimizer is set up for this model using the configured learning rate.
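Although the full implementation is not reproduced in this report, the structure described above could look roughly like the following sketch in PyTorch Lightning. Method names, logging keys, and the default learning rate are illustrative assumptions; the layer sizes (2 -> 128 -> 256 -> 128 -> 4) follow the description in Sections 2.1.1 and 3.1.

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl


class LitGenericClassifier(pl.LightningModule):
    """Base class: holds the loss function and the train/val/test/predict logic."""

    def __init__(self, lr=0.005):
        super().__init__()
        self.lr = lr
        self.loss_fn = nn.CrossEntropyLoss()
        self.model = nn.Sequential()  # actual layers supplied by the derived classes

    def forward(self, x):
        return self.model(x)

    def _shared_step(self, batch):
        # forward pass, cross-entropy loss, and accuracy against the ground truth
        x, y = batch
        logits = self(x)
        loss = self.loss_fn(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        return loss, acc

    def training_step(self, batch, batch_idx):
        loss, acc = self._shared_step(batch)
        self.log("train_loss", loss)
        self.log("train_acc", acc)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, acc = self._shared_step(batch)
        self.log("val_loss", loss)
        self.log("val_acc", acc)

    def test_step(self, batch, batch_idx):
        loss, acc = self._shared_step(batch)
        self.log("test_loss", loss)
        self.log("test_acc", acc)

    def predict(self, x):
        # label prediction on raw inputs; no gradients are needed here
        with torch.no_grad():
            return self(x).argmax(dim=1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


class LitSimpleClassifier(LitGenericClassifier):
    """Simple dataset: 2 input features -> three hidden layers -> 4 classes."""

    def __init__(self, lr=0.005):
        super().__init__(lr=lr)
        self.model = nn.Sequential(
            nn.Linear(2, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 4),
        )
```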
2.1.2 Derived Class 2 (LitDigitsClassifier)

Model Architecture: This class customizes the model for the digits classification task. The architecture includes linear layers, ReLU activations, and dropout layers to reduce overfitting:
Input: 64 features
Hidden layers: 3 linear layers with ReLU activations and dropout layers
Output: 10 classes (digits 0-9)

Transform Input: As in the previous class, the transform input method can be used for data preprocessing if needed.

Optimizer Configuration: The optimizer is set up for this model using the configured learning rate.

3. Training Procedure: Both models are trained with mini-batch gradient descent using the Adam optimizer; the batch size, learning rate, and number of epochs are varied as described in Section 3.

3 Hyperparameter Analysis

We investigated the impact of various hyperparameters on the model's performance: the number of hidden layers, the activation function, the optimizer, the loss function, the batch size, and the number of training epochs, evaluated through training and validation loss and accuracy.

3.1 Classification on Simple (4-class) Dataset

For the simple dataset, we performed the following analyses:

1. Learning Rate: We conducted experiments with learning rates ranging from 0.005 to 0.1. Each experiment was trained for a fixed number of epochs to identify the best learning rate.

2. Number of Epochs: We varied the number of training epochs, from 50 to 250, while keeping the learning rate constant.

3. Number of Hidden Layers:
Hidden layers: 3
Layer sizes: 128, 256, 128
To analyze the effect of depth, we experimented with different numbers of hidden layers. Adjusting the architecture in this way allowed us to assess the model's capacity at different depths and to find how many layers are needed for sufficient accuracy.

4. Batch Size: We tested batch sizes of 16, 32, 64, and 128. Larger batch sizes reduce the stochastic nature of the gradient updates.

5. Loss Function: We use cross-entropy loss, which is suitable for classification tasks. Alternative loss functions such as Mean Squared Error (MSE) or hinge loss exist; changing this hyperparameter lets us observe how different loss functions affect training and convergence.

3.2 Classification on Digits Dataset

Similar to the simple dataset, we performed the following analyses for the digits dataset:

1. Learning Rate: We conducted experiments with learning rates ranging from 0.005 to 0.1. Each experiment was trained for a fixed number of epochs to identify the best learning rate.

2. Number of Epochs: We varied the number of training epochs, from 50 to 250, while keeping the learning rate constant.

3. Number of Hidden Layers:
Hidden layers: 3
Layer sizes: 128, 128, 64
To analyze the effect of depth, we experimented with different numbers of hidden layers. Adjusting the architecture in this way allowed us to assess the model's capacity at different depths and to find how many layers are needed for sufficient accuracy.

4. Batch Size: We tested batch sizes of 16, 32, 64, and 128. Larger batch sizes reduce the stochastic nature of the gradient updates.

5. Loss Function: We use cross-entropy loss, which is suitable for classification tasks. Alternative loss functions such as Mean Squared Error (MSE) or hinge loss exist; changing this hyperparameter lets us observe how different loss functions affect training and convergence.
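To illustrate how such a sweep could be driven, the following is a minimal sketch using PyTorch Lightning's Trainer with the LitSimpleClassifier outlined in Section 2.1. The dataset objects simple_train and simple_val are placeholders for the train/validation splits, and the grids are illustrative subsets of the values listed above; the digits sweep is identical with the digits model.

```python
import itertools
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# Hypothetical sweep driver; simple_train / simple_val stand in for the
# train/validation splits of the simple (4-class) dataset.
learning_rates = [0.005, 0.01, 0.02, 0.05, 0.1]
batch_sizes = [16, 32, 64, 128]
epoch_choices = [50, 100, 250]

for n_epochs, lr, bs in itertools.product(epoch_choices, learning_rates, batch_sizes):
    model = LitSimpleClassifier(lr=lr)
    trainer = pl.Trainer(max_epochs=n_epochs)
    trainer.fit(model,
                DataLoader(simple_train, batch_size=bs, shuffle=True),
                DataLoader(simple_val, batch_size=bs))
    metrics = trainer.callback_metrics  # last logged validation metrics
    print(n_epochs, lr, bs,
          float(metrics["val_loss"]), float(metrics["val_acc"]))
```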
4 Results

4.1 Classification on Simple (4-class) Dataset

The following figures illustrate the results of our hyperparameter analyses for the simple dataset:

[Figure 1: Analysis for Simple (4-class) Dataset - Part 1. (a) Steps vs. Training Loss; (b) Steps vs. Training Accuracy]

[Figure 2: Analysis for Simple (4-class) Dataset - Part 2. (a) Steps vs. Validation Loss; (b) Steps vs. Validation Accuracy]

4.2 Classification on Digits Dataset

Similarly, we present the results of our experiments on the digits dataset:

[Figure 3: Analysis for Digits Dataset - Part 1. (a) Steps vs. Training Loss; (b) Steps vs. Training Accuracy]

[Figure 4: Analysis for Digits Dataset - Part 2. (a) Steps vs. Validation Loss; (b) Steps vs. Validation Accuracy]

5 Discussion

In this section, we discuss the findings from our hyperparameter analyses, highlighting the following points.

5.1 Classification on Simple (4-class) Dataset

1. Best Epoch: We evaluated several epoch counts for the model. Given the stochastic nature of training under the chosen hyperparameters, we selected the epoch count at which the accuracy settles at an optimal value. Choice of epochs: 50.

2. Choice of Optimizer: For our PyTorch Lightning-based classification model, the optimizer is a critical hyperparameter that significantly influences training and convergence. After experimenting with several relevant optimizers, we chose Adam because of the following advantages:
Adaptive Learning Rate: Adam adapts the learning rate during training for each parameter. It computes an individual learning rate per parameter, allowing larger updates for infrequently updated weights and smaller updates for frequently updated weights.
Momentum: Adam incorporates momentum, which helps accelerate convergence when gradient magnitudes vary. The momentum term keeps an exponentially weighted average of past gradients.
Bias Correction: Adam applies bias correction to the estimates of the first and second moments of the gradients. This correction mitigates the initialization bias towards zero, especially in the early stages of training.

3. Steps to Prevent Overfitting: We did not apply any explicit regularization technique to the simple dataset. However, our choice of Adam as the optimizer helped limit overfitting to some degree, since the combination of adaptive learning rates and momentum provides a mild regularization effect, potentially reducing the need for explicit dropout or weight decay.
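For reference, the standard Adam update rule makes the three properties above concrete; here g_t is the gradient at step t, and alpha, beta_1, beta_2, epsilon are the usual Adam hyperparameters (this is the textbook formulation, not code from our implementation):

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t        && \text{(momentum / first moment)} \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}    && \text{(second moment)} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad
\hat{v}_t  = \frac{v_t}{1-\beta_2^{t}}            && \text{(bias correction)} \\
\theta_t &= \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} && \text{(parameter update)}
\end{aligned}
```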
5.2 Classification on Digits Dataset

1. Best Epoch: Compared to the simple dataset, the digits dataset is fairly complex, since it is a 10-class classification problem. We evaluated several epoch counts for the model. Given the stochastic nature of training under the chosen hyperparameters, we selected the epoch count at which the accuracy settles at an optimal value. Choice of epochs: 250.

2. Choice of Optimizer: As for the simple dataset, we chose Adam after experimenting with several optimizers, for the reasons outlined in Section 5.1 (adaptive per-parameter learning rates, momentum, and bias correction). In addition, Adam has a regularization effect: the combination of adaptive learning rates and momentum provides a degree of regularization, potentially reducing the need for explicit dropout or weight decay.

3. Steps to Prevent Overfitting:
1. To prevent overfitting, we tried some common regularization techniques to see which works best for our classifier. Our first choice was L2 regularization, which encourages small weights by adding a penalty term to the loss function based on the L2 norm of the weight vector:
Regularized Loss = Loss + λ · ||w||^2
However, it reduced both the training and the validation accuracy, so we moved to other choices such as batch normalization, which had similar problems.
2. Our final choice of regularization was dropout, as it increased accuracy and prevented overfitting by introducing randomness during training. By randomly deactivating a fraction of neurons in each training batch, dropout encourages the network to rely on different sets of features, making the model more robust.
3. Our choice of Adam as the optimizer also helped limit overfitting to some degree, since the combination of adaptive learning rates and momentum provides a mild regularization effect.

4. Analysis with Different Hyperparameters:

Epochs   LR      Batch Size   Seed      Valid Loss   Valid Acc
20       0.001   16           6327983   0.1039       0.9750
20       0.01    64           6327983   0.1427       0.9650
20       0.02    128          2354078   0.1755       0.9650
20       0.1     16           6327983   2.3330       0.1100
50       0.001   128          2354078   0.0971       0.9725
50       0.005   16           6327983   0.0825       0.9775
50       0.01    128          6327983   0.1098       0.9800
50       0.02    32           2354078   0.2342       0.9425
50       0.05    16           2354078   2.3217       0.1100
50       0.1     128          6327983   2.3165       0.1050
100      0.001   128          2354078   0.1024       0.9800
100      0.002   128          2354078   0.0925       0.9775
100      0.005   16           6327983   0.0825       0.9775
100      0.01    16           2354078   0.1992       0.9525
100      0.02    64           6327983   0.2019       0.9575
100      0.05    16           6327983   1.7500       0.2825
100      0.1     128          6327983   2.3165       0.1050
250      0.005   128          812947    0.1241       0.9900

Table 1: Analysis of different hyperparameters on the digits dataset
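As an illustration of where dropout (and, optionally, the L2 penalty via weight decay) enters the digits model, a minimal sketch of LitDigitsClassifier is given below. The layer sizes (64 -> 128 -> 128 -> 64 -> 10) follow Sections 2.1.2 and 3.2; the dropout fractions and the weight_decay default are placeholders, not the values finally chosen in our experiments.

```python
import torch
import torch.nn as nn


class LitDigitsClassifier(LitGenericClassifier):
    """Digits dataset: 64 input features -> three hidden layers with dropout -> 10 classes."""

    def __init__(self, lr=0.005, weight_decay=0.0, dropout=(0.2, 0.2, 0.1)):
        super().__init__(lr=lr)
        self.weight_decay = weight_decay  # coefficient of the L2 penalty; 0 disables it
        p1, p2, p3 = dropout              # per-layer dropout fractions (placeholder values)
        self.model = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p1),
            nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p2),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p3),
            nn.Linear(64, 10),
        )

    def configure_optimizers(self):
        # Adam's weight_decay argument applies an L2 penalty on the weights
        # (the lambda of the regularized loss discussed in Section 5.2)
        return torch.optim.Adam(self.parameters(), lr=self.lr,
                                weight_decay=self.weight_decay)
```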
6 Conclusion

Our analysis of hyperparameters on classification tasks for both the simple (4-class) dataset and the digits dataset has provided valuable insights. We tuned all of the hyperparameters to obtain the best accuracy on each dataset and experimented with different choices of optimizer and regularization technique. We found that the choice of learning rate and the number of training epochs significantly impact the model's performance.

6.1 Classification on Simple (4-class) Dataset

For the simple dataset, we obtained the following results with the chosen hyperparameters:
Validation accuracy: 100%
Validation loss: 0.022
Number of epochs: 50
Optimizer: Adam

6.2 Classification on Digits Dataset

For the digits dataset, we obtained the following results with the chosen hyperparameters:
Validation accuracy: 99%
Validation loss: 0.124
Number of epochs: 250
Optimizer: Adam
Regularization: Dropout

This analysis provides a foundation for making informed hyperparameter choices when training neural networks for classification tasks.