Semester Project Report
Machine Learning Classifiers Performance Evaluation
1/16/2022

1. INTRODUCTION

This project comprises multiple tasks, including the implementation of classifiers, feature engineering, performance evaluation, dimensionality reduction, cross-validation, and timing analysis. Because the data attributes (features) are Boolean, only classification algorithms will be implemented. Before moving forward, some context is provided about the dataset.

1.1. Monk's Problem Dataset

The dataset is chosen according to the following calculation:

ID = 327014; last 2 digits = 14; 14 mod 21 = 14

This dataset has the following characteristics:

Number of attributes = 6
Number of examples = 432

The dataset describes artificial robots, each characterized by the following attributes:

x1: head_shape ∈ {round, square, octagon}
x2: body_shape ∈ {round, square, octagon}
x3: is_smiling ∈ {yes, no}
x4: holding ∈ {sword, balloon, flag}
x5: jacket_color ∈ {red, yellow, green, blue}
x6: has_tie ∈ {yes, no}

Each sub-attribute (possible value) of these attributes accounts for one column in the dataset and takes a value of either 1 or -1, which expands the attributes over 17 columns. The label column holds the target and takes the value 0 or 1. The dataset has already been arranged numerically, so the initial step of numerical assignment has been skipped. However, further feature engineering is required for the tasks. The details of the tasks to be performed for this project are provided in the next section.

1.2. List of Tasks

You will use any four of the machine learning algorithms, such as Decision Trees, Naïve Bayes, k-NNs, Random Forest, SVMs, and Neural Networks, on a UCI dataset. You will also use feature engineering and observe how different machine learning algorithms are affected by:

Set-1: Random features: generate 5 random features for each input; you can use different techniques such as a random number generator, binning, etc. Add random features of different ranges.
Set-2: Derived features: generate 5 derived features from the original input features, e.g. sum, difference, square, etc.
Set-3: Apply PCA to reduce the dimensionality of the dataset (keep 95% variance) and then run the classification algorithms.

Train 4 classifiers and report accuracy for:

• Original features
• Original features + Set-1
• Original features + Set-2
• Set-3
• Original features + (Set-1, Set-2, Set-3)

Prepare each dataset for k-fold cross-validation and choose a suitable value of k (typically between 5 and 10).

1.3. Feature Engineering

Creating random features and derived (or interaction) features requires some manipulation of the original features. The randint() routine in Python can be used to generate the random features, while the derived features are built with mathematical operations such as multiplication. Feature engineering offers the following benefits:

a. Flexibility: if features are selected carefully, they can produce good results even when the model itself is not very strong.
b. Better results: carefully engineered features are necessary for better results.
c. Simpler models: feature engineering directly affects model complexity; good feature selection is part of solving the problem.

1.4. Classifiers

For this project, the following classifiers will be implemented, considering the Boolean nature of the dataset:

a. Decision Trees
b. Naïve Bayes
c. k-NNs
d. Logistic Regression
e. SVC

A sketch of one possible evaluation setup for these classifiers is given below.
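The following is a minimal sketch, assuming scikit-learn, of how the listed classifiers might be evaluated with k-fold cross-validation. The hyper-parameters, the choice of BernoulliNB as the Naïve Bayes variant (suited to the 1/-1 columns), and the helper name evaluate_classifiers are assumptions for illustration, not the project's actual code.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def evaluate_classifiers(X, y, k=7):
    # Return the mean cross-validated accuracy of each classifier.
    classifiers = {
        'Decision Tree':       DecisionTreeClassifier(),
        'Naive Bayes':         BernoulliNB(),                       # assumed variant for the 1/-1 columns
        'k-NN':                KNeighborsClassifier(n_neighbors=5),
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'SVC':                 SVC(),
    }
    cv = StratifiedKFold(n_splits=k, shuffle=True)                   # k folds, reshuffled on every run
    return {name: cross_val_score(clf, X, y, cv=cv).mean()
            for name, clf in classifiers.items()}

With X and y prepared as in the next section, evaluate_classifiers(X, y) would return a dictionary with one mean accuracy per classifier.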
2. TASKS

The platform used for the tasks is Google's Colab. After adding the necessary libraries, the first step is to open the *.csv file so that the different classifiers can be run on it, using the following code.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import random

UCIMonk = pd.read_csv('14_monks-problems1.csv', sep=',', header=None, skiprows=1)

2.1. Original Dataset

First of all, the classifiers are run on the original data: the Decision Tree first and then the other classifiers.

X = UCIMonk.iloc[:, :-1].values
y = UCIMonk.iloc[:, UCIMonk.shape[1] - 1].values

The train_test_split routine could be used to split the data into two sets, training the model on the training set and then evaluating it on the test set. However, this may produce misleading results, especially when the amount of data is limited. Instead of this approach, k-fold cross-validation is used to split the data into training and validation sets. The split is dynamic and the sets change with each iteration. Generally k is between 5 and 10. As the split is random, each run of the code generates different results; in our case the score is higher for values of k between 7 and 9. The average training and inference times of all the classifiers are tabulated below.

Classifier              Average Training Time    Average Inference Time
Decision Tree           1.72E-02                 9.10E-03
k-NN                    4.13E-02                 1.87E-02
Logistic Regression     5.83E-02                 6.87E-02
Naïve Bayes             1.89E-02                 1.08E-02

As far as training time is concerned, the Decision Tree and Naïve Bayes have done better than the other classifiers because of their structure. For the decision tree, a full-length tree is constructed; however, this resulted in overfitting and poor generalization on the test set. For Logistic Regression, both training and inference times are quite high. k-NN, despite being slow during the training phase, has outperformed both the Decision Tree and Logistic Regression. The value of k in this case is chosen to be 5, which gave superior performance without overfitting.

The errors are plotted in the figure below. It is clear that the performance of all classifiers is not very good on the original dataset; Naïve Bayes, Logistic Regression and SVC (in the Python file) perform particularly poorly.

2.2. Random Dataset

Set-1: Random features: generate 5 random features for each input; you can use different techniques such as a random number generator, binning, etc. Add random features of different ranges.

To generate the random features, the randint() routine is used to randomly select features from the already provided features, which adds redundant features to the dataset. Adding redundant features reduces the generalization capability of a classifier and its overall accuracy. Since there are 6 attributes in total, the first five of them have been used to create the random features: from each of the 5 attributes, one attribute is chosen at random to build the new dataset. A sketch of this step is given at the end of this subsection.

The overall timing performance of all the algorithms has degraded, owing to the fact that the redundant features increased the overhead without helping classification.

Classifier              Average Training Time    Average Inference Time
Decision Tree           1.92E-02                 9.47E-03
k-NN                    4.88E-02                 2.39E-02
Logistic Regression     6.79E-02                 4.89E-02
Naïve Bayes             1.94E-02                 1.18E-02

The training error of the Decision Tree and k-NN algorithms has improved slightly; however, the overall performance remains below par. The error performance of k-NN is better than that of the other models and improved after the addition of the random features, as more neighbors have been added at the redundant locations.
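As referenced above, the following minimal sketch illustrates the Set-1 step: copying randomly chosen existing columns and appending purely random columns of different ranges. The specific ranges and the helper name add_random_features are assumptions for illustration, not taken from the project code.

import numpy as np
import random

def add_random_features(X, n_new=5, seed=None):
    # Append n_new extra columns to X: some are copies of randomly chosen
    # existing columns (redundant features), others are drawn from random
    # number generators with different ranges.
    rng = random.Random(seed)
    new_cols = []
    for i in range(n_new):
        if i % 2 == 0:
            new_cols.append(X[:, rng.randint(0, X.shape[1] - 1)].copy())
        else:
            low, high = rng.choice([(0, 1), (-5, 5), (0, 100)])   # illustrative ranges
            new_cols.append(np.random.uniform(low, high, size=X.shape[0]))
    return np.column_stack([X] + new_cols)

The augmented matrix is then passed to the same cross-validation loop as before.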
2.3. Derived Features

Set-2: Derived features: generate 5 derived features from the original input features, e.g. sum, difference, square, etc.

For the derived features, sub-attributes of the first five attributes have been multiplied element-wise. This significantly improved the timing performance of the decision tree, especially the inference time, because a derived feature generally combines the effect of multiple features into one and conveys more information than the individual features.

Classifier              Average Training Time    Average Inference Time
Decision Tree           1.98E-02                 1.02E-02
k-NN                    4.56E-02                 2.05E-02
Logistic Regression     6.49E-02                 5.08E-02
Naïve Bayes             1.74E-02                 1.12E-02

The error performance of the k-NN algorithm has improved slightly further after the addition of the derived features. For the decision tree the error is still high, and for the other algorithms it is very high.

2.4. PCA

Set-3: Apply PCA to reduce the dimensionality of the dataset (keep 95% variance) and then run the classification algorithms.

PCA is used for dimensionality reduction and resulted in the scatter plot shown in the figure. A sketch of this step is given at the end of the report.

a. Accuracy of each algorithm on the original features: rank the algorithms w.r.t. accuracy and discuss why certain algorithms performed better than others.

Both the k-NN and decision tree algorithms have outperformed the others. A full-length decision tree is constructed, and therefore its performance is better than that of the other algorithms. For k-NN, k = 3, which is a fairly balanced choice of neighbors.

b. Training time of each algorithm on the original features: rank the algorithms w.r.t. training time (smallest is best) and discuss why certain algorithms take less training time than others.

The training time of the decision tree algorithm is better than that of the others because of its exhaustive nature. Naïve Bayes has also done fairly well.

c. Inference time of each algorithm on the original features: rank the algorithms w.r.t. inference time (smallest is best) and discuss why certain algorithms take less inference time than others.

The inference times of the Naïve Bayes and k-NN algorithms are better than those of the others. For k = 3 the performance of the k-NN algorithm is optimal: classification becomes easier than with k = 1, while it is computationally faster than with higher values of k.

d. Impact of irrelevant features on algorithm accuracy: which algorithms were most affected by adding irrelevant features and which were least affected? Why?

After the addition of the redundant features, all algorithms were affected except k-NN. This is because the redundant features only added overhead, except for k-NN, where the redundant data might help to filter out noise.

e. Impact of redundant (derived) features on algorithm accuracy: which algorithms were most affected by adding redundant features and which were least affected? Why?

The derived features have helped the algorithms because each derived feature possesses the combined information of multiple features and therefore conveys more information.

f. Robustness of different algorithms against noise: which algorithms were most affected by adding noise to the data and which were least affected? Why?

The decision tree is most affected by the addition of noise because it was already overfitting.
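For completeness, the following is a minimal sketch, assuming scikit-learn, of the Set-3 PCA step referenced in Section 2.4: keeping the smallest number of principal components that preserve 95% of the variance before re-running the classifiers. The helper name reduce_with_pca is only for illustration.

from sklearn.decomposition import PCA

def reduce_with_pca(X, variance=0.95):
    # A float n_components in (0, 1) tells scikit-learn to keep the smallest
    # number of components whose cumulative explained variance reaches that fraction.
    pca = PCA(n_components=variance)
    X_reduced = pca.fit_transform(X)
    print("Kept", pca.n_components_, "components,",
          round(pca.explained_variance_ratio_.sum() * 100, 2), "% variance retained")
    return X_reduced

The reduced matrix X_reduced is then evaluated with the same k-fold loop used for the other feature sets.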