Uploaded by imagenationpk

Machine Learning Classifiers Performance Evaluation Report

Semester Project Report
This project comprises multiple tasks, including implementation of classifiers, feature engineering, performance evaluation,
dimensionality reduction, cross-validation and timing analysis. Because the data attributes (features) are Boolean, only
classification algorithms are implemented. Before moving on, some context is provided about the dataset.
1.1. Monk’s Problem Dataset
The dataset is chosen according to the following calculation:
CMS ID = 327014
last 2 digits = 14; 14 mod 21 = 14
This dataset has the following characteristics:
Number of attributes = 6
Number of examples = 432
The dataset describes artificial robots, each characterized by the following six attributes:
x1: head_shape ∈ {round, square, octagon}
x2: body_shape ∈ {round, square, octagon}
x3: is_smiling ∈ {yes, no}
x4: holding ∈ {sword, balloon, flag}
x5: jacket_color ∈ {red, yellow, green, blue}
x6: has_tie ∈ {yes, no}
Each sub-attribute (possible value) of these attributes occupies one column in the dataset and takes the value 1 or -1. This expands
the six attributes over 17 columns. The label column holds the target and takes the value 0 or 1.
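The column count can be checked with a short sketch. It assumes the attribute values listed above and a +1/-1 flag per sub-attribute; the column order and the names ATTRIBUTES and encode are illustrative and may not match the layout of the actual CSV file.

```python
from itertools import product

# Attribute values as listed in Section 1.1 (column order is an assumption).
ATTRIBUTES = {
    "head_shape": ["round", "square", "octagon"],
    "body_shape": ["round", "square", "octagon"],
    "is_smiling": ["yes", "no"],
    "holding": ["sword", "balloon", "flag"],
    "jacket_color": ["red", "yellow", "green", "blue"],
    "has_tie": ["yes", "no"],
}


def encode(robot):
    """Encode one robot (dict of attribute -> value) as a list of +1/-1 flags."""
    row = []
    for name, values in ATTRIBUTES.items():
        # One column per sub-attribute: +1 if it matches, -1 otherwise.
        row.extend(1 if robot[name] == v else -1 for v in values)
    return row


# Enumerate every combination of attribute values.
robots = [dict(zip(ATTRIBUTES, combo)) for combo in product(*ATTRIBUTES.values())]
encoded = [encode(r) for r in robots]

print(len(encoded))     # 432 distinct robots
print(len(encoded[0]))  # 17 encoded columns
```

The 432 examples therefore correspond to every possible combination of the six attributes (3 x 3 x 2 x 3 x 4 x 2), and the 17 columns to the sum of the value counts (3 + 3 + 2 + 3 + 4 + 2).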
The dataset is already encoded numerically, so the initial step of numerical assignment has been skipped. However,
further feature engineering is required for the tasks. The tasks to be performed for this project are detailed in
the next section.
1.2. List of Tasks
You will use any four machine learning algorithms, such as Decision Trees, Naïve Bayes, KNNs, Random Forest, SVMs,
and Neural Networks, on a UCI dataset. You will also use feature engineering and observe how different machine learning
algorithms are affected by:
Set-1: Random features: generate 5 random features for each input; you can use different techniques like, random number
generator, binning etc. Add random features of different ranges.
Set-2: Derived Features: generate 5 derived features from the original input features like, sum, difference, square, etc.
Set-3: Apply PCA to reduce the dimensionality of dataset (Keep 95% variance) and then run the classification algorithm.
Train 4 classifiers and report accuracy for:
Original Features
Original features + Set-1
Original features + Set-2
Original features + (Set-1, Set-2, Set-3)
Prepare each dataset for k-fold cross-validation; choose a suitable value of k (typically between 5 and 10).
1.3. Feature Engineering
Creating random features and derived (or interaction) features requires some manipulation of the existing features. The
randint() routine in Python can generate the random features, while the derived features use mathematical
operations such as multiplication. Feature engineering offers the following benefits:
If features are selected carefully, they can produce good results even when the model is not very good.
Better results
Carefully engineered features are necessary for better results.
Simpler models
Feature engineering directly affects model complexity. Good feature selection is part of solving the problem.
1.4. Classifiers
For this project, the following classifiers are implemented, considering the Boolean nature of the dataset.
Decision Trees
Naïve Bayes
Logistic Regression
The platform used for the task is Google Colab. After importing the necessary libraries, the first step is to open the
*.csv file so that the different classifiers can be run on it, using the following code.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset: skip the first row of the file and do not infer a header
UCIMonk = pd.read_csv('14_monks-problems1.csv', sep=',', header=None, skiprows=1)
2.1. Original Dataset
First, the classifiers are run on the original data, starting with the Decision Tree and followed by the other classifiers.
# Features: every column except the last; labels: the last column
X = UCIMonk.iloc[:, :-1].values
y = UCIMonk.iloc[:, -1].values
The train_test_split routine can split the data into two sets, so that the model is trained on the training set and then evaluated
on the test set. However, a single split may give misleading results, especially when data is limited. Instead,
k-fold cross-validation is used to split the data into training and validation sets. The split changes with each iteration, so
every sample is used for validation exactly once. Generally k is between 5 and 10. Because the split is random, each run of the code
generates different results. In our case, the score is higher for values of k between 7 and 9.
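The cross-validation and timing procedure can be sketched as follows. This is not the report's actual code: it uses randomly generated stand-in data with the same shape as the encoded dataset, assumes scikit-learn's cross_validate, assumes BernoulliNB as the Naïve Bayes variant, and fixes k = 7 and the random seeds for repeatability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (432 rows, 17 columns of +1/-1); replace with the real X and y.
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(432, 17))
y = rng.integers(0, 2, size=432)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": BernoulliNB(),  # assumed variant; binarizes inputs at 0
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fixing random_state makes the shuffled folds repeatable between runs.
cv = KFold(n_splits=7, shuffle=True, random_state=0)

results = {}
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=cv, scoring="accuracy")
    results[name] = {
        "accuracy": scores["test_score"].mean(),
        "train_time": scores["fit_time"].mean(),       # seconds per fold
        "inference_time": scores["score_time"].mean(),  # seconds per fold
    }
    print(name, results[name])
```

Note that score_time covers prediction plus metric computation on the validation fold, so it only approximates pure inference time. Leaving random_state unset reproduces the run-to-run variation described above.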
Average training and inference times are tabulated below for all the classifiers.
Table: Average Training Time and Average Inference Time for Decision Tree, Logistic Regression and Naïve Bayes.
As far as training time is concerned, the Decision Tree and Naïve Bayes have done better than the other classifiers, owing
to their structure. For the decision tree, a full-length tree is constructed; however, this resulted in overfitting and poor
generalization performance on the test set. For Logistic Regression, both training and inference times are quite high. k-NN,
despite being poor during the training phase, has outperformed both the Decision Tree and Logistic Regression algorithms. The
value of k in this case is chosen to be 5, which resulted in superior performance without overfitting.
The errors are plotted in the figure below. The performance of all classifiers is clearly poor on the original dataset;
Naïve Bayes, Logistic Regression and SVC (in the Python file) perform particularly poorly.
2.2. Random Dataset
Set-1: Random features: generate 5 random features for each input; you can use different techniques like, random number
generator, binning etc. Add random features of different ranges.
To generate random features, the random number routine (randint()) is used to randomly select features from the features
already provided. This adds redundant features to the dataset, which reduces the generalization capability of the
classifier and its overall accuracy.
Since there are 6 attributes in total, the first five of them have been used to create the random features. From each of the 5
attributes, one sub-attribute is chosen randomly to generate the dataset. The overall timing performance of all the algorithms has
degraded, because the added redundant features increased the overhead without helping in
Table: Average Training Time and Average Inference Time for Decision Tree, Logistic Regression and Naïve Bayes.
The training error of the Decision Tree and k-NN algorithms has improved slightly; however, overall
performance remains below par. The error performance of k-NN is better than that of the other models and improved after the
addition of the random features, as more neighbors have been added to redundant places.
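The random-feature construction described in this section can be sketched as follows. The column ranges assume the 17 columns are ordered head_shape, body_shape, is_smiling, holding, jacket_color, has_tie; the names ATTRIBUTE_COLUMNS and add_random_features are illustrative, and NumPy's Generator.integers stands in for randint().

```python
import numpy as np

# Assumed column ranges of the first five attributes in the 17-column encoding
# (head_shape, body_shape, is_smiling, holding, jacket_color).
ATTRIBUTE_COLUMNS = [range(0, 3), range(3, 6), range(6, 8), range(8, 11), range(11, 15)]


def add_random_features(X, seed=None):
    """Append one randomly chosen existing column per attribute to X."""
    rng = np.random.default_rng(seed)
    # Pick one sub-attribute column index from each attribute's range
    # (the upper bound of rng.integers is exclusive).
    chosen = [int(rng.integers(cols.start, cols.stop)) for cols in ATTRIBUTE_COLUMNS]
    return np.hstack([X, X[:, chosen]]), chosen


# Stand-in data: 432 rows, 17 columns of +1/-1.
X = np.random.default_rng(0).choice([-1, 1], size=(432, 17))
X_random, chosen = add_random_features(X, seed=1)

print(X_random.shape)  # (432, 22)
print(chosen)          # indices of the duplicated columns
```

Because the five new columns are exact copies of existing ones, they carry no new information, which is why the section describes them as redundant.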
2.3. Derived Features
Set-2: Derived Features: generate 5 derived features from the original input features like, sum, difference, square, etc.
For derived features, sub-attributes of the first five attributes have been multiplied element-wise. This significantly improved
the timing performance of the decision tree, especially the inference time, because derived features
generally combine the effect of multiple features into one and convey more information than the individual features.
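A minimal sketch of element-wise derived features is given below. The report does not state which columns were multiplied, so the pairs in COLUMN_PAIRS (the first two sub-attribute columns of each of the first five attributes, under the same assumed column order) are a hypothetical choice, as is the name add_derived_features.

```python
import numpy as np

# Hypothetical choice of column pairs: the first two sub-attribute columns
# of each of the first five attributes.
COLUMN_PAIRS = [(0, 1), (3, 4), (6, 7), (8, 9), (11, 12)]


def add_derived_features(X):
    """Append the element-wise product of each column pair to X."""
    left = X[:, [a for a, _ in COLUMN_PAIRS]]
    right = X[:, [b for _, b in COLUMN_PAIRS]]
    return np.hstack([X, left * right])


# One example row in the assumed 17-column +1/-1 encoding.
X = np.array([[1, -1, -1,  -1, 1, -1,  1, -1,  -1, -1, 1,  1, -1, -1, -1,  1, -1]])
X_derived = add_derived_features(X)

print(X_derived.shape)    # (1, 22)
print(X_derived[0, 17:])  # the five derived features
```

With a +1/-1 encoding, the product of two columns is +1 when the two flags agree and -1 when they differ, so each derived column encodes an interaction between two sub-attributes.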
Table: Average Training Time and Average Inference Time for Decision Tree, Logistic Regression and Naïve Bayes.
The error performance of the k-NN algorithm has improved slightly further after the addition of the derived features. For
decision trees it is still high, and for the other algorithms it is very high.
2.4. PCA
Set-3: Apply PCA to reduce the dimensionality of dataset (Keep 95% variance) and then run the classification algorithm.
PCA is used for dimensionality reduction and resulted in the following scatter plot.
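Keeping 95% of the variance can be sketched with scikit-learn's PCA, where a fractional n_components selects the number of components automatically. The data here is a randomly generated stand-in, not the report's dataset, so the number of retained components will differ from the report's.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 432 rows, 17 columns of +1/-1; replace with the real feature matrix.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(432, 17))

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (432, number of components kept)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

The first two columns of X_reduced are the natural choice for a two-dimensional scatter plot, and X_reduced can be passed to the classifiers in place of X.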
a. Accuracy of each algorithm on the original features: rank the algorithms by
accuracy and add a discussion of why certain algorithms performed better than the others.
Both the k-NN and decision tree algorithms have outperformed the others. A full-length decision tree is constructed, and therefore its
performance is better than that of the other algorithms. For k-NN, k = 3, which is a fairly balanced choice of neighbors.
b. Training time of each algorithm on the original features: rank the algorithms w.r.t.
training time (smallest is the best) and add a discussion why certain algorithms take
less training time than the others.
The training time of the decision tree algorithm is better than the others because of its exhaustive nature. Naïve Bayes has also done
c. Inference time of each algorithm on the original features: rank the algorithms w.r.t.
inference time (smallest is the best) and add a discussion why certain algorithms take
less inference time than the others.
The inference time of the Naïve Bayes and k-NN algorithms is better than the others. For k = 3, the performance of the k-NN
algorithm is optimal: classification becomes easier compared to k = 1, while it is computationally faster than with
higher values of k.
d. Impact of irrelevant features on algorithm accuracy: which algorithms got
affected by adding irrelevant features and which algorithms got least affected? Why?
After the addition of the redundant features, all algorithms have been affected except k-NN. This is because the
redundant features only added overhead, except for k-NN, where redundant data might help to filter out noise.
e. Impact of redundant (derived) features on algorithm accuracy: which algorithms got
most affected by adding redundant features and which algorithms got least affected?
Derived features have helped the algorithms because each derived feature possesses the combined information of multiple
features and therefore conveys more information.
f. Robustness of different algorithms against noise: which algorithms got most affected
by adding noise in the data and which algorithms got least affected? Why?
The decision tree is most affected by the addition of noise because it was already overfitting.