International Journal of Mechanical Engineering and Technology (IJMET)
Volume 10, Issue 01, January 2019, pp. 1392-1398, Article ID: IJMET_10_01_141
Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=01
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication
Scopus Indexed
Krishna Sriharsha Gundu
Student in VIT Vellore
Sundar S
Professor in VIT Vellore
For years, the machine learning community has focused on developing efficient
algorithms that can produce very accurate classifiers. However, it is often much easier
to find several good classifier-dataset combinations than a single classifier that
performs well on different datasets. The advantages of using classifier-dataset combinations
instead of a single classifier are twofold: it helps lower the computational complexity by
using simpler models, and it can improve classification accuracy and performance.
Most data mining applications are based on pattern matching algorithms, so improving
the performance of classification has a positive impact on the quality of the overall
data mining task. Since combination strategies have proved very useful in improving
performance, these techniques have become very important in applications such as
cancer detection, speech technology and natural language processing. The aim of this
paper is to propose a proprietary metric, the Normalized Geometric Index (NGI),
based on the latent properties of datasets, for improving the accuracy of data mining tasks.
Key words: Machine Learning, Classification, Classifier Selection, Data Mining, Non
Linear Regression, Normalized Geometric Index (NGI)
Cite this Article: Krishna Sriharsha Gundu and Sundar S, Normalized Geometric Index: a
Scale for Classifier Selection, International Journal of Mechanical Engineering and
Technology, 10(01), 2019, pp.1392–1398
Classification is an important data mining task that classifies the given features and learns the
hidden knowledge in a dataset. The input to a typical classification system is a set of features from
a dataset with an associated class. A feature is represented by a set of measurements that contain
relevant information about the structure of the object we wish to classify. Hence, in the context of
classification, the combination of classifier and dataset is an important measure for understanding the
performance of a classifier system. This method of understanding performance is termed
"Overproduce and choose" [1]. In this method, a large number of datasets of different geometries
are given as inputs to different classifiers. Flach [2] has discussed in more detail the generic
approaches to assessing the influence of dataset parameters on accuracy, but did not elaborate any
specific direction for solving the problem. In this paper, a numeric index is developed (in Section 3)
and used to predict which classifier gives the best accuracy for the geometry of the dataset that
is to be classified.
1.1. Dataset and Classifier
Classification consists of predicting a certain outcome based on historic input data. The
prediction is carried out by processing the data with an algorithm applied to a training
dataset. The algorithm tries to discover relationships between the features that will aid in
predicting the output according to the perceived pattern. The classification algorithm analyzes
the input and predicts the output. The prediction accuracy is the figure of merit of the classification
algorithm. For example, in a software defect dataset, as shown in Table 1, the training set
would contain relevant, historically collected information on the "bug". The prediction data, as
shown in Table 2, is used by the algorithm to predict the bugs in the module.
1.2. Influence of Dataset-Classifier combination
In general, classification is about taking a decision on given data. Michie et al. define
classification as "construction of a procedure that is applied to a series of objects, where each
object is assigned to a label" [3,4,5]. In this paper, classification refers to supervised learning
based classification, where the classifier is trained on historic data with associated classes.
In today's machine learning world, the challenge is to improve the performance of the learning
system and to apply the right classification algorithm to a particular dataset [6,7]. Since dataset
volume and the number of features increase over time, selecting a suitable classifier is a challenge.
Poor selection of a classifier results in poor accuracy. A few studies have addressed this issue,
but the problem remains open [8,9]. Given the explosive growth in data volumes and the availability
of several machine learning algorithms, there is no comprehensive study of, or set of guidelines for,
selecting classifiers for a given type of dataset [10]. Datasets themselves offer
little clue for selecting a relevant classifier algorithm.
Each classifier interprets and processes the data differently. For example, the k-Nearest
Neighbour (kNN) classifier computes the distance from the test point to all the training points and
assigns the test point to the class most common among its k nearest neighbours. In contrast, the
Support Vector Machine (SVM) classifier constructs hyperplanes that separate the classes, so that
all the points on one side of a hyperplane belong to one class.
This difference in the underlying algorithm causes a difference in classification accuracy. In
other words, not all classifiers are suitable for all dataset geometries, i.e., the importance
of features, observations and classes differs between algorithms. We therefore have to
choose the classifier based on the dataset at hand.
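The distance-based behaviour of kNN described above can be illustrated with a minimal sketch in plain Python. The toy points, the labels and the function name `knn_predict` are assumptions made for illustration and are not taken from the experiments; k = 3 follows the setting given in Section 4.1.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, test_point, k=3):
    """Classify test_point by majority vote among its k nearest training points."""
    # Euclidean distance from the test point to every training point.
    distances = [
        (math.dist(point, test_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count how often each label occurs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data (not one of the paper's datasets).
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```

Because every prediction scans all training points, the cost of this approach grows with the number of observations and features, which is one way the dataset geometry affects a classifier.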
T. van Gemert [3] studied the relationship between classification algorithm and
dataset, with execution time as the prime consideration. The author discussed the influence of
dataset characteristics on classification algorithm performance, but did not address
the real issue: the influence of dataset parameters (e.g., the number of features or classes)
on classification accuracy.
In an attempt to produce optimal accuracy for all datasets, a classifier ensemble is
generated from other classifiers or classifier ensembles and combination functions [11]. Although
this approach may yield optimal accuracy for all kinds of datasets, there are many possible combinations
of classifiers and combination functions, and the time required to build a custom classifier is very
high. A final selection must then be made from the list of classifiers built for the dataset [12]. This
ensemble method is worth implementing only if there is a drastic improvement in accuracy.
The time taken to understand and build the optimum classifier renders the approach
useless in time-sensitive classification or in accuracy-insensitive scenarios (for minor accuracy
improvements).
A true classifier ensemble can be built only if classifiers with diverse strengths and weaknesses
are combined. Research on measuring this diversity is still inconclusive [13]. This implies
that a suitable classifier ensemble may not be found or, if found, may correspond to a local minimum
of the error. The search process then has to be restarted with a different classifier and
its combinations. The time taken to gather enough data to make a decision is considered critical in a
classification task [13].
Since truly complementary and diverse classifiers do not exist, fusing many decisions
into a single output label is challenging. Although some frameworks such as weighted voting
have been developed [14], they are not foolproof. They therefore require another classifier just to
reduce the pool of outputs to a single output, based on the relevance of the features to each
classifier and the confidence interval. The computational complexity of such a system is exponentially
high, as the selection of a classifier ensemble uses another classifier ensemble.
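As a point of reference for the fusion problem, weighted voting can be sketched in a few lines of plain Python. This is a generic illustration and not the scheme of [14]; the function name `weighted_vote`, the labels and the weights are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Fuse per-classifier labels into one label using weighted voting."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per prediction")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # The label with the largest accumulated weight wins.
    return max(scores, key=scores.get)


# Hypothetical outputs of three classifiers for a single observation, with
# hypothetical weights (for example, each classifier's validation accuracy).
labels = ["defect", "clean", "clean"]
weights = [0.9, 0.4, 0.3]
print(weighted_vote(labels, weights))
```

In this toy case the single heavily weighted classifier overrides the two that agree with each other, so the fused label depends strongly on how the weights are chosen.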
The diversity of the classifier pool is ensured by manipulating the classifier inputs and outputs
[15]. This manipulation of the dataset could lose some crucial information about the dataset.
The main drawback of the single classifier system is the requirement of prior knowledge to
choose the best classifier [16]. This paper proposes a solution to standardize the knowledge
through a parameter.
In this paper we propose a novel metric, the Normalized Geometric Index (NGI), for selecting the
classifier based on the dataset parameters for optimal classification accuracy and execution time.
Since the metric is a numeric value, it can be used directly to decide on a single classifier for
optimal performance. This eliminates the need for one or more learning algorithms to fuse
classifiers and choose the right fusion function, thereby saving time and computational complexity.
This parameter is developed with four kinds of datasets in mind (refer to Section 4.2).
The rules behind developing the metric –
1. The accuracy of classification of a dataset improves with the increase in the number
of observations, as long as enough care is taken to avoid overfitting.
2. The accuracy of classification decreases if there are more classes for the
same number of observations.
3. The accuracy of classification decreases if there are more features for the
same number of observations.
Every observation is considered as new information provided to the classifier, detailing the
behaviour of the dataset. More observations (for the same number of classes and
features) therefore imply better classification.
Every new feature is considered as a new dimension in which to visualize the classes. If there are more
features (for the same number of observations and classes), then the information provided by the
observations will not be sufficient for the classifier to perform the classification.
The greater the number of output classes, the more information is required to classify the test
data into the different classes. By combining the above points, we obtain the following metric.
The classifiers and datasets considered for the experiments are:
1. Gaussian Naive Bayes (GNB)
2. Support Vector Machine (SVM)
3. Random Forest (RF)
4. k Nearest Neighbor (KNN)
5. Multi-Layer Perceptron (MLP)
6. Multinomial Naive Bayes (MNB)
7. Quadratic Discriminant Analysis (QDA)
4.1. Technical Details in experiment
The parameters of the classifiers are set as follows:
• kNN Classifier has k value set to 3
• MLP Classifier has activation function set to tanh()
• MLP Classifier has been set to adam solver in python
• MLP Classifier’s tolerance has been set to 10^-5
• Random Forest takes the decision from an ensemble of 100 trees.
• SVM uses sigmoid kernel for classification.
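One way to express these settings in code is sketched below, under the assumption that the scikit-learn library is used (the text mentions Python but does not name a library). Only the parameters listed above are set; everything else is left at the library defaults, and the classifiers with no listed settings (GNB, MNB, QDA) are omitted.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Settings stated in the text; all remaining parameters keep their
# scikit-learn defaults, which is an assumption of this sketch.
classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "MLP": MLPClassifier(activation="tanh", solver="adam", tol=1e-5),
    "RF": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="sigmoid"),
}

for name, model in classifiers.items():
    print(name, model)
```

Keeping the models in a dictionary makes it straightforward to run every dataset through every classifier in a single loop.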
4.2. Experiment
The experiment is set up such that datasets of all types are run on all the classifiers. The
resulting accuracies are tabulated. For each classifier, a non-linear regression curve is fitted
to the accuracies as a function of the NGI metric.
For each dataset, the average of the accuracies of all classifiers is taken. Another non-linear
regression curve is fitted to these average accuracies as a function of the NGI metric. This curve
acts as a threshold for selecting a classifier.
When the two curves are plotted together, the region where the individual classifier outperforms the
average classifier performance is the region of strength for that classifier. It must be noted that
the R-squared value (describing the goodness of fit of the curve) will be low, since the accuracy is not
fully explained by these parameters alone. Hence, the curve that best describes the accuracy in terms
of NGI is selected for determining the region of performance.
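The comparison of a classifier's curve against the average (threshold) curve can be sketched as follows. The sketch assumes NumPy, uses a degree-2 polynomial as a stand-in for the unspecified non-linear regression, and assumes NGI lies in [0, 1]. The accuracy values are synthetic placeholders generated from known quadratics; they are not measurements from the experiments.

```python
import numpy as np

# Synthetic placeholder values, generated from known quadratics so that the
# example is reproducible. They are NOT results from the paper.
ngi = np.linspace(0.05, 0.95, 10)
avg_accuracy = 0.60 + 0.20 * ngi - 0.05 * ngi**2  # mean over all classifiers
clf_accuracy = 0.50 + 0.50 * ngi - 0.10 * ngi**2  # one individual classifier

# Assumption: a degree-2 polynomial stands in for the non-linear regression.
avg_curve = np.poly1d(np.polyfit(ngi, avg_accuracy, deg=2))
clf_curve = np.poly1d(np.polyfit(ngi, clf_accuracy, deg=2))

# Region of strength: NGI values at which the classifier's fitted curve lies
# above the fitted average (threshold) curve.
grid = np.linspace(0.0, 1.0, 1001)
strong_region = grid[clf_curve(grid) > avg_curve(grid)]

if strong_region.size:
    # Reporting min and max assumes the region is one contiguous interval.
    print(f"classifier beats the average for NGI in "
          f"[{strong_region.min():.3f}, {strong_region.max():.3f}]")
else:
    print("classifier never beats the average on this grid")
```

Evaluating both fitted curves on a dense grid avoids solving for the intersection analytically and works unchanged if a different regression form is substituted.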
Figure 1. Experimental Setup
After conducting the experiments as described in the previous section, the results are obtained as
given below.
A table of Classifier Accuracy for the NGI values.
The accuracies are tabulated. As seen in the NGI column, each dataset
corresponds to a different dataset geometry. From the table, it can be inferred that NGI values
around 0.041 have higher overall classification accuracy. This NGI value
corresponds to datasets having fewer classes, more observations and fewer features.
The lowest overall classification accuracy occurs for a high number of classes, a high number of
features and a low number of observations. During classification, it is also observed that the
Multinomial Naive Bayes classifier does not work with all the raw datasets: all its inputs must be
non-negative, so pre-processing is needed. The following setup is used for studying the relationship
between classifier performance and the NGI metric. For each classifier, the experiments are
conducted using the different datasets mentioned in the previous section. The response function
of NGI is calculated as
The above accuracy corresponds to the average classification accuracy. It serves as the
baseline for determining whether a classifier is better or worse at a particular NGI value. After
calculating the NGI value from equation (2) above, the accuracy values of the individual
classifiers are approximated using the following equations. These equations are non-linear
regression models of accuracy in terms of the NGI metric. For the accuracy of a specific
classifier, the corresponding column of the table is chosen as y and the related NGI values are
chosen as x. Using non-linear regression, a function is fitted. These functions are as follows.
These equations approximately describe the behaviour of the classifiers for different NGI values.
Graphical examination of these equations shows the optimal classifier for a given
NGI value. All the approximations have an R-squared value greater than 0.85.
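The per-classifier fitting step and the R-squared check can be sketched as below. The sketch assumes NumPy and again uses a degree-2 polynomial in place of the unspecified non-linear model. The table columns `clf_A` and `clf_B` and all numbers are made-up placeholders, and the helper names `fit_with_r_squared` and `best_classifier` are hypothetical.

```python
import numpy as np


def fit_with_r_squared(x, y, degree=2):
    """Fit a polynomial y ~ f(x) and return (fitted function, R-squared)."""
    model = np.poly1d(np.polyfit(x, y, deg=degree))
    residual = np.sum((y - model(x)) ** 2)
    total = np.sum((y - np.mean(y)) ** 2)
    return model, 1.0 - residual / total


# Synthetic placeholder table: one NGI column (x) and one accuracy column (y)
# per classifier. The numbers are invented for illustration only.
ngi = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
noise = np.array([0.01, -0.01, 0.005, -0.005, 0.0, 0.01, -0.01, 0.0, 0.005, -0.005])
accuracy_table = {
    "clf_A": 0.55 + 0.40 * ngi - 0.10 * ngi**2 + noise,
    "clf_B": 0.80 - 0.30 * ngi + 0.05 * ngi**2 - noise,
}

models, r_squared = {}, {}
for name, y in accuracy_table.items():
    models[name], r_squared[name] = fit_with_r_squared(ngi, y)
    print(f"{name}: R-squared = {r_squared[name]:.3f}")


def best_classifier(ngi_value):
    """Return the classifier whose fitted curve is highest at ngi_value."""
    return max(models, key=lambda name: models[name](ngi_value))


print(best_classifier(0.1), best_classifier(0.9))
```

Once the curves are fitted, selecting a classifier for a new dataset reduces to computing its NGI value and evaluating each curve at that point, with no further training runs.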
From the above results it is clear that the Normalized Geometric Index (NGI) is helpful
in determining the classifier-dataset relation for improved accuracy. The performance of QDA is
inferior to that of the other classifiers, so it is not suggested as a prime choice
for the given dataset properties. The KNN classifier performs consistently compared
to the remaining classifiers, although at some cost in accuracy. The accuracy of the
MLP classifier, however, is very high provided the NGI is greater than 0.787. The Random Forest (RF)
classifier performs uniformly well across all NGI values, with no threshold value; its
performance is mostly consistent.
We have conducted experiments on a few datasets; however, the experiments can be repeated
using sparse and scientific datasets to study the impact of the NGI metric on a wider variety
of datasets.
[1] Amanda JC Sharkey, Noel E Sharkey, Uwe Gerecke, and Gopinath Odayammadath
Chandroth. The “test and select” approach to ensemble combination. In International
Workshop on Multiple Classifier Systems, pages 30–44. Springer, 2000.
[2] Peter Flach. Machine Learning: The art and science of algorithms that make sense of data.
Cambridge University Press, 2012.
[3] T van Gemert. On the influence of dataset characteristics on classifier performance. B.S.
thesis, 2017.
[4] D Michie, DJ Spiegelhalter, and CC Taylor. Machine learning, neural and statistical
classification. Ellis Horwood, 1994.
[5] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and
Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural
Information Processing Systems, pages 2962–2970, 2015.
[6] Evan R Sparks, Ameet Talwalkar, Daniel Haas, Michael J Franklin, Michael I Jordan, and
Tim Kraska. Automating model search for large scale machine learning. In Proceedings of
the Sixth ACM Symposium on Cloud Computing, pages 368–380. ACM, 2015.
[7] Gang Luo. A review of automatic selection methods for machine learning algorithms and
hyper-parameter values. Network Modeling Analysis in Health Informatics and
Bioinformatics, 5(1):18, 2016.
[8] Alexandros Kalousis and Theoharis Theoharis. Noemon: Design, implementation and
performance results of an intelligent assistant for classifier selection. Intelligent Data
Analysis, 3(5):319–337, 1999.
[9] Joao Gama and Pavel Brazdil. Characterization of classification algorithms. In Portuguese
Conference on Artificial Intelligence, pages 189–200. Springer, 1995.
[10] Pavel Brazdil, Christophe Giraud Carrier, Carlos Soares, and Ricardo Vilalta. Meta learning:
Applications to data mining. Springer Science & Business Media, 2008.
[11] Josef Kittler. Multiple Classifier Systems: First International Workshop, MCS 2000 Cagliari,
Italy, June 21-23, 2000 Proceedings, volume 1857. Springer Science & Business Media, 2000.
[12] Fabio Roli, Giorgio Giacinto, and Gianni Vernazza. Methods for designing multiple classifier
systems. In International Workshop on Multiple Classifier Systems, pages 78–87. Springer.
[13] Michał Woźniak, Manuel Graña, and Emilio Corchado. A survey of multiple classifier
systems as hybrid systems. Information Fusion, 16:3–17, 2014.
[14] Šarūnas Raudys. Trainable fusion rules. II. Small sample-size effects. Neural Networks,
19(10):1517–1527, 2006.
[15] Ludmila I Kuncheva. Combining pattern classifiers: methods and algorithms. John Wiley &
Sons, 2004.
[16] Josef Kittler. A framework for classifier fusion: Is it still needed? In Joint IAPR International
Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and
Syntactic Pattern Recognition (SSPR), pages 45–56. Springer, 2000.