International Journal of Mechanical Engineering and Technology (IJMET)
Volume 10, Issue 01, January 2019, pp. 1392-1398, Article ID: IJMET_10_01_141
Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=01
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication, Scopus Indexed

NORMALIZED GEOMETRIC INDEX: A SCALE FOR CLASSIFIER SELECTION

Krishna Sriharsha Gundu
Student, VIT Vellore

Sundar S
Professor, VIT Vellore

ABSTRACT

For years, the machine learning community has focused on developing efficient algorithms that produce highly accurate classifiers. However, it is often much easier to find several good classifier-dataset combinations than a single classifier that works well on different datasets. The advantages of using classifier-dataset combinations instead of a single classifier are twofold: they lower computational complexity by allowing simpler models, and they can improve classification accuracy and performance. Most data mining applications are based on pattern matching algorithms, so improving classification performance has a positive impact on the quality of the overall data mining task. Since combination strategies have proved very useful in improving performance, these techniques have become important in applications such as cancer detection, speech technology and natural language processing. The aim of this paper is to propose a new metric, the Normalized Geometric Index (NGI), based on the latent properties of datasets, for improving the accuracy of data mining tasks.

Key words: Machine Learning, Classification, Classifier Selection, Data Mining, Non-Linear Regression, Normalized Geometric Index (NGI)

Cite this Article: Krishna Sriharsha Gundu and Sundar S, Normalized Geometric Index: a Scale for Classifier Selection, International Journal of Mechanical Engineering and Technology, 10(01), 2019, pp. 1392-1398. http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=10&IType=01

1. INTRODUCTION

Classification is an important data mining task: it classifies the given features and learns the hidden knowledge in a dataset. The input to a typical classification system is a set of features from a dataset with an associated class. A feature is represented by a set of measurements that contain relevant information about the structure of the object we wish to classify. Hence, in the context of classification, the combination of classifier and dataset is an important measure for understanding the performance of a classifier system. This method of understanding performance is termed "overproduce and choose" [1]. In this method, a large number of datasets of different geometries are given as inputs to different classifiers. Flach [2] has discussed generic approaches to assessing the influence of dataset parameters on accuracy, but did not elaborate any specific direction for solving the problem. In this paper, a numeric index is developed (in Section 3) and used to predict which classifier gives the best accuracy for the geometry of the dataset to be classified.

1.1. Dataset and Classifier

Classification consists of predicting a certain outcome based on historic input data. The prediction is carried out by processing the data with an algorithm applied to a training dataset; a minimal sketch of this train/predict flow is given below.
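The following sketch is only an illustration of the train/predict flow just described; it assumes scikit-learn, and the two-feature defect dataset in it is hypothetical, standing in for the historic training data (Table 1) and the prediction data (Table 2) described next.

```python
# Minimal train/predict sketch (assumes scikit-learn; the defect
# dataset here is hypothetical, standing in for Tables 1 and 2).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Training set: historic module metrics (e.g. lines of code and
# complexity) with the known "bug" label collected historically.
X_train = np.array([[120, 4], [300, 9], [80, 2], [450, 12]])
y_train = np.array([0, 1, 0, 1])  # 1 = module had a bug

# Prediction set: new modules whose bug label is unknown.
X_new = np.array([[200, 7], [95, 3]])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)   # learn the feature/label relationship
print(clf.predict(X_new))   # predicted bug labels for the new modules
```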
The algorithm tries to discover relationships among the features that will aid in predicting the output according to the perceived pattern. The classification algorithm analyzes the input and predicts the output, and prediction accuracy is the figure of merit of the algorithm. For example, in a software defect dataset, the training set (Table 1) holds relevant information on the "bug" label, collected historically. The algorithm then uses the prediction data, shown in Table 2, to predict the bugs in each module.

1.2. Influence of the Dataset-Classifier Combination

In general, classification is about taking a decision on given data. Michie et al. define classification as the "construction of a procedure that is applied to a series of objects, where each object is assigned to a label" [3,4,5]. In this paper, classification refers to supervised-learning-based classification, where the classifier is trained on historic data with associated classes. In today's machine learning world, the challenge is to improve the performance of the learning system and to apply the right classification algorithm to a particular dataset [6,7]. Since dataset volumes and feature counts grow over time, selecting a suitable classifier is a challenge, and a poor selection results in poor accuracy. Several studies have touched on this problem sparsely, but it remains a challenge [8,9]. Despite the explosive growth in data volumes and the availability of many machine learning algorithms, there is no study on, or guideline for, selecting a classifier for a given type of dataset [10]. Datasets themselves offer little clue for selecting a relevant classifier algorithm.

Each classifier interprets and processes the data differently. For example, a k-Nearest Neighbour (kNN) classifier computes the distance from the test point to all the training points and assigns the class that is nearest to it. In contrast, a Support Vector Machine (SVM) classifier draws hyperplanes such that all points satisfying a hyperplane belong to one class. This difference in algorithm leads to differences in classification accuracy. In other words, not all classifiers suit all dataset geometries; the importance of features, observations and classes differs between algorithms. We therefore have to choose the classifier based on the dataset at hand.

2. EXISTING WORK

T. van Gemert [3] studied the relationship between classification algorithm and dataset with execution time as the prime consideration. The author discussed the influence of dataset characteristics on classification algorithm performance, but did not address the real issue of how dataset characteristics (e.g. the number of features or classes) influence classification accuracy.

In an attempt to produce optimal accuracy for all datasets, a classifier ensemble can be generated from other classifiers or classifier ensembles and combination functions [11]. Although this approach may yield optimal accuracy for all kinds of datasets, there are many possible combinations of classifiers and combination functions, and the time required to build such a custom classifier is very high; the sketch below illustrates a single such combination.
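To make the cost concrete, the sketch below builds one ensemble from base classifiers and a majority-voting combination function. It assumes scikit-learn; the synthetic dataset and the particular choice of base classifiers are illustrative, and a full search would have to evaluate many such candidate combinations.

```python
# Sketch of a classifier ensemble built from base classifiers and a
# combination function (majority voting). Assumes scikit-learn; the
# dataset and the choice of base classifiers are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Three diverse base classifiers fused by hard (majority) voting.
ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=3)),
                ("rf", RandomForestClassifier(n_estimators=100)),
                ("gnb", GaussianNB())],
    voting="hard",
)
# Every candidate combination must be evaluated like this, which is
# what makes the search for a custom ensemble so expensive.
print(cross_val_score(ensemble, X, y, cv=5).mean())
```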
A final selection must then be made from the list of classifiers built for the dataset [12]. The proposed method is worth implementing only if there is a drastic improvement in accuracy; the time taken to understand and build the optimum classifier renders the classification task useless in time-sensitive scenarios, or wasteful where only minor accuracy improvements result.

A true classifier ensemble can be built only if classifiers with diverse strengths and weaknesses are combined, and research on measuring this diversity is not concluded [13]. This implies that a classifier-ensemble solution may not be found, or, if found, it may sit at a local minimum of the error, which requires restarting the search with a different classifier and different combinations. The time taken to gather enough data to make a decision is considered critical in a classification task [13].

Since truly complementary and diverse classifiers do not exist, fusing many decisions into a single output label is challenging. Although some frameworks such as weighted voting have been developed [14], they are not foolproof; another classifier is then required just to map the pool of outputs to a single output based on feature relevance to the classifier and the confidence interval. The computational complexity of such a system is exponentially high, as the selection of a classifier ensemble uses another classifier ensemble. The diversity of the classifier pool is ensured by manipulating the classifier inputs and outputs [15], but this manipulation of the dataset can lose crucial information about it.

The main drawback of a single-classifier system is the prior knowledge required to choose the best classifier [16]. This paper proposes a solution that standardizes this knowledge through a parameter.

3. PROPOSED METRIC

In this paper we propose a novel metric, the Normalized Geometric Index (NGI), for selecting a classifier based on dataset parameters for optimal classification accuracy and execution time. Since the metric is a numeric value, it can be used directly to decide on a single classifier for optimal performance. This eliminates the need for one or more learning algorithms to fuse classifiers and choose the right fusion function, saving time and computational complexity. The parameter was developed with four kinds of datasets in mind (refer to Section 4.2).

The rules behind developing the metric:

1. The accuracy of classification of a dataset improves with an increase in the number of observations, as long as enough care is taken to avoid overfitting.
2. The accuracy of classification decreases if there are more classes for the same number of observations.
3. The accuracy of classification decreases if there are more features for the same number of observations.

Every observation is considered new information provided to the classifier about the behaviour of the dataset; more observations (for the same number of classes and features) therefore imply better classification. Every new feature is considered a new dimension in which to visualize the classes; if there are more features (for the same number of observations and classes), then the information provided by the observations will not be sufficient for the classifier to perform the classification. Likewise, the more output classes there are, the more information is required to classify the test data into the different classes. Combining these points yields the metric, a plausible form of which is sketched below.
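The form below is offered purely as an illustration consistent with the three rules above; it is one of several possible normalizations, not a uniquely determined definition of the index.

```latex
% Illustrative form only: accuracy should rise with the number of
% observations n and fall with the number of features f and the
% number of classes c; the raw geometric ratio g is normalized to
% the unit interval over the datasets under study.
g = \frac{n}{f \cdot c},
\qquad
\mathrm{NGI} = \frac{g}{\max\limits_{\text{datasets}} g}
```

Here n, f and c denote the numbers of observations, features and classes of a dataset, and the normalization maps the raw ratio onto [0, 1] across the datasets being compared.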
The classifiers considered in the experiments are:

1. Gaussian Naive Bayes (GNB)
2. Support Vector Machine (SVM)
3. Random Forest (RF)
4. k-Nearest Neighbor (kNN)
5. Multi-Layer Perceptron (MLP)
6. Multinomial Naive Bayes (MNB)
7. Quadratic Discriminant Analysis (QDA)

4. EXPERIMENTAL SETUP

4.1. Technical Details of the Experiment

The parameters of the classifiers are set as follows:

• The kNN classifier has its k value set to 3.
• The MLP classifier uses the tanh() activation function.
• The MLP classifier uses the Adam solver in Python.
• The MLP classifier's tolerance is set to 10^-5.
• The Random Forest takes its decision from an ensemble of 100 trees.
• The SVM uses a sigmoid kernel for classification.

4.2. Experiment

The experiment is set up so that datasets of all types are executed on all the classifiers, and the resulting accuracies are tabulated. A non-linear regression line is fitted to the accuracies as a function of the NGI metric for each classifier. The accuracies for each single dataset are then averaged, and another non-linear regression line is fitted to these averages as a function of the NGI metric; this line acts as a threshold for selecting a classifier. When the two lines are plotted, the region where an individual classifier outperforms the average classifier performance is that classifier's region of strength. It must be noted that the R-squared value (describing the goodness of fit of the line) will be low, since accuracy is not fully explained by these parameters alone. Hence, the line that best describes accuracy in terms of NGI is selected for determining the region of performance.

Figure 1. Experimental Setup

5. RESULTS

After conducting the experiments described in the previous section, the classifier accuracies for the various NGI values are tabulated. As seen in the NGI column, each dataset corresponds to a different dataset geometry. From the table, it can be inferred that NGI values around 0.041 have overall higher classification accuracy; this NGI value corresponds to datasets with fewer classes, more observations and fewer features. The lowest overall classification accuracy occurs for a high number of classes, a high number of features and a low number of observations. During classification it is also observed that the Multinomial Naive Bayes classifier does not work with all raw datasets: all its inputs must be non-negative, so pre-processing is needed.

The following setup is used to study the relationship between classifier performance and the NGI metric. For each classifier, the experiments are conducted using the different datasets mentioned in the previous section. The response function of NGI is calculated as the average classification accuracy; it serves as the baseline for determining whether a classifier is better or worse at a particular NGI value. After the NGI value is calculated from equation (2), the accuracy values of the individual classifiers are approximated by non-linear regression models of accuracy in terms of the NGI metric, as described next.
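As a concrete illustration of this fitting-and-selection procedure, the sketch below fits a non-linear model of accuracy against NGI for each classifier and for the average baseline, then picks the classifier whose fitted curve is highest at the NGI of a new dataset. The saturating-exponential model form and the toy accuracy values are assumptions of the sketch (as are NumPy and SciPy as dependencies); the paper's actual fitted equations are the ones referred to in the text.

```python
# Sketch of the regression-and-selection procedure described above.
# The model form (a saturating exponential) and the toy accuracy
# table are assumptions; the paper's actual fitted equations are
# not reproduced here. Assumes NumPy and SciPy.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    # Non-linear model of accuracy as a function of NGI.
    return a - b * np.exp(-c * x)

# Toy data: one NGI value per dataset, one accuracy column per classifier.
ngi = np.array([0.01, 0.04, 0.10, 0.30, 0.79, 0.95])
acc = {
    "kNN": np.array([0.55, 0.62, 0.68, 0.72, 0.74, 0.75]),
    "MLP": np.array([0.40, 0.50, 0.63, 0.75, 0.88, 0.90]),
}
baseline = np.mean(list(acc.values()), axis=0)  # average accuracy per dataset

def fit(y):
    # Fit the non-linear regression model to one accuracy column.
    params, _ = curve_fit(model, ngi, y, p0=(0.8, 0.5, 5.0), maxfev=10000)
    return params

fits = {name: fit(y) for name, y in acc.items()}
base_fit = fit(baseline)

# Select the classifier whose fitted curve is highest at the NGI of
# the dataset to be classified; its "region of strength" is wherever
# its curve exceeds the baseline curve.
ngi_new = 0.85
scores = {name: model(ngi_new, *p) for name, p in fits.items()}
best = max(scores, key=scores.get)
print(best, scores[best], "baseline:", model(ngi_new, *base_fit))
```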
For the accuracy of a specific classifier, the corresponding column of the table is chosen as y and the related NGI values as x; non-linear regression then yields a function for that classifier. These fitted equations approximately describe the behaviour of the classifiers for different NGI values, and graphical examination of them shows which classifier performs optimally at a given NGI value. All the approximations have an R-squared value greater than 0.85.

6. CONCLUSION AND FUTURE WORK

From the above results it is clear that the Normalized Geometric Index (NGI) is very helpful in determining the classifier-dataset relation for improved accuracy. QDA's performance is inferior to that of the other classifiers, so it is not suggested as the prime choice for the given dataset properties. The kNN classifier performs consistently well compared to the remaining classifiers, though with some compromise on accuracy. The accuracy of the MLP classifier, however, is very high provided the NGI is greater than 0.787. The Random Forest (RF) classifier performs uniformly well across all NGI values with no threshold value; its performance is mostly consistent. We have conducted experiments on a few datasets; the experiments can be repeated on sparse and scientific datasets to study the impact of the NGI metric on a wider variety of data.

REFERENCES

[1] Amanda J. C. Sharkey, Noel E. Sharkey, Uwe Gerecke, and Gopinath Odayammadath Chandroth. The "test and select" approach to ensemble combination. In International Workshop on Multiple Classifier Systems, pages 30-44. Springer, 2000.
[2] Peter Flach. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012.
[3] T. van Gemert. On the influence of dataset characteristics on classifier performance. B.S. thesis, 2017.
[4] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
[5] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962-2970, 2015.
[6] Evan R. Sparks, Ameet Talwalkar, Daniel Haas, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. Automating model search for large scale machine learning. In Proceedings of the Sixth ACM Symposium on Cloud Computing, pages 368-380. ACM, 2015.
[7] Gang Luo. A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Network Modeling Analysis in Health Informatics and Bioinformatics, 5(1):18, 2016.
[8] Alexandros Kalousis and Theoharis Theoharis. NOEMON: Design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis, 3(5):319-337, 1999.
[9] Joao Gama and Pavel Brazdil. Characterization of classification algorithms. In Portuguese Conference on Artificial Intelligence, pages 189-200. Springer, 1995.
[10] Pavel Brazdil, Christophe Giraud-Carrier, Carlos Soares, and Ricardo Vilalta. Metalearning: Applications to Data Mining. Springer Science & Business Media, 2008.
[11] Josef Kittler.
Multiple Classifier Systems: First International Workshop, MCS 2000, Cagliari, Italy, June 21-23, 2000, Proceedings, volume 1857. Springer Science & Business Media, 2000.
[12] Fabio Roli, Giorgio Giacinto, and Gianni Vernazza. Methods for designing multiple classifier systems. In International Workshop on Multiple Classifier Systems, pages 78-87. Springer, 2001.
[13] Michał Woźniak, Manuel Graña, and Emilio Corchado. A survey of multiple classifier systems as hybrid systems. Information Fusion, 16:3-17, 2014.
[14] Šarūnas Raudys. Trainable fusion rules. II. Small sample-size effects. Neural Networks, 19(10):1517-1527, 2006.
[15] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.
[16] Josef Kittler. A framework for classifier fusion: Is it still needed? In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 45-56. Springer, 2000.