Classification

Cheng Lei
Department of Electrical and Computer Engineering
University of Victoria, Canada
rexlei86@uvic.ca

I. Introduction

Classification, also called supervised learning in machine learning, is the task of learning a classification model from historical data that can then be used to predict the classes of new cases. The process of classification resembles the way a human learns from past experience. The data used for classification is grouped into two parts, namely the training data set and the test data set. The training data set is the data fed to the classification algorithms to train the rules that will be used to assign future data items to the proper classes. The training data must not be used as test data; the test data is used to validate the learnt rules and, likewise, is not used to train the rules. Some of the evaluation methods use the training data to optimize the classification models. Evaluation is discussed in Section II.

As with general research, a classification analysis consists of several general steps. To begin, the purpose of the research must be stated clearly, which defines what kind of data will be used. After defining the research data, the second step is to collect such data by the available means. Before the gathered data is applied to the classification algorithms, the input feature representation of the learnt function has to be determined. When the data preparation is complete, the next step is to select the learning algorithm that will be used to construct the classification model. In this step, multiple learning algorithms can be applied to the same training data set so that multiple classification models are produced; by comparing their accuracy or other evaluation measures, the optimum one is selected to obtain the best prediction results. The following step is to run the selected algorithm on the training data set to build the classification model, which produces the result report. The final step is to evaluate the accuracy. The process of constructing the classification model is depicted in Figure 1.

Figure 1. Steps of Constructing Classification Model

The figure describes the steps in general. The target of classification is to find the model in the middle of the five parts in Figure 1.
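As a concrete illustration of the workflow above, the following is a minimal sketch using the scikit-learn library; the data set (Iris), the learning algorithm (a decision tree), and the 70/30 split are illustrative assumptions and are not prescribed by the text.

    # Minimal sketch of the classification workflow, assuming scikit-learn
    # is installed. Data set, algorithm, and split ratio are illustrative.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Steps 1-2: define and collect the data (here, a built-in example data set).
    X, y = load_iris(return_X_y=True)

    # Step 3: prepare the data; keep the test data separate from the training data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Step 4: select a learning algorithm and build the model on the training data.
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Step 5: evaluate the learnt model on the held-out test data.
    predictions = model.predict(X_test)
    print("Predictive accuracy:", accuracy_score(y_test, predictions))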
II. Evaluation

After running the algorithms on the training data, the classification models are formed. In order to find the best one, evaluation methods have to be introduced. Evidently, predictive accuracy is the most straightforward measure: it indicates how many new items are correctly classified by the model learnt from the historical data. Another measure is time, including the time to build the model and the time to use the model to predict new cases. Robustness is also considered; it demonstrates the ability to handle noisy data and missing values in the data set. Scalability describes the efficiency of applying the model to disk-resident databases when predicting new data. Interpretability describes whether the rules are understandable and whether they give insight into the model. The number of rules is another criterion for judging whether the model is good, since it indicates the compactness of the model. If the number of rules is too large, it is difficult to apply these rules to predict new items, which implicitly shows that the model is not good enough.

The holdout set is another way to evaluate the model. The available data at hand is divided into two parts, one used as the training data set and the other as the test data; the test part is called the holdout set. In N-fold cross-validation, the data is partitioned into N subsets of equal size. One of the N subsets is selected as the test data and the rest are used as the training data to run the learning algorithm. The whole process is repeated N times so that each subset serves once as the test data, and the results of the N runs are combined (typically averaged) to estimate the model's performance. Based on this idea, there is a special case of N-fold cross-validation, namely leave-one-out cross-validation: one data record is extracted from the available data set and used as the test data while the rest of the data set is used as the training data, and this process is repeated until every data record has been used once as the test data to validate the model.

Besides these, precision and recall are introduced to measure the model. Before defining precision and recall, the concept of the confusion matrix is given in Table 1:

Table 1. Definition of Confusion Matrix

                            Actual Positive    Actual Negative
    Classified Positive           TP                 FP
    Classified Negative           FN                 TN

TP: True Positive, the number of positive instances that are correctly classified.
FN: False Negative, the number of positive instances that are incorrectly classified.
FP: False Positive, the number of negative instances that are incorrectly classified.
TN: True Negative, the number of negative instances that are correctly classified.

Precision is the ratio of the number of correctly classified positive instances to the total number of instances classified as positive, while recall is the ratio of the number of correctly classified positive instances to the total number of actual positive instances. The mathematical expressions of the two measures are given in Equation 1:

    p = TP / (TP + FP),    r = TP / (TP + FN)

Equation 1. Definition of Precision and Recall

However, the two measures are difficult to apply together when comparing two different classifiers (learning methods). Therefore, another measure, the F-measure, is created by combining the two. The F-measure is the harmonic mean of precision and recall. The harmonic mean is close to the smaller of the two values, so if the F value is large, both precision and recall must be large. The mathematical expression of the F-measure is given in Equation 2; a small computational sketch of both measures is given at the end of this section.

    F = 2 / (1/p + 1/r) = 2pr / (p + r)

Equation 2. Definition of F-measure

The evaluation methods above are applied in different cases depending on the specific conditions. Many other evaluation methods are available and will be discussed in the future. Meanwhile, classification remains an active field, in which more methods will be explored.
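As a concrete illustration of Equations 1 and 2, the following is a minimal sketch that computes precision, recall, and the F-measure from the four confusion-matrix counts of Table 1; the function name and the example counts are illustrative assumptions, not values from the text.

    # Minimal sketch computing precision, recall and F-measure from the
    # confusion-matrix counts defined in Table 1. The function name and
    # example counts are illustrative assumptions.
    def precision_recall_f(tp, fp, fn):
        p = tp / (tp + fp)          # Equation 1: precision
        r = tp / (tp + fn)          # Equation 1: recall
        f = 2 * p * r / (p + r)     # Equation 2: harmonic mean of p and r
        return p, r, f

    # Example: 80 true positives, 20 false positives, 40 false negatives.
    p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
    print(f"precision={p:.2f}, recall={r:.2f}, F-measure={f:.2f}")
    # precision=0.80, recall=0.67, F-measure=0.73

Note how the F-measure (0.73) lies closer to the smaller of the two values (recall, 0.67), reflecting the harmonic-mean behaviour described above.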