
Classification
Cheng Lei
Department of Electrical and Computer Engineering
University of Victoria, Canada
rexlei86@uvic.ca
I. Introduction
Classification, also called supervised learning in machine learning, is the task of learning a classification model from historical data that can then be used to predict the classes of new cases in the future. The process of classification resembles the way humans learn from past experience.
The data used for classification is grouped into two parts: the training dataset and the test dataset. The training dataset is fed to the classification algorithm to learn the rules that will be used to assign future data items to the proper classes. The training data cannot be used as test data. The test data is used to validate the learnt rules; likewise, it is not applied to train the rules. Some evaluation methods use the training data to optimize the classification models. Evaluation is discussed in the second part.
As with general research, classification analysis consists of several general steps. To begin, the purpose of the research must be declared clearly, which defines what kind of data will be used. After defining the research data, the second step is to collect that data by whatever means are available. Before the gathered data is fed to a classification algorithm, the input feature representation of the learnt function has to be determined. Once the data preparation is complete, the next step is to select the learning algorithm used to construct the classification model. In this step, multiple learning algorithms can be applied to the same training dataset so that multiple classification models are produced; by comparing their accuracy or other evaluation measures, the optimal one is selected to obtain the best predictions. The following step is to run the selected algorithm on the training dataset to build the classification model, which produces the result report. The final step is to evaluate the accuracy. The process of constructing the classification model is depicted as follows:
Figure 1. Steps of Constructing Classification Model
The figure describes the steps in general terms. The target of classification is to find the model, shown in the middle of the five parts in Figure 1.
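As a concrete illustration of these steps, the sketch below walks through the workflow under assumptions not made in the text: scikit-learn as the library, its bundled Iris dataset standing in for the collected research data, and a decision tree as the chosen learning algorithm.

```python
# A minimal sketch of the workflow in Figure 1, assuming scikit-learn and
# its bundled Iris dataset in place of the collected research data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 1-2: the "collected" data, with features X and class labels y.
X, y = load_iris(return_X_y=True)

# Step 3: prepare the input feature representation and split the data
# into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 4: select a learning algorithm and build the classification model
# from the training data only.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Step 5: evaluate the learnt model on the held-out test data.
predictions = model.predict(X_test)
print("Predictive accuracy:", accuracy_score(y_test, predictions))
```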
II. Evaluation
After running the algorithms on the training data, the classification models are formed. To find the best one, evaluation methods have to be introduced. Predictive accuracy is the most straightforward measure: it indicates how many new items are correctly classified by the model learnt from the historical data. Another measure is time, including the time to build the model and the time to use the model to predict new cases. Robustness is also considered; it demonstrates the ability to handle noisy data and missing values in the dataset. Scalability describes the efficiency of applying the model to disk-resident databases to predict new data. Interpretability depicts whether the rules are understandable and whether they give insight into the model. The number of rules is another criterion for judging whether the model is good, as it indicates the compactness of the model: if the number of rules is too large, it is difficult to apply them to predict new items, which implicitly shows that the model is not good enough.
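The sketch below roughly illustrates two of the measurable criteria above, predictive accuracy and the time to build versus use the model; it again assumes scikit-learn and the Iris data, and the naive Bayes classifier is only an illustrative choice.

```python
# A rough sketch of measuring predictive accuracy plus the time to build the
# model and the time to predict new cases; data and classifier are assumptions.
import time
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

start = time.perf_counter()
model = GaussianNB().fit(X_train, y_train)      # time to build the model
build_time = time.perf_counter() - start

start = time.perf_counter()
predictions = model.predict(X_test)             # time to predict the new cases
predict_time = time.perf_counter() - start

print(f"accuracy = {accuracy_score(y_test, predictions):.3f}, "
      f"build = {build_time:.4f} s, predict = {predict_time:.4f} s")
```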
The holdout set is another way to evaluate the model. In the holdout method, the available data at hand is divided into two parts, one used as the training set and the other as the test set. The test part is called the holdout set.
In N-fold cross-validation, the data is partitioned into N subsets of equal size. One of the N subsets is selected as test data and the rest is used as training data to run the learning algorithm. The whole process is repeated N times, so that each subset serves as the test data once, and the N results are combined (typically averaged) to estimate the performance of the learning algorithm. A special case of N-fold cross-validation is leave-one-out cross-validation, in which one data record is extracted from the available dataset and used as test data while the rest of the dataset is used as training data. This process is simply repeated until every data record has been used as test data to validate the model.
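The following sketch illustrates the holdout method, N-fold cross-validation (with N = 5), and leave-one-out cross-validation described above; it assumes scikit-learn, its Iris data, and a decision tree classifier, and the averaging of fold accuracies follows common cross-validation practice.

```python
# A minimal sketch of the holdout method, N-fold cross-validation, and
# leave-one-out cross-validation; dataset and classifier are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Holdout method: reserve part of the available data (here 30%) as the
# holdout (test) set and train only on the remainder.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=0)
holdout_acc = model.fit(X_train, y_train).score(X_holdout, y_holdout)
print("holdout accuracy:", holdout_acc)

# N-fold cross-validation (N = 5): each subset is used once as test data
# and the N accuracies are averaged.
scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean accuracy:", scores.mean())

# Leave-one-out: the special case where every subset holds a single record.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("leave-one-out mean accuracy:", loo_scores.mean())
```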
Besides these, precision and recall are introduced to measure the model. Before defining precision and recall, the concept of the confusion matrix is given in Table 1:
Table 1. Definition of Confusion Matrix

                        Actual Positive    Actual Negative
Classified Positive     TP                 FP
Classified Negative     FN                 TN
TP: True Positive, the number of positive instances that are correctly classified.
FN: False Negative, the number of positive instances that are incorrectly classified.
FP: False Positive, the number of negative instances that are incorrectly classified.
TN: True Negative, the number of negative instances that are correctly classified.
Precision is the ratio of the number of correctly classified positive instances to the total number of instances classified as positive, while recall is the ratio of the number of correctly classified positive instances to the total number of positive instances. The mathematical expressions of the two measures are given in Equation 1:
$p = \dfrac{TP}{TP + FP}, \qquad r = \dfrac{TP}{TP + FN}$
Equation 1. Definition of Precision and Recall
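A small sketch of Equation 1 follows, using made-up TP/FP/FN/TN counts purely to illustrate the two ratios.

```python
# Precision and recall computed directly from hypothetical confusion matrix
# counts; the numbers are illustrative, not taken from any real classifier.
TP, FP, FN, TN = 40, 10, 5, 45

precision = TP / (TP + FP)   # correct positives over all items classified positive
recall = TP / (TP + FN)      # correct positives over all actual positives

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```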
However, the two measures are difficult to apply together when comparing two different classifiers (learning methods). Therefore, another measure is created by combining them: the F-measure. The F-measure is the harmonic mean of precision and recall. The harmonic mean is close to the smaller of the two values, so if the F value is large, both precision and recall must be large. The mathematical expression of the F-measure is given in Equation 2.
$F = \dfrac{2}{\dfrac{1}{p} + \dfrac{1}{r}} = \dfrac{2pr}{p + r}$
Equation 2. Definition of F-measure
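A small sketch of Equation 2, reusing the hypothetical counts from the previous sketch to show that both algebraic forms give the same value.

```python
# F-measure from Equation 2, using the same hypothetical counts as before
# (TP = 40, FP = 10, FN = 5); both forms of the formula agree.
p = 40 / (40 + 10)   # precision
r = 40 / (40 + 5)    # recall

f_harmonic = 2 / (1 / p + 1 / r)   # harmonic-mean form
f_product = 2 * p * r / (p + r)    # equivalent simplified form

print(f"F = {f_harmonic:.3f} (harmonic mean), {f_product:.3f} (simplified)")
```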
The evaluation methods above are applied in different cases depending on the specific conditions. Many other evaluation methods are available and will be discussed in the future. Meanwhile, classification is still an active field, and more methods will be explored in the future.