Project Report #2 - Pedro C. Exposito

CAP 4770
Introduction to Data Mining
Project Report #2:
Classification
Pedro Exposito
ID: 1826385
1. Introduction and Objectives
The main goal of this project is to choose and evaluate classification mechanisms. For this
purpose, some classifiers already available in WEKA will be used and tested on three different
datasets. The Iris, Zoo, and Adult datasets from the UCI Machine Learning Repository website (
http://archive.ics.uci.edu/ml/datasets.html ) will be used in this project. The classification
algorithms Decision Tree, Naïve Bayes, KNN, and SVM will be applied to each dataset through
WEKA’s Explorer, after the appropriate preprocessing steps are taken for each dataset. Then, the
four classifiers will be compared based on performance and a brief discussion about when to use
each classifier, based on the general results, will be given in the conclusions section.
It is not assumed that the reader has prior knowledge of WEKA; thus, this project also
demonstrates some of WEKA’s classification capabilities for novice users, as well as the steps
needed to obtain the results shown.
2. Performance Comparison
Before comparing the performances of each algorithm, the datasets will be described and the
preprocessing steps taken to do the experiment with each one will be covered. Different datasets
might require different preprocessing methods even if the same experiment is intended for all.
Therefore, it is important to understand how to set up each dataset before we compare their
classification results.
2.1 Description of the data sets
2.1.1 Iris
The Iris dataset contains three classes of 50 instances each, which leads to a set with 150
instances total. Each class refers to a type of Iris plant—Setosa, Versicolor, or Virginica. Each
instance in the set has four numeric attributes, which represent sepal length, sepal width, petal
length, and petal width, in that order. The fifth and last attribute for all instances is the nominal
attribute class, which is the type of Iris plant.
2.1.2 Zoo
The Zoo dataset contains 101 instances that describe the characteristics of each animal at a
particular zoo, and each instance refers to a different animal. All tuples contain a boolean value (0 or
1) for characteristics such as feathers, eggs, milk, and airborne, which tells us whether that animal
has that characteristic (given by 1) or not (given by 0). There are 15 such boolean values in
every instance. The number of legs is the 14th of the comma-separated attributes in this dataset,
and it can take a value from the set {0,2,4,6,8}. The animal type, which refers to the animal
group each instance belongs to, is the last attribute. There are seven types in total. The following
are all the attributes in the Zoo dataset: animal_name, hair, feathers, eggs, milk, airborne, aquatic,
predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize, and type.
2.1.3 Adult
The Adult dataset is by far the largest of the three. It contains more than 30,000 instances. It is
also the only one out of the three that contains instances that have unknown values for some
attributes. Each instance refers to an adult individual and contains attributes that describe their
job, education, and some personal information. The 15 attributes for each instance in this dataset
are the following: age, workclass, fnlwgt (weight), education, education-num, marital-status,
occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country,
annual_pay. The Adult dataset is meant to be used to predict, from the previously mentioned
attributes, whether an adult’s annual pay is above or below $50,000 per year.
2.2 Data Preparation Methods
2.2.1 File Format Change
All datasets were originally downloaded from http://archive.ics.uci.edu/ml/datasets.html. The
original data files were iris.data, zoo.data, and adult.data; however, these were not in a format
compatible with WEKA, so they had to be modified slightly. WEKA’s standard file extension is
.arff, so the .data files were converted to .arff. Files with a .csv extension would have worked as
well, but the .arff format is preferred because it makes the experiment’s output clearer: with
.csv, WEKA renames each attribute to the value seen in the first instance, so the attribute names
would be incorrect. To transform each dataset to .arff, each one was copied into a
Notepad document and saved with .arff extension (i.e. iris.arff) and the appropriate header
information for .arff files was added in the lines on top of the actual data. For example, the
following was added on top of the data from iris.data:
@RELATION iris
@ATTRIBUTE sepal_length NUMERIC
@ATTRIBUTE sepal_width NUMERIC
@ATTRIBUTE petal_length NUMERIC
@ATTRIBUTE petal_width NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
The top line specifies the relation name. Then each attribute name and its type, or domain, is
specified. Next comes the actual data for each instance, separated by commas. @DATA specifies
that the data instances start below. The downloaded .data files already had attributes separated by
commas and each instance in a row, therefore, just making these modifications was enough to
convert them to the correct file format.
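The conversion described above can also be scripted instead of done by hand in Notepad. The following Python sketch (my own illustration, not part of the original procedure) prepends the iris ARFF header shown above to the downloaded iris.data:

```python
# Sketch: convert the downloaded iris.data (comma-separated rows) to
# iris.arff by prepending the ARFF header shown above. Assumes iris.data
# is in the current directory; WEKA can then open iris.arff directly.

HEADER = """@RELATION iris
@ATTRIBUTE sepal_length NUMERIC
@ATTRIBUTE sepal_width NUMERIC
@ATTRIBUTE petal_length NUMERIC
@ATTRIBUTE petal_width NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
"""

def data_to_arff(src="iris.data", dst="iris.arff"):
    with open(src) as f:
        # keep only non-empty lines; the UCI .data files end with blank lines
        rows = [line.strip() for line in f if line.strip()]
    with open(dst, "w") as f:
        f.write(HEADER)
        f.write("\n".join(rows) + "\n")
```

The same pattern works for zoo.data and adult.data by swapping in the corresponding header.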
After these changes were done, the file was saved as iris.arff and ready to be opened using
WEKA. This procedure was repeated to obtain zoo.arff and adult.arff as well. In zoo.arff the
boolean attributes were given a domain of {0,1} and the type attribute was given the domain
{1,2,3,4,5,6,7} because there are seven animal groups in the dataset. The attribute animal_name
was specified with type STRING. The same procedure was done in adult.arff specifying the
possible set of values for each nominal attribute in its declaration at the top. The annual_pay
attribute was represented by “>50K” or “<=50K”, which divides all instances into two categories
based on an annual pay of more or less than $50,000 per year.
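For reference, the zoo.arff header implied by this description would look roughly like the following (a reconstruction based on the attribute list in section 2.1.2, not a verbatim copy of the file; the middle attributes are elided):

```
@RELATION zoo
@ATTRIBUTE animal_name STRING
@ATTRIBUTE hair {0,1}
@ATTRIBUTE feathers {0,1}
@ATTRIBUTE eggs {0,1}
...
@ATTRIBUTE legs {0,2,4,6,8}
@ATTRIBUTE tail {0,1}
@ATTRIBUTE domestic {0,1}
@ATTRIBUTE catsize {0,1}
@ATTRIBUTE type {1,2,3,4,5,6,7}
@DATA
```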
2.2.2 Iris Data Preprocessing
Open WEKA’s Explorer, then choose “Open file…” from the Preprocess tab and open the
iris.arff file. WEKA will display the attributes of the dataset and various statistics in the
Preprocess window. Before we move on to the classifiers in the Classify tab, the appropriate data
preprocessing methods should be executed to prepare the data. WEKA provides various filters to
do data preprocessing. For the Iris dataset, it is convenient to convert the numerical attributes to
nominal attributes. To do so, click Choose, then open weka->filters->unsupervised->attribute
->Discretize. After opening the Discretize filter, double-click on its text field and a window with
the filter’s options and chosen parameters is shown. Change the bins parameter to 5 and click
OK, then click Apply to run the Discretize filter on the Iris data. This converts the four numerical
attributes of Iris (sepal and petal length/width) to nominal attributes and divides their numerical
range into five partitions. Five was chosen as the number of partitions because the values of
these four attributes all fall between 1 and 9; thus, five partitions distribute this data sensibly.
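As a side note, the unsupervised Discretize filter with equal-width binning simply splits each attribute's observed range into equally sized intervals. A minimal Python sketch of that idea (my illustration, not WEKA's code; the sample values are invented):

```python
import numpy as np

def equal_width_bins(values, bins=5):
    """Assign each numeric value to one of `bins` equal-width intervals,
    mimicking the idea behind WEKA's unsupervised Discretize filter."""
    lo, hi = min(values), max(values)
    edges = np.linspace(lo, hi, bins + 1)
    # np.digitize maps each value to an interval index (1..bins);
    # clip so the maximum value falls in the last bin, not bins+1
    idx = np.clip(np.digitize(values, edges), 1, bins)
    return idx, edges

# e.g. some hypothetical sepal lengths
labels, edges = equal_width_bins([4.3, 5.0, 5.8, 6.4, 7.9])
```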
2.2.3 Zoo Data Preprocessing
The Zoo dataset is the one that requires the least amount of preprocessing because during the file
format change procedure, covered in section 2.2.1, all the boolean attributes in zoo.arff were
created as nominal attributes. Attributes type and legs were also created as nominal using their
few possible values as their domain set. The only non-nominal attribute is animal_name, which
was created as a STRING. It has a different value for every instance and does not provide useful
information for classification like the other attributes. Therefore, we chose to remove it. The type attribute, which tells us what animal
group each instance belongs to, is more meaningful for classification (based on animal types)
and is enough to identify animal instances by group, even if they don’t have the animal’s name.
To exclude animal_name from the dataset, mark the checkbox to the left of attribute
animal_name, then click Remove. The Zoo dataset is ready for classification after the removal of
this attribute.
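Outside WEKA, the same removal amounts to dropping the first comma-separated field of every instance. A small Python sketch (illustrative only; the sample rows are abbreviated and invented):

```python
import csv
import io

def drop_first_column(csv_text):
    """Remove the first attribute (animal_name) from each comma-separated
    zoo instance, mirroring the Remove step done in WEKA's Preprocess tab."""
    rows = csv.reader(io.StringIO(csv_text))
    return "\n".join(",".join(row[1:]) for row in rows if row)

sample = "aardvark,1,0,0,1\nbass,0,0,1,0"  # truncated, made-up rows
trimmed = drop_first_column(sample)
```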
2.2.4 Adult Data Preprocessing
The Adult dataset was the only one out of the three datasets that required special preprocessing
methods in order to apply one of the classifiers to it successfully.
The Discretize filter was used on it, just as in the Iris dataset, to convert its numerical attributes
to nominal attributes. However, before applying Discretize, the filter ReplaceMissingValues
(weka->filters->unsupervised->attribute->ReplaceMissingValues) was applied because the Adult
data contained instances with unknown values. This filter replaces the unknown values with the
modes and means from the dataset’s training data. After replacing the missing values, the
Discretize filter was applied with a bins parameter of 10, which was more suitable for the Adult
dataset than the 5 used for the Iris dataset. Numerical attributes, such as age and education-num,
were converted to nominal attributes of 10 partitions each. This procedure was enough to prepare
the Adult dataset for classification using Decision Tree (J48), Naïve Bayes, and KNN (IBk), but
the SVM (SMO in WEKA) classifier required further data preprocessing.
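Conceptually, ReplaceMissingValues fills each unknown entry (marked “?” in the data) with the column mean for numeric attributes or the column mode for nominal ones. A rough Python sketch of that behavior (my illustration, not WEKA's implementation):

```python
from statistics import mean, mode

def replace_missing(column, numeric):
    """Fill '?' entries with the column mean (numeric attributes) or the
    column mode (nominal attributes), as WEKA's ReplaceMissingValues does."""
    known = [v for v in column if v != "?"]
    if numeric:
        fill = mean(float(v) for v in known)
        return [float(v) if v != "?" else fill for v in column]
    fill = mode(known)
    return [v if v != "?" else fill for v in column]
```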
Applying SMO to the entire Adult dataset left the program running nonstop for hours without
signs of progress. This could be due to the dataset’s large size (over 30,000 instances).
Therefore, a different approach was taken to apply this classifier. The Adult data was reloaded
and the filters ReplaceMissingValues and Discretize were applied again. Then the filter
Resample (weka->filters->unsupervised->instance->Resample) was applied with a
sampleSizePercent parameter of 40.0. This took a 40% random sample of the Adult dataset and
made it the new working dataset, reducing the number of instances by 60%. Even this was not
enough: after several hours, WEKA’s SMO run on the Adult dataset resampled to 40% ended in
an error.
Due to the previous error, the process was repeated for the Adult dataset, resampling it to 25%
of the original instead of 40%. This time the classifier worked correctly and, after approximately
six hours, gave the output results shown in 5.1.4.
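For intuition, Resample with sampleSizePercent = 25 keeps a random quarter of the instances; applied to the 32,561 Adult instances it yields 8,140. A simple Python sketch (drawn here without replacement for simplicity; the parameter name mirrors WEKA's but the code is mine):

```python
import random

def resample(instances, sample_size_percent, seed=1):
    """Return a random subsample of the given size percentage,
    similar in spirit to WEKA's unsupervised Resample filter."""
    rng = random.Random(seed)
    k = round(len(instances) * sample_size_percent / 100)
    return rng.sample(instances, k)

subset = resample(list(range(32561)), 25)  # 8140 instances remain
```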
2.3 Parameters for the Algorithms
The classification algorithms to apply to the datasets are chosen from the Classify tab, just like
filters are chosen from the Preprocess tab. The Decision Tree algorithm is selected by opening
weka->classifiers->trees->J48. The J48 classifier performs the decision tree algorithm in WEKA.
The Naïve Bayes algorithm is selected with weka->classifiers->bayes->NaiveBayes. The k-nearest neighbors algorithm (KNN) is done with weka->classifiers->lazy->IBk. Finally, SVM is
done by the SMO classifier from weka->classifiers->functions->SMO.
Some parameters can be changed for WEKA’s classifiers, in the same way as parameters can be
changed in the filters used during data preprocessing. In order to compare the results later on,
some classification options were kept the same for all datasets and all classifiers. For example, in
the Classify tab the Cross-validation folds were kept at 10 for all tests, and only some parameters
of the classifiers themselves were changed. If a change is not mentioned, assume that all other
parameters kept their standard values.
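To make the setup concrete, a similar experiment can be approximated outside WEKA. The sketch below assumes scikit-learn is installed; its DecisionTreeClassifier, GaussianNB, KNeighborsClassifier, and SVC are only rough counterparts of J48, NaiveBayes, IBk, and SMO, not the same implementations. It runs 10-fold cross-validation on the Iris data, matching the evaluation setting used here:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Rough scikit-learn counterparts of the four WEKA classifiers used here
classifiers = {
    "Decision Tree (cf. J48)": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "KNN, k=1 (cf. IBk)": KNeighborsClassifier(n_neighbors=1),
    "SVM (cf. SMO)": SVC(kernel="linear"),
}

# 10-fold cross-validation, the same evaluation setting used in WEKA
mean_accuracy = {
    name: cross_val_score(clf, X, y, cv=10).mean()
    for name, clf in classifiers.items()
}
```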
Next, the parameter values used to obtain the outputs of section 5 are mentioned, and how
different values affected the results is discussed as well.
2.3.1 Parameters for Iris
The first classifier applied to the Iris dataset is J48, which uses a decision tree for classification.
By double-clicking on the J48 textbox you get the window with the parameters for the classifier.
The output of section 5.2.1 used a confidenceFactor parameter (used for pruning the tree) of 0.20
and the standard minNumObj (minimum number of instances per leaf) for J48, which is 2. Other
tests with different values for confidenceFactor and minNumObj showed that, in general, raising
these parameters did not produce significant changes in the percentage of misclassified instances.
On the other hand, lowering the confidenceFactor to values below 0.01 usually gave four or five
more misclassified instances than the other tests.
The NaiveBayes classifier does not provide any parameters to modify, thus, it runs with the same
preset functionality for all datasets. The NaiveBayes output for the Iris dataset is shown in 5.2.2.
The KNN algorithm is performed by the classifier IBk in WEKA. The IBk output shown in 5.2.3
used KNN = 1, windowSize = 0, and all the other standard parameter values. In further testing,
the values for KNN (k-nearest neighbors used) and windowSize were modified. The results
showed that if windowSize does not equal 0—if it equals 0, unlimited instances are allowed in
the training pool—then the misclassification percentage rose significantly, so it was left at
zero for the printed output. The KNN value did not seem to affect results because using 5 and 10
for KNN gave very similar classification error rates.
The SVM algorithm is done with the SMO classifier in WEKA. The SMO output for the Iris data
is in section 5.2.4 and used the standard parameter values for the algorithm. Clicking on More in
the parameters window for SMO leads us to information on the parameters, which specified that
parameters c and toleranceParameter shouldn’t be changed, so those were kept the same
throughout all the tests. The other modifiable parameters contained appropriate standard values
as well. The only modification done, for further testing, was to change the filter type to
Standardize Training Data, but the results obtained were very similar to the ones with the
standard filter Normalize Training Data.
2.3.2 Parameters for Zoo
The standard J48 parameters were used for the output shown in 5.3.1. Notably, a higher value
for minNumObj gave smaller trees in the output, but it also raised the classification error
percentage seen in 5.3.1. However, raising or lowering the value for confidenceFactor did not
affect the error percentage by a significant amount.
NaiveBayes has no modifiable parameters. Its standard output for the Zoo dataset is shown in
5.3.2.
IBk was run with KNN=1 to obtain the results shown in 5.3.3. Retrying the algorithm with
higher values for KNN did not give a drastic change, but it did seem to show that a higher KNN
produced a slightly higher error rate. For example, KNN = 6 gave 11 misclassified instances and
KNN = 20 gave 16 misclassified instances.
The standard SMO parameter values were used for the output in 5.3.4.
2.3.3 Parameters for Adult
The output in 5.1.1 was obtained after running J48 with minNumObj equal to 30,
confidenceFactor equal to 0.01, and the standard values for other parameters. This selection of
parameter values gave a slightly better performance than the others (a lower misclassification rate).
Raising the value for minNumObj and confidenceFactor gave similar output results, but raising
minNumObj decreased the overall run-time.
NaiveBayes’ standard output for the Adult dataset is shown in 5.1.2. This classifier has no
modifiable parameters.
The KNN, or IBk, output for the Adult dataset is shown in 5.1.3. The standard values, including
KNN=1 and windowSize=0, were used for this output. However, it took a long time to produce
the output (25 minutes), so for further testing IBk was applied to a 25% sample of the Adult
dataset (Resample filter was applied to it beforehand) with the same parameter values. The error
rate was very close to the initial one with the full dataset. Also, the 25% sample was run with
KNN=40 and KNN=200, and the results were very similar, which means that in this case raising
the number of k-nearest neighbors did not affect the obtained results.
Finally, SMO’s output for the Adult dataset is in section 5.1.4. The standard parameter values
were used for this output; however, this is the output produced from the 25% sample of the Adult
dataset. Section 2.2.4 described the problems found while attempting to run this algorithm on the
full dataset and on a 40% sample. This output should still be a fairly good representation of the
general one since it is still one quarter of the original data.
2.4 Platform Information
The results obtained in this project might be slightly different if the same experiment is
replicated in another machine with more processing power. In particular, the running speed of
WEKA’s classifiers might be faster in a better computer. The same tests could be performed
using different machines and operating systems to compare their speed. The following
specifications are from the machine used to obtain the results shown in section 5 of the report:
OS Name: Microsoft Windows Vista Home Premium
OS Version: 6.0.6002 Service Pack 2 Build 6002
System Model: Dell DM061
System Type: X86-based PC
Processor: Intel(R) Core(TM)2 CPU 6300@ 1.86GHz, 1862 Mhz, 2 Core(s), 2 Logical Processors
Installed Physical Memory (RAM): 3.00 GB
2.5 Classifier Performance Comparisons
Let’s take a look at the performance of all four classifiers with respect to running times and
classification error rates.
As expected, all classifiers took more time to process the Adult dataset than the Iris or Zoo
datasets. This is no surprise because the Adult dataset is the largest of all three by far. In
addition, it is interesting to see that the J48 and SMO classifiers had slightly shorter run-times for
Iris than for Zoo, despite the fact that Iris has 150 instances and Zoo has 101. This is probably
due to the fact that the Zoo dataset has three times as many attributes and also some nominal
attributes with more possible values than those in the Iris dataset. Therefore, instances in Zoo
took a little more time to process than instances in Iris, which have just five attributes.
The algorithm that took the longest to produce its output for all datasets was SVM (SMO
classifier). In fact, it took so long to produce the output for the Adult dataset, that it had to be
applied to a 25% sample of the dataset to obtain the results and it still took hours. SMO only took
five seconds to classify the Iris data and 25 seconds to classify the Zoo data, but those times were
still more than five times larger than the times other classifiers took to produce the results.
Overall, the fastest classifier was NaïveBayes with two seconds as the longest it took to get the
classification output, when it was applied to the Adult dataset. The time it took to get the results
for Iris and Zoo was less than one second.
In terms of accuracy, the most accurate classifier with the lowest misclassification percentages
was SMO. This is possibly why it was a slower algorithm than the others. If we add up the
misclassification error percentages that J48, NaïveBayes, and IBk gave across the three datasets,
we see that each total is close to 30 points, which leads us to believe that in terms of general
accuracy these three are similar. However, if we compare individual results, some are better than others.
For example, J48 had better accuracy and lower misclassification rate for the Iris set than the
other classifiers, but it also did worse than all others for the Zoo dataset. This shows that some
classifiers are better than others for specific datasets, which is the main topic covered in the
conclusions section.
The dataset with the worst accuracy results was the Adult dataset, possibly because it was the
most complex one as well. The best accuracy results (lowest misclassification percentages) were
obtained for the Zoo dataset using IBk and NaïveBayes. These two got a 4% error rate for the
Zoo data. Overall, the best classifier in terms of speed was NaïveBayes and the slowest, but most
accurate one, was SMO.
The next section includes tables that show the results discussed here.
3. Experimental Results Summary
WEKA uses several metrics, such as TP Rate, FP Rate, Precision, Recall, and F-Measure, to
evaluate accuracy. Their results for each classifier are shown in the Reference section’s outputs.
They appear near the end of each output, before the confusion matrix. However, I decided to use
a simpler and widely used metric: standard accuracy, computed by dividing the number of
correctly classified instances by the total number of instances in the dataset. Alternatively, one
can subtract the percentage of misclassified instances from 100 to obtain the same fraction. The
accuracy results obtained using this method are shown in the tables of this section.
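As a quick illustration of the computation, using the J48 result on Iris (8 misclassified out of 150 instances):

```python
def accuracy(correctly_classified, total):
    """Standard accuracy: correctly classified instances / total instances."""
    return correctly_classified / total

def accuracy_from_error_pct(error_pct):
    """Equivalent: subtract the misclassification percentage from 100."""
    return (100.0 - error_pct) / 100.0

# J48 on Iris: 150 - 8 = 142 correctly classified instances
acc = accuracy(150 - 8, 150)  # about 0.947, i.e. 94.7%
```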
The following accuracy comparison tables show each algorithm’s results with respect to the
misclassification error rate and accuracy for each dataset. These results were briefly discussed in
section 2.5:
ACCURACY COMPARISON TABLE 1

             Total        J48                                Naïve Bayes                        SMO
Datasets     Instances    Miscl.     Error Rate   Acc.       Miscl.     Error Rate   Acc.       Miscl.     Error Rate   Acc.
Iris         150          8          5.3%         94.7%      10         6.7%         93.3%      9          6.0%         94.0%
Zoo          101          8          7.9%         92.1%      7          6.9%         93.1%      4          4.0%         96.0%
Adult        32561        5519       16.9%        83.1%      5885       18.1%        81.9%      1081*      13.3%*       86.7%*

Miscl. = Misclassified Instances; Acc. = Accuracy
* = Results based on the 25% sample (8140 instances) of the Adult dataset

ACCURACY COMPARISON TABLE 2

             Total        IBk
Datasets     Instances    Miscl.     Error Rate   Acc.
Iris         150          12         8.0%         92.0%
Zoo          101          4          4.0%         96.0%
Adult        32561        6078       18.7%        81.3%

Miscl. = Misclassified Instances; Acc. = Accuracy

In addition, the running times recorded for the algorithms are shown next. The training time
refers to the time it took WEKA to build the initial training model for the dataset. The testing
time refers to the time each algorithm took to produce the output results, after building the
training model. The results of the following table were already compared in section 2.5.
RUNNING TIMES TABLE

            Training Time (seconds)                     Testing Time (seconds)
Datasets    J48      Naïve Bayes   IBk     SMO          J48     Naïve Bayes   IBk              SMO
Iris        0.01     0             0       0.30         ~0.3    ~0.3          0.5              5
Zoo         0.14     0             0       1.71         0.4     ~0.3          ~0.3             28
Adult       0.45     0.06          0.02    574.64*      15      2             1260 (21 min)    ~21600* (6 hrs)

~ = Approximate time in seconds
* = Results based on the 25% sample from Adult dataset
4. Conclusions
After running several tests and obtaining the results for this project, it becomes clearer that
some of the four classifiers used are more suitable for certain datasets and tasks than others.
The decision tree classifier, J48, appears to be more suitable for datasets with few attributes and
few values for each attribute. Such sets will produce a small and clear J48 pruned tree, as well as
an easy-to-understand tree view with the Classifier Tree Visualizer. The fact that J48 is very
good at classifying datasets composed mostly of true/false or yes/no attributes (just two possible
values for each attribute) becomes clear if the outputs in 5.1.1, 5.2.1, and 5.3.1 are compared.
The J48 pruned tree and the visualized tree for the Zoo dataset are very clear because most of the
Zoo attributes are boolean nominal attributes with just 0 and 1 as possible values. On the other
hand, the visualized trees for Adult and Iris are a mess because they have either too many
attributes or too many values per attribute. Their J48 pruned trees are not as clear as the one for
the Zoo dataset either. The conclusion for J48 is that, despite the fact it is not the most accurate
classifier out of the four, it is the best one to visualize classification of data for datasets
composed of attributes with two possible values.
The NaïveBayes classifier is very well suited to get detailed, well-organized, and easy-to-understand statistics for large datasets, such as Adult, with attributes that may have many values.
It does all the classification with respect to the “class” attribute, which is the last attribute for
each instance. It gives general statistics (fraction of instances for each value of the class attribute)
and specific statistics for each possible value, or partition, of every non-class attribute. This
classifier is ideal for datasets where you want to see the number of instances that belong to each
value of a class attribute based on the values for many other categories. This was the fastest
classifier so it is also very good for large datasets. Its results and performance make it the best
classifier out of the four for the Adult dataset, although it also gave very good results for the
other two datasets as well.
The KNN algorithm, performed by the IBk classifier, is good for datasets where the data
instances could be classified into well-defined “clusters” of similar instances. In such cases,
finding the best KNN (# of k-nearest neighbors to use) would greatly improve accuracy with this
classifier. However, this is the least descriptive out of the four classifiers (no tables, no
visualizers, no specific statistics on the results as in NaïveBayes) so it is not a great choice unless
the user is not interested in detailed results and the dataset is expected to work well for IBk.
Overall, IBk was the classifier with the least amount of useful information about the data.
From the results of the SMO outputs and the general performance of all algorithms, it seems that
SMO’s main advantage is that it had the best accuracy percentages; so, if classifying a dataset as
accurately as possible is the main concern, SMO is the best choice out of the four classifiers used
here. However, if descriptive and easy-to-understand statistics or a fast running time are
important, then SMO should be avoided. It took longer than the other classifiers and its
classification output is not as useful as that of NaïveBayes or J48.
Overall, the best classifier for the Adult dataset was NaïveBayes. The best one for the Zoo
dataset was either J48 or NaïveBayes. J48 provided helpful visual results and NaïveBayes gave
useful statistics for Zoo data. The best one for the Iris dataset was NaïveBayes; the other three
did not give as much useful output. In general, NaïveBayes was the most useful classifier in
terms of output statistics and it was also the fastest. Each of the four classifiers is more suitable
for different datasets though, so they are all useful.
5. Reference
The following are the complete outputs obtained from WEKA’s sample runs for each classifier
applied to each of the three datasets:
5.1 ADULT DATASET OUTPUTS
5.1.1 Decision Tree (J48)
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.01 -M 50
Relation: adult_data-weka.filters.unsupervised.attribute.ReplaceMissingValues-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
annual_pay
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
marital-status = Married-civ-spouse
| education-num = '(-inf-2.5]': <=50K (101.0/6.0)
| education-num = '(2.5-4]': <=50K (531.0/50.0)
| education-num = '(4-5.5]': <=50K (230.0/20.0)
| education-num = '(5.5-7]': <=50K (703.0/99.0)
| education-num = '(7-8.5]': <=50K (130.0/29.0)
| education-num = '(8.5-10]'
| | capital-gain = '(-inf-9999.9]'
| | | capital-loss = '(-inf-435.6]': <=50K (7139.0/2383.0)
| | | capital-loss = '(435.6-871.2]': <=50K (0.0)
| | | capital-loss = '(871.2-1306.8]': <=50K (1.0)
| | | capital-loss = '(1306.8-1742.4]': <=50K (120.0/22.0)
| | | capital-loss = '(1742.4-2178]': >50K (244.0/47.0)
| | | capital-loss = '(2178-2613.6]': <=50K (37.0/13.0)
| | | capital-loss = '(2613.6-3049.2]': <=50K (0.0)
| | | capital-loss = '(3049.2-3484.8]': <=50K (0.0)
| | | capital-loss = '(3484.8-3920.4]': <=50K (0.0)
| | | capital-loss = '(3920.4-inf)': <=50K (0.0)
| | capital-gain = '(9999.9-19999.8]': >50K (82.0/4.0)
| | capital-gain = '(19999.8-29999.7]': >50K (12.0/1.0)
| | capital-gain = '(29999.7-39999.6]': <=50K (0.0)
| | capital-gain = '(39999.6-49999.5]': <=50K (0.0)
| | capital-gain = '(49999.5-59999.4]': <=50K (0.0)
| | capital-gain = '(59999.4-69999.3]': <=50K (0.0)
| | capital-gain = '(69999.3-79999.2]': <=50K (0.0)
| | capital-gain = '(79999.2-89999.1]': <=50K (0.0)
| | capital-gain = '(89999.1-inf)': >50K (28.0)
| education-num = '(10-11.5]': <=50K (689.0/316.0)
| education-num = '(11.5-13]': >50K (3228.0/1147.0)
| education-num = '(13-14.5]': >50K (1003.0/229.0)
| education-num = '(14.5-inf)': >50K (698.0/113.0)
marital-status = Divorced
| capital-gain = '(-inf-9999.9]': <=50K (4359.0/380.0)
| capital-gain = '(9999.9-19999.8]': >50K (51.0)
| capital-gain = '(19999.8-29999.7]': >50K (21.0)
| capital-gain = '(29999.7-39999.6]': <=50K (1.0)
| capital-gain = '(39999.6-49999.5]': <=50K (0.0)
| capital-gain = '(49999.5-59999.4]': <=50K (0.0)
| capital-gain = '(59999.4-69999.3]': <=50K (0.0)
| capital-gain = '(69999.3-79999.2]': <=50K (0.0)
| capital-gain = '(79999.2-89999.1]': <=50K (0.0)
| capital-gain = '(89999.1-inf)': >50K (11.0)
marital-status = Never-married
| capital-gain = '(-inf-9999.9]': <=50K (10570.0/382.0)
| capital-gain = '(9999.9-19999.8]': >50K (81.0)
| capital-gain = '(19999.8-29999.7]': >50K (16.0)
| capital-gain = '(29999.7-39999.6]': <=50K (4.0)
| capital-gain = '(39999.6-49999.5]': <=50K (0.0)
| capital-gain = '(49999.5-59999.4]': <=50K (0.0)
| capital-gain = '(59999.4-69999.3]': <=50K (0.0)
| capital-gain = '(69999.3-79999.2]': <=50K (0.0)
| capital-gain = '(79999.2-89999.1]': <=50K (0.0)
| capital-gain = '(89999.1-inf)': >50K (12.0)
marital-status = Separated: <=50K (1025.0/66.0)
marital-status = Widowed: <=50K (993.0/85.0)
marital-status = Married-spouse-absent: <=50K (418.0/34.0)
marital-status = Married-AF-spouse: <=50K (23.0/10.0)
Number of Leaves : 52
Size of the tree : 58
Time taken to build model: 0.45 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       27042               83.0503 %
Incorrectly Classified Instances      5519               16.9497 %
Kappa statistic                          0.4814
Mean absolute error                      0.2443
Root mean squared error                  0.3496
Relative absolute error                 66.8124 %
Root relative squared error             81.7692 %
Total Number of Instances            32561
=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.493     0.062     0.715       0.493    0.583       0.844      >50K
               0.938     0.507     0.854       0.938    0.894       0.844      <=50K
Weighted Avg.  0.831     0.4       0.82        0.831    0.819       0.844

=== Confusion Matrix ===
a b <-- classified as
3863 3978 | a = >50K
1541 23179 | b = <=50K
WEKA’s Classifier Tree Visualizer
5.1.2 Naïve Bayes
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: adult_data-weka.filters.unsupervised.attribute.ReplaceMissingValues-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
annual_pay
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute
>50K <=50K
(0.24) (0.76)
===============================================
age
  '(-inf-24.3]'              62.0    5510.0
  '(24.3-31.6]'             805.0    5087.0
  '(31.6-38.9]'            1678.0    4372.0
  '(38.9-46.2]'            2230.0    3935.0
  '(46.2-53.5]'            1596.0    2373.0
  '(53.5-60.8]'             925.0    1668.0
  '(60.8-68.1]'             418.0    1179.0
  '(68.1-75.4]'              95.0     403.0
  '(75.4-82.7]'              30.0     146.0
  '(82.7-inf)'               12.0      57.0
  [total]                  7851.0   24730.0
workclass
  Private                  5155.0   19379.0
  Self-emp-not-inc          725.0    1818.0
  Self-emp-inc              623.0     495.0
  Federal-gov               372.0     590.0
  Local-gov                 618.0    1477.0
  State-gov                 354.0     946.0
  Without-pay                 1.0      15.0
  Never-worked                1.0       8.0
  [total]                  7849.0   24728.0
fnlwgt
  '(-inf-159527]'          3231.0    9888.0
  '(159527-306769]'        3621.0   11708.0
  '(306769-454011]'         864.0    2636.0
  '(454011-601253]'         100.0     379.0
  '(601253-748495]'          22.0      82.0
  '(748495-895737]'           5.0      17.0
  '(895737-1042979]'          3.0       9.0
  '(1042979-1190221]'         2.0       5.0
  '(1190221-1337463]'         2.0       2.0
  '(1337463-inf)'             1.0       4.0
  [total]                  7851.0   24730.0
education
  Bachelors                2222.0    3135.0
  Some-college             1388.0    5905.0
  11th                       61.0    1116.0
  HS-grad                  1676.0    8827.0
  Prof-school               424.0     154.0
  Assoc-acdm                266.0     803.0
  Assoc-voc                 362.0    1022.0
  9th                        28.0     488.0
  7th-8th                    41.0     607.0
  12th                       34.0     401.0
  Masters                   960.0     765.0
  1st-4th                     7.0     163.0
  10th                       63.0     872.0
  Doctorate                 307.0     108.0
  5th-6th                    17.0     318.0
  Preschool                   1.0      52.0
  [total]                  7857.0   24736.0
education-num
  '(-inf-2.5]'                7.0     214.0
  '(2.5-4]'                  57.0     924.0
  '(4-5.5]'                  28.0     488.0
  '(5.5-7]'                 123.0    1987.0
  '(7-8.5]'                  34.0     401.0
  '(8.5-10]'               3063.0   14731.0
  '(10-11.5]'               362.0    1022.0
  '(11.5-13]'              2487.0    3937.0
  '(13-14.5]'               960.0     765.0
  '(14.5-inf)'              730.0     261.0
  [total]                  7851.0   24730.0
marital-status
  Married-civ-spouse       6693.0    8285.0
  Divorced                  464.0    3981.0
  Never-married             492.0   10193.0
  Separated                  67.0     960.0
  Widowed                    86.0     909.0
  Married-spouse-absent      35.0     385.0
  Married-AF-spouse          11.0      14.0
  [total]                  7848.0   24727.0
occupation
  Tech-support              284.0     646.0
  Craft-repair              930.0    3171.0
  Other-service             138.0    3159.0
  Sales                     984.0    2668.0
  Exec-managerial          1969.0    2099.0
  Prof-specialty           2051.0    3934.0
  Handlers-cleaners          87.0    1285.0
  Machine-op-inspct         251.0    1753.0
  Adm-clerical              508.0    3264.0
  Farming-fishing           116.0     880.0
  Transport-moving          321.0    1278.0
  Priv-house-serv             2.0     149.0
  Protective-serv           212.0     439.0
  Armed-Forces                2.0       9.0
  [total]                  7855.0   24734.0
relationship
  Wife                      746.0     824.0
  Own-child                  68.0    5002.0
  Husband                  5919.0    7276.0
  Not-in-family             857.0    7450.0
  Other-relative             38.0     945.0
  Unmarried                 219.0    3229.0
  [total]                  7847.0   24726.0
race
  White                    7118.0   20700.0
  Asian-Pac-Islander        277.0     764.0
  Amer-Indian-Eskimo         37.0     276.0
  Other                      26.0     247.0
  Black                     388.0    2738.0
  [total]                  7846.0   24725.0
sex
  Female                   1180.0    9593.0
  Male                     6663.0   15129.0
  [total]                  7843.0   24722.0
capital-gain
  '(-inf-9999.9]'          7086.0   24707.0
  '(9999.9-19999.8]'        512.0       7.0
  '(19999.8-29999.7]'        87.0       2.0
  '(29999.7-39999.6]'         1.0       6.0
  '(39999.6-49999.5]'         1.0       3.0
  '(49999.5-59999.4]'         1.0       1.0
  '(59999.4-69999.3]'         1.0       1.0
  '(69999.3-79999.2]'         1.0       1.0
  '(79999.2-89999.1]'         1.0       1.0
  '(89999.1-inf)'           160.0       1.0
  [total]                  7851.0   24730.0
capital-loss
  '(-inf-435.6]'           7069.0   23986.0
  '(435.6-871.2]'             3.0      16.0
  '(871.2-1306.8]'            1.0      22.0
  '(1306.8-1742.4]'          57.0     406.0
  '(1742.4-2178]'           581.0     200.0
  '(2178-2613.6]'           123.0      86.0
  '(2613.6-3049.2]'          13.0       3.0
  '(3049.2-3484.8]'           1.0       1.0
  '(3484.8-3920.4]'           2.0       6.0
  '(3920.4-inf)'              1.0       4.0
  [total]                  7851.0   24730.0
hours-per-week
  '(-inf-10.8]'              66.0     672.0
  '(10.8-20.6]'             131.0    2063.0
  '(20.6-30.4]'             159.0    2160.0
  '(30.4-40.2]'            3633.0   14104.0
  '(40.2-50]'              2353.0    3587.0
  '(50-59.8]'               453.0     607.0
  '(59.8-69.6]'             777.0    1021.0
  '(69.6-79.4]'             157.0     293.0
  '(79.4-89.2]'              80.0     124.0
  '(89.2-inf)'               42.0      99.0
  [total]                  7851.0   24730.0
native-country
  United-States            7318.0   22437.0
  Cambodia                    8.0      13.0
  England                    31.0      61.0
  Puerto-Rico                13.0     103.0
  Canada                     40.0      83.0
  Germany                    45.0      94.0
  Outlying-US(Guam-USVI-etc)  1.0      15.0
  India                      41.0      61.0
  Japan                      25.0      39.0
  Greece                      9.0      22.0
  South                      17.0      65.0
  China                      21.0      56.0
  Cuba                       26.0      71.0
  Iran                       19.0      26.0
  Honduras                    2.0      13.0
  Philippines                62.0     138.0
  Italy                      26.0      49.0
  Poland                     13.0      49.0
  Jamaica                    11.0      72.0
  Vietnam                     6.0      63.0
  Mexico                     34.0     611.0
  Portugal                    5.0      34.0
  Ireland                     6.0      20.0
  France                     13.0      18.0
  Dominican-Republic          3.0      69.0
  Laos                        3.0      17.0
  Ecuador                     5.0      25.0
  Taiwan                     21.0      32.0
  Haiti                       5.0      41.0
  Columbia                    3.0      58.0
  Hungary                     4.0      11.0
  Guatemala                   4.0      62.0
  Nicaragua                   3.0      33.0
  Scotland                    4.0      10.0
  Thailand                    4.0      16.0
  Yugoslavia                  7.0      11.0
  El-Salvador                10.0      98.0
  Trinadad&Tobago             3.0      18.0
  Peru                        3.0      30.0
  Hong                        7.0      15.0
  Holand-Netherlands          1.0       2.0
  [total]                  7882.0   24761.0
Time taken to build model: 0.06 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       26676               81.9262 %
Incorrectly Classified Instances      5885               18.0738 %
Kappa statistic                          0.5507
Mean absolute error                      0.1971
Root mean squared error                  0.3631
Relative absolute error                 53.909  %
Root relative squared error             84.9288 %
Total Number of Instances            32561
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.772     0.166      0.596     0.772      0.673       0.896   >50K
                 0.834     0.228      0.92      0.834      0.875       0.896   <=50K
Weighted Avg.    0.819     0.213      0.842     0.819      0.826       0.896
=== Confusion Matrix ===
a b <-- classified as
6050 1791 | a = >50K
4094 20626 | b = <=50K
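To see how the count tables above turn into a prediction, a Naïve Bayes posterior is the class prior times the product of the per-attribute conditional probabilities. A minimal sketch (not WEKA output), using counts copied from the model above for just two attributes of a hypothetical instance:

```python
# Naive Bayes posterior ∝ prior × Π P(attribute value | class).
# Counts are taken from the model printed above; only two attributes are used
# here for brevity (age in '(38.9-46.2]', marital-status = Married-civ-spouse).
priors = {'>50K': 0.24, '<=50K': 0.76}
cond = {
    '>50K':  {'age': 2230 / 7851,  'marital': 6693 / 7848},
    '<=50K': {'age': 3935 / 24730, 'marital': 8285 / 24727},
}
score = {c: priors[c] * cond[c]['age'] * cond[c]['marital'] for c in priors}
posterior = {c: s / sum(score.values()) for c, s in score.items()}
print(max(posterior, key=posterior.get))  # >50K wins for this partial instance
```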
5.1.3 KNN (IBk)
=== Run information ===
Scheme:
weka.classifiers.lazy.IBk -K 1 -W 0 -A
"weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: adult_data-weka.filters.unsupervised.attribute.ReplaceMissingValuesweka.filters.unsupervised.attribute.Discretize-B10-M-1.0-Rfirst-last
Instances: 32561
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
annual_pay
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
IB1 instance-based classifier
using 1 nearest neighbour(s) for classification
Time taken to build model: 0.02 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       26483               81.3335 %
Incorrectly Classified Instances      6078               18.6665 %
Kappa statistic                          0.4908
Mean absolute error                      0.2106
Root mean squared error                  0.3818
Relative absolute error                 57.5896 %
Root relative squared error             89.2849 %
Total Number of Instances            32561
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.616     0.124      0.612     0.616      0.614       0.832   >50K
                 0.876     0.384      0.878     0.876      0.877       0.832   <=50K
Weighted Avg.    0.813     0.321      0.814     0.813      0.814       0.832
=== Confusion Matrix ===
a b <-- classified as
4832 3009 | a = >50K
3069 21651 | b = <=50K
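IBk with K = 1 simply assigns each instance the class of its single nearest training neighbour under Euclidean distance. A minimal sketch of that rule, using tiny hypothetical points (not the Adult data):

```python
import math

def one_nn(train, query):
    """Return the label of the training point closest to `query` (1-NN, Euclidean)."""
    nearest = min(train, key=lambda p: math.dist(p[0], query))
    return nearest[1]

# Toy (hypothetical) training data: (feature vector, label)
train = [((1.0, 1.0), '<=50K'), ((1.2, 0.9), '<=50K'), ((5.0, 5.0), '>50K')]
print(one_nn(train, (4.5, 5.2)))  # >50K
```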
5.1.4 SVM (SMO with 25% sample of Adult dataset)
=== Run information ===
Scheme:
weka.classifiers.functions.SMO -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K
"weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Relation: adult_data-weka.filters.supervised.attribute.Discretize-Rfirst-lastweka.filters.unsupervised.instance.Resample-S1-Z25.0weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 8140
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
annual_pay
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
SMO
Kernel used:
Linear Kernel: K(x,y) = <x,y>
Classifier for classes: >50K, <=50K
BinarySMO
Machine linear: showing attribute weights, not support vectors.
0.1958 * (normalized) age='(-inf-21.5]'
0.6555 * (normalized) age='(21.5-23.5]'
0.4589 * (normalized) age='(23.5-27.5]'
-0.0416 * (normalized) age='(27.5-29.5]'
-0.1178 * (normalized) age='(29.5-35.5]'
-0.3482 * (normalized) age='(35.5-43.5]'
-0.4389 * (normalized) age='(43.5-54.5]'
-0.4273 * (normalized) age='(54.5-61.5]'
0.0635 * (normalized) age='(61.5-inf)'
-0.0508 * (normalized) workclass=Private
0.2203 * (normalized) workclass=Self-emp-not-inc
-0.4643 * (normalized) workclass=Self-emp-inc
-0.1311 * (normalized) workclass=Federal-gov
-0.0108 * (normalized) workclass=Local-gov
0.1789 * (normalized) workclass=State-gov
0.2579 * (normalized) workclass=Never-worked
-0.2787 * (normalized) education=Bachelors
0.181 * (normalized) education=Some-college
-0.0973 * (normalized) education=11th
0.2712 * (normalized) education=HS-grad
-0.411 * (normalized) education=Prof-school
0.199 * (normalized) education=Assoc-acdm
-0.0063 * (normalized) education=Assoc-voc
-0.1871 * (normalized) education=9th
0.1337 * (normalized) education=7th-8th
0 * (normalized) education=12th
-0.3161 * (normalized) education=Masters
0.132 * (normalized) education=1st-4th
-0.187 * (normalized) education=10th
-0.3691 * (normalized) education=Doctorate
0.4092 * (normalized) education=5th-6th
0.5265 * (normalized) education=Preschool
0.7301 * (normalized) education-num='(-inf-8.5]'
0.2712 * (normalized) education-num='(8.5-9.5]'
0.181 * (normalized) education-num='(9.5-10.5]'
0.1926 * (normalized) education-num='(10.5-12.5]'
-0.2787 * (normalized) education-num='(12.5-13.5]'
-0.3161 * (normalized) education-num='(13.5-14.5]'
-0.7801 * (normalized) education-num='(14.5-inf)'
-0.4299 * (normalized) marital-status=Married-civ-spouse
0.2246 * (normalized) marital-status=Divorced
0.3066 * (normalized) marital-status=Never-married
-0.0862 * (normalized) marital-status=Separated
-0.1672 * (normalized) marital-status=Widowed
0.547 * (normalized) marital-status=Married-spouse-absent
-0.3948 * (normalized) marital-status=Married-AF-spouse
-0.4997 * (normalized) occupation=Tech-support
0.0927 * (normalized) occupation=Craft-repair
0.4868 * (normalized) occupation=Other-service
-0.0874 * (normalized) occupation=Sales
-0.501 * (normalized) occupation=Exec-managerial
-0.1382 * (normalized) occupation=Prof-specialty
0.1826 * (normalized) occupation=Handlers-cleaners
0.1712 * (normalized) occupation=Machine-op-inspct
-0.1376 * (normalized) occupation=Adm-clerical
0.3981 * (normalized) occupation=Farming-fishing
0.1706 * (normalized) occupation=Transport-moving
0.6423 * (normalized) occupation=Priv-house-serv
-0.7803 * (normalized) occupation=Protective-serv
0 * (normalized) occupation=Armed-Forces
-0.9709 * (normalized) relationship=Wife
0.4985 * (normalized) relationship=Own-child
-0.3523 * (normalized) relationship=Husband
0.1661 * (normalized) relationship=Not-in-family
0.2024 * (normalized) relationship=Other-relative
0.4562 * (normalized) relationship=Unmarried
-0.17 * (normalized) race=White
0.0623 * (normalized) race=Asian-Pac-Islander
0.5142 * (normalized) race=Amer-Indian-Eskimo
-0.2474 * (normalized) race=Other
-0.1591 * (normalized) race=Black
-0.2996 * (normalized) sex
0.8433 * (normalized) capital-gain='(-inf-57]'
1.7478 * (normalized) capital-gain='(57-3048]'
-1.2452 * (normalized) capital-gain='(3048-3120]'
2.5481 * (normalized) capital-gain='(3120-4243.5]'
-1 * (normalized) capital-gain='(4243.5-4401]'
1.4045 * (normalized) capital-gain='(4401-4668.5]'
-2 * (normalized) capital-gain='(4668.5-4826]'
1 * (normalized) capital-gain='(4826-4932.5]'
-1.7386 * (normalized) capital-gain='(4932.5-4973.5]'
2.2733 * (normalized) capital-gain='(4973.5-5119]'
-1.8125 * (normalized) capital-gain='(5119-5316.5]'
0.7985 * (normalized) capital-gain='(5316.5-5505.5]'
-0.3352 * (normalized) capital-gain='(5505.5-6618.5]'
0.4184 * (normalized) capital-gain='(6618.5-7073.5]'
-2.9024 * (normalized) capital-gain='(7073.5-inf)'
0.9743 * (normalized) capital-loss='(-inf-1551.5]'
-1.1147 * (normalized) capital-loss='(1551.5-1568.5]'
1.8271 * (normalized) capital-loss='(1568.5-1820.5]'
-1 * (normalized) capital-loss='(1820.5-1862]'
1.4065 * (normalized) capital-loss='(1862-1881.5]'
-1.1672 * (normalized) capital-loss='(1881.5-1923]'
0.1524 * (normalized) capital-loss='(1923-1975.5]'
-1 * (normalized) capital-loss='(1975.5-1978.5]'
1.4759 * (normalized) capital-loss='(1978.5-2168.5]'
1 * (normalized) capital-loss='(2176.5-2218.5]'
0 * (normalized) capital-loss='(2218.5-2384.5]'
-1.6022 * (normalized) capital-loss='(2384.5-2450.5]'
-1.0032 * (normalized) capital-loss='(2450.5-3726.5]'
0.0511 * (normalized) capital-loss='(3726.5-inf)'
0.5095 * (normalized) hours-per-week='(-inf-34.5]'
0.0488 * (normalized) hours-per-week='(34.5-39.5]'
0.0491 * (normalized) hours-per-week='(39.5-41.5]'
-0.1823 * (normalized) hours-per-week='(41.5-49.5]'
-0.2586 * (normalized) hours-per-week='(49.5-65.5]'
-0.1666 * (normalized) hours-per-week='(65.5-inf)'
-0.0123 * (normalized) native-country=United-States
0 * (normalized) native-country=Cambodia
-0.8794 * (normalized) native-country=England
0.0783 * (normalized) native-country=Puerto-Rico
-0.2733 * (normalized) native-country=Canada
-0.4264 * (normalized) native-country=Germany
0 * (normalized) native-country=Outlying-US(Guam-USVI-etc)
-0.1786 * (normalized) native-country=India
-0.244 * (normalized) native-country=Japan
0.2454 * (normalized) native-country=Greece
1.2475 * (normalized) native-country=South
0 * (normalized) native-country=China
-0.0123 * (normalized) native-country=Cuba
0 * (normalized) native-country=Iran
0 * (normalized) native-country=Honduras
-0.6894 * (normalized) native-country=Philippines
-0.4222 * (normalized) native-country=Italy
0.2954 * (normalized) native-country=Poland
1 * (normalized) native-country=Jamaica
1 * (normalized) native-country=Vietnam
-0.086 * (normalized) native-country=Mexico
-0.2418 * (normalized) native-country=Portugal
-0.0921 * (normalized) native-country=Ireland
0.7286 * (normalized) native-country=France
0.7175 * (normalized) native-country=Dominican-Republic
0 * (normalized) native-country=Laos
0.0002 * (normalized) native-country=Ecuador
0 * (normalized) native-country=Taiwan
0.0673 * (normalized) native-country=Haiti
0.2742 * (normalized) native-country=Columbia
-1 * (normalized) native-country=Hungary
0.2587 * (normalized) native-country=Guatemala
0.3218 * (normalized) native-country=Nicaragua
0 * (normalized) native-country=Scotland
-0.2457 * (normalized) native-country=Thailand
0 * (normalized) native-country=El-Salvador
-1.6907 * (normalized) native-country=Trinadad&Tobago
0.5544 * (normalized) native-country=Peru
-0.295 * (normalized) native-country=Hong
0.1615
Number of kernel evaluations: 79605087 (49.256% cached)
Time taken to build model: 574.64 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        7059               86.7199 %
Incorrectly Classified Instances      1081               13.2801 %
Kappa statistic                          0.5975
Mean absolute error                      0.1328
Root mean squared error                  0.3644
Relative absolute error                 36.7839 %
Root relative squared error             85.7762 %
Total Number of Instances             8140
=== Detailed Accuracy By Class ===
TP Rate
0.595
0.952
Weighted 0.867
Avg.
FP Rate
0.048
0.405
0.321
Precision
0.792
0.883
0.862
Recall
0.595
0.952
0.867
F-Measure ROC Area Class
0.679
0.773
>50K
0.916
0.773
<=50K
0.86
0.773
=== Confusion Matrix ===
a b <-- classified as
1144 780 | a = >50K
301 5915 | b = <=50K
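The "Machine linear" dump above lists one weight per normalized one-hot attribute; the prediction is just the sign of w·x + b. A sketch of that decision rule, using a hand-picked subset of the printed weights (the sign-to-class mapping is assumed here from the weight pattern, e.g. large capital gains pushing strongly toward >50K):

```python
# Sketch of the linear SMO decision: sign of w·x + b over one-hot attributes.
# Weights below are a small subset copied from the dump above, for illustration only.
w = {"age='(21.5-23.5]'": 0.6555,
     "relationship=Wife": -0.9709,
     "capital-gain='(7073.5-inf)'": -2.9024}
b = 0.1615  # the constant printed at the end of the weight list

def smo_decision(active_attrs):
    # Active one-hot attributes contribute their weight; assumed mapping:
    # negative score -> >50K, non-negative -> <=50K.
    s = sum(w.get(a, 0.0) for a in active_attrs) + b
    return '>50K' if s < 0 else '<=50K'

print(smo_decision({"capital-gain='(7073.5-inf)'"}))  # >50K
```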
5.2 IRIS DATASET OUTPUTS
5.2.1 Decision Tree (J48)
=== Run information ===
Scheme:
weka.classifiers.trees.J48 -C 0.1 -M 2
Relation: iris-weka.filters.unsupervised.attribute.Discretize-B5-M-1.0-Rfirst-last
Instances: 150
Attributes: 5
sepal_length
sepal_width
petal_length
petal_width
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
petal_length = '(-inf-2.18]': Iris-setosa (50.0)
petal_length = '(2.18-3.36]': Iris-versicolor (3.0)
petal_length = '(3.36-4.54]': Iris-versicolor (34.0/1.0)
petal_length = '(4.54-5.72]'
| petal_width = '(-inf-0.58]': Iris-virginica (0.0)
| petal_width = '(0.58-1.06]': Iris-virginica (0.0)
| petal_width = '(1.06-1.54]': Iris-versicolor (13.0/3.0)
| petal_width = '(1.54-2.02]': Iris-virginica (20.0/4.0)
| petal_width = '(2.02-inf)': Iris-virginica (14.0)
petal_length = '(5.72-inf)': Iris-virginica (16.0)
Number of Leaves  : 9
Size of the tree  : 11
Time taken to build model: 0.01 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         142               94.6667 %
Incorrectly Classified Instances         8                5.3333 %
Kappa statistic                          0.92
Mean absolute error                      0.0607
Root mean squared error                  0.1788
Relative absolute error                 13.6478 %
Root relative squared error             37.9371 %
Total Number of Instances              150
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1         1          1           1       Iris-setosa
                 0.92      0.04       0.92      0.92       0.92        0.971   Iris-versicolor
                 0.92      0.04       0.92      0.92       0.92        0.966   Iris-virginica
Weighted Avg.    0.947     0.027      0.947     0.947      0.947       0.979
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 46 4 | b = Iris-versicolor
0 4 46 | c = Iris-virginica
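The kappa statistic of 0.92 reported above can be reproduced from this confusion matrix: kappa compares the observed agreement against the agreement expected by chance from the row and column totals. A short sketch:

```python
# Cohen's kappa from the confusion matrix above: (p_o - p_e) / (1 - p_e).
matrix = [[50, 0, 0],
          [0, 46, 4],
          [0, 4, 46]]
n = sum(map(sum, matrix))
p_o = sum(matrix[i][i] for i in range(3)) / n            # observed accuracy
row = [sum(r) for r in matrix]                           # actual class totals
col = [sum(matrix[i][j] for i in range(3)) for j in range(3)]  # predicted totals
p_e = sum(row[i] * col[i] for i in range(3)) / n ** 2    # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.92, as reported in the summary
```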
WEKA’s Classifier Tree Visualizer
5.2.2 Naïve Bayes
=== Run information ===
Scheme:
weka.classifiers.bayes.NaiveBayes
Relation: iris-weka.filters.unsupervised.attribute.Discretize-B5-M-1.0-Rfirst-last
Instances: 150
Attributes: 5
sepal_length
sepal_width
petal_length
petal_width
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
                             Class
Attribute        Iris-setosa  Iris-versicolor  Iris-virginica
                   (0.33)         (0.33)           (0.33)
==================================================================
sepal_length
  '(-inf-5.02]'     29.0    4.0    2.0
  '(5.02-5.74]'     22.0   19.0    3.0
  '(5.74-6.46]'      2.0   21.0   22.0
  '(6.46-7.18]'      1.0   10.0   16.0
  '(7.18-inf)'       1.0    1.0   12.0
  [total]           55.0   55.0   55.0
sepal_width
  '(-inf-2.48]'      2.0   10.0    2.0
  '(2.48-2.96]'      2.0   26.0   21.0
  '(2.96-3.44]'     28.0   17.0   27.0
  '(3.44-3.92]'     18.0    1.0    4.0
  '(3.92-inf)'       5.0    1.0    1.0
  [total]           55.0   55.0   55.0
petal_length
  '(-inf-2.18]'     51.0    1.0    1.0
  '(2.18-3.36]'      1.0    4.0    1.0
  '(3.36-4.54]'      1.0   34.0    2.0
  '(4.54-5.72]'      1.0   15.0   34.0
  '(5.72-inf)'       1.0    1.0   17.0
  [total]           55.0   55.0   55.0
petal_width
  '(-inf-0.58]'     50.0    1.0    1.0
  '(0.58-1.06]'      2.0    8.0    1.0
  '(1.06-1.54]'      1.0   39.0    4.0
  '(1.54-2.02]'      1.0    6.0   25.0
  '(2.02-inf)'       1.0    1.0   24.0
  [total]           55.0   55.0   55.0
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         140               93.3333 %
Incorrectly Classified Instances        10                6.6667 %
Kappa statistic                          0.9
Mean absolute error                      0.0629
Root mean squared error                  0.2036
Relative absolute error                 14.1535 %
Root relative squared error             43.1989 %
Total Number of Instances              150
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1         1          1           1       Iris-setosa
                 0.92      0.06       0.885     0.92       0.902       0.974   Iris-versicolor
                 0.88      0.04       0.917     0.88       0.898       0.975   Iris-virginica
Weighted Avg.    0.933     0.033      0.934     0.933      0.933       0.983
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 46 4 | b = Iris-versicolor
0 6 44 | c = Iris-virginica
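The F-measure column above is simply the harmonic mean of precision and recall; for example, the Iris-versicolor row can be recomputed as:

```python
# F-measure = 2PR / (P + R); precision and recall for Iris-versicolor
# from the detailed-accuracy table above.
precision, recall = 0.885, 0.92
f = 2 * precision * recall / (precision + recall)
print(round(f, 3))  # 0.902, matching the table
```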
5.2.3 KNN (IBk)
=== Run information ===
Scheme:
weka.classifiers.lazy.IBk -K 1 -W 0 -A
"weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: iris-weka.filters.unsupervised.attribute.Discretize-B5-M-1.0-Rfirst-last
Instances: 150
Attributes: 5
sepal_length
sepal_width
petal_length
petal_width
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
IB1 instance-based classifier
using 1 nearest neighbour(s) for classification
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         138               92      %
Incorrectly Classified Instances        12                8      %
Kappa statistic                          0.88
Mean absolute error                      0.0596
Root mean squared error                  0.1851
Relative absolute error                 13.4013 %
Root relative squared error             39.2653 %
Total Number of Instances              150
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1         1          1           1       Iris-setosa
                 0.88      0.06       0.88      0.88       0.88        0.971   Iris-versicolor
                 0.88      0.06       0.88      0.88       0.88        0.975   Iris-virginica
Weighted Avg.    0.92      0.04       0.92      0.92       0.92        0.982
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 44 6 | b = Iris-versicolor
0 6 44 | c = Iris-virginica
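The bin boundaries seen throughout these Iris outputs (e.g. petal_length cut at 2.18, 3.36, 4.54, 5.72) come from the equal-width Discretize filter with 5 bins. A sketch of that preprocessing step, assuming petal_length spans 1.0 to 6.9 (the attribute's actual range in the Iris data):

```python
# Equal-width discretization into 5 bins, as done by
# weka.filters.unsupervised.attribute.Discretize -B 5.
# Assumed attribute range for petal_length: 1.0 .. 6.9.
lo, hi, bins = 1.0, 6.9, 5
width = (hi - lo) / bins
cuts = [round(lo + width * i, 2) for i in range(1, bins)]
print(cuts)  # [2.18, 3.36, 4.54, 5.72] -- the cut points in the output above
```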
5.2.4 SVM (SMO)
=== Run information ===
Scheme:
weka.classifiers.functions.SMO -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K
"weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Relation: iris-weka.filters.unsupervised.attribute.Discretize-B5-M-1.0-Rfirst-last
Instances: 150
Attributes: 5
sepal_length
sepal_width
petal_length
petal_width
class
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
SMO
Kernel used:
Linear Kernel: K(x,y) = <x,y>
Classifier for classes: Iris-setosa, Iris-versicolor
BinarySMO
Machine linear: showing attribute weights, not support vectors.
-0.1785 * (normalized) sepal_length='(-inf-5.02]'
-0.0258 * (normalized) sepal_length='(5.02-5.74]'
0.1021 * (normalized) sepal_length='(5.74-6.46]'
0.1022 * (normalized) sepal_length='(6.46-7.18]'
0.2549 * (normalized) sepal_width='(-inf-2.48]'
0.1033 * (normalized) sepal_width='(2.48-2.96]'
0.1011 * (normalized) sepal_width='(2.96-3.44]'
-0.4334 * (normalized) sepal_width='(3.44-3.92]'
-0.026 * (normalized) sepal_width='(3.92-inf)'
-0.9508 * (normalized) petal_length='(-inf-2.18]'
0.3609 * (normalized) petal_length='(2.18-3.36]'
0.3596 * (normalized) petal_length='(3.36-4.54]'
0.2302 * (normalized) petal_length='(4.54-5.72]'
-0.5173 * (normalized) petal_width='(-inf-0.58]'
0.1705 * (normalized) petal_width='(0.58-1.06]'
0.1731 * (normalized) petal_width='(1.06-1.54]'
0.1737 * (normalized) petal_width='(1.54-2.02]'
0.3926
Number of kernel evaluations: 1314 (85.461% cached)
Classifier for classes: Iris-setosa, Iris-virginica
BinarySMO
Machine linear: showing attribute weights, not support vectors.
-0.2116 * (normalized) sepal_length='(-inf-5.02]'
-0.0979 * (normalized) sepal_length='(5.02-5.74]'
0.0647 * (normalized) sepal_length='(5.74-6.46]'
0.0641 * (normalized) sepal_length='(6.46-7.18]'
0.1806 * (normalized) sepal_length='(7.18-inf)'
0.1241 * (normalized) sepal_width='(-inf-2.48]'
0.1247 * (normalized) sepal_width='(2.48-2.96]'
0.0103 * (normalized) sepal_width='(2.96-3.44]'
-0.106 * (normalized) sepal_width='(3.44-3.92]'
-0.153 * (normalized) sepal_width='(3.92-inf)'
-0.8027 * (normalized) petal_length='(-inf-2.18]'
0.3403 * (normalized) petal_length='(3.36-4.54]'
0.2313 * (normalized) petal_length='(4.54-5.72]'
0.2311 * (normalized) petal_length='(5.72-inf)'
-0.5161 * (normalized) petal_width='(-inf-0.58]'
-0.2866 * (normalized) petal_width='(0.58-1.06]'
0.1737 * (normalized) petal_width='(1.06-1.54]'
0.3403 * (normalized) petal_width='(1.54-2.02]'
0.2887 * (normalized) petal_width='(2.02-inf)'
0.4062
Number of kernel evaluations: 1677 (82.133% cached)
Classifier for classes: Iris-versicolor, Iris-virginica
BinarySMO
Machine linear: showing attribute weights, not support vectors.
0.1874 * (normalized) sepal_length='(-inf-5.02]'
0 * (normalized) sepal_length='(5.02-5.74]'
-0.0937 * (normalized) sepal_length='(5.74-6.46]'
-0.0937 * (normalized) sepal_length='(6.46-7.18]'
0 * (normalized) sepal_length='(7.18-inf)'
0.1874 * (normalized) sepal_width='(-inf-2.48]'
-0.0937 * (normalized) sepal_width='(2.48-2.96]'
-0.0938 * (normalized) sepal_width='(2.96-3.44]'
0 * (normalized) sepal_width='(3.44-3.92]'
-0.4055 * (normalized) petal_length='(2.18-3.36]'
-0.407 * (normalized) petal_length='(3.36-4.54]'
0.4065 * (normalized) petal_length='(4.54-5.72]'
0.406 * (normalized) petal_length='(5.72-inf)'
-0.8126 * (normalized) petal_width='(0.58-1.06]'
-1.0627 * (normalized) petal_width='(1.06-1.54]'
0.9378 * (normalized) petal_width='(1.54-2.02]'
0.9375 * (normalized) petal_width='(2.02-inf)'
-0.1561
Number of kernel evaluations: 2285 (82.537% cached)
Time taken to build model: 0.3 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         141               94      %
Incorrectly Classified Instances         9                6      %
Kappa statistic                          0.91
Mean absolute error                      0.2356
Root mean squared error                  0.2956
Relative absolute error                 53      %
Root relative squared error             62.7163 %
Total Number of Instances              150
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1         1          1           1       Iris-setosa
                 0.9       0.04       0.918     0.9        0.909       0.934   Iris-versicolor
                 0.92      0.05       0.902     0.92       0.911       0.953   Iris-virginica
Weighted Avg.    0.94      0.03       0.94      0.94       0.94        0.962
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 45 5 | b = Iris-versicolor
0 4 46 | c = Iris-virginica
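SMO handles this three-class problem by training one binary machine per pair of classes (the three "Classifier for classes" blocks above) and combining them by majority vote (one-vs-one). A minimal sketch of the voting step, with hypothetical pairwise outcomes for a single instance:

```python
from collections import Counter

def one_vs_one_vote(pairwise_winners):
    """Majority vote over the pairwise classifiers' chosen classes."""
    return Counter(pairwise_winners).most_common(1)[0][0]

# Hypothetical winners of the three binary machines for one instance:
votes = ['Iris-versicolor',   # setosa vs versicolor machine
         'Iris-virginica',    # setosa vs virginica machine
         'Iris-versicolor']   # versicolor vs virginica machine
print(one_vs_one_vote(votes))  # Iris-versicolor
```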
5.3 ZOO DATASET OUTPUTS
5.3.1 Decision Tree (J48)
=== Run information ===
Scheme:
weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: zoo-weka.filters.unsupervised.attribute.Remove-R1
Instances: 101
Attributes: 17
hair
feathers
eggs
milk
airborne
aquatic
predator
toothed
backbone
breathes
venomous
fins
legs
tail
domestic
catsize
type
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
J48 pruned tree
------------------
feathers = 0
| milk = 0
| | backbone = 0
| | | airborne = 0
| | | | predator = 0
| | | | | legs = 0: 7 (2.0)
| | | | | legs = 2: 6 (0.0)
| | | | | legs = 4: 6 (0.0)
| | | | | legs = 5: 6 (0.0)
| | | | | legs = 6: 6 (2.0)
| | | | | legs = 8: 6 (0.0)
| | | | predator = 1: 7 (8.0)
| | | airborne = 1: 6 (6.0)
| | backbone = 1
| | | fins = 0
| | | | tail = 0: 5 (3.0)
| | | | tail = 1: 3 (6.0/1.0)
| | | fins = 1: 4 (13.0)
| milk = 1: 1 (41.0)
feathers = 1: 2 (20.0)
Number of Leaves  : 13
Size of the tree  : 21
Time taken to build model: 0.14 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          93               92.0792 %
Incorrectly Classified Instances         8                7.9208 %
Kappa statistic                          0.8955
Mean absolute error                      0.0225
Root mean squared error                  0.1375
Relative absolute error                 10.2478 %
Root relative squared error             41.6673 %
Total Number of Instances              101
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1         1          1           1       1
                 1         0          1         1          1           1       2
                 0.6       0.01       0.75      0.6        0.667       0.793   3
                 1         0.011      0.929     1          0.963       0.994   4
                 0.75      0          1         0.75       0.857       0.872   5
                 0.625     0.032      0.625     0.625      0.625       0.923   6
                 0.8       0.033      0.727     0.8        0.762       0.984   7
Weighted Avg.    0.921     0.008      0.922     0.921      0.92        0.976
=== Confusion Matrix ===
a b c d e f g <-- classified as
41 0 0 0 0 0 0 | a = 1
0 20 0 0 0 0 0 | b = 2
0 0 3 1 0 1 0| c=3
0 0 0 13 0 0 0 | d = 4
0 0 1 0 3 0 0| e=5
0 0 0 0 0 5 3| f=6
0 0 0 0 0 2 8| g=7
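The "Weighted Avg." row above weighs each per-class metric by that class's support (the row totals of the confusion matrix). For instance, the weighted TP rate can be recomputed as:

```python
# Weighted-average TP rate: per-class recall weighted by class support
# (row totals of the confusion matrix above).
support = [41, 20, 5, 13, 4, 8, 10]
tp_rate = [1, 1, 0.6, 1, 0.75, 0.625, 0.8]
weighted = sum(s * t for s, t in zip(support, tp_rate)) / sum(support)
print(round(weighted, 3))  # 0.921, matching the Weighted Avg. row
```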
WEKA’s Classifier Tree Visualizer
5.3.2 Naïve Bayes
=== Run information ===
Scheme:
weka.classifiers.bayes.NaiveBayes
Relation: zoo-weka.filters.unsupervised.attribute.Remove-R1
Instances: 101
Attributes: 17
hair
feathers
eggs
milk
airborne
aquatic
predator
toothed
backbone
breathes
venomous
fins
legs
tail
domestic
catsize
type
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Naive Bayes Classifier
                            Class
Attribute        1      2      3      4      5      6      7
              (0.39) (0.19) (0.06) (0.13) (0.05) (0.08) (0.1)
================================================================
hair
  0            3.0  21.0   6.0  14.0   5.0   5.0  11.0
  1           40.0   1.0   1.0   1.0   1.0   5.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
feathers
  0           42.0   1.0   6.0  14.0   5.0   9.0  11.0
  1            1.0  21.0   1.0   1.0   1.0   1.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
eggs
  0           41.0   1.0   2.0   1.0   1.0   1.0   2.0
  1            2.0  21.0   5.0  14.0   5.0   9.0  10.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
milk
  0            1.0  21.0   6.0  14.0   5.0   9.0  11.0
  1           42.0   1.0   1.0   1.0   1.0   1.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
airborne
  0           40.0   5.0   6.0  14.0   5.0   3.0  11.0
  1            3.0  17.0   1.0   1.0   1.0   7.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
aquatic
  0           36.0  15.0   5.0   1.0   1.0   9.0   5.0
  1            7.0   7.0   2.0  14.0   5.0   1.0   7.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
predator
  0           20.0  12.0   2.0   5.0   2.0   8.0   3.0
  1           23.0  10.0   5.0  10.0   4.0   2.0   9.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
toothed
  0            2.0  21.0   2.0   1.0   1.0   9.0  11.0
  1           41.0   1.0   5.0  14.0   5.0   1.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
backbone
  0            1.0   1.0   1.0   1.0   1.0   9.0  11.0
  1           42.0  21.0   6.0  14.0   5.0   1.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
breathes
  0            1.0   1.0   2.0  14.0   1.0   1.0   8.0
  1           42.0  21.0   5.0   1.0   5.0   9.0   4.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
venomous
  0           42.0  21.0   4.0  13.0   4.0   7.0   9.0
  1            1.0   1.0   3.0   2.0   2.0   3.0   3.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
fins
  0           38.0  21.0   6.0   1.0   5.0   9.0  11.0
  1            5.0   1.0   1.0  14.0   1.0   1.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
legs
  0            4.0   1.0   4.0  14.0   1.0   1.0   5.0
  2            8.0  21.0   1.0   1.0   1.0   1.0   1.0
  4           32.0   1.0   3.0   1.0   5.0   1.0   2.0
  5            1.0   1.0   1.0   1.0   1.0   1.0   2.0
  6            1.0   1.0   1.0   1.0   1.0   9.0   3.0
  8            1.0   1.0   1.0   1.0   1.0   1.0   3.0
  [total]     47.0  26.0  11.0  19.0  10.0  14.0  16.0
tail
  0            7.0   1.0   1.0   1.0   4.0   9.0  10.0
  1           36.0  21.0   6.0  14.0   2.0   1.0   2.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
domestic
  0           34.0  18.0   6.0  13.0   5.0   8.0  11.0
  1            9.0   4.0   1.0   2.0   1.0   2.0   1.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
catsize
  0           10.0  15.0   5.0  10.0   5.0   9.0  10.0
  1           33.0   7.0   2.0   5.0   1.0   1.0   2.0
  [total]     43.0  22.0   7.0  15.0   6.0  10.0  12.0
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          94               93.0693 %
Incorrectly Classified Instances         7                6.9307 %
Kappa statistic                          0.9089
Mean absolute error                      0.0203
Root mean squared error                  0.1025
Relative absolute error                  9.2616 %
Root relative squared error             31.0791 %
Total Number of Instances              101
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.976     0          1         0.976      0.988       1       1
                 1         0.012      0.952     1          0.976       1       2
                 0.6       0.021      0.6       0.6        0.6         0.983   3
                 1         0.023      0.867     1          0.929       1       4
                 0.75      0          1         0.75       0.857       1       5
                 1         0.022      0.8       1          0.889       1       6
                 0.7       0          1         0.7        0.824       0.998   7
Weighted Avg.    0.931     0.008      0.938     0.931      0.929       0.999
=== Confusion Matrix ===
a b c d e f g <-- classified as
40 0 0 1 0 0 0 | a = 1
0 20 0 0 0 0 0 | b = 2
0 1 3 1 0 0 0| c=3
0 0 0 13 0 0 0 | d = 4
0 0 1 0 3 0 0| e=5
0 0 0 0 0 8 0| f=6
0 0 1 0 0 2 7| g=7
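Notice that every cell in the Naïve Bayes count tables above is at least 1.0 and each column total exceeds the class size: WEKA's NaiveBayes starts every nominal count at 1 (a Laplace correction), which prevents zero probabilities. A sketch of the resulting estimate, using the hair attribute for class 1 (mammals):

```python
# Laplace-corrected estimate from the table above: counts include a +1 per value,
# so P(hair=1 | class 1) is 40/43 (39 observed + 1, over 41 instances + 2 values).
count_with_hair = 40.0
class_total = 43.0
p = count_with_hair / class_total
print(round(p, 3))  # 0.93
```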
5.3.3 KNN (IBk)
=== Run information ===
Scheme:
weka.classifiers.lazy.IBk -K 1 -W 0 -A
"weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: zoo-weka.filters.unsupervised.attribute.Remove-R1
Instances: 101
Attributes: 17
hair
feathers
eggs
milk
airborne
aquatic
predator
toothed
backbone
breathes
venomous
fins
legs
tail
domestic
catsize
type
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
IB1 instance-based classifier
using 1 nearest neighbour(s) for classification
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          97               96.0396 %
Incorrectly Classified Instances         4                3.9604 %
Kappa statistic                          0.9477
Mean absolute error                      0.0195
Root mean squared error                  0.0941
Relative absolute error                  8.894  %
Root relative squared error             28.5252 %
Total Number of Instances              101
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 1         0          1         1          1           1       1
                 1         0.012      0.952     1          0.976       1       2
                 0.6       0.021      0.6       0.6        0.6         0.985   3
                 1         0.011      0.929     1          0.963       1       4
                 0.75      0          1         0.75       0.857       0.997   5
                 1         0          1         1          1           1       6
                 0.9       0          1         0.9        0.947       0.984   7
Weighted Avg.    0.96      0.005      0.962     0.96       0.96        0.998
=== Confusion Matrix ===
a b c d e f g <-- classified as
41 0 0 0 0 0 0 | a = 1
0 20 0 0 0 0 0 | b = 2
0 1 3 1 0 0 0| c=3
0 0 0 13 0 0 0 | d = 4
0 0 1 0 3 0 0| e=5
0 0 0 0 0 8 0| f=6
0 0 1 0 0 0 9| g=7
5.3.4 SVM (SMO)
=== Run information ===
Scheme:
weka.classifiers.functions.SMO -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K
"weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Relation: zoo-weka.filters.unsupervised.attribute.Remove-R1
Instances: 101
Attributes: 17
hair
feathers
eggs
milk
airborne
aquatic
predator
toothed
backbone
breathes
venomous
fins
legs
tail
domestic
catsize
type
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
SMO
Kernel used:
Linear Kernel: K(x,y) = <x,y>
Classifier for classes: 1, 2
BinarySMO
Machine linear: showing attribute weights, not support vectors.
-0.4407 * (normalized) hair
0.5084 * (normalized) feathers
0.2712 * (normalized) eggs
-0.5084 * (normalized) milk
0.0011 * (normalized) airborne
-0.0008 * (normalized) aquatic
0.0014 * (normalized) predator
-0.2712 * (normalized) toothed
0 * (normalized) backbone
0 * (normalized) breathes
-0.0678 * (normalized) fins
-0.0678 * (normalized) legs=0
0.305 * (normalized) legs=2
-0.2372 * (normalized) legs=4
0.0012 * (normalized) tail
0 * (normalized) domestic
0.0011 * (normalized) catsize
-0.0872
Number of kernel evaluations: 807 (78.808% cached)
Classifier for classes: 1, 3
BinarySMO
Machine linear: showing attribute weights, not support vectors.
-0.7034 * (normalized) hair
0.3609 * (normalized) eggs
-0.9743 * (normalized) milk
-0.2835 * (normalized) aquatic
-0.0387 * (normalized) predator
-0.1161 * (normalized) toothed
0 * (normalized) backbone
-0.3005 * (normalized) breathes
0.3005 * (normalized) venomous
-0.271 * (normalized) fins
0.0295 * (normalized) legs=0
-0.0145 * (normalized) legs=2
-0.015 * (normalized) legs=4
0 * (normalized) tail
-0.1549 * (normalized) catsize
1.1089
Number of kernel evaluations: 329 (80.839% cached)
Classifier for classes: 1, 4
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.1175 * (normalized) hair
 +  0.5885 * (normalized) eggs
 + -0.706  * (normalized) milk
 +  0      * (normalized) airborne
 +  0      * (normalized) aquatic
 +  0      * (normalized) predator
 +  0.1175 * (normalized) toothed
 +  0      * (normalized) backbone
 + -0.706  * (normalized) breathes
 +  0.1175 * (normalized) fins
 +  0.1175 * (normalized) legs=0
 +  0      * (normalized) legs=2
 + -0.1175 * (normalized) legs=4
 +  0      * (normalized) tail
 + -0.0003 * (normalized) catsize
 +  0.0594
Number of kernel evaluations: 269 (70.729% cached)
Classifier for classes: 1, 5
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.5713 * (normalized) hair
 +  0.4283 * (normalized) eggs
 + -0.7142 * (normalized) milk
 +  0.2855 * (normalized) aquatic
 +  0.0009 * (normalized) predator
 +  0.2858 * (normalized) toothed
 +  0      * (normalized) backbone
 +  0      * (normalized) breathes
 + -0.1428 * (normalized) fins
 + -0.1428 * (normalized) legs=0
 +  0.1428 * (normalized) legs=4
 + -0.0014 * (normalized) tail
 + -0.0009 * (normalized) domestic
 + -0.4287 * (normalized) catsize
 +  0.1427
Number of kernel evaluations: 145 (83.726% cached)
Classifier for classes: 1, 6
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.1153 * (normalized) hair
 +  0.2588 * (normalized) eggs
 + -0.3648 * (normalized) milk
 +  0.1167 * (normalized) airborne
 + -0.106  * (normalized) aquatic
 + -0.106  * (normalized) predator
 + -0.2588 * (normalized) toothed
 + -0.3648 * (normalized) backbone
 +  0      * (normalized) breathes
 +  0.0548 * (normalized) venomous
 + -0.149  * (normalized) legs=2
 + -0.2159 * (normalized) legs=4
 +  0.3648 * (normalized) legs=6
 + -0.2388 * (normalized) tail
 + -0.0551 * (normalized) domestic
 + -0.1222 * (normalized) catsize
 +  0.3755
Number of kernel evaluations: 245 (81.523% cached)
Classifier for classes: 1, 7
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.3679 * (normalized) hair
 +  0.1327 * (normalized) eggs
 + -0.4727 * (normalized) milk
 + -0.0004 * (normalized) airborne
 + -0.0982 * (normalized) aquatic
 +  0.0492 * (normalized) predator
 + -0.2801 * (normalized) toothed
 + -0.4727 * (normalized) backbone
 + -0.1993 * (normalized) breathes
 +  0.1474 * (normalized) venomous
 + -0.1048 * (normalized) fins
 +  0.0213 * (normalized) legs=0
 + -0.0511 * (normalized) legs=2
 + -0.1287 * (normalized) legs=4
 +  0.1585 * (normalized) legs=8
 + -0.2012 * (normalized) tail
 + -0.1242 * (normalized) domestic
 + -0.2863 * (normalized) catsize
 +  1.0448
Number of kernel evaluations: 179 (82.046% cached)
Classifier for classes: 2, 3
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.7497 * (normalized) feathers
 + -0.0002 * (normalized) eggs
 + -0.0019 * (normalized) airborne
 +  0.0002 * (normalized) aquatic
 + -0.0019 * (normalized) predator
 +  0.2501 * (normalized) toothed
 + -0.0002 * (normalized) breathes
 +  0.0002 * (normalized) venomous
 +  0.2501 * (normalized) legs=0
 + -0.7497 * (normalized) legs=2
 +  0.4997 * (normalized) legs=4
 + -0.0011 * (normalized) catsize
 +  0.502
Number of kernel evaluations: 126 (89.367% cached)
Classifier for classes: 2, 4
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.333  * (normalized) feathers
 +  0      * (normalized) eggs
 + -0.0027 * (normalized) airborne
 +  0.0018 * (normalized) aquatic
 +  0.0004 * (normalized) predator
 +  0.333  * (normalized) toothed
 +  0      * (normalized) backbone
 + -0.333  * (normalized) breathes
 +  0.333  * (normalized) fins
 +  0.333  * (normalized) legs=0
 + -0.333  * (normalized) legs=2
 +  0      * (normalized) tail
 + -0.0011 * (normalized) catsize
 +  0.0012
Number of kernel evaluations: 200 (91.431% cached)
Classifier for classes: 2, 5
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.4616 * (normalized) feathers
 +  0      * (normalized) eggs
 + -0.1546 * (normalized) airborne
 +  0.1535 * (normalized) aquatic
 +  0      * (normalized) predator
 +  0.4616 * (normalized) toothed
 +  0      * (normalized) backbone
 +  0      * (normalized) breathes
 + -0.4616 * (normalized) legs=2
 +  0.4616 * (normalized) legs=4
 +  0      * (normalized) tail
 + -0.1535 * (normalized) catsize
 +  0.0764
Number of kernel evaluations: 53 (72.959% cached)
Classifier for classes: 2, 6
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.4    * (normalized) feathers
 +  0      * (normalized) eggs
 +  0      * (normalized) airborne
 +  0      * (normalized) aquatic
 +  0      * (normalized) predator
 + -0.4    * (normalized) backbone
 +  0      * (normalized) breathes
 + -0.4    * (normalized) legs=2
 +  0.4    * (normalized) legs=6
 + -0.4    * (normalized) tail
 +  0.6
Number of kernel evaluations: 104 (62.044% cached)
Classifier for classes: 2, 7
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.4886 * (normalized) feathers
 + -0.1778 * (normalized) eggs
 + -0.0891 * (normalized) airborne
 +  0.0435 * (normalized) aquatic
 + -0.0445 * (normalized) predator
 + -0.4886 * (normalized) backbone
 + -0.1338 * (normalized) breathes
 +  0.1778 * (normalized) venomous
 +  0.1771 * (normalized) legs=0
 + -0.4886 * (normalized) legs=2
 +  0.0446 * (normalized) legs=4
 +  0.0442 * (normalized) legs=5
 +  0.045  * (normalized) legs=6
 +  0.1778 * (normalized) legs=8
 + -0.3108 * (normalized) tail
 + -0.0447 * (normalized) catsize
 +  1.1337
Number of kernel evaluations: 231 (82.162% cached)
Classifier for classes: 3, 4
BinarySMO
Machine linear: showing attribute weights, not support vectors.
    0.6359 * (normalized) eggs
 +  0.4552 * (normalized) aquatic
 +  0.0019 * (normalized) predator
 +  0.09   * (normalized) toothed
 +  0      * (normalized) backbone
 + -0.4552 * (normalized) breathes
 + -0.273  * (normalized) venomous
 +  1.0911 * (normalized) fins
 +  0.09   * (normalized) legs=0
 + -0.09   * (normalized) legs=4
 +  0      * (normalized) tail
 +  0.273  * (normalized) catsize
 +  1.3631
Number of kernel evaluations: 66 (84.793% cached)
Classifier for classes: 3, 5
BinarySMO
Machine linear: showing attribute weights, not support vectors.
    0.3685 * (normalized) eggs
 +  1.1579 * (normalized) aquatic
 +  0.001  * (normalized) predator
 +  0.1579 * (normalized) toothed
 +  0      * (normalized) backbone
 +  0.3685 * (normalized) breathes
 +  0      * (normalized) venomous
 + -0.3685 * (normalized) legs=0
 +  0.3685 * (normalized) legs=4
 + -0.5264 * (normalized) tail
 + -0.1579 * (normalized) catsize
 +  1.4219
Number of kernel evaluations: 44 (88.832% cached)
Classifier for classes: 3, 6
BinarySMO
Machine linear: showing attribute weights, not support vectors.
    0      * (normalized) eggs
 +  0.1075 * (normalized) airborne
 + -0.1086 * (normalized) predator
 + -0.2161 * (normalized) toothed
 + -0.4866 * (normalized) backbone
 +  0      * (normalized) breathes
 + -0.0011 * (normalized) venomous
 + -0.2161 * (normalized) legs=0
 + -0.2706 * (normalized) legs=4
 +  0.4866 * (normalized) legs=6
 + -0.4866 * (normalized) tail
 + -0.2706 * (normalized) catsize
 +  0.5142
Number of kernel evaluations: 63 (85.246% cached)
Classifier for classes: 3, 7
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.0531 * (normalized) eggs
 + -0.026  * (normalized) aquatic
 +  0.0694 * (normalized) predator
 + -0.5156 * (normalized) toothed
 + -0.9434 * (normalized) backbone
 +  0.026  * (normalized) breathes
 + -0.0006 * (normalized) venomous
 + -0.1571 * (normalized) legs=0
 + -0.1749 * (normalized) legs=4
 +  0.332  * (normalized) legs=8
 + -0.6114 * (normalized) tail
 + -0.4278 * (normalized) catsize
 +  1.185
Number of kernel evaluations: 110 (91.941% cached)
Classifier for classes: 4, 5
BinarySMO
Machine linear: showing attribute weights, not support vectors.
    0      * (normalized) eggs
 +  0      * (normalized) aquatic
 +  0      * (normalized) predator
 +  0      * (normalized) toothed
 +  0      * (normalized) backbone
 +  0.5    * (normalized) breathes
 + -0.5    * (normalized) fins
 + -0.5    * (normalized) legs=0
 +  0.5    * (normalized) legs=4
 +  0      * (normalized) tail
 +  0
Number of kernel evaluations: 9 (30.769% cached)
Classifier for classes: 4, 6
BinarySMO
Machine linear: showing attribute weights, not support vectors.
    0      * (normalized) eggs
 +  0.0018 * (normalized) airborne
 + -0.2498 * (normalized) aquatic
 + -0.0018 * (normalized) predator
 + -0.2498 * (normalized) toothed
 + -0.2498 * (normalized) backbone
 +  0.2498 * (normalized) breathes
 + -0.2498 * (normalized) fins
 + -0.2498 * (normalized) legs=0
 +  0.2498 * (normalized) legs=6
 + -0.2498 * (normalized) tail
 + -0.0005 * (normalized) domestic
 +  0.4997
Number of kernel evaluations: 66 (87.687% cached)
Classifier for classes: 4, 7
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.0401 * (normalized) eggs
 + -0.1207 * (normalized) aquatic
 +  0      * (normalized) predator
 + -0.48   * (normalized) toothed
 + -0.48   * (normalized) backbone
 +  0.0401 * (normalized) breathes
 +  0.1205 * (normalized) venomous
 + -0.48   * (normalized) fins
 + -0.12   * (normalized) legs=0
 +  0.0006 * (normalized) legs=6
 +  0.1194 * (normalized) legs=8
 + -0.44   * (normalized) tail
 + -0.1196 * (normalized) catsize
 +  1.1605
Number of kernel evaluations: 226 (89.35% cached)
Classifier for classes: 5, 6
BinarySMO
Machine linear: showing attribute weights, not support vectors.
    0.0017 * (normalized) airborne
 + -0.4    * (normalized) aquatic
 + -0.0017 * (normalized) predator
 + -0.4    * (normalized) toothed
 + -0.4    * (normalized) backbone
 + -0.4    * (normalized) legs=4
 +  0.4    * (normalized) legs=6
 +  0.6008
Number of kernel evaluations: 22 (92.81% cached)
Classifier for classes: 5, 7
BinarySMO
Machine linear: showing attribute weights, not support vectors.
   -0.0659 * (normalized) eggs
 + -0.1995 * (normalized) aquatic
 + -0.0005 * (normalized) predator
 + -0.7335 * (normalized) toothed
 + -0.7335 * (normalized) backbone
 + -0.534  * (normalized) breathes
 +  0.0008 * (normalized) venomous
 +  0.1336 * (normalized) legs=0
 + -0.1995 * (normalized) legs=4
 +  0.0659 * (normalized) legs=8
 +  0.0015 * (normalized) tail
 +  1.4656
Number of kernel evaluations: 89 (91.127% cached)
Classifier for classes: 6, 7
BinarySMO
Machine linear: showing attribute weights, not support vectors.
    0      * (normalized) hair
 + -0.1228 * (normalized) eggs
 + -0.4496 * (normalized) airborne
 +  0.7752 * (normalized) aquatic
 +  0.4484 * (normalized) predator
 + -0.7752 * (normalized) breathes
 +  0.1228 * (normalized) venomous
 +  0.9378 * (normalized) legs=0
 + -1.0606 * (normalized) legs=6
 +  0.1228 * (normalized) legs=8
 +  0.1228 * (normalized) tail
 +  0.9593
Number of kernel evaluations: 52 (84.478% cached)
Time taken to build model: 1.71 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          97                96.0396 %
Incorrectly Classified Instances         4                 3.9604 %
Kappa statistic                          0.9478
Mean absolute error                      0.2048
Root mean squared error                  0.3018
Relative absolute error                 93.3993 %
Root relative squared error             91.4742 %
Total Number of Instances              101
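As a sanity check, the Kappa statistic reported above can be reproduced by hand from the confusion matrix that WEKA prints for this run. Kappa compares the observed accuracy p_o with the agreement p_e expected by chance from the row and column marginals: kappa = (p_o - p_e) / (1 - p_e). A minimal Python sketch (matrix values copied from the WEKA output):

```python
# Recompute accuracy and Cohen's kappa from the SMO confusion matrix.
matrix = [
    [41, 0, 0, 0, 0, 0, 0],
    [0, 20, 0, 0, 0, 0, 0],
    [0, 0, 3, 1, 1, 0, 0],
    [0, 0, 0, 13, 0, 0, 0],
    [0, 0, 1, 0, 3, 0, 0],
    [0, 0, 0, 0, 0, 8, 0],
    [0, 0, 1, 0, 0, 0, 9],
]

n = sum(map(sum, matrix))                              # 101 instances
p_o = sum(matrix[i][i] for i in range(7)) / n          # observed accuracy
row = [sum(r) for r in matrix]                         # actual class counts
col = [sum(r[j] for r in matrix) for j in range(7)]    # predicted class counts
p_e = sum(row[i] * col[i] for i in range(7)) / n ** 2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(round(p_o * 100, 4), round(kappa, 4))  # → 96.0396 0.9478
```

Both figures match the summary, confirming the report's numbers are internally consistent.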
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
                 1        0        1        1        1        1        1
                 1        0        1        1        1        1        2
                 0.6      0.021    0.6      0.6      0.6      0.752    3
                 1        0.011    0.929    1        0.963    0.994    4
                 0.75     0.01     0.75     0.75     0.75     0.983    5
                 1        0        1        1        1        1        6
                 0.9      0        1        0.9      0.947    0.996    7
Weighted Avg.    0.96     0.003    0.961    0.96     0.96     0.986
=== Confusion Matrix ===
  a  b  c  d  e  f  g   <-- classified as
 41  0  0  0  0  0  0 |  a = 1
  0 20  0  0  0  0  0 |  b = 2
  0  0  3  1  1  0  0 |  c = 3
  0  0  0 13  0  0  0 |  d = 4
  0  0  1  0  3  0  0 |  e = 5
  0  0  0  0  0  8  0 |  f = 6
  0  0  1  0  0  0  9 |  g = 7
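The per-class figures in the accuracy table follow directly from this confusion matrix: precision for a class divides its diagonal count by its column sum (everything predicted as that class), while recall divides by its row sum (everything actually in that class). A short sketch showing where, for instance, the 0.6 precision and recall for class 3 come from:

```python
# Derive per-class precision/recall from the SMO confusion matrix
# (values copied from the WEKA output above; classes 1-7 are indices 0-6).
matrix = [
    [41, 0, 0, 0, 0, 0, 0],
    [0, 20, 0, 0, 0, 0, 0],
    [0, 0, 3, 1, 1, 0, 0],
    [0, 0, 0, 13, 0, 0, 0],
    [0, 0, 1, 0, 3, 0, 0],
    [0, 0, 0, 0, 0, 8, 0],
    [0, 0, 1, 0, 0, 0, 9],
]

def precision(c):
    """True positives over the column sum (all predictions of class c)."""
    return matrix[c][c] / sum(row[c] for row in matrix)

def recall(c):
    """True positives over the row sum (all actual members of class c)."""
    return matrix[c][c] / sum(matrix[c])

print(precision(2), recall(2))  # class 3 → 0.6 0.6
```

Only 3 of the 5 class-3 instances are caught, and 2 of the 5 class-3 predictions are wrong, which is why this small class drags the weighted averages below the perfect scores of classes 1, 2, and 6.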