
C5.0 vs RPart: Decision Tree Analysis with Iris, Wine, Titanic Datasets

Hayley Klobuchar
Research Report 2
(MATH 3220 Report 2)
Problem Description
This report discusses a comparison between two classification algorithms, each implementing its own methodology for producing a decision tree, applied to three different datasets. The two classification algorithms examined are C5.0 and RPart (CART), and the datasets used are the iris, wine, and titanic datasets. This report compares the decision tree methods used by each algorithm and notes any variances and discrepancies across the above-mentioned datasets and algorithms.
Background
This report examines two machine learning methodologies for data classification. Each uses a provided training dataset and a classification algorithm that examines the attributes of the data in order to assign each sample to a pure classification. Using the information from the training dataset, the algorithm is trained so that it can predict the classification of unknown data, referred to as the testing dataset. (Aleshunas)
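As a minimal sketch of this training and testing workflow in R (assuming the built-in iris data and the C50 package, which are used later in this report; the 70/30 split is an arbitrary choice added here for illustration):

library(C50)                                     # R implementation of C5.0

set.seed(123)                                    # reproducible split
train_rows <- sample(nrow(iris), 0.7 * nrow(iris))
train_set  <- iris[train_rows, ]                 # training dataset
test_set   <- iris[-train_rows, ]                # testing dataset, treated as "unknown"

# Train the classifier on the training data
model <- C5.0(Species ~ ., data = train_set)

# Predict the classification of the unseen testing data
predictions <- predict(model, newdata = test_set)
table(Predicted = predictions, Actual = test_set$Species)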
C5.0
C5.0 is based on the C4.5 algorithm, which is in turn built upon the initial algorithm, ID3, all developed by J. Ross Quinlan. This algorithm uses information entropy and information gain.
Information entropy is the amount of variance that a dataset contains. To imagine this with an example, think of a bag full of marbles as a dataset. If the bag contains marbles of only three colors, one could say the bag has relatively low entropy; a bag that has 10 colors has relatively high entropy (Zhou, 2019). Information gain is found after computing the information entropy of the attributes within the dataset. Higher entropy for an attribute means there is higher variance in that attribute, so the attribute provides less information gain.
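To make these definitions concrete, the following short R sketch (an illustration added here, not code from the original report) computes the entropy of a set of class labels and the information gain of a candidate split; the split on petal length is chosen only as an example.

# Illustrative R sketch: entropy and information gain of a candidate split.
entropy <- function(labels) {
  p <- table(labels) / length(labels)     # class proportions
  p <- p[p > 0]                           # drop empty classes to avoid log2(0)
  -sum(p * log2(p))                       # Shannon entropy in bits
}

info_gain <- function(labels, split_var) {
  groups   <- split(labels, split_var)    # partition the labels by the split
  weighted <- sum(sapply(groups, function(g) length(g) / length(labels) * entropy(g)))
  entropy(labels) - weighted              # parent entropy minus weighted child entropy
}

entropy(iris$Species)                               # entropy of the full iris dataset
info_gain(iris$Species, iris$Petal.Length <= 1.9)   # gain from one candidate split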
RPART/CART
The R implementation of CART provides the structure for a regression or decision tree to be produced, containing three main components: the root node, the terminal leaves, and the branches of the tree (Pandey, 2019).
RPART is the R implementation of the CART algorithm. It ranks the attributes of the dataset by importance and produces a regression/decision tree of the data's classification as its output.
Methodology
The Iris dataset is a numerical dataset containing 150 samples with 4 attributes provided (sepal length, sepal width, petal length, and petal width) used to predict the classification of each sample. There are three possible classification outcomes: Setosa, Versicolor, and Virginica.
The Wine dataset is a numerical dataset containing 153 samples with 14 total attributes provided. These attributes are used to assign the appropriate classification to each sample. The possible classifications are class_1, class_2, and class_3. This dataset has a larger number of attributes than the other two datasets, which sets it up for possible variations between the two methods being compared and contrasted.
The Titanic dataset is a categorical dataset containing 2201 entries with 3 attributes provided to determine whether each entry should be classified as survived or not survived. The attributes provided are class, age, and sex.
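The following sketch shows one way these datasets might be loaded and inspected in R; iris ships with R, while the file names wine.csv and titanic.csv are placeholders for whichever copies of the wine and Titanic data are actually used.

# Load and inspect the three datasets (CSV file names below are placeholders).
data(iris)                              # iris is built into R
wine    <- read.csv("wine.csv")         # hypothetical path to the wine data
titanic <- read.csv("titanic.csv")      # hypothetical path to the 2201-row Titanic data

str(iris)      # 150 samples: 4 numeric attributes plus the Species class
str(wine)      # numeric attributes plus the class_1/class_2/class_3 label
str(titanic)   # categorical attributes: class, age, sex, and the survival outcome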
C5.0
Using the R implementation of the classification algorithm, C5.0 creates a classification model that can be used to create a visual representation of the data classification, called a decision tree. The decision tree is constructed by the classification model, which is created by splitting the training data on the attribute that provides the most information gain, then repeating this process until the data cannot be split any further and has fallen into its appropriate (pure) classification. During this process, if any attributes in the dataset do not provide useful information, the algorithm eliminates them from the splitting criteria. (Patil et al., 2012)
C5.0 calculates the information gain of each attribute and automatically splits first on the attribute with the highest information gain. This process is repeated down through the attributes with lower information gain. The last split should result in the final classification of the data within a margin of error. (Patil et al., 2012)
RPART (CART)
This method determines the importance of an attribute by the resulting Gini index value of the attributes being examined for the algorithm's splitting criteria. The Gini index originated as an economics-based measure. A training set's Gini index is found by the formula
GINI(T) = 1 - \sum_{i=1}^{m} p_i^2
where m is the number of values of the target attribute and p_i is the proportion of the n samples in the training set T that belong to the i-th value. The sum of the squared proportions over the i values is evaluated and subtracted from 1. (Patil et al., 2012)
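A small illustrative sketch of this calculation in R (not part of the original report) is:

# Illustrative R sketch: Gini index of a set of class labels,
# GINI(T) = 1 - sum over classes of the squared class proportion.
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions p_i
  1 - sum(p^2)
}

gini(iris$Species)                              # Gini of the full training set
gini(iris$Species[iris$Petal.Length <= 1.9])    # Gini within one candidate split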
The root node is determined by the attribute whose Gini index is lowest. After the attribute with the lowest Gini index makes the initial split, the tree splits on the next attribute with the minimized Gini index value, creating a branch, and this repeats until the data is split into concentrated or pure classes.
The tree's terminal nodes represent a concentrated or pure class. The terminal nodes are the resulting classifications of the dataset, determined by the tree's splitting rules that relate to each internal node. By providing a trained dataset, the information can be used to forecast the leaf, or classification, in which a new sample will be placed, based on the training dataset. (Aleshunas)
Assumptions
It is assumed that no missing data is contained within the datasets used with C5.0. It is also assumed that RPART will test more attributes than C5.0.
Experimental Design
This experimental design involves three datasets in the R implementations of the classification algorithms C5.0 and RPART/CART. Provided in this section is a guided example with example code. The experimental design sections that follow use the Iris dataset as a guided example. Descriptions of each of the datasets are given in the paragraphs below.
The Iris dataset, the guided example dataset, is a numerical dataset containing 150 samples with 4 attributes provided (sepal length, sepal width, petal length, and petal width) used to predict the classification of each sample. There are three possible classification outcomes: Setosa, Versicolor, and Virginica. The Wine dataset is a numerical dataset containing 153 samples with 14 total attributes provided. These attributes are used to assign the appropriate classification to each sample. The possible classifications are class_1, class_2, and class_3. This dataset has a larger number of attributes than the other two datasets, which sets it up for possible variations between the two methods being compared and contrasted.
The Titanic dataset is a categorical dataset containing 2201 entries with 3 attributes provided to determine whether each entry should be classified as survived or not survived. The attributes provided are class, age, and sex.
By using each of the three datasets in the implementations of the two classification methods, this experiment examines both methods' outputs.
This report uses all three datasets in the R implementations of the classification algorithms C5.0 and RPART/CART to view the results: the resulting visual representations of decision trees produced with C5.0 and the resulting visual representations of regression trees produced with RPART/CART for each dataset. The following experimental design sections for each method use the Iris dataset as a guided example for implementing the two algorithms/methods.
C5.0
1. Install and load the C50 package.
2. Open and read the datasets into objects.
3. Create and train a decision tree, using the minimum range allowed for the data, for each one of the datasets, and view their decision tree summaries.
4. Graphically depict the decision tree; example output is given using the Iris dataset.
5. Develop and view the rule set summary for each dataset; the commands are sketched after this list.
6. Analyze the results.
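The following R sketch illustrates these steps for the Iris guided example; the exact commands used in the report appear in its figures, so this is an approximation assuming the C50 package.

# Sketch of steps 1-5 for the Iris guided example (C50 package assumed).
# install.packages("C50")                      # step 1: install once, then load
library(C50)

data(iris)                                     # step 2: read the dataset into an object

iris_tree <- C5.0(Species ~ ., data = iris)    # step 3: create and train the decision tree
summary(iris_tree)                             #         view the decision tree summary

plot(iris_tree)                                # step 4: graphically depict the decision tree

iris_rules <- C5.0(Species ~ ., data = iris, rules = TRUE)   # step 5: develop the rule set
summary(iris_rules)                                          #         view the rule set summary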
RPART/CART
To use the RPART algorithm with these datasets, a formula must first be created. The formula consists of the variable being used for classification modeled as a function of the attributes of the data.
1. Install and load the packages rpart and rpart.plot.
2. Create the formula this method will use (an example format using the Iris dataset is shown in the sketch after this list).
3. Train the tree.
4. View the summary.
5. View a text version of the regression tree.
6. Plot the regression tree.
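A corresponding R sketch of these steps for the Iris guided example (assuming the rpart and rpart.plot packages) is:

# Sketch of steps 1-6 for the Iris guided example (rpart and rpart.plot assumed).
# install.packages(c("rpart", "rpart.plot"))   # step 1: install once, then load
library(rpart)
library(rpart.plot)

# Step 2: the formula -- the classification variable as a function of the attributes
iris_formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

iris_rpart <- rpart(iris_formula, data = iris, method = "class")   # step 3: train the tree

summary(iris_rpart)       # step 4: view the summary
print(iris_rpart)         # step 5: view a text version of the tree
rpart.plot(iris_rpart)    # step 6: plot the tree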
Results and Discussion
Results from the descriptive statistics for datasets: Iris, Wine, and Titanic Data
For the wine dataset, examine the summary of the dataset for a preview of the information it contains.
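For example, assuming the dataset objects named in the experimental design section, the descriptive statistics can be produced with:

summary(iris)       # descriptive statistics for the iris data
summary(wine)       # preview of the information contained in the wine data
summary(titanic)    # descriptive statistics for the Titanic data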
C5.0 results for datasets
IRIS DATA: Summary of decision tree
The image above and to the left shows the commands used to train the decision tree and the command to view the summary of the decision tree. The image above and to the right is the output displayed from calling the summary function on the decision tree. It provides information on the splits of the tree and shows that petal length was the attribute with the highest information gain in the dataset. This is why the first test of a data instance is whether its petal length is less than or equal to 1.9: the algorithm found this attribute to contain the highest information gain and developed an initial test for each data entry from it. If an entry cannot be placed into a terminal node from the initial test, it is then tested against the attribute with the second highest information gain. The second splitting attribute for the Iris dataset is petal width, testing for a petal width greater than 1.7. If the data entry passes this test, it falls into the Virginica classification; this produces 3 misclassifications and 47 correct classifications within the terminal node. If it fails, it is tested against petal length less than or equal to 4.9, in which case the data entry is classified as Versicolor. This test had 49 correct classifications with 1 misclassification. Overall, the text decision tree uses a confusion matrix that reformats the above information. The text decision tree also displays that two attributes were used by the algorithm, petal length and petal width.
WINE DATA: Summary of decision tree
The figure to the left depicts the command to train the decision tree with the Wine dataset. The figure to the right depicts the text version of the decision tree, called by the summary function on the wine decision tree. The output states that there is a total of 153 data entries with 14 different attributes. The tree tests each data entry for flavonoids less than or equal to 1.57 to determine the first split. It then tests entries for the color attribute less than or equal to 3.8; if true, the entry is classified as class_2, and if greater than 3.8, the entry is classified as class_3. If the flavonoids value is instead greater than 1.57, the tree tests the entry for proline less than or equal to 720, in which case the entry is class_2; if proline is greater than 720, it tests color_intensity for less than or equal to 3.4, in which case the data entry is class_2, and if greater than 3.4 the entry is classified as class_1. The confusion matrix provides a visual summary of the information stated above along with the misclassifications. The last chunk of information shows which attributes had the highest information gain, in order of importance.
TITANIC DATA: Summary of decision tree
The figure to the left depicts the training of the Titanic dataset's decision tree and the command to view the summary of the decision tree. The summary provides a text decision tree for the dataset. It states that the algorithm used 2201 entries with four attributes. The attribute with the highest information gain was sex = Male. If this was true, the entry was classified as not survived, with 367 misclassifications for the first split. The second split, from the second most important attribute, is sex = Female; if this is true, the entry is tested for the class the entry held on the ship, and if it is any class other than third, the entry is classified as survived. This produced 20 errors. With class = Third, the algorithm produced 90 misclassifications.
DECISION TREE: Iris, Wine, and Titanic
The visual output of the decision trees matches up to the above stated summaries for each dataset.
RULE SET: Iris, Wine, and Titanic Data
The following is the development and summary of each dataset's rule set.
Rule Set Summary: Iris
The above rule set is developed from the results of the training set of the iris data. Rule 1 tests whether Petal Length (PL) <= 1.9; if true, the entry is Setosa. If false, go to rule 2: if PL <= 4.9 and Petal Width (PW) <= 1.7, then the entry is classed as Versicolor. Rule 3: if PW > 1.7, then the entry is classed as Virginica. Rule 4: if PL > 4.9, then the entry is classed as Virginica.
Rule Set Summary: Wine
The figure to the left depicts the summary of the rule set for the wine data. It is read in the same manner as the iris data section above. Looking at an organized matrix of the rule set, we can conclude that flavonoids had the largest information gain, with color_intensity being the second highest and proline the third highest in attribute importance. Out of the 153 entries, the confusion matrix also displays that there was 1 misclassification, making this an appropriate training set.
Rule Set Summary: Titanic
The figure to the left depicts creating the rule set and the command to view the summary of the rule set for the Titanic dataset. The figure to the right is the confusion matrix of the rule set. The information provided by this figure says that out of 2201 entries there were 477 misclassifications and that the attributes used for the rule set were sex and class. This dataset has the highest error rate. It is also the only dataset that is categorical. This leads to the conclusion that C5.0 could be better suited to numerical data.
END C5.0
RPART/CART results for datasets
Iris Summary Results
The figures above show the Iris dataset in RPART being used to assign a function to predict the classification of the data. The algorithm then trains the tree, and the summary function is called, with the outputs in the figures below the formula figure. This algorithm uses all four attributes to train the tree. Node 1 represents the initial split; the remaining nodes represent two tests for the class Setosa, two tests for the class Versicolor, and one test for Virginica. This model produced a lower error rate than C5.0.
Wine Summary Results
The wine text decision tree is read similarly to the iris description above this section.
Titanic Summary Results
The Titanic summary also follows the format noted above.
Text Regression Tree for Datasets
Iris
This provides a good description of the Iris regression tree. It displays that the root node tests for the classification Setosa; if that criterion is not met, the tree splits to test PL, and then PW if the PL criterion is not met by the data entry. It also notes the misclassification rate for each terminal node.
Wine
This provides a good description of the Wine regression tree. It displays that the root node tests for the classification class_2. If the root node's criterion is not met, the tree splits to test color_intensity and determines the entry to be either class_2 or class_3. If the criterion is still not met by the entry, it then tests the flavonoids attribute with the proline attribute to determine the entry to be either class_1 or class_2. If the entry still needs to be tested, the flavonoids attribute is tested a final time to determine the entry to be class_3. It also notes the misclassification rate for each terminal node.
Titanic
The Titanic root node tests males for the classification not survived. It uses two tests: it tests whether gender is male, and if so the entry is classed as not survived. If the entry survived, males are then tested for age and then for class. If the entry still needs further classification, it is tested for female, and if survived, the class of the entry is then tested to determine survival.
Plots of Regression Tree
Iris
Wine
Titanic
These plots are visual depictions of the text tree results stated above.
Issues
The programs ran well. Issues arose from comprehension of the literature review; most were minor and easy to work through. There were no major issues to note in this report. There was difficulty in formatting this quantity of data, but that is not related to the methods or algorithms; rather, it is an experience issue that will only improve with more practice.
Conclusions and Future Work
This report led to the conclusion that both methods are valid for classifying data. While the slight variations in each program lead to similar but not identical results, a baseline understanding of the data being worked with is required to make the most efficient decision about which method to use. Future work that would be beneficial would be to examine more categorical data with C5.0, perhaps a smaller dataset with a better attribute distribution behind the data. RPART had a more user-friendly appeal to it: the node summary was overwhelming, but the confusion matrix nicely depicts the text tree.
Appendices
N/A (R script included in the report)
References
Pandey, P. (2019, February 13). A Guide to Machine Learning in R for Beginners: Decision Trees. Retrieved September 27, 2019, from https://medium.com/analytics-vidhya/a-guide-to-machine-learning-in-r-for-beginners-decision-trees-c24dfd490abb
Patil, N., Lathi, R., & Chitre, V. (2012). Comparison of C5.0 & CART Classification Algorithms using Pruning Technique. Retrieved September 26, 2019, from https://www.semanticscholar.org/paper/Comparison-of-C5.0-%26-CART-Classification-Algorithms-Patil-Lathi/ac7b7c0aa3cef86ace51ad070b9b6c543bad13e0
Zhou, V. (2019, June 7). A Simple Explanation of Information Gain and Entropy. Retrieved September 26, 2019, from https://victorzhou.com/blog/information-gain/
Aleshunas, J. In-class materials and online webpage on C5.0 comparison with RPART/CART. Retrieved September 2019, from http://mercury.webster.edu/aleshunas/R_learning_infrastructure/Classification%20of%20data%20using%20decision%20tree%20and%20regression%20tree%20methods.html