Case Study Handout - Predictive Analytics Center of Excellence

Data Mining in Weka
“Bringing It All Together”
Predictive Analytics Center of Excellence (PACE)
San Diego Supercomputer Center, UCSD
Data Mining Boot Camp 1
Introduction
The project assignment demonstrates an example of an end-to-end data mining process supported
by the Weka software to build supervised and unsupervised models for analysis.
The sick.arff data set will be used to illustrate a set of steps and actions through the process. There are
several versions of this file available in the .arff format. More details and the data set description
can be found in Appendix A.
We will be using the Explorer component of Weka for this project.
Part 1 Data Exploration
1. Data and descriptors
The dataset for this project contains 30 attributes describing patient information for thyroid
diagnoses obtained from the Garvan Institute, consisting of 9172 records from 1984 to early 1987.
2. Files
The following file is supplied for the project:
• sick.arff – descriptor and activity values
This file can be found at: http://repository.seasr.org/Datasets/UCI/arff/
3. Exercise 1: Preprocess the Data
It is important to understand and properly preprocess the data. Some of the key factors that
need to be considered are the total number of instances, the number of attributes, the number of
continuous and/or discrete attributes, the number of missing values, etc.
Step by step instructions
In the starting interface of Weka, click on the button Explorer.
In the Preprocess tab, click on the button Open File. In the file selection interface, select the file
sick.arff.
The dataset is characterized in the Current relation frame: the name, the number of instances, and the
number of attributes (descriptors + class). We see in this frame that the number of instances is 3772,
whereas the number of attributes is 30 (29 descriptors plus the class). The Attributes frame allows the
user to modify the set of attributes using the select and remove options. Information about the selected
attribute is given in the Selected attribute frame, in which a histogram depicts the attribute
distribution. For example, selecting the descriptor T4U shows the distribution of its values in the
dataset. Take note of the number of missing, unique and distinct values.
Select the last attribute “class” in the Attributes frame.
One can read from the Selected attribute frame that there are 3541 “negative” and 231 “sick” class
examples in the dataset. Negative examples are depicted in blue, whereas “sick” examples are depicted
in red in the histogram. Note the ratio of the number of examples in each class. Does it seem balanced?
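The imbalance can be quantified quickly. A minimal sketch, using the dataset's 3772 instances with 3541 “negative” examples (leaving 231 “sick”):

```python
# Class counts as shown in the Selected attribute frame.
negative = 3541
sick = 3772 - negative                     # 231 "sick" examples
total = negative + sick

imbalance_ratio = negative / sick          # negatives per sick case
majority_share = negative / total          # accuracy of always guessing "negative"

print(total)                               # 3772
print(round(imbalance_ratio, 1))           # 15.3
print(round(majority_share * 100, 4))      # 93.8759
```

The last number foreshadows the ZeroR exercise below: a model that always predicts “negative” is already 93.9 % accurate, which is why accuracy alone is misleading here.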
Visualization
Click on the Visualize All button on the lower right to look at the class distribution across the entire
set of attributes.
Examine each one of the variables. Did you notice anything? Are there any variables you think should
be removed, discretized, or manipulated in any way? Are there any duplicates? Are any strongly correlated?
Attribute Removal
Note that the attribute number 28 named “TBG” has 100% missing values.
This attribute, together with the related attribute 27, should be removed by adding a checkmark in
front of the attribute name and clicking on the Remove button below. Also consider attribute 29, the
referral source, which seems irrelevant, but that may depend on the nature of the disease.
Discretization
Apply the Discretize filter to the ‘Sick’ dataset
Discretize task 1 (DT1):
Browse the attribute information details on the sick.arff file. How many of the attributes are numeric?
Write down their attribute numbers.
Discretize task 2 (DT2):
In the Preprocess panel, choose the supervised Discretization filter
(filters.supervised.attribute.Discretize) and apply it (using default settings). Browse the details of the
attributes you wrote down in DT1. How are they different? How many distinct ranges have been created
for each attribute?
Discretize task 3 (DT3):
Undo the Discretize filter. Change the filter to the unsupervised Discretization filter
(filters.unsupervised.attribute.Discretize) and set the ‘bins’ setting to 5 (filter settings are found by
right-clicking on the box to the right of the ‘Choose’ button and selecting the show properties option).
Leave the other settings as default and click ‘Apply’. Have a look at the attributes that you wrote down
in DT1. Undo the filter and redo it with the bins set to 10. What do you think the ‘bins’ setting affects?
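Equal-width binning, which the unsupervised Discretize filter performs by default, can be sketched as follows (a simplified illustration, not Weka's implementation):

```python
def equal_width_bins(values, bins):
    """Return the bin index (0..bins-1) for each value, using equal-width
    cut points over [min, max] -- the default unsupervised scheme."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    out = []
    for v in values:
        i = int((v - lo) / width) if width > 0 else 0
        out.append(min(i, bins - 1))   # clamp the maximum value into the last bin
    return out

# Made-up ages, just to show how the 'bins' setting changes the granularity:
ages = [18, 22, 35, 41, 58, 63, 77, 90]
print(equal_width_bins(ages, 5))       # [0, 0, 1, 1, 2, 3, 4, 4]
```

Raising the bins setting narrows each interval, so more distinct ranges appear per attribute; the supervised filter instead chooses cut points using the class labels.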
Undo the Discretize filter and go to the Classify panel. We will start with model building in the next
section but will come back to discretization later to check how different discretization filters might
influence the produced models.
Building the Clustering (Simple k-means) Model
In this exercise, we will create a simple k-means model for predicting the thyroid disease outcome.
In the Clustering frame, click Choose, then select the SimpleKMeans method.
Click on the Start button to build the simple k-means model.
Notice that there are no “sick” clusters.
Click with the right mouse button on the word SimpleKMeans in the Clustering frame. The window for
setting options for the k-means method pops up.
Change the numClusters option to a larger number in order to create at least one cluster with a majority
of the “sick” class.
How many clusters does it take? You can change other parameters as well – distance metric, seed, etc.
How does that influence your produced clusters?
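The assign/update loop that SimpleKMeans runs can be illustrated on a single numeric attribute. A toy sketch with made-up values (not the actual Weka implementation, which also handles nominal attributes and missing values):

```python
import random

def kmeans_1d(values, k, iters=20, seed=1):
    """Toy 1-D k-means: the same assign/update loop SimpleKMeans runs,
    shown here on a single numeric attribute."""
    rng = random.Random(seed)              # the 'seed' option controls initialization
    centroids = rng.sample(values, k)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Invented TSH-like readings with one low group and one high group:
tsh = [0.5, 0.8, 1.1, 1.3, 9.0, 11.5, 12.0]
print(kmeans_1d(tsh, 2))
```

With well-separated groups the two centroids settle on the group means regardless of the seed; with overlapping groups (as in the real data), the seed and the distance metric can change which clusters emerge, which is what the questions above are probing.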
Exercise 2: Model Building
Building the ZeroR model
In this exercise, we will build the trivial model ZeroR, in which all instances are classified as
“negative”. The goal is to demonstrate that accuracy is not a correct choice to measure the
performance of classification for unbalanced datasets, in which the number of “negative”
diagnoses is much larger than the number of “sick” ones.
Click on the tab Classify.
The ZeroR method is already selected by default. For assessing the predictive performance of all
models to be built, the 10-fold cross-validation method has also been specified by default.
Click on the Start button to build a model.
The predictive performance of the model is characterized in the right-hand Classifier output frame.
The Confusion Matrix for the model is presented at the bottom of the Classifier output window. It
can be seen that all instances have been classified as “negative”. It is clear that such a
trivial model is unusable and cannot be used for discovering “sick” patients. However, it is worth
noticing that the accuracy (Correctly Classified Instances) of this trivial model is very
high: 93.8759 %. This fact clearly indicates that accuracy cannot be used for assessing the
usefulness of classification models built on unbalanced datasets. For this purpose a good choice is
the “Kappa statistic”, which is zero in this case. The “Kappa statistic” is an analog of a correlation
coefficient. Its value is zero when there is no relation, and approaches one for a very strong
statistical relation between the class label and the attributes of instances, i.e. between the classes of
healthy or sick patients and the values of their descriptors. Another useful statistical characteristic is
the “ROC Area”, for which a value near 0.5 means the lack of any statistical dependence.
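The Kappa statistic can be computed by hand from a 2×2 confusion matrix. A minimal sketch of Cohen's kappa, which shows why ZeroR scores exactly zero here:

```python
def kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix:
    (observed accuracy - chance accuracy) / (1 - chance accuracy)."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement: sum over classes of (actual share * predicted share).
    chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (observed - chance) / (1 - chance)

# ZeroR on the sick dataset: all 231 "sick" patients predicted "negative".
print(round(kappa(tp=0, fn=231, fp=0, tn=3541), 4))   # 0.0
```

Because ZeroR's predictions carry no information beyond the class priors, observed accuracy equals chance accuracy and kappa collapses to zero, no matter how high the raw accuracy is.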
Building the Naïve Bayesian Model
In this exercise, we build a Naïve Bayesian model for predicting the thyroid disease outcome. The goal
is to demonstrate the ability of Weka to build statistically significant classification models for predicting
the class outcome, as well as to show different ways of assessing the statistical significance and
usefulness of classification models.
In the classifier frame, click Choose, and then select the NaiveBayes method from the bayes submenu.
Click on the Start button to build a model.
Not only did the accuracy of the model increase (from 93.8759 % to 94.3001 %), its real statistical
significance became much stronger. This follows from the value of the “Kappa statistic” of 0.58,
which indicates the existence of a moderate statistical dependence. It can be analyzed using the
“Confusion Matrix” at the bottom of the Classifier output window. Taking “sick” as the positive class,
there are 3384 true negative, 173 true positive, 157 false positive, and 58 false negative examples.
The model exhibits an excellent value of “ROC Area” for “negative” examples, 0.96, and has
significantly improved the ROC area for the “sick” class as well. This indicates that this Naïve
Bayesian model could very advantageously be used for predicting thyroid patients’ outcomes. This can
clearly be shown by analyzing the ROC and Cost/Benefit plots.
The Naïve Bayes method provides probabilistic outputs. This means that Naïve Bayes models can assess
the probability (varying from 0 to 1) that a given patient with particular characteristics should be
predicted as “negative” or “sick”. By moving the threshold from 0 to 1 and imposing that an outcome
is predicted as “sick” if the corresponding probability exceeds the current threshold, one can build
the ROC (Receiver Operating Characteristic) curve.
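The threshold-sweeping procedure just described can be sketched in a few lines (the probabilities below are invented for illustration, not actual Weka output):

```python
def roc_points(scores_and_labels):
    """Sweep the decision threshold over the predicted P(sick) values and
    emit (false positive rate, true positive rate) points -- the ROC curve."""
    pos = sum(1 for _, y in scores_and_labels if y == "sick")
    neg = len(scores_and_labels) - pos
    points = []
    for threshold in sorted({s for s, _ in scores_and_labels}, reverse=True):
        tp = sum(1 for s, y in scores_and_labels if s >= threshold and y == "sick")
        fp = sum(1 for s, y in scores_and_labels if s >= threshold and y == "negative")
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical probabilities from a probabilistic classifier such as Naive Bayes:
preds = [(0.9, "sick"), (0.8, "sick"), (0.7, "negative"),
         (0.6, "sick"), (0.4, "negative"), (0.2, "negative")]
print(roc_points(preds))
```

Each lowering of the threshold admits more instances as “sick”: true positives push the curve up, false positives push it right, which is exactly the trade-off the Weka plot colors by threshold value.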
Extra exercise for additional practice with the ROC Curve:
Visualize the ROC curve by clicking the right mouse button on the model type
bayes.NaiveBayes in the Result list frame and selecting the menu item Visualize
threshold curve / active.
The ROC curve is shown in the Plot frame of the window. The X axis corresponds to the false
positive rate, whereas the Y axis corresponds to the true positive rate. The color depicts the value of
the threshold: the “colder” (closer to blue) the color, the lower the threshold value. All outcomes
with a probability of being “sick” exceeding the current threshold are predicted as “sick”. If such a
prediction made for a given outcome is correct, then the corresponding outcome is a true positive;
otherwise it is a false positive. If for some values of the threshold the true positive rate greatly exceeds
the false positive rate (which is indicated by an angle A close to 90 degrees), then the classification
model with such a threshold can be used to selectively extract “sick” outcomes from the much larger
number of “negative” ones.
In order to find the optimal value of the threshold (or the optimal fraction of patients to be predicted
and diagnosed with the thyroid disease), one can perform the cost/benefit analysis.
Close the window with the ROC curve
Open the window for the cost/benefit analysis by clicking the right mouse button on the model type
bayes.NaiveBayes in the Result list frame and selecting the menu item Cost/Benefit analysis / active.
Click on the Minimize Cost/Benefit button at the right bottom corner of the window.
Consider attentively the window for the Cost/Benefit Analysis. It consists of several panels. The left
part of the window contains the Plot: ThresholdCurve frame with the Threshold Curve (also called
the Lift curve). The Threshold curve looks very similar to the ROC curve. In both of them the Y axis
corresponds to the true positive rate. However, in contrast to the ROC curve, the X axis in the
Threshold curve corresponds to the fraction of selected instances (the “Sample Size”). In other words,
the Threshold curve depicts the fraction of “sick” patients retrieved as a function of the fraction of
instances selected from the whole dataset (i.e. only those for which the estimated probability of
having thyroid disease exceeds the chosen threshold). The value of the threshold can be modified
interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The
confusion matrix for the current value of the threshold is shown in the Confusion Matrix frame at the
left bottom corner of the window.
Notice that the confusion matrix for the current value of the threshold differs sharply from the
previously obtained one. Why is this happening?
In order to give an answer to this question and explain the corresponding phenomenon, let us take a
look at the right side of the window. Its right bottom corner contains the Cost Matrix frame.
The left part of the frame contains the Cost matrix itself. Its four entries indicate the cost one should
pay for decisions taken on the basis of the classification model. The cost values are expressed in the
table in abstract units; however, in case studies they can be considered on a money scale, for example,
in US Dollars. The left bottom cell of the Cost matrix defines the cost of false positives. Its default
value is 1 unit. In the case of thyroid disease this corresponds to the mean price one should pay in
order to treat a patient wrongly predicted by the model as “sick”. The right top cell of the Cost matrix
defines the cost of false negatives. Its default value is 1 unit. In the case of thyroid disease this
corresponds to the mean price one pays for missing a sick patient and losing the chance of a
successful treatment because of a wrong prediction by the classification model. It is also assumed
by default that no price is paid for correct decisions taken using the classification model. It is
clear that all these settings can be changed in order to match the real situation of the problem at
hand. In order to find the threshold corresponding to the minimum cost, it is sufficient to press the
button Minimize Cost/Benefit. This explains the aforementioned difference in confusion matrices.
The initial confusion matrix corresponds to the threshold 0.5, whereas the second confusion matrix
results from the value of the threshold found by minimizing the cost function.
The current value of the cost is compared by the program with the cost of selecting the same number of
instances at random. The difference between the cost of the random selection and the current value of
the cost is called the Gain, indicated at the right side of the frame. In
the context of thyroid disease, the Gain can be interpreted as the benefit obtained by using the
classification model instead of random selection of the same number of patients. Unfortunately, the
current version of the Weka software does not provide the means of automatic maximization of the
Gain function. However, this can easily be done interactively by moving the slider in the Threshold
frame of the Cost/Benefit Analysis window.
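The cost-minimizing search behind the Minimize Cost/Benefit button can be sketched as a brute-force scan over thresholds (toy probabilities and an assumed false-negative cost of 5 units, not actual Weka output):

```python
def total_cost(preds, threshold, fp_cost=1.0, fn_cost=1.0):
    """Total cost of predicting "sick" for every P(sick) >= threshold,
    under the Cost Matrix convention above (correct decisions cost nothing)."""
    fp = sum(1 for p, y in preds if p >= threshold and y == "negative")
    fn = sum(1 for p, y in preds if p < threshold and y == "sick")
    return fp * fp_cost + fn * fn_cost

# Hypothetical predicted probabilities, as in the ROC discussion:
preds = [(0.9, "sick"), (0.8, "sick"), (0.7, "negative"),
         (0.6, "sick"), (0.4, "negative"), (0.2, "negative")]

# Try every observed probability as a candidate threshold and keep the cheapest.
candidates = sorted({p for p, _ in preds})
best = min(candidates, key=lambda t: total_cost(preds, t, fp_cost=1.0, fn_cost=5.0))
print(best, total_cost(preds, best, fp_cost=1.0, fn_cost=5.0))   # 0.6 1.0
```

Making false negatives five times more expensive pushes the optimal threshold down, so more borderline patients are flagged as “sick”: the same effect you see when editing the Cost matrix in the Weka window.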
Close the window with the Cost/Benefit Analysis.
Extra exercise for additional practice with the Discretize filter:
Set the classifier to Naive Bayes. Select Cross-Validation from the test options, run the test and record
the accuracy.
Discretize task 5 (DT5):
Run the same test as in DT2, but using the unsupervised Discretize filter. Run this test three
times, changing the bins setting from 5 to 10 to 20. Record the accuracy of each test result.
Discretize task 6 (DT6):
Compare all 5 accuracy readings. Which test had the highest accuracy rate? What can you say about
discretization in general? How does supervised discretization compare with unsupervised
discretization? What difference does the number of bins make when using unsupervised discretization?
Building a Classification Tree Model
In this exercise, we build a classification tree model (using the decision tree method named J48
in Weka) for predicting the thyroid disease diagnosis of the patients. The goal is to learn the
possibilities offered by the Weka software to build and visualize classification trees.
In the classifier frame, click Choose, then select the J48 method from the trees submenu.
Click on the Start button.
The statistical parameters of the J48 model appear quite high in this case, and an added bonus of
individual classification trees is their interpretability. In order to visualize the classification tree
in text mode, scroll up the text field in the Classifier output frame.
In order to obtain a more usual representation of the same tree, do the following.
Click the right mouse button on the model type trees.J48 in the Result list frame and select the menu
item Visualize tree.
Resize the new window with the graphical representation of the tree.
Click with the right mouse button on the empty space in this screen, and in the popup menu select the
item Fit to screen.
The Tree View graphical diagram can be used to visualize decision trees. It contains two types of
nodes, ovals and rectangles. Each oval contains a test on one of the attributes, for example whether a
numeric descriptor value is below a threshold. The branch whose label matches the outcome of the
test is followed next. The top node of the tree is queried first. The “leaves” of the tree, depicted by
rectangles, contain the final decisions: whether the current patient is classified as “sick” or “negative”.
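Reading such a tree is equivalent to following nested attribute tests. A hand-written sketch (the attributes and thresholds are invented for illustration, not the actual J48 output):

```python
def classify(patient):
    """Hand-written sketch of how a learned tree is read. The attribute
    tests and thresholds here are made up, not the actual J48 output."""
    if patient["T3"] < 1.15:                 # top oval: first attribute test
        if patient["TT4"] < 61:              # next oval on the "true" branch
            return "sick"                    # leaf (rectangle)
        return "negative"                    # leaf
    return "negative"                        # leaf on the "false" branch

print(classify({"T3": 0.9, "TT4": 55}))      # sick
print(classify({"T3": 2.0, "TT4": 120}))     # negative
```

This nesting is why trees are easy to interpret: each leaf corresponds to a human-readable rule such as “T3 < 1.15 and TT4 < 61 implies sick”.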
Extra Exercise: Build the ROC curve and perform the Cost/Benefit analysis of the J48 model.
Final Exercise:
Compare the evaluations of all of the different Methods you ran in this exercise. Which one would you
use for this particular problem and why?
APPENDIX
UCI Machine Learning Repository
Thyroid Disease Data Set
Abstract: 10 separate databases from the Garvan Institute

Data Set Characteristics: Multivariate, Domain-Theory
Number of Instances: 7200
Area: Life
Attribute Characteristics: Categorical, Real
Number of Attributes: 21
Date Donated: 1987-01-01
Associated Tasks: Classification
Missing Values? N/A
Source:
Thyroid disease records supplied by the Garvan Institute and J. Ross Quinlan, New South Wales Institute, Sydney,
Australia; 1987.
Data Set Information:
# From the Garvan Institute
# Documentation: as given by Ross Quinlan
# 6 databases from the Garvan Institute in Sydney, Australia
# Approximately the following for each database:
** 2800 training (data) instances and 972 test instances
** Plenty of missing data
** 29 or so attributes, either Boolean or continuously-valued
# 2 additional databases, also from Ross Quinlan, are also here
** Hypothyroid.data and sick-euthyroid.data
** Quinlan believes that these databases have been corrupted
** Their format is highly similar to the other databases
# 1 more database of 9172 instances that cover 20 classes, and a related domain theory
# Another thyroid database from Stefan Aeberhard
** 3 classes, 215 instances, 5 attributes
** No missing values
# A Thyroid database suited for training ANNs
** 3 classes
** 3772 training instances, 3428 testing instances
** Includes cost data (donated by Peter Turney)
Attribute Information:
Classes: sick, negative.
age: continuous.
sex: M, F.
on thyroxine: f, t.
query on thyroxine: f, t.
on antithyroid medication: f, t.
sick: f, t.
pregnant: f, t.
thyroid surgery: f, t.
I131 treatment: f, t.
query hypothyroid: f, t.
query hyperthyroid: f, t.
lithium: f, t.
goitre: f, t.
tumor: f, t.
hypopituitary: f, t.
psych: f, t.
TSH measured: f, t.
TSH: continuous.
T3 measured: f, t.
T3: continuous.
TT4 measured: f, t.
TT4: continuous.
T4U measured: f, t.
T4U: continuous.
FTI measured: f, t.
FTI: continuous.
TBG measured: f, t.
TBG: continuous.
referral source: WEST, STMW, SVHC, SVI, SVHD, other.
Num Instances: 3772
Num Attributes: 30
Num Continuous: 7 (Int 1 / Real 6)
Num Discrete: 23
Missing values: 6064 / 5.4%
Relevant Papers:
Quinlan, J.R., Compton, P.J., Horn, K.A., & Lazurus, L. (1986). Inductive knowledge acquisition: A case study. In
Proceedings of the Second Australian Conference on Applications of Expert Systems. Sydney, Australia.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.