Data Mining in Weka: "Bringing It All Together"
Predictive Analytics Center of Excellence (PACE)
San Diego Supercomputer Center, UCSD
Data Mining Boot Camp

Introduction

This project assignment demonstrates an example of an end-to-end data mining process, supported by the Weka software, to build supervised and unsupervised models for analysis. The sick.arff data set will be used to illustrate the steps and actions throughout the process. Several versions of this file are available in the .arff format. More details and the data set description can be found in the Appendix. We will be using the Explorer component of Weka for this project.

Part 1: Data Exploration

1. Data and descriptors

The dataset for this project contains 30 attributes of patient data describing thyroid diagnoses obtained from the Garvan Institute, whose full database consists of 9172 records collected from 1984 to early 1987; the sick.arff file used here contains 3772 of those records.

2. Files

The following file is supplied for the project:
• sick.arff – descriptor and activity values

This file can be found at: http://repository.seasr.org/Datasets/UCI/arff/

3. Exercise 1: Preprocess the Data

It is important to understand and properly preprocess the data. Some of the key factors to consider are the total number of instances, the number of attributes, the number of continuous and/or discrete attributes, the number of missing values, etc.

Step by step instructions

In the starting interface of Weka, click on the button Explorer. In the Preprocess tab, click on the button Open File. In the file selection interface, select the file sick.arff.

The dataset is characterized in the Current relation frame: the name, the number of instances, and the number of attributes (descriptors + class). We see in this frame that the number of instances is 3772 and the number of attributes is 30 (29 descriptors plus the class). The Attributes frame allows the user to modify the set of attributes using the select and remove options. Information about the selected attribute is given in the Selected attribute frame, in which a histogram depicts the attribute distribution. For the currently selected descriptor T4U, the histogram shows the distribution of its values in the dataset. Take note of the number of missing, unique and distinct values.

Select the last attribute, "class", in the Attributes frame. One can read from the Selected attribute frame that there are 3541 "negative" and 231 "sick" examples in the dataset. "Negative" examples are depicted in blue, whereas "sick" examples are depicted in red in the histogram. Note the ratio of the two class counts. Does the dataset seem balanced?

Visualization

Click on the Visualize all button on the lower right to look at the class distribution across the entire set of attributes. Examine each one of the variables. Did you notice anything? Are there any variables you think should be removed, discretized, or otherwise manipulated? Are there any duplicates? Any strongly correlated attributes?

Attribute Removal

Note that attribute number 28, named "TBG", has 100% missing values. This attribute, together with the related attribute 27 ("TBG measured"), should be removed by adding a checkmark in front of the attribute names and clicking on the Remove button below. Also consider attribute 29, the referral source: it seems irrelevant, but whether it is may depend on the nature of the disease.
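The same preprocessing can also be scripted with Weka's Java API instead of the Explorer. The following is a minimal sketch, assuming weka.jar is on the classpath and sick.arff is in the working directory (the class name PreprocessSick is ours): it loads the file, reports the instance and attribute counts, and removes the two TBG attributes.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class PreprocessSick {
        public static void main(String[] args) throws Exception {
            // Load the dataset; the last attribute is the class.
            Instances data = DataSource.read("sick.arff");
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println(data.numInstances() + " instances, "
                    + data.numAttributes() + " attributes");

            // Drop "TBG measured" and "TBG" (attributes 27-28 in the
            // 1-based numbering shown in the Explorer's Attributes frame).
            Remove remove = new Remove();
            remove.setAttributeIndices("27-28");
            remove.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, remove);
            System.out.println("after removal: " + reduced.numAttributes()
                    + " attributes");
        }
    }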
Discretization

Apply the Discretize filter to the 'Sick' dataset.

Discretize task 1 (DT1): Browse the attribute information details of the sick.arff file. How many of the attributes are numeric? Write down their attribute numbers.

Discretize task 2 (DT2): In the Preprocess panel, choose the supervised Discretization filter (filters.supervised.attribute.Discretize) and apply it (using default settings). Browse the details of the attributes you wrote down in DT1. How are they different? How many distinct ranges have been created for each attribute?

Discretize task 3 (DT3): Undo the Discretize filter. Change the filter to the unsupervised Discretization filter (filters.unsupervised.attribute.Discretize) and set the 'bins' setting to 5 (filter settings are found by right-clicking on the box to the right of the 'Choose' button and selecting the show properties option). Leave the other settings at their defaults and click 'Apply'. Have a look at the attributes that you wrote down in DT1. Undo the filter and redo it with the bins set to 10. What do you think the 'bins' setting affects?

Undo the Discretize filter. We will start with model building in the next section, but we will come back to Discretization later to check how different discretization filters might influence the produced models.

Building the Clustering (Simple k-means) Model

In this exercise, we will create simple k-means models for predicting the thyroid disease outcome. Click on the Cluster tab. In the Clusterer frame, click Choose, then select the SimpleKMeans method. Click on the Start button to build the simple k-means model. Notice that no cluster corresponds to the "sick" class. Click with the right mouse button on the word SimpleKMeans in the Clusterer frame. The window for setting the options of the k-means method pops up. Change the option numClusters to a larger number in order to create at least one cluster in which the majority class is "sick". How many clusters does it take? You can change other parameters as well – distance metric, seed, etc. How does that influence your produced clusters?
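The same experiment can be scripted. The sketch below (the class name ClusterSick and the loop bounds are ours; the seed is arbitrary) follows the pattern Weka documents for classes-to-clusters evaluation: the class attribute is stripped before building the clusterer, and ClusterEvaluation then maps each cluster to its majority class. Re-running with increasing numClusters shows how many clusters it takes before one of them is dominated by "sick" instances.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class ClusterSick {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("sick.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Clusterers must not see the class attribute, so strip it for training.
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(data.classIndex() + 1));
            remove.setInputFormat(data);
            Instances train = Filter.useFilter(data, remove);

            for (int k = 2; k <= 8; k++) {
                SimpleKMeans km = new SimpleKMeans();
                km.setNumClusters(k);
                km.setSeed(10);
                km.buildClusterer(train);

                // Classes-to-clusters: each cluster is assigned its majority class.
                ClusterEvaluation eval = new ClusterEvaluation();
                eval.setClusterer(km);
                eval.evaluateClusterer(data);
                System.out.println("k = " + k);
                System.out.println(eval.clusterResultsToString());
            }
        }
    }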
Exercise 2: Model Building

Building the ZeroR model

In this exercise, we will build the trivial model ZeroR, in which all patients are classified as "negative". The goal is to demonstrate that accuracy is not an appropriate measure of classification performance for unbalanced datasets, in which the number of "negative" diagnoses is much larger than the number of "sick" ones.

Click on the tab Classify. The ZeroR method is already selected by default. For assessing the predictive performance of all models to be built, the 10-fold cross-validation method has also been specified by default. Click on the Start button to build a model.

The predictive performance of the model is characterized in the right-hand Classifier output frame. The Confusion Matrix for the model is presented at the bottom of the Classifier output window. It can be seen from it that all instances have been classified as "negative". It is clear that such a trivial model is unusable: it cannot discover any "sick" patients. However, it is worth noticing that the accuracy (Correctly Classified Instances) of this trivial model is nonetheless high: 93.8759 %. This fact clearly indicates that accuracy cannot be used for assessing the usefulness of classification models built on unbalanced datasets. For this purpose a good choice is the "Kappa statistic", which is zero in this case. The "Kappa statistic" is an analog of a correlation coefficient: its value is zero in the absence of any relation and approaches one for a very strong statistical relation between the class label and the attributes of the instances, i.e. between the "negative" or "sick" classes and the values of their descriptors. Another useful statistical characteristic is the "ROC Area", for which a value near 0.5 means the lack of any statistical dependence.

Building the Naïve Bayesian Model

In this exercise, we build a Naïve Bayesian model for predicting the thyroid disease outcome. The goal is to demonstrate the ability of Weka to build statistically significant classification models for predicting the class outcome, as well as to show different ways of assessing the statistical significance and usefulness of classification models.

In the Classifier frame, click Choose, then select the NaiveBayes method from the bayes submenu. Click on the Start button to build a model.

Not only did the accuracy of the model increase (from 93.8759 % to 94.3001 %), its real statistical significance became much stronger. This follows from the value of the "Kappa statistic" of 0.58, which indicates the existence of a moderate statistical dependence. It can be analyzed using the "Confusion Matrix" at the bottom of the Classifier output window: 3384 "negative" patients are correctly classified, 173 "sick" patients are correctly classified, 157 "negative" patients are misclassified as "sick", and 58 "sick" patients are misclassified as "negative". The model exhibits an excellent "ROC Area" of 0.96 for the "negative" class and a significantly improved ROC area for the "sick" class as well. This indicates that this Naïve Bayesian model could be used to great advantage for predicting thyroid patients' outcomes, as can clearly be shown by analyzing the ROC and Cost/Benefit plots.
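Before turning to those plots, both runs can be reproduced programmatically with the Weka Java API (a minimal sketch; the class name CompareModels is ours, and it assumes "sick" is class value 1). The exact figures may differ slightly from the Explorer's depending on the random seed used to assign the cross-validation folds.

    import java.util.Random;

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareModels {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("sick.arff");
            data.setClassIndex(data.numAttributes() - 1);

            for (Classifier c : new Classifier[] { new ZeroR(), new NaiveBayes() }) {
                // 10-fold cross-validation, as in the Explorer's default test option.
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.println(c.getClass().getSimpleName());
                System.out.printf("  accuracy: %.4f%%%n", eval.pctCorrect());
                System.out.printf("  kappa:    %.4f%n", eval.kappa());
                System.out.printf("  ROC area (sick): %.4f%n", eval.areaUnderROC(1));
                System.out.println(eval.toMatrixString("  Confusion matrix:"));
            }
        }
    }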
The Naïve Bayes method provides probabilistic outputs. This means that Naïve Bayes models can assess the probability (varying from 0 to 1) that a given patient with particular characteristics should be predicted as "negative" or "sick". By moving a threshold from 0 to 1 and requiring that an outcome be predicted as "sick" only if the corresponding probability exceeds the current threshold, one can build the ROC (Receiver Operating Characteristic) curve.

Extra exercise for additional practice with the ROC Curve: Visualize the ROC curve by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Visualize threshold curve for the class of interest (here "sick"). The ROC curve is shown in the Plot frame of the window. The X axis corresponds to the false positive rate, whereas the Y axis corresponds to the true positive rate. The color depicts the value of the threshold: "colder" (closer to blue) colors correspond to lower threshold values. All outcomes with a probability of being "sick" exceeding the current threshold are predicted as "sick". If such a prediction made for a given instance is correct, then that instance is a true positive, otherwise it is a false positive. If for some values of the threshold the true positive rate greatly exceeds the false positive rate (the curve rises steeply toward the upper left corner), then the classification model with such a threshold can be used to selectively extract "sick" patients from the much larger number of "negative" ones.

In order to find the optimal value of the threshold (or the optimal fraction of patients to be selected and diagnosed with thyroid disease), one can perform a cost/benefit analysis. Close the window with the ROC curve. Open the window for the cost/benefit analysis by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Cost/Benefit analysis for the same class. Click on the Minimize Cost/Benefit button at the bottom right corner of the window.

Examine the Cost/Benefit Analysis window attentively. It consists of several panels. The left part of the window contains the Plot: ThresholdCurve frame with the Threshold Curve (also called the Lift curve). The Threshold curve looks very similar to the ROC curve. In both of them the Y axis corresponds to the true positive rate. However, in contrast to the ROC curve, the X axis in the Threshold curve corresponds to the fraction of selected instances (the "Sample Size"). In other words, the Threshold curve depicts the fraction of "sick" patients retrieved as a function of the fraction of the whole dataset that is selected (i.e. only those instances for which the estimated probability of thyroid disease exceeds the chosen threshold). The value of the threshold can be modified interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The confusion matrix for the current value of the threshold is shown in the Confusion Matrix frame at the bottom left corner of the window. Notice that the confusion matrix for the current value of the threshold differs sharply from the previously obtained one. Why is this happening?

In order to answer this question and explain the corresponding phenomenon, let us take a look at the right side of the window. Its bottom right corner contains the Cost Matrix frame. The left part of the frame contains the Cost matrix itself. Its four entries indicate the cost one should pay for decisions taken on the basis of the classification model. The cost values are expressed in abstract units, but in case studies they can be considered on a monetary scale, for example in US dollars. The bottom left cell of the Cost matrix defines the cost of false positives; its default value is 1 unit. For thyroid disease, this corresponds to the mean price paid to treat a patient wrongly predicted by the model to be "sick". The top right cell of the Cost matrix defines the cost of false negatives; its default value is also 1 unit. For thyroid disease, this corresponds to the mean price paid for missing a sick patient and losing a successful treatment because of the model's wrong prediction. By default, no price is paid for correct decisions taken using the classification model. All these settings can of course be changed to match the real situation in the problem at hand.

In order to find the threshold corresponding to the minimum cost, it is sufficient to press the button Minimize Cost/Benefit. This explains the aforementioned difference in confusion matrices: the initial confusion matrix corresponds to the threshold 0.5, whereas the second confusion matrix results from the value of the threshold found by minimizing the cost function. The current value of the cost is compared by the program with the cost of selecting the same number of instances at random. The difference between the cost of such a random selection and the current value of the cost is called the Gain, indicated at the right side of the frame.

In the context of thyroid disease, the Gain can be interpreted as the benefit obtained by using the classification model instead of a random selection of the same number of patients. Unfortunately, the current version of the Weka software does not provide a means of automatically maximizing the Gain function. However, this can easily be done interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. Close the window with the Cost/Benefit Analysis.
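The cost-minimizing threshold can also be found outside the Explorer. The sketch below (class name CostScan is ours; it assumes a recent Weka version where Evaluation.predictions() returns the recorded predictions, and that "sick" is class value 1) builds the same threshold curve the Cost/Benefit window plots, one row per candidate threshold with the confusion-matrix counts at that threshold, and scans it for the minimum of the false-positive plus false-negative cost under the default unit costs. Change costFP and costFN to explore other cost matrices.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.evaluation.ThresholdCurve;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CostScan {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("sick.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Cross-validate Naive Bayes, keeping the per-instance predictions.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

            // One row per candidate threshold; class value 1 is taken as "sick".
            Instances curve = new ThresholdCurve().getCurve(eval.predictions(), 1);
            int fpIdx = curve.attribute("False Positives").index();
            int fnIdx = curve.attribute("False Negatives").index();
            int thIdx = curve.attribute("Threshold").index();

            double costFP = 1.0, costFN = 1.0;  // default unit costs of the Cost Matrix
            double bestCost = Double.MAX_VALUE, bestThreshold = 0.5;
            for (int i = 0; i < curve.numInstances(); i++) {
                double cost = costFP * curve.instance(i).value(fpIdx)
                            + costFN * curve.instance(i).value(fnIdx);
                if (cost < bestCost) {
                    bestCost = cost;
                    bestThreshold = curve.instance(i).value(thIdx);
                }
            }
            System.out.printf("minimum cost %.0f at threshold %.3f%n",
                    bestCost, bestThreshold);
        }
    }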
Extra exercise for additional practice with the Discretize filter: Set the classifier to Naive Bayes. Select Cross-Validation from the test options, run the test and record the accuracy.

Discretize task 5 (DT5): Run the same test as in the DT2 task, but using the unsupervised Discretize filter. Run this test three times, changing the bins setting from 5 to 10 to 20. Record the accuracy of each test result.

Discretize task 6 (DT6): Compare all 5 accuracy readings. Which test had the highest accuracy rate? What can you say about discretization in general? Supervised discretization versus unsupervised discretization? What difference does the size of the bins make in unsupervised discretization?

Building a Classification Tree Model

In this exercise, we build a classification tree model (using the decision tree method named J48 in Weka) for predicting the thyroid disease diagnosis of the patients. The goal is to learn the possibilities offered by the Weka software to build and visualize classification trees.

In the Classifier frame, click Choose, then select the J48 method from the trees submenu. Click on the Start button.

The statistical parameters of the J48 model appear quite high in this case; an added bonus of individual classification trees is their interpretability. In order to visualize the classification tree in text mode, scroll up in the text field of the Classifier output frame. To obtain a more conventional representation of the same tree, do the following: click the right mouse button on the model type trees.J48 in the Result list frame and select the menu item Visualize tree. Resize the new window with the graphical representation of the tree, click with the right mouse button on empty space in this window, and in the popup menu select the item Fit to screen.

The Tree View graphical diagram can be used to visualize decision trees. It contains two types of nodes: ovals and rectangles. Each oval contains a test on one of the attributes, for example whether a numeric descriptor value is below some threshold or whether a boolean attribute is true. The branch whose label matches the answer is followed to the next node, and the top node of the tree is queried first. The "leaves" of the tree, depicted by rectangles, contain the final decision on whether the current patient is "sick" or "negative".
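The same tree can be built and inspected from the Java API. A minimal sketch (class name TreeSick is ours): the printed text form is the same representation shown in the Classifier output frame, and the cross-validation reproduces the performance statistics up to the fold seed.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TreeSick {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("sick.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Build the pruned tree on the full dataset and print the same
            // text representation shown in the Classifier output frame.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);

            // 10-fold cross-validation for the performance statistics.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("accuracy %.4f%%, kappa %.4f%n",
                    eval.pctCorrect(), eval.kappa());
        }
    }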
Extra Exercise: Build the ROC curve and perform the Cost/Benefit analysis of the J48 model.

Final Exercise: Compare the evaluations of all of the different methods you ran in this exercise. Which one would you use for this particular problem and why?

APPENDIX

UCI Machine Learning Repository: Thyroid Disease Data Set

Abstract: 10 separate databases from the Garvan Institute

Data Set Characteristics: Multivariate, Domain-Theory
Number of Instances: 7200
Area: Life
Attribute Characteristics: Categorical, Real
Number of Attributes: 21
Date Donated: 1987-01-01
Associated Tasks: Classification
Missing Values? N/A

Source: Thyroid disease records supplied by the Garvan Institute and J. Ross Quinlan, New South Wales Institute, Sydney, Australia; 1987.

Data Set Information:
# From the Garvan Institute
# Documentation: as given by Ross Quinlan
# 6 databases from the Garvan Institute in Sydney, Australia
# Approximately the following for each database:
** 2800 training (data) instances and 972 test instances
** Plenty of missing data
** 29 or so attributes, either boolean or continuously-valued
# 2 additional databases, also from Ross Quinlan, are also here
** hypothyroid.data and sick-euthyroid.data
** Quinlan believes that these databases have been corrupted
** Their format is highly similar to the other databases
# 1 more database of 9172 instances that cover 20 classes, and a related domain theory
# Another thyroid database from Stefan Aeberhard
** 3 classes, 215 instances, 5 attributes
** No missing values
# A thyroid database suited for training ANNs
** 3 classes
** 3772 training instances, 3428 testing instances
** Includes cost data (donated by Peter Turney)

Attribute Information:

class: sick, negative
age: continuous
sex: M, F
on thyroxine: f, t
query on thyroxine: f, t
on antithyroid medication: f, t
sick: f, t
pregnant: f, t
thyroid surgery: f, t
I131 treatment: f, t
query hypothyroid: f, t
query hyperthyroid: f, t
lithium: f, t
goitre: f, t
tumor: f, t
hypopituitary: f, t
psych: f, t
TSH measured: f, t
TSH: continuous
T3 measured: f, t
T3: continuous
TT4 measured: f, t
TT4: continuous
T4U measured: f, t
T4U: continuous
FTI measured: f, t
FTI: continuous
TBG measured: f, t
TBG: continuous
referral source: WEST, STMW, SVHC, SVI, SVHD, other

Num Instances: 3772
Num Attributes: 30
Num Continuous: 7 (Int 1 / Real 6)
Num Discrete: 23
Missing values: 6064 (5.4%)

Relevant Papers:
Quinlan, J.R., Compton, P.J., Horn, K.A., & Lazarus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems. Sydney, Australia.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
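As a cross-check of the summary statistics above, the missing-value count can be recomputed from sick.arff with a few lines of the Weka Java API (a minimal sketch; the class name MissingValues is ours):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MissingValues {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("sick.arff");
            int missing = 0;
            for (int i = 0; i < data.numAttributes(); i++) {
                // AttributeStats.missingCount is the number of missing cells
                // for attribute i across all instances.
                missing += data.attributeStats(i).missingCount;
            }
            double pct = 100.0 * missing
                    / (data.numInstances() * (double) data.numAttributes());
            System.out.printf("%d missing values (%.1f%% of all cells)%n",
                    missing, pct);
        }
    }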