Case Study: Predicting Patient Outcomes
Computer Science 105
Boston University
David G. Sullivan, Ph.D.

Dataset Description
• The "spine clinic dataset" from Roiger & Geatz.
• Data consists of records for 171 patients who had back surgery at a spine clinic.
• 31 attributes per record describing:
  • the patient's condition before and during surgery
  • the patient's condition three months after surgery
    • including whether he/she has been able to return to work
• Includes missing and erroneous information.

Overview of the Data-Mining Task
• Goal: to develop insights into factors that influence patient outcomes – in particular, whether the patient can return to work.
• What type of data mining should we perform?
• What will the data mining produce?
  [diagram: input attributes → model → return to work?]

Review: Preparing the Data
• Possible steps include:
  • denormalization: several records for a given entity → a single training example
  • discretization: numeric → nominal
  • nominal → numeric
  • force Weka to realize that a seemingly numeric attribute is really nominal
  • remove ID attributes and other problematic attributes

Preparing the Data (cont.)
• We begin by loading the dataset (a CSV file) into the Weka Explorer.
• It's helpful to examine each attribute by highlighting its name in the Attributes portion of the Preprocess tab.
  • helps us to identify missing/anomalous values
  • can also help us to discover formatting issues that should be addressed

Preparing the Data (cont.)
• Things worth noting about the attributes in this dataset:
• Steps we may want to take:

Review: Dividing Up the Data
• To allow us to validate the model(s) we learn, we'll divide the examples into two files:
  • n% for training
  • (100 – n)% for testing
    • don't touch these until you've finalized your model(s)
• You can use Weka to split the dataset (a scripted version of these steps is sketched after the baseline slide below):
  1) filters/unsupervised/instance/Randomize
  2) save the shuffled examples in Arff format
  3) filters/unsupervised/instance/RemovePercentage
     • specify the percentage parameter to remove n%
  4) save the remaining examples as your test set
  5) load the full file of shuffled examples back into Weka
  6) use RemovePercentage with invertSelection set to True to remove the other (100 – n)%
  7) save the remaining examples as your training set

Experimenting with Different Techniques
• Use Weka to try different techniques on the training data.
• For each technique, examine:
  • the resulting model
  • the validation results
    • for classification models: overall accuracy, confusion matrix
    • for numeric estimation models: correlation coefficient, errors
    • for association-rule models: support, confidence
• If the model is something you can interpret, make sure it seems reasonable.
• Try to improve the validation results by:
  • changing the algorithm used
  • changing the algorithm's parameters

Remember to Start with a Baseline
• For classification learning:
  • 0R
  • 1R
• For numeric estimation:
  • simple linear regression
• Include the results of these baselines to put your other results in context.
  • example: 80% accuracy isn't that impressive if 0R has 78% accuracy
  • being honest about your results is better than making exaggerated claims!
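Aside: Scripting the Split with Weka's Java API
The same preparation and splitting steps can be scripted instead of clicked through the Explorer. The following is a minimal sketch against Weka's Java API, assuming Weka 3.x is on the classpath; the file name spine.csv, the random seed, the choice of n = 75, and the attribute forced to nominal are illustrative placeholders rather than details of the case study.

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToNominal;
    import weka.filters.unsupervised.instance.Randomize;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class PrepareAndSplit {
        public static void main(String[] args) throws Exception {
            // load the CSV file (hypothetical name), as the Preprocess tab would
            Instances data = new DataSource("spine.csv").getDataSet();

            // example only (hypothetical choice): force the first attribute,
            // which merely looks numeric, to be treated as nominal
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setAttributeIndices("first");
            toNominal.setInputFormat(data);
            data = Filter.useFilter(data, toNominal);

            // step 1: shuffle the examples (fixed seed so the split is repeatable)
            Randomize shuffle = new Randomize();
            shuffle.setRandomSeed(42);
            shuffle.setInputFormat(data);
            Instances shuffled = Filter.useFilter(data, shuffle);
            // steps 2 and 5 (saving and reloading the shuffled file) are unnecessary
            // here, since we can reuse the in-memory shuffled copy

            // step 3: remove n% (here 75%) of the shuffled examples;
            // what remains becomes the test set
            RemovePercentage removeTrain = new RemovePercentage();
            removeTrain.setPercentage(75);
            removeTrain.setInputFormat(shuffled);
            Instances test = Filter.useFilter(shuffled, removeTrain);

            // step 6: apply RemovePercentage again with invertSelection so the
            // 75% removed above is instead kept as the training set
            RemovePercentage keepTrain = new RemovePercentage();
            keepTrain.setPercentage(75);
            keepTrain.setInvertSelection(true);
            keepTrain.setInputFormat(shuffled);
            Instances train = Filter.useFilter(shuffled, keepTrain);

            // steps 4 and 7: save the test and training sets in ARFF format
            save(test, "spine-test.arff");
            save(train, "spine-train.arff");
        }

        private static void save(Instances insts, String filename) throws Exception {
            ArffSaver saver = new ArffSaver();
            saver.setInstances(insts);
            saver.setFile(new File(filename));
            saver.writeBatch();
        }
    }

Either way, the key point is unchanged: shuffle once, split once, and leave the test file untouched until the models are finalized.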
Cross-Validation
• When validating classification/estimation models, Weka performs 10-fold cross-validation by default:
  1) divides the training data into 10 subsets
  2) repeatedly does the following:
     a) holds out one of the 10 subsets
     b) builds a model using the other 9 subsets
     c) tests the model using the held-out subset
  3) reports results averaged over the 10 held-out subsets
• Note: the model reported in the output window is learned from all of the training examples.
  • the cross-validation results do not actually evaluate it

Reporting the Results
• Once you have settled on the algorithm(s) with the best cross-validation results, you should evaluate the resulting model(s) on both the training and test data (a scripted version of this workflow is sketched at the end of these notes).
• To see how well the reported model does on the training data, select Use training set in the Test options box of the Classify tab and rerun the algorithm.
• To see how well the reported model does on the test data, select Supplied test set in the Test options box of the Classify tab.
  • click the Set button to specify the file
  • rerun the algorithm
• Include appropriate metrics for each portion of your data:
  • classification learning: accuracy, confusion matrix
  • numeric estimation: correlation coefficient

Discussing the Results
• Your report should include more than just the numeric results.
• You should include an intelligent discussion of the results:
  • compare training vs. test results
    • how well do the models appear to generalize?
  • which attributes are included in the models?
  • for classification learning, what do the confusion matrices tell you about the types of examples that the models get right or get wrong?
  • for numeric estimation, which attributes have positive coefficients and which have negative?
    • note: the magnitude of the coefficients may not be significant
  • are the models intuitive? why or why not?
• Don't make overly confident claims!

Summary of Experiments
• Summary of experiments:

Summary of Experiments (cont.)
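Aside: Scripting the Evaluation with Weka's Java API
For reference, the evaluation workflow above can also be scripted rather than run from the Classify tab. The sketch below assumes the train/test ARFF files produced earlier, that the class attribute (return to work?) is the last one, and that J48 is merely an illustrative classifier, not the one the case study settles on; it reports 10-fold cross-validation accuracy for a 0R baseline and for J48, then evaluates the final model on both the training set and the supplied test set.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluateModels {
        public static void main(String[] args) throws Exception {
            // load the training and test splits produced earlier (hypothetical names)
            Instances train = new DataSource("spine-train.arff").getDataSet();
            Instances test  = new DataSource("spine-test.arff").getDataSet();
            // assume the class attribute (return to work?) is the last one
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // 10-fold cross-validation on the training data, for a 0R baseline
            // and one illustrative classifier (J48)
            for (Classifier c : new Classifier[] { new ZeroR(), new J48() }) {
                Evaluation cv = new Evaluation(train);
                cv.crossValidateModel(c, train, 10, new Random(1));
                System.out.printf("%s 10-fold CV accuracy: %.1f%%%n",
                                  c.getClass().getSimpleName(), cv.pctCorrect());
            }

            // build the final model on all of the training examples ...
            Classifier finalModel = new J48();
            finalModel.buildClassifier(train);

            // ... report how it does on the training set itself
            Evaluation onTrain = new Evaluation(train);
            onTrain.evaluateModel(finalModel, train);
            System.out.println(onTrain.toSummaryString("\n=== Training set ===", false));
            System.out.println(onTrain.toMatrixString());

            // ... and on the held-out test set (the "supplied test set")
            Evaluation onTest = new Evaluation(train);
            onTest.evaluateModel(finalModel, test);
            System.out.println(onTest.toSummaryString("\n=== Test set ===", false));
            System.out.println(onTest.toMatrixString());
        }
    }

Putting the cross-validation, training-set, and test-set numbers side by side gives exactly the comparison the discussion of results calls for.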