Case Study:
Predicting Patient Outcomes
Computer Science 105
Boston University
David G. Sullivan, Ph.D.
Dataset Description
• The "spine clinic dataset" from Roiger & Geatz.
• Data consists of records for 171 patients who had back surgery
at a spine clinic.
• 31 attributes per record describing:
• the patient's condition before and during surgery
• the patient's condition three months after surgery
• including whether he/she has been able to return to work
• Includes missing and erroneous information
Overview of the Data-Mining Task
• Goal: to develop insights into factors that influence patient
outcomes – in particular, whether the patient can return
to work.
• What type of data mining should we perform?
• What will the data mining produce?
[Diagram: input attributes → model → "return to work?"]
Review: Preparing the Data
• Possible steps include:
• denormalization
several records for a given entity → single training example
• discretization
numeric → nominal
• nominal → numeric
• force Weka to realize that a seemingly numeric attribute
is really nominal
• remove ID attributes and other problematic attributes
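• The same preparation steps can be scripted with Weka's Java API. A minimal sketch, assuming the data have already been loaded into an Instances object; the attribute indices used here are hypothetical and would have to match the actual dataset:

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;
    import weka.filters.unsupervised.attribute.NumericToNominal;
    import weka.filters.unsupervised.attribute.Remove;

    public class PrepareData {
        public static Instances prepare(Instances data) throws Exception {
            // remove an ID attribute (assumed here to be the first attribute)
            Remove remove = new Remove();
            remove.setAttributeIndices("1");           // hypothetical index
            remove.setInputFormat(data);
            data = Filter.useFilter(data, remove);

            // force a seemingly numeric attribute to be treated as nominal
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setAttributeIndices("3");        // hypothetical index
            toNominal.setInputFormat(data);
            data = Filter.useFilter(data, toNominal);

            // discretize the remaining numeric attributes (numeric -> nominal)
            Discretize discretize = new Discretize();
            discretize.setBins(5);
            discretize.setInputFormat(data);
            data = Filter.useFilter(data, discretize);

            return data;
        }
    }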
Preparing the Data (cont.)
• We begin by loading the dataset (a CSV file) into Weka Explorer.
• It's helpful to examine each attribute by highlighting
its name in the Attribute portion of the Preprocess tab.
• helps us to identify missing/anomalous values
• can also help us to discover formatting issues that should be addressed
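• Loading the CSV file can also be done programmatically. A minimal sketch using Weka's CSVLoader; the file name is hypothetical:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;

    public class LoadSpineData {
        public static Instances load() throws Exception {
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("spine_clinic.csv"));  // hypothetical file name
            Instances data = loader.getDataSet();

            // treat the last attribute (e.g., return to work) as the class attribute
            data.setClassIndex(data.numAttributes() - 1);

            // per-attribute summary, analogous to browsing the Preprocess tab
            System.out.println(data.toSummaryString());
            return data;
        }
    }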
Preparing the Data (cont.)
• Things worth noting about the attributes in this dataset:
• Steps we may want to take:
Review: Dividing Up the Data
• To allow us to validate the model(s) we learn,
we'll divide the examples into two files:
• n% for training
• (100 – n)% for testing
• don't touch these until you've finalized your model(s)
• You can use Weka to split the dataset:
1) filters/unsupervised/instance/Randomize
2) save the shuffled examples in Arff format
3) filters/unsupervised/instance/RemovePercentage
• specify the percentage parameter to remove n%
4) save the remaining examples as your test set
5) load the full file of shuffled examples back into Weka
6) use RemovePercentage with invertSelection set to True
to remove the other (100 – n)%
7) save the remaining examples as your training set
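• The same split can be scripted. A minimal sketch with n = 75 (75% training, 25% testing); the random seed and file names are just illustrative:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.Randomize;
    import weka.filters.unsupervised.instance.RemovePercentage;

    public class SplitData {
        public static void split(Instances data) throws Exception {
            double n = 75.0;                           // n% for training

            // 1) shuffle the examples
            Randomize randomize = new Randomize();
            randomize.setRandomSeed(42);
            randomize.setInputFormat(data);
            Instances shuffled = Filter.useFilter(data, randomize);

            // 3-4) remove n%; the remaining (100 - n)% is the test set
            RemovePercentage dropTrainPortion = new RemovePercentage();
            dropTrainPortion.setPercentage(n);
            dropTrainPortion.setInputFormat(shuffled);
            Instances test = Filter.useFilter(shuffled, dropTrainPortion);

            // 5-7) invert the selection to keep the n% that becomes the training set
            RemovePercentage keepTrainPortion = new RemovePercentage();
            keepTrainPortion.setPercentage(n);
            keepTrainPortion.setInvertSelection(true);
            keepTrainPortion.setInputFormat(shuffled);
            Instances train = Filter.useFilter(shuffled, keepTrainPortion);

            saveArff(train, "spine_train.arff");       // hypothetical file names
            saveArff(test, "spine_test.arff");
        }

        private static void saveArff(Instances data, String fileName) throws Exception {
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File(fileName));
            saver.writeBatch();
        }
    }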
Experimenting with Different Techniques
• Use Weka to try different techniques on the training data.
• For each technique, examine:
• the resulting model
• the validation results
• for classification models: overall accuracy, confusion matrix
• for numeric estimation models: correlation coefficient, errors
• for association-rule models: support, confidence
• If the model is something you can interpret,
make sure it seems reasonable.
• Try to improve the validation results by:
• changing the algorithm used
• changing the algorithm's parameters
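• A minimal sketch of this kind of experimentation via Weka's Java API; J48 and NaiveBayes are just example algorithms, and setMinNumObj illustrates changing a parameter:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CompareClassifiers {
        public static void compare(Instances train) throws Exception {
            J48 tree = new J48();
            tree.setMinNumObj(5);            // example of changing a parameter
            Classifier[] candidates = { tree, new NaiveBayes() };

            for (Classifier c : candidates) {
                Evaluation eval = new Evaluation(train);
                eval.crossValidateModel(c, train, 10, new Random(1));
                System.out.println(c.getClass().getSimpleName() + ": "
                        + eval.pctCorrect() + "% correct");
                System.out.println(eval.toMatrixString());   // confusion matrix
            }
        }
    }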
Remember to Start with a Baseline
• For classification learning:
• 0R
• 1R
• For numeric estimation:
• simple linear regression
• Include the results of these baselines to put your other results
in context.
• example: 80% accuracy isn't that impressive
if 0R has 78% accuracy
• being honest about your results is better than making
exaggerated claims!
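• A minimal sketch of running the 0R and 1R baselines with cross-validation on the training data (for numeric estimation, weka.classifiers.functions.SimpleLinearRegression would play the same role):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;

    public class Baselines {
        public static void run(Instances train) throws Exception {
            Evaluation zeroR = new Evaluation(train);
            zeroR.crossValidateModel(new ZeroR(), train, 10, new Random(1));
            System.out.println("0R accuracy: " + zeroR.pctCorrect() + "%");

            Evaluation oneR = new Evaluation(train);
            oneR.crossValidateModel(new OneR(), train, 10, new Random(1));
            System.out.println("1R accuracy: " + oneR.pctCorrect() + "%");
        }
    }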
Cross Validation
• When validating classification/estimation models,
Weka performs 10-fold cross validation by default:
1) divides the training data into 10 subsets
2) repeatedly does the following:
a) holds out one of the 10 subsets
b) builds a model using the other 9 subsets
c) tests the model using the held-out subset
3) reports results that average the 10 models together
• Note: the model reported in the output window is learned from
all of the training examples.
• the cross-validation results do not actually evaluate it
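• This distinction is visible in the Java API, where cross-validating and building the reported model are separate calls; a minimal sketch with J48 as an example:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CrossValidationNote {
        public static void run(Instances train) throws Exception {
            // cross-validation: 10 models are built and tested internally;
            // none of them is the model that gets reported
            Evaluation eval = new Evaluation(train);
            eval.crossValidateModel(new J48(), train, 10, new Random(1));
            System.out.println(eval.toSummaryString());

            // the reported model is learned from all of the training examples
            J48 reported = new J48();
            reported.buildClassifier(train);
            System.out.println(reported);
        }
    }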
Reporting the Results
• Once you have settled on the algorithm(s) with the best
cross-validation results, you should evaluate the resulting
model(s) on both the training and test data.
• To see how well the reported model does on the training data,
select Using training set in the Test box of the Classify tab
and rerun the algorithm.
• To see how well the reported model does on the test data,
select Supplied test set in the Test box of the Classify tab.
• click the Set button to specify the file
• rerun the algorithm
• Include appropriate metrics for each portion of your data:
• classification learning: accuracy, confusion matrix
• numeric estimation: correlation coefficient
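• A minimal sketch of this final evaluation via the Java API, with J48 standing in for whichever model was chosen; the metrics printed here are the ones for classification learning:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class ReportResults {
        public static void report(Instances train, Instances test) throws Exception {
            J48 model = new J48();
            model.buildClassifier(train);

            // "Using training set" in the Explorer's Test options
            Evaluation trainEval = new Evaluation(train);
            trainEval.evaluateModel(model, train);
            System.out.println("training accuracy: " + trainEval.pctCorrect() + "%");
            System.out.println(trainEval.toMatrixString());

            // "Supplied test set" in the Explorer's Test options
            Evaluation testEval = new Evaluation(train);
            testEval.evaluateModel(model, test);
            System.out.println("test accuracy: " + testEval.pctCorrect() + "%");
            System.out.println(testEval.toMatrixString());
        }
    }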
Discussing the Results
• Your report should include more than just the numeric results.
• You should include an intelligent discussion of the results.
• compare training vs. test results
• how well do the models appear to generalize?
• which attributes are included in the models?
• for classification learning, what do the confusion matrices
tell you about the types of examples that the models
get right or get wrong?
• for numeric estimation, which attributes have positive
coefficients and which have negative?
• note: the magnitude of the coefficients may not be significant
• are the models intuitive? why or why not?
• Don't make overly confident claims!
Summary of Experiments
• Summary of experiments:
Summary of Experiments (cont.)