Homework – Decision trees.
(1) Run the program Breast_Cancer.sas from our demos, being sure to insert your library name in the place indicated so that Enterprise Miner will find the data. Put it in the same library you use for demos,
AAEM for example. Alternatively, you can copy the data from our class web page DATA link if you prefer. This is real data from a study in Wisconsin attempting to distinguish benign from malignant cancer using biopsy data. Note that we have not at this point set up a training and validation data set.
Please answer these short answer questions with well thought out complete sentences. You could even combine a, b, and c into a paragraph then d and e into another.
(a) If we use the final model here to sort the data from most to least likely to be malignant, what statistic in the logistic output would address the success of this ranking operation? How did we do based on this statistic, in your opinion?
(b) Our logistic model gives us probabilities of malignancy. Making decisions just on which
(benign or malignant) the model says is more likely, how did we do on proportion misclassified?
You could use the CTABLE option here of course. Explain the role of the prior probability (pprob option) in this calculation.
(c) How many variables were retained in the final model?
(d) Suppose, in some logistic regression, you have an odds ratio of 2 associated with variable a X and when X=10 your probability of an event is 0.7. What is the probability of an event at X=11?
Think (nothing written) about whether the odds ratio is easy for a nonmathematically oriented person to understand
(e) Suppose the logit I get for the probability of heart disease is L= -8.0 + 0.01(cholesterol) +
0.02(systolic blood pressure) + 0.03(age). Imagine you have to explain in words to a client (who is not a mathematician or analyst) how to compute their probability of developing heart disease.
Finish this expression “First you measure your cholesterol then …….. which is your probability p.
Think (nothing written) about whether this is easy for a nonmathematically oriented person to understand.
(2) Perhaps a decision tree will be easier to understand while still working well. Run a decision tree on the data created by the program Breast_Cancer.sas. In the data, Target is a binary variable with 0 for benign and 1 for malignant samples. Be sure to declare Target to be binary – otherwise EM will consider it to be an interval variable even though it has just 2 values (for example, if you ask people how often they have been ticketed for parking at a fire hydrant, you might well get only 0 and 1 responses even though it could be any nonnegative integer). Use 75% of the data for training and the other 25% for pruning. Other than this, do whatever you think is right but tell us what you did in your report in part
(3). Mention how many observations are in each of these data sets and how many of these the tree correctly classified for each. Mention which variables are involved in the tree. Hint: If you view the
Classification Chart then right click and select data options, you can add count as a tip.
(3) The most important part – Report what you did (data split etc.) and what you found in a readable report. Include (of course) whatever tables & graphs you think are helpful but don’t overdo it. Most readers will not want to see every piece of output EM produces. Think about an executive level report to the National Cancer Institute for example. Be sure to include a summary with your evaluation of whether this decision tree model is a good predictor or not and a comparison to the logistic approach.
Three short answer questions to also address in your report:
(a) What changes, if any, did you make in the properties panel? Why?
(b) What would you do differently if you were computing estimates of Pr{malignant} rather than making a decision, benign versus malignant?
(c) Show the variable importance table (view->model->Variable Importance) and comment on why someone would be interested in it.