Digits_bagging

Digits data analysis
First run the SAS program draw_digits.sas to view the data. What are the roles of the 16 X variables X1
through X16?
(1) Import the Pensall data from our data sets. A preliminary analysis shows that X11 and X16 are the
most important variables so set all others to rejected (this is just so we can make some illustrative
graphs). Set DIGIT to be the target and make sure it is a level=nominal variable. Create a new diagram
and drag and drop the data into it.
(2) To limit the comparison to digits 1, 7, 9 (and to practice with the SAS code node) go to the utilities
subtab and select the SAS code node. Connect it to the data node.
(A) Open the code editor from the properties panel.
(B) Go to the “macro variables” subtab of the code editor window and find the macro variable
names for the data being imported and the TRAIN data being exported.
(C) Type in this code:
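    /* Keep only digits 1, 7 and 9 in the exported training data */
    data &EM_EXPORT_TRAIN;
        set &EM_IMPORT_DATA;
        if DIGIT in ("1", "7", "9");   /* subsetting IF; assumes DIGIT is stored as character */
    run;
    proc print data=&EM_EXPORT_TRAIN; run;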
(D) Click the running man icon (“run node”) within the code editor to run the node, then view the results.
(3) Use a data partition node to split the data into 50% training and 50% validation.
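To make the idea of a 50/50 partition concrete, here is a minimal Python sketch of a simple random split. It is only an illustration outside Enterprise Miner: the function name `split_half`, the seed, and the toy `rows` list are hypothetical, and the data partition node may sample differently (for example, within levels of the target).

```python
import random

def split_half(rows, seed=12345):
    """Randomly split rows into two halves: (training, validation)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]             # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = len(shuffled) // 2
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))             # stand-in for 10 data records
train, valid = split_half(rows)
print(len(train), len(valid))
```

Every record lands in exactly one of the two parts, so the validation half plays no role in fitting the models.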
(4) Import a decision tree node and connect it to the data partition node. Run the node with no
changes and look at the leaf by leaf decisions and their error rates.
(5) Similarly, drag in a neural network node, attach it to the data partition node and change the model
selection method to “misclassification.” Under “optimization” in the properties panel enter 100 for the maximum number of iterations and set
“Preliminary Training” (near the bottom) to “No.” Run the node. Did the estimation converge? How
many iterations did it take?
(6) In nearest neighbor analysis, the training data cases are saved and organized in an easily
searchable structure called an RD tree. For each validation or score data point, you find its nearest
16 (k in general) neighbors among the training cases. Assuming a binary response (0 or 1), you then assign j/16 (or
j/k) as the probability of getting a 1, where j is the number of those neighbors that have the 1
response. In SAS, the model subtab has an icon for “memory based reasoning,” which is SAS’s version of
nearest neighbor analysis. Attach that to the data partition node and run it.
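The j/k rule described in step (6) can be sketched in a few lines of Python. This is an illustration only, not what the memory based reasoning node runs: it searches by brute force rather than with an RD tree, and the function name `knn_probability` and the toy training cases are hypothetical.

```python
import math

def knn_probability(train, point, k):
    """Return j/k, the share of the k nearest training cases whose response is 1.

    train is a list of ((x11, x16), y) pairs with y equal to 0 or 1.
    """
    # Brute-force search: sort every training case by Euclidean distance to point.
    by_distance = sorted(train, key=lambda case: math.dist(case[0], point))
    neighbors = by_distance[:k]
    j = sum(y for _, y in neighbors)   # number of neighbors with response 1
    return j / k

# Hypothetical toy training cases: ((X11, X16), response)
train = [((10, 10), 0), ((12, 9), 0), ((11, 14), 0),
         ((80, 85), 1), ((82, 90), 1), ((78, 88), 1), ((50, 50), 1)]

print(knn_probability(train, (75, 80), k=5))
```

With a nominal target such as DIGIT, the same count can be made separately for each level, giving one estimated probability per digit.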
(7) Go to the utilities subtab and drag in a start group and an end group node. Go to the model subtab
and drag in another decision tree node. Insert the decision tree node between the start and end nodes.
Connect the start node to the data partition node.
(A) Inspect the start node properties panel. Under general, change the mode from stratify to
bagging. It will create 10 new data sets each of size 10% of the original data size by sampling
from the original data.
(B) The voting or averaging over the 10 trees created from the 10 samples helps the tree to
generalize to future data just as pruning did. Because of this, Breiman suggested that
pruning is not so helpful in this case, so in the tree node’s properties panel, change the
subtree method to “largest.”
(C) Run the diagram from the “end groups” node. Notice how it cycles 10 times through this
segment of your diagram. View the results and compare to the single tree method. This
will take a few minutes.
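The resample-and-vote idea in step (7) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the start group and end group implementation: each model is a one-split tree (a stump) on a single input, each bootstrap sample is drawn with replacement at the same size as the data, and the names `fit_stump`, `bagged_predict`, and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split tree on (x, label) pairs: choose the cut on x with the fewest errors."""
    best = None
    for cut in sorted({x for x, _ in sample}):
        left = [lab for x, lab in sample if x <= cut]    # never empty: cut is an observed x
        right = [lab for x, lab in sample if x > cut]
        left_lab = Counter(left).most_common(1)[0][0]
        right_lab = Counter(right).most_common(1)[0][0] if right else left_lab
        errors = sum(lab != left_lab for lab in left) + sum(lab != right_lab for lab in right)
        if best is None or errors < best[0]:
            best = (errors, cut, left_lab, right_lab)
    _, cut, left_lab, right_lab = best
    return lambda x: left_lab if x <= cut else right_lab

def bagged_predict(data, x, n_models=10, seed=1):
    """Fit n_models stumps on bootstrap samples and return the majority-vote label for x."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]        # draw with replacement
        votes.append(fit_stump(sample)(x))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: small x values are digit "1", large x values are digit "7"
data = [(x, "1") for x in range(5, 45, 5)] + [(x, "7") for x in range(60, 100, 5)]
print(bagged_predict(data, 10), bagged_predict(data, 90))
```

Each stump is fit without any pruning step; the stabilizing comes from the vote across resampled models, which is the reasoning behind choosing the “largest” subtree in the tree node.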
(8) Go to the “assess” subtab and drag in a model comparison node. Connect the tree, memory based
reasoning, neural net, and end groups nodes to it then run. Check the results.
(9) Because we have 2 X variables only, we’ll make a grid of X11 and X16 values then plot the predicted
digits on that grid using colors. Bring in another SAS code node and find (macro variables subtab) the
name of the exported SCORE data set. Type in the following code and run the node:
    /* Build a 21 x 21 grid of (X11, X16) points to score */
    data &EM_EXPORT_SCORE;
        do X16 = 0 to 100 by 5;
            do X11 = 0 to 100 by 5;
                output;
            end;
        end;
    run;
Now you have a data set with no target but with all inputs and you are ready to score it.
(10) From the assess subtab, drag in a score node, connect it to the model comparison node and the most recent
SAS code node, then run. Now explore the exported data (from the score node’s properties panel).
Make a graph of X11 by X16, choosing DIGIT as your group variable. Interpret the graph.
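As a rough picture of what scoring a grid produces, the Python sketch below labels every grid point and prints a character map in place of the colored plot. It is an assumption-laden illustration: a 1-nearest-neighbor rule stands in for whichever model the score node applies, and the three training cases and the name `nearest_label` are hypothetical.

```python
import math

def nearest_label(train, point):
    """Label a point with the digit of its single nearest training case."""
    return min(train, key=lambda case: math.dist(case[0], point))[1]

# Hypothetical labeled training cases: ((X11, X16), DIGIT)
train = [((20, 80), "1"), ((80, 80), "7"), ((50, 20), "9")]

# Same grid as the SAS data step: X11 and X16 each run from 0 to 100 by 5
grid = [(x11, x16) for x16 in range(0, 101, 5) for x11 in range(0, 101, 5)]
scored = {point: nearest_label(train, point) for point in grid}

# Text version of the X11-by-X16 plot: one character per grid point,
# with X16 decreasing down the page so the picture is oriented like a graph
for x16 in range(100, -1, -5):
    print("".join(scored[(x11, x16)] for x11 in range(0, 101, 5)))
```

The blocks of identical characters are the decision regions; in the real graph, the shape of the boundaries between colors reflects how the selected model partitions the X11 by X16 plane.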