Boston Housing Data




Regression Tree
CART: Classification and Regression Trees
Target is CONTINUOUS
Split based on F statistic P-value
[Diagram: House Value split on NOx into a Low branch and a High branch]
N = n1 + n2 observations.
Branch 1: Y_1, Y_2, Y_3, ..., Y_n1 with error sum of squares SSE(1)
Branch 2: Y_(n1+1), Y_(n1+2), Y_(n1+3), ..., Y_N with error sum of squares SSE(2)
F numerator = [SS(total) - SSE(1) - SSE(2)] / 1 df
F denominator = MSE = [SSE(1) + SSE(2)] / (N - 2) df
p-value = Pr > F
Kass adjusted p-value = (# possible splits) x (p-value)
Logworth of split = -log10[(# possible splits) x (p-value)]
Keep on splitting as usual.
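As a rough illustration of the calculation above, the following SAS sketch tests one candidate split on NOx outside of Enterprise Miner. It assumes the AAEM library is already assigned and that AAEM.BOSTON contains MEDV and NOX. The cut point 0.6 is arbitrary, and counting the possible splits as (number of distinct NOx values - 1) is a simplifying assumption, so the result need not match the logworth that EM reports.

```sas
/* Flag each observation as low (0) or high (1) NOx at a hypothetical cut point */
data split;
  set aaem.boston;
  high_nox = (nox > 0.6);
run;

/* One-way ANOVA on the two branches: the model F test has 1 and N-2 df */
ods output OverallANOVA=anova;
proc glm data=split;
  class high_nox;
  model medv = high_nox;
run; quit;

/* Assumed number of possible binary splits on NOx: distinct values minus 1 */
proc sql noprint;
  select count(distinct nox) - 1 into :nsplits trimmed
  from aaem.boston;
quit;

/* Kass adjustment and logworth, following the formulas above */
data logworth;
  set anova;
  where Source = 'Model';
  adj_p    = &nsplits * ProbF;
  logworth = -log10(adj_p);
run;

proc print data=logworth;
  var FValue ProbF adj_p logworth;
run;
```

Repeating this for several cut points and keeping the one with the largest logworth mimics, in a simplified way, how the best split on one input is chosen.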
(1) Add the BOSTON data source from our AAEM library.
(2) Use median house value as the target, NOx (environment) and RM (avg. # rooms in
houses) as inputs. Reject everything else. Explore (at least) the variables RM and NOx
to get their range. What happens if you click on a histogram bar?
(3) Create a new diagram. Optionally, split the data into training and validation sets.
(4) Drag in a tree node, connect, run, and view results.
(A) Click “Exported data” in the properties panel.
(B) Click “train” to view training data results
(C) Actions->plot->3 D plot (color=NODE, X=RM, Y=NOX, Z=MEDV)
(5) (optional) Make a grid of input values: put the code below in a SAS Code node and run
it. (Where did that funky name &em_export_score come from? See
Code Editor -> Macro Variables (subtab at top) -> Exports -> EM_EXPORT_SCORE.)
data &em_export_score;
  /* grid of NOx and RM values to be scored by the tree */
  do nox = 0.35 to 0.9 by 0.025;
    do rm = 3.5 to 9 by 0.25;
      output;
    end;
  end;
run;
proc print data=&em_export_score; run;
(6) From the ASSESS subtab, drag in a score node and connect the tree and code nodes to
it. Update and run. From the properties menu, select Exported data… then select the
SCORE data set and click on Explore at the bottom. Use the graphing icon to make a 3-
D plot of P_MEDV (Y) versus RM and NOx. Use _LEAF_ as a color variable. What
kind of predictions do you see?
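The ranges asked for in step (2) can also be read off in code rather than from the Explore window. A minimal sketch, assuming the AAEM library is assigned and AAEM.BOSTON contains the variables RM and NOX:

```sas
/* Minimum and maximum of the two inputs, for choosing the grid limits in step (5) */
proc means data=aaem.boston n min max;
  var rm nox;
run;
```

The grid limits in step (5) should roughly cover these ranges so that the scored surface spans the observed data.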
Boston Housing II
Here I describe how you can export the scoring code and use it in SAS (outside EM) to
score another data set that has the inputs but most likely does not have the target variable.
This means that anyone with SAS can score a data set with your code. Note that the
code is created within EM, so a person without EM cannot build a tree; they can only
score a data set using your tree.
(1) Click on the tree node. Select results.
(2) From the top menu bar select view-> scoring -> SAS code. The created code opens
in a window.
(3) Activate (click on top banner) the window containing the code. From the menu select
Edit->select all then Edit->copy.
(4) Get into SAS. You could be in VCL or you could launch SAS from your desktop.
Go to the program editor and paste the copied code into it.
(5) Before the included code, type this:
data score;
  do rm = 3.5 to 9 by 0.25;
    do nox = 0.35 to 0.90 by 0.025;
(6) After the included code, type this:
      output;
    end;
  end;
run;
proc print data=score; run;
proc sort data=score; by _NODE_; run;
proc means data=score;
  var P_MEDV RM NOx;
  by _NODE_;
run;
(7) Make a 3-D plot using this code in SAS:
PROC G3D DATA=score;
PLOT RM*NOX=P_MEDV;
RUN; QUIT;
You can try the plot options rotate=15 and tilt=30 (placed after a slash on the PLOT
statement) for different views.
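Steps (5) through (7) can be collected into a single program. The sketch below assumes the scoring code copied from EM was saved to a file instead of being pasted; the path C:\temp\tree_score.sas is hypothetical. It also relies on the EM score code being a DATA step fragment that sets P_MEDV and _NODE_ from RM and NOX, as the steps above imply.

```sas
/* Score a grid of RM and NOx values with the exported tree code */
data score;
  do rm = 3.5 to 9 by 0.25;
    do nox = 0.35 to 0.90 by 0.025;
      %include "C:\temp\tree_score.sas";  /* hypothetical path to the saved EM score code */
      output;
    end;
  end;
run;

/* Predicted surface; plot options follow a slash on the PLOT statement */
proc g3d data=score;
  plot rm*nox=p_medv / rotate=15 tilt=30;
run; quit;
```

Changing the ROTATE= and TILT= values turns the surface so that the flat steps produced by the tree are easier to see.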