Homework 5: Neural net for loan defaults (due

Some short-answer questions – treat questions 1 and 2 like a “regular” homework; you can just give the
answers. For maximum learning, I suggest each of you try this on your own, then get together,
discuss what you did, and turn in your group’s responses.
(1) Suppose I fit a neural network with 2 hidden units and three inputs: X1 = debt to income ratio,
X2 = age, and X3 = years of job training or experience. Y is a binary variable, 1 for default on a car loan
and 0 otherwise. My neural net used the standard hyperbolic tangent activation functions in going from the
inputs to the hidden units. The hyperbolic tangent of an argument L, where L might be a linear combination of
some inputs, is (exp(2L)-1) / (exp(2L)+1), which is the same as (exp(L)-exp(-L)) / (exp(L)+exp(-L)) (though
I’m not asking you to, you should be able to see why). Here are the bias and weights (for X1, X2, X3 in
order) linking the inputs to the two hidden units:
Unit 1: bias = 50, weights = 1.2 (X1), -0.5 (X2), -0.8 (X3)
Unit 2: bias = 12, weights = -0.2 (X1), 2.0 (X2), -0.5 (X3)
Here are the bias and weights (for Units 1 and 2 in order) linking the hidden units to the logit of the
response variable (the logit is just a linear combination of the hidden unit outputs):
bias = -1, weights = -0.6 (Unit 1), 1.1 (Unit 2)
What are the logit and probability (of default) for someone with a debt to income ratio X1=2, age X2=30
and having X3=5 years of job training? Assume a logistic function linking the hidden units to the
response. Logit = ______ Probability of default = ________
(2) Suppose X=ln(2). What is the hyperbolic tangent of X? You should be able to do this easily by hand.
What range of values ___ to ____ can the hyperbolic tangent function take on?
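As a worked example with a different value (not the answer to question 2): if X = ln(3), then exp(2X) = 9, so tanh(X) = (exp(2X)-1) / (exp(2X)+1) = (9-1) / (9+1) = 0.8.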
(3) This part involves a nice report. I have generated some data on three features, age, debt to income ratio
(DI_ratio), and amount of training or job experience (training), that might predict the probability of
people going into arrears by a certain critical amount on their loan. Because the data are just
generated, it is possible to see the true probability p of going into arrears in addition to the data that
we will use for modelling. If you run the program for this homework you will see how the data were
generated. My goal is to see whether, by looking at their features, I can predict the probability of
applicants going into arrears and thus avoid lending to people who will go into arrears in the future.
I. First, I’ll step you through a neural network model in Enterprise Miner so we’ll all be working from the
same information.
(a) First run this program and view the 3-D graphs. Adjust the libname statement to put the
data into your library:
LIBNAME aaem "c:\workshop\winsas\aaem"; ** change to yours **;
data arrears;
  do subject = 1 to 5000;
    DI       = round(5*ranuni(123), .01);
    DI_ratio = DI;   /* analysis variable name used in the rest of the homework */
    training = round(8*ranuni(123), .25);
    age      = round(3*training + 30 + 5*normal(123));
    p = 0.18;
    radius = DI**2 + ((age - 40)/10)**2;
    if radius < 1 then p = 0.03;
    if DI > 5.5 - ((age - 38)/10)**2 then p = 0.95;
    if radius > 1 and DI > 4 then
      p = p - 0.18*((training/4 - 1)**2)*(training < 4);
    if age < 20 and DI_ratio > 4 then p = 0.8*p + 0.2;
    p = sqrt(p);
    target = (ranuni(123) < p);
    keep subject DI_ratio training age target p;
    output;
  end;
run;
proc g3d; scatter DI_ratio*age=p / noneedle; run;
proc g3d; scatter training*DI_ratio=p / noneedle; run;
proc g3d; scatter DI_ratio*age=target / noneedle; run;
proc g3d; scatter training*DI_ratio=target / noneedle; run;
data aaem.arrears; set arrears; run;
proc means data=aaem.arrears; run;
ods listing gpath = "%sysfunc(pathname(work))";
proc sgplot data=aaem.arrears;
  scatter x=age y=DI_ratio / group=p;
run;
(b) In Enterprise Miner pull in the raw data after defining the target variable as a binary target
and subject as an ID. Divide the data into half training and half validation.
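For reference only (the homework itself uses the Data Partition node), a 50/50 split could also be produced outside Enterprise Miner, for example with PROC SURVEYSELECT; the output data set name and the seed below are arbitrary:
proc surveyselect data=aaem.arrears out=parts samprate=0.5 outall seed=4321;
run;
** the Selected flag (1/0) in the output marks the two halves **;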
(c) Connect the data partition node to a neural network node. Change the Model Selection
Criterion to Average Error as suggested in our Veterans demo. Run that node (check to see if it
converged).
(d) For comparison purposes, connect a regression node and a decision tree node to the data
partition. In the tree node, make it a class probability tree by making the subtree assessment measure
Average Square Error. We have only 3 features (predictors) so we’ll not bother with model selection in
the regression. Run these as well.
(e) Connect your three models (neural net, regression, tree) to a Model Comparison node.
(f) For now, we’ll leave the decision matrix as it stands though you may want to play around
with different profits after you finish up the homework.
(g) Run the diagram from the Model Comparison node.
II. Here are the points expected to be addressed in your report:
(a) Compare the average age, debt to income ratio (DI_ratio) and amount of training (training)
for those in arrears (target=1) to those not in arrears in the full data. Also mention the overall rate of
going into arrears in the data. This can be done outside of Enterprise Miner.
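One way to do this in base SAS, assuming the aaem.arrears data set created by the program above, is sketched here:
proc means data=aaem.arrears mean;
   class target;                /* compares those in arrears (target=1) to those not (target=0) */
   var age DI_ratio training;
run;
proc freq data=aaem.arrears;
   tables target;               /* overall rate of going into arrears */
run;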
(b) What is the most important feature (predictor variable) based on the tree result? Is the
answer the same for training and validation data? In each part of the data partition, what are the
relative importances of the other two features?
(c) By what number are my odds of being in arrears multiplied if my debt to income ratio
increases by 1, according to the regression model? Explain why it would be hard (or impossible) to get a
comparable odds ratio from a neural network or a tree model.
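If you want to double-check the odds ratio outside Enterprise Miner, a rough sketch is an ordinary logistic regression on the same three inputs (it will not match the Regression node exactly, because it uses all of the data rather than just the training half):
proc logistic data=aaem.arrears descending;
   model target = DI_ratio age training;   /* the odds ratio for DI_ratio is exp(its coefficient) */
run;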
(d) Find the lift for each model that is listed in the Model Comparison results Fit Statistics table
and list the three values for the validation data. Lift is a function with “depth” on its horizontal axis; it is not
just a single number. By looking at the Model Comparison node’s properties panel or otherwise, explain at
what depth these lift numbers were computed. Also explain to the reader how lift is computed.
(e) (related to (d)) My boss says we’re going to cut down on our rate of making loans by 5% next
year. I could just refuse 5% of the applicants at random or I could look at their features (age, DI_ratio,
training) and use my neural net model to select the 5% to refuse. In terms of the probability of an
applicant going into arrears, how much better off will I be, if at all, using the model? Please state this
carefully.
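If it helps you think about (d) and (e), here is a sketch that uses the true probability p on the generated data as a stand-in for the model’s predicted probability (in Enterprise Miner you would rank by the neural net’s score instead); the data set names ranked and top5 are mine:
proc sort data=aaem.arrears out=ranked;
   by descending p;
run;
data top5;
   set ranked nobs=n;
   if _n_ <= ceil(0.05*n);      /* keep the 5% of subjects with the highest probability */
run;
proc means data=top5 mean;
   var target;                  /* arrears rate among the refused 5% */
run;
proc means data=aaem.arrears mean;
   var target;                  /* overall arrears rate, for comparison */
run;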
(f) Which model is chosen as the winner and what default criterion was used to select it? What
is the area under the ROC curve for the validation data for that model? The book suggests a strong
model has area exceeding 0.7. How did we do? What Gini coefficients would indicate a strong model?
(g) Do you think we were in pretty good shape without oversampling for rare events here?
Why?
(h) Using the output window in the Model Comparison results, which model minimizes the
number of false positives in the validation data? If we make a false positive decision, are we deciding
that a person is not in arrears when actually they are, or are we deciding a person is in arrears when
they actually are not?
(i) Based on the tree result, would you suggest a possible simplification in your neural net?
Be sure to write up a nice report incorporating the points above.