Homework 2
Datamining III.
The file AAEM/PENS47 contains data from a handwriting recognition study. I want to find a logistic
regression model to predict whether the number being written is a 4 or 7 (in this data set that is all there
is). I will predict based on various X values which are coordinates of the pen at various moments in the
process of writing the digit.
The participants in this study were asked to write a given digit (we use the 4 and 7 subset here) on a
pressure-sensitive pad and, as they did, the pen coordinates X were recorded. This is real data and can be
found online at the UCI (U. Calif. Irvine) data repository.
1. How many 7s and 4s are contained in the data? What are the mean, max, and min of each X in the
2. What is the correlation between X2 and X5 overall, and within each response category (4 and 7)?
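A possible starting point for questions 1 and 2 is sketched below. It is only a sketch under assumptions: the library path is a placeholder, the response variable is called digit here purely for illustration (check the actual name in the file), and X1-X6 stands in for whichever X columns you want summarized.

```sas
/* Hypothetical library path; point this at wherever AAEM lives. */
libname aaem "path-to-AAEM-library";

/* Count of 4s and 7s. "digit" is an assumed name for the response. */
proc freq data=aaem.pens47;
  tables digit;
run;

/* Mean, max, and min of each X (X1-X6 assumed here). */
proc means data=aaem.pens47 mean max min;
  var X1-X6;
run;

/* Overall correlation between X2 and X5. */
proc corr data=aaem.pens47;
  var X2 X5;
run;

/* Correlation within each response category; BY needs sorted data. */
proc sort data=aaem.pens47 out=pens47_sorted;
  by digit;
run;

proc corr data=pens47_sorted;
  by digit;
  var X2 X5;
run;
```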
3. Run a logistic regression to predict the probability that the digit being written is 4. Use X1 through X6
as potential predictors and don’t bother with a holdout sample or model building. Summarize the results
briefly including what is significant and what is not. Use Enterprise Miner to do this in order to practice.
I am not asking you to output EM results nor do I care if you also run PROC LOGISTIC. I am simply
trusting you to (I hope individually) have this experience with EM. Be sure to show the model and what is
significant but be concise here.
4. Turning now to PROC LOGISTIC, run a stepwise logistic regression to pare down the number of
inputs and thus arrive at a final model. Write up a report as you would if this were an interview task given
to you by, say, a credit card company interested in handwritten digits at a checkout counter. It is OK to
try this in EM as well, but I’ll check your numbers against PROC LOGISTIC so be sure to run that for
your writeup.
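One way the stepwise run might be set up is sketched here. The response name digit and the 0.05 entry and stay levels are assumptions for illustration, not values given in this assignment, and the libname aaem is assumed to be already assigned.

```sas
/* Stepwise logistic regression modeling the probability that the digit is 4. */
/* "digit" is an assumed response name; slentry/slstay values are illustrative. */
proc logistic data=aaem.pens47;
  model digit(event='4') = X1-X6
        / selection=stepwise slentry=0.05 slstay=0.05;
run;
```

The event='4' option makes the fitted probability refer to the digit 4, which matches the wording of question 3.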
5. Suppose I have X1 = 46, X2 = 100, X3 = 20, X4 = 73, and X5 = 0. According to your final model, what
value of X6, along with these, would then make digits 4 and 7 equally likely?
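The condition behind question 5 can be written out as follows. The symbols b_0 through b_6 are placeholders for whatever estimates your final model gives (take b_i = 0 for any input that stepwise dropped). This is a sketch and assumes X6 stays in the final model with b_6 not equal to zero.

```latex
% Equal likelihood means p = 1/2, so the logit is zero:
%   L = \ln\frac{p}{1-p} = \ln 1 = 0 .
\[
  b_0 + b_1(46) + b_2(100) + b_3(20) + b_4(73) + b_5(0) + b_6 X_6 = 0
\]
\[
  X_6 = -\,\frac{b_0 + 46\,b_1 + 100\,b_2 + 20\,b_3 + 73\,b_4}{b_6},
  \qquad b_6 \neq 0 .
\]
```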
6. To illustrate a point made earlier in our module, use the relationship between the probability p and the
logit L. Let's say L = -7 + 1.6(X1) + 1.2(X2) is the logit in terms of X1 and X2. Notice that for any probability p we
have a specific L = ln(p/(1-p)). Write the formula for solving for X2 as a function of L and X1 (you’ll see
why below).
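For reference, the p-to-L relationship used in question 6 runs in both directions. The numerical values below are plain arithmetic on the formula, not results from the data.

```latex
\[
  L = \ln\frac{p}{1-p}
  \quad\Longleftrightarrow\quad
  p = \frac{e^{L}}{1+e^{L}} = \frac{1}{1+e^{-L}}
\]
% A few values: equal steps in p do not give equal steps in L.
\[
  p = 0.5 \;\Rightarrow\; L = 0, \qquad
  p = 0.9 \;\Rightarrow\; L = \ln 9 \approx 2.197, \qquad
  p = 0.99 \;\Rightarrow\; L = \ln 99 \approx 4.595 .
\]
```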
7. Run this program:
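data picture;
  do p = .01 to .99 by .02;
    L = log(p/(1-p));               /* logit for this probability */
    do X1 = -10 to 5 by .1;
      X2 = (L + 7 - 1.6*X1)/1.2;    /* X2 on the contour for this p */
      if (0<X1<5) and (0<X2<5) then output;
    end;
  end;
run;
proc gplot data=picture;
  plot X1*X2=p;
  symbol1 v=none i=join r=10;
run; quit;
proc g3d data=picture;
  scatter X1*X2=p / shape="balloon" noneedle;
run; quit;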
Are the contours of equal probabilities p given by straight lines or curves? If one of the pictures shows
this, include it. If I plot X1*X2=L instead of p, are the contours of equal L equally spaced? How about
when I use contours of p? (this was the question asked in class – note that the p’s I used are equally
spaced). I hope you see why I asked question 6 now.