Homework 2 Datamining III.

The file AAEM/PENS47 contains data from a handwriting recognition study. I want to find a logistic regression model to predict whether the number being written is a 4 or a 7 (these are the only two digits in this data set). I will predict based on various X values, which are the coordinates of the pen at various moments in the process of writing the digit. The participants in this study were asked to write a given digit (we use the 4 and 7 subset here) on a pressure-sensitive pad, and as they did, the pen coordinates X were recorded. This is real data and can be found online at the UCI (University of California, Irvine) data repository.

1. How many 7s and 4s are contained in the data? What are the mean, max, and min of each X in the data?

2. What is the correlation between X2 and X5 overall, and within each response category (4 and 7)?

3. Run a logistic regression to predict the probability that the digit being written is 4. Use X1 through X6 as potential predictors and don’t bother with a holdout sample or model building. Use Enterprise Miner to do this in order to practice. I am not asking you to output EM results, nor do I care if you also run PROC LOGISTIC. I am simply trusting you to (I hope individually) have this experience with EM. Summarize the results briefly: show the model and state what is significant and what is not, but be concise.

4. Turning now to PROC LOGISTIC, run a stepwise logistic regression to pare down the number of inputs and thus arrive at a final model. Write up a report as you would if this were an interview task given to you by, say, a credit card company interested in handwritten digits at a checkout counter. It is OK to try this in EM as well, but I’ll check your numbers against PROC LOGISTIC, so be sure to run that for your writeup.

5. Suppose I have 46 100 20 73 0 for X1 through X5. According to your final model, what value of X6, along with these, would make digits 4 and 7 equally likely?

6. To illustrate a point made earlier in our module, use the relationship between a probability p and its logit L, where, let’s say, L = -7 + 1.6(X1) + 1.2(X2) is the logit as a function of X1 and X2. Notice that any probability p corresponds to a specific L = ln(p/(1-p)). Write the formula for X2 as a function of L and X1 (you’ll see why below).

7. Run this program:

    /* One curve per probability p: convert p to its logit L,  */
    /* then trace the (X1, X2) points that give that logit.    */
    data picture;
      do p = .01 to .99 by .02;
        L = log(p/(1-p));
        do X1 = -10 to 5 by .1;
          X2 = (L + 7 - 1.6*X1)/1.2;
          /* keep only the points inside the 5-by-5 plotting window */
          if (0 < X1 < 5) and (0 < X2 < 5) then output;
        end;
      end;
    run;

    proc gplot;
      plot X1*X2=p;
      symbol1 v=none i=join r=10;
    run;

    proc g3d;
      scatter X1*X2=p / shape="balloon" noneedle;
    run;

Are the contours of equal probability p straight lines or curves? If one of the pictures shows this, include it. If I plot X1*X2=L instead of p, are the contours of equal L equally spaced? What about the contours of p? (This was the question asked in class; note that the p values I used are equally spaced.) I hope you now see why I asked question 6.
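For reference, the assignment X2 = (L + 7 - 1.6*X1)/1.2 in the program of question 7 can be traced back to the logit in question 6 by ordinary algebra. The steps below are a sketch of that rearrangement only; the subscripted symbols X_1 and X_2 are simply typeset versions of X1 and X2, and the inverse relation for p is the standard one for the logit rather than something stated in the assignment.

```latex
\begin{align*}
L &= \ln\frac{p}{1-p}, \qquad p = \frac{1}{1+e^{-L}},\\
L &= -7 + 1.6\,X_1 + 1.2\,X_2,\\
1.2\,X_2 &= L + 7 - 1.6\,X_1,\\
X_2 &= \frac{L + 7 - 1.6\,X_1}{1.2}.
\end{align*}
```

The last line is exactly the expression the DATA step evaluates for each pair (p, X1), with L computed from p in the first line.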