St 590, Dickey, Spring 2015          Name ____________________

1. (18 pts.) An ROC curve from a 3 leaf tree model consists of 3 line segments. The target (response) is binary. The line segments go from point (X,Y) = (0,0) to (0.2,0.4), from there to (0.5,0.8), then to (1,1). From this, compute if possible
A. The sensitivity ______ and specificity _______ for the decision rule that gave the (X,Y) point (0.2,0.4).
B. The proportion of concordant pairs _________, discordant pairs _______ and ties _______ for the model that gave this ROC curve.
C. The area ________ under the ROC curve.

2. (24 pts.) I want to use k means clustering to divide these 5 points {8, 12, 16, 24, 28} on a line into k=2 clusters. The seeds I use to start are 0 and 22.
(a) List the clusters produced at the first cluster step: Cluster 1 {          } Cluster 2 {          }
(b) List the new cluster centroids: ______, _________
(c) List the two clusters at the second cluster step: Cluster 1 {          } Cluster 2 {          }
(d) List the two new cluster centroids resulting from step 2: ______, _________

3. (24 pts.) A study looks at several predictors of the disease emphysema (Y=1 for diseased, 0 otherwise) among which is smoking versus non-smoking. For a tree model, here is the Chi-square table used to decide whether to split on smoking.

                     Smoker      Non-smoker
                  *-----------*-------------*
   Emphysema Yes  |    350    |             |    500
                  |-----------+-------------|
             No   |           |             |   1500
                  *-----------*-------------*
                       600       (_______)     (_______)

A. Fill in the missing numbers inside the table, the second column total, and the grand total.
B. What is the contribution _______ to the Chi-square calculated statistic coming from the upper left cell of the table?
C. By what number _____ do I multiply the p-value from this table if I want to do a Bonferroni correction?
D.
If the only other variables in my model are gender and whether or not the person has a college degree, what are the advantages of doing the Bonferroni correction versus using the unadjusted p-values to decide which variable should be used for splitting?

4. (10 pts.) The probability of recovering at dose 25 of a drug is 0.50. The odds ratio for dose is shown as 2 on my computer output when I do a logistic regression predicting the probability of recovering using a logit that is a linear function of dose, L = a + b(dose). According to my model, if I increase the dose from 25 to 27, what will be my estimated probability ____ of recovering? Is the coefficient b on dose positive or negative? How do you know?

5. (6 pts.) I have a Fisher linear discriminant function -1 + .6*X1 - .2*X2 for comparing a point (X1,X2) to bivariate normal population 1 and another, -2 + .7*X1 + b*X2, for comparing to bivariate normal population 2.
(a) For what range of parameter values b ________ would the point (3,2) be classified as coming from population 2 if equal priors had been assumed in computing the discriminants?
(b) How, if at all, would your answer change if you knew that these same discriminants had been computed using priors 0.8 and 0.2 for populations 1 and 2?

6. (18 pts.) Suppose 20% of students major in a STEM (Science, Technology, Engineering, Math) discipline, and that 80% of these STEM students have jobs when they graduate.
(a) For the rule STEM=>JOB, find if possible from the information above, the support ____ lift ______ and confidence ______
(b) If I add the information that 60% of all students who graduate do not have jobs upon graduation, then for the rule STEM=>JOB, find if possible from the total information so far, the support ____ lift ______ and confidence ______

1. At the point (0.2, 0.4), specificity = 1 - 0.2 = 0.8 and sensitivity = 0.4. Draw the picture: 2 rectangles, 3 triangles.
Lower rectangle: height = 0.4, distance to right = 1 - 0.2 = 0.8, rectangle area = (0.4)(0.8) = 0.32.
Upper rectangle: height = 0.4, distance to right = 0.5, rectangle area = (0.4)(0.5) = 0.20.
Proportion of concordant pairs is 0.32 + 0.20 = 0.52.
Leftmost triangle has area ½(0.2)(0.4) = ½(0.08). Middle triangle has area ½(0.3)(0.4) = ½(0.12). Top triangle has area ½(0.5)(0.2) = ½(0.10).
Sum of triangle areas = ½(0.08 + 0.12 + 0.10) = ½(0.30) = ½(proportion of ties), so the proportion of ties is 0.30.
Proportion discordant = 1 - 0.52 - 0.30 = 0.18.
Area under the curve = 0.52 + ½(0.30) = 0.67.

2. Seeds 0 and 22, with 11 in the middle. 8 < 11 goes in cluster 1; the rest are > 11, so they go in cluster 2: {8} and {12, 16, 24, 28}. New centroids are 8 and (12+16+24+28)/4 = 20, so (20+8)/2 = 14 is the separator and {8, 12}, {16, 24, 28} are the clusters, with new centroids 10 and 68/3 = 22.67. (Note that the next step uses 32.67/2 = 16.33 as the separator, so we get {8, 12, 16} and {24, 28} with centroids 12 and 26, which leaves the clusters the same on the following step and provides the final solution.)

3.
      350     150  |   500
      250    1250  |  1500
      600    1400  |  2000

Expected count for the upper left cell is 500(600)/2000 = 150, so its contribution is (350-150)(350-150)/150 = 40000/150 = 266.7. The Bonferroni multiplier is 1 (there is only 1 way to divide smoking status). Since gender and college degree also get an adjustment of 1, there is no advantage to doing Bonferroni; in fact the Bonferroni adjustment has no effect at all when every predictor has just 2 levels.

4. p = 0.5 => odds = 1. With an odds ratio of 2 per unit of dose, dose = 26 => odds = 2 and dose = 27 => odds = 4. Odds = 4 = p/(1-p) => 5p = 4 => p = 0.8. The coefficient b is positive: the odds ratio exp(b) = 2 is greater than 1, so b = ln(2) > 0.

5. Population 1: -1 + .6(3) - .2(2) = -1 + 1.8 - .4 = 0.4. Population 2: -2 + .7(3) + b(2) = 0.1 + 2b. The point is classified into population 2 when 0.1 + 2b > 0.4, that is, when b > 0.15. With priors 0.8 and 0.2 there would be no change. The effect of priors would have been captured in the intercept terms, so it would have already been accounted for.

6. The confidence is Pr{JOB | STEM}, which is given as 0.80. Since 20% of these people are STEM students and 80% of these STEM students have jobs, 16% of all students are STEM students with jobs (0.8*0.2 = 0.16). By definition, the support is 0.16.
To get the lift we need to take confidence divided by expected confidence. The expected confidence is the proportion of all students who have jobs, so without this we cannot compute lift. If, however, we are told that only 40% of students overall have jobs upon graduation (60% don't, so 40% do), then we CAN compute lift = confidence divided by expected confidence = 0.80/0.40 = 2.0. The support and confidence are unchanged at 0.16 and 0.80.

5. A neural network using hyperbolic tangent functions has inputs X1, X2, X3, and X4. It has 2 hidden units H1 and H2 in one hidden layer. When I put in X1=5, X2=8, X3=20, and X4=15, the values of H1 and H2 become 0.8 and -0.2 respectively. The response variable is binary and its logit is 0.6 + 0.4(Htan(H1)) + 0.8(Htan(H2)), where Htan( ) is the hyperbolic tangent function.
(A) Compute the logit associated with the 4 X inputs described here.
(B) Is it more likely that Y will be 1 than it is that Y will be 0? How do you know?

6. Suppose we know that 15% of kids have peanut butter for lunch and that 50% of those having peanut butter have jelly as well. Overall we know that 10% of kids have jelly with their lunch. For the association rule Peanut butter => jelly, what are the support _____ confidence _____ and lift _____ ? What is the lift _____ for the rule jelly => peanut butter?

7. Suppose, for some x, that e^x = 0.5. How, if at all, is that x related to the natural logarithm of 0.5? What is the hyperbolic tangent function of that x? Htan(x) = _____.

8. I have sales of ice cream monthly over several years and my regression model for sales is
Predicted Sales = 100 + 5X1 + 2X2 + 1X3 + 6X4 + 10X5 + 13X6 + 12X7 + 9X8 + 3X9 + 2X10 + 3X11
where X1 is a dummy (indicator) variable for January, X2 an indicator variable for February, etc., through X11, which is a dummy variable for November. What is my prediction for
(a) December sales ___
(b) August sales ______
(c) The difference August sales minus December sales ____
(d) The difference August sales minus January sales ____
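
The arithmetic in the worked answers to the ROC, k means, and STEM=>JOB questions can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names (roc_pair_proportions, kmeans_1d_step, rule_metrics) are made up for this check and are not part of the course software, and the ROC function assumes each line segment of the curve corresponds to one leaf of the tree.

```python
def roc_pair_proportions(points):
    """Concordant, discordant, tied pair proportions and AUC.

    points: ROC vertices (x, y) running from (0, 0) to (1, 1).
    Assumption: each segment is one leaf; its rise dy is the leaf's
    share of the 1s and its run dx is the leaf's share of the 0s.
    """
    concordant = 0.0
    ties = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        concordant += dy * (1.0 - x1)  # 1s here paired with 0s in later leaves
        ties += dy * dx                # 1s and 0s in the same leaf
    discordant = 1.0 - concordant - ties
    auc = concordant + ties / 2.0
    return concordant, discordant, ties, auc


def kmeans_1d_step(data, centers):
    """One assignment-and-update step of k means for points on a line."""
    clusters = [[] for _ in centers]
    for x in data:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(x)
    # Assumes no cluster ends up empty (true for the exam data).
    new_centers = [sum(c) / len(c) for c in clusters]
    return clusters, new_centers


def rule_metrics(p_a, p_b_given_a, p_b=None):
    """Support, confidence, lift for the rule A => B (lift is None without P(B))."""
    support = p_a * p_b_given_a
    confidence = p_b_given_a
    lift = None if p_b is None else confidence / p_b
    return support, confidence, lift


if __name__ == "__main__":
    # Question 1: the three-segment ROC curve.
    print([round(v, 2) for v in
           roc_pair_proportions([(0, 0), (0.2, 0.4), (0.5, 0.8), (1, 1)])])

    # Question 2: k means from seeds 0 and 22, three steps.
    centers = [0, 22]
    for _ in range(3):
        clusters, centers = kmeans_1d_step([8, 12, 16, 24, 28], centers)
        print(clusters, [round(c, 2) for c in centers])

    # Question 6: STEM => JOB, without and then with P(JOB) = 0.40.
    print(rule_metrics(0.20, 0.80))
    print(rule_metrics(0.20, 0.80, 0.40))
```

Up to rounding, the printed values should agree with the answer key: 0.52, 0.18, 0.30 and 0.67 for the ROC curve, the cluster sequence {8}, {8, 12}, {8, 12, 16} for cluster 1, and a lift of 2.0 once the overall job rate is known.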