Z_Final_2015.docx

St 590, Dickey, Spring 2015
Name ____________________
1. (18 pts.) An ROC curve from a 3 leaf tree model consists of 3 line segments. The target (response) is
binary. The line segments go from point (X,Y)= (0,0) to (0.2,0.4), from there to (0.5,0.8) then to (1,1).
From this, compute if possible
A. The sensitivity ______ and specificity _______ for the decision rule that gave the (X,Y) point (0.2,0.4).
B. The proportion of concordant pairs _________, discordant pairs _______ and ties _______ for the
model that gave this ROC curve.
C. The area ________ under the ROC curve.
2. (24 pts.) I want to use k-means clustering to divide these 5 points {8, 12, 16, 24, 28} on a line
into k=2 clusters. The seeds I use to start are 0 and 22.
(a) List the clusters produced at the first cluster step
Cluster 1 {
} Cluster 2 {
}
(b) List the new cluster centroids: ______, _________
(c) List the two clusters at the second cluster step:
Cluster 1 {
} Cluster 2 {
}
(d) List the two new cluster centroids resulting from step 2 ______, _________
3. (24 pts.) A study looks at several predictors of the disease emphysema (Y=1 for diseased, 0 otherwise)
among which is smoking versus non-smoking. For a tree model, here is the Chi-square table used to
decide whether to split on smoking.
                      Smoker    Non-smoker
                   *----------*------------*
Emphysema   Yes    |   350    |            |   500
                   |----------+------------|
            No     |          |            |  1500
                   *----------*------------*
                       600      (_______)    (_______)
A. Fill in the missing numbers inside the table, the second column total, and the grand total.
B. What is the contribution _______ to the Chi-square calculated statistic coming from the upper left
cell of the table?
C. By what number _____ do I multiply the p-value from this table if I want to do a Bonferroni
correction?
D. If the only other variables in my model are gender and whether or not the person has a college
degree, what are the advantages of doing the Bonferroni correction versus using the unadjusted
p-values to decide which variable should be used for splitting?
4. (10 pts.) The probability of recovering at dose 25 of a drug is 0.50. The odds ratio for dose is shown
as 2 on my computer output when I do a logistic regression predicting the probability of recovering
using a logit that is a linear function of dose, L=a+b(dose). According to my model, if I increase the dose
from 25 to 27, what will be my estimated probability ____of recovering? Is the coefficient b on dose
positive or negative? How do you know?
5. (6 pts.) I have a Fisher linear discriminant function -1 + .6*X1 -.2*X2 for comparing a point (X1,X2) to
bivariate normal population 1 and another, -2 + .7*X1 + b*X2 for comparing to bivariate normal
population 2.
(a) For what range of parameter values b ________ would the point (3,2) be classified as coming from
population 2 if equal priors had been assumed in computing the discriminants?
(b) How, if at all, would your answer change if you knew that these same discriminants had been
computed using priors 0.8 and 0.2 for populations 1 and 2?
6. (18 pts.) Suppose 20% of students major in a STEM (Science, Technology, Engineering, Math)
discipline, and that 80% of these STEM students have jobs when they graduate.
(a) For the rule STEM=>JOB, find if possible from the information above,
the support ____ lift ______ and confidence ______
(b) If I add the information that 60% of all students who graduate do not have jobs upon
graduation, then for the rule STEM=>JOB, find if possible from the total information so far,
the support ____ lift ______ and confidence ______
1. At the point (0.2, 0.4): specificity = 1 - 0.2 = 0.8, sensitivity = 0.4.
Draw the picture: 2 rectangles, 3 triangles.
Lower rectangle: height = 0.4, distance to right = 1 - 0.2 = 0.8, rectangle area = (0.4)(0.8) = 0.32
Upper rectangle: height = 0.4, distance to right = 0.5, rectangle area = (0.4)(0.5) = 0.20
Proportion of concordant pairs is 0.32 + 0.20 = 0.52
Leftmost triangle has area ½(.2)(.4) = ½(0.08).
Middle triangle has area ½(.3)(.4) = ½(0.12)
Top triangle has area ½(.5)(.2)
Sum of triangle areas = ½(0.08+0.12+0.10)=1/2(.3)=1/2(proportion of ties)
Proportion of ties is 0.30
Proportion discordant = 1 - 0.52 - 0.30 = 0.18.
Area Under the Curve = .52 + ½(.30)=0.67
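The pair proportions above can be checked numerically. What follows is a minimal Python sketch, not part of the original key. The function name `roc_pair_proportions` is hypothetical, and the sketch assumes the ROC vertices are listed in order from (0,0) to (1,1), with each line segment corresponding to one leaf of the tree.

```python
def roc_pair_proportions(points):
    """Return (concordant, discordant, ties, auc) from ordered ROC vertices.

    Each segment is one leaf: dx is that leaf's share of the non-events,
    dy is its share of the events.
    """
    dx = [b[0] - a[0] for a, b in zip(points, points[1:])]
    dy = [b[1] - a[1] for a, b in zip(points, points[1:])]
    n = len(dx)
    # Event in an earlier (higher-scoring) leaf, non-event in a later leaf.
    concordant = sum(dy[i] * dx[j] for i in range(n) for j in range(i + 1, n))
    # Event and non-event fall in the same leaf.
    ties = sum(dy[i] * dx[i] for i in range(n))
    discordant = 1.0 - concordant - ties
    auc = concordant + 0.5 * ties
    return concordant, discordant, ties, auc


points = [(0.0, 0.0), (0.2, 0.4), (0.5, 0.8), (1.0, 1.0)]
concordant, discordant, ties, auc = roc_pair_proportions(points)
print(round(concordant, 2), round(discordant, 2), round(ties, 2), round(auc, 2))
```

The rectangles in the picture correspond to the concordant products and the triangles to half of each tie product, which is why the area equals concordant + ½(ties).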
2. Seeds 0 and 22, with 11 in the middle. 8 < 11 goes in cluster 1; the rest are > 11, so cluster 2.
New centroids are 8 and (12+16+24+28)/4 = 20, so (20+8)/2 = 14 is the separator and
{8, 12} {16, 24, 28} are the clusters, with centroids 10 and 68/3 = 22.67.
(Note that the next step uses (10+22.67)/2 = 32.67/2 = 16.33 as the separator, so we get {8, 12, 16}
and {24, 28} with centroids 12 and 26. The step after that, with separator (12+26)/2 = 19, leaves the
clusters the same, so this is the final solution.)
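The iteration above can be traced with a short script. This is only a sketch in Python, not part of the original key. The function name `kmeans_1d` is hypothetical, and the code assumes no cluster ever becomes empty, which holds for these five points and seeds.

```python
def kmeans_1d(points, seeds, max_steps=10):
    """Run k-means on a line, recording (clusters, centroids) at each step."""
    centroids = list(seeds)
    history = []
    for _ in range(max_steps):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the nearest current centroid.
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Assumes no cluster is empty (true for this data set).
        new_centroids = [sum(c) / len(c) for c in clusters]
        history.append((clusters, new_centroids))
        if new_centroids == centroids:
            break  # centroids did not move, so the clusters are final
        centroids = new_centroids
    return history


history = kmeans_1d([8, 12, 16, 24, 28], seeds=[0, 22])
for step, (clusters, centroids) in enumerate(history, start=1):
    print(step, clusters, centroids)
```

The first two entries of `history` correspond to parts (a) through (d); the later entries show the clusters settling at {8, 12, 16} and {24, 28}.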
3.
350 150 500
250 1250 1500
600 1400 2000
Expect 500(600)/2000 = 150, contribution is (350-150)(350-150)/150 = 40000/150 = 266.7. Bonferroni
multiplier is 1 (only 1 way to divide smoking status).
Since gender and college degree also get an adjustment of 1 there is no advantage to doing Bonferroni
and in fact there is no effect at all of the Bonferroni adjustment when every predictor is at just 2 levels.
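The expected count and the upper-left contribution can be reproduced as follows. This is a minimal Python sketch, not part of the original key. The function name `chi_square_cells` is hypothetical, and the observed counts are the completed table given in the answer.

```python
def chi_square_cells(table):
    """Return expected counts and per-cell chi-square contributions."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # Expected count under independence: (row total)(column total)/(grand total).
    expected = [[r * c / grand for c in col_totals] for r in row_totals]
    contrib = [
        [(obs - exp) ** 2 / exp for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]
    return expected, contrib


# Rows: emphysema yes / no; columns: smoker / non-smoker.
observed = [[350, 150], [250, 1250]]
expected, contrib = chi_square_cells(observed)
print(expected[0][0], round(contrib[0][0], 1))
```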
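4. p = 0.5 => odds = 1, so dose = 26 => odds = 2 and dose = 27 => odds = 4.
Odds = 4 = p/(1-p) => 5p = 4 => p = 0.8.
The coefficient b is positive: the odds ratio is e^b = 2 > 1, so b = ln(2) > 0.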
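The same arithmetic can be written as a small function. This is a sketch in Python, not part of the original key. The function name `prob_after_dose_change` is hypothetical, and it assumes the reported odds ratio of 2 applies per one-unit increase in dose.

```python
import math


def prob_after_dose_change(p_start, odds_ratio, dose_change):
    """Probability after shifting dose, given a per-unit odds ratio."""
    # Each unit of dose multiplies the odds by the odds ratio.
    odds = p_start / (1.0 - p_start) * odds_ratio ** dose_change
    return odds / (1.0 + odds)


p27 = prob_after_dose_change(0.5, 2.0, 27 - 25)
b = math.log(2.0)  # slope on the logit scale implied by an odds ratio of 2
print(p27, round(b, 3))
```

Because the odds ratio is the exponential of the slope, an odds ratio above 1 always corresponds to a positive slope.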
5. -1+.6(3)-.2(2) = -1+1.8 -.4 = .4
-2 +.7(3) + b(2) = 0.1+2b
0.1+2b > .4 if b>.15.
There would be no change. The effect of priors would have been captured in the intercept terms so it
would have already been accounted for.
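The threshold on b can be checked by evaluating both discriminants at (3,2). This is a minimal Python sketch, not part of the original key. The function names are hypothetical, and the rule assumed is to assign the point to the population with the larger discriminant score.

```python
def score_pop1(x1, x2):
    """Discriminant for population 1, as given in the question."""
    return -1 + 0.6 * x1 - 0.2 * x2


def score_pop2(x1, x2, b):
    """Discriminant for population 2, with unknown coefficient b on X2."""
    return -2 + 0.7 * x1 + b * x2


def classify(x1, x2, b):
    """Assign the point to the population with the larger score."""
    return 2 if score_pop2(x1, x2, b) > score_pop1(x1, x2) else 1


# One value of b on each side of the 0.15 threshold.
print(classify(3, 2, 0.10), classify(3, 2, 0.20))
```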
6. The confidence is Pr{JOB | STEM} which is given as 0.80. Since 20% of these people are STEM
students and 80% of these STEM students have jobs, that means that 16% of all students are STEM
students with jobs (0.8*0.2=0.16). By definition, the support is 0.16. To get the lift we need to take
confidence divided by expected confidence and this expected confidence is the proportion of all
students who have jobs so without this, we cannot compute lift.
If, however, we are told that only 40% of students overall have jobs upon graduation (60% don’t, so
40% do) then we CAN compute confidence divided by expected confidence = 0.80/0.40 = 2.0.
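The three rule measures can be computed together. This is a sketch in Python, not part of the original key. The function name `rule_metrics` is hypothetical, and lift is returned as `None` when the overall proportion for the right-hand side is not supplied, mirroring part (a).

```python
def rule_metrics(p_lhs, p_rhs_given_lhs, p_rhs=None):
    """Support, confidence, and lift for the rule LHS => RHS.

    Lift is None when the overall RHS proportion is unknown.
    """
    support = p_lhs * p_rhs_given_lhs   # Pr{LHS and RHS}
    confidence = p_rhs_given_lhs        # Pr{RHS | LHS}
    lift = confidence / p_rhs if p_rhs is not None else None
    return support, confidence, lift


part_a = rule_metrics(0.20, 0.80)              # only the STEM information
part_b = rule_metrics(0.20, 0.80, p_rhs=0.40)  # 40% of all students have jobs
print(part_a)
print(part_b)
```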
5. A neural network using hyperbolic tangent functions has inputs X1, X2, X3, and X4. It has 2 hidden
units H1 and H2 in one hidden layer. When I put in X1=5, X2=8, X3=20, and X4=15, the values of H1 and
H2 become 0.8 and -0.2 respectively. The response variable is binary and its logit is
0.6 + 0.4 Htan(H1) + 0.8 Htan(H2), where Htan( ) is the hyperbolic tangent function.
(A) Compute the logit associated with the 4 X inputs described here.
(B) Is it more likely that Y will be 1 than it is that Y will be 0? How do you know?
6. Suppose we know that 15% of kids have peanut butter for lunch and that 50% of those having peanut
butter have jelly as well. Overall we know that 10% of kids have jelly with their lunch. For the
association rule
Peanut butter => jelly
what are the
support _____ confidence _____ and lift _____ ?
What is the lift _____ for the rule jelly => peanut butter ?
7. Suppose, for some x, that e^x = 0.5. How, if at all, is that x related to the natural logarithm of 0.5?
What is the hyperbolic tangent function of that x? Htan(x) = _____.
8. I have sales of ice cream monthly over several years and my regression model for sales is
Predicted Sales = 100 + 5X1 +2X2 +1X3 + 6X4 + 10X5 + 13X6 + 12X7 + 9X8 +3X9 + 2X10 + 3X11
where X1 is a dummy (indicator) variable for January, X2 an indicator variable for February, etc., through
X11 which is a dummy variable for November. What is my prediction for
(a) December sales ___
(b) August sales ______
(c) The difference August sales minus December sales ____
(d) The difference August sales minus January sales ____