Data Mining Assessment #1 Name _________________________________
1. (10 pts.) Circle all the methods in this list, if any, that assume multivariate normality of the features (the X’s):
Decision Trees, Discriminant Analysis, Logistic Regression.
2. I have 2 predictor variables. One is gender with 2 possible values (M, F) and one is age in years, taking on 36 different values in my data. My target is binary and I am building a decision tree using the Chi-square criterion. When I split on gender, my Chi-square p-value is 0.0100, and when I try splits on age, the Chi-square p-value for the best age split is smaller, namely 0.0020.
(7 pts.) Which variable (age or gender) would I choose to split on if I just used these p-values?
(7 pts.) Which (age or gender) would I use if I do a Bonferroni correction?
(16 pts.) Justify your Bonferroni answer.
3. (20 pts.) A logistic regression of a binary response on a single predictor variable X gives a
probability p=0.2 at X=5.
The logit at X=5 is thus L=ln(________)
Fill in the blank.
When I increase my predictor variable X to 6, I get odds 0.8. Compute the odds ratio for variable
X.
4. I expose 50 male and 50 female volunteers to a mild irritant and record counts of those who do
and do not develop a rash, getting this table of counts.
           Females   Males
No rash       21       39
Rash          29       11
(20 pts.) Compute the Chi-square statistic for this table.
(10 pts.) If your Chi-square statistic exceeds 3.84, it is significant at the usual 0.05 level. Do you conclude that there is a difference in response between men and women? (yes, no)
5. (10 pts.) What condition would cause me to switch from Fisher’s linear discriminant analysis to
quadratic discriminant analysis?
=======================answers==============================
1. Only discriminant analysis makes assumptions about the distribution of the X’s.
2.
          p-value    adjusted p-value
Age       0.0020     35(0.0020) = 0.0700
Gender    0.0100      1(0.0100) = 0.0100
If no adjustment is made, split on age (smaller p-value). Gender has only 2 levels (1 possible split point), while age has 36 values (35 possible split points), so after the Bonferroni correction the p-values and our choice reverse: choose gender, since 0.01 < 0.07. No need to compute logworth, of course. Also note that without the Bonferroni-corrected number, justification has not been provided.
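The Bonferroni arithmetic above can be checked with a short sketch. The variable names (raw_p, n_splits, bonferroni) are illustrative assumptions, and the number of tests is taken to be the number of candidate binary split points, as in the answer.

```python
# Raw Chi-square p-values from the problem statement.
raw_p = {"age": 0.0020, "gender": 0.0100}

# Assumption: k distinct ordered values give k - 1 candidate split points.
n_splits = {"age": 36 - 1, "gender": 2 - 1}


def bonferroni(p, m):
    """Bonferroni-adjusted p-value for m tests, capped at 1."""
    return min(1.0, p * m)


adjusted = {name: bonferroni(raw_p[name], n_splits[name]) for name in raw_p}

# Variable with the smallest p-value before and after the correction.
best_raw = min(raw_p, key=raw_p.get)
best_adjusted = min(adjusted, key=adjusted.get)

print(adjusted)
print("unadjusted choice:", best_raw)
print("Bonferroni choice:", best_adjusted)
```

Run as written, the unadjusted choice is age and the Bonferroni choice is gender, matching the table.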
3. Odds at X=5: p/(1-p) = 0.2/0.8 = 1/4 = 0.25, so L = ln(0.25). When X goes up by 1, the odds get multiplied by 0.8/0.25 = 3.2, which then is the odds ratio (basic definition). No need to compute the logit slope and exponentiate.
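As a check on the odds arithmetic, here is a minimal sketch using only the numbers given in the problem. The variable names are illustrative. The last line shows the longer route through the logit slope, which the answer notes is unnecessary.

```python
import math

p_at_5 = 0.2                        # given probability at X = 5
odds_at_5 = p_at_5 / (1 - p_at_5)   # odds = p / (1 - p)
logit_at_5 = math.log(odds_at_5)    # L = ln(odds)

odds_at_6 = 0.8                     # given odds at X = 6

# Odds ratio for a one-unit increase in X: ratio of the two odds.
odds_ratio = odds_at_6 / odds_at_5

# Equivalent longer route: the logit slope is the log of the odds ratio.
slope = math.log(odds_at_6) - logit_at_5

print(odds_at_5, logit_at_5, odds_ratio, math.exp(slope))
```

Both routes give the same odds ratio, about 3.2, because exponentiating the difference in logits is the same as dividing the odds.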
4. Include expected numbers (in parentheses) and totals.
           Females (50 total)   Males (50 total)   Row total
No rash        21 (30)              39 (30)           (60)
Rash           29 (20)              11 (20)           (40)
Chi-square = sum of (observed - expected)^2 / expected = 81/30 + 81/30 + 81/20 + 81/20 = (162/10)(1/3 + 1/2) = 16.2(5/6) = 81/6 = 13.5
13.5 > 3.84, so reject the no-difference hypothesis, i.e., they react differently.
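The hand computation can be reproduced with a small sketch in plain Python. It computes the Pearson Chi-square statistic with no continuity correction, which is what the hand calculation does. The variable names are illustrative, and 3.84 is the critical value quoted in the question.

```python
# Observed counts: rows are (no rash, rash), columns are (females, males).
observed = [[21, 39],
            [29, 11]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson Chi-square: sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

critical_value = 3.84  # 0.05 level, 1 degree of freedom, from the question
print(expected)
print(chi_square, chi_square > critical_value)
```

Library routines that apply a continuity correction by default to 2x2 tables would report a somewhat smaller statistic than the hand calculation.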
5. Use quadratic (i.e., a quadratic term in X enters the discriminant function) when the variance matrices differ among the populations.
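To see why unequal variances bring in a quadratic term, here is a simplified sketch for a single feature and two normal populations with equal priors (both are assumptions made for illustration; the exam question concerns variance matrices in general). The hypothetical helper log_ratio_coefficients returns the coefficients of the log density ratio written as a*x^2 + b*x + c.

```python
import math


def log_ratio_coefficients(m1, v1, m2, v2):
    """Coefficients (a, b, c) such that
    a*x**2 + b*x + c = log N(x; m1, v1) - log N(x; m2, v2),
    where m is a mean and v is a variance (one feature, equal priors).
    """
    a = -0.5 / v1 + 0.5 / v2
    b = m1 / v1 - m2 / v2
    c = -0.5 * m1 ** 2 / v1 + 0.5 * m2 ** 2 / v2 - 0.5 * math.log(v1 / v2)
    return a, b, c


# Equal variances: the x^2 terms cancel and the rule is linear in x.
print(log_ratio_coefficients(0.0, 2.0, 3.0, 2.0))

# Unequal variances: the x^2 coefficient is nonzero, so the rule is quadratic.
print(log_ratio_coefficients(0.0, 1.0, 3.0, 4.0))
```

With several features the same cancellation happens to the x'Sx term when the covariance matrices are equal, and fails when they differ.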