2012 test 1 - NCSU Statistics


Data Mining Assessment #1 Name _________________________________

1. (10 pts.) Circle all the methods in this list, if any, that assume multivariate normality of the features (the X’s):

Decision Trees, Discriminant Analysis, Logistic Regression.

2. I have 2 predictor variables. One is gender with 2 possible values (M, F) and one is age in years, taking on 36 different values in my data. My target is binary and I am building a decision tree using the Chi-square criterion. When I split on gender, my Chi-square p-value is 0.0100, and when I try splits on age, the Chi-square p-value for the best age split is smaller, namely 0.0020.

(7 pts.) Which variable (age or gender) would I choose to split on if I just used these p-values?

(7 pts.) Which (age or gender) would I use if I do a Bonferroni correction?

(16 pts.) Justify your Bonferroni answer.

3. (20 pts.) A logistic regression of a binary response on a single predictor variable X gives a probability p=0.2 at X=5.

The logit at X=5 is thus L=ln(________)

Fill in the blank.

When I increase my predictor variable X to 6, I get odds 0.8. Compute the odds ratio for variable X.

4. I expose 50 male and 50 female volunteers to a mild irritant, and record counts of those who do and do not develop a rash, getting this table of counts.

            Females   Males
No rash        21       39
Rash           29       11

(20 pts.) Compute the Chi-square statistic for this table.

(10 pts.) If your Chi-Square statistic exceeds 3.84 it is significant at the usual 0.05 level. Do you conclude that there is a difference in response between men and women? ( yes, no )

5. (10 pts.) What condition would cause me to switch from Fisher’s linear discriminant analysis to quadratic discriminant analysis?

=======================answers==============================

1. Only discriminant analysis makes a distributional assumption (multivariate normality) about the X’s.

2.

          p-value   adjusted p-value
Age       0.0020    35(0.0020) = 0.0700
Gender    0.0100    1(0.0100) = 0.0100

If no adjustment is made, split on age (smaller p-value). Gender has only 2 levels (1 possible split point) while age has 35 possible split points, so after the Bonferroni correction the p-values and our choice reverse: choose gender, since 0.01 < 0.07. There is no need to compute logworth, of course. Also note that an answer without the Bonferroni-corrected numbers has not provided a justification.
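The adjustment in answer 2 can be checked with a few lines of Python. This is only a minimal sketch: the helper name `bonferroni` is made up for illustration, and it assumes the multiplier is the number of candidate binary split points (distinct values minus one), as in the table above.

```python
def bonferroni(p_value, n_splits):
    """Multiply the raw p-value by the number of candidate splits, capped at 1."""
    return min(1.0, p_value * n_splits)

# Age takes 36 distinct values, so there are 35 possible binary split points.
age_adjusted = bonferroni(0.0020, 36 - 1)

# Gender has 2 levels, so there is only 1 possible split.
gender_adjusted = bonferroni(0.0100, 2 - 1)

# Pick the variable with the smaller adjusted p-value.
best = "gender" if gender_adjusted < age_adjusted else "age"
print(round(age_adjusted, 4), round(gender_adjusted, 4), best)
```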

3. The blank is the odds p/(1-p) = 0.2/0.8 = 1/4 = 0.25. When X goes up by 1 the odds become 0.8, i.e. they are multiplied by 0.8/0.25 = 3.2, which is then the odds ratio (basic definition). There is no need to compute the logit slope and exponentiate.
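The arithmetic in answer 3 can be reproduced as follows. This is a small sketch with illustrative variable names; the slope is computed only to show that exponentiating it returns the same odds ratio, which the answer notes is not required.

```python
import math

def odds(p):
    """Convert a probability to odds p / (1 - p)."""
    return p / (1.0 - p)

odds_at_5 = odds(0.2)               # odds at X = 5
logit_at_5 = math.log(odds_at_5)    # L = ln(odds) at X = 5
odds_at_6 = 0.8                     # given in the question

# The odds ratio for a one-unit increase in X is the ratio of the two odds.
odds_ratio = odds_at_6 / odds_at_5

# Equivalent route: the logit slope is ln(odds ratio), and exp(slope) recovers it.
slope = math.log(odds_at_6) - logit_at_5

print(odds_at_5, round(odds_ratio, 4), round(math.exp(slope), 4))
```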

4. Expected counts and totals are shown in parentheses.

            Females (50 total)   Males (50 total)   Row total
No rash          21 (30)             39 (30)           (60)
Rash             29 (20)             11 (20)           (40)

Chi-square = sum of (observed - expected)^2 / expected = 81/30 + 81/30 + 81/20 + 81/20 = (162/10)(1/3 + 1/2) = 16.2(5/6) = 81/6 = 13.5

13.5 > 3.84, so reject the no-difference hypothesis, i.e. men and women react differently.
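The same calculation can be sketched in plain Python. The function name `chi_square` is illustrative; it computes the Pearson statistic with no continuity correction, with expected counts taken as row total times column total divided by the grand total.

```python
def chi_square(table):
    """Pearson Chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the no-difference hypothesis.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: no rash, rash. Columns: females, males.
observed_counts = [[21, 39], [29, 11]]
stat = chi_square(observed_counts)
significant = stat > 3.84  # 0.05 critical value quoted in the question
print(round(stat, 4), significant)
```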

5. Use quadratic (i.e. a quadratic term in X enters the discriminant function) when the covariance matrices differ among the populations.
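A one-dimensional sketch may help show why unequal variances make the rule quadratic. It assumes two normal populations and ignores prior probabilities; the function name and all numbers are hypothetical, chosen only for illustration. The log density ratio is written as a*x^2 + b*x + c, and the coefficient a vanishes exactly when the two variances are equal.

```python
import math

def log_ratio_coefficients(mean1, var1, mean2, var2):
    """Coefficients (a, b, c) of ln f1(x) - ln f2(x) = a*x^2 + b*x + c
    for two univariate normal densities (priors ignored)."""
    a = 0.5 * (1.0 / var2 - 1.0 / var1)
    b = mean1 / var1 - mean2 / var2
    c = (mean2 ** 2 / (2.0 * var2) - mean1 ** 2 / (2.0 * var1)
         - 0.5 * math.log(var1 / var2))
    return a, b, c

# Equal variances (hypothetical values): the x^2 term drops out, the rule is linear.
a_equal, b_equal, c_equal = log_ratio_coefficients(0.0, 4.0, 3.0, 4.0)

# Unequal variances (hypothetical values): the x^2 term survives, the rule is quadratic.
a_unequal, b_unequal, c_unequal = log_ratio_coefficients(0.0, 1.0, 3.0, 4.0)

print(a_equal, a_unequal)
```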
