Test1_2015.docx

Test 1, St 590, Dickey
Solutions on last page
1. (10 pts.) A leaf in a tree computed on training data had 200 1’s and 130 0’s. What is the contribution
of that leaf ______ to the misclassification count?
2. (10 pts.) If a contingency table has 3 rows and its Chi-square test statistic has 12 degrees of freedom,
how many columns _____ must the table have?
3. (10 pts.) Average squared error is a way of assessing how well we did with which of the three kinds of
predictions (estimates, decisions, rankings) mentioned in the book and in class?
4. (24 pts.) I ran a regression with monthly data. I fit a linear trend plus 11 monthly dummy variables
using the reference cell approach with December as the reference cell. The linear part of the model is
100 + 3t where t is the observation number. The January dummy variable coefficient is 10.
(A) Observation 10 occurs in December for these data. Compute its predicted value ______
(B) Find, if possible from the given information, the predicted values for observation 11 _______ and
observation 12 _______ . (put “NP” if not possible)
(C) In words that might interest a non-statistician, give a brief interpretation of that January coefficient
10.
5. (10 pts.) The following contingency table has counts of undergraduate students in 4 majors classified
by whether they finished in 4 years or not, this being my target.
                 Major
Finished?      A     B     C     D    Totals
Yes           12    22     9     7        50
No            13     4    15    18        50
(A) Compute the contribution _____________ to the Chi-square statistic coming from the 9 people in
major C who finished in 4 years.
(B) Suppose the p-value for the Chi-square statistic for this table is exactly p= 0.00010. Compute the
associated logworth ______ (no Bonferroni adjustment)
6. (36 pts.) The target in the decision tree below is whether a person purchases (Target=1) or does not
purchase (Target=0) a product. The first split is on gender (M or F) and the second on age in years as
shown.
Root node
  1: _____%   0: _____%   Count: _______

  M
    1: 90.0%   0: 10.0%   Count: 2000

  F
    1: ____%   0: 35.0%   Count: _____

    Age < 40.5
      1: 40.0%   0: 60.0%   Count: 3000

    Age > 40.5
      1: 80.0%   0: 20.0%   Count: 5000
(A) Estimate the probability of purchasing for a female age 35 _________ and a male age 35 ______.
(B) How many females ______ were in the data set and how many of them purchased _____ ?
(C) How many people _________ were in the root node?
(D) What is the percentage of 1’s _______ and 0’s _________ in the root node?
(E) Based on this tree, estimate the probability of purchasing for the 5% of people most likely to
purchase ________ and from that, the lift ___________ at the fifth percentile.
SOLUTIONS
(1) Since it’s the training data, the decision is 1, the more likely class, leading to 200 correct and 130
incorrect decisions. Thus 130 is the misclassification count coming from this leaf.
(2) 12 = (3-1)(c-1) so c=7.
(3) Estimates
(4) December: 100 + 3(10) + 0 = 130 (reference month)
January: 100 + 3(11) + 10 = 143 (January effect 10)
February: 100 + 3(12) + ? NP (the February coefficient is not given)
The 10 is the January effect minus the December effect. It is the difference between those means
after adjusting for the trend.
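The arithmetic in solution (4) can be checked with a short Python sketch. The names `month_effects` and `predict` are illustrative and not part of the test. Only the December (reference cell, effect 0) and January (effect 10) values are known from the problem, so any other month returns `None`, standing in for "NP".

```python
# Month effects relative to December, the reference cell.
# Only January's coefficient is given in the problem; other months are unknown.
month_effects = {"Dec": 0, "Jan": 10}

def predict(t, month):
    """Return 100 + 3t + month effect, or None when the effect is not given."""
    if month not in month_effects:
        return None  # "NP": not possible from the given information
    return 100 + 3 * t + month_effects[month]

print(predict(10, "Dec"))  # 130
print(predict(11, "Jan"))  # 143
print(predict(12, "Feb"))  # None
```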
(5) (A) Expected count = (9+15)(50)/100 = 12, so the contribution is (9-12)(9-12)/12 = 9/12 = 0.75.
(B) Logworth = 4 (p is 10 raised to the -4 power)
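As a check on solution (5), the following Python sketch recomputes the cell contribution and the logworth from the counts in the question 5 table. The function names `cell_contribution` and `logworth` are illustrative, and logworth is taken here as -log10(p), consistent with the answer of 4 for p = 0.0001.

```python
import math

# Observed counts from the table in question 5 (columns are majors A, B, C, D).
observed = {"Yes": [12, 22, 9, 7], "No": [13, 4, 15, 18]}

def cell_contribution(row, col):
    """(observed - expected)^2 / expected for one cell of the table."""
    row_total = sum(observed[row])
    col_total = sum(counts[col] for counts in observed.values())
    grand_total = sum(sum(counts) for counts in observed.values())
    expected = row_total * col_total / grand_total
    return (observed[row][col] - expected) ** 2 / expected

def logworth(p_value):
    """Logworth taken as -log10(p-value), with no Bonferroni adjustment."""
    return -math.log10(p_value)

print(cell_contribution("Yes", 2))    # major C, finished in 4 years: 0.75
print(round(logworth(0.0001), 6))     # 4.0
```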
(6) (A) Female .4, male .9 (regardless of age). The point here is to remember that a leaf cannot be split
any further. Its elements cannot be distinguished from each other. All males, regardless of age, have the
same predicted probability based on the tree model.
(B) 8000 females, (.65)(8000) = 5200 purchased. 35% did not.
(C), (D) 10000 people in root node, 5200 + (.9)(2000) = 7000 purchased, so 70% 1’s and 30% 0’s.
(E) The 2000 males are 20% of the data, so the most likely 5% all come from the male leaf: .9 probability.
Overall .7 probability, so lift = .9/.7 = 1.2857
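The numbers in solution (6) can be reproduced from the three leaves alone. The Python sketch below is one way to do that. The variable and function names are illustrative, and lift is computed as the predicted purchase rate in the top fraction divided by the overall rate, as in the solution.

```python
# Leaves of the tree: (label, count, proportion of 1's).
leaves = [
    ("M", 2000, 0.90),
    ("F, Age < 40.5", 3000, 0.40),
    ("F, Age > 40.5", 5000, 0.80),
]

total = sum(count for _, count, _ in leaves)        # people in the root node
buyers = sum(count * p for _, count, p in leaves)   # 1's in the root node
overall_rate = buyers / total

females = sum(count for label, count, _ in leaves if label.startswith("F"))
female_buyers = sum(count * p for label, count, p in leaves if label.startswith("F"))

def top_fraction_rate(fraction):
    """Average predicted probability among the top `fraction` of people,
    ranked by leaf probability (people within a leaf are interchangeable)."""
    remaining = fraction * total
    weighted = 0.0
    for _, count, p in sorted(leaves, key=lambda leaf: leaf[2], reverse=True):
        take = min(count, remaining)
        weighted += take * p
        remaining -= take
        if remaining <= 0:
            break
    return weighted / (fraction * total)

lift = top_fraction_rate(0.05) / overall_rate

print(females, round(female_buyers))   # 8000 5200
print(total, round(overall_rate, 4))   # 10000 0.7
print(round(lift, 4))                  # 1.2857
```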