Test 1: Data Mining, 2016, D.A.D.
1. (12 pts.) Here are the residuals from each of three different lines fit to a set of 4
(X,Y) points. If I use the same criterion used in regression and homework 1 to pick
the best line, which would I pick?
Line 1: 2, 3, -1, 4
Line 2: 4, 3, -3, 0
Line 3: 3, 3, -2, 2
Why would I pick that one?
2. (35 pts.) I modeled the quarterly retail sales for my company by a least squares
regression giving this predicting equation:
Sales = 10 + 2Q1 -3Q2 -2Q3 + 2t
Here Q1, Q2, and Q3 are dummy variables for quarters 1 to 3, and t=1,2,3,4,5,6 etc.
is the observation number. Observations t=1, 5, 9, etc. are quarter 1 values.
Compute, if possible, the predicted values for times t=8 ____ and t=9 ______.
Should I have added a quarter 4 dummy variable Q4 to the model? (yes, no)
Write down the first 8 values of variable Q3 ___, ___, ___, ___, ___, ___,___,___
My current quarter’s actual sales figure is 50 but its prediction was 57. What
quarterly sales does the model say I should expect ____ 4 quarters from now?
3. (10 pts.) I fit a decision tree for a binary target (purchased or did not) in a
dataset of 1000 customers of whom 300 purchased. The best leaf has 80
customers of whom 75 purchased. Compute, if possible from this information, the
lifts at depth 5 (top 5%) ____ and depth 10 ______. If either is not possible
explain why.
4. Here is a partial Chi-square table of counts for type of car purchase by gender
along with a few border totals, and expected counts in parentheses ( ).
            Sedan    Sports car   SUV        Coupe      Totals:
M           6 (12)   12 (9)       27 ( )     15 (15)    ___
F           14 (8)   ___ (___)    ___ (___)  10 (___)   ___
Totals:     20       ___          40         25         ___
(8 pt.) What is the contribution ______ to the table Chi-square statistic of the (M,
Sedan) cell of this table?
(20 pts.) Fill in as many of the 10 missing table entries as possible. It may help to
think about how the expected numbers are computed.
5. (15 pts.) One leaf of a tree for a binary target (Y) has 240 1s and 160 0s.
Compute the sum of squares Σ_{i=1}^{400} (Yi - Ȳ)² = _____ for these 400 Y values.
What is the misclassification rate _____ for this leaf (assuming no prior
probabilities)?
Which criterion, average squared error or misclassification, is more important
to minimize for estimate predictions?
***** Solutions with explanations ********
1. The sums of squared residuals are 30, 34, and 26, so pick line 3 to minimize
the residual sum of squares.
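The arithmetic can be checked with a short Python sketch (the residual lists are copied from the problem; the variable names are illustrative):

```python
# Sum of squared residuals for each candidate line; the least-squares
# criterion picks the line with the smallest sum.
residuals = {
    "Line 1": [2, 3, -1, 4],
    "Line 2": [4, 3, -3, 0],
    "Line 3": [3, 3, -2, 2],
}
ssr = {line: sum(r ** 2 for r in res) for line, res in residuals.items()}
best = min(ssr, key=ssr.get)
print(ssr)   # {'Line 1': 30, 'Line 2': 34, 'Line 3': 26}
print(best)  # Line 3
```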
2. 10 + 2(0) - 3(0) - 2(0) + 2(8) = 26 (t=8 implies quarter 4)
10 + 2(1) - 3(0) - 2(0) + 2(9) = 30
No Q4 is needed: 10 + 2t gives the quarter 4 results. By omitting Q4, the fourth
quarter automatically becomes the baseline. Notice that the question asks for
the values of variable Q3, the column of numbers by which -2 is multiplied in
the model. We want to subtract 2 in every third quarter, so Q3 = 0 0 1 0 0 0 1 0 etc.
In 4 quarters I have the same quarterly effect but have ascended 2(4) = 8 from
the current prediction 57. Thus 57 + 8 = 65 is what the model says I should expect.
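The same predictions can be sketched in Python (the function name `predict` is illustrative; the dummy coding follows the problem's rule that t = 1, 5, 9, ... are quarter 1):

```python
# Prediction equation Sales = 10 + 2*Q1 - 3*Q2 - 2*Q3 + 2*t.
def predict(t):
    q = (t - 1) % 4 + 1                         # quarter of observation t
    q1, q2, q3 = (q == 1), (q == 2), (q == 3)   # dummies (booleans act as 0/1)
    return 10 + 2 * q1 - 3 * q2 - 2 * q3 + 2 * t

print(predict(8))                # 26 (quarter 4)
print(predict(9))                # 30 (quarter 1)
print(predict(13) - predict(9))  # 8: same quarter, 4 time steps later
```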
3. 5% of 1000 is 50. The best leaf has over 50 observations, so the depth 5 lift
is computable: (75/80)/(300/1000) = 75000/24000 = 25/8 = 3.125. But 10% of 1000
is 100, more than the 80 customers in the best leaf, so the depth 10 lift would
require additional information about the second best leaf.
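A minimal sketch of the depth 5 computation (variable names are illustrative):

```python
# Depth-5 lift: purchase rate in the best leaf (which supplies all of the
# top 50 = 5% of customers) divided by the overall purchase rate.
leaf_rate = 75 / 80                       # 0.9375
overall_rate = 300 / 1000                 # 0.3
lift_depth5 = (75 * 1000) / (80 * 300)    # = 75000/24000 = 3.125
print(lift_depth5)
```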
4. Follow steps A-J in order. Use the row 1 total for A and D (the expected
counts in a row sum to the same row total as the observed counts), and column
totals for the rest.

            Sedan    Sports car   SUV                Coupe              Totals:
M           6 (12)   12 (9)       27 (D: 60-36=24)   15 (15)            A: 6+12+27+15 = 60
F           14 (8)   ___ (___)    C: 40-27=13        10 (B: 25-15=10)   ___
                                  (E: 40-24=16)
Totals:     20       ___          40                 25                 X?
Now, using the upper left expected number, (12) = 60*20/X, so X = grand total = 100
… or use (M, Coupe): (15) = 60*25/X, or (24) = 40*60/X.
            Sedan    Sports car     SUV       Coupe     Totals:
M           6 (12)   12 (9)         27 (24)   15 (15)   60
F           14 (8)   I: 15-12=3     13 (16)   10 (10)   G: 100-60=40
                     (J: 15-9=6)
Totals:     20       H: 100-85=15   40        25        F: 100
The contribution of the upper left (M, Sedan) cell is (6-12)²/12 = 36/12 = 3.
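The expected counts and the (M, Sedan) contribution can be verified from the completed margins (a sketch; the dictionary layout is illustrative):

```python
# Expected count = row total * column total / grand total, then the
# chi-square contribution is (observed - expected)^2 / expected.
rows = {"M": 60, "F": 40}
cols = {"Sedan": 20, "Sports car": 15, "SUV": 40, "Coupe": 25}
grand = 100

expected = {(r, c): rows[r] * cols[c] / grand for r in rows for c in cols}
contrib_m_sedan = (6 - expected[("M", "Sedan")]) ** 2 / expected[("M", "Sedan")]
print(expected[("M", "Sedan")], contrib_m_sedan)  # 12.0 3.0
```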
5. The average of 240 1s and 160 0s is 240/400 = .6, so we have
240(1-.6)² + 160(0-.6)² = 240(.16) + 160(.36) = 38.4 + 57.6 = 96.
You might also recall that Σ_{i=1}^{400}(Yi - Ȳ)² = Σ Yi² - (Σ Yi)²/400 =
240 - 240²/400 = 240(1-.6) = 96. Classifying everything in this leaf as 1, the
misclassification rate is .4 or 40%, and for estimates, we prune to minimize
average squared error.
You might also recall that for 3 categories we took (1-p)² for the observed
category and p² for each non-observed category (a different p each time) and
summed them for each observation; but then for average squared error we divided
by 3n, not n, where n was the number of observations, implying that each row
contributes 3 terms to the error sum of squares being averaged. We can apply
this to 2 categories as well, as long as we divide by 2: we would have 240
cases of (1-.6)² + .4² = 2(.4²) where we observed 1, and 160 cases of
.6² + (1-.4)² = 2(.6²) for the 160 times we observed 0. Dividing by 2, we once
again have 240(.16) + 160(.36) = 96. Clearly this is not the fastest way to do
this one.
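Both routes to 96, plus the misclassification rate, can be checked directly (a sketch with illustrative names):

```python
# Corrected sum of squares for a leaf with 240 ones and 160 zeros,
# computed directly and via the shortcut identity, plus the
# misclassification rate (minority-class proportion).
ys = [1] * 240 + [0] * 160
ybar = sum(ys) / len(ys)                                       # 0.6
ss_direct = sum((y - ybar) ** 2 for y in ys)                   # ≈ 96
ss_shortcut = sum(y * y for y in ys) - sum(ys) ** 2 / len(ys)  # 240 - 240**2/400 = 96.0
misclassification = min(sum(ys), len(ys) - sum(ys)) / len(ys)  # 160/400 = 0.4
print(round(ss_direct, 6), ss_shortcut, misclassification)
```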