Test 1: Data Mining, 2016, D.A.D.

1. (12 pts.) Here are the residuals from each of three different lines fit to a set of 4 (X,Y) points. If I use the same criterion used in regression and homework 1 to pick the best line, which would I pick?

Line 1: 2, 3, -1, 4
Line 2: 4, 3, -3, 0
Line 3: 3, 3, -2, 2

Why would I pick that one?

2. (35 pts.) I modeled the quarterly retail sales for my company by a least squares regression giving this predicting equation:

Sales = 10 + 2Q1 - 3Q2 - 2Q3 + 2t

Here Q1, Q2, and Q3 are dummy variables for quarters 1 to 3, and t = 1, 2, 3, 4, 5, 6, etc. is the observation number. Observations t = 1, 5, 9, etc. are quarter 1 values.

Compute, if possible, the predicted values for times t=8 ____ and t=9 ______.
Should I have added a quarter 4 dummy variable Q4 to the model? (yes, no)
Write down the first 8 values of variable Q3: ___, ___, ___, ___, ___, ___, ___, ___
My current quarter's actual sales figure is 50 but its prediction was 57. What quarterly sales does the model say I should expect ____ 4 quarters from now?

3. (10 pts.) I fit a decision tree for a binary target (purchased or did not) in a dataset of 1000 customers, of whom 300 purchased. The best leaf has 80 customers, of whom 75 purchased. Compute, if possible from this information, the lifts at depth 5 (top 5%) ____ and depth 10 ______. If either is not possible, explain why.

4. Here is a partial Chi-square table of counts for type of car purchase by gender, along with a few border totals and expected counts in parentheses ( ).

              M          F           Totals:
Sedan        6 (12)    14 (8)          20
Sports car  12 (9)    ___ (___)       ___
SUV         27 ( )    ___ (___)        40
Coupe       15 (15)    10 (___)        25
Totals:    ___        ___             ___

(8 pts.) What is the contribution ______ to the table Chi-square statistic of the (M, Sedan) cell of this table?
(20 pts.) Fill in as many of the 10 missing table entries as possible. It may help to think about how the expected numbers are computed.

5. (15 pts.) One leaf of a tree for a binary target (Y) has 240 1s and 160 0s.
Compute the sum of squares Σ_{i=1}^{400} (Y_i - Ȳ)² = _____ for these 400 Y values.
What is the misclassification rate _____ for this leaf (assuming no prior probabilities)?
Which criterion, average squared error or misclassification, is more important to minimize for estimate predictions?

***** Solutions with explanations *****

1. The sums of squared residuals are 30, 34, and 26, so pick line 3 to minimize the residual sum of squares.

2. t=8 falls in quarter 4, so Sales = 10 + 2(0) - 3(0) - 2(0) + 2(8) = 26.
t=9 falls in quarter 1, so Sales = 10 + 2(1) - 3(0) - 2(0) + 2(9) = 30.
No Q4 is needed: 10 + 2t gives the quarter 4 results. By omitting Q4, the fourth quarter automatically becomes the baseline.
Notice that the question asks for the values of variable Q3, the column of numbers by which -2 is multiplied in the model. We want to subtract 2 in the third quarter of every year, so Q3 = 0, 0, 1, 0, 0, 0, 1, 0, etc.
Four quarters from now I have the same quarterly effect, so the trend has added 2(4) = 8 to the current prediction of 57. Thus 57 + 8 = 65 is what the model says I should expect.

3. 5% of 1000 is 50. The best leaf has more than 50 observations, so the depth 5 lift is computable from it alone: (75/80)/(300/1000) = 75000/24000 = 25/8 = 3.125. But 10% of 1000 is 100 observations, more than the 80 in the best leaf, so the depth 10 lift would require additional information about the second best leaf.

4. Work through steps A-J in order. Use the M column totals for A and D, and the remaining row and column totals for the rest.

              M                       F                                Totals:
Sedan        6 (12)                  14 (8)                              20
Sports car  12 (9)                  ___ (___)                           ___
SUV         27 (D: 60-36 = 24)       C: 40-27 = 13 (E: 40-24 = 16)       40
Coupe       15 (15)                  10 (B: 25-15 = 10)                  25
Totals:     A: 6+12+27+15 = 60      ___                                _X?_

Now, using the upper left expected number, (12) = 60*20/X, so X = grand total = 100. Or use (M, Coupe): (15) = 60*25/X; or (M, SUV): (24) = 40*60/X.

              M          F                              Totals:
Sedan        6 (12)     14 (8)                            20
Sports car  12 (9)      I: 15-12 = 3 (J: 15-9 = 6)        H: 100-85 = 15
SUV         27 (24)     13 (16)                           40
Coupe       15 (15)     10 (10)                           25
Totals:     60          G: 100-60 = 40                    F: 100

The contribution of the upper left (M, Sedan) cell is (6 - 12)²/12 = 36/12 = 3.

5. The average of 240 1s and 160 0s is Ȳ = 240/400 = .6, so we have 240(1 - .6)² + 160(0 - .6)² = 96. You might also recall that Σ_{i=1}^{400} (Y_i - Ȳ)² = Σ_{i=1}^{400} Y_i² - (Σ_{i=1}^{400} Y_i)²/400 = 240 - 240²/400 = 240(1 - .6) = 96.
Classifying everything in this leaf as 1 (the majority), the misclassification rate is 160/400 = .4 or 40%, and for estimate predictions we prune to minimize average squared error.

You might also recall that for 3 categories we took (1 - p)² for the observed category and p² for each non-observed category (a different p each time) and summed them for each observation, but then for average squared error we divided by 3n, not n, where n was the number of observations, implying that each row contributes 3 terms to the error sum of squares being averaged. We can apply this to the 2 categories as well, as long as we divide by 2: we would have 240 cases of (1 - .6)² + .4² = 2(.4²) where we observed 1, and 160 cases of .6² + (1 - .4)² = 2(.6²) for the 160 times we observed 0. Dividing by 2, we once again have 240(.16) + 160(.36) = 96. Clearly this is not the fastest way to do this one.
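The arithmetic in solutions 1-4 can be checked mechanically. Here is a short Python sketch that recomputes each answer; all variable names are my own, and the numbers come straight from the problems above.

```python
# 1. Residual sum of squares for each candidate line; pick the smallest.
residuals = {1: [2, 3, -1, 4], 2: [4, 3, -3, 0], 3: [3, 3, -2, 2]}
sse = {line: sum(r ** 2 for r in res) for line, res in residuals.items()}
best_line = min(sse, key=sse.get)
print(sse)        # {1: 30, 2: 34, 3: 26}
print(best_line)  # 3

# 2. Sales = 10 + 2*Q1 - 3*Q2 - 2*Q3 + 2*t, where t = 1, 5, 9, ... fall in quarter 1.
def predict(t):
    quarter = (t - 1) % 4 + 1                # t=8 -> quarter 4, t=9 -> quarter 1
    q1, q2, q3 = (quarter == 1), (quarter == 2), (quarter == 3)
    return 10 + 2 * q1 - 3 * q2 - 2 * q3 + 2 * t

print(predict(8), predict(9))                # 26 30
# Same quarter 4 periods later: only the trend term changes, by 2*4 = 8,
# so the model expects 57 + 8 = 65 four quarters from now.
print(predict(12) - predict(8))              # 8

# 3. Lift at depth 5: the top 50 of 1000 all come from the best leaf.
lift5 = (75 / 80) / (300 / 1000)
print(lift5)                                 # 3.125

# 4. Expected count = row total * column total / grand total, and each cell
#    contributes (observed - expected)**2 / expected to the Chi-square statistic.
row_total_sedan, col_total_m, grand = 20, 60, 100
expected_sedan_m = row_total_sedan * col_total_m / grand
print(expected_sedan_m)                      # 12.0
print((6 - expected_sedan_m) ** 2 / expected_sedan_m)  # 3.0
```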
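Solution 5 can be checked the same way. This sketch (names are my own) recomputes the corrected sum of squares three equivalent ways, the misclassification rate, and the dummy-coded sum divided by 2:

```python
# A leaf holding 240 ones and 160 zeros.
ys = [1] * 240 + [0] * 160
n = len(ys)
ybar = sum(ys) / n                           # mean = 240/400 = 0.6

# Corrected sum of squares, three equivalent ways.
ss_direct = sum((y - ybar) ** 2 for y in ys)
ss_shortcut = sum(y * y for y in ys) - sum(ys) ** 2 / n   # 240 - 240**2/400
ss_binary = n * ybar * (1 - ybar)            # n*p*(1-p), valid for 0/1 data
print(round(ss_direct, 6), round(ss_shortcut, 6), round(ss_binary, 6))  # 96.0 96.0 96.0

# Classify the whole leaf as the majority class (1): the 160 zeros are missed.
misclass = 160 / n
print(misclass)                              # 0.4

# Dummy-coded view: each row contributes (1 - p)^2 for its observed category
# plus p^2 for the other category; dividing the total by 2 recovers 96.
total = 240 * ((1 - 0.6) ** 2 + 0.4 ** 2) + 160 * (0.6 ** 2 + (1 - 0.4) ** 2)
print(round(total / 2, 6))                   # 96.0
```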