ASSIGNMENT 26
Regression

1. For assignment 14, we used summary statistics for the length of MLB games in 2002 and 2003 to test the one-tailed alternative that mean length was smaller in 2003 (due to rule changes).

                 2002 game length   2003 game length
sample mean      172.1              165.9
s                12.2               13.7
n                61                 51

pooled variance       166.4991   =(60*12.2^2+50*13.7^2)/(60+50)
Numerator of t        6.2        =172.1 - 165.9
Denominator of t      2.4483     =166.5^.5*(1/61+1/51)^.5
t-stat                2.5324
one-tailed p-value    0.0064     =t.dist.rt(2.53,110)

The difference in average minutes per game is statistically significant. We reject H0.

As usual, Bo marches to the beat of a different drummer. Bo found the original data for the minutes of each of the 112 games. Bo created two columns of data: minutes and D2003, where D2003 was 1 for a game in 2003 and 0 for a game in 2002. Bo regressed minutes on D2003. Fill in each blank. (If it is not possible to know the exact answer, simply write “NP” for not possible without the data.) [25 points]

Bo’s estimated intercept will be __172.1, the sample mean for 2002__
Bo’s estimated coefficient of D2003 will be __-6.2, the difference in sample means__
The standard error of Bo’s coefficient of D2003 will be __2.45, the denominator of the t-stat__
The t-stat associated with Bo’s coefficient of D2003 will be __-2.5324__
The p-value associated with Bo’s coefficient of D2003 will be __0.0064 for Bo’s 1-tailed test, but 0.0128 (the two-tailed value) on the regression output__

2. (EMBS Problem 42, page 448. Assigned as one of Class 19 practice problems.) A study claimed that self-employed individuals do not experience greater job satisfaction than individuals who are not self-employed. Job satisfaction was measured using 18 questions with answers ranging from 1 to 5. The total score was the measure of job satisfaction. Scores for individuals in four separate professions are given below and in the spreadsheet “Class 26 Assignment data”.
Lawyer   Physical Therapist   Cabinetmaker   Systems Analyst
44       55                   54             44
42       78                   65             73
74       80                   79             71
42       86                   69             60
53       60                   79             64
50       59                   64             66
45       62                   59             41
48       52                   78             55
64       55                   84             76
38       50                   60             62

Are the differences in sample mean job satisfaction scores across the four professions statistically significant? PLEASE USE REGRESSION TO ANSWER THIS QUESTION. Show the relevant regression output to demonstrate that you used regression to do the required calculations. [25 points]

Stacking the data, creating 3 columns of dummy variables, and regressing the satisfaction scores on the three dummies produced the following output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.537136224
R Square             0.288515323
Adjusted R Square    0.229224933
Standard Error       11.52605744
Observations         40

ANOVA
             df   SS       MS            F             Significance F
Regression   3    1939.4   646.4666667   4.866139757   0.006080513
Residual     36   4782.6   132.85
Total        39   6722

            Coefficients   Standard Error   t Stat         P-value
Intercept   61.2           3.644859394      16.79077116    1.30888E-18
Dlawyer     -11.2          5.154609588      -2.172812472   0.036455868
DPT         2.5            5.154609588      0.485002784    0.63061271
Dcabinet    7.9            5.154609588      1.532608797    0.134114132

The null hypothesis is that mean satisfaction is equal for the four groups. This is the same as saying that satisfaction and profession are independent, which in turn is the same as saying that all three b’s in the above model are equal to zero. The alternative is that the four means are not all equal, the variables are not independent, and the three b’s are not all equal to zero. The p-value (Significance F) is 0.006. We reject H0 in favor of Ha.

3. (EMBS case problem 1, page 606) Consumer Research, Inc. investigated consumer characteristics that predict the amount charged by credit card users. Data were collected on annual income, household size, and annual credit card charges for a random sample of 50 consumers. The complete data are available in a spreadsheet “Class 26 Assignment data”.

Income ($1000s)   Household Size   Amount Charged ($)
54                3                4,016
30                2                3,159
32                4                5,100
.                 .                .
.                 .                .
22                4                3,074
46                5                4,820
66                4                5,149

3a. Which is the better predictor of Amount Charged: Income or Household Size? Why? (Assume you can use one or the other but not both.) [10 points]

The regression of Amount on Income has a standard error of 731.7. The regression of Amount on Household Size has a standard error of 620.8. Household size leaves the smaller standard error, so it is the better predictor. (Note that household size will also have the higher adjusted r-square and the lower p-value. When comparing simple models fit to the same data, these three statistics tell an identical story.) We can also answer this question using the multiple regression results from 3c. Household size has the lower p-value (higher magnitude t-value), and thus is the better predictor variable.

3b. Regress Amount Charged on Household Size. Report and interpret BRIEFLY the coefficient of household size. (Tell us what the coefficient means and measures. Tell us why it turned out positive/negative.) [10 points]

                 Coefficients   Standard Error   t Stat       P-value
Intercept        2581.941       195.26258        13.2229177   1.28E-17
Household Size   404.1284       50.99787         7.92441641   2.86E-10

The coefficient of 404.13 is the average rate at which amount charged changes with size of household. In the data, larger households charged more, on average, at a rate of $404 for each extra person.

3c. Run a multiple regression of Amount Charged on both Household Size and Income. Report and interpret BRIEFLY the coefficient of household size. (Tell us what the coefficient means and measures. Tell us why it turned out higher/lower than the coefficient from 3b.) [15 points]

                  Coefficients   Standard Error   t Stat       P-value
Intercept         1304.905       197.65484        6.60193679   3.29E-08
Income ($1000s)   33.13301       3.9679058        8.35025085   7.68E-11
Household Size    356.2959       33.20089         10.7315164   3.12E-14

The coefficient of 356.3 is the average rate at which amount charged changes with size of household for households with comparable incomes.
In order to use this multiple regression equation, one must know both income and household size. Once one “plugs in” income, the forecast goes up by $356.3 for each extra person. The fact that this rate is lower than that in 3b indicates that household size and income are positively correlated in the data. Amount charged goes up by $356.3 per person for any given income, and income also goes up with household size. Thus the $404 from 3b includes both the $356 from 3c AND the fact that larger households indicate larger incomes which, in turn, lead to even higher charge amounts.

3d. Will a household of 2 with annual income of $60,000 charge more or less than $4,500 on their credit card? (Assume the household in question will be a random sample of the population of consumers used in the study.) [15 points]

Since both income and size are significant predictors (they both have p-values much less than 0.05), I will use the 3c model to answer this question. The point forecast is

               Intercept   Income ($1000s)   Household Size
Coefficients   1304.905    33.13301          356.2959

intercept   Income   Size   Point forecast
1           60       2      4005.477

The uncertainty in this new charge amount is measured by 398.1, the reported “standard error” from the multiple regression. Using the “better” method, the probability that the charge amount is less than $4,500 is calculated using the t-distribution with 47 degrees of freedom (n-2 for a simple regression, n-p for a multiple regression, where p is the number of terms in the model, including the intercept). The t is (4,500 – 4,005.5)/398.1 = 1.24. The Pr(charge amount < $4,500) = 1 – t.dist.rt(1.24,47) = 0.89.
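
The 3d arithmetic can be checked outside the spreadsheet. The following is a minimal Python sketch, not the method used for the answer above: it takes the coefficients, the standard error of 398.1, n = 50, and p = 3 from the text, and it replaces Excel's t.dist.rt with a small numerical integration of the t density. The helper names t_pdf and t_cdf are illustrative only and do not come from any library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x): 0.5 plus Simpson's rule on [0, x]; steps must be even."""
    if x < 0:
        return 1 - t_cdf(-x, df, steps)  # the t density is symmetric about 0
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

# Values copied from the 3c regression output and the 3d discussion.
intercept, b_income, b_size = 1304.905, 33.13301, 356.2959
std_error = 398.1   # regression "standard error" quoted in 3d
n, p = 50, 3        # sample size; terms in the model including the intercept

# Point forecast for income = 60 ($1000s) and household size = 2.
forecast = intercept + b_income * 60 + b_size * 2

# Probability that the new charge amount falls below $4,500.
t_stat = (4500 - forecast) / std_error
prob_below = t_cdf(t_stat, n - p)

print(round(forecast, 3), round(t_stat, 2), round(prob_below, 2))
```

The sketch works with the unrounded t of about 1.242 rather than 1.24, which does not change the probability at two decimal places.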