(The next 4 questions are based on the following information.)

In this problem we consider an analysis of the number of active physicians in a city as a function of the city's population and the region of the United States that the city is in. The sample consists of data on 141 cities in the United States, and the variables are defined as follows:

pop = city population (in thousands)
doctors = number of professionally active physicians in the city

Region dummy variables are defined for the 4 regions: East, Central, South, and West.
east = 1 if city is in the East region, 0 otherwise
central = 1 if city is in the Central region, 0 otherwise
south = 1 if city is in the South region, 0 otherwise

R2 = .9551    R2(Adjusted) = .9538    SSE = 56,950,000    Residual SD = 647.1

             Coefficients   Standard Error   t Stat   P-value
Intercept       -255            130.2         -1.96    0.052
pop                2.3            0.043        53.5    0.000
east             -36             174.7         -0.21   0.836
central         -327             164.1         -1.99   0.048
south            -83             152.8         -0.54   0.590

1. What is the predicted number of doctors in a city with a population of 500,000 in the West region?
(a) 568
(b) 823
(c) 895
(d) 1150

Answer: (c)
Predicted doctors = -255 + 2.3*500 - 36*0 - 327*0 - 83*0 = 895

2. What is the equation for the regression line predicting number of doctors from population for the South region?
(a) Doctors = -338 + 2.3 pop
(b) Doctors = -255 + 83 pop
(c) Doctors = -172 + 2.3 pop
(d) Doctors = -255 + 2.3 pop

Answer: (a)
Predicted Doctors = -255 + 2.3 pop - 36*0 - 327*0 - 83*1 = -338 + 2.3 pop

3. Based on this model, in which region is the slope of the regression line relating doctors to population the steepest?
(a) The West region has the steepest regression line.
(b) The Central region has the steepest regression line.
(c) The East region has the steepest regression line.
(d) The regression line has the same slope in all four regions.

Answer: (d)
The slope for the variable "pop" is the same for all four regions – this is a central assumption of a model with a continuous predictor and a set of dummy variables. Only if we add interaction terms can the slope be different in each region.

4. To test whether the whole model is at all useful, we perform a hypothesis test of whether the coefficients for the four independent variables (pop, east, central, and south) are all equal to 0 in the population. What is the test statistic, approximate critical value, and conclusion for this hypothesis test? (Use alpha = .05)
(a) test statistic: F = 723; approximate critical value: F* = 2.45; Conclusion: Reject H0
(b) test statistic: t = 53.5; approximate critical value: t* = 1.98; Conclusion: Reject H0
(c) test statistic: F = 2862; approximate critical value: F* = 3.92; Conclusion: Don't Reject H0
(d) test statistic: t = -0.54; approximate critical value: t* = 1.98; Conclusion: Don't Reject H0

Answer: (a)
Test statistic F = (R2 / k) / [(1 - R2) / (n - k - 1)] = (.9551 / 4) / (.0449 / (141 - 4 - 1)) = 723
We want the critical F from the F-table, using 4 and 136 df: the closest table entry is F(4, 120) = 2.447.
Because the test statistic is larger than the critical value, we reject the null hypothesis.
(btw, you won't need to use the F table on the exams)
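The overall F statistic in question 4 can also be reproduced numerically. The sketch below is illustrative only: the variable names are mine, and it assumes scipy is available for the critical value (the exam itself only uses the printed F table).

from scipy import stats

# Sketch: overall F test from R-squared (question 4)
r2 = 0.9551   # R-squared from the regression output
n = 141       # number of cities
k = 4         # predictors: pop, east, central, south

# F = (R2 / k) / ((1 - R2) / (n - k - 1))
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

# 5% critical value with 4 and 136 degrees of freedom
f_crit = stats.f.ppf(0.95, k, n - k - 1)

print(round(f_stat))       # about 723
print(round(f_crit, 2))    # about 2.44 -> reject H0, since 723 > 2.44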
(The next 4 questions deal with the following information.)

Below, we predict hourly wages (Wage, measured in dollars per hour) based on type of job (Professional, Clerical, or Service) and amount of education (Educ, measured in years) for n=378 workers interviewed in a 1985 Population Survey. The type of job is coded with two dummy variables, Pro and Clerk:
Pro = 1 if type of job is Professional, and Pro = 0 otherwise
Clerk = 1 if type of job is Clerical, and Clerk = 0 otherwise.

Model A: predictors are Pro and Clerk
Predicted Wage = 6.54 + 4.78 Pro + 0.89 Clerk
R2 = .159    Adjusted R2 = .155    Residual SD = 5.0

Model B: predictors are Pro, Clerk and Educ
Predicted Wage = -1.06 + 2.64 Pro + 0.01 Clerk + .66 Educ
R2 = .225    Adjusted R2 = .219    Residual SD = 4.8

5. Determine the overall average Wage for each of the three types of jobs in the sample.
(a) Professionals: $4.78; Clerical workers: $0.89; Service workers: $6.54
(b) Professionals: $11.32; Clerical workers: $6.54; Service workers: $7.43
(c) Professionals: $9.18; Clerical workers: $6.55; Service workers: $6.54
(d) Professionals: $11.32; Clerical workers: $7.43; Service workers: $6.54

Answer: (d)
Use Model A and plug in the dummy codes for the three groups:
Professionals: Pro=1, Clerk=0: Predicted Wage = 6.54 + 4.78*1 + 0.89*0 = 11.32
Clerical workers: Pro=0, Clerk=1: Predicted Wage = 6.54 + 4.78*0 + 0.89*1 = 7.43
Service workers: Pro=0, Clerk=0: Predicted Wage = 6.54 + 4.78*0 + 0.89*0 = 6.54

6. Which of the following is/are true about the predictions of Model B?
(a) For every level of education, Professionals make $2.64 more than Service workers.
(b) For every level of education, Clerical workers make virtually the same amount as Service workers.
(c) With each additional year of education, Wage increases by $0.66 (for all three types of workers).
(d) All of the above are true.

Answer: (d)
Model B describes 3 parallel lines relating Wage to Educ:
The slope for each line is .66 (the Educ coefficient) [so (c) is true].
The baseline (Service workers) group has a y-intercept of -1.06.
The line for the Professionals is $2.64 higher than for the Service workers [so (a) is true].
The Clerical workers' line is $0.01 higher than the Service workers' line (i.e., they're virtually the same, only different by a penny, so (b) is true).

7. How much of the variability in Wage is not explained by type of job and amount of education?
(a) 84.5%
(b) 77.5%
(c) 34.0%
(d) 22.5%

Answer: (b)
Unexplained variability for Model B: 1 - .225 = .775

8. Consider Professional workers who have 16 years of education. Estimate the average value of Wage for these workers, and (using the normality and equal variances assumptions) estimate the percentage of these workers who make less than $10 per hour.
(a) average Wage = $11.32; about 40% make less than $10 per hour
(b) average Wage = $11.32; about 10% make less than $10 per hour
(c) average Wage = $12.14; about 33% make less than $10 per hour
(d) average Wage = $12.14; about 49% make less than $10 per hour

Answer: (c)
Predicted Wage = -1.06 + 2.64*1 + 0.01*0 + .66*16 = 12.14 (that's the estimated average Wage for those workers)
The scores on Wage for those workers are assumed to be normally distributed, with a mean of 12.14 and with an SD given by the residual SD, which is 4.8.
To determine the percentage who make less than $10/hour, we convert 10 to a z-score: z = (10 - 12.14) / 4.8 = -.45
The area below -.45 is .3264, or about 33%.
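The normal-distribution step in question 8 can be checked numerically. This is a minimal sketch under the same assumptions the solution makes (normal Wage with mean 12.14 and SD 4.8); the variable names and the use of scipy are mine.

from scipy import stats

# Question 8: Professional (Pro=1, Clerk=0) with Educ=16, using Model B
predicted_wage = -1.06 + 2.64*1 + 0.01*0 + 0.66*16      # about 12.14

residual_sd = 4.8
z = (10 - predicted_wage) / residual_sd                  # about -0.45
share_below_10 = stats.norm.cdf(10, loc=predicted_wage, scale=residual_sd)

print(round(predicted_wage, 2), round(z, 2), round(share_below_10, 2))  # ~12.14, -0.45, ~0.33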
(The next 3 questions are based on the following information.)

A study of several hundred professors' salaries in a large American university in 1969 resulted in the following regression equation:
Y-hat = 6300 + 230 B + 120 A + 490 D + 190 E - 2400 X
where the variables are defined as follows:
Y is annual salary
B is number of books published
A is number of articles published
D is number of doctoral students supervised
E is years of experience
X is gender (X=1 if female, X=0 if male)

9. What is the predicted salary for a female professor with 5 published articles, no books, 10 doctoral students supervised, and 5 years of experience? (Remember, this was 1969, so the numbers might seem rather low.)
(a) $6,450
(b) $9,050
(c) $10,350
(d) $12,750

Answer: (c)
From the description: A=5, B=0, D=10, E=5, and X=1 (because she's female)
Predicted salary = 6300 + 230*0 + 120*5 + 490*10 + 190*5 - 2400*1 = $10,350

10. What is the expected increase in salary if a male professor writes 1 book and 2 articles in 1 additional year of experience?
(a) $190
(b) $420
(c) $470
(d) $660

Answer: (d)
B increases by 1, A increases by 2, and E increases by 1. Everything else stays the same.
Expected change in salary is: 230*1 + 120*2 + 190*1 = 660

11. In this sample, the overall mean salary for male professors was $16,100 and the overall mean salary for female professors was $11,200. What is the regression equation for predicting salary from the single predictor X?
(a) Y-hat = 16100 - 4900 X
(b) Y-hat = 11200 + 4900 X
(c) Y-hat = 6300 - 2400 X
(d) Y-hat = 16100 - 2400 X

Answer: (a)
A regression with a dummy variable only will give the mean of Y for each of the two groups (when plugging in X=0 and X=1).
When X=0, the predicted salary needs to be 16,100 (male professor avg); so that's the y-intercept.
When X=1, the predicted salary needs to be 11,200 (female professor avg).
The slope is the difference between the 2 group averages: 11,200 - 16,100 = -4900.
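The logic of question 11 (a regression on a single 0/1 dummy just reproduces the two group means) can be written out as a couple of lines of arithmetic. This is only a sketch of the reasoning; the variable names are mine.

# Question 11: dummy-only regression recovers the two group means
mean_male = 16100     # X = 0 group
mean_female = 11200   # X = 1 group

intercept = mean_male                 # prediction when X = 0
slope = mean_female - mean_male       # -4900, the female-minus-male difference

print(f"Y-hat = {intercept} + ({slope}) X")   # Y-hat = 16100 + (-4900) X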
(The next 3 questions deal with the following information.)

Suppose a realtor runs a regression predicting the Time required to sell a house (measured in weeks), using the following variables as predictors: the Price of the house (in dollars), whether the house is inside the CityLimits or not, and the Age of the house (in years).

12. If the F-test for the overall regression is statistically significant (p<.05), then which of the following is the best conclusion?
(a) All three of the predictors (Price, CityLimits, and Age) are individually useful in predicting Time.
(b) Exactly one of the predictor variables is useful for predicting Time, but we don't know which one it is.
(c) The three predictors as a team are somewhat useful for predicting Time.
(d) The whole regression equation explains none of the variability in Time.

Answer: (c)
The overall F test tests the null hypothesis that the regression equation is useless (the null hypothesis that all the slopes in the regression equation are 0). If we can reject that null hypothesis, it means that the team of all predictors is somewhat useful – the slopes aren't all equal to 0. But we can't conclude that any individual variable is a useful predictor – only that the set of all of them is jointly (a little bit) useful.

13. Which variable would best be represented by a dummy variable?
(a) Price
(b) CityLimits
(c) Age
(d) Time

Answer: (b)
Whether or not a house is within the CityLimits is a categorical variable that can be represented by a 0/1 dummy variable.

14. What are the measurement units for the residuals in this regression?
(a) weeks
(b) dollars
(c) years
(d) houses

Answer: (a)
The residuals have the same units as the outcome variable (Time), which is measured in weeks.

(The next 4 questions are based on the following information.)

In the analysis below, we consider differences in salaries for public school teachers across different parts of the United States. The dataset contains 51 observations from 1986, one for each state and for the District of Columbia. These 51 observations are classified into one of 4 geographic regions: West, Midwest, Northeast, and Southeast. The outcome variable is PAY (public schoolteacher annual pay, measured in dollars). The predictors are 3 dummy variables defined as follows:
NORTHEAST = 1 if state is in the Northeast, 0 otherwise
SOUTHEAST = 1 if state is in the Southeast, 0 otherwise
MIDWEST = 1 if state is in the Midwest, 0 otherwise

ANOVA
              df      SS            MS              F       Significance F
Regression     3    180541498    60180499.48      4.082        0.0117
Residual      47    692838766    14741250.34
Total         50    873380265

              Coefficients   Standard Error   t Stat   P-value
Intercept        26159            1065         24.57    1.8E-28
NORTHEAST         -113            1537         -0.07    0.942
SOUTHEAST        -4487            1479         -3.03    0.004
MIDWEST          -2312            1537         -1.50    0.139

15. Which region in the sample has the highest paid teachers, on average?
(a) Midwest
(b) Northeast
(c) Southeast
(d) West

Answer: (d)
All of the coefficients for the dummy variables are negative, meaning that the Northeast, the Southeast, and the Midwest are all below the baseline group (West).

16. Determine the average schoolteacher annual pay in the Midwest region.
(a) $26,159
(b) $26,046
(c) $23,847
(d) $21,672

Answer: (c)
26159 - 113*0 - 4487*0 - 2312*1 = 23,847

17. Construct a 95% confidence interval for the difference between the average PAY in the West region and the average PAY in the Southeast region.
(a) (1588, 7386)
(b) (-2897, 3126)
(c) (2400, 6574)
(d) (-701, 5325)

Answer: (a)
The coefficient for SOUTHEAST represents the difference between average PAY in the Southeast and average PAY in the West. Use a tabled t value with df=47; 47 is large enough to use t=1.96.
95% CI: 4487 +/- 1.96*1479 = (1588, 7386)

18. What does the p-value ("Significance F") of .0117 in the ANOVA table tell us?
(a) We cannot reject the null hypothesis that all four regions have the same average PAY.
(b) The Northeast and the Midwest regions have the same average PAY.
(c) The three predictors explain about 1% of the variability in PAY.
(d) There are some differences in average PAY across the four regions, although it doesn't tell us the specific pattern of those differences.

Answer: (d)
The F-test tests whether all the coefficients in the regression equation are 0 (in the population). In this case, that would mean that the population average PAY is exactly the same in all 4 regions. A low p-value means that we reject that null hypothesis, and conclude that there are some differences in average PAY among the 4 regions.
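The interval in question 17 is just coefficient +/- t * standard error. A minimal sketch follows; the variable names are mine, and the scipy line is only there to show the exact t value with 47 df (about 2.01), whereas the solution rounds to 1.96.

from scipy import stats

# Question 17: 95% CI for the West-minus-Southeast difference (the SOUTHEAST coefficient, sign flipped)
diff = 4487          # West average minus Southeast average
std_err = 1479

t_approx = 1.96                          # value used in the solution (df = 47 is large)
t_exact = stats.t.ppf(0.975, df=47)      # about 2.01, the exact table value

low, high = diff - t_approx*std_err, diff + t_approx*std_err
print(round(low), round(high))           # about (1588, 7386)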
(The next 8 questions are based on the following information.)

Here are several variables measured for n=74 different models of cars sold in the U.S. in 1991.
MPG: Average miles per gallon in highway driving
WEIGHT: Vehicle weight (in pounds)
FOREIGN: Dummy variable indicating whether the car was made by an American or foreign company (0=American, 1=Foreign)
We will examine what variables predict the fuel efficiency (MPG) for these cars.

Two regressions involving MPG as the outcome variable are below:

Model 1: MPG regressed on FOREIGN
R2 = .120    R2(Adjusted) = .108    Residual Standard Deviation = 7.50

            Coefficients   Standard Error   t Stat   P-value     Lower 95%   Upper 95%
Intercept       30.88           1.31         23.64    1.14E-35      28.28       33.49
FOREIGN          5.50           1.75          3.13    0.00251        2.00        8.99

Model 2: MPG regressed on WEIGHT and FOREIGN
R2 = .807    R2(Adjusted) = .801    Residual Standard Deviation = 3.54

            Coefficients   Standard Error   t Stat    P-value     Lower 95%   Upper 95%
Intercept      64.21           2.19          29.38     1.84E-41      59.86       68.57
WEIGHT         -0.01009        0.00063      -15.89     4.38E-25      -0.0114     -0.0088
FOREIGN         0.53           0.885          0.600    0.5505        -1.233       2.295

19. Determine the overall average MPG for all the American cars in the sample, and also determine the overall average MPG for all the foreign cars in the sample.
(a) average MPG of American cars = 30.88; average MPG of foreign cars = 36.38
(b) average MPG of American cars = 36.38; average MPG of foreign cars = 30.88
(c) average MPG of American cars = 64.21; average MPG of foreign cars = 64.74
(d) average MPG of American cars = 64.74; average MPG of foreign cars = 64.21

Answer: (a)
Plug FOREIGN=0 and FOREIGN=1 into Model 1:
average MPG of American cars = 30.88 + 5.50*0 = 30.88
average MPG of foreign cars = 30.88 + 5.50*1 = 36.38

20. Test the null hypothesis that the overall average MPG for all American cars is the same as the overall average MPG for all foreign cars. What is the appropriate conclusion and p-value for this test?
(a) We conclude that the average MPG is different for foreign cars than for American cars, based on the low p-value of .00251.
(b) We cannot reject the null hypothesis that the average MPG is the same for American cars and for foreign cars, based on the low p-value of .00251.
(c) We conclude that the average MPG is different for foreign cars than for American cars, based on the high p-value of .5505.
(d) We cannot reject the null hypothesis that the average MPG is the same for American cars and for foreign cars, based on the high p-value of .5505.

Answer: (a)
Here, we use the t-test for the FOREIGN coefficient in Model 1. The p-value is very low (.00251), so we conclude that the two average MPGs are different.

21. Which statement is most accurate about Model 2?
(a) Holding WEIGHT constant, foreign cars get significantly worse gas mileage than American cars.
(b) Holding WEIGHT constant, foreign cars get about the same gas mileage as American cars.
(c) Holding FOREIGN constant, foreign cars get about the same gas mileage as American cars.
(d) Holding FOREIGN constant, foreign cars get significantly worse gas mileage than American cars.

Answer: (b)
The coefficient for FOREIGN in Model 2 (which also includes WEIGHT) is close to 0 (only .53, p=.5505): so, controlling for WEIGHT, there's little difference between American and foreign cars in terms of gas mileage.

22. The Ford Mustang, an American car, weighs 3500 pounds and gets 28.0 miles per gallon. Using Model 2, determine the predicted MPG and the residual for the Mustang.
(a) Predicted MPG for Mustang = 28.9; Residual for Mustang = 3.9
(b) Predicted MPG for Mustang = 30.9; Residual for Mustang = 2.6
(c) Predicted MPG for Mustang = 28.9; Residual for Mustang = -0.9
(d) Predicted MPG for Mustang = 30.9; Residual for Mustang = -2.9

Answer: (c)
Predicted MPG = 64.21 - .01009*3500 + 0.53*0 = 28.9
Residual = 28.0 - 28.9 = -0.9
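The prediction-and-residual arithmetic in question 22 is easy to verify in a few lines. A minimal sketch; the variable names are mine.

# Question 22: predicted MPG and residual for the Mustang under Model 2
weight, foreign, actual_mpg = 3500, 0, 28.0     # American car, 3500 lb, 28.0 MPG

predicted_mpg = 64.21 - 0.01009*weight + 0.53*foreign   # about 28.9
residual = actual_mpg - predicted_mpg                    # about -0.9

print(round(predicted_mpg, 1), round(residual, 1))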
23. On average, how much is MPG for a foreign car affected by an additional 100 pounds of car weight?
(a) MPG decreases by about 1.0 for an additional 100 pounds of weight
(b) MPG increases by about 1.0 for an additional 100 pounds of weight
(c) MPG decreases by about 5.3 for an additional 100 pounds of weight
(d) MPG increases by about 5.5 for an additional 100 pounds of weight

Answer: (a)
Here we use the WEIGHT coefficient from Model 2. Holding FOREIGN constant, adding 100 pounds leads to a -.01009*100 = -1.009 change in predicted MPG. We could also say that Model 2 implies that MPG for an American car changes by the same amount given an additional 100 pounds of car weight.

24. How is WEIGHT related to MPG, and how do we know? Pick the best statement(s) below.
(a) WEIGHT is negatively related to MPG (holding FOREIGN constant), because the coefficient for WEIGHT in Model 2 is negative, and is significantly different from 0.
(b) Knowing the WEIGHT of a car (in addition to FOREIGN) helps to predict MPG, because Model 2 has a higher Adjusted R2 than Model 1.
(c) Both (a) and (b) are correct.
(d) Neither (a) nor (b) is correct.

Answer: (c)
The WEIGHT variable is significant, and negatively related to MPG, so (a) is true. The model with WEIGHT added fits better than the model without WEIGHT (adjusted R2 increases), so (b) is also true.

25. What is the typical size of prediction errors if we use both WEIGHT and FOREIGN to predict MPG?
(a) 0.53
(b) 3.54
(c) 5.50
(d) 7.50

Answer: (b)
The residual SD for Model 2 (3.54) gives the size of the prediction errors when using WEIGHT and FOREIGN as predictors.

26. What are the measurement units for the residuals for these regressions?
(a) miles per gallon
(b) pounds
(c) American or Foreign
(d) speed

Answer: (a)
Residuals are in the same units as the DV, which in this case is miles per gallon.

(The next 7 questions deal with the following information.)

Below are several regression models predicting the number of touchdown passes (TD) for n=35 quarterbacks (QBs) in the current NFL season. The predictor variables are:
COMPLETE: total number of completed passes for each quarterback
CONF: dummy variable for the conference (1=NFC, 0=AFC) that the QB plays in.

Predictors       Regression Equation                                R2     Adj. R2   Residual SD
CONF only        Predicted TD = 11.1 - 3.1 CONF                    .122     .095        4.22
COMPLETE only    Predicted TD = -0.37 + .072 COMPLETE              .590     .578        2.88
Both             Predicted TD = 0.84 + .068 COMPLETE - 1.3 CONF    .612     .588        2.85

27. Overall, in this sample, did QBs in the NFC or the AFC throw more touchdown passes?
(a) QBs in the NFC threw 3.1 more touchdowns, on average, than QBs in the AFC.
(b) QBs in the AFC threw 3.1 more touchdowns, on average, than QBs in the NFC.
(c) QBs in the NFC threw 1.3 more touchdowns, on average, than QBs in the AFC.
(d) QBs in the AFC threw 1.3 more touchdowns, on average, than QBs in the NFC.

Answer: (b)
To get the overall average number of TD passes, we use the model with CONF as the only predictor:
AFC: CONF=0, Predicted TD = 11.1
NFC: CONF=1, Predicted TD = 11.1 - 3.1 = 8.0
So the AFC avg is greater than the NFC avg by 3.1 TDs.

28. Estimate the number of touchdown passes for a quarterback in the NFC who has thrown 200 completed passes.
(a) 8.0
(b) 13.1
(c) 14.0
(d) 14.4

Answer: (b)
Here we know that CONF=1 (because it's an NFC QB) and COMPLETE = 200. We use the model that has both CONF and COMPLETE as predictors:
Predicted TD = 0.84 + .068*200 - 1.3*1 = 13.14
29. Estimate the average number of touchdown passes by all quarterbacks (in either conference) who have thrown 100 completed passes.
(a) 4.1
(b) 6.3
(c) 6.8
(d) 7.6

Answer: (c)
Here we only know that COMPLETE = 100, and we don't know CONF. So we use the model that only has COMPLETE as a predictor:
Predicted TD = -0.37 + .072*100 = 6.83

30. Determine the regression and residual degrees of freedom for the model that estimates two parallel lines, one for each conference.
(a) Regression df = 2; residual df = 32
(b) Regression df = 1; residual df = 33
(c) Regression df = 2; residual df = 33
(d) Regression df = 1; residual df = 34

Answer: (a)
The model with 2 parallel lines is the 3rd model, with k=2 predictors. Regression df = k = 2, and residual df = n-k-1 = 35-2-1 = 32.

31. In order to predict how many touchdown passes a quarterback has thrown, would you rather know the conference he's in, or the number of completed passes he's thrown?
(a) The number of completed passes he's thrown.
(b) The conference he's in.
(c) It doesn't matter; both pieces of information are equally useful.

Answer: (a)
Which predictor is better on its own: CONF or COMPLETE? The model with just COMPLETE has a much better fit than the one with just CONF.

32. On average, an additional touchdown pass occurs for every ____ additional completed passes.
(a) 3
(b) 7
(c) 14
(d) 42

Answer: (c)
How much does COMPLETE need to increase for TD to go up by 1? Call this unknown increase c. Using the model with COMPLETE as the only predictor (because we're ignoring the conference for now), we want to solve the following: c*.072 = 1, or c = 1/.072 = 13.9, or about 14.
Note that if you mistakenly used the 3rd model, you'd get 1/.068 = 14.7, which is luckily still very close to the right answer.

33. Suppose we wanted to estimate two non-parallel lines predicting TD from COMPLETE for each conference. What are the best independent variables and dependent variable to use for this regression analysis?
(a) IVs: TD, CONF, and TD*CONF; DV: COMPLETE
(b) IVs: COMPLETE and CONF; DV: TD
(c) IVs: COMPLETE*CONF; DV: TD
(d) IVs: COMPLETE, CONF, and COMPLETE*CONF; DV: TD

Answer: (d)
We want the IVs to include COMPLETE, CONF, and the interaction term COMPLETE*CONF.

(The next 7 questions are based on the following information.)

Below are several regression models predicting the salaries of n=263 Major League baseball players (from the 1987 season). The dependent variable is SAL, the player's salary, measured in thousands of dollars. The predictor variables are:
HR: number of home runs the player hit in the previous season
OUTF: dummy variable indicating whether the player is an outfielder (OUTF=1) or not (OUTF=0)
HR*OUTF: interaction term created by multiplying HR and OUTF together

Model   Regression Equation                                       Quality of Fit
1       Predicted SAL = 508 + 115 OUTF                            R2 = .012,  Adjusted R2 = .008
2       Predicted SAL = 331 + 17.7 HR                             R2 = .1177, Adjusted R2 = .114
3       Predicted SAL = 329 + 17.4 HR + 15 OUTF                   R2 = .1178, Adjusted R2 = .111
4       Predicted SAL = 347 + 15.7 HR - 58 OUTF + 5.2 HR*OUTF     R2 = .120,  Adjusted R2 = .110

Detailed Excel output for Model 4:

            Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept       347              50           7.0     2.1E-11      249.7       445.0
HR               15.7             3.8         4.1     0.00006        8.1        23.3
OUTF            -58             112          -0.5     0.605       -278.9       162.7
HR*OUTF           5.2             6.6         0.8     0.431         -7.8        18.2

34. Determine the average salary for all of the outfielders in the sample.
(a) $508,000
(b) $623,000
(c) $289,000
(d) $344,000

Answer: (b)
Here we use Model 1 – it will give us the average salaries for the two groups when we plug in the appropriate dummy codes. The outfielders are given by OUTF=1:
Predicted SAL = 508 + 115*1 = 623, or $623,000
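Both question 33 (COMPLETE*CONF) and the baseball models above (HR*OUTF) rely on an interaction column built by multiplying a continuous predictor by a dummy, which is what lets the slope differ between groups. A small illustrative sketch follows; the data values are made up and the names are mine.

# Building an interaction column for non-parallel regression lines
hr   = [20, 5, 30]        # hypothetical home-run counts
outf = [1, 0, 1]          # 1 = outfielder, 0 = non-outfielder

hr_x_outf = [h * o for h, o in zip(hr, outf)]   # [20, 0, 30]

# A regression of SAL on HR, OUTF, and HR*OUTF (as in Model 4) then allows
# the HR slope to be different for outfielders and non-outfielders.
print(hr_x_outf)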
35. Jim R. is a player who hit 20 home runs. We don't know what position he played. Based on this information, what would we predict his salary to be?
(a) $623,000
(b) $661,000
(c) $677,000
(d) $685,000

Answer: (d)
Here we know only HR, so we use Model 2:
Predicted SAL = 331 + 17.7*20 = 685

36. A hypothesis test of the slope coefficient in Model 1 would be identical to what kind of analysis we encountered in the first half of the course?
(a) Comparing two population means, using independent samples
(b) Comparing two population means, using paired samples
(c) Using the normal distribution to approximate the binomial distribution
(d) Calculating the variance of a Bernoulli random variable

Answer: (a)
The slope coefficient in Model 1 represents the difference in average salary between outfielders and non-outfielders. These are two separate, independent groups of players. So when testing the slope, we're comparing two population means, using independent samples.

37. Using the model that estimates non-parallel regression lines relating SAL to HR for outfielders and non-outfielders – on average, how much more would outfielders who hit 25 home-runs make, compared to outfielders who hit 15 home-runs?
(a) $52,000
(b) $157,000
(c) $174,000
(d) $209,000

Answer: (d)
Here we use Model 4 – that's the one that estimates non-parallel regression lines.
The slope of the line relating SAL to HR for outfielders (OUTF=1) is the coefficient for HR plus the coefficient for HR*OUTF: 15.7 + 5.2 = 20.9.
So a 10 home-run change from HR=15 to HR=25 means a change of 20.9*10 = 209, or $209,000.

38. How is the relationship between salary and home-runs different for outfielders and non-outfielders? Choose the best explanation below.
(a) Outfielders make slightly more for each home-run (compared to non-outfielders), and the difference is statistically significant.
(b) Outfielders make slightly less for each home-run (compared to non-outfielders), and the difference is statistically significant.
(c) Outfielders make slightly more for each home-run (compared to non-outfielders), but the difference is not statistically significant.
(d) Outfielders make slightly less for each home-run (compared to non-outfielders), but the difference is not statistically significant.

Answer: (c)
The coefficient for HR*OUTF tells us about the difference in slopes for the outfielders and the non-outfielders. It's positive (5.2), which means that the outfielders make $5,200 more per home run compared to the non-outfielders. But the coefficient is not statistically significant (p-value=.431).

39. If you drew a picture of the predictions of the best model (based on Adjusted R2), what would it look like?
(a) One line relating salary to position for any number of home-runs.
(b) One line relating salary to home-runs for both outfielders and non-outfielders.
(c) Two parallel lines relating salary to home-runs, one for outfielders and one for non-outfielders.
(d) Two non-parallel lines relating salary to home-runs, one for outfielders and one for non-outfielders.

Answer: (b)
The best model based on Adjusted R2 is Model 2, which describes just one regression line relating salary to home-runs for both groups.

40. Determine the regression and residual degrees of freedom for Model 4.
(a) Regression df = 1; Residual df = 261.
(b) Regression df = 2; Residual df = 261.
(c) Regression df = 3; Residual df = 259.
(d) Regression df = 4; Residual df = 263.

Answer: (c)
Regression df = # of predictors = 3
Residual df = n - # of predictors - 1 = 263 - 3 - 1 = 259
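Question 37's comparison under the non-parallel-lines model can be checked directly by plugging both HR values into Model 4. A minimal sketch; the function name is mine.

# Question 37: Model 4, comparing outfielders with 25 vs. 15 home runs
def predicted_sal(hr, outf):
    # salary in thousands of dollars, from Model 4
    return 347 + 15.7*hr - 58*outf + 5.2*hr*outf

diff = predicted_sal(25, 1) - predicted_sal(15, 1)
print(round(diff))   # about 209, i.e. $209,000 (outfielder slope is 15.7 + 5.2 = 20.9 per HR)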
(The next 8 questions deal with the following information.)

Let's consider some data (from the U.S. Bureau of Labor Statistics) on wages for different groups of workers, measured over a ten-year period. We'll consider workers from 3 industries: mining, construction, and manufacturing. These three types of employees are coded by two dummy variables: Construc (1 for construction workers, 0 otherwise) and Mining (1 for mining workers, 0 otherwise). There are 120 observations for each type of worker, one observation for each month starting in January 1992 and ending in December 2001. The variable Month (ranging from 1 to 120) tells us the time period for each measurement. (Two interaction terms are also computed: Construc*Month and Mining*Month.) The outcome variable is the average Wage (dollars/hour) for each type of employee measured for each month.

Below are the first few rows of the data (showing the first 3 months for the construction workers); there are n=360 observations total.

Wage    Month   Construc   Mining   Construc*Month   Mining*Month
14.07     1        1          0            1              0
14.02     2        1          0            2              0
14.12     3        1          0            3              0
 ...     ...      ...        ...          ...            ...

Four regression models are described below, using Construc, Mining, Month, and the interaction terms as predictors.

Model   Regression Equation                                               R2       Adj. R2   Residual SD
1       Predicted Wage = 13.02 + 2.96 Construc + 2.97 Mining             .5733     .5709        1.21
2       Predicted Wage = 12.92 + .034 Month                              .4117     .4101        1.42
3       Predicted Wage = 10.95 + .034 Month + 2.96 Construc
          + 2.97 Mining                                                  .9850     .9849        0.23
4       Predicted Wage = 11.13 + .031 Month + 2.41 Construc
          + 2.97 Mining + .009 Construc*Month - .0001 Mining*Month       .9913     .9912        0.17

All regression coefficients in these models are highly statistically significant (p-values < .01), except for the -.0001 for the Mining*Month term in Model 4 (p-value=.88).

41. Determine the overall average value of Wage for each type of worker, over the entire 10-year period.
(a) overall average Wage = $10.95 for manufacturing, $13.91 for construction, and $13.92 for mining.
(b) overall average Wage = $11.13 for manufacturing, $13.54 for construction, and $14.10 for mining.
(c) overall average Wage = $13.02 for manufacturing, $15.98 for construction, and $15.99 for mining.
(d) overall average Wage = $12.92 for all three groups.

Answer: (c)
Use Model 1 and plug in the dummy codes:
baseline group, manufacturing: average Wage = 13.02 + 2.96*0 + 2.97*0 = 13.02
construction: average Wage = 13.02 + 2.96*1 + 2.97*0 = 15.98
mining: average Wage = 13.02 + 2.96*0 + 2.97*1 = 15.99

42. Use Model 3 to determine the predicted value of Wage for the first observation in the data set (that is, for construction workers in the first month). Is this prediction too high or too low compared to the actual value of Wage?
(a) predicted Wage = $13.94; prediction is too low.
(b) predicted Wage = $15.98; prediction is too high.
(c) predicted Wage = $10.95; prediction is too low.
(d) predicted Wage = $13.94; prediction is too high.

Answer: (a)
First observation: Construc=1, Mining=0, Month=1
Predicted Wage = 10.95 + .034*1 + 2.96*1 + 2.97*0 = $13.94
Actual Wage = 14.07 (from the first row of the data table, above)
The prediction is too low; it's less than the actual value.
(Questions about regressions predicting Wage, continued.)

43. Which of the following best describes the predictions of Model 3?
(a) Three parallel lines relating Wage to Month. Controlling for Month, mining and construction workers tend to make about $3 more per hour than manufacturing workers.
(b) Three parallel lines relating Wage to Month. Controlling for Month, manufacturing workers make about $3 more per hour than mining and construction workers.
(c) Three lines relating Wage to Month, with different slopes for the three groups. The line is steepest for mining workers.
(d) Three lines relating Wage to Month, with different slopes for the three groups. The line is steepest for construction workers.

Answer: (a)
A model with k dummy predictors and one continuous predictor will estimate k+1 parallel lines. In this case, the coefficients of 2.96 and 2.97 for Construc and Mining tell us that the lines for those two groups are above the baseline group, manufacturing.

44. Use the best-fitting model to estimate how much the hourly wages for construction workers tended to increase each month.
(a) Each month, average hourly wages for construction workers tended to increase by about $2.41.
(b) Each month, average hourly wages for construction workers tended to increase by about 4 cents.
(c) Each month, average hourly wages for construction workers tended to increase by about 3 cents.
(d) Each month, average hourly wages for construction workers tended to increase by about 1 cent.

Answer: (b)
The best-fitting model is Model 4. Plug in Construc=1 and Mining=0 to get the slope of the line relating Wage to Month:
Predicted Wage = 11.13 + .031 Month + 2.41*1 + 2.97*0 + .009*1*Month - .0001*0*Month
Predicted Wage = 11.13 + 2.41 + (.031 + .009) Month = 13.54 + .04 Month
The slope is .04, or an increase of about 4 cents per Month.

45. Based on the best-fitting model, how well can we predict Wage based on the Month and the type of worker?
(a) Our typical prediction error will be about 1 cent.
(b) Our typical prediction error will be about 3 cents.
(c) Our typical prediction error will be about 17 cents.
(d) Our typical prediction error will be about 23 cents.

Answer: (c)
The residual SD for Model 4 gives the typical prediction error: 0.17, or about 17 cents.

46. Determine the correlation coefficient between the variables Month and Wage.
(a) r = 0
(b) r = .64
(c) r = .76
(d) r = .99

Answer: (b)
The correlation r is the square root of R2 for Model 2: sqrt(.4117) = +/- .64
Because the slope of the line for Model 2 is positive, the correlation is positive too, so r = .64

47. Notice that the R2 for Model 3 is equal to the R2 for Model 1 plus the R2 for Model 2. Also, the regression slopes in Model 3 are the same as in Model 1 and Model 2. Why is this?
(a) Because all the coefficients are statistically significant.
(b) Because Month isn't correlated with either Construc or Mining, so the predictors in Model 1 and Model 2 aren't redundant.
(c) Because wider is better.
(d) I'm not sure, but I'm sure magic statistical fairy dust is involved.

Answer: (b)
Recall that when predictors are uncorrelated (as in the hayfever example in class), the R2 values will add up nicely, and the coefficients for the predictors will stay the same, because the predictors aren't redundant (they don't overlap).

48. Which of the following could be the overall standard deviation of the 360 values of Wage in the data set? (Hint: No formal calculations are needed here.)
(a) $0.17
(b) $1.21
(c) $1.42
(d) $1.85

Answer: (d)
The overall SD of Y must be greater than the residual SD of any model predicting Y. So the overall SD of Y needs to be greater than 1.42, which is the largest residual SD given for the 4 models.
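The two small calculations in questions 44 and 46 can be written out in a few lines. This is only an arithmetic sketch; the variable names are mine.

import math

# Question 44: slope for construction workers under Model 4 (Construc=1, Mining=0)
construction_slope = 0.031 + 0.009            # 0.04 -> about 4 cents per month

# Question 46: correlation between Month and Wage from Model 2's R-squared
r = math.sqrt(0.4117)                          # about .64; positive because Model 2's slope (.034) is positive

print(round(construction_slope, 3), round(r, 2))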
Optional: In fact, with some algebra we can directly calculate the overall SD of Wage.
Recall that the variance of Wage = SSTotal / (n - 1), so what we want to calculate is sqrt(SSTotal / (n - 1)).
Because the unexplained proportion of variance is (1 - R2) = SSResidual / SSTotal, we can solve for SSTotal:
SSTotal = SSResidual / (1 - R2)
And SSResidual is related to the Residual SD in the ANOVA table: SSResidual = (Residual SD)^2 * (Residual df)
Putting it all together we get the ugly formula:
SD of Y = sqrt[ (Residual SD)^2 * (Residual df) / ((1 - R2) * (n - 1)) ]
We can plug in the Residual SD, R2, and Residual df (which is n - k - 1) for any of the models, and we'll get the same answer (within some rounding error).
Using Model 1: SD of Y = sqrt[ 1.21^2 * (360 - 3) / ((1 - .5733) * (360 - 1)) ] = 1.847
Using Model 2: SD of Y = sqrt[ 1.42^2 * (360 - 2) / ((1 - .4117) * (360 - 1)) ] = 1.849
etc.
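The same "ugly formula" is easy to apply for any of the models. A minimal sketch of the calculation above; the function and variable names are mine.

import math

# Backing out the overall SD of Wage from a model's residual SD and R-squared
def overall_sd(residual_sd, r2, n, k):
    # SD of Y = sqrt[ residual_sd^2 * (n - k - 1) / ((1 - r2) * (n - 1)) ]
    ss_residual = residual_sd**2 * (n - k - 1)
    ss_total = ss_residual / (1 - r2)
    return math.sqrt(ss_total / (n - 1))

print(round(overall_sd(1.21, 0.5733, 360, 2), 3))   # Model 1 (k=2): about 1.847
print(round(overall_sd(1.42, 0.4117, 360, 1), 3))   # Model 2 (k=1): about 1.849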