dummy variable

advertisement
(The next 4 questions are based on the following information.)
In this problem we consider an analysis of the number of active physicians in a city as a function of the
city's population and the region of the United States that the city is in. The sample consists of data on 141
cities in the United States, and the variables are defined as follows



pop = city population (in thousands)
doctors = number of professionally active physicians in the city
region dummy variables are defined for 4 regions: East, Central, South, and West.

east = 1 if city in the East region, 0 otherwise

central = 1 if city in the Central region, 0 otherwise

south = 1 if city in the South region, 0 otherwise
R2 = .9551
Intercept
pop
east
central
south
R2(Adjusted) = .9538
Coefficients Standard
Error
-255
130.2
2.3
0.043
-36
174.7
-327
164.1
-83
152.8
SSE=56,950,000
t Stat
-1.96
53.5
-0.21
-1.99
-0.54
Residual SD = 647.1
P-Value
0.052
0.000
0.836
0.048
0.590
1. What is the predicted number of doctors in a city with a population of 500,000 in the West region?
(a) 568
(b) 823
(c) 895
(d) 1150
Answer: (c)
predicted doctors = -255 + 2.3 * 500 – 36*0 – 327*0 –83*0 = 895
2. What is the equation for the regression line predicting number of doctors from population for the
South region?
(a) Doctors = -338 + 2.3 pop
(b) Doctors = -255 + 83 pop
(c) Doctors = -172 + 2.3 pop
(d) Doctors = -255 + 2.3 pop
Answer: (a)
Predicted Doctors = -255 + 2.3 pop - 36*0 - 327*0 - 83*1 = -338 + 2.3 pop
3. Based on this model, in which region is the slope of the regression line relating doctors to population
the steepest?
(a) The West region has the steepest regression line.
(b) The Central region has the steepest regression line.
(c) The East region has the steepest regression line.
(d) The regression line has the same slope in all four regions.
Answer: (d)
The slope for the variable “pop” is the same for all four regions – this is a central assumption of a
model with a continuous predictor and a set of dummy variables. Only if we add interaction terms
can the slope be different in each region.
4. To test whether the whole model is at all useful, we perform a hypothesis test of whether the population
coefficients for the four independent variables (pop, east, central, and south) are all equal to 0. What is
the test statistic, approximate critical value, and conclusion for this hypothesis test? (Use alpha=.05)
(a) test statistic: F = 723
approximate critical value: F* = 2.45
Conclusion: Reject H0
(b) test statistic: t = 53.5
approximate critical value: t* = 1.98
Conclusion: Reject H0
(c) test statistic: F = 2862
approximate critical value: F* = 3.92
Conclusion: Don't Reject H0
(d) test statistic: t = -0.54
approximate critical value: t* = 1.98
Conclusion: Don't Reject H0
Answer: (a)
Test statistic F = (R2 / k) / [(1-R2) / (n-k-1)] = (.9551 / 4) / (.0449 / (141- 4 –1)) = 723
Want critical F from F-table , using 4 and 136 df: closest is F(4,60120) = 2.447
Because the test statistic is larger than the critical value we reject the null hypothesis.
(btw, you won’t need to use the F table on the exams)
(The next 4 questions deal with the following information.)
Below. we predict hourly wages (Wage, measured in dollars per hour) based on type of job (Professional,
Clerical, or Service) and amount of education (Educ, measured in years) for n=378 workers interviewed
in a 1985 Population Survey.
The type of job is coded with two dummy variables, Pro and Clerk:
 Pro =1 if type of job is Professional, and Pro=0 otherwise
 Clerk =1 if type of job is Clerical, and Clerk=0 otherwise.
Model A: predictors are Pro and Clerk
Predicted Wage = 6.54 + 4.78 Pro + 0.89 Clerk
R2 = .159
Adjusted R2 = .155
Residual SD = 5.0
Model B: predictors are Pro, Clerk and Educ
Predicted Wage = -1.06 + 2.64 Pro + 0.01 Clerk + .66 Educ
R2 = .225
Adjusted R2 = .219
Residual SD = 4.8
5. Determine the overall average Wage for each of the three types of jobs in the sample.
(a)
(b)
(c)
(d)
Professionals:$4.78
Professionals:$11.32
Professionals:$9.18
Professionals:$11.32
Clerical workers: $0.89
Clerical workers: $6.54
Clerical workers: $6.55
Clerical workers: $7.43
Service workers: $6.54
Service workers: $7.43
Service workers: $6.54
Service workers: $6.54
Answer: (d)
Use Model A and plug in the dummy codes for the three groups:
Professionals: Pro=1, Clerk=0 
Predicted Wage = 6.54 + 4.78 * 1 + 0.89 * 0 = 11.32
Clerical workers: Pro=0, Clerk=1  Predicted Wage = 6.54 + 4.78 * 0 + 0.89 * 1 = 7.43
Service workers: Pro=0, Clerk=0  Predicted Wage = 6.54 + 4.78 * 0 + 0.89 * 0 = 6.54
6. Which of the following is/are true about the predictions of Model B?
(a)
(b)
(c)
(d)
For every level of education, Professionals make $2.64 more than Service workers.
For every level of education, Clerical workers make virtually the same amount as Service workers.
With each additional year of education, Wage increases by $0.66 (for all three types of workers).
All of the above are true.
Answer: (d)
Model B describes 3 parallel lines relating Wage to Educ:
The slope for each line is .66 (the Educ coef) [so (c) is true]
The baseline (Service workers) group has a y-intercept of –1.06.
The line for the Professionals is $2.64 higher than for the Service workers [so (a) is true]
The Clerical workers’ line is $0.01 higher than the Service workers’ line (i.e., they’re virtually the
same, only different by a penny, so (b) is true)
7. How much of the variability in Wage is not explained by type of job and amount of education?
(a) 84.5%
(b) 77.5%
(c) 34.0%
(d) 22.5%
Answer: (b)
Unexplained variability for Model B: 1- .225 = .775
8. Consider Professional workers who have 16 years of education. Estimate the average value of Wage
for these workers, and (using the normality and equal variances assumptions) estimate the percentage of
these workers who make less than $10 per hour.
(a)
(b)
(c)
(d)
average Wage = $11.32;
average Wage = $11.32;
average Wage = $12.14;
average Wage = $12.14;
about 40% make less than $10 per hour
about 10% make less than $10 per hour
about 33% make less than $10 per hour
about 49% make less than $10 per hour
Answer: (c)
Predicted Wage = -1.06 + 2.64 *1 + 0.01 * 0 + .66 * 16 = 12.14 (that’s the estimated average Wage
for those workers)
The scores on Wage for those workers are assumed to be normally distributed, with mean of 12.14,
and with a SD given by the residual SD, which is 4.8
To determine the percentage who make less than $10/hour, we convert 10 to a z-score:
z = 10 – 12.14 / 4.8 = -.45
The area below -.45 is .3264 or about 33%.
(The next 3 questions are based on the following information.)
A study of several hundred professors' salaries in a large American university in 1969 resulted in the
following regression equation:
Y-hat = 6300 + 230 B + 120 A + 490 D + 190 E - 2400 X
where the variables are defined as follows:






Y is annual salary
B is number of books published
A is number of articles published
D is number of doctoral students supervised
E is years of experience
X is gender (X=1 if female, X=0 if male)
9. What is the predicted salary for a female professor with 5 published articles, no books, 10 doctoral
students supervised, and 5 years of experience? (Remember, this was 1969, so the numbers might seem
rather low.)
(a)
(b)
(c)
(d)
$6,450
$9,050
$10,350
$12,750
Answer: (c)
From the description:
A=5, B=0, D=10, E=5, and X=1 (because she’s female)
predicted salary = 6300 + 230*0 + 120*5 + 490 * 10 + 190 * 5 – 2400 * 1 = $10,350
10. What is the expected increase in salary if a male professor writes 1 book and 2 articles in 1 additional
year of experience?
(a)
(b)
(c)
(d)
$190
$420
$470
$660
Answer: (d)
B increases by 1, A increases by 2, and E increases by 1. Everything else stays the same.
Expected change in salary is: 230*1 + 120*2 + 190*1 = 660
11. In this sample, the overall mean salary for male professors was $16,100 and the overall mean salary
for female professors was $11,200. What is the regression equation for predicting salary from the single
predictor X?
(a)
(b)
(c)
(d)
Yˆ = 16100 - 4900 X
Yˆ = 11200 + 4900 X
Yˆ = 6300 - 2400 X
Yˆ = 16100 - 2400 X
Answer: (a)
Regression with dummy variable only will give the mean of Y for each of the two groups (when
plugging in X =0 and X = 1)
When X=0, predicted salary needs to be 16,100 (male professor avg); So that’s the y-intercept
When X=1, predicted salary needs to be 11,200 (female professor avg); The slope is the difference
between the 2 group averages: 11,200 – 16,100 = -4900.
(The next 3 questions deal with the following information.)
Suppose a realtor runs a regression predicting the Time required to sell a house (measured in weeks),
using the following variables as predictors: the Price of the house (in dollars), whether the house is inside
the CityLimits or not, and the Age of the house (in years).
12. If the F-test for the overall regression is statistically significant (p<.05), then which of the following is
the best conclusion?
(a) All three of the predictors (Price, CityLimits, and Age) are individually useful in predicting Time.
(b) Exactly one of the predictor variables is useful for predicting Time, but we don’t know which one it
is.
(c) The three predictors as a team are somewhat useful for predicting Time.
(d) The whole regression equation explains none of the variability in Time.
Answer: (c)
The overall F test tests the null hypothesis that the regression equation is useless (the null
hypothesis that all the slopes in the regression equation are 0). If we can reject that null hypothesis,
it means that the team of all predictors is somewhat useful – the slopes aren’t all equal to 0. But we
can’t conclude that any individual variable is a useful predictor – only that the set of all of them is
jointly (a little bit) useful.
13. Which variable would best be represented by a dummy variable?
(a)
(b)
(c)
(d)
Price
CityLimits
Age
Time
Answer: (b)
Whether or not a house is within the CityLimits is a categorical variable that can be represented by
a 0/1 dummy variable.
14. What are the measurement units for the residuals in this regression?
(a)
(b)
(c)
(d)
weeks
dollars
years
houses
Answer: (a)
The residuals have the same units as the outcome variable (Time) which is measured in weeks.
(The next 4 questions are based on the following information.)
In the analysis below, we consider differences in salaries for public school teachers across different parts
of the United States. The dataset contains 51 observations from 1986, one for each state and for the
District of Columbia. These 51 observations are classified into one of 4 geographic regions: West,
Midwest, Northeast, and Southeast. The outcome variable is PAY (public schoolteacher annual pay,
measured in dollars). The predictors are 3 dummy variables defined as follows:
NORTHEAST = 1 if state is in the Northeast, 0 otherwise
SOUTHEAST = 1 if state is in the Southeast, 0 otherwise
MIDWEST = 1 if state is in the Midwest, 0 otherwise
ANOVA
df
Regression
Residual
Total
Intercept
NORTHEAST
SOUTHEAST
MIDWEST
3
47
50
SS
MS
180541498 60180499.48
692838766 14741250.34
873380265
Coefficients Standard Error
26159
1065
-113
1537
-4487
1479
-2312
1537
t Stat
24.57
-0.07
-3.03
-1.50
F
Significance F
4.082
0.0117
P-value
1.8E-28
0.942
0.004
0.139
15. Which region in the sample has the highest paid teachers, on average?
(a) Midwest
(b) Northeast
(c) Southeast
(d) West
Answer: (d)
All of the coefficients for the dummy variables are negative, meaning that the Northeast, the
Southeast, and the Midwest are all below the baseline group (West).
16. Determine the average schoolteacher annual pay in the Midwest region.
(a) $26,159
(b) $26,046
(c) $23,847
(d) $21,672
Answer: (c)
26159 –113*0 –4487*0 – 2312*1 = 23,847
17. Construct a 95% confidence interval for the difference between the average PAY in the West region
and the average PAY in the Southeast region.
(a) (1588, 7386)
(b) (-2897, 3126)
(c) (2400, 6574)
(d) (-701, 5325)
Answer: (a)
The coefficient for SOUTHEAST represents the difference between average PAY in the Southeast
and average PAY in the West.
Use a tabled t value with df=47. 47 is large enough to use t=1.96
95% CI: 4487 +/- 1.96 * 1479  (1588, 7386)
18. What does the p-value (“Significance F”) of .0117 in the ANOVA table tell us?
(a) We cannot reject the null hypothesis that all four regions have the same average PAY.
(b) The Northeast and the Midwest regions have the same average PAY.
(c) The three predictors explain about 1% of the variability in PAY.
(d) There are some differences in average PAY across the four regions, although it doesn’t tell us the
specific pattern of those differences
Answer: (d)
The F-test tests whether all the coefficients in the regression equation are 0 (in the population). In
this case, that would mean that the population average PAY is exactly the same in all 4 regions. A
low p-value means that we reject that null hypothesis, and conclude that there are some differences
in average PAY among the 4 regions.
(The next 8 questions are based on the following information.)
Here are several variables measured for n=74 different models of cars sold in the U.S. in 1991.



MPG: Average miles per gallon in highway driving
WEIGHT: Vehicle weight (in pounds)
FOREIGN: Dummy variable indicating whether the car was made by an American or foreign
company (0=American, 1=Foreign)
We will examine what variables predict the fuel efficiency (MPG) for these cars. Two regressions
involving MPG as the outcome variable are below:
Model 1: MPG regressed on FOREIGN
R2 = .120
Intercept
FOREIGN
R2(Adjusted) = .108
Residual Standard Deviation = 7.50
Coefficients Standard Error t Stat P-value
30.88
1.31
23.64 1.14E-35
5.50
1.75
3.13 0.00251
Lower 95%
28.28
2.00
Upper 95%
33.49
8.99
Model 2: MPG regressed on WEIGHT and FOREIGN
R2 = .807
Intercept
WEIGHT
FOREIGN
R2(Adjusted) = .801
Residual Standard Deviation = 3.54
Coefficients Standard Error
64.21
2.19
-0.01009
0.00063
0.53
0.885
t Stat
29.38
-15.89
0.600
P-value
1.84E-41
4.38E-25
0.5505
Lower 95%
59.86
-0.0114
-1.233
Upper 95%
68.57
-0.0088
2.295
19. Determine the overall average MPG for all the American cars in the sample, and also determine the
overall average MPG for all the foreign cars in the sample.
(a)
(b)
(c)
(d)
average MPG of American cars =
average MPG of American cars =
average MPG of American cars =
average MPG of American cars =
30.88; average MPG of foreign cars =
36.38; average MPG of foreign cars =
64.21; average MPG of foreign cars =
64.74; average MPG of foreign cars =
36.38
30.88
64.74
64.21
Answer: (a)
plug in FOREIGN=0 and FOREIGN=1 into Model 1:
average MPG of American cars = 30.88 + 5.50*0 = 30.88
average MPG of foreign cars = 30.88 + 5.50 * 1 = 36.38
20. Test the null hypothesis that the overall average MPG for all American cars is the same as the overall
average MPG for all foreign cars. What is the appropriate conclusion and p-value for this test?
(a) We conclude that the average MPG is different for foreign cars than for American cars, based on the
low p-value of .00251.
(b) We cannot reject the null hypothesis that the average MPG is the same for American cars and for
foreign cars, based on the low p-value of .00251.
(c) We conclude that the average MPG is different for foreign cars than for American cars, based on the
high p-value of .5505.
(d) We cannot reject the null hypothesis that the average MPG is the same for American cars and for
foreign cars, based on the high p-value of .5505.
Answer: (a)
Here, we use the t-test for the FOREIGN coefficient in Model 1.
The p-value is very low (.00251), so we conclude that the two average MPGs are different.
21. Which statement is most accurate about Model 2?
(a)
(b)
(c)
(d)
Holding WEIGHT constant, foreign cars get significantly worse gas mileage than American cars.
Holding WEIGHT constant, foreign cars get about the same gas mileage as American cars.
Holding FOREIGN constant, foreign cars get about the same gas mileage as American cars.
Holding FOREIGN constant, foreign cars get significantly worse gas mileage than American cars.
Answer: (b)
The coefficient for FOREIGN in Model 2 (which also includes WEIGHT) is close to 0 (only .53,
p=.5505): so, controlling for WEIGHT, there’s little difference between American and foreign cars
in terms of gas mileage.
22. The Ford Mustang, an American car, weighs 3500 pounds and gets 28.0 miles per gallon. Using
Model 2, determine the predicted MPG and the residual for the Mustang.
(a)
(b)
(c)
(d)
Predicted MPG for Mustang = 28.9;
Predicted MPG for Mustang = 30.9;
Predicted MPG for Mustang = 28.9;
Predicted MPG for Mustang = 30.9;
Residual for Mustang =
Residual for Mustang =
Residual for Mustang =
Residual for Mustang =
3.9
2.6
-0.9
-2.9
Answer: (c)
predicted MPG = 64.21 -.01009*3500 + 0.53*0 = 28.9
residual = 28.0 – 28.9 = -0.9
23. On average, how much is MPG for a foreign car affected by an additional 100 pounds of car weight?
(a)
(b)
(c)
(d)
MPG decreases by about 1.0 for an additional 100 pounds of weight
MPG increases by about 1.0 for an additional 100 pounds of weight
MPG decreases by about 5.3 for an additional 100 pounds of weight
MPG increases by about 5.5 for an additional 100 pounds of weight
Answer: (a)
Here we use the WEIGHT coefficient from Model 2. Holding constant FOREIGN, adding 100
pounds leads to a -.01009*100 = -1.009 change in predicted MPG.
We could also say that Model 2 implies that MPG for an American car also changes the same
amount given an additional 100 pounds of car weight.
24. How is WEIGHT related to MPG, and how do we know? Pick the best statement(s) below.
(a) WEIGHT is negatively related to MPG (holding FOREIGN constant), because the coefficient for
WEIGHT in Model 2 is negative, and is significantly different from 0.
(b) Knowing the WEIGHT of a car (in addition to FOREIGN) helps to predict MPG, because Model 2
has a higher Adjusted R2 than Model 1.
(c) Both (a) and (b) are correct.
(d) Neither (a) nor (b) is correct.
Answer: (c)
The WEIGHT variable is significant, and negatively related to MPG, so (a) is true.
The model with WEIGHT added fits better than the model without WEIGHT (adjusted R2
increases), so (b) is also true.
25. What is the typical size of prediction errors if we use both WEIGHT and FOREIGN to predict MPG?
(a)
(b)
(c)
(d)
0.53
3.54
5.50
7.50
Answer: (b)
The residual SD for Model 2 (3.54) gives the size of the prediction errors when using WEIGHT and
FOREIGN as predictors.
26. What are the measurement units for the residuals for these regressions?
(a)
(b)
(c)
(d)
miles per gallon
pounds
American or Foreign
speed
Residuals are in the same units as the DV, which in this case is miles per gallon.
(The next 7 questions deal with the following information.)
Below are several regression models predicting the number of touchdown passes (TD) for n=35
quarterbacks (QBs) in the current NFL season. The predictor variables are:
 COMPLETE: total number of completed passes for each quarterback
 CONF: dummy variable for the conference (1=NFC, 0=AFC) that the QB plays in.
Predictors
Regression Equation
R2
CONF only
COMPLETE only
Both
Predicted TD = 11.1 – 3.1 CONF
Predicted TD = -0.37 + .072 COMPLETE
Predicted TD = 0.84 + .068 COMPLETE –1.3 CONF
.122
.590
.612
Adj.
R2
.095
.578
.588
Residual
SD
4.22
2.88
2.85
27. Overall, in this sample, did QBs in the NFC or the AFC throw more touchdown passes?
(a) QBs in the NFC threw 3.1 more touchdowns, on average, than QBs in the AFC.
(b) QBs in the AFC threw 3.1 more touchdowns, on average, than QBs in the NFC.
(c) QBs in the NFC threw 1.3 more touchdown, on average, than QBs in the AFC.
(d) QBs in the AFC threw 1.3 more touchdown, on average, than QBs in the NFC.
Answer: (b)
To get the overall average number of TD passes, we use the model with CONF as the only
predictor:
AFC: CONF=0  Predicted TD = 11.1
NFC: CONF=1  Predicted TD = 11.1 – 3.1 = 8.0
So the AFC avg is greater than NFC avg by 3.1 TDs.
28. Estimate the number of touchdown passes for a quarterback in the NFC who has thrown 200
completed passes.
(a) 8.0
(b) 13.1
(c) 14.0
(d) 14.4
Answer: (b)
Here we know that CONF=1 (because it’s an NFC QB) and COMPLETE = 200. We use the model
that has both CONF and COMPLETE as predictors:
Predicted TD = 0.84 + .068 * 200 – 1.3 *1 = 13.14
29. Estimate the average number of touchdown passes by all quarterbacks (in either conference) who have
thrown 100 completed passes.
(a) 4.1
(b) 6.3
(c) 6.8
(d) 7.6
Answer: (c)
Here we only know that COMPLETE = 100, and we don’t know CONF. So we use the model that
only has COMPLETE as a predictor:
Predicted TD = -0.37 + .072 * 100 = 6.83
30. Determine the regression and residual degrees of freedom for the model that estimates two parallel
lines, one for each conference.
(a) Regression df = 2; residual df = 32
(b) Regression df = 1; residual df = 33
(c) Regression df = 2; residual df = 33
(d) Regression df = 1; residual df = 34
Answer: (a)
The model with 2 parallel lines is the 3rd model, with k=2 predictors.
Regression df = k = 2, and residual df = n-k-1 = 35-2-1 = 32
31. In order to predict how many touchdown passes a quarterback has thrown, would you rather know the
conference he’s in, or the number of completed passes he’s thrown?
(a) The number of completed passes he’s thrown.
(b) The conference he’s in
(c) It doesn’t matter; both pieces of information are equally useful.
Answer: (a)
Which predictor is better on its own: CONF or COMPLETE? The model with just COMPLETE
has a much better fit than the one with just CONF.
32. On average, an additional touchdown pass occurs for every ____ additional completed passes.
(a) 3
(b) 7
(c) 14
(d) 42
Answer: (c)
How much does COMPLETE need to increase for TD to go up by 1? Call this unknown increase c.
Using the model with COMPLETE as the only predictor (because we’re ignoring the conference for
now): we want to solve the following: c*.072 = 1
Or: c = 1/.072 = 13.9, or about 14.
Note that if you mistakenly used the 3rd model, you’d get 1/.068 = 14.7, which is luckily still very
close to the right answer
33. Suppose we wanted to estimate two non-parallel lines predicting TD from COMPLETE for each
conference. What are the best independent variables and dependent variable to use for this regression
analysis?
(a)
(b)
(c)
(d)
IVs: TD, CONF, and TD*CONF
IVs: COMPLETE and CONF
IVs: COMPLETE*CONF
IVs: COMPLETE, CONF, and COMPLETE*CONF
DV:
DV:
DV:
DV:
COMPLETE
TD
TD
TD
Answer: (d)
Want the IVs to include COMPLETE, CONF and the interaction term COMPLETE*CONF
(The next 7 questions are based on the following information.)
Below are several regression models predicting the salaries of n=263 Major League baseball players
(from the 1987 season). The dependent variable is SAL, the player’s salary, measured in thousands of
dollars.
The predictor variables are:
HR:
number of home runs the player hit in the previous season
OUTF:
dummy variable indicating whether the player is an outfielder (OUTF=1) or not
OUTF=0).
HR*OUTF:
interaction term created by multiplying HR and OUTF together
Model
1
2
3
4
Regression Equation
Predicted SAL = 508 + 115 OUTF
Predicted SAL = 331 + 17.7 HR
Predicted SAL = 329 + 17.4 HR + 15 OUTF
Predicted SAL = 347 + 15.7 HR - 58 OUTF + 5.2 HR*OUTF
Quality of Fit
R2 = .012, Adjusted R2 = .008
R2 = .1177, Adjusted R2 = .114
R2 = .1178, Adjusted R2 = .111
R2 = .120, Adjusted R2 = .110
Detailed Excel output for Model 4:
Intercept
HR
OUTF
HR*OUTF
Coefficients
347
15.7
-58
5.2
Standard
Error
50
3.8
112
6.6
t Stat
7.0
4.1
-0.5
0.8
P-value Lower 95%
2.1E-11
249.7
0.00006
8.1
0.605
-278.9
0.431
-7.8
Upper
95%
445.0
23.3
162.7
18.2
34. Determine the average salary for all of the outfielders in the sample.
(a) $508,000
(b) $623,000
(c) $289,000
(d) $344,000
Answer: (b)
Here we use Model 1 – it will give us the average salaries for the two groups when we plug in the
appropriate dummy codes. The outfielders are given by OUTF=1:
Predicted SAL = 508 + 115 * 1 = 623 or $623,000
35. Jim R. is a player who hit 20 home runs. We don’t know what position he played. Based on this
information, what would we predict his salary to be?
(a)
(b)
(c)
(d)
$623,000
$661,000
$677,000
$685,000
Answer: (d)
Here we know only HR, so we use Model 2:
predicted SAL = 331 + 17.7*20 = 685
36. A hypothesis test of the slope coefficient in Model 1 would be identical to what kind of analysis we
encountered in the first half of the course?
(a)
(b)
(c)
(d)
Comparing two population means, using independent samples
Comparing two population means, using paired samples
Using the normal distribution to approximate the binomial distribution
Calculating the variance of a Bernoulli random variable
Answer: (a)
The slope coefficient in Model 1 represents the difference in average salary between outfielders and
non-outfielders. These are two separate, independent groups of players. So when testing the slope,
we’re comparing two population means, using independent samples.
37. Using the model that estimates non-parallel regression lines relating SAL to HR for outfielders and
non-outfielders – on average, how much more would outfielders who hit 25 home-runs make, compared
to outfielders who hit 15 home-runs?
(a) $52,000
(b) $157,000
(c) $174,000
(d) $209,000
Answer: (d)
Here we use Model 4 – that’s the one that estimates non-parallel regression lines…
The slope of the line relating SAL to HR for outfielders (OUTF=1) is the coefficient for HR plus the
coef from HR*OUTF: 15.7 + 5.2 = 20.9
So 10 home-run change from HR=15 to HR=25 means a change of 20.9*10 = 209 or $209,000
38. How is the relationship between salary and home-runs different for outfielders and non-outfielders?
Choose the best explanation below.
(a) Outfielders make slightly more for each home-run (compared to non-outfielders), and the difference
is statistically significant.
(b) Outfielders make slightly less for each home-run (compared to non-outfielders), and the difference is
statistically significant.
(c) Outfielders make slightly more for each home-run (compared to non-outfielders), but the difference
is not statistically significant.
(d) Outfielders make slightly less for each home-run (compared to non-outfielders), but the difference is
not statistically significant.
Answer: (c)
The coef for HR*OUTF tells about the difference in slopes for the outfielders and the nonoutfielders. It’s positive (5.2) which means that the outfielders make $5200 more per home run
compared to the non-outfielders. But the coefficient is not statistically significant (p-value=.431)
39. If you drew a picture of the predictions of best model (based on Adjusted R2), what would it look
like?
(a) One line relating salary to position for any number of home-runs.
(b) One line relating salary to home-runs for both outfielders and non-outfielders.
(c) Two parallel lines relating salary to home-runs, one for outfielders and one for non-outfielders.
(d) Two non-parallel lines relating salary to home-runs, one for outfielders and one for non-outfielders.
Answer: (b)
Best model based on Adjusted R2 is Model 2, which describes just one regression line relating salary
to home-runs for both groups.
40. Determine the regression and residual degrees of freedom for Model 4.
(a)
(b)
(c)
(d)
Regression df = 1;
Regression df = 2;
Regression df = 3;
Regression df = 4;
Residual df = 261.
Residual df = 261.
Residual df = 259.
Residual df = 263.
Answer: (c)
regression df = # of predictors = 3
residual df = n - # of predictors –1 = 263 – 3 – 1 = 259
(The next 8 questions deal with the following information.)
Let’s consider some data (from the U.S. Bureau of Labor Statistics) on wages for different groups of
workers, measured over a ten-year period. We’ll consider workers from 3 industries: mining,
construction, and manufacturing. These three types of employees are coded by two dummy variables:
Construc (1 for construction workers, 0 otherwise) and Mining (1 for mining workers, 0 otherwise).
There are 120 observations for each type of worker, one observation for each month starting in January
1992 and ending in December 2001. The variable Month (ranging from 1 to 120) tells us the time period
for each measurement. (Two interaction terms are also computed: Construc*Month and
Mining*Month.)
The outcome variable is the average Wage (dollars/hour) for each type of employee measured for each
month. Below are the first few rows of the data (showing the first 3 months for the construction
workers); there are n=360 observations total.
Wage
14.07
14.02
14.12
…
Month
1
2
3
…
Construc
1
1
1
…
Mining
0
0
0
…
Construc*Month
1
2
3
…
Mining*Month
0
0
0
…
Four regression models are described below, using Construc, Mining, Month, and the interaction terms as
predictors.
Model
Regression Equation
R2
1
2
3
4
Predicted Wage = 13.02 + 2.96 Construc + 2.97 Mining
Predicted Wage = 12.92 + .034 Month
Predicted Wage = 10.95 + .034 Month + 2.96 Construc + 2.97 Mining
Predicted Wage = 11.13 + .031 Month + 2.41 Construc + 2.97 Mining
+ .009 Construc*Month - .0001 Mining*Month
.5733
.4117
.9850
.9913
Adj.
R2
.5709
.4101
.9849
.9912
Residual
SD
1.21
1.42
0.23
0.17
All regression coefficients in these models are highly statistically significant (p-values < .01), except for
the -.0001 for the Mining*Month term in Model 4 (p-value=.88).
41. Determine the overall average value of Wage for each type of worker, over the entire 10-year period.
(a) overall average Wage = $10.95 for manufacturing, $13.91 for construction, and $13.92 for mining.
(b) overall average Wage = $11.13 for manufacturing, $13.54 for construction, and $14.10 for mining.
(c) overall average Wage = $13.02 for manufacturing, $15.98 for construction, and $15.99 for mining.
(d) overall average Wage = $12.92 for all three groups.
Answer: (c)
Use Model 1 and plug in the dummy codes:
baseline group, manufacturing: average Wage = 13.02 + 2.96*0 + 2.97*0 = 13.02
construction:
average Wage = 13.02 + 2.96*1 + 2.97*0 = 15.98
mining:
average Wage = 13.02 + 2.96*0 + 2.97*1 = 15.99
42. Use Model 3 to determine the predicted value of Wage for the first observation in the data set (that is,
for construction workers in the first month). Is this prediction too high or too low compared to the actual
value of Wage?
(a) predicted Wage =$13.94; prediction is too low.
(b) predicted Wage =$15.98; prediction is too high.
(c) predicted Wage =$10.95; prediction is too low.
(d) predicted Wage =$13.94; prediction is too high.
Answer: (a)
First observation: Construc=1, Mining = 0, Month=1
Predicted Wage = 10.95 + .034 *1 + 2.96 *1 + 2.97 *0 = $13.94
Actual Wage = 14.07 (from first row of data table, above)
prediction is too low, it’s less than the actual value
(Questions about regressions predicting Wage, continued.)
43. Which of the following best describes the predictions of Model 3?
(a) Three parallel lines relating Wage to Month. Controlling for Month, mining and construction
workers tend to make about $3 more per hour than manufacturing workers.
(b) Three parallel lines relating Wage to Month. Controlling for Month, manufacturing workers make
about $3 more per hour than mining and construction workers.
(c) Three lines relating Wage to Month, with different slopes for the three groups. The line is steepest
for mining workers.
(d) Three lines relating Wage to Month, with different slopes for the three groups. The line is steepest
for construction workers.
Answer: (a)
A model with k dummy predictors and one continuous predictor will estimate k+1 parallel lines. In
this case, the coefficients of 2.96 and 2.97 for Construc and Mining tell us that the lines for those
two groups are above the baseline group, manufacturing
44. Use the best-fitting model to estimate how much the hourly wages for construction workers tended to
increase each month.
(a) Each month, average hourly wages for construction workers tended by increase by about $2.41.
(b) Each month, average hourly wages for construction workers tended by increase by about 4 cents.
(c) Each month, average hourly wages for construction workers tended by increase by about 3 cents.
(d) Each month, average hourly wages for construction workers tended by increase by about 1 cent.
Answer: (b)
Best-fitting model is Model 4.
Plug in Construc =1 and Mining=0 to get the slope of the line relating Wage to Month:
Predicted Wage = 11.13 + .031 Month + 2.41 *1 + 2.97 *0 + .009 * 1 *Month - .0001 * 0 *Month
Predicted Wage = 11.13 + 2.41 + (.031 + .009) Month = 13.54 + .04 Month
Slope is .04, or an increase of about 4 cents per Month.
45. Based on the best-fitting model, how well can we predict Wage based on the Month and the type of
worker?
(a) Our typical prediction error will be about 1 cent.
(b) Our typical prediction error will be about 3 cents.
(c) Our typical prediction error will be about 17 cents.
(d) Our typical prediction error will be about 23 cents.
Answer: (c)
Residual SD for Model 4 will give the typical prediction error: 0.17 or about 17 cents.
46.
(a)
(b)
(c)
(d)
Determine the correlation coefficient between the variables Month and Wage.
r=0
r = .64
r = .76
r = .99
Answer: (b)
The correlation r is the square root of R2 for Model 2 = sqrt(. 4117) = +/- .64
Because the slope of the line for Model 2 is positive, the correlation is positive too, so r = .64
47. Notice that the R2 for Model 3 is equal to the R2 for Model 1 plus the R2 for Model 2. Also, the
regression slopes in Model 3 are the same as in Model 1 and Model 2. Why is this?
(a) Because all the coefficients are statistically significant.
(b) Because Month isn’t correlated with either Construc or Mining, so the predictors in Model 1 and
Model 2 aren’t redundant.
(c) Because wider is better.
(d) I’m not sure, but I’m sure magic statistical fairy dust is involved.
Answer: (b)
Recall that when predictors are uncorrelated (as in the hayfever example in class), the R2 values
will add up nicely, and the coefficients for the predictors will stay the same, because the predictors
aren’t redundant (they don’t overlap).
48. Which of the following could be the overall standard deviation of the 360 values of Wage in the data
set? (Hint: No formal calculations are needed here.)
(a) $0.17
(b) $1.21
(c) $1.42
(d) $1.85
Answer: (d)
Overall SD of Y must be greater than the residual SD of any model predicting Y. So the overall SD
of Y needs to be greater than 1.42, which is the largest residual SD given for the 4 models.
Optional: In fact, with some algebra we can directly calculate the overall SD of INCOME Recall
that the variance of income = SSTotal/(n-1).
So what we want to calculate is sqrt(SSTotal / (n-1))
Because the unexplained proportion of variance is (1-R2)=SSResidual / SSTotal, we can solve for
SSTotal:
SSTotal = SSResidual / (1-R2)
And SSResidual is related to the Residual SD in the ANOVA table:
SSResidual = (Residual SD)^2 * (Residual df)
Putting it all together we get the ugly formula:
SD of Y = sqrt[(Residual SD)^2 * (Residual df) / ((1-R2) * (n-1)) ]
We can plug in the Residual SD,R2, and Residual df (which is n-k-1) for any of the models, and
we’ll get the same answer (within some rounding error).
Using Model 1: SD of Y = sqrt[1.21^2 * (360-3) / ((1-.5733) * (360-1)) ] = 1.847
Using Model 2: SD of Y = sqrt[1.42^2 * (360-2) / ((1-.4117) * (360-1)) ] = 1.849
etc.
Download