Midterm 1 Practice Short Answer

Practice Questions for Exam 1
1) The height of male students at your college/university is normally distributed with a mean of 70
inches and a standard deviation of 3.5 inches. If you had a list of telephone numbers for male
students for the purpose of conducting a survey, what would be the probability of randomly calling
one of these students whose height is:
(a) taller than 6'0"?
(b) between 5'3" and 6'5"?
(c) shorter than 5'7", the mean height of female students?
(d) shorter than 5'0"?
(e) taller than Shaquille O'Neal, the center of the Boston Celtics, who is 7'1" tall?
Compare this to the probability of a woman being pregnant for 10 months (300 days), where days of
pregnancy is normally distributed with a mean of 266 days and a standard deviation of 16 days.
Answer:
(a) Pr(Z > 0.5714) = 0.2839;
(b) Pr( –2 < Z < 2) = 0.9545 or approximately 0.95;
(c) Pr(Z < -0.8571) = 0.1957;
(d) Pr(Z < -2.8571) = 0.0021;
(e) Pr(Z > 4.2857) = 0.000009 (the text does not show values above 2.99 standard deviations; Pr(Z > 2.99) =
0.0014). For the pregnancy comparison, Pr(Z > 2.1250) = 0.0168.
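The probabilities in this answer can be checked numerically. The following is a minimal sketch that assumes Python's standard-library statistics.NormalDist; the helper to_inches and the variable names are hypothetical and only serve the illustration.

```python
from statistics import NormalDist

# Parameters are taken from the question.
height = NormalDist(mu=70, sigma=3.5)      # male height in inches
pregnancy = NormalDist(mu=266, sigma=16)   # pregnancy length in days


def to_inches(feet, inches):
    """Convert a height given in feet and inches to inches (hypothetical helper)."""
    return 12 * feet + inches


p_a = 1 - height.cdf(to_inches(6, 0))                            # taller than 6'0"
p_b = height.cdf(to_inches(6, 5)) - height.cdf(to_inches(5, 3))  # between 5'3" and 6'5"
p_c = height.cdf(to_inches(5, 7))                                # shorter than 5'7"
p_d = height.cdf(to_inches(5, 0))                                # shorter than 5'0"
p_e = 1 - height.cdf(to_inches(7, 1))                            # taller than 7'1"
p_pregnancy = 1 - pregnancy.cdf(300)                             # longer than 300 days

for label, p in [("(a)", p_a), ("(b)", p_b), ("(c)", p_c),
                 ("(d)", p_d), ("(e)", p_e), ("pregnancy", p_pregnancy)]:
    print(f"{label:10s} {p:.6f}")
```

Unlike a printed table, the cumulative distribution function has no cutoff at 2.99 standard deviations, so part (e) can be evaluated directly.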
2) Adult males are taller, on average, than adult females. Visiting two recent American Youth Soccer
Organization (AYSO) under 12 year old (U12) soccer matches on a Saturday, you do not observe an
obvious difference in the height of boys and girls of that age. You suggest to your little sister that she
collect data on height and gender of children in 4th to 6th grade as part of her science project. The
accompanying table shows her findings.
Height of Young Boys and Girls, Grades 4-6, in inches

Group    Mean height   Standard deviation   n
Boys     57.8          3.9                  55
Girls    58.4          4.2                  57
(a) Let your null hypothesis be that there is no difference in the height of females and males at this age
level. Specify the alternative hypothesis.
(b) Find the difference in height and the standard error of the difference.
(c) Generate a 95% confidence interval for the difference in height.
(d) Calculate the t-statistic for comparing the two means. Is the difference statistically significant at the
1% level? Which critical value did you use? Why would this number be smaller if you had assumed a
one-sided alternative hypothesis? What is the intuition behind this?
Answer:
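(a) \( H_0: \mu_{Boys} - \mu_{Girls} = 0 \) vs. \( H_1: \mu_{Boys} - \mu_{Girls} \neq 0 \)
(b) \( \bar{Y}_{Boys} - \bar{Y}_{Girls} = -0.6 \), \( SE(\bar{Y}_{Boys} - \bar{Y}_{Girls}) = \sqrt{\frac{3.9^2}{55} + \frac{4.2^2}{57}} = 0.77 \).
(c) -0.6 ± 1.96 × 0.77 = (-2.11, 0.91).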
(d) t = -0.78, so |t| < 2.58, which is the two-sided critical value at the 1% level. Hence you cannot reject the null
hypothesis. The critical value for the one-sided hypothesis would have been 2.33. Assuming a one-sided
hypothesis implies that you have some information about the problem at hand, and, as a result, can be
more easily convinced than if you had no prior expectation.
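The arithmetic behind parts (b) through (d) can be sketched in a few lines. This is a hedged illustration that assumes the table columns are the sample mean, sample standard deviation, and sample size for each group; the variable names are hypothetical.

```python
from math import sqrt

# Sample statistics from the table (assumed to be mean, std. dev., n).
mean_boys, sd_boys, n_boys = 57.8, 3.9, 55
mean_girls, sd_girls, n_girls = 58.4, 4.2, 57

# Difference in means and its standard error (unequal variances).
diff = mean_boys - mean_girls
se_diff = sqrt(sd_boys**2 / n_boys + sd_girls**2 / n_girls)

# 95% confidence interval using the large-sample critical value 1.96.
ci_95 = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

# t-statistic for H0: no difference in means.
t_stat = diff / se_diff

# Two-sided test at the 1% level uses 2.58; a one-sided test uses 2.33.
reject_two_sided_1pct = abs(t_stat) > 2.58

print(f"difference = {diff:.2f}, SE = {se_diff:.2f}")
print(f"95% CI = ({ci_95[0]:.2f}, {ci_95[1]:.2f})")
print(f"t = {t_stat:.2f}, reject at 1%: {reject_two_sided_1pct}")
```

Because the code carries the unrounded standard error forward, the interval endpoints may differ in the second decimal from the answer above, which rounds the standard error to 0.77 first.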
3) Assume that two presidential candidates, call them Bush and Gore, receive 50% of the votes in the
population. You can model this situation as a Bernoulli trial, where Y is a random variable with success
probability Pr(Y = 1) = p, and where Y = 1 if a person votes for Bush and Y = 0 otherwise. Furthermore, let
p 1  p 
) in reasonably large
p̂ be the fraction of successes (1s) in a sample, which is distributed N(p,
n
samples, say for n ≥ 40.
(a) Given your knowledge about the population, find the probability that in a random sample of 40, Bush
would receive a share of 40% or less.
(b) How would this situation change with a random sample of 100?
(c) Given your answers in (a) and (b), would you be comfortable to predict what the voting intentions for
the entire population are if you did not know p but had polled 10,000 individuals at random and
calculated p̂ ? Explain.
(d) This result seems to hold whether you poll 10,000 people at random in the Netherlands or the United
States, where the former has a population of less than 20 million people, while the United States is 15
times as populous. Why does the population size not come into play?
Answer:
(a) \( \Pr(\hat{p} < 0.40) = \Pr\left(Z < \frac{0.40 - 0.50}{\sqrt{0.25/40}}\right) = \Pr(Z < -1.26) \approx 0.104 \). In roughly every 10th sample of this size,
Bush would receive a vote of less than 40%, although in truth, his share is 50%.
(b) \( \Pr(\hat{p} < 0.40) = \Pr\left(Z < \frac{0.40 - 0.50}{\sqrt{0.25/100}}\right) = \Pr(Z < -2.00) \approx 0.023 \). With this sample size, you would expect this
to happen only every 50th sample.
(c) The answers in (a) and (b) suggest that for even moderate increases in the sample size, the estimator
does not vary too much from the population mean. Polling 10,000 individuals, the probability of finding a
p̂ of 0.48 or less, for example, would be 0.00003. Unless the election was extremely close, which the 2000 election
was, polls are quite accurate even for sample sizes of 2,500.
(d) The distribution of sample means shrinks very quickly depending on the sample size, not the
population size. Although at first this does not seem intuitive, the standard error of an estimator is a
value which indicates by how much the estimator varies around the population value. For large sample
sizes, the sample mean typically is very close to the population mean.
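The normal approximation used in (a) through (c) can be sketched as follows. This is a minimal illustration that assumes Python's statistics.NormalDist; the function name prob_share_at_most is hypothetical. Note that only the sample size n enters the standard error, not the population size, which is the point of part (d).

```python
from math import sqrt
from statistics import NormalDist


def prob_share_at_most(share, n, p=0.5):
    """Approximate Pr(p_hat <= share) using p_hat ~ N(p, p(1-p)/n)."""
    se = sqrt(p * (1 - p) / n)          # depends on n only, not on population size
    z = (share - p) / se
    return NormalDist().cdf(z)


p_n40 = prob_share_at_most(0.40, 40)         # part (a)
p_n100 = prob_share_at_most(0.40, 100)       # part (b)
p_n10000 = prob_share_at_most(0.48, 10_000)  # example in part (c)

print(f"n = 40:     {p_n40:.4f}")
print(f"n = 100:    {p_n100:.4f}")
print(f"n = 10,000: {p_n10000:.6f}")
```

The code keeps the unrounded z-value, so the result for n = 40 can differ in the third decimal from the answer above, which rounds z to -1.26 before looking up the probability.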
4) You have collected weekly earnings and age data from a sub-sample of 1,744 individuals using the
Current Population Survey in a given year.
(a) Given the overall mean of $434.49 and a standard deviation of $294.67, construct a 99% confidence
interval for average earnings in the entire population. State the meaning of this interval in words, rather
than just in numbers. If you constructed a 90% confidence interval instead, would it be smaller or larger?
What is the intuition?
(b) When dividing your sample into people 45 years and older, and younger than 45, the information
shown in the table is found.
Age Category   Average Earnings (Ȳ)   Standard Deviation (sY)   N
Age ≥ 45       $488.87                $328.64                   507
Age < 45       $412.20                $276.63                   1237
Test whether or not the difference in average earnings is statistically significant. Given your knowledge
of age-earning profiles, does this result make sense?
Answer:
(a) The confidence interval for mean weekly earnings is \( 434.49 \pm 2.58 \times \frac{294.67}{\sqrt{1744}} = 434.49 \pm 18.20 = (416.29, 452.69) \).
Based on the sample at hand, the best guess for the population mean is $434.49.
However, because of random sampling error, this guess is likely to be wrong. Instead, the interval
estimate for the average earnings lies between $416.29 and $452.69. Committing to such an interval
repeatedly implies that the resulting statement is incorrect 1 out of 100 times. For a 90% confidence
interval, the only change in the calculation of the confidence interval is to replace 2.58 by 1.64. Hence the
confidence interval is smaller. A smaller interval implies, given the same average earnings and the
standard deviation, that the statement will be false more often. The larger the confidence interval, the
more likely it is to contain the population value.
 488.87  412.20 
(b) Assuming unequal population variances, t =
= 4.62, which is statistically
328.642 276.632

507
12.7
significant at conventional levels whether you use a two-sided or one-sided alternative. Hence the null
hypothesis of equal average earnings in the two groups is rejected. Age-earning profiles typically take on
an inverted U-shape. Maximum earnings occur in the 40s, depending on some other factors such as years
of education, which are not considered here. Hence it is not clear if the alternative hypothesis should be
one-sided or two-sided. In such a situation, it is best to assume a two-sided alternative hypothesis.
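Both calculations in this answer can be reproduced with a short sketch. The variable names are hypothetical; the critical values 2.58 and 1.64 are the ones used in the answer.

```python
from math import sqrt

# Part (a): confidence intervals for mean weekly earnings.
n, mean, sd = 1744, 434.49, 294.67
se_mean = sd / sqrt(n)

ci_99 = (mean - 2.58 * se_mean, mean + 2.58 * se_mean)
ci_90 = (mean - 1.64 * se_mean, mean + 1.64 * se_mean)

# Part (b): t-statistic for the difference in means, unequal variances.
mean_old, sd_old, n_old = 488.87, 328.64, 507        # age >= 45
mean_young, sd_young, n_young = 412.20, 276.63, 1237  # age < 45

se_diff = sqrt(sd_old**2 / n_old + sd_young**2 / n_young)
t_stat = (mean_old - mean_young) / se_diff

print(f"99% CI: ({ci_99[0]:.2f}, {ci_99[1]:.2f})")
print(f"90% CI: ({ci_90[0]:.2f}, {ci_90[1]:.2f})")
print(f"t = {t_stat:.2f}")
```

The 90% interval is nested inside the 99% interval because only the critical value changes while the standard error stays the same.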
5) Sir Francis Galton, a cousin of Charles Darwin, examined the relationship between the height of children
and their parents towards the end of the 19th century. It is from this study that the name "regression"
originated. You decide to update his findings by collecting data from 110 college students, and estimate
the following relationship:
\( \widehat{Studenth} = 19.6 + 0.73 \times Midparh \), \( R^2 = 0.45 \), SER = 2.0
where Studenth is the height of students in inches, and Midparh is the average of the parental heights.
(Following Galton's methodology, both variables were adjusted so that the average female height was
equal to the average male height.)
(a) Interpret the estimated coefficients.
(b) What is the meaning of the regression R2?
(c) What is the prediction for the height of a child whose parents have an average height of 70.06 inches?
(d) What is the interpretation of the SER here?
(e) Given the positive intercept and the fact that the slope lies between zero and one, what can you say
about the height of students who have quite tall parents? Those who have quite short parents?
Answer:
(a) For every one inch increase in the average height of their parents, the student's height increases by
0.73 of an inch. There is no reasonable interpretation for the intercept.
(b) The model explains 45 percent of the variation in the height of students.
(c) 19.6 + 0.73 × 70.06 = 70.74.
(d) The SER is a measure of the spread of the observations around the regression line. The magnitude of
the typical deviation from the regression line or the typical regression error here is two inches.
(e) Tall parents will have, on average, tall students, but they will not be as tall as their parents. Short
parents will have short students, although on average, they will be somewhat taller than their parents.
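The prediction in (c) and the regression-toward-the-mean pattern in (e) can be illustrated with a small sketch. The function name and the example parental heights of 76 and 64 inches are hypothetical choices for illustration; the coefficients are the estimates from the question.

```python
INTERCEPT = 19.6
SLOPE = 0.73


def predict_student_height(midparh):
    """Predicted student height (inches) from the average parental height."""
    return INTERCEPT + SLOPE * midparh


# Part (c): prediction for parents averaging 70.06 inches.
pred_c = predict_student_height(70.06)

# Part (e): the height at which predicted child height equals parental height.
# Solving h = INTERCEPT + SLOPE * h gives h = INTERCEPT / (1 - SLOPE).
crossover = INTERCEPT / (1 - SLOPE)

pred_tall = predict_student_height(76.0)   # hypothetical quite tall parents
pred_short = predict_student_height(64.0)  # hypothetical quite short parents

print(f"prediction at 70.06: {pred_c:.2f}")
print(f"crossover height:    {crossover:.2f}")
print(f"parents 76 -> child {pred_tall:.2f}")
print(f"parents 64 -> child {pred_short:.2f}")
```

Parents above the crossover height are predicted to have children shorter than themselves, and parents below it are predicted to have children taller than themselves.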
6) The baseball team nearest to your home town is, once again, not doing well. Given that your
knowledge of what it takes to win in baseball is vastly superior to that of management, you want to find
out what it takes to win in Major League Baseball (MLB). You therefore collect the winning percentage of
all 30 baseball teams in MLB for 1999 and regress the winning percentage on what you consider the
primary determinant for wins, which is quality pitching (team earned run average). You find the
following information on team performance:
Summary of the Distribution of Winning Percentage and Team Earned Run Average for MLB in 1999

                               Standard    Percentile
                     Average   deviation   10%    25%    40%    50% (median)   60%    75%    90%
Team ERA             4.71      0.53        3.84   4.35   4.72   4.78           4.91   5.06   5.25
Winning Percentage   0.50      0.08        0.40   0.43   0.46   0.48           0.49   0.59   0.60
(a) What is your expected sign for the regression slope? Will it make sense to interpret the intercept? If
not, should you omit it from your regression and force the regression line through the origin?
(b) OLS estimation of the relationship between the winning percentage and the team ERA yield the
following:
\( \widehat{Winpct} = 0.94 - 0.10 \times teamera \), \( R^2 = 0.49 \), SER = 0.06,
where winpct is measured as wins divided by games played, so for example a team that won half of its
games would have Winpct = 0.50. Interpret your regression results.
(c) It is typically sufficient to win 90 games to be in the playoffs and/or to win a division. Winning over
100 games a season is exceptional: the Atlanta Braves had the most wins in 1999 with 103. Teams play a
total of 162 games a year. Given this information, do you consider the slope coefficient to be large or
small?
(d) What would be the effect on the slope, the intercept, and the regression R2 if you measured Winpct in
percentage points, i.e., as (Wins/Games) × 100?
(e) Are you impressed with the size of the regression R2? Given that there is 51% of unexplained variation
in the winning percentage, what might some of these factors be?
Answer:
(a) You expect a negative relationship, since a higher team ERA implies a lower quality of the input. No
team comes close to a zero team ERA, and therefore it does not make sense to interpret the intercept.
Forcing the regression through the origin is a false implication from this insight. Instead the intercept
fixes the level of the regression.
(b) For every one point increase in Team ERA, the winning percentage decreases by 10 percentage points,
or 0.10. Roughly half of the variation in winning percentage is explained by the quality of team pitching.
(c) The coefficient is large, since increasing the winning percentage by 0.10 is the equivalent of winning 16
more games per year. Since it is typically sufficient to win 56 percent of the games to qualify for the
playoffs, this difference of 0.10 in winning percentage can easily turn a losing team into a winning
team.
(d) Clearly the regression R2 will not be affected by a change in scale, since a descriptive measure of the
quality of the regression would depend on whim otherwise. The slope of the regression will compensate
in such a way that the interpretation of the result is unaffected, i.e., it will become –10 in the above
example. The intercept will also change to reflect the fact that if X were 0, then the dependent variable
would now be measured in percentage, i.e., it will become 94.0 in the above example.
(e) It is impressive that a single variable can explain roughly half of the variation in winning percentage.
Answers to the second question will vary by student, but will typically include the quality of hitting,
fielding, and management. Salaries could be included, but should be reflected in the inputs.
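The scaling argument in (d) can be checked numerically. The sketch below fits a one-regressor OLS line by hand on a small made-up data set (the ERA and winning-fraction values are hypothetical, not the 1999 MLB data) and refits after multiplying the dependent variable by 100.

```python
def ols(x, y):
    """Return (intercept, slope, r_squared) for a one-regressor OLS fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ssr / tss


# Hypothetical data, made up for illustration only.
era = [3.8, 4.3, 4.7, 4.8, 4.9, 5.1, 5.3]
winpct = [0.60, 0.52, 0.50, 0.47, 0.49, 0.44, 0.40]

b0, b1, r2 = ols(era, winpct)
b0_pct, b1_pct, r2_pct = ols(era, [100 * w for w in winpct])

print(f"fraction scale:   intercept={b0:.3f}, slope={b1:.3f}, R2={r2:.3f}")
print(f"percentage scale: intercept={b0_pct:.1f}, slope={b1_pct:.1f}, R2={r2_pct:.3f}")

# Part (c): a 0.10 change in winning percentage over a 162-game season.
extra_wins = 0.10 * 162
print(f"extra wins: {extra_wins:.1f}")
```

Under these assumptions the intercept and slope are both multiplied by 100 while the R2 is unchanged, which matches the reasoning in the answer.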
7) In 2001, the Arizona Diamondbacks defeated the New York Yankees in the Baseball World Series in 7
games. Some players, such as Bautista and Finley for the Diamondbacks, had a substantially higher
batting average during the World Series than during the regular season. Others, such as Brosius and Jeter
for the Yankees, did substantially poorer. You set out to investigate whether or not the regular season
batting average is a good indicator for the World Series batting average. The results for 11 players who
had the most at bats for the two teams are:
\( \widehat{AZWsavg} = -0.347 + 2.290 \times AZSeasavg \), \( R^2 = 0.11 \), SER = 0.145,
\( \widehat{NYWsavg} = 0.134 + 0.136 \times NYSeasavg \), \( R^2 = 0.001 \), SER = 0.092,
where Wsavg and Seasavg indicate the batting average during the World Series and the regular season
respectively.
(a) Focusing on the coefficients first, what is your interpretation?
(b) What can you say about the explanatory power of your equation? What do you conclude from this?
Answer:
(a) The two regressions are quite different. For the Diamondbacks, players who had a 10 point higher
batting average during the regular season had roughly a 23 point higher batting average during the
World Series. Hence top performers did relatively better. The opposite holds for the Yankees.
(b) Both regressions have little explanatory power as seen from the regression R2. Hence performance
during the season is a poor forecast of World Series performance.
8) You have obtained a sample of 14,925 individuals from the Current Population Survey (CPS) and are
interested in the relationship between average hourly earnings and years of education. The regression
yields the following result:
\( \widehat{ahe} = -4.58 + 1.71 \times educ \), \( R^2 = 0.182 \), SER = 9.30
where ahe and educ are measured in dollars and years respectively.
a. Interpret the coefficients and the regression R2.
b. Is the effect of education on earnings large?
c. Why should education matter in the determination of earnings? Do the results suggest that there is a
guarantee for average hourly earnings to rise for everyone as they receive an additional year of
education? Do you think that the relationship between education and average hourly earnings is linear?
d. The average years of education in this sample is 13.5 years. What is the mean of average hourly earnings
in the sample?
e. Interpret the measure SER. What is its unit of measurement?
Answer:
a. On average, a person with one more year of education earns $1.71 more per hour. There is no meaning
attached to the intercept; it just determines the height of the regression. The model explains 18.2 percent of
the variation in average hourly earnings.
b. The difference between a high school graduate and a college graduate is four years of education. Hence
a college graduate will earn almost $7 more per hour, on average ($6.84 to be precise). If you assume that
there are 2,000 working hours per year, then the average salary difference would be close to $14,000
(actually $13,680). Depending on how much you have spent for an additional year of education and how
much income you have forgone, this does not seem particularly large.
c. In general, you would expect to find a positive relationship between years of education and average
hourly earnings. Education is considered investment in human capital. If this were not the case, then it
would be a puzzle as to why there are students in the econometrics course — surely they are not there to
just "find themselves" (which would be quite expensive in most cases). However, if you consider
education as an investment and you wanted to see a return on it, then the relationship will most likely
not be linear. For example, a constant percent return would imply an exponential relationship whereby
the additional year of education would bring a larger increase in average hourly earnings at higher levels
of education. The results do not suggest that there is a guarantee for earnings to rise for everyone as they
become more educated since the regression R2 does not equal 1. Instead the result holds "on average."
d. Since \( \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \), it follows that \( \bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} \). Substituting the estimates for the slope and the intercept then
results in a mean of average hourly earnings of roughly $18.50.
e. The typical prediction error is $9.30. Since the measure is related to the deviation of the actual and
fitted values, the unit of measurement must be the same as that of the dependent variable, which is in
dollars here.
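The numbers in parts b and d follow directly from the estimated coefficients. A minimal sketch, with hypothetical variable names and the 2,000 working hours per year assumed in the answer:

```python
INTERCEPT = -4.58   # estimated intercept, dollars per hour
SLOPE = 1.71        # estimated slope, dollars per hour per year of education

# Part b: four more years of education (high school vs. college graduate).
hourly_gap = 4 * SLOPE
annual_gap = hourly_gap * 2000   # assumes 2,000 working hours per year

# Part d: the OLS line passes through the sample means,
# so mean(ahe) = intercept + slope * mean(educ).
mean_educ = 13.5
mean_ahe = INTERCEPT + SLOPE * mean_educ

print(f"hourly gap:  ${hourly_gap:.2f}")
print(f"annual gap:  ${annual_gap:,.0f}")
print(f"mean of ahe: ${mean_ahe:.2f}")
```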
9) You have obtained measurements of height in inches of 29 female and 81 male students (Studenth) at
your university. A regression of the height on a constant and a binary variable (BFemme), which takes a
value of one for females and is zero otherwise, yields the following result:
\( \widehat{Studenth} = \underset{(0.3)}{71.0} - \underset{(0.57)}{4.84} \times BFemme \), \( R^2 = 0.40 \), SER = 2.0
(a) What is the interpretation of the intercept? What is the interpretation of the slope? How tall are
females, on average?
(b) Test the hypothesis that females, on average, are shorter than males, at the 1% level.
(c) Is it likely that the error term is homoskedastic here?
Answer:
(a) The intercept gives you the average height of males, which is 71 inches in this sample. The slope tells
you by how much shorter females are, on average (almost 5 inches). The average height of females is
therefore approximately 66 inches.
(b) The t-statistic for the difference in means is -8.49. For a one-sided test, the critical value is –2.33. Hence
the difference is statistically significant.
(c) It is safer to assume that the variances for males and females are different. In the underlying sample the
standard deviation for females was smaller.
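Parts (a) and (b) reduce to simple arithmetic on the reported coefficients and standard errors. The sketch below assumes Python's statistics.NormalDist for the one-sided critical value; the variable names are hypothetical.

```python
from statistics import NormalDist

intercept, se_intercept = 71.0, 0.3   # average male height and its standard error
slope, se_slope = -4.84, 0.57         # female minus male difference and its standard error

# Part (a): average female height is intercept plus slope.
female_mean = intercept + slope

# Part (b): one-sided test of H0: slope = 0 against H1: slope < 0 at the 1% level.
t_stat = slope / se_slope
critical_value = NormalDist().inv_cdf(0.01)   # large-sample lower-tail critical value
reject = t_stat < critical_value

print(f"average female height: {female_mean:.2f}")
print(f"t = {t_stat:.2f}, critical value = {critical_value:.2f}, reject: {reject}")
```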
10) You have collected 14,925 observations from the Current Population Survey. There are 6,285 females
in the sample, and 8,640 males. The females report a mean of average hourly earnings of $16.50 with a
standard deviation of $9.06. The males have an average of $20.09 and a standard deviation of $10.85. The
overall mean average hourly earnings is $18.58.
a. Using the t-statistic for testing differences between two means (section 3.4 of your textbook), decide
whether or not there is sufficient evidence to reject the null hypothesis that females and males have
identical average hourly earnings.
b. You decide to run two regressions: first, you simply regress average hourly earnings on an intercept
only. Next, you repeat this regression, but only for the 6,285 females in the sample. What will the
regression coefficients be in each of the two regressions?
c. Finally you run a regression over the entire sample of average hourly earnings on an intercept and a
binary variable DFemme, where this variable takes on a value of 1 if the individual is a female, and is 0
otherwise. What will be the value of the intercept? What will be the value of the coefficient of the binary
variable?
d. What is the standard error on the slope coefficient? What is the t-statistic?
Answer:
a. H0: μF = μM; H1: μF ≠ μM
\( t = \frac{20.09 - 16.50}{\sqrt{\frac{10.85^2}{8640} + \frac{9.06^2}{6285}}} = 21.98 \). As a result, you can comfortably reject the null hypothesis at any reasonable
significance level.
b. \( \widehat{ahe} = \hat{\beta}_0 = 18.58 \); \( \widehat{ahe} = \hat{\beta}_0 = 16.50 \)
Hence for each of the regressions, the intercept takes on the value of the overall mean for average hourly
earnings, and the mean average hourly earnings for females.
c. \( \widehat{ahe} = \hat{\beta}_0 + \hat{\beta}_1 \times DFemme = 20.09 - 3.59 \times DFemme \)
The intercept is the mean of average hourly earnings for males, and the slope is the difference between
the mean of average hourly earnings of females and males.
d. The standard error on the slope coefficient is 0.16, which is identical to the standard error of the difference in means in the denominator of the t-statistic in (a) above. Hence the t-statistic is –21.98.
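The link between the difference-in-means test and the binary-variable regression can be traced with the summary statistics alone. This is a hedged sketch with hypothetical variable names; it uses the unequal-variance standard error from part a.

```python
from math import sqrt

n_f, mean_f, sd_f = 6285, 16.50, 9.06    # females
n_m, mean_m, sd_m = 8640, 20.09, 10.85   # males

# Part a: t-statistic for the difference in means (males minus females).
se_diff = sqrt(sd_m**2 / n_m + sd_f**2 / n_f)
t_means = (mean_m - mean_f) / se_diff

# Part b: an intercept-only regression returns the sample mean.
overall_mean = (n_f * mean_f + n_m * mean_m) / (n_f + n_m)

# Part c: with DFemme = 1 for females, the intercept is the male mean
# and the slope is the female mean minus the male mean.
intercept = mean_m
slope = mean_f - mean_m

# Part d: same standard error, so the t-statistic only changes sign.
t_slope = slope / se_diff

print(f"SE = {se_diff:.2f}, t (means) = {t_means:.2f}")
print(f"overall mean = {overall_mean:.2f}")
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, t (slope) = {t_slope:.2f}")
```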