Practice Questions for Exam 1

1) The height of male students at your college/university is normally distributed with a mean of 70 inches and a standard deviation of 3.5 inches. If you had a list of telephone numbers for male students for the purpose of conducting a survey, what would be the probability of randomly calling one of these students whose height is: (a) taller than 6'0"? (b) between 5'3" and 6'5"? (c) shorter than 5'7", the mean height of female students? (d) shorter than 5'0"? (e) taller than Shaquille O'Neal, the center of the Boston Celtics, who is 7'1" tall? Compare this to the probability of a woman being pregnant for 10 months (300 days), where days of pregnancy is normally distributed with a mean of 266 days and a standard deviation of 16 days.

Answer: (a) Pr(Z > 0.5714) = 0.2839; (b) Pr(-2 < Z < 2) = 0.9545, or approximately 0.95; (c) Pr(Z < -0.8571) = 0.1957; (d) Pr(Z < -2.8571) = 0.0021; (e) Pr(Z > 4.2857) = 0.000009 (the text does not show values above 2.99 standard deviations; Pr(Z > 2.99) = 0.0014). For the pregnancy comparison, Pr(Z > 2.1250) = 0.0168.

2) Adult males are taller, on average, than adult females. Visiting two recent American Youth Soccer Organization (AYSO) under 12 year old (U12) soccer matches on a Saturday, you do not observe an obvious difference in the height of boys and girls of that age. You suggest to your little sister that she collect data on height and gender of children in 4th to 6th grade as part of her science project. The accompanying table shows her findings.

Height of Young Boys and Girls, Grades 4-6, in inches

          Average height    Standard deviation    n
Boys      57.8              3.9                   55
Girls     58.4              4.2                   57

(a) Let your null hypothesis be that there is no difference in the height of females and males at this age level. Specify the alternative hypothesis. (b) Find the difference in height and the standard error of the difference. (c) Generate a 95% confidence interval for the difference in height. (d) Calculate the t-statistic for comparing the two means.
Is the difference statistically significant at the 1% level? Which critical value did you use? Why would this number be smaller if you had assumed a one-sided alternative hypothesis? What is the intuition behind this?

Answer: (a) \( H_0: \mu_{Boys} - \mu_{Girls} = 0 \) vs. \( H_1: \mu_{Boys} - \mu_{Girls} \neq 0 \).
(b) \( \bar{Y}_{Boys} - \bar{Y}_{Girls} = -0.6 \), \( SE(\bar{Y}_{Boys} - \bar{Y}_{Girls}) = \sqrt{\frac{3.9^2}{55} + \frac{4.2^2}{57}} = 0.77 \).
(c) -0.6 ± 1.96 × 0.77 = (-2.11, 0.91).
(d) t = -0.6 / 0.77 = -0.78, so |t| < 2.58, which is the critical value at the 1% level. Hence you cannot reject the null hypothesis. The critical value for the one-sided hypothesis would have been 2.33. Assuming a one-sided hypothesis implies that you have some information about the problem at hand and, as a result, can be more easily convinced than if you had no prior expectation.

3) Assume that two presidential candidates, call them Bush and Gore, each receive 50% of the votes in the population. You can model this situation as a Bernoulli trial, where Y is a random variable with success probability Pr(Y = 1) = p, and where Y = 1 if a person votes for Bush and Y = 0 otherwise. Furthermore, let \( \hat{p} \) be the fraction of successes (1s) in a sample, which is distributed \( N\left(p, \frac{p(1-p)}{n}\right) \) in reasonably large samples, say for n ≥ 40. (a) Given your knowledge about the population, find the probability that in a random sample of 40, Bush would receive a share of 40% or less. (b) How would this situation change with a random sample of 100? (c) Given your answers in (a) and (b), would you be comfortable predicting the voting intentions of the entire population if you did not know p but had polled 10,000 individuals at random and calculated \( \hat{p} \)? Explain. (d) This result seems to hold whether you poll 10,000 people at random in the Netherlands or the United States, where the former has a population of less than 20 million people, while the United States is 15 times as populous. Why does the population size not come into play?

Answer: (a) \( \Pr(\hat{p} < 0.40) = \Pr\left(Z < \frac{0.40 - 0.50}{\sqrt{0.25/40}}\right) = \Pr(Z < -1.26) \approx 0.104 \). In roughly every 10th sample of this size, Bush would receive a vote of less than 40%, although in truth his share is 50%.
(b) \( \Pr(\hat{p} < 0.40) = \Pr\left(Z < \frac{0.40 - 0.50}{\sqrt{0.25/100}}\right) = \Pr(Z < -2.00) \approx 0.023 \). With this sample size, you would expect this to happen in only about one of every 50 samples.
(c) The answers in (a) and (b) suggest that even for moderate increases in the sample size, the estimator does not vary too much from the population mean. Polling 10,000 individuals, the probability of finding a \( \hat{p} \) of 0.48 or less, for example, would be 0.00003. Unless the election is extremely close, which the 2000 election was, polls are quite accurate even for sample sizes of 2,500.
(d) The spread of the sampling distribution of the sample mean shrinks very quickly as the sample size grows; it depends on the sample size, not the population size. Although at first this does not seem intuitive, the standard error of an estimator is a value which indicates by how much the estimator varies around the population value. For large sample sizes, the sample mean typically is very close to the population mean.

4) You have collected weekly earnings and age data from a sub-sample of 1,744 individuals using the Current Population Survey in a given year. (a) Given the overall mean of $434.49 and a standard deviation of $294.67, construct a 99% confidence interval for average earnings in the entire population. State the meaning of this interval in words, rather than just in numbers. If you constructed a 90% confidence interval instead, would it be smaller or larger? What is the intuition? (b) When dividing your sample into people 45 years and older, and younger than 45, the information shown in the table is found.

Age Category    Average Earnings (Y-bar)    Standard Deviation (s_Y)    N
Age ≥ 45        $488.87                     $328.64                     507
Age < 45        $412.20                     $276.63                     1,237

Test whether or not the difference in average earnings is statistically significant. Given your knowledge of age-earning profiles, does this result make sense?
Answer: (a) The confidence interval for mean weekly earnings is \( 434.49 \pm 2.58 \times \frac{294.67}{\sqrt{1744}} = 434.49 \pm 18.20 = (416.29, 452.69) \). Based on the sample at hand, the best guess for the population mean is $434.49. However, because of random sampling error, this guess is likely to be wrong. Instead, the interval estimate for the average earnings lies between $416.29 and $452.69. Committing to such an interval repeatedly implies that the resulting statement is incorrect 1 out of 100 times. For a 90% confidence interval, the only change in the calculation of the confidence interval is to replace 2.58 by 1.64. Hence the confidence interval is smaller. A smaller interval implies, given the same average earnings and the same standard deviation, that the statement will be false more often. The larger the confidence interval, the more likely it is to contain the population value.
(b) Assuming unequal population variances, \( t = \frac{488.87 - 412.20}{\sqrt{\frac{328.64^2}{507} + \frac{276.63^2}{1237}}} = 4.62 \), which is statistically significant at conventional levels whether you use a two-sided or one-sided alternative. Hence the null hypothesis of equal average earnings in the two groups is rejected. Age-earning profiles typically take on an inverted U-shape. Maximum earnings occur in the 40s, depending on some other factors such as years of education, which are not considered here. Hence it is not clear if the alternative hypothesis should be one-sided or two-sided. In such a situation, it is best to assume a two-sided alternative hypothesis.

5) Sir Francis Galton, a cousin of Charles Darwin, examined the relationship between the height of children and their parents towards the end of the 19th century. It is from this study that the name "regression" originated.
You decide to update his findings by collecting data from 110 college students, and estimate the following relationship:

\( \widehat{Studenth} = 19.6 + 0.73 \times Midparh \), \( R^2 = 0.45 \), SER = 2.0

where Studenth is the height of students in inches, and Midparh is the average of the parental heights. (Following Galton's methodology, both variables were adjusted so that the average female height was equal to the average male height.) (a) Interpret the estimated coefficients. (b) What is the meaning of the regression \( R^2 \)? (c) What is the prediction for the height of a child whose parents have an average height of 70.06 inches? (d) What is the interpretation of the SER here? (e) Given the positive intercept and the fact that the slope lies between zero and one, what can you say about the height of students who have quite tall parents? Those who have quite short parents?

Answer: (a) For every one inch increase in the average height of their parents, the student's height increases by 0.73 of an inch, on average. There is no reasonable interpretation for the intercept.
(b) The model explains 45 percent of the variation in the height of students.
(c) 19.6 + 0.73 × 70.06 = 70.74.
(d) The SER is a measure of the spread of the observations around the regression line. The magnitude of the typical deviation from the regression line, or the typical regression error, is two inches here.
(e) Tall parents will have, on average, tall students, but they will not be as tall as their parents. Short parents will have short students, although on average, they will be somewhat taller than their parents.

6) The baseball team nearest to your home town is, once again, not doing well. Given that your knowledge of what it takes to win in baseball is vastly superior to that of management, you want to find out what it takes to win in Major League Baseball (MLB).
You therefore collect the winning percentage of all 30 baseball teams in MLB for 1999 and regress the winning percentage on what you consider the primary determinant for wins, which is quality pitching (team earned run average). You find the following information on team performance:

Summary of the Distribution of Winning Percentage and Team Earned Run Average for MLB in 1999

                                                        Percentile
                      Average   Standard    10%    25%    40%    50%        60%    75%    90%
                                deviation                        (median)
Team ERA              4.71      0.53        3.84   4.35   4.72   4.78       4.91   5.06   5.25
Winning Percentage    0.50      0.08        0.40   0.43   0.46   0.48       0.49   0.59   0.60

(a) What is your expected sign for the regression slope? Will it make sense to interpret the intercept? If not, should you omit it from your regression and force the regression line through the origin? (b) OLS estimation of the relationship between the winning percentage and the team ERA yields the following:

\( \widehat{Winpct} = 0.94 - 0.10 \times teamera \), \( R^2 = 0.49 \), SER = 0.06,

where Winpct is measured as wins divided by games played, so for example a team that won half of its games would have Winpct = 0.50. Interpret your regression results. (c) It is typically sufficient to win 90 games to be in the playoffs and/or to win a division. Winning over 100 games a season is exceptional: the Atlanta Braves had the most wins in 1999 with 103. Teams play a total of 162 games a year. Given this information, do you consider the slope coefficient to be large or small? (d) What would be the effect on the slope, the intercept, and the regression \( R^2 \) if you measured Winpct in percentage points, i.e., as (Wins/Games) × 100? (e) Are you impressed with the size of the regression \( R^2 \)? Given that 51% of the variation in the winning percentage is unexplained, what might some of the other factors be?

Answer: (a) You expect a negative relationship, since a higher team ERA implies a lower quality of the input. No team comes close to a zero team ERA, and therefore it does not make sense to interpret the intercept.
Forcing the regression through the origin is a false implication from this insight. Instead the intercept fixes the level of the regression.
(b) For every one point increase in Team ERA, the winning percentage decreases by 10 percentage points, or 0.10. Roughly half of the variation in winning percentage is explained by the quality of team pitching.
(c) The coefficient is large, since increasing the winning percentage by 0.10 is the equivalent of winning 16 more games per year. Since it is typically sufficient to win 56 percent of the games to qualify for the playoffs, this difference of 0.10 in winning percentage can easily turn a losing team into a winning team.
(d) Clearly the regression \( R^2 \) will not be affected by a change in scale, since a descriptive measure of the quality of the regression would otherwise depend on whim. The slope of the regression will compensate in such a way that the interpretation of the result is unaffected, i.e., it will become -10 in the above example. The intercept will also change to reflect the fact that if X were 0, then the dependent variable would now be measured in percentage points, i.e., it will become 94.0 in the above example.
(e) It is impressive that a single variable can explain roughly half of the variation in winning percentage. Answers to the second question will vary by student, but will typically include the quality of hitting, fielding, and management. Salaries could be included, but should be reflected in the inputs.

7) In 2001, the Arizona Diamondbacks defeated the New York Yankees in the Baseball World Series in 7 games. Some players, such as Bautista and Finley for the Diamondbacks, had a substantially higher batting average during the World Series than during the regular season. Others, such as Brosius and Jeter for the Yankees, did substantially worse. You set out to investigate whether or not the regular season batting average is a good indicator for the World Series batting average.
The results for the 11 players who had the most at bats for the two teams are:

\( \widehat{AZWsavg} = -0.347 + 2.290 \times AZSeasavg \), \( R^2 = 0.11 \), SER = 0.145,
\( \widehat{NYWsavg} = 0.134 + 0.136 \times NYSeasavg \), \( R^2 = 0.001 \), SER = 0.092,

where Wsavg and Seasavg indicate the batting average during the World Series and the regular season respectively. (a) Focusing on the coefficients first, what is your interpretation? (b) What can you say about the explanatory power of your equation? What do you conclude from this?

Answer: (a) The two regressions are quite different. For the Diamondbacks, players who had a 10 point higher batting average during the regular season had roughly a 23 point higher batting average during the World Series. Hence top performers did relatively better. The opposite holds for the Yankees.
(b) Both regressions have little explanatory power, as seen from the regression \( R^2 \). Hence performance during the season is a poor forecast of World Series performance.

8) You have obtained a sample of 14,925 individuals from the Current Population Survey (CPS) and are interested in the relationship between average hourly earnings and years of education. The regression yields the following result:

\( \widehat{ahe} = -4.58 + 1.71 \times educ \), \( R^2 = 0.182 \), SER = 9.30

where ahe and educ are measured in dollars and years respectively.
a. Interpret the coefficients and the regression \( R^2 \).
b. Is the effect of education on earnings large?
c. Why should education matter in the determination of earnings? Do the results suggest that there is a guarantee for average hourly earnings to rise for everyone as they receive an additional year of education? Do you think that the relationship between education and average hourly earnings is linear?
d. The average years of education in this sample is 13.5 years. What is the mean of average hourly earnings in the sample?
e. Interpret the measure SER. What is its unit of measurement?

Answer: a. A person with one more year of education has, on average, average hourly earnings that are $1.71 higher.
There is no meaning attached to the intercept; it just determines the height of the regression line. The model explains about 18 percent of the variation in average hourly earnings.
b. The difference between a high school graduate and a college graduate is four years of education. Hence a college graduate will earn almost $7 more per hour, on average ($6.84 to be precise). If you assume that there are 2,000 working hours per year, then the average salary difference would be close to $14,000 (actually $13,680). Depending on how much you have spent for an additional year of education and how much income you have forgone, this does not seem particularly large.
c. In general, you would expect to find a positive relationship between years of education and average hourly earnings. Education is considered an investment in human capital. If this were not the case, then it would be a puzzle as to why there are students in the econometrics course; surely they are not there to just "find themselves" (which would be quite expensive in most cases). However, if you consider education as an investment and you want to see a return on it, then the relationship will most likely not be linear. For example, a constant percent return would imply an exponential relationship whereby the additional year of education would bring a larger increase in average hourly earnings at higher levels of education. The results do not suggest that there is a guarantee for earnings to rise for everyone as they become more educated, since the regression \( R^2 \) does not equal 1. Instead the result holds "on average."
d. Since \( \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \), it follows that \( \bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} \). Substituting the estimates for the slope and the intercept, together with the mean of 13.5 years of education, results in a mean of average hourly earnings of roughly $18.50.
e. The typical prediction error is $9.30. Since the measure is related to the deviation of the actual and fitted values, the unit of measurement must be the same as that of the dependent variable, which is in dollars here.
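The arithmetic behind answers b and d of question 8 can be checked with a few lines of Python. This is only a sketch using the coefficients reported in the question; the variable and function names are illustrative, and the 2,000 working hours per year is the assumption already made in answer b.

```python
# Estimated regression from question 8: ahe_hat = -4.58 + 1.71 * educ
b0, b1 = -4.58, 1.71


def predicted_ahe(educ):
    """Predicted average hourly earnings (dollars) for a given number of years of education."""
    return b0 + b1 * educ


# (b) Four additional years of education (high school vs. college graduate)
hourly_gap = b1 * 4
annual_gap = hourly_gap * 2000  # assumes 2,000 working hours per year, as in answer b

# (d) The OLS line passes through the sample means, so the sample mean of ahe
# equals the prediction at the sample mean of educ (13.5 years).
mean_ahe = predicted_ahe(13.5)

print(f"hourly gap: {hourly_gap:.2f} dollars")
print(f"annual gap: {annual_gap:.0f} dollars")
print(f"mean of ahe: {mean_ahe:.3f} dollars")
```

The last value agrees with the overall mean of about $18.58 reported in question 10 up to rounding of the coefficients.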

9) You have obtained measurements of height in inches of 29 female and 81 male students (Studenth) at your university. A regression of the height on a constant and a binary variable (BFemme), which takes a value of one for females and is zero otherwise, yields the following result (standard errors in parentheses):

\( \widehat{Studenth} = 71.0 - 4.84 \times BFemme \), \( R^2 = 0.40 \), SER = 2.0
               (0.3)    (0.57)

(a) What is the interpretation of the intercept? What is the interpretation of the slope? How tall are females, on average? (b) Test the hypothesis that females, on average, are shorter than males, at the 1% level. (c) Is it likely that the error term is homoskedastic here?

Answer: (a) The intercept gives you the average height of males, which is 71 inches in this sample. The slope tells you by how much shorter females are, on average (almost 5 inches). The average height of females is therefore approximately 66 inches.
(b) The t-statistic for the difference in means is -4.84 / 0.57 = -8.49. For a one-sided test, the critical value is -2.33. Hence the difference is statistically significant.
(c) It is safer to assume that the variances for males and females are different. In the underlying sample the standard deviation for females was smaller.

10) You have collected 14,925 observations from the Current Population Survey. There are 6,285 females in the sample, and 8,640 males. The females report a mean of average hourly earnings of $16.50 with a standard deviation of $9.06. The males have an average of $20.09 and a standard deviation of $10.85. The overall mean average hourly earnings is $18.58.
a. Using the t-statistic for testing differences between two means (section 3.4 of your textbook), decide whether or not there is sufficient evidence to reject the null hypothesis that females and males have identical average hourly earnings.
b. You decide to run two regressions: first, you simply regress average hourly earnings on an intercept only. Next, you repeat this regression, but only for the 6,285 females in the sample.
What will the regression coefficients be in each of the two regressions?
c. Finally you run a regression over the entire sample of average hourly earnings on an intercept and a binary variable DFemme, where this variable takes on a value of 1 if the individual is a female, and is 0 otherwise. What will be the value of the intercept? What will be the value of the coefficient of the binary variable?
d. What is the standard error on the slope coefficient? What is the t-statistic?

Answer: a. \( H_0: \mu_F = \mu_M \); \( H_1: \mu_F \neq \mu_M \).
\( t = \frac{20.09 - 16.50}{\sqrt{\frac{10.85^2}{8640} + \frac{9.06^2}{6285}}} = 21.98 \).
As a result, you can comfortably reject the null hypothesis at any reasonable significance level.
b. \( \widehat{ahe} = \hat{\beta}_0 = 18.58 \) for the full sample; \( \widehat{ahe} = \hat{\beta}_0 = 16.50 \) for the female sub-sample. Hence for each of the regressions, the intercept takes on the value of the mean of the sample used: the overall mean of average hourly earnings in the first regression, and the mean of average hourly earnings for females in the second.
c. \( \widehat{ahe} = \hat{\beta}_0 + \hat{\beta}_1 \times DFemme = 20.09 - 3.59 \times DFemme \). The intercept is the mean of average hourly earnings for males, and the slope is the difference between the mean of average hourly earnings of females and males.
d. The standard error on the slope coefficient is 0.16, which is identical to the standard error of the difference in means used for the t-statistic in (a) above. Hence the t-statistic is -21.98.
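
The difference-in-means calculation used in question 10 (and earlier in questions 2 and 4) can be sketched in Python as follows. The helper name `two_sample_t` is hypothetical, and the function simply applies the unequal-variance standard error formula to the summary statistics given in the question; it is a check on the arithmetic, not a replacement for the textbook derivation.

```python
import math


def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (difference, standard error, t-statistic) for mean1 - mean2.

    The standard error allows the two population variances to differ.
    """
    diff = mean1 - mean2
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return diff, se, diff / se


# Question 10: females minus males, matching the sign of the DFemme coefficient
diff, se, t = two_sample_t(16.50, 9.06, 6285, 20.09, 10.85, 8640)
print(f"difference = {diff:.2f}, SE = {se:.2f}, t = {t:.2f}")

# Question 4(b): age >= 45 minus age < 45
_, _, t_age = two_sample_t(488.87, 328.64, 507, 412.20, 276.63, 1237)
print(f"t for the age groups = {t_age:.2f}")
```

Running the sketch reproduces the slope of -3.59, the standard error of about 0.16, and the t-statistic of about -21.98 from answers c and d, which illustrates why the regression on a binary variable and the two-sample comparison give the same test.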