Statistics 103: Practice Problems for Final Exam

This sheet contains practice problems for the final exam. It is longer than the actual final so that you have extra problems. Other material in the text and from lectures may appear on the final exam.

Questions 1 – 18 refer to a random sample of 500 heads of households. The data are sampled from the households collected in the March 2000 Current Population Survey. For these problems, we’ll assume that the data are a simple random sample of 500 households from the entire U.S. population.

People who own their homes have to pay taxes on their property. Below is a histogram that shows property taxes for all 500 households in the sample.

[Histogram: Property tax for all households; horizontal axis from 0 to 9000]

1. True or False: More than 45% of these households have property taxes greater than the mean.

2. Choose the value that you think is closest to the standard deviation of these property taxes: 10, 100, 1000, 10000.

3. True or False: If we remove all the houses that have property taxes equal to zero, the average of the remaining taxes would be larger than $1,000.

4. True or False: A normal curve can be used to determine the percentage of houses paying above $500 with very good accuracy.

5. Estimate the percentage of households that pay more than $2000 in property taxes. _____

Below is a box plot of property tax by marital status. Marital status 1 is for a married householder, 5 is for a divorced householder, and 7 is for a single householder. There are 219 married people, 84 divorced people, and 108 single people. The people in the sample who have other marital statuses are not displayed in this graph.

[Box plot: Oneway Analysis of property tax By marital status (1 = married, 5 = divorced, 7 = single); vertical axis: property tax, 0 to 9000; horizontal axis: marital status]

6. Order the three groups by median property tax, going from largest to smallest. _________

7. True or False: The standard deviation of property taxes for these married people is closer in value to the standard deviation for these single people than it is to the standard deviation for these divorced people.

8. True or False: A larger percentage of these married people have property taxes below 500 than do these divorced people.

9. Which of these three groups has the largest percentage of people not paying any property tax?

10. True or False: The standard error of the sample average property tax for divorced people is larger than the standard error for the sample average property tax for married people.

Below is a box plot of property taxes for male and female household heads. Men are coded with a 1, and women are coded with a 2.

[Box plot: Oneway Analysis of property tax By sex (men = 1, women = 2); vertical axis: property tax, 0 to 9000; horizontal axis: sex]

Means and Std Deviations

Level   Number   Mean      Std Dev
1       257      996.086   1507.10
2       242      754.942   1162.05

11. Are the assumptions for using confidence intervals or hypothesis tests involving sample averages of property taxes likely to hold? Explain your reasoning.

12. Give a 95% confidence interval for the population difference in average property taxes for male and female household heads. Assume the degrees of freedom is 400.

13. Based on the interval, do you think there is overwhelming evidence that the population average property tax amounts differ?

14. Check all of the following that are true:

___ There is a 95% chance that the population difference in average property taxes is between the two values you determined in Question 12.

___ If we pick two households at random so that one is headed by a male and the other by a female, the difference in their property taxes will fall within the upper and lower limits 95% of the time.

___ If we took another random sample of 500, then another, then another, and so on, we’d expect 95% of the formed confidence intervals to contain the population difference in average property taxes.
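The interval asked for in Question 12 can be set up from the summary table alone. The following is a minimal Python sketch, not an official solution. It assumes the unpooled (unequal-variance) standard error for a difference of two sample means, and it approximates the t multiplier for 400 degrees of freedom as 1.966, which is an assumed table value. All variable names are illustrative.

```python
import math

# Summary statistics from the "Means and Std Deviations" table (1 = men, 2 = women).
n_men, mean_men, sd_men = 257, 996.086, 1507.10
n_women, mean_women, sd_women = 242, 754.942, 1162.05

# Assumption: approximate 97.5th percentile of a t distribution with 400 df.
T_MULTIPLIER = 1.966

# Point estimate of the difference in population averages (men minus women).
diff = mean_men - mean_women

# Unpooled standard error of the difference of two independent sample means.
se_diff = math.sqrt(sd_men**2 / n_men + sd_women**2 / n_women)

# 95% confidence interval: estimate plus or minus multiplier times SE.
margin = T_MULTIPLIER * se_diff
lower, upper = diff - margin, diff + margin

print(f"difference = {diff:.3f}, SE = {se_diff:.3f}")
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
```

The same standard error is the denominator of the test statistic in Question 15, so the two questions can share this setup.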
15. Test the null hypothesis that there is no difference in population average property taxes between male and female household heads. State your null and alternative hypotheses, the test statistic, the p-value, and your conclusions. Consider a p-value near 0.05 to be small.

16. Check all of the following that are true.

___ The probability that the null hypothesis is true equals the p-value from the previous part.

___ It may be the case that the results are due to chance, and our conclusion from the hypothesis test is wrong.

___ The chance of getting a value of the test statistic as or more extreme than what was observed, assuming the null hypothesis is true, equals the p-value.

Below are scatter plots of property tax, household income, number of people in the household, and age of the household head.

[Scatterplot matrix: property tax, household income, number people in house, age of hh]

Questions 17 – 18 refer to the plot above.

17. The value of the correlation between household income and age of household head is closest to which of the following values: -.8, -.4, 0, .4, .8.

18. If you fit a regression between property tax (outcome) and income (predictor), which of the following statements would be true? You can choose more than one.

___ The slope of the line would be positive.

___ The intercept of the line would be greater than 1000.

___ The root MSE of the regression would be larger than the SD of property tax.

19. Does taking additional vitamin C help prevent the common cold? Nobel Laureate Linus Pauling (1901 - 1994) performed a randomized experiment to address this question and reported his results in the Proceedings of the National Academy of Sciences. Pauling randomly assigned 279 French skiers to be in one of two groups: a group that took vitamin C supplements or a group that took a placebo (a sugar pill).
The numbers of people for each category are summarized below:

                      Vitamin C   Placebo
Got a cold                17         31
Did not get a cold       122        109

Pretend that you are the consulting statistician for Linus Pauling (a lofty honor indeed!).

i) Pauling seeks to know if there is evidence that the population incidence rate of colds for people who take Vitamin C is less than the population incidence rate of colds for people who take the placebo. What do you tell him? State clearly and justify your null and alternative hypotheses, the test statistic, the p-value, and conclusions.

ii) Asking people to take the sugar pills is expensive because you have to buy sugar pills and distribute them to the skiers. Pauling requests that the next experiment (Nobel Laureates always try to replicate results) avoid the sugar pills to save resources. Instead, he suggests the control group be randomly assigned to take nothing. Are you willing to comply with Pauling's request? Explain why or why not.

iii) Suppose you could replicate this experiment with 1000 skiers, 500 in each treatment group. Approximate the standard error that you’d use in a 99% confidence interval.

iv) Discuss the types of conclusions that you can draw from these data. That is, what do the results suggest about the ability of Vitamin C to prevent colds?

20. True or False

For each statement, if you think the statement is always true, just say it is true. If you think the statement is always false or sometimes false, say it is false and explain why or when it is false in two or fewer sentences.

i) Your research colleague calculates by hand a value of the correlation of -0.95. He says this shows that there is a strong, negative linear association between the two variables.

ii) When data are randomly sampled from the same population, a 95% confidence interval constructed from a sample of 100 units should be narrower than a 95% confidence interval constructed from a sample of 200 units.
iii) In a regression analysis, a non-random pattern in the plot of residuals versus fitted values is consistent with the assumptions of the regression model.

iv) You perform a hypothesis test with a sample size of four units, and you do not reject the null hypothesis. This statistical test provides conclusive evidence against the alternative hypothesis.

v) A group of teachers attends a summer program designed to improve their foreign language skills. The teachers take a foreign language test at the start of the summer before the program begins. After the program ends, the teachers take another language test of similar difficulty. Based on a matched pairs hypothesis test, the average increase in scores is significantly greater than zero (p-value = .002). These data demonstrate that the summer training program improved the foreign language skills of the teachers.

vi) A professor is considering which of two exams to give. Scores on the first exam follow a normal distribution with mean of 75 and standard deviation of 5. Scores on the second exam follow a normal distribution with mean of 75 and standard deviation of 10. She wants to pick the exam likely to result in a relatively small number of people scoring below 60. She should pick the second exam.

vii) A certain company employs 10,000 people. Fifty percent of these employees are women. One hundred of these employees are in management positions. Of these 100 managers, 35 are women. In a court case against the company, a defense attorney argues that his client does not discriminate on the basis of sex when hiring managers. As evidence, he says, "The chance that a randomly selected employee is in management and a woman equals Pr(is a woman) * Pr(is in management) = .50 * .01 = .005. The chance that a randomly selected employee is in management and a man equals Pr(is a man) * Pr(is in management) = .50 * .01 = .005. These are equal probabilities, so that there is no evidence of discrimination." The defense attorney's calculations are valid.

21. A random sample of 100 is taken from a population with 10% minority and 90% non-minority members.

i) True or False: The number of minorities in the sample will be around 10 with an SE of around 3.

ii) True or False: There is about a 68% chance that the number of minorities in the sample will be between 9% and 11%.

iii) True or False: The population has about 10% minority members with an SE of around .3%.

iv) True or False: In a particular sample of 100, it would be nearly impossible to see more than 16 minority members.

22. In economics, a standard national accounting identity for total production is:

Y = C + I + G + X,

where Y = the total production, C = consumption, I = investment, G = government expenditures, and X = net exports. Consider each of these to be random variables.

i) A colleague suggests that to find the expected total production, you should add the expected values of consumption, investment, government expenditures, and net exports. Is your colleague correct?

ii) The same colleague suggests that to find the variance of total production, you should add the variances of consumption, investment, government expenditures, and net exports. Is your colleague correct?

23. You have to decide whether or not to study hard for your Stats final. The professor tells you that, in the past, 75% of the people who got As studied hard, whereas 20% of the people who did not get As studied hard. Furthermore, experience shows that about 40% of people get As on the final.

i) What is the probability of getting an A, given that you study hard?

ii) What is the probability of getting an A, given that you do not study hard?

24. The time (in hours) it takes to complete a three-hour final exam follows the following continuous probability density function:

\[ f(y) = \frac{1}{\ln(3)} \cdot \frac{1}{(4 - y)} \quad \text{for } 1 \le y \le 3, \]

where ln(3) is the natural log of 3.
i) Find the probability that a person takes less than two hours to complete the exam.

ii) Find the probability that a person takes between 2 and 2.5 hours.

iii) Given that a person takes more than two hours, what is the chance that he or she will take less than 2.5 hours?

iv) Write the expression for the expected time it takes to complete the exam. You don’t have to evaluate the expression.

25. Suppose that the joint distribution of the number of servings of vegetables (Y) and the number of servings of fruit (X) that the typical Duke student gets per day is described by the following joint distribution.

                 X
           x=0    x=1    x=2
Y   y=0    0.10   0.15   0.05
    y=1    0.30   0.20   0
    y=2    0.20   0      0

i) What is the probability that a randomly selected Duke student eats at least one serving of fruit per day?

ii) What is the expected value of the number of servings of vegetables per day for Duke students?

iii) Among Duke students who eat one serving of fruit per day, how many servings of vegetables are they expected to get?

iv) What is the standard deviation of the number of servings of vegetables?

v) What is the covariance between X and Y?

vi) Just for practice, suppose you make the function: T = 1.5Y - .2X. What are the expected value and variance of T?

26. You are offered the following game. You roll a fair, six-sided die. For each roll less than three (i.e., one or two), you get $10. For each roll more than two (i.e., three through six), you have to pay $2 times the number of rolls (e.g., if you get a 3-6 on the first roll, you pay $2. If you get a 3-6 on the second roll, you pay $4. If you get a 3-6 on the third roll, you pay $6.)

i) Suppose you roll the die three times. What is the probability distribution of your net earnings?

ii) What are the expected value and variance of your earnings?

27. Circle true or false. You don’t need to explain your choice.

(i) You take a random sample of 100 people and find out the average gratuity they give when dining at restaurants (i.e., the tip) is 16% of their bill.
True False: The standard error for making a confidence interval for the average tipping percentage in the population is \( \sqrt{.16(.84)/100} \).

(ii) You pick a random sample of 5 M&Ms from an enormous jar that has 30% blue M&Ms.

True False: Since \( z = \frac{.4 - .3}{\sqrt{.3(.7)/5}} = .488 \), the chance of getting less than 40% blue M&Ms is about 68.7% (this is the area under the normal curve to the left of .488).

28. Invest your money with me. Yeah, I’ll take care of it.

The probability distribution for the monthly return rate (expressed as a decimal, so that a -10% return equals -0.10) for a certain stock (call it “JR stock”) can be well described by the following probability density function:

\[ f(x) = \frac{9x^2}{6} \quad \text{for } -1 < x < 1. \]

a) What is the probability that JR stock returns more than 20%?

b) What is the expected rate of return of JR stock?

c) What is the variance of the rate of return for JR stock?

d) The correlation between JR stock and the IBM stock is 0.45. IBM stock has an expected rate of return of 7% and a standard deviation of 10%. What is the standard deviation of a portfolio that has 75% IBM stock and 25% JR stock?

e) If you assume the return rates for all months are independent, what is the chance that JR stock will have a positive monthly return in at least one of the next twelve months?

f) If you assume the return rates for all months are independent, what is the chance that JR stock will have a positive monthly return in at least 70 of the next 120 months?

29. Weather predictions

In preparation for tenting next year, you buy a weather predictor that has the following properties:

On days when it is sunny, there is an 80% chance that the predictor says sunny.
On days when it is sunny, there is a 15% chance that the predictor says cloudy.
On days when it is sunny, there is a 5% chance that the predictor says rain.
On days when it rains, there is a 10% chance that the predictor says sunny.
On days when it rains, there is a 40% chance that the predictor says cloudy.
On days when it rains, there is a 50% chance that the predictor says rain.
On days when it is cloudy, there is a 33% chance that the predictor says sunny.
On days when it is cloudy, there is a 34% chance the predictor says cloudy.
On days when it is cloudy, there is a 33% chance the predictor says rain.

In January in Durham, 40% of days are sunny, 30% are cloudy, and 30% have rain.

a) The predictor says sunny. What is the probability that the day will be sunny?

b) The predictor says rain. What is the probability that the day will be cloudy or sunny?

30. Show that \( \mathrm{Cov}(X, Y) = E(XY) - E(X)E(Y) \).

31. Suppose that the variance of GRE scores for individuals who graduate from private schools equals \( \sigma_1^2 \), and that the variance of GRE scores for individuals who graduate from public schools equals \( \sigma_2^2 \). You are going to take a random sample of \( n_1 \) individuals from private schools and \( n_2 \) individuals from public schools. Show that

\[ \mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \sigma_1^2 / n_1 + \sigma_2^2 / n_2. \]

32. In the problem above, suppose that the population means in the two groups both equal \( \mu \). You want to estimate \( \mu \). Two estimators are proposed: (i) \( W = \bar{X}_1 \) and (ii) \( V = (\bar{X}_1 + \bar{X}_2)/2 \).

a) Show that both W and V are unbiased estimators of \( \mu \).

b) Suppose that \( \sigma_1^2 = 1000 \) and \( \sigma_2^2 = 10000 \), and that the sample size in the first group is 25. What is the smallest sample size needed in the second group so that Var(V) < Var(W)?

33. Drinking and Driving

In many states a motorist is legally drunk or driving under the influence (DUI) if his or her blood alcohol concentration is .10% or higher. When a suspected DUI offender is pulled over, police often request a sobriety test called a breathalyzer, in which the suspected offender breathes into a machine that reports a blood alcohol level. Although the breathalyzers are remarkably precise, they do exhibit some measurement error.
Because of that variability, the possibility exists that a driver's true blood alcohol concentration may be under .10% even though the breathalyzer gives a reading over .10%. Experience has shown that repeated breathalyzer measurements taken on the same person produce a distribution of responses that can be described by a normal distribution with mean equal to the person's true blood alcohol concentration and standard deviation equal to .004%.

i) Suppose that a driver is stopped on his way home from a party. He has a true blood alcohol concentration of .095%, barely below the legal limit. If he takes the breathalyzer test, what are the chances that he will be incorrectly booked on a DUI charge (i.e., his result will be above .10%)?

ii) Suppose a different driver is stopped on her way home and is wicked drunk with a blood alcohol level of .15%. What are the chances that the breathalyzer will indicate that she is not guilty of DUI?

iii) In one bad night, 9 people from the same party are pulled over by the police and given breathalyzer tests. Let's assume that all of them have a blood alcohol content of .10%. What is the probability that at least one of the people will be booked on a DUI charge?

34. Sex of children

Do certain families have a tendency to have babies of the same sex? Let's assume that the sexes of babies are independent (e.g., the sex of the first-born baby does not affect the probability the second-born baby is female), and that the probability that a baby is born female is 0.5014, which is approximately the current percentage. Assume that this probability is the same regardless of family size (which is an assumption that is not necessarily true).

i) What is the chance that a family with 6 children has them born in alternating sex order, with the oldest being a male?

ii) What is the chance that a family has 6 girls in a row?

iii) Given that the first three children are boys, what is the chance that the next child will be a girl?
An article that analyzes the question of sex tendencies is in the winter 2001 issue of the magazine Chance.

35. You propose to use a Poisson distribution for the number of siblings that a randomly selected person has. The probability distribution for the Poisson is:

\[ \Pr(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y = 0, 1, 2, 3, \ldots \]

You sample four people and obtain sibling counts of 0, 2, 1, and 4. Find the maximum likelihood estimate of \( \lambda \).

36. Lotteries

The Pennsylvania Daily Number is a lottery game in which the state constructs a three-digit number by drawing a digit from 0 to 9 from each of three different containers. For example, if the digits were drawn in the order 3, 6, 3, then the winning number would be 363. In this problem, we focus only on the first container.

If the numbers are truly randomly selected, each value between 0 and 9 should be equally likely to occur. To test if the digits are randomly selected, the frequencies of each digit in the first container were collected for the 500 days between July 19, 1999, and November 29, 2000. The frequencies are given below:

Digit        0    1    2    3    4    5    6    7    8    9
Frequency   47   50   55   46   53   39   55   55   44   56

To see if the digits are randomly selected, two analyses are proposed:

(i) Form the null hypothesis that the population average of the frequencies for these draws equals 50, and the alternative hypothesis that the sample average does not equal 50. To do this, a t-test is performed using the test statistic: \( t = (\text{sample average frequency} - 50)/\text{SE} \), where SE is the standard error of the 10 frequencies above.

(ii) Form the null hypothesis that the population percentages of each digit equal 0.1, and use a chi-squared goodness of fit test with the test statistic: \( \chi^2 = \sum [(\text{observed} - \text{expected})^2 / \text{expected}] \).

a) Which of these two analyses would you use? Explain why you chose that analysis, and what, if anything, is wrong with the other analysis.

b) For the t-test, the p-value equals 0.999. For the chi-squared test, the p-value equals 0.733.
Using the method you selected in part a, what does the p-value tell you about whether the numbers are randomly selected?

c) I want you to change the frequencies for 8 and 9 above so that the p-value for your analysis is much smaller than the one in part b. Write the frequencies for 8 and 9 that you’d use to make this happen.

37. More chi-squared problems

a) Here’s a conceptual question on chi-squared tests. Fill in the following table so that you get a p-value from the chi-squared test of independence near 1. Your table must have 40 men and 80 women. You must have at least one person in each of the six cells.

            Favorite color
            Red    Blue   Green   Total
Male        ___    ___    ___     40
Female      ___    ___    ___     80

b) Following up on part a, fill in the following table so that you get a p-value from the chi-squared test of independence that is near zero. Your table must have 40 men and 80 women. You must have at least one person in each of the six cells.

            Favorite color
            Red    Blue   Green   Total
Male        ___    ___    ___     40
Female      ___    ___    ___     80
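For problem 36, the chi-squared goodness of fit statistic in analysis (ii) can be computed directly from the frequency table. The following is a minimal Python sketch under the stated null hypothesis, where each digit has probability 0.1, so the expected count is the total number of days divided by 10. The variable names are illustrative, and the sketch stops at the statistic and its degrees of freedom rather than the p-value so that it needs only the standard library.

```python
# Observed first-container frequencies for digits 0 through 9 (problem 36).
observed = [47, 50, 55, 46, 53, 39, 55, 55, 44, 56]

# Under the null hypothesis every digit is equally likely,
# so each expected count is the total number of draws divided by 10.
total_days = sum(observed)
expected = total_days / len(observed)

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_squared = sum((o - expected) ** 2 / expected for o in observed)

# A goodness of fit test with k categories has k - 1 degrees of freedom.
degrees_of_freedom = len(observed) - 1

print(f"total = {total_days}, expected per digit = {expected}")
print(f"chi-squared = {chi_squared:.2f} on {degrees_of_freedom} df")
```

Changing two entries of `observed` while keeping the total at 500 shows how the statistic responds, which is the kind of experiment part c of problem 36 asks about.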