Topic 4: Random Sampling
Rossman/Chance, Workshop Statistics, 3/e Solutions, Unit 1, Topic 4

In-Class Activities

Activity 4-1: Sampling Words
4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6

a. Answers will vary. The answers given here are one example.
b.
Word       Number of Letters
score      5
did        3
forth      5
whether    7
have       4
might      5
here       4
full       4
resolve    7
perish     6
c. [letters.pdf]
observational units = words
variable = number of letters per word; type = quantitative
d. average = 50/10 = 5 letters per word; statistic
e. An example set of responses from one class:
[averagelengths.pdf]
f. observational units = samples of 10 words
variable = average number of letters per word; type = quantitative
g. In this example, 8/10 = .8 of the students produced a sample average greater than 4.29 letters per word.
h. Yes – this sampling method appears to be biased. It tends to overestimate the population mean. This is evident from the dotplot, which is centered at about 5.7 (rather than at the population mean of 4.29) and shows that a large proportion of the class selected samples with means greater than 4.29.
i. Our eyes are most likely drawn to the longer words – we tend to overlook the short, common words like “a,” “and,” “is,” and “or.” Thus, when we try to choose representative samples, we do not select enough short words.
j. If we use this method, we would also be likely to select too many long words, because long words take up more space on the page and therefore have a greater chance of being selected when we blindly point to a location.
k. No – increasing the sample size will not make up for the biased sampling method. We would still tend to overrepresent the long words.
l. We need to employ a truly random method to select the words. We could write each word on a slip of paper (all slips the same size), put the slips in a hat, mix them thoroughly, and then draw ten slips from the hat.
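The contrast in Activity 4-1 between blindly pointing at the page (parts i and j) and drawing slips from a hat (part l) can be sketched with a small simulation. This is only an illustration under stated assumptions: the word list is the opening sentence of the Gettysburg Address rather than the full 268-word population, so its mean word length is not 4.29, and "pointing" is modeled by making each word's selection chance proportional to its length. The names WORDS, srs_mean, and pointing_mean are hypothetical.

```python
import random
from statistics import mean

# Assumption: a small stand-in population (opening sentence only),
# not the full 268-word text used in the activity.
WORDS = ("Four score and seven years ago our fathers brought forth on this "
         "continent a new nation conceived in liberty and dedicated to the "
         "proposition that all men are created equal").split()


def srs_mean(rng, n=10):
    """Mean word length of a simple random sample (slips drawn from a hat)."""
    return mean(len(w) for w in rng.sample(WORDS, n))


def pointing_mean(rng, n=10):
    """Mean word length when longer words are more likely to be picked.

    Assumption: selection chance is proportional to word length, a rough
    model of pointing blindly at the page. Sampling is with replacement.
    """
    weights = [len(w) for w in WORDS]
    return mean(len(w) for w in rng.choices(WORDS, weights=weights, k=n))


rng = random.Random(4)  # fixed seed so the run is repeatable
population_mean = mean(len(w) for w in WORDS)
srs_avg = mean(srs_mean(rng) for _ in range(2000))
pointing_avg = mean(pointing_mean(rng) for _ in range(2000))

print(f"population mean word length:       {population_mean:.2f}")
print(f"average of random-sample means:    {srs_avg:.2f}")
print(f"average of length-weighted means:  {pointing_avg:.2f}")
```

In a run of this sketch, the random-sample means should center near the population mean, while the length-weighted means should center noticeably higher, which is the pattern described in parts h through l.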

Activity 4-2: Sampling Words
4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6

a. Many answers are possible. The following was obtained using the beginning of line 60:
     Random Digits   Word   Word Length
1    031             now    3
2    025             that   4
3    052             can    3
4    076             a      1
5    059             a      1
b. average word length = 12/5 = 2.4 letters per word
c. Answers will vary; below is an example from one class.
[samplemeans.pdf]
d. This distribution is much closer to being centered at 4.29 and has a smaller horizontal spread than the previous one did (though the latter is not always the case).
e. The sample averages are split roughly evenly on both sides of 4.29.
f. Yes – random sampling appears to have produced unbiased estimates of the average word length in the population.

Activity 4-3: Sampling Words
4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6

Answers will vary. The following are from one particular running of the applet.
a.
     Word     Number of Letters
1    The      3
2    these    5
3    here     4
4    for      3
5    should   6
Average number of letters = 21/5 = 4.2
b. You will probably not obtain the same sample of words or the same average length the second time.
c. Average of the 500 sample averages = 4.31 letters per word
d. Yes – this appears to be “around” 4.29.
e. Answers will vary according to student expectation.
f. The center of this distribution should also be near 4.29, but the horizontal spread should be much smaller.
g. The distribution for samples of size 20 has less variability (more consistency) in the values of the sample average word length.
h. The result of a single sample is more likely to be close to 4.29 with a sample of size 20 than with a sample of size 5.
i. No – increasing the sample size when using a biased sampling method will not reduce the bias.
The results from different samples will tend to be closer together but will still be centered in the wrong location (not around the parameter value of interest). If you want to reduce the bias, you must change the sampling method.

Activity 4-4: Sampling Words
4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6

a. Below is one example set of results:
[wordsapplet.pdf]
b. Both distributions are roughly bell-shaped, centered at about 4.29, with a horizontal spread from about 2 to 7.
c. Yes – these distributions seem to have similar variability.
d. No – not much changed when we sampled from the larger population.

Activity 4-5: Back to Sleep
4-5, 6-5, 21-2

a. The population of interest is all infants younger than eight months in the United States in those years. The sampling frame is the list of households with such infants, generated from birth records, infant photography companies, and infant formula companies. The sample consists of the infants in the 1002 households whose mother (or other caregiver) participated in the interview.
b. The sample size is 1002. (Actually, a total of 1015 infants were in the sample because some households had twins.)
c. The researchers did not technically obtain a simple random sample of infants. One reason is that the sampling frame did not include the entire population. Another reason is that more than half of the numbers called did not lead to an interview. Infants who were not included in the sampling frame or whose mother declined to participate might differ systematically in some ways from those who were included. Nevertheless, the researchers did use randomness to select their sample, and they probably obtained as representative a sample as reasonably possible.
d.
Perhaps mothers in those groups were in a lower economic class and, therefore, less likely to have phones in the first place, or perhaps they had to work, so their children were in daycare.
e. These comparisons address the issue of bias, not precision. The sampling method was slightly biased with regard to the mother’s race and age and the infant’s birth weight.
f. These percentages are statistics because they are based on the sample.
g. The large sample size produces high precision. This means that the sample statistics are likely to be close to their population counterparts. For example, the population proportion of infants who sleep on their backs should be close to the sample proportion who sleep on their backs.
h. The sample size for subgroups is smaller than for the whole group, so the sample results would be less precise.

Homework Activities

Activity 4-6: Rating Chain Restaurants

a. It seems unlikely that this sample was randomly chosen, as it would be extremely difficult to give each Consumer Reports reader an equal chance of being selected for the sample and to ensure that everyone selected responded. It is much more likely that the responders self-selected by returning a survey.
b. The authors probably make the disclaimer because the sample was not randomly selected from the entire population but only from their readers, who may have different habits and attitudes from non-readers. The results therefore cannot reasonably be extended to the general population.
c. Answers will vary, but we probably should generalize these results only to Consumer Reports readers who tend to visit full-service restaurant chains and like to complete surveys.

Activity 4-7: Sampling Words
4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6

a. categorical (binary)
b. 99/268 = .369
c. parameter; .369 is the proportion of all 268 words (the population) in the Gettysburg Address that are more than 5 letters long.
d.
No – because of sampling variability we would not expect the sample proportion to equal .369, but we would expect it to be reasonably close most of the time. (In fact, with a sample of size 5, the sample proportion could not equal .369; it could only be 0, .2, .4, .6, .8, or 1.)

Activity 4-8: Sampling Words
4-1, 4-2, 4-3, 4-4, 4-7, 4-8, 8-9, 9-15, 14-6

Answers will vary. These are based on one particular running of the applet.
a. Yes – this distribution should be centered at about .369 (it is .38 in this case).
b. This distribution should still be centered at .369 (the mean is .37), but with much less variability.
c. Since we are taking random samples, we expect our sample proportions to center around the parameter (.369), regardless of the sample size. However, as we increase the sample size, we expect our sample results to become more precise; that is, we expect the variability between samples to decrease.

Activity 4-9: Sampling Senators
4-9, 4-18

a. observational units = U.S. senators
variable = years of service in the Senate
population = current 100 U.S. senators
sample = 5 selected current U.S. senators
parameter = average years of service of all 100 U.S. senators
statistic = average years of service of the 5 selected senators
b. This sampling method would most likely overestimate the average years of service, since your classmates would most likely select names of well-known senators who have been serving in the Senate for a long time. (You also need to worry about a tendency for students to mention the senators from their own state more than those from other states.)
c. No – increasing the sample size will not correct for a biased sampling method. Students would still tend to overrepresent the senators who have served longer.
d. Obtain a list of the current senators. Number each senator in the list from 00-99.
Select any row of the table of random digits and read the row as a sequence of 2-digit numbers. These 2-digit numbers tell you which senators from your list will make up your sample. Continue selecting senators until you have five senators in your sample. Skip any repeated 2-digit numbers.
e. Obtain a list of the current representatives. Number each representative in the list from 000-434. Select any row of the table of random digits and read the row as a sequence of 3-digit numbers. These 3-digit numbers tell you which representatives from your list will make up your sample. Skip any repeated 3-digit numbers and any numbers greater than 434. Continue selecting representatives until you have five representatives in your sample. If necessary, continue to another row of the table of random digits.

Activity 4-10: Responding to Katrina
2-12, 4-10, 16-13

Based on the sample sizes, the non-Hispanic white adults’ responses probably come closer to reflecting the group’s population value than the black adults’ responses do, because there were so many more white adults sampled. If both samples were selected randomly, the larger sample is more likely to produce a sample result similar to the population parameter.

Activity 4-11: Rose-y Opinions

a. observational units = 1000 individuals
variable = did they have a favorable or unfavorable opinion of Pete Rose? (categorical)
b. population = American sports fans
sample = first 1000 people leaving an LA Lakers basketball game
c. This was not a randomly selected sample. People attending this basketball game are not necessarily sports fans in general; they may be extreme LA Lakers fans (fanatics) or just basketball fans. This is an example of convenience sampling and is unlikely to result in a representative sample.
d. No, the individuals in the sample may still be interested only in basketball and not in sports in general.
e.
If you have a list of subscribers to Sports Illustrated, you could number the list and use a table of random digits or a computer to select a random sample of subscribers. The population that would be represented by this sample would be all readers of Sports Illustrated, which would certainly be more representative of the general sports fan than the previous methods.
f. The parameter is the percentage of American sports fans that have an unfavorable opinion of Pete Rose. Its value is unknown. The statistic is the 49% of the 1000 people interviewed by the Gallup pollsters who said they had an unfavorable opinion of Pete Rose.
g. The value of the statistic would most likely change if Gallup had selected another random sample of 1000 people to interview. But the value of the parameter would remain the same.

Activity 4-12: Sampling on Campus

a. observational units = college freshmen; variable = weight gained during the first term at college; population = all U.S. college freshmen; sample = random sample of college freshmen; parameter = average weight gained by all college freshmen during their first term.
Since it would be impossible to obtain a random sample of all U.S. college freshmen, work with freshmen at a particular college. Obtain a list of all freshmen from the registrar. Number the list and use a table of random digits to obtain a random sample of freshmen.
b. observational units = college students; variable = price paid for textbooks; population = all U.S. college students; sample = random sample of college students; parameter = average price paid for textbooks by all college students.
Since it would be impossible to obtain a random sample of all U.S. college students, work with students at a particular college. Obtain a list of all students from the registrar. Number the list and use a table of random digits to obtain a random sample of students.
c.
observational units = pages of your history book; variable = number of words on each page; population = all pages in your history book; sample = random sample of pages from your history book; parameter = average number of words per page in your history book.
Number all the pages in your history book consecutively. Use a table of random digits to select a sample of pages from your book and count all the words on these pages.
d. observational units = college faculty; variable = political party registration; population = all U.S. college faculty; sample = random sample of U.S. college faculty; parameters = percentages of U.S. college faculty that are registered in each political party.
Since it would be impossible to obtain a random sample of all U.S. college faculty, work with faculty at a particular college. Obtain a list of all faculty, and number the list. Then use a table of random digits to obtain a random sample of faculty.

Activity 4-13: Sport Utility Vehicles

a. observational units = vehicles
variable = whether or not the vehicle is an SUV
population = all vehicles on the road in your hometown
sample = the vehicles that pass by the intersection between 7 and 8 AM that morning
parameter = the proportion of all vehicles on the road in your hometown that are SUVs
statistic = the proportion of all vehicles that pass by that morning that are SUVs
b. The vehicles that you observed between 7 and 8 AM may not be representative of all vehicles on the road. For example, the vehicles may be used to carpool children to school and therefore overrepresent larger families with children and larger cars. Or they may be predominantly commuter vehicles rather than weekend recreational vehicles, in which case the sample would underrepresent SUVs.
c. The sampling frame is the list of cars sold by that dealer.
d. The recently purchased vehicles will probably not represent the vehicles on the road in your town.
For example, there may have been a backlash against SUVs recently because of high gas prices, so that fewer SUVs were purchased in the last year, yet many people would still own them from purchases several years ago.

Activity 4-14: Generation M
3-8, 4-14, 13-6, 16-1, 16-3, 16-7, 18-1, 21-11, 21-12

a. Your classmates form a sample, as they are only a subset of all students at your school.
b. Answers will vary. This number is a statistic since it is collected from your class (a sample).
c. Answers will vary from class to class, but the numbers calculated will all be statistics.
d. No – you and your classmates do not constitute a random sample of the students at your school, because not every student had an equal chance of being selected for the sample.
e. Answers will vary by school and class.
f. Answers will vary by school and class.

Activity 4-15: Emotional Support
4-15, 18-19

a. Hite’s sampling method is likely to be biased in the direction of women who think they give more support than they receive. She sampled women in women’s groups, and women usually join such groups because they aren’t getting the kind of companionship they want from their husbands or boyfriends.
b. Hite’s poll surveyed the larger number of women.
c. The ABC News/Washington Post poll was probably more representative of the truth about the population of all American women, since it used random sampling that was presumably unbiased.

Activity 4-16: College Football Players

a. position = categorical; weight = quantitative; class = categorical
b. Example answer, using line 13 of the table: First delete the 17 red-shirted freshmen from the list. Then renumber the remaining list from 01 to 82.
Then, use line 13 to select players 54 Brock Daniels (275 lbs), 40 Aris Borjas (200 lbs), 02 Courtney Brown (205 lbs), 21 Anthony Randolph (220 lbs), 50 Jason Relyea (220 lbs), 56 Perris Kelly (285 lbs), 55 Kenny Calderone (285 lbs), 87, 52 Bobby Best (245 lbs), 86, 07 Pat Johnston (195 lbs), 30 Drew Robinson (195 lbs), 34 Martin Mates (185 lbs), 05 Mike Anderson (180 lbs), 60 Lucas Trily (235 lbs), 57 Patrick Koligian (250 lbs), 62 Julai Tuua (275 lbs). (The numbers 87 and 86 are skipped because the list runs only from 01 to 82, which leaves 15 players.)
The average weight in this sample is 3450/15 = 230 lbs. This weight should be fairly close to the average weight of all 82 players since we took a random sample, but we don’t expect it to match exactly. In particular, while this value will vary from sample to sample, we don’t expect a tendency to consistently overestimate or underestimate the population mean weight.

Activity 4-17: Phone Book Gender
4-17, 16-16, 18-11

a. parameter = the proportion of women living in San Luis Obispo County
statistic = the proportion of women listed on the randomly selected phone book page
b. This sampling technique will give a biased estimate of the proportion of women living in San Luis Obispo County, because many married women are listed only under their husbands’ names. In addition, many single women choose not to list their phone numbers to avoid harassing phone calls. Therefore, we expect the statistic to underestimate the population parameter.

Activity 4-18: Sampling Senators
4-9, 4-18

a. This would produce the most variability because it has the smallest sample size.
b. This would produce the least variability because it has the largest sample size.
c. This would have less variability than (a) but more than (d).
d. This would have more variability than (b) but less than (c).
From most variability to least variability: a, c, d, b.
As the sample size increases, regardless of the size of the population, the variability in the sample values decreases.

Activity 4-19: Voter Turnout
4-19, 18-10

a. 1783/2613 = .682
b. This is a statistic because it is a number calculated from a sample (of 2613 adults).
c. [Bar graph: "Did you vote in the 1996 election?" Horizontal axis: Response (yes, no); vertical axis: Proportion, from 0 to 1.]
d. This number (49%) is a parameter because the Federal Election Commission has the records of all registered voters. Everyone who was eligible to vote was included in this number.
e. No, the sample grossly overestimated the proportion of eligible voters who actually voted.
f. While the sample result is unlikely to exactly match the population value, this difference is probably too large to be attributed to sampling variability.
g. People may be reluctant to tell the truth (and seem unpatriotic) and so may claim to have voted when they did not. They might also not remember that they didn’t vote in this particular election. Even with random samples, we have to worry about the honesty of the respondents in surveys.

Activity 4-20: Nonsampling Sources of Bias

a. The proportions of “yes” responses would most likely differ between these two groups. The question that includes the words “horrific murder” is obviously putting a negative idea into the minds of those surveyed, while the other question seems neutral.
b. The proportions declaring agreement with the policy might differ between these two groups. Those interviewed by the smoker might feel pressured into disagreement.
c. The proportion of “yes” responses would probably be lower than the actual proportion of married people in the community who have engaged in extramarital sex. This manner of survey is not very confidential, and the surveyor would be hard-pressed to get honest answers to such a personal and potentially harmful question.
d.
We should not be surprised that the proportions would differ between these two groups. The President’s views on foreign policy would be fresh in the minds of one group, while the other group would have to recall past speeches or actions of the President in order to form an opinion. Approval ratings tend to rise shortly after rousing speeches but then come back down again over time.
e. How the question is worded, the appearance of the interviewer, lack of confidentiality, knowledge of the topic, and the timing of the question.

Activity 4-21: Prison Terms and Car Trips

a. Prisoners with longer terms have a higher probability of ending up in the sample (similar to how longer words are more likely to be selected when you point your finger at one spot on the page).
b. Cars engaged in longer trips have a higher chance of being observed at a particular time point than cars on short trips.
c. Many answers are possible, but one example is estimating the average length of time that people are employed by a particular company. If we take a random sample of the current employees, those who have been around longer have a better chance of ending up in the sample.
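The reasoning in Activity 4-21 (longer prison terms are more likely to be caught by a sample taken at one moment) can be checked with a rough simulation. The sketch below rests entirely on made-up assumptions: the list of possible term lengths, the 100-year window of entry times, and the snapshot year are all hypothetical and are not taken from the activity.

```python
import random
from statistics import mean

rng = random.Random(21)  # fixed seed so the run is repeatable

# Assumption: hypothetical prison terms in years, mostly short with a few long.
POSSIBLE_TERMS = [1, 1, 1, 2, 2, 5, 10, 20]

# Each simulated prisoner enters at a random time in a 100-year window
# and stays for a randomly chosen term.
prisoners = []
for _ in range(20000):
    start = rng.uniform(0, 100)
    term = rng.choice(POSSIBLE_TERMS)
    prisoners.append((start, term))

# Average term over everyone who was ever sentenced.
all_terms_mean = mean(term for _, term in prisoners)

# Snapshot sample: only the prisoners who are in prison at year 50.
snapshot_terms = [term for start, term in prisoners
                  if start <= 50 < start + term]
snapshot_mean = mean(snapshot_terms)

print(f"mean term, all prisoners:         {all_terms_mean:.2f} years")
print(f"mean term, prisoners at year 50:  {snapshot_mean:.2f} years")
```

Under these assumptions the snapshot average should come out well above the average over all prisoners, because a 20-year term overlaps the snapshot moment far more often than a 1-year term does. The same mechanism applies to the car trips in part b and the employment lengths in part c.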