4.1 Sampling and Surveys

The Federalist Papers

The Authors

Word Bank
• Anarchy • Difficulty • Oblige • Cede • Mankind • Instituted • Perfection • Natural • Governed

Alexander Hamilton (Federalist No. 65) [On Government]
• “If mankind were to resolve to agree in no institution of government, until every part of it had been adjusted to the most exact standards of perfection, society would soon become a general scheme of anarchy, and the world a desert.”

John Jay (Federalist No. 2) [On Government]
• “Whenever and however it is instituted, the people must cede to it, some of their natural rights in order to vest it with requisite powers.”

James Madison (Federalist No. 51) [On Government and Human Nature]
• “The great difficulty lies in this: you must first enable the government to control the governed; and in the next place, oblige it to control itself.”

Activity: Sampling from The Federalist Papers

The Federalist Papers are a series of 85 essays supporting the ratification of the U.S. Constitution. At the time they were published, the identity of the authors was a secret known to just a few people. Over time, however, the authors were identified as Alexander Hamilton, James Madison, and John Jay. The authorship of 73 of the essays is fairly certain, leaving 12 in dispute. However, thanks in some part to statistical analysis, most scholars now believe that the 12 disputed essays were written by Madison alone or in collaboration with Hamilton.

There are several ways to use statistics to help determine the authorship of a disputed text. One example is to estimate the average word length in a disputed text and compare it to the average word lengths of works where the authorship is not in dispute.

• Directions: The following passage is the opening paragraph of Federalist Paper #51, one of the disputed essays. The theme of this essay is the separation of powers between the three branches of government.
Choose 5 words from this passage, count the number of letters in each of the words you selected, and find the average word length. Share your estimate with the class and create a class dotplot.

• To what expedient, then, shall we finally resort, for maintaining in practice the necessary partition of power among the several departments, as laid down in the Constitution? The only answer that can be given is, that as all these exterior provisions are found to be inadequate, the defect must be supplied, by so contriving the interior structure of the government as that its several constituent parts may, by their mutual relations, be the means of keeping each other in their proper places. Without presuming to undertake a full development of this important idea, I will hazard a few general observations, which may perhaps place it in a clearer light, and enable us to form a more correct judgment of the principles and structure of the government planned by the convention.

Directions: Use a table of random digits or a random number generator to select a simple random sample (SRS) of 5 words from the opening passage of Federalist Paper #51. Once you have chosen the words, count the number of letters in each of the words you selected and find the average word length. Share your estimate with the class and create a class dotplot.

Trig has a few questions about the activity.
• How does the first dotplot (individual choice) compare to the second one (random choice)? Can you think of any reasons why they might be different?
• Do we see any problems with the first sampling technique?
• How about this scenario: we want to run a study to determine the average salary of HS students, so we go outside into the sophomore hallway and sample the first 20 students we see. Can you see any problems with that? What might be a fix to correct those problems?
• Which sampling technique do you think is more reliable? Why?
• Think about what would happen if we kept running simulations.
Which technique would center around the true mean?

J.K. Rowling
• A similar statistical technique was used to determine that J.K. Rowling was actually the author of “The Cuckoo’s Calling,” a crime fiction novel published in 2013 under the pseudonym Robert Galbraith.
• J.K. Rowling, “The Cuckoo’s Calling”

4.1 Sampling and Surveys Objectives

SWBAT:
1. IDENTIFY the population and sample in a statistical study.
2. IDENTIFY voluntary response samples and convenience samples. EXPLAIN how these sampling methods can lead to bias.
3. DESCRIBE how to obtain a random sample using slips of paper, technology, or a table of random digits.
4. DISTINGUISH a simple random sample from a stratified random sample or cluster sample. Give the advantages and disadvantages of each sampling method.
5. EXPLAIN how undercoverage, nonresponse, question wording, and other aspects of a sample survey can lead to bias.

What’s the difference between a population and a sample? What is a census?
• A population is the entire group of individuals we want information about.
• A census collects data from every individual in the population.
• Often it is too time consuming or too costly to reach every member of the population, so a sample is chosen to represent the population.
• A sample is a subset of individuals in the population from which we actually collect data.

What is the icon in the top-right corner of the example on page 210?
• That’s the video icon!!! It means there is a video for this example on the textbook website. See the link on my teacher page!!!

When sampling, we want to choose a sample that we can trust to represent the population. There are a number of different methods to select samples.

What’s the problem with convenience samples?
• A convenience sample consists of individuals from the population who are easy to reach.
• For example, if you want to know how long students in a high school spend doing homework, you might go to the library and ask the first 30 students you see.
• The problem with this is that it is unlikely to represent the study habits of all high school students. Students sampled in a library are likely to spend more time on homework than students in general. As a result, we will overestimate the average homework time of our population.
• Convenience samples are almost guaranteed to show bias.
• Think about this convenience sample. You take a survey at a football game: “Do you support or oppose cuts to the athletic budget?” How might this sample be biased?

What is bias?
• The design of a statistical study shows bias if it would consistently underestimate or consistently overestimate the value you want to know.
• Think of our Federalist Papers activity! Was there any bias?
• On the AP test, if asked to describe how the design of a study leads to bias, you need to do two things:
• 1) identify a problem with the design
• 2) explain how this problem would lead to an underestimate or overestimate
• Bias also exists in voluntary response samples.

What’s a voluntary response sample? Is this a good method for obtaining a sample?
• A voluntary response sample consists of people who choose themselves by responding to a general invitation.
• Think about call-in, text-in, write-in, and Internet polls. Usually the people in such surveys are not representative of some larger population of interest, because generally these people feel strongly about an issue, and often share the same opinion. This leads to bias.
• Think about American Idol. A lot of people watch, but does everyone vote? Usually your average fan will watch the show but not vote, whereas your super fans will vote several times. Do the numbers really represent how the population feels about certain contestants? Probably not.

The Check Your Understanding feature gives you practice to ensure you understand the concepts covered!
Example: To estimate the proportion of families that oppose budget cuts to the athletic department, the PC principal surveys families as they enter the football stadium on Friday night. Explain how this plan will result in bias and how the bias will affect the estimated proportion.

Families attending the football game are likely pro-athletics. Therefore, they are more likely to oppose budget cuts than families in general, so the survey would overestimate the proportion of families that oppose budget cuts. Note: This is a convenience sample.

• A sample chosen by chance rules out both favoritism by the sampler and self-selection by respondents.
• Random sampling involves using a chance process to determine which members of a population are included in the sample.

What’s a simple random sample (SRS)? How can you choose an SRS?

A simple random sample (SRS) of size n is chosen in such a way that every group of n individuals in the population has an equal chance to be selected as the sample.
• There are a few ways to choose an SRS.
• One way is with a hat.
• For example, let’s say this class was our population, and we wanted to draw a sample. We can write everyone’s name on a slip of paper, place the slips into a hat, shake the hat to shuffle the names, and then draw names from the hat without looking.
• Other ways to choose an SRS involve using random numbers generated by a computer or calculator to choose samples. If you don’t have technology handy, you can use a table of random digits.

How to Choose an SRS

Choosing an SRS With Technology
Step 1: Label. Give each individual in the population a distinct numerical label from 1 to N.
Step 2: Randomize. Use a random number generator to obtain n different integers from 1 to N.

How to Choose an SRS Using Table D
Step 1: Label. Give each member of the population a numerical label with the same number of digits. Use as few digits as possible.
Step 2: Randomize.
Read consecutive groups of digits of the appropriate length from left to right across a line in Table D. Ignore any group of digits that wasn’t used as a label or that duplicates a label already in the sample. Stop when you have chosen n different labels. Your sample contains the individuals whose labels you find.

What’s the difference between sampling with replacement and sampling without replacement? How should you account for this difference when using a table of random digits or other random number generator?
• Sampling with replacement means the numbers can repeat, or be selected again.
• Sampling without replacement means the numbers must remain unique and cannot be selected again.
• Think of selecting cards from a standard deck of 52 cards. Let’s say you want to draw two consecutive cards. With replacement means you will draw the first card and then put the card back into the deck before you draw the second card. Without replacement means you will draw the first card and then set it aside before you draw the second.
• To sample without replacement using the table of random digits, skip any number that has already been selected and keep reading.
• To sample without replacement on the calculator, use option 8:randIntNoRep(

Example: Mall Hours
The management company of a local mall plans to survey a random sample of 3 stores to determine the hours they would like to stay open during the holiday season. Use Table D at line 101 to select an SRS of 3 stores.

First give each member of the population a numerical label with the same number of digits. Now read the digits from left to right, ignoring digits that weren’t used as labels and ignoring repeat numbers. We would survey Star Jewelers, Carter’s for Kids, and Mrs. Fields.

Suppose we want to estimate the yield of our corn field. The field is square and divided into 16 equally sized plots (4 rows X 4 columns). A river runs along the eastern edge of the field.
We want to take a sample of 4 plots. Use a random number generator and pick an SRS of 4 plots. Mark them with an X. Now randomly choose one plot from each horizontal row. This is called a stratified random sample. Which method do you think will work best? Explain. Finally, randomly choose one plot from each vertical column. This is also a stratified random sample.
• Now it’s time for the harvest! The numbers to the right are the yield for each of the 16 plots. For each of your three samples, calculate the average yield.

What is a stratified random sample? How is it different from a simple random sample?
• The basic idea of sampling is straightforward: take an SRS from the population and use your sample results to gain information about the population.
• Sometimes there are statistical advantages to using more complex sampling methods. To get a stratified random sample, start by classifying the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the sample.

When is it beneficial to use a stratified random sample? What is the benefit? How do you choose a variable to stratify by?
• It is beneficial to use a stratified random sample when subpopulations within an overall population vary. It is advantageous to sample each subpopulation independently.
• The benefit is that stratified random samples give more precise estimates (the estimates will be less variable).
• See the example on the bottom of page 220.
• When choosing which variable to stratify by, pick the variable that is the best predictor of what you are trying to measure (the variable that has the strongest association).
• For example, in our harvest example, it was a good idea to stratify by column because one of the best predictors of corn yield is irrigation. Stratifying by column guaranteed that our sample included plots at each distance from the river (the closer plots get more water than the farther ones).
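
The three ways of choosing 4 plots in the corn field activity can be sketched in Python. This is only a minimal sketch and not part of the original activity: the (row, column) labels, the function names, and the choice of Python's random module are assumptions made for illustration. No yields are included because the plot yields from the slide are not reproduced in these notes.

```python
import random

ROWS, COLS = 4, 4  # the field is 4 rows by 4 columns (16 plots)

# Label every plot by (row, column). Assumption: the last column index
# is the one next to the river on the eastern edge.
PLOTS = [(r, c) for r in range(ROWS) for c in range(COLS)]

def srs_plots(n=4, rng=random):
    """SRS: every group of n plots is equally likely to be the sample."""
    return rng.sample(PLOTS, n)

def stratified_by_row(rng=random):
    """Rows are the strata: one randomly chosen plot from each row."""
    return [(r, rng.randrange(COLS)) for r in range(ROWS)]

def stratified_by_column(rng=random):
    """Columns are the strata: one randomly chosen plot from each column."""
    return [(rng.randrange(ROWS), c) for c in range(COLS)]

if __name__ == "__main__":
    print("SRS of 4 plots:      ", srs_plots())
    print("Stratified by row:   ", stratified_by_row())
    print("Stratified by column:", stratified_by_column())
```

Note that only the column version is guaranteed to include one plot at each distance from the river; an SRS or a row-stratified sample could, by chance, land entirely in the columns near the water.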
What is a cluster sample? Why do we use a cluster sample? How is it different from a stratified sample?
• Although a stratified random sample can sometimes give more precise information about a population than an SRS, both sampling methods are hard to use when populations are large and spread out over a wide area.
• In that situation, we’d prefer a method that selects groups of individuals that are “near” one another.
• To get a cluster sample, start by classifying the population into groups of individuals that are located near each other, called clusters.
• Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample.
• Cluster samples are often used for practical reasons, like saving time and money.
• Example: The school administration is considering a new schedule and would like student input. Administration decides to survey 200 randomly selected students. Think about how difficult it would be to locate 200 students. Administration opts to use a cluster sample. The clusters are homerooms. They take an SRS of 8 homerooms and give the survey to all 25 students in each homeroom.
• Question: Why would it be better to cluster by homeroom than to cluster by math classes?
• Math classes are usually formed by grouping students of the same ability. These clusters will be more homogeneous than the general population. Certain types of classes (e.g., honors or remedial) may be over- or underrepresented. Homeroom classes, however, are much more mixed.
• Clusters are different from strata. We want each stratum to contain similar individuals and for there to be large differences between strata. For a cluster, we’d like each cluster to look just like the population, but on a smaller scale.

Example: A Good Read
A school librarian wants to know the average number of pages in all the books in the library. The library has 20,000 books, arranged by type (fiction, biography, history, etc.) in shelves that hold about 50 books each.
a) Explain how to select a simple random sample of 500 books.
Number the books 1 to 20000. Use the command randIntNoRep with a lower bound of 1, an upper bound of 20000, and a sample size of 500 to select 500 unique numbers between 1 and 20000. Use the books that correspond to those numbers.

b) Explain how to select a stratified random sample of 500 books. Explain your choice of strata and one reason why this method might be chosen.
Stratify by type because different types of books might be longer or shorter than other types. This will provide a more precise estimate of the average number of pages than a simple random sample. To select the sample, take an appropriately sized SRS of each type of book and combine the books selected from each type to form the sample. For example, if there are 1000 biographies (5% of 20000), select an SRS of 25 biographies (5% of 500) using the method described in part (a).

c) Explain how to select a cluster sample of 500 books. Explain your choice of cluster and one reason why this method might be chosen.
Clusters are formed by grouping books that are located near each other, making it easier for the librarian. We can use each shelf of 50 books as a cluster and randomly select 10 shelves to obtain the 500 books for our sample. Number the shelves from 1 to 400 (20000/50) and choose an SRS of 10 shelves using the method described in part (a). Then use all the books on the 10 selected shelves as the sample.

d) Discuss a potential drawback with each of the methods above.
SRS: Time consuming and difficult to locate the 500 books. You also might miss certain book types in the sample (for example, if only 1% of books are cat biographies, there’s a chance no cat biography will be in the sample).
Stratified random sample: Time consuming (need to find books from all over the library).
Cluster: Because only 10 shelves are being chosen, it is possible that certain book types will be left out.

What is inference?
• The purpose of a sample is to give us information about a larger population.
• The process of drawing conclusions about a population on the basis of sample data is called inference.
• We infer information about the population from what we know about the sample.

Why should we rely on random sampling?
1. To avoid bias in selecting samples from the list of available individuals.
2. The laws of probability allow trustworthy inference about the population.
• Results from random samples come with a margin of error that sets bounds on the size of the likely error.
• Larger random samples give better information about the population than smaller samples.

What is a margin of error?
• It is unlikely that results from a random sample are exactly the same as for the entire population. Sample results are only estimates of the truth about the population.
• If we selected two samples at random from the same population, we would almost certainly choose different individuals, so the sample results would differ somewhat, by chance.
• A margin of error sets bounds on how far we expect the sample proportion to be from the true proportion, at most.
• The word “error” does not mean that a mistake has been made!!! The margin of error simply accounts for the variability that results from taking a random sample from a population. It does not account for a mistake (bias) made during the data collection process.

What is the benefit of increasing the sample size?
• Increasing the sample size increases our precision (not accuracy). Larger samples give better information about a population than smaller samples. As sample size increases, the effect of random chance decreases.
• Suppose we perform an experiment flipping a coin to determine how often heads will appear.
Let’s say you flip a coin 10 times. If heads comes up 3 times, does that mean heads will only appear 30% of the time?
• How can you get a more precise result? Increase your sample size. If we increased our sample size to 1000 flips, what would you expect to happen to the proportion of heads? It should move much closer to 50%. There is too much chance involved with small samples, but as the sample size increases, the chance factor decreases.

What is a sampling frame?
• The list of individuals from which a sample will be drawn is called the sampling frame.

What is undercoverage and what problems might undercoverage cause?
• Undercoverage occurs when some members of the population cannot be chosen in a sample.
• Examples: A sample survey of households will miss homeless people, prison inmates, and students in dormitories. The Federalist Paper example only looked at the opening paragraph. It would have missed any words that appeared in subsequent paragraphs that weren’t in the first.
• Undercoverage may lead to bias if the members who are left out differ from the rest of the population.

What is nonresponse and what problems might nonresponse cause? How is it different from voluntary response?
• Nonresponse occurs when an individual chosen for the sample can’t be contacted or refuses to participate.
• Bias results if people who respond differ from those who don’t, in a way that is related to the response.
• Nonresponse is different from voluntary response. Nonresponse can occur only after a sample is selected. In a voluntary response sample, every individual has opted to take part, so there won’t be any nonresponse.

What is response bias and what problems might response bias cause?
• A systematic pattern of incorrect responses in a sample survey leads to response bias.
• Response bias occurs because people may lie about their age, income, or lifestyle habits, or they may misremember or make up answers.
• The wording of questions, gender, race, age, ethnicity, human nature, the order of questions, or even the interviewer can affect people’s responses.
• Obviously this can lead to bias if people are not being truthful.
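
Returning to the coin-flip illustration of sample size: a short simulation can show how the proportion of heads settles near 50% as the number of flips grows. This is a sketch under assumptions that are not in the original notes: a fair coin, Python's random module, and two hypothetical helper functions, proportion_heads and spread_of_estimates. Results vary from run to run unless a seed is fixed.

```python
import random

def proportion_heads(n_flips, rng):
    """Flip a fair coin n_flips times and return the proportion of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

def spread_of_estimates(n_flips, n_samples, rng):
    """Range (max minus min) of the proportion of heads across repeated samples."""
    props = [proportion_heads(n_flips, rng) for _ in range(n_samples)]
    return max(props) - min(props)

if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the run is repeatable
    for n in (10, 1000):
        spread = spread_of_estimates(n, 200, rng)
        print(n, "flips per sample: range of estimates =", round(spread, 3))
```

The estimates from 1000 flips should be bunched much more tightly around 0.5 than the estimates from 10 flips. A larger sample only reduces this chance variability; it does not remove bias from undercoverage, nonresponse, or question wording.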
