Chapter 8 Sampling

Chapter 4 & 8
Sampling
Sample bias
Margin of error
Spring 2016 12/29/15

Learning Objectives:
samples, surveys, polls
population N
sample n
sample proportion $\hat{p}$
population proportion p
one sample survey
two sample survey
sample bias
margin of error E
95% confidence range

Chapter 8.1

Sampling, polling, and surveying are terms used to describe estimating what a large group of objects (usually people) will do based on what a small group of similar objects does.

Sample = n = a small group of objects intended to represent a much larger group.
Population = N = the large group of objects.

ONE SAMPLE Survey
How many SUNY-Oswego students have a personal computer?

There are 10,000 people at SUNY-Oswego. N = 10,000
There are 1,000 students in the Math Dept. n = 1,000

Select the 1,000 Math students as a sample and ask if they have a personal computer.
216 said yes
784 said no
1,000 sample

The proportion of people who answered “YES” = 216/1000 = .216 = 21.6%
21.6% is called the sample proportion, written as $\hat{p}$ (“p-hat”).

Based on the 1,000 student sample, we will temporarily “assume” that 21.6% of all 10,000 people at SUNY-Oswego have a computer. Later, we will refine that assumption with some statistics, but let it go for now.

Assume 21.6% (10,000) = .216 (10,000) = 2,160 of all SUNY-Oswego people have a computer.
21.6% is now also called the population proportion, written as p.
$\hat{p} = p$ (for now)
sample proportion = population proportion for now

ONE SAMPLE Survey
How many of the 100,000 light bulbs made in your department each month are defective?

N = population = 100,000 each month
Randomly select 800 bulbs as a sample and test them. 17 are defective.
n = sample = 800 randomly selected
17 are defective
783 are OK
800 sample

$\hat{p}$ = sample proportion = 17/800 = .02125 = 2.1% are defective
Assume $\hat{p} = p$ for now.

Estimate the number of bulbs that are defective in a month:
defective = $\hat{p}$(N) = p(N) = 2.1% of (100,000) = .021 (100,000) = 2,100

ONE SAMPLE Survey
You were hired to determine how many voters in the USA will vote Republican and how many will vote Democrat. You selected 485 voters in the City of Oswego to sample.

268 said they will vote Democrat
100 said they will vote Republican
117 did not reply or said it was none of your business
485 old n

The sample size has changed from n = 485 to n = 268 + 100 = 368 because 117 sampled voters provided no useful information and are disqualified.

Democrat $\hat{p}$ = Democrat sample proportion = 268/368 = .728260869 = 72.8%
Republican $\hat{p}$ = Republican sample proportion = 100/368 = .27173913 = 27.2%

There are 122,500,000 voters in the USA. Assume for now $\hat{p} = p$.

For now, Democrat sample proportion $\hat{p}$ = Democrat population proportion p = 72.8%
72.8% (122,500,000) = 89,180,000 will vote Democrat

For now, Republican sample proportion $\hat{p}$ = Republican population proportion p = 27.2%
27.2% (122,500,000) = 33,320,000 will vote Republican

Two sample survey - how many pike (N) are in the lake?

Procedure:
- Using a net, capture a sample of all the fish you can in a period of time.
- Count only the pike. There are 200 pike in the net. $n_1$ = 200
- Tag all $n_1$ = 200 pike and return them all to the lake.
- Sometime later, recapture another sample of all the fish you can and count only the pike, tagged and untagged. $n_2$ = 150
- Count the tagged pike = 21

Summary:
$n_1$ = 200 pike (1st capture), all of which were tagged
$n_2$ = 150 pike (2nd capture), of which 21 were tagged in the 1st capture
N = all pike in the lake

Assume the percentage of tagged pike in the lake (200 tagged pike out of N pike in the lake) is the same as the percentage of tagged pike in the 2nd sample (21 tagged pike out of 150 captured pike):

$$\frac{200}{N} = \frac{21}{150}$$

Number of pike in the lake:

$$N = \frac{n_1 n_2}{21} = \frac{200(150)}{21} \approx 1429 \text{ pike in the lake}$$

We’ve temporarily assumed (for now) in all the above examples that the sample proportion $\hat{p}$ equals the population proportion p. They are nearly equal, but two things prevent them from being exactly equal: sample bias and statistical error (covered in the next 2 lectures).

Practice problems from page 132

1.) A large jar contains N = 200 gumballs of two different colors: red and green. A sample of n = 25 gumballs is randomly drawn.
8 are red
17 are green
25 sample
Estimate the number of red gumballs in the jar.

One sample survey - calculate $\hat{p}$
$\hat{p}$ = 8/25 = 0.32
assume $\hat{p}$ = p = .32
# red = p(200) = 64

3.) Madison County population is 34,522. Estimate the number of people with blood type A- based on a random sample of 253 patients, of which 17 were A-.

One sample survey - calculate $\hat{p}$
$\hat{p}$ = 17/253 = .067193676 (leave this in your calculator)
assume $\hat{p}$ = p = .067193676
# A- = p(34,522) = 2320

8.) A rookery has an unknown quantity (N) of fur seal pups. 4965 were captured and tagged. Later, 900 were captured. Of these, 218 were found to be tagged previously. Estimate the total fur seal pup population.

Two sample survey - how many (N) fur seal pups are in the rookery?
- capture 1st sample of 4965 ($n_1$) fur seal pups and tag each
- capture 2nd sample of 900 ($n_2$) fur seal pups and count 218 tagged previously

$$N = \frac{n_1 n_2}{218} = \frac{4965(900)}{218} \approx 20498 \text{ pups}$$

10.) Maui has an unknown quantity (N) of dolphins. 26 were captured and tagged. Later, 27 were captured and 12 were found to be tagged previously. Estimate the dolphin population.
Two sample survey - how many (N) dolphins in Maui?
- capture 1st sample of 26 ($n_1$) dolphins and tag each
- capture 2nd sample of 27 ($n_2$) dolphins and count 12 tagged previously

$$N = \frac{n_1 n_2}{12} = \frac{26(27)}{12} = 58.5 \approx 59 \text{ dolphins}$$

Practice problems from page 233

13a.) 6,523 of 12,345 people moved within the last 5 years.
N = 12,345 people
6,523 moved
population proportion p = 6,523/12,345 = .528 = 52.8%

13b.) 245 of a sample of 500 moved within the last 5 years.
n = 500 people
245 moved
sample proportion $\hat{p}$ = 245/500 = .49 = 49%
Note: p ≠ $\hat{p}$

14a.) 269 of 2,444 people are left handed.
p = 269/2,444 = .110 = 11%

14b.) 8 of a sample of 50 are left handed.
$\hat{p}$ = 8/50 = .16 = 16%
Note: p ≠ $\hat{p}$

We assumed $\hat{p}$ = p, that is, the sample proportion = population proportion. It’s often not exactly true. Samples are often biased, which moves $\hat{p}$ away from p.

p ≠ $\hat{p}$ …here’s one reason why:

Sample Bias - a built-in tendency (whether intentional or not) which excludes a particular group or characteristic within the population, or includes those that shouldn’t be included.

Common types of sample bias:

Convenience sampling bias - the selection of individuals dictated by what is easiest or cheapest to sample.
- How many of the 10,000 SUNY-Oswego people have a personal computer? Use Math students as the sample because they are convenient for us to contact? It’s well known that Math students buy computers much more often than all other students. By selecting only Math students for the sample, we bias the result toward more computers.
- Selecting only Math students is biased in another way. Many of the 10,000 in the population don’t need a personal computer. Infants in the day-care center, landscapers, plumbers, carpenters, senior citizens, visitors, etc. are included in the 10,000 and must be included in the sample.

Another example (convenience bias)
You were hired to determine how many USA people will vote Republican and how many Democrat.
You selected 485 voters in the City of Oswego because you live here and Oswego voters are convenient to contact. Oswego is, however, a predominantly Democratic city. By selecting only Oswego voters, the sample proportion will be biased toward Democrats.

Non-response bias - many individuals do not respond to a survey.
- Of 10 million people selected in our text (page 119) to survey voter preference, only 2.4 million responded, resulting in a low 24% response rate. A low response rate often means that people aren’t interested now, but are usually interested later (on election day).

How to prevent sample bias:

Random sampling - the best alternative is to let the laws of chance, randomness, determine the selection of a sample. This means that any group of members of the population should have the same chance of being in the sample as any other group of the same size. The personal computer sample should have included students and non-students, Math students and other students. The voter sample should have included voters from other counties and other states.

Quota sampling - a systematic effort to force the sample to be representative of a given population through the use of quotas. The sample should have so many women, so many men, so many blacks, so many whites, so many people living in urban areas, so many people living in rural areas.

Stratified sampling - an alternative to simple random sampling. Divide the sampling frame into categories, called strata, and then (unlike quota sampling) randomly choose a sample from these strata. The chosen strata are then further divided into categories, called substrata, and a random sample is taken from these substrata. The selected substrata are further subdivided, a random sample is taken from them, etc. The process goes on for a predetermined number of steps (usually four or five). Stratified sampling has generally proved to be a reliable way to collect national data.
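The difference between simple random sampling and stratified sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the campus population and its three groups are made up, the function names are hypothetical, and the stratified version samples one level of strata at a fixed fraction rather than the four or five nested steps mentioned above.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, seed=None):
    """Draw n members so that every group of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_sample(population, stratum_of, fraction, seed=None):
    """Randomly sample the same fraction from each stratum (one level only)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical campus of 10,000 people as (id, group) pairs; made up for illustration.
population = (
    [(i, "math student") for i in range(1000)]
    + [(i, "other student") for i in range(1000, 8000)]
    + [(i, "non-student") for i in range(8000, 10000)]
)

srs = simple_random_sample(population, 500, seed=1)
strat = stratified_sample(population, lambda member: member[1], 0.05, seed=1)

print(len(srs), len(strat))
```

In the simple random sample the number of Math students varies from draw to draw; in the stratified sample each group contributes exactly 5% of its members, so no group can be left out by chance.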
Cause and effect in the medical community

Sampling sometimes suggests a cause and effect relationship which can’t be proven. If formal proof is needed, use a clinical study or clinical trial.

Examples of faulty studies/trials:
Hormone replacement therapy studies (text page 125) yielded conflicting results.
Coffee drinking extends life (text page 126) yielded confounded results.
Alar apple study (text page 127) yielded scary, misleading results.
Salk Polio vaccine study (text page 128) yielded confused results.

Methods to prevent faulty studies/trials:

Controlled study - the subjects are divided into two groups: one that gets the treatment (treatment group) and another that doesn’t (control group). The control group is there for comparison purposes only; it gives the experimenters a baseline to see if the treatment group does better or not.

Placebo effect - just the idea that one is getting a treatment can produce positive results in suggestible people.

Blind study - a study in which neither the members of the treatment group nor the members of the control group know to which of the two groups they belong.

Double-blind study - a controlled placebo study in which neither the subjects nor the scientists conducting the experiment know which subjects are in the treatment group and which are in the control group.

Practice problems page 133

15.a) Convenience - selecting a sample close by, not random. George peeks only at scores of nearby classmates.
15.b) Stratified - this is good random sampling. The population is divided into 4 strata, then 5% is sampled randomly from each stratum.
15.c) Simple - all players can be in the sample; selecting a random sample from the entire population.
15.d) Quota - forced sample to have a specific trait (seniors only)…not random.

17a.) The population being sampled is registered Cleansburg voters only.
17b.) The sample is 680 registered Cleansburg voters surveyed by phone.
17c.) The sampling method is simple random selection.

18a.)
The sampling proportion is 680 randomly chosen out of 8325 registered voters.
sampling proportion = n/N = 680/8325 = .08168 = 8.2%

18b.) 306 out of 680 sampled stated they would vote for Smith. The sample statistic estimating the percentage of the vote going to Smith = 306/680 = .45 = 45%

19.) Smith actually received 42% compared to 45% estimated.
sample error = 45% - 42% = 3%
Jones’s estimated percentage from the survey was 272/680 = .4 = 40%. Jones actually received 43%.
sample error = 43% actual - 40% estimated = 3%
Brown’s estimated percentage from the survey was 102/680 = .15 = 15%. Brown actually received 15%.
sample error = 15% - 15% = 0%

20.) The error appears to be due to chance because the sample was selected randomly. Also, since there was a 100% response rate, non-response bias can be disregarded.

37a.) The target population is those experiencing a cold and likely to buy medication to help.
37b.) The sampling frame is college students in the San Diego area.
37c.) The sample is 500 students from a warm weather climate. Also, they are young, likely to overlook a cold, and likely to spend their limited money on other things.

38a.) The study was not a controlled study. There was no control group.
38b.) List four possible causes other than the effectiveness of vitamin X itself that could have confounded the results of the study.
1.) San Diego students are young and healthy compared to the target population, which includes older adults, young children, and people that live in cold wintry weather.
2.) Students were paid to participate.
3.)
4.) There was no control group.

39.) List four different problems with the study that indicate poor design.
1.) College students don’t represent the population in age, health, or finances.
2.) They were paid to participate.
3.) San Diego isn’t “cold” country compared to the Northeast.
4.) The medical response to the vitamin was self-reported, not medically determined.

40.) List 4 recommendations to improve this study:
1.)
Select sample participants randomly from all over the country.
2.) Set up a control group who are given identical looking placebos.
3.) Have medical staff determine actual improvement, like temperature and congestion.
4.) Make it a double-blind study where no one knows who is getting the vitamin and who is getting the placebo.

….another reason why $\hat{p}$ ≠ p

Chapter 8.3 – Sample error

We assumed $\hat{p}$ = p, and if the sample wasn’t biased, it will be close, but not exactly the same…..there will be a margin of error even if there were no bias.

How closely $\hat{p}$ represents p (sample proportion vs. population proportion) is expressed by a margin of error E:

$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

and a 95% confidence interval given by the range:

($\hat{p}$ - E) to ($\hat{p}$ + E)

The 95% confidence interval tells us that we can be 95% sure that p lies within that range.

Example:
Nielsen Rating Service is paid to estimate how many of the world’s 2,000,000,000 sports fans are watching World Cup Soccer. They randomly sampled 5,000 households and found 3,615 were watching.

$\hat{p}$ = sample proportion = 3,615/5,000 = .723. In words, 72.3% of the sample were watching. But, p ≠ $\hat{p}$.

The margin of error E is:

$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.723(1-.723)}{5000}} = 2\sqrt{\frac{.200271}{5000}} = 2\sqrt{.0000400542} = 2(.006328838756) = .012657677 \approx .013$$

E = .013 = 1.3%

With 95% confidence, p (population proportion) will be somewhere between ($\hat{p}$ - E) and ($\hat{p}$ + E):
(72.3% - 1.3%) to (72.3% + 1.3%) = 71% to 73.6%

In words, the population proportion p will be in the range 71% to 73.6% of sports fans.

Let’s assume p is at the low end of the range, 71%. With 95% confidence, Nielsen Rating Service can say at least 71% of 2,000,000,000 sports fans are watching World Cup Soccer.
71% (2,000,000,000) = .71(2,000,000,000) = 1,420,000,000 fans

Example: Yankee Stadium Rock Concert
Theater Concert Promotions Inc. was asked about the possibility of producing a rock concert at Yankee Stadium.
They needed to estimate how many from the likely 4,000,000 person market (18-35 year olds) would purchase $175 tickets for this proposed event. After randomly selecting 600 people in this age range, they acquired the following data:

Summary:
11 yes
564 no
575 sample
25 no response

What was the population N? N = 4,000,000 person market
What was the sample size n? n = 575 people (the no responses are excluded)
What is the sample proportion $\hat{p}$? $\hat{p}$ = 11/575 = .019 = 1.9%

Calculate the range in which p will occur with a 95% confidence level:

$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.019(1-.019)}{575}} = 2\sqrt{.000032} = 2(.0057) = .011$$

E = .011 = 1.1%

95% confidence range is: 1.9% - 1.1% = .8% to 1.9% + 1.1% = 3%

Assume the worst case, that p will be at the low end (.8%) of the range. How many will attend the concert?
.8% of 4,000,000 = .008(4,000,000) = 32,000

If Yankee Stadium requires $10,000,000 guaranteed revenue to book the concert, will Theater Concert Promotions Inc. be able to guarantee it?
(32,000 attendance)($175 ticket) = $5,600,000…no guarantee possible if $10,000,000 is needed.

Practice problems page 237:

68.) Estimate the margin of error E and a 95% confidence interval for n = 1,000 and $\hat{p}$ = 0.4 = 40%

$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.4(1-.4)}{1000}} = 2\sqrt{.00024} = .031$$

E = .031 = 3.1%

So, with 95% confidence we can estimate the actual p ranges between $\hat{p}$ - E and $\hat{p}$ + E:
(40% - 3.1%) to (40% + 3.1%)
36.9% and 43.1%
With 95% confidence, p will be somewhere in the range 36.9% to 43.1%.

82.) Based on a sample of n = top 400 movies, $\hat{p}$ = 98% of them involve drugs, etc.

a.) Estimate the population proportion p of all movies that contain drugs, etc.
For all movies, we can say with 95% confidence that p will be somewhere inside the range $\hat{p}$ - E to $\hat{p}$ + E:
98% - E to 98% + E

b.) What is E?

$$E = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2\sqrt{\frac{.98(1-.98)}{400}} = 2\sqrt{.000049} = .014$$

E = .014 = 1.4%

So, with 95% confidence we can estimate p ranges between:
(98% - 1.4%) and (98% + 1.4%)
96.6% and 99.4%
With 95% confidence, we can say p ranges between 96.6% and 99.4% of all movies.

c.)
Are the top 400 movies a random sample? No, they’re the top 400. Middle and bottom movies are not included in the sample. The sample is not random; it’s biased toward the top.
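The calculations used throughout these notes (sample proportion, the two sample tag-and-recapture estimate, and the margin of error with its 95% confidence range) can be checked with a short Python sketch. The function names are hypothetical and chosen only for illustration; the factor 2 is the one used in the notes’ formula for E, and the numbers are taken from the pike example, the Nielsen example, and problem 68.

```python
import math


def sample_proportion(successes, n):
    """p-hat: fraction of the sample with the trait of interest."""
    return successes / n


def capture_recapture(n1, n2, tagged_in_second):
    """Two sample estimate of N, solving n1/N = tagged_in_second/n2 for N."""
    return n1 * n2 / tagged_in_second


def margin_of_error(p_hat, n):
    """E = 2 * sqrt(p_hat * (1 - p_hat) / n), as in the notes."""
    return 2 * math.sqrt(p_hat * (1 - p_hat) / n)


def confidence_range(p_hat, n):
    """95% confidence range (p_hat - E, p_hat + E)."""
    e = margin_of_error(p_hat, n)
    return p_hat - e, p_hat + e


# Pike example: 200 tagged, 150 recaptured, 21 of those tagged.
print(round(capture_recapture(200, 150, 21)))  # 1429

# Nielsen example: 3,615 of 5,000 households watching.
p_hat = sample_proportion(3615, 5000)
print(f"{margin_of_error(p_hat, 5000):.3f}")   # 0.013

# Problem 68: n = 1,000 and p-hat = 0.4.
low, high = confidence_range(0.4, 1000)
print(f"{low:.3f} to {high:.3f}")              # 0.369 to 0.431
```

Keeping the unrounded value of $\hat{p}$ until the last step, as the notes suggest with “leave this in your calculator”, avoids small differences in the final digit.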