Inference with Proportions I
Sampling Distributions and Confidence Intervals

Parameter
• A number that describes the population
• Symbols we will use for parameters include
  $\mu$ - mean
  $\sigma$ - standard deviation
  $p$ - proportion

Statistic
• A number that can be computed from sample data
• Some statistics we will use include
  $\bar{x}$ - sample mean
  $s$ - standard deviation
  $\hat{p}$ - sample proportion
• The observed value of the statistic depends on the particular sample selected from the population, and it will vary from sample to sample. This variability is called sampling variability.

Let's explore what happens with distributions of sample proportions ($\hat{p}$). Have students perform the following experiment.
• Toss a penny 20 times and record the number of heads.
• Calculate the proportion of heads and mark it on the dot plot on the board. This is a statistic! In this case, we will use
  $\hat{p} = \dfrac{\text{number of successes in the sample}}{n}$

What shape do you think the dot plot will have?
The dotplot is a partial graph of the sampling distribution of all sample proportions of sample size 20.
What would happen to the dotplot if we flipped the penny 50 times and recorded the proportion of heads?

Sampling Distribution of $\hat{p}$
The distribution that would be formed by considering the value of a sample statistic for every possible different sample of a given size from a population.

Suppose we have a population of six students: Alice, Ben, Charles, Denise, Edward, & Frank
We are interested in the proportion of females. This is called the parameter of interest.
We will keep the population small so that we can find ALL the possible samples of a given size.
What is the proportion of females? $p = 1/3$
Let's select samples of two from this population. How many different samples are possible? ${}_6C_2 = 15$
Find the 15 different samples that are possible and find the sample proportion of the number of females in each sample.
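The enumeration asked for above can be sketched in a few lines of Python. This is only an illustration, not part of the class activity: the names `students`, `females`, and `sample_proportions` are made up for the sketch, and it assumes Alice and Denise are the two females (which is what p = 1/3 implies).

```python
from fractions import Fraction
from itertools import combinations
from statistics import pstdev  # population standard deviation

# Assumption for this sketch: Alice and Denise are the females (p = 2/6 = 1/3).
students = ["Alice", "Ben", "Charles", "Denise", "Edward", "Frank"]
females = {"Alice", "Denise"}


def sample_proportions(size):
    """Return the proportion of females in every possible sample of `size`."""
    return [
        Fraction(sum(name in females for name in sample), size)
        for sample in combinations(students, size)
    ]


props = sample_proportions(2)

# Mean and standard deviation taken over ALL possible samples of size 2.
mean_p_hat = sum(props) / len(props)
sd_p_hat = pstdev([float(x) for x in props])

for sample, prop in zip(combinations(students, 2), props):
    print(" & ".join(sample), float(prop))
print(len(props), mean_p_hat, round(sd_p_hat, 5))
```

Calling `sample_proportions(3)` gives the same list for samples of three.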
Sample                 $\hat{p}$
Alice & Ben            .5
Alice & Charles        .5
Alice & Denise         1
Alice & Edward         .5
Alice & Frank          .5
Ben & Charles          0
Ben & Denise           .5
Ben & Edward           0
Ben & Frank            0
Charles & Denise       .5
Charles & Edward       0
Charles & Frank        0
Denise & Edward        .5
Denise & Frank         .5
Edward & Frank         0

Find the mean and standard deviation of these sample proportions.
$\mu_{\hat{p}} = \dfrac{1}{3}$ and $\sigma_{\hat{p}} = 0.29814$
How does the mean of the sampling distribution compare to the population parameter (p)?

Six Students Continued . . .
Let's select samples of three from this population. How many different samples are possible? ${}_6C_3 = 20$
Find the mean and standard deviation of these sample proportions.
$\mu_{\hat{p}} = \dfrac{1}{3}$ and $\sigma_{\hat{p}} = 0.2108$

General Properties for Sampling Distributions of $\hat{p}$
Rule 1: $\mu_{\hat{p}} = p$
Rule 2: $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$

Let's verify Rule 2. Does the formula equal the standard deviation for samples of size 2 ($\sigma_{\hat{p}} = .29814$)?
NO: $\sigma_{\hat{p}} = \sqrt{\dfrac{\frac{1}{3}\left(\frac{2}{3}\right)}{2}} = \dfrac{1}{3} \neq 0.29814$
WHY? We are sampling more than 10% of our population!
Correction factor: multiply by $\sqrt{\dfrac{N-n}{N-1}}$
If we use the correction factor, we will see that we are correct.
$\sigma_{\hat{p}} = \sqrt{\dfrac{\frac{1}{3}\left(\frac{2}{3}\right)}{2}} \cdot \sqrt{\dfrac{6-2}{6-1}} = 0.29814$
So in order to use the formula $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$ to calculate the standard deviation of the sampling distribution, we MUST be sure that our sample size is less than 10% of the population!

General Properties for Sampling Distributions of $\hat{p}$
Rule 1: $\mu_{\hat{p}} = p$
Rule 2: $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
This rule is exact if the population is infinite, and is approximately correct if the population is finite and no more than 10% of the population is included in the sample.

Chip Activity:
• Select three samples of size 5, 10, and 20 and record the number of blue chips.
• Place your proportions on the appropriate dotplots.
What do you notice about these distributions?

In the fall of 2008, there were 18,516 students enrolled at California Polytechnic State University, San Luis Obispo. Of these students, 8091 (43.7%) were female. We will use a statistical software package to simulate sampling from this Cal Poly population.
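A minimal sketch of such a simulation, assuming Python's standard library in place of the statistical software package used in class. The population is built from the enrollment counts above; the function name `simulate`, the fixed seed, and the use of 500 samples for each of the sizes 10, 25, 50, and 100 are choices made for this sketch.

```python
import random
from statistics import fmean, pstdev

N_STUDENTS = 18516
N_FEMALE = 8091

# 1 = female, 0 = not female; one entry per enrolled student.
population = [1] * N_FEMALE + [0] * (N_STUDENTS - N_FEMALE)


def simulate(n, reps=500, seed=0):
    """Draw `reps` samples of size n (without replacement) and
    return the sample proportion of females from each one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [sum(rng.sample(population, n)) / n for _ in range(reps)]


results = {n: simulate(n) for n in (10, 25, 50, 100)}

for n, props in results.items():
    # Center and spread of the simulated sampling distribution.
    print(n, round(fmean(props), 3), round(pstdev(props), 3))
```

Comparing the printed means with p = .437, and watching how the standard deviations change with n, addresses the same questions the histograms are meant to raise.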
We will generate 500 samples of each of the following sample sizes: n = 10, n = 25, n = 50, n = 100 and compute the proportion of females for each sample.
The following histograms display the distributions of the sample proportions for the 500 samples of each sample size.
Are these histograms centered around the true proportion p = .437?
What do you notice about the standard deviation of these distributions?
What do you notice about the shape of these distributions?

The development of viral hepatitis after a blood transfusion can cause serious complications for a patient. The article “Lack of Awareness Results in Poor Autologous Blood Transfusions” (Health Care Management, May 15, 2003) reported that hepatitis occurs in 7% of patients who receive blood transfusions during heart surgery.
We will simulate sampling from a population of blood recipients. We will generate 500 samples of each of the following sample sizes: n = 10, n = 25, n = 50, n = 100 and compute the proportion of people who contract hepatitis for each sample.
The following histograms display the distributions of the sample proportions for the 500 samples of each sample size.
Are these histograms centered around the true proportion p = .07?
What happens to the shape of these histograms as the sample size increases?

General Properties Continued . . .
Rule 3: When n is large and p is not too near 0 or 1, the sampling distribution of $\hat{p}$ is approximately normal.
The farther the value of p is from 0.5, the larger n must be for the sampling distribution of $\hat{p}$ to be approximately normal.
A conservative rule of thumb: If np > 10 and n(1 – p) > 10, then a normal distribution provides a reasonable approximation to the sampling distribution of $\hat{p}$.

Why does np > 10 ensure an approximate normal distribution?
In a binomial distribution, we will investigate what happens to the probability histogram as the sample size increases.
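The rule of thumb above can be applied to the two simulations as a quick check. The sketch below is illustrative only; `normal_approx_ok` is a hypothetical helper name, not something from the course software.

```python
def normal_approx_ok(n, p):
    """Conservative rule of thumb: np > 10 and n(1 - p) > 10."""
    return n * p > 10 and n * (1 - p) > 10


# Cal Poly (p = .437) and hepatitis (p = .07) at the simulated sample sizes.
for p in (0.437, 0.07):
    for n in (10, 25, 50, 100):
        print(f"p = {p}, n = {n}: np = {n * p:.2f}, "
              f"n(1 - p) = {n * (1 - p):.2f}, ok = {normal_approx_ok(n, p)}")
```

With p = .07 none of the four sample sizes passes the check, which is consistent with Rule 3: the farther p is from 0.5, the larger n must be.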
Suppose p = 0.1 and n = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
What does np equal?
Let's draw a normal curve over the histogram.

Why do we need to also check n(1 – p)?
Consider what the histogram looks like when n = 10 and p = .9. We must also check that the upper tail will spread out into an approximate normal curve.

Here's an algebraic proof . . .
If a binomial distribution can be approximated by a normal curve, then the values within 3 standard deviations of the mean MUST lie between the minimum and maximum values of 0 and n.
Recall: $\mu = np$ and $\sigma = \sqrt{np(1-p)}$
Therefore:
$0 < np - 3\sqrt{np(1-p)}$ AND $np + 3\sqrt{np(1-p)} < n$
Let's simplify this inequality.
$3\sqrt{np(1-p)} < np$ AND $3\sqrt{np(1-p)} < n(1-p)$
Square both sides.
$9np(1-p) < n^2p^2$ AND $9np(1-p) < n^2(1-p)^2$
Divide both sides by np (first inequality) or by n(1 – p) (second inequality).
$9(1-p) < np$ AND $9p < n(1-p)$
Since 0 < p < 1, we can substitute the values 0 and 1 into these inequalities (p = 0 in the first, p = 1 in the second) to find the largest value needed to be within 3 standard deviations of the mean. Simplifying this inequality gives us the following:
$9 < np$ AND $9 < n(1-p)$
The conservative approach uses 10 instead of 9.

Blood Transfusions Revisited . . .
Let p = proportion of patients who contract hepatitis after a blood transfusion
p = .07
Suppose a new blood screening procedure is believed to reduce the incidence rate of hepatitis. Blood screened using this procedure is given to n = 200 blood recipients. Only 6 of the 200 patients contract hepatitis.
$\hat{p} = 6/200 = .03$
Does this result indicate that the true proportion of patients who contract hepatitis when the new screening is used is less than 7%?
To answer this question, we must consider the sampling distribution of $\hat{p}$.

Blood Transfusions Revisited . . .
Let p = .07
$\hat{p} = 6/200 = .03$
Is the sampling distribution approximately normal?
np = 200(.07) = 14 > 10
n(1-p) = 200(.93) = 186 > 10
Yes, we can use a normal approximation.
What is the mean and standard deviation of the sampling distribution?
$\mu_{\hat{p}} = .07$
$\sigma_{\hat{p}} = \sqrt{\dfrac{.07(.93)}{200}} = .018$
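With that mean and standard deviation, the chance of seeing a sample proportion of .03 or smaller when p = .07 is a lower-tail normal probability. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for a calculator's normal CDF function; the variable names are chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

p = 0.07          # hypothesized population proportion
n = 200           # number of blood recipients
p_hat = 6 / 200   # observed sample proportion

# Rule 1 and Rule 2 for the sampling distribution of p-hat.
mean_p_hat = p
sd_p_hat = sqrt(p * (1 - p) / n)

# Lower-tail probability P(p-hat < .03) under the normal approximation.
prob = NormalDist(mu=mean_p_hat, sigma=sd_p_hat).cdf(p_hat)

print(round(sd_p_hat, 3), round(prob, 4))
```

The result is about 0.013. It can differ in the last digit from a hand calculation, depending on whether the standard deviation is rounded to .018 first.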
$\mu_{\hat{p}} = .07$
$\sigma_{\hat{p}} = \sqrt{\dfrac{.07(.93)}{200}} = .018$
Let p = .07
$\hat{p} = 6/200 = .03$
Does this result indicate that the true proportion of patients who contract hepatitis when the new screening is used is less than 7%?
$P(\hat{p} < .03)$ = Normalcdf($-10^{99}$, .03, .07, .018) = .0132
This small probability tells us that it is unlikely that a sample proportion of .03 or smaller would be observed if the true proportion were still .07.
This new screening procedure appears to yield a smaller incidence rate for hepatitis.

Confidence Intervals

Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl.
How might we go about estimating this proportion?
We could take a sample of candies and compute the proportion of blue candies in our sample.
We would have a sample proportion or a statistic: a single value for the estimate.

Point Estimate
• A single number (a statistic) based on sample data that is used to estimate a population characteristic
  “point” refers to the single value on a number line.
• But not always close to the population characteristic due to sampling variation
  Different samples may produce different statistics.

Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl.
We could take a sample of candies and compute the proportion of blue candies in our sample.
How much confidence do you have in your point estimate?
Would you have more confidence if the answer were an interval?

Confidence intervals
A confidence interval (CI) for a population characteristic is an interval of plausible values for the characteristic.
It is constructed so that, with a chosen degree of confidence, the actual value of the characteristic will be between the lower and upper endpoints of the interval.
The primary goal of a confidence interval is to estimate an unknown population characteristic.

Rate your confidence 0 – 100%
How confident (%) are you that you can ...
Guess my age within 10 years? (What does it mean to be within 10 years?)
. . . within 5 years?
. . . within 1 year?
What happened to your level of confidence as the interval became smaller?

Perform CI Activity . . .
Question for after the activity:
• What proportion of all possible CI's contain the true proportion p?
• This is called the confidence level.

Let's develop the equation for the large-sample confidence interval.
To begin, we will use a 95% confidence level.
Use the table of standard normal curve areas to determine the value of z* such that a central area of .95 falls between –z* and z*.
[Figure: standard normal curve with central area = .95 between –1.96 and 1.96, lower tail area = .025, upper tail area = .025]
95% of these values are within 1.96 standard deviations of the mean.
We can generalize this to normal distributions other than the standard normal distribution: about 95% of the values are within 1.96 standard deviations of the mean.
For large random samples, the sampling distribution of $\hat{p}$ is approximately normal. So about 95% of the possible $\hat{p}$ will fall within
$1.96\sqrt{\dfrac{p(1-p)}{n}}$ of p.

Developing a Confidence Interval Continued . . .
[Figure: approximate sampling distribution of $\hat{p}$, centered at p, with lines at $p - 1.96\sqrt{\dfrac{p(1-p)}{n}}$ and $p + 1.96\sqrt{\dfrac{p(1-p)}{n}}$ and three example intervals]
• Here is the mean of the sampling distribution, p.
• One line represents 1.96 standard deviations below the mean; the other represents 1.96 standard deviations above the mean.
• Suppose we get this $\hat{p}$ and create an interval around $\hat{p}$. This $\hat{p}$ fell within 1.96 standard deviations of the mean AND its confidence interval “captures” p.
• Suppose we get this $\hat{p}$ and create an interval. This $\hat{p}$ fell within 1.96 standard deviations of the mean AND its confidence interval “captures” p.
• Suppose we get this $\hat{p}$ and create an interval. This $\hat{p}$ doesn't fall within 1.96 standard deviations of the mean AND its confidence interval does NOT “capture” p.
Notice that the length of each half of the interval equals $1.96\sqrt{\dfrac{p(1-p)}{n}}$.
Using this method of calculation, the confidence interval will not capture p 5% of the time.
When n is large, a 95% confidence interval for p is
$\hat{p} \pm 1.96\sqrt{\dfrac{p(1-p)}{n}}$
If $\hat{p}$ is within $1.96\sqrt{\dfrac{p(1-p)}{n}}$ of p, this means the interval
$\hat{p} - 1.96\sqrt{\dfrac{p(1-p)}{n}}$ to $\hat{p} + 1.96\sqrt{\dfrac{p(1-p)}{n}}$
will capture p.
And this will happen for 95% of all possible samples!

Confidence level
The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.
If this method was used to generate an interval estimate over and over again from different samples, in the long run 95% (or whatever confidence level we use) of the resulting intervals would include the actual value of the characteristic being estimated.
Our confidence is in the method, NOT in any ONE particular interval!
The most common confidence levels are 90%, 95%, and 99% confidence.

The diagram to the right is 100 confidence intervals for p computed from 100 different random samples.
Note that the ones with asterisks do not capture p.
If we were to compute 100 more confidence intervals for p from 100 different random samples, would we get the same results?

Recall the General Properties for Sampling Distributions of $\hat{p}$
1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
   As long as the sample size is less than 10% of the population
3. As long as n is large (np > 10 and n(1-p) > 10) the sampling distribution of $\hat{p}$ is approximately normal.
These are the conditions that must be true in order to calculate a large-sample confidence interval for p.

The Large-Sample Confidence Interval for p
The general formula for a confidence interval for a population proportion p . . . is
statistic ± critical value (standard deviation of the statistic)
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
• $\hat{p}$ is the point estimate.
• In real life, we often do not know the population proportion. What value can we use to estimate it?
• $\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ is an estimate of the standard deviation of $\hat{p}$, or the standard error.
The standard error of a statistic is the estimated standard deviation of the statistic.
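The idea that the confidence level is the long-run success rate of the method can be checked by simulation. The sketch below is an illustration under stated assumptions: the true proportion p = 0.5, the sample size n = 200, the 2000 repetitions, and the function names `large_sample_ci` and `coverage` are all hypothetical choices, not values from the slides.

```python
import random
from math import sqrt


def large_sample_ci(successes, n, z_star=1.96):
    """Large-sample interval: p-hat +/- z* * sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = successes / n
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def coverage(p, n, reps=2000, z_star=1.96, seed=1):
    """Fraction of simulated intervals that capture the true proportion p."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(reps):
        # One random sample of size n from a population with proportion p.
        successes = sum(rng.random() < p for _ in range(n))
        low, high = large_sample_ci(successes, n, z_star)
        hits += low <= p <= high
    return hits / reps


# Hypothetical population proportion and sample size for the illustration.
print(coverage(p=0.5, n=200))
```

The printed fraction should land near 0.95 without being exactly 0.95, and a different seed gives a slightly different answer, just as 100 new intervals would not repeat the first 100 exactly.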
The Large-Sample Confidence Interval for p
The general formula for a confidence interval for a population proportion p . . . is
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
The term $(z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ is called the margin of error.
The 95% confidence interval is based on the fact that, for approximately 95% of all random samples, $\hat{p}$ is within the margin of error of p.

Critical value (z*)
• Found from the confidence level
• The upper z-score with probability p lying to its right under the standard normal curve

Confidence level    tail area    z*
90%                 .05          1.645
95%                 .025         1.96
99%                 .005         2.576

The Large-Sample Confidence Interval for p
The general formula for a confidence interval for a population proportion p,
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
applies when (assumptions) (STEP 1)
• $\hat{p}$ is the sample proportion from a random sample
• the sample size n is large ($n\hat{p} > 10$ and $n(1-\hat{p}) > 10$), and
• if the sample is selected without replacement, the sample size is small relative to the population size (at most 10% of the population)

What are the steps for performing a confidence interval?
1. Assumptions
   • Data from a random sample
   • Sample size is large enough
   • Sample size is small relative to population size
2. Calculations
3. Conclusion

Conclusion: (memorize!!)
We are ________% confident that the true proportion context is between ______ and ______.

The article “How Well Are U.S. Colleges Run?” (USA Today, February 17, 2010) describes a survey of 1031 adult Americans. The survey was carried out by the National Center for Public Policy and the sample was selected in a way that makes it reasonable to regard the sample as representative of adult Americans. Of those surveyed, 567 indicated that they believe a college education is essential for success.
The point estimate is $\hat{p} = \dfrac{567}{1031} = .55$
Before computing the confidence interval, we need to verify the conditions.
What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success?
College Education Continued . . .
What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success?
Conditions:
1) $n\hat{p}$ = 1031(.55) = 567 and $n(1-\hat{p})$ = 1031(.45) = 464; since both of these are greater than 10, the sample size is large enough to proceed.
2) The sample size of n = 1031 is much smaller than 10% of the population size (adult Americans).
3) The sample was selected in a way designed to produce a representative sample. So we can regard the sample as a random sample from the population.
All our conditions are verified, so it is safe to proceed with the calculation of the confidence interval.

College Education Continued . . .
What is a 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success?
Calculation:
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = .55 \pm 1.96\sqrt{\dfrac{.55(.45)}{1031}} = (.52, .58)$
What does this interval mean in the context of this problem?
Conclusion:
We are 95% confident that the population proportion of adult Americans who believe that a college education is essential for success is between 52% and 58%.

College Education Revisited . . .
A 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success is:
$.55 \pm 1.96\sqrt{\dfrac{.55(.45)}{1031}} = (.52, .58)$
Compute a 90% confidence interval for this proportion.
$.55 \pm 1.645\sqrt{\dfrac{.55(.45)}{1031}} = (.524, .575)$
Compute a 99% confidence interval for this proportion.
$.55 \pm 2.58\sqrt{\dfrac{.55(.45)}{1031}} = (.510, .590)$
Recall the “Rate your Confidence” Activity.
What do you notice about the relationship between the confidence level of an interval and the width of the interval?

A May 2000 Gallup Poll found that 38% of a random sample of 1012 adults said that they believe in ghosts. Find a 95% confidence interval for the true proportion of adults who believe in ghosts.
Step 1: check assumptions!
Assumptions:
• Have an SRS of adults
• $n\hat{p}$ = 1012(.38) = 384.56 & $n(1-\hat{p})$ = 1012(.62) = 627.44. Since both are greater than 10, the distribution can be approximated by a normal curve
• Population of adults is at least 10,120.
Step 2: make calculations
$\hat{p} \pm z^*\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = .38 \pm 1.96\sqrt{\dfrac{.38(.62)}{1012}} = (.35, .41)$
Step 3: conclusion in context
We are 95% confident that the true proportion of adults who believe in ghosts is between 35% and 41%.

Choosing a Sample Size
The margin of error for a confidence interval is
$m = z^*\sqrt{\dfrac{p(1-p)}{n}}$
Before collecting any data, an investigator may wish to determine a sample size needed to achieve a certain margin of error.
What value should be used for the unknown p?
• Sometimes, it is feasible to perform a preliminary study to estimate the value for p.
• In other cases, prior knowledge may suggest a reasonable estimate for p.
• If there is no prior knowledge and a preliminary study is not feasible, then the conservative estimate for p is 0.5.

Why is the conservative estimate for p = 0.5?
.1(.9) = .09
.2(.8) = .16
.3(.7) = .21
.4(.6) = .24
.5(.5) = .25
By using .5 for p, we are using the largest value for p(1 – p) in our calculations.
Recall the activity where we graphed the histograms for binomials with different probabilities of success: which had the largest standard deviation?

In spite of the potential safety hazards, some people would like to have an internet connection in their car. Determine the sample size required to estimate the proportion of adult Americans who would like an internet connection in their car to within 0.03 with 95% confidence.
What value should be used for p?
$m = z^*\sqrt{\dfrac{p(1-p)}{n}}$
$.03 = 1.96\sqrt{\dfrac{.5(.5)}{n}}$
Here .03 is the value for the margin of error m.
$n = 1067.111$
$n = 1068$ people
Always round the sample size up to the next whole number.
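Solving the margin of error formula for n gives $n = p(1-p)\left(\dfrac{z^*}{m}\right)^2$, rounded up. A short sketch of that calculation follows; the function name `required_sample_size` and its default arguments are assumptions made for this illustration.

```python
from math import ceil


def required_sample_size(margin, z_star=1.96, p=0.5):
    """Smallest n with z* * sqrt(p(1 - p)/n) <= margin.

    p defaults to the conservative estimate 0.5, which makes p(1 - p)
    as large as possible.
    """
    n_exact = p * (1 - p) * (z_star / margin) ** 2
    return ceil(n_exact)  # always round up to the next whole number


# Internet-in-the-car example: within 0.03 with 95% confidence.
print(required_sample_size(0.03))
```

Passing a different `p` (for example, an estimate from a preliminary study) shows how much smaller the required sample becomes when p is known to be far from 0.5.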