Chapter 4 Simple Random Sampling
• Definition of Simple Random Sample (SRS) and how to select an SRS
• Estimation of population mean and total; sample size for estimating population mean and total
• Estimation of population proportion; sample size for estimating population proportion
• Comparing estimates

Simple Random Samples
• We want the sample to be representative of the population from which it is selected.
• Each individual in the population should have an equal chance to be selected.
• Is this good enough?

Example
Select a sample of high school students as follows:
1. Flip a fair coin.
2. If heads, select all female students in the school as the sample.
3. If tails, select all male students in the school as the sample.
• Each student has an equal chance to be in the sample.
• Every sample consists of a single gender, so it is not representative.
• Each individual in the population has an equal chance to be selected. Is this good enough? NO!!

Simple Random Sample
A simple random sample (SRS) of size n consists of n units from the population chosen in such a way that every set of n units has an equal chance to be the sample actually selected.

Simple Random Samples (cont.)
• Suppose a large History class of 500 students has 250 male and 250 female students.
• To select a random sample of 250 students from the class, I flip a fair coin one time. If the coin shows heads, I select the 250 males as my sample; if the coin shows tails, I select the 250 females as my sample.
• What is the chance any individual student from the class is included in the sample? 1/2
• This is a random sample. Is it a simple random sample? NO! Not every possible group of 250 students has an equal chance to be selected.
• Every sample consists of only 1 gender, which is hardly representative.

Simple Random Samples (cont.)
The easiest way to choose an SRS is with random numbers. Statistical software can generate random digits (e.g., Excel “=RAND()”, ran# button on calculator).
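As an illustration of selecting an SRS with software random numbers, here is a minimal Python sketch. It is not part of the original slides: the helper name `simple_random_sample` and the labelling of the 500 History students as 1 to 500 are assumptions made for the example. Python's `random.sample` draws without replacement so that every set of n units is equally likely, which is exactly the SRS requirement.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Return an SRS of n units drawn without replacement.

    Hypothetical helper for illustration; random.sample gives every
    subset of size n the same chance of being selected.
    """
    rng = random.Random(seed)
    return rng.sample(list(units), n)

# Assumed labelling: the 500 History students are numbered 1..500.
students = range(1, 501)
sample = simple_random_sample(students, 250, seed=4)

# 250 students were chosen and nobody was chosen twice.
print(len(sample), len(set(sample)))
```

Unlike the coin-flip scheme, this procedure can produce any group of 250 students, not only the two single-gender groups.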
Example: simple random sample
An academic department wishes to randomly choose a 3-member committee from the 28 members of the department.
01 Abbott      08 Goodwin       15 Pillotte    22 Theobald
02 Cicirelli   09 Haglund       16 Raman       23 Vader
03 Crane       10 Johnson       17 Reimann     24 Wang
04 Dunsmore    11 Keegan        18 Rodriguez   25 Wieczoreck
05 Engle       12 Lechtenb’g    19 Rowe        26 Williams
06 Fitzpat’k   13 Martinez      20 Sommers     27 Wilson
07 Garcia      14 Nguyen        21 Stone       28 Zink

Solution
• Use a random number table; read 2-digit pairs, skipping pairs that match no label, until you have chosen 3 committee members.
• For example, start in row 121: 71487 09984 29077 14863 61683 47052 62224 51025
• Garcia (07), Theobald (22), Johnson (10)
• Your calculator generates random numbers; you can also generate random numbers using Excel.

Sampling Variability
• Suppose we had started in line 145?
• 19687 12633 57857 95806 09931 02150 43163 58636
• Our sample would have been 19 Rowe, 26 Williams, 06 Fitzpatrick.

Sampling Variability
• Samples drawn at random generally differ from one another.
• Each draw of random numbers selects different people for our sample.
• These differences lead to different values for the variables we measure.
• We call these sample-to-sample differences sampling variability.
• Variability is OK; bias is bad!!

Example: simple random sample
• Using Excel tools
• Using StatCrunch (NFL)

4.3 Estimation of population mean
Usual estimator:
\[\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\]
Recall that \(E(\bar{y}) = \mu\).
What about the variability of \(\bar{y}\) from sample to sample?
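One way to get a feel for this sample-to-sample variability is to draw many simple random samples from a small population and look at the resulting sample means. The sketch below is only an illustration: the population of the integers 1 to 20, the sample size n = 5, the 2000 repetitions, and the seed are all assumptions chosen for the example, not values from the slides.

```python
import random
import statistics

# Hypothetical population of N = 20 values (not from the slides).
population = list(range(1, 21))
mu = statistics.mean(population)

rng = random.Random(2024)
n = 5

# Draw 2000 simple random samples (without replacement) and
# record the mean of each one.
sample_means = [statistics.mean(rng.sample(population, n)) for _ in range(2000)]

print("population mean:", mu)
print("average of the sample means:", round(statistics.mean(sample_means), 3))
print("smallest and largest sample mean:", min(sample_means), max(sample_means))
```

The individual sample means differ from one another (sampling variability), while their average stays close to the population mean, in line with \(E(\bar{y}) = \mu\).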
4.3 Estimation of population mean (cont.)
For a simple random sample of size n chosen without replacement from a population of size N,
\[V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)\]
The correction factor \(\frac{N-n}{N-1}\) takes into account that an estimate based on a sample of n = 10 from a population of N = 20 items contains more information than a sample of n = 10 from a population of N = 20,000.

4.3 Estimating the variance of the sample mean
Recall the sample variance
\[s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2\]
It can be shown (Appendix A) that
\[E(s^2) = \frac{N}{N-1}\,\sigma^2\]
So \(V(\bar{y})\) can be unbiasedly estimated by
\[\hat{V}(\bar{y}) = \left(\frac{N-n}{N}\right)\frac{s^2}{n} = \left(1-\frac{n}{N}\right)\frac{s^2}{n}\]
\(1-\frac{n}{N}\) is called the finite population correction (fpc).
If \(\frac{n}{N} \le \frac{1}{20}\), the fpc is usually ignored, so
\[\hat{V}(\bar{y}) = \frac{s^2}{n}\]

4.3 Example
Population {1, 2, 3, 4}; \(\mu = 2.5\), \(\sigma^2 = 5/4\); n = 2, equal weights
Sample    Pr. of sample    \(\bar{y}\)    \(s^2\)    \(\hat{V}(\bar{y})\)
{1, 2}    1/6              1.5    0.5    0.125
{1, 3}    1/6              2.0    2.0    0.500
{1, 4}    1/6              2.5    4.5    1.125
{2, 3}    1/6              2.5    0.5    0.125
{2, 4}    1/6              3.0    2.0    0.500
{3, 4}    1/6              3.5    0.5    0.125
For example, for {1, 2}:
\[\hat{V}(\bar{y}) = \left(1-\frac{2}{4}\right)\frac{0.5}{2} = \frac{0.5}{4} = 0.125\]
Averaging over the six equally likely samples:
\[E(\bar{y}) = 1.5\left(\tfrac{1}{6}\right) + 2.0\left(\tfrac{1}{6}\right) + 2.5\left(\tfrac{1}{6}\right) + 2.5\left(\tfrac{1}{6}\right) + 3.0\left(\tfrac{1}{6}\right) + 3.5\left(\tfrac{1}{6}\right) = 2.5 = \mu\]
\[V(\bar{y}) = (1.5-2.5)^2\left(\tfrac{1}{6}\right) + \cdots + (3.5-2.5)^2\left(\tfrac{1}{6}\right) = \frac{5}{12} = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)\]
\[E(s^2) = \frac{0.5+2.0+4.5+0.5+2.0+0.5}{6} = \frac{5}{3} = \frac{N}{N-1}\,\sigma^2\]
\[E(\hat{V}(\bar{y})) = \frac{0.125+0.5+1.125+0.125+0.5+0.125}{6} = \frac{5}{12} = V(\bar{y})\]
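The {1, 2, 3, 4} example is small enough to check by brute force. The following sketch enumerates all six samples of size 2 and uses exact fractions, so the results can be compared directly with 5/2, 5/12, and 5/3. The function names `ybar`, `s2`, and `v_hat` are illustrative choices, not notation from the slides.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4]
N, n = len(population), 2

mu = Fraction(sum(population), N)
# Population variance with divisor N, as in the slides (sigma^2 = 5/4).
sigma2 = sum((Fraction(y) - mu) ** 2 for y in population) / N

samples = list(combinations(population, n))
prob = Fraction(1, len(samples))  # each of the 6 samples has probability 1/6

def ybar(s):
    """Sample mean."""
    return Fraction(sum(s), len(s))

def s2(s):
    """Sample variance with divisor n - 1."""
    m = ybar(s)
    return sum((Fraction(y) - m) ** 2 for y in s) / (len(s) - 1)

def v_hat(s):
    """Estimated variance of the sample mean: (1 - n/N) * s^2 / n."""
    return (1 - Fraction(n, N)) * s2(s) / n

E_ybar = sum(prob * ybar(s) for s in samples)
V_ybar = sum(prob * (ybar(s) - E_ybar) ** 2 for s in samples)
E_s2 = sum(prob * s2(s) for s in samples)
E_v_hat = sum(prob * v_hat(s) for s in samples)

print(E_ybar, V_ybar, E_s2, E_v_hat)  # 5/2 5/12 5/3 5/12
```

The enumeration reproduces \(E(\bar{y}) = \mu\), \(V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)\), \(E(s^2) = \frac{N}{N-1}\sigma^2\), and \(E(\hat{V}(\bar{y})) = V(\bar{y})\) for this population.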
4.3 Example Summary
Population {1, 2, 3, 4}; \(\mu = 2.5\), \(\sigma^2 = 5/4\); n = 2, equal weights
\[E(\bar{y}) = \mu = 2.5, \qquad V(\bar{y}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \frac{5}{12}\]
\[E(\hat{V}(\bar{y})) = E\left[\left(1-\frac{n}{N}\right)\frac{s^2}{n}\right] = V(\bar{y}) = \frac{5}{12}\]

4.3 Margin of error when estimating the population mean
Margin of error (MOE), also called "bound on the error of estimation":
\[t_{.025,\,n-1}\sqrt{\hat{V}(\bar{y})} = t_{.025,\,n-1}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]
Often the value of z from N(0, 1) is used:
\[1.96\sqrt{\hat{V}(\bar{y})} = 1.96\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]

t distributions
• Very similar to z ~ N(0, 1)
• Sometimes called Student’s t distribution; Gosset, brewery employee
• Properties:
  i) symmetric around 0 (like z)
  ii) degrees of freedom \(\nu\): if \(\nu > 1\), \(E(t_\nu) = 0\); if \(\nu > 2\), \(V(t_\nu) = \frac{\nu}{\nu-2}\), which is always bigger than 1.

Student’s t Distribution
\(t_{10}\): P(t > 2.2281) = .025, P(t < -2.2281) = .025, central area .95
Standard normal: P(z > 1.96) = .025, P(z < -1.96) = .025, central area .95

Student’s t Distribution
\[z = \frac{\bar{y}-\mu}{\sigma/\sqrt{n}}, \qquad t = \frac{\bar{y}-\mu}{s/\sqrt{n}}\]
Degrees of freedom: \(s = \sqrt{s^2}\) with \(s^2 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}\), so t has n - 1 degrees of freedom.
[Figure 11.3, Page 372: density curves of Z, \(t_1\), and \(t_7\) on a horizontal axis from -3 to 3]

4.3 Margin of error when estimating the population mean
Adding MOE to \(\bar{y}\) and subtracting MOE from \(\bar{y}\) gives a 95% confidence interval:
\[\bar{y} \pm t_{.025,\,n-1}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}} \quad\text{or}\quad \bar{y} \pm 1.96\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]
Understanding confidence intervals; behavior of confidence intervals.
More generally, a \(100(1-\alpha)\%\) confidence interval:
\[\bar{y} \pm t_{\alpha/2,\,n-1}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}} \quad\text{or}\quad \bar{y} \pm z_{\alpha/2}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]

Comparing t and z Critical Values
z:            1.645    1.96     2.33     2.58
Conf. level:  90%      95%      98%      99%
t (n = 30):   1.6991   2.0452   2.4620   2.7564

4.4 Determining Sample Size to Estimate \(\mu\)
Required Sample Size To Estimate a Population Mean
If you desire a C% confidence interval for a population mean \(\mu\) with an accuracy specified by you, how large does the sample size need to be? We will denote the accuracy by MOE, which stands for Margin of Error.

Example: Sample Size to Estimate a Population Mean
Suppose we want to estimate the unknown mean height \(\mu\) of male students at NC State with a confidence interval. We want to be 95% confident that our estimate is within .5 inch of \(\mu\). How large does our sample size need to be?

Confidence Interval for \(\mu\)
In terms of the margin of error MOE, the CI for \(\mu\) can be expressed as \(\bar{y} \pm \mathrm{MOE}\).
The confidence interval for \(\mu\) is \(\bar{y} \pm t^*_{n-1}\frac{s}{\sqrt{n}}\), so \(\mathrm{MOE} = t^*_{n-1}\frac{s}{\sqrt{n}}\).
So we can find the sample size by solving this equation for n:
\[\mathrm{MOE} = t^*_{n-1}\frac{s}{\sqrt{n}}, \quad\text{which gives}\quad n = \left(\frac{t^*_{n-1}\,s}{\mathrm{MOE}}\right)^2\]
Good news: we have an equation.
Bad news:
1. Need to know s
2. We don’t know n, so we don’t know the degrees of freedom to find \(t^*_{n-1}\)

A Way Around this Problem: Use the Standard Normal
Use the corresponding z* from the standard normal to form the equation
\[\mathrm{MOE} = z^*\frac{s}{\sqrt{n}}\]
Solve for n:
\[n = \left(\frac{z^*\,s}{\mathrm{MOE}}\right)^2\]
[Figure: sampling distribution of \(\bar{y}\), confidence level .95; the central area runs from \(\mu - 1.96\frac{\sigma}{\sqrt{n}}\) to \(\mu + 1.96\frac{\sigma}{\sqrt{n}}\), a distance MOE on each side of \(\mu\)]
Set \(\mathrm{MOE} = 1.96\frac{\sigma}{\sqrt{n}}\) and solve for n:
\[n = \left(\frac{1.96\,\sigma}{\mathrm{MOE}}\right)^2\]

Estimating s
• Previously collected data or prior knowledge of the population
• If the population is normal or near-normal, then s can be conservatively estimated by \(s \approx \frac{\text{range}}{6}\), since 99.7% of observations are within \(3\sigma\) of the mean.

Example: sample size to estimate mean height µ of NCSU undergrad. male students
\[n = \left(\frac{z^*\,s}{\mathrm{MOE}}\right)^2\]
We want to be 95% confident that we are within .5 inch of \(\mu\), so MOE = .5; z* = 1.96.
Suppose previous data indicates that s is about 2 inches.
\[n = \left[\frac{(1.96)(2)}{.5}\right]^2 = 61.47\]
We should sample 62 male students.

Example: Sample Size to Estimate a Population Mean (Textbooks)
Suppose the financial aid office wants to estimate the mean NCSU semester textbook cost \(\mu\) within MOE = $25 with 98% confidence. How many students should be sampled? Previous data shows \(\sigma\) is about $85.
\[n = \left(\frac{z^*\sigma}{\mathrm{MOE}}\right)^2 = \left(\frac{(2.33)(85)}{25}\right)^2 = 62.76\]
Round up to n = 63.

Example: Sample Size to Estimate a Population Mean (NFL footballs)
• The manufacturer of NFL footballs uses a machine to inflate new footballs.
• The mean inflation pressure is 13.0 psi, but random factors cause the final inflation pressure of individual footballs to vary from 12.8 psi to 13.2 psi.
• After throwing several interceptions in a game, Tom Brady complains that the balls are not properly inflated. The manufacturer wishes to estimate the mean inflation pressure to within .025 psi with a 99% confidence interval. How many footballs should be sampled?
\[n = \left(\frac{z^*\sigma}{\mathrm{MOE}}\right)^2\]
• 99% confidence, so z* = 2.58; MOE = .025
• \(\sigma\) = ? Inflation pressures range from 12.8 to 13.2 psi
• So range = 13.2 - 12.8 = .4; \(\sigma \approx\) range/6 = .4/6 = .067
\[n = \left(\frac{2.58 \times .067}{.025}\right)^2 = 47.8, \quad\text{so sample 48 footballs}\]

Required Sample Size To Estimate a Population Mean
The formula
\[n = \left(\frac{z^*\,s}{\mathrm{MOE}}\right)^2\]
assumes an infinite population or sampling with replacement (so no fpc). It is frequently the case that we are sampling without replacement.

Required Sample Size To Estimate a Population Mean When Sampling Without Replacement
\(100(1-\alpha)\%\) confidence interval:
\[\bar{y} \pm t_{\alpha/2,\,n-1}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]
We can't use this since we don't know n; use
\[\bar{y} \pm z_{\alpha/2}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}, \qquad z_{\alpha/2}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}} = \mathrm{MOE}\]
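The three worked examples above (heights, textbooks, footballs) all use the infinite-population formula \(n = (z^* s/\mathrm{MOE})^2\), which is easy to check numerically. A minimal sketch follows; the function name `sample_size_infinite` is made up for illustration, and the inputs are the values quoted in the examples.

```python
import math

def sample_size_infinite(z_star, s, moe):
    """Return (raw n, n rounded up) for n = (z* s / MOE)^2, with no fpc.

    Hypothetical helper; the sample size is always rounded UP so the
    margin of error is no larger than requested.
    """
    n_raw = (z_star * s / moe) ** 2
    return n_raw, math.ceil(n_raw)

# Heights: 95% confidence, s about 2 inches, MOE = 0.5 inch
heights = sample_size_infinite(1.96, 2, 0.5)
# Textbooks: 98% confidence, sigma about 85 dollars, MOE = 25 dollars
textbooks = sample_size_infinite(2.33, 85, 25)
# Footballs: 99% confidence, sigma about range/6 = 0.4/6, MOE = 0.025 psi
footballs = sample_size_infinite(2.58, 0.4 / 6, 0.025)

for name, (n_raw, n_up) in [("heights", heights),
                            ("textbooks", textbooks),
                            ("footballs", footballs)]:
    print(name, round(n_raw, 2), n_up)
```

The football calculation here uses the unrounded value .4/6 rather than .067, so its raw n comes out slightly below the 47.8 shown in the example; after rounding up, both give 48 footballs.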
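Required Sample Size To Estimate a Population Mean When Sampling Without Replacement (cont.)
Solve \(\mathrm{MOE} = z_{\alpha/2}\sqrt{\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\) for n:
\[(\mathrm{MOE})^2 = z_{\alpha/2}^2\left(1-\frac{n}{N}\right)\frac{s^2}{n} \;\Rightarrow\; n\left(1+\frac{z_{\alpha/2}^2 s^2}{N(\mathrm{MOE})^2}\right) = \frac{z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}\]
\[n = \frac{\dfrac{z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}}{1+\dfrac{1}{N}\,\dfrac{z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}} = \frac{n_0}{1+\dfrac{n_0}{N}}, \quad\text{where}\quad n_0 = \frac{z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}\]

4.3 Estimation of population total
Since \(\tau = N\mu\), we know that the estimator of \(\tau\) is N times the estimator \(\bar{y}\) of \(\mu\), and the MOE for estimating the total is N times the MOE for estimating the mean.
\[\hat{\tau} = N\bar{y} = \frac{N}{n}\sum_{i=1}^{n} y_i\]
\[V(\hat{\tau}) = V(N\bar{y}) = N^2 V(\bar{y}) = N^2\,\frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)\]
\[\hat{V}(\hat{\tau}) = N^2\left(1-\frac{n}{N}\right)\frac{s^2}{n}\]
Margin of error (MOE):
\[t_{.025,\,n-1}\sqrt{N^2\left(1-\frac{n}{N}\right)\frac{s^2}{n}} \quad\text{or}\quad 1.96\sqrt{N^2\left(1-\frac{n}{N}\right)\frac{s^2}{n}}\]

Required Sample Size To Estimate a Population Total
\[\mathrm{MOE} = z_{\alpha/2}\sqrt{N^2\left(1-\frac{n}{N}\right)\frac{s^2}{n}} \;\Rightarrow\; n\left(1+\frac{N z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}\right) = \frac{N^2 z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}\]
\[\text{so}\quad n = \frac{N^2 n_0}{1+N n_0}, \quad\text{where}\quad n_0 = \frac{z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}\]

4.3 Estimation of population total (cont.)
Estimate number of lakes in Minnesota, the “Land of 10,000 Lakes”.

4.5 Estimation of population proportion p
Interested in the proportion p of a population that has a characteristic of interest.
Estimate p with a sample proportion.
http://packpoll.com/

The data:
\(y_i = 1\) if item i has the characteristic of interest, \(y_i = 0\) if item i does not have the characteristic
\[\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}\]
Underlying model: each \(y_i \sim B(1, p)\), so \(\sum_{i=1}^{n} y_i \sim B(n, p)\) with
\[E\left(\sum_{i=1}^{n} y_i\right) = np, \qquad V\left(\sum_{i=1}^{n} y_i\right) = np(1-p)\]
So:
\[E(\hat{p}) = E\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n}E\left(\sum_{i=1}^{n} y_i\right) = \frac{np}{n} = p\]
\[V(\hat{p}) = V\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \frac{1}{n^2}V\left(\sum_{i=1}^{n} y_i\right) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}\]
\[\hat{V}(\hat{p}) = \left(1-\frac{n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}, \qquad \mathrm{MOE} = z_{\alpha/2}\sqrt{\left(1-\frac{n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}}\]

Required Sample Size To Estimate a Population Proportion p When Sampling Without Replacement
Since \(\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}\), we can use the sample size formula for \(\mu\):
\[n = \frac{n_0}{1+\dfrac{n_0}{N}}, \quad\text{where}\quad n_0 = \frac{z_{\alpha/2}^2 s^2}{(\mathrm{MOE})^2}\]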
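The finite-population formula \(n = n_0/(1 + n_0/N)\), together with its counterpart \(n = N^2 n_0/(1 + N n_0)\) for a total, can be sketched numerically as follows. This is an illustration under stated assumptions, not a computation from the slides: the function names are invented, and the population size N = 1000 is hypothetical (the slides give no N for the heights example).

```python
import math

def n0(z, s, moe):
    """Infinite-population sample size z^2 s^2 / MOE^2 (not rounded)."""
    return (z * s / moe) ** 2

def sample_size_mean(z, s, moe, N):
    """n = n0 / (1 + n0/N): sample size for a mean, sampling without replacement."""
    base = n0(z, s, moe)
    return base / (1 + base / N)

def sample_size_total(z, s, moe_total, N):
    """n = N^2 n0 / (1 + N n0): sample size for a total; MOE is on the total."""
    base = n0(z, s, moe_total)
    return N ** 2 * base / (1 + N * base)

# Heights example values (z = 1.96, s = 2, MOE = 0.5 inch) with a
# HYPOTHETICAL population of N = 1000 students (N is an assumption).
N = 1000
n_infinite = n0(1.96, 2, 0.5)
n_mean = sample_size_mean(1.96, 2, 0.5, N)
# A margin of N * 0.5 on the total is the same requirement as 0.5 on the mean.
n_total = sample_size_total(1.96, 2, N * 0.5, N)

print(math.ceil(n_infinite), math.ceil(n_mean), math.ceil(n_total))
```

With these assumed inputs the fpc lowers the required sample size from 62 to 58, and the mean and total versions agree because the total's margin of error is N times the mean's.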
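Since \(V(y_i) = \sigma^2 = p(1-p)\), for \(s^2\) we can use \(p(1-p)\), so that \(n_0 = \frac{z_{\alpha/2}^2\,p(1-p)}{(\mathrm{MOE})^2}\) (use prior information about p, or \(p = \frac{1}{2}\)).

4.6 Comparing Estimates
We often like to compare the means \(\mu_1\) and \(\mu_2\) in different populations or the proportions \(p_1\) and \(p_2\).
To compare \(\mu_1\) and \(\mu_2\) we estimate the difference \(\mu_1 - \mu_2\).
To compare \(p_1\) and \(p_2\) we estimate the difference \(p_1 - p_2\).

4.6 Comparing Estimates: Comparing Means
Background: for random variables X and Y:
\[E(X \pm Y) = E(X) \pm E(Y) = \mu_X \pm \mu_Y\]
\[V(X \pm Y) = V(X) + V(Y) \pm 2\,\mathrm{Cov}(X, Y)\]
If X and Y are independent, then \(\mathrm{Cov}(X, Y) = 0\) (we will focus on the independent case).

4.6 Comparing Estimates: Comparing Means
\(x_1, x_2, \ldots, x_{n_1}\): random sample from pop. 1 (\(\mu_1\), \(\sigma_1\) unknown)
\(y_1, y_2, \ldots, y_{n_2}\): random sample from pop. 2 (\(\mu_2\), \(\sigma_2\) unknown)
Independent random samples from the 2 populations
\[E(\bar{x}-\bar{y}) = E(\bar{x}) - E(\bar{y}) = \mu_1 - \mu_2, \qquad V(\bar{x}-\bar{y}) = V(\bar{x}) + V(\bar{y})\]
\[\hat{V}(\bar{x}) = \left(1-\frac{n_1}{N_1}\right)\frac{s_1^2}{n_1}; \qquad \hat{V}(\bar{y}) = \left(1-\frac{n_2}{N_2}\right)\frac{s_2^2}{n_2}\]

Population 1: parameters \(\mu_1\) and \(\sigma_1^2\) (values are unknown); sample size \(n_1\); statistics \(\bar{x}_1\) and \(s_1^2\)
Population 2: parameters \(\mu_2\) and \(\sigma_2^2\) (values are unknown); sample size \(n_2\); statistics \(\bar{x}_2\) and \(s_2^2\)
Estimate \(\mu_1 - \mu_2\) with \(\bar{x}_1 - \bar{x}_2\)

Sampling distribution model for \(\bar{x}_1 - \bar{x}_2\)?
\[E(\bar{x}_1-\bar{x}_2) = \mu_1 - \mu_2; \qquad SD(\bar{x}_1-\bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}; \qquad SE(\bar{x}_1-\bar{x}_2) = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]
With the fpc:
\[SE(\bar{x}_1-\bar{x}_2) = \sqrt{\left(1-\frac{n_1}{N_1}\right)\frac{s_1^2}{n_1}+\left(1-\frac{n_2}{N_2}\right)\frac{s_2^2}{n_2}}\]
Shape?
\[\frac{(\bar{x}_1-\bar{x}_2) - (\mu_1-\mu_2)}{SE(\bar{x}_1-\bar{x}_2)} \text{ has approximately a t distribution with } df = \frac{\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1-1}\left(\dfrac{s_1^2}{n_1}\right)^2+\dfrac{1}{n_2-1}\left(\dfrac{s_2^2}{n_2}\right)^2}\]
A simpler, conservative estimate of the degrees of freedom is \(\min(n_1 - 1,\, n_2 - 1)\).

4.6 Comparing Estimates: Comparing Means
Bound on the error of estimation:
\[t_{\alpha/2,\,df}\sqrt{\left(1-\frac{n_1}{N_1}\right)\frac{s_1^2}{n_1}+\left(1-\frac{n_2}{N_2}\right)\frac{s_2^2}{n_2}} \quad\text{or}\quad z_{\alpha/2}\sqrt{\left(1-\frac{n_1}{N_1}\right)\frac{s_1^2}{n_1}+\left(1-\frac{n_2}{N_2}\right)\frac{s_2^2}{n_2}}\]

4.6 Comparing Estimates: Comparing Means (Special Case, Seldom Used)
Assume \(\sigma_1^2 = \sigma_2^2 = \sigma^2\).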
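Pooled estimate of common variance:
\[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\]
\[V(\bar{x}-\bar{y}) = V(\bar{x}) + V(\bar{y}) = \frac{\sigma^2}{n_1}+\frac{\sigma^2}{n_2}\]
\[t = \frac{(\bar{x}_1-\bar{x}_2) - (\mu_1-\mu_2)}{\sqrt{\dfrac{s_p^2}{n_1}+\dfrac{s_p^2}{n_2}}}, \qquad df = n_1+n_2-2\]
\[\hat{V}(\bar{x}) = \left(1-\frac{n_1}{N_1}\right)\frac{s_p^2}{n_1}; \qquad \hat{V}(\bar{y}) = \left(1-\frac{n_2}{N_2}\right)\frac{s_p^2}{n_2}\]
BOE:
\[t_{\alpha/2,\,n_1+n_2-2}\sqrt{\left(1-\frac{n_1}{N_1}\right)\frac{s_p^2}{n_1}+\left(1-\frac{n_2}{N_2}\right)\frac{s_p^2}{n_2}} \quad\text{or}\quad z_{\alpha/2}\sqrt{\left(1-\frac{n_1}{N_1}\right)\frac{s_p^2}{n_1}+\left(1-\frac{n_2}{N_2}\right)\frac{s_p^2}{n_2}}\]

4.6 Comparing Estimates: Comparing Proportions, Two Cases
• Difference between two polls: difference of proportions between 2 independent polls
• Differences within a single poll question: comparing proportions for a single poll question, horse-race polls (dependent proportions)

4.6 Comparing Estimates: Comparing Proportions in Two Independent Polls
\(\hat{p}_1\) estimates pop. proportion \(p_1\) in poll #1
\(\hat{p}_2\) estimates pop. proportion \(p_2\) in poll #2
Polls are independent
\[V(\hat{p}_1-\hat{p}_2) = V(\hat{p}_1) + V(\hat{p}_2) = \frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}\]
\[\hat{V}(\hat{p}_1-\hat{p}_2) = \hat{V}(\hat{p}_1) + \hat{V}(\hat{p}_2) = \left(1-\frac{n_1}{N_1}\right)\frac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}+\left(1-\frac{n_2}{N_2}\right)\frac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}\]
\[\text{BOE: } z_{\alpha/2}\sqrt{\left(1-\frac{n_1}{N_1}\right)\frac{\hat{p}_1(1-\hat{p}_1)}{n_1-1}+\left(1-\frac{n_2}{N_2}\right)\frac{\hat{p}_2(1-\hat{p}_2)}{n_2-1}}\]

4.6 Comparing Estimates: Comparing Dependent Proportions in a Single Poll
Multinomial sampling situation: typically 3 or more choices in a poll
\[V(\hat{p}_1-\hat{p}_2) = \frac{(p_1+p_2) - (p_1-p_2)^2}{n} = V(\hat{p}_1) + V(\hat{p}_2) + \frac{2\,p_1 p_2}{n}\]
\[\hat{V}(\hat{p}_1-\hat{p}_2) = \hat{V}(\hat{p}_1) + \hat{V}(\hat{p}_2) + \frac{2\,\hat{p}_1\hat{p}_2}{n-1} = \left(1-\frac{n}{N}\right)\frac{\hat{p}_1(1-\hat{p}_1)}{n-1}+\left(1-\frac{n}{N}\right)\frac{\hat{p}_2(1-\hat{p}_2)}{n-1}+\frac{2\,\hat{p}_1\hat{p}_2}{n-1}\]

Worksheet
http://packpoll.com/

End of Chapter 4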