CHAPTER 5
SAMPLING AND SAMPLING DISTRIBUTIONS

5-1. Parameters are numerical measures of populations. Sample statistics are numerical measures of samples. An estimator is a sample statistic used for estimating a population parameter.

5-2. \( \bar{x} = 97.9225 \) (estimate of \( \mu \))
\( s = 51.8303 \) (estimate of \( \sigma \))
\( s^2 = 2{,}686.38 \) (estimate of \( \sigma^2 \), the population variance)

5-3. \( \hat{p} = x/n = 5/12 = 0.41667 \) (5 out of 12 accounts are over $100.)

5-4. \( \bar{x} = 15.333 \), \( s = 2.5546 \)

5-5. a) average price: 1.690385
Template output (Basic Statistics from Raw Data, Gas Prices):
  Measures of Central Tendency: Mean 1.6903846, Median 1.69, Mode 1.69
  Range 0.23, IQR 0.115
  Measures of Dispersion    If the data is of a Sample    Population
  Variance                  0.00545185                    0.00524216
  St. Dev.                  0.07383662                    0.07240276
b) Assuming a normal distribution N(1.64, 0.04), \( P(X > 1.69039) = 0.3373 \).
Template output: Mean 1.64, Stdev 0.12, x = 1.69039, P(X > x) = 0.3373.

5-6. \( \hat{p} = x/n = 11/18 = 0.6111 \), where x = the number of users of the product.

5-7. We need 25 elements from a population of 950 elements. Use the rows of Table 5-1, taking the rightmost 3 digits of each group, starting in row 1 (left to right). We skip any such 3-digit number that is either > 950 or that has been generated earlier in this list, giving us a list of 25 different numbers in the desired range. The chosen numbers are: 480, 11, 536, 647, 646, 179, 194, 368, 573, 595, 393, 198, 402, 130, 360, 527, 265, 809, 830, 167, 93, 243, 680, 856, 376.

5-8. We will again use Table 5-1, using columns this time. We will use the right-hand columns, first 4 digits from the right (going down the column): 4,194 3,402 4,830 3,537 1,305.

5-9. We will use Table 5-1, sets of 2 columns, using all 5 digits from column 1 and the first 3 digits from column 2, continuing by reading down in these columns. Then we will continue to the next set: column 3 and the first 3 digits of column 4. We skip any numbers that are > 40,000,000. The resulting voter numbers are: 10,480,150 22,368,465 24,130,483 37,570,399 1,536,020.

5-10.
There are 7 x 24 x 60 minutes in one week: (7)(24)(60) = 10,080 minutes. We will use Table 5-1. Start in the first row and go across the row, then to the next row (left to right, using all 5 digits in each set), discarding any of the resulting 5-digit numbers that are > 10,080. The resulting minute numbers are: 1,536 2,011 6,243 7,856 6,121 6,907.

5-11. A sampling distribution is the probability distribution of a sample statistic. The sampling distribution is useful in determining the accuracy of estimation results.

5-12. Only if the population is itself normal.

5-13. \( E(\bar{X}) = \mu = 125 \)
\( SE(\bar{X}) = \sigma/\sqrt{n} = 20/\sqrt{5} = 8.944 \)

5-14. The fact that, in the limit, the population distribution does not matter. Thus the theorem is very general.

5-15. When the population distribution is unknown.

5-16. The Central Limit Theorem does not apply.

5-17. \( \hat{P} \) is binomial. Since np = 1.2, the Central Limit Theorem does not apply and we cannot use the normal distribution.

5-18. \( \sigma^2 = 10{,}000 \), \( \mu = 1{,}247 \), \( n = 100 \)
\( P(\bar{X} < 1{,}230) = P\left(Z < \frac{1{,}230 - 1{,}247}{100/10}\right) = P(Z < -1.7) = .5 - .4554 = 0.0446 \)
Template output (Sampling Distribution of Sample Mean): population Mean 1247, Stdev 100; sample size n = 100; X-bar Mean 1247, Stdev 10; x = 1230; P(X-bar < x) = 0.0446.

5-19. \( P(|\bar{X} - \mu| \ge 8) = 1 - P(|\bar{X} - \mu| < 8) = 1 - P(-8 < \bar{X} - \mu < 8) \)
\( = 1 - P\left(\frac{-8}{55/\sqrt{150}} < Z < \frac{8}{55/\sqrt{150}}\right) = 1 - P(-1.78 < Z < 1.78) = 1 - 2(.4625) = 0.075 \)

5-20. \( P(\bar{X} > 3.6) = P\left(Z > \frac{3.6 - 3.4}{1.5/\sqrt{100}}\right) = P(Z > 1.333) = 0.0912 \)
Template output (Sampling Distribution of Sample Mean): population Mean 3.4, Stdev 1.5; sample size n = 100; X-bar Mean 3.4, Stdev 0.15; x = 3.6; P(X-bar > x) = 0.0912.

5-21. \( P(3.7 < \bar{X} < 3.9) = P\left(\frac{3.7 - 3.8}{1.2/\sqrt{36}} < Z < \frac{3.9 - 3.8}{1.2/\sqrt{36}}\right) = P(-0.5 < Z < 0.5) = 2(.1915) = .3830 \) (approximately)
(Use template: Sampling Distribution.xls, sheet: x-bar)
Template output (Sampling Distribution of Sample Mean): population Mean 3.8, Stdev 1.2; sample size n = 36; X-bar Mean 3.8, Stdev 0.2; x1 = 3.7, x2 = 3.9; P(x1 < X-bar < x2) = 0.3829.

5-22. \( \sigma = 4{,}500 \), \( n = 225 \)
\( P(|\bar{X} - \mu| \le 800) = P\left(|Z| \le \frac{800}{4{,}500/\sqrt{225}}\right) = P\left(\frac{-800}{4{,}500/15} \le Z \le \frac{800}{4{,}500/15}\right) \)
\( = P(-2.667 < Z < 2.667) = 2(.4961) = 0.9923 \)

5-23. \( p = 0.18 \), \( n = 200 \)
\( P(\hat{P} \ge .20) = P\left(Z \ge \frac{.20 - .18}{\sqrt{(.18)(.82)/200}}\right) = P\left(Z \ge \frac{.02}{.02717}\right) = P(Z \ge .736) = .5 - .2692 = 0.2308 \)

5-24. The claim is that p = 0.58. We have n = 250 and x/n = 123/250 = 0.492.
\( P(\hat{P} \le .492) = P\left(Z \le \frac{.492 - .58}{\sqrt{(.58)(.42)/250}}\right) = P(Z < -2.819) = 0.0024 \)

5-25. a) \( P(\bar{X} > 125000) = 0.0907 \)
Template output: population Mean 119600, Stdev 35000; sample size n = 75; X-bar Mean 119600, Stdev 4041.45; x = 125000; P(X-bar > x) = 0.0907.
b)
  stdev               30000   32000   34000   36000   38000   40000
  P(X-bar > 125000)   0.0595  0.0720  0.0845  0.0970  0.1092  0.1212

5-26. \( n = 16 \), \( \mu = 1.5 \), \( \sigma = 2 \)
\( P(\bar{X} > 0) = P\left(Z > \frac{0 - 1.5}{2/\sqrt{16}}\right) = P(Z > -3) = .5 + .4987 = 0.9987 \)
Template output (Sampling Distribution of Sample Mean): population Mean 1.5, Stdev 2; sample size n = 16; X-bar Mean 1.5, Stdev 0.5; x = 0; P(X-bar > x) = 0.9987.

5-27. \( p = 1/7 \)
\( P(\hat{P} < .10) = P\left(Z < \frac{.10 - .143}{\sqrt{(1/7)(6/7)/180}}\right) = P(Z < -1.648) = 0.5 - 0.4503 = 0.0497 \), a low probability.
The sample size, along with np and n(1 - p), is large enough here that the sampling distribution (over all the different samples of 180 people in the population) of the proportion of people who get hospitalized during the year is going to be pretty close to normal. Therefore, any one such sample proportion will be close to the predicted mean 1/7 with reasonable probability, and 1/10 is far enough away from that mean, given the standard error, that the probability of falling even farther away than that from the mean is small.

5-28. \( \mu = 700 \), \( \sigma = 100 \), \( n = 60 \)
\( P(680 \le \bar{X} \le 720) = P\left(\frac{680 - 700}{100/\sqrt{60}} \le Z \le \frac{720 - 700}{100/\sqrt{60}}\right) = 2\,TA(1.549) = 0.8786 \)

5-29. \( E(\hat{P}) = p = 0.35 \), \( SE(\hat{P}) = \sqrt{(0.35)(0.65)/500} = 0.0213 \)
\( P(|\hat{P} - p| > 0.05) = P(\hat{P} < 0.30) + P(\hat{P} > 0.40) \)
\( = P\left(Z < \frac{0.30 - 0.35}{0.0213}\right) + P\left(Z > \frac{0.40 - 0.35}{0.0213}\right) = 1 - 2\,TA(2.344) = 0.0190 \)

5-30. Estimator B is better. It has a small bias, but its variance is small.
This estimator is more likely to produce an estimate that is close to the parameter of interest.

5-31. I would use this estimator because consistency means that as \( n \to \infty \) the probability of getting close to the parameter increases. With a generous budget I can get a large sample size, which will make this probability high.

5-32. \( \hat{s}^2 = 1{,}287 \)
\( s^2 = \frac{n}{n-1}\,\hat{s}^2 = \frac{100}{99}(1{,}287) = 1{,}300 \)

5-33. Advantage: uses all information in the data. Disadvantage: may be too sensitive to the influence of outliers.

5-34. Depends also on efficiency and other factors. With respect to the bias:
A has bias = 1/n
B has bias = 0.01
A is better than B when 1/n < 0.01, that is, when n > 1/0.01 = 100.

5-35. Consistency is important because it means that as you get more data, your probability of getting closer to your "target" increases.

5-36. \( n_1 = 30 \), \( n_2 = 48 \), \( n_3 = 32 \). The three sample means are known. The df for deviations from the three sample means are:
\( df = n_1 + n_2 + n_3 - 3 = 30 + 48 + 32 - 3 = 107 \)

5-37. a) The mean is the best number to use. mean = 43.667
  Sample   Deviation from mean   Deviation squared
  34       -9.667                93.45089
  51        7.333                53.77289
  40       -3.667                13.44689
  38       -5.667                32.11489
  47        3.333                11.10889
  50        6.333                40.10689
  52        8.333                69.43889
  44        0.333                0.110889
  37       -6.667                44.44889
SSD = 358, degrees of freedom = 8, MSD = SSD / df = 358 / 8 = 44.75

b) Choose the means of the respective blocks of numbers: 40.75, 49.667, 40.5.
Minimized SSD = 195.917, df = 6, MSD = 32.65283
  Sample   Block mean   Deviation from mean   Deviation squared
  34       40.75        -6.75                 45.5625
  51       40.75        10.25                 105.0625
  40       40.75        -0.75                 0.5625
  38       40.75        -2.75                 7.5625
  47       49.667       -2.667                7.112889
  50       49.667        0.333                0.110889
  52       49.667        2.333                5.442889
  44       40.5          3.5                  12.25
  37       40.5         -3.5                  12.25
SSD = 195.9167

c) Each of the numbers themselves. SSD = 0. MSD indicates that the variance is zero, which is true since we are using each of the individual numbers to reduce SSD to zero.
d) SSD = 719, df = 9, MSD = 79.889. mean = 50
  Sample   Deviation from mean   Deviation squared
  34       -16                   256
  51         1                   1
  40       -10                   100
  38       -12                   144
  47        -3                   9
  50         0                   0
  52         2                   4
  44        -6                   36
  37       -13                   169
SSD = 719

5-38. No, because there are n - 1 = 19 - 1 = 18 degrees of freedom for these checks once you know their mean. Since 17 is one less, there is a remaining degree of freedom and you cannot solve for the missing checks.

5-39. Yes. \( (x_1 + \cdots + x_{18} + x_{19})/19 = \bar{x} \). Since 18 of the \( x_i \) are known and so is \( \bar{x} \), we can solve the equation for the unknown \( x_{19} \).

5-40. df = n - k; as k increases, df decreases, SSD decreases, MSD decreases.

5-41. \( E(\bar{X}) = \mu = 1{,}065 \)
\( V(\bar{X}) = \sigma^2/n = 500^2/100 = 2{,}500 \)

5-42. \( \sigma^2 = 1{,}000{,}000 \). Want \( SD(\bar{X}) \le 25 \).
\( SD(\bar{X}) = \sigma/\sqrt{n} = 1{,}000/\sqrt{n} \)
\( 1{,}000/\sqrt{n} \le 25 \;\Rightarrow\; \sqrt{n} \ge 1{,}000/25 = 40 \;\Rightarrow\; n \ge 1{,}600 \)
The sample size must be at least 1,600.

5-43. \( \mu = 53 \), \( \sigma = 10 \), \( n = 400 \)
\( E(\bar{X}) = \mu = 53 \)
\( SE(\bar{X}) = \sigma/\sqrt{n} = 10/\sqrt{400} = 0.5 \)
Template output (Sampling Distribution of Sample Mean): population Mean 53, Stdev 10; sample size n = 400; X-bar Mean 53, Stdev 0.5.

5-44. \( p = 0.5 \), \( n = 120 \)
\( SE(\hat{P}) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{(.5)(.5)}{120}} = 0.0456 \)

5-45. \( E(\hat{P}) = p = 0.2 \)
\( SE(\hat{P}) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{(.2)(.8)}{90}} = 0.04216 \)

5-46. p = 0.5 maximizes the variance of \( \hat{P} \).
Proof: \( V(\hat{P}) = \frac{p(1-p)}{n} \)
\( \frac{dV(\hat{P})}{dp} = \frac{1}{n}\frac{d}{dp}(p - p^2) = \frac{1}{n}(1 - 2p) \)
Set the derivative to zero: \( \frac{1}{n}(1 - 2p) = 0 \;\Rightarrow\; 1 = 2p \;\Rightarrow\; p = 1/2 \)
The second derivative, \( -2/n \), is negative, so this point is a maximum. The assertion may also be demonstrated by trying different values of p.

5-47. \( P(500 < \bar{X} < 700) = P\left(\frac{500 - 600}{600/\sqrt{30}} < Z < \frac{700 - 600}{600/\sqrt{30}}\right) = P(-.913 < Z < .913) = 2(.3194) = .6388 \)
Template output (Sampling Distribution of Sample Mean): population Mean 600, Stdev 600; sample size n = 30; X-bar Mean 600, Stdev 109.545; x1 = 500, x2 = 700; P(x1 < X-bar < x2) = 0.6387.

5-48. \( P(\bar{X} \ge 1{,}000) = P\left(Z \ge \frac{1{,}000 - 1{,}065}{500/10}\right) = P\left(Z \ge \frac{-650}{500}\right) = P(Z \ge -1.3) = .5 + .4032 = 0.9032 \)
We need the Central Limit Theorem in order to use the normal distribution here.

5-49. \( \mu = 53 \), \( \sigma = 10 \), \( n = 400 \)
\( P(52 < \bar{X} < 54) = P\left(\frac{52 - 53}{10/20} < Z < \frac{54 - 53}{10/20}\right) = P(-2 < Z < 2) = 0.9544 \)

5-50.
\( p = 0.5 \), \( n = 120 \)
\( P(\hat{P} \ge .45) = P\left(Z \ge \frac{.45 - .5}{\sqrt{(.5)(.5)/120}}\right) = P(Z \ge -1.095) = 0.8632 \)

5-51. a. $8,128.08, found by $3.3M/406 = 8,128.08
b. \( P(\bar{X} < 7000) = P\left(Z < \frac{7000 - 8128.08}{2000/\sqrt{16}}\right) = P(Z < -2.256) = .5000 - .4880 = 0.012 \)
Template output: x = 7000; P(X-bar < x) = 0.0120.

5-52. \( 0.06 \le p \le 0.10 \)
\( SE(\hat{P}) = \sqrt{p(1-p)/n} \le 0.03 \)
Assume p = 0.06: \( SE(\hat{P}) = \sqrt{(.06)(.94)/n} \le .03 \;\Rightarrow\; (.06)(.94)/n \le .03^2 \;\Rightarrow\; 62.66 \le n \)
Now assume the other extreme, p = 0.10: \( SE(\hat{P}) = \sqrt{(.1)(.9)/n} \le .03 \;\Rightarrow\; (.1)(.9)/n \le .03^2 \;\Rightarrow\; 100 \le n \)
Now, we also know that the function \( SE(\hat{P}) \) does not have a maximum point between p = 0.06 and p = 0.10, because the only maximum point of the function occurs at p = 0.5 (as we know from Problem 5-46). Hence \( SE(\hat{P}) \) is monotonic between p = 0.06 and 0.10, and thus n = 100 is the minimum required sample size.

5-53. Random samples from the entire population of interest reduce the chance of a bias and increase the chance of being representative of the entire population. Also, we have a known probability of being within certain distances of the parameter of interest. We use a frame and a random number generator or a table of random numbers. A simple random sample is such that every possible set of n elements has an equal chance of being selected.

5-54. A bias is a systematic deviation away from the target of estimation. A bias takes us away from the target parameter in repeated sampling. If the bias is small and the variance of the estimator is also small, the bias may be tolerated, especially if the bias decreases as n increases.

5-55. The sample median is unbiased. The sample mean is more efficient; it is also sufficient. This is why we prefer the sample mean. We must assume normality for using the sample median to estimate \( \mu \). The median is more resistant to outliers.

5-56. \( S^2 \) has n - 1 in the denominator because there are n - 1 degrees of freedom for deviations from the sample mean. Using n - 1 instead of n makes \( S^2 \) an unbiased estimator of \( \sigma^2 \).

5-57.
\( \mu = 19.5 \), \( \sigma = 5.3 \), \( n = 100 \)
\( P(\bar{X} > 20) = P\left(Z > \frac{20 - 19.5}{5.3/10}\right) = P(Z > .9434) = .5 - .3273 = 0.1727 \)
Template output (Sampling Distribution of Sample Mean): population Mean 19.5, Stdev 5.3; sample size n = 100; X-bar Mean 19.5, Stdev 0.53; x = 20; P(X-bar > x) = 0.1727.

5-58. 95% bounds on \( \bar{X} \): \( \mu \pm 1.96\,\sigma/\sqrt{n} = 19.5 \pm 1.96(5.3/10) = [18.4612, 20.5388] \)
90% bounds on \( \bar{X} \): \( 19.5 \pm 1.645(5.3/10) = [18.62815, 20.37185] \)
Template output (Symmetric Intervals):
  x1         P(x1 < X-bar < x2)   x2
  18.46122   0.95                 20.538779
  18.62823   0.9                  20.371772

5-59. \( \mu = 2.9 \), \( \sigma = 0.5 \), \( n = 25 \)
\( P(\bar{X} > 3.0) = P\left(Z > \frac{3.0 - 2.9}{0.5/\sqrt{25}}\right) = P(Z > 1) = 0.5 - 0.3413 = 0.1587 \)
(Use template: Sampling Distribution.xls, sheet: x-bar)
Template output (Sampling Distribution of Sample Mean): population Mean 2.9, Stdev 0.5; sample size n = 25; X-bar Mean 2.9, Stdev 0.1; x = 3; P(X-bar > x) = 0.1587.

5-60. df = (rows - 1)(columns - 1) = (5 - 1)(3 - 1) = 8

5-61. \( p = 0.38 \), \( n = 100 \)
\( P(\hat{P} > 0.30) = P\left(Z > \frac{.30 - .38}{\sqrt{(.38)(.62)/100}}\right) = P(Z > -1.648) = .5 + .4503 = 0.9503 \)
Template output (Sampling Distribution of Sample Mean), where stdev = SQRT(.38*.62): population Mean 0.38, Stdev 0.48539; sample size n = 100; X-bar Mean 0.38, Stdev 0.04854; x = 0.3; P(X-bar > x) = 0.9503.

5-62. \( \bar{X} \) is normal. But since \( \sigma \) is unknown and we use S, the quantity \( (\bar{X} - \mu)/(S/\sqrt{n}) \) has the \( t_{(n-1)} \) distribution rather than the standard normal distribution Z.

5-63. No minimum (n = 1 is enough for normality).

5-64. \( \bar{X} \), \( \hat{P} \), \( S^2 \) are unbiased. S is the square root of an unbiased estimator of \( \sigma^2 \), thus it is not unbiased.
Proof: Assume \( E(S) = \sigma \). Then \( (E(S))^2 = \sigma^2 \), and \( E(S^2) - (E(S))^2 = \sigma^2 - \sigma^2 = 0 \) (since \( E(S^2) = \sigma^2 \)). But \( E(S^2) - (E(S))^2 = V(S) \), and V(S) = 0 would mean that S is a constant rather than a statistical estimator. The contradiction establishes the proposition that S is biased.

5-65. This estimator is also consistent. It is more efficient than \( \bar{X} \), because \( \sigma^2/n^2 < \sigma^2/n \) for n > 1.

5-66. df = 124 - 3 = 121

5-67. a. Normal population requires the smallest minimum n.
b. Mound-shaped population requires the next higher minimum n.
c. Discrete population needs the highest minimum n.
d. Slightly skewed population: n more than for (b), less than for (c).
e. Highly skewed population: n less than for (c), but more than for (d).
The relative minimum required sample sizes are as follows: \( n_a < n_b < n_d < n_e < n_c \)

5-68. Yes. \( SE(\bar{X}) \) decreases as n increases: \( SE(\bar{X}) = \sigma/\sqrt{n} \), which goes to 0 as n goes to \( \infty \). Statistically, it is always good to have as large a sample as possible.

5-69. Draw repeated samples, preferably by simulation on a computer, and determine the empirical distribution of the statistic: the relative frequency distribution of its values.

5-70. \( P(\hat{P} < .15) = P\left(Z < \frac{.15 - .20}{\sqrt{(.2)(.8)/250}}\right) = P(Z < -1.976) = .5 - .4759 = 0.0241 \)

5-71. \( \mu = 25 \), \( \sigma = 2 \), \( n = 100 \)
\( P(\bar{X} < 24) = P\left(Z < \frac{24 - 25}{2/10}\right) = P(Z < -5) = 0.0000003 \)
Not probable at all.

5-72. \( P(|\hat{P} - 0.80| \le 0.07) = P(0.73 \le \hat{P} \le 0.87) \)
\( = P\left(\frac{.73 - .80}{\sqrt{(.80)(.20)/200}} \le Z \le \frac{.87 - .80}{\sqrt{(.80)(.20)/200}}\right) = P(-2.475 \le Z \le 2.475) = 2\,TA(2.475) = 0.9866 \)
Template output (Sampling Distribution of Sample Mean), where stdev = SQRT(.80*.20): population Mean 0.8, Stdev 0.4; sample size n = 200; X-bar Mean 0.8, Stdev 0.02828; x1 = 0.73, x2 = 0.87; P(x1 < X-bar < x2) = 0.9867.

5-73. \( P(1.52 < \bar{X} < 1.62) = P\left(\frac{1.52 - 1.57}{0.4/\sqrt{200}} < Z < \frac{1.62 - 1.57}{0.4/\sqrt{200}}\right) = 2\,TA(1.768) = 0.923 \)

5-74. a) point estimate for the sample mean is 52
Template output: population Mean 52, Stdev 2.4; sample size n = 40; X-bar Mean 52, Stdev 0.37947; x1 = 52, x2 = 53; P(x1 < X-bar < x2) = 0.4958.
b) \( P(52 < \bar{X} < 53) = 0.4958 \)

5-75. (Use template: Sampling Distribution.xls, sheet: p-hat)
\( n = 400 \), \( p = 0.06 \)
Template output (Sampling Distribution of Sample Proportion): population proportion p = 0.06; sample size n = 400; P-hat Mean 0.06, Stdev 0.01187; x = 0.05; P(P-hat < x) = 0.1999.
\( P(\hat{P} < 0.05) = 0.1999 \)

5-76. (Use template: Sampling Distribution.xls, sheet: x-bar)
\( \mu = 15830 \), \( \sigma = 458 \), \( n = 10 \)
Template output (Sampling Distribution of Sample Mean): population Mean 15830, Stdev 458; sample size n = 10; X-bar Mean 15830, Stdev 144.832; x = 16000; P(X-bar > x) = 0.1202.
\( P(\bar{X} \ge 16000) = 0.1202 \)

5-77. (Use template: Sampling Distribution.xls, sheet: x-bar)
\( \mu = 750.4 \), \( \sigma = 1.2 \), \( n = 50 \)
Template output (Sampling Distribution of Sample Mean): population Mean 750.4, Stdev 1.2; sample size n = 50; X-bar Mean 750.4, Stdev 0.16971; x1 = 749.5, x2 = 750.5; P(x1 < X-bar < x2) = 0.7222.
\( P(749.5 \le \bar{X} \le 750.5) = 0.7222 \)
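
The template calculations in Problems 5-75 to 5-77 can also be checked without the spreadsheet. The following is a minimal Python sketch, not part of the original solutions: the helper names `xbar_distribution` and `phat_distribution` are hypothetical, and the sketch assumes the same normal approximation for the sampling distributions that the solutions use.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+


def xbar_distribution(mu, sigma, n):
    """Normal model for the sample mean: mean mu, standard error sigma/sqrt(n).

    Hypothetical helper; assumes the normal approximation is appropriate.
    """
    return NormalDist(mu, sigma / sqrt(n))


def phat_distribution(p, n):
    """Normal model for the sample proportion: mean p, standard error sqrt(p(1-p)/n).

    Hypothetical helper; assumes np and n(1-p) are large enough.
    """
    return NormalDist(p, sqrt(p * (1 - p) / n))


# Problem 5-75: P(P-hat < 0.05) with p = 0.06, n = 400
p_575 = phat_distribution(0.06, 400).cdf(0.05)

# Problem 5-76: P(X-bar >= 16000) with mu = 15830, sigma = 458, n = 10
p_576 = 1 - xbar_distribution(15830, 458, 10).cdf(16000)

# Problem 5-77: P(749.5 <= X-bar <= 750.5) with mu = 750.4, sigma = 1.2, n = 50
d_577 = xbar_distribution(750.4, 1.2, 50)
p_577 = d_577.cdf(750.5) - d_577.cdf(749.5)

print(round(p_575, 4), round(p_576, 4), round(p_577, 4))
```

The three printed probabilities should agree with the template results for these problems (0.1999, 0.1202, 0.7222) to about four decimal places; small differences in the last digit can arise from rounding.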