The Sample Mean is a Random Variable

The sample mean x̄ is a random variable: its value is determined through a random sampling process. The probability distribution of x̄ is called the sampling distribution of x̄.

How is the Sampling Distribution of x̄ Generated?

Consider a population of five elements, A, B, C, D, and E. The numeric value x of each element is listed below, and the population mean and variance are computed as follows:

Element     x     x − µ    (x − µ)²
A           6     −6.6      43.56
B           9     −3.6      12.96
C          12     −0.6       0.36
D          15      2.4       5.76
E          21      8.4      70.56
Total      63      0.0     133.20

µ = Σx / N = 63 / 5 = 12.6
σ² = Σ(x − µ)² / N = 133.2 / 5 = 26.64

We select a random sample of size n = 3 from this population, without replacement. The following is the list of all the possible samples. Each possible sample has its own mean x̄. Since there are 10 possible samples, there are 10 possible sample means. The value of x̄ is thus assigned through a random sampling process, which makes x̄ a random variable.

Sample     Sample values     x̄
A B C       6   9  12         9
A B D       6   9  15        10
A B E       6   9  21        12
A C D       6  12  15        11
A C E       6  12  21        13
A D E       6  15  21        14
B C D       9  12  15        12
B C E       9  12  21        14
B D E       9  15  21        15
C D E      12  15  21        16

The table below lists all the possible values of x̄ in ascending order, along with the probability (relative frequency) of each value. This table represents the probability distribution of x̄. The probability distribution of x̄ is called the sampling distribution of x̄.

Sampling Distribution of x̄

x̄        f(x̄)
 9        0.1
10        0.1
11        0.1
12        0.2
13        0.1
14        0.2
15        0.1
16        0.1
Total     1.0

IMPORTANT PROPERTIES OF THE SAMPLING DISTRIBUTION OF x̄

The Mean of the Means Equals the Mean!

In the example above there are ten x̄ values. Find the mean of these ten sample means, µ_x̄. This calculation is shown in two ways.
Find µ_x̄ directly from the ten x̄ values:

x̄: 9, 10, 12, 11, 13, 14, 12, 14, 15, 16
Σx̄ = 126
µ_x̄ = 126 / 10 = 12.6

Find the expected value of x̄, E(x̄), from the sampling distribution:

x̄        f(x̄)     x̄·f(x̄)
 9        0.1       0.9
10        0.1       1.0
11        0.1       1.1
12        0.2       2.4
13        0.1       1.3
14        0.2       2.8
15        0.1       1.5
16        0.1       1.6
Total     1.0      12.6

µ_x̄ ≡ E(x̄) = Σx̄·f(x̄) = 12.6

Note the important conclusion from these calculations:

µ_x̄ ≡ E(x̄) = µ

The mean of the (sample) means equals the (population) mean.

The Variance and the Standard Error of x̄

The variance of x̄, denoted by var(x̄), is the measure of dispersion of the x̄ values around their center of gravity, that is, around µ_x̄. To find var(x̄), use the sampling distribution of x̄. The var(x̄) is calculated as the weighted mean of the squared deviations of x̄.

x̄        f(x̄)     (x̄ − µ_x̄)²     (x̄ − µ_x̄)²·f(x̄)
 9        0.1       12.96            1.296
10        0.1        6.76            0.676
11        0.1        2.56            0.256
12        0.2        0.36            0.072
13        0.1        0.16            0.016
14        0.2        1.96            0.392
15        0.1        5.76            0.576
16        0.1       11.56            1.156
Total                                4.440

var(x̄) = Σ(x̄ − µ_x̄)²·f(x̄) = 4.440

The square root of var(x̄) is called the standard error of x̄:

se(x̄) = √var(x̄)
se(x̄) = √4.440 = 2.107

The Relationship Between var(x̄) and the Parent Population Variance, σ²

In this example, the parent population, the population from which the samples are obtained, is a finite population. The criterion for treating a population as finite is n/N ≥ 0.05. Here, n = 3 and N = 5, which gives us n/N = 0.60.

When the parent population is finite, var(x̄) is related to σ² according to the following formula:

var(x̄) = (σ²/n) · (N − n)/(N − 1)

The term (N − n)/(N − 1) is called the finite population correction factor (FPCF). For the example above:

var(x̄) = (26.64/3) · (5 − 3)/(5 − 1) = 8.88(0.5) = 4.440

var(x̄) for Non-Finite Populations

For large populations the FPCF approaches 1. Therefore, var(x̄) and se(x̄) become:

var(x̄) = σ²/n
se(x̄) = σ/√n

These are the formulas that we will be using from this point forward.
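The five-element example above is small enough to enumerate by machine. The following is a minimal sketch, assuming only the Python standard library; the variable names are illustrative and not part of the original notes. It lists every sample of size 3, builds the sampling distribution, and compares var(x̄) computed from the distribution with the value given by the FPCF formula.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

population = [6, 9, 12, 15, 21]      # values of elements A, B, C, D, E
N, n = len(population), 3            # population size and sample size

# Parent population mean and variance (divide by N, not N - 1)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N

# Mean of every possible sample of size n drawn without replacement
sample_means = [sum(s) / n for s in combinations(population, n)]

# Sampling distribution: each distinct mean with its relative frequency
counts = Counter(sample_means)
dist = {m: c / len(sample_means) for m, c in sorted(counts.items())}

# Expected value and variance of the sample mean, from the distribution
mean_of_means = sum(m * f for m, f in dist.items())
var_xbar = sum((m - mean_of_means) ** 2 * f for m, f in dist.items())
se_xbar = sqrt(var_xbar)

# The same variance from sigma^2 / n times the finite population correction
var_from_formula = (sigma2 / n) * (N - n) / (N - 1)

print("sampling distribution:", dist)
print("mu =", round(mu, 4), " mean of means =", round(mean_of_means, 4))
print("var(xbar) =", round(var_xbar, 4), " via FPCF =", round(var_from_formula, 4))
print("se(xbar) =", round(se_xbar, 3))
```

Enumerating with `itertools.combinations` mirrors sampling without replacement, which is why the FPCF is needed for the two variance calculations to agree.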
The Number of Possible Samples and Sample Means

The number of possible samples of size n quickly becomes astronomical. For example, the number of possible samples of size n = 50 selected from a parent population with N = 1,000 elements is:

C(1000, 50) = 9.5E+84

This is 9.5 × 10^84, a number with 85 digits. Since the number of possible samples is astronomically large (effectively infinite), and each sample provides its own sample mean, the number of x̄ values is also effectively infinite. From here we can conclude that, for all practical purposes, x̄ is a continuous random variable.

The Sampling Distribution of x̄ Must be Normal

The sole application of the sampling distribution of x̄ is in statistical inference (inferential statistics). For this purpose the sampling distribution must be normal.

The sampling distribution of x̄ will be normal if the samples are selected from a normal parent population. If the parent population is not normal, the sampling distribution will still be approximately normal provided the sample size is at least 30 (n ≥ 30). If the parent population is not normal, samples smaller than 30 cannot be relied on to generate a normal sampling distribution of x̄.

Examples Using the Normal Sampling Distribution of x̄

Suppose the average per-semester textbook expense of all IUPUI students is $550 (the population mean µ = 550) with a standard deviation of $260 (the population standard deviation σ = 260).

Example 1
The textbook expenses of a random sample of n = 100 students are obtained. The probability that the sample mean is less than $500 is _____. Or, the proportion or percentage of x̄ values obtained from samples of size n = 100 that are less than $500 is _____.

P(x̄ < 500) = _____

In the normal sampling distribution of x̄, the mean or expected value is the mean of the parent population, E(x̄) = µ. The standard deviation (standard error) is se(x̄) = σ/√n.

[Figure: normal sampling distribution of x̄ centered at E(x̄) = µ = 550, with x̄ = 500 marked]

To find this probability, you must first transform x̄ into the standard normal variable z.
Solution to Example 1:

z = (x̄ − µ) / se(x̄)
se(x̄) = σ/√n = 260/√100 = 26
z = (500 − 550) / 26 = −1.92
P(z < −1.92) = 0.0274

The parent population parameters for the examples below: µ = $550, σ = $260.

Example 2
A sample of size n = 64 is selected. The probability that the sample mean is greater than $600 is _____. Or, the proportion or percentage of x̄ values obtained from samples of size n = 64 that are greater than $600 is _____.

P(x̄ > 600) = _____

se(x̄) = σ/√n = 260/√64 = 32.5
z = (600 − 550) / 32.5 = 1.54
P(z > 1.54) = 0.0618

Example 3
A sample of size n = 80 is selected. The probability that the sample mean is between $506 and $594 is _____. Or, the proportion or percentage of x̄ values obtained from samples of size n = 80 that are between $506 and $594 is _____.

P(506 < x̄ < 594) = _____

se(x̄) = σ/√n = 260/√80 = 29.069
z = (506 − 550) / 29.069 = −1.51
z = (594 − 550) / 29.069 = 1.51
P(−1.51 < z < 1.51) = 0.8690

Example 4
A sample of size n = 80 is selected. The probability that the sample mean is within ±$40 of the population mean is _____. Or, the proportion or percentage of x̄ values obtained from samples of size n = 80 that are within ±$40 of the population mean is _____.

P(µ − 40 < x̄ < µ + 40) = _____

se(x̄) = σ/√n = 260/√80 = 29.069
z = (510 − 550) / 29.069 = −1.38
z = (590 − 550) / 29.069 = 1.38
P(−1.38 < z < 1.38) = 0.8324

Example 5
When n = 80, the proportion of x̄ values that fall within ±1.25 standard errors of the population mean is _____.

P(µ − 1.25·se(x̄) < x̄ < µ + 1.25·se(x̄)) = _____

se(x̄) = 29.069
1.25·se(x̄) = 1.25(29.07) = 36.3
P(µ − 36.3 < x̄ < µ + 36.3) = P(513.7 < x̄ < 586.3)
z = (513.7 − 550) / 29.07 = −1.25
z = (586.3 − 550) / 29.07 = 1.25
P(−1.25 < z < 1.25) = 0.7888

Note: the z calculations here are redundant. The statement "within ±1.25 standard errors from the population mean" implies z = ±1.25.

Example 6
For samples of size n = 100, the interval which contains the middle 90% (0.90) of sample means is:

P(x̄_L < x̄ < x̄_U) = 0.9000
x̄_L = ______    x̄_U = ______

Find the quantity to be subtracted from and added to µ to get the lower and upper ends of the desired interval. Since

z = (x̄ − µ) / se(x̄)

we have

x̄ = µ + z·se(x̄)

Since each tail area under the normal curve is 0.05, z_0.05 = 1.64.

se(x̄) = 260/√100 = 26
x̄_L = 550 + (−1.64)(26) = 550 − 42.64 = 507.36
x̄_U = 550 + (1.64)(26) = 550 + 42.64 = 592.64

The term z·se(x̄) is called the "margin of sampling error", or simply the MARGIN OF ERROR (MOE). The combined area of the two tails is called the error probability and is denoted by α. Here, α = 0.10.

Example 7
For samples of size n = 100, the interval symmetric about the mean which contains 95% (0.95) of sample means is:

P(x̄_L < x̄ < x̄_U) = 0.9500
x̄_L = ______    x̄_U = ______

α = 1 − 0.95 = 0.05
α/2 = 0.025
z_α/2 = z_0.025 = 1.96
MOE = z_α/2 · se(x̄)
MOE = (1.96)(26) = 50.96 ≈ $51
x̄_L = 550 − 51 = $499
x̄_U = 550 + 51 = $601

Determining the Sample Size for a Given Margin of Error

In Example 7, where n = 100, we determined that the interval containing 95% of x̄ values is within ±$51 of µ. That is, the 95% margin of error (MOE) turned out to be $51 when n = 100:

P(µ − 50.96 ≤ x̄ ≤ µ + 50.96) = 0.95

In this interval, 95% of x̄ values deviate from µ by no more than $50.96 in either direction. Generally,

P[µ − z_α/2 · se(x̄) ≤ x̄ ≤ µ + z_α/2 · se(x̄)] = 1 − α

This expression means that 100(1 − α)% of sample means fall within MOE = ±z_α/2 · se(x̄) of the population mean.

Now suppose we are interested in an interval in which the middle 95% of x̄ values deviate from µ by no more than $20. That is, we want the 95% margin of error to be $20.
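The probabilities and margins of error worked out in Examples 1 through 7 can be checked numerically. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper names `sampling_dist` and `margin_of_error` are illustrative and not part of the original notes. Because the notes round z to two decimals before reading the normal table (and use 1.64 rather than 1.645 for the 90% interval), the computed values differ slightly from the hand-worked ones.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 550.0, 260.0   # parent population mean and standard deviation


def sampling_dist(n):
    """Normal sampling distribution of the sample mean for samples of size n."""
    return NormalDist(MU, SIGMA / sqrt(n))


def margin_of_error(n, confidence):
    """z_(alpha/2) * se(xbar) for the middle `confidence` share of sample means."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * SIGMA / sqrt(n)


# Example 1: P(xbar < 500), n = 100
p1 = sampling_dist(100).cdf(500)

# Example 2: P(xbar > 600), n = 64
p2 = 1 - sampling_dist(64).cdf(600)

# Examples 3 and 4: probability of an interval, n = 80
d80 = sampling_dist(80)
p3 = d80.cdf(594) - d80.cdf(506)
p4 = d80.cdf(590) - d80.cdf(510)

# Examples 6 and 7: 90% and 95% margins of error, n = 100
moe90 = margin_of_error(100, 0.90)
moe95 = margin_of_error(100, 0.95)

for label, value in [("P(xbar < 500)", p1), ("P(xbar > 600)", p2),
                     ("P(506 < xbar < 594)", p3), ("P(510 < xbar < 590)", p4),
                     ("90% MOE", moe90), ("95% MOE", moe95)]:
    print(f"{label}: {value:.4f}")
```

Calling `margin_of_error` with larger values of n shows the MOE shrinking in proportion to 1/√n, which is the relationship used to solve for the sample size.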
Since se(x̄) = σ/√n, the general statement can be written as

P(µ − z_α/2 · σ/√n ≤ x̄ ≤ µ + z_α/2 · σ/√n) = 1 − α

In Example 7, since we have chosen the error probability to be α = 0.05,

P(µ − z_0.025 · σ/√n ≤ x̄ ≤ µ + z_0.025 · σ/√n) = 0.95
P(µ − 1.96 · 260/√100 ≤ x̄ ≤ µ + 1.96 · 260/√100) = 0.95

The interval we now want is

P(µ − 20 ≤ x̄ ≤ µ + 20) = 0.95

This smaller MOE, a narrower interval, requires a different sample size. How do we determine the proper n to obtain this MOE? Consider the MOE formula:

MOE = z_α/2 · σ/√n

This shows that the margin of error is inversely related to the square root of the sample size n: the larger the sample size, the smaller the MOE. Solving for n, we have

n = (z_α/2 · σ / MOE)²

Thus, to obtain a 95% margin of error of $20, the minimum sample size should be:

n = (1.96 × 260 / 20)² = 649.2

Since we are interested in the minimum sample size, round n up to the next integer:

n = 650

Sampling Distribution of the Sample Proportion p̄

Everything you learned about the sampling distribution of x̄ applies equally to the sampling distribution of p̄. The only difference between the two is that in the sampling distribution of p̄ all data are binary. Therefore, the symbols change accordingly.

A Comparison

Parent population mean and variance:
  Sampling distribution of x̄: µ = Σx/N and σ² = Σ(x − µ)²/N
  Sampling distribution of p̄: the parent population mean is the proportion π = Σx/N (each x is 0 or 1), and σ² = π(1 − π)

Number of possible samples:
  x̄: There is an effectively infinite number of possible samples of size n obtainable from the parent population. Therefore, there is an effectively infinite number of sample mean x̄ values.
  p̄: There is an effectively infinite number of possible samples of size n obtainable from the parent population. Therefore, there is an effectively infinite number of sample proportion p̄ values.

Expected value:
  x̄: The center of gravity, the expected value, or the mean of the sample means equals the population mean. (The mean of the means equals the mean.)
  E(x̄) = µ_x̄ = µ
  p̄: The center of gravity, the expected value, or the mean of the sample proportions equals the population proportion. (The mean of the proportions equals the proportion.)
  E(p̄) = µ_p̄ = π

Variance (n is the sample size):
  x̄: The variance of the sample means is var(x̄) = σ²/n
  p̄: The variance of the sample proportions is var(p̄) = σ²/n = π(1 − π)/n

Standard error:
  x̄: The standard error of the means is se(x̄) = σ/√n
  p̄: The standard error of the proportions is se(p̄) = √(π(1 − π)/n)

Normality:
  x̄: To be applicable in statistical inference, the sampling distribution of x̄ must be normal.
  p̄: To be applicable in statistical inference, the sampling distribution of p̄ must be normal.

Examples

Use the following information for the examples below. The proportion of households in a state who are home-owners is 0.67 (67%): π = 0.67.

Example 1
A sample of n = 500 households is selected. The probability that the sample proportion p̄ is less than 0.65 is _____. Or, the proportion or percentage of p̄ values obtained from samples of size n = 500 that are less than 0.65 is _____.

P(p̄ < 0.65) = ______

Example 2
A sample of size n = 500 is selected. The probability that the sample proportion p̄ is greater than 0.71 is _____. Or, the proportion or percentage of p̄ values obtained from samples of size n = 500 that are greater than 0.71 is _____.

P(p̄ > 0.71) = ______

Example 3
A sample of size n = 600 is selected. The probability that the sample proportion p̄ is between 0.63 and 0.71 is _____. Or, the proportion or percentage of p̄ values obtained from samples of size n = 600 that are between 0.63 and 0.71 is _____.
P(0.63 < p̄ < 0.71) = _____

In each case, transform p̄ into the standard normal variable:

z = (p̄ − π) / se(p̄)

Solution to Example 1:

se(p̄) = √(π(1 − π)/n) = √(0.67(1 − 0.67)/500) = 0.0210
z = (0.65 − 0.67) / 0.0210 = −0.95
P(z < −0.95) = 0.1711

Solution to Example 2:

se(p̄) = 0.0210
z = (0.71 − 0.67) / 0.0210 = 1.90
P(z > 1.90) = 0.0287

Solution to Example 3:

se(p̄) = √(0.67(1 − 0.67)/600) = 0.0192
z = (0.63 − 0.67) / 0.0192 = −2.08
z = (0.71 − 0.67) / 0.0192 = 2.08
P(−2.08 < z < 2.08) = 0.9624

The parent population parameter for the examples below: π = 0.67.

Example 4
A sample of size n = 800 is selected. The probability that the sample proportion is within ±0.03 (3 percentage points) of the population proportion is _____. Or, the proportion or percentage of p̄ values obtained from samples of size n = 800 that are within ±0.03 of the population proportion is _____.

P(π − 0.03 < p̄ < π + 0.03) = _____

se(p̄) = √(π(1 − π)/n) = √(0.67(1 − 0.67)/800) = 0.0166
z = (p̄ − π) / se(p̄) = (0.64 − 0.67) / 0.0166 = −1.81
z = (0.70 − 0.67) / 0.0166 = 1.81
P(−1.81 < z < 1.81) = 0.9298

Nearly 93% of sample proportions (for samples of n = 800) fall within ±0.03 (3 percentage points) of the population proportion. Alternatively, nearly 93% of sample proportions deviate from the population proportion by no more than 0.03.

Example 5
A sample of size n = 800 is selected. The probability that the sample proportion is within ±1.5 standard errors of the population proportion is _____. Or, the proportion or percentage of p̄ values obtained from samples of size n = 800 that are within ±1.5 standard errors of the population proportion is _____.

P[π − 1.50·se(p̄) < p̄ < π + 1.50·se(p̄)] = _____

se(p̄) = 0.0166
π ± 1.50·se(p̄) = 0.67 ± 0.025
P(0.645 < p̄ < 0.695) = _____
P(−1.50 < z < 1.50) = 0.8664

Again, you can see that the z calculations would be redundant. "1.50 standard errors from the population proportion" implies z = ±1.50.

The parent population parameter for the examples below: π = 0.67. Sample size: n = 900.

se(p̄) = √(0.67(1 − 0.67)/900) = 0.0157

Example 6
The interval which contains 80% of p̄ values is

P(____ < p̄ < ____) = 0.80

Find the quantity to be subtracted from and added to π to get the lower and upper ends of the desired interval. Since

z = (p̄ − π) / se(p̄)

we have

p̄ = π + z_0.10 · se(p̄)

Since each tail area under the normal curve is 0.10, z_0.10 = 1.28.

p̄_L = 0.67 + (−1.28)(0.0157) = 0.67 − 0.02 = 0.65
p̄_U = 0.67 + (1.28)(0.0157) = 0.67 + 0.02 = 0.69

The term z·se(p̄) is called the "margin of sampling error", or, more simply, the MARGIN OF ERROR (MOE). The combined area of the two tails is called the error probability and is denoted by α. Here, α = 0.20.

Example 7
The interval which contains 90% of p̄ values is

P(____ < p̄ < ____) = 0.90

α = 1 − 0.90 = 0.10
α/2 = 0.05
z_α/2 = z_0.05 = 1.64
MOE = z_α/2 · se(p̄)
MOE = (1.64)(0.0157) = 0.026
p̄_L = 0.67 − 0.026 = 0.644
p̄_U = 0.67 + 0.026 = 0.696

Example 8
The interval which contains 95% of p̄ values is

P(____ < p̄ < ____) = 0.95

α = 1 − 0.95 = 0.05
α/2 = 0.025
z_α/2 = z_0.025 = 1.96
MOE = z_α/2 · se(p̄)
MOE = (1.96)(0.0157) = 0.031
p̄_L = 0.67 − 0.031 = 0.639
p̄_U = 0.67 + 0.031 = 0.701

Determining the Sample Size for a Given Margin of Error

In Example 8, where n = 900, we determined that the interval containing 95% of p̄ values is within ±0.031 (3.1 percentage points) of π. That is, the 95% margin of error (MOE) turned out to be 0.031 when n = 900:

P(π − 1.96·√(0.67(1 − 0.67)/900) ≤ p̄ ≤ π + 1.96·√(0.67(1 − 0.67)/900)) = 0.95
P(π − 0.031 ≤ p̄ ≤ π + 0.031) = 0.95

In this interval, 95% of p̄ values deviate from π by no more than 0.031 (3.1 percentage points) in either direction. Generally,

P[π − z_α/2 · se(p̄) ≤ p̄ ≤ π + z_α/2 · se(p̄)] = 1 − α

This expression means that 100(1 − α)% of sample proportions fall within MOE = ±z_α/2 · se(p̄) of the population proportion, where se(p̄) = √(π(1 − π)/n).
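The proportion examples follow the same pattern as the examples for x̄ and can be checked the same way. This is a minimal sketch, again assuming the standard-library `statistics.NormalDist`; the function names `se_prop`, `prob_between` and `moe_prop` are hypothetical helpers introduced here for illustration. Computed values differ slightly from the hand-worked ones because the notes round se(p̄) and z before using the normal table.

```python
from math import sqrt
from statistics import NormalDist

PI = 0.67   # population proportion of home-owning households


def se_prop(n, pi=PI):
    """Standard error of the sample proportion for samples of size n."""
    return sqrt(pi * (1 - pi) / n)


def prob_between(lo, hi, n, pi=PI):
    """P(lo < pbar < hi) under the normal sampling distribution of pbar."""
    dist = NormalDist(pi, se_prop(n, pi))
    return dist.cdf(hi) - dist.cdf(lo)


def moe_prop(n, confidence, pi=PI):
    """z_(alpha/2) * se(pbar) for the middle `confidence` share of pbar values."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * se_prop(n, pi)


# Examples 1 and 2 (n = 500): one-tailed probabilities
q1 = NormalDist(PI, se_prop(500)).cdf(0.65)
q2 = 1 - NormalDist(PI, se_prop(500)).cdf(0.71)

# Example 3 (n = 600) and Example 4 (n = 800): interval probabilities
q3 = prob_between(0.63, 0.71, 600)
q4 = prob_between(0.64, 0.70, 800)

# Examples 6 to 8 (n = 900): 80%, 90% and 95% margins of error
moes = {c: moe_prop(900, c) for c in (0.80, 0.90, 0.95)}

print(f"P(pbar < 0.65) = {q1:.4f}")
print(f"P(pbar > 0.71) = {q2:.4f}")
print(f"P(0.63 < pbar < 0.71) = {q3:.4f}")
print(f"P(0.64 < pbar < 0.70) = {q4:.4f}")
for c, m in moes.items():
    print(f"{c:.0%} MOE at n = 900: {m:.4f}")
```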
Now suppose we are interested in an interval in which the middle 95% of p̄ values deviate from π by no more than 0.02 (2 percentage points). That is, we want the 95% margin of error to be 0.02:

P(π − 0.02 ≤ p̄ ≤ π + 0.02) = 0.95

Substituting se(p̄) = √(π(1 − π)/n) into the general statement gives

P(π − z_α/2 · √(π(1 − π)/n) ≤ p̄ ≤ π + z_α/2 · √(π(1 − π)/n)) = 1 − α

In Example 8, since we have chosen the error probability to be α = 0.05,

P(π − z_0.025 · √(π(1 − π)/n) ≤ p̄ ≤ π + z_0.025 · √(π(1 − π)/n)) = 0.95

This smaller MOE, a narrower interval, requires a different sample size. How do we determine the proper n to obtain this MOE? Consider the MOE formula:

MOE = z_α/2 · √(π(1 − π)/n)

This shows that the margin of error is inversely related to the square root of the sample size n: the larger the sample size, the smaller the MOE. Solving for n, we have

n = (z_α/2 / MOE)² · π(1 − π)

Thus, to obtain a 95% margin of error of 0.02, the minimum sample size should be:

n = (1.96 / 0.02)² · 0.67(1 − 0.67) = 2123.44

Since we are interested in the minimum sample size, round n up to the next integer:

n = 2124
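Both sample-size formulas in these notes, the one for x̄ and the one for p̄, reduce to a one-line calculation followed by rounding up. The sketch below assumes the standard-library `statistics.NormalDist` for the z value; the function names `n_for_mean` and `n_for_proportion` are illustrative, not from the original notes. It uses the exact z rather than the rounded 1.96, which does not change the rounded-up answers for the two worked cases.

```python
from math import ceil
from statistics import NormalDist


def z_two_sided(confidence):
    """z_(alpha/2) for a symmetric interval holding `confidence` of the area."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def n_for_mean(sigma, moe, confidence=0.95):
    """Minimum n such that z * sigma / sqrt(n) does not exceed moe."""
    z = z_two_sided(confidence)
    return ceil((z * sigma / moe) ** 2)


def n_for_proportion(pi, moe, confidence=0.95):
    """Minimum n such that z * sqrt(pi * (1 - pi) / n) does not exceed moe."""
    z = z_two_sided(confidence)
    return ceil((z / moe) ** 2 * pi * (1 - pi))


# Textbook expenses: sigma = $260, desired 95% MOE = $20
n_mean = n_for_mean(sigma=260, moe=20)

# Home ownership: pi = 0.67, desired 95% MOE = 0.02
n_prop = n_for_proportion(pi=0.67, moe=0.02)

print("minimum n for the mean:", n_mean)
print("minimum n for the proportion:", n_prop)
```

Rounding up with `ceil` rather than to the nearest integer matters: any smaller n would give a margin of error above the target.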