Chapter 6 – Sampling Distributions

In most experiments, we have more than one measurement for any given variable, each measurement being associated with one randomly selected member of a population. Hence we need to examine probabilities associated with events that specify conditions on two or more random variables.

Defn: A set of n random variables $X_1, X_2, \dots, X_n$ constitutes a random sample of size n from a finite population of size N if each member of the sample, $X_i$, is chosen in such a way that every sample of size n has the same probability of being chosen.

Defn: A set of (continuous or discrete) random variables $X_1, X_2, \dots, X_n$ is called a random sample of size n if the r.v.'s have the same distribution and are independent. We say that $X_1, X_2, \dots, X_n$ are independent and identically distributed (i.i.d.).

Note: We will also use the term random sample to refer to the set of observed values $x_1, x_2, \dots, x_n$ of the random variables. Prior to selecting the sample and making the measurements, we have $X_1, X_2, \dots, X_n$, with each $X_i$ being an (unknown) random quantity having associated probability distribution f(x). After selecting the sample and making the measurements, we have the observed values $x_1, x_2, \dots, x_n$.

Note: In practice, it is often difficult to do random sampling. However, random sampling is basic to the use of the statistical inferential procedures that we will discuss later. These procedures are used for analyzing experimental data, for testing hypotheses, for estimating parameters (numerical characteristics of populations), and for performing quality control in manufacturing. In each situation, we must somehow obtain convincing evidence that the data collected do approximate the conditions of randomness.

Example: In a manufacturing situation, we have manufactured items coming off an assembly line. Assume that the population of items that have been completed is relatively large.
We want to check the quality of these items by selecting a random sample of them and making measurements on each item in the sample. If the sample is random, then it has a good chance of being representative of the population, and we can obtain useful information about the quality of the entire population. For example, we are interested in knowing whether the average value of a certain measurement is "close" to the specified target value. It is very unlikely that the sample average will be exactly equal to the population average, but it is likely to be close.

The Sampling Distribution of the Sample Mean

Defn: A statistic is a random variable which is a function of a random sample. The probability distribution associated with a statistic is called its sampling distribution.

Example: Let $X_1, X_2, \dots, X_n$ be a random sample from a population (probability distribution). The statistic
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
is called the sample mean. Since the $X_i$'s are random variables, $\bar{X}$ is also a random variable, with a sampling distribution.

Some other examples of statistics are:
1) The sample variance, $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$,
2) The sample median, $\tilde{X}$.

Theorem 6.1: Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution having mean $\mu$ and standard deviation $\sigma$. Then the mean of the sampling distribution of $\bar{X}$ is:
\[ \mu_{\bar{X}} = E[\bar{X}] = E\left[ \sum_{i=1}^{n} \frac{1}{n} X_i \right] = \sum_{i=1}^{n} \frac{1}{n} \mu = \mu. \]
The variance of the sampling distribution depends on the size of the population from which the sample is drawn. If the population is of infinite size, then, because the $X_i$ are independent,
\[ \sigma_{\bar{X}}^2 = \sum_{i=1}^{n} \frac{1}{n^2} \sigma^2 = \frac{\sigma^2}{n}. \]

Note: The quantity $\sigma_{\bar{X}} = \sigma / \sqrt{n}$ (the standard deviation of the sampling distribution of the sample mean) is also called the standard error of the mean. It provides us with a measure of reliability of the sample mean as an estimate of the population mean. This term will be important when we discuss statistical inference.
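
The two results in Theorem 6.1 can be checked empirically. The following is a minimal sketch, not part of the original notes: it assumes a normal population with illustrative values (mu = 10, sigma = 2, n = 25, chosen here only for demonstration), draws many random samples, and compares the mean and standard deviation of the resulting sample means with $\mu$ and $\sigma/\sqrt{n}$. The function name `simulate_sample_means` is hypothetical.

```python
import math
import random
import statistics


def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` random samples of size n from Normal(mu, sigma)
    and return the list of their sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


# Illustrative values (assumptions, not taken from the notes).
mu, sigma, n = 10.0, 2.0, 25
means = simulate_sample_means(mu, sigma, n, reps=5000)

# Theorem 6.1: the sample means should centre on mu ...
print("mean of sample means:", round(statistics.fmean(means), 3), "vs mu =", mu)
# ... and their spread should be close to the standard error sigma / sqrt(n).
print("s.d. of sample means:", round(statistics.stdev(means), 3),
      "vs sigma/sqrt(n) =", sigma / math.sqrt(n))
```

With these values the standard error is 2/5 = 0.4, so the simulated standard deviation of the sample means should land near 0.4, with small run-to-run variation if the seed is changed.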
Note: If the random sample was selected from a normal distribution (we write $X_1, X_2, \dots, X_n \sim \text{Normal}(\mu, \sigma)$), then it can be shown that
\[ \bar{X} \sim \text{Normal}\left( \mu, \frac{\sigma}{\sqrt{n}} \right). \]

Example: On page 134, Exercise 5.27. If I randomly select a single assembled piece of machinery from the population of assembled pieces, the time for assembly will be a random variable X having a Normal(µ = 12.9 min., σ = 2.0 min.) distribution. On the other hand, if I select a random sample of size 64 from the population, then $\bar{X}$, the average assembly time for the sample of pieces, will have a Normal($\mu_{\bar{X}}$ = 12.9 min., $\sigma_{\bar{X}}$ = 2.0/√64 = 0.25 min.) distribution. Note that the standard deviation of $\bar{X}$ is only one-eighth the standard deviation of X. This is an important concept.

The following theorem is EXTREMELY important (as well as astonishing). This theorem provides the basis for our procedures for doing statistical inference.

Theorem 6.3: (Central Limit Theorem) If $X_1, X_2, \dots, X_n$ are a random sample from any distribution with mean $\mu$ and standard deviation $\sigma < +\infty$, then the limiting distribution of
\[ \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
as $n \to +\infty$ is standard normal.

Note: Nothing was said about the distribution from which the sample was selected except that it has finite standard deviation. The sample could be selected from a normal distribution, or from an exponential distribution, or from a Weibull distribution, or from a Bernoulli distribution, or from a Poisson distribution, or from any other distribution with finite standard deviation. See, e.g., the example on pages 179-180. See also the illustration on page 184.

Note: For what n will the normal approximation be good? For most purposes, if $n \ge 30$, we will say that the approximation given by the Central Limit Theorem (CLT) works well.

Example: p. 187, Exercise 6.15.

Example: The fracture strength of tempered glass averages 14 (measured in thousands of p.s.i.) and has a standard deviation of 2.
What is the probability that the average fracture strength of 100 randomly selected pieces of tempered glass will exceed 14,500 p.s.i.?

Example: Shear strength measurements for spot welds have been found to have a standard deviation of 10 p.s.i. If 100 test welds are to be measured, what is the approximate probability that the sample mean will be within 1 p.s.i. of the true population mean?

The T Distribution

There is a flaw in using the above discussion (Central Limit Theorem, etc.) to draw conclusions about the value of the population mean, µ, from a measured value of the sample mean, $\bar{x}$. If we have to depend on sample data for information about the population mean, then we usually do not know the value of the population standard deviation either, and we would have to estimate σ as well. We need to modify our theory somewhat to take this complication into account. We introduce another probability distribution that allows us to use sample data alone to make inferences about the population mean.

Theorem 6.4: If $\bar{X}$ is the mean of a random sample of size n taken from a normal distribution having mean µ and standard deviation σ, and if
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
is the sample variance, then the random variable
\[ t = \frac{\bar{X} - \mu}{S / \sqrt{n}} \]
has a t-distribution with degrees of freedom ν = n – 1.

The t-distribution (which is actually a family of distributions, characterized by the degrees of freedom) has characteristics similar to those of the standard normal distribution, as we can see from the figure on page 187. Note that for large d.f., the t(n-1) distribution is very close to the standard normal distribution. In fact, the standard normal distribution provides a good approximation to the t(n-1) distribution for n of size 30 or more.

Note: Cut-off values and various tail probabilities for the t-distribution, with various values for ν, may be found in Table 4 on page 516. Note that in order to use this table, we must know the degrees of freedom in the particular exercise.
However, we will find these values using Excel. The Excel functions to be used would be $P(T \le x)$ = T.DIST(x, d.f., TRUE), and pth quantile = T.INV(p, d.f.).

Example: page 188.

The Sampling Distribution of the Variance

The above discussion provides us with the tools to do inference about the value of a population mean. If we want to do inference about the value of a population variance, $\sigma^2$, then we need to discuss the sampling distribution for the sample statistic, $S^2$, that we use to estimate the population variance. For this, we need to introduce another family of probability distributions, the chi-square family.

Theorem 6.5: If $S^2$ is the variance of a random sample of size n taken from a normal distribution with variance $\sigma^2$, then the random variable
\[ \chi^2 = \frac{(n-1) S^2}{\sigma^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} \]
has a chi-square distribution with degrees of freedom ν = n – 1.

Note: Cut-off values and various tail probabilities for the chi-square distribution, with various values for ν, may be found in Table 5 on page 517. Note that in order to use this table, we must know the degrees of freedom in the particular exercise. However, we will find these values using Excel. The Excel functions to be used are $P(\chi^2 \le x)$ = CHISQ.DIST(x, d.f., TRUE), and pth quantile = CHISQ.INV(p, d.f.).

Example: p. 190.

The F-Distribution

When we do analysis of experimental data, our conclusions about whether the experimental treatments had an effect will be based on a statistic which may be imagined as a "signal-to-noise" ratio, with the "signal" being the treatment effect (differences among the treatment groups) and the "noise" being the variability of the data within treatment groups. The sampling distribution of this statistic is given in the following theorem. This statistic may also be used to do inference about the differences between two population variances.
Theorem 6.6: If $S_1^2$ and $S_2^2$ are the variances of independent random samples of size $n_1$ and $n_2$, respectively, taken from two normal distributions having the same variance, then the random variable
\[ F = \frac{S_1^2}{S_2^2} \]
has an F distribution with parameters $\nu_1 = n_1 - 1$ (the numerator degrees of freedom) and $\nu_2 = n_2 - 1$ (the denominator degrees of freedom).

Note: Cut-off values and various tail probabilities for the F distribution, with various values for $\nu_1$ and $\nu_2$, may be found in Table 6 on pages 518-519 (note that this table is an abbreviated version of an F-table that would be used in practical situations). Note that in order to use this table, we must know the values of the two degrees-of-freedom parameters in the particular exercise. We may also find probabilities and quantiles using Excel. We will come back to the F distribution later in the course.
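
Theorem 6.6 can be illustrated by simulation. The sketch below is not from the original notes: it assumes illustrative sample sizes ($n_1 = 10$, $n_2 = 16$) and a common standard deviation of 3, repeatedly draws two independent normal samples, and forms the ratio of their sample variances. As a rough check, it compares the average simulated ratio with $\nu_2 / (\nu_2 - 2)$, which is the standard formula for the mean of an F distribution when $\nu_2 > 2$ (a property not stated in the notes above). The helper names `sample_variance` and `simulate_f_ratios` are hypothetical.

```python
import random


def sample_variance(values):
    """Sample variance S^2 with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def simulate_f_ratios(n1, n2, sigma, reps, seed=0):
    """Return `reps` simulated values of F = S1^2 / S2^2, where the two
    samples are independent draws from Normal(0, sigma)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    ratios = []
    for _ in range(reps):
        s1_sq = sample_variance([rng.gauss(0.0, sigma) for _ in range(n1)])
        s2_sq = sample_variance([rng.gauss(0.0, sigma) for _ in range(n2)])
        ratios.append(s1_sq / s2_sq)
    return ratios


# Illustrative values (assumptions, not taken from the notes).
n1, n2 = 10, 16
ratios = simulate_f_ratios(n1, n2, sigma=3.0, reps=5000)

nu2 = n2 - 1                        # denominator degrees of freedom
theoretical_mean = nu2 / (nu2 - 2)  # mean of F(nu1, nu2), valid for nu2 > 2
simulated_mean = sum(ratios) / len(ratios)

print("simulated mean of F:", round(simulated_mean, 3))
print("nu2 / (nu2 - 2):    ", round(theoretical_mean, 3))
```

Because both samples share the same variance, the common value of sigma cancels in the ratio; changing `sigma` in the call should leave the distribution of the simulated ratios essentially unchanged.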