Chapter 8 Sampling Variability and Sampling Distributions

In inferential statistics, we want to use information contained in a sample to reach conclusions about one or more characteristics of the population from which the sample was selected. In the subsequent chapters, we will introduce some inferential methods.

8.1 Statistics and Sampling Variability

A statistic is any quantity computed from values in a sample (for example, $\bar{x}$, $s$, the sample median, the sample interquartile range, and so on).

A population parameter is any quantity computed from values in a population (for example, $\mu$, $\sigma$, the population median, the population interquartile range, and so on).

A statistic and a population parameter differ in two ways:
(1) A statistic is a sample characteristic, whereas a population parameter is a population characteristic.
(2) The observed value of a statistic depends on the particular sample selected from the population; typically, it varies from sample to sample. This variability is called sampling variability. A population parameter, in contrast, is a fixed number, which is generally unknown.

A population is an entire collection of individuals or objects about which information is desired. Generally we are interested in a variable of a population. A variable associates a value with each individual or object in a population. For example,

Population = {all students at NAU} = {student 1, student 2, ..., student k}

The variable of interest for this population may be $x$ = height of a student.

Now we consider a hypothetical "population":

Population of samples = {all possible samples of a given size $n$} = {sample 1, sample 2, ..., sample k}

For a population of samples, we may be interested in a statistic (for example, $\bar{x}$). Just as a variable associates a value with every individual or object in a population and can be described by its distribution, a statistic associates a value with each individual sample in the population of samples, and can also be described by a distribution.
Definition 8.1: The distribution of a statistic is called its sampling distribution.

Example 8.1 Population = {6 students}. The variable of interest is $x$ = the amount of money (dollars) each student spent on textbooks for the current semester, which is given in the following table.

| Student | Student 1 | Student 2 | Student 3 | Student 4 | Student 5 | Student 6 |
|---|---|---|---|---|---|---|
| $x$ | 267 | 258 | 261 | 275 | 252 | 288 |

The distribution of $x$ can be summarized by a density histogram.

| Class Interval | Relative Frequency | Density |
|---|---|---|
| 250 to < 260 | 0.333 | 0.0333 |
| 260 to < 270 | 0.333 | 0.0333 |
| 270 to < 280 | 0.167 | 0.0167 |
| 280 to < 290 | 0.167 | 0.0167 |

Table 8.1 Frequency distribution for $x$

Figure 8.1 Density histogram for $x$ (horizontal axis: $x$; vertical axis: density)

Now we consider all possible samples of size 2 and calculate their means.

| Sample | $\bar{x}$ |
|---|---|
| {student 1, student 2} | 262.5 |
| {student 1, student 3} | 264 |
| {student 1, student 4} | 271 |
| {student 1, student 5} | 259.5 |
| {student 1, student 6} | 277.5 |
| {student 2, student 3} | 259.5 |
| {student 2, student 4} | 266.5 |
| {student 2, student 5} | 255 |
| {student 2, student 6} | 273 |
| {student 3, student 4} | 268 |
| {student 3, student 5} | 256.5 |
| {student 3, student 6} | 274.5 |
| {student 4, student 5} | 263.5 |
| {student 4, student 6} | 281.5 |
| {student 5, student 6} | 270 |

Table 8.2 All possible samples of size 2 and their means

The sampling distribution of $\bar{x}$ can also be summarized by a density histogram.

| Class Interval | Relative Frequency | Density |
|---|---|---|
| 250 to < 260 | 0.267 | 0.0267 |
| 260 to < 270 | 0.333 | 0.0333 |
| 270 to < 280 | 0.333 | 0.0333 |
| 280 to < 290 | 0.067 | 0.0067 |

Table 8.3 Frequency distribution for $\bar{x}$

Figure 8.2 Density histogram for $\bar{x}$ (horizontal axis: $\bar{x}$; vertical axis: density)

8.2 The Sampling Distribution of a Sample Mean

When we want to make an inference about the population mean $\mu$, it is natural to use the sample mean $\bar{x}$. The behavior of $\bar{x}$ is described by its sampling distribution.
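To make the idea of a sampling distribution concrete, the enumeration in Example 8.1 can be reproduced in a few lines of Python. This is only an illustrative sketch: the names `spending`, `sample_means`, and `relative_frequencies` are introduced here and are not part of the original example. The data values and class intervals are those of Example 8.1.

```python
from itertools import combinations
from statistics import mean

# Textbook spending (dollars) for the six students in Example 8.1.
spending = {1: 267, 2: 258, 3: 261, 4: 275, 5: 252, 6: 288}

# Every possible sample of size 2 (drawn without replacement) and its mean.
sample_means = {
    pair: mean([spending[s] for s in pair])
    for pair in combinations(spending, 2)
}


def relative_frequencies(values, edges):
    """Share of the values falling in each class interval [lo, hi)."""
    values = list(values)
    return [
        sum(lo <= v < hi for v in values) / len(values)
        for lo, hi in zip(edges, edges[1:])
    ]


edges = [250, 260, 270, 280, 290]
rel_freq = relative_frequencies(sample_means.values(), edges)

for (a, b), m in sample_means.items():
    print(f"{{student {a}, student {b}}}: {m}")
print("relative frequencies:", [round(f, 3) for f in rel_freq])
```

Dividing each relative frequency by the class width (10) gives the densities used for the histogram. Because the population has only six members, the sampling distribution here is exact rather than approximate.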
The sample size $n$ and the characteristics of the population (for example, its mean value $\mu$, its standard deviation $\sigma$, and its shape) are important in determining the sampling distribution of $\bar{x}$. Generally, it is hard to obtain the true sampling distribution of $\bar{x}$, so we approximate it by sampling experiments.

Let us recall the experiment of tossing a coin. Population = {head, tail}. We can obtain an approximate distribution of the population as follows: toss the coin 500 times, then construct a relative frequency histogram for the outcomes of the 500 tosses. Since 500 is reasonably large, the histogram should closely resemble the population distribution.

Figure 8.3 An approximate distribution of the population = {head, tail} based on 500 tosses (relative frequency histogram with bars for head and tail)

An approximate sampling distribution of $\bar{x}$ can be obtained in the same way:
(1) select 500 different random samples of a given size $n$;
(2) compute $\bar{x}$ for each sample;
(3) construct a sample histogram of these 500 $\bar{x}$ values.
The sample histogram should closely resemble the true sampling distribution of $\bar{x}$.

General Properties of the Sampling Distribution of $\bar{x}$

Let $\bar{x}$ denote the mean of the observations in a random sample of size $n$ from a population having mean $\mu$ and standard deviation $\sigma$. Denote the mean value of the $\bar{x}$ distribution by $\mu_{\bar{x}}$ and the standard deviation of the $\bar{x}$ distribution by $\sigma_{\bar{x}}$. Then the following rules hold.

Rule 1: $\mu_{\bar{x}} = \mu$.

Rule 2: $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$.

Rule 3: When the population distribution is normal, the sampling distribution of $\bar{x}$ is also normal for any sample size $n$. Thus, the standardized variable
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
has the standard normal ($z$) distribution.

Rule 4: (Central Limit Theorem) When $n$ is sufficiently large, the sampling distribution of $\bar{x}$ is well approximated by a normal curve, even when the population distribution is not itself normal. So, the standardized variable
$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
has approximately the standard normal ($z$) distribution.
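The sampling experiment described above (500 random samples, one $\bar{x}$ per sample) can be sketched in Python to see Rules 1 and 2 at work. This is a minimal illustration under stated assumptions, not part of the original notes: the population is taken to be exponential with mean 1 and standard deviation 1 (a skewed, non-normal choice made purely for illustration), the sample size $n = 36$ is arbitrary, and the helper name `simulate_sample_means` is hypothetical.

```python
import random
import statistics


def simulate_sample_means(draw, n, num_samples=500, seed=0):
    """Return the means of num_samples random samples of size n.

    draw(rng) must return a single observation from the population.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean([draw(rng) for _ in range(n)])
        for _ in range(num_samples)
    ]


# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
mu, sigma, n = 1.0, 1.0, 36
means = simulate_sample_means(lambda rng: rng.expovariate(1.0), n)

# Rule 1: the mean of the x-bar values should be close to mu.
print("mean of x-bar values:", statistics.fmean(means), "| mu:", mu)
# Rule 2: their standard deviation should be close to sigma / sqrt(n).
print("sd of x-bar values:  ", statistics.stdev(means),
      "| sigma/sqrt(n):", sigma / n ** 0.5)
```

A histogram of `means` would play the role of the sample histogram in step (3). With only 500 samples the agreement with Rules 1 and 2 is approximate, not exact.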
Note: As a rule of thumb, the Central Limit Theorem can safely be applied if $n$ exceeds 30.

Example 8.2 A patient visits her doctor with concerns about her blood pressure. If the systolic blood pressure exceeds 150, the patient is considered to have high blood pressure and medication may be prescribed. The problem is that there is considerable variation in a patient's systolic blood pressure readings during a given day. Suppose that the patient's systolic readings during a given day have a normal distribution with a mean of $\mu = 160$ mm mercury and a standard deviation of $\sigma = 20$ mm. If five measurements are taken at various times during the day, what is the probability that the average blood pressure reading will be less than 150 and hence fail to indicate that the patient has a high blood pressure problem?

Let $\bar{x}$ denote the average blood pressure reading. Because the patient's systolic readings during a given day have a normal distribution, by Rule 3 the sampling distribution of $\bar{x}$ is also normal, and by Rules 1 and 2 it has mean value $\mu_{\bar{x}} = \mu = 160$ and standard deviation
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{5}} = 8.9443.$$
Then
$$P(\bar{x} < 150) = P\left(z < \frac{? - ?}{?}\right) = P(z < ?) = ?.$$

Example 8.3 An anthropologist wishes to estimate the average height of men for a certain race of people. Suppose that the population standard deviation is $\sigma = 2.5$ inches and that she randomly samples 100 men.
(1) Find the probability that the difference between the sample mean $\bar{x}$ and the true population mean $\mu$ will not exceed .5 inch.
(2) If the true population mean is $\mu = 68$ inches, find the probability that $\bar{x}$ will exceed 68.4 inches.

(1) By Rule 2,
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2.5}{\sqrt{100}} = .25.$$
Since $n = 100$, we can invoke the Central Limit Theorem and regard the $\bar{x}$ distribution as approximately normal. Then
$$P(|\bar{x} - \mu| \le 0.5) = P\left(\frac{|\bar{x} - \mu|}{0.25} \le \frac{0.5}{0.25}\right) \approx P(|z| \le 2) = P(-2 \le z \le 2) = P(z \le 2) - P(z < -2) = 0.9772 - 0.0228 = 0.9544.$$

(2) By the Central Limit Theorem,
$$P(\bar{x} > 68.4) \approx P\left(z > \frac{? - ?}{?}\right) = P(z > ?) = 1 - P(z \le ?) = ?.$$
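Readers who want to check their answers to Examples 8.2 and 8.3 numerically can use a short Python sketch such as the one below. It relies on `statistics.NormalDist` from the Python standard library in place of a printed $z$ table, so results may differ from table-based values in the last decimal place. The helper `sd_of_mean` and the variable names are introduced here for illustration only.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()  # standard normal distribution (mean 0, sd 1)


def sd_of_mean(sigma, n):
    """Rule 2: standard deviation of the sample mean."""
    return sigma / sqrt(n)


# Example 8.2: mu = 160, sigma = 20, n = 5; P(x-bar < 150).
p_82 = z.cdf((150 - 160) / sd_of_mean(20, 5))

# Example 8.3 (1): sigma = 2.5, n = 100; P(|x-bar - mu| <= 0.5).
bound = 0.5 / sd_of_mean(2.5, 100)
p_83_1 = z.cdf(bound) - z.cdf(-bound)

# Example 8.3 (2): mu = 68; P(x-bar > 68.4).
p_83_2 = 1 - z.cdf((68.4 - 68) / sd_of_mean(2.5, 100))

print(f"Example 8.2:     {p_82:.4f}")
print(f"Example 8.3 (1): {p_83_1:.4f}")
print(f"Example 8.3 (2): {p_83_2:.4f}")
```

Each probability is obtained the same way as in the worked steps: standardize with $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$, then evaluate the standard normal cumulative distribution function.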
8.3 The Sampling Distribution of a Sample Proportion

When we investigate the proportion of individuals or objects in a population that possess a specified property, we traditionally label an individual or object that possesses the property of interest S (for success), and one that does not possess the property F (for failure). Let $\pi$ denote the proportion of S's in the population. For example, population = {students at NAU} and $\pi$ = the proportion of female students. The value of $\pi$ is usually unknown.

When a random sample of size $n$ is selected from this type of population, some of the individuals in the sample are S's, and the remaining individuals in the sample are F's. The statistic that provides a basis for making inferences about $\pi$ is $p$, the sample proportion of S's:

$p$ = (number of S's in the sample) / $n$.

Just as making inferences about $\mu$ requires knowing something about the sampling distribution of $\bar{x}$, making inferences about $\pi$ requires knowing properties of the sampling distribution of the statistic $p$.

Let
$$x = \begin{cases} 1 & \text{if it is a success (S),} \\ 0 & \text{if it is a failure (F).} \end{cases}$$
Then $x$ has the Bernoulli($\pi$) distribution and the sample becomes

| Sample | F | S | S | ... | F | S |
|---|---|---|---|---|---|---|
| $x$ | 0 | 1 | 1 | ... | 0 | 1 |
| | $x_1$ | $x_2$ | $x_3$ | ... | $x_{n-1}$ | $x_n$ |

Now
$$\bar{x} = \frac{\sum x}{n} = \frac{\text{number of S's in the sample}}{n} = p.$$
Thus the general properties of the sampling distribution of $\bar{x}$ give the following rules for $p$.

Note: For the Bernoulli($\pi$) distribution, mean = $\pi$ and standard deviation = $\sqrt{\pi(1 - \pi)}$.

General Properties of the Sampling Distribution of $p$

Let $p$ be the proportion of S's in a random sample of size $n$ from a population whose proportion of S's is $\pi$. Denote the mean value of $p$ by $\mu_p$ and the standard deviation of $p$ by $\sigma_p$. Then the following rules hold.

Rule 1: $\mu_p = \pi$.

Rule 2: $\sigma_p = \sqrt{\dfrac{\pi(1 - \pi)}{n}}$.

Rule 3: (Central Limit Theorem) When $n$ is large and $\pi$ is not too near 0 or 1, the sampling distribution of $p$ is approximately normal. Thus, the standardized variable
$$z = \frac{p - \mu_p}{\sigma_p} = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}$$
has approximately the standard normal ($z$) distribution.

Note: If both $n\pi \ge 10$ and $n(1 - \pi) \ge 10$, it is safe to use a normal approximation.
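Rules 1 and 2 for $p$ can be checked directly, because the number of S's in the sample has a binomial distribution when the $n$ observations are independent Bernoulli($\pi$) values. The sketch below assumes that independence (for example, a very large population or sampling with replacement). The function name `sampling_distribution_of_p` and the values $n = 50$, $\pi = 0.3$ are chosen here for illustration and do not come from the notes.

```python
from math import comb, sqrt


def sampling_distribution_of_p(n, pi):
    """Exact distribution of p = (number of S's) / n for independent draws.

    Returns a dict mapping each possible value k/n to its probability.
    """
    return {
        k / n: comb(n, k) * pi ** k * (1 - pi) ** (n - k)
        for k in range(n + 1)
    }


n, pi = 50, 0.3  # illustrative values, not taken from the text
dist = sampling_distribution_of_p(n, pi)

# Mean and standard deviation computed from the distribution itself.
mean_p = sum(p * prob for p, prob in dist.items())
sd_p = sqrt(sum((p - mean_p) ** 2 * prob for p, prob in dist.items()))

print("mean of p:", mean_p, "| pi:", pi)                           # Rule 1
print("sd of p:  ", sd_p, "| formula:", sqrt(pi * (1 - pi) / n))   # Rule 2
```

Unlike a simulation, this computation uses the full distribution of $p$, so the two sides of each rule should agree up to floating-point rounding.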
Example 8.4 Population = {all blood recipients}. $\pi$ = the proportion of all blood recipients stricken with viral hepatitis = .07. A new treatment is given to $n = 200$ blood recipients. Only 6 of the 200 patients contract hepatitis. The question of interest to medical researchers is: Is the new treatment effective?

If the new treatment is ineffective, then $\mu_p = \pi = .07$ and
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{(.07)(1 - .07)}{200}} = .018.$$
Here, $p = 6/200 = .03$. Since $n\pi = 200(.07) = 14 > 10$ and $n(1 - \pi) = 200(1 - .07) = 186 > 10$, the sampling distribution of $p$ is approximately normal. Then
$$P(p \le .03) \approx P\left(z \le \frac{.03 - .07}{.018}\right) = P(z \le -2.22) = .0132.$$
This probability is very small, which means that it is unlikely that a sample proportion of .03 or smaller would be observed if the new treatment really were ineffective. We therefore conclude that the new treatment is effective.

Exercise in class: Suppose that the sample size is $n = 100$ and the population proportion is $\pi = 0.95$.
a) Does the sample proportion $p$ have approximately a normal distribution? Explain.
b) What is the smallest value of $n$ for which the sampling distribution of $p$ is approximately normal?
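The calculation in Example 8.4 can be reproduced with a short Python sketch. The function `normal_approx_ok` is a hypothetical helper that encodes the rule of thumb from the Note in Section 8.3, and `statistics.NormalDist` stands in for the printed $z$ table. The code carries full precision instead of rounding $\sigma_p$ to .018 and $z$ to $-2.22$, so the probability it prints may differ slightly in the last digit from the table-based value.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_ok(n, pi, threshold=10):
    """Rule of thumb from the notes: n*pi >= 10 and n*(1 - pi) >= 10."""
    return n * pi >= threshold and n * (1 - pi) >= threshold


n, pi = 200, 0.07       # Example 8.4, assuming the treatment is ineffective
p_observed = 6 / n      # observed sample proportion, 0.03

sd_p = sqrt(pi * (1 - pi) / n)        # Rule 2 for p
z_value = (p_observed - pi) / sd_p    # standardized sample proportion
prob = NormalDist().cdf(z_value)      # P(p <= 0.03) under the normal approximation

print("normal approximation reasonable:", normal_approx_ok(n, pi))
print(f"sd of p = {sd_p:.4f}, z = {z_value:.3f}, P(p <= .03) = {prob:.4f}")
```

The same three steps (check the rule of thumb, standardize, evaluate the normal cumulative distribution function) apply to any question about a sample proportion of this kind.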