MATH 2441 Probability and Statistics for Biological Sciences

Sampling and Sampling Distributions

The Story So Far ...

Remind yourself of several major ideas raised so far in the course. First, the goal of statistics is to develop methods for describing populations. Almost always we are prevented from studying the entire population of interest in detail, and so we must rely on the information available in a random sample drawn from that population.

(Diagram: Population and Random Sample, linked by probability theory and statistical inference.)

For finite populations, a random sample is defined to be one in which every element of the population has an equal chance of being included. For infinitely large populations, samples are random if the likelihood of selecting one item of the population is unaffected by which items have already been selected.

Populations are described by parameters: a few numbers that summarize some general characteristics of the population. These are fixed numbers (that is, not variables) for that population. However, because we can't study the entire population exhaustively, we have no way of measuring the value of a parameter directly. So far in the course, we've looked at a number of population parameters:

    $\mu$                   the arithmetic mean of the population
    $\sigma$                the standard deviation of the population
    $\sigma^2$              the variance of the population
    $\pi$                   a population proportion
    $\rho$                  a population correlation coefficient
    $\beta_0$, $\beta_1$    the intercept and slope of the population regression line

Recall that we always use Greek symbols for population parameters.

Analogous descriptive numbers for samples are called statistics. Sample statistics are random variables, since their values depend on precisely which elements of the population happen to be selected into the sample. Of course, once we have selected a random sample, the value of each of these sample statistics can be determined as precisely as we wish, limited only by the precision of our observational equipment. Among the sample statistics we've looked at are:

    $\bar{x}$, $\tilde{x}$   the arithmetic mean and the median of the sample
    $s$                      the standard deviation of the sample
    $s^2$                    the variance of the sample
    $p$                      a sample proportion
    $r$                      a sample correlation coefficient
    $b_0$, $b_1$             the intercept and slope of the sample regression line

By convention, sample statistics are always denoted by letters of the Roman alphabet.

With this information, we can state the goal of the discipline of statistics in a bit more detail. It is our intention to develop methods which will enable us to:

(i.) compute estimates of the values of population parameters using the observed values of appropriate statistics for random samples of that population (statistical estimation), and

(ii.) evaluate the degree to which observed values of appropriate statistics support statements made about the values of population parameters (hypothesis testing).

Achieving these goals is not a straightforward task, because we must work with limited information: a random sample, being just a small part of a population, cannot represent all of the characteristics of that population precisely.

Since sample statistics are random variables, they have probability distributions. The probability distribution of a sample statistic is called its sampling distribution. A sampling distribution reflects the possible values of the statistic and how those values cluster or spread out, and so it depends on the values of the population parameters. (For example, we would expect the distribution of arithmetic means of random samples of a population to be influenced by the arithmetic mean of that population.)
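To see concretely what it means for sample statistics to be random variables while parameters stay fixed, here is a minimal Python sketch (assuming numpy is available; the population itself is invented purely for illustration): four random samples drawn from the same fixed population give four different values of $\bar{x}$ and $s$, while $\mu$ and $\sigma$ do not change.

```python
# A minimal sketch (not from the original handout) illustrating that sample
# statistics are random variables while population parameters are fixed.
# The population below is hypothetical, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=150.0, scale=8.3, size=100_000)   # a fixed population
mu, sigma = population.mean(), population.std()               # parameters (fixed numbers)

for i in range(4):                        # four different random samples, each of size n = 25
    sample = rng.choice(population, size=25, replace=False)
    print(f"sample {i + 1}: x-bar = {sample.mean():6.2f}, s = {sample.std(ddof=1):5.2f}")

print(f"population: mu = {mu:.2f}, sigma = {sigma:.2f}")
```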
Because sampling distributions depend on population parameters in this way, they provide the link between the values of sample statistics and the values of population parameters that we need in order to achieve the two goals described just above. In the remainder of this document, we will summarize the basic facts about several important types of sampling distributions, and look at a few immediate applications in some instances.

The Sampling Distribution of the Mean

It is necessary to distinguish between two populations here. First, there is the original target population which we are studying. This might be the population of all residents of Canada, the population of all apple trees in Canada, or the population of all cans of tomato soup produced by a company. This target population has a mean, $\mu$, and a standard deviation, $\sigma$.

Now, think of the set of all of the samples of a specific size n that can be selected from that target population. Each of these samples has a mean value, $\bar{x}$ (and a sample standard deviation, s, but we'll deal with s a little later). The set of these $\bar{x}$ values for all the possible samples forms a population of its own, which has a mean value, $\mu_{\bar{x}}$, and a standard deviation, $\sigma_{\bar{x}}$. Notice that we have used the symbols $\mu$ and $\sigma$ here because these really are population parameters. The subscript $\bar{x}$ is used to indicate that $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ are parameters of the population of $\bar{x}$ values of all possible samples of size n that can be selected from the original target population.

Then, a very fundamental mathematical result from probability theory (the so-called Central Limit Theorem) says the following:

(i.) $\mu_{\bar{x}} = \mu$, the population mean. That is, the mean of the target population and the mean of the population of possible $\bar{x}$ values are the same.

(ii.) $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$, expressing the relationship between the standard deviation of the sampling distribution and the standard deviation of the target population.

and, most importantly,

(iii.) if the target population is normally distributed, then these $\bar{x}$ values will also be normally distributed. If the target population is not normally distributed, then these $\bar{x}$ values will still be approximately normally distributed if the sample size n is large enough. As a rule of thumb, you can assume "large enough" here means 30 or more.

When point (iii) applies, points (i) - (iii) give us the shape, the position, and information about the width of the distribution of possible $\bar{x}$ values. This embodies as much as it is possible for us to know about the relationship between an $\bar{x}$ value we may observe and the value of $\mu$ for the target population.

You may recall that part of this result was suggested by some rough experimental results reported in the document "The Arithmetic Mean: Sampling Issues", which was presented just after our discussion of measures of central tendency in the descriptive statistics section of this course. There, we found that a sampling of $\bar{x}$ values for increasing (but small) values of n seemed to give more bell-like relative frequency histograms, centered approximately at the position of the population mean (which we knew in that case because we had constructed the population ourselves). The histograms obtained there also seemed to narrow as n got larger, reflecting point (ii) above.
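Points (i) - (iii) can also be illustrated with a short simulation. The following is a minimal Python sketch (assuming numpy is available; the exponential population and the sample size are arbitrary choices made for illustration, not taken from the handout): even though the population is strongly skewed, the $\bar{x}$ values for samples of size 30 are centred near $\mu$ and have a spread close to $\sigma/\sqrt{n}$.

```python
# A minimal simulation sketch of the Central Limit Theorem (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
n, trials = 30, 10_000
mu, sigma = 2.0, 2.0            # an exponential population: mean 2, sd 2, strongly skewed

# Draw 10,000 samples of size 30 and compute the mean of each one.
sample_means = rng.exponential(scale=2.0, size=(trials, n)).mean(axis=1)

print(f"mean of x-bar values: {sample_means.mean():.3f}  (CLT predicts mu = {mu})")
print(f"sd of x-bar values:   {sample_means.std():.3f}  (CLT predicts sigma/sqrt(n) = {sigma/np.sqrt(n):.3f})")
# A relative frequency histogram of sample_means (e.g. with matplotlib) is close to a normal curve.
```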
The importance of this result lies in providing ways to estimate $\mu$ from values of $\bar{x}$ (and s, as it turns out). However, there are also a few useful applications in situations in which it is plausible to talk about knowing the values of $\mu$ and $\sigma$, and using that information to solve a problem involving $\bar{x}$. We'll illustrate with several short examples.

Example 1: Suppose a company produces tablets containing an active medicinal ingredient. Because of unavoidable variations in various aspects of the production process, not every tablet contains exactly the same amount of this compound. In fact, extensive studies show that the amount of ingredient in each tablet is an approximately normally distributed random variable with a mean of 150 mg and a standard deviation of 8.3 mg. If an experimental subject is given five such tablets, what is the probability that they will have received between 740 and 760 mg of the active ingredient?

Solution: This problem speaks of the population of all such tablets, indicating that it is approximately normally distributed, and that for this population, $\mu$ = 150 mg and $\sigma$ = 8.3 mg. This information allows us to calculate the probability that the amount of active ingredient in an individual tablet falls within a certain range of values. Thus, for example, if we define

    x = mg of active ingredient in a randomly selected tablet of this type

then we can compute

$$\Pr(148 \le x \le 152) = \Pr\left(\frac{148 - 150}{8.3} \le z \le \frac{152 - 150}{8.3}\right) = \Pr(-0.24 \le z \le 0.24) = 2\Pr(0 \le z \le 0.24) = 2(0.0948) = 0.1896$$

Thus, just under 20% of the tablets will contain between 148 mg and 152 mg of the active ingredient.

The only thing wrong with this calculation is that it doesn't answer the question that was asked! The question was not about the amount of active ingredient in one tablet, but about the amount of active ingredient in 5 tablets. However, if you think about it a minute, you'll realize that asking for the probability that 5 tablets contain between 740 mg and 760 mg of active ingredient is the same thing as asking for the probability that the mean amount of active ingredient in a sample of 5 tablets will be between 740/5 = 148 mg and 760/5 = 152 mg. The central limit theorem tells us that the distribution of such an $\bar{x}$:

- should be approximately normal, because the target population is approximately normal;
- should have a mean value of 150 mg (because the target population has $\mu$ = 150 mg);
- should have a standard deviation of $8.3/\sqrt{5} \approx 3.712$ mg (that is, $\sigma/\sqrt{n}$, where $\sigma$ = 8.3 mg is the standard deviation of the target population).

Thus,

$$\Pr(148 \le \bar{x} \le 152) = \Pr\left(\frac{148 - 150}{3.712} \le z \le \frac{152 - 150}{3.712}\right) = \Pr(-0.54 \le z \le 0.54) = 0.4108$$

Thus, the probability that the five tablets will have totalled between 740 mg and 760 mg of active ingredient is 0.4108.

(Notice the difference between the probabilities for x and for $\bar{x}$ obtained here. There is a considerably higher probability that the value of $\bar{x}$ for a set of 5 tablets will fall within the range 148 mg to 152 mg about the target population mean of 150 mg than there is that any one of the five tablets by itself will fall within that range. This is an example of a very useful feature of the random sampling process: means of samples cluster much more closely about the population mean than do the values of the elements of the target population themselves. This shows up in the formula $\sigma_{\bar{x}} = \sigma/\sqrt{n}$: the standard deviation of the sampling distribution is smaller by a factor of $\sqrt{n}$ than the standard deviation of the target population. It is in this property of the sampling distribution that the value of taking larger rather than smaller samples shows up. In case we forget to mention this again (and again, and again ...), it is very important to notice that the value of $\sigma_{\bar{x}}$ depends on the sample size and not at all on the population size; but see the note below on sampling from finite populations!)
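As a quick check, here is a minimal sketch of the Example 1 calculations, assuming scipy is available (its normal CDF plays the role of the z-table; the tiny differences from 0.1896 and 0.4108 come from rounding z to two decimals in the hand calculation).

```python
# Sketch: re-doing the Example 1 calculations with scipy (illustrative only).
from scipy.stats import norm

mu, sigma, n = 150.0, 8.3, 5

# One tablet: P(148 <= x <= 152)
p_one = norm.cdf(152, mu, sigma) - norm.cdf(148, mu, sigma)

# Mean of five tablets: P(148 <= x-bar <= 152), with sd sigma / sqrt(n)
sd_xbar = sigma / n ** 0.5
p_mean = norm.cdf(152, mu, sd_xbar) - norm.cdf(148, mu, sd_xbar)

print(f"single tablet: {p_one:.4f}")    # about 0.19
print(f"mean of five:  {p_mean:.4f}")   # about 0.41
```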
Example 2: A food processing company claims that their canned clams contain less than 2.50 ppb (parts per billion) of a certain pollutant. A food technology student who is rather concerned about this takes tissue specimens from 40 such canned clams and analyzes them for the presence of the pollutant. For that sample of 40 clams, she obtains a mean concentration of 2.63 ppb and a standard deviation of 0.86 ppb. Is this proof that the company's claim is false?

Solution: First, we need to sort out the numbers here a bit. The company has claimed that the average concentration of this pollutant in all of their canned clams is less than 2.50 ppb. In symbols, their claim is that $\mu$ < 2.50 ppb, where $\mu$ stands for the mean pollutant concentration of all of the canned clams they produce. The technologist took a sample of this population: the sample had size n = 40, a mean value $\bar{x}$ = 2.63 ppb, and a standard deviation s = 0.86 ppb.

Now, the technologist did observe $\bar{x}$ = 2.63, which is a number bigger than 2.50, the maximum value of the mean claimed by the company. However, because this value of $\bar{x}$ resulted from a random sample of the population, we can't immediately interpret it as invalidating the company's claim. The central limit theorem indicates that if $\mu$ were exactly equal to 2.50 ppb, the values of $\bar{x}$ for random samples drawn from that population would vary to some extent, some being smaller than 2.50 ppb while others would be larger than 2.50 ppb, and this would still be so even if $\mu$ were somewhat less than 2.50 ppb.

However, there is one sort of result that would tend to make us suspect the company's claim. If it turned out to be very improbable to observe a sample of size 40 with a mean $\bar{x}$ = 2.63 or bigger when it was really true that $\mu$ = 2.50 ppb, then we would be justified in doubting their claim. The technologist's observation would then be surprising if the company's claim were true, but since the technologist's observation was sound, the apparent inconsistency could only be reconciled by casting suspicion on the company's claim. In this case,

$$\Pr(\bar{x} \ge 2.63) = \Pr\left(z \ge \frac{2.63 - 2.50}{\sigma/\sqrt{40}}\right)$$

To compute this exactly, we need a value of $\sigma$, which we don't have. The best we can do is assume that the value s = 0.86 ppb is a reasonable estimate of $\sigma$. Doing this, we get

$$\Pr(\bar{x} \ge 2.63) \approx \Pr\left(z \ge \frac{2.63 - 2.50}{0.86/\sqrt{40}}\right) = \Pr(z \ge 0.96) = 0.1685$$

Thus, assuming the company's claim is correct, there is a probability of nearly 17% of observing a result as contradictory as the one the technologist actually did observe. This probability is too high to conclude that this value of $\bar{x}$ is inconsistent with the company's claim. These experimental results do not invalidate the company's claim.

(What sort of answer would we have needed to be able to state with confidence that the company's claim appears doubtful? It depends. For most routine work, a probability of 0.05 or less in the calculation above would be considered by most practitioners to be adequate grounds for concluding against the company. What we have calculated here is the probability that we are wrong if we use the technologist's results to declare the company's claim to be unfounded. The more serious the consequences of such a mistake on our part, the smaller we'd want that number to be before sticking our necks out! Much more will be said about this when we deal with the topic of hypothesis testing shortly.)
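Here is a minimal sketch of the Example 2 calculation, again assuming scipy is available and, as in the text, treating s = 0.86 ppb as a stand-in for $\sigma$.

```python
# Sketch: the Example 2 tail probability, treating s as an estimate of sigma.
from math import sqrt
from scipy.stats import norm

mu_claimed, xbar, s, n = 2.50, 2.63, 0.86, 40

sd_xbar = s / sqrt(n)                        # estimated sd of the sampling distribution of x-bar
p_value = 1 - norm.cdf(xbar, mu_claimed, sd_xbar)

print(f"P(x-bar >= 2.63 | mu = 2.50) = {p_value:.4f}")   # about 0.17
```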
Sampling Distributions of Sums or Totals

You can see from Example 1 above that questions about sums or totals are very closely related to questions about averages. We state here a somewhat restricted result that will form the basis for some later work.

Consider a set of random variables $X_1, X_2, \ldots, X_n$. These might be the values of n randomly selected items from a population forming a random sample, for example. Then we can form a new random variable Y in the following way:

$$Y = a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$$

where $a_1, a_2, \ldots, a_n$ are just constants chosen for some particular reason. (For instance, if $X_1, X_2, \ldots, X_n$ are the n observations comprising a random sample of size n, then choosing each of the $a_k$'s to be 1/n would make Y the sample mean, $\bar{x}$.) Then, the central limit theorem says the following:

(i.) If the individual $X_k$'s are normally distributed or approximately normally distributed, the random variable Y will be normally distributed or approximately normally distributed. Y will still be approximately normally distributed as long as those $X_k$'s which are not normally distributed do not dominate.

(ii.) $\mu_Y = a_1 \mu_{X_1} + a_2 \mu_{X_2} + \cdots + a_n \mu_{X_n}$

(iii.) If the $X_k$'s are independent (which is true if they are elements of random samples), then $\sigma_Y^2 = a_1^2 \sigma_{X_1}^2 + a_2^2 \sigma_{X_2}^2 + \cdots + a_n^2 \sigma_{X_n}^2$

This looks pretty gruesome, but it covers an enormous amount of territory. For example, by writing

$$Y = X_1 - X_2, \qquad \mu_Y = \mu_{X_1} - \mu_{X_2}, \qquad \sigma_Y^2 = \sigma_{X_1}^2 + \sigma_{X_2}^2$$

(choose $a_1$ = 1 and $a_2$ = -1 in a sum involving just $X_1$ and $X_2$), we get an approach for comparing two populations. Selecting an item from population #1 gives a value of $X_1$. Selecting an item from population #2 (independently) gives a value of $X_2$. Then $\mu_Y$ would be the difference of the means of the two populations, a quantity that is of great interest in many comparative studies. We will expand on this application in detail shortly in the course.

Example 3: Answer the question in Example 1 again, but this time using the principles described in this section, rather than an approach based on the sample mean.

Solution: We just let $X_1$ stand for the mg of active ingredient in the first tablet selected, $X_2$ for the mg of active ingredient in the second tablet selected, and so on for the remaining three tablets. Then the question asks for $\Pr(740 \le Y \le 760)$, where $Y = X_1 + X_2 + X_3 + X_4 + X_5$.

From the principles stated in this section, we know that Y must be approximately normally distributed, because each of the $X_k$'s is approximately normally distributed. Further, since the formula for Y here fits the pattern above with $a_1 = a_2 = a_3 = a_4 = a_5 = 1$, we get

$$\mu_Y = \mu_{X_1} + \mu_{X_2} + \mu_{X_3} + \mu_{X_4} + \mu_{X_5} = 150 + 150 + 150 + 150 + 150 = 750 \text{ mg}$$

and

$$\sigma_Y^2 = \sigma_{X_1}^2 + \sigma_{X_2}^2 + \sigma_{X_3}^2 + \sigma_{X_4}^2 + \sigma_{X_5}^2 = 8.3^2 + 8.3^2 + 8.3^2 + 8.3^2 + 8.3^2 = (5)(8.3^2)$$

so that

$$\sigma_Y = \sqrt{(5)(8.3^2)} = \sqrt{5}\,(8.3) = 18.559 \text{ mg}$$

Thus,

$$\Pr(740 \le Y \le 760) = \Pr\left(\frac{740 - 750}{18.559} \le z \le \frac{760 - 750}{18.559}\right) = \Pr(-0.54 \le z \le 0.54) = 0.4108$$

as we obtained in Example 1.
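Rules (ii) and (iii) are easy to package as a small helper. The sketch below is purely illustrative (the function name linear_combo_params is an invented choice, not anything from the handout); it reproduces the Example 3 values and also handles the difference $Y = X_1 - X_2$ mentioned above.

```python
# Sketch: mean and sd of Y = a1*X1 + ... + an*Xn for independent Xk
# (an illustration of rules (ii) and (iii); names and values are illustrative).
from math import sqrt

def linear_combo_params(a, means, sds):
    """Return (mu_Y, sigma_Y) for Y = sum(a[k] * X[k]) with independent X[k]."""
    mu_y = sum(ak * mk for ak, mk in zip(a, means))
    var_y = sum(ak**2 * sk**2 for ak, sk in zip(a, sds))
    return mu_y, sqrt(var_y)

# Example 3: Y = X1 + ... + X5, each tablet with mean 150 mg and sd 8.3 mg
print(linear_combo_params([1] * 5, [150.0] * 5, [8.3] * 5))      # (750.0, about 18.56)

# Difference of two independent observations: Y = X1 - X2
print(linear_combo_params([1, -1], [150.0, 150.0], [8.3, 8.3]))  # (0.0, about 11.74)
```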
Sampling Distribution of the Proportion

Let p be the proportion of elements in a random sample which share some characteristic of interest, and let $\pi$ be the corresponding proportion of the target population. Different random samples will of course result in a distribution of values of p. That distribution has the following characteristics:

(i.) $\mu_p = \pi$ (that is, the mean of the sampling distribution is equal to the population proportion)

(ii.) $\sigma_p = \sqrt{\dfrac{\pi(1 - \pi)}{n}}$

(iii.) when the sample size, n, is sufficiently large, the distribution of p is approximately normal.

There are several conventions for deciding when n is "sufficiently large" for this to be true. In this course we prefer the conditions

$$n\pi \ge 5 \quad \text{and} \quad n(1 - \pi) \ge 5$$

Another (but somewhat more awkward) condition is that the interval from $p - 2\sigma_p$ to $p + 2\sigma_p$ be entirely within the interval 0 to 1 (see Mendenhall). If these conditions are not met, then it is more appropriate to work with the binomial distribution. We will not go into the details of that case in this course.

As in the case of means, these properties allow us to establish a connection between observed values of p and the value of the corresponding population proportion, $\pi$. We will take that issue up shortly. However, there are some more direct applications of these results, which we will illustrate by example.

Example 4: Fruit is received at a food processing plant in large truckloads. Each truckload is transferred into a specific holding bin, and during the transfer 250 individual fruits are selected at random for detailed grading. If at least 92% of these selected fruit satisfy specifications, the entire truckload is used in the preparation of whole fruit products. However, if fewer than 92% of these selected fruit satisfy specifications, the entire truckload is used for juice production and the supplier receives a considerably smaller price for the truckload. What is the probability that a truckload which is really 95% in conformance with the specifications will end up being used for juice?

Solution: First, sort out the numbers in the problem. We are to assume that 95% is the true proportion of conforming fruit in the truckload (which is the population in this problem), and so we identify this as $\pi$ = 0.95. The truckload will be used for juice if the proportion, p, of the sample of 250 fruit which satisfy specifications is less than 0.92. Thus,

Pr(truckload is used for juice) = Pr(p < 0.92 when $\pi$ = 0.95 and n = 250)

We can use the z-table to compute this probability, since $n\pi$ = (250)(0.95) = 237.5 > 5 and $n(1 - \pi)$ = (250)(0.05) = 12.5 > 5. So p is approximately normally distributed with a mean of $\pi$ = 0.95 and a standard deviation of

$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{(0.95)(0.05)}{250}} = 0.013784$$

Thus,

$$\Pr(p < 0.92) = \Pr\left(z < \frac{0.92 - 0.95}{0.013784}\right) = \Pr(z < -2.18) = 0.0146$$

Thus, the probability that such a good truckload will be mistakenly used for juice as a result of sampling error is quite small: 1.46%.
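Here is a minimal sketch of the Example 4 calculation, assuming scipy is available (its normal CDF replaces the z-table).

```python
# Sketch: probability that a 95%-conforming truckload is sent to juice (Example 4).
from math import sqrt
from scipy.stats import norm

pi, n, cutoff = 0.95, 250, 0.92

sigma_p = sqrt(pi * (1 - pi) / n)            # sd of the sample proportion
prob_juice = norm.cdf(cutoff, pi, sigma_p)   # P(p < 0.92) under the normal approximation

print(f"sigma_p = {sigma_p:.6f}, P(juice) = {prob_juice:.4f}")   # about 0.015
```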
Example 5: A biotechnologist wishes to determine the proportion of a population of some organism which has a certain genetic mutation. She has some preliminary information to the effect that the mutation is present in about one out of every 50 members of that population. How large a sample would she have to examine in order to be justified in assuming that the sample proportion is approximately normally distributed?

Answer: The preliminary information indicates that $\pi \approx$ 1/50 or 0.02. The conditions for approximate normality are that $n\pi \ge 5$ and $n(1 - \pi) \ge 5$. Since $\pi$ is very nearly zero, $1 - \pi$ is very nearly 1, so the second condition will be met as long as n is a bit bigger than 5 (which is a fairly small sample size for this kind of work). The first condition will be harder to satisfy. In fact, it appears that we need something like

$$n\pi = n \times 0.02 \ge 5 \quad \Longrightarrow \quad n \ge 5/0.02 = 250$$

Thus, it would appear that she should plan to collect a random sample of at least approximately 250 organisms if she wants to be justified in assuming the sample proportion is approximately normally distributed.

The Sampling Distribution of the Variance

There are situations in which the value of the population variance is of interest as a way of assessing the degree of uniformity of the population. This prompts interest in the distribution of sample variances. It turns out that if the population is approximately normally distributed, then the following quantity has the so-called chi-squared (or $\chi^2$) distribution:

$$\chi^2 = \frac{(n - 1)s^2}{\sigma^2}$$

The $\chi^2$ distribution is one of those standard distributions that arise in a number of areas of application. Unlike the normal distribution, its probability density function is not symmetric, and it is not defined for negative values of $\chi^2$. In fact, there is a family of such density functions, distinguished by a whole number, $\nu = n - 1$, called the number of degrees of freedom. One typical example is shown in the figure to the right. Given the value of $\nu$ and specially prepared tables of $\chi^2$ probabilities, we can calculate probabilities for $s^2$, and ultimately make statements about the value of $\sigma^2$. We will return to this topic later in the course.

The Finite Population Correction

The formulas given so far have all been based on the assumption that the sample of size n is being selected from what is essentially an infinite-sized population. When sampling is done without replacement, and the sample amounts to 5% or more of the population, it is necessary to employ the so-called finite population correction in computing the standard deviation of the sampling distribution of the mean:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$$

where n is the sample size and N is the population size. (When sampling is done with replacement, the population size is effectively infinite.) When n is a very small fraction of N, the ratio (N - n)/(N - 1) is essentially 1, and so this correction factor has little effect. Thus, when n = 0.05N, we have

$$\sqrt{\frac{N - n}{N - 1}} = \sqrt{\frac{N - 0.05N}{N - 1}} = \sqrt{\frac{0.95N}{N - 1}} = \sqrt{0.95}\,\sqrt{\frac{N}{N - 1}} \approx 0.975\,\sqrt{\frac{N}{N - 1}}$$

For relatively large values of N, the square root in the final form is essentially 1, meaning that the overall factor is approximately 0.975, not much different from 1. It is not too common to encounter truly finite-size populations in technical applications of probability and statistics.
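To get a feel for how small the correction typically is, here is a minimal sketch (the values of $\sigma$, n, and N are invented for illustration) comparing $\sigma_{\bar{x}}$ with and without the finite population correction when the sample is 5% of the population.

```python
# Sketch: standard deviation of x-bar with and without the finite population
# correction, for a hypothetical population (illustrative values only).
from math import sqrt

def sd_xbar(sigma, n, N=None):
    """sigma/sqrt(n), multiplied by the finite population correction if N is given."""
    sd = sigma / sqrt(n)
    if N is not None:
        sd *= sqrt((N - n) / (N - 1))
    return sd

sigma, n, N = 8.3, 50, 1000            # the sample is 5% of the population
print(f"without correction: {sd_xbar(sigma, n):.4f}")
print(f"with correction:    {sd_xbar(sigma, n, N):.4f}")   # smaller by a factor of about 0.975
```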