here - BCIT Commons

advertisement
MATH 2441
Probability and Statistics for Biological Sciences
Sampling and Sampling Distributions
The Story So Far …
Remind yourself of several major ideas raised so far in the course.
First, the goal of statistics is to develop methods for describing populations. Almost always we are
prevented from studying the entire population of interest in detail, and so must rely on information available
in a random sample drawn from that population.
statistical inference
Population
Random
Sample
probability theory
For finite size populations, a random sample is defined to be one in which every element of the population
has equal chance to be included. For infinitely large populations, samples are random if the likelihood of
selecting one item of the population is unaffected by whichever items have already been selected.
Populations are described by parameters  a few numbers that summarize some general characteristics of
the population. These are fixed numbers (that is, not variables) for that population. However, because we
can't study the entire population exhaustively, we have no way of measuring the value of a parameter
directly. So far in the course, we've looked at a number of population parameters:


2


 0,  1
the arithmetic mean of the population
the standard deviation of the population
the variance of the population
a population proportion
a population correlation coefficient
the intercept and slope of the population regression line
Recall that we always use Greek symbols for population parameters.
Analogous descriptive numbers for samples are called statistics. Sample statistics are random variables,
since their values depend on precisely which elements of the population become one of the randomly
selected elements of the sample. Of course, when we select a random sample, the value of each of these
sample statistics can be determined as precisely as we wish, limited only by the precision of our
observational equipment. Among the sample statistics we've looked at are
x, ~
x
s
s2
p
r
b 0, b 1
the arithmetic mean and the median of the sample
the standard deviation of the sample
the variance of the sample
a sample proportion
a sample correlation coefficient
the intercept and slope of the sample regression line
By convention, sample statistics are always denoted by letters of our alphabet.
With this information, we can state the goal of the discipline of statistics in a bit more detail. It is our
intention to develop methods which will enable us to:
David W. Sabo (1999)
Sampling and Sampling Distributions
Page 1 of 8
(i.)
compute estimates of values of population parameters using the observed values of
appropriate statistics for random samples of that population (statistical estimation),
(ii.)
to evaluate the degree to which observed values of appropriate statistics support
statements made about values of population parameters (hypothesis testing)
Achieving these goals is not a straightforward task because we need to work with limited information. A
random sample which is just a small part of a population cannot represent all of the characteristics of that
population precisely.
Since sample statistics are random variables, they have a probability distribution. Probability distributions
for sample statistics are called sampling distributions. Sampling distributions would reflect the possible
values of the sample statistics and how they are clustered or spread out, and so will depend on the value of
population parameters. (For example, we would expect the distribution of arithmetic means of random
samples of a population to be influenced by the arithmetic mean of that population.) As a result, sampling
distributions provide the link between values of sample statistics and values of population parameters that
we need to achieve the two goals described in more detail just above.
In the remainder of this document, we will just summarize the basic facts about several important types of
sampling distributions, and look at a few immediate applications in some instances.
The Sampling Distribution of the Mean
It is necessary to distinguish between two populations here. First, there is the original target population
which we are studying. This might be the population of all residents of Canada, it might be the population of
all apple trees in Canada, or it might be the population of all cans to tomato soup produced by a company.
This target population has a mean, , and a standard deviation, .
Now, think of the set of all of the samples of a specific size n that can be selected from that target
population. Each of these samples has a mean value, x (and a sample standard deviation, s  but we'll
deal with s a little later). The set of these x values for all the possible samples forms a population, which
has a mean value,  x , and a standard deviation,  x . Notice that we have used the symbols  and  here
because these really are population parameters. The subscript, x , is used to indicate that this  x and  x
are parameters of the population of x values of all possible samples of size n that can be selected from the
original target population.
Then, a very fundamental mathematical result from probability theory (the so-called Central Limit Theorem)
says the following:
(i.)
 x = , the population mean. That is, the mean of the target population, and the mean of
the population of possible x values is the same.
(ii.)
x 

, expressing the relationship between the standard deviation of the sampling
n
distribution, and the standard deviation of the target population.
and most importantly
(iii.)
if the target population is normally distributed, then these x values will also be normally
distributed. If the target population is not normally distributed, then these x values will be
approximately normally distributed if the sample size n is large enough. As a rule of
thumb, you can assume "large enough" here means 30 or more.
When point (iii) applies, points (I) - (iii) give us the shape, position, and information about the width of the
distribution of possible x values. This embodies as much as it is possible for us to know about the
relationship between an x value we may observe, and the value of  for the target population.
You may recall that part of this result was suggested by some rough experimental results reported in the
document "The Arithmetic Mean: Sampling Issues", which was presented just after our discussion of
Page 2 of 8
Sampling and Sampling Distributions
David W. Sabo (1999)
measures of central tendency in the descriptive statistics section of this course. There, we found that a
sampling of x values for increasing (but small) values of n seemed to give more bell-like relative frequency
histograms, centered approximately at the position of the population mean (which we knew in that case
because we had constructed the population ourselves). The histograms obtained there also seemed to be
narrowing as n got larger, reflecting point (ii) above.
The importance of this result is in providing ways to estimate  from values of x (and s, as it turns out).
However, there are a few useful applications in situations in which it is plausible to talk about knowing the
value of  and  and using that information to solve a problem involving x . We'll illustrate with several short
examples.
Example 1: Suppose a company produces tablets containing an active medicinal ingredient. Because of
unavoidable variations in various aspects of the production process, not every tablet contains exactly the
same amount of this compound. In fact, extensive studies show that the amount of ingredient in each tablet
is an approximately normally distributed random variable with a mean of 150 mg and a standard deviation of
8.3 mg. If an experimental subject is given five such tablets, what is the probability that they will have
received between 740 and 760 mg of the active ingredient?
Solution
This problem speaks of the population of all such tablets, indicating that it is approximately normally
distributed, and that for this population,  = 150 mg and  = 8.3 mg. This information will allow us to
calculate the probability of that the amount of active ingredient in an individual tablet falls within a certain
range of values. Thus, for example, if we define
x = mg of active ingredient in a randomly selected tablet of this type
then, we can compute
152  150 
 148  150
Pr(148  x  152 )  Pr 
z
  Pr(  0.24  z  0.24 )
8.3
8 .3


= 2 Pr(0  z  0.24) = 2 x 0.0948 = 0.1896.
Thus, just under 20% of the tablets will contain between 148 mg and 152 mg of the active ingredient.
The only thing wrong with this calculation is that it doesn't answer the question that was asked! The
question was not about the amount of active ingredient in one tablet, but about the amount of active
ingredient in 5 tablets. However, if you think about it a minute, you'll realize that asking for the probability
that 5 tablets contain between 740 mg and 760 mg of active ingredient is the same thing as asking for the
probability that the mean amount of active ingredient in a sample of 5 tablets will be between
740/5 = 148 mg and 760/5 = 152 mg. The central limit theorem tells us that the distribution of such an x :


should be approximately normal, because the target population is approximately normal
should have a mean value of 150 mg (because the target population has  = 150 mg)

should have a standard deviation of 8.3 / 5  3.712 mg (that is  / n , where  = 8.3 mg
is the standard deviation of the target population
Thus,
152  150 
 148  150
Pr(148  x  152 )  Pr 
z
  Pr( 0.54  z  0.54 )  0.4108
3.712 
 3.712
Thus, the probability that the five tablets will have totalled between 740 mg and 760 mg of active ingredient
is 0.4108.
David W. Sabo (1999)
Sampling and Sampling Distributions
Page 3 of 8
(Notice the difference between the probabilities for x and for x obtained here. There is a considerably
higher probability that the value of x for a set of 5 tablets will fall within the range 148 mg to 152 mg about
the target population mean of 150 mg, than there is that any one of the five tablets themselves will fall within
that range. This is an example of a very useful feature of the random sampling process: means of samples
cluster much closer to the population mean than do the values of the elements of the target population
themselves. This shows up in the formulas in  x 

, the standard deviation of the sampling distribution
n
is small by a factor of n than the standard deviation of the target population. It is in this property of the
sampling distribution that value of taking larger rather than smaller samples shows up. In case we forget to
mention this again (and again, and again… ) it is very important to notice that the value of  x depends on
the sample size and not at all on the population size (but, see the note below on sampling from finite
populations!).

Example 2: A food processing company claims that their canned clams contain less than 2.50 ppb (parts
per billion) of a certain pollutant. A food technology student who is rather concerned about this takes tissue
specimens from 40 such canned clams and analyzes them for the presence of the pollutant. For that
sample of 40 clams, she obtains a mean concentration of 2.63 ppb and a standard deviation of 0.86 ppb. Is
this proof that the company's claim is false?
Solution:
First, we need to sort out the numbers here a bit. The company has claimed that the average concentration
of this pollutant in all of their canned clams is less than 2.5 ppb. In symbols, their claim is that  < 2.50 ppb,
where  stands for the mean pollutant concentration of all of the canned clams they produce. The
technologist took a sample of this population: the sample had size n = 40, a mean value x = 2.63 ppb, and
a standard deviation s = 0.86 ppb.
Now, the technologist did observe x = 2.63 which is a number bigger than 2.50, the maximum value of the
mean claimed by the company. However, because this value of x resulted from a random sample of the
population, we can't immediately interpret its value as invalidating the company's claim. The central limit
theorem indicates that if  was exactly equal to 2.50 ppm, the values of x for random samples drawn from
that population would vary to some extent  some being smaller than 2.50 ppb, while others would be
larger than 2.50 ppb, and this would still be so even if  was somewhat less than 2.50 ppb.
However, there is one result that would tend to make us suspect the company's claim. If it turned out to be
very improbable to observe a sample of size 40 with a mean x = 2.63 or bigger when it was really true that
 = 2.50 ppb, then we would be justified in doubting their claim. The technologist's clear observation would
then be surprising if the company's claim were true, but since the technologist's observation was sound, the
apparent inconsistency can only be reconciled by casting suspicion on the company's claim.
In this case,



2.63  2.50 
Pr( x  2.63 )  Pr  z 




40 

To compute this exactly, we need a value of , which we don't have. The best we can do is assume that the
value s = 0.86 ppb is a reasonable estimate of . Doing this, we get






2.63  2.50 
2.63  2.50 
Pr( x  2.63 )  Pr  z 
  Pr  z 
  Pr( z  0.96 )  0.1685

0.86




40 
40 


Page 4 of 8
Sampling and Sampling Distributions
David W. Sabo (1999)
Thus, assuming the company's claim is correct, there is a probability of nearly 17% of observing a result as
contradictory as the one the technologist actually did observe. This probability is too high to conclude that
this value of x is inconsistent with the company's claim. These experimental results do not invalidate the
company's claim.
(What sort of answer would we have needed to be able to state with confidence that the company's claim
appears doubtful? It depends. For most routine work, a probability above of 0.05 or less would be
considered by most practitioners to be adequate grounds for concluding against the company. What we
have calculated here is the probability that we are wrong if we use the technologist's results to declare the
company's claim to be unfounded. The more serious the consequence of such a mistake on our part, the
smaller we'd want that number to be before sticking our necks out! Much more will be said about this when
we deal with the topic of hypothesis testing shortly.)

Sampling Distributions of Sums or Totals
You can see from Example 1 above that questions about sums or totals are very closely related to questions
about averages. We state here a somewhat restricted result that will form the basis for some later work.
Consider a set of random variables: X1, X2, …, Xn. These might be the values of n randomly selected items
from a population which are forming a random sample, for example. Then, we can form a new random
variable Y in the following way:
Y = a1X1 + a2X2 + … + anXn
where a1, a2,… an are just constants chosen for some particular reason. (For instance, if X1, X2, …, Xn are
the n observations comprising a random sample of size n, then choosing each of the ak's as 1/n would make
Y the sample mean, x .) Then, the central limit theorem says the following:
(i.)
If the individual Xk's are normally distributed or approximately normally distributed, the
random variable Y will be normally distributed or approximately normally distributed. Y will
still be approximately normally distributed as long as those Xk's which are not normally
distributed do not dominate.
(ii.)
 Y  a1 X1  a 2  X2    an  Xn
(iii.)
If the Xk's are independent (which is true if they are elements of random samples), then
 Y 2  a12 X1 2  a 2 2 X2 2    an 2 Xn 2
This looks pretty gruesome, but, it covers an enormous amount of territory. For example, by writing
Y = X1 - X2

 Y   X1   X2
and
 Y 2   X1 2   X2 2
(choose a1 = 1 and a2 = -1 in a sum involving just X1 and X2), we get an approach for comparing two
populations. Selecting an item from population #1 gives a value of X1. Selecting an item from population #2
(independently) gives a value of X2. Then, Y would be the difference of the means of the two populations
 a quantity that is of great interest in many comparative studies. We will expand on this application in
detail shortly in the course.
Example 3: Answer the question in Example 1 again, but this time using the principles described in this
section, rather than using an approach based on the sample mean.
Solution
We just let X1 stand for the mg of active ingredient in the first tablet selected, X2 for the mg of active
ingredient in the second tablet selected, and so on for the remaining three tablets. Then the question asks
for
David W. Sabo (1999)
Sampling and Sampling Distributions
Page 5 of 8
Pr(740  Y  760)
where
Y = X1 + X2 + X3 + X4 + X5 .
From the principles stated in this section, we know that Y must be approximately normally distributed
because each of the Xk's are approximately normally distributed. Further, since the formula for Y here fits
the pattern above with a1 = a2 = a3 = a4 = a5 = 1, we get that
 Y   X1   X2   X3   X4   X5  150  150  150  150  150  750 mg
and
 Y 2   X1 2   X1 2   X1 2   X1 2   X1 2  8.3 2  8.3 2  8.3 2  8.3 2  8.3 2  (5)(8.3 2 )
so that
 Y  (5)(8.3 2 )  5  (8.3)  18.559 mg
Thus,
760  750 
 740  750
Pr( 740  Y  760 )  Pr 
 z
  Pr( 0.54  z  0.54 )  0.4108
18
.
559
18 .559 

as we obtained in Example 1.

Sampling Distribution of the Proportion
Let p be the proportion of elements in a random sample which share some characteristic of interest, and let
 be the corresponding proportion of the target population. Different random samples will of course result in
a distribution of values of p. That distribution has the following characteristics:
(i.)
 p   (that is, the mean of the sampling distribution is equal to the population
proportion.)
 1   
(ii.)
p 
(iii.)
when the sample size, n, is sufficiently large, the distribution of p is approximately normal.
There are several conventions for deciding when n is "sufficiently large" for this to be true.
In this course we prefer the conditions
n
n  5
and
n(1 - )  5
Another (but somewhat more awkward condition) is the condition that the interval
 p  2 p to  p  2 p be entirely within the interval 0 to 1 (see Mendenhall). If these
conditions are not met, then it is more appropriate to work with the binomial distribution.
We will not go into the details of that case in this course.
As in the case of means, these properties allow us to establish a connection between observed values of p
and the value of the corresponding population proportion, . We will take that issue up shortly. However,
there are some more direct applications of these results, which we will illustrate by example.
Page 6 of 8
Sampling and Sampling Distributions
David W. Sabo (1999)
Example 4: Fruit is received at a food processing plant in large truckloads. Each truckload is transferred
into a specific holding bin, and during the transfer 250 individual fruits are selected at random for detailed
grading. If at least 92% of these selected fruit satisfy specifications, the entire truckload is used in the
preparation of whole fruit products. However, if fewer than 92% of these selected fruit satisfy specifications,
the entire truckload is used for juice production and the supplier receives a considerably smaller price for the
truckload. What is the probability that a truckload which is really 95% in conformance with the specifications
will end up being used for juice?
Solution
First, sort out the numbers in the problem. We are to assume that the 95% is the true proportion of the
truckload (which is the population in this problem), and so we identify this as  = 0.95. The truckload will be
used for juice if the proportion, p, of the sample of 250 fruit which are non-conforming is less than 0.92.
Thus,
Pr(truckload is used for juice)
= Pr( p < 0.92 when  = 0.95 and n = 250)
We can use the z-table to compute this probability since n = (250)(0.95) = 237.5 > 5 and n(1 - ) =
(250)(0.05) = 12.5 > 5. So, p is approximately normally distributed with a mean of  = 0.95 and a standard
deviation of
p 
 (1   )
n

(0.95 )( 0.05 )
 0.013784
250
Thus,
Pr(p  0.92 )  Pr( z 
0.92  0.95
)  Pr( z   2.18 )  0.0146
0.013784
Thus, the probability that such a good truckload will be mistakenly used for juice as a result of sampling error
is quite small  1.46%.

Example 5: A biotechnologist wishes to determine the proportion of a population of some organism which
has a certain genetic mutation. She has some preliminary information to the effect that the mutation is
present in about one out of every 50 members of that population. How large a sample would she have to
examine in order to be justified in assuming that the sample proportion is approximately normally
distributed?
Answer
The preliminary information indicates that   1/50 or 0.02. The conditions for approximate normality are
that n > 5 and n(1 - ) > 5. Since  is very nearly zero, 1 -  is very nearly 1, so the second condition will
be met as long as n is a bit bigger than 5 (which is a fairly small sample size for this kind of work). The first
condition will be harder to satisfy. In fact, it appears that we need something like
n  n x 0.02 > 5

n > 5/0.02 = 2500.
Thus, it would appear that she should plan to collect a random sample of at least approximately 2500
organisms if she wants to be justified in assuming the sample proportion is approximately normally
distributed.

The Sampling Distribution of the Variance
There are situations in which the value of the population variance is of interest as a way of assessing the
degree of uniformity of the population. This prompts interest in the distribution of sample variances.
It turns out that if the population is approximately normally distributed, then the following quantity has the socalled chi-squared (or 2 ) distribution:
David W. Sabo (1999)
Sampling and Sampling Distributions
Page 7 of 8
2 
(n  1)s 2
2
The 2 - distribution is one of those standard
distributions that arises in a number of areas of
application. Unlike the normal distribution, its
probability density function is not symmetric and is not
defined for negative values of 2. In fact, there is a
family of such density functions, distinguished by a
whole number,  = n - 1, called the number of
degrees of freedom. One typical example is shown in
the figure to the right. Given the value of  and
specially prepared tables of 2 probabilities we can
calculate probabilities for s2, and ultimately make
statements about the value of 2 .
We will return to this topic later in the course.
The Finite Population Correction
The formulas given so far have all been based on the assumption that the sample of size n is being selected
from what is essentially an infinite-sized population. When sampling is done without replacement, and the
sample amounts to 5% or more of the population, then it is necessary to employ the so-called finite
population correction in computing the variance of the sampling distribution of the mean:
x 

n

Nn
N1
where n is the sample size, and N is the population size. (When sampling is done with replacement, the
population size is effectively infinite.)
When n is a very small fraction of N, the ratio (N-n)/(N-1) is essentially 1, and so this correction factor has
little effect. Thus, when n = 0.05N, we have
Nn
N  0.05N


N 1
N1
0.95N
N
N
 0.95
 0.975 
N1
N1
N1
For relatively large values of N, the square root in the final form is essentially 1, meaning that the overall
factor is approximately 0.975  not much different than 1.
It is not too common to encounter truly finite size populations in technical applications of probability and
statistics.
Page 8 of 8
Sampling and Sampling Distributions
David W. Sabo (1999)
Download