U.B.C. BIOLOGY 300

Biology 300
Lab Exercise # 3
5. DISCRETE PROBABILITY DISTRIBUTIONS
Probability distributions may be discrete or continuous. This week we will examine two
common discrete distributions: the binomial and Poisson. We will use JMPin to generate
random samples from these distributions and explore their characteristics.
Binomial
The binomial distribution is one of the most commonly encountered discrete probability
distributions in biology. It is based on nominal scale data that come from a population with
only two categories. One of the two categories is arbitrarily referred to as a "success" and the
other a "failure" (based on which you are looking for) and these categories are mutually
exclusive (e.g. male and female, black and white, left and right). The process of selecting an
individual at random from the population is called a trial. The probability of a success
remains constant from trial to trial. Furthermore, the outcome of any particular trial is not
affected by the outcome of any other trial (i.e. trials are independent). The computer
calculates the terms of the binomial distribution automatically using the equation:
$$P(X) = \frac{n!}{X!\,(n-X)!}\, p^{X} q^{n-X}$$
In this equation P(X) is the probability of seeing X successes, n is the number of trials, p is
the proportion of total occurrences that are successes, and q is the proportion that are failures
(since every trial is either a success or a failure, q = 1 - p). Another term commonly encountered in
binomial calculations is N, the number of observations.
The binomial distribution has several convenient mathematical features. First, the variance
(σ²) = n × p × q (the standard deviation is the square root of this). Second, the mean (μ) = n ×
p. Third, for large numbers of trials (perhaps n > 25), the shape of the binomial distribution is
very similar to a normal distribution, allowing us to estimate probabilities without repeatedly
using the cumbersome formula given above.
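As a minimal sketch of these calculations outside JMPin, the Python snippet below evaluates the binomial probabilities for a litter of 10 with p = 0.5 and checks the mean and variance shortcuts against the probabilities themselves. It assumes the standard binomial formula, P(X) = n!/(X!(n-X)!) × p^X × q^(n-X); the function name `binomial_pmf` is illustrative and not part of any course software.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n, p = 10, 0.5  # litter size and probability of a male (illustrative values)
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Shortcut formulas from the text
mean = n * p
variance = n * p * (1 - p)
sd = sqrt(variance)

# The same quantities computed directly from the probabilities, for comparison
mean_from_pmf = sum(x * px for x, px in enumerate(probs))
var_from_pmf = sum((x - mean_from_pmf) ** 2 * px for x, px in enumerate(probs))

print(f"P(X = 10) = {probs[10]:.6f}")
print(f"shortcut: mean = {mean}, variance = {variance}, sd = {sd:.4f}")
print(f"from pmf: mean = {mean_from_pmf:.4f}, variance = {var_from_pmf:.4f}")
```

The two routes to the mean and variance should agree to within floating-point rounding, which is a quick way to confirm that the formula has been entered correctly.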
Poisson
Another discrete probability distribution commonly encountered in biology is the Poisson
distribution. This distribution is important in describing random occurrences of objects in
space or events in time. The occurrence of an object or event is assumed to have no effect on
the probability of a second occurrence of the same object or event (i.e. objects or events are
independent). The formula for calculating the probability of X occurrences is:
$$P(X) = \frac{e^{-\mu}\,\mu^{X}}{X!}$$
In this equation, P(X) is the probability of seeing X occurrences, μ is the mean number of
occurrences, and e is a constant (2.718..., the base for natural logarithms). An interesting
property of the Poisson distribution is that the mean and variance of the number of events per
interval are equal. Thus for a Poisson distribution, the variance to mean ratio, often called the
coefficient of dispersion (s²/x̄) = 1, which is an indication that events are randomly spaced.
If this is not the case, the distribution is not a Poisson. A uniform distribution, for instance,
has a variance/mean ratio less than 1, while a clumped distribution produces a ratio greater
than 1.
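The equality of mean and variance can be checked numerically. The sketch below assumes the standard Poisson formula, P(X) = e^(-μ) × μ^X / X!, and uses a mean of 0.5 purely as an illustration; the function name `poisson_pmf` is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean count is mu."""
    return exp(-mu) * mu**x / factorial(x)

mu = 0.5  # illustrative mean number of occurrences per interval
# Terms beyond x = 50 are negligible for this mean, so the sums stop there
xs = range(51)
probs = [poisson_pmf(x, mu) for x in xs]

mean = sum(x * px for x, px in zip(xs, probs))
variance = sum((x - mean) ** 2 * px for x, px in zip(xs, probs))
ratio = variance / mean  # coefficient of dispersion

print(f"P(0) = {probs[0]:.4f}, P(1) = {probs[1]:.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}, ratio = {ratio:.4f}")
```

A ratio computed from real count data will not be exactly 1 even when events are random, because a sample variance and sample mean both vary from sample to sample.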
When the number of trials is large and p is very small, the binomial distribution is very
similar in shape to a Poisson distribution with mean n × p. When the mean is large, the normal
distribution approximates the Poisson distribution.
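To see how close the binomial and Poisson distributions become, the following sketch compares the two sets of probabilities for a large number of trials and a small p. The values n = 1000 and p = 0.002 are assumptions chosen only for illustration, and both function names are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean count is mu."""
    return exp(-mu) * mu**x / factorial(x)

n, p = 1000, 0.002  # many trials, rare success (illustrative values)
mu = n * p          # matching Poisson mean

# Print the two probabilities side by side for small counts
for x in range(6):
    print(f"X = {x}: binomial = {binomial_pmf(x, n, p):.5f}, "
          f"Poisson = {poisson_pmf(x, mu):.5f}")

# Largest absolute difference over the counts 0 to 20
max_diff = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, mu)) for x in range(21))
print(f"largest difference = {max_diff:.6f}")
```

Rerunning the comparison with a larger p (say 0.2) and the same mean should show the agreement deteriorating, which illustrates why the approximation is reserved for rare events.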
Using the Program
To experiment with these distributions you will need to use the calculator functions of
JMPin. Start by opening a new file (double click on the blank page icon), then set up four
columns. To format a column to illustrate a probability distribution, select the column, then
choose column info from the pull-down menu for columns. Go to data source and choose
formula. When you exit from this information window the program will open up a new
window, the calculator. This platform allows you to create complex formulas to produce the
data for a variable. For today’s exercise we will concentrate on the random number
generating functions of this calculator.
In the calculator window, click on random from the set of choices to the right of the keypad.
This will produce a set of terms in the far right window that will allow us to generate samples
from a wide variety of probability distributions.
Problems
1. Format column 1 as a binomial distribution by choosing binomial after you have selected
the random number generator (it’s probably worth changing the column name to binomial
just for clarity’s sake). As described above, two parameters, the number of trials and the
probability of success, characterize the binomial. The calculator requires these values to be
selected before it can generate a binomial distribution. Let’s assume that we wish to examine
the distribution of male and female offspring in rats with litter sizes of 10. Click to highlight
the first empty square in the formula window. This is where you select the number of trials.
For this example choose 10 (the number of offspring in each litter). Select the second empty
box, the one for probability. Again, for this example, where males and females are equally
likely, set the probability of success to 0.5, then close the calculator window. If you now add
10 rows to your column you should obtain a random sample of 10 values from this particular
binomial distribution.
a) Plot a histogram and boxplots of the data in your binomial column and describe
the distribution using the terminology from last week (skewed, bimodal, uniform, normal
etc.). Does this sample appear to be symmetric? Does it have any outliers? Note the mean
and standard deviation of the distribution.
b) Delete the 10 rows you have in this data set and immediately re-add them. Are the
numbers the same as before? Why or why not? How much have the mean and standard
deviation changed?
c) Re-open the column info, access the calculator window and alter the probability
of success. For now let’s assume that males are more common in this strain of rats and
change the probability of success to 0.75 from its current value. Plot new histograms and
boxplots and describe the changes to the distribution of the sample from this new
binomial population. How are the mean and variance of the distribution changed? Write
down these values.
d) Add 9990 rows to the column. What effect does this have on the distribution
shape? On the sample mean and variance? Use the formulas from the introduction to this
lab exercise to calculate the true mean and standard deviation for a population with these
characteristics (10 trials or offspring = n, p: the probability of success = 0.75, and q: the
probability of failure = 0.25). How much do your calculated parameters differ from the
sample statistics for this set of 10000 values?
e) Compare the differences in parameter (true value) and statistics (sample estimate
of a parameter) for your sample of 10000 values to those from your previous sample with
only 10 values. Which sample size provides a more reliable estimate of population
parameters?
f) Under column info, edit the modeling type to convert the column into nominal
data (we’ve actually been cheating by leaving the data listed as continuous - this allowed
us to see values for mean and standard deviation). Change the probability of success back
to 0.5 to simplify hand calculation. The histogram window should now provide a table of
probabilities of obtaining specific outcomes for any number of successes from 0 to 10.
Compare the probability of obtaining 10 males out of a litter of 10 with the value you
hand calculate yourself using the binomial formula from the introduction to this lab.
2. Set up a new column and label it Poisson. As you did in the previous question, call up the
calculator window and format this column as a Poisson distribution by choosing Poisson
after you have selected the random function. In this case, we could be examining the number
of Asian Gypsy moths found in insect traps throughout the lower mainland. For this
distribution you must set one characteristic, the mean (equivalent to lambda in this instance).
Set this value to 0.5 for now, meaning that on average we found one insect in every second
trap.
a) Produce a histogram and boxplots for the distribution and describe their shape. Why
are there nothing but integer values for this distribution? Note the mean and standard
deviation for this set of values.
b) Temporarily reduce the number of rows to 10. A quick way to do this is to select row
11, scroll to the bottom of the column using the arrows on the side of the window, then
use shift-click to select all the rows from the 11th row to the end of the file and delete them. What effect
(if any) does this have on the shape of the distribution? Again, note the mean and
standard deviation for this distribution (do this for parts c and d as well).
c) Edit the formula for your Poisson distribution under column info, and change the mean
to 0.01 (and increase the number of rows back to 10000), then produce a new histogram
and boxplots and describe how the shape of the distribution has changed.
d) Increase the mean number of insects found per trap to 1.5 and describe the shape of the
curve. Increase it to 5 then 20 then 50 and describe what happens to the shape of the
distribution. What distribution is this starting to resemble?
e) Refer back to the means and standard deviations from parts a through d. What is the
approximate relationship between these values (remember that we are just sampling from
a Poisson, so these are only estimates of the true mean and standard deviation and will
vary randomly)? What is the relationship between mean and variance for these
distributions?
f) As with the binomial, we can’t directly calculate the true probabilities for a specific
number of successes. With 10,000 values in our sample, however, our estimates should
be quite accurate (if you haven’t increased the number of rows back to 10,000 do so
now). Convert the modeling type to nominal so that we will get a list of probabilities for
each number of successes. Set the mean to 1 for your Poisson distribution and use the
equation from the introduction to this exercise to calculate the probability of finding 1
gypsy moth in a trap. Compare this to the estimated probability from your sample. Repeat
this for the probability of finding no gypsy moths in a trap.
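For readers who want to repeat the sampling experiments outside JMPin, the sketch below is one possible Python equivalent. It is not JMPin's random number generator: it uses Python's standard `random` module with an arbitrary seed, draws binomial values by counting successes over individual trials, and draws Poisson values with Knuth's multiplication method. The function names `binomial_draw` and `poisson_draw` are hypothetical.

```python
import random
from math import exp
from statistics import mean, variance

rng = random.Random(300)  # arbitrary seed so the run is repeatable

def binomial_draw(n, p):
    """Count successes in n independent trials, each with success probability p."""
    return sum(rng.random() < p for _ in range(n))

def poisson_draw(mu):
    """Draw one Poisson count by Knuth's multiplication method (fine for small mu)."""
    limit = exp(-mu)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

# Problem 1: litters of 10 with P(male) = 0.75; true mean 7.5, true variance 1.875
small_binomial = [binomial_draw(10, 0.75) for _ in range(10)]
large_binomial = [binomial_draw(10, 0.75) for _ in range(10_000)]
for label, sample in (("10 rows", small_binomial), ("10000 rows", large_binomial)):
    print(f"binomial, {label}: mean = {mean(sample):.3f}, "
          f"variance = {variance(sample):.3f}")

# Problem 2: trap counts with mean 1; compare sample proportions with the formula
poisson_sample = [poisson_draw(1.0) for _ in range(10_000)]
prop_zero = poisson_sample.count(0) / len(poisson_sample)
prop_one = poisson_sample.count(1) / len(poisson_sample)
print(f"Poisson: mean = {mean(poisson_sample):.3f}, "
      f"variance = {variance(poisson_sample):.3f}")
print(f"P(0): sample = {prop_zero:.4f}, formula = {exp(-1.0):.4f}")
print(f"P(1): sample = {prop_one:.4f}, formula = {exp(-1.0):.4f}")
```

Changing the seed plays the same role as deleting and re-adding rows in JMPin: the individual values change, the small sample's statistics move noticeably, and the large sample's statistics stay close to the true parameters.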