Chapter 3 - San Jose State University

CHAPTER 3 : EXAMINING SPATIAL PATTERNS IN DISCRETE DATA
WITH THE BINOMIAL AND POISSON DISTRIBUTIONS.
PURPOSE: In this lab, you will learn how to generate Expected Probabilities and Expected
Frequencies using the Binomial and Poisson Distributions. These theoretical probability
distributions are used with discrete data to predict occurrences by chance alone when you are
interested in determining if there is a spatial pattern (Clumped or Uniform) or if the distribution is
stochastic.
Background
If the observed frequencies fit the expected frequencies (chance alone), you would conclude that the
observed distribution is stochastic (i.e. no pattern – see Figure 3-1). If the observed frequencies do
not fit the expected, then you would conclude that the observed distribution is either clumped or
uniform.
Patterns of distributions
Patterns in distributions occur when the observed frequencies do not fit the expected frequencies (see
Figure 3-1).
Clumped: A distribution is clumped when low values and high values of Y occur MORE frequently
than the expected while mid values of Y occur LESS frequently than expected. In these patterns,
individuals tend to form in groups. If animals needed a particular resource that had a patchy
distribution, their distribution would be clumped.
Uniform: A distribution is uniform when the reverse occurs. This type of distribution often occurs
when organisms are competing with each other and set up territories.
Figure 3-1: Spatial patterns and their frequency distributions. Panels: Clumped, No Pattern
(Stochastic), and Uniform. Each panel plots Frequency against # per quadrat.
Theoretical Probability Distributions (Binomial and Poisson)
Previously, we had to generate our own theoretical probability distribution (expected probabilities
and frequencies) by using probability rules to compute joint probabilities. There are several
situations in which someone has already worked out a method for generating a theoretical probability
distribution for certain circumstances. Both the Binomial and the Poisson distributions are theoretical
probability distributions that predict what would occur by chance alone (used to generate a stochastic
distribution) with discrete data but do so under different conditions:
The Binomial Distribution is used with a discrete variable for which there are two possible values
(levels or states) and when you always examine the same sample size (number of individuals or
sites). An example would be blue-eye color or not-blue-eye color measured in cases where we would
sample three people each time.
The Poisson Distribution is usually used with abundance data. You use this distribution with
discrete data when p is rare and q is not measurable. Abundance fits this category because you only
know when an individual is present, not when they are absent.
Binomial Distribution
Computing Expected Probabilities
The probability formulae for each possible outcome can be generated from the expansion of the
binomial model, $(p+q)^k$, where “p” is the probability for one state, “q” is the probability for the other
state and “k” is sample size.
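For reference, the expansion can be written out term by term. This is the standard binomial theorem; the index x (the number of occurrences of the first state) is the same x used for the coefficients in step 4, and the k = 2 case is the two-coin flip.

```latex
(p+q)^k \;=\; \sum_{x=0}^{k} \frac{k!}{x!\,(k-x)!}\; p^{x}\, q^{\,k-x},
\qquad \text{for example } (p+q)^2 = p^2 + 2pq + q^2 .
```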
There are 6 steps to computing expected probabilities for a binomial distribution:
1) Determine if the single probabilities, “p” and “q” are INTRINSIC or EXTRINSIC to the data.
2) Estimate probabilities (p and q) for single events for INTRINSIC probabilities.
3) Determine the exponents for “p” and “q” for the binomial model.
4) Determine the coefficients for the terms.
5) Compute the combined probabilities.
6) Compute expected frequencies and compare to observed frequencies
These steps are things you have already done when computing probabilities and expected
frequencies from the coin flips. However, for large problems, the binomial computations are easier.
We will first work through Steps 1-5 with an example for which you already know
the answers: the 2 coin flip from the first lab (Table 3-1).
Table 3-1: Probabilities for a 2 coin flip
| Number of Heads | Number of Tails | Number of ways this can happen | Probability formula | Probability can also be written as | Probability |
|---|---|---|---|---|---|
| 0 | 2 | 1 | q^2 | 1*p^0*q^2 | 0.25 |
| 1 | 1 | 2 | 2pq | 2*p^1*q^1 | 0.50 |
| 2 | 0 | 1 | p^2 | 1*p^2*q^0 | 0.25 |
Example 1: 2-Coin Flip - Binomial
1) Determine if the single probabilities, “p” and “q” are INTRINSIC or
EXTRINSIC to the data.
If “p” is already known, then the “p” parameter is EXTRINSIC to the data. For example, if the
measured variable was a coin flip, we know that the probability of getting a heads is 0.5 so the
probability is extrinsic. If this is the case, skip to step 3.
If “p” must be computed from the data, then the “p” parameter is INTRINSIC to the data.
2) Estimate probabilities (p and q) for single events for INTRINSIC probabilities.
Single event probabilities are extrinsic for the coin toss, so we can skip this step.
3) Determine the exponents for “p” and “q” for the binomial model.
This is just an expansion of what we learned about combined probabilities. Let’s look at this first
by using our 2 coin flip example. Let p = the probability of getting a heads and q= the probability
of getting a tails. Y would be the number of heads. There are three possible outcomes. The
exponents follow the values for the number of heads and tails for any one outcome.
Table 3-2: Determining the binomial exponents for the 2 coin example
| Outcome | Number of Heads with 2 coins (Y) | Number of Tails with 2 coins | Exponents |
|---|---|---|---|
| 1 | 0 | 2 | p^0 q^2 |
| 2 | 1 | 1 | p^1 q^1 |
| 3 | 2 | 0 | p^2 q^0 |
4) Determine the coefficients for the terms.
The coefficients are just the number of ways that an event can happen (combinations).
The formula for the coefficients (number of combinations) is $\frac{k!}{x!\,(k-x)!}$, where x is the
power of p for the term you want and k is the sample size. Remember 0! = 1 and 1! = 1.
Table 3-3: Determining the binomial coefficients for the 2 coin example
| Outcome | Number of Heads with 2 coins (Y) | Number of Tails with 2 coins | Coefficients (number of possible combinations) | Exponents |
|---|---|---|---|---|
| 1 | 0 | 2 | 2!/(0!(2-0)!) = (2*1)/(1*(2*1)) = 1 | p^0 q^2 |
| 2 | 1 | 1 | 2!/(1!(2-1)!) = (2*1)/(1*(1)) = 2 | p^1 q^1 |
| 3 | 2 | 0 | 2!/(2!(2-2)!) = (2*1)/(2*1*(1)) = 1 | p^2 q^0 |
5) Compute the combined probabilities.
The combined probabilities are computed by multiplying the coefficient by p and q raised to
the proper exponents (Table 3-4).
Table 3-4: Computing the combined probabilities for the 2 coin example
| Outcome | Number of Heads with 2 coins (Y) | Number of Tails with 2 coins | Coeff. | Exponents | Probability formula | Probability |
|---|---|---|---|---|---|---|
| 1 | 0 | 2 | 1 | p^0 q^2 | 1*p^0*q^2 = 1*0.5^0*0.5^2 | 0.25 |
| 2 | 1 | 1 | 2 | p^1 q^1 | 2*p^1*q^1 = 2*0.5^1*0.5^1 | 0.50 |
| 3 | 2 | 0 | 1 | p^2 q^0 | 1*p^2*q^0 = 1*0.5^2*0.5^0 | 0.25 |
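Steps 3 to 5 can also be checked with a few lines of code. The following is a minimal sketch, not part of the original lab: the function name `binomial_probabilities` is hypothetical, and it assumes Python 3.8 or later for `math.comb`, which returns the number of combinations k!/(x!(k-x)!).

```python
from math import comb

def binomial_probabilities(p, k):
    """Return [P(Y=0), ..., P(Y=k)] from the expansion of (p + q)^k."""
    q = 1.0 - p
    # Each term is coefficient * p^x * q^(k - x), where x is the count
    # of the state whose single-event probability is p.
    return [comb(k, x) * p**x * q**(k - x) for x in range(k + 1)]

# Two-coin flip: p = probability of heads = 0.5, k = 2 coins
coin_probs = binomial_probabilities(0.5, 2)
print(coin_probs)  # [0.25, 0.5, 0.25]
```

The three values match the Probability column for 0, 1 and 2 heads.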
Example 2: Contagious Disease - Binomial
Problem: Let’s assume that we are trying to determine if a disease is transmitted by contact. If that
were true, the distribution would have a clumped pattern. Let’s assume that we randomly select 400
groups of four buildings that are close to each other and then we determine if anybody in each building
is sick (Table 3-5). The two possible single events would be that a building does or does not contain
a sick person. The discrete variable would then be “Sick” with two possible levels: Yes or No. “p”
would equal the probability of encountering a building with a sick person (Sick=Yes) and “q” would
equal the probability of encountering a building without a sick person (Sick=No). In this case, we
always examined groups of four buildings so the sample size (k) is fixed.
Why is a Binomial probability distribution appropriate for this problem?
The measured variable, “Sick,” is discrete with only two states and you can identify both
states. Also, there is a fixed sample size because we are always examining four buildings
at a time.
Data:
Table 3-5: Frequency table for number of buildings with a sick person inside out
of groups of four buildings.
| Number of buildings out of four with a sick person (Y) | Observed Frequency (Sets of four) (f) |
|---|---|
| 0 | 145 |
| 1 | 30 |
| 2 | 25 |
| 3 | 65 |
| 4 | 135 |
| TOTAL (∑f) = | 400 |
1) Determine if the single probabilities, “p” and “q” are INTRINSIC or EXTRINSIC to
the data.
In this case, we do not know the probability of encountering a building containing a sick person or
the probability of encountering a building without a sick person. Therefore we must compute
estimates of “p” and “q” from our data.
2) Estimate probabilities (p and q) for single events for INTRINSIC probabilities.
The present case is not one in which we would already know “p” (“p” is INTRINSIC) so we will
have to determine “p” from our data. The probability of a building having a sick person, “p” would
be equal to the number of buildings with sick people divided by the total number of buildings.
$p = \frac{\text{Number of bldgs with sick people}}{\text{Total number of bldgs}}$
First we need to compute the number of buildings with sick people (sum of Column 3 in Table 3-6)
using the information from Table 3-5. For any given result (e.g. 0 out of 4, 1 out of 4, 2 out of 4 etc.), we
can determine the number of buildings with sick people by multiplying the observed frequency (f)
times the result (Y). For example, there were 145 cases in which none of the four buildings had sick
people and f*Y = 145*0 = 0. So the number of buildings with sick people for the first row would be 0.
In the second row, there were 30 cases in which 1 out of the 4 buildings contained a sick person and
f*Y = 30*1 = 30 buildings with sick people.
• The number of buildings with sick people would be Σ(f*Y).
• The total number of buildings sampled is 4 buildings sampled at a time (k) * the number of
samples (∑f), or k*(∑f) = 4*400 = 1600.
Table 3-6: Computing Σ(f*Y), the number of buildings with sick people
| Number of buildings out of four with a sick person (Y) | Observed frequency (f) | Number of buildings with a sick person (f*Y) |
|---|---|---|
| 0 | 145 | 145*0 = 0 |
| 1 | 30 | 30*1 = 30 |
| 2 | 25 | 25*2 = 50 |
| 3 | 65 | 65*3 = 195 |
| 4 | 135 | 135*4 = 540 |
| Totals | Number of samples Σf = 400 | Number of buildings with a sick person Σ(f*Y) = 815 |
The probability of encountering a building with a sick person (p), sampling one at a time = Total
number of buildings with a sick person over the total number of buildings sampled or Σ(f*Y)/(k*∑f).
p = 815/1600=0.509
The probability of encountering a building without a sick person (q), sampling one at a time is equal
to 1- p.
q = 1-0.509=0.491
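The estimate of p and q from the frequency table can be reproduced as follows. This is only an illustrative sketch; the variable names (`observed`, `p_hat`, `q_hat`) are my own and do not come from the lab.

```python
# Observed frequency table for the disease example: Y -> f
observed = {0: 145, 1: 30, 2: 25, 3: 65, 4: 135}
k = 4  # buildings examined in every sample

n_samples = sum(observed.values())                # sum of f
n_sick = sum(y * f for y, f in observed.items())  # sum of f*Y
p_hat = n_sick / (k * n_samples)                  # buildings with a sick person / all buildings
q_hat = 1.0 - p_hat

print(n_samples, n_sick, round(p_hat, 3), round(q_hat, 3))  # 400 815 0.509 0.491
```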
3) Determine the exponents for “p” and “q” for the binomial model.
There are five possible outcomes if we always sample 4 buildings at a time.
Table 3-7: Determining the binomial exponents for the disease example
| Outcome | Number of bldgs with a sick person (Y) | Number of bldgs without a sick person | Exponents |
|---|---|---|---|
| 1 | 0 | 4 | p^0 q^4 |
| 2 | 1 | 3 | p^1 q^3 |
| 3 | 2 | 2 | p^2 q^2 |
| 4 | 3 | 1 | p^3 q^1 |
| 5 | 4 | 0 | p^4 q^0 |
4) Determine the coefficients for the terms
The formula for the coefficients (number of combinations) is $\frac{k!}{x!\,(k-x)!}$, where x is the
power of p for the term you want and k is the sample size. Remember 0! = 1 and 1! = 1.
Table 3-8: Determining the binomial coefficients for the disease example
| Outcome | Number of bldgs with a sick person (Y) | Number of bldgs without a sick person | Coefficients (number of possible combinations) | Exponents |
|---|---|---|---|---|
| 1 | 0 | 4 | 4!/(0!(4-0)!) = (4*3*2*1)/(1*(4*3*2*1)) = 1 | p^0 q^4 |
| 2 | 1 | 3 | 4!/(1!(4-1)!) = (4*3*2*1)/(1*(3*2*1)) = 4 | p^1 q^3 |
| 3 | 2 | 2 | 4!/(2!(4-2)!) = (4*3*2*1)/(2*1*(2*1)) = 6 | p^2 q^2 |
| 4 | 3 | 1 | 4!/(3!(4-3)!) = (4*3*2*1)/(3*2*1*(1)) = 4 | p^3 q^1 |
| 5 | 4 | 0 | 4!/(4!(4-4)!) = (4*3*2*1)/(4*3*2*1*(1)) = 1 | p^4 q^0 |
5) Compute the combined probabilities.
Table 3-9: Probabilities for the disease example
| Outcome | Number of bldgs with a sick person (Y) | Number of bldgs without a sick person | Coeff. | Exponents | Probability formula | Probability |
|---|---|---|---|---|---|---|
| 1 | 0 | 4 | 1 | p^0 q^4 | 1*p^0*q^4 = 1*0.509^0*0.491^4 | 0.058 |
| 2 | 1 | 3 | 4 | p^1 q^3 | 4*p^1*q^3 = 4*0.509^1*0.491^3 | 0.241 |
| 3 | 2 | 2 | 6 | p^2 q^2 | 6*p^2*q^2 = 6*0.509^2*0.491^2 | 0.375 |
| 4 | 3 | 1 | 4 | p^3 q^1 | 4*p^3*q^1 = 4*0.509^3*0.491^1 | 0.259 |
| 5 | 4 | 0 | 1 | p^4 q^0 | 1*p^4*q^0 = 1*0.509^4*0.491^0 | 0.067 |
6) Compute the Expected Frequencies and compare to observed frequencies
The Expected Frequency ( f̂ ) can be computed from the Probabilities: Expected Frequency =
Probability*Total frequency (∑f) as we did before.
Table 3-10: Expected Frequencies and Observed frequencies for the disease example
| Number of bldgs with a sick person (Y) | Observed Frequency | Probability | Expected Frequency |
|---|---|---|---|
| 0 | 145 | 0.058 | 400 * 0.058 = 23.2 |
| 1 | 30 | 0.241 | 400 * 0.241 = 96.4 |
| 2 | 25 | 0.375 | 400 * 0.375 = 150.0 |
| 3 | 65 | 0.259 | 400 * 0.259 = 103.6 |
| 4 | 135 | 0.067 | 400 * 0.067 = 26.8 |
| TOTAL | 400 | | 400 |
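Steps 3 to 6 for the disease example can be sketched in code as below. This is an illustration under stated assumptions rather than the lab's own procedure: it uses the rounded estimates p = 0.509 and q = 0.491 from the text, relies on `math.comb` (Python 3.8+), and carries full precision through the multiplication, so an expected frequency may differ from the hand-computed table in the first decimal place.

```python
from math import comb

p, q, k = 0.509, 0.491, 4            # rounded estimates from step 2
observed = [145, 30, 25, 65, 135]    # observed frequencies for Y = 0..4
total = sum(observed)                # total number of samples (sum of f)

# Step 5: coefficient * p^x * q^(k - x) for each outcome x
probs = [comb(k, x) * p**x * q**(k - x) for x in range(k + 1)]

# Step 6: expected frequency = probability * total frequency
expected = [total * pr for pr in probs]

for y, (obs, pr, exp_f) in enumerate(zip(observed, probs, expected)):
    print(f"Y={y}  observed={obs:3d}  P={pr:.3f}  expected={exp_f:.1f}")
```

Printing observed and expected side by side makes the comparison in step 6 direct: look for classes where the observed count is far above or far below the expected count.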
Use Excel™ or Systat™ to create a bar chart comparing Observed and Expected
Frequencies.
Paste the chart on the top of the next page:
Is the distribution clumped, uniform or stochastic?
What is your conclusion about the disease?
Poisson Distribution
The Poisson distribution is a discrete distribution that is used to describe the expected frequency of
random events in time and/or space, when the event of interest is RARE. Use the Poisson if you
can’t measure q OR if you are measuring abundance.
There are 4 steps to computing expected probabilities for a Poisson distribution:
1) Determine if the mean is INTRINSIC or EXTRINSIC to the data.
2) Estimate the mean from the data.
3) Compute the combined probabilities.
4) Compute expected frequencies and compare to observed frequencies.
Example 3: Road kill - Poisson
Problem: Suppose you have been assigned to investigate road kills on Highway 26 south of
Hollister, about 100 miles from San Jose. The route passes through both urban and rural areas. Are
the road kills stochastic or do they seem to be concentrated or do they seem to be evenly spaced? If
the distribution is clumped, it means that there are “hot spots” and there might be a possibility for
reducing the number of deaths with proper control measures. You divided the 100 mile distance into
1-mile segments (units of space) and counted the number of road kills on the route in each segment
(you are measuring the abundance of road kills). The appropriate distribution is a Poisson because the
data are discrete and you are measuring abundance. Data are presented in Table 3-11. The mean
number of road kills per mile is known for the state (the mean = 2.09 road kills per mile).
Why is a Poisson probability distribution appropriate for this problem?
The measured variable, “Road Kills,” represents counts of the number of road kills per
segment. We do not know how many animals were not killed.
Data:
Table 3-11: Number of road kills on Hwy 26 on one hundred 1 mile long segments
| # of road kills per segment (Y) | Observed Frequency (f) |
|---|---|
| 0 | 13 |
| 1 | 26 |
| 2 | 26 |
| 3 | 18 |
| 4 | 10 |
| 5 | 5 |
| ≥6 | 2 |
| TOTAL | ∑f = 100 |
1) Is the mean INTRINSIC or EXTRINSIC to the data?
If the mean (average number of road kills) is already known, then the mean is extrinsic. If this is the
case (as in this example), skip to step 3.
If the mean must be computed from the data, then the mean is INTRINSIC to the data.
2) Compute the mean from the data when the mean is intrinsic
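| # of road kills per segment (Y) | Observed Frequency (f) | f*Y |
|---|---|---|
| 0 | 13 | 0 |
| 1 | 26 | 26 |
| 2 | 26 | 52 |
| 3 | 18 | 54 |
| 4 | 10 | 40 |
| 5 | 5 | 25 |
| ≥6* | 2 | 12 |
| TOTAL | ∑f = 100 | ∑(f*Y) = 209 |
* If the class includes values greater than or equal to some value, use the lowest value (e.g. 6 in
this case) to compute the mean.
Mean $\bar{Y} = \frac{\sum (f \cdot Y)}{\sum f} = \frac{209}{100} = 2.09$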
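The mean of a frequency table can be computed the same way in code. A minimal sketch with hypothetical variable names follows; as in the footnote above, the open-ended "≥6" class is entered as 6.

```python
# (Y, f) pairs for the road kill data; the ">=6" class is entered as 6
road_kills = [(0, 13), (1, 26), (2, 26), (3, 18), (4, 10), (5, 5), (6, 2)]

n = sum(f for _, f in road_kills)            # sum of f
sum_fy = sum(y * f for y, f in road_kills)   # sum of f*Y
mean_y = sum_fy / n

print(n, sum_fy, mean_y)  # 100 209 2.09
```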
3) Compute the probabilities for the Poisson model.
The highest number of kills in any segment was 6, so we will need terms for 0 through 6 kills. The
probability for any class P(Y) in a Poisson distribution follows a series. The terms are in a series
where each term depends upon the last. The series goes like:
P(0), the probability of getting no road kills in a 1 mile segment = $e^{-\mu}$
P(1), the probability of getting one road kill in a 1 mile segment = $P(0) \cdot \frac{\mu}{1}$
P(2) = $P(1) \cdot \frac{\mu}{2}$
……
and in general P(Y) = $P(Y-1) \cdot \frac{\mu}{Y}$
To calculate the last probability (the probability for all higher classes combined) you need to add up
all of the preceding probabilities and subtract the total from 1. For this example, the probability for
the last class would be P(≥6).
Notice that the only parameter you need to know to compute the series is µ, which is the mean of the
data. Since we had to estimate µ, we will substitute $\bar{Y} = 2.09$.
Compute the expected Poisson probabilities in Table 3-12.
Table 3-12: Poisson probabilities. BE CAREFUL BECAUSE EACH TERM DEPENDS ON
THE PREVIOUS TERM.
| Number of road kills per segment (Y) | Formula | Probability |
|---|---|---|
| 0 | e^(-Ȳ) = 2.7183^(-2.09) | 0.124 |
| 1 | P(0) * Ȳ/1 = 0.124 * 2.09/1 | 0.259 |
| 2 | P(1) * Ȳ/2 = 0.259 * 2.09/2 | 0.271 |
| 3 | P(2) * Ȳ/3 = 0.271 * 2.09/3 | 0.189 |
| 4 | P(3) * Ȳ/4 = 0.189 * 2.09/4 | 0.099 |
| 5 | P(4) * Ȳ/5 = 0.099 * 2.09/5 | 0.041 |
| ≥6 | 1 - Sum of P(0) to P(5) = 1 - (0.124 + 0.259 + 0.271 + 0.189 + 0.099 + 0.041) | 0.017 |
| TOTAL | | 1.000 |
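The series is easy to compute in a loop because each term is the previous term times mean/Y. The sketch below is an illustration only: the function name `poisson_probabilities` and its `max_class` argument are hypothetical, and it carries full floating-point precision from term to term. The hand computation in the table rounds each term to three decimals before reusing it, so the later entries (and the "≥6" remainder) can differ from the code's output in the second or third decimal place.

```python
from math import exp

def poisson_probabilities(mean, max_class):
    """Return P(0), ..., P(max_class - 1) and, last, P(Y >= max_class)."""
    probs = [exp(-mean)]                     # P(0) = e^(-mean)
    for y in range(1, max_class):
        probs.append(probs[-1] * mean / y)   # P(Y) = P(Y-1) * mean / Y
    probs.append(1.0 - sum(probs))           # open-ended class: 1 - sum of the rest
    return probs

road_probs = poisson_probabilities(2.09, 6)
for label, pr in zip(["0", "1", "2", "3", "4", "5", ">=6"], road_probs):
    print(label, round(pr, 3))
```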
4) Compute the Expected Frequencies and compare to the Observed Frequencies.
The Expected Frequency ( f̂ ) can be computed from the Probabilities: Expected Frequency =
Probability*Total frequency (∑f). For this experiment, total frequency (of samples) was 100.
Table 3-13: Observed and Expected frequencies for the road kill example
| Number of road kills in a segment (Y) | Observed Frequency (f) | Probability | Expected Frequency ( f̂ ) |
|---|---|---|---|
| 0 | 13 | 0.124 | 12.4 |
| 1 | 26 | 0.259 | 25.9 |
| 2 | 26 | 0.271 | 27.1 |
| 3 | 18 | 0.189 | 18.9 |
| 4 | 10 | 0.099 | 9.9 |
| 5 | 5 | 0.041 | 4.1 |
| ≥6 | 2 | 0.017 | 1.7 |
| TOTAL | 100 | 1.000 | 100.0 |
Use Excel™ or Systat™ to create a bar chart comparing Observed and
Expected Frequencies.
Paste the chart here:
Is the distribution clumped, uniform or stochastic?
What is your conclusion about the road kills?
Computing Probabilities for Groups of Classes.
The key to most of these types of problems is in recognizing that all of the probabilities
have to add to 1.0.
For example, assume that you want to know the probability of 1 or more buildings
containing sick people. The probability for no buildings having sick people was 0.058.
The probability of 1 or more buildings containing sick people = 1- The probability for no
buildings having sick people. So P(>0)=1-P(0) or P(>0)=1-0.058=0.942.
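The same complement rule works for any group of classes. A short sketch, using the rounded disease-example probabilities for Y = 0 to 4 (the variable names are my own):

```python
# Rounded probabilities for the disease example, indexed by Y = 0..4
probs = [0.058, 0.241, 0.375, 0.259, 0.067]

p_at_least_one = 1.0 - probs[0]        # P(>0) = 1 - P(0)
p_two_or_more = 1.0 - sum(probs[:2])   # P(>1) = 1 - (P(0) + P(1))

print(round(p_at_least_one, 3), round(p_two_or_more, 3))  # 0.942 0.701
```

Subtracting from 1 is usually less work than adding up every class in the group, especially when the group is open-ended.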
The Coefficient of Dispersion – An Indicator of Whether a
Distribution is Clumped, Stochastic or Uniform
The Coefficient of Dispersion (CD) can be used as an indicator of whether or not a
distribution is clumped, stochastic or uniform. There are different expectations of the
value of CD for uniform, stochastic or clumped depending on the type of theoretical
probability distribution that is appropriate for the data (e.g. Binomial, Poisson, or
Normal) (Table 3-14).
Table 3-14: Values for CD to indicate uniform, stochastic or clumped distributions
depending upon the appropriate theoretical probability distribution.
| Distribution | Uniform | Stochastic | Clumped |
|---|---|---|---|
| Binomial | CD < q | CD ≈ q | CD > q |
| Poisson | CD < 1 | CD ≈ 1 | CD > 1 |
| Normal | CD << 1 | CD ≈ 1 | CD >> 1 |
Coefficient of Dispersion – indicator of whether a distribution is clumped, stochastic, or uniform
Problem: You are trying to find out if a species of marine algae, Postelsia sp., has a uniform distribution. You randomly selected 44
sites. At each site, you placed a quadrat which had been split into four equal parts. You inspected each of the four squares to
determine if a Postelsia sp. was present or absent. For each site you then recorded the number of squares for which Postelsia was
present. You should be able to tell that the Binomial is the appropriate distribution here.
Formula for sample CD = $\frac{s^2}{\bar{Y}}$
Note: For parametric CD = $\frac{\sigma^2}{\mu}$
Example Data for a sample:
| Number of squares per quadrat in which Postelsia was found (Y) | Frequency (f) | f*Y | f*Y^2 |
|---|---|---|---|
| 0 | 4 | 0 | 0 |
| 1 | 33 | 33 | 33 |
| 2 | 5 | 10 | 20 |
| 3 | 1 | 3 | 9 |
| 4 | 1 | 4 | 16 |
| TOTAL | ∑f = 44 | ∑(f*Y) = 50 | ∑(f*Y^2) = 78 |
Example Computations:
1) $p = \frac{50}{44 \times 4} = 0.284$, $q = 1 - 0.284 = 0.716$
2) $\bar{Y} = \frac{50}{44} = 1.136$
3) $s^2 = \frac{78 - \frac{50^2}{44}}{43} = 0.493$
4) $CD = \frac{0.493}{1.136} = 0.433$
5) Because CD < q (0.433 < 0.716), the distribution is
probably uniform.
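The five computations can be reproduced from the frequency table alone. This is a minimal sketch with hypothetical variable names; it uses the shortcut form of the sample variance shown in step 3.

```python
# Postelsia example: Y -> frequency f, with k = 4 squares per quadrat
freq = {0: 4, 1: 33, 2: 5, 3: 1, 4: 1}
k = 4

n = sum(freq.values())                              # sum of f
sum_fy = sum(y * f for y, f in freq.items())        # sum of f*Y
sum_fy2 = sum(y * y * f for y, f in freq.items())   # sum of f*Y^2

p = sum_fy / (n * k)
q = 1.0 - p
mean_y = sum_fy / n
variance = (sum_fy2 - sum_fy**2 / n) / (n - 1)      # sample variance s^2
cd = variance / mean_y                              # coefficient of dispersion

print(round(mean_y, 3), round(variance, 3), round(cd, 3), round(q, 3))
# 1.136 0.493 0.433 0.716
```

Since the binomial is the appropriate distribution here, CD is compared against q rather than against 1.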
Frequency Data with range of value classes – The computation is virtually the same as for single value classes but you use the class
marks as Y
On your own
1) You suspect that a nematomorph parasite may be influencing its primary host’s
(grasshopper) movements. In particular, you think the parasite will attempt to direct
the host to move to a habitat where the parasite will have the best chance of finding
the secondary host in its cycle. You randomly select several sites where at least four
potential primary hosts are located and then, at each site, assess four potential hosts
for the presence of the parasite. Your results are in the following table.
What is the measured variable? _____________________
What is the appropriate probability distribution? __________________
Are the parameters intrinsic or extrinsic? _________________
Compute the probabilities and expected frequencies.
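| Number of grasshoppers per site with the parasite (Y) | Frequency (f) | f*Y | f*Y^2 | Probability | Expected |
|---|---|---|---|---|---|
| 0 | 4 | | | | |
| 1 | 13 | | | | |
| 2 | 20 | | | | |
| 3 | 11 | | | | |
| 4 | 3 | | | | |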
Compute the mean
Compute the variance
Compute the CD
Is the distribution clumped, uniform or stochastic? _____________ Make sure
you know why.
Use Excel to plot Observed versus Expected Frequencies
2) You are taking a multiple choice exam with 10 questions. Each question has five
possible answers.
What is the measured variable? _____________________
What is the appropriate probability distribution? __________________
Are the parameters intrinsic or extrinsic? _________________
What is the probability of getting a score of 7 on the exam if you guess every
question?
What is the probability of getting a score less than 8 on the exam if you guess every
question?
3) You are trying to determine if wood ticks prefer some deer over others (e.g. they have
a clumped distribution). You randomly select deer and check them for ticks. The
mean for these data is 2.00 ticks per deer. Your data are in the following table:
What is the measured variable? _____________________
What is the appropriate probability distribution? __________________
Are the parameters intrinsic or extrinsic? _________________
Compute the probabilities and expected frequencies.
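| # of ticks per deer (Y) | Observed Frequency (f) | f*Y | f*Y^2 | Expected Probabilities (PY) | Expected Frequency ( f̂ ) |
|---|---|---|---|---|---|
| 0 | 90 | | | | |
| 1 | 62 | | | | |
| 2 | 30 | | | | |
| 3 | 9 | | | | |
| 4 | 2 | | | | |
| 5 | 6 | | | | |
| 6 | 5 | | | | |
| 7 | 7 | | | | |
| ≥8 | 15 | | | | |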
Compute the mean
Compute the variance
Compute the CD
Is the distribution clumped, uniform or stochastic? _____________ Make sure
you know why.
Use Excel to plot Observed versus Expected Frequencies