CHAPTER 3: EXAMINING SPATIAL PATTERNS IN DISCRETE DATA WITH THE BINOMIAL AND POISSON DISTRIBUTIONS

PURPOSE: In this lab, you will learn how to generate Expected Probabilities and Expected Frequencies using the Binomial and Poisson Distributions. These theoretical probability distributions are used with discrete data to predict the occurrences expected by chance alone when you want to determine whether there is a spatial pattern (Clumped or Uniform) or whether the distribution is stochastic.

Background

If the observed frequencies fit the expected frequencies (chance alone), you would conclude that the observed distribution is stochastic (i.e. no pattern; see Figure 3-1). If the observed frequencies do not fit the expected frequencies, you would conclude that the observed distribution is either clumped or uniform.

Patterns of distributions

Patterns in distributions occur when the observed frequencies do not fit the expected frequencies (see Figure 3-1).

Clumped: A distribution is clumped when low values and high values of Y occur MORE frequently than expected while mid values of Y occur LESS frequently than expected. In these patterns, individuals tend to form groups. If animals needed a particular resource that had a patchy distribution, their distribution would be clumped.

Uniform: A distribution is uniform when the reverse occurs: mid values of Y occur MORE frequently than expected while low and high values occur LESS frequently than expected. This type of distribution often occurs when organisms are competing with each other and set up territories.

Figure 3-1: Spatial pattern and frequency distribution for Clumped, No Pattern (Stochastic), and Uniform distributions. Each panel plots Frequency against # per quadrat.

Theoretical Probability Distributions (Binomial and Poisson)

Previously, we had to generate our own theoretical probability distribution (expected probabilities and frequencies) by using probability rules to compute joint probabilities.
There are several situations in which someone has already worked out a method for generating a theoretical probability distribution. Both the Binomial and the Poisson distributions are theoretical probability distributions that predict what would occur by chance alone (they are used to generate a stochastic distribution) with discrete data, but they do so under different conditions:

The Binomial Distribution is used with a discrete variable for which there are two possible values (levels or states) and when you always examine the same sample size (number of individuals or sites). An example would be blue eye color versus not-blue eye color, measured in cases where we sample three people each time.

The Poisson Distribution is usually used with abundance data. You use this distribution with discrete data when p is rare and q is not measurable. Abundance fits this category because you only know when an individual is present, not when it is absent.

Binomial Distribution

Computing Expected Probabilities

The probability formulae for each possible outcome can be generated from the expansion of the binomial model, (p + q)^k, where “p” is the probability for one state, “q” is the probability for the other state, and “k” is the sample size. There are 6 steps to computing expected probabilities for a binomial distribution:

1) Determine if the single probabilities, “p” and “q”, are INTRINSIC or EXTRINSIC to the data.
2) Estimate probabilities (p and q) for single events for INTRINSIC probabilities.
3) Determine the exponents for “p” and “q” for the binomial model.
4) Determine the coefficients for the terms.
5) Compute the combined probabilities.
6) Compute expected frequencies and compare to observed frequencies.

These steps are things you have already done when computing probabilities and expected frequencies from the coin flips. However, for large problems, the binomial computations are easier.
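Steps 3 through 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the lab procedure: the function names `binomial_probabilities` and `expected_frequencies` are hypothetical, and the standard-library function `math.comb` supplies the number of combinations. The demonstration uses two fair coins (p = 0.5, k = 2), for which the probabilities are already known from the first lab.

```python
from math import comb

def binomial_probabilities(p, k):
    """Return [P(Y=0), ..., P(Y=k)] from the expansion of (p + q)^k."""
    q = 1.0 - p
    probs = []
    for y in range(k + 1):
        coefficient = comb(k, y)  # number of ways to get y outcomes of the "p" state
        probs.append(coefficient * p**y * q**(k - y))
    return probs

def expected_frequencies(probs, total_frequency):
    """Expected frequency = probability * total frequency."""
    return [prob * total_frequency for prob in probs]

# Two fair coins: p = 0.5 is extrinsic, sample size k = 2
coin_probs = binomial_probabilities(0.5, 2)
print(coin_probs)  # [0.25, 0.5, 0.25]
```

The same function can be called with an intrinsic estimate of p and a larger k once p has been computed from a frequency table.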
We will first work through Steps 1-5 with an example for which you already know the answers: the 2-coin flip from the first lab (Table 3-1).

Table 3-1: Probabilities for a 2-coin flip

Number of Heads                        0            1            2
Number of Tails                        2            1            0
Number of ways this can happen         1            2            1
Probability formula                    q^2          2pq          p^2
Probability can also be written as     1*p^0*q^2    2*p^1*q^1    1*p^2*q^0
Probability                            0.25         0.50         0.25

Example 1: 2-Coin Flip - Binomial

1) Determine if the single probabilities, “p” and “q”, are INTRINSIC or EXTRINSIC to the data.

If “p” is already known, then the “p” parameter is EXTRINSIC to the data. For example, if the measured variable is a coin flip, we know that the probability of getting a heads is 0.5, so the probability is extrinsic. If this is the case, skip to step 3. If “p” must be computed from the data, then the “p” parameter is INTRINSIC to the data.

2) Estimate probabilities (p and q) for single events for INTRINSIC probabilities.

Single event probabilities are extrinsic for the coin toss, so we can skip this step.

3) Determine the exponents for “p” and “q” for the binomial model.

This is just an expansion of what we learned about combined probabilities. Let’s look at this first by using our 2-coin flip example. Let p = the probability of getting a heads and q = the probability of getting a tails. Y is the number of heads. There are three possible outcomes. The exponents follow the values for the number of heads and tails for any one outcome.

Table 3-2: Determining the binomial exponents for the 2-coin example

Outcome   Number of Heads with 2 coins (Y)   Number of Tails with 2 coins   Exponents
1         0                                  2                              p^0 q^2
2         1                                  1                              p^1 q^1
3         2                                  0                              p^2 q^0

4) Determine the coefficients for the terms.

The coefficients are just the number of ways that an event can happen (combinations). The formula for the coefficients (number of combinations) is:

    k! / (x! (k - x)!)

where x is the power of p for the term you want and k is the sample size.
Remember that this is just the number of possible combinations, and that 0! = 1 and 1! = 1.

Table 3-3: Determining the binomial coefficients for the 2-coin example

Outcome   Heads with 2 coins (Y)   Tails with 2 coins   Coefficient (number of possible combinations)   Exponents
1         0                        2                    2! / (0! (2-0)!) = (2*1) / (1 * (2*1)) = 1      p^0 q^2
2         1                        1                    2! / (1! (2-1)!) = (2*1) / (1 * 1) = 2          p^1 q^1
3         2                        0                    2! / (2! (2-2)!) = (2*1) / ((2*1) * 1) = 1      p^2 q^0

5) Compute the combined probabilities.

The combined probabilities are computed by multiplying the coefficient by p and q raised to the proper exponents (Table 3-4).

Table 3-4: Computing the combined probabilities for the 2-coin example

Outcome   Heads (Y)   Tails   Coeff.   Exponents   Probability formula                Probability
1         0           2       1        p^0 q^2     1*p^0*q^2 = 1 * 0.5^0 * 0.5^2      0.25
2         1           1       2        p^1 q^1     2*p^1*q^1 = 2 * 0.5^1 * 0.5^1      0.50
3         2           0       1        p^2 q^0     1*p^2*q^0 = 1 * 0.5^2 * 0.5^0      0.25

Example 2: Contagious Disease - Binomial

Problem: Let’s assume that we are trying to determine if a disease is transmitted by contact. If that were true, the distribution would have a clumped pattern. Let’s assume that we randomly select 400 groups of four buildings that are close to each other and then determine whether anybody in each building is sick (Table 3-5). The two possible single events are that a building does or does not contain a sick person. The discrete variable is then “Sick”, with two possible levels: Yes or No. “p” equals the probability of encountering a building with a sick person (Sick=Yes) and “q” equals the probability of encountering a building without a sick person (Sick=No). In this case, we always examined groups of four buildings, so the sample size (k) is fixed.

Why is a Binomial probability distribution appropriate for this problem? The measured variable, “Sick,” is discrete with only two states, and you can identify both states. Also, there is a fixed sample size because we are always examining four buildings at a time.
Data:

Table 3-5: Frequency table for the number of buildings with a sick person inside, out of groups of four buildings.

Number of buildings out of four     Observed Frequency
with a sick person (Y)              (sets of four) (f)
0                                   145
1                                   30
2                                   25
3                                   65
4                                   135
TOTAL                               ∑f = 400

1) Determine if the single probabilities, “p” and “q”, are INTRINSIC or EXTRINSIC to the data.

In this case, we do not know the probability of encountering a building containing a sick person or the probability of encountering a building without a sick person. Therefore we must compute estimates of “p” and “q” from our data.

2) Estimate probabilities (p and q) for single events for INTRINSIC probabilities.

The present case is not one in which we would already know “p” (“p” is INTRINSIC), so we will have to determine “p” from our data. The probability of a building having a sick person, “p”, is equal to the number of buildings with sick people divided by the total number of buildings:

    p = (Number of bldgs with sick people) / (Total number of bldgs)

First we need to compute the number of buildings with sick people (the sum of Column 3 in Table 3-6) using the information from Table 3-5. For any given result (e.g. 0 out of 4, 1 out of 4, 2 out of 4, etc.), we can determine the number of buildings with sick people by multiplying the observed frequency (f) by the result (Y). For example, there were 145 cases in which none of the four buildings had sick people, and f*Y = 145*0 = 0, so the number of buildings with sick people for the first row is 0. In the second row, there were 30 cases in which 1 out of the 4 buildings contained a sick person, and f*Y = 30*1 = 30 buildings with sick people. The number of buildings with sick people is Σ(f*Y). The total number of buildings sampled is the 4 buildings sampled at a time (k) times the number of samples (∑f), or k*(∑f) = 4*400 = 1600.

Table 3-6: Computing Σ(f*Y).
The number of buildings with sick people

Number of buildings out of four     Observed          Number of buildings with
with a sick person (Y)              frequency (f)     a sick person (f*Y)
0                                   145               145*0 = 0
1                                   30                30*1 = 30
2                                   25                25*2 = 50
3                                   65                65*3 = 195
4                                   135               135*4 = 540
Totals                              Σf = 400          Σ(f*Y) = 815

The probability of encountering a building with a sick person (p), sampling one at a time, is the total number of buildings with a sick person divided by the total number of buildings sampled, or Σ(f*Y)/(k*∑f):

    p = 815/1600 = 0.509

The probability of encountering a building without a sick person (q), sampling one at a time, is equal to 1 - p:

    q = 1 - 0.509 = 0.491

3) Determine the exponents for “p” and “q” for the binomial model.

There are five possible outcomes if we always sample 4 buildings at a time.

Table 3-7: Determining the binomial exponents for the disease example

Outcome   Number of bldgs with a sick person (Y)   Number of bldgs without a sick person   Exponents
1         0                                        4                                       p^0 q^4
2         1                                        3                                       p^1 q^3
3         2                                        2                                       p^2 q^2
4         3                                        1                                       p^3 q^1
5         4                                        0                                       p^4 q^0

4) Determine the coefficients for the terms.

The formula for the coefficients (number of combinations) is:

    k! / (x! (k - x)!)

where x is the power of p for the term you want and k is the sample size. Remember 0! = 1 and 1! = 1.

Table 3-8: Determining the binomial coefficients for the disease example

Outcome   With (Y)   Without   Coefficient (number of possible combinations)                    Exponents
1         0          4         4! / (0! (4-0)!) = (4*3*2*1) / (1 * (4*3*2*1)) = 1               p^0 q^4
2         1          3         4! / (1! (4-1)!) = (4*3*2*1) / (1 * (3*2*1)) = 4                 p^1 q^3
3         2          2         4! / (2! (4-2)!) = (4*3*2*1) / ((2*1) * (2*1)) = 6               p^2 q^2
4         3          1         4! / (3! (4-3)!) = (4*3*2*1) / ((3*2*1) * 1) = 4                 p^3 q^1
5         4          0         4! / (4! (4-4)!) = (4*3*2*1) / ((4*3*2*1) * 1) = 1               p^4 q^0

("With" and "Without" are the number of bldgs with and without a sick person.)

5) Compute the combined probabilities.

Table 3-9: Probabilities for the disease example

Outcome   With (Y)   Without   Coeff.   Exponents   Probability formula                    Probability
1         0          4         1        p^0 q^4     1*p^0*q^4 = 1 * 0.509^0 * 0.491^4      0.058
2         1          3         4        p^1 q^3     4*p^1*q^3 = 4 * 0.509^1 * 0.491^3      0.241
3         2          2         6        p^2 q^2     6*p^2*q^2 = 6 * 0.509^2 * 0.491^2      0.375
4         3          1         4        p^3 q^1     4*p^3*q^1 = 4 * 0.509^3 * 0.491^1      0.259
5         4          0         1        p^4 q^0     1*p^4*q^0 = 1 * 0.509^4 * 0.491^0      0.067

6) Compute the Expected Frequencies and compare to observed frequencies.

The Expected Frequency ( f̂ ) can be computed from the Probabilities: Expected Frequency = Probability * Total frequency (∑f), as we did before.

Table 3-10: Expected Frequencies and Observed frequencies for the disease example

Number of bldgs with     Observed      Probability   Expected Frequency
a sick person (Y)        Frequency
0                        145           0.058         400 * 0.058 = 23.2
1                        30            0.241         400 * 0.241 = 96.4
2                        25            0.375         400 * 0.375 = 150.0
3                        65            0.259         400 * 0.259 = 103.6
4                        135           0.067         400 * 0.067 = 26.8
TOTAL                    400                         400

Use Excel™ or Systat™ to create a bar chart comparing Observed and Expected Frequencies. Paste the chart on the top of the next page:

Is the distribution clumped, uniform or stochastic?

What is your conclusion about the disease?

Poisson Distribution

The Poisson distribution is a discrete distribution that is used to describe the expected frequency of random events in time and/or space, when the event of interest is RARE. Use the Poisson if you can’t measure q OR if you are measuring abundance. There are 4 steps to computing expected probabilities for a Poisson distribution:

1) Determine if the mean is INTRINSIC or EXTRINSIC to the data.
2) Estimate the mean from the data.
3) Compute the combined probabilities.
4) Compute expected frequencies and compare to observed frequencies.

Example 3: Road kill - Poisson

Problem: Suppose you have been assigned to investigate road kills on Highway 26 south of Hollister, about 100 miles from San Jose. The route passes through both urban and rural areas. Are the road kills stochastic, do they seem to be concentrated, or do they seem to be evenly spaced?
If the distribution is clumped, it means that there are “hot spots”, and it might be possible to reduce the number of deaths with proper control measures. You divided the 100-mile distance into 1-mile segments (units of space) and counted the number of road kills in each segment (you are measuring the abundance of road kills). The appropriate distribution is a Poisson because the data are discrete and you are measuring abundance. Data are presented in Table 3-11. The mean number of road kills per mile is known for the state (the mean = 2.09 road kills per mile).

Why is a Poisson probability distribution appropriate for this problem? The measured variable, “Road Kills,” represents counts of the number of road kills per segment. We do not know how many animals were not killed.

Data:

Table 3-11: Number of road kills on Hwy 26 in one hundred 1-mile-long segments

# of road kills per segment (Y)    0    1    2    3    4    5    ≥6   TOTAL
Observed Frequency (f)             13   26   26   18   10   5    2    ∑f = 100

1) Is the mean INTRINSIC or EXTRINSIC to the data?

If the mean (average number of road kills) is already known, then the mean is EXTRINSIC. If this is the case (as in this example), skip to step 3. If the mean must be computed from the data, then the mean is INTRINSIC to the data.

2) Compute the mean from the data when the mean is intrinsic.

The computation below gives the same value as the known state mean.

# of road kills per segment (Y)    Observed Frequency (f)    f*Y
0                                  13                        0
1                                  26                        26
2                                  26                        52
3                                  18                        54
4                                  10                        40
5                                  5                         25
≥6*                                2                         12
TOTAL                              ∑f = 100                  ∑(f*Y) = 209

* If the class includes values greater than or equal to some value, use the lowest value (e.g. 6 in this case) to compute the mean.

    Mean Ȳ = ∑(f*Y) / ∑f = 209 / 100 = 2.09

3) Compute the probabilities for the Poisson model.

The highest class is 6 or more kills, so we need terms for 0 through 5 kills and a final term for ≥6. The probabilities P(Y) in a Poisson distribution form a series in which each term depends upon the previous one.
The series goes like this:

    P(0), the probability of getting no road kills in a 1-mile segment,  = e^(-µ)
    P(1), the probability of getting one road kill in a 1-mile segment,  = P(0) * µ/1
    P(2) = P(1) * µ/2
    ...
    P(Y) = P(Y-1) * µ/Y

To calculate the last probability (the probability for all higher classes combined), add up all of the preceding probabilities and subtract the total from 1. For this example, the probability for the last class is P(≥6). Notice that the only parameter you need to know to compute the series is µ, the mean. Here we substitute Ȳ = 2.09 for µ.

Compute the expected Poisson probabilities in Table 3-12.

Table 3-12: Poisson probabilities. BE CAREFUL BECAUSE EACH TERM DEPENDS ON THE PREVIOUS TERM. Each probability is rounded to three decimals before it is used in the next term.

Number of road kills    Formula                                                  Probability
per segment (Y)
0                       e^(-Ȳ) = 2.7183^(-2.09)                                  0.124
1                       P(0) * Ȳ/1 = 0.124 * 2.09/1                              0.259
2                       P(1) * Ȳ/2 = 0.259 * 2.09/2                              0.271
3                       P(2) * Ȳ/3 = 0.271 * 2.09/3                              0.189
4                       P(3) * Ȳ/4 = 0.189 * 2.09/4                              0.099
5                       P(4) * Ȳ/5 = 0.099 * 2.09/5                              0.041
≥6                      1 - (sum of P(0) to P(5))
                        = 1 - (0.124 + 0.259 + 0.271 + 0.189 + 0.099 + 0.041)    0.017
TOTAL                                                                            1.000

4) Compute the Expected Frequencies and compare to the Observed Frequencies.

The Expected Frequency ( f̂ ) can be computed from the Probabilities: Expected Frequency = Probability * Total frequency (∑f). For this experiment, the total frequency (number of samples) was 100.

Table 3-13: Observed and Expected frequencies for the road kill example

Number of road kills     Observed          Probability   Expected
in a segment (Y)         Frequency (f)                   Frequency ( f̂ )
0                        13                0.124         12.4
1                        26                0.259         25.9
2                        26                0.271         27.1
3                        18                0.189         18.9
4                        10                0.099         9.9
5                        5                 0.041         4.1
≥6                       2                 0.017         1.7
TOTAL                    100               1.000         100.0

Use Excel™ or Systat™ to create a bar chart comparing Observed and Expected Frequencies. Paste the chart here:

Is the distribution clumped, uniform or stochastic?

What is your conclusion about the road kills?

Computing Probabilities for Groups of Classes
The key to most of these types of problems is recognizing that all of the probabilities have to add to 1.0. For example, assume that you want to know the probability of 1 or more buildings containing sick people. The probability of no buildings having sick people was 0.058. The probability of 1 or more buildings containing sick people = 1 - the probability of no buildings having sick people. So P(>0) = 1 - P(0), or P(>0) = 1 - 0.058 = 0.942.

The Coefficient of Dispersion: An Indicator of Whether a Distribution is Clumped, Stochastic or Uniform

The Coefficient of Dispersion (CD) can be used as an indicator of whether a distribution is clumped, stochastic or uniform. The expected value of CD for uniform, stochastic or clumped distributions depends on the type of theoretical probability distribution that is appropriate for the data (e.g. Binomial, Poisson, or Normal) (Table 3-14).

Table 3-14: Values for CD to indicate uniform, stochastic or clumped distributions depending upon the appropriate theoretical probability distribution.

Distribution   Uniform     Stochastic   Clumped
Binomial       CD < q      CD ≈ q       CD > q
Poisson        CD < 1      CD ≈ 1       CD > 1
Normal         CD << 1     CD ≈ 1       CD >> 1

Coefficient of Dispersion: indicator of whether a distribution is clumped, stochastic, or uniform

Problem: You are trying to find out if a species of marine algae, Postelsia sp., has a uniform distribution. You randomly selected 44 sites. At each site, you placed a quadrat which had been split into four equal parts. You inspected each of the four squares to determine if Postelsia sp. was present or absent. For each site you then recorded the number of squares in which Postelsia was present. You should be able to tell that the Binomial is the appropriate distribution here.
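The CD calculation for this problem can be sketched in Python. This is a minimal illustration, assuming the sample CD is the sample variance divided by the sample mean (the formula given below) and using the Postelsia frequencies from the example data that follow (f = 4, 33, 5, 1, 1 for Y = 0 to 4). The function name `coefficient_of_dispersion` is hypothetical.

```python
def coefficient_of_dispersion(values, freqs):
    """Return (CD, mean, variance) for a frequency table of Y values."""
    n = sum(freqs)                                           # sum of f
    sum_fy = sum(f * y for y, f in zip(values, freqs))       # sum of f*Y
    sum_fy2 = sum(f * y * y for y, f in zip(values, freqs))  # sum of f*Y^2
    mean = sum_fy / n
    variance = (sum_fy2 - sum_fy**2 / n) / (n - 1)           # sample variance
    return variance / mean, mean, variance

values = [0, 1, 2, 3, 4]   # squares per quadrat with Postelsia (Y)
freqs = [4, 33, 5, 1, 1]   # number of sites (f)
k = 4                      # squares inspected per quadrat

cd, mean, variance = coefficient_of_dispersion(values, freqs)
p = sum(f * y for y, f in zip(values, freqs)) / (k * sum(freqs))
q = 1.0 - p

print(round(mean, 3), round(variance, 3), round(cd, 3), round(q, 3))
# 1.136 0.493 0.433 0.716
```

For a Binomial variable the CD is compared with q, so a CD of about 0.433 against q of about 0.716 points toward a uniform distribution.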
Formula for sample CD = s^2 / Ȳ

Note: For parametric CD = σ^2 / µ

Example Data for a sample:

Number of squares per quadrat in       Frequency (f)    f*Y             f*Y^2
which Postelsia was found (Y)
0                                      4                0               0
1                                      33               33              33
2                                      5                10              20
3                                      1                3               9
4                                      1                4               16
TOTAL                                  ∑f = 44          ∑(f*Y) = 50     ∑(f*Y^2) = 78

Example Computations:

1) p = ∑(f*Y) / (k * ∑f) = 50 / (44 * 4) = 0.284
   q = 1 - 0.284 = 0.716

2) Ȳ = ∑(f*Y) / ∑f = 50 / 44 = 1.136

3) s^2 = [∑(f*Y^2) - (∑(f*Y))^2 / ∑f] / (∑f - 1) = (78 - 50^2 / 44) / 43 = 0.493

4) CD = s^2 / Ȳ = 0.493 / 1.136 = 0.433 (using unrounded values)

5) Because CD < q (0.433 < 0.716), the distribution is probably uniform.

Frequency Data with a range of value classes: The computation is virtually the same as for single value classes, but you use the class marks as Y.

On your own

1) You suspect that a nematomorph parasite may be influencing the movements of its primary host (a grasshopper). In particular, you think the parasite will attempt to direct the host to move to a habitat where the parasite will have the best chance of finding the secondary host in its cycle. You randomly select several sites where at least four potential primary hosts are located and then, at each site, assess four potential hosts for the presence of the parasite. Your results are in the following table.

What is the measured variable? _____________________
What is the appropriate probability distribution? __________________
Are the parameters intrinsic or extrinsic? _________________

Compute the probabilities and expected frequencies.

Number of grasshoppers per site    Frequency (f)    f*Y     f*Y^2    Probability    Expected
with the parasite (Y)
0                                  4                ____    ____     ____           ____
1                                  13               ____    ____     ____           ____
2                                  20               ____    ____     ____           ____
3                                  11               ____    ____     ____           ____
4                                  3                ____    ____     ____           ____

Compute the mean
Compute the variance
Compute the CD

Is the distribution clumped, uniform or stochastic? _____________ Make sure you know why.

Use Excel to plot Observed versus Expected Frequencies.

2) You are taking a multiple choice exam with 10 questions. Each question has five possible answers.

What is the measured variable? _____________________
What is the appropriate probability distribution? __________________
Are the parameters intrinsic or extrinsic?
_________________

What is the probability of getting a score of 7 on the exam if you guess every question?

What is the probability of getting a score less than 8 on the exam if you guess every question?

3) You are trying to determine if wood ticks prefer some deer over others (i.e. they have a clumped distribution). You randomly select deer and check them for ticks. The mean for these data is 2.00 ticks per deer. Your data are in the following table:

What is the measured variable? _____________________
What is the appropriate probability distribution? __________________
Are the parameters intrinsic or extrinsic? _________________

Compute the probabilities and expected frequencies.

# of ticks       Observed         f*Y     f*Y^2    Expected                 Expected
per deer (Y)     Frequency (f)                     Probabilities (P(Y))     Frequency ( f̂ )
0                90               ____    ____     ____                     ____
1                62               ____    ____     ____                     ____
2                30               ____    ____     ____                     ____
3                9                ____    ____     ____                     ____
4                2                ____    ____     ____                     ____
5                6                ____    ____     ____                     ____
6                5                ____    ____     ____                     ____
7                7                ____    ____     ____                     ____
≥8               15               ____    ____     ____                     ____

Compute the mean
Compute the variance
Compute the CD

Is the distribution clumped, uniform or stochastic? _____________ Make sure you know why.

Use Excel to plot Observed versus Expected Frequencies.
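For checking hand computations, the Poisson series used in Example 3 (each term computed from the previous one, with the last class taken as 1 minus the sum of the others) can be sketched in Python. This is a minimal illustration rather than part of the lab procedure; the function name `poisson_probabilities` is hypothetical. Because the code carries full precision from term to term, its results can differ in the third decimal from a hand calculation that rounds at each step. For the road kill data it gives about 0.270 for P(2) and about 0.020 for P(≥6), where Table 3-12 lists 0.271 and 0.017.

```python
from math import exp

def poisson_probabilities(mean, last_class):
    """Return [P(0), ..., P(last_class - 1), P(>= last_class)]."""
    probs = [exp(-mean)]                    # P(0) = e^(-mean)
    for y in range(1, last_class):
        probs.append(probs[-1] * mean / y)  # P(Y) = P(Y-1) * mean / Y
    probs.append(1.0 - sum(probs))          # final class: 1 - sum of earlier terms
    return probs

# Road kill example: mean = 2.09 per segment, classes 0..5 and >=6, 100 segments
road_kill_probs = poisson_probabilities(2.09, 6)
expected = [prob * 100 for prob in road_kill_probs]

for y, (prob, freq) in enumerate(zip(road_kill_probs, expected)):
    label = f">={y}" if y == 6 else str(y)
    print(f"{label:>3}  {prob:.3f}  {freq:.1f}")
```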