Probability Models & Frequency Data

Goodness of Fit
Proportional Model
Chi-square Statistic
Example
R
Distribution
Assumptions
Poisson Distribution
Example
R
Goodness of Fit
Goodness of fit tests are used to compare any
observed frequency distribution against an expected
frequency distribution.
We previously did specialized examples of this for a
probability distribution (the 50:50 expected right- vs. left-handed
toad example) and a binomial distribution (sperm
genes on the X chromosome of mice).
The binomial test we did is a specialized form for
categorical variables with only two outcomes. Here we
will introduce a more generalized form.
Proportional Model
The proportional model is one of the simplest
probability models. The frequency of occurrence of
events is proportional to the number of
opportunities (e.g., X chromosome example).
What would we do, however, if we had multiple
proportions? A more generalized form of this test
is the chi-square (χ2) goodness-of-fit test.
Example 8.1: Under the proportional model, one would expect babies
born in the U.S. to be born in equal proportions across the days of
the week (i.e., about 14.29% per day). Is this true? The data are a random
sample of 350 births from across the U.S. during 1999.
Goodness-of-Fit Test
The χ2 goodness-of-fit test uses the chi-square statistic
(based upon the chi-square distribution) to compare
frequency data to a model stated by the null
hypothesis.
Continuing with our example:
H0: The probability of birth is the same every day of the week.
HA: The probability of birth is not the same every day of the week.
Again, H0 and HA are statements about the population
from which the sample is obtained.
In order to proceed, we need to determine the expected
frequencies under the null model. Examining the calendar for
1999, we see that the days of the week do not all occur the same
number of times: each occurs 52 times except Friday, which occurs
53 times, so we need to adjust for this.
Goodness-of-Fit Test
The calculation of the expected frequencies is
straightforward. For a day that occurs 52 times in the year:
Expected = 350 ⋅ (52/365) = 49.863
NB: the expected frequencies must
sum to the total observed (350).
Once you have a full set of observed and expected
frequencies, one can then determine a chi-square
statistic and associated probability.
Chi-square Statistic
The chi-square statistic measures the discrepancy
between observed and expected
frequencies (make sure to always use the absolute
frequencies [counts], not relative frequencies
[proportions]).
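Chi-square for each element can be calculated as (Observed − Expected)^2 / Expected, and the statistic is the sum over all elements:
\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]
For the first category (33 observed births):
\[ \frac{(33 - 49.863)^2}{49.863} = 5.70 \]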
Chi-square Statistic
The χ2 statistic is additive across all levels, so:
χ2 = 5.70 + 1.58 + 3.46 + 3.46 + 0.16 + 0.53 + 0.16 = 15.05
We now have a calculated test statistic and as usual need to
compare it to a table value at a particular degree of freedom to
make our decision. In other words, is 15.05 large enough to be
significantly different?
df = (number of categories) -1 = 7-1 = 6
From Statistical Table A in your text, we see that at df = 6, the
critical value for χ2 is 12.59. Therefore, we reject the null
hypothesis and conclude that there are unequal proportions of
births among days.
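The table lookup can be cross-checked in R. This is a minimal sketch, assuming the conventional significance level of 0.05; the variable names are illustrative and not from the text.

```r
# Critical value of the chi-square distribution at alpha = 0.05, df = 6
alpha <- 0.05
df <- 7 - 1                          # number of categories minus 1
crit <- qchisq(1 - alpha, df = df)   # upper-tail critical value

chi_obs <- 15.05                     # test statistic calculated above
reject <- chi_obs > crit             # TRUE means H0 is rejected
round(crit, 2)
reject
```

qchisq is the quantile function of the chi-square distribution, so it plays the role of the printed statistical table.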
Chi-square Statistic
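This type of problem can most easily be solved using a table format:
Day         Observed   Expected   (O - E)^2 / E
Sunday         33       49.863        5.70
Monday         41       49.863        1.58
Tuesday        63       49.863        3.46
Wednesday      63       49.863        3.46
Thursday       47       49.863        0.16
Friday         56       50.822        0.53
Saturday       47       49.863        0.16
Total         350      350.000       15.05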
Chi-square Statistic
Assuming equal probabilities this can be very easily
done in R using chisq.test:
> births<-c(33,41,63,63,47,56,47)
> chisq.test(births)
Chi-squared test for given probabilities
data: births
X-squared = 15.24, df = 6, p-value = 0.01847
How can we do this with the unequal probabilities
that we have? This is a bit more complicated, but
still straightforward:
Chi-square Statistic
> obsbirths<-births
> days<-c(52,52,52,52,52,53,52)
> expbirths<-350*(days/365)
> expbirths
[1] 49.86301 49.86301 49.86301 49.86301 49.86301 50.82192
[7] 49.86301
> chi<-sum((obsbirths-expbirths)^2/expbirths)
> chi
[1] 15.05676
> ?pchisq
> pchisq(chi,df=6)
[1] 0.9801802
> # What's going on here? By default pchisq returns the lower tail, P(X <= chi).
> # The p-value is the upper tail, so set lower.tail=FALSE:
> pchisq(chi,df=6,lower.tail=FALSE)
[1] 0.01981982
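The same test can also be run in one step. This is a minimal sketch, assuming the expected proportions (days/365) are passed through the p argument of chisq.test; it should reproduce the statistic and p-value computed by hand above.

```r
births <- c(33, 41, 63, 63, 47, 56, 47)   # observed births per weekday
days   <- c(52, 52, 52, 52, 52, 53, 52)   # occurrences of each weekday in 1999

# p must sum to 1; days/365 is the expected proportion for each weekday
fit <- chisq.test(births, p = days / 365)
fit$statistic   # chi-square statistic
fit$parameter   # degrees of freedom
fit$p.value     # upper-tail probability
```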
Chi-square Distribution
The chi-square distribution is a theoretical probability
distribution (analogous to the normal, binomial, Poisson, etc.).
Note that the distribution is not symmetrical and is highly
skewed (to the right).
When df = 1, the density curve is asymptotic to both axes!
Chi-square Distribution
If χ2 is a random variable with a chi-square
distribution with n degrees of freedom:
χ2 is a positive real number
The density function depends only on n (df)
The expected value of χ2 = n
The variance of χ2 = 2n
The graph of f(χ2) is not symmetrical
The graph of f(χ2) approaches symmetry as n → ∞
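The mean and variance properties can be checked numerically. This is a rough sketch, assuming n = 6 as an illustrative choice of df and using numerical integration of the density, so the results are approximate.

```r
n <- 6   # degrees of freedom (illustrative choice)

# Mean: integral of x * f(x) over the positive real line
m <- integrate(function(x) x * dchisq(x, df = n), 0, Inf)$value

# Variance: E[X^2] - (E[X])^2
ex2 <- integrate(function(x) x^2 * dchisq(x, df = n), 0, Inf)$value
v <- ex2 - m^2

c(mean = m, variance = v)   # should be close to n and 2n
```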
We can explore the properties of the chi-square distribution
through the use of R functions and graphics:
> par(mfrow=c(2,2),mar=c(3,4,3,3))
> layout.show(4)
> # each panel: density at a fixed value (1, 5, 10, 15) plotted against df = 1 to 30
> plot(dchisq(1,df=1:30))
> plot(dchisq(5,df=1:30))
> plot(dchisq(10,df=1:30))
> plot(dchisq(15,df=1:30))
Chi-square Assumptions
The sampling distribution of the chi-square
statistic only approximately follows the chi-square
distribution (but pretty closely).
Two assumptions apply:
1) None of the categories should have an
expected frequency less than one.
2) No more than 25% of the categories should
have expected frequencies less than five.
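These two rules of thumb are easy to check before running a test. A small sketch follows; check_expected is a hypothetical helper written for illustration, not an existing R function.

```r
# Returns TRUE when both rules of thumb are met
check_expected <- function(expected) {
  none_below_one <- all(expected >= 1)            # rule 1
  few_below_five <- mean(expected < 5) <= 0.25    # rule 2
  none_below_one && few_below_five
}

# Expected frequencies from the births example
expbirths <- 350 * c(52, 52, 52, 52, 52, 53, 52) / 365
check_expected(expbirths)

# A made-up set with one expected frequency below one
check_expected(c(0.5, 10, 10, 10))
```

When the check fails, a common remedy is to pool adjacent categories until the expected frequencies are large enough.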
Goodness-of-Fit Test
- Two Proportions -
The chi-square goodness of fit test is a very
general one and can be used in a variety of
situations.
It can also be used when there are only two
proportions, as a replacement for the binomial test,
but at a cost: it is much less powerful in this
situation. So, use the binomial test whenever
appropriate.
Poisson Distribution
The Poisson distribution describes the number of
successes in blocks of time or space, when successes
happen independently of each other and occur with equal
probability at every point in time or space.
The Poisson is often useful in biological studies because
it is a starting place for evaluating whether an
observed pattern is random.
If the null model is rejected, the distribution may be either
clumped or dispersed.
Poisson Distribution
A clumped distribution arises when the presence
of one success increases the probability of
success for adjacent observations (e.g.,
occurrences of a contagious disease).
A dispersed distribution is the opposite: the
presence of one success decreases the probability
of success for adjacent observations (e.g., animals
with well defended territories).
Poisson Distribution
The Poisson distribution is constructed using the
probability of X successes occurring in any given
block of time or space:
\[ \Pr[X \text{ successes}] = \frac{e^{-\mu} \mu^{X}}{X!} \]
where μ is the mean number of independent
successes in time or space (expressed as a unit
count) and e is the base of the natural log.
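The formula can be written directly in R and compared with the built-in dpois. This is a minimal sketch; pois_prob is a hypothetical helper name and the mean of 2 is an arbitrary illustrative value.

```r
# Poisson probability from the formula: e^(-mu) * mu^X / X!
pois_prob <- function(x, mu) exp(-mu) * mu^x / factorial(x)

x  <- 0:5
mu <- 2                        # illustrative mean, not from the text
by_hand  <- pois_prob(x, mu)
built_in <- dpois(x, mu)
all.equal(by_hand, built_in)   # the two should agree
```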
Poisson Distribution
- Example -
Example 8.6 assesses the fossil record. It asks: do extinctions occur
randomly through the fossil record, or are there periods
where extinction rates are unusually high (mass
extinctions) compared to background rates?
Fossil marine invertebrates are an ideal group with which to test
this question as they preserve well. The data
are the number of recorded extinctions in 76 contiguous
blocks of time.
Poisson Distribution
- Example -
The hypotheses are:
H0: The number of extinctions per time interval has a Poisson distribution.
HA: The number of extinctions per time interval does not have a Poisson distribution.
We need to begin by estimating μ, the mean number of
extinctions per time interval. As usual, μ can be estimated
by x-bar (= 4.21, n = 76).
We need to use the same protocol and generate expected
values to compare to our observed values, so return to the
formula for calculation of the Poisson distribution.
Poisson Distribution
- Example -
For example, for 3 extinctions:
\[ \Pr[3 \text{ extinctions}] = \frac{e^{-4.21} \, 4.21^{3}}{3!} = 0.1846 \]
Expected[3 extinctions] = 76 x 0.1846 = 14.03
Now, expand for all categories...
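The expansion to all categories can be sketched in R. The grouping used here (0 or 1, then 2 through 7, then 8 or more, for eight categories in total) is an assumption for illustration, as are the variable names.

```r
mu <- 4.21   # estimated mean number of extinctions per interval
n  <- 76     # number of time intervals

p_each <- dpois(0:7, mu)                 # Pr[X = 0], ..., Pr[X = 7]
p_grouped <- c(p_each[1] + p_each[2],    # 0 or 1 extinctions
               p_each[3:8],              # 2 through 7 extinctions
               1 - sum(p_each))          # 8 or more extinctions
expected <- n * p_grouped
round(expected, 2)
```

Taking the last category as one minus the sum of the others guarantees that the expected frequencies add up to 76.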
Poisson Distribution
- Example -
We now have a calculated chi-square test statistic.
We need to determine the degrees of freedom. In
the broadest sense, df is normally the number of categories minus 1. However,
in a variety of circumstances, we also need to
subtract the number of parameters estimated
from the data (here, μ). So, df = 8-1-1=6.
The critical value for χ2 at P = 0.05 and df
= 6 is 12.59. The calculated statistic exceeds this value, so we reject the null hypothesis
and conclude extinctions are non-random.
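> # Observed number of time blocks with 0, 1, ..., 9 and 10+ extinctions (sums to 76)
> extinctions<-c(0,13,15,16,7,10,4,2,1,2,6)
> ?dpois
> # Caution: the first argument of dpois is the number of successes, so this call
> # evaluates the Poisson probability at the values 0, 13, 15, ... (the observed
> # frequencies), not at the counts 0 to 10; dpois(0:10, 4.21) gives the latter.
> dpois(extinctions, 4.21)
[1] 1.484637e-02 3.111768e-04 2.626347e-05 6.910575e-06
[5] 6.905011e-02 7.156129e-03 1.943289e-01 1.315693e-01
[9] 6.250321e-02 1.315693e-01 1.148102e-01
> hist(dpois(extinctions, 4.21))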
Poisson Distribution
- Example -
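> # Observed frequencies pooled into 8 categories
> extinctions2<-c(13,15,16,7,10,4,2,9)
> # Caution: with no p argument, chisq.test compares against equal proportions
> # (hence df = 7), not against the Poisson expected frequencies.
> chisq.test(extinctions2)
Chi-squared test for given probabilities
data: extinctions2
X-squared = 18.7368, df = 7, p-value = 0.009053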
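A comparison against the Poisson expected frequencies can be sketched as follows. This assumes the eight categories in extinctions2 are 0 or 1, then 2 through 7, then 8 or more extinctions, and it subtracts one extra degree of freedom because μ was estimated from the data. The statistic is computed by hand because chisq.test would use df = 7.

```r
extinctions2 <- c(13, 15, 16, 7, 10, 4, 2, 9)   # observed frequencies
mu <- 4.21                                      # mean estimated from the data

# Poisson probabilities for the assumed categories: 0-1, 2, ..., 7, 8+
p_each <- dpois(0:7, mu)
p <- c(p_each[1] + p_each[2], p_each[3:8], 1 - sum(p_each))
expected <- sum(extinctions2) * p

chi <- sum((extinctions2 - expected)^2 / expected)
df  <- length(extinctions2) - 1 - 1             # categories - 1 - estimated parameters
p_value <- pchisq(chi, df = df, lower.tail = FALSE)
c(chi = chi, df = df, p = p_value)
```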
Poisson Distribution
We can explore the properties of the Poisson distribution
through the use of R functions and graphics:
> par(mfrow=c(2,2),mar=c(3,4,3,3))
> layout.show(4)
> plot(dpois(1:25,1))
> plot(dpois(1:25,2))
> plot(dpois(1:25,4.21))   # our example
> plot(dpois(1:25,10))