Three Frameworks for Statistical Analysis

Sample Design
Count the number of ant nests per quadrat in two habitats: Forest (N = 6) and Field (N = 4).

Data

Id #  Habitat  Number of ant nests per quadrat
1     Forest   9
2     Forest   6
3     Forest   4
4     Forest   6
5     Forest   7
6     Forest   10
7     Field    12
8     Field    9
9     Field    12
10    Field    10
Three Frameworks for Statistical Analysis
• Monte Carlo Analysis
• Parametric Analysis
• Bayesian Analysis
The model (for the Parametric and Bayesian analyses)

y_i = α + β·x_i + ε_i,   ε_i ~ Normal(0, σ²)
• y_i is a measurement on a “continuous” scale for observation i, which belongs to one of the two habitat types
• x_i is an indicator or dummy variable for groups (0, 1)
• The model includes three parameters:
• α: the mean of the group coded x = 0 (here, Forest)
• β: the mean difference between groups, and
• σ²: the variance of the normal distribution from which the residuals ε_i are assumed to have come
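The model above can be sketched as a short simulation. This is an illustrative stand-in in Python (not the deck's own R/BUGS code), and the parameter values α = 7, β = 3.75, σ = 2 are arbitrary choices near the observed group means, not fitted estimates:

```python
import random

# Simulate data from y_i = alpha + beta * x_i + eps_i, eps_i ~ Normal(0, sigma^2)
# Parameter values are illustrative only
alpha, beta, sigma = 7.0, 3.75, 2.0

random.seed(42)
x = [0] * 6 + [1] * 4                      # dummy variable: 0 = Forest, 1 = Field
eps = [random.gauss(0, sigma) for _ in x]  # normally distributed residuals
y = [alpha + beta * xi + e for xi, e in zip(x, eps)]

# Simulated difference between group means (should be near beta)
print(sum(y[6:]) / 4 - sum(y[:6]) / 6)
```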
Monte Carlo Analysis
Involves a number of methods in which data are randomized or reshuffled so that observations are randomly reassigned to different treatment groups. This randomization specifies the null hypothesis under consideration.

1. Specify a test statistic or index to describe the pattern in the data
2. Create a distribution of the test statistic that would be expected under the null hypothesis
3. Decide on a one- or two-tailed test
4. Compare the observed test statistic to a distribution of simulated values and estimate the appropriate P-value as a tail probability
1. Specifying the Test Statistic

DIFobs = 10.75 − 7 = 3.75
2. Creating the Null Distribution
3. Deciding on a One- or Two-Tailed Test

[Figure: histogram of the null distribution with the threshold marked; Abs(difference) = 3.750, P = 0.036]
4. Calculating the Tail Probability

Inequality         N
DIFsim > DIFobs    7
DIFsim = DIFobs    29
DIFsim < DIFobs    964

P = (7 + 29)/1000 = 36/1000 = 0.036
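The four steps can be sketched as a randomization test in Python (a stand-in for the deck's own tools); the simulated counts will differ slightly from the slide's 36/1000 because each run reshuffles at random:

```python
import random

# Ant nest counts per quadrat (from the data table above)
forest = [9, 6, 4, 6, 7, 10]   # N = 6
field = [12, 9, 12, 10]        # N = 4

# 1. Test statistic: difference between group means
obs_diff = sum(field) / len(field) - sum(forest) / len(forest)  # 10.75 - 7 = 3.75

# 2. Null distribution: reshuffle the pooled data, reassign to groups
random.seed(1)
pooled = forest + field
n_sims = 1000
n_extreme = 0
for _ in range(n_sims):
    random.shuffle(pooled)
    sim_field, sim_forest = pooled[:4], pooled[4:]
    sim_diff = sum(sim_field) / 4 - sum(sim_forest) / 6
    # 3./4. One-tailed test: count simulated differences at least as large
    if sim_diff >= obs_diff:
        n_extreme += 1

p_value = n_extreme / n_sims  # tail probability, analogous to 36/1000 = 0.036
print(obs_diff, p_value)
```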
Differences between means

[Figure: null distribution of differences between means; Difference = 3.7500, P1 = 0.0228]

difference_obs = 10.75 − 7 = 3.75
Assumptions
• The data collected represent random,
independent samples
• The test statistic describes the pattern of
interest
• The randomization creates an appropriate
null distribution for the question
Advantages
• It makes clear and explicit the underlying
assumptions and the structure of the null
hypothesis
• It does not require the assumption that the data
are sampled from a specified probability
distribution, such as the normal
Disadvantages
• It is computer intensive and is not included
in most traditional statistical packages
• Different analyses of the same data set
can yield slightly different answers
• The domain of inference for a Monte Carlo
analysis is subtly more restrictive than that
for a parametric analysis
Parametric analysis
• Refers to statistical tests built on the assumption that the data being analyzed were sampled from a specified distribution
• Most statistical tests specify the normal distribution
1. Specify the test statistic
2. Specify the null distribution
3. Calculate the tail probability
1. Specify the test statistic: the t-test

t = (X̄1 − X̄2) / s(X̄1−X̄2)

s(X̄1−X̄2) = sqrt( [(df1·s1² + df2·s2²) / dfT] · (1/N1 + 1/N2) )
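Plugging the sample values into the formula above; Python here serves only as a calculator (the deck's own analysis used a statistical package):

```python
import math

# Ant nest counts (from the data table above)
forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # sample variance with n - 1 in the denominator
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n1, n2 = len(forest), len(field)
df1, df2 = n1 - 1, n2 - 1
dfT = df1 + df2
# Pooled standard error of the difference between means
se = math.sqrt((df1 * var(forest) + df2 * var(field)) / dfT * (1 / n1 + 1 / n2))
t = (mean(forest) - mean(field)) / se
print(round(t, 5))  # -2.96319, matching the t-test results below
```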
[Figure: Forest and Field distributions under the null hypothesis]

2. Specify the null distribution

[Figure: t distribution with the critical value marked]
3. Calculate the tail probability: Student's t table

df\p  0.4       0.25      0.1       0.05      0.025     0.01      0.005     0.0005
1     0.324920  1.000000  3.077684  6.313752  12.70620  31.82052  63.65674  636.6192
2     0.288675  0.816497  1.885618  2.919986  4.30265   6.96456   9.92484   31.5991
3     0.276671  0.764892  1.637744  2.353363  3.18245   4.54070   5.84091   12.9240
4     0.270722  0.740697  1.533206  2.131847  2.77645   3.74695   4.60409   8.6103
5     0.267181  0.726687  1.475884  2.015048  2.57058   3.36493   4.03214   6.8688
6     0.264835  0.717558  1.439756  1.943180  2.44691   3.14267   3.70743   5.9588
7     0.263167  0.711142  1.414924  1.894579  2.36462   2.99795   3.49948   5.4079
8     0.261921  0.706387  1.396815  1.859548  2.30600   2.89646   3.35539   5.0413
http://www.statsoft.com/textbook/sttable.html#t
Results of t-test

Levene's Test for Equality of Variances: F = 0.4255, Sig. = 0.5324

t-test for Equality of Means:
                             t         df    Sig. (2-tailed)  Mean Difference
Equal variances assumed      -2.96319  8     0.018            -3.75
Equal variances not assumed  -3.21265  7.95  0.012            -3.75

Group statistics:
Habitat  N  Mean   Std. Deviation  Std. Error Mean
Forest   6  7      2.19            0.89
Field    4  10.75  1.50            0.75
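The Sig. (2-tailed) value for the equal-variances case can be checked by numerically integrating the Student's t density; this is a standard-library sketch of the tail-probability idea, not how the statistical package computes it:

```python
import math

def t_density(t, df):
    # Student's t probability density function
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_obs, df, upper=100.0, steps=100000):
    # Area under the density beyond |t_obs| (trapezoidal rule), doubled
    a = abs(t_obs)
    h = (upper - a) / steps
    area = 0.5 * (t_density(a, df) + t_density(upper, df))
    for i in range(1, steps):
        area += t_density(a + i * h, df)
    return 2 * area * h

print(round(two_tailed_p(2.96319, 8), 3))  # 0.018, the Sig. (2-tailed) above
```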
Assumptions
• The data collected represent random,
independent samples
• The data were sampled from a specified
distribution
Advantages
• It uses a powerful framework based on known
probability distributions
Disadvantages
• It may not be as powerful as sophisticated
Monte Carlo models that are tailored to
particular questions or data
• It rarely incorporates a priori information
or results from other experiments
What About Non-Parametric
Analyses?
• Essentially, these analyses give the P-values
that would be obtained by ranking the
observations and then performing randomization
tests on the ranked data
• Like other resampling methods, non-parametric
analyses do not require distributional
assumptions.
• However, they have less power than the
equivalent parametric tests and can only be
used with simple experimental designs.
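The link between ranking and randomization can be illustrated on the ant-nest data. This Python sketch uses the difference in mean ranks as the test statistic (a Mann-Whitney-style choice made here for illustration, not taken from the deck):

```python
import random

# Ant nest counts (from the data table above)
forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]
data = forest + field

# Replace each observation by its rank (average rank for ties)
sorted_vals = sorted(data)
def avg_rank(v):
    positions = [i + 1 for i, s in enumerate(sorted_vals) if s == v]
    return sum(positions) / len(positions)
ranks = [avg_rank(v) for v in data]

# Test statistic: difference in mean ranks (Field minus Forest)
obs = sum(ranks[6:]) / 4 - sum(ranks[:6]) / 6

# Randomization test performed on the ranked data
random.seed(2)
n_sims, n_extreme = 10000, 0
for _ in range(n_sims):
    shuffled = ranks[:]
    random.shuffle(shuffled)
    if sum(shuffled[6:]) / 4 - sum(shuffled[:6]) / 6 >= obs:
        n_extreme += 1
print(n_extreme / n_sims)  # one-tailed P-value on the ranks
```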
Bayesian analysis
• It includes prior information and then uses current data to build on earlier results
• It also allows us to quantify the probability of the observed difference [i.e., P(Ha | data)]
1. Specify the hypothesis
2. Specify parameters as random variables
3. Specify the prior probability distribution
4. Calculate the likelihood
5. Calculate the posterior probability distribution
6. Interpret the results
1. Specify the hypothesis
• The primary goal of a Bayesian analysis is to
determine the probability of the hypothesis
given the data P(H | data)
• The hypothesis needs to be quite specific, and it needs to be quantitative:
P(diff > 2 | diffobs = 3.75)
P(hypothesis | data)

P(hypothesis | data) = P(hypothesis) × P(data | hypothesis) / P(data)
The left hand side of the equation is
called the posterior probability
distribution, and is the quantity of
interest
The right hand side of the equation
consists of a fraction. In the numerator,
the term P(hypothesis) is the prior
probability distribution, and is the
probability of the hypothesis of interest
before you conducted the experiment
The next term in the numerator is referred
as the likelihood of the data; it reflects
the probability of observing the data given
the hypothesis
The denominator is a normalizing
constant that reflects the probability of the
data given all possible hypotheses. It
scales the posterior probability
distribution to the range [0,1].
P(hypothesis | data) ∝ P(hypothesis) × P(data | hypothesis)

We can focus our attention on the numerator
2. Specify the parameters as random variables

λforest ~ N(μforest, σ²)
λfield ~ N(μfield, σ²)

The type of random variable used for each population parameter should reflect biological reality or mathematical convenience.
3. Specify the prior probability distribution
• We can combine and re-analyze data from the literature, talk to experts, etc., to come up with reasonable estimates for the density of ant nests in fields and forests
• OR, we can use an “uninformative prior”, for which we initially estimate the density of ant nests to be equal to zero and the variances to be very large

μforest ~ N(0, 10000)
sigma ~ dunif(0, 10)
τ = 1/σ²
OpenBUGS code

# Define BUGS model
ttestmodel <- function(){
  # Priors
  mu1 ~ dnorm(0, 0.001)
  delta ~ dnorm(0, 0.001)
  tau <- 1/(sigma*sigma)
  sigma ~ dunif(0, 10)
  # Likelihood
  for (i in 1:n){
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- mu1 + delta*x[i]
    residual[i] <- y[i] - mu[i]
  }
  # Derived quantities
  mu2 <- mu1 + delta
}
write.model(ttestmodel, "ttestmodel.txt")
Comparison between approaches

Parametric:
• Null hypothesis: P(data | H0)
• P(tobs = 2.96 | ttheoretical = 1.86)
• Parameters are fixed

Bayesian:
• Hypothesis: P(H | data)
• P(diff > 2 | diffobs = 3.75)
• Parameters are random variables
4. Calculate the likelihood

[Figure: likelihood surfaces over the mean and variance for Forest and Field; the peak marks the maximum likelihood estimate]

The likelihood is a distribution that is proportional to the probability of the observed data given the hypothesis.
5. Calculate the posterior probability distribution
• We multiply the prior by the likelihood, and divide by the normalizing constant
• In contrast to the results of the parametric or Monte Carlo analysis, the result of a Bayesian analysis is a probability distribution, not a single P-value
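Steps 3–5 can be sketched with a brute-force grid approximation in Python (a stand-in for the OpenBUGS sampler above): flat priors over a wide grid play the role of the uninformative priors, and the marginal posterior of delta gives P(diff > 2 | data). The grid ranges and step sizes are arbitrary choices, so the result only approximates the MCMC output:

```python
import math

# Ant nest counts (from the data table above)
forest = [9, 6, 4, 6, 7, 10]
field = [12, 9, 12, 10]
n = len(forest) + len(field)

# Grids over the model parameters; flat priors on these ranges, so the
# posterior is proportional to the likelihood (prior x likelihood)
mus    = [i * 0.25 for i in range(0, 61)]    # mu (Forest mean): 0 .. 15
deltas = [i * 0.25 for i in range(-20, 61)]  # delta (difference): -5 .. 15
sigmas = [i * 0.25 for i in range(1, 41)]    # sigma: 0.25 .. 10

post_delta = dict.fromkeys(deltas, 0.0)  # unnormalized marginal posterior
for mu in mus:
    for d in deltas:
        # Sum of squared residuals under y_i = mu + d * x_i
        sse = (sum((y - mu) ** 2 for y in forest)
               + sum((y - mu - d) ** 2 for y in field))
        for s in sigmas:
            # Normal likelihood of the data (constant factors dropped)
            post_delta[d] += math.exp(-n * math.log(s) - sse / (2 * s * s))

total = sum(post_delta.values())
p_gt2 = sum(v for d, v in post_delta.items() if d > 2) / total
print(round(p_gt2, 2))
```

The printed probability lands near the P = 0.87 reported from the MCMC output below; the small discrepancy reflects the crude grid and the slightly different priors.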
Bayesian output

[Figure: posterior density plots for the Forest mean (a[1], 650,000 samples) and the Field mean (a[2], 650,000 samples), box plots of the two means, and the posterior distribution of delta, the difference (chains 1:3, 2,997 samples)]
Estimates

Analysis                       λForest  λField  σForest  σField
Parametric                     7.00     10.75   2.19     1.50
Bayesian, uninformative prior  7.05     10.74   1.01     1.25
6. Interpreting the Results
• Given the Bayesian estimate of the mean difference = 3.698, P(diff > 2 | diffobs = 3.75) = 0.87 (2607/2997)
• In other words, the analysis indicates that there is a probability of 0.87 that ant nest densities in the two habitats differ by more than 2 nests
Assumptions
• The data collected represent random,
independent samples
• The parameters to be estimated are
random variables with known distributions
Advantages
• It allows for the explicit incorporation of prior
information, and the results from one
experiment can be used to inform subsequent
experiments
• The results are interpreted in an intuitively
straightforward way, and the inferences are
conditional on both the observed data and the
prior information
Disadvantages
• It poses computational challenges and requires conditioning the hypothesis on the data
• Potential lack of objectivity, because
different results will be obtained using
different priors