Bayesian Analysis of Informative Hypotheses

Topic Discussion:
Bayesian Analysis of
Informative Hypotheses
Quantitative Methods Forum
20 January 2014
Outline
Articles:
1. A gentle introduction to Bayesian analysis:
Applications to developmental research
2. Moving beyond traditional null hypothesis
testing: Evaluating expectations directly
3. A prior predictive loss function for the
evaluation of inequality constrained
hypotheses
Note: The author, Rens van de Schoot, was awarded the APA Division 5
dissertation award in 2013.
2
A gentle introduction to Bayesian
analysis: Applications to
developmental research.
3
Probability
• Frequentist Paradigm
– R. A. Fisher, Jerzy Neyman, Egon Pearson
– Long-run frequency
• Subjective Probability Paradigm
– Bayes’ theorem
– Probability as the subjective experience of
uncertainty
4
5
Ingredients of Bayesian Statistics
• Prior distribution
– Encompasses background knowledge on parameters
tested
– Parameters of prior distribution called
hyperparameters
• Likelihood function
– Information in the data
• Posterior Inference
– Combination of prior and likelihood via Bayes’
theorem
– Reflects updated knowledge, balancing prior and
observed information
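A minimal sketch of how the three ingredients combine, assuming a normal mean with known data standard deviation and a normal prior (a conjugate case chosen only for illustration; the data values and hyperparameters are hypothetical):

```python
import math

def normal_posterior(prior_mean, prior_sd, data, sigma):
    """Posterior mean and sd of a normal mean when the data sd `sigma` is known."""
    n = len(data)
    ybar = sum(data) / n
    prior_prec = 1.0 / prior_sd**2      # precision carried by the prior (hyperparameters)
    data_prec = n / sigma**2            # precision carried by the likelihood
    post_prec = prior_prec + data_prec  # precisions add when prior and likelihood combine
    post_mean = (prior_prec * prior_mean + data_prec * ybar) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

data = [4.8, 5.6, 5.1, 4.9, 5.4]        # hypothetical observations
post_mean, post_sd = normal_posterior(prior_mean=4.0, prior_sd=1.0, data=data, sigma=1.0)
print(post_mean, post_sd)
```

The posterior mean falls between the prior mean and the sample mean, which is the "balancing" described in the last bullet.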
6
Defining Prior Knowledge
• Lack of information
– Still important to quantify ignorance
– Noninformative: Uniform Distribution
• Akin to a Frequentist analysis
• Considerable information
– Meta-analyses, previous studies
• Sensitivity analyses may be conducted to
quantify effect of different prior specifications
• Priors reflect knowledge about model
parameters before observing data.
7
Effect of Priors
Horse and donkey analogy
Benefits of priors:
– Incorporate findings from previous studies
– Smaller Bayesian credible intervals (cf. confidence
intervals)
• Credible intervals are also known as posterior probability
intervals (PPI)
• PPI gives the probability that a certain parameter lies within
the interval.
The more prior information available, the smaller
the credible intervals.
When priors are misspecified, posterior results are
affected.
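The two closing statements can be illustrated with a small sketch, assuming the same kind of conjugate normal model with known data standard deviation; the sample mean, sample size, and prior settings are hypothetical:

```python
from statistics import NormalDist

def credible_interval(prior_mean, prior_sd, ybar, n, sigma=1.0, level=0.95):
    """Central posterior probability interval for a normal mean (known sigma)."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = n / sigma**2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * ybar) / post_prec
    post_sd = post_prec ** -0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return post_mean - z * post_sd, post_mean + z * post_sd

ybar, n = 5.0, 10                        # hypothetical sample mean and size

# Well-centred priors of increasing informativeness (decreasing prior sd).
widths = {}
for prior_sd in (100.0, 1.0, 0.25):
    lo, hi = credible_interval(5.0, prior_sd, ybar, n)
    widths[prior_sd] = hi - lo

# A confident prior centred far from the data pulls the interval away from ybar.
lo_bad, hi_bad = credible_interval(0.0, 0.25, ybar, n)
print(widths, (lo_bad, hi_bad))
```

In this sketch the interval narrows as the prior becomes more informative, and the misspecified prior yields an interval that no longer covers the sample mean.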
8
9
Empirical Example
Theory of dynamic
interactionism:
• Individuals believed to
develop through a
dynamic and reciprocal
transaction between
personality and
environment
3 studies:
• Neyer & Asendorpf (2001)
• Sturaro et al. (2010)
• Asendorpf & van Aken
(2003)
Note: Neyer & Asendorpf involved young adults;
Sturaro et al. and Asendorpf & van Aken involved
adolescents.
10
Analytic Strategy
• Prior Specification
– Used frequentist estimates from one study as priors for the next
• Assessment of convergence
– Gelman-Rubin criterion and other ‘tuning’ variables
• Cutoff value
• Minimum number of iterations
• Start values
• Examination of trace plots
• Model fit assessed with posterior predictive
checking
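As a rough illustration of the Gelman-Rubin criterion, the sketch below computes the potential scale reduction factor from several chains. The chains are simulated rather than taken from the studies, and the formula is the basic (non-split) version:

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor for a list of equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                       # B
    var_hat = (n - 1) / n * within + between / n   # pooled posterior variance estimate
    return (var_hat / within) ** 0.5

random.seed(0)
# Four chains sampling the same target: should look converged.
mixed = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]
# Same chains, but two of them shifted: mimics chains stuck in different regions.
offsets = [0.0, 0.0, 3.0, 3.0]
stuck = [[x + off for x in chain] for chain, off in zip(mixed, offsets)]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 suggest the chains agree; the cutoff actually used in the analysis is not stated on the slide.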
11
Results 1
12
Results 2
13
Observations
• Point estimates do not differ between
Frequentist and Bayesian approaches.
• Credible intervals are smaller than
confidence intervals.
• Using prior knowledge in the analyses led
to more certainty about outcomes of the
analyses; i.e., more confidence (precision)
in conclusions.
14
Theoretical Advantages of
Bayesian Approach
• Interpretation
– More intuitive because the focus is on predictive accuracy
– Bayesian framework eliminates contradictions in
traditional NHST
• Offers more direct expression of uncertainty
• Updating knowledge
– Incorporate prior information into estimates instead of
conducting NHST repeatedly.
NHST = Null Hypothesis Significance Testing
15
Practical Advantages of
Bayesian Approach
• Smaller sample sizes required for Bayesian
estimation compared to Frequentist approaches.
• In the context of small sample sizes, Bayesian
methods produce slowly increasing
confidence regarding coefficients compared to
Frequentist approaches.
• Bayesian methods can handle non-normal
parameters better than Frequentist approaches.
• Protection against overinterpreting unlikely
results.
• Elimination of inadmissible parameters.
16
Limitations
• Influence of prior specification.
• Prior distribution specification.
– Assumption that every parameter has a
distribution.
• Computational time
DIAGNOSTICS?
17
Comments?
18
Moving beyond traditional null
hypothesis testing: Evaluating
expectations directly.
19
What is “wrong” with the
traditional H0?
Example: Shape of the earth
H0: The shape of the earth is a flat disk
H1: The shape of the earth is not a flat disk
Evidence gathered against H0.
Conclusion: The earth is not a flat disk
→ modification of testable hypotheses.
20
HA: The shape of the earth is a flat disk
HB: The shape of the earth is a sphere
HA is no longer the complement of HB.
Instead, HA and HB are competing models
regarding the shape of the earth.
Testing of such competing hypotheses will
result in a more informative conclusion.
21
What does this example teach us?
The evaluation of informative hypotheses
presupposes that prior information is available.
Prior knowledge is available in the form of specific
expectations of the ordering of statistical
parameters.
Example: Mean comparisons
HI1: μ3 < μ1 < μ5 < μ2 < μ4
HI2: μ3 < {μ1, μ5, μ2} < μ4
where “,” denotes no ordering
versus traditional setup
H0: μ1 = μ2 = μ3 = μ4 = μ5
Hu: μ1, μ2, μ3, μ4, μ5
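One way to see how much more specific the informative hypotheses are than Hu is to count the strict orderings of the five means that each one admits. This is only an illustrative sketch (ties are ignored, and the label strings are made up for the example), not a procedure taken from the articles:

```python
from itertools import permutations

labels = ("mu1", "mu2", "mu3", "mu4", "mu5")

def satisfies_hi1(order):
    # `order` lists the means from smallest to largest.
    return order == ("mu3", "mu1", "mu5", "mu2", "mu4")

def satisfies_hi2(order):
    # mu3 is smallest, mu4 is largest, the middle three are left unordered.
    return order[0] == "mu3" and order[-1] == "mu4"

orderings = list(permutations(labels))   # every ordering allowed under Hu
n_total = len(orderings)
n_hi1 = sum(satisfies_hi1(o) for o in orderings)
n_hi2 = sum(satisfies_hi2(o) for o in orderings)
print(n_total, n_hi1, n_hi2)
```

HI1 admits a single ordering out of 120 and HI2 admits six, so data that agree with either hypothesis support a much sharper claim than Hu can express.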
22
Evaluating Informative Hypotheses
• Hypothesis Testing Approaches
– F-bar test for ANOVA (Silvapulle, et al., 2002; Silvapulle & Sen,
2004)
– Constraints on variance terms in SEM (Stoel, et al., 2006; Gonzalez
& Griffin, 2001)
• Model Selection Approaches
– Evaluate competing models for model fit and model complexity.
• Akaike Information Criterion (AIC; Akaike, 1973)
• Bayesian Information Criterion (BIC; Schwarz, 1978)
• Deviance Information Criterion (DIC; Spiegelhalter, et al., 2002)
– These cannot deal with inequality constraints
• Paired Comparison Information Criterion (PCIC; Dayton, 1998 & 2003)
• Order restricted Information Criterion (ORIC; Anraku, 1999; Kuiper, et
al., in press)
• Bayes Factor
23
Comments?
24
A prior predictive loss function for
the evaluation of inequality
constrained hypotheses.
25
Inequality Constrained Hypotheses
Example 1
General linear model with two group means:
$y_i = \mu_1 d_{i1} + \mu_2 d_{i2} + \varepsilon_i$
$\varepsilon_i \sim N(0, \sigma^2)$
$d_{ig}$ takes on 0 or 1 to indicate group
H0: μ1, μ2 (unconstrained hypothesis)
H1: μ1 < μ2 (inequality constraint imposed)
26
Deviance Information Criterion
$D(\theta) = -2 \log p(y \mid \theta) + C$
y: data
θ: unknown parameter
p(·): likelihood
C: constant
Taking the expectation:
$\bar{D} = E_{\theta}[D(\theta)]$
A measure of how well the model fits the data.
Effective number of parameters:
$p_D = \bar{D} - D(\bar{\theta})$
$\bar{\theta}$: expectation of θ
$\mathrm{DIC} = D(\bar{\theta}) + 2 p_D$
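A hedged sketch of how these quantities could be computed from posterior draws, assuming i.i.d. normal data with known standard deviation 1, a flat prior on the mean (so the posterior can be sampled directly), and C = 0; the data are simulated:

```python
import math
import random
import statistics

def deviance(theta, y, sigma=1.0):
    """D(theta) = -2 log p(y | theta) for i.i.d. normal data, taking C = 0."""
    n = len(y)
    sse = sum((yi - theta) ** 2 for yi in y)
    return n * math.log(2 * math.pi * sigma**2) + sse / sigma**2

random.seed(1)
y = [random.gauss(0.5, 1.0) for _ in range(50)]      # simulated data

# Under a flat prior the posterior of theta is N(ybar, sigma^2 / n).
ybar = statistics.fmean(y)
draws = [random.gauss(ybar, 1.0 / math.sqrt(len(y))) for _ in range(20000)]

d_bar = statistics.fmean([deviance(t, y) for t in draws])   # posterior mean deviance
theta_bar = statistics.fmean(draws)                          # posterior mean of theta
p_d = d_bar - deviance(theta_bar, y)                         # effective number of parameters
dic = deviance(theta_bar, y) + 2 * p_d
print(d_bar, p_d, dic)
```

With one free parameter the estimate of p_D comes out close to 1, and the DIC equals the mean deviance plus p_D.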
27
More on DIC
$\mathrm{DIC} = D(\bar{\theta}) + 2 p_D$
model fit + penalty for model complexity
• Smaller is better.
• Only valid when posterior distribution approximates
multivariate normal.
• Assumes that the specified parametric family of pdfs generating
future observations encompasses the true model.
(Can be violated.)
• Data y used to construct posterior distribution AND evaluate
estimated model → DIC selects overfitting models.
• Solution: Bayesian predictive information criterion.
28
Bayesian Predictive
Information Criterion
• Developed by Ando (2007) to avoid
overfitting problems associated with DIC.
$\mathrm{BPIC} = -2 E_{\theta}[\log p(y \mid \theta)] + 2 p_D$
… looks like the posterior DIC presented in
van de Schoot et al. (2012).
29
Posterior (Predictive?) DIC
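$\mathrm{BPIC} = -2 E_{\theta}[\log p(y \mid \theta)] + 2 p_D$
$\mathrm{postDIC} = E_{g(\theta \mid y)} E_{f(x \mid \theta)} [-2 \log f(x \mid \bar{\theta}_y)]$
$\approx -2 \log f(y \mid \bar{\theta}_y) + 2[2 \log f(y \mid \bar{\theta}_y) - 2 E_{g(\theta \mid y)} \log f(y \mid \theta)]$
Are these the same?
Note: $p_D = \bar{D} - D(\bar{\theta})$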
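One way to approach the question is to rewrite both criteria in terms of the mean deviance and the effective number of parameters. The following is a sketch only, assuming C = 0 so that the deviance is -2 log p(y | θ), and assuming the expectation in BPIC is taken over the posterior:

```latex
\begin{align*}
\mathrm{DIC}  &= D(\bar{\theta}) + 2 p_D = \bar{D} + p_D, \\
\mathrm{BPIC} &= -2\, E_{\theta}[\log p(y \mid \theta)] + 2 p_D = \bar{D} + 2 p_D = \mathrm{DIC} + p_D .
\end{align*}
```

Under these assumptions the approximation to postDIC has the DIC form (deviance at the posterior mean plus twice p_D), while BPIC as written carries one additional p_D of penalty.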
30
Performance of postDIC
H0: μ1, μ2
H1: μ1 < μ2
H2: μ1 = μ2
Data generated to be
consistent (cases 1 to 4) or
reversed in direction (cases
5 to 7) with H2.
postDIC does not distinguish
H0 from H1.
Recall: Smaller is better.
31
Prior DIC
$\mathrm{prDIC} = E_{h(\theta)} E_{f(x \mid \theta)} [-2 \log f(x \mid \bar{\theta}_y)]$
$\approx C + 2 \log f(y \mid \bar{\theta}_y) + E_{h(\theta)}[-2 \log f(y \mid \theta)]$
• Specification of the prior distribution has more
importance for prDIC than postDIC.
What’s the intuitive difference between a posterior
predictive vs. prior predictive approach?
32
Performance of prDIC
H0: μ1, μ2
H1: μ1 < μ2
H2: μ1 = μ2
Data generated to be
consistent (cases 1 to 4) or
reversed in direction (cases
5 to 7) with H2.
prDIC distinguishes H0 from
H1 when data are in
agreement with H1. But it
chooses a badly fitting model
for cases 5 to 7.
Recall: Smaller is better.
33
Prior (Predictive?)
Information Criterion
$\mathrm{PIC} = E_{h(\theta)}[-2 \log f(y \mid \theta)]$
Omits $C + 2 \log f(y \mid \bar{\theta}_y)$ from prDIC.
New loss function accounts for agreement
between $\bar{\theta}_y$ and y.
It quantifies how well replicated data x fits a
certain hypothesis, and how well the
hypothesis fits the data y.
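A Monte Carlo sketch of the prior expectation of -2 log f(y | θ) for the two-group example, comparing the unconstrained H0 with H1: μ1 < μ2. Everything specific here is an assumption made for illustration and not the paper's specification: known σ = 1, independent N(0, 2²) priors on both means, rejection sampling to truncate the prior to the constraint, and made-up data.

```python
import math
import random
import statistics

def minus2_loglik(mu1, mu2, y1, y2, sigma=1.0):
    """-2 log f(y | mu1, mu2) for two normal groups with known sigma."""
    n = len(y1) + len(y2)
    sse = sum((v - mu1) ** 2 for v in y1) + sum((v - mu2) ** 2 for v in y2)
    return n * math.log(2 * math.pi * sigma**2) + sse / sigma**2

def pic(y1, y2, constraint, n_draws=20000, prior_sd=2.0, seed=0):
    """Monte Carlo estimate of E_h[-2 log f(y | theta)] under a truncated prior."""
    rng = random.Random(seed)
    losses = []
    while len(losses) < n_draws:
        mu1, mu2 = rng.gauss(0.0, prior_sd), rng.gauss(0.0, prior_sd)
        if constraint(mu1, mu2):   # rejection step: keep draws that obey the hypothesis
            losses.append(minus2_loglik(mu1, mu2, y1, y2))
    return statistics.fmean(losses)

def h0(mu1, mu2):
    return True                    # unconstrained

def h1(mu1, mu2):
    return mu1 < mu2               # inequality constraint

low = [-1.0, -0.5, 0.0, 0.5, 1.0]          # hypothetical group with mean 0
high = [v + 1.0 for v in low]              # hypothetical group with mean 1

# Data in agreement with H1 (group 1 lower than group 2).
pic_h0_agree, pic_h1_agree = pic(low, high, h0), pic(low, high, h1)
# Data reversed in direction (group 1 higher than group 2).
pic_h0_rev, pic_h1_rev = pic(high, low, h0), pic(high, low, h1)
print(pic_h0_agree, pic_h1_agree, pic_h0_rev, pic_h1_rev)
```

Since smaller is better, this sketch prefers H1 when the data agree with the constraint and prefers H0 when the data are reversed.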
34
Performance of PIC
H0: μ1, μ2
H1: μ1 < μ2
H2: μ1 = μ2
Data generated to be
consistent (cases 1 to 4) or
reversed in direction (cases
5 to 7) with H2.
PIC selects the hypothesis
that is most consistent with
the data – outperforming
postDIC and prDIC. (?)
Recall: Smaller is better.
35
Paper Conclusions
• The (posterior?) DIC performs poorly
when evaluating inequality constrained
hypotheses.
• The prior DIC can be useful for model
selection when the population from which
the data are generated is in agreement
with the constrained hypotheses.
• The PIC, which is related to the marginal
likelihood, is better suited to selecting the best of a set of
inequality constrained hypotheses.
36
Comments?
37