Week 5: Statistical approaches

Why do we do statistics?
Test hypotheses
Estimate some number / make predictions
Help sort out what’s going on in complex data sets (data mining)
Convince others that your interpretation of the data is right
How do we define probability?
Long-run frequency of event
Fraction of sample-space occupied by event
Degree of ‘belief’ associated with outcome
Statistical schools of thought: Frequentists
Parameters fixed
Derive probability statements from hypothetical repetitions of same experiment
Neyman-Pearson (hypothesis tests, confidence intervals)
Popperian tests of hypotheses
Standard confidence intervals
Very limited in what you can analyze
Lots of results and approximations derived for special cases.
Maximum likelihood (Fisher)
Handle models of arbitrary complexity
Integrate information across multiple data sets
Sometimes hard to find the maximum
Estimates are biased; confidence statements are asymptotic
2 ln[L1/L2] ~ χ²(p1 - p2)  (likelihood-ratio statistic for nested models with p1 and p2 parameters)
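As a rough illustration of how this statistic is used (the log-likelihoods below are made-up values, not from any data set in these notes):

```python
# Minimal sketch of a likelihood-ratio test between nested models.
from scipy.stats import chi2

logL_simple = -123.4   # maximized log-likelihood, p = 1 parameter (made-up value)
logL_full   = -120.1   # maximized log-likelihood, p = 2 parameters (made-up value)

stat = 2 * (logL_full - logL_simple)   # 2*ln[L_full / L_simple]
df = 2 - 1                             # difference in number of parameters
p_value = chi2.sf(stat, df)            # upper-tail chi-square probability
print(stat, p_value)                   # 6.6, ~0.010
```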
AIC
Discard sampling ideas in favor of 'information-theoretic' reasoning
Discount the log-likelihood for model complexity: AIC = -2 ln L + 2k, where k is the number of parameters
Seems to perform well in simulations
But, be careful of implicit priors!
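A minimal sketch of an AIC comparison, again with hypothetical log-likelihoods rather than values from these notes:

```python
# Compare models by AIC = -2*lnL + 2k; smaller is better.
logL = {"simple": -123.4, "full": -120.1}   # hypothetical maximized log-likelihoods
k    = {"simple": 1,       "full": 2}       # number of parameters in each model
aic  = {m: -2 * logL[m] + 2 * k[m] for m in logL}
print(aic, min(aic, key=aic.get))           # the 'full' model wins here
```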
Resampling
Bootstrap – resample with replacement (see the sketch after this list)
Jackknife – leave one observation out
Permutation – evaluate all permutations of the data
No assumptions about underlying distribution
Computer intensive
But, need plenty of data for the answer to be reasonable
Sometimes hard to see what ‘null model’ really is.
Sort of a bridge between frequentist and Bayes, since all inference is drawn from a single data set that is used to construct distributions for parameters, but the ideology is more 'frequentist'
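A minimal bootstrap sketch with simulated data (the sample, the statistic, and the number of resamples are all arbitrary choices here); the jackknife and permutation tests follow the same resample-and-recompute pattern:

```python
# Nonparametric bootstrap of the sample mean: resample with replacement,
# recompute the statistic, and read off a percentile interval.
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=5, size=30)          # made-up data set
boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(x.mean(), lo, hi)                  # point estimate and 95% bootstrap interval
```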
Statistical schools of thought: Bayesian
Parameters are random (in the subjectivist sense)
Draw inference about parameters / hypotheses from a single data set using Bayes' rule.
P(θ | data) = P(data | θ) P(θ) / P(data)
Posterior = likelihood * prior / normalizing constant
Use likelihood to describe information in data
Parameters are random
Need to define prior
Probability statements regarding parameters are ‘subjective’
Priors are not invariant: if we changed the scale of measurement, we would get a different answer
People with different priors will arrive at different conclusions
The importance of prior information: Medical screening
You are being screened at random for a rare (1 in 10,000) but awful disease
Screening test for disease X
P(pos result | sick) = 0.95
P(pos result | not sick) = 0.05
You get back a positive result – how likely is it that you are sick?
Maximum Likelihood:
The likelihood ratio is 0.95/0.05 = 19:1 so you are probably sick
Bayes
P(sick | pos result) = P(pos | sick)*P(sick) / P(pos) = 0.95*0.0001 / 0.0501 ≈ 0.0019
{ Since
P(pos result) = [P(pos | sick)*P(sick) + P(pos | not sick)*P(not sick)]
= 0.95*0.0001 + 0.05*0.9999
≈ 0.0501
}
Posterior odds of being sick are roughly 525 to 1 against, so you are probably fine
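The arithmetic above is easy to check directly (a small sketch using the numbers from this slide):

```python
# Bayes' rule for the screening example: P(sick | positive test).
p_sick = 1 / 10_000
p_pos_given_sick = 0.95
p_pos_given_healthy = 0.05

p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_pos, p_sick_given_pos)   # ~0.0501, ~0.0019
```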
An ecological example – egg production v. temperature
Limited prior data
But counts of eggs are probably not normal
[Figure: OLS regression of # eggs against temperature; fitted line y = 0.4*x + 4.8]
To compare with OLS:
A generalized linear model with a Poisson likelihood and an 'identity' link
A Bayesian version of the same model with 'flat' priors
cᵢ ~ Poisson(λᵢ)
λᵢ = a + b·Tᵢ
1. Obtain maximum likelihood estimates and 95% confidence intervals (a sketch follows below)
2. Assume flat prior and obtain Bayesian posterior mode and ‘credible region’
3. Compare with point/intervals from OLS
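A minimal sketch of step 1, fitting cᵢ ~ Poisson(a + b·Tᵢ) by maximum likelihood with a general-purpose optimizer; the data here are simulated (the temperatures, sample size, and "true" coefficients are made up, not the egg counts from the figure):

```python
# Maximum-likelihood fit of a Poisson model with an identity link.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(0)
T = rng.uniform(10, 30, size=40)          # simulated temperatures
c = rng.poisson(4.8 + 0.4 * T)            # simulated egg counts

def negloglik(params):
    a, b = params
    lam = a + b * T
    if np.any(lam <= 0):                  # identity link: the mean must stay positive
        return np.inf
    return -poisson.logpmf(c, lam).sum()

fit = minimize(negloglik, x0=[5.0, 0.4], method="Nelder-Mead")
print(fit.x)                              # maximum-likelihood estimates of a and b
```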
A brief digression on independence and likelihood
Joint log-likelihood for a and b
[Figure: joint log-likelihood surface over a and b]
Maximum likelihood
Profiles obtained by maximizing likelihood over other parameter
[Figure: profile likelihoods for a and b]
Since 2[ln max(L) - ln L] ~ χ²[1], and P(χ²[1] < 3.84) = 0.95, we look for parameters such that L / max(L) ≥ exp(-3.84/2) ≈ 0.146.
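The cutoff quoted above comes straight from the χ² quantile (a one-line check):

```python
# Relative-likelihood cutoff for a 95% profile-likelihood interval.
import numpy as np
from scipy.stats import chi2

cutoff = np.exp(-chi2.ppf(0.95, df=1) / 2)   # exp(-3.84/2)
print(cutoff)                                # ~0.1465
```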
Bayesian estimates
Marginal posterior
Marginals obtained by integrating over other parameter
[Figure: marginal posterior distributions for a and b]
Find tail probabilities such that 95% of the distribution lies inside the interval.
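A minimal sketch of step 2 under a flat prior: evaluate the likelihood on a grid of (a, b) values and sum over the other parameter to get each marginal. The grid ranges and the simulated data are arbitrary choices matching the ML sketch above, not the lecture's data:

```python
# Grid approximation to the joint posterior with flat priors, then marginals.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(0)
T = rng.uniform(10, 30, size=40)              # same simulated data as before
c = rng.poisson(4.8 + 0.4 * T)

a_grid = np.linspace(-5, 20, 201)
b_grid = np.linspace(-1, 1, 201)
logpost = np.full((a_grid.size, b_grid.size), -np.inf)
for i, a in enumerate(a_grid):
    for j, b in enumerate(b_grid):
        lam = a + b * T
        if np.all(lam > 0):
            logpost[i, j] = poisson.logpmf(c, lam).sum()   # flat prior: posterior ∝ likelihood

post = np.exp(logpost - logpost.max())
post /= post.sum()
marg_a = post.sum(axis=1)                     # marginal posterior of a
marg_b = post.sum(axis=0)                     # marginal posterior of b

# 95% equal-tail credible interval for b from the marginal's CDF.
cdf_b = np.cumsum(marg_b)
print(b_grid[np.searchsorted(cdf_b, 0.025)], b_grid[np.searchsorted(cdf_b, 0.975)])
```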
Comparison of intervals
[Figure: point estimates and intervals for a and b from likelihood, Bayes, and OLS]

a               estimate   lower     upper
MLE             5.2013     0.3805    10.814
Bayes           5.3423     0.4451    10.9432
OLS             4.8115    -1.4217    11.0446
Bootstrap OLS   4.6375    -1.399     8.7688

b               estimate   lower     upper
MLE             0.3769     0.1381    0.6038
Bayes           0.3769     0.1361    0.6002
OLS             0.3957     0.133     0.6584
Bootstrap OLS   0.4        0.2013    0.6447
When will Likelihood and Bayes be different ?
1. Strongly informative priors (point and interval estimates will differ)
2. Likelihood is strongly asymmetric or has an odd shape (intervals will differ)
3. Likelihood has fat tails (Bayesian intervals wider)
When will Likelihood and Bayes be same?
1. Weakly informative priors and lots of data
2. Likelihood is fairly symmetric
Likelihood v. Bayes
So, what are the practical differences between likelihood and Bayes?
Likelihood
Don’t need to specify a prior
Estimates biased
Confidence statements approximate
Unclear how to propagate uncertainty
Bayes
Need a prior
No asymptotic theory required
Propagation of uncertainty automatic
'Subjectivity' of Bayes is a red herring: everyone who uses the same assumptions on the same data will get the same answer.
With flat priors, the posterior mode is the same as the MLE. Confidence intervals can still differ, but you will generally get very similar results from Bayes and likelihood, so for testing hypotheses the extra effort of Bayes is probably not worth it.
But if you 1) have usable prior information, or 2) want to use your estimates in a model or make predictions with them, Bayes offers an internally consistent way of doing so.
Hierarchical models
Share information ('borrow strength') across related data sets (see the sketch below)
Account for heterogeneity among groups
Estimating distribution of parameters across groups / partition variance
Applications:
‘Random-effects’ and ‘mixed-effects’ models
'Nested' models
‘Repeated measures’ models
Finite mixture models
‘Random regression’ models
Any model that has a stochastic model for how parameters vary across observations can be construed as a hierarchical model, e.g. hidden Markov models, spatial models, etc.
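A minimal sketch of the 'borrowing strength' idea, using a normal-normal shrinkage estimator with the between-group variance simply assumed known (a real hierarchical fit would estimate it); all numbers here are made up:

```python
# Partial pooling: shrink each group mean toward the grand mean,
# with noisier (smaller) groups shrinking more.
import numpy as np

rng = np.random.default_rng(2)
groups = [rng.normal(1.0, 0.5, size=n) for n in (20, 10, 5)]   # three made-up samples

means = np.array([g.mean() for g in groups])
se2 = np.array([g.var(ddof=1) / g.size for g in groups])   # sampling variance of each mean
grand_mean = means.mean()                                  # crude estimate of the overall mean
tau2 = 0.1                                                 # assumed between-group variance

w = tau2 / (tau2 + se2)                   # weight on each group's own data
pooled = w * means + (1 - w) * grand_mean
print(means.round(3), pooled.round(3))
```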
Combining information from samples from three ‘populations’
[Figure: estimates for three populations (N = 20, 10, 5), each modeled identically, hierarchically, and independently]
Combining information from samples from three 'populations' – lots of data
[Figure: the same comparison with N = 60 in each population, modeled identically, hierarchically, and independently]