Why do we do statistics?
- Test hypotheses
- Estimate some number / make predictions
- Help sort out what is going on in complex data sets (data mining)
- Convince others that your interpretation of the data is right

How do we define probability?
- Long-run frequency of an event
- Fraction of the sample space occupied by an event
- Degree of 'belief' associated with an outcome

Statistical schools of thought: Frequentists
- Parameters are fixed; probability statements are derived from hypothetical repetitions of the same experiment
- Neyman-Pearson (hypothesis tests, confidence intervals)
  - Popperian tests of hypotheses; standard confidence intervals
  - Very limited in what you can analyze
  - Lots of results and approximations derived for special cases
- Maximum likelihood (Fisher)
  - Handles models of arbitrary complexity; integrates information across multiple data sets
  - Sometimes hard to find the maximum; estimates are biased and confidence statements are asymptotic
  - 2 ln(L1/L2) ~ χ² with (p1 - p2) degrees of freedom
- AIC
  - Discards sampling ideas in favor of 'information-theoretic' reasoning: discount the log likelihood for model complexity
  - Seems to perform well in simulations, but be careful of implicit priors!
- Resampling
  - Bootstrap: resample with replacement
  - Jackknife: leave one out
  - Permutation: evaluate all permutations of the data
  - No assumptions about the underlying distribution, but computer intensive, you need plenty of data for the answer to be reasonable, and it is sometimes hard to see what the 'null model' really is
  - Sort of a bridge between frequentist and Bayes, since all inference is drawn from a single data set that is used to construct distributions for parameters, but the ideology is more 'frequentist'

Statistical schools of thought: Bayesian
- Parameters are random (in the subjectivist sense)
- Draw inference about parameters / hypotheses from a single data set using Bayes' rule:
  P(θ | data) = P(data | θ) P(θ) / P(data)
  posterior = likelihood * prior / normalizing constant
- Use the likelihood to describe the information in the data
- Parameters are random, so a prior must be defined, and probability statements about parameters are 'subjective'
- Priors are not invariant: if we changed the scale of measurement, we would get a different answer
- People with different priors will arrive at different conclusions

The importance of prior information: medical screening
- You are being screened at random for a rare (1 in 10,000) but awful disease
- Screening test for disease X: P(pos result | sick) = 0.95, P(pos result | not sick) = 0.05
- You get back a positive result: how likely is it that you are sick?
- Maximum likelihood: the likelihood ratio is 0.95/0.05 = 19:1, so you are probably sick
- Bayes: P(sick | pos result) = P(pos | sick) P(sick) / P(pos) = 0.95*0.0001/0.05 ≈ 0.002
  (since P(pos result) = P(pos | sick) P(sick) + P(pos | not sick) P(not sick) = 0.95*0.0001 + 0.05*0.9999 ≈ 0.05)
- The posterior odds of being sick are roughly 500 to 1 against, so you are probably fine

An ecological example: egg production vs. temperature
- Limited prior data
- But counts of eggs are probably not normal
- OLS regression
[Figure: scatterplot of egg counts (# eggs) against temperature with the OLS fit y = 0.4*x + 4.8]

To compare with OLS:
- A generalized linear model with a Poisson likelihood and 'identity' link
- A Bayesian version of the same model with 'flat' priors
- Model: c_i ~ Poisson(λ_i), with λ_i = a + b*T_i
1. Obtain maximum likelihood estimates and 95% confidence intervals
2. Assume a flat prior and obtain the Bayesian posterior mode and 'credible region'
3. Compare with the point estimates and intervals from OLS
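A minimal sketch of steps 1 and 3 is shown below, using simulated counts in place of the original egg data (which are not reproduced here); the sample size, temperature range, and random seed are illustrative assumptions. It maximizes the Poisson log likelihood for c_i ~ Poisson(a + b*T_i) directly, fits OLS for comparison, and bootstraps the OLS slope.

```python
# A sketch of steps 1 and 3 above, with simulated counts standing in for the
# original egg data (not reproduced here); sample size, temperatures, and the
# random seed are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

rng = np.random.default_rng(1)
T = rng.uniform(10, 30, size=30)           # temperatures (hypothetical)
eggs = rng.poisson(4.8 + 0.4 * T)          # counts simulated from the OLS fit shown above

def negloglik(params):
    """Negative Poisson log likelihood for c_i ~ Poisson(a + b*T_i)."""
    a, b = params
    lam = a + b * T                        # identity link: the mean is linear in T
    if np.any(lam <= 0):                   # the Poisson mean must stay positive
        return np.inf
    return -poisson.logpmf(eggs, lam).sum()

a_mle, b_mle = minimize(negloglik, x0=[5.0, 0.4], method="Nelder-Mead").x

# OLS for comparison (np.polyfit returns the slope first, then the intercept)
b_ols, a_ols = np.polyfit(T, eggs, 1)

# Nonparametric bootstrap of the OLS slope: resample (T, eggs) pairs with replacement
boot_b = []
for _ in range(2000):
    idx = rng.integers(0, len(T), len(T))
    boot_b.append(np.polyfit(T[idx], eggs[idx], 1)[0])
b_lo, b_hi = np.percentile(boot_b, [2.5, 97.5])

print(f"MLE:            a = {a_mle:.2f}, b = {b_mle:.2f}")
print(f"OLS:            a = {a_ols:.2f}, b = {b_ols:.2f}")
print(f"Bootstrap OLS:  b in ({b_lo:.2f}, {b_hi:.2f})")
```

The likelihood-ratio and credible intervals for a and b described next can be built on this same log-likelihood function.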
A brief digression on independence, and likelihood
[Figure: joint log-likelihood surface for a and b]

Maximum likelihood
- Profile likelihoods are obtained by maximizing the likelihood over the other parameter
- Since 2(ln Lmax - ln L) ~ χ²[1] and P(χ²[1] < 3.84) = 0.95, we look for parameter values where L / Lmax ≥ exp(-3.84/2) = 0.146
[Figure: profile likelihoods for a and b]

Bayesian estimates
- Marginal posteriors are obtained by integrating over the other parameter
- Find tail probabilities such that 95% of the distribution lies inside the interval
[Figure: marginal posterior distributions for a and b]

Comparison of intervals
[Figure: point estimates and 95% intervals for a and b from likelihood, Bayes, and OLS]

  a              estimate   lower     upper
  MLE            5.2013     0.3805    10.814
  Bayes          5.3423     0.4451    10.9432
  OLS            4.8115     -1.4217   11.0446
  Bootstrap OLS  4.6375     -1.399    8.7688

  b              estimate   lower     upper
  MLE            0.3769     0.1381    0.6038
  Bayes          0.3769     0.1361    0.6002
  OLS            0.3957     0.133     0.6584
  Bootstrap OLS  0.4        0.2013    0.6447

When will likelihood and Bayes be different?
1. Strongly informative priors (point and interval estimates will differ)
2. The likelihood is strongly asymmetric or has an odd shape (intervals differ)
3. The likelihood has fat tails (Bayesian intervals are wider)

When will likelihood and Bayes be the same?
1. Weakly informative priors and lots of data
2. The likelihood is fairly symmetric

Likelihood v. Bayes: what are the practical differences?
- Likelihood: no prior needs to be specified, but estimates are biased, confidence statements are approximate, and it is unclear how to propagate uncertainty
- Bayes: a prior is required, but no asymptotic theory is needed and propagation of uncertainty is automatic
- The 'subjectivity' of Bayes is a red herring: everyone who uses the same assumptions on the same data will get the same answer
- With flat priors, the posterior mode is the same as the MLE. Confidence intervals can still differ, but you will generally get very similar results from Bayes and likelihood, so for testing hypotheses the extra effort of Bayes is probably not worth it
- But if you (1) have usable prior information, or (2) want to use your estimates in a model or make predictions with them, Bayes offers an internally consistent way of doing so

Hierarchical models
- Share information ('borrow strength') across related data sets
- Account for heterogeneity among groups
- Estimate the distribution of parameters across groups / partition variance
- Applications: 'random-effects' and 'mixed-effects' models, 'nested' models, 'repeated-measures' models, finite mixture models, 'random regression' models
- Any model with a stochastic model for how parameters vary across observations can be construed as a hierarchical model, e.g. hidden Markov models, spatial models, etc.

Combining information from samples from three 'populations'
[Figure: estimated distributions for three populations (N = 20, 10, 5), modeled identically, hierarchically, and independently]

Combining information from samples from three 'populations': lots of data
[Figure: the same comparison when every population has plenty of data (N = 60 each), modeled identically, hierarchically, and independently]
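As a rough numerical illustration of the pooling comparison in the figures above, the sketch below contrasts complete pooling ('modeled identically'), no pooling ('modeled independently'), and a simple normal shrinkage (empirical-Bayes) version of partial pooling ('modeled hierarchically'); the simulated group means and variances and the moment estimate of the between-group variance are assumptions for illustration, not the model behind the original figures.

```python
# Complete pooling vs. no pooling vs. partial pooling for three groups,
# using simulated data and a simple normal shrinkage (empirical-Bayes) rule.
import numpy as np

rng = np.random.default_rng(7)
true_means = [1.0, 1.5, 2.0]                      # hypothetical group means
sizes = [20, 10, 5]                               # group sizes from the small-sample figure
groups = [rng.normal(m, 0.5, n) for m, n in zip(true_means, sizes)]

# No pooling ("modeled independently"): each group keeps its own sample mean
no_pool = np.array([g.mean() for g in groups])

# Complete pooling ("modeled identically"): one common mean for all observations
pooled = np.concatenate(groups).mean()

# Partial pooling ("modeled hierarchically"): shrink each group mean toward the
# overall mean, with less shrinkage for groups that carry more data.
n = np.array(sizes)
sigma2 = np.mean([g.var(ddof=1) for g in groups])                 # within-group variance
tau2 = max(no_pool.var(ddof=1) - sigma2 * np.mean(1 / n), 1e-6)   # rough moment estimate of between-group variance
w_data, w_prior = n / sigma2, 1 / tau2                            # precision of each group mean, precision of the shared "prior"
partial = (w_data * no_pool + w_prior * pooled) / (w_data + w_prior)

print("independently :", no_pool.round(2))
print("identically   :", round(pooled, 2))
print("hierarchically:", partial.round(2))
```

With plenty of data in every group, as in the second figure, the data precisions dominate the shrinkage weights and the hierarchical estimates approach the independent ones.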