Uncertainty and confidence intervals
Statistical estimation methods, Finse
Friday 10.9.2010, 12.45–14.05
Andreas Lindén

Outline
• Point estimates and uncertainty
• Sampling distribution
  – Standard error
  – Covariation between parameters
• Finding the variance-covariance (VC) matrix for the parameter estimates
  – Analytical formulas
  – From the Hessian matrix
  – Bootstrapping
• The idea behind confidence intervals
• General methods for constructing confidence intervals of parameters
  – CI based on the central limit theorem
  – Profile likelihood CI
  – CI by bootstrapping

Point estimates and uncertainty
• The main output of any statistical model fitting is the set of parameter estimates
  – Point estimates: one value for each parameter
  – The effect sizes
  – They answer the question “how much”
• Point estimates are of little use without an assessment of uncertainty
  – Standard error
  – Confidence intervals
  – p-values
  – Estimated sampling distribution
  – Bayesian credible intervals
  – Plotting the Bayesian posterior distribution

Sampling distribution
• The probability distribution of a parameter estimate
  – Calculated from a sample
  – Variability due to sampling effects
• Typically depends on sample size or the number of degrees of freedom (df)
• Examples of common sampling distributions
  – Student’s t-distribution
  – F-distribution
  – χ²-distribution

Degrees of freedom
• In a linear regression df = n – 2
[Figure: Y plotted against X]

Properties of the sampling distribution
• The standard error (SE) of a parameter is the estimated standard deviation of the sampling distribution
  – Square root of the parameter variance
• Parameters are not necessarily unrelated
  – The sampling distribution of several parameters is multivariate
  – Example: regression slope and intercept

Linear regression – simulated data

              a       b       σ²
True value    4.00    1.00    0.80
Estim. 1      4.29    0.96    0.70
Estim. 2      4.13    0.97    0.36
Estim. 3      3.86    0.98    0.83
Estim. 4      3.77    1.04    0.75
Estim. 5      3.63    1.06    0.63
Estim. 6      4.39    0.93    0.72
Estim. 7      3.80    0.98    0.91
Estim. 8      3.78    1.06    0.92
Estim. 9      3.74    1.07    0.69
Estim. 10     4.62    0.84    0.50
…             …       …       …
Estim. 100    3.54    1.06    0.71

Properties of the sampling distribution (continued)
Variance-covariance (COV) and correlation (CORR) matrices of the estimates, with rows and columns in the order a, b, σ²:

COV =
   0.1531  -0.0273   0.0031
  -0.0273   0.0059   0.0002
   0.0031   0.0002   0.0335

CORR =
   1.0000  -0.9085   0.0432
  -0.9085   1.0000   0.0159
   0.0432   0.0159   1.0000

• Methods to obtain the VC-matrix (or standard errors) for a set of parameters
  – Analytical formulas
  – Bootstrap
  – The inverse of the Hessian matrix

Parameter variances analytically
• For many common situations the SE and VC-matrix of a set of parameters can be calculated with analytical formulas
• Standard error of the sample mean: SE = s / √n (s = sample standard deviation, n = sample size)
• Standard error of the estimated binomial probability: SE = √(p̂(1 – p̂) / n)

Bootstrap
• The bootstrap is a general and common resampling method
• Used to simulate the sampling distribution
• Information in the sample itself is used to mimic the original sampling procedure
  – Non-parametric bootstrap: sampling with replacement
  – Parametric bootstrap: simulation based on parameter estimates
• The procedure is repeated B times (e.g.
  B = 1000)
• To make inference from the bootstrapped estimates
  – Sample standard deviation = bootstrap estimate of SE
  – Sample VC-matrix = bootstrap estimate of VC-matrix
  – Mean: the difference between the bootstrap mean and the original estimate is an estimate of bias

VC-matrix from the Hessian
• The Hessian matrix (H)
  – Matrix of second derivatives of the (multivariate) negative log-likelihood at the ML-estimate
  – Typically given as output by software for numerical optimization
• The inverse of the Hessian is an estimate of the parameters’ variance-covariance matrix

Confidence interval (CI)
• A frequentist interval estimate of one or several parameters
• A fraction α of all correctly produced CIs will fail to include the true parameter value
  – Trust your 95% CI and take the risk α = 0.05
• NB! Should not be confused with Bayesian credible intervals
  – A CI should not be thought to contain the parameter with 95% probability
  – The CI is based on the sampling distribution, not on an estimated probability distribution for the parameter of interest

[Figure: vertical axis 0 to 100, horizontal axis –1 to 1]

CI based on central limit theorem
• The sum/mean of many random values is approximately normally distributed
  – Actually t-distributed, with df depending on sample size and model complexity
  – Might matter with small sample size
• As a rule of thumb, an arbitrary parameter estimate ± 2*SE produces an approximate 95% confidence interval
  – With infinitely many observations ± 1.96*SE

CI from profile likelihood
• The profile deviance
  – The change in −2*log-likelihood, in comparison to the ML-estimate
  – Asymptotically χ²-distributed (assuming infinite sample size)
• Confidence intervals can be obtained as the range around the ML-estimate for which the profile deviance is under a critical level
  – The 1 – α quantile from the χ²-distribution
  – One parameter -> df = 1 (e.g.
    3.841 for α = 0.05)
  – k-dimensional profile deviance -> df = k

95% CI from profile deviance
[Figure: –2*LL plotted against parameter value, with the levels Fmin and Fmin + 3.841 marked]

2-D confidence regions
[Figure: confidence regions in the plane of Parameter a and Parameter b]
• 99% confidence region: deviance χ² (df = 2) = 9.210
• 95% confidence region: deviance χ² (df = 2) = 5.991

CI by bootstrapping
• A 100*(1 – α)% CI for a parameter can be calculated from the sampling distribution
  – The α / 2 and 1 – α / 2 quantiles (e.g. 0.025 and 0.975 with α = 0.05)
• In bootstrapping, simply use the sample quantiles of the simulated values

Exercises
• Data: The prevalence of an infectious disease in a human population is investigated. The infection is recorded with 100% detection efficiency. In a sample of N = 80 humans, X = 18 infections were found.
• Model: Assume that infection (x = 0 or 1) of a host individual is an independent Bernoulli trial with probability p, such that the probability of infection is constant over all hosts.
• (This equals a logistic regression with an intercept only. Host-specific explanatory variables, such as age, condition, etc., could be used to model the infection probability more closely.)

Do the following in R:
a) Calculate and plot the profile (log) likelihood of the infection probability p
b) What is the maximum likelihood estimate of p (called p̂)?
c) Construct 95% and 99% confidence intervals for p based on the profile likelihood
d) Calculate the analytic SE for p̂
e) Construct a symmetric 95% confidence interval for p based on the central limit theorem and the SE obtained in the previous exercise
f) Simulate and plot the sampling distribution of p̂ by parametric bootstrapping (B = 10000)
g) Calculate the bootstrap SE of p̂
h) Construct a 95% confidence interval for p based on the bootstrap
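
A minimal sketch of how the three interval methods could be applied to the exercise data (N = 80, X = 18). It is written in Python with only the standard library rather than in R as the exercise asks, and it leaves out the plots in a) and f). The function names (loglik, deviance, profile_ci), the bisection search, the random seed and the 99% critical value 6.635 (χ², df = 1) are assumptions made for this illustration and are not taken from the slides.

```python
import math
import random
import statistics

N, X = 80, 18            # sample size and number of infections (from the exercise)
p_hat = X / N            # b) ML estimate of a binomial probability

def loglik(p):
    """Binomial log-likelihood without the constant binomial coefficient."""
    return X * math.log(p) + (N - X) * math.log(1 - p)

def deviance(p):
    """Change in -2*log-likelihood compared with the ML estimate."""
    return -2 * (loglik(p) - loglik(p_hat))

def profile_ci(crit, tol=1e-10):
    """c) Interval where the deviance stays below `crit`, found by bisection."""
    def crossing(lo, hi):
        # deviance(lo) and deviance(hi) lie on opposite sides of crit
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if (deviance(mid) > crit) == (deviance(lo) > crit):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    return crossing(1e-9, p_hat), crossing(p_hat, 1 - 1e-9)

ci95 = profile_ci(3.841)  # 0.95 quantile of chi-square, df = 1
ci99 = profile_ci(6.635)  # 0.99 quantile of chi-square, df = 1 (assumed value)

# d) analytic SE and e) symmetric CI from the central limit theorem
se = math.sqrt(p_hat * (1 - p_hat) / N)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# f) parametric bootstrap: simulate N Bernoulli trials with probability p_hat, B times
random.seed(1)            # arbitrary seed, only for reproducibility
B = 10000
boot = [sum(random.random() < p_hat for _ in range(N)) / N for _ in range(B)]

boot_se = statistics.stdev(boot)          # g) bootstrap SE
bias = statistics.mean(boot) - p_hat      # bootstrap estimate of bias
cuts = statistics.quantiles(boot, n=40)   # cut points at 2.5%, 5%, ..., 97.5%
boot_ci = (cuts[0], cuts[-1])             # h) percentile CI

print("p_hat      ", p_hat)
print("profile 95%", ci95)
print("profile 99%", ci99)
print("analytic SE", se)
print("CLT 95%    ", wald)
print("bootstrap SE", boot_se, "bias", bias)
print("bootstrap 95%", boot_ci)
```

The profile interval is not forced to be symmetric around p̂, whereas the interval based on the central limit theorem is. The bootstrap interval can only take values that are multiples of 1/N, because each simulated estimate is a count divided by N.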