Data analysis
Ben Graham
MA930, University of Warwick
October 19, 2015

Intro
- MLE
- Confidence intervals
- Bayesian credible intervals
- p-values
- Hypothesis testing

6.2 Sufficient statistics
- Def 6.2.1: A statistic T(X) is a sufficient statistic for θ if the conditional distribution of the sample X given the value of T(X) does not depend on θ.
- Thm 6.2.2: If f(x | θ)/f(T(x) | θ) is constant as a function of θ, then T(X) is sufficient.
- Thm 6.2.6 (factorisation theorem): T(X) is sufficient iff f(x | θ) = g(T(x) | θ) h(x) for some g, h.
- Example: Independent X_i ∼ Bernoulli(θ), θ ∈ (0, 1).
- Example: Independent X_1, ..., X_N ∼ Uniform(0, θ), θ > 0.
- Example: Independent X_1, ..., X_n ∼ N(θ, σ²), θ ∈ ℝ.
- Example: Independent X_1, ..., X_n ∼ N(θ_1, θ_2), θ_1 ∈ ℝ, θ_2 > 0.
- Minimal sufficient statistics.

6.3 Likelihood principle
- Random sample X = (X_1, ..., X_n), X_i ∼ f(x_i | θ) (pmf or pdf).
- X ∼ ∏_i f(x_i | θ) = f(x | θ).
- Likelihood function L(θ | x) = f(x | θ).
- Likelihood principle: if L(θ | x)/L(θ | y) does not depend on θ, then the conclusions drawn from x and y should be identical.

Chapter 7: Point estimation

7.2.2 Maximum Likelihood Estimator
- L(θ | x) = ∏_i f(x_i | θ), θ ∈ ℝ^k.
- MLE: the statistic θ̂(x) = arg max_θ L(θ | x).
- Differentiable? Solve ∂L(θ | x)/∂θ_i = 0, i = 1, ..., k.
- Log-likelihood ℓ(θ) = log L(θ).
- Ex: N(µ, σ²).
- Ex: Uniform(0, θ), θ > 0.
- Theorem 7.2.10 (invariance): the MLE of τ(θ) is τ(θ̂).

Lagrange multipliers
- For maximising/minimising f: ℝⁿ → ℝ subject to a constraint g: ℝⁿ → ℝ, g(x) = 0.
- Example: multinomial distribution.

Newton-Raphson method
- For finding roots of an equation f(x) = 0:
  x_{n+1} = x_n − f(x_n)/f′(x_n)
- Ex: √2 is a root of the equation x² − 2 = 0, so take f(x) = x² − 2:
  x_{n+1} = x_n − (x_n² − 2)/(2 x_n)

7.2.3 Bayes Estimators
- Parameter θ is random, with prior distribution π(θ).
- Joint distribution π(θ) f(x | θ).
- Posterior distribution θ | X: condition the joint distribution on the observed data X.
- Example: θ ∼ Beta(α, β), π(θ) ∝ θ^(α−1) (1 − θ)^(β−1) for θ ∈ (0, 1), and X_1, ..., X_n ∼ Bernoulli(θ).
- Conjugate family of priors/posteriors.
- Example 2: normal prior, normal data.

Bayes Risk
- Loss function L(θ, δ).
- Choose δ = δ(X) to minimise the posterior expected loss
  E_{θ|X}[L(θ, δ)] = ∫ L(θ, δ) f(θ | x) dθ.
- This also minimises the Bayes risk E_{θ,X}[L(θ, δ)].
- Quadratic loss L(θ, δ) = (θ − δ)² → posterior mean.
- Absolute value loss L(θ, δ) = |θ − δ| → posterior median.

7.2.4 EM algorithm
- Missing data problem: x = (x_o, x_m).
- x_o observed, x_m missing.
- Joint distribution f(x_o, x_m | θ).
- Want arg max_θ log L(θ | x_o).
- EM algorithm: start at some initial guess θ^(0), then iterate
  θ^(r+1) = arg max_θ E_{x_m | θ^(r), x_o}[log L(θ | x_o, x_m)]
          = arg max_θ ( log L(θ | x_o) + E_{x_m | θ^(r), x_o}[log f(x_m | θ, x_o)] ),
  where the second term is maximised by θ = θ^(r), so each iteration cannot decrease log L(θ | x_o).

The hard EM algorithm
- The hard EM algorithm is, as its name suggests, much easier than the general EM algorithm.
- Split the data x = (x_o, x_m); only x_o is observed. Start at some θ^(0).
- Iterate θ^(t) → θ^(t+1) by:
  - sampling x_m = x_m^(t) conditional on θ^(t) and x_o, and then
  - setting θ^(t+1) to be the MLE for x = (x_o, x_m^(t)).

7.2.17 Multiple Poisson Rates
- Parameters β; τ_1, ..., τ_n.
- Observe X_i ∼ Poisson(τ_i) and Y_i ∼ Poisson(β τ_i),
  i.e. τ_i = population density at place i, β = disease effect size.
- Missing data: suppose X_1 is missing.
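For this Poisson example the EM update has a closed form: setting the derivatives of the complete-data log-likelihood to zero gives β = Σ y_i / Σ x_i and τ_i = (x_i + y_i)/(1 + β), and the E-step only needs E[X_1 | θ^(r)] = τ_1^(r), because the complete-data log-likelihood is linear in the missing count apart from a term that does not involve θ. The following Python sketch is not from the notes (the function name, indexing convention and initial guess are my own choices); it simply iterates these two steps.

import numpy as np

def em_poisson_rates(x_obs, y, n_iter=100):
    """EM sketch for X_i ~ Poisson(tau_i), Y_i ~ Poisson(beta * tau_i), X_1 missing.

    x_obs : observed counts x_2, ..., x_n (length n - 1)
    y     : observed counts y_1, ..., y_n (length n)
    """
    x_obs = np.asarray(x_obs, dtype=float)
    y = np.asarray(y, dtype=float)
    beta, tau = 1.0, np.ones(len(y))          # initial guess theta^(0)
    for _ in range(n_iter):
        # E-step: replace the missing count by its conditional expectation tau_1.
        x_full = np.concatenate(([tau[0]], x_obs))
        # M-step: complete-data MLEs (derivatives of the log-likelihood set to zero).
        beta = y.sum() / x_full.sum()
        tau = (x_full + y) / (1.0 + beta)
    return beta, tau

For example, em_poisson_rates([3, 5, 2], [4, 6, 10, 3]) returns estimates of β and (τ_1, ..., τ_4). Swapping the E-step line for a draw from Poisson(τ_1^(t)) (e.g. rng.poisson(tau[0]) with a NumPy Generator) would give the hard-EM variant described above.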
7.3.1 Mean squared error
- How do we measure the quality of an estimator W of θ?
  MSE(W) = E_θ[(W − θ)²] = Var_θ[W] + (Bias_θ W)²,   where Bias_θ W = E_θ W − θ.
- An estimator is unbiased if Bias_θ W = 0.
- The MLE for σ² for N(µ, σ²) is biased.
- Small MSE is more important than unbiasedness (see the simulation sketch at the end of these notes).
- A sequence of estimators is consistent if W_n → θ in probability as n → ∞.
- W_n → θ in probability if MSE(W_n) → 0.
- Markov's inequality: if X ≥ 0, then P(X ≥ a) ≤ E[X]/a.
- Chebyshev's inequality: for a r.v. X with mean µ and variance σ², P(|X − µ| ≥ kσ) ≤ k⁻².

Fisher's information
- Sample distribution f(x | θ).
- Fisher's information:
  I(θ) = E_θ[(∂/∂θ log f(X | θ))²] = −E_θ[∂²/∂θ² log f(X | θ)],
  the second equality holding under regularity conditions.

Theorem 7.3.9 Cramér-Rao Inequality
- Sample X_1, ..., X_n with pdf f(x | θ); estimator W(X) with finite variance.
- Assume (d/dθ) E_θ W(X) = ∫_X (∂/∂θ)[W(x) f(x | θ)] dx.
- Then
  Var_θ(W(X)) ≥ ((d/dθ) E_θ W(X))² / E_θ[(∂/∂θ log f(X | θ))²].
- Special case: X_1, ..., X_n i.i.d. r.v.s, in which case the denominator equals n I(θ).

Proof
- Cauchy-Schwarz inequality: Cov(X, Y)² ≤ Var(X) Var(Y).
- Assume wlog E[X] = E[Y] = 0.
- For all t ∈ ℝ, E[(tX + Y)²] = E[t²X² + 2tXY + Y²] ≥ 0.
- Cramér-Rao: apply Cauchy-Schwarz to W and the score,
  Cov_θ(W, ∂/∂θ log f(X | θ))² ≤ Var_θ(W) Var_θ(∂/∂θ log f(X | θ)).

Ch 8 Hypothesis testing
- A hypothesis is a statement about a population parameter.
- Null hypothesis H_0; alternative hypothesis H_1.
- Form a statistical test.
- Can you reject H_0?
- Rejecting H_0 does not mean accepting H_1.
- You do not accept H_0.
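To illustrate the point from 7.3.1 that a small MSE can matter more than unbiasedness, here is a small simulation sketch. It is not part of the notes, and the sample size, variance and repetition count are arbitrary choices; it compares the biased MLE σ̂² = (1/n) Σ(X_i − X̄)² with the unbiased estimator s² = (1/(n−1)) Σ(X_i − X̄)² on normal data.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 0.0, 4.0, 10, 100_000

# Draw many samples of size n from N(mu, sigma2) and compute both variance estimators.
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = samples.var(axis=1, ddof=0)   # divide by n     (the MLE, biased)
s2 = samples.var(axis=1, ddof=1)           # divide by n - 1 (unbiased)

for name, est in [("MLE (1/n)", sigma2_mle), ("unbiased (1/(n-1))", s2)]:
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)
    print(f"{name:>20}: bias = {bias:+.4f}, MSE = {mse:.4f}")

For normal data the MLE has bias −σ²/n and MSE (2n − 1)σ⁴/n², which is smaller than the MSE 2σ⁴/(n − 1) of s², so the printout should show the biased estimator winning on MSE despite losing on bias.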