Math 362: Mathematical Statistics II
Le Chen (le.chen@emory.edu), Emory University, Atlanta, GA
Last updated on April 13, 2021. 2021 Spring.

Chapter 5. Estimation
§ 5.1 Introduction
§ 5.2 Estimating parameters: MLE and MME
§ 5.3 Interval Estimation
§ 5.4 Properties of Estimators
§ 5.5 Minimum-Variance Estimators: The Cramér-Rao Lower Bound
§ 5.6 Sufficient Estimators
§ 5.7 Consistency
§ 5.8 Bayesian Estimation

§ 5.1 Introduction

Motivating example: Given an unfair coin, or p-coin, such that

X = 1 (head) with probability p,   X = 0 (tail) with probability 1 - p,

how would you determine the value of p?

Solution:
1. Try the coin several times, say, three times. Suppose you obtain "HHT".
2. Draw a conclusion from the experiment you just made.

Rationale: The choice of the parameter p should be the value that maximizes the probability of the sample:

P(X_1 = 1, X_2 = 1, X_3 = 0) = P(X_1 = 1) P(X_2 = 1) P(X_3 = 0) = p^2 (1 - p).

# Hello, R.
p <- seq(0, 1, 0.01)
plot(p, p^2 * (1 - p), type = "l", col = "red")
title("Likelihood")
# add a vertical dotted (lty = 4) blue line
abline(v = 0.67, col = "blue", lty = 4)
# add some text
text(0.67, 0.01, "2/3")

Maximize f(p) = p^2 (1 - p) ....

A random sample of size n from the population Bernoulli(p):
- X_1, ..., X_n are i.i.d.¹ random variables, each following Bernoulli(p).
- Suppose the outcomes of the random sample are X_1 = k_1, ..., X_n = k_n.
- What is your choice of p based on the above random sample?

p = \frac{1}{n} \sum_{i=1}^n k_i =: \bar{k}.

¹ independent and identically distributed
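As a small complement to the three-toss plot above, here is a minimal R sketch (the sample size n and the true p below are made-up values, not from the slides) that draws a Bernoulli(p) sample and computes the estimate k-bar:

# Draw a random sample of size n from Bernoulli(p) and estimate p by the
# sample mean, i.e. the value that maximizes the likelihood as argued above.
set.seed(362)                        # for reproducibility
n <- 50                              # hypothetical sample size
p <- 0.7                             # hypothetical "true" p (unknown in practice)
k <- rbinom(n, size = 1, prob = p)   # n Bernoulli(p) outcomes (0/1)
mean(k)                              # the estimate k-bar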
A random sample of size n from a population with a given pdf:
- X_1, ..., X_n are i.i.d. random variables, each following the same given pdf.
- A statistic, or an estimator, is a function of the random sample. A statistic/estimator is a random variable! E.g.,

\hat{p} = \frac{1}{n} \sum_{i=1}^n X_i.

- The outcome of a statistic/estimator is called an estimate. E.g.,

p_e = \frac{1}{n} \sum_{i=1}^n k_i.

§ 5.2 Estimating parameters: MLE and MME

Two methods for estimating parameters, and the corresponding estimators:
1. Method of maximum likelihood: MLE.
2. Method of moments: MME.

Maximum Likelihood Estimation

Definition 5.2.1. For a random sample of size n from the discrete (resp. continuous) population/pdf p_X(k; θ) (resp. f_Y(y; θ)), the likelihood function L(θ) is the product of the pdf evaluated at X_i = k_i (resp. Y_i = y_i), i.e.,

L(θ) = \prod_{i=1}^n p_X(k_i; θ)   (resp. L(θ) = \prod_{i=1}^n f_Y(y_i; θ)).

Definition 5.2.2. Let L(θ) be as defined in Definition 5.2.1. If θ_e is a value of the parameter such that L(θ_e) ≥ L(θ) for all possible values of θ, then we call θ_e the maximum likelihood estimate for θ.

Examples for MLE

Often, but not always, the MLE can be obtained by setting the first derivative equal to zero.

E.g. 1. Poisson distribution: p_X(k) = \frac{e^{-λ} λ^k}{k!}, k = 0, 1, ....
L(λ) = \prod_{i=1}^n \frac{e^{-λ} λ^{k_i}}{k_i!} = e^{-nλ} λ^{\sum_{i=1}^n k_i} \Big( \prod_{i=1}^n k_i! \Big)^{-1}.

\ln L(λ) = -nλ + \Big( \sum_{i=1}^n k_i \Big) \ln λ - \ln \Big( \prod_{i=1}^n k_i! \Big).

\frac{d}{dλ} \ln L(λ) = -n + \frac{1}{λ} \sum_{i=1}^n k_i.

\frac{d}{dλ} \ln L(λ) = 0  ⟹  λ_e = \frac{1}{n} \sum_{i=1}^n k_i =: \bar{k}.

Comment: The critical point is indeed a global maximum because

\frac{d^2}{dλ^2} \ln L(λ) = -\frac{1}{λ^2} \sum_{i=1}^n k_i < 0.

The following two cases are related to waiting times.

E.g. 2. Exponential distribution: f_Y(y) = λ e^{-λy} for y ≥ 0.

L(λ) = \prod_{i=1}^n λ e^{-λ y_i} = λ^n \exp\Big( -λ \sum_{i=1}^n y_i \Big).

\ln L(λ) = n \ln λ - λ \sum_{i=1}^n y_i.

\frac{d}{dλ} \ln L(λ) = \frac{n}{λ} - \sum_{i=1}^n y_i.

\frac{d}{dλ} \ln L(λ) = 0  ⟹  λ_e = \frac{n}{\sum_{i=1}^n y_i} =: \frac{1}{\bar{y}}.
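A quick numerical sanity check of E.g. 2 on simulated data (the true λ below is a made-up value): the maximizer of the exponential log-likelihood found by optimize() agrees with the closed form λ_e = 1/ȳ.

# Compare the closed-form exponential MLE, 1 / ybar, with a numerical
# maximization of the log-likelihood ln L(lambda).
set.seed(362)
lambda <- 2.5                                   # hypothetical true rate
y <- rexp(100, rate = lambda)                   # random sample of size 100
loglik <- function(l) sum(dexp(y, rate = l, log = TRUE))
num <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)$maximum
c(closed_form = 1 / mean(y), numerical = num)   # the two values should agree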
A random sample of size n from the following population:

E.g. 3. Gamma distribution: f_Y(y; λ) = \frac{λ^r y^{r-1} e^{-λy}}{Γ(r)} for y ≥ 0, with r > 1 known.

L(λ) = \prod_{i=1}^n \frac{λ^r}{Γ(r)} y_i^{r-1} e^{-λ y_i} = λ^{rn} Γ(r)^{-n} \Big( \prod_{i=1}^n y_i^{r-1} \Big) \exp\Big( -λ \sum_{i=1}^n y_i \Big).

\ln L(λ) = rn \ln λ - n \ln Γ(r) + \ln \Big( \prod_{i=1}^n y_i^{r-1} \Big) - λ \sum_{i=1}^n y_i.

\frac{d}{dλ} \ln L(λ) = \frac{rn}{λ} - \sum_{i=1}^n y_i.

\frac{d}{dλ} \ln L(λ) = 0  ⟹  λ_e = \frac{rn}{\sum_{i=1}^n y_i} = \frac{r}{\bar{y}}.

Comments:
– When r = 1, this reduces to the exponential distribution case.
– If r is also unknown, the problem becomes much more complicated: there is no closed-form solution, and one needs a numerical solver². Try MME instead.

² [DW, Example 7.2.25]
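For the case where r is also unknown, the slides point to a numerical solver. A minimal sketch on simulated data (the true r and λ below are made-up values), maximizing the two-parameter gamma log-likelihood with optim():

# Two-parameter gamma MLE by numerical optimization of ln L(r, lambda).
set.seed(362)
y <- rgamma(200, shape = 3, rate = 1.5)   # hypothetical sample (r = 3, lambda = 1.5)
negloglik <- function(par)                # optim() minimizes, so negate ln L
  -sum(dgamma(y, shape = par[1], rate = par[2], log = TRUE))
fit <- optim(par = c(1, 1), fn = negloglik,
             method = "L-BFGS-B", lower = c(1e-6, 1e-6))
fit$par                                   # numerical MLEs of (r, lambda)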
A detailed study with data:

E.g. 4. Geometric distribution: p_X(k; p) = (1 - p)^{k-1} p, k = 1, 2, ....

L(p) = \prod_{i=1}^n (1 - p)^{k_i - 1} p = (1 - p)^{-n + \sum_{i=1}^n k_i} p^n.

\ln L(p) = \Big( -n + \sum_{i=1}^n k_i \Big) \ln(1 - p) + n \ln p.

\frac{d}{dp} \ln L(p) = -\frac{-n + \sum_{i=1}^n k_i}{1 - p} + \frac{n}{p}.

\frac{d}{dp} \ln L(p) = 0  ⟹  p_e = \frac{n}{\sum_{i=1}^n k_i} = \frac{1}{\bar{k}}.

Comment: Its cousin distribution, the negative binomial distribution, can be worked out similarly (see Ex 5.2.14).
MLE for Geom. Distr. of sample size n = 128:

k    Observed frequency   Predicted frequency
1    72                   74.14
2    35                   31.2
3    11                   13.13
4    6                    5.52
5    2                    2.32
6    2                    0.98

Real p = unknown and MLE for p = 0.5792.

# The example from the book.
library(pracma)      # Load the library "Practical Numerical Math Functions" (for dot())
library(gridExtra)   # Provides grid.table(), used below
k <- c(72, 35, 11, 6, 2, 2)     # observed frequencies
a <- 1:6
pe <- sum(k) / dot(k, a)        # MLE for p
f <- a
for (i in 1:6) {
  f[i] <- round((1 - pe)^(i - 1) * pe * sum(k), 2)
}
# Initialize the table
d <- matrix(1:18, nrow = 6, ncol = 3)
# Now add the column names
colnames(d) <- c("k", "Observed freq.", "Predicted freq.")
d[1:6, 1] <- a
d[1:6, 2] <- k
d[1:6, 3] <- f
grid.table(d)   # Show the table
PlotResults("unknown", pe, d, "Geometric.pdf")   # Output the results using a user-defined function

MLE for Geom. Distr. of sample size n = 128:

k    Observed frequency   Predicted frequency
1    42                   40.96
2    31                   27.85
3    15                   18.94
4    11                   12.88
5    9                    8.76
6    5                    5.96
7    7                    4.05
8    2                    2.75
9    1                    1.87
10   2                    1.27
11   1                    0.87
13   1                    0.59
14   1                    0.4

Real p = 0.3333 and MLE for p = 0.32.

# Now generate random samples from a geometric distribution with p = 1/3, with the same sample size.
p <- 1/3
n <- 128
gdata <- rgeom(n, p) + 1                    # Generate random samples
g <- table(gdata)                           # Count the frequency of your data
g <- t(rbind(as.numeric(rownames(g)), g))   # Transpose and combine the two columns
pe <- n / dot(g[, 1], g[, 2])               # MLE for p
f <- g[, 1]                                 # Initialize f
for (i in 1:nrow(g)) {
  f[i] <- round((1 - pe)^(i - 1) * pe * n, 2)   # Compute the expected frequency
}
g <- cbind(g, f)                            # Add one column to your matrix
colnames(g) <- c("k", "Observed freq.", "Predicted freq.")   # Specify the column names
g_df <- as.data.frame(g)                    # One can use a data frame to store the data
g_df                                        # Show the data on your terminal
PlotResults(p, pe, g, "Geometric2.pdf")     # Output the results using a user-defined function

MLE for Geom. Distr. of sample size n = 300:

k    Observed frequency   Predicted frequency
1    99                   105.88
2    69                   68.51
3    47                   44.33
4    28                   28.69
5    27                   18.56
6    9                    12.01
7    8                    7.77
8    5                    5.03
9    5                    3.25
10   3                    2.11

Real p = 0.3333 and MLE for p = 0.3529.

In case we have several parameters:

E.g. 5. Normal distribution: f_Y(y; µ, σ^2) = \frac{1}{\sqrt{2π} σ} e^{-(y-µ)^2/(2σ^2)}, y ∈ R.

L(µ, σ^2) = \prod_{i=1}^n \frac{1}{\sqrt{2π} σ} e^{-(y_i-µ)^2/(2σ^2)} = (2πσ^2)^{-n/2} \exp\Big( -\frac{1}{2σ^2} \sum_{i=1}^n (y_i - µ)^2 \Big).

\ln L(µ, σ^2) = -\frac{n}{2} \ln(2πσ^2) - \frac{1}{2σ^2} \sum_{i=1}^n (y_i - µ)^2.

\frac{∂}{∂µ} \ln L(µ, σ^2) = \frac{1}{σ^2} \sum_{i=1}^n (y_i - µ),
\frac{∂}{∂σ^2} \ln L(µ, σ^2) = -\frac{n}{2σ^2} + \frac{1}{2σ^4} \sum_{i=1}^n (y_i - µ)^2.

Setting both partial derivatives to zero gives

µ_e = \bar{y},   σ^2_e = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2.
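A minimal check of E.g. 5 on simulated data (the true µ and σ below are made-up values): the MLEs are the sample mean and the squared deviations averaged with denominator n, so R's var(), which divides by n - 1, has to be rescaled.

# Normal MLEs: mu_e = sample mean, sigma2_e = mean squared deviation (denominator n).
set.seed(362)
y <- rnorm(100, mean = 10, sd = 2)    # hypothetical sample
n <- length(y)
mu_e     <- mean(y)
sigma2_e <- mean((y - mu_e)^2)        # equals (n - 1) / n * var(y)
c(mu_e = mu_e, sigma2_e = sigma2_e, check = (n - 1) / n * var(y))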
In case the parameters determine the support of the density (non-regular case):

E.g. 6. Uniform distribution on [a, b] with a < b: f_Y(y; a, b) = \frac{1}{b - a} if y ∈ [a, b].

L(a, b) = \prod_{i=1}^n \frac{1}{b - a} = \frac{1}{(b - a)^n} if a ≤ y_1, ..., y_n ≤ b, and L(a, b) = 0 otherwise.

L(a, b) is monotone increasing in a and decreasing in b. Hence, in order to maximize L(a, b), one needs to choose a_e = y_{min} and b_e = y_{max}.

E.g. 7. f_Y(y; θ) = \frac{2y}{θ^2} for y ∈ [0, θ].

L(θ) = \prod_{i=1}^n \frac{2 y_i}{θ^2} = 2^n θ^{-2n} \prod_{i=1}^n y_i if 0 ≤ y_1, ..., y_n ≤ θ, and L(θ) = 0 otherwise.

⟹ θ_e = y_{max}.
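A small simulation sketch of the two non-regular cases (the parameter values and the inverse-CDF sampling step are my own made-up illustration, not from the slides): the estimates sit on the boundary of the support.

# Non-regular MLEs: the estimates are the sample extremes.
set.seed(362)
u <- runif(50, min = 2, max = 7)    # hypothetical Uniform[a, b] sample with a = 2, b = 7
c(a_e = min(u), b_e = max(u))       # MLEs for (a, b)

theta <- 4                          # hypothetical theta for f(y) = 2y / theta^2
y <- theta * sqrt(runif(50))        # inverse-CDF sampling, since F(y) = y^2 / theta^2
max(y)                              # MLE for theta, always slightly below theta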
In case of a discrete parameter:

E.g. 8. Wildlife sampling (capture-tag-recapture). Previously, a animals in the population have been captured and tagged. In order to estimate the population size N, one randomly captures n animals, and k of them turn out to be tagged. Find the MLE for N.

Sol. The number of tagged animals among the n recaptured ones follows the hypergeometric distribution:

p_X(k; N) = \frac{\binom{a}{k} \binom{N-a}{n-k}}{\binom{N}{n}},   so   L(N) = \frac{\binom{a}{k} \binom{N-a}{n-k}}{\binom{N}{n}}.

How to maximize L(N)?

> a = 10
> k = 5
> n = 20
> N = seq(a, a + 100)
> p = choose(a, k) * choose(N - a, n - k) / choose(N, n)
> plot(N, p, type = "p")
> print(paste("The MLE is", n * a / k))
[1] "The MLE is 40"
The graph suggests studying the following ratio:

r(N) := \frac{L(N)}{L(N-1)} = \frac{N - n}{N} × \frac{N - a}{N - a - n + k}.

r(N) < 1  ⟺  na < Nk,  i.e.,  N > \frac{na}{k}.

N_e = \arg\max L(N):  N ∈ \{ ⌊na/k⌋, ⌈na/k⌉ \}.

Method of Moments Estimation

Rationale: The population moments should be close to the sample moments, i.e.,

E(Y^k) ≈ \frac{1}{n} \sum_{i=1}^n y_i^k,  k = 1, 2, 3, ....

Definition 5.2.3. For a random sample of size n from the discrete (resp. continuous) population/pdf p_X(k; θ_1, ..., θ_s) (resp. f_Y(y; θ_1, ..., θ_s)), the solutions to

E(Y) = \frac{1}{n} \sum_{i=1}^n y_i,  ...,  E(Y^s) = \frac{1}{n} \sum_{i=1}^n y_i^s,

which are denoted by θ_{1e}, ..., θ_{se}, are called the method of moments estimates of θ_1, ..., θ_s.

Examples for MME

The MME is often the same as the MLE:

E.g. 1. Normal distribution: f_Y(y; µ, σ^2) = \frac{1}{\sqrt{2π} σ} e^{-(y-µ)^2/(2σ^2)}, y ∈ R.

µ = E(Y) = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y},   σ^2 + µ^2 = E(Y^2) = \frac{1}{n} \sum_{i=1}^n y_i^2

⟹  µ_e = \bar{y},   σ^2_e = \frac{1}{n} \sum_{i=1}^n y_i^2 - µ_e^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2.

More examples where the MLE coincides with the MME: Poisson, Exponential, Geometric.

The MME is often much more tractable than the MLE:

E.g. 2. Gamma distribution³: f_Y(y; r, λ) = \frac{λ^r y^{r-1} e^{-λy}}{Γ(r)} for y ≥ 0.

\frac{r}{λ} = E(Y) = \frac{1}{n} \sum_{i=1}^n y_i = \bar{y},   \frac{r}{λ^2} + \frac{r^2}{λ^2} = E(Y^2) = \frac{1}{n} \sum_{i=1}^n y_i^2

⟹  r_e = \frac{\bar{y}^2}{\hat{σ}^2},   λ_e = \frac{\bar{y}}{\hat{σ}^2} = \frac{r_e}{\bar{y}},

where \bar{y} is the sample mean and \hat{σ}^2 is the sample variance \hat{σ}^2 := \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2.

Comment: The MME for λ is consistent with the MLE when r is known.

³ Check Theorem 4.6.3 on p. 269 for the mean and variance.

Another example that is tractable for the MME but less tractable for the MLE:

E.g. 3. Negative binomial distribution: p_X(k; p, r) = \binom{k + r - 1}{k} (1 - p)^k p^r, k = 0, 1, ....

\frac{r(1 - p)}{p} = E(X) = \bar{k},   \frac{r(1 - p)}{p^2} = Var(X) = \hat{σ}^2

⟹  p_e = \frac{\bar{k}}{\hat{σ}^2},   r_e = \frac{\bar{k}^2}{\hat{σ}^2 - \bar{k}}.

(For the data example worked out on the slides: r_e = 12.74 and p_e = 0.391.)

E.g. 4. f_Y(y; θ) = \frac{2y}{θ^2} for y ∈ [0, θ].

E[Y] = \int_0^θ y \frac{2y}{θ^2} dy = \frac{2}{θ^2} \frac{y^3}{3} \Big|_{y=0}^{y=θ} = \frac{2}{3} θ   ⟹   θ_e = \frac{3}{2} \bar{y}.
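A minimal sketch of the gamma MME from E.g. 2 above, on simulated data (the true r and λ are made-up values). Note that the slides define σ̂² with denominator n, hence the rescaling relative to var().

# Gamma method-of-moments estimates: r_e = ybar^2 / sigma2, lambda_e = ybar / sigma2.
set.seed(362)
y <- rgamma(200, shape = 3, rate = 1.5)   # hypothetical sample (r = 3, lambda = 1.5)
ybar   <- mean(y)
sigma2 <- mean((y - ybar)^2)              # sample variance with denominator n
c(r_mme = ybar^2 / sigma2, lambda_mme = ybar / sigma2)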
§ 5.3 Interval Estimation

Rationale. A point estimate does not provide precision information. By using the variance of the estimator, one can construct an interval such that, with a high probability, that interval will contain the unknown parameter.
- The interval is called a confidence interval.
- The high probability is the confidence level.

E.g. 1. A random sample of size 4, (Y_1 = 6.5, Y_2 = 9.2, Y_3 = 9.9, Y_4 = 12.4), from a normal population:

f_Y(y; µ) = \frac{1}{\sqrt{2π} \cdot 0.8} e^{-\frac{1}{2} \left( \frac{y - µ}{0.8} \right)^2}.

Both the MLE and the MME give µ_e = \bar{y} = \frac{1}{4}(6.5 + 9.2 + 9.9 + 12.4) = 9.5. The estimator \hat{µ} = \bar{Y} follows a normal distribution. Construct a 95% confidence interval for µ ...

"The parameter is an unknown constant and no probability statement concerning its value may be made."
– Jerzy Neyman, original developer of confidence intervals.

In general, for a normal population with σ known, the 100(1 - α)% confidence interval for µ is

\left( \bar{y} - z_{α/2} \frac{σ}{\sqrt{n}},\ \bar{y} + z_{α/2} \frac{σ}{\sqrt{n}} \right).

Comment: There are many variations.
1. One-sided intervals, such as \left( \bar{y} - z_α \frac{σ}{\sqrt{n}},\ \bar{y} \right) or \left( \bar{y},\ \bar{y} + z_α \frac{σ}{\sqrt{n}} \right).
2. σ is unknown and the sample size is small: z-score → t-score.
3. σ is unknown and the sample size is large: z-score, by the CLT.
4. Non-Gaussian population but the sample size is large: z-score, by the CLT.
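Applying the general formula to the four observations of E.g. 1 (σ = 0.8 known, α = 0.05) gives a quick check in R; the only value not taken from the slides is the exact z quantile from qnorm().

# 95% confidence interval for mu when sigma is known: ybar +/- z * sigma / sqrt(n).
y     <- c(6.5, 9.2, 9.9, 12.4)   # data from E.g. 1
sigma <- 0.8                      # known population standard deviation
n     <- length(y)
z     <- qnorm(0.975)             # z_{alpha/2} for alpha = 0.05, about 1.96
c(lower = mean(y) - z * sigma / sqrt(n),
  upper = mean(y) + z * sigma / sqrt(n))   # roughly (8.72, 10.28)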
Theorem. Let k be the number of successes in n independent trials, where n is large and p = P(success) is unknown. An approximate 100(1 - α)% confidence interval for p is the set of numbers

\left( \frac{k}{n} - z_{α/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}},\ \frac{k}{n} + z_{α/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}} \right).

Proof: It follows from the following facts:
- X ∼ binomial(n, p) iff X = Y_1 + ... + Y_n, where the Y_i are i.i.d. Bernoulli(p): E[Y_i] = p and Var(Y_i) = p(1 - p).
- Central Limit Theorem: Let W_1, W_2, ..., W_n be a sequence of i.i.d. random variables whose distribution has mean µ and variance σ^2. Then

\frac{\sum_{i=1}^n W_i - nµ}{\sqrt{nσ^2}} approximately follows N(0, 1) when n is large.

When the sample size n is large, by the central limit theorem,

\frac{\sum_{i=1}^n Y_i - np}{\sqrt{np(1 - p)}} = \frac{X - np}{\sqrt{np(1 - p)}} = \frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}} ≈ \frac{\frac{X}{n} - p}{\sqrt{\frac{p_e(1-p_e)}{n}}} ∼ N(0, 1), approximately.

Since p_e = k/n, we see that

P\left( -z_{α/2} ≤ \frac{\frac{X}{n} - p}{\sqrt{\frac{(k/n)(1 - k/n)}{n}}} ≤ z_{α/2} \right) ≈ 1 - α,

i.e., the 100(1 - α)% confidence interval for p is

\left( \frac{k}{n} - z_{α/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}},\ \frac{k}{n} + z_{α/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}} \right).
E.g. 1. Use the median test to check the randomness of a random number generator.

Suppose y_1, ..., y_n denote measurements presumed to have come from a continuous pdf f_Y(y). Let k denote the number of y_i's that are less than the median of f_Y(y). If the sample is random, we would expect the difference between k/n and 1/2 to be small. More specifically, a 95% confidence interval based on k should contain the value 0.5.

Let f_Y(y) = e^{-y}. The median is m = 0.69315.

#!/usr/bin/Rscript
main <- function() {
  args <- commandArgs(trailingOnly = TRUE)
  n <- 100                     # Number of random samples.
  r <- as.numeric(args[1])     # Rate of the exponential
  # Check if the rate argument is given.
  if (is.na(r)) return("Please provide the rate and try again.")
  # Now start computing ...
  f <- function(y) pexp(y, rate = r) - 0.5
  m <- uniroot(f, lower = 0, upper = 100, tol = 1e-9)$root
  print(paste("For rate", r, "exponential distribution,",
              "the median is equal to", round(m, 3)))
  data <- rexp(n, r)                           # Generate n random samples
  data <- round(data, 3)                       # Round to 3 digits after the decimal
  data <- matrix(data, nrow = 10, ncol = 10)   # Turn the data into a matrix
  prmatrix(data)                               # Show the data on the terminal
  k <- sum(data < m)                           # Count how many entries are less than m
  lowerbd <- k/n - 1.96 * sqrt((k/n) * (1 - k/n) / n)
  upperbd <- k/n + 1.96 * sqrt((k/n) * (1 - k/n) / n)
  print(paste("The 95% confidence interval is (",
              round(lowerbd, 3), ",", round(upperbd, 3), ")"))
}
main()

Try the command line ...

Instead of reporting the confidence interval

\left( \frac{k}{n} - z_{α/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}},\ \frac{k}{n} + z_{α/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}} \right),

one can simply report the center k/n together with the margin of error

d := z_{α/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}}.

Since \max_{p ∈ (0,1)} p(1 - p) = 1/4, attained at p = 1/2, we have

d ≤ \frac{z_{α/2}}{2\sqrt{n}} =: d_m.

Comment:
1. When p is close to 1/2, d ≈ \frac{z_{α/2}}{2\sqrt{n}}, which is equivalent to σ_p ≈ \frac{1}{2\sqrt{n}}. E.g., for n = 1000, k/n = 0.48, and α = 5%,

d = 1.96 \sqrt{\frac{0.48 × 0.52}{1000}} = 0.03097 and d_m = \frac{1.96}{2\sqrt{1000}} = 0.03099,

σ_p = \sqrt{\frac{0.48 × 0.52}{1000}} = 0.01579873 and σ_p ≈ \frac{1}{2\sqrt{1000}} = 0.01581139.

2. When p is away from 1/2, the discrepancy between d and d_m becomes large....

E.g. Running for the presidency. Max and Sirius obtained 480 and 520 votes, respectively. What is the probability that Max will win? What if the sample size is n = 5000 and Max obtained 2400 votes?
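A quick check in R of the numbers in Comment 1 above (n = 1000, k/n = 0.48, z_{α/2} = 1.96, as in the slides):

# Margin of error d versus the conservative bound dm = z / (2 * sqrt(n)).
n    <- 1000
phat <- 0.48    # k / n from the example
z    <- 1.96    # z_{alpha/2} for alpha = 0.05
d  <- z * sqrt(phat * (1 - phat) / n)
dm <- z / (2 * sqrt(n))
c(d = d, dm = dm)   # approximately 0.03097 and 0.03099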
Choosing sample sizes:

d ≤ z_{α/2} \sqrt{p(1 - p)/n}  ⟺  n ≥ \frac{z_{α/2}^2 \, p(1 - p)}{d^2}   (when p is known),

d ≤ \frac{z_{α/2}}{2\sqrt{n}}  ⟺  n ≥ \frac{z_{α/2}^2}{4 d^2}   (when p is unknown).

E.g. Anti-smoking campaign. We need a 95% C.I. with a margin of error equal to 1%. Determine the sample size.
Answer: n ≥ \frac{1.96^2}{4 × 0.01^2} = 9604.

E.g.' In order to reduce the sample size, a small pilot sample is used to estimate p, and one finds that p ≈ 0.22. Determine the sample size again.
Answer: n ≥ \frac{1.96^2 × 0.22 × 0.78}{0.01^2} = 6592.2.
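The two answers follow directly from the formulas; a one-line check in R (using the same values 1.96, 0.01, and 0.22 as the slides):

# Sample sizes for a 95% CI with margin of error d = 0.01.
z <- 1.96
d <- 0.01
n1 <- z^2 / (4 * d^2)            # p unknown: worst case p = 1/2, gives 9604
n2 <- z^2 * 0.22 * 0.78 / d^2    # pilot estimate p ~ 0.22, gives about 6592.2
c(p_unknown = n1, p_pilot = n2)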
§ 5.4 Properties of Estimators

Question: Estimators are not in general unique (MLE, MME, ...). How should one select an estimator?

Recall: For a random sample of size n from a population with a given pdf, we have X_1, ..., X_n, which are i.i.d. random variables. The estimator \hat{θ} is a function of the X_i's: \hat{θ} = \hat{θ}(X_1, ..., X_n).

Criteria:
1. Unbiasedness. (Mean)
2. Efficiency, the minimum-variance estimator. (Variance)
3. Sufficiency.
4. Consistency. (Asymptotic behavior)

Unbiasedness

Definition 5.4.1. Given a random sample of size n whose population distribution depends on an unknown parameter θ, let \hat{θ} be an estimator of θ. Then \hat{θ} is called unbiased if E(\hat{θ}) = θ, and \hat{θ} is called asymptotically unbiased if \lim_{n→∞} E(\hat{θ}) = θ.

E.g. 1. f_Y(y; θ) = \frac{2y}{θ^2} if y ∈ [0, θ].
– \hat{θ}_1 = \frac{3}{2} \bar{Y}
– \hat{θ}_2 = Y_{max}
– \hat{θ}_3 = \frac{2n + 1}{2n} Y_{max}

E.g. 2. Let X_1, ..., X_n be a random sample of size n with the unknown parameter θ = E(X). Show that for any constants a_i,

\hat{θ} = \sum_{i=1}^n a_i X_i is unbiased  ⟺  \sum_{i=1}^n a_i = 1.

E.g. 3. Let X_1, ..., X_n be a random sample of size n with the unknown parameter σ^2 = Var(X).
– \hat{σ}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2
– S^2 = sample variance = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2
– S = sample standard deviation = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (X_i - \bar{X})^2}.   (Biased for σ!)
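A small simulation sketch of E.g. 3 on a made-up normal population: the n-denominator estimator σ̂² systematically underestimates σ², whereas the sample variance S² does not.

# Compare E[sigma2-hat] (denominator n) with E[S^2] (denominator n - 1) by simulation.
set.seed(362)
n <- 5; sigma2 <- 4                               # hypothetical small sample, true variance 4
sims <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(mle = mean((x - mean(x))^2), s2 = var(x))     # sigma2-hat and S^2
})
rowMeans(sims)   # mle is close to (n - 1) / n * sigma2 = 3.2; s2 is close to 4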
E.g. 4. Exponential distribution: f_Y(y; λ) = λ e^{-λy} for y ≥ 0. \hat{λ} = 1/\bar{Y} is biased.

n\bar{Y} = \sum_{i=1}^n Y_i ∼ Gamma(n, λ). Hence,

E(\hat{λ}) = E(1/\bar{Y}) = n \int_0^∞ \frac{1}{y} \frac{λ^n}{Γ(n)} y^{n-1} e^{-λy} dy
= \frac{nλ}{n - 1} \int_0^∞ \underbrace{\frac{λ^{n-1}}{Γ(n - 1)} y^{(n-1)-1} e^{-λy}}_{pdf of Gamma(n - 1, λ)} dy
= \frac{n}{n - 1} λ.

Biased! But E(\hat{λ}) = \frac{n}{n - 1} λ → λ as n → ∞, so \hat{λ} is asymptotically unbiased.

Note: \hat{λ}^* = \frac{n - 1}{n\bar{Y}} is unbiased.

E.g. 4'. Exponential distribution parametrized by the mean: f_Y(y; θ) = \frac{1}{θ} e^{-y/θ} for y ≥ 0. \hat{θ} = \bar{Y} is unbiased:

E(\hat{θ}) = \frac{1}{n} \sum_{i=1}^n E(Y_i) = \frac{1}{n} \sum_{i=1}^n θ = θ.
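A simulation sketch of E.g. 4 with made-up values of n and λ: averaged over many samples, 1/Ȳ is close to nλ/(n - 1) rather than λ, while the corrected estimator λ̂* is close to λ.

# Bias of 1 / Ybar as an estimator of lambda for exponential data.
set.seed(362)
n <- 5; lambda <- 2                         # hypothetical sample size and rate
lam_hat <- replicate(20000, 1 / mean(rexp(n, rate = lambda)))
c(raw       = mean(lam_hat),                # close to n * lambda / (n - 1) = 2.5
  corrected = mean((n - 1) / n * lam_hat))  # close to lambda = 2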
Efficiency

Definition 5.4.2. Let \hat{θ}_1 and \hat{θ}_2 be two unbiased estimators for a parameter θ. If Var(\hat{θ}_1) < Var(\hat{θ}_2), then we say that \hat{θ}_1 is more efficient than \hat{θ}_2. The relative efficiency of \hat{θ}_1 w.r.t. \hat{θ}_2 is the ratio Var(\hat{θ}_1)/Var(\hat{θ}_2).

E.g. 1. f_Y(y; θ) = \frac{2y}{θ^2} if y ∈ [0, θ]. Which is more efficient? Find the relative efficiency of \hat{θ}_1 w.r.t. \hat{θ}_3.
– \hat{θ}_1 = \frac{3}{2} \bar{Y}
– \hat{θ}_3 = \frac{2n + 1}{2n} Y_{max}

E.g. 2. Let X_1, ..., X_n be a random sample of size n with the unknown parameter θ = E(X) (suppose σ^2 = Var(X) < ∞). Among all unbiased estimators of the form \hat{θ} = \sum_{i=1}^n a_i X_i with \sum_{i=1}^n a_i = 1, find the most efficient one.

Sol:

Var(\hat{θ}) = \sum_{i=1}^n a_i^2 Var(X) = σ^2 \sum_{i=1}^n a_i^2 ≥ σ^2 \frac{1}{n} \Big( \sum_{i=1}^n a_i \Big)^2 = \frac{1}{n} σ^2,

with equality iff a_1 = ... = a_n = 1/n. Hence, the most efficient one is the sample mean \hat{θ} = \bar{X}.
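E.g. 1 above is left as an exercise; without giving the analytic answer, here is a simulation sketch (made-up θ and n, and my own inverse-CDF sampling step) that compares the variances of θ̂_1 and θ̂_3 empirically.

# Simulated variances of the two unbiased estimators of theta from E.g. 1.
set.seed(362)
n <- 10; theta <- 4                                   # hypothetical values
samples <- replicate(20000, theta * sqrt(runif(n)))   # each column: one sample from 2y / theta^2
t1 <- 1.5 * colMeans(samples)                         # theta-hat-1 = (3/2) * Ybar
t3 <- (2 * n + 1) / (2 * n) * apply(samples, 2, max)  # theta-hat-3
c(var_t1 = var(t1), var_t3 = var(t3), rel_eff = var(t1) / var(t3))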
§ 5.5 MVE: The Cramér-Rao Lower Bound

Question: Can one identify the unbiased estimator having the smallest variance?

Short answer: In many cases, yes! We are going to develop the theory to answer this question in detail.

Regular estimation/condition: The set of y (resp. k) values where f_Y(y; θ) ≠ 0 (resp. p_X(k; θ) ≠ 0) does not depend on θ; i.e., the domain of the pdf does not depend on the parameter (so that one can differentiate under the integral sign).

Definition. The Fisher information of a continuous (resp. discrete) random variable Y (resp. X) with pdf f_Y(y; θ) (resp. p_X(k; θ)) is defined as

I(θ) = E\left[ \left( \frac{∂ \ln f_Y(Y; θ)}{∂θ} \right)^2 \right]   \left( resp.\ E\left[ \left( \frac{∂ \ln p_X(X; θ)}{∂θ} \right)^2 \right] \right).

Lemma. Under the regular condition, let Y_1, ..., Y_n be a random sample of size n from the continuous population pdf f_Y(y; θ). Then the Fisher information in the random sample Y_1, ..., Y_n equals n times the Fisher information in Y:

E\left[ \left( \frac{∂ \ln f_{Y_1,...,Y_n}(Y_1, ..., Y_n; θ)}{∂θ} \right)^2 \right] = n E\left[ \left( \frac{∂ \ln f_Y(Y; θ)}{∂θ} \right)^2 \right] = n I(θ).   (1)

(A similar statement holds for the discrete case p_X(k; θ).)

Proof. Based on two observations:

LHS = E\left[ \left( \sum_{i=1}^n \frac{∂}{∂θ} \ln f_{Y_i}(Y_i; θ) \right)^2 \right],

and

E\left[ \frac{∂}{∂θ} \ln f_{Y_i}(Y_i; θ) \right] = \int_R \frac{\frac{∂}{∂θ} f_Y(y; θ)}{f_Y(y; θ)} f_Y(y; θ)\, dy = \int_R \frac{∂}{∂θ} f_Y(y; θ)\, dy \overset{R.C.}{=} \frac{∂}{∂θ} \int_R f_Y(y; θ)\, dy = \frac{∂}{∂θ} 1 = 0.

Lemma. Under the regular condition, if \ln f_Y(y; θ) is twice differentiable in θ, then

I(θ) = -E\left[ \frac{∂^2}{∂θ^2} \ln f_Y(Y; θ) \right].   (2)

(A similar statement holds for the discrete case p_X(k; θ).)

Proof. This is due to the two facts:

\frac{∂^2}{∂θ^2} \ln f_Y(Y; θ) = \frac{\frac{∂^2}{∂θ^2} f_Y(Y; θ)}{f_Y(Y; θ)} - \left( \frac{\frac{∂}{∂θ} f_Y(Y; θ)}{f_Y(Y; θ)} \right)^2 = \frac{\frac{∂^2}{∂θ^2} f_Y(Y; θ)}{f_Y(Y; θ)} - \left( \frac{∂}{∂θ} \ln f_Y(Y; θ) \right)^2,

and

E\left[ \frac{\frac{∂^2}{∂θ^2} f_Y(Y; θ)}{f_Y(Y; θ)} \right] = \int_R \frac{\frac{∂^2}{∂θ^2} f_Y(y; θ)}{f_Y(y; θ)} f_Y(y; θ)\, dy = \int_R \frac{∂^2}{∂θ^2} f_Y(y; θ)\, dy \overset{R.C.}{=} \frac{∂^2}{∂θ^2} \int_R f_Y(y; θ)\, dy = \frac{∂^2}{∂θ^2} 1 = 0.

Theorem (Cramér-Rao Inequality). Under the regular condition, let Y_1, ..., Y_n be a random sample of size n from the continuous population pdf f_Y(y; θ), and let \hat{θ} = \hat{θ}(Y_1, ..., Y_n) be any unbiased estimator for θ. Then

Var(\hat{θ}) ≥ \frac{1}{n I(θ)}.

(A similar statement holds for the discrete case p_X(k; θ).)

Proof. If n = 1, then by the Cauchy-Schwarz inequality,

E\left[ (\hat{θ} - θ) \frac{∂}{∂θ} \ln f_Y(Y; θ) \right] ≤ \sqrt{Var(\hat{θ}) × I(θ)}.

On the other hand,

E\left[ (\hat{θ} - θ) \frac{∂}{∂θ} \ln f_Y(Y; θ) \right] = \int_R (\hat{θ} - θ) \frac{\frac{∂}{∂θ} f_Y(y; θ)}{f_Y(y; θ)} f_Y(y; θ)\, dy = \int_R (\hat{θ} - θ) \frac{∂}{∂θ} f_Y(y; θ)\, dy = \frac{∂}{∂θ} \underbrace{\int_R (\hat{θ} - θ) f_Y(y; θ)\, dy}_{= E(\hat{θ} - θ) = 0} + 1 = 1.

For general n, apply (1).

Definition. Let Θ be the set of all estimators \hat{θ} that are unbiased for the parameter θ. We say that \hat{θ}^* is a best or minimum-variance estimator (MVE) if \hat{θ}^* ∈ Θ and Var(\hat{θ}^*) ≤ Var(\hat{θ}) for all \hat{θ} ∈ Θ.

Definition. An unbiased estimator \hat{θ} is efficient if Var(\hat{θ}) is equal to the Cramér-Rao lower bound, i.e., Var(\hat{θ}) = (n I(θ))^{-1}. The efficiency of an unbiased estimator \hat{θ} is defined to be (n I(θ) Var(\hat{θ}))^{-1}.
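Before the worked example below, a quick Monte Carlo illustration of the Fisher information (the Bernoulli parameter p is a made-up value): averaging the squared score over simulated draws approximates I(p), which E.g. 1 below computes analytically as 1/(p(1 - p)).

# Monte Carlo check of I(p) = E[(d/dp ln p_X(X; p))^2] for X ~ Bernoulli(p).
set.seed(362)
p <- 0.3                               # hypothetical parameter value
x <- rbinom(1e6, size = 1, prob = p)   # many draws from Bernoulli(p)
score <- x / p - (1 - x) / (1 - p)     # d/dp ln p_X(x; p) = x/p - (1 - x)/(1 - p)
c(monte_carlo = mean(score^2), analytic = 1 / (p * (1 - p)))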
The proof is based on two observations. First, since ln f_{Y_1,...,Y_n} = Σ_{i=1}^n ln f_{Y_i},

  LHS = E[ ( Σ_{i=1}^n ∂/∂θ ln f_{Y_i}(Y_i; θ) )² ].

Second, each score has mean zero:

  E[ ∂/∂θ ln f_{Y_i}(Y_i; θ) ] = ∫_R ( ∂f_Y(y; θ)/∂θ ) / f_Y(y; θ) · f_Y(y; θ) dy
    = ∫_R ∂/∂θ f_Y(y; θ) dy  =(R.C.)  ∂/∂θ ∫_R f_Y(y; θ) dy = ∂/∂θ 1 = 0.

By independence, the cross terms in the expanded square therefore vanish, and the left-hand side reduces to Σ_{i=1}^n E[(∂/∂θ ln f_{Y_i}(Y_i; θ))²] = n I(θ). □

Lemma. Under the regular condition, if ln f_Y(y; θ) is twice differentiable in θ, then

  I(θ) = −E[ ∂²/∂θ² ln f_Y(Y; θ) ].   (2)

(A similar statement holds for the discrete case p_X(k; θ).)

Proof. This is due to the two facts:

  ∂²/∂θ² ln f_Y(Y; θ) = ( ∂²f_Y(Y; θ)/∂θ² ) / f_Y(Y; θ) − ( (∂f_Y(Y; θ)/∂θ) / f_Y(Y; θ) )²,

where the second term equals (∂/∂θ ln f_Y(Y; θ))², and

  E[ ( ∂²f_Y(Y; θ)/∂θ² ) / f_Y(Y; θ) ] = ∫_R ( ∂²f_Y(y; θ)/∂θ² ) / f_Y(y; θ) · f_Y(y; θ) dy
    = ∫_R ∂²/∂θ² f_Y(y; θ) dy  =(R.C.)  ∂²/∂θ² ∫_R f_Y(y; θ) dy = ∂²/∂θ² 1 = 0. □

Theorem (Cramér-Rao Inequality). Under the regular condition, let Y_1, ..., Y_n be a random sample of size n from the continuous population pdf f_Y(y; θ), and let θ̂ = θ̂(Y_1, ..., Y_n) be any unbiased estimator for θ. Then

  Var(θ̂) ≥ 1 / (n I(θ)).

(A similar statement holds for the discrete case p_X(k; θ).)

Proof. If n = 1, then by the Cauchy-Schwarz inequality,

  E[ (θ̂ − θ) ∂/∂θ ln f_Y(Y; θ) ] ≤ sqrt( Var(θ̂) × I(θ) ).

On the other hand,

  E[ (θ̂ − θ) ∂/∂θ ln f_Y(Y; θ) ] = ∫_R (θ̂ − θ) ( ∂f_Y(y; θ)/∂θ ) / f_Y(y; θ) · f_Y(y; θ) dy
    = ∫_R (θ̂ − θ) ∂/∂θ f_Y(y; θ) dy
    = ∂/∂θ ∫_R (θ̂ − θ) f_Y(y; θ) dy + 1 = 1,

since ∫_R (θ̂ − θ) f_Y(y; θ) dy = E(θ̂ − θ) = 0. Hence 1 ≤ Var(θ̂) I(θ), i.e., Var(θ̂) ≥ 1/I(θ). For general n, apply (1). □

Definition. Let Θ be the set of all estimators θ̂ that are unbiased for the parameter θ. We say that θ̂* is a best or minimum-variance estimator (MVE) if θ̂* ∈ Θ and Var(θ̂*) ≤ Var(θ̂) for all θ̂ ∈ Θ.

Definition. An unbiased estimator θ̂ is efficient if Var(θ̂) is equal to the Cramér-Rao lower bound, i.e., Var(θ̂) = (n I(θ))^{−1}. The efficiency of an unbiased estimator θ̂ is defined to be (n I(θ) Var(θ̂))^{−1}.
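As a numerical sanity check of identity (2) (not from the slides), the sketch below estimates E[(∂/∂λ ln f)²] by Monte Carlo for an exponential density f(y; λ) = λe^{−λy}, for which ∂/∂λ ln f = 1/λ − y and ∂²/∂λ² ln f = −1/λ²; both sides should be close to 1/λ². The variable names are illustrative, and the second derivative is constant here, so its average is exact.

# Check E[(score)^2] = -E[second derivative] = 1/lambda^2 for Exponential(lambda)
set.seed(1)
lambda <- 1.5
y <- rexp(1e6, rate = lambda)
score <- 1/lambda - y                   # d/d lambda of log f(y; lambda)
hess  <- rep(-1/lambda^2, length(y))    # d^2/d lambda^2 of log f(y; lambda)
c(mean(score^2), -mean(hess), 1/lambda^2)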
[Diagram: nested sets labeled "Unbiased estimators Θ", "MVE", and "Efficient est."]

E.g. 1. X ∼ Bernoulli(p). Check whether p̂ = X̄ is efficient.

Step 1. Compute the Fisher information:
  p_X(k; p) = p^k (1−p)^{1−k}
  ln p_X(k; p) = k ln p + (1−k) ln(1−p)
  ∂/∂p ln p_X(k; p) = k/p − (1−k)/(1−p)
  ∂²/∂p² ln p_X(k; p) = −k/p² − (1−k)/(1−p)²
  −E[ ∂²/∂p² ln p_X(X; p) ] = E[ X/p² + (1−X)/(1−p)² ] = 1/p + 1/(1−p) = 1/(pq),
so I(p) = 1/(pq), where q = 1 − p.

Step 2. Compute Var(p̂):
  Var(p̂) = (1/n²) Var( Σ_{i=1}^n X_i ) = npq/n² = pq/n.

Conclusion. Because p̂ is unbiased and Var(p̂) = (n I(p))^{−1}, p̂ is efficient.
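A quick Monte Carlo sketch (not from the slides) of the conclusion just reached: the empirical variance of p̂ = X̄ should sit essentially on the Cramér-Rao bound p(1−p)/n. The values of p, n, and M below are illustrative.

# Compare the empirical variance of p.hat = X.bar with p(1-p)/n
set.seed(1)
p <- 0.3; n <- 25; M <- 20000
p.hat <- replicate(M, mean(rbinom(n, size = 1, prob = p)))
c(var(p.hat), p * (1 - p) / n)   # the two numbers should be close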
E.g. 2. Exponential distr.: f_Y(y; λ) = λ e^{−λy} for y ≥ 0. Is λ̂ = 1/Ȳ efficient?

Answer: No, because λ̂ is biased. Nevertheless, we can still compute the Fisher information:
  ln f_Y(y; λ) = ln λ − λy
  ∂/∂λ ln f_Y(y; λ) = 1/λ − y
  ∂²/∂λ² ln f_Y(y; λ) = −1/λ²
  −E[ ∂²/∂λ² ln f_Y(Y; λ) ] = E[ 1/λ² ] = 1/λ²,
so I(λ) = λ^{−2}.

Try λ̂* := ((n−1)/n) · (1/Ȳ). It is unbiased. Is it efficient?
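A simulation sketch (not from the slides) of the bias computed in E.g. 4 of § 5.4: the mean of 1/Ȳ should be close to nλ/(n−1), while λ̂* = (n−1)/(nȲ) should be close to λ. The constants below are illustrative.

# Bias check for 1/Y.bar and the corrected estimator (n-1)/(n*Y.bar)
set.seed(1)
lambda <- 2; n <- 10; M <- 50000
ybar <- replicate(M, mean(rexp(n, rate = lambda)))
c(mean(1 / ybar), n * lambda / (n - 1))      # biased estimator vs. its expectation
c(mean((n - 1) / (n * ybar)), lambda)        # corrected estimator vs. lambda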
E.g. 2'. Exponential distr.: f_Y(y; θ) = θ^{−1} e^{−y/θ} for y ≥ 0. Is θ̂ = Ȳ efficient?

Step 1. Compute the Fisher information:
  ln f_Y(y; θ) = −ln θ − y/θ
  ∂/∂θ ln f_Y(y; θ) = −1/θ + y/θ²
  ∂²/∂θ² ln f_Y(y; θ) = 1/θ² − 2y/θ³
  −E[ ∂²/∂θ² ln f_Y(Y; θ) ] = E[ −1/θ² + 2Y/θ³ ] = −1/θ² + 2θ/θ³ = θ^{−2},
so I(θ) = θ^{−2}.

Step 2. Compute Var(θ̂):
  Var(Ȳ) = (1/n²) Σ_{i=1}^n Var(Y_i) = (1/n²) · n θ² = θ²/n.

Conclusion. Because θ̂ is unbiased and Var(θ̂) = (n I(θ))^{−1}, θ̂ is efficient.
E.g. 3. f_Y(y; θ) = 2y/θ² for y ∈ [0, θ]. Is θ̂ = (3/2) Ȳ efficient?

Step 1. Compute the Fisher information:
  ln f_Y(y; θ) = ln(2y) − 2 ln θ
  ∂/∂θ ln f_Y(y; θ) = −2/θ.

By the definition of Fisher information,
  I(θ) = E[ (∂/∂θ ln f_Y(Y; θ))² ] = E[ (−2/θ)² ] = 4/θ².

However, if we compute the second derivative, ∂²/∂θ² ln f_Y(y; θ) = 2/θ², then
  −E[ ∂²/∂θ² ln f_Y(Y; θ) ] = −2/θ² ≠ 4/θ² = I(θ).   (†)

Step 2. Compute Var(θ̂):
  Var(θ̂) = (9/4) Var(Ȳ) = (9/(4n)) Var(Y) = (9/(4n)) · (θ²/18) = θ²/(8n).

Discussion. Even though θ̂ is unbiased, we have two discrepancies: (†), and
  Var(θ̂) = θ²/(8n) ≤ θ²/(4n) = 1/(n I(θ)).
This is because this is not a regular estimation problem! (A simulation sketch follows.)
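The simulation sketch below (not from the slides) illustrates the non-regular behaviour just discussed: the empirical variance of θ̂ = (3/2)Ȳ should be close to θ²/(8n), which is smaller than the nominal bound θ²/(4n). Sampling again uses Y = θ√U; the constants are illustrative.

# Variance of (3/2)*Y.bar versus theta^2/(8n) and the nominal bound theta^2/(4n)
set.seed(1)
theta <- 3; n <- 30; M <- 20000
theta.hat <- replicate(M, 3/2 * mean(theta * sqrt(runif(n))))
c(var(theta.hat), theta^2 / (8 * n), theta^2 / (4 * n))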
§ 5.6 Sufficient Estimators

Rationale: Let θ̂ be an estimator of the unknown parameter θ. Does θ̂ contain all the information about θ that the sample carries? Equivalently, how can one reduce the random sample of size n, denoted by (X_1, ..., X_n), to a function of the sample without losing any information about θ?

E.g., choose the function h(X_1, ..., X_n) := (1/n) Σ_{i=1}^n X_i, the sample mean. In many cases, h(X_1, ..., X_n) contains all relevant information about the true mean E(X). In that case, h(X_1, ..., X_n), as an estimator, is sufficient.

Definition. Let (X_1, ..., X_n) be a random sample of size n from a discrete population with an unknown parameter θ, and let θ̂ (resp. θ̃) be an estimator (resp. estimate). We call θ̂ and θ̃ sufficient if

  P(X_1 = k_1, ..., X_n = k_n | θ̂ = θ̃) = b(k_1, ..., k_n)   (Sufficiency-1)

is a function that does not depend on θ.
For a random sample (Y_1, ..., Y_n) from a continuous population, (Sufficiency-1) should be replaced by

  f_{Y_1,...,Y_n | θ̂ = θ̃}(y_1, ..., y_n | θ̂ = θ̃) = b(y_1, ..., y_n).

Note: θ̂ = h(X_1, ..., X_n) and θ̃ = h(k_1, ..., k_n), or θ̂ = h(Y_1, ..., Y_n) and θ̃ = h(y_1, ..., y_n).

Equivalently,

Definition. ... θ̂ (or θ̃) is sufficient if the likelihood function can be factorized as

  Discrete:    L(θ) = Π_{i=1}^n p_X(k_i; θ) = g(θ̃, θ) · b(k_1, ..., k_n)      (Sufficiency-2)
  Continuous:  L(θ) = Π_{i=1}^n f_Y(y_i; θ) = g(θ̃, θ) · b(y_1, ..., y_n),

where g is a function of the two arguments θ̃ and θ only, and b is a function that does not depend on θ.

E.g. 1. A random sample of size n from Bernoulli(p). Let p̂ = Σ_{i=1}^n X_i. Check the sufficiency of p̂ for p by (Sufficiency-1).

Case I: If k_1, ..., k_n ∈ {0, 1} are such that Σ_{i=1}^n k_i ≠ c, then
  P(X_1 = k_1, ..., X_n = k_n | p̂ = c) = 0.

Case II: If k_1, ..., k_n ∈ {0, 1} are such that Σ_{i=1}^n k_i = c, then

  P(X_1 = k_1, ..., X_n = k_n | p̂ = c)
    = P(X_1 = k_1, ..., X_n = k_n, p̂ = c) / P(p̂ = c)
    = P(X_1 = k_1, ..., X_n = k_n, X_n + Σ_{i=1}^{n−1} X_i = c) / P(Σ_{i=1}^n X_i = c)
    = P(X_1 = k_1, ..., X_{n−1} = k_{n−1}, X_n = c − Σ_{i=1}^{n−1} k_i) / P(Σ_{i=1}^n X_i = c)
    = [ Π_{i=1}^{n−1} p^{k_i}(1−p)^{1−k_i} × p^{c − Σ_{i=1}^{n−1} k_i} (1−p)^{1 − c + Σ_{i=1}^{n−1} k_i} ] / [ C(n, c) p^c (1−p)^{n−c} ]
    = 1 / C(n, c),

where C(n, c) denotes the binomial coefficient.
In summary,

  P(X_1 = k_1, ..., X_n = k_n | p̂ = c) = 1/C(n, c) if k_1, ..., k_n ∈ {0, 1} with Σ_{i=1}^n k_i = c, and 0 otherwise.

Hence, by (Sufficiency-1), p̂ = Σ_{i=1}^n X_i is a sufficient estimator for p.

E.g. 1'. As in E.g. 1, check the sufficiency of p̂ for p by (Sufficiency-2). Notice that p̃ = Σ_{i=1}^n k_i. Then

  L(p) = Π_{i=1}^n p_X(k_i; p) = Π_{i=1}^n p^{k_i} (1−p)^{1−k_i} = p^{Σ_{i=1}^n k_i} (1−p)^{n − Σ_{i=1}^n k_i} = p^{p̃} (1−p)^{n−p̃}.

Therefore, p̃ (or p̂) is sufficient, since (Sufficiency-2) is satisfied with g(p̃, p) = p^{p̃}(1−p)^{n−p̃} and b(k_1, ..., k_n) = 1.

Comments:
1. The estimator p̂ is sufficient but not unbiased, since E(p̂) = np ≠ p.
2. Any one-to-one function of a sufficient estimator is again a sufficient estimator. E.g., p̂_2 := (1/n) p̂, which is an unbiased, sufficient, and minimum-variance estimator.
3. p̂_3 := X_1 is not sufficient!
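A small simulation sketch (not from the slides) of what sufficiency means here: with n = 3 Bernoulli(p) trials, the conditional distribution of the pattern (X_1, X_2, X_3) given Σ X_i = 1 should put probability about 1/3 on each of "100", "010", "001", whatever p is. The helper name cond.pattern is illustrative.

# Conditional pattern frequencies given sum = 1 do not depend on p
set.seed(1)
cond.pattern <- function(p, M = 100000) {
  x <- matrix(rbinom(3 * M, 1, p), ncol = 3)
  keep <- rowSums(x) == 1                                  # condition on the sufficient statistic
  table(apply(x[keep, ], 1, paste, collapse = "")) / sum(keep)
}
cond.pattern(0.2)   # roughly (1/3, 1/3, 1/3)
cond.pattern(0.7)   # essentially the same table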
E.g. 2. Poisson(λ): p_X(k; λ) = e^{−λ} λ^k / k!, k = 0, 1, .... Show that λ̂ = (Σ_{i=1}^n X_i)² is sufficient for λ for a sample of size n.

Sol: The corresponding estimate is λ̃ = (Σ_{i=1}^n k_i)². Then

  L(λ) = Π_{i=1}^n e^{−λ} λ^{k_i} / k_i!
       = e^{−nλ} λ^{Σ_{i=1}^n k_i} ( Π_{i=1}^n k_i! )^{−1}
       = e^{−nλ} λ^{√λ̃} × ( Π_{i=1}^n k_i! )^{−1},

with g(λ̃, λ) = e^{−nλ} λ^{√λ̃} and b(k_1, ..., k_n) = (Π_{i=1}^n k_i!)^{−1}. Hence, λ̂ is a sufficient estimator for λ.

E.g. 3. Let Y_1, ..., Y_n be a random sample from f_Y(y; θ) = 2y/θ² for y ∈ [0, θ]. Is the MLE θ̂ = Y_max sufficient for θ?

Sol: The corresponding estimate is θ̃ = y_max.

  L(θ) = Π_{i=1}^n (2y_i/θ²) I_{[0,θ]}(y_i)
       = 2^n θ^{−2n} ( Π_{i=1}^n y_i ) × Π_{i=1}^n I_{[0,θ]}(y_i)
       = 2^n θ^{−2n} I_{[0,θ]}(y_max) × ( Π_{i=1}^n y_i )
       = 2^n θ^{−2n} I_{[0,θ]}(θ̃) × Π_{i=1}^n y_i,

with g(θ̃, θ) = 2^n θ^{−2n} I_{[0,θ]}(θ̃) and b(y_1, ..., y_n) = Π_{i=1}^n y_i. Hence, θ̂ is a sufficient estimator for θ.

Note: the MME θ̂ = (3/2) Ȳ is NOT sufficient for θ!
§ 5.7 Consistency

Definition. An estimator θ̂_n = h(W_1, ..., W_n) is said to be consistent for θ if it converges to θ in probability, i.e., for every ε > 0,

  lim_{n→∞} P(|θ̂_n − θ| < ε) = 1.

Comment: In the ε-δ language, the above convergence in probability says: for all ε > 0 and all δ > 0, there exists n(ε, δ) > 0 such that for all n ≥ n(ε, δ),
  P(|θ̂_n − θ| < ε) > 1 − δ.

A useful tool to check convergence in probability is

Theorem (Chebyshev's inequality). Let W be any random variable with finite mean μ and variance σ². Then for any ε > 0,
  P(|W − μ| < ε) ≥ 1 − σ²/ε²,
or, equivalently, P(|W − μ| ≥ ε) ≤ σ²/ε².

Proof. ...

As a consequence of Chebyshev's inequality, we have

Proposition. The sample mean μ̂_n = (1/n) Σ_{i=1}^n W_i is consistent for E(W) = μ, provided that the population W has finite mean μ and variance σ².

Proof. E(μ̂_n) = μ and Var(μ̂_n) = σ²/n, so for every ε > 0,
  P(|μ̂_n − μ| < ε) ≥ 1 − σ²/(nε²) → 1.

E.g. 1. Let Y_1, ..., Y_n be a random sample of size n from the uniform pdf f_Y(y; θ) = 1/θ, y ∈ [0, θ]. Let θ̂_n = Y_max. We know that Y_max is biased. Is it consistent?

Sol. The c.d.f. of Y is F_Y(y) = y/θ for y ∈ [0, θ]. Hence,
  f_{Y_max}(y) = n F_Y(y)^{n−1} f_Y(y) = n y^{n−1} / θ^n,   y ∈ [0, θ].

Therefore, for 0 < ε < θ,
  P(|θ̂_n − θ| < ε) = P(θ − ε < θ̂_n < θ + ε)
    = ∫_{θ−ε}^{θ} (n y^{n−1}/θ^n) dy + ∫_{θ}^{θ+ε} 0 dy
    = 1 − ((θ − ε)/θ)^n → 1 as n → ∞,
since ((θ − ε)/θ)^n → 0. Hence θ̂_n = Y_max is consistent.
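A Monte Carlo sketch (not from the slides) of E.g. 1: the empirical coverage P(|Y_max − θ| < ε) should match 1 − ((θ − ε)/θ)^n and approach 1 as n grows. The sample sizes and tolerance below are illustrative.

# Empirical vs. exact P(|Ymax - theta| < eps) for the uniform(0, theta) sample
set.seed(1)
theta <- 1; eps <- 0.05
for (n in c(5, 20, 100, 500)) {
  hit <- mean(replicate(20000, abs(max(runif(n, 0, theta)) - theta) < eps))
  cat(n, hit, 1 - ((theta - eps)/theta)^n, "\n")
}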
E.g. 2. Suppose Y_1, Y_2, ..., Y_n is a random sample from the exponential pdf f_Y(y; λ) = λ e^{−λy}, y > 0. Show that λ̂_n = Y_1 is not consistent for λ.

Sol. To prove that λ̂_n is not consistent for λ, we only need to find one ε > 0 for which the following limit fails to hold:

  lim_{n→∞} P(|λ̂_n − λ| < ε) = 1.   (3)

Choose ε = λ/m for some m ≥ 1. Then
  |λ̂_n − λ| ≤ λ/m ⟺ (1 − 1/m) λ ≤ λ̂_n ≤ (1 + 1/m) λ ⟹ λ̂_n ≥ (1 − 1/m) λ.

Hence,
  P(|λ̂_n − λ| < λ/m) ≤ P(λ̂_n ≥ (1 − 1/m) λ)
    = P(Y_1 ≥ (1 − 1/m) λ)
    = ∫_{(1−1/m)λ}^{∞} λ e^{−λy} dy
    = e^{−(1 − 1/m) λ²} < 1.

The bound does not depend on n, so the limit in (3) cannot hold.
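A simulation sketch (not from the slides): since λ̂_n = Y_1 uses only the first observation, P(|Y_1 − λ| < ε) is the same for every n and stays well below 1; for contrast, the last column tracks 1/Ȳ, which does settle near λ as n grows. The constants are illustrative.

# P(|Y1 - lambda| < eps) does not improve with n
set.seed(1)
lambda <- 1; eps <- 0.2
for (n in c(10, 100, 1000)) {
  y <- matrix(rexp(2000 * n, rate = lambda), nrow = 2000)
  cat(n,
      mean(abs(y[, 1] - lambda) < eps),          # first observation only
      mean(abs(1 / rowMeans(y) - lambda) < eps), # estimator based on the whole sample
      "\n")
}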
§ 5.8 Bayesian Estimation

Rationale: Let W be an estimator that depends on a parameter θ.

1. Frequentists view θ as a fixed parameter whose exact value is to be estimated.
2. Bayesians view θ as the value of a random variable Θ. One can incorporate prior knowledge about Θ — the prior distribution, p_Θ(θ) if Θ is discrete and f_Θ(θ) if Θ is continuous — and use Bayes' formula to update that knowledge about Θ upon a new observation W = w:

  g_Θ(θ | W = w) = p_W(w | Θ = θ) p_Θ(θ) / P(W = w)   if W is discrete,
  g_Θ(θ | W = w) = f_W(w | Θ = θ) f_Θ(θ) / f_W(w)     if W is continuous,

where g_Θ(θ | W = w) is called the posterior distribution of Θ.
Schematically,

  P(Θ | W) = P(W | Θ) P(Θ) / P(W),

where P(Θ | W) is the posterior of Θ, P(W | Θ) is the likelihood of the sample W, P(Θ) is the prior distribution of Θ, and P(W) is the total probability of the sample W.

Four cases for computing the posterior distribution g_Θ(θ | W = w):

                    W discrete                                              W continuous
  Θ discrete    p_W(w|Θ=θ) p_Θ(θ) / Σ_i p_W(w|Θ=θ_i) p_Θ(θ_i)          f_W(w|Θ=θ) p_Θ(θ) / Σ_i f_W(w|Θ=θ_i) p_Θ(θ_i)
  Θ continuous  p_W(w|Θ=θ) f_Θ(θ) / ∫_R p_W(w|Θ=θ') f_Θ(θ') dθ'        f_W(w|Θ=θ) f_Θ(θ) / ∫_R f_W(w|Θ=θ') f_Θ(θ') dθ'

Gamma distributions

  Γ(r) := ∫_0^∞ y^{r−1} e^{−y} dy,   r > 0.

Two parametrizations of the Gamma distribution:
1. With a shape parameter r and a scale parameter θ:
  f_Y(y; r, θ) = y^{r−1} e^{−y/θ} / (θ^r Γ(r)),   y > 0, r, θ > 0.
2. With a shape parameter r and a rate parameter λ = 1/θ:
  f_Y(y; r, λ) = λ^r y^{r−1} e^{−λy} / Γ(r),   y > 0, r, λ > 0.

  E[Y] = r/λ = rθ   and   Var(Y) = r/λ² = rθ².

# Plot gamma distributions
x = seq(0, 20, 0.01)
k = 3          # shape parameter
theta = 0.5    # scale parameter
plot(x, dgamma(x, k, scale = theta), type = "l", col = "red")

Beta distributions

  B(α, β) := ∫_0^1 y^{α−1} (1 − y)^{β−1} dy = Γ(α)Γ(β)/Γ(α + β),   α, β > 0   (see Appendix).

Beta distribution:
  f_Y(y; α, β) = y^{α−1} (1 − y)^{β−1} / B(α, β),   y ∈ [0, 1], α, β > 0,
with
  E[Y] = α/(α + β)   and   Var(Y) = αβ / ((α + β)²(α + β + 1)).

# Plot Beta distributions
x = seq(0, 1, 0.01)
a = 13
b = 2
plot(x, dbeta(x, a, b), type = "l", col = "red")

E.g. 1.
Let X_1, ..., X_n be a random sample from Bernoulli(θ): p_{X_i}(k; θ) = θ^k (1 − θ)^{1−k} for k = 0, 1. Let X = Σ_{i=1}^n X_i. Then X follows binomial(n, θ).

Prior distribution: Θ ∼ beta(r, s), i.e., f_Θ(θ) = (Γ(r+s)/(Γ(r)Γ(s))) θ^{r−1} (1 − θ)^{s−1} for θ ∈ [0, 1].

  X_1, ..., X_n | θ ∼ Bernoulli(θ),  Θ ∼ Beta(r, s),  r and s are known;
equivalently,
  X = Σ_{i=1}^n X_i | θ ∼ Binomial(n, θ),  Θ ∼ Beta(r, s),  r and s are known.

X is discrete and Θ is continuous:

  g_Θ(θ | X = k) = p_X(k | Θ = θ) f_Θ(θ) / ∫_R p_X(k | Θ = θ') f_Θ(θ') dθ',

with numerator
  p_X(k | Θ = θ) f_Θ(θ) = C(n, k) θ^k (1 − θ)^{n−k} × (Γ(r+s)/(Γ(r)Γ(s))) θ^{r−1} (1 − θ)^{s−1}
                        = C(n, k) (Γ(r+s)/(Γ(r)Γ(s))) θ^{k+r−1} (1 − θ)^{n−k+s−1},   θ ∈ [0, 1],
and denominator
  p_X(k) = ∫_R p_X(k | Θ = θ') f_Θ(θ') dθ'
         = C(n, k) (Γ(r+s)/(Γ(r)Γ(s))) ∫_0^1 θ'^{k+r−1} (1 − θ')^{n−k+s−1} dθ'
         = C(n, k) (Γ(r+s)/(Γ(r)Γ(s))) × Γ(k+r) Γ(n−k+s) / Γ((k+r) + (n−k+s)).
X is discrete and Θ is continuous, so

    gΘ(θ | X = k) = pX(k | Θ = θ) fΘ(θ) / ∫_R pX(k | Θ = θ′) fΘ(θ′) dθ′.

Numerator (writing C(n, k) = n!/(k!(n − k)!) for the binomial coefficient):

    pX(k | Θ = θ) fΘ(θ) = C(n, k) θ^k (1 − θ)^{n−k} × [Γ(r + s) / (Γ(r)Γ(s))] θ^{r−1} (1 − θ)^{s−1}
                        = C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] θ^{k+r−1} (1 − θ)^{n−k+s−1},   θ ∈ [0, 1].

Denominator:

    pX(k) = ∫_R pX(k | Θ = θ′) fΘ(θ′) dθ′
          = C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] ∫_0^1 θ′^{k+r−1} (1 − θ′)^{n−k+s−1} dθ′
          = C(n, k) [Γ(r + s) / (Γ(r)Γ(s))] × Γ(k + r)Γ(n − k + s) / Γ((k + r) + (n − k + s)).

Taking the ratio, the constants cancel and

    gΘ(θ | X = k) = [Γ(n + r + s) / (Γ(k + r)Γ(n − k + s))] θ^{k+r−1} (1 − θ)^{n−k+s−1},   θ ∈ [0, 1].

Conclusion: the posterior ∼ Beta(k + r, n − k + s). Recall that the prior ∼ Beta(r, s).
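This conjugacy can be checked numerically (a sketch, not part of the text): evaluate the unnormalized posterior pX(k | θ) fΘ(θ) on a grid, normalize it, and compare with dbeta(θ, k + r, n − k + s). The values n = 10, k = 2, r = 7, s = 193 are the ones used later in this example.

# Numerical check of the Beta-Binomial conjugacy
n <- 10; k <- 2; r <- 7; s <- 193                   # values used later in this example
theta <- seq(0.001, 0.999, by = 0.001)
unnorm <- dbinom(k, n, theta) * dbeta(theta, r, s)  # pX(k | theta) * prior
post   <- unnorm / sum(unnorm * 0.001)              # normalize on the grid
max(abs(post - dbeta(theta, k + r, n - k + s)))     # small (discretization error only)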
It remains to determine the values of r and s that incorporate the prior knowledge:

PK 1. The prior mean is about 0.035: E(Θ) = r/(r + s) = 0.035  ⇔  r/s = 7/193.

PK 2. The prior pdf should be mostly concentrated over [0, 0.07] — compare candidate choices of (r, s) by trial and error:

x <- seq(0, 1, length = 1025)
plot(x, dbeta(x, 4, 102), type = "l")
plot(x, dbeta(x, 7, 193), type = "l")
dev.off()
pdf <- cbind(dbeta(x, 4, 102), dbeta(x, 7, 193))
matplot(x, pdf, type = "l", lty = 1:2,
        xlab = "theta", ylab = "PDF",
        lwd = 2)                        # line width
legend(0.2, 25,                         # position of legend
       c("Beta(4,102)", "Beta(7,193)"),
       col = 1:2, lty = 1:2,
       ncol = 1,                        # number of columns
       cex = 1.5,                       # font size
       lwd = 2)                         # line width
abline(v = 0.07, col = "blue", lty = 1, lwd = 1.5)
text(0.07, -0.5, "0.07")
abline(v = 0.035, col = "gray60", lty = 3, lwd = 2)
text(0.035, 1, "0.035")

If we choose r = 7 and s = 193:

    gΘ(θ | X = k) = [Γ(n + 200) / (Γ(k + 7)Γ(n − k + 193))] θ^{k+6} (1 − θ)^{n−k+192},   θ ∈ [0, 1].

Moreover, if n = 10 and k = 2,

    gΘ(θ | X = 2) = [Γ(210) / (Γ(9)Γ(201))] θ^8 (1 − θ)^{200},   θ ∈ [0, 1].

x <- seq(0, 0.1, length = 1025)
pdf <- cbind(dbeta(x, 7, 193), dbeta(x, 9, 201))
matplot(x, pdf, type = "l", lty = 1:2,
        xlab = "theta", ylab = "PDF",
        lwd = 2)                        # line width
legend(0.05, 25,                        # position of legend
       c("Prior Beta(7,193)", "Posterior Beta(9,201)"),   # labels match the column order of pdf
       col = 1:2, lty = 1:2,
       ncol = 1,                        # number of columns
       cex = 1.5,                       # font size
       lwd = 2)                         # line width
abline(v = 0.07, col = "blue", lty = 1, lwd = 1.5)
text(0.07, -0.5, "0.07")
abline(v = 0.035, col = "black", lty = 3, lwd = 2)
text(0.035, 1, "0.035")

Definition. If the posterior distributions p(Θ | X) are in the same probability distribution family as the prior probability distribution p(Θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function.

1. Beta distributions are conjugate priors for Bernoulli, binomial, negative binomial, and geometric likelihoods.
2. Gamma distributions are conjugate priors for Poisson and exponential likelihoods.
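The Beta–Binomial update can be written as a one-line helper (a sketch, not part of the text; beta_update is a hypothetical function name). The numbers below reproduce the example above: prior Beta(7, 193) and k = 2 successes in n = 10 trials give posterior Beta(9, 201), whose mean can be compared with the prior mean 0.035.

# Beta-Binomial conjugate update (hypothetical helper; numbers from the example above)
beta_update <- function(r, s, k, n) c(shape1 = r + k, shape2 = s + n - k)
post <- beta_update(r = 7, s = 193, k = 2, n = 10)
post                                              # Beta(9, 201)
c(prior_mean = 7 / (7 + 193),
  post_mean  = unname(post["shape1"]) / sum(post))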
E.g. 2. Let X1, · · · , Xn be a random sample from Poisson(θ):

    pX(k; θ) = e^{−θ} θ^k / k!   for k = 0, 1, · · · .

Let W = Σ_{i=1}^n Xi. Then W follows Poisson(nθ).

Prior distribution: Θ ∼ Gamma(s, µ) (shape s, rate µ), i.e.,

    fΘ(θ) = µ^s θ^{s−1} e^{−µθ} / Γ(s)   for θ > 0.

In summary:

- X1, · · · , Xn | θ ∼ Poisson(θ), with prior Θ ∼ Gamma(s, µ), s and µ known;
- equivalently, W = Σ_{i=1}^n Xi | θ ∼ Poisson(nθ), with prior Θ ∼ Gamma(s, µ), s and µ known.

W is discrete and Θ is continuous, so

    gΘ(θ | W = w) = pW(w | Θ = θ) fΘ(θ) / ∫_R pW(w | Θ = θ′) fΘ(θ′) dθ′.

Numerator:

    pW(w | Θ = θ) fΘ(θ) = [e^{−nθ} (nθ)^w / w!] × [µ^s θ^{s−1} e^{−µθ} / Γ(s)]
                        = [n^w µ^s / (w! Γ(s))] θ^{w+s−1} e^{−(µ+n)θ},   θ > 0.

Denominator:

    pW(w) = ∫_R pW(w | Θ = θ′) fΘ(θ′) dθ′
          = [n^w µ^s / (w! Γ(s))] ∫_0^∞ θ′^{w+s−1} e^{−(µ+n)θ′} dθ′
          = [n^w µ^s / (w! Γ(s))] × Γ(w + s) / (µ + n)^{w+s}.

Taking the ratio,

    gΘ(θ | W = w) = [(µ + n)^{w+s} / Γ(w + s)] θ^{w+s−1} e^{−(µ+n)θ},   θ > 0.

Conclusion: the posterior of Θ ∼ Gamma(w + s, n + µ) (shape w + s, rate n + µ). Recall that the prior of Θ ∼ Gamma(s, µ).

Case Study 5.8.1

x <- seq(0, 4, length = 1025)
pdf <- cbind(dgamma(x, shape = 88, rate = 50),
             dgamma(x, shape = 88 + 92, 100),
             dgamma(x, 88 + 92 + 72, 150))
matplot(x, pdf, type = "l", lty = 1:3,
        xlab = "theta", ylab = "PDF",
        lwd = 2)                        # line width
legend(2, 3.5,                          # position of legend
       c("Prior Gamma(88,50)", "Posterior1 Gamma(180,100)", "Posterior2 Gamma(252,150)"),
       col = 1:3, lty = 1:3,
       ncol = 1,                        # number of columns
       cex = 1.5,                       # font size
       lwd = 2)                         # line width
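As with the Beta–Binomial case, the Gamma–Poisson conjugacy can be checked numerically (a sketch, not part of the text). The values below mirror the first update in the Case Study 5.8.1 code, prior Gamma(88, 50) → posterior Gamma(180, 100), which by the update rule above corresponds to s = 88, µ = 50, w = 92, n = 50.

# Numerical check of the Gamma-Poisson conjugacy (values read off the Case Study code)
s <- 88; mu <- 50; w <- 92; n <- 50
theta <- seq(0.01, 4, by = 0.01)
unnorm <- dpois(w, n * theta) * dgamma(theta, shape = s, rate = mu)   # pW(w | theta) * prior
post   <- unnorm / sum(unnorm * 0.01)                                 # normalize on the grid
max(abs(post - dgamma(theta, shape = w + s, rate = mu + n)))          # small (discretization error only)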
Bayesian Point Estimation

Question. Can one calculate an appropriate point estimate θe for θ, given the posterior gΘ(θ | W = w)?

Definitions. Let θe be an estimate for θ based on a statistic W. The loss function associated with θe is denoted L(θe, θ), where L(θe, θ) ≥ 0 and L(θ, θ) = 0.

Let gΘ(θ | W = w) be the posterior distribution of the random variable Θ. Then the risk associated with the estimate θe is the expected value of the loss function with respect to the posterior distribution of Θ:

    risk = ∫_R L(θe, θ) gΘ(θ | W = w) dθ     if Θ is continuous,
    risk = Σ_i L(θe, θi) gΘ(θi | W = w)      if Θ is discrete.

Theorem. Let gΘ(θ | W = w) be the posterior distribution of the random variable Θ.

1. If L(θe, θ) = |θe − θ|, then the Bayes point estimate for θ is the median of gΘ(θ | W = w).
2. If L(θe, θ) = (θe − θ)², then the Bayes point estimate for θ is the mean of gΘ(θ | W = w).

Remarks.

1. The median usually does not have a closed-form formula.
2. The mean usually does have a closed-form formula.
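To make the theorem concrete (a sketch, not part of the text): for the posterior Beta(9, 201) obtained in E.g. 1, the Bayes estimate under absolute-error loss is the posterior median (computed numerically with qbeta), and under squared-error loss it is the posterior mean 9/210.

# Bayes point estimates for the posterior Beta(9, 201) from E.g. 1
shape1 <- 9; shape2 <- 201
c(median_abs_loss = qbeta(0.5, shape1, shape2),   # Bayes estimate under |theta_e - theta|
  mean_sq_loss    = shape1 / (shape1 + shape2))   # Bayes estimate under (theta_e - theta)^2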
(A table of conjugate priors and the corresponding posterior parameters can be found at https://en.wikipedia.org.)

Proof (of Part 1). In the proof, W denotes a generic random variable with pdf fW; the theorem follows by applying the result to Θ with its posterior density. Let m be the median of the random variable W. We first claim that

    E(|W − m|) ≤ E(|W|).     (⋆)

For any constant b ∈ R, because

    1/2 = P(W ≤ m) = P(W − b ≤ m − b),

we see that m − b is the median of W − b. Hence, applying (⋆) to W − b,

    E(|W − m|) = E(|(W − b) − (m − b)|) ≤ E(|W − b|)   for all b ∈ R,

which proves the statement.

Proof (of Part 1, continued). It remains to prove (⋆). Without loss of generality, we may assume m > 0. Then

    E(|W − m|) = ∫_R |w − m| fW(w) dw
               = ∫_{−∞}^m (m − w) fW(w) dw + ∫_m^∞ (w − m) fW(w) dw
               = −∫_{−∞}^m w fW(w) dw + ∫_m^∞ w fW(w) dw + m[P(W ≤ m) − P(W > m)]
               = −∫_{−∞}^0 w fW(w) dw + ∫_m^∞ w fW(w) dw − ∫_0^m w fW(w) dw
               ≤ −∫_{−∞}^0 w fW(w) dw + ∫_0^∞ w fW(w) dw
               = ∫_R |w| fW(w) dw
               = E(|W|).

Here m[P(W ≤ m) − P(W > m)] = (1/2)(m − m) = 0 because m is the median, and the inequality holds because ∫_0^m w fW(w) dw ≥ 0 when m > 0, so dropping it and enlarging ∫_m^∞ to ∫_0^∞ can only increase the expression.

Proof (of Part 2). Let µ be the mean of W. Then for any b ∈ R,

    E[(W − b)²] = E[([W − µ] + [µ − b])²]
                = E[(W − µ)²] + 2(µ − b) E(W − µ) + (µ − b)²
                = E[(W − µ)²] + (µ − b)²                    (since E(W − µ) = 0)
                ≥ E[(W − µ)²],

that is, E[(W − µ)²] ≤ E[(W − b)²] for all b ∈ R.

E.g. 1'. As in E.g. 1, let X1, · · · , Xn | θ ∼ Bernoulli(θ) with prior Θ ∼ Beta(r, s) (r and s known), so that X = Σ_{i=1}^n Xi | θ ∼ Binomial(n, θ). The prior Beta(r, s) becomes the posterior Beta(k + r, n − k + s) upon observing X = k for a random sample of size n. Consider the L2 loss function. Then

    θe = mean of Beta(k + r, n − k + s)
       = (k + r) / (n + r + s)
       = [n / (n + r + s)] × (k/n)  +  [(r + s) / (n + r + s)] × [r / (r + s)],

where k/n is the MLE and r/(r + s) is the mean of the prior.
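A quick numerical confirmation of this weighted-average identity (a sketch, not part of the text), using the values n = 10, k = 2, r = 7, s = 193 from E.g. 1:

# Bayes estimate for Beta-Binomial as a weighted average of MLE and prior mean
n <- 10; k <- 2; r <- 7; s <- 193
post_mean <- (k + r) / (n + r + s)
weighted  <- n / (n + r + s) * (k / n) + (r + s) / (n + r + s) * (r / (r + s))
c(post_mean = post_mean, weighted = weighted)     # both equal 9/210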
MLE vs. Prior (E.g. 1'):

    θe = [n / (n + r + s)] × (k/n)  +  [(r + s) / (n + r + s)] × [r / (r + s)]
       = (weight on data) × MLE + (weight on prior) × (mean of prior).

E.g. 2'. As in E.g. 2, let X1, · · · , Xn | θ ∼ Poisson(θ) with prior Θ ∼ Gamma(s, µ) (s and µ known), so that W = Σ_{i=1}^n Xi | θ ∼ Poisson(nθ). The prior Gamma(s, µ) becomes the posterior Gamma(w + s, µ + n) upon observing W = w for a random sample of size n. Consider the L2 loss function. Then

    θe = mean of Gamma(w + s, µ + n)
       = (w + s) / (µ + n)
       = [n / (µ + n)] × (w/n)  +  [µ / (µ + n)] × (s/µ),

where w/n is the MLE and s/µ is the mean of the prior.
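The same weighted-average structure can be checked numerically for the Gamma–Poisson case (a sketch, not part of the text); the values below reuse the first update of Case Study 5.8.1 (s = 88, µ = 50 and, reading the posterior Gamma(180, 100) off the code, w = 92 and n = 50).

# Bayes estimate for Gamma-Poisson as a weighted average of MLE and prior mean
s <- 88; mu <- 50; w <- 92; n <- 50
post_mean <- (w + s) / (mu + n)
weighted  <- n / (mu + n) * (w / n) + mu / (mu + n) * (s / mu)
c(post_mean = post_mean, weighted = weighted)     # both equal 180/100 = 1.8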
MLE vs. Prior (E.g. 2'):

    θe = [n / (µ + n)] × (w/n)  +  [µ / (µ + n)] × (s/µ)
       = (weight on data) × MLE + (weight on prior) × (mean of prior).

Appendix: Beta integral

Lemma. B(α, β) := ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β) / Γ(α + β).

Proof. Notice that

    Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx   and   Γ(β) = ∫_0^∞ y^{β−1} e^{−y} dy.

Hence,

    Γ(α)Γ(β) = ∫_0^∞ ∫_0^∞ x^{α−1} y^{β−1} e^{−(x+y)} dx dy.

The key in the proof is the following change of variables:

    x = r² cos²(θ),   y = r² sin²(θ)

    ⇒  ∂(x, y)/∂(r, θ) = [ 2r cos²(θ)   −2r² cos(θ) sin(θ)
                            2r sin²(θ)    2r² cos(θ) sin(θ) ]

    ⇒  det(∂(x, y)/∂(r, θ)) = 4r³ sin(θ) cos(θ).

Therefore,

    Γ(α)Γ(β) = ∫_0^{π/2} dθ ∫_0^∞ dr  r^{2(α+β)−4} e^{−r²} cos^{2α−2}(θ) sin^{2β−2}(θ) × 4r³ sin(θ) cos(θ)
                                                                                         (Jacobian)
             = 4 ( ∫_0^{π/2} cos^{2α−1}(θ) sin^{2β−1}(θ) dθ ) ( ∫_0^∞ r^{2(α+β)−1} e^{−r²} dr ).

Now let us compute the following two integrals separately:

    I1 := ∫_0^{π/2} cos^{2α−1}(θ) sin^{2β−1}(θ) dθ,
    I2 := ∫_0^∞ r^{2(α+β)−1} e^{−r²} dr.

For I2, by the change of variable r² = u (so that 2r dr = du),

    I2 = ∫_0^∞ r^{2(α+β)−1} e^{−r²} dr
       = (1/2) ∫_0^∞ r^{2(α+β)−2} e^{−r²} (2r dr)
       = (1/2) ∫_0^∞ u^{α+β−1} e^{−u} du
       = (1/2) Γ(α + β).

For I1, by the change of variable √x = cos(θ) (so that −sin(θ) dθ = dx / (2√x), and θ: 0 → π/2 corresponds to x: 1 → 0),

    I1 = ∫_0^{π/2} cos^{2α−1}(θ) sin^{2β−1}(θ) dθ
       = ∫_0^{π/2} cos^{2α−1}(θ) sin^{2β−2}(θ) × sin(θ) dθ
       = −∫_1^0 x^{α−1/2} (1 − x)^{β−1} dx / (2√x)
       = (1/2) ∫_0^1 x^{α−1} (1 − x)^{β−1} dx
       = (1/2) B(α, β).

Therefore,

    Γ(α)Γ(β) = 4 I1 × I2 = 4 × (1/2) B(α, β) × (1/2) Γ(α + β),

i.e., B(α, β) = Γ(α)Γ(β) / Γ(α + β).
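As a numerical cross-check of the lemma (a sketch, not part of the text), R's integrate, beta, and gamma functions should agree for any fixed α, β > 0; the values below are arbitrary illustrative choices (the Beta parameter is named b to avoid masking base::beta).

# Numerical check of B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta)
alpha <- 2.5; b <- 4                              # arbitrary illustrative values
c(integral   = integrate(function(x) x^(alpha - 1) * (1 - x)^(b - 1), 0, 1)$value,
  beta_fn    = beta(alpha, b),
  gamma_form = gamma(alpha) * gamma(b) / gamma(alpha + b))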