18.05 Problem Set 6, Spring 2014 Solutions

Problem 1. (10 pts.)

(a) Throughout this problem we will let x be the data of 140 heads out of 250 tosses. We have 140/250 = 0.56. Computing the likelihoods:

    p(x|H_0) = \binom{250}{140} (0.5)^{250},    p(x|H_1) = \binom{250}{140} (0.56)^{140} (0.44)^{110},

which yields Bayes factor

    p(x|H_0) / p(x|H_1) = (0.5)^{250} / ((0.56)^{140} (0.44)^{110}) = 0.16458.

(Actually, we computed the log Bayes factor since it is numerically more stable, then exponentiated to get the Bayes factor.)

Since we chose the probability 140/250 for H_1 to exactly match the data, it is not surprising that the probability of the data given H_1 is much greater than the probability given H_0. Said differently, the data will pull our prior towards one centered at 140/250.

(b) Here are the plots of the five priors. The vertical dashed red line is at θ = 0.5. The R code is posted alongside these solutions.

[Figure: pdfs of the priors Beta(1,1), Beta(10,10), Beta(50,50), Beta(500,500), and Beta(30,70) plotted over 0 ≤ θ ≤ 1, with a vertical dashed red line at θ = 0.5.]

A priori I would want my prior centered at 0.5. This rules out Beta(30,70). Beta(500,500) seems too narrow. Beta(1,1) doesn't really match my experience with coins, but I might go with it and just let the data speak for itself. Both Beta(10,10) and Beta(50,50) seem plausible. Even if they're wrong they aren't so strong that they would cause us to ignore the evidence in the data.

(c) The prior probability of a bias in favor of heads is P(θ > 1/2). Looking at the plots of the prior pdfs in part (b), we see that (i)-(iv) are symmetric about 0.5, therefore they predict the probability of heads is 1/2. That is, they are all unbiased. (v) has most of its probability below 0.5, so it is strongly biased against heads. Thus, the ranking in order of bias from least to greatest is (v) followed by a four-way tie among (i)-(iv).

(d) All of the prior pdfs are beta distributions, so they have the form f(θ) = c_1 θ^{a-1} (1-θ)^{b-1}. For a fixed hypothesis θ the likelihood function (given the data x) is

    p(x|θ) = \binom{250}{140} θ^{140} (1-θ)^{110}.

Thus the posterior pdf is

    f(θ|x) = c_2 θ^{140+a-1} (1-θ)^{110+b-1} ∼ Beta(140+a, 110+b).

So the five posterior distributions (i)-(v) are Beta(141,111), Beta(150,120), Beta(190,160), Beta(640,610), and Beta(170,180).

Here are the plots of the five posteriors.

[Figure: pdfs of the five posteriors, plotted over 0 ≤ θ ≤ 1, labeled by their priors Beta(1,1), Beta(10,10), Beta(50,50), Beta(500,500), Beta(30,70).]

Each prior is centered on a value of θ. The sharpness of the peak is a measure of the prior's 'commitment' to this value. So prior (iv) is strongly committed to θ = 0.5, but prior (ii) is only weakly committed and (i) is essentially uncommitted. The effect of the data is to pull the center of the prior towards the data mean of 0.56. That is, it averages the center of the prior and the data mean. The stronger the prior belief, the less the data pulls the center towards 0.56. So prior (iv) is only moved a little and prior (i) is moved almost all the way to 0.56. Priors (ii) and (iii) are intermediate.

Prior (v) is centered at θ = 0.3. The data moves the center a long way towards 0.56, but since it starts so much farther from 0.56 than the other priors, the posterior is still centered the farthest from 0.56.
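As a concrete illustration of the computations above, here is a minimal R sketch (not necessarily identical to the R code posted with these solutions) that computes the Bayes factor on the log scale and plots the five priors. The binomial coefficients cancel in the ratio, so dbinom can be used for both likelihoods.

    # Problem 1(a): Bayes factor p(x|H0)/p(x|H1), computed via log likelihoods
    # for numerical stability and then exponentiated.
    logBF <- dbinom(140, size = 250, prob = 0.50, log = TRUE) -
             dbinom(140, size = 250, prob = 0.56, log = TRUE)
    exp(logBF)                      # approximately 0.16458

    # Problem 1(b): the five prior pdfs on one set of axes.
    theta <- seq(0, 1, length.out = 1001)
    a <- c(1, 10, 50, 500, 30)
    b <- c(1, 10, 50, 500, 70)
    plot(theta, dbeta(theta, a[1], b[1]), type = "l", ylim = c(0, 26),
         xlab = expression(theta), ylab = "prior density")
    for (j in 2:5) lines(theta, dbeta(theta, a[j], b[j]), col = j)
    abline(v = 0.5, col = "red", lty = 2)   # dashed red line at theta = 0.5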
(e) For each of the five posterior distributions, we compute P(θ > 0.5 | x):

    P_(i)(θ > 0.5 | x)   = 1 - pbeta(.5, 141, 111) = 0.9710
    P_(ii)(θ > 0.5 | x)  = 1 - pbeta(.5, 150, 120) = 0.9664
    P_(iii)(θ > 0.5 | x) = 1 - pbeta(.5, 190, 160) = 0.9459
    P_(iv)(θ > 0.5 | x)  = 1 - pbeta(.5, 640, 610) = 0.8020
    P_(v)(θ > 0.5 | x)   = 1 - pbeta(.5, 170, 180) = 0.2963

This is consistent with the plot in (d), as the posterior computed from the uniform prior has the most density past 0.5 while the posterior computed from prior (v) has the least.

(f) Step 1. Since the intervals are small we can use the relation probability ≈ density · Δθ. So

    P_(i)(H_0|x) = P_(i)(0.49 ≤ θ ≤ 0.51 | x) ≈ f_(i)(0.5|x) · 0.02   and   P_(i)(H_1|x) = P_(i)(0.55 ≤ θ ≤ 0.57 | x) ≈ f_(i)(0.56|x) · 0.02.

So the posterior odds (using prior (i)) of H_1 versus H_0 are approximately

    P_(i)(H_1|x) / P_(i)(H_0|x) ≈ f_(i)(0.56|x) / f_(i)(0.5|x) = c (0.56)^{140} (0.44)^{110} / (c (0.5)^{140} (0.5)^{110}) = dbeta(.56, 141, 111) / dbeta(.5, 141, 111) ≈ 6.07599.

By similar reasoning, the posterior odds (using prior (iv)) of H_1 versus H_0 are approximately 0.00437.

Problem 2. (10 pts.)

Let A be the event that Alice is collecting tickets and B the event that Bob is collecting tickets. Denoting our data as D, we have the likelihoods

    P(D|A) = 10^{12+10+11+4+11} e^{-50} / (12! 10! 11! 4! 11!),    P(D|B) = 15^{12+10+11+4+11} e^{-75} / (12! 10! 11! 4! 11!).

Moreover, we are given the prior odds O(A) = P(A)/P(B) = 1/10. Thus, our posterior odds are

    O(A|D) = O(A) · P(D|A)/P(D|B) = (1/10) · (10/15)^{48} · e^{25} ≈ 25.408.

Note that the Bayes factor is about 250.
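For completeness, here is a short R check of the Problem 2 odds, again computed on the log scale. The counts 12, 10, 11, 4, 11 and the per-day rates 10 (Alice) and 15 (Bob) are read off from the likelihoods above; this is a sketch, not the posted solution code.

    # Problem 2: Bayes factor P(D|A)/P(D|B) and posterior odds O(A|D).
    D <- c(12, 10, 11, 4, 11)                         # daily ticket counts
    logBF <- sum(dpois(D, lambda = 10, log = TRUE)) - # Alice: Poisson(10) per day
             sum(dpois(D, lambda = 15, log = TRUE))   # Bob:   Poisson(15) per day
    exp(logBF)              # Bayes factor, roughly 254
    (1/10) * exp(logBF)     # posterior odds, roughly 25.408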
Problem 3. (10 pts.)

(a) We have a flat prior pdf f(θ) = 1. For a single data value x, our likelihood function is

    f(x|θ) = 0 if θ < x,    f(x|θ) = 1/θ if x ≤ θ ≤ 1.

Thus our table is

    hypothesis      prior    likelihood    unnormalized posterior    posterior f(θ|x)
    θ < x           dθ       0             0                         0
    x ≤ θ ≤ 1       dθ       1/θ           dθ/θ                      c dθ/θ
    Total           1                                                1

The normalizing constant c must make the total posterior probability 1, so

    ∫_x^1 (c/θ) dθ = 1  ⇒  c = -1/ln(x).

Note that since x ≤ 1, we have c = -1/ln(x) > 0. Here are plots for x = .2 (c = 0.621) and x = .5 (c = 1.443).

[Figure: posterior pdfs f(θ | x = .2) and f(θ | x = .5) plotted over 0 ≤ θ ≤ 1.]

(b) Notice that θ cannot be less than any one of x_1, ..., x_n. So the likelihood function is given by

    f(x_1,...,x_n | θ) = 0 if θ < max{x_1,...,x_n},    f(x_1,...,x_n | θ) = 1/θ^n if max{x_1,...,x_n} ≤ θ ≤ 1.

Let x_M = max{x_1,...,x_n}. So our table is

    hypothesis      prior    likelihood    unnormalized posterior    posterior f(θ|data)
    θ < x_M         dθ       0             0                         0
    x_M ≤ θ ≤ 1     dθ       1/θ^n         dθ/θ^n                    c dθ/θ^n
    Total           1                                                1

The normalizing constant c must make the total posterior probability 1, so

    ∫_{x_M}^1 (c/θ^n) dθ = 1  ⇒  c = (n-1)/(x_M^{1-n} - 1).

The posterior pdf depends only on n and x_M, therefore the data (.1, .5) and (.5, .5) have the same posteriors. Here are the plots of the posteriors for the given data.

[Figure: posterior pdfs for data = (.1, .5) or (.5, .5) and for data = (.1, ..., .5), plotted over 0 ≤ θ ≤ 1.]

(c) We now have x_M = 0.5, so from part (b) the posterior density is

    f(θ | x_1,...,x_5) = 0 if θ < .5,    f(θ | x_1,...,x_5) = c/θ^5 if .5 ≤ θ ≤ 1,

where c = 4/(.5^{-4} - 1) = 4/15. Now let x be the amount by which Jane is late for the sixth class. The likelihood is

    f(x|θ) = 0 if θ < x,    f(x|θ) = 1/θ if θ ≥ x.

We have the posterior predictive pdf

    f(x | x_1,...,x_5) = ∫ f(x|θ) f(θ | x_1,...,x_5) dθ.

For 0 ≤ x < .5:

    f(x | x_1,...,x_5) = ∫_{.5}^1 (1/θ) · (4/(15 θ^5)) dθ = [-(4/75) θ^{-5}]_{.5}^1 = 124/75.

For .5 ≤ x ≤ 1:

    f(x | x_1,...,x_5) = ∫_x^1 (1/θ) · (4/(15 θ^5)) dθ = [-(4/75) θ^{-5}]_x^1 = (4/75)(x^{-5} - 1).

Thus the posterior predictive probability that x ≤ 0.5 is

    P(x ≤ .5 | x_1,...,x_5) = ∫_0^{.5} f(x | x_1,...,x_5) dx = ∫_0^{.5} (124/75) dx = 62/75 = 0.82667.

(d) The graphs or the formula in part (a) show that f(θ|x) is decreasing for θ ≥ x, so the MAP estimate is θ = x.

(e) Extra credit: 5 points.

(i) From part (a) we have the posterior pdf f(θ|x). The conditional expectation is

    E(θ|x) = ∫ θ f(θ|x) dθ = ∫_x^1 θ · (c/θ) dθ = c(1-x) = -(1-x)/ln(x).

(ii) After observing x, we know that θ ≥ x, and as a result the conditional expectation E[θ|x] ≥ x. So the conditional expectation estimator is always at least as big as the MAP estimator. However, the MAP estimator is precisely x, the amount by which Jane is late on the first class. In this context the MAP is not reasonable, as it suggests that on the first class Jane arrived as late as possible and that in the future she will arrive less than x hours late.

[Figure: the MAP estimator and the conditional expectation estimator plotted as functions of x on 0 ≤ x ≤ 1.]

Problem 4. (10 pts.)

(a) Leaving the scale factors as letters, our table is

    hypothesis    prior f(θ) ∼ N(5, 9)        likelihood f(x=6|θ) ∼ N(θ, 4)    posterior f(θ|x)
    θ             c_1 e^{-(θ-5)^2/18} dθ      c_2 e^{-(6-θ)^2/8}               c exp(-(θ-5)^2/18 - (6-θ)^2/8) dθ
    Total         1                                                            1

All we need is some algebraic manipulation of the exponent in the posterior:

    -(θ-5)^2/18 - (6-θ)^2/8 = -(1/2) [(θ^2 - 12θ + 36)/4 + (θ^2 - 10θ + 25)/9]
                            = -(1/2) (13θ^2 - 148θ + 424)/36
                            = -(1/2) (θ - 74/13)^2 / (36/13) + k,

where k is a constant. Thus the posterior is

    f(θ|x) ∝ exp(-(θ - 74/13)^2 / (2 · 36/13)).

This has the form of a pdf for N(74/13, 36/13). QED

(b) We have μ_prior = 5, σ²_prior = 9, x̄ = 6, σ² = 4, n = 4. So we have

    a = 1/9,   b = n/σ² = 1,   a + b = 10/9  ⇒  μ_post = (5/9 + 6)/(10/9) = 5.9,   σ²_post = 1/(10/9) = 0.9.

[Figure: the prior N(5, 9) and the posterior N(5.9, 0.9).]

After observing x_1, ..., x_4, we see that the posterior mean is close to x̄ and the posterior variance is much smaller than the prior variance. The data has made us more certain about the location of θ.

(c) As more data is received, n increases, so b increases, so the mean of the posterior is closer to the data mean and the variance of the posterior decreases. Since the variance goes down, we gain more certainty about the true value of θ.

(d) With no new data we are given the prior f(θ) ∼ N(100, 15²). For data x = score on the IQ test we have the likelihood f(x|θ) ∼ N(θ, 10²). Using the update formulas we have μ_prior = 100, σ²_prior = 15², σ² = 10², n = 1. So a = 1/225, b = 1/100 and

    (i) Randall, x = 80:   μ_post = (a · 100 + b · 80)/(a + b) = 86.15.
    (ii) Mary, x = 150:    μ_post = (a · 100 + b · 150)/(a + b) = 134.62.

Regression towards the mean!
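Here is a small R sketch of the update formulas used in parts (b) and (d), with a = 1/σ²_prior and b = n/σ² as above. The helper name normal_update is ours, not from the posted solutions.

    # Normal-normal updating: the posterior mean is the (a,b)-weighted average of
    # the prior mean and the data mean; the posterior variance is 1/(a + b).
    normal_update <- function(mu_prior, var_prior, xbar, var_data, n) {
      a <- 1 / var_prior
      b <- n / var_data
      c(mu_post = (a * mu_prior + b * xbar) / (a + b), var_post = 1 / (a + b))
    }
    normal_update(5, 9, 6, 4, 4)            # part (b): mu_post = 5.9, var_post = 0.9
    normal_update(100, 15^2, 80, 10^2, 1)   # Randall:  mu_post = 86.15
    normal_update(100, 15^2, 150, 10^2, 1)  # Mary:     mu_post = 134.62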
(e) Extra credit: 5 points. This is essentially the same manipulation as in part (a). First suppose we have one data value x_1. Then the table is

    hypothesis    prior f(θ) ∼ N(μ_prior, σ²_prior)           likelihood f(x_1|θ) ∼ N(θ, σ²)    posterior f(θ|x_1)
    θ             c_1 e^{-(θ-μ_prior)^2/(2σ²_prior)} dθ       c_2 e^{-(x_1-θ)^2/(2σ²)}          c exp(-(θ-μ_prior)^2/(2σ²_prior) - (x_1-θ)^2/(2σ²)) dθ
    Total         1                                                                             1

All we need is some algebraic manipulation of the exponent in the posterior. Whenever we get a term not involving θ we just absorb it into the constants k_1, k_2, etc.:

    -(θ-μ_prior)^2/(2σ²_prior) - (x_1-θ)^2/(2σ²)
        = -(1/2) [(θ^2 - 2μ_prior θ + μ²_prior)/σ²_prior + (θ^2 - 2x_1 θ + x_1^2)/σ²]
        = -(1/2) [(a+b)θ^2 - 2(a μ_prior + b x_1)θ] + k_1
        = -(1/2) (θ - (a μ_prior + b x_1)/(a+b))^2 / (1/(a+b)) + k_2.

Thus the posterior is

    f(θ|x_1) ∝ exp(-(θ - (a μ_prior + b x_1)/(a+b))^2 / (2/(a+b))).

This has the form of a pdf for N((a μ_prior + b x_1)/(a+b), 1/(a+b)). This proves the formulas (1) when n = 1. The formulas when n > 1 are a simple consequence of updating one data point at a time using the formulas for n = 1.

Problem 5. (10 pts.) Censored data.

(a) We assume that, given a particular die, the rolls are independent. Let x be the censored value on one roll. The Bayes factor for x is

    Bayes factor = p(x | 4-sided) / p(x | 6-sided) = (3/4)/(5/6) = 9/10 if x = 0,    (1/4)/(1/6) = 3/2 if x = 1.

Starting from prior odds of 1, we multiply by the appropriate Bayes factor and get the posterior odds after rolls 1-5:

    3/2 = 1.5,   27/20 = 1.35,   81/40 = 2.025,   243/80 = 3.0375,   729/160 = 4.5562.

(b) In part (a) we saw the Bayes factor when x = 1 is 3/2. Since this is more than 1, it is evidence in favor of the 4-sided die. When x = 0 the Bayes factor is 9/10, which is evidence in favor of the 6-sided die. We saw this in part (a): after every value of 1 the odds for the 4-sided die went up, and after the value of 0 the odds went down.
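A short R sketch of the odds updates in part (a). The roll-by-roll sequence 1, 0, 1, 1, 1 is inferred from the odds listed above (each factor of 3/2 corresponds to a roll of 1 and each factor of 9/10 to a roll of 0); treat it as an assumption.

    # Problem 5(a): running posterior odds for the 4-sided die, starting from
    # prior odds of 1 and multiplying by the Bayes factor of each censored roll.
    x  <- c(1, 0, 1, 1, 1)               # censored rolls (inferred from the odds above)
    bf <- ifelse(x == 1, 3/2, 9/10)      # Bayes factor: 3/2 if x = 1, 9/10 if x = 0
    cumprod(bf)                          # 1.5  1.35  2.025  3.0375  4.55625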