BTRY 4090 / STSCI 4090, Spring 2010
Theory of Statistics
Instructor: Ping Li, Department of Statistical Science, Cornell University

General Information

• Lectures: Tue, Thu 10:10-11:25 am, Stimson Hall G01
• Section: Mon 2:55-4:10 pm, Warren Hall 131
• Instructor: Ping Li, pingli@cornell.edu. Office hours: Tue, Thu 11:25 am-12 pm, 1192 Comstock Hall
• TA: Xiao Luo, lx42@cornell.edu. Office hours: (1) Mon, 4:10-5:10 pm, Warren Hall 131; (2) Wed, 2:30-3:30 pm, Comstock Hall 1181
• Prerequisites: BTRY 4080 or equivalent
• Textbook: Rice, Mathematical Statistics and Data Analysis, 3rd edition

Exams

• Prelim 1: in class, Feb. 25, 2010
• Prelim 2: in class, April 8, 2010
• Final Exam: Warren Hall 145, 2 pm-4:30 pm, May 13, 2010
• Policy: closed book, closed notes

Programming: There will be some programming assignments. You can use either Matlab or R. For practice, please download the Matlab examples in the 4080 lecture notes.

Homework: Weekly

• Please turn in your homework either in class or at the BSCB front desk (Comstock Hall 1198).
• No late homework will be accepted.
• Before computing your overall homework grade, the assignment with the lowest grade (if ≥ 25%) will be dropped; the assignment with the second lowest grade (if ≥ 50%) will also be dropped.
• It is the students' responsibility to keep copies of the submitted homework.

Grading: Two formulas

1. Homework: 30% + Two Prelims: 35% + Final: 35%
2. Homework: 30% + Two Prelims: 25% + Final: 45%

Your grade is whichever is higher.

Course Letter Grade Assignment: A ≈ 90% (on the absolute scale), C ≈ 60% (on the absolute scale). In borderline cases, participation in section and class interactions will be used as a determining factor.

Syllabus

• Random number generation
• Probability, Random Variables, Joint Distributions, Expected Values: Chapters 1-4
• Limit Theorems: Chapter 5
• Distributions Derived from the Normal Distribution: Chapter 6
• Estimation of Parameters and Fitting of Probability Distributions: Chapter 8
• Testing Hypotheses and Assessing Goodness of Fit: Chapter 9
• Comparing Two Samples: Chapter 11
• The Analysis of Categorical Data: Chapter 13
• Linear Least Squares: Chapter 14

Chapters 1 to 4: Mostly Reviews

• Random number generation
• The method of random projections: a real example of using probability to solve computationally intensive (or infeasible) problems.
• The capture/recapture method: an example of discrete probability and an introduction to parameter estimation using maximum likelihood.
• Conditional expectations, the bivariate normal, and random projections
• Moment generating functions and random projections

Nonuniform Sampling by Inversion

The goal: sample $X$ from a distribution $F(x)$.

Inverse transform sampling:
• Sample $U \sim \mathrm{Uniform}(0,1)$.
• Output $X = F^{-1}(U)$.

Proof: $\Pr(X \le x) = \Pr(F^{-1}(U) \le x) = \Pr(U \le F(x)) = F(x)$.

Limitation: We need a closed-form $F^{-1}$, but many common distributions (e.g., the normal) do not have a closed-form $F^{-1}$.
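As a quick illustration of inversion, here is a minimal Matlab sketch for the exponential case worked out on the next slide; the choices lambda = 2 and n = 10^5 are mine, not from the notes.

lambda = 2; n = 10^5;
U = rand(n, 1);                  % U ~ Uniform(0,1)
X = -log(1 - U) / lambda;        % X = F^{-1}(U) ~ Exponential(lambda)
mean(X)                          % should be close to 1/lambda = 0.5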
Examples of Inversion Transform Sampling

• $X \sim \mathrm{Exponential}(\lambda)$, i.e., $F(x) = 1 - e^{-\lambda x}$, $x \ge 0$. Let $U \sim \mathrm{Uniform}(0,1)$; then $\frac{\log(1-U)}{-\lambda} \sim \mathrm{Exponential}(\lambda)$.
• $X \sim \mathrm{Pareto}(\alpha)$, i.e., $F(x) = 1 - \frac{1}{x^\alpha}$, $x \ge 1$. Let $U \sim \mathrm{Uniform}(0,1)$; then $\frac{1}{(1-U)^{1/\alpha}} \sim \mathrm{Pareto}(\alpha)$.

A small trick: if $U \sim \mathrm{Uniform}(0,1)$, then $1-U \sim \mathrm{Uniform}(0,1)$. Thus we can replace $1-U$ by $U$.

The Box-Muller Transform

Let $U_1$ and $U_2$ be i.i.d. samples from $\mathrm{Uniform}(0,1)$. Then
$$N_1 = \sqrt{-2\log U_1}\,\cos(2\pi U_2), \qquad N_2 = \sqrt{-2\log U_1}\,\sin(2\pi U_2)$$
are two i.i.d. samples from the standard normal $N(0,1)$.

Q: How do we generate non-standard normals?

An Introduction to Random Projections

Many applications require a data matrix $\mathbf{A} \in \mathbb{R}^{n\times D}$. For example, a term-by-document matrix may contain $n = 10^{10}$ documents (web pages) and $D = 10^6$ single words, or $D = 10^{12}$ double words (bi-gram model), or $D = 10^{18}$ triple words (tri-gram model).

Many matrix operations boil down to computing how close (or how far) two rows (or columns) of the matrix are. For example, linear least squares: $(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T y$.

Challenges: The matrix may be too large to store, or computing $\mathbf{A}^T\mathbf{A}$ is too expensive.

Random Projections: Replace $\mathbf{A}$ by $\mathbf{B} = \mathbf{A}\times\mathbf{R}$

$\mathbf{R} \in \mathbb{R}^{D\times k}$: a random matrix with i.i.d. entries sampled from $N(0,1)$.
$\mathbf{B} \in \mathbb{R}^{n\times k}$: the projected matrix, also random.
$k$ is very small (e.g., $k = 50 \sim 100$), but $n$ and $D$ are very large.

$\mathbf{B}$ approximately preserves the Euclidean distances and dot products between any two rows of $\mathbf{A}$. In particular, $E\left(\frac1k\mathbf{B}\mathbf{B}^T\right) = \mathbf{A}\mathbf{A}^T$.

Consider the first two rows of $\mathbf{A}$: $u_1, u_2 \in \mathbb{R}^D$,
$$u_1 = \{u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D}\}, \qquad u_2 = \{u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D}\}$$
and the first two rows of $\mathbf{B}$: $v_1, v_2 \in \mathbb{R}^k$,
$$v_1 = \{v_{1,1}, ..., v_{1,j}, ..., v_{1,k}\}, \qquad v_2 = \{v_{2,1}, ..., v_{2,j}, ..., v_{2,k}\}$$
where $v_1 = \mathbf{R}^T u_1$, $v_2 = \mathbf{R}^T u_2$, and $\mathbf{R} = \{r_{ij}\}$, $i = 1$ to $D$, $j = 1$ to $k$, with $r_{ij} \sim N(0,1)$. Entrywise,
$$v_{1,j} = \sum_{i=1}^D r_{ij}u_{1,i}, \qquad v_{2,j} = \sum_{i=1}^D r_{ij}u_{2,i}, \qquad v_{1,j}-v_{2,j} = \sum_{i=1}^D r_{ij}[u_{1,i}-u_{2,i}]$$

The squared Euclidean norm of $u_1$: $\sum_{i=1}^D |u_{1,i}|^2$. Of $v_1$: $\sum_{j=1}^k |v_{1,j}|^2$.
The squared Euclidean distance between $u_1$ and $u_2$: $\sum_{i=1}^D |u_{1,i}-u_{2,i}|^2$. Between $v_1$ and $v_2$: $\sum_{j=1}^k |v_{1,j}-v_{2,j}|^2$.

What are we hoping for?

• $\sum_{j=1}^k |v_{1,j}|^2 \approx \sum_{i=1}^D |u_{1,i}|^2$, as close as possible.
• $\sum_{j=1}^k |v_{1,j}-v_{2,j}|^2 \approx \sum_{i=1}^D |u_{1,i}-u_{2,i}|^2$, as close as possible.
• $k$ should be as small as possible, for a specified level of accuracy.

Unbiased Estimators of $d$ and $m_1$, $m_2$

We need a good estimator: unbiased, with small variance. The estimation problem is essentially the same for $d$ and for $m_1$ (or $m_2$), so we can focus on estimating $m_1$. Random projections give us $k$ i.i.d. samples (why?)
$$v_j = \sum_{i=1}^D r_{ij}u_{1,i}, \qquad j = 1, 2, ..., k$$
Because $r_{ij} \sim N(0,1)$, we can develop estimators and analyze their properties using the normal and $\chi^2$ distributions. But we can also solve the problem without using normals.
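Before the formal analysis, here is a minimal Matlab sketch of this setup (toy sizes of my own choosing): project a small matrix and compare one squared distance before and after projection. The division by k anticipates the estimator $\hat d$ defined later.

n = 5; D = 1000; k = 100;
A = randn(n, D);                     % toy data matrix
R = randn(D, k);                     % i.i.d. N(0,1) projection matrix
B = A * R;                           % projected matrix, n x k
d  = sum((A(1,:) - A(2,:)).^2);      % true squared distance
dh = sum((B(1,:) - B(2,:)).^2) / k;  % estimate from the projected rows
[d dh]                               % should be close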
Unbiased Estimator of $m_1$

$$v_{1,j} = \sum_{i=1}^D r_{ij}u_{1,i}, \qquad j = 1, 2, ..., k, \qquad (r_{ij}\sim N(0,1))$$
To get started, let's first look at the moments:
$$E(v_{1,j}) = E\left(\sum_{i=1}^D r_{ij}u_{1,i}\right) = \sum_{i=1}^D E(r_{ij})u_{1,i} = 0$$
$$E(v_{1,j}^2) = E\left[\sum_{i=1}^D r_{ij}u_{1,i}\right]^2 = E\left[\sum_{i=1}^D r_{ij}^2u_{1,i}^2 + \sum_{i\ne i'} r_{ij}u_{1,i}r_{i'j}u_{1,i'}\right] = \sum_{i=1}^D E(r_{ij}^2)u_{1,i}^2 + \sum_{i\ne i'}E(r_{ij}r_{i'j})u_{1,i}u_{1,i'} = \sum_{i=1}^D u_{1,i}^2 + 0 = m_1$$
Great! $m_1$ is exactly what we are after. Since we have $k$ i.i.d. samples $v_{1,j}$, we can simply average them to estimate $m_1$.

An unbiased estimator of the squared Euclidean norm $m_1 = \sum_{i=1}^D|u_{1,i}|^2$:
$$\hat m_1 = \frac1k\sum_{j=1}^k|v_{1,j}|^2, \qquad E(\hat m_1) = \frac1k\sum_{j=1}^kE|v_{1,j}|^2 = \frac1k\,k\,m_1 = m_1$$
We need to analyze its variance to assess its accuracy. Recall, our goal is to use as small a number of projections $k$ as possible.
$$Var(\hat m_1) = \frac{1}{k^2}\sum_{j=1}^kVar\left(|v_{1,j}|^2\right) = \frac1kVar\left(|v_{1,j}|^2\right) = \frac1k\left[E|v_{1,j}|^4 - E^2(|v_{1,j}|^2)\right] = \frac1k\left[E\left(\sum_{i=1}^Dr_{ij}u_{1,i}\right)^4 - m_1^2\right]$$
We can compute $E\left(\sum_{i=1}^Dr_{ij}u_{1,i}\right)^4$ directly, but it is much easier to take advantage of the $\chi^2$ distribution.

$\chi^2$ Distribution

If $X\sim N(0,1)$, then $Y = X^2$ has a chi-square distribution with one degree of freedom, denoted by $\chi^2_1$.
If $X_j$, $j = 1$ to $k$, are i.i.d. $N(0,1)$, then $Y = \sum_{j=1}^kX_j^2$ follows a chi-square distribution with $k$ degrees of freedom, denoted by $\chi^2_k$.
If $Y\sim\chi^2_k$, then $E(Y) = k$ and $Var(Y) = 2k$.

Recall, after random projections, $v_{1,j} = \sum_{i=1}^Dr_{ij}u_{1,i}$, $r_{ij}\sim N(0,1)$. Therefore $v_{1,j}$ also has a normal distribution:
$$v_{1,j}\sim N\left(0, \sum_{i=1}^D|u_{1,i}|^2\right) = N(0, m_1), \qquad\text{equivalently}\quad \frac{v_{1,j}}{\sqrt{m_1}}\sim N(0,1)$$
Therefore,
$$\left(\frac{v_{1,j}}{\sqrt{m_1}}\right)^2 = \frac{v_{1,j}^2}{m_1}\sim\chi^2_1, \qquad Var\left(\frac{v_{1,j}^2}{m_1}\right) = 2, \qquad Var(v_{1,j}^2) = 2m_1^2$$
Now we can figure out the variance formula for random projections.

Implication:
$$Var(\hat m_1) = \frac1kVar\left(|v_{1,j}|^2\right) = \frac{2m_1^2}{k}, \qquad \frac{Var(\hat m_1)}{m_1^2} = \frac2k, \quad\text{independent of } m_1$$
$\sqrt{Var(\hat m_1)/m_1^2}$ is known as the coefficient of variation.

We have solved the variance using $\chi^2_1$. We can actually figure out the distribution of $\hat m_1$ using $\chi^2_k$. Because the $v_{1,j}$'s are i.i.d. with $v_{1,j}\sim N(0,m_1)$, we know
$$\frac{k\hat m_1}{m_1} = \sum_{j=1}^k\left(\frac{v_{1,j}}{\sqrt{m_1}}\right)^2\sim\chi^2_k \quad(\text{why?})$$
This will be useful for analyzing the error bound using probability inequalities. We can also write down the moments of $\hat m_1$ directly using $\chi^2_k$. Recall, if $Y\sim\chi^2_k$, then $E(Y) = k$ and $Var(Y) = 2k$:
$$E\left(\frac{k\hat m_1}{m_1}\right) = k, \quad Var\left(\frac{k\hat m_1}{m_1}\right) = 2k \;\Longrightarrow\; Var(\hat m_1) = 2k\,\frac{m_1^2}{k^2} = \frac{2m_1^2}{k}$$

An unbiased estimator of the squared Euclidean distance $d = \sum_{i=1}^D|u_{1,i}-u_{2,i}|^2$:
$$\hat d = \frac1k\sum_{j=1}^k|v_{1,j}-v_{2,j}|^2, \qquad \frac{k\hat d}{d}\sim\chi^2_k, \qquad Var(\hat d) = \frac{2d^2}{k}$$
These can be derived in exactly the same way as for the estimator of $m_1$. Note that the squared coefficient of variation for $\hat d$,
$$\frac{Var(\hat d)}{d^2} = \frac2k, \quad\text{independent of } d,$$
meaning that the errors are pre-determined by $k$: a huge advantage.
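A minimal Matlab sketch (sizes of my own choosing) checking the two claims above, that $\hat m_1$ is unbiased and that $Var(\hat m_1) = 2m_1^2/k$:

D = 500; k = 50; T = 2000;           % T = number of repeated projections
u1 = randn(1, D); m1 = sum(u1.^2);   % a fixed vector and its squared norm
m1hat = zeros(T, 1);
for t = 1:T
    v1 = u1 * randn(D, k);           % v1(j) = sum_i r_ij u1_i
    m1hat(t) = mean(v1.^2);          % the estimator (1/k) sum_j v1j^2
end
[mean(m1hat)  m1]                    % unbiasedness
[var(m1hat)   2*m1^2/k]              % variance formula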
More Probability Problems

• What is the error probability $P\left(|\hat d - d|\ge\epsilon d\right)$?
• How large should $k$ be?
• What about the inner (dot) product $a = \sum_{i=1}^Du_{1,i}u_{2,i}$?

An Unbiased Estimator of the Inner Product

$$\hat a = \frac1k\sum_{j=1}^kv_{1,j}v_{2,j}, \qquad a = \sum_{i=1}^Du_{1,i}u_{2,i}, \qquad E(\hat a) = a, \qquad Var(\hat a) = \frac{m_1m_2+a^2}{k}$$
Proof:
$$v_{1,j}v_{2,j} = \left[\sum_{i=1}^Du_{1,i}r_{ij}\right]\left[\sum_{i=1}^Du_{2,i}r_{ij}\right] = \sum_{i=1}^Du_{1,i}u_{2,i}r_{ij}^2 + \sum_{i\ne i'}u_{1,i}u_{2,i'}r_{ij}r_{i'j}$$
$$\Longrightarrow\; E(v_{1,j}v_{2,j}) = \sum_{i=1}^Du_{1,i}u_{2,i}E(r_{ij}^2) + \sum_{i\ne i'}u_{1,i}u_{2,i'}E(r_{ij}r_{i'j}) = \sum_{i=1}^Du_{1,i}u_{2,i}\cdot1 + 0 = a$$
This proves the unbiasedness.

We first derive the variance of $\hat a$ using a complicated brute-force method; then we show a much simpler method using conditional expectation.
$$[v_{1,j}v_{2,j}]^2 = \left[\sum_{i=1}^Du_{1,i}u_{2,i}r_{ij}^2 + \sum_{i\ne i'}u_{1,i}u_{2,i'}r_{ij}r_{i'j}\right]^2 = \sum_{i=1}^D[u_{1,i}u_{2,i}]^2r_{ij}^4 + \sum_{i\ne i'}u_{1,i}u_{2,i}u_{1,i'}u_{2,i'}[r_{ij}r_{i'j}]^2 + \sum_{i\ne i'}[u_{1,i}u_{2,i'}]^2[r_{ij}r_{i'j}]^2 + ...$$
Why can we ignore the rest of the terms (after taking expectations)? Recall the $r_{ij}\sim N(0,1)$ are i.i.d.:
$$E(r_{ij}) = 0, \quad E(r_{ij}^2) = 1, \quad E(r_{ij}^3) = 0, \quad E(r_{ij}^4) = 3,$$
$$E(r_{ij}r_{i'j}) = E(r_{ij})E(r_{i'j}) = 0, \qquad E(r_{ij}^2r_{i'j}) = E(r_{ij}^2)E(r_{i'j}) = 0$$
Therefore,
$$E[v_{1,j}v_{2,j}]^2 = \sum_{i=1}^D3[u_{1,i}u_{2,i}]^2 + 2\sum_{i\ne i'}u_{1,i}u_{2,i}u_{1,i'}u_{2,i'} + \sum_{i\ne i'}[u_{1,i}u_{2,i'}]^2$$
But
$$a^2 = \left[\sum_{i=1}^Du_{1,i}u_{2,i}\right]^2 = \sum_{i=1}^D[u_{1,i}u_{2,i}]^2 + \sum_{i\ne i'}u_{1,i}u_{2,i}u_{1,i'}u_{2,i'}$$
$$m_1m_2 = \left[\sum_{i=1}^D|u_{1,i}|^2\right]\left[\sum_{i=1}^D|u_{2,i}|^2\right] = \sum_{i=1}^D[u_{1,i}u_{2,i}]^2 + \sum_{i\ne i'}[u_{1,i}u_{2,i'}]^2$$
Therefore,
$$E[v_{1,j}v_{2,j}]^2 = m_1m_2 + 2a^2, \qquad Var[v_{1,j}v_{2,j}] = m_1m_2 + a^2$$

The coefficient of variation of $\hat a$:
$$\sqrt{\frac{Var(\hat a)}{a^2}} = \sqrt{\frac{m_1m_2+a^2}{a^2}\cdot\frac1k}, \quad\text{not independent of } a$$
When the two vectors $u_1$ and $u_2$ are almost orthogonal, $a\approx0$, so the coefficient of variation $\approx\infty$: random projections may not be good for estimating inner products.

The Joint Distribution of $v_{1,j} = \sum_{i=1}^Du_{1,i}r_{ij}$ and $v_{2,j} = \sum_{i=1}^Du_{2,i}r_{ij}$

$$E(v_{1,j}) = 0, \quad Var(v_{1,j}) = \sum_{i=1}^D|u_{1,i}|^2 = m_1, \qquad E(v_{2,j}) = 0, \quad Var(v_{2,j}) = \sum_{i=1}^D|u_{2,i}|^2 = m_2$$
$$Cov(v_{1,j}, v_{2,j}) = E(v_{1,j}v_{2,j}) - E(v_{1,j})E(v_{2,j}) = a$$
$v_{1,j}$ and $v_{2,j}$ are jointly normal (bivariate normal):
$$\begin{pmatrix}v_{1,j}\\ v_{2,j}\end{pmatrix}\sim N\left(\mu = \begin{pmatrix}0\\0\end{pmatrix},\;\Sigma = \begin{pmatrix}m_1 & a\\ a & m_2\end{pmatrix}\right)$$
(What if we know $m_1$ and $m_2$ exactly? For example, by one scan of the data matrix.)

Summary of Random Projections

Random projections: replace $\mathbf{A}$ by $\mathbf{B} = \mathbf{A}\times\mathbf{R}$.

• An elegant method, and an interesting probability exercise.
• Suitable for approximating Euclidean distances in massive, dense, and heavy-tailed (some entries are excessively large) data matrices.
• It does not take advantage of data sparsity.
• We will come back to study its error probability bounds (and other things).
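To wrap up this section, a minimal Matlab sketch (toy sizes of my own choosing) checking the inner-product estimator: $E(\hat a) = a$ and $Var(\hat a) = (m_1m_2+a^2)/k$.

D = 200; k = 50; T = 5000;
u1 = randn(1, D); u2 = randn(1, D);
a = u1 * u2'; m1 = sum(u1.^2); m2 = sum(u2.^2);
ahat = zeros(T, 1);
for t = 1:T
    R = randn(D, k);
    ahat(t) = mean((u1*R) .* (u2*R));    % (1/k) sum_j v1j v2j
end
[mean(ahat)  a]                          % unbiasedness
[var(ahat)   (m1*m2 + a^2)/k]            % variance formula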
Capture/Recapture Method: Section 1.4, Example I

The method may be used to estimate the size of a wildlife population. Suppose that $t$ animals are captured, tagged, and released. On a later occasion, $m$ animals are captured, and it is found that $r$ of them are tagged. Assume the total population is $N$.

Q1: What is the probability mass function $P\{N = n\}$?
Q2: How large is the population $N$, estimated from $m$, $r$, and $t$?

Solution:
$$P\{N = n\} = \frac{\binom{t}{r}\binom{n-t}{m-r}}{\binom{n}{m}}$$
To estimate $N$, we can choose the $n$ such that $L_n = P\{N = n\}$ is maximized:
$$L_n = \frac{\frac{t!}{r!(t-r)!}\cdot\frac{(n-t)!}{(m-r)!(n-t-m+r)!}}{\frac{n!}{m!(n-m)!}} \propto \frac{(n-t)!}{(n-t-m+r)!}\cdot\frac{(n-m)!}{n!} = \frac{(n-t)!(n-m)!}{(n-t-m+r)!\,n!}$$

The Method of Maximum Likelihood

To find the $n$ such that
$$L_n = \frac{(n-t)!(n-m)!}{(n-t-m+r)!\,n!}$$
is maximized: if $L_n$ has a global maximum, this is equivalent to finding the $n$ at which the likelihood ratio
$$g_n = \frac{L_n}{L_{n-1}} = \frac{(n-t)(n-m)}{n(n-t-m+r)}$$
crosses 1, which gives $n = \frac{mt}{r}$. Indeed,
$$(n-t)(n-m) - n(n-t-m+r) = mt - nr,$$
which is positive if $n < \frac{mt}{r}$ and negative if $n > \frac{mt}{r}$. Thus $L_n$ is increasing for $n < \frac{mt}{r}$ and decreasing for $n > \frac{mt}{r}$.

How to plot $L_n$?
$$L_n = \frac{(n-t)!(n-m)!}{(n-t-m+r)!\,n!} = \frac{(n-m)(n-m-1)\cdots(n-m-t+r+1)}{n(n-1)\cdots(n-t+1)}$$
$$\log L_n = \sum_{j=1}^{t-r}\log(n-m-j+1) - \sum_{i=1}^t\log(n-i+1)$$

[Figures: the likelihood $L_n$ and the likelihood ratio $g_n$ versus $n$, for $t = 10$, $m = 20$, $r = 4$.]

Matlab code

function cap_recap(t, m, r)
n0 = max(t+m-r, m) + 5;
j = 1:(t-r);  i = 1:t;
for n = n0:5*n0
    L(n-n0+1) = exp( sum(log(n-m+1-j)) - sum(log(n+1-i)) );
    g(n-n0+1) = (n-t)*(n-m)./n./(n-t-m+r);
end
figure;
plot(n0:5*n0, L, 'r', 'linewidth', 2); grid on;
xlabel('n'); ylabel('Likelihood');
title(['Likelihood (L_n): t = ' num2str(t) ' m = ' num2str(m) ' r = ' num2str(r)]);
figure;
plot(n0:5*n0, g, 'r', 'linewidth', 2); grid on;
xlabel('n'); ylabel('Likelihood Ratio');
title(['Likelihood ratio (g_n): t = ' num2str(t) ' m = ' num2str(m) ' r = ' num2str(r)]);

The Bivariate Normal Distribution

The random variables $X$ and $Y$ have a bivariate normal distribution if, for constants $\mu_x$, $\mu_y$, $\sigma_x > 0$, $\sigma_y > 0$, $-1 < \rho < 1$, their joint density function is given, for all $-\infty < x, y < \infty$, by
$$f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right\}$$
If $X$ and $Y$ are independent, then $\rho = 0$, and
$$f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left\{-\frac12\left[\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right\}$$

Denote that $X$ and $Y$ are jointly normal:
$$\begin{pmatrix}X\\Y\end{pmatrix}\sim N\left(\mu = \begin{pmatrix}\mu_x\\\mu_y\end{pmatrix},\;\Sigma = \begin{pmatrix}\sigma_x^2 & \rho\sigma_x\sigma_y\\ \rho\sigma_x\sigma_y & \sigma_y^2\end{pmatrix}\right)$$
$X$ and $Y$ are marginally normal: $X\sim N(\mu_x,\sigma_x^2)$, $Y\sim N(\mu_y,\sigma_y^2)$.
$X$ and $Y$ are also conditionally normal:
$$X|Y\sim N\left(\mu_x + \rho\frac{\sigma_x}{\sigma_y}(y-\mu_y),\;(1-\rho^2)\sigma_x^2\right), \qquad Y|X\sim N\left(\mu_y + \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x),\;(1-\rho^2)\sigma_y^2\right)$$
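A minimal Matlab sketch (parameter values of my own choosing) checking the conditional-normal formulas by simulation; it constructs $Y$ from $X$ via the conditional representation and then conditions on $X$ near a point:

mux = 1; muy = -2; sx = 2; sy = 3; rho = 0.6; n = 10^6;
X = normrnd(mux, sx, n, 1);
Y = muy + rho*(sy/sx)*(X - mux) + sqrt(1 - rho^2)*sy*randn(n, 1);
I = abs(X - 2) < 0.05;                        % condition on X near x = 2
[mean(Y(I))  muy + rho*(sy/sx)*(2 - mux)]     % conditional mean
[var(Y(I))   (1 - rho^2)*sy^2]                % conditional variance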
Bivariate Normal and Random Projections

$\mathbf{A}\,\mathbf{R} = \mathbf{B}$. The first two rows of $\mathbf{B}$, $v_1$ and $v_2$, each have $k$ entries: $v_{1,j} = \sum_{i=1}^Du_{1,i}r_{ij}$ and $v_{2,j} = \sum_{i=1}^Du_{2,i}r_{ij}$. $v_{1,j}$ and $v_{2,j}$ are bivariate normal:
$$\begin{pmatrix}v_{1,j}\\v_{2,j}\end{pmatrix}\sim N\left(\mu = \begin{pmatrix}0\\0\end{pmatrix},\;\Sigma = \begin{pmatrix}m_1 & a\\ a & m_2\end{pmatrix}\right)$$
where $m_1 = \sum_{i=1}^D|u_{1,i}|^2$, $m_2 = \sum_{i=1}^D|u_{2,i}|^2$, and $a = \sum_{i=1}^Du_{1,i}u_{2,i}$.

Simplify Calculations Using Conditional Normality

$$v_{1,j}|v_{2,j}\sim N\left(\frac{a}{m_2}v_{2,j},\;\frac{m_1m_2-a^2}{m_2}\right)$$
$$E\left(v_{1,j}^2v_{2,j}^2\right) = E\left(E\left(v_{1,j}^2|v_{2,j}\right)v_{2,j}^2\right) = E\left(v_{2,j}^2\left[\frac{m_1m_2-a^2}{m_2} + \left(\frac{a}{m_2}v_{2,j}\right)^2\right]\right) = m_2\frac{m_1m_2-a^2}{m_2} + 3m_2^2\frac{a^2}{m_2^2} = m_1m_2 + 2a^2$$
The unbiased estimator $\hat a = \frac1k\sum_{j=1}^kv_{1,j}v_{2,j}$ therefore has variance
$$Var(\hat a) = \frac1k\left(m_1m_2 + a^2\right)$$

Moment Generating Function (MGF)

Definition: For a random variable $X$, its moment generating function (MGF) is defined as
$$M_X(t) = E\left(e^{tX}\right) = \begin{cases}\sum_x p(x)e^{tx} & \text{if } X \text{ is discrete}\\ \int_{-\infty}^\infty e^{tx}f(x)dx & \text{if } X \text{ is continuous}\end{cases}$$
The MGF $M_X(t)$ uniquely determines the distribution of $X$.

MGF of the Normal

Suppose $X\sim N(0,1)$, i.e., $f_X(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. Completing the square,
$$M_X(t) = \int_{-\infty}^\infty e^{tx}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}dx = \int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2-2tx+t^2-t^2}{2}}dx = e^{\frac{t^2}{2}}\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac{(x-t)^2}{2}}dx = e^{\frac{t^2}{2}}$$
Suppose $Y\sim N(\mu,\sigma^2)$. Write $Y = \sigma X + \mu$, where $X\sim N(0,1)$:
$$M_Y(t) = E\left(e^{tY}\right) = E\left(e^{\mu t+\sigma tX}\right) = e^{\mu t}E\left(e^{\sigma tX}\right)$$
We can view $\sigma t$ as another $t'$:
$$M_Y(t) = e^{\mu t}M_X(\sigma t) = e^{\mu t}\cdot e^{\frac{\sigma^2t^2}{2}} = e^{\mu t+\frac{\sigma^2t^2}{2}}$$

MGF of the Chi-Square

If $X_j$, $j = 1$ to $k$, are i.i.d. $N(0,1)$, then $Y = \sum_{j=1}^kX_j^2\sim\chi^2_k$, a chi-square distribution with $k$ degrees of freedom. What is the density function? Well, since the MGF uniquely determines the distribution, we can analyze the MGF first. By the independence of the $X_j$,
$$M_Y(t) = E\left(e^{tY}\right) = E\left(e^{t\sum_{j=1}^kX_j^2}\right) = \prod_{j=1}^kE\left(e^{tX_j^2}\right) = \left[E\left(e^{tX_j^2}\right)\right]^k$$
$$E\left(e^{tX_j^2}\right) = \int_{-\infty}^\infty e^{tx^2}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}dx = \int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2(1-2t)}{2}}dx = \int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2\sigma^2}}dx \quad\left(\sigma^2 = \frac{1}{1-2t}\right)$$
$$= \sigma\int_{-\infty}^\infty\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{x^2}{2\sigma^2}}dx = \sigma = \frac{1}{(1-2t)^{1/2}}$$
$$M_Y(t) = \left[E\left(e^{tX_j^2}\right)\right]^k = \frac{1}{(1-2t)^{k/2}}, \qquad (t < 1/2)$$

MGF for Random Projections

In random projections, for the unbiased estimator $\hat d = \frac1k\sum_{j=1}^k|v_{1,j}-v_{2,j}|^2$,
$$\frac{k\hat d}{d} = \sum_{j=1}^k\frac{|v_{1,j}-v_{2,j}|^2}{d}\sim\chi^2_k$$
Q: What is the MGF of $\hat d$? Solution:
$$M_{\hat d}(t) = E\left(e^{\hat dt}\right) = E\left(e^{\frac{k\hat d}{d}\cdot\frac{dt}{k}}\right) = \left(1-\frac{2dt}{k}\right)^{-k/2}$$
where $2dt/k < 1$, i.e., $t < k/(2d)$.

Moments and MGF

$$M_X(t) = E\left(e^{tX}\right) \;\Longrightarrow\; M_X'(t) = E\left(Xe^{tX}\right) \;\Longrightarrow\; M_X^{(n)}(t) = E\left(X^ne^{tX}\right)$$
Setting $t = 0$: $E[X^n] = M_X^{(n)}(0)$.

Example: $X\sim\chi^2_k$, $M_X(t) = \frac{1}{(1-2t)^{k/2}}$.
$$M'(t) = -\frac k2(1-2t)^{-k/2-1}(-2) = k(1-2t)^{-k/2-1}$$
$$M''(t) = k\left(-\frac k2-1\right)(1-2t)^{-k/2-2}(-2) = k(k+2)(1-2t)^{-k/2-2}$$
Therefore,
$$E(X) = M'(0) = k, \qquad E(X^2) = M''(0) = k^2+2k, \qquad Var(X) = (k^2+2k)-k^2 = 2k$$
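A minimal Matlab sketch ($k$, $t$, and the sample size are my own choices) comparing the chi-square MGF just derived with an empirical average of $e^{tY}$:

k = 5; t = 0.2; n = 10^6;            % need t < 1/2
Y = sum(randn(k, n).^2, 1);          % Y ~ chi-square with k d.o.f.
[mean(exp(t*Y))  (1 - 2*t)^(-k/2)]   % empirical vs exact MGF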
Example: MGF and Moments of $\hat a$ in Random Projections

The unbiased estimator of the inner product: $\hat a = \frac1k\sum_{j=1}^kv_{1,j}v_{2,j}$. Using conditional expectation:
$$v_{1,j}|v_{2,j}\sim N\left(\frac{a}{m_2}v_{2,j},\;\frac{m_1m_2-a^2}{m_2}\right), \qquad v_{2,j}\sim N(0,m_2)$$
For simplicity, let $x = v_{1,j}$, $y = v_{2,j}$, $\mu = \frac{a}{m_2}y$, $\sigma^2 = \frac{m_1m_2-a^2}{m_2}$.
$$E\left(\exp(v_{1,j}v_{2,j}t)\right) = E\left(\exp(xyt)\right) = E\left(E\left(\exp(xyt)|y\right)\right)$$
Using the MGF of $x|y\sim N(\mu,\sigma^2)$:
$$E\left(\exp(xyt)|y\right) = e^{\mu yt+\frac{\sigma^2}{2}(yt)^2}, \qquad \mu yt + \frac{\sigma^2}{2}(yt)^2 = y^2\left(\frac{a}{m_2}t+\frac{\sigma^2}{2}t^2\right)$$
Since $y\sim N(0,m_2)$, we know $\frac{y^2}{m_2}\sim\chi^2_1$. Using the MGF of $\chi^2_1$, we obtain
$$E\left(e^{\mu yt+\frac{\sigma^2}{2}(yt)^2}\right) = E\left(e^{\frac{y^2}{m_2}\cdot m_2\left(\frac{a}{m_2}t+\frac{\sigma^2}{2}t^2\right)}\right) = \left(1-2m_2\left(\frac{a}{m_2}t+\frac{\sigma^2}{2}t^2\right)\right)^{-\frac12} = \left(1-2at-\left(m_1m_2-a^2\right)t^2\right)^{-\frac12}$$
By independence,
$$M_{\hat a}(t) = \left(1-2a\frac tk-\left(m_1m_2-a^2\right)\frac{t^2}{k^2}\right)^{-\frac k2}$$
Now we can use this MGF to calculate the moments of $\hat a$:
$$M_{\hat a}^{(1)}(t) = \left(-\frac k2\right)\left[1-2a\frac tk-\left(m_1m_2-a^2\right)\frac{t^2}{k^2}\right]^{-\frac k2-1}\times\left(-\frac{2a}{k}-\left(m_1m_2-a^2\right)\frac{2t}{k^2}\right)$$
The term in $[\,...\,]$ will not matter after setting $t = 0$. Therefore,
$$E(\hat a) = M_{\hat a}^{(1)}(0) = \left(-\frac k2\right)\left(-\frac{2a}{k}\right) = a$$
Following a similar procedure, we can obtain
$$Var(\hat a) = \frac{m_1m_2+a^2}{k}, \qquad E(\hat a-a)^3 = \frac{2a}{k^2}\left(3m_1m_2+a^2\right)$$
The centered third moment measures the skewness of the distribution and can be quite useful, for example, for testing normality.

Tail Probabilities

The tail probability $P(X > t)$ is extremely important. For example, in random projections, $P\left(|\hat d-d|\ge\epsilon d\right)$ tells us the probability that the difference (error) between the estimated Euclidean distance $\hat d$ and the true distance $d$ exceeds an $\epsilon$ fraction of the true distance $d$.

Q: Is it just the cumulative distribution function (CDF)?

Tail Probability Inequalities (Bounds): $P(X > t)\le\;???$

Reasons to study tail probability bounds:
• Even if the distribution of $X$ is known, evaluating $P(X > t)$ often requires numerical methods.
• Often the exact distribution of $X$ is unknown; instead, we may know the moments (mean, variance, MGF, etc.).
• Theoretical reasons, for example, studying how fast the error decreases.

Several Tail Probability Inequalities (Bounds)

• Markov's inequality. Uses only the first moment. The most basic.
• Chebyshev's inequality. Uses only the second moment.
• Chernoff's inequality. Uses the MGF. The most accurate, and popular among theorists.

Markov's Inequality: Theorem A in Section 4.1

If $X$ is a random variable with $P(X\ge0) = 1$, and for which $E(X)$ exists, then
$$P(X\ge t)\le\frac{E(X)}{t}$$
Proof: Assume $X$ is continuous with probability density $f(x)$:
$$E(X) = \int_0^\infty xf(x)dx \ge \int_t^\infty xf(x)dx \ge \int_t^\infty tf(x)dx = tP(X\ge t)$$
See the textbook for the proof assuming $X$ is discrete. Many extremely useful bounds can be obtained from Markov's inequality.

If $t = kE(X)$, then
$$P(X\ge t) = P(X\ge kE(X))\le\frac1k$$
The bound decreases at the rate $\frac1k$, which is too slow. The original Markov inequality utilizes only the first moment (hence its inaccuracy).
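A minimal Matlab sketch of how loose Markov's bound can be, using Exponential(1) (my own choice), for which $E(X) = 1$ and $P(X\ge t) = e^{-t}$ exactly:

t = 1:5;
markov = 1 ./ t;          % Markov's bound E(X)/t
truth  = exp(-t);         % the exact tail
[t; markov; truth]        % the bound holds, but is loose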
Chebyshev's Inequality: Theorem C in Section 4.1

Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$. Then for any $t > 0$,
$$P(|X-\mu|\ge t)\le\frac{\sigma^2}{t^2}$$
Proof: Let $Y = (X-\mu)^2 = |X-\mu|^2$ and $w = t^2$. Then, by Markov's inequality,
$$P(Y\ge w)\le\frac{E(Y)}{w} = \frac{E(X-\mu)^2}{w} = \frac{\sigma^2}{w}$$
Note that $|X-\mu|^2\ge t^2 \iff |X-\mu|\ge t$. Therefore,
$$P(|X-\mu|\ge t) = P\left(|X-\mu|^2\ge t^2\right)\le\frac{\sigma^2}{t^2}$$
If $t = k\sigma$, then
$$P(|X-\mu|\ge k\sigma)\le\frac{1}{k^2}$$
The bound decreases at the rate $\frac{1}{k^2}$, which is faster than $\frac1k$.

Chernoff's Inequality

Ross, Proposition 8.5.2: If $X$ is a random variable with finite MGF $M_X(t)$, then for any $\epsilon > 0$,
$$P\{X\ge\epsilon\}\le e^{-t\epsilon}M_X(t) \quad\text{for all } t > 0$$
$$P\{X\le\epsilon\}\le e^{-t\epsilon}M_X(t) \quad\text{for all } t < 0$$
Application: One can choose the $t$ that minimizes the upper bound. This usually leads to accurate probability bounds, which decrease exponentially fast.

Proof: Use Markov's inequality. For $t > 0$, because $X\ge\epsilon \iff e^{tX}\ge e^{t\epsilon}$ (a monotone transformation),
$$P(X\ge\epsilon) = P\left(e^{tX}\ge e^{t\epsilon}\right)\le\frac{E\left(e^{tX}\right)}{e^{t\epsilon}} = e^{-t\epsilon}M_X(t)$$

Tail Bounds of Normal Random Variables

$X\sim N(\mu,\sigma^2)$; assume $\mu > 0$. We need to know $P(|X-\mu|\ge\epsilon\mu)\le\;??$

Chebyshev's inequality:
$$P(|X-\mu|\ge\epsilon\mu)\le\frac{\sigma^2}{\epsilon^2\mu^2}$$
This bound is not good enough, decreasing only at the rate $\frac{1}{\epsilon^2}$.

Tail Bounds of the Normal Using Chernoff's Inequality

Right tail bound, $P(X-\mu\ge\epsilon\mu)$: for any $t > 0$,
$$P(X-\mu\ge\epsilon\mu) = P(X\ge(1+\epsilon)\mu)\le e^{-t(1+\epsilon)\mu}M_X(t) = e^{-t(1+\epsilon)\mu}e^{\mu t+\sigma^2t^2/2} = e^{-t\epsilon\mu+\sigma^2t^2/2}$$
What's next? Since the inequality holds for any $t > 0$, we can choose the $t$ that minimizes the upper bound. Choose $t = t^*$ to minimize $g(t) = -t\epsilon\mu+\sigma^2t^2/2$:
$$g'(t) = -\epsilon\mu+\sigma^2t = 0 \;\Longrightarrow\; t^* = \frac{\mu\epsilon}{\sigma^2} \;\Longrightarrow\; g(t^*) = -\frac{\mu^2\epsilon^2}{2\sigma^2}$$
Therefore,
$$P(X-\mu\ge\epsilon\mu)\le e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2}}$$
decreasing at the rate $e^{-\epsilon^2}$.

Left tail bound, $P(X-\mu\le-\epsilon\mu)$: for any $t < 0$,
$$P(X-\mu\le-\epsilon\mu) = P(X\le(1-\epsilon)\mu)\le e^{-t(1-\epsilon)\mu}M_X(t) = e^{-t(1-\epsilon)\mu}e^{\mu t+\sigma^2t^2/2} = e^{t\epsilon\mu+\sigma^2t^2/2}$$
Choose $t = t^* = -\frac{\mu\epsilon}{\sigma^2}$ to minimize $t\epsilon\mu+\sigma^2t^2/2$. Therefore,
$$P(X-\mu\le-\epsilon\mu)\le e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2}}$$
Combining the left and right tail bounds:
$$P(|X-\mu|\ge\epsilon\mu) = P(X-\mu\ge\epsilon\mu) + P(X-\mu\le-\epsilon\mu)\le 2e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2}}$$

Sample Size Selection Using Tail Bounds

$X_i\sim N(\mu,\sigma^2)$, i.i.d., $i = 1$ to $k$. An unbiased estimator of $\mu$ is
$$\hat\mu = \frac1k\sum_{i=1}^kX_i, \qquad \hat\mu\sim N\left(\mu,\frac{\sigma^2}{k}\right)$$
Choose $k$ such that $P(|\hat\mu-\mu|\ge\epsilon\mu)\le\delta$.

We already know $P(|\hat\mu-\mu|\ge\epsilon\mu)\le 2e^{-\frac{\epsilon^2}{2}\frac{\mu^2}{\sigma^2/k}}$. It suffices to select the $k$ such that
$$2e^{-\frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2}}\le\delta \;\Longrightarrow\; e^{-\frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2}}\le\frac\delta2 \;\Longrightarrow\; -\frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2}\le\log\frac\delta2 \;\Longrightarrow\; \frac{\epsilon^2}{2}\frac{k\mu^2}{\sigma^2}\ge-\log\frac\delta2 \;\Longrightarrow\; k\ge\frac{2\sigma^2}{\epsilon^2\mu^2}\log\frac2\delta$$
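A minimal Matlab sketch (the values of mu, sigma, epsilon, and delta are my own choices) computing this sample size and checking the coverage by simulation:

mu = 10; sig = 4; ep = 0.05; del = 0.05;
k = ceil(2*sig^2 / (ep^2 * mu^2) * log(2/del))
T = 10^4;
muhat = mean(normrnd(mu, sig, k, T), 1);     % T repetitions of mu_hat
mean(abs(muhat - mu) >= ep*mu)               % should be at most delta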
Summary: Suppose $X_i\sim N(\mu,\sigma^2)$, $i = 1$ to $k$, i.i.d. Then $\hat\mu = \frac1k\sum_{i=1}^kX_i$ is an unbiased estimator of $\mu$. If the sample size $k$ satisfies
$$k\ge\frac{2\sigma^2}{\epsilon^2\mu^2}\log\frac2\delta,$$
then with probability at least $1-\delta$, the estimated $\mu$ is within a $1\pm\epsilon$ factor of the true $\mu$, i.e., $|\hat\mu-\mu|\le\epsilon\mu$.

What affects the sample size $k$?

$$k\ge\frac{2\sigma^2}{\epsilon^2\mu^2}\log\frac2\delta$$
• $\delta$: the level of significance. Lower $\delta$ means more significant, hence larger $k$.
• $\frac{\sigma^2}{\mu^2}$: the noise/signal ratio. Higher $\frac{\sigma^2}{\mu^2}$ means larger $k$.
• $\epsilon$: the accuracy. Lower $\epsilon$ means more accurate, hence larger $k$.
• The evaluation criterion. For example, $|\hat\mu-\mu|\le\epsilon\mu$, or $|\hat\mu-\mu|\le\epsilon$?

Exercise: In random projections, $\hat d$ is the unbiased estimator of the Euclidean distance $d$.
• Prove the exponential tail bound: $P\left(|\hat d-d|\ge\epsilon d\right)\le e^{???}$
• Determine the sample size such that $P\left(|\hat d-d|\ge\epsilon d\right)\le\delta$

Section 4.6: Approximate Methods

Suppose we know $E(X) = \mu_X$ and $Var(X) = \sigma_X^2$, and suppose $Y = g(X)$. What about $E(Y)$ and $Var(Y)$? In many cases, analytical solutions are not available (or are too complicated).

How about $Y = aX$? Easy! We know $E(Y) = aE(X) = a\mu_X$ and $Var(Y) = a^2\sigma_X^2$.

The Delta Method

General idea: expansion of $Y = g(X)$ about $X = \mu_X$:
$$Y = g(X) = g(\mu_X) + (X-\mu_X)g'(\mu_X) + \frac12(X-\mu_X)^2g''(\mu_X) + ...$$
Taking expectations on both sides:
$$E(Y) = g(\mu_X) + E(X-\mu_X)g'(\mu_X) + \frac12E(X-\mu_X)^2g''(\mu_X) + ... \;\Longrightarrow\; E(Y)\approx g(\mu_X) + \frac{\sigma_X^2}{2}g''(\mu_X)$$
What about the variance? Use the linear approximation only:
$$Y = g(X) = g(\mu_X) + (X-\mu_X)g'(\mu_X) + ... \;\Longrightarrow\; Var(Y)\approx\left[g'(\mu_X)\right]^2\sigma_X^2$$
How good are these approximations? It depends on the nonlinearity of $g(X)$ about $\mu_X$.

Example B in Section 4.6

$X\sim U(0,1)$, $Y = \sqrt X$. Compute $E(Y)$ and $Var(Y)$.

Exact method:
$$E(Y) = \int_0^1\sqrt x\,dx = \frac{1}{1/2+1}x^{1/2+1}\Big|_0^1 = \frac23, \qquad E(Y^2) = \int_0^1x\,dx = \frac12$$
$$Var(Y) = \frac12-\left(\frac23\right)^2 = \frac{1}{18} = 0.0556$$
Delta method: $X\sim U(0,1)$, $E(X) = \frac12$, $Var(X) = \frac1{12}$. $Y = g(X) = \sqrt X$, $g'(X) = \frac12X^{-1/2}$, $g''(X) = -\frac12\cdot\frac12X^{-1/2-1} = -\frac14X^{-3/2}$.
$$E(Y)\approx\sqrt{E(X)} + \frac{Var(X)}{2}\left(-\frac14\right)E(X)^{-3/2} = \sqrt{1/2} + \frac{1/12}{2}\left(-\frac14\right)(1/2)^{-3/2} = 0.6776$$
$$Var(Y)\approx Var(X)\left[\frac12E(X)^{-1/2}\right]^2 = \frac1{12}\left[\frac12(1/2)^{-1/2}\right]^2 = 0.0417$$

Delta Method for Sign Random Projections

The projected data $v_{1,j}$ and $v_{2,j}$ are bivariate normal:
$$\begin{pmatrix}v_{1,j}\\v_{2,j}\end{pmatrix}\sim N\left(\mu = \begin{pmatrix}0\\0\end{pmatrix},\;\Sigma = \begin{pmatrix}m_1 & a\\ a & m_2\end{pmatrix}\right), \qquad j = 1, 2, ..., k$$
One can use $\hat a = \frac1k\sum_{j=1}^kv_{1,j}v_{2,j}$ to estimate $a$ without bias. One can also first estimate the angle $\theta = \cos^{-1}\frac{a}{\sqrt{m_1m_2}}$ using
$$\Pr\left(\mathrm{sign}(v_{1,j}) = \mathrm{sign}(v_{2,j})\right) = 1-\frac\theta\pi,$$
and then estimate $a$ using $\cos(\hat\theta)\sqrt{m_1m_2}$. The delta method can help the analysis. (Why sign random projections?)
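Looking back at Example B, a minimal Matlab sketch comparing the exact moments of $Y = \sqrt X$ with the delta-method approximations by simulation:

X = rand(10^6, 1); Y = sqrt(X);
[mean(Y)  2/3   0.6776]      % simulated, exact E(Y), delta method
[var(Y)   1/18  0.0417]      % simulated, exact Var(Y), delta method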
The Delta Method for Two Variables

$Z = g(X,Y)$, with $E(X) = \mu_X$, $E(Y) = \mu_Y$, $Var(X) = \sigma_X^2$, $Var(Y) = \sigma_Y^2$, $Cov(X,Y) = \sigma_{XY}$.

Taylor expansion of $Z = g(X,Y)$ about $(X = \mu_X, Y = \mu_Y)$:
$$Z = g(\mu_X,\mu_Y) + (X-\mu_X)\frac{\partial g(\mu_X,\mu_Y)}{\partial X} + (Y-\mu_Y)\frac{\partial g(\mu_X,\mu_Y)}{\partial Y} + \frac12(X-\mu_X)^2\frac{\partial^2g(\mu_X,\mu_Y)}{\partial X^2} + \frac12(Y-\mu_Y)^2\frac{\partial^2g(\mu_X,\mu_Y)}{\partial Y^2} + (X-\mu_X)(Y-\mu_Y)\frac{\partial^2g(\mu_X,\mu_Y)}{\partial X\partial Y} + ...$$
Taking expectations of both sides of the expansion:
$$E(Z)\approx g(\mu_X,\mu_Y) + \frac{\sigma_X^2}{2}\frac{\partial^2g(\mu_X,\mu_Y)}{\partial X^2} + \frac{\sigma_Y^2}{2}\frac{\partial^2g(\mu_X,\mu_Y)}{\partial Y^2} + \sigma_{XY}\frac{\partial^2g(\mu_X,\mu_Y)}{\partial X\partial Y}$$
Using only the linear expansion yields
$$Var(Z)\approx\sigma_X^2\left(\frac{\partial g(\mu_X,\mu_Y)}{\partial X}\right)^2 + \sigma_Y^2\left(\frac{\partial g(\mu_X,\mu_Y)}{\partial Y}\right)^2 + 2\sigma_{XY}\frac{\partial g(\mu_X,\mu_Y)}{\partial X}\frac{\partial g(\mu_X,\mu_Y)}{\partial Y}$$

Chapter 5: Limit Theorems

$X_1, X_2, ..., X_n$ are i.i.d. samples. What happens as $n\to\infty$?
• The Law of Large Numbers
• The Central Limit Theorem
• The Normal Approximation

The Law of Large Numbers

Theorem 5.2.A: Let $X_1, X_2, ...,$ be a sequence of independent random variables with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2$. Then, for any $\epsilon > 0$, as $n\to\infty$,
$$P\left(\left|\frac1n\sum_{i=1}^nX_i-\mu\right| > \epsilon\right)\to0$$
The sequence of sample means is said to converge in probability to $\mu$.

Proof: Use Chebyshev's inequality. Because the $X_i$'s are i.i.d., with $\bar X = \frac1n\sum_{i=1}^nX_i$,
$$E(\bar X) = \frac1n\sum_{i=1}^nE(X_i) = \frac1n\,n\mu = \mu, \qquad Var(\bar X) = \frac{1}{n^2}\sum_{i=1}^nVar(X_i) = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$$
Thus, by Chebyshev's inequality,
$$P(|\bar X-\mu|\ge\epsilon)\le\frac{Var(\bar X)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}\to0$$

[Figures: running sample means versus n (log scale, up to n = 10^6), three paths each for normal, gamma, and uniform samples; all paths settle at the true mean.]

Matlab Code

function TestLawLargeNumbers(MEAN)
N = 10^6;
figure; c = ['r','k','b'];
for repeat = 1:3
    X = normrnd(MEAN, 1, 1, N);            % var = 1
    semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
    grid on; hold on;
end
xlabel('n'); ylabel('Sample Mean'); title('Normal Distribution');

figure;
for repeat = 1:3
    X = gamrnd(MEAN.^2, 1./MEAN, 1, N);    % var = 1
    semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
    grid on; hold on;
end
xlabel('n'); ylabel('Sample Mean'); title('Gamma Distribution');

figure;
for repeat = 1:3
    X = rand(1, N)*MEAN*2;
    semilogx(cumsum(X)./(1:N), c(repeat), 'linewidth', 2);
    grid on; hold on;
end
xlabel('n'); ylabel('Sample Mean'); title('Uniform Distribution');

Monte Carlo Integration

To calculate
$$I(f) = \int_0^1f(x)dx, \qquad\text{for example } f(x) = e^{-x^2/2},$$
numerical integration can be difficult, especially in high dimensions.

Monte Carlo integration: generate $n$ i.i.d. samples $X_i\sim U(0,1)$. Then, by the LLN, as $n\to\infty$,
$$\frac1n\sum_{i=1}^nf(X_i)\to E(f(X_i)) = \int_0^1f(x)\cdot1\,dx$$

Advantages
• Very flexible. The interval does not have to be [0,1]. The function $f(x)$ can be complicated. The function can be decomposed in various ways, e.g., $f(x) = g(x)h(x)$, and one can sample from other distributions.
• Straightforward in high dimensions: double integrals, triple integrals, etc.
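A minimal Matlab sketch of this procedure for $I(f) = \int_0^1e^{-x^2/2}dx$ (the sample size is my own choice); the exact value, via the error function, is about 0.8556:

n = 10^6;
X = rand(n, 1);
Ihat = mean(exp(-X.^2/2))            % Monte Carlo estimate
sqrt(pi/2)*erf(1/sqrt(2))            % exact value, about 0.8556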
Major disadvantage of Monte Carlo integration: the LLN converges at the rate $\frac{1}{\sqrt n}$ (from the Central Limit Theorem), while numerical integration converges at the rate $\frac1n$. However, in high dimensions the difference becomes smaller. Also, there are more advanced Monte Carlo techniques that achieve better rates.

Examples of Monte Carlo Numerical Integration

Treat $\int_0^1\cos x\,dx$ as an expectation:
$$\int_0^1\cos x\,dx = \int_0^11\times\cos x\,dx = E(\cos(x)), \qquad x\sim\text{Uniform }U(0,1)$$
Monte Carlo integration procedure:
• Generate $N$ i.i.d. samples $x_i\sim\text{Uniform }U(0,1)$, $i = 1$ to $N$.
• Use the empirical expectation $\frac1N\sum_{i=1}^N\cos(x_i)$ to approximate $E(\cos(x))$.

True value: $\int_0^1\cos x\,dx = \sin(1) = 0.8415$.

[Figure: the Monte Carlo estimate versus N (up to 10^7), converging to sin(1) = 0.8415.]

A second example:
$$\int_0^1\frac{\log^2(x+0.1)}{\sqrt{\sin(x+0.1)}}\,e^{-x^{0.15}}dx$$
[Figure: the Monte Carlo estimate versus N for this integral.]

Section 5.3: The Central Limit Theorem and Normal Approximation

Central Limit Theorem: Let $X_1, X_2, ...,$ be a sequence of independent and identically distributed random variables, each having finite mean $E(X_i) = \mu$ and variance $\sigma^2$. Then, as $n\to\infty$,
$$P\left(\frac{X_1+X_2+...+X_n-n\mu}{\sqrt n\,\sigma}\le y\right)\to\int_{-\infty}^y\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt$$

Normal Approximation

$$\frac{X_1+X_2+...+X_n-n\mu}{\sqrt n\,\sigma} = \frac{\bar X-\mu}{\sqrt{\sigma^2/n}} \quad\text{is approximately } N(0,1)$$
Non-rigorously, we may say $\bar X$ is approximately $N\left(\mu,\frac{\sigma^2}{n}\right)$. But we know $E(\bar X) = \mu$ and $Var(\bar X) = \frac{\sigma^2}{n}$ exactly.

The Normal Distribution Approximates the Binomial

Suppose $X\sim\mathrm{Binomial}(n,p)$. For fixed $p$, as $n\to\infty$,
$$\mathrm{Binomial}(n,p)\approx N(\mu,\sigma^2), \qquad \mu = np, \quad \sigma^2 = np(1-p)$$
[Figures: the Binomial(n, p = 0.2) mass function with the approximating normal density overlaid, for n = 10, 20, 50, 100, and 1000; the approximation improves as n grows.]

Matlab code

function NormalApprBinomial(n, p)
mu = n*p;  sigma2 = n*p*(1-p);
figure;
bar((0:n), binopdf(0:n, n, p), 'g'); hold on; grid on;
x = mu - 3*sigma2 : 0.001 : mu + 3*sigma2;
plot(x, normpdf(x, mu, sqrt(sigma2)), 'r-', 'linewidth', 2);
xlabel('x'); ylabel('Density (mass) function');
title(['n = ' num2str(n) ' p = ' num2str(p)]);

Convergence in Distribution

Definition: Let $X_1, X_2, ...,$ be a sequence of random variables with cumulative distributions $F_1, F_2, ...,$ and let $X$ be a random variable with distribution function $F$. We say that $X_n$ converges in distribution to $X$ if
$$\lim_{n\to\infty}F_n(x) = F(x)$$
at every point $x$ at which $F$ is continuous.

Theorem 5.3A: Continuity Theorem

Let $F_n$ be a sequence of cumulative distribution functions with the corresponding MGFs $M_n$. Let $F$ be a cumulative distribution function with MGF $M$. If $M_n(t)\to M(t)$ for all $t$ in an open interval containing zero, then $F_n(x)\to F(x)$ at all continuity points of $F$.

Approximating the Poisson by the Normal

$X\sim Poi(\lambda)$ is approximately normal when $\lambda$ is large. Recall that $Poi(\lambda)$ approximates $Bin(n,p)$ with $\lambda\approx np$ and large $n$.

Let $\lambda_1, \lambda_2, ...$ be an increasing sequence with $\lambda_n\to\infty$, and let $X_n\sim Poi(\lambda_n)$. Let
$$Z_n = \frac{X_n-\lambda_n}{\sqrt{\lambda_n}}, \text{ with CDF } F_n,$$
and let $Z\sim N(0,1)$, with CDF $F$. To show $F_n\to F$, it suffices to show $M_{Z_n}(t)\to M_Z(t) = e^{t^2/2}$.

Proof: If $Y\sim Poi(\lambda)$, then $M_Y(t) = e^{\lambda(e^t-1)}$. Then, for $Z_n = \frac{X_n-\lambda_n}{\sqrt{\lambda_n}}$,
$$M_{Z_n}(t) = e^{-\sqrt{\lambda_n}\,t}\,e^{\lambda_n\left(e^{t/\sqrt{\lambda_n}}-1\right)} = \exp\left[-t\sqrt{\lambda_n} + \lambda_n\left(e^{t/\sqrt{\lambda_n}}-1\right)\right] = \exp[g(t,n)]$$
Recall $e^t = 1+t+\frac{t^2}{2}+\frac{t^3}{6}+...$, so
$$g(t,n) = -t\sqrt{\lambda_n} + \lambda_n\left(\frac{t}{\sqrt{\lambda_n}} + \frac12\frac{t^2}{\lambda_n} + \frac16\frac{t^3}{\lambda_n^{3/2}} + ...\right) = \frac{t^2}{2} + \frac16\frac{t^3}{\lambda_n^{1/2}} + ...\to\frac{t^2}{2}$$
Therefore, $M_{Z_n}(t)\to e^{t^2/2} = M_Z(t)$.
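A minimal Matlab sketch (the values of lambda are my own choices) comparing a Poisson tail probability with its normal approximation as lambda grows:

for lam = [5 50 500]
    x = lam + 2*sqrt(lam);           % two standard deviations above the mean
    [1 - poisscdf(floor(x), lam),  1 - normcdf(x, lam, sqrt(lam))]
end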
The Proof of the Central Limit Theorem

Theorem 5.3.B: Let $X_1, X_2, ...,$ be a sequence of independent random variables having mean $\mu$ and variance $\sigma^2$ and common probability distribution function $F$ and MGF $M$ defined in a neighborhood of zero. Then
$$\lim_{n\to\infty}P\left(\frac{\sum_{i=1}^nX_i-n\mu}{\sigma\sqrt n}\le x\right) = \int_{-\infty}^x\frac{1}{\sqrt{2\pi}}e^{-z^2/2}dz, \qquad -\infty < x < \infty$$
Proof: Let $S_n = \sum_{i=1}^nX_i$ and $Z_n = \frac{S_n-n\mu}{\sigma\sqrt n}$. It suffices to show $M_{Z_n}(t)\to e^{t^2/2}$ as $n\to\infty$.

Note that $M_{S_n}(t) = M^n(t)$. Hence
$$M_{Z_n}(t) = e^{-\frac{\sqrt n\mu}{\sigma}t}M_{S_n}\left(\frac{t}{\sigma\sqrt n}\right) = e^{-\frac{\sqrt n\mu}{\sigma}t}M^n\left(\frac{t}{\sigma\sqrt n}\right)$$
Taylor expand $M(t)$ about zero:
$$M(t) = 1 + tM'(0) + \frac{t^2}{2}M''(0) + ... = 1 + t\mu + \frac{t^2}{2}\left(\sigma^2+\mu^2\right) + ...$$
Therefore,
$$M_{Z_n}(t) = e^{-\frac{\sqrt n\mu}{\sigma}t}\left[1 + \frac{\mu t}{\sigma\sqrt n} + \frac{t^2}{2\sigma^2n}\left(\sigma^2+\mu^2\right) + ...\right]^n = \exp\left[-\frac{\sqrt n\mu}{\sigma}t + n\log\left(1 + \frac{\mu t}{\sigma\sqrt n} + \frac{t^2}{2\sigma^2n}\left(\sigma^2+\mu^2\right) + ...\right)\right]$$
By the Taylor expansion $\log(1+x) = x-\frac{x^2}{2}+...$,
$$n\log\left(1 + \frac{\mu t}{\sigma\sqrt n} + \frac{t^2}{2\sigma^2n}\left(\sigma^2+\mu^2\right)\right) = n\left[\frac{\mu t}{\sigma\sqrt n} + \frac{t^2}{2\sigma^2n}\left(\sigma^2+\mu^2\right) - \frac12\left(\frac{\mu t}{\sigma\sqrt n}\right)^2 + ...\right] = \frac{\sqrt n\,\mu t}{\sigma} + \frac{t^2}{2} + ...$$
Hence $M_{Z_n}(t)\to e^{t^2/2}$. The textbook assumed $\mu = 0$ to start with, which simplifies the algebra.

Chapter 6: Distributions Derived From the Normal

• $\chi^2$ distribution: If $X_1, X_2, ..., X_n$ are i.i.d. $N(0,1)$, then $\sum_{i=1}^nX_i^2\sim\chi^2_n$, the $\chi^2$ distribution with $n$ degrees of freedom.
• $t$ distribution: If $U\sim\chi^2_n$, $Z\sim N(0,1)$, and $Z$ and $U$ are independent, then $\frac{Z}{\sqrt{U/n}}\sim t_n$, the $t$ distribution with $n$ degrees of freedom.
• $F$ distribution: If $U\sim\chi^2_m$, $V\sim\chi^2_n$, and $U$ and $V$ are independent, then $\frac{U/m}{V/n}\sim F_{m,n}$, the $F$ distribution with $m$ and $n$ degrees of freedom.

$\chi^2$ Distribution

If $X_1, X_2, ..., X_n$ are i.i.d. $N(0,1)$, then $\sum_{i=1}^nX_i^2\sim\chi^2_n$, the $\chi^2$ distribution with $n$ degrees of freedom.

• $Z\sim\chi^2_n$ has MGF $M_Z(t) = (1-2t)^{-n/2}$.
• $Z\sim\chi^2_n$ has $E(Z) = n$, $Var(Z) = 2n$.
• If $Z_1\sim\chi^2_n$ and $Z_2\sim\chi^2_m$ are independent, then $Z = Z_1+Z_2\sim\chi^2_{n+m}$.
• $\chi^2_n = \mathrm{Gamma}\left(\alpha = \frac n2, \lambda = \frac12\right)$.

If $X\sim\mathrm{Gamma}(\alpha,\lambda)$, then $M_X(t) = \left(\frac{\lambda}{\lambda-t}\right)^\alpha = \left(\frac{1}{1-t/\lambda}\right)^\alpha$. If $Z\sim\chi^2_n$, then $M_Z(t) = (1-2t)^{-n/2} = \left(\frac{1}{1-2t}\right)^{n/2}$. Therefore $Z\sim\chi^2_n = \mathrm{Gamma}\left(\frac n2,\frac12\right)$, and the density function of $Z\sim\chi^2_n$ is
$$f_Z(z) = \frac{1}{2^{n/2}\Gamma(n/2)}z^{n/2-1}e^{-z/2}, \qquad z\ge0$$

$t$ Distribution

If $U\sim\chi^2_n$, $Z\sim N(0,1)$, and $Z$ and $U$ are independent, then $\frac{Z}{\sqrt{U/n}}\sim t_n$, the $t$ distribution with $n$ degrees of freedom.

Theorem 6.2.A: The density function of $Z\sim t_n$ is
$$f_Z(z) = \frac{\Gamma[(n+1)/2]}{\sqrt{n\pi}\,\Gamma(n/2)}\left(1+\frac{z^2}{n}\right)^{-(n+1)/2}$$

[Figure: t densities with 1 through 10 degrees of freedom, compared with the standard normal density.]

Matlab Code

function plot_tdensity
figure;
x = -5:0.01:5;
plot(x, tpdf(x,1), 'g-', 'linewidth', 2); hold on; grid on;
plot(x, tpdf(x,10), 'k-', 'linewidth', 2);
plot(x, normpdf(x), 'r', 'linewidth', 2);
for n = 2:9
    plot(x, tpdf(x,n));
end
xlabel('x'); ylabel('density');
legend('1 degree', '10 degrees', 'normal');

Things to know about $t_n$ distributions:
• It is widely used in statistical testing: the $t$-test.
• It is practically indistinguishable from the normal when $n\ge45$.
• It is a heavy-tailed distribution: only moments of order less than $n$ exist.
• It is the Cauchy distribution when $n = 1$.
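A minimal Matlab sketch ($n$ and the sample size are my own choices) checking the defining construction $Z/\sqrt{U/n}$ against the $t_n$ quantiles:

n = 5; N = 10^6;
Z = randn(N, 1); U = chi2rnd(n, N, 1);       % independent N(0,1) and chi-square_n
T = Z ./ sqrt(U/n);
[quantile(T, [0.9 0.95 0.99]); tinv([0.9 0.95 0.99], n)]   % should agree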
The F Distribution

If $U\sim\chi^2_m$, $V\sim\chi^2_n$, and $U$ and $V$ are independent, then $Z = \frac{U/m}{V/n}\sim F_{m,n}$, the $F$ distribution with $m$ and $n$ degrees of freedom.

Proposition 6.2.B: If $Z\sim F_{m,n}$, then the density is
$$f_Z(z) = \frac{\Gamma[(m+n)/2]}{\Gamma(m/2)\Gamma(n/2)}\left(\frac mn\right)^{m/2}z^{m/2-1}\left(1+\frac mn z\right)^{-(m+n)/2}$$
The $F$ distribution is also widely used in statistical testing: the $F$-test.

The Cauchy Distribution

If $X\sim N(0,1)$ and $Y\sim N(0,1)$, and $X$ and $Y$ are independent, then $Z = \frac XY$ has the standard Cauchy distribution, with density
$$f_Z(z) = \frac{1}{\pi(z^2+1)}, \qquad -\infty < z < \infty$$
The Cauchy distribution does not have a finite mean. It is also the $t$ distribution with 1 degree of freedom.

Proof:
$$F_Z(z) = P(Z\le z) = P\left(\frac XY\le z\right) = P(X\le Yz, Y > 0) + P(X\ge Yz, Y < 0) = 2P(X\le Yz, Y > 0)$$
$$= 2\int_0^\infty\int_{-\infty}^{yz}f_{X,Y}(x,y)\,dxdy = 2\int_0^\infty\int_{-\infty}^{yz}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}}dxdy = \frac1\pi\int_0^\infty\int_{-\infty}^{yz}e^{-\frac{x^2+y^2}{2}}dxdy$$
Now what? It actually appears easier to work with the PDF $f_Z(z)$. Using the fact that $\frac{\partial}{\partial x}\int_c^{g(x)}h(y)dy = h(g(x))g'(x)$ for any constant $c$,
$$f_Z(z) = \frac1\pi\int_0^\infty e^{-\frac{y^2}{2}}\,y\,e^{-\frac{y^2z^2}{2}}dy = \frac1\pi\int_0^\infty e^{-\frac{y^2}{2}(z^2+1)}\,d\left(\frac{y^2}{2}\right) = \frac1\pi\frac{1}{z^2+1}$$
What's the problem when working directly with the CDF? Changing to polar coordinates (for $z > 0$, splitting off the piece with $x\le0$, which contributes $P(X\le0, Y>0) = \frac14$ to each of the two symmetric terms):
$$F_Z(z) = \frac12 + \frac1\pi\int_0^\infty\int_{\tan^{-1}(1/z)}^{\pi/2}e^{-\frac{r^2}{2}}\,r\,d\theta\,dr = \frac12 + \frac1\pi\int_0^\infty e^{-\frac{r^2}{2}}r\left[\frac\pi2-\tan^{-1}(1/z)\right]dr = \frac12 + \frac{\pi/2-\tan^{-1}(1/z)}{\pi}$$
Therefore, again,
$$f_Z(z) = \frac1\pi\frac{1}{z^2+1}$$

Section 6.3: Sample Mean and Sample Variance

Let $X_1, X_2, ..., X_n$ be independent samples from $N(\mu,\sigma^2)$.

The sample mean: $\bar X = \frac1n\sum_{i=1}^nX_i$.
The sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\bar X\right)^2$.

Theorem 6.3.A: The random variable $\bar X$ and the vector $(X_1-\bar X, X_2-\bar X, ..., X_n-\bar X)$ are independent.
Proof: Read the book for a more rigorous proof. Here let's only prove that $\bar X$ and $X_i-\bar X$ are uncorrelated (a homework problem).

Corollary 6.3.A: $\bar X$ and $S^2$ are independently distributed.
Proof: It follows immediately because $S^2$ is a function of $(X_1-\bar X, X_2-\bar X, ..., X_n-\bar X)$.

Joint Distribution of the Sample Mean and Sample Variance

Theorem 6.3.B: $(n-1)S^2/\sigma^2\sim\chi^2_{n-1}$.

Proof: $X_1, X_2, ..., X_n$ are independent normal variables, $X_i\sim N(\mu,\sigma^2)$. Intuitively, $S^2 = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\bar X\right)^2$ should be closely related to a chi-squared distribution.
$$(n-1)S^2 = \sum_{i=1}^n\left(X_i-\bar X\right)^2 = \sum_{i=1}^n\left(X_i-\mu+\mu-\bar X\right)^2 = \sum_{i=1}^n(X_i-\mu)^2 - n\left(\mu-\bar X\right)^2$$
Note that
$$\sum_{i=1}^n\left(\frac{X_i-\mu}{\sigma}\right)^2\sim\chi^2_n, \qquad \left(\frac{\mu-\bar X}{\sigma/\sqrt n}\right)^2\sim\chi^2_1$$
Letting $Y = \frac{(n-1)S^2}{\sigma^2}$, the decomposition reads
$$Y + \left(\frac{\mu-\bar X}{\sigma/\sqrt n}\right)^2 = \sum_{i=1}^n\left(\frac{X_i-\mu}{\sigma}\right)^2$$
The MGFs of both sides should be equal. Also, note that $Y$ and $\bar X$ are independent.
$$Y = \frac{(n-1)S^2}{\sigma^2}, \qquad Y + \left(\frac{\mu-\bar X}{\sigma/\sqrt n}\right)^2 = \sum_{i=1}^n\left(\frac{X_i-\mu}{\sigma}\right)^2$$
Equating the MGFs of both sides (and using independence),
$$E\left(e^{tY}\right)(1-2t)^{-1/2} = (1-2t)^{-n/2} \;\Longrightarrow\; E\left(e^{tY}\right) = (1-2t)^{-(n-1)/2}$$
Therefore,
$$\frac{(n-1)S^2}{\sigma^2}\sim\chi^2_{n-1}$$

Corollary 6.3.B:
$$\frac{\bar X-\mu}{S/\sqrt n}\sim t_{n-1}$$
Proof:
$$\frac{\bar X-\mu}{S/\sqrt n} = \frac{\frac{\bar X-\mu}{\sigma/\sqrt n}}{\sqrt{\left[(n-1)S^2/\sigma^2\right]/(n-1)}} = \frac{U}{\sqrt{V/(n-1)}}$$
where $U\sim N(0,1)$ and $V = (n-1)S^2/\sigma^2\sim\chi^2_{n-1}$ are independent. Therefore $\frac{U}{\sqrt{V/(n-1)}}\sim t_{n-1}$ by definition.

Chapter 8: Parameter Estimation

One of the most important chapters for 4090!

Assume $n$ i.i.d. observations $X_i$, $i = 1$ to $n$. The $X_i$'s have a density function with $k$ parameters $\theta_1, \theta_2, ..., \theta_k$, written as $f_X(x;\theta_1,\theta_2,...,\theta_k)$. The task is to estimate $\theta_1, \theta_2, ..., \theta_k$ from the $n$ samples $X_1, X_2, ..., X_n$.

Where did the density function $f_X$ come from in the first place? This is often a chicken-and-egg problem, but it is not a major concern for this class.

Two Basic Estimation Methods

Suppose $X_1, X_2, ..., X_n$ are i.i.d. samples with density $f_X(x;\theta_1,\theta_2)$.

• The method of moments: force $\frac1n\sum_{i=1}^nX_i = E(X)$ and $\frac1n\sum_{i=1}^nX_i^2 = E(X^2)$. Two equations, two unknowns ($\theta_1,\theta_2$).
• The method of maximum likelihood: find the $\theta_1$ and $\theta_2$ that maximize the joint probability (likelihood) $\prod_{i=1}^nf_X(x_i;\theta_1,\theta_2)$. An optimization problem, maybe convex.

The Method of Moments

Define the $m$th theoretical moment of $X$: $\mu_m = E(X^m)$. Define the $m$th empirical moment of $X$: $\hat\mu_m = \frac1n\sum_{i=1}^nX_i^m$. Solve a system of $k$ equations: $\mu_m = \hat\mu_m$, $m = 1$ to $k$. What could be the difficulties?

Example 8.4.A: $X_i\sim Poisson(\lambda)$, i.i.d., $i = 1$ to $n$.

Because $E(X_i) = \lambda$, the moment estimator is
$$\hat\lambda = \frac1n\sum_{i=1}^nX_i = \bar X$$
Properties of $\hat\lambda$:
$$E(\hat\lambda) = \frac1n\sum_{i=1}^nE(X_i) = \lambda, \qquad Var(\hat\lambda) = \frac1nVar(X_i) = \frac\lambda n$$
Because $Var(X_i) = \lambda$, we can also estimate $\lambda$ by
$$\hat\lambda_2 = \frac1n\sum_{i=1}^nX_i^2 - \left(\frac1n\sum_{i=1}^nX_i\right)^2$$
This estimator $\hat\lambda_2$ is no longer unbiased, because
$$E(\hat\lambda_2) = \left(\lambda+\lambda^2\right) - \left(\frac\lambda n+\lambda^2\right) = \lambda-\frac\lambda n$$
Moment estimators are in general biased. Q: How can we modify $\hat\lambda_2$ to obtain an unbiased estimator?

Example 8.4.B: $X_i\sim N(\mu,\sigma^2)$, i.i.d., $i = 1$ to $n$.

Solve for $\mu$ and $\sigma^2$ from the equations
$$\mu = \frac1n\sum_{i=1}^nX_i, \qquad \sigma^2 = \frac1n\sum_{i=1}^nX_i^2 - \left(\frac1n\sum_{i=1}^nX_i\right)^2$$
The moment estimators are
$$\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac1n\sum_{i=1}^n\left(X_i-\bar X\right)^2$$
We already know that $\hat\mu$ and $\hat\sigma^2$ are independent, and
$$\hat\mu\sim N\left(\mu,\frac{\sigma^2}{n}\right), \qquad \frac{n\hat\sigma^2}{\sigma^2}\sim\chi^2_{n-1}$$
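Looking back at Example 8.4.A, a minimal Matlab sketch (lambda, n, T are my own choices) checking the bias formula $E(\hat\lambda_2) = \lambda-\lambda/n$:

lam = 3; n = 20; T = 10^5;
X = poissrnd(lam, n, T);                   % T samples of size n
lam2 = mean(X.^2, 1) - mean(X, 1).^2;      % the second-moment estimator
[mean(lam2)  lam - lam/n]                  % biased, as predicted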
Example 8.4.C: $X_i\sim Gamma(\alpha,\lambda)$, i.i.d., $i = 1$ to $n$.

The first two moments are
$$\mu_1 = \frac\alpha\lambda, \qquad \mu_2 = \frac{\alpha(\alpha+1)}{\lambda^2}$$
Equivalently,
$$\alpha = \frac{\mu_1^2}{\mu_2-\mu_1^2}, \qquad \lambda = \frac{\mu_1}{\mu_2-\mu_1^2}$$
The moment estimators are
$$\hat\alpha = \frac{\hat\mu_1^2}{\hat\mu_2-\hat\mu_1^2} = \frac{\bar X^2}{\hat\sigma^2}, \qquad \hat\lambda = \frac{\hat\mu_1}{\hat\mu_2-\hat\mu_1^2} = \frac{\bar X}{\hat\sigma^2}$$

Example 8.4.D: Assume that the random variable $X$ has density
$$f_X(x) = \frac{1+\alpha x}{2}, \qquad |x|\le1, \quad |\alpha|\le1$$
Then $\alpha$ can be estimated from the first moment:
$$\mu_1 = \int_{-1}^1x\,\frac{1+\alpha x}{2}\,dx = \frac\alpha3$$
Therefore, the moment estimator is $\hat\alpha = 3\bar X$.

Consistency of Moment Estimators

Definition: Let $\hat\theta_n$ be an estimator of a parameter $\theta$ based on a sample of size $n$. Then $\hat\theta_n$ is consistent in probability if, for any $\epsilon > 0$,
$$P\left(|\hat\theta_n-\theta|\ge\epsilon\right)\to0, \qquad\text{as } n\to\infty$$
Moment estimators are consistent if the conditions for the Weak Law of Large Numbers are satisfied.

A Simulation Study for Estimating Gamma Parameters

Consider a gamma distribution $Gamma(\alpha,\lambda)$ with $\alpha = 4$ and $\lambda = 0.5$. Generate $n$ samples from $Gamma(\alpha = 4, \lambda = 0.5)$, for $n = 5$ to $n = 10^5$. Estimate $\alpha$ and $\lambda$ by the moment estimators for every $n$. Repeat the experiment 4 times.

[Figures: moment estimates of alpha = 4 and lambda = 0.5 versus n (log scale), four independent runs; the estimates converge to the true values as n grows.]

Matlab Code

function est_gamma
n = 10^5; al = 4; lam = 0.5;
c = ['b','k','r','g'];
for t = 1:4
    X = gamrnd(al, 1/lam, n, 1);
    mu1 = cumsum(X)./(1:n)';
    mu2 = cumsum(X.^2)./(1:n)';
    est_al  = mu1.^2./(mu2 - mu1.^2);
    est_lam = mu1./(mu2 - mu1.^2);
    st = 5;
    figure(1);
    semilogx((st:n)', est_al(st:end), c(t), 'linewidth', 2); hold on; grid on;
    title(['Gamma: Moment estimate of \alpha = ' num2str(al)]);
    figure(2);
    semilogx((st:n)', est_lam(st:end), c(t), 'linewidth', 2); hold on; grid on;
    title(['Gamma: Moment estimate of \lambda = ' num2str(lam)]);
end

The Method of Maximum Likelihood

Suppose that random variables $X_1, X_2, ..., X_n$ have a joint density $f(x_1, x_2, ..., x_n|\theta)$. Given observed values $X_i = x_i$, where $i = 1$ to $n$, the likelihood of $\theta$ as a function of $(x_1, x_2, ..., x_n)$ is defined as
$$lik(\theta) = f(x_1, x_2, ..., x_n|\theta)$$
The method of maximum likelihood seeks the $\theta$ that maximizes $lik(\theta)$.

The Log Likelihood in the I.I.D. Case

If the $X_i$'s are i.i.d., then $lik(\theta) = \prod_{i=1}^nf(X_i|\theta)$. It is often more convenient to work with its logarithm, called the log likelihood:
$$l(\theta) = \sum_{i=1}^n\log f(X_i|\theta)$$
Example 8.5.A: Suppose $X_1, X_2, ..., X_n$ are i.i.d. samples of $Poisson(\lambda)$. Then the likelihood of $\lambda$ is
$$lik(\lambda) = \prod_{i=1}^n\frac{\lambda^{X_i}e^{-\lambda}}{X_i!}$$
The log likelihood is
$$l(\lambda) = \sum_{i=1}^n\left[X_i\log\lambda-\lambda-\log X_i!\right] = \log\lambda\sum_{i=1}^nX_i - n\lambda + \left[-\sum_{i=1}^n\log X_i!\right]$$
The part in $[\,...\,]$ is useless for finding the MLE (it does not involve $\lambda$).

The MLE is the solution to $l'(\lambda) = 0$, where
$$l'(\lambda) = \frac1\lambda\sum_{i=1}^nX_i - n$$
Therefore, the MLE is $\hat\lambda = \bar X$, the same as the moment estimator. For verification, check $l''(\lambda) = -\frac{1}{\lambda^2}\sum_{i=1}^nX_i\le0$, meaning that $l(\lambda)$ is a concave function and the solution to $l'(\lambda) = 0$ is indeed the maximum.

Example 8.5.B: Given $n$ i.i.d. samples $X_i\sim N(\mu,\sigma^2)$, $i = 1$ to $n$, the log likelihood is
$$l\left(\mu,\sigma^2\right) = \sum_{i=1}^n\log f_X\left(X_i;\mu,\sigma^2\right) = -\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\mu)^2 - \frac n2\log\left(2\pi\sigma^2\right)$$
$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(X_i-\mu) = 0 \;\Longrightarrow\; \hat\mu = \frac1n\sum_{i=1}^nX_i$$
$$\frac{\partial l}{\partial\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^n(X_i-\mu)^2 - \frac{n}{2\sigma^2} = 0 \;\Longrightarrow\; \hat\sigma^2 = \frac1n\sum_{i=1}^n(X_i-\hat\mu)^2$$

Example 8.5.C: $X_i\sim Gamma(\alpha,\lambda)$, i.i.d., $i = 1$ to $n$. The likelihood function is
$$lik(\alpha,\lambda) = \prod_{i=1}^n\frac{1}{\Gamma(\alpha)}\lambda^\alpha X_i^{\alpha-1}e^{-\lambda X_i}$$
The log likelihood function is
$$l(\alpha,\lambda) = \sum_{i=1}^n\left[-\log\Gamma(\alpha) + \alpha\log\lambda + (\alpha-1)\log X_i - \lambda X_i\right]$$
Taking derivatives,
$$\frac{\partial l(\alpha,\lambda)}{\partial\alpha} = -n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + n\log\lambda + \sum_{i=1}^n\log X_i, \qquad \frac{\partial l(\alpha,\lambda)}{\partial\lambda} = n\frac\alpha\lambda - \sum_{i=1}^nX_i$$
The MLE solutions satisfy
$$\hat\lambda = \frac{\hat\alpha}{\bar X}, \qquad -\frac{\Gamma'(\hat\alpha)}{\Gamma(\hat\alpha)} + \log\hat\alpha - \log\bar X + \frac1n\sum_{i=1}^n\log X_i = 0$$
We need an iterative scheme to solve for $\hat\alpha$ and $\hat\lambda$. This is actually a difficult numerical problem, because a naive method will not converge, or possibly because the Matlab implementation of the "psi" function $\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$ is not that accurate. As a last resort, one can always do an exhaustive search or a binary search. Our simulations show that the MLE is indeed better than the moment estimator.

[Figures: moment estimates (dashed) versus MLEs (solid) of alpha = 4 and lambda = 0.5 as the sample size grows; the MLE tracks the true values more closely.]

Matlab Code

function est_gamma_mle
close all; clear all;
n = 10^2; al = 4; lam = 0.5;
c = ['b','k','r','g'];
for t = 1:3
    X = gamrnd(al, 1/lam, n, 1);
    % Find the moment estimators as starting points.
    mu1 = cumsum(X)./(1:n)';
    mu2 = cumsum(X.^2)./(1:n)';
    est_al  = mu1.^2./(mu2 - mu1.^2);
    est_lam = mu1./(mu2 - mu1.^2);
    % Exhaustive search in the neighborhood of the moment estimator.
    mu_log = cumsum(log(X))./(1:n)';
    m = 400;
    for i = 1:m
        al_m(:,i) = est_al - 2 + 0.01*(i-1);
        ind_neg = find(al_m(:,i) < 0);
        al_m(ind_neg,i) = eps;
        lam_m(:,i) = al_m(:,i)./mu1;
    end
    L = log(lam_m).*al_m + (al_m-1).*(mu_log*ones(1,m)) ...
        - lam_m.*(mu1*ones(1,m)) - log(gamma(al_m));
    [dummy, ind] = max(L, [], 2);
    for i = 1:n
        est_al_mle(i)  = al_m(i,ind(i));
        est_lam_mle(i) = lam_m(i,ind(i));
    end
    st = 10;
    figure(1);
    plot((st:n)', est_al(st:end), [c(t) '--'], 'linewidth', 2); hold on; grid on;
    plot((st:n)', est_al_mle(st:end), c(t), 'linewidth', 2);
    title(['Gamma: Moment estimate of \alpha = ' num2str(al)]);
    legend('Moment', 'MLE');
    figure(2);
    plot((st:n)', est_lam(st:end), [c(t) '--'], 'linewidth', 2); hold on; grid on;
    plot((st:n)', est_lam_mle(st:end), c(t), 'linewidth', 2);
    title(['Gamma: Moment estimate of \lambda = ' num2str(lam)]);
    legend('Moment', 'MLE');
end

Newton's Method

Finding the maximum or minimum of a function $f(x)$ is equivalent to finding the $x^*$ such that $f'(x^*) = 0$. Suppose $x$ is close to $x^*$. By Taylor expansion,
$$f'(x^*) = f'(x) + (x^*-x)f''(x) + ... = 0,$$
and we obtain
$$x^*\approx x-\frac{f'(x)}{f''(x)}$$
This gives an iterative formula. In multiple dimensions, one needs to invert a Hessian matrix (not just take the reciprocal of $f''(x)$).

MLE Using Newton's Method for Estimating Gamma Parameters

$X_i\sim Gamma(\alpha,\lambda)$, i.i.d., $i = 1$ to $n$. The log likelihood function is
$$l(\alpha,\lambda) = \sum_{i=1}^n\left[-\log\Gamma(\alpha) + \alpha\log\lambda + (\alpha-1)\log X_i - \lambda X_i\right]$$
First derivatives:
$$\frac{\partial l(\alpha,\lambda)}{\partial\alpha} = -n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + n\log\lambda + \sum_{i=1}^n\log X_i, \qquad \frac{\partial l(\alpha,\lambda)}{\partial\lambda} = n\frac\alpha\lambda - \sum_{i=1}^nX_i$$
[Figure: MSE of the moment estimator vs. the one-step MLE of α (true α = 4), sample sizes 20 to 200.]

[Figure: MSE of the moment estimator vs. the one-step MLE of λ (true λ = 0.5), sample sizes 20 to 200.]

Matlab Code for MLE Using One-Step Newton Updates

function est_gamma_mle_onestep
al = 4; lam = 0.5;
N = [20:10:50, 80, 100, 150, 200]; T = 10^4;
X = gamrnd(al, 1/lam, T, max(N));
for i = 1:length(N)
    n = N(i);
    mu1 = sum(X(:,1:n),2)./n;  mu2 = sum(X(:,1:n).^2,2)./n;
    est_al0  = mu1.^2./(mu2 - mu1.^2);
    est_lam0 = mu1./(mu2 - mu1.^2);
    est_al0_mu(i) = mean(est_al0);   est_al0_var(i) = var(est_al0);
    est_lam0_mu(i) = mean(est_lam0); est_lam0_var(i) = var(est_lam0);
    est_al_mle_s1 = est_al0;  est_lam_mle_s1 = est_lam0;
    d1_al  = log(est_lam_mle_s1) + mean(log(X(:,1:n)),2) - psi(est_al_mle_s1);
    d1_lam = est_al_mle_s1./est_lam_mle_s1 - mean(X(:,1:n),2);
    d2_al  = -psi(1, est_al_mle_s1);
    d12    = 1./est_lam_mle_s1;
    d2_lam = -est_al_mle_s1./est_lam_mle_s1.^2;
    for j = 1:T
        update(j,:) = (inv([d2_al(j) d12(j); d12(j) d2_lam(j)]) * [d1_al(j); d1_lam(j)])';
    end
    est_al_mle_s1  = est_al_mle_s1  - update(:,1);
    est_lam_mle_s1 = est_lam_mle_s1 - update(:,2);
    est_lam_mle_s1 = est_al_mle_s1./mean(X(:,1:n),2);
    est_al_mle_s1_mu(i) = mean(est_al_mle_s1);   est_al_mle_s1_var(i) = var(est_al_mle_s1);
    est_lam_mle_s1_mu(i) = mean(est_lam_mle_s1); est_lam_mle_s1_var(i) = var(est_lam_mle_s1);
end
figure;
plot(N, (est_al0_mu-al).^2 + est_al0_var, 'k--', 'linewidth', 2); hold on; grid on;
plot(N, (est_al_mle_s1_mu-al).^2 + est_al_mle_s1_var, 'r-', 'linewidth', 2);
xlabel('Sample size'); ylabel('MSE');
title(['Gamma: One-step MLE of \alpha = ' num2str(al)]);
legend('Moment','One-step MLE');
figure;
plot(N, (est_lam0_mu-lam).^2 + est_lam0_var, 'k--', 'linewidth', 2); hold on; grid on;
plot(N, (est_lam_mle_s1_mu-lam).^2 + est_lam_mle_s1_var, 'r-', 'linewidth', 2);
title(['Gamma: One-step MLE of \lambda = ' num2str(lam)]);
xlabel('Sample size'); ylabel('MSE');
legend('Moment','One-step MLE');

MLE of Multinomial Probabilities

Suppose X1, X2, ..., Xm, which are the counts of cells 1, 2, ..., m, follow a multinomial distribution with total count n and cell probabilities p1, p2, ..., pm. To estimate p1, ..., pm from the observations X1 = x1, ..., Xm = xm, write down the joint likelihood

  f(x1, x2, ..., xm | p1, p2, ..., pm) ∝ Π_{i=1}^m pi^{xi}

and the log likelihood

  L(p1, p2, ..., pm) = Σ_{i=1}^m xi log pi,  subject to Σ_{i=1}^m pi = 1.

A constrained optimization problem.

Solution 1: Reduce to m − 1 variables:

  L(p2, ..., pm) = x1 log(1 − p2 − p3 − ... − pm) + Σ_{i=2}^m xi log pi,

where Σ_{i=2}^m pi ≤ 1 and 0 ≤ pi ≤ 1. We do not have to worry about the inequality constraints unless they are violated.

  ∂L/∂pi = −x1/(1 − p2 − p3 − ... − pm) + xi/pi = 0,  i = 2, 3, ..., m

  ⟹ x1/p1 = xi/pi  ⟹ x1/p1 = x2/p2 = x3/p3 = ...
= xm/pm = λ, say. Therefore,

  p1 = x1/λ, p2 = x2/λ, ..., pm = xm/λ

  ⟹ 1 = Σ_{i=1}^m pi = (Σ_{i=1}^m xi)/λ = n/λ  ⟹  λ = n  ⟹  p̂i = xi/n, i = 1, 2, ..., m.

Solution 2: Lagrange multiplier (essentially the same as Solution 1). Convert the original problem into an "unconstrained" problem:

  L(p1, p2, ..., pm) = Σ_{i=1}^m xi log pi − λ (Σ_{i=1}^m pi − 1).

Example A: Hardy-Weinberg Equilibrium

If gene frequencies are in equilibrium, the genotypes AA, Aa, and aa occur in a population with frequencies (1 − θ)², 2θ(1 − θ), and θ², respectively. Suppose we observe sample counts x1, x2, and x3, with total n. Q: Estimate θ using MLE.

Solution: The log likelihood can be written as

  l(θ) = Σ_{i=1}^3 xi log pi = x1 log(1−θ)² + x2 log 2θ(1−θ) + x3 log θ²
       ∝ (2x1 + x2) log(1−θ) + (x2 + 2x3) log θ.

Taking the first derivative,

  ∂l(θ)/∂θ = −(2x1 + x2)/(1−θ) + (x2 + 2x3)/θ = 0  ⟹  θ̂ = (x2 + 2x3)/(2n).

What is Var(θ̂)?

  Var(θ̂) = (1/(4n²)) [Var(x2) + 4Var(x3) + 4Cov(x2, x3)]
          = (1/(4n²)) [np2(1−p2) + 4np3(1−p3) − 4np2p3]
          = (1/(4n)) [p2 + 4p3 − (p2 + 2p3)²]
          = θ(1−θ)/(2n).

We will soon show that the variance of the MLE is asymptotically 1/I(θ), where I(θ) = −E[∂²l(θ)/∂θ²] is the Fisher information.

  ∂²l(θ)/∂θ² = −(2x1 + x2)/(1−θ)² − (x2 + 2x3)/θ²

  I(θ) = −E[∂²l(θ)/∂θ²] = n [2(1−θ)² + 2θ(1−θ)]/(1−θ)² + n [2θ(1−θ) + 2θ²]/θ²
       = 2n/(1−θ) + 2n/θ = 2n/(θ(1−θ)).

Therefore, the "asymptotic variance" is Var(θ̂) = θ(1−θ)/(2n), which in this case is the exact variance.

Review: Properties of the Multinomial Distribution

Suppose X1, X2, ..., Xm, which are the counts of cells 1, 2, ..., m, follow a multinomial distribution with total count n and cell probabilities p1, p2, ..., pm.

Marginal and conditional distributions:

  Xj ∼ Binomial(n, pj),   Xj | Xi ∼ Binomial(n − Xi, pj/(1 − pi)),  i ≠ j.

Moments:

  E(Xj) = npj,  Var(Xj) = npj(1 − pj),  E(Xj | Xi) = (n − Xi) pj/(1 − pi)

  E(XiXj) = E(Xi E(Xj | Xi)) = E(nXi − Xi²) pj/(1 − pi)
          = [n²pi − npi(1 − pi) − n²pi²] pj/(1 − pi) = n(n−1) pi pj

  Cov(Xi, Xj) = E(XiXj) − E(Xi)E(Xj) = n(n−1) pi pj − n² pi pj = −n pi pj.
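The exact-variance claim in the Hardy-Weinberg example above is easy to check by simulation. A minimal Matlab sketch (not from the slides; it assumes the Statistics Toolbox function mnrnd, and the values of θ, n, and the number of replications are arbitrary illustrative choices):

theta = 0.3; n = 1000; T = 10^4;               % illustrative values
p = [(1-theta)^2, 2*theta*(1-theta), theta^2];  % Hardy-Weinberg cell probabilities
X = mnrnd(n, p, T);                             % T multinomial samples of size n
theta_hat = (X(:,2) + 2*X(:,3)) / (2*n);        % MLE from each sample
[var(theta_hat), theta*(1-theta)/(2*n)]         % empirical vs. theoretical variance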
Large Sample Theory for MLE

Assume i.i.d. samples of size n, Xi, i = 1 to n, with density f(x|θ). The MLE of θ, denoted by θ̂, is given by

  θ̂ = argmax_θ Σ_{i=1}^n log f(xi|θ).

Large sample theory says that, as n → ∞, θ̂ is asymptotically unbiased and normal:

  θ̂ ∼ N(θ, 1/(nI(θ))),  approximately,

where I(θ) = −E[(∂²/∂θ²) log f(X|θ)] is the Fisher information of θ.

Fisher Information

  I(θ) = E[((∂/∂θ) log f(X|θ))²] = −E[(∂²/∂θ²) log f(X|θ)].

How do we prove the equivalence of the two definitions?

Proof:

  E[((∂/∂θ) log f(X|θ))²] = ∫ (∂f/∂θ)²/f² · f dx = ∫ (∂f/∂θ)² (1/f) dx

  −E[(∂²/∂θ²) log f(X|θ)] = −∫ [(∂²f/∂θ²) f − (∂f/∂θ)²]/f² · f dx
                           = −∫ ∂²f/∂θ² dx + ∫ (∂f/∂θ)² (1/f) dx.

Therefore, it suffices to show (in fact, to assume)

  ∫ ∂²f/∂θ² dx = (∂²/∂θ²) ∫ f dx = 0.

Example: Normal Distribution

Given n i.i.d. samples, xi ∼ N(µ, σ²), i = 1 to n:

  log fX(x; µ, σ²) = −(x − µ)²/(2σ²) − (1/2) log(2πσ²)

  ∂² log fX/∂µ² = −1/σ²  ⟹  I(µ) = 1/σ²

  ∂² log fX/∂(σ²)² = −(x − µ)²/σ⁶ + 1/(2σ⁴)  ⟹  I(σ²) = σ²/σ⁶ − 1/(2σ⁴) = 1/(2σ⁴).

The "asymptotic" variances of the MLE are in fact exact in this case.

Example: Binomial Distribution

x ∼ Binomial(n, p): Pr(x = k) = C(n, k) p^k (1 − p)^{n−k}. Log likelihood and Fisher information:

  l(p) = k log p + (n − k) log(1 − p)

  l′(p) = k/p − (n − k)/(1 − p)  ⟹  MLE p̂ = ??

  l″(p) = −k/p² − (n − k)/(1 − p)²

  I(p) = −E(l″(p)) = np/p² + (n − np)/(1 − p)² = n/(p(1 − p)).

The "asymptotic" variance of the MLE is also exact in this case.

Intuition About the Asymptotic Distributions & Variances of MLE

The MLE θ̂ is the solution to the MLE equation l′(θ̂) = 0. A Taylor expansion around the true θ gives

  l′(θ̂) ≈ l′(θ) + (θ̂ − θ) l″(θ).

Setting l′(θ̂) = 0 (because θ̂ is the MLE solution),

  θ̂ − θ ≈ −l′(θ)/l″(θ).

What is the mean of l′(θ)? What is the mean of l″(θ)?

  l′(θ) = Σ_{i=1}^n ∂ log f(xi)/∂θ

  E(l′(θ)) = Σ_{i=1}^n E[∂ log f(xi)/∂θ] = n E[(∂f(x)/∂θ)/f(x)] = 0,

because

  E[(∂f(x)/∂θ)/f(x)] = ∫ (∂f(x)/∂θ) dx = (∂/∂θ) ∫ f(x) dx = 0.

So E(l′(θ)) = 0, and we know −E(l″(θ)) = nI(θ), the Fisher information. Thus

  θ̂ − θ ≈ −l′(θ)/l″(θ) ≈ l′(θ)/(nI(θ)),  E(θ̂ − θ) ≈ E(l′(θ))/(nI(θ)) = 0,

and the variance is

  Var(θ̂) ≈ E(l′(θ))²/(n²I²(θ)) = nI(θ)/(n²I²(θ)) = 1/(nI(θ)).

Sec. 8.7: Efficiency and the Cramér-Rao Lower Bound

Definition: Given two unbiased estimates θ̂1 and θ̂2, the efficiency of θ̂1 relative to θ̂2 is

  eff(θ̂1, θ̂2) = Var(θ̂2)/Var(θ̂1).

For example, if the variance of θ̂2 is 0.8 times the variance of θ̂1, then θ̂1 is 80% efficient relative to θ̂2.

Asymptotic relative efficiency: Given two asymptotically unbiased estimates θ̂1 and θ̂2, the asymptotic relative efficiency of θ̂1 relative to θ̂2 is computed using their asymptotic variances (as the sample size goes to infinity).

Example 8.7.A: Assume that the random variable X has density

  fX(x) = (1 + αx)/2,  |x| ≤ 1, |α| ≤ 1.

Method of moments: α can be estimated from the first moment,

  µ1 = ∫_{−1}^{1} x (1 + αx)/2 dx = α/3.

Therefore, the moment estimator is α̂m = 3X̄, whose variance is

  Var(α̂m) = (9/n) Var(X) = (9/n) [E(X²) − E²(X)] = (3 − α²)/n.

Maximum likelihood estimate: The first two derivatives are

  (∂/∂α) log fX(x; α) = x/(1 + αx),  (∂²/∂α²) log fX(x; α) = −x²/(1 + αx)².

Therefore, the MLE is the solution to

  Σ_{i=1}^n Xi/(1 + α̂mle Xi) = 0.

We cannot compute the exact variance,
so we resort to the approximate (asymptotic) variance,

  Var(α̂mle) ≈ 1/(nI(α)).

Use the second derivative to compute I(α):

  I(α) = −E[(∂²/∂α²) log fX(x|α)] = ∫_{−1}^{1} x²/(1 + αx)² · (1 + αx)/2 dx
       = ∫_{−1}^{1} x²/(2(1 + αx)) dx
       = [log((1+α)/(1−α)) − 2α]/(2α³),  α ≠ 0.

When α = 0, I(α) = 1/3, which can also be obtained by taking the limit of I(α).

The asymptotic relative efficiency of α̂m to α̂mle is

  Var(α̂mle)/Var(α̂m) = 2α³ / [(3 − α²)(log((1+α)/(1−α)) − 2α)].

[Figure: the asymptotic efficiency as a function of α on (−1, 1); it equals 1 at α = 0 and falls toward 0 as |α| → 1.]

Why is the efficiency no larger than 1?

Cramér-Rao Inequality

Theorem 8.7.A: Let X1, X2, ..., Xn be i.i.d. with density function f(x; θ), and let T be an unbiased estimate of θ. Then, under smoothness assumptions on f(x; θ),

  Var(T) ≥ 1/(nI(θ)).

Thus, under reasonable assumptions, the MLE is optimal or (asymptotically) optimal.
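The efficiency formula above is easy to tabulate numerically; a one-off sketch (not in the slides; the grid of α values is an arbitrary choice):

alpha = [0.1 0.3 0.5 0.7 0.9];
eff = 2*alpha.^3 ./ ((3 - alpha.^2) .* (log((1+alpha)./(1-alpha)) - 2*alpha));
[alpha; eff]   % efficiency is near 1 for small alpha and drops as alpha -> 1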
Sec. 8.8: Sufficiency

Definition: Let X1, X2, ..., Xn be i.i.d. samples with density f(x; θ). A statistic T = T(X1, X2, ..., Xn) is said to be sufficient for θ if the conditional distribution of X1, X2, ..., Xn, given T = t, does not depend on θ for any t. In other words, given T, we can gain no more knowledge about θ.

Example 8.8.A: Let X1, X2, ..., Xn be a sequence of independent Bernoulli random variables with P(Xi = 1) = θ, and let T = Σ_{i=1}^n Xi. Then

  P(X1 = x1, ..., Xn = xn | T = t) = P(X1 = x1, ..., Xn = xn, T = t)/P(T = t)
  = θ^t (1−θ)^{n−t} / [C(n, t) θ^t (1−θ)^{n−t}] = 1/C(n, t),

which is independent of θ. Therefore, T = Σ_{i=1}^n Xi is a sufficient statistic.

Theorem 8.8.1.A: Factorization Theorem

A necessary and sufficient condition for T(X1, ..., Xn) to be sufficient for a parameter θ is that the joint probability density (mass) function factors as

  f(x1, x2, ..., xn; θ) = g[T(x1, x2, ..., xn), θ] h(x1, x2, ..., xn).

Example 8.8.1.A: X1, ..., Xn are i.i.d. Bernoulli random variables with success probability θ:

  f(x1, ..., xn; θ) = Π θ^{xi} (1−θ)^{1−xi} = θ^{Σxi} (1−θ)^{n−Σxi}
                    = [θ/(1−θ)]^{Σxi} (1−θ)^n = g(T, θ) × h,

with T(x1, ..., xn) = Σ_{i=1}^n xi the sufficient statistic, g(T, θ) = [θ/(1−θ)]^T (1−θ)^n, and h(x1, ..., xn) = 1.

Example 8.8.1.B: X1, ..., Xn are i.i.d. normal N(µ, σ²), with both µ and σ² unknown:

  f(x1, ..., xn; µ, σ²) = Π (1/(√(2π)σ)) e^{−(xi−µ)²/(2σ²)}
  = (2π)^{−n/2} σ^{−n} exp[−(1/(2σ²)) Σ (xi − µ)²]
  = (2π)^{−n/2} σ^{−n} exp[−(1/(2σ²)) (Σ xi² − 2µ Σ xi + nµ²)].

Therefore, Σ xi² and Σ xi are sufficient statistics. Equivalently, we say T = (X̄, S²) is the sufficient statistic for a normal with unknown mean and variance.

Proof of the Factorization Theorem (Discrete Case)

Theorem: A necessary and sufficient condition for T(X1, ..., Xn) to be sufficient for θ is that the joint probability mass function factors as P(X1 = x1, ..., Xn = xn; θ) = g[T(x1, ..., xn), θ] h(x1, ..., xn).

Proof of the sufficient condition: Assume the factorization holds. Then

  P(T = t) = Σ_{T(x1,...,xn)=t} P(X1 = x1, ..., Xn = xn) = g(t, θ) Σ_{T(x1,...,xn)=t} h(x1, ..., xn).

Note that t is constant. Thus the conditional distribution

  P(X1 = x1, ..., Xn = xn | T = t) = g(t, θ) h(x1, ..., xn) / [g(t, θ) Σ_{T=t} h(x1, ..., xn)]
                                   = h(x1, ..., xn) / Σ_{T=t} h(x1, ..., xn),

which does not depend on θ. Therefore T(X1, ..., Xn) is a sufficient statistic.

Proof of the necessary condition: Assume T(X1, ..., Xn) is sufficient, i.e., the conditional distribution of (X1, ..., Xn) given T does not depend on θ. Then

  P(X1 = x1, ..., Xn = xn) = P(X1 = x1, ..., Xn = xn | T = t) P(T = t) = g(t, θ) h(x1, ..., xn),

where h(x1, ..., xn) = P(X1 = x1, ..., Xn = xn | T = t) and g(t, θ) = P(T = t). Therefore the probability mass function factors.

Exponential Family

Definition: Members of the one-parameter (θ) exponential family have density functions (or frequency functions) of the form

  f(x; θ) = exp[c(θ)T(x) + d(θ) + S(x)] if x ∈ A;  0 otherwise,

where the set A does not depend on θ. Many common distributions (normal, binomial, Poisson, gamma) are members of this family.

Example 8.8.C: The frequency function of the Bernoulli distribution is

  P(X = x) = θ^x (1−θ)^{1−x} = exp[x log(θ/(1−θ)) + log(1−θ)],  x ∈ {0, 1}.

Therefore, this is a member of the exponential family, with

  c(θ) = log(θ/(1−θ)),  T(x) = x,  d(θ) = log(1−θ),  S(x) = 0.

Sufficient statistics of the exponential family: Suppose X1, X2, ..., Xn is an i.i.d. sample from a member of the exponential family. Then the joint probability is

  Π_{i=1}^n f(xi|θ) = Π exp[c(θ)T(xi) + d(θ) + S(xi)]
                    = exp[c(θ) Σ_{i=1}^n T(xi) + n d(θ)] exp[Σ_{i=1}^n S(xi)].

By the factorization theorem, Σ_{i=1}^n T(xi) is a sufficient statistic. In the Bernoulli example, Σ T(xi) = Σ xi.

The MLE of the exponential family: If T(x) is a sufficient statistic for θ, then the MLE is a function of T. Recall that if X ∼ N(µ, σ²), the MLEs are µ̂ = (1/n) Σ xi and σ̂² = (1/n) Σ (xi − µ̂)², and (Σ xi, Σ xi²) is the sufficient statistic. Note that the normal is a member of the two-parameter exponential family.
k-parameter Exponential Family

Definition: Members of the k-parameter (θ) exponential family have density functions (or frequency functions) of the form

  f(x; θ) = exp[Σ_{j=1}^k cj(θ)Tj(x) + d(θ) + S(x)].

Normal Distribution and Exponential Family

Suppose X ∼ N(µ, σ²). Then

  f(x; µ, σ²) = exp[−(1/(2σ²)) x² + (µ/σ²) x − µ²/(2σ²) − (1/2) log σ² − (1/2) log 2π].

Does it really belong to a (2-dim) exponential family? Well, suppose σ² is known; then it clearly belongs to a one-dim exponential family, f(x; θ) = exp[c(θ)T(x) + d(θ) + S(x)], with

  θ = µ,  c(θ) = µ/σ²,  T(x) = x,  d(θ) = −µ²/(2σ²),
  S(x) = −x²/(2σ²) − (1/2) log σ² − (1/2) log 2π.

When σ² is unknown, we need to re-parameterize the distribution by letting θ = (µ/σ², σ²) = (θ1, θ2). Then it belongs to a 2-dim exponential family, f(x; θ) = exp[c1(θ)T1(x) + c2(θ)T2(x) + d(θ) + S(x)], with

  c1(θ) = µ/σ² = θ1,  T1(x) = x,
  c2(θ) = −1/(2σ²) = −1/(2θ2),  T2(x) = x²,
  d(θ) = −(1/2) log σ² − µ²/(2σ²) = −(1/2) log θ2 − θ1²θ2/2,
  S(x) = −(1/2) log 2π.

Another Nice Property of the Exponential Family

Suppose f(x; θ) = exp[Σ_{j=1}^k cj(θ)Tj(x) + d(θ) + S(x)]. Then

  E(Ti(X)) = −∂d(θ)/∂ci(θ).

Exercise: What about variances and covariances?

Proof: Take derivatives on both sides of ∫ f dx = 1, i.e., ∂(∫ f dx)/∂ci(θ) = 0:

  0 = ∫ ∂f/∂ci(θ) dx = ∫ exp[Σ_{j=1}^k cj(θ)Tj(x) + d(θ) + S(x)] (Ti(x) + ∂d(θ)/∂ci(θ)) dx
    = ∫ f (Ti(x) + ∂d(θ)/∂ci(θ)) dx.

Therefore,

  E(Ti(X)) = ∫ f Ti(x) dx = −∫ f ∂d(θ)/∂ci(θ) dx = −∂d(θ)/∂ci(θ).

For example, X ∼ N(µ, σ²) belongs to a 2-dim exponential family with θ = (θ1, θ2) = (µ/σ², σ²), T1(x) = x, T2(x) = x². Applying the previous result,

  E(T1(x)) = E(x) = −∂d(θ)/∂c1(θ) = −(−θ1θ2) = (µ/σ²) σ² = µ,

as expected.

Sec. 8.6: The Bayesian Approach to Parameter Estimation

θ is the parameter to be estimated.
The prior distribution: fΘ(θ).
The joint distribution: fX,Θ(x, θ) = fX|Θ(x|θ) fΘ(θ).
The marginal distribution: fX(x) = ∫ fX,Θ(x, θ) dθ = ∫ fX|Θ(x|θ) fΘ(θ) dθ.
The posterior distribution: fΘ|X(θ|x) = fX,Θ(x, θ)/fX(x) = fX|Θ(x|θ) fΘ(θ) / ∫ fX|Θ(x|u) fΘ(u) du.

Three main issues in Bayesian estimation:
• Specify a prior (without looking at the data first).
• Calculate the posterior distribution, which may be computationally intensive.
• Choose an appropriate estimator from the posterior distribution: mean, median, mode, ...

The Add-one Smoothing

Consider n + m trials having a common probability of success. Suppose, however, that this success probability is not fixed in advance but is chosen from U(0, 1). Q: What is the conditional distribution of this success probability given that the n + m trials result in n successes?
Solution: Let X = the trial success probability, X ∼ U(0, 1), and let N = the total number of successes, N | X = x ∼ Binomial(n + m, x).

  fX|N(x|n) = P(N = n | X = x) fX(x) / P(N = n)
            = C(n+m, n) x^n (1−x)^m / P(N = n) ∝ x^n (1−x)^m.

Therefore, X | N ∼ Beta(n + 1, m + 1). Here X ∼ U(0, 1) is the prior distribution.

If X | N ∼ Beta(n + 1, m + 1), then

  E(X | N) = (n + 1)/((n + 1) + (m + 1)) = (n + 1)/(n + m + 2).

Suppose we do not have prior knowledge of the success probability X. We observe n successes out of n + m trials; the most intuitive estimate (in fact, the MLE) of X would be X̂ = n/(n + m). Assuming a uniform prior on X leads to the add-one smoothing.

[Figure: posterior densities Beta(n+1, m+1), assuming the prior p ∼ U(0, 1), for (n, m) = (0, 8), (2, 8), and (20, 80); the posterior concentrates as the counts grow.]

Posterior distribution: X | N ∼ Beta(n + 1, m + 1). Posterior mean: (n + 1)/(n + m + 2). Posterior mode (peak of the density): n/(n + m).

Estimating the Binomial Parameter Under a Beta Prior

X ∼ Bin(n, p), p ∼ Beta(a, b). Joint probability:

  fX,P(x, p) = C(n, x) p^x (1−p)^{n−x} · [Γ(a+b)/(Γ(a)Γ(b))] p^{a−1} (1−p)^{b−1}
             = [Γ(a+b)/(Γ(a)Γ(b))] C(n, x) p^{x+a−1} (1−p)^{n−x+b−1},

which, as a function of p, is again a beta distribution, Beta(x + a, n − x + b). Marginal distribution:

  fX(x) = ∫₀¹ fX,P(x, p) dp = g(n, x)  (very nice; why?).

Therefore, the posterior distribution is also Beta, with parameters (x + a, n − x + b). This is extremely convenient. The estimator using the posterior mean is

  p̂ = E(p|x) = (x + a)/((x + a) + (n − x + b)) = (x + a)/(n + a + b)
     = (x/n) · n/(a + b + n) + (a/(a + b)) · (a + b)/(a + b + n).

x/n: the usual estimate, without considering priors. a/(a + b): the estimate when there are no data. The add-one smoothing is the special case a = b = 1. What about the bias-variance trade-off?

The Bias-Variance Trade-off

Bayesian estimator (using the posterior mean): p̂ = (x + a)/(n + a + b). MLE: p̂MLE = x/n. Assume p is fixed (condition on p), and study the MSE ratio

  MSE ratio = MSE(p̂)/MSE(p̂MLE).

We hope the MSE ratio ≤ 1, especially when the sample size n is reasonable.

Asymptotic MSE ratio (when n is not too small):

  Asymptotic MSE ratio = 1 + A/n + O(1/n²).

We hope A ≤ 0. Exercise: Find A, which is a function of p, a, b.

[Figure: exact and asymptotic MSE ratios vs. n, for p = 0.5, a = b = 1 (the ratio stays below 1) and for p = 0.9, a = b = 1 (the ratio can exceed 1).]

Conjugate Priors

The prior distribution fΘ(θ) belongs to a family G; the conditional distribution fX|Θ(x|θ) belongs to a family H. If the posterior distribution

  fΘ|X(θ|x) = fX,Θ(x, θ)/fX(x) = fX|Θ(x|θ) fΘ(θ) / ∫ fX|Θ(x|u) fΘ(u) du

also belongs to G, then G is conjugate to H. Conjugate priors were introduced mainly for computational convenience.
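A minimal Matlab sketch (not from the slides) of the conjugate update and the fixed-p MSE comparison above; it assumes the Statistics Toolbox function binornd, and the values of p, a, b, n are arbitrary illustrative choices:

p = 0.9; a = 1; b = 1; n = 30; T = 10^5;   % illustrative values
x = binornd(n, p, T, 1);                   % T binomial experiments
p_bayes = (x + a) / (n + a + b);           % posterior mean under Beta(a,b)
p_mle   = x / n;                           % MLE
mse_ratio = mean((p_bayes - p).^2) / mean((p_mle - p).^2)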
Examples of conjugate priors:
Beta is conjugate to binomial.
Gamma is conjugate to Poisson.
Dirichlet is conjugate to multinomial.
Gamma is conjugate to exponential.
Normal is conjugate to normal (with known variance).

Chapter 9: Testing Hypothesis

Suppose you have a coin which is possibly biased, and you want to test whether the coin is indeed biased (i.e., p ≠ 0.5) by tossing the coin n = 10 times. Suppose you observe k = 8 heads (out of n = 10 tosses). It is reasonable to guess that this coin is indeed biased, but how do we make a precise statement? Are n = 10 tosses enough? How about n = 100? n = 1000? What is the principled approach?

Terminologies

Null hypothesis H0: p = 0.5
Alternative hypothesis HA: p ≠ 0.5
Type I error: rejecting H0 when it is true.
Significance level: P(Type I error) = P(Reject H0 | H0) = α.
Type II error: accepting H0 when it is false; P(Type II error) = P(Accept H0 | HA) = β.
Power: 1 − β.
Goal: low α and high 1 − β.
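For the coin question above, the machinery of this chapter gives a concrete answer via a p-value. Here is a small sketch (not in the slides), assuming the Statistics Toolbox function binocdf:

n = 10; k = 8;
% Two-sided p-value for H0: p = 0.5, doubling the upper tail P(X >= k)
pval10 = 2 * (1 - binocdf(k - 1, n, 0.5))    % about 0.109: weak evidence
% The same 80% heads rate with n = 100 tosses (k = 80)
pval100 = 2 * (1 - binocdf(79, 100, 0.5))    % essentially zero: strong evidence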
α, the test 229 Cornell University, BTRY 4090 / STSCI 4090 Proof: Let Spring 2010 H0 : f (x) = f0 (x), Instructor: Ping Li HA : f (x) = fA (x) Denote two tests 0, if H is accepted 0 d(x) = 1, if H0 is rejected 0, if H is accepted 0 d∗ (x) = 1, if H0 is rejected The test d(x), based on the likelihood ratio, has a significance level α, i.e., d(x) = 1, whenever f0 (x) < cfA (x), (c > 0) Z α = P (d(x) = 1| H0 ) = E(d(x)|H0 ) = d(x)f0 (x)dx Assume the test d∗ (x) has smaller significance level, i.e., P (d∗ (x) = 1|H0 ) ≤ P (d(x) = 1|H0 ) = α Z =⇒ [d(x) − d∗ (x)] f0 (x)dx ≥ 0 230 Cornell University, BTRY 4090 / STSCI 4090 To show: Spring 2010 Instructor: Ping Li P (d∗ (x) = 1|HA ) ≤P (d(x) = 1|HA ) Equivalently, we need to show Z [d(x) − d∗ (x)] fA (x)dx≥0 We make use of a key inequality d∗ (x) [cfA (x) − f0 (x)] ≤ d(x) [cfA (x) − f0 (x)] d(x) = 1 whenever cfA (x) − f0 (x) > 0, and d(x), d∗ (x) only take values in {0, 1}. which is true because More specifically, let M (x) If M (x) = cfA (x) − f0 (x). > 0, then the right-hand-side of the inequality becomes M (x), but the left-hand-side becomes M (x) (if d∗ (x) = 1) or 0 (if d∗ (x) = 0). Thus the inequality holds, because M (x) > 0. 231 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li If M (x) < 0, then the right-hand-side of the inequality becomes 0, but the left-hand-side becomes M (x) (if d∗ (x) = 1) or 0 (if d∗ (x) = 0). Thus the inequality also holds, because M (x) < 0. Integrating both sides of the inequality yields Z Z d∗ (x) [cfA (x) − f0 (x)] dx ≤ d(x) [cfA (x) − f0 (x)] dx Z Z =⇒c [d(x) − d∗ (x)] fA dx ≥ [d(x) − d∗ (x)] f0 dx ≥ 0 232 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 f0 f1 Continue Example 9.2.A: Instructor: Ping Li ≤ c =⇒ Reject H0 . h n i f0 2 2 = exp 2 X̄(µ − µ ) + µ − µ ≤c 0 1 1 0 f1 2σ 2 α = P (reject H0 |H0 ) = P (f0 ≤ cf1 |H0 ) Equivalently, Reject H0 if X̄ ≥ x0 , P (X̄ ≥ x0 |H0 ) = α. 2 Under H0 : X̄ ∼ N µ0 , σ /n and 233 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li α =P (X̄ ≥ x0 |H0 ) X̄ − µ0 x −µ √ > 0 √ 0 =P σ/ n σ/ n x0 − µ0 √ =1 − Φ σ/ n σ =⇒ x0 = µ0 + zα √ n zα is the upper α point of the standard normal: P (Z ≥ zα ) = α, where Z ∼ N (0, 1). z0.05 = 1.645, z0.025 = 1.960 Therefore, the test rejects H0 if X̄ ≥ µ0 + zα √σn . ————— Q: What is β ? What is the power? Can we reduce both α and β when n is fixed? 234 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Uniformly Most Powerful Test Neyman-Pearson Lemma requires that both hypotheses be simple. However, most real-situations require composite hypothesis. If the alternative H1 is composite, a test that is most powerful for every simple alternative in H1 is uniformly most powerful (UMP). 235 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Continuing Example 9.2.A: Instructor: Ping Li Consider testing H0 : µ = µ0 H1 : µ > µ0 For every µ1 > µ0 , the likelihood ratio test rejects H0 if X̄ ≥ x0 , where σ √ x0 = µ0 + zα n does not depend on µ1 . Therefore, this test is most powerful for every µ1 > µ0 and hence it is UMP. 236 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Similarly, the test is UMP for testing (one-sided alternative) H0 : µ < µ0 H1 : µ > µ0 However, the test is not UMP for testing (two-sided alternative) H0 : µ = µ0 H1 : µ 6= µ0 Unfortunately, in typical composite situations, there is no UMP test. 
Instructor: Ping Li 237 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li P-Value Definition: The p-value is the smallest significance level at which the null hypothesis would be rejected. The smaller the p-value, the stronger the evidence against the null hypothesis. In a sense, calculating the p-value is more sensible than specifying (often arbitrarily) the level of significance α. 238 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Confidence Intervals Xn be an i.i.d. sample from a normal distribution having unknown mean µ and known variance σ 2 . Consider testing Example 9.3 A: Let X1 , ... H0 : µ = µ0 HA : µ 6= µ0 Consider a test that rejects H0 : for |X̄ − µ0 | ≥ x0 such that P (|X̄ − µ0 | > x0 |H0 ) = α Solve for x0 : x0 = √σ zα/2 . n 239 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li The test accepts H0 if σ σ √ √ X̄ − zα/2 ≤ µ0 ≤ X̄ + zα/2 n n We say a 100(1 − α)% confidence interval for µ0 is µ0 ∈ Duality: σ σ X̄ − √ zα/2 , X̄ + √ zα/2 n n µ0 lies in the confidence interval for µ if and only if the hypothesis test accepts. This result holds more generally. 240 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Duality of Confidence Intervals and Hypothesis Tests Let θ be a parameter of a family of probability distributions. random variables constituting the data by X. Theorem 9.3.A: Suppose that for every value θ0 θ ∈ Θ. Denote the ∈ Θ there is a test at level α of the hypothesis: H0 : θ = θ0 . Denote that acceptance region of the test by A(θ0 ). Then the set C(X) = {θ : X ∈ A(θ)} is a 100(1 − α)% confidence region for θ . 241 Cornell University, BTRY 4090 / STSCI 4090 Proof: Spring 2010 Need to show P [θ0 ∈ C(X)|θ = θ0 ] = 1 − α By the definition of C(X), we know P [θ0 ∈ C(X)|θ = θ0 ] = P [X ∈ A(θ0 )|θ = θ0 ] By the definition of level of significance, we know P [X ∈ A(θ0 )|θ = θ0 ] = 1 − α. This completes the proof. Instructor: Ping Li 242 Cornell University, BTRY 4090 / STSCI 4090 Theorem 9.3.B: Spring 2010 Instructor: Ping Li Suppose that C(X) is 100(1 − α)% confidence region for θ ; that is, for every θ0 , P [θ0 ∈ C(X)|θ = θ0 ] = 1 − α Then an acceptance region for a test at level α of H0 : θ = θ0 is A(θ0 ) = {X|θ0 ∈ C(X)} Proof: P [X ∈ A(θ0 )|θ = θ0 ] = P [θ0 ∈ C(X)|θ = θ0 ] = 1 − α 243 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Generalized Likelihood Ratio Test Likelihood ratio test: A simple hypothesis versus a simple hypothesis. Optimal. Very limited use. Generalized likelihood ratio test: Composite hypotheses. Sub-optimal and widely-used. Play the same role as MLE in parameter estimation. 244 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Assume a sample X1 , ... ,Xn from a distribution with unknown parameter θ . H0 : θ ∈ ω0 HA : θ ∈ ω1 Let Ω = ω0 ∪ ω1 . The test statistic max Λ= θ∈ω0 lik(θ) max lik(θ) θ∈Ω Reject H0 if Λ ≤ λ0 , such that P (Λ ≤ λ0 |H0 ) = α 245 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Let X1 , ..., Xn be i.i.d. and Example 9.4.A: Testing a Normal Mean normally distributed with mean µ and known variance σ 2 . Test H0 : µ = µ0 HA : µ 6= µ0 In other words, ω0 = {µ0 }, Ω = {−∞ < µ < ∞}. 
max θ∈ω0 − 2σ12 lik(µ) = √ n e 2πσ max lik(µ) θ∈Ω 1 1 − 2σ12 = √ n e 2πσ Pn i=1 (Xi −µ0 ) Pn i=1 (Xi −X̄) 2 2 246 Cornell University, BTRY 4090 / STSCI 4090 max lik(θ) Spring 2010 Instructor: Ping Li " n #) n X 1 X θ∈ω0 2 (Xi − µ0 ) − (Xi − X̄)2 Λ= = exp − 2 max lik(θ) 2σ i=1 i=1 θ∈Ω ( " n #) 1 X = exp − 2 (X̄ − µ0 )(2Xi − µ0 − X̄) 2σ i=1 1 = exp − 2 n(X̄ − µ0 )2 2σ ( (X̄ − µ0 )2 −2 log Λ = σ 2 /n Because under H0 , ∼ N (µ0 , σ 2 /n), we know, under H0 , −2 log Λ|H0 ∼ χ21 247 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li The test rejects H0 (X̄ − µ0 )2 2 > χ 1,α σ 2 /n χ21,0.05 = 3.841. Equivalently, the test rejects H0 if X̄ − µ0 ≥ zα/2 √σ n —————– In this case, we know the sample null distribution exactly. When the sample distribution is unknown (or not in a convenient form), we resort to the approximation by central limit theorem. 248 Cornell University, BTRY 4090 / STSCI 4090 Theorem 9.4.A: Spring 2010 Instructor: Ping Li Under some smoothness conditions on the probability density of mass functions, the null distribution of −2 log Λ tends to a chi-square distribution with degrees of freedom equal to dimΩ− dimω0 , as the sample size tends to infinity. dimΩ = number of free parameters under Ω dimω0 = number of free parameters under ω0 . In Example 9.4.A, the null hypothesis specifies µ and σ 2 and hence there are no free parameters under H0 , i.e., dimω0 = 0. Under Ω, σ 2 is known (fixed) but µ is free, so dimΩ = 1. 249 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Generalized Likelihood Ratio Tests for Multinomial Distribution Goodness of fit: Assume the multinomial probabilities pi are specified by H0 : p = p(θ), θ ∈ ω0 where θ is a (vector of) parameter(s) to be estimated. We need to know whether the model p(θ) is good or not, according to the observed data (cell counts). We also need an alternative hypothesis. A common choice of Ω would be Ω = {pi , i = 1, 2, ..., m|pi ≥ 0, m X i=1 pi = 1} 250 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 max Λ= p∈ω0 Instructor: Ping Li lik(p) max lik(p) p∈Ω n x1 xm p ( θ̂) ...p ( θ̂) 1 m x ,x ,...,xm x1 xm = 1 2 n x1 ,x2 ,...,xm p̂1 ...p̂m !xi m Y pi (θ̂) = i=1 θ̂ : the MLE under ω0 Λ= m Y i=1 pi (θ̂) p̂i !np̂i p̂i p̂i = , xi n : the MLE under Ω. −2 log Λ = −2n m X i=1 p̂i log pi (θ̂) p̂i ! 251 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li −2 log Λ = − 2n =2 m X m X p̂i log i=1 np̂i log i=1 m X ! pi (θ̂) p̂i ! np̂i npi (θ̂) Oi =2 Oi log Ei i=1 Oi = np̂i = xi : the observed counts, Ei = npi (θ̂) : the expected counts −2 log Λ is asymptotically χ2s . The degrees of freedom s = dimΩ − dimω0 = (m − 1) − k . k = length of the vector θ = number of parameters in the model. 252 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li G2 Test Versus X 2 Test Generalized likelihood ratio test G2 = −2 log Λ =2 m X np̂i np̂i log npi (θ̂) i=1 ! =2 m X Oi log i=1 Pearson’s Chi-square test X2 = h m xi − npi (θ̂) X i=1 npi (θ̂) i2 G2 and X 2 are asymptotically equivalent. = m 2 X [Oi − Ei ] i=1 Ei Oi Ei 253 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 By Taylor expansions, about x x log Instructor: Ping Li ≈ x0 , x x − x0 + x0 x − x0 =x log = x log 1 + x0 x0 x0 2 x − x0 (x − x0 ) =x − + ... 2 x0 2x0 2 x − x0 (x − x0 ) = (x − x0 + x0 ) − + ... 2 x0 2x0 (x − x0 )2 =(x − x0 ) + + ... 2x0 254 Cornell University, BTRY 4090 / STSCI 4090 Spring 2010 Instructor: Ping Li Under H0 , we expect np̂i = xi ≈ npi (θ̂). Thus ! 
Example 9.5.A: The Hardy-Weinberg equilibrium model assumes the cell probabilities are

  (1 − θ)², 2θ(1 − θ), θ².

The observed counts are 342, 500, and 187, respectively (total n = 1029). Using MLE, we estimate

  θ̂ = (2x3 + x2)/(2n) = 0.4246842.

The expected (estimated) counts are 340.6, 502.8, and 185.6, respectively.

  G² = 0.032499,  X² = 0.0325041  (slightly different numbers in the book).

Both G² and X² are asymptotically χ²_s, where s = (m − 1) − k = (3 − 1) − 1 = 1.

p-values: for G², p-value = 0.85694; for X², p-value = 0.85682. Very large p-values indicate that we should not reject H0; in other words, the model is very good. If we did want to reject H0, we would have to use a significance level α ≥ 0.86.

The Poisson Dispersion Test

Assume X ∼ Poi(λ); then E(X) = λ and Var(X) = λ. However, for many real data sets, the variance may considerably exceed the mean. Over-dispersion is often caused by subject heterogeneity, which may require a more flexible model to explain the data. Given counts x1, ..., xn, consider

  ω0: xi ∼ Poi(λ), i = 1, 2, ..., n;  Ω: xi ∼ Poi(λi), i = 1, 2, ..., n.

Under ω0, the MLE is λ̂ = x̄ (under Ω, λ̂i = xi).

  Λ = max_{λ∈ω0} lik(λ) / max_{λi∈Ω} lik(λi)
    = Π λ̂^{xi} e^{−λ̂}/xi! / Π λ̂i^{xi} e^{−λ̂i}/xi!
    = Π (x̄/xi)^{xi} e^{xi − x̄}

  −2 log Λ = 2 Σ_{i=1}^n xi log(xi/x̄) ∼ χ²_{n−1}  (asymptotically).

Tests for Normality

If X ∼ N(µ, σ²), then:
• The density function is symmetric about µ, with coefficient of skewness b1 = E(X − µ)³/σ³ = 0.
• The coefficient of kurtosis is b2 = E(X − µ)⁴/σ⁴ = 3.

These provide two simple tests for normality (among many tests):
• Reject if the empirical coefficient of skewness |b̂1| is large, where b̂1 = [(1/n) Σ (Xi − X̄)³] / [(1/n) Σ (Xi − X̄)²]^{3/2}.
• Reject if the empirical coefficient of kurtosis |b̂2 − 3| is large, where b̂2 = [(1/n) Σ (Xi − X̄)⁴] / [(1/n) Σ (Xi − X̄)²]².

Difficulty: The distributions of b̂1 and b̂2 have no closed forms, and one must resort to a numerical procedure, as in the sketch below.
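For instance, the null distribution of b̂1 can be simulated to obtain a critical value. A sketch (not from the slides; it assumes the Statistics Toolbox function quantile, and n = 50 with 10^4 replications are arbitrary choices):

n = 50; T = 10^4; b1 = zeros(T,1);
for t = 1:T
    X = randn(n,1);                          % a sample under H0: normality
    Z = X - mean(X);
    b1(t) = mean(Z.^3) / mean(Z.^2)^(3/2);   % empirical skewness
end
crit = quantile(abs(b1), 0.95)    % reject normality if |b1_hat| exceeds this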
Chapter 11: Comparing Two Samples

• Comparing two independent samples. For example, a sample X1, ..., Xn is drawn from N(µX, σ²), and an independent sample Y1, ..., Ym is drawn from N(µY, σ²):

  H0: µX = µY,  HA: µX ≠ µY.

• Comparing paired samples. For example, we observe pairs (Xi, Yi), i = 1 to n, and would like to test the difference between X and Y. Pairing causes the samples to be dependent, i.e., Cov(Xi, Yi) = σXY.

Section 11.2: Comparing Two Independent Samples

Example: In a medical study, a sample of subjects may be assigned to a particular treatment, and another independent sample may be assigned to a control treatment.
• Methods based on the normal distribution
• The analysis of power

Methods Based on the Normal Distribution

A sample X1, ..., Xn is drawn from N(µX, σ²); an independent sample Y1, ..., Ym is drawn from N(µY, σ²). The goal is to study the difference µX − µY from the observations. By the independence assumption,

  X̄ − Ȳ ∼ N(µX − µY, σ²(1/n + 1/m)).

Two scenarios: σ² is known; σ² is unknown.

Two Independent Normal Samples with Known Variance

Assume σ² is known. Then

  Z = [(X̄ − Ȳ) − (µX − µY)] / (σ √(1/n + 1/m)) ∼ N(0, 1),

and the 100(1 − α)% confidence interval of µX − µY is

  (X̄ − Ȳ) ± zα/2 σ √(1/n + 1/m).

However, σ² in general must be estimated from the data.

Two Independent Normal Samples with Unknown Variance

The pooled sample variance

  s²p = [(n − 1)s²X + (m − 1)s²Y] / (m + n − 2)

is an estimate of the common variance σ², where

  s²X = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²,  s²Y = (1/(m − 1)) Σ_{i=1}^m (Yi − Ȳ)²

are the sample variances of the X's and Y's. s²p is a weighted average of s²X and s²Y.

Theorem 11.2.A: The test statistic

  t = [(X̄ − Ȳ) − (µX − µY)] / (sp √(1/n + 1/m)) ∼ t_{m+n−2},

a t distribution with m + n − 2 degrees of freedom.

Proof: Recall from Chapter 6 that if V ∼ χ²_n, U ∼ N(0, 1), and U and V are independent, then U/√(V/n) ∼ t_n. Here

  s²p(m + n − 2)/σ² = [(n − 1)s²X + (m − 1)s²Y]/σ² ∼ χ²_{m+n−2},

  U = [(X̄ − Ȳ) − (µX − µY)] / (σ √(1/n + 1/m)) ∼ N(0, 1).

Then

  U/√(s²p/σ²) = [(X̄ − Ȳ) − (µX − µY)] / (sp √(1/n + 1/m)) ∼ t_{m+n−2}.

Three Types of Hypothesis Testing

The null hypothesis: H0: µX = µY. Three common alternative hypotheses:

  H1: µX ≠ µY  (two-sided);  H2: µX > µY;  H3: µX < µY  (one-sided).

Using the test statistic t = (X̄ − Ȳ)/(sp √(1/n + 1/m)), the rejection regions are:

  For H1: |t| > t_{n+m−2, α/2};  For H2: t > t_{n+m−2, α};  For H3: t < −t_{n+m−2, α}.

Pay attention to the p-value calculation for H1 (double the one-sided tail); a worked computation is sketched below.
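A minimal Matlab sketch (not from the slides) of the pooled two-sample t test; the simulated data and the 0.8 mean shift are arbitrary illustrative choices:

X = randn(25,1) + 0.8;  Y = randn(30,1);         % illustrative two samples
n = length(X); m = length(Y);
sp2 = ((n-1)*var(X) + (m-1)*var(Y)) / (n+m-2);   % pooled variance
t = (mean(X) - mean(Y)) / sqrt(sp2*(1/n + 1/m));
pval = 2 * (1 - tcdf(abs(t), n+m-2))             % two-sided p-value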
The Equivalence between the t-test and the Likelihood Ratio Test

H0: µX = µY, H1: µX ≠ µY. Three parameters: θ = (µX, µY, σ²).

  Λ = max_{θ∈ω0} lik(µX, µY, σ²) / max_{θ∈Ω} lik(µX, µY, σ²).

We can show that rejecting for small Λ (i.e., rejecting for large −2 log Λ) is equivalent to rejecting for large |t| = |X̄ − Ȳ|/(sp √(1/n + 1/m)). Here

  ω0 = {µX = µY = µ0, 0 < σ = σ0 < ∞},  Ω = {−∞ < µX, µY < ∞, 0 < σ < ∞},

  lik(µX, µY, σ²) = Π_{i=1}^n (1/(√(2π)σ)) exp[−(Xi − µX)²/(2σ²)] · Π_{i=1}^m (1/(√(2π)σ)) exp[−(Yi − µY)²/(2σ²)],

  l(µX, µY, σ²) = −((m+n)/2) log 2π − ((m+n)/2) log σ²
                  − (1/(2σ²)) [Σ_{i=1}^n (Xi − µX)² + Σ_{i=1}^m (Yi − µY)²].

Under ω0 = {µX = µY = µ0, 0 < σ = σ0 < ∞},

  l(µ0, σ0²) = −((m+n)/2) log 2π − ((m+n)/2) log σ0² − (1/(2σ0²)) [Σ (Xi − µ0)² + Σ (Yi − µ0)²].

In fact, since Xi ∼ N(µ0, σ0²) and Yi ∼ N(µ0, σ0²), with Xi and Yi independent, we have m + n samples from N(µ0, σ0²). Therefore, the MLEs are

  µ̂0 = (1/(m+n)) [Σ Xi + Σ Yi] = (n/(m+n)) X̄ + (m/(m+n)) Ȳ,
  σ̂0² = (1/(m+n)) [Σ (Xi − µ̂0)² + Σ (Yi − µ̂0)²],

and thus, under the null ω0,

  l(µ̂0, σ̂0²) = −((m+n)/2) log 2π − ((m+n)/2) log σ̂0² − (m+n)/2.

Under Ω = {−∞ < µX, µY < ∞, 0 < σ < ∞}, we can show

  µ̂X = X̄,  µ̂Y = Ȳ,  σ̂² = (1/(m+n)) [Σ (Xi − µ̂X)² + Σ (Yi − µ̂Y)²],
  l(µ̂X, µ̂Y, σ̂²) = −((m+n)/2) log 2π − ((m+n)/2) log σ̂² − (m+n)/2.

The negative log likelihood ratio is

  −log Λ = l(µ̂X, µ̂Y, σ̂²) − l(µ̂0, σ̂0²) = ((m+n)/2) log(σ̂0²/σ̂²).

Therefore, the test rejects for large values of σ̂0²/σ̂²:

  σ̂0²/σ̂² = [Σ (Xi − µ̂0)² + Σ (Yi − µ̂0)²] / [Σ (Xi − X̄)² + Σ (Yi − Ȳ)²]
          = 1 + (mn/(m+n)) (X̄ − Ȳ)² / [Σ (Xi − X̄)² + Σ (Yi − Ȳ)²].

Equivalently, the test rejects for large values of

  |X̄ − Ȳ| / √(Σ (Xi − X̄)² + Σ (Yi − Ȳ)²),

which is (up to constants) the t statistic.

Power Analysis of the Two-Sample t Test

Recall power = 1 − Type II error = P(reject H0 | HA). To compute the power, we must specify a simple alternative hypothesis. We consider

  H0: µX − µY = 0,  H1: µX − µY = ∆.

For simplicity, we assume σ² is known and n = m. The t test rejects if |X̄ − Ȳ| > zα/2 σ √(2/n).

  power = P(|X̄ − Ȳ| > zα/2 σ √(2/n) | H1)
        = P(X̄ − Ȳ > zα/2 σ √(2/n) | H1) + P(X̄ − Ȳ < −zα/2 σ √(2/n) | H1).

Note that X̄ − Ȳ | H1 ∼ N(µX − µY = ∆, 2σ²/n). Therefore,

  P(X̄ − Ȳ > zα/2 σ √(2/n) | H1) = P((X̄ − Ȳ − ∆)/(σ√(2/n)) > zα/2 − ∆/(σ√(2/n)))
                                  = 1 − Φ(zα/2 − (∆/σ)√(n/2)).

Therefore, the power can be computed from

  power = 1 − Φ(zα/2 − ∆′) + Φ(−zα/2 − ∆′),  where ∆′ = (∆/σ)√(n/2).

Three parameters, α, ∆, and n, affect the power:
• Larger α ⟹ smaller zα/2 ⟹ larger power.
• Larger |∆′| ⟹ larger power.
• Larger |∆| ⟹ larger power.
• Larger n ⟹ larger power.
• Smaller σ ⟹ larger power.

What is the relation between α and power if ∆ = 0? (See the sketch below.)
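A sketch (not from the slides) of the power formula, also solving for the per-group sample size needed to reach 80% power; the values of α, σ, and ∆ are illustrative choices:

alpha = 0.05; sigma = 1; Delta = 0.5;      % illustrative values
z = norminv(1 - alpha/2);
n = 10:200;                                % candidate per-group sizes
Dp = (Delta/sigma) * sqrt(n/2);
power = 1 - normcdf(z - Dp) + normcdf(-z - Dp);
n_needed = n(find(power >= 0.8, 1))        % smallest n with power >= 80%

Setting Delta = 0 reduces the formula to power = 1 − Φ(zα/2) + Φ(−zα/2) = α, which answers the question above.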
Section 11.3: Comparing Paired Samples

In many cases, the samples are paired (and dependent), for example, measurements before and after a medical treatment. Consider

  (Xi, Yi), i = 1, 2, ..., n;  (Xi, Yi) independent of (Xj, Yj) if i ≠ j;
  E(Xi) = µX, Var(Xi) = σX²;  E(Yi) = µY, Var(Yi) = σY².

Let Di = Xi − Yi and D̄ = (1/n) Σ_{i=1}^n Di. Then

  E(D̄) = µX − µY,  Var(D̄) = (1/n) (σX² + σY² − 2ρσXσY).

Therefore, D̄ is still an unbiased estimator of µX − µY, but it has smaller variance if there is positive correlation (ρ > 0).

Paired Test Based on the Normal Distribution

This method assumes that Di = Xi − Yi is i.i.d. normal with E(Di) = µD and Var(Di) = σD². In general, σD² needs to be estimated from the data. Consider a two-sided test,

  H0: µD = 0,  HA: µD ≠ 0.

A t-test rejects for large values of |t|, where t = (D̄ − µD)/sD̄. The rejection region is |D̄| > t_{n−1, α/2} sD̄.

Example 11.3.1.A: Effect of cigarette smoking on platelet aggregation.

  Before (X):      25  25  27  44  30  67  53  53  52  60  28
  After (Y):       27  29  37  56  46  82  57  80  61  59  43
  Difference (D):   2   4  10  12  16  15   4  27   9  −1  15

  D̄ = 10.272,  sD̄ = √(63.6182/11) = 2.405,  ρ = 0.8938,
  t = D̄/sD̄ = 10.272/2.405 = 4.271.

Suppose α = 0.01. Then tα/2, n−1 = t0.005, 10 = 3.169 < t. Therefore, the test rejects H0 at significance level α = 0.01. Alternatively, we say the p-value is smaller than 0.01.
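A sketch (not from the slides) reproducing the paired test above from the table:

X = [25 25 27 44 30 67 53 53 52 60 28]';    % before
Y = [27 29 37 56 46 82 57 80 61 59 43]';    % after
D = Y - X;                                  % the tabulated differences
n = length(D);
t = mean(D) / sqrt(var(D)/n)                % about 4.27, as in the example
pval = 2 * (1 - tcdf(abs(t), n-1))          % below 0.01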
A Heuristic Explanation of the GLRT

Why, under H0, does the test statistic

  Λ = max_{θ∈ω0} lik(θ) / max_{θ∈Ω} lik(θ)

satisfy −2 log Λ → χ²_s as n → ∞? The heuristic argument:
• considers only s = 1;
• utilizes a Taylor expansion;
• uses the fact that the MLE is asymptotically normal.

Since s = 1, we consider H0: θ = θ0. Let l(θ) = log lik(θ) and let θ̂ be the MLE of θ ∈ Ω.

  −2 log Λ = −2 [l(θ0) − l(θ̂)].

Applying a Taylor expansion,

  l(θ0) = l(θ̂) + (θ0 − θ̂) l′(θ̂) + ((θ0 − θ̂)²/2) l″(θ̂) + ...

Because θ̂ is the MLE, we know l′(θ̂) = 0. Therefore,

  −2 log Λ = −2 [l(θ0) − l(θ̂)] = −l″(θ̂)(θ0 − θ̂)² + ...

The MLE is asymptotically normal; i.e., as n → ∞,

  (θ̂ − θ0)/√(1/(nI(θ))) = (θ̂ − θ0) √(nI(θ)) → N(0, 1).

Because nI(θ) = −E(l″(θ)), we can (heuristically) write, as n → ∞,

  −2 log Λ = −l″(θ̂)(θ0 − θ̂)² ≈ [(θ̂ − θ0) √(nI(θ))]² → χ²₁.

Chapter 14: Linear Least Squares

Materials:
• The basic procedure: Observe (xi, yi). Assume y = β0 + β1x. Estimate β0, β1 by minimizing Σ (yi − β0 − β1xi)².
• Statistical analysis of least squares estimates: Assume y = β0 + β1x + e, e ∼ N(0, σ²), and x is constant. What are the statistical properties of β0 and β1, which are estimated by the least squares procedure?
• Matrix approach to multiple least squares.
• Conditional expectation and the best linear estimator, for a better understanding of the basic procedure. If X and Y are jointly normal, then linear regression is the best under MSE.

Linear Least Squares: The Basic Procedure

The basic procedure is to fit a straight line to a plot of points (xi, yi),

  y = β0 + β1x,

by minimizing

  L(β0, β1) = Σ_{i=1}^n (yi − β0 − β1xi)²,

i.e., solving for β0 and β1 from ∂L(β0, β1)/∂β0 = 0 and ∂L(β0, β1)/∂β1 = 0.

Taking the first derivatives,

  ∂L(β0, β1)/∂β0 = Σ_{i=1}^n 2(yi − β0 − β1xi)(−1),
  ∂L(β0, β1)/∂β1 = Σ_{i=1}^n 2(yi − β0 − β1xi)(−xi).

Setting them to zero ⟹

  β̂0 = ȳ − x̄β̂1,
  β̂1 = [Σ xiyi − ȳ Σ xi] / [Σ xi² − x̄ Σ xi].

Statistical Properties of β̂0 and β̂1

Model: yi = β0 + β1xi + ei, i = 1, 2, ..., n; ei ∼ N(0, σ²), i.i.d.; the xi's are constants. The randomness of the yi's is due to the ei. The coefficients β0 and β1 are estimated by least squares. Q: Under this model, what are E(β̂0), Var(β̂0), E(β̂1), Var(β̂1), etc.?

According to the model,

  E(yi) = β0 + β1xi,  E(ȳ) = β0 + β1x̄,  Var(yi) = σ²,  Cov(yi, yj) = 0 if i ≠ j.

Therefore,

  E(β̂0) = E(ȳ − x̄β̂1) = β0 + β1x̄ − x̄E(β̂1),

i.e., E(β̂0) = β0 iff E(β̂1) = β1.

  E(β̂1) = [Σ xi E(yi) − E(ȳ) Σ xi] / [Σ xi² − x̄ Σ xi]
         = [Σ xi(β0 + β1xi) − (β0 + β1x̄) Σ xi] / [Σ xi² − x̄ Σ xi]
         = β1 [Σ xi² − x̄ Σ xi] / [Σ xi² − x̄ Σ xi]
         = β1.

Theorem 14.2.A: Unbiasedness: E(β̂0) = β0, E(β̂1) = β1.

Another way to express β̂1:

  β̂1 = [Σ xiyi − ȳ Σ xi] / [Σ xi² − x̄ Σ xi]
      = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
      = Σ (xi − x̄)yi / Σ (xi − x̄)².

Note that Σ (xi − x̄) = 0 and Σ (yi − ȳ) = 0.

Theorem 14.2.B:

  Var(β̂1) = Σ (xi − x̄)² Var(yi) / [Σ (xi − x̄)²]² = σ² / Σ (xi − x̄)².

Exercises:

  Var(β̂0) = σ² Σ xi² / (n Σ (xi − x̄)²),  Cov(β̂0, β̂1) = −σ² x̄ / Σ (xi − x̄)².

Residual Sum of Squares (RSS)

Definition:

  RSS = Σ_{i=1}^n (yi − β̂0 − β̂1xi)².

We can show E(RSS) = (n − 2)σ². In other words,

  s² = RSS/(n − 2)

is an unbiased estimator of σ².
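A sketch (not from the slides) of the basic fit and the resulting t statistic; the simulated data are an arbitrary illustration:

x = (1:20)';  y = 2 + 0.5*x + randn(20,1);    % illustrative data
n = length(x);
b1 = sum((x-mean(x)).*(y-mean(y))) / sum((x-mean(x)).^2);
b0 = mean(y) - mean(x)*b1;
RSS = sum((y - b0 - b1*x).^2);
s2 = RSS/(n-2);                               % unbiased estimate of sigma^2
se_b1 = sqrt(s2 / sum((x-mean(x)).^2));       % standard error of b1
t = b1/se_b1                                  % test statistic for H0: beta1 = 0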
  E(RSS) = E[Σ (yi − β̂0 − β̂1xi)²]
         = E[Σ (β0 + β1xi + ei − β̂0 − β̂1xi)²]
         = E[Σ (β0 − β̂0 + (β1 − β̂1)xi + ei)²]
         = nVar(β̂0) + Var(β̂1) Σ xi² + nσ² + 2Cov(β̂0, β̂1) Σ xi
           + 2E[Σ ei (β0 − β̂0 + (β1 − β̂1)xi)]
         = (n + 2)σ² + 2E[Σ ei (β0 − β̂0 + (β1 − β̂1)xi)].

For the remaining term, substitute β̂0 = ȳ − x̄β̂1 and ȳ = β0 + β1x̄ + ē:

  E[Σ ei (β0 − β̂0 + (β1 − β̂1)xi)]
  = E[Σ ei (β0 − ȳ + x̄β̂1 + (β1 − β̂1)xi)]
  = E[Σ ei (β0 − β0 − x̄β1 − ē + x̄β̂1 + (β1 − β̂1)xi)]
  = E[Σ ei (−x̄β1 + x̄β̂1 + (β1 − β̂1)xi)] − σ²
  = E[β̂1 Σ ei (x̄ − xi)] − σ²,

and

  E[β̂1 Σ ei (x̄ − xi)] = E[(Σ (xi − x̄)yi / Σ (xi − x̄)²) Σ ei (x̄ − xi)]
  = E[Σ (xi − x̄)(β0 + β1xi + ei) · Σ (x̄ − xi)ei] / Σ (xi − x̄)²
  = E[Σ (xi − x̄)(x̄ − xi) ei²] / Σ (xi − x̄)²
  = −σ².

Therefore,

  E(RSS) = (n + 2)σ² + 2(−2σ²) = (n − 2)σ².

The Distributions of β̂0 and β̂1

Model: yi = β0 + β1xi + ei, ei ∼ N(0, σ²), so yi ∼ N(β0 + β1xi, σ²):

  β̂1 = Σ ci yi ∼ N(β1, Var(β̂1)),  β̂0 = ȳ − x̄β̂1 ∼ N(β0, Var(β̂0)),
  (n − 2) s²/σ² = RSS/σ² ∼ χ²_{n−2},
  (β̂0 − β0)/sβ̂0 ∼ t_{n−2},  (β̂1 − β1)/sβ̂1 ∼ t_{n−2}.

What if ei is not normal? Central limit theorem and normal approximation.

Hypothesis Testing

Once we know the distributions

  (β̂0 − β0)/sβ̂0 ∼ t_{n−2},  (β̂1 − β1)/sβ̂1 ∼ t_{n−2},

we can conduct hypothesis tests, for example, H0: β1 = 0 vs. HA: β1 ≠ 0.

Multiple Least Squares

Model: yi = β0 + β1xi,1 + ... + βp−1 xi,p−1 + ei, ei ∼ N(0, σ²) i.i.d. Observations: (xi, yi), i = 1 to n. Multiple least squares: estimate the βj by minimizing

  L(βj, j = 0, 1, ..., p − 1) = Σ_{i=1}^n (yi − β0 − Σ_{j=1}^{p−1} xi,j βj)².

Matrix Approach to Linear Least Squares

  X = [ 1  x1,1  x1,2  ...  x1,p−1
        1  x2,1  x2,2  ...  x2,p−1
        ...
        1  xn,1  xn,2  ...  xn,p−1 ],   β = (β0, β1, β2, ..., βp−1)ᵀ,

  L(β) = Σ_{i=1}^n (yi − β0 − Σ_{j=1}^{p−1} xi,j βj)² = ‖Y − Xβ‖².

Matrix/vector derivative:

  ∂L(β)/∂β = 2(−Xᵀ)(Y − Xβ) = −2(XᵀY − XᵀXβ) = 0
  ⟹ XᵀXβ = XᵀY  ⟹  β̂ = (XᵀX)⁻¹XᵀY.
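A sketch (not from the slides) of the matrix solution; the design and coefficients are arbitrary illustrative choices:

n = 100; p = 3;
X = [ones(n,1), randn(n,p-1)];     % design matrix with an intercept column
beta = [1; 2; -0.5];               % illustrative true coefficients
Y = X*beta + randn(n,1);
beta_hat = (X'*X) \ (X'*Y)         % normal equations
beta_hat2 = X \ Y;                 % Matlab's backslash gives the same answer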
Statistical Properties of β̂

Model: Y = Xβ + e, with the components of e i.i.d. N(0, σ²).

Unbiasedness (Theorem 14.4.2.A):

  E(β̂) = E[(XᵀX)⁻¹XᵀY] = E[(XᵀX)⁻¹Xᵀ(Xβ + e)]
        = E[(XᵀX)⁻¹XᵀX β] + E[(XᵀX)⁻¹Xᵀe] = β.

Covariance matrix of β̂ (Theorem 14.4.2.B):

  Var(β̂) = Var[(XᵀX)⁻¹XᵀXβ + (XᵀX)⁻¹Xᵀe] = Var[(XᵀX)⁻¹Xᵀe]
          = (XᵀX)⁻¹Xᵀ Var(e) [(XᵀX)⁻¹Xᵀ]ᵀ = (XᵀX)⁻¹Xᵀ Var(e) X(XᵀX)⁻¹
          = σ² (XᵀX)⁻¹XᵀX(XᵀX)⁻¹ = σ² (XᵀX)⁻¹.

Note that Var(e) is a diagonal matrix, σ² In×n.

Theorem 14.4.3.A: An unbiased estimator of σ² is s², where

  s² = ‖Y − Ŷ‖²/(n − p).

Lemma 14.4.3.A:

  Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = PY,  P² = P = Pᵀ,  (I − P)² = I − P = (I − P)ᵀ.

Proof of Lemma 14.4.3.A:

  P² = X(XᵀX)⁻¹XᵀX(XᵀX)⁻¹Xᵀ = X(XᵀX)⁻¹Xᵀ = P.

Therefore,

  ‖Y − Ŷ‖² = ‖(I − P)Y‖² = Yᵀ(I − P)ᵀ(I − P)Y = Yᵀ(I − P)Y,

and

  E[Yᵀ(I − P)Y] = E[(βᵀXᵀ + eᵀ)(I − P)(Xβ + e)]
                = βᵀXᵀ(I − P)Xβ + E[eᵀ(I − P)e]
                = E[eᵀ(I − P)e] = nσ² − E(eᵀPe),

because

  Xᵀ(I − P)X = XᵀX − XᵀX(XᵀX)⁻¹XᵀX = 0.

Also,

  E(eᵀPe) = E[Σ_i Σ_j ei Pij ej] = E[Σ_j Pjj ej²] = σ² Σ_j Pjj = pσ²,

where we skip the proof of the very last step (that Σ_j Pjj = trace(P) = p). Combining the results, we obtain

  E‖Y − Ŷ‖² = (n − p)σ².

Properties of Residuals

Residuals: ê = Y − Ŷ = (I − P)Y. Covariance matrix of the residuals:

  Var(ê) = (I − P) Var(Y) (I − P)ᵀ = (I − P) σ²I (I − P) = σ²(I − P)
  ⟹ the residuals are correlated.

Theorem 14.4.A: The residuals are uncorrelated with the fitted values.

Proof:

  E(êᵀŶ) = E[Yᵀ(I − P)PY] = E[Yᵀ(P − P²)Y] = E[Yᵀ(P − P)Y] = 0.

Inference about β

  Var(β̂) = σ² (XᵀX)⁻¹ = σ² C.

Using s² to estimate σ², we obtain the distribution

  (β̂j − βj)/sβ̂j ∼ t_{n−p},  where sβ̂j = s √cjj,

which allows us to conduct hypothesis tests on the significance of the fit.
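Continuing the matrix sketch above (a sketch, not from the slides; it reuses X, Y, beta_hat, n, and p from that example):

Y_hat = X*beta_hat;
s2 = sum((Y - Y_hat).^2) / (n - p);    % unbiased estimate of sigma^2
C = inv(X'*X);
se = sqrt(s2 * diag(C));               % s * sqrt(c_jj)
t = beta_hat ./ se;                    % t_{n-p} statistics for H0: beta_j = 0
pvals = 2 * (1 - tcdf(abs(t), n - p))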