Department of Statistics
Notes: Mathematical Statistics (WST)
Copyright reserved

CONTENTS

CHAPTER 7: RANDOM VARIABLES AND THEIR DISTRIBUTIONS
7.1 INTRODUCTION 196
7.2 SEQUENCES OF RANDOM VARIABLES 196
7.3 THE CENTRAL LIMIT THEOREM 203
7.4 APPROXIMATIONS FOR THE BINOMIAL DISTRIBUTION 206
7.5 ASYMPTOTIC NORMAL DISTRIBUTIONS 208
7.6 STOCHASTIC CONVERGENCE 208
7.7 ADDITIONAL LIMIT THEOREMS 211
7.8 ASYMPTOTIC DISTRIBUTIONS OF EXTREME ORDER STATISTICS 214

CHAPTER 8: STATISTICS AND SAMPLING DISTRIBUTIONS
8.1 INTRODUCTION 215
8.2 STATISTICS 215
8.3 SAMPLING DISTRIBUTIONS 218
8.4 THE t, F AND BETA DISTRIBUTIONS 226
8.5 LARGE SAMPLE APPROXIMATIONS 238

CHAPTER 9: POINT ESTIMATION
9.1 INTRODUCTION 239
9.2 SOME METHODS OF ESTIMATION 240
9.3 CRITERIA FOR EVALUATING ESTIMATORS 253
9.4 LARGE SAMPLE PROPERTIES 267
9.5 BAYES AND MINIMAX ESTIMATORS 272

CHAPTER 10: SUFFICIENCY AND COMPLETENESS
10.1 INTRODUCTION 276
10.2 SUFFICIENT STATISTICS 277
10.3 FURTHER PROPERTIES OF SUFFICIENT STATISTICS 284
10.4 COMPLETENESS AND THE EXPONENTIAL CLASS 287

CHAPTER 11: INTERVAL ESTIMATION
11.1 INTRODUCTION 293
11.2 CONFIDENCE INTERVALS 294
11.3 PIVOTAL QUANTITY METHOD 297
11.4 GENERAL METHOD 302
11.5 CONFIDENCE INTERVALS FOR SOME STANDARD PROBLEMS 309

CHAPTER 12: TESTS OF HYPOTHESES
12.1 INTRODUCTION 317
12.2 P-VALUES AND THE USE OF CONFIDENCE INTERVALS FOR TESTING HYPOTHESES 331
12.3 TESTS OF HYPOTHESES FOR SOME STANDARD PROBLEMS 333

CHAPTER 7 : LIMITING DISTRIBUTIONS

7.1 INTRODUCTION

In many cases the distribution of a function of random variables is very difficult to derive exactly, or may not be very useful even if it is known exactly. For large samples there are many instances where a very good approximate distribution can be found and used very easily.

7.2 SEQUENCES OF RANDOM VARIABLES

Definition 7.2.1 : Suppose that $Y_1, Y_2, Y_3, \ldots$ is a sequence of random variables with corresponding distribution functions $G_1, G_2, G_3, \ldots$, i.e. $P[Y_n \le y] = G_n(y)$ for all values of $n$ and for all values of $y$. If $Y$ is a random variable with distribution function $G$ and $\lim_{n \to \infty} G_n(y) = G(y)$ for all values of $y$ where $G$ is continuous, we say that $Y_1, Y_2, Y_3, \ldots$ converges in distribution to $Y$ and use the notation $Y_n \xrightarrow{d} Y$. The distribution function $G$ is then called the limiting distribution of $Y_n$.

EXAMPLE 7.2.1 : Suppose that $X_1, X_2, X_3, \ldots, X_n$ is a random sample from a population which is uniformly distributed over the interval $(0,1)$. Let $Y_{n:n} = \max\{X_1, X_2, X_3, \ldots, X_n\}$. Then the distribution function of $Y_{n:n}$ is given by (th.6.5.3a, and since $F(y) = y$ for $0 < y < 1$)

$$G_n(y) = [F(y)]^n = \begin{cases} 0 & \text{for } y \le 0 \\ y^n & \text{for } 0 < y < 1 \\ 1 & \text{for } y \ge 1. \end{cases}$$

Therefore

$$\lim_{n \to \infty} G_n(y) = \begin{cases} 0 & \text{for } y < 1 \\ 1 & \text{for } y \ge 1. \end{cases}$$

Hence $\lim_{n \to \infty} G_n(y) = G(y)$ is the distribution function of a random variable $Y$ such that $P[Y = 1] = 1$.

Definition 7.2.2 : The distribution function $G(y)$ of the random variable $Y$ is called degenerate at the point $c$ if

$$G(y) = \begin{cases} 0 & \text{for } y < c \\ 1 & \text{for } y \ge c, \end{cases}$$

i.e. if $P[Y = c] = 1$.

EXAMPLES :

Example 7.2.2 : Suppose that $X_1, X_2, X_3, \ldots, X_n$ is a random sample from a population which is exponentially distributed with parameter $\theta$, i.e. the distribution function is

$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - e^{-x/\theta} & \text{for } x \ge 0 \end{cases}$$

for all $X_i$. Let $Y_{1:n} = \min\{X_1, X_2, X_3, \ldots, X_n\}$. Then the distribution function of $Y_{1:n}$ is given by (th.6.5.3a)

$$G_n(y) = 1 - [1 - F(y)]^n = \begin{cases} 0 & \text{for } y < 0 \\ 1 - [1 - (1 - e^{-y/\theta})]^n = 1 - e^{-ny/\theta} & \text{for } y \ge 0. \end{cases}$$

Therefore

$$\lim_{n \to \infty} G_n(y) = \begin{cases} 0 & \text{for } y \le 0 \\ 1 & \text{for } y > 0. \end{cases}$$

Note that $\lim_{n \to \infty} G_n(y)$ is not the distribution function of a random variable since it is not
n continuous from the right at y 0. However, if we make the function continuous from the right at 0, i.e. take 0 for y 0 Gy 1 for y u 0, G y Gy at all points y where G is continuous and therefore we do get then lim nv. n convergence in distribution to G. 198 [Chapter 7] Example 7.2.3 Suppose that Y 1 , Y 2 , Y 3 , T is a sequence of random variables such that Y n is normally distributed with expected value 1n and variance 1. Then G n y P¡Y n t y¢ P Y n " 1n 1 P¡Z t y " oy " y" 1n 1 t 1 n ¢ where Z L N0, 1 1 n and therefore G y oy lim nv. n since o a continuous function. d Hence Y n v Z where Z L N0, 1 . Example 7.2.4 Suppose that Y 1 , Y 2 , Y 3 , T is a sequence of random variables such that Y n is normally 2 distributed with expected value 0 and variance 1n . Then G n y P¡Y n t y¢ P Y n "0 1 n t y"0 1 n P¡Z t ny¢ where Z L N0, 1 ony . It then follows that lim G y nv. n o". if y 0 o0 if y 0 o. if y 0 0 if y 0 0. 5 if y 0 1 if y 0 G y is not a distribution function since it is not In this case we also get that lim nv. n continuous from the right at 0. If we make the function continuous from the right at 0 i.e. take 0 for y 0 Gy 1 for y u 0, G y Gy at all points y where G is continuous and therefore we do get then lim nv. n convergence in distribution to G. 199 [Chapter 7] Definition 7.2.3 : A sequence of random variables is said to converge stochastically to a constant c if the limiting distribution is degenerate at c. EXAMPLES : Example 7.2.5 : Suppose that Y 1 , Y 2 , Y 3 , T is a sequence of normally distributed random variables such that Y n L Nn, 1 . Then G n y P¡Y n t y¢ P Y n "n 1 t y"n 1 P¡Z t y " n¢ oy " n . It then follows that lim G y o". nv. n where Z L N0, 1 for all y 0 for all y In this case there does not exist a distribution function G such that lim G y Gy for all y where G is continuous i.e. the sequence Y 1 , Y 2 , Y 3 , T does nv. n not converge in distribution. Example 7.2.6 Suppose that S n is the sum of all the observations in a random sample from a normal population with expected value 6 and variance @ 2 . Then S n is normally distributed with expected value n6 and variance n@ 2 - from Example 6.4.5. Therefore G n y P¡S n t y¢ P S n " n6 y " n6 t n@ n@ P Zt o y " n6 n@ y " n6 n@ . where Z L N0, 1 200 [Chapter 7] It then follows that o". if 6 0 for all y lim G y nv. n o0 if 6 0 for all y o. if 6 0 for all y In none of these cases does there exist a distribution function G such that G y Gy i.e. there does not exist a limit distribution for S n . lim nv. n S " n6 . Now let Y n n n@ Then S n " n6 G n y P ty n@ P¡Z t y¢ where Z L N0, 1 oy i.e. lim G y oy nv. n d for all y i.e. Y n v Z where Z L N0, 1 . Note that although S n itself does not converge in distribution to any distribution, a linear function of S n namely Y n does converge in distribution. To transform to Y n we subtracted the expected value and divided by the standard deviation i.e. we obtained a random variable with expected value 0 and variance 1. It turns out in many cases that if we standardise random variables in this way that we do obtain a limit distribution. JTheorem 7.2.1a : Suppose that c and b are constants and dn a function of n such that lim dn 0. nv. Then nb c dn 1 e cb . lim n n nv. 201 [Chapter 7] EXAMPLES : Example 7.2.7 Suppose that X 1 , X 2 , T , X n is a random sample from a population with a Pareto distribution with parameters 1 and 1. Then for x 0 1 x "2 f X i x 0 otherwise and x F X i x ; f X i t dt ". 
for x t 0 0 x ;1 t "2 for x 0 dt 0 for x t 0 0 "1 t x "1 0 for x 0 for x t 0 0 1 " 1 x "1 for x 0 Let Y 1:n min£X 1 , X 2 , T , X n ¤. Then F Y 1:n y 1 " ¡1 " F X y ¢ n where F X is the distribution function of all X i ’s 1 " ¡1 " 0¢ n 1 " 1 " 1 " 1 y "1 for y t 0 0 1 " ¡1 y¢ F y i.e. lim nv. Y 1:n i.e. Y 1:n for y t 0 0 "n for y 0 n for y 0 (7.2.3) for all y t 0 1 for all y 0 converge in distribution to a degenerate distribution. 202 [Chapter 7] Now let Z n nY 1:n . Then G n y P¡Z n t y¢ P¡nY 1:n t y¢ P¡Y 1:n t y/n¢ F Y 1:n y/n 0 for y t 0 1 " ¡1 y/n¢ "n for y 0 for y t 0 0 i.e. lim G y nv. n from (7.2.3) 1 " e "y for y 0 from th.7. 2. 1a i.e. Z n converges in distribution to an exponential distribution with parameter 1. Example 7.2.8 Suppose that X 1 , X 2 , T , X n is a random sample from a population with an exponential distribution with parameter 2. Let Y n:n max£X 1 , X 2 , T , X n ¤. Then F Y n:n y £Fy ¤ n from th.6.5.3a £1 " e "y/2 ¤ n £0¤ n for y 0 (7.2.4) for y 0 Let Z n Y n:n /2 " ln n. Then G n y P¡Z n t y¢ P Y n:n 2 " ln n t y P¡Y n:n t 2y ln n ¢ F Y n:n ¡2y ln n ¢ 1 " e " 2 ¡2yln n ¡1 " i.e. lim G y nv. n 1 ¢ n for n big enough y lnn 0 "y n e n ¢ "y e "e for all y. In the examples above G n y P¡Y n t y¢ is known for all values of n. G y Gy , then Gy is an approximation of G n y for large n. How good this If lim nv. n approximation will be, will differ for different cases, and will depend on how big n is. In general the approximation will be good enough provided n is big enough. See for instance Bain and Engelhardt p236. 203 [Chapter 7] OEXERCISES : Bain and Engelhardt p.259 No. 1 and p260 No. 3. 7.3 THE CENTRAL LIMIT THEOREM JTheorem 7.3.1 Suppose that Y 1 , Y 2 , Y 3 , T is a sequence of random variables with distribution functions G 1 , G 2 , G 3 , T and moment generating functions M 1 , M 2 , M 3 , T . Suppose that Y is a random variable with distribution function G and moment generating function M. If lim M n t Mt for all t in an interval "h t h where h 0, then nv. d lim G n y Gy for all y where G is continuous i.e. Y n v Y. nv. Proof : We will not give a proof of this theorem. EXAMPLES : Example 7.3.1 Suppose that Y 1 , Y 2 , Y 3 , T is a sequence of binomial random variables with parameters n and 6/n. Then E¡Y n ¢ 6 for all values of n. Let M n t be the moment generating function of Y n . Then 6 6 n M n t n e t 1 " n n 6e t " 1 1 n i.e. lim M n t e 6e "1 t for all values of t. nv. Since this limit moment generating function is that of a Poisson random variable with d parameter ȝ , it follows from th.7.3.1 that Y n v Y where Y is a Poisson random variable with parameter ȝ. Example 7.3.2 Suppose that Y n is the number of successes in n independent trials where only success or not-success is observed and the probability of success is p for all trials i.e. Y n is a binomial random variable with parameters n and p. Let W n Y n /n i.e. W n is the proportion of successes in n trials. Hence t t n M Wn t E¡e tWn ¢ E¡e n Y n ¢ M Y n t/n ¡pe n 1 " p¢ Let ft e t . 204 [Chapter 7] Then ft et f0 1 f U t et f U 0 1 and f UU t et. 2 where 8 is between 0 and t et 1 t t e8 2 2 t e n 1 nt t 2 e 8 n where 8 n is between 0 and t/n . 2n Therefore or Hence M Wn t 2 p 1 nt t 2 e 8 n 2n 2 8n p t2 en pt 1 n n dn pt 1 n n 1"p n n n 8n 2 where dn p t en v 0 as n v . for all values of t. 2 Therefore lim M t e pt for all values of t. nv. W n But e pt is the moment generating function of a random variable which is equal to p with probability 1. 
From th.7.3.1 it then follows that the proportion of successes converge stochastically to p. JTheorem 7.3.2 THE CENTRAL LIMIT THEOREM Suppose that X 1 , X 2 , X 3 , T , X n is a random sample from a population with expected value ȝ and variance ı 2 .. n ! X i " n6 Let Z n i1 . n@ d Then Z n v Z where Z is a standard normal random variable. Proof: For the proof we also assume that the moment generating function of the X i ’s exist. Let mt be the moment generating function of X i " 6. Then m0 1, m U 0 E¡X i " 6¢ 0 and m UU 0 E X i " 6 2 @ 2 . Hence mt 2 m0 m U 0 t m UU 8 t where 8 is between 0 and t 2 2 1 m UU 8 t 2 2t2 2 @ 1 £m UU 8 " @ 2 ¤ t . 2 2 205 [Chapter 7] Note that n ! X i " n6 Zn i1 n@ n ! i1 Xi " 6 n@ and therefore M Z n t n E exp t ! i1 X1 " 6 n@ E exp t E exp t n i1 n 1 exp t Xi " 6 n@ X2 " 6 n@ C exp t Xn " 6 n@ since X i ’s are independent t n@ m i1 Xi " 6 n@ @2 t2 n@ 2 2 n ¡m UU 8 n t2 2 dn 1 n n 2 " @2 ¢ t 2 2n@ where 8 n is between 0 and t n@ n where dn v 0 as n v . since 8 n v 0 and m UU 8 n v @ 2 . 1 2 M t e 2 t Therefore lim nv. Z n which is the moment generating function of a standard normal distribution. d From th 7.3.1 it then follows that Z n v Z where Z is a standard normal random variable. n ! X i " n6 For large values of n, Z n i1 n@ standard normal random variable. ſ is therefore approximately distributed like a OEXERCISES : Bain and Engelhardt p260 Nos. 9 and 11 . 206 [Chapter 7] 7.4 APPROXIMATIONS FOR THE BINOMIAL DISTRIBUTION Suppose that Y n is a binomial random variable with parameters n and p. It then follows Y " np is approximately distributed like a standard normal random variable. that Z n n npq How good this approximation is, will depend on p and on how big n is. In general a reasonable approximation will be obtained if np u 5 and nq u 5. EXAMPLES : Example 7.4.1 Suppose that Y n is a binomial random variable with parameters n 20 and p 0. 5. Then P¡Y 20 8¢ 1 " P¡Y 20 t 8¢ 8 1"! y0 n 0. 5 y 1 " 0. 5 y 1 " 0. 2517 20"y from table I p601 Bain and Engelhardt 0. 7483. If we use the normal approximation we obtain P¡Y 20 8¢ 1 " P¡Y 20 t 8¢ 1"P Y 20 " 20 . 5 t 20 . 5 . 5 ` 1 " P¡Z t ". 89¢ 1 " ¡1 ". 8133¢ . 8133. 8 " 20 . 5 20 . 5 . 5 where Z L N0, 1 207 [Chapter 7] A better approximation can be obtained by using the so-called correction for continuity since we are approximating a binomial distribution which is discrete with a normal distribution which is continuous. Suppose that Y n L BINn, p . Then P¡a t Y n t b¢ ` P¡a ". 5 t Y n t b . 5¢ a ". 5 " np Y " np b . 5 " np P¡ t n t ¢ npq npq npq ` P¡ o a ". 5 " np b . 5 " np tZt ¢ npq npq b . 5 " np npq "o where Z L N0, 1 a ". 5 " np npq Example 7.4.2 The same as Example 7.4.1 i.e. Y 20 L BIN20, . 5 . With the correction for continuity we get that P¡Y 20 8¢ 1 " P¡Y 20 t 8¢ ` 1 " P¡Y 20 t 8 . 5¢ 1"P Y 20 " 20 . 5 t 8. 5 " 20 . 5 20 . 5 . 5 20 . 5 . 5 ` 1 " P Z t 8. 5 " 20 . 5 20 . 5 . 5 where Z L N0, 1 1 " ¡1 ". 7486¢ . 7486 and this approximation is almost precisely the correct probability. OEXERCISES : Bain and Engelhardt p260 No. 12 . 208 [Chapter 7] 7.5 ASYMPTOTIC NORMAL DISTRIBUTIONS Definition 7.5.1 : If Y 1 , Y 2 , T is a sequence of random variables and m and c are constants such that d Z n Y n c" m v Z where Z is a standard normal random variable, then Y n is said to n have an asymptotic normal distribution with asymptotic mean m and asymptotic variance c 2 /n . EXAMPLE 7.5.1 Suppose that a random sample of size n 40 is selected from a population with an exponential distribution with parameter 100. 
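As a numerical aside, the behaviour described in this example can be checked by simulation. The following is a minimal Python sketch, assuming NumPy is available; the seed, the number of repetitions and the reported quantities are illustrative choices only.

```python
# Simulation for Example 7.5.1: sample means of n = 40 exponential(theta = 100) observations
# should be approximately normal with mean 100 and variance 100^2 / 40 = 250.
import numpy as np

rng = np.random.default_rng(6)
n, theta, reps = 40, 100.0, 50_000

xbar = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)
print("mean of sample means    :", xbar.mean())        # close to 100
print("variance of sample means:", xbar.var(ddof=1))   # close to 250
```

The simulated mean and variance of the sample means should be close to the asymptotic values derived next.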
By the central limit theorem n ! X i " n 100 i1 n 100 " 100 converges in distribution to a standard normal random X n 100 n variable i.e. X 40 has an asymptotic normal distribution with asymptotic mean 100 and asymptotic variance 100 2 /40 250. 7.6 STOCHASTIC CONVERGENCE Definition 7.6.1 : A sequence of random variables Y 1 , Y 2 , T is said to converge stochastically to a constant c if it has a limiting distribution that is degenerate at the point c i.e. if G n is the distribution function of Y n then lim G y 0 for y c nv. n 1 for y c. 209 [Chapter 7] JTheorem 7.6.1 A sequence of random variables Y 1 , Y 2 , T converges stochastically to a constant c if and only if for every . 0 P¡|Y n " c| .¢ 1. lim nv. Proof : Suppose that we have stochastic convergence to the constant c. For any . 0 we have that 1 u P¡|Y n " c| .¢ P¡Y n c .¢ " P¡Y n t c " .¢ u P¡Y n t c ./2¢ " G n ¡c " .¢ since c ./2 c . G n ¡c ./2¢ " G n ¡c " .¢ The limit of the right-hand side is 1 i.e. lim P¡|Y n " c| .¢ exists and is equal to 1. nv. P¡|Y n " c| .¢ 1 for all . 0. Now suppose that lim nv. Let y c. Then . c " y /2 0. Also y . y c " y /2 c y /2 and c " . c " c " y /2 c y /2 i.e. y . c " .. We then have that 0 t G n y P¡Y n t y¢ t P¡Y n t y .¢ P¡Y n t c " .¢ t P¡|Y n " c| u .¢ 1 " P¡|Y n " c| .¢. Since the right-hand side goes to zero it follows that lim G y exist and is equal to 0. nv. n The proof for y c is similar and is left as an exercise. ſ Definition 7.6.2 : If Y 1 , Y 2 , T is a sequence of random variables and c a constant, then we say that the sequence of random variables converge to the constant c in probability if P P¡|Y n " c| .¢ 1 for all . 0 and is denoted by Y n v c. lim nv. 210 [Chapter 7] EXERCISE 7.6.1 : § Let p n be the proportion of successes in n Bernoulli trials with probability of success p. L § Then E p n § p and var p n L p1 " p /n. L By Chebychev’s inequality we then get that for any . 0 § p n "p . 1 u lim P nv. L lim 1"P nv. § p n "p u . 1 " lim P nv. § p n "p u . u 1" L L p1"p n lim nv. .2 Since the limit on the right-hand side is 1, it follows that § § P P p n "p . 1 and therefore p n v p. lim nv. L L JTheorem 7.6.2 Let X 1 , X 2 , T , X n be a random sample from a population with expected value 6 and P variance @ 2 . with sample mean X n . Then X n v 6. Proof : For any . 0 we have that P¡|X n " 6| .¢ lim 1 u lim £1 " P¡|X n " 6| u .¢¤ nv. nv. 1 " lim £P¡|X n " 6| u .¢¤ nv. u 1 " lim nv. @ 2 /n .2 From Chebychev’s theorem Since the limit on the right-hand side is 1, it follows that P lim P¡|X n " 6| .¢ 1 and therefore X n v 6. ſ nv. JTheorem 7.6.3 d P If Z n Y n c" m v Z where Z is a standard normal random variable, then Y n v m. n 211 [Chapter 7] 7.7 ADDITIONAL LIMIT THEOREMS Definition 7.7.1 : A sequence of random variables Y 1 , Y 2 , T is said to converge in probability to a random variable Y if P P¡|Y n " Y| .¢ 1 for any . 0 and denoted by Y n v Y. lim nv. NOTE: In this definition the "limit" is a random variable and not a constant. In case Y is a random variable which is equal to a constant m with probability 1 the two definitions are the same. JTheorem 7.7.1a P If Y n v Y then lim P¡|Y n " Y| t .¢ 1 for any . 0. nv. Proof : Let . 0. Then 1 u P¡|Y n " Y| t .¢ u P¡|Y n " Y| .¢ and letting n tend to infinity if follows that 1 u lim P¡|Y n " Y| t .¢ u 1 nv. i.e. lim P¡|Y n " Y| t .¢ 1. nv. From this it then also follows that lim P¡|Y n " Y| .¢ 0 and nv. lim P¡|Y n " Y| u .¢ 0. nv. 
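The convergence in probability results above lend themselves to simulation. The sketch below, in Python and assuming NumPy is available, estimates the probability that the sample mean misses the population mean by more than a fixed amount; by Theorem 7.6.2 this probability should shrink towards zero as n grows. The exponential population, the tolerance and the number of repetitions are illustrative choices.

```python
# Estimate P[|Xbar_n - mu| > eps] for several n; by Theorem 7.6.2 it tends to 0 as n grows.
import numpy as np

rng = np.random.default_rng(0)
mu, eps, reps = 2.0, 0.1, 1_000

for n in (10, 100, 1_000, 10_000):
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) > eps))
```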
ſ JTheorem 7.7.1 If Y 1 , Y 2 , T is a sequence of random variables that converge to a random variable Y in P d probability then they converge to Y in distribution i.e. Y n v Y ´ Y n v Y. Proof : Let G n be the distribution function of Y n , G the distribution function of Y and let . 0. Then Gy " . P¡Y t y " .¢ P¡Y t y " ., Y n t y¢ P¡Y t y " ., Y n y¢ t P¡Y n t y¢ P¡Y n " Y .¢ t G n y P¡|Y n " Y| .¢. 212 [Chapter 7] Letting n tend to infinity on both sides we obtain that Gy " . t lim G y . nv. n Similarly Gy . P¡Y t y .¢ u P¡Y t y ., Y n t y¢ P¡Y n t y¢ " P¡Y n t y, Y y .¢ u P¡Y n t y¢ " P¡Y " Y n .¢ u P¡Y n t y¢ " P¡|Y n " Y| .¢. G n y " P¡|Y n " Y| .¢. Letting n tend to infinity we obtain that G y . Gy . u lim nv. n Thus Gy " . t lim G y t Gy . for any . 0. nv. n If G is continuous at the point y, if we let . tend to 0 in the above, we get that Gy t lim G y t Gy i.e. lim G y Gy . nv. n nv. n Therefore Y n converge to Y in distribution. ſ NOTE : The converse of theorem 7.7.1 is not true, i.e. convergence in probability is a stronger convergence than convergence in distribution. JTheorem 7.7.2 P P If Y n v c, then for any function g that is continuous at c , gY n v gc . Proof : If g is continuous at the point c, there exist for every . 0 a - 0 such that |y " c| - .implies |gy " gc | .. Therefore 1 u P¡|gY n " gc | .¢ u P¡|Y n " c| -¢. Letting n go to infinity on both sides we obtain that P P¡|gY n " gc | .¢ u 1 since Y n v c 1 u lim nv. P P¡|gY n " gc | .¢ 1 for any . 0 i.e. gY n v gc . i.e. lim nv. ſ 213 [Chapter 7] We note the following theorems with no proofs given. JTheorem 7.7.3 P If X 1 , X 2 , T and Y 1 , Y 2 , T are sequences of random variables such that X n v c and P Y n v d, then P 1. aX n bY n v ac bd P 2. X n Y n v cd P 3. X n /c v 1 for c p 0 P 4. 1/X n v 1/c if P¡X n p 0¢ 1 for all n and c p 0 5. P Xn v c if P¡X n u 0¢ 1 for all n. JTheorem 7.7.4 P If X 1 , X 2 , T and Y 1 , Y 2 , T are sequences of random variables such that X n v c and d Y n v Y, then d 1. X n Y n v c Y d 2. X n Y n v cY d 3. Y n /X n v Y/c for c p 0 JTheorem 7.7.5 d d If Y n v Y, then for any continuous function g, gY n v gY . JTheorem 7.7.6 n Y n " m d v Z where Z is a standard normal random variable and if g has a If c n gY n " gm d v Z i.e. gY n has an non-zero derivative at y m, then U |cg m | asymptotic normal distribution with asymptotic expected value gm and asymptotic 2 c 2 ¡g U m ¢ . variance n 214 7.8 [Chapter 7] ASYMPTOTIC DISTRIBUTIONS OF EXTREME ORDER STATISTICS Let X 1:n , X 2:n , T , , X n:n be the order statistics for a random sample from a population with distribution function F. The maximum X n:n is said to have a non-degenerate limiting distribution G if there exist sequences of standardizing constants £a n ¤ and £b n ¤ with a n 0 such that the d bn v Y where Y is a random variable with standardized variable Y n X n:na " n distribution function G. We will then say that X n:n has a limiting distribution of type G. Similarly for X 1:n . It can be shown that there are only three types of limiting distributions for the extreme order statistic X n:n , namely "y 1. Gy e "e for y 0 2. Gy e "y "+ for y 0 + 3. Gy e ""y for y 0. Similarly it can be shown that there are only three types of limiting distributions for the extreme order statistic X 1:n , namely y 1. Gy 1 " e "e for all y 2. Gy 1 " e ""y "y + "+ for y 0 for y 0. 3. Gy 1 " e Note that the type 3 distribution for minimums is a Weibull distribution. 
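The type 1 limit for maxima can be illustrated numerically in the exponential case of Example 7.2.8, where the standardized maximum has limiting distribution function G(y) = exp(-e^{-y}). The following Python sketch assumes NumPy is available; the parameter value, the sample size and the number of repetitions are illustrative.

```python
# Check of the type 1 limit for maxima (Example 7.2.8): for exponential(theta) data,
# Z_n = X_{n:n}/theta - ln(n) has limiting distribution function G(y) = exp(-e^{-y}).
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 2.0, 200, 20_000

z = rng.exponential(scale=theta, size=(reps, n)).max(axis=1) / theta - np.log(n)

for y in (-1.0, 0.0, 1.0, 2.0):
    print(y, np.mean(z <= y), np.exp(-np.exp(-y)))   # empirical proportion vs limiting G(y)
```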
215 [Chapter 8] CHAPTER 8 : STATISTICS AND SAMPLING DISTRIBUTIONS 8.1 INTRODUCTION In this chapter we study the distribution of a number of statistics which are later used for statistical inference. 8.2 STATISTICS Definition 8.2.1 : Suppose that X 1 , X 2 , X 3 , T X n are jointly distributed random variables. Let tX 1 , X 2 , X 3 , T X n T be a function of X 1 , X 2 , X 3 , T X n which does not depend on any unknown parameters. Then T is called a statistic. EXAMPLES : Example 8.2.1 Suppose that X 1 , X 2 , X 3 , T X n is a random sample from some population. Let X Let s 2 L n Xi ! i1 n . Then X is a statistic and is called the sample mean . n ! i1 X i " X n"1 2 . Then s 2 is a statistic and is called the sample variance. L Example 8.2.2 Suppose that X 1 , X 2 , X 3 , T X n are independent Bernoulli random variables all with n § ! i1 X i § . Then p is a statistic and is called the sample proportion. parameter p. Let p n L L Statistics ( the plural of statistic as defined above) are therefore functions of random variables. We are interested in the properties of these random variables such as the expected value, variance and the distribution as for any other random variable. 216 [Chapter 8] JTheorem 8.2.1 : Suppose that X 1 , X 2 , X 3 , T X n is a random sample from a population with expected value ȝ and variance ı 2 . n Xi ! i1 @2 . . Then E¡X¢ 6 and varX Let X n n Proof : E¡X¢ E¡ 1n £X 1 X 2 X 3 T X n ¤¢ 1 n 1 n 1 n E¡X 1 X 2 X 3 T X n ¢ £E¡X 1 ¢ E¡X 2 ¢ E¡X 3 ¢ T E¡X n ¢¤ £6 6 6 T 6¤ 6. Furthermore var¡X¢ var¡ 1n £X 1 X 2 X 3 T X n ¤¢ 1 n2 1 n2 @2 n ¡varX 1 varX 2 varX 3 T varX n ¢ since X i ’s are independent n@ 2 . ſ JTheorem 8.2.2 : Suppose that X 1 , X 2 , X 3 , T X n is a random sample from a population with expected value ȝ and variance ı 2 . 2 n ! i1 X i " X 2 Let s . Then E s 2 @2. n"1 L L Proof : Note first of all that n !X i " X 2 i1 n ! X 2i " 2XX i X i1 n 2 n ! X 2i " 2X ! X i nX i1 n i1 ! X 2i " 2XnX nX i1 n ! X 2i " nX i1 and therefore 2 2 2 217 [Chapter 8] E s2 L n E ! X 2i " nX 1 n"1 i1 n 1 n"1 2 E ! X 2i " nX 2 i1 n 1 n"1 ! EX 2i " nE X 2 1 n"1 ! varX i EX i 1 n"1 1 n"1 @2. i1 n 2 " n varX EX 2 i1 n£@ 2 6 2 ¤ " n @2 n 62 ¡n " 1 @ 2 ¢ ſ JTheorem 8.2.3a : Suppose that X 1 , X 2 , X 3 , T X n are independent Bernoulli random variables all with parameter p . n pq § ! i1 X i § § . Then E p p and var p n . Let p n L L L Proof : The variable X i is 1 if the i-th observation is a success and 0 if the i-th observation is n not-success i.e. ! X i is the number of successes in n independent trials where the i1 n probability of success is p at each trial i.e. ! X i L BINn, p . Therefore i1 § E p L E n Xi ! i1 n n 1n E ! i1 X i 1n np p and 218 [Chapter 8] § var p var n Xi ! i1 n L n 12 var ! i1 X i n 12 npq n pq n . ſ 8.3 SAMPLING DISTRIBUTIONS LINEAR COMBINATIONS OF NORMAL RANDOM VARIABLES JTheorem 8.3.1 : Suppose that X 1 , X 2 , X 3 , T X n are independent random variables and that n X i L N6 i , @ 2i . Let Y ! a i X i where a 1 , a 2 , T , a n are constants. Then Y is normally i1 n n i1 i1 distributed with expected value ! a i 6 i and variance ! a 2i @ 2i . Proof: M Y t n E exp t ! a i X i i1 E¡expta 1 X 1 expta 2 X 2 T expta n X n ¢ E¡expta 1 X 1 ¢ E¡expta 2 X 2 ¢ T E¡expta n X n ¢ since X 1 , X 2 , T , X n are independent. M X 1 a 1 t M X 2 a 2 t T M X n a n t n M X i a i t i1 n exp 6 i a i t @ 2i i1 exp a i t 2 n n i1 i1 2 2 ! a i 6 i t ! a 2i @ 2i t 2 This is the moment generating function of a normal random variable with expected value n n n n i1 i1 i1 i1 ! 
a i 6 i and variance ! a 2i @ 2i i.e. Y L N ! a i 6 i , ! a 2i @ 2i . ſ 219 [Chapter 8] JTheorem 8.3.2a : Suppose that X 1 , X 2 , T X n is a random sample from a normal population with expected value 6 and variance @ 2 . Let X be the sample mean. Then X is normally distributed with expected value 6 and 2 variance @n . Proof : n n Xi ! i1 1 We have that X ! n Xi. n i1 n From th.8.3.1 it then follows that X is normally distributed with mean ! n variance ! 1n i1 1 n 6 6 and i1 2 2 @ 2 @n . ſ JTheorem 8.3.2b : Suppose that X 1 , X 2 , T X n 1 is a random sample from a normal population with expected value 6 1 and variance @ 21 . Suppose that Y 1 , Y 2 , T Y n 2 is an independent random sample from a normal population with expected value 6 2 and variance @ 22 . Let X be the sample mean of the X’s and let Y be the sample mean of the Y’s. Then @2 @2 X " Y is normally distributed with expected value 6 1 " 6 2 and variance n 11 n 22 . Proof : From th.8.3.2a it follows that X is normally distributed with expected value 6 1 and @2 variance n 11 and Y is also normally distributed with expected value 6 2 and variance @ 22 n 2 . Furthermore X and Y are independent since the two samples are independent. Hence X " Y is a linear combination of two independent random variables and it then follows from th.8.3.1 that X " Y is a normal random variable with expected value @2 @2 6 1 " 6 2 and variance n 11 n 22 . ſ OEXERCISES : Bain and Engelhardt p.283 No.’s 1, 2 and 4. 220 [Chapter 8] THE CHI-SQUARE DISTRIBUTION Definition 8.3.1 : The random variable X is said to have a chi-square distribution with 7 .degrees of freedom if X is distributed like a gamma random variable with parameters 2 and 7/2. JTheorem 8.3.2 : If X is distributed like a chi-square with 7 .degrees of freedom, then 72 r "7/2 r r , E¡X ¢ 2 , E¡X¢ 7 .and varX 27. M X t 1 " 2t 72 Proof : The results follows from the results for gamma distributions. ſ JTheorem 8.3.3 : If X L GAM2, 4 , then Y 2X/2 L D 2 24 i.e. Y is distributed like chi-square with 24 degrees of freedom. Proof : "4 2X 2t 1 " 2t "24/2 , M Y t E¡e tY ¢ E e t 2 E e 2 X M X 2t 1 " 2 2t 2 2 i.e. it is the moment generating function of a chi-square distribution with 24 degrees of freedom. ſ JTheorem 8.3.4 : Suppose that X 1 , X 2 , T X n are independent random variables and that X i L D 2 7 i . Let n n i1 i1 Y ! X i . Then Y is distributed like chi-square with ! 7 i degrees of freedom. Proof : M Y t n M X i t i1 n 1 " 2t " 7i 2 i1 1 " 2t " % ni1 7 i 2 n which is the moment generating function of a chi-squared random variable with ! 7 i degrees of freedom. ſ i1 221 [Chapter 8] JTheorem 8.3.5 : Suppose that X is a standard normal random variable. Then X 2 is distributed like a chi-squared random variable with 1 degree of freedom. Proof : M X 2 t E¡e tX ¢ 2 . "x /2 2 dx ; e tx e 2= ". . ; 2 2 exp " x2 1 " 2t 2= ". . 1 ; 1 " 2t ". dx x2 exp " 12 2= 1 1"2t 1 1 " 2t dx this is the integral of a normal density function with expected value 0 and variance 1 . 1 " 2t 1 1 " 2t This is the moment generating function of a chi-squared random variable with 1 degree of freedom. ſ JTheorem 8.3.6a : Suppose that X 1 , X 2 , T X n is a random sample from a normal population with expected value 6 and variance @ 2 . Let X be the sample mean. 2 n nX " 6 X i " 6 2 2 L D n and L D 2 1 . Then ! @2 @2 i1 Proof : n n n Xi " 6 2 X " 6 2 ! Y 2i ! i 2 ! @ @ i1 i1 i1 Xi " 6 where Y i are independent N0, 1 random variables i.e. Y i 2 are @ n independent D 2 1 random variables i.e. ! Y 2i L D 2 n . 
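The first claim of Theorem 8.3.6a is easy to check by simulation. A minimal Python sketch, assuming NumPy and SciPy are available and using illustrative parameter values:

```python
# Simulation check: for a normal sample, sum_i ((X_i - mu)/sigma)^2 behaves like a
# chi-square random variable with n degrees of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 3.0, 8, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
v = (((x - mu) / sigma) ** 2).sum(axis=1)

print("simulated mean and variance:", v.mean(), v.var(ddof=1))   # close to n and 2n
print("chi-square(n) mean and variance:", n, 2 * n)
print("P[V <= 10] simulated vs chi-square:", np.mean(v <= 10), chi2.cdf(10, df=n))
```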
Also nX " 6 @2 2 X"6 @ n 2 i1 Y 2 where Y L N0, 1 i.e. Y 2 L D 2 1 . ſ 222 [Chapter 8] JTheorem 8.3.6 : Suppose that X 1 , X 2 , T X n is a random sample from a normal population with expected value 6 and variance @ 2 . Let X be the sample mean and let s 2 be the sample variance. Then L 1. X and X i " X are independent for i 1, 2, 3, T , n. 2. X and s 2 are independent. L n " 1 s 2 3. @2 L L D 2 n " 1 . Proof : Part 1 The joint density function of X 1 , X 2 , T X n is given by n f X i x i f X 1 ,X 2 ,T,X n x 1 , x 2 , T x n i1 n " e 2 f X 1 ,X 2 ,T,X n x 1 , x 2 , T x n 1 i1 since the X i ’s are independent i.e. x i "6 @ 2= @ x i "6 " 2 % i1 @ e n 2= 2 @ n 1 n 2 2 . n Let Y 1 X ! 1 n Xi i1 and Yi Xi " X for i 2, 3, 4, T n. Consider the equations n y1 x ! 1 n xi i1 and yi xi " x for i 2, 3, 4, T n. The solution of these equations is given by x i y i y 1 for i 2, 3, 4, T n. and x 1 ny 1 " x 2 " x 3 "T "x n ny 1 " £y 2 y 1 ¤ " £y 3 y 1 ¤ "T "£y n y 1 ¤ y 1 " y 2 " y 3 "T "y n . 223 [Chapter 8] Then 1 "1 "1 "1 T "1 J xvy 1 1 0 0 T 0 1 0 1 0 T 0 1 0 0 1 T 0 . . . . T . n "1 "1 "1 T "1 0 1 0 0 T 0 0 0 1 0 T 0 0 0 0 1 T 0 . . . . T . 1 0 0 0 T 1 0 0 0 0 T 1 by subtracting column 2 from the first column, subtracting column 3 from the first column,T,subtracting the n-th column from the first column n 0 0 0 T 0 0 1 0 0 T 0 0 0 1 0 T 0 n 0 0 0 1 T 0 . . . T . . 0 0 0 0 T 1 by adding row 2 to the first row, adding row 3 to the first row,T,adding the n-th row to the first row and since the determinant of a diagonal matrix is the product of the diagonal elements. We have that n !x i " x i1 n n i1 i1 ! x i " ! x nx " nx 0 n and !x i " 6 n n i2 i2 " !x i " x " ! y i i.e. x 1 " x 2 n !x i " x x " 6 i1 i1 n ! x i " x i1 n !x i " x i1 n !x i " x 2 2 2 8. 3. 1 2 2x i " x x " 6 x " 6 n n i1 i1 2x " 6 !x i " x !x " 6 nx " 6 2 2 n !x i " x 2 nx " 6 i2 n " ! yi i2 2 n ! y 2i ny 1 " 6 i2 2 8. 3. 2 i1 x 1 " x 2 2 2 224 [Chapter 8] Therefore f Y 1 ,Y 2 ,T,Y n y 1 , y 2 , T y n 2 n " ! yi exp " 2@1 2 n ! y 2i ny 1 " 6 i2 2 i2 n 2 2= . |n| @n nn exp " 1 2 ny 1 " 6 2@ 2= 2 @ n 2 exp " 1 2 2@ n " ! yi i2 2 n ! y 2i i2 i.e. the joint density Y 1 , Y 2 , T , Y n consists of one factor that depends only on y 1 and the other factor that depends on y 2 , y 3 , T y n i.e. Y 1 is independent of Y 2 , Y 3 , T , Y n or X is independent of X 2 " X, X 3 " X, T , X n " X. From (8.3.1) we have that n X 1 " X " !X i " X and therefore X is also independent of X 1 " X. i2 Part 2 L n ! i1 X i " X 2 depends only on X 1 " X, X 2 " X, T , X n " X and since X is n"1 independent of all of them, it follows that X and s 2 are independent. Since s2 L Part 3 From 8.3.2 we get that n !X i " 6 2 n !X i " X i1 2 nX " 6 2 i1 Xi " 6 or ! @ n 2 i1 n " 1 s 2 @2 L X"6 2 @ n or say V 1 V 2 V 3 . X "6 The random variables i @ , i 1, 2, T , n, are independent standard normal random Xi " 6 2 variables, i.e. , i 1, 2, T , n, are independent D 2 1 random variables and @ n Xi " 6 2 therefore V 1 ! L D 2 n . Furthermore V 2 depend only on s 2 and V 3 @ L i1 only depends on X and since they are independent random variables it follows that V 2 and V 3 are independent random variables. Since X L N6, @ 2 /n it follows that X"6 @ n L N0, 1 and V 3 X"6 @ n 2 L D 2 1 . 225 [Chapter 8] Since V 1 is the sum of two independent random variables V 2 and V 3 we have that M V 1 t M V 2 t . M V 3 t where M V i t the m.g.f. of V i or 1 " 2t " n2 M V 2 t . 1 " 2t " M V 2 t 1 " 2t n"1 2 " 12 or . 
This is the moment generating function of a chi-squared random variable with n " 1 n " 1 s 2 L degrees of freedom i.e. V 2 L D 2 n " 1 . ſ 2 @ Suppose that Y L D 2 7 , that 0 + 1 and that D 2+ 7 is that value such that P¡Y t D 2+ 7 ¢ + i.e. D 2+ 7 is the 100 + percentile of a chi-squared distribution with 7 ..degrees of freedom. Some values of D 2+ 7 may be found in Table 4 pp604-605 in the book of Bain and Engelhardt. For instance D 2.025 9 2. 7 and D 2.975 9 19. 02 from which it then follows that if Y L D 2 9 , then P¡D 2.025 9 t Y t D 2.975 9 ¢ 0. 95 or P¡2. 7 t Y t 19. 02¢ 0. 95. EXAMPLE 8.3.1 : Suppose that X 1 , X 2 , T X 10 is a random sample from a normal population with expected 9 s2 L 2 2 L D 2 9 . value 6 and variance @ where both 6 and @ are unknown. Then @2 It then follows that 9 s2 L 2 t D 2.975 9 0. 95 P D .025 9 t @2 P 2. 7 t P 9 s2 L @2 t 19. 02 2 1 t @2 19. 02 9s t 1 2. 7 L P 9 s2 L 19. 02 t @2 t 9 s2 L 2. 7 . 226 [Chapter 8] Note that @ 2 is a fixed number, but that s 2 is a random variable i.e. for different L samples the value of s2 L 9 s2 will be different. The intervals L 19. 02 9 s2 , L 2. 7 will be different for different samples. Some intervals will include @ 2 and others will not include @ 2 . The probability that the interval will include @ 2 is 0.95. Such an interval is called a 95% confidence interval for @ 2 . OEXERCISES : Bain and Engelhardt p.284 No.’s 5, 8, 9, 11, 12 and 13. 8.4 THE t, F AND BETA DISTRIBUTIONS STUDENT’S t DISTRIBUTION JTheorem 8.4.1 : Suppose that Z is a standard normal random variable and that X is an independent chi-squared random variable with 7 .degrees of freedom. Let T Z . Then the density function of T is given by X 7 f T t 7 2 71 2 12 2 1 t7 7 " 71 2 for ". t .. The density function is called the t distribution with 7 .degrees of freedom. Proof : The joint density function of Z and X is given by f Z,X z, x f Z z . f X x since Z and X are independent 7 1 e " 12 z 2 x 2 "1 e " 2 7 2 2 72 2= x 0 Let T Z X 7 for " . z . and x 0 otherwise and W X. z The solution of the equations t and w x x 7 is given by x w and z t w 7 . Therefore Jz, x v t, w w 7 0 t 1 1 2 w 7 1 7 w . 7 227 [Chapter 8] Hence 1 e " 12 2= f T,W t, w t2w 7 7 "1 " w7 2 e7 2 22 2 w w 7 for " . t . and w 0 0 otherwise Then . f T t ; f T,W t, w dw ". . ; 0 . ; 0 71 w 2 71 2 2 71 2 1 2 2 71 2 1 2 7 2 71 2 7 2 " 71 2 1 t7 2 w dw 7 2 " 71 2 2 w e 1 2 1 t7 2 2 7 "1 " w7 2 e7 2 22 2 w t "1 " 2 1 7 2 t2w 7 1 e " 12 2= 7 . 7 71 2 7 2 dw 7 71 2 ; w 0 2 2 1 t7 2 w t "1 " 2 1 7 e 71 2 1 since the integral is that of a GAM 1 2 71 2 72 JTheorem 8.4.2 : See Bain and Engelhardt. 7 dw 71 2 2 1 t7 " 71 2 ſ 2 , 71 2 2 1 t7 function 228 [Chapter 8] JTheorem 8.4.3 : Suppose that X 1 , X 2 , T X n is a random sample from a normal population with expected value 6 and variance @ 2 where both 6 and @ 2 are unknown. Let X be the sample X"6 is distributed like t with n " 1 mean and let s 2 be the sample variance. Then s/ n L L degrees of freedom. Proof : From th.8.3.2a we have that n " 1 s 2 L @2 L D 2 n " 1 . It then follows from th.8.4.1 that X"6 @/ n n"1 s 2 L @2 X"6 L N0, 1 and from th.8.3.6 it is independent of @/ n / n " 1 X"6 L tn " 1 . s/ n ſ L The percentiles of the t-distribution t + 7 can be found in table 6 p.608 in Bain and Engelhardt. Suppose that T is distributed like t with 7 .degrees of freedom, then + P¡Y t t + 7 ¢. For example t .025 9 "2. 262 and t .975 9 2. 262. EXAMPLE 8.4.1 : Suppose that X 1 , X 2 , T X 10 is a random sample from a N6, @ 2 X"6 Then L t9 and s / 10 population. 
L 0. 95 P t .025 9 t X"6 t t .975 9 s / 10 L P "2. 262 t X"6 t 2. 262 s / 10 L P "2. 262 s / 10 t X " 6 t 2. 262 s / 10 L L P "X " 2. 262 s / 10 t "6 t "X 2. 262 s / 10 L L P X " 2. 262 s / 10 t 6 t X 2. 262 s / 10 L L 229 [Chapter 8] Note that 6 is a constant. For some samples the interval X " 2. 262 s / 10 , X 2. 262 s / 10 L L will include 6 while in other samples it will not include 6. The probability that the interval will include 6 is 0.95 i.e. the interval is a 95% confidence interval for 6. OEXERCISES : Bain and Engelhardt p.285 No.16. THE BETA AND F DISTRIBUTIONS Definition 8.4.1 : The random variable X is said to have a beta density function with parameters ) 1 and ) 2 if the density function is given by x ) 1 "1 1 " x ) 2 "1 for 0 x 1 B) 1 , ) 2 f X x 0 otherwise 1 where B) 1 , ) 2 ; x ) 1 "1 1 " x ) 2 "1 dx. 0 Note that . 1 ) 1 "1 ; f X x dx ; x 1 " x ". 0 B) 1 , ) 2 ) 2 "1 dx 1 1 ; x ) 1 "1 1 " x B) 1 , ) 2 0 1 B) 1 , ) 2 B) 1 , ) 2 ) 2 "1 1 i.e. the integral of the density function is 1. dx 230 [Chapter 8] JTheorem 8.4.4a : Suppose that X 1 is a gamma random variable with parameters 2 and ) 1 . Suppose that X 2 is an independent gamma random variable with parameters 2 and ) 2 . X1 and Y 2 X 1 X 2 . Let Y 1 X1 X2 Then Y 1 and Y 2 are independent random variables and Y 1 is a beta random variable with parameters ) 1 and ) 2 and Y 2 is a gamma random variable with parameters 2 and ) 1 ) 2 . Proof : The joint density function of X 1 and X 2 is given by f X 1 ,X 2 x 1 , x 2 f X 1 x 1 . f X 2 x 2 x1 x )1 1 "1 e " 2 2 ) 1 ) 1 0 x2 x )2 2 "1 e " 2 2 ) 2 ) 2 since X 1 and X 2 are independent for x 1 0 and x 2 0 otherwise. The solution of y 1 x x1 x 1 2 and y 2 x 1 x 2 is given by x 1 y 1 . x 1 x 2 y 1 . y 2 and x 2 y 2 " x 1 y 2 " y 1 y 2 1 " y 1 y 2 and the Jacobian of the transformation is y2 y1 y 2 1 " y 1 " £"y 2 y 1 ¤ y 2 . J xvy "y 2 1 " y 1 Therefore f Y 1 ,Y 2 y 1 , y 2 y 1 y 2 ) 1 "1 e " 2 ) 1 ) 1 y1y2 2 ¡1 " y 1 y 2 ¢ ) 2 "1 e " 2 ) 2 ) 2 1"y 1 y 2 2 0 y )1 1 "1 1 " y 1 ) 1 ) 2 /) 1 ) 2 0 for y 1 y 2 0 and 1 " y 1 y 2 0 otherwise ) 2 "1 |y 2 | y2 y ) 1 "1) 2 "11 e " 2 )21 ) 2 2 ) 1 ) 2 for 0 y 1 1 and y 2 0 otherwise 231 [Chapter 8] . Then f Y 1 y 1 ; f Y 1 ,Y 2 y 1 , y 2 dy 2 . ". Since f Y 1 ,Y 2 y 1 , y 2 0 for all y 2 if y 1 t 0 or if y 1 u 1, it follows that the integral of f Y 1 y 1 is 0 for y 1 t 0 or if y 1 u 1. For 0 y 1 1 we then get that . ) 2 "1 f Y 1 y 1 y2 y )1 1 "1 1 " y 1 y ) 1 "1) 2 "11 e " 2 ; )21 ) 2 dy 2 ) 1 ) 2 /) 1 ) 2 0 2 ) 1 ) 2 y )1 1 "1 1 " y 1 ) 1 ) 2 /) 1 ) 2 ) 2 "1 1 since the integral is the integral of a gamma density function with parameters 2 and ) 1 ) 2 . Since 1 ; f Y 1 y 1 dy 1 ". 1 ; 0 ) 2 "1 y )1 1 "1 1 " y 1 dy 1 ) 1 ) 2 /) 1 ) 2 1 ) 1 ) 2 /) 1 ) 2 1 ; y )1 1 "1 1 " y 1 it follows that B) 1 , ) 2 dy 1 0 1 B) 1 , ) 2 ) 1 ) 2 /) 1 ) 2 ) 2 "1 by definition of B) 1 , ) 2 ) 1 ) 2 ) 1 ) 2 Therefore f Y 1 y 1 y )1 1 "1 1 " y 1 B) 1 , ) 2 0 ) 2 "1 for 0 y 1 1 otherwise i.e. Y 1 has a beta density function with parameters ) 1 and ) 2 . 232 [Chapter 8] We also have that . f Y 2 y 2 ; f Y 1 ,Y 2 y 1 , y 2 dy 1 . ". Since f Y 1 ,Y 2 y 1 , y 2 0 for all y 1 if y 2 t 0, it follows that the integral of f Y 2 y 2 for y 2 t 0. is 0 For y 2 0 ) 2 "1 y )1 1 "1 1 " y 1 ; ) 1 ) 2 /) 1 ) 2 0 1 f Y 2 y 2 y2 y2 y )2 1 "1) 2 "11 e " 2 dy 1 2 ) 1 ) 2 ) 1 ) 2 ) 2 "1 y )1 1 "1 1 " y 1 y )2 1 ) 2 "1 e " 2 ; ) 1 ) dy 1 2 2 ) 1 ) 2 0 ) 1 ) 2 /) 1 ) 2 1 y2 y )2 1 ) 2 "1 e " 2 ) 1 ) 2 2 ) 1 ) 2 B) 1 , ) 2 ) 1 ) 2 /) 1 ) 2 y2 y )2 1 ) 2 "1 e " 2 ) 1 ) because of 8. 4. 
1 2 2 ) 1 ) 2 i.e. Y 2 has a gamma density function with parameters 2 and ) 1 ) 2 . ſ JTheorem 8.4.4b : Suppose that X 1 and X 2 are two independent chi-squared random variables with 7 1 and 7 2 degrees of freedom respectively. X1 and Y 2 X 1 X 2 are independent random variables and Y 1 is a Then Y 1 X1 X2 beta random variable with parameters 721 and 722 and Y 2 is a chi-squared random variable with 7 1 7 2 degrees of freedom. Proof : We have that X 1 L GAM 2, 721 and X 2 L GAM 2, 722 . It then follows from th.8.4.4a that Y 1 and Y 2 are independent and that Y 1 has a beta density function with parameters 721 and 722 and that Y 2 has a GAM 2, 721 722 distribution i.e. Y 2 L D 2 7 1 7 2 . ſ 233 [Chapter 8] JTheorem 8.4.4c : Suppose that X is a random variable with a beta density function with parameters and )22 . ) 2 . X . Then the density function of Y is given by Let Y ) 1 1"X f Y y )1 2 ) 1 ) 2 2 )2 2 )) 12 )1 2 y )1 2 "1 1 )1 )2 y " ) 1 ) 2 2 )1 2 for y 0 0 otherwise This density function is called an F distribution with ) 1 and ) 2 degrees of freedom. Proof : We have that x f X x )1 2 "1 1 " x B )21 , )22 )2 2 "1 0 for 0 x 1 otherwise )2 x . Let y ) 1 1"x ) 1 y1 " x x Then ) 2 ) 1 y x ) 1 yx 1 ) 1 y x or ) )2 )2 2 y )1 or x ) u "1 y 2 1 y 1 ) ) say. 2 Then d u "1 y ) 1 )2 dy )1 1 1 ) )2 y " )2 y )1 2 )2 1 1 ) y )2 1 1 1 ) )2 y 2 . 234 [Chapter 8] Therefore f Y y )1 )2 1 )1 2 y )1 )2 "1 1" y B )1 )2 )1 2 , 1 )2 2 )2 2 y )1 )2 y "1 )1 )2 1 1 1 ) )2 y 2 0 1 y )1 )2 y 1 otherwise for 0 )1 )2 )1 2 ) 1 ) 2 2 )1 2 )1 )2 )2 2 y )1 2 "1 1 1 ) )2 y " ) 1 ) 2 2 for y 0 0 ſ otherwise JTheorem 8.4.4 : Suppose that X 1 and X 2 are independent chi-squared random variables with 7 1 and 7 2 degrees of freedom respectively. Let F X1 71 X2 72 . Then F is distributed like F with 7 1 and 7 2 degrees of freedom. Proof : From th.8.4.4b we have that Y parameters 71 2 and 72 2 X1 is a beta random variable with X1 X2 . From th.8.4.4c it then follows that X1 72 Y 72 72 X1 X 1 X 2 F 7 7 7 1 1 " X1 1 1 X2 1"Y X 1 X 2 X1 71 X2 72 is distributed like F with 7 1 and 7 2 degrees of freedom. JTheorem 8.4.5 : See Bain and Engelhardt p276. ſ 235 [Chapter 8] JTheorem 8.4.6 : If U is distributed like F with 7 1 and 7 2 degrees of freedom, then 1/U is distributed like F with 7 2 and 7 1 degrees of freedom. Proof : To get an F distribution with 7 1 and 7 2 degrees of freedom, suppose that X 1 and X 2 are independent chi-squared random variables with 7 1 and 7 2 degrees of freedom respectively. Then U X1 71 X2 72 is distributed like F with 7 1 and 7 2 degrees of freedom. But 1 U X2 72 X1 71 and from th.8.4.5 it follows that 1/U is distributed like F with 7 2 and 7 1 degrees of freedom. ſ We will use the notation F L F7 1 , 7 2 to indicate that the random variable F has an F distribution with 7 1 and 7 2 degrees of freedom. Now suppose that F L F7 1 , 7 2 . Let f + 7 1 , 7 2 be that value such that P¡F t f + 7 1 , 7 2 ¢ +, i.e. f + 7 1 , 7 2 is the 100 + percentile of the F distribution with 7 1 and 7 2 degrees of freedom. Values of f + 7 1 , 7 2 is given in Table 7 pp.609-611 in the book of Bain and Engelhardt. Note that the table gives values of f + 7 1 , 7 2 for various combinations of values of 7 1 and 7 2 and + . 90, . 95, . 975, . 99 and . 995 i.e. only for values of + close to 1. To obtain values of f + 7 1 , 7 2 for values of + close to 0 we use th.8.4.6 as follows. Suppose that F L F7 1 , 7 2 i.e. 1/F L F7 2 , 7 1 . 
Therefore 1 " + P¡F t f 1"+ 7 1 , 7 2 ¢ P¡1/F u 1/f 1"+ 7 1 , 7 2 ¢ 1 " P¡1/F t 1/f 1"+ 7 1 , 7 2 ¢ or + P¡1/F t 1/f 1"+ 7 1 , 7 2 ¢ i.e. f + 7 2 , 7 1 1/f 1"+ 7 1 , 7 2 or f + 7 1 , 7 2 1/f 1"+ 7 2 , 7 1 . 236 [Chapter 8] EXAMPLE 8.4.2 : Suppose that F L F5, 20 . From Table 7 we get that f .975 5, 20 3. 29 and f .975 20, 5 6. 33 i.e. f .025 5, 20 1/6. 33 0. 157. Therefore 0. 95 0. 975 " 0. 025 P f .025 5, 20 t F t f .975 5, 20 P¡0. 157 t F t 3. 29¢. JTheorem 8.4.7 : Suppose that X 1 , X 2 , T , X n 1 is a random sample from a normal population with variance @ 21 . Suppose that Y 1 , Y 2 , T , Y n 2 is an independent random sample from a normal population with variance @ 22 . Let s 21 be the sample variance of the X’s and let s 22 be the sample variance of the Y’s. L L s 21 Let F L @ 21 . s 22 L @ 22 Then F has an F distribution with n 1 " 1 and n 2 " 1 degrees of freedom. Proof : From th.8.3.6 we have that n 1 " 1 s 21 n 2 " 1 s 22 L L U1 L D 2 n 1 " 1 and U 2 L D 2 n 2 " 1 @ 21 @ 22 and U 1 and U 2 are independent since the samples are independent. It then follows from th.8.4.4 that F U1 n 1 "1 U2 n 2 "1 But F is distributed like F with n 1 " 1 and n 2 " 1 degrees of freedom. U1 n 1 "1 U2 n 2 "1 n 1 "1 s 21 L @ 21 n 1 "1 n 2 "1 s 22 L @ 22 n 2 "1 s 21 L @ 21 s 22 L @ 22 . ſ 237 [Chapter 8] EXAMPLE 8.4.3 : Suppose that X 1 , X 2 , T , X 11 is a random sample from a normal population with variance @ 21 . Suppose that Y 1 , Y 2 , T , Y 21 is an independent random sample from a normal population with @ 22 . Let s 21 be the sample variance of the X’s and let s 22 be the sample variance of the Y ’s. L L s 21 L @ 21 Then s 22 L F10, 20 and L @ 22 s 21 L @ 21 0. 95 P f .025 10, 20 t t f .975 10, 20 s 22 L @ 22 P P 1 f .975 20, 10 @ 21 @ 22 1 @ 21 @ 22 L s 21 L s 22 L s 21 2 L P i.e. 1 2. 77 the interval L s 22 L t @ 21 @ 22 @ 21 @ 22 1 2. 77 u t f .975 10, 20 @ 21 @ 22 1 s 22 P 3. 42 sL2 u 1 L s 22 L L P 3. 42 u s 21 t s 21 1 t 3. 42 s 21 t 2. 77 u from Table 7 1 2. 77 1 2. 77 s 21 L s 22 L s 21 t 3. 42 sL2 2 L s 21 s 22 s2 , 3. 42 s 12 is a 95% confidence interval for 2 OEXERCISES : Bain and Engelhardt p.286 No.’s 17 and 18. @ 21 . @ 22 238 [Chapter 8] 8.5 LARGE-SAMPLE APPROXIMATIONS JTheorem 8.5.1 : Suppose that Y 7 is a random variable for 7 1, 2, 3, T and that Y 7 has a chi-squared distribution with 7 .degrees of freedom. Let Z 7 Y 7 " 7 . 27 d Then Z 7 v Z where Z is a standard normal random variable. Proof : 7 If Y 7 L D 2 7 then Y 7 is distributed like ! X i where the X i ’s are independent i1 random variables all with a chi-squared distribution with 1 degree of freedom. Then E¡X i ¢ 2 12 1 and varX i 2 2 12 2. 7 ! Xi " 7 1 From the central limit theorem it then follows that Z L N0, 1 . i1 7 2 d Z 7 v Z where ſ Suppose that Y 7 L D 2 7 and that D 2+ 7 is the number such that P¡Y 7 t D 2+ 7 ¢ + i.e. D 2+ 7 is the 100 + percentile of a chi-squared distribution with 7 .degrees of freedom. For large values of 7 .we then get that D 2+ 7 " 7 D 2+ 7 " 7 + P Y7 " 7 t `P Zt where Z L N0, 1 27 27 27 D 2+ 7 " 7 i.e. ` z + where z + is the 100 + percentile of the N0, 1 distribution. 27 Therefore D 2+ 7 ` 27 z + 7. EXAMPLE 8.5.1 : Suppose that 7 30 and + 0. 95. Then D 2+ 30 43. 77 from Table 4 in Bain and Engelhardt. The approximation derived above gives us D 2+ 30 ` 2 30 1. 645 30 42. 74. It can also be shown that if T 7 is distributed like t with 7 degrees of freedom, then d T 7 v Z where Z L N0, 1 . 
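Both large-sample approximations in this section can be checked against exact values. A short Python sketch, assuming SciPy is available; the degrees of freedom and percentiles used are illustrative.

```python
# Check of the Section 8.5 approximations: the chi-square percentile approximation
# chi2_alpha(nu) ~ sqrt(2 nu) z_alpha + nu (Example 8.5.1), and the t percentiles
# approaching the standard normal percentiles as the degrees of freedom grow.
from scipy.stats import chi2, norm, t

nu, alpha = 30, 0.95
z = norm.ppf(alpha)
print("chi-square exact      :", chi2.ppf(alpha, df=nu))    # about 43.77 (Table 4 value)
print("chi-square approximate:", (2 * nu) ** 0.5 * z + nu)  # about 42.74

for df in (5, 30, 120):
    print(f"t_0.975({df}) = {t.ppf(0.975, df):.3f}  vs  z_0.975 = {norm.ppf(0.975):.3f}")
```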
The percentiles of the t distribution with 7 degrees of freedom is therefore approximately equal to the percentiles of a standard normal distribution. 239 [Chapter 9] CHAPTER 9 : POINT ESTIMATION 9.1 INTRODUCTION Suppose that X 1 , X 2 , T , X n are jointly distributed random variables and that their joint distribution depends on certain quantities. These quantities are referred to as parameters of the distribution. For example if X 1 , X 2 , T , X n is a random sample from a normal population with expected value ȝ and variance ı 2 then n x i "6 2 1 f X 1 ,X 2 ,T,X n x 1 , x 2 , T , x n 1 e " 2 @ . 2= @ i1 This distribution depends on the two parameters 6 and @ 2 . Another example is when the joint distribution of X 1 , X 2 , T , X k is a multinomial distribution i.e. n! f X x p x 1 p x 2 T p xk k 1 " p 1 " p 2 "T "p k n"x 1 "x 2 "T"x k . x 1 !x 2 !T x k !n " x 1 " x 2 "T "x k ! 1 2 In this case the parameters of the distribution are n and p 1 , p 2 , T , p k . Usually the properties (e.g. the distribution) of populations are not known completely. It may for instance be known ( or may reasonably be expected to) that the distribution of a population is normal but the expected value and variance may not be known. In this case the quantities 6 and @ 2 are unknown parameters. The purpose of taking a sample from a population is usually to get to know more about the unknown properties of the population. On the basis of a sample the unknown properties e.g. parameters cannot be established exactly and estimates of these parameters are therefore required. We may also be interested to find intervals which we are fairly confident will include the values of these parameters i.e. confidence intervals. It may also be of interest to test whether a parameter is equal to some specific value i.e. we need tests of hypotheses. In chapter 9 we study the problem of estimating unknown parameters of a distribution. In the rest of this chapter we will assume that the distribution of a population depends on a number of parameters say 2 1 , 2 2 , T , 2 k . Let ș 2 1 , 2 2 , T , 2 k . We will assume that X 1 , X 2 , T , X n is a random sample from a population with distribution fx; 2 1 , 2 2 , T , 2 k fx; 2 i.e. n f X 1 ,X 2 ,T,X n x 1 , x 2 , T , x n fx i ; 2 1 , 2 2 , T , 2 k . i1 240 [Chapter 9] Definition 9.1.1 : A statistic T tX 1 , X 2 , T , X n is an estimator of IJș if tx 1 , x 2 , T , x n is used to estimate IJș if the observed values of X 1 , X 2 , T , X n are equal to x 1 , x 2 , T , x n . For the observed values x 1 , x 2 , T , x n , tx 1 , x 2 , T , x n is an estimate of A2 . A special case of IJș is IJș ș in which case we estimate the parameter ș itself. 9.2 SOME METHODS OF ESTIMATION Suppose that the distribution of a random variable X is given by fx; 2 1 , 2 2 , T , 2 k . Then the moments of X will depend on 2 1 , 2 2 , T , 2 k i.e. 6 Uj E¡X j ¢ 6 Uj 2 1 , 2 2 , T , 2 k , say. We then say that 6 U1 , 6 U2 , T , 6 Uk are the population moments for X and all of them will be functions of the parameters of the distribution of X. Definition 9.2.1 : Suppose that X 1 , X 2 , T , X n is a random sample from a population with distribution fx; 2 1 , 2 2 , T , 2 k . The sample moments are defined by n j M Uj 1n ! X i . i1 Note that M U1 X the sample mean. § We will use the notation 2 to indicate an estimate of the parameter ș. If the observed values of X 1 , X 2 , T , X n are x 1 , x 2 , T , x n§ and we use tx 1 , x 2 , T , x n . 
The tx 1 , x 2 , T , x n as an estimate of 2, we use the notation 2 § corresponding estimator tX 1 , X 2 , T , X n is indicated by 2. L 241 [Chapter 9] Definition 9.2.2a : THE METHOD OF MOMENTS ESTIMATORS Suppose that X 1 , X 2 , T , X n is a random sample from a population with distribution fx; 2 1 , 2 2 , T , 2 k . Let 6 Uj 2 1 , 2 2 , T , 2 k , j 1, 2, . . . , k be the population moments. Let M Uj , j 1, 2, . . . , k be the sample moments. If x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n the observed values of M Uj n will be m Uj 1n ! x ji , j 1, 2, . . . , k. i1 § § § The method of moments estimates of 2 1 , 2 2 , T , 2 k , say 2 1 , 2 2 , T , 2 k are those values of 2 1 , 2§2 , T§ , 2 k such § that U U (9.2.1) 6 j 2 1 , 2 2 , T , 2 k m j for j 1, 2, . . . , k. The solution of equations (9.2.1) will depend on x 1 , x 2 , T , x n i.e. there will be functions t 1 , t 2§, . . . , t k of x 1 , x 2 , T , x n such that 2 i t i x 1 , x 2 , T , x n for i 1, 2, . . . , k. The corresponding method of moments estimators of 2 1 , 2 2 , T , 2 k is then given by § 2 i t i X 1 , X 2 , T , X n for i 1, 2, . . . , k. L § § § The method of moments estimators of 2 1 , 2 2 , T , 2 k denoted by 2 1 , 2 2 , T , 2 k are therefore those values of 2 1 , 2 2 , T , 2 k such that § § § 6 Uj 2 1 , 2 2 , T , 2 k M Uj for j 1, 2, . . . , k. L L L L L L The method estimator of IJ2 1 , 2 2 , T , 2 k is given by § § of moments § IJ 2 1 , 2 2 , T , 2 k . L L L EXAMPLES : Example 9.2.1 Suppose that X 1 , X 2 , T , X n is a random sample from a normal population with parameters 6 and @ 2 . In this case 6 U1 6 and 6 U2 E¡X 2 ¢ varX £E¡X¢¤ 2 @ 2 6 2 . If x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n we have that n m U1 1n ! x i x and i1 n m U2 1n ! x 2i i1 n 1n !x i " x x 2 1n ! x i " x 2x i " x x x i1 n 1n 2 i1 n n !x i " x 2 2x !x i " x i1 n 1n !x i " x i1 2 nx 2 i1 2 x 2 n since !x i " x nx " nx 0. i1 (9.2.2) 242 [Chapter 9] § The equations to determine 6 and @ 2 are therefore given by § 6x and n § @ 2 6 2 1n !x i " x 2 x 2 . i1 The solution is given by n § 1 2 6 x and @ 2 1n !x i " x 2 n " n s i1 where s 2 is the sample variance. Example 9.2.2 Suppose that X 1 , X 2 , T , X n is a random sample from a population with an exponential distribution with parameter 2. In this case 6 U1 E¡X¢ § 2. The equation to determine the method of moments estimate of 2 is then given by 2 m U1 x. Example 9.2.3 Suppose that X 1 , X 2 , T , X n is a random sample from a population with an exponential distribution with parameter 2. In this example we want to estimate 1 A2 P¡X u 1¢ § e " 2 . According to the definition of method of moment estimators, 1 the estimate is A 2 Ax e " x from Example 9.2.2. 1 1 Another way to estimate e " 2 is to write the distribution in terms of e " 2 . 1 Let 1 e " 2 . Then ln1 " 1 i.e. 2 " 1 . 2 ln1 243 [Chapter 9] In terms of 1 the density function is given by fx; 1 " ln1 e ln1 x 0 for x 0 otherwise. 1 and m U1 x. In this case 6 U1 1 E¡X¢ " ln1 The equation to determine the method of moments estimate of 1 is therefore " 1§ x ln 1 1 § § i.e. ln 1 " 1 or 1 e " x which is the same estimate as obtained above. The fact that x we get the same estimate is an example of the so-called property of invariance of the method of moments estimates. Example 9.2.4 Suppose that X 1 , X 2 , T , X n is a random sample from a population with an uniform distribution over the interval 2 1 , 2 2 . 
Then fx; 2 1 , 2 2 1 22 " 21 0 In this case 6 U1 2 1 , 2 2 and 6 U2 2 1 , 2 2 for 2 1 x 2 2 otherwise. 21 22 2 varX £E¡X¢¤ 2 2 2 " 2 1 2 2 2 2 2 1 12 4 2 2 2 21 22 22 1 . 3 § § The equations to determine the method of moments estimates 2 1 and 2 2 § § 2 1 22 are m U1 x (9.2.3) 2 §2 §2 § § n 21 21 22 22 m U2 1n ! x 2i (9.2.4) and 3 i1 § § From (9.2.3) we get that 2 2 2x " 2 1 and if we substitute this in (9.2.4) we get that §2 § § § 2 n 2 1 2 1 2x " 2 1 2x " 2 1 1n ! x 2i 3 i1 or 244 [Chapter 9] n §2 § 2 1 2 1 2x " 4x 4x 2 " 3n ! x 2i 0. (9.2.5) i1 § Equation (9.2.5) is a quadratic equation for 2 1 and will § give § us two solutions. Since ș 1 ș 2 we must use that solution for which ș 1 ș 2 . OEXERCISES : Bain and Engelhardt p.328 No. 1 and 2(b), 2(c) and 2(d). METHOD OF MAXIMUM LIKELIHOOD Definition 9.2.2 : Suppose that the joint distribution of X 1 , X 2 , T , X n is given by f X 1 ,X 2 ,T,X n x 1 , x 2 , T , x n f X 1 ,X 2 ,T,X n x 1 , x 2 , T , x n ; 2 for all x 1 , x 2 , T , x n R n . Let ( be the set of all possible values of 2. Given the observed values x 1 , x 2 , T , x n of X 1 , X 2 , T , X n , the function L : ( v R, where L2 fx 1 , x 2 , T , x n ; 2 for 2 ( is defined as the likelihood function. In the case where X 1 , X 2 , T , X n is a random sample from a population with distribution fx; 2 , the likelihood function is given by Lș fx 1 ; ș . fx 2 ; ș . T . fx n ; ș for all ș ȍ . Definition 9.2.3 : Suppose that L2 f X 1 ,X 2 ,T,X n x 1 , x 2 , T , x n ; 2 , 2 (, is the likelihood function for the observed § values x 1 , x 2 , T , x n of the random variables X 1 , X 2 , T , X n . A value 2 ȍ is a maximum likelihood estimate of ș if § f X 1 ,X 2 ,T,X n x 1 , x 2 , T , x n ; 2 max f X 1 ,X 2 ,T,X n x 1 , x 2 , T , x n ; 2 2( max L2 . 2( Given the observed values x 1 , x 2 , T , x n , the value of 2 for which L2 is a§maximum is not necessarily unique. If it is § unique for all values of x 1 , x 2 , T , x n then 2 is a function of x 1 , x 2 , T , x n , say 2 tx 1 , x 2 , T , x n for all§ x 1 , x 2 , T , x n R n . The corresponding likelihood estimator is then given by 2 tX 1 , X 2 , T , X n . L 245 [Chapter 9] NOTE 1 : In case ș consists of only one parameter ș , and if ȍ consists of an open interval and Lș is a maximum for some ș ȍ, then the maximum likelihood estimate § 2 of 2 is a solution of the equation d Lș 0. (9.2.6) d2 If the solution is not unique, it must be determined which one of the solutions gives the maximum value of L2 - remember a solution of (9.2.6) may be a minimum of L2 . § § NOTE 2 : If 2 gives a maximum of Lș , then 2 gives a maximum of§ ln Lș and vice-versa. The maximum likelihood estimate of ș is therefore a value 2 of ș which is a solution of the equation d ln Lș 0. (9.2.7) d2 JTheorem 9.2.1a : Suppose that X 1 , X 2 , T , X n is a random sample from a population with a Poisson distribution with parameter 2. Then the maximum likelihood estimator of 2 is given by § 2 X. L Proof : Suppose that x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n . Then n L2 xi e "2 2 xi! i1 % i1 x i e "n2 2n xi! n i1 and n n i1 i1 ln L2 "n2 ! x i ln2 " ln x i ! . Therefore n ln L2 "n ! x i . 1 0 2 i1 n § or 2 1n ! x i x. d d2 i1 246 [Chapter 9] § To show that 2 x actually gives a maximum value of L2 consider d2 d2 2 ln L2 § 2 2 n ! x i " 12 2 i1 § 2 2 £nx¤ " 12 x n " x n 0 if x 0 i.e. if ! x i 0 i1 which will be the case unless all x i ’s are zero. 
Assuming not all observations are zero, we § x does give a maximum value for Lș and therefore can then conclude that 2 § 2 X. ſ L JTheorem 9.2.1 : THE INVARIANCE PROPERTY OF ML ESTIMATORS Suppose that X 1 § , X 2 , T , X n is a random sample from a population with distribution given by fx; 2 . Let 2 be the maximum likelihood estimator ( ml estimator) of 2. If L § § A u2 is a one-to-one function of 2, then the ml estimator of A is given by A u 2 L i.e. the ml estimator of a function of theta is that function at the ml estimator of 2. L Proof : Let L2 be the likelihood function in terms of 2 i.e. n L2 fx i ; 2 i1 Let L ' IJ be the likelihood function in terms of IJ. To find L ' IJ we need to write the distribution of X in terms of IJ. Suppose that u : R v R is a one to one function i.e. if IJ uș , then ș u "1 IJ . Therefore fx; ș fx; u "1 IJ . Hence n L2 fx i ; 2 i1 n fx i ; u "1 A i1 Lu "1 A L ' A (9.2.8) § § The maximum value of L is obtained when 2 2 and in terms of A when A A. L L Because of the relationship (9.2.8) , which must be the same on both sides § the maximums § § § of (9.2.8) , we must have that 2 u "1 A or A u 2 . ſ L L L L 247 [Chapter 9] JTheorem 9.2.2a : Suppose that X 1 , X 2 , T , X n is a random sample from a population with an exponential distribution with parameter ș. Then the maximum likelihood estimator of ș is given by § 2 X. L Proof : In this case 1 e " 2x 2 0 fx; 2 for x 0 otherwise. Therefore n xi L2 1 e " 2 i1 2 0 if all x i 0 otherwise. Since P¡X i 0¢ 1 and P¡all X i 0¢ 1, we will assume that all x i 0 in which case 1 n L2 1n e " 2 % i1 x i 2 and then n ln L2 "n ln2 " 1 ! x i 2 i1 and n 1 " " 1 . ! xi d ln L2 "n. d2 2 22 i1 § and for 2 2 we have that 1 " "n. § 2 " §12 2 n i.e. ! xi n . ! xi 0 i1 § n2 i1 n § i.e. 2 1n ! x i x give a maximum value for L2 and i1 § therefore 2 X. ſ L 248 [Chapter 9] EXAMPLES : Example 9.2.5 Suppose that X 1 , X 2 , T , X n is a random sample from a population with density function given by e "x"1 for x u 1 fx; 1 0 otherwise Then n L1 e "x i "1 if all x i u 1 i1 0 otherwise e "% i1 x i "1 if all x i u 1 0 otherwise n Let y 1 min£x 1 , x 2 , T , x n ¤. Hence 1 t all x i ’s if and only if 1 t y 1 . Then e "% i1 x i "1 if 1 t y 1 0 if 1 y 1 n L1 n n i1 i1 Note that as 1 gets bigger, !x i " 1 gets smaller i.e. " !x i " 1 gets bigger and therefore e "% i1 x i "1 gets bigger i.e. L1 gets bigger, provided 1 t y 1 . As soon as 1 becomes greater than y 1 the value of L1 becomes zero and stays zero for all 1 y 1 . A graphical representation of L1 is given in Figure 9.1 in Bain and Engelhardt. n Therefore L1 is a maximum for 1 y 1 min£x 1 , x 2 , T , x n ¤ . This is a case where the maximum of L1 is not at a point where the derivative of L1 is zero. Example 9.2.6 Suppose that X 1 , X 2 , T , X n is a random sample from a normal population with expected value 6 and variance @ 2 . Suppose that @ 2 is known but that 6 is unknown. Suppose that x 1 , x 2 , T , x n are the observed values for X 1 , X 2 , T , X n . In this case n L6 1 e " 12 xi@"6 2 2= @ i1 " 12 % ni1 1 n e n 2 2 2 2= @ x i "6 @ 2 249 [Chapter 9] i.e. n ln L6 " n ln2= " n ln@ 2 " 1 2 !x i " 6 2 2 2@ i1 and 2 n d d6 " 1 2 ! 2x i " 6 "1 2@ i1 ln L6 n 12 ! x i " n6 @ i1 12 £nx " n6¤ @ d d6 and therefore § § ln L6 0 for 6 x i.e. 6 x and therefore 6 X. L Example 9.2.7 Suppose that X 1 , X 2 , T , X n is a random sample from a normal population with expected value 6 and variance @ 2 . Suppose that 6 is known but that @ 2 is unknown. 
EXAMPLES :

Example 9.2.5 : Suppose that X_1, ..., X_n is a random sample from a population with density function f(x; η) = e^{−(x−η)} for x ≥ η and 0 otherwise. Then
L(η) = ∏ e^{−(x_i−η)} = e^{−Σ(x_i−η)} if all x_i ≥ η, and 0 otherwise.
Let y_1 = min{x_1, ..., x_n}. Then η ≤ all the x_i's if and only if η ≤ y_1. Note that as η gets bigger, Σ(x_i − η) gets smaller, i.e. −Σ(x_i − η) gets bigger, and therefore e^{−Σ(x_i−η)} = L(η) gets bigger, provided η ≤ y_1. As soon as η becomes greater than y_1, the value of L(η) becomes zero and stays zero for all η > y_1. (A graphical representation of L(η) is given in Figure 9.1 in Bain and Engelhardt.) Therefore L(η) is a maximum for η̂ = y_1 = min{x_1, ..., x_n}. This is a case where the maximum of L(η) is not at a point where the derivative of L(η) is zero.

Example 9.2.6 : Suppose that X_1, ..., X_n is a random sample from a normal population with expected value μ and variance σ², where σ² is known but μ is unknown. Suppose that x_1, ..., x_n are the observed values. In this case
L(μ) = ∏ (1/(√(2π) σ)) e^{−(x_i−μ)²/(2σ²)} = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²))Σ(x_i−μ)²},
i.e. ln L(μ) = −(n/2)ln(2π) − n ln σ − (1/(2σ²))Σ(x_i − μ)², and
d ln L(μ)/dμ = (1/σ²)(Σ x_i − nμ) = (1/σ²)(n x̄ − nμ),
which is 0 for μ = x̄, i.e. μ̂ = x̄ and therefore μ̂ = X̄.

Example 9.2.7 : Suppose that X_1, ..., X_n is a random sample from a normal population with expected value μ and variance σ², where μ is known but σ² is unknown. Suppose that x_1, ..., x_n are the observed values. In this case
ln L(σ²) = −(n/2)ln(2π) − (n/2)ln σ² − (1/(2σ²))Σ(x_i − μ)²
and
d ln L(σ²)/dσ² = −n/(2σ²) + (1/(2σ⁴))Σ(x_i − μ)² = (1/(2σ⁴))[Σ(x_i − μ)² − nσ²],
which is 0 if σ² = (1/n)Σ(x_i − μ)², i.e. σ̂² = (1/n)Σ(X_i − μ)².

Theorem 9.2.2b : Suppose that X_1, ..., X_n is a random sample from a normal population with expected value μ and variance σ², where μ and σ² are unknown, and that x_1, ..., x_n are the observed values. Then μ̂ = X̄ and σ̂² = (1/n)Σ(X_i − X̄)².
Proof : In this case
ln L(μ, σ²) = −(n/2)ln(2π) − (n/2)ln σ² − (1/(2σ²))Σ(x_i − μ)².
Then
∂ ln L(μ, σ²)/∂μ = (1/σ²)(n x̄ − nμ) = 0 for μ = x̄    (9.2.9)
and
∂ ln L(μ, σ²)/∂σ² = (1/(2σ⁴))[Σ(x_i − μ)² − nσ²] = 0 if σ² = (1/n)Σ(x_i − μ)².    (9.2.10)
Substituting (9.2.9) in (9.2.10) gives σ̂² = (1/n)Σ(x_i − x̄)², and therefore μ̂ = X̄ and σ̂² = (1/n)Σ(X_i − X̄)². ∎

EXAMPLES :

Example 9.2.8 : Suppose that X_1, ..., X_n is a random sample from a population with density function f(x; θ, η) = (1/θ)e^{−(x−η)/θ} for x ≥ η and 0 otherwise, where θ > 0 and η are unknown parameters. Let x_1, ..., x_n be the observed values. Then
L(θ, η) = (1/θ^n) e^{−Σ(x_i−η)/θ} if all x_i ≥ η, and 0 otherwise,
i.e.
ln L(θ, η) = −n ln θ − (1/θ)Σ(x_i − η) if all x_i ≥ η, and −∞ otherwise.
Note that for any fixed value of θ the value of −n ln θ − (1/θ)Σ(x_i − η) gets bigger and bigger as η gets bigger (since θ > 0), provided η is less than all the x_i's, i.e. as long as η is less than y_1 = min{x_1, ..., x_n}. As soon as η gets bigger than y_1, the value of ln L(θ, η) becomes −∞ and stays −∞. The maximum value of L(θ, η) is therefore obtained at η = y_1 for any value of θ. Given η = y_1, we must then find the value of θ for which L(θ, y_1) is a maximum. We have
ln L(θ, y_1) = −n ln θ − (1/θ)Σ(x_i − y_1)
and
d ln L(θ, y_1)/dθ = −n/θ + (1/θ²)Σ(x_i − y_1) = 0 if −nθ + Σ(x_i − y_1) = 0, i.e. if θ = (1/n)Σ(x_i − y_1) = x̄ − y_1.
Hence θ̂ = x̄ − y_1 and η̂ = y_1.

Example 9.2.9 : Suppose that X_1, ..., X_n is a random sample from a population with a Weibull density function given by
f(x; θ, β) = (β/θ)(x/θ)^{β−1} e^{−(x/θ)^β} for x > 0 and 0 otherwise.
Let x_1, ..., x_n be the observed values. For all x_i > 0 we have
ln L(θ, β) = n ln β − nβ ln θ + (β − 1)Σ ln x_i − Σ (x_i/θ)^β.
Hence
∂ ln L(θ, β)/∂θ = −nβ/θ + (β/θ^{β+1})Σ x_i^β
and
∂ ln L(θ, β)/∂β = n/β + Σ ln(x_i/θ) − Σ (x_i/θ)^β ln(x_i/θ).
To find the maximum likelihood estimates we have to solve the equations
−nβ/θ + (β/θ^{β+1})Σ x_i^β = 0    (9.2.11)
and
n/β + Σ ln(x_i/θ) − Σ (x_i/θ)^β ln(x_i/θ) = 0.    (9.2.12)
From (9.2.11) we get −nθ^β + Σ x_i^β = 0, i.e.
θ = ( Σ x_i^β / n )^{1/β}.    (9.2.13)
We can substitute this into (9.2.12) to obtain an equation in β to solve. It is not possible to find an analytic solution to this equation, and it therefore becomes necessary to use numerical methods to find a solution.
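Substituting (9.2.13) into (9.2.12) and simplifying gives the profile equation Σ x_i^β ln x_i / Σ x_i^β − 1/β − (1/n)Σ ln x_i = 0 in β alone. The sketch below (not part of the notes) solves it with a simple bisection search, relying on the fact that the left-hand side is increasing in β; the simulated data, seed and bracketing interval are hypothetical, and NumPy is assumed.

```python
import numpy as np

def weibull_profile(beta, x):
    # Equation in beta obtained by substituting (9.2.13) into (9.2.12):
    #   sum(x^b * ln x)/sum(x^b) - 1/b - mean(ln x) = 0
    xb = x ** beta
    return (xb * np.log(x)).sum() / xb.sum() - 1.0 / beta - np.log(x).mean()

def weibull_mle(x, lo=0.01, hi=50.0, tol=1e-10):
    # simple bisection; the profile function is increasing in beta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weibull_profile(mid, x) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    beta_hat = 0.5 * (lo + hi)
    theta_hat = ((x ** beta_hat).mean()) ** (1.0 / beta_hat)   # equation (9.2.13)
    return theta_hat, beta_hat

rng = np.random.default_rng(2)
x = rng.weibull(1.8, size=200) * 3.0    # hypothetical data: beta = 1.8, theta = 3
print(weibull_mle(x))
```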
Once the solution for β̂ is known, it can be substituted in (9.2.13) to find the value of θ̂.

EXERCISES : Bain and Engelhardt p. 328, Nos. 3, 4(a)–4(c), 6, 7 and 13.

9.3 CRITERIA FOR EVALUATING ESTIMATORS

Definition 9.3.1 : The estimator T of τ(θ) is said to be an unbiased estimator of τ(θ) if E[T] = τ(θ) for all possible values of θ. If an estimator T is not an unbiased estimator of τ(θ), it is a biased estimator of τ(θ).

EXAMPLES :

Example 9.3.1 : Suppose that X_1, ..., X_n is a random sample from a population with expected value μ, and let X̄ be the sample mean. Then E[X̄] = μ for all possible values of μ, i.e. X̄ is an unbiased estimator of μ.

Example 9.3.2 : Suppose that X_1, ..., X_n is a random sample from a population with variance σ², and let s² be the sample variance. Then E[s²] = σ² for all possible values of σ², i.e. s² is an unbiased estimator of σ².

Example 9.3.3 : Suppose that X_1, ..., X_n are independent Bernoulli random variables with parameter p, and let p̂ = (1/n)Σ X_i. Then E[p̂] = p, and this is true for all possible values of p, i.e. p̂ is an unbiased estimator of p.

It is often possible to derive several different estimators of a parameter, e.g. the method of moments estimator may differ from the maximum likelihood estimator. It is then necessary to compare the properties of the estimators to decide which one is the "best" estimator. Suppose that T is used as an estimator of τ(θ). The "error" in the estimate is then T − τ(θ). If T is an unbiased estimator of τ(θ), then E["error"] = E[T − τ(θ)] = τ(θ) − τ(θ) = 0 for all possible values of θ. An unbiased estimator is a "good" estimator in the sense that the expected or average error is zero. This is, however, not the only desirable property of an estimator, and sometimes we may even prefer a biased estimator to an unbiased estimator. A further problem is that it is possible to obtain more than one unbiased estimator, and we then have to decide which one of the unbiased estimators is best.

Example 9.3.4 : Suppose that X_1, ..., X_n is a random sample from a population with an exponential distribution with parameter θ. Then E[X_i] = θ and therefore E[X̄] = θ for all values of θ, i.e. X̄ is an unbiased estimator of θ. Let Y_1 = min{X_1, ..., X_n}. Then
F_{Y_1}(y_1) = 1 − [1 − F(y_1)]^n = 1 − e^{−n y_1/θ} for y_1 > 0, and 0 otherwise,
i.e. Y_1 has an exponential distribution with parameter θ/n. Therefore E[Y_1] = θ/n and E[nY_1] = θ. Since this is true for all values of θ, it means that T = nY_1 is an unbiased estimator of θ.

For the unbiased estimator T of τ(θ) to be a "good" estimator we would like the absolute value of the "error", namely |T − τ(θ)|, to be as small as possible with a probability as close to 1 as possible. Given ε > 0, we would therefore like the probability P[|T − τ(θ)| > ε] to be as small as possible. From Chebyshev's inequality we have
P[|T − τ(θ)| ≥ ε] ≤ var(T)/ε²,
i.e. to make P[|T − τ(θ)| ≥ ε] as small as possible, we must make var(T) as small as possible. We therefore expect that estimators with smaller variances are better than those with bigger variances.

Example 9.3.5 : Continuation of Example 9.3.4. For the exponential distribution with parameter θ the variance is θ², i.e. var(X̄) = θ²/n. Since Y_1 has an exponential distribution with parameter θ/n, we have var(Y_1) = θ²/n² and var(nY_1) = n²·θ²/n² = θ². Hence var(X̄) ≤ var(nY_1), and therefore X̄ is a better unbiased estimator of θ than nY_1.
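A short Monte Carlo sketch (not part of the notes) of Examples 9.3.4 and 9.3.5: both X̄ and nY_1 have mean close to θ, but the variance of nY_1 is roughly n times larger. The parameter value, sample size and number of replications are hypothetical; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 20, 100_000           # hypothetical settings
samples = rng.exponential(theta, size=(reps, n))

xbar = samples.mean(axis=1)                 # unbiased, variance theta^2/n
n_y1 = n * samples.min(axis=1)              # unbiased, variance theta^2

print("means :", xbar.mean(), n_y1.mean())  # both close to theta
print("vars  :", xbar.var(), n_y1.var())    # ~ theta^2/n versus ~ theta^2
```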
Definition 9.3.2 : UNIFORMLY MINIMUM VARIANCE UNBIASED ESTIMATORS
An estimator T* of τ(θ) is called a uniformly minimum variance unbiased estimator (UMVUE) of τ(θ) if
1. T* is an unbiased estimator of τ(θ), and
2. for any unbiased estimator T of τ(θ), var(T*) ≤ var(T) for all possible values of θ.
If possible we would like to use uniformly minimum variance unbiased estimators when they can be found. One problem is to know whether or not a given unbiased estimator has minimum variance.

Theorem 9.3.2a : Suppose X_1, ..., X_n is a random sample from a population with density function f(x; θ). If T is an unbiased estimator of τ(θ), then the Cramér–Rao lower bound (CRLB), based on a random sample, is
Var(T) ≥ [τ'(θ)]² / ( n E[ (∂ ln f(X; θ)/∂θ)² ] ) = CRLB.    (9.3.1)
Proof : Assume the case of sampling from a continuous distribution. Define the random variable
U* = U*(X_1, ..., X_n; θ) = ∂ ln f(X_1, ..., X_n; θ)/∂θ = [∂f(X_1, ..., X_n; θ)/∂θ] / f(X_1, ..., X_n; θ).
Then
E[U*] = ∫⋯∫ U*(x_1, ..., x_n; θ) f(x_1, ..., x_n; θ) dx_1⋯dx_n = ∫⋯∫ [∂f(x_1, ..., x_n; θ)/∂θ] dx_1⋯dx_n
 = (∂/∂θ) ∫⋯∫ f(x_1, ..., x_n; θ) dx_1⋯dx_n = (∂/∂θ)(1) = 0,
where we assume that we can interchange the order of differentiation and integration. If T = T(X_1, ..., X_n) is an unbiased estimator of τ(θ), then
τ(θ) = E[T] = ∫⋯∫ T(x_1, ..., x_n) f(x_1, ..., x_n; θ) dx_1⋯dx_n,    (9.3.2)
and if we differentiate with respect to θ, then
τ'(θ) = ∫⋯∫ T(x_1, ..., x_n) [∂f(x_1, ..., x_n; θ)/∂θ] dx_1⋯dx_n = ∫⋯∫ T(x_1, ..., x_n) U*(x_1, ..., x_n; θ) f(x_1, ..., x_n; θ) dx_1⋯dx_n = E[TU*].
Since E[U*] = 0 it follows that
τ'(θ) = E[TU*] = Cov(T, U*) + E[T]E[U*] = Cov(T, U*).    (9.3.3)
Because the correlation coefficient is always between ±1,
[Cor(T, U*)]² ≤ 1 ⇔ [Cov(T, U*)]² ≤ Var(T) Var(U*) ⇔ Var(T) ≥ [τ'(θ)]² / Var(U*).    (9.3.4)
When X_1, ..., X_n represents a random sample, f(X_1, ..., X_n; θ) = f(X_1; θ)⋯f(X_n; θ), so that
U* = ∂ ln f(X_1, ..., X_n; θ)/∂θ = Σ_{i=1}^n ∂ ln f(X_i; θ)/∂θ = Σ_{i=1}^n U(X_i; θ),
where U(X_i; θ) = ∂ ln f(X_i; θ)/∂θ, i = 1, 2, ..., n. Since the X_i's are independent and identically distributed, and E[∂ ln f(X; θ)/∂θ] = 0 by the same argument as for (9.3.2), it follows that
Var(U*) = Σ Var[U(X_i; θ)] = n E[ (∂ ln f(X; θ)/∂θ)² ].    (9.3.5)
Consequently it follows from (9.3.4) and (9.3.5) that
Var(T) ≥ [τ'(θ)]² / ( n E[ (∂ ln f(X; θ)/∂θ)² ] ). ∎
From Theorem 9.3.2a it follows that if T is an unbiased estimator of τ(θ) and var(T) is equal to the right-hand side of (9.3.1), then T is the unbiased estimator with the smallest possible variance.

EXAMPLE 9.3.6 : Suppose that X_1, ..., X_n is a random sample from a population with an exponential distribution with parameter θ. Then f(x; θ) = (1/θ)e^{−x/θ} for x > 0 and 0 otherwise, and for x > 0 we have ln f(x; θ) = −ln θ − x/θ, i.e.
∂ ln f(x; θ)/∂θ = −1/θ + x/θ² = (x − θ)/θ².
Then
E[ (∂ ln f(X; θ)/∂θ)² ] = E[(X − θ)²]/θ⁴ = var(X)/θ⁴ = θ²/θ⁴ = 1/θ².
For τ(θ) = θ we have τ'(θ) = 1, and then
CRLB = 1 / ( n·(1/θ²) ) = θ²/n.
Since var(X̄) = θ²/n from Example 9.3.5, it follows that X̄ is the unbiased estimator of θ with the smallest possible variance. Since this is true for all possible values of θ, X̄ is the uniformly minimum variance unbiased estimator of θ.
Theorem 9.3.2b : Suppose that X_1, ..., X_n is a random sample from a population with density function f(x; θ), and let U* = Σ_{i=1}^n ∂ ln f(X_i; θ)/∂θ. If T is an unbiased estimator of τ(θ) with variance equal to the Cramér–Rao lower bound, then T must be some linear function of U* (or, equivalently, U* must be a linear function of T) with probability 1.
Proof : The equality in (9.3.4) can only be true if [cor(T, U*)]² = 1, which implies that each is a linear function of the other with probability one. ∎
NOTE : When U* = aT + b, the constants a and b for which this is true will usually depend on θ, i.e. we should really write U* = a(θ)T + b(θ).

EXAMPLE 9.3.6a : Continuation of Example 9.3.6. In this case
U* = Σ ∂ ln f(X_i; θ)/∂θ = Σ (X_i − θ)/θ² = (n/θ²)X̄ − n/θ = (n/θ²)(X̄ − θ),
and since X̄ is an unbiased estimator of θ, X̄ must be an unbiased estimator with variance equal to the CRLB. Note that X̄ is also the value of θ for which the derivative of the log-likelihood function is 0, i.e. X̄ is the maximum likelihood estimator of θ.

Theorem 9.3.2c : Suppose that T is an unbiased estimator of τ(θ) with variance equal to the CRLB, and suppose that T* is an unbiased estimator of τ*(θ), also with variance equal to the CRLB. Then τ*(θ) must be a linear function of τ(θ).
Proof : From Theorem 9.3.2b it follows that U* = a(θ)T + b(θ) and also U* = c(θ)T* + d(θ), where a(θ), b(θ), c(θ) and d(θ) depend only on θ and not on the X_i's, while T and T* depend only on the X_i's and in no way depend on θ. It then follows that c(θ)T* + d(θ) = a(θ)T + b(θ), i.e.
T* = [a(θ)/c(θ)]T + [b(θ) − d(θ)]/c(θ).
Note that a(θ)/c(θ) depends only on θ and not on the X_i's. But since T* can only depend on the X_i's, we must have that a(θ)/c(θ) is some constant, say e, that does not depend on θ or the X_i's. Similarly [b(θ) − d(θ)]/c(θ) must be some constant f for which the same is true. Hence T* = eT + f. But since E[T*] = τ*(θ) for all θ, E[T] = τ(θ) for all θ, and e and f are constants, it must be true that τ*(θ) = e τ(θ) + f for all θ. ∎

EXAMPLE 9.3.6b : Continuation of Example 9.3.6a. Since X̄ is an unbiased estimator of θ with variance equal to the CRLB, there cannot exist unbiased estimators of 1/θ or e^θ with variance equal to the CRLB. This does not mean, however, that UMVU estimators of these functions do not exist — we simply cannot use the Cramér–Rao lower bound to show that such estimators are UMVUE. We will pursue this problem of how to find UMVU estimators further in Chapter 10.
NOTE : An estimator T of τ(θ) which is unbiased and with variance equal to the CRLB will be called a CRLB estimator of τ(θ).

Theorem 9.3.2d : If T is an unbiased estimator of τ(θ) such that U* = c(θ)T + d(θ), then T is a CRLB estimator of τ(θ) and var(T) = CRLB = τ'(θ)/c(θ).
Proof : Since T is an unbiased estimator of τ(θ) and U* is a linear function of T, T must be the CRLB estimator, i.e. var(T) = CRLB. Furthermore var(U*) = [c(θ)]² var(T) = [c(θ)]² CRLB, and since CRLB = [τ'(θ)]²/var(U*) (from Theorem 9.3.2a), we get
CRLB = [τ'(θ)]² / ( [c(θ)]² CRLB ), i.e. [CRLB]² = [τ'(θ)/c(θ)]², i.e. var(T) = CRLB = τ'(θ)/c(θ). ∎

EXAMPLE 9.3.6c : Continuation of Example 9.3.6b. In this case U* = (n/θ²)X̄ − n/θ, and since X̄ is an unbiased estimator of θ, it is the CRLB estimator of τ(θ) = θ, and therefore
var(T) = τ'(θ)/c(θ) = 1/(n/θ²) = θ²/n,
which is the CRLB obtained in Example 9.3.6.

Theorem 9.3.2e : Suppose that X_1, ..., X_n is a random sample from a population with density function f(x; θ). If T is the CRLB estimator of τ(θ), where τ is a one-to-one function, and if the maximum likelihood estimator θ̂ of θ is unique, then T is some function of θ̂.
Proof : Note that U* is the derivative of the log of the likelihood function with respect to θ, i.e. θ̂ is the value of θ for which U* = 0.
If T is the CRLB estimator of τ(θ), then U* = c(θ)T + d(θ), so the value of θ for which U* = 0 depends on T, i.e. θ̂ is a function of T, say θ̂ = u(T), where u is some one-to-one function. The maximum likelihood estimator of τ(θ), say τ̂, is then given by τ̂ = τ(θ̂) = τ(u(T)). Since u and τ are both one-to-one functions, T = u^{−1}(τ^{−1}(τ̂)), i.e. T is a function of the maximum likelihood estimator. ∎

EXAMPLE 9.3.6d : Continuation of Example 9.3.6c. In this case the CRLB estimator of θ is X̄, which is also the maximum likelihood estimator of θ.

EXAMPLE 9.3.7 : Suppose that X_1, ..., X_n are independent random variables with a geometric distribution with parameter θ. Let τ(θ) = 1/θ. Since E[X_i] = 1/θ, it follows that X̄ = Σ X_i/n is an unbiased estimator of 1/θ, and since var(X_i) = (1 − θ)/θ², we have var(X̄) = (1 − θ)/(nθ²). We have
ln f(x; θ) = ln θ + (x − 1)ln(1 − θ)
and
∂ ln f(x; θ)/∂θ = 1/θ − (x − 1)/(1 − θ) = (1 − xθ)/[θ(1 − θ)],
and therefore
E[ (∂ ln f(X; θ)/∂θ)² ] = E[(1 − Xθ)²]/[θ(1 − θ)]² = θ² E[(X − 1/θ)²]/[θ²(1 − θ)²] = var(X)/(1 − θ)² = [(1 − θ)/θ²]/(1 − θ)² = 1/[θ²(1 − θ)],
since E[X] = 1/θ. For τ(θ) = 1/θ we have τ'(θ) = −1/θ², and therefore
CRLB = (1/θ⁴) / ( n/[θ²(1 − θ)] ) = (1 − θ)/(nθ²),
which is equal to var(X̄), i.e. X̄ is the CRLB estimator of 1/θ. We also have
U* = Σ ∂ ln f(X_i; θ)/∂θ = Σ (1 − X_iθ)/[θ(1 − θ)] = n/[θ(1 − θ)] − [n/(1 − θ)]X̄.
Since X̄ is an unbiased estimator of 1/θ and U* is a linear function of X̄, it also follows from this that X̄ is the CRLB estimator of 1/θ and that the variance of X̄ is equal to
τ'(θ)/c(θ) = (−1/θ²)/(−n/(1 − θ)) = (1 − θ)/(nθ²),
which is the value of the CRLB obtained above.
The maximum likelihood estimator of θ is that value for which U* = 0, i.e. n/[θ(1 − θ)] − nX̄/(1 − θ) = 0, i.e. θ̂ = 1/X̄, and the maximum likelihood estimator of 1/θ is 1/(1/X̄) = X̄, i.e. in this case the maximum likelihood estimator is the same as the CRLB estimator. Note that θ is not a linear function of 1/θ and therefore cannot have a CRLB estimator — not even the maximum likelihood estimator 1/X̄ is a CRLB estimator, and we do not even know whether it is unbiased.

Definition 9.3.3 : The relative efficiency of an unbiased estimator T of τ(θ) to another unbiased estimator T* of τ(θ) is defined by
re(T, T*) = Var(T*)/Var(T).
An unbiased estimator T* of τ(θ) is said to be efficient if re(T, T*) ≤ 1 for all unbiased estimators T of τ(θ) and for all possible values of θ. The efficiency of an unbiased estimator T of τ(θ) is given by e(T) = re(T, T*), where T* is an efficient estimator of τ(θ).

EXAMPLE 9.3.8 : Continuation of Examples 9.3.4, 9.3.5 and 9.3.6. Suppose that X_1, ..., X_n is a random sample from a population with an exponential distribution with parameter θ. Then X̄ is an unbiased estimator of θ with var(X̄) = θ²/n, and nY_1 is also an unbiased estimator with var(nY_1) = θ². Therefore
re(nY_1, X̄) = Var(X̄)/Var(nY_1) = (θ²/n)/θ² = 1/n,
i.e. for large n the relative efficiency of nY_1 with respect to X̄ is very small. From Example 9.3.6 we know that X̄ is the unbiased estimator of θ with the smallest possible variance for all values of θ, i.e. X̄ is an efficient estimator of θ. Since X̄ is an efficient estimator of θ, the efficiency of nY_1 is given by e(nY_1) = re(nY_1, X̄) = 1/n.

So far we have considered only unbiased estimators. It is possible to have an estimator T of τ(θ) which has a very small bias but for which the differences T − τ(θ) are always very small. Such an estimator may be preferred to any possible unbiased estimator of τ(θ).

Definition 9.3.4 : If T is an estimator of τ(θ), then the bias is given by b(T) = E[T] − τ(θ) and the mean squared error is given by MSE(T) = E[{T − τ(θ)}²].
NOTE : If we call T − τ(θ) the "error" of the estimator, then MSE(T) = E[("error")²].

Theorem 9.3.2 : If T is an estimator of τ(θ), then MSE(T) = var(T) + [b(T)]².
Proof :
MSE(T) = E[{T − τ(θ)}²] = E[{(T − E[T]) + (E[T] − τ(θ))}²]
 = E[{T − E[T]}²] + 2(E[T] − τ(θ))E[T − E[T]] + (E[T] − τ(θ))²
 = var(T) + [b(T)]²,
since E[T − E[T]] = E[T] − E[T] = 0. ∎

EXAMPLE 9.3.9 : Suppose that X_1, ..., X_n is a random sample from a population with density function f(x; η) = e^{−(x−η)} for x > η and 0 otherwise. Then, substituting y = x − η,
E[X] = ∫_η^∞ x e^{−(x−η)} dx = ∫_0^∞ (y + η)e^{−y} dy = Γ(2) + ηΓ(1) = 1 + η
and
E[X²] = ∫_0^∞ (y + η)² e^{−y} dy = Γ(3) + 2ηΓ(2) + η²Γ(1) = 2 + 2η + η²,
so that var(X) = 2 + 2η + η² − (1 + η)² = 1.
Hence E[X̄ − 1] = 1 + η − 1 = η, i.e. X̄ − 1 is an unbiased estimator of η, and var(X̄ − 1) = var(X̄) = 1/n.
Let F(x) be the distribution function of the population. For x < η, F(x) = 0, and for x ≥ η, substituting y = t − η,
F(x) = ∫_η^x e^{−(t−η)} dt = ∫_0^{x−η} e^{−y} dy = 1 − e^{−(x−η)}.
Let Y_1 = min{X_1, ..., X_n}. Then
F_{Y_1}(y_1) = 1 − [1 − F(y_1)]^n = 1 − e^{−n(y_1−η)} for y_1 ≥ η and 0 otherwise,
and f_{Y_1}(y_1) = n e^{−n(y_1−η)} for y_1 ≥ η and 0 otherwise. Therefore, substituting u = n(y_1 − η),
E[Y_1] = ∫_η^∞ y_1 n e^{−n(y_1−η)} dy_1 = ∫_0^∞ (u/n + η)e^{−u} du = 1/n + η.
If Y_1 were to be used as an estimator of η, it would not be an unbiased estimator. But
MSE(Y_1) = E[{Y_1 − η}²] = ∫_η^∞ (y_1 − η)² n e^{−n(y_1−η)} dy_1 = ∫_0^∞ (u/n)² e^{−u} du = Γ(3)/n² = 2/n².
Therefore MSE(X̄ − 1) = var(X̄) = 1/n ≥ 2/n² = MSE(Y_1) whenever n ≥ 2. Although Y_1 is not an unbiased estimator of η, its MSE is smaller than that of the unbiased estimator X̄ − 1.

EXERCISES : Bain and Engelhardt p. 330, Nos. 15, 17, 21, 28(a)–(d) and 33(a)–(d).
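A brief simulation sketch of Example 9.3.9 (not part of the notes): the biased estimator Y_1 has clearly smaller mean squared error than the unbiased estimator X̄ − 1. The value of η, the sample size and the number of replications are hypothetical; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(4)
eta, n, reps = 5.0, 10, 200_000                    # hypothetical settings
x = eta + rng.exponential(1.0, size=(reps, n))     # density e^{-(x-eta)}, x > eta

t_unbiased = x.mean(axis=1) - 1.0                  # X-bar - 1, unbiased for eta
t_min = x.min(axis=1)                              # Y_1, biased but small MSE

for name, t in [("X-bar - 1", t_unbiased), ("Y_1", t_min)]:
    print(name, "bias:", t.mean() - eta, "MSE:", ((t - eta) ** 2).mean())
# MSE(X-bar - 1) ~ 1/n = 0.1 ; MSE(Y_1) ~ 2/n^2 = 0.02
```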
9.4 LARGE-SAMPLE PROPERTIES

The properties of estimators discussed so far were for fixed values of n. It turns out that although estimators may have undesirable properties for small n, they may have very good asymptotic properties as n tends to infinity, and may therefore be useful estimators for large values of n.

Definition 9.4.1 : Let {T_n} be a sequence of estimators of τ(θ), i.e. T_n is an estimator of τ(θ) if the sample size is n. These estimators are said to be consistent estimators of τ(θ) if for every ε > 0
lim_{n→∞} P[|T_n − τ(θ)| < ε] = 1 for every θ ∈ Ω.
For a consistent estimator we know that if the sample size is large, there is a probability close to 1 that the value of the estimator will be very close to τ(θ), irrespective of the true value of θ. Note that the sequence {T_n} of estimators of τ(θ) is consistent if T_n → τ(θ) in probability.

Definition 9.4.2 : The estimators {T_n} are said to be mean squared error consistent estimators of τ(θ) if
lim_{n→∞} E[{T_n − τ(θ)}²] = lim_{n→∞} MSE(T_n) = 0 for every θ ∈ Ω.
For mean squared error consistent estimators the expected value of the squared error of the estimate is very small for large n.

Definition 9.4.3 : The estimators {T_n} are said to be asymptotically unbiased if
lim_{n→∞} E[T_n] = τ(θ) for every θ ∈ Ω.
For asymptotically unbiased estimators the expected value of T_n is very close to τ(θ) for large n, irrespective of the value of θ.

Theorem 9.4.1 : A sequence of estimators of τ(θ) is mean squared error consistent if and only if it is asymptotically unbiased and lim_{n→∞} var(T_n) = 0.
Proof : By Theorem 9.3.2,
MSE(T_n) = var(T_n) + [E(T_n) − τ(θ)]².    (9.4.1)
Therefore if {T_n} is asymptotically unbiased and lim var(T_n) = 0, then lim MSE(T_n) = 0 for every θ ∈ Ω, i.e. {T_n} is MSE consistent. On the other hand, if {T_n} is MSE consistent then lim MSE(T_n) = 0, and this can only happen if both (non-negative) terms on the right-hand side of (9.4.1) tend to 0, i.e. lim var(T_n) = 0 and lim E[T_n] = τ(θ), i.e. {T_n} is asymptotically unbiased. ∎

EXAMPLE 9.4.1 : Let X_1, ..., X_n be a random sample from a population with an exponential distribution with parameter θ, and let T_n = 1/X̄_n be used as an estimator of 1/θ. Note that X̄_n is the maximum likelihood estimator of θ, and therefore 1/X̄_n is the maximum likelihood estimator of 1/θ. Let
Y_n = 2Σ X_i/θ = 2nX̄_n/θ ~ χ²(2n) by Theorem 8.3.3, since Σ X_i ~ GAM(θ, n).
Hence T_n = (2n/θ)Y_n^{−1}, where Y_n ~ χ²(2n). Therefore, by Theorem 8.3.2 with r = −1 and using Γ(n) = (n − 1)Γ(n − 1),
E[T_n] = (2n/θ) E[Y_n^{−1}] = (2n/θ)·1/[2(n − 1)] = [n/(n − 1)](1/θ).
Therefore T_n = 1/X̄_n is not an unbiased estimator of 1/θ for any finite value of n, but since lim E[T_n] = 1/θ, it is an asymptotically unbiased estimator of 1/θ. Furthermore, using r = −2 in Theorem 8.3.2 and Γ(n) = (n − 1)(n − 2)Γ(n − 2),
E[T_n²] = (2n/θ)² E[Y_n^{−2}] = (4n²/θ²)·1/[4(n − 1)(n − 2)] = n²/[θ²(n − 1)(n − 2)],
and
var(T_n) = n²/[θ²(n − 1)(n − 2)] − n²/[θ²(n − 1)²] = (n²/θ²)·[(n − 1) − (n − 2)]/[(n − 1)²(n − 2)] = n²/[θ²(n − 1)²(n − 2)],
i.e. lim var(T_n) = 0 for all values of θ. From Theorem 9.4.1 it then follows that T_n is MSE consistent for 1/θ.

Theorem 9.4.2 : If a sequence {T_n} is mean squared error consistent for τ(θ), then it is also consistent for τ(θ).
Proof : We have
1 ≥ P[|T_n − τ(θ)| < ε] = P[{T_n − τ(θ)}² < ε²] ≥ 1 − E[{T_n − τ(θ)}²]/ε² = 1 − MSE(T_n)/ε²
by the Markov inequality, and since {T_n} is MSE consistent, lim MSE(T_n) = 0 and therefore lim P[|T_n − τ(θ)| < ε] = 1 for any ε > 0, i.e. {T_n} is consistent. ∎

Theorem 9.4.3 : If {T_n} is consistent for τ(θ) and g is everywhere continuous, then {g(T_n)} is consistent for g(τ(θ)).
Proof : This follows from Theorem 7.7.2 with Y_n = T_n and c = τ(θ), and the fact that being consistent is the same as convergence in probability. ∎
It follows from Theorem 9.4.3 that if θ̂_n is consistent for θ and τ is a continuous function, then τ(θ̂_n) is consistent for τ(θ).
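The following minimal sketch (not part of the notes) illustrates Example 9.4.1 numerically: the bias and MSE of T_n = 1/X̄_n as an estimator of 1/θ both shrink as n grows, consistent with asymptotic unbiasedness and MSE consistency. The parameter value, sample sizes and replication count are hypothetical; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, reps = 2.0, 50_000                      # hypothetical true parameter

for n in (5, 20, 100, 500):
    x = rng.exponential(theta, size=(reps, n))
    t_n = 1.0 / x.mean(axis=1)                 # T_n = 1 / X-bar_n, estimator of 1/theta
    bias = t_n.mean() - 1.0 / theta            # ~ (1/theta)/(n-1), vanishes with n
    mse = ((t_n - 1.0 / theta) ** 2).mean()
    print(f"n={n:4d}  bias={bias:.5f}  MSE={mse:.5f}")
```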
Definition 9.4.4 : Let {T_n} and {T_n*} be two asymptotically unbiased sequences of estimators of τ(θ). The asymptotic relative efficiency of T_n relative to T_n* is defined by
are(T_n, T_n*) = lim_{n→∞} var(T_n*)/var(T_n).
The sequence {T_n*} is said to be asymptotically efficient if are(T_n, T_n*) ≤ 1 for all other asymptotically unbiased sequences {T_n}. The asymptotic efficiency of an asymptotically unbiased sequence {T_n} is given by ae(T_n) = are(T_n, T_n*), where {T_n*} is asymptotically efficient.

EXAMPLE 9.4.2 : Suppose that X_1, ..., X_n is a random sample from a population with an exponential distribution with parameter θ. Then f(x; θ) = (1/θ)e^{−x/θ}, ln f(x; θ) = −ln θ − x/θ, ∂ ln f(x; θ)/∂θ = (x − θ)/θ², and
E[ (∂ ln f(X; θ)/∂θ)² ] = E[(X − θ)²]/θ⁴ = var(X)/θ⁴ = 1/θ².
If we let τ(θ) = 1/θ, then τ'(θ) = −1/θ² and the CRLB for an unbiased estimator of 1/θ is given by (1/θ⁴)/(n/θ²) = 1/(nθ²). From Example 9.4.1 we have that for T_n = 1/X̄_n,
E[T_n] = [n/(n − 1)](1/θ) and var(T_n) = n²/[θ²(n − 1)²(n − 2)].
Then lim_{n→∞} CRLB/var(T_n) = lim_{n→∞} (n − 1)²(n − 2)/n³ = 1, i.e. although the T_n's are not unbiased estimators, they are in the limit, and in the limit the variance of T_n is as small as the variance of any sequence of unbiased estimators.

ASYMPTOTIC PROPERTIES OF MLEs
Under certain conditions it can be shown that maximum likelihood estimators possess very desirable properties. Let θ̂_n be the solution of the maximum likelihood equations. Then under certain conditions:
1. θ̂_n exists and is unique;
2. θ̂_n is consistent;
3. θ̂_n is asymptotically normal with asymptotic mean θ and asymptotic variance 1/( n E[ (∂ ln f(X; θ)/∂θ)² ] );
4. θ̂_n is asymptotically efficient.
Note that the asymptotic variance is equal to the CRLB for an unbiased estimator of θ. It also follows from Theorem 7.7.6 that if τ is a function with non-zero derivative and τ̂_n = τ(θ̂_n), then τ̂_n is asymptotically normal with asymptotic mean τ(θ) and asymptotic variance [τ'(θ)]²·CRLB, which is the CRLB for an unbiased estimator of τ(θ).

EXAMPLE 9.4.3 : Suppose that X_1, ..., X_n is a random sample from a population with an exponential distribution with parameter θ. Then X̄_n is the maximum likelihood estimator of θ. From the central limit theorem we know that asymptotically X̄_n has a normal distribution with expected value θ and asymptotic variance θ²/n. Since the CRLB for an unbiased estimator of θ is θ²/n, this shows that X̄_n is an efficient estimator of θ. From the properties of MLEs it also follows that X̄_n is asymptotically efficient. Now let R(θ) = P[X > t] = e^{−t/θ}, where t > 0 is some fixed number (R(t) is the reliability as defined in Chapter 16). Then R̂_n = e^{−t/X̄_n} is the maximum likelihood estimator of R(θ) and, by Theorem 7.7.6, is asymptotically normally distributed with asymptotic expected value R(θ) and asymptotic variance
var(R̂_n) ≈ [R'(θ)]²(θ²/n) = [e^{−t/θ}·t/θ²]²(θ²/n).

9.5 BAYES AND MINIMAX ESTIMATORS

Let T be an estimator of τ(θ). The difference between the value of τ(θ) and the value of the estimator T is referred to as the "error". This error may have certain consequences for a person making use of the estimate instead of the value of τ(θ). These consequences, if they can be measured in some way by a non-negative real number, are referred to as the loss suffered by using the estimate.

Definition 9.5.1 : If T is an estimator of τ(θ), then a loss function is any real-valued function L(t; θ) such that L(t; θ) ≥ 0 for all values of t and L(t; θ) = 0 if t = τ(θ).
NOTE that L(t; θ) = {t − τ(θ)}² satisfies the conditions to be a loss function.

Definition 9.5.2 : The risk function is defined as R_T(θ) = E[L(T; θ)], i.e. the risk function is the expected value of the loss using T as an estimate of τ(θ).
NOTE that the value of the risk function depends on the value of θ.
NOTE that if the loss function is {t − τ(θ)}², then the risk function is the MSE.

If one is faced with estimating a parameter or a function of a parameter, one way to proceed could be to choose an appropriate loss function for the particular problem and then try to make the expected loss as small as possible for all values of the parameter.

Definition 9.5.3 : An estimator T_1 is a better estimator than T_2 if and only if R_{T_1}(θ) ≤ R_{T_2}(θ) for all θ ∈ Ω and R_{T_1}(θ) < R_{T_2}(θ) for at least one θ ∈ Ω. An estimator T is admissible if and only if there is no better estimator.

We will only consider admissible estimators, since if an estimator is not admissible, there is some estimator with risk less than or equal to it for all values of θ and definitely better for some values of θ.
Typically an estimator will have smaller risk for some values of θ and bigger risk for other values of θ than some other estimator.

Definition 9.5.4 : An estimator T_1 is a minimax estimator if max_θ R_{T_1}(θ) ≤ max_θ R_T(θ) for every estimator T, i.e.
max_θ R_{T_1}(θ) = min_T max_θ R_T(θ).
One will use a minimax estimator if one wants to make the worst possible expected loss as small as possible. It can happen that a minimax estimator will have a slightly smaller maximum expected loss than another estimator but have much bigger expected losses than the other estimator for almost all other values of θ. To avoid this pitfall one may consider an average value of the expected loss. In statistics we usually calculate the expected value of a random variable using the probability distribution of the random variable; to calculate such an average loss one will need some "probability distribution" for the values of θ. The fact that we use such a distribution to determine the average expected loss does not mean that the parameter becomes a random variable, the value of which is determined by the outcome of some experiment. Let p be some density function that has high values for values of θ which are considered very likely and small values for values of θ considered very unlikely. If no such prior knowledge of θ exists, one may consider using a uniform distribution.

Definition 9.5.5 : For a random sample from a population with distribution f(x; θ), the Bayes risk of an estimator T relative to a risk function R_T(θ) and density function p(θ) is the average risk with respect to p(θ), namely
A_T = E_θ[R_T(θ)] = ∫_Ω R_T(θ) p(θ) dθ.

Definition 9.5.6 : For a random sample from a population with distribution f(x; θ), the Bayes estimator T* relative to the risk function R_T(θ) and distribution p(θ) is the estimator with minimum expected risk, i.e.
E_θ[R_{T*}(θ)] ≤ E_θ[R_T(θ)] for every estimator T.
NOTE that the distribution p(θ) is used to calculate the average risk. This is an additional assumption about the parameter θ and is a tool that can be used to choose between two estimators. Hopefully a researcher will have prior knowledge about the likely values of the parameter θ to help in choosing the function p(θ). The distribution p(θ) is referred to as the prior distribution.

Definition 9.5.7 : The conditional density of θ given the sample observations x = (x_1, ..., x_n) is called the posterior density and is given by
f_{θ|X}(θ) = f(x_1, ..., x_n | θ) p(θ) / ∫_{−∞}^∞ f(x_1, ..., x_n | θ) p(θ) dθ.
NOTE that in the above definition, if X_1, ..., X_n and θ were all ordinary random variables, f(x_1, ..., x_n | θ)p(θ) would be the joint density function of all the random variables.

The Bayes estimator is the estimator that minimises the average risk over θ, namely
E_θ[R_T(θ)] = E_θ{ E_{X|θ}[L(T; θ)] } = E_X{ E_{θ|X}[L(T; θ)] }.
If for all values of x we use for T, which is a function of x, that value which minimises E_{θ|X=x}[L(T; θ)] for that particular x, then E_X{E_{θ|X}[L(T; θ)]} will be minimised.

Theorem 9.5.1 : If X_1, ..., X_n is a random sample from a population with distribution f(x; θ), then the Bayes estimator is the estimator that minimises the expected loss relative to the posterior distribution of θ given x, E_{θ|X}[L(T; θ)].
Proof : Follows from the remark before the theorem. ∎

Theorem 9.5.2 : The Bayes estimator T of τ(θ) under the squared error loss function L(T; θ) = [T − τ(θ)]² is the conditional expected value of τ(θ) relative to the posterior distribution,
T = E_{θ|X}[τ(θ)] = ∫_{−∞}^∞ τ(θ) f_{θ|X}(θ) dθ.
Proof : For any random variable X the quantity g(c) = E[{X − c}²] is minimised if c = E[X], since
g(c) = E[X² − 2cX + c²] = E[X²] − 2cE[X] + c² and dg(c)/dc = −2E[X] + 2c,
which is equal to 0 if c = E[X]. Therefore E_{θ|X}[L(T; θ)] = ∫ {T(x) − τ(θ)}² f_{θ|X}(θ) dθ is minimised if T(X) = ∫ τ(θ) f_{θ|X}(θ) dθ. ∎

Theorem 9.5.3 : The Bayes estimator T of θ under the absolute error loss function L(T; θ) = |T − θ| is the median of the posterior distribution f_{θ|X}(θ).
Proof : For any random variable X with density function f(x), the quantity g(c) = E[|X − c|] is minimised if c is the median of X, since
g(c) = ∫_{−∞}^c (c − x)f(x) dx + ∫_c^∞ (x − c)f(x) dx = cF(c) − ∫_{−∞}^c x f(x) dx + ∫_c^∞ x f(x) dx − c[1 − F(c)]
and
dg(c)/dc = F(c) + cf(c) − cf(c) − cf(c) − [1 − F(c)] + cf(c) = 2F(c) − 1,
which is equal to 0 if F(c) = 1/2, i.e. if c is the median of X. Therefore E_{θ|X}[L(T; θ)] = ∫ |T(x) − θ| f_{θ|X}(θ) dθ is minimised if T(X) is the median of f_{θ|X}(θ). ∎
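The sketch below (not part of the notes) approximates the posterior on a grid for an exponential likelihood and reads off the Bayes estimates of Theorems 9.5.2 and 9.5.3, i.e. the posterior mean (squared error loss) and the posterior median (absolute error loss). The prior shown, the data, and the grid are all hypothetical choices for illustration; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(2.0, size=15)                 # hypothetical exponential data

theta = np.linspace(0.1, 10.0, 4000)              # grid over the parameter space
prior = theta * np.exp(-theta)                    # illustrative unnormalised prior p(theta)
loglik = -len(x) * np.log(theta) - x.sum() / theta
post = prior * np.exp(loglik - loglik.max())      # unnormalised posterior on the grid
post /= np.trapz(post, theta)                     # normalise numerically

bayes_sq = np.trapz(theta * post, theta)          # posterior mean (Theorem 9.5.2)
cdf = np.cumsum(post) * (theta[1] - theta[0])
bayes_abs = theta[np.searchsorted(cdf, 0.5)]      # posterior median (Theorem 9.5.3)
print(bayes_sq, bayes_abs)
```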
CHAPTER 10 : SUFFICIENCY AND COMPLETENESS

10.1 INTRODUCTION

Suppose that X_1, ..., X_n are jointly distributed random variables and that their joint distribution depends on certain parameters. The observed values of the random variables contain information about the distribution of the random variables and therefore contain information about the values of the parameters, which we can use to estimate those values. The likelihood function gives us information on the relation between the observed values and the parameters, since it depends on the observed values and on the values of the parameters. A question that arises is whether or not it is really necessary to know all the individual observed values, or whether it is possible that all the information about the parameters is contained in a few properties (functions) of the observed values.

EXAMPLE 10.1.1 : Let X_1, ..., X_n be n independent Bernoulli random variables with probability of success θ, and let x_1, ..., x_n be the observed values. Then the likelihood function for θ, given the observed values, is
L(θ) = θ^{Σx_i}(1 − θ)^{n−Σx_i}.
The value of this function can be determined without knowing the individual x_i's; we only need to know the sum of the x_i's. Intuitively we therefore feel that all the information in the sample about the value of θ is contained in the sum of the observations. To formalise this, let us determine the conditional distribution of the x_i's given the sum of the x_i's. Let S = Σ X_i. Then the conditional distribution of the X_i's given S is as follows:
f_{X_1,...,X_n|S}(x_1, ..., x_n | s) = P[X_1 = x_1, ..., X_n = x_n | S = s]
 = P[X_1 = x_1, ..., X_n = x_n, S = s]/P[S = s]
 = P[X_1 = x_1, ..., X_n = x_n]/P[S = s] if Σ x_i = s, and 0 if Σ x_i ≠ s
 = θ^{Σx_i}(1 − θ)^{n−Σx_i} / [ C(n, s) θ^s (1 − θ)^{n−s} ] = 1/C(n, s) if Σ x_i = s, and 0 if Σ x_i ≠ s,
where C(n, s) is the binomial coefficient. This does not depend on θ and therefore contains no information about θ. Now let T = t(X_1, ..., X_n) and let C_t = {(x_1, ..., x_n) | t(x_1, ..., x_n) = t}. Then the conditional distribution of T given S is
f_{T|S}(t | s) = P[T = t | S = s] = Σ_{C_t} f_{X_1,...,X_n|S}(x_1, ..., x_n | s),
which does not depend on θ. Neither the individual x_i's nor any function of them contains any information about θ if Σ X_i is known.
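A minimal numerical check of Example 10.1.1 (not part of the notes): enumerating all binary outcomes with a given sum shows that the conditional probabilities given S = s are the same for any two values of θ, namely 1/C(n, s). The values n = 4, s = 2 and the two θ values are hypothetical; only the Python standard library and NumPy-free code are used.

```python
from itertools import product
from math import comb

def conditional_given_sum(n, s, theta):
    # P[X_1=x_1,...,X_n=x_n | sum = s] for every binary vector with sum s
    probs = {}
    for x in product((0, 1), repeat=n):
        if sum(x) == s:
            joint = theta ** s * (1 - theta) ** (n - s)
            marg = comb(n, s) * theta ** s * (1 - theta) ** (n - s)
            probs[x] = joint / marg            # reduces to 1 / C(n, s)
    return probs

n, s = 4, 2
print(conditional_given_sum(n, s, 0.3))        # every vector has probability 1/6
print(conditional_given_sum(n, s, 0.8))        # identical: free of theta
```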
10.2 SUFFICIENT STATISTICS

Let X = (X_1, ..., X_n) be a vector of random variables with joint distribution f(x; θ), where x = (x_1, ..., x_n) and θ = (θ_1, ..., θ_r) is a vector of parameters. Let S be a vector of k statistics, say
S = (S_1, ..., S_k) = ( s_1(X_1, ..., X_n), ..., s_k(X_1, ..., X_n) ).
We will assume that the functions s_1, ..., s_k are such that no one of them can be expressed in terms of the other functions.

Definition 10.2.1 : Let X be a vector of random variables with joint distribution f(x; θ) and let S = (S_1, ..., S_k) be a k-dimensional statistic. Then S_1, ..., S_k is a set of jointly sufficient statistics for θ if, for any other vector of statistics T, the conditional distribution of T given S, denoted by f_{T|S}, does not depend on θ. In the case of a single parameter θ and a single statistic S, we simply say that S is sufficient for θ.

The vector statistic S will be jointly sufficient for θ if any other vector statistic T contains no additional information about θ beyond the information we already have in S. When we consider "other" statistics T we have in mind functions of the X_i's, say T_i = t_i(X_1, ..., X_n), i = 1, ..., q, which cannot be expressed in terms of each other or in terms of the functions that define S_1, ..., S_k. One problem with Definition 10.2.1 is that we have to consider all possible vector statistics T and have to determine the conditional distribution of T given S; the latter may be fairly difficult to do. In the case of discrete random variables we can first get the conditional distribution of X given S as follows:
P[X_1 = x_1, ..., X_n = x_n | S_1 = s_1, ..., S_k = s_k]
 = P[X_1 = x_1, ..., X_n = x_n, S_1 = s_1, ..., S_k = s_k] / P[S_1 = s_1, ..., S_k = s_k]
 = P[X_1 = x_1, ..., X_n = x_n] / P[S_1 = s_1, ..., S_k = s_k] = f_X(x; θ)/f_S(s; θ) if s_i(x_1, ..., x_n) = s_i for all i, and 0 otherwise.
Now let C_t = {(x_1, ..., x_n) | t_i(x_1, ..., x_n) = t_i for i = 1, ..., q}. Then
P[T_1 = t_1, ..., T_q = t_q | S_1 = s_1, ..., S_k = s_k] = Σ_{C_t} P[X_1 = x_1, ..., X_n = x_n | S_1 = s_1, ..., S_k = s_k].
If it turns out that f_X(x; θ)/f_S(s; θ) does not depend on θ, it will follow from the above that the conditional distribution of T given S is independent of θ for all possible statistics T. (NOTE: To show a similar result for the continuous case involves some technical difficulties, e.g. the transformation from n variables to n + q variables.)

Consider a random sample X_1, ..., X_n from a population with density function f(x; θ), and let S_1, ..., S_n be the order statistics. Then
f_S(s; θ) = n! f(s_1; θ)f(s_2; θ)⋯f(s_n; θ) for s_1 < s_2 < ⋯ < s_n, and 0 otherwise,
and therefore, when s_i(x_1, ..., x_n) = s_i for all i,
f_X(x; θ)/f_S(s; θ) = f(x_1; θ)f(x_2; θ)⋯f(x_n; θ) / [ n! f(s_1; θ)f(s_2; θ)⋯f(s_n; θ) ] = 1/n!,
since the x_i's are just the s_i's in some order. This does not depend on θ, and this then means that the order statistics are jointly sufficient for θ.

Theorem 10.2.1 : Suppose that X = (X_1, ..., X_n) is a random vector with joint distribution f(x; θ) and that S = (S_1, ..., S_k), k < n, is a vector of statistics. Then S is jointly sufficient for θ if and only if there exist functions g and h such that
f(x_1, ..., x_n; θ) = g(s; θ) h(x_1, ..., x_n),
where g(s; θ) depends on x_1, ..., x_n only through s and h(x_1, ..., x_n) in no way involves θ.
Proof : Let S_j = u_j(X_1, ..., X_n), j = 1, ..., k, and let T_1, ..., T_{n−k} be other statistics with T_i = v_i(X_1, ..., X_n), i = 1, ..., n − k. This gives us a transformation from X to (S, T) = q(X), say. We will assume that R^n can be subdivided into regions such that over each one of these regions the transformation is one-to-one and that on the i-th region we have X = q_i^{−1}(S, T). We then have
f_{S,T}(s, t; θ) = Σ_i f( q_i^{−1}(s, t); θ ) |J_i| = Σ_i g(s; θ) h( q_i^{−1}(s, t) ) |J_i| = g(s; θ) Σ_i h( q_i^{−1}(s, t) ) |J_i|,
where J_i is the Jacobian of the i-th inverse transformation, and
f_S(s; θ) = g(s; θ) ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ Σ_i h( q_i^{−1}(s, t) ) |J_i| dt = g(s; θ) m(s),
where m(s) does not depend on θ, since not one of the functions h, q_i^{−1} or J_i depends on θ. Therefore
f_{T|S}(t; s) = f_{S,T}(s, t; θ)/f_S(s; θ) = Σ_i h( q_i^{−1}(s, t) ) |J_i| / m(s),
which is independent of θ. The marginal distribution of any subset of the T's will also be independent of θ. Thus the conditional distribution of any other statistic T given S is independent of θ, and therefore S is jointly sufficient for θ. ∎

NOTE : If S is jointly sufficient for θ and S' = u(S) is a one-to-one transformation from S to S', then S' will be jointly sufficient. For any given s' there is a unique value of s such that s' = u(s), i.e. knowing that S' = s' is equivalent to knowing that S = s. Therefore f_{T|S'}(t|s') = f_{T|S}(t|s), which is independent of θ.

EXAMPLE 10.2.2 : Consider a random sample X_1, ..., X_n where X_i ~ BIN(1, θ). Let S = Σ X_i and s = Σ x_i. Then
f(x_1, ..., x_n; θ) = θ^{Σx_i}(1 − θ)^{n−Σx_i} = θ^s(1 − θ)^{n−s}·1 = g(s; θ)·h(x_1, ..., x_n),
where h(x_1, ..., x_n) = 1 for all x_i = 0 or 1 and 0 otherwise, which does not depend on θ. From Theorem 10.2.1 it follows that S = Σ X_i is sufficient for θ.

It may happen that a set of statistics S_1, ..., S_k is jointly sufficient for θ but that all the information is really contained in only some of the S_i's. In Example 10.2.2 above, if we let s_1 = Σ x_i and s_2 = x_1, we can write the joint distribution as
f(x_1, ..., x_n; θ) = θ^{Σx_i}(1 − θ)^{n−Σx_i} = [ θ^{s_1}(1 − θ)^{n−s_1}·s_2^0 ]·1 = g(s_1, s_2; θ)·1,
i.e. S_1 and S_2 are jointly sufficient for θ. However, since S_1 is sufficient, if we know s_1 then s_2 does not contain any extra information about θ.

Definition 10.2.2 : A set of statistics is called a minimal sufficient set if the members of the set are jointly sufficient for the parameters and if they are a function of every other set of jointly sufficient statistics. In the example above, s_1 can be determined if we know both s_1 and s_2, but the converse is not true, i.e. both s_1 and s_2 cannot be determined if we only know s_1.

EXAMPLE 10.2.3 : Consider a random sample X_1, ..., X_n from a population which is uniformly distributed over the interval (0, θ), where θ is unknown. Let X_{1:n} and X_{n:n} be the smallest and largest values respectively. Then the joint distribution is given by
f(x_1, ..., x_n; θ) = 1/θ^n if 0 < x_i < θ for i = 1, ..., n, and 0 otherwise
 = 1/θ^n if 0 < x_{1:n} and x_{n:n} < θ, and 0 otherwise
 = g(s; θ) h(x_1, ..., x_n),
where g(s; θ) = 1/θ^n if s = x_{n:n} < θ (and 0 otherwise) and h(x_1, ..., x_n) = 1 if x_{1:n} > 0 (and 0 otherwise). It then follows from the factorisation criterion that X_{n:n} is a sufficient statistic for θ.

Definition 10.2.3 : If A is a set, then the indicator function of A, denoted by I_A, is defined as I_A(x) = 1 if x ∈ A and I_A(x) = 0 if x ∉ A.

Using indicator functions we can rewrite Example 10.2.3 as follows: f(x; θ) = (1/θ)I_{(0,θ)}(x), so
f(x_1, ..., x_n; θ) = ∏ f(x_i; θ) = (1/θ^n) ∏ I_{(0,θ)}(x_i) = (1/θ^n) I_{(0,θ)}(x_{n:n}) I_{(0,∞)}(x_{1:n}) = [ (1/θ^n) I_{(0,θ)}(x_{n:n}) ]·h(x_1, ..., x_n).
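As a small numerical illustration of the factorisation in Example 10.2.3 (not part of the notes): because L(θ) = θ^{−n} I(x_{n:n} < θ) I(x_{1:n} > 0), any two samples of the same size with the same maximum produce identical likelihood functions. The two samples and the grid of θ values below are hypothetical; NumPy is assumed.

```python
import numpy as np

def unif_likelihood(theta, x):
    # L(theta) = theta^{-n} * I(max x_i < theta) * I(min x_i > 0)
    x = np.asarray(x)
    return np.where((theta > x.max()) & (x.min() > 0), theta ** (-len(x)), 0.0)

thetas = np.linspace(0.5, 10.0, 500)
a = np.array([1.2, 0.4, 3.7, 2.0])      # hypothetical sample, maximum = 3.7
b = np.array([3.7, 3.1, 0.1, 3.65])     # different sample with the same maximum

print(np.allclose(unif_likelihood(thetas, a), unif_likelihood(thetas, b)))  # True
```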
EXAMPLES :

Example 10.2.4 : Let X_1, ..., X_n be a random sample from a population which is normally distributed with parameters μ and σ², which are both unknown. Then
f(x_1, ..., x_n; μ, σ²) = ∏ f(x_i; μ, σ²) = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²))Σ(x_i−μ)²} = (2π)^{−n/2} σ^{−n} e^{−(1/(2σ²))[Σx_i² − 2μΣx_i + nμ²]}.
This expression depends on the x_i's only through s_1 = Σ x_i and s_2 = Σ x_i², i.e.
f(x_1, ..., x_n; μ, σ²) = g(s_1, s_2; μ, σ²) h(x_1, ..., x_n),
where g(s_1, s_2; μ, σ²) = (2π)^{−n/2}σ^{−n} e^{−(1/(2σ²))(s_2 − 2μs_1 + nμ²)} and h(x_1, ..., x_n) ≡ 1. It follows from the factorisation criterion that S_1 = Σ X_i and S_2 = Σ X_i² are jointly sufficient statistics for μ and σ². The maximum likelihood estimators of μ and σ² are μ̂ = X̄ = S_1/n and σ̂² = Σ(X_i − X̄)²/n = S_2/n − (S_1/n)². This is a one-to-one transformation of S_1 and S_2, and they are therefore also jointly sufficient for μ and σ².

Example 10.2.5 : Consider a random sample X_1, ..., X_n from a population which is uniformly distributed over the interval (θ, θ + 1), where θ is unknown. Let X_{1:n} and X_{n:n} be the smallest and largest values respectively. Then the joint distribution is given by
f(x_1, ..., x_n; θ) = ∏ f(x_i; θ) = ∏ I_{(θ,θ+1)}(x_i) = I_{(θ,∞)}(x_{1:n}) I_{(−∞,θ+1)}(x_{n:n}),
which shows that X_{1:n} and X_{n:n} are jointly sufficient for θ. (It can be shown that they are minimal sufficient, i.e. we would lose information if only one of them were known.)

10.3 FURTHER PROPERTIES OF SUFFICIENT STATISTICS

Theorem 10.3.1 : If S = (S_1, ..., S_k) is sufficient for θ and θ̂ is a unique maximum likelihood estimator of θ, then θ̂ must be a function of S.
Proof : If S is sufficient then, by the factorisation criterion,
L(θ) = f(x_1, ..., x_n; θ) = g(s; θ) h(x_1, ..., x_n),
which means that to maximise L(θ) it is necessary to maximise g(s; θ), i.e. the value of θ for which the maximum is obtained depends only on s. If the maximum likelihood estimator is unique, this defines a function of s. ∎
NOTE : If the maximum likelihood estimators θ̂_1, ..., θ̂_k are unique and sufficient, they will be minimal sufficient, since the factorisation criterion applies to all sets of sufficient statistics and therefore also to minimal sufficient statistics.
NOTE : Example 10.3.1 in Bain and Engelhardt shows that if the maximum likelihood estimates are not unique, it is possible to find maximum likelihood estimators which are not functions of sufficient statistics.
NOTE : If S is a sufficient statistic with a known distribution, the distribution of S can be used to find the maximum likelihood estimate, since g(s; θ) is proportional to the distribution of S.

Theorem 10.3.2 : If S is sufficient for θ, then any Bayes estimator of θ will be a function of S.
Proof : The posterior distribution of θ given x is given by
f_{θ|x}(θ) = f(x_1, ..., x_n; θ)p(θ) / ∫ f(x_1, ..., x_n; θ)p(θ) dθ = g(s; θ)h(x_1, ..., x_n)p(θ) / ∫ g(s; θ)h(x_1, ..., x_n)p(θ) dθ = g(s; θ)p(θ) / ∫ g(s; θ)p(θ) dθ,
which depends only on s and θ. Hence E_{θ|x}[L(T; θ)] depends only on T and s, and therefore the value of T for which it is a minimum depends only on s, i.e. is a function of s. By Theorem 9.5.1 the Bayes estimate depends only on s. ∎
Theorem 10.3.3 : If X_1, ..., X_n is a random sample from a population with a continuous distribution with density function f(x; θ), then the order statistics are sufficient for θ.
Proof : For fixed x_{1:n}, x_{2:n}, ..., x_{n:n} and associated x_1, ..., x_n, the conditional density function of x_1, ..., x_n given x_{1:n}, ..., x_{n:n} is given by
f(x_1; θ)f(x_2; θ)⋯f(x_n; θ) / [ n! f(x_{1:n}; θ)f(x_{2:n}; θ)⋯f(x_{n:n}; θ) ] = 1/n!
if x_{1:n} = min(x_1, ..., x_n), ..., x_{n:n} = max(x_1, ..., x_n), and 0 otherwise, i.e. the conditional distribution does not depend on θ, which implies the sufficiency of the order statistics. ∎

Theorem 10.3.4 : THE RAO–BLACKWELL THEOREM
Let X_1, ..., X_n be jointly distributed random variables with joint density function f(x_1, ..., x_n; θ) and let S be a vector of jointly sufficient statistics for θ. If T is any unbiased estimator of τ(θ) and T* = E[T|S], then
1. T* is a function of S,
2. T* is an unbiased estimator of τ(θ), and
3. var(T*) ≤ var(T) for all θ, and var(T*) < var(T) for some θ unless T* = T with probability 1.
Proof : Since S is jointly sufficient, the conditional distribution of T given S does not depend on θ. Therefore t*(s) = E[T|s] does not depend on θ, and T* = E[T|S] is a statistic that depends on S and does not depend on θ. Furthermore
E[T*] = E_S[T*] (since T* is a function of S) = E_S[ E[T|S] ] = E[T] (by Theorem 5.4.1) = τ(θ) for all values of θ,
since T is an unbiased estimator of τ(θ), i.e. T* is an unbiased estimator of τ(θ). From Theorem 5.4.3 we have
var(T) = var( E[T|S] ) + E[ var(T|S) ] ≥ var( E[T|S] ) = var(T*),
and equality holds if and only if E[var(T|S)] = 0, which will be true if and only if var(T|S) = 0 with probability 1 (since a variance is always ≥ 0), and var(T|S) can only be zero if T is equal to its conditional expected value, i.e. T*, with probability 1. ∎

NOTE : From the Rao–Blackwell theorem it follows that if any unbiased estimator T exists and S is jointly sufficient, there will exist a function of S that is an unbiased estimator, namely E[T|S]. This unbiased estimator, which is a function of S, will also have a smaller variance unless T is already a function of S. To look for a UMVU estimator we can therefore restrict our search to sufficient statistics. If there exists a unique function of the sufficient statistics which is unbiased, it will be the UMVU estimator. The problem of uniqueness will be discussed in the next section.
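A minimal Rao–Blackwell sketch (not part of the notes): for Bernoulli(p) data, start from the crude unbiased estimator T = X_1 and condition on the sufficient statistic S = Σ X_i, which gives E[T|S] = S/n. Both are unbiased, but the conditioned estimator has much smaller variance. The values of p, n and the replication count are hypothetical; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.3, 25, 200_000
x = rng.binomial(1, p, size=(reps, n))

t = x[:, 0].astype(float)           # T = X_1, unbiased but crude
t_rb = x.mean(axis=1)               # E[T | S] = S/n = sample proportion

print("means    :", t.mean(), t_rb.mean())    # both ~ p
print("variances:", t.var(), t_rb.var())      # ~ p(1-p) versus ~ p(1-p)/n
```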
10.4 COMPLETENESS AND THE EXPONENTIAL CLASS

Definition 10.4.1 : COMPLETENESS
A family (i.e. set) of density functions {f_T(t; θ); θ ∈ Ω} is called complete if E[u(T)] = 0 for all θ ∈ Ω implies u(T) = 0 with probability 1 for all θ ∈ Ω.
NOTE : If the functions u_1(T) and u_2(T) of T are both unbiased estimators of τ(θ), then E[u_1(T) − u_2(T)] = τ(θ) − τ(θ) = 0 for all values of θ. In the case of completeness this implies that u_1(T) − u_2(T) = 0 with probability 1, i.e. u_1(T) = u_2(T) with probability 1, i.e. for all practical purposes the two functions are exactly the same, i.e. there is a unique function of T that is an unbiased estimator of τ(θ).
NOTE : The Rao–Blackwell theorem implies that what we are essentially interested in is whether or not the family of distributions of a set of sufficient statistics is complete.
NOTE : A sufficient statistic, the density of which is a member of a complete family of density functions, is referred to as a complete sufficient statistic.

Theorem 10.4.1 : LEHMANN–SCHEFFÉ
Let X_1, ..., X_n be random variables with joint density function f(x_1, ..., x_n; θ) and let S be a vector of jointly complete sufficient statistics for θ. If T* = t*(S) is a statistic that is unbiased for τ(θ), then T* is the UMVUE of τ(θ).
Proof : From the completeness it follows that any statistic that is a function of S and an unbiased estimator of τ(θ) must be equal to T* with probability 1. If T is any other statistic that is an unbiased estimator of τ(θ), then by the Rao–Blackwell theorem E[T|S] is also an unbiased estimator of τ(θ) and a function of S, so by uniqueness T* = E[T|S] with probability 1. Furthermore var(T*) ≤ var(T) for all θ. Thus T* is the UMVUE of τ(θ). ∎

EXAMPLE 10.4.1 : Let X_1, ..., X_n be a random sample from a population which has a Poisson distribution with parameter μ. Then
f(x_1, ..., x_n; μ) = e^{−nμ}μ^{Σx_i}/∏ x_i!.
By the factorisation criterion S = Σ X_i is a sufficient statistic, and the distribution of S is a Poisson distribution with parameter nμ. The expected value of any function u of S is
E[u(S)] = Σ_{s=0}^∞ u(s) e^{−nμ}(nμ)^s/s! = e^{−nμ} Σ_{s=0}^∞ [u(s)n^s/s!] μ^s.
This will be 0 for all μ > 0 only if Σ_{s=0}^∞ [u(s)n^s/s!] μ^s = 0 for all μ > 0, since e^{−nμ} ≠ 0 for any μ > 0. This power series in μ will be equal to 0 for all μ > 0 if and only if all the coefficients of μ^s are 0, i.e. u(s)n^s/s! = 0 for all s, i.e. u(s) = 0 for all s, i.e. u(S) is 0 with probability 1. Therefore the family of distributions of S is complete. Since E[S/n] = nμ/n = μ for all values of μ, S/n is the UMVUE of μ, and var(S/n) = nμ/n² = μ/n.
To determine the Cramér–Rao lower bound for an unbiased estimator of μ we use f(x; μ) = e^{−μ}μ^x/x!, so that
ln f(x; μ) = −μ + x ln μ − ln(x!), ∂ ln f(x; μ)/∂μ = −1 + x/μ = (x − μ)/μ,
E[ (∂ ln f(X; μ)/∂μ)² ] = E[(X − μ)²]/μ² = var(X)/μ² = μ/μ² = 1/μ,
and therefore CRLB = 1/(n·(1/μ)) = μ/n. In this case the variance of the UMVUE of μ is actually equal to the CRLB. Note also that
Σ ∂ ln f(X_i; μ)/∂μ = (S − nμ)/μ = (n/μ)(S/n) − n,
which is a linear function of S/n, an unbiased estimator of μ. From this we can also conclude that var(S/n) is equal to the CRLB for an unbiased estimator of μ.
We also have E[(S/n)²] = var(S/n) + {E[S/n]}² = μ/n + μ², i.e. (S/n)² is a function of the complete sufficient statistic S and an unbiased estimator of μ/n + μ². From the Lehmann–Scheffé theorem it then follows that (S/n)² is the UMVUE of μ/n + μ². Since Σ ∂ ln f(X_i; μ)/∂μ is not a linear function of (S/n)², the variance of this UMVUE is not equal to the Cramér–Rao lower bound.
To find the UMVUE of any function τ(μ) it is only necessary to find some function of S that is an unbiased estimator. If there is a difficulty finding an unbiased estimator of τ(μ) which is a function of S, first find any function T of X_1, ..., X_n which is an unbiased estimator and then determine E[T|S]. In any particular case it may still be a substantial problem to establish whether or not a family of density functions is complete.

Definition 10.4.2 : REGULAR EXPONENTIAL CLASS (REC)
A density function is said to be a member of a regular exponential class if it can be expressed in the form
f(x; θ) = c(θ)h(x) e^{Σ_{j=1}^k q_j(θ)t_j(x)} for x ∈ A, and 0 otherwise,
where θ = (θ_1, ..., θ_k) is a vector of k unknown parameters and the parameter space has the form Ω = {θ | a_i ≤ θ_i ≤ b_i, i = 1, ..., k} (a_i = −∞ and b_i = ∞ permissible), and if it satisfies the regularity conditions 1, 2 and 3a or 3b given by:
1. The set A = {x | f(x; θ) > 0} does not depend on θ.
2. The functions q_j(θ) are non-trivial, functionally independent, continuous functions of the θ_i.
3a. For a continuous random variable the derivatives t_j'(x) are linearly independent continuous functions of x over A.
3b. For a discrete random variable the t_j(x) are non-trivial functions of x on A and none is a linear function of the others.
For convenience we will write that f(x; θ) is a member of REC(q_1, ..., q_k), or simply REC.
NOTE : The notion of a REC can be extended to the case where X is a vector.

Theorem 10.4.2a : A REC of density functions is complete if t_j(x) = x for all j.
Proof : We will not prove this theorem.
Theorem 10.4.2b : Suppose that X_1, ..., X_n is a random sample from a population with a density function that is a member of a REC of density functions. Then the statistics
S_1 = Σ t_1(X_i), S_2 = Σ t_2(X_i), ..., S_k = Σ t_k(X_i)
are a set of sufficient statistics for θ = (θ_1, ..., θ_k).
Proof : The joint density function of X_1, ..., X_n is given by
f(x_1, ..., x_n; θ) = ∏ c(θ)h(x_i) e^{Σ_j q_j(θ)t_j(x_i)} = [∏ h(x_i)]·[c(θ)]^n e^{Σ_j q_j(θ)[Σ_i t_j(x_i)]},
so that it follows from the factorisation criterion that S_1, ..., S_k is a set of sufficient statistics. ∎

Theorem 10.4.2 : Suppose that X_1, ..., X_n is a random sample from a population with a density function that is a member of a REC of density functions. Then the set of sufficient statistics S_1, ..., S_k has a joint distribution which is a member of a REC that is complete.
Proof : No proof will be given.

An unbiased estimator of τ(θ), the variance of which is equal to the Cramér–Rao lower bound, will be referred to as a CRLB estimator.

EXAMPLES :

Example 10.4.2 : Suppose that X is a random variable with a Bernoulli distribution with parameter p. Then
f(x; p) = p^x(1 − p)^{1−x} = (1 − p)[p/(1 − p)]^x = (1 − p) e^{x ln[p/(1−p)]} for x ∈ A = {0, 1} and 0 < p < 1,
i.e. the Bernoulli density functions are a member of the regular exponential class with c(p) = 1 − p, q(p) = ln[p/(1 − p)] and t(x) = x, i.e. they also form a complete class of distributions.

Example 10.4.3 : Let X_1, ..., X_n be independent random variables, all with a Bernoulli distribution with parameter p. From Example 10.4.2 it follows that the distribution of the X_i's belongs to a REC with t_1(x_i) = x_i, i.e. S_1 = Σ X_i is a single complete sufficient statistic for p (Theorem 10.4.2b). Let X̄ = S_1/n. Then E[X̄] = p, and from the Lehmann–Scheffé theorem it then follows that X̄ is the UMVUE of p, with var(X̄) = p(1 − p)/n. To determine the CRLB we note that
f(x; p) = p^x(1 − p)^{1−x}, ln f(x; p) = x ln p + (1 − x)ln(1 − p), ∂ ln f(x; p)/∂p = x/p − (1 − x)/(1 − p) = (x − p)/[p(1 − p)],
E[ (∂ ln f(X; p)/∂p)² ] = var(X)/[p(1 − p)]² = 1/[p(1 − p)],
and therefore CRLB = p(1 − p)/n = var(X̄), i.e. X̄ is the CRLB estimator of p. From Theorem 9.3.2c it follows that no CRLB estimator of p(1 − p) exists. But
E[ (n/(n − 1)) X̄(1 − X̄) ] = (n/(n − 1)){ E[X̄] − E[X̄²] } = (n/(n − 1)){ p − var(X̄) − (E[X̄])² }
 = (n/(n − 1)){ p − p(1 − p)/n − p² } = (n/(n − 1)) p(1 − p)(1 − 1/n) = p(1 − p),
i.e. (n/(n − 1))X̄(1 − X̄) is an unbiased estimator and a function of the complete sufficient statistic S_1, i.e. from the Lehmann–Scheffé theorem it follows that it is the UMVUE of p(1 − p).
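A short simulation check of Example 10.4.3 (not part of the notes): the estimator (n/(n − 1))X̄(1 − X̄) has mean essentially equal to p(1 − p), while the naive plug-in estimator X̄(1 − X̄) is systematically too small. The values of p, n and the replication count are hypothetical; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(8)
p, n, reps = 0.3, 12, 300_000
x = rng.binomial(1, p, size=(reps, n))
xbar = x.mean(axis=1)

umvue = n / (n - 1) * xbar * (1 - xbar)   # UMVUE of p(1-p) from Example 10.4.3
naive = xbar * (1 - xbar)                 # plug-in estimator, biased

print("target       :", p * (1 - p))
print("UMVUE mean   :", umvue.mean())     # ~ 0.21
print("plug-in mean :", naive.mean())     # systematically too small
```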
Example 10.4.4 : Suppose that X is a random variable with a normal density function with parameters μ and σ². Then
f(x; μ, σ²) = (1/(√(2π)σ)) e^{−(x−μ)²/(2σ²)} = (1/(√(2π)σ)) e^{−μ²/(2σ²)} exp{ x²·(−1/(2σ²)) + x·(μ/σ²) },
which is a member of a REC with t_1(x) = x² and t_2(x) = x. The statistics S_1 = Σ X_i² and S_2 = Σ X_i are therefore a set of complete sufficient statistics for μ and σ², i.e. any unbiased estimator of any function of μ and σ² which is a function of S_1 and S_2 will be a UMVU estimator.

We state the next two theorems without proof.

Theorem 10.4.3 : If a CRLB estimator T exists for τ(θ), then a single sufficient statistic exists, and T is a function of the sufficient statistic. Conversely, if a single sufficient statistic exists and the CRLB exists, then a CRLB estimator exists for some τ(θ).

Theorem 10.4.4 : If the CRLB exists, then a CRLB estimator will exist for some function τ(θ) if and only if the density function is a member of the REC. Furthermore the CRLB estimator of τ(θ) will be τ(θ̂), where θ̂ is the maximum likelihood estimator of θ.

CHAPTER 11 : INTERVAL ESTIMATION

11.1 INTRODUCTION

Although point estimation is very important, it is necessary to get an indication of the accuracy of the estimate. The MSE of the estimate gives an indication of the error; in most cases, however, the MSE will depend on the parameters of the distribution. One possibility is to estimate the MSE, using estimates of the parameters, to get some indication of the possible error in the estimate of the parameter. A more satisfactory solution is to find a set of values of θ, depending on the observed values of X_1, ..., X_n, which is a subset of the set of possible values of θ, say Ω, and which is such that there is a high probability that this set will include the value, referred to as the true value, of θ.

Definition 11.1.1 : Suppose that X_1, ..., X_n are jointly distributed random variables with joint distribution f(x_1, ..., x_n; θ). Let Ω be the set of all possible values of θ. For all possible observed values x_1, ..., x_n of X_1, ..., X_n let R(x_1, ..., x_n) ⊂ Ω. Let θ_0 be the true value of θ. If
P[θ_0 ∈ R(X_1, ..., X_n) | θ_0] = γ for all θ_0 ∈ Ω,
then R(x_1, ..., x_n) is called a 100γ% confidence region for θ.
NOTE : For some observed values x_1, ..., x_n the region R(x_1, ..., x_n) will include θ_0 and for some observed values R(x_1, ..., x_n) will not include θ_0. The probability that the region will include θ_0 must be γ for all possible values of θ_0 for the region to be a 100γ% confidence region.

In the rest of this chapter we will consider methods to find such confidence regions. We would like to get confidence regions which are as small as possible. This means that we should use statistics with good properties; such statistics are discussed in Chapter 10 of Bain and Engelhardt. In general it is true that when such statistics exist, they are functions of the maximum likelihood estimators of the parameters. Confidence regions based on maximum likelihood estimators will therefore be used whenever possible.

11.2 CONFIDENCE INTERVALS

Definition 11.2.1 : Suppose that X_1, ..., X_n are jointly distributed random variables with joint distribution f(x_1, ..., x_n; θ). Let Ω be the set of all possible values of θ. Let l : R^n → R and u : R^n → R be two functions such that l(x_1, ..., x_n) < u(x_1, ..., x_n) for all values of x_1, ..., x_n. If θ_0 is the true value of θ and
P[l(X_1, ..., X_n) < θ_0 < u(X_1, ..., X_n) | θ_0] = γ for all θ_0 ∈ Ω,
then the interval R(x_1, ..., x_n) = ( l(x_1, ..., x_n), u(x_1, ..., x_n) ) is called a 100γ% confidence interval for θ.

For different sets of observed values x_1, ..., x_n we will get different intervals. Some will include θ and some will not include θ.
For any observed values x 1 , x 2 , T , x n of X 1 , X 2 , T , X n the values of lx 1 , x 2 , T , x n and ux 1 , x 2 , T , x n are called the lower and upper limits of the confidence interval respectively. The quantity + is called the confidence level. EXAMPLE 11.2.1 : Suppose that X 1 , X 2 , T , X n is a random sample from a population with an exponential distribution with parameter 2. In this case the X i ’s are independent GAMș, 1 random variables i.e. n ! X i L GAM2, n by Example 6.4.2. From th.8.3.3 it then follows that i1 2 ! i1 X i L D 2 2n 2 i.e. if X is the sample mean, then 2nX L D 2 2n . Hence 2 n 295 [Chapter 11] 0. 95 P D 2.025 2n t 2nX t D 2.975 2n | 2 2 2 u 1 |2 D 2.975 2n 2nX P 1 D 2.025 2n u P 2nX D 2.025 2n u2u 2nX |2 D 2.975 2n P 2nX D 2.975 2n t2t 2nX |2 D 2.025 2n i.e. if x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n then the interval from 2nx to 2 2nx is a 95% confidence interval for 2. Note that this interval D 2.975 2n D .025 2n is based on x which is the maximum likelihood estimate of 2. Definition 11.2.2 : Suppose that X 1 , X 2 , T , X n are jointly distributed random variables with joint distribution fx 1 , x 2 , T , x n ; ș . Let ȍ be the set of all possible values ș. 1. If l : R n v R is such that P¡lX 1 , X 2 , T , X n 2 | 2¢ + for all 2 ( then lx 1 , x 2 , T , x n is called a one-sided lower 100+% confidence limit for 2. 2. If u : R n v R is such that P¡uX 1 , X 2 , T , X n 2 | 2¢ + for all 2 ( then ux 1 , x 2 , T , x n is called a one-sided upper 100+% confidence limit for 2. EXAMPLES : Example 11.2.2 Continuation of Example 11.2.1. Suppose that X 1 , X 2 , T , X n is a random sample from a population with an exponential distribution with parameter 2. Then, if X is the sample mean, 2nX L D 2 2n . Hence 2 0. 95 P 2nX t D 2.95 2n |2 2 P 2 u 1 |2 D 2.95 2n 2nX P 2u P 2nX |2 D 2.95 2n 2nX D 2.95 2n t 2|2 296 [Chapter 11] i.e. if x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n then 95% confidence limit for 2. Similarly 0. 95 P 2nX u D 2.05 2n | 2 2 P 2nx is a lower 2nx is an upper D 2.95 2n 2 t 1 |2 D 2.05 2n 2nX P 2t 2nX D 2.05 2n |2 i.e. if x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n then 95% confidence limit for 2. D 2.05 2n These examples shows that usually lower and upper confidence limits for ș can be found quite easily once we have found a confidence interval for ș . We will therefore concentrate on finding confidence intervals for ș. The problem still remains how to find such confidence intervals. Example 11.2.3 Suppose that X 1 , X 2 , T , X n is a random sample from a population which is normally distributed with expected value 6 and variance @ 2 where 6 is unknown and @ 2 is known. The maximum likelihood estimate of 6 is the sample mean X - see Example 2 9.2.6. Then X is normally distributed with expected value 6 and variance @n X"6 is a standard normal random variable - th.3.3.4. th.8.3.2a. Then Y @ n 297 [Chapter 11] Hence X"6 1 " ) P "z 1" )2 @ n z 1" )2 |6, @ 2 P "z 1" )2 @ X " 6 z 1" )2 @ |6, @ 2 n n P "X " z 1" )2 @ "6 "X z 1" )2 @ |6, @ 2 n n P X z 1" )2 @ 6 X " z 1" )2 @ |6, @ 2 n n P X " z 1" )2 @ 6 X z 1" )2 @ |6, @ 2 n n i.e. x " z 1" )2 @ to x z 1" )2 @ is a 1001 " ) % confidence interval for 6. n n Note that it is only possible to calculate the limits of the interval if @ 2 is known. In the above examples we managed to find confidence limits using the two quantities 2nX L D 2 2n and X " 6 L N0, 1 . The problem still remains how to find such @ 2 n confidence intervals. 
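As a numerical illustration of the two quantities just mentioned, the sketch below computes the 95% interval of Example 11.2.1 for the exponential parameter and the interval of Example 11.2.3 for the mean when the variance is known. The observations and the value of the known standard deviation are hypothetical, chosen only to show the calculations.

```python
import numpy as np
from scipy.stats import chi2, norm

# Hypothetical observations from an exponential population with mean theta
x = np.array([0.8, 2.3, 1.1, 0.4, 3.0, 1.7, 0.9, 2.6, 1.2, 0.5])
n, xbar = len(x), x.mean()

# Example 11.2.1:  2*n*xbar / theta ~ chi-square(2n), so a 95% interval for theta is
lower = 2 * n * xbar / chi2.ppf(0.975, 2 * n)
upper = 2 * n * xbar / chi2.ppf(0.025, 2 * n)
print(lower, upper)

# Example 11.2.3: interval for mu when sigma^2 is known (sigma = 2 assumed, formula
# illustration only, reusing the same sample mean)
sigma, z = 2.0, norm.ppf(0.975)
print(xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))
```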
In the next two sections we study two methods of obtaining confidence intervals. 11.3 PIVOTAL QUANTITY METHOD Definition 11.3.1 : Suppose that Q qX 1 , X 2 , T , X n ; ș is a quantity that is a function of only X 1 , X 2 , T , X n and ș . Then Q is called a pivotal quantity if its distribution does not depend on ș or any other unknown parameters . In Example 11.2.1 the quantity 2nX is a pivotal quantity since it is a function only of 2 n 1 X n ! X i and 2 and its distribution is D 2 2n which does not depend on 2. In i1 Example 11.2.3 X"6 and 6 ( remember depend on 6. @ n @2 n is a pivotal quantity since it is only a function of X 1n ! X i i1 is assumed known ) and its distribution is N0, 1 which does not 298 [Chapter 11] NOTE : The distribution of a pivotal quantity qX 1 , X 2 , T , X n ; 2 is independent of 2 only if 2 is the true value of the parameter. For example if X i L EXP5 then 2nX L D 2 2n and if X i L EXP10 then 2nX L D 2 2n . 5 10 Suppose that Q qX 1 , X 2 , T , X n ; ș is a pivotal quantity. Let q Ȗ be the value for which Ȗ P¡qX 1 , X 2 , T , X n ; ș t q Ȗ | ș¢ . Note that this probability is true if ș is the true value of the parameter. Note also that since the distribution of Q does not depend on ș or any other unknown parameters, the value of q Ȗ does not depend on ș and therefore it can be determined without knowing the value of ș. JTheorem 11.3.1a : Suppose that X 1 , X 2 , T , X n are jointly distributed random variables with joint distribution fx 1 , x 2 , T , x n ; ș and that Q qX 1 , X 2 , T , X n ; ș is a pivotal quantity. Given the observed values x 1 , x 2 , T , x n of X 1 , X 2 , T , X n , let Rx 1 , x 2 , T , x n 2 | 2 (, q )2 t qx 1 , x 2 , T , x n ; 2 t q 1" )2 . Then Rx 1 , x 2 , T , x n is a 1001 " ) % confidence region for 2. Proof : Let 2 0 be the true value of 2. Then 1 " ) P¡q )2 t qX 1 , X 2 , T , X n ; 2 0 t q 1" )2 | 2 0 ¢ P¡2 0 RX 1 , X 2 , T , X n | 2 0 ¢ and it is true for all 2 0 ( i.e. Rx 1 , x 2 , T , x n is a 1001 " Į % confidence region for ș. ſ JTheorem 11.3.1b : Suppose that X 1 , X 2 , T , X n is a random sample from a population which is normally distributed with expected value 6 and variance @ 2 where 6 and @ 2 are unknown. Then X"6 Q L tn " 1 i.e. it is a pivotal quantity for 6 and s/ n ~ x " t 1" )2 n " 1 s to x t 1" )2 n " 1 s is a 1001 " ) % confidence interval n n for 6. 299 [Chapter 11] Proof : X"6 t 1" )2 n " 1 | 6, @ 2 s/ n 1 " ) P "t 1" )2 n " 1 ~ s P "t 1" )2 n " 1 X " 6 t 1" )2 n " 1 ~ n s P "X " t 1" )2 n " 1 i.e. x " t 1" )2 n " 1 interval for 6. n n 6 X " t 1" )2 n " 1 ~ n s P X " t 1" )2 n " 1 ~ 6 X t 1" )2 n " 1 ~ n | 6, @ 2 "6 "X t 1" )2 n " 1 ~ s P X t 1" )2 n " 1 s s to x t 1" ) n " 1 2 n s ~ n s ~ n s ~ n | 6, @ 2 | 6, @ 2 | 6, @ 2 s is a 1001 " ) % confidence n ſ JTheorem 11.3.1c : Suppose that X 1 , X 2 , T , X n is a random sample from a population which is normally distributed with expected value ȝ and variance ı 2 where ȝ and ı 2 are unknown. n " 1 s 2 ~ L D 2 n " 1 i.e. it is a pivotal quantity for @ 2 and Then Q @2 n " 1 s 2 n " 1 s 2 to is a 1001 " ) % confidence interval for @ 2 . D 21" )2 n " 1 D 2)2 n " 1 Proof : We have that n " 1 s 2 1 " ) P D 2 n " 1 t P P P 1 D 2 n " 1 2) n " 1 s2 ~ D 2 n " 1 2) n " 1 s 2 ~ D 21" )2 n " 1 t D 21" )2 n " 1 | 6, @ 2 ~ 2) @2 @2 n " 1 s 2 u ~ u @2 t u @2 t u 1 | 6, @ 2 D 21" )2 n " 1 n " 1 s 2 ~ D 21" )2 n " 1 n " 1 s 2 ~ D 2 n " 1 2) | 6, @ 2 | 6, @ 2 300 [Chapter 11] n " 1 s 2 D 21" )2 n " 1 i.e. to n " 1 s 2 D 2)2 n " 1 is a 1001 " ) % confidence interval for @ 2 . 
ſ In general it is true that any pivotal quantity can be used to construct confidence intervals. To obtain confidence intervals with good properties, it is usually best to find pivotal quantities based on maximum likelihood estimators of ș. If it is not possible to find a pivotal quantity for a particular problem, the result in the following theorem can be used. JTheorem 11.3.1d : Suppose that X 1 , X 2 , T , X n is a random sample from a population with density function fx; 2 and distribution function Fx; 2 . Then n Q "2 ! ln FX i ; 2 L D 2 2n i1 i.e. Q is a pivotal quantity for 2. Proof : The random variable FX i ; 2 is uniformly distributed over the interval 0, 1 and therefore " ln FX i ; 2 is exponentially distributed with parameter 1 i.e. it is a gamma random variable with parameters 1 and 1. Hence n !£" ln FX i ; 2 ¤ L GAM1, n i1 i.e. n 2 !£" ln FX i ; 2 ¤ L D 2 2n . ſ (11.3.1) i1 NOTE : If X L UNIF0, 1 , then £1 " X¤ L UNIF0, 1 i.e. if FX i ; 2 L UNIF0, 1 then £1 " FX i ; 2 ¤ L UNIF0, 1 and therefore n "2 ! ln£1 " FX i ; 2 ¤ L D 2 2n i1 (11.3.2) 301 [Chapter 11] EXAMPLE 11.3.1 : Suppose that X 1 , X 2 , T , X n is a random sample from a population with a Pareto density function with parameters 1 and 4 i.e. 41 x "4"1 for x 0 fx; 4 0 otherwise and therefore x Fx; 4 ; ft; 4¤dt ". for x 0 0 x ; 41 t "4"1 dt for x 0 0 for x 0 0 ¡"1 t "4 x ¢0 for x 0 for x 0 0 1 " 1 x "4 for x 0 Using (11.3.2) and for all X i 0 we have that n Q "2 ! ln£1 " FX i ; 2 ¤ i1 n "2 !£"4 ln1 X i ¤ L D 2 2n . i1 Now using th.11.3.1a we get that if x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n that Rx 1 , x 2 , T , x n n 4 | D 2)2 2n t 24 ! ln1 x i t D 21" )2 2n i1 4| D 2)2 2n n 2 ! ln1 x i i1 t4t D 21" )2 2n n 2 ! ln1 x i i1 which gives a 1001 " ) % confidence interval for 4. In Example 11.3.1 it was very easy to determine the confidence interval for 4. In most cases, however, it is not that simple to determine the confidence interval for 2 using the pivotal quantities (11.3.1) or (11.3.2). 302 [Chapter 11] 11.4 GENERAL METHOD In many cases a suitable pivotal quantity is not available i.e. where the distribution of Q qX 1 , X 2 , T , X n ; ș does not depend on ș. In general it is much more likely that we can find a statistic S sX 1 , X 2 , T , X n such that the distribution of S depends on ș. In such cases the following theorem can be used to construct confidence regions. JTheorem 11.4.1a : Suppose that X 1 , X 2 , T , X n are jointly distributed random variables with joint distribution fx 1 , x 2 , T , x n ; ș . Also suppose that S sX 1 , X 2 , T , X n is a statistic with distribution gs; 2 . Let h 1 2 and h 2 2 be such that P¡h 1 2 t S t h 2 2 |2¢ 1 " ). Given any observed value s of S, let Rs £2|h 1 2 t s t h 2 2 ¤. Then Rs is a 1001 " ) % confidence region for 2. Proof : Let 2 0 be the true value of 2. Then 1 " ) P¡h 1 2 0 t S t h 2 2 0 | 2 0 ¢ P¡2 0 RS | 2 0 ¢ for all possible values of 2 0 i.e. Rs is a 1001 " ) % confidence region for 2. ſ NOTE: Suppose that Gs; 2 is the distribution function of S. Suppose further that h 1 ș is such that P¡S t h 1 ș ¢ Į/2 i.e. Gh 1 ș ; ș Į/2. Also suppose that h 2 ș is such that P¡S t h 2 ș ¢ 1 " Į/2 i.e. Gh 2 ș ; ș 1 " Į/2. Then P¡h 1 2 t S t h 2 2 | 2¢ Gh 2 2 ; 2 " Gh 1 2 ; 2 1 " )/2 " )/2 1 " ). From th.11.4.1a it then follows that Rs £2 | h 1 2 t s t h 2 2 ¤ is a 1001 " Į % confidence region for ș. 
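Before turning to examples of this general method, here is a short numerical sketch of the pivotal-quantity interval obtained in Example 11.3.1 for the Pareto parameter. The observations are made up purely for illustration.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical observations; F(x; kappa) = 1 - (1 + x)^(-kappa) for x > 0
x = np.array([0.3, 1.8, 0.7, 4.2, 0.1, 2.5, 0.9, 0.4, 1.1, 6.0])
n, alpha = len(x), 0.10
s = np.log(1 + x).sum()          # 2 * kappa * s ~ chi-square(2n) by (11.3.2)

lower = chi2.ppf(alpha / 2, 2 * n) / (2 * s)
upper = chi2.ppf(1 - alpha / 2, 2 * n) / (2 * s)
print(lower, upper)              # 90% confidence interval for kappa
```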
303 [Chapter 11] EXAMPLES : Example 11.4.1: Suppose that X 1 , X 2 , T , X n is a random sample from a population with density function 1 e " x"2 22 for x 2 22 fx; 2 0 otherwise Let ( £2 | 2 0¤. Then Fx; ș 0 for x t ș and for x ș we have that x Fx; 2 " t"2 ; 12 e 22 dt 2 2 "e " t"22 2 x 2 " x"22 1"e 2 . Note that F. ; 2 1. Let Y 1 min£X 1 , X 2 , T , X n . ¤ Then F Y 1 y 1 ; 2 1 " ¡1 " Fy 1 ; 2 ¢ n if y 1 2 0 1" 1" 1"e " y 1 "2 22 n if y 1 2 if y 1 2 0 1"e " ny 1 "2 22 if y 1 2 Let h 1 2 be such that F Y 1 h 1 2 ; 2 0. 05 i.e. 0. 05 1 " e i.e. " nh 1 2 "2 22 " nh 1 2 "2 e 22 nh 1 2 " 2 i.e. " 22 0. 95 ln 0. 95 i.e. h 1 2 2 " 1n ln 0. 95 2 2 2 1n 0. 0513 2 2 . 304 [Chapter 11] Let h 2 2 be such that F Y 1 h 2 2 ; 2 0. 95 i.e. 0. 95 1 " e " nh 2 2 "2 22 " i.e. nh 2 2 "2 0. 05 e 22 nh 2 2 " 2 i.e. " 22 ln 0. 05 i.e. h 2 2 2 " 1n ln 0. 05 2 2 2 1n 2. 996 2 2 . The two functions h 1 2 and h 2 2 can be represented graphically as follows h1 h2 3 y1 2 1 0 0 1 l (y1) 2 u (y1) 3 For any value of 2 we have that P¡h 1 2 t Y 1 t h 2 2 ¢ 0. 95 " 0. 05 0. 9. For any observed value y 1 of Y 1 we have that Ry 1 £2 | h 1 2 t y 1 t h 2 2 ¤ £2 | ly 1 t 2 t uy 1 ¤. Now suppose that n 10 and that y 1 2. 5 is the observed value of Y 1 . The value of ly 1 is such that h 2 ly 1 y 1 i.e. ly 1 1n 2. 996 ¡ly 1 ¢ 2 y 1 i.e. l2. 5 1 10 2. 996 ¡l2. 5 ¢ 2 2. 5 i.e. . 2996¡l2. 5 ¢ 2 l2. 5 " 2. 5 0 i.e. l2. 5 "1 o 1 2 " 4 . 2996 "2. 5 2 . 2996 "1 o 2 . 5992 1. 667 since 2 0. 305 [Chapter 11] The value of uy 1 is such that h 1 uy 1 uy 1 1n . 0513 ¡uy 1 ¢ 2 y 1 i.e. u2. 5 1 10 y 1 i.e. . 0513 ¡u2. 5 ¢ 2 2. 5 i.e. . 00513¡u2. 5 ¢ 2 u2. 5 " 2. 5 0 "1 o 1 2 " 4 . 00513 "2. 5 2 . 00513 "1 o 1. 0253 . 01026 2. 466 since 2 0 i.e. u2. 5 i.e. the interval 1. 667, 2. 466 is a 90% confidence interval for ș if n 10 and the observed value of Y 1 is 2. 5 . Example 11.4.2: Suppose that X 1 , X 2 , T , X n is a random sample from a population with an exponential n density function with parameter 2. Then X i L GAM2, 1 and S ! X i L GAM2, n and therefore 2S L D 2 2n . 2 Let h 1 ș and h 2 ș be such that P¡S t h 1 2 |2¢ )/2 and P¡S t h 2 2 |2¢ 1 " )/2. Then P¡h 1 2 t S t h 2 2 |2¢ 1 " )/2 " )/2 1 " ). Then )/2 P¡S t h 1 2 | 2¢ 2h 1 2 P 2S t |2 2 2 2h 1 2 P D 2 2n t 2 i.e. 2h 1 2 2 i.e. h 1 2 D 2)2 2n D 2)2 2n 2. 2 i1 306 [Chapter 11] Similarly 1 " )/2 P¡S t h 2 2 | 2¢ 2h 2 2 P 2S t |2 2 2 2h 2 2 P D 2 2n t 2 i.e. i.e. 2h 2 2 2 h 2 2 D 21" )2 2n D 21" )2 2n 2. 2 Now suppose that s is the observed value of S and let ls and us be that value such that h 2 ls s and h 1 us s. Then Rs £2|h 1 2 t s t h 2 2 ¤ £2|ls t 2 t us ¤. as can be seen graphically as follows h1 h2 s 0 l (s ) 1 u (s ) 2 307 [Chapter 11] Since h 2 ls s we have that D 21" )2 2n h 2 ls ls s 2 we have that ls 2 2s D 1" )2 2n D 2)2 2n us s 2 we have that us 2 2s . D )2 2n and since h 1 us 2s 2s , is therefore a 1001 " ) % confidence interval D 21" )2 2n D 2)2 2n for 2 if s is the observed value of S. The interval In the above theorems and examples we assumed that it is possible to find the functions h 1 2 and h 2 2 such that P¡h 1 2 t S t h 2 2 | 2¢ is exactly equal to 1 " ) for all values of 2. For continuous random variables this is true, but for discrete random variables it is usually not true that we can find such functions for which the probability is exactly equal to 1 " ). To cater for discrete random variables we need to consider so-called conservative confidence intervals. 
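Before considering the discrete case, the following sketch checks the calculations of Examples 11.4.1 and 11.4.2 numerically. The first part reproduces the roots l(2.5) and u(2.5); in the second part the observed value of S is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

# Example 11.4.1 with n = 10 and observed y1 = 2.5 (90% region)
n, y1 = 10, 2.5
a1 = -np.log(0.95) / n           # h1(theta) = theta + a1 * theta**2
a2 = -np.log(0.05) / n           # h2(theta) = theta + a2 * theta**2
l = (-1 + np.sqrt(1 + 4 * a2 * y1)) / (2 * a2)   # positive root of h2(l) = y1
u = (-1 + np.sqrt(1 + 4 * a1 * y1)) / (2 * a1)   # positive root of h1(u) = y1
print(l, u)                      # about (1.67, 2.47), matching Example 11.4.1 up to rounding

# Example 11.4.2: interval from 2s/chi2_{1-alpha/2}(2m) to 2s/chi2_{alpha/2}(2m)
s, m, alpha = 18.4, 10, 0.05     # s is a hypothetical observed value of S from m observations
print(2 * s / chi2.ppf(1 - alpha / 2, 2 * m), 2 * s / chi2.ppf(alpha / 2, 2 * m))
```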
Definition 11.4.1 : Suppose that X 1 , X 2 , T , X n are jointly distributed random variables with joint distribution fx 1 , x 2 , T , x n ; ș . Let ȍ be the set of all possible values ș . Let Rx 1 , x 2 , T , x n ȍ. Let ș 0 be the true value of ș. If P¡2 0 RX 1 , X 2 , T , X n | 2 0 ¢ u 1 " ) for all 2 0 ( and x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n then Rx 1 , x 2 , T , x n is called a 1001 " Į % conservative confidence region for ș. 308 [Chapter 11] JTheorem 11.4.3a : Suppose that X 1 , X 2 , T , X n are jointly distributed random variables with joint distribution fx 1 , x 2 , T , x n ; ș . Let S sX 1 , X 2 , T , X n be a discrete random variable with possible values s 1 s 2 s 3 . . . . Let h 1 ș be the biggest value of the s i ’s such that P¡S h 1 ș | ș¢ t Į/2 . Let h 2 ș be the smallest value of the s i ’s such that P¡S t h 2 ș | ș¢ u 1 " Į/2. Let s be the observed value of S and let Rs £ș | h 1 ș t s t h 2 ș ¤. Then Rs is a 1001 " ) % conservative confidence region for 2. Proof : For any value of 2 P¡h 1 2 t S t h 2 2 | 2¢ P¡S t h 2 2 | 2¢ " P¡S h 1 2 | 2¢ u 1 " )/2 " )/2 1 " ). (11.4.1) Suppose that ș 0 is the true value of ș. Then P¡2 0 RS ¢ P¡h 1 2 0 t S t h 2 2 0 | 2 0 ¢ u 1 " ). since (11.4.1) true for all values of 2 i.e. Rs a 1001 " ) % conservative confidence region for 2. ſ Basically we can therefore use the same method for continuous and discrete random variables to construct either exact confidence regions or conservative confidence regions. We will later consider making use of certain tables to find conservative confidence intervals for the parameters of the binomial and Poisson distributions. NOTE : Let E denote some event. The notation P¡E | ș¢ is used to denote the probability of E when ș is the value of the parameter in the distribution used to determine the probability of E. To simplify notation we will simply use P¡E¢ instead of P¡E | ș¢ from now on. 309 [Chapter 11] 11.5 CONFIDENCE INTERVALS FOR SOME STANDARD PROBLEMS 11.5.1 RANDOM SAMPLES FROM NORMAL POPULATIONS : ONE SAMPLE In this section we will assume that X 1 , X 2 , T , X n is a random sample from a population with a normal distribution with parameters 6 and @ 2 . We will use X to denote the sample mean and use s 2 to denote the sample variance. We will use x and s 2 to ~ denote observed values of X and s 2 . ~ (a) Confidence interval for ȝ From th.11.3.1b we have that X"6 L tn " 1 and is a pivotal quantity for 6 s/ n ~ and that x " t 1" )2 n " 1 s to x t ) n " 1 1" 2 n s is a 1001 " ) % confidence n interval for 6. (b) Confidence interval for ı 2 . n " 1 s 2 From th.11.3.1c we have that @ 2 and that @2. n " 1 s 2 D 21" )2 n " 1 to ~ @2 n " 1 s 2 D 2)2 n " 1 L D 2 n " 1 is a pivotal quantity for is a 1001 " ) % confidence interval for 11.5.2 RANDOM SAMPLES FROM NORMAL POPULATIONS : TWO SAMPLES Suppose that X 11 , X 12 , T , X 1n 1 is a random sample of size n 1 from a normal population with expected value 6 1 and variance @ 21 . Let X 1 be the sample mean of the X 1i ’s i.e. n1 X 1i ! i1 X1 . Let s 21 be the sample variance of the X 1i ’s i.e. n1 s 21 ~ n1 ! i1 X 1i " X 1 n1 " 1 ~ 2 . 310 [Chapter 11] Suppose that X 21 , X 22 , T , X 2n 2 is an independent random sample of size n 2 from a normal population with expected value 6 2 and variance @ 22 . Let X 2 be the sample mean n2 X 2i ! i1 . Let s 22 be the sample variance of the X 2i ’s i.e. of the X 2i ’s i.e. X 2 n2 s 22 n2 ! i1 X 2i " X 2 . 
n2 " 1 ~ ~ 2 (a) Confidence Interval for 6 1 " 6 2 (i) @ 21 @ 22 @ 2 , say n 1 " 1 s 21 @2 In this case X 1 L N6 1 , n 1 ~ and @2 @2 L D 2 n 1 " 1 are independent and is n 2 " 1 s 22 ~ also independent of X 2 L N 6 2 , n 2 and L D 2 n 2 " 1 which are @2 independent. It then follows that 2 @2 and is independent of X 1 " X 2 L N 6 1 " 6 2 , @ n1 n2 n 1 " 1 s 21 ~ @2 Therefore Z n 2 " 1 s 22 X 1 " X 2 " 6 1 " 6 2 @2 n1 n 1 " 1 s 21 V ~ @2 @2 n2 L N0, 1 and is independent of n 2 " 1 s 22 Hence T L D 2 n 1 n 2 " 2 . ~ @2 L D 2 n 1 n 2 " 2 . ~ @2 X 1 "X 2 "6 1 "6 2 Z V n 1 n 2 "2 @ 1 @ 1 n1 n12 n 1 "1 s 21 n 2 "1 s 22 ~ n 1 n 2 "2 ~ X 1 " X 2 " 6 1 " 6 2 1 n1 1 n2 n 1 "1 s 21 n 2 "1 s 22 is distributed like t with n 1 n 2 " 2 degrees of freedom. ~ n 1 n 2 "2 ~ 311 [Chapter 11] Hence 1"Į X 1 " X 2 " 6 1 " 6 2 P "t 1" )2 n 1 n 2 " 2 1 n1 P "t 1" )2 n 1 n 2 " 2 1 n1 1 n2 ~ 1 n1 X 1 " X 2 " t 1" n 1 n 2 " 2 ) 2 X 1 " X 2 " 6 1 " 6 2 ~ 1 n1 n 1 "1 s 21 n 2 "1 s 22 1 n2 ~ 1 n1 1 n2 ~ n 1 n 2 "2 1 n2 X 1 " X 2 t 1" n 1 n 2 " 2 ) 2 i.e. x 1 " x 2 o t 1" )2 n 1 n 2 " 2 ~ n 1 n 2 "2 t 1" )2 n 1 n 2 " 2 P ~ n 1 n 2 "2 n 1 "1 s 21 n 2 "1 s 22 1 n2 t 1" )2 n 1 n 2 " 2 n 1 "1 s 21 n 2 "1 s 22 n 1 "1 s 21 n 2 "1 s 22 ~ ~ n 1 n 2 "2 1 n1 6 1 " 6 2 n 1 "1 s 21 n 2 "1 s 22 1 n2 ~ ~ n 1 n 2 "2 n 1 "1 s 21 n 2 "1 s 22 n 1 n 2 "2 is a 1001 " ) % confidence interval for 6 1 " 6 2 . (ii) @ 21 p @ 22 The problem of determining confidence intervals for 6 1 " 6 2 in case @ 21 p @ 22 is a well known problem in statistical literature. Unfortunately no easy solution exists for this problem and we will only give two approximate solutions below. X 1 " X 2 " 6 1 " 6 2 d If n 1 v . and n 2 v . it can be shown that v Z where 2 2 s1 ~ n1 s2 ~ n2 Z L N0, 1 . For large values of n 1 and n 2 we then have that 1 " ) ` P¡"z 1" )2 t X 1 " X 2 " 6 1 " 6 2 s 21 ~ n1 s 22 t z 1" )2 ¢ from which we find ~ n2 approximate 1 " ) 100% confidence intervals for 6 1 " 6 2 of x 1 " x 2 o z 1" )2 s 21 n1 s 22 n2 . 312 [Chapter 11] NOTE : If n 1 and n 2 are big, the approximate confidence intervals for 6 1 " 6 2 apply even if the populations are not normally distributed. For small values of n 1 and n 2 of samples from normal populations reasonably accurate 1 " ) 100% confidence intervals for 6 1 " 6 2 is obtained by using the fact that X 1 " X 2 " 6 1 " 6 2 s 21 ~ n1 where v is approximately distributed like t with v degrees of freedom s 22 ~ n2 s 21 n1 2 s 22 n2 2 s 21 n1 n 1 "1 s 22 n2 . 2 n 2 "1 Hence 1 " ) ` P "t 1" )2 v t X 1 " X 2 " 6 1 " 6 2 s 21 s 22 t t 1" )2 v n~2 from which we can derive approximate 1 " Į 100% confidence intervals for ȝ 1 " ȝ 2 of ~ n1 x 1 " x 2 o t 1" )2 v s 21 n1 s 22 n2 (b) Confidence interval for . @ 21 @ 22 n 1 " 1 s 21 ~ From th.8.3.6 we have that @2 n 2 " 1 s 22 ~ @2 L D 2 n 2 " 1 . From th.8.4.7 it then follows that n 1 "1 s 21 ~ @ 21 F n 1 "1 n 2 "1 s 22 ~ @ 22 L D 2 n 1 " 1 and is independent of s 21 s 21 ~ @ 21 s 22 s 22 ~ ~ @ 21 @ 22 ~ @ 22 n 2 "1 i.e. F is a pivotal quantity for @ 21 @ 22 . L Fn 1 " 1, n 2 " 1 313 [Chapter 11] Therefore s 21 ~ s 22 1 " ) P f )2 n 1 " 1, n 2 " 1 t P 1 f )2 n 1 " 1, n 2 " 1 t f 1" )2 n 1 " 1, n 2 " 1 ~ @ 21 @ 22 @ 21 @ 22 u u s 21 ~ 1 f 1" )2 n 1 " 1, n 2 " 1 s 22 ~ P s 21 s 21 ~ s 22 ~ s 22 ~ f n 1 " 1, n 2 " 1 ) 2 @ 21 @ 22 u u ~ f 1" )2 n 1 " 1, n 2 " 1 s 21 P for f 1" )2 n 1 " 1, n 2 " 1 @ 21 @ 22 . ~ f 1" )2 n 1 " 1, n 2 " 1 s 21 s 22 i.e. 
$\left( \dfrac{s_1^2/s_2^2}{f_{1-\alpha/2}(n_1-1,\,n_2-1)},\ \dfrac{s_1^2/s_2^2}{f_{\alpha/2}(n_1-1,\,n_2-1)} \right)$ is a $100(1-\alpha)\%$ confidence interval for $\sigma_1^2/\sigma_2^2$.

11.5.3 RANDOM SAMPLES FROM NORMAL POPULATIONS : PAIRED OBSERVATIONS

Suppose that n experimental units are selected at random from a population of experimental units and that two observations, say $X_{1i}$ and $X_{2i}$, are made on the i-th experimental unit. Suppose also that the joint distribution of $(X_{1i}, X_{2i})$ is bivariate normal with expected values $\mu_1$ and $\mu_2$, variances $\sigma_1^2$ and $\sigma_2^2$ and covariance $\sigma_{12}$. Then $Y_i = X_{1i} - X_{2i}$, $i = 1, 2, 3, \ldots, n$, are independent normal random variables with expected value $\mu_1 - \mu_2$ and variance $\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}$.

Let $\bar{Y} = \bar{X}_1 - \bar{X}_2$ be the sample mean of $Y_1, Y_2, \ldots, Y_n$ and let $s^2$ be the sample variance. Then $\dfrac{\bar{Y} - (\mu_1 - \mu_2)}{s/\sqrt{n}}$ is distributed like t with $n-1$ degrees of freedom, i.e. it is a pivotal quantity for $\mu_1 - \mu_2$. Hence

$\left( \bar{y} - t_{1-\alpha/2}(n-1)\dfrac{s}{\sqrt{n}},\ \bar{y} + t_{1-\alpha/2}(n-1)\dfrac{s}{\sqrt{n}} \right)$

is a $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$.

NOTE :
The idea of using the differences between $X_{1i}$ and $X_{2i}$ to get confidence intervals for $\mu_1 - \mu_2$ can also be used in the case of independent samples to get exact confidence intervals when $\sigma_1^2 \neq \sigma_2^2$. Suppose that $X_{11}, X_{12}, \ldots, X_{1n_1}$ is a random sample of size $n_1$ from a normal population with expected value $\mu_1$ and variance $\sigma_1^2$ and that $X_{21}, X_{22}, \ldots, X_{2n_2}$ is an independent random sample of size $n_2$ from a normal population with expected value $\mu_2$ and variance $\sigma_2^2$. Suppose that $n_1 \le n_2$. Let $Y_i = X_{1i} - X_{2i}$, $i = 1, 2, 3, \ldots, n_1$. Then $Y_1, Y_2, \ldots, Y_{n_1}$ are independent normal random variables with expected value $\mu_1 - \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$. Let $\bar{Y}$ be the sample mean of $Y_1, Y_2, \ldots, Y_{n_1}$ and let $s^2$ be the sample variance. Then $\dfrac{\bar{Y} - (\mu_1 - \mu_2)}{s/\sqrt{n_1}}$ is distributed like t with $n_1 - 1$ degrees of freedom, i.e. it is a pivotal quantity for $\mu_1 - \mu_2$. Hence

$\left( \bar{y} - t_{1-\alpha/2}(n_1-1)\dfrac{s}{\sqrt{n_1}},\ \bar{y} + t_{1-\alpha/2}(n_1-1)\dfrac{s}{\sqrt{n_1}} \right)$

is a $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$. Although this method gives exact confidence intervals, their expected length will be greater than that of the approximate intervals given before since the method does not use all the information available.

11.5.4 CONFIDENCE INTERVALS FOR THE PROBABILITY OF SUCCESS : ONE SAMPLE CASE

Suppose that $X_1, X_2, \ldots, X_n$ are independent Bernoulli random variables with $p$ the probability of success for all of them. Then $X = \sum_{i=1}^{n} X_i$ is the number of successes in the $n$ trials and $\hat{p} = X/n$, the proportion of successes, is the maximum likelihood estimator of $p$. In this case we have that

$\dfrac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \xrightarrow{d} Z$ where $Z \sim N(0, 1)$.

For large values of $n$ it then follows that

$1 - \alpha \approx P\left( -z_{1-\alpha/2} \le \dfrac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \le z_{1-\alpha/2} \right)$

i.e. $\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$ is an approximate $100(1-\alpha)\%$ confidence interval for $p$. Since there does not exist a pivotal quantity for $p$ and $X$ is a discrete random variable, we can only find conservative $100(1-\alpha)\%$ confidence intervals for $p$ in the case of small values of $n$. To find 95% and 99% conservative confidence intervals for $p$, Table 41 of Biometrika Tables for Statisticians Vol. I by Pearson and Hartley can be used.

11.5.5 CONFIDENCE INTERVALS FOR THE DIFFERENCE BETWEEN TWO PROBABILITIES OF SUCCESS

Suppose that $X_1 \sim BIN(n_1, p_1)$ and that $X_2 \sim BIN(n_2, p_2)$ and that $X_1$ and $X_2$ are independent. Let $\hat{p}_1 = X_1/n_1$ and $\hat{p}_2 = X_2/n_2$. If $n_1 \to \infty$ and $n_2 \to \infty$
we get that ~ § § p1 " p2 ~ " p 1 " p 2 ~ § § p 1 1" p 1 ~ ~ n1 § § p 2 1" p 2 ~ ~ n2 ~ d v Z where Z L N0, 1 . 316 [Chapter 11] For large values of n 1 and n 2 we then have that § § p1 " p2 1 " ) ` P "z 1" )2 t ~ § § p 1 1" p 1 ~ ~ n1 § § i.e. p 1 " p 2 o z 1" )2 " p 1 " p 2 ~ § § p 1 1" p 1 n1 § § p 2 1" p 2 ~ t z 1" )2 ~ n2 § § p 2 1" p 2 n2 is an approximate 1001 " ) % confidence interval for p 1 " p 2 . 11.5.6 CONFIDENCE INTERVALS FOR THE PARAMETER OF A POISSON DISTRIBUTION Suppose that X 1 , X 2 , T , X n is a random sample from a population with a Poisson distribution with parameter 6 . Let X be the sample mean. As n v . we have that X"6 X n d v Z where Z L N0, 1 . For large n 1 " ) ` P "z 1" )2 t X"6 X n t z 1" )2 x n is an approximate 1001 " ) % confidence interval for 6. i.e. x o z 1" )2 If X L POI6 then Table 40 of Biometrika Tables For Statisticians vol. I by Pearson and Hartly can be used to find conservative confidence intervals for 6 . If X 1 , X 2 , T , X n is a random sample from a population with a Poisson distribution with n n i1 i1 parameter 6 , ! X i L POIn6 . In that case ! X i can be used in Table 40 to determine a confidence interval for n6 which can then be used to determine a confidence interval for 6 . 317 [Chapter 12] CHAPTER 12 : TESTS OF HYPOTHESES 12.1 INTRODUCTION Definition 12.1.1 : Suppose that X is a random variable with distribution fx; ș . A statistical hypotheses is some statement about the distribution of X. If there is only one possible distribution for which the statement is true it is called a simple hypothesis. If there are more than one distribution for which the statement is true it is called a composite hypothesis. EXAMPLE 12.1.1 : A statement that the random variable X has a normal distribution with expected value 100 and variance 225 is a simple hypothesis. A statement that X has a normal distribution with expected value 100 is a composite hypothesis since the statement would be true for any N100, @ 2 , @ 2 0, distribution. Suppose that X is a random variable with distribution fx; 2 where the mathematical form of f is known e.g. an exponential distribution. In such cases it is usual to formulate the hypothesis only in terms of 2. Suppose that ( is the set of all possible values of 2. Let ( 0 (. In this case a statement that the distribution of X is such that 2 ( 0 is a hypothesis that is usually stated as H : 2 ( 0 . If the hypothesis is correct then 2 ( 0 and if the hypothesis is not correct then 2 ( " ( 0 . If ( 0 consists of only one element, then the hypothesis is simple. If ( 0 consists of more than one element, then the hypothesis is composite. To make a decision whether or not a hypothesis is true, it is necessary to plan some experiment which will give us observations of the random variable X. Suppose that X 1 , X 2 , T , X n is a random sample from a population with distribution fx; ș . Then the X i ’s will be independent random variables all with distribution fx; ș . Now suppose that x 1 , x 2 , T , x n are the observed values of X 1 , X 2 , T , X n . On the basis of these observed values it is then necessary to make a decision whether or not to accept the hypothesis. We will decide to reject the hypothesis if the observed values are very unlikely , maybe impossible, if the hypothesis is true. Even if the observed values are very unlikely if the hypothesis is true, it will never be a ”proof” that the hypothesis is not true. 
It is therefore possible to make an incorrect decision about the hypothesis on the basis of the observed values. 318 [Chapter 12] Two types of incorrect decisions are possible : TYPE I ERROR - To reject a hypothesis if the hypothesis is correct TYPE II ERROR - Not to reject a hypothesis if the hypothesis is not correct. NOTE : There are two possible decisions we can make namely to reject a hypothesis or not to reject a hypothesis. A hypothesis is rejected if the observed values are such that it is very unlikely to observe such values if the hypothesis is correct. This does not mean that if the hypothesis is not rejected that the observed values strongly support the hypothesis. It is therefore not correct to say that if a hypothesis is not rejected that we actually accept the hypothesis as correct. Depending on the set of observed values for which we will reject the hypothesis we can determine the probabilities of type I and type II errors. Definition 12.1.2 : The set of all possible observed values for which we would reject the hypothesis is called the critical region of the test. Suppose that the critical region for a test is C. If x 1 , x 2 , T , x n C the hypothesis is rejected and if x 1 , x 2 , T , x n R n " C the hypothesis is not rejected. Normally researchers would like to prove that something, say some treatment, does make a difference or that there is some relationship between variables. In such cases the onus to prove the existence of such differences or relationships is on the researcher i.e. the researcher must provide the necessary data to prove the statement. In such cases the usual procedure is to formulate the hypotheses that there is no difference or no relationship. This is known as the null hypothesis and is normally indicated by H 0 . The onus is then on the researcher to show that the observed data is very unlikely on the assumption of no difference or no relationship. It is therefore required that there is only a very small probability that the null hypothesis will be rejected if in fact the null hypothesis is correct i.e. P¡reject H 0 |H 0 is true¢ P¡Type I error¢ must be small for all cases where H 0 is correct. 319 [Chapter 12] Definition 12.1.3 : Suppose that H 0 is a simple hypothesis. Then Į P¡type I error¢ P¡reject H 0 |H 0 is true¢ is called the significance level of the test. If H 0 is a composite hypothesis, then the maximum of P¡type I error¢ over all cases for which H 0 is true, is called the size of the test. EXAMPLE 12.1.2 : Suppose that the lifetime of a standard electronic component is exponentially distributed with parameter 1 i.e. the expected lifetime of a standard component is 1. The producer of a new type of component claims that the new type of component is better than the standard component i.e. that the new component has an expected lifetime greater than 1. (i) In this case the null hypothesis is that the expected lifetime of the new type of component does not differ from the expected lifetime of the standard component. The onus is on the producer of the new type of component to prove that the expected lifetime of the new type of component is greater than that of the standard type of component. The null hypothesis is that the lifetime of the new type of component is exponentially distributed with expected lifetime equal to 1. Hence the null hypothesis is H 0 : ș 1 where ș is the parameter of the exponential distribution. The null hypothesis is not true if ș p 1. This is called the alternative hypothesis and is indicated by H A : ș p 1. 
Suppose that the lifetimes of 10 of the new type of component are observed and that X 1 , X 2 , T , X 10 are independent exponentially distributed random variables with parameter ș. n Suppose the null hypothesis is rejected if 2 ! x i Ȥ 2.99 20 37. 57. i1 Then P¡reject H 0 |H 0 correct¢ P 10 2 ! X i /1 37. 57| 2 1 i1 P¡D 2 20 0. 01. D 2.99 20 ¢ by th.8.3.3 320 [Chapter 12] The probability is therefore 0.01 that H 0 will be rejected if H 0 is correct i.e. if ș 1. For the simple hypothesis H 0 : ș 1 the significance level of the test is 0.01. (ii) Since the producer claims that the new type of component is better than the standard type of component, his claim really is that 2 1. It would therefore be better to formulate the null hypothesis in such a way that if the null hypothesis is rejected it would mean that 2 1. The alternative hypothesis should therefore be H A : 2 1 and since ( £2 | 2 0¤ we will have to take ( 0 £2 | 0 2 t 1¤. In this case we have a composite null hypothesis H 0 : 2 ( 0 . Suppose that we use the same critical region as above namely to reject H 0 if n 2 ! X i Ȥ 2.99 20 37. 57. i1 For any value of ș such that 0 ș t 1 we then have that P¡reject H 0 | 2¢ P 2 ! i1 X i D 2.99 20 | 2 10 P 2 ! i1 X i D 2 20 .99 |2 2 2 tP 2 ! i1 X i D 2.99 20 | 2 2 10 10 t P¡D 2 20 D 2.99 20 ¢ since 2 t 1 i.e. D 2.99 20 t D 2.99 20 2 by th.8.3.3 0. 01. The maximum value of the probability of a type I error for all values of ș such that 0 ș t 1 is therefore 0.01 i.e. the size of the test is 0.01. Let ȕ P¡ type II error ¢ P¡ do not reject H 0 | H 0 is not correct ¢. Obviously we would like the probability of a type II error i.e. ȕ to be as small as possible subject to the condition that the probability of a type I error must be small. But 1 " ȕ P¡ reject H 0 | H 0 is not correct ¢ i.e. we would like to make 1 " ȕ to be as big as possible. 321 [Chapter 12] Definition 12.1.4 : Let ʌș P¡ reject H 0 | ș¢ for all ș ȍ. The function = defined for all possible values of 2, is called the power function of the test. NOTE that the power function is always the probability to reject the null hypothesis. For ș ȍ 0 we have that =2 P¡ reject H 0 | 2¢ for 2 ( 0 i.e. it is the probability of a type I error and we want this as small as possible for all values of 2 which belong to ( 0 . For ș ȍ " ȍ 0 we have that =2 P¡ reject H 0 | 2¢ for 2 ( " ( 0 i.e. it is the probability to reject H 0 when in fact H 0 is not true i.e. it is the probability of a correct decision i.e. we want this as close to 1 as possible for all values of 2 which belong to ( " ( 0 . An ideal test would therefore be a test for which =2 0 for all 2 ( 0 1 for all 2 ( " ( 0 which means that we would always make the correct decision i.e never reject H 0 if it is true and always reject H 0 if it is not true. One would therefore always try to construct the test of an hypothesis so that the power function of the test is as close as possible to the power function of the ideal test. EXAMPLE 12.1.3 : Continuation of Example 12.1.2. Suppose that X 1 , X 2 , T , X 10 are independent exponential random variables with parameter 2. Let ( 0 £2 | 0 2 t 1¤ and H 0 : 2 ( 0 and H A : 2 ( " ( 0 . 10 Suppose that the test is to reject H 0 if 2 ! x i D 2.99 20 37. 57 where x 1 , x 2 , T , x 10 i1 are the observed values of X 1 , X 2 , T , X 10 . 322 Then =2 [Chapter 12] P¡reject H 0 | 2¢ P 2 ! i1 X i D 2.99 20 | 2 10 2 ! i1 X i D 2 20 .99 |2 2 2 10 P P D 2 20 37. 57 | 2 . 2 by th.8.3.3 For different values of 2 this probability is given in the table below. 
POWER FUNCTION OF TEST : n 10, Į 0. 01. 2: 0. 3 0. 6 1. 0 1. 5 2. 0 2. 5 3. 0 37. 57 125. 2 62. 6 37. 6 25. 0 18. 8 15. 0 12. 5 2 =2 . 00001 . 00001 0. 01 0. 2014 0. 5349 0. 7764 . 8978 Note that although =2 t 0. 01 for all 2 ( 0 , the probability is only 0.2014 to reject H 0 even if 2 1. 5 i.e. if the expected lifetime of the new type of component is 50% bigger than that of the standard component. For 2 1. 5 there is an 80% probability of not rejecting the null hypothesis. The question now is if we can improve on the test to come closer to that of the ideal test. One possibility is to increase the size ot the test i.e. to allow a bigger probability of rejecting H 0 even if H 0 is true. 10 Suppose that H 0 will be rejected if 2 ! x i Ȥ 2.95 20 31. 41 where x 1 , x 2 , T , x 10 are the observed values of X 1 , X 2 , T , X 10 . Then =2 P¡reject H 0 | 2¢ i1 P 2 ! i1 X i D 2.95 20 | 2 10 2 ! i1 X i D 2 20 .95 |2 2 2 10 P P D 2 20 31. 41 | 2 . 2 by th.8.3.3 323 [Chapter 12] For different values of 2 this probability is given in the table below. POWER FUNCTION OF TEST : n 10, Į 0. 05. 2: 0. 3 0. 6 1. 0 1. 5 2. 0 2. 5 3. 0 31. 41 104. 7 52. 4 31. 41 20. 9 15. 7 12. 6 10. 5 2 =2 . 00001 . 00001 0. 05 0. 397 0. 735 . 898 . 958 The size of the test ( i.e. the maximum value of the probability to reject H 0 if in fact H 0 is true) is now 0.05 in stead of 0.01 as before. For 2 1. 5 the probability to reject H 0 is now 0.397 compared to 0.2014 before. The ”cost” of this improvement of the power of the test when H 0 is not true is to increase the probability of a type I error. Another way to improve the power of the test is to take more observations. Suppose we make 20 observations i.e. X 1 , X 2 , T , X 20 are independent exponential random variables with parameter ș. 20 Suppose that H 0 will be rejected if 2 ! x i D 2.95 40 55. 76 where x 1 , x 2 , T , x 20 i1 are the observed values of X 1 , X 2 , T , X 20 . Then =2 P¡reject H 0 | 2¢ P 2 ! i1 X i D 2.95 40 | 2 20 2 ! i1 X i D 2 40 .95 |2 2 2 20 P P D 2 40 55. 76 | 2 . 2 by th.8.3.3 For different values of 2 this probability is given in the table below. POWER FUNCTION OF TEST : n 20, Į 0. 05. 2: 0. 3 0. 6 1. 0 1. 5 2. 0 2. 5 3. 0 55. 76 186 92. 9 55. 76 37. 2 27. 9 22. 3 18. 6 2 =2 . 00001 . 00001 0. 05 0. 606 0. 926 . 991 . 999 Note that for 2 1 i.e. when H 0 is not true, the probability of rejecting is much higher than before. 324 [Chapter 12] The three power functions above can be represented graphically as follows. 1 power 0.8 0.6 0.4 0.2 0 0 1 2 3 theta 4 Note that for ș 1 i.e. ș ȍ 0 the power of the test is much higher for n 20 than for n 10. The question arises, given Į 0. 05 , how many observations should be made such that the power is at least 0.95 when ș 2 i.e. if ș 2 we want to be at least 95% certain that H 0 will be rejected. If Į 0. 05 and n observations are made then H 0 should be rejected if n 2 ! x i Ȥ 2.95 2n . i1 Then =2 P¡reject H 0 | 2¢ P 2 ! i1 X i D 2.95 2n | 2 n 2 ! i1 X i D 2 2n .95 |2 2 2 n P P D 2 2n D 2.95 2n 2 . by th.8.3.3 We then have to determine the smallest value for n such that D 2 2n =2 P D 2 2n .95 u 0. 95. 2 This has to be done in a trial and error basis i.e. try different values of n until we find the smallest value of n. The table below show an example of such calculations starting with n 25. 325 [Chapter 12] DETERMINING THE SMALLEST n SUCH THAT ʌ2 u 0. 95 D 2.95 2n n D 2.95 2n =2 Decision 2 25 67. 5 33. 8 0. 962 D n t 25 24 65 32. 5 0. 957 D n t 24 23 62 31. 0 0. 955 D n t 23 22 60 30. 0 0. 
946 D n 23 CRITICAL REGION The choice of the critical region i.e. those values of the observed values x 1 , x 2 , T , x n for which the null hypothesis H 0 is rejected determines the values of Į, the significance level of the test, and ʌș the power of the test. In general a decision is first made about the significance level of the test and then the test is chosen in such a way that the power of the test is as big as possible if H 0 is not true. As a general rule the best choice of C , the critical region, consists of those observed values x 1 , x 2 , T , x n which are very unlikely if H 0 is true and is as likely as possible if H 0 is not true. To find such values it is necessary to pay particular attention to the formulation of the null hypothesis and the alternative hypothesis. This usually determines the shape of the test e.g. whether it is a one-sided or a two-sided test. As an example let us consider the case where X 1 , X 2 , T , X n is a random sample from a normal population with expected value ȝ and variance ı 2 where ı 2 is known. We will consider three possible cases. (i) H 0 : 6 t 6 0 ; H A : 6 6 0 . In this case ȍ 0 £ȝ | ȝ t ȝ 0 ¤ and ȍ " ȍ 0 £ȝ | ȝ ȝ 0 ¤. § The maximum likelihood estimator of ȝ is 6 X where X is the sample mean. L Choose ȝ 1 ȍ 0 i.e. ȝ 1 t ȝ 0 and choose ȝ 2 ȍ " ȍ 0 i.e. ȝ 2 ȝ 0 . Suppose that x is the observed value of X. Let f X . | 6 be the density function of X given the value of 6. Then f X x | 6 1 gives us the likelihood of observing x in case 6 6 1 i.e. when H 0 is true. Similarly f X x | 6 2 gives us the likelihood of observing x f x | 6 2 gives us the ratio of the likelihood if 6 6 2 i.e. when H A is true. The ratio X f X x | 6 1 of observing x when 6 6 2 to the likelihood of observing x when 6 6 1 . 326 [Chapter 12] Hence if this ratio is much greater than 1 it is much more likely to observe x if H A is true than when H 0 is true. If the ratio is much less than 1 it is much more likely to observe x if H 0 is true than when H A is true. We will therefore reject H 0 if the ratio is big and not reject H 0 if the ratio is small. Since X L N6, @ 2 /n we get that f X x | 6 2 f X x | 6 1 1 2= 1 2= e " n 2@ 2 e " n 2@ 2 " n @ n @ n exp " 1 2 2 @n x " 6 2 2 exp " 1 2 2 @n x " 6 1 2 x"6 2 2 "x"6 1 2 ¡6 22 "6 21 "2x6 2 "6 1 ¢ ¡6 2 "6 2 ¢ n ¡2x6 "6 ¢ 2 1 e 2@ 2 2 1 e 2@ 2 Since 6 2 " 6 1 0 the ratio will be big if x is large i.e. we will reject H 0 if x is large. The critical region will therefore be to reject H 0 if x is greater than some critical value c i.e. the form of the test is to reject H 0 if x c. The critical value is determined by the significance level or size of the test. Suppose that we want the probability of a type I error to be t ) for all 6 ( 0 i.e. when 6 t 6 0 . For all ȝ t ȝ 0 P¡reject H 0 | 6¢ P¡X c | 6¢ P X"6 @ n P Z tP Z if c " 60 @ n c"6 @ n c"6 @ n c " 60 @ n |6 |6 | 60 where Z L N0, 1 since 6 t 6 0 ) z 1") or c 6 0 @ n z 1") . The test of size Į for H 0 : ȝ t ȝ 0 vs. H A : ȝ ȝ 0 is therefore to reject H 0 if x ȝ 0 @n z 1"Į . (12.1.1) This test is called a one-sided test of size ) . 327 [Chapter 12] The power function of the test is given by =6 P¡X c | 6¢ P X 60 P X"6 @ n @ n 60 " 6 60 " 6 1"o z 1") | 6 @ n @ n z 1") | 6 z 1") . NOTE : Note that ʌȝ gets bigger as ȝ gets bigger and tend to 1 as n tends to . . The power is Į if ȝ ȝ 0 . For any value of ȝ ȝ 0 , ʌȝ gets bigger as n gets bigger. For any value of ȝ ȝ 0 , ʌȝ gets smaller as ı gets bigger. For any value of 6, =6 gets bigger as ) gets bigger. 
Now suppose that when ȝ ȝ 2 ȝ 0 we want the power of the test to be equal to 1 " ȕ. Then =6 2 1 " * 1 " o i.e. i.e. i.e. i.e. 60 " 62 @ n 60 " 62 @ n z 1") z 1") z * "z 1"* n 6 0 " 6 2 "z 1") " z 1"* @ @z 1") z 1"* n 62 " 60 n @ 2 z 1") z 1"* 6 2 " 6 0 2 2 . (12.1.2) NOTE : The bigger ı 2 is, the bigger is n. The smaller Į is, the bigger is n. The bigger 1 " ȕ is, the bigger is n. The closer 6 2 is to 6 0 , the bigger n is. (ii) H 0 : 6 u 6 0 ; H A : 6 6 0 . f x | 6 2 ( " ( 0 In this case the ratio X will be big if x is small i.e. H 0 will be f X x | 6 1 ( 0 rejected if x c where c is determined in such a way that the size of the test is ) . Note that this is also a one-sided test but in this case we reject if x is too small. 328 [Chapter 12] (iii) H 0 : 6 6 0 ; H A : 6 p 6 0 Suppose that ȝ 1 is a value of ȝ different from ȝ 0 . Then f X x | 6 1 f X x | 6 0 1 2= 1 2= @ n @ n exp " 1 2 2 @n x " 6 1 2 exp " 1 2 2 @n x " 6 0 2 e " n 2@ 2 x"6 1 2 "x"6 0 e " n 2@ 2 ¡6 21 "6 20 "2x6 1 "6 0 ¢ e " n 2@ 2 ¡6 21 "6 20 ¢ n e 2@ 2 2 ¡2x6 1 "6 0 ¢ In this case, since ȝ 1 " ȝ 0 can be either positive or negative, we see that the ratio will be big if x is big if ȝ 1 " ȝ 0 0 and the ratio will also be big if x is small when ȝ 1 " ȝ 0 0 . To provide for both possibilities we will have to reject either if x is too small or when it is too big. This is known as a two-sided test. If H 0 is true then ȝ ȝ 0 and X is normally distributed with expected value ȝ 0 and variance ı 2 /n . This means that if H 0 is true the most likely values of X are close to ȝ 0 . Therefore we will reject if the difference between x and ȝ 0 is greater than some critical value c i.e. when |x " 6 0 | c. For the significance level of the test to be Į we must have that ) =6 0 P |X " 6 0 | c | 6 0 P X " 60 @ n P |Z| i.e. c @ n c where Z L N0, 1 @ n nc ) @ z 1" 2 i.e. c | 60 @ n z 1" )2 . The test of significance level ) for H 0 : 6 6 0 vs. H A : 6 p 6 0 is therefore to reject H 0 if |x " 6 0 | @n z 1" )2 . (12.1.3) This test is called a two-sided test. 329 [Chapter 12] The power function of the test is given by =6 P |X " 6 0 | @ n z 1" )2 | 6 P X " 60 " @ n z 1" )2 | 6 P X " 6 0 @ n P X 60 " P X"6 @ n 60 " 6 @ n @ n @ n . " z 1" )2 | 6 " z 1" )2 P P Z z 1" )2 | 6 z 1" )2 | 6 X"6 @ n 60 " 6 @ n 60 " 6 @ n z 1" )2 z 1" )2 | 6 where Z L N0, 1 12.1.4 For 6 6 0 we have that - 0 and from =6 @ n z 1" )2 | 6 P X 6 0 60 " 6 60 " 6 P Z Let - @ n P Z - " z 1" )2 12.1.4 P Z - z 1" )2 it then follows that where Z L N0, 1 ` P Z - " z 1" )2 , since P Z - z 1" )2 )/2 i.e. very small. Suppose that for some given value 6 1 6 0 we want the power of the test to be 1 " *. Then =6 1 1 " * ` P Z - " z 1" )2 `P Z i.e. i.e. i.e. z 1"* ` n 6 0 " 6 1 @ 60 " 61 @ n 60 " 61 @ n " z 1" )2 " z 1" )2 ` z 1"* z 1" )2 n ` @ 2 z 1" )2 z 1"* 6 0 " 6 1 2 2 . 12. 1. 5 . 330 [Chapter 12] For 6 6 0 we have that - =6 P Z - " z 1" )2 60 " 6 @ n 0. In this case it follows from P Z - z 1" )2 12.1.4 that where Z L N0, 1 ` P Z - z 1" )2 , since P Z - " z 1" )2 )/2 i.e. very small. 1 " P Z t - z 1" )2 Suppose that for some given value 6 1 6 0 we want the power of the test to be 1 " *. Then =6 1 1 " * ` 1 " P Z t - z 1" )2 ` 1"P Z t i.e. 60 " 61 i.e. @ n z 1" )2 n 6 0 " 6 1 @ @ n z 1" )2 ` z * "z 1"* ` "z 1"* " z 1" )2 n ` i.e. 60 " 61 @ 2 z 1" )2 z 1"* 2 . 6 0 " 6 1 2 which is the same result as obtained in 12. 1. 5 . EXAMPLE 12.1.4 : Suppose that X 1 , X 2 , T , X n is a random sample from a normal population with expected value 6 and variance @ 2 10. 
To test the hypothesis H 0 : 6 t 5 vs. the alternative H A : 6 5 with a test of size ) 0. 01, we will reject H 0 if x 60 @ n z 1") see 12. 1. 1 i.e. if x 5 10 n 2. 326. If it is required that the probability to reject H 0 must be 0. 99 in case 6 8, then by 12. 1. 2 we must have that @ 2 z 1") z 1"* n 6 2 " 6 0 2 2 102. 326 2. 326 8 " 5 2 i.e. n 25 observations are required. 2 24. 05 331 [Chapter 12] To test the hypothesis H 0 : 6 5 vs. the alternative H A : 6 p 5 with significance level ) 0. 01, we will reject H 0 if |x " 6 0 | @ n see 12. 1. 3 z 1" )2 i.e. if |x " 5| 10 n 2. 576. If it is required that the probability to reject H 0 must be 0. 99 in case 6 8, then by 12. 1. 5 we must have that n @ 2 z 1" )2 z 1"* 6 0 " 6 1 2 2 102. 576 2. 326 8 " 5 2 2 26. 70 i.e. n 27 observations are needed. 12.2 P-VALUES AND THE USE OF CONFIDENCE INTERVALS FOR TESTING HYPOTHESES P-values It was established in the previous section that the choice of the significance level of a test has a very strong and direct influence on the power of the test. Furthermore it must be noticed that there is no prescription about the choice of the significance level i.e. in many cases it could be a very personal choice. The common use of 0.05 and 0.01 is merely a choice for convenience. Strictly speaking a reasonable choice should be based on the consequences of a type I error versus the consequences of a type II error. This is very seldom known and not easy to establish. For this reason it has become common practice to report the P-value of a test and leave it to the user or reader to make his own decision. Definition 12.2.1a : The P-value for any test is the smallest value of the significance level for which the null hypothesis would be rejected based on the observed values of the random variables. If the P-value of a test is 0.03 and the hypothesis would be rejected for any value of ) u 0. 03, but would not be rejected for any value of ) 0. 03. This tells the reader that the test would be rejected for ) 0. 05 and would not be rejected at ) 0. 01. If the P-value is 0.005 then it indicates to the reader that the hypothesis would be rejected for both ) 0. 05 and ) 0. 01. Since the significance level of a test is the probability to reject H 0 if in fact H 0 is correct, a very small P-value therefore indicates that the observed values of the random variables are extremely unlikely if the hypothesis is true. 332 [Chapter 12] EXAMPLE 12.2.1 : Suppose that X 1 , X 2 , T , X 10 is a random sample from a normal population. Let the hypothesis be H 0 : 6 3 vs. H A : 6 p 3. If H 0 is correct, then X " 3 is distributed s/ n L like t with 10-19 degrees of freedom. The critical region for the test is to reject if x " 3 c and the test will have significance level ) if we reject when s/ n x " 3 t 1" ) 9 . Also note that as c gets bigger the significance level gets smaller. 2 s/ n Now suppose that the observed values are such that x " 3 0. 883. For this observed s/ n value and critical value c we would reject if . 883 c. To make the significance level as small as possible, make c as big as possible and still reject for that value of c i.e. we would reject if x " 3 0. 883, because if c . 883 we would not reject H 0 . The s/ n P-value is therefore the significance level of the test if we reject when x " 3 0. 883 s/ n i.e. P P X"3 s/ n L 0. 883 2 0. 2 0. 4 as can be determined from the t table in Bain and Engelhardt p 608. The P-value gives an indication of the probability to get such a big deviation of X from 3 if the hypothesis was true. 
333 [Chapter 12] The use of confidence intervals for testing hypotheses. JTheorem 12.2.1 : Suppose that x 1 , x 2 , T , x n are the observed values of the random variables X 1 , X 2 , T , X n and that Rx 1 , x 2 , T , x n is a 1001 " ) % confidence region for the parameter 2. A test of significance level ) for the hypothesis H 0 : 2 2 0 vs. H A : 2 p 2 0 is to reject H 0 if 2 0 Rx 1 , x 2 , T , x n . Proof : Since Rx 1 , x 2 , T , x n is a 1001 " ) % confidence region for the parameter 2, we have that P 2 RX 1 , X 2 , T , X n | 2 1 " ) for all 2. The significance level of the test to reject H 0 if 2 0 Rx 1 , x 2 , T , x n is given by P reject H 0 | H 0 is true P 2 0 RX 1 , X 2 , T , X n | 2 0 1 " P 2 0 RX 1 , X 2 , T , X n | 2 0 1 " 1 " ) for all 2 0 ( ). ſ EXAMPLE 12.2.2 : Suppose that X 1 , X 2 , T , X 16 is a random sample from a normal population. Let the hypothesis be H 0 : 6 3 vs. H A : 6 p 3. and suppose that this hypothesis is to be tested with significance level ) 0. 05. Suppose that the observed values are such that x 5 and s 2 9. The 1001 " 0. 05 % 95% confidence interval for 6 is then given by x o t 1". 052 n " 1 . s n 5 o 2. 131 3 16 i.e. the interval is 3. 4, 6. 6 . Since this does not include the value 3 the hypothesis that 6 3 is rejected with significance level ) 0. 05. 12.3 TESTS OF HYPOTHESES FOR SOME STANDARD PROBLEMS. RANDOM SAMPLES FROM NORMAL POPULATIONS : ONE SAMPLE Suppose that X 1 , X 2 , T , X n is a random sample from a normal population with expected value 6 and variance @ 2 . Let X be the sample mean and s 2 the sample variance and let x and s 2 be the observed values of X and s 2 . L L 334 [Chapter 12] (a) Test for H 0 : 6 6 0 vs. H A : 6 p 6 0 : @ 2 known In this case X L N6 0 , @ 2 /n if H 0 is true and therefore true. If we reject H 0 when P reject H 0 | H 0 is true x " 60 @/ n P X " 60 L N0, 1 if H 0 is @/ n z 1" )2 the significance level of the test is X " 60 @/ n z 1" )2 | 6 0 P |Z| z 1" )2 where Z L N0, 1 ). We have previously derived the result (see (12.1.5)) that if we want the power of the test to be 1 " * for 6 6 1 , then the number of observations must be n` @ 2 z 1" )2 z 1"* 6 0 " 6 1 2 2 . NOTE : We will only give the two-sided tests. The necessary adjustment must in each case be made for one-sided hypotheses and the corresponding one-sided tests. (b) Test for H 0 : 6 6 0 vs. H A : 6 p 6 0 : @ 2 not known. In this case x " 60 s/ n X " 60 L tn " 1 if H 0 is correct. If we reject H 0 when s/ n L t 1" )2 n " 1 the significance level of the test is P reject H 0 | H 0 is true P X " 60 s/ n t 1" )2 n " 1 | 6 0 L P |T| t 1" )2 n " 1 ). where T L tn " 1 335 [Chapter 12] The power function of this test is given by =6, @ 2 P reject H 0 | 6, @ 2 P X " 60 s/ n t 1" )2 n " 1 | 6, @ 2 L P X " 6 6 " 60 s/ n t 1" )2 n " 1 | 6, @ 2 L X"6 P @ n n"1 s 2 L @2 P ZV n"1 6"6 0 @ n t 1" )2 n " 1 | 6, @ 2 /n " 1 t 1" )2 n " 1 | 6, @ 2 6 " 60 . @/ n is called a non-central t-distribution with n " 1 degrees of where Z L N0, 1 , V L D 2 n " 1 , Z and V are independent and - The distribution of ZV n"1 freedom and parameter of non-centrality -. This distribution can be used to determine the power of the t-test and to determine n that will give us a certain power for specific values of 6 and @ 2 . Table 10 of Biometrika Tables for Statisticians Vol I by Pearson and Hartly gives the power of the test for values of ) 0. 05 and ) 0. 01, different degrees of freedom and C 6 " 60 @/ n 1 . 
2 EXAMPLE 12.3.1 : Suppose that X 1 , X 2 , T , X 10 is a random sample from a normal population with expected value 6 and variance @ 2 . Let the hypothesis be H 0 : 6 5 vs. H A : 6 p 5 and suppose that this hypothesis is to be tested with significance level ) 0. 05. In this case X"5 s/ n L t9 if H 0 is correct. If we reject H 0 when L x"5 s/ n t 1" .052 9 2. 262 the significance level of the test is 0.05. 336 [Chapter 12] If 6 8 and @ 2 10 then the power of the test can be determined from table 10 in Biometrika Tables vol I for ) 0. 05, degrees of freedom 9 and 8"5 C 1 2. 12. From table 10 we get the power as 0.76. 2 10 / 10 Now suppose that we want to determine the number of observations required to ensure that the power is at least 0.95 if 6 8 and @ 2 10. The calculations has to be done on a trial and error basis, an example of which is shown in the table below. DETERMINATION OF n TO ENSURE POWER OF .95 FOR 6 8 AND @ 2 10. n C 15 2. 6 =8, 10 Decision 0. 93 D n 15 18 2. 84 0. 96 D n t 18 17 2. 77 0. 95 D n t 17 16 2. 68 0. 95 D n 17 Table 8 p612 in the book of Bain and Engelhardt gives an indication of the values of n that are required. For the example above we have that 2) 0. 05, d 3 . 95 and 10 from the table we find that n 18 observations are required. (c) Test for H 0 : @ 2 @ 20 vs. H A : @ 2 p @ 20 . n " 1 s 2 L If H 0 is correct we have that L D 2 n " 1 and if we reject H 0 whenever @ 20 n " 1 s 2 n " 1 s 2 2) D D 21" )2 n " 1 the significance level of the " 1 or when n 2 @ 20 @ 20 test will be ). 337 [Chapter 12] SAMPLES FROM NORMAL POPULATIONS : TWO INDEPENDENT SAMPLES Suppose that X 11 , X 12 , T , X 1n 1 is a random sample of size n 1 from a normal population with expected value 6 1 and variance @ 21 . Let X 1 be the sample mean of the X 1i ’s i.e. 2 n1 n1 X 1i ! i1 ! i1 X 1i " X 1 2 2 X1 . Let s 1 be the sample variance of the X 1i ’s i.e. s 1 . n1 n1 " 1 L L Suppose that X 21 , X 22 , T , X 2n 2 is an independent random sample of size n 2 from a normal population with expected value 6 2 and variance @ 22 . Let X 2 be the sample n2 X 2i ! i1 . Let s 22 be the sample variance of the X 2i ’s i.e. mean of the X 2i ’s i.e. X 2 n2 s 22 n2 ! i1 X 2i " X 2 n2 " 1 L L 2 . (a) Test for H 0 : 6 1 " 6 2 d vs. H A : 6 1 " 6 2 p d : @ 21 @ 22 @ 2 n 1 " 1 s 21 L In this case X 1 L N6 1 , @ 2 /n 1 and is independent of L D 2 n 1 " 1 . @2 n 2 " 1 s 22 L They are independent of X 2 L N6 2 , @ 2 /n 2 and L D 2 n 2 " 1 and @2 the last two also are independent random variables. Hence X 1 " X 2 L N 6 1 " 6 2 , n 1 " 1 s 21 @2 L Therefore Z @2 n1 @2 n2 and is independent of n 2 " 1 s 22 @2 L L D 2 n 1 n 2 " 2 . X 1 " X 2 " 6 1 " 6 2 @2 n1 @2 n2 L N0, 1 n 1 " 1 s 21 and is independent of V @2 L n 2 " 1 s 22 L @2 L D 2 n 1 n 2 " 2 . X 1 "X 2 "6 1 "6 2 Hence T Z V n 1 n 2 "2 @ 1 @ 1 n1 n12 n 1 "1 s 21 n 2 "1 s 22 L n 1 n 2 "2 L X 1 " X 2 " 6 1 " 6 2 1 n1 is distributed like t with n 1 n 2 " 2 degrees of freedom. 1 n2 n 1 "1 s 21 n 2 "1 s 22 L n 1 n 2 "2 L 338 [Chapter 12] If H 0 is true X 1 " X 2 " d 1 n1 n 1 "1 s 21 n 2 "1 s 22 1 n2 L n 1 n 2 "2 L tn 1 n 2 " 2 . L A test of significance level ) is to reject H 0 if x 1 " x 2 " d 1 n1 1 n2 n 1 "1 s 21 n 2 "1 s 22 n 1 n 2 "2 t 1" )2 n 1 n 2 " 2 . (b) Test for H 0 : 6 1 " 6 2 d vs. H A : 6 1 " 6 2 p d : @ 21 p @ 22 and n 1 and n 2 large. If H 0 is correct and n 1 v . and n 2 v . 
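Before turning to the two-sample tests, the power computation of Example 12.3.1 can be checked numerically using the non-central t-distribution instead of Table 10 of the Biometrika Tables. A sketch with the numbers of that example:

```python
import numpy as np
from scipy.stats import t, nct

# Example 12.3.1: H0: mu = 5 vs HA: mu != 5, n = 10, alpha = 0.05,
# power evaluated at mu = 8 when sigma^2 = 10
n, alpha = 10, 0.05
delta = (8 - 5) / (np.sqrt(10) / np.sqrt(n))        # non-centrality parameter
t_crit = t.ppf(1 - alpha / 2, n - 1)
power = 1 - nct.cdf(t_crit, n - 1, delta) + nct.cdf(-t_crit, n - 1, delta)
print(round(power, 2))                              # roughly 0.76, as in the example
```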
we have that X 1 " X 2 " d s 21 L n1 s 22 L n2 d v Z where Z L N0, 1 For large values of n 1 and n 2 a test of H 0 with approximate significance level ) is therefore to reject H 0 if x 1 " x 2 " d s 21 n1 s 22 n2 z 1" )2 . 339 [Chapter 12] (c) Test for H 0 : 6 1 " 6 2 d vs. H A : 6 1 " 6 2 p d : @ 21 p @ 22 and n 1 or n 2 small. X 1 " X 2 " d If H 0 is correct then is approximately distributed like t with v degrees 2 2 s1 L n1 s2 L n2 of freedom where s 21 n1 v s 21 n1 2 n 1 "1 2 s 22 n2 . 2 s 22 n2 n 2 "1 A test with approximate significance level ) is to reject H 0 if x 1 " x 2 " d s 21 n1 s 22 n2 t 1" )2 v . @ 21 @ 22 (d) Test for H 0 : r vs. H A : @ 21 @ 22 pr n 1 " 1 s 21 L In this case n 2 " 1 s 22 L D 2 n 1 " 1 and @ 21 L @ 22 L D 2 n 2 " 1 and they are independent since the samples are independent. Therefore n 1 "1 s 21 n 2 "1 s 22 @ 21 @ 22 L n1 " 1 L µ n2 " 1 s 21 If H 0 is true then L s 22 s 21 L s 22 L µ @ 21 @ 22 ~Fn 1 " 1, n 2 " 1 . µ r L Fn 1 " 1, n 2 " 1 . A two-sided test of significance level L ) is therefore to reject H 0 if s 21 s 22 µ r f )2 n 1 " 1, n 2 " 1 or if s 21 s 22 µ r f 1" )2 n 1 " 1, n 2 " 1 . 340 [Chapter 12] SAMPLES FROM NORMAL POPULATIONS: PAIRED OBSERVATIONS Suppose that a random sample of n pairs of observations X 1i , X 2i is drawn from a population with a bivariate normal distribution with expected values 6 1 and 6 2 and variances @ 21 and @ 22 and covariance @ 12 . Then Y i X 1i " X 2i for i1,2,3,T,n are independent normal random variables with expected value 6 1 " 6 2 and variance @ 21 @ 22 " 2@ 12 . Let Y be the sample mean and s 2 the sample variance of the Y i ’s and let y and s 2 be L the observed values of Y and s 2 . Y " 6 1 " 6 2 Then s/ n L L tn " 1 . L Let the hypothesis be H 0 : 6 1 " 6 2 d vs. H A : 6 1 " 6 2 p d. If H 0 is true Y " d L tn " 1 . s/ n L A test of significance level ) is to reject H 0 if y"d s/ n t 1" )2 n " 1 . TEST FOR H 0 : p p 0 vs. H A : p p p 0 FOR THE PROBABILITY OF SUCCESS: ONE SAMPLE. Let X be the total number of successes in n independent trials of a Bernoulli experiment with probability of success p. (a) n Large § Let p X n , the proportion of successes. If H 0 is true it then follows that L § p "p 0 L p 0 .1"p 0 n d v Z where Z L N0, 1 . A test with approximate significance level ) is to reject H 0 when § p " p0 p 0 .1"p 0 n z 1" )2 . 341 [Chapter 12] (b) n Small A test of approximate significance level ) is to reject H 0 when p 0 is not an element of the 1001 " ) % confidence interval for p as determined with the help of Table 41 in Biometrika Tables. TEST FOR H 0 : p 1 " p 2 d vs. H A : p 1 " p 2 p d WHERE p 1 AND p 2 ARE PROBABILITIES OF SUCCESS IN TWO SAMPLES. Suppose that X 1 L BINn 1 , p 1 independently of X 2 L BINn 2 , p 2 . § § X2 1 Let p 1 X n 1 and p 2 n 2 . If n 1 v . and n 2 v . we get that if H 0 is correct that L L § § p1 " p2 L § § p 1 . 1" p 1 L n1 " d d L L § § p 2 . 1" p 2 L n2 v Z where Z L N0, 1 . L For large values of n 1 and n 2 we then have that a test which has approximate significance level ) is to reject H 0 if § § p 1 " p 2 " d § § . p 1 1" p 1 n1 § § p 2 .1" p 2 n2 z 1" )2 . For n 1 and/or n 2 small the hypothesis H 0 : p 1 " p 2 0 vs. H A : p 1 " p 2 p 0 can be tested using a so-called conditional test as given in Bain and Engelhardt p427. 342 [Chapter 12] TEST FOR H 0 : 6 6 0 vs. H A : 6 p 6 0 FOR THE PARAMETER OF A POISSON DISTRIBUTION. Suppose that X 1 , X 2 , T , X n is a random sample from a population with a Poisson distribution with parameter 6. Let X be the sample mean. 
(a) n Large

As $n \to \infty$ we have that, if $H_0$ is true,

$\dfrac{\bar{X} - \mu_0}{\sqrt{\mu_0/n}} \xrightarrow{d} Z$ where $Z \sim N(0, 1)$.

For large $n$ a test for which the significance level is approximately $\alpha$ is to reject $H_0$ if

$\dfrac{|\bar{x} - \mu_0|}{\sqrt{\mu_0/n}} > z_{1-\alpha/2}$.

(b) n Small

If $n$ is small, a test with significance level approximately $\alpha$ is to reject $H_0$ if $\mu_0$ is not in the confidence interval given by Table 40 of Biometrika Tables for Statisticians Vol. I by Pearson and Hartley.
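A minimal sketch of the large-sample test in (a); the Poisson counts and the hypothesised value are hypothetical, chosen only to show the calculation.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical Poisson counts; test H0: mu = 2 vs HA: mu != 2 at alpha = 0.05
x = np.array([3, 1, 2, 4, 0, 2, 5, 3, 1, 2, 4, 3, 2, 1, 3, 2, 0, 4, 2, 3])
mu0, alpha = 2.0, 0.05
n = len(x)

z_obs = (x.mean() - mu0) / np.sqrt(mu0 / n)
print(z_obs, abs(z_obs) > norm.ppf(1 - alpha / 2))   # True means reject H0
```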