Brief Review: Probability and Statistics

Probability distributions — continuous distributions

Defn (Density function)
Let $x$ denote a continuous random variable. Then $f(x)$ is called the density function of $x$ if
1) $f(x) \ge 0$,
2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$,
3) $\int_a^b f(x)\,dx = P[a \le x \le b]$.

The Normal distribution (mean $\mu$, standard deviation $\sigma$)
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The Exponential distribution
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
[Graph: the exponential density]

The Gamma distribution — an important family of distributions

The Gamma function, $\Gamma(x)$ — an important function in mathematics
The Gamma function is defined for $x > 0$ by
$$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\,du$$

The Gamma distribution
Let the continuous random variable $X$ have density function
$$f(x) = \begin{cases} \dfrac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
Then $X$ is said to have a Gamma distribution with parameters $\alpha$ and $\lambda$.
[Graph: the gamma density for $(\alpha = 2, \lambda = 0.9)$, $(\alpha = 2, \lambda = 0.6)$ and $(\alpha = 3, \lambda = 0.6)$]

Comments
1. The set of gamma distributions is a family of distributions (parameterized by $\alpha$ and $\lambda$).
2. Contained within this family are other distributions:
   a. The Exponential distribution — when $\alpha = 1$ the gamma distribution becomes the exponential distribution with parameter $\lambda$. The exponential distribution arises when we measure the lifetime, $X$, of an object that does not age. It is also used as a distribution for waiting times between events occurring uniformly in time.
   b. The Chi-square distribution — when $\alpha = n/2$ and $\lambda = 1/2$ the gamma distribution becomes the chi-square ($\chi^2$) distribution with $n$ degrees of freedom. Later we will see that a sum of squares of independent standard normal variates has a chi-square distribution, with degrees of freedom equal to the number of independent terms in the sum of squares.

The Exponential distribution
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$

The Chi-square ($\chi^2$) distribution with $n$ d.f.
$$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
[Graph: the $\chi^2$ density for $n = 4$, $n = 5$ and $n = 6$]
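The special cases noted in the comments above can be checked numerically. Below is a minimal sketch using scipy.stats; note that scipy parameterizes the gamma by a shape $a = \alpha$ and a scale equal to $1/\lambda$, and the grid of $x$ values and parameter values are chosen purely for illustration.

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 10, 50)          # illustrative grid of evaluation points

# Gamma(alpha, lambda) density: scipy uses shape a = alpha and scale = 1/lambda
alpha, lam = 2.0, 0.6
gamma_pdf = stats.gamma.pdf(x, a=alpha, scale=1/lam)

# Special case (a): alpha = 1 reduces to the exponential with rate lambda
expo_pdf = stats.gamma.pdf(x, a=1.0, scale=1/lam)
assert np.allclose(expo_pdf, stats.expon.pdf(x, scale=1/lam))

# Special case (b): alpha = n/2, lambda = 1/2 reduces to chi-square with n d.f.
n = 4
chi2_pdf = stats.gamma.pdf(x, a=n/2, scale=2.0)
assert np.allclose(chi2_pdf, stats.chi2.pdf(x, df=n))
```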
Defn (Joint density function)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote a vector of continuous random variables. Then $f(\mathbf{x}) = f(x_1, x_2, x_3, \dots, x_n)$ is called the joint density function of $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ if
1) $f(\mathbf{x}) \ge 0$,
2) $\int f(\mathbf{x})\,d\mathbf{x} = 1$,
3) $\int_R f(\mathbf{x})\,d\mathbf{x} = P[\mathbf{x} \in R]$.
Note:
$$\int f(\mathbf{x})\,d\mathbf{x} = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} f(x_1, x_2, \dots, x_n)\,dx_1\,dx_2 \cdots dx_n, \qquad \int_R f(\mathbf{x})\,d\mathbf{x} = \int\!\!\cdots\!\!\int_R f(x_1, x_2, \dots, x_n)\,dx_1\,dx_2 \cdots dx_n.$$

Defn (Marginal density function)
The marginal density of $\mathbf{x}_1 = (x_1, x_2, x_3, \dots, x_p)$ ($p < n$) is defined by
$$f_1(\mathbf{x}_1) = \int f(\mathbf{x})\,d\mathbf{x}_2 = \int f(\mathbf{x}_1, \mathbf{x}_2)\,d\mathbf{x}_2, \quad \text{where } \mathbf{x}_2 = (x_{p+1}, x_{p+2}, x_{p+3}, \dots, x_n).$$
The marginal density of $\mathbf{x}_2 = (x_{p+1}, x_{p+2}, x_{p+3}, \dots, x_n)$ is defined by
$$f_2(\mathbf{x}_2) = \int f(\mathbf{x})\,d\mathbf{x}_1 = \int f(\mathbf{x}_1, \mathbf{x}_2)\,d\mathbf{x}_1, \quad \text{where } \mathbf{x}_1 = (x_1, x_2, x_3, \dots, x_p).$$

Defn (Conditional density function)
The conditional density of $\mathbf{x}_1$ given $\mathbf{x}_2$ (defined above, $p < n$) is defined by
$$f_{1|2}(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{f(\mathbf{x})}{f_2(\mathbf{x}_2)} = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f_2(\mathbf{x}_2)}.$$
The conditional density of $\mathbf{x}_2$ given $\mathbf{x}_1$ is defined by
$$f_{2|1}(\mathbf{x}_2 \mid \mathbf{x}_1) = \frac{f(\mathbf{x})}{f_1(\mathbf{x}_1)} = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f_1(\mathbf{x}_1)}.$$
Marginal densities describe how the subvector $\mathbf{x}_i$ behaves ignoring $\mathbf{x}_j$. Conditional densities describe how the subvector $\mathbf{x}_i$ behaves when the subvector $\mathbf{x}_j$ is held fixed.

Defn (Independence)
The two sub-vectors ($\mathbf{x}_1$ and $\mathbf{x}_2$) are called independent if
$$f(\mathbf{x}) = f(\mathbf{x}_1, \mathbf{x}_2) = f_1(\mathbf{x}_1)\, f_2(\mathbf{x}_2) = \text{product of marginals},$$
or equivalently, the conditional density of $\mathbf{x}_i$ given $\mathbf{x}_j$ satisfies
$$f_{i|j}(\mathbf{x}_i \mid \mathbf{x}_j) = f_i(\mathbf{x}_i) = \text{marginal density of } \mathbf{x}_i.$$

Example ($p$-variate Normal)
The random vector $\mathbf{x}$ ($p \times 1$) is said to have the $p$-variate Normal distribution with mean vector $\boldsymbol{\mu}$ ($p \times 1$) and covariance matrix $\boldsymbol{\Sigma}$ ($p \times p$), written $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, if
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right].$$

Example (bivariate Normal)
The random vector $\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ is said to have the bivariate Normal distribution with mean vector $\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and covariance matrix
$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$
With $p = 2$ the $p$-variate Normal density becomes
$$f(x_1, x_2) = \frac{1}{2\pi\,(\sigma_{11}\sigma_{22} - \sigma_{12}^2)^{1/2}} \exp\!\left[-\tfrac{1}{2} Q(x_1, x_2)\right],$$
where
$$Q(x_1, x_2) = (\mathbf{x} - \boldsymbol{\mu})' \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}^{-1} (\mathbf{x} - \boldsymbol{\mu}) = \frac{\sigma_{22}(x_1 - \mu_1)^2 - 2\sigma_{12}(x_1 - \mu_1)(x_2 - \mu_2) + \sigma_{11}(x_2 - \mu_2)^2}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}.$$
Equivalently,
$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\!\left[-\tfrac{1}{2} Q(x_1, x_2)\right],$$
$$Q(x_1, x_2) = \frac{1}{1 - \rho^2}\left[\left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1 - \mu_1}{\sigma_1}\right)\!\left(\frac{x_2 - \mu_2}{\sigma_2}\right) + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2\right].$$

Theorem (Transformations)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote a vector of continuous random variables with joint density function $f(x_1, x_2, x_3, \dots, x_n) = f(\mathbf{x})$. Let
$$y_1 = \phi_1(x_1, x_2, x_3, \dots, x_n),\quad y_2 = \phi_2(x_1, x_2, x_3, \dots, x_n),\ \dots,\ y_n = \phi_n(x_1, x_2, x_3, \dots, x_n)$$
define a 1-1 transformation of $\mathbf{x}$ into $\mathbf{y}$. Then the joint density of $\mathbf{y}$ is $g(\mathbf{y})$ given by
$$g(\mathbf{y}) = f(\mathbf{x})\,|J|,$$
where
$$J = \frac{\partial(\mathbf{x})}{\partial(\mathbf{y})} = \frac{\partial(x_1, x_2, x_3, \dots, x_n)}{\partial(y_1, y_2, y_3, \dots, y_n)} = \det\begin{pmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_2}{\partial y_1} & \cdots & \dfrac{\partial x_n}{\partial y_1} \\ \dfrac{\partial x_1}{\partial y_2} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_n}{\partial y_2} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial x_1}{\partial y_n} & \dfrac{\partial x_2}{\partial y_n} & \cdots & \dfrac{\partial x_n}{\partial y_n} \end{pmatrix}$$
is the Jacobian of the transformation.

Corollary (Linear Transformations)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote a vector of continuous random variables with joint density function $f(x_1, x_2, x_3, \dots, x_n) = f(\mathbf{x})$. Let
$$y_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n,\quad y_2 = a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n,\ \dots,\ y_n = a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n$$
define a 1-1 transformation of $\mathbf{x}$ into $\mathbf{y}$. Then the joint density of $\mathbf{y}$ is
$$g(\mathbf{y}) = f(\mathbf{x})\,\frac{1}{|\det(A)|} = f(A^{-1}\mathbf{y})\,\frac{1}{|\det(A)|}, \qquad \text{where } A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}.$$

Corollary (Linear Transformations for Normal Random Variables)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote a vector of continuous random variables having an $n$-variate Normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, i.e. $\mathbf{x} \sim N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let $y_1, \dots, y_n$ be defined by the 1-1 linear transformation above. Then
$$\mathbf{y} = (y_1, y_2, y_3, \dots, y_n) \sim N_n(A\boldsymbol{\mu},\, A\boldsymbol{\Sigma}A').$$
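The last corollary is easy to check by simulation. The following is a minimal sketch; the particular $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ and nonsingular $A$ are arbitrary choices for illustration, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mean vector, covariance matrix, and nonsingular A (so y = Ax is 1-1)
mu    = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 0.5]])
A     = np.array([[1.0, 1.0,  0.0],
                  [0.0, 2.0, -1.0],
                  [1.0, 0.0,  3.0]])

# Theoretical distribution of y = Ax: N(A mu, A Sigma A')
mu_y    = A @ mu
Sigma_y = A @ Sigma @ A.T

# Monte Carlo check
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T
print(np.round(y.mean(axis=0), 2), np.round(mu_y, 2))   # sample mean vs A mu
print(np.round(np.cov(y, rowvar=False), 2))             # sample covariance ...
print(np.round(Sigma_y, 2))                             # ... vs A Sigma A'
```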
Defn (Expectation)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote a vector of continuous random variables with joint density function $f(\mathbf{x}) = f(x_1, x_2, x_3, \dots, x_n)$. Let $U = h(\mathbf{x}) = h(x_1, x_2, x_3, \dots, x_n)$. Then
$$E[U] = E[h(\mathbf{x})] = \int h(\mathbf{x})\, f(\mathbf{x})\,d\mathbf{x}.$$

Defn (Conditional Expectation)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n) = (\mathbf{x}_1, \mathbf{x}_2)$ denote a vector of continuous random variables with joint density function $f(\mathbf{x}) = f(\mathbf{x}_1, \mathbf{x}_2)$. Let $U = h(\mathbf{x}_1) = h(x_1, x_2, x_3, \dots, x_p)$. Then the conditional expectation of $U$ given $\mathbf{x}_2$ is
$$E[U \mid \mathbf{x}_2] = E[h(\mathbf{x}_1) \mid \mathbf{x}_2] = \int h(\mathbf{x}_1)\, f_{1|2}(\mathbf{x}_1 \mid \mathbf{x}_2)\,d\mathbf{x}_1.$$

Defn (Variance)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote a vector of continuous random variables with joint density function $f(\mathbf{x})$. Let $U = h(\mathbf{x})$. Then
$$\sigma_U^2 = \operatorname{Var}[U] = E\big[(U - E[U])^2\big] = E\big[(h(\mathbf{x}) - E[h(\mathbf{x})])^2\big].$$

Defn (Conditional Variance)
Let $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$ denote a vector of continuous random variables with joint density function $f(\mathbf{x}) = f(\mathbf{x}_1, \mathbf{x}_2)$. Let $U = h(\mathbf{x}_1) = h(x_1, x_2, x_3, \dots, x_p)$. Then the conditional variance of $U$ given $\mathbf{x}_2$ is
$$\operatorname{Var}[U \mid \mathbf{x}_2] = E\big[(h(\mathbf{x}_1) - E[h(\mathbf{x}_1) \mid \mathbf{x}_2])^2 \mid \mathbf{x}_2\big].$$

Defn (Covariance, Correlation)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote a vector of continuous random variables with joint density function $f(\mathbf{x})$. Let $U = h(\mathbf{x})$ and $V = g(\mathbf{x})$. Then the covariance of $U$ and $V$ is
$$\operatorname{Cov}[U, V] = E\big[(U - E[U])(V - E[V])\big] = E\big[(h(\mathbf{x}) - E[h(\mathbf{x})])(g(\mathbf{x}) - E[g(\mathbf{x})])\big],$$
and the correlation of $U$ and $V$ is
$$\rho_{UV} = \frac{\operatorname{Cov}[U, V]}{\sqrt{\operatorname{Var}(U)\operatorname{Var}(V)}}.$$

Properties — Expectation, Variance, Covariance, Correlation
1. $E[a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n] = a_1 E[x_1] + a_2 E[x_2] + a_3 E[x_3] + \cdots + a_n E[x_n]$, or $E[\mathbf{a}'\mathbf{x}] = \mathbf{a}' E[\mathbf{x}]$.
2. $E[UV] = E[h(\mathbf{x}_1) g(\mathbf{x}_2)] = E[U]\,E[V] = E[h(\mathbf{x}_1)]\,E[g(\mathbf{x}_2)]$ if $\mathbf{x}_1$ and $\mathbf{x}_2$ are independent.
3. $\operatorname{Var}[a_1 x_1 + a_2 x_2 + \cdots + a_n x_n] = \sum_{i=1}^n a_i^2 \operatorname{Var}[x_i] + 2\sum_{i < j} a_i a_j \operatorname{Cov}[x_i, x_j]$, or $\operatorname{Var}[\mathbf{a}'\mathbf{x}] = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$, where
$$\boldsymbol{\Sigma} = \begin{pmatrix} \operatorname{Var}(x_1) & \operatorname{Cov}(x_1, x_2) & \cdots & \operatorname{Cov}(x_1, x_n) \\ \operatorname{Cov}(x_2, x_1) & \operatorname{Var}(x_2) & \cdots & \operatorname{Cov}(x_2, x_n) \\ \vdots & \vdots & & \vdots \\ \operatorname{Cov}(x_n, x_1) & \operatorname{Cov}(x_n, x_2) & \cdots & \operatorname{Var}(x_n) \end{pmatrix}.$$
4. $\operatorname{Cov}[a_1 x_1 + a_2 x_2 + \cdots + a_n x_n,\ b_1 x_1 + b_2 x_2 + \cdots + b_n x_n] = \sum_{i=1}^n a_i b_i \operatorname{Var}[x_i] + \sum_{i \ne j} a_i b_j \operatorname{Cov}[x_i, x_j]$, or $\operatorname{Cov}[\mathbf{a}'\mathbf{x}, \mathbf{b}'\mathbf{x}] = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{b}$.
5. $E[U] = E_{\mathbf{x}_2}\big[E[U \mid \mathbf{x}_2]\big]$.
6. $\operatorname{Var}[U] = E_{\mathbf{x}_2}\big[\operatorname{Var}[U \mid \mathbf{x}_2]\big] + \operatorname{Var}_{\mathbf{x}_2}\big[E[U \mid \mathbf{x}_2]\big]$.
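Properties 3 and 4 can be checked numerically. The sketch below uses an arbitrary covariance matrix and coefficient vectors chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative covariance matrix and coefficient vectors (not from the notes)
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.5, 0.3],
                  [0.2, 0.3, 1.0]])
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.5,  1.0, -1.0])

# Large sample of x with covariance Sigma (mean 0 for simplicity)
x = rng.multivariate_normal(np.zeros(3), Sigma, size=500_000)

u = x @ a          # u = a'x for each draw
v = x @ b          # v = b'x for each draw

print(np.var(u), a @ Sigma @ a)             # property 3: Var[a'x] = a' Sigma a
print(np.cov(u, v)[0, 1], a @ Sigma @ b)    # property 4: Cov[a'x, b'x] = a' Sigma b
```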
Statistical Inference — making decisions from data

There are two main areas of Statistical Inference:
• Estimation — deciding on the value of a parameter
  – Point estimation
  – Confidence interval, confidence region estimation
• Hypothesis testing — deciding if a statement (hypothesis) about a parameter is True or False

The general statistical model — most data fit this situation

Defn (The Classical Statistical Model)
The data vector: $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$.
The model: let $f(\mathbf{x} \mid \boldsymbol{\theta}) = f(x_1, x_2, \dots, x_n \mid \theta_1, \theta_2, \dots, \theta_p)$ denote the joint density of the data vector $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ of observations, where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$ (a subset of $p$-dimensional space).

An Example
The data vector $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ is a sample from the normal distribution with mean $\mu$ and variance $\sigma^2$.
The model: $f(\mathbf{x} \mid \mu, \sigma^2) = f(x_1, x_2, \dots, x_n \mid \mu, \sigma^2)$, the joint density of $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$, takes the form
$$f(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\, e^{-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}},$$
where the unknown parameter vector $\boldsymbol{\theta} = (\mu, \sigma^2) \in \Omega = \{(x, y) : -\infty < x < \infty,\ 0 \le y < \infty\}$.

Defn (Sufficient Statistics)
Let $\mathbf{x}$ have joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Then $S = (S_1(\mathbf{x}), S_2(\mathbf{x}), S_3(\mathbf{x}), \dots, S_k(\mathbf{x}))$ is called a set of sufficient statistics for the parameter vector $\boldsymbol{\theta}$ if the conditional distribution of $\mathbf{x}$ given $S = (S_1(\mathbf{x}), \dots, S_k(\mathbf{x}))$ is not functionally dependent on the parameter vector $\boldsymbol{\theta}$. A set of sufficient statistics contains all of the information concerning the unknown parameter vector.

A Simple Example Illustrating Sufficiency
Suppose that we observe a Success–Failure experiment $n = 3$ times. Let $\theta$ denote the probability of Success. Suppose that the data collected are $x_1, x_2, x_3$, where $x_i$ takes the value 1 if the $i$th trial is a Success and 0 if the $i$th trial is a Failure. The following table gives the possible values of $(x_1, x_2, x_3)$:

(x1, x2, x3)   f(x1, x2, x3 | θ)   S = Σxi   g(S | θ)         f(x1, x2, x3 | S)
(0, 0, 0)      (1 - θ)^3            0        (1 - θ)^3        1
(1, 0, 0)      (1 - θ)^2 θ          1        3(1 - θ)^2 θ     1/3
(0, 1, 0)      (1 - θ)^2 θ          1                         1/3
(0, 0, 1)      (1 - θ)^2 θ          1                         1/3
(1, 1, 0)      (1 - θ) θ^2          2        3(1 - θ) θ^2     1/3
(1, 0, 1)      (1 - θ) θ^2          2                         1/3
(0, 1, 1)      (1 - θ) θ^2          2                         1/3
(1, 1, 1)      θ^3                  3        θ^3              1

The data can be generated in two equivalent ways:
1. Generating $(x_1, x_2, x_3)$ directly from $f(x_1, x_2, x_3 \mid \theta)$, or
2. Generating $S$ from $g(S \mid \theta)$ and then generating $(x_1, x_2, x_3)$ from $f(x_1, x_2, x_3 \mid S)$.
Since the second step does not involve $\theta$, no additional information about $\theta$ is obtained by knowing $(x_1, x_2, x_3)$ once $S$ is determined. (A small enumeration check of this table is sketched below.)
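The last column of the table can be reproduced by brute-force enumeration: for any value of $\theta$ the ratio $f(x_1, x_2, x_3 \mid \theta) / g(S \mid \theta)$ equals $1/\binom{3}{S}$, i.e. it is free of $\theta$. A minimal sketch (the trial values of $\theta$ are illustrative):

```python
from itertools import product
from math import comb

n = 3
for theta in (0.2, 0.5, 0.8):                     # several illustrative values of theta
    for x in product([0, 1], repeat=n):
        S = sum(x)
        f_x = theta**S * (1 - theta)**(n - S)     # f(x1, x2, x3 | theta)
        g_S = comb(n, S) * f_x                    # g(S | theta) = C(n,S) theta^S (1-theta)^(n-S)
        print(theta, x, S, round(f_x / g_S, 4))   # conditional prob = 1/C(n,S), same for every theta
```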
The Sufficiency Principle
Any decision regarding the parameter $\boldsymbol{\theta}$ should be based on a set of sufficient statistics $S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})$ and not otherwise on the value of $\mathbf{x}$.

A useful approach in developing a statistical procedure:
1. Find sufficient statistics.
2. Develop estimators, tests of hypotheses, etc. using only these statistics.

Defn (Minimal Sufficient Statistics)
Let $\mathbf{x}$ have joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Then $S = (S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x}))$ is a set of Minimal Sufficient statistics for the parameter vector $\boldsymbol{\theta}$ if $S$ is a set of sufficient statistics and $S$ can be calculated from any other set of sufficient statistics.

Theorem (The Factorization Criterion)
Let $\mathbf{x}$ have joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Then $S = (S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x}))$ is a set of sufficient statistics for $\boldsymbol{\theta}$ if
$$f(\mathbf{x} \mid \boldsymbol{\theta}) = h(\mathbf{x})\, g(S, \boldsymbol{\theta}) = h(\mathbf{x})\, g(S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x}), \boldsymbol{\theta}).$$
This is useful for finding sufficient statistics: if you can factor out the $\boldsymbol{\theta}$-dependence through a set of statistics, then those statistics are a set of sufficient statistics.

Defn (Completeness)
Let $\mathbf{x}$ have joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Then $S = (S_1(\mathbf{x}), \dots, S_k(\mathbf{x}))$ is a set of Complete Sufficient statistics for $\boldsymbol{\theta}$ if $S$ is a set of sufficient statistics and whenever
$$E[\phi(S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x}))] = 0 \quad \text{then} \quad P[\phi(S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})) = 0] = 1.$$

Defn (The Exponential Family)
Let $\mathbf{x}$ have joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Then $f(\mathbf{x} \mid \boldsymbol{\theta})$ is said to be a member of the exponential family of distributions if
$$f(\mathbf{x} \mid \boldsymbol{\theta}) = \begin{cases} h(\mathbf{x})\, g(\boldsymbol{\theta}) \exp\!\left[\displaystyle\sum_{i=1}^k S_i(\mathbf{x})\, p_i(\boldsymbol{\theta})\right] & a_i < x_i < b_i \\ 0 & \text{otherwise} \end{cases}, \qquad \boldsymbol{\theta} \in \Omega,$$
where
1) $-\infty < a_i < b_i < \infty$ are not dependent on $\boldsymbol{\theta}$;
2) $\Omega$ contains a nondegenerate $k$-dimensional rectangle;
3) $g(\boldsymbol{\theta})$, $a_i$, $b_i$ and $p_i(\boldsymbol{\theta})$ are not dependent on $\mathbf{x}$;
4) $h(\mathbf{x})$, $a_i$, $b_i$ and $S_i(\mathbf{x})$ are not dependent on $\boldsymbol{\theta}$.
If in addition:
5) the $S_i(\mathbf{x})$ are functionally independent for $i = 1, 2, \dots, k$;
6) $\partial S_i(\mathbf{x})/\partial x_j$ exists and is continuous for all $i = 1, 2, \dots, k$ and $j = 1, 2, \dots, n$;
7) $p_i(\boldsymbol{\theta})$ is a continuous function of $\boldsymbol{\theta}$ for all $i = 1, 2, \dots, k$;
8) $R = \{(p_1(\boldsymbol{\theta}), p_2(\boldsymbol{\theta}), \dots, p_k(\boldsymbol{\theta})) : \boldsymbol{\theta} \in \Omega\}$ contains a nondegenerate $k$-dimensional rectangle;
then the set of statistics $S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})$ forms a Minimal Complete set of Sufficient statistics.

Examples
Suppose we repeat a success–failure experiment independently $n$ times, and let $\theta$ denote the probability of success (note: $0 \le \theta \le 1$). Let
$$x_i = \begin{cases} 1 & \text{if the } i\text{th repetition is a Success} \\ 0 & \text{if the } i\text{th repetition is a Failure} \end{cases}, \qquad f(x_i) = \theta^{x_i}(1 - \theta)^{1 - x_i}, \quad x_i = 0, 1.$$
The joint density of $x_1, x_2, x_3, \dots, x_n$ is
$$f(x_1, x_2, x_3, \dots, x_n \mid \theta) = f(x_1) f(x_2) f(x_3) \cdots f(x_n) = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i} = (1 - \theta)^n \left(\frac{\theta}{1 - \theta}\right)^{S} = (1 - \theta)^n \exp\!\left[S \ln\frac{\theta}{1 - \theta}\right],$$
where $S = S(\mathbf{x}) = \sum_{i=1}^n x_i$. This is the exponential-family form
$$f(\mathbf{x} \mid \theta) = h(\mathbf{x})\, g(\theta) \exp[S(\mathbf{x})\, p(\theta)]$$
with $h(\mathbf{x}) = 1$, $g(\theta) = (1 - \theta)^n$, $p(\theta) = \ln[\theta/(1 - \theta)]$ and $S(\mathbf{x}) = \sum_{i=1}^n x_i$.

Defn (The Likelihood Function)
Let $\mathbf{x}$ have joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Then for a given value of the observation vector $\mathbf{x}$, the likelihood function, $L_{\mathbf{x}}(\boldsymbol{\theta})$, is defined by
$$L_{\mathbf{x}}(\boldsymbol{\theta}) = f(\mathbf{x} \mid \boldsymbol{\theta}), \quad \boldsymbol{\theta} \in \Omega.$$
The log-likelihood function $l_{\mathbf{x}}(\boldsymbol{\theta})$ is defined by
$$l_{\mathbf{x}}(\boldsymbol{\theta}) = \ln L_{\mathbf{x}}(\boldsymbol{\theta}) = \ln f(\mathbf{x} \mid \boldsymbol{\theta}), \quad \boldsymbol{\theta} \in \Omega.$$

The Likelihood Principle
Any decision regarding the parameter $\boldsymbol{\theta}$ should be based on the likelihood function $L_{\mathbf{x}}(\boldsymbol{\theta})$ and not otherwise on the value of $\mathbf{x}$. If two data sets result in the same likelihood function, the decision regarding $\boldsymbol{\theta}$ should be the same. Some statisticians find it useful to plot the likelihood function $L_{\mathbf{x}}(\boldsymbol{\theta})$ given the value of $\mathbf{x}$; it summarizes the information contained in $\mathbf{x}$ regarding the parameter vector $\boldsymbol{\theta}$.

An Example
The data vector $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ is a sample from the normal distribution with mean $\mu$ and variance $\sigma^2$. The joint density of $\mathbf{x}$ is
$$f(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\, e^{-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}},$$
where the unknown parameter vector $\boldsymbol{\theta} = (\mu, \sigma^2) \in \Omega = \{(x, y) : -\infty < x < \infty,\ 0 \le y < \infty\}$.

The Likelihood Function
Assume the data vector $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ is known. Then
$$L(\mu, \sigma) = f(\mathbf{x} \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right] = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left[-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)\right].$$
Since $(n-1)s^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2$, so that $\sum_{i=1}^n x_i^2 = (n-1)s^2 + n\bar{x}^2$, and since $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, so that $\sum_{i=1}^n x_i = n\bar{x}$, we have
$$L(\mu, \sigma) = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left[-\frac{1}{2\sigma^2}\big((n-1)s^2 + n\bar{x}^2 - 2\mu n\bar{x} + n\mu^2\big)\right] = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left[-\frac{1}{2\sigma^2}\big((n-1)s^2 + n(\bar{x} - \mu)^2\big)\right].$$

Now consider the following data ($n = 10$):
57.1  72.3  75.0  57.8  50.3
48.0  53.1  58.5  53.7  49.6
Here $\bar{x} = 57.54$ and $s = 9.2185$, so
$$L(\mu, \sigma) = \frac{1}{(6.2832)^{5}\,\sigma^{10}} \exp\!\left[-\frac{1}{2\sigma^2}\big(9\,(9.2185)^2 + 10\,(57.54 - \mu)^2\big)\right].$$
[Graphs: the likelihood surface and a contour map of the likelihood, $n = 10$]

Now consider the following data ($n = 100$):
57.1 72.3 75.0 57.8 50.3 48.0 49.6 53.1 58.5 53.7
77.8 43.0 69.8 65.1 71.1 44.4 64.4 52.9 56.4 43.9
49.0 37.6 65.5 50.4 40.7 66.9 51.5 55.8 49.1 59.5
64.5 67.6 79.9 48.0 68.1 68.0 65.8 61.3 75.0 78.0
61.8 69.0 56.2 77.2 57.5 84.0 45.5 64.4 58.7 77.5
81.9 77.1 58.7 71.2 58.1 50.3 53.2 47.6 53.3 76.4
69.8 57.8 65.9 63.0 43.5 70.7 85.2 57.2 78.9 72.9
78.6 53.9 61.9 75.2 62.2 53.2 73.0 38.9 75.4 69.7
68.8 77.0 51.2 65.6 44.7 40.4 72.1 68.1 82.2 64.7
83.1 71.9 65.4 45.0 51.6 48.3 58.5 65.3 65.9 59.6
Here $\bar{x} = 62.02$ and $s = 11.8571$, so
$$L(\mu, \sigma) = \frac{1}{(6.2832)^{50}\,\sigma^{100}} \exp\!\left[-\frac{1}{2\sigma^2}\big(99\,(11.8571)^2 + 100\,(62.02 - \mu)^2\big)\right].$$
[Graphs: the likelihood surface and a contour map of the likelihood, $n = 100$]
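The likelihood calculation for the $n = 10$ data can be reproduced directly. The sketch below evaluates the log-likelihood in the $(n-1)s^2 + n(\bar{x} - \mu)^2$ form derived above; the comparison point $(\mu, \sigma) = (50, 12)$ is purely illustrative.

```python
import numpy as np

# The n = 10 data from the example above
x = np.array([57.1, 72.3, 75.0, 57.8, 50.3, 48.0, 53.1, 58.5, 53.7, 49.6])
n = len(x)
xbar = x.mean()
s = x.std(ddof=1)                      # sample standard deviation (divisor n - 1)

def log_likelihood(mu, sigma):
    """ln L(mu, sigma) using the (n-1)s^2 + n(xbar - mu)^2 form."""
    return (-n/2 * np.log(2*np.pi) - n*np.log(sigma)
            - ((n - 1)*s**2 + n*(xbar - mu)**2) / (2*sigma**2))

print(round(xbar, 2), round(s, 4))                  # 57.54, 9.2185 (matches the notes)
print(log_likelihood(xbar, s*np.sqrt((n-1)/n)))     # value at the maximizing (mu, sigma)
print(log_likelihood(50.0, 12.0))                   # a smaller value at another point
```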
The Sufficiency Principle (restated)
Any decision regarding the parameter $\boldsymbol{\theta}$ should be based on a set of sufficient statistics $S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})$ and not otherwise on the value of $\mathbf{x}$. If two data sets result in the same values for the set of sufficient statistics, the decision regarding $\boldsymbol{\theta}$ should be the same.

Theorem (Birnbaum — Equivalency of the Likelihood Principle and the Sufficiency Principle)
$$L_{\mathbf{x}_1}(\boldsymbol{\theta}) \propto L_{\mathbf{x}_2}(\boldsymbol{\theta}) \quad \text{if and only if} \quad S_1(\mathbf{x}_1) = S_1(\mathbf{x}_2),\ \dots,\ S_k(\mathbf{x}_1) = S_k(\mathbf{x}_2).$$

Recall the table of possible values of $(x_1, x_2, x_3)$ for the Success–Failure example with $n = 3$: for each value of $S = \sum x_i$ the likelihood function is proportional to $g(S \mid \theta)$, namely $(1-\theta)^3$, $3(1-\theta)^2\theta$, $3(1-\theta)\theta^2$ and $\theta^3$ for $S = 0, 1, 2, 3$.
[Graphs: the likelihood functions $L(\theta)$ for $S = 0$, $S = 1$, $S = 2$ and $S = 3$]

Estimation Theory — Point Estimation

Defn (Estimator)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Then an estimator of the parameter $\phi(\boldsymbol{\theta}) = \phi(\theta_1, \theta_2, \dots, \theta_k)$ is any function $T(\mathbf{x}) = T(x_1, x_2, x_3, \dots, x_n)$ of the observation vector.

Defn (Mean Square Error)
Let $\mathbf{x}$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$, $\boldsymbol{\theta} \in \Omega$. Let $T(\mathbf{x})$ be an estimator of the parameter $\phi(\boldsymbol{\theta})$. Then the mean square error of $T(\mathbf{x})$ is defined to be
$$\mathrm{M.S.E.}_{T(\mathbf{x})}(\boldsymbol{\theta}) = E\big[(T(\mathbf{x}) - \phi(\boldsymbol{\theta}))^2\big] = \int (T(\mathbf{x}) - \phi(\boldsymbol{\theta}))^2\, f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x}.$$

Defn (Uniformly Better)
Let $T(\mathbf{x})$ and $T^*(\mathbf{x})$ be estimators of the parameter $\phi(\boldsymbol{\theta})$. Then $T(\mathbf{x})$ is said to be uniformly better than $T^*(\mathbf{x})$ if
$$\mathrm{M.S.E.}_{T(\mathbf{x})}(\boldsymbol{\theta}) \le \mathrm{M.S.E.}_{T^*(\mathbf{x})}(\boldsymbol{\theta}) \quad \text{whenever } \boldsymbol{\theta} \in \Omega.$$

Defn (Unbiased)
$T(\mathbf{x})$ is said to be an unbiased estimator of the parameter $\phi(\boldsymbol{\theta})$ if
$$E[T(\mathbf{x})] = \int T(\mathbf{x})\, f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} = \phi(\boldsymbol{\theta}).$$

Theorem (Cramér–Rao Lower Bound)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Suppose that:
i) $\dfrac{\partial f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$ exists for all $\mathbf{x}$ and for all $\boldsymbol{\theta} \in \Omega$;
ii) $\dfrac{\partial}{\partial \boldsymbol{\theta}} \displaystyle\int f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} = \displaystyle\int \dfrac{\partial f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\,d\mathbf{x}$;
iii) $\dfrac{\partial}{\partial \boldsymbol{\theta}} \displaystyle\int t(\mathbf{x})\, f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} = \displaystyle\int t(\mathbf{x})\, \dfrac{\partial f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\,d\mathbf{x}$;
iv) $0 < E\!\left[\left(\dfrac{\partial \ln f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \theta_i}\right)^{2}\right] < \infty$ for all $\boldsymbol{\theta} \in \Omega$ and each $i$.
Let $M$ denote the $p \times p$ matrix with $ij$th element
$$m_{ij} = -E\!\left[\frac{\partial^2 \ln f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \theta_i\, \partial \theta_j}\right], \qquad i, j = 1, 2, \dots, p.$$
Then $V = M^{-1}$ is the lower bound for the covariance matrix of unbiased estimators of $\boldsymbol{\theta}$. That is,
$$\operatorname{var}(\mathbf{c}'\hat{\boldsymbol{\theta}}) = \mathbf{c}'\operatorname{var}(\hat{\boldsymbol{\theta}})\,\mathbf{c} \ge \mathbf{c}' M^{-1} \mathbf{c} = \mathbf{c}' V \mathbf{c},$$
where $\hat{\boldsymbol{\theta}}$ is a vector of unbiased estimators of $\boldsymbol{\theta}$.

Defn (Uniformly Minimum Variance Unbiased Estimator)
$T^*(\mathbf{x})$ is said to be the UMVU (uniformly minimum variance unbiased) estimator of $\phi(\boldsymbol{\theta})$ if:
1) $E[T^*(\mathbf{x})] = \phi(\boldsymbol{\theta})$ for all $\boldsymbol{\theta} \in \Omega$;
2) $\operatorname{Var}[T^*(\mathbf{x})] \le \operatorname{Var}[T(\mathbf{x})]$ for all $\boldsymbol{\theta} \in \Omega$ whenever $E[T(\mathbf{x})] = \phi(\boldsymbol{\theta})$.

Theorem (Rao–Blackwell)
Let $S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})$ denote a set of sufficient statistics, and let $T(\mathbf{x})$ be any unbiased estimator of $\phi(\boldsymbol{\theta})$. Then
$$T^*[S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})] = E[T(\mathbf{x}) \mid S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})]$$
is an unbiased estimator of $\phi(\boldsymbol{\theta})$ such that
$$\operatorname{Var}\big[T^*(S_1(\mathbf{x}), \dots, S_k(\mathbf{x}))\big] \le \operatorname{Var}[T(\mathbf{x})] \quad \text{for all } \boldsymbol{\theta} \in \Omega.$$

Theorem (Lehmann–Scheffé)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Let $S_1(\mathbf{x}), S_2(\mathbf{x}), \dots, S_k(\mathbf{x})$ denote a set of complete sufficient statistics, and let $T^*[S_1(\mathbf{x}), \dots, S_k(\mathbf{x})]$ be an unbiased estimator of $\phi(\boldsymbol{\theta})$. Then $T^*(S_1(\mathbf{x}), \dots, S_k(\mathbf{x}))$ is the UMVU estimator of $\phi(\boldsymbol{\theta})$.
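The Rao–Blackwell construction can be illustrated with the Success–Failure example: $T(\mathbf{x}) = x_1$ is unbiased for $\theta$, and conditioning on the sufficient statistic $S = \sum x_i$ gives $T^* = E[x_1 \mid S] = S/n = \bar{x}$, which has much smaller variance. Below is a minimal simulation sketch; the sample size, the value of $\theta$ and the number of replications are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

theta, n, reps = 0.3, 20, 100_000            # illustrative values
x = rng.binomial(1, theta, size=(reps, n))   # reps independent Bernoulli samples of size n

T      = x[:, 0]                # crude unbiased estimator: T = x1
T_star = x.mean(axis=1)         # Rao-Blackwellized estimator: T* = E[x1 | S] = S/n = xbar

print(T.mean(), T_star.mean())  # both close to theta (both unbiased)
print(T.var(), T_star.var())    # Var(T) ~ theta(1-theta); Var(T*) ~ theta(1-theta)/n
```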
Defn (Consistency)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Let $T_n(\mathbf{x})$ be an estimator of $\phi(\boldsymbol{\theta})$. Then $T_n(\mathbf{x})$ is called a consistent estimator of $\phi(\boldsymbol{\theta})$ if for any $\varepsilon > 0$
$$\lim_{n \to \infty} P\big[\,|T_n(\mathbf{x}) - \phi(\boldsymbol{\theta})| \ge \varepsilon\,\big] = 0 \quad \text{for all } \boldsymbol{\theta} \in \Omega.$$

Defn (M.S.E. Consistency)
Let $T_n(\mathbf{x})$ be an estimator of $\phi(\boldsymbol{\theta})$. Then $T_n(\mathbf{x})$ is called an M.S.E. consistent estimator of $\phi(\boldsymbol{\theta})$ if
$$\lim_{n \to \infty} \mathrm{M.S.E.}_{T_n}(\boldsymbol{\theta}) = \lim_{n \to \infty} E\big[(T_n(\mathbf{x}) - \phi(\boldsymbol{\theta}))^2\big] = 0 \quad \text{for all } \boldsymbol{\theta} \in \Omega.$$

Methods for Finding Estimators
1. The Method of Moments
2. Maximum Likelihood Estimation

Method of Moments
Let $x_1, \dots, x_n$ denote a sample from the density function $f(x;\, \theta_1, \dots, \theta_p) = f(x;\, \boldsymbol{\theta})$. The $k$th moment of the distribution being sampled is defined to be
$$\mu_k(\theta_1, \dots, \theta_p) = E[x^k] = \int x^k f(x;\, \theta_1, \dots, \theta_p)\,dx.$$
The $k$th sample moment is defined to be
$$m_k = \frac{1}{n}\sum_{i=1}^n x_i^k.$$
To find the method of moments estimators of $\theta_1, \dots, \theta_p$ we set up the equations
$$\mu_1(\theta_1, \dots, \theta_p) = m_1,\quad \mu_2(\theta_1, \dots, \theta_p) = m_2,\ \dots,\ \mu_p(\theta_1, \dots, \theta_p) = m_p,$$
and solve them for $\theta_1, \dots, \theta_p$. The solutions $\tilde{\theta}_1, \dots, \tilde{\theta}_p$ are called the method of moments estimators.

The Method of Maximum Likelihood
Suppose that the data $x_1, \dots, x_n$ have joint density function $f(x_1, \dots, x_n;\, \theta_1, \dots, \theta_p)$, where $\boldsymbol{\theta} = (\theta_1, \dots, \theta_p)$ are unknown parameters assumed to lie in $\Omega$ (a subset of $p$-dimensional space). We want to estimate the parameters $\theta_1, \dots, \theta_p$.

Definition (Maximum Likelihood Estimation)
Suppose that the data $x_1, \dots, x_n$ have joint density function $f(x_1, \dots, x_n;\, \theta_1, \dots, \theta_p)$. Then the likelihood function is defined to be
$$L(\boldsymbol{\theta}) = L(\theta_1, \dots, \theta_p) = f(x_1, \dots, x_n;\, \theta_1, \dots, \theta_p).$$
The maximum likelihood estimators of the parameters $\theta_1, \dots, \theta_p$ are the values $\hat{\theta}_1, \dots, \hat{\theta}_p$ that maximize $L(\boldsymbol{\theta})$, i.e.
$$L(\hat{\theta}_1, \dots, \hat{\theta}_p) = \max_{\theta_1, \dots, \theta_p} L(\theta_1, \dots, \theta_p).$$
Note: maximizing $L(\theta_1, \dots, \theta_p)$ is equivalent to maximizing the log-likelihood function
$$l(\theta_1, \dots, \theta_p) = \ln L(\theta_1, \dots, \theta_p).$$
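For a normal sample both methods can be carried out in closed form: the method of moments equations $\bar{x} = \mu$ and $m_2 = \mu^2 + \sigma^2$ give $\tilde{\mu} = \bar{x}$ and $\tilde{\sigma}^2 = m_2 - \bar{x}^2$, and these coincide with the maximum likelihood estimators. The sketch below checks this numerically, reusing the $n = 10$ data from the likelihood example; the numerical optimizer and its starting values are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# The n = 10 data from the likelihood example above
x = np.array([57.1, 72.3, 75.0, 57.8, 50.3, 48.0, 53.1, 58.5, 53.7, 49.6])
n = len(x)

# Method of moments: solve xbar = mu and m2 = mu^2 + sigma^2
mu_mom     = x.mean()
sigma2_mom = (x**2).mean() - x.mean()**2

# Maximum likelihood: minimize the negative log-likelihood numerically
def neg_log_lik(par):
    mu, log_sigma = par                 # parameterize by log(sigma) to keep sigma > 0
    sigma = np.exp(log_sigma)
    return n*np.log(sigma) + np.sum((x - mu)**2) / (2*sigma**2)

res = minimize(neg_log_lik, x0=[50.0, 2.0])   # illustrative starting values
mu_mle, sigma2_mle = res.x[0], np.exp(res.x[1])**2

print(mu_mom, sigma2_mom)    # xbar and (1/n) * sum (xi - xbar)^2
print(mu_mle, sigma2_mle)    # numerically the same values
```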
Application: The General Linear Model
Consider the random variable $Y$ with
1. $E[Y] = g(U_1, U_2, \dots, U_k) = \beta_1 \phi_1(U_1, U_2, \dots, U_k) + \beta_2 \phi_2(U_1, U_2, \dots, U_k) + \cdots + \beta_p \phi_p(U_1, U_2, \dots, U_k) = \sum_{i=1}^p \beta_i \phi_i(U_1, U_2, \dots, U_k)$, and
2. $\operatorname{var}(Y) = \sigma^2$,
where $\beta_1, \beta_2, \dots, \beta_p$ are unknown parameters and $\phi_1, \phi_2, \dots, \phi_p$ are known functions of the nonrandom variables $U_1, U_2, \dots, U_k$. Assume further that $Y$ is normally distributed. Thus the density of $Y$ is
$$f(Y \mid \beta_1, \dots, \beta_p, \sigma^2) = f(Y \mid \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{1}{2\sigma^2}\big(Y - g(U_1, U_2, \dots, U_k)\big)^2\right]$$
$$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{1}{2\sigma^2}\Big(Y - \sum_{i=1}^p \beta_i \phi_i(U_1, U_2, \dots, U_k)\Big)^2\right] = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{1}{2\sigma^2}\big(Y - \beta_1 X_1 - \beta_2 X_2 - \cdots - \beta_p X_p\big)^2\right],$$
where $X_i = \phi_i(U_1, U_2, \dots, U_k)$, $i = 1, 2, \dots, p$.

Now suppose that $n$ independent observations of $Y$, $(y_1, y_2, \dots, y_n)$, are made corresponding to $n$ sets of values of $(U_1, U_2, \dots, U_k)$: $(u_{11}, u_{12}, \dots, u_{1k})$, $(u_{21}, u_{22}, \dots, u_{2k})$, ..., $(u_{n1}, u_{n2}, \dots, u_{nk})$. Let $x_{ij} = \phi_j(u_{i1}, u_{i2}, \dots, u_{ik})$, $j = 1, 2, \dots, p$; $i = 1, 2, \dots, n$. Then the joint density of $\mathbf{y} = (y_1, y_2, \dots, y_n)$ is
$$f(y_1, y_2, \dots, y_n \mid \beta_1, \dots, \beta_p, \sigma^2) = f(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n \big(y_i - g(u_{i1}, \dots, u_{ik})\big)^2\right]$$
$$= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p \beta_j x_{ij}\Big)^2\right] = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}(\mathbf{y} - X\boldsymbol{\beta})'(\mathbf{y} - X\boldsymbol{\beta})\right]$$
$$= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}\big(\mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta} + \boldsymbol{\beta}'X'X\boldsymbol{\beta}\big)\right] = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left[-\frac{1}{2\sigma^2}\boldsymbol{\beta}'X'X\boldsymbol{\beta}\right] \exp\!\left[-\frac{1}{2\sigma^2}\big(\mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta}\big)\right]$$
$$= h(\mathbf{y})\, g(\boldsymbol{\beta}, \sigma^2) \exp\!\left[-\frac{1}{2\sigma^2}\big(\mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta}\big)\right].$$
Thus $f(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2)$ is a member of the exponential family of distributions, and $S = (\mathbf{y}'\mathbf{y},\, X'\mathbf{y})$ is a Minimal Complete set of Sufficient Statistics.

Hypothesis Testing

Defn (Test of size $\alpha$)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Let $\omega$ be any subset of $\Omega$. Consider testing the Null Hypothesis $H_0: \boldsymbol{\theta} \in \omega$ against the alternative hypothesis $H_1: \boldsymbol{\theta} \notin \omega$. Let $A$ denote the acceptance region for the test (all values $\mathbf{x} = (x_1, \dots, x_n)$ such that the decision to accept $H_0$ is made), and let $C$ denote the critical region for the test (all values $\mathbf{x} = (x_1, \dots, x_n)$ such that the decision to reject $H_0$ is made). Then the test is said to be of size $\alpha$ if
$$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha \ \text{ for all } \boldsymbol{\theta} \in \omega, \quad \text{and} \quad P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha \ \text{ for at least one } \boldsymbol{\theta}_0 \in \omega.$$

Defn (Power)
Consider testing $H_0: \boldsymbol{\theta} \in \omega$ against $H_1: \boldsymbol{\theta} \notin \omega$, where $\omega$ is any subset of $\Omega$, with critical region $C$. Then the power of the test for $\boldsymbol{\theta} \notin \omega$ is defined to be
$$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x}.$$

Defn (Uniformly Most Powerful (UMP) test of size $\alpha$)
Consider testing $H_0: \boldsymbol{\theta} \in \omega$ against $H_1: \boldsymbol{\theta} \notin \omega$, where $\omega$ is any subset of $\Omega$, and let $C$ denote the critical region for the test. Then the test is called the UMP test of size $\alpha$ if
$$\int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha \ \text{ for all } \boldsymbol{\theta} \in \omega \quad \text{and} \quad \int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha \ \text{ for at least one } \boldsymbol{\theta}_0 \in \omega,$$
and for any other critical region $C^*$ such that
$$\int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha \ \text{ for all } \boldsymbol{\theta} \in \omega \quad \text{and} \quad \int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha \ \text{ for at least one } \boldsymbol{\theta}_0 \in \omega,$$
we have
$$\int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \ge \int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \quad \text{for all } \boldsymbol{\theta} \notin \omega.$$

Theorem (Neyman–Pearson Lemma)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \theta)$ where the unknown parameter $\theta \in \Omega = \{\theta_0, \theta_1\}$. Consider testing the Null Hypothesis $H_0: \theta = \theta_0$ against the alternative hypothesis $H_1: \theta = \theta_1$. Then the UMP test of size $\alpha$ has critical region
$$C = \left\{\mathbf{x} : \frac{f(\mathbf{x} \mid \theta_0)}{f(\mathbf{x} \mid \theta_1)} \le K\right\},$$
where $K$ is chosen so that
$$\int_C f(\mathbf{x} \mid \theta_0)\,d\mathbf{x} = \alpha.$$
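For a normal sample with known $\sigma$, the Neyman–Pearson region for $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$ with $\mu_1 > \mu_0$ reduces to rejecting when $\bar{x}$ exceeds a cut-off. A minimal simulation sketch of the resulting size and power; the particular $\mu_0$, $\mu_1$, $\sigma$, $n$ and $\alpha$ are illustrative values, not from the notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

mu0, mu1, sigma, n, alpha = 50.0, 55.0, 10.0, 25, 0.05     # illustrative values

# For mu1 > mu0 the region {f(x|mu0)/f(x|mu1) <= K} is equivalent to {xbar >= c}
c = mu0 + stats.norm.ppf(1 - alpha) * sigma / np.sqrt(n)   # cut-off giving size alpha

reps = 100_000
xbar_H0 = rng.normal(mu0, sigma / np.sqrt(n), size=reps)   # xbar ~ N(mu, sigma^2/n)
xbar_H1 = rng.normal(mu1, sigma / np.sqrt(n), size=reps)

print(np.mean(xbar_H0 >= c))   # estimated size, close to 0.05
print(np.mean(xbar_H1 >= c))   # estimated power at mu1
```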
Defn (Likelihood Ratio Test of size $\alpha$)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Consider testing the Null Hypothesis $H_0: \boldsymbol{\theta} \in \omega$ against the alternative hypothesis $H_1: \boldsymbol{\theta} \notin \omega$, where $\omega$ is any subset of $\Omega$. Then the likelihood ratio (LR) test of size $\alpha$ has critical region
$$C = \left\{\mathbf{x} : \lambda(\mathbf{x}) = \frac{\max_{\boldsymbol{\theta} \in \omega} f(\mathbf{x} \mid \boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Omega} f(\mathbf{x} \mid \boldsymbol{\theta})} \le K\right\},$$
where $K$ is chosen so that
$$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha \ \text{ for all } \boldsymbol{\theta} \in \omega, \quad \text{and} \quad P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha \ \text{ for at least one } \boldsymbol{\theta}_0 \in \omega.$$

Theorem (Asymptotic distribution of the likelihood ratio test criterion)
Let $\mathbf{x} = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(\mathbf{x} \mid \boldsymbol{\theta})$ where the unknown parameter vector $\boldsymbol{\theta} \in \Omega$. Consider testing the Null Hypothesis $H_0: \boldsymbol{\theta} \in \omega$ against the alternative hypothesis $H_1: \boldsymbol{\theta} \notin \omega$, where $\omega$ is any subset of $\Omega$, and let
$$\lambda(\mathbf{x}) = \frac{\max_{\boldsymbol{\theta} \in \omega} f(\mathbf{x} \mid \boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Omega} f(\mathbf{x} \mid \boldsymbol{\theta})}.$$
Then, under proper regularity conditions, $U = -2\ln \lambda(\mathbf{x})$ possesses an asymptotic chi-square distribution with degrees of freedom equal to the difference between the number of independent parameters in $\Omega$ and in $\omega$.
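As a quick check of this result, consider testing $H_0: \mu = \mu_0$ for a normal sample with known $\sigma$. Here $-2\ln\lambda(\mathbf{x}) = n(\bar{x} - \mu_0)^2/\sigma^2$, and $\Omega$ has one more free parameter than $\omega$, so the reference distribution is chi-square with 1 degree of freedom. A minimal simulation sketch with illustrative values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

mu0, sigma, n, reps = 0.0, 1.0, 30, 100_000      # illustrative values

# Simulate under H0 and compute U = -2 ln lambda = n (xbar - mu0)^2 / sigma^2
x = rng.normal(mu0, sigma, size=(reps, n))
U = n * (x.mean(axis=1) - mu0)**2 / sigma**2

# Compare simulated quantiles of U with the chi-square(1) reference distribution
for q in (0.50, 0.90, 0.95, 0.99):
    print(q, np.quantile(U, q), stats.chi2.ppf(q, df=1))
```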