Chapter 2  Statistics Review

A. Random variables

1. Expected value:
Definition: Let X be a discrete random variable. "The mean (or expected value) of X" is the weighted average of the possible outcomes, where the probabilities of the outcomes serve as the weights. With $p_i$ the probability of the $i$th outcome, $i = 1, 2, \ldots, n$:
$$\mu_X = E(X) = \sum_i p_i X_i = p_1 X_1 + p_2 X_2 + \cdots + p_n X_n$$
Interpretation: a random variable is a variable that has a probability associated with each outcome; the outcome is not controlled.
Discrete random variable: has a finite, or countably infinite, number of outcomes.
Continuous random variable: has uncountably many outcomes, so the probability of any single outcome is negligible because there are too many possible values. For a normal random variable, the probability density function is used to calculate the probability over an interval (the area under the curve).
E( ): the expectations operator; $\sum_i p_i = 1$.
$$\bar X = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad N = \text{number of observations}$$
... the "sample mean", used to estimate $\mu_X$. $\bar X$ changes from sample to sample; it is not a fixed number, since the outcomes selected will not be the same each time. There is a probability associated with each value of $\bar X$, so $\bar X$ is also a random variable and we can calculate $E(\bar X)$.

2. Variance: measures the dispersion, i.e. the spread of the values.
$$Var(X) = \sigma_X^2 = \sum_i p_i [X_i - E(X)]^2 = E[X - E(X)]^2 = E(X^2) - \mu_X^2$$
$\sigma_X^2$ ............ "population variance", a constant.
$\sigma_X = \sqrt{Var(X)}$ ............ "population standard deviation".
$$\hat\sigma_X^2 = S_X^2 = \frac{\sum (X_i - \bar X)^2}{N-1}$$ ... "sample variance", used to estimate $\sigma_X^2$.
$\hat\sigma_X = S_X = \sqrt{S_X^2}$ ............ "sample standard deviation".

3. Joint distribution: (linear relation of X and Y; bivariate random variables.)
$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
Covariance measures the linear relationship between X and Y. Cov(X, Y) depends on the units of X and Y; different units give a different Cov(X, Y).
Cov(X, Y) > 0: the best-fitting line has a positive slope; positive relationship between X and Y.
Cov(X, Y) < 0: the best-fitting line has a negative slope; negative relationship between X and Y.
Cov(X, Y) = 0: there is no linear relationship between X and Y, but there may be a nonlinear relationship.
$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$ ... "population correlation coefficient", which is scale free.
$\rho_{XY} > 0$: positive correlation; $\rho_{XY} < 0$: negative correlation; $\rho_{XY} = 0$: no linear relationship between X and Y.
$-1 \le \rho_{XY} \le 1$; $\rho_{XY} = 1$: the regression line is a straight line with positive slope; $\rho_{XY} = -1$: the regression line is a straight line with negative slope.
$$\widehat{Cov}(X, Y) = \frac{\sum_{i=1}^{N}(X_i - \bar X)(Y_i - \bar Y)}{N-1}$$
$$\hat\rho_{XY} = \frac{\widehat{Cov}(X, Y)}{S_X S_Y} = \frac{\sum_{i=1}^{N}(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum (X_i - \bar X)^2}\,\sqrt{\sum (Y_i - \bar Y)^2}}$$ ... "sample correlation coefficient"

Ex: joint probability distribution of X and Y

             X = 1    X = 2    X = 3    Prob(Y)
    Y = 6    0.175    0.088    0.035    0.298
    Y = 5    0.070    0.210    0.105    0.385
    Y = 4    0.018    0.105    0.194    0.317
    Prob(X)  0.263    0.403    0.334    1

$\sum_i p(X_i) = 1$, $\sum_j p(Y_j) = 1$, $\sum_i \sum_j p_{ij} = 1$ (joint probabilities).

E(X) = 0.263×1 + 0.403×2 + 0.334×3 = 2.071
Var(X) = 4.881 − (2.071)² = 0.591959
E(Y) = 0.298×6 + 0.385×5 + 0.317×4 = 4.981
Var(Y) = 25.425 − (4.981)² = 0.614639
Cov(X, Y) = $\sum_i \sum_j p_{ij} X_i Y_j - E(X)E(Y)$ = 10.001 − 2.071×4.981 ≈ −0.315
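As a quick numerical check on the example above, the following sketch (a minimal illustration assuming NumPy is available; the array names are my own, not part of the notes) recomputes the marginals, moments, covariance, and correlation directly from the joint probability table:

import numpy as np

x = np.array([1.0, 2.0, 3.0])            # possible values of X (table columns)
y = np.array([6.0, 5.0, 4.0])            # possible values of Y (table rows)
p = np.array([[0.175, 0.088, 0.035],     # joint probabilities p_ij = P(Y = y_i, X = x_j)
              [0.070, 0.210, 0.105],
              [0.018, 0.105, 0.194]])

px = p.sum(axis=0)                       # marginal Prob(X): 0.263, 0.403, 0.334
py = p.sum(axis=1)                       # marginal Prob(Y): 0.298, 0.385, 0.317

ex  = (px * x).sum()                     # E(X)  = 2.071
ey  = (py * y).sum()                     # E(Y)  = 4.981
vx  = (px * x**2).sum() - ex**2          # Var(X) = 4.881 - 2.071^2 = 0.591959
vy  = (py * y**2).sum() - ey**2          # Var(Y) = 25.425 - 4.981^2 = 0.614639
exy = (p * np.outer(y, x)).sum()         # E(XY) = sum_i sum_j p_ij x_j y_i = 10.001
cov = exy - ex * ey                      # Cov(X,Y) = 10.001 - 2.071*4.981 ~ -0.315
rho = cov / np.sqrt(vx * vy)             # population correlation coefficient (scale free)
print(ex, vx, ey, vy, cov, rho)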
4. Useful formulas (a, b constants):
E(b) = b, Var(b) = 0; E(aX) = aE(X), Var(aX) = a²Var(X); E(aX + b) = aE(X) + b, Var(aX + b) = a²Var(X).
$$Var(aX + b) = E[(aX + b) - E(aX + b)]^2 = E[aX + b - aE(X) - b]^2 = E\{a^2[X - E(X)]^2\} = a^2 E[X - E(X)]^2 = a^2 Var(X)$$
E(X + Y) = E(X) + E(Y), Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y):
$$Var(X + Y) = E[(X + Y) - E(X + Y)]^2 = E[X + Y - E(X) - E(Y)]^2 = E\{[X - E(X)] + [Y - E(Y)]\}^2$$
$$= E\{[X - E(X)]^2 + [Y - E(Y)]^2 + 2[X - E(X)][Y - E(Y)]\} = Var(X) + Var(Y) + 2Cov(X, Y)$$
If X and Y are independent (hence linearly uncorrelated), then E(X + Y) = E(X) + E(Y), Cov(X, Y) = 0, $\rho_{XY} = 0$, and Var(X + Y) = Var(X) + Var(Y):
$$Cov(X, Y) = E\{[X - E(X)][Y - E(Y)]\} = E[XY - X\mu_Y - Y\mu_X + \mu_X\mu_Y] = E(XY) - \mu_X\mu_Y - \mu_X\mu_Y + \mu_X\mu_Y = E(XY) - \mu_X\mu_Y = 0$$
since E(XY) = E(X)E(Y) under independence. However, Cov(X, Y) = 0 does not allow us to conclude that X and Y are independent.
$E(X \cdot X) = E(X^2) \ne [E(X)]^2$ ... X is not independent of itself.

B. (Probability) distributions:

1. The normal distribution: $X \sim N(\mu_X, \sigma_X^2)$.

2. The standard normal distribution: $Z = \dfrac{X - \mu_X}{\sigma_X}$, $Z \sim N(0, 1)$.

3. The Chi-square distribution:
$$\chi^2_N = \sum_{i=1}^{N} Z_i^2 = Z_1^2 + Z_2^2 + \cdots + Z_N^2$$
where the $Z_i$ are N independently normally distributed random variables with mean 0 and variance 1. As N gets larger, the χ² distribution becomes approximately normal. The range of χ² is 0 to ∞.

4. The t distribution: If (1) Z is normally distributed with mean 0 and variance 1, (2) $\chi^2_N$ is distributed as Chi-square with N degrees of freedom, and (3) Z and $\chi^2_N$ are independent, then
$$\frac{Z}{\sqrt{\chi^2_N / N}} \sim t_N$$
As N gets large, the t distribution tends toward the normal distribution.

5. The F distribution: if X and Y are independent and distributed as Chi-square with N₁ and N₂ degrees of freedom, respectively, then
$$\frac{X / N_1}{Y / N_2} \sim F_{N_1, N_2}$$

Sampling distribution of the sample mean: assume the $X_i$ are independent of each other, with $X \sim (\mu_X, \sigma_X^2)$. Then $\bar X \sim (\mu_X, \sigma_X^2 / N)$, regardless of the distribution of X:
$$E(\bar X) = E\left[\frac{1}{N}\sum X_i\right] = \frac{1}{N}E(X_1 + X_2 + \cdots + X_N) = \frac{1}{N}(\mu_X + \mu_X + \cdots + \mu_X) = \frac{1}{N}(N\mu_X) = \mu_X$$
$$Var(\bar X) = Var\left(\frac{1}{N}\sum X_i\right) = \frac{1}{N^2}\sum Var(X_i) = \frac{1}{N^2}(N\sigma_X^2) = \frac{\sigma_X^2}{N}$$
where
$$Var\left(\sum X_i\right) = E\left[(X_1 + X_2 + \cdots + X_N) - E(X_1 + X_2 + \cdots + X_N)\right]^2 = E\{[X_1 - E(X_1)] + [X_2 - E(X_2)] + \cdots + [X_N - E(X_N)]\}^2$$
$$= \sum E[X_i - E(X_i)]^2 + 2\sum_{i \ne j} E[X_i - E(X_i)][X_j - E(X_j)] = \sum Var(X_i) = Var(X_1) + Var(X_2) + \cdots + Var(X_N) = \sigma_X^2 + \sigma_X^2 + \cdots + \sigma_X^2 = N\sigma_X^2$$
(the cross terms vanish because the $X_i$ are i.i.d.: independent of one another and identically distributed over time).
If $X \sim N(\mu_X, \sigma_X^2)$, then $\bar X \sim N(\mu_X, \sigma_X^2 / N)$.

* Central limit theorem: If $X \sim (\mu_X, \sigma_X^2)$ (any distribution), then $\bar X \sim N(\mu_X, \sigma_X^2 / N)$ approximately as N (the sample size) increases.

$S = \hat\sigma_X = \sqrt{\dfrac{\sum (X_i - \bar X)^2}{N-1}}$; what is the distribution of $\dfrac{\bar X - \mu_0}{S / \sqrt{N}}$?

C. Hypothesis testing:

Ex. H₀: μ = 100 ... null hypothesis; H₁: μ ≠ 100 ... alternative hypothesis; α = 0.05 ... level of significance.
Compute $\bar X$ from the data; the test statistic is $Z = \dfrac{\bar X - 100}{\sqrt{\sigma^2 / N}}$ if σ² is known.
Compare it with the critical value $Z_c$ (to decide whether or not to accept H₀).
If we accept H₀: "we cannot reject H₀ at the 95% confidence level based on the data we have."
If we reject H₀: "we can reject H₀ at the 5% level of significance."
Type I error: rejecting H₀ when H₀ is true; the probability of making a Type I error is α.
Type II error: accepting H₀ when H₀ is false; the probability of making a Type II error is difficult to determine.
When the confidence interval is widened (α is reduced), the probability of a Type I error falls and the probability of a Type II error rises.

※
$$\frac{\bar X - \mu_0}{S / \sqrt{N}} = \frac{(\bar X - \mu_0)\sqrt{N}}{S} = \frac{(\bar X - \mu_0)\sqrt{N} / \sigma}{\sqrt{\dfrac{(N-1)S^2}{\sigma^2} \Big/ (N-1)}}$$
(1) If $X \sim N(\mu_X, \sigma_X^2)$, then $\bar X \sim N(\mu_X, \sigma_X^2 / N)$; the numerator $(\bar X - \mu_0)\sqrt{N} / \sigma$ is a standard normal variable Z, and $\dfrac{(N-1)S^2}{\sigma^2}$ is $\chi^2_{N-1}$:
$$\frac{(N-1)S^2}{\sigma^2} = \frac{N-1}{\sigma^2} \cdot \frac{\sum (X_i - \bar X)^2}{N-1} = \sum \left(\frac{X_i - \bar X}{\sigma}\right)^2 \sim \chi^2_{N-1}$$
Therefore
$$\frac{\bar X - \mu_0}{S / \sqrt{N}} \sim \frac{Z}{\sqrt{\chi^2_{N-1} / (N-1)}} \sim t_{N-1}$$
provided the numerator and denominator are independent.
(2) If X is not normally distributed, $X \sim (\mu_X, \sigma_X^2)$, then we appeal to the central limit theorem.
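A small sketch of this test in code (a hedged illustration only: the sample values are made up, and the t critical value from SciPy is used since σ² is usually unknown, matching the $t_{N-1}$ result derived above):

import numpy as np
from scipy import stats

x = np.array([102.0, 98.5, 101.2, 103.7, 99.8, 100.9, 104.1, 97.6])  # hypothetical sample
mu0, alpha = 100.0, 0.05
n = len(x)

xbar = x.mean()
s = x.std(ddof=1)                             # sample standard deviation S (divides by N-1)
t_stat = (xbar - mu0) / (s / np.sqrt(n))      # (Xbar - mu0)/(S/sqrt(N)) ~ t_{N-1} under H0
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1) # two-sided 5% critical value

if abs(t_stat) > t_crit:
    print(f"t = {t_stat:.3f}, |t| > {t_crit:.3f}: reject H0 at the 5% level")
else:
    print(f"t = {t_stat:.3f}, |t| <= {t_crit:.3f}: cannot reject H0 at the 5% level")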
D. Point and interval estimation:

1. Point estimate: ex. $\bar X = 40$, but we do not know how close the true value is to it.

2. Interval estimate:
$$\bar X \pm Z_{\alpha/2}\,\frac{\sigma_X}{\sqrt{N}}, \qquad \alpha = 0.05, \text{ if } \sigma_X \text{ is known}$$
$$\bar X \pm t_{N-1,\,\alpha/2}\,\frac{S_X}{\sqrt{N}}, \qquad \alpha = 0.05, \text{ if } \sigma_X \text{ is unknown and } N \text{ is small } (N < 31)$$
The 95% interval will include the true value in 95% of repeated samples.

E. Properties of estimators: unbiasedness, efficiency, consistency.

1. Unbiasedness: $\hat\beta$ is an unbiased estimator of β if $E(\hat\beta) = \beta$ (the true value).
Ex. $\bar X$ is an unbiased estimator of $\mu_X$; ex. $S_X^2$ is an unbiased estimator of $\sigma_X^2$:
$$E(S^2) = E\left[\frac{\sum (X - \bar X)^2}{N-1}\right] = E\left\{\frac{1}{N-1}\left[\sum (X - \mu_X)^2 - N(\bar X - \mu_X)^2\right]\right\} = \frac{1}{N-1}\left(N\sigma_X^2 - N\frac{\sigma_X^2}{N}\right) = \sigma_X^2$$
using
$$\sum (X - \bar X)^2 = \sum [(X - \mu_X) - (\bar X - \mu_X)]^2 = \sum [(X - \mu_X)^2 + (\bar X - \mu_X)^2 - 2(X - \mu_X)(\bar X - \mu_X)]$$
$$= \sum (X - \mu_X)^2 + N(\bar X - \mu_X)^2 - 2(\bar X - \mu_X)\sum (X - \mu_X) = \sum (X - \mu_X)^2 + N(\bar X - \mu_X)^2 - 2N(\bar X - \mu_X)^2 = \sum (X - \mu_X)^2 - N(\bar X - \mu_X)^2$$
bias = $E(\hat\beta) - \beta$.

2. Efficiency:
Define: (a) minimum mean square error, MSE($\hat\beta$) = $E(\hat\beta - \beta)^2 = [\text{bias}(\hat\beta)]^2 + Var(\hat\beta)$.
Proof:
$$E(\hat\beta - \beta)^2 = E\{[\hat\beta - E(\hat\beta)] + [E(\hat\beta) - \beta]\}^2 = E[\hat\beta - E(\hat\beta)]^2 + [E(\hat\beta) - \beta]^2 + 2E\{[\hat\beta - E(\hat\beta)][E(\hat\beta) - \beta]\}$$
$$= Var(\hat\beta) + [\text{bias}(\hat\beta)]^2 + 2[E(\hat\beta) - \beta]\,E[\hat\beta - E(\hat\beta)] = Var(\hat\beta) + [\text{bias}(\hat\beta)]^2 + 0$$
(Intuitively, we check $\bar X$: its values are mostly clustered around μ.)
When bias($\hat\beta$) = 0, MSE($\hat\beta$) = Var($\hat\beta$). If we have an unbiased estimator with a large dispersion around the true value (i.e. high variance) and a biased estimator with a low variance, we might prefer the biased estimator to the unbiased one in order to maximize the precision of prediction.

Def. (b) If, for a given sample size, (1) $\hat\beta$ is unbiased, and (2) $Var(\hat\beta) \le Var(\tilde\beta)$, where $\tilde\beta$ is any other unbiased estimator of β, then $\hat\beta$ is an efficient estimator of β.
Cramér-Rao lower bound: gives a lower limit for the variance of any unbiased estimator. Ex. the lower bound for $Var(\bar X)$ is $\sigma^2 / N$; the lower bound for the variance of an unbiased estimator of $\sigma^2$ is $2\sigma^4 / N$ (among all unbiased estimators, this is the minimum attainable variance).

3. Consistency: the probability limit of $\hat\beta$, plim($\hat\beta$) = β as N → ∞; that is,
$$\lim_{N \to \infty} \text{prob}(|\hat\beta - \beta| < \delta) = 1, \text{ for any } \delta > 0$$
i.e. prob($|\hat\beta - \beta| < \delta$) → 1 and prob($|\hat\beta - \beta| > \delta$) → 0 as N → ∞.
Criteria for consistency:
(a) $\hat\beta$ is a consistent estimator of β iff plim($\hat\beta$) = β as N → ∞.
(b) Mean square error consistency (a sufficient, but not necessary, condition): if $\hat\beta$ is an estimator of β and $\lim_{N \to \infty} \text{MSE}(\hat\beta) = 0$, then $\hat\beta$ is a consistent estimator of β.
E.g. $\bar X$ is a consistent estimator of $\mu_X$: MSE($\bar X$) = $[\text{bias}(\bar X)]^2 + Var(\bar X) = 0 + \sigma_X^2 / N$, and $\lim_{N \to \infty} \text{MSE}(\bar X) = 0$.
Slutsky's theorem: if plim($\hat\beta$) = β and g($\hat\beta$) is a continuous function of $\hat\beta$, then plim g($\hat\beta$) = g[plim($\hat\beta$)]. (This is "consistency carries over." Unbiasedness, by contrast, does not carry over through such transformations.)

4. Asymptotic unbiasedness (as N becomes larger and larger): "$\hat\beta$" is an asymptotically unbiased estimator of β if $\lim_{N \to \infty} E(\hat\beta) = \beta$. If an estimator is unbiased, it is asymptotically unbiased.
Ex. $\tilde\sigma^2 = \frac{1}{N}\sum (X_i - \bar X)^2$ is biased but asymptotically unbiased:
$$E(\tilde\sigma^2) = \frac{N-1}{N}\sigma^2, \qquad \lim_{N \to \infty} E(\tilde\sigma^2) = \sigma^2$$

5. Asymptotic efficiency (N → ∞): "$\hat\beta$" is an asymptotically efficient estimator of β if all of the following conditions are satisfied:
(a) $\hat\beta$ has an asymptotic distribution with finite mean and finite variance;
(b) $\hat\beta$ is consistent;
(c) no other consistent estimator of β has a smaller asymptotic variance than $\hat\beta$.
If an estimator is efficient, it is asymptotically efficient.

Summary of possible combinations of properties:
unbiasedness, asym. unbiasedness, efficiency, asym. efficiency, consistency
biasedness, asym. unbiasedness, inefficiency, asym. efficiency, consistency
biasedness, asym. unbiasedness, inefficiency, asym. inefficiency, inconsistency
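The following simulation sketch (my own illustration; the normal population, seed, and sample sizes are arbitrary choices) shows three of these properties at once: S² is unbiased for σ², the divide-by-N estimator σ̃² is biased but asymptotically unbiased, and MSE(X̄) = σ²/N shrinks toward zero, which is the mean square error consistency of X̄:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, reps = 5.0, 4.0, 20_000          # true mean and variance of the population

for n in (5, 30, 200):
    x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    xbar = x.mean(axis=1)
    s2 = x.var(axis=1, ddof=1)               # S^2: divides by N-1, unbiased
    s2_tilde = x.var(axis=1, ddof=0)         # sigma-tilde^2: divides by N, biased
    print(f"N={n:3d}  mean of S^2 = {s2.mean():.3f}"
          f"  mean of sigma~^2 = {s2_tilde.mean():.3f}"    # -> (N-1)/N * sigma^2
          f"  MSE(Xbar) = {((xbar - mu)**2).mean():.4f}")  # -> sigma^2/N -> 0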
Review of linear algebra

1. A matrix A is idempotent iff AA = A.

2. If the inner product of two vectors vanishes (i.e., the scalar is 0), then the vectors are orthogonal. [An inner product (or scalar, or dot product) of two vectors is a row vector times a column vector, yielding a scalar. An outer product of two vectors is a column vector times a row vector, yielding a matrix.]

3. The rank of any matrix A, ρ(A), is the size of the largest non-vanishing determinant contained in A; or, the rank is the maximum number of linearly independent rows or columns of A, where $a_1, a_2, \ldots, a_N$ is a set of linearly independent vectors iff $\sum_{j=1}^{N} k_j a_j = 0$ necessarily implies $k_1 = k_2 = \cdots = k_N = 0$.
Ex.
$$A = \begin{bmatrix} 2 & 3 \\ 8 & 6 \end{bmatrix}, \quad |A| = 12 - 24 = -12 \ne 0, \quad \rho(A) = 2; \qquad B = \begin{bmatrix} 3 & 6 \\ 2 & 4 \end{bmatrix}, \quad |B| = 12 - 12 = 0, \quad \rho(B) = 1$$
Properties:
a. 0 ≤ ρ(A) = integer ≤ min(M, N), where A is an M×N matrix.
b. ρ(I_N) = N for the identity matrix; ρ(0) = 0 for the null matrix.
c. ρ(A) = ρ(A') = ρ(A'A) = ρ(AA').
d. If A and B are of the same order, ρ(A + B) ≤ ρ(A) + ρ(B).
e. If AB is defined, ρ(AB) ≤ min[ρ(A), ρ(B)].
f. If A is diagonal, ρ(A) = number of nonzero elements.
g. If A is idempotent, ρ(A) = trace(A).
h. The rank of a matrix is not changed if one row (or column) is multiplied by a nonzero constant, or if such a multiple of one row (column) is added to another row (column).

4. A square matrix of order N is nonsingular iff it is of full rank, ρ(A) = N (or |A| ≠ 0); otherwise, it is singular. The rank of a matrix is unchanged by premultiplying or postmultiplying it by a nonsingular matrix.

5. Differentiation in matrix notation (rules):
a. If $X = (X_1, \ldots, X_M)'$ and $a = (a_1, \ldots, a_M)'$ are M×1 vectors, where the $a_i$ (i = 1, 2, ..., M) are constants, then
$$\frac{\partial (a'X)}{\partial X} = a$$
b. If A is a symmetric matrix of order M×M whose typical element $a_{ij}$ is a constant, then
$$\frac{\partial (X'AX)}{\partial X} = 2AX;$$
if A is not symmetric,
$$\frac{\partial (X'AX)}{\partial X} = (A' + A)X$$
c. If A is a symmetric matrix of order M×M whose typical element $a_{ij}$ is a constant, then
$$\frac{\partial^2 (X'AX)}{\partial X\,\partial X'} = 2A, \qquad \frac{\partial (X'AX)}{\partial A} = XX'$$
If Y and Z are vectors and B is a matrix, then
$$\frac{\partial (Y'BZ)}{\partial Y} = BZ, \qquad \frac{\partial (Y'BZ)}{\partial Z} = B'Y, \qquad \frac{\partial (Y'BZ)}{\partial B} = YZ'$$

Formulas for matrices:
(AB)' = B'A', where A is M×N and B is N×K.
(A')' = A.
AB ≠ BA in general.
(A + C)' = A' + C', where A and C are both M×N.
(DE)⁻¹ = E⁻¹D⁻¹, where D and E are M×M, iff det D ≠ 0 and det E ≠ 0 (i.e., iff D and E are nonsingular matrices).
(D⁻¹)' = (D')⁻¹.
det D = det D'.
trace(D ± E) = trace(D) ± trace(E).
trace(AF) = trace(FA), where A is M×N and F is N×M.
trace(scalar) = scalar.
E[trace(D)] = trace[E(D)].
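A quick numerical check of several of these facts (a sketch only, assuming NumPy; the random matrices and the projection-matrix example H = X(X'X)⁻¹X' are my own choices, not from the notes):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))                  # M x N
B = rng.normal(size=(4, 2))                  # N x K
D = rng.normal(size=(3, 3))                  # square, almost surely nonsingular
E = rng.normal(size=(3, 3))
F = rng.normal(size=(4, 3))                  # N x M, so both AF and FA are defined

print(np.allclose((A @ B).T, B.T @ A.T))                                       # (AB)' = B'A'
print(np.allclose(np.linalg.inv(D @ E), np.linalg.inv(E) @ np.linalg.inv(D)))  # (DE)^-1 = E^-1 D^-1
print(np.isclose(np.linalg.det(D), np.linalg.det(D.T)))                        # det D = det D'
print(np.isclose(np.trace(A @ F), np.trace(F @ A)))                            # trace(AF) = trace(FA)

# An idempotent matrix: H = X(X'X)^-1 X' satisfies HH = H and rank(H) = trace(H)
X = rng.normal(size=(6, 2))
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(H @ H, H), np.linalg.matrix_rank(H), round(np.trace(H)))     # True 2 2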