INSY 7300 Notes on Sampling Distributions (SMDs) from Underlying Infinite Populations, F2009, by S. Maghsoodloo

As an example, consider the population {2, 5, 8, 8′, 11}, where the prime indicates that two elements of this size N = 5 population have the identical value 8. Simple calculations show that the population mean is E(Y) = μ_Y = 6.80, the variance is σ²_Y = V(Y) = E[(Y − 6.8)²] = 9.360, the standard deviation is σ = 3.0594, and the coefficient of variation of Y is CV_Y = σ/E(Y) = 44.99%.

Now consider a random sample of size n = 2 with replacement from the above population. First, from the standpoint of sampling, the population is considered infinite because it can never be exhausted; but if we sample without replacement, the population is finite of size N = 5, and in fact it will be exhausted after two random samples of size n = 2. Because all the processes that we consider in this course are hypothetically infinite, we use sampling with replacement to illustrate the concept of sampling distributions.

Accordingly, let ȳ represent the mean of the random sample of size n = 2 with replacement. Then the SMD (SaMpling Distribution), or the probability mass function, of ȳ in the with-replacement case is given by

	pmf(ȳ) =  1/25 at ȳ = 2,
	          2/25 at ȳ = 3.5,
	          5/25 at ȳ = 5,
	          6/25 at ȳ = 6.5,
	          6/25 at ȳ = 8,
	          4/25 at ȳ = 9.5,
	          1/25 at ȳ = 11.

E(ȳ) = the weighted average of the sample mean over all possible values in the range space of ȳ:

	E(ȳ) = Σ ȳ·pmf(ȳ) = 2(1/25) + 3.5(2/25) + 5(5/25) + 6.5(6/25) + 8(6/25) + 9.5(4/25) + 11(1/25) = 170/25 = 6.80 = μ_Y

Hence ȳ is an unbiased estimator of μ, i.e., the amount of bias in ȳ as an estimator of μ is given by B(ȳ) = E(ȳ) − μ = 0. Variance, by definition, is the weighted (or long-term) average of squared deviations from the mean, i.e.,

	V(ȳ) = E[(ȳ − μ)²] = E[(ȳ − 6.8)²] = [(−4.8)² + (4.2)²]/25 + (−3.3)²(2/25) + (−1.8)²(5/25) + [(−0.3)² + (1.2)²](6/25) + (2.7)²(4/25) = 117/25 = 4.680.
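The pmf of ȳ above can be verified by brute-force enumeration of all 5² = 25 equally likely ordered samples. The following is a minimal Python sketch (the variable names are mine, not part of the notes); exact rational arithmetic is used so the check is free of rounding error.

```python
from itertools import product
from fractions import Fraction

population = [2, 5, 8, 8, 11]   # the two 8's are distinct sampling units

# Enumerate all 25 equally likely ordered samples of size n = 2
# drawn with replacement, and tabulate the sampling distribution of ybar.
pmf = {}
for sample in product(population, repeat=2):
    ybar = Fraction(sum(sample), 2)
    pmf[ybar] = pmf.get(ybar, 0) + Fraction(1, 25)

# E(ybar) = sum of ybar * pmf(ybar); should equal mu = 6.80
E_ybar = sum(y * p for y, p in pmf.items())

# V(ybar) = E[(ybar - mu)^2]; should equal sigma^2 / n = 9.36/2 = 4.68
V_ybar = sum((y - E_ybar) ** 2 * p for y, p in pmf.items())

print(float(E_ybar))          # -> 6.8
print(float(V_ybar))          # -> 4.68
print(pmf[Fraction(5)])       # -> 1/5, i.e., 5/25 for ybar = 5
```

The enumeration reproduces every entry of the pmf, e.g., ȳ = 5 arises from the four ordered (2, 8) pairs plus (5, 5), giving 5/25.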
It can easily be proven that for all infinite populations in the universe, the variance of the sample mean is equal to V(individuals)/(sample size), i.e., V(ȳ) = V(Y)/n = σ²/n. For our example, n = 2 and V(Y) = 9.360, and thus V(ȳ) = 9.360/2 = 4.68, as before! Further, the variance operator can be reduced to expected-value operators as follows:

	V(Y) = E[(Y − EY)²] = E[(Y − μ)²] = E[Y² − 2μY + μ²] = E(Y²) − 2μE(Y) + μ² = E(Y²) − 2μ² + μ² = E(Y²) − μ² = E(Y²) − [E(Y)]².

To illustrate the use of the above formula, we re-compute V(ȳ):

	E(ȳ²) = 2²(1/25) + 3.5²(2/25) + 5²(5/25) + 6.5²(6/25) + 8²(6/25) + 9.5²(4/25) + 11²(1/25) = 1273/25 = 50.92

	V(ȳ) = E(ȳ²) − [E(ȳ)]² = 50.92 − 6.80² = 4.680, as before!

One measure of variability is S² = Σᵢ₌₁ⁿ (yᵢ − ȳ)²/(n − 1) = CSS/(n − 1) = (USS − CF)/(n − 1), where the USS = Σᵢ₌₁ⁿ yᵢ² and the CF = (Σᵢ₌₁ⁿ yᵢ)²/n. The pmf (Pr. mass function, or SMD) of S² for our example with n = 2 (with replacement) is given by

	pmf(S²) =  7/25 at S² = 0,
	          10/25 at S² = 4.5,
	           6/25 at S² = 18,
	           2/25 at S² = 40.5.

	E(S²) = Σ S²·p(S²) = 0 + 4.5(10/25) + 18(6/25) + 40.5(2/25) = 234/25 = 9.360 = σ²

Hence S² is an unbiased estimator of σ² for all infinite populations; if the population were finite, this would not be the case. In fact, for all finite populations, E(S²) = Nσ²/(N − 1). Just because S² is an unbiased estimator of σ² for all infinite populations, it does not at all imply that S is an unbiased estimator of σ, as illustrated below for our example.

	p(S) =  7/25 at S = 0,
	       10/25 at S = 4.5^0.5 = 2.121320,
	        6/25 at S = 18^0.5 = 4.242641,
	        2/25 at S = 40.5^0.5 = 6.363961.

	E(S) = Σ S·p(S) = 2.375878785 < σ = 3.0594

The amount of bias in S for this example is B(S) = 2.375878785 − 3.059412 = −0.683533. In fact, for all infinite populations, S is a biased estimator of σ. It will be shown below that σ perforce must always exceed E(S), i.e., the amount of bias in S as an estimator of σ is always negative. By definition:

	V(S) = E(S²) − [E(S)]² = σ² − [E(S)]² > 0 ⟹ σ² > [E(S)]² ⟹ σ > E(S), or E(S) < σ ⟹ B(S) = E(S) − σ < 0.
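The same 25-sample enumeration confirms both facts at once: E(S²) = σ² (unbiasedness of S²) while E(S) < σ (negative bias of S). A short Python sketch, assuming the same population as above (identifiers are mine):

```python
from itertools import product
from statistics import variance, pvariance

population = [2, 5, 8, 8, 11]

# S^2 with the (n - 1) divisor, computed for each of the 25
# equally likely ordered samples of size n = 2 with replacement.
s2_values = [variance(sample) for sample in product(population, repeat=2)]

E_S2 = sum(s2_values) / 25                   # expected value of S^2
E_S = sum(v ** 0.5 for v in s2_values) / 25  # expected value of S

sigma2 = pvariance(population)               # population variance, N divisor
sigma = sigma2 ** 0.5

print(E_S2, sigma2)   # both 9.36: S^2 is unbiased for sigma^2
print(E_S, sigma)     # 2.3758... < 3.0594...: S is biased low for sigma
```

Note that `statistics.variance` uses the n − 1 divisor (sample variance S²) while `statistics.pvariance` uses the N divisor (population variance σ²), which is exactly the distinction the notes are exploiting.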
Further, note that for all rvs (random variables) in the universe, V(Y) = E(Y²) − [E(Y)]² ≥ 0 ⟹ E(Y²) ≥ [E(Y)]², i.e., √E(Y²) ≥ |E(Y)| ≥ E(Y). If the underlying distribution is the Laplace-Gaussian N(μ, σ²), then it can be proven that E(S) = c₄σ, where

	c₄ = √(2/(n − 1)) · Γ(n/2)/Γ((n − 1)/2)

lies within the interval [0.797884561, 1), where c₄ at n = 2 is equal to 0.797884561 and the limit of c₄ (as n → ∞) is 1. Thus, for a N(μ, σ²), an unbiased estimator of σ is given by S/c₄.

Exercise for Friday 08/19/2011. Use the following two independent pmfs

	p(y₁) = 4/9 at y₁ = 1,  2/9 at y₁ = 2.5,  3/9 at y₁ = 3,  and
	p(y₂) = 3/5 at y₂ = 1.5,  2/5 at y₂ = 2.0,

in order to illustrate the following properties of the expected-value and variance operators.
(1) E(Y₁ + Y₂) = E(Y₁) + E(Y₂)
(2) V(3Y₁) = 9V(Y₁)
(3) V(Y₁ + Y₂) = V(Y₁) + V(Y₂), only because Y₁ and Y₂ are considered independent.
(4) For the example on pp. 1-2 of these notes, compute the amount of bias, for a random sample of size 2 (with replacement), of the sample range R̂ and the sample median ỹ. Further, compute the variance of ỹ and compare it against V(ȳ).

It has been shown in the statistical literature that the asymptotic SMD of the p-th sample quantile, ŷ_p, is approximately Laplace-Gaussian with mean y_p and variance given by Var(ŷ_p) ≈ pq/[n·f²(y_p)], where f(y) is the underlying density function and q = 1 − p. This implies that the SMD of the median from a normal universe is approximately normal (for n > 10) with E(ŷ₀.₅₀) = μ, due to symmetry, and variance V(ŷ₀.₅₀) = (0.5)(0.5)/[n·f²(μ)] = 0.25/[n/(2πσ²)] = πσ²/(2n). Thus, SE(ŷ₀.₅₀) = σ·√(π/(2n)) = √(π/2)·SE(ȳ) = 1.25331414·SE(ȳ), which is larger than SE(ȳ) by roughly 25%. It has also been shown in the statistical literature that when the sample size n from a N(μ, σ²) is small, V(ŷ₀.₅₀)/V(ȳ) = 1, 1.35, 1.19, and 1.44 for n = 2, 3, 4, 5, respectively. Kendall & Stuart (1967), Vol. 2, p. 7 give these results but do not provide information for n = 6, 7, 8, 9, and 10.
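The constant c₄ and the √(π/2) ratio above are easy to evaluate with the gamma function. A brief Python sketch of the two formulas (the function name `c4` is mine):

```python
import math

def c4(n):
    # c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2);
    # under N(mu, sigma^2), E(S) = c4 * sigma, so S/c4 is unbiased for sigma.
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

print(c4(2))                    # sqrt(2/pi) = 0.797884561..., the lower endpoint
print(c4(100))                  # near 1, illustrating c4 -> 1 as n grows
print(math.sqrt(math.pi / 2))   # 1.25331414..., the SE(median)/SE(mean) ratio
```

At n = 2 the expression reduces to √2·Γ(1)/Γ(1/2) = √2/√π, matching the 0.797884561 endpoint quoted in the notes, and c₄ climbs toward 1 as n increases.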