Econ 496/895
Introduction to Design and Analysis of Economics Experiments
Professor Daniel Houser

Background

A. Useful definitions and results from probability theory.

1. A "random experiment" is an experiment for which the outcome cannot be predicted with certainty, although the set of all possible outcomes can be described.

2. The sample space S of a random experiment is the set of all possible outcomes.

3. Any function $X : S \to \mathbb{R}$ is a "random variable." Note that the choice of function X generally depends on the purpose of the experiment. For example, in an experiment on brain activation one could look for activation in a single area, for the average over several areas, or for certain activation paths.

Ex: Tossing a six-sided die and observing the number of spots facing up is a random experiment. The sample space is $S = \{1,2,3,4,5,6\}$ and the random variable is $X(s) = s$, the identity map. An alternative random variable would be $X(s) = 1$ if $s > 4$ and $X(s) = 0$ otherwise, which provides an indicator for whether the realized outcome exceeded four.

4. Let A be a subset of S, $A \subseteq S$. Suppose P is a set function defined on the set of all subsets of S, $P(A)$ for all $A \subseteq S$. P is called a "probability" if and only if it satisfies the following.
(i) For all subsets A of S, $P(A) \ge 0$,
(ii) $P(S) = 1$,
(iii) If $A_1, A_2, \ldots, A_k$ are such that $A_m \cap A_n = \emptyset$ for $m \neq n$, then $P(A_1 \cup \cdots \cup A_k) = P(A_1) + P(A_2) + \cdots + P(A_k)$, where k may be finite or countably infinite.

Ex: In the die tossing example $P(1) = P(2) = \cdots = P(6) = 1/6$, $P(1 \cup 2) = 1/3$, and $P([1 \cup 2] \cup [2 \cup 3]) = 0.5$.

5. The "conditional probability" of A given that B has occurred is
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$

Ex: In the die tossing example, the probability of a "2" given that a "1" or "2" has been observed is $P(2 \mid 1 \text{ or } 2) = P(2 \cap [1 \text{ or } 2]) / P(1 \text{ or } 2) = (1/6)/(1/3) = 0.5$.

6. Events A and B are "independent" if and only if $P(A \cap B) = P(A)P(B)$. From the definition of conditional probability we see that if two events are independent then knowledge about the occurrence of one does not help to improve predictions with regard to the likelihood of the other.

7. Events A, B and C are mutually independent if they are pairwise independent and $P(A \cap B \cap C) = P(A)P(B)P(C)$. This extends easily to four or more events: each pair, triplet, etc. must satisfy independence.

Ex: Suppose three coins are tossed simultaneously but that these are trick coins and, with equal probability, one of the four outcomes {H,T,H}, {H,H,T}, {T,T,T} or {T,H,H} occurs, where H is heads, T is tails, and the order is {Coin 1, Coin 2, Coin 3}. Then it is easy to see that the realizations are pairwise independent, e.g. P(Coin 1 = H | Coin 2 = T) = 0.5, although there is obviously dependence in their joint realization, since P(H,H,H) is zero, not 1/8.

8. From the definition of conditional probability one can deduce Bayes' Formula:
$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}.$
This is often referred to as the posterior probability of A given B.

9. If X is a random variable, we will often denote its "probability density function" (pdf) by f. If X is discrete, then $f(x) = P(X = x)$. If X is continuous then $P(X \in A) = \int_A f(x)\,dx$. A pdf is nonnegative and integrates to unity over the sample space.

10. The "cumulative distribution function" (cdf) of a random variable X is defined by $F(x) = P(X \le x) = \sum_{t \le x} f(t)$ (with the sum replaced by an integral for a continuous rv). A cdf lies between zero and one and is nondecreasing.

11. The "expected value" of a random variable X, E(X), is $\int x f(x)\,dx$ (or $\sum_x x f(x)$ for a discrete rv), provided that the value of this integral is finite. This is the weighted average of all possible values of X, where the weights are interpreted as the chance that the observation will occur. If the value of the integral is infinite, then the expectation is said not to exist. Expected values satisfy:
E(c) = c for every constant c.
E(cX) = cE(X) for constant c and rv X.
E(u(X) + v(X)) = E(u(X)) + E(v(X)) for functions u and v.
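To make these definitions concrete, here is a minimal numerical sketch in Python (NumPy is assumed available; the variable names, sample size and seed are illustrative choices, not part of the notes). It computes E(X) for the die example as the weighted average $\sum_x x f(x)$ and checks the conditional probability from item 5 by simulation.

    import numpy as np

    # Die-tossing example from items 3, 5 and 11: S = {1,...,6}, f(x) = 1/6.
    outcomes = np.arange(1, 7)
    probs = np.full(6, 1 / 6)

    # E[X] as the probability-weighted average sum_x x f(x); here 3.5.
    expected_value = float(np.sum(outcomes * probs))

    # Conditional probability: P(2 | 1 or 2) = P(2) / P(1 or 2) = 0.5.
    p_two_given_one_or_two = probs[1] / (probs[0] + probs[1])

    # Monte Carlo check of both quantities.
    rng = np.random.default_rng(seed=0)
    tosses = rng.choice(outcomes, size=200_000, p=probs)
    sim_mean = tosses.mean()
    restricted = tosses[tosses <= 2]            # condition on the event {1, 2}
    sim_cond = np.mean(restricted == 2)

    print(expected_value, sim_mean)             # 3.5 and roughly 3.5
    print(p_two_given_one_or_two, sim_cond)     # 0.5 and roughly 0.5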
12. The variance of a rv X is its expected squared deviation from its mean,
$\sigma_X^2 = \int (x - E(X))^2 f(x)\,dx.$
The positive square root of the variance is called the standard deviation. The variance is a measure of the dispersion of the distribution of X about its mean. Chebyshev's inequality shows that the probability that a realization of X is within $k \ge 1$ standard deviations of its mean is at least $1 - 1/k^2$. Hence, at least 84% of the mass of any rv lies within 2.5 sd's of its mean.

13. The covariance of two rv's X and Y is $\mathrm{Cov}(X,Y) = E([X - \mu_X][Y - \mu_Y])$. Their correlation coefficient is
$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}.$
The correlation coefficient lies between -1 and +1. Intuition about the meaning of covariance and correlation is gained by noting that the least squares regression of Y on X is
$y = \mu_Y + \rho_{XY}\frac{\sigma_Y}{\sigma_X}(x - \mu_X).$

14. A rv X has a "normal" (or Gaussian) distribution if its pdf is
$f(x) = \frac{1}{\sqrt{2\pi\sigma_X^2}} \exp\!\left( -\frac{(x - \mu_X)^2}{2\sigma_X^2} \right), \quad -\infty < x < \infty.$
If X follows a normal distribution we write $X \sim N(\mu_X, \sigma_X^2)$. If $X \sim N(\mu, \sigma^2)$ then $Z = (X - \mu)/\sigma \sim N(0,1)$. A finite sum of independent normally distributed rv's follows a normal distribution.

B. Useful definitions and results from sampling distribution theory.

1. A "statistic" is a function of a random sample of realizations of random variables. The mean and variance of a sample are both statistics. Since statistics are functions of random variables, they are themselves random variables.

2. "Sampling distribution theory" is concerned with determining the distribution of statistics.

3. Example: Finding the sampling distribution of a function of random variables in a random sample. An unbiased four-sided die is cast twice with outcomes $X_1$ and $X_2$. Find the expected value and variance of $Y = X_1 + X_2$. Y takes values in {2, 3, ..., 8}. Let g(y) denote the pdf of Y.
g(2) = P(Y=2) = P(X1=1 and X2=1) = (1/4)(1/4) = 1/16.
g(3) = P(Y=3) = P(X1=1 and X2=2) + P(X1=2 and X2=1) = 2/16.
Similarly:
g(4) = 3/16.
g(5) = 4/16.
g(6) = 3/16.
g(7) = 2/16.
g(8) = 1/16.
Hence:
E[Y] = 2(1/16) + 3(2/16) + 4(3/16) + 5(4/16) + 6(3/16) + 7(2/16) + 8(1/16) = 5.
One may then calculate Var(Y) = E([Y - 5]^2) = 5/2.
Note that in many cases sampling distributions can be simulated more quickly and easily than they can be computed, as the sketch below illustrates.
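A sketch of that simulation approach for this example, in Python with NumPy assumed available (the number of draws and the seed are arbitrary illustrative choices): it draws many pairs of four-sided-die tosses, compares the relative frequencies of Y with the exact pdf g(y), and checks E[Y] = 5 and Var(Y) = 5/2.

    import numpy as np

    # Exact pmf of Y = X1 + X2 from the calculation above.
    exact_g = {2: 1/16, 3: 2/16, 4: 3/16, 5: 4/16, 6: 3/16, 7: 2/16, 8: 1/16}

    rng = np.random.default_rng(seed=1)
    n_draws = 200_000
    x1 = rng.integers(1, 5, size=n_draws)   # uniform on {1, 2, 3, 4}
    x2 = rng.integers(1, 5, size=n_draws)
    y = x1 + x2

    # Exact probability vs. simulated relative frequency for each value of Y.
    for value, prob in exact_g.items():
        print(value, prob, np.mean(y == value))

    # Simulated mean and variance; roughly 5 and 2.5, matching E[Y] and Var(Y).
    print(y.mean(), y.var())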
4. In many empirical applications the sampling distributions of the statistics of interest turn out to be chi-square, t or F. We next discuss each of these distributions.

- The chi-square distribution.

(i) A rv X is said to have a chi-square distribution with r degrees of freedom, $\chi^2(r)$, if its pdf is
$f(x) = \frac{1}{\Gamma(r/2)\,2^{r/2}}\, x^{r/2 - 1} e^{-x/2}, \quad 0 < x < \infty.$
Here E[X] = r and Var(X) = 2r.

(ii) If the rv X is $N(\mu, \sigma^2)$ then the rv $V = \left[(X - \mu)/\sigma\right]^2$ is $\chi^2(1)$. That is, the square of a standard normal rv follows a chi-square distribution with one degree of freedom.

(iii) If $X_1$ is $\chi^2(r_1)$ and $X_2$ is $\chi^2(r_2)$ and if they are independent, then $X_1 + X_2 \sim \chi^2(r_1 + r_2)$.

(iv) Let $Z_1, \ldots, Z_k$ be independent standard normal rv's. Then the sum of the squares of these rv's has a chi-square distribution with k degrees of freedom: $W = Z_1^2 + \cdots + Z_k^2 \sim \chi^2(k)$.

(v) If $X_1, \ldots, X_n$ have mutually independent normal distributions $N(\mu_i, \sigma_i^2)$ then the distribution of
$W = \sum_{i=1}^{n} \frac{(X_i - \mu_i)^2}{\sigma_i^2}$
is $\chi^2(n)$.

(vi) Suppose $X_1, \ldots, X_n$ are independent random draws from a $N(\mu_X, \sigma_X^2)$ distribution. Define
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ (the sample mean),
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$ (estimate of the variance).
Then
(a) $\bar{X}$ and $S^2$ are independent.
(b) $\frac{(n-1)S^2}{\sigma_X^2} = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma_X^2}$ is $\chi^2(n-1)$.
Use of the sample mean instead of the true mean reduces the degrees of freedom by one. This reduces the expected value of the statistic relative to its value under the true mean. The reason is that the sum of squared deviations of the observations from the sample mean is never greater than the sum of squared deviations from the mean of the population's distribution.

- The t distribution

(i) If Z is a rv that is N(0,1), if U is a rv that is chi-square with r degrees of freedom, and if Z and U are independent, then
$T = \frac{Z}{\sqrt{U/r}}$
has a t distribution with r degrees of freedom. This distribution was discovered by W. Gosset, who published under the name "Student."

(ii) The pdf of a t distribution is
$g(t) = \frac{\Gamma((r+1)/2)}{\sqrt{\pi r}\,\Gamma(r/2)\,(1 + t^2/r)^{(r+1)/2}}, \quad -\infty < t < \infty.$

(iii) When r = 1 the t distribution is the Cauchy distribution, hence its mean and variance do not exist. The mean exists and is zero when $r \ge 2$, and the variance exists and is equal to $\frac{r}{r-2}$ when $r \ge 3$.

(iv) The t distribution is symmetric about the origin but has fatter tails than the normal distribution. The tails shrink as r grows larger: as r approaches infinity the t distribution approaches the normal distribution.

(v) Suppose $X_1, \ldots, X_n$ is a random sample from a $N(\mu_X, \sigma_X^2)$ distribution. Then $(\bar{X} - \mu_X)/(\sigma_X/\sqrt{n})$ is N(0,1), $(n-1)S^2/\sigma_X^2$ is $\chi^2(n-1)$, and the two are independent. Thus,
$T = \frac{(\bar{X} - \mu_X)/(\sigma_X/\sqrt{n})}{\sqrt{[(n-1)S^2/\sigma_X^2]/(n-1)}} = \frac{\bar{X} - \mu_X}{S/\sqrt{n}}$
is t(n-1). This provides a convenient way to test whether a mean is statistically different from, say, zero when the variance of the underlying normal distribution is not known but must be estimated.

- The F distribution

(i) If U and V are independent chi-square variables with q and r degrees of freedom, respectively, then F = (U/q)/(V/r) has an F distribution with q and r degrees of freedom. The F distribution is named in honor of R. A. Fisher.

(ii) The F statistic can take only positive values.

(iii) $E(F) = \frac{r}{r-2}$ when r > 2, and $\mathrm{Var}(F) = \frac{2r^2(q + r - 2)}{q(r-2)^2(r-4)}$ when r > 4.

(iv) Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be random samples of size n and m from two independent normal distributions $N(\mu_X, \sigma_X^2)$ and $N(\mu_Y, \sigma_Y^2)$. Then $(n-1)S_X^2/\sigma_X^2$ is $\chi^2(n-1)$ and $(m-1)S_Y^2/\sigma_Y^2$ is $\chi^2(m-1)$. Since the distributions, and hence $S_X^2$ and $S_Y^2$, are independent we have
$F = \frac{[(m-1)S_Y^2/\sigma_Y^2]/(m-1)}{[(n-1)S_X^2/\sigma_X^2]/(n-1)} = \frac{S_Y^2/\sigma_Y^2}{S_X^2/\sigma_X^2} \sim F(m-1, n-1).$
This statistic is often useful when testing for the statistical significance of treatment effects.
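As a rough illustration of how this variance-ratio statistic might be applied to two treatment groups, the following Python sketch (NumPy and SciPy assumed available) uses simulated normal draws as stand-ins for observed data; the group sizes, means, standard deviations and seed are arbitrary placeholders. Under $H_0: \sigma_X^2 = \sigma_Y^2$ the ratio $S_Y^2/S_X^2$ is referred to an F(m-1, n-1) distribution.

    import numpy as np
    from scipy import stats

    # Placeholder "data" for two treatment groups; in practice these would be
    # the observed outcomes from each treatment.
    rng = np.random.default_rng(seed=2)
    x = rng.normal(loc=10.0, scale=3.0, size=20)   # group X, n = 20
    y = rng.normal(loc=10.0, scale=3.0, size=25)   # group Y, m = 25
    n, m = len(x), len(y)

    # Sample variances S_X^2 and S_Y^2 (dividing by n-1 and m-1).
    s2_x = np.var(x, ddof=1)
    s2_y = np.var(y, ddof=1)

    # Under H0 the ratio follows F(m-1, n-1).
    f_stat = s2_y / s2_x

    # One-sided p-value against the alternative sigma_Y^2 > sigma_X^2.
    p_value = stats.f.sf(f_stat, dfn=m - 1, dfd=n - 1)
    print(f_stat, p_value)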
5. The central limit theorem

Suppose $X_1, \ldots, X_n$ is a random sample from a $N(\mu_X, \sigma_X^2)$ distribution. What is the sampling distribution of the mean $\bar{X}$? Note that:
$E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{n\mu_X}{n} = \mu_X,$
$\mathrm{Var}(\bar{X}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{n\sigma_X^2}{n^2} = \frac{\sigma_X^2}{n}.$
Moreover, since $\bar{X}$ is a scaled sum of independent normal rv's it also follows a normal distribution. Thus,
$\bar{X} \sim N\!\left(\mu_X, \frac{\sigma_X^2}{n}\right)$ and $W = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}} \sim N(0,1).$
As $n \to \infty$ the distribution of $\bar{X} - \mu_X$ degenerates to zero. However, the distribution of W does not degenerate. The scale factor $\sigma_X/\sqrt{n}$ spreads out the mass so that the variance of W is unity for all n.

The statistic W is well defined even if the underlying sample is not normal. Suppose $X_1, \ldots, X_n$ is a random sample from an arbitrary distribution. What is the sampling distribution of the W statistic in this case? It is easy to show (and you should) that E[W] = 0 and Var(W) = 1 in this case as well. However, since the distribution of the $X_i$ is not normal, the distribution of W will not generally be normal. Nevertheless, the central limit theorem (clt) assures us that a standard normal distribution closely approximates the true sampling distribution of W, as long as fairly weak regularity conditions hold and n is sufficiently large. Typically, n = 15 is large enough for practical purposes.

Central Limit Theorem: If $\bar{X}$ is the mean of a random sample $X_1, \ldots, X_n$ of size n from a distribution with a finite mean $\mu_X$ and a finite, positive variance $\sigma_X^2$, then the distribution of
$W = \frac{\bar{X} - \mu_X}{\sigma_X/\sqrt{n}}$
is N(0,1) in the limit as $n \to \infty$.

It is often the case that assessing treatment contrasts involves comparing means from two treatment groups. One way to determine the significance of the contrasts is to use parametric tests that assume the means are normally distributed. The clt can be invoked to justify this normality assumption, particularly if the number of observations in each treatment is 15 or more.
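A simulation sketch of the theorem in Python (NumPy and SciPy assumed available; the exponential parent distribution, n = 15, the number of replications and the seed are illustrative choices): it draws many samples from a skewed, non-normal distribution, forms W for each sample, and compares the empirical distribution of W with the standard normal cdf.

    import numpy as np
    from scipy import stats

    # Sampling distribution of W = (Xbar - mu) / (sigma / sqrt(n)) when the
    # data are NOT normal: exponential with mean 1 (so mu = sigma = 1).
    mu, sigma, n = 1.0, 1.0, 15
    n_replications = 100_000

    rng = np.random.default_rng(seed=3)
    samples = rng.exponential(scale=1.0, size=(n_replications, n))
    w = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

    # Compare a few empirical probabilities P(W <= z) with the standard
    # normal values that the central limit theorem predicts.
    for z in (-1.96, -1.0, 0.0, 1.0, 1.96):
        print(z, np.mean(w <= z), stats.norm.cdf(z))

Even at n = 15 the agreement is typically close in the center of the distribution; for strongly skewed parent distributions the tail probabilities tend to converge more slowly.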
C. Comments on Tests of Statistical Hypotheses

1. Definitions

- A statistical hypothesis is an assertion about the distribution of one or more rv's.
- A test of a statistical hypothesis is a procedure, based on the observed values of the rv's, that leads to the acceptance or rejection of the hypothesis.
- The critical region C is the set of points in the sample space that leads to the rejection of the null hypothesis.
- A statistic used to define the critical region is referred to as a test statistic. The critical region can be defined as the set of values of the test statistic that leads to rejection of the null.
- A type I error is made when the null is rejected even though it is true. A type II error is made when the null is accepted even though it is false.
- The significance level of the test is the probability of a type I error.
- The power function of a test of a null $H_0$ against an alternative $H_1$ is the function that gives the probability of rejecting $H_0$ for each parameter point in $H_0$ and $H_1$. The value of the power function at a particular point is called the "power" of the test at that point.
- The p-value is the probability, under $H_0$, of all values of the test statistic that are as extreme (in the direction of rejection of $H_0$) as the observed value of the test statistic.

2. Example

Suppose we have a random sample of 25 observations $X_i$ from an unknown distribution with a known variance of 100. We believe the mean of this distribution is 60, but we want to entertain the possibility that it exceeds 60. Hence, we will test the null hypothesis $H_0: \mu = 60$ against the alternative hypothesis $H_1: \mu > 60$. Note that each of these is a statistical hypothesis, since each makes an assertion about the distribution of X.

We decide to reject the null in favor of the alternative if the mean of the observations is at least 62, i.e. $\bar{X} \ge 62$. This defines the critical region for the test:
$C = \left\{ (X_i)_{i=1,\ldots,25} : \tfrac{1}{25}\sum_{i=1}^{25} X_i \ge 62 \right\}.$
If the realized sample does not fall in the critical region we accept the null.

Since the critical region is known we may now determine the significance level (or size) and power of this test. The power function is $\gamma(\mu) = P(\bar{X} \ge 62; \mu)$. Since by the clt the distribution of $\bar{X}$ should be well approximated by a N($\mu$, 100/25 = 4) distribution, we have that
$\gamma(\mu) = P\!\left( \frac{\bar{X} - \mu}{2} \ge \frac{62 - \mu}{2};\, \mu \right) = 1 - \Phi\!\left( \frac{62 - \mu}{2} \right), \quad \mu \ge 60,$
where $\Phi(\cdot)$ denotes the standard normal cdf. Some values of the power function follow.

$\mu$:            60    61    62    63    64    65    66
$\gamma(\mu)$:    0.16  0.31  0.50  0.69  0.84  0.93  0.98

Hence, the significance level of this test is 0.16: there is a 16% chance of rejecting the null when it is true. The power of the test when the true mean is 65 is 93%.

The p-value associated with a realized mean of 63 is
p-value $= P(\bar{X} \ge 63; \mu = 60) = P\!\left( \frac{\bar{X} - 60}{2} \ge \frac{63 - 60}{2} \right) = 1 - \Phi\!\left( \frac{63 - 60}{2} \right) = 0.067,$
so we would not reject at a 5% significance level, but would reject at a 10% level.

Generally, we would like a power of 100% and a size of 0%. This is not usually attainable. In fact, for a fixed sample size, more powerful tests usually have greater chances of generating a type I error.

3. The power function and sample size

The power function can be used to help determine an appropriate sample size.

Ex: Let X have a N($\mu_X$, 36) distribution. We want to use a sample of size n from this distribution to test the hypothesis that $\mu_X = 50$ against the alternative that $\mu_X > 50$. We want the size of this test (the probability of a type I error) to be 0.05 and the power of the test at $\mu_X = 55$ to be 0.90. This requires us to solve simultaneously the following two equations:
$0.05 = P(\bar{X} \ge c; H_0) = P\!\left( \frac{\bar{X} - 50}{6/\sqrt{n}} \ge \frac{c - 50}{6/\sqrt{n}} \right),$
$0.10 = P(\bar{X} < c; \mu_X = 55) = P\!\left( \frac{\bar{X} - 55}{6/\sqrt{n}} < \frac{c - 55}{6/\sqrt{n}} \right).$
Thus, we must solve
$\frac{c - 50}{6/\sqrt{n}} = 1.645$ and $\frac{c - 55}{6/\sqrt{n}} = -1.282,$
which has the approximate solution $c \approx 52.8$ and $n \approx 12.3$; rounding the sample size up to n = 13 guarantees both the size and the power requirements.
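The power-function values in the example of part 2 and this sample-size calculation can both be reproduced with a short Python sketch (SciPy assumed available; the function and variable names are illustrative).

    from scipy import stats

    def power(mu, cutoff=62.0, sd_xbar=2.0):
        # P(Xbar >= cutoff) when Xbar ~ N(mu, 4), i.e. 1 - Phi((cutoff - mu)/2).
        return 1.0 - stats.norm.cdf((cutoff - mu) / sd_xbar)

    for mu in range(60, 67):
        print(mu, round(power(mu), 2))   # reproduces the table in part 2

    # Sample-size calculation from part 3: size 0.05 at mu = 50,
    # power 0.90 at mu = 55, with sigma = 6.
    z_alpha = stats.norm.ppf(0.95)       # about 1.645
    z_beta = stats.norm.ppf(0.90)        # about 1.282
    sigma, mu0, mu1 = 6.0, 50.0, 55.0

    sqrt_n = (z_alpha + z_beta) * sigma / (mu1 - mu0)
    n = sqrt_n ** 2                      # about 12.3; round up to n = 13
    c = mu0 + z_alpha * sigma / sqrt_n   # about 52.8
    print(n, c)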