Introduction to Design and Analysis of Experiments
Professor Daniel Houser
Noteset 3

A. Parametric tests about means

1. We assume we have a random sample of size n from a N(μ_X, σ_X²) distribution, where the variance is known but the mean is unknown. We are interested in testing the null hypothesis that μ_X = μ_0. The alternative hypothesis will typically take one of three forms:
(i) μ_X < μ_0. This would be used if the mean could only have decreased.
(ii) μ_X > μ_0. This would be used if the mean could only have increased.
(iii) μ_X ≠ μ_0. This would be used to test for any change in the mean.

An appropriate test statistic to test the null against any of these alternatives is

   z = (X̄ − μ_0) / (σ_X / √n).

Let z_α correspond to the α·100% critical value from the standard normal distribution. For example, z_0.05 ≈ 1.65, because about 5% of the mass of the standard normal distribution lies to the right of 1.65. The critical regions for the z-test are as follows:

   H0: μ_X = μ_0   H1: μ_X < μ_0   Critical region: z < −z_α
   H0: μ_X = μ_0   H1: μ_X > μ_0   Critical region: z > z_α
   H0: μ_X = μ_0   H1: μ_X ≠ μ_0   Critical region: |z| > z_{α/2}

2. The z-test is only appropriate when the variance of the distribution is known. Assume the variance is unknown. An appropriate statistic in this case to test the null against the possible alternatives described above is

   T = (X̄ − μ_0) / (S / √n),   where S = √S² and S² = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)².

The critical regions are analogous to those of the z-test, with the appropriate t distribution replacing the standard normal. For the two-tailed alternative, for example, the alternative is accepted over the null if

   |T| = |X̄ − μ_0| / (S / √n) > t_{α/2}(n − 1).

Example: Suppose we believe that with probability p a person will play a certain Nash equilibrium, and that this probability is the same for all people. We do not know p, but have reason to believe it is about 25%. To test this hypothesis against the two-tailed alternative we put 100 subjects through an experiment and observe whether each plays the Nash strategy. Hence, we obtain 100 independent draws from a Bernoulli distribution with probability of success p.
(Recall that a Bernoulli random variable takes only two values, zero or one, and has pdf f(x) = p^x (1−p)^(1−x).) From the CLT it follows that, approximately,

   p̂ = (1/100) Σ_{i=1}^{100} x_i ~ N(p, p(1−p)/100).

Suppose that p̂ = 0.20. This provides an estimate of the mean of the normal, and we can estimate the variance by 0.20(0.80)/100 = 0.0016 (this is S²/n). Then we can test the null using a standard z-test (which closely approximates the t-test because the number of observations is large):

   |0.20 − 0.25| / √0.0016 = 1.25 < 1.96 = z_0.025.

So we accept the null at the 5% significance level.

B. Tests about variances and differences in means.

1. A random sample of n observations is taken from a normal distribution with unknown variance. It is desired to test the null hypothesis that the variance is 100 against the two-sided alternative. Under the null, we know that (n−1)S²/100 ~ χ²(n−1). Hence, the test is performed by comparing the realized value of this statistic to the χ²(n−1) distribution. In this case, since the test is two-sided, we reject the null if

   S² > (100/(n−1)) χ²_{α/2}(n−1)   or   S² < (100/(n−1)) χ²_{1−α/2}(n−1).

2. Suppose we have n and m observations from two independent normal distributions, X and Y, and we want to test the hypothesis that their variances are equal against the two-sided alternative that they are different. We know that (n−1)S_X²/σ_X² ~ χ²(n−1) and (m−1)S_Y²/σ_Y² ~ χ²(m−1). Under the null that σ_X² = σ_Y² it follows that S_X²/S_Y² ~ F(n−1, m−1), and from this the null hypothesis can be easily tested in a way analogous to that described above.

3. Suppose we have n and m observations from two independent normal distributions, X and Y, and that we know the variances of the distributions are equal. We are interested in determining whether their means are statistically significantly different. It is appropriate to use the following statistic:

   T = (X̄ − Ȳ) / √{ [((n−1)S_X² + (m−1)S_Y²)/(n+m−2)] · (1/n + 1/m) },

which has a t-distribution with n+m−2 degrees of freedom.

4.
Suppose that in (3.) above the variances are unequal. Assessing whether there are differences in means in this case is hard using standard classical techniques: this is called the Behrens-Fisher problem. The difficulty is that the variances do not drop out of the statistic in a natural way, so ad hoc assumptions must be used to eliminate them. One statistic that people use is

   T = (X̄ − Ȳ) / √(S_X²/n + S_Y²/m),

which has, approximately, a t-distribution with n+m−2 degrees of freedom. This approximation is better when the sample size is large, in which case the t-test can be replaced by a z-test. If the variances are known but different, then S_X² and S_Y² are replaced by the true values σ_X² and σ_Y².

C. Nonparametric methods (Siegel and Castellan – at bookstore.)

1. Advantages of nonparametric statistical tests.
- If the number of observations is small, there is often no alternative to nonparametrics (except making artificial assumptions about the properties of the data-generating process).
- Tests based on ordinal ranks may be easier to implement nonparametrically.
- Nonparametric methods can easily test for location differences between distributions from different families. The Behrens-Fisher problem exemplifies the difficulty that classical, parametric techniques have with this situation.
- Some nonparametric tests have more intuitive appeal than certain parametric tests, which can often look rather ad hoc.

2. Disadvantage of nonparametric statistical tests.
- They are less efficient than parametric tests. If the conditions of the parametric model are met, then parametric tests allow more precise inference.

3. Useful nonparametric tests (for more on these tests and other nonparametric tests see Siegel and Castellan (1988)).

- Chi-square goodness-of-fit test. Used to test whether a sample follows a particular distribution.
   H0: the data follow a distribution with pdf f.
   H1: otherwise.
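As a concrete sketch, the goodness-of-fit statistic v = Σ (O_i − E_i)²/E_i, discussed in what follows, can be computed in Python as below. The function name is illustrative, and scipy is assumed to be available for the chi-square tail probability.

```python
import numpy as np
from scipy.stats import chi2

def chisq_gof(observed, expected):
    """Chi-square goodness of fit: v = sum (O_i - E_i)^2 / E_i, df = k - 1."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    v = ((observed - expected) ** 2 / expected).sum()
    df = len(observed) - 1
    p_value = chi2.sf(v, df)   # upper-tail probability of chi-square(df)
    return v, p_value
```

For instance, chisq_gof([50, 30, 20], [40, 40, 20]) gives v = 5.0 on 2 degrees of freedom, which is not significant at the 5% level.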
This test is particularly valuable in assessing whether an estimated model “fits” the data used to estimate it. The idea is to compare cell frequencies between the two distributions. The statistic is

   v = Σ_{i=1}^{k} (O_i − E_i)² / E_i ~ χ²(k−1),

where
   O_i = observed number of cases in the ith category,
   E_i = expected number of cases in the ith category when the null is true,
   k = the number of categories.

The null hypothesis is rejected at the α significance level if v > χ²_α(k−1).

Example: The data from an experiment include a series of choices that subjects made in a multiple-round game. Aggregate choice frequencies can be described as follows:

   Choice \ Period    1      2      3      4
   A                 25%    25%    50%     0%
   B                 10%    40%    40%    60%
   C                 50%    25%     0%    20%
   D                 15%    10%    10%    20%

The experimenter has additional information about the subjects, such as their history of play and other observable, individual-specific data, summarized by a vector X_i for each subject i. The researcher is interested in whether a particular parametric model G(X_i | θ), which depends on a finite parameter vector θ, can adequately explain choices. This can be answered by:
(i) estimating the model, giving a point estimate θ̂ of the parameter vector;
(ii) simulating the model under the estimated parameter vector;
(iii) calculating the 16 cell frequencies for the simulated data;
(iv) calculating the statistic v and comparing it to a χ²(15) distribution.

This test is particularly useful when other tests of “fit” are hard to compute. The statistic v is only asymptotically χ². When the amount of data is small, say fewer than five observations per category, one should attempt to recategorize to increase frequencies within each cell. When there are a small number of cells and few observations in each cell, the test may be inaccurate.

- Permutation test for paired replicates. A powerful test for treatment effects when a pair of similar experimental units is observed in each condition. The null hypothesis is that any observed differences are not due to treatments.
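A minimal Python sketch of this paired permutation test, applied to the B−A wear differences from the Boys’ Shoes example that follows (scaled to tenths of a unit so the arithmetic is exact; the function name is illustrative):

```python
from itertools import product

def paired_permutation_p(diffs):
    """Exact one-sided sign-flip permutation test for paired differences.

    Under the null of no treatment effect, each difference is equally likely
    to be +|d| or -|d|, so all 2^n sign assignments are equally likely.
    Returns the fraction of assignments whose total (equivalently, mean) is
    at least the observed total.
    """
    n = len(diffs)
    observed_total = sum(diffs)
    count = sum(
        1
        for signs in product((1, -1), repeat=n)
        if sum(s * abs(d) for s, d in zip(signs, diffs)) >= observed_total
    )
    return count / 2 ** n

# B - A differences in tenths: 0.8, 0.6, 0.3, -0.1, 1.1, -0.2, 0.3, 0.5, 0.5, 0.3
p = paired_permutation_p([8, 6, 3, -1, 11, -2, 3, 5, 5, 3])   # 7/1024, about 0.7%
```

This exact enumeration reproduces the 7/1024 significance level computed in the example below.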
Example: An experiment on boys’ shoes.

Ten boys wore shoes made of different materials, A and B, on their left and right feet. Whether material A was on the right foot or the left foot was randomized for each boy by flipping a coin. Any differences due to individual boys should be apparent in both shoes. Data were taken on the wear of the sole of each shoe, giving the following:

   boy   material A   material B   B − A
    1     13.2 (L)      14.0        0.8
    2      8.2 (L)       8.8        0.6
    3     10.9 (R)      11.2        0.3
    4     14.3 (L)      14.2       −0.1
    5     10.7 (R)      11.8        1.1
    6      6.6 (L)       6.4       −0.2
    7      9.5 (L)       9.8        0.3
    8     10.8 (L)      11.3        0.5
    9      8.8 (R)       9.3        0.5
   10     13.3 (L)      13.6        0.3

   Mean difference: 0.41

Under the null hypothesis that there is no treatment effect – that A and B give equal protection against wear – all that is affected by the outcome of the coin toss is the sign of the difference B − A. Hence, under the null there are 2^10 = 1024 equally likely realizations of the mean difference, and these define the sampling distribution of the mean difference under the null. The significance of the observed mean is determined by comparing it to the 1023 other possible outcomes. In this case, only 3 of the 1023 other means are greater than 0.41, and there are four cases in which the mean is identical to 0.41. Conservatively, then, the significance level is 7/1024, or about 0.7%. Hence we would reject the null.

This test uses all of the information in the sample, and is among the most powerful of all statistical tests. It can be cumbersome to compute if the number of observations is large. An alternative that is easier to compute, but throws out some information and is therefore somewhat less powerful, is the Wilcoxon signed-ranks test, which is just the permutation test based on ranks instead of actual values.

- The median test. Tests whether two independent samples have the same median. It is a “robust” test, in the sense that it does not make strong assumptions about the relationship between the underlying distributions of the two samples (they may have different dispersions, for example).
The median test is the natural test to use when data are truncated. The procedure is to derive the combined-sample median and then form the following table from the two samples:

                                          Data set I   Data set II
   No. of scores above combined median        A             B
   No. of scores below combined median        C             D
   Observations                               m             n

Let N denote the total number of observations, N = m + n. Under the null hypothesis that the medians are the same, the approximate sampling distribution of the statistic

   v = N (|AD − BC| − N/2)² / [ (A+B)(C+D)(A+C)(B+D) ]

is χ²(1). The approximation is better when sample sizes are larger.

- The Wilcoxon-Mann-Whitney test is more powerful than the median test, but it requires that the distributions underlying the two populations differ only in location. In particular, it requires that their variances are the same.

- The Jonckheere test for ordered alternatives. Suppose one has a sample from each of k independent populations. The Jonckheere procedure may be used to test
   H0: the population distributions are identical
against
   H1: the populations have different medians θ_i, ordered θ_1 ≤ θ_2 ≤ ... ≤ θ_k, where at least one of the inequalities is strict.

Note: the ordering of the variables must be specified before the data are collected. To run the test one first arranges the data in a table as follows:

   Data set 1          Data set 2              ...   Data set k
   (lowest median)     (2nd lowest median)           (highest median)
   x(1,1)              x(1,2)                        x(1,k)
   x(2,1)              x(2,2)                        x(2,k)
   ...                 ...                           ...
   x(n,1)              x(m,2)                        x(m,k)

The columns are arranged from the smallest observation to the largest. The test statistic J* is formed using the following three-step procedure:
(i) For each entry x(i,j) in each of the first k−1 columns, determine the number of entries in all of the higher columns that are greater than x(i,j). Call this number N(i,j). Note that there will be an N(i,j) corresponding to each entry in the table except those in the last column.
(ii) Define J as the sum of the N(i,j).
(iii) It can be shown that the sampling distribution of J under the null that the distributions are identical has mean and variance

   μ_J = ( N² − Σ_{j=1}^{k} n_j² ) / 4,

   σ_J² = (1/72) [ N²(2N + 3) − Σ_{j=1}^{k} n_j²(2n_j + 3) ].

In large samples, the statistic J* = (J − μ_J)/σ_J can be compared to the standard normal cdf to compute approximate p-values for a test of the null hypothesis. Rejection of the null implies that at least one median is statistically greater in magnitude than one that precedes it, but it does not tell us which one.

D. Comments on the bootstrap

The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates. There are parametric, nonparametric, classical and Bayesian versions of the bootstrap. The original motivation for the bootstrap was to provide a method for assigning standard errors to estimators for which no closed-form solution exists. Intuitively, the bootstrap treats the sample as though it were the population, and “resamples” the original sample repeatedly to generate an approximation to the sampling distribution of any statistic that one might find useful. The properties of the estimator are known only in large samples, although evidence suggests it works well even when the sample size is very small.

Example: Bootstrapping the standard error of the mean.
- Suppose one has a random sample of 25 observations, x_i, i = 1, ..., 25, from an unknown population.
- Definition: A bootstrap sample x* is obtained by randomly sampling 25 times with replacement from {x_i}, i = 1, ..., 25.
- By repeatedly resampling from {x_i} one obtains a large number of bootstrap samples x*(1), x*(2), ..., x*(B). Around 200 bootstrap samples is usually enough.
- Corresponding to each bootstrap sample is its mean x̄*(b), b = 1, ..., B.
- An estimate of the standard deviation of the mean’s sampling distribution is the standard deviation of the bootstrapped means:

   σ_boot = [ Σ_{b=1}^{B} ( x̄*(b) − x̄* )² / (B − 1) ]^(1/2),   where x̄* = (1/B) Σ_{b=1}^{B} x̄*(b).
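The resampling recipe above can be sketched in Python as follows (numpy assumed available; the function name and seed are illustrative):

```python
import numpy as np

def bootstrap_se(x, stat=np.mean, B=200, seed=0):
    """Bootstrap estimate of the standard error of `stat`.

    Draws B resamples of the same size as x, with replacement, and returns
    the standard deviation of the statistic across the resamples.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    reps = np.array([stat(rng.choice(x, size=len(x), replace=True))
                     for _ in range(B)])
    return reps.std(ddof=1)   # std. dev. of the bootstrapped statistics
```

For the mean of 25 i.i.d. draws, this estimate should land close to the classical standard error S/√25.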
Note that one can replace the mean with any statistic, s(x), and follow the same procedure to generate a measure of accuracy of that statistic’s value. In the case of the mean of a vector of independent draws from a single distribution, one would not usually want to bootstrap the standard error: the CLT assures normality, and the standard errors under that distributional assumption are tighter than those given by the bootstrap.

Example 2: Bootstrapping the permutation test.

We observe two independent random samples from possibly different pdf’s F and G:

   F → z_1, ..., z_n
   G → y_1, ..., y_m.

We are interested in testing the null hypothesis H0: F = G. This is the standard two-sample problem we have discussed above, where we are interested in determining whether there is evidence of a treatment effect. If we cannot reject the null, there is little evidence of any effect. Note that the null is very strong: it requires that there is no difference in the stochastic behavior of z and y. We have discussed how to test for different means, assuming that both distributions are normal and have the same variance. The bootstrap procedure is as follows.

(a) If F = G, then the m+n observations came from the same distribution, and the way they were classified as “z” or “y” was one of (m+n)!/(m! n!) equally likely outcomes (this result is called the “Permutation Lemma”).
(b) We can resample from the pooled distribution repeatedly to form bootstrapped samples of the z and y vectors.
(c) We calculate the difference between the means of each pair of bootstrap samples, and then order (from high to low) this set of differences.
(d) We determine where the realized difference in means occurs within the bootstrapped order. For example, if we have 200 bootstrapped differences in means, then if the original difference between the means of the z and y samples is greater than 190 of those 200, we would say that the difference is significant at the 5% level (assuming the test is two-sided).
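Steps (a)–(d) can be sketched in Python as follows (numpy assumed available; here the two-sided p-value is computed directly as the fraction of reshuffled mean differences at least as large in absolute value as the observed one):

```python
import numpy as np

def permutation_pvalue(z, y, B=1000, seed=0):
    """Two-sided permutation test of H0: F = G via the difference in means.

    Repeatedly reshuffles the pooled sample into groups of sizes n and m,
    approximating the permutation distribution of the mean difference.
    """
    rng = np.random.default_rng(seed)
    z, y = np.asarray(z, dtype=float), np.asarray(y, dtype=float)
    pooled = np.concatenate([z, y])
    n = len(z)
    observed = abs(z.mean() - y.mean())
    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)           # one relabeling of the pool
        diff = perm[:n].mean() - perm[n:].mean()
        if abs(diff) >= observed:
            count += 1
    return count / B
```

If the two samples are identical the observed difference is zero and every reshuffle ties it, so the p-value is 1; widely separated samples give a p-value near zero.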
To summarize the calculation of the two-sample permutation test statistic:

(1) Choose B independent vectors g*(1), g*(2), ..., g*(B), each consisting of n z’s and m y’s, and each randomly selected from the set of all C(m+n, n) possible such vectors. Usually you will want to set B at least equal to 1000.
(2) Evaluate the desired statistic, θ̂*(b), for every bootstrapped vector.
(3) The achieved significance level of the original sample’s test statistic θ̂ is given by (1/B) Σ_b 1(θ̂*(b) ≥ θ̂), where 1(s) = 1 if s is true, and zero otherwise.

Bernoulli, Binomial, Poisson and Normal distributions

Bernoulli distribution

A random variable X follows a Bernoulli distribution if it can take only two values, say zero and one, and:
   Pr(X = 1) = p (between zero and one)
   Pr(X = 0) = 1 − p (also between zero and one)

Hence, the pdf of a Bernoulli random variable is f(x) = p^x (1−p)^(1−x), x = 0, 1. The mean and variance of a Bernoulli random variable are p and p(1−p), respectively.

Binomial distribution

Suppose an experimenter conducts a sequence of N Bernoulli trials with probability of a “1” equal to p > 0, and let Y be the random variable that indicates the number of times “1” occurred over the N trials. Then Y is said to follow a Binomial distribution with parameters N and p, and has pdf

   Pr(y | N, p) = C(N, y) p^y (1−p)^(N−y)   if y = 0, 1, ..., N, and 0 otherwise.

Here the notation C(N, y) denotes the number of ways y distinct elements can be selected from a set of N elements, divided by the number of distinct arrangements of y distinct elements. Hence,

   C(N, y) = N! / ( (N−y)! y! ).

The mean and variance of the Binomial distribution are Np and Np(1−p), respectively. The normal distribution with the same mean and variance provides a good approximation to the Binomial distribution when N > 5 and

   (1/√N) | √((1−p)/p) − √(p/(1−p)) | < 0.3.
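The quality of this normal approximation can be checked numerically. The sketch below (illustrative parameters n = 20, p = 0.4) compares the exact Binomial upper tail with a normal approximation that uses the same mean and variance and a 0.5 continuity correction:

```python
import math

def binom_sf(k, n, p):
    """Exact upper tail P(Y >= k) for Y ~ Binomial(n, p)."""
    return sum(math.comb(n, y) * p**y * (1 - p)**(n - y)
               for y in range(k, n + 1))

def normal_sf(x):
    """Upper tail of the standard normal, P(Z >= x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

n, p, k = 20, 0.4, 12   # illustrative values satisfying N > 5
exact = binom_sf(k, n, p)
# normal with mean Np, variance Np(1-p), 0.5 continuity correction
approx = normal_sf((k - 0.5 - n * p) / math.sqrt(n * p * (1 - p)))
```

For these values the two tail probabilities agree to within a few thousandths.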
When the normal approximation is valid, one can test hypotheses about the mean of the Binomial distribution by forming the z statistic

   z = (y_0 − Np) / √(Np(1−p)),

where y_0 is the actual number of 1’s observed, and p is the hypothesized success probability. It turns out that the approximation is somewhat better if (y_0 − 0.5) is used to calculate the z statistic, in place of y_0, when y_0 exceeds Np (this is called the Yates adjustment).

Suppose one has observations from N subjects in two treatment conditions. In each treatment condition they make a series of yes/no decisions, each of which is either correct or incorrect. Let C_i(A) denote the number of correct answers provided by subject i in treatment condition A, and similarly C_i(B). Then it is reasonable to model C_i(A) ~ Bi(p, N). The researcher hopes to determine whether the fraction of correct responses varies with the treatment.

Because we have a “within” design (this means that we have observations on each subject in both conditions), a natural approach would be to use the permutation test, as in the boys’ shoes example. But suppose we wanted to use a paired t-test. To do this we would form a set of N differences (correct in treatment A minus correct in treatment B for each subject) and test whether the mean of these differences is zero. Recall that to use this test, we must be able to assume that the data from each subject have about the same variance. This assumption is likely violated if there are large differences across subjects in the fraction of correct answers. The reason is that the variance of a Binomial rv is Np(1−p), so that if different subjects have different values of p, they will also have different variances.

How to circumvent this problem? Transformation of variables. In many cases it is useful to transform the outcome variable of interest. For example, the log of a variable whose distribution is skewed right is often more nearly normally distributed than the variable itself.
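As a quick illustration of the log transform just mentioned, here is a sketch with simulated data (lognormal draws, so log(x) is exactly normal; the skewness helper is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # strongly right-skewed

def skewness(v):
    """Sample skewness: third central moment over the cubed std. dev."""
    v = np.asarray(v, dtype=float)
    return ((v - v.mean()) ** 3).mean() / v.std() ** 3

raw_skew = skewness(x)           # large and positive for right-skewed data
log_skew = skewness(np.log(x))   # near zero: log(x) is N(0, 1) here
```

The raw sample is heavily right-skewed, while the log-transformed sample has skewness near zero.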
In the present case, it turns out that the transformation x̂_i = arcsin(√p̂_i) is particularly useful, where p̂_i is the fraction of correct answers given by subject i. It turns out that x̂_i (called a “score”) is a rv with variance that does not depend on p. Hence, by transforming percentages to this score, one can use the usual paired t-test with greater confidence that the results will be correct. In particular, the test will be more sensitive to treatment differences after this correction has been applied.

Sometimes one is interested in knowing whether, within a given session and treatment, success probabilities vary (e.g., testing for learning effects).

Example: A subject makes a sequence of 20 yes/no decisions over a series of 10 identical blocks (a total of 200 decisions). The researcher wants to determine whether there is evidence of learning. A first test of this might be to ask whether success probabilities seem to remain fixed over the 10 blocks. The null hypothesis is that there is no learning (success probabilities are constant across blocks) and the alternative is otherwise. The data are as follows.

   Block:            1   2   3   4   5   6   7   8   9  10
   Number correct:   4   3   2   5   7   8   6   7  12  10

Under the null that the success probability is constant, we know that the number correct in each block follows a Bi(N, p) = Bi(20, p) distribution. By pooling the data, one can easily estimate p̂ = 64/200 = 0.32. Then, if C_i represents the number correct in block i, we have that

   z_i = (C_i − 20·0.32) / √(20·0.32·(1 − 0.32)) ~ N(0, 1),

at least approximately. It follows that Σ_{i=1}^{10} z_i² ~ χ²(9). In this example, it turns out that doing the summation gives a result of 19.85, which is large relative to what one would expect from a random draw from a chi-square(9) distribution (it is significant at the 2.5% level). This is evidence against the null hypothesis that the success rate is constant across blocks, and is therefore evidence in favor of learning.

Note that the test above looks similar to the chi-square test discussed earlier.
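The block calculation above can be reproduced in a few lines of Python (a sketch; pooling and the Binomial variance under the null are as described):

```python
correct = [4, 3, 2, 5, 7, 8, 6, 7, 12, 10]   # number correct per block
N = 20                                        # decisions per block
p_hat = sum(correct) / (N * len(correct))     # pooled estimate: 64/200 = 0.32
var = N * p_hat * (1 - p_hat)                 # Binomial variance under H0
stat = sum((c - N * p_hat) ** 2 for c in correct) / var
# stat is about 19.85; compare with the chi-square(9) distribution
```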
In the above case, the general form of the test is

   Σ_{i=1}^{K} (C_i − Np)² / ( Np(1−p) ) ~ χ²(K−1).

The form of the chi-square test discussed earlier, when applied to the above example, is

   Σ_{i=1}^{K} (C_i − Np)² / (Np) ~ χ²(K−1).

Note that these two tests are very similar whenever p is very small, because then (1−p) is near unity.

Poisson distribution

Let λ = Np. If p → 0 and N → ∞ while λ = Np stays constant, then the Binomial distribution becomes the Poisson distribution. The pdf of the Poisson distribution is

   Pr(y) = e^(−λ) λ^y / y!,

where y is the number of successes, or occurrences of an event. Note that the probability depends only on the expected frequency λ, and not on the number of events or the probability of success per event. Note also, by examining the limit of the Binomial distribution, that the mean and variance of a Poisson distribution are the same (both equal λ).

The sum of independent Poisson random variables is again a Poisson random variable, with mean equal to the sum of the means of the underlying random variables. That is, if y_1, y_2, ..., y_K are all distributed according to Poisson distributions with means λ_1, ..., λ_K, then Σ_{i=1}^{K} y_i is also a Poisson random variable, with mean equal to Σ_{i=1}^{K} λ_i.

Example: A researcher is studying the effect of information events on behavior in an asset-market experiment. She wants information events to occur randomly during the experiment, but also wants to ensure that 90% of two-minute blocks contain at least one information event. Assuming that an information event takes only one second to occur, how many information events should occur, on average, during each block?

There are 120 seconds in each two-minute block, but the probability of a random information event occurring during any particular second is small. Hence, the Poisson approximation is appropriate (large N, small p). In this case, the probability of not getting an information event during a two-minute block is supposed to be 0.1. Hence, 1 − e^(−λ) = 0.9, so λ = −ln(0.1) ≈ 2.3.
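The arithmetic can be confirmed in a couple of lines of Python:

```python
import math

lam = -math.log(0.1)      # solves 1 - exp(-lam) = 0.9, about 2.3
p_none = math.exp(-lam)   # probability a block contains no event: 0.1
```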
Thus the researcher should ensure that there are, on average, 2.3 information events per two-minute block. If there are 45 two-minute blocks in the experiment, this means that the researcher should allocate 45 × 2.3 ≈ 104 information events randomly across the timeline of the experiment.

Relationship between the Poisson and the chi-square test

When λ is not too small (say, greater than 5), the Poisson distribution can be approximated well by a normal distribution with the same mean and variance (this makes sense, given that the normal well approximates a Binomial distribution). Hence,

   (y_i − λ) / √λ ~ N(0, 1),

at least approximately. Therefore,

   Σ_{i=1}^{K} (y_i − λ)² / λ ~ χ²(K),

and, when λ is estimated from the data,

   Σ_{i=1}^{K} (y_i − λ̂)² / λ̂ ~ χ²(K−1).

Because λ is the expected frequency, this is a type of chi-square test (within the context of contingency tables) as discussed above, where the expected frequency within each cell is the same.
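Under the common-λ null, this chi-square calculation looks as follows in Python (the counts are hypothetical, for illustration only):

```python
counts = [3, 1, 4, 2, 2, 5, 1, 3, 2, 2]   # hypothetical event counts, one per cell
lam_hat = sum(counts) / len(counts)        # common expected frequency: 2.5
stat = sum((y - lam_hat) ** 2 for y in counts) / lam_hat
# compare stat to the chi-square distribution with K - 1 = 9 degrees of freedom
```

Here the statistic is 5.8 on 9 degrees of freedom, so there is no evidence against a constant event rate across cells.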