Stat 921 Notes 3

Reading: Observational Studies, Chapters 2.4-2.5.

I. General Approach to Randomization Inference for Testing Hypothesis of No Treatment Effect and Examples

Consider the null hypothesis of no treatment effect: $H_0: r_{Ti} = r_{Ci}$ for all $i$.

Notation: Let $r_z$ denote the vector of potential responses for randomization assignment $z$. Let $Z$ denote the observed randomization assignment. Let $r = r_Z$ denote the vector of observed responses. Under $H_0: r_{Ti} = r_{Ci}$ for all $i$, $r_z$ is the same for all $z$ and hence $r_z = r$.

Test: Consider a test statistic $t(Z, r)$. We will reject for large values of $t(Z, r)$.

We want to compute a p-value for the test. To do this,
(i) The null hypothesis is tentatively assumed to hold, so $r$ is fixed.
(ii) The treatment assignment $Z$ is assumed to have been selected from the set $\Omega$ of possible treatment assignments using a known random mechanism.
(iii) The observed value, say $T$, of the test statistic $t(Z, r)$ is calculated.
(iv) We seek the probability of a value of the test statistic $t(Z, r)$ as large as or larger than that observed if the null hypothesis were true.

The p-value is simply the sum of the randomization probabilities of the assignments $z$ that lead to values of $t(z, r)$ greater than or equal to the observed value $T$, namely

$$\Pr_{H_0}\{t(Z, r) \ge T\} = \sum_{z \in \Omega} I[t(z, r) \ge T] \, \Pr(Z = z).$$

In the case of a uniform randomized experiment, since $\Pr(Z = z) = 1/K = 1/|\Omega|$, the p-value is simply

$$\Pr_{H_0}\{t(Z, r) \ge T\} = \frac{|\{z \in \Omega : t(z, r) \ge T\}|}{K}.$$

Common Randomization Inferences:

1. Fisher's exact test for binary outcomes.

Example 1: Perry Preschool Project. In a 1962 social experiment, 123 children aged 3 and 4 from poverty-level families in Ypsilanti, Michigan were randomly assigned either to a treatment group receiving 2 years of preschool instruction or to a control group receiving no preschool. The participants were followed into their adult years.
The following table shows how many in each group were arrested for some crime by the time they were 19 years old:

                 Arrested for some crime?
                 No (1)    Yes (0)    Total
    Preschool      42         19        61
    Control        30         32        62
    Total          72         51       123

Let the outcome be 1 if the person was not arrested for any crime and 0 if the person was arrested, as coded in the table. The test statistic for Fisher's test is the number of treated units with the outcome of 1, $t(Z, r) = Z^T r$; here $t(Z, r)$ is the number of children assigned to the preschool group who were not arrested for some crime, which is 42.

Note that under the null hypothesis of no effect, the margins of this table are fixed. Thus, under the null hypothesis, $t(Z, r)$ has a hypergeometric distribution. The p-value for the test is

    > 1-phyper(41,72,51,61)
    [1] 0.01673607

There is moderate evidence (p=0.017) that the preschool program has an effect.

Note that we have constructed a one-sided test that rejects when there are a large number of treated units with an outcome of 1. If we wanted a two-sided test, we could use the test statistic $t(Z, r) = |Z^T r - E_{H_0}[Z^T r]|$.

2. Mantel-Haenszel test

Analogue of Fisher's exact test when there are two or more strata and the outcome is binary. The test statistic is again the number of outcomes equal to 1 among the treated subjects.

Special case: Matched pairs. The experiment randomly assigns one member of each pair to the treatment. For this special case, the Mantel-Haenszel test is called McNemar's test.

Example 2:

    Pair   Treated   Control
      1       1         1
      2       0         0
      3       1         0
      4       1         0
      5       0         0
      6       0         1
      7       1         0
      8       1         1
      9       1         0
     10       1         1

Pairs 1, 2, 5, 8 and 10 are concordant, meaning that the two outcomes in the pair are the same, and pairs 3, 4, 6, 7 and 9 are discordant, meaning that the outcomes are different.

Under the null hypothesis, $t(Z, r) = Z^T r$ is distributed as the number of concordant pairs in which both outcomes are one, plus a binomial random variable whose number of trials is the number of discordant pairs and whose probability of success is 0.5.
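The distributional claim above can be checked by brute force for Example 2, since there are only $2^{10} = 1024$ equally likely within-pair assignments. The following is a minimal sketch, not part of the original notes; the variable names (`flips`, `t_null`, `claimed`, and so on) are ours.

```r
# Outcomes from Example 2, one entry per pair.
treated <- c(1, 0, 1, 1, 0, 0, 1, 1, 1, 1)
control <- c(1, 0, 0, 0, 0, 1, 0, 1, 0, 1)

# Under H0 both responses in a pair are fixed; only the treatment label is random.
# flips[k, s] = 1 means that, in assignment k, the unit observed as control
# in pair s is the one that receives treatment.
flips <- as.matrix(expand.grid(rep(list(0:1), length(treated))))

# Value of t(z, r) = z^T r for each of the 2^10 assignments.
t_null <- as.vector((1 - flips) %*% treated + flips %*% control)
null_dist <- table(t_null) / nrow(flips)

# Claimed null distribution: (# pairs with both outcomes 1) + Binomial(# discordant, 0.5).
n_both_one   <- sum(treated == 1 & control == 1)
n_discordant <- sum(treated != control)
claimed <- dbinom(0:n_discordant, n_discordant, 0.5)
names(claimed) <- n_both_one + 0:n_discordant

# The two rows should agree column by column.
print(rbind(enumerated = as.numeric(null_dist), claimed = claimed))
```

Enumeration is only feasible here because the number of pairs is small; the binomial representation is what makes the test practical in general.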
The p-value for McNemar's test can be found as the p-value for testing that a binomial has p=0.5, with the number of trials equal to the number of discordant pairs and the number of successes equal to the number of discordant pairs in which the treated unit has a 1. For Example 2, 4 out of the 5 discordant pairs have the treated unit with a 1. The p-value is

    > 1-pbinom(3,5,.5)
    [1] 0.1875

There is not strong evidence that the treatment has an effect.

We now consider some tests for outcomes taking more than two numerical values:

3. Difference in sample means or t-test. We studied this in Notes 2.

4. Wilcoxon's rank sum test for experiments with only one stratum.

Rank the observed responses from smallest to largest. If all $N$ responses were different numbers, the ranks would be the numbers $1, 2, \ldots, N$. If some of the responses were equal, then the average of their ranks would be used. Write $q_i$ for the rank of the observed response $r_i$ of the $i$th unit and write $q = (q_1, \ldots, q_N)^T$. Note that the ranks $q$ are fixed under the null hypothesis of no treatment effect. The rank sum statistic is the sum of the ranks of the treated observations, $t(Z, r) = Z^T q$.

Example: Job training experiment from Notes 2

    nswdata=read.table("nswdata.txt",header=TRUE);
    treated.r.jobtrain=nswdata$earnings78[nswdata$treatment==1];
    control.r.jobtrain=nswdata$earnings78[nswdata$treatment==0];
    # alternative="greater" in the Wilcoxon test rejects for large
    # values of the Wilcoxon rank sum statistic
    wilcox.test(treated.r.jobtrain,control.r.jobtrain,alternative="greater")

            Wilcoxon rank sum test with continuity correction

    data:  treated.r.jobtrain and control.r.jobtrain
    W = 68209.5, p-value = 0.03096
    alternative hypothesis: true location shift is greater than 0

The p-value from the Wilcoxon rank sum test, 0.03096, is very close to the p-value we got from the difference in means test statistic, 0.031, in Notes 2.

The Wilcoxon rank sum test has almost as much power as the difference in means test when the data are normally distributed but is much more robust to outliers.
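One point that can cause confusion when reading the output above: the `W` reported by `wilcox.test` is not the rank sum $Z^T q$ itself but the rank sum minus $m(m+1)/2$, where $m$ is the number of treated units. The two differ by a constant under the randomization distribution, so they give the same p-value. A minimal sketch on a small hypothetical data set (the numbers below are made up for illustration and are not the job training data):

```r
# Hypothetical responses, chosen without ties so the ranks are 1, ..., N.
treated <- c(5.1, 7.3, 9.8, 12.4)
control <- c(3.2, 4.7, 6.0, 8.5, 2.9)

r <- c(treated, control)                                   # observed responses
z <- c(rep(1, length(treated)), rep(0, length(control)))   # treatment indicators

q <- rank(r)              # ranks of all N responses (average ranks if tied)
rank_sum <- sum(z * q)    # sum statistic Z^T q
m <- sum(z)               # number of treated units

# Statistic reported by R for the same comparison.
W <- unname(wilcox.test(treated, control, alternative = "greater")$statistic)

print(c(rank_sum = rank_sum, W = W, shift = m * (m + 1) / 2))
```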
Reference: Lehmann (1975, Nonparametrics: Statistical Methods Based on Ranks).

5. Wilcoxon's signed rank statistic for matched pair experiments.

For each pair, compute the absolute difference in the responses between the treated and control units. Rank these absolute differences. Let $d_s$ be the rank of the absolute difference of the $s$th pair. The signed rank statistic is the sum of the ranks $d_s$ over the pairs in which the treated unit had a higher response than the control unit.

6. Stratified rank sum statistic

Compute the Wilcoxon rank sum statistic separately within each stratum and sum these statistics over the strata.

7. Aligned rank statistic

Hodges and Lehmann (1962, Annals of Mathematical Statistics) find the stratified rank sum statistic to be inefficient when the number of strata $S$ is large compared to $N$. They suggest as an alternative the method of aligned ranks: the mean in each stratum is subtracted from the responses in that stratum, creating aligned responses that are ranked from 1 to $N$, momentarily ignoring the strata. Writing $q$ for these aligned ranks, the aligned rank statistic is the sum of the aligned ranks in the treated group, $t(Z, r) = Z^T q$.

8. Median test: The number of treated responses that exceed the median of the responses in their stratum.

II. Classes of Test Statistics (Section 2.4.4)

Computing the exact p-value for a test statistic becomes computationally difficult for even moderately sized experiments. One approach is to use the Monte Carlo method to approximate the p-value as we did in Notes 2. Another approach is to use a large sample or asymptotic approximation based on the mean and variance of the test statistic. For test statistics that are based on the ranks of the data, these large sample approximations are quite accurate even for relatively small experiments.

A general class of test statistics for which we can derive the moments are sum statistics, which are test statistics of the form $t(Z, r) = Z^T q$ where $q$ is some function of $r$.
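To make the computational point concrete, here is a minimal sketch comparing the exact randomization p-value with a Monte Carlo approximation for one sum statistic, taking $q$ to be the ranks of $r$ in a single stratum. The data are hypothetical, and the seed, the number of draws and all variable names are our own choices rather than anything from Notes 2.

```r
set.seed(921)  # arbitrary seed, only to make the Monte Carlo draw reproducible

# Hypothetical single-stratum experiment: N = 9 units, m = 4 treated.
r     <- c(5.1, 7.3, 9.8, 12.4, 3.2, 4.7, 6.0, 8.5, 2.9)
z_obs <- c(1, 1, 1, 1, 0, 0, 0, 0, 0)

q <- rank(r)               # q is a function of r only, so it is fixed under H0
t_obs <- sum(z_obs * q)    # observed value T of the sum statistic Z^T q
N <- length(r)
m <- sum(z_obs)

# Exact p-value: enumerate all choose(N, m) equally likely assignments.
t_all <- combn(N, m, function(idx) sum(q[idx]))
p_exact <- mean(t_all >= t_obs)

# Monte Carlo p-value: sample assignments uniformly at random instead.
n_draws <- 20000
t_mc <- replicate(n_draws, sum(q[sample(N, m)]))
p_mc <- mean(t_mc >= t_obs)

print(c(exact = p_exact, monte_carlo = p_mc))
```

Here enumeration is cheap because `choose(9, 4)` is only 126; the Monte Carlo version scales to experiments where enumeration is impossible, at the cost of simulation error.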
All of the statistics we considered above are sum statistics for suitable choices of $q$. In Fisher's exact test, the Mantel-Haenszel test and McNemar's test, $q$ is simply $r$. In the rank sum test, $q$ is the vector of ranks of $r$. In the median test, $q$ is the vector of ones and zeros identifying responses that exceed stratum medians.

Proposition 2: In a uniform randomized experiment in which stratum $s$ contains $n_s$ units, $m_s$ of which are treated, if the treatment has no effect, the expectation and variance of a sum statistic $t(Z, r) = Z^T q$ are

$$E(Z^T q) = \sum_{s=1}^{S} m_s \bar{q}_s,$$

$$\mathrm{Var}(Z^T q) = \sum_{s=1}^{S} \frac{m_s (n_s - m_s)}{n_s (n_s - 1)} \sum_{i=1}^{n_s} (q_{si} - \bar{q}_s)^2,$$

where

$$\bar{q}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} q_{si}.$$

Proof: Each unit in stratum $s$ is treated with probability $m_s / n_s$, so

$$E(Z^T q) = \sum_{s=1}^{S} E(Z_s^T q_s) = \sum_{s=1}^{S} \sum_{i=1}^{n_s} \frac{m_s}{n_s} q_{si} = \sum_{s=1}^{S} m_s \bar{q}_s.$$

The proof for the variance uses properties of simple random sampling without replacement. See Problem 4 at the end of Chapter 2 in the book.

Using Proposition 2, we can approximate the p-value using the Central Limit Theorem by

$$\Pr_{H_0}[t(Z, r) \ge T] \approx 1 - \Phi\left( \frac{T - E_{H_0}[t(Z, r)]}{\sqrt{\mathrm{Var}_{H_0}(t(Z, r))}} \right),$$

where $\Phi$ is the standard normal cumulative distribution function.

Effect increasing statistics:

Many statistics that measure the size of the difference between treated and control groups would tend to increase in value if responses in the treated group were increased and those in the control group were decreased. Statistics with this property will be called effect increasing.

To express the idea formally, note that a treated unit $si$ has $2Z_{si} - 1 = 1$ and a control unit has $2Z_{si} - 1 = -1$. Let $z$ be a possible treatment assignment and let $r$ and $r^*$ be two possible values of the responses such that $(r^*_{si} - r_{si})(2z_{si} - 1) \ge 0$ for all $s, i$. With treatments given by $z$, this says that $r^*_{si} \ge r_{si}$ for every treated unit and $r^*_{si} \le r_{si}$ for every control unit. In words, if higher responses indicate favorable outcomes, then every treated unit does at least as well with $r^*$ as with $r$, and every control unit does no better with $r^*$ than with $r$. That is, the difference between the treated and control groups looks larger with $r^*$ than with $r$.
The test statistic $t(Z, r)$ is effect increasing if $t(z, r) \le t(z, r^*)$ whenever $r$ and $r^*$ are two possible values of the responses such that $(r^*_{si} - r_{si})(2z_{si} - 1) \ge 0$ for all $s, i$. All of the test statistics we considered in Section I of these notes are effect increasing.

Table 1 below contains a small, hypothetical example to illustrate the idea of an effect increasing statistic. Here there is a single stratum and four subjects, 2 of whom receive treatment. Notice that when $r_i$ and $r_i^*$ are compared, treated subjects have $r_i^* \ge r_i$ while control subjects have $r_i^* \le r_i$. If the responses are ranked 1, 2, 3 and 4, and the ranks in the treated group are summed to give Wilcoxon's rank sum statistic, then the rank sum is larger for $r_i^*$ than for $r_i$.

Table 1

    i    Group      z_i   2z_i - 1   r_i   r_i*
    1    Treated     1       1        5     6
    2    Treated     1       1        2     4
    3    Control     0      -1        3     2
    4    Control     0      -1        1     1
    Rank sum                          6     7

III. Models for Treatment Effects (Chapter 2.5)

If the treatment has an effect, then the vector of potential responses $r_z$ varies with the randomization assignment $z$. Let $Z$ denote the observed randomization assignment. Let $R = r_Z$ denote the vector of observed responses.

In principle, each possible treatment assignment $z$ might yield a pattern of responses $r_z$ that is unrelated to the pattern observed with another $z$. For instance, in a completely randomized experiment with 50 subjects divided into two groups of 25, there might be $\binom{50}{25} \approx 1.3 \times 10^{14}$ different and unrelated $r_z$'s. Since it is difficult to comprehend a treatment effect in such terms, we look for regularities, patterns or models of the behavior of $r_z$ as $z$ varies.
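The count of possible assignments quoted above is quick to reproduce in R (the variable name `n_assign` is ours):

```r
# Number of ways to choose which 25 of the 50 subjects receive treatment.
n_assign <- choose(50, 25)

# Rounded to two significant digits for comparison with the figure in the text.
print(signif(n_assign, 2))
```

With this many assignments, each potentially carrying its own unrelated response vector, some simplifying model for how $r_z$ depends on $z$ is unavoidable.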
No Interference between units

A first model is that of "no interference between units," which means that "the observation on one unit should be unaffected by the particular assignment of treatments to the other units." Donald Rubin calls this SUTVA, for the "stable unit treatment value assumption." Formally, no interference means that $r_{siz}$ varies with $z_{si}$ but not with the other coordinates of $z$. When this model is assumed, we can write the potential outcomes for unit $si$ as $r_{Tsi}$ (potential outcome when $z_{si} = 1$) and $r_{Csi}$ (potential outcome when $z_{si} = 0$). We have been implicitly assuming no interference between units whenever we have used this notation.

"No interference between units" is a model and it can be false. No interference is often plausible when the units are different people and the treatment is a medical intervention with a biological response. In this case, no interference means that a medical treatment given to one patient affects only that patient, not other patients. That is often true. However, a vaccine given to many people may protect unvaccinated individuals by reducing the spread of a virus (so-called herd immunity), and this is a form of interference. No interference is less plausible in some social settings, such as a workplace or classroom, where a reward given to one person may be visible to others and may affect their behavior.

Additive Treatment Effect Model

The model of an additive treatment effect assumes that units do not interfere with each other and that the treatment raises the response of a unit by a constant amount $\tau$, so that $r_{Tsi} = r_{Csi} + \tau$ for each $si$.

The additive treatment effect model cannot be directly checked because $r_{Tsi}$ and $r_{Csi}$ are never jointly observed. However, in a large randomized experiment, the model of an additive effect has clear implications for the distribution of observed responses in treated and control groups.
The model of an additive treatment effect $r_{Tsi} = r_{Csi} + \tau$ implies that the distribution of observed outcomes $R_{si}$ among treated subjects ($Z_{si} = 1$) has the same shape and dispersion as the distribution of observed outcomes $R_{si}$ among control subjects ($Z_{si} = 0$), but the treated distribution is shifted by $\tau$. For example, boxplots or histograms of the distribution of outcomes in treated and control groups would look the same except that one would be shifted upwards by $\tau$. If, in a large randomized experiment, the distributions of observed responses in the two groups are clearly not shifted versions of each other, then the model of an additive effect is not applicable and other methods are needed.

Example 1: Intrinsic vs. Extrinsic motivation experiment.

An experiment was done to examine whether or not grading systems promote creativity in students. Subjects with considerable experience in creative writing were assigned to one of two treatment groups. 24 of the subjects were placed in an "intrinsic" group, in which they were given a questionnaire designed to establish a thought pattern of intrinsic motivation: doing something because it is associated with satisfaction. 23 of the subjects were placed in an "extrinsic" group, in which they were given a questionnaire designed to establish a thought pattern of extrinsic motivation: doing something because a reward is associated with its completion. After completing the questionnaire, all subjects were asked to write a poem in the Haiku style about "laughter." All poems were submitted to 12 poets, who evaluated them on a 40-point scale of creativity, based on their own subjective views. (Data from T. Amabile, "Motivation and Creativity: Effects of Motivational Orientation on Creative Writers," Journal of Personality and Social Psychology 48(2), 1985: 393-399.)
R code for producing boxplots:

    intrinsic=c(12,12,12.9,13.6,16.6,17.2,17.5,18.2,19.1,19.3,19.8,20.3,
                20.5,20.6,21.3,21.6,22.1,22.2,22.6,23.1,24,24.3,26.7,29.7);
    extrinsic=c(5,5.4,6.1,10.9,11.8,12,12.3,14.8,15,16.8,17.2,17.2,
                17.4,17.5,18.5,18.7,18.7,19.2,19.5,20.7,21.2,22.1,24);
    # Side-by-side boxplots of creativity scores in the two groups
    boxplot(intrinsic,extrinsic,names=c("Intrinsic","Extrinsic"),
            ylab="Creativity Score");

The extrinsic scores are slightly more dispersed than the intrinsic scores, but the additive treatment effect model appears reasonable.

Example 2: A randomized experiment was performed on mice to determine whether two forms of iron, Fe3+ and Fe4+, are retained differently. If one type is retained especially well, then it may be more useful as a dietary supplement for humans. The iron was radioactively labeled so that the initial amount and the amount retained after a fixed time interval could be measured. The measurements of interest are the percentages of iron retained in each mouse after the time period had elapsed.

    fe3plus=c(.71,1.66,2.01,2.16,2.42,2.42,2.56,2.60,3.31,3.64,3.74,3.74,
              4.39,4.50,5.07,5.26,8.15,8.24)
    fe4plus=c(2.2,2.69,3.54,3.75,3.83,4.08,4.27,4.53,5.32,6.18,6.22,6.33,
              6.97,6.97,7.52,8.36,11.65,12.45)
    boxplot(fe3plus,fe4plus,names=c("fe3plus","fe4plus"))

The Fe4+ mice have higher and more dispersed amounts of iron retained, suggesting that the additive treatment effect model is not reasonable.

Multiplicative Treatment Effect Model:

$$r_{Tsi} = \delta \, r_{Csi}$$

For $\delta > 1$ and positive responses, treated outcomes will be larger and more dispersed than control outcomes.

The multiplicative treatment effect model is an additive treatment effect model on the log scale:

$$\log(r_{Tsi}) = \log(r_{Csi}) + \log \delta$$

    # Boxplots on the log scale, where a multiplicative effect becomes a shift
    boxplot(log(fe3plus),log(fe4plus),names=c("log (fe3plus)","log (fe4plus)"))
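As a rough numerical companion to the boxplots, we can compare the spread of the two groups on the raw and log scales. The sketch below uses the interquartile range as the measure of spread; that choice, and the variable names `ratio_raw` and `ratio_log`, are ours and are only one of several reasonable options. If the multiplicative model is reasonable, the ratio of spreads should be noticeably above 1 on the raw scale and closer to 1 after taking logs.

```r
# Iron retention data from Example 2.
fe3plus <- c(.71, 1.66, 2.01, 2.16, 2.42, 2.42, 2.56, 2.60, 3.31, 3.64, 3.74, 3.74,
             4.39, 4.50, 5.07, 5.26, 8.15, 8.24)
fe4plus <- c(2.2, 2.69, 3.54, 3.75, 3.83, 4.08, 4.27, 4.53, 5.32, 6.18, 6.22, 6.33,
             6.97, 6.97, 7.52, 8.36, 11.65, 12.45)

# Ratio of interquartile ranges (Fe4+ relative to Fe3+) on each scale.
ratio_raw <- IQR(fe4plus) / IQR(fe3plus)
ratio_log <- IQR(log(fe4plus)) / IQR(log(fe3plus))

print(c(raw_scale = ratio_raw, log_scale = ratio_log))
```

This is only an informal diagnostic: a spread ratio near 1 on the log scale is consistent with the multiplicative model but does not establish it, and other spread measures may behave differently in samples of this size.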