Notes 3 - Wharton Statistics Department

Stat 921 Notes 3
Reading:
Observational Studies, Chapters 2.4-2.5.
I. General Approach to Randomization Inference for Testing
Hypothesis of No Treatment Effect and Examples
Consider the null hypothesis of no treatment effect:
$H_0: r_{Ti} = r_{Ci}$ for all $i$.
Notation: Let $r_z$ denote the vector of potential responses for
randomization assignment $z$.
Let $Z$ denote the observed randomization assignment.
Let $r = r_Z$ denote the vector of observed responses.
Under $H_0: r_{Ti} = r_{Ci}$ for all $i$, $r_z$ is the same for all $z$ and hence
$r_z = r$.
Test: Consider a test statistic $t(Z, r)$. We will reject for large
values of $t(Z, r)$. We want to compute a p-value for the test.
To do this,
(i) The null hypothesis is tentatively assumed to hold, so $r$ is
fixed.
(ii) The treatment assignment $Z$ is assumed to have been
selected from the set of possible treatment assignments $\Omega$ using
a known random mechanism.
(iii) The observed value, say $T$, of the test statistic $t(Z, r)$ is
calculated.
(iv) We seek the probability of a value of the test statistic
$t(Z, r)$ as large or larger than that observed if the null
hypothesis were true.
The p-value is simply the sum of the randomization probabilities
of assignments $z \in \Omega$ that lead to values of $t(z, r)$ greater than
or equal to the observed value $T$, namely
$$\Pr_{H_0}\{t(Z, r) \ge T\} = \sum_{z \in \Omega} I[t(z, r) \ge T] \Pr(Z = z).$$
In the case of a uniform randomized experiment, since
$\Pr(Z = z) = 1/K = 1/|\Omega|$, the p-value is simply
$$\Pr_{H_0}\{t(Z, r) \ge T\} = \frac{|\{z \in \Omega : t(z, r) \ge T\}|}{K}.$$
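Steps (i)-(iv) can be sketched in R for a tiny uniform randomized experiment by enumerating every assignment in $\Omega$. The data, the choice of 3 treated units out of 6, and the helper name t_stat are hypothetical and only for illustration; the statistic used here is the sum of the treated responses.

```r
# Hypothetical single-stratum experiment: N = 6 units, m = 3 treated.
r <- c(5.1, 3.2, 4.8, 2.0, 2.9, 3.5)   # observed responses, fixed under H0
Z <- c(1, 0, 1, 0, 0, 1)               # observed treatment assignment

# Test statistic t(z, r): sum of the responses of the treated units.
t_stat <- function(z, r) sum(z * r)

# Enumerate Omega: every way of choosing which 3 of the 6 units are treated.
treated_sets <- combn(length(r), sum(Z))
t_all <- apply(treated_sets, 2, function(idx) {
  z <- numeric(length(r))
  z[idx] <- 1
  t_stat(z, r)
})

T_obs <- t_stat(Z, r)                  # observed value of the statistic
K <- ncol(treated_sets)                # |Omega| = choose(6, 3)
p_value <- sum(t_all >= T_obs) / K     # share of assignments with t >= T
p_value
```

Full enumeration is only feasible for very small experiments, since $K$ grows quickly with the number of units.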
Common Randomization Inferences:
1. Fisher’s exact test for binary outcomes.
Example 1: Perry Preschool Project. In a 1962 social experiment,
123 3- and 4-year-old children from poverty-level families in
Ypsilanti, Michigan were randomly assigned either to a
treatment group receiving 2 years of preschool instruction or to a
control group receiving no preschool. The participants were
followed into their adult years. The following table shows how
many in each group were arrested for some crime by the time
they were 19 years old:
             Arrested for some crime?
             No (1)    Yes (0)    Total
Preschool      42        19         61
Control        30        32         62
Total          72        51        123
Let the outcome be 1 if the person was not arrested for any crime
and 0 if the person was arrested, as coded in the table. The test
statistic for Fisher’s test is the number of
treated units with the outcome of 1, $t(Z, r) = Z^T r$; here $t(Z, r)$
is the number of children assigned to the preschool group who
were not arrested for any crime. Note that under the null
hypothesis of no effect, the margins of this table are fixed.
Thus, under the null hypothesis, $t(Z, r)$ has a hypergeometric
distribution. The p-value for the test is
1-phyper(41,72,51,61)
[1] 0.01673607
There is moderate evidence (p=0.017) that the preschool
program is having an effect.
Note that we have constructed a one-sided test that rejects when
there are a large number of treated units with an outcome of 1.
If we wanted a two-sided test, we could use the test statistic
$t(Z, r) = |Z^T r - E_{H_0}[Z^T r]|$.
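As a rough sketch of the two-sided version, the null distribution of $Z^T r$ for the Perry Preschool table can be listed with dhyper and the p-value obtained by summing the probabilities of all values at least as far from the null expectation as the observed count. The variable names are illustrative only, and the margins are taken from the table above.

```r
# Margins of the Perry Preschool table.
n_one  <- 72    # children with outcome 1 (not arrested)
n_zero <- 51    # children with outcome 0 (arrested)
m      <- 61    # children assigned to preschool
T_obs  <- 42    # treated children with outcome 1

# Null distribution of Z'r: hypergeometric on its support.
x     <- max(0, m - n_zero):min(m, n_one)
probs <- dhyper(x, n_one, n_zero, m)

# Null expectation of Z'r.
E0 <- m * n_one / (n_one + n_zero)

# One-sided p-value (same quantity as 1 - phyper(41, 72, 51, 61)).
p_one_sided <- sum(probs[x >= T_obs])

# Two-sided p-value based on |Z'r - E0|.
p_two_sided <- sum(probs[abs(x - E0) >= abs(T_obs - E0)])
c(one_sided = p_one_sided, two_sided = p_two_sided)
```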
2. Mantel-Haenszel test
Analogue of Fisher’s exact test when there are two or more
strata and the outcome is binary. The test statistic is again the
number of outcomes equal to 1 among the treated subjects.
Special case: Matched pairs. The experiment randomly assigns
one member of each pair to the treatment. For this special case,
the Mantel-Haenszel test is called McNemar’s test.
Example 2:
Pair    Treated    Control
 1         1          1
 2         0          0
 3         1          0
 4         1          0
 5         0          0
 6         0          1
 7         1          0
 8         1          1
 9         1          0
10         1          1
Pairs 1, 2, 5, 8 and 10 are concordant, meaning that the
outcomes are the same and pairs 3, 4, 6, 7 and 9 are discordant,
meaning that the outcomes are different.
Under the null hypothesis, $t(Z, r) = Z^T r$ is distributed as
the number of concordant pairs in which both outcomes are 1,
plus a binomial random variable whose number of trials is
the number of discordant pairs and whose probability of success is 0.5.
The p-value for McNemar’s test is therefore the p-value
for testing that a binomial has p=0.5, with the number of successes
equal to the number of discordant pairs in which the treated unit
has a 1.
For example 2, 4 out of the 5 discordant pairs have the treated
unit with a 1. The p-value is
> 1-pbinom(3,5,.5)
[1] 0.1875
There is not strong evidence that the treatment has an effect.
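The same calculation can be sketched directly from the pair table of Example 2; the vector names are illustrative, and the values are copied from the table.

```r
# Outcomes for the 10 matched pairs of Example 2.
treated <- c(1, 0, 1, 1, 0, 0, 1, 1, 1, 1)
control <- c(1, 0, 0, 0, 0, 1, 0, 1, 0, 1)

# Only discordant pairs carry information about the treatment.
discordant    <- treated != control
n_disc        <- sum(discordant)            # number of discordant pairs
n_treated_one <- sum(treated[discordant])   # discordant pairs with treated = 1

# One-sided p-value: P(Binomial(n_disc, 0.5) >= n_treated_one).
p_value <- 1 - pbinom(n_treated_one - 1, n_disc, 0.5)
p_value
```

The full statistic $Z^T r$ is sum(treated), which equals the number of concordant pairs with both outcomes 1 plus n_treated_one.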
We now consider some tests for outcomes taking more than two
numerical values:
3. Difference in sample means or t-test.
We studied this in Notes 2.
4. Wilcoxon’s rank sum test for experiments with only one
stratum.
Rank the observed responses from smallest to largest. If all $N$
responses were different numbers, the ranks would be the
numbers $1, 2, \ldots, N$. If some of the responses were equal, then
the average of their ranks would be used. Write $q_i$ for the rank
of the observed $r_i$ of the $i$th unit and write $q = (q_1, \ldots, q_N)^T$.
Note that the ranks $q$ are fixed under the null hypothesis of no
treatment effect. The rank sum statistic is the sum of the ranks
of the treated observations, $t(Z, r) = Z^T q$.
Example: Job training experiment from Notes 2
nswdata=read.table("nswdata.txt",header=TRUE);
treated.r.jobtrain=nswdata$earnings78[nswdata$treatment==1];
control.r.jobtrain=nswdata$earnings78[nswdata$treatment==0];
# Alternative = “greater” in the Wilcoxon test rejects for large
# values of the Wilcoxon rank sum test
wilcox.test(treated.r.jobtrain,control.r.jobtrain,alternative="greater")
Wilcoxon rank sum test with continuity correction
data: treated.r.jobtrain and control.r.jobtrain
W = 68209.5, p-value = 0.03096
alternative hypothesis: true location shift is greater than 0
The p-value from the Wilcoxon rank sum test, 0.03096, is very
close to the p-value we got from the difference in means test
statistic, 0.031, in Notes 2.
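Note that the W reported by wilcox.test is not the rank sum $Z^T q$ itself: R reports the rank sum of the first sample minus $m(m+1)/2$, where $m$ is the size of the first sample. The two statistics differ by a constant and so give the same test. A small sketch with hypothetical data (not the job training data) shows the relationship:

```r
# Hypothetical responses, no ties.
treated <- c(12.1, 8.4, 15.3, 9.9)
control <- c(7.2, 10.5, 6.8, 8.0, 5.5)

r <- c(treated, control)
Z <- c(rep(1, length(treated)), rep(0, length(control)))

q        <- rank(r)        # ranks of all N responses
rank_sum <- sum(Z * q)     # Wilcoxon rank sum statistic Z'q
m        <- sum(Z)

# W as reported by R: rank sum minus m(m + 1)/2.
W <- unname(wilcox.test(treated, control, alternative = "greater")$statistic)
c(rank_sum = rank_sum, W = W, rank_sum_minus_const = rank_sum - m * (m + 1) / 2)
```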
The Wilcoxon rank sum test has almost as much power as the
difference in means test statistic when the data is normally
distributed but is much more robust to outliers. Reference:
Lehmann (1975, Nonparametrics: Statistical Methods Based on
Ranks).
5. Wilcoxon’s signed rank statistic for matched pair
experiments.
For each pair, compute the absolute difference in the responses
between the treated and control units. Rank these absolute
differences. Let $d_s$ be the rank of the absolute difference of the
$s$th pair. The signed rank statistic is the sum of the ranks $d_s$ of the
pairs in which the treated unit had a higher response than the
control unit.
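A minimal sketch of the signed rank statistic and its exact randomization p-value, using hypothetical paired data with no ties and no zero differences. Under the null hypothesis, each pair is equally likely to have had its treated and control labels swapped, so the null distribution is obtained by enumerating all $2^S$ sign patterns.

```r
# Hypothetical matched pairs (S = 6).
treated <- c(10.2, 7.5, 9.1, 6.0, 8.8, 7.0)
control <- c(8.0, 7.9, 6.5, 6.9, 5.4, 7.3)

diffs <- treated - control
d     <- rank(abs(diffs))             # ranks d_s of the absolute differences
signed_rank <- sum(d[diffs > 0])      # sum of ranks where treated > control

# Exact null distribution: each pair contributes d_s with probability 1/2.
signs      <- as.matrix(expand.grid(rep(list(c(0, 1)), length(d))))
null_stats <- as.vector(signs %*% d)
p_exact    <- mean(null_stats >= signed_rank)

# Cross-check against R's built-in paired test.
wt <- wilcox.test(treated, control, paired = TRUE, alternative = "greater")
c(signed_rank = signed_rank, V = unname(wt$statistic),
  p_exact = p_exact, p_wilcox = wt$p.value)
```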
6. Stratified rank sum statistic
Sum the Wilcoxon rank sum statistics of the individual strata.
7. Aligned rank statistic
Hodges and Lehmann (1962, Annals of Mathematical Statistics) find the
stratified rank sum statistic to be inefficient when the number of strata $S$ is large
compared to the number of units $N$. They suggest as an alternative the method of
aligned ranks: the mean in each stratum is subtracted from the
responses in that stratum, creating aligned responses that are
ranked from 1 to $N$, momentarily ignoring the strata. Writing
$q$ for these aligned ranks, the aligned rank statistic is the sum of
the aligned ranks in the treated group, $t(Z, r) = Z^T q$.
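The following sketch computes both the stratified rank sum statistic (item 6) and the aligned rank statistic for a small hypothetical data set with 3 strata of 4 units each, 2 treated per stratum. The data and variable names are made up for illustration.

```r
# Hypothetical stratified experiment.
stratum <- rep(1:3, each = 4)
Z <- c(1, 1, 0, 0,   1, 0, 1, 0,   0, 1, 1, 0)
r <- c(10.4, 9.1, 8.7, 7.9,   21.3, 19.0, 20.6, 18.2,   5.2, 6.8, 6.1, 4.4)

# Stratified rank sum: rank within each stratum, then sum treated ranks.
q_strat <- ave(r, stratum, FUN = rank)
stratified_rank_sum <- sum(Z * q_strat)

# Aligned ranks: subtract each stratum mean, then rank all N aligned
# responses together, ignoring the strata.
aligned <- r - ave(r, stratum)        # ave() defaults to the stratum mean
q <- rank(aligned)
aligned_rank_stat <- sum(Z * q)

c(stratified = stratified_rank_sum, aligned = aligned_rank_stat)
```

Aligning removes the large between-stratum differences in level (here roughly 9, 20 and 6) before ranking, so that ranks can be compared across strata.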
8. Median test: The number of treated responses that exceed the
median of the responses in their stratum.
II. Classes of Test Statistics (Section 2.4.4)
Computing the exact p-value for a test statistic becomes
computationally difficult for even moderate sized experiments.
One approach is to use the Monte Carlo method to approximate
the p-value as we did in Notes 2. Another approach is to use a
large sample or asymptotic approximation based on the mean
and variance of the test statistic. For test statistics that are based
on the ranks of the data, these large sample approximations are
quite accurate even for relatively small experiments.
A general class of test statistics for which we can derive the
moments are sum statistics, which are test statistics of the form
$t(Z, r) = Z^T q$, where $q$ is some function of $r$. All of the
statistics we considered above are sum statistics for suitable
choices of q . In Fisher’s exact test, the Mantel-Haenszel test
and McNemar’s test, q is simply r . In the rank sum test, q is
the ranks of r . In the median test, q is the vector of ones and
zeros identifying responses that exceed stratum medians.
Proposition 2: In a uniform randomized experiment, if the
treatment has no effect, the expectation and variance of a sum
statistic $t(Z, r) = Z^T q$ are
$$E(Z^T q) = \sum_{s=1}^{S} m_s \bar{q}_s,$$
$$\mathrm{Var}(Z^T q) = \sum_{s=1}^{S} \frac{m_s (n_s - m_s)}{n_s (n_s - 1)} \sum_{i=1}^{n_s} (q_{si} - \bar{q}_s)^2,$$
where
$$\bar{q}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} q_{si}.$$
Proof:
$$E(Z^T q) = \sum_{s=1}^{S} E(Z_s^T q_s) = \sum_{s=1}^{S} \sum_{i=1}^{n_s} \frac{m_s}{n_s} q_{si} = \sum_{s=1}^{S} m_s \bar{q}_s.$$
The proof for the variance uses properties of simple random
sampling without replacement. See Problem 4 at the end of
Chapter 2 in the book.
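As an informal numerical check of Proposition 2 (not a substitute for the proof), the formulas can be compared with a brute-force enumeration for a small hypothetical experiment with two strata. Here $n_s$ is the number of units in stratum $s$ and $m_s$ the number treated; the scores q are made up.

```r
# Hypothetical scores q_si in S = 2 strata, with m_s treated per stratum.
q <- list(c(1, 4, 6, 7), c(2, 3, 5))
m <- c(2, 1)

# Moments from Proposition 2.
n    <- sapply(q, length)
qbar <- sapply(q, mean)
ss   <- sapply(q, function(qs) sum((qs - mean(qs))^2))
E_formula <- sum(m * qbar)
V_formula <- sum(m * (n - m) / (n * (n - 1)) * ss)

# Brute force: treated-score sums for every assignment within each stratum,
# then all combinations across strata (assignments are independent by stratum).
sums_by_stratum <- lapply(seq_along(q),
                          function(s) combn(q[[s]], m[s], FUN = sum))
t_all <- Reduce(function(a, b) as.vector(outer(a, b, "+")), sums_by_stratum)

E_enum <- mean(t_all)
V_enum <- mean((t_all - E_enum)^2)    # variance over equally likely assignments

c(E_formula = E_formula, E_enum = E_enum,
  V_formula = V_formula, V_enum = V_enum)
```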
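Using Proposition 2, we can approximate the p-value using the
Central Limit Theorem by
$$\Pr_{H_0}[t(Z, r) \ge T] \approx 1 - \Phi\left( \frac{T - E_{H_0}[t(Z, r)]}{\sqrt{\mathrm{Var}_{H_0}(t(Z, r))}} \right),$$
where $T$ is the observed value of the statistic and $\Phi$ is the standard normal cumulative distribution function.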
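As a sketch, the normal approximation can be applied to the Perry Preschool example from Section I, where $q = r$ is the binary outcome and there is a single stratum. No continuity correction is used here, so the approximation should not be expected to match the exact hypergeometric p-value closely.

```r
# Perry Preschool example: one stratum, n = 123 units, m = 61 treated.
n <- 123
m <- 61
q <- c(rep(1, 72), rep(0, 51))   # 72 outcomes equal to 1, 51 equal to 0
T_obs <- 42                      # observed Z'q

# Null moments from Proposition 2 with S = 1.
E0 <- m * mean(q)
V0 <- m * (n - m) / (n * (n - 1)) * sum((q - mean(q))^2)

# Large-sample approximation to the one-sided p-value.
p_approx <- 1 - pnorm((T_obs - E0) / sqrt(V0))

# Exact one-sided p-value for comparison.
p_exact <- 1 - phyper(41, 72, 51, 61)
c(approx = p_approx, exact = p_exact)
```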
Effect increasing statistics:
Many statistics that measure the size of the difference between
treated and control groups would tend to increase in value if
responses in the treated group were increased and those in the
control group were decreased. Statistics with this property will
be called effect increasing.
To express the idea formally, note that a treated unit $si$ has
$2Z_{si} - 1 = 1$ and a control unit has $2Z_{si} - 1 = -1$. Let $z$ be a
possible treatment assignment and let $r$ and $r^*$ be two possible
values of the responses such that $(r_{si}^* - r_{si})(2z_{si} - 1) \ge 0$ for all
$s, i$. With treatments given by $z$, this says that $r_{si}^* \ge r_{si}$ for every
treated unit and $r_{si}^* \le r_{si}$ for every control unit. In words, if
higher responses indicated favorable outcomes, then every
treated unit does at least as well with $r^*$ as with $r$, and every control unit
does no better with $r^*$ than with $r$. That is, the difference between the
treated and control groups looks at least as large with $r^*$ as with $r$. The test
statistic $t(Z, r)$ is effect increasing if
$t(z, r) \le t(z, r^*)$ whenever $r$ and $r^*$ are two possible values of the
responses such that $(r_{si}^* - r_{si})(2z_{si} - 1) \ge 0$ for all $s, i$. All of the
test statistics we considered in Section I of these notes are effect
increasing.
Table 1 below contains a small, hypothetical example to
illustrate the idea of an effect increasing statistic. Here there is a
single stratum and four subjects, 2 of whom receive treatment.
Notice that when $r_i$ and $r_i^*$ are compared, treated subjects have
$r_i^* \ge r_i$ while control subjects have $r_i^* \le r_i$. If the responses are
ranked 1, 2, 3 and 4, and the ranks in the treated group are
summed to give Wilcoxon’s rank sum statistic, then the rank
sum is larger for $r_i^*$ than for $r_i$.
Table 1
$i$         Group      $z_i$    $2z_i - 1$    $r_i$    $r_i^*$
1           Treated      1         1            5        6
2           Treated      1         1            2        4
3           Control      0        -1            3        2
4           Control      0        -1            1        1
Rank Sum                                        6        7
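The rank sums in Table 1 can be reproduced with a few lines of R; the helper name rank_sum is illustrative only.

```r
# Table 1: one stratum, four subjects, two treated.
z      <- c(1, 1, 0, 0)
r      <- c(5, 2, 3, 1)
r_star <- c(6, 4, 2, 1)

# Condition relating r and r*: (r*_i - r_i)(2 z_i - 1) >= 0 for every unit.
condition_holds <- all((r_star - r) * (2 * z - 1) >= 0)

# Wilcoxon rank sum statistic t(z, r) = z'q with q = rank(r).
rank_sum <- function(z, r) sum(z * rank(r))

t_r      <- rank_sum(z, r)        # 6, as in Table 1
t_r_star <- rank_sum(z, r_star)   # 7, as in Table 1
c(condition_holds = condition_holds, t_r = t_r, t_r_star = t_r_star)
```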
III. Models for Treatment Effects (Chapter 2.5)
If the treatment has an effect, then the vector of potential
responses $r_z$ varies with the randomization assignment $z$.
Let $Z$ denote the observed randomization assignment.
Let $R = r_Z$ denote the vector of observed responses.
In principle, each possible treatment assignment $z$ might yield a
pattern of responses $r_z$ that is unrelated to the pattern observed
with another $z$. For instance, in a completely randomized
experiment with 50 subjects divided into two groups of 25, there
might be $\binom{50}{25} \approx 1.3 \times 10^{14}$ different and unrelated $r_z$’s. Since it
is difficult to comprehend a treatment effect in such terms, we
look for regularities, patterns or models of the behavior of $r_z$ as
$z$ varies.
No Interference between units
A first model is that of “no interference between units” which
means that “the observation on one unit should be unaffected by
the particular assignment of treatments to the other units.”
Donald Rubin calls this SUTVA for the “stable unit treatment
value assumption.” Formally, no interference means that $r_{siz}$
varies with $z_{si}$ but not with the other coordinates of $z$. When
this model is assumed we can write the potential outcomes for
unit $si$ as $r_{Tsi}$ (potential outcome when $z_{si} = 1$) and $r_{Csi}$ (potential
outcome when $z_{si} = 0$); we’ve been implicitly assuming no
interference between units when we’ve been using this notation.
“No interference between units” is a model and it can be false.
No interference is often plausible when the units are different
people and the treatment is a medical intervention with a
biological response. In this case, no interference means that a
medical treatment given to one patient affects only that patient,
not other patients. That is often true. However, a vaccine given
to many people may protect unvaccinated individuals by
reducing the spread of a virus (so called herd immunity) and this
is a form of interference. No interference is less plausible in
some social settings such as a workplace or classroom, where a
reward given to one person may be visible to others, and may
affect behavior.
Additive Treatment Effect Model
The model of an additive treatment effect assumes units do not
interfere with each other, and the treatment raises the response
of a unit by a constant amount $\tau$, so that $r_{Tsi} = r_{Csi} + \tau$ for each
$si$.
The additive treatment effect model cannot be directly checked
because $r_{Tsi}$ and $r_{Csi}$ are never jointly observed.
However, in a large randomized experiment, the model of an
additive effect has clear implications for the distribution of
observed responses in treated and control groups. The model of
an additive treatment effect $r_{Tsi} = r_{Csi} + \tau$ implies that the
distribution of observed outcomes $R_{si}$ among treated subjects
($Z_{si} = 1$) has the same shape and dispersion as the distribution
of observed outcomes $R_{si}$ among control subjects ($Z_{si} = 0$), but
the treated distribution is shifted by $\tau$. For example, boxplots
or histograms of the distribution of outcomes in treated and
control groups would look the same except that one would be
shifted upwards by $\tau$. If, in a large, randomized experiment,
the distributions of observed responses are clearly not shifted versions of each other,
then the model of an additive effect is not applicable and other
methods are needed.
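A small simulation sketch of this implication: potential outcomes are generated under an additive effect, treatment is assigned completely at random, and the observed treated and control distributions are compared. The sample size, the exponential distribution for the control responses, and the shift of 3 (stored in tau) are arbitrary choices for illustration.

```r
set.seed(1)
N   <- 2000
tau <- 3                      # hypothetical constant additive effect

r_C <- rexp(N)                # hypothetical (skewed) responses under control
r_T <- r_C + tau              # additive model: treatment adds tau to each unit

# Completely randomized assignment: half treated, half control.
Z <- sample(rep(c(1, 0), each = N / 2))
R <- ifelse(Z == 1, r_T, r_C) # observed responses

# Differences between treated and control quantiles should all be near tau,
# and the spreads of the two groups should be similar.
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
quantile_shift <- quantile(R[Z == 1], probs) - quantile(R[Z == 0], probs)
sd_treated <- sd(R[Z == 1])
sd_control <- sd(R[Z == 0])

quantile_shift
c(sd_treated = sd_treated, sd_control = sd_control)
```

In real data only R and Z are observed, so the same comparison of quantiles or boxplots serves as an informal diagnostic rather than a check of the model itself.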
Example 1: Intrinsic vs. Extrinsic motivation experiment.
An experiment was done to examine whether or not grading
systems promote creativity in students. Subjects with
considerable experience in creative writing were assigned to one
of two treatment groups: 24 of the subjects were placed in an
“intrinsic” group in which they were given a questionnaire
designed to establish a thought pattern of intrinsic motivation –
doing something because it is associated with satisfaction; 23 of
the subjects were placed in an “extrinsic” group in which they
were given a questionnaire designed to establish a thought
pattern of extrinsic motivation – doing something because a
reward is associated with its completion. After completing the
questionnaire, all subjects were asked to write a poem in the
Haiku style about “laughter.” All poems were submitted to 12
poets, who evaluated them on a 40-point scale of creativity,
based on their own subjective views (Data from T. Amabile,
“Motivation and Creativity: Effects of Motivational Orientation
on Creative Writers,” Journal of Personality and Social
Psychology 48(2), 1985: 393-399).
R code for producing boxplots:
intrinsic=c(12,12,12.9,13.6,16.6,17.2,17.5,18.2,19.1,19.3,19.8,20.3,20.5,20.6,21.3,
21.6,22.1,22.2,22.6,23.1,24,24.3,26.7,29.7);
extrinsic=c(5,5.4,6.1,10.9,11.8,12,12.3,14.8,15,16.8,17.2,17.2,17.4,17.5,18.5,18.7,
18.7,19.2,19.5,20.7,21.2,22.1,24);
boxplot(intrinsic,extrinsic,names=c("Intrinsic","Extrinsic"),
ylab="Creativity Score");
The extrinsic scores are slightly more dispersed than the
intrinsic scores but the additive treatment effect model appears
reasonable.
Example 2: A randomized experiment was performed on mice to
determine whether two forms of iron, Fe3+ and Fe4+ are retained
differently. If one type is retained especially well, then it may
be more useful as a dietary supplement for humans. The iron
was radioactively labeled so that the initial amount and the
amount retained after a fixed time interval could be measured.
The measurements of interest are the percentages of iron
retained in each mouse after the time period had elapsed.
fe3plus=c(.71,1.66,2.01,2.16,2.42,2.42,2.56,2.60,3.31,3.64,3.74,3.74,4.39,4.50,
5.07,5.26,8.15,8.24)
fe4plus=c(2.2,2.69,3.54,3.75,3.83,4.08,4.27,4.53,5.32,6.18,6.22,6.33,6.97,6.97,
7.52,8.36,11.65,12.45)
boxplot(fe3plus,fe4plus,names=c("fe3plus","fe4plus"))
The Fe4+ mice have higher and more dispersed amounts of iron
retained, suggesting that the additive treatment effect model is
not reasonable.
Multiplicative Treatment Effect Model:
$r_{Tsi} = \delta r_{Csi}$
For $\delta > 1$ and positive responses, treated outcomes will be larger and more dispersed
than control outcomes.
The multiplicative treatment effect model is an additive treatment
effect model on the log scale:
$\log(r_{Tsi}) = \log(r_{Csi}) + \log \delta$
boxplot(log(fe3plus),log(fe4plus),names=c("log (fe3plus)","log (fe4plus)"))
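To complement the boxplots with numbers, the sketch below compares the spread of the two groups on the raw and log scales. The exponentiated difference in mean logs is included only as a rough descriptive summary of a possible multiplicative factor; it is an illustrative choice, not an estimator taken from these notes.

```r
# Iron retention data from Example 2.
fe3plus <- c(.71, 1.66, 2.01, 2.16, 2.42, 2.42, 2.56, 2.60, 3.31, 3.64, 3.74,
             3.74, 4.39, 4.50, 5.07, 5.26, 8.15, 8.24)
fe4plus <- c(2.2, 2.69, 3.54, 3.75, 3.83, 4.08, 4.27, 4.53, 5.32, 6.18, 6.22,
             6.33, 6.97, 6.97, 7.52, 8.36, 11.65, 12.45)

# Ratio of standard deviations; a value near 1 is what an additive
# (pure shift) model would suggest on that scale.
sd_ratio_raw <- sd(fe4plus) / sd(fe3plus)
sd_ratio_log <- sd(log(fe4plus)) / sd(log(fe3plus))

# Rough descriptive summary: shift on the log scale, back-transformed.
factor_summary <- exp(mean(log(fe4plus)) - mean(log(fe3plus)))

c(sd_ratio_raw = sd_ratio_raw, sd_ratio_log = sd_ratio_log,
  factor_summary = factor_summary)
```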