Stat 921 Notes 9
Reading:
Observational Studies, Chapter 10.2-10.3
I. Example of matching in an observational study
Background: Silber, Rosenbaum, Polsky, Ross, Even-Shoshan,
Schwartz, Armstrong and Randall (2007, Journal of Clinical
Oncology) sought to compare medical oncologists (MOs) with
gynecologic oncologists (GOs) as providers of chemotherapy for
ovarian cancer. Medical oncologists typically have a residency
in internal medicine, followed by a fellowship emphasizing the
administration of chemotherapy and the management of its side
effects. Gynecologic oncologists typically complete a residency
in obstetrics and gynecology, followed by a fellowship in
gynecologic oncology, which includes training in surgical
oncology and chemotherapy administration for gynecologic
cancers. Unlike gynecologic oncologists, who are trained in
surgery, medical oncologists are almost invariably not surgeons,
so medical oncologists provide chemotherapy after someone
else has performed surgery. It was anticipated that MOs would
use chemotherapy more intensively than GOs, both at the time
of initial diagnosis and several years later if the cancer has
spread from its site of origin. A main question of interest was
whether the greater intensity of chemotherapy found in the
practice of MOs is of benefit to patients in terms of survival.
The study used data on women older than 65 with ovarian
cancer who were diagnosed between 1991 and 1999, who had
appropriate surgery and at least some chemotherapy. There were
344 such women who received chemotherapy from a GO, and
2011 such women who received chemotherapy from an MO.
Before matching, surgeon type, SEER site and year of diagnosis
were considerably out of balance, and clinical stage, grade and
comorbid conditions were slightly out of balance. After
matching, all variables were reasonably well balanced between
the GO and MO groups.
As anticipated, MOs often used chemotherapy more intensively
than GOs did, both in the first year following diagnosis and in
the first five years following diagnosis. The difference at the
medians is not large, but it is quite noticeable at the upper
quartiles.
Weeks with Chemotherapy in Matched Pairs

Period     Group    Mean   Min   25%   50%   75%   Max   P-value
Year 1     GO       6.63     1     5     6     8    19   .0022
Year 1     MO       7.74     1     5     6    10    42
Year 1-5   GO      12.07     1     5     9    16    70   .00045
Year 1-5   MO      16.47     1     6    11    21   103
Despite a difference in chemotherapy intensity, survival was
virtually identical for the patients of MOs and GOs. The
standard test comparing paired censored survival times is the
Prentice-Wilcoxon test (O’Brien and Fleming, 1987,
Biometrics); it gives a two-sided P-value of 0.45.
                          GO Patients      MO Patients
Median (Years)            3.04             2.98
  95% CI                  [2.50, 3.40]     [2.69, 3.67]
1 Year Survival %         86.6             87.5
  95% CI                  [83.0, 90.2]     [84.0, 90.1]
2 Year Survival %         64.8             66.9
  95% CI                  [59.8, 69.9]     [61.9, 71.8]
5 Year Survival %         35.1             34.2
  95% CI                  [30.0, 40.2]     [29.2, 39.3]
Number at Risk Year 0     344              344
Number at Risk Year 1     298              301
Number at Risk Year 2     223              230
Number at Risk Year 3     173              172
Number at Risk Year 4     133              128
In brief, it appears that MOs often treated more intensively than
GOs, but survival was no different.
Two advantages of matching for presenting results in a scientific
journal:
(1) The reader may examine the degree to which matched
groups are comparable with respect to observed covariates, as
well as which covariates are not among the observed covariates,
without getting involved in the procedures used to construct the
matched sample.
(2) Straightforward analyses can be used to assess the effect of
the treatment on outcomes in the matched samples.
II. Matching on the Estimated Propensity Score
# Load Job training data
treated.table=read.table("nsw_treated_earn74.txt",header=TRUE);
control.table=read.table("cps_controls.txt",header=TRUE);
jobtraining=c(treated.table[,1],control.table[,1]);
age=c(treated.table[,2],control.table[,2]);
education=c(treated.table[,3],control.table[,3]);
black=c(treated.table[,4],control.table[,4]);
hispanic=c(treated.table[,5],control.table[,5]);
married=c(treated.table[,6],control.table[,6]);
nodegree=c(treated.table[,7],control.table[,7]);
earnings74=c(treated.table[,8],control.table[,8])
earnings75=c(treated.table[,9],control.table[,9]);
earnings78=c(treated.table[,10],control.table[,10]);
age.sq=age^2;
education.sq=education^2;
earnings74.sq=earnings74^2;
earnings75.sq=earnings75^2;
age.education=age*education;
age.black=age*black;
age.hispanic=age*hispanic;
age.married=age*married;
age.nodegree=age*nodegree;
age.earnings74=age*earnings74;
age.earnings75=age*earnings75;
education.black=education*black;
education.hispanic=education*hispanic;
education.married=education*married;
education.nodegree=education*nodegree;
education.earnings74=education*earnings74;
education.earnings75=education*earnings75;
black.married=black*married;
black.nodegree=black*nodegree;
black.earnings74=black*earnings74;
black.earnings75=black*earnings75;
hispanic.married=hispanic*married;
hispanic.nodegree=hispanic*nodegree;
hispanic.earnings74=hispanic*earnings74;
hispanic.earnings75=hispanic*earnings75;
married.nodegree=married*nodegree;
married.earnings74=married*earnings74;
married.earnings75=married*earnings75;
nodegree.earnings74=nodegree*earnings74;
nodegree.earnings75=nodegree*earnings75;
earnings74.earnings75=earnings74*earnings75;
Xmat=cbind(age,education,black,hispanic,married,nodegree,earnings74,earnings75,
age.sq,education.sq,earnings74.sq,earnings75.sq,
age.education,age.black,age.hispanic,age.married,age.nodegree,
age.earnings74,age.earnings75,
education.black,education.hispanic,education.married,education.nodegree,
education.earnings74,education.earnings75,
black.married,black.nodegree,black.earnings74,black.earnings75,
hispanic.married,hispanic.nodegree,hispanic.earnings74,hispanic.earnings75,
married.nodegree,married.earnings74,married.earnings75,
nodegree.earnings74,nodegree.earnings75,earnings74.earnings75);
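# Load Design and optmatch libraries
# (Design has since been superseded by the rms package, where
# glmD is named Glm; fastbw is also available in rms)
library(Design);
library(optmatch);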
# Model that uses all main effects, squared main effects
# and interactions
firstmodel=glmD(jobtraining~age+education+black+hispanic+married+nodegree+
earnings74+earnings75+age.sq+education.sq+earnings74.sq+earnings75.sq+
age.education+age.black+age.hispanic+age.married+age.nodegree+
age.earnings74+age.earnings75+
education.black+education.hispanic+education.married+education.nodegree+
education.earnings74+education.earnings75+
black.married+black.nodegree+black.earnings74+black.earnings75+
hispanic.married+hispanic.nodegree+hispanic.earnings74+hispanic.earnings75+
married.nodegree+married.earnings74+married.earnings75+
nodegree.earnings74+nodegree.earnings75+earnings74.earnings75,
family=binomial)
# Stepwise logistic regression
stepmodel=fastbw(firstmodel);
Factors in Final Model
[1] age            education      black          earnings75     age.sq
[6] education.sq   earnings74.sq  age.married    age.earnings74
# Use terms in stepwise model
secondmodel=glm(jobtraining~age+education+black+earnings75+age.sq+
education.sq+earnings74.sq+age.married+age.earnings74,family=binomial);
summary(secondmodel)
Call:
glm(formula = jobtraining ~ age + education + black + earnings75 +
age.sq + education.sq + earnings74.sq + age.married + age.earnings74,
family = binomial)
Deviance Residuals:
      Min         1Q     Median         3Q        Max
-2.107377  -0.048253  -0.012250  -0.001842   3.945850
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    -1.807e+01  1.823e+00  -9.911  < 2e-16 ***
age             8.026e-01  9.114e-02   8.807  < 2e-16 ***
education       1.034e+00  2.557e-01   4.045 5.23e-05 ***
black           3.504e+00  2.173e-01  16.125  < 2e-16 ***
earnings75     -1.973e-04  3.556e-05  -5.548 2.88e-08 ***
age.sq         -1.239e-02  1.531e-03  -8.091 5.89e-16 ***
education.sq   -6.794e-02  1.322e-02  -5.139 2.76e-07 ***
earnings74.sq   1.249e-08  2.072e-09   6.028 1.66e-09 ***
age.married    -5.477e-02  8.210e-03  -6.671 2.55e-11 ***
age.earnings74 -1.094e-05  1.773e-06  -6.170 6.81e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2022.1 on 16176 degrees of freedom
Residual deviance: 865.7 on 16167 degrees of freedom
AIC: 885.7
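As a small illustration of what a fitted model of this kind is used for, the sketch below fits a logistic regression to simulated data and extracts each unit's estimated propensity score together with its logit. Everything here is hypothetical: the data are simulated (they are not the job training data), and the names toy_age, toy_earnings and toy_treated, as well as the coefficients used to simulate treatment, are made up for the example.

```r
set.seed(921)  # arbitrary seed so the simulated data are reproducible

# Hypothetical toy data: 200 units, two covariates, binary treatment
n <- 200
toy_age <- round(runif(n, 18, 55))
toy_earnings <- round(rexp(n, rate = 1 / 8000))
treat_prob <- plogis(1.5 - 0.05 * toy_age - 0.0001 * toy_earnings)
toy_treated <- rbinom(n, size = 1, prob = treat_prob)

# Logistic regression of treatment on the covariates
ps_model <- glm(toy_treated ~ toy_age + toy_earnings, family = binomial)

# Estimated propensity score: fitted probability of treatment
pscore <- fitted(ps_model)

# Linear predictor, i.e. log(pscore / (1 - pscore))
logit_pscore <- predict(ps_model, type = "link")

# Compare treated (1) and control (0) units on the logit scale
tapply(logit_pscore, toy_treated, summary)
```

Working on the logit scale spreads out estimated probabilities that are close to 0 or 1, which is why distances for matching are often computed there.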
We now consider how to construct matched sets using the
estimated propensity scores.
In the simplest case, we seek to form matched pairs, with one
control unit matched to each treated unit.
Suppose there are a treated units and b>a control units. We can
form the matched pairs to minimize the total absolute propensity
score difference within the matched pairs. The solution to
this optimal matching problem can be formulated as finding a
minimum cost flow in a certain network, a problem that has
been extensively studied and for which good algorithms exist
(textbook, Chapter 10). An algorithm is implemented in the
optmatch package in R, which was developed by Ben Hansen.
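To see why an optimal match can differ from matching each treated unit in turn to its nearest available control, consider the toy sketch below. The propensity scores are made up, and the helper names greedy_match, optimal_match and total_cost are hypothetical. The optimal match is found here by brute-force enumeration, which is feasible only for tiny problems; it is not the network flow algorithm used by optmatch.

```r
# Hypothetical propensity scores: a = 2 treated units, b = 3 controls
treated_ps <- c(0.50, 0.60)
control_ps <- c(0.55, 0.20, 0.95)

# a x b matrix of absolute propensity score differences
cost <- abs(outer(treated_ps, control_ps, "-"))

# Greedy: each treated unit, in order, takes its nearest unused control
greedy_match <- function(cost) {
  available <- rep(TRUE, ncol(cost))
  chosen <- integer(nrow(cost))
  for (i in seq_len(nrow(cost))) {
    candidates <- which(available)
    chosen[i] <- candidates[which.min(cost[i, candidates])]
    available[chosen[i]] <- FALSE
  }
  chosen
}

# Optimal: enumerate every assignment of distinct controls to treated
# units and keep the one with the smallest total distance
optimal_match <- function(cost) {
  a <- nrow(cost)
  b <- ncol(cost)
  grid <- as.matrix(expand.grid(rep(list(seq_len(b)), a)))
  distinct <- apply(grid, 1, function(r) !anyDuplicated(r))
  grid <- grid[distinct, , drop = FALSE]
  totals <- apply(grid, 1, function(r) sum(cost[cbind(seq_len(a), r)]))
  as.integer(grid[which.min(totals), ])
}

# Total distance of a match (chosen[i] = control matched to treated i)
total_cost <- function(cost, chosen) {
  sum(cost[cbind(seq_along(chosen), chosen)])
}

greedy <- greedy_match(cost)
optimal <- optimal_match(cost)
c(greedy = total_cost(cost, greedy), optimal = total_cost(cost, optimal))
```

In this example the greedy match pairs the first treated unit with the control at 0.55 and leaves the second treated unit with the control at 0.95, for a total distance of 0.40. The optimal match gives the control at 0.55 to the second treated unit instead, for a total distance of 0.35.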
# Load optmatch (Note: Need to first install optmatch
# package)
library(optmatch);
# Match based on second model
# Construct a distance matrix which is the distance between
# log(propensity score/(1-propensity score)) of the two units
# avoiding compression of the estimated probabilities near 0
# and 1 (Rosenbaum and Rubin, American Statistician, 1985)
# Note: The distance is scaled by pooled sd of propensity
# scores
distmat=pscore.dist(secondmodel);
# Match the pairs
pairs=pairmatch(distmat);
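# Create a vector saying which control unit each treated unit is matched to.
# The code assumes each matched-set label has a two-character prefix
# followed by the set number, so dropping the prefix leaves the number.
pairs.short=substr(pairs,start=3,stop=10);
pairsnumeric=as.numeric(pairs.short);
notreated=sum(jobtraining)
pairsvec=rep(0,notreated);
# Treated units are rows 1..notreated, and set i is assumed to contain
# treated unit i. The two row indices in set i then sum to
# i + (row of the matched control), so subtracting i leaves that row.
for(i in 1:notreated){
temp=(pairsnumeric==i)*seq(1,length(pairsnumeric),1);
pairsvec[i]=sum(temp,na.rm=TRUE)-i;
}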
# Calculate standardized differences
notreated=sum(jobtraining);
treatedmat=Xmat[1:notreated,];
# Standardized differences before matching
controlmat.before=Xmat[(notreated+1):nrow(Xmat),];
controlmean.before=apply(controlmat.before,2,mean);
treatmean=apply(treatedmat,2,mean);
treatvar=apply(treatedmat,2,var);
controlvar=apply(controlmat.before,2,var);
stand.diff.before=(treatmean-controlmean.before)/sqrt((treatvar+controlvar)/2);
# Standardized differences after matching
controlmat.after=Xmat[pairsvec,];
controlmean.after=apply(controlmat.after,2,mean);
stand.diff.after=(treatmean-controlmean.after)/sqrt((treatvar+controlvar)/2);
Note about standardized differences: Even when calculating the
standardized differences after matching, we should use
$\sqrt{\left(s^2_{\text{treatment, before matching}} + s^2_{\text{control, before matching}}\right)/2}$
in the denominator of the standardized difference, because we
want to compare the difference in means to the population
standard deviation.
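The point about the denominator can be seen in a small worked sketch. The function name std_diff and the toy ages are hypothetical; the function keeps the pre-matching variances in the denominator and changes only the control mean in the numerator.

```r
# Standardized difference for one covariate.
# x_control_matched defaults to all controls, which gives the
# "before matching" value; the denominator always uses the
# variances of the full (pre-matching) treated and control groups.
std_diff <- function(x_treated, x_control_all,
                     x_control_matched = x_control_all) {
  pooled_sd <- sqrt((var(x_treated) + var(x_control_all)) / 2)
  (mean(x_treated) - mean(x_control_matched)) / pooled_sd
}

# Hypothetical ages: 3 treated units and 5 potential controls
age_treated <- c(20, 24, 28)
age_control <- c(22, 30, 38, 46, 54)

# Suppose the match selected controls 1, 2 and 3
matched_idx <- c(1, 2, 3)

before <- std_diff(age_treated, age_control)
after <- std_diff(age_treated, age_control, age_control[matched_idx])
c(before = before, after = after)
```

Here the pooled standard deviation is sqrt((16 + 160)/2) = sqrt(88) both before and after matching, so the standardized difference moves from -14/sqrt(88) to -6/sqrt(88) purely because the matched control mean is closer to the treated mean.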
> stand.diff.before
                  age             education                 black
          -0.79618331           -0.67850209            2.42774682
             hispanic               married              nodegree
          -0.05069731           -1.23264758            0.90381106
           earnings74            earnings75                age.sq
          -1.56898961           -1.74642806           -0.80312739
         education.sq         earnings74.sq         earnings75.sq
          -0.76039546           -1.29336819           -1.45018324
        age.education             age.black          age.hispanic
          -1.00522974            1.87634548           -0.13693315
          age.married          age.nodegree        age.earnings74
          -1.30457154            0.51640907           -1.49391740
       age.earnings75       education.black    education.hispanic
          -1.56397833            2.16352032           -0.05724567
    education.married    education.nodegree  education.earnings74
          -1.27808151            0.93975238           -1.48280162
 education.earnings75         black.married        black.nodegree
          -1.63222551            0.37432057            1.57594495
     black.earnings74      black.earnings75      hispanic.married
           0.22621174            0.13310904           -0.20016807
    hispanic.nodegree   hispanic.earnings74   hispanic.earnings75
           0.04923006           -0.25206779           -0.25568829
     married.nodegree    married.earnings74    married.earnings75
          -0.14903770           -1.39571408           -1.41660356
  nodegree.earnings74   nodegree.earnings75 earnings74.earnings75
          -0.40724542           -0.40028462           -1.42268950
> stand.diff.after
                  age             education                 black
         -0.337486672           0.002181047           0.170492989
             hispanic               married              nodegree
          0.108948520          -0.497291312           0.225151557
           earnings74            earnings75                age.sq
         -0.361654499          -0.679310783          -0.343167393
         education.sq         earnings74.sq         earnings75.sq
         -0.051963934          -0.063220188          -0.335412662
        age.education             age.black          age.hispanic
         -0.188750830          -0.058755590           0.074370153
          age.married          age.nodegree        age.earnings74
         -0.429197751          -0.008507845          -0.268165384
       age.earnings75       education.black    education.hispanic
         -0.462242454           0.195667895           0.098819785
    education.married    education.nodegree  education.earnings74
         -0.391722315           0.304562223          -0.270328907
 education.earnings75         black.married        black.nodegree
         -0.538952689          -0.673411243           0.294293628
     black.earnings74      black.earnings75      hispanic.married
         -0.544008546          -1.100527898           0.000000000
    hispanic.nodegree   hispanic.earnings74   hispanic.earnings75
          0.158604662          -0.025202258          -0.022508091
     married.nodegree    married.earnings74    married.earnings75
         -0.332761429          -0.161317590          -0.243809454
  nodegree.earnings74   nodegree.earnings75 earnings74.earnings75
         -0.351432252          -0.454896341          -0.167084297
# Add to second model variables that showed absolute standardized
# difference greater than 0.1
thirdmodel=glm(jobtraining~age+education+black+earnings75+age.sq+
education.sq+earnings74.sq+age.married+age.earnings74+
hispanic+married+nodegree+earnings74+earnings75.sq+
age.education+age.earnings75+
education.black+education.married+education.nodegree+
education.earnings74+education.earnings75+
black.married+black.nodegree+black.earnings74+black.earnings75+
hispanic.nodegree+
married.nodegree+married.earnings74+married.earnings75+
nodegree.earnings74+nodegree.earnings75+earnings74.earnings75,
family=binomial);
# Match based on third model
rm(distmat);
distmat=pscore.dist(thirdmodel);
pairs=pairmatch(distmat);
pairs.short=substr(pairs,start=3,stop=10);
pairsnumeric=as.numeric(pairs.short);
notreated=sum(jobtraining)
pairsvec=rep(0,notreated);
for(i in 1:notreated){
temp=(pairsnumeric==i)*seq(1,length(pairsnumeric),1);
pairsvec[i]=sum(temp,na.rm=TRUE)-i;
}
# Standardized differences after matching with third model
controlmat.after=Xmat[pairsvec,];
controlmean.after=apply(controlmat.after,2,mean);
stand.diff.after=(treatmean-controlmean.after)/sqrt((treatvar+controlvar)/2);
> stand.diff.after
                  age             education                 black
         -0.337486672           0.002181047           0.170492989
             hispanic               married              nodegree
          0.108948520          -0.497291312           0.225151557
           earnings74            earnings75                age.sq
         -0.361654499          -0.679310783          -0.343167393
         education.sq         earnings74.sq         earnings75.sq
         -0.051963934          -0.063220188          -0.335412662
        age.education             age.black          age.hispanic
         -0.188750830          -0.058755590           0.074370153
          age.married          age.nodegree        age.earnings74
         -0.429197751          -0.008507845          -0.268165384
       age.earnings75       education.black    education.hispanic
         -0.462242454           0.195667895           0.098819785
    education.married    education.nodegree  education.earnings74
         -0.391722315           0.304562223          -0.270328907
 education.earnings75         black.married        black.nodegree
         -0.538952689          -0.673411243           0.294293628
     black.earnings74      black.earnings75      hispanic.married
         -0.544008546          -1.100527898           0.000000000
    hispanic.nodegree   hispanic.earnings74   hispanic.earnings75
          0.158604662          -0.025202258          -0.022508091
     married.nodegree    married.earnings74    married.earnings75
         -0.332761429          -0.161317590          -0.243809454
  nodegree.earnings74   nodegree.earnings75 earnings74.earnings75
         -0.351432252          -0.454896341          -0.167084297
The matching is still not great on many variables. We will study
how to improve it in the next class.
Let’s ignore for a moment the problems with the match and look
at how we would make inferences about the treatment effect if
we were happy with the matches.
# Wilcoxon signed rank test
wilcox.test(earnings78[1:notreated],earnings78[pairsvec],paired=TRUE,conf.int=TRUE)
Wilcoxon signed rank test with continuity correction
data: earnings78[1:notreated] and earnings78[pairsvec]
V = 8108, p-value = 0.08057
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-149.449 2377.971
sample estimates:
(pseudo)median
1116.483
The estimated treatment effect (the pseudomedian of the paired
differences in 1978 earnings) is $1116 with a 95% confidence
interval of (-149, 2378).
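The (pseudo)median reported by wilcox.test is the Hodges-Lehmann estimate, the median of the pairwise averages (Walsh averages) of the matched-pair differences. The sketch below checks this on a handful of made-up paired differences; the numbers and the names pair_diff, walsh and hl_estimate are hypothetical and are not taken from the job training data.

```r
# Hypothetical paired differences (treated minus matched control), in dollars
pair_diff <- c(1200, -300, 2500, 800, -150, 1900)

# Walsh averages: (d_i + d_j) / 2 for all i <= j
sums <- outer(pair_diff, pair_diff, "+")
walsh <- sums[upper.tri(sums, diag = TRUE)] / 2

# Hodges-Lehmann estimate: median of the Walsh averages
hl_estimate <- median(walsh)

# One-sample signed rank test on the differences, which is
# equivalent to the paired test on the two outcome vectors
fit <- wilcox.test(pair_diff, conf.int = TRUE)

c(by_hand = hl_estimate, wilcox_test = unname(fit$estimate))
fit$conf.int
```

With no ties and no zero differences, as in this toy example, wilcox.test uses the exact signed rank distribution and its estimate agrees with the median computed by hand.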