Parametric robust regression of correlated binary data on cluster-specific covariates

Tsung-Shan Tsou
Institute of Statistics
Institute of System Biology and bioinformatics
National Central University, Taiwan
tsou@mx.stat.ncu.edu.tw
Introduction
Methods for analyzing correlated binary data suggested in the literature fall into two broad categories. One approach is to construct a parametric model that accounts for the intra-cluster correlation. A typical way to accomplish this is to impose a random mechanism on the parameters. For instance, let $Y$ be a binomial random variable, denoted by $B(n, p)$. If one assumes that the binomial probability $p$ is beta distributed with mean $\pi$ and variance $\pi(1-\pi)\phi$, then the marginal distribution of $Y$ is known as the beta-binomial distribution (Williams, 1975; Crowder, 1978). The mean and variance of $Y$ are given by $E(Y) = n\pi$ and
$$\mathrm{Var}(Y) = n\pi(1-\pi)\{1 + (n-1)\phi\}.$$
Here $\phi$ has the interpretation of the correlation coefficient between the Bernoulli outcome variables that constitute $Y$. Since $0 \le \phi \le 1$, only overdispersion is allowed by the beta-binomial model. Nevertheless, Prentice (1986) pointed out that $\phi$ in the beta-binomial model could actually be negative, but with a lower bound, $\max\{-\pi/(n-\pi-1),\, -(1-\pi)/(n+\pi)\}$.
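The mean and variance formulas above can be checked numerically. The following is a minimal sketch, assuming the beta mixing distribution is parameterized so that its two shape parameters sum to $(1-\phi)/\phi$ (which gives mean $\pi$ and variance $\pi(1-\pi)\phi$); the function names and the values of `n`, `pi` and `phi` are illustrative and do not come from the paper.

```python
import math

def beta_binomial_pmf(y, n, pi, phi):
    """P(Y = y) when Y | p ~ Binomial(n, p) and p ~ Beta(a, b).

    The shape parameters are chosen so that p has mean pi and
    variance pi * (1 - pi) * phi, i.e. a + b = (1 - phi) / phi.
    """
    a = pi * (1.0 - phi) / phi
    b = (1.0 - pi) * (1.0 - phi) / phi
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    log_beta_num = math.lgamma(y + a) + math.lgamma(n - y + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp(log_choose + log_beta_num - log_beta_den)

def beta_binomial_moments(n, pi, phi):
    """Mean and variance of Y obtained by summing over the exact pmf."""
    probs = [beta_binomial_pmf(y, n, pi, phi) for y in range(n + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

n, pi, phi = 12, 0.3, 0.2  # illustrative values, not from the paper
mean, var = beta_binomial_moments(n, pi, phi)
expected_var = n * pi * (1 - pi) * (1 + (n - 1) * phi)
print(mean, n * pi)        # the two numbers should agree
print(var, expected_var)   # the two numbers should agree
```

With $\phi = 0.2$ and $n = 12$ the variance is inflated by the factor $1 + (n-1)\phi = 3.2$ relative to the binomial variance, which is the overdispersion the text refers to.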
In order to model dispersion more liberally one could adopt the generalized estimating equations (GEE; Liang, Zeger and Qaqish, 1992) approach, which permits more flexible modeling of the intra-cluster correlation. However, this flexibility comes at the cost of a loss of efficiency and the sacrifice of the likelihood function. Because the GEE methodology is semi-parametric in nature, it does not support full likelihood inference. One obvious shortcoming is that in order to perform hypothesis tests one has to rely on the Wald test, which has been shown to lack the invariance property under reparameterization (Hauck and Donner, 1977).
The aim of this paper is to introduce a parametric way of making valid inference about regression parameters associated with cluster-specific covariates. The binomial model is utilized as the working model and is suitably corrected to achieve this goal. One need not explicitly model the correlation in order to attain legitimate inferences. Under- and overdispersion present in the data are automatically accounted for and adjusted by the proposed robust method. It is stressed that "robustness" in this paper refers to the violation of the binomial assumption and not to the presence of outliers.
Robust binomial regression
Suppose that the response variable $Y_i$ is the sum of $n_i$ binary outcomes with success probability $\mu_i$. For example, $Y_i$ might be the number of dead fetuses from a rat with litter size $n_i$. It is assumed that $\mu_i$ is linked to the linear predictor
$$\eta_i = \beta_0 + \cdots + \beta_{p-2}\, x_{i,p-2} + \beta_{p-1}\, x_{i,p-1} \qquad (1)$$
by a link function, say, $g$, so that $g(\eta_i) = \mu_i$. Here $\beta_0, \ldots, \beta_{p-1}$ are $p$ regression coefficients and $x_i = (x_{i,0}, x_{i,1}, \ldots, x_{i,p-1})^t$, $x_{i,0} = 1$, is the corresponding $p$-vector of characteristics specific to $Y_i$. The logit link function $\log\{\mu_i/(1-\mu_i)\} = \eta_i$ is a typical and common choice of the link, partly due to the odds ratio interpretation of the regression parameters. A wide choice of link functions is also available; see, for example, McCullagh and Nelder (1989).
The log likelihood contribution from $y_i$, under the binomial assumption, is
$$l_i = y_i \log\{\mu_i/(1-\mu_i)\} + n_i \log(1-\mu_i),$$
and independent observations from $m$ clusters $Y_i$, $i = 1, \ldots, m$, contribute $\sum_{i=1}^{m} l_i$, denoted by $l$ for notational simplicity. The score functions for the regression parameters are, hence,
$$\frac{\partial l}{\partial \beta_{j-1}} = \sum_{i=1}^{m} x_{i,j-1}\, \mu_i' \,(y_i - n_i\mu_i)/\{\mu_i(1-\mu_i)\}, \quad j = 1, \ldots, p, \qquad (2)$$
where $\mu_i'$ is the derivative of $g(\eta_i)$ with respect to $\eta_i$. As long as $\mu_i$, $i = 1, \ldots, m$, is correctly specified, the functions in (2) are unbiased estimating functions. Consequently, under mild regularity conditions (McCullagh, 1983), the parameter estimates (maximum likelihood (ML) estimates) obtained by solving (2) = 0 remain consistent even if the binomial assumption is violated. Hence, the binomial likelihood function can be properly adjusted to become asymptotically robust; see Royall and Tsou (2003).
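Solving (2) = 0 is an ordinary binomial regression fit. The sketch below is one possible way to do it for the simple logistic case $\eta_i = \beta_0 + \theta x_i$, using Newton-Raphson iterations in plain Python; the function name `fit_binomial_logit` and the toy data are hypothetical and serve only to illustrate the score equations, which under the logit link reduce to $\sum_i x_{i,j-1}(y_i - n_i\mu_i) = 0$.

```python
import math

def fit_binomial_logit(y, n, x, tol=1e-10, max_iter=50):
    """Solve the score equations (2) = 0 for eta_i = b0 + theta * x_i (logit link)."""
    b0, theta = 0.0, 0.0
    for _ in range(max_iter):
        mu = [1.0 / (1.0 + math.exp(-(b0 + theta * xi))) for xi in x]
        resid = [yi - ni * mi for yi, ni, mi in zip(y, n, mu)]
        w = [ni * mi * (1.0 - mi) for ni, mi in zip(n, mu)]
        # Score vector: for the logit link mu' = mu(1 - mu), so the weights cancel.
        u0 = sum(resid)
        u1 = sum(xi * ri for xi, ri in zip(x, resid))
        # Binomial information matrix (2 x 2), inverted by hand.
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, theta = b0 + d0, theta + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, theta

# Toy data (hypothetical): successes, cluster sizes, covariate values.
y = [2, 5, 6, 9, 12]
n = [10, 12, 11, 13, 15]
x = [0.0, 0.5, 1.0, 1.5, 2.0]
b0_hat, theta_hat = fit_binomial_logit(y, n, x)
print(b0_hat, theta_hat)
```

Nothing in this fit uses the binomial variance beyond the choice of iteration weights, which is why the estimates stay consistent when only the mean is correctly specified.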
Without loss of generality suppose now that $\beta_{p-1}$ is the parameter of interest, denoted by $\theta$. Let $Z = (z_0, \ldots, z_{p-1})$ be the $m \times p$ design matrix, so that $Z^t = (x_1, \ldots, x_m)$. Define
$$U = (z_0, z_1, \ldots, z_{j-1}, z_j, \ldots, z_{p-2})$$
and
$$U_j = (z_0, z_1, \ldots, z_{p-1}, z_j, \ldots, z_{p-2}),$$
which is obtained from $U$ by substituting $z_{p-1}$ for the $j$th column vector $z_{j-1}$ of $U$. Let $\phi$ denote the vector of the other regression parameters $(\beta_0, \ldots, \beta_{p-2})$.
Extending the robust likelihood technique for the iid cases considered by Royall and Tsou (2003) to the binomial regression scenario, the factors $A$ and $B$ of $A/B$ that corrects the working likelihood function for $\theta$ are
$$A = \lim_{m\to\infty} \frac{1}{m} \sum_{i=1}^{m} \frac{n_i\,(\mu_{i,0}')^{2}}{\mu_{i,0}(1-\mu_{i,0})} \left( x_{i,p-1} - \sum_{j=1}^{p-1} \frac{\Delta_j}{\Delta}\, x_{i,j-1} \right)^{2}$$
and
$$B = \lim_{m\to\infty} \frac{1}{m} \sum_{i=1}^{m} \frac{\mathrm{Var}(Y_i)\,(\mu_{i,0}')^{2}}{\mu_{i,0}^{2}(1-\mu_{i,0})^{2}} \left( x_{i,p-1} - \sum_{j=1}^{p-1} \frac{\Delta_j}{\Delta}\, x_{i,j-1} \right)^{2},$$
where $\mu_{i,0}$ and $\mu_{i,0}'$ denote the true values of $\mu_i$ and $\mu_i'$, respectively. Here $\mathrm{Var}(Y_i)$ denotes the true variance of $Y_i$, and $\Delta = U^t V^{-1} U$, $\Delta_j = U^t V^{-1} U_j$ and
$$V = \mathrm{diag}\left( \frac{\mu_{1,0}(1-\mu_{1,0})}{n_1 (\mu_{1,0}')^{2}}, \ldots, \frac{\mu_{m,0}(1-\mu_{m,0})}{n_m (\mu_{m,0}')^{2}} \right).$$
Readers could consult Tsou (2006) for the derivation of $A$ and $B$.
Notice that even though the true underlying distributions are unknown, one can still estimate $A$ and $B$ consistently, for the following reason. Recall that the ML estimators based on the binomial model remain consistent despite model misspecification. If the regression parameters are replaced by their corresponding ML estimates, then the resultant estimates of $\mu_i$ and $\mu_i'$ under the logit link,
$$\hat\mu_i = \frac{e^{\hat\beta_0 + \cdots + \hat\beta_{p-1} x_{i,p-1}}}{1 + e^{\hat\beta_0 + \cdots + \hat\beta_{p-1} x_{i,p-1}}} \quad \text{and} \quad \hat\mu_i' = \hat\mu_i(1-\hat\mu_i),$$
are consistent as well. Consistent estimates for $A$ and $B$ can thus be obtained by further substituting the empirical second central moment $(y_i - n_i\hat\mu_i)^2$ for $\mathrm{Var}(Y_i)$.
Let $\hat A$ and $\hat B$ denote the empirical versions of $A$ and $B$, and let $\hat\phi(\theta_0)$ and $\hat B(\theta_0, \hat\phi(\theta_0))$ represent, respectively, the constrained ML estimators of $\phi$ and $B$ given $\theta_0$. Royall and Tsou (2003) showed that the adjusted log profile likelihood function
$$(\hat A/\hat B)\, l(\theta, \hat\phi(\theta))$$
is asymptotically equivalent to the profile likelihood yielded from the true model. One could therefore employ it to acquire asymptotically valid inferences about $\theta$. Now if we let $m$ go to infinity so that $n_i/m \to \omega_i > 0$ and $\sum \omega_i = 1$, then one could show that the adjusted likelihood ratio (LR) test statistic
$$2(\hat A/\hat B)\,[\,l(\hat\theta, \hat\phi(\hat\theta)) - l(\theta_0, \hat\phi(\theta_0))\,]$$
approximates a $\chi^2_1$ distribution under $H_0: \theta = \theta_0$, even if the working model is incorrect. Likewise we can demonstrate that the adjusted score test statistic $l_\theta^2(\theta_0, \hat\phi(\theta_0))/[m\hat B(\theta_0, \hat\phi(\theta_0))]$ has an asymptotic $\chi^2_1$ distribution, and that the adjusted Wald statistic $(\hat A/\sqrt{\hat B})\,\sqrt{m}\,(\hat\theta - \theta_0)$ is asymptotically standard normal, for $B/A^2$ is a valid asymptotic variance of $\sqrt{m}\,\hat\theta$.
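The role of $A$ and $B$ in these three statistics can be seen from the usual sandwich argument for misspecified models. The display below is only a heuristic sketch under standard regularity conditions, not the derivation given in Royall and Tsou (2003) or Tsou (2006); $S_m(\theta_0)$ is shorthand introduced here for the profile score $l_\theta(\theta_0, \hat\phi(\theta_0))$.

```latex
\begin{align*}
m^{-1/2} S_m(\theta_0) &\xrightarrow{d} N(0, B),
  \qquad -m^{-1}\,\partial S_m(\theta_0)/\partial\theta \xrightarrow{p} A, \\
\sqrt{m}\,(\hat\theta - \theta_0) &\approx A^{-1} m^{-1/2} S_m(\theta_0)
  \xrightarrow{d} N(0,\, B/A^{2}), \\
2\,[\,l(\hat\theta, \hat\phi(\hat\theta)) - l(\theta_0, \hat\phi(\theta_0))\,]
  &\approx A\, m\,(\hat\theta - \theta_0)^{2}
  \xrightarrow{d} (B/A)\,\chi^2_1 .
\end{align*}
```

Under this sketch, multiplying the naïve LR statistic by $A/B$ restores the $\chi^2_1$ limit, dividing the squared score by $mB$ rather than $mA$ does the same for the score test, and no correction is needed when $A = B$.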
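The naïve LR and score tests, namely, $2[\,l(\hat\theta, \hat\phi(\hat\theta)) - l(\theta_0, \hat\phi(\theta_0))\,]$ and $l_\theta^2(\theta_0, \hat\phi(\theta_0))/\{m\hat A(\theta_0, \hat\phi(\theta_0))\}$, are asymptotically chi-squared distributed only if $f = h$, that is, only if the working model coincides with the true model (Cox and Hinkley, 1976). Similarly, the naïve Wald statistic $\sqrt{m\hat A}\,(\hat\theta - \theta_0)$ approximates a standard normal distribution only if $1/A$ is a legitimate asymptotic variance of $\sqrt{m}\,\hat\theta$, which occurs when $f$ is equal to $h$.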
Clearly, if the intra-cluster binary responses are independent or the within-cluster correlation is negligible, then $\mathrm{Var}(Y_i)$ is approximately equal to $n_i\mu_{i,0}(1-\mu_{i,0})$. Hence $A$ and $B$ are nearly identical so that no adjustment is necessary. An overdispersed cluster is indicated if $\mathrm{Var}(Y_i)$ is in excess of $n_i\mu_{i,0}(1-\mu_{i,0})$, whereas underdispersion is revealed when $\mathrm{Var}(Y_i)$ is smaller than the naïve binomial variance.
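Under a simple logistic regression model $\eta_i = \beta_0 + \theta x_i$,
$$A = \lim_{m\to\infty} \sum_{i=1}^{m} n_i\mu_{i,0}(1-\mu_{i,0})(x_i - \bar x)^2 / m$$
and
$$B = \lim_{m\to\infty} \sum_{i=1}^{m} \mathrm{Var}(Y_i)(x_i - \bar x)^2 / m,$$
where $\bar x = \sum_{i=1}^{m} n_i\mu_{i,0}(1-\mu_{i,0})\, x_i \big/ \sum_{i=1}^{m} n_i\mu_{i,0}(1-\mu_{i,0})$ is the weighted covariate mean to which $\Delta_1/\Delta$ reduces when $p = 2$.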
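For the simple logistic model the plug-in estimates of $A$ and $B$ take only a few lines. The sketch below is an illustration under two stated assumptions: the covariate mean is weighted by $n_i\hat\mu_i(1-\hat\mu_i)$, which is what $\Delta_1/\Delta$ gives for $p = 2$ with $V$ as defined earlier, and $\mathrm{Var}(Y_i)$ is replaced by $(y_i - n_i\hat\mu_i)^2$. The function name, the toy data and the coefficient values are hypothetical; in practice `b0_hat` and `theta_hat` would be the ML estimates.

```python
import math

def robust_adjustment(y, n, x, b0_hat, theta_hat):
    """Plug-in estimates of A and B for eta_i = b0 + theta * x_i (logit link)."""
    m = len(y)
    mu = [1.0 / (1.0 + math.exp(-(b0_hat + theta_hat * xi))) for xi in x]
    # Naive binomial variances n_i * mu_i * (1 - mu_i), also used as weights.
    w = [ni * mi * (1.0 - mi) for ni, mi in zip(n, mu)]
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    a_hat = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x)) / m
    # (y_i - n_i * mu_i)^2 stands in for the unknown Var(Y_i).
    b_hat = sum((yi - ni * mi) ** 2 * (xi - x_bar) ** 2
                for yi, ni, mi, xi in zip(y, n, mu, x)) / m
    naive_var = 1.0 / (m * a_hat)           # naive variance of theta-hat
    robust_var = b_hat / (m * a_hat ** 2)   # adjusted (robust) variance of theta-hat
    return a_hat, b_hat, naive_var, robust_var

# Toy data and coefficient values (hypothetical stand-ins for ML estimates).
y = [2, 5, 6, 9, 12]
n = [10, 12, 11, 13, 15]
x = [0.0, 0.5, 1.0, 1.5, 2.0]
a_hat, b_hat, naive_var, robust_var = robust_adjustment(y, n, x, -1.3, 1.4)
print(a_hat / b_hat, naive_var / robust_var)  # the two ratios coincide
```

The ratio `a_hat / b_hat` is the factor that multiplies the naïve log profile likelihood; values below one indicate overdispersion relative to the binomial working model.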
Simulation studies
To demonstrate the efficacy of the proposed robust method, simulation studies are carried out in this section. Without loss of generality, a simple logistic regression model is considered. Beta-binomial data are generated according to the model $\log\{\mu_i/(1-\mu_i)\} = 0.1 + \theta x_i$, $i = 1, \ldots, m$, with $m = 50$, 80 and 100. The true value of $\theta$ is set to 0.5. The cluster sizes are randomly selected from 10 to 50, and the covariate $x_i$ is simulated uniformly from the interval (0, 1). The intra-cluster correlation coefficient for each cluster is allowed to vary uniformly from 0.1 to 0.9, so that the within-cluster associations are distinct.
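One way to generate data of this kind is sketched below. It is not the author's simulation code: the function name `simulate_clusters` and the seed are hypothetical, cluster sizes are assumed to be uniform on the integers 10 to 50, and each cluster's success probability is drawn from a beta distribution with mean $\mu_i$ and variance $\mu_i(1-\mu_i)\phi_i$.

```python
import math
import random

def simulate_clusters(m, theta=0.5, beta0=0.1, seed=0):
    """Draw m beta-binomial clusters following the simulation design in the text."""
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        n_i = rng.randint(10, 50)        # cluster size (assumed uniform on 10..50)
        x_i = rng.uniform(0.0, 1.0)      # cluster-specific covariate
        phi_i = rng.uniform(0.1, 0.9)    # intra-cluster correlation
        mu_i = 1.0 / (1.0 + math.exp(-(beta0 + theta * x_i)))
        # Beta shape parameters giving mean mu_i and variance mu_i(1 - mu_i)phi_i.
        a = mu_i * (1.0 - phi_i) / phi_i
        b = (1.0 - mu_i) * (1.0 - phi_i) / phi_i
        p_i = rng.betavariate(a, b)
        y_i = sum(1 for _ in range(n_i) if rng.random() < p_i)
        data.append((y_i, n_i, x_i))
    return data

clusters = simulate_clusters(50)
print(clusters[:3])
```

Each tuple holds $(y_i, n_i, x_i)$ for one cluster, which is the input needed to fit the working binomial model and to compute the adjustment.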
Four thousand simulation runs are performed for each $m$. Table 1 tabulates the average of the 4,000 parameter estimates $\hat\theta$ and the sample variance of $\hat\theta$, designated as $S^2$. The naïve and the adjusted robust variance estimates of $\hat\theta$, denoted respectively by $\mathrm{Var}_n(\hat\theta)$ and $\mathrm{Var}_a(\hat\theta)$, are also exhibited. The empirical type I error probability of the robust LR test (for testing $H_0: \theta = 0.5$ with a nominal level of 0.05) is denoted by $\alpha_a$, and the counterpart calculated from the naïve LR test is also given and denoted by $\alpha_n$.
Table 1. Robust and naïve variance estimates and type I error rates

           $\hat\theta$   $S^2$    $\mathrm{Var}_a(\hat\theta)$   $\mathrm{Var}_n(\hat\theta)$   $\alpha_a$   $\alpha_n$
m = 50     0.498          0.478    0.509                          0.029                          0.058        0.633
m = 80     0.501          0.343    0.337                          0.018                          0.055        0.648
m = 100    0.508          0.278    0.273                          0.015                          0.052        0.657
The accuracy of the parameter estimates is expected, owing to the unbiasedness of the estimating functions (2). The empirical type I error probabilities and the corrected asymptotic variances clearly indicate that the adjustment $A/B$ has satisfactorily corrected the binomial likelihood. The varying within-cluster correlations have evidently been accounted for by the adjustment.
Examples
Example 1.
The proposed parametric robust method is first applied to the Weil-Williams toxicology data. The data set comprises two groups of 16 pregnant female rats each, fed a control diet and a diet treated with a chemical, respectively, during pregnancy and lactation. The observations are the numbers of pups alive at 4 days ($n_i$) and the numbers of pups that survived the 21-day lactation period ($y_i$) (Weil, 1970; Williams, 1975).
The ML estimates for the logistic regression parameters are $\hat\beta_0 = 2.183$ and $\hat\theta = -0.961$. The robust estimate of the standard error of $\hat\theta$ is calculated as 0.519, whereas the naïve version is 0.285. The naïve and the adjusted robust LR test statistics for $H_0: \theta = 0$ are 9.016 and 2.713, respectively. The robust p-value is, hence, 0.100.
Williams (1975) reported a p-value of 0.034 based on a chi-squared statistic of 4.48 derived from a beta-binomial model. On the other hand, Rao and Scott (1992) found the p-value to be 0.047 on the basis of the value 4.10 derived from their proposed chi-squared statistic.
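The figures reported for this example can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above, together with the relation stated in the Conclusions that $A/B$ equals the naïve variance divided by the robust variance; the helper `chi2_1_pvalue` is a hypothetical name, and it relies on the identity $P(\chi^2_1 > s) = \mathrm{erfc}(\sqrt{s/2})$.

```python
import math

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

naive_se, robust_se = 0.285, 0.519   # reported standard errors of theta-hat
naive_lr, robust_lr = 9.016, 2.713   # reported naive and adjusted LR statistics

# Estimated A/B: naive variance divided by robust variance.
adjustment = (naive_se / robust_se) ** 2
print(round(adjustment, 3))                   # roughly 0.30
print(round(adjustment * naive_lr, 2))        # close to the reported 2.713
print(round(chi2_1_pvalue(robust_lr), 3))     # close to the reported 0.100
print(chi2_1_pvalue(naive_lr))                # p-value the naive test would give
```

The small gap between the rescaled naïve statistic and 2.713 is consistent with rounding in the reported standard errors.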
Example 2.
Paul (1982) published an experimental data set taken from the Shell Toxicology Laboratory in Kent, England. The data contain the number of live fetuses in a litter affected by treatment and the total number of live fetuses in a litter. There are four dose groups: control, low dose, medium dose, and high dose. This data set was analyzed by Rao and Scott (1992) with the high dose group omitted because of problems associated with the high toxicity. They also assigned scores 0, 1 and 2 to the three remaining dose levels, respectively, as a continuous covariate to study the dose-response trend.
The ML estimates for the logistic regression parameters are $\hat\beta_0 = -2.035$ and $\hat\theta = 0.633$, which are practically identical to those reported by Rao and Scott (1992). The calculated robust estimate of the standard error of $\hat\theta$ is 0.190, which is slightly smaller than the 0.21 given by Rao and Scott (1992), who used data transformed according to the method they suggested. The naïve standard error estimate is, in contrast, 0.122.
The naïve and the adjusted robust LR test statistics for $H_0: \theta = 0$ are 22.353 and 9.181, respectively. The latter gives a p-value of 0.0024, while Rao and Scott (1992) found the p-value to be less than 0.002. Note that our test results are likelihood-based, which makes them superior in terms of efficiency to the non-parametric approach adopted by Rao and Scott.
Conclusions
A parametric robust approach is proposed to analyze correlated binary
data. This method is applicable to any sensible link function that relates the
mean response to cluster-specific covariates. Unlike the generalized estimating equations approach, this parametric method supplies asymptotically valid likelihood functions. Full likelihood inferences for regression
parameters are hence made available.
It is easy to implement the proposed method. No additional programming is necessary. One could employ standard statistical packages to get regression parameter estimates. Naïve and robust variance estimates are easily obtainable with software such as the SAS procedure GENMOD. The adjustment $A/B$ is simply the naïve asymptotic variance $1/(mA)$ divided by the robust version $B/(mA^2)$. An asymptotically legitimate likelihood function and, consequently, a valid likelihood ratio test are readily in place.
References
Cox, D. R. and Hinkley, D. V. (1986). Theoretical statistics. New York: Chapman
and Hall.
Crowder, M. J. (1978). Beta-binomial ANOVA for proportions. Applied Statistics,
27, 34-37.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in
logit analysis. Journal of the American Statistical Association, 72, 851-853.
Huber, P. J. (1981). Robust statistics. New York: John Wiley.
Liang, K. Y., Zeger, S. L. and Qaqish, B. (1992). Multivariate regression analyses
for categorical data (with discussion). Journal of the Royal Statistical Society, B,
54, 3-40.
McCullagh, P. (1983). Quasi-likelihood functions. Annals of Statistics, 11, 59-67.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. 2nd ed. New
York: Chapman and Hall.
Paul, S. R. (1982). Analysis of proportions of affected fetuses in teratological experiments. Biometrics, 38, 361-370.
Prentice, R. L. (1986). Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association, 81, 321-327.
Rao, J. N. K. and Scott, A. J. (1992). A simple method for the analysis of clustered
binary data. Biometrics, 48, 577-585.
Royall, R. M. and Tsou, T. S. (2003). Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. Journal of the Royal Statistical Society, B, 65, 391-404.
Tsou, T. S. (2006). Robust Poisson regression. (Journal of Statistical Planning
and Inference, in press).
Weil, C. S. (1970). Selection of the valid number of sampling units and a consideration of their combination in toxicological studies involving reproduction,
teratogenesis or carcinogenesis. Food and Cosmetics Toxicology, 8, 177-182.
White, H. (1982). Maximum likelihood estimation of misspecified models.
Econometrica, 50, 1-25.
Williams, D. A. (1975). The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics, 31, 949-952.