Parametric robust regression of correlated binary data on cluster-specific covariates

Tsung-Shan Tsou
Institute of Statistics, Institute of System Biology and Bioinformatics
National Central University, Taiwan
tsou@mx.stat.ncu.edu.tw

Introduction

Methods for analyzing correlated binary data suggested in the literature fall into two broad categories. One approach is to construct a parametric model that accounts for the intra-cluster correlation. A typical way to accomplish this is to impose a random mechanism on the parameters. For instance, let $Y$ be a binomial random variable, denoted by $B(n, p)$. If one assumes that the binomial probability $p$ is beta distributed with mean $\pi$ and variance $\pi(1-\pi)\phi$, then the marginal distribution of $Y$ is known as the beta-binomial distribution (Williams, 1975; Crowder, 1978). The mean and variance of $Y$ are given by

\[ E(Y) = n\pi \quad \text{and} \quad \mathrm{Var}(Y) = n\pi(1-\pi)\{1 + (n-1)\phi\}. \]

Here $\phi$ has the interpretation of the correlation coefficient between the Bernoulli outcome variables that constitute $Y$. Since $0 \le \phi \le 1$, only overdispersion is allowed by the beta-binomial model. Nevertheless, Prentice (1986) pointed out that $\phi$ in the beta-binomial model could actually be negative, subject to the lower bound $\max\{-\pi/(n-\pi-1),\ -(1-\pi)/(n+\pi-2)\}$.

In order to model dispersion more liberally one could adopt the generalized estimating equations approach (GEE; Liang, Zeger and Qaqish, 1992), which permits more flexible modeling of the intra-cluster correlation. However, this flexibility comes with a loss of efficiency and the sacrifice of the likelihood function. Because GEE is semi-parametric in nature, full likelihood inference is not available. One obvious shortcoming is that hypothesis tests must rely on Wald's test, which has been shown to lack invariance under reparameterization (Hauck and Donner, 1977).
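As a quick numerical check of the beta-binomial moments quoted above, the following sketch sums the exact probability mass function and compares the result with $n\pi$ and $n\pi(1-\pi)\{1+(n-1)\phi\}$. The function names and the example values $n = 10$, $\pi = 0.3$, $\phi = 0.2$ are illustrative assumptions, not taken from the paper. The beta parameters $a = \pi(1-\phi)/\phi$ and $b = (1-\pi)(1-\phi)/\phi$ are obtained by matching the stated mean and variance of $p$.

```python
import math

def log_beta(u, v):
    # Logarithm of the beta function B(u, v).
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)

def beta_binomial_pmf(y, n, pi, phi):
    # p ~ Beta(a, b) with mean pi and variance pi*(1-pi)*phi,
    # which requires a + b = (1 - phi) / phi (assumes 0 < phi < 1).
    s = (1.0 - phi) / phi
    a, b = pi * s, (1.0 - pi) * s
    return math.comb(n, y) * math.exp(log_beta(y + a, n - y + b) - log_beta(a, b))

def moments_from_pmf(n, pi, phi):
    # Mean and variance obtained by direct summation over y = 0, ..., n.
    probs = [beta_binomial_pmf(y, n, pi, phi) for y in range(n + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

def moments_from_formula(n, pi, phi):
    # Closed-form mean and variance of the beta-binomial distribution.
    return n * pi, n * pi * (1.0 - pi) * (1.0 + (n - 1) * phi)

# Illustrative values only (hypothetical, not from the paper).
print(moments_from_pmf(10, 0.3, 0.2))
print(moments_from_formula(10, 0.3, 0.2))
```

With $\phi > 0$ the factor $1 + (n-1)\phi$ exceeds one, which is the overdispersion relative to the binomial variance $n\pi(1-\pi)$ discussed above.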
The aim of this paper is to introduce a parametric way of making valid inference about regression parameters associated with cluster-specific covariates. The binomial model is used as the working model and is suitably corrected to achieve this goal. One need not explicitly model the correlation in order to attain legitimate inferences. Under- and overdispersion present in the data are automatically accounted for and adjusted by the proposed robust method. It is stressed that “robustness” in this paper refers to violation of the binomial assumption and not to the presence of outliers.

Robust binomial regression

Suppose that the response variable $Y_i$ is the sum of $n_i$ binary outcomes with success probability $\mu_i$. For example, $Y_i$ might be the number of dead fetuses from a rat with litter size $n_i$. It is assumed that $\mu_i$ is linked to the linear predictor

\[ \eta_i = \beta_0 + \cdots + \beta_{p-2} x_{i,p-2} + \beta_{p-1} x_{i,p-1} \tag{1} \]

by a link function whose inverse is denoted by $g$, so that $g(\eta_i) = \mu_i$. Here $\beta_0, \ldots, \beta_{p-1}$ are $p$ regression coefficients and $x_i = (x_{i,0}, x_{i,1}, \ldots, x_{i,p-1})^t$, with $x_{i,0} = 1$, is the corresponding $p$-vector of characteristics specific to $Y_i$. The logit link, $\log\{\mu_i/(1-\mu_i)\} = \eta_i$, is a common choice, partly because of the odds ratio interpretation of the regression parameters. Many other link functions are available; see, for example, McCullagh and Nelder (1989).

The log likelihood contribution from $y_i$, under the binomial assumption, is

\[ l_i = y_i \log\{\mu_i/(1-\mu_i)\} + n_i \log(1-\mu_i), \]

and independent observations from $m$ clusters, $Y_i$, $i = 1, \ldots, m$, contribute $\sum_{i=1}^{m} l_i$, denoted by $l$ for notational simplicity. The score functions for the regression parameters are, hence,

\[ \frac{\partial l}{\partial \beta_{j-1}} = \sum_{i=1}^{m} x_{i,j-1}\, \mu_i' (y_i - n_i \mu_i) / \{\mu_i (1-\mu_i)\}, \quad j = 1, \ldots, p, \tag{2} \]

where $\mu_i'$ is the derivative of $g(\eta_i)$ with respect to $\eta_i$. Evidently, so long as $\mu_i$, $i = 1, \ldots, m$, is correctly specified, the functions in (2) are unbiased estimating functions.
Consequently, under mild regularity conditions (McCullagh, 1983), the parameter estimates (the maximum likelihood (ML) estimates) obtained by setting the score functions (2) equal to zero remain consistent even if the binomial assumption is violated. Hence, the binomial likelihood function can be properly adjusted to become asymptotically robust; see Royall and Tsou (2003).

Without loss of generality suppose now that $\beta_{p-1}$ is the parameter of interest, denoted by $\theta$. Let $Z = (z_0, \ldots, z_{p-1})$ be the $m \times p$ design matrix, so that $Z^t = (x_1, \ldots, x_m)$. Define $U = (z_0, z_1, \ldots, z_{j-1}, z_j, \ldots, z_{p-2})$ and $U_j = (z_0, z_1, \ldots, z_{p-1}, z_j, \ldots, z_{p-2})$, which is obtained from $U$ by substituting $z_{p-1}$ for the $j$th column vector $z_{j-1}$ of $U$. Let $\phi$ denote the vector of the other regression parameters $(\beta_0, \ldots, \beta_{p-2})$; from here on $\phi$ no longer stands for the correlation coefficient. Extending the robust likelihood technique for the iid case considered by Royall and Tsou (2003) to the binomial regression scenario, the factors $A$ and $B$ of the ratio $A/B$ that corrects the working likelihood function for $\theta$ are

\[ A = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} \frac{n_i\, \mu_{i,0}'^{\,2}}{\mu_{i,0}(1-\mu_{i,0})} \left( x_{i,p-1} - \sum_{j=1}^{p-1} \frac{\Delta_j}{\Delta}\, x_{i,j-1} \right)^{2} \]

and

\[ B = \lim_{m \to \infty} \frac{1}{m} \sum_{i=1}^{m} \frac{\mathrm{Var}(Y_i)\, \mu_{i,0}'^{\,2}}{\mu_{i,0}^{2}(1-\mu_{i,0})^{2}} \left( x_{i,p-1} - \sum_{j=1}^{p-1} \frac{\Delta_j}{\Delta}\, x_{i,j-1} \right)^{2}, \]

where $\mu_{i,0}$ and $\mu_{i,0}'$ denote the true values of $\mu_i$ and $\mu_i'$, respectively. Here $\mathrm{Var}(Y_i)$ denotes the true variance of $Y_i$, $\Delta = |U^t V^{-1} U|$ and $\Delta_j = |U^t V^{-1} U_j|$ are determinants, so that $\Delta_j/\Delta$ is a scalar, and

\[ V = \mathrm{diag}\left( \frac{\mu_{1,0}(1-\mu_{1,0})}{n_1\, \mu_{1,0}'^{\,2}}, \ldots, \frac{\mu_{m,0}(1-\mu_{m,0})}{n_m\, \mu_{m,0}'^{\,2}} \right). \]

Readers could consult Tsou (2006) for the derivation of $A$ and $B$.

Notice that even though the true underlying distributions are unknown, one can still estimate $A$ and $B$ consistently, for the following reason. Recall that the ML estimators based on the binomial model remain consistent despite model misspecification.
If the regression parameters are replaced by their corresponding ML estimates, then the resultant estimates of $\mu_i$ and $\mu_i'$, which under the logit link are

\[ \hat\mu_i = \frac{e^{\hat\beta_0 + \cdots + \hat\beta_{p-1} x_{i,p-1}}}{1 + e^{\hat\beta_0 + \cdots + \hat\beta_{p-1} x_{i,p-1}}} \quad \text{and} \quad \hat\mu_i' = \hat\mu_i (1 - \hat\mu_i), \]

are consistent as well. Consistent estimates of $A$ and $B$ can thus be obtained by further substituting the empirical second central moment $(y_i - n_i \hat\mu_i)^2$ for $\mathrm{Var}(Y_i)$.

Let $\hat A$ and $\hat B$ denote the empirical versions of $A$ and $B$, and let $\hat\phi(\theta_0)$ and $\hat B(\theta_0, \hat\phi(\theta_0))$ represent, respectively, the constrained ML estimators of $\phi$ and $B$ given $\theta_0$. Royall and Tsou (2003) showed that the adjusted log profile likelihood function

\[ (\hat A / \hat B)\, l(\theta, \hat\phi(\theta)) \]

is asymptotically equivalent to the profile likelihood from the true model. One could therefore employ it to obtain asymptotically valid inferences about $\theta$.

Now if we let $m$ go to infinity so that $n_i/m \to \omega_i > 0$ and $\sum \omega_i = 1$, then one can show that the adjusted likelihood ratio (LR) test statistic

\[ 2 (\hat A / \hat B) [\, l(\hat\theta, \hat\phi(\hat\theta)) - l(\theta_0, \hat\phi(\theta_0)) \,] \]

approximates a $\chi^2_1$ distribution under $H_0: \theta = \theta_0$, even if the working model is incorrect. Likewise, writing $l_\theta$ for the derivative of $l$ with respect to $\theta$, the adjusted score test statistic $l_\theta^2(\theta_0, \hat\phi(\theta_0)) / [\, m \hat B(\theta_0, \hat\phi(\theta_0)) \,]$ has an asymptotic $\chi^2_1$ distribution, and the adjusted Wald statistic $(\hat A / \sqrt{\hat B}) \sqrt{m}\, (\hat\theta - \theta_0)$ is asymptotically standard normal, since $B/A^2$ is a valid asymptotic variance of $\sqrt{m}\, \hat\theta$.

The naïve LR and score test statistics, namely $2 [\, l(\hat\theta, \hat\phi(\hat\theta)) - l(\theta_0, \hat\phi(\theta_0)) \,]$ and $l_\theta^2(\theta_0, \hat\phi(\theta_0)) / \{ m \hat A(\theta_0, \hat\phi(\theta_0)) \}$, are asymptotically chi-squared distributed only if $f = h$, that is, only if the working model $f$ coincides with the true distribution $h$ (Cox and Hinkley, 1986). Similarly, the naïve Wald statistic $\sqrt{m \hat A}\, (\hat\theta - \theta_0)$ approximates a standard normal distribution only if $1/A$ is a legitimate asymptotic variance of $\sqrt{m}\, \hat\theta$, which occurs when $f$ is equal to $h$.
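The adjusted and naïve statistics above can be sketched for the special case of a single covariate with the logit link. The code below is a minimal illustration under stated assumptions, not the author's implementation: the function names and the toy data are hypothetical, $\hat A$ and $\hat B$ are evaluated at the unconstrained ML estimates, and the covariate is centred at the weighted mean with weights $n_i \hat\mu_i (1 - \hat\mu_i)$, which is what the general expressions for $A$ and $B$ reduce to when $p = 2$.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def loglik(b0, theta, y, n, x):
    # Working binomial log likelihood: sum of y_i*eta_i - n_i*log(1 + exp(eta_i)).
    return sum(yi * (b0 + theta * xi) - ni * math.log1p(math.exp(b0 + theta * xi))
               for yi, ni, xi in zip(y, n, x))

def fit(y, n, x, theta_fixed=None, iters=50):
    # Newton-Raphson on the binomial score equations; theta is held fixed if given.
    b0 = 0.0
    theta = 0.0 if theta_fixed is None else theta_fixed
    for _ in range(iters):
        mu = [expit(b0 + theta * xi) for xi in x]
        r = [yi - ni * mi for yi, ni, mi in zip(y, n, mu)]
        w = [ni * mi * (1.0 - mi) for ni, mi in zip(n, mu)]
        s0, s1 = sum(r), sum(ri * xi for ri, xi in zip(r, x))
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        if theta_fixed is not None:
            b0 += s0 / i00
        else:
            det = i00 * i11 - i01 * i01
            b0 += (i11 * s0 - i01 * s1) / det
            theta += (i00 * s1 - i01 * s0) / det
    return b0, theta

def robust_summary(y, n, x, theta0=0.0):
    m = len(y)
    b0, theta = fit(y, n, x)                      # unconstrained ML estimates
    b0_c, _ = fit(y, n, x, theta_fixed=theta0)    # constrained fit under H0
    mu = [expit(b0 + theta * xi) for xi in x]
    w = [ni * mi * (1.0 - mi) for ni, mi in zip(n, mu)]
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    # Empirical A uses the binomial variance, empirical B the squared residuals.
    A = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)) / m
    B = sum((yi - ni * mi) ** 2 * (xi - xbar) ** 2
            for yi, ni, mi, xi in zip(y, n, mu, x)) / m
    lr_naive = 2.0 * (loglik(b0, theta, y, n, x) - loglik(b0_c, theta0, y, n, x))
    return {"theta": theta, "A": A, "B": B,
            "var_naive": 1.0 / (m * A), "var_robust": B / (m * A * A),
            "lr_naive": lr_naive, "lr_adjusted": (A / B) * lr_naive}

def chi2_1_pvalue(stat):
    # Upper tail of a chi-squared distribution with one degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical overdispersed toy data: six control and six treated clusters.
x = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
n = [10, 12, 9, 11, 13, 10, 12, 10, 11, 9, 13, 10]
y = [9, 11, 4, 10, 12, 5, 10, 3, 9, 2, 11, 3]
res = robust_summary(y, n, x)
print(res)
print(chi2_1_pvalue(res["lr_naive"]), chi2_1_pvalue(res["lr_adjusted"]))
```

Because the toy counts vary more between clusters than a binomial model allows, the empirical $B$ exceeds the empirical $A$ here, and the adjusted LR statistic is smaller than the naïve one.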
Clearly, if the intra-cluster binary responses are independent, or the within-cluster correlation is negligible, then $\mathrm{Var}(Y_i)$ is approximately equal to $n_i \mu_{i,0}(1-\mu_{i,0})$. Hence $A$ and $B$ are nearly identical and no adjustment is necessary. Overdispersion is indicated if $\mathrm{Var}(Y_i)$ exceeds $n_i \mu_{i,0}(1-\mu_{i,0})$, whereas underdispersion is revealed when $\mathrm{Var}(Y_i)$ is smaller than this naïve binomial variance. Under a simple logistic regression model $\eta_i = \beta_0 + \theta x_i$,

\[ A = \lim_{m \to \infty} \sum_{i=1}^{m} n_i \mu_{i,0}(1-\mu_{i,0})(x_i - \bar x)^2 / m \]

and

\[ B = \lim_{m \to \infty} \sum_{i=1}^{m} \mathrm{Var}(Y_i)(x_i - \bar x)^2 / m, \]

where $\bar x = \sum_{i=1}^{m} n_i \mu_{i,0}(1-\mu_{i,0})\, x_i \big/ \sum_{i=1}^{m} n_i \mu_{i,0}(1-\mu_{i,0})$ is the mean of the covariate weighted by the diagonal elements of $V^{-1}$.

Simulation studies

To demonstrate the efficacy of the proposed robust method, simulation studies are carried out in this section. Without loss of generality, a simple logistic regression model is considered. Beta-binomial data are generated according to the model

\[ \log\{\mu_i/(1-\mu_i)\} = 0.1 + \theta x_i, \quad i = 1, \ldots, m, \]

with $m = 50$, 80 and 100. The true value of $\theta$ is set to 0.5. The cluster sizes are randomly selected from 10 to 50, and the covariate $x_i$ is simulated uniformly from the interval (0, 1). The intra-cluster correlation coefficient for each cluster is allowed to vary uniformly from 0.1 to 0.9, so that the within-cluster associations are distinct. Four thousand simulation runs are performed for each $m$.

Table 1 tabulates the average of the 4,000 parameter estimates $\hat\theta$ and their sample variance, designated as $S^2$. The naïve and the adjusted robust variance estimates of $\hat\theta$, denoted respectively by $\mathrm{Var}_n(\hat\theta)$ and $\mathrm{Var}_a(\hat\theta)$, are also exhibited. The empirical type I error probability of the robust LR test (for testing $H_0: \theta = 0.5$ with a nominal level of 0.05) is denoted by $\alpha_a$, and its counterpart calculated from the naïve LR test is denoted by $\alpha_n$.

Table 1.
Robust and naïve variance estimates and type I error rates

             $\hat\theta$   $S^2$    $\mathrm{Var}_a(\hat\theta)$   $\mathrm{Var}_n(\hat\theta)$   $\alpha_a$   $\alpha_n$
$m = 50$     0.498          0.478    0.509                          0.029                          0.058        0.633
$m = 80$     0.501          0.343    0.337                          0.018                          0.055        0.648
$m = 100$    0.508          0.278    0.273                          0.015                          0.052        0.657

The accuracy of the parameter estimates is expected, owing to the unbiasedness of the estimating functions (2). The empirical type I error probabilities and the corrected asymptotic variances clearly indicate that the adjustment $A/B$ has satisfactorily corrected the binomial likelihood: $\mathrm{Var}_a(\hat\theta)$ is close to $S^2$ and $\alpha_a$ is close to the nominal 0.05, whereas the naïve variance is far too small and the naïve test rejects in more than 60% of the runs. The varying within-cluster correlations have thus been successfully accounted for by the adjustment.

Examples

Example 1.

The proposed parametric robust method is first applied to the Weil-Williams toxicology data. The data set comprises two groups of 16 pregnant female rats each, fed a control diet and a diet treated with a chemical, respectively, during pregnancy and lactation. The observations are the numbers of pups alive at 4 days ($n_i$) and the numbers of pups that survived the 21-day lactation period ($y_i$) (Weil, 1970; Williams, 1975).

The ML estimates for the logistic regression parameters are $\hat\beta_0 = 2.183$ and $\hat\theta = -0.961$. The robust estimate of the standard error of $\hat\theta$ is 0.519, whereas the naïve version is 0.285. The naïve and the adjusted robust LR test statistics for $H_0: \theta = 0$ are 9.016 and 2.713, respectively. The robust p-value is, hence, 0.100. Williams (1975) reported a p-value of 0.034 based on a chi-squared statistic of 4.48 derived from a beta-binomial model. On the other hand, Rao and Scott (1992) found the p-value to be 0.047 on the basis of the value 4.10 of their proposed chi-squared statistic.

Example 2.

Paul (1982) published an experimental data set from the Shell Toxicology Laboratory in Kent, England. The data contain the number of live fetuses in a litter affected by treatment and the total number of live fetuses in a litter.
There are four dose groups: control, low dose, medium dose, and high dose. This data set was analyzed by Rao and Scott (1992) with the high dose group omitted because of problems associated with the high toxicity. They also assigned scores 0, 1 and 2 to the three remaining dose levels, respectively, as a continuous covariate to study the dose-response trend.

The ML estimates for the logistic regression parameters are $\hat\beta_0 = -2.035$ and $\hat\theta = 0.633$, which are practically identical to those reported by Rao and Scott (1992). The robust estimate of the standard error of $\hat\theta$ is 0.190, which is slightly smaller than the 0.21 given by Rao and Scott (1992), who used data transformed according to the method they suggested. The naïve standard error estimate is, in contrast, 0.122. The naïve and the adjusted robust LR test statistics for $H_0: \theta = 0$ are 22.353 and 9.181, respectively. The latter gives a p-value of 0.0024, while Rao and Scott (1992) found the p-value to be less than 0.002. Note that our tests are likelihood-based and are therefore superior in terms of efficiency to the non-parametric approach adopted by Rao and Scott (1992).

Conclusions

A parametric robust approach is proposed to analyze correlated binary data. This method is applicable to any sensible link function that relates the mean response to cluster-specific covariates. Unlike the generalized estimating equations approach, this parametric method supplies asymptotically valid likelihood functions. Full likelihood inferences for regression parameters are hence made available.

The proposed method is easy to implement, and no additional programming is necessary. One could employ standard statistical packages to obtain the regression parameter estimates. Naïve and robust variance estimates are easily obtainable from software such as the SAS procedure GENMOD. The adjustment $A/B$ is simply the naïve asymptotic variance $1/(mA)$ divided by the robust version $B/(mA^2)$.
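The variance-ratio form of the adjustment can be checked against the figures reported in the two examples. The sketch below takes the published naïve and robust standard errors, forms their squared ratio as an approximation to $\hat A/\hat B$, and rescales the naïve LR statistics. The function names are hypothetical, and since the inputs are rounded to three decimals the results should only be expected to agree approximately with the reported adjusted statistics (2.713 and 9.181) and p-values (0.100 and 0.0024).

```python
import math

def adjustment_from_standard_errors(se_naive, se_robust):
    # A/B equals the naive variance divided by the robust variance.
    return (se_naive / se_robust) ** 2

def chi2_1_pvalue(stat):
    # Upper tail of a chi-squared distribution with one degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

# (label, naive SE, robust SE, naive LR statistic), as reported in the examples.
reported = [
    ("Example 1", 0.285, 0.519, 9.016),
    ("Example 2", 0.122, 0.190, 22.353),
]

adjusted = {}
for label, se_naive, se_robust, lr_naive in reported:
    ratio = adjustment_from_standard_errors(se_naive, se_robust)
    adjusted[label] = ratio * lr_naive
    print(label, round(ratio, 3), round(adjusted[label], 3),
          round(chi2_1_pvalue(adjusted[label]), 4))
```

In practice the same two standard errors can be read off the output of any package that reports both model-based and empirical variance estimates, which is why no dedicated programming is needed.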
An asymptotically legitimate likelihood function and, consequently, a valid likelihood ratio test are then readily in place.

References

Cox, D. R. and Hinkley, D. V. (1986). Theoretical Statistics. New York: Chapman and Hall.
Crowder, M. J. (1978). Beta-binomial ANOVA for proportions. Applied Statistics, 27, 34-37.
Hauck, W. W. and Donner, A. (1977). Wald's test as applied to hypotheses in logit analysis. Journal of the American Statistical Association, 72, 851-853.
Huber, P. J. (1981). Robust Statistics. New York: John Wiley.
Liang, K. Y., Zeger, S. L. and Qaqish, B. (1992). Multivariate regression analyses for categorical data (with discussion). Journal of the Royal Statistical Society, B, 54, 3-40.
McCullagh, P. (1983). Quasi-likelihood functions. Annals of Statistics, 11, 59-67.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. 2nd ed. New York: Chapman and Hall.
Paul, S. R. (1982). Analysis of proportions of affected fetuses in teratological experiments. Biometrics, 38, 361-370.
Prentice, R. L. (1986). Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association, 81, 321-327.
Rao, J. N. K. and Scott, A. J. (1992). A simple method for the analysis of clustered binary data. Biometrics, 48, 577-585.
Royall, R. M. and Tsou, T. S. (2003). Interpreting statistical evidence using imperfect models: robust adjusted likelihood functions. Journal of the Royal Statistical Society, B, 65, 391-404.
Tsou, T. S. (2006). Robust Poisson regression. Journal of Statistical Planning and Inference, in press.
Weil, C. S. (1970). Selection of the valid number of sampling units and a consideration of their combination in toxicological studies involving reproduction, teratogenesis or carcinogenesis. Food and Cosmetics Toxicology, 8, 177-182.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50, 1-25.
Williams, D. A. (1975). The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics, 31, 949-952.