J. R. Statist. Soc. A (2001) 164, Part 3, pp. 449–465

A simple method for estimating a regression model for κ between a pair of raters

Stuart R. Lipsitz, Medical University of South Carolina, Charleston, USA
John Williamson, Centers for Disease Control, Atlanta, USA
Neil Klar, Cancer Care Ontario, Toronto, Canada
Joseph Ibrahim, Harvard School of Public Health and Dana-Farber Cancer Institute, Boston, USA
and Michael Parzen, University of Chicago, USA

[Received November 1999. Revised December 2000]

Summary. Agreement studies commonly occur in medical research, for example, in the review of X-rays by radiologists, blood tests by a panel of pathologists and the evaluation of psychopathology by a panel of raters. In these studies, often two observers rate the same subject for some characteristic with a discrete number of levels. The κ-coefficient is a popular measure of agreement between the two raters. The κ-coefficient may depend on covariates, i.e. characteristics of the raters and/or the subjects being rated. Our research was motivated by two agreement problems. The first is a study of agreement between a pastor and a co-ordinator of Christian education on whether they feel that the congregation puts enough emphasis on encouraging members to work for social justice (yes versus no). We wish to model the κ-coefficient as a function of covariates such as political orientation (liberal versus conservative) of the pastor and co-ordinator. The second example is a spousal education study, in which we wish to model the κ-coefficient as a function of covariates such as the highest degree of the father of the wife and the father of the husband. We propose a simple method to estimate the regression model for the κ-coefficient, which consists of two logistic (or multinomial logistic) regressions and one linear regression for binary data. The estimates can be easily obtained in any generalized linear model software program.
Keywords: Generalized estimating equations; Generalized linear model; Measure of agreement

Address for correspondence: Stuart R. Lipsitz, Department of Biometry and Epidemiology, Medical University of South Carolina, Suite 1148, 135 Rutledge Avenue, PO Box 250551, Charleston, SC 29425, USA. E-mail: lipsitzs@musc.edu

© 2001 Royal Statistical Society 0964–1998/01/164449

1. Introduction

Agreement studies are common in medical research. For example, researchers are interested in agreement between radiologists in the review of X-rays, pathologists in reviewing blood tests and raters in evaluating patients' mental condition. When two observers rate subjects on a categorical scale, the κ-coefficient proposed by Cohen (1960) is a popular chance-corrected measure of interobserver agreement. Although not always the case, it is often true in such studies that each subject is rated by only two raters. The κ-coefficient may depend on covariates, i.e. characteristics of the raters and/or the subjects being rated. The characteristics of the raters are called 'rater-specific' covariates; the characteristics of the subject are called 'subject-specific' covariates.

In this paper, we propose a simple method to estimate the regression model of the κ-coefficient as a function of covariates, when each subject is rated by two raters. For ratings on a dichotomous scale, estimates can be obtained by computing two ordinary logistic regressions and one linear regression for binary data. If the rating is on a categorical scale, all that we need to do is to compute two ordinal or polytomous logistic regressions and one linear regression for binary data. These estimates can be easily obtained in any generalized linear model software program.

We use the following two examples to illustrate the methods. We note that both data sets are selected subsamples downloaded from the World Wide Web.
The interpretation and results given here do not necessarily reflect the results and views of the data collection and funding agencies for either data set.

1.1. Example 1: pastor and co-ordinator educator study

Our first example is from the 'Effective Christian education study' (Benson and Eklin, 1991), which was a study of US Protestant congregations. In this study, one pastor and one co-ordinator of Christian education from each of 395 randomly chosen congregations provided detailed data on faith, loyalty, congregational life and the dynamics of Christian education programming. In our example, both the pastor and the co-ordinator were asked whether they felt that the congregation puts enough emphasis on encouraging members to work for social justice (yes versus no); we are interested in agreement between a pastor and a co-ordinator on this variable. The 'subject' in this example is the congregation, with one rating from the pastor and the other rating from the co-ordinator.

The goal of this analysis was to determine whether agreement between pastors and co-ordinators depends on two subject level covariates and three rater level covariates. The two 'subject level' covariates are characteristics of the congregations: the Protestant denomination (Disciples of Christ, Lutheran, Presbyterian, Baptist, Church of Christ or Methodist) and whether the church offers community service projects for youths (yes versus no). The following three covariates can differ between the pastor and co-ordinator and thus are 'rater level' covariates: the number of hours (2 or under versus more than 2) spent in the last 30 days helping people who are poor, political orientation (liberal versus conservative) and the amount of money given to charities over the past year ($100 or more versus less than $100).
Our hypothesis is that the κ-coefficient may be higher in certain denominations, may be higher in churches that offer community service projects for youths, may be higher when both the pastor and the co-ordinator work the same number of hours helping the poor, may be higher when both the pastor and the co-ordinator have the same political orientation and may be higher when both the pastor and the co-ordinator give the same amount to charity. The data of 15 of the 395 congregations in the data set are displayed in Table 1.

Table 1. Data from 15 selected congregations from the pastor and co-ordinator educator study†

        Church emphasizes    Hours helping      Donations            Liberal            Church offers
        social justice       poor                                                       community
Case    Pastor   Co-ord.     Pastor   Co-ord.   Pastor    Co-ord.    Pastor   Co-ord.   service projects   Religion
 1      Y        N           >2       >2        <$100     <$100      N        N         N                  3
 2      N        Y           >2       ≤2        ≥$100     ≥$100      N        N         N                  3
 3      Y        Y           ≤2       >2        ≥$100     ≥$100      Y        Y         Y                  6
 4      Y        Y           >2       >2        ≥$100     ≥$100      N        N         N                  4
 5      Y        N           ≤2       >2        ≥$100     <$100      N        N         Y                  1
 6      Y        Y           >2       ≤2        <$100     ≥$100      Y        N         Y                  3
 7      Y        N           >2       >2        <$100     ≥$100      N        N         N                  2
 8      N        Y           ≤2       >2        ≥$100     ≥$100      N        N         Y                  6
 9      Y        Y           >2       >2        <$100     ≥$100      N        Y         N                  3
10      N        Y           ≤2       >2        ≥$100     ≥$100      Y        N         N                  3
11      Y        N           >2       >2        ≥$100     <$100      Y        Y         N                  1
12      Y        N           >2       ≤2        ≥$100     ≥$100      N        N         N                  2
13      N        Y           ≤2       >2        ≥$100     ≥$100      Y        N         N                  5
14      Y        N           >2       ≤2        <$100     <$100      N        Y         Y                  5
15      Y        Y           >2       >2        ≥$100     ≥$100      Y        Y         Y                  5

†Y, yes; N, no.

1.2. Example 2: spousal education study

In our second example, we are interested in the similarity of the highest educational degree (no high school degree, high school degree or college degree) of a husband and wife. The subject in this example is the husband–wife pair; the wife is considered one rater, and the husband the other rater. We use data from the 1991 General Social Survey (Smith, 1996), which is a personal interview survey of US households conducted by the National Opinion Research Center.
The goal of this analysis was to determine whether agreement of the highest degree between the husband and wife can be predicted by two subject level (family level) covariates and three rater level covariates. The subject level covariates are the current family income (less than $25 000, $25 000–50 000 or more than $50 000) and the number of children in the family (0, 1, 2, 3 or 4). The rater level covariates are the ages at marriage for the wife and husband, the highest degree (no high school degree, high school degree or college degree) of the father of the wife and the father of the husband, and an indicator for living at home with both parents at age 16 years (yes versus no). In this example, we expect families with lower incomes and more children possibly to have both husband and wife with lower education (and thus higher agreement). We also expect agreement to be higher for husbands and wives whose fathers have the same education level, husbands and wives who have similar ages at marriage and husbands and wives who both lived or did not live at home at age 16 years. The data of 15 of the 332 husband–wife pairs are displayed in Table 2.

Maximum likelihood (Shoukri and Mian, 1996) has recently been proposed for estimating the κ-coefficient for dichotomous ratings as a function of covariates. The method requires a special macro to compute the estimates. In this paper, we show that, for dichotomous ratings, the regression model for κ can be estimated by using two ordinary logistic regressions and one linear regression for binary data, and we show that the estimates can be obtained in any generalized linear models program, such as SAS procedure GENMOD (SAS Institute, 1997), STATA glm (Stata Corporation, 1995), S function glm (Hastie and Pregibon, 1993) or GLIM (Francis et al., 1993).
Table 2. Data from 15 selected husband–wife pairs from the spousal education study†

        Highest degree         Father's highest degree   Age at marriage   Living with parents
                                                         (years)           at 16 years
Case    Wife       Husband     Wife       Husband        Wife   Husband    Wife   Husband       Family income ($)   Number of children
 1      HS         HS          HS         Below HS       24     20         Y      Y             25 000–50 000       0
 2      HS         HS          HS         HS             21     21         Y      N             25 000–50 000       0
 3      HS         HS          COL        HS             23     25         Y      Y             >50 000             3
 4      HS         HS          COL        HS             23     23         N      N             >50 000             1
 5      Below HS   Below HS    HS         HS             16     27         N      Y             <25 000             0
 6      Below HS   Below HS    Below HS   Below HS       18     21         Y      Y             >50 000             0
 7      COL        COL         HS         COL            28     27         Y      Y             25 000–50 000       1
 8      HS         HS          Below HS   Below HS       23     25         Y      Y             25 000–50 000       0
 9      HS         HS          Below HS   Below HS       21     23         N      N             25 000–50 000       0
10      Below HS   HS          Below HS   HS             28     28         Y      Y             >50 000             0
11      COL        COL         COL        HS             24     27         Y      Y             >50 000             0
12      COL        COL         COL        Below HS       27     29         Y      Y             >50 000             3
13      COL        COL         HS         HS             21     23         Y      Y             >50 000             3
14      COL        COL         COL        HS             26     27         Y      Y             >50 000             1
15      COL        HS          HS         HS             22     20         Y      Y             25 000–50 000       2

†HS, high school; COL, college; Y, yes; N, no.

For dichotomous ratings, the two ordinary logistic regressions and the one linear regression for binary data completely specify the joint multinomial distribution of the two ratings on the same subject, so the main advantage of our approach versus maximum likelihood for dichotomous ratings is that writing the computer program needed to obtain the estimate is simpler. However, when estimating the κ-coefficient for polytomous ratings (more than two levels), maximum likelihood is much more complicated to extend since the full joint distribution of the two ratings must be specified, and this contains many unwanted nuisance parameters. Using our approach, the regression model for κ can be estimated by using two multinomial or ordinal logistic regressions for the polytomous ratings and one linear regression for binary data, and does not require the specification of the nuisance parameters that is required by maximum likelihood.
Thus, for polytomous ratings, our method can be considered semiparametric. In Sections 2 and 3 we introduce some notation and briefly describe how the estimates are obtained, first for dichotomous ratings and then for general categorical ratings. In Section 4, we compare our proposed estimates with the maximum likelihood estimates (MLEs) by using the pastor and co-ordinator educator example. As stated above, a maximum likelihood method has yet to be developed for polytomous ratings, so we do not compare our estimate with maximum likelihood in the spousal education study.

2. Notation and model: dichotomous rating

We have $i = 1, \ldots, N$ independent subjects. Two raters assess each subject on a discrete characteristic. First, we assume that the rating is dichotomous. Thus, we let $Y_{ir}$ equal 1 if subject $i$ is judged positive by rater $r$ and equal 0 if judged negative, for $r = 1, 2$. Each individual also has a subject-specific covariate vector $x_i$, such as family income in the spousal education study. Also, there are two rater-specific covariate vectors $x_{ir}$, $r = 1, 2$. A rater-specific covariate could be, for example, the age of the rater. We let $z_i' = (x_i', x_{i1}', x_{i2}')$. We can further define an indicator random variable $Y_i$, which equals 1 if both raters agree on their ratings for subject $i$ and equals 0 otherwise. In terms of $Y_{i1}$ and $Y_{i2}$,

$$Y_i = Y_{i1} Y_{i2} + (1 - Y_{i1})(1 - Y_{i2}).$$

To measure agreement between $Y_{i1}$ and $Y_{i2}$, Cohen (1960) proposed the κ-coefficient

$$\kappa_i = \frac{p_{\mathrm{agree},i} - p_{\mathrm{chance},i}}{1 - p_{\mathrm{chance},i}}, \tag{1}$$

where $p_{\mathrm{agree},i}$ is the probability that the two raters agree on subject $i$,

$$p_{\mathrm{agree},i} = \mathrm{pr}(Y_i = 1 \mid z_i) = \mathrm{pr}(Y_{i1} = 1, Y_{i2} = 1 \mid z_i) + \mathrm{pr}(Y_{i1} = 0, Y_{i2} = 0 \mid z_i),$$

and

$$p_{\mathrm{chance},i} = p_{i1} p_{i2} + (1 - p_{i1})(1 - p_{i2})$$

is the probability that the two raters agree if they independently choose a category, where $p_{ir} = \mathrm{pr}(Y_{ir} = 1 \mid x_i, x_{ir})$.
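As a quick numerical illustration of definition (1) for dichotomous ratings, the following minimal Python sketch computes κ from the two agreement cell probabilities and the two marginal probabilities. The function name `kappa_from_joint` and the probability values are hypothetical and are used only for illustration; they are not taken from either data set.

```python
def kappa_from_joint(p11, p00, p1, p2):
    """Chance-corrected agreement for two binary ratings.

    p11 = pr(Y1 = 1, Y2 = 1), p00 = pr(Y1 = 0, Y2 = 0),
    p1 = pr(Y1 = 1), p2 = pr(Y2 = 1).
    """
    p_agree = p11 + p00
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)
    return (p_agree - p_chance) / (1 - p_chance)


# Independent raters: the joint cells are products of the margins, so kappa is 0.
p1, p2 = 0.6, 0.3
kappa_independent = kappa_from_joint(p1 * p2, (1 - p1) * (1 - p2), p1, p2)

# Perfect agreement: all probability lies on the two agreement cells, so kappa is 1.
kappa_perfect = kappa_from_joint(0.4, 0.6, 0.4, 0.4)

print(kappa_independent, kappa_perfect)
```

The two cases mark the usual reference points of the scale: no agreement beyond chance and complete agreement.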
Also, for a given value of $p_{\mathrm{chance},i}$, one can show that $\kappa_i$ is restricted to be in the interval

$$-\frac{p_{\mathrm{chance},i}}{1 - p_{\mathrm{chance},i}} \le \kappa_i \le 1; \tag{2}$$

further algebra can be used to show that $\kappa_i$ is restricted to lie between $-1$ and 1. For a given link function $g(\cdot)$, an adjustment for covariates associated with κ can be accomplished by using the model

$$g(\kappa_i) = z_i' \alpha.$$

Since $\kappa_i$ is similar to a correlation coefficient, we could let $g(\kappa_i)$ be proportional to Fisher's z-transformation, i.e.

$$g(\kappa_i) = \log\left(\frac{1 + \kappa_i}{1 - \kappa_i}\right).$$

This link function avoids the need for constraints on the parameter α to ensure that $-1 < \kappa_i < 1$. However, the parameters have no simple interpretation. Thus, we propose to use the identity (linear) link

$$\kappa_i = z_i' \alpha, \tag{3}$$

so that the parameters will be directly interpreted as the change in κ for a 1-unit change in a covariate. In equation (3), there are constraints on α so that $-1 < \kappa_i < 1$. However, we do not use constraints in the estimating procedure that we propose later in this section; the estimation procedure is much simpler without constraints, and the resulting estimates are asymptotically normal and consistent. Also, in all the data sets (and simulated data sets) that we have looked at, we have never had a problem with non-convergence of the algorithm described below to a unique $\hat\alpha$; further, for each $\hat\alpha$, all estimated $\kappa_i$ were such that $-1 < \hat\kappa_i < 1$. Although we have not experienced problems with $\hat\kappa_i$ being greater than 1, if the true $\kappa_i$ is close to 1 for most combinations of covariates, then we expect that, depending on the sample size, there could be problems with $\hat\kappa_i$ being greater than 1.

We now show that we can estimate the parameter vector α through modelling $p_{\mathrm{agree},i} = E(Y_i \mid z_i) = \mathrm{pr}(Y_i = 1 \mid z_i)$. In particular, using equations (1) and (3), the model for $p_{\mathrm{agree},i}$ in terms of $\kappa_i$ can be expressed as

$$p_{\mathrm{agree},i} = p_{\mathrm{chance},i} + \kappa_i (1 - p_{\mathrm{chance},i}) = p_{\mathrm{chance},i} + z_i' \alpha\, (1 - p_{\mathrm{chance},i}) = p_{\mathrm{chance},i} + z_i^{*\prime} \alpha, \tag{4}$$

where $z_i^* = (1 - p_{\mathrm{chance},i})\, z_i$.
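The rearrangement leading to equation (4) can be checked numerically. The sketch below, written under illustrative assumptions (the helper name `linear_model_pieces`, the covariate matrix, the coefficient values and the chance probabilities are all hypothetical), builds the offset and the rescaled covariate vector and confirms that the linear form reproduces the agreement probability implied by equations (1) and (3).

```python
import numpy as np


def linear_model_pieces(z, p_chance):
    """Return (offset, z_star) such that p_agree = offset + z_star @ alpha."""
    z = np.asarray(z, dtype=float)
    p_chance = np.asarray(p_chance, dtype=float)
    z_star = (1.0 - p_chance)[:, None] * z   # each row of z scaled by 1 - p_chance
    return p_chance, z_star


# Hypothetical inputs: an intercept plus one binary covariate for three subjects.
z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])
alpha = np.array([0.2, 0.1])
p_chance = np.array([0.5, 0.6, 0.7])

offset, z_star = linear_model_pieces(z, p_chance)
kappa = z @ alpha                                    # identity link, equation (3)
p_agree_linear = offset + z_star @ alpha             # right-hand side of equation (4)
p_agree_direct = p_chance + kappa * (1 - p_chance)   # equation (1) rearranged

# Lower bound on kappa from condition (2) for each subject.
kappa_lower = -p_chance / (1 - p_chance)
print(p_agree_linear, p_agree_direct, kappa_lower)
```

Because the offset and the rescaled covariates depend only on the chance probability, any software that fits a linear model to a binary outcome with an offset can be used once that probability is available.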
If $p_{\mathrm{chance},i}$ was known, then the model for $p_{\mathrm{agree},i}$ is a 'linear' model with

(a) known 'offset' $p_{\mathrm{chance},i}$,
(b) known covariate vector $z_i^* = (1 - p_{\mathrm{chance},i})\, z_i$ and
(c) parameter vector α.

Thus, instead of a logistic model for the probability $p_{\mathrm{agree},i}$, we have a linear model for $p_{\mathrm{agree},i}$. To estimate α if $p_{\mathrm{chance},i}$ is known, we can use maximum likelihood based on the Bernoulli distribution of $Y_i$. The likelihood in this case is just

$$\prod_{i=1}^{n} p_{\mathrm{agree},i}^{\,y_i} (1 - p_{\mathrm{agree},i})^{1 - y_i}.$$

The MLE from this likelihood can be obtained by using any generalized linear models program which allows a linear model for the binary outcome $Y_i$.

Unfortunately, $p_{\mathrm{chance},i}$ is rarely known. One possibility is to use maximum likelihood based on the joint distribution of $(Y_{i1}, Y_{i2})$ to estimate $p_{i1}$, $p_{i2}$ and $\kappa_i$ jointly. This is the approach taken by Shoukri and Mian (1996). We take a different approach, which will be easier to extend than maximum likelihood when the ratings are polytomous. In our approach, we replace $p_{\mathrm{chance},i}$ in equation (4) with an estimate $\hat p_{\mathrm{chance},i}$ and then estimate α by using a linear model. In particular, we can estimate $p_{i1}$ and $p_{i2}$, denoted $\hat p_{i1}$ and $\hat p_{i2}$, by using ordinary logistic regression with the model

$$\mathrm{logit}(p_{ir}) = x_{ir}' \beta_{1r} + x_i' \beta_{2r}, \tag{5}$$

$r = 1, 2$, and we estimate $p_{\mathrm{chance},i}$ with

$$\hat p_{\mathrm{chance},i} = \hat p_{i1} \hat p_{i2} + (1 - \hat p_{i1})(1 - \hat p_{i2}).$$

Then, the linear model for $p_{\mathrm{agree},i}$ is

$$p_{\mathrm{agree},i} \approx \hat p_{\mathrm{chance},i} + (1 - \hat p_{\mathrm{chance},i})\, z_i' \alpha \tag{6}$$

(we put '≈' in expression (6) since $\hat p_{\mathrm{chance},i}$ is an estimate). Treating the estimated value $\hat p_{\mathrm{chance},i}$ in expression (6) as known, α can still be estimated by using a linear model for binary data. For example, suppose that we wanted to use the SAS generalized linear models program PROC GENMOD to estimate α; then we can do the following.

(a) Use logistic regression (PROC GENMOD or LOGISTIC) of $Y_{i1}$ versus $(x_{i1}, x_i)$ to obtain $\hat p_{i1}$.
(b) Use logistic regression (PROC GENMOD or LOGISTIC) of $Y_{i2}$ versus $(x_{i2}, x_i)$ to obtain $\hat p_{i2}$.
(c) For each subject, use the results of the first two models to form $\hat p_{\mathrm{chance},i} = \hat p_{i1} \hat p_{i2} + (1 - \hat p_{i1})(1 - \hat p_{i2})$.
(d) Use linear regression (PROC GENMOD) of the binary outcome $Y_i$ versus $z_i^*$ with offset $\hat p_{\mathrm{chance},i}$.

SAS PROC GENMOD prints out an error message that convergence is questionable if

$$\widehat{\mathrm{pr}}(Y_i = 1 \mid z_i^*) = \hat p_{\mathrm{agree},i} = \hat p_{\mathrm{chance},i} + z_i' \hat\alpha\, (1 - \hat p_{\mathrm{chance},i}) = \hat p_{\mathrm{chance},i} + \hat\kappa_i (1 - \hat p_{\mathrm{chance},i})$$

is not between 0 and 1. However, if

$$0 < \hat p_{\mathrm{chance},i} + \hat\kappa_i (1 - \hat p_{\mathrm{chance},i}) < 1,$$

then, by rearranging terms, one can show that condition (2) holds, and PROC GENMOD prints out 'converged'. Although it is surely possible that $\hat\kappa_i$ will be outside the range from $-1$ to 1, we have never found this to be true in any data which we have analysed.

We note here that since the model for $p_{ir}$ in equation (5) is not a function of α, for a given model for $p_{ir}$, the estimate $\hat\beta$ will be the same, regardless of the model for $\kappa_i$. The estimate of $\kappa_i$, however, depends on the model for $p_{ir}$ and thus could be sensitive to the model for $p_{ir}$ and be biased if we underfit the model for $p_{ir}$. Thus, if we were most interested in consistently estimating $\kappa_i$, we should fit the largest possible model for $p_{ir}$. In particular, to prevent biases in the estimate of $\kappa_i$, we would not worry about the type I error rate when fitting a model for $p_{ir}$; we would much rather overfit the model for $p_{ir}$ than underfit it. In all the examples that we have looked at, we have found that overfitting the model (i.e. including non-significant terms in the model) for $p_{ir}$ leads to very little increase in the estimated standard error of $\hat\alpha$.

Suppose that we let $\beta' = (\beta_{11}', \beta_{21}', \beta_{12}', \beta_{22}')$. Using Taylor series expansions similarly to Prentice (1988), assuming that the models for $p_{i1}$, $p_{i2}$ and $\kappa_i$ are correctly specified, $(\hat\beta', \hat\alpha')'$ is consistent for $(\beta', \alpha')'$, and $N^{1/2} \{(\hat\beta', \hat\alpha')' - (\beta', \alpha')'\}$ has an asymptotic distribution which is multivariate normal with mean vector 0 and a covariance matrix which can be consistently estimated by a robust variance estimator such as the jackknife (Wu, 1986) or the sandwich estimator of White (1982). One form of the jackknife variance estimate (Wu, 1986) is

$$\widehat{\mathrm{var}}\{(\hat\beta', \hat\alpha')'\} = \sum_{i=1}^{n} \{(\hat\beta', \hat\alpha')'_{-i} - (\hat\beta', \hat\alpha')'\} \{(\hat\beta', \hat\alpha')'_{-i} - (\hat\beta', \hat\alpha')'\}', \tag{7}$$

where $(\hat\beta', \hat\alpha')'_{-i}$ is the estimate of $(\beta', \alpha')'$ obtained by deleting the two ratings on the $i$th individual and recalculating both $\hat\beta$ and $\hat\alpha$. As opposed to equation (7), the SAS PROC GENMOD variance estimate of $\hat\alpha$, sometimes called a 'naïve' variance estimate, is calculated assuming that $\hat p_{\mathrm{chance},i}$ is known. The SAS PROC GENMOD variance estimate tends to be somewhat conservative in simulations that we have performed; these standard errors tend to be conservative because of the covariance between $\hat\alpha$ and $\hat\beta$. Since logistic and linear regression for binary data can be calculated quite quickly, the time that is needed to calculate the jackknife variance estimate is minimal. For example, it took 2 min on a SPARCstation Ultra 20 computer to calculate the variance for the first example with 395 observations.

As stated above, to use maximum likelihood (Shoukri and Mian, 1996) with $p_{i1}$ and $p_{i2}$ unknown, we must specify the joint multinomial distribution of $(Y_{i1}, Y_{i2})$. This joint multinomial distribution has three non-redundant probabilities. One can show that $(p_{i1}, p_{i2}, \kappa_i)$ completely specifies these three probabilities. Thus, we must estimate the same number of parameters with our method as with maximum likelihood, so the main advantage of our method is that a simpler program can be written to calculate the estimate.
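To make the estimation steps (a)–(d) and the delete-one-subject jackknife concrete, here is a minimal Python sketch that uses only NumPy. It is an illustration under stated assumptions and not the authors' SAS macros: the function names are hypothetical, the logistic fit is a plain Newton–Raphson routine, the identity-link Bernoulli fit uses unconstrained Fisher scoring, the jackknife is computed for $\hat\alpha$ only (the α block of equation (7)), and the data are simulated with intercept-only models.

```python
import numpy as np


def expit(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Logistic regression by Newton-Raphson; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        w = p * (1.0 - p)
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def fit_linear_binary(z_star, y, offset, n_iter=100, tol=1e-10):
    """Bernoulli model with identity link, E(y) = offset + z_star @ alpha,
    fitted by Fisher scoring with no constraints on alpha."""
    alpha = np.zeros(z_star.shape[1])
    for _ in range(n_iter):
        mu = np.clip(offset + z_star @ alpha, 1e-8, 1.0 - 1e-8)
        w = 1.0 / (mu * (1.0 - mu))
        step = np.linalg.solve(z_star.T @ (w[:, None] * z_star),
                               z_star.T @ (w * (y - mu)))
        alpha += step
        if np.max(np.abs(step)) < tol:
            break
    return alpha


def fit_kappa_model(y1, y2, X1, X2, Z):
    """Steps (a)-(d): two logistic fits, the chance probability, then the linear fit."""
    p1 = expit(X1 @ fit_logistic(X1, y1))                 # step (a)
    p2 = expit(X2 @ fit_logistic(X2, y2))                 # step (b)
    p_chance = p1 * p2 + (1 - p1) * (1 - p2)              # step (c)
    agree = (y1 == y2).astype(float)
    z_star = (1 - p_chance)[:, None] * Z
    return fit_linear_binary(z_star, agree, p_chance)     # step (d)


def jackknife_cov(y1, y2, X1, X2, Z):
    """Delete-one-subject jackknife covariance of alpha-hat, refitting all models."""
    full = fit_kappa_model(y1, y2, X1, X2, Z)
    keep = np.ones(len(y1), dtype=bool)
    devs = []
    for i in range(len(y1)):
        keep[i] = False
        devs.append(fit_kappa_model(y1[keep], y2[keep], X1[keep], X2[keep], Z[keep]) - full)
        keep[i] = True
    devs = np.array(devs)
    return devs.T @ devs


# Simulated data (hypothetical): p1 = p2 = 0.5 and kappa = 0.4 for every subject.
rng = np.random.default_rng(0)
n = 300
cell = rng.choice(4, size=n, p=[0.35, 0.15, 0.15, 0.35])  # (1,1), (1,0), (0,1), (0,0)
y1 = np.isin(cell, [0, 1]).astype(float)
y2 = np.isin(cell, [0, 2]).astype(float)
ones = np.ones((n, 1))                                    # intercept-only designs

alpha_hat = fit_kappa_model(y1, y2, ones, ones, ones)
se_hat = np.sqrt(np.diag(jackknife_cov(y1, y2, ones, ones, ones)))
print(alpha_hat, se_hat)
```

With intercept-only models for both the marginal probabilities and κ, the fitted coefficient coincides with Cohen's simple sample estimate, which gives a convenient sanity check before covariates are added.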
SAS macros to calculate our estimate can be found at the Web site of the fifth author:

http://gsbwww.uchicago.edu/fac/michael.parzen/research/

However, if the ratings have more than two levels, our method, described in the next section, extends more easily than maximum likelihood.

3. Extension to more than two categories

Suppose that $Y_{ir}$ can take levels $1, \ldots, J$ instead of 0 or 1. In this case, the κ-coefficient is again defined as

$$\kappa_i = \frac{p_{\mathrm{agree},i} - p_{\mathrm{chance},i}}{1 - p_{\mathrm{chance},i}},$$

where $p_{\mathrm{agree},i}$ is the probability that the two raters agree on subject $i$,

$$p_{\mathrm{agree},i} = \sum_{j=1}^{J} \mathrm{pr}(Y_{i1} = j, Y_{i2} = j \mid z_i)$$

and

$$p_{\mathrm{chance},i} = \sum_{j=1}^{J} \mathrm{pr}(Y_{i1} = j \mid x_i, x_{i1})\, \mathrm{pr}(Y_{i2} = j \mid x_i, x_{i2}).$$

If we again define an indicator random variable $Y_i$, which equals 1 if both raters agree on their ratings for subject $i$ and equals 0 otherwise, then

$$\mathrm{pr}(Y_i = 1 \mid z_i) = p_{\mathrm{agree},i} = p_{\mathrm{chance},i} + \kappa_i (1 - p_{\mathrm{chance},i}) = p_{\mathrm{chance},i} + z_i^{*\prime} \alpha, \tag{8}$$

for the linear model $\kappa_i = z_i' \alpha$, where $z_i^* = z_i (1 - p_{\mathrm{chance},i})$.

To estimate α if $p_{\mathrm{chance},i}$ is known, we can again use maximum likelihood based on the distribution of $Y_i$, with the likelihood

$$\prod_{i=1}^{n} p_{\mathrm{agree},i}^{\,y_i} (1 - p_{\mathrm{agree},i})^{1 - y_i}.$$

The MLE from this likelihood can again be obtained by using any generalized linear models program which allows a linear model for binary outcome $Y_i$. However, $p_{\mathrm{chance},i}$ is rarely known, so we must extend our method from the previous section. Extending the approach in the previous section, to estimate $\kappa_i$, we can

(a) use polytomous or ordinal logistic regression of $Y_{i1}$ versus $(x_i, x_{i1})$ to obtain $\widehat{\mathrm{pr}}(Y_{i1} = j \mid x_i, x_{i1})$,
(b) use polytomous or ordinal logistic regression of $Y_{i2}$ versus $(x_i, x_{i2})$ to obtain $\widehat{\mathrm{pr}}(Y_{i2} = j \mid x_i, x_{i2})$,
(c) for each subject, form $\hat p_{\mathrm{chance},i} = \sum_{j=1}^{J} \widehat{\mathrm{pr}}(Y_{i1} = j \mid x_i, x_{i1})\, \widehat{\mathrm{pr}}(Y_{i2} = j \mid x_i, x_{i2})$ and
(d) use linear regression of the binary outcome $Y_i$ versus $z_i^*$ with offset $\hat p_{\mathrm{chance},i}$.
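Step (c) for polytomous ratings amounts to a row-wise sum of products of fitted category probabilities. The following minimal sketch assumes that the fitted probabilities from steps (a) and (b) are already available as two arrays with one row per subject and one column per category; the function name `chance_agreement` and the numerical values are hypothetical.

```python
import numpy as np


def chance_agreement(P1, P2):
    """Chance agreement per subject: sum over categories j of P1[i, j] * P2[i, j]."""
    P1 = np.asarray(P1, dtype=float)
    P2 = np.asarray(P2, dtype=float)
    return np.sum(P1 * P2, axis=1)


# Hypothetical fitted probabilities for two subjects and J = 3 categories.
P1 = np.array([[0.2, 0.5, 0.3],
               [0.1, 0.6, 0.3]])
P2 = np.array([[0.3, 0.4, 0.3],
               [0.2, 0.2, 0.6]])
p_chance = chance_agreement(P1, P2)

# With J = 2 the same function reduces to the dichotomous formula.
p1, p2 = 0.6, 0.3
p_chance_binary = chance_agreement([[1 - p1, p1]], [[1 - p2, p2]])
print(p_chance, p_chance_binary)
```

The resulting vector then plays the role of the offset in step (d), exactly as in the dichotomous case.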
To use maximum likelihood with polytomous ratings, and $\mathrm{pr}(Y_{i1} = j \mid x_i, x_{i1})$ and $\mathrm{pr}(Y_{i2} = j \mid x_i, x_{i2})$ unknown, we must specify the joint multinomial distribution of $(Y_{i1}, Y_{i2})$. This joint multinomial distribution has $J^2 - 1$ non-redundant probabilities. Our method requires a model for the $J - 1$ non-redundant probabilities $\mathrm{pr}(Y_{i1} = j \mid x_i, x_{i1})$ ($j = 1, \ldots, J - 1$), a model for the $J - 1$ probabilities $\mathrm{pr}(Y_{i2} = j \mid x_i, x_{i2})$ ($j = 1, \ldots, J - 1$) and a model for $\kappa_i$. These models only specify $2(J - 1) + 1$ dimensions of the $(J^2 - 1)$-dimensional joint distribution of $(Y_{i1}, Y_{i2})$. For example, with $J = 3$ categories the joint distribution has eight non-redundant probabilities, of which our models specify five. Thus, compared with a maximum likelihood method based on the joint distribution of $(Y_{i1}, Y_{i2})$, our method requires far fewer 'nuisance' parameters. Further, the full joint distribution that is needed for maximum likelihood does not have a simple form. Thus, not only is the computational method simple with polytomous ratings, this method is also semiparametric in that the full joint distribution of $(Y_{i1}, Y_{i2})$ need not be specified as in maximum likelihood.

4. Examples

4.1. Pastor and co-ordinator educator study

For the pastor and co-ordinator educator study, we want to determine covariates predictive of agreement between a pastor and co-ordinator with respect to whether the congregation puts enough emphasis on encouraging members to work for social justice. We model the κ-coefficient as a function of the Protestant denomination (six different denominations), whether the church offers community service projects for youths (yes versus no), the number of hours (2 or under versus more than 2) spent in the last 30 days helping people who are poor, political orientation (liberal versus conservative) and the amount of money donated to charities over the past year ($100 or more versus less than $100). Table 3 gives Cohen's (1960) simple estimate of κ for various levels of these covariates; note that the estimates of κ for levels of one covariate are not adjusted for the other covariates.
Looking at the results in Table 3, none of the covariates seems significantly to predict chance-corrected agreement. As we shall see later in this section, just as in these univariate results, when we formulate a regression model for κ, none of these covariates are significant.

Table 3. Cohen's (1960) estimates of κ from the pastor and co-ordinator educator study for subgroups

Covariate                                       N     % agree   κ̂        Standard error   P-value†
Church offers youth community service projects
  No                                            217   62         0.201   0.067            0.390
  Yes                                           178   57         0.115   0.075
Religion
  1                                              72   58         0.100   0.114            0.664
  2                                              85   61         0.120   0.109
  3                                              67   61         0.224   0.119
  4                                              37   51        −0.064   0.161
  5                                              69   62         0.244   0.114
  6                                              65   60         0.195   0.121
Hours helping poor (pastor, co-ordinator)
  ≤2, ≤2                                         47   65         0.227   0.145            0.818
  ≤2, >2                                         38   53         0.053   0.158
  >2, ≤2                                        132   62         0.201   0.082
  >2, >2                                        178   58         0.148   0.074
Donations (pastor, co-ordinator)
  <$100, <$100                                   48   65         0.297   0.134            0.455
  <$100, ≥$100                                   84   60         0.198   0.104
  ≥$100, <$100                                   62   60         0.185   0.126
  ≥$100, ≥$100                                  201   58         0.074   0.071
Liberal (pastor, co-ordinator)
  (no, no)                                      244   62         0.165   0.065            0.877
  (no, yes)                                      31   58         0.176   0.144
  (yes, no)                                      85   53         0.071   0.105
  (yes, yes)                                     35   60         0.178   0.163

†Wald test for equal κs over subgroups.

If we let $Y_{i1}$ be the pastor rating (1 if yes; 0 if no), and $Y_{i2}$ be the co-ordinator rating, then the models for the marginal probabilities are, for the pastor,

$$\mathrm{logit}(p_{i1}) = \beta_{01} + \beta_{11} c_{1i} + \ldots + \beta_{51} c_{5i} + \beta_{61} \mathrm{SERV}_i + \beta_{71} \mathrm{HOUR}_{i1} + \beta_{81} \mathrm{LIB}_{i1} + \beta_{91} \mathrm{DONATE}_{i1} \tag{9}$$

and, for the co-ordinator,

$$\mathrm{logit}(p_{i2}) = \beta_{02} + \beta_{12} c_{1i} + \ldots + \beta_{52} c_{5i} + \beta_{62} \mathrm{SERV}_i + \beta_{72} \mathrm{HOUR}_{i2} + \beta_{82} \mathrm{LIB}_{i2} + \beta_{92} \mathrm{DONATE}_{i2}, \tag{10}$$

where $\{c_{1i}, \ldots, c_{5i}\}$ are five church indicators, $\mathrm{SERV}_i$ equals 1 if the church offers community service projects for youths and equals 0 otherwise, $\mathrm{HOUR}_{ir}$ equals 1 if rater $r$ spent more than 2 h in the last 30 days helping people who are poor and equals 0 otherwise, $\mathrm{LIB}_{ir}$ equals 1 if rater $r$ is liberal and equals 0 otherwise, and $\mathrm{DONATE}_{ir}$ equals 1 if rater $r$ donated at least $100 to charities in the past year and equals 0 otherwise.

As stated earlier, to prevent biases in the estimate of $\kappa_i$, we do not worry about the type I error rate when fitting a model for $p_{ir}$; we would much rather overfit the model for $p_{ir}$ than underfit it. Thus, we fit the model for $p_{ir}$ which was to include all interaction terms that are significant at the 0.25-level of significance. However, fitting a step-up model, none of the pairwise interactions were significant using a Wald statistic (P-values greater than 0.25). The estimates for the logistic models (9) and (10) are given in Table 4. We see from Table 4 that the covariates have different effects on $p_{i1}$ in model (9) and $p_{i2}$ in model (10). For instance, DONATIONS and LIBERAL are significant for the pastor, but not for the co-ordinator. When further fitting the model for $\kappa_i$, we included the same covariates for both $p_{i1}$ in model (9) and $p_{i2}$ in model (10); however, this is not necessary, and we could have dropped the non-significant terms in either $p_{i1}$ or $p_{i2}$. Our model for κ is

$$\begin{aligned}
\kappa_i = {} & \alpha_0 + \alpha_1 c_{1i} + \ldots + \alpha_5 c_{5i} + \alpha_6 \mathrm{SERV}_i \\
& + \alpha_7 \mathrm{HOUR}_{i1} (1 - \mathrm{HOUR}_{i2}) + \alpha_8 (1 - \mathrm{HOUR}_{i1}) \mathrm{HOUR}_{i2} + \alpha_9 (1 - \mathrm{HOUR}_{i1})(1 - \mathrm{HOUR}_{i2}) \\
& + \alpha_{10} \mathrm{LIB}_{i1} (1 - \mathrm{LIB}_{i2}) + \alpha_{11} (1 - \mathrm{LIB}_{i1}) \mathrm{LIB}_{i2} + \alpha_{12} (1 - \mathrm{LIB}_{i1})(1 - \mathrm{LIB}_{i2}) \\
& + \alpha_{13} \mathrm{DONATE}_{i1} (1 - \mathrm{DONATE}_{i2}) + \alpha_{14} (1 - \mathrm{DONATE}_{i1}) \mathrm{DONATE}_{i2} + \alpha_{15} (1 - \mathrm{DONATE}_{i1})(1 - \mathrm{DONATE}_{i2}).
\end{aligned} \tag{11}$$

Table 4. Estimates from the pastor and co-ordinator educator study for the logistic marginal model for putting emphasis on encouraging members to work for social justice

Parameter          Estimate   Standard error   Z       P-value
Pastor
  INTERCEPT        −0.635     0.412            −1.54   0.123
  YOUTH SERVICE    −0.039     0.221            −0.18   0.859
  CHURCH(1)        −0.090     0.394            −0.23   0.820
  CHURCH(2)        −0.302     0.389            −0.78   0.438
  CHURCH(3)         0.079     0.399             0.20   0.844
  CHURCH(4)        −0.170     0.479            −0.36   0.723
  CHURCH(5)        −0.057     0.403            −0.14   0.887
  HOURS HELP        0.039     0.269             0.14   0.886
  DONATIONS         0.583     0.230             2.54   0.011
  LIBERAL           0.703     0.247             2.85   0.004
Co-ordinator
  INTERCEPT        −0.481     0.357            −1.35   0.178
  YOUTH SERVICE     0.194     0.225             0.87   0.387
  CHURCH(1)        −0.799     0.404            −1.98   0.048
  CHURCH(2)        −0.812     0.391            −2.08   0.038
  CHURCH(3)         0.044     0.392             0.11   0.911
  CHURCH(4)        −0.311     0.464            −0.67   0.502
  CHURCH(5)        −0.420     0.399            −1.05   0.293
  HOURS HELP        0.328     0.228             1.44   0.150
  DONATIONS         0.346     0.245             1.41   0.157
  LIBERAL           0.018     0.292             0.06   0.952

The effects $\{\alpha_7, \ldots, \alpha_{15}\}$ correspond to the various combinations of the rater-specific covariates. When fitting this model for $\kappa_i$, we fit the marginal models (9) and (10). The covariates in the marginal models and the model for κ need not be identical, and we see that the covariates in model (11) are different from those in models (9) and (10). We can think of the 'saturated model' as containing a different κ-coefficient for every possible combination of the covariates, which means that both the marginal model for $p_{ir}$ and the model for κ would have all possible interaction terms. Of course, this 'saturated' model usually cannot be fitted in practice, so we propose fitting step-up regression models for $p_{ir}$ and $\kappa_i$. Fitting a step-up model for $\kappa_i$, none of the pairwise interactions were significant using a Wald statistic (P-values greater than 0.25).

Table 5 gives our estimate and the MLE (Shoukri and Mian, 1996) for α. Our approach and the MLE are similar, with the MLEs having slightly smaller estimated standard errors.
Thus, our estimates appear highly efficient compared with the MLEs (the estimated efficiency is at least 85% for most effects). Unfortunately, none of the covariates are significant. For model (11), the estimated values of $\hat\kappa_i$ range from $-0.35$ to 0.51, which are well within the parameter space from $-1$ to 1. We note that it took 2 min to run on a SPARCstation Ultra 20 computer to calculate the jackknife variance for this example with 395 observations.

As stated earlier, the estimate of κ could be sensitive to the model for $p_{ir}$. Table 6 contains the estimated parameters for model (11) when including only intercepts in the marginal models for $p_{ir}$ as well as the marginal model which adds all possible pairwise interactions to models (9) and (10). Looking at Table 6, the parameter estimates of α and estimated standard errors are very similar. In this example, the estimates of α are not very sensitive to underfitting or overfitting the marginal model. One possible reason for this is that none of the covariates are apparently predictive of κ; however, we could imagine that an underfitted marginal model could affect the best fitting model for $\kappa_i$.

Table 5. Regression parameter estimates for κ for the pastor and co-ordinator educator study

Effect                        Approach†   Estimate   Standard error   Z       P-value
INTERCEPT                     GLIM         0.345     0.209             1.65   0.099
                              MLE          0.376     0.201             1.87   0.061
YOUTH SERVICE                 GLIM        −0.097     0.111            −0.87   0.384
                              MLE         −0.132     0.107            −1.23   0.219
CHURCH(1)                     GLIM        −0.127     0.178            −0.71   0.478
                              MLE         −0.154     0.170            −0.91   0.363
CHURCH(2)                     GLIM        −0.135     0.175            −0.77   0.441
                              MLE         −0.185     0.163            −1.14   0.254
CHURCH(3)                     GLIM         0.012     0.179             0.07   0.944
                              MLE         −0.086     0.160            −0.54   0.589
CHURCH(4)                     GLIM        −0.300     0.226            −1.33   0.184
                              MLE         −0.322     0.213            −1.51   0.131
CHURCH(5)                     GLIM         0.067     0.185             0.36   0.719
                              MLE          0.033     0.175             0.19   0.849
HOURS(≤2, >2)                 GLIM        −0.282     0.244            −1.16   0.246
                              MLE         −0.308     0.233            −1.32   0.187
HOURS(>2, ≤2)                 GLIM        −0.076     0.179            −0.43   0.667
                              MLE         −0.095     0.169            −0.56   0.575
HOURS(>2, >2)                 GLIM        −0.112     0.174            −0.64   0.522
                              MLE         −0.130     0.165            −0.79   0.430
DONATIONS(<$100, ≥$100)       GLIM         0.099     0.157             0.63   0.529
                              MLE          0.125     0.145             0.86   0.390
DONATIONS(≥$100, <$100)       GLIM         0.116     0.133             0.87   0.384
                              MLE          0.116     0.128             0.90   0.368
DONATIONS(≥$100, ≥$100)       GLIM         0.180     0.166             1.09   0.276
                              MLE          0.163     0.157             1.04   0.298
LIBERAL(NO, YES)              GLIM        −0.122     0.217            −0.56   0.575
                              MLE         −0.035     0.189            −0.19   0.849
LIBERAL(YES, NO)              GLIM        −0.181     0.135            −1.34   0.180
                              MLE         −0.164     0.131            −1.25   0.211
LIBERAL(YES, YES)             GLIM         0.002     0.187             0.01   0.992
                              MLE         −0.050     0.171            −0.29   0.772

†GLIM is our approach, with jackknife standard errors.

Given that none of the covariates in the model for $\kappa_i$ are significant, we are interested in an overall estimate of $\kappa_i$. To obtain an overall estimate of the κ-coefficient, we refitted the marginal models (9) and (10), and modelled $\kappa_i = \alpha_0$. The estimate of κ is $\hat\kappa = 0.1528$. We computed the 95% confidence interval by first calculating the quantity

$$\hat\theta = \log\left(\frac{1 + \hat\kappa}{1 - \hat\kappa}\right),$$

and estimating its asymptotic standard error, denoted $\widehat{\mathrm{SE}}(\hat\theta)$, by using the delta method. We then calculated the end points of the 95% confidence interval with

$$\frac{\exp\{\hat\theta \pm 1.96\, \widehat{\mathrm{SE}}(\hat\theta)\} - 1}{\exp\{\hat\theta \pm 1.96\, \widehat{\mathrm{SE}}(\hat\theta)\} + 1}.$$

The 95% confidence interval is (0.0541, 0.2484), suggesting agreement beyond chance.
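The confidence interval calculation can be sketched in a few lines of Python. The delta-method step used here, multiplying the standard error of the κ estimate by $2/(1 - \hat\kappa^2)$ (the derivative of the log-ratio transformation), is a standard result; the function name `kappa_confidence_interval` is hypothetical, and the standard error of 0.05 passed below is an assumed value chosen only for illustration, since the standard error of the overall estimate is not quoted in the text.

```python
import math


def kappa_confidence_interval(kappa_hat, se_kappa, z=1.96):
    """Interval for kappa built on the log{(1 + kappa)/(1 - kappa)} scale.

    se_kappa is the standard error of kappa_hat; the delta method gives the
    standard error on the transformed scale as 2 * se_kappa / (1 - kappa_hat**2).
    """
    theta = math.log((1 + kappa_hat) / (1 - kappa_hat))
    se_theta = 2 * se_kappa / (1 - kappa_hat ** 2)

    def back(t):
        # Inverse of the transformation: (exp(t) - 1) / (exp(t) + 1).
        return (math.exp(t) - 1) / (math.exp(t) + 1)

    return back(theta - z * se_theta), back(theta + z * se_theta)


# kappa_hat is the overall estimate from the text; the standard error is assumed.
lower, upper = kappa_confidence_interval(0.1528, 0.05)
print(lower, upper)
```

Working on the transformed scale keeps both end points strictly between −1 and 1, which a symmetric interval on the κ scale does not guarantee.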
However, using the scale of Landis and Koch (1977), agreement between the pastor and co-ordinators is at best 'slight'.

Table 6. Regression parameter estimates for \(\gamma\) for the pastor and co-ordinator educator study, for the intercept only and pairwise interaction marginal models

Effect                    Marginal model†   Estimate  Standard error       Z  P-value
INTERCEPT                 INT                  0.354           0.208    1.71    0.087
                          PAIR                 0.361           0.207    1.74    0.082
YOUTH SERVICE             INT                 -0.081           0.112   -0.72    0.472
                          PAIR                -0.098           0.111   -0.89    0.373
CHURCH(1)                 INT                 -0.071           0.185   -0.38    0.704
                          PAIR                -0.132           0.179   -0.74    0.459
CHURCH(2)                 INT                 -0.033           0.176   -0.19    0.849
                          PAIR                -0.141           0.176   -0.80    0.424
CHURCH(3)                 INT                  0.018           0.182    0.10    0.920
                          PAIR                 0.005           0.179    0.03    0.976
CHURCH(4)                 INT                 -0.248           0.226   -1.10    0.271
                          PAIR                -0.299           0.225   -1.33    0.184
CHURCH(5)                 INT                  0.102           0.189    0.54    0.589
                          PAIR                 0.059           0.185    0.32    0.749
HOURS(≤2, >2)             INT                 -0.259           0.241   -1.07    0.285
                          PAIR                -0.319           0.246   -1.30    0.194
HOURS(>2, ≤2)             INT                 -0.041           0.178   -0.23    0.818
                          PAIR                -0.089           0.177   -0.50    0.617
HOURS(>2, >2)             INT                 -0.121           0.173   -0.70    0.484
                          PAIR                -0.127           0.172   -0.74    0.459
DONATIONS(<$100, ≥$100)   INT                  0.056           0.156    0.36    0.719
                          PAIR                 0.098           0.157    0.62    0.535
DONATIONS(≥$100, <$100)   INT                  0.020           0.140    0.14    0.889
                          PAIR                 0.125           0.134    0.93    0.352
DONATIONS(≥$100, ≥$100)   INT                  0.127           0.170    0.75    0.453
                          PAIR                 0.175           0.167    1.05    0.294
LIBERAL(NO, YES)          INT                 -0.117           0.202   -0.58    0.562
                          PAIR                -0.113           0.213   -0.53    0.596
LIBERAL(YES, NO)          INT                 -0.254           0.142   -1.79    0.073
                          PAIR                -0.183           0.137   -1.34    0.180
LIBERAL(YES, YES)         INT                 -0.073           0.201   -0.36    0.719
                          PAIR                 0.003           0.189    0.02    0.984

†INT just contains an intercept in the marginal model; PAIR contains pairwise interactions in the marginal model.

4.2. Spousal education study

Social scientists are often interested in measuring the similarity of the highest educational degree (no high school degree, high school degree or college degree) of a husband and wife.
We want to determine whether agreement between the spouses' highest educational degrees depends on the current family income (less than $25000, $25000–50000 or more than $50000), the number of children (0, 1, 2, 3 or 4), the ages at marriage for the wife and husband, the highest degree (no high school degree, high school degree or college degree) of the father of the wife and the father of the husband, and an indicator for living at home with both parents at age 16 years (yes versus no). The data from 332 husband–wife pairs are taken from the 1991 US General Social Survey (Smith, 1996). Table 7 gives Cohen's (1960) simple estimate of \(\kappa\) for various levels of these covariates, unadjusted for the other covariates. Looking at the results in Table 7, \(\kappa\) appears to differ for different combinations of the father's education.

Table 7. Cohen's (1960) estimates of \(\kappa\) from the spousal education study for subgroups†

Covariate                               N   % agree   \(\hat{\kappa}\)   Standard error   P-value‡
Family income                                                                                0.325
  <$25000                              50        64      0.420            0.110
  $25000–50000                        127        61      0.312            0.071
  >$50000                             157        72      0.458            0.069
Father's education (wife, husband)                                                           0.002
  (below HS, below HS)                 80        70      0.518            0.081
  (below HS, HS)                       31        48      0.192            0.134
  (below HS, COL)                       8        44     -0.231            0.192
  (HS, below HS)                       36        72      0.354            0.144
  (HS, HS)                             80        64      0.270            0.107
  (HS, COL)                            27        52     -0.073            0.171
  (COL, below HS)                      15        73      0.492            0.199
  (COL, HS)                            28        79      0.569            0.143
  (COL, COL)                           29        83      0.356            0.220
Age at marriage§ (years) (wife, husband)                                                     0.503
  (≤22, ≤22)                          129        70      0.476            0.070
  (≤22, >22)                           80        64      0.400            0.090
  (>22, ≤22)                            9        56      0.053            0.296
  (>22, >22)                          116        66      0.387            0.074
Home at 16 years (wife, husband)                                                             0.299
  (no, no)                              6        50     -0.200            0.150
  (no, yes)                            29        59      0.374            0.132
  (yes, no)                            27        67      0.449            0.148
  (yes, yes)                          270        68      0.441            0.049
Number of children                                                                           0.872
  0–1                                 244        66      0.433            0.051
  2–4                                  90        69      0.417            0.085

†HS, high school; COL, college.
‡Wald test for equal \(\kappa\)s over subgroups.
§22 years is the median age, collapsed over spouse.

Because the ratings are ordinal, we chose to model the marginal probabilities \(p_{ir,j} = \mathrm{pr}(Y_{ir} = j \mid x_i, x_{ir})\) with the cumulative logistic link function (McCullagh, 1980). In particular, we model \(\log\{F_{ir,j} / (1 - F_{ir,j})\}\) for \(j = 1, 2\), where \(F_{ir,j} = p_{ir,1} + \ldots + p_{ir,j}\) is a 'cumulative' probability of observing at most the \(j\)th level. For example, \(F_{ir,2} = p_{ir,1} + p_{ir,2}\) is the probability of at most a high school degree (\(j = 1\), no high school degree, or \(j = 2\), high school degree). The covariates are the number of children in the family (KIDS\(_i\)), current family income (INCOME2\(_i\) equals 1 if the income is $25000–50000 and 0 otherwise, INCOME3\(_i\) equals 1 if the income is more than $50000 and 0 otherwise), father's education (FATHED2\(_{ir}\) equals 1 if the father's education is a high school degree and 0 otherwise, FATHED3\(_{ir}\) equals 1 if the father's education is a college degree and 0 otherwise), age at marriage (AGE\(_{ir}\)) and home with both parents at age 16 years (HOME\(_{ir}\)). We fit a proportional odds model (McCullagh, 1980) to the cumulative logits, i.e.

\[
\begin{aligned}
\log\left( \frac{F_{ir,j}}{1 - F_{ir,j}} \right) = {} & \beta_{0rj} + \beta_{1r}\,\mathrm{KIDS}_i + \beta_{2r}\,\mathrm{INCOME2}_i + \beta_{3r}\,\mathrm{INCOME3}_i + \beta_{4r}\,\mathrm{FATHED2}_{ir} \\
& + \beta_{5r}\,\mathrm{FATHED3}_{ir} + \beta_{6r}\,\mathrm{AGE}_{ir} + \beta_{7r}\,\mathrm{HOME}_{ir} + \beta_{8r}\,\mathrm{INCOME2}_i\,\mathrm{FATHED2}_{ir} \\
& + \beta_{9r}\,\mathrm{INCOME2}_i\,\mathrm{FATHED3}_{ir} + \beta_{10r}\,\mathrm{INCOME3}_i\,\mathrm{FATHED2}_{ir} + \beta_{11r}\,\mathrm{INCOME3}_i\,\mathrm{FATHED3}_{ir} \\
& + \beta_{12r}\,\mathrm{INCOME2}_i\,\mathrm{AGE}_{ir} + \beta_{13r}\,\mathrm{INCOME3}_i\,\mathrm{AGE}_{ir}
\end{aligned}
\qquad (12)
\]

for \(j = 1, 2\) and \(r = 1, 2\). The estimates of the parameters of the cumulative logistic model (12) are given in Table 8. All other interaction terms were non-significant at the 0.25 level of significance.

Table 8. Estimates from the spousal education study for the cumulative logistic marginal model for education level

Parameter              Estimate  Standard error       Z  P-value
Wife's education level
INTERCEPT1               -0.347           1.202   -0.29    0.773
INTERCEPT2                3.489           1.235    2.83    0.005
KIDS                      0.100           0.096    1.04    0.299
INCOME2                  -1.136           1.402   -0.81    0.418
INCOME3                   0.845           1.559    0.54    0.588
FATHED2                  -2.079           0.700   -2.97    0.003
FATHED3                  -3.990           0.903   -4.42    0.000
AGE AT MAR                0.038           0.053    0.71    0.480
HOME AT 16               -0.767           0.392   -1.96    0.050
INCOME2 * FATHED2         1.218           0.810    1.50    0.133
INCOME2 * FATHED3         1.760           1.023    1.72    0.085
INCOME3 * FATHED2         1.741           0.814    2.14    0.032
INCOME3 * FATHED3         2.191           1.043    2.10    0.036
INCOME2 * AGE            -0.021           0.062   -0.34    0.737
INCOME3 * AGE            -0.193           0.069   -2.81    0.005
Husband's education level
INTERCEPT1                0.076           1.156    0.07    0.948
INTERCEPT2                2.981           1.175    2.54    0.011
KIDS                      0.112           0.090    1.24    0.214
INCOME2                  -0.978           1.493   -0.66    0.512
INCOME3                   0.378           1.578    0.24    0.811
FATHED2                  -2.006           0.622   -3.22    0.001
FATHED3                  -3.621           0.955   -3.79    0.000
AGE AT MAR                0.001           0.044    0.02    0.988
HOME AT 16               -0.355           0.373   -0.95    0.342
INCOME2 * FATHED2         1.245           0.718    1.73    0.083
INCOME2 * FATHED3         0.586           1.241    0.47    0.637
INCOME3 * FATHED2         1.941           0.727    2.67    0.008
INCOME3 * FATHED3         1.854           1.095    1.69    0.090
INCOME2 * AGE            -0.028           0.060   -0.46    0.646
INCOME3 * AGE            -0.126           0.065   -1.95    0.051

We fit the following model for the \(\kappa\)-coefficient:

\[
\kappa_i = \gamma_0 + \gamma_1\,\mathrm{KIDS}_i + \gamma_2\,\mathrm{INCOME2}_i + \gamma_3\,\mathrm{INCOME3}_i + \gamma_4\,\mathrm{FATHED}_i + \gamma_5\,\mathrm{AGE}_i + \gamma_6\,\mathrm{HOME}_i,
\qquad (13)
\]

where FATHED\(_i\) equals 0 if the levels of education of the fathers of the husband and wife are the same, and 1 otherwise, AGE\(_i\) equals 0 if the age at marriage of the husband and wife is the same, and 1 otherwise, and HOME\(_i\) equals 0 if the husband and wife have the same value of HOME\(_{ir}\), and 1 otherwise. Table 9 gives a comparison of the estimates of \(\gamma\) using marginal model (12) as well as an intercept-only marginal model, with jackknife standard errors.

Table 9. Regression parameter estimates for \(\gamma\) for the spousal education study, using the best fitting marginal model (Table 8) and the marginal model containing only intercepts

Parameter                 Marginal model†   Estimate  Standard error       Z  P-value
INTERCEPT                 (12)                 0.364           0.129    2.83    0.005
                          INT                  0.378           0.127    2.97    0.003
KIDS                      (12)                 0.025           0.038    0.65    0.516
                          INT                  0.027           0.034    0.82    0.412
INCOME2                   (12)                -0.112           0.147   -0.76    0.447
                          INT                 -0.094           0.145   -0.64    0.522
INCOME3                   (12)                 0.016           0.145    0.11    0.912
                          INT                  0.097           0.139    0.69    0.490
FATHED DIF                (12)                -0.140           0.057   -2.47    0.014
                          INT                 -0.144           0.055   -2.62    0.009
AGE DIF                   (12)                 0.005           0.010    0.55    0.582
                          INT                  0.004           0.009    0.47    0.638
HOME AT 16 DIF            (12)                -0.037           0.116   -0.32    0.749
                          INT                 -0.031           0.106   -0.29    0.772

†INT, intercept-only model.

Looking at Table 9, the parameter estimates of \(\gamma\) and estimated standard errors are very similar when either marginal model (12) or the intercept-only model is used. In this example, the estimates of \(\gamma\) are again not very sensitive to underfitting or overfitting the marginal model. For either marginal model, the results in Table 9 agree with the univariate results in Table 7 in that only father's education is significant. From Table 9, when the fathers of the husband and wife have different levels of education, the \(\kappa\)-coefficient decreases by approximately 0.14. We note that, using marginal model (12) and model (13), it took 2 min and 26 s to calculate the jackknife standard errors on a SPARCstation Ultra 20 computer for this example with 332 observations. Further, using these same models, the estimated values of \(\hat{\kappa}_i\) range from \(-0.05\) to 0.75, which are well within the parameter space \((-1, 1)\).
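As a worked illustration of how the fitted model (13) turns covariates into a value of the kappa-coefficient, the short Python sketch below evaluates the linear predictor with the point estimates listed in the "(12)" rows of Table 9. The dictionary `GAMMA_HAT`, the function names and the example couple are hypothetical and were chosen only for illustration; the covariate coding follows the definitions given in the text.

```python
# Point estimates for model (13), copied from the "(12)" rows of Table 9.
GAMMA_HAT = {
    "INTERCEPT": 0.364,
    "KIDS": 0.025,
    "INCOME2": -0.112,
    "INCOME3": 0.016,
    "FATHED": -0.140,
    "AGE": 0.005,
    "HOME": -0.037,
}


def fitted_kappa(kids, income2, income3, fathed, age, home, coef=GAMMA_HAT):
    """Evaluate the linear model (13) for one husband-wife pair."""
    return (
        coef["INTERCEPT"]
        + coef["KIDS"] * kids
        + coef["INCOME2"] * income2
        + coef["INCOME3"] * income3
        + coef["FATHED"] * fathed
        + coef["AGE"] * age
        + coef["HOME"] * home
    )


def in_parameter_space(kappa):
    """Check that a fitted value lies strictly inside (-1, 1)."""
    return -1.0 < kappa < 1.0


# Hypothetical couple: two children, family income $25000-50000, and all
# "difference" indicators equal to 0 except, in the second call, FATHED.
same_fathers = fitted_kappa(kids=2, income2=1, income3=0, fathed=0, age=0, home=0)
diff_fathers = fitted_kappa(kids=2, income2=1, income3=0, fathed=1, age=0, home=0)
print(round(same_fathers, 3), round(diff_fathers, 3))  # 0.302 0.162
print(in_parameter_space(same_fathers), in_parameter_space(diff_fathers))
```

The two fitted values differ by the FATHED coefficient, which is the decrease of approximately 0.14 noted in the text. Because model (13) is linear, nothing forces fitted values to stay inside (-1, 1), so a check such as `in_parameter_space` is a reasonable safeguard when applying the model to new covariate patterns.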
Unfortunately, as discussed earlier, we do not compare this method with an MLE because a likelihood method requires a full specification of the joint distribution of the two ratings on the same subject, which can be quite complicated to specify and has only been developed for dichotomous ratings (Shoukri and Mian, 1996).

5. Conclusion

In this paper we considered a simple method for estimating a linear regression model for the \(\kappa\)-coefficient, for the case when each subject is rated by two raters. Although it was not shown here, the estimates can be shown to be the solution to a generalized estimating equation (Liang and Zeger, 1986). The method is attractive because it can be implemented easily by using existing general purpose statistical software. The method can be easily extended to more than two raters. With more than two ratings per subject, we can still use a (possibly ordinal) logistic regression for each rating on a subject, and a linear regression model for a binary indicator of agreement between a pair of ratings. For more than two ratings, though, the resulting estimator may not be the most efficient estimator of the \(\kappa\)-coefficient, which is a topic for future research. Although useful for dichotomous ratings, our method is especially attractive for polytomous ratings, in that it is not necessary to specify the full joint multinomial distribution between the pair of ratings on each subject.

Acknowledgements

We are grateful for the support provided by National Institutes of Health grant CA 55576 and NIMH grant 1-R01-MH-54693-01A1.

References

Benson, P. L. and Eklin, C. H. (1991) Effective Christian Education: a National Study of Protestant Congregations. Minneapolis: Search Institute.
Cohen, J. (1960) A coefficient of agreement for nominal scales. Educ. Psychol. Measmnt, 20, 37–46.
Francis, B., Green, M. and Payne, C. (1993) The GLIM System, Release 4 Manual. New York: Clarendon.
Hastie, T. J. and Pregibon, D. (1993) Generalized linear models. In Statistical Models in S (eds J. M. Chambers and T. J. Hastie). Pacific Grove: Wadsworth and Brooks.
Landis, J. R. and Koch, G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Liang, K. Y. and Zeger, S. L. (1986) Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
McCullagh, P. (1980) Regression models for ordinal data (with discussion). J. R. Statist. Soc. B, 42, 109–142.
Prentice, R. L. (1988) Correlated binary regression with covariates specific to each binary observation. Biometrics, 44, 1033–1048.
SAS Institute (1997) SAS/STAT Software: Changes and Enhancements through Release 6.12. Cary: SAS Institute.
Shoukri, M. M. and Mian, I. U. (1996) Maximum likelihood estimation of the kappa coefficient from bivariate logistic regression. Statist. Med., 15, 1409–1419.
Smith, T. W. (1996) Who, What, When, Where, and Why: an Analysis of Usage of the General Social Survey. Chicago: National Opinion Research Center.
Stata Corporation (1995) Stata Reference Manual: Release 4.0, vol. 2, pp. 399–410. College Station: Stata Corporation.
White, H. (1982) Maximum likelihood estimation of misspecified models. Econometrica, 50, 1–25.
Wu, C. F. J. (1986) Jackknife bootstrap and other resampling methods in regression analysis. Ann. Statist., 14, 1261–1295.