Comparing the Odds and the Dependence Ratios

Anders Ekholm
Rolf Nevanlinna Institute, P.O. Box 4, (Yliopistonkatu 5), FIN-00014 UNIVERSITY OF HELSINKI, Finland. Email: anders.ekholm@helsinki.fi

Summary

The parametrisation of a multivariate binary response in terms of first order moments and dependence ratios of order from two up to the dimension of the response is presented. Conventionally, the mixed parameter is used for marginal regression analysis of a multivariate binary response. In the mixed parameter, the first order moments are complemented by odds ratios. The properties of the less well known dependence ratios are expounded and compared to those of the odds ratio. To demonstrate how the dependence ratio parametrisation works, seven different binary time series, all of length 10, are constructed and analysed, one by one.

Key words: Asymmetric association measure; Combining regression with association; Parameter orthogonality; Short binary time series.

1 Motivation

In my contribution to the ‘Fellman Festschrift’ I introduced the success parametrisation for a multivariate binary response (Ekholm, 1991). It gained a number of references, but few applications. Soon, I gave it up and developed instead, jointly with a group of coauthors, the dependence ratio parametrisation, which we have applied for analysing a fair number of empirical data sets (Ekholm, Smith & McDonald, 1995; Ekholm & Skinner, 1998; Ekholm, McDonald & Smith, 2000; Ekholm, Jokinen & Kilpi, 2002). Ekholm, Jokinen, McDonald & Smith (2002) generalized the dependence ratio from a binary to an ordinal response, and Jokinen, McDonald & Smith (2002) formulated a number of new association generating mechanisms using the dependence ratio. Despite being handy in applied work, the dependence ratio has gained fewer references and applications than the success parametrisation.
On the contrary, referees have occasionally dismissed our papers, it seems to me, just because the dependence ratio is not the odds ratio. Perplexed by this conservatism, I devote my contribution for the ‘Nordberg Festschrift’ to a comparison of the dependence ratio with the odds ratio, and to some quotes exemplifying the arguments raised against the dependence ratio. Naturally, I do not know the names of the referees and I will also leave some authors and editors anonymous. For example, AA has referred to Ekholm (1991) several times during the nineties without comment, but in an invited talk at an international conference in 2002, he complained that there are problems with the success parametrisation. Strangely, he has never referred to the dependence ratio.

2 Simultaneous regression and association modelling

We denote the response of unit i = 1, . . . , n at subunit tk, for k = 1, . . . , q, by Yi tk = 1/0 and the q-variate response of unit i by Yi = (Yi t1, . . . , Yi tq). In the data set analysed by Ekholm, Jokinen & Kilpi (2002), Yi tk = 1/0 indicates carriage/non-carriage of a certain bacterium by child i at age tk. Another useful application of the hierarchical data structure is when the unit is a family and the subunit is a member of the family (Ekholm et al. 1995, Example 4.1). With each subunit is associated a p × 1 vector of fixed explanatory values, xi tk. In modelling the data set, {(xi tk, yi tk); i = 1, . . . , n, k = 1, . . . , q}, we want to combine (i) a marginal regression model for the univariate means µi tk = E(Yi tk | xi tk), with (ii) an association model depicting the dependence between the subunit responses Yi t1, . . . , Yi tq. Ambitiously, we model the joint probability of the q responses of unit i, which we refer to as the path probability πi and define as

πi = pr(Yi t1 = yi1, . . . , Yi tq = yiq).   (1)

The marginal regression model is the conventional logistic regression, for i = 1, . . . , n and k = 1, . . .
, q,

µi tk = pr(Yi tk = 1 | xi tk) = {1 + exp(−β xi tk)}⁻¹,   (2)

where β = (β1, . . . , βp) is a 1 × p vector of regression coefficients, constant with respect to both i and tk. The response vectors Y1, . . . , Yn are assumed to be independent of each other. For task (ii) we need measures of association that handily complement the first order moments µi tk in specifying the joint q-dimensional probability. The dependence ratios, denoted by τ, of second, third, . . . , qth order are well suited for this task, being defined by ratios of moments of all orders, as

τt1t2 = µi t1t2 / (µi t1 µi t2),   τt1t2t3 = µi t1t2t3 / (µi t1 µi t2 µi t3),   . . . ,   τt1...tq = µi t1...tq / (µi t1 · · · µi tq),   (3)

where, for example, µi t1t2t3 = E(Yi t1 Yi t2 Yi t3 | xi t1, xi t2, xi t3) is the third-order product moment. There are q!/{w!(q−w)!} dependence ratios of order w = 2, . . . , q and accordingly 2^q − q − 1 dependence ratios in all. The q first order moments µi t1, . . . , µi tq and the 2^q − q − 1 dependence ratios, together, specify fully the path probability πi of child i (Ekholm et al. 1995). In fact, there exists a closed form expression, denoted here by π∗(·), for the path probability in terms of the first order moments and the dependence ratios of all orders,

πi = pr(Yi t1 = yi1, . . . , Yi tq = yiq) = π∗(µi t1, . . . , µi tq, τt1t2, . . . , τt1...tq).   (4)

To achieve parsimonious modelling, a strong structure has to be imposed on the dependence ratios, specifying a meaningful mechanism that generates the association between Yi t1, . . . , Yi tq from a small number of association parameters. Denoting the vector of association parameters by α and the vector of dependence ratios by τ = (τt1t2, . . . , τt1...tq), we refer to a set of equations

τ = g(α),   (5)

as an association model with parameter α. Ekholm et al. (2000) present a number of association generating mechanisms for a binary response, Ekholm et al.
(2002) for a multicategory response and Jokinen et al. (2002) extend the latter set. It follows from (2), (4) and (5), jointly, that there exists an explicit expression, denoted here by π(·), for the path probability in terms of β, α and the xi tk,

πi = pr(Yi t1 = yi1, . . . , Yi tq = yiq) = π(β, α, xi t1, . . . , xi tq).   (6)

For given values of the regression coefficients, the association parameters and the data {(xi tk, yi tk); i = 1, . . . , n, k = 1, . . . , q}, (6) gives computational access to the log-likelihood function l(β, α) in a single, iteration-free step.

3 The Dependence Ratio Compared with the Odds Ratio

Dropping the subscript i, we consider the response path of a generic unit, and denote the q-dimensional binary response by (Y1, . . . , Yq). For several important differences between the dependence ratio and the odds ratio parametrisations, it is enough to consider the bivariate case, q = 2. The notation is conveniently introduced by spelling out the relevant 2 × 2 table as Table 1.

Table 1. The probabilities of the bivariate distribution for (Y1, Y2).

          Y2 = 1       Y2 = 0                 Sum
Y1 = 1    µ12          µ1 − µ12               µ1
Y1 = 0    µ2 − µ12     1 − µ1 − µ2 + µ12      1 − µ1
Sum       µ2           1 − µ2                 1

The dependence ratio for (Y1, Y2) is τ12 = µ12/(µ1 µ2) and the odds ratio, also called the cross product ratio, is χ12 = µ12(1 − µ1 − µ2 + µ12)/{(µ1 − µ12)(µ2 − µ12)}. We denote for (Y1, Y2) the dependence ratio parametrisation by υ = (µ1, µ2, τ12) and the odds ratio parametrisation by ψ = (µ1, µ2, χ12). We list the most important features of the dependence ratio, including comparisons to the odds ratio:

(i) The metric of the w-way dependence ratio τ1...w is that pr(Y1 = · · · = Yw = 1) is (τ1...w − 1)100% greater than if the w events {Y1 = 1}, . . . , {Yw = 1} were independent. The definitions (3), and accordingly the interpretations of τ1...w for w = 2, . . . , q, do not grow in conceptual complexity when w increases.
In contrast, the 3-way odds ratio is a ratio of two conditional 2-way odds ratios, in obvious notation, χ123 = χ12|Y3=1 / χ12|Y3=0, and so forth in geometrically growing complexity.

(ii) A computational advantage. Since, from (3), µ12 = µ1 µ2 τ12, it is obvious from Table 1 that the cell probabilities pr(Y1 = u, Y2 = v), for u, v = 1, 0, have explicit expressions in terms of µ1, µ2 and τ12. The thrust of (4) and (6) is that analogous explicit expressions for the path probabilities are valid when q > 2 (Ekholm et al., 1995). In contrast, to find the path probabilities from µ1, µ2 and χ12 one has to solve a quadratic, and when q > 2 one has to resort to iterative procedures. Therefore, maximizing the log-likelihood function using the odds ratio parametrisation requires nesting two iterative procedures, while using the dependence ratio parametrisation iteration in the parameter space is enough.

(iii) An interpretational advantage. When the subunits are ordered, for example by time as in the data on carriage of bacteria (Ekholm et al., 2002), then the transition probabilities between the states are of interest. Interpretation of the dependence ratio is facilitated by the relation τ12 = pr(Y2 = 1 | Y1 = 1)/pr(Y2 = 1).

(iv) The range of τ12 depends on µ1 and µ2; the lower and upper bounds are

max(0, µ1⁻¹ + µ2⁻¹ − µ1⁻¹µ2⁻¹) < τ12 < min(µ1⁻¹, µ2⁻¹).   (7)

For the case µ1 = µ2 = µ the parameter space (µ, τ) is illustrated in Figure 1. For µ < 0.5 the parameters are reasonably variation independent, while when µ approaches one, the range of the dependence ratio is quite narrow around the independence value 1. In contrast, the odds ratio, χ, is variation independent of the marginal probabilities µ1 and µ2.
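As a small illustration (my own sketch, not code from the paper), the two association measures of Table 1 and the bounds (7) can be computed directly from the cell probabilities:

```python
# A minimal sketch: the dependence ratio tau12 and the odds ratio chi12
# of Table 1, computed from the cell probabilities, together with the
# range bounds on tau12 given in (7).

def ratios(p11, p10, p01):
    """Return (mu1, mu2, tau12, chi12) for the 2 x 2 table of Table 1."""
    p00 = 1.0 - p11 - p10 - p01
    mu1 = p11 + p10                  # pr(Y1 = 1)
    mu2 = p11 + p01                  # pr(Y2 = 1)
    tau = p11 / (mu1 * mu2)          # dependence ratio
    chi = p11 * p00 / (p10 * p01)    # odds (cross product) ratio
    return mu1, mu2, tau, chi

def tau_bounds(mu1, mu2):
    """Lower and upper bounds for tau12, equation (7)."""
    lower = max(0.0, 1/mu1 + 1/mu2 - 1/(mu1 * mu2))
    upper = min(1/mu1, 1/mu2)
    return lower, upper
```

Under independence, p11 = µ1µ2 and both ratios equal one; the upper bound τ12 < min(1/µ1, 1/µ2) is exactly the requirement pr(Y2 = 1 | Y1 = 1) = µ2 τ12 < 1 noted in (iii).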
A referee stated ‘. . . the lack of a good grip on the range of the dependence ratios seems to be a major drawback.’ I believe that the bound τ12 < 1/µ reflects a meaningful unidentifiability, for if pr(Y1 = 1) = pr(Y2 = 1) ≈ 1, then only scant empirical evidence can be found for discriminating between the propositions (a) that the events {Y1 = 1} and {Y2 = 1} are independent and (b) that they are associated. Also, the bound τ < 1/µ is, by (iii) above, equivalent to pr(Y2 = 1 | Y1 = 1) < 1 and therefore desirable.

[Figure 1: Parameter space of (µ, τ) for µ1 = µ2 = µ.]

(v) The mappings τ12 → χ12 and ρ12 → τ12, where ρ12 is the correlation coefficient of (Y1, Y2), are easily seen to be

χ12 − 1 = (τ12 − 1) / {(1 − µ1 τ12)(1 − µ2 τ12)},   (8)

and

τ12 − 1 = {(1 − µ1)(1 − µ2)/(µ1 µ2)}^(1/2) ρ12.   (9)

Since, by (7), both factors in the denominator of (8) are positive, it follows that χ12 and τ12 indicate negative, zero and positive association simultaneously. If µt τ12, for t = 1, 2, are both small, then χ12 ≈ τ12. Except when there is zero association, |χ12 − 1| > |τ12 − 1|, implying that τ12 is always bounded towards zero association by χ12. This result gives an important grip on the range of τ12, see also Table 2. The intuitive reason for this result is that τ12 measures, solely, the strength of association between the events {Y1 = 1} and {Y2 = 1}, see (i) above, while χ12 measures the strength of association between the pair of binary random variates (Y1, Y2). Being focused on specific events, the interpretation of the dependence ratio is sharper, which we demonstrate with a numerical example in Section 4.

(vi) DR, OR and RR. By being focused on specific events, the dependence ratio (DR) is more closely akin to the risk ratio (RR) than to the odds ratio (OR).
Consider the case when Y1 is explanatory, for example exposed/not exposed, denoted by, respectively, E and E c , and Y2 is a response, for example presence/absence of disease, denoted by, respectively, D and D c , then RR = pr(D|E)/pr(D|E c ) and DR = pr(D|E)/pr(D) = RR × pr(D|E c )/pr(D). Therefore, DR can be interpreted as a scaled RR. If exposure is a rare condition, then pr(E) ≈ 0 and the scale factor pr(D|E c )/pr(D) ≈ 1. Sackett, Deeks and Altman (1996) prefer RR over OR, since RR has a more pertinent medical interpretation. (vii) Lack of invariance to coding. Consider the consequences of reversing the coding, meaning that we transform Yk → Yk∗ = 1 − Yk , for k = 1, 2. Denoting the odds ratio, the correlation coefficient and the dependence ratio for the starred ∗ , one finds that χ∗12 = χ12 , ρ∗12 = ρ12 , but responses by, respectively, χ∗12 , ρ∗12 and τ12 ∗ τ12 − 1 = (τ − 1)µ1 µ2 {(1 − µ1 )(1 − µ1 )}−1 . The dependence ratio lacks invariance to coding because it measures event specific association. (viii) Lack of parameter orthogonality. The lack of variation independence, 8 see (iv), implies that τ12 is not orthogonal to µt , for t = 1, 2. In contrast, χ12 is orthogonal to µt . Also, for q > 2, in the general odds ratio parametrisation ψ = (µ1 , . . . , µq , χ12 , . . . , χ1...q ) the mean parameters, µt , for t = 1, . . . , q are orthogonal to the canonical parameters, χ1...w , for w = 2, . . . , q (Barndorff-Nielsen & Cox, 1994, Sec. 2.9). When introducing the dependence ratio parametrisation, we pointed out that the regression coefficients, β, and the association parameters, α, are not orthogonal and recommended routine calculation of the correlations of the parameter estimates (Ekholm et al., 1995, Sec. 3.4). In all applications we have followed this advice and only rarely have we encountered high correlations. 
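The mappings (8) and (9) in (v), and the recoding formula in (vii), are easy to check numerically. The following sketch is my own code; the values µ1 = µ2 = 0.2 of Table 2 are used only as an example:

```python
import math

def tau_from_rho(rho, mu1, mu2):
    # Equation (9): tau12 - 1 = sqrt{(1 - mu1)(1 - mu2)/(mu1 mu2)} * rho12
    return 1.0 + math.sqrt((1 - mu1) * (1 - mu2) / (mu1 * mu2)) * rho

def chi_from_tau(tau, mu1, mu2):
    # Equation (8): chi12 - 1 = (tau12 - 1)/{(1 - mu1 tau)(1 - mu2 tau)}
    return 1.0 + (tau - 1.0) / ((1 - mu1 * tau) * (1 - mu2 * tau))

def tau_reversed(tau, mu1, mu2):
    # Point (vii): the dependence ratio after recoding Y* = 1 - Y
    return 1.0 + (tau - 1.0) * mu1 * mu2 / ((1 - mu1) * (1 - mu2))
```

For example, ρ12 = 0.2 with µ1 = µ2 = 0.2 gives τ12 = 1.8 and χ12 ≈ 2.95, which Table 2 rounds to 3.0; and χ12 computed from the recoded margins and τ*12 agrees with χ12 from the original coding, illustrating the invariance of the odds ratio.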
An important result in the theory of statistical inference is that inference concerning the parameter of interest, for example β, is sharper if β is orthogonal to the nuisance parameter, for example α (Barndorff-Nielsen & Cox, 1994, Sec. 3.6). The conventional casting of interest and nuisance is, in analyses of multivariate binary or ordinal responses, indeed, just this. The more data sets we have analysed, the more we have learned to appreciate that meaningful association modelling can yield at least equally important scientific insights as the regression modelling, see Ekholm, Jokinen & Kilpi (2002) and Ekholm, Jokinen, McDonald & Smith (2002). Accordingly, we consider the parameter of interest to be (β, α). In applications one needs to monitor the standard errors of the estimates. In an Appendix we derive the asymptotic covariance matrices for υ̂ = (µ̂1, µ̂2, τ̂12) and for ψ̂ = (µ̂1, µ̂2, χ̂12), and from these matrices it is clear that the standard errors of τ̂12 and χ̂12 are of form, respectively, cτ/√n and cχ/√n, where n is the number of bivariate observations. We list in Table 2 the values of τ12 and χ12, calculated using (8) and (9), and the coefficients cτ and cχ, calculated using (A4) and (A7), when ρ12 ranges between its minimum and its maximum value, for µ1 = µ2 = 0.2. Not only do the values of the odds ratio increase very fast with increasing correlation, but the estimates of the odds ratio become statistically very unstable.

(ix) Negative fitted path probabilities. Jukka Jokinen’s R-program for fitting our models maximizes the log-likelihood function by a direct search, restricted to values of the parameters that produce non-negative fitted path probabilities for all observed paths. However, since we parametrize in terms of probabilities, not in terms of logits of probabilities, the joining of a regression with an association model can produce negative fitted probabilities for unobserved paths.
We do not regard this feature as a serious drawback of our approach, but rather as providing a model validation tool. We strive to specify meaningful models for both the regression and the association. A balance between explanations in terms of variation (a) between units, i.e., regression, and (b) within units, i.e., association, has to be found. Finding the proper balance is part of doing meaningful science, instead of throwing all available explanatory variables into a foolproof regression mill, treating the association as a nuisance parameter, which is what authors using the odds ratio parametrisation usually do. An associate editor recently wrote that to him our method ‘allows better estimation of parameters at the expense of a less flexible model’. To me statistical models are better if they are not too flexible. An editor recently dismissed our method with the unsubstantiated argument that ‘. . . the drawbacks of the dependence ratio parametrization in the regression setting are major’. Strangely, this editor had seen a supporting report in which we reanalysed a data set, previously analysed by him, complementing his regression analysis by an association analysis which yielded new empirical insights.

Table 2. Values of τ12, cτ = s.e.(τ̂12) × √n, χ12 and cχ = s.e.(χ̂12) × √n for a range of values of the correlation coefficient ρ12 in a 2 × 2 table, with µ1 = µ2 = 0.2, when the number of observations is n = 1.

ρ12      τ12     cτ      χ12      cχ
−0.25    0       –       0        –
−0.20    0.20    2.1     0.13     1.5
−0.10    0.60    3.4     0.48     3.6
 0.00    1.0     4.0     1.0      6.3
 0.10    1.4     4.4     1.8      10.2
 0.20    1.8     4.6     3.0      16.4
 0.30    2.2     4.8     4.8      26.7
 0.40    2.6     5.1     7.9      44.8
 0.50    3.0     5.5     13.5     79.5
 0.60    3.4     6.0     24.4     154.5
 0.70    3.8     6.8     49.6     349.3
 0.80    4.2     7.7     126.0    1052.3
 0.90    4.6     8.8     563.5    6467.0
 1.00    5       –       ∞        –

4 Seven short binary time series

A patient’s condition is recorded for ten consecutive days as bad, scored 1, or good, scored 0.
Requested to analyse the binary time series (0, 0, 1, 1, 1, 1, 1, 1, 1, 0), two explanations spring to mind: (i) If bad has no carry over effect from one day to the next, then the probability of bad is 0.7. (ii) If bad, once it appears, is likely to carry over into the next days, then the marginal probability of bad is, presumably, smaller and the dependence between bad on consecutive days needs to be assessed. As a demonstration of how the dependence ratio works, I fit two models and compute a number of estimates. The two models are:

Model I for independence: we assume Y1, . . . , Y10 ⊥⊥ and marginal stationarity µ = pr(Yt = 1), for t = 1, . . . , 10. The quantities to be estimated from the data are (a) µ̃ = (y1 + · · · + y10)/10 and (g) p̂r(d|I), the maximum likelihood estimate of the probability of the observed data under I.

Model M for Markov: we assume pr(Yt+1 = y | Y1, . . . , Yt) = pr(Yt+1 = y | Yt), for t = 1, . . . , 9, stationarity in dependence ratio, that is τt(t+1) = τ, and also marginal stationarity as under I. In addition to the numbers (a) and (g), we calculate (b) τ̃, the estimate of τ directly from the 9 bivariate observations (yt, yt+1), (c) χ̃, the estimate of the odds ratio directly from the 9 bivariate observations, (d) µ̂, the maximum likelihood estimate of µ under M, (e) τ̂, the maximum likelihood estimate of τ under M, (f) p̂r(d|M), (h) p̃r(1|1), the estimate of pr(Yt+1 = 1 | Yt = 1) directly from the data, (i) p̂r(1|1) = µ̂τ̂, the maximum likelihood estimate under M of pr(Yt+1 = 1 | Yt = 1), and finally (j) ω = 2[log{p̂r(d|M)} − log{p̂r(d|I)}], the likelihood ratio statistic for testing model I against model M.

Denoting the series (0, 0, 1, 1, 1, 1, 1, 1, 1, 0) by (2, 7, 1), since two zeros are followed by seven ones and they by one zero, the statistics (a) to (j) are listed in Table 3, not only for (2, 7, 1), but also for six other series (2, 8, 0), (2, 6, 2), (2, 5, 3), (2, 4, 4), (2, 3, 5), (2, 2, 6).
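The fit of models I and M can be sketched as follows. This is my own code, not Jukka Jokinen's R program; the grid ranges are my own choice, and the transition probabilities are those implied by marginal stationarity, pr(1 → 1) = µτ and pr(0 → 1) = µ(1 − µτ)/(1 − µ):

```python
# A minimal sketch: direct grid search for the maximum likelihood
# estimates under model M (stationary marginal mu, dependence ratio tau
# between adjacent days), applied to the series (2, 7, 1).
import math

y = (0, 0, 1, 1, 1, 1, 1, 1, 1, 0)          # the series (2, 7, 1)

def loglik_markov(y, mu, tau):
    """Log-likelihood under model M."""
    p11 = mu * tau                           # pr(1 -> 1) = mu * tau
    p01 = mu * (1 - mu * tau) / (1 - mu)     # pr(0 -> 1), from stationarity
    if not (0 < p11 < 1 and 0 < p01 < 1):
        return float("-inf")                 # outside the parameter space
    ll = math.log(mu if y[0] else 1 - mu)    # initial state
    for a, b in zip(y, y[1:]):               # nine transitions
        p = p11 if a else p01
        ll += math.log(p if b else 1 - p)
    return ll

# Model I: independence; the MLE of mu is the sample mean 0.7.
mu_I = sum(y) / len(y)
ll_I = sum(math.log(mu_I if v else 1 - mu_I) for v in y)

# Model M: crude grid search, 0.30 <= mu <= 0.90, 1.0 <= tau <= 2.0.
ll_M, mu_M, tau_M = max(
    (loglik_markov(y, 0.30 + 0.005 * i, 1.0 + 0.005 * j),
     0.30 + 0.005 * i, 1.0 + 0.005 * j)
    for i in range(121) for j in range(201))

omega = 2 * (ll_M - ll_I)                    # statistic (j)
```

Within the grid resolution this reproduces row (2, 7, 1) of Table 3: µ̂ ≈ 0.58, τ̂ ≈ 1.35, p̂r(d|M) ≈ 0.0044, p̂r(d|I) ≈ 0.0022 and ω ≈ 1.37.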
The log-likelihood function is unbounded for the more extreme cases, in which not both (0, 0) and (1, 1) for adjacent times are present. Since model M is better supported by the data than model I, although not significantly better, I select M for my analysis of the series (2, 7, 1) and find that the marginal probability of bad is 0.58, clearly less than under I, when it is 0.70. The maximum likelihood estimate for the probability that bad on day t is followed by bad on day (t + 1) is 0.79, considerably higher than the marginal probability of bad, but smaller than the model free estimate, p̃r(Yt+1 = 1 | Yt = 1) = 0.86. Inspecting Table 3 as a whole, some remarks are:

(1) The dependence ratio, by measuring event specific association, here association between bad on adjacent days, provides an interpretational advantage, for the association between good on adjacent days is of no substantive interest. In accordance with the heuristic argument in the introduction to this section, τ̂ increases when µ̂ decreases. The relation of χ̃ to µ̃ from row two on is non-monotonic, first increasing and then decreasing. For the series (2, 8, 0), in which the frequency of (1, 0) is zero, χ̃ is unbounded, while τ̃ = 9/8, attaining its upper bound.

(2) For the range, 0.2 ≤ µ̃ ≤ 0.8, studied here, τ̂ stays on all rows well below its upper bound 1/µ̂. However, if we know from outside information that for the series (2, 7, 1), pr(Y10 = 1) = 0.8, then, using the estimates µ̂1 = · · · = µ̂9 = 7/9 and τ̂ = 1.35, we find that pr(Y9 = 1, Y10 = 0) = 0.78(1 − 0.8 · 1.35) < 0. This goes to demonstrate that joining a regression and an association model can easily lead to negative fitted path probabilities and has to be done with care. Ekholm et al. (2000, Sec.
5.2) checked the impact of reversing the coding, see Section 3 (vii), in analysing a real data example, finding that the association model is simpler if the less frequent category is scored as 1, but with little impact on the regression model.

Table 3. The values of ten statistics for seven different binary time series, all of length 10.

Series     (a) µ̃   (b) τ̃   (c) χ̃   (d) µ̂   (e) τ̂   (f) p̂r(d|M)   (g) p̂r(d|I)   (h) p̃r(1|1)   (i) p̂r(1|1)   (j) ω
(2,8,0)    0.8      1.12     ∞        0.71     1.28     0.026          0.0067         1              0.91           2.69
(2,7,1)    0.7      1.10     6        0.58     1.35     0.0044         0.0022         0.86           0.79           1.37
(2,6,2)    0.6      1.25     10       0.48     1.62     0.0043         0.0012         0.83           0.77           2.55
(2,5,3)    0.5      1.44     12       0.39     1.88     0.0046         0.0010         0.80           0.74           3.09
(2,4,4)    0.4      1.69     12       0.32     2.16     0.0054         0.0012         0.75           0.69           3.01
(2,3,5)    0.3      2.00     10       0.25     2.44     0.0071         0.0022         0.67           0.61           2.31
(2,2,6)    0.2      2.25     6        0.18     2.53     0.0113         0.0067         0.50           0.46           1.05

(3) The correlations of the parameter estimates, corr(µ̂, τ̂), for the seven series, reading Table 3 from top to bottom, are −0.97, −0.93, −0.94, −0.94, −0.92, −0.86, −0.66. Admittedly, all except the last one are alarmingly high. However, note that from the substantive point of view, the most important number on any one row is the product µ̂τ̂, which is the maximum likelihood estimate of the probability of persistence of bad, p̂r(Yt+1 = 1 | Yt = 1), and these behave decently on all rows.

(4) The maximum likelihood estimates of the odds ratio, χ̂, are, by the invariance of maximum likelihood estimates, conveniently computed using equation (8). The seven pairs of (χ̃, χ̂), reading Table 3 from top to bottom, are (∞, 33.7), (6, 7.4), (10, 11.9), (12, 12.4), (12, 12.2), (10, 9.5), (6, 5.2). The computational route from τ̂ to χ̂ suggests that also models specifying distributions of dimensions higher than two and formulated in terms of odds ratios might be fitted using dependence ratios, which provide a computational advantage.
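The model free columns (b) and (c) of Table 3 can be reproduced directly from the nine adjacent pairs of each series. A sketch in my own code, not the program used in the paper:

```python
# A minimal sketch: moment estimates of the dependence ratio and the
# odds ratio from the adjacent pairs (y_t, y_{t+1}) of a binary series,
# as in columns (b) and (c) of Table 3.

def empirical_ratios(y):
    """Return (tau_tilde, chi_tilde) from the pairs (y_t, y_{t+1})."""
    pairs = list(zip(y, y[1:]))
    n = len(pairs)
    mu1 = sum(a for a, _ in pairs) / n        # mean of first coordinates
    mu2 = sum(b for _, b in pairs) / n        # mean of second coordinates
    mu12 = sum(a * b for a, b in pairs) / n   # second-order product moment
    tau = mu12 / (mu1 * mu2)
    n11 = sum(1 for a, b in pairs if a and b)
    n10 = sum(1 for a, b in pairs if a and not b)
    n01 = sum(1 for a, b in pairs if not a and b)
    n00 = n - n11 - n10 - n01
    chi = float("inf") if n10 * n01 == 0 else n11 * n00 / (n10 * n01)
    return tau, chi
```

For (2, 7, 1) this gives τ̃ = 54/49 ≈ 1.10 and χ̃ = 6, and for (2, 8, 0) it gives τ̃ = 9/8 and an unbounded χ̃, as stated in remark (1).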
It would be valuable to find out what kind of models one can formulate only by dependence ratios, what kind only by odds ratios and what kind by both.

5 Conclusions

Not all referees whom we have encountered during the last seven years have dismissed the dependence ratio as wrong or useless. Naturally, we have then pondered why some statisticians are unreasonably dismissive and others take a more positive, sometimes encouraging, attitude. In fact, Sir David Cox in 1994 at a seminar in Nuffield College, Oxford, expressed appreciation of precisely the two features of the dependence ratio which the critical referees dislike most. These two are the event specificity and the risk of negative fitted probabilities. Lindsey (1999, p. ix) states ‘As a group, statistical referees tend to be very conservative; a minor modification to some obscure score test is revolutionary whereas a new family of models must either already be widely known or be wrong’. Noteworthy as this statement is, it does not explain the variation we have experienced between referees. Another word of wisdom points to a different explanation. Elfving (1985) remarked that to gain recognition, it is not enough to do original work, you have also to belong to a school which is willing to listen to you. Now, there are wordings in the highly critical referee reports that indicate that some are written by statisticians who work on developing the generalized estimating equations (GEE) approach, and some others by statisticians who take the odds ratio as the starting point for the theory of multivariate distributions for categorical data. The irritating thing about the dependence ratio is then, presumably, that it suggests a new solution to a problem they regard as solved for good. Our problem is, perhaps, not so much that we do not belong to a listening school, as that the critical referees belong to schools that do not want to hear about new solutions.

References

Barndorff-Nielsen, O. E. & Cox, D. R. (1994).
Inference and asymptotics. London: Chapman & Hall.

Ekholm, A. (1991). Fitting regression models to a multivariate binary response. In A spectrum of statistical thought: essays in statistical theory, economics and population genetics in honour of Johan Fellman, pp. 19-32, Eds. G. Rosenqvist, K. Juselius, K. Nordström and J. Palmgren. Helsingfors: Swedish School of Economics and Business Administration.

Ekholm, A., Smith, P. W. F. & McDonald, J. W. (1995). Marginal regression analysis of a multivariate binary response. Biometrika, 82, 847-854.

Ekholm, A. & Skinner, C. (1998). The Muscatine children’s obesity data reanalysed using pattern mixture models. Applied Statistics, 47, 251-263.

Ekholm, A., McDonald, J. W. & Smith, P. W. F. (2000). Association models for a multivariate binary response. Biometrics, 56, 712-718.

Ekholm, A., Jokinen, J. & Kilpi, T. (2002). Combining regression and association modelling for longitudinal data on bacterial carriage. Statistics in Medicine, 21, 773-791.

Ekholm, A., Jokinen, J., McDonald, J. W. & Smith, P. W. F. (2002). Joint regression and association modelling of longitudinal ordinal data. Manuscript.

Elfving, G. (1985). Finnish mathematical statistics in the past. In Proceedings of the first international Tampere seminar on linear statistical models and their applications, Tampere 1983, pp. 3-8, Eds. T. Pukkila & S. Puntanen. Department of Mathematical Sciences/Statistics, University of Tampere.

Jokinen, J., McDonald, J. W. & Smith, P. W. F. (2002). Meaningful regression and association models for clustered ordinal data. Manuscript.

Lindsey, J. K. (1999). Models for repeated measurements: second edition. Oxford University Press.

Sackett, D. L., Deeks, J. J. & Altman, D. G. (1996). Down with odds ratios! Evidence-Based Medicine, 1, 164-166.
Appendix: Derivation of the covariance matrices of ψ̂ and υ̂

In 1998 a referee, adamant that the dependence ratio is hopelessly inferior to the odds ratio, wrote as one argument: ‘Another useful property of odds ratios is orthogonality between marginal parameters and odds ratio parameters in the bivariate case (BB, 1989)’. The reference is to a technical report and the referee was, obviously, not aware of the general orthogonality of the components of the mixed parameter, see (viii) of Section 3. The technical report purports to prove that µ1 and µ2 are orthogonal to χ in the parametrisation ψ = (µ1, µ2, χ). The proof is deficient and the asymptotic covariance matrix Var(µ̂1, µ̂2, χ̂) is not derived. The matter would not be worth commenting upon, if I had not encountered several other references attributing the revelation of this orthogonality in a 2 × 2 table to that same source; most recently at an international conference in 2002. Presumably, Var(µ̂1, µ̂2, χ̂) was derived and published decades ago, but since the simplest way to derive Var(µ̂1, µ̂2, τ̂) is via Var(µ̂1, µ̂2, χ̂), we start from scratch.

Denote the cell probabilities by πuv = pr(Y1 = u, Y2 = v), for u, v = 1, 0, where 0 < πuv < 1 and π11 + π10 + π01 + π00 = 1. A full rank parametrisation π is obtained by π = (π11, π10, π01). For brevity we use, sometimes, π00 = 1 − π11 − π10 − π01. We work now with three parametrisations of the 2 × 2 table: π, ψ and υ. The transformation π → ψ is given by

µ1 = π11 + π10,   µ2 = π11 + π01,   χ = π11(1 − π11 − π10 − π01)/(π10 π01),

and the transformation υ → ψ by

µ1 = µ1,   µ2 = µ2,   χ = 1 + (τ − 1)/{(1 − µ1τ)(1 − µ2τ)}.

Denote the log-likelihood functions for the three parameters by, respectively, l(π), l*(ψ) and l**(υ) and the information matrices by, respectively, i(π), i*(ψ) and i**(υ). Any information matrix is defined, in the standard way, as (−1) times the expected value of the Hessian matrix of the corresponding log-likelihood function.
Denoting by n the number of bivariate observations (y1, y2), i(π) is easily calculated,

           | 1/π11 + 1/π00,   1/π00,           1/π00          |
i(π) = n   | 1/π00,           1/π10 + 1/π00,   1/π00          |.   (A1)
           | 1/π00,           1/π00,           1/π01 + 1/π00  |

The relation between i(π) and i*(ψ) is (Barndorff-Nielsen & Cox, 1994, p. 28)

i*(ψ) = [∂π/∂ψ]ᵀ i(π) [∂π/∂ψ],   (A2)

where the last matrix on the right hand side is the 3 × 3 Jacobian of the reparametrisation π → ψ. In this case calculating that Jacobian is an unpleasant task and we use instead the general result that for locally 1 − 1 transformations the Jacobian of the inverse transformation is the inverse of the original Jacobian,

[∂π/∂ψ] = [∂ψ/∂π]⁻¹.   (A3)

It is readily checked that

              | 1, 1, 0 |⁻¹                  | −b,     −c,     1  |
[∂ψ/∂π]⁻¹ =   | 1, 0, 1 |    = (a − b − c)⁻¹ | a − c,   c,    −1  |,
              | a, b, c |                    | b,      a − b, −1  |

where a, b and c are, respectively, ∂χ/∂π11, ∂χ/∂π10 and ∂χ/∂π01. Calculating these derivatives, denoting vu = var(Yu), for u = 1, 2, and c12 = cov(Y1, Y2), and using (A1), (A2) and (A3) gives

                   | v2,     −c12,   0         |
i*(ψ) = n d⁻¹(π)   | −c12,   v1,     0         |,
                   | 0,      0,      χ⁻² e(π)  |

where d(π) = v1v2 − c12² = π10π01π00 + π11π01π00 + π11π10π00 + π11π10π01 and e(π) = π11π10π01π00. Inverting i*(ψ) we find that the asymptotic covariance matrix for ψ̂ is

                                      | v1,    c12,   0   |
Var(µ̂1, µ̂2, χ̂) = [i*(ψ)]⁻¹ = (1/n)   | c12,   v2,    0   |,   (A4)
                                      | 0,     0,     vχ  |

where vχ = χ²(π11⁻¹ + π10⁻¹ + π01⁻¹ + π00⁻¹). To derive Var(µ̂1, µ̂2, τ̂) we use

i**(υ) = [∂ψ/∂υ]ᵀ i*(ψ) [∂ψ/∂υ],   (A5)

where

            | 1,    0,    0   |
[∂ψ/∂υ] =   | 0,    1,    0   |,   (A6)
            | a*,   b*,   c*  |

with a* = ∂χ/∂µ1, b* = ∂χ/∂µ2 and c* = ∂χ/∂τ. Var(µ̂1, µ̂2, τ̂) is then best calculated as [i**(υ)]⁻¹ using (A4), (A5) and (A6), obtaining

                        | 1,       0,       0    | | v1,    c12,   0   | | 1,   0,   −a*/c* |
Var(µ̂1, µ̂2, τ̂) = (1/n) | 0,       1,       0    | | c12,   v2,    0   | | 0,   1,   −b*/c* |
                        | −a*/c*,  −b*/c*,  1/c* | | 0,     0,     vχ  | | 0,   0,   1/c*   |

                        | v1,    c12,   c1τ |
                = (1/n) | c12,   v2,    c2τ |,
                        | c1τ,   c2τ,   vτ  |

where

c1τ = −(v1 ∂χ/∂µ1 + c12 ∂χ/∂µ2)(∂χ/∂τ)⁻¹,
c2τ = −(c12 ∂χ/∂µ1 + v2 ∂χ/∂µ2)(∂χ/∂τ)⁻¹,
vτ = {(∂χ/∂µ1)² v1 + (∂χ/∂µ2)² v2 + 2(∂χ/∂µ1)(∂χ/∂µ2)c12 + vχ}(∂χ/∂τ)⁻².

For computational purposes, useful expressions for the partial derivatives are

∂χ/∂µ1 = τ(τ − 1)(1 − µ1τ)⁻²(1 − µ2τ)⁻¹,
∂χ/∂µ2 = τ(τ − 1)(1 − µ1τ)⁻¹(1 − µ2τ)⁻²,   (A7)
∂χ/∂τ = {π00 − π11(τ − 1)}(1 − µ1τ)⁻²(1 − µ2τ)⁻².
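As a sanity check on (A7), my own code (not part of the derivation) compares the closed-form partial derivatives with central finite differences of χ as a function of (µ1, µ2, τ), using the υ → ψ mapping χ = 1 + (τ − 1)/{(1 − µ1τ)(1 − µ2τ)}:

```python
# A minimal sketch: the partial derivatives in (A7) versus central
# finite differences of chi(mu1, mu2, tau).

def chi(mu1, mu2, tau):
    # the transformation from upsilon to psi, equation (8)
    return 1.0 + (tau - 1.0) / ((1 - mu1 * tau) * (1 - mu2 * tau))

def dchi_exact(mu1, mu2, tau):
    """The closed-form derivatives (A7)."""
    d1 = tau * (tau - 1) / ((1 - mu1 * tau) ** 2 * (1 - mu2 * tau))
    d2 = tau * (tau - 1) / ((1 - mu1 * tau) * (1 - mu2 * tau) ** 2)
    p11 = mu1 * mu2 * tau                    # pr(Y1 = 1, Y2 = 1)
    p00 = 1 - mu1 - mu2 + p11                # pr(Y1 = 0, Y2 = 0)
    dt = (p00 - p11 * (tau - 1)) / ((1 - mu1 * tau) ** 2 * (1 - mu2 * tau) ** 2)
    return d1, d2, dt

def dchi_numeric(mu1, mu2, tau, h=1e-6):
    """Central finite differences of chi."""
    d1 = (chi(mu1 + h, mu2, tau) - chi(mu1 - h, mu2, tau)) / (2 * h)
    d2 = (chi(mu1, mu2 + h, tau) - chi(mu1, mu2 - h, tau)) / (2 * h)
    dt = (chi(mu1, mu2, tau + h) - chi(mu1, mu2, tau - h)) / (2 * h)
    return d1, d2, dt
```

At any interior point of the parameter space the two sets of derivatives agree to within the finite difference error.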