Comparing the Odds and the Dependence Ratios

Anders Ekholm
Rolf Nevanlinna Institute, P.O. Box 4, (Yliopistonkatu 5),
FIN-00014 UNIVERSITY OF HELSINKI, Finland.
Email: anders.ekholm@helsinki.fi
Summary
The parametrisation of a multivariate binary response in terms of
first order moments and dependence ratios of order from two up to
the dimension of the response, is presented. Conventionally, the mixed
parameter is used for marginal regression analysis of a multivariate binary response. In the mixed parameter, the first order moments are
complemented by odds ratios. The properties of the, less well known,
dependence ratios are expounded and compared to those of the odds
ratio. To demonstrate how the dependence ratio parametrisation works,
seven different binary time series, all of length 10, are constructed and
analysed, one by one.
Key words: Asymmetric association measure; Combining regression with association; Parameter orthogonality; Short binary time series.
1 Motivation
In my contribution to the ‘Fellman Festschrift’ I introduced the success parametrisation for a multivariate binary response (Ekholm, 1991). It gained a number of
references, but few applications. Soon, I gave it up and developed instead, jointly
with a group of coauthors, the dependence ratio parametrisation, which we have applied for analysing a fair number of empirical data sets (Ekholm, Smith & McDonald, 1995; Ekholm & Skinner, 1998; Ekholm, McDonald & Smith, 2000; Ekholm,
Jokinen & Kilpi, 2002). Ekholm, Jokinen, McDonald & Smith (2002), generalized
the dependence ratio from a binary to an ordinal response and Jokinen, McDonald
& Smith (2002) formulated a number of new association generating mechanisms
using the dependence ratio. Despite being handy in applied work, the dependence
ratio has gained fewer references and applications than the success parametrisation. Indeed, referees have occasionally dismissed our papers, it seems
to me, just because the dependence ratio is not the odds ratio. Perplexed by this
conservatism, I devote my contribution to the ‘Nordberg Festschrift’ to a comparison of the dependence ratio with the odds ratio, and to some quotes exemplifying
the arguments raised against the dependence ratio. Naturally, I do not know the
names of the referees and I will also leave some authors and editors anonymous.
For example, AA has referred to Ekholm (1991) several times during the nineties
without comments, but in an invited talk at an international conference in 2002, he
complained that there are problems with the success parametrisation. Strangely,
he has never referred to the dependence ratio.
2 Simultaneous regression and association modelling
We denote the response of unit i = 1, . . . , n at subunit tk , for k = 1, . . . , q,
by Yi tk = 1/0 and the q-variate response of unit i by Y i = (Yi t1 , . . . , Yi tq ). In
the data set analysed by Ekholm, Jokinen & Kilpi (2002), Yi tk = 1/0 indicates
carriage/non-carriage of a certain bacterium by child i at age tk . Another useful
application of the hierarchical data structure is when unit is a family and subunit
is a member of the family (Ekholm et al. 1995, Example 4.1). With each subunit
is associated a p × 1 vector of fixed explanatory values, xi tk . In modelling the data
set, {(xi tk , yi tk ); i = 1, . . . , n, k = 1, . . . , q}, we want to combine (i) a marginal
regression model for the univariate means µi tk = E(Yi tk |xi tk ), with (ii) an association model depicting the dependence between the subunit responses Yi t1 , . . . , Yi tq .
Ambitiously, we model the joint probability of the q responses of unit i, which we
refer to as the path probability πi and define as
πi = pr(Yi t1 = yi1 , . . . , Yi tq = yiq ).
(1)
The marginal regression model is the conventional logistic regression, for i =
1, . . . , n and k = 1, . . . , q,
\[ \mu_{i t_k} = \mathrm{pr}(Y_{i t_k} = 1 \mid x_{i t_k}) = \{1 + \exp(-\beta x_{i t_k})\}^{-1}, \tag{2} \]
where β = (β1 , . . . , βp ) is a 1 × p vector of regression coefficients, constant with
respect to both i and tk . The response vectors Y1 , . . . , Yn are assumed to be independent of each other. For task (ii) we need measures of association that handily
complement the first order moments µi tk in specifying the joint q-dimensional
probability. The dependence ratios, denoted by τ , of second, third,. . . , qth order,
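are well suited for this task, being defined by ratios of moments of all orders, as
\[ \tau_{t_1 t_2} = \frac{\mu_{i t_1 t_2}}{\mu_{i t_1}\,\mu_{i t_2}}, \;\ldots,\; \tau_{t_1 t_2 t_3} = \frac{\mu_{i t_1 t_2 t_3}}{\mu_{i t_1}\,\mu_{i t_2}\,\mu_{i t_3}}, \;\ldots,\; \tau_{t_1 \ldots t_q} = \frac{\mu_{i t_1 \ldots t_q}}{\mu_{i t_1} \cdots \mu_{i t_q}}, \tag{3} \]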
where, for example, µi t1 t2 t3 = E(Yi t1 Yi t2 Yi t3 |xi t1 , xi t2 , xi t3 ), is the third-order
product moment. There are q!/{w!(q−w)!} dependence ratios of order w = 2, . . . , q
and accordingly 2^q − q − 1 dependence ratios in all. The q first order moments
µi t1 , . . . , µi tq and the 2^q − q − 1 dependence ratios, together, specify fully the path
probability πi of child i (Ekholm et al. 1995). In fact, there exists a closed form
expression, denoted here by π ∗ (·), for the path probability in terms of the first
order moments and the dependence ratios of all orders,
πi = pr(Yi t1 = yi1 , . . . , Yi tq = yiq ) = π ∗ (µi t1 , . . . , µi tq , τt1 t2 , . . . , τt1 ...tq ).
(4)
To achieve parsimonious modelling, a strong structure has to be imposed on
the dependence ratios, specifying a meaningful mechanism that generates the association between Yi t1 , . . . , Yi tq from a small number of association parameters.
Denoting the vector of association parameters by α and the vector of dependence
ratios by τ = (τt1 t2 , . . . , τt1 ...tq ), we refer to a set of equations
τ = g(α),
(5)
as an association model with parameter α. Ekholm et al. (2000) present a number
of association generating mechanisms for a binary response, Ekholm et al. (2002)
for a multicategory response and Jokinen et al. (2002) extend the latter set.
It follows from (2), (4) and (5), jointly, that there exists an explicit expression,
denoted here by π(·), for the path probability in terms of β, α and the xi tk ,
πi = pr(Yi t1 = yi1 , . . . , Yi tq = yiq ) = π(β, α, xi t1 , . . . , xi tq ).
(6)
For given values of the regression coefficients, the association parameters and the
data {(xi tk , yi tk ); i = 1, . . . , n, k = 1, . . . , q}, (6) gives computational access to the
log-likelihood function l(β, α) in a single, iteration-free step.
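The closed form π∗(·) of (4) is not written out in this paper. One standard route to such an expression, sketched here in Python, is inclusion-exclusion over the product moments, each of which is a dependence ratio times the product of the corresponding means by (3). The function names, the zero-based index tuples and the convention that a missing dependence ratio is read as 1 are assumptions made for illustration, not notation taken from Ekholm et al. (1995).

```python
from itertools import combinations, product
from math import exp, prod


def logistic_mean(beta, x):
    """Marginal mean of equation (2): {1 + exp(-beta x)}^(-1)."""
    return 1.0 / (1.0 + exp(-sum(b * v for b, v in zip(beta, x))))


def path_probability(y, mu, tau):
    """pr(Y = y) from first order moments and dependence ratios.

    y   : tuple of 0/1 responses, one per subunit
    mu  : first order moments mu_1, ..., mu_q
    tau : dict mapping a sorted tuple of zero-based subunit indices
          (length >= 2) to its dependence ratio; a missing key is read
          as 1 (illustrative convention, not from the paper)
    """
    ones = [t for t, v in enumerate(y) if v == 1]
    zeros = [t for t, v in enumerate(y) if v == 0]
    total = 0.0
    # Inclusion-exclusion: add and subtract product moments of all sets
    # that contain the "one" positions plus some of the "zero" positions.
    for r in range(len(zeros) + 1):
        for extra in combinations(zeros, r):
            subset = tuple(sorted(ones + list(extra)))
            # Product moment of the subset: tau_S times the product of means.
            moment = prod(mu[t] for t in subset)
            if len(subset) >= 2:
                moment *= tau.get(subset, 1.0)
            total += (-1) ** r * moment
    return total


# Bivariate illustration with hypothetical values mu1 = mu2 = 0.2, tau12 = 1.5.
mu = [0.2, 0.2]
tau = {(0, 1): 1.5}
for y in product((1, 0), repeat=2):
    print(y, round(path_probability(y, mu, tau), 4))
```

For q = 2 the four printed values are the cells of a 2 × 2 table with µ12 = µ1 µ2 τ12 , and for any q the probabilities of all 2^q paths sum to one by construction. Nothing in the sketch prevents a negative value for an ill-chosen combination of means and dependence ratios.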
3 The Dependence Ratio Compared with the Odds Ratio
Dropping the subscript i, we consider the response path of a generic unit, and
denote the q-dimensional binary response by (Y1 , . . . , Yq ). For several important
differences between the dependence ratio and the odds ratio parametrisations, it
is enough to consider the bivariate case, q = 2. The notation is conveniently introduced by spelling out the relevant 2 × 2 table as Table 1.
Table 1. The probabilities of the bivariate distribution for (Y1 , Y2 ).

           Y2 = 1       Y2 = 0                 Sum
Y1 = 1     µ12          µ1 − µ12               µ1
Y1 = 0     µ2 − µ12     1 − µ1 − µ2 + µ12      1 − µ1
Sum        µ2           1 − µ2                 1

The dependence ratio for (Y1 , Y2 ) is τ12 = µ12 /(µ1 µ2 ) and the odds ratio, also
called the cross product ratio, is χ12 = µ12 (1−µ1 −µ2 +µ12 )/{(µ1 −µ12 )(µ2 −µ12 )}.
We denote for (Y1 , Y2 ) the dependence ratio parametrisation by υ = (µ1 , µ2 , τ12 )
and the odds ratio parametrisation by ψ = (µ1 , µ2 , χ12 ). We list the most important features of the dependence ratio, including comparisons to the odds ratio:
(i) The metric of the w-way dependence ratio τ1...w is that pr(Y1 = · · · = Yw = 1)
is (τ1...w − 1)100% greater than if the w events {Y1 = 1}, . . . , {Yw = 1} were independent. The definitions (3) and accordingly the interpretations of τ1...w , for
w = 2, . . . q, do not grow in conceptual complexity, when w increases. In contrast
the 3-way odds ratio is a ratio of two conditional 2-way odds ratios, in obvious notation, χ123 = χ12|Y3 =1 /χ12|Y3 =0 , and so forth in geometrically growing complexity.
(ii) A computational advantage. Since, from (3), µ12 = µ1 µ2 τ12 , it is obvious from Table 1 that the cell probabilities pr(Y1 = u, Y2 = v), for u, v = 1, 0,
have explicit expressions in terms of µ1 , µ2 and τ12 . The thrust of (4) and (6) is
that analogous explicit expressions for the path probabilities are valid when q > 2
(Ekholm et al., 1995). In contrast, to find the path probabilities from µ1 , µ2 and
χ12 one has to solve a quadratic, and when q > 2 one has to resort to iterative
procedures. Therefore, maximizing the log-likelihood function using the odds ratio
parametrisation requires nesting two iterative procedures, while using the dependence ratio parametrisation iteration in the parameter space is enough.
(iii) An interpretational advantage. When the subunits are ordered, for example, by time as in the data on carriage of bacteria (Ekholm et al., 2002), then
the transition probabilities between the states are of interest. Interpretation of the
dependence ratio is facilitated by the relation τ12 = pr(Y2 = 1|Y1 = 1)/pr(Y2 = 1).
(iv) The range of τ12 depends on µ1 and µ2 , the lower and upper bounds are
\[ \max(0,\; \mu_1^{-1} + \mu_2^{-1} - \mu_1^{-1}\mu_2^{-1}) < \tau_{12} < \min(\mu_1^{-1},\; \mu_2^{-1}). \tag{7} \]
For the case µ1 = µ2 = µ the parameter space (µ, τ ) is illustrated in Figure 1.
For µ < 0.5 the parameters are reasonably variation independent, while when µ
approaches one, the range of the dependence ratio is quite narrow around the
independence value 1. In contrast, the odds ratio, χ, is variation independent
from the marginal probabilities µ1 and µ2 . A referee stated ‘...the lack of a good
grip on the range of the dependence ratios seems to be a major drawback.’ I
believe that the bound τ12 < 1/µ reflects a meaningful unidentifiability, for if
pr(Y1 = 1) = pr(Y2 = 1) ≈ 1, then only scant empirical evidence can be found for
discriminating between the propositions (a) that the events {Y1 = 1} and {Y2 = 1}
are independent and (b) that they are associated. Also, the bound τ < 1/µ is, by
(iii) above, equivalent to pr(Y2 = 1|Y1 = 1) < 1 and therefore desirable.
Figure 1: Parameter space of (µ, τ ) for µ1 = µ2 = µ (horizontal axis: µ, from 0 to 1; vertical axis: τ , from 0 to 10; plot not reproduced).
(v) The mappings τ12 → χ12 and ρ12 → τ12 , where ρ12 is the correlation
coefficient of (Y1 , Y2 ), are easily seen to be,
\[ \chi_{12} - 1 = \frac{\tau_{12} - 1}{(1 - \mu_1\tau_{12})(1 - \mu_2\tau_{12})}, \tag{8} \]
and
\[ \tau_{12} - 1 = \sqrt{\frac{(1 - \mu_1)(1 - \mu_2)}{\mu_1\mu_2}}\;\rho_{12}. \tag{9} \]
Since, by (7), both factors in the denominator of (8) are positive, it follows that χ12
and τ12 indicate negative, zero and positive association simultaneously. If µt τ12 ,
for t = 1, 2, are both small, then χ12 ≈ τ12 . Except when there is zero association,
|χ12 − 1| > |τ12 − 1|, so that τ12 always lies closer to the independence value 1
than χ12 does. This result gives an important grip on the range of τ12 , see also Table
2. The intuitive reason for this result is that τ12 measures, solely, the strength of
association between the events {Y1 = 1} and {Y2 = 1}, see (i) above, while χ12
measures the strength of association between the pair of binary random variates
(Y1 , Y2 ). Being focused on specific events, the interpretation of the dependence
ratio is sharper, which we demonstrate with a numerical example in Section 4.
(vi) DR, OR and RR. By being focused on specific events, the dependence ratio
(DR) is more closely akin to the risk ratio (RR) than to the odds ratio (OR). Consider
the case when Y1 is explanatory, for example exposed/not exposed, denoted by,
respectively, E and E c , and Y2 is a response, for example presence/absence of
disease, denoted by, respectively, D and D c , then RR = pr(D|E)/pr(D|E c ) and
DR = pr(D|E)/pr(D) = RR × pr(D|E c )/pr(D). Therefore, DR can be interpreted
as a scaled RR. If exposure is a rare condition, then pr(E) ≈ 0 and the scale factor
pr(D|E c )/pr(D) ≈ 1. Sackett, Deeks and Altman (1996) prefer RR over OR, since
RR has a more pertinent medical interpretation.
(vii) Lack of invariance to coding. Consider the consequences of reversing the
coding, meaning that we transform Yk → Yk∗ = 1 − Yk , for k = 1, 2. Denoting
the odds ratio, the correlation coefficient and the dependence ratio for the starred
responses by, respectively, χ∗12 , ρ∗12 and τ∗12 , one finds that χ∗12 = χ12 , ρ∗12 = ρ12 , but
τ∗12 − 1 = (τ12 − 1)µ1 µ2 {(1 − µ1 )(1 − µ2 )}^−1 . The dependence ratio lacks invariance
to coding because it measures event specific association.
(viii) Lack of parameter orthogonality. The lack of variation independence,
see (iv), implies that τ12 is not orthogonal to µt , for t = 1, 2. In contrast, χ12
is orthogonal to µt . Also, for q > 2, in the general odds ratio parametrisation
ψ = (µ1 , . . . , µq , χ12 , . . . , χ1...q ) the mean parameters, µt , for t = 1, . . . , q are orthogonal to the canonical parameters, χ1...w , for w = 2, . . . , q (Barndorff-Nielsen &
Cox, 1994, Sec. 2.9). When introducing the dependence ratio parametrisation, we
pointed out that the regression coefficients, β, and the association parameters, α,
are not orthogonal and recommended routine calculation of the correlations of the
parameter estimates (Ekholm et al., 1995, Sec. 3.4). In all applications we have
followed this advice and only rarely have we encountered high correlations.
An important result in the theory of statistical inference is that inference concerning the parameter of interest, for example, β, is sharper if β is orthogonal to
the nuisance parameter, for example, α, (Barndorff-Nielsen & Cox, 1994, Sec. 3.6).
The conventional casting of interest and nuisance is, in analyses of multivariate
binary or ordinal responses, indeed, just this. The more data sets we have analysed
the more we have learned to appreciate, that meaningful association modelling can
yield, at least, equally important scientific insights as the regression modelling, see
Ekholm, Jokinen & Kilpi (2002) and Ekholm, Jokinen, McDonald & Smith (2002).
Accordingly, we consider the parameter of interest to be (β, α).
In applications one needs to monitor the standard errors of the estimates. In the
Appendix we derive the asymptotic covariance matrices for υ̂ = (µ̂1 , µ̂2 , τ̂12 ) and
for ψ̂ = (µ̂1 , µ̂2 , χ̂12 ) and from these matrices it is clear that the standard errors of
τ̂12 and χ̂12 are of form, respectively, cτ /√n and cχ /√n, where n is the number of
bivariate observations. We list in Table 2 the values of τ12 , χ12 , calculated using
(8) and (9), and the coefficients cτ and cχ , calculated using (A4) and (A7), when
ρ12 ranges between its minimum and its maximum value, for µ1 = µ2 = 0.2. Not
only do the values of the odds ratio increase very fast with increasing correlation,
but the estimates of the odds ratio become statistically very unstable.
(ix) Negative fitted path probabilities. Jukka Jokinen’s R-program for fitting
our models maximizes the log-likelihood function by a direct search, restricted to
values of the parameters that produce non-negative fitted path probabilities for
all observed paths. However, since we parametrize in terms of probabilities, not
in terms of logits of probabilities, the joining of a regression with an association
model can produce negative fitted probabilities for unobserved paths. We do not
regard this feature as a serious drawback of our approach, but rather as providing
a model validation tool. We strive to specify meaningful models for both the regression and the association. A balance between explanations in terms of variation
(a) between units, i.e., regression, and (b) within units, i.e., association, has to be
found. Finding the proper balance is part of doing meaningful science, instead of
throwing all available explanatory variables into a foolproof regression mill, treating the association as a nuisance parameter, which is what authors using the odds
ratio parametrisation usually do. An associate editor recently wrote that to him
our method ‘allows better estimation of parameters at the expense of a less flexible
model’. To me statistical models are better if they are not too flexible.
An editor recently dismissed our method with the unsubstantiated argument
that, ‘...the drawbacks of the dependence ratio parametrization in the regression
setting are major’. Strangely, this editor had seen a supporting report in which we
reanalysed a data set, previously analysed by him, complementing his regression
analysis by an association analysis which yielded new empirical insights.
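The bivariate relations used in items (ii), (iv), (v) and (vii) above can be collected in a few lines of Python. This is a minimal sketch with illustrative function names; the only formulas used are Table 1, (7), (8), (9), the coding-reversal relation of item (vii) and the root of the quadratic that the odds ratio parametrisation requires.

```python
from math import sqrt


def tau_bounds(mu1, mu2):
    """Lower and upper bounds of tau12 from inequality (7)."""
    lower = max(0.0, 1 / mu1 + 1 / mu2 - 1 / (mu1 * mu2))
    upper = min(1 / mu1, 1 / mu2)
    return lower, upper


def tau_from_rho(mu1, mu2, rho):
    """Equation (9): dependence ratio from the correlation coefficient."""
    return 1 + rho * sqrt((1 - mu1) * (1 - mu2) / (mu1 * mu2))


def chi_from_tau(mu1, mu2, tau):
    """Equation (8); undefined when tau equals 1/mu1 or 1/mu2."""
    return 1 + (tau - 1) / ((1 - mu1 * tau) * (1 - mu2 * tau))


def cells_from_tau(mu1, mu2, tau):
    """Cells (p11, p10, p01, p00) of Table 1, with no equation solving."""
    mu12 = mu1 * mu2 * tau
    return mu12, mu1 - mu12, mu2 - mu12, 1 - mu1 - mu2 + mu12


def cells_from_chi(mu1, mu2, chi):
    """The same cells from the odds ratio: a quadratic has to be solved."""
    if abs(chi - 1) < 1e-12:
        mu12 = mu1 * mu2
    else:
        s = 1 + (mu1 + mu2) * (chi - 1)
        disc = s * s - 4 * chi * (chi - 1) * mu1 * mu2
        mu12 = (s - sqrt(disc)) / (2 * (chi - 1))
    return mu12, mu1 - mu12, mu2 - mu12, 1 - mu1 - mu2 + mu12


def tau_reversed(mu1, mu2, tau):
    """Dependence ratio after recoding Y -> 1 - Y, as in item (vii)."""
    return 1 + (tau - 1) * mu1 * mu2 / ((1 - mu1) * (1 - mu2))


mu1 = mu2 = 0.2
tau = tau_from_rho(mu1, mu2, 0.1)
chi = chi_from_tau(mu1, mu2, tau)
print(tau_bounds(mu1, mu2), tau, chi)
print(cells_from_tau(mu1, mu2, tau))
print(cells_from_chi(mu1, mu2, chi))
print(tau_reversed(mu1, mu2, tau))
```

With µ1 = µ2 = 0.2 and ρ12 = 0.10 the sketch gives τ12 = 1.4 and χ12 ≈ 1.77, and both routes return the same four cell probabilities. After reversing the coding the dependence ratio drops to 1.025, while the odds ratio is unchanged.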
Table 2. Values of τ12 , cτ = s.e.(τ̂12 ) × √n, χ12 , cχ = s.e.(χ̂12 ) × √n for a range
of values of the correlation coefficient ρ12 in a 2 × 2 table, with µ1 = µ2 = 0.2,
when the number of observations is n = 1.

ρ12      τ12     s.e.(τ̂12 ) × √n     χ12       s.e.(χ̂12 ) × √n
-0.25    0       –                   0         –
-0.20    0.20    2.1                 0.13      1.5
-0.10    0.60    3.4                 0.48      3.6
 0.00    1.0     4.0                 1.0       6.3
 0.10    1.4     4.4                 1.8       10.2
 0.20    1.8     4.6                 3.0       16.4
 0.30    2.2     4.8                 4.8       26.7
 0.40    2.6     5.1                 7.9       44.8
 0.50    3.0     5.5                 13.5      79.5
 0.60    3.4     6.0                 24.4      154.5
 0.70    3.8     6.8                 49.6      349.3
 0.80    4.2     7.7                 126.0     1052.3
 0.90    4.6     8.8                 563.5     6467.0
 1.00    5       –                   ∞         –

4 Seven short binary time series
A patient’s condition is recorded for ten consecutive days as bad, scored 1, or
good, scored 0. Requested to analyse the binary time series (0, 0, 1, 1, 1, 1, 1, 1, 1, 0)
two explanations spring to mind: (i) If bad has no carry over effect from one day
to the next, then the probability of bad is 0.7. (ii) If bad, once it appears, is
likely to carry over into the next days, then the marginal probability of bad is,
presumably, smaller and the dependence between bad on consecutive days needs
to be assessed. As a demonstration of how the dependence ratio works, I fit two
models and compute a number of estimates. The two models are:
Model I for independence: we assume that Y1 , . . . , Y10 are mutually independent and marginal stationarity,
µ = pr(Yt = 1), for t = 1, . . . , 10. The parameters to be estimated from the data
are (a) µ̃ = Σ_{t=1}^{10} yt /10 and (g) p̂r(d|I) = the maximum likelihood estimate of the
probability of the observed data under I.
Model M for Markov: we assume pr(Yt+1 = y|Y1 , . . . , Yt ) = pr(Yt+1 = y|Yt ), for
t = 1, . . . , 9, stationarity in dependence ratio, that is τt(t+1) = τ , and also marginal
stationarity as under I. In addition to the numbers (a) and (g), we calculate (b)
τ̃ , the estimate of τ directly from the 9 bivariate observations, (yt , yt+1 ), (c) χ̃, the
estimate of the odds ratio directly from the 9 bivariate observations, (d) µ̂ = the
maximum likelihood estimate of µ under M, (e) τ̂ the maximum likelihood estimate of τ under M, (f) p̂r(d|M), (h) p̃r(1|1) the estimate of pr(Yt+1 = 1|Yt = 1)
directly from the data, (i) p̂r(1|1) = µ̂τ̂ , the maximum likelihood estimate under
M of pr(Yt+1 = 1|Yt = 1) and finally (j) ω = 2[log{p̂r(d|M)} − log{p̂r(d|I)}], the
likelihood ratio statistic for testing model I against model M.
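A minimal sketch of fitting model M in Python follows. It assumes one reading of the model: a stationary first-order chain whose likelihood starts with pr(Y1 = y1 ) and uses pr(Yt+1 = 1|Yt = 1) = µτ and pr(Yt+1 = 1|Yt = 0) = µ(1 − µτ )/(1 − µ). The crude grid search stands in for a direct search and is not the R-program mentioned in Section 3 (ix); all function names are illustrative.

```python
from math import exp, inf, log


def markov_loglik(y, mu, tau):
    """Log-likelihood of a binary series under the assumed reading of model M."""
    if not 0.0 < mu < 1.0:
        return -inf
    p11 = mu * tau                         # pr(Y_{t+1} = 1 | Y_t = 1)
    p01 = mu * (1.0 - p11) / (1.0 - mu)    # pr(Y_{t+1} = 1 | Y_t = 0)
    if not (0.0 < p11 < 1.0 and 0.0 < p01 < 1.0):
        return -inf                        # outside the admissible region
    ll = log(mu if y[0] == 1 else 1.0 - mu)
    for prev, cur in zip(y[:-1], y[1:]):
        p = p11 if prev == 1 else p01
        ll += log(p if cur == 1 else 1.0 - p)
    return ll


def grid_search(y, step=0.005):
    """Search a grid over (mu, pr(1|1)); returns (mu_hat, tau_hat, loglik)."""
    n = round(1.0 / step)
    best = (None, None, -inf)
    for i in range(1, n):
        mu = i * step
        for j in range(1, n):
            tau = (j * step) / mu          # tau = pr(1|1) / mu
            ll = markov_loglik(y, mu, tau)
            if ll > best[2]:
                best = (mu, tau, ll)
    return best


series = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]
mu_hat, tau_hat, ll = grid_search(series)
print(mu_hat, tau_hat, mu_hat * tau_hat, exp(ll))
```

Under these assumptions the search lands close to µ̂ = 0.58, τ̂ = 1.35 and a maximised likelihood of about 0.0044, in line with the corresponding row of Table 3. The grid step limits the accuracy to roughly two decimals.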
Denoting the series (0, 0, 1, 1, 1, 1, 1, 1, 1, 0) by (2, 7, 1), since two zeros are followed by seven ones and they by one zero, the statistics (a) to (j) are listed in
Table 3, not only for (2, 7, 1), but also for six other series (2, 8, 0), (2, 6, 2), (2, 5, 3),
(2, 4, 4), (2, 3, 5), (2, 2, 6). The log-likelihood function is unbounded for the more
extreme cases, in which not both (0, 0) and (1, 1) for adjacent times are present.
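The model-free statistics (a), (b), (c), (h) and the independence likelihood (g) need only the counts of the nine adjacent pairs. A minimal Python sketch, with illustrative names and the assumption that τ̃ is the observed proportion of (1, 1) pairs divided by the product of the two observed pair margins:

```python
def make_series(n_zero_head, n_one, n_zero_tail):
    """Series written (a, b, c): a zeros, then b ones, then c zeros."""
    return [0] * n_zero_head + [1] * n_one + [0] * n_zero_tail


def model_free_statistics(y):
    """Statistics (a), (b), (c), (g), (h) for one binary series."""
    n = len(y)
    pairs = list(zip(y[:-1], y[1:]))
    m = len(pairs)
    n11 = pairs.count((1, 1))
    n10 = pairs.count((1, 0))
    n01 = pairs.count((0, 1))
    n00 = pairs.count((0, 0))
    mu = sum(y) / n                                   # (a)
    # (b): proportion of (1,1) pairs over the product of the pair margins.
    tau = n11 * m / ((n11 + n10) * (n11 + n01))
    # (c): cross product ratio; reported as infinite if a discordant count is 0.
    chi = float("inf") if n10 * n01 == 0 else n11 * n00 / (n10 * n01)
    p11 = n11 / (n11 + n10)                           # (h)
    lik_indep = mu ** sum(y) * (1 - mu) ** (n - sum(y))   # (g)
    return {"mu": mu, "tau": tau, "chi": chi, "p11": p11, "lik_I": lik_indep}


for spec in [(2, 8, 0), (2, 7, 1), (2, 6, 2), (2, 5, 3),
             (2, 4, 4), (2, 3, 5), (2, 2, 6)]:
    stats = model_free_statistics(make_series(*spec))
    print(spec, {k: round(v, 4) for k, v in stats.items()})
```

The sketch assumes that at least one pair starts with a 1 and at least one ends with a 1, which holds for all seven series considered here.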
Since model M is better supported by the data than model I, although not
significantly better, I select M for my analysis of the series (2, 7, 1) and find that
the marginal probability of bad is 0.58, clearly less than under I, when it is 0.70.
The maximum likelihood estimate for the probability that bad on day t is followed
by bad on day (t + 1) is 0.79, considerably higher than the marginal probability of
bad, but smaller than the model free estimate, p̃r(Yt+1 = 1|Yt = 1) = 0.86.
Inspecting Table 3 as a whole, some remarks are:
(1) The dependence ratio, by measuring event specific association, here association between bad on adjacent days, provides an interpretational advantage, for the
association between good on adjacent days is of no substantive interest. In accordance with the heuristic argument in the introduction to this section, τ̂ increases
when µ̂ decreases. The relation of χ̃ to µ̃ from row two on is non-monotonic, first
increasing and then decreasing. For the series (2, 8, 0) in which the frequency of
(1, 0) is zero, χ̃ is unbounded, while τ̃ = 9/8, attaining its upper bound.
(2) For the range, 0.2 ≤ µ̃ ≤ 0.8, studied here, τ̂ stays on all rows well below
its upper bound 1/µ̂. However, if we know from outside information that for the
series (2, 7, 1), pr(Y10 = 1) = 0.8, then, using the estimates µ̂1 = · · · = µ̂9 = 0.58
and τ̂ = 1.35, we find that pr(Y9 = 1, Y10 = 0) = 0.58(1 − 0.8 · 1.35) < 0. This goes
to demonstrate that joining a regression and an association model can easily lead
to negative fitted path probabilities and has to be done with care. Ekholm et al.
(2000, Sec. 5.2) checked the impact of reversing the coding, see Section 3 (vii), in
analysing a real data example, finding that the association model is simpler, if the
less frequent category is scored as 1, but with little impact on the regression model.
Table 3. The values of ten statistics for seven different binary time series, all of
length 10.

Series     (a)    (b)    (c)   (d)    (e)    (f)        (g)        (h)        (i)        (j)
           µ̃      τ̃      χ̃     µ̂      τ̂      p̂r(d|M)    p̂r(d|I)    p̃r(1|1)    p̂r(1|1)    ω
(2,8,0)    0.8    1.12   ∞     0.71   1.28   0.026      0.0067     1          0.91       2.69
(2,7,1)    0.7    1.10   6     0.58   1.35   0.0044     0.0022     0.86       0.79       1.37
(2,6,2)    0.6    1.25   10    0.48   1.62   0.0043     0.0012     0.83       0.77       2.55
(2,5,3)    0.5    1.44   12    0.39   1.88   0.0046     0.0010     0.80       0.74       3.09
(2,4,4)    0.4    1.69   12    0.32   2.16   0.0054     0.0012     0.75       0.69       3.01
(2,3,5)    0.3    2.00   10    0.25   2.44   0.0071     0.0022     0.67       0.61       2.31
(2,2,6)    0.2    2.25   6     0.18   2.53   0.0113     0.0067     0.50       0.46       1.05

(3) The correlations of the parameter estimates, corr(µ̂, τ̂ ), for the seven series reading Table 3 from top to bottom, are −0.97, −0.93, −0.94, −0.94, −0.92, −0.86, −0.66.
Admittedly, all except the last one, are alarmingly high. However, note that from
the substantive point of view, the most important number, on any one row, is the
product µ̂τ̂ , which is the maximum likelihood estimate of the probability of persistence of bad, p̂r(Yt+1 = 1|Yt = 1), and these behave decently on all rows.
(4) The maximum likelihood estimates of the odds ratio, χ̂, are, by the invariance
of maximum likelihood estimates, conveniently computed using equation (8). The
seven pairs of (χ̃, χ̂), reading Table 3 from top to bottom, are (∞, 33.7), (6, 7.4),
(10, 11.9), (12, 12.4), (12, 12.2), (10, 9.5), (6, 5.2). The computational route from τ̂
to χ̂ suggests that also models, specifying distributions of dimensions higher than
two and formulated in terms of odds ratios, might be fitted using dependence ratios, which provide a computational advantage. It would be valuable to find out
what kind of models one can formulate only by dependence ratios, what kind only
by odds ratios and what kind by both.
5 Conclusions
Not all referees, whom we have encountered during the last seven years, have
dismissed the dependence ratio as wrong or useless. Naturally, we have then pondered, why some statisticians are unreasonably dismissive and others take a more
positive, sometimes encouraging attitude. In fact, Sir David Cox in 1994 at a seminar in Nuffield College, Oxford, expressed appreciation of the very two features of
the dependence ratio, which the critical referees dislike most. These two are, the
event specificity and the risk for negative fitted probabilities.
Lindsey (1999, p. ix) states ‘As a group, statistical referees tend to be very conservative; a minor modification to some obscure score test is revolutionary whereas
a new family of models must either already be widely known or be wrong’. Noteworthy as this statement is, it does not explain the variation we have experienced
between referees. Another word of wisdom points to a different explanation. Elfving (1985) remarked that to gain recognition, it is not enough to do original work,
you have also to belong to a school which is willing to listen to you. Now, there
are wordings in the highly critical referee reports that indicate that some are written by statisticians who work on developing the generalized estimating equations
(GEE) approach and some others by statisticians who take the odds ratio as the
starting point for the theory of multivariate distributions for categorical data. The
irritating thing about the dependence ratio is then, presumably, that it suggests
a new solution to a problem they regard as solved for good. Our problem is, perhaps, not so much that we do not belong to a listening school, as that the critical
referees belong to schools that do not want to hear about new solutions.
References
Barndorff-Nielsen, O. E. & Cox, D. R. (1994). Inference and asymptotics. London:
Chapman & Hall.
Ekholm, A. (1991). Fitting regression models to a multivariate binary response. In A
spectrum of statistical thought: essays in statistical theory, economics and population
genetics in honour of Johan Fellman, pp. 19-32, Eds. G. Rosenqvist, K. Juselius, K.
Nordström and J. Palmgren. Helsingfors: Swedish School of Economics and Business
Administration.
Ekholm, A., Smith, P. W. F. & McDonald, J. W. (1995). Marginal regression analysis
of a multivariate binary response. Biometrika, 82, 847-854.
Ekholm, A. & Skinner, C. (1998). The Muscatine children’s obesity data reanalysed
using pattern mixture models. Applied Statistics, 47, 251-263.
Ekholm, A., McDonald, J. W. & Smith, P. W. F. (2000). Association models for a
multivariate binary response. Biometrics, 56, 712-718.
Ekholm, A., Jokinen, J. & Kilpi, T. (2002). Combining regression and association modelling for longitudinal data on bacterial carriage. Statistics in Medicine, 21, 773-791.
Ekholm, A., Jokinen, J., McDonald, J. W. & Smith, P. W. F. (2002). Joint regression
and association modelling of longitudinal ordinal data. Manuscript.
Elfving, G. (1985). Finnish mathematical statistics in the past. In Proceedings of the
first international Tampere seminar on linear statistical models and their applications,
Tampere 1983, pp. 3-8, Eds. T. Pukkila & S. Puntanen. Department of Mathematical
Sciences/Statistics, University of Tampere.
Jokinen, J., McDonald, J. W. & Smith, P. W. F. (2002). Meaningful regression and
association models for clustered ordinal data. Manuscript.
Lindsey, J. K. (1999). Models for repeated measurements: second edition. Oxford University Press.
Sackett, D. L., Deeks, J. J. & Altman, D. G. (1996). Down with odds ratios! Evidence-Based Medicine, 1, 164-166.
Appendix: Derivation of the covariance matrices of ψ̂ and υ̂
In 1998 a referee, adamant that the dependence ratio is hopelessly inferior to
the odds ratio, wrote as one argument: ‘Another useful property of odds ratios is
orthogonality between marginal parameters and odds ratio parameters in the bivariate case (BB, 1989)’. The reference is to a technical report and the referee was,
obviously, not aware of the general orthogonality of the components of the mixed
parameter, see (viii) of Section 3. The technical report purports to prove that µ1
and µ2 are orthogonal to χ in the parametrisation ψ = (µ1 , µ2 , χ). The proof is
deficient and the asymptotic covariance matrix Var(µ̂1 , µ̂2 , χ̂) is not derived. The
matter would not be worth commenting upon, if I had not encountered several
other references attributing the revelation of this orthogonality in a 2 × 2 table to
that same source; most recently at an international conference in 2002.
Presumably, Var(µ̂1 , µ̂2 , χ̂) was derived and published decades ago, but since the
simplest way to derive Var(µ̂1 , µ̂2 , τ̂ ) is via Var(µ̂1 , µ̂2 , χ̂), we start from scratch.
Denote the cell probabilities by πuv = pr(Y1 = u, Y2 = v), for u, v = 1, 0, where
0 < πuv < 1 and π11 + π10 + π01 + π00 = 1.
A full rank parametrisation π is obtained by π = (π11 , π10 , π01 ). For brevity we
use, sometimes, π00 = 1−π11 −π10 −π01 . We work now with three parametrisations
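of the 2 × 2 table: π, ψ and υ. The transformation π → ψ is given by
\[ \mu_1 = \pi_{11} + \pi_{10}, \quad \mu_2 = \pi_{11} + \pi_{01}, \quad \chi = \frac{\pi_{11}(1 - \pi_{11} - \pi_{10} - \pi_{01})}{\pi_{10}\,\pi_{01}}, \]
and the transformation υ → ψ by
\[ \mu_1 = \mu_1, \quad \mu_2 = \mu_2, \quad \chi = 1 + \frac{\tau - 1}{(1 - \mu_1\tau)(1 - \mu_2\tau)}. \]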
Denote the log-likelihood functions for the three parameters by, respectively, l(π),
l∗ (ψ) and l∗∗ (υ) and the information matrices by, respectively, i(π), i∗ (ψ) and
i∗∗ (υ). Any information matrix is defined, in the standard way, as (−1) times
the expected value of the Hessian matrix of the corresponding log-likelihood function. Denoting by n the number of bivariate observations (y1 , y2 ), i(π) is easily
calculated,
\[ i(\pi) = n \begin{pmatrix} \pi_{11}^{-1} + \pi_{00}^{-1} & \pi_{00}^{-1} & \pi_{00}^{-1} \\ \pi_{00}^{-1} & \pi_{10}^{-1} + \pi_{00}^{-1} & \pi_{00}^{-1} \\ \pi_{00}^{-1} & \pi_{00}^{-1} & \pi_{01}^{-1} + \pi_{00}^{-1} \end{pmatrix}. \tag{A1} \]
The relation between i(π) and i∗ (ψ) is (Barndorff-Nielsen & Cox, 1994, p. 28),
\[ i^{*}(\psi) = \left[\frac{\partial\pi}{\partial\psi}\right]^{T} i(\pi) \left[\frac{\partial\pi}{\partial\psi}\right], \tag{A2} \]
where the last matrix on the right hand side is the 3×3 Jacobian of the reparametrisation π → ψ. In this case calculating that Jacobian is an unpleasant task and we
use instead the general result that for locally 1 − 1 transformations the Jacobian
of the inverse transformation is the inverse of the original Jacobian,
\[ \left[\frac{\partial\pi}{\partial\psi}\right] = \left[\frac{\partial\psi}{\partial\pi}\right]^{-1}. \tag{A3} \]
It is readily checked that
\[ \left[\frac{\partial\psi}{\partial\pi}\right]^{-1} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ a & b & c \end{pmatrix}^{-1} = (a - b - c)^{-1} \begin{pmatrix} -b & -c & 1 \\ a - c & c & -1 \\ b & a - b & -1 \end{pmatrix}, \]
where a, b and c are, respectively, ∂χ/∂π11 , ∂χ/∂π10 and ∂χ/∂π01 . Calculating these derivatives and denoting by vu = var(Yu ), for u = 1, 2, and by c12 =
cov(Y1 , Y2 ), using (A1), (A2) and (A3) gives
\[ i^{*}(\psi) = n\, d^{-1}(\pi) \begin{pmatrix} v_2 & -c_{12} & 0 \\ -c_{12} & v_1 & 0 \\ 0 & 0 & \chi^{-2} e(\pi) \end{pmatrix}, \]
where d(π) = v1 v2 − c12^2 = π10 π01 π00 + π11 π01 π00 + π11 π10 π00 + π11 π10 π01 and
e(π) = π11 π10 π01 π00 . Inverting i∗ (ψ) we find that the asymptotic covariance matrix
for ψ̂ is
\[ \mathrm{Var}(\hat\mu_1, \hat\mu_2, \hat\chi) = [i^{*}(\psi)]^{-1} = \frac{1}{n} \begin{pmatrix} v_1 & c_{12} & 0 \\ c_{12} & v_2 & 0 \\ 0 & 0 & v_\chi \end{pmatrix}, \tag{A4} \]
where vχ = χ^2 (π11^−1 + π10^−1 + π01^−1 + π00^−1 ). To derive Var(µ̂1 , µ̂2 , τ̂ ) we use
\[ i^{**}(\upsilon) = \left[\frac{\partial\psi}{\partial\upsilon}\right]^{T} i^{*}(\psi) \left[\frac{\partial\psi}{\partial\upsilon}\right], \tag{A5} \]
where
\[ \left[\frac{\partial\psi}{\partial\upsilon}\right] = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ a^{*} & b^{*} & c^{*} \end{pmatrix}, \tag{A6} \]
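with a∗ = ∂χ/∂µ1 , b∗ = ∂χ/∂µ2 and c∗ = ∂χ/∂τ . Var(µ̂1 , µ̂2 , τ̂ ) is then best
calculated as [i∗∗ (υ)]^−1 using (A4), (A5) and (A6), obtaining
\[ \mathrm{Var}(\hat\mu_1, \hat\mu_2, \hat\tau) = \frac{1}{n} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -a^{*}/c^{*} & -b^{*}/c^{*} & 1/c^{*} \end{pmatrix} \begin{pmatrix} v_1 & c_{12} & 0 \\ c_{12} & v_2 & 0 \\ 0 & 0 & v_\chi \end{pmatrix} \begin{pmatrix} 1 & 0 & -a^{*}/c^{*} \\ 0 & 1 & -b^{*}/c^{*} \\ 0 & 0 & 1/c^{*} \end{pmatrix} = \frac{1}{n} \begin{pmatrix} v_1 & c_{12} & c_{1\tau} \\ c_{12} & v_2 & c_{2\tau} \\ c_{1\tau} & c_{2\tau} & v_\tau \end{pmatrix}, \tag{A7} \]
where
\[ c_{1\tau} = -\left(\frac{\partial\chi}{\partial\mu_1}\, v_1 + \frac{\partial\chi}{\partial\mu_2}\, c_{12}\right) \left(\frac{\partial\chi}{\partial\tau}\right)^{-1}, \qquad c_{2\tau} = -\left(\frac{\partial\chi}{\partial\mu_1}\, c_{12} + \frac{\partial\chi}{\partial\mu_2}\, v_2\right) \left(\frac{\partial\chi}{\partial\tau}\right)^{-1} \]
and
\[ v_\tau = \left\{ \left(\frac{\partial\chi}{\partial\mu_1}\right)^{2} v_1 + \left(\frac{\partial\chi}{\partial\mu_2}\right)^{2} v_2 + 2 \cdot \frac{\partial\chi}{\partial\mu_1} \cdot \frac{\partial\chi}{\partial\mu_2} \cdot c_{12} + v_\chi \right\} \left(\frac{\partial\chi}{\partial\tau}\right)^{-2}. \]
For computational purposes, useful expressions for the partial derivatives are,
∂χ/∂µ1 = τ (τ − 1)(1 − µ1 τ )^−2 (1 − µ2 τ )^−1 ,
∂χ/∂µ2 = τ (τ − 1)(1 − µ1 τ )^−1 (1 − µ2 τ )^−2 ,
∂χ/∂τ = {π00 − π11 (τ − 1)}(1 − µ1 τ )^−2 (1 − µ2 τ )^−2 .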
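The coefficients cτ and cχ of Table 2 can be computed directly from the variance formulas vχ of (A4) and vτ of (A7), together with the three partial derivatives of χ listed above. The following Python function is a minimal sketch; its name and return convention are illustrative.

```python
from math import sqrt


def se_coefficients(mu1, mu2, tau):
    """Return (c_tau, c_chi), where s.e. = c / sqrt(n), for a 2 x 2 table."""
    # Cell probabilities from the dependence ratio parametrisation.
    p11 = mu1 * mu2 * tau
    p10 = mu1 - p11
    p01 = mu2 - p11
    p00 = 1 - mu1 - mu2 + p11
    chi = p11 * p00 / (p10 * p01)

    v1 = mu1 * (1 - mu1)          # var(Y1)
    v2 = mu2 * (1 - mu2)          # var(Y2)
    c12 = p11 - mu1 * mu2         # cov(Y1, Y2)

    # v_chi from (A4).
    v_chi = chi ** 2 * (1 / p11 + 1 / p10 + 1 / p01 + 1 / p00)

    # Partial derivatives of chi with respect to mu1, mu2 and tau.
    d1 = 1 - mu1 * tau
    d2 = 1 - mu2 * tau
    a = tau * (tau - 1) / (d1 ** 2 * d2)
    b = tau * (tau - 1) / (d1 * d2 ** 2)
    c = (p00 - p11 * (tau - 1)) / (d1 ** 2 * d2 ** 2)

    # v_tau from (A7).
    v_tau = (a * a * v1 + b * b * v2 + 2 * a * b * c12 + v_chi) / c ** 2
    return sqrt(v_tau), sqrt(v_chi)


# Interior rows of Table 2: mu1 = mu2 = 0.2, so tau = 1 + 4 * rho by (9).
for k in range(-2, 10):
    rho = k / 10
    c_tau, c_chi = se_coefficients(0.2, 0.2, 1 + 4 * rho)
    print(f"{rho:5.2f}  {c_tau:6.1f}  {c_chi:8.1f}")
```

The two boundary rows of Table 2 are left out of the loop, since a zero cell probability makes the reciprocals undefined.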