Stratified McNemar Tests

DRAFT
C. Mitchell Dayton
University of Maryland
The McNemar chi-square test is the procedure of choice in behavioral studies
assessing marginal homogeneity for repeated dichotomous measures. Common examples
include two independent raters providing dichotomous judgments for the same set of
stimuli or a panel of judges responding on two occasions to the same dichotomous item.
The research question of interest is whether it is reasonable to describe the
marginal response rates for, say, a favorable rating as equivalent (i.e., homogeneous). The
development of the chi-square test for this case is attributed to McNemar (1947) and the
generalization to square tables larger than 2x2 is often referred to as the Stuart-Maxwell
test (Stuart, 1955; Maxwell, 1970).
_______________________________________
Insert Table 1 about here
_______________________________________
For dichotomous variables, A and B, let πij represent the theoretic proportion for
level i of variable A and level j of variable B (Table 1). Marginal homogeneity means
that π1. = π.1 or, equivalently, that π2. = π.2, and this implies symmetry; i.e., π12 = π21. Note,
however, that marginal homogeneity does not imply symmetry for 3x3 or larger tables.
Assuming a sample of n cases and observed frequencies, nij, the test for symmetry, and
perforce the test for marginal homogeneity, reduces to a simple two-celled goodness-of-fit
test based on the observed frequencies n12 and n21, with the hypothesis that each of the two
discordant cells accounts for half of the discordant responses; i.e., π12/(π12 + π21) = π21/(π12 + π21) = .5.
Dayton AERA 2006
Consequently, both expected frequencies are equal to (n12 + n21)/2. In terms of observed
frequencies, the McNemar statistic is algebraically equivalent to

\[ \chi^2 = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}} . \]

Often a correction for continuity that is known to improve accuracy is applied (Fleiss, 1981).
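As an illustration, the statistic can be computed directly from the two discordant counts. The following is a minimal Python sketch; the function names are hypothetical, the counts are the Year 1 values from Table 2, and the continuity-corrected form, (|n12 − n21| − 1)²/(n12 + n21), is the commonly used variant and is assumed here rather than taken from this paper.

```python
import math

def mcnemar_chi2(n12, n21, continuity=False):
    """McNemar chi-square (1 df) from the two discordant cell counts."""
    diff = abs(n12 - n21)
    if continuity:
        # Common continuity correction (assumed form): shrink |n12 - n21| by 1.
        diff = max(diff - 1, 0)
    return diff ** 2 / (n12 + n21)

def chi2_1df_pvalue(x):
    """Upper-tail p-value for a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Year 1 discordant counts from Table 2 (Yes/No = 45, No/Yes = 47).
stat = mcnemar_chi2(45, 47)
stat_cc = mcnemar_chi2(45, 47, continuity=True)
print(round(stat, 4), round(chi2_1df_pvalue(stat), 3))
print(round(stat_cc, 4), round(chi2_1df_pvalue(stat_cc), 3))
```

Only the discordant cells enter the computation, which is why tables with very different sample sizes can yield similar statistics.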
The present paper focuses on the issue of what may be termed stratified
homogeneity. Stratified homogeneity implies that marginal homogeneity for, say,
variables A and B, holds across the levels of a third variable such as strata or time. Although
this model can be conceptualized in log-linear terms (Bishop, Fienberg and Holland,
1975), the present approach is to exploit a result from Dayton and Macready (1983).
They showed that the model underlying the McNemar test is equivalent to a certain
restricted two-class latent class model for a 2x2 contingency table. We begin with a brief
summary of relevant latent class modeling, present the model for the stratified McNemar
test and conclude with exemplary data from six years of the General Social Survey
(GSS).
Latent Class Analysis
The mathematical model for latent class analysis (LCA) can be represented as
follows. Let Ys = {ysj} be the vector-valued response for items j = 1, . . ., J, for the sth
respondent. Let the response options for the items be defined over a set of distinct,
mutually-exclusive values r = 1, . . ., Rj for the jth item (e.g., for dichotomous responses
these values would be r = 1, 2). Then, for C distinct latent classes, an unrestricted latent
class model is defined by:
\[ P(\mathbf{Y}_s) = \sum_{c=1}^{C} \theta_c \prod_{j=1}^{J} \prod_{r=1}^{R_j} \alpha_{cjr}^{\delta_{sjr}} \]
The latent class (mixing) proportions are θc , c = 1, . . . , C, with the restriction that the
proportions sum to one. These latent class proportions represent the sizes of the
unobserved latent classes. The αcjr are conditional probabilities associated with the items.
That is, they represent the probability of response r to item j given membership in the cth
latent class. Thus, for each item, there is a vector of Rj conditional probabilities and these
conditional probabilities sum to one for each item within each latent class. The δsjr terms
are introduced in the manner of Kronecker deltas to include the appropriate conditional
probabilities in the model based on the observed responses for the sth respondent. Thus,
δsjr = 1 if ysj = r but δsjr = 0 otherwise. In effect, the latent class model is based on the
assumption that, conditional on knowing latent class membership, the responses to the
items are independent. To make the model more explicit, consider three dichotomously scored items and two latent classes. Within latent class 1, the probabilities for a 1
response (e.g., yes or agree) are α111, α121, and α131 while within latent class 2 they are
α211, α221, and α231. The observed response [1,2,1], for example, has conditional
probability α111(1-α121)α131 within latent class 1 and conditional probability
α211(1-α221)α231 within latent class 2. Then, the unconditional probability for this response
is: θ1α111(1-α121)α131 + (1-θ1)α211(1-α221)α231. From a measurement perspective, the
conditional probabilities may be viewed as item “difficulties” that vary across the
unobserved latent classes.
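The three-item example can be checked numerically. The sketch below evaluates the latent class formula for a single response pattern; the function name and all parameter values are hypothetical and chosen only for illustration.

```python
from math import prod

def pattern_probability(response, theta, alpha):
    """Unconditional probability of one response pattern under a latent class model.

    response: observed categories (1-based), one per item
    theta:    latent class proportions (must sum to one)
    alpha:    alpha[c][j][r - 1] = P(response r to item j | class c)
    """
    total = 0.0
    for theta_c, alpha_c in zip(theta, alpha):
        # Local independence: multiply the item probabilities within a class.
        within = prod(alpha_c[j][r - 1] for j, r in enumerate(response))
        total += theta_c * within
    return total

# Hypothetical values for three dichotomous items and two latent classes.
theta = [0.6, 0.4]
alpha = [
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],  # class 1
    [[0.2, 0.8], [0.3, 0.7], [0.1, 0.9]],  # class 2
]
p_121 = pattern_probability([1, 2, 1], theta, alpha)
print(round(p_121, 4))
```

With these values the pattern [1,2,1] has probability .6(.9)(.2)(.7) + .4(.2)(.7)(.1), which mirrors the expression θ1α111(1-α121)α131 + (1-θ1)α211(1-α221)α231 in the text.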
Although the model underlying latent class analysis is non-linear in the
parameters, maximum-likelihood estimation (MLE) can be accomplished using relatively
straightforward iterative procedures such as Newton-Raphson iterations (e.g., Haberman’s
program LAT, 1979) or the expectation-maximization (EM) algorithm (e.g., Vermunt’s
program LEM, 1993). Based on the MLEs, model fit can be assessed by conventional
Pearson or likelihood-ratio chi-square procedures based on the multi-way cross-tabulation of the item responses (e.g., the 2^J table for J dichotomous items). In general,
the degrees of freedom for these tests are #Cells − 1 − #IndPars, where #IndPars is the
number of independent parameters estimated by MLE. However, it is possible that a
latent class model is not identified even though there are positive degrees of freedom.
Programs such as LEM (Vermunt, 1993) provide some useful information on model
identification although this can be a complex issue. These methods, as well as related
descriptive approaches to assessing model fit, are summarized in Dayton (1999).
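The degrees-of-freedom count can be sketched for the simplest case. The hypothetical helper below assumes an unrestricted model with dichotomous items only, so that each class contributes one free conditional probability per item. A non-negative result does not by itself establish identification.

```python
def lca_degrees_of_freedom(n_items, n_classes):
    """Chi-square df for an unrestricted latent class model with dichotomous items."""
    n_cells = 2 ** n_items
    # (C - 1) free class proportions plus one free conditional probability
    # per item within each class.
    n_ind_pars = (n_classes - 1) + n_classes * n_items
    return n_cells - 1 - n_ind_pars

for n_items in (2, 3, 4):
    print(n_items, lca_degrees_of_freedom(n_items, 2))
```

For two items and two classes the count is negative (three independent cell proportions against five parameters), which is why restrictions are needed before a 2x2 table can be analyzed.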
Two Repeated Dichotomous Classifications
The McNemar test is based on a 2x2 table with cell frequencies nij and cell
proportions pij = nij/N, where N is the total sample size. Assuming an unrestricted two-class latent class model, the expected cell proportions are:

\[
\begin{aligned}
E(p_{11}) &= \theta_1 \alpha_{111}\alpha_{121} + \theta_2 \alpha_{211}\alpha_{221} \\
E(p_{12}) &= \theta_1 \alpha_{111}\alpha_{122} + \theta_2 \alpha_{211}\alpha_{222} \\
E(p_{21}) &= \theta_1 \alpha_{112}\alpha_{121} + \theta_2 \alpha_{212}\alpha_{221} \\
E(p_{22}) &= \theta_1 \alpha_{112}\alpha_{122} + \theta_2 \alpha_{212}\alpha_{222}
\end{aligned}
\]

Given the usual restrictions on probabilities, there are five independent parameters,
θ1, α111, α121, α211, and α221, but only three independent proportions, p11, p12, and p21, say.
Thus, the model is not identified unless at least two restrictions are imposed. However,
imposing two restrictions would not yield positive degrees of freedom for assessing fit.
Thus, in order to assess fit of the model, it is necessary to impose three restrictions. The
first two restrictions can be: α111 = α121 = α11 and α211 = α221 = α21; i.e., equating conditional
probabilities across the two items. If we interpret the first class as favoring a “2” response
and the second class as favoring a “1” response, then a third restriction of the form
α111 = α121 = 1 − α211 = 1 − α221, or α11 = αe and 1 − α21 = αe, allows the single conditional
probability, αe, to be interpreted as a response error. Given these restrictions, the
equations above reduce to:

\[
\begin{aligned}
E(p_{11}) &= \theta_1 \alpha_e^2 + (1 - \theta_1)(1 - \alpha_e)^2 \\
E(p_{12}) &= \theta_1 \alpha_e (1 - \alpha_e) + (1 - \theta_1)(1 - \alpha_e)\alpha_e \\
E(p_{21}) &= \theta_1 (1 - \alpha_e)\alpha_e + (1 - \theta_1)\alpha_e (1 - \alpha_e) \\
E(p_{22}) &= \theta_1 (1 - \alpha_e)^2 + (1 - \theta_1)\alpha_e^2
\end{aligned}
\]
In light of these restrictions, the likelihood for a sample becomes greatly simplified and it
is easy to show that the MLEs are:

\[ \hat{\theta}_1 = \frac{p_{22} - \hat{\alpha}_e^2}{1 - 2\hat{\alpha}_e} \quad \text{and} \quad \hat{\alpha}_e = .5 - \sqrt{.25 - (p_{12} + p_{21})/2} \]

Note that the estimate of αe is undefined for p12 + p21 > .5. If this occurs in practice, it is only necessary
to reverse the coding for one of the two variables. As noted by Dayton and Macready
(1983), this restricted latent model reproduces expected frequencies that are consistent
with the McNemar test in the following sense:

\[ \hat{p}_{11} = p_{11}, \quad \hat{p}_{22} = p_{22}, \quad \hat{p}_{12} = \hat{p}_{21} = (p_{12} + p_{21})/2 \]

and the resulting chi-square value for model fit is exactly the same as the McNemar chi-square statistic, both with one degree of freedom. Thus, the McNemar test may be viewed as
testing the null hypothesis α11 = 1 − α21 versus the alternative α11 ≠ 1 − α21.
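The equivalence can be verified numerically. The sketch below fits the restricted two-class model in closed form to the Year 1 counts from Table 2 and compares the Pearson fit statistic with the McNemar statistic. It assumes the parameterization in which αe is the probability of a “1” response in the first class and of a “2” response in the second class; the function name is hypothetical.

```python
import math

def restricted_lca_fit(n11, n12, n21, n22):
    """Closed-form fit of the restricted two-class model to one 2x2 table.

    Assumes alpha_e = P(response 1 | class 1) = P(response 2 | class 2)
    and requires p12 + p21 <= .5.
    """
    n = n11 + n12 + n21 + n22
    p12, p21, p22 = n12 / n, n21 / n, n22 / n
    alpha_e = 0.5 - math.sqrt(0.25 - (p12 + p21) / 2.0)
    theta_1 = (p22 - alpha_e ** 2) / (1.0 - 2.0 * alpha_e)
    # Expected frequencies implied by the fitted parameters.
    e11 = n * (theta_1 * alpha_e ** 2 + (1 - theta_1) * (1 - alpha_e) ** 2)
    e12 = e21 = n * alpha_e * (1 - alpha_e)
    e22 = n * (theta_1 * (1 - alpha_e) ** 2 + (1 - theta_1) * alpha_e ** 2)
    observed = (n11, n12, n21, n22)
    expected = (e11, e12, e21, e22)
    pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return alpha_e, theta_1, expected, pearson

# Year 1 counts from Table 2.
alpha_e, theta_1, expected, pearson = restricted_lca_fit(342, 45, 47, 422)
mcnemar = (45 - 47) ** 2 / (45 + 47)
print(round(alpha_e, 3), round(theta_1, 3))
print(round(pearson, 4), round(mcnemar, 4))
```

The fitted concordant cells match the observed counts, so only the two discordant cells contribute to the fit statistic, each with expected frequency (n12 + n21)/2.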
Stratified McNemar Test
Consider cross-tabulations similar to Table 1 for two or more strata within a
population or for the same population at different points in time. Letting the strata be
represented as y = 1,…,Y, the equations for expected cell proportions for any given
stratum can be written as:
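\[
\begin{aligned}
E(p_{11y}) &= \theta_{1y} \alpha_{ey}^2 + (1 - \theta_{1y})(1 - \alpha_{ey})^2 \\
E(p_{12y}) &= \theta_{1y} \alpha_{ey} (1 - \alpha_{ey}) + (1 - \theta_{1y})(1 - \alpha_{ey})\alpha_{ey} \\
E(p_{21y}) &= \theta_{1y} (1 - \alpha_{ey})\alpha_{ey} + (1 - \theta_{1y})\alpha_{ey} (1 - \alpha_{ey}) \\
E(p_{22y}) &= \theta_{1y} (1 - \alpha_{ey})^2 + (1 - \theta_{1y})\alpha_{ey}^2
\end{aligned}
\]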
Jointly estimating the parameters in this heterogeneous form of the stratified model is
equivalent to fitting the model separately to each stratum but has the advantage of
providing an overall measure of fit in the form of a chi-square statistic with Y degrees of
freedom. However, the real advantage of conceptualizing the model in this manner is that
various restrictions can be imposed across the error rates. The most extreme case results
in a homogeneous model with 2Y-1 degrees of freedom and is based on restrictions of the
form αey = αe for all y. However, a variety of partially heterogeneous models may be
suggested by theory (or the data) and tested accordingly. Fortunately, available computer
programs for latent class analysis allow for these restrictions, as illustrated below.
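For the abortion-item counts listed in Table 2, the error rates under different equality restrictions can be approximated without specialized software. The sketch below is a closed-form shortcut, not the LEM estimation itself: it assumes that, when strata share an error rate, the estimate solves 2α(1 − α) = (pooled discordant count)/(pooled sample size), and that the class proportions remain free in every stratum. Function and variable names are hypothetical.

```python
import math

# Table 2 counts per year: (Yes/Yes, Yes/No, No/Yes, No/No).
TABLES = {
    1: (342, 45, 47, 422), 2: (376, 42, 41, 475), 3: (429, 43, 44, 476),
    4: (829, 90, 78, 903), 5: (725, 109, 75, 867), 6: (672, 68, 69, 941),
}

def error_rate(discordant, total):
    """Error-rate estimate solving 2 * a * (1 - a) = discordant / total, a < .5."""
    return 0.5 - math.sqrt(0.25 - discordant / (2.0 * total))

def grouped_error_rates(tables, groups):
    """One shared error rate per group of strata, pooling counts within groups."""
    rates = []
    for group in groups:
        discordant = sum(tables[y][1] + tables[y][2] for y in group)
        total = sum(sum(tables[y]) for y in group)
        rates.append(error_rate(discordant, total))
    return rates

def heterogeneous_g2(tables):
    """Likelihood-ratio chi-square when every stratum has its own error rate."""
    g2 = 0.0
    for _, n12, n21, _ in tables.values():
        expected = (n12 + n21) / 2.0
        g2 += 2.0 * (n12 * math.log(n12 / expected) + n21 * math.log(n21 / expected))
    return g2

heterogeneous = grouped_error_rates(TABLES, [[y] for y in TABLES])
homogeneous = grouped_error_rates(TABLES, [list(TABLES)])
part_c = grouped_error_rates(TABLES, [[1, 5], [2, 3, 4], [6]])
g2 = heterogeneous_g2(TABLES)
print([round(a, 3) for a in heterogeneous])
print([round(a, 3) for a in homogeneous], [round(a, 3) for a in part_c])
print(round(g2, 2))
```

Passing a different list of year groupings to `grouped_error_rates` corresponds to a different partially heterogeneous model.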
Stratified McNemar Model for Two Abortion Items
The General Social Survey (GSS) has provided response data for several items
related to attitudes toward abortion over many years. These data are in public-access
databases maintained by the National Opinion Research Center in Chicago, Illinois at the
web site http://webapp.icpsr.umich.edu/GSS/. In the GSS, these items are presented with
the introduction:
“Please tell me whether or not you think it should be possible for a
pregnant woman to obtain a legal abortion if. . .”
For purposes of illustrating stratified McNemar tests, the following two items were
selected:
“She is married and does not want any more children.”
“She is not married and does not want to marry the man.”
For reference, the items are called No More and Single, respectively. Data for the six-year period 1993 to 1998 (recoded 1 through 6) are presented in Table 2. It is notable
that, for four of the six years, the numbers of inconsistent responses (i.e., Yes, No or No,
Yes) differ by only 1 or 2 even though the annual sample sizes range from 856 to 1900
for these years.
_______________________________________
Insert Table 2 about here
_______________________________________
Table 3 provides a summary for six models that were fitted to the six-year cross-tabulation of the No More and Single abortion items. The Part Heterogeneous models
were suggested by inspecting the subgroup error rates for the heterogeneous model.
Parameter estimation, model fit, etc. were conducted using the program LEM (Vermunt,
1993) and sample set-ups are included for three of the models in the Appendix. Note that
each of these six models fails to result in discrepancies large enough to provide
statistically significant chi-square statistics at conventional levels. Thus, these goodness-of-fit
chi-square procedures are not of much help in selecting among the models. On the
other hand, certain difference chi-square statistics can be computed since the remaining
five models are each nested within the Heterogeneous model and the Homogeneous
model is nested within each of the Part Heterogeneous models. Unfortunately, these tests
yield a not-unusual picture of intransitivity when multiple significance tests are used in
this manner. Specifically, the Homogeneous/Heterogeneous difference is not significant
at the conventional .05 level (Δ chi-square = 9.84, DF = 5, p = .080), all four Part
Heterogeneous/Heterogeneous differences are non-significant at the conventional .05
level (Δ chi-squares = 5.95, 2.97, .10, .01, respectively, with 4, 3, 3 and 2 DF and
corresponding p-values of .203, .397, .992, .996), but all four Part
Heterogeneous/Homogeneous differences are significant at the conventional .05 level (Δ
chi-squares = 3.89, 6.87, 9.73, 9.83, respectively, with 1, 2, 2 and 3 DF and
corresponding p-values of .049, .032, .008, .020).
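The difference chi-square tests can be recomputed from the G2 values and degrees of freedom reported in Table 3. The sketch below uses only the Python standard library; `chi2_sf` is a hypothetical helper that evaluates the chi-square upper-tail probability for integer degrees of freedom using the standard finite-series forms. Because it starts from the rounded G2 values, results may differ from the reported figures in the last digit.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail term plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

# (degrees of freedom, G2) as reported in Table 3.
MODELS = {
    "Homogeneous": (11, 17.09),
    "Part Heterogeneous A": (10, 13.20),
    "Part Heterogeneous B": (9, 10.22),
    "Part Heterogeneous C": (9, 7.35),
    "Part Heterogeneous D": (8, 7.26),
    "Heterogeneous": (6, 7.25),
}

def difference_test(restricted, general):
    """Difference chi-square for a restricted model nested in a more general one."""
    df_r, g2_r = MODELS[restricted]
    df_g, g2_g = MODELS[general]
    delta, ddf = g2_r - g2_g, df_r - df_g
    return delta, ddf, chi2_sf(delta, ddf)

for name in MODELS:
    if name != "Heterogeneous":
        delta, ddf, p = difference_test(name, "Heterogeneous")
        print(f"{name} vs Heterogeneous: {delta:.2f}, DF = {ddf}, p = {p:.3f}")
for name in MODELS:
    if name.startswith("Part"):
        delta, ddf, p = difference_test("Homogeneous", name)
        print(f"Homogeneous vs {name}: {delta:.2f}, DF = {ddf}, p = {p:.3f}")
```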
_______________________________________
Insert Table 3 about here
_______________________________________
In general, model comparisons in situations for which multiple significance tests
are typically employed are better approached using strategies based on information
criteria such as Akaike (1973) min(AIC) or Schwarz (1978) min(BIC), both of which are
shown in Table 3. It should be noted that BIC is expected to result in the selection of
more conservative models than is AIC because of the heavier penalty that BIC employs
(i.e., ln(N) rather than 2 for each estimated parameter). This is illustrated by the example in Table 3, where BIC favors
the simplest model (Homogeneous), which incorporates a single error rate, whereas AIC
favors a somewhat more complex model, Part Heterogeneous C, that incorporates three
separate error rates for various subsets of years. Although simulations around the issue of
choosing between AIC and BIC are far from conclusive or convincing, there is some
evidence that AIC is more effective than BIC in selecting “correct” models when the
“correct” model is relatively complex (note that “correct” models only exist in simulation
studies). For this author, the model, Part Heterogeneous C, seems to represent the
disparity in subgroup errors in a convincing manner and is the preferred model. That is,
marginal homogeneity seems reasonable for these two abortion items across all six
years but the response error rates seem to fall into three distinct levels:
approximately .041, .046 and .056.
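The relationship between the two criteria can be illustrated with the Table 3 values. The sketch below assumes the usual definitions AIC = −2lnL + 2k and BIC = −2lnL + k ln(N), takes N as the sum of the yearly totals in Table 2, and assumes that the number of independent parameters k equals 23 (24 cells minus 1) minus the model degrees of freedom. The names are hypothetical.

```python
import math

N = 856 + 934 + 992 + 1900 + 1776 + 1750  # total sample size from Table 2
SATURATED_PARS = 2 * 2 * 6 - 1             # 24 cells minus 1 (assumed baseline)

# (degrees of freedom, AIC) as reported in Table 3.
MODELS = {
    "Homogeneous": (11, 44871.78),
    "Part Heterogeneous A": (10, 44869.89),
    "Part Heterogeneous B": (9, 44868.91),
    "Part Heterogeneous C": (9, 44866.04),
    "Part Heterogeneous D": (8, 44867.95),
    "Heterogeneous": (6, 44871.94),
}

def bic_from_aic(df, aic, n=N):
    """Recover BIC from a reported AIC under the assumed definitions."""
    k = SATURATED_PARS - df            # number of independent parameters
    minus_two_loglik = aic - 2 * k     # undo the AIC penalty of 2 per parameter
    return minus_two_loglik + k * math.log(n)

bic = {name: bic_from_aic(df, aic) for name, (df, aic) in MODELS.items()}
best_aic = min(MODELS, key=lambda name: MODELS[name][1])
best_bic = min(bic, key=bic.get)
for name, value in bic.items():
    print(f"{name}: BIC = {value:.2f}")
print("min(AIC):", best_aic, "| min(BIC):", best_bic)
```

With N above 8,000, ln(N) is roughly 9, so each additional error-rate parameter costs about 9 units under BIC against 2 under AIC, which accounts for the different selections.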
References
Akaike, H. (1973). Information theory and an extension of the maximum
likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second International Symposium
on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete Multivariate
Analysis: Theory and Practice. Cambridge: MIT Press.
Dayton, C. M. (1999). Latent Class Scaling Analysis. Sage Publications.
Dayton, C. M., & Macready, G. B. (1983). Latent structure analysis of repeated
classifications with dichotomous data. British Journal of Mathematical & Statistical
Psychology, 36, 189-201.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New York:
Wiley.
Haberman, S. J. (1979). Analysis of Qualitative Data, Volume 2: New
Developments. New York: Academic Press.
Maxwell, A. E. (1970). Comparing the classification of subjects by two
independent judges. British Journal of Psychiatry, 116, 651-655.
McNemar, Q. (1947). Note on the sampling error of the difference between
correlated proportions or percentages. Psychometrika, 12, 153-157.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6,
461-464.
Stuart, A. (1955). A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412-416.
Vermunt, J. K. (1993). Log-linear & event history analysis with missing data
using the EM algorithm. WORC Paper, Tilburg University, The Netherlands.
Appendix
LEM input file for Homogeneous model
* Six years of abortion data – Item: No More, Not Married
* Stratified McNemar test
* Homogeneous Model
lat 1
man 3
dim 2 6 2 2
lab X Y D H * X = latent variable; Y = year; D = No More; H = Not Married
mod Y
X|Y
D|XY eq2
H|XY eq2
des [0 2 0 2 0 2 0 2 0 2 0 2 2 0 2 0 2 0 2 0 2 0 2 0
0 2 0 2 0 2 0 2 0 2 0 2 2 0 2 0 2 0 2 0 2 0 2 0]
dat [342 45 47 422 376 42 41 475 429 43 44 476 829 90 78 903
725 109 75 867 672 68 69 941]
LEM input file for Heterogeneous model
* Six years of abortion data – Item: No More, Not Married
* Stratified McNemar test
* Heterogeneous Model
lat 1
man 3
dim 2 6 2 2
lab X Y D H * X = latent variable; Y = year; D = No More; H = Not Married
mod Y
X|Y
D|XY eq2
H|XY eq2
des [0 2 0 4 0 6 0 8 0 10 0 12 2 0 4 0 6 0 8 0 10 0 12 0
0 2 0 4 0 6 0 8 0 10 0 12 2 0 4 0 6 0 8 0 10 0 12 0]
dat [342 45 47 422 376 42 41 475 429 43 44 476 829 90 78 903
725 109 75 867 672 68 69 941]
LEM input file for Part Heterogeneous C model
* Six years of abortion data – Item: No More, Not Married
* Stratified McNemar test
* Part Heterogeneous Model C
lat 1
man 3
dim 2 6 2 2
lab X Y D H * X = latent variable; Y = year; D = No More; H = Not Married
mod Y
X|Y
D|XY eq2
H|XY eq2
des [0 2 0 4 0 4 0 4 0 2 0 6 2 0 4 0 4 0 4 0 2 0 6 0
0 2 0 4 0 4 0 4 0 2 0 6 2 0 4 0 4 0 4 0 2 0 6 0]
dat [342 45 47 422 376 42 41 475 429 43 44 476 829 90 78 903
725 109 75 867 672 68 69 941]
Table 1
Theoretic Proportions for 2X2 Table

            Item 2 +    Item 2 -
Item 1 +    π11         π12         π1.
Item 1 -    π21         π22         π2.
            π.1         π.2
Table 2
Cross-Tabulation of Two Abortion Items for Six Years

No More:    Yes     Yes     No      No
Single:     Yes     No      Yes     No      Total
Year
1           342     45      47      422     856
2           376     42      41      475     934
3           429     43      44      476     992
4           829     90      78      903     1900
5           725     109     75      867     1776
6           672     68      69      941     1750

GSS Item
No More: Married - wants no more children
Single: Not married
Table 3
Stratified McNemar Models Fit to Abortion Items

Model                   Degrees of   Likelihood-Ratio   P-Value   AIC         BIC         Years                          Subgroup Error Rate
                        Freedom      Chi-Square (G2)
Homogeneous             11           17.09              0.1054    44,871.78   44,955.93   [123456]                       0.048
Part Heterogeneous A    10           13.20              0.2129    44,869.89   44,961.05   [12346], [5]                   .046, .055
Part Heterogeneous B    9            10.22              0.3331    44,868.91   44,967.09   [1234], [5], [6]               .048, .055, .041
Part Heterogeneous C    9            7.35               0.6006    44,866.04   44,964.22   [15], [234], [6]               .056, .046, .041
Part Heterogeneous D    8            7.26               0.5089    44,867.95   44,973.14   [1], [234], [5], [6]           .057, .046, .055, .041
Heterogeneous           6            7.25               0.2983    44,871.94   44,991.16   [1], [2], [3], [4], [5], [6]   .057, .047, .046, .046, .055, .041