DRAFT

Stratified McNemar Tests

C. Mitchell Dayton
University of Maryland
Dayton AERA 2006

The McNemar chi-square test is the procedure of choice in behavioral studies assessing marginal homogeneity for repeated dichotomous measures. Common examples include two independent raters providing dichotomous judgments for the same set of stimuli, or a panel of judges responding on two occasions to the same dichotomous item. The research question of interest is whether or not it is reasonable to describe the marginal response rates for, say, a favorable rating as equivalent (i.e., homogeneous). The development of the chi-square test for this case is attributed to McNemar (1947), and the generalization to square tables larger than 2x2 is often referred to as the Stuart-Maxwell test (Stuart, 1955; Maxwell, 1970).

For dichotomous variables, A and B, let $\pi_{ij}$ represent the theoretic proportion for level i of variable A and level j of variable B (Table 1). Marginal homogeneity means that $\pi_{1.} = \pi_{.1}$ or, equivalently, that $\pi_{2.} = \pi_{.2}$, and this implies symmetry; i.e., $\pi_{12} = \pi_{21}$. Note, however, that marginal homogeneity does not imply symmetry for 3x3 or larger tables.

_______________________________________
Insert Table 1 about here
_______________________________________

Assuming a sample of n cases and observed frequencies, $n_{ij}$, the test for symmetry, and perforce the test for marginal homogeneity, reduces to a simple two-celled goodness-of-fit test based on the observed frequencies $n_{12}$ and $n_{21}$, with the hypothesis that each of these two cells has probability .5, conditional on a case falling in one of them. Consequently, both expected frequencies are equal to $(n_{12} + n_{21})/2$. In terms of observed frequencies, the McNemar statistic is algebraically equivalent to

$$\chi^2 = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}}.$$

Often a correction for continuity that is known to improve accuracy is applied (Fleiss, 1981).

The present paper focuses on the issue of what may be termed stratified homogeneity.
Stratified homogeneity implies that marginal homogeneity for, say, variables A and B, holds across the levels of a third variable such as strata or time. Although this model can be conceptualized in log-linear terms (Bishop, Fienberg and Holland, 1975), the present approach is to exploit a result from Dayton and Macready (1983). They showed that the model underlying the McNemar test is equivalent to a certain restricted two-class latent class model for a 2x2 contingency table. We begin with a brief summary of relevant latent class modeling, present the model for the stratified McNemar test, and conclude with exemplary data from six years of the General Social Survey (GSS).

Latent Class Analysis

The mathematical model for latent class analysis (LCA) can be represented as follows. Let $Y_s = \{y_{sj}\}$ be the vector-valued response for items j = 1, . . ., J, for the sth respondent. Let the response options for the items be defined over a set of distinct, mutually-exclusive values r = 1, . . ., $R_j$ for the jth item (e.g., for dichotomous responses these values would be r = 1, 2). Then, for C distinct latent classes, an unrestricted latent class model is defined by:

$$P(Y_s) = \sum_{c=1}^{C} \theta_c \prod_{j=1}^{J} \prod_{r=1}^{R_j} \alpha_{cjr}^{\delta_{sjr}}$$

The latent class (mixing) proportions are $\theta_c$, c = 1, . . ., C, with the restriction that the proportions sum to one. These latent class proportions represent the sizes of the unobserved latent classes. The $\alpha_{cjr}$ are conditional probabilities associated with the items. That is, they represent the probability of response r to item j given membership in the cth latent class. Thus, for each item, there is a vector of $R_j$ conditional probabilities, and these conditional probabilities sum to one for each item within each latent class. The $\delta_{sjr}$ terms are introduced in the manner of Kronecker deltas to include the appropriate conditional probabilities in the model based on the observed responses for the sth respondent.
Thus, $\delta_{sjr} = 1$ if $y_{sj} = r$ but $\delta_{sjr} = 0$ otherwise. In effect, the latent class model is based on the assumption that, conditional on knowing latent class membership, the responses to the items are independent. To make the model more explicit, consider three dichotomously-scored items and two latent classes. Within latent class 1, the probabilities for a 1 response (e.g., yes or agree) are $\alpha_{111}$, $\alpha_{121}$, and $\alpha_{131}$, while within latent class 2 they are $\alpha_{211}$, $\alpha_{221}$, and $\alpha_{231}$. The observed response [1,2,1], for example, has conditional probability $\alpha_{111}(1-\alpha_{121})\alpha_{131}$ within latent class 1 and conditional probability $\alpha_{211}(1-\alpha_{221})\alpha_{231}$ within latent class 2. Then, the unconditional probability for this response is:

$$\theta_1\alpha_{111}(1-\alpha_{121})\alpha_{131} + (1-\theta_1)\alpha_{211}(1-\alpha_{221})\alpha_{231}.$$

From a measurement perspective, the conditional probabilities may be viewed as item “difficulties” that vary across the unobserved latent classes.

Although the model underlying latent class analysis is non-linear in the parameters, maximum-likelihood estimation (MLE) can be accomplished using relatively straightforward iterative procedures such as Newton-Raphson iterations (e.g., Haberman’s program LAT, 1979) or the expectation-maximization (EM) algorithm (e.g., Vermunt’s program LEM, 1993). Based on the MLEs, model fit can be assessed by conventional Pearson or likelihood-ratio chi-square procedures based on the multi-way cross-tabulation of the item responses (e.g., the $2^J$ table for J dichotomous items). In general, the degrees of freedom for these tests are #Cells - 1 - #IndPars, where #IndPars is the number of independent parameters estimated by MLE. However, it is possible that a latent class model is not identified even though there are positive degrees of freedom. Programs such as LEM (Vermunt, 1993) provide some useful information on model identification, although this can be a complex issue.
These methods, as well as related descriptive approaches to assessing model fit, are summarized in Dayton (1999).

Two Repeated Dichotomous Classifications

The McNemar test is based on a 2x2 table with cell frequencies $n_{ij}$ and cell proportions $p_{ij} = n_{ij}/N$, where N is the total sample size. Assuming an unrestricted two-class latent class model, the expected cell proportions are:

$$E(p_{11}) = \theta_1\alpha_{111}\alpha_{121} + \theta_2\alpha_{211}\alpha_{221}$$
$$E(p_{12}) = \theta_1\alpha_{111}\alpha_{122} + \theta_2\alpha_{211}\alpha_{222}$$
$$E(p_{21}) = \theta_1\alpha_{112}\alpha_{121} + \theta_2\alpha_{212}\alpha_{221}$$
$$E(p_{22}) = \theta_1\alpha_{112}\alpha_{122} + \theta_2\alpha_{212}\alpha_{222}$$

Given the usual restrictions on probabilities, there are five independent parameters, $\theta_1$, $\alpha_{111}$, $\alpha_{121}$, $\alpha_{211}$, and $\alpha_{221}$, but only three independent proportions, $p_{11}$, $p_{12}$, and $p_{21}$, say. Thus, the model is not identified unless at least two restrictions are imposed. However, imposing two restrictions would not yield positive degrees of freedom for assessing fit. Thus, in order to assess fit of the model, it is necessary to impose three restrictions. The first two restrictions can be $\alpha_{111} = \alpha_{121} = \alpha_{11}$ and $\alpha_{211} = \alpha_{221} = \alpha_{21}$; i.e., equating conditional probabilities across the two items. If we interpret the first class as favoring a “1” response and the second class as favoring a “2” response, then a third restriction of the form $\alpha_{111} = \alpha_{121} = 1 - \alpha_{211} = 1 - \alpha_{221}$, or $\alpha_{11} = 1 - \alpha_e$ and $\alpha_{21} = \alpha_e$, allows the single conditional probability, $\alpha_e$, to be interpreted as a response error. Given these restrictions, the equations above reduce to:

$$E(p_{11}) = \theta_1(1-\alpha_e)^2 + (1-\theta_1)\alpha_e^2$$
$$E(p_{12}) = \theta_1(1-\alpha_e)\alpha_e + (1-\theta_1)\alpha_e(1-\alpha_e)$$
$$E(p_{21}) = \theta_1\alpha_e(1-\alpha_e) + (1-\theta_1)(1-\alpha_e)\alpha_e$$
$$E(p_{22}) = \theta_1\alpha_e^2 + (1-\theta_1)(1-\alpha_e)^2$$

In light of these restrictions, the likelihood for a sample becomes greatly simplified: both inconsistent cells have expected proportion $\alpha_e(1-\alpha_e)$, and it is easy to show that the MLEs are:

$$\hat{\theta}_1 = \frac{p_{11} - \hat{\alpha}_e^2}{1 - 2\hat{\alpha}_e} \quad \text{and} \quad \hat{\alpha}_e = .5 - \sqrt{.25 - (p_{12} + p_{21})/2}$$

Note that $\hat{\alpha}_e$ is undefined for $p_{12} + p_{21} > .5$, since the quantity under the square root is then negative. If this occurs in practice, it is only necessary to reverse the coding for one of the two variables.
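As a rough numerical check on the closed-form estimates above, the following Python sketch computes the error rate and class proportion for a single 2x2 table and compares the Pearson fit statistic of the restricted model with the usual McNemar statistic. The function names are hypothetical, the first class is assumed to favor a “1” response (so its probability of a “1” is one minus the error rate), and the frequencies are the Year 1 counts from Table 2. It is an illustration under these assumptions, not the LEM computation used in the paper.

```python
import math


def restricted_lca_fit(n11, n12, n21, n22):
    """Closed-form estimates and Pearson chi-square for the restricted two-class model."""
    n = n11 + n12 + n21 + n22
    p11, p12, p21 = n11 / n, n12 / n, n21 / n
    disc = 0.25 - (p12 + p21) / 2.0
    if disc <= 0:
        # Negative: error rate undefined (p12 + p21 > .5); zero: class proportion undefined.
        raise ValueError("estimates undefined; consider reversing the coding of one variable")
    alpha_e = 0.5 - math.sqrt(disc)                       # response error rate
    theta_1 = (p11 - alpha_e ** 2) / (1.0 - 2.0 * alpha_e)  # proportion in class 1
    # Expected proportions under the restricted model (class 1 assumed to favor "1").
    e11 = theta_1 * (1 - alpha_e) ** 2 + (1 - theta_1) * alpha_e ** 2
    e12 = alpha_e * (1 - alpha_e)
    e21 = e12
    e22 = theta_1 * alpha_e ** 2 + (1 - theta_1) * (1 - alpha_e) ** 2
    observed = (n11, n12, n21, n22)
    expected = tuple(n * e for e in (e11, e12, e21, e22))
    pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return alpha_e, theta_1, pearson


def mcnemar_chi_square(n12, n21):
    """Uncorrected McNemar statistic based on the two inconsistent cells."""
    return (n12 - n21) ** 2 / (n12 + n21)


# Year 1 frequencies from Table 2: Yes/Yes, Yes/No, No/Yes, No/No.
alpha_e, theta_1, pearson = restricted_lca_fit(342, 45, 47, 422)
print(round(alpha_e, 3), round(theta_1, 3))
print(pearson, mcnemar_chi_square(45, 47))
```

Under these assumptions the two printed chi-square values should agree up to rounding error, because the restricted model reproduces the two consistent cells exactly and splits the inconsistent cases evenly.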
As noted by Dayton and Macready (1983), this restricted latent class model reproduces expected frequencies that are consistent with the McNemar test in the following sense: $\hat{p}_{11} = p_{11}$, $\hat{p}_{22} = p_{22}$, $\hat{p}_{12} = \hat{p}_{21} = (p_{12} + p_{21})/2$, and the resulting chi-square value for model fit is exactly the same as the McNemar chi-square statistic, both with one degree of freedom. Thus, the McNemar test may be viewed as testing the null hypothesis $\alpha_{11} = 1 - \alpha_{21}$ versus the alternative $\alpha_{11} \neq 1 - \alpha_{21}$.

Stratified McNemar Test

Consider cross-tabulations similar to Table 1 for two or more strata within a population or for the same population at different points in time. Letting the strata be represented as y = 1,…,Y, with $\theta_{1y}$ the proportion in the first latent class and $\alpha_{ey}$ the response error rate in stratum y, the equations for expected cell proportions for any given stratum can be written as:

$$E(p_{11y}) = \theta_{1y}(1-\alpha_{ey})^2 + (1-\theta_{1y})\alpha_{ey}^2$$
$$E(p_{12y}) = \theta_{1y}(1-\alpha_{ey})\alpha_{ey} + (1-\theta_{1y})\alpha_{ey}(1-\alpha_{ey})$$
$$E(p_{21y}) = \theta_{1y}\alpha_{ey}(1-\alpha_{ey}) + (1-\theta_{1y})(1-\alpha_{ey})\alpha_{ey}$$
$$E(p_{22y}) = \theta_{1y}\alpha_{ey}^2 + (1-\theta_{1y})(1-\alpha_{ey})^2$$

Jointly estimating the parameters in this heterogeneous form of the stratified model is equivalent to fitting the model separately to each stratum but has the advantage of providing an overall measure of fit in the form of a chi-square statistic with Y degrees of freedom. However, the real advantage of conceptualizing the model in this manner is that various restrictions can be imposed across the error rates. The most extreme case results in a homogeneous model with 2Y-1 degrees of freedom and is based on restrictions of the form $\alpha_{ey} = \alpha_e$ for all y. However, a variety of partially heterogeneous models may be suggested by theory (or the data) and tested accordingly. Fortunately, available computer programs for latent class analysis allow for these restrictions, as illustrated below.

Stratified McNemar Model for Two Abortion Items

The General Social Survey (GSS) has provided response data for several items related to attitudes toward abortion over many years.
These data are in public-access databases maintained by the National Opinion Research Center in Chicago, Illinois at the web site http://webapp.icpsr.umich.edu/GSS/. In the GSS, these items are presented with the introduction: “Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if. . .” For purposes of illustrating stratified McNemar tests, the following two items were selected:

“She is married and does not want any more children.”
“She is not married and does not want to marry the man.”

For reference, the items are called No More and Single, respectively. Data for the six-year period 1993 to 1998 (recoded 1 through 6) are presented in Table 2. It is notable that, for four of the six years, the numbers of inconsistent responses (i.e., Yes, No or No, Yes) differ by only 1 or 2 even though the annual sample sizes range from 856 to 1900 for these years.

_______________________________________
Insert Table 2 about here
_______________________________________

Table 3 provides a summary for six models that were fitted to the six-year cross-tabulation of the No More and Single abortion items. The Part Heterogeneous models were suggested by inspecting the subgroup error rates for the Heterogeneous model. Parameter estimation and assessment of model fit were carried out using the program LEM (Vermunt, 1993), and sample set-ups for three of the models are included in the Appendix. Note that none of these six models produces discrepancies large enough to yield a statistically significant chi-square statistic at conventional levels. Thus, these goodness-of-fit chi-square procedures are not of much help in selecting among the models.
On the other hand, certain difference chi-square statistics can be computed since the remaining five models are each nested within the Heterogeneous model and the Homogeneous model is nested within each of the Part Heterogeneous models. Unfortunately, these tests yield a not-unusual picture of intransitivity when multiple significance tests are used in this manner. Specifically, the Homogeneous/Heterogeneous difference is not significant at the conventional .05 level (Δ chi-square = 9.84, DF = 5, p = .080), all four Part Heterogeneous/Heterogeneous differences are non-significant at the conventional .05 level (Δ chi-squares = 5.95, 2.97, .10, .01, respectively, with 4, 3, 3 and 2 DF and corresponding p-values of .203, .397, .992, .996), but all four Part Heterogeneous/Homogeneous differences are significant at the conventional .05 level (Δ chi-squares = 3.89, 6.87, 9.73, 9.83, respectively, with 1, 2, 2 and 3 DF and corresponding p-values of .049, .032, .008, .020).

_______________________________________
Insert Table 3 about here
_______________________________________

In general, model comparisons in situations for which multiple significance tests are typically employed are better approached using strategies based on information criteria such as Akaike (1973) min(AIC) or Schwarz (1978) min(BIC), both of which are shown in Table 3. It should be noted that BIC is expected to result in the selection of more conservative models than is AIC because of the heavier penalty that BIC employs (i.e., ln(N) rather than 2 for each estimated parameter). This is illustrated by the example in Table 3, where BIC favors the simplest model (Homogeneous), which incorporates a single error rate, whereas AIC favors a somewhat more complex model, Part Heterogeneous C, which incorporates three separate error rates for various subsets of years.
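The different penalties can be illustrated with a short Python sketch that ranks the six models using only the degrees of freedom and G2 values reported in Table 3. It assumes that the number of independent parameters is #Cells - 1 - DF with 24 cells (6 years by 2 by 2) and that N = 8,208 is the sum of the yearly totals in Table 2. The criteria computed here are relative values (G2 plus penalty), which differ from the tabled AIC and BIC by a constant common to all models and therefore give the same ordering. The function names are hypothetical.

```python
import math

# (model, degrees of freedom, G2) as reported in Table 3.
MODELS = [
    ("Homogeneous", 11, 17.09),
    ("Part Heterogeneous A", 10, 13.20),
    ("Part Heterogeneous B", 9, 10.22),
    ("Part Heterogeneous C", 9, 7.35),
    ("Part Heterogeneous D", 8, 7.26),
    ("Heterogeneous", 6, 7.25),
]
N_CELLS = 24    # 6 years x 2 x 2 cells (assumption stated in the text)
N_TOTAL = 8208  # sum of the yearly totals in Table 2


def relative_criteria(df, g2, n_cells=N_CELLS, n=N_TOTAL):
    """Return (relative AIC, relative BIC): G2 plus the respective penalty."""
    k = n_cells - 1 - df  # number of independent parameters
    return g2 + 2 * k, g2 + k * math.log(n)


def best_models(models):
    """Names of the models minimizing the relative AIC and the relative BIC."""
    scored = {name: relative_criteria(df, g2) for name, df, g2 in models}
    best_aic = min(scored, key=lambda name: scored[name][0])
    best_bic = min(scored, key=lambda name: scored[name][1])
    return best_aic, best_bic


for name, df, g2 in MODELS:
    aic, bic = relative_criteria(df, g2)
    print(f"{name:22s} AIC* = {aic:7.2f}  BIC* = {bic:7.2f}")
print(best_models(MODELS))
```

Because ln(8208) is roughly 9, each additional error-rate parameter costs about 9 units under BIC but only 2 under AIC, which is why the two criteria can point to different models for the same G2 values.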
Although simulations around the issue of choosing between AIC and BIC are far from conclusive or convincing, there is some evidence that AIC is more effective than BIC in selecting “correct” models when the “correct” model is relatively complex (note that “correct” models only exist in simulation studies). For this author, the model, Part Heterogeneous C, seems to represent the disparity in subgroup errors in a convincing manner and is the preferred model. That is, marginal homogeneity seems reasonable for these two abortion items across all six years, but the rates of inconsistent responses seem to fall into three distinct levels: approximately .041, approximately .046 and approximately .056.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (Eds.), Second International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge: MIT Press.
Dayton, C. M. (1999). Latent Class Scaling Analysis. Sage Publications.
Dayton, C. M., & Macready, G. B. (1983). Latent structure analysis of repeated classifications with dichotomous data. British Journal of Mathematical & Statistical Psychology, 36, 189-201.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. New York: Wiley.
Haberman, S. J. (1979). Analysis of Qualitative Data, Volume 2: New Developments. New York: Academic Press.
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153-157.
Maxwell, A. E. (1970). Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 116, 651-655.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Stuart, A. (1955). A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412-416.
Vermunt, J. K. (1993). Log-linear & event history analysis with missing data using the EM algorithm. WORC Paper, Tilburg University, The Netherlands.

Appendix

LEM input file for Homogeneous model

* Six years of abortion data – Item: No More, Not Married
* Stratified McNemar test
* Homogeneous Model
lat 1
man 3
dim 2 6 2 2
lab X Y D H
* X = latent variable; Y = year; D = No More; H = Not Married
mod Y X|Y D|XY eq2 H|XY eq2
des [0 2 0 2 0 2 0 2 0 2 0 2
     2 0 2 0 2 0 2 0 2 0 2 0
     0 2 0 2 0 2 0 2 0 2 0 2
     2 0 2 0 2 0 2 0 2 0 2 0]
dat [342 45 47 422
     376 42 41 475
     429 43 44 476
     829 90 78 903
     725 109 75 867
     672 68 69 941]

LEM input file for Heterogeneous model

* Six years of abortion data – Item: No More, Not Married
* Stratified McNemar test
* Heterogeneous Model
lat 1
man 3
dim 2 6 2 2
lab X Y D H
* X = latent variable; Y = year; D = No More; H = Not Married
mod Y X|Y D|XY eq2 H|XY eq2
des [0 2 0 4 0 6 0 8 0 10 0 12
     2 0 4 0 6 0 8 0 10 0 12 0
     0 2 0 4 0 6 0 8 0 10 0 12
     2 0 4 0 6 0 8 0 10 0 12 0]
dat [342 45 47 422
     376 42 41 475
     429 43 44 476
     829 90 78 903
     725 109 75 867
     672 68 69 941]

LEM input file for Part Heterogeneous C model

* Six years of abortion data – Item: No More, Not Married
* Stratified McNemar test
* Part Heterogeneous Model C
lat 1
man 3
dim 2 6 2 2
lab X Y D H
* X = latent variable; Y = year; D = No More; H = Not Married
mod Y X|Y D|XY eq2 H|XY eq2
des [0 2 0 4 0 4 0 4 0 2 0 6
     2 0 4 0 4 0 4 0 2 0 6 0
     0 2 0 4 0 4 0 4 0 2 0 6
     2 0 4 0 4 0 4 0 2 0 6 0]
dat [342 45 47 422
     376 42 41 475
     429 43 44 476
     829 90 78 903
     725 109 75 867
     672 68 69 941]

Table 1
Theoretic Proportions for 2x2 Table

             Item 2 +    Item 2 -
Item 1 +     π11         π12         π1.
Item 1 -     π21         π22         π2.
             π.1         π.2

Table 2
Cross-Tabulation of Two Abortion Items for Six Years

        No More:   Yes     Yes     No      No
        Single:    Yes     No      Yes     No      Total
Year 1             342     45      47      422     856
Year 2             376     42      41      475     934
Year 3             429     43      44      476     992
Year 4             829     90      78      903     1900
Year 5             725     109     75      867     1776
Year 6             672     68      69      941     1750

GSS Item
No More:   Married - wants no more children
Single:    Not married

Table 3
Stratified McNemar Models Fit to Abortion Items

DF = degrees of freedom; G2 = likelihood-ratio chi-square.

Model                  DF   G2      P-Value   AIC         BIC         Years Subgroup                  Error Rate
Homogeneous            11   17.09   0.1054    44,871.78   44,955.93   [123456]                        .048
Part Heterogeneous A   10   13.20   0.2129    44,869.89   44,961.05   [12346], [5]                    .046, .055
Part Heterogeneous B    9   10.22   0.3331    44,868.91   44,967.09   [1234], [5], [6]                .048, .055, .041
Part Heterogeneous C    9    7.35   0.6006    44,866.04   44,964.22   [15], [234], [6]                .056, .046, .041
Part Heterogeneous D    8    7.26   0.5089    44,867.95   44,973.14   [1], [234], [5], [6]            .057, .046, .055, .041
Heterogeneous           6    7.25   0.2983    44,871.94   44,991.16   [1], [2], [3], [4], [5], [6]    .057, .047, .046, .046, .055, .041
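The likelihood-ratio statistics and error rates in Table 3 can be cross-checked from the Table 2 frequencies with a short Python sketch. It rests on an assumption that is not stated in the paper: when the class proportions are free in every year, the fitted probability of an inconsistent response for a group of years is taken to be the pooled share of inconsistent responses in that group, split evenly between the two inconsistent cells, while the two consistent cells keep their observed split. The function names are hypothetical, and the chi-square tail probability uses a closed-form series for integer degrees of freedom rather than a statistics library.

```python
import math

# Table 2 frequencies for years 1-6, ordered (Yes/Yes, Yes/No, No/Yes, No/No).
TABLE2 = [
    (342, 45, 47, 422),
    (376, 42, 41, 475),
    (429, 43, 44, 476),
    (829, 90, 78, 903),
    (725, 109, 75, 867),
    (672, 68, 69, 941),
]


def pooled_error_rate(rows):
    """Error rate .5 - sqrt(.25 - q/2), with q the pooled share of inconsistent responses."""
    q = sum(r[1] + r[2] for r in rows) / sum(sum(r) for r in rows)
    return 0.5 - math.sqrt(0.25 - q / 2.0)


def likelihood_ratio_g2(tables, groups):
    """G2 for a model in which the years inside each group share one error rate."""
    g2 = 0.0
    for group in groups:
        rows = [tables[year - 1] for year in group]
        q = sum(r[1] + r[2] for r in rows) / sum(sum(r) for r in rows)
        for n11, n12, n21, n22 in rows:
            n = n11 + n12 + n21 + n22
            agree = n11 + n22
            # Assumption: consistent cells keep their observed split of the share (1 - q).
            expected = (
                n * (1 - q) * n11 / agree,
                n * q / 2.0,
                n * q / 2.0,
                n * (1 - q) * n22 / agree,
            )
            observed = (n11, n12, n21, n22)
            g2 += 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
    return g2


def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df (closed-form series)."""
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2.0) / k
            total += term
        return math.exp(-x / 2.0) * total
    total = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


heterogeneous = likelihood_ratio_g2(TABLE2, [[1], [2], [3], [4], [5], [6]])
homogeneous = likelihood_ratio_g2(TABLE2, [[1, 2, 3, 4, 5, 6]])
p_difference = chi2_sf(homogeneous - heterogeneous, 5)
print(round(heterogeneous, 2), round(homogeneous, 2), round(p_difference, 3))
print(round(pooled_error_rate(TABLE2), 3))
```

Under the stated assumption, the two G2 values should come out close to the Heterogeneous and Homogeneous entries in Table 3, and other groupings (for example [[1, 5], [2, 3, 4], [6]] for Part Heterogeneous C) can be checked the same way. The sketch does not handle zero cell frequencies, which do not occur in these data.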