STRUCTURAL EQUATION MODELING, 12(1), 148-162
Copyright © 2005, Lawrence Erlbaum Associates, Inc.

Do Items That Measure Self-Perceived Physical Appearance Function Differentially Across Gender Groups? An Application of the MACS Model

Vicente Gonzalez-Roma, Ines Tomas, Doris Ferreres and Ana Hernandez
Facultad de Psicologia
University of Valencia, Spain

The aims of this study were to investigate whether the 6 items of the Physical Appearance Scale (Marsh, Richards, Johnson, Roche, & Tremayne, 1994) show differential item functioning (DIF) across gender groups of adolescents, and to show how this can be done using the multigroup mean and covariance structure (MG-MACS) analysis model. Two samples composed of 402 boys and 445 girls were analyzed. Two DIF items were detected. One of them showed uniform DIF in the unexpected direction, whereas the other showed nonuniform DIF in the expected direction. The practical significance of the DIF detected was trivial. Thus, we concluded that the differences between girls' and boys' average scores on the analyzed scale reflect valid differences on the latent trait.

Correspondence concerning this study should be addressed to Vicente Gonzalez-Roma, University of Valencia, Faculty of Psychology, Department of Methodology of Behavioral Sciences, Av. Blasco Ibáñez, 21, 46010-Valencia, Spain. E-mail: Vicente.Glez-roma@uv.es

Psychologists providing assessment and testing services, as well as those who carry out studies involving comparison of test scores across groups, are obligated to select and use nonbiased test instruments (Knauss, 2001). For this to be possible, researchers and test constructors have to provide empirical evidence about whether test items function differently across groups (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). In this article, we investigate whether the six items of the Physical Appearance Scale (PAS), a scale included in the Physical-Self Description Questionnaire (PSDQ; Marsh, Richards, Johnson, Roche, & Tremayne, 1994), show differential item functioning (DIF) across adolescent gender groups. Furthermore, we show how this can be done using multigroup confirmatory factor analysis with mean and covariance structure (MG-CFA-MACS; Sorbom, 1974), a method frequently used to investigate group differences on factor means (e.g., McArdle, Johnson, Hishinuma, Miyamoto, & Andrade, 2001).

During the last two decades, many techniques have been proposed to detect DIF. Some examples are the Mantel-Haenszel procedure (Holland & Thayer, 1988), the SIBTEST procedure (Shealy & Stout, 1993a, 1993b), the logistic regression procedure (Swaminathan & Rogers, 1990), and item response theory (IRT) procedures (see Camilli & Shepard, 1994). However, some of these techniques may be too complex for psychologists lacking a highly technical or statistical background. If we want DIF analysis to be systematically implemented when tests are assessed, we have to foster the use and application of the most accessible methods. With this idea in mind, the DIF detection procedure used in this article was selected because it is an extension of a technique familiar to many psychologists: factor analysis.

DIF AND SELF-PERCEIVED PHYSICAL APPEARANCE

Recent research has shown that adolescent girls obtain lower scores than adolescent boys on self-perceived physical appearance measures. Marsh and colleagues (Marsh, Hey, Roche, & Perry, 1997; Marsh et al., 1994) have found consistent statistically significant differences between the average observed scores obtained by girls and boys that favor boys. There are some data that suggest that adolescent girls really have a lower self-perceived physical appearance than boys have.
For instance, the percentage of anorexia cases among girls is higher than among boys, and anorexia is related to a dysfunction in self-perception of physical appearance. However, this does not rule out the possibility that DIF of appearance items contributes to producing gender differences in observed scores on self-perceived physical appearance measures.

To conclude that observed differences reflect valid differences in the latent trait, measurement equivalence (or the opposite, DIF) of items across gender groups must be analyzed. This involves ascertaining whether item parameters differ across gender groups or not. An item is said to function differentially across groups when individuals at the same latent trait level, but belonging to different groups, respond differently to that item. If a substantial number of items on a questionnaire show DIF (i.e., are not equivalent across groups), "then we cannot even assume that the same construct is being assessed across groups by the same measure" (Chan, 2000, p. 170). If the proportion of items showing DIF is small, then meaningful between-group comparison may still be possible at the latent trait level (Chan, 2000; Reise, Widaman, & Pugh, 1993). On the other hand, impact is defined as differences in item (or test) performance caused by real differences in the underlying latent variable measured by the test (Camilli & Shepard, 1994).

Two types of DIF can be distinguished depending on the type of item parameter that differs across groups (i.e., the item difficulty and the item discrimination parameters). Uniform DIF exists when only the item difficulty parameter differs across groups. In the context of the factor analytic item response model that is used later, the item difficulty parameter corresponds to the expected (mean) item response value for participants with a score on the latent trait that equals zero.
Therefore, the higher the expected item response value for those participants, the easier the item. When the items under study are personality or attitude items, the item difficulty parameter is referred to as the item attractiveness or evocativeness parameter (e.g., Lanning, 1991; Oort, 1998). Nonuniform DIF exists when the item discrimination parameter differs across groups, whether or not the item difficulty parameter remains invariant. Item discrimination represents the relation between an item and the latent trait that the item is supposed to measure. It also refers to the extent to which an item is able to distinguish between individuals with close but different scores on the latent trait. In the context of the factor analytic item response model that is used later, the item discrimination parameter corresponds to the item factor loading.

A pervasive criticism of empirical DIF detection studies is that the detection of DIF has been performed in a nontheoretical manner without a priori hypotheses regarding whether DIF would exist and in which direction (Chan, 2000). One exception to this general trend is Chan (2000). He based his hypothesis about nonuniform DIF on Roskam's (1985) idea that in personality inventories the item's discrimination parameter expresses its psychological ambiguity. According to Roskam (1985), the more the item is formulated in concrete terms and the less ambiguous it is, the larger the discrimination parameter. However, Zumbo, Pope, Watson, and Hubley (1997) tested Roskam's idea and did not find any evidence supporting it. Zumbo et al. (1997) concluded that "a general theory on the interpretation of item parameters in personality inventories in terms of psychological ambiguity is not viable and such interpretation should be tailored to individual scales or subscales" (p. 968).

In this study, to guide DIF detection, we formulated two tentative hypotheses, taking into account item content and some characteristics of the groups compared.
The first tentative hypothesis refers to the difficulty or evocativeness parameter. Because of the predominance of female models in advertisements and mass media, adolescent girls are exposed to same-gender beauty models more frequently than adolescent boys are. Thus, it is possible that when they rate their own physical appearance, their (top) models of reference are more clear and present, and this may help explain why they score lower than adolescent boys on physical appearance items. A recent experiment showed that children who were exposed to images judged to epitomize the media emphasis on physical beauty reported lower physical appearance than children who viewed images judged to be devoid of such messages (Oliveira, 2000). Considering these arguments, we expect that physical appearance items will show uniform DIF (i.e., differences in the item intercepts) across gender groups, so that the item intercepts will be lower for girls than for boys.

The second tentative hypothesis refers to the discrimination parameter. Because of socialization processes, adolescent girls talk and express their opinions about their physical appearance with more naturalness than do adolescent boys. In groups of girls, it is socially acceptable to talk about physical appearance, whereas in groups of boys, talking about physical appearance may be regarded unfavorably. Therefore, it is possible that other factors, such as group norms, lack of familiarity, and feelings of shame, also play a role when boys express their opinions about their physical appearance. Thus, we think that girls' responses to physical appearance items will be more strongly related to the intended underlying latent trait than boys' responses will. Therefore, we hypothesize that physical appearance items will show nonuniform DIF (i.e., differences in the item factor loadings) across gender groups, so that the factor loading estimates will be larger for girls than for boys.
Because of the lack of more precise theoretical reasons, no specific hypotheses as to which items would show DIF were formulated.

THE MG-CFA-MACS MODEL FOR ANALYZING DIF

Unlike the CFA with covariance structure, which assumes that all measured variable and latent variable means are equal to zero, in the MACS model the means of latent and measured variables are not presumed to be zero. Thus, within this framework the linear relation between items and latent traits is expressed as follows (Sorbom, 1974):

$$X = \tau_x + \Lambda_x \xi + \delta \tag{1}$$

where $\tau_x$ is a $(q \times 1)$ vector that contains the intercept parameters; $X$ is a $(q \times 1)$ vector containing the scores on the $q$ measured variables or items; $\Lambda_x$ is a $(q \times r)$ matrix of factor loadings that represents the relations between the $q$ measured variables and the $r$ latent variables; $\xi$ is a $(r \times 1)$ vector that contains the factor scores on the $r$ latent variables; $\delta$ is a $(q \times 1)$ vector that represents the measurement errors or uniquenesses for each measured variable. It is assumed that uniquenesses are normally distributed with a population mean that equals zero. Mean values of measured variables $[E(X)]$ can be explained in terms of the latent variable means as follows:

$$E(X) = \tau_x + \Lambda_x \kappa \tag{2}$$

where $\kappa$ is a $(r \times 1)$ column vector containing the latent variable means $[E(\xi) = \kappa]$. Hence, in a unidimensional questionnaire, when the latent trait equals zero, the expected score for the items is the corresponding item intercept. As stated earlier, within the MACS model, the item intercept ($\tau$) represents the item difficulty (or evocativeness) parameter, whereas the item factor loading ($\lambda$) corresponds to the item discrimination parameter.

When two or more groups are considered, Equations 1 and 2 become:

$$X^{(g)} = \tau_x^{(g)} + \Lambda_x^{(g)} \xi^{(g)} + \delta^{(g)} \tag{3}$$

$$E(X)^{(g)} = \tau_x^{(g)} + \Lambda_x^{(g)} \kappa^{(g)} \tag{4}$$

where $g = 1, 2, \dots, G$ refers to the different groups considered.
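The mean structure in Equation 4 can be illustrated with a minimal sketch for the single-factor case. All parameter values below are hypothetical and are not estimates from this study; the function name `expected_item_means` is introduced here only for illustration.

```python
def expected_item_means(intercepts, loadings, latent_mean):
    """Model-implied item means for one group in a single-factor model.

    Each item mean is tau_j + lambda_j * kappa (Equation 4 with r = 1).
    """
    return [tau + lam * latent_mean for tau, lam in zip(intercepts, loadings)]


# Hypothetical parameter values for three items (not taken from the study).
tau = [3.0, 4.0, 3.5]      # item intercepts
lam = [1.0, 0.8, 1.2]      # item factor loadings
kappa_reference = 0.0      # latent mean fixed to zero in the reference group
kappa_focal = -0.5         # latent mean in the focal group

reference_means = expected_item_means(tau, lam, kappa_reference)
focal_means = expected_item_means(tau, lam, kappa_focal)

# With invariant intercepts and loadings, each gap equals lambda_j * (kappa difference).
invariant_gaps = [f - r for f, r in zip(focal_means, reference_means)]

# Uniform DIF on the second item: its intercept differs in the focal group,
# so the gap in item means no longer reflects only the latent mean difference.
tau_focal_dif = [3.0, 4.3, 3.5]
dif_means = expected_item_means(tau_focal_dif, lam, kappa_focal)
dif_gaps = [f - r for f, r in zip(dif_means, reference_means)]

print("gaps under invariance:", invariant_gaps)
print("gaps with uniform DIF on item 2:", dif_gaps)
```

In this sketch the reference group's expected item means equal the intercepts because its latent mean is zero, which mirrors the interpretation of the intercept as the expected response at a latent trait score of zero.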
If item parameters are invariant across groups (i.e., items do not show DIF), then

$$E(X)^{(g)} = \tau_x + \Lambda_x \kappa^{(g)} \tag{5}$$

According to Equation 5, when item intercepts and factor loadings are invariant across groups, between-group differences in average item scores do reflect between-group differences in latent means. Under these conditions, average item and scale scores are comparable across groups. According to Meredith (1993), the invariance of item intercepts and factor loadings represents a type of factorial invariance, the so-called strong factorial invariance. When only the invariance of item intercepts cannot be maintained, then uniform DIF exists. In this case, the intercepts that a given item ($j$) shows in the different groups of participants considered are not equivalent [i.e., $\tau_j^{(1)} \neq \tau_j^{(2)} \neq \dots \neq \tau_j^{(G)}$]. When the invariance of item factor loadings cannot be maintained, regardless of whether the intercepts are invariant or not, then nonuniform DIF exists [i.e., $\lambda_j^{(1)} \neq \lambda_j^{(2)} \neq \dots \neq \lambda_j^{(G)}$]. Thus, within the MG-MACS model, testing for DIF involves testing for item intercepts and factor loadings invariance.

In summary, the aim of this study is twofold. The first is to investigate whether the six items of the PAS, a scale included in the PSDQ (Marsh et al., 1994), show DIF across gender groups of adolescents. The second is to show how this can be done using MG-CFA-MACS.

METHOD

Participants

The study sample was composed of 847 participants of between 12 and 16 years of age. Forty-eight percent were boys (n = 402), and 52% were girls (n = 445). The average age for both groups was 13.3 years (SD = 1 for both groups). All of the participants responded to the PSDQ (see Marsh et al., 1994, for a full presentation of the questionnaire). Participation in the study was voluntary. The same researcher administered the questionnaire to classroom units of high school students during a regular class period.
Before the participants completed the questionnaire, the test administrator read the written test instructions, and procedural questions were solicited and answered.

Measures

The items analyzed in this study are those included in the Appearance scale of the PSDQ (Marsh et al., 1994). The Appearance scale comprises six items that are responded to using a 6-point response scale ranging from 1 (completely false) to 6 (completely true; see Appendix). Items 4 and 6 are reversed. Responses to these items were transformed so that a high score was indicative of high self-perceived physical appearance. The Cronbach's alpha estimates computed for the Appearance scale in both samples were satisfactory: .87 in the boy sample and .86 in the girl sample.

Analysis

All the MG-MACS models were tested using LISREL 8.30 (Joreskog & Sorbom, 1993) and normal theory maximum likelihood (ML) estimation methods. As the ML estimation procedure assumes a multivariate normal distribution for the observed variables, this assumption was tested. The tests of multivariate normality provided by PRELIS 2.30 indicated that this assumption could not be maintained in the analyzed groups of participants. The tests of univariate normality indicated that the analyzed variables could not be considered as normally distributed. Simulation studies that have analyzed the robustness of ML estimators to violations of distributional assumptions when the observed variables are discrete (e.g., Boomsma, 1983; Harlow, 1985; Muthen & Kaplan, 1985; Olsson, 1979) indicate that little distortion of the ML chi-square, and little or no parameter estimate bias, is to be expected with nonnormal ordinal variables when they show a moderate departure from normality; that is, when they have univariate skewness and kurtosis in the range from -1 to +1. The skewness statistic computed for the items analyzed showed ranges from -0.66 to 0.05 and from -0.45 to 0.37 in the boy and girl samples, respectively.
The kurtosis statistic showed a range from -0.60 to -0.13 in the boy sample and a range from -0.86 to -0.34 in the girl sample. Therefore, because skewness and kurtosis are minimal, the assumption of approximate normality for the observed variables is reasonable, and the use of normal theory ML estimation techniques can be justified (Bollen, 1989).

To assess the goodness of fit for the models, we considered the chi-square goodness-of-fit statistic (χ²), the root mean square error of approximation (RMSEA), and the Nonnormed Fit Index (NNFI). The chi-square statistic is a test of the difference between the observed covariance matrix and the one predicted by the specified model. Nonsignificant values indicate that the hypothesized model fits the data. However, this index has two important limitations: (a) it is sensitive to sample size, so that the probability of rejecting a hypothesized model increases as sample size increases (Joreskog & Sorbom, 1993; La Du & Tanaka, 1995; Tanaka, 1993); and (b) it evaluates whether or not the analyzed model holds exactly in the population, which is a very demanding assessment. Thus, the use of other fit indexes has been suggested as an alternative to tests of statistical significance (e.g., Marsh, 1994; Marsh & Hocevar, 1985; Marsh et al., 1997; Reise et al., 1993). The RMSEA evaluates whether the analyzed model holds approximately in the population. Guidelines for interpretation of the RMSEA suggest that values of about .05 or less would indicate a close fit of the model, values of about .08 or less would indicate a fair fit of the model or a reasonable error of approximation, and values greater than .1 would indicate poor fit (Browne & Cudeck, 1993; Browne & Du Toit, 1992). Finally, the NNFI (Bentler & Bonett, 1980; Tucker & Lewis, 1973) is a relative measure of fit that also applies penalties for a lack of parsimony.
NNFI values of .90 or above indicate good model fit (Bentler & Bonett, 1980). Chi-square difference tests were used to compare the fit of nested models.

Considering that the DIF analyses assume that the scale under study is unidimensional, this assumption was tested before running the DIF analyses. First, scale factorability was assessed by means of Bartlett's sphericity test. The results obtained for boys, χ²(15, N = 402) = 1275.6, p < .01, and girls, χ²(15, N = 445) = 1235.9, p < .01, supported the factorability of the scale in both samples. Then, a multigroup one-factor CFA model with no invariance constraints across groups was tested. The fit provided by the model was acceptable, χ²(18, N = 847) = 63.9, p < .01; RMSEA = .075; NNFI = .97. Thus the unidimensionality of the Appearance scale could be maintained.

The MG-MACS model was fitted to the 6 x 6 item variance-covariance matrices and vectors of six means of both the boys and the girls samples. To detect uniform and nonuniform DIF, a series of nested multiple-group, single-factor models were tested according to the iterative procedure recommended by Oort (1998) and followed by Chan (2000). For all the models, a number of constraints were imposed for model identification and scaling purposes. First, an item was chosen as the reference indicator. To guide this selection, we conducted an exploratory factor analysis. The item that showed the highest loading (Item 5) was selected as the reference indicator. This item's factor loading was set to 1 in both groups to scale the latent variable and provide a common scale in both groups. Second, the factor mean was fixed to zero in the boys group for identification purposes, whereas the factor mean in the girls group was freely estimated. Finally, the reference indicator intercepts were constrained to be equal in both groups to identify and estimate the factor mean in the girls group and the intercepts in both groups (Sorbom, 1982).
The iterative procedure starts with a fully equivalent model in which all the item factor loadings and the intercepts are set to be equal in both groups. Then nonuniform DIF is evaluated. Specifically, the largest modification index (MI) associated with the factor loading estimates is evaluated to determine its statistical significance. An MI estimates the reduction in the model chi-square value that would result if the corresponding constrained parameter were freely estimated. Because this chi-square difference is distributed with 1 df, it is easy to determine whether the reduction in chi-square is statistically significant. If the largest loading MI is statistically significant, it is concluded that the corresponding item exhibits nonuniform DIF across the two groups. Then a new model is fitted. In this model, the factor loading that showed a statistically significant MI is freely estimated, and the remaining factor loading estimates are constrained to be equal in both groups. The largest MI associated with the factor loading estimates is evaluated again to determine its statistical significance, and this iterative procedure continues until the largest MI is not statistically significant.

After evaluating nonuniform DIF, the procedure focuses on uniform DIF to determine the statistical significance of the MIs associated with the intercepts. If the largest MI associated with the intercepts is statistically significant, it is concluded that the corresponding item exhibits uniform DIF across groups. As before, a new model is fitted in which the corresponding intercept is freely estimated, and the remaining intercepts are constrained to be equal in both groups. The largest intercept MI is evaluated again to determine its statistical significance. This iterative procedure continues until the largest intercept MI is not statistically significant.
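The decision made at each step of this procedure can be sketched as follows. This is only an illustration of the rule "flag the item with the largest MI if it is significant as a 1-df chi-square"; it does not refit any model, the function names are hypothetical, and the MI values in the usage lines are invented. The per-test alpha is passed in as a parameter.

```python
import math


def mi_p_value(mi):
    """Upper-tail p value of a modification index treated as chi-square with 1 df.

    For 1 df, P(X >= x) = erfc(sqrt(x / 2)).
    """
    if mi < 0:
        raise ValueError("a modification index cannot be negative")
    return math.erfc(math.sqrt(mi / 2.0))


def flag_largest_mi(modification_indexes, alpha):
    """Return the item with the largest MI if that MI is significant, else None.

    `modification_indexes` maps an item label to the MI of its constrained
    parameter (a loading or an intercept) in the currently fitted model.
    """
    item, largest = max(modification_indexes.items(), key=lambda pair: pair[1])
    return item if mi_p_value(largest) < alpha else None


# Hypothetical MIs for the constrained loadings at two successive steps.
per_test_alpha = 0.05 / 6          # Bonferroni-type level for six MIs per step
step_1 = {"item1": 2.1, "item2": 0.4, "item3": 1.3, "item4": 17.9, "item6": 0.9}
step_2 = {"item1": 3.97, "item2": 0.6, "item3": 1.1, "item6": 0.8}

print(flag_largest_mi(step_1, per_test_alpha))  # an item is flagged; refit with it freed
print(flag_largest_mi(step_2, per_test_alpha))  # nothing flagged; stop iterating
```

In an actual analysis, each flagged parameter would be freed and the model refitted in the SEM program before the next set of MIs is inspected.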
Taking into account that each MI is evaluated multiple times, the Bonferroni correction should be used to test the significance of the reduction in chi-square at a specified alpha. In this study, because at each step a maximum of six MIs were considered, the alpha value for determining the significance of each MI was .05/6 = .008.

RESULTS

Descriptive statistics and correlations among items are displayed in Table 1. The boy sample showed an average score on the Appearance scale (24.1) that is larger than the average score obtained for the girl sample (21.1), t = 8.2, p < .01.

TABLE 1
Descriptive Statistics and Correlations Among Items

                                                            Correlations(b)
Item  M(a)       SD(a)     Skewness(a)   Kurtosis(a)    1     2     3     4     5     6
1     3.82/3.31  .35/1.34  -0.22/-0.14   -0.50/-0.86    1    .48   .59   .53   .61   .33
2     4.19/3.86  .22/1.27  -0.46/-0.45   -0.13/-0.34   .63    1    .56   .56   .58   .39
3     3.36/2.70  .37/1.29   0.05/0.37    -0.60/-0.57   .64   .59    1    .67   .72   .40
4     4.46/3.74  .41/1.52  -0.66/-0.29   -0.32/-0.75   .48   .48   .48    1    .71   .45
5     3.97/3.32  .39/1.41  -0.46/-0.10   -0.34/-0.77   .64   .59   .67   .63    1    .38
6     4.27/4.14  .38/1.43  -0.58/-0.41   -0.21/-0.68   .44   .38   .43   .46   .42    1

(a) The value on the left is for boys, the value on the right is for girls.
(b) The correlations below the diagonal are for boys, the correlations above the diagonal are for girls.

Nonuniform DIF

The initial model (Model 1) that we tested imposed invariance constraints across groups on all the factor loadings and all the intercepts. The fit of this model was acceptable, χ²(28, N = 847) = 103.5, p < .01; RMSEA = .079; NNFI = .97. To detect nonuniform DIF, the largest MI associated with the factor loadings was tested: It corresponded to the factor loading of Item 4 (MI = 17.9, p < .008), and because it was statistically significant, we concluded that Item 4 showed nonuniform DIF across gender groups. Then, a new model (Model 2) in which the aforementioned factor loading was freely estimated in both groups was fitted.
This model yielded an acceptable fit to data, χ²(27, N = 847) = 85.6, p < .01; RMSEA = .070; NNFI = .97. Now, the largest MI associated with the factor loadings was not statistically significant (MI = 3.97, p > .008), so we concluded that none of the remaining items showed nonuniform DIF.

Uniform DIF

Next, the MI values associated with the item intercepts were examined to detect uniform DIF. The largest MI was the one associated with the intercept of Item 6 (MI = 8.7, p < .008). Because this MI was statistically significant, we concluded that Item 6 showed uniform DIF across gender groups. Then, a new model (Model 3), in which the factor loading of Item 4 and the intercept of Item 6 were freely estimated in both groups, was fitted. This model showed an acceptable fit to data, χ²(26, N = 847) = 76.9, p < .01; RMSEA = .067; NNFI = .98, and did not yield any additional statistically significant MI. Therefore, we concluded that Items 1, 2, and 3 did not show DIF across gender groups.

To confirm that Items 4 and 6 showed nonuniform and uniform DIF, respectively, the fit of Model 3 was compared with the fit of Model 1. The difference between the chi-square statistics of each model is distributed as a chi-square with a number of degrees of freedom equal to the difference between the degrees of freedom for both models. In this case, the difference in model fit was statistically significant, χ²(2, N = 847) = 26.5, p < .01, supporting the presence of DIF on the items that were flagged by the MI values. To confirm that the remaining invariance constraints could be maintained, the fit of Model 3 was compared with the fit of the model that imposed no invariance constraints across groups. The difference in model fit was not statistically significant, χ²(8, N = 847) = 13.0, p > .05, supporting that Items 1, 2, and 3 did not show DIF.
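The two nested-model comparisons above can be checked with a short sketch of the chi-square difference test. It assumes the fit statistics read from the text, with degrees of freedom of 28, 26, and 18 for Model 1, Model 3, and the unconstrained model; the helper names are hypothetical, and the tail probability is computed from closed-form expressions for integer degrees of freedom rather than taken from any SEM program.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series in x.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)
    return total


def chi2_difference_test(chi2_restricted, df_restricted, chi2_free, df_free):
    """Return (delta chi-square, delta df, p value) for two nested models."""
    delta_chi2 = chi2_restricted - chi2_free
    delta_df = df_restricted - df_free
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Fit statistics as read from the text (the df values are assumptions, see above).
model_1 = (103.5, 28)        # all loadings and intercepts constrained
model_3 = (76.9, 26)         # Item 4 loading and Item 6 intercept freed
unconstrained = (63.9, 18)   # no invariance constraints

print(chi2_difference_test(*model_1, *model_3))
print(chi2_difference_test(*model_3, *unconstrained))
```

The first difference recomputed from the rounded statistics is 26.6 rather than the reported 26.5, presumably because of rounding in the published values; the conclusions (significant at .01 for the first comparison, nonsignificant at .05 for the second) are unaffected.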
To rule out the possibility that the item used as the reference indicator (Item 5) might be a DIF item, the iterative procedure of DIF detection was repeated using a randomly selected item (Item 3) as a reference indicator. Items 4 and 6 were flagged again as showing nonuniform and uniform DIF, respectively, and Item 5 was not.

DIF Interpretation

The item parameter estimates yielded by Model 3 are displayed in Table 2. The difference observed in the factor loading estimates shown by Item 4 ("I am ugly") in both samples was in the expected direction. The factor loading estimate for the girls sample (.86) was higher than the factor loading estimate for the boys sample (.64). However, the difference observed in the intercepts shown by Item 6 ("Nobody thinks that I'm good looking") in both samples was in the unexpected direction. The intercept estimate for the girls sample (4.53) was higher than the intercept estimate for the boys sample (4.27).

To assess the practical significance of the DIF detected, an additional analysis was carried out at the scale level. We ascertained the practical implications of retaining the DIF items in the Appearance scale. The mean score on the Appearance scale, with and without removing Items 4 and 6, was computed for each gender group and compared across groups using the standardized mean difference (d; Chan, 2000). With Items 4 and 6 included, the means for the two groups differed by .53 deviation units (d = .53). With Items 4 and 6 excluded from the scale, the standardized mean difference equaled .54. The difference on d provides an index of the practical significance of the DIF detected. In this study, the d difference equaled .01. This value suggests that the practical implication of the DIF detected at the scale level is trivial (Chan, 2000).
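The scale-level comparison just described can be sketched as follows. The text does not give the exact formula used for d, so this sketch assumes the common pooled-standard-deviation version; the response data are invented for illustration and the function names are hypothetical.

```python
import math
from statistics import mean, variance


def scale_scores(responses, items):
    """Sum the selected items (0-based positions) for each respondent."""
    return [sum(person[i] for i in items) for person in responses]


def standardized_mean_difference(group_a, group_b):
    """Mean difference divided by the pooled standard deviation (assumed formula)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def dif_practical_impact(group_a, group_b, all_items, dif_items):
    """Return d with all items, d without the DIF items, and their absolute difference."""
    kept = [i for i in all_items if i not in dif_items]
    d_full = standardized_mean_difference(
        scale_scores(group_a, all_items), scale_scores(group_b, all_items))
    d_reduced = standardized_mean_difference(
        scale_scores(group_a, kept), scale_scores(group_b, kept))
    return d_full, d_reduced, abs(d_full - d_reduced)


# Invented responses to six items on a 1-6 scale (not data from this study).
boys = [[5, 4, 4, 5, 4, 5], [4, 5, 3, 5, 4, 4], [3, 4, 3, 4, 3, 4], [5, 5, 4, 6, 5, 5]]
girls = [[4, 4, 3, 4, 3, 4], [3, 4, 2, 3, 3, 5], [4, 3, 3, 4, 3, 4], [2, 3, 2, 3, 2, 4]]

# Items 4 and 6 occupy positions 3 and 5 when counting from zero.
d_full, d_reduced, d_change = dif_practical_impact(boys, girls, range(6), {3, 5})
print(d_full, d_reduced, d_change)
```

A small change in d after dropping the flagged items would indicate, as in the study, that the DIF has little practical effect on the observed group comparison.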
TABLE 2
Item Parameter Estimates Provided by Model 3

Item   Factor Loading(a)   Intercept
1      0.74                3.84
2      0.70                4.26
3      0.81                3.33
4      0.64/0.86(b)        4.44
5      0.85                3.97
6      0.52                4.27/4.53(b)

(a) Common metric completely standardized solution.
(b) The value on the left is for boys, the value on the right is for girls.

Differences in Latent Means

Finally, the estimates obtained for the latent means under Model 3 revealed statistically significant differences between groups. As stated earlier, for identification purposes the latent mean for boys was fixed to zero. The estimated latent mean for girls was -0.65, with a standard error of 0.09. Therefore, the latent means for the two groups differed by practically two thirds of a standard deviation unit. Thus, considering that there existed real differences in the underlying latent variable measured by the Appearance scale, we concluded that the statistically significant difference obtained between the observed average scores for the two groups reflected impact.

DISCUSSION

The aims of this study were to investigate whether the six items of the PAS (Marsh et al., 1994) show DIF across gender groups of adolescents, and to show how this can be done using MG-CFA-MACS. Recent research has consistently shown that groups of adolescent girls obtain lower average scores than groups of adolescent boys on self-perceived physical appearance measures (Marsh et al., 1997; Marsh et al., 1994). DIF analysis must be carried out before researchers can conclude that these observed gender differences reflect valid differences in the latent trait. Only when items do not show DIF, or the amount of DIF detected is practically trivial, does comparison of gender groups' observed average scores make sense.
According to our first tentative hypothesis, we expected that the analyzed physical appearance items would show uniform DIF (i.e., differences in the item intercepts) across gender groups, so that the item intercepts would be lower for girls than for boys. This hypothesis was not supported. Only one of the six items showed uniform DIF, but it was in the unexpected direction. According to our second tentative hypothesis, we expected that the items on the Marsh et al. (1994) PAS would show nonuniform DIF (i.e., differences in the item factor loadings) across gender groups, so that the factor loading estimates would be larger for girls than for boys. This hypothesis was supported for only one of the six items analyzed. Interestingly, the two items that showed DIF were reversed items. However, the fact that those items showed different types of DIF makes it difficult, for the moment, to interpret this finding and to formulate plausible post hoc explanations. Future research should address this issue.

From a practical point of view, the DIF detected was trivial. The standardized mean difference (d) between the observed average scores for boys and girls was practically the same regardless of whether it was computed using the six appearance items or only the four items with no DIF (d values of .54 and .53, respectively). Thus, it can be concluded that the differences between girls' and boys' average scores on the Marsh et al. (1994) PAS reflect valid differences in the latent trait. In relation to this, in this study we found gender differences that are congruent with the findings reported in the literature (Marsh et al., 1997; Marsh et al., 1994): Girls showed an average score on the Appearance scale that is significantly lower than the average score obtained for boys. Besides, the latent means for the two groups significantly differed by 0.65 SD, and the girls sample showed the lower latent mean.
Because of socialization processes and the impact of mass media, girls receive more pressure than boys to be good looking. Considering that the standards of reference for both genders are high, the different pressure received may help to explain why girls score lower than boys when they compare themselves to reference models and rate themselves on self-perceived physical appearance.

We also wanted to show how MG-CFA-MACS can be used for DIF detection. In comparison with other methods, such as those based on item response theory (IRT), the method used here has a number of advantages (Chan, 2000; Reise et al., 1993). First, programs performing MG-CFA-MACS, such as LISREL, EQS, Mplus and AMOS (i.e., structural equation modeling [SEM] programs), provide different indexes of practical fit to data that are very useful when the sample size is large and the models include a large number of indicators. IRT programs (e.g., MULTILOG, PARSCALE) only provide the likelihood ratio chi-square test as a measure of model fit, and this test is very sensitive to sample size. Second, when a model imposing invariance constraints on an item parameter cannot be maintained, modification indexes provided by SEM software are very useful for detecting which particular items have parameters that are not invariant. This facility allows researchers to specify models assuming partial invariance on the item parameters involved. IRT programs do not provide modification indexes or analogues. Third, SEM programs allow researchers to work with multidimensional questionnaires, whereas IRT programs, such as MULTILOG and PARSCALE, are suitable for unidimensional questionnaires. Multidimensional models can be operationalized using SEM programs following the strategy proposed by Little (1997). This strategy allows researchers to test hypotheses that refer to the invariance of factor parameters (correlations, variances, and means), hypotheses that are relevant in cross-cultural research.
Fourth, SEM methods offer researchers different alternatives for testing the hypothesis that a grouping variable (e.g., gender) is producing DIF: a group comparison strategy (see Breithaupt & Zumbo, 2002) and a strategy based on including the grouping variable in the structural model (see Oort, 1992).
However, this method also has some limitations. The MG-CFA-MACS model is a model for continuous variables. In our study and in previous applications of this model (e.g., Chan, 2000), ordinal Likert-type items were analyzed under the assumption that responses to graded polytomous items approximate a continuous scale. In relation to this, a recent simulation study has shown that MG-CFA-MACS detects both uniform and nonuniform DIF in graded polytomous items quite well when the amount of DIF is medium or large in a sample of 800 individuals, while keeping the proportion of false positive detections (Type I error) close to, or lower than, nominal alpha values (Hernandez & Gonzalez-Roma, 2003). Although these results are promising, more simulation research is needed regarding the use of MG-CFA-MACS for testing item parameter invariance with graded polytomous items.
REFERENCES
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Boomsma, A. (1983). On the robustness of LISREL (maximum likelihood estimation) against small sample size and non-normality. Unpublished doctoral dissertation, University of Groningen, Groningen, The Netherlands.
Breithaupt, K., & Zumbo, B. D. (2002). Sample invariance of the structural equation model and the item response model: A case study. Structural Equation Modeling, 9, 390-412.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.
Browne, M. W., & Du Toit, S. H. C. (1992). Automated fitting of nonstandard models. Multivariate Behavioral Research, 27, 269-300.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
Chan, D. (2000). Detection of differential item functioning on the Kirton Adaptation-Innovation Inventory using multiple-group mean and covariance structure analyses. Multivariate Behavioral Research, 35, 169-199.
Harlow, L. L. (1985). Behavior of some elliptical estimators with nonnormal data in a covariance structure framework: A Monte Carlo study. Unpublished doctoral dissertation, University of California, Los Angeles.
Hernandez, A., & Gonzalez-Roma, V. (2003). Evaluating the multiple-group mean and covariance analysis model for the detection of differential item functioning in polytomous ordered items. Psicothema, 15, 322-327.
Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Joreskog, K. G., & Sorbom, D. (1993). LISREL 8: User's reference guide. Mooresville, IN: Scientific Software.
Knauss, L. K. (2001). Ethical issues in psychological assessment in school settings. Journal of Personality Assessment, 77, 231-241.
La Du, T. J., & Tanaka, J. S. (1995). Incremental fit index changes for nested structural equation models. Multivariate Behavioral Research, 30, 289-316.
Lanning, K. (1991). Consistency, scalability and personality measurement. New York: Springer-Verlag.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53-76.
Marsh, H. W. (1994). Confirmatory factor analysis models of factorial invariance: A multifaceted approach. Structural Equation Modeling, 1, 5-34.
Marsh, H. W., Hey, J., Roche, L., & Perry, C. (1997). Structure of physical self-concept: Elite athletes and physical education students. Journal of Educational Psychology, 89, 369-380.
Marsh, H. W., & Hocevar, D. (1985). The application of confirmatory factor analysis to the study of self-concept: First and higher order factor structures and their invariance across age groups. Psychological Bulletin, 97, 562-582.
Marsh, H. W., Richards, G. E., Johnson, S., Roche, L., & Tremayne, P. (1994). Physical Self-Description Questionnaire: Psychometric properties and a multitrait-multimethod analysis of relations to existing instruments. Journal of Sport and Exercise Psychology, 16, 270-305.
McArdle, J. J., Johnson, R. C., Hishinuma, E. S., Miyamoto, R. H., & Andrade, N. N. (2001). Structural equation modeling of group differences in CES-D ratings of native Hawaiian and non-Hawaiian high school students. Journal of Adolescent Research, 16, 108-149.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525-543.
Muthen, B., & Kaplan, D. (1985). A comparison of some methodologies for the factor analysis of non-normal Likert variables. British Journal of Mathematical and Statistical Psychology, 38, 171-189.
Oliveira, M. A. (2000). The effect of media's objectification of beauty on children's body esteem. Dissertation Abstracts International: Section B: The Sciences and Engineering, 61, 221 A.
Olsson, U. (1979). On the robustness of factor analysis against crude classification of the observations. Multivariate Behavioral Research, 14, 485-500.
Oort, F. J. (1992). Using restricted factor analysis to detect item bias. Methodika, 6, 150-166.
Oort, F. J. (1998). Simulation study of item bias detection with restricted factor analysis. Structural Equation Modeling, 5, 107-124.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Roskam, E. E. (1985). Current issues in item response theory. In E. E. Roskam (Ed.), Measurement and personality assessment (pp. 3-20). Amsterdam: North Holland.
Shealy, R., & Stout, W. F. (1993a). An item response theory model for test bias. In P. W. Holland & H. Wainer (Eds.), Differential item functioning: Theory and practice (pp. 197-239). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Shealy, R., & Stout, W. F. (1993b). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.
Sorbom, D. (1974). A general method for studying differences in factor means and factor structures between groups. British Journal of Mathematical and Statistical Psychology, 27, 229-239.
Sorbom, D. (1982). Structural equation models with structured means. In K. G. Joreskog & H. Wold (Eds.), Systems under indirect observation (pp. 183-195). Amsterdam: North Holland.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.
Tanaka, J. S. (1993). Multifaceted conceptions of fit in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 10-39). Newbury Park, CA: Sage.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.
Zumbo, B. D., Pope, G. A., Watson, J. E., & Hubley, A. M. (1997). An empirical test of Roskam's conjecture about the interpretation of an ICC parameter in personality inventories. Educational and Psychological Measurement, 57, 963-969.
APPENDIX
PSDQ APPEARANCE SCALE ITEMS
1. I am attractive for my age
2. I have a nice looking face
3. I'm better looking than most of my friends
4. I am ugly
5. I am good looking
6. Nobody thinks that I'm good looking
Note. From "Physical Self-Description Questionnaire: Psychometric properties and a multitrait-multimethod analysis of relations to existing instruments," by H. W. Marsh, G. E. Richards, S. Johnson, L. Roche, & P. Tremayne, 1994, Journal of Sport and Exercise Psychology, 16, pp. 270-305. Reprinted with permission of the authors.