Estimating the correlations between the latent variables for

A Method for Estimating the Correlations Between Observed and IRT Latent Variables or Between Pairs of IRT Latent Variables

Alan Nicewander
Pacific Metrics

Presented at a conference to honor Dr. Michael W. Browne of the Ohio State University, September 9-10, 2010

• Using the factor analytic version of item response theory (IRT) models,
  – Estimates of the correlations between the latent variables measured by test items are derived.
  – Also, estimates of the correlations between the latent variables measured by test items and external, observed variables are derived.

Brief Derivations of the Correlations

• The normal ogive model for multiple-choice, dichotomous items may be written as

  $$P(u_i = 1 \mid \theta) = c_i + (1 - c_i)\int_{-\infty}^{a_i(\theta - b_i)} \varphi(t)\,dt, \qquad (1)$$

• where θ is the latent proficiency variable, ai is the item slope parameter, bi is the item location parameter, ci is a guessing parameter, and φ(t) is the normal density function.
• Another useful version of this model is the so-called factor analytic representation: Let Yi be a latent response variable that is a linear function of θ plus error,

  $$Y_i = \lambda_i \theta + \varepsilon_i,$$

• where λi may be considered as a factor loading and εi is an error variable. It is further assumed that Yi and θ are normally distributed with zero means and unit variances, and that εi is uncorrelated with θ.
• Let γi be a response threshold, defined so that if Yi > γi the item is answered correctly; then (1) may be rewritten as

  $$P(u_i = 1 \mid \theta) = c_i + (1 - c_i)\,P(Y_i > \gamma_i \mid \theta) = c_i + (1 - c_i)\int_{-\infty}^{(\lambda_i \theta - \gamma_i)/\sqrt{1 - \lambda_i^2}} \varphi(t)\,dt.$$

• A graphical representation of this equation is given on the following slide.
• Then if λi and γi are rescaled as

  $$a_i = \frac{\lambda_i}{\sqrt{1 - \lambda_i^2}}, \qquad b_i = \frac{\gamma_i}{\lambda_i},$$

• the rewritten equation reduces to the normal ogive form in (1), since (λiθ − γi)/√(1 − λi²) = ai(θ − bi).

[Figure: A Graph Showing Yi, the Latent Response Variable, Mapped into (1, 0) Using the Response Threshold, γi]

Estimating correlations between the latent variables measured by dichotomous items

• Suppose one wants to determine the correlation between the latent variables, θi and θj, that underlie the observed item responses, ui and uj. Let Yi and Yj be the latent response variables for the two MC items.
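• With the errors εi and εj uncorrelated with each other and with the θ's, the correlation between the two latent response variables is

  $$\rho(Y_i, Y_j) = \lambda_i \lambda_j\, \rho(\theta_i, \theta_j),$$

• or,

  $$\rho(\theta_i, \theta_j) = \frac{\rho(Y_i, Y_j)}{\lambda_i \lambda_j}.$$

• The resulting equation does not seem useful in that it involves two latent correlations, ρ(θi, θj) and ρ(Yi, Yj). However, from the definition of the tetrachoric correlation it follows that

  $$\rho(\theta_i, \theta_j) = \frac{\rho^{*}_{tet}(u_i, u_j)}{\lambda_i \lambda_j}, \qquad \lambda_i = \frac{a_i}{\sqrt{1 + a_i^2}},$$

  where ρ*tet(ui, uj) is the guessing-corrected tetrachoric correlation coefficient.

Estimating the correlations between the latent variables for dichotomous items and external, observed variables

• It is fairly easy to extend the logic above in order to derive a means for computing correlations between observed variables and IRT latent variables.
• First, define Zk as an observed variable scaled to have zero mean and unit variance. Then the correlation between Zk and the latent response variable, Yi, assumed to underlie an MC item, ui, is given by

  $$\rho(Z_k, Y_i) = \lambda_i\, \rho(Z_k, \theta_i), \qquad \text{so that} \qquad \rho(Z_k, \theta_i) = \frac{\rho(Z_k, Y_i)}{\lambda_i}.$$

• As before, this equation involves two latent correlation coefficients. However, following from the definition of a biserial correlation, we may substitute and obtain a solution involving only observables,

  $$\rho(Z_k, \theta_i) = \frac{\rho^{*}_{bis}(Z_k, u_i)}{\lambda_i},$$

  where ρ*bis(Zk, ui) is the guessing-corrected biserial correlation between the observed variables, Zk and ui.

Extending the latent-variable correlations to polytomous test items

• In order to simplify exposition, only polytomous items having three categories are modeled.
• Generalization of the methods described below to items with more than three categories is very straightforward.
• Let xij be the score for item i scored in category j (j = 1, 2, …, m). Under commonly used scoring rules, a three-category item would be scored 0, 1, or 2.
• As was done above for the case of MC items having binary scores, let Y*i be the latent response variable underlying the polytomous item xij:

  $$Y^{*}_i = \lambda_i \theta + \varepsilon_i,$$

• where λi is a factor loading.
• Let γi1 and γi2 be two response thresholds, defined so that:

  $$x_{ij} = \begin{cases} 0, & Y^{*}_i \le \gamma_{i1} \\ 1, & \gamma_{i1} < Y^{*}_i \le \gamma_{i2} \\ 2, & Y^{*}_i > \gamma_{i2} \end{cases}$$

  λi and these two thresholds may be rescaled into IRT slope and location parameters, viz.

  $$a_i = \frac{\lambda_i}{\sqrt{1 - \lambda_i^2}}, \qquad b_{i1} = \frac{\gamma_{i1}}{\lambda_i}, \qquad b_{i2} = \frac{\gamma_{i2}}{\lambda_i}.$$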
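The rescaling between factor-analytic and IRT parameters, and the two latent-correlation formulas for dichotomous items, can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: the function names and the input values are hypothetical, and the guessing-corrected tetrachoric and biserial correlations are taken as given inputs rather than computed. The sketch assumes λi = ai / √(1 + ai²), ρ(θi, θj) = ρ*tet(ui, uj) / (λi λj) and ρ(Zk, θi) = ρ*bis(Zk, ui) / λi, which follow from Yi = λi θ + εi with unit variances.

```python
import math


def loading_from_slope(a):
    """Factor loading implied by an IRT slope: lambda = a / sqrt(1 + a^2)."""
    return a / math.sqrt(1.0 + a * a)


def slope_from_loading(lam):
    """IRT slope implied by a factor loading: a = lambda / sqrt(1 - lambda^2)."""
    return lam / math.sqrt(1.0 - lam * lam)


def location_from_threshold(gamma, lam):
    """IRT location implied by a response threshold: b = gamma / lambda."""
    return gamma / lam


def latent_corr_between_items(r_tet, a_i, a_j):
    """Correlation between theta_i and theta_j from a (guessing-corrected)
    tetrachoric correlation and the two item slopes."""
    return r_tet / (loading_from_slope(a_i) * loading_from_slope(a_j))


def latent_corr_with_external(r_bis, a_i):
    """Correlation between an observed Z and theta_i from a
    (guessing-corrected) biserial correlation and the item slope."""
    return r_bis / loading_from_slope(a_i)


# Hypothetical inputs, chosen only to exercise the functions.
a_i, a_j = 1.0, 0.8
r_tet = 0.30   # assumed guessing-corrected tetrachoric correlation
r_bis = 0.25   # assumed guessing-corrected biserial correlation

print("lambda_i:", round(loading_from_slope(a_i), 4))
print("rho(theta_i, theta_j):", round(latent_corr_between_items(r_tet, a_i, a_j), 4))
print("rho(Z, theta_i):", round(latent_corr_with_external(r_bis, a_i), 4))
```

Because each formula divides an estimated correlation by loadings that are less than one, sampling error in the inputs is magnified, and an estimate can fall outside the interval from -1 to 1 in small samples.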
• Fitting the previous model to data may be done with the normal ogive version of Samejima's (1969) graded response model, or the more commonly used logistic approximation thereof.
• However, the correlations we are seeking here do not depend on the locations, but only on the estimates of the item slope parameters,

  $$a_i = \frac{\lambda_i}{\sqrt{1 - \lambda_i^2}}.$$

• First, define the correlation between the two latent response variables that underlie two polytomous items, xij and xkj, using previous logic,

  $$\rho(Y^{*}_i, Y^{*}_k) = \lambda_i \lambda_k\, \rho(\theta_i, \theta_k).$$

• Then, solve for ρ(θi, θk) and substitute λi = ai/√(1 + ai²) and λk = ak/√(1 + ak²):

  $$\rho(\theta_i, \theta_k) = \frac{\rho(Y^{*}_i, Y^{*}_k)}{\lambda_i \lambda_k} = \rho(Y^{*}_i, Y^{*}_k)\,\frac{\sqrt{1 + a_i^2}\,\sqrt{1 + a_k^2}}{a_i a_k},$$

  where ρ(Y*i, Y*k) is estimated by the polychoric correlation between the two items.

Computing the correlation between an external variable and the latent variable measured by a polytomous item

• From earlier derivations, it follows that the correlation between an external variable Zk and the latent variable measured by a polytomous item, xij, is

  $$\rho(Z_k, \theta_i) = \frac{\rho_{poly\_s}(Z_k, x_{ij})}{\lambda_i} = \rho_{poly\_s}(Z_k, x_{ij})\,\frac{\sqrt{1 + a_i^2}}{a_i},$$

• where ρpoly_s(Zk, xij) is the polyserial correlation between the external variable and the score on the polytomous item.

Some Numerical Examples

Computing the correlations between the latent variables measured by three polytomous items, each having three categories

• Ten replications of 300 observations were simulated using the values of a, b1 and b2 given below in Table 1, and with true values of ρ(θ1,θ2) = ρ(θ2,θ3) = .6, and ρ(θ1,θ3) = 1.

Table 1. Summary averages (std. error) and parameters for simulations of three polytomous items having three categories.*

Items   x1          x2          x3          locations   a-values   replications
x1      1           .198(.04)   .295(.04)   -1, 1       1          10
x2      .276(.04)   1           .155(.04)   -.5, 1.5    .8         10
x3      .394(.04)   .193(.05)   1           -.2, 1.5    .6         10

* Mean phi-correlations above the diagonal, mean polychoric correlations below the diagonal.

Estimating slopes in latent variable regression problems

• Consider a multivariate multiple regression that involves regressing a multidimensional vector of latent variables, θ, onto a multidimensional vector of observed scores, z.
• What are the slopes of the latent variables on the observed Z's (moderated by observed scores on a test designed to measure the latent dimensions)?
• The multivariate multiple regression model may be expressed as θ = Γz + ε, where Γ contains the regression slopes and ε the residuals.
• A Bayesian solution to this regression problem was originally proposed by Mislevy (1985), and this solution was implemented using the EM-algorithm-based program called C-Group (ETS, 1993).
• Notice that this regression system could be solved using ordinary least squares if the dispersion matrices Σθθ = E(θθ') and Σθz = E(θz') were known.
• The correlation estimates derived here could be used to compute an estimate of Σθz (and Σθθ). These in turn could be used to estimate the slopes as

  $$\Gamma = \Sigma_{\theta z}\,\Sigma_{zz}^{-1} \qquad (\text{equivalently, } \Gamma' = \Sigma_{zz}^{-1}\Sigma_{z\theta}).$$

• In order to compare C-Group and OLS solutions for the latent-variable regression problem, a data set consisting of five θ's and six observed Z's was simulated, each with n = 2,000, using the following population dispersion matrices:
  – Σθθ = a (5x5) constant-correlation matrix with ρ = .9.
  – Σzz = a (6x6) identity matrix.
  – Σθz = a (5x6) matrix with identical rows = (.4, .2, 0, .4, .2, 0).
• It is easy to see that, given these dispersion matrices, the population value of Γ is equal to Σθz. Using well-known procedures, normal random numbers were generated for 2,000 cases and transformed to have the above dispersion matrices in expectation (Browne, 1969).

Summary of Regression Slopes of Five Latent Variables, θi, on Six Observed Variables, Zk

      θ1   θ2   θ3   θ4   θ5
Z1    .4   .4   .4   .4   .4
Z2    .2   .2   .2   .2   .2
Z3    0    0    0    0    0
Z4    .4   .4   .4   .4   .4
Z5    .2   .2   .2   .2   .2
Z6    0    0    0    0    0

• In addition to these simulated data, the responses to fifty NAEP Math items were simulated using 1990 NAEP item-parameter estimates.
• These simulated data for 2,000 cases were then used to generate estimates of the regression slopes in Γ using both the C-Group EM solution and the OLS solution.
• The whole process was repeated ten times, and means and standard deviations of the estimates were computed. These summary statistics are shown in Table 2 on the following page:

                          Θ1    Θ2    Θ3    Θ4    Θ5
True slope                .40   .40   .40   .40   .40
OLS-LVC est. C-GRP est.   .43 (.04) .40 (.05) .39 (.03) .40 (.02) .38 (.05) .43 (.03) .38 (.03) .42 (.03) .39 (.03) .47 (.04)
True slope                .20   .20   .20   .20   .20
OLS-LVC est. C-GRP est.   .21 (.04) .21 (.04) .21 (.04) .21 (.02) .21 (.04) .22 (.02) .21 (.03) .22 (.04) .21 (.03) .24 (.04)
True slope                0     0     0     0     0
OLS-LVC est. C-GRP est.   .01 (.04) .04 (.05) .01 (.03) .02 (.03) .01 (.05) .01 (.03) .03 (.03) .01 (.02) .02 (.03) .01 (.04)

                          Θ1    Θ2    Θ3    Θ4    Θ5
True slope                .40   .40   .40   .40   .40
OLS-LVC est. C-GRP est.   .44 (.02) .38 (.04) .39 (.02) .43 (.02) .40 (.04) .43 (.03) .39 (.03) .42 (.02) .41 (.02) .47 (.03)
True slope                .20   .20   .20   .20   .20
OLS-LVC est. C-GRP est.   .23 (.04) .20 (.06) .20 (.02) .21 (.03) .20 (.04) .22 (.02) .19 (.02) .22 (.02) .21 (.02) .20 (.04)
True slope                0     0     0     0     0
OLS-LVC est. C-GRP est.   .00 (.03) .01 (.06) -.01 (.03) -.01 (.03) .03 (.04) .00 (.03) .01 (.04) -.01 (.03) -.01 (.02) .01 (.04)

Tabled estimates are the means of ten replications with n = 2,000 each. Values in parentheses are estimated standard errors.

The coefficients developed in this inquiry have a very simple form for two basic reasons:
1. They make strong assumptions about the data, and
2. The exotic correlation coefficients on which they are based (tetrachoric-polychoric and biserial-polyserial) do the "heavy lifting" mathematically, because of the complex calculations that their computation entails.

It is also the case that the standard errors of these coefficients are rather large, and they almost certainly will require large samples for accuracy.
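As a rough numerical check on two of the computations described above, the sketch below (i) applies the latent-correlation formula to the mean polychoric correlations and a-values reported in Table 1, and (ii) computes the OLS slopes from the stated population dispersion matrices. It is an illustration only, not the code used for the reported simulations: it assumes NumPy, the function names are hypothetical, and it assumes λ = a / √(1 + a²) and Γ = Σθz Σzz⁻¹ (the form whose dimensions match θ = Γz + ε).

```python
import numpy as np


def loadings_from_slopes(a):
    """Factor loadings implied by IRT slopes: lambda = a / sqrt(1 + a^2)."""
    a = np.asarray(a, dtype=float)
    return a / np.sqrt(1.0 + a ** 2)


def latent_corr_matrix(r_poly, a):
    """Divide each polychoric correlation by the product of the two
    item loadings; the diagonal is reset to 1."""
    lam = loadings_from_slopes(a)
    r_latent = np.asarray(r_poly, dtype=float) / np.outer(lam, lam)
    np.fill_diagonal(r_latent, 1.0)
    return r_latent


def ols_slopes(sigma_theta_z, sigma_zz):
    """Gamma = Sigma_theta_z @ inv(Sigma_zz), computed with a linear solve
    (Sigma_zz is symmetric, so solving against the transpose is valid)."""
    return np.linalg.solve(sigma_zz, sigma_theta_z.T).T


# (i) Mean polychoric correlations (below the diagonal in Table 1),
#     written here as a full symmetric matrix, and the a-values.
r_poly = np.array([[1.0,   0.276, 0.394],
                   [0.276, 1.0,   0.193],
                   [0.394, 0.193, 1.0]])
a_values = [1.0, 0.8, 0.6]
r_latent = latent_corr_matrix(r_poly, a_values)
print(np.round(r_latent, 2))

# (ii) Population dispersion matrices from the regression example.
sigma_zz = np.eye(6)
sigma_theta_z = np.tile([0.4, 0.2, 0.0, 0.4, 0.2, 0.0], (5, 1))
gamma = ols_slopes(sigma_theta_z, sigma_zz)
print(gamma)
```

With the Table 1 inputs, the estimated latent correlations come out near .62, .60 and 1.08 against true values of .6, .6 and 1. An estimate above 1 is possible because the formula divides a sample correlation by a product of loadings. With Σzz equal to the identity, the computed Γ equals Σθz, as stated in the text.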
