A Method for Estimating the
Correlations Between Observed and
IRT Latent Variables or Between
Pairs of IRT Latent Variables
Alan Nicewander
Pacific Metrics
Presented at a conference to honor
Dr. Michael W. Browne of the Ohio State University,
September 9-10, 2010
• Using the factor analytic version of item
response theory (IRT) models,
– estimates of the correlations between the latent
variables measured by test items are derived;
– estimates of the correlations between the
latent variables measured by test items and
external, observed variables are also derived.
Brief Derivations of the Correlations
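• The normal ogive model for multiple-choice,
dichotomous items may be written as
$$P(u_i = 1 \mid \theta) = c_i + (1 - c_i)\int_{-\infty}^{a_i(\theta - b_i)} \phi(t)\,dt \qquad (1)$$
• where ui is the scored (1, 0) item response, θ is the latent proficiency variable, ai is
the item slope parameter, bi is the item
location parameter, ci is a guessing parameter,
and φ(t) is the standard normal density function.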
• Another useful version of this model is the so-called factor analytic representation: Let Yi be
a latent response variable that is a linear
function of θ plus error,
$$Y_i = \lambda_i\theta + \varepsilon_i$$
• where λi may be considered as a factor loading
and εi is an error variable. It is further
assumed that Yi and θ are normally distributed
with zero means and unit variances, and that
εi is uncorrelated with θ, so that Var(εi) = 1 − λi².
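• Let γi be a response threshold, defined so that
the item is answered correctly (apart from guessing) if Yi > γi; then (1)
may be rewritten as
$$P(u_i = 1 \mid \theta) = c_i + (1 - c_i)\,P(Y_i > \gamma_i \mid \theta)$$
• A graphical representation of this equation is
given on the following slide.
• Then, if λi and γi are rescaled as
$$a_i = \frac{\lambda_i}{\sqrt{1 - \lambda_i^{2}}}, \qquad b_i = \frac{\gamma_i}{\lambda_i},$$
• the rewritten model again takes the form of (1),
$$P(u_i = 1 \mid \theta) = c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)] = c_i + (1 - c_i)\int_{-\infty}^{a_i(\theta - b_i)} \phi(t)\,dt$$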
A Graph Showing Yi, the Latent Response Variable,
Mapped into (1, 0) Using the Response Threshold, γi
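The equivalence between the IRT and factor analytic forms of the model can be checked numerically. The following is a minimal sketch, not the author's code: the function names and the parameter values (λ = .6, γ = .3, c = .2) are hypothetical, and it assumes the usual rescaling a = λ/√(1 − λ²), b = γ/λ with error variance 1 − λ².

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_correct_irt(theta: float, a: float, b: float, c: float) -> float:
    """Normal ogive model with guessing: c + (1 - c) * Phi(a * (theta - b))."""
    return c + (1.0 - c) * std_normal_cdf(a * (theta - b))

def p_correct_factor(theta: float, lam: float, gamma: float, c: float) -> float:
    """Same probability from Y = lam * theta + error, with Var(error) = 1 - lam**2
    and a correct response (apart from guessing) whenever Y > gamma."""
    resid_sd = math.sqrt(1.0 - lam ** 2)
    return c + (1.0 - c) * std_normal_cdf((lam * theta - gamma) / resid_sd)

def to_irt(lam: float, gamma: float) -> tuple[float, float]:
    """Rescale a loading and threshold into an IRT slope and location (assumed form)."""
    return lam / math.sqrt(1.0 - lam ** 2), gamma / lam

# Hypothetical item: loading .6, threshold .3, guessing .2
lam, gamma, c = 0.6, 0.3, 0.2
a, b = to_irt(lam, gamma)

# The two parameterizations should agree at every proficiency level
max_gap = max(
    abs(p_correct_irt(t, a, b, c) - p_correct_factor(t, lam, gamma, c))
    for t in (-2.0, -0.5, 0.0, 1.5, 3.0)
)
print(f"a = {a:.3f}, b = {b:.3f}, largest discrepancy = {max_gap:.2e}")
```

Under these assumptions the hypothetical item has slope .75 and location .5, and the two functions agree to rounding error.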
Estimating correlations between the latent variables
measured by dichotomous items
• Suppose one wants to determine the correlation between the
latent variables, θi and θj, that underlie the observed item
responses, ui and uj . Let Yi and Yj be the latent response
variables for the two MC items.
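• Assuming the errors εi and εj are uncorrelated with each other and with the θ's,
$$\rho(Y_i, Y_j) = \lambda_i\lambda_j\,\rho(\theta_i, \theta_j)$$
• or,
$$\rho(\theta_i, \theta_j) = \frac{\rho(Y_i, Y_j)}{\lambda_i\lambda_j}$$
• The resulting equation does not seem useful in that it involves two latent
correlations, ρ(θi, θj) and ρ(Yi, Yj).
However, from the definition of the tetrachoric
correlation it follows that ρ(Yi, Yj) may be replaced by an observable quantity,
$$\rho(\theta_i, \theta_j) = \frac{\rho^{*}_{tet}(u_i, u_j)}{\lambda_i\lambda_j}$$
where ρ*tet(ui, uj) is the guessing-corrected tetrachoric
correlation coefficient.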
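Since each latent response variable loads on its own θ with loading λ, the latent correlation is the guessing-corrected tetrachoric divided by the product of the two loadings. The same result can be written directly in terms of IRT slopes. This is a sketch that assumes the usual conversion λ = a/√(1 + a²) between a normal ogive slope and a factor loading; the numbers in the comment are hypothetical.

```latex
\rho(\theta_i,\theta_j)
  = \frac{\rho^{*}_{tet}(u_i,u_j)}{\lambda_i\lambda_j}
  = \rho^{*}_{tet}(u_i,u_j)\,
    \frac{\sqrt{(1+a_i^{2})(1+a_j^{2})}}{a_i a_j},
\qquad
\lambda_i = \frac{a_i}{\sqrt{1+a_i^{2}}}.

% Hypothetical example: a_i = a_j = 1 gives \lambda_i\lambda_j = 0.5,
% so \rho^{*}_{tet}(u_i,u_j) = 0.30 implies \rho(\theta_i,\theta_j) = 0.30/0.5 = 0.60.
```

The correction factor grows quickly as the slopes shrink, so weakly discriminating items inflate both the estimate and its sampling error.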
Estimating the correlations between the latent
variables for dichotomous items and external,
observed variables
• It is fairly easy to extend the logic above in order to derive a
means for computing correlations between observed
variables and IRT latent variables.
• First, define Zk as an observed variable scaled to have zero
mean and unit variance. Then, assuming εi is uncorrelated with Zk, the correlation between Zk and
the latent response variable, Yi, assumed to underlie an MC
item, ui, is given by
$$\rho(Z_k, Y_i) = \lambda_i\,\rho(Z_k, \theta_i)$$
• Solving the previous equation for ρ(Zk, θi),
$$\rho(Z_k, \theta_i) = \frac{\rho(Z_k, Y_i)}{\lambda_i}$$
we once again have an equation with two latent correlation
coefficients. However, following from the definition of a
biserial correlation, we may substitute for ρ(Zk, Yi) and obtain a solution
involving only observables,
$$\rho(Z_k, \theta_i) = \frac{\rho^{*}_{bis}(Z_k, u_i)}{\lambda_i}$$
where ρ*bis(Zk, ui) is the guessing-corrected biserial correlation
between the observed variables, Zk and ui.
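As an illustration of the observed-by-latent case, the sketch below divides a biserial correlation by the item's factor loading. It is a hypothetical example, not the author's implementation: the function names and input values are made up, the biserial is assumed to be guessing-corrected already, and the loading is obtained from the slope as λ = a/√(1 + a²).

```python
import math

def loading_from_slope(a: float) -> float:
    """Factor loading implied by a normal ogive slope (assumed conversion)."""
    return a / math.sqrt(1.0 + a * a)

def latent_observed_corr(r_bis: float, a: float) -> float:
    """Estimate rho(Z, theta) as the (guessing-corrected) biserial divided by lambda."""
    return r_bis / loading_from_slope(a)

# Hypothetical inputs: slope .75 (loading .6) and a corrected biserial of .30
estimate = latent_observed_corr(r_bis=0.30, a=0.75)
print(f"loading = {loading_from_slope(0.75):.2f}, rho(Z, theta) = {estimate:.2f}")
```

Because the estimate is a ratio, nothing constrains it to the interval [−1, 1] in small samples; a caller may want to flag such values rather than truncate them silently.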
Extending the latent-variable correlations to
polytomous test items.
• In order to simplify exposition, only polytomous
items having three categories are modeled.
• Generalization of the methods described below to
items with more than three categories is very
straightforward.
• Let xij be the score for item i scored in category j (j =
1, 2, …, m). Under commonly-used scoring rules, a
three-category item would be scored 0, 1, or 2.
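• As was done above, for the case of MC items having
binary scores, let Y*i be the latent response variable
underlying the polytomous item xij:
$$Y^{*}_i = \lambda_i\theta_i + \varepsilon_i$$
• where λi is a factor loading.
• Let γi1 and γi2 be two response thresholds,
defined so that:
$$x_i = \begin{cases} 0 & \text{if } Y^{*}_i \le \gamma_{i1} \\ 1 & \text{if } \gamma_{i1} < Y^{*}_i \le \gamma_{i2} \\ 2 & \text{if } Y^{*}_i > \gamma_{i2} \end{cases}$$
• λi and these two thresholds may be rescaled into
IRT slope and location parameters, viz.
$$a_i = \frac{\lambda_i}{\sqrt{1 - \lambda_i^{2}}}, \qquad b_{i1} = \frac{\gamma_{i1}}{\lambda_i}, \qquad b_{i2} = \frac{\gamma_{i2}}{\lambda_i}$$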
• Fitting the previous model to data may be
done with the normal ogive version of
Samejima’s (1969) Graded Response model, or
the more commonly used logistic
approximation thereof.
• However, the correlations we are seeking here
do not depend on the locations, but only on the
estimates of the $a_i = \lambda_i / (1 - \lambda_i^{2})^{1/2}$, the item slope
parameters.
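• First, define the correlation between the two latent
response variables that underlie two polytomous
items, xij and xkj, using previous logic,
$$\rho(Y^{*}_i, Y^{*}_k) = \lambda_i\lambda_k\,\rho(\theta_i, \theta_k)$$
• Then, solve for ρ(θi, θk) and substitute for λi and λk,
$$\rho(\theta_i, \theta_k) = \frac{\rho_{poly}(x_i, x_k)}{\lambda_i\lambda_k} = \rho_{poly}(x_i, x_k)\,\frac{\sqrt{(1 + a_i^{2})(1 + a_k^{2})}}{a_i a_k}$$
where ρpoly(xi, xk) is the polychoric correlation between the two item scores.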
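The threshold model for a three-category item can be sketched in a few lines. The code below is illustrative only: the function names are hypothetical, the mapping γ = λb assumes the rescaling b = γ/λ with λ = a/√(1 + a²), and the example values (a = 1, locations −1 and 1) are those listed for item x1 in Table 1.

```python
import math

def std_normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def thresholds_from_locations(a: float, b1: float, b2: float) -> tuple[float, float, float]:
    """Convert an IRT slope and two locations into a loading and two thresholds."""
    lam = a / math.sqrt(1.0 + a * a)
    return lam, lam * b1, lam * b2

def score_category(y_star: float, gamma1: float, gamma2: float) -> int:
    """Map a latent response onto the 0, 1, 2 score scale."""
    if y_star <= gamma1:
        return 0
    if y_star <= gamma2:
        return 1
    return 2

# Item x1 of Table 1: a = 1, locations -1 and 1
lam, gamma1, gamma2 = thresholds_from_locations(1.0, -1.0, 1.0)

# Because Y* is standard normal, the marginal category proportions follow directly
p0 = std_normal_cdf(gamma1)
p2 = 1.0 - std_normal_cdf(gamma2)
p1 = 1.0 - p0 - p2
print(f"lambda = {lam:.3f}, thresholds = ({gamma1:.3f}, {gamma2:.3f})")
print(f"expected proportions in categories 0, 1, 2: {p0:.3f}, {p1:.3f}, {p2:.3f}")
```

Only the loading enters the latent-variable correlations; the thresholds determine how responses spread across the categories.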
Computing the correlation between an external
variable and the latent variable measured by a
polytomous item.
• From earlier derivations, it follows that the
correlation between an external variable Zk and the
latent variable measured by a polytomous item, xij, is
$$\rho(Z_k, \theta_i) = \frac{\rho_{poly\_s}(Z_k, x_{ij})}{\lambda_i}$$
• where ρpoly_s(Zk, xij) is the polyserial correlation
between the external variable and the score on the
polytomous item.
Some Numerical Examples
Computing the correlations between the latent variables
measured by three polytomous items, each having three
categories.
• Ten replications of 300 observations were simulated using the
values of a, b1 and b2 given below in Table 1, and with true
values of ρ(θ1,θ2) = ρ(θ2,θ3) = .6, and ρ(θ1,θ3) = 1.
Table 1. Summary averages (std. error) and parameters for simulations of three
polytomous items having three categories.*

Items   x1           x2           x3           locations   a-values   replications
x1      1            .198 (.04)   .295 (.04)   -1, 1       1          10
x2      .276 (.04)   1            .155 (.04)   -.5, 1.5    .8         10
x3      .394 (.04)   .193 (.05)   1            -.2, 1.5    .6         10

* Mean phi-correlations above the diagonal, mean polychoric correlations below the diagonal.
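To see what the Table 1 averages imply, the mean polychoric correlations can be divided by the product of the loadings. This is a rough sketch rather than a reproduction of the author's computation: it uses the generating a-values from Table 1 in place of estimated slopes, assumes λ = a/√(1 + a²), and the variable names are hypothetical.

```python
import math

# Generating slopes and mean polychoric correlations, taken from Table 1
a_values = {"x1": 1.0, "x2": 0.8, "x3": 0.6}
mean_polychoric = {("x1", "x2"): 0.276, ("x1", "x3"): 0.394, ("x2", "x3"): 0.193}

# True latent correlations stated for the simulation
true_rho = {("x1", "x2"): 0.6, ("x1", "x3"): 1.0, ("x2", "x3"): 0.6}

def loading(a: float) -> float:
    """Factor loading implied by a normal ogive slope (assumed conversion)."""
    return a / math.sqrt(1.0 + a * a)

# Disattenuate each polychoric by the product of the two item loadings
latent_estimates = {
    pair: r / (loading(a_values[pair[0]]) * loading(a_values[pair[1]]))
    for pair, r in mean_polychoric.items()
}

for pair, estimate in latent_estimates.items():
    print(f"{pair[0]}, {pair[1]}: estimate = {estimate:.3f}, true = {true_rho[pair]:.1f}")
```

Under these assumptions the implied latent correlations are roughly .62, 1.08, and .60 against true values of .6, 1, and .6. The middle value exceeds 1, which is possible because the estimator is a ratio and is not bounded.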
Estimating slopes in latent variable regression problems
• Consider a multivariate multiple regression that involves
regressing a multidimensional vector of latent variables, θ, onto
a multidimensional vector of observed scores, z.
• What are the slopes of the latent variables on the observed Z’s
(moderated by observed scores on a test designed to measure
the latent dimensions)?
• The multivariate, multiple regression model may be expressed
as
θ = Γz + ε,
where Γ contains the regression slopes, and ε the residuals.
• A Bayesian solution to this regression problem was originally
proposed by Mislevy (1985), and this solution was
implemented using the EM-algorithm-based program called
C-Group (ETS, 1993).
• Notice that this regression system could be solved using
ordinary least squares if the dispersion matrices Σθθ = E(θθ')
and Σθz = E(θz') were known.
• With a little reflection, the estimates derived here could be
used to compute an estimate of Σθz (and Σθθ). These in turn
could be used to estimate
$$\Gamma = \Sigma_{\theta z}\,\Sigma_{zz}^{-1}.$$
• In order to compare C-Group and OLS solutions for the latent-variable regression problem, a data set consisting of five θ's and six
observed Z's was simulated, each with n = 2,000, using the following
population dispersion matrices:
Σθθ = a (5x5) constant-correlation matrix with ρ = .9.
Σzz = a (6x6) identity matrix.
Σθz = a (5x6) matrix with identical rows = (.4, .2, 0, .4, .2, 0).
• It is easy to see that, given these dispersion matrices, the population
value of Γ is equal to Σθz. Using well-known procedures, normal
random numbers were generated for 2,000 cases and transformed to
have the above dispersion matrices in expectation (Browne, 1969).
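The population slopes implied by these dispersion matrices can be verified directly. The sketch below is an assumption-laden illustration, not the C-Group or OLS-LVC code: it takes the slope matrix for θ = Γz + ε to be Γ = Σθz Σzz⁻¹ (5x6), uses small hand-written matrix helpers to stay dependency-free, and all function and variable names are hypothetical.

```python
def mat_mul(A, B):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Matrix transpose."""
    return [list(col) for col in zip(*A)]

def invert(M):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

n_theta, n_z = 5, 6
# Population dispersion matrices as stated for the simulation
sigma_tt = [[1.0 if i == j else 0.9 for j in range(n_theta)] for i in range(n_theta)]
sigma_zz = [[1.0 if i == j else 0.0 for j in range(n_z)] for i in range(n_z)]
sigma_tz = [[0.4, 0.2, 0.0, 0.4, 0.2, 0.0] for _ in range(n_theta)]

# Slopes of theta on z; with an identity Sigma_zz this reduces to Sigma_theta_z
gamma = mat_mul(sigma_tz, invert(sigma_zz))

# Residual dispersion: Sigma_theta_theta - Gamma Sigma_zz Gamma'
explained = mat_mul(mat_mul(gamma, sigma_zz), transpose(gamma))
residual = [[sigma_tt[i][j] - explained[i][j] for j in range(n_theta)]
            for i in range(n_theta)]

print("first row of Gamma:", gamma[0])
print("residual variance:", round(residual[0][0], 3),
      "residual covariance:", round(residual[0][1], 3))
```

With these matrices the six Z's account for .40 of the variance of each θ, leaving a residual variance of .60 and residual covariances of .50. In practice Σθz would be replaced by the estimated latent-by-observed correlations and Σzz by the sample dispersion matrix of the Z's.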
Summary of Regression Slopes of Five Latent Variables, θi,
on Six Observed Variables, Zk

      θ1    θ2    θ3    θ4    θ5
Z1    .4    .4    .4    .4    .4
Z2    .2    .2    .2    .2    .2
Z3    0     0     0     0     0
Z4    .4    .4    .4    .4    .4
Z5    .2    .2    .2    .2    .2
Z6    0     0     0     0     0
• In addition to these simulated data, the responses to fifty
NAEP Math items were simulated using 1990 NAEP item-parameter estimates.
• These simulated data for 2,000 cases were then used to
generate estimates of the regression slopes in Γ using both
the C-Group EM solution and the OLS solution.
• The whole process was repeated ten times, and means and
standard deviations of the estimates were computed. These
summary statistics are shown in Table 2:
Θ1
Θ2
Θ3
Θ4
Θ5
True slope
.40
.40
.40
.40
.40
OLS-LVC
est.
C-GRP.est.
.43 (.04)
.40 (.05)
.39 (.03)
.40 (.02)
.38 (.05)
.43 (.03)
.38 (.03)
.42 (.03)
.39 (.03)
.47 (.04)
True slope
.20
.20
.20
.20
.20
OLS-LVC
est.
C-GRP.
est.
.21 (.04)
.21 (.04)
.21 (.04)
.21 (.02)
.21 (.04)
.22 (.02)
.21 (.03)
.22 (.04)
.21 (.03)
.24 (.04)
True slope
0
0
0
0
0
OLS-LVC
est.
C-GRP.
est
.01 (.04)
.04 (.05)
.01 (.03)
.02 (.03)
.01 (.05)
.01 (.03)
.03 (.03)
.01 (.02)
.02 (.03)
.01 (.04)
Θ1
True slope
Θ2
Θ3
Θ4
Θ5
.40
.40
.40
.40
.40
OLS-LVC
est.
C-GRP
est.
.44 (.02)
.38 (.04)
.39 (.02)
.43 (.02)
.40 (.04)
.43 (.03)
.39 (.03)
.42 (.02)
.41 (.02)
.47 (.03)
True slope
.20
.20
.20
.20
.20
OLS-LVC
est.
C-GRP.
est.
.23 (.04)
.20 (.06)
.20 (.02)
.21 (.03)
.20 (.04)
.22 (.02)
.19 (.02)
.22 (.02)
.21 (.02)
.20 (.04)
True slope
0
0
0
0
0
OLS-LVC
est.
C-GRP.
est.
.00 (.03)
.01 (.06)
-.01 (.03)
-.01 (.03)
.03 (.04)
.00 (.03)
.01 (.04)
-.01 (.03)
-.01 (.02)
.01 (.04)
Tabled estimates are the means of ten replications with n=2,000 each. Values in
parentheses are estimated standard errors.
The coefficients developed in this inquiry have a
very simple form for two basic reasons:
1. They make strong assumptions about the data, and
2. The exotic correlation coefficients on which they are
based (tetrachoric-polychoric and biserial-polyserial) do the “heavy lifting” mathematically,
because of the complex calculations that are
entailed in their computation.
It is also the case that the standard errors of these
coefficients are rather large, and they almost
certainly will require large samples for accuracy.