Predictive Inference Using Latent Variables with Covariates∗
Lynne Steuerle Schofield
Department of Mathematics and Statistics, Swarthmore College
Brian Junker
Department of Statistics, Carnegie Mellon University
Lowell J. Taylor
Heinz College, Carnegie Mellon University
Dan A. Black
Harris School, University of Chicago
January 8, 2014
Abstract
Plausible Values (PVs) are a standard multiple imputation tool for analysis of large
education survey data that measure latent proficiency variables. When latent proficiency is the dependent variable, we reconsider the standard institutionally generated
PV methodology and find it applies with greater generality than shown previously.
When latent proficiency is an independent variable, we show that the standard institutional PV methodology produces biased inference because the institutional conditioning
model places restrictions on the form of the secondary analysts’ model. We offer an
alternative approach, based on the mixed effects structural equations (MESE) model
of Schofield (2008), that avoids these biases.
∗Requests for reprints should be made to Lynne Steuerle Schofield, 500 College Avenue, Swarthmore,
PA 19081. The authors would like to thank the anonymous reviewers for their valuable comments and
suggestions to improve the quality of the paper. The work was supported by Award Number R21HD069778
from the Eunice Kennedy Shriver National Institute of Child Health and Human Development. The content
is solely the responsibility of the authors and does not necessarily represent the official views of the Eunice
Kennedy Shriver National Institute of Child Health and Human Development or the National Institutes of
Health.
1 Introduction
Latent variable models for measuring cognitive constructs (e.g., proficiency in a particular
domain of mathematics) are ubiquitous in education research and institutional reporting.
Item response theory (IRT) models (van der Linden & Hambleton, 1997; Fox, 2010) offer the
machinery needed to handle sophisticated item- and person-sampling schemes in complex
survey data. Even in simpler settings these models offer proficiency estimates with high
reliability and precision, due to their efficient use of assessment data. Econometricians,
policy analysts, and other social scientists increasingly rely on the results of latent variable
measurement models to generate constructs for statistical analysis.
Studies that characterize students’ achievement under different curricula, compare students belonging to different social groups, or evaluate achievement differences across countries, use estimated proficiency as the dependent variable in their analyses. Studies that
focus on downstream outcomes, such as earnings in the labor market, might assess the direct
effect of academic proficiency on the outcome, or control for proficiency in trying to assess
the influence of other variables of interest. In these latter cases estimated proficiency is an
independent variable in the analysis.
Whether latent proficiency variables play the role of dependent or independent variables,
the issue of measurement error must be addressed. If the proficiency variables were estimated
without error, they could be used directly with no adjustments. If, as is usually the case,
proficiencies are estimated with some uncertainty, it will affect both the precision and bias of
estimated effects. The accurate assessment of precision requires using appropriately adjusted
standard errors or similar calculations. Bias must be dealt with by conditioning proficiency
estimates on an appropriate set of covariates.
The appropriate conditioning model has been discussed at length by Mislevy (1991),
Mislevy, Beaton, Kaplan & Sheehan (1992) and others. A key motivating application is the
release of data for secondary analysis by large institutional surveys, such as the U.S. National
Assessment of Educational Progress (NAEP, Allen, Carlson & Zelenak, 1999), or other large-scale national and international surveys of education that have a similar structure (e.g., the
National Adult Literacy Survey, NALS, Kirsch, et al, 2000; and Trends in International
Mathematics and Science Study, TIMSS, Olson, Martin & Mullis, 2008). In a data release
that provides individual-level proficiency measures, a fixed number of multiple imputations
(Rubin, 1987, 1996) for each individual’s ability are released. These imputations, known as
plausible values (PVs) in this context, are adjusted to account for degraded precision and bias
due to measurement error, in two ways. First, they are Monte Carlo draws from posterior
proficiency distributions for each individual, and hence incorporate all sources of uncertainty
(including measurement error). Second, the posterior distribution is conditioned not only on
the individual responses to items on a cognitive assessment, but also on a set of demographic
and other background variables. PV methodology provided in Mislevy (1991), Mislevy et
al. (1992), and other sources allows secondary analysts to account for measurement error in
subsequent analyses by employing PVs in appropriate ways (e.g., Mislevy, 1991 and 1993,
and von Davier, Gonzalez & Mislevy, 2009). Typically, agencies release five PVs for each
individual, along with instructions for using PVs to estimate regression coefficients and other
effects. (For more on current PV methodology see Li, Oranje & Jiang, 2009.)
Given standard practice, there is a subtle but important question about the conditioning
model used to generate PVs: What data, aside from the item response data themselves,
should be incorporated in generating the posterior distribution from which PVs are drawn?
Based on an argument developed by Mislevy (1991), institutions that release PVs typically
condition on a fixed but extremely large set of covariates to account for the large universe
of studies that a secondary analyst might undertake. In particular, any contrast (such as a
comparison between mean proficiencies in two social groups of interest) that a hypothetical
secondary analyst might be interested in must be included, directly or by proxy, in the
conditioning model used by the institution to generate PVs. In Section 2 and Section 3
we review this argument and see that when proficiency is a dependent variable the release
of institutional PVs based on an extremely large conditioning model allows a secondary
analyst to conduct estimation that is unbiased but perhaps statistically inefficient, under
rather general assumptions. In Section 4, we provide a disquieting result for the case when
proficiency is an independent variable in a regression model. We show that secondary analysis
is susceptible to substantial bias when using institutional PVs as independent variables
with the standard methodology prescribed for them. Because of the complex nature of the
conditioning model, a secondary analyst has essentially no chance of specifying a model
consistent with the survey institution’s modeling choices. The bias involves not only the
regression coefficient for proficiency in the model, but also regression coefficients for other
predictors, whether they are latent or not. Our conditions and results are similar in spirit
to Meng’s (1992) work on congeniality between inference methods and imputation models,
but our work is distinct from Meng’s in two ways: first, we consider imputation for latent
variables, which are measured with some error, and second we consider the role of latent
variables as independent variables in secondary analyses.
An immediate consequence of these results is that the use of institutional PVs based
on a large, fixed conditioning model may introduce substantial bias when proficiency is an
independent variable in a secondary analysis. More broadly, analysts who wish to use latent
variables to predict other outcomes should use conditioning models that are customized to
their particular prediction problem; in Section 5 we discuss workable machinery to do this.
Our results are stated in considerable generality. Nevertheless we show an example in
Section 6 to demonstrate the size and direction of the bias when the secondary analyst’s
model is not compatible with the institution’s conditioning model. In the example, the
structural model of interest is a linear regression model in which proficiency serves as an
independent variable predicting weekly wages and the measurement model is a standard
IRT model. We use data from the 1992 National Adult Literacy Survey (NALS) to illustrate
the bias resulting from the incompatibility of the secondary analyst’s structural model and
the institution’s conditioning model.
2 Modeling Components for the Analysis of Education Surveys
Before discussing the key results of Mislevy (1991) and Mislevy et al. (1992) in Section 3
and exploring their extension to models that use proficiency to predict other outcomes in
Section 4, it is important to describe and discuss the two sets of analyses that modern large-scale education surveys are designed to serve, and to discuss, in abstract terms, the tools
that they use to make inferences from the survey data.
In order to focus the discussion on these two sets of analyses and their inferential tools,
we will ignore some complexities of education surveys, such as complex student-sampling
designs (which are generally accounted for with design-based survey weights and jackknife
or Taylor linearization variance adjustments), and complex item-sampling designs (which
are generally accounted for with incomplete likelihoods in the measurement model, e.g.,
items not administered are missing completely at random or MCAR by design). All of these
complexities, and the tools developed to address them, are crucial for practical inference
from education survey data, and are well described in the technical documentation for these
surveys (e.g., Allen, Carlson & Zelenak, 1999; Allen, Donoghue & Schoeps, 2001; Kirsch
et al., 2000; NCES, 2009; and Olsen, Martin, & Mullis, 2009). But to review them in
detail would distract from the essential structure of the inferential problems faced both by
primary analysts working on behalf of the survey institution, and by secondary analysts who
use public-use and restricted-use data to answer questions not envisioned in the reports
published by the survey institution.
In Section 2.1 we review the survey institution’s measurement model, a generative psychometric model that describes the relationships between participants’ proficiency in a particular cognitive domain, and their responses to particular cognitive items in the survey. In
Section 2.2 we review the survey institution’s population model, also known as the conditioning model, which describes variation in cognitive proficiency across groups defined by
demographic, jurisdictional, and other background covariates. We then briefly discuss in
Section 2.3 the kinds of inference made by primary analysts working on behalf of the survey
institution. Finally, in Section 2.4, we discuss two basic classes of models used by secondary
analysts to explore research questions not contemplated in the survey institution’s reports.
2.1 The Measurement Model
For simplicity we suppose there are N participants (students or other respondents) in the
education survey and J test items. We denote the response of participant i to question j as
Xij, and the set of all responses as $X = [X_{ij}]_{i=1,j=1}^{N,J}$. We also denote the latent proficiency
of the ith participant as θi , and the set of all N proficiencies as θ = (θ1 , . . . , θN ). The
measurement model
$$p(X \mid \theta) \tag{1}$$
is the generative model chosen by the survey institution to model the likelihood of observing
response matrix X, given latent proficiencies θ. The measurement model may depend on
other parameters, which do not concern our analysis here.
The formulation in (1) is intended to be quite general and cover a broad variety of possible
stochastic models for measurement, including
• classical unidimensional dichotomous item response theory (IRT) models, which take
the form
$$p(X \mid \theta) = \prod_{i=1}^{N} \prod_{j=1}^{J} P(\theta_i \mid \gamma_j)^{X_{ij}} \bigl(1 - P(\theta_i \mid \gamma_j)\bigr)^{1 - X_{ij}} ,$$
where Xij = 0 or 1, indicating a wrong or right response, θi is a single real number
indicating a level of proficiency, P(θ) is a standard item characteristic curve, such
as the 2-parameter logistic (2PL) model, and γj is a set of item parameters for item
j, such as a discrimination parameter aj and difficulty parameter bj, in which case
γj = (aj, bj) (a small numerical sketch of this case appears after this list);
• multidimensional IRT (MIRT) models, in which θi is a vector of d real numbers, θi =
(θ1i , . . . , θdi ), denoting proficiencies on d latent constructs, and P (θ) and γj are modified
accordingly;
• polytomous variations on the IRT or MIRT models above, in which Xij can take values
in a discrete set of categories, and P (θ) and γj are modified accordingly;
• cognitive diagnosis models (CDMs), in which Xij may take dichotomous or polytomous
values, θi is a d-dimensional vector of discrete indicators denoting the presence or
absence of d specific skills or knowledge components, and P (θ) and γj are modified to
specify a specific CDM such as the DINA or DINO model;
• factor analysis (FA) models in which Xij are continuous responses, θi is a vector of
continuous factor scores, and p(X|θ) is a typical FA model; and
• other models in which Xij is a more complex (multivariate, graphical, etc.) response,
and/or θi is a more complex proficiency variable, and/or the measurement model
p(X|θ) may reflect violations of, or variations on, many standard assumptions such
as local independence, monotonicity, experimental independence, complete data, etc.
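To make the 2PL case concrete, the following minimal sketch (our illustration; all names and parameter values are hypothetical, not any survey institution's operational code) evaluates the 2PL item characteristic curve and the corresponding log of the measurement model (1) for a small simulated response matrix:

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item characteristic curve P(theta_i | gamma_j), gamma_j = (a_j, b_j)."""
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))

def loglik_2pl(X, theta, a, b):
    """Log of the measurement model p(X | theta) in (1) for the 2PL case,
    assuming local independence across the N x J response matrix X."""
    P = p_2pl(theta, a, b)              # N x J matrix of success probabilities
    return np.sum(X * np.log(P) + (1 - X) * np.log(1 - P))

# Toy usage: 5 participants, 3 items with fixed (pre-calibrated) parameters.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)                                    # latent proficiencies
a, b = np.array([1.0, 1.5, 0.8]), np.array([-0.5, 0.0, 0.7])  # item parameters
X = (rng.random((5, 3)) < p_2pl(theta, a, b)).astype(int)     # simulated responses
print(loglik_2pl(X, theta, a, b))
```

Because the item parameters are treated as fixed arrays here, the sketch matches the pre-calibration assumption described next.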
In most modern large-scale education surveys, the measurement model (1) is some form of
an IRT or MIRT model. Whatever the measurement model is, it is usually pre-calibrated so
that any item parameters γ1 , . . . , γJ can be thought of as fixed and known for all subsequent
analyses. We will assume this in our development below. In the case in which γj are
estimated along with other quantities, however, there is no essential change in the message
of our work.
2.2 The Population (Conditioning) Model
In a typical large-scale education survey, the survey institution is primarily interested in
reporting on features of the population distribution
$$p_{PA}(\theta \mid Z) . \tag{2}$$
The subscript PA in (2) is intended to remind that this distribution is the focus of the
primary analysis performed by the survey institution or its contractors. The variable Z
denotes an entire set of conditioning variables that are of interest in the primary analysis.
These might typically include
• primary reporting variables;
• survey design variables;
• jurisdictional or institutional variables that describe the institutions (typically schools,
school districts, governmental jurisdictions, etc.) that the individual participants (typically students) are members of;
• variables concerning participants’ education contexts, such as teacher questionnaires;
• participant demographic variables, such as gender, race/ethnicity, age, SES; and
• other background variables for individual participants, such as education experience
or number of hours of TV watched, which might be collected through a background
questionnaire administered to individual participants.
The conditioning variables Z in our setup subsume both the collateral variables Y and the
design variables Z in the setup of Mislevy (1991) and Mislevy et al. (1992). The distinction
between design variables and collateral variables is not important for our development, and
we wish to reserve Y for the (observable) dependent variable in a prediction model.
The model in (2) is highly multivariate in both θ and Z. Indeed, Z generally spans
“reporting variables” that serve primary analyses and reports by the survey institution, other
demographic, background and jurisdictional variables that may serve secondary analysts, and
many interactions between them. Thus, (2) usually conditions on a large set of covariates
Z, and so it is also known as the conditioning model for the survey.
2.3 Primary Analysis: Reporting and Plausible Values
It is typically not possible to do inference on small jurisdictional units or individual participants, for a variety of legal and technical reasons. Many education surveys, such as the
National Assessment of Educational Progress, operate under laws that proscribe the public
identification of individual students, schools, or other local units that participate in the survey. More broadly, it is generally unethical to break confidentiality and privacy commitments
made to survey participants. At a more technical level, the participant sample is usually
not designed to provide reliable inferences at the level of a school or even a school district of
moderate size (and would be prohibitively expensive if it were so designed), and the number
of cognitive items asked of any individual participant is small enough (to manage time and
fatigue constraints) that inference for an individual is usually not reliable either. Instead,
the targets of inference for primary analysis by the survey institution are typically means,
percentiles, and other summaries for major reporting groups, defined by reporting variables
such as race/ethnicity, gender, age, region, larger jurisdictions, as well as some background
variables.
The “institutional model” described by equations (1) and (2) is essentially a two-stage
generative model for the cognitive data collected in the survey: first, θ is generated from its
conditional distribution given Z, and then X is generated from its conditional distribution
given θ. The objects of inference for primary analyses are features of the θ distribution,
after collecting the survey data. This suggests a Bayesian, or at least empirical Bayes,
approach. Indeed, the measurement model in (1) can be thought of as a likelihood for θ, and the conditioning model in (2) can be thought of as a prior distribution for θ. Then the
posterior distribution of θ may be written as
$$p_{PA}(\theta \mid X, Z) \propto p(X \mid \theta, Z)\, p_{PA}(\theta \mid Z) = p(X \mid \theta)\, p_{PA}(\theta \mid Z) \tag{3}$$
under the assumption that X ⊥⊥ Z | θ, which is usually true by design of the measurement process producing X (i.e., if the measurement process is well-designed, X should be
conditionally independent of any other variable, given θ). In typical settings, a great deal
of X is missing by design, to reduce testing time, fatigue, etc., for individual participants.
The mechanics of implementation of the measurement model (1), as reported in the technical documentation for any large-scale education survey—such as Allen, Carlson & Zelenak
(1999), Allen, Donoghue & Schoeps (2001), Kirsch, et al. (2000), NCES (2009), and Olsen,
Martin, & Mullis (2009)—allow reporting for all groups of students on a common θ scale.
Thus different groups are equated on a common θ scale, even though they may have seen
disjoint sets of items.
The summaries (e.g., conditional means or percentiles) produced in primary reports by
the survey institution are either functionals of the posterior distribution pP A (θ|X, Z)—that
is, they can be obtained by computing the integral¹
$$\int s(\theta, Z)\, p_{PA}(\theta \mid X, Z)\, d\theta \tag{4}$$
for some appropriate function s(θ, Z)—or they can be derived from functionals of pPA(θ|X, Z). The quantities in (3) and (4) may be estimated using Bayesian methods (Johnson & Jenkins, 2005) or marginal maximum likelihood and other methods (Allen, Donoghue & Schoeps, 2001).

¹ As suggested in Section 2.1 and Section 2.2, X, Z and θ are extremely general multidimensional objects; they may have components that are continuous, discrete, etc. For ease of exposition, we will express all appropriate probability calculations as integrals, as if the variables involved were continuous. For other variable types, the integrals can be replaced with appropriate sums, Riemann-Stieltjes integrals, etc., as needed. The essential message of our work is the same.
Following the work of Mislevy (1991), Mislevy et al. (1992), and others, many survey
institutions compute and publish plausible values (PVs) for θ in large scale education surveys.
PVs, known in the statistics literature as multiple imputations (Rubin, 1996), are sets of
random draws from the posterior distribution (3). Their primary use, as noted by Mislevy
et al. (1992, p. 142), is as a Monte Carlo numerical integration tool for integrals such as
(4). PVs and their appropriate use have been discussed recently by von Davier, Gonzalez
& Mislevy (2009), and the consequences of their misuse in certain contexts were recently
discussed by Carstens & Hastedt (2010).
2.4 The Secondary Analyst's Research Models
In addition to primary reports generated by the survey institution and its contractors, important substantive and methodological work has been performed by secondary analysts
(NCES, 2008; Robitaille & Beaton, 2002), that is, researchers acting independently of the
survey institution, investigating questions outside the scope of the primary reports. Substantive questions for secondary analysts often revolve around some feature of a distribution
such as
$$p_{SA}(\theta \mid \tilde{Z}) , \tag{5}$$
where the subscript SA is intended as a reminder that this is a model chosen by the secondary
analyst and Z̃ represents covariates of interest to the secondary analyst in his/her research.
For example, if the components of θ = (θ1 , . . . , θN ) are continuous and unidimensional, and
Z̃ can be separated into participant-level pieces Z̃ = (Z̃1 , . . . , Z̃N ), then pSA (θ|Z̃) might be
expressed as a linear model
$$\theta_i = \beta_0 + \beta_1 \tilde{Z}_i + \varepsilon_i , \qquad \varepsilon_i \sim N(0, \sigma^2) . \tag{6}$$
(6)
In general Z̃i need not be univariate, in which case β1 is a vector of regression coefficients.
In addition Z̃ may or may not be identical to Z in (2). Indeed, the most interesting and
innovative secondary analyses usually involve Z̃ not contemplated by the survey institution.
Because θ appears as the dependent variable in the regression form of (5), we refer to models
of the form (5) as θ-dependent models.
The object of inference in the θ-dependent case is typically some function s(θ, Z̃), which
captures some feature of pSA (θ|Z̃) of interest. For example in the linear regression case,
the secondary analyst might be interested in the least-squares estimate of β1 , in which case
s(θ, Z̃) takes the form
$$s(\theta, \tilde{Z}) = \hat{\beta}_1 = \frac{\widehat{\mathrm{Cov}}(\theta, \tilde{Z})}{\widehat{\mathrm{Var}}(\tilde{Z})} ,$$
where $\widehat{\mathrm{Cov}}(\cdot, \cdot)$ and $\widehat{\mathrm{Var}}(\cdot)$ denote sample covariance and variance calculations suitable for the design of the survey. More generally, the posterior distribution of β1,
$$s(\theta, \tilde{Z}) = p(\beta_1 \mid \theta, \tilde{Z}) ,$$
and similar quantities, may be of interest.
Another class of models considered by secondary analysts, especially those interested in
using cognitive proficiency in predicting later outcomes Y , is of the form
$$p_{SA}(Y \mid \theta, \tilde{Z}) . \tag{7}$$
Under suitable assumptions, this might also be expressible as a regression model such as
$$Y_i = \beta_0 + \beta_1 \theta_i + \beta_2 \tilde{Z}_i + \varepsilon_i , \qquad \varepsilon_i \sim N(0, \sigma^2) , \tag{8}$$
where, once again, Z̃ may or may not be identical to Z, and either or both of θi and Z̃i might
be multidimensional. Since θ appears as an independent variable in the regression form of
(7), we refer to models of the form (7) as θ-independent models.
The object of inference in the θ-independent case is again some function s(θ, Y, Z̃), which
now captures some feature of pSA (Y |θ, Z̃) of interest. For example, in the linear regression
case, the secondary analyst will typically be interested in the least-squares estimate of some
regression coefficient(s) or the posterior distribution of the regression coefficient(s).
3 The θ-dependent case
We consider first the θ-dependent case. Here the secondary analyst has a research model of
the form of (5), i.e.,
$$p_{SA}(\theta \mid \tilde{Z}) ,$$
in which θ appears as the dependent variable, and the object of inference is a function s(θ, Z̃)
related to pSA (θ|Z̃). Because θ is completely missing, the most the secondary analyst can
hope to learn is some feature of a marginal quantity such as
$$s(X, \tilde{Z}) = \int s(\theta, \tilde{Z})\, p_{SA}(\theta \mid X, \tilde{Z})\, d\theta = E_{SA}[s(\theta, \tilde{Z}) \mid X, \tilde{Z}] . \tag{9}$$
We can now state, in modern terms, the problem identified and solved by Mislevy (1991)
and Mislevy et al. (1992): What tool can the primary analyst provide to the secondary
analyst, so that (a) an integral of the form of (9) can be calculated or approximated appropriately, and (b) the results of the secondary analysis are numerically consistent with the
primary survey reports? The answer is the publication of institutional PVs, as discussed
above at the end of Section 2.3.
With institutional PVs the secondary analyst can approximate the quantity
$$s(X, \tilde{Z}, Z) = \int s(\theta, \tilde{Z})\, p_{PA}(\theta \mid X, Z)\, d\theta = E_{PA}[s(\theta, \tilde{Z}) \mid X, Z] , \tag{10}$$
which is a functional of the form (4). The following theorem lays out conditions under
which calculation of (10) leads to an unbiased estimate of s(X, Z̃), providing the underlying
justification for the use of PVs as outlined by Mislevy (1991) and subsequent authors.
Theorem 3.1. If Z̃ ⊆ Z, then s(X, Z̃, Z) is an unbiased estimate of s(X, Z̃).
By the notation Z̃ ⊆ Z, we mean that the σ-field generated by Z̃ is a subfield of the
σ-field generated by Z (see Billingsley, 1986, for definitions). Informally, this means that Z̃
is a deterministic function of Z.
Proof. We calculate
$$E_{SA}\{s(X, \tilde{Z}, Z) \mid X, \tilde{Z}\} = E_{SA}\bigl\{E_{PA}[s(\theta, \tilde{Z}) \mid X, Z] \mid X, \tilde{Z}\bigr\} = E_{SA}[s(\theta, \tilde{Z}) \mid X, \tilde{Z}] = s(X, \tilde{Z})$$
by the "telescoping" property of conditional expected values, when Z̃ ⊆ Z (Billingsley, 1986, p. 470). □
Biases arising when Z̃ ⊈ Z have been illustrated by Mislevy et al. (1992), von Davier et al. (2009), and Carstens & Hastedt (2010).
The standard procedure for using institutional PVs, described by Mislevy (1991, pp. 181–182), amounts to calculating
$$s_m = s(\theta^m, \tilde{Z}) , \qquad m = 1, \ldots, M$$
for each of M imputations θ^m drawn from pPA(θ|X, Z), and then averaging. This produces
$$\bar{s} = \frac{1}{M} \sum_{m=1}^{M} s_m \approx \int s(\theta, \tilde{Z})\, p_{PA}(\theta \mid X, Z)\, d\theta = E_{PA}[s(\theta, \tilde{Z}) \mid X, Z] ,$$
the Monte Carlo numerical approximation to s(X, Z̃, Z) in (10). A further between/within variance calculation (Mislevy, 1991, p. 182) approximates the posterior variance Var_PA[s(θ, Z̃) | X, Z].
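As a minimal sketch of this procedure (our illustration; the array names are hypothetical, and the simple-random-sampling variance formulas stand in for the design-based calculations a real survey analysis would use), the function below averages a least-squares slope over M plausible-value sets and combines its variance with the usual between/within rule:

```python
import numpy as np

def ols_slope(theta_m, z):
    """Least-squares slope of theta on z, with its sampling variance.
    (Simple SRS formulas; a real survey analysis would substitute
    design-based variance calculations.)"""
    z1 = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(z1, theta_m, rcond=None)
    resid = theta_m - z1 @ coef
    s2 = resid @ resid / (len(z) - 2)
    return coef[1], s2 * np.linalg.inv(z1.T @ z1)[1, 1]

def combine_plausible_values(pvs, z):
    """Combining rules of Mislevy (1991) / Rubin (1987) over M sets of PVs.
    pvs: M x N array, one row of plausible values theta^m per imputation."""
    stats = [ols_slope(pv, z) for pv in pvs]
    s_m = np.array([s for s, _ in stats])      # s_m, m = 1, ..., M
    within = np.array([v for _, v in stats])   # sampling variance per imputation
    M = len(s_m)
    s_bar = s_m.mean()                         # Monte Carlo estimate of (10)
    between = s_m.var(ddof=1)                  # between-imputation variance
    total_var = within.mean() + (1 + 1 / M) * between
    return s_bar, np.sqrt(total_var)           # estimate and its standard error
```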
Theorem 3.1 works for any function s(θ, Z̃), but it is useful to know that the same result
applies when computing formal posterior distributions of parameters of interest, such as the
regression coefficient β1 in (6). The corollary below extends Theorem 3.1 to this case, as
well as any other case where β is some parameter (or set of parameters) of interest.
Corollary 3.1. Let β be a parameter in the model pSA (θ|Z̃) and let s(θ, Z̃) = pSA (β|θ, Z̃).
If Z̃ ⊆ Z and β ⊥⊥ X | θ, Z̃, then s(X, Z̃, Z) is an unbiased estimate of pSA (β|X, Z̃).
The condition β ⊥⊥ X | θ, Z̃ is essentially guaranteed by design of the measurement
process leading to X.
Proof. Observe that
$$\begin{aligned}
s(X, \tilde{Z}) &= \int s(\theta, \tilde{Z})\, p_{SA}(\theta \mid X, \tilde{Z})\, d\theta \\
&= \int p_{SA}(\beta \mid \theta, \tilde{Z})\, p_{SA}(\theta \mid X, \tilde{Z})\, d\theta \\
&= \int p_{SA}(\beta \mid \theta, X, \tilde{Z})\, p_{SA}(\theta \mid X, \tilde{Z})\, d\theta \\
&= p_{SA}(\beta \mid X, \tilde{Z}) ,
\end{aligned}$$
where the second-to-last line follows from the assumption that β ⊥⊥ X | θ, Z̃. □
Theorem 3.1 gives a positive result, in the case that the secondary analyst's Z̃ is a function of the survey institution's Z. Since Z̃ is "invented" by the secondary analyst, however, there is a good chance that Z̃ ⊈ Z. In that case, the amount of bias is simply
$$E_{SA}\bigl\{E_{PA}[s(\theta, \tilde{Z}) \mid X, Z] \mid X, \tilde{Z}\bigr\} - E_{SA}[s(\theta, \tilde{Z}) \mid X, \tilde{Z}] . \tag{11}$$
We generally expect that this bias should decrease as the number of items J in X increases,
or equivalently, as the reliability with which θ can be measured by X increases. Mislevy
(1991) shows this in the case of a classical true score theory model, and Mislevy et al. (1992)
illustrate the same point numerically with an application to SAT testing data. Here we give
an informal argument that this should be true even more generally. Note that the term on
the right in (11) is
$$E_{SA}[s(\theta, \tilde{Z}) \mid X, \tilde{Z}] = \int s(\theta, \tilde{Z})\, p(\theta \mid X, \tilde{Z})\, d\theta = \int s(\theta, \tilde{Z})\, \frac{p(\tilde{Z} \mid \theta, X)}{p(\tilde{Z} \mid X)}\, p(\theta \mid X)\, d\theta .$$
In any measurement model for which there is a consistent estimator θ̂(X) based on the
response variables X, we expect that θ ⊆ X will become true as J grows (Ellis & Junker, 1997, show a somewhat stronger result for a general class of locally independent monotone latent variable models, for example); hence p(Z̃|θ, X)/p(Z̃|X) → 1. Moreover, as J grows, p(θ|X) should converge to a point mass at the participants' true θ values, θTRUE.² Thus, as J → ∞,
$$E_{SA}[s(\theta, \tilde{Z}) \mid X, \tilde{Z}] \approx \int s(\theta, \tilde{Z})\, p(\theta \mid X)\, d\theta \to s(\theta_{TRUE}, \tilde{Z}) .$$
A similar line of reasoning, beginning with the inner expected value EPA[s(θ, Z̃)|X, Z] in the term on the left in (11), shows that this term too converges to s(θTRUE, Z̃) as J → ∞, and hence the bias (11) vanishes as J grows.

² Chang & Stout (1993) give a result implying this for IRT models, and a similar result can be obtained for other psychometric models, by further generalizing the work of Walker (1969).
Since Z̃ is determined by the secondary analyst long after the survey institution has done
the primary analyses, survey institutions try to make Z as large as possible, to accommodate
any possible Z̃ that secondary analysts may be interested in. A typical conditioning model
(e.g., Kirsch et al., 2000; Mislevy et al., 1992; Dresher, 2006) will involve Z containing all
of the variables listed in Section 2.2 as well as their two-way interactions. This generally
produces a Z with many hundreds of columns. This is reduced by principal components
analysis (PCA) to a Z with far fewer columns (e.g., a few hundred) and this is used for all
subsequent primary analyses, including the generation of plausible values. Such a large Z
is thought to contain a good proxy for any Z̃ that a secondary analyst could define, so that
Z̃ ⊆ Z nearly holds, and the bias (11) in s(X, Z̃, Z) is minimal, even when θ is not measured
with high reliability.
Although the construction of such a large Z may seem awkward, it represents an elegant solution to the problem of making primary and secondary analyses logically and arithmetically consistent. For both primary and secondary analysts, computation is simply a matter
of summing over plausible values to approximate the functional in (4). If the primary and
secondary analysts are using the same set of plausible values, based on a Z designed to
contain good proxies for any possible Z̃, then the primary reports are margins of the table of
all possible secondary analysis results. If we are able to aggregate across secondary analyses
to produce a reporting quantity such as the mean proficiency for female students, this must
produce the same answer as the primary analysis did by calculating that mean directly, since
it amounts to summing across plausible values in a different order. Thus any inconsistencies
between primary and secondary analyses must be due to arithmetic errors, or conceptual
errors in setting up the quantity to be computed, rather than differences in computational
methods or tools.
Finally we note in passing that making Z much larger than Z̃ causes some inefficiency,
as can be seen by comparing the variability of s(X, Z̃, Z) with that of s(X, Z̃) over random
replications of the survey,
$$\begin{aligned}
\mathrm{Var}\bigl[s(X, \tilde{Z}, Z)\bigr] &= E\Bigl[\mathrm{Var}\bigl(s(X, \tilde{Z}, Z) \mid X, \tilde{Z}\bigr)\Bigr] + \mathrm{Var}\Bigl[E\bigl(s(X, \tilde{Z}, Z) \mid X, \tilde{Z}\bigr)\Bigr] \\
&= E\Bigl[\mathrm{Var}\bigl(s(X, \tilde{Z}, Z) \mid X, \tilde{Z}\bigr)\Bigr] + \mathrm{Var}\bigl[s(X, \tilde{Z})\bigr] ,
\end{aligned}$$
where the last line follows directly if Z̃ ⊆ Z. However, since there are no replications of
surveys in practice, this inefficiency is usually overlooked.
4 The θ-independent case
Suppose now that the secondary analyst has a research model of the form of (7), namely
$$p_{SA}(Y \mid \theta, \tilde{Z}) ,$$
in which θ now plays the role of an independent variable, and again the secondary analyst
is interested in a quantity of the form s(θ, Y, Z̃). Once again, θ is completely missing, and
so it is natural to consider a marginal quantity like
$$s(X, Y, \tilde{Z}) = \int s(\theta, Y, \tilde{Z})\, p_{SA}(\theta \mid X, Y, \tilde{Z})\, d\theta .$$
By replacing Z̃ with (Y, Z̃), we immediately obtain natural corollaries to Theorem 3.1 and
Corollary 3.1. For these corollaries, stated below, we also define
$$s(X, Y, \tilde{Z}, Z) = \int s(\theta, Y, \tilde{Z})\, p_{PA}(\theta \mid X, Z)\, d\theta$$
for the institutional posterior distribution pP A (θ|X, Z), perhaps available to secondary analysts through the publication of plausible values. We then immediately obtain
Corollary 4.1. If (Y, Z̃) ⊆ Z, then s(X, Y, Z̃, Z) is an unbiased estimate of s(X, Y, Z̃).
Corollary 4.2. Let β be a parameter in the model pSA (Y |θ, Z̃) and let s(θ, Y, Z̃) = p(β|θ, Y, Z̃).
If (Y, Z̃) ⊆ Z and β ⊥⊥ X | θ, Y, Z̃, then s(X, Y, Z̃, Z) is an unbiased estimate of p(β|X, Y, Z̃).
Corollary 4.1 and Corollary 4.2 assert that Y , which is already a dependent variable in
the secondary analyst’s model pSA (Y |θ, Z̃), should also be an independent variable in the
primary analyst’s conditioning model pP A (θ|Z), in order that S(X, Y, Z̃, Z)—or its approximation using institutional PVs—produces an unbiased estimate of s(X, Y, Z̃). However, the
assumption that Y is in the institutional conditioning model imposes a serious restriction
on what the secondary analyst’s model can be; violating this constraint can produce other
biases.
Indeed Theorem 4.1, which we present next, shows that when Y is in the institutional
conditioning model, the form of p(Y |θ, Z̃) is essentially determined by the institutional conditioning model. If the form of the secondary analyst’s research model pSA (Y |θ, Z̃) is the
same as the form determined by Theorem 4.1 from the institutional conditioning model
p(θ|Y, Z̃), then s(X, Y, Z̃, Z)—and therefore its estimate using institutional PVs—will be an
unbiased estimate of s(X, Y, Z̃). If not, the PV-based approximation to s(X, Y, Z̃, Z) will be
vulnerable to bias, as an estimate of s(X, Y, Z̃).
We illustrate this bias in Section 6 below. There, we suppose that the secondary analyst’s
research model pSA (Y |θ, Z̃) is in the form of the wage equation from economics, which in
our case is a linear regression model for Y = log(wage) in terms of reading proficiency θ
and additional covariates Z̃ that include race/ethnicity and work experience. We attempt
to estimate regression coefficients in the wage equation using PVs released for the National
Adult Literacy Survey (NALS). In this case, the primary NALS analysis explicitly includes
(a function of) Y in the institutional conditioning model, but the form for pSA (Y |θ, Z̃)
determined from the conditioning model by Theorem 4.1 is undoubtedly different from the
wage equation in the economics literature. The estimates of coefficients in the wage equation
based on NALS PVs are thus vulnerable to bias. We examine the size and direction of the
bias, in Section 6, by comparing PV-based coefficient estimates with estimates that do not
depend on institutional PVs.
For ease of exposition, we consider in Theorem 4.1 the case in which Z = (Y, Z̃). However,
as discussed below with respect to Corollary 4.3(b) and illustrated in Section 6, we can expect
similar behavior in the more general case Z ⊇ (Y, Z̃). Note that in Theorem 4.1, p(θ|Y, Z̃)
is the institutional conditioning model as in (2), not a posterior distribution, and p(Y |θ, Z̃)
is the form that the secondary analyst’s research model pSA (Y |θ, Z̃) must match.
Theorem 4.1. The distribution p(Y |θ, Z̃) is completely determined by the conditioning
model p(θ|Y, Z̃) and the conditional distribution p(Y |Z̃).
Since p(Y |Z̃) is entirely determined by the observable relationship between Y and Z̃,
p(Y |θ, Z̃) is essentially determined once p(θ|Y, Z̃) is specified.
Proof. Observe that
$$p(Y \mid \theta, \tilde{Z}) = \frac{p(\theta \mid Y, \tilde{Z})}{p(\theta \mid \tilde{Z})} \cdot p(Y \mid \tilde{Z}) . \tag{12}$$
For the denominator in (12), we note
$$p(\theta \mid \tilde{Z}) = \int p(\theta, Y \mid \tilde{Z})\, dY = \int p(\theta \mid Y, \tilde{Z})\, p(Y \mid \tilde{Z})\, dY . \tag{13}$$
Clearly, equations (12) and (13) depend only on p(θ|Y, Z̃) and p(Y |Z̃), and completely determine p(Y |θ, Z̃). □
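To see Theorem 4.1 in action, consider a worked special case (our illustration; the normal linear forms are assumptions, not the NALS specification). Suppose the conditioning model and the marginal distribution of Y are both normal and linear:
$$p(\theta \mid Y, \tilde{Z}) = N(a + bY + c\tilde{Z},\; s^2) , \qquad p(Y \mid \tilde{Z}) = N(d + e\tilde{Z},\; v^2) .$$
Substituting these into (12) and completing the square in Y gives
$$Y \mid \theta, \tilde{Z} \;\sim\; N\!\left( \frac{(b/s^2)\,(\theta - a - c\tilde{Z}) + (d + e\tilde{Z})/v^2}{b^2/s^2 + 1/v^2} ,\;\; \frac{1}{b^2/s^2 + 1/v^2} \right) .$$
Thus a normal linear conditioning model does permit a linear regression of Y on (θ, Z̃), but only with the intercept and slopes implied by (a, b, c, d, e, s², v²); any other coefficient values, or any nonlinear specification, would be incompatible with the conditioning model.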
Now let us consider a more general conditioning model p(θ|U, Z̃). In the special case of
Theorem 4.1, U = Y . Or, U might be a variable specifically intended to proxy for Y . Or,
if we have a large institutional conditioning model p(θ|Z) with Z̃ ⊆ Z, U might be all the
information in Z that is not in Z̃, in which case U might be highly multivariate.
Part (a) of Corollary 4.3, which we present next, shows that if the relationship between
U and Y is completely explained by θ and Z̃—in the sense that Y ⊥⊥ U |θ, Z̃—then adding
U to the conditioning model does not force a particular form for the secondary analyst’s
research model p(Y |θ, Z̃). For example in the wage equation example of Section 6, an
indicator variable U that is 1 for someone who has been working 5–10 years and 0 otherwise is
undoubtedly associated with Y = log(wage), but since U is a function of the work experience
variable that is already in Z̃, including U in the conditioning model does not constrain the
specification of the wage equation by the secondary analyst.
On the other hand, part (b) of Corollary 4.3 shows that if Y ⊥̸⊥ U | θ, Z̃, then once again the form of p(Y |θ, Z̃) is determined by the conditioning model p(θ|U, Z̃). For example, if we have a large institutional conditioning model that is designed so that Z contains "proxies for everything," and U contains all of the information in Z that is not in Z̃, then it is very likely that Y ⊥̸⊥ U | θ, Z̃. Hence the secondary analyst's research model pSA(Y |θ, Z̃) must
match the form determined by the conditioning model p(θ|Z).
Corollary 4.3.
(a) If Y ⊥⊥ U |θ, Z̃, then the conditioning model p(θ|U, Z̃) places no constraint on the
distribution p(Y |θ, Z̃).
(b) If Y ⊥̸⊥ U | θ, Z̃, then the conditioning model p(θ|U, Z̃) and the conditional distribution p(U |Z̃) determine the distribution p(Y |θ, Z̃).
Proof. We first observe that, as in the proof of Theorem 4.1,
$$p(U \mid \theta, \tilde{Z}) = \frac{p(\theta \mid U, \tilde{Z})}{p(\theta \mid \tilde{Z})} \cdot p(U \mid \tilde{Z}) ,$$
which again depends only on p(θ|U, Z̃) and p(U |Z̃). Then,
$$p(Y \mid \theta, \tilde{Z}) = \int p(Y, U \mid \theta, \tilde{Z})\, dU = \int p(Y \mid \theta, U, \tilde{Z})\, p(U \mid \theta, \tilde{Z})\, dU . \tag{14}$$
If Y ⊥⊥ U | θ, Z̃, then the first term under the integral in (14) reduces to p(Y |θ, Z̃), and there is no constraint. However, if Y ⊥̸⊥ U | θ, Z̃, then (14) determines p(Y |θ, Z̃). □
Corollary 4.1 and Corollary 4.2 show that, in order to use institutional PV methodology
to explore predictive inference using θ and other covariates, the variable Y to be predicted—
or a good proxy of it—must be in the institutional conditioning model. But Theorem 4.1
and Corollary 4.3 show that to include Y or a nontrivial proxy U forces a particular form
for the secondary analyst’s research model. When this form is not used, bias may result, as
illustrated below in Section 6, and this may lead to incorrect scientific or policy conclusions.
It might be reassuring to realize that the bias should decrease as test length increases (or
measurement error for θ decreases), as suggested by the discussion of equation (11), but the
direction and magnitude of the bias will vary depending on the application.
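Both effects are easy to see in a small simulation (our illustration; a normal "test score" stands in for the IRT measurement model so that the institutional posterior has a closed form). When the conditioning model omits Y, regression on institutional PVs attenuates the coefficient on θ, and the attenuation shrinks as measurement error falls:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, beta1 = 20000, 5, 1.0

for meas_sd in [1.0, 0.5, 0.1]:               # measurement error for theta
    theta = rng.normal(size=N)                # true latent proficiency
    Y = beta1 * theta + rng.normal(size=N)    # secondary analyst's model, as in (8)
    x = theta + meas_sd * rng.normal(size=N)  # observed "test score"

    # Institutional posterior p(theta | x) when the conditioning model omits Y:
    # with theta ~ N(0, 1) and x | theta ~ N(theta, meas_sd^2), the posterior is
    # N(w * x, w * meas_sd^2), where w = 1 / (1 + meas_sd^2).
    w = 1.0 / (1.0 + meas_sd ** 2)
    slopes = []
    for _ in range(M):                          # M plausible values per person
        pv = w * x + np.sqrt(w) * meas_sd * rng.normal(size=N)
        slopes.append(np.polyfit(pv, Y, 1)[0])  # PV-based slope on theta
    print(f"meas_sd = {meas_sd}: mean PV-based slope = {np.mean(slopes):.3f}")
```

Analytically, the PV-based slope in this setup is β1/(1 + meas_sd²) (about 0.50, 0.80, and 0.99 for the three error levels against a true β1 of 1), so the bias can be severe when θ is measured with modest reliability and vanishes as measurement error goes to zero, consistent with the discussion of (11).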
Since survey institutions typically release only the plausible values and not the details
of the model associated with them, it is unlikely that the secondary analyst could specify
pSA (Y |θ, Z̃) to match the form suggested by the conditioning model, even if he/she wished
to deviate from the research models in the substantive literature (such as the wage equation
in economics). Thus, for predictive inference using θ, the secondary analyst is better off
building a model from scratch, not making use of institutional PVs. We turn to this process
in the next section.
5 A Workable Approach to the θ-independent Case
In Section 4, we argued that the usual plausible values methodology, using institutional PVs
generated from a large, fixed conditioning model in order to calculate unbiased estimates of
s(X, Y, Z̃), is not usually a tenable practice. An alternative would be to build a model directly
for the secondary analyst’s research question. The following easy proposition summarizes the
essential features of a marginal likelihood or Bayesian model built by the secondary analyst.
Proposition 5.1. Let β be any parameter(s) of interest. Then, under the setup of Section 2, if X ⊥⊥ (Z̃, β) | θ and θ ⊥⊥ β | Z̃,
(a) The secondary analyst's marginal likelihood for β is
$$p_{SA}(X, Y \mid \tilde{Z}, \beta) = \int p_{SA}(Y \mid \theta, \tilde{Z}, \beta)\, p_{SA}(X \mid \theta)\, p_{SA}(\theta \mid \tilde{Z})\, d\theta .$$
(b) The secondary analyst's posterior distribution for β is
$$p_{SA}(\beta \mid X, Y, \tilde{Z}) = \frac{\int p_{SA}(Y \mid \theta, \tilde{Z}, \beta)\, p_{SA}(X \mid \theta)\, p_{SA}(\theta \mid \tilde{Z})\, d\theta \; p_{SA}(\beta)}{\int\!\!\int p_{SA}(Y \mid \theta, \tilde{Z}, \beta)\, p_{SA}(X \mid \theta)\, p_{SA}(\theta \mid \tilde{Z})\, d\theta \; p_{SA}(\beta)\, d\beta} .$$
The condition X ⊥⊥ (Z̃, β) | θ is essentially guaranteed by good measurement practice in constructing X as a measure of θ. The condition θ ⊥⊥ β | Z̃ is quite innocuous in the θ-independent case, as we shall see after the proof.
Proof. For part (a), we observe
$$\begin{aligned}
p_{SA}(X, Y \mid \tilde{Z}, \beta) &= \int p_{SA}(X, Y, \theta \mid \tilde{Z}, \beta)\, d\theta \\
&= \int p_{SA}(Y \mid \theta, \tilde{Z}, \beta)\, p_{SA}(X \mid \theta, \tilde{Z}, \beta)\, p_{SA}(\theta \mid \tilde{Z}, \beta)\, d\theta \\
&= \int p_{SA}(Y \mid \theta, \tilde{Z}, \beta)\, p_{SA}(X \mid \theta)\, p_{SA}(\theta \mid \tilde{Z})\, d\theta ,
\end{aligned}$$
where the first equality follows from the law of total probability, the second from the multiplication rule, and the third from the conditional independence assumptions.
For part (b) we observe that
$$p_{SA}(\beta \mid X, Y, \tilde{Z}) = \frac{p_{SA}(\beta, X, Y \mid \tilde{Z})}{p_{SA}(X, Y \mid \tilde{Z})} = \frac{p_{SA}(X, Y \mid \tilde{Z}, \beta)\, p_{SA}(\beta)}{\int p_{SA}(X, Y \mid \tilde{Z}, \beta)\, p_{SA}(\beta)\, d\beta} .$$
The result now follows by applying part (a). □
Note that Y is forced to be in the institutional PV conditioning model in Corollary 4.1,
but it does not appear in the secondary analyst’s “conditioning model” p(θ|Z̃) in Proposition 5.1. There is a subtle but important distinction between the problems that these two
results solve. Corollary 4.1 shows what is necessary for institutional PVs so that s(X, Y, Z̃, Z)
(or its PV-based approximation) is an unbiased estimate of s(X, Y, Z̃). Proposition 5.1 shows
what is necessary for the rather different problem of making inferences about β directly from
the data and tools available to the secondary analyst, ignoring the institutional model used
for primary analyses. It is still necessary for the secondary analyst to incorporate appropriate survey design variables into Z̃, in order to avoid omitted-variable biases for example,
but Y plays a rather different role in the model of Proposition 5.1, versus the institutional
PV conditioning model.
The model implied by Proposition 5.1 can be instantiated, for example, as a Mixed Effects
Structural Equations (MESE) model (Schofield, 2008; Junker, Schofield, and Taylor, 2012).
The MESE model takes the form
$$\begin{aligned}
Y_i \mid \tilde{Z}_i, \theta_i, \beta, \sigma &\sim p(Y_i \mid \theta_i, \tilde{Z}_i, \beta, \sigma) && (15) \\
X_{ij} \mid \theta_i, \gamma_j &\sim p(X_{ij} \mid \theta_i, \gamma_j) && (16) \\
\theta_i \mid \tilde{Z}_i, \alpha, \tau &\sim p(\theta_i \mid \tilde{Z}_i, \alpha, \tau) && (17)
\end{aligned}$$
where θi , Z̃i , and Yi are defined as before.
Although it is expressed in hierarchical Bayes form here, it is not difficult to see that
the MESE model is a generalized structural equations model (SEM; e.g. Bollen, 2002). In
particular, it is an extension of the MIMIC model (Joreskog and Goldberger, 1975) that
allows for general data types and additional observed covariates. The MESE model allows
for examination of the conditional effect of Z̃ on the dependent variable Y after controlling
for θ, through the parameter(s) β. It is a “mixed effects” model, because the latent variable
is modeled as a random effect while the other covariates in (15) are fixed effects.
Equation (15) corresponds to the secondary analyst’s model and is of primary substantive
interest. Often (15) will take the form of a generalized linear model. For example, Junker,
Schofield, and Taylor (2012) apply MESE models with linear regression components such as
$$Y_i = \beta_0 + \beta_1 \theta_i + \beta_2 \tilde{Z}_i + \varepsilon_i , \qquad \varepsilon_i \sim N(0, \sigma^2) , \tag{18}$$
and with logistic regression components, in (15).
Equation (16) is the measurement model for θ with parameters γj . Frequently, (16) will
be an IRT or an MIRT model, but any measurement model (e.g., a classical test theory
model, a factor analysis model, a cognitive diagnosis model) is possible. When the survey
institution has built and calibrated a reliable measurement model, for example an IRT model,
then it is best for the secondary analyst to use that model and those calibration estimates
of γj in (16).
Equation (17) is a prior for θ; it is the secondary analyst’s “conditioning model.” But,
as argued above, because the problem being solved now is different from the one that institutional PVs were designed to solve, (17) must have a form different from the institutional
conditioning model. In the MESE model, the conditioning model plays the role of a prior
on θ, and it must condition on the Z̃ that appear as covariates in (15), but it does not condition on the dependent variable Y.
Note that (16) uses item response data, which the survey institution must make available to secondary researchers. While these responses are available for some surveys (NCES
provides item responses for those researchers who have restricted use data licenses in many
surveys), many other surveys (e.g., the 1979 and 1997 National Longitudinal Survey of
Youth) do not publish item responses for all cognitive tests. Items need not be released, but
rather item responses along with information about the construction and design of the test
(e.g., item parameters and information on the model used to construct the test) are sufficient
for estimation and inference with the MESE model.
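As a sketch of how Proposition 5.1 can be instantiated computationally, the function below assembles the joint log-posterior of a minimal MESE model with a linear structural component as in (18), a 2PL measurement component, and a normal conditioning model as in (17). It is our illustrative skeleton (flat priors on the regression parameters, hypothetical array names), not the authors' WinBUGS implementation, and it could be handed to any general-purpose MCMC sampler:

```python
import numpy as np

def mese_log_posterior(theta, beta, alpha, log_sigma, log_tau, Y, X, Zt, a, b):
    """Joint log posterior (up to a constant) of a minimal MESE model:
      structural model (15):   Y_i ~ N(beta[0] + beta[1]*theta_i + Zt_i @ beta[2:], sigma^2)
      measurement model (16):  X_ij ~ Bernoulli(logistic(a_j * (theta_i - b_j)))   # 2PL
      conditioning model (17): theta_i ~ N(alpha[0] + Zt_i @ alpha[1:], tau^2)
    Item parameters (a, b) are treated as fixed and pre-calibrated, and flat
    priors are assumed on beta, alpha, log(sigma), and log(tau)."""
    sigma, tau = np.exp(log_sigma), np.exp(log_tau)
    # (15): structural regression of Y on theta and Zt -- the target of inference
    mu_y = beta[0] + beta[1] * theta + Zt @ beta[2:]
    lp = -0.5 * np.sum(((Y - mu_y) / sigma) ** 2) - len(Y) * log_sigma
    # (16): 2PL measurement likelihood, assuming local independence across items
    P = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    lp += np.sum(X * np.log(P) + (1 - X) * np.log(1 - P))
    # (17): conditioning model / prior for theta -- conditions on Zt but NOT on Y
    mu_t = alpha[0] + Zt @ alpha[1:]
    lp += -0.5 * np.sum(((theta - mu_t) / tau) ** 2) - len(theta) * log_tau
    return lp
```

Sampling from this joint density and thereby integrating over θ corresponds to the marginal inference of Proposition 5.1; plugging in institutional PVs instead is exactly what Section 4 warns against.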
6 Data Example: Racial Wage Gaps in the 1992 National Adult Literacy Survey
The wage regression literature in econometrics provides an important example of the analysis
we have been discussing. A typical problem in this area examines the wage gap between two social groups; the typical solution is to build a regression equation that accounts for different factors that could affect the wage differential, while controlling for cognitive proficiency
through test scores.
To illustrate our theoretical results, we examine the issue of racial wage gaps with the
hope of decomposing the gaps in order to understand the impact of the racial differences in
cognitive proficiency on differences in labor market earnings. The model of interest is
$$\log(\mathrm{wage}_i) = \beta_0 + \beta_1 \theta_i + \beta_2\, \mathrm{black}_i + \beta_3\, \mathrm{Hispanic}_i + \beta_4 W_i + \varepsilon_i \tag{19}$$
where wagei is the weekly wage rate for individual i, θi is a measure of cognitive proficiency
(in this case English-language prose literacy), blacki and Hispanici are indicator variables
for race/ethnicity identification (with non-Hispanic white serving as the omitted category),
and W is a vector that includes potential labor market experience, region, and urban status.
Here, the variables blacki , Hispanici , and Wi comprise Z̃.
Equation (19) is a wage regression of the form seen in the literature (e.g., Neal and Johnson, 1996; Lang and Manove, 2006; and Junker, Schofield and Taylor, 2012).³ Of primary interest are estimates of the regression coefficients on the racial indicator variables (β̂2 and β̂3), as they provide information on racial differences in labor market earnings after controlling for differences in cognitive proficiency. Accurate estimates can potentially inform policy, with the goal of targeting appropriate resources to address the earnings differential.

³ The basic form of equation (19) is very common in the labor economics literature. However, the work we present below differs in many ways from Neal and Johnson (1996) and Lang and Manove (2006). Those papers use different data than we do, they use a different measure of cognitive ability (and that measure is constructed when individuals are teenagers, while wages are measured many years later when individuals are working adults), and their regressions include different covariates than ours. Thus, while the basic form of the regression is standard, interpretation of results will differ.
The data come from the 1992 National Adult Literacy Survey (NALS; Kirsch et al., 2000)
which includes an individually-administered household survey of 24,944 adults ages 16 and
over. The NALS comprises two sets of questions—standard demographic questions (e.g., race, gender, labor force behavior, wages, age, etc.) and items that measure functional literacy in three unidimensional domains: prose, document, and quantitative. For completeness we describe the design in terms of all three domains, but in our analyses below we
include only a latent variable θ for the unidimensional prose domain.
The NALS contains 165 items (41 prose literacy, 81 document literacy, 43 quantitative
literacy) to test the literacy skills of the examinees. Because it was deemed impractical to
administer every item to every respondent, each respondent was randomly administered a
booklet designed via the Balanced Incomplete Block (BIB; Beaton and Zwick, 1992) spiral design, which contained a representative sub-sample of approximately one-quarter to one-third of the full set of 165 items. On the prose scale, each individual received a booklet that contained between eight and thirteen prose items rather than the full 41 items.
Since the primary purpose for conducting the NALS survey is to provide information on the literacy skills of U.S. adults, population estimates are of more interest than individual-level literacy scores. Consequently, the NALS data set does not contain individual literacy proficiency estimates, but instead contains five PVs per content area and individual to aid
in calculating population estimates through the marginal estimation procedures outlined in
Section 3 (with further information available in Mislevy, 1991).
Kirsch, et al. (2000) describe the conditioning model used in the production of plausible
values for the NALS data. The NALS conditioning model is "a normal multivariate distribution . . . with a common variance, S, and with a mean given by a linear model with
slope parameters, G, based on the first approximately 100 principal components of several
hundred selected main effects and two-way interactions of the complete vector of background
variables” (Kirsch, et al., 2000, p. 180). The full list of background variables placed in the
principal component analysis is available in Appendix B of the report. For our purposes, it is important to note that this saturated model used in producing the plausible values contains wages—the dependent variable in our analysis—and several other variables not in (19), the regression of interest. Many of these additional variables, such as family income and occupation, are correlated with wages.
We compare estimates of the coefficients on prose literacy (β1 ), black (β2 ) and Hispanic
(β3 ) in (19) under several different model specifications, using a restricted sample consisting
of the 4950 men in the study who are between the ages of 25 and 65, work full time, report a
wage, and who self-report as black, Hispanic, or non-Hispanic white. Sample characteristics
are available in Table 1. We note a few features of the data. On average, white men earn
more than black and Hispanic men and white men have higher average prose literacy scores
than black men or Hispanic men.
Table 1: Sample Characteristics of the 1992 NALS Data

                                 Black   Hispanic    White
N                                  622        470     3858
Average Age                       39.5       37.5     40.5
Average Weekly Wage             479.03     468.41   721.25
Average Prose Literacy Scores    −0.36      −0.62     0.75

Note: Authors' calculations, 1992 NALS. Data are from the 1992 NALS, restricted to individuals aged 25–65 who work full-time, reported wages, and answered at least one prose literacy item.
We report estimates of the regression coefficients for three different model specifications in
Table 2. In the first specification, the explanatory variables include two indicator variables
for racial identification (black and Hispanic), potential experience entered as a quadratic
where potential experience is defined as the approximate number of possible years in the
labor force (defined as age − years of schooling − 6), and indicator variables for census regions (midwest, south, and west) and urban settings. Column (a) provides a "baseline" for the
black-white and Hispanic-white wage gaps without controlling for prose literacy proficiency.
We expect to see omitted-variable bias in the regression coefficients in column (a), because
an important variable—prose literacy—has been omitted from the model.
The next two specifications control for prose literacy proficiency but do so in different
ways. In column (b), we include a measure of prose literacy proficiency using the plausible
values provided by the survey institution with the saturated conditioning model. We follow
Mislevy (1991) and Kirsch, et al. (2000) by estimating the regression five times (each with
a different prose plausible value as the measure of θ). The best estimate of the regression
coefficients is the average of the five regression coefficients obtained from the analyses using
different sets of plausible values. We estimate the standard errors of the regression coefficients
by using the between/within calculation recommended by Mislevy (1991, p. 182); see also
the discussion of this calculation in Section 3 above.
Since Y is in the conditioning model used to generate the institutional PVs, Theorem 4.1 and Corollary 4.3 suggest that the conditioning model will determine a unique form
for p(Y |X, θ, black, Hisp, W ). Because of the complex nature of the conditioning model
used by the survey institution, it is impossible for us to determine the functional form of
p(Y |X, θ, black, Hisp, W ) determined by the NALS conditioning model. Additionally, it is
highly unlikely that our wage regression shown in (19) will be compatible with the survey institution’s p(Y |X, θ, black, Hisp, W ). For these reasons, we expect to see bias in our
estimates of β if we use the survey institution's PVs in our secondary analysis.
In the MESE specification, shown in column (c), we estimate the regression coefficients
using the Mixed Effects Structural Equations (MESE) model described in Section 5. We
report estimates from a MESE model where we specify the conditioning model to include
only race, experience, census region, and urban setting. For this example, the MESE model
takes the form
$$\begin{aligned}
Y_i \mid R_i, W_i, \theta_i &\sim N(\beta_0 + \beta_1 \theta_i + \beta_2\, \mathrm{black}_i + \beta_3\, \mathrm{Hispanic}_i + \beta_4 W_i ,\; \sigma^2) && (20) \\
X_{ij} \mid \theta_i &\sim \mathrm{IRT}(X_{ij} \mid \theta_i, \gamma_j) && (21) \\
\theta_i \mid R_i, W_i &\sim N(\alpha_0 + \alpha_1\, \mathrm{black}_i + \alpha_2\, \mathrm{Hispanic}_i + \alpha_3 W_i ,\; \tau^2) && (22)
\end{aligned}$$
where Yi is log(wages), Xij are prose literacy item responses, θ is latent prose literacy proficiency, Ri are the race/ethnicity variables, and Wi are the additional variables we control
for that include experience, census region and urban status indicator variables. (Ri and Wi
are the components of Z̃i in equations (15) and (17)).
In the MESE specification, we use the same 3-PL IRT model specified by Kirsch, et
al. (2000) as the measurement model for prose literacy in the primary analyses, and we fix
the item parameters at the estimates reported in Kirsch, et al. (2000, Appendix A), to ensure
that the differences in the MESE versus PV estimates are not due to measurement model
differences.
Kirsch, et al. (2000) note that the IRT model used in the NALS primary analyses is a multi-unidimensional model that treats each scale independently, so that "a unique proficiency was assumed for each scale" (p. 170) and all items load on only one scale. We include only those
items deemed “prose” items in the MESE model, and compare our results to results using
institutional PVs for prose literacy proficiency only. While the conditioning model used in the production of the PVs is multivariate, separate coefficients for the conditioning variables were estimated for each scale, so the additional items on the document and quantitative literacy scales effectively function as additional conditioning variables in the conditioning
model for prose literacy. Because the MESE model is correctly specified as described in
Section 5 and because it does not depend on institutional plausible values, we do not expect
it to exhibit the biases that the PV-based methodology exhibits.
We use a Markov Chain Monte Carlo (MCMC) procedure to estimate the regression
coefficients in the MESE specification using WinBUGS software (Spiegelhalter, Thomas,
and Best, 2000). R and WinBUGS code implementing this MESE model is available from
the authors.
Table 2 shows substantially different results depending on the conditioning model used.
Based on arguments made in Section 4 and Section 5, we have reason to suspect that there
are biases in the estimates of the regression coefficients in column (b), because the model in
our secondary analysis is not compatible with the primary institution's conditioning model.

Table 2: Wage Regressions Comparing NCES PVs and MESE Using 1992 NALS Data

                                   (a) No θ    (b) PV     (c) MESE
Black                              −0.368      −0.140     −0.083
                                   (0.027)     (0.027)    (0.029)
Hispanic                           −0.453      −0.145     −0.172
                                   (0.030)     (0.031)    (0.033)
92 NALS Prose Lit                     —         0.247      0.256
                                               (0.010)    (0.013)
N                                   4950        4950       4950
Conditioning Model contains Y?        —         Yes        No

Notes: All regressions control for potential experience entered as a quartic, census
region (entered as dummy variables), and urban setting (entered as a dummy variable).
PV estimates in column (b) employ the recommended procedure (Mislevy, 1991) for
combining regression results for multiple imputations. PV estimates use the NCES
saturated conditioning set, which includes wage (the dependent variable), variables
correlated with wages, race, experience, census region, and urban setting. The MESE
model estimates in column (c) are posterior means from estimating a MESE model. In
column (c), the conditioning set includes race, experience, census region, and urban
setting. Literacy has been scaled such that it is the same in columns (b)–(c) for
comparison purposes.
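The PV estimates in column (b) are obtained by fitting the same regression once per plausible value and pooling the results with the standard multiple-imputation combining rules (Mislevy, 1991; Rubin, 1987). The following is a minimal R sketch of that pooling step; pv (an N × M matrix of institutional plausible values) and run_reg() (a helper returning the coefficient vector and squared standard errors from one imputed regression) are hypothetical names of our own, not the authors' code.

    # Rubin/Mislevy combining rules across M plausible-value regressions (sketch)
    combine_pv <- function(pv, run_reg) {
      M    <- ncol(pv)
      fits <- lapply(seq_len(M), function(m) run_reg(pv[, m]))
      est  <- sapply(fits, function(f) f$coef)  # coefficients, one column per PV
      vars <- sapply(fits, function(f) f$var)   # squared SEs, one column per PV
      qbar <- rowMeans(est)                     # combined point estimates
      ubar <- rowMeans(vars)                    # average within-imputation variance
      bm   <- apply(est, 1, var)                # between-imputation variance
      data.frame(estimate = qbar,
                 se = sqrt(ubar + (1 + 1 / M) * bm))  # combined standard errors
    }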
The coefficient on the variable black is substantially smaller in magnitude under the
MESE model than under the PV methodology. The MESE estimates indicate that prose
literacy proficiency “accounts for” approximately 77 percent of the black-white log wage
gap. By way of comparison, the regression estimated by the PV methodology (which is
likely biased) suggests that literacy proficiency accounts for only 62 percent of this gap.
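These shares follow directly from the black coefficients in Table 2: the fraction of the
column (a) gap that is eliminated once proficiency enters the model is

    (0.368 − 0.083) / 0.368 ≈ 0.77 under MESE,    (0.368 − 0.140) / 0.368 ≈ 0.62 under the PV methodology.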
Also, using the PV-based model, we would infer that Hispanics and blacks have a similar
wage disadvantage relative to whites once we account for prose literacy proficiency. In
contrast, the MESE estimates suggest that, among men with comparable literacy proficiency,
the wage gap is larger for Hispanic men than for black men.
Our example is intended as an exercise to demonstrate differences in inference that can
arise from differences in methodological approach. We note, though, that our example comes
from a literature that has an important goal—providing useful policy guidance. For instance,
if empirical work establishes that racial and ethnic wage gaps are primarily the consequence
of differences in proficiencies (literacy, as in our example, or other skills), this suggests that
there is an important role for allocating educational resources so as to narrow racial and
ethnic differences in the acquisition of skills. If, instead, wage gaps are found to persist even
among similarly-skilled individuals, this might direct a focus to policies designed to reduce
discrimination in labor markets.
This empirical exercise demonstrates the biases that can result from using institutional
PVs in a secondary analysis that is not compatible with the conditioning model used to
generate the PVs. Our secondary analysis is not a rigged example; regressions of the form
(19) are used in many studies in labor economics (as well as in other areas of the social
sciences). Using real data, we have demonstrated what we argued in Section 4 analytically:
institutional PVs will produce biased estimates of regression coefficients when the secondary
analysis is not compatible with the survey institution’s conditioning model.
7 Discussion
Institutional plausible values (PVs) are multiple imputations from an institutional posterior
distribution p(θ|X, Z) based on a large latent regression model known as a “conditioning
model.” Institutional PVs were designed for, and are successful at, allowing secondary
analysts to produce unbiased estimates of quantities associated with a “small” posterior
distribution p(θ|X, Z̃) related to the secondary analyst’s research questions, using machinery
from the larger institutional posterior distribution. This situation, which we call the
θ-dependent case, covers all primary and much secondary analysis of education survey data:
characterizing θ in one or more subgroups of interest. In the first part of this paper, we have
reformulated basic results for this case, going back to Mislevy (1991), in modern and fairly
general notation. We have shown that the utility of PVs is likely to extend across a much
larger range of measurement and conditioning models than has been considered previously
in studies of PV methodology.
We have also considered the θ-independent case, in which an outcome variable Y is
predicted from θ and other covariates Z̃. Such models are often expressed as regressions for
Y on Z̃ and θ, and can be found in many social science settings, from predicting end-of-year
exams using progress in a tutoring system throughout the year (Ayers & Junker, 2008) to
understanding social group-based gaps in wages after accounting for cognitive status (Junker,
Schofield & Taylor, 2012), to predicting post-secondary choice and success (Schofield, 2013).
Unfortunately, our results show—both theoretically and empirically—that standard
institutional PVs cannot lead to the same sorts of unbiased estimates of posterior quantities
in the θ-independent case. Indeed, the only time unbiased results are guaranteed using
standard PV methodology in the θ-independent case is when the institution releases enough
information that a secondary analyst could build a θ-independent model compatible with
the conditioning model that generated the PVs.
There is, however, a fix at hand. Instead of trying to use PVs in the θ-independent
case, secondary analysts can build the relevant model for their research questions, from
scratch, largely ignoring the institutional machinery. One such model, the Mixed Effects
Structural Equations (MESE) model, offers considerable flexibility, has shown great initial
success, and accounts carefully for the latent structure and its unreliable measurement.
In order to use the MESE model or any related model, the secondary analyst must have
access to the individual item responses of survey participants on the cognitive portion of the
survey, and ideally also the measurement model that the survey institution used to build
the θ scale. The advantage of doing so is that the secondary analyst can exploit the
measurement model that the survey institution designed, to carefully characterize and
account for the measurement uncertainty in the θ-independent model. Other approaches—
such as classical errors-in-variables or instrumental variables approaches—require additional,
often untested and untenable, assumptions to account for the measurement error.
Finally, while we focused our discussion and example on IRT models and linear regression
models, there is nothing in our analysis that requires this structure. The points we raise
here are issues for a wide class of models and a wide class of latent variables. Using the
proper conditioning set for latent variables that serve as covariates is necessary for obtaining
consistent estimates.
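To see this point in miniature, the following self-contained R simulation (our own
illustration, not from the paper) replaces the IRT measurement model with a single noisy
score s, so that the institutional posterior for θ is available in closed form, and compares
PV-based regressions under conditioning sets that exclude and include the outcome Y:

    # PVs attenuate the theta coefficient when the conditioning model omits the
    # outcome y, but not when the conditioning model includes y.
    set.seed(1)
    N     <- 5000
    z     <- rbinom(N, 1, 0.5)                   # covariate in the secondary model
    theta <- rnorm(N, 0.5 * z, 1)                # latent proficiency
    y     <- theta + 0.5 * z + rnorm(N, 0, 0.5)  # outcome; true theta slope = 1
    s     <- theta + rnorm(N, 0, 0.8)            # noisy score, error variance 0.64

    draw_pvs <- function(cond, M = 10, err_var = 0.64) {
      fit <- lm(s ~ cond)                        # stand-in for the latent regression
      m   <- fitted(fit)                         # estimated E[theta | cond]
      t2  <- max(summary(fit)$sigma^2 - err_var, 0.01)  # est. Var(theta | cond)
      v   <- 1 / (1 / t2 + 1 / err_var)          # posterior variance of theta
      mu  <- v * (m / t2 + s / err_var)          # posterior mean of theta
      replicate(M, rnorm(N, mu, sqrt(v)))        # M plausible values per person
    }

    slope <- function(pvs) {                     # pooled point estimate for theta
      mean(apply(pvs, 2, function(p) coef(lm(y ~ p + z))[2]))
    }

    slope(draw_pvs(cbind(z)))     # conditioning omits y: attenuated, roughly 0.6
    slope(draw_pvs(cbind(z, y)))  # conditioning includes y: close to the true 1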
References
Allen, N. L., Carlson, J. E., & Zelenak, C. A. (1999). The NAEP 1996 Technical Report
(NCES 99-452). Washington, DC: National Center for Education Statistics.
Allen, N. L., Donoghue, J. R., & Schoeps, T. L. (2001). The NAEP 1998 Technical Report
(NCES 2001-509). Washington, DC: National Center for Education Statistics. Obtained from http://nces.ed.gov/nationsreportcard/pubs/main1998/2001509.asp
Ayers, E., & Junker, B. (2008). IRT Modeling of tutor performance to predict end-of-year
exam scores. Educational and Psychological Measurement, 68, 972-987.
Beaton, A. E., & Zwick, R. (1992). Overview of the National Assessment of Educational
Progress. Journal of Educational Statistics, 17(2), Special Issue: National Assessment
of Educational Progress, 95-109.
Billingsley, P. (1986). Probability and Measure. New York: Wiley and Sons.
Bollen, K. A. (2002). Latent variables in psychology and the social sciences. Annual Review
of Psychology, 53, 605-634.
Carstens, R. & Hastedt, D. (2010). The effect of not using plausible values when they should
be: An illustration using TIMSS 2007 grade 8 mathematics data. Paper presented at
the 4th IEA International Research Conference (IRC-2010), July 2010, Gothenburg,
Sweden. Obtained July 2013 from http://www.iea.nl/irc-2010.html.
Chang, H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in
an IRT model. Psychometrika, 58, 37-52.
Dresher, A. (2006). Results from NAEP Marginal Estimation Research. Presented at the
annual meeting of the National Council on Measurement in Education (NCME), April
2006, San Francisco, CA.
Ellis, J. L. & Junker, B. W. (1997). Tail-measurability in monotone latent variable models.
Psychometrika, 62, 495-523.
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York:
Springer.
Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress. Research Report RR-04-38. Princeton, NJ: Educational Testing Service. Abstracted at
http://www.ets.org/research/policy_research_reports/publications/report/
Jöreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators
and multiple causes of a single latent variable. Journal of the American Statistical
Association, 70, 631-639.
Junker, B. W., Schofield, L. S., & Taylor, L. (2012). The use of cognitive ability measures
as explanatory variables in regression analysis. IZA Journal of Labor Economics, 1(4).
Kirsch, I., et al. (2000). Technical report and data file user's manual for the 1992 National
Adult Literacy Survey. Washington, DC: U.S. Department of Education, National
Center for Education Statistics.
Lang, K., & Manove, M. (2011). Education and labor market discrimination. American
Economic Review, 101, 1467-1496.
Li, D., Oranje, A., & Jiang, Y. (2009). On the estimation of hierarchical latent regression
models for large-scale assessments. Journal of Educational and Behavioral Statistics,
34, 433-463.
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input. Statistical Science, 9(4), 538-558.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex
samples. Psychometrika, 56, 177-196.
Mislevy, R. J. (1993). Should “multiple imputations” be treated as “multiple indicators”?
Psychometrika, 58, 79-85.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population
characteristics from sparse matrix samples of item responses. Journal of Educational
Measurement, 29(2), 133-161.
NCES (2008). NAEP secondary analysis grant abstracts. Washington DC: National Center
for Education Statistics. Obtained from
http://nces.ed.gov/nationsreportcard/researchcenter/naepgrants.asp
NCES (2009). NAEP technical documentation. Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics (NCES).
Obtained July 2013 from http://nces.ed.gov/nationsreportcard/tdw/.
Neal, D., & Johnson, W. (1996). The role of pre-market factors in black-white wage differences. Journal of Political Economy, 104, 869-895.
Olson, J. F., Martin, M. O., & Mullis, I. V. S. (Eds.). (2008). TIMSS 2007 technical report.
Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Robitaille, D. F., & Beaton, A. E. (Eds.) (2002). Secondary analysis of the TIMSS data.
New York: Springer.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: John
Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years (with discussion). Journal of the
American Statistical Association, 91, 473-489.
Schofield, L. S. (2008). Modeling measurement error when using cognitive test scores in social
science research. (Doctoral dissertation). Department of Statistics and Heinz College
of Public Policy, Carnegie Mellon University, Pittsburgh, PA.
Schofield, L. S. (under review). Measurement error in latent variables that predict STEM
retention.
Spiegelhalter, D. J., Thomas, A., & Best, N. G. (2000). WinBUGS Version 1.3 User Manual.
Cambridge: Medical Research Council Biostatistics Unit. (Available from www.mrc-bsu.cam.ac.uk/bugs.)
van der Linden, W. J., & Hambleton, R. K. (Eds.) (1997). Handbook of modern item response
theory. New York, NY: Springer-Verlag.
von Davier, M., Gonzalez, E., & Mislevy, R. J. (2009). What are plausible values and why
are they useful? In M. von Davier, and D. Hastedt, (Eds.), IERI Monograph Series,
volume 2: Issues and methodologies in large-scale assessments (9-36). Hamburg, Germany and Princeton NJ: International Association for the Evaluation of Educational
Achievement (IEA) and Educational Testing Service (ETS).
Walker, A. M. (1969). On the asymptotic behavior of posterior distributions. Journal of the
Royal Statistical Society, Series B, 31, 80-88.