Item Response Theory for Time-Series Data

Dynamic Measurement of Political Phenomena:
Item Response Theory for Time-Series Data
Jonathan Kropko
Columbia University
July 11, 2013
Abstract. Item response theory (IRT) is a method for estimating a latent variable from a series of items with
a shared covariance. Current applications of IRT require a large enough cross-sectional sample to identify item
parameters and values of the latent variable. As a result, IRT can only be run on data that contain a prominent
cross-sectional component. I develop a method called time-series item response theory (TSIRT) that estimates
a latent variable and item parameters for a single important case when the items are observed at several points
in time. Time points replace the cross-sectional sample, and the estimated latent variable is a time-series rather
than a variable with a distribution in a population. TSIRT specifies an integrated time-series to be the prior
distribution of the latent variable so that the serial dependence in the latent time-series is explicitly modeled.
TSIRT is demonstrated to provide close estimates of a simulated latent variable using binary, count, proportion,
and continuous items, and is used to examine two substantive examples: the dynamics of the Israeli-Palestinian
conflict from 1990-2004, and the overall performance of the US economy since 1978.
1 Introduction
Item response theory (IRT) is an important tool for measuring a latent variable. The method was first developed
in psychometrics (Thurstone 1927, Lawley 1943) to measure unobservable individual traits with many observable
associated behaviors, and has been used to study self-esteem (Gray-Little, Williams, and Hancock 1997), depression
(Orlando, Sherbourne, and Thissen 2000), and attachment anxiety (Fraley, Waller, and Brennan 2000) among other
psychological traits. Another canonical application of IRT is educational testing, in which students’ latent ability
scores are estimated from their observed responses to the test items.1 Recently, IRT has been applied more broadly
across the health and social sciences: in political science, IRT has been used to estimate the ideological ideal points
of members of Congress (Clinton, Jackman, and Rivers 2004), of Supreme Court Justices (Martin and Quinn 2002,
Bailey and Maltzmann 2008), of state legislators (Shor and McCarty 2011), and member states in the UN (Voeten
2004). The method has also been applied to study the cross-national variation in democracy (Treier and Jackman
2008), and underlies the method of computerized adaptive testing that is used to assess performance on the GRE
and GMAT tests, and has been proposed as an efficient means to conduct a public opinion survey (Montgomery
and Cutler 2013).
IRT is often preferred by researchers to simpler measurements such as the sum of correct responses on a test
because different items provide different levels of information about the latent variable. On an educational test, the
standard 2-parameter implementation of IRT (Birnbaum 1968) estimates the difficulty of each test item, and the
extent to which the item accurately discriminates the high-ability students from the low-ability students. Likewise,
in a legislature some roll-call votes are more informative about ideology than others. IRT therefore provides more
accurate indices for latent political variables than naive metrics that weight all items equally.
Most applications of IRT require a large sample size to estimate two parameters per item with enough precision
to provide useful estimates of the latent variable. In addition, most uses of IRT are on cross-sectional data. Very
few applications have attempted to estimate a dynamic latent variable, but Martin and Quinn’s (2002) analysis of
Supreme Court ideal points is an important exception. Even Martin and Quinn, however, utilize cross-sectional
variation. In this paper I propose a version of IRT to measure a latent variable that is strictly a time-series. This
method, called time-series item response theory (TSIRT), works with data that describe only a single case but
contain several variables that are recorded at many points in time. As with any measurement technique, these
variables should have shared covariance; that is, they should all be implications of the latent time-series variable
to be measured. In addition, while most implementations of IRT work exclusively with binary or categorical items,
TSIRT can produce a latent variable from combinations of continuous, count, and proportion (real values bounded
within [0, 1]) variables as well as binary and categorical ones, while estimating item parameters for each variable
type. This method can be applied to a large and previously neglected class of measurement problems in which the
focus is on the dynamics of a particularly important case.
1 For IRT, the word “item” is synonymous with “observed variable.”
Below, I describe the formulation of TSIRT and how it builds on other versions of IRT and similar measurement
methods in political science. I demonstrate the performance of the algorithm on simulated data, and I apply the
method to two examples: an analysis of the dynamics of the Israeli-Palestinian conflict from 1990-2004, and a
measurement of the overall health of the US economy since 1978.
2 Background
The classic 2-parameter logit (2PL) version of IRT (Bock and Lieberman 1970) works with binary items and jointly
estimates a latent ability parameter for each individual i ∈ {1, 2, . . . , N }, denoted θi , as well as a discrimination
parameter for each item k ∈ {1, 2, . . . , K}, denoted αk , and a difficulty parameter for each item, denoted βk . Let X
be an (N × K) matrix in which the columns are binary variables. The probability of a correct response is logistic
conditional on θi , αk , and βk :
$$P(X_{ik} = 1 \mid \theta_i, \alpha_k, \beta_k) = f(\theta_i \mid \alpha_k, \beta_k) = \frac{1}{1 + \exp\left(-\alpha_k(\theta_i - \beta_k)\right)}, \tag{1}$$
where f(θi | αk, βk) is sometimes referred to as a test curve for the item. The discrimination parameter indicates
the certainty about the predicted value of the item conditional on θi. Two examples of test curves are illustrated
in figure 1. Each item has a difficulty parameter of 0, indicating that P(Xik = 1) = 0.5 when θi = 0. The item on
the left has a high discrimination parameter α = 3 and the item on the right has a lower discrimination parameter
α = 0.8. An individual with a latent ability of 1 has a 0.95 probability of a positive response for Xi if α = 3,
but this probability is only 0.69 if α = 0.8. Discrimination parameters are therefore analogous to loadings in a
factor analysis: large values indicate that the variation of the item is mostly explained by the latent variable, and
low values indicate that the item is potentially inappropriate to use to estimate the latent variable.
Figure 1: Two Logistic Items – One That Fits Well (Left) and One That Fits Less Well (Right). [Left panel: α = 3; right panel: α = 0.8. Horizontal axis: θ; vertical axis: Pr(Y=1).]
The likelihood function for the item parameters conditional on θ is
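$$L(\alpha, \beta \mid \theta, X) = \prod_{i=1}^{N} \prod_{k=1}^{K} f(\theta_i \mid \alpha_k, \beta_k)^{X_{ik}} \left(1 - f(\theta_i \mid \alpha_k, \beta_k)\right)^{1 - X_{ik}} \tag{2}$$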
(van der Linden and Hambleton 1997, p. 14). The values of θi are estimated through a Bayesian updating process.
Given a marginal prior distribution P (θi ) for each value of the latent variable, the posterior distribution of θi is
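$$P(\theta_i \mid \alpha, \beta, X_i) \propto P(\theta_i) \prod_{k=1}^{K} f(\theta_i \mid \alpha_k, \beta_k)^{X_{ik}} \left(1 - f(\theta_i \mid \alpha_k, \beta_k)\right)^{1 - X_{ik}}. \tag{3}$$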
The 2PL IRT model can be estimated either through a fully Bayesian specification or through marginal maximum
likelihood (Bock and Aitkin 1981), which is an EM procedure that estimates item parameters conditional on θi via
maximum likelihood on equation 2, updates θi using equation 3, and iterates the procedure until convergence.
IRT models make an assumption of local independence of items and cases. That is, IRT assumes that conditional on the latent variable, the responses on any item are independent from the responses on every other item,
and the response pattern on any row is independent from the response pattern on every other row.2 In practice,
the assumption that is made with regard to the rows is often stronger than local independence. The joint prior distribution on the latent variable is commonly the multivariate standard normal distribution. Under this distribution
the marginal prior of each θi is standard normal, which is a convenient choice for many researchers since it is naive
as to the true value of θi and also serves as a scale to identify the estimates. Using this prior, and assuming local
independence, the posterior distributions of θi must be independent over i as well.
Very few applications of IRT use an alternative prior. One notable exception however is the analysis of ideological
ideal points of Supreme Court justices conducted by Martin and Quinn (2002). In order to capture the temporal
variation of the justices’ ideal points from term to term, Martin and Quinn use a random walk prior for the latent
ideal points:
$$\theta_{t,j} \sim N\left(\theta_{t-1,j}, \Delta_{\theta_{t,j}}\right) \tag{4}$$
where t refers to the judicial term and j refers to the justice (p. 140). ∆θt,j is a scalar variance term which is
fixed to 1 in their analysis. Because the prior distribution of θj forms an integrated time-series over t, the
autocorrelation is contained in the prior distribution of the latent variable, and Martin and Quinn can maintain
the assumption that the items are locally independent. Each term contains different cases, so the item parameters
are estimated conditioning only on the 9 (or fewer) votes in each case.
In general, there have only been a few attempts in political science to measure a dynamic latent variable. Voeten
(2004) estimates the ideological ideal points of states in UN voting, and allows these ideal points to change over
time by modeling the latent variable as a quadratic function of time:
$$\theta_{it} = \pi_{0i} + \pi_{1i} T_t + \pi_{2i} T_t^2 + v_{it}, \qquad v_{it} \sim N(0, \sigma^2), \tag{5}$$
where Tt ∈ {−5, −4, . . . , 4, 5} represents the time frame 1991-2001 and π0i , π1i , and π2i are estimated from covariates
(p. 738). Similarly, Bailey (2007) estimates the ideal points of members of Congress and Supreme Court justices
as quartic functions of their time in office. The most famous example of dynamic measurement in political science
is DW-NOMINATE, which also treats legislator ideal points as polynomial functions of time (Carroll et al 2009, p.
263).
All of these approaches are at least in part derived from cross-sectional variation. Many important measurement
problems exist, however, for data that are strictly a time-series. These problems consider the variation over time
of one important case where several items are observed at repeated points in time. Problems of this type arise in
measuring, for example, the nature of a conflict between two states, the performance of an economy, the development
of an institution, the behavior of an executive, the strategies of a campaign, and so on. TSIRT extends the approach
of Martin and Quinn to time-series data by using a random walk prior on the latent variable, which is more general
and uses fewer degrees of freedom than specifying the latent variable to be a polynomial function of time. Unlike
Martin and Quinn’s approach, however, item parameters are well identified since the same items are observed
repeatedly. TSIRT also extends this methodology by including test curves for continuous, count, and proportion
items. The derivation of TSIRT is presented below.
2 Other assumptions made by IRT include the unidimensionality of the latent variable and the parametric representation of the test
curves. IRT has been extended to estimate multidimensional scales (McDonald 1997) and to handle test curves non-parametrically
(Mokken 1997). These extensions can also be applied to TSIRT, but the focus here is on the unidimensional, parametric case.
3 Methodology
TSIRT is implemented as a fully Bayesian model, estimated through MCMC.3 The data X are assumed to form a
(T × K) matrix, where t ∈ {1, 2, . . . , T } indexes the time points, and k ∈ {1, 2, . . . , K} indexes the items. The time
points refer to the variation over time of one important case, and the items have a shared covariance that can be
largely explained by a unidimensional, dynamic latent variable θt . The parameters to be estimated include θt , the
variance of the random walk σ 2 , item discrimination parameters αk , item difficulty parameters βk , and auxiliary
parameters for particular variable types: rk , the number of failures before an experiment is terminated under the
negative binomial distribution, for count variables, and φk , the total count parameter of the beta distribution, for
proportion variables. These parameters are assumed to be independent in their joint prior distribution, so that
$$P(\theta_t, \sigma^2, \alpha, \beta, r, \phi) = P_\theta(\theta_t) \cdot P_{\sigma^2}(\sigma^2) \cdot P_\alpha(\alpha) \cdot P_\beta(\beta) \cdot P_r(r) \cdot P_\phi(\phi). \tag{6}$$
Each of these prior distributions is described below.
The dynamic variation of the latent variable is expressed through a random walk prior:
$$\theta_t \sim N(\theta_{t-1}, \sigma^2). \tag{7}$$
The same prior is used by Martin and Quinn (2002) to estimate the ideal points of Supreme Court justices while
allowing these estimates to be correlated over time. Unlike Martin and Quinn, the variance hyperparameter σ 2 is
not fixed a priori, but is estimated. σ 2 is bounded to be greater than 0, and is specified to have a uniform prior.
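To make the random walk prior in equation 7 concrete, here is a minimal Python sketch of its log density over a whole latent series. The function name and the starting value `theta0 = 0` are assumptions for this example (the paper does not state the prior for the first time point at this stage), and the uniform prior on σ² only adds a constant on its support, so it is represented here by the positivity check alone.

```python
import math

def random_walk_log_prior(theta, sigma2, theta0=0.0):
    """Sum over t of log N(theta[t] | theta[t-1], sigma2).

    theta0 is a hypothetical starting value for the walk (assumption).
    Returns -inf when sigma2 is outside its support (sigma2 must be > 0).
    """
    if sigma2 <= 0:
        return float("-inf")
    log_p = 0.0
    prev = theta0
    for th in theta:
        log_p += -0.5 * math.log(2.0 * math.pi * sigma2) - (th - prev) ** 2 / (2.0 * sigma2)
        prev = th
    return log_p

# A smooth path receives more prior support than a path with large jumps.
smooth = random_walk_log_prior([0.1, 0.2, 0.1, 0.3], sigma2=1.0)
jumpy = random_walk_log_prior([2.0, -2.0, 2.0, -2.0], sigma2=1.0)
```

Because each term depends only on the step from one time point to the next, the prior penalizes abrupt changes in the latent variable without constraining its overall level.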
The item parameters are strongly identified since the data contain T observations of each item. Since these observations form a time-series, they are not independent, but the observations are assumed to be locally independent
conditional on the latent variable which accounts for autocorrelation. The prior distributions for the difficulty and
discrimination parameters are
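$$\beta \sim N(0, 1), \quad \text{and} \quad \alpha \sim \begin{cases} N(0, 1) & \text{if } \alpha > 0, \\ 0 & \text{if } \alpha \le 0. \end{cases} \tag{8}$$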
Finally, the auxiliary parameters for count and proportion items are specified to have uniform and uninformative
prior distributions.
Each item has a distribution that is specific to the type of item being considered. Binary items follow a logit
distribution conditional on θt , αk , and βk as in equation 1.4 Continuous items are standardized, and are conditionally
distributed normally with a mean of θt and a standard deviation of αk :
$$f_k(X_{tk} \mid \theta_t, \alpha_k) = \frac{1}{\sqrt{2\pi}\,\alpha_k} \exp\left(\frac{-(X_{tk} - \theta_t)^2}{2\alpha_k^2}\right). \tag{9}$$
No difficulty parameter is included in the test curves for continuous items because Xk is standardized and βk must
equal 0 by construction. αk is the standard deviation of the normal test curve. As the means θt deviate from the
data Xtk , the estimated values of αk must become larger in order to compensate for this deviation. Smaller values
of αk indicate a better fit for continuous items. Figure 2 illustrates the conditional distribution of a continuous
item X given θt and αk . On the left αk = 0.5, which is a low value because the standardized values of Xtk do not
depart much from the values of θt . But on the right these values deviate from the values of the latent variable to a
greater extent, and the standard deviation αk must be larger to compensate.
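The role of αk for continuous items can be checked numerically. The sketch below is an illustration only: it treats αk as the standard deviation of the normal test curve, as the text describes, and the function name is an assumption made for this example.

```python
import math

def continuous_test_curve(x, theta, alpha):
    """Normal density of a standardized item with mean theta and standard deviation alpha."""
    return math.exp(-((x - theta) ** 2) / (2.0 * alpha ** 2)) / (alpha * math.sqrt(2.0 * math.pi))

# Height of the density at its mean for the two panels of figure 2.
peak_tight = continuous_test_curve(0.0, 0.0, 0.5)   # alpha = 0.5, roughly 0.80
peak_loose = continuous_test_curve(0.0, 0.0, 1.2)   # alpha = 1.2, roughly 0.33
```

A smaller αk concentrates the density around θt, which is why smaller values indicate a better fit for continuous items.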
Count variables are modeled using the negative binomial distribution, which has two parameters: π ∈ [0, 1], the
probability of a positive outcome on any one particular draw, and r > 0, the number of negative draws before the
experiment is terminated.5 The distribution of a count item Xk is conditioned on the latent variable by specifying
π to be a logistic link function of θt :
$$\pi_{tk} = \frac{1}{1 + \exp\left(-\alpha_k(\theta_t - \beta_k)\right)}. \tag{10}$$
3 The models described in the examples below are estimated using Stan, a Hamiltonian MCMC algorithm, available in C++ and for
R (Stan Development Team 2013).
4 Similarly, ordinal items can be modeled using the ordered logit distribution and unordered-categorical items can be modeled using
the multinomial logit distribution. These distributions are natural extensions of the logit distribution. All of these models can be
implemented using probit models instead of logit models by using standard normal test curves, but the differences between logit and
standard normal CDFs are minor.
5 For example, if the probability of a positive draw is π = 0.5, and r = 3 negative draws result in the termination of the experiment,
then the probability that X = 2 is the probability that 2 positive outcomes had been drawn when the third negative outcome is drawn.
If 0 is a negative outcome and 1 is a positive outcome, then this result can be achieved by drawing a string of 11000, 10100, 10010,
01100, 01010, or 00110. Therefore if X ∼ NB(.5, 3), then P(X = 2) = 6(2^{−5}) = 0.1875.
Figure 2: Two Continuous Items – One That Fits Well (Left) and One That Fits Less Well (Right). [Left panel: α = 0.5; right panel: α = 1.2. Curves for θt = −1, θt = 0, and θt = 1. Horizontal axis: X; vertical axis: f(X | θt, α).]
Then the negative binomial distribution is conditioned on r, on the latent variable, and on the item parameters:6
$$f_k(X_{tk} \mid \theta_t, \alpha_k, \beta_k, r_k) = \binom{X_{tk} + r_k - 1}{X_{tk}} (1 - \pi_{tk})^{r_k} \, \pi_{tk}^{X_{tk}}. \tag{11}$$
The discrimination parameter αk determines how much weight a particular count places on a region of the latent
variable. Figure 3 shows three values of two count items with rk = 3. The item on the left has a high discrimination
parameter and places a great deal of weight on specific regions of the posterior distribution of θt . The item on the
right has a lower discrimination parameter, and is more agnostic about the implications on θt .
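The count test curve in equations 10 and 11 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's implementation: the function names are invented for this example, and the gamma-function version follows the general idea in footnote 6 of replacing the binomial coefficient so that r need not be an integer.

```python
import math

def logistic(theta, alpha, beta):
    """Equation 10: probability of a positive draw given the latent variable."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def neg_binomial_pmf(x, pi, r):
    """Equation 11 with integer r: probability of x positive draws before the r-th negative draw."""
    return math.comb(x + r - 1, x) * (1.0 - pi) ** r * pi ** x

def neg_binomial_pmf_gamma(x, pi, r):
    """Same pmf with the binomial coefficient written as gamma functions, so r may be any positive real."""
    log_coef = math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(1.0 - pi) + x * math.log(pi))

def count_test_curve(x, theta, alpha, beta, r):
    """P(X = x | theta, alpha, beta, r) for a count item."""
    return neg_binomial_pmf_gamma(x, logistic(theta, alpha, beta), r)

# Worked example from footnote 5: X ~ NB(0.5, 3), P(X = 2) = 6 * 2**-5.
footnote_value = neg_binomial_pmf(2, 0.5, 3)   # 0.1875
```

With βk = 0 and θt = 0 the logistic link gives π = 0.5, so `count_test_curve(2, 0.0, 1.0, 0.0, 3)` returns the same value as the footnote example.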
Proportion items follow the beta distribution, which has two shape parameters a > 0 and b > 0:
$$f_k(X \mid a, b) = \frac{1}{B(a, b)} X^{a-1} (1 - X)^{b-1}, \tag{12}$$
where B() is the beta function, given by
$$B(a, b) = \int_0^1 u^{a-1} (1 - u)^{b-1} \, du. \tag{13}$$
The beta distribution can be reparameterized to have a mean parameter π = a/(a + b) and a total count parameter
φ = a + b (Gelman et al 2004, p.132-133). Since π ∈ [0, 1], it can be specified using equation 10 to be a logistic link
function through which to introduce dependence on the latent variable and the item discrimination and difficulty
parameters. The reparameterized beta distribution is
$$f_k(X_{tk} \mid \theta_t, \alpha_k, \beta_k, \phi_k) = \frac{1}{B\left(\pi_{tk}\phi_k, (1 - \pi_{tk})\phi_k\right)} X_{tk}^{\pi_{tk}\phi_k - 1} (1 - X_{tk})^{(1 - \pi_{tk})\phi_k - 1}. \tag{14}$$
Figure 4 illustrates the distribution of a proportion item conditional on θt and αk , where βk is fixed to 0 and φk
is fixed to 10. As with the test curves for continuous items, the item with high discrimination involves a high level
of association between the values of Xtk and θt while the item with lower discrimination allows this relationship to
be less determined.
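The reparameterization behind equation 14 amounts to setting a = πφ and b = (1 − π)φ in equation 12. A minimal Python sketch, with function names chosen for this example and the beta function computed through log-gamma functions, might look as follows.

```python
import math

def logistic(theta, alpha, beta):
    """Equation 10: the mean of the beta distribution as a function of the latent variable."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def beta_test_curve(x, pi, phi):
    """Equation 14: beta density with mean pi and total count phi, for 0 < x < 1."""
    a = pi * phi
    b = (1.0 - pi) * phi
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_beta_fn)

# Settings used in figure 4: beta_k = 0 and phi_k = 10, evaluated at theta_t = 0.
pi_at_zero = logistic(0.0, 3.0, 0.0)                   # 0.5, so a = b = 5
density_mid = beta_test_curve(0.5, pi_at_zero, 10.0)   # 630 * 0.5**8 = 2.4609375
```

Larger values of φ concentrate the density more tightly around the mean π, while α controls how quickly that mean moves with θt.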
6 r is an integer parameter, but it can be treated as positive and continuous by replacing the binomial in the negative binomial
distribution with a ratio of gamma functions (Cameron and Trivedi 2005, chapter 20).
Figure 3: Two Count Items – One That Fits Well (Left) and One That Fits Less Well (Right). [Left panel: α = 3; right panel: α = 0.8. Curves for C = 0, C = 2, and C = 4. Horizontal axis: θt; vertical axis: P(X = C | θt, α).]
Figure 4: Two Proportion Items – One That Fits Well (Left) and One That Fits Less Well (Right). [Left panel: α = 3; right panel: α = 0.8. Curves for θt = −1, θt = 0, and θt = 1. Horizontal axis: X; vertical axis: f(X | θt, α).]
The models for binary, continuous, count, and proportion items are all linked by the common latent variable
θt , and the parameters that are specific to each item are locally independent conditional on θt . Let fBk represent
the distribution of a binary item implied by equation 1, let fNk be the distribution of a continuous item given in
equation 9, let fCk be the distribution of a count item given in equation 11, and let fPk be the distribution of a
proportion item given in equation 14. Consider a model with KB binary items, KN continuous items, KC count
items, and KP proportion items. The joint posterior distribution for all of the parameters is given by
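$$
\begin{aligned}
P(\theta_t, \sigma^2, \alpha, \beta, r, \phi \mid X) \propto{} & P_\theta(\theta_t) \cdot P_{\sigma^2}(\sigma^2) \cdot P_\alpha(\alpha) \cdot P_\beta(\beta) \cdot P_r(r) \cdot P_\phi(\phi) \\
& \times \prod_{k=1}^{K_B} f_{Bk}(X \mid \theta_t, \alpha_k, \beta_k) \\
& \times \prod_{k=1}^{K_N} f_{Nk}(X \mid \theta_t, \alpha_k) \\
& \times \prod_{k=1}^{K_C} f_{Ck}(X \mid \theta_t, \alpha_k, \beta_k, r_k) \\
& \times \prod_{k=1}^{K_P} f_{Pk}(X \mid \theta_t, \alpha_k, \beta_k, \phi_k),
\end{aligned}
\tag{15}
$$
∀t ∈ {1, 2, . . . , T}. In order to assess the convergence of the MCMC algorithm, I run multiple chains and examine
their traceplots. I also consider the values of the R̂ statistics to diagnose convergence (Gelman and Rubin 1992).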
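For readers unfamiliar with the R̂ statistic, the following Python sketch computes the basic potential scale reduction factor for a single parameter from several chains of equal length. It is the textbook form of the Gelman and Rubin diagnostic, offered as an assumption about what is being computed; software such as Stan may use a refined variant, and the function name is invented for this example.

```python
from statistics import mean, variance

def r_hat(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of equal-length lists of posterior draws, one list per chain.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])     # W: average within-chain variance
    between = n * variance(chain_means)              # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n    # pooled estimate of the posterior variance
    return (var_plus / within) ** 0.5

# Chains that sit in different regions give a value far above 1.
separated = r_hat([[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]])
```

Values close to 1 indicate that the between-chain variation is small relative to the within-chain variation, which is the pattern expected when the chains are sampling from the same distribution.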
TSIRT has some limitations. First, like IRT, TSIRT requires that the items are locally independent conditional
on the latent variable. If this assumption is not true, then neither the estimates of item parameters nor the estimates
of the latent variable are accurate.7 Second, item parameters are estimated as if they are fixed over time, when in
fact it is quite plausible that the meaning and implications of an item change from the beginning to the end of
the time-series. Finally, the reliance on MCMC to estimate the model implies that a researcher can never conclude
with absolute certainty that the algorithm has converged to the posterior distribution. However, if
modern convergence diagnostics all agree that the chains have converged, then it would be highly unlikely that the
chains remain unconverged.
4 Demonstration on Simulated Data
In order to demonstrate the performance of TSIRT, I generate a dataset that consists of binary, continuous, count,
and proportion variables from a data generating process in which every variable depends on a variable θt , generated
from an integrated random walk. My goal is to show that TSIRT can accurately measure values of the latent
variable and correctly estimate item parameters for each type of variable.
I set the number of time points to be T = 300, which would be the number of cases in a time-series observed
monthly for 25 years. I generate values of a latent variable θt from
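$$\theta_1 = \varepsilon_1, \qquad \theta_t = \theta_{t-1} + \varepsilon_t, \quad \forall t \in \{2, \ldots, T\}, \tag{16}$$
where
$$\varepsilon_t \sim N(0, \sigma^2), \quad \forall t. \tag{17}$$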
The variance of the random walk σ 2 is set to 1 in this simulation.
I generate 5 binary items, 5 continuous items, 5 count items, and 5 proportion items, all using θt as a parameter to
link these items together. The binary items are randomly generated from the Bernoulli distribution with probability
parameter πtk as given in equation 10. The continuous items are generated conditional on θt and αk from the normal
distribution in equation 9. The count items are generated conditional on rk and πtk by the negative binomial
distribution given in equation 11. The proportion items are generated from the beta distribution conditional on πtk
and φk in equation 14.
For this simulation, the true values of the item parameters are randomly generated. The discrimination parameters αk are drawn from the uniform[.2, 1] distribution for the binary, count, and proportion items, and from the
uniform[.5, 1.5] distribution for the continuous items. The difficulty parameters are drawn from the uniform[−1, 1]
distribution for the binary, count, and proportion items. The rk parameters for the count items are drawn from the
uniform[1, 8] distribution, but are rounded up to the nearest integer. The φk parameters for the proportion items
are drawn from the uniform[0, 100] distribution.
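The data generating process described above can be sketched with the Python standard library alone. This is a simplified illustration, not the simulation code used for the paper: it generates one item of each type instead of five, the function and dictionary names are invented, the linear predictor is clamped to [−20, 20] as a numerical safeguard that the paper does not mention, and negative binomial draws are produced by summing geometric draws obtained by inversion.

```python
import math
import random

def logistic(theta, alpha, beta):
    """Equation 10, with the linear predictor clamped (safeguard added for this sketch)."""
    z = max(-20.0, min(20.0, alpha * (theta - beta)))
    return 1.0 / (1.0 + math.exp(-z))

def draw_neg_binomial(rng, pi, r):
    """Positive draws (probability pi each) observed before the r-th negative draw."""
    total = 0
    for _ in range(r):
        u = 1.0 - rng.random()                     # uniform on (0, 1]
        total += int(math.log(u) / math.log(pi))   # geometric count by inversion
    return total

def simulate(T=300, sigma2=1.0, seed=1):
    rng = random.Random(seed)
    # Equations 16 and 17: theta_1 = eps_1 and theta_t = theta_{t-1} + eps_t.
    theta, level = [], 0.0
    for _ in range(T):
        level += rng.gauss(0.0, math.sqrt(sigma2))
        theta.append(level)
    # True item parameters, drawn from the uniform ranges given in the text.
    params = {
        "binary": {"alpha": rng.uniform(0.2, 1.0), "beta": rng.uniform(-1.0, 1.0)},
        "continuous": {"alpha": rng.uniform(0.5, 1.5)},
        "count": {"alpha": rng.uniform(0.2, 1.0), "beta": rng.uniform(-1.0, 1.0),
                  "r": math.ceil(rng.uniform(1.0, 8.0))},
        "proportion": {"alpha": rng.uniform(0.2, 1.0), "beta": rng.uniform(-1.0, 1.0),
                       "phi": rng.uniform(0.0, 100.0)},
    }
    items = {"binary": [], "continuous": [], "count": [], "proportion": []}
    for th in theta:
        p = params["binary"]
        items["binary"].append(1 if rng.random() < logistic(th, p["alpha"], p["beta"]) else 0)
        p = params["continuous"]
        items["continuous"].append(rng.gauss(th, p["alpha"]))
        p = params["count"]
        items["count"].append(draw_neg_binomial(rng, logistic(th, p["alpha"], p["beta"]), p["r"]))
        p = params["proportion"]
        pi = logistic(th, p["alpha"], p["beta"])
        items["proportion"].append(rng.betavariate(pi * p["phi"], (1.0 - pi) * p["phi"]))
    return theta, params, items

theta, params, items = simulate()
```

Each list in `items` has one entry per time point, so the four lists together form the columns of a (T × 4) data matrix of the kind TSIRT takes as input.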
I estimate TSIRT on the simulated data described above using a fully Bayesian algorithm (Stan Development
Team 2013). I run three chains of the algorithm for 1000 iterations each, using the first 500 iterations of each chain
as a burn-in and keeping the last 500 iterations. The R̂ statistic for each parameter is within .01 of 1, which suggests
that the chains have converged to the posterior distribution.8 The true values of θt are graphed over t in
figure 5, along with a grey area that represents the 95% credible interval for each estimated value of θt . TSIRT
does an excellent job of returning the true values of the latent variable, with only a small amount of noise around
the estimate at each time point.
Figure 5: Estimated and True Values of the Latent Integrated Time-Series. [Horizontal axis: t; vertical axis: θt. Legend: True Value; 95% Credible Interval for Estimate.]
7 Some work has been done to identify testlets – clusters of locally dependent items – and to correct for this dependence with
parametric and nonparametric approaches (e.g. Jannarone 1997). This methodology is not currently implemented for TSIRT, however.
The item parameter estimates are illustrated in figure 6. 20 estimates of αk are listed on the left: one for each
of the 5 binary, 5 continuous, 5 count, and 5 proportion items. The estimate of σ 2 is listed below the estimates
of αk . 15 estimates of βk are listed on the right, excluding estimates for the 5 continuous items. Finally, the 5
estimates of rk and φk are listed below the estimates of βk . The dots represent the posterior mean of each estimate
and the horizontal lines through the dots represent the 95% credible interval for each estimate. The crosses denote
the true value of each parameter. TSIRT returns the item parameters accurately as well: the 95% credible interval
of every parameter except for one contains the true value. The results suggest that TSIRT may be more suited
to estimating α parameters than the other parameters, and is particularly suited to estimate the discrimination
parameters of count and proportion items, but more simulated trials are necessary to confirm this observation.
5 Example 1: the Israeli-Palestinian Conflict, 1990-2004
The examples presented here demonstrate the utility of TSIRT for real applications in political science research,
but are not intended to be completed research projects.
Many empirical studies in international relations use dyadic data in which each observation considers the relationship between two distinct political actors, often countries. Dyadic data contain several variables that characterize the
relationship between the two actors, and often dyadic data also contain multiple time points. Hoff and Ward (2004)
point out that many methodological approaches adopted by researchers with time-series cross-sectional dyadic data
use both the cross-sectional and temporal variation to draw inferences. They warn, however, that these approaches
make inappropriate assumptions about the data: it may not be appropriate to assume that dyads are independently
drawn, particularly if two dyads involve the same country; and it is not appropriate to assume that the actions
taken within a dyad at one time point are independent from the actions taken within the dyad at previous time
points.
One way to avoid a series of suspicious assumptions about the independence of dyads is to leverage variation
over time to consider individual dyads. TSIRT can be used to analyze a particularly important relationship between
two countries without making an incorrect assumption that the relationship at a point in time is independent from
the relationship at another point in time.
Figure 6: Estimated Item Parameters vs. True Values. [Panels: α and β by item type (Binary, Continuous, Count, Proportion), r for the count items, φ for the proportion items, and σ²; horizontal axes: Parameter Value.]
8 The same convergence properties are true for both examples described below.
One important dyad in international politics is Israel and Palestine. For many years the relationship between
Israel and Palestine has involved ebbs and flows in the amount of conflict exhibited between the two sides, and in the
efforts to achieve peace. Both “peace” and “conflict” are latent variables in that they are not directly observable,
but have many observable implications. There are increasing efforts towards peace when the actors are engaged in
meetings, agreeing to frameworks for future actions or meetings, removing occupying forces, and easing sanctions.
Escalations of conflict involve rejected proposals, hostile rhetoric, military raids and occupations, arrests, violent
clashes, shootings, physical assaults, grenade attacks, and suicide bombings. The latent peace and conflict variables
are not perfectly negatively correlated: at times conflicts increase while efforts for peace decrease, and vice versa,
but at other times conflicts and peace efforts increase or decrease together.
Efforts for peace become necessary in response to escalations of conflict. On the other hand, a large literature in
international relations suggests that certain actors within Israel and Palestine benefit from increased conflict, and
attempt to spoil peace efforts (Stedman 2001, Kaufman 2006, Greenhill and Major 2007, Sheafer and Dvir-Gvirsman
2010). It is an open question, therefore, as to whether peace efforts respond to conflict, or conflicts arise as part
of deliberate efforts by certain parties to derail ongoing peace efforts. In this application, I use TSIRT to measure
latent peace and conflict indices for Israel and Palestine, and I consider the question of whether peace succeeds or
precedes conflict using a Granger causality test.
The implications of peace efforts and escalations of conflict are events that are directly observable. I use the
subset of the 10 Million International Dyadic Events dataset (King 2003) that contains events that involve both
Israel and Palestine. The data cover the years 1990-2004. The events are coded daily, and I take the sum of the
occurrence of each event within each quarter, so that the data contain T = 60 time points. All of the variables are
counts, so TSIRT employs negative binomial test curves. Of the events that occur within this dyad, several are
direct implications of peace efforts, listed in the top section of table 1, and several are direct implications of conflict
escalation, listed in the bottom section of table 1.
Figure 7: Mean Estimates of Peace and Conflict Indices, 1990-2004. [Series: Peace Index, Conflict Index. Horizontal axis: Year and Quarter; vertical axis: Index.]
The indices are graphed for every quarter between 1990 and 2004 in figure 7.9 The peace index achieves its
highest values in the early 1990s, with a brief decline that roughly corresponds to the first Gulf War. Another
decline in the peace index occurs in 1996 after the assassination of Yitzhak Rabin and the appointment of Benjamin
Netanyahu to his first term as Israel’s prime minister. The peace index climbs in the late 1990s during the talks
leading to the Oslo II accord, and falls again during the Second Intifada. Correspondingly, the conflict index falls
immediately after the Oslo II accords but spikes dramatically during the Second Intifada.
In order to assess whether peace tends to respond to conflict, or vice versa, I conduct a Granger causality test
on the means of the two indices (Granger 1969). I control for 10 lags in order to allow 2.5 years for the effects of
peace efforts to be observed. The results are listed in table 2. In the left-hand column, I test whether the lags of
9 The
credible intervals are not graphed in order to make the comparison of the two trends as clear as possible.
10
Table 1: Indicators of Efforts Towards Peace, Escalations of Conflicts.

Efforts Towards Peace
Indicator                              Direction    Mean   Min   Max
Accept a proposal for future action    ISR → PAL     3.2     0    28
Accept a proposal for future action    PAL → ISR     1.9     0    14
Discussions or meetings                ISR → PAL    10.3     0    43
Discussions or meetings                PAL → ISR    10.3     0    43
Demobilize armed forces                ISR → PAL     1.3     0    27
Ease sanctions                         PAL → ISR     1.7     0     8
Travel to meet                         ISR → PAL     1.7     0    20
Travel to meet                         PAL → ISR     1.1     0    11
Yield control of a location            ISR → PAL     1.0     0    13

Escalations of Conflict
Indicator                              Direction    Mean   Min   Max
Military occupation                    ISR → PAL     1.4     0    14
Arrest and detention                   ISR → PAL     1.5     0     7
Criticize or blame                     ISR → PAL     3.4     0    15
Criticize or blame                     PAL → ISR     8.3     0    30
Military clash                         ISR → PAL     2.7     0    48
Military clash                         PAL → ISR     1.3     0    13
Grenade/RPG use                        ISR → PAL     1.5     0    10
Physical assault                       ISR → PAL     2.5     0    16
Physical assault                       PAL → ISR     2.3     0    11
Shooting                               ISR → PAL    25.4     0   122
Shooting                               PAL → ISR     4.6     0    35
Political arrests and detentions       ISR → PAL     1.1     0     6
Military raid                          ISR → PAL     5.5     0    42
Reject proposal                        PAL → ISR     1.1     0     4
Suicide bombing                        PAL → ISR     1.1     0    23
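Table 2: Granger Causality Tests: Israel-Palestinian Peace Index vs. Conflict Index.

                     Conflict → Peace                                                                                                   Peace → Conflict
Null Model           $P_t = \alpha + \sum_{s=1}^{10} \beta_s P_{t-s} + \varepsilon_t$                                                   $C_t = \alpha + \sum_{s=1}^{10} \beta_s C_{t-s} + \varepsilon_t$
Alternative Model    $P_t = \alpha + \sum_{s=1}^{10} \beta_s P_{t-s} + \sum_{s=1}^{10} \delta_s C_{t-s} + \varepsilon_t$                $C_t = \alpha + \sum_{s=1}^{10} \beta_s C_{t-s} + \sum_{s=1}^{10} \delta_s P_{t-s} + \varepsilon_t$
F(10, 39)            1.834                                                                                                              0.076
p                    0.099                                                                                                              1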
the conflict index jointly have a significant effect on the peace index, after controlling for 10 lags of the peace index.
The result is marginally significant, which may suggest that the previous 2.5 years of variation in conflict has a real
effect on current peace efforts after controlling for the variation of the peace index over the same timeframe. On
the other hand, the symmetric test that conflict responds to peace yields a highly insignificant result. This analysis
suggests that peace efforts are natural responses to recent conflict, and it does not suggest the presence of spoilers.
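The nested-model comparison behind a Granger test can be sketched in a few lines. The following is a minimal illustration, not the code used for table 2: the two series are simulated (hypothetical data in which conflict is built to drive peace), only 2 lags are used to keep the example small, and the degrees-of-freedom bookkeeping is one common convention that may differ from the one behind the table. The function names `solve`, `rss`, and `granger_f` are introduced here for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def granger_f(y, x, lags):
    """F statistic for H0: lags of x add nothing to an autoregression of y."""
    rows = range(lags, len(y))
    target = [y[t] for t in rows]
    # Null model: intercept plus own lags. Alternative: add lags of x.
    null_X = [[1.0] + [y[t - s] for s in range(1, lags + 1)] for t in rows]
    alt_X = [r + [x[t - s] for s in range(1, lags + 1)] for r, t in zip(null_X, rows)]
    rss0, rss1 = rss(null_X, target), rss(alt_X, target)
    df2 = len(target) - (2 * lags + 1)  # observations minus alternative-model parameters
    return ((rss0 - rss1) / lags) / (rss1 / df2), lags, df2

# Hypothetical data: conflict follows an AR(1); peace responds to lagged conflict.
rng = random.Random(1)
conflict, peace = [0.0], [0.0]
for _ in range(199):
    c_prev, p_prev = conflict[-1], peace[-1]
    conflict.append(0.5 * c_prev + rng.gauss(0, 1))
    peace.append(0.5 * p_prev + 0.8 * c_prev + rng.gauss(0, 1))

f_cp, q, df = granger_f(peace, conflict, lags=2)   # conflict → peace
f_pc, _, _ = granger_f(conflict, peace, lags=2)    # peace → conflict
print(f"conflict → peace: F({q}, {df}) = {f_cp:.2f}")
print(f"peace → conflict: F({q}, {df}) = {f_pc:.2f}")
```

A p-value can be obtained from the F distribution with (q, df) degrees of freedom, for example with `scipy.stats.f.sf(F, q, df)` if SciPy is available.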
6
Example 2: US Economic Performance Since 1978
In this example I focus on interpretations of the item parameters. The overall performance of the U.S. economy is
a latent variable, but it is hardly a secret. The recession of late 2008 is evident in every metric used to evaluate the
economy. In this example, I consider five indicators measured quarterly from 1978 through the first quarter of 2013:
GDP growth, the consumer sentiment index, the percent change in the S&P 500 index, the unemployment rate,
and the percent change in housing starts.^10 Summary statistics for these items as well as their sources are listed
in table 3. I standardize these variables, and I estimate the economic performance index using these standardized
variables as continuous items.
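The standardization step, including the reverse coding applied to the unemployment rate, can be sketched as follows. This is a minimal illustration under stated assumptions: the input numbers are hypothetical rather than the series in table 3, the helper name `standardize` is introduced here, and the population standard deviation is used because the text does not say which variant was applied.

```python
from statistics import mean, pstdev

def standardize(values, reverse=False):
    """Return z-scores; optionally multiply by -1 first so that higher means better."""
    coded = [-v for v in values] if reverse else list(values)
    m, s = mean(coded), pstdev(coded)  # assumption: population standard deviation
    return [(v - m) / s for v in coded]

# Hypothetical quarterly unemployment rates (illustrative numbers only).
unemployment = [5.0, 6.0, 9.0, 8.0]
z = standardize(unemployment, reverse=True)
print([round(v, 3) for v in z])
```

After reverse coding, the quarter with the lowest unemployment rate receives the highest standardized score, so all five items point in the same direction.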
Table 3: Indicators of U.S. Quarterly Economic Performance, 1978-2013.

Indicator                   Mean     Best               Worst               Source
GDP Growth                  2.71     16.7 (1978, Q2)    -8.9 (2008, Q4)     Bureau of Economic Analysis (2013)
Consumer Sentiment Index    85.3     110.1 (2000, Q1)   51.1 (1980, Q2)     Thomson Reuters and the University of Michigan (2013)
S&P 500, % Change           2.20     20.2 (1982, Q4)    -27.2 (2008, Q4)    Federal Reserve Bank of St. Louis (2013a)
Unemployment Rate           6.42     3.9 (2000, Q4)     10.7 (1982, Q4)     Bureau of Labor Statistics (2013)
Housing Starts, % Change    -0.18    31.5 (1980, Q3)    -23.1 (2008, Q4)    Federal Reserve Bank of St. Louis (2013b)
Figure 8: Estimated Economic Performance Index, 1978-2013.
[Plot omitted: Mean Estimate with 95% Credible Interval (vertical axis: Index, −0.15 to 0.10) by Year and Quarter, starting in 1978.]
The economic performance index is illustrated as a function of time in figure 8. For every quarter from 1978
through the first quarter of 2013, the mean is plotted as well as a grey region that represents the 95% credible
interval around this mean. This index captures the recessions of the early 1980s, the stock market crash of 1987,
the recession of the early 1990s, the burst of the “dot-com” bubble in the early 2000s, and the recession of 2008.
10 Since low values of the unemployment rate indicate a stronger economy, I reverse code the unemployment rate by multiplying it by
-1 before standardizing the variable.
Figure 9: Item Parameters for the Economic Performance Index.
[Plot omitted: five panels showing the density f(X | θt) against X − θt, one per item: GDP Growth (α1 = 0.93), Consumer Sentiment Index (α2 = 0.43), S&P 500 (α3 = 1.11), Unemployment (α4 = 0.75), Housing Starts (α5 = 1.16).]
A secondary but important use of TSIRT is the estimation of item parameters. For continuous items, the
only item parameter represents item discrimination, which is another way to describe the fit of the item to the
latent variable in question. The item with the smallest variance has the best fit, while the item with the highest
variance also has the greatest amount of unique variance unexplained by the latent variable. In this example,
the item parameters suggest which items are the best indicators of the overall performance of the economy. The
discrimination parameters are illustrated in figure 9.
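To make the link between the discrimination parameter and item fit concrete, the sketch below ranks the five items by the values reported in figure 9 and evaluates the height of each item's density at X = θt. It treats each αj as the standard deviation of a normal density centered on θt, which is an assumption made here for illustration only; the names `alphas`, `normal_density`, `ranked`, and `peaks` are likewise introduced for this example.

```python
import math

# Discrimination parameters as reported in figure 9.
alphas = {
    "GDP Growth": 0.93,
    "Consumer Sentiment Index": 0.43,
    "S&P 500": 1.11,
    "Unemployment": 0.75,
    "Housing Starts": 1.16,
}

def normal_density(x, theta, sd):
    """Normal density of x given latent value theta and spread sd (assumed form)."""
    z = (x - theta) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Smaller spread means a tighter density around theta, i.e. a better-fitting item.
ranked = sorted(alphas, key=alphas.get)
peaks = {item: normal_density(0.0, 0.0, a) for item, a in alphas.items()}

for item in ranked:
    print(f"{item:26s} alpha = {alphas[item]:.2f}  density at X = theta: {peaks[item]:.3f}")
```

Under this reading, the item listed first has the tightest density around the latent variable and the item listed last has the most unexplained variation.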
The item that best fits the underlying measure of economic performance is the consumer sentiment index, which
is a score compiled from 500 telephone interviews that focus on judging consumers’ level of economic optimism
(Thomson Reuters and the University of Michigan 2013). The result is telling in light of prior research that shows
that consumer confidence is a primary driving component of fluctuations in the economy (Matsusaka and Sbordone
1995). Furthermore, MacKuen, Erikson, and Stimson (1992) have shown that voters’ prospective attitudes towards
the economy explain the variation in presidential election outcomes. Since the consumer sentiment index is centrally
associated with overall economic performance, this result reinforces the connection between economic performance
and voting in presidential elections.
7
Conclusion
TSIRT estimates a latent variable for one important case in which relevant items are observed repeatedly at several
time points. This methodology departs from IRT and other measurement techniques in that it does not require any
cross-sectional information to identify item parameters or values of the latent variable. TSIRT therefore provides
a means to address an important but previously neglected class of measurement problems, and can be used for
a variety of applications in political science and other social sciences. TSIRT relies on the same assumptions
as standard applications of IRT: in particular, TSIRT assumes that items are locally independent conditional on
the latent variable. Most implementations of IRT use a prior distribution for the latent variable that results in
the even stronger assumption that these values are independent in their posterior distribution as well. TSIRT,
in contrast, uses an integrated time-series for the prior distribution of the latent variable, and therefore models
temporal dependence explicitly. In addition, TSIRT accommodates many data structures, and is demonstrated here
to process categorical, continuous, count, and proportion variables accurately.
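The data structure that TSIRT targets can be illustrated with a small simulation. The sketch below is a simplified stand-in rather than the estimation procedure itself: it assumes the integrated prior is a Gaussian random walk, θt = θt−1 + εt, and that each continuous item is normally distributed around θt with its own standard deviation. The function name `simulate_latent_and_items` and all parameter values are hypothetical.

```python
import random

def simulate_latent_and_items(n_periods, item_sds, innovation_sd=1.0, seed=0):
    """Simulate a random-walk latent series and continuous items observed around it."""
    rng = random.Random(seed)
    theta, level = [], 0.0
    for _ in range(n_periods):
        level += rng.gauss(0.0, innovation_sd)  # integrated series: cumulative sum of shocks
        theta.append(level)
    # One row per time point, one column per item; items are independent given theta.
    items = [[rng.gauss(th, sd) for sd in item_sds] for th in theta]
    return theta, items

# Hypothetical setup: 60 quarters and three items of differing precision.
theta, items = simulate_latent_and_items(60, item_sds=[0.5, 1.0, 1.5])
print(len(theta), len(items), len(items[0]))
```

Because each θt depends on θt−1, neighboring time points are correlated by construction, which is the serial dependence that an independent prior on the latent variable would ignore.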
The methodology underlying TSIRT can be developed to account for several violations of its assumptions. If
items remain locally dependent even after conditioning on the latent variable, it may be possible to group dependent
items into blocks using the same methods that educational test researchers use to create testlets. The model
can also be extended to estimate a multidimensional latent variable. TSIRT can be rewritten to use alternative
parametric distributions for different types of variables, alternative link functions, or nonparametric distributions.
Time dependent formulations of the item distributions can allow estimates of item parameters to change over
time. Finally, although TSIRT assumes that the latent time-series variable is an integrated time-series, alternative
ARIMA models or nonparametric time-series models may be more appropriate for certain situations. All of these
topics are areas for future research.
References
Bailey, Michael A. 2007. “Comparable Preference Estimates across Time and Institutions for the Court, Congress,
and Presidency.” American Journal of Political Science. 51(3): 433-448.
Bailey, Michael A. and Forrest Maltzman. 2008. “Does Legal Doctrine Matter? Unpacking Law and Policy Preferences on the U.S. Supreme Court.” American Political Science Review. 102(3): 369-384.
Birnbaum, Allan. 1968. “Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability.” In Frederick
M. Lord and Melvin R. Novick, Eds., Statistical Theories of Mental Test Scores. p. 395-479. Reading, MA:
Addison Wesley.
Bock, R. Darrell and Murray Aitkin. 1981. “Marginal Maximum Likelihood Estimation of Item Parameters: Application of an EM Algorithm.” Psychometrika. 46(4): 443-459.
Bock, R. Darrell and Marcus Lieberman. 1970. “Fitting a Response Model for n Dichotomously Scored Items.”
Psychometrika. 35(2): 179-197.
Bureau of Economic Analysis. 2013. “Gross Domestic Product: Percent Change from Preceding Period.” National
Economic Accounts. <http://www.bea.gov/national/index.htm>. Accessed 20 June 2013.
Bureau of Labor Statistics. 2013. “Labor Force Statistics from the Current Population Survey: (Seas) Unemployment Rate.” Databases, Tables & Calculators by Subject. <http://data.bls.gov/timeseries/LNS14000000>.
Accessed 20 June 2013.
Cameron, A. Colin and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press.
Carroll, Royce et al. 2009. “Measuring Bias and Uncertainty in DW-NOMINATE Ideal Point Estimates via the
Parametric Bootstrap.” Political Analysis. 17(2): 261-275.
Clinton, Joshua, Simon Jackman, and Douglas Rivers. 2004. “The Statistical Analysis of Roll Call Data.” American Political Science Review. 98(2): 355-370.
Federal Reserve Bank of St. Louis. 2013a. “S&P 500 Stock Price Index (SP500).” FRED Economic Data.
<http://research.stlouisfed.org/fred2/series/SP500/downloaddata>. Accessed 20 June 2013.
Federal Reserve Bank of St. Louis. 2013b. “Housing Starts: Total: New Privately Owned Housing Units Started
(HOUST).” FRED Economic Data. <http://research.stlouisfed.org/fred2/series/HOUST/downloaddata?cid=32302>.
Accessed 20 June 2013.
Fraley, R. Chris, Niels G. Waller, and Kelly A. Brennan. 2000. “An Item Response Theory Analysis of Self-Report
Measures of Adult Attachment.” Journal of Personality and Social Psychology. 78(2): 350-365.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis. Second Ed.
Texts in Statistical Science. New York: Chapman and Hall.
Gelman, Andrew and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.”
Statistical Science. 7(4): 457-511.
Granger, C. W. J. 1969. “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods.”
Econometrica. 37(3): 424-438.
Gray-Little, Bernadette, Valerie S. L. Williams, and Timothy D. Hancock. 1997. “An Item Response Theory
Analysis of the Rosenberg Self-Esteem Scale.” Personality and Social Psychology Bulletin. 23(5): 443-451.
Greenhill, Kelly M. and Solomon Major. 2007. “The Perils of Profiling: Civil War Spoilers and the Collapse of
Intrastate Peace Accords.” International Security. 31(3): 7-40.
Hoff, Peter D. and Michael D. Ward. 2004. “Modeling Dependencies in International Relations Networks.” Political
Analysis. 12(1): 160-175.
Jannarone, Robert J. 1997. “Models for Locally Dependent Responses: Conjunctive Item Response Theory.” In
van der Linden, Wim and Ronald K. Hambleton, Handbook of Modern Item Response Theory. p. 465-480. New
York: Springer.
Kaufman, Stuart J. 2006. “Escaping the Symbolic Politics Trap: Reconciliation Initiatives and Conflict Resolution
in Ethnic Wars.” Journal of Peace Research. 43(2): 201-218.
King, Gary. 2003. 10 Million International Dyadic Events. <http://hdl.handle.net/1902.1/FYXLAWZRIA>. Accessed 19 June 2013.
Lawley, D. N. 1943. “On the Problems Connected with Item Selection and Test Construction.” Proceedings of the
Royal Society of Edinburgh. 61(3): 273-287.
McDonald, Roderick P. 1997. “Normal-Ogive Multidimensional Model.” In van der Linden, Wim and Ronald K.
Hambleton, Handbook of Modern Item Response Theory. p. 258-270. New York: Springer.
MacKuen, Michael B., Robert S. Erikson, and James A. Stimson. 1992. “Peasants or Bankers? The American
Electorate and the U.S. Economy.” American Political Science Review. 86(3): 597-611.
Matsusaka, John G. and Argia M. Sbordone. 1995. “Consumer Confidence and Economic Fluctuations.” Economic
Inquiry. 33(2): 296-318.
Martin, Andrew D. and Kevin M. Quinn. 2002. “Dynamic Ideal Point Estimation Via Markov Chain Monte Carlo
for the U.S. Supreme Court, 1953-1999.” Political Analysis. 10(2): 134-153.
Mokken, Robert J. 1997. “Nonparametric Models for Dichotomous Responses.” In van der Linden, Wim and
Ronald K. Hambleton, Handbook of Modern Item Response Theory. p. 351-368. New York: Springer.
Montgomery, Jacob M. and Joshua Cutler. 2013. “Computerized Adaptive Testing for Public Opinion Surveys.”
Political Analysis. 21(2): 141-171.
Orlando, Maria, Cathy Donald Sherbourne, and David Thissen. 2000. “Summed-Score Linking Using Item Response Theory: Application to Depression Measurement.” Psychological Assessment. 12(3): 354-359.
Thomson Reuters and the University of Michigan. 2013. Index of Consumer Sentiment. <http://www.sca.isr.umich.edu/dataarchive/mine.php>. Accessed 20 June 2013.
Sheafer, Tamir and Shira Dvir-Gvirsman. 2010. “The Spoiler Effect: Framing Attitudes and Expectations Toward
Peace.” Journal of Peace Research. 47(2): 205-215.
Shor, Boris and Nolan McCarty. 2011. “The Ideological Mapping of American Legislatures.” American Political
Science Review. 105(3): 530-551.
Stan Development Team. 2013. Stan: Version 1.1. <https://github.com/stan-dev/stan>.
Stedman, Stephen John. 2001. Implementing Peace Agreements in Civil Wars: Lessons and Recommendations for
Policymakers. IPA Policy Paper Series on Peace Implementation. New York: International Peace Academy.
Thurstone, L. L. 1927. “A Law of Comparative Judgement.” Psychological Review. 34(2): 273-286.
Treier, Shawn and Simon Jackman. 2008. “Democracy as a Latent Variable.” American Journal of Political Science. 52(1): 201-217.
van der Linden, Wim and Ronald K. Hambleton. 1997. “Item Response Theory: Brief History, Common Models,
and Extensions.” In van der Linden, Wim and Ronald K. Hambleton, Handbook of Modern Item Response
Theory. p. 1-28. New York: Springer.
Voeten, Erik. 2004. “Resisting the Lonely Superpower: Responses of States in the United Nations to U.S. Dominance.” The Journal of Politics. 66(3): 729-754.