Dynamic Measurement of Political Phenomena: Item Response Theory for Time-Series Data

Jonathan Kropko
Columbia University
July 11, 2013

Abstract. Item response theory (IRT) is a method for estimating a latent variable from a series of items with a shared covariance. Current applications of IRT require a large enough cross-sectional sample to identify item parameters and values of the latent variable. As a result, IRT can only be run on data that contain a prominent cross-sectional component. I develop a method called time-series item response theory (TSIRT) that estimates a latent variable and item parameters for a single important case when the items are observed at several points in time. Time points replace the cross-sectional sample, and the estimated latent variable is a time-series rather than a variable with a distribution in a population. TSIRT specifies an integrated time-series to be the prior distribution of the latent variable so that the serial dependence in the latent time-series is explicitly modeled. TSIRT is demonstrated to provide close estimates of a simulated latent variable using binary, count, proportion, and continuous items, and is used to examine two substantive examples: the dynamics of the Israeli-Palestinian conflict from 1990-2004, and the overall performance of the US economy since 1978.

1 Introduction

Item response theory (IRT) is an important tool for measuring a latent variable. The method was first developed in psychometrics (Thurstone 1927, Lawley 1943) to measure unobservable individual traits with many observable associated behaviors, and has been used to study self-esteem (Gray-Little, Williams, and Hancock 1997), depression (Orlando, Sherbourne, and Thissen 2000), and attachment anxiety (Fraley, Waller, and Brennan 2000) among other cognitive disorders.
Another canonical application of IRT is educational testing, in which the students’ latent ability scores are estimated from their observed responses to the test items.1 Recently, IRT has been applied more broadly across the health and social sciences: in political science, IRT has been used to estimate the ideological ideal points of members of Congress (Clinton, Jackman, and Rivers 2004), of Supreme Court Justices (Martin and Quinn 2002, Bailey and Maltzmann 2008), of state legislators (Shor and McCarty 2011), and member states in the UN (Voeten 2004). The method has also been applied to study the cross-national variation in democracy (Treier and Jackman 2008), and underlies the method of computerized adaptive testing that is used to assess performance on the GRE and GMAT tests, and has been proposed as an efficient means to conduct a public opinion survey (Montgomery and Cutler 2013). IRT is often preferred by researchers to simpler measurements such as the sum of correct responses on a test because different items provide different levels of information about the latent variable. On an educational test, the standard 2-parameter implementation of IRT (Birnbaum 1968) estimates the difficulty of each test item, and the extent to which the item accurately discriminates the high-ability students from the low-ability students. Likewise, some votes are more informative than others. IRT therefore provides more accurate indices for latent political variables than naive metrics that treat items with equal weight. Most applications of IRT require a large sample size to estimate two parameters per item with enough precision to provide useful estimates of the latent variable. In addition, most uses of IRT are on cross-sectional data. Very few applications have attempted to estimate a dynamic latent variable, but Martin and Quinn’s (2002) analysis of Supreme Court ideal points is an important exception. Even Martin and Quinn, however, utilize cross-sectional variation. 
In this paper I propose a version of IRT to measure a latent variable that is strictly a time-series. This method, called time-series item response theory (TSIRT), works with data that describe only a single case but contain several variables that are recorded at many points in time. As with any measurement technique, these variables should have shared covariance; that is, they should all be implications of the latent time-series variable to be measured. In addition, while most implementations of IRT work exclusively with binary or categorical items, TSIRT can produce a latent variable from combinations of continuous, count, and proportion (real values bounded within [0, 1]) variables as well as binary and categorical ones, while estimating item parameters for each variable type. This method can be applied to a large and previously neglected class of measurement problems in which the focus is on the dynamics of a particularly important case.

Below, I describe the formulation of TSIRT and how it builds on other versions of IRT and similar measurement methods in political science. I demonstrate the performance of the algorithm on simulated data, and I apply the method to two examples: an analysis of the dynamics of the Israeli-Palestinian conflict from 1990-2004, and a measurement of the overall health of the US economy since 1978.

1 For IRT, the word “item” is synonymous with “observed variable.”

2 Background

The classic 2-parameter logit (2PL) version of IRT (Bock and Lieberman 1970) works with binary items and jointly estimates a latent ability parameter for each individual $i \in \{1, 2, \ldots, N\}$, denoted $\theta_i$, as well as a discrimination parameter for each item $k \in \{1, 2, \ldots, K\}$, denoted $\alpha_k$, and a difficulty parameter for each item, denoted $\beta_k$. Let $X$ be an $(N \times K)$ matrix in which the columns are binary variables.
The probability of a correct response is logistic conditional on $\theta_i$, $\alpha_k$, and $\beta_k$:

$$P(X_{ik} = 1 \mid \theta_i, \alpha_k, \beta_k) = f(\theta_i \mid \alpha_k, \beta_k) = \frac{1}{1 + \exp(-\alpha_k(\theta_i - \beta_k))}, \qquad (1)$$

where $f(\theta_i \mid \alpha_k, \beta_k)$ is sometimes referred to as a test curve for the item. The discrimination parameter indicates the certainty about the predicted value of the item conditional on $\theta_i$. Two examples of test curves are illustrated in figure 1. Each item has a difficulty parameter of 0, indicating that $P(X_{ik} = 1) = 0.5$ when $\theta_i = 0$. The item on the left has a high discrimination parameter $\alpha = 3$ and the item on the right has a lower discrimination parameter $\alpha = 0.8$. An individual with a latent ability of 1 has a 0.95 probability of a positive response for $X_i$ if $\alpha = 3$, but this probability is only 0.69 if $\alpha = 0.8$. Discrimination parameters are therefore analogous to loadings in a factor analysis: large values indicate that the variation of the item is mostly explained by the latent variable, and low values indicate that the item is potentially inappropriate to use to estimate the latent variable.

[Figure 1: Two Logistic Items – One That Fits Well (Left) and One That Fits Less Well (Right). Left panel: α = 3; right panel: α = 0.8. Axes: θ (horizontal), Pr(Y=1) (vertical).]

The likelihood function for the item parameters conditional on $\theta$ is

$$L(\alpha, \beta \mid \theta, X) = \prod_{i=1}^{N} \prod_{k=1}^{K} f(\theta_i \mid \alpha_k, \beta_k)^{X_{ik}} \left(1 - f(\theta_i \mid \alpha_k, \beta_k)\right)^{1 - X_{ik}} \qquad (2)$$

(van der Linden and Hambleton 1997, p. 14). The values of $\theta_i$ are estimated through a Bayesian updating process. Given a marginal prior distribution $P(\theta_i)$ for each value of the latent variable, the posterior distribution of $\theta_i$ is

$$P(\theta_i \mid \alpha, \beta, X_i) \propto P(\theta_i) \prod_{k=1}^{K} f(\theta_i \mid \alpha_k, \beta_k)^{X_{ik}} \left(1 - f(\theta_i \mid \alpha_k, \beta_k)\right)^{1 - X_{ik}}. \qquad (3)$$

The 2PL IRT model can be estimated either through a fully Bayesian specification or through marginal maximum likelihood (Bock and Aitkin 1981), which is an EM procedure that estimates item parameters conditional on $\theta_i$ via maximum likelihood on equation 2, updates $\theta_i$ using equation 3, and iterates the procedure until convergence.

IRT models make an assumption of local independence of items and cases. That is, IRT assumes that conditional on the latent variable, the responses on any item are independent from the responses on every other item, and the response pattern on any row is independent from the response pattern on every other row.2 In practice, the assumption that is made with regard to the rows is often stronger than local independence. The joint prior distribution on the latent variable is commonly the multivariate standard normal distribution. Under this distribution the marginal prior of each $\theta_i$ is standard normal, which is a convenient choice for many researchers because it is agnostic about the true value of $\theta_i$ and also fixes the scale that identifies the estimates. Using this prior, and assuming local independence, the posterior distributions of $\theta_i$ must be independent over $i$ as well.

Very few applications of IRT use an alternative prior. One notable exception, however, is the analysis of ideological ideal points of Supreme Court justices conducted by Martin and Quinn (2002). In order to capture the temporal variation of the justices’ ideal points from term to term, Martin and Quinn use a random walk prior for the latent ideal points:

$$\theta_{t,j} \sim N(\theta_{t-1,j}, \Delta_{\theta_{t,j}}) \qquad (4)$$

where $t$ refers to the judicial term and $j$ refers to the justice (p. 140). $\Delta_{\theta_{t,j}}$ is a scalar variance term which is fixed to 1 in their analysis. Because the prior distribution of $\theta_j$ forms an integrated time-series over $t$, Martin and Quinn can maintain the assumption that the items are locally independent: the autocorrelation is contained in the prior distribution of the latent variable.
In Martin and Quinn's data, each term contains different cases, so the item parameters are estimated conditioning only on the 9 (or fewer) votes in each case.

In general, there have only been a few attempts in political science to measure a dynamic latent variable. Voeten (2004) estimates the ideological ideal points of states in UN voting, and allows these ideal points to change over time by modeling the latent variable as a quadratic function of time:

$$\theta_{it} = \pi_{0i} + \pi_{1i} T_t + \pi_{2i} T_t^2 + v_{it}, \qquad v_{it} \sim N(0, \sigma^2), \qquad (5)$$

where $T_t \in \{-5, -4, \ldots, 4, 5\}$ represents the time frame 1991-2001 and $\pi_{0i}$, $\pi_{1i}$, and $\pi_{2i}$ are estimated from covariates (p. 738). Similarly, Bailey (2007) estimates the ideal points of members of Congress and Supreme Court justices as quartic functions of their time in office. The most famous example of dynamic measurement in political science is DW-NOMINATE, which also treats legislator ideal points as polynomial functions of time (Carroll et al. 2009, p. 263). All of these approaches are at least in part derived from cross-sectional variation.

Many important measurement problems exist, however, for data that are strictly a time-series. These problems consider the variation over time of one important case where several items are observed at repeated points in time. Problems of this type arise in measuring, for example, the nature of a conflict between two states, the performance of an economy, the development of an institution, the behavior of an executive, or the strategies of a campaign. TSIRT extends the approach of Martin and Quinn to time-series data by using a random walk prior on the latent variable, which is more general and uses fewer degrees of freedom than specifying the latent variable to be a polynomial function of time. Unlike in Martin and Quinn’s approach, however, the item parameters are well identified because the same items are observed repeatedly. TSIRT also extends this methodology by including test curves for continuous, count, and proportion items.
The derivation of TSIRT is presented below.

2 Other assumptions made by IRT include the unidimensionality of the latent variable and the parametric representation of the test curves. IRT has been extended to estimate multidimensional scales (McDonald 1997) and to handle test curves non-parametrically (Mokken 1997). These extensions can also be applied to TSIRT, but the focus here is on the unidimensional, parametric case.

3 Methodology

TSIRT is implemented as a fully Bayesian model, estimated through MCMC.3 The data $X$ are assumed to form a $(T \times K)$ matrix, where $t \in \{1, 2, \ldots, T\}$ indexes the time points, and $k \in \{1, 2, \ldots, K\}$ indexes the items. The time points refer to the variation over time of one important case, and the items have a shared covariance that can be largely explained by a unidimensional, dynamic latent variable $\theta_t$. The parameters to be estimated include $\theta_t$, the variance of the random walk $\sigma^2$, item discrimination parameters $\alpha_k$, item difficulty parameters $\beta_k$, and auxiliary parameters for particular variable types: for count variables, the negative binomial parameter $r_k$, which is the number of failures before an experiment is terminated, and for proportion variables, a beta distribution total count parameter $\phi_k$. These parameters are assumed to be independent in their joint prior distribution, so that

$$P(\theta_t, \sigma^2, \alpha, \beta, r, \phi) = P_\theta(\theta_t) \cdot P_{\sigma^2}(\sigma^2) \cdot P_\alpha(\alpha) \cdot P_\beta(\beta) \cdot P_r(r) \cdot P_\phi(\phi). \qquad (6)$$

Each of these prior distributions is described below.

The dynamic variation of the latent variable is expressed through a random walk prior:

$$\theta_t \sim N(\theta_{t-1}, \sigma^2). \qquad (7)$$

The same prior is used by Martin and Quinn (2002) to estimate the ideal points of Supreme Court justices while allowing these estimates to be correlated over time. Unlike in Martin and Quinn's analysis, the variance hyperparameter $\sigma^2$ is not fixed a priori, but is estimated. The variance $\sigma^2$ is bounded to be greater than 0, and is specified to have a uniform prior.
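To make the random walk prior in equation 7 concrete, the sketch below evaluates its log density for a candidate latent path. It conditions on the first value $\theta_1$, which is an assumption of this sketch (equation 7 does not state how the initial value is treated), and `random_walk_log_prior` is a hypothetical name used only here.

```python
import numpy as np

def random_walk_log_prior(theta, sigma2):
    """Log density of theta_2, ..., theta_T given theta_1 under equation 7.

    Each step theta_t - theta_{t-1} is treated as an independent N(0, sigma2) draw.
    """
    steps = np.diff(np.asarray(theta, dtype=float))
    n = steps.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - np.sum(steps ** 2) / (2.0 * sigma2)

# A smooth path receives more prior support than a jagged one for the same variance.
smooth = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
jagged = np.array([0.0, 2.0, -2.0, 2.0, -2.0])
lp_smooth = random_walk_log_prior(smooth, sigma2=1.0)
lp_jagged = random_walk_log_prior(jagged, sigma2=1.0)
print(lp_smooth > lp_jagged)  # True
```

Because the prior only penalizes changes between adjacent time points, it places no restriction on the level of the series, which is the sense in which the latent variable is an integrated time-series.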
The item parameters are strongly identified since the data contain $T$ observations of each item. Since these observations form a time-series, they are not independent, but the observations are assumed to be locally independent conditional on the latent variable, which accounts for autocorrelation. The prior distributions for the difficulty and discrimination parameters are

$$\beta \sim N(0, 1), \quad \text{and} \quad \alpha \sim \begin{cases} N(0, 1) & \text{if } \alpha > 0, \\ 0 & \text{if } \alpha \le 0. \end{cases} \qquad (8)$$

Finally, the auxiliary parameters for count and proportion items are specified to have uniform and uninformative prior distributions.

Each item has a distribution that is specific to the type of item being considered. Binary items follow a logit distribution conditional on $\theta_t$, $\alpha_k$, and $\beta_k$ as in equation 1.4 Continuous items are standardized, and are conditionally distributed normally with a mean of $\theta_t$ and a standard deviation of $\alpha_k$:

$$f_k(X_{tk} \mid \theta_t, \alpha_k) = \frac{1}{\sqrt{2\pi}\,\alpha_k} \exp\left(\frac{-(X_{tk} - \theta_t)^2}{2\alpha_k^2}\right). \qquad (9)$$

No difficulty parameter is included in the test curves for continuous items because $X_k$ is standardized and $\beta_k$ must equal 0 by construction. $\alpha_k$ is the standard deviation of the normal test curve. As the means $\theta_t$ deviate from the data $X_{tk}$, the estimated values of $\alpha_k$ must become larger in order to compensate for this deviation. Smaller values of $\alpha_k$ indicate a better fit for continuous items. Figure 2 illustrates the conditional distribution of a continuous item $X$ given $\theta_t$ and $\alpha_k$. On the left $\alpha_k = 0.5$, which is a low value because the standardized values of $X_{tk}$ do not depart much from the values of $\theta_t$. But on the right these values deviate from the values of the latent variable to a greater extent, and the standard deviation $\alpha_k$ must be larger to compensate.
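The relationship between fit and $\alpha_k$ for continuous items can be sketched in a few lines of Python. The log density follows the normal test curve with mean $\theta_t$ and standard deviation $\alpha_k$ described above. The helper `alpha_mle` is a hypothetical illustration only: it uses the maximum likelihood estimate of a normal standard deviation with $\theta_t$ treated as known, which is not the Bayesian estimator used for TSIRT, and the simulated items are not standardized.

```python
import numpy as np

def continuous_log_density(x, theta, alpha):
    """Log of the normal test curve: mean theta, standard deviation alpha."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(alpha) - (x - theta) ** 2 / (2.0 * alpha ** 2)

def alpha_mle(x, theta):
    """Illustrative only: ML estimate of alpha when theta is treated as known."""
    return float(np.sqrt(np.mean((x - theta) ** 2)))

rng = np.random.default_rng(0)            # arbitrary seed for reproducibility
theta = rng.normal(size=2000)             # stand-in latent values
x_tight = theta + 0.5 * rng.normal(size=2000)   # item that tracks theta closely
x_loose = theta + 1.2 * rng.normal(size=2000)   # item with more unique variance

alpha_tight = alpha_mle(x_tight, theta)
alpha_loose = alpha_mle(x_loose, theta)
print(alpha_tight < alpha_loose)  # True
```

The item that departs more from the latent values requires the larger standard deviation, which mirrors the contrast between the two panels of figure 2.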
[Figure 2: Two Continuous Items – One That Fits Well (Left) and One That Fits Less Well (Right). Left panel: α = 0.5; right panel: α = 1.2. Curves for θt = −1, θt = 0, and θt = 1. Axes: X (horizontal), f(X | θt, α) (vertical).]

Count variables are modeled using the negative binomial distribution, which has two parameters: $\pi \in [0, 1]$, the probability of a positive outcome on any one particular draw, and $r > 0$, the number of negative draws before the experiment is terminated.5 The distribution of a count item $X_k$ is conditioned on the latent variable by specifying $\pi$ to be a logistic link function of $\theta_t$:

$$\pi_{tk} = \frac{1}{1 + \exp(-\alpha_k(\theta_t - \beta_k))}. \qquad (10)$$

Then the negative binomial distribution is conditioned on $r$, on the latent variable, and on the item parameters:6

$$f_k(X_{tk} \mid \theta_t, \alpha_k, \beta_k, r_k) = \binom{X_{tk} + r_k - 1}{X_{tk}} (1 - \pi_{tk})^{r_k}\, \pi_{tk}^{X_{tk}}. \qquad (11)$$

The discrimination parameter $\alpha_k$ determines how much weight a particular count places on a region of the latent variable. Figure 3 shows three values of two count items with $r_k = 3$. The item on the left has a high discrimination parameter and places a great deal of weight on specific regions of the posterior distribution of $\theta_t$. The item on the right has a lower discrimination parameter, and is more agnostic about the implications for $\theta_t$.

[Figure 3: Two Count Items – One That Fits Well (Left) and One That Fits Less Well (Right). Left panel: α = 3; right panel: α = 0.8. Curves for C = 0, C = 2, and C = 4. Axes: θt (horizontal), P(X = C | θt, α) (vertical).]

3 The models described in the examples below are estimated using Stan, a Hamiltonian MCMC algorithm, available in C++ and for R (Stan Development Team 2013).

4 Similarly, ordinal items can be modeled using the ordered logit distribution and unordered-categorical items can be modeled using the multinomial logit distribution. These distributions are natural extensions of the logit distribution. All of these models can be implemented using probit models instead of logit models by using standard normal test curves, but the differences between logit and standard normal CDFs are minor.

5 For example, if the probability of a positive draw is $\pi = 0.5$, and $r = 3$ negative draws result in the termination of the experiment, then the probability that $X = 2$ is the probability that 2 positive outcomes had been drawn when the third negative outcome is drawn. If 0 is a negative outcome and 1 is a positive outcome, then this result can be achieved by drawing a string of 11000, 10100, 10010, 01100, 01010, or 00110. Therefore if $X \sim NB(.5, 3)$, then $P(X = 2) = 6(2^{-5}) = 0.1875$.

6 $r$ is an integer parameter, but it can be treated as positive and continuous by replacing the binomial coefficient in the negative binomial distribution with a ratio of gamma functions (Cameron and Trivedi 2005, chapter 20).

Proportion items follow the beta distribution, which has two shape parameters $a > 0$ and $b > 0$:

$$f_k(X \mid a, b) = \frac{1}{B(a, b)} X^{a-1} (1 - X)^{b-1}, \qquad (12)$$

where $B(\cdot)$ is the beta function, given by

$$B(a, b) = \int_0^1 u^{a-1} (1 - u)^{b-1}\, du. \qquad (13)$$

The beta distribution can be reparameterized to have a mean parameter $\pi = a/(a + b)$ and a total count parameter $\phi = a + b$ (Gelman et al. 2004, pp. 132-133). Since $\pi \in [0, 1]$, it can be specified using equation 10 as a logistic link function that introduces dependence on the latent variable and on the item discrimination and difficulty parameters. The reparameterized beta distribution is

$$f_k(X_{tk} \mid \theta_t, \alpha_k, \beta_k, \phi_k) = \frac{1}{B\left(\pi_{tk}\phi_k, (1 - \pi_{tk})\phi_k\right)} X_{tk}^{\pi_{tk}\phi_k - 1} (1 - X_{tk})^{(1 - \pi_{tk})\phi_k - 1}. \qquad (14)$$

Figure 4 illustrates the distribution of a proportional item conditional on $\theta_t$ and $\alpha_k$, where $\beta_k$ is fixed to 0 and $\phi_k$ is fixed to 10. As with the test curves for continuous items, the item with high discrimination involves a high level of association between the values of $X_{tk}$ and $\theta_t$ while the item with lower discrimination allows this relationship to be less determined.
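The count and proportion test curves in equations 10, 11, and 14 can be checked numerically with a short Python sketch. It uses only the standard library; the binomial coefficient is written with log-gamma functions so that $r$ may be non-integer, in the spirit of footnote 6. The function names are hypothetical and chosen for this illustration.

```python
import math

def link(theta, alpha, beta):
    """Equation 10: logistic link from the latent variable to pi."""
    return 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))

def neg_binomial_pmf(x, pi, r):
    """Equation 11, with the binomial coefficient expressed through gamma functions."""
    log_coef = math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
    return math.exp(log_coef + r * math.log(1.0 - pi) + x * math.log(pi))

def beta_density(x, pi, phi):
    """Equation 14: beta density with mean pi and total count phi."""
    a, b = pi * phi, (1.0 - pi) * phi
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_beta_fn)

# Footnote 5: pi = 0.5 and r = 3 give P(X = 2) = 6 * 2**-5.
p_two = neg_binomial_pmf(2, 0.5, 3)
print(round(p_two, 4))  # 0.1875

# A proportion item with beta_k = 0 and phi_k = 10, evaluated where theta_t = 1.
dens = beta_density(0.7, link(1.0, 3.0, 0.0), 10.0)
```

The first check reproduces the worked example in footnote 5, and the second evaluates one point on a curve of the kind drawn in figure 4.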
[Figure 4: Two Proportion Items – One That Fits Well (Left) and One That Fits Less Well (Right). Left panel: α = 3; right panel: α = 0.8. Curves for θt = −1, θt = 0, and θt = 1. Axes: X (horizontal), f(X | θt, α) (vertical).]

The models for binary, continuous, count, and proportion items are all linked by the common latent variable $\theta_t$, and the parameters that are specific to each item are locally independent conditional on $\theta_t$. Let $f_{Bk}$ represent the distribution of a binary item implied by equation 1, let $f_{Nk}$ be the distribution of a continuous item given in equation 9, let $f_{Ck}$ be the distribution of a count item given in equation 11, and let $f_{Pk}$ be the distribution of a proportion item given in equation 14. Consider a model with $K_B$ binary items, $K_N$ continuous items, $K_C$ count items, and $K_P$ proportion items. The joint posterior distribution for all of the parameters is given by

$$\begin{aligned} P(\theta_t, \sigma^2, \alpha, \beta, r, \phi \mid X) \propto{} & P_\theta(\theta_t) \cdot P_{\sigma^2}(\sigma^2) \cdot P_\alpha(\alpha) \cdot P_\beta(\beta) \cdot P_r(r) \cdot P_\phi(\phi) \\ & \times \prod_{k=1}^{K_B} f_{Bk}(X \mid \theta_t, \alpha_k, \beta_k) \\ & \times \prod_{k=1}^{K_N} f_{Nk}(X \mid \theta_t, \alpha_k) \\ & \times \prod_{k=1}^{K_C} f_{Ck}(X \mid \theta_t, \alpha_k, \beta_k, r_k) \\ & \times \prod_{k=1}^{K_P} f_{Pk}(X \mid \theta_t, \alpha_k, \beta_k, \phi_k), \qquad \forall t \in \{1, 2, \ldots, T\}. \end{aligned} \qquad (15)$$

In order to assess the convergence of the MCMC algorithm, I run multiple chains and examine their traceplots. I also consider the values of the $\hat{R}$ statistics to diagnose convergence (Gelman and Rubin 1992).

TSIRT has some limitations. First, like IRT, TSIRT requires that the items are locally independent conditional on the latent variable.
If this assumption is not true, then neither the estimates of item parameters nor the estimates of the latent variable are accurate.7 Second, item parameters are estimated as if they are fixed over time, when in fact it is entirely plausible that the meaning and implications of an item may change from the beginning to the end of the time-series. Finally, the reliance on MCMC to estimate the model implies that a researcher can never conclude with absolute certainty that the algorithm has converged to the posterior distribution. However, if modern convergence diagnostics all agree that the chains have converged, then it would be highly unlikely that the chains remain unconverged.

4 Demonstration on Simulated Data

In order to demonstrate the performance of TSIRT, I generate a dataset that consists of binary, continuous, count, and proportion variables from a data generating process in which every variable depends on a variable $\theta_t$, generated from an integrated random walk. My goal is to show that TSIRT can accurately measure values of the latent variable and correctly estimate item parameters for each type of variable. I set the number of time points to be $T = 300$, which would be the number of cases in a time-series observed monthly for 25 years. I generate values of a latent variable $\theta_t$ from

$$\theta_1 = \varepsilon_1, \qquad \theta_t = \theta_{t-1} + \varepsilon_t, \quad \forall t \in \{2, \ldots, T\}, \qquad (16)$$

where

$$\varepsilon_t \sim N(0, \sigma^2), \quad \forall t. \qquad (17)$$

The variance of the random walk $\sigma^2$ is set to 1 in this simulation. I generate 5 binary items, 5 continuous items, 5 count items, and 5 proportion items, all using $\theta_t$ as a parameter to link these items together. The binary items are randomly generated from the Bernoulli distribution with probability parameter $\pi_{tk}$ as given in equation 10. The continuous items are generated conditional on $\theta_t$ and $\alpha_k$ from the normal distribution in equation 9. The count items are generated conditional on $r_k$ and $\pi_{tk}$ by the negative binomial distribution given in equation 11.
The proportion items are generated from the beta distribution conditional on $\pi_{tk}$ and $\phi_k$ in equation 14.

For this simulation, the true values of the item parameters are randomly generated. The discrimination parameters $\alpha_k$ are drawn from the uniform[.2, 1] distribution for the binary, count, and proportion items, and from the uniform[.5, 1.5] distribution for the continuous items. The difficulty parameters are drawn from the uniform[−1, 1] distribution for the binary, count, and proportion items. The $r_k$ parameters for the count items are drawn from the uniform[1, 8] distribution, but are rounded up to the nearest integer. The $\phi_k$ parameters for the proportion items are drawn from the uniform[0, 100] distribution.

I estimate TSIRT on the simulated data described above using a fully Bayesian algorithm (Stan Development Team 2013). I run three chains of the algorithm for 1000 iterations each, using the first 500 iterations of each chain as a burn-in and keeping the last 500 iterations. The $\hat{R}$ statistic for each parameter is within .01 of 1, which suggests that the chains have converged to the correct posterior distribution.8

7 Some work has been done to identify testlets – clusters of locally dependent items – and to correct for this dependence with parametric and nonparametric approaches (e.g. Jannarone 1997). This methodology is not currently implemented for TSIRT, however.

[Figure 5: Estimated and True Values of the Latent Integrated time-series. Legend: True Value; 95% Credible Interval for Estimate. Axes: t (horizontal, 0 to 300), θt (vertical).]

The true values of $\theta_t$ are graphed over $t$ in figure 5, along with a grey area that represents the 95% credible interval for each estimated value of $\theta_t$. TSIRT does an excellent job of returning the true values of the latent variable, with only a small amount of noise around the estimate at each time point. The item parameter estimates are illustrated in figure 6.
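The data generating process for the simulation (equations 16 and 17, with the item distributions in equations 9, 10, 11, and 14) can be sketched with NumPy as follows. This is a reconstruction under stated assumptions, not the code used for the results reported here: the seed is arbitrary, the clipping of $\pi_{tk}$ away from 0 and 1 is a numerical safeguard added for this sketch, and the continuous items are left unstandardized.

```python
import numpy as np

rng = np.random.default_rng(2013)   # arbitrary seed (assumption of this sketch)
T, K = 300, 5                       # time points, items per type
sigma2 = 1.0

# Equations 16 and 17: theta_1 = eps_1, theta_t = theta_{t-1} + eps_t.
theta = np.cumsum(rng.normal(0.0, np.sqrt(sigma2), size=T))

def link(theta, alpha, beta):
    """Equation 10 for every (t, k) pair, clipped for numerical safety."""
    pi = 1.0 / (1.0 + np.exp(-alpha[None, :] * (theta[:, None] - beta[None, :])))
    return np.clip(pi, 1e-6, 1.0 - 1e-6)

def draw_item_params():
    """Discrimination and difficulty for binary, count, and proportion items."""
    return rng.uniform(0.2, 1.0, K), rng.uniform(-1.0, 1.0, K)

# Binary items: Bernoulli draws with probability pi_tk.
a_b, b_b = draw_item_params()
binary = rng.binomial(1, link(theta, a_b, b_b))

# Continuous items: normal with mean theta_t and standard deviation alpha_k.
a_n = rng.uniform(0.5, 1.5, K)
continuous = rng.normal(theta[:, None], a_n[None, :])

# Count items: positives (probability pi) observed before r_k negatives.
# NumPy counts failures before n successes, so "success" here is a negative draw.
a_c, b_c = draw_item_params()
r = np.ceil(rng.uniform(1.0, 8.0, K)).astype(int)
count = rng.negative_binomial(r[None, :], 1.0 - link(theta, a_c, b_c))

# Proportion items: beta with mean pi_tk and total count phi_k.
a_p, b_p = draw_item_params()
phi = rng.uniform(0.0, 100.0, K)
pi_p = link(theta, a_p, b_p)
proportion = rng.beta(pi_p * phi[None, :], (1.0 - pi_p) * phi[None, :])

X = np.hstack([binary, continuous, count, proportion])   # (T x 20) data matrix
```

The resulting matrix has one row per time point and one column per item, which is the $(T \times K)$ layout assumed in the methodology section.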
In figure 6, 20 estimates of $\alpha_k$ are listed on the left: one for each of the 5 binary, 5 continuous, 5 count, and 5 proportion items. The estimate of $\sigma^2$ is listed below the estimates of $\alpha_k$. 15 estimates of $\beta_k$ are listed on the right, excluding estimates for the 5 continuous items. Finally, the 5 estimates of $r_k$ and $\phi_k$ are listed below the estimates of $\beta_k$. The dots represent the posterior mean of each estimate and the horizontal lines through the dots represent the 95% credible interval for each estimate. The crosses denote the true value of each parameter. TSIRT returns the item parameters accurately as well: the 95% credible interval of every parameter except for one contains the true value. The results suggest that TSIRT may be more suited to estimating $\alpha$ parameters than the other parameters, and is particularly suited to estimating the discrimination parameters of count and proportion items, but more simulated trials are necessary to confirm this observation.

5 Example 1: the Israeli-Palestinian Conflict, 1990-2004

The examples presented here demonstrate the utility of TSIRT for real applications in political science research, but are not intended to be completed research projects.

Many empirical studies in international relations use dyadic data in which each observation considers the relationship between two distinct political actors, often countries. Dyadic data contain several variables that characterize the relationship between the two actors, and often dyadic data also contain multiple time points. Hoff and Ward (2004) point out that many methodological approaches adopted by researchers with time-series cross-sectional dyadic data use both the cross-sectional and temporal variation to draw inferences.
They warn, however, that these approaches make inappropriate assumptions about the data: it may not be appropriate to assume that dyads are independently drawn, particularly if two dyads involve the same country; and it is not appropriate to assume that the actions taken within a dyad at one time point are independent from the actions taken within the dyad at previous time points. One way to avoid a series of questionable assumptions about the independence of dyads is to leverage variation over time to consider individual dyads. TSIRT can be used to analyze a particularly important relationship between two countries without making an incorrect assumption that the relationship at a point in time is independent from the relationship at another point in time.

8 The same convergence properties are true for both examples described below.

[Figure 6: Estimated Item Parameters vs. True Values. Panels: α by item (binary, continuous, count, proportion) with σ² below; β by item (binary, count, proportion); r by item; φ by item. Horizontal axis in each panel: Parameter Value.]

One important dyad in international politics is Israel and Palestine. For many years the relationship between Israel and Palestine has involved ebbs and flows in the amount of conflict exhibited between the two sides, and in the efforts to achieve peace. Both “peace” and “conflict” are latent variables in that they are not directly observable, but have many observable implications. There are increasing efforts towards peace when the actors are engaged in meetings, agreeing to frameworks for future actions or meetings, removing occupying forces, and easing sanctions.
Escalations of conflict involve rejected proposals, hostile rhetoric, military raids and occupations, arrests, violent clashes, shootings, physical assaults, grenade attacks, and suicide bombings. The latent peace and conflict variables are not perfectly negatively correlated: at times conflicts increase while efforts for peace decrease, and vice versa, but at other times conflicts and peace efforts increase or decrease together. Efforts for peace become necessary in response to escalations of conflict. On the other hand, a large literature in international relations suggests that certain actors within Israel and Palestine benefit from increased conflict, and attempt to spoil peace efforts (Stedman 2001, Kaufman 2006, Greenhill and Major 2007, Sheafer and Dvir-Gvirsman 2010). It is an open question, therefore, as to whether peace efforts respond to conflict, or conflicts arise as part of deliberate efforts by certain parties to derail ongoing peace efforts. In this application, I use TSIRT to measure latent peace and conflict indices for Israel and Palestine, and I consider the question of whether peace succeeds or precedes conflict using a Granger causality test. The implications of peace efforts and escalations of conflict are events that are directly observable. I use the subset of the 10 Million International Dyadic Events dataset (King 2003) that contains events that involve both Israel and Palestine. The data cover the years 1990-2004. The events are coded daily, and I take the sum of the occurrence of each event within each quarter, so that the data contain T = 60 time points. All of the variables are counts, so TSIRT employs negative binomial test curves. Of the events that occur within this dyad, several are direct implications of peace efforts, listed in the top section of table 1, and several are direct implications of conflict escalation, listed in the bottom section of table 1. 
[Figure 7: Mean Estimates of Peace and Conflict Indices, 1990-2004. Series: Peace Index, Conflict Index. Axes: Year and Quarter (horizontal), Index (vertical).]

Table 1: Indicators of Efforts Towards Peace, Escalations of Conflicts.

Efforts Towards Peace
Indicator                              Direction    Mean   Min   Max
Accept a proposal for future action    ISR → PAL     3.2     0    28
Accept a proposal for future action    PAL → ISR     1.9     0    14
Discussions or meetings                ISR → PAL    10.3     0    43
Discussions or meetings                PAL → ISR    10.3     0    43
Demobilize armed forces                ISR → PAL     1.3     0    27
Ease sanctions                         PAL → ISR     1.7     0     8
Travel to meet                         ISR → PAL     1.7     0    20
Travel to meet                         PAL → ISR     1.1     0    11
Yield control of a location            ISR → PAL     1.0     0    13

Escalations of Conflict
Indicator                              Direction    Mean   Min   Max
Military occupation                    ISR → PAL     1.4     0    14
Arrest and detention                   ISR → PAL     1.5     0     7
Criticize or blame                     ISR → PAL     3.4     0    15
Criticize or blame                     PAL → ISR     8.3     0    30
Military clash                         ISR → PAL     2.7     0    48
Military clash                         PAL → ISR     1.3     0    13
Grenade/RPG use                        ISR → PAL     1.5     0    10
Physical assault                       ISR → PAL     2.5     0    16
Physical assault                       PAL → ISR     2.3     0    11
Shooting                               ISR → PAL    25.4     0   122
Shooting                               PAL → ISR     4.6     0    35
Political arrests and detentions       ISR → PAL     1.1     0     6
Military raid                          ISR → PAL     5.5     0    42
Reject proposal                        PAL → ISR     1.1     0     4
Suicide bombing                        PAL → ISR     1.1     0    23

The indices are graphed for every quarter between 1990 and 2004 in figure 7.9 The peace index achieves its highest values in the early 1990s, with a brief decline that roughly corresponds to the first Gulf War. Another decline in the peace index occurs in 1996 after the assassination of Yitzhak Rabin and the appointment of Benjamin Netanyahu to his first term as Israel’s prime minister. The peace index climbs in the late 1990s during the talks leading to the Oslo II accord, and falls again during the Second Intifada. Correspondingly, the conflict index falls immediately after the Oslo II accords but spikes dramatically during the Second Intifada.

9 The credible intervals are not graphed in order to make the comparison of the two trends as clear as possible.

In order to assess whether peace tends to respond to conflict, or vice versa, I conduct a Granger causality test on the means of the two indices (Granger 1969). I control for 10 lags in order to allow 2.5 years for the effects of peace efforts to be observed. The results are listed in table 2.

Table 2: Granger Causality Tests: Israel-Palestinian Peace Index vs. Conflict Index.

Conflict → Peace
  Null Model:         $P_t = \alpha + \sum_{s=1}^{10} \beta_s P_{t-s} + \varepsilon_t$
  Alternative Model:  $P_t = \alpha + \sum_{s=1}^{10} \beta_s P_{t-s} + \sum_{s=1}^{10} \delta_s C_{t-s} + \varepsilon_t$
  F(10, 39):          1.834
  p:                  0.099

Peace → Conflict
  Null Model:         $C_t = \alpha + \sum_{s=1}^{10} \beta_s C_{t-s} + \varepsilon_t$
  Alternative Model:  $C_t = \alpha + \sum_{s=1}^{10} \beta_s C_{t-s} + \sum_{s=1}^{10} \delta_s P_{t-s} + \varepsilon_t$
  F(10, 39):          0.076
  p:                  1

In the left-hand column, I test whether the lags of the conflict index jointly have a significant effect on the peace index, after controlling for 10 lags of the peace index. The result is marginally significant, which may suggest that the previous 2.5 years of variation in conflict has a real effect on current peace efforts after controlling for the variation of the peace index over the same timeframe. On the other hand, the symmetric test that conflict responds to peace yields a highly insignificant result. This analysis suggests that peace efforts are natural responses to recent conflict, and it does not suggest the presence of spoilers.
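The logic of the Granger causality test in table 2 can be sketched as a comparison of the null and alternative regressions by an F statistic. The Python below is a generic illustration on synthetic data, not a reproduction of the reported results: `lag_matrix` and `granger_f` are hypothetical helper names, and the denominator degrees of freedom follow one common convention (observations minus the number of coefficients in the alternative model), which need not match the convention behind table 2.

```python
import numpy as np

def lag_matrix(z, lags):
    """Columns z_{t-1}, ..., z_{t-lags} for t = lags, ..., T-1."""
    T = len(z)
    return np.column_stack([z[lags - s: T - s] for s in range(1, lags + 1)])

def granger_f(y, x, lags):
    """F statistic for whether lags of x help predict y beyond lags of y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[lags:]
    const = np.ones((len(target), 1))
    null_design = np.hstack([const, lag_matrix(y, lags)])      # null model
    alt_design = np.hstack([null_design, lag_matrix(x, lags)]) # alternative model

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_null, rss_alt = rss(null_design), rss(alt_design)
    df1 = lags
    df2 = len(target) - alt_design.shape[1]
    f_stat = ((rss_null - rss_alt) / df1) / (rss_alt / df2)
    return f_stat, df1, df2

# Synthetic check: x leads y by one period, so x should Granger-cause y.
rng = np.random.default_rng(1)        # arbitrary seed
x = rng.normal(size=200)
y = np.zeros(200)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=199)
f_x_to_y, df1, df2 = granger_f(y, x, lags=2)
f_y_to_x, _, _ = granger_f(x, y, lags=2)
print(f_x_to_y > f_y_to_x)  # True
```

Applied to the two indices, `granger_f(peace, conflict, 10)` would correspond to the Conflict → Peace column and `granger_f(conflict, peace, 10)` to the Peace → Conflict column, where `peace` and `conflict` stand for the posterior mean series.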
6 Example 2: US Economic Performance Since 1978

In this example I focus on interpretations of the item parameters. The overall performance of the U.S. economy is a latent variable, but it is hardly a secret. The recession of late 2008 is evident in every metric used to evaluate the economy. I consider five indicators measured quarterly from 1978 through the first quarter of 2013: GDP growth, the consumer sentiment index, the percent change in the S&P 500 index, the unemployment rate, and the percent change in housing starts.[10] Summary statistics for these items as well as their sources are listed in table 3. I standardize these variables and estimate the economic performance index using the standardized variables as continuous items.

Table 3: Indicators of U.S. Quarterly Economic Performance, 1978-2013.

Indicator                  Mean    Best               Worst              Source
GDP Growth                 2.71    16.7 (1978, Q2)    -8.9 (2008, Q4)    Bureau of Economic Analysis (2013)
Consumer Sentiment Index   85.3    110.1 (2000, Q1)   51.1 (1980, Q2)    Thomson Reuters and the University of Michigan (2013)
S&P 500, % Change          2.20    20.2 (1982, Q4)    -27.2 (2008, Q4)   Federal Reserve Bank of St. Louis (2013a)
Unemployment Rate          6.42    3.9 (2000, Q4)     10.7 (1982, Q4)    Bureau of Labor Statistics (2013)
Housing Starts, % Change   -0.18   31.5 (1980, Q3)    -23.1 (2008, Q4)   Federal Reserve Bank of St. Louis (2013b)

[Figure 8: Estimated Economic Performance Index, 1978-2013. Vertical axis: Index; horizontal axis: Year and Quarter; plotted: Mean Estimate, 95% Credible Interval.]

The economic performance index is illustrated as a function of time in figure 8. For every quarter from 1978 through the first quarter of 2013, the mean is plotted along with a grey region that represents the 95% credible interval around this mean.
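The mean and 95% credible interval plotted in figure 8 summarize the posterior distribution of the latent variable at each quarter. Assuming the posterior is available as a set of simulation draws for each time point, and assuming an equal-tailed interval (both are assumptions about the workflow, as are the names `quantile` and `summarize_draws`), a minimal sketch of that summary is:

```python
import statistics


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize_draws(draws, level=0.95):
    """Return (mean, lower, upper) for the draws at one time point."""
    ordered = sorted(draws)
    tail = (1 - level) / 2
    return (statistics.fmean(ordered),
            quantile(ordered, tail),
            quantile(ordered, 1 - tail))


# Hypothetical draws for a single quarter: the values 0, 1, ..., 100.
mean, lower, upper = summarize_draws([float(v) for v in range(101)])
print(mean, lower, upper)
```

Applying `summarize_draws` to the draws for every quarter would give the line and the shaded band of a plot like figure 8.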
This index captures the recessions of the early 1980s, the stock market crash of 1987, the recession of the early 1990s, the burst of the “dot-com” bubble in the early 2000s, and the recession of 2008.

[10] Since low values of the unemployment rate indicate a stronger economy, I reverse code the unemployment rate by multiplying it by -1 before standardizing the variable.

[Figure 9: Item Parameters for the Economic Performance Index. Five panels plot f(X | θt) against X − θt: GDP Growth (α1 = 0.93), Consumer Sentiment Index (α2 = 0.43), S&P 500 (α3 = 1.11), Unemployment (α4 = 0.75), Housing Starts (α5 = 1.16).]

A secondary but important use of TSIRT is the estimation of item parameters. For continuous items, the only item parameter represents item discrimination, which is another way to describe the fit of the item to the latent variable in question. The item with the smallest variance fits best, while the item with the largest variance has the greatest amount of unique variance left unexplained by the latent variable. In this example, the item parameters suggest which items are the best indicators of the overall performance of the economy.

The discrimination parameters are illustrated in figure 9. The item that best fits the underlying measure of economic performance is the consumer sentiment index, which is a score compiled from 500 telephone interviews that focus on judging consumers’ level of economic optimism (Thomson Reuters and the University of Michigan 2013). The result is telling in light of prior research showing that consumer confidence is a primary driving component of fluctuations in the economy (Matsusaka and Sbordone 1995).
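To make the comparison in figure 9 concrete, the sketch below ranks the five items by their estimated parameter. It assumes, for illustration only, that each standardized item is normally distributed around the latent variable with variance α, so that the height of the density at X − θt = 0 is 1/sqrt(2πα). The dictionary `alphas` copies the values reported in figure 9; the helper `peak_density` is hypothetical.

```python
import math

# Item parameters as reported in figure 9.
alphas = {
    "GDP Growth": 0.93,
    "Consumer Sentiment Index": 0.43,
    "S&P 500": 1.11,
    "Unemployment": 0.75,
    "Housing Starts": 1.16,
}


def peak_density(variance):
    """Height of a normal density at its mean (assumes alpha is a variance)."""
    return 1.0 / math.sqrt(2.0 * math.pi * variance)


# A smaller alpha means a tighter distribution around the latent variable,
# so sorting in ascending order lists the best-fitting item first.
ranking = sorted(alphas, key=alphas.get)
for item in ranking:
    print(f"{item}: alpha = {alphas[item]:.2f}, "
          f"peak = {peak_density(alphas[item]):.2f}")
```

Under this reading the consumer sentiment index comes first in the ranking and housing starts last, matching the discussion of figure 9.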
Furthermore, MacKuen, Erikson, and Stimson (1992) have shown that voters’ prospective attitudes towards the economy explain the variation in presidential election outcomes. Since the consumer sentiment index is centrally associated with overall economic performance, this result reinforces the connection between economic performance and voting in presidential elections.

7 Conclusion

TSIRT estimates a latent variable for one important case in which relevant items are observed repeatedly at several time points. This methodology departs from IRT and other measurement techniques in that it does not require any cross-sectional information to identify item parameters or values of the latent variable. TSIRT therefore provides a means to address an important but previously neglected class of measurement problems, and can be used for a variety of applications in political science and other social sciences.

TSIRT relies on the same assumptions as standard applications of IRT: in particular, TSIRT assumes that items are locally independent conditional on the latent variable. Most implementations of IRT use a prior distribution for the latent variable that results in the even stronger assumption that these values are independent in their posterior distribution as well. TSIRT, in contrast, uses an integrated time-series for the prior distribution of the latent variable, and therefore models temporal dependence explicitly. In addition, TSIRT accommodates many data structures, and is demonstrated here to accurately process categorical, continuous, count, and proportion variables.

The methodology underlying TSIRT can be developed to account for several violations of its assumptions. If items remain locally dependent even after conditioning on the latent variable, it may be possible to group dependent items into blocks using the same methods that educational test researchers use to create testlets.
The model can also be extended to estimate a multidimensional latent variable. TSIRT can be rewritten to use alternative parametric distributions for different types of variables, alternative link functions, or nonparametric distributions. Time-dependent formulations of the item distributions can allow estimates of item parameters to change over time. Finally, although TSIRT assumes that the latent variable is an integrated time-series, alternative ARIMA models or nonparametric time-series models may be more appropriate for certain situations. All of these topics are areas for future research.

References

Bailey, Michael A. 2007. “Comparable Preference Estimates across Time and Institutions for the Court, Congress, and Presidency.” American Journal of Political Science. 51(3): 433-448.
Bailey, Michael A. and Forrest Maltzman. 2008. “Does Legal Doctrine Matter? Unpacking Law and Policy Preferences on the U.S. Supreme Court.” American Political Science Review. 102(3): 369-384.
Birnbaum, Allan. 1968. “Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability.” In Frederick M. Lord and Melvin R. Novick, Eds., Statistical Theories of Mental Test Scores. p. 395-479. Reading, MA: Addison-Wesley.
Bock, R. Darrell and Murray Aitkin. 1981. “Marginal Maximum Likelihood Estimation of Item Parameters: Application of an EM Algorithm.” Psychometrika. 46(4): 443-459.
Bock, R. Darrell and Marcus Lieberman. 1970. “Fitting a Response Model for n Dichotomously Scored Items.” Psychometrika. 35(2): 179-197.
Bureau of Economic Analysis. 2013. “Gross Domestic Product: Percent Change from Preceding Period.” National Economic Accounts. <http://www.bea.gov/national/index.htm>. Accessed 20 June 2013.
Bureau of Labor Statistics. 2013. “Labor Force Statistics from the Current Population Survey: (Seas) Unemployment Rate.” Databases, Tables & Calculators by Subject. <http://data.bls.gov/timeseries/LNS14000000>. Accessed 20 June 2013.
Cameron, A. Colin and Pravin K.
Trivedi. 2005. Microeconometrics: Methods and Applications. New York: Cambridge University Press.
Carroll, Royce et al. 2009. “Measuring Bias and Uncertainty in DW-NOMINATE Ideal Point Estimates via the Parametric Bootstrap.” Political Analysis. 17(2): 261-275.
Clinton, Joshua, Simon Jackman, and Douglas Rivers. 2004. “The Statistical Analysis of Roll Call Data.” American Political Science Review. 98(2): 355-370.
Federal Reserve Bank of St. Louis. 2013a. “S&P 500 Stock Price Index (SP500).” FRED Economic Data. <http://research.stlouisfed.org/fred2/series/SP500/downloaddata>. Accessed 20 June 2013.
Federal Reserve Bank of St. Louis. 2013b. “Housing Starts: Total: New Privately Owned Housing Units Started (HOUST).” FRED Economic Data. <http://research.stlouisfed.org/fred2/series/HOUST/downloaddata?cid=32302>. Accessed 20 June 2013.
Fraley, R. Chris, Niels G. Waller, and Kelly A. Brennan. 2000. “An Item Response Theory Analysis of Self-Report Measures of Adult Attachment.” Journal of Personality and Social Psychology. 78(2): 350-365.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis. Second Ed. Texts in Statistical Science. New York: Chapman and Hall.
Gelman, Andrew and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science. 7(4): 457-511.
Granger, C. W. J. 1969. “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods.” Econometrica. 37(3): 424-438.
Gray-Little, Bernadette, Valerie S. L. Williams, and Timothy D. Hancock. 1997. “An Item Response Theory Analysis of the Rosenberg Self-Esteem Scale.” Personality and Social Psychology Bulletin. 23(5): 443-451.
Greenhill, Kelly M. and Solomon Major. 2007. “The Perils of Profiling: Civil War Spoilers and the Collapse of Intrastate Peace Accords.” International Security. 31(3): 7-40.
Hoff, Peter D. and Michael D. Ward. 2004. “Modeling Dependencies in International Relations Networks.” Political Analysis. 12(1): 160-175.
Jannarone, Robert J. 1997. “Models for Locally Dependent Responses: Conjunctive Item Response Theory.” In van der Linden, Wim and Ronald K. Hambleton, Handbook of Modern Item Response Theory. p. 465-480. New York: Springer.
Kaufman, Stuart J. 2006. “Escaping the Symbolic Politics Trap: Reconciliation Initiatives and Conflict Resolution in Ethnic Wars.” Journal of Peace Research. 43(2): 201-218.
King, Gary. 2003. 10 Million International Dyadic Events. <http://hdl.handle.net/1902.1/FYXLAWZRIA>. Accessed 19 June 2013.
Lawley, D. N. 1943. “On the Problems Connected with Item Selection and Test Construction.” Proceedings of the Royal Society of Edinburgh. 61(3): 273-287.
McDonald, Roderick P. 1997. “Normal-Ogive Multidimensional Model.” In van der Linden, Wim and Ronald K. Hambleton, Handbook of Modern Item Response Theory. p. 258-270. New York: Springer.
MacKuen, Michael B., Robert S. Erikson, and James A. Stimson. 1992. “Peasants or Bankers? The American Electorate and the U.S. Economy.” American Political Science Review. 86(3): 597-611.
Matsusaka, John G. and Argia M. Sbordone. 1995. “Consumer Confidence and Economic Fluctuations.” Economic Inquiry. 33(2): 296-318.
Martin, Andrew D. and Kevin M. Quinn. 2002. “Dynamic Ideal Point Estimation Via Markov Chain Monte Carlo for the U.S. Supreme Court, 1953-1999.” Political Analysis. 10(2): 134-153.
Mokken, Robert J. 1997. “Nonparametric Models for Dichotomous Responses.” In van der Linden, Wim and Ronald K. Hambleton, Handbook of Modern Item Response Theory. p. 351-368. New York: Springer.
Montgomery, Jacob M. and Joshua Cutler. 2013. “Computerized Adaptive Testing for Public Opinion Surveys.” Political Analysis. 21(2): 141-171.
Orlando, Maria, Cathy Donald Sherbourne, and David Thissen. 2000. “Summed-Score Linking Using Item Response Theory: Application to Depression Measurement.” Psychological Assessment. 12(3): 354-359.
Thomson Reuters and the University of Michigan. 2013. Index of Consumer Sentiment. <http://www.sca.isr.umich.edu/dataarchive/mine.php>. Accessed 20 June 2013.
Sheafer, Tamir and Shira Dvir-Gvirsman. 2010. “The Spoiler Effect: Framing Attitudes and Expectations Toward Peace.” Journal of Peace Research. 47(2): 205-215.
Shor, Boris and Nolan McCarty. 2011. “The Ideological Mapping of American Legislatures.” American Political Science Review. 105(3): 530-551.
Stan Development Team. 2013. Stan: Version 1.1. <https://github.com/stan-dev/stan>.
Stedman, Stephen John. 2001. Implementing Peace Agreements in Civil Wars: Lessons and Recommendations for Policymakers. IPA Policy Paper Series on Peace Implementation. New York: International Peace Academy.
Thurstone, L. L. 1927. “A Law of Comparative Judgement.” Psychological Review. 34(2): 273-286.
Treier, Shawn and Simon Jackman. 2008. “Democracy as a Latent Variable.” American Journal of Political Science. 52(1): 201-217.
van der Linden, Wim and Ronald K. Hambleton. 1997. “Item Response Theory: Brief History, Common Models, and Extensions.” In van der Linden, Wim and Ronald K. Hambleton, Handbook of Modern Item Response Theory. p. 1-28. New York: Springer.
Voeten, Erik. 2004. “Resisting the Lonely Superpower: Responses of States in the United Nations to U.S. Dominance.” The Journal of Politics. 66(3): 729-754.