Inference in Tough Places: Essays on Modeling and Matching with Applications to Civil Conflict

by Chad Hazlett

M.S., Duke University (2002)
M.P.P., Harvard Kennedy School (2006)

Submitted to the Department of Political Science in partial fulfillment of the requirements for the degree of Doctorate in Political Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2014

© Chad Hazlett, MMXIV. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Political Science, May 5, 2014
Certified by: Jens Hainmueller, Associate Professor, Thesis Supervisor
Accepted by: Roger Petersen, Arthur and Ruth Sloan Professor of Political Science, Chairman, Graduate Program Committee

ABSTRACT

This dissertation focuses on the challenges of making inferences from observational data in the social sciences, with particular application to situations of violent conflict. The first essay utilizes quasi-experimental conditions to examine the effects of violence against civilians in Darfur, Sudan on attitudes towards peace and reconciliation. The second and third essays both address a common but overlooked challenge to making inferences from observational data: even when unobserved confounding can be ruled out, correctly "conditioning on" or "adjusting for" covariates remains a challenge.
In all but the simplest cases, existing methods ensure unbiased estimation only when the investigator can correctly specify the functional relationship between covariates and the outcome. The second essay (with Jens Hainmueller) introduces Kernel Regularized Least Squares (KRLS), a flexible modelling approach that provides investigators with a powerful tool to estimate marginal effects, without linearity or additivity assumptions, and at low risk of misspecification bias. The third essay introduces Kernel Balancing (KBAL), a weighting method that mitigates the risk of misspecification bias by establishing high-order balance between treated and control samples without balance testing or a specification search.

Thesis Supervisor: Jens Hainmueller
Title: Associate Professor

Acknowledgments

I owe an enormous debt of gratitude to many advisors and supporters, formal and informal, throughout my time at MIT. First, I am extremely fortunate that each member of my thesis committee has been supportive and responsive beyond reasonable expectation. From the first day I arrived in Cambridge, Fotini Christia provided unparalleled direction and encouragement. She was the first to get me involved in her research, and I have gained a great deal from this involvement. More than a few times, a timely phone call with her provided much needed advice and support. I look forward to continued collaboration in the coming years. Adam Berinsky provided invaluable strategic advice at every stage of my graduate career, from choosing a thesis topic to negotiating while on the market. Teppei Yamamoto provided essential technical feedback, especially on the Kernel Balancing project (Chapter 4). Moreover, his confidence that I could (and should!) develop a solo-authored methods piece prior to going on the job market proved to be essential. Finally, nobody has had a greater impact on my intellectual development than Jens Hainmueller.
The fact that I arrived at MIT at just the right time to work with Jens altered entirely my experience and quality of training. Jens provided the backbone for the methods training I rely upon, and writing the Kernel Regularized Least Squares paper (Chapter 3) with him was among the most important and rewarding experiences of my time at MIT. From Jens, I learned just how good a teacher can be, how it can transform both individual students and a department, and the enormous amounts of time and effort required to achieve these outcomes. I hope to become the kind of teacher to my students that Jens has been to his.

Numerous faculty outside of my committee have also been extremely supportive and helpful throughout my time at MIT. Danny Hidalgo has been a frequent source of advice and feedback, and I learned a great deal as teaching assistant for his class with Teppei Yamamoto. Rich Nielsen and Vipin Narang also provided frequent advice at levels ranging from the technical to the strategic. Kosuke Imai has been kind enough to host me as a pre-doctoral fellow at Princeton in this final year.

I would also like to thank my fellow students, and especially those in my cohort - Chris Clary, Jeremy Ferwerda, Yue Hou, David Hyun-Saeng, Nicholas Miller, and Krista Loose - whose friendship and academic support made my time at MIT much more pleasant. My parents, Robert and Nedra, provided their ceaseless support and unconditional confidence in my abilities. Finally, my wife Trish has tolerated not only the long distance for these last five years but also the frequent times at which I was too deeply submerged in work to pay sufficient attention to much else. Thank you, Trish, for putting up with this, and I look forward to beginning our new life together at UCLA.

Contents

1 Introduction
2 Angry or Weary? The effect of physical violence on attitudes towards peace in Darfur
    2.1 Introduction
    2.2 Background
    2.3 Methods
    2.4 Results
    2.5 Robustness
    2.6 Discussion
    2.7 Conclusions
    2.8 Tables and Figures
3 Kernel Regularized Least Squares
    3.1 Introduction
    3.2 Explaining KRLS
    3.3 KRLS in Practice: Parameters and Quantities of Interest
    3.4 Inference and Interpretation with KRLS
    3.5 Simulation Results
    3.6 Empirical Applications
    3.7 Conclusion
    3.8 Tables
    3.9 Figures
4 Kernel Balancing
    4.1 Introduction
    4.2 Background
    4.3 Motivating Example
    4.4 Theoretical Framework
    4.5 The Proposed Method
    4.6 Implementation
    4.7 Empirical Examples
    4.8 Conclusions
5 Appendices
    5.1 Appendix for Kernel Regularized Least Squares
    5.2 Appendix for Kernel Balancing

Chapter 1

Introduction

This dissertation consists of three essays, focusing on the challenges of making credible inferences from observational data in the social sciences, with particular application to situations of violent conflict. Social scientists are often interested in estimating the effects of a particular treatment variable on one or more outcome variables. In many cases, these treatment variables cannot be randomly assigned, making experiments impossible. Within political science, one area where this challenge is particularly acute is the study of the causes and consequences of violence, as neither violence itself nor its putative precursors can practically or ethically be randomized by investigators.

The first essay (chapter 2) demonstrates how causal inferences about the effect of violence can be made and put to theoretical use, in a situation where the distribution of the violence is arguably indiscriminate within certain sub-populations of those who were targeted. Specifically, it argues that violence against civilians in Darfur, Sudan during the height of atrocities in 2003-2004 was indiscriminately applied among individuals within a particular village and of a particular gender. This provides a rare opportunity to examine the effects of violence during mass atrocity with greatly reduced risk of confounding. This allows a preliminary answer to a central theoretical question that would be difficult to convincingly address without such causal leverage: Does exposure to violence make individuals more angry, more vengeful, and more supportive of further violence against their perpetrators - as is often assumed? Or, does it instead make them more weary, desirous of peace, and disenchanted with armed actors?
While the answer found here is not easily generalized to other cases, results consistently support the "weary" response, with individuals exposed to direct physical violence more likely to report that peace is possible, and less likely to demand that their enemies be executed. This finding qualifies the claim that violence generates demands for retribution that lead to further violence or war recurrence, but is consistent with an emerging view that exposure to violence increases some pro-social attitudes. It also suggests that victims of violence have an important role to play in political settlement and reconciliation processes.

While the first essay must go to lengths to substantiate the claim that the distribution of violence is conditionally indiscriminate, the actual act of "conditioning on" or "adjusting for" the covariates is straightforward, as there are only two categorical covariates that must be accounted for to identify the causal effect. The second and third essays introduce methods for dealing with less ideal but more common circumstances, in which the investigator must adjust for more numerous covariates, of which some may be continuous. As described more rigorously in those essays, existing methods for dealing with high-dimensional and/or continuous covariates typically require that the investigator can appropriately specify the functional form relating the covariates to the outcome. This is an implausible claim in most circumstances, yet violating this assumption can lead to potentially substantial misspecification bias. An important goal of chapters 3 and 4 is to provide investigators with tools that allow them to easily and accurately make covariate adjustment in this common scenario, with greatly reduced risk of misspecification bias.

Specifically, chapter 3, with Jens Hainmueller, describes kernel regularized least squares (KRLS) (Hainmueller and Hazlett, 2013).
KRLS is a modeling approach, to be deployed in regression or classification problems where investigators would more habitually use generalized linear models or other parametric approaches. While bringing the power of flexible machine learning approaches into an easy-to-use package, it also allows the investigator to interpret the result in ways similar to those permitted by traditional regression models. We also provide proofs of desirable statistical properties such as unbiasedness, consistency, and normality, and provide closed-form expressions for standard errors of several quantities of interest. The result is a powerful tool for estimating the marginal effects of variables, even in very high dimensional problems, with greatly reduced risk of misspecification bias. Another benefit of this flexibility is that it naturally accommodates heterogeneous effects, and readily allows for their exploration.

Chapter 4 introduces "Kernel Balancing". Like KRLS, this approach borrows insights from kernel-based methods in statistical learning theory to solve a common analytical challenge faced by social scientists. Matching and balancing approaches are frequently used to construct control groups and treated groups from the existing data. The intent of such methods is to identify control and treated samples "similar" enough in terms of their covariate values to effectively adjust for those covariates, after which estimating treatment effects (under conditional ignorability) is straightforward. However, without further (parametric) assumptions regarding functional form, matching produces biased estimates in most common circumstances. Weighting methods, which use continuous weights instead of simply keeping or dropping units, can overcome some of these challenges. However, these too guarantee unbiased effect estimation only if the functional form relating the covariates to the outcome is of a particular form known to the investigator.
Ultimately the challenge with both matching and weighting methods is that the investigator must know what functions of the covariates to include in the procedure (and to check balance on) in order to ensure that different mean outcomes for the two groups are due to the treatment rather than remaining differences on the covariates. Kernel balancing answers this question by ensuring that the treated and control groups have the same mean on a very large space of smooth functions of the covariates, which grows with N. This greatly reduces the risk of misspecification bias, without assuming the investigator is able to correctly guess or determine the functional form. Kernel balancing has the additional useful interpretation that it equalizes the multivariate densities of the covariates for the treated and (reweighted) controls, when density is measured in a particular way.

Chapter 2

Angry or Weary? The effect of physical violence on attitudes towards peace in Darfur

Chad Hazlett - Massachusetts Institute of Technology

ABSTRACT

Exposure to indiscriminate violence during civil conflict is often thought to increase anger towards its perpetrators, the desire for vengeance, and pessimism regarding the prospects for peace and security. Alternatively, however, experiences with violence during conflict could make individuals more "weary", less interested in retribution, and more desirous of peace. While these responses theoretically play a role in the evolution and recurrence of violent conflict, it has been difficult to obtain micro-level evidence for how violence impacts these attitudes. This paper uses information about the indiscriminate nature of violence in Darfur and a new survey of Darfurian refugees to shed light on the responses of Darfurian civilians to violence.
Results consistently support the "weary" response, with individuals exposed to direct physical violence more likely to report that peace is possible, and less likely to demand that their enemies be executed. This finding qualifies the claim that violence generates demands for retribution that lead to further violence or war recurrence, but is consistent with an emerging view that exposure to violence increases some pro-social attitudes. It also suggests that victims of violence can play an important role in political settlement processes.

2.1 Introduction

Large-scale violence directed against civilian populations is a common feature of internal conflicts such as civil wars, especially when relatively weak states attempt to defeat insurgencies embedded in or supported by civilian communities (Valentino et al., 2004; Colaresi and Carey, 2008). Beyond the immediate and horrific human consequences of targeting civilians, such violence may also shape the duration, termination, and recurrence of the conflicts in which it occurs through a variety of possible channels. Mass violence can also generate persistent anger towards other groups or fears of future attacks. Especially when the perpetrator is another civilian community living nearby, these fears may have long-lasting effects on the prospects for peace, as civilian communities support armed actors who can promise protection, and are unwilling to see "their side" disarmed and left vulnerable to renewed attacks. Moreover, where political solutions to a conflict may have been possible at the outset of fighting, once mass atrocities occur against civilian communities, it is often difficult or impossible to identify credible security arrangements that alleviate civilians' fears of future attacks.
In these ways, the reactions of civilians to atrocities shape both their own decisions to support ongoing violence or peace, and the strategies available to elites who would exploit these emotions, security concerns, and desires for their own ends. Yet, we know little about how civilians' experiences with violence influence their views towards continuing to fight or making peace.

In this paper, I focus on adjudicating between two opposing hypotheses regarding how civilians react to violence. First, the "angry hypothesis" states what most experts and non-experts alike might expect: that direct exposure to violence during conflict makes individuals less likely to seek peace or believe it is possible, and more likely to be angry or vengeful. One might predict this outcome based on a number of theoretical mechanisms. To name a few such channels, violence may harden divisive ethnic identifications and generate new, stronger grievances and demands for reprisal violence. Relatedly, past atrocities may generate heightened demands for future violence by showing that neutrality is no longer a guarantee of safety for non-combatants, and that civilians must demand protection from other armed actors. As a result, in places where governments do not monopolize the use of large-scale violence, civilians may choose to maintain support for armed actors, refuse to support negotiated settlements that involve disarmament, and may become convinced (by elites or otherwise) that counter-violence or preventive strikes against the perpetrating group are justifiable security measures. All of these paths lead to heightened prospects for continued or renewed violence, and while the mechanisms differ, each predicts the "angry hypothesis" - that exposure to violence leads to a reduced desire for or belief in the possibility of peace.
Alternatively, we may reasonably hypothesize just the opposite: a "weary" hypothesis stating that exposure to violence makes one wish for peace more strongly or believe it to be more achievable.[1] High levels of violence directed against civilians may be blamed on insurgents and their willingness to employ violence, triggering heightened attempts by civilians to push for peace. Or, individual exposure to violence may make the costs of fighting a war more apparent, and may alter calculations of whether it is worthwhile to pursue the initial war aims rather than protect the pre-war status quo, making individuals less interested in fighting.

[1] Note that the terms "angry" and "weary" are merely shorthand for these hypotheses, each of which is the observable implication of multiple possible mechanisms. The terms are not intended to suggest that emotions as such, whether rational or not in their origin, are the driving force in attitudes towards peace and violence.

This paper makes two contributions. First, it employs a novel dataset from a random sample of Darfurian refugees in eastern Chad. In focusing on Darfur, this paper examines the first and only conflict so far this century to be labeled a genocide by the U.S. government and in indictments by the International Criminal Court, but that has remained under-studied in empirical work due to severe logistical constraints. The data examined here come from the only large-scale, systematic survey of Darfurian refugees' exposure to violence and attitudes toward peace, justice, and reconciliation. Second, while prior literature relates only suggestively to the "angry" versus "weary" hypotheses (see below), this paper directly adjudicates between them, using a causal identification strategy based on conditionally exogenous exposure to violence. The results show that exposure to violence makes individuals more pro-peace or "weary" rather than more anti-peace or "angry".
This effect is sizable: for each of four individual binary outcomes, physical harm increases the probability of giving the "weary" response by 8-12 percentage points, which is 17-48% of each mean response level. This finding proves robust to a variety of modelling approaches (regression, matching, and re-weighting estimators) as well as sensitivity analysis and placebo tests exploring the effects of possibly omitted confounders.

While perhaps counter-intuitive, these findings offer insights into individual-level responses to indiscriminate mass violence during episodes of civil conflict, with implications for the duration, termination, and recurrence of those conflicts. The results also suggest specific recommendations for the design of peace and reconciliation processes.

2.2 Background

Violence in Darfur

This study examines the effects of violence directed indiscriminately but deliberately against civilians in Darfur in 2003 and 2004. As this violence was directed broadly against whole communities, it differs from other forms of violence against civilians, such as cases of unintended collateral damage, or highly selective violence targeting individuals based on their political or military activities or on denunciations (e.g. Kalyvas, 2006). The findings, thus, speak predominantly to cases of mass atrocity and genocidal violence during ongoing civil conflicts.

While Darfur has experienced previous wars and sporadic violence, the current conflict most clearly began in February 2003, when two rebel groups - the Sudan Liberation Army (SLA) and the Justice and Equality Movement (JEM) - launched an attack on the government air force base in Al Fashir, the capital of North Darfur state. The articulated motives for this rebellion included long-standing neglect of the region by the central government and prior attacks on civilians by both the Sudanese army itself and irregular militia widely referred to as the Janjaweed (Flint and de Waal, 2008).
In response to the surprising success of the rebellion in its early stages, the government unleashed a ferocious counter-insurgency operation, designed to punish, kill, or displace the civilian population presumed to be supportive of the rebellion. The offensive employed not only the army and air force, but also expanded mobilization of irregular forces that would continue to be known as the Janjaweed, in a "counter-insurgency on the cheap" (Flint and de Waal, 2008) strategy of exploiting pre-existing ethnic and tribal tensions to mobilize against the civilian base of an insurgency.

Violence rates climbed and remained high through 2003 and 2004. Most refugees or internally displaced persons (IDPs) of the conflict left their homes at this time. Those near major towns generally chose to move to them. Others fled to the mountains and forests. A large number of those in the western regions of West Darfur made the decision to cross the border into eastern Chad, becoming refugees. Many of these refugees still have not returned home. At the time of our survey in 2009, approximately 250,000 Darfurians were living in registered refugee camps in eastern Chad. The largest offensives by the Sudanese army and Janjaweed concluded in early 2005; thereafter, fighting has continued sporadically and with varying patterns (de Waal et al., 2014).

The number of people killed during the height of violence remains uncertain. Estimates suggest that in the 17 months from September 2003 to January 2005, there were 120,000 deaths directly attributable to the conflict, of which 35,000 were due to direct violence (Guha-Sapir and Degomme, 2005). Over the wider course of the conflict, Degomme and Guha-Sapir (2010) find that for the period of 2004-2008, approximately 300,000 deaths were attributable to the conflict, roughly 5% of the pre-2004 population.
Related Literature

The existing empirical literature sheds little light on the "angry" versus "weary" consequences of exposure to violence during mass atrocities or even violence more generally. Prior research speaking indirectly to this question, however, can be organized into cross-national analyses; events-level studies that look at a single conflict but study the effects on non-individual outcomes (e.g. insurgent attacks; patterns of control); and micro (or individual) level studies.

Cross-national regression studies have spoken indirectly to this question through the analysis of war recurrence. Doyle and Sambanis (2000) suggest that some measures of war intensity (log of deaths and displacement) are associated with more war recurrence, loosely supportive of the "angry" hypothesis. However, they also find that longer durations of war are associated with greater likelihood of a lasting peace, suggesting a "weary" response. Walter (2004) concludes that recurrence is better explained by underlying conditions in the country rather than possible effects of the previous war, but also finds longer wars associated with lower rates of recurrence, suggestive of the "weary" hypothesis. Fortna (2004) similarly found that longer wars were associated with longer periods of postwar peace.

"Event-level" studies have focused on the impacts of violence within a given conflict, but without access to individual-level outcomes. These again relate only indirectly to the "angry" versus "weary" distinction. Lyall (2009) examined the effects of random mortar fire on villages by Russian soldiers in Chechnya, finding that shelled villages were less likely to be the source of future reprisal attacks. Taken at face value, this might loosely support the "weary" hypothesis. Kocher et al. (2011) find that aerial bombardment by U.S.
forces in Vietnam was strongly associated with higher likelihood that an area would later fall under Viet Cong control, which the authors interpret as evidence that such indiscriminate forms of violence create a backlash against those who perpetrate it. Lyall et al. (2013) find that violence committed by some (but not all) warring parties tends to shift support towards their opposition. This may suggest an "angry" result, though only indirectly, as it does not speak to preferences for continuation of a struggle versus achieving peace.

Micro-level studies show promise for resolving this debate as they can examine how events relate to individual attitudes. A richer set of findings on the micro-level effects of violence has begun to emerge, several of which apply careful causal identification strategies. So far these have not focused on the effects of violence exposure on attitudes towards peace as such, but have examined the relationship between personal violence and a range of outcomes, finding that violence relates to heightened psychological trauma (Pham et al., 2004; Vinck et al., 2007; Pham et al., 2009), and reduced education, employment, and future earnings (Blattman and Annan, 2010; Akresh and De Walque, 2008). However, some work has begun to support a perhaps counterintuitive set of results: that exposure to violence is related to greater levels of social engagement (Bellows and Miguel, 2009; Blattman, 2009) and increased altruism, at least parochially (i.e. towards kin or coethnics) (Gilligan et al., 2011; Voors et al., 2011; Choi, 2007; Cassar et al., 2012).

2.3 Methods

Data

The primary data source is a survey conducted from April to June of 2009 by the author and other members of the "Darfurian Voices" team. The project sought to systematically document the views held by Darfurian refugees in Chad on issues of peace, justice, and reconciliation, and to accurately transmit these views to policymakers, mediators, negotiating parties, and other key stakeholders.
Reports and other materials from this project are available for download. This paper uses data from the random-sample survey. Briefly, the sample includes 1,872 individuals from the target population of adult refugees (18 years or older) from Darfur living in all 12 Darfurian refugee camps in eastern Chad. We used a stratified random sampling method, with geographic location (camp and block) and gender as strata.

It should be emphasized that the refugee population sampled here is not representative of Darfur's civilian population broadly. Geography was the primary determinant of who migrated from Darfur into Chad rather than elsewhere; almost all Darfurian refugees in Chad hail from the western part of West Darfur.

Measurement

The key causal variable of interest is exposure to violence, which I refer to as the "treatment" in keeping with the usual language of causal inference. I focus on whether or not the respondent was the victim of direct physical harm during this conflict, which I code as a binary variable Physical Harm, indicating injury or maiming during an attack.[2] Approximately 40% of the sample report being directly injured or maimed. This measure speaks to an individual's exposure to violence above and beyond the experiences of those around them, which is particularly important in this context, where many individuals have family members or neighbors who experienced violence during these attacks. All violence-related questions come at the end of the survey to avoid possible priming effects. Participants are not asked to describe the violence, and in particular, women are not asked whether the violence was of a sexual nature.

I examine four outcome measures. The first three assess whether individuals believe it is possible to make peace with former enemies (Peace Enemies)[3], peace with individual Janjaweed members (Peace Janjaweed Members)[4], and peace with the tribes from which the Janjaweed come (Peace Janjaweed Tribes)[5].
All three response items are transformed into binary responses, coding the positive responses ("strongly" or "somewhat" possible, or "strongly" or "somewhat" agree) as 1, and coding the negative responses ("somewhat" or "strongly" impossible, or "somewhat" or "strongly" disagree) as 0. Note that the directionality of these variables is such that more positive values indicate more pro-peace ("weary") answers. If violence increases weariness as measured, we would see positive effects on these outcomes; if it increases "anger", we would see negative effects on these outcomes.

[2] The question was: "Have you suffered violence, or have you been physically maimed in an attack related to the current conflict? (a) yes; (b) no; (c/d/e) uncertain/refused/not understood". Enumerators were trained to ensure that this was understood to refer to physical harm against the participant resulting in physical assault or injury.

[3] "Some people say that it is possible for former enemies to live peacefully together after a war. Some people say that it is not possible for former enemies to live peacefully together after a war. Do you believe (a) strongly that it is possible; (b) somewhat that it is possible; (c) somewhat that it is impossible; (d) or strongly that it is impossible?"

[4] "In the future, I can see myself living peacefully with actual members of the Janjaweed": (Strongly agree / somewhat agree / somewhat disagree / strongly disagree).

[5] "In the future, I can see myself living peacefully with the tribes from which the Janjaweed came": (Strongly agree / somewhat agree / somewhat disagree / strongly disagree).

A fourth outcome regards what punishment participants feel is appropriate for Government soldiers involved in the conflict (Execute Soldier). This is coded as a 1 when the answer was "execution", and 0 for any other (lesser) punishment, and so points in the opposite direction to the previous three (the "wearier" answer now being the lower value). These four measures are highly inter-related.
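The binary coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response strings, the function names `code_peace_item` and `code_execute_soldier`, and the choice to return `None` for uncertain or refused answers are hypothetical, since the survey's actual data encoding and missing-data handling are not given here.

```python
# Hypothetical response labels; the survey's real data encoding is not shown in the text.
POSITIVE = {"strongly possible", "somewhat possible", "strongly agree", "somewhat agree"}
NEGATIVE = {"somewhat impossible", "strongly impossible",
            "somewhat disagree", "strongly disagree"}


def code_peace_item(response):
    """Code a peace item as 1 (pro-peace, "weary") or 0 (negative).

    Returns None for anything else (e.g. refused); treating such answers
    as missing is an assumption of this sketch.
    """
    label = response.strip().lower()
    if label in POSITIVE:
        return 1
    if label in NEGATIVE:
        return 0
    return None


def code_execute_soldier(punishment):
    """Code Execute Soldier as 1 for "execution", 0 for any lesser punishment."""
    return 1 if punishment.strip().lower() == "execution" else 0
```

Note that the two codings point in opposite directions: a 1 on a peace item is the "weary" answer, while a 1 on Execute Soldier is the "angry" answer.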
Factor analysis supports a single-dimensional solution, with the expected signs on the loadings.[6] Using these loadings, and then re-scaling by the sum of the weights, I create the variable Peace Index, for use when a single measure of the outcome is useful.

[6] Principal factor analysis, no rotation. Factor loadings were 0.68, 0.63, and 0.79 for Peace Enemies, Peace Janjaweed Members, and Peace Janjaweed Tribes respectively, and -0.35 for Execute Soldier. The eigenvalue of the single retained factor was 1.6; all others were negative.

What does this single factor measure? These survey questions are difficult; they require the participant to evaluate counterfactual circumstances and estimate the chances of a complex process leading to a particular outcome. Moreover, they are emotionally charged, and come towards the end of a challenging two-hour interview. In order to answer difficult and emotional questions about the possibility of living in peace, respondents likely answer instead an easier and more intuitive question (Kahneman, 2011) such as "Would I like to live with these groups?" or "How would I feel about living with these groups?"

Identification Assumption: Conditional Indiscriminacy

The most critical assumption needed to identify the effect of violence on individuals in the data is that, conditional on observed covariates, whether an individual experiences violence does not depend on the outcome that individual would have if exposed to violence, or the outcome they would have if not exposed to violence. Let $Y_i(1)$ designate individual $i$'s (possibly unobserved) outcome had she been exposed to violence; let $Y_i(0)$ be the same individual's (possibly unobserved) outcome under non-exposure to violence. The causal effect for unit $i$ is then defined as $Y_i(1) - Y_i(0)$, and the average treatment effect (ATE) over the population is $E[Y_i(1) - Y_i(0)]$. Let $D_i$ be an indicator of exposure to violence for individual $i$, while $\text{gender}_i$ and $\text{village}_i$ designate the gender and village of respondent $i$. The assumption made here is then stated as $\{Y_i(1), Y_i(0)\} \perp\!\!\!\perp D_i \mid \text{gender}_i, \text{village}_i$. That is, among individuals of a given gender and village, whether they experience injury or not is unrelated to both potential outcomes $Y_i(0)$ and $Y_i(1)$. I refer to this throughout as "conditional indiscriminacy", as it implies violence is effectively indiscriminate within village-gender strata.

The Distribution of Violence

Justifying the conditional indiscriminacy assumption requires characterizing the nature and purpose of the violence conducted against civilians. During the height of attacks in Darfur in 2003-2004 described above, widespread violence against civilians was employed throughout Darfur, including the state of West Darfur, from which almost all the survey respondents in this study originate. Critically, the aim of these attacks was not to selectively seek out rebel or political leaders. Instead, it was to punish or destroy the communities behind rebel groups, through both direct violence against the populace and forced displacement. Displacement of communities served a second purpose of incentivizing members of the Janjaweed militia, whose tribes have long sought more reliable access to grazing lands, which could be achieved by removing these groups.

An attack on a village would typically involve one or both of the following. First, Government of Sudan planes would often begin crude, indiscriminate aerial bombardment. Second, Janjaweed militia would charge into the village, during which time many would be killed and many women were raped. In the case of government bombing of villages, it is relatively straightforward to claim that, within a given village, one's chances of being injured are largely random. These villages are relatively small, allowing for little variation in targeting.
These bombings were often as crude as pushing bombs, scrap metal, and barrels full of shrapnel out of aircraft. This does not allow for any kind of targeting based on political attitudes or other strategic considerations within the village.

The Janjaweed attacks, too, produced effectively exogenous exposure to violence within a given village, conditional on gender. Beyond the use of different types of violence for males versus females, the Janjaweed not only appeared to be indiscriminate in their use of violence, but also were unlikely to have any knowledge of which individuals in the village were potentially more or less politically or militarily active. Villages are ethnically very homogeneous and, while certain villages may be targeted, within a village there was little or no basis for targeting. Men and women, the old and young, were all apparently subject to injury and killing. In over 80 filmed and transcribed interviews, our research team asked a range of questions that included the nature of attacks on their villages. Not one of the interviewees provided evidence suggesting that during village attacks, the Janjaweed discriminated in directing violence against particular types of individuals, though there is evidence that Janjaweed groups encountered on the roads and elsewhere interrogated individuals. The common theme was that the Janjaweed would "kill everything", with their instructions to do so sometimes overheard by villagers. One typical respondent recounted, "The government came with Antonovs (aircraft), and targeted everything that moved... If it moved, it was bombed. It is the same thing, whether there are rebel groups (present) or not... They shoot everyone when they see them from a distance, and [if] they have any doubt about him, they shoot him. The government Antonovs survey the area from time to time to see if there is anything moving and if it is a human or an animal... The government bombs from the sky and the Janjaweed sweeps through and burns everything and loots the animals and spoils everything that they cannot take". Such statements look very similar to those collected by other organizations at other times, such as those collected in Human Rights Watch (2006).

Further examination of these interviews finds that those in the village, whether sleeping or attempting to flee, were subject to attack. Even those fleeing to nearby hiding places were frequently pursued. Livestock and belongings were often stolen (97% of respondents in our sample reported losing all or most of their livestock, crops, and belongings), and villages were almost always burned to the ground.

One immediate concern is that some individuals would have been more likely to have resisted or counter-attacked, and also more likely to experience violence. This is relatively unproblematic for two reasons. First, during the phase of violence experienced by those in the survey, resistance within the village had become extremely rare. One reason is that once the government had clearly joined the effort using its aircraft, this was no longer a war among tribes, and the would-be resisters among the Fur, Massalit, and Zaghawa tribes realized that protecting the village was not an option. Relatedly, those who did wish to resist in this area had already left to join rebel groups operating outside the villages (and do not enter our sample). Second, it is important to note that those who hid or attempted to flee were not evidently shown mercy. Testimony describes how those who fled or hid during the attack were often chased down or found, and thus still potentially subject to direct violence.

2.4 Results

Covariate Balance

While anecdotal evidence, testimonials, and other information support the conditional indiscriminacy claim, we can also partially test its plausibility quantitatively.
A traditional balance test would examine whether the distribution of a series of pre-treatment covariates is the same for the treated and untreated groups. The identification strategy here requires as-if-random distribution of violence only within each village-gender sub-group. I therefore test "conditional" balance, first splitting the sample by gender and then, within each, regressing the Physical Harm indicator on the pre-treatment covariates and village fixed effects. This tests whether covariates predict Physical Harm within village and gender. If exposure to Physical Harm is indeed unrelated to the distribution of a covariate (conditional on the others), its coefficient in this regression will be zero in expectation.

Covariates are included in this analysis if they are certain to be pre-treatment (measured prior to the village attack or clearly not altered by the violence). These include age, whether they were a farmer, herder, merchant, or trader in Darfur, their household size in Darfur, and whether or not they had voted in the past. All results are shown for linear probability models, with heteroscedasticity-robust standard errors. The analysis includes 517 unique villages, with no single village accounting for more than 6% of the sample. On average, 40% of individuals report experiencing physical harm.

Note that the identification assumptions hold only for villagers who were present during the time of village attacks. Several sample restrictions are thus made in all further analyses. Most importantly, only those who report leaving Darfur due to direct violence are included, ruling out the approximately 20% of the sample that left before violence occurred. In addition, because only the civilian (non-leadership) sample was randomly surveyed, and because leaders are expected to be more politicized in their responses than non-leader civilians, only those who report being non-leaders both in Darfur and while in the camps are considered here.
The remaining sample size is 1345. Note, however, that when the same analyses below are run on the full sample, the results are nearly identical.

The results of conditional balance tests support the conditional indiscriminacy assumption (Table 2.1). The only covariate with a p-value of less than 0.10 is Herder in Darfur: herders appear to be more likely to experience physical harm. While possibly a spurious result (made more likely by multiple comparisons), this suggests conditioning on herder status to ensure this is not acting as a confounder, though herders make up only 15% of the sample, and dropping them does not affect the results reported below. Moreover, covariates other than village are not jointly predictive of who experienced violence for either men (F(8, 338) = 1.10, p = 0.37) or women (F(6, 321), p = 0.43).

Distributions of Treatment Probabilities

It is helpful to see the distributions of propensity score estimates for the treated and untreated, to ensure that there is no group for which the scores differ greatly. Here we are interested in propensity to treatment only within each stratum of gender and village. Conditioning on gender is achieved by separately plotting male and female propensity scores; adjusting for village can be achieved by a re-weighting procedure. For each directly harmed participant, I assign a weight of 1. For each participant not harmed, I re-weight according to $w_i = \frac{P(\text{Village} = \text{village}_i \mid D = 1)}{P(\text{Village} = \text{village}_i \mid D = 0)}$, where $\text{village}_i$ is the village from which participant $i$ originates. This ensures that the post-weighting share of untreated participants from each village is the same as the share of treated participants from that village, so that differences in the distribution of propensities to treatment are not due to differences in village of origin. The top row of Figure 2-1 shows the gender-specific distributions of propensity scores prior to this re-weighting by village.
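The village re-weighting described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the analysis: the function name and the toy data are hypothetical, and the conditional probabilities are estimated by simple sample shares within the harmed and unharmed groups.

```python
from collections import Counter


def village_weights(villages, harmed):
    """Weight 1 for harmed respondents; for unharmed respondents,
    P(village | harmed) / P(village | not harmed), using sample shares.

    Assumes at least one harmed and one unharmed respondent overall.
    """
    treated = Counter(v for v, d in zip(villages, harmed) if d == 1)
    control = Counter(v for v, d in zip(villages, harmed) if d == 0)
    n_treated = sum(treated.values())
    n_control = sum(control.values())

    weights = []
    for v, d in zip(villages, harmed):
        if d == 1:
            weights.append(1.0)
        else:
            share_treated = treated[v] / n_treated  # 0 if no one harmed in v
            share_control = control[v] / n_control
            weights.append(share_treated / share_control)
    return weights


# Toy data (hypothetical): two villages, seven respondents.
villages = ["A", "A", "A", "B", "B", "B", "B"]
harmed = [1, 1, 0, 1, 0, 0, 0]
weights = village_weights(villages, harmed)

# Weighted share of unharmed respondents from village A after re-weighting.
control_total = sum(w for w, d in zip(weights, harmed) if d == 0)
control_a = sum(w for w, v, d in zip(weights, villages, harmed)
                if d == 0 and v == "A")
print(control_a / control_total)
```

In the toy data, two of the three harmed respondents come from village A, and after re-weighting the unharmed group has the same two-thirds share from village A. Unharmed respondents from villages with no harmed respondents receive weight zero under this scheme.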
Before re-weighting, the balance is clearly not good (Figure 2-1, top row), reflecting that some villages experienced much more complete violence than others. However, once the untreated observations are re-weighted to adjust for differences in village of origin, the balance is extremely good, with very similar distributions of propensity scores for the treated and untreated (Figure 2-1, bottom row). This boosts our confidence that those units within a single village and gender group are exposed to violence in ways unrelated to any of the observed pre-treatment covariates.

Main Results

Treatment effects were estimated using linear regression models, OLS with weights determined by entropy balancing, and Mahalanobis matching. Results from models on each of the five outcomes are shown in Figure 2-2. We first examine the OLS results. Given the identification assumptions, it should be necessary only to regress the outcome on the treatment and village and gender dummies. I refer to this model as the "short" specification. Adding further covariates to the model ("long") is not required for unbiased estimation, but allows these (pre-treatment) covariates to explain additional variation, possibly improving the precision of estimates.

Coefficient estimates from the short and long OLS models are given in Table 2.2 and summarized in Figure 2-2. Both reveal the same pattern, as expected since the covariates are effectively controlled for by design. Those who report being directly harmed are approximately 10 percentage points more likely to say it is possible to live in peace with former enemies, with individual members of the Janjaweed, or with the tribes from which the Janjaweed were drawn. Results on these three outcomes, under either model, fall in the 8-11 percentage point range. These are substantively significant as well: each of these outcomes had an unconditional mean between 0.17 and 0.40, making increases of 10 percentage points quite large, generally more than 25% of each variable's mean.
Those directly harmed are also 9-11 percentage points less likely to choose execution as the punishment for Government of Sudan soldiers (compared to an overall mean of 62%). The factor created by a weighted average of these four, Peace Index, is also significantly affected by Physical Harm, rising by 0.13 among those harmed. Peace Index is no longer binary, and has a minimum of -0.20 rather than 0. The effect size of 0.13 amounts to 31% of the distance between the minimum and the mean. Together, these results consistently point towards the hypothesis that exposure to violence stimulates a greater desire for or belief in the possibility of peace, and lesser desire to punish enemies with death. Evidence, thus, appears to lie in favor of the "weary", rather than the "angry" hypothesis.

Entropy Balancing

To reduce possible model dependency while ensuring effectively perfect balance on selected covariate moments, I also employ entropy balancing (Hainmueller, 2012). This approach chooses weights for the control units such that after weighting, the marginal distributions of covariates are the same for the treated and untreated up to a specified number of moments, while keeping the weights as close as possible to equality. Entropy balancing is successful in equating the means and variances of the covariate distributions between those directly harmed and those not directly harmed. I then employ these weights in regressions with village fixed effects to complete the required conditioning. Again, this is done with (a) a "short" model with the minimal conditioning to achieve identification (gender and village fixed effects), and (b) a "long" specification in which covariates are included in the regression stage for additional robustness.
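To make the weighting idea concrete, the following is a deliberately simplified Python sketch of entropy balancing on a single covariate and a single moment (the mean). It is not the implementation used for the analysis, which balances means and variances of several covariates. The sketch relies on the known form of the solution with uniform base weights, in which control weights are proportional to $\exp(\lambda x_i)$, and finds $\lambda$ by bisection; all function names and the toy ages are hypothetical.

```python
import math


def tilted_weights(x, lam):
    """Normalized weights proportional to exp(lam * x_i)."""
    shift = max(lam * xi for xi in x)  # subtract the max for numerical stability
    raw = [math.exp(lam * xi - shift) for xi in x]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_mean(x, w):
    return sum(wi * xi for wi, xi in zip(w, x))


def entropy_balance_mean(x_control, target_mean, iterations=200):
    """Control weights (summing to 1) whose weighted mean of x equals
    target_mean, staying as close as possible to uniform in the
    entropy (Kullback-Leibler) sense."""
    if not min(x_control) < target_mean < max(x_control):
        raise ValueError("target mean must lie strictly inside the control range")

    # The weighted mean is increasing in lam, so bracket the root, then bisect.
    lo, hi = -1.0, 1.0
    while weighted_mean(x_control, tilted_weights(x_control, lo)) > target_mean:
        lo *= 2.0
    while weighted_mean(x_control, tilted_weights(x_control, hi)) < target_mean:
        hi *= 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if weighted_mean(x_control, tilted_weights(x_control, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted_weights(x_control, (lo + hi) / 2.0)


# Toy example (hypothetical ages): re-weight controls so that their mean
# age matches a treated-group mean of 38.
control_ages = [20.0, 25.0, 30.0, 35.0, 40.0, 60.0]
w = entropy_balance_mean(control_ages, 38.0)
print(weighted_mean(control_ages, w))
```

When the target already equals the unweighted control mean, the procedure returns (numerically) equal weights, which is the sense in which the weights are kept "as close as possible to equality".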
The entropy balancing results are summarized in Figure 2-2 and Table 2.3, and are very similar to those produced by the OLS analyses: respondents directly harmed by violence are 8-12 percentage points more likely to give the pro-peace or "weary" response to all questions, all of which are highly significant. Peace Index rises by 0.14-0.16 among those exposed to direct violence.

Matching

Finally, matching offers an alternative estimation approach. Here, the aim is less to improve balance on observables than to allow for conditioning on covariates in a way that is less model-dependent than linear modeling. Mahalanobis matching was used, with 1-to-1 matching without replacement. The variables matched on were the same as those in the multivariate models above: all available pre-treatment variables with enough variation such that at least 10% of the participants fall in the smaller group. Matching is exact on all variables except age and household size in Darfur. Post-matching balance tests showed no statistically significant imbalances on any covariates.

Table 2.4 shows estimates from the matching analyses. The findings are consistent with the regression estimates, though larger and more significant in some cases. While the number of observations is substantially lower due to the strict matching requirements, the estimates are precise enough to allow gender-specific effects to be estimated more precisely than under regression. The effects all lie in the same direction for men and for women. However, the effects for men tend to be larger. The only outcome on which the effects dramatically differ by gender is "peace with Janjaweed Tribes" (Peace Janjaweed Tribes), such that men see a large 20 percentage point increase in positive responses after exposure to violence, while women see an insignificant change of only 3 percentage points.
The effect of physical harm on Execute Soldier is also significantly negative for men (as it is in the overall sample), and negative but nonsignificant among women. Otherwise, all the effects that were significant for men or for the overall sample are significant among women as well.

2.5 Robustness

As this is an observational study, further validity checks and an examination of possible alternative explanations are in order before proceeding to interpretation.

Robustness to Confounders

The validity of this finding depends on the absence of unobserved confounders, which in turn is plausible only if violence was targeted on gender and village, but was indiscriminate within these strata. While this cannot be definitively proven, I show that the results observed here are unlikely to be the result of confounding through: (1) consideration of the likely direction of bias if confounders did exist, (2) a placebo test, and (3) sensitivity analysis.

First, the direction of the effect is opposite to what we would expect from the most likely sources of confounding. We would typically expect that an unobserved characteristic driving some people to "select into" experiencing direct violence would be associated with more "angry" attitudes, not less angry ones. For example, those who are more anti-government or more interested in supporting the rebellion may rush into the fight, increasing their chances of exposure to violence, but would be expected to give the less peaceable answer to survey questions. The observed effect, however, is in the opposite direction.

Second, a placebo test further supports the identification assumption. Note that those experiencing physical harm are more likely to report they would vote in future elections (11%, p < 0.01), echoing findings of other studies (Blattman, 2009; Bateson, 2012). However, the variable pastvoted is a pre-violence measure of whether individuals voted in the past.
According to the identification assumptions, conditional on village and gender, there should be no relation between physical harm and pastvoted, even though we do see a relationship between physical harm and wouldvote. Using identical analyses to those above, I find no effect of direct violence on past voting ($\beta = 0.02$, $p = 0.62$ using OLS-long, for example). The finding that physical harm strongly affects whether people would vote in the future, but correctly shows no effect on whether people did vote prior to treatment, is useful evidence that physical harm was distributed without reference to pre-existing political attitudes within each village-by-gender cell.

Third, sensitivity analyses are useful for examining the robustness of the results to violations of the identification assumption. I use an approach similar to Imbens (2003). Suppose the "true" model is $y = X\beta + Z\gamma + \epsilon$, where $y$ is the outcome of interest (here, Peace Index), $X$ contains the treatment, intercept, and covariates, $\beta$ is the true (causal) effect of each variable in $X$ on $y$, $Z$ is an unobserved confounder, and $\gamma$ is the effect of this confounder on $y$. If we estimate this model using OLS on only the observables ($X$), then, in expectation, $\hat{\beta} = \beta + \gamma (X^\top X)^{-1} X^\top Z$. That is, the bias is the product of (a) the effect of $Z$ on Peace Index ($\gamma$), and (b) the strength of the correlation between the treatment and the confounder, measured as the predictiveness of the treatment for the confounder after controlling for the rest of $X$, $(X^\top X)^{-1} X^\top Z$, which estimates $E[Z \mid \text{PhysicalHarm} = 1, X] - E[Z \mid \text{PhysicalHarm} = 0, X]$. Figure 2-3 shows the "true" treatment effect implied by varying the degree of confounding using these two parameters. Note that I make the worst-case assumption that $\gamma$ and $(X^\top X)^{-1} X^\top Z$ are signed so as to produce bias in the direction of the result; if either sign were to change, the direction of bias would imply that the result was actually stronger than what was observed.
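The logic of this sensitivity analysis can be illustrated with a small numerical sketch in Python. It is a stripped-down version under loud assumptions: there are no covariates other than the treatment, the outcome is generated without noise so the identity holds exactly, and all numbers (a true effect of 0.10, a confounder effect of 0.05) are hypothetical. With a binary treatment and no other covariates, the imbalance term reduces to the difference in the confounder's mean between harmed and unharmed respondents.

```python
def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def implied_true_effect(observed, gamma, delta):
    """Observed estimate minus the bias gamma * delta, where gamma is the
    confounder's effect on the outcome and delta its imbalance across
    treatment groups."""
    return observed - gamma * delta


# Hypothetical data: binary treatment and an unobserved confounder z.
harm = [1, 1, 1, 1, 0, 0, 0, 0]
z = [2.0, 1.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0]

beta, gamma = 0.10, 0.05  # hypothetical true effect and confounder effect
y = [beta * d + gamma * zi for d, zi in zip(harm, z)]  # noise-free outcome

observed = slope(harm, y)  # estimate when z is omitted
delta = slope(harm, z)     # mean of z among harmed minus mean among unharmed
recovered = implied_true_effect(observed, gamma, delta)

print(observed, delta, recovered)
```

Scanning a grid of hypothetical values for `gamma` and `delta` with `implied_true_effect`, starting from an observed estimate, traces out the kind of surface such a sensitivity plot displays.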
For comparability, Figure 2-3 also shows the confounding effects of each covariate included, had it not been observed. This shows that in order for an omitted confounder to reduce the true effect so far that it cannot be distinguished from zero (the red dotted line), it would have to be a considerably stronger confounder than any observed covariate. For example, to imply a true treatment effect statistically indistinguishable from zero, a confounder would have to be as strongly correlated with Physical Harm as age, but would need to have an effect on Peace Index more than 10 times larger than that of age. In another example, female has a large correlation with the outcome (larger even than Physical Harm), as there is a substantial gender "effect" in the data. However, even for a confounder as strongly related to Peace Index as female (which is difficult to imagine), in order to reduce the implied treatment effect to the critical value, the treatment would have to be three times more strongly related to such a confounder as it is to female.

Interference Between Units

Another concern is interference between units, or spillover. In this case, one cannot reasonably assume that the Stable Unit Treatment Value Assumption (SUTVA) of zero spillover is valid. Instead, I examine possible violations of this assumption, and determine how each would alter the meaning of the estimated effect given that interference occurs. One possibility is that when a person experiences direct violence, those around her who do not experience it (but hear about it or observe it) receive on average a mitigated effect in the same direction.[7] If this is the case, it ensures a bias towards zero on the estimated treatment effect, as those classified as unexposed to violence are actually "partially" exposed to it. This would suggest that the true effect is stronger than the estimate on the observed data.
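A stylized calculation makes the direction of this bias explicit. Assume, purely for illustration, a constant direct effect $\tau$, a baseline outcome level $\mu_0$, and that each unexposed respondent receives a fraction $s$ of the effect through spillover; the symbols $\tau$, $\mu_0$, and $s$ are introduced here and are not part of the original analysis. Under as-if-random exposure within strata, the comparison of exposed and unexposed respondents then recovers

```latex
\begin{align*}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]
  &= (\mu_0 + \tau) - (\mu_0 + s\tau) \\
  &= (1 - s)\,\tau .
\end{align*}
```

For $0 \le s < 1$ the estimate has the same sign as $\tau$ but is smaller in magnitude, so the true effect is understated. A negative value of $s$ would instead make the estimate larger in magnitude than $\tau$.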
Alternatively, "negative" spillover is also possible: it could be that when person j experiences violence, its effect on person i's (non-treatment) outcome is on average opposite in direction to the average treatment effect. This type of spillover would be implicated if, for example, those who are harmed become more pro-peace, but those not harmed experience "survivor's guilt," and as a result become more anti-peace. This example would not invalidate the finding here of the pro-peace effect of violence, but it would suggest the observed effect is exaggerated relative to the true individual effect.

However, the data do not show evidence for spillover of either the partial-treatment type or the negative type. In addition to asking about exposure to physical harm, we also asked individuals how many family members were killed or maimed, whether they witnessed other family members being injured, or whether they witnessed non-family members being injured. Because these measure harm experienced by those close to the respondent but not the respondent herself, they essentially allow a direct test of how violence committed against others affects the attitudes of the respondent. Using the same specifications and models as above, these measures of indirect exposure show no significant effect on attitudes in either direction.[8]

[7] That is, when person j is exposed to physical harm, the effect of j's exposure on person i's non-exposure outcome ($Y_i(0)$) is, on average, in the same direction as the average treatment effect.

Correlated Measurement Error

Another potential threat is that some respondents are of a "sophisticated" type, and seek to show the survey enumerators that (a) they have suffered and are thus in need of support from donors, and (b) they are of a pacific, conciliatory nature, more likely to attract donors to continue supporting the camps.
This is effectively a concern about non-classical measurement error: the error or misrepresentation in the measurement of the treatment may be correlated with error in the outcome. This is unlikely to explain the observed effect: if strategic misrepresentation of this type were driving the effect, we would also expect to see a (false) effect for indirect forms of violence, such as the loss of family members. The same individuals would be expected to over-report losses on these measures, while also reporting being more conciliatory, again confounding the relationship between Physical Harm and attitudes. Since we see no measurable effect of indirect forms of exposure on attitudes, however, such a confounder seems unlikely.

[8] These results are available upon request. Also note that this result does not rule out the possibility of a spillover effect so broad that even those who report no indirect harm experience that spillover.

Survivorship Effects

A final set of concerns is that the population from which we sample is censored in some way that may bias the results. As noted, the population from which we sample is in no way representative of the population of Darfur: individuals only appear in the population studied here if they survived the initial attack, chose to come to refugee camps in Chad rather than seek refuge elsewhere or join the rebel movements to stay and fight in Darfur, and survived the trip to Chad. To the degree that those who were directly, physically harmed and those who were not physically harmed experience the same selection pressures on who makes it to the camps (that is, the relationship between potential outcomes and making it into the camp does not depend on treatment status), then these pressures alter the population about which we make inferences, but cause no bias in the causal estimates of the effect of physical harm.
In contrast, selection pressures that occur differentially depending on whether one is directly harmed could cause a biased estimate of the effect of violence. The first concern of this type is that among those who are directly physically harmed during an attack, the chances of death are higher. It seems plausible, however, that among those who are physically harmed, whether they survive that injury or not is unrelated to their potential attitudes. Likewise, among those who are not harmed, whether they survive is surely uncorrelated with their attitudes. As long as this reasonable assumption holds, the higher death rate among those who are injured does not introduce a bias.

A related concern is selective mobilization into rebel groups depending on Physical Harm. Among those present during the attack but not physically harmed, the more "angry" ones may have joined the rebel movements rather than coming to the refugee camp. Among those physically harmed, on the other hand, even the angry ones may come to the camp for medical care, regardless of their attitudes. This would bias the results, but in the opposite direction of the observed effect. If the more "angry" individuals from the unharmed sub-population join the rebel movements rather than coming to the camps, it would make the resulting non-harmed group in the camp appear less angry, but we observe the opposite.

2.6 Discussion

Violence directed against civilians during civil wars could lead those who experience it to be either more resistant to or more supportive of peace. On the one hand, exposure to such violence may increase grievances against, fear of, or anger towards the perpetrating group. Any of these could drive civilians to support armed groups that would offer them protection, and to resist calls for disarmament until their fears, angers, or grievances have been addressed.
In such cases, we expect to see exposure to violence leading to increased pessimism about the prospects for peace, and/or increased willingness to punish one's enemies (the "angry" hypothesis). On the other hand, exposure to violence may increase the perceived cost of supporting ongoing conflict, improving the attractiveness of peace despite whatever heightened fears, anger, or other effects it produces. If this effect dominates, we expect to see exposure to violence lead to increased desire for peace (the "weary" hypothesis).

The findings here consistently support the "weary" hypothesis: those exposed to direct violence are approximately 10 percentage points more likely to report the "weary" or pro-peace answer using four different measures and under a variety of different modeling approaches. This effect is large, with exposure to physical harm increasing the probability of giving the more pro-peace answer by 17-48% of the mean probability for each item.

Prior quantitative studies have not directly examined the "angry" versus "weary" question, complicating comparisons to existing literature. However, this finding is roughly consistent with those cross-national studies that indirectly suggest a "weariness effect" through the finding that longer wars are associated with better chances at future peace (Doyle and Sambanis, 2000; Walter, 2004; Fortna, 2004). It may also be consistent with Lyall (2009), which found that indiscriminate violence by the incumbents may have led to lower support even for insurgents, as areas subject to such violence staged fewer insurgent attacks. At the micro-level, Beber et al. (2012) find that those exposed to riot-related violence in Khartoum (the capital of Sudan) were far more likely to support South Sudan's secession.
This is roughly consistent with the "weary" finding here, as it implies a willingness to put an end to hostilities even if it comes at potentially high economic cost and means granting the opposition its long-standing aims.[9] In addition, if the "weary" effect here is regarded (cautiously) as a pro-social outcome, it is consistent with other micro-level studies that show relatively pro-social effects, including increased political and social engagement (Bellows and Miguel, 2009; Blattman, 2009), or heightened altruism, if only towards kin or coethnics (Gilligan et al., 2011; Voors et al., 2011; Choi, 2007; Cassar et al., 2012).[10]

[9] However, an alternative interpretation would be that granting secession to South Sudan would have the effect of potentially removing South Sudanese from Khartoum. Thus a host of motives besides "weariness" could produce this result.

While this study and the micro-level work cited above have focused on civilian attitudes in the wake of violence, an influential strand of research in the civil war literature suggests we should be principally concerned with elites, who are thought to largely control public narratives, alliance-formation, and the mobilization of violence (e.g. Christia, 2012; Fearon and Laitin, 2000; Wilkinson, 2006). However, the civilian-centric and elite-centric approaches are complementary. First, even elites do not operate in a vacuum; the perceptions, attitudes, and emotions of the civilian populations they seek to influence shape the opportunities and strategies available to them. Second, if elite manipulation matters, it is ultimately through its ability to influence individual attitudes, perceptions, or incentives. By studying civilian responses to violence, we are studying the net effect of many influences, including elite manipulations which have already occurred in the wake of these events.
Understanding the various mechanisms that result in changes in individual attitudes - ranging from intrinsic individual responses, to the effect of social groups, and to the efforts of elites to shape these responses - remains a key area for future research.

[10] It is important not to overstate those effects of violence that might possibly be pro-social. While Beber et al. (2012) suggest that violence increases readiness to make peace, those exposed to violence were also less willing to grant citizenship to Southerners remaining in the North, presumably out of heightened concern for their safety. Moreover, many counter-productive effects of violence have also been documented regarding psychological trauma (Pham et al., 2004; Vinck et al., 2007; Pham et al., 2009), reduced education, employment, and future earnings (Blattman and Annan, 2010; Akresh and De Walque, 2008), as well as negative effects on trust (Cassar et al., 2012; Nunn and Wantchekon, 2009; Becchetti et al., 2011) regarding increase trust, at parochially). Most directly, violence clearly carries with it a horrific and unacceptable direct cost, and the possibility of there being some positive effects does not suggest that violence has a net positive effect.

The generalizability of these findings likely depends on several factors. The Darfur case is one of indiscriminate violence directed against civilians, and the results may be specific to this type of violence, as opposed to cases of selective violence targeted against individual insurgents, the accidental killing of civilians in otherwise carefully targeted violence, or violence resulting from denunciation of collaborators by community members (e.g. Kalyvas, 2006). In Darfur, civilians were also attacked both by their own government and by members of other civilian communities with whom they share a history of antagonism and conflict.
This may also be a relevant feature to consider when generalizing the finding to other cases, though future research will be needed to establish the exact conditions under which the effect found here is likely to hold.

Possible Mechanisms

Having found robust support in favor of the "weary" hypothesis, I briefly consider three possible explanations to help inform future research on possible mechanisms. The first explanation is a calculus of perceived cost and benefit. In its simplest form, those subject to direct physical violence experience heightened suffering and, therefore, see greater costs to ongoing conflict, translating into a greater desire for peace and heightened perceived attractiveness of the pre-war status quo.

Second, recent work supports the claim that in situations of war and other forms of violence, "post-traumatic growth" (Tedeschi et al., 1998) is a more common outcome than alienation or debilitating psychiatric illness (see Tedeschi and Calhoun, 2004; Bateson, 2012; Blattman, 2009). Most closely related, Blattman (2009) describes interviews suggesting that abduction by the Lord's Resistance Army in Uganda, followed by managing to escape and survive thereafter, leads to rapid maturation and an increased sense of control over one's life. This could presumably lead to behavior changes, though why it should specifically lead to an increased desire for peace remains unclear.

A third possibility is suggested by the combination of a demand for retributive violence with the special status of individuals with injuries. Communities that live far from government protections and have easily lootable capital stocks - an apt description of Darfur - must maintain a reputation for toughness and the willingness to use reciprocal violence when slighted. In such a "culture of honor", as described by Nisbett and Cohen (1996), individuals are expected to show a desire for retribution in response to attacks on group members.
Evidence collected during the survey - such as songs sung by women calling for their men to take retributive action - suggests that the demands for retributive violence are high among these communities (see also Hastrup, 2013). Against this backdrop, however, it appears that those individuals who are directly harmed - particularly those with physical evidence of their injuries - have heightened legitimacy to speak on violence, and can promote peace without fear of appearing cowardly.[11]

Such arguments are clearly tentative, however, and only suggestive of directions for future research. In the available data, the lack of evidence for any effect due to non-direct forms of harm would seem to challenge both the "heightened cost" and "personal growth" mechanisms: it is difficult to see why personal injury should trigger greater increases in perceived cost or personal growth while the other forms of harm visited on many in the sample do not. However, effects of indirect harm might also be harder to detect because they may be weaker and more susceptible to downward bias through positive spillover. The "culture of honor exemption" theory also raises questions as to (a) why individuals who are injured would be given the hypothesized exemption from norms calling for reciprocal violence, especially in a context where so many individuals have lost so much; and (b) why harmed individuals would be inclined to use this exemption as an opportunity to become more pro-peace.

Nevertheless, a combination of mechanisms provides a plausible candidate for future investigation. First, if the culture of honor "exemption" really gives injured individuals the necessary status to speak without fear of being branded cowardly, then either the post-traumatic growth or increased perceived-costs mechanisms could explain why they use that opening to espouse pro-peace attitudes.
A characteristic of this combined theory is that, if physical wounds act as evidence of one's hardship and a mark of authority to speak on the subject of violence, this could explain why those who experience indirect harm do not show such an effect.

[11] In our anecdotal experience of conducting and filming interviews in these camps, those with physical evidence of injuries - such as amputations, shrapnel, and scars - were often eager to approach our research team and to be interviewed. While clearly only suggestive, this is consistent with the view that individuals with physical evidence of harm appear to have the strongest mandate or motivation to demand an audience.

2.7 Conclusions

Violence against non-combatant civilians is a common feature of many civil wars, and beyond the obvious human cost of such violence, it can shape the possible trajectory and outcomes of conflict. Two plausible theoretical claims produce opposing hypotheses: does exposure to violence make individuals more "angry", vengeful, or likely to view peace as impossible on the one hand, or more "weary" or pro-peace on the other? The results strongly and consistently favor the "weary" hypothesis: those who report being injured or maimed were approximately 10 percentage points more likely to say it is possible to live in peace with former enemies, to live in peace with individual Janjaweed, or to live in peace with the tribes from which the Janjaweed were recruited. They are also roughly 10 percentage points less likely to demand that Government of Sudan soldiers be executed. These effects are substantively large, amounting to 17-48% of the mean probability of giving the pro-peace answer on each item.

This study can only hope to maximize internal validity in the single case of Darfurian refugees in eastern Chad. That said, this is a particularly important and severely understudied case.
A valuable next step would be to assess its generalizability by similarly examining the effects of personal violence on attitudes towards peace in other conflicts.

This paper contributes to a small but growing literature identifying the effects of violence during conflict, but is the first to directly test the "weary" versus "angry" hypotheses. This is also the first study to make micro-level causal inference in the case of Darfur, which has been vastly understudied relative to the scale of violence and policy attention. While not directly comparable to any prior study, these findings are broadly consistent with an emerging view that, perhaps surprisingly, exposure to violence is associated with some positive shift in individuals' social and political engagement (Bellows and Miguel, 2009; Blattman, 2009; Gilligan et al., 2011; Bateson, 2012).

Having estimated this effect, arbitrating among the mechanisms that generate it remains an important task for future work. Those exposed to physical harm may perceive a shift in the cost of conflict relative to others, making ongoing violence less appealing and the pre-war status quo more appealing. Alternatively, individuals who undergo heightened suffering due to direct physical violence may experience "post-traumatic growth". It is also possible that individuals physically harmed are "exempted" from demands to show anger and a desire for vengeance. Accordingly, further research could fruitfully help to understand these mechanisms by examining whether victims of physical harm see the costs of the conflict as being starker than others do, whether they show post-traumatic growth or other changes in a variety of domains, and whether communities view victims of physical violence as having greater authority to speak, or as having an exemption from norms requiring support for retribution.
An important consequence of this finding is the possibility that, by enhancing weariness, exposure to direct physical violence may mitigate rather than potentiate the support civilians are willing to provide to violent actors. This matters, first, in terms of the direct impact of such shifts on civilians' willingness to contribute to armed conflict (through providing safe haven, providing or withholding information, material support, or direct participation). It may also influence the narratives that will resonate or be counter-productive for elites to exploit, as they seek to mobilize for war or peace in their interests. One practical policy lesson emerging from this analysis is that individuals harmed by violence are not to be treated as lost causes or likely spoilers of potential peace. On the contrary, they may be more peace-seeking than their neighbors. Political settlement and reconciliation processes would do well to incorporate these individuals as directly and inclusively as possible.

2.8 Tables and Figures

Table 2.1: Multivariate Balance Conditional on Village Fixed Effects

DV: Physically Harmed        Males (p-val)      Females (p-val)
Age                          -0.003 (0.196)     -0.001 (0.801)
Farmer in Darfur             -0.031 (0.733)      0.027 (0.732)
Herder in Darfur              0.201 (0.045)      0.144 (0.151)
Voted in past                -0.011 (0.861)      0.093 (0.287)
Household size in Darfur     -0.004 (0.541)      0.006 (0.352)
Merchant in Darfur            0.135 (0.096)      NA
Tradesman in Darfur           0.037 (0.800)      NA
Joint F                       1.10               0.99
Joint p                       0.362              0.43
N                             640                588

Note: Conditional balance test examining whether, within village and gender, observable pretreatment covariates have the same means for those who were and were not physically harmed. The treatment indicator (physical harm) is regressed on village fixed effects and all pre-treatment covariates, separately for men and for women. The results show good balance overall: all coefficients are near zero, with the only significant estimate being a dummy for individuals who were herders in Darfur.
Joint significance tests fail to reject the null hypothesis that all coefficients (except those on the village fixed effects) are zero. Thus, taken together these covariates are not significantly predictive of being physically harmed.

Table 2.2: Effect of Physical Harm on Attitudes: OLS Regression Estimates

Outcome                        Mean(DV)  Model  Physical harm     Intercept       N
Peace with Former Enemies      0.40      (1)     0.087 (0.042)    0.49 (0.030)    1294
                                         (2)     0.083 (0.043)    0.61 (0.078)    1270
Peace With Janjaweed Indiv.    0.17      (3)     0.075 (0.034)    0.51 (0.024)    1303
                                         (4)     0.082 (0.035)    0.22 (0.064)    1279
Peace With Janjaweed Tribes    0.32      (5)     0.097 (0.039)    0.40 (0.028)    1303
                                         (6)     0.099 (0.040)    0.51 (0.076)    1278
Would Execute Gov. Soldiers    0.63      (7)    -0.092 (0.043)    0.56 (0.032)    1225
                                         (8)    -0.11 (0.044)     0.50 (0.082)    1223
Peace Index                    0.22      (9)     0.13 (0.044)     0.32 (0.032)    1188
                                         (10)    0.13 (0.044)     0.46 (0.078)    1168

Robust SEs in parentheses. Controls: age, farmer, herder, past vote, household size in Darfur.

Note: OLS estimates of the effect of being physically harmed on each outcome. All models include a gender dummy and village fixed effects, as required to meet identification conditions. The "long" models (odd numbers) also include pre-treatment controls, despite apparent balance on these, to improve precision of the estimate. All estimates show that exposure to physical harm (Physical Harm) produces a 9-11 percentage point increase in pro-peace attitudes. The reduction in willingness to execute Government of Sudan soldiers by a similar amount is consistent with this pro-peace effect. Similarly, Peace Index is the single-factor solution combining all the other outcome variables into one, and shows the most significant effect of Direct Harm. As all the effects fall in the pro-peace direction, these findings support the "weary hypothesis" rather than the "angry hypothesis". The effect sizes are not only statistically significant, but also substantively large relative to the means of each outcome, also shown. These results are summarized together with other models in Figure 2-2.

Figure 2-1: Propensity Scores for Harmed and Unharmed

[Figure: four density panels comparing propensity scores of treated and untreated individuals. Horizontal axis from 0.0 to 1.0. Panels report N = 266, bandwidth = 0.08611 and N = 235, bandwidth = 0.0905.]

Top row: Propensity scores for treated (harmed) and untreated (unharmed) individuals, using the same linear model of pre-treatment covariates used in multivariate balance testing. Top left: male only; Top right: female only. These show that without conditioning on village, the covariate values of those who are harmed and those who are unharmed are different, allowing the propensity score model to distinguish between those likely to be harmed and those who are not. Bottom row: Propensity scores for harmed and unharmed individuals, after re-weighting the data so that the distribution of villages is the same for the harmed and unharmed. Bottom left panel: male only; Bottom right: female only. These results show that the apparent imbalances seen in the top row disappear when village of origin is taken into account. It thus provides a visual illustration of the balance on covariates within village of origin and gender.
Figure 2-2: Estimated Effect of Exposure to Physical Harm on Attitudes under Five Models

[Figure: effect estimates for five outcomes (Peace with Former Enemies, Peace with Janjaweed Individuals, Peace with Janjaweed Tribes, Should Execute Government Soldiers, Peace Factor) under five models (OLS-short, OLS-long, Ebal-short, Ebal-long, Match). Horizontal axis: Effect Estimate of Exposure to Direct Harm, from -0.3 to 0.3.]

Note: Summary of effect estimates on various outcomes under five models: OLS with only minimal covariates (OLS-s), OLS with additional covariates (OLS-l), entropy balancing followed by weighted OLS with minimal covariates (ebal-s) and with additional covariates (ebal-l), and matching. As discussed in the text, all models find that being directly harmed moves all variables in the "pro-peace" or "weary" direction. Specifically, regardless of the model used, those directly harmed show an 8-12 percentage point increase in their beliefs that it is possible to live in peace with former enemies, with Janjaweed individuals, or with Janjaweed tribes. Congruently, those directly harmed are also approximately 10 percentage points less likely to say that execution would be the appropriate punishment for Government of Sudan soldiers. Finally, a single dimensional index made from the above four variables shows the strongest effects, again in the pro-peace direction. Evidence is thus consistent with the "weary" hypothesis and opposite to what is predicted by the "angry" hypothesis.

Table 2.3: Effect of Physical Harm on Attitudes: Entropy Balanced Regression Estimates

Outcome                        Mean(DV)  Model  Physical harm     N
Peace with Former Enemies      0.40      (1)     0.097 (0.046)    1270
                                         (2)     0.083 (0.044)    1270
Peace With Janjaweed Indiv.    0.17      (3)     0.094 (0.036)    1279
                                         (4)     0.088 (0.036)    1279
Peace With Janjaweed Tribes    0.32      (5)     0.11 (0.042)     1278
                                         (6)     0.10 (0.041)     1278
Would Execute Gov. Soldiers    0.63      (7)    -0.092 (0.045)    1203
                                         (8)    -0.11 (0.046)     1203
Peace Index                    0.22      (9)     0.16 (0.047)     1168
                                         (10)    0.14 (0.045)     1168

Intercept estimates (SE), in extraction order and not matched to models: 0.44 (0.082), 0.40 (0.085), 0.51 (0.078), 0.15 (0.025), 0.28 (0.023), 0.13 (0.019), 0.68 (0.025), 0.22 (0.067), 0.59 (0.081), 0.36 (0.026).

Robust SEs in parentheses. Controls: age, farmer, herder, past vote, household size in Darfur.

Note: Estimates of the effect of being directly harmed on each outcome. Weights derived from entropy balancing ensure that those who were and who were not directly harmed have the same mean and variance on pre-treatment covariates. The "short" and "long" OLS models are then run on the re-weighted data. As before, the models show that exposure to physical harm produces an 8-11 percentage point increase in pro-peace attitudes, or congruently, a 9-11 percentage point decrease in reported desire to execute Government of Sudan soldiers. Peace Index shows the largest effect of Physical Harm. Again, all results are in the "pro-peace" direction, consistent with the "weary" hypothesis, and contradictory to the "angry hypothesis". The effect sizes are not only statistically significant, but are also substantively large relative to the means of each outcome. These results are summarized together with those of other models in Figure 2-2.

Table 2.4: Effect of Exposure to Physical Harm on Attitudes: Matching Estimates

                   Peace with       Peace With         Peace With         Would Execute    Peace
                   Former Enemies   Janjaweed Indiv.   Janjaweed Tribes   Gov. Soldiers    Index
All
  Mean(DV)         0.40             0.17               0.32               0.63             0.22
  physical harm    0.13             0.11               0.14               -0.12            0.19
  p-val            0.00             0.00               0.00               0.00             0.00
  Npairs           254              258                260                248              231
Male
  Mean(DV)         0.55             0.24               0.46               0.53             0.39
  physical harm    0.20             0.14               0.21               -0.09            0.26
  p-val            0.00             0.00               0.00               0.01             0.00
  Npairs           118              119                118                108              101
Female
  Mean(DV)         0.23             0.09               0.18               0.73             0.04
  physical harm    0.08             0.07               0.03               -0.04            0.10
  p-val            0.01             0.00               0.35               0.20             0.00
  Npairs           99               91                 85                 97               94

Controls: age, farmer, herder, past vote, household size in Darfur.

Note: Matching estimates of the effect of being directly harmed on each outcome, using Mahalanobis, 1-to-1 matching without replacement. Matching is exact on all variables except age and household size in Darfur. Results show that the effect of being directly harmed on each outcome variable is again in the "pro-peace" direction. On the full sample, effects are similar to OLS and entropy-balanced models, but slightly larger. Among females, effect sizes are somewhat smaller, and the effects of physical harm on Peace with Janjaweed Tribes and Would Execute Government Soldiers shrink substantially, losing significance. For Peace with Janjaweed Tribes in particular it appears that the observed aggregate effect is largely driven by males, while females show little or no effect. Otherwise, all effects significant in the overall model are significant among each gender separately.

Figure 2-3: Effect of Physical Harm on Peace Index implied by confounders of varying strength

[Figure: contour plot. Horizontal axis: Effect of Confounder on Peace Index, from 0.0 to 0.5. Vertical axis: strength of the confounder's relationship to Physical Harm. Labeled points mark included covariates (among them age, female, herder, pastvote, and farmer); the observed effect is marked at 0.13.]

Note: Sensitivity analysis. The "height" shown by contour lines gives the expected true size of the effect of Physical Harm on Peace Index, given a hypothetical confounder. The bias is parameterized by how strongly this confounder relates to Physical Harm (vertical axis) and how strongly it relates to the outcome (horizontal axis). For the true effect of Physical Harm on Peace Index to be statistically indistinguishable from zero, an unobserved confounder would have to be substantively more confounding than any of the included covariates.
For example, even a confounder as strongly correlated with Peace Index as female would have to be three times more predictive of exposure to Physical Harm in order for the true treatment effect to be statistically indistinguishable from zero.

Chapter 3

Kernel Regularized Least Squares

Kernel Regularized Least Squares: Reducing Misspecification Bias with a Flexible and Interpretable Machine Learning Approach

Jens Hainmueller - Massachusetts Institute of Technology
Chad Hazlett - Massachusetts Institute of Technology

ABSTRACT

We propose the use of Kernel Regularized Least Squares (KRLS) for social science modeling and inference problems. KRLS borrows from machine learning methods designed to solve regression and classification problems without relying on linearity or additivity assumptions. The method constructs a flexible hypothesis space that uses kernels as radial basis functions and finds the best-fitting surface in this space by minimizing a complexity-penalized least squares problem. We argue that the method is well-suited for social science inquiry because it avoids strong parametric assumptions, yet allows interpretation in ways analogous to generalized linear models while also permitting more complex interpretation to examine non-linearities, interactions, and heterogeneous effects. We also extend the method in several directions to make it more effective for social inquiry, by (1) deriving estimators for the pointwise marginal effects and their variances, (2) establishing unbiasedness, consistency, and asymptotic normality of the KRLS estimator under fairly general conditions, (3) proposing a simple automated rule for choosing the kernel bandwidth, and (4) providing companion software. We illustrate the use of the method through simulations and empirical examples.

3.1 Introduction

Generalized linear models (GLMs) remain the workhorse method for regression and classification problems in the social sciences.
Applied researchers are attracted to GLMs because they are fairly easy to understand, implement, and interpret. However, GLMs also impose strict functional form assumptions. These assumptions are often problematic in social science data, which are frequently ridden with non-linearities, non-additivity, heterogeneous marginal effects, complex interactions, bad leverage points, or other complications. It is well-known that misspecified models can lead to bias, inefficiency, incomplete conditioning on control variables, incorrect inferences, and fragile model-dependent results (e.g. King and Zeng (2006)). One traditional and well-studied approach to address some of these problems is to introduce high-order terms and interactions to GLMs (e.g. Friedrich, 1982; Jackson, 1991; Brambor et al., 2006). However, higher-order terms only allow for interactions of a prescribed type, and even for experienced researchers, it is typically very difficult to find the correct functional form among the many possible interaction specifications, which explode in number once the model involves more than a few variables. Moreover, as we show below, even when these efforts may appear to work based on model diagnostics, under common conditions, they can instead make the problem worse, generating false inferences about the effects of included variables. Presumably, many researchers are aware of these problems and routinely resort to GLMs not because they staunchly believe in the implied functional form assumptions, but because they lack convenient alternatives that relax these modeling assumptions while maintaining a high degree of interpretability. While some more flexible methods, such as neural networks (e.g. Beck et al., 2000) and Generalized Additive Models (GAMs, e.g. Wood, 2003), have been proposed, they have not been widely adopted by social scientists, perhaps because these models often do not generate the desired quantities of interest or allow inference on them (e.g. 
confidence intervals or tests of null hypotheses) without non-trivial modifications and often impracticable computational demands.

In this paper, we describe Kernel Regularized Least Squares (KRLS). This approach draws from Regularized Least Squares (RLS), a well-established method in the machine learning literature (see e.g. Rifkin et al., 2003).[1] We add the "K" to (a) emphasize that it employs kernels (whereas the term RLS can also apply to non-kernelized models); and (b) designate the specific set of choices we have made in this version of RLS, including procedures we developed to remove all parameter selection from the investigator's hands and, most importantly, methodological innovations we have added relating to interpretability and inference.

The KRLS approach offers a versatile and convenient modeling tool that strikes a compromise between the highly constrained GLMs that many investigators rely on and more flexible but often less interpretable machine learning approaches. KRLS is an easy-to-use approach that helps researchers to protect their inferences against misspecification bias and does not require them to give up many of the interpretative and statistical properties they value. This method belongs to a class of models for which marginal effects are well-behaved and easily obtainable due to the existence of a continuously differentiable solution surface, estimated in closed form. It also readily admits statistical inference using closed form expressions, and has desirable statistical properties under relatively weak assumptions. The resulting model is directly interpretable in ways similar to linear regression while also making much richer interpretations possible. The estimator yields pointwise estimates of partial derivatives that characterize the marginal effects of each independent variable at each data point in the covariate space.
The researcher can examine the distribution of these pointwise estimates to learn about the heterogeneity in marginal effects or average them to obtain an average partial derivative similar to a $\beta$ coefficient from linear regression.

[1] Similar methods appear under various names, including Regularization Networks (e.g. Evgeniou et al., 2000) and Kernel Ridge Regression (e.g. Saunders et al., 1998).

Because it marries flexibility with interpretability, the KRLS approach is suitable for a wide range of regression and classification problems where the correct functional form is unknown. This includes exploratory analysis to learn about the data-generating process, model-based causal inference, or prediction problems that require an accurate approximation of a conditional expectation function to impute missing counterfactuals. Similarly, it can be employed for propensity score estimation or other regression and classification problems where it is critical to use all the available information from covariates to estimate a quantity of interest. Instead of engaging in a tedious specification search, researchers simply pass the X matrix of predictors to the KRLS estimator (e.g. krls(y=y,X=X) in our R package), which then learns the target function from the data. For those who work with matching approaches, the KRLS estimator has the benefit of similarly weak functional form assumptions while allowing continuous valued treatments, maintaining good properties in high-dimensional spaces where matching and other local methods suffer from the curse of dimensionality, and producing principled variance estimates in closed form. Finally, although necessarily somewhat less efficient than Ordinary Least Squares (OLS), the KRLS estimator also has advantages even when the true data-generating process is linear, as it protects against model dependency that results from bad leverage points or extrapolation and is designed to bound over-fitting.
The main contributions of this paper are threefold. First, we explain and justify the underlying methodology in an accessible way and introduce interpretations that illustrate why KRLS is a good fit for social science data. Second, we develop various methodological innovations. We (a) derive closed-form estimators for pointwise and average marginal effects; (b) derive closed-form variance estimators for these quantities to enable hypothesis tests and the construction of confidence intervals; (c) establish the unbiasedness, consistency, and asymptotic normality of the estimator for fitted values under conditions more general than those required for GLMs; and (d) derive justification for a simple rule for choosing the bandwidth of the kernel at no computational cost, thereby taking all parameter-setting decisions out of the investigator's hands to improve falsifiability. Third, we provide companion software that allows researchers to implement the approach in R, Stata, and Matlab.

3.2 Explaining KRLS

Regularized least squares approaches with kernels, of which KRLS is a special case, can be motivated in a variety of ways. We begin with two explanations, the "similarity-based" view and the "superposition of Gaussians" view, which provide useful insight on how the method works and why it is a good fit for many social science problems. Further below we also provide a more rigorous, but perhaps less intuitive, justification.[2]

Similarity-Based View

Assume that we draw i.i.d. data of the form $(y_i, x_i)$, where $i = 1, \ldots, N$ indexes units of observation, $y_i \in \mathbb{R}$ is the outcome of interest, and $x_i \in \mathbb{R}^D$ is our D-dimensional vector of covariate values for unit i (often called exemplars).
Next, we need a so-called kernel, which for our purposes is defined as a symmetric and positive semi-definite function $k(\cdot, \cdot)$ that takes two arguments and produces a real-valued output.[3] It is useful to think of the kernel function as providing a measure of similarity between two input patterns. While many kernels are available, the kernel used in KRLS and throughout this paper is the Gaussian kernel given by

$$k(x_j, x_i) = e^{-\|x_j - x_i\|^2 / \sigma^2} \qquad (3.1)$$

where $e^x$ is the exponential function and $\|x_j - x_i\|$ is the Euclidean distance between the covariate vectors $x_j$ and $x_i$. This function is the same function as the normal distribution, but with $\sigma^2$ in place of $2\sigma^2$, and omitting the normalizing factor $1/\sqrt{2\pi\sigma^2}$. The most important feature of this kernel is that it reaches its maximum of one only when $x_i = x_j$ and grows closer to zero as $x_i$ and $x_j$ become more distant. We will thus think of $k(x_i, x_j)$ as a measure of the similarity of $x_i$ to $x_j$.

[2] Another justification is based on the analysis of reproducing kernels, and the corresponding spaces of functions (Reproducing Kernel Hilbert Spaces) they generate along with norms over those spaces. For details on this approach, we direct readers to recent reviews included in Evgeniou et al. (2000) and Schölkopf and Smola (2002).

[3] By positive semi-definite, we mean that $\sum_i \sum_j a_i a_j k(x_i, x_j) \geq 0$, $\forall\, a_i, a_j \in \mathbb{R}$, $x \in \mathbb{R}^D$, $D \in \mathbb{Z}^+$. Note that the use of kernels for regression in our context should not be confused with non-parametric methods commonly called "kernel regression" that involve using a kernel to construct a weighted local estimate.

Under the "similarity-based view", we assert that the target function $y = f(x)$ can be approximated by some function in the space of functions represented by[4]

$$f(x) = \sum_{i=1}^{N} c_i k(x, x_i) \qquad (3.2)$$

where $k(x, x_i)$ measures the similarity between our point of interest ($x$) and one of $N$ input patterns $x_i$, and $c_i$ is a weight for each input pattern.
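As a minimal sketch of equations (3.1) and (3.2), the following Python snippet evaluates the Gaussian kernel and the similarity-weighted prediction at a test point. The function names (`gaussian_kernel`, `predict`), the exemplars, the weights, and the bandwidth value `sigma2 = 1.0` are all illustrative assumptions; this is not the companion software, and the weights here are made up rather than estimated.

```python
import math

def gaussian_kernel(x_j, x_i, sigma2):
    """Gaussian kernel of equation (3.1): exp(-||x_j - x_i||^2 / sigma2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_j, x_i))
    return math.exp(-sq_dist / sigma2)

def predict(x, exemplars, weights, sigma2):
    """Equation (3.2): sum over exemplars of c_i * k(x, x_i)."""
    return sum(c * gaussian_kernel(x, x_i, sigma2)
               for c, x_i in zip(weights, exemplars))

# Hypothetical exemplars in R^2, hypothetical weights c_i, assumed bandwidth.
exemplars = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
weights = [1.0, -0.5, 2.0]
sigma2 = 1.0

# Similarity is exactly one for identical points and decays with distance.
same = gaussian_kernel(exemplars[0], exemplars[0], sigma2)
near = gaussian_kernel((0.1, 0.0), exemplars[0], sigma2)
far = gaussian_kernel((3.0, 3.0), exemplars[0], sigma2)

# Prediction at a test point close to the first exemplar.
f_at_test = predict((0.1, 0.0), exemplars, weights, sigma2)
print(same, near, far, f_at_test)
```

Because the test point lies closest to the first exemplar, that exemplar's weight contributes most to the prediction, which is the sense in which nearby observations have more "influence".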
The key intuition behind this approach is that it does not model $y_i$ as a linear function of $x_i$. Rather, it leverages information about the similarity between observations. To see this, consider some test point $x^*$ at which we would like to evaluate the function value given fixed input patterns $x_i$ and weights $c_i$. For such a test point, the predicted value is given by

$$f(x^*) = c_1 k(x^*, x_1) + c_2 k(x^*, x_2) + \ldots + c_N k(x^*, x_N) \tag{3.3}$$

$$\phantom{f(x^*)} = c_1 (\text{similarity of } x^* \text{ to } x_1) + c_2 (\text{sim. of } x^* \text{ to } x_2) + \ldots + c_N (\text{sim. of } x^* \text{ to } x_N) \tag{3.4}$$

That is, the outcome is linear in the similarities of the target point to each observation, and the closer $x^*$ comes to some $x_j$, the greater the "influence" of $x_j$ on the predicted $f(x^*)$. This approach to understanding how equation (3.2) fits complex functions is what we refer to as the "similarity view." It highlights a fundamental difference between KRLS and the GLM approach. With GLMs, we assume that the outcome is a weighted sum of the independent variables. In contrast, KRLS is based on the premise that information is encoded in the similarity between observations, with more similar observations expected to have more similar outcomes. We argue that this latter approach is more natural and powerful in most social science circumstances: in most reasonable cases, we expect that the nearness of a given observation, $x_i$, to other observations reveals information about the expected value of $y_i$, which suggests a large space of smooth functions in which observations close to each other in $X$ are close to each other in $y$.

[4] Below we provide a formal justification for this space based on ridge regressions in high-dimensional feature spaces.

Superposition of Gaussians View

Another useful perspective is the "superposition of Gaussians" view. Recalling that $k(\cdot, x_i)$ traces out a Gaussian curve centered over $x_i$, we slightly rewrite our function approximation as

$$f(\cdot) = c_1 k(\cdot, x_1) + c_2 k(\cdot, x_2) + \ldots + c_N k(\cdot, x_N). \tag{3.5}$$

The resulting function can be thought of as the superposition of Gaussian curves, centered over the exemplars ($x_i$) and scaled by their weights ($c_i$). Figure 3-1 illustrates six random samples of functions in this space. We draw eight data points $x_i \sim \text{Uniform}(0, 1)$ and weights $c_i \sim N(0, 1)$ and compute the target function by centering a Gaussian over each $x_i$, scaling each by its $c_i$, and then summing them (the dots represent the data points, the dotted lines refer to the scaled Gaussian kernels, and the solid lines represent the target function created from the superposition). This figure shows that the function space is much more flexible than the function spaces available to GLMs; it enables us to approximate highly non-linear and non-additive functions that may characterize the data-generating process in social science data. The same logic generalizes seamlessly to multiple dimensions.

In this view, for a given dataset, KRLS would fit the target function by placing Gaussians over each of the observed exemplars $x_i$ and scaling them such that the summed surface approximates the target function. The process of fitting the function requires solving for the $N$ values of the weights $c_i$. We, therefore, refer to the $c_i$ weights as choice coefficients, similar to the role that $\beta$ coefficients play in linear regression. Notice that a great many choices of $c_i$ can produce highly similar fits, a problem resolved in the next section through regularization. (In the online appendix, we present a toy example to build intuition for the mechanics of fitting the function; see Figure A.1.)

Before describing how KRLS chooses the choice coefficients, we introduce a more convenient matrix notation. Let $K$ be the $N \times N$ symmetric kernel matrix whose $(j, i)$th entry is $k(x_j, x_i)$; it measures the pairwise similarities between each of the $N$ input patterns $x_i$. Let $c = [c_1, \ldots, c_N]^T$ be the $N \times 1$ vector of choice coefficients and $y = [y_1, \ldots, y_N]^T$ be the $N \times 1$ vector of outcome values. Equation (3.2) can be rewritten as

$$y = Kc = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ k(x_N, x_1) & \cdots & \cdots & k(x_N, x_N) \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{bmatrix} \tag{3.6}$$

In this form, we plainly see KRLS as fitting a simple linear model: we fit $y$ for some $x_i$ as a linear combination of basis functions or regressors, each of which is a measure of $x_i$'s similarity to another observation in the dataset. Notice that the matrix $K$ is symmetric and positive semi-definite; for the Gaussian kernel it is in fact strictly positive definite, and thus invertible, whenever the input patterns are distinct.[5] Therefore, there is a "perfect" solution to the linear system $y = Kc$, or equivalently, there is a target surface that is created from the superposition of scaled Gaussians that provides a perfect fit to each data point.

[5] This holds as long as no input pattern is repeated exactly. We relax this in the following section.

Regularization and the KRLS Solution

While extremely flexible, fitting functions by the method described above produces a perfect fit of the data and invariably leads to over-fitting. This issue speaks to the ill-posedness of the problem of simply fitting the observed data: there are many solutions that are similarly good fits. We need to make two additional assumptions that specify which type of solutions we prefer. Our first assumption is that we prefer functions that minimize squared loss, which ensures that the resulting function has a clear interpretation as a conditional expectation function (of $y$ conditional on $x$). The second assumption is that we prefer smoother, less complicated functions. Rather than simply choosing $c$ as $c = K^{-1} y$, we instead solve a different problem that explicitly takes into account our preference for smoothness and concerns for over-fitting.
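The unregularized solution $c = K^{-1} y$ can be sketched numerically. The example below is a hedged illustration with made-up data: the outcomes are pure noise, the input patterns are equally spaced, and the bandwidth `sigma2 = 0.02` is an arbitrary choice made only to keep the kernel matrix well conditioned. It shows that the superposition of scaled Gaussians reproduces every outcome exactly, which is the over-fitting problem the text describes.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    """Matrix of exp(-||a - b||^2 / sigma2) for every row a of A and row b of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / sigma2)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)   # eight distinct input patterns
y = rng.normal(size=8)                         # outcomes that are pure noise
sigma2 = 0.02                                  # illustrative bandwidth (an assumption)

K = gaussian_kernel(X, X, sigma2)
c = np.linalg.solve(K, y)                      # c = K^{-1} y, no regularization
fitted = K @ c                                 # superposition evaluated at the data
max_abs_residual = np.max(np.abs(fitted - y))  # numerically zero: a "perfect" fit

# The same superposition evaluated on a fine grid between the data points.
grid = np.linspace(0.0, 1.0, 201).reshape(-1, 1)
curve = gaussian_kernel(grid, X, sigma2) @ c
```

Because the fitted curve passes through every noisy outcome, it chases noise rather than signal, which motivates the regularized problem.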
This is based on a common but perhaps under-utilized assumption: in social science contexts, we often believe that the conditional expectation function characterizing the data-generating process is relatively smooth, and that less "wiggly" functions are more likely to be due to real underlying relationships rather than noise. Less "wiggly" functions also provide more stable predictions at values in between the observed data points. Put another way, for most social science inquiry, we think that "low-frequency" relationships (in which $y$ cycles up and down fewer times across the range of $x$) are theoretically more plausible and useful than "high-frequency" relationships. (Figure A.2 in the appendix provides an example of a low- and a high-frequency explanation of the relationship between $x$ and $y$.)[6]

[6] This smoothness prior may prove wrong if there are truly sharp thresholds or discontinuities in the phenomenon of interest. Rarely, however, is a threshold so sharp that it cannot be fit well by a smooth curve. Moreover, most political science data has a degree of measurement error. Given measurement error (on $x$), then, even if the relationship between the "true" $x$ and $y$ was a step function, the observed relationship with noise will be the convolution of a step function with the distribution of the noise, producing a smoother curve (for example, a sigmoidal curve in the case of normally distributed noise).

To give preference to smoother, less complicated functions, we change the optimization problem from one that considers only model fit to one that also considers complexity. Tikhonov regularization (Tychonoff, 1963) proposes that we search over some space of possible functions and choose the best function according to the rule

$$\underset{f \in H}{\operatorname{argmin}} \sum_i V(f(x_i), y_i) + \lambda \mathcal{R}(f) \tag{3.7}$$

where $V(f(x_i), y_i)$ is a loss function that computes how "wrong" the function is at each observation, $\mathcal{R}$ is a "regularizer" measuring the "complexity" of function $f$, and $\lambda \in \mathbb{R}^+$ is a scalar parameter that governs the tradeoff between model fit and complexity. Tikhonov regularization forces us to choose a function that minimizes a weighted combination of empirical error and complexity. Larger values of $\lambda$ result in a larger penalty for the complexity of the function and a lower priority for model fit; lower values of $\lambda$ will have the opposite effect. Our hypothesis space, $H$, is the flexible space of functions in the span of kernels built on $N$ input patterns or, more formally, the Reproducing Kernel Hilbert Space (RKHS) of functions associated with a particular choice of kernel.

For our particular purposes, we choose the regularizer to be the square of the $L_2$ norm, $\langle f, f \rangle_H = \|f\|_H^2$, in the RKHS associated with our kernel. It can be shown that, for the Gaussian kernel, this choice of norm imposes an increasingly high penalty on higher-frequency components of $f$. We also always use squared loss for $V$. The resulting Tikhonov regularization problem is given by

$$\underset{f \in H}{\operatorname{argmin}} \sum_i (f(x_i) - y_i)^2 + \lambda \|f\|_H^2. \tag{3.8}$$

Tikhonov regularization may seem a natural objective function given our preference for low-complexity functions. As we show in the appendix, it also results more formally from encoding our prior beliefs that desirable functions tend to be less complicated and then solving for the most likely model given this preference and the observed data.

To solve this problem, we first substitute $f(x) = Kc$ to approximate $f(x)$ in our hypothesis space $H$.[7] In addition, we use as the regularizer the norm $\|f\|_H^2 = \sum_i \sum_j c_i c_j k(x_i, x_j) = c^T K c$. The justification for this form is given below; however, a suitable intuition is that it is akin to the sum of the squared $c_i$'s, which itself is a possible measure of complexity, but it is weighted to reflect overlap that occurs for points nearer to each other. The resulting problem is

$$c^* = \underset{c \in \mathbb{R}^N}{\operatorname{argmin}} \; (y - Kc)^T (y - Kc) + \lambda c^T K c. \tag{3.9}$$

Accordingly, $y^* = Kc^*$ provides the best-fitting approximation to the conditional expectation of the outcome in the available space of functions given regularization. Notice that this minimization is equivalent to a ridge regression in a new set of features, one that measures the similarity of an exemplar to each of the other exemplars. As we show in the appendix, we solve this problem explicitly by differentiating the objective function with respect to the choice coefficients $c$ and solving the resulting first-order conditions, finding the solution $c^* = (K + \lambda I)^{-1} y$.

[7] As we explain below, we do not need an intercept since we work with demeaned data for fitting the function.

We therefore have a closed-form solution for the estimator of the choice coefficients that provides the solution to the Tikhonov regularization problem within our flexible space of functions. This estimator is numerically rather benign. Given a fixed value for $\lambda$, we compute the kernel matrix and add $\lambda$ to its diagonal. The resulting matrix is symmetric and positive definite, so inverting it is straightforward. Also, note that the addition of $\lambda$ along the diagonal ensures that the matrix is well-conditioned (for large enough $\lambda$), which is another way of conceptualizing the stability gains achieved by regularization.

Derivation from an Infinite-Dimensional Linear Model

The above interpretations informally motivate the choices made in KRLS through our expectation that "similarity matters" more than linearity and that, within a broad space of smooth functions, less complex functions are preferable. Here we provide a formal justification for the KRLS approach that offers perhaps less intuition, but has the benefit of being generalizable to other choices of kernels and motivates both the choice of $f(x_i) = \sum_{j=1}^{N} c_j k(x_i, x_j)$ for the function space and $c^T K c$ for the regularizer.
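The closed-form solution $c^* = (K + \lambda I)^{-1} y$ takes only a few lines to compute. The sketch below is a minimal illustration on simulated data, not the companion software: the names `krls_fit` and `objective`, the data-generating process, and the fixed value `lam = 0.5` are assumptions made for the example. It solves the linear system directly rather than forming the inverse.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    """Matrix of exp(-||a - b||^2 / sigma2) for every row a of A and row b of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / sigma2)

def krls_fit(X, y, lam, sigma2):
    """Return K and the choice coefficients c* = (K + lam I)^{-1} y."""
    K = gaussian_kernel(X, X, sigma2)
    c = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return K, c

def objective(c, K, y, lam):
    """Penalized least squares: (y - Kc)'(y - Kc) + lam * c'Kc."""
    resid = y - K @ c
    return resid @ resid + lam * (c @ K @ c)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized covariates
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=50)
y = y - y.mean()                                   # demeaned outcome, so no intercept

lam = 0.5                                          # fixed here; chosen by LOOE in practice
K, c_star = krls_fit(X, y, lam, sigma2=float(X.shape[1]))
y_fit = K @ c_star                                 # fitted values y* = K c*
```

At the solution the residuals satisfy $y - Kc^* = \lambda c^*$, which is the first-order condition of the penalized problem and offers a quick numerical check.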
For any positive semi-definite kernel function $k(\cdot, \cdot)$, there exists a mapping $\phi(x)$ that transforms $x_i$ to a higher-dimensional vector $\phi(x_i)$ such that $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. In the case of the Gaussian kernel, the mapping $\phi(x_i)$ is infinite-dimensional. Suppose we wish to fit a regularized linear model (i.e. a ridge regression) in the expanded features, i.e. $f(x_i) = \phi(x_i)^T \theta$, where $\phi(x)$ has dimension $D'$ (which is $\infty$ in the Gaussian case), and $\theta$ is a $D'$ vector of coefficients. Then, we solve

$$\underset{\theta \in \mathbb{R}^{D'}}{\operatorname{argmin}} \sum_{i=1}^{N} (y_i - \phi(x_i)^T \theta)^2 + \lambda \|\theta\|^2 \tag{3.10}$$

where $\theta \in \mathbb{R}^{D'}$ gives the coefficients for each dimension of the new feature space, and $\|\theta\|^2 = \theta^T \theta$ is simply the $L_2$ norm in that space. The first-order condition is $-2 \sum_{i=1}^{N} (y_i - \phi(x_i)^T \theta) \phi(x_i) + 2 \lambda \theta = 0$. Solving partially for $\theta$ gives $\theta = \lambda^{-1} \sum_{i=1}^{N} (y_i - \phi(x_i)^T \theta) \phi(x_i)$, or simply

$$\theta = \sum_{i=1}^{N} c_i \phi(x_i) \tag{3.11}$$

where $c_i = \lambda^{-1} (y_i - \phi(x_i)^T \theta)$. Equation 3.11 asserts that the solution for $\theta$ is in the span of the features, $\phi(x_i)$. Moreover, it makes clear that the solution to our potentially infinite-dimensional problem can be found in just $N$ parameters, and using only the features at the observations.[8]

Substituting $\theta$ back into $f(x) = \phi(x)^T \theta$, we get

$$f(x) = \sum_{i=1}^{N} c_i \phi(x_i)^T \phi(x) = \sum_{i=1}^{N} c_i k(x, x_i) \tag{3.12}$$

which is precisely the form of the function space we previously asserted. Note that the use of kernels to compute inner products between each $\phi(x_i)$ and $\phi(x_j)$ in equation 3.12 prevents us from needing to ever explicitly perform the expansion implied by $\phi(x_i)$; this is often referred to as the kernel "trick" or kernel substitution. Finally, the norm in equation 3.10, $\|\theta\|^2$, is $\langle \theta, \theta \rangle = \langle \sum_{i=1}^{N} c_i \phi(x_i), \sum_{j=1}^{N} c_j \phi(x_j) \rangle = c^T K c$. Thus, both the choice of our function space and our norm can be derived from a ridge regression in a high- or infinite-dimensional feature space $\phi(x)$ associated with the kernel.
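The equivalence between the ridge regression in feature space and the kernel solution can be checked numerically when the feature map is finite-dimensional. The sketch below swaps in the linear kernel, for which $\phi(x) = x$, because the Gaussian feature map cannot be written out explicitly; the data and the value of `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 20, 3
Phi = rng.normal(size=(N, D))   # explicit features phi(x_i) = x_i (linear kernel stand-in)
y = Phi @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=N)
lam = 0.7

# Primal ridge regression: solve for theta directly in the D-dimensional feature space.
theta_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Dual (kernel) solution: uses only the inner products K = Phi Phi^T.
K = Phi @ Phi.T
c = np.linalg.solve(K + lam * np.eye(N), y)
theta_dual = Phi.T @ c          # theta = sum_i c_i phi(x_i), as in equation 3.11

# The two regularizers agree: theta'theta equals c'Kc.
norm_primal = theta_primal @ theta_primal
norm_dual = c @ K @ c
```

Both routes give the same coefficients and the same penalty, but the dual route never touches the features except through $K$, which is what makes the infinite-dimensional Gaussian case tractable.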
[8] This powerful result is more directly shown by the Representer theorem (Kimeldorf and Wahba, 1970).

3.3 KRLS in Practice: Parameters and Quantities of Interest

In this section, we address some remaining features of the KRLS approach and discuss the quantities of interest that can be computed from the KRLS model.

Why Gaussian Kernels?

While users can build a kernel of their choosing to be used with KRLS, the logic is most applicable to kernels that radially measure the distance between points. We seek functions $k(x_i, x_j)$ that approach 1 as $x_i$ and $x_j$ become identical and approach 0 as they move far away from each other, with some smooth transition in between. Among kernels with this property, Gaussian kernels provide a suitable choice. One intuition for this is that we can imagine some data-generating process that produces $x$'s with normally distributed errors. Some $x$'s may be essentially "the same" point but separated in observation by random fluctuations. Then, the value of $k(x_i, x_j)$ is proportional to the likelihood of the two observations $x_i$ and $x_j$ being the "same" in this sense. Moreover, we can take derivatives of the Gaussian kernel and, thus, of the response surface itself, which is central to interpretation.[9]

Data Pre-processing

We standardize all variables prior to analysis by subtracting off the sample means and dividing by the sample standard deviations. Subtracting the mean of $y$ is equivalent to including an (unpenalized) intercept and simplifies the mathematics and exposition. Subtracting the means of the $x$'s has no effect, since the kernel is translation-invariant. The re-scaling operation is commonly invoked in penalized regressions for norms $L_q$ with $q > 0$ (including ridge, bridge, Least Absolute Shrinkage and Selection Operator (LASSO), and elastic-net methods) because, in these approaches, the penalty depends on the magnitudes of the coefficients and thus on the scale of the data.
Re-scaling by the standard deviation ensures that unit-of-measure decisions have no effect on the estimates. As a second benefit, re-scaling enables us to use a simple and fast approach for choosing $\sigma^2$ (see below). Note that this re-scaling does not interfere with interpretation or generalizability; all estimates are returned to the original scale and location.[10]

[9] In addition, by choosing the Gaussian kernel, KRLS is made similar to Gaussian Process regression, in which each point ($y_i$) is assumed to be a normally distributed random variable, and part of a joint normal distribution together with all other $y_j$, with the covariance between any two observations $y_i$, $y_j$ (taken over the space of possible functions) being equal to $k(x_i, x_j)$.

Choosing the Regularization Parameter $\lambda$

As formulated, there is no single "correct" choice of $\lambda$, a property shared with other penalized regression approaches such as ridge, bridge, LASSO, etc. Nevertheless, cross-validation provides a now standard approach (see, e.g. Hastie et al. (2009)) for choosing reasonable values that perform well in practice. We follow previous work on RLS-related approaches and choose $\lambda$ by minimizing the sum of the squared leave-one-out errors (LOOE) by default (e.g. Schölkopf and Smola, 2002; Rifkin and Lippert, 2007; Rifkin et al., 2003). For leave-one-out validation, the model is trained on $N - 1$ observations and tested on the left-out observation. For a given test value of $\lambda$, this can be done $N$ times, producing a prediction for each observation that does not depend on that observation itself. The $N$ errors from these predictions can then be squared and summed to measure the goodness of out-of-sample fit for that choice of $\lambda$. Fortunately, with KRLS, the vector of $N$ leave-one-out errors (LOOE) can be efficiently estimated in $O(N^1)$ time for any valid choice of $\lambda$ using the formula $\text{LOOE} = \frac{c}{\operatorname{diag}(G^{-1})}$, where $G = K + \lambda I$ and the division is element-wise (see Rifkin and Lippert, 2007).[11]
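The leave-one-out shortcut can be checked against a brute-force loop. The sketch below is illustrative only: the function names, the simulated data, and the grid of candidate $\lambda$ values are assumptions, and the brute-force version holds the kernel matrix fixed rather than re-standardizing within each fold.

```python
import numpy as np

def looe_shortcut(K, y, lam):
    """Leave-one-out errors via c / diag(G^{-1}) with G = K + lam I."""
    G_inv = np.linalg.inv(K + lam * np.eye(len(y)))
    c = G_inv @ y
    return c / np.diag(G_inv)

def looe_brute_force(K, y, lam):
    """Refit on N - 1 observations N times and predict the held-out one."""
    N = len(y)
    errors = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        c_minus_i = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(N - 1), y[keep])
        errors[i] = y[i] - K[i, keep] @ c_minus_i
    return errors

rng = np.random.default_rng(4)
N, D = 30, 2
X = rng.normal(size=(N, D))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=N)
y = y - y.mean()
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) / D)

fast = looe_shortcut(K, y, lam=0.5)
slow = looe_brute_force(K, y, lam=0.5)

# Choose lambda on a hypothetical grid by minimizing the sum of squared LOO errors.
lams = np.logspace(-3, 1, 20)
scores = [np.sum(looe_shortcut(K, y, lam) ** 2) for lam in lams]
best_lam = lams[int(np.argmin(scores))]
```

The shortcut needs one matrix inversion per candidate $\lambda$, whereas the loop needs $N$ separate fits.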
[10] New test points for which estimates are required can be applied, using the means and standard deviations from the original training data. Our companion software handles this automatically.

[11] A variant on this approach, generalized cross-validation (GCV), is equal to a weighted version of LOOE (Golub et al., 1979), computed as $\frac{c}{\frac{1}{N}\operatorname{tr}(G^{-1})}$. GCV can provide computational savings in some contexts (since the trace of $G^{-1}$ can be computed without computing $G^{-1}$ itself) but less so here, as we must compute $G^{-1}$ anyway to solve for $c$. In practice, LOOE and GCV provide nearly identical measures of out-of-sample fit and, commonly, very similar results. Our companion software also allows users to set their own value of $\lambda$, which can be used to implement other approaches if needed.

Choosing the Kernel Bandwidth $\sigma^2$

To avoid confusion, we first emphasize that the role of $\sigma^2$ in KRLS differs from its role in methods such as traditional kernel regression and kernel density estimation. In those approaches, the kernel bandwidth is typically the only smoothing parameter; no additional fitting procedure is conducted to minimize an objective function, and no separate complexity penalty is available. In KRLS, by contrast, the kernel is used to form $K$, beyond which fitting is conducted through the choice of coefficients $c$, under a penalty for complexity controlled by $\lambda$. Here, $\sigma^2$ enters principally as a measurement decision incorporated into the kernel definition, determining how distant points need to be in the (standardized) covariate space before they are considered dissimilar. The resulting fit is thus expected to be less dependent on the exact choice of $\sigma^2$ than is true of those kernel methods in which the bandwidth is the only parameter. Moreover, since there is a tradeoff between $\sigma^2$ and $\lambda$ (increasing either can increase smoothness), a range of $\sigma^2$ values is typically acceptable and leads to similar fits after optimizing over $\lambda$.
Accordingly, in KRLS, our goal is to choose $\sigma^2$ to ensure that the columns of $K$ carry useful information extracted from $X$, resulting in some units being considered similar, some being dissimilar, and some in between. We propose that $\sigma^2 = \dim(X) = D$ is a suitable default choice that adds no computational cost. The theoretical motivation for this proposition is that, in the standardized data, the average squared (Euclidean) distance between two observations that enters into the kernel calculation, $E[\|x_j - x_i\|^2]$, is equal to $2D$ (see appendix). Choosing $\sigma^2$ to be proportional to $D$ therefore ensures a reasonable scaling of the average distance. Empirically, we have found that setting $\sigma^2 = 1D$ in particular has reliably resulted in good empirical performance (see simulations below) and typically provides a suitable distribution of values in $K$ such that entries range from close to 1 (highly similar) to close to 0 (highly dissimilar), with a distribution falling in between.[12]

[12] Note that our choice for $\sigma^2$ is consistent with advice from other work. For example, Schölkopf and Smola (2002) suggest that an "educated guess" for $\sigma^2$ can be made by ensuring that the squared distance scaled by $\sigma^2$ "roughly lies in the same range, even if the scaling and dimension of the data are different," and they also choose $\sigma^2 = \dim(X)$ for the Gaussian kernel in several examples (though without the justification given here). Our companion software also allows users to set their own value for $\sigma^2$, and this feature can be used to implement more complicated approaches if needed. In principle, one could also use a joint grid-search over values of $\sigma^2$ and $\lambda$, for example using k-fold cross-validation where k is typically between 5 and 10. However, this approach adds a significant computational burden (since a new $K$ needs to be formed for each choice of $\sigma^2$), and the benefits can be small since $\sigma^2$ and $\lambda$ trade off with each other. It is therefore typically more computationally efficient to fix $\sigma^2$ at a reasonable value and optimize over $\lambda$.
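The scaling argument behind the default $\sigma^2 = D$ can be illustrated on simulated data. In the sketch below the raw covariates, their scales, and the sample sizes are made up; the standardization uses the population standard deviation (NumPy's default `ddof=0`), an assumption under which the average squared distance over all ordered pairs equals $2D$ exactly rather than approximately.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 200, 5
# Raw covariates measured on very different scales (illustrative values).
scales = np.array([1.0, 5.0, 0.1, 20.0, 3.0])
X_raw = rng.normal(loc=10.0, scale=scales, size=(N, D))

# Standardize each column: subtract the mean, divide by the standard deviation.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Squared Euclidean distance between every ordered pair of observations.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
mean_sq_dist = sq_dist.mean()        # equals 2 * D after standardization

# Kernel matrix under the default bandwidth sigma^2 = D.
K = np.exp(-sq_dist / D)
off_diagonal = K[~np.eye(N, dtype=bool)]
```

With this bandwidth the typical off-diagonal entry is on the order of $e^{-2}$, so the entries of $K$ spread between 0 and 1 rather than collapsing toward either extreme.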
3.4 Inference and Interpretation with KRLS

In this section, we provide the properties of the KRLS estimator. In particular, we establish its unbiasedness, consistency, and asymptotic normality and derive a closed-form estimator for its variance.[13] We also develop new interpretational tools, including estimators for the pointwise partial derivatives and their variances, and discuss how the KRLS estimator protects against extrapolation when modeling extreme counterfactuals.

[13] While statisticians and econometricians are often interested in these classical statistical properties, machine learning theorists have largely focused attention on whether and how fast the empirical error rate of the estimator converges to the true error rate. We are not aware of existing arguments for unbiasedness, or the normality of KRLS point estimates, though proofs of consistency, distinct from our own, have been given, including in frameworks with stochastic $X$ (e.g. De Vito et al., 2005).

Unbiasedness, Variance, Consistency, and Asymptotic Normality

Unbiasedness

We first show that KRLS unbiasedly estimates the best approximation of the true conditional expectation function that falls in the available space of functions given our preference for less complex functions.

ASSUMPTION 1 (FUNCTIONAL FORM) The target function we seek to estimate falls in the space of functions representable as $y^* = Kc^*$, and we observe a noisy version of this, $y_{obs} = y^* + \varepsilon$.

These two conditions together constitute the "correct specification" requirement for KRLS. Notice that these requirements are analogous to the familiar correct specification assumption for the linear regression model, which states that the data-generating process is given by $y = X\beta + \varepsilon$. However, as we saw above, the functional form assumption in KRLS is much more flexible than in linear regression or GLMs more generally, and this guards against misspecification bias.
ASSUMPTION 2 (ZERO CONDITIONAL MEAN) $E[\varepsilon | X] = 0$, which implies that $E[\varepsilon | K_i] = 0$ (where $K_i$ designates the $i$th column of $K$) since $K$ is a deterministic function of $X$.

This assumption is mathematically equivalent to the usual zero conditional mean assumption used to establish unbiasedness for linear regression or GLMs more generally. However, note that substantively, this assumption is typically weaker in KRLS than in GLMs, which is the source of KRLS' improved robustness to misspecification bias. In a standard OLS setup, with $y = X\beta + \varepsilon_{linear}$, unbiasedness requires that $E[\varepsilon_{linear} | X] = 0$. Importantly, this $\varepsilon_{linear}$ includes both omitted variables and unmodeled effects of $X$ on $y$ that are not linear functions of $X$ (e.g. an omitted squared term or interaction). Thus, in addition to any omitted variable bias due to unobserved confounders, misspecification bias also occurs whenever the unmodeled effects of $X$ in $\varepsilon_{linear}$ are correlated with the $X$s that are included in the model. In KRLS, we instead have $y = Kc + \varepsilon_{krls}$. In this case, $\varepsilon_{krls}$ is devoid of virtually any smooth function of $X$ because these functions are captured in the flexible model through $Kc$. In other words, KRLS moves many otherwise unmodeled effects of $X$ from the error term into the model. This greatly reduces the chances of misspecification bias, leaving the errors restricted principally to the unobserved confounders, which will always be an issue in non-experimental data.

Under these assumptions, we can establish the unbiasedness of the KRLS estimator, meaning that the expectation of the estimator for the choice coefficients that minimize the penalized least squares, $\hat{c}^*$, obtained from running KRLS on $y_{obs}$, equals its true population estimand, $c^*$. Given this unbiasedness result, we can also establish unbiasedness for the fitted values.

THEOREM 1 (UNBIASEDNESS OF CHOICE COEFFICIENTS) Under assumptions 1-2, $E[\hat{c}^* | X] = c^*$. The proof is given in the appendix.
THEOREM 2 (UNBIASEDNESS OF FITTED VALUES) Under assumptions 1-2, $E[\hat{y}] = y^*$. The proof is given in the appendix.

We emphasize that this definition of unbiasedness says only that the estimator is unbiased for the best approximation of the conditional expectation function given penalization.[14] In other words, unbiasedness here establishes that we get the correct answer in expectation for $y^*$ (not $y$), regardless of noise added to the observations. While this may seem like a somewhat dissatisfying notion of unbiasedness, it is precisely the sense in which many other approaches are unbiased, including OLS. If, for example, the "true" data-generating process includes a sharp discontinuity that we do not have a dummy variable for, then KRLS will always instead choose a function that smooths this out somewhat, regardless of $N$, just as a linear model will not correctly fit a non-linear function. The benefit of KRLS over GLMs is that the space of allowable functions is much larger, making the "correct specification" assumption much weaker.

Variance

Here, we derive a closed-form estimator for the variance of the KRLS estimator of the choice coefficients that minimizes the penalized least squares, $\hat{c}^*$, conditional on a given $\lambda$. This is important because it allows researchers to conduct hypothesis tests and construct confidence intervals. We utilize a standard homoscedasticity assumption, although the results could be extended to allow for heteroscedastic, serially correlated, or grouped error structures. We note that, as in OLS, the values for the point estimates of interest (e.g. $\hat{y}$ and $\widehat{\partial y / \partial x_j^{(d)}}$, discussed below) do not depend on this homoscedasticity assumption. Rather, an assumption over the error structure is needed for computing variances.

ASSUMPTION 3 (SPHERICAL ERRORS) The errors are homoscedastic and have zero serial correlation, such that $E[\varepsilon \varepsilon^T | X] = \sigma_\varepsilon^2 I$.
[14] Readers will recognize that classical ridge regression, usually in the span of $X$ rather than $\phi(X)$, is biased, in that the coefficients achieved are biased relative to the unpenalized coefficients. Imposing this bias is, in some sense, the purpose of ridge regression. However, if one is seeking to estimate the post-penalization function because regularization is desirable to identify the most reliable function for making new predictions, the procedure is unbiased for estimating that post-penalization function.

LEMMA 1 (VARIANCE OF CHOICE COEFFICIENTS) Under assumptions 1-3, the variance of the choice coefficients is given by $\operatorname{Var}[\hat{c}^* | X, \lambda] = \sigma_\varepsilon^2 (K + \lambda I)^{-2}$. The proof is given in the appendix.

LEMMA 2 (VARIANCE OF FITTED VALUES) Under assumptions 1-3, the variance of the fitted values is given by $\operatorname{Var}[\hat{y} | X, \lambda] = \operatorname{Var}[K \hat{c}^* | X, \lambda] = K^T [\sigma_\varepsilon^2 (K + \lambda I)^{-2}] K$.

In many applications, we also need to estimate the variance of fitted values for new counterfactual predictions at specific test points. We can compute these out-of-sample predictions using $\hat{y}_{test} = K_{test} \hat{c}^*$, where $K_{test}$ is the $N_{test} \times N_{train}$ dimensional kernel matrix that contains the similarity measures of each test observation to each training observation.[15]

LEMMA 3 (VARIANCE FOR TEST POINTS) Under assumptions 1-3, the variance for predicted outcomes at test points is given by $\operatorname{Var}[\hat{y}_{test} | X, \lambda] = K_{test} \operatorname{Var}[\hat{c}^* | X, \lambda] K_{test}^T = K_{test} [\sigma_\varepsilon^2 (K + \lambda I)^{-2}] K_{test}^T$.

Our companion software implements these variance estimators. We estimate $\sigma_\varepsilon^2$ by $\hat{\sigma}_\varepsilon^2 = \frac{1}{N} \sum_{i=1}^{N} \hat{\varepsilon}_i^2 = \frac{1}{N} (y - K \hat{c}^*)^T (y - K \hat{c}^*)$. Note that all variance estimates above are conditional on the user's choice of $\lambda$. This is important, since the variance does indeed depend on $\lambda$: higher choices of $\lambda$ always imply the choice of a more stable (but less well-fitting) solution, producing lower variance. Recall that $\lambda$ is not a random variable with a distribution but, rather, a choice regarding the tradeoff of fit and complexity made by the investigator.
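The variance expressions in Lemmas 1-3 translate directly into matrix operations. The following is a hedged sketch, not the companion software: the function name `krls_variances`, the simulated data, the three test points, and the choices `lam = 0.5` and `sigma2 = D` are all assumptions made for illustration, and the error variance is estimated by the mean squared residual.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    """Matrix of exp(-||a - b||^2 / sigma2) for every row a of A and row b of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / sigma2)

def krls_variances(K, y, lam, K_test=None):
    """Closed-form variances of coefficients, fitted values, and test predictions."""
    N = len(y)
    G_inv = np.linalg.inv(K + lam * np.eye(N))
    c_hat = G_inv @ y
    resid = y - K @ c_hat
    sigma2_eps = resid @ resid / N                 # estimated error variance
    var_c = sigma2_eps * (G_inv @ G_inv)           # sigma^2 (K + lam I)^{-2}
    var_fitted = K.T @ var_c @ K                   # variance of fitted values
    var_test = None if K_test is None else K_test @ var_c @ K_test.T
    return c_hat, sigma2_eps, var_c, var_fitted, var_test

rng = np.random.default_rng(6)
N, D = 40, 2
X = rng.normal(size=(N, D))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=N)
y = y - y.mean()
sigma2, lam = float(D), 0.5

K = gaussian_kernel(X, X, sigma2)
X_test = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, 2.0]])   # hypothetical test points
K_test = gaussian_kernel(X_test, X, sigma2)                 # N_test x N_train

c_hat, sigma2_eps, var_c, var_fitted, var_test = krls_variances(K, y, lam, K_test)
se_test = np.sqrt(np.diag(var_test))                        # standard errors at test points
```

The square roots of the diagonal entries give pointwise standard errors, conditional on the chosen $\lambda$ and $\sigma^2$.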
LOOE provides a reasonable criterion for choosing this parameter, and so variance estimates are given for $\lambda = \lambda_{LOOE}$.[16]

[15] To reduce notation, here we condition simply on $X$, but we intend this $X$ to include both the original training data (used to form $K$) and the test data (needed to form $K_{test}$).

[16] Though we suppress the notation, variance estimates are technically conditional on the choice of $\sigma^2$ as well. Recall that, in our setup, $\sigma^2$ is not a random variable; it is set to the dimension of the input data as a mechanical means of rescaling Euclidean distances appropriately.

Consistency

In machine learning, attention is usually given to bounds on the error rate of a given method, and to how this error rate changes with the sample size. When the probability limit of the sample error rate will reach the irreducible approximation error (i.e. the best error rate possible for a given problem and a given learning machine), the approach is said to be consistent (e.g. De Vito et al., 2005). Here, we are instead interested in consistency in the classical sense, i.e. determining whether $\operatorname{plim}_{N \to \infty} \hat{y}_{i,N} = y_i^*$ for all $i$. Since we have already established that $E[\hat{y}_i] = y_i^*$, all that remains to prove consistency is that the variance of $\hat{y}_i$ goes to zero as $N$ grows large.

ASSUMPTION 4 (REGULARITY CONDITION I) Let (i) $\lambda > 0$ and (ii) as $N \to \infty$, for eigenvalues of $K$ given by $a_i$, $\sum_i \frac{a_i}{a_i + \lambda}$ grows slower than $N$ once $N > M$ for some finite $M$.

THEOREM 3 (CONSISTENCY) Under assumptions 1-4, $E[\hat{y}_i | X] = y_i^*$ and $\operatorname{plim}_{N \to \infty} \operatorname{Var}[\hat{y}_i | X, \lambda] = 0$, so the estimator is consistent with $\operatorname{plim}_{N \to \infty} \hat{y}_{i,N} = y_i^*$ for all $i$. The proof is provided in the appendix.

Our proof provides several insights, which we briefly highlight here. The degrees of freedom of the model can be related to the effective number of non-zero eigenvalues. The number of effective eigenvalues, in turn, is given by $\sum_i \frac{a_i}{a_i + \lambda}$, where the $a_i$ are the eigenvalues of $K$. This generates two important insights.
First, some regularization is needed ($\lambda > 0$) or this quantity grows exactly as $N$ does. Without regularization ($\lambda = 0$), new observations translate into added complexity rather than added certainty; accordingly, the variances do not shrink. Thus, consistency is achieved precisely because of the regularization. Second, regularization greatly reduces the number of effective degrees of freedom, driving the eigenvalues that are small relative to $\lambda$ essentially to zero. Empirically, a model with hundreds or thousands of observations, which could theoretically support as many degrees of freedom, often turns out to have on the order of 5-10 effective degrees of freedom. This ability to approximate complex functions but with a preference for less complicated ones is central to the wide applicability of KRLS. It makes models as complicated as needed but not more so, and it gains from the efficiency boost when simple models are sufficient. As we show below, the regularization can rescue so much efficiency that the resulting KRLS model is not much less efficient than an OLS regression even for linear data.

Finite Sample and Asymptotic Distribution of $\hat{y}$

Here, we establish the asymptotic normality of the KRLS estimator. First, we establish that the estimator is normally distributed in finite samples when the elements of $\varepsilon$ are i.i.d. normal.

ASSUMPTION 5 (NORMALITY) The errors are distributed normally, $\varepsilon_i \overset{iid}{\sim} N(0, \sigma_\varepsilon^2)$.

THEOREM 4 (NORMALITY IN FINITE SAMPLES) Under assumptions 1-5, $\hat{y} \sim N(y^*, (\sigma_\varepsilon K (K + \lambda I)^{-1})^2)$. The proof is given in the appendix.

Second, we establish that the estimator is also normal asymptotically even when $\varepsilon$ is non-normal but independently drawn from a distribution with a finite mean and variance.
ASSUMPTION 6 (REGULARITY CONDITIONS II) Let (i) the errors be independently drawn from a distribution with a finite mean and variance and (ii) the standard Lindeberg conditions hold such that the sum of variances of each term in the summation $\sum_j [K(K + \lambda I)^{-1}]_{(i,j)} \varepsilon_j$ goes to infinity as $N \to \infty$ and that the summands are uniformly bounded, i.e. there exists some constant $a$ such that $|[K(K + \lambda I)^{-1}]_{(i,j)} \varepsilon_j| \leq a$ for all $j$.

THEOREM 5 (ASYMPTOTIC NORMALITY) Under assumptions 1-4 and 6, $\hat{y} \xrightarrow{d} N(y^*, (\sigma_\varepsilon K (K + \lambda I)^{-1})^2)$ as $N \to \infty$. The proof is given in the appendix.

The resulting asymptotic distribution used for inference on any given $\hat{y}_i$ is

$$\frac{\hat{y}_i - y_i^*}{\sigma_\varepsilon (K (K + \lambda I)^{-1})_{(i,i)}} \xrightarrow{d} N(0, 1). \tag{3.13}$$

Theorem 4 is corroborated by simulations, which show that 95% confidence intervals based on standard errors computed by this method (a) closely match confidence intervals constructed from a non-parametric bootstrap and (b) have accurate empirical coverage rates under repeated sampling where new noise vectors are drawn for each iteration.

Taken together, these new results establish the desirable theoretical properties of the KRLS estimator for the conditional expectation: it is unbiased for the best-fitting approximation to the true Conditional Expectation Function (CEF) in a large space of (penalized) functions (Theorems 1 and 2), it is consistent (Theorem 3), and it is asymptotically normally distributed given standard regularity conditions (Theorems 4 and 5). Moreover, variances can be estimated in closed form (Lemmas 1-3).

Interpretation and Quantities of Interest

One important benefit of KRLS over many other flexible modeling approaches is that the fitted KRLS model lends itself to a range of interpretational tools, which we develop in this section.

Estimating $E[y | X]$ and First Differences

The most straightforward interpretive element of KRLS is that we can use it to estimate the expectation of $y$ conditional on $X = x$.
From here, we can compute many quantities of interest, such as first differences or marginal effects. We can also produce plots that show how the predicted outcomes change across a range of values for a given predictor variable while holding the other predictors fixed. For example, we can construct a dataset in which one predictor $x^{(a)}$ varies across a range of test values and the other predictors remain fixed at some constant value (e.g. the means) and then use this dataset to generate predicted outcomes, add a confidence envelope, and plot them against $x^{(a)}$ to explore ceteris paribus changes. Similar plots are typically used to interpret GAM models; however, the advantage of KRLS is that the learned model that is used to generate predicted outcomes does not rely on the additivity assumptions typically required for GAMs. Our companion software includes an option to produce such plots.

Partial Derivatives

We derive an estimator for the pointwise partial derivatives of $y$ with respect to any particular input variable, $x^{(a)}$, which allows researchers to directly explore the pointwise marginal effects of each input variable and summarize them, for example, in the form of a regression table. Let $x^{(d)}$ be a particular variable such that $X = [x^{1} \ldots x^{d} \ldots x^{D}]$. Then, for a single observation, $j$, the partial derivative of $y$ with respect to variable $d$ is estimated by

$$\widehat{\frac{\partial y}{\partial x_j^{(d)}}} = \frac{-2}{\sigma^2} \sum_i c_i \, e^{-\|x_i - x_j\|^2/\sigma^2} \left(x_j^{(d)} - x_i^{(d)}\right). \tag{3.14}$$

The KRLS pointwise partial derivatives may vary across every point in the covariate space. One way to summarize the partial derivatives is to take their expectation. We, thus, estimate the sample-average partial derivative of $y$ with respect to $x^{(d)}$ at each observation as

$$E_N\left[\widehat{\frac{\partial y}{\partial x_j^{(d)}}}\right] = \frac{-2}{\sigma^2 N} \sum_j \sum_i c_i \, e^{-\|x_i - x_j\|^2/\sigma^2} \left(x_j^{(d)} - x_i^{(d)}\right). \tag{3.15}$$

We also derive the variance of this quantity, and our companion software computes the pointwise and the sample-average partial derivative for each input variable together with their standard errors.
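The two estimators above can be sketched in a few lines. The code assumes the Gaussian kernel $k(x_i, x_j) = e^{-\|x_i - x_j\|^2/\sigma^2}$ and takes the coefficient vector $c$ as given; the function names and the two-point example are hypothetical and serve only to show the mechanics, not the implementation in our companion software.

```python
import math


def pointwise_derivatives(X, c, sigma2, d):
    """Partial derivative of the fitted surface with respect to input d,
    evaluated at every observation j (the pointwise estimator)."""
    derivs = []
    for xj in X:
        total = 0.0
        for ci, xi in zip(c, X):
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            total += ci * math.exp(-sq_dist / sigma2) * (xj[d] - xi[d])
        derivs.append(-2.0 / sigma2 * total)
    return derivs


def average_derivative(X, c, sigma2, d):
    """Sample-average partial derivative: the mean of the pointwise values."""
    derivs = pointwise_derivatives(X, c, sigma2, d)
    return sum(derivs) / len(derivs)


# Hypothetical example: one kernel "bump" centered at x = 0 with weight 1.
X_obs = [[0.0], [1.0]]
c_obs = [1.0, 0.0]
pointwise = pointwise_derivatives(X_obs, c_obs, sigma2=1.0, d=0)
average = average_derivative(X_obs, c_obs, sigma2=1.0, d=0)
```

In this example the fitted surface is $f(x) = e^{-x^2}$, so the derivative is zero at the center of the bump and negative at $x = 1$, which is what the pointwise values report.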
The benefit of the sample-average partial derivative estimator is that it reports something akin to the usual $\hat{\beta}$ produced by linear regression: an estimate of the average marginal effect of each independent variable. However, there is a key difference between taking a best linear approximation to the data (as in OLS) versus fitting the CEF flexibly and then taking the average partial derivative in each dimension (as in KRLS). OLS gives a linear summary, but is highly susceptible to misspecification bias, in which the unmodeled effects of some observed variables can be mistakenly attributed to other observed variables. KRLS is much less susceptible to this bias because it first fits the CEF more flexibly and then can report back an average derivative over this improved fit.

Since KRLS provides partial derivatives for every observation, it allows for interpretation beyond the sample-average partial derivative. Plotting histograms of the pointwise derivatives and plotting the derivative of $y$ with respect to $x^{(d)}$ as a function of $x^{(d)}$ are useful interpretational tools. Plotting a histogram of $\widehat{\partial y / \partial x_i^{(d)}}$ over all $i$ can quickly give the investigator a sense of whether the effect of a particular variable is relatively constant or very heterogeneous. It may turn out that the distribution of $\widehat{\partial y / \partial x_i^{(d)}}$ is bimodal, having a marginal effect that is strongly positive for one group of observations and strongly negative for another group. While the average partial derivative (or a $\beta$ coefficient) would return a result near zero, this would obscure the fact that the variable in question is having a strong effect but in opposite directions depending on the levels of other variables. KRLS is well-suited to detect such effect heterogeneity. Our companion software includes an option to plot such histograms, as well as a range of other quantities.
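A small numerical illustration of the bimodal case may help. The derivative values below are invented for the example (they are not estimates from any model), and `summarize` is a hypothetical helper that reports the mean alongside the quartiles.

```python
import statistics


def summarize(derivs):
    """Return the mean and the three quartiles of a list of pointwise derivatives."""
    q1, median, q3 = statistics.quantiles(derivs, n=4)
    return {"mean": statistics.fmean(derivs), "q1": q1, "median": median, "q3": q3}


# Hypothetical pointwise derivatives: half strongly negative, half strongly positive.
derivs = [-2.0] * 5 + [2.0] * 5
summary = summarize(derivs)
```

Here the mean is zero while the first and third quartiles sit at -2 and 2, so reporting quartiles next to the average exposes heterogeneity that the average alone would hide.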
Binary Independent Variables

KRLS works well with binary independent variables; however, they must be interpreted by a different approach than continuous variables. Given a binary variable $x^{(b)}$, the pointwise partial derivative is only observed where $x^{(b)} = 0$ or where $x^{(b)} = 1$. The partial derivatives at these two points do not characterize the expected effect of going from $x^{(b)} = 0$ to $x^{(b)} = 1$.$^{17}$ If the investigator wishes to know the expected difference in $y$ between a case in which $x^{(b)} = 0$ and one in which $x^{(b)} = 1$, as is usually the case, we must instead compute first-differences directly. Let all other covariates (besides the binary covariate in question) be given by $X$. The first-difference sample estimator is

$$\frac{1}{N} \sum_i \left[\hat{y}_i \mid x_i^{(b)} = 1, X = x_i\right] - \left[\hat{y}_i \mid x_i^{(b)} = 0, X = x_i\right].$$

This is computed by taking the mean $\hat{y}$ in one version of the dataset in which all $X$'s retain their original value and all $x^{(b)} = 1$ and then subtracting from this the mean $\hat{y}$ in a dataset where all the values of $x^{(b)} = 0$. In the appendix, we derive closed-form estimators for the standard errors for this quantity. Our companion software detects binary variables and reports the first-difference estimate and its standard error, allowing users to interpret these effects as they are accustomed to from regression tables.

17 The predicted function that KRLS fits for a binary input variable is a sigmoidal curve, less steep at the two endpoints than at the (unobserved) values in between. Thus, the sample-average partial derivative on such variables will underestimate the marginal effect of going from 0 to 1 on this variable.

$E[y|x]$ Returns to $E[y]$ for Extreme Examples of $x$

One important result is that KRLS protects against extrapolation for modeling extreme counterfactuals. Suppose we attempt to model a value of $\hat{y}_j$ for a test point $x_j$. If $x_j$ lies far from all the observed data points, then $k(x_i, x_j)$ will be close to zero for all $i$.
Thus, by equation (3.2), $f(x_j)$ will be close to zero, which also equals the mean of $y$ due to pre-processing. Thus, if we attempt to predict $y$ for a new counterfactual example that is far from the observed data, our estimate approaches the sample mean of the outcome variable. This property of the estimator is both useful and sensible. It is useful because it protects against highly model-dependent counterfactual reasoning based on extrapolation. In linear models, for example, counterfactuals are modeled as though the linear trajectory of the CEF continues on indefinitely, creating a risk of producing highly implausible estimates (King and Zeng, 2006). This property is also sensible, we argue, because, in a Bayesian sense, it reflects the knowledge that we have for extreme counterfactuals. Recall that, under the similarity-based view, the only information we need about observations is how similar they are to other observations; the matrix of similarities, $K$, is a sufficient statistic for the data. If an observation is so unusual that it is not similar to any other observation, our best estimate of $E[y \mid X = x_j]$ would simply be $E[y]$, as we have no basis for updating that expectation.

3.5 Simulation Results

Here, we show simulation examples of KRLS that illustrate certain aspects of its behavior. Further examples are presented in the online appendix.

Leverage Points

One weakness of OLS is that a single aberrant data point can have an overwhelming effect on the coefficients and lead to unstable inferences. This concern is mitigated in KRLS due to the complexity-penalized objective function: adjusting the model to accommodate a single aberrant point typically adds more in complexity than it makes up for by improving model fit. To test this, we consider a linear data-generating process, $y = 2x + \epsilon$. In each simulation, we draw $x \sim \text{Unif}(0, 1)$ and $\epsilon \sim N(0, .3)$.
We then contaminate the data by setting a single data point to $(x = 5, y = -5)$, which is off the line described by the target function. As shown in the left panel of Figure 3-2, this single bad leverage point strongly biases the OLS estimates of the average marginal effect downwards (open circles), while the estimates of the average marginal effect from KRLS are robust even at small sample sizes (closed circles).

Efficiency Comparison

We expect that the added flexibility of KRLS will reduce the bias due to misspecification error but at the cost of increased variance due to the usual bias-variance tradeoff. However, regularization helps to prevent KRLS from suffering this problem too severely. The regularizer imposes a high penalty on complex, high-frequency functions, effectively reducing the space of functions and ensuring that small variations in the data do not lead to large variations in the fitted function. Thus, it reduces the variance. We illustrate this using a linear data-generating process, $y = 2x + \epsilon$, $x \sim N(0, 1)$, and $\epsilon \sim N(0, .25)$ such that OLS is guaranteed to be the most efficient unbiased linear estimator according to the Gauss-Markov theorem. The right panel in Figure 3-2 compares the standard error of the sample average partial derivative estimated by KRLS to that of $\hat{\beta}$ obtained by OLS. As expected, KRLS is not as efficient as OLS. However, the efficiency cost is quite modest, with the KRLS standard error, on average, being only 14% larger than the standard errors from OLS. The efficiency cost is relatively low due to regularization, as discussed above. Both OLS and KRLS standard errors decrease at the rate of roughly $1/\sqrt{N}$, as suggested by our consistency results.

Over-fitting

A possible concern with flexible estimators is that they may be prone to over-fitting, especially in large samples. With KRLS, regularization helps to prevent over-fitting by explicitly penalizing complex functions.
To demonstrate this point, we consider a high-frequency function given by $y = .2 \sin(12\pi x) + \sin(2\pi x)$ and run simulations with $x \sim \text{Unif}(0, 1)$ and $\epsilon \sim N(0, 1)$ with two sample sizes, $N = 40$ and $N = 400$. The results are displayed in the left panel of Figure 3-3. We find that, for the small sample size, KRLS approximates the high-frequency target function (solid line) well with a smooth low-frequency approximation (dashed line). This approximation remains stable at the larger sample size (dotted line), indicating that KRLS is not prone to over-fit the function even as $N$ grows large. This admittedly depends on the appropriate choice of $\lambda$, which is automatically chosen in all examples by LOOE as described above.

Non-smooth Functions

One potential downside of regularization is that KRLS is not well-suited to estimate discontinuous target functions. In the right panel of Figure 3-3, we use the same setup from the over-fitting simulation above but replace the high-frequency function with a discontinuous step function. KRLS does not approximate the step well at $N = 40$, and the fit improves only modestly at $N = 400$, still failing to approximate the sharp discontinuity. However, KRLS still performs much better than the comparable OLS estimate, which uses $x$ as a continuous regressor. The fact that KRLS tries to approximate the step with a smooth function is expected and desirable. For most social science problems, we would assume that the target function is continuous in the sense that very small changes in the independent variable are not associated with dramatic changes in the outcome variable, which is why KRLS uses such a smoothness prior by construction. Of course, if the discontinuity is known to the researcher, it should be directly incorporated into the KRLS or the OLS model by using a dummy variable $x' = 1[x > .5]$ instead of the continuous regressor $x$. Both methods would then exactly fit the target function.

Interactions

We now turn to multivariate functions.
First, we consider the standard interaction model where the target function is $y = .5 + x_1 + x_2 - 2(x_1 \cdot x_2) + \epsilon$ with $x_j \sim \text{Bernoulli}(.5)$ for $j = 1, 2$ and $\epsilon \sim N(0, .5)$. We fit KRLS and OLS models that include $x_1$ and $x_2$ as covariates and test the out-of-sample performance using the $R^2$ for predictions of $y$ at 1000 test points drawn from the same distribution as the covariates. The upper panel in Figure 3-4 shows the out-of-sample $R^2$ estimates. KRLS (closed circles) accurately learns the interaction from the data and approaches the true $R^2$ as the sample size increases. OLS (open circles) misses the interaction and performs poorly even as the sample size increases. Of course, in this simple case, we can get the correct answer with OLS if we specify the saturated regression that includes the interaction term $(x_1 \cdot x_2)$. However, even if the investigator suspects that such an interaction needs to be modeled, the strategy of including interaction terms very quickly runs up against the combinatorial explosion of potential interactions in more realistic cases with multiple predictors. Consider a similar simulation for a more realistic case with ten binary predictors and a target function that contains several interactions: $y = (x_1 \cdot x_2) - 2(x_3 \cdot x_4) + 3(x_5 \cdot x_6 \cdot x_7) - (x_1 \cdot x_8) + 2(x_8 \cdot x_9 \cdot x_{10}) + x_{10}$. Here, it is difficult to search through the myriad different OLS specifications to find the correct model: it would take $2^{10}$ terms to account for all the unique possible multiplicative interactions. This is why, in practice, social science researchers typically include no or very few interactions in their regressions. It is well-known that this results in often severe misspecification bias if the effects of some covariates depend on the levels of other covariates (e.g. Brambor et al., 2006). KRLS allows researchers to avoid this problem since it learns the interactions from the data.
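To see why the additive OLS model fails in the two-predictor case, it helps to enumerate the four cells of the noise-free target function $y = .5 + x_1 + x_2 - 2(x_1 \cdot x_2)$. The sketch below does this and also counts the interaction terms for ten predictors; the helper names are hypothetical, and the cells are weighted equally, which corresponds to independent Bernoulli(.5) predictors.

```python
from itertools import product
from math import comb


def target(x1, x2):
    """Noise-free target function of the two-predictor interaction example."""
    return 0.5 + x1 + x2 - 2 * (x1 * x2)


# Expected outcome in each of the four (x1, x2) cells.
cells = {(x1, x2): target(x1, x2) for x1, x2 in product((0, 1), repeat=2)}


def main_effect(index):
    """Difference in the mean outcome when predictor `index` is 1 versus 0,
    averaging over the other predictor with equal weights."""
    on = [y for x, y in cells.items() if x[index] == 1]
    off = [y for x, y in cells.items() if x[index] == 0]
    return sum(on) / len(on) - sum(off) / len(off)


effect_x1 = main_effect(0)
effect_x2 = main_effect(1)

# Number of terms in a fully saturated model with ten binary predictors
# (every subset of predictors, including the intercept), and the number of
# two-way interactions alone.
n_terms_saturated = sum(comb(10, k) for k in range(11))
n_two_way = comb(10, 2)
```

Both main effects are exactly zero because the effect of each predictor is +1 when the other equals 0 and -1 when the other equals 1. An additive linear fit therefore has nothing to pick up, even though the outcome clearly depends on both predictors.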
The lower panel in Figure 3-4 shows that, in this more complex example, the OLS regression that is linear in the predictors (open circles) performs very poorly, and this performance does not improve as the sample size increases. Even at the largest sample size, it still misses close to half of the systematic variation in the outcome that results from the covariates. In stark contrast, the KRLS estimator (closed circles) performs well even at small sample sizes when there are fewer observations than the number of possible two-way interactions (not to mention higher-order interactions). Moreover, the out-of-sample performance approaches the true $R^2$ as the sample size increases, indicating that the learning of the function continues as the sample size grows larger. This clearly demonstrates how KRLS obviates the need for tedious specification searches and guards against misspecification bias. The KRLS estimator accurately learns the target function from the data and captures complex non-linearities or interactions that are likely to bias OLS estimates.

The Dangers of OLS with Multiplicative Interactions

Here, we show how the strategy of adding interaction terms can easily lead to incorrect inferences even in simple cases. Consider two correlated predictors $x_1 \sim \text{Unif}(0, 2)$ and $x_2 = x_1 + \xi$ with $\xi \sim N(0, 1)$. The true target function is $y = 5x_1^2$ and, thus, only depends on $x_1$ with a mild non-linearity. This non-linearity is so mild that, in reasonably noisy samples, even a careful researcher who follows the textbook recommendations and first inspects a scatterplot between the outcome and $x_1$ might mistake it for a linear relationship. The same is true for the relationship between the outcome and the (conditionally irrelevant) predictor $x_2$. Given this, a researcher who has no additional knowledge about the true model is likely to fit a rather "flexible" regression model with a multiplicative interaction term given by $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \cdot x_2)$.
To examine the performance of this model, we run a simulation that adds random noise and fits the model using outcomes generated by $y = 5x_1^2 + \epsilon$ where $\epsilon \sim N(0, 2)$. The second column in Table 3.1 displays the coefficient estimates from the OLS regression (averaged across the simulations) together with their bootstrapped standard errors. In the eyes of the researcher, the OLS model performs rather well. Both lower-order terms and the interaction term are highly significant, and the model fit is good with $R^2 = .89$. In reality, however, using OLS with the added interaction term leads us to entirely false conclusions. We conclude that $x_1$ has a positive effect, and the magnitude of this effect increases with higher levels of $x_2$. Similarly, $x_2$ appears to have a negative effect at low levels of $x_1$ and a positive effect at high levels of $x_1$. Both conclusions are false and an artefact of misspecification bias. In truth, no interaction effect exists; the effect of $x_1$ only depends on levels of $x_1$ and $x_2$ has no effect at all.

The third column in Table 3.1 displays the estimates of the average pointwise derivatives from the KRLS estimator, which accurately recover the true average derivatives. The magnitude of the average marginal effect of $x_2$ is zero and highly insignificant. The average marginal effect of $x_1$ is highly significant and estimated at 9.2, which is fairly accurate given that $x_1$ is uniform between 0 and 2 (so we expect an average marginal effect of 10). Moreover, KRLS gives us more than just the average derivatives: it allows us to examine the effect of heterogeneity by examining the marginal distribution of the pointwise derivatives. The next three columns display the first, second, and third quartile of the distributions of the marginal effects of the two predictors. The marginal effect of $x_2$ is close to zero throughout the support of $x_2$, which is accurate given that this predictor is indeed irrelevant for the outcome.
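The benchmark values against which these estimates are judged follow from a short calculation under the stated design, $y = 5x_1^2$ with $x_1 \sim \text{Unif}(0, 2)$. It is sketched here for the noise-free target function:

```latex
\[
\frac{\partial y}{\partial x_1} = \frac{\partial}{\partial x_1}\, 5x_1^2 = 10\,x_1,
\qquad
\frac{\partial y}{\partial x_2} = 0 .
\]
\[
E\!\left[\frac{\partial y}{\partial x_1}\right] = 10\,E[x_1] = 10 \cdot 1 = 10
\quad \text{for } x_1 \sim \mathrm{Unif}(0, 2).
\]
\[
\text{Quartiles of } x_1:\ 0.5,\ 1,\ 1.5
\;\Longrightarrow\;
\text{quartiles of } \frac{\partial y}{\partial x_1}:\ 5,\ 10,\ 15 .
\]
```

The true marginal effect of $x_1$ thus ranges from 5 at the first quartile to 15 at the third, while the true marginal effect of $x_2$ is zero everywhere.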
The marginal effect of $x_1$ varies greatly in magnitude from about 5 at the first quartile to more than 14 at the third quartile. This accurately captures the non-linearity in the true effect of $x_1$.

Common Interactions and Non-additivity

Here, we show how KRLS is well-suited to fit target functions that are non-additive and/or involve more complex interactions as they arise in social science research. For the sake of presentation, we focus on target functions that involve two independent variables, but the principles generalize to higher-dimensional problems. We consider three types of functions: those with one "hill" and one "valley," two hills and two valleys, or three hills and three valleys (see appendix, Figures A.4, A.5, and A.5, respectively). These functions, especially the first two, correspond to rather common scenarios in the social sciences where the effect of one variable changes or dissipates depending on the effect of another. We simulate each type of function, using 200 observations, $x_1, x_2 \sim \text{Unif}(0, 1)$, and noise given by $\epsilon \sim N(0, .25)$. We then fit these data using KRLS, OLS, and GAMs. The results are averaged over 100 simulations. In the online appendix, we provide further explanation and visualizations pertaining to each simulation. Table 3.2 displays both the in-sample and out-of-sample $R^2$ (based on 200 test points drawn from the same distribution as the training sample) for all three target functions and estimators. KRLS provides better in- and out-of-sample fits for all three target functions, and the out-of-sample $R^2$ for each model is close to the true $R^2$ that one would obtain knowing the functional form. These simulations increase our confidence that KRLS can capture complex non-linearity, non-additivity, and interactions that we may expect in social science data.
While such features may be easy to detect in examples like these that only involve two predictors, they are even more likely in higher-dimensional problems where complex interactions and nonlinearities are very hard to detect using plots or traditional diagnostics.

Comparison to Other Approaches

KRLS is not a panacea for all that ails empirical research, but our proposition is that it provides a useful addition to the empirical toolkit of social scientists, especially those currently using GLMs, because of (a) the appropriateness of its assumptions to social science data, (b) its ease of use, and (c) the interpretability and ease with which relevant quantities of interest and their variances are produced. It therefore fulfills different needs than many other machine learning or flexible modeling approaches, such as neural networks, regression trees, k-Nearest Neighbors, SVMs, and GAMs, to name a few. In the appendix, we describe in greater detail how KRLS compares to important classes of models on interpretability and inference, with special attention to Generalized Additive Models (GAMs) and to approaches that involve explicit basis expansions followed by fitting methods that force many of the coefficients to be exactly zero (LASSO). At bottom, we do not claim that KRLS is generally superior to other approaches but, rather, that it provides a particularly useful marriage of flexibility and interpretability. It does so with far lower risk of misspecification bias than highly constrained models, while minimizing arbitrary choices about basis expansions and the selection of smoothing parameters.

These differences aside, in proposing a new method, it is useful to compare its pure modeling performance to other candidates. In this area, KRLS does very well.$^{18}$ To further illustrate how KRLS compares against other methods that have appeared in political science, we replicate a simulation from Wood (2003) that was designed specifically to illustrate the use of GAMs.
The data-generating process is given by $x_1, x_2 \sim \text{Unif}(0, 1)$, $\epsilon \sim N(0, .25)$, and $y = e^{10(-(x_1 - .25)^2 - (x_2 - .25)^2)} + .5 \cdot e^{14(-(x_1 - .7)^2 - (x_2 - .7)^2)} + \epsilon$. We consider five models: (1) KRLS with default choices ($\sigma^2 = D = 2$), implemented in our R package simply as krls(y=y, X=cbind(x1, x2)); (2) a "naive" GAM (GAM1) that smoothes $x_1$ and $x_2$ separately but then assumes that they add; (3) a "smart" GAM (GAM2) that smoothes $x_1$ and $x_2$ together using the default thin-plate splines and the default method for choosing the number of basis functions in the mgcv package in R; (4) a flexibly specified linear model (LM), y = /o0 + 3,1X + 0 2 x 2 + 3X+ 3x x x 2 ; and (5) a neural network (NN) with 5 hidden units and all other parameters at their defaults using the NeuralNet package in R. We train each model on samples of 50, 100, or 200 observations and then test it on 100 out-of-sample observations. The results for the root mean square error (RMSE) of each model averaged over 200 iterations at each sample size are shown in Table 3.3. KRLS performs as well as or better than all other methods at all sample sizes. In smaller samples, it clearly dominates. As the sample size increases, the fully smoothed GAM performs very similarly.$^{19}$

18 It has been shown that the RLS models on which KRLS is based are effective even when used for classification rather than regression, with performance indistinguishable from state-of-the-art Support Vector Machines (Rifkin et al., 2003).

19 KRLS and GAMs in which all variables are smoothed together are similar. The main differences under current implementations (our package for KRLS and mgcv for GAMs) include the following: (1) the fewer interpretable quantities produced by GAMs; (2) the inability of GAMs to fully smooth together more than a few input variables; and (3) the kernel implied by GAMs that leads to straight-line extrapolation outside the support of $X$. These are discussed further in the online appendix.
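For readers who wish to reproduce the setup of this comparison, the data-generating process and the RMSE criterion can be sketched as follows. The code assumes the two-bump surface $e^{10(-(x_1 - .25)^2 - (x_2 - .25)^2)} + .5 \cdot e^{14(-(x_1 - .7)^2 - (x_2 - .7)^2)}$ and reads the .25 in $N(0, .25)$ as a variance (a standard deviation of 0.5); both readings, as well as the function names and the seed, are assumptions of this sketch. No fitted models are included, so the only "model" evaluated is a constant prediction at the training mean.

```python
import math
import random


def surface(x1, x2):
    """Noise-free two-bump target surface (assumed form, see lead-in)."""
    bump_a = math.exp(10 * (-(x1 - 0.25) ** 2 - (x2 - 0.25) ** 2))
    bump_b = 0.5 * math.exp(14 * (-(x1 - 0.7) ** 2 - (x2 - 0.7) ** 2))
    return bump_a + bump_b


def simulate(n, noise_sd, rng):
    """Draw n rows (x1, x2, y) with x1, x2 ~ Unif(0, 1) and Gaussian noise."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        rows.append((x1, x2, surface(x1, x2) + rng.gauss(0.0, noise_sd)))
    return rows


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(2014)                     # arbitrary seed
train = simulate(200, noise_sd=0.5, rng=rng)  # one training sample
test = simulate(100, noise_sd=0.5, rng=rng)   # 100 out-of-sample observations

# Baseline: predict the training mean of y for every test observation.
train_mean = sum(row[2] for row in train) / len(train)
baseline_rmse = rmse([train_mean] * len(test), [row[2] for row in test])
```

Any candidate model can be scored the same way by replacing the constant prediction with its out-of-sample predictions and averaging the resulting RMSE over repeated draws.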
3.6 Empirical Applications

In this section, we show an application of KRLS to a real data example. In the online appendix, we also provide a second empirical example that shows how KRLS analysis corrects for misspecification bias in a linear interaction model used by Brambor et al. (2006) to test the "short-coattails" hypothesis. This second example highlights the common problem that multiplicative interaction terms in linear models only allow marginal effects to vary linearly, while KRLS allows marginal effects to vary in virtually any smooth way, and this added flexibility can be critical to substantive inferences.

Predicting Genocide

In a widely cited article, Harff (2003) examines data from 126 political instability events (i.e. internal wars and regime changes away from democracy) to determine which factors can be used to predict whether a state will commit genocide.$^{20}$ Harff proposes a "structural model of genocide" where a dummy for genocide onset (onset) is regressed on two continuous variables, prior upheaval (summed years of prior instability events in the past 15 years) and trade openness (imports and exports as a fraction of gross domestic product (GDP) in logs), and four dummy variables that capture whether the state is an autocracy, had a prior genocide, and whether the ruling elite has an ideological character and/or an ethnic character.$^{21}$ The first column in Table 3.4 replicates the original specification, using a linear probability model (LPM) in place of the original logit. We use the LPM here because this allows more direct comparison to the KRLS results. However, the substantive results of the LPM are virtually identical to those of the logit in terms of magnitude and statistical significance. The next four columns on the left present the replication results from the KRLS estimator. We report first differences for all the binary predictor variables as described above.

20 The American Political Science Association lists this paper as the 15th most downloaded paper in the American Political Science Review. According to Google Scholar, this article has been cited 310 times.

21 See Harff (2003) for details. Notice that Harff dichotomized a number of continuous variables (such as the polity score), which discards valuable information. With KRLS, one could instead use the original continuous variables unless there was a strong reason to code dummies. In fact, tests confirm that using the original continuous variables with KRLS results in a more predictive model.

The analysis yields several lessons. First, the in-sample $R^2$ from the original logit model and KRLS are very similar (32% versus 34%), but KRLS dominates in terms of its receiver operating characteristic (ROC) curve for predicting genocide, with statistically significantly more area under the curve (p < 0.03). It is reassuring that KRLS performs better (at least in-sample) than the original logit model even though, as Harff reports, her final specification was selected after an extensive search through a large number of models. Moreover, this added predictive power does not require any human specification search: the researcher simply passes the predictor matrix to KRLS, which learns the functional form from the data, and this improves empirical performance and reduces arbitrariness in selecting a particular specification.

Second, the average marginal effects reported by KRLS (shown in the second column) are all of reasonable size and tend to be in the same direction as but somewhat smaller than the estimates from the linear probability model. We also see some important differences. The LPM (and the original logit) shows a significant effect of prior upheaval, with an increase of one standard deviation corresponding to a 10 percentage point increase in the probability of genocide onset, which corresponds to a 37 percent increase over the baseline probability.
This sizable "effect" completely vanishes in the KRLS model, which yields an average marginal effect of zero that is also highly insignificant. This sharply different finding is confirmed when we look beyond the average marginal effect. Recall that the average marginal effects, while a useful summary tool especially to compare to GLMs, are only summaries and can hide interesting heterogeneity in the actual marginal effects across the covariate space. To examine the effect heterogeneity, the next three columns on the left in Table 3.4 show the quartiles of the distribution of pointwise marginal effects for each input variable. Figure 3-5 also plots histograms to visualize the distributions. We see that the effect of prior upheaval is essentially zero at every point.

What explains this difference in marginal effect estimates? It turns out that the significant effect in the LPM is an artefact of misspecification bias. The variable prior upheaval is strongly right-skewed and, when logged to make it more appropriate for linear or logistic regression, the "effect" disappears entirely. This change in results emphasizes the risk of mistaken inference due to misspecification under GLMs and its potential impact on interpretation. Note that this difference in results is by no means trivial substantively. In fact, Harff (2003) argues that prior upheaval is "the necessary precondition for genocide and politicide" and "a concept that captures the essence of the structural crises and societal pressures that are preconditions for authorities' efforts to eliminate entire groups." Harff (2003) goes on to explain two mechanisms by which this variable matters and draws policy conclusions from it. However, as the KRLS results show, this "important finding" readily disappears when the model accounts for the skew.
This showcases the general problem that misspecification bias is often difficult to avoid in typical political science data, even for experienced researchers who publish in top journals and engage in various model diagnostics and specification searches. It also highlights the advantages of a more flexible approach such as KRLS, which avoids misspecification bias while yielding marginal effects estimates that are as easy to interpret as LPM and also make richer interpretation possible.

Third, while using KRLS as a robustness test of more rigid models can thus be valuable, working in a much richer model space also permits exploration of effect heterogeneity, including interactions. In Figure 3-5 we see that for several variables, such as autocracy and ideological character, the marginal effect lies to the same side of zero at almost every point, indicating that these variables have marginal effects in the same direction regardless of their level or the levels of other variables. We also see that some variables show little variation in marginal effects, such as prior upheaval, while others show more substantial variation, such as prior genocide. For example, the marginal effects (measured as first-differences) of ethnic character and ideological character are mostly positive, but both show variation from approximately 0 to 20 percentage points. A suggestive summary of how these marginal effects relate to each observed covariate can be provided by regressing the estimates of the pointwise marginal effects $\partial\,\text{onset} / \partial\,\text{ideological character}$ or $\partial\,\text{onset} / \partial\,\text{ethnic character}$ on the covariates.$^{22}$ Both regressions reveal a strong negative relationship of the level of trade openness on these marginal effects.
To give substantive interpretation to the results, we find that having an ethnic character to the ruling elite is associated with a 3 percentage point higher probability of genocide for countries in the highest quartile of trade openness, but a 9 percentage point higher probability in the lowest quartile of trade openness. Ideological character is associated with a 9 percentage point higher risk of genocide for the countries in the top quartile of trade openness, but an 18 percentage point higher risk among those in the first quartile of trade openness. These findings, while associational only, are consistent with theoretical expectations, but would be easily missed in models that do not allow sufficient flexibility.

In addition, the marginal effects of prior genocide are very widely dispersed. We find that the marginal effects of prior genocide and ideological character are strongly related: when one is high, the marginal effect of the other is lessened on average. For example, the marginal effect of ideological character is 18 percentage points higher when prior genocide is equal to zero. Correspondingly, the marginal effect of prior genocide is 21 percentage points higher when ideological character is equal to zero. This is characteristic of a sub-additive relationship, in which either prior genocide or ideological character signals a higher risk of genocide, but once one of them is known, the marginal effect of the other is negligible.$^{23}$ By contrast, the marginal effects of ethnic character, and of every other variable besides ideological character, change little as a function of prior genocide.

22 This approach is helpful to identify non-linearities and interaction effects. For each variable, take the pointwise partial derivatives (or first-differences) modeled by KRLS and regress them on all original independent variables to see which of them help explain the marginal effects.
For example, if $\partial y / \partial x^{(a)}$ is found to be well-explained by $x^{(a)}$ itself, then this suggests a non-linearity in $x^{(a)}$ (because the derivative changes with the level of the same variable). Likewise, if $\partial y / \partial x^{(a)}$ is well-explained by another variable $x^{(b)}$, this suggests an interaction effect (the marginal effect of one variable, $x^{(a)}$, depends on the level of another, $x^{(b)}$).

[23] In addition to theoretically plausible reasons why these effects are sub-additive, this relationship may be partly due to ex post facto coding of the variables: once a prior genocide has occurred, it becomes easier to classify a government as having an ideological character, since it has demonstrated a willingness to kill civilians, possibly even stating an ideological aim as justification. Thus, in the absence of prior genocide, coding a country as having ideological character is informative of genocide risk, while it adds less after prior genocide has been observed.

This brief example demonstrates that KRLS is appropriate and effective in dealing with real-world data even in relatively small datasets. KRLS offers much more flexibility than GLMs and guards against misspecification bias that can result in incorrect substantive inferences. It is also straightforward to interpret the KRLS results in ways that are familiar to researchers from GLMs.

3.7 Conclusion

To date, it has been difficult to find user-friendly approaches that avoid the dangers of misspecification while also conveniently generating quantities of interest that are as interpretable and appealing as the coefficients from GLMs. We argue that KRLS represents a particularly useful marriage of flexibility and interpretability, especially for current GLM users looking for more powerful modeling approaches.
It allows investigators to easily model non-linear and non-additive effects and reduce misspecification bias, while still producing quantities of interest that enable "simple" interpretations (similar to those allowed by GLMs) and, if desired, more nuanced interpretations that examine non-constant marginal effects.

While interpretable quantities can be derived from almost any flexible modeling approach with sufficient knowledge, computational power, and time, constructing such estimates for many methods is inconvenient at best and computationally infeasible in some cases. Moreover, conducting inference over derived quantities of interest multiplies the problem. KRLS belongs to a class of models, those producing continuously differentiable solution surfaces with closed-form expressions, that makes such interpretation feasible and fast. All the interpretational and inferential quantities are produced by a single run of the model, and the model does not require user input regarding functional form or parameter settings, improving falsifiability.

We have illustrated how KRLS accomplishes this improved tradeoff between flexibility and interpretability by starting from a different set of assumptions altogether: rather than assume that the target function is well-fitted by a linear combination of the original regressors, it is instead modeled in an N-dimensional space using information about similarity to each observation, but with a preference for less complicated functions, improving stability and efficiency. Since KRLS is a global method, i.e. the estimate at each point uses information from all other points, it is less susceptible to the curse of dimensionality than purely local methods such as k-nearest neighbors and matching.

We have established a number of desirable properties of this technique.
First, it allows computationally tractable, closed-form solutions for many quantities, including $E[y|X]$, the variance of this estimator, the pointwise partial derivatives with respect to each variable, the sample average partial derivatives, and their variances. We have also shown that it is unbiased, consistent, and asymptotically normal. Simulations have demonstrated the performance of this method, even with small samples and high-dimensional spaces. They have also shown that even when the true data-generating process is linear, the KRLS estimate of the average partial derivative is not much less efficient than the analogous OLS coefficient and far more robust to bad leverage points.

We believe that KRLS is broadly useful whenever investigators are unsure of the functional form in regression and classification problems. This may include model-fitting problems such as prediction tasks, propensity score estimation, or any case where a conditional expectation function must be acquired and rigid functional forms risk missing important variation. The method's interpretability also makes it suitable for both exploratory analyses of marginal effects and causal inference problems in which accurate conditioning on a set of covariates is required to achieve a reliable causal estimate. Relatedly, using KRLS as a specification check for more rigid methods can also be very useful.

However, there remains considerable room for further research. Our hope is that the approach provided here and in our companion software will allow more researchers to begin using KRLS or methods like it; only when tested by a larger community of scholars will we be able to determine the method's true usefulness. Specific research tasks remain as well. Due to the memory demands of working with an N x N matrix, the practical limit on N for most users is currently in the tens of thousands. Work on resolving this constraint would be useful.
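To make the closed-form quantities and the N x N memory demand concrete, the following is a minimal sketch rather than the companion software. It assumes the standard Gaussian-kernel regularized least squares solution, $c = (K + \lambda I)^{-1} y$ with $K_{ij} = \exp(-\|x_i - x_j\|^2 / \sigma^2)$, and it fixes $\lambda$ and $\sigma^2$ by hand instead of choosing them from the data. The function names (`krls_fit`, `krls_predict`, `krls_derivative`), the toy data, and the absence of any rescaling of the inputs are all illustrative assumptions.

```python
import math

def gaussian_kernel(a, b, sigma2):
    """k(a, b) = exp(-||a - b||^2 / sigma2) for two points given as lists."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / sigma2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z

def krls_fit(X, y, lam, sigma2):
    """Return the N x N kernel matrix K and c = (K + lam * I)^{-1} y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma2) for j in range(n)] for i in range(n)]
    A = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    return K, solve(A, y)

def krls_predict(x_new, X, c, sigma2):
    """Fitted surface f(x) = sum_i c_i k(x, x_i)."""
    return sum(ci * gaussian_kernel(x_new, xi, sigma2) for ci, xi in zip(c, X))

def krls_derivative(x_new, X, c, sigma2, d):
    """Closed-form partial derivative of f at x_new with respect to coordinate d."""
    return sum(
        ci * gaussian_kernel(x_new, xi, sigma2) * (-2.0 / sigma2) * (x_new[d] - xi[d])
        for ci, xi in zip(c, X)
    )

# Toy data (hypothetical): a noiseless quadratic on ten evenly spaced points.
X = [[i / 9.0] for i in range(10)]
y = [5.0 * x[0] ** 2 for x in X]
lam, sigma2 = 0.1, 1.0  # fixed by hand here, not chosen from the data

K, c = krls_fit(X, y, lam, sigma2)
fitted = [krls_predict(x, X, c, sigma2) for x in X]
pointwise = [krls_derivative(x, X, c, sigma2, 0) for x in X]
avg_derivative = sum(pointwise) / len(pointwise)
```

Every step stores or factorizes the full N x N matrix `K`, which is where the memory limit discussed above comes from. The list `pointwise` holds one derivative per observation, and its mean corresponds to a sample average partial derivative.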
In addition, the most effective methods for choosing $\lambda$ and $\sigma^2$ are still relatively open questions, and it would be useful to develop heteroscedasticity-, autocorrelation-, and cluster-robust estimators for standard errors.

3.8 Tables

Table 3.1: Comparing KRLS to OLS with Multiplicative Interactions

| Term | OLS: Average | KRLS $\partial y / \partial x_{ij}$: Average | KRLS: 1st Qu. | KRLS: Median | KRLS: 3rd Qu. |
|---|---|---|---|---|---|
| const | -1.50 (0.34) | | | | |
| $x_1$ | 7.51 (0.40) | 9.22 (0.52) | 5.22 (0.82) | 9.38 (0.85) | 14.03 (0.79) |
| $x_2$ | -1.28 (0.21) | 0.02 (0.13) | -0.08 (0.19) | 0.00 (0.16) | 0.10 (0.20) |
| $(x_1 \times x_2)$ | 1.24 (0.15) | | | | |
| N | 250 | | | | |

Note: Point estimates of marginal effects from OLS and KRLS regression with bootstrapped standard errors in parentheses. For KRLS, the table shows the average and the quartiles of the distribution of the pointwise marginal effects. The true target function is $y = 5x_1^2$, simulated using $y = 5x_1^2 + e$ with $e \sim N(0, 2)$, $x_1 \sim \text{Unif}(0, 2)$, and $x_2 = x_1 + \xi$ with $\xi \sim N(0, 1)$. With OLS, we conclude that $x_1$ has a positive effect that grows with higher levels of $x_2$ and that $x_2$ has a negative (positive) effect at low (high) levels of $x_1$. The true marginal effects are $\partial y / \partial x_1 = 10x_1$ and $\partial y / \partial x_2 = 0$; the effect of $x_1$ only depends on levels of $x_1$, and $x_2$ has no effect at all. The KRLS estimator accurately recovers the true average derivatives. The marginal effects of $x_2$ are close to zero throughout the support of $x_2$. The marginal effects of $x_1$ vary from about 5 at the first quartile to about 14 at the third quartile.

Table 3.2: KRLS Captures Complex Interactions and Non-additivity

| Target Function | In-Sample $R^2$: KRLS | In-Sample $R^2$: OLS | In-Sample $R^2$: GAM | Out-of-Sample $R^2$: KRLS | Out-of-Sample $R^2$: OLS | Out-of-Sample $R^2$: GAM | True $R^2$ |
|---|---|---|---|---|---|---|---|
| One Hill, One Valley | 0.75 | 0.61 | 0.63 | 0.70 | 0.60 | 0.60 | 0.73 |
| Two Hills, Two Valleys | 0.41 | 0.01 | 0.21 | 0.35 | -0.01 | 0.13 | 0.39 |
| Three Hills, Three Valleys | 0.52 | 0.01 | 0.05 | 0.45 | -0.01 | -0.03 | 0.51 |

Note: In- and out-of-sample $R^2$ (based on 200 test points) for simulations using the three target functions displayed in Figures A.4, A.5, and A.6 in the appendix with the OLS, GAM, and KRLS estimators.
KRLS attains the best in-sample and out-of-sample fit for all three functions.

Table 3.3: Comparing KRLS to Other Methods

| Model | Mean RMSE, N=50 | Mean RMSE, N=100 | Mean RMSE, N=200 |
|---|---|---|---|
| KRLS | 0.139 | 0.107 | 0.088 |
| GAM2 | 0.143 | 0.109 | 0.088 |
| NN | 0.312 | 0.177 | 0.118 |
| LM | 0.193 | 0.177 | 0.169 |
| GAM1 | 0.234 | 0.213 | 0.202 |

Note: Simulation comparing RMSE for out-of-sample fits generated by five models, averaged over 200 iterations. The data-generating process is based on Wood (2003): $x_1, x_2 \sim \text{Unif}(0, 1)$, $e \sim N(0, .25)$, and $y = e^{10(-(x_1 - .25)^2 - (x_2 - .25)^2)} + .5\, e^{14(-(x_1 - .7)^2 - (x_2 - .7)^2)} + e$. The models are (1) KRLS with default choices; (2) a "naive" GAM (GAM1) that smooths $x_1$ and $x_2$ separately; (3) a "smart" GAM (GAM2) that smooths $x_1$ and $x_2$ together; (4) a generous linear model (LM), $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 (x_1 \times x_2)$; and (5) a neural network (NN) with 5 hidden units. The models are trained on samples of 50, 100, or 200 observations and then tested on 100 out-of-sample observations. KRLS out-performs all other methods in small samples. In larger samples, KRLS and GAM2 (with "full-smoothing") perform similarly. The linear model, despite including terms for $x_1^2$, $x_2^2$, and $x_1 x_2$, does not perform particularly well. GAM1 also performs poorly in all circumstances.

Table 3.4: Predictors of Genocide Onset: OLS versus KRLS

| Predictor | OLS: $\beta$ | KRLS $\partial y / \partial x_{ij}$: Average | KRLS: 1st Qu. | KRLS: Median | KRLS: 3rd Qu. |
|---|---|---|---|---|---|
| Prior upheaval | 0.009* (0.004) | 0.002 (0.003) | -0.001 | 0.002 | 0.004 |
| Prior genocide | 0.263* (0.119) | 0.190* (0.075) | 0.137 | 0.232 | 0.266 |
| Ideological char. of elite | 0.152 (0.084) | 0.129 (0.076) | 0.086 | 0.136 | 0.186 |
| Autocracy | 0.160* (0.077) | 0.122 (0.068) | 0.092 | 0.114 | 0.136 |
| Ethnic char. of elite | 0.120 (0.083) | 0.052 (0.077) | 0.012 | 0.046 | 0.078 |
| Trade openness (log) | -0.172* (0.057) | -0.093* (0.035) | -0.142 | -0.073 | -0.048 |
| Intercept | 0.659 (0.217) | | | | |

Note: Replication of the "structural model of genocide" by Harff (2003). Marginal effects of predictors from OLS regression and KRLS regression with standard errors in parentheses.
For KRLS, the table shows the average of the pointwise derivatives as well as the quartiles of their distribution to examine the effect heterogeneity. The dependent variable is a binary indicator for genocide onsets. N=126. *p-value < .05. See text for details.

3.9 Figures

Figure 3-1: Random Samples of Functions of the Form $f(x) = \sum_i c_i k(x, x_i)$

[Figure: two panels over x in [0, 1]; legend: Gaussians for x_i, Superposition]

Note: The target function is created by centering a Gaussian over each $x_i$, scaling each by its $c_i$, and then summing them. We use 8 observations with $c_i \sim N(0, 1)$, $x \sim \text{Unif}(0, 1)$, and a fixed value for the bandwidth of the kernel $\sigma^2$. The dots represent the sampled data points, the dotted lines refer to the scaled Gaussian kernels that are placed over each sample point, and the solid lines represent the target functions created from the superpositions. Notice that the center of each Gaussian curve depends on the point $x_i$, its upward or downward direction depends on the sign of the weight $c_i$, and its amplitude depends on the magnitude of the weight $c_i$ (as well as the fixed $\sigma^2$).

Figure 3-2: KRLS Compares Well to OLS with Linear Data-Generating Processes

[Figure: two panels with N (100 to 600) on the x-axis; left legend: OLS estimates, KRLS estimates; right legend: KRLS: SE(E[dy/dx]), OLS: SE(beta)]

Left: Simulation to recover the average derivative of $y = .5x$, i.e. $\partial y / \partial x = .5$ (solid line). For each sample size, we run 100 simulations with observed outcomes $y = .5x + e$ where $x \sim \text{Unif}(0, 1)$ and $e \sim N(0, .3)$. One contaminated data point is set to $(y_i = -5, x_i = 5)$. Dots represent the mean estimated average derivative for each sample size for OLS (open circles) and KRLS (full circles). The simulation shows that KRLS is robust to the bad leverage point, while OLS is not.
Right: Comparison of the standard error of $\hat{\beta}$ from OLS (solid line) to the standard error of the sample average partial derivative from KRLS (dashed line). Data are generated according to $y = 2x + e$, with $x \sim N(0, 1)$ and $e \sim N(0, 1)$ with 100 simulations for each sample size. KRLS is nearly as efficient as OLS at all but very small sample sizes, with standard errors, on average, approximately 14% larger than those of OLS.

Figure 3-3: KRLS with High-Frequency and Discontinuous Functions

[Figure: two panels over x in [0, 1]; left legend: Target function, KRLS estimate N=40, KRLS estimate N=400; right legend: Target function, KRLS estimate N=40, KRLS estimate N=400, OLS estimate N=40, OLS estimate N=400]

Left: Simulation to recover a high-frequency target function given by $y = .2 \sin(12\pi x) + \sin(2\pi x)$ (solid line). For each sample size, we run 100 simulations where we draw $x \sim \text{Unif}(0, 1)$ and simulate observed outcomes as $y = .2 \sin(12\pi x) + \sin(2\pi x) + e$ where $e \sim N(0, .2)$. The dashed line shows mean estimates across simulations for N=40 and the dotted line for N=400. The results show that KRLS finds a low-frequency approximation even at the larger sample sizes. Right: Simulation to recover the discontinuous target function given by $y = .5 \cdot 1(x > .5)$ (solid line). For each sample size, we run 100 simulations where we draw $x \sim \text{Unif}(0, 1)$ and simulate observed outcomes as $y = .5 \cdot 1(x > .5) + e$ where $e \sim N(0, .2)$. Dashed lines show mean estimates across simulations for N=40 and dotted lines for N=400. The results show that KRLS fails to approximate the sharp discontinuity even at the larger sample size, but still dominates the comparable OLS estimate, which uses x as a continuous regressor.

Figure 3-4: KRLS Learns Interactions from the Data

[Figure: two panels, Simple Interaction (left) and Complex Interaction (right), with N (0 to 300) on the x-axis; legend: KRLS, OLS, True R^2]

Simulations to recover target functions that include multiplicative interaction terms. Left: The target function is $y = .5 + x_1 + x_2 - 2(x_1 \cdot x_2) + e$ with $x_j \sim \text{Bernoulli}(.5)$ for $j = 1, 2$ and $e \sim N(0, .5)$. Right: The target function is $y = (x_1 \cdot x_2) - 2(x_3 \cdot x_4) + 3(x_5 \cdot x_6 \cdot x_7) - (x_1 \cdot x_8) + 2(x_8 \cdot x_9 \cdot x_{10}) + x_{10}$ where all x are drawn i.i.d. Bernoulli(p) with p = .25 for $x_1$ and $x_2$, p = .75 for $x_3$ and $x_4$, and p = .5 for all others. For each sample size, we run 100 simulations where we draw the x and simulate outcomes using $y = y_{true} + e$ where $e \sim N(0, .5)$ for the training data. We use 1,000 test points drawn from the same distribution to test the out-of-sample $R^2$ of the estimators. The closed circles show the average $R^2$ estimates across simulations for the KRLS estimator, the open circles show the estimates for the OLS regression that uses all x as predictors. The true $R^2$ is given by the solid line. The results show that KRLS learns the interactions from the data and approaches the true $R^2$ that one would obtain knowing the functional form as the sample size increases.

Figure 3-5: Effect Heterogeneity in Harff Data

[Figure: six histograms titled "Distributions of pointwise marginal effects", with marginal effect on the x-axis; panels: PriorUpheaval, Autoc, PriorGen, EthnicChar, IdeologicalChar, TradeOpen]

Histograms of pointwise marginal effects based on KRLS fit to the Harff data (Model 2 in Table 3.4).

Chapter 4

Kernel Balancing

Kernel Balancing: A Balancing Method to Equalize Multivariate Densities and Reduce Bias without a Specification Search

Chad Hazlett - Massachusetts Institute of Technology

ABSTRACT

Investigators often use matching and weighting techniques to adjust for differences between treated and control groups on observed characteristics. These methods, however, ensure that the treated and control have the same means only on explicitly chosen functions of the covariates. Treatment effect estimates made after adjustment by these methods are thus sensitive to specification choices. The resulting treatment effect estimates are biased if any function of the covariates influencing the outcome is imbalanced. Kernel balancing finds weights that ensure the treated and control have equal means on a very large class of smooth functions of the covariates. In addition, when multivariate density is measured a particular way, the reweighted control group has the same multivariate density as the treated. In two empirical applications, kernel balancing (1) accurately recovers the experimentally estimated effect of a job training program, and (2) finds that after controlling for observed differences, democracies are less likely to win counterinsurgencies, consistent with theoretical expectation but in contrast to previous findings.

4.1 Introduction

Matching and weighting methods are widely used to estimate causal effects from non-experimental data when unobserved confounders can be ruled out. However, existing methods do not ensure that the multivariate densities of the resulting treated and control units are sufficiently similar, nor do they typically allow for multivariate imbalances to be detected. As a result, even apparently well-balanced samples may differ on important functions of the covariates, leading to biased treatment effect estimates.
Kernel balancing, proposed here, is a weighting technique that uses kernels to construct a higher dimensional transformation of the original data. It then achieves equal means for the treated and control groups on this transformed version of the data. This method makes several contributions to existing methodology. First, it obtains approximate balance on a large class of smooth functions of the covariates. Second, it ensures that the entire multivariate density of the covariates, as measured by a particular smoothing estimator, is approximately equalized for the treated and control samples. Third, the method does not require users to conduct an iterative specification search, to check univariate balance measures, or to otherwise guess what functions of the covariates must be included in matching/reweighting procedures. Fourth and finally, I introduce a method of measuring multivariate imbalance using an $L_1$ metric, which can be applied before and after this or any other matching or weighting method to assess the distance between the multivariate densities of treated and controls.

In what follows, section 4.2 briefly describes existing methods and their shortcomings, after which section 4.3 illustrates these shortcomings with a simple hypothetical example. Section 4.4 then establishes a formal framework for the problem, and describes the conditions under which unbiased estimation with matching and weighting estimators is possible. The kernel balancing technique is then described in section 4.5, with implementation details in section 4.6. Section 4.7 offers two empirical applications. The first is a re-analysis of Dehejia and Wahba (1999), in which the kernel balancing estimates made using observational data accurately recover the experimental estimate. The second application reanalyses data from Lyall (2010), examining whether democracies are less successful counterinsurgents than non-democracies.
Kernel balancing produces far better balance on the original variables and numerous functions of them. The resulting effect estimates reveal that, in contrast to Lyall (2010), democracies are over 25 percentage points less likely than non-democracies to defeat insurgencies, a substantial and significant reduction.

4.2 Background

Traditional matching approaches (e.g. Rubin, 1973) match each treated unit to one or several control units that are most similar, as measured using some distance metric. Methods in this family vary principally in how they measure distance, with recent advances allowing the relative weight of each variable to change in order to optimize balance across a panel of balance tests (Diamond and Sekhon, 2005). While the non-parametric nature of these approaches is appealing, methods in this family have three key shortcomings. First, they seek to ensure the multivariate density of the controls matches that of the treated, but are not typically equipped to measure discrepancies in these distributions, nor are they designed to optimize equality of the multivariate distributions. Second, when exact matches cannot be found for each treated unit, as is the case when matching on continuous variables, the resulting matching discrepancies cause bias. This bias dissipates only very slowly, and in general the resulting estimates are not $\sqrt{N}$-consistent (Abadie and Imbens, 2006). The bias can be removed by modeling the effect of the matching discrepancy, but this re-introduces parametric assumptions. Finally, because of these difficulties, results depend on what functions of the covariates (e.g. squared terms, logarithms, multiplicative interactions) the user includes in the matching procedure. Thus while intended to be a specification-free approach, in practice users must undergo a tedious specification search.
Moreover, without balance metrics that accurately measure multivariate balance, there is no clear way of arbitrating among different specifications in a way that leads to the least-biased estimate.

A second family of techniques involves weighting methods that choose continuous (usually non-negative) weights for control units and possibly treated units. Propensity score weighting (Rosenbaum and Rubin, 1983) is a widely used technique to remove bias due to observed covariates. However, its major shortcoming is the requirement that the propensity score model be correctly specified. The difficulty of achieving this correct specification and the resulting biases have been well studied (e.g. Smith and Todd, 2001). More recently, "covariate-balancing" weighting techniques have been proposed. Entropy balancing (Hainmueller, 2012) allows users to enter a covariate matrix X, which may include squares, interactions, or other higher-order terms of the original covariates, and achieve essentially perfect mean-balance on each of these covariates.[1] Since many possible combinations of weights achieve this, entropy balancing chooses the set that, roughly speaking, ensures that weights are as close to constant as possible. Also in this family, the covariate-balancing propensity score (Imai and Ratkovic, 2014) seeks weights on the controls to balance the propensity score, while also balancing desired moments of the covariate distributions. These various covariate-balancing methods have the benefit of achieving essentially perfect equality of means on the matrix of included covariates, X, thus side-stepping some of the problem of bias due to matching discrepancies. However, as shown below, the principal shortcoming of these approaches is that they do not guarantee balance on nonlinear functions of X. Accordingly, unbiasedness will only be guaranteed for these methods when the (non-treatment potential) outcome is linear in the columns of X.
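The gap between mean balance and balance on nonlinear functions can be seen in a small numerical sketch. The code below is not Hainmueller's implementation; it assumes a single covariate, uniform base weights, and the exponential-tilting form $w_i \propto \exp(\theta x_i)$, which is the shape the entropy-balancing solution takes under one mean constraint. The data values and the function names (`tilt_weights`, `mean_balance`) are hypothetical.

```python
import math

def tilt_weights(x, theta):
    """Weights proportional to exp(theta * x_i), normalized to sum to one."""
    shift = max(theta * xi for xi in x)  # subtract the max for numerical stability
    raw = [math.exp(theta * xi - shift) for xi in x]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_mean(values, w):
    return sum(v * wi for v, wi in zip(values, w))

def mean_balance(x_control, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisect on theta until the weighted control mean of x hits target_mean.

    The weighted mean is increasing in theta, so bisection is sufficient.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if weighted_mean(x_control, tilt_weights(x_control, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilt_weights(x_control, (lo + hi) / 2.0)

# Hypothetical data: controls on an even grid, treated units clustered higher.
x_control = [i / 10.0 for i in range(-20, 21)]
x_treated = [0.5, 1.0, 1.5, 1.5, 2.0]

treated_mean = sum(x_treated) / len(x_treated)
treated_mean_sq = sum(x * x for x in x_treated) / len(x_treated)

w = mean_balance(x_control, treated_mean)

# Balance on x itself is essentially exact ...
gap_mean = weighted_mean(x_control, w) - treated_mean
# ... but the same weights leave x^2, a nonlinear function of x, imbalanced.
gap_square = weighted_mean([x * x for x in x_control], w) - treated_mean_sq
```

If the non-treatment outcome depended on $x^2$, the remaining `gap_square` would carry over into the effect estimate even though the mean of $x$ is matched.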
Finally, coarsened exact matching (CEM, Iacus et al., 2012) is a weighting technique distinct from those above. This approach "coarsens" the data, placing each observation into a multivariate bin. Within each bin having at least one control unit, a weight can be chosen so that the weighted number of controls equals the number of treated falling in that bin (if the desired estimand is an average treatment effect on the treated). This has the benefit of bounding the multivariate imbalance, and providing a way of measuring the imbalance by examining the original imbalance within each bin. However, it too has several major shortcomings. First, it requires choosing the boundaries for the bins, which may influence the result. Second, it is not tolerant of high-dimensional data, since the bin size required in order to obtain bins with common support grows very quickly with the dimension of the data. Perhaps most importantly, because the bins must be large, a treated and control unit within the same bin may vary widely in their covariate values, and thus in their associated values of the non-treatment potential outcome, re-introducing the matching discrepancy problem that causes bias in traditional matching estimators.

[1] Equivalently, entropy balancing allows the user to equate any desired moments of the covariate distribution for the controls to that of the treated. For example, obtaining mean balance on a covariate X and its square ensures that the first and second sample moments of X are equal for the treated and controls.

In summary, then, the existing approaches do not guarantee multivariate balance, balance on functions of the covariates that may influence the outcome, or unbiasedness. The covariate-balancing weighting techniques side-step the problem of matching discrepancies, but instead require the investigator to know all the functions of the covariates that may influence the outcome.
While theoretical guidance may help investigators to choose which covariates are important to match on, theory rarely tells us exactly how these covariates matter. Thus, investigators cannot know exactly what functions of the covariates to achieve balance on, e.g. the raw variables, squared terms, interactions, ratios, or even non-standard functions of the data. Balance testing also provides little guidance, as it is typically confirmed only univariately, or on higher order terms explicitly included by the user. As a result, current matching or weighting techniques leave users unsure of "what to match on", and estimates made even by the most careful researcher can be both biased and sensitive to specification.

4.3 Motivating Example

This section provides a motivating example using simulated data to illustrate the risks of bias under existing methods.

Suppose we are interested in the question of whether peacekeeping missions deployed after civil wars are effective in lengthening the duration of peace (peace years) after the war's conclusion (e.g. Fortna, 2004; Doyle and Sambanis, 2000). However, the "treatment", peacekeeping missions (peacekeeping), is not randomly assigned. Rather, missions are more likely to be deployed in certain situations, which may differ systematically in their expected peace years even in the absence of a peacekeeping mission. To deal with this, we collect four pre-treatment covariates that describe each case: the duration of the preceding war (war duration), the number of fatalities (fatalities), democracy level prior to the peacekeeping mission (democracy), and a measure of the number of factions or sides in the civil war (factionalism). We are interested in the average treatment effect on the treated (ATT), which is the mean number of peace years experienced by countries that received peacekeeping, minus the average number of peace years for this group had they not received peacekeeping missions.
Such causal estimands can be estimated from these data only if there are no unobserved confounders. For this example, suppose that peacekeeping missions are deployed on the basis of a conflict's intensity, measured as fatalities / war duration, with missions more likely to be deployed where conflicts were higher in intensity. Such an assignment process would result, for example, if "faster-burning" conflicts are more likely to attract international attention, and thus peacekeeping missions. In addition, suppose the outcome of interest, peace years, is also a function of intensity, with more intense conflicts leading to longer average peace years. This is reasonable if, for example, more intense wars indicate greater dominance by one side, leading to a lower likelihood of resurgence in each subsequent year. In this example, peace years is only a function of intensity, and not of peacekeeping, implying a true treatment effect of zero.[2]

[2] Details for the simulation are as follows. War duration in years is distributed max(1, N(7, 9)); intensity in fatalities per year is distributed Unif(100, 10000). Fatalities is then computed as intensity × war duration. The treatment, peacekeeping, is assigned by a Bernoulli draw with probability logit-1(iintensity - 2), and the outcome peace years = in 2 sty + e, r ~ N(0, 0.004).

How well do existing techniques achieve balance, both on the original covariates and on intensity, an important function of the observables? In figure 4-1, the x-axis for each plot shows the difference in means between treated and control on each of the covariates, as well as on intensity. All results are averaged over 500 simulations with the same data generating process and N = 500 on each simulation. The first plot (matching) shows results for simple Mahalanobis distance matching (with replacement). Imbalance remains somewhat large on war duration. More troubling, imbalance remains considerable on intensity, which was not directly included in the matching procedure.
A careful researcher may realize the need to match on more functions of the covariates, and instead match on the original covariates, their squares, and their pairwise multiplicative interactions. While few researchers go this far in practice, the second plot in figure 4-1 (matching+) shows that even this approach would not provide the needed flexibility to produce balance on intensity. In fact, balance on both war duration and intensity is worsened. In the third plot (mean balance), entropy balancing (Hainmueller, 2012) is used to achieve equal means in the original covariates. As expected, this produces excellent balance on the original covariates, but only a modest improvement in balance on intensity. Finally, in the fourth plot (kernel balance), the kernel balancing approach introduced here is applied, again using the original covariate data alone. Because this method achieves balance on many smooth functions of the included covariates, it achieves vastly improved balance on intensity.

These imbalances are worrying because they lead to biased ATT estimates: since intensity affects the outcome, the mean differences in intensity between the treated and control group after adjustment lead to mean differences in the outcome between treated and control that are not due to the treatment. When the ATT is estimated by difference in means in the post weighting/matching sample, bias is thus found for all methods but kernel balancing. This is evident in figure 4-2, which shows the distribution of ATT estimates (over the simulations) for each method.

This example is notable because, while artificial, it is reasonable in many cases that a function such as the ratio of two variables may impact the outcome variable in question in the absence of the treatment, yet investigators rarely ensure balance on such ratios.
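The logic of this example can be sketched in a few lines. The code below is an illustration patterned on the design above, not a reproduction of the thesis simulation: the scaling constants in the assignment and outcome equations, the noise level, the reading of N(7, 9) as a standard deviation of 3, and the use of simple stratification on binned intensity in place of matching or weighting are all assumptions made here, and the function names are hypothetical.

```python
import math
import random

def simulate(n, rng):
    """Draw one dataset in which only the ratio fatalities / war_duration matters."""
    rows = []
    for _ in range(n):
        war_duration = max(1.0, rng.gauss(7.0, 3.0))  # years (sd of 3 is assumed)
        intensity = rng.uniform(100.0, 10000.0)       # fatalities per year
        fatalities = intensity * war_duration          # the covariate actually observed
        # Assumed assignment rule: more intense conflicts are more often treated.
        p_treat = 1.0 / (1.0 + math.exp(-(intensity / 2500.0 - 2.0)))
        treated = 1 if rng.random() < p_treat else 0
        # Assumed outcome: a function of intensity only, so the true effect is zero.
        peace_years = intensity / 10000.0 + rng.gauss(0.0, 0.05)
        rows.append({"war_duration": war_duration, "fatalities": fatalities,
                     "treated": treated, "peace_years": peace_years})
    return rows

def mean(values):
    return sum(values) / len(values)

def naive_att(rows):
    """Unadjusted difference in mean outcomes, treated minus control."""
    treated = [r["peace_years"] for r in rows if r["treated"] == 1]
    control = [r["peace_years"] for r in rows if r["treated"] == 0]
    return mean(treated) - mean(control)

def stratified_att(rows, n_bins=10):
    """ATT after stratifying on binned intensity = fatalities / war_duration."""
    width = (10000.0 - 100.0) / n_bins
    bins = {}
    for r in rows:
        intensity = r["fatalities"] / r["war_duration"]
        b = min(int((intensity - 100.0) / width), n_bins - 1)
        bins.setdefault(b, {0: [], 1: []})[r["treated"]].append(r["peace_years"])
    total, n_treated = 0.0, 0
    for groups in bins.values():
        if groups[0] and groups[1]:  # use only bins containing both groups
            total += len(groups[1]) * (mean(groups[1]) - mean(groups[0]))
            n_treated += len(groups[1])
    return total / n_treated

rng = random.Random(0)
naive, adjusted = [], []
for _ in range(20):
    data = simulate(500, rng)
    naive.append(naive_att(data))
    adjusted.append(stratified_att(data))

avg_naive, avg_adjusted = mean(naive), mean(adjusted)
```

Under these assumed settings `avg_naive` sits well above zero while `avg_adjusted` is close to zero, because the second estimator conditions on the ratio that drives both assignment and outcome. An investigator who adjusts only for fatalities and war duration separately has no guarantee of removing that imbalance.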
More generally, even with strong theoretical priors, it is unreasonable to expect investigators to correctly guess what functions of the observables may impact the outcome, and to ensure balance on each of these. Kernel balancing offers a solution to this problem.

Figure 4-1: Imbalance on a function of the covariates

[Figure: four panels (matching, matching+, mean balance, kernel balance); rows: fatalities, war duration, factions, democracy, intensity; x-axis: Imbalance, 0.00 to 0.12]

Mean imbalances on the four included covariates, and intensity = fatalities / war duration, which is the key factor in both assignment of the treatment (peacekeeping) and the eventual outcome (peace years). Matching: Mahalanobis distance matching on the original four covariates leaves a substantial imbalance on war duration. More problematically, it shows a large imbalance on intensity. Matching+: Mahalanobis distance matching with squared terms and all pairwise multiplicative interactions included in the matching procedure. This produces a slight worsening of imbalance, particularly on intensity. Mean balance: Entropy balancing on the original four covariates achieves essentially perfect mean balance on these covariates. However, this produces only a small improvement in balance on intensity. Kernel balance: the technique proposed here obtains mean balance on a wide range of smooth functions of the included covariates. As a result, it obtains very good balance on intensity, even though the user only enters the original four covariates.

4.4 Theoretical Framework

Let $Y(1) \in \mathbb{R}$, $Y(0) \in \mathbb{R}$, $X \in \mathcal{X}$ and $D \in \{0, 1\}$ be random variables, with joint distribution $f_{X, D, Y(0), Y(1)}$, where $Y(1)$ is the potential outcome under treatment, $Y(0)$ is the potential outcome under control, $D$ is the treatment assignment, and $X$ is a vector of covariates with labels $x$. We sample N i.i.d. tuples $(X_i, D_i, Y_i(0), Y_i(1))$ from this joint distribution.
The estimand of primary interest throughout will be the average treatment effect on the treated (ATT):

$$\text{ATT} = E[Y_i(1) - Y_i(0) \mid D_i = 1] \tag{4.1}$$

Figure 4-2: Biased ATT estimation due to an imbalanced function of the covariates. [Boxplots of ATT estimates for mdmatch, mdmatch+, mean balance, and kernel balance, with the truth marked. Horizontal axis: ATT estimates, from 0.00 to 0.30.] Boxplot illustrating the distribution of average treatment effect on the treated (ATT) estimates in the same example as figure 4-1 above. The actual effect is zero peace years. Matching on the raw covariates, matching on higher order transforms, and obtaining mean balance all show large biases because the control samples chosen by these procedures include higher intensity conflicts than the treated sample, even though intensity is entirely a function of observables. Since intensity influences the outcome, peace years, the treated and control samples thus differ regardless of any treatment effect. By contrast, kernel balance is approximately unbiased, as it achieves balance on a large space of smooth functions of the covariates.

Weights $W_i$ will be chosen for the control units. The weighted difference in means estimator for the ATT is then:

$$\widehat{\text{ATT}} = \frac{1}{N_t}\sum_{i:D_i=1} Y_i - \sum_{i:D_i=0} W_i Y_i \tag{4.2}$$

where $N_t$ and $N_c$ are the numbers of treated and control units respectively, and the $W_i$ are non-negative weights on the controls such that $\sum_{i:D_i=0} W_i = 1$. Here I describe more precisely the conditions under which this simple estimator is unbiased for the ATT. To summarize the main results, conditional ignorability is not in general sufficient for unbiased ATT estimation by these techniques. Instead, estimates are guaranteed to be unbiased only if conditional ignorability holds and, after adjustment by $W$, either:

1. all functions of $X_i$ that influence $Y_i(0)$ have the same means for the treated and controls, or, more strongly,
2.
the multivariate density of the covariates for the treated is the same as that of the controls.

Conditions for Unbiasedness

First, throughout this analysis, assume conditional ignorability with respect to the non-treatment potential outcome:

ASSUMPTION 7 (CONDITIONAL IGNORABILITY FOR THE NON-TREATMENT OUTCOME)

$$Y_i(0) \perp\!\!\!\perp D_i \mid X_i$$

where $Y_i(0)$ is the non-treatment potential outcome and is assumed to be bounded, $D_i$ is treatment status, and $X_i$ is a vector of observed, pre-treatment covariates.

I also assume "common support", that $0 < \Pr(D_i = 1 \mid X_i) < 1$.[3] $Y_i(0)$ is assumed to be bounded and, without loss of generality, can be constructed as:

$$Y_i(0) = g(X_i) + \eta_i$$

where (i) $E[\eta_i \mid X_i] = E[\eta_i] = 0$ for all values of $X_i \in \mathcal{X}$; (ii) $g(X_i)$ may be any integrable function of $X$; and (iii) $\eta_i$ is bounded, with $E[\eta_i \mid D_i] = 0$. This construction makes explicit that $Y_i(0)$ may vary with $X_i$, while allowing for a stochastic element ($\eta_i$) and maintaining independence of $Y_i(0)$ and $D_i$ conditional on $X_i$ due to (iii).

[3] However, note that $\Pr(D_i = 1 \mid X_i)$ must be estimated, and determining whether common support exists in a given case depends (as it always does) on some assumption used to make these estimates.

To see the difficulties in estimating the ATT even when conditional ignorability holds, it is useful to examine more closely its dependence on the distribution of the covariates. The ATT can be re-written:

$$\begin{aligned}
\text{ATT} &= E[Y(1) - Y(0) \mid D = 1] \\
&= E_{X|D=1}E[Y(1) \mid X, D = 1] - E_{X|D=1}E[Y(0) \mid X, D = 1] \qquad (4.3) \\
&= E_{X|D=1}E[Y(1) \mid X, D = 1] - E_{X|D=1}E[Y(0) \mid X, D = 0] \qquad (4.4)
\end{aligned}$$

where the substitution between equations 4.3 and 4.4 is possible due to conditional ignorability.[4] While the aim of this substitution is to make the second term in equation 4.4 identifiable, it remains non-trivial to estimate.
Specifically, we only observe $Y(0)$ at locations in $\mathcal{X}$ at which control units are found, but we must integrate over these sampled points as though they have the distribution given by the treated, $f_{X|D=1}$.[5] Weighting and matching methods are designed to effectively change the distribution of the controls such that an average over the control units is taken as though they had the same empirical distribution (of $X$) as the treated.

[4] Note that equation 4.3 leaves implicit the distribution over which the inner expectations are taken. This is more fully written

$$\text{ATT} = E_{X|D=1}E_{Y(1)|X,D=1}[Y(1) \mid X, D = 1] - E_{X|D=1}E_{Y(0)|X,D=1}[Y(0) \mid X, D = 1]$$

The subscripts on the inner expectations have been suppressed in the text for simplicity and because the argument to the expectation operator correctly indicates the distribution of interest.

[5] The integration implied by the subscripted expectation operators is more evident when rewritten in integral form:

$$E_{X|D=1}E[Y_i(0) \mid X_i, D_i = 0] = \int_{x \in \mathcal{X}} E[Y(0) \mid X_i = x]\, f_{X|D=1}(x)\,dx = \int_{x \in \mathcal{X}} g(x)\, f_{X|D=1}(x)\,dx$$

Conditions for Unbiased ATT Estimation

When estimating the ATT by equation 4.2, unbiasedness can be obtained by ensuring that the mean non-treatment outcome of the treated equals that of the reweighted controls. This is stated more formally by proposition 1.

PROPOSITION 1 (SUFFICIENCY FOR UNBIASED ATT ESTIMATION) The sample estimator for the ATT under a method of choosing weights, $W_i$:

$$\widehat{\text{ATT}} = \frac{1}{N_t}\sum_{i:D_i=1} Y_i - \sum_{i:D_i=0} W_i Y_i$$

is unbiased for the true ATT if and only if

$$E[Y_i(0) \mid D_i = 1] = E\Big[\sum_{i:D_i=0} W_i Y_i(0)\Big]$$

Proof of this derives directly from consideration of the bias. Since $Y_i = Y_i(1)$ for treated units and $Y_i = Y_i(0)$ for control units, the bias is given by:

$$\begin{aligned}
\text{bias} &= E[\widehat{\text{ATT}}] - \text{ATT} \\
&= E\Big[\frac{1}{N_t}\sum_{i:D_i=1} Y_i - \sum_{i:D_i=0} W_i Y_i\Big] - \big(E[Y(1) \mid D_i = 1] - E[Y(0) \mid D_i = 1]\big) \\
&= E[Y(1) \mid D_i = 1] - E\Big[\sum_{i:D_i=0} W_i Y_i\Big] - E[Y(1) \mid D_i = 1] + E[Y(0) \mid D_i = 1] \\
&= E[Y(0) \mid D_i = 1] - E\Big[\sum_{i:D_i=0} W_i Y_i\Big] \qquad (4.5)
\end{aligned}$$

Unbiasedness is thus achieved if and only if $E[Y(0) \mid D_i = 1] = E[\sum_{i:D_i=0} W_i Y_i]$, proving proposition 1. The sample analog of proposition 1 is that

$$\frac{1}{N_t}\sum_{i:D_i=1} Y_i(0) = \sum_{i:D_i=0} W_i Y_i(0)$$

i.e. that mean balance obtains for $Y(0)$ itself.
This is an unusual statement in the sense that $Y(0)$ is not usually thought of as a direct target of balancing procedures: $Y(0)$ is unobserved for the treated, so the condition cannot be verified directly. A more natural corollary is that "any function of $X$ that influences $Y(0)$ must have the same mean among the treated and controls."[6] I use this latter interpretation throughout the paper.

[6] These are equivalent statements because if some function of $X_i$ influences $Y_i(0)$ and is imbalanced, $Y_i(0)$ too will be imbalanced (except in knife-edge cases). Proposition 1 simply takes this logic to its conclusion, by saying that $Y_i(0)$ itself must be mean balanced, implying mean balance of any function that influences it. Put differently, anything that might "matter" in determining the outcome (in the absence of the treatment) must be balanced; otherwise average differences between the treated and control on the outcome will emerge as a result of improper adjustment even in the absence of the treatment.

Second, while conditional ignorability and proposition 1 are sufficient for unbiasedness, a more standard approach is to consider a slightly broader condition that is more typically regarded as the target for matching and weighting methods: equalizing the multivariate densities of the treated and the (reweighted) controls. Multivariate balance is defined as follows:

ASSUMPTION 8 (MULTIVARIATE BALANCE) "Multivariate balance" is achieved when a method finds weights, $W$, on the control units such that the post-weighting density of the controls equals that of the treated. In finite samples this is required at each observation in the dataset:

$$f_{w,X|D=0}(X_i) = f_{X|D=1}(X_i), \quad \forall i$$

We can now state proposition 2:

PROPOSITION 2 (MULTIVARIATE BALANCE PRODUCES UNBIASEDNESS) Under conditional ignorability (assumption 7), any choice of weights that produces multivariate balance as defined by assumption 8 allows for unbiased estimation of the ATT.
Proof is given in the appendix, but the intuition is straightforward. By assumption 7, at any given $X_i = x_i$ the observed outcomes from the controls are equal in expectation to the non-treatment outcomes that treated units would take if found at the same value $x_i$. The problem of having to average these control outcomes over the distribution of the treated is solved because the distribution of the controls is the same as that of the treated when assumption 8 holds.

Note also the relationship between multivariate balance and proposition 1. When multivariate balance does not hold, it is possible that some function of $X$ has different means among the treated and controls, which will induce bias if that function of $X$ is correlated with $Y(0)$. As a condition for unbiasedness, then, multivariate balance is stronger than necessary, in that it ensures mean balance on every function of $X$, whereas proposition 1 only requires balance on $Y(0)$, and thus balance on functions of $X$ that influence $Y(0)$.[7]

Unbiasedness, Matching, and Mean Balance

To complete the analysis, consider three situations and the potential for bias in each: exact matching, distance-minimizing matching methods for approximate matching, and approaches that ensure equality of means or higher moments of covariates that are explicitly chosen by the investigator.

Exact Matching

First, exact matching ensures that multivariate balance (assumption 8) holds. It is therefore sufficient, under conditional ignorability, for unbiased ATT estimation.

Approximate Matching

Second, consider approximate matching by methods such as nearest-neighbor, Mahalanobis distance, and genetic matching, using single matching for simplicity. These methods are typically optimized based on tests of univariate balance measures, and while they may bring $f_{w,X|D=0}$ closer to $f_{X|D=1}$, they do not ensure multivariate balance and thus fail assumption 8. The potential bias can be appreciated by examining each matched pair's contribution to it.
Specifically, matching attempts to pair observations $i$ (a treated unit) and $j$ (a control unit) such that some measure of the distance between $x_i$ and $x_j$ is as small as possible. It then must be assumed that $E[Y(0) \mid D = 1, X = x_i] = E[Y(0) \mid D = 0, X = x_j]$, so that the observed outcomes from the control unit at $x_j$ can substitute for the non-treatment outcomes of a treated unit placed at $x_i$. Under conditional ignorability, this implies $E[Y(0) \mid X = x_i] = E[Y(0) \mid X = x_j]$, or simply $g(x_i) = g(x_j)$. The bias due to each matched pair is $g(x_i) - g(x_j)$, and the total bias is the average of these over all pairs. Besides knife-edge cases, unbiasedness is achieved only if $g(x_i) = g(x_j)$ for each matched pair $\{i, j\}$. This does not generally hold.[8] As shown in Abadie and Imbens (2006), the resulting bias due to these "matching discrepancies", $\|x_i - x_j\|$, does not shrink fast enough to achieve $\sqrt{N}$-consistency of ATT estimates for many problems.[9]

[7] Strictly speaking, even the common support assumption is too strong by this logic: for a variable $X^a$ in $X$ that does not influence $Y(0)$ through $g(X)$, neither equality of marginal distributions nor common support is required.

Mean Balancing on X

Third, consider "mean balancing" methods, those that achieve mean balance on covariates. Note that since $X$ can include higher order transforms of the original covariates (squared terms, multiplicative interactions, etc.), this allows balance to be sought on any desired sample moments of the original covariates. While matching estimates may nearly obtain mean balance on the included covariates, other methods target mean balance more directly through weighting, including entropy balancing (Hainmueller, 2012) and the covariate-balancing propensity score (Imai and Ratkovic, 2014).
A key fact for understanding when such estimators are unbiased is that mean balance on $X$ implies mean balance on all linear functions of $X$:

PROPOSITION 3 (BALANCE ON LINEAR TRANSFORMS OF X) When weights $W$ achieve mean balance on the (possibly augmented) covariates $X_i$ according to:

$$\sum_{i:D_i=0} W_i X_i = \frac{1}{N_t}\sum_{i:D_i=1} X_i$$

all linear functions in $X_i$, which evaluate to $X_i^\top\beta$ at the observed points, also have the same mean for the treated and the weighted control samples:

$$\sum_{i:D_i=0} W_i (X_i^\top\beta) = \frac{1}{N_t}\sum_{i:D_i=1} X_i^\top\beta \tag{4.6}$$

[8] Moreover, this fails particularly when $E[g(X) \mid D = 1] \neq E[g(X) \mid D = 0]$, which is generally suspected to be the case in problems that require conditioning.

[9] When $g(X_i)$ is not a known function, there is no guarantee of how severe the remaining bias is. When $g(X_i)$ is known, this can be used to adjust for the bias in each pair as proposed by Abadie and Imbens (2011). However, the presumption that $g(X_i)$ is a known function may bring us back to the specification assumptions which matching was meant to avoid.

Proof follows simply from the fact that equation 4.6 can be rewritten as:

$$\beta^\top \sum_{i:D_i=0} W_i X_i = \beta^\top \frac{1}{N_t}\sum_{i:D_i=1} X_i$$

and $\sum_{i:D_i=0} W_i X_i = \frac{1}{N_t}\sum_{i:D_i=1} X_i$ under mean balance.

Recall that due to proposition 1, unbiasedness requires that all functions of $X_i$ influencing $Y(0)$ are mean balanced after weighting. Since mean balance on $X_i$ guarantees mean balance only on linear functions of $X_i$, mean balance on $X_i$ only ensures unbiasedness if the structural component of $Y(0)$, $g(X_i)$, is linear in $X_i$. Suppose, in contrast, that a nonlinear component of $Y(0)$, $h(X_i)$, exists, such that $Y_i(0) = X_i^\top\beta + h(X_i) + \eta_i$. If $h(X_i)$ is non-zero, unbiasedness is not guaranteed. In this case, substituting into equation 4.5 and using mean balance on $X_i$, the bias is given by

$$\text{bias} = E\Big[\frac{1}{N_t}\sum_{i:D_i=1} h(X_i) - \sum_{i:D_i=0} h(X_i) W_i\Big] \tag{4.7}$$

This bias is derived in the appendix. Note that it is closely related to the correlation between $h(X_i)$ and the treatment assignment on the weighted data.
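The role of proposition 3 and of the nonlinear component $h(X_i)$ can be illustrated numerically. The sketch below is a simplified one-covariate stand-in, not the entropy balancing implementation: it assumes weights of the exponential-tilting form $W_i \propto \exp(\lambda X_i)$, normalized to sum to one as in equation 4.2, and finds $\lambda$ by bisection. All data values, function names, and the choice $h(x) = x^2$ are hypothetical.

```python
import math

def tilting_weights(x_controls, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Weights proportional to exp(lam * x_i), normalized to sum to 1.

    lam is found by bisection so that the weighted control mean equals
    target_mean, which must lie strictly inside the range of x_controls.
    """
    def weights(lam):
        scores = [lam * x for x in x_controls]
        top = max(scores)  # subtract the max for numerical stability
        raw = [math.exp(s - top) for s in scores]
        total = sum(raw)
        return [r / total for r in raw]

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        w_mid = weights(mid)
        if sum(w * x for w, x in zip(w_mid, x_controls)) < target_mean:
            lo = mid  # the weighted mean is increasing in lam
        else:
            hi = mid
    return weights((lo + hi) / 2.0)

def mean(values):
    return sum(values) / len(values)

def weighted_mean(values, w):
    return sum(wi * v for wi, v in zip(w, values))

def att_hat(y_treated, y_controls, w):
    """Equation 4.2: treated mean minus weighted control mean."""
    return mean(y_treated) - weighted_mean(y_controls, w)

x_treated = [1.0, 2.0, 3.0]
x_controls = [0.0, 1.0, 2.0, 4.0]
w = tilting_weights(x_controls, mean(x_treated))

# Noise-free potential outcomes with a true ATT of zero, so the observed
# outcome equals Y(0) in both groups.
def y0_linear(x):
    return 3.0 * x + 1.0

def y0_nonlinear(x):
    return 3.0 * x + 1.0 + x ** 2  # h(x) = x^2 is not mean balanced

att_linear = att_hat([y0_linear(x) for x in x_treated],
                     [y0_linear(x) for x in x_controls], w)
att_nonlinear = att_hat([y0_nonlinear(x) for x in x_treated],
                        [y0_nonlinear(x) for x in x_controls], w)
# Imbalance in h itself: treated mean minus weighted control mean.
h_gap = mean([x ** 2 for x in x_treated]) - weighted_mean(
    [x ** 2 for x in x_controls], w)

print(att_linear)             # approximately zero
print(att_nonlinear, h_gap)   # equal up to rounding, and clearly nonzero
```

With the linear outcome the estimate is zero up to rounding, as proposition 3 implies. With the added $x^2$ term the estimate departs from the true effect of zero by exactly the remaining imbalance in $h$, even though the covariate itself is perfectly mean balanced.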
4.5 The Proposed Method

Kernel Functions

Consider a kernel function, $k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$, which takes in covariate vectors from any two observations and produces a single real-valued output interpretable as a measure of similarity between those two vectors. Here I use the Gaussian kernel:

$$k(X_i, X_j) = e^{-\frac{\|X_i - X_j\|^2}{\sigma^2}} \tag{4.8}$$

Note that $k(X_i, X_j)$ has a clear interpretation as a measure of similarity between $X_i$ and $X_j$.

Furthermore, consider a feature map, $\phi(\cdot)$, mapping any given observation, $X_i$, to a $P'$-dimensional vector, $\phi(X_i)$, where $P'$ may be very large or even infinite. For any positive semi-definite kernel,[10] there exists a choice of feature mapping $\phi(\cdot)$ such that $\langle\phi(X_i), \phi(X_j)\rangle = k(X_i, X_j)$. That is, for a given kernel, there exists a choice of expansion $\phi(\cdot)$ such that the inner product of $\phi(X_i)$ and $\phi(X_j)$ can be computed by taking $k(X_i, X_j)$, even if $\phi(\cdot)$ cannot be explicitly formed.[11]

A critical piece of notation is the kernel matrix, $K$, constructed to store the results of each pairwise application of the kernel, i.e. $K_{i,j} = k(X_i, X_j) = \langle\phi(X_i), \phi(X_j)\rangle$. To reduce notation it is useful to order the observations so that the $N_t$ treated units come first, followed by the $N_c$ control units. Then $K$ can be partitioned into two rectangular matrices, $K_t$ and $K_c$. $K_t$ is the "left-half" of $K$ and is $N \times N_t$. The $j$th row of $K_t$ indicates the similarity of the $j$th observation in the dataset to the first treated unit, to the second treated unit, and so on. Likewise, $K_c$ is the "right-half" of $K$, is $N \times N_c$, and its $j$th row describes the similarity of the $j$th observation in the dataset to each of the control units. By symmetry of $K$, the average row of $K$ for the treated is identical to the average column of $K_t$, and can be written $\frac{1}{N_t}K_t 1_{N_t}$. Likewise the (unweighted) mean column (or row) of $K$ belonging to the controls would be $\frac{1}{N_c}K_c 1_{N_c}$. The weighted average column (or row) of $K$ is $K_c w$ for the $N_c \times 1$ vector of weights $w$ such that $\sum_i w_i = N_c$.
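The construction of $K$ and its partition into $K_t$ and $K_c$ can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions: the data points and the bandwidth `sigma2 = 2.0` are made up, and the function names are hypothetical rather than taken from any software.

```python
import math

def gaussian_kernel(xi, xj, sigma2):
    """Equation 4.8: exp(-||xi - xj||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / sigma2)

def kernel_matrix(X, sigma2):
    """N x N matrix of pairwise kernel evaluations."""
    return [[gaussian_kernel(xi, xj, sigma2) for xj in X] for xi in X]

# Hypothetical data, ordered with the treated units first.
X_treated = [(0.0, 0.0), (1.0, 0.0)]
X_controls = [(0.0, 1.0), (2.0, 2.0), (1.0, 1.0)]
X = X_treated + X_controls
n_t, n_c = len(X_treated), len(X_controls)
sigma2 = 2.0  # illustrative bandwidth

K = kernel_matrix(X, sigma2)
K_t = [row[:n_t] for row in K]  # "left half", N x N_t
K_c = [row[n_t:] for row in K]  # "right half", N x N_c

# (1/N_t) K_t 1: average similarity of each observation to the treated units.
k_bar_t = [sum(row) / n_t for row in K_t]
# (1/N_c) K_c 1: the unweighted counterpart for the control units.
k_bar_c = [sum(row) / n_c for row in K_c]

print(k_bar_t)
print(k_bar_c)
```

Each entry of `k_bar_t` is the average similarity of one observation in the dataset to the treated units, which is the target that a weighted combination of the columns of `K_c` is later asked to match.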
Proposal: Mean balance on K

A reasonable weighting approach is to achieve mean balance on the matrix of original covariates, $X$, which is to say that the average vector $X_i$ for the treated is equal to the weighted mean of $X_i$ for the controls. The kernel balancing procedure is analogous to this, but seeks balance in $K$ instead. That is, consider a single row or column of $K$:

$$k_i = [k(X_i, X_1), k(X_i, X_2), \ldots, k(X_i, X_N)]$$

Each $k_i$ is analogous to $X_i$, but it describes the data in new terms, using an $N$-dimensional vector of similarity measures rather than the original coordinates of $X$. Similar to mean balancing with the data taken as $X_i$, kernel balancing then seeks the weights that ensure the average $k_i$ of the treated is equal to the weighted mean vector $k_i$ of the controls:

$$\bar{k}_t = \sum_{i:D_i=0} W_i k_i$$

where $\bar{k}_t$ is the average row of $K$ for the treated units, with non-negative weights $W$ that sum to $N_c$.

In what follows, I explain why obtaining balance in this way achieves both approximate mean balance on a large set of functions of $X_i$ that could influence $Y(0)$, and approximate multivariate balance, thus achieving approximately unbiased estimation by propositions 1 and 2. I present two equivalent but separate views that provide alternative interpretations of why this occurs.

[10] A kernel is positive semi-definite if $\sum_i \sum_j a_i a_j k(X_i, X_j) \geq 0$, $\forall\, a_i, a_j \in \mathbb{R}$, $x \in \mathbb{R}^D$, $D \in \mathbb{Z}^+$.

[11] For example, suppose $X = [X^{(1)}, X^{(2)}]$ and we choose the kernel $(1 + \langle X_i, X_j\rangle)^2$. This choice of kernel happens to correspond to $\phi(X) = [1, \sqrt{2}X^{(1)}, \sqrt{2}X^{(2)}, X^{(1)}X^{(1)}, \sqrt{2}X^{(1)}X^{(2)}, X^{(2)}X^{(2)}]$, and one can confirm that $k(X_i, X_j) = \langle\phi(X_i), \phi(X_j)\rangle$ for this choice of kernel and $\phi(\cdot)$. For the Gaussian kernel, the corresponding choice of $\phi(X_i)$ happens to be infinite-dimensional, but can be understood roughly as listing the distance (in a Gaussian sense) of an observation at $X_i$ to every other point in $\mathcal{X}$.
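The correspondence between a kernel and its feature map can be confirmed numerically for the two-covariate polynomial kernel $(1 + \langle X_i, X_j\rangle)^2$. The check below is a small sketch; the two input points are arbitrary and the function names are hypothetical.

```python
import math

def poly_kernel(x, z):
    """The kernel (1 + <x, z>)^2 for two-dimensional inputs."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit six-dimensional feature map matching poly_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1],
            x[0] * x[0], r2 * x[0] * x[1], x[1] * x[1]]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (0.5, -1.0), (2.0, 3.0)  # arbitrary example points
kernel_value = poly_kernel(x, z)
feature_value = inner(phi(x), phi(z))
print(kernel_value, feature_value)  # the two agree up to rounding
```

The kernel evaluation needs only the two original coordinates, while the inner product requires forming all six features. For the Gaussian kernel the feature vector cannot be written out at all, which is why working through $K$ is the practical route.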
View 1: Mean balance in K implies mean balance on many smooth functions of X

Recall that under conditional ignorability, unbiased estimation can be achieved if any function influencing $Y(0)$, or simply $Y(0)$ itself, is mean balanced after weighting (proposition 1). We begin with the view that kernel balancing ensures mean balance on a large space of smooth functions, and that these functions are likely to include most plausible forms for $Y(0)$.

Understanding mean balance in a large set of functions of the covariates begins with considering the space of functions that are linear in $K$. These are functions that, when evaluated at all $N$ observations, produce the vector of values $Kc$ for some $c \in \mathbb{R}^N$.

View 1A: Superposition of Gaussians

There are two important interpretations of what this function space looks like. The first is the "superposition of Gaussians" view. Suppose we place a Gaussian kernel over each observation in the dataset, rescale each Gaussian by the value of $c_i$ for that observation, then sum the resulting rescaled Gaussians to form a single surface. By varying the scaling factors in $c$, an enormous variety of smooth functions can be formed, approximating a wide variety of non-linear functions of the covariates. This view is described and illustrated at length in Hainmueller and Hazlett (2013), where this function space is used to successfully model highly non-linear functions even in high-dimensional problems.

Critically, achieving mean balance on the vectors $k_i$ achieves mean balance on all functions $Kc$, and thus on all the smooth functions formed by superposition of Gaussians in this way. Thus, the many smooth functions of $X_i$ that can be built by superposition of Gaussians can be mean balanced between the treated and control, simply by achieving mean balance in $K$ instead of $X$. Smooth functions of the covariates that influence the outcome (such as intensity in the motivating example above) can thus be balanced.
More directly, if $Y(0)$ itself is a smooth function of $X_i$ in the function space representable as $Kc$, it too is directly mean balanced, ensuring unbiasedness of the ATT.

View 1B: Mean balance in φ(X)

A closely related view, more familiar in machine learning theory, relates to a feature space expansion of the original data. Under this view, the benefit of achieving mean balance on the $N$ columns of $K$ is that it also achieves mean balance on a very high-dimensional set of features, $\phi(X_i)$, such that the outcome $Y(0)$ is likely to be linear in these features, and thus mean balanced as well.

As introduced above, the feature map $\phi(X_i)$ is related to the choice of kernel such that the inner product $\langle\phi(X_i), \phi(X_j)\rangle$ is equal to simply $k(X_i, X_j)$, and the $\phi(\cdot)$ corresponding to the Gaussian kernel happens to be infinite-dimensional. Perhaps surprisingly, the weights $W$ that achieve mean balance on $\phi(X_i)$ can be found without the need to explicitly form $\phi(X_i)$.

PROPOSITION 4 (BALANCE IN K IMPLIES BALANCE IN φ(X)) Let the mean row of $K$ among the treated units be given by $\bar{k}_t = \frac{1}{N_t}K_t 1_{N_t}$, and the weighted mean row of $K$ among the controls be given by $K_c w$. If $\bar{k}_t = K_c w$, then $\bar{\phi}_t = \bar{\phi}_c$, where $\bar{\phi}_t = \frac{1}{N_t}\sum_{D_i=1}\phi(X_i)$ and $\bar{\phi}_c = \sum_{D_i=0}\phi(X_i)W_i$.

Proof is given in the appendix. Proposition 4 states that mean balance in $K$ implies mean balance in each feature of $\phi(X_i)$. The benefit of this can be understood in several ways. First and most simply, $\phi(X_i)$ is a much richer representation, akin to taking many higher-order and multivariate transforms of the covariates. Second, if the control and treated units have the same means on every dimension of $\phi(X_i)$, then they have the same means on all linear combinations $\phi(X_i)^\top\theta$. So long as $Y(0)$ is well captured by the many smooth functions that are linear in $\phi(X_i)$, it is balanced, and the ATT can be estimated without bias.[12] This is the same space described previously as the superposition of Gaussians, and it captures a very wide variety of smooth functions.[13]

In practical terms then, kernel balancing answers the question of "what to balance on" by offering a set of transformations, the rows of $K$, such that balance on these transformations ensures balance on a very large set of smooth functions likely to include or nearly include any smooth function of $X_i$ that influences $Y(0)$, or $Y(0)$ itself.

[12] As noted, the remaining bias is given by

$$\text{bias} = E\Big[\frac{1}{N_t}\sum_{i:D_i=1} h(X_i) - \sum_{i:D_i=0} h(X_i)W_i\Big]$$

As the function space in which balance is achieved grows richer, even if $Y(0)$ is not fully captured in that space, $\text{var}(h(X_i))$ grows smaller, resulting in decreased bias. Formal results regarding the lower bias of kernel balancing and the rate at which remaining bias dissipates are forthcoming.

[13] One further interpretation is that since $\bar{\phi}_t = \bar{\phi}_c$, the treated and controls have the same class means in the feature space. Thus any classifier that takes observation $i$ and classifies it based on whether $\phi(X_i)$ is nearer to the class mean of the treated ($\bar{\phi}_t$) or the class mean of the controls ($\bar{\phi}_c$) would be unable to make classifications for any observation. The logic of finding balance by considering a subset in which treated and control can no longer be distinguished is also explored in Ratkovic (2012).

Equalization of smoothed multivariate densities

The second view of what is achieved by kernel balancing relates to the quest for multivariate balance. Recall proposition 2, which states that under conditional ignorability (assumption 7), obtaining multivariate balance is (more than) sufficient for unbiased estimation of the ATT. Matching techniques attempt to achieve this equalization asymptotically by pairing together control and treated units with similar locations in $\mathcal{X}$, but generally fail to achieve multivariate balance, and are typically optimized and tested with respect to univariate balance.
Here I show that kernel balancing approximately equalizes the multivariate covariate distributions for the treated and weighted controls, as estimated by a particular smoother:

PROPOSITION 5 (BALANCE IN K IMPLIES EQUALITY OF SMOOTHED MULTIVARIATE DENSITIES) Consider a density estimator for the treated, $\hat{f}_{X|D=1}$, and for the (weighted) controls, $\hat{f}_{X|D=0,w}$, each constructed with kernel $k(\cdot,\cdot)$ of bandwidth $\sigma^2$ as described below. The choice of weights that ensures mean balance in the kernel matrix $K$ constructed by the same choice of kernel ensures that $\hat{f}_{X|D=1} = \hat{f}_{X|D=0,w}$ at every position in $\mathcal{X}$ at which an observation is located.

While the full proof of proposition 5 is given in the appendix, I describe here the key intuitions that are required. Density estimation seeks to estimate an underlying density function that can be evaluated at new locations where observations have not previously occurred. In this sense, it always requires some assumption about how an observation in one exact location of $\mathcal{X}$ should be "smeared" to understand the probability at nearby points. In a univariate context, the typical (Parzen-Rosenblatt window) approach estimates a density function according to:

$$\hat{f}(x) = \frac{1}{N}\sum_{i=1}^{N} k(x, X_i)$$

for kernel function $k$ with choice of bandwidth $\sigma$.[14] The Gaussian kernel is among the most commonly used for this task. Generalizing to a Euclidean distance, a multivariate density estimator can be given by:

$$\hat{f}(x) = \frac{1}{N(\pi\sigma^2)^{\dim(X)/2}}\sum_{j=1}^{N} k(x, X_j)$$

where the bandwidth is defined by $\sigma^2$, and the normalizing constant $(\pi\sigma^2)^{\dim(X)/2}$ (which reduces to $\sqrt{\pi\sigma^2}$ for a single covariate) is required since it is not included in the definition of the Gaussian kernel used throughout this paper.[15] Such density estimators are intuitively understandable as a process of placing a multivariate Gaussian kernel over each observation's location in $\mathcal{X}$, then summing them into a single surface and rescaling, providing a density estimate at each location.[16]

[14] This is typically written in the form $\hat{f}(x) = \frac{1}{N}\sum_i k_\sigma(|x - X_i|)$. However, translation-invariant kernels, including the Gaussian, are those that operate only on the difference between the two input arguments. For such kernels, it is always possible to equivalently write $k(|x - X_i|)$ or $k(x, X_i)$. For example, the Gaussian could be written as $k(z) = e^{-z^2/\sigma^2}$, with $z = |x - X_i|$. I use the two-argument form here for consistency with the remainder of the paper.

[15] The denominator here involves $\pi\sigma^2$ rather than the conventional $2\pi\sigma^2$ because the $\sigma^2$ used in our particular kernel is what plays the role normally played by $2\sigma^2$ in the expression for the normal density.

[16] Note that in some cases, the density estimator constructed in this way would not be a natural one, for example when $X$ is a categorical variable or has sharp bounds. Nevertheless, this approach will apply the same smoothing estimator of density in the case of treated and control values. Obtaining equality of these smoothed density estimates for the treated and controls is thus still useful, and means obtaining equality on a function that is similar to the true underlying density function. For the same reason it is also not critical to get the kernel bandwidth $\sigma^2$ exactly "correct", if a correct value exists. Nevertheless the choice of $\sigma^2$ implies a bias-variance tradeoff, which I discuss briefly below but will discuss more fully in future drafts.

Notice that this estimator is not a strictly local one: the density at a given point is a function of the distance to every other point in the dataset. Local estimators are highly sensitive to the curse of dimensionality, because the size of a volumetric neighborhood required to include even the nearest observations grows very quickly with $\dim(X)$. By contrast, the density estimator applied here is less sensitive to dimensionality, because it depends on the Euclidean distance, $\|X_i - X_j\|$, which grows only linearly in $\dim(X)$.

For a sample consisting of $X_1, \ldots, X_N$, a density estimate can be made at position $x^*$ by multiplying the corresponding row of $K$ by a column of normalizing weights. Specifically, constructing the kernel matrix $K$ using the Gaussian kernel and right-multiplying it by such a column vector produces values numerically equal to (1) constructing such an estimator based on all the observations represented in the columns of $K$, then (2) evaluating the resulting density estimates at all the positions represented by the rows of $K$.

The expression $\frac{1}{N_t}K_t 1_{N_t}$ thus returns estimates for the density of the treated, measured at all observed points in $\mathcal{X}$. Likewise, $\frac{1}{N_c}K_c 1_{N_c}$ estimates the density of the control units and returns its evaluated height at every datapoint observed, and $K_c w$ does the same for the reweighted density of the controls. Proposition 5 states that the choice of $w$ found by kernel balancing to achieve $K_c w = \bar{k}_t$ is also the choice that equalizes the smoothed density estimates for the treated and weighted controls at every point in the dataset. Proof is given in the appendix.

Note that the density estimate at a given point depends on the choice of kernel, including its bandwidth. The choice of a Gaussian kernel for density estimation is common. The choice of $\sigma^2$ is more difficult, and is discussed further below.

Figure 4-3 provides a graphical illustration of the density-equalizing property of the kernel balancing weights for a one-dimensional problem. The left panel shows the $x$ values for 10 treated units (red dots), drawn from $N(0.5, 1)$, and for 30 control units (black dots), drawn from $N(-0.5, 1)$. In each case, the appropriately rescaled Gaussian is placed over each observation, and these are summed to form the density estimator for the treated (solid red line) and for the controls (solid black line).
In the right panel of figure 4-3, the heights of the Gaussians over each control unit are adjusted according to the weights given by kernel balancing (dashed blue lines). When these reweighted Gaussians are summed to form the reweighted density estimator of the controls (solid blue line), it closely matches the density of the treated.

Figure 4-3: Density Equalizing Property of the kbal Weights. [Two panels; horizontal axis from -4 to 4; legend: treated, control, weighted control.] Left: Density estimates for the treated and (unweighted) controls. Red dots show the location of 10 treated units. The dashed black lines show the appropriately scaled Gaussian over each observation, which sum to form the density estimator for the treated (solid red line). Similarly, the black dots indicate the location of 30 control units, and the solid black line gives the resulting density estimate. The $L_1$ imbalance (see below) is measured to be 0.32. Right: The weights chosen by kernel balancing effectively rescale the height of the Gaussian over each control observation (dashed blue lines). The summed density from the rescaled controls (solid blue line) now closely matches the density of the treated across the covariate space. The $L_1$ imbalance is now measured to be 0.002.

A Continuous Multivariate Imbalance Measure

The construction of estimated multivariate densities evaluated at each point in the dataset immediately suggests a balance metric that simply compares the estimated densities of the treated and controls at all points in the dataset for a given choice of weights. One reasonable way to combine the pointwise estimated differences into a summary measure is an $L_1$ metric, whose sample analogue is:

$$L_1 = \frac{1}{2}\sum_{i=1}^{N}\left|\hat{f}_{D=1}(X_i) - \hat{f}_{w,D=0}(X_i)\right|$$

For interpretability, the values of $\hat{f}_{D=1}$ and $\hat{f}_{w,D=0}$ are first normalized to sum to one.[17] Note that the density estimates depend on the underlying choice of kernel bandwidth, discussed below.
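The $L_1$ measure can be sketched for a single covariate as follows. This is an illustrative reading of the formula above, not the kbal software's code: function names, data, and the bandwidth are hypothetical, control weights are assumed to sum to one, and the kernel's normalizing constant is omitted because the heights are rescaled to sum to one anyway.

```python
import math

def smoothed_heights(points, centers, weights, sigma2):
    """Height at each point of a weighted sum of Gaussians placed on centers."""
    return [
        sum(w * math.exp(-((p - c) ** 2) / sigma2)
            for w, c in zip(weights, centers))
        for p in points
    ]

def l1_imbalance(x_treated, x_controls, w_controls, sigma2):
    """Half the summed absolute difference between the normalized heights of
    the treated and weighted-control density estimates at every data point."""
    points = x_treated + x_controls
    uniform = [1.0 / len(x_treated)] * len(x_treated)
    f_t = smoothed_heights(points, x_treated, uniform, sigma2)
    f_c = smoothed_heights(points, x_controls, w_controls, sigma2)
    total_t, total_c = sum(f_t), sum(f_c)
    return 0.5 * sum(abs(a / total_t - b / total_c) for a, b in zip(f_t, f_c))

x_treated = [0.0, 1.0]
x_controls = [0.0, 1.0, 5.0]  # the control at 5.0 has no treated counterpart

before = l1_imbalance(x_treated, x_controls, [1 / 3, 1 / 3, 1 / 3], sigma2=1.0)
after = l1_imbalance(x_treated, x_controls, [0.5, 0.5, 0.0], sigma2=1.0)
print(before, after)
```

In this toy case, removing all weight from the outlying control makes the two smoothed densities coincide at every data point, so the measure falls to zero. With equal weights it is clearly positive.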
This metric is similar to the $L_1$ metric used in CEM (Iacus et al., 2012), but without requiring coarsening in order to construct discrete bins. As noted above, using a global rather than local approach to density estimation makes kernel balance tolerant of high dimensional data. This $L_1$ imbalance measure can be applied as a measure of multivariate imbalance with any matching or weighting method, not just kernel balancing. The kbal software computes $L_1$ on the original data and after balancing. It also provides the multivariate density for the treated and the controls as computed at each point in the dataset, which can be useful for visualizing overlap and diagnosing which treated units are most difficult to accommodate.

[17] While the underlying continuous functions for the densities each integrate to 1, in general a series of estimated heights drawn from this surface does not sum to 1. A rescaling is thus useful for interpretational purposes.

4.6 Implementation

Achieving Balance on K

A method is needed to find the weight vector $w$ such that $\frac{1}{N_t}K_t 1_{N_t} = K_c w$ as nearly as possible, while constraining the weights to be positive and to have minimal variation. To achieve this, I employ entropy balancing (Hainmueller, 2012), which satisfies these conditions while maximizing the entropy of the distribution of weights. However, establishing balance on all columns of $K$ by entropy balancing is computationally infeasible, owing to the near co-linearity of many columns of $K$. This co-linearity is perfect in cases where a single observation is repeated exactly, but even if this does not occur, there may be a multitude of similarly suitable solutions with different weights. Instead, I first project $K$ onto its major principal components using principal components analysis.[18] The number of factors retained for balancing, starting with those corresponding to the largest eigenvalues, is determined by the parameter numdims.
The algorithm will converge when numdims is small enough to avoid excessive collinearity. The balance as measured by \(L_1\) improves as numdims initially rises, and then typically deteriorates once numdims is too high, where both overfitting and numerical instability begin to creep in. Thus, when numdims is not user-provided, an optimization is performed to find the value of numdims that produces the best \(L_1\) balance. Note that this does not involve the outcome in any way.[19] An illustration of the relationship between numdims, \(L_1\), and the balance achieved on unknown functions of \(X_i\) is given in the appendix.

[18] The aim here is to get approximate balance on K by getting balance on the principal components of K. In kernel methods, "kernel PCA" is sometimes used. This approach treats K as the covariance matrix of \(\phi(X)\), since \(K_{ij} = \phi(x_i)^T \phi(x_j)\). Thus directly computing the eigenvectors of K effectively produces principal components for data originally in the coordinates of \(\phi(x)\). Here, a traditional PCA is computed instead, taking the eigen-decomposition of \(\tilde{K}^T \tilde{K}\), where \(\tilde{K}\) is a centered version of K. This is more in keeping with the intention of getting balance on K through its principal components, and it also demonstrates slightly better performance in practice than the kernel-PCA approach.

[19] In addition, kbal computes the quantity pctvarK, which is the percentage of variation in the matrix K accounted for by the included factors, computed based on the sum of squared eigenvalues. At the choices of numdims that minimize \(L_1\) for a given problem, pctvarK is consistently 0.99 or higher. This indicates that while balancing on a subset of dimensions of K is not ideal, it does account for a large majority of the variation in the matrix.

Choosing \(\sigma^2\), bias-variance tradeoff, and common support

The kernel bandwidth, \(\sigma^2\), plays an important role in determining the precision with which balance is assessed and achieved, governing a bias-variance tradeoff. Under the "mean balance in \(\phi(X_i)\)" view, \(\sigma^2\) is most naturally viewed as a measurement decision that determines the construction of \(\phi(X_i)\), and particularly how close two points \(X_i\) and \(X_j\) need to be in order to have highly similar features \(\phi(X_i)\) and \(\phi(X_j)\). Under the "equalization of smoothed multivariate densities" view, \(\sigma^2\) can be understood as how "blurry" or sharply resolved the density functions are taken to be prior to weighting. One can therefore think of \(\sigma^2\) as controlling the "precision" of the match: while balance is typically obtainable under a range of values, \(\sigma^2\) describes how high a bar this actually is, with smaller values of \(\sigma^2\) implying balance to a finer level of detail.[20]

[20] This role is somewhat analogous to the role of bin size in CEM, where exact matching can be obtained within-bin, but this implies more precise matches when bin sizes are smaller than when they are large.

How should \(\sigma^2\) be chosen? The question is difficult to answer as it implies a bias-variance tradeoff and there is no clear way of determining the ideal point along this tradeoff. Occasionally \(\sigma^2\) may be set too small for balance to be achievable at all, in which case the algorithm will not converge. Such cases effectively represent an absence of common support, when density is assessed by a given choice of \(\sigma^2\). In these cases, one option is raising \(\sigma^2\), which further "spreads out" the density contributed by each observation, thus increasing scope for common support. Alternatively, it may be necessary to drop treated units for which matches are most difficult to find (see section 4.6).

Fortunately, however, this choice is often not strictly necessary. In many cases, balance is achievable across a wide range of \(\sigma^2\) values. While lower values of \(\sigma^2\) are generally preferable, smaller \(\sigma^2\) may produce highly "concentrated" weights, i.e. solutions that depend on placing very large weights on a very small proportion of the controls.
Numerous metrics could be used to assess the concentration of the weights, including variance or entropy measures. For an easily interpretable metric, I use the quantity min90, which is the minimum number of control units that are required to account for 90% of the total weight among the controls. For example, if min90 = 20, 90% of the total weight of the controls comes from just the 20 most heavily weighted observations.

I propose choosing \(\sigma^2 = 2\dim(X)\) as a rule of thumb. The average squared Euclidean distance \(E[\lVert X_i - X_j \rVert^2]\) that enters into the kernel calculation scales with \(\dim(X)\). Choosing \(\sigma^2\) proportional to \(\dim(X)\) thus ensures a relatively sound scaling of the data, such that some observations appear to be closer together, some further apart, and some in between, regardless of \(\dim(X)\). The constant of proportionality, however, remains open to debate. Empirically, the choice of \(\sigma^2 = 2\dim(X)\) has offered very good performance, and so this is the default value of \(\sigma^2\), though clearly further work is needed to justify this choice.[21]

This rule-of-thumb approach is a useful starting point and is used in all simulated and empirical examples presented here. Results are not typically highly sensitive to the choice of \(\sigma^2\). Nevertheless, investigators may wish to present their results across a range of \(\sigma^2\) values to ensure that this holds in any particular example. Where results do vary across \(\sigma^2\) values, inspecting \(L_1\) and min90 can be helpful for determining an appropriate value.

[21] This is similar to the approach used in KRLS, where the default setting is \(\sigma^2 = 1\dim(X)\). However, KRLS is tolerant of a wide range of \(\sigma^2\) values because the smoothing parameter, \(\lambda\), is free to vary, and the two terms largely compensate for each other. Accordingly, KRLS at \(\sigma^2 = 2\dim(X)\) shows nearly identical performance to the original default of \(\dim(X)\), offering excellent power to detect highly nonlinear, nonadditive relationships even in small samples.
This provides some assurance: recall that kernel balancing achieves mean balance on all elements of \(\phi(X)\) for a given choice of kernel, and thus KRLS will be unable to detect any differences between treated and controls on data reweighted by kernel balancing. Since KRLS with the choice of \(\sigma^2 = 2\dim(X)\) is powerful in detecting a wide range of nonlinear, nonadditive functional forms, the guarantee that kernel balancing controls for all such confounding functions when the same \(\sigma^2\) is used makes this an attractive choice.

Optional Trimming of the Treated

In some cases, balance can be greatly improved with less variable (and thus more efficient) weights if the most difficult-to-match treated units are trimmed. In estimating an ATT, control units in areas with very low density of treated units can always be down-weighted (or dropped if the weight goes to zero), but treated units in areas unpopulated by control units pose a greater problem. These areas may prevent any suitable weighting solution, or may place extremely large (and thus inefficient) weights on a small set of controls.

While estimates drawn from samples in which the treated are trimmed no longer represent the ATT with respect to the original population, they can be considered a local or sample average treatment effect within the remaining population. King et al. (2011) refer similarly to a "feasible sample average treatment effect on the treated" (FSATT), based on only the treated units for which sufficiently close matches can be found. In any case, the discarded units can be characterized to learn how the inferential population has changed.

However, even when the investigator is willing to change the population of interest by trimming the treated, it is not always clear on what basis trimming should be done. In kernel balancing, trimming of the treated can be (optionally) employed by using the multivariate density interpretation given above.
Specifically, the density estimators at all points are constructed using the kernel matrix. Then, treated units are trimmed if \(f_{X|D=1}(X_i) / f_{X|D=0}(X_i)\) exceeds the parameter trimratio. The value of trimratio can be set by the investigator based on qualitative considerations, inspection of the typical ratio of densities, a willingness to trim up to a certain percent of the sample, or performance on \(L_1\). Whatever approach is taken to determine a suitable level of trimratio, kbal produces a list of the trimmed units, which the investigator can examine to determine how the inferential population has changed.

4.7 Empirical Examples

In this section, I apply kernel balancing to two empirical examples. The first is a standard benchmark in the literature. Following the example pioneered by LaLonde (1986) and Dehejia and Wahba (1999) and repeated in many other studies (e.g. Diamond and Sekhon, 2005; Iacus et al., 2012; Hainmueller, 2012), I reanalyze the impact of a job training intervention, the National Supported Work Demonstration Program (NSW). This is a difficult estimation problem, but one for which an experimental estimate is also available for comparison. Using default settings and no specification search, the treatment effect estimated by kernel balancing is within 0.7% of the experimental estimate, though the latter itself is estimated with uncertainty. The second example applies kernel balancing to reexamine whether democracies are less successful in fighting counterinsurgencies (Lyall, 2010). The results show that when high-order balance is achieved by using kernel balancing, democracies are over 25 percentage points less likely to win counterinsurgencies, consistent with theoretical expectations but in contrast to Lyall (2010).

Example 1: Job Training Benchmark

It is useful to know whether kernel balancing accurately recovers average treatment effects in observational data under conditions in which an approximately "true" answer is known.
This can be approximated using a method and dataset first used by LaLonde (1986) and Dehejia and Wahba (1999), and which has become a routine benchmark for new matching and weighting approaches (e.g. Diamond and Sekhon, 2005; Iacus et al., 2012; Hainmueller, 2012). The aim of these studies is to recover an experimental estimate of the effect of a job training program, the National Supported Work (NSW) program. Following LaLonde (1986), the treated sample from the experimental study is compared to a control sample drawn from a separate, observational sample. Methods of adjustment are tested to see if they accurately recover the treatment effect despite large observable differences between the control sample and the treated sample.[22]

Here I use 185 treated units from NSW, originally selected by Dehejia and Wahba (1999) for the treated sample. The experimental benchmark for this group of treated units is $1794, which is computed by difference-in-means in the original experimental data with these 185 treated units. The control sample is drawn from the Panel Study of Income Dynamics (PSID-1), containing N = 2490 controls. The pre-treatment covariates available for matching are age, years of education, real earnings in 1974, real earnings in 1975, and a series of indicator variables: Black, Hispanic, and married. Three further variables that are actually transforms of these are commonly used as well: indicators for being unemployed (having income of $0) in 1974 and 1975, and an indicator for having no high school degree (fewer than 12 years of education).

As found by Dehejia and Wahba (1999), propensity score matching can be effective in recovering reasonable estimates of the ATT, but these results are highly sensitive to specification choices in constructing the propensity score model (Smith and Todd, 2001). Diamond and Sekhon (2005) use genetic matching to estimate treatment effects with the same treated sample.
While matching solutions with the highest degree of balance produced estimates very close to the experimental benchmark, these models included the addition of squared terms and two-way interactions. Similarly, entropy balancing (Hainmueller, 2012) has also been shown to recover good estimates using a similar setup,[23] also employing all pairwise interactions and squared terms for continuous variables, amounting to 52 covariates.

Figure 4-4 reports results from a variety of estimation procedures and specifications. Three procedures are used: linear regression (OLS), Mahalanobis distance matching (match), and kernel balancing (kbal). For match and kbal, estimates are produced by simple difference in means on the matched/reweighted sample.[24] For each method, three sets of covariates are attempted: the standard set of 10 covariates described in the text, a reduced set (simple) including only the seven of these that are not transforms of other variables, and an expanded set (squares) including the 10 standard covariates plus squares of the three continuous variables.

Figure 4-4 shows that the OLS estimates vary widely by specification, and even the estimate closest to the benchmark ($1794) is incorrect by $1042. Mahalanobis distance matching performs better, though it remains somewhat specification dependent, with its best estimate (match-squares) falling within $387 of the benchmark. Finally, kernel balancing performs well over the three specifications. While there is some variation by specification, no estimate is more than $681 from the benchmark, and the standard specification, kbal, produces an estimate of $1807, within $13 of the benchmark.

From the kernel balancing solution, we can also see that balance is difficult to achieve in this example, in the sense that it requires focusing on a relatively small portion of the original control sample. Specifically, at the solution achieved by kernel balancing, min90 = 193, meaning that 90% of the total weight of the controls comes from 193 observations. While this is still a reasonable number, and similar to the size of the treatment group, it implies that approximately 90% of the control sample was not useful for comparison to the treated. This is appropriate, however, given the large differences between the treated and control samples. For example, while 72% of the treated are unemployed in either 1974 or 1975, only 12% of controls are unemployed in either year.

Figure 4-4: Estimating the Effect of a Job Training Program from Partially Observational Data
[Point estimates for OLS, OLS-simple, OLS-squares, match, match-simple, match-squares, kbal, kbal-simple, and kbal-squares, shown against the benchmark; x-axis: Effect of Training Program on Income ($).]
Reanalysis of Dehejia and Wahba (1999), estimating the effect of a job training program on income using a variety of estimation procedures. Three procedures are used: linear regression (OLS), Mahalanobis distance matching (match), and kernel balancing (kbal). For each, three sets of covariates are attempted: the standard set of 10 covariates described in the text, a reduced set (simple) including only the seven of these that are not transforms of other variables, and an expanded set (squares) including the 10 standard covariates plus squares of the three continuous variables. While OLS and match perform reasonably well, both are sensitive to specification. The best OLS estimate (OLS-simple) still under-estimates the $1794 benchmark by $1042, while the best matching estimate (match-squares) is off by $387. Kernel balancing performs reasonably well on all three specifications. While there is some variation by specification, no estimate is more than $681 from the benchmark, and the standard specification, kbal, produces an estimate of $1807, within $13 of the benchmark.

[22] See Diamond and Sekhon (2005) for an extensive description of this dataset, the debates around it, and the various subsets that have been drawn from it.

[23] Hainmueller (2012) uses the same treated group used here, but a different control dataset based on the Current Population Survey (CPS-1).

[24] Standard errors from matching are the Abadie-Imbens standard errors, though the correct standard errors for matching estimators remain a largely unsolved problem. Standard errors from kernel balancing are from weighted least squares with fixed weights, which are also incorrect, as they do not incorporate uncertainty in the choice of weights. Bootstrap or jackknife procedures to obtain standard error estimates for kernel balancing may be valid (in contrast to matching estimators). Alternatively, it may be possible to do the entire estimation in an Empirical Likelihood framework that will also allow for closed-form estimation of standard errors. Examining this remains an area for future work.

Example 2: Are Democracies Inferior Counterinsurgents?

Decades of research in international relations have argued that democracies are poor counterinsurgents (see Lyall, 2010 for a review). Democracies, as the argument goes, (1) are sensitive to public backlash against wars that get more costly in blood or treasure than originally expected, (2) are unable to control the media in order to suppress this backlash, and (3) often respect international prohibitions on brutal tactics that may be needed to obtain a quick victory. Each of these makes them more prone to withdrawal from counterinsurgency operations, which often become long and bloody wars of attrition.
Empirical work on this question was significantly advanced by Lyall (2010), who points out that previous work (1) often examined only democracies, rather than a universe of cases with variation on polity type, and (2) did little to overcome the non-random assignment of democracy, and in particular, the selection effects by which democracies may choose to fight different types of counterinsurgencies than non-democracies. Lyall (2010) overcomes these shortcomings by constructing a dataset covering the period of 1800-2005, in which the polity type of the counterinsurgent regimes varies. Matching is then used to adjust for observable differences between the conflicts selected by democracies and non-democracies, using one-to-one nearest neighbor matching on a series of covariates.[25]

In a battery of analyses with varying modeling approaches, Lyall (2010) finds that democracy, measured as a polity score of at least 7 in the specifications replicated here, has no relationship to success or failure in counterinsurgency, either in the raw data or in the matched sample. While the credibility of this estimate as a causal quantity depends on the absence of unobserved confounders, we can nevertheless assess whether the procedures used to adjust for observed covariates were sufficient, or whether an inability to achieve mean balance on some functions of the covariates may have led to bias even in the absence of unobserved confounders.

Here I reexamine these findings using the post-1945 portion of the data, which includes 35 counterinsurgencies by democracies and 100 by non-democracies, and is used in many of the analyses in Lyall (2010).[26] First, I assess balance. As shown in Figure 4-5, numerous covariates are badly imbalanced in the original dataset (circles), where imbalance is measured on the x-axis by the standardized difference in means. This balance improves somewhat under matching (diamonds), but improves far more under kernel balancing (squares).
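For readers who want to see the balance statistic concretely, the standardized difference in means used here can be sketched in Python as below. The function name `standardized_difference` is hypothetical, and scaling by the standard deviation of the treated group is an assumption of this sketch: the text specifies only that the difference in means is divided by a standard deviation.

```python
import numpy as np


def standardized_difference(x, treated, w=None):
    """Difference in means (treated minus controls) in standard deviation units.

    x       : length-N covariate
    treated : length-N boolean treatment indicator
    w       : optional control weights (normalized here); uniform if None
    Scaling by the treated-group SD is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    x_t, x_c = x[treated], x[~treated]
    if w is None:
        w = np.ones(x_c.shape[0])
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return (x_t.mean() - np.sum(w * x_c)) / x_t.std(ddof=1)


# Toy covariate: three treated units followed by four controls.
x = np.array([1.0, 2.0, 3.0, 0.0, 2.0, 4.0, 6.0])
treated = np.array([True, True, True, False, False, False, False])
raw = standardized_difference(x, treated)
reweighted = standardized_difference(x, treated, w=[0.25, 0.5, 0.25, 0.0])
```

The unweighted comparison corresponds to the original sample, while passing weights gives the same statistic for a reweighted control group.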
Note that imbalance is shown both on the variables used in the matching/weighting algorithms (the first ten covariates up to and including year), as well as several others that were not explicitly included in the balancing procedure: year\(^2\), and two multiplicative interactions that were particularly predictive of treatment status in the original data. Kernel balancing produces good balance on both the included covariates and functions of them.

[25] These covariates are: a dummy for whether the counterinsurgent is an occupier (occupier), a measure of support and sanctuary for insurgents from neighboring countries (support), a measure of state power (power), mechanization of the military (mechanized), elevation, distance from the state capital to the war zone, a dummy for whether a state is in the first two years of independence (new state), a cold war dummy, the number of languages spoken in the country, and the year in which the conflict began.

[26] The post-1945 period is the only one with complete data on the covariates used for balancing here, but is also the period in which the logic of democratic vulnerability is expected to be most relevant.

Figure 4-5: Balance: Democracies vs. Non-democracies and the Counterinsurgencies they Fight
[Rows: mechanization, support, occupier, power, elevation, distance, new state, coldwar, num.languages, year, year^2, cincXelev, occupierXcinc; legend: kbal, matched, orig; x-axis: standardized difference in means.]
Balance in post-1945 sample of Lyall (2010). Imbalance, measured as the difference in means divided by the standard deviation, is shown on the x-axis. Democracies (treated) and non-democracies (controls) vary widely on numerous covariates. The matched sample (diamonds) shows somewhat improved balance over the original sample, but imbalances remain on numerous characteristics. Balance is considerably improved by kernel balancing (squares).
The rows at or above year show imbalance on characteristics explicitly included in the balancing procedures. Those below year show imbalance on characteristics not explicitly included.

Next, I use the matched and weighted data to estimate the effect of democracy on counterinsurgency success. For this, I simply use linear probability models (LPM) to regress a dummy for victory (1) or defeat (0) on covariates according to five different specifications.[27] The first three specifications are as follows: (1) raw regresses the outcome directly on democracy without covariates (and is equivalent to difference-in-means); (2) orig uses the same covariates as Lyall (2010), which are all those variables balanced on except for year; (3) time reincludes year as well as year\(^2\) to flexibly model the effects of time. The final two models, occupier1 (4) and occupier2 (5), add flexibility by including interactions of occupier with other variables in the model.[28]

Figure 4-6 shows results for the matched and kernel balanced samples with 95% confidence intervals. Under matching, the effect varies considerably depending on the choice of model. No estimate is significantly different from zero, however. In stark contrast, kernel balancing produces estimates that are essentially invariant to the choice of model. Each kernel balancing estimate is between -0.26 and -0.27, indicating that democracy is associated with a 26 to 27 percentage point lower probability of success in fighting counterinsurgencies. This is a very large effect, both statistically and substantively, given that the overall success rate is only 33% in the post-1945 sample.

Figure 4-6: Effect of Democracy on Counterinsurgency Success
[Two panels, Match and Kernel Balance; rows: raw, orig, time, occupier1, occupier2; x-axis: Effect of Democracy on Pr(victory).]
Effect of democracy on counterinsurgency success in post-1945 sample of Lyall (2010) using matching or kernel balancing for pre-processing followed by five different estimation procedures. Under matching, effect estimates remain highly variable, but none are significantly different from zero. Kernel balancing shows remarkably stable estimates over the five estimation procedures, even when no covariates are included (raw). Results from kernel balancing are consistently in the -0.26 to -0.27 range and significantly different from zero, indicating that democracy is associated with a substantively large deficit in the ability to win counterinsurgencies.

[27] While Lyall (2010) used a number of other approaches, including logistic regression, some of these models suffer "separation" under the specifications attempted here. This causes observations and variables to effectively drop out of the analysis, producing variability in effect estimates that is due only to this artefact of logistic regression and not to any meaningful change in the relationship among the variables. Linear models do not suffer this problem, and provide a well defined approximation to the conditional expectation function, allowing valid estimation of the changing probability of victory associated with changes in the treatment variable, democracy.

[28] These interactions were chosen because analysis with KRLS revealed that interactions with occupier were particularly predictive of the outcome.

4.8 Conclusions

In the ongoing quest to reliably infer causal quantities from observational data, the first-order challenge often remains ensuring that there are no unobserved confounders in a given identification scenario. However, the problem of actually adjusting for differences in observed covariates to take advantage of conditional ignorability remains non-trivial. As shown here, even when conditional ignorability holds, matching and other weighting approaches only ensure unbiasedness under strict conditions. One sufficient condition is full multivariate balance. Absent this, unbiased estimation of the ATT requires that Y(0) (or all functions of \(X_i\) influencing Y(0)) has the same mean for the treated and controls.

Kernel balancing can be understood as a method of approximately achieving both of these conditions. First, by obtaining balance on the columns of the kernel matrix K, mean balance is also obtained on the much higher-dimensional set of features, \(\phi(X_i)\). Mean balance on these features implies mean balance on all the functions that are linear in these features. Equivalently, and more intuitively, these are the functions that can be formed by the superposition of Gaussians placed over each observation in the covariate space. The assumption that the systematic component of Y(0) is among these smooth functions is far more plausible than the assumption that it is linear in the original \(X_i\), even if the investigator is careful enough to include higher-order terms among these \(X_i\)'s. Moreover, as N grows large, Y(0) is increasingly well modelled within this space, while the space of functions linear in \(X_i\) does not grow with N.

Second, while existing methods are evaluated by univariate balance metrics, kernel balancing ensures that the entire multivariate densities of the treated and weighted controls are approximately equalized, as measured by a corresponding kernel smoother. This does not require coarsening the data into discrete bins, and because the method is global rather than local, it is more tolerant of high-dimensional data than methods such as CEM that require discrete binning of observations. Kernel balancing also generates pointwise estimates of the multivariate density for the treated and controls at each location in the dataset, and uses these to report an \(L_1\) measure of imbalance that is truly multivariate in nature but does not require coarsening of the data.

Two empirical examples illustrate the use of kernel balancing.
The first, a widely used benchmark, uses data from Dehejia and Wahba (1999) to test whether kernel balancing accurately recovers a known ATT estimate by using the experimental treatment group but control observations drawn from a separate, observational dataset. At its default values, with the covariates commonly used for this problem and no further specification choices, kernel balancing estimated an effect of $1807, extremely close to the experimental benchmark of $1794.

In a second empirical example, kernel balancing is used to obtain higher-order balance in the comparison of counterinsurgency success for democracies and non-democracies (Lyall, 2010). While theory and prior research have argued that democracies are inferior counterinsurgents, Lyall (2010) finds otherwise using a novel dataset and matching to ensure comparability of the counterinsurgencies fought by democracies and non-democracies. Reexamining the post-1945 period and using the same covariates, kernel balancing proves far more effective in obtaining balance, both on the covariates directly included in the balancing procedures and on functions of these variables. Using five different models to estimate the effect of democracy on the adjusted datasets, estimates from the kernel balanced data all indicate that democracies were 26 to 27 percentage points less likely to win counterinsurgencies over this period than non-democracies on comparable cases. These effects are statistically significant, but also substantively large, especially given the overall success rate of just 33%.

Nevertheless, additional questions and challenges remain for future work. First, it will be useful to better understand the asymptotic properties of this procedure, and in particular the rate of decline in bias as a function of N. Second, K has dimensionality N x N, which becomes unwieldy as N grows large, posing a practical limit of tens of thousands of observations.
Third, obtaining correct confidence intervals for estimates based on weighted samples, either through resampling or a closed-form solution, will be an important and useful advance, particularly since standard errors remain poorly understood for matching techniques. Finally, improving the method for selecting \(\sigma^2\) beyond the rule-of-thumb approach proposed here would be very useful as well.

Chapter 5

Appendices

5.1 Appendix for Kernel Regularized Least Squares

Figure A.1: Fitting a Simple Function with KRLS
[Left panel: Unscaled Gaussians. Right panel: Scaled Gaussians and KRLS Fit (Superposition). x-axis: x.]
Note: Left Panel: Unscaled Gaussians placed over each of the four data points. Right Panel: Gaussians scaled by the choice coefficients obtained from KRLS. The choice coefficients for the data points (from left to right) are c = [-3.06, 2.68, -1.12, 0.97].

Figure A.2: Example of High and Low Frequency Functions
[x-axis: x, from 0.0 to 1.0.]
Note: The solid line represents a "good" explanation of the relationship between x and y. The dashed line represents a "bad" one, which is both considered more likely to be noise and is also much less useful in a theoretical way. For most social science inquiry, we are interested in recovering conditional expectation functions that look like the solid, low-frequency line, not the dashed, high-frequency line.

Figure A.3: KRLS Fits Non-Linear Functions and their Derivatives
[Panels: OLS, N = 20; KRLS, N = 20; OLS, N = 100; KRLS, N = 100. Legend: f(y) = 100 + 3x^4, f(y) fit, dy/dx = 12x^3, dy/dx fit. x-axis: x, from -4 to 4.]
Note: Simulation to recover the non-linear function \(y = 100 + 3x^4\) (black solid line) and its derivative \(\partial y / \partial x = 12x^3\) (gray dashed line). The sample sizes are 20, 50, and 100, \(X \sim \mathrm{Unif}(-4, 4)\), and observed outcomes are simulated as \(y = 100 + 3x^4 + \epsilon\) where \(\epsilon \sim N(0, 1)\).
In the right figures the black dots show the fitted values for \(\hat{y}\) and the grey triangles show the fitted values for \(\widehat{\partial y / \partial x}\) from the KRLS estimator (average across 500 simulations). The left figures show the corresponding estimates from the OLS estimator.

Figure A.4: KRLS Approximates Complex Interactions: One Hill, One Valley
[Panels: True f(x1,x2); KRLS fit f(x1,x2); OLS fit f(x1,x2); GAM fit f(x1,x2).]
Note: Simulation to recover target function given by y = e--+(x) 2 (1-X2) 2 _ e-5(1-X2)+(x1)2 using simulations with 200 observations drawn from \(x_1, x_2 \sim \mathrm{Unif}(0, 1)\) and random noise \(\epsilon \sim N(0, .25)\). The top right figure shows the true target function. The top left, bottom right, and bottom left figures show the fitted functions from the KRLS, OLS, and GAM estimators respectively.

Figure A.5: KRLS Approximates Complex Interactions: Two Hills, Two Valleys
[Panels: True f(x1,x2); KRLS fit f(x1,x2); OLS fit f(x1,x2); GAM fit f(x1,x2).]
Note: Simulation to recover target function given by \(y = e^{-5(1-x_1)^2 + (x_2)^2} + e^{-5(x_1)^2 + (1-x_2)^2}\) using simulations with 200 observations drawn from \(x_1, x_2 \sim \mathrm{Unif}(0, 1)\) and random noise \(\epsilon \sim N(0, .25)\). The top right figure shows the true target function. The top left, bottom right, and bottom left figures show the fitted functions from the KRLS, OLS, and GAM estimators respectively.

Figure A.6: KRLS Approximates Complex Interactions: Three Hills, Three Valleys
[Panels: True f(x1,x2); KRLS fit f(x1,x2); OLS fit f(x1,x2); GAM fit f(x1,x2).]
Note: Simulation to recover target function given by \(y = \sin(x_1) \cdot \cos(x_2)\) using simulations with 200 observations drawn from \(x_1, x_2 \sim \mathrm{Unif}(0, 2\pi)\) and random noise \(\epsilon \sim N(0, .25)\). The top right figure shows the true target function. The top left, bottom right, and bottom left figures show the fitted functions from the KRLS, OLS, and GAM estimators respectively.
Figure A.7: The marginal effect of temporally proximate presidential elections on the effective number of electoral parties
[Figure: top panel reproduced from Brambor et al. (2006), showing the marginal effect with a 95% confidence interval against "Effective Number of Presidential Candidates"; bottom panel showing pointwise marginal effects against "Effective Presidential Candidates" (0 to 6).]
Note: Top Panel: Figure 3 from Brambor et al. (2006). More temporally proximate presidential and legislative elections lead to fewer effective electoral parties. However, this is true only when there are relatively few presidential candidates, and the effect vanishes when there are large numbers of presidential candidates. Bottom Panel: Scatterplot of pointwise marginal effects of temporal proximity on number of parties ($\partial \text{Parties} / \partial \text{Proximity}$), with lowess estimates superimposed. The plot looks similar to the Brambor et al. (2006) model only when there are 3 or more presidential candidates. By contrast, at zero presidential candidates (which represents 62% of the observations included in the Brambor et al. regression), the marginal effect estimates come back towards zero.

Figure A.8: OLS Results for Brambor et al. Split at Two Presidential Candidates
[Figure: two panels with horizontal axis "Presidential Candidates", the first running from 0.0 to 2.0 and the second from 2 to 6.]
Note: Results from OLS models identical to those in the previous figure, but split at observations with two or fewer Presidential Candidates and those with more than two. KRLS estimates differed from the original Brambor et al. (2006) result (Figure A.7), suggesting that $\partial \text{Parties} / \partial \text{Proximity}$ takes values near zero when PresidentialCandidates is zero (indicating no "coat-tail effect" there), and if anything decreases as PresidentialCandidates rises to two, then reverses direction and follows the pattern suggested by Brambor et al. (2006) thereafter.
Here we split the sample and conduct OLS analyses separately when PresidentialCandidates $\leq 2$ and when PresidentialCandidates $> 2$. As shown, the OLS results from the split samples reflect the KRLS result.

5.2 Appendix for Kernel Balancing

Proof of proposition 2

Proposition 2 states that the estimator $\widehat{ATT} = \frac{1}{N_t}\sum_{i:D=1} y_i - \sum_{i:D=0} w_i y_i$ is unbiased for ATT if both Assumption 7 (conditional ignorability) and Assumption 8 (multivariate balance) hold. The proof is as follows. Under Assumption 8, $f_{w,X|D=0} = f_{X|D=1}$. Then,

$$\text{bias} = E[y(0) \mid D=1] - E\Big[\sum_{i:D=0} w_i y_i\Big] = \int y(0)\, f_{X|D=1}(x)\,dX - \int y(0)\, f_{w,X|D=0}(x)\,dX = 0 \tag{A.1}$$

Proof of proposition 4

Proposition 4 states that for the mean row of $K$ among the treated, $k_t = \frac{1}{N_t} K_t 1_{N_t}$, and the weighted mean row of $K$ among the controls given by $k_c(w) = K_c w$, if $k_t = k_c(w)$, then $\bar{\phi}_t = \bar{\phi}_c(w)$, where $\bar{\phi}_t = \frac{1}{N_t}\sum_{D_i=1} \phi(x_i)$ and $\bar{\phi}_c(w) = \sum_{D_i=0} w_i \phi(x_i)$. This can be shown as follows, writing $k_i$ for row $i$ of $K$:

$$k_t = \sum_{i:D=0} w_i k_i \tag{A.2}$$

$$\Big[\tfrac{1}{N_t}\sum_{i:D=1} k(x_i, x_1), \ldots, \tfrac{1}{N_t}\sum_{i:D=1} k(x_i, x_N)\Big] = \Big[\sum_{i:D=0} w_i k(x_i, x_1), \ldots, \sum_{i:D=0} w_i k(x_i, x_N)\Big]$$

$$\tfrac{1}{N_t}\sum_{i:D=1} \langle \phi(x_i), \phi(x_j) \rangle = \sum_{i:D=0} w_i \langle \phi(x_i), \phi(x_j) \rangle \quad \forall j$$

$$\Big\langle \tfrac{1}{N_t}\sum_{i:D=1} \phi(x_i),\, \phi(x_j) \Big\rangle = \Big\langle \sum_{i:D=0} w_i \phi(x_i),\, \phi(x_j) \Big\rangle \tag{A.3}$$

$$\langle \bar{\phi}_t,\, \phi(x_j) \rangle = \Big\langle \sum_{i:D=0} w_i \phi(x_i),\, \phi(x_j) \Big\rangle \tag{A.4}$$

$$\bar{\phi}_t = \sum_{i:D=0} w_i \phi(x_i) \tag{A.5}$$

Remarks

An intuitive interpretation of equation A.3 is that each unit $j$ is as close to the average treated unit as it is to the (weighted) average control unit, where distance is measured in the feature space $\phi(X)$.¹ Relatedly, a method of classifying observations as treated or control based on whether they are closer to the centroid of the treated or the centroid of the controls in $\phi(X)$ would be unable to classify any point.

Proof of proposition 5

Proposition 5 states that for a density estimator for the treated, $f_{X|D=1}$, and for the (weighted) controls, $f_{X|D=0,w}$, both constructed with kernel $k$ of bandwidth $\sigma$, the choice of weights that ensures mean balance in the kernel matrix $K$ also ensures $f_{X|D=1} = f_{X|D=0,w}$ at every location in $X$ at which an observation is located.
As detailed in the main text, the expression $\frac{1}{N_t} K_t 1_{N_t}$ places a multivariate standard normal density over each treated observation, sums these to construct a smooth density estimator at all points in $X$, and evaluates the height of that joint density estimate at each of the points found in the dataset. Likewise, $\frac{1}{N_c} K_c 1_{N_c}$ estimates the density of the control units and returns its evaluated height at every datapoint in the dataset. To reweight the controls would be to say that some units originally observed should be made more or less likely. This is achieved by changing the numerator of each weight $\frac{1}{N_c}$ to some non-negative value other than 1. Letting the weights sum to 1 (rather than $N_c$), the reweighted density of the controls would be evaluated at each point in the dataset according to $K_c w$, for vector of weights $w$. If weights are selected so that this equals the density of the treated:

$$\tfrac{1}{N_t} K_t 1_{N_t} = K_c w$$

$$k_t = K_c w$$

$$k_t = k_c(w) \tag{A.6}$$

where the final line is the definition of mean balance in $K$. Thus, the weights that achieve mean balance in $K$ are precisely the right weights to achieve equivalence of the measured multivariate densities for the treated and controls at all points in the dataset.

¹ For the Gaussian kernel, $\langle \phi(x_i), \phi(x_j) \rangle$ is naturally interpretable as a similarity measure in the input space, since this quantity equals $k(x_j, x_i) = e^{-\|x_j - x_i\|^2 / \sigma^2}$. However, $\langle \phi(x_i), \phi(x_j) \rangle$ or $k(x_i, x_j)$ is more generally interpretable as similarity in the feature space as well. Note the squared Euclidean distance between two points $x_i$ and $x_j$ after mapping into $\phi(\cdot)$ is: $\|\phi(x_i) - \phi(x_j)\|^2 = \langle \phi(x_i) - \phi(x_j),\, \phi(x_i) - \phi(x_j) \rangle = \langle \phi(x_i), \phi(x_i) \rangle + \langle \phi(x_j), \phi(x_j) \rangle - 2\langle \phi(x_i), \phi(x_j) \rangle$. In the case of the Gaussian kernel, $\langle \phi(x_i), \phi(x_i) \rangle = 1$, so this distance reduces to $2(1 - \langle \phi(x_i), \phi(x_j) \rangle)$. In this sense, $\langle \phi(x_i), \phi(x_j) \rangle$ is a reasonable measure of similarity of position in the feature space, as it runs opposite to distance in this space.
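The link between mean balance in $K$ and equality of the pointwise density estimates can be checked numerically. The sketch below is illustrative only and rests on several assumptions: it uses deterministic toy data, a hand-picked bandwidth `sigma2`, balances only the leading `numdims` principal components of $K$, and obtains the weights by exponential tilting with damped Newton steps. That solver and the helper name `tilting_weights` are stand-ins chosen for the example and are not claimed to be the routine used inside kbal.

```python
import numpy as np

def gaussian_kernel(A, B, sigma2):
    """Matrix of k(a, b) = exp(-||a - b||^2 / sigma2) for all rows a of A and b of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / sigma2)

def tilting_weights(Z_c, target, iters=200, tol=1e-10):
    """Weights w_i proportional to exp(Z_c[i] @ theta), summing to 1, whose weighted
    mean of Z_c matches `target`. Hypothetical helper: damped Newton steps on the
    convex dual log(sum_i exp(Z_c[i] @ theta)) - theta @ target."""
    def dual(theta):
        s = Z_c @ theta
        return s.max() + np.log(np.exp(s - s.max()).sum()) - theta @ target

    def weights(theta):
        s = Z_c @ theta
        w = np.exp(s - s.max())
        return w / w.sum()

    theta = np.zeros(Z_c.shape[1])
    for _ in range(iters):
        w = weights(theta)
        grad = Z_c.T @ w - target                  # remaining imbalance
        if np.abs(grad).max() < tol:
            break
        centered = Z_c - Z_c.T @ w
        hess = (centered * w[:, None]).T @ centered
        step = np.linalg.solve(hess + 1e-12 * np.eye(len(theta)), grad)
        t, base = 1.0, dual(theta)
        while dual(theta - t * step) > base - 1e-4 * t * (grad @ step) and t > 1e-8:
            t *= 0.5                               # backtrack until the dual decreases
        theta = theta - t * step
    return weights(theta)

# Toy data: controls spread over [-3, 3], treated concentrated on [0, 2].
x_c = np.linspace(-3, 3, 61)
x_t = np.linspace(0, 2, 21)
X = np.concatenate([x_t, x_c])[:, None]
treated = np.arange(len(X)) < len(x_t)

sigma2 = 2.0                                       # assumed bandwidth
K = gaussian_kernel(X, X, sigma2)
k_t = K[treated].mean(axis=0)                      # mean row of K among the treated
k_c_unweighted = K[~treated].mean(axis=0)          # controls with weights 1/N_c

numdims = 5                                        # assumed number of components
eigval, eigvec = np.linalg.eigh(K)
Z = eigvec[:, -numdims:] * eigval[-numdims:]       # scores, equal to K @ eigvec[:, -numdims:]
w = tilting_weights(Z[~treated], Z[treated].mean(axis=0))
k_c_weighted = w @ K[~treated]                     # weighted mean row of K among controls

imbalance_before = np.abs(k_t - k_c_unweighted).max()
imbalance_after = np.abs(k_t - k_c_weighted).max()
```

Each entry of `k_t` and `k_c_weighted` is, up to a normalizing constant, a kernel density estimate evaluated at one observation, so comparing `imbalance_before` with `imbalance_after` shows how far the reweighted control density has moved toward the treated density at the observed points.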
Bias in ATT when mean balance on X is achieved

As discussed in the text, weighting estimates of the ATT are unbiased under conditional ignorability when all functions of $X$ influencing $y(0)$, or $y(0)$ itself, are mean balanced. Because mean balance on $X$ implies mean balance on all linear functions of $X$, this generalizes to the statement that conditional ignorability and mean balance in $X$ are sufficient for unbiasedness when $y(0)$ is linear in $X$. Here I provide proof of this statement while also deriving the bias that obtains when linearity does not hold.

Let $y(0)_i = g(x_i) + \eta_i = x_i^T\beta + h(x_i) + \eta_i$, where $x_i^T\beta$ is the linear component of $y(0)$, and $h(x)$ includes only the remaining nonlinear components. $x$ is a vector-valued random variable containing the pre-treatment covariates to be balanced, and all desired transforms of them. The weighted mean of the control outcomes is given by $\sum_{i:D=0} w_i y_i$. For fixed choice of $x$ and $w$, the expectation of this estimator is given by:

$$E\Big[\sum_{i:D=0} w_i y_i\Big] = \sum_{i:D=0} w_i E[y(0)_i] = \sum_{i:D=0} w_i \big[x_i^T\beta + h(x_i)\big]$$

$$= \Big(\sum_{i:D=0} w_i x_i^T\Big)\beta + \sum_{i:D=0} w_i h(x_i) \tag{A.7}$$

$$= \Big(\frac{1}{N_t}\sum_{i:D=1} x_i^T\Big)\beta + \sum_{i:D=0} w_i h(x_i) \tag{A.8}$$

$$= \frac{1}{N_t}\sum_{i:D=1} E\big[y(0)_i - h(x_i)\big] + \sum_{i:D=0} w_i h(x_i)$$

$$= E[y(0) \mid D=1] + \sum_{i:D=0} w_i h(x_i) - \frac{1}{N_t}\sum_{i:D=1} h(x_i)$$

where the substitution from A.7 to A.8 is due to mean balance. The resulting bias of the weighted control mean as an estimate of $E[y(0) \mid D=1]$ is thus:

$$\text{bias} = \sum_{i:D=0} w_i h(x_i) - \frac{1}{N_t}\sum_{i:D=1} h(x_i)$$

This equals zero only when (1) $y(0)$ is linear in $x$ and thus $h(x) = 0$, $\forall x \in X$, or (2) the mean of $h(x)$ among treated and (weighted) controls happens to be equal. Accordingly, note that the degree of bias worsens as $h(x)$ has larger variance and as its correlation with $D$ increases. The results are thus similar to those obtained for omitted variables in linear models, where $h(x)$ is omitted in this case.

Density Equalization Illustration

This example visualizes the density estimates produced internally by kernel balancing using linear combinations of $K$ as described above. Suppose $X$ contains 200 observations from a standard normal distribution.
Units are assigned to treatment with probability $1/(1 + \exp(2 - 2X))$, which produces approximately 2 control units for each treated unit. Figure A.9 shows the resulting density plots, using density estimates provided by kbal in which the density of the treated is given by $\frac{1}{N_t} K_t 1_{N_t}$ and the density of the controls is given by $\frac{1}{N_c} K_c 1_{N_c}$. As shown, the density estimate for the treated at each observation's $X$ position (black squares) is initially very different from the density estimate for the controls taken at each observation (black circles). After weighting, however, the new density of the controls as measured at each observation (red x) matches that of the treated almost exactly. Note that in multidimensional examples, the density becomes more difficult to visualize across each dimension, but it is still straightforward to compute and to think about the pointwise density estimates for the treated or control as measured at each observation's $X$ value. In contrast to binning approaches such as CEM, equalizing density functions continuously in this way avoids difficult or arbitrary binning decisions, is tolerant of high dimensional data, and smoothly matches the densities in a continuous fashion, resolving the within-bin discrepancies implied by CEM.

$L_1$, imbalance, and numdims

Recall that kernel balancing does not directly achieve mean balance on $K$, but rather on the first numdims factors of $K$ as determined by principal components analysis. This example examines the efficacy of this approach in minimizing the $L_1$ loss, and in minimizing imbalance on an unknown function of the data. Suppose we have 500 observations and 5 covariates, each with a standard normal distribution. Let $z = \sqrt{x_1^2 + x_2^2}$. This function impacts treatment assignment, with the probability of treatment being given by $\text{logit}^{-1}(z - 2)$, which produces approximately two control units for each treated unit.
In Figure A.10, the value of numdims (the number of factors of $K$ retained for purposes of balancing) is increased from a minimum of 2 up to 100. As expected, both $L_1$ and the mean imbalance on $z$ taken after weighting improve as numdims is first increased, and then worsen beyond some choice of numdims. Most importantly, while the balance on $z$ is unobservable in the case of unknown confounders, $L_1$ is observable, and improvements in $L_1$ track very closely to improvements in the balance of $z$. Accordingly, selecting numdims to minimize $L_1$ appears to be a viable strategy for selecting the value that also minimizes imbalance on unseen functions of the data.

Figure A.9: Density-Equalizing Property of Kernel Balancing
[Figure: pointwise density estimates plotted against X from -3 to 3; legend entries "Control", "Treated", "Weighted Controls".]
Plot showing the density-equalization property of kernel balancing. For 200 observations of $X \sim N(0, 1)$, treatment is assigned according to $\Pr(\text{treatment}) = 1/(1 + \exp(2 - 2X))$, producing approximately two control units for each treated unit. Black squares indicate the density of the treated, as evaluated at each observation's location in the dataset (and given the choice of kernel and $\sigma^2$). Black circles indicate the density of (unweighted) controls. The treated and control are seen to be drawn from different distributions, owing to the treatment assignment process. Red x's show the new density of the controls, after weighting by kbal. The reweighted density is nearly indistinguishable from the density of the treated, owing to the density equalization property of kernel balancing.

Figure A.10: $L_1$ distance and imbalance on an unknown confounder, by numdims
[Figure: curves labeled "L1" and "Imbal on z" plotted against "numdims of K included", from 0 to 100.]
This example shows the relationship between the number of components of $K$ that get balanced upon (numdims), the multivariate imbalance ($L_1$), and balance on confounder $z$.
$L_1$ generally improves as numdims is increased at first, but beyond approximately 50 dimensions, numerical instability produces less desirable results and a higher $L_1$ imbalance. While the confounder represented by $z$ in this case would generally be unobservable, balance on $z$ is optimized where $L_1$ finds its minimum, which is observable.

Bibliography

Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74(1), 235-267.
Abadie, A. and G. W. Imbens (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics 29(1).
Akresh, R. and D. De Walque (2008). Armed conflict and schooling: Evidence from the 1994 Rwandan genocide. World Bank Policy Research Working Paper Series.
Bateson, R. (2012). Crime victimization and political participation. American Political Science Review 106.
Beber, B., P. Roessler, and A. Scacco (2012). Who supports partition? Violence and political attitudes in a dividing Sudan.
Becchetti, L., P. Conzo, and A. Romeo (2011). Violence, social capital and economic development: Evidence of a microeconomic vicious circle. ECINEQ Working Paper Series.
Beck, N., G. King, and L. Zeng (2000). Improving quantitative studies of international conflict: A conjecture. American Political Science Review 94, 21-36.
Bellows, J. and E. Miguel (2009). War and local collective action in Sierra Leone. Journal of Public Economics 93(11), 1144-1157.
Blattman, C. (2009). From violence to voting: War and political participation in Uganda. American Political Science Review 103(02), 231-247.
Blattman, C. and J. Annan (2010). The consequences of child soldiering. The Review of Economics and Statistics 92(4), 882-898.
Brambor, T., W. Clark, and M. Golder (2006). Understanding interaction models: Improving empirical analyses. Political Analysis 14(1), 63-82.
Cassar, A., P. Grosjean, and S. Whitt (2012).
Social cooperation and the problem of the conflict gap: Survey and experimental evidence from post-war Tajikistan.
Choi, J.-K. and S. Bowles (2007). The coevolution of parochial altruism and war. Science 318, 636-640.
Christia, F. (2012). Alliance Formation in Civil Wars. Cambridge University Press.
Colaresi, M. and S. Carey (2008). To kill or to protect. Journal of Conflict Resolution 52(1), 39-67.
De Vito, E., A. Caponnetto, and L. Rosasco (2005). Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics 5(1), 59-85.
de Waal, A., C. Hazlett, C. Davenport, and J. Kennedy (2014). The epidemiology of lethal violence in Darfur: Using micro-data to explore complex patterns of ongoing armed conflict. Social Science & Medicine.
Degomme, O. and D. Guha-Sapir (2010). Patterns of mortality rates in Darfur conflict. The Lancet 375(9711), 294-300.
Dehejia, R. H. and S. Wahba (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94(448), 1053-1062.
Diamond, A. and J. S. Sekhon (2005). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics (0).
Doyle, M. W. and N. Sambanis (2000). International peacebuilding: A theoretical and quantitative analysis. American Political Science Review, 779-801.
Evgeniou, T., M. Pontil, and T. Poggio (2000). Regularization networks and support vector machines. Advances in Computational Mathematics 13(1), 1-50.
Fearon, J. D. and D. D. Laitin (2000). Violence and the social construction of ethnic identity. International Organization 54(4), 845-877.
Flint, J. and A. de Waal (2008). Darfur: A new history of a long war. Zed Books.
Fortna, V. P. (2004). Does peacekeeping keep peace? International intervention and the duration of peace after civil war.
International Studies Quarterly 48(2), 269-292.
Friedrich, R. J. (1982). In defense of multiplicative terms in multiple regression equations. American Journal of Political Science 26(4), 797-833.
Gilligan, M., B. Pasquale, and C. Samii (2011). Civil war and social capital: Behavioral-game evidence from Nepal.
Golub, G. H., M. Heath, and G. Wahba (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2), 215-223.
Guha-Sapir, D. and O. Degomme (2005). Darfur: Counting the deaths. Report, Center for Research on the Epidemiology of Disasters 26.
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis 20(1), 25-46.
Hainmueller, J. and C. Hazlett (2013). Kernel regularized least squares: Reducing misspecification bias with a flexible and interpretable machine learning approach. Political Analysis, mpt019.
Harff, B. (2003). No lessons learned from the Holocaust? Assessing risks of genocide and political mass murder since 1955. American Political Science Review 97(1), 57-73.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The elements of statistical learning: Data mining, inference, and prediction (Second ed.). Springer.
Hastrup, A. (2013). The War in Darfur: Reclaiming Sudanese History. Routledge.
Iacus, S. M., G. King, and G. Porro (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis 20(1), 1-24.
Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1), 243-263.
Imbens, G. (2003). Sensitivity to exogeneity assumptions in program evaluation. The American Economic Review 93(2), 126-132.
Jackson, J. E. (1991). Estimation of models with variable coefficients. Political Analysis 3(1), 27-49.
Kahneman, D. (2011). Thinking, fast and slow. Farrar Straus & Giroux.
Kalyvas, S.
(2006). The logic of violence in civil war. Cambridge Univ Press.
Kimeldorf, G. and G. Wahba (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics 41(2), 495-502.
King, G., R. Nielsen, C. Coberley, J. E. Pope, and A. Wells (2011). Comparative effectiveness of matching methods for causal inference. Unpublished manuscript 15.
King, G. and L. Zeng (2006). The Dangers of Extreme Counterfactuals. Political Analysis 14(2), 131-159.
Kocher, M. A., T. B. Pepinsky, and S. N. Kalyvas (2011). Aerial bombing and counterinsurgency in the Vietnam war. American Journal of Political Science 55(2), 201-218.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 604-620.
Lyall, J. (2009). Does indiscriminate violence incite insurgent attacks? Journal of Conflict Resolution 53(3), 331-362.
Lyall, J. (2010). Do democracies make inferior counterinsurgents? Reassessing democracy's impact on war outcomes and duration. International Organization 64(01), 167-192.
Lyall, J., K. Imai, and G. Blair (2013). Explaining support for combatants during wartime: A survey experiment in Afghanistan. American Political Science Review.
Nisbett, R. and D. Cohen (1996). Culture of honor: The psychology of violence in the South. Westview Press.
Nunn, N. and L. Wantchekon (2009). The slave trade and the origins of mistrust in Africa. American Economic Review.
Pham, P., P. Vinck, and E. Stover (2009). Returning home: Forced conscription, reintegration, and mental health status of former abductees of the Lord's Resistance Army in northern Uganda. BMC Psychiatry 9(1), 23.
Pham, P., H. Weinstein, and T. Longman (2004). Trauma and PTSD symptoms in Rwanda. JAMA: The Journal of the American Medical Association 292(5), 602-612.
Ratkovic, M. (2012). Identifying the largest balanced subset of the data under general treatment regimes.
Technical report, Working Paper. Available at http://www.princeton.edu/ ratkovic/SVMMatch.pdf.
Rifkin, R., G. Yeo, and T. Poggio (2003). Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences 190, 131-154.
Rifkin, R. M. and R. A. Lippert (2007). Notes on regularized least squares. Technical report, MIT Computer Science and Artificial Intelligence Laboratory Technical Report.
Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41-55.
Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 159-183.
Saunders, C., A. Gammerman, and V. Vovk (1998). Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning, Volume 19980, pp. 515-521. San Francisco, CA, USA: Morgan Kaufmann.
Schölkopf, B. and A. Smola (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press.
Smith, J. A. and P. E. Todd (2001). Reconciling conflicting evidence on the performance of propensity-score matching methods. The American Economic Review 91(2), 112-118.
Tedeschi, R. G. and L. G. Calhoun (2004). Posttraumatic growth: Conceptual foundations and empirical evidence. Psychological Inquiry 15(1).
Tedeschi, R. G., C. L. Park, and L. G. Calhoun (1998). Posttraumatic growth: Positive changes in the aftermath of crisis. Routledge.
Tychonoff, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Doklady Akademii Nauk SSSR 151, 501-504. Translated in Soviet Mathematics 4: 1035-1038.
Valentino, B., P. Huth, and D. Balch-Lindsay (2004). Draining the sea: Mass killing and guerrilla warfare. International Organization 58(02), 375-407.
Vinck, P., P. Pham, E. Stover, and H. Weinstein (2007). Exposure to war crimes and implications for peace building in northern Uganda. JAMA 298(5), 543-554.
Voors, M., E. Nillesen, P. Verwimp, E. Bulte, R. Lensink, and D. van Soest (2011). Violent conflict and behavior: A field experiment in Burundi. American Economic Review.
Walter, B. F. (2004). Does conflict beget conflict? Explaining recurring civil war. Journal of Peace Research 41(3), 371-388.
Wilkinson, S. I. (2006). Votes and violence: Electoral competition and ethnic riots in India. Cambridge University Press.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(1), 95-114.