Inference in Tough Places: Essays on Modeling
and Matching with Applications to Civil Conflict
Chad Hazlett
M.S., Duke University (2002)
M.P.P., Harvard Kennedy School (2006)
Submitted to the Department of Political Science
in partial fulfillment of the requirements for the degree of
Doctorate in Political Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014
© Chad Hazlett, MMXIV. All rights reserved.
The author hereby grants to MIT permission to reproduce and to
distribute publicly paper and electronic copies of this thesis document
in whole or in part in any medium now known or hereafter created.
Author: Signature redacted
Department of Political Science
May 5, 2014

Certified by: Signature redacted
Jens Hainmueller
Associate Professor
Thesis Supervisor

Accepted by: Signature redacted
Roger Petersen
Arthur and Ruth Sloan Professor of Political Science
Chairman, Graduate Program Committee
Inference in Tough Places: Essays on Modeling and
Matching with Applications to Civil Conflict
by
Chad Hazlett
Submitted to the Department of Political Science
on May 5, 2014, in partial fulfillment of the
requirements for the degree of
Doctorate in Political Science
ABSTRACT
This dissertation focuses on the challenges of making inferences from observational data in the social sciences, with particular application to situations of violent conflict. The first essay utilizes quasi-experimental conditions to examine the effects of violence against civilians in Darfur, Sudan
on attitudes towards peace and reconciliation. The second and third essays both address a common but overlooked challenge to making inferences
from observational data: even when unobserved confounding can be ruled
out, correctly "conditioning on" or "adjusting for" covariates remains a
challenge. In all but the simplest cases, existing methods ensure unbiased
estimation only when the investigator can correctly specify the functional
relationship between covariates and the outcome. The second essay (with
Jens Hainmueller) introduces Kernel Regularized Least Squares (KRLS),
a flexible modeling approach that provides investigators with a powerful
tool to estimate marginal effects, without linearity or additivity assumptions, and at low risk of misspecification bias. The third essay introduces
Kernel Balancing (KBAL), a weighting method that mitigates the risk of
misspecification bias by establishing high-order balance between treated
and control samples without balance testing or a specification search.
Thesis Supervisor: Jens Hainmueller
Title: Associate Professor
Acknowledgments
I owe an enormous debt of gratitude to many advisors and supporters, formal and
informal, throughout my time at MIT. First, I am extremely fortunate that each
member of my thesis committee has been supportive and responsive beyond reasonable expectation. From the first day I arrived in Cambridge, Fotini Christia provided
unparalleled direction and encouragement. She was the first to get me involved in
her research, and I have gained a great deal from this involvement. More than a few
times, a timely phone call with her provided much needed advice and support. I look
forward to continued collaboration in the coming years. Adam Berinsky provided invaluable strategic advice at every stage of my graduate career, from choosing a thesis
topic to negotiating while on the market. Teppei Yamamoto provided essential technical feedback, especially on the Kernel Balancing project (Chapter 4). Moreover, his
confidence that I could (and should!) develop a solo-authored methods piece prior to
going on the job market proved to be essential. Finally, nobody has had a greater
impact on my intellectual development than Jens Hainmueller. The fact that I arrived at MIT at just the right time to work with Jens altered entirely my experience
and quality of training. Jens provided the backbone for the methods training I rely
upon, and writing the Kernel Regularized Least Squares paper (Chapter 3) with him
was among the most important and rewarding experiences of my time at MIT. From
Jens, I learned just how good a teacher can be, how it can transform both individual
students and a department, and the enormous amounts of time and effort required
to achieve these outcomes. I hope to become the kind of teacher to my students that
Jens has been to his.
Numerous faculty outside of my committee have also been extremely supportive
and helpful throughout my time at MIT. Danny Hidalgo has been a frequent source of
advice and feedback, and I learned a great deal as teaching assistant for his class with
Teppei Yamamoto. Rich Nielsen and Vipin Narang also provided frequent advice at
levels ranging from the technical to the strategic. Kosuke Imai has been kind enough
to host me as a pre-doctoral fellow at Princeton in this final year. I would also like
to thank my fellow students, and especially those in my cohort - Chris Clary, Jeremy
Ferwerda, Yue Hou, David Hyun-Saeng, Nicholas Miller, and Krista Loose - whose
friendship and academic support made my time at MIT much more pleasant.
My parents, Robert and Nedra, provided their ceaseless support and unconditional
confidence in my abilities. Finally, my wife Trish has tolerated not only the long
distance for these last five years but also the frequent times at which I was too deeply
submerged in work to pay sufficient attention to much else. Thank you, Trish, for
putting up with this, and I look forward to beginning our new life together at UCLA.
Contents

1 Introduction

2 Angry or Weary? The effect of physical violence on attitudes towards peace in Darfur
2.1 Introduction
2.2 Background
2.3 Methods
2.4 Results
2.5 Robustness
2.6 Discussion
2.7 Conclusions
2.8 Tables and Figures

3 Kernel Regularized Least Squares
3.1 Introduction
3.2 Explaining KRLS
3.3 KRLS in Practice: Parameters and Quantities of Interest
3.4 Inference and Interpretation with KRLS
3.5 Simulation Results
3.6 Empirical Applications
3.7 Conclusion
3.8 Tables
3.9 Figures

4 Kernel Balancing
4.1 Introduction
4.2 Background
4.3 Motivating Example
4.4 Theoretical Framework
4.5 The Proposed Method
4.6 Implementation
4.7 Empirical Examples
4.8 Conclusions

5 Appendices
5.1 Appendix for Kernel Regularized Least Squares
5.2 Appendix for Kernel Balancing
Chapter 1
Introduction
This dissertation consists of three essays, focusing on the challenges of making credible
inferences from observational data in the social sciences, with particular application
to situations of violent conflict.
Social scientists are often interested in estimating the effects of a particular treatment variable on one or more outcome variables.
In many cases, these treatment
variables cannot be randomly assigned, making experiments impossible. Within political science, one area where this challenge is particularly acute is the study of the
causes and consequences of violence, as neither violence itself nor its putative precursors can practically or ethically be randomized by investigators.
The first essay (chapter 2) demonstrates how causal inferences about the effect of
violence can be made and put to theoretical use, in a situation where the distribution
of the violence is arguably indiscriminate within certain sub-populations of those who
were targeted. Specifically, it argues that violence against civilians in Darfur, Sudan during the height of atrocities in 2003-2004 was indiscriminately applied among
individuals within a particular village and of a particular gender. This provides a
rare opportunity to examine the effects of violence during mass atrocity with greatly
reduced risk of confounding. This allows a preliminary answer to a central theoretical
question that would be difficult to convincingly address without such causal leverage:
Does exposure to violence make individuals more angry, more vengeful, and more
supportive of further violence against their perpetrators - as is often assumed? Or,
does it instead make them more weary, desirous of peace, and disenchanted with
armed actors? While the answer found here is not easily generalized to other cases,
results consistently support the "weary" response, with individuals exposed to direct
physical violence more likely to report that peace is possible, and less likely to demand that their enemies be executed. This finding qualifies the claim that violence
generates demands for retribution that lead to further violence or war recurrence,
but is consistent with an emerging view that exposure to violence increases some
pro-social attitudes. It also suggests that victims of violence have an important role
to play in political settlement and reconciliation processes.
While the first essay must go to lengths to substantiate the claim that the distribution of violence is conditionally indiscriminate, the actual act of "conditioning on"
or "adjusting for" the covariates is straightforward, as there are only two, categorical covariates that must be accounted for to identify the causal effect. The second
and third essays introduce methods for dealing with less ideal but more common circumstances, in which the investigator must adjust for more numerous covariates, of
which some may be continuous. As described more rigorously in those essays, existing methods for dealing with high-dimensional and/or continuous covariates typically
require that the investigator can appropriately specify the functional form relating
the covariates to the outcome. This is an implausible claim in most circumstances,
yet violating this assumption can lead to potentially substantial misspecification bias.
An important goal of chapters 3 and 4 is to provide investigators with tools that allow
them to easily and accurately make covariate adjustment in this common scenario,
with greatly reduced risk of misspecification bias.
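To make the simple case concrete: with only two categorical covariates, adjustment reduces to a stratified difference in means. The sketch below illustrates this on simulated data; all variable names and values are hypothetical stand-ins, not the survey's actual coding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical data: treatment probability varies by stratum, and the
# true treatment effect on the outcome is +0.1.
df = pd.DataFrame({
    "village": rng.choice(["v1", "v2", "v3"], size=1000),
    "female": rng.integers(0, 2, size=1000),
})
p_treat = 0.3 + 0.2 * df["female"]            # treatment depends on gender
df["harmed"] = rng.random(1000) < p_treat
df["peace"] = (0.4 + 0.1 * df["harmed"]
               + rng.normal(scale=0.05, size=1000))

def stratified_ate(df):
    """Difference in means within each village-by-gender cell,
    averaged over the cells' shares of the sample."""
    cells = df.groupby(["village", "female"])
    effects = cells.apply(lambda g: g.loc[g["harmed"], "peace"].mean()
                          - g.loc[~g["harmed"], "peace"].mean())
    shares = cells.size() / len(df)
    return float((effects * shares).sum())

ate = stratified_ate(df)   # should recover roughly the true effect
```

Because treatment assignment here depends only on the strata, the within-cell comparisons are unconfounded, and weighting by cell shares recovers the average effect without any functional-form assumption.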
Specifically, chapter 3, with Jens Hainmueller, describes kernel regularized least
squares (KRLS) (Hainmueller and Hazlett, 2013). KRLS is a modeling approach, to
be deployed in regression or classification problems where investigators would more
habitually use generalized linear models or other parametric approaches. While bringing the power of flexible machine learning approaches into an easy-to-use package,
it also allows the investigator to interpret the result in ways similar to those permitted by traditional regression models. We also provide proof of desirable statistical
properties such as unbiasedness, consistency, and normality, and provide closed-form
expressions for standard errors of several quantities of interest. The result is a powerful tool for estimating the marginal effects of variables, even in very high dimensional
problems, with greatly reduced risk of misspecification bias. Another benefit of this
flexibility is that it naturally accommodates heterogeneous effects, and readily allows
for their exploration.
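The computational core of KRLS can be sketched briefly. This toy version uses simulated data and a fixed regularization parameter (rather than the cross-validated choice described in Chapter 3); it fits the kernel regression in closed form and computes average marginal effects from the analytic derivative of the Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data in which y depends on x1 only, so the average
# marginal effect of x1 should be large and that of x2 near zero.
n = 300
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

# Gaussian kernel matrix, with bandwidth set to the number of covariates.
sigma2 = X.shape[1]
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dist / sigma2)

# Closed-form solution of the regularized problem: c = (K + lam*I)^{-1} y.
lam = 0.1   # fixed here for illustration only
c = np.linalg.solve(K + lam * np.eye(n), y)
y_hat = K @ c

def pointwise_effects(d):
    """Derivative of the fitted function with respect to covariate d,
    evaluated at every sample point."""
    diff = X[:, None, d] - X[None, :, d]
    return (K * (-2.0 / sigma2) * diff) @ c

ame_x1 = pointwise_effects(0).mean()   # large: y depends on x1
ame_x2 = pointwise_effects(1).mean()   # near zero: y ignores x2
```

Because the fitted function is a weighted sum of Gaussian kernels, its derivatives are available analytically, which is what makes pointwise and average marginal effects cheap to compute.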
Chapter 4 introduces "Kernel Balancing".
Like KRLS, this approach borrows
insights from kernel-based methods in statistical learning theory to solve a common
analytical challenge faced by social scientists. Matching and balancing approaches
are frequently used to construct control groups and treated groups from the existing
data. The intent of such methods is to identify control and treated samples "similar"
enough in terms of their covariate values to effectively adjust for those covariates,
after which estimating treatment effects (under conditional ignorability) is straightforward.
However, without further (parametric) assumptions regarding functional
form, matching produces biased estimates in most common circumstances. Weighting methods, which use continuous weights instead of simply keeping or dropping
units, can overcome some of these challenges. However these, too, only guarantee unbiased effect estimation if the functional form relating the covariates to the outcome
is of a particular form known to the investigator. Ultimately the challenge with both
matching and weighting methods is that the investigator must know what functions
of the covariates to include in the procedure (and to check balance on) in order to
ensure that different mean outcomes for the two groups are due to the treatment
rather than remaining differences on the covariates. Kernel balancing answers this
question by ensuring that the treated and control groups have the same mean on a very
large space of smooth functions of the covariates, which grows with N. This greatly
reduces the risk of misspecification bias, without assuming the investigator is able to
correctly guess or determine the functional form. Kernel balancing has the additional
useful interpretation that it equalizes the multivariate densities of the covariates for
the treated and (reweighted) controls, when density is measured in a particular way.
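The core idea can be sketched as follows. This toy version, on simulated data, chooses control weights so that the weighted mean of the controls' kernel features approximates the treated mean. The actual kernel balancing algorithm solves a constrained problem that also guarantees non-negative weights, so the least-squares shortcut below is an illustration of the balancing condition only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated confounded data: covariates are distributed differently
# among treated and control units.
n_t, n_c = 100, 200
X = np.vstack([rng.normal(loc=1.0, size=(n_t, 2)),    # treated
               rng.normal(loc=0.0, size=(n_c, 2))])   # control

# Gaussian kernel matrix; row i gives unit i's coordinates in a large
# space of smooth functions of the covariates.
sigma2 = X.shape[1]
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq_dist / sigma2)

K_t, K_c = K[:n_t], K[n_t:]
mu_t = K_t.mean(axis=0)        # treated mean in the kernel feature space

# Least-squares weights for the controls, with a heavily weighted
# extra row pushing the weights to sum to one.
A = np.vstack([K_c.T, 100.0 * np.ones((1, n_c))])
b = np.concatenate([mu_t, [100.0]])
w, *_ = np.linalg.lstsq(A, b, rcond=None)

imbalance_before = np.abs(K_c.mean(axis=0) - mu_t).max()
imbalance_after = np.abs(K_c.T @ w - mu_t).max()
```

Matching means on the kernel features, rather than on the raw covariates alone, is what implicitly balances the large space of smooth functions described above.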
Chapter 2
Angry or Weary?
The effect of physical violence on
attitudes towards peace in Darfur
Angry or Weary?
The effect of physical violence on attitudes towards peace in
Darfur
Chad Hazlett - Massachusetts Institute of Technology
ABSTRACT
Exposure to indiscriminate violence during civil conflict is often thought
to increase anger towards its perpetrators, the desire for vengeance, and
pessimism regarding the prospects for peace and security. Alternatively,
however, experiences with violence during conflict could make individuals
more "weary", less interested in retribution, and more desiring of peace.
While these responses theoretically play a role in the evolution and recurrence of violent conflict, it has been difficult to obtain micro-level evidence
for how violence impacts these attitudes. This paper uses information
about the indiscriminate nature of violence in Darfur and a new survey of
Darfurian refugees to shed light on the responses of Darfurian civilians to
violence. Results consistently support the "weary" response, with individuals exposed to direct physical violence more likely to report that peace is
possible, and less likely to demand that their enemies be executed. This
finding qualifies the claim that violence generates demands for retribution
that lead to further violence or war recurrence, but is consistent with an
emerging view that exposure to violence increases some pro-social attitudes. It also suggests that victims of violence can play an important role
in political settlement processes.
2.1 Introduction
Large-scale violence directed against civilian populations is a common feature of internal conflicts such as civil wars, especially when relatively weak states attempt
to defeat insurgencies embedded in or supported by civilian communities (Valentino
et al., 2004; Colaresi and Carey, 2008). Beyond the immediate and horrific human
consequences of targeting civilians, such violence may also shape the duration, termination, and recurrence of the conflicts in which it occurs through a variety of
possible channels. Mass violence can also generate persistent anger towards other
groups or fears of future attacks. Especially when the perpetrator is another civilian
community living nearby, these fears may have long-lasting effects on the prospects
for peace, as civilian communities support armed actors who can promise protection,
and are unwilling to see "their side" disarmed and left vulnerable to renewed attacks.
Moreover, where political solutions to a conflict may have been possible at the outset
of fighting, once mass atrocities occur against civilian communities, it is often difficult
or impossible to identify credible security arrangements that alleviate civilians' fears
of future attacks. In these ways, the reactions of civilians to atrocities shape both
their own decisions to support ongoing violence or peace, and the strategies available
to elites who would exploit these emotions, security concerns, and desires for their
own ends.
Yet, we know little about how civilians' experiences with violence influence their
views towards continuing to fight or making peace. In this paper, I focus on arbitrating between two opposing hypotheses regarding how civilians react to violence.
First, the "angry hypothesis" states what most experts and non-experts alike might
expect: that direct exposure to violence during conflict makes individuals less likely
to seek peace or believe it is possible, and more likely to be angry or vengeful. One
might predict this outcome based on a number of theoretical mechanisms. To name
a few such channels, violence may harden divisive ethnic identifications and generate
new, stronger grievances and demands for reprisal violence. Relatedly, past atrocities
may generate heightened demands for future violence by showing that neutrality is
no longer a guarantee of safety for non-combatants, and that civilians must demand
protection from other armed actors. As a result, in places where governments do not
monopolize the use of large-scale violence, civilians may choose to maintain support
for armed actors, refuse to support negotiated settlements that involve disarmament,
and may become convinced (by elites or otherwise) that counter-violence or preventive strikes against the perpetrating group are justifiable security measures. All the
paths lead to heightened prospects for continued or renewed violence, and while the
mechanisms differ, each predicts the "angry hypothesis" - that exposure to violence
leads to a reduced desire for or belief in the possibility of peace.
Alternatively, we may reasonably hypothesize just the opposite: a "weary" hypothesis stating that exposure to violence makes one wish for peace more strongly
or believe it to be more achievable.1 High levels of violence directed against civilians
may be blamed on insurgents and their willingness to employ violence, triggering
heightened attempts by civilians to push for peace. Or, individual exposure to violence may make the costs of fighting a war more apparent, and may alter calculations
of whether it is worthwhile to pursue the initial war aims rather than protect the
pre-war status quo, making individuals less interested in fighting.
This paper makes two contributions. First, it employs a novel dataset from a random sample of Darfurian refugees in eastern Chad. In focusing on Darfur, this paper
examines the first and only conflict so far this century to be labeled a genocide by the
U.S. government and in indictments by the International Criminal Court, but that
has remained under-studied in empirical work due to severe logistical constraints. The
data examined here come from the only large-scale, systematic survey of Darfurian
refugees' exposure to violence and attitudes toward peace, justice, and reconciliation.
Second, while prior literature relates only suggestively to the "angry" versus
"weary" hypotheses (see below), this paper directly adjudicates between them, using
a causal identification strategy based on conditionally exogenous exposure to violence.

1 Note that the terms "angry" and "weary" are merely shorthand for these hypotheses, each of which is the observable implication of multiple possible mechanisms. The terms are not intended to suggest that emotions as such, whether rational or not in their origin, are the driving force in attitudes towards peace and violence.

The results find that exposure to violence makes individuals more pro-peace
or "weary" rather than more anti-peace or "angry". This effect is sizable: for each
of four individual binary outcomes, physical harm increases the probability of giving the "weary" response by 8-12 percentage points, which is 17-48% of each mean
response level. This finding proves robust to a variety of modeling approaches (regression, matching, and re-weighting estimators) as well as sensitivity analysis and
placebo tests exploring the effects of possibly omitted confounders. While perhaps
counter-intuitive, these findings offer insights into individual-level responses to indiscriminate mass violence during episodes of civil conflict, with implications for the
duration, termination, and recurrence of those conflicts.
The results also suggest
specific recommendations for the design of peace and reconciliation processes.
2.2 Background
Violence in Darfur
This study examines the effects of violence directed indiscriminately but deliberately
against civilians in Darfur in 2003 and 2004. As this violence was directed broadly
against whole communities, it differs from other forms of violence against civilians,
such as cases of unintended collateral damage, or highly selective violence targeting
individuals based on their political or military activities or on denunciations (e.g.
Kalyvas, 2006). The findings, thus, speak predominantly to cases of mass atrocity
and genocidal violence during ongoing civil conflicts.
While Darfur has experienced previous wars and sporadic violence, the current
conflict most clearly began in February 2003, when two rebel groups - the Sudan
Liberation Army (SLA) and the Justice and Equality Movement (JEM) - launched
an attack on the government air force base in Al Fashir, the capital of North Darfur
state. The articulated motives for this rebellion included long-standing neglect of
the region by the central government and prior attacks on civilians by both the
Sudanese army itself and irregular militia widely referred to as the Janjaweed (Flint
and de Waal, 2008). In response to the surprising success of the rebellion in its early
stages, the government unleashed a ferocious counter-insurgency operation, designed
to punish, kill, or displace the civilian population presumed to be supportive of the
rebellion. The offensive employed not only the army and air force, but also expanded
mobilization of irregular forces that would continue to be known as the Janjaweed, in
a "counter-insurgency on the cheap" (Flint and de Waal, 2008) strategy of exploiting
pre-existing ethnic and tribal tensions to mobilize against the civilian base of an
insurgency.
Violence rates climbed and remained high through 2003 and 2004. Most refugees
or internally displaced persons (IDPs) of the conflict left their homes at this time.
Those near major towns generally chose to move to them. Others fled to the mountains and forests. A large number of those in the western regions of West Darfur made
the decision to cross the border into eastern Chad, becoming refugees. Many of these
refugees still have not returned home. At the time of our survey in 2009, approximately 250,000 Darfurians were living in registered refugee camps in eastern Chad.
The largest offensives by the Sudanese army and Janjaweed concluded in early 2005;
thereafter, fighting has continued sporadically and with varying patterns (de Waal
et al., 2014).
The number of people killed during the height of violence remains uncertain.
Estimates suggest that in the 17 months from September 2003 to January 2005,
there were 120,000 deaths directly attributable to the conflict, of which 35,000 were
due to direct violence (Guha-Sapir and Degomme, 2005). Over the wider course of
the conflict, Degomme and Guha-Sapir (2010) find that for the period of 2004-2008,
approximately 300,000 deaths were attributable to the conflict, roughly 5% of the
pre-2004 population.
Related Literature
The existing empirical literature sheds little light on the "angry" versus "weary"
consequences of exposure to violence during mass atrocities or even violence more
generally. Prior research speaking indirectly to this question, however, can be organized into cross-national analyses; event-level studies that look at a single conflict
but study the effects on non-individual outcomes (e.g. insurgent attacks; patterns of
control); and micro or (individual) level studies.
Cross-national regression studies have spoken indirectly to this question through
the analysis of war recurrence. Doyle and Sambanis (2000) suggest that measures of war intensity (log of deaths and displacement) are associated with more war recurrence, loosely supportive of the "angry" hypothesis.
However, they also find
that longer durations of war are associated with greater likelihood of a lasting peace,
suggesting a "weary" response. Walter (2004) concludes that recurrence is better
explained by underlying conditions in the country rather than possible effects of the
previous war, but also finds longer wars associated with lower rates of recurrence,
suggestive of the "weary" hypothesis. Fortna (2004) similarly found that longer wars were associated with longer periods of postwar peace.
"Event-level" studies have focused on the impacts of violence within a given conflict, but without access to individual-level outcomes. These again relate only indirectly to the "angry" versus "weary" distinction. Lyall (2009) examined the effects of
random mortar fire on villages by Russian soldiers in Chechnya, finding that shelled
villages were less likely to be the source of future reprisal attacks. Taken at face value,
this might loosely support the "weary" hypothesis. Kocher et al. (2011) find that
aerial bombardment by U.S. forces in Vietnam was strongly associated with higher
likelihood that an area would later fall under Viet Cong control, which the authors
interpret as evidence that such indiscriminate forms of violence create a backlash
against those who perpetrate it. Lyall et al. (2013) find that violence committed by
some (but not all) warring parties tends to shift support towards their opposition.
This may suggest an "angry" result, though only indirectly, as it does not speak to
preferences for continuation of a struggle versus achieving peace.
Micro-level studies show promise for resolving this debate as they can examine
how events relate to individual attitudes. A richer set of findings on the micro-level
effects of violence has begun to emerge, several of which apply careful causal identification strategies. So far these have not focused on the effects of violence exposure
towards peace as such, but have examined the relationship between personal violence
and a range of outcomes, finding that violence relates to heightened psychological
trauma (Pham et al., 2004; Vinck et al., 2007; Pham et al., 2009), and reduced education, employment, and future earnings (Blattman and Annan, 2010; Akresh and
De Walque, 2008).
However, some work has begun to support a perhaps counter-intuitive set of results: that exposure to violence is related to greater levels of social engagement (Bellows and Miguel, 2009; Blattman, 2009) and increased altruism, at least parochially (i.e., towards kin or coethnics) (Gilligan et al., 2011; Voors et al.,
2011; Choi, 2007; Cassar et al., 2012).
2.3 Methods
Data
The primary data source is a survey conducted from April to June of 2009 by the
author and other members of the "Darfurian Voices" team. The project sought to systematically document the views held by Darfurian refugees in Chad on issues of peace,
justice, and reconciliation, and to accurately transmit these views to policymakers,
mediators, negotiating parties, and other key stakeholders. Reports and other materials from this project can be downloaded at http://www.darfurianvoices.org.
This paper uses data from the random-sample survey. Briefly, the sample includes
1,872 individuals from the target population of adult refugees (18 years or older) from
Darfur living in all 12 Darfurian refugee camps in eastern Chad. We used a stratified
random sampling method, with geographic location (camp and block) and gender
as strata. It should be emphasized that the refugee population sampled here is not
representative of Darfur's civilian population broadly. Geography was the primary
determinant of who migrated from Darfur into Chad rather than elsewhere; almost
all Darfurian refugees in Chad hail from the western part of West Darfur.
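The sampling design described above can be sketched as a proportionate stratified draw. The frame below is entirely hypothetical (invented camp, block, and gender values), standing in for the actual refugee-camp listings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical sampling frame listing each adult by camp, block, and gender.
frame = pd.DataFrame({
    "camp": rng.choice(["camp_A", "camp_B", "camp_C"], size=5000),
    "block": rng.integers(1, 5, size=5000),
    "gender": rng.choice(["female", "male"], size=5000),
})

# Proportionate stratified sample: draw the same fraction from every
# camp-by-block-by-gender stratum, so each stratum is represented in
# proportion to its size in the frame.
sample = (frame
          .groupby(["camp", "block", "gender"], group_keys=False)
          .sample(frac=0.1, random_state=0))
```

Stratifying on location and gender guarantees coverage of every camp, block, and gender cell, rather than leaving that coverage to chance as simple random sampling would.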
Measurement
The key causal variable of interest is exposure to violence, which I refer to as the
"treatment" in keeping with the usual language of causal inference. I focus on whether
or not the respondent was the victim of direct physical harm during this conflict, which
I code as a binary variable Physical Harm, indicating injury or maiming during an
attack.2 Approximately 40% of the sample report being directly injured or maimed.
This measure speaks to an individual's exposure to violence above and beyond the
experiences of those around them, which is particularly important in this context,
where many individuals have family members or neighbors who experienced violence
during these attacks. All violence-related questions come at the end of the survey to
avoid possible priming effects. Participants are not asked to describe the violence,
and in particular, women are not asked whether the violence was of a sexual nature.
I examine four outcome measures.
The first three assess whether individuals
believe it is possible to make peace with former enemies (Peace Enemies)3 , peace
with individual Janjaweed members (Peace Janjaweed Members)4 , and peace with
the tribes from which the Janjaweed come (Peace Janjaweed Tribes)5 . All three
response items are transformed into binary responses, coding the positive response
("strongly" or "somewhat" possible/agree) as 1, and coding negative responses ("somewhat" or "strongly" impossible/disagree) as 0. Note that the directionality of these variables is such that more positive values indicate more pro-peace ("weary") answers. If violence increases weariness as measured, we will see positive effects on these outcomes;
if it increases "anger", we would see negative effects on these outcomes.
2 The question was: "Have you suffered violence, or have you been physically maimed in an
attack related to the current conflict? (a) yes; (b). no; (c/d/e) uncertain/refused/not understood".
Enumerators were trained to ensure that this was understood to refer to physical harm against the
participant resulting in physical assault or injury.
3 "Some people say that it is possible for former enemies to live peacefully together after a war.
Some people say that it is not possible for former enemies to live peacefully together after a war.
Do you believe (a) strongly that it is possible; (b) somewhat that it is possible; (c) somewhat that
it is impossible; (d) or strongly that it is impossible?"
4 "In the future, I can see myself living peacefully with actual members of the Janjaweed": (Strongly agree/ somewhat agree/ somewhat disagree/ strongly disagree).
5 "In the future, I can see myself living peacefully with the tribes from which the Janjaweed came": (Strongly agree/ somewhat agree/ somewhat disagree/ strongly disagree).
A fourth outcome concerns what punishment participants feel is appropriate for Government soldiers involved in the conflict (Execute Soldier). This is coded as 1 when the answer was "execution" and 0 for any other (lesser) punishment, and so points in the opposite direction to the previous three (the "wearier" answer now being the lower value).
These four measures are highly inter-related. Factor analysis supports a single-dimensional solution, with the expected signs on the loadings.6 Using these loadings as weights, and then re-scaling by the sum of the weights, I create the variable Peace Index, for use when a single measure of the outcome is useful.
What does this single factor measure? These survey questions are difficult: they require the participant to evaluate counterfactual circumstances and estimate the
chances of a complex process leading to a particular outcome. Moreover, they are
emotionally charged, and come towards the end of a challenging two-hour interview.
In order to answer difficult and emotional questions about the possibility of living
in peace, respondents likely answer instead an easier and more intuitive question
(Kahneman, 2011) such as "Would I like to live with these groups?" or "How would
I feel about living with these groups?"
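For concreteness, the index construction can be sketched in a few lines of code. The loadings are those reported in footnote 6; the outcome variable names are hypothetical labels, not those in the actual dataset.

```python
# Sketch of the Peace Index: a factor-loading-weighted average of the four
# binary outcomes, re-scaled by the sum of the weights. Loadings are those
# reported in footnote 6; outcome names are hypothetical.
LOADINGS = {
    "peace_enemies": 0.68,
    "peace_janjaweed_members": 0.63,
    "peace_janjaweed_tribes": 0.79,
    "execute_soldier": -0.35,
}

def peace_index(responses):
    """responses: dict mapping each outcome name to its 0/1 coding."""
    total = sum(LOADINGS.values())  # 0.68 + 0.63 + 0.79 - 0.35 = 1.75
    return sum(LOADINGS[k] * responses[k] for k in LOADINGS) / total
```

Under this construction the index attains its minimum, -0.35/1.75 = -0.20, when only Execute Soldier equals 1, consistent with the minimum of -0.20 reported in the results.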
Identification Assumption: Conditional Indiscriminacy
The most critical assumption to identify the effect of violence on individuals in the
data is that conditional on observed covariates, whether an individual experiences
violence must not depend on the outcome the individual would have if exposed to violence, nor on the outcome she would have if not exposed. Let Yi(1) designate individual i's (possibly unobserved) outcome had she been exposed to violence; let Yi(0) be the same individual's (possibly unobserved) outcome under non-exposure to violence. The causal effect for unit i is then defined as Yi(1) - Yi(0), and the average treatment effect (ATE) over the population is E[Yi(1) - Yi(0)]. Let Di be an indicator of exposure to violence for individual i, while genderi and villagei designate the gender and village of respondent i.

6 Principal factor analysis, no rotation. Factor loadings were 0.68, 0.63, and 0.79 for Peace Enemies, Peace Janjaweed Members, and Peace Janjaweed Tribes, respectively, and -0.35 for Execute Soldier. The eigenvalue of the single retained factor was 1.6; all others were negative.
The assumption made here is then stated as {Yi(1), Yi(0)} ⊥ Di | genderi, villagei. That is, among individuals of a given gender and village, whether they experience injury or not is unrelated to both potential outcomes Yi(0) and Yi(1). I refer to this throughout as "conditional indiscriminacy", as it implies violence is effectively indiscriminate within village-gender strata.
The Distribution of Violence
Justifying the conditional indiscriminacy assumption requires characterizing the nature and purpose of the violence conducted against civilians. During the height of
attacks in Darfur in 2003-2004 described above, widespread violence against civilians
was employed throughout Darfur, including the state of West Darfur, from which
almost all the survey respondents in this study originate. Critically, the aim of these attacks was not to selectively seek out rebel or political leaders, but to punish or destroy the communities behind rebel groups, through both direct violence against the populace and forced displacement. Displacement of communities served
a second purpose of incentivizing members of the Janjaweed militia, whose tribes
have long sought more reliable access to grazing lands, which could be achieved by
removing these groups.
An attack on a village typically involved one or both of the following: first, Government of Sudan planes would often begin crude, indiscriminate aerial bombardment. Second, Janjaweed militia would charge into the village, during which time many would be killed and many women raped.
In the case of government bombing of villages, within a given village it is relatively straightforward to claim that one's chances of being injured were largely random. These villages are relatively small, allowing for little variation in targeting, and the bombings were often as crude as pushing bombs, scrap metal, and barrels full of shrapnel out of aircraft. This does not allow for targeting based on political attitudes or other strategic considerations within the village.
The Janjaweed attacks, too, produced effectively exogenous exposure to violence
within a given village, conditional on gender. Beyond the use of different types of violence against males versus females, the Janjaweed not only appeared to be indiscriminate in their use of violence, but were also unlikely to have any knowledge of which individuals in a village were more or less politically or militarily active. Villages are ethnically very homogeneous and, while certain villages may have been targeted, within a village there was little or no basis for targeting. Men and women, old and young, were all apparently subject to injury and killing.
In over 80 filmed and transcribed interviews, our research team asked a range of questions, including about the nature of attacks on respondents' villages. Not one of the interviewees provided evidence suggesting that, during village attacks, the Janjaweed discriminated in directing violence against particular types of individuals, though there is evidence that Janjaweed groups encountered on the roads and elsewhere interrogated individuals. The common theme was that the Janjaweed would "kill
everything", with their instructions to do so sometimes overheard by villagers. One
typical respondent recounted: "The government came with Antonovs (aircraft), and targeted everything that moved... If it moved, it was bombed. It is the same thing, whether there are rebel groups (present) or not... They shoot everyone when they see them from a distance, and [if] they have any doubt about him, they shoot him. The government Antonovs survey the area from time to time to see if there is anything moving and if it is a human or an animal... The government bombs from the sky and the Janjaweed sweeps through and burns everything and loots the animals and spoils everything that they cannot take." Such statements look very similar to those
collected by other organizations at other times, such as those collected in Human
Rights Watch (2006). Further examination of these interviews finds that those in the
village, whether sleeping or attempting to flee, were subject to attack. Even those
fleeing to nearby hiding places were frequently pursued. Livestock and belongings
were often stolen (97% of respondents in our sample reported losing all or most of
their livestock, crops, and belongings), and villages were almost always burned to the
ground.
One immediate concern is that some individuals would have been more likely
to have resisted or counter-attacked, and also more likely to experience violence.
This is relatively unproblematic for two reasons. First, during the phase of violence
experienced by those in the survey, resistance within the village had become extremely
rare. One reason is that once the government had clearly joined the effort using its
aircraft, this was no longer a war among tribes, and the would-be resisters among
the Fur, Massalit, and Zaghawa tribes realized that protecting the village was not an
option. Relatedly, those who did wish to resist in this area had already left to join
rebel groups operating outside the villages (and do not enter our sample). Second,
it is important to note that those who hid or attempted to flee were not evidently
shown mercy. Testimony describes how those who fled or hid during the attack were
often chased down or found, and thus still potentially subject to direct violence.
2.4 Results
Covariate Balance
While anecdotal evidence, testimonials, and other information support the conditional indiscriminacy claim, we can also partially test its plausibility quantitatively. A traditional balance test would examine whether the distributions of a series of pre-treatment covariates are the same for the treated and untreated groups. The identification strategy here requires as-if-random distribution of violence only within each
village-gender sub-group. I therefore test "conditional" balance, first splitting the
sample by gender and then, within each, regressing the Physical Harm indicator on
the pre-treatment covariates and village fixed effects. This tests whether covariates
predict Physical Harm within village and gender. If exposure to Physical Harm is indeed unrelated to the distribution of a covariate (conditional on the others), its coefficient in this regression will be zero in expectation.
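As an illustrative sketch (not the study's own code), the within-gender balance regression can be implemented with ordinary least squares, using village dummies to absorb the fixed effects; all variable names here are hypothetical.

```python
# Sketch of the conditional balance test: within one gender's subsample,
# regress the treatment indicator on pre-treatment covariates plus village
# fixed effects. Covariate coefficients near zero are consistent with
# conditional indiscriminacy. Illustrative only.
import numpy as np

def balance_coefs(D, X, village_ids):
    """D: (n,) treatment indicator; X: (n, k) covariates; village_ids: (n,).
    Returns OLS coefficients on the covariates, with village dummies
    absorbing the intercept."""
    villages, idx = np.unique(village_ids, return_inverse=True)
    V = np.eye(len(villages))[idx]        # village dummy matrix
    Z = np.hstack([X, V])                 # covariates + fixed effects
    beta, *_ = np.linalg.lstsq(Z, D, rcond=None)
    return beta[: X.shape[1]]
```

In the actual analysis this would be run separately for men and women, with heteroscedasticity-robust standard errors attached to each coefficient.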
Covariates are included in this analysis if they are certain to be pre-treatment
(measured prior to the village attack or clearly not altered by the violence). These
include age, whether they were a farmer, herder, merchant, or trader in Darfur, their
household size in Darfur, and whether or not they had voted in the past. All results are
shown for linear probability models, with heteroscedasticity-robust standard errors.
The analysis includes 517 unique villages, with no single village accounting for
more than 6% of the sample. On average, 40% of individuals report experiencing
physical harm. Note that the identification assumptions hold only for villagers who
were present during the time of village attacks. Several sample restrictions are thus
made in all further analyses. Most importantly, only those who report leaving Darfur due to direct violence are included, ruling out the approximately 20% of the
sample that left before violence occurred. In addition, because only the civilian (non-leadership) sample was randomly surveyed, and because leaders are expected to be
more politicized in their responses than non-leader civilians, only those who report
being non-leaders both in Darfur and while in the camps are considered here. The
remaining sample size is 1345. Note, however, that when the same analyses below
are run on the full sample, the results are nearly identical.
The results of conditional balance tests support the conditional indiscriminacy
assumption (Table 2.1). The only covariate with a p-value of less than 0.10 is Herder
in Darfur: herders appear to be more likely to experience physical harm. While
possibly a spurious result (made more likely by multiple comparisons), this suggests
conditioning on herder status to ensure this is not acting as a confounder, though
herders make up only 15% of the sample, and dropping them does not affect the results
reported below. Moreover, covariates other than village are not jointly predictive of
who experienced violence for either men (F(8, 338) = 1.10, p = 0.37) or women (F(6, 321), p = 0.43).
Distributions of Treatment Probabilities
It is helpful to see the distributions of propensity score estimates for the treated and
untreated, to ensure that there is no group for which the scores differ greatly. Here
we are interested in propensity to treatment only within each stratum of gender and
village. Conditioning on gender is achieved by separately plotting male and female
propensity scores; adjusting for village can be achieved by a re-weighting procedure. For each directly harmed participant, I assign a weight of 1. For each participant not harmed, I re-weight according to wi = P(Village = villagei | D = 1) / P(Village = villagei | D = 0), where villagei is the village from which participant i originates. This ensures that the post-weighting number of untreated participants from each village is the same as the number of treated units from each village, so that differences in the distribution of propensities to treatment are not due to differences in village of origin.
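This re-weighting can be sketched in a few lines; the column names (`harmed`, `village`) are hypothetical placeholders for the actual survey variables.

```python
# Sketch of the village re-weighting: treated units get weight 1; each
# untreated unit is weighted by P(village | D = 1) / P(village | D = 0), so
# that the weighted village distribution of the untreated matches that of
# the treated. Column names are hypothetical.
import pandas as pd

def village_weights(df):
    p_treated = df.loc[df["harmed"] == 1, "village"].value_counts(normalize=True)
    p_control = df.loc[df["harmed"] == 0, "village"].value_counts(normalize=True)
    ratio = (p_treated / p_control).fillna(0.0)  # 0 for villages with no treated
    w = df["village"].map(ratio)
    return w.where(df["harmed"] == 0, 1.0)       # treated weight = 1
```

After weighting, the share of untreated weight in each village equals the share of treated units from that village, which is what makes the re-weighted propensity-score distributions in Figure 2-1 comparable.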
The top row of Figure 2-1 shows the gender-specific distributions of propensity
scores prior to this re-weighting by village. Clearly, the balance is not good, reflecting
that some villages experienced much more complete violence than others. However,
once the untreated observations are re-weighted to adjust for differences in village of
origin, the balance is extremely good, with very similar distributions of propensity
scores for the treated and untreated (Figure 2-1, bottom row). This boosts our
confidence that those units within a single village and gender group are exposed to
violence in ways unrelated to any of the observed pre-treatment covariates.
Main Results
Treatment effects were estimated using OLS regression, OLS with weights determined by entropy balancing, and Mahalanobis matching. Results from models
on each of the five outcomes are shown in Figure 2-2.
We first examine the OLS results. Given the identification assumptions, it should
be necessary only to regress the outcome on the treatment and village and gender
dummies. I refer to this model as the "short" specification. Adding further covariates
to the model ("long") is not required for unbiased estimation, but allows these (pre-treatment) covariates to explain additional variation, possibly improving the precision
of estimates.
Coefficient estimates from the short and long OLS models are given in Table 2.2
and summarized in Figure 2-2. Both reveal the same pattern, as expected since the
covariates are effectively controlled for by design. Those who report being directly
harmed are approximately 10 percentage points more likely to say it is possible to
live in peace with former enemies, with individual members of the Janjaweed, or with
the tribes from which the Janjaweed were drawn. Results on these three outcomes,
under either model, fall in the 8-11 percentage point range. These effects are substantively significant as well: each of these outcomes had an unconditional mean between 0.17 and 0.40, making increases of 10 percentage points quite large, generally more than 25% of each variable's mean. Those directly harmed are also 9-11 percentage points less likely to support executing Government of Sudan soldiers (compared to an overall
mean of 62%). The factor created by a weighted average of these four, Peace Index,
is also significantly affected by Physical Harm, rising by 0.13 among those harmed.
Peace Index is no longer binary, and has a minimum of -0.20 rather than 0. The effect
size of 0.13 amounts to 31% of the distance between the minimum and the mean.
Together, these results consistently point towards the hypothesis that exposure to violence stimulates a greater desire for, or belief in the possibility of, peace, and a lesser desire to punish enemies to death. The evidence thus favors the "weary" rather than the "angry" hypothesis.
Entropy Balancing
To reduce possible model dependency while ensuring effectively perfect balance on selected covariate moments, I also employ entropy balancing (Hainmueller, 2012). This
approach chooses weights for the control units such that, after weighting, the marginal distributions of covariates are the same for the treated and untreated up to a specified number of moments, while keeping the weights as close as possible to equal.
Entropy balancing is successful in equating the means and variances of the covariate
distributions between those directly harmed and those not directly harmed. I then
employ these weights in regressions with village fixed effects to complete the required conditioning. Again, this is done with (a) a "short" model with the minimal conditioning needed to achieve identification (gender and village fixed effects), and (b) a "long" specification in which covariates are included in the regression stage for additional robustness.
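The core of entropy balancing can be sketched via its convex dual problem, in the spirit of Hainmueller (2012). This toy version matches only first moments (the study also matched variances) and is an illustration, not the study's implementation.

```python
# Minimal entropy-balancing sketch: find control weights, as close to uniform
# as the constraints allow, whose weighted covariate means equal the treated
# means. Solves the convex dual; matches first moments only.
import numpy as np
from scipy.optimize import minimize

def entropy_balance(X_control, treated_means):
    def dual(lam):  # convex dual of the entropy-balancing problem
        return np.log(np.exp(X_control @ lam).sum()) - treated_means @ lam
    lam = minimize(dual, np.zeros(X_control.shape[1]), method="BFGS").x
    w = np.exp(X_control @ lam)
    return w / w.sum()
```

At the optimum the dual's gradient, which is exactly the weighted control mean minus the target, is zero, so the weighted means match the treated means by construction.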
The results are summarized in Figure 2-2 and Table 2.3, and are very similar to
those produced by the OLS analyses: respondents directly harmed by violence are
8-12 percentage points more likely to give the pro-peace or "weary" response to all
questions, all of which are highly significant. Peace Index rises by 0.14-0.16 among
those exposed to direct violence.
Matching
Finally, matching offers an alternative estimation approach. Here, the aim is not so much to improve balance on observables as to allow for conditioning on covariates in a way that is less model-dependent than linear modeling. Mahalanobis matching was
used, with 1-to-1 matching without replacement. The variables matched on were the
same as those in the multivariate models above: all available pre-treatment variables
with enough variation such that at least 10% of the participants fall in the smaller
group. Matching is exact on all variables except age and household size in Darfur.
Post-matching balance tests showed no statistically significant imbalances on any
covariates.
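The matching step can be sketched as 1-to-1 nearest-neighbor Mahalanobis matching without replacement. The text does not specify the pairing order, so the greedy scheme below is an assumption for illustration.

```python
# Sketch of 1-to-1 Mahalanobis matching without replacement: pair each
# treated unit with its nearest not-yet-used control by Mahalanobis distance.
# The greedy pairing order is an illustrative choice, not necessarily the
# study's exact algorithm.
import numpy as np

def mahalanobis_match(X_treated, X_control):
    all_X = np.vstack([X_treated, X_control])
    VI = np.linalg.inv(np.cov(all_X.T))           # inverse pooled covariance
    diff = X_treated[:, None, :] - X_control[None, :, :]
    d2 = np.einsum("tcj,jk,tck->tc", diff, VI, diff)  # squared distances
    pairs, used = [], set()
    for t in range(len(X_treated)):
        c = next(c for c in np.argsort(d2[t]) if c not in used)
        used.add(c)
        pairs.append((t, int(c)))
    return pairs
```

Exact matching on the discrete covariates, as described above, would simply restrict each treated unit's candidate controls before computing distances on age and household size.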
Table 2.4 shows estimates from the matching analyses. The findings are consistent
with the regression estimates, though larger and more significant in some cases. While the number of observations is substantially lower due to the strict matching requirements, gender-specific effects can be estimated more precisely than under regression. The effects all lie in the same direction for men
and for women. However, the effects for men tend to be larger. The only outcome
on which the effects dramatically differ by gender is "peace with Janjaweed Tribes"
(Peace Janjaweed Tribes), such that men see a large 20 percentage point increase in
positive responses after exposure to violence, while women see an insignificant change
of only 3 percentage points. The effect of physical harm on Execute Soldier is also
significantly negative for men (as it is in the overall sample), and negative but non-significant among women. Otherwise, all the effects that were significant for men or
for the overall sample are significant among women as well.
2.5 Robustness
As this is an observational study, further validity checks and an examination of possible alternative explanations are in order before proceeding to interpretation.
Robustness to Confounders
The validity of this finding depends on the absence of unobserved confounders, which
in turn is plausible only if violence was targeted on gender and village, but was
indiscriminate within these strata. While this cannot be definitively proven, I show
that the results observed here are unlikely to be the result of confounding through:
(1) consideration of the likely direction of bias if confounders did exist, (2) a placebo
test, and (3) sensitivity analysis.
First, the direction of the effect is opposite to what we would expect from the most likely sources of confounding. We would typically expect that an unobserved
characteristic driving some people to "select into" experiencing direct violence would
be associated with more "angry" attitudes, not less angry ones. For example, those
who are more anti-government or more interested in supporting the rebellion may
rush into the fight, increasing their chances of exposure to violence, but would be
expected to give the less peaceable answer to survey questions. The observed effect,
however, is in the opposite direction.
Second, a placebo test further supports the identification assumption. Note that
those experiencing physical harm are more likely to report they would vote in future
elections (11%, p < 0.01), echoing findings of other studies (Blattman, 2009; Bateson,
2012). However, the variable pastvoted is a pre-violence measure of whether individuals voted in the past. According to the identification assumptions, conditional on
village and gender, there should be no relation between physical harm and pastvoted,
even though we do see a relationship between physical harm and wouldvote. Using
identical analyses to those above, I find no effect of direct violence on past voting
(β̂ = 0.02, p = 0.62 using OLS-long, for example). The finding that physical harm
strongly affects whether people would vote in the future, but correctly shows no effect on whether people did vote prior to treatment, is useful evidence that physical
harm was distributed without reference to pre-existing political attitudes within each
village-by-gender cell.
Third, sensitivity analyses are useful for examining the robustness of the results
to violations of the identification assumption. I use an approach similar to Imbens
(2003). Suppose the "true" model is y = Xβ + Zγ + ε, where y is the outcome of interest (here, Peace Index), X contains the treatment, intercept, and covariates, β is the true (causal) effect of each variable in X on y, Z is an unobserved confounder, and γ is the effect of this confounder on y. If we estimate this model using OLS on only the observables (X), then β̂ = β + γ(XᵀX)⁻¹XᵀZ. That is, the bias is the product of (a) the effect of Z on Peace Index (γ), and (b) the strength of the correlation between the treatment and the confounder, measured as the predictiveness of the treatment for the confounder after controlling for the rest of X, (XᵀX)⁻¹XᵀZ (which estimates E[Z | PhysicalHarm = 1, X] - E[Z | PhysicalHarm = 0, X]). Figure 2-3 shows the "true" treatment effect implied by varying the degree of confounding using these two parameters. Note that I make the worst-case assumption that γ and (XᵀX)⁻¹XᵀZ are signed so as to produce bias in the direction of the result; if either sign were to change, the direction of bias would imply that the result was actually stronger than what was observed.
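The calculation underlying this sensitivity plot reduces to simple arithmetic over a grid of hypothesized confounder strengths. The grid values below are hypothetical; only the observed Peace Index estimate of roughly 0.13 comes from the OLS results above.

```python
# Sketch of the sensitivity calculation: for a hypothesized confounder with
# effect gamma on the outcome and partial association delta with treatment
# (estimating E[Z | Harm = 1, X] - E[Z | Harm = 0, X]), the implied true
# effect is the observed estimate minus the bias gamma * delta, under the
# worst-case signing. Grid values are hypothetical.
def implied_true_effect(beta_hat, gamma, delta):
    return beta_hat - gamma * delta

beta_hat = 0.13  # observed Peace Index estimate (approximate, OLS)
grid = [(g, d, implied_true_effect(beta_hat, g, d))
        for g in (0.05, 0.10, 0.20)
        for d in (0.10, 0.30, 0.50)]
```

Plotting the contour where the implied true effect crosses the critical value produces a figure in the style of Figure 2-3.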
For comparability, the plot shows the confounding effect each included covariate would have had, had it not been observed. This shows that in order for an omitted confounder
to reduce the true effect so far that it cannot be distinguished from zero (the red dotted line), it would have to be a considerably stronger confounder than any observed
covariate. For example, to imply a true treatment effect statistically indistinguishable
from zero, a confounder would have to be as strongly correlated with Physical Harm
as age, but would need to have an effect on Peace Index more than 10 times larger
than that of age. In another example, female has a large correlation with the outcome (larger even than Physical Harm), as there is a substantial gender "effect" in the data. However, even for a confounder as strongly related to Peace Index as female (which is difficult to imagine), in order to reduce the implied treatment effect to the critical value, the treatment would have to be three times more strongly related to such a confounder than it is to female.
Interference Between Units
Another concern is interference between units, or spillover. In this case, one cannot reasonably assume the Stable Unit Treatment Value Assumption (SUTVA) of zero spillover is valid. Instead, I examine possible violations of this
assumption, and determine how each would alter the meaning of the estimated effect
given that interference occurs. One possibility is that when a person experiences direct violence, those around her who do not experience it (but hear about it or observe it) receive on average a mitigated effect in the same direction.7 If this is the case, it ensures a bias towards zero on the estimated treatment effect, as those classified as unexposed to violence are actually "partially" exposed to it. This would suggest that the true effect is stronger than the estimate on the observed data.
Alternatively, "negative" spillover is also possible: it could be that when person j experiences violence, its effect on person i's non-treatment outcome is on average opposite in direction to the average treatment effect. This type of spillover would be implicated if, for example, those who are harmed become more pro-peace, but those not harmed experience "survivor's guilt," and as a result become more anti-peace.
This example would not invalidate the finding here of the pro-peace effect of violence,
but it would suggest the observed effect is exaggerated relative to the true individual
effect.
However, the data do not show evidence for spillover of either the partial-treatment
type or the negative type. In addition to asking about exposure to physical harm, we
also asked individuals how many family members were killed or maimed, whether they
witnessed other family members being injured, or whether they witnessed non-family
members being injured. Because these measure harm experienced by those close to
the respondent but not the respondent herself, they essentially allow a direct test of
7 That is, when person j is exposed to physical harm, the effect of j's exposure on person i's non-exposure outcome (Yi(0)) is, on average, in the same direction as the average treatment effect.
how violence committed against others affects the attitudes of the respondent. Using
the same specifications and models as above, these measures of indirect exposure
show no significant effect on attitudes in either direction. 8
Correlated Measurement Error
Another potential threat is that some respondents are of a "sophisticated" type, and
seek to show the survey enumerators that (a) they have suffered and are thus in need
of support from donors, and (b) are of a pacific, conciliatory nature, more likely to
attract donors to continue supporting the camps. This is effectively a concern about
non-classical measurement error: the error or mis-representation on the measurement
of the treatment may be correlated with error on the outcome.
This is unlikely to explain the observed effect: if strategic misrepresentation of
this type was driving the effect, we would also expect to see a (false) effect for indirect
forms of violence, such as the loss of family members. The same individuals would be
expected to over-report losses on these measures, while also reporting being more conciliatory, again confounding the relationship between Physical Harm and attitudes.
Since we see no measurable effect of indirect forms of exposure on attitudes, however,
such a confounder seems unlikely.
Survivorship Effects
A final set of concerns is that the population from which we sample is censored in
some way that may bias the results. As noted, the population from which we sample
is in no way representative of the population of Darfur: individuals only appear in the
population studied here if they survived the initial attack, chose to come to refugee
camps in Chad rather than seek refuge elsewhere or join the rebel movements to
stay and fight in Darfur, and survived the trip to Chad. To the degree that those
who were directly, physically harmed and those who were not physically harmed
experience the same selection pressures on who makes it to the camps (that is, the
8 These results are available upon request. Also note that this result does not rule out the possibility of a spillover effect so broad that even those who report no indirect harm experience that spillover.
relationship between potential outcomes and making it into the camp does not depend
on treatment status), then these pressures alter the population about which we make
inferences, but cause no bias on the causal estimates of physical harm.
In contrast, selection pressures that occur differentially depending on whether one
is directly harmed could cause a biased estimate of the effect of violence. The first
concern of this type is that among those who are directly physically harmed during an attack, the chances of death are higher. It seems plausible, however, that among those who are physically harmed, whether or not they survive that injury is unrelated to their potential attitudes. Likewise, among those who are not harmed, whether
they survive is surely uncorrelated with their attitudes. As long as this reasonable
assumption holds, the higher death rate among those who are injured does not
introduce a bias.
A related concern is selective mobilization into rebel groups depending on Physical
Harm. Among those present during the attack but not physically harmed, the more
"angry" ones may have joined the rebel movements rather than coming to the refugee
camp. Among those physically harmed, on the other hand, even the angry ones may
come to the camp for medical care, regardless of their attitudes. This would bias
the results, but in the opposite direction of the observed effect. If the more "angry"
individuals from the unharmed sub-population join the rebel movements rather than
coming to the camps, it would make the resulting non-harmed group in the camp
appear less angry, but we observe the opposite.
2.6
Discussion
Violence directed against civilians during civil wars could lead those who experience it to be either more resistant to or more supportive of peace. On the one
hand, exposure to such violence may increase grievances against, fear of, or anger
towards the perpetrating group. Any of these could drive civilians to support armed
groups that would offer them protection, and to resist calls for disarmament until
their fears, angers, or grievances have been addressed. In such cases, we expect to see
exposure to violence leading to increased pessimism about the prospects for peace,
and/or increased willingness to punish one's enemies (the "angry" hypothesis). On
the other hand, exposure to violence may increase the perceived cost of supporting
ongoing conflict, improving the attractiveness of peace despite whatever heightened
fears, anger, or other effects it produces. If this effect dominates, we expect to see
exposure to violence lead to increased desire for peace (the "weary" hypothesis).
The findings here consistently support the "weary" hypothesis: those exposed
to direct violence are approximately 10 percentage points more likely to report the
"weary" or pro-peace answer using four different measures and under a variety of
different modeling approaches. This effect is large, with exposure to physical harm
increasing the probability of giving the more pro-peace answer by 17-48% of the mean
probability for each item.
Prior quantitative studies have not directly examined the "angry" versus "weary"
question, complicating comparisons to existing literature. However, this finding is
roughly consistent with those cross-national studies that indirectly suggest a "weariness effect" through the finding that longer wars are associated with better chances
at future peace (Doyle and Sambanis, 2000; Walter, 2004; Fortna, 2004). It may also
be consistent with Lyall (2009), which found that indiscriminate violence by the incumbents may have led to lower support even for insurgents, as areas subject to such
violence staged fewer insurgent attacks.
At the micro-level, Beber et al. (2012) find that those exposed to riot-related violence in Khartoum (the capital of Sudan) were far more likely to support South Sudan's secession. This is roughly consistent with the "weary" finding here, as it implies a willingness to put an end to hostilities even if doing so comes at potentially high economic cost and means granting the opposition its long-standing aims.9 In addition, if the "weary" effect here is regarded (cautiously) as a pro-social outcome, it is
consistent with other micro-level studies that show relatively pro-social effects, including increased political and social engagement (Bellows and Miguel, 2009; Blattman,
9 However, an alternative interpretation would be that granting secession to South Sudan would have the effect of potentially removing South Sudanese from Khartoum. Thus a host of motives besides "weariness" could produce this result.
2009), or heightened altruism, if only towards kin or coethnics (Gilligan et al., 2011;
Voors et al., 2011; Choi, 2007; Cassar et al., 2012).10
While this study and the micro-level work cited above have focused on civilian
attitudes in the wake of violence, an influential strand of research in the civil war
literature suggests we should be principally concerned with elites, who are thought
to largely control public narratives, alliance-formation, and the mobilization of violence (e.g.
Christia, 2012; Fearon and Laitin, 2000; Wilkinson, 2006).
However, the civilian-centric and elite-centric approaches are complementary. First,
even elites do not operate in a vacuum; the perceptions, attitudes, and emotions of
the civilian populations they seek to influence shape the opportunities and strategies
available to them. Second, if elite manipulation matters, it is ultimately through
its ability to influence individual attitudes, perceptions, or incentives. By studying
civilian responses to violence, we are studying the net effect of many influences, including elite manipulations which have already occurred in the wake of these events.
Understanding the various mechanisms that result in changes in individual attitudes
- ranging from intrinsic individual responses, to the effect of social groups, and to
the efforts of elites to shape these responses - remains a key area for future research.
The generalizability of these findings likely depends on several factors. The Darfur
case is one of indiscriminate violence directed against civilians, and the results may
be specific to this type of violence, as opposed to cases of selective violence targeted
against individual insurgents, the accidental killing of civilians in otherwise carefully
targeted violence, or violence resulting from denunciation of collaborators by community members (e.g. Kalyvas, 2006). In Darfur, civilians were also attacked both
by their own government and by members of other civilian communities with whom
10 It is important not to overstate those effects of violence that might possibly be pro-social. While Beber et al. (2012) suggests that violence increases readiness to make peace, those exposed to violence were also less willing to grant citizenship to Southerners remaining in the North, presumably out of heightened concern for their safety. Moreover, many counter-productive effects of violence have also been documented, including psychological trauma (Pham et al., 2004; Vinck et al., 2007; Pham et al., 2009); reduced education, employment, and future earnings (Blattman and Annan, 2010; Akresh and De Walque, 2008); and negative effects on trust (Cassar et al., 2012; Nunn and Wantchekon, 2009; Becchetti et al., 2011). Most directly, violence clearly carries with it a horrific and unacceptable direct cost, and the possibility of there being some positive effects does not suggest that violence has a net positive effect.
they share a history of antagonism and conflict. This may also be a relevant feature
to consider when generalizing the finding to other cases, though future research will
be needed to establish the exact conditions under which the effect found here is likely
to hold.
Possible Mechanisms
Having found robust support for the "weary" hypothesis, I briefly consider three candidate explanations to help inform future research on mechanisms.
The first explanation is a calculus of perceived cost and benefit.
In its simplest
form, those subject to direct physical violence experience heightened suffering and,
therefore, see greater costs to ongoing conflict, translating into a greater desire for
peace and heightened perceived attractiveness of the pre-war status quo.
Second, recent work supports the claim that in situations of war and other forms of
violence, "post-traumatic growth" (Tedeschi et al., 1998) is a more common outcome
than alienation or debilitating psychiatric illness (see Tedeschi and Calhoun, 2004,
Bateson, 2012, Blattman, 2009).
Most closely related, Blattman (2009) describes interviews suggesting that abduction by the Lord's Resistance Army in Uganda (and managing to escape and survive thereafter) leads to rapid maturation and an increased sense of control over one's life. This could presumably lead to behavioral changes, though why it should lead to an increased desire for peace in particular remains unclear.
A third possibility is suggested by the combination of a demand for retributive
violence, with the special status of individuals with injuries. Communities that live
far from government protections and have easily lootable capital stocks - an apt
description of Darfur - must maintain a reputation for toughness and the willingness
to use reciprocal violence when slighted. In such a "culture of honor", as described by Nisbett and Cohen (1996), individuals are expected to show a desire for retribution in response to attacks on group members. Evidence collected during the survey - such as songs sung by women calling for their men to take retributive action - suggests that the demand for retributive violence is high among these communities (see also Hastrup, 2013). Against this backdrop, however, it appears that those individuals who are directly harmed - particularly those with physical evidence of their injuries - have heightened legitimacy to speak on violence, and can promote peace without fear of appearing cowardly.11 Such arguments are clearly tentative, however, and only suggestive of directions for future research.
In the available data, the lack of evidence for any effect due to non-direct forms
of harm would seem to challenge both the "heightened cost" and "personal growth"
mechanisms: it is difficult to see why personal injury should trigger greater increases
in perceived cost or personal growth while the other forms of harm visited on many in
the sample do not. However, effects of indirect harm might also be harder to detect
because they may be weaker and more susceptible to downward bias through positive
spillover. The "culture of honor exemption" theory also raises questions as to (a) why
individuals who are injured would be given the hypothesized exemption from norms
calling for reciprocal violence, especially in a context where so many individuals have
lost so much; and (b) why harmed individuals would be inclined to use this exemption
as an opportunity to become more pro-peace.
Nevertheless, a combination of mechanisms provides a plausible candidate for future investigation. First, if the culture of honor "exemption" really gives injured individuals the necessary status to speak without fear of being branded cowardly, then either the post-traumatic growth or the increased perceived-costs mechanism could explain why they use that opening to espouse pro-peace attitudes. A characteristic of this
combined theory is that, if physical wounds act as evidence of one's hardship and a
mark of authority to speak on the subject of violence, this could explain why those
who experience indirect harm do not show such an effect.
11 In our anecdotal experience of conducting and filming interviews in these camps, those with physical evidence of injuries - such as amputations, shrapnel, and scars - were often eager to approach our research team and to be interviewed. While clearly only suggestive, this is consistent with the view that individuals with physical evidence of harm have the strongest mandate or motivation to demand an audience.
2.7 Conclusions
Violence against non-combatant civilians is a common feature of many civil wars,
and beyond the obvious human cost of such violence, it can shape the possible trajectory and outcomes of conflict. Two plausible theoretical claims produce opposing
hypotheses: does exposure to violence make individuals more "angry", vengeful, or
likely to view peace as impossible on the one hand, or more "weary" or pro-peace on
the other?
The results strongly and consistently favor the "weary" hypothesis: those who
report being injured or maimed were approximately 10 percentage points more likely
to say it is possible to live in peace with former enemies, to live in peace with individual Janjaweed, or to live in peace with the tribes from which the Janjaweed were
recruited.
They are also roughly 10 percentage points less likely to demand that
Government of Sudan soldiers be executed.
These effects are substantively large,
amounting to 17-48% of the mean probability of giving the pro-peace answer on each
item. This study can only hope to maximize internal validity in the single case of Darfurian refugees in eastern Chad. That said, this is a particularly important and severely understudied case. A valuable next step would be to assess its generalizability by similarly examining the effects of personal violence on attitudes towards
peace in other conflicts.
This paper contributes to a small but growing literature identifying the effects of
violence during conflict, but is the first to directly test the "weary" versus "angry"
hypotheses. This is also the first study to make micro-level causal inference in the case
of Darfur, which has been vastly understudied relative to the scale of violence and
policy attention. While not directly comparable to any prior study, these findings
are broadly consistent with an emerging view that, perhaps surprisingly, exposure
to violence is associated with some positive shift in individuals' social and political
engagement (Bellows and Miguel, 2009; Blattman, 2009; Gilligan et al., 2011; Bateson,
2012).
Having estimated this effect, arbitrating among the mechanisms that generate it
remains an important task for future work. Those exposed to physical harm may
perceive a shift in the cost of conflict relative to others, making ongoing violence
less appealing and the pre-war status quo more appealing. Alternatively, individuals
who undergo heightened suffering due to direct physical violence may experience
"post-traumatic growth". It is also possible that individuals who are physically harmed are
"exempted" from demands to show anger and a desire for vengeance.
Accordingly, further research could fruitfully help to understand these mechanisms
by examining whether victims of physical harm see the costs of the conflict as being
starker than others do, whether they show post-traumatic growth or other changes in
a variety of domains, and whether communities view victims of physical violence as
having greater authority to speak, or as having an exemption from norms requiring
support for retribution.
An important consequence of this finding is the possibility that, by enhancing
weariness, exposure to direct physical violence may mitigate rather than potentiate
the support civilians are willing to provide to violent actors. This matters, first, in
terms of the direct impact of such shifts on civilians' willingness to contribute to
armed conflict (through providing safe haven, providing or withholding information,
material support, or direct participation). It may also influence which narratives will resonate or prove counter-productive as elites seek to mobilize for war or peace in their own interests.
One practical policy lesson emerging from this analysis is that individuals harmed
by violence are not to be treated as lost causes or likely spoilers of potential peace.
On the contrary, they may be more peace-seeking than their neighbors.
Political
settlement and reconciliation processes would do well to incorporate these individuals
as directly and inclusively as possible.
2.8 Tables and Figures
Table 2.1: Multivariate Balance Conditional on Village Fixed Effects
DV: Physically Harmed

                              Males             Females
                              (p-val)           (p-val)
Age                           -0.003 (0.196)    -0.001 (0.801)
Farmer in Darfur              -0.031 (0.733)     0.027 (0.732)
Herder in Darfur               0.201 (0.045)     0.144 (0.151)
Voted in past                 -0.011 (0.861)     0.093 (0.287)
Household size in Darfur      -0.004 (0.541)     0.006 (0.352)
Merchant in Darfur             0.135 (0.096)        NA
Tradesman in Darfur            0.037 (0.800)        NA
Joint F                        1.10              0.99
Joint p                        0.362             0.43
N                              640               588
Note: Conditional balance test examining whether, within village and gender, observable pre-treatment covariates have the same means for those who were and were not physically harmed.
The treatment indicator (physical harm) is regressed on village fixed effects and all pre-treatment
covariates, separately for men and for women. The results show good balance overall: all coefficients
are near zero, with the only significant estimate being a dummy for individuals who were herders in
Darfur. Joint significance tests fail to reject the null hypothesis that all coefficients (except those
on the village fixed effects) are zero. Thus, taken together these covariates are not significantly
predictive of being physically harmed.
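The test described in the note can be sketched as follows. This is an illustrative implementation, not the author's code; the function and array names are hypothetical. It fits the restricted model (fixed effects only) and the full model (fixed effects plus covariates), then forms the standard F-statistic for the joint null that the covariate coefficients are zero.

```python
import numpy as np

def joint_balance_f(T, C, V):
    """Conditional balance test: regress the treatment indicator T on
    village fixed effects V plus pre-treatment covariates C, then compute
    the F-statistic for the null that all covariate coefficients (but not
    the fixed-effect coefficients) are zero.

    T : (n,) treatment indicator (1 = physically harmed)
    C : (n, q) pre-treatment covariates
    V : (n, m) village dummies (one village omitted as the baseline)
    Returns (F-statistic, numerator df, denominator df).
    """
    n, q = C.shape
    X_r = np.column_stack([np.ones(n), V])   # restricted model: FEs only
    X_f = np.column_stack([X_r, C])          # full model: FEs + covariates
    rss = lambda X: np.sum((T - X @ np.linalg.lstsq(X, T, rcond=None)[0]) ** 2)
    rss_r, rss_f = rss(X_r), rss(X_f)
    df_den = n - X_f.shape[1]
    return ((rss_r - rss_f) / q) / (rss_f / df_den), q, df_den
```

Failing to reject the null (a large p-value on this F-statistic) is consistent with balance, as in the joint tests reported in Table 2.1.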
Table 2.2: Effect of Physical Harm on Attitudes: OLS Regression Estimates

[Table layout not recoverable from the extraction. Columns are models (1)-(10), two per outcome: Peace with Former Enemies (mean 0.40), Peace With Janjaweed Indiv. (mean 0.17), Peace With Janjaweed Tribes (mean 0.32), Would Execute Gov. Soldiers (mean 0.63), and Peace Index (mean 0.22). Rows: physical harm, Intercept, Female, Village FEs, Controls, N, Mean(DV). Recoverable coefficient estimates on physical harm lie between about 0.075 and 0.13 for the pro-peace outcomes and between about -0.09 and -0.11 for the execution outcome. Robust SEs in parentheses. Controls: age, farmer, herder, past vote, household size in Darfur.]

Note: OLS estimates of the effect of being physically harmed on each outcome. All models include a gender dummy and village fixed effects, as required to meet identification conditions. The "long" models (odd numbers) also include pre-treatment controls, despite apparent balance on these, to improve the precision of the estimates. All estimates show that exposure to physical harm (Physical Harm) produces a 9-11 percentage point increase in pro-peace attitudes. The reduction in willingness to execute Government of Sudan soldiers by a similar amount is consistent with this pro-peace effect. Similarly, Peace Index is the single-factor solution combining all the other outcome variables into one, and shows the most significant effect of Direct Harm. As all the effects fall in the pro-peace direction, these findings support the "weary" hypothesis rather than the "angry" hypothesis. The effect sizes are not only statistically significant, but also substantively large relative to the means of each outcome, also shown. These results are summarized together with other models in Figure 2-2.
Figure 2-1: Propensity Scores for Harmed and Unharmed
[Four kernel density plots of propensity scores for Treated (harmed) and Untreated (unharmed) individuals, on a 0.0-1.0 horizontal axis. Panels report N = 266 with bandwidth 0.08611 and N = 235 with bandwidth 0.0905.]
Top row: Propensity scores for treated (harmed) and untreated (unharmed) individuals, using the same linear model of pre-treatment covariates used in multivariate balance testing. Top left: males only;
Top right: female only. These show that without conditioning on village, the covariate values of those
who are harmed and those who are unharmed are different, allowing the propensity score model to
distinguish between those likely to be harmed and those who are not. Bottom row: Propensity scores
for harmed and unharmed individuals, after re-weighting the data so that the distribution of villages
is the same for the harmed and unharmed. Bottom left panel: male only; Bottom right: female
only. These results show that the apparent imbalances seen in the top row disappear when village
of origin is taken into account. It thus provides a visual illustration of the balance on covariates
within village of origin and gender.
Figure 2-2: Estimated Effect of Exposure to Physical Harm on Attitudes under Five
Models
[Dot plot of effect estimates for five outcomes (Peace with Former Enemies; Peace with Janjaweed Individuals; Peace with Janjaweed Tribes; Should Execute Government Soldiers; Peace Factor) under five models (OLS-short, OLS-long, Ebal-short, Ebal-long, Match). Horizontal axis: Effect Estimate of Exposure to Direct Harm, ranging from -0.3 to 0.3.]
Note: Summary of effect estimates on various outcomes under five models: OLS with only minimal covariates (OLS-s), OLS with additional covariates (OLS-l), entropy balancing followed by weighted OLS with minimal covariates (ebal-s) and with additional covariates (ebal-l), and matching. As discussed in the text, all models find that being directly harmed moves all variables in the "pro-peace" or "weary" direction. Specifically, regardless of the model used, those directly harmed show an 8-12% increase in their beliefs that it is possible to live in peace with former enemies, with Janjaweed individuals, or with Janjaweed tribes. Congruently, those directly harmed are also approximately 10% less likely to say that execution would be the appropriate punishment for Government of Sudan soldiers. Finally, a single-dimensional index made from the above four variables shows the strongest effects, again in the pro-peace direction. The evidence is thus consistent with the "weary" hypothesis and opposite to what is predicted by the "angry" hypothesis.
Table 2.3: Effect of Physical Harm on Attitudes: Entropy Balanced Regression Estimates

[Table layout not recoverable from the extraction. Columns are models (1)-(10), two per outcome: Peace with Former Enemies (mean 0.40), Peace With Janjaweed Indiv. (mean 0.17), Peace With Janjaweed Tribes (mean 0.32), Would Execute Gov. Soldiers (mean 0.63), and Peace Index (mean 0.22). Rows: physical harm, Intercept, Female, Village FEs, Controls, N, Mean(DV). Recoverable coefficient estimates on physical harm lie between about 0.083 and 0.16 for the pro-peace outcomes and between about -0.09 and -0.11 for the execution outcome. Robust SEs in parentheses. Controls: age, farmer, herder, past vote, household size in Darfur.]

Note: Estimates of the effect of being directly harmed on each outcome. Weights derived from entropy balancing ensure that those who were and who were not directly harmed have the same mean and variance on pre-treatment covariates. The "short" and "long" OLS models are then run on the re-weighted data. As before, the models show that exposure to physical harm produces an 8-11 percentage point increase in pro-peace attitudes, or congruently, a 9-11 percentage point decrease in reported desire to execute Government of Sudan soldiers. Peace Index shows the largest effect of Physical Harm. Again, all results are in the "pro-peace" direction, consistent with the "weary" hypothesis and contradicting the "angry" hypothesis. The effect sizes are not only statistically significant, but are also substantively large relative to the means of each outcome. These results are summarized together with those of other models in Figure 2-2.
Table 2.4: Effect of Exposure to Physical Harm on Attitudes: Matching Estimates
                      Peace with    Peace With    Peace With    Would Execute   Peace
                      Former        Janjaweed     Janjaweed     Gov. Soldiers   Index
                      Enemies       Indiv.        Tribes
All
  Mean(DV)            0.40          0.17          0.32          0.63            0.22
  physical harm       0.13          0.11          0.14          -0.12           0.19
  p-val               0.00          0.00          0.00          0.00            0.00
  Npairs              254           258           260           248             231
Male
  Mean(DV)            0.55          0.24          0.46          0.53            0.39
  physical harm       0.20          0.14          0.21          -0.09           0.26
  p-val               0.00          0.00          0.00          0.01            0.00
  Npairs              118           119           118           108             101
Female
  Mean(DV)            0.23          0.09          0.18          0.73            0.04
  physical harm       0.08          0.07          0.03          -0.04           0.10
  p-val               0.01          0.00          0.35          0.20            0.00
  Npairs              99            91            85            97              94

Controls: age, farmer, herder, past vote, household size in Darfur
Note: Matching estimates of the effect of being directly harmed on each outcome, using Mahalanobis-distance, 1-to-1 matching without replacement. Matching is exact on all variables except age and household size in Darfur. Results show that the effect of being directly harmed on each outcome variable is again in the "pro-peace" direction. On the full sample, effects are similar to the OLS and entropy-balanced models, but slightly larger. Among females, effect sizes are somewhat smaller, and the effects of physical harm on Peace with Janjaweed Tribes and Would Execute Gov. Soldiers shrink substantially, losing significance. For Peace with Janjaweed Tribes in particular, it appears that the observed aggregate effect is largely driven by males, while females show little or no effect. Otherwise, all effects significant in the overall model are significant among each gender separately.
Figure 2-3: Effect of Physical Harm on Peace Index implied by confounders of varying
strength
[Contour plot. Horizontal axis: Effect of Confounder on Peace Index (0.0 to 0.5). Plotted points mark the included covariates (age, female, household size, herder, past vote, farmer) and the observed estimate (0.13).]
Note: Sensitivity analysis. The "height" shown by contour lines gives the expected true size of the effect of Physical Harm on Peace Index, given a hypothetical confounder. The bias is parameterized by how strongly this confounder relates to Physical Harm (vertical axis) and how strongly it relates to the outcome (horizontal axis). For the true effect of Physical Harm on Peace Index to be statistically indistinguishable from zero, an unobserved confounder would have to be substantively more confounding than any of the included covariates. For example, even a confounder as strongly correlated with Peace Index as female would have to be three times more predictive of exposure to Physical Harm in order for the true treatment effect to be statistically indistinguishable from zero.
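The parameterization in the note mirrors the textbook omitted-variable-bias logic. As a toy sketch (an illustration of the idea only, not the exact computation behind the contours; the function name is ours), in a linear model, omitting a confounder shifts the treatment coefficient by roughly the product of the confounder's relation to the treatment and its effect on the outcome:

```python
def implied_true_effect(observed, conf_on_treatment, conf_on_outcome):
    """Height of a contour line: the effect of Physical Harm on Peace Index
    that would remain if a hypothetical confounder, with the given relations
    to treatment and outcome, accounted for part of the observed estimate
    (linear omitted-variable-bias approximation)."""
    return observed - conf_on_treatment * conf_on_outcome

# Example: with the observed estimate of 0.13, a confounder related to the
# treatment at 0.2 and to the outcome at 0.3 would leave an implied true
# effect of about 0.07.
```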
Chapter 3
Kernel Regularized Least Squares
Kernel Regularized Least Squares: Reducing Misspecification
Bias with a Flexible and Interpretable Machine Learning
Approach
Jens Hainmueller - Massachusetts Institute of Technology
Chad Hazlett - Massachusetts Institute of Technology
ABSTRACT
We propose the use of Kernel Regularized Least Squares (KRLS) for social
science modeling and inference problems. KRLS borrows from machine
learning methods designed to solve regression and classification problems
without relying on linearity or additivity assumptions. The method constructs a flexible hypothesis space that uses kernels as radial basis functions and finds the best-fitting surface in this space by minimizing a
complexity-penalized least squares problem. We argue that the method is
well-suited for social science inquiry because it avoids strong parametric
assumptions, yet allows interpretation in ways analogous to generalized
linear models while also permitting more complex interpretation to examine non-linearities, interactions, and heterogeneous effects. We also extend
the method in several directions to make it more effective for social inquiry, by (1) deriving estimators for the pointwise marginal effects and
their variances, (2) establishing unbiasedness, consistency, and asymptotic normality of the KRLS estimator under fairly general conditions,
(3) proposing a simple automated rule for choosing the kernel bandwidth,
and (4) providing companion software. We illustrate the use of the method
through simulations and empirical examples.
3.1 Introduction
Generalized linear models (GLMs) remain the workhorse method for regression and
classification problems in the social sciences. Applied researchers are attracted to
GLMs because they are fairly easy to understand, implement, and interpret. However,
GLMs also impose strict functional form assumptions. These assumptions are often
problematic in social science data, which are frequently ridden with non-linearities,
non-additivity, heterogeneous marginal effects, complex interactions, bad leverage
points, or other complications. It is well-known that misspecified models can lead to
bias, inefficiency, incomplete conditioning on control variables, incorrect inferences,
and fragile model-dependent results (e.g. King and Zeng (2006)). One traditional and
well-studied approach to address some of these problems is to introduce high-order
terms and interactions to GLMs (e.g. Friedrich, 1982; Jackson, 1991; Brambor et al.,
2006). However, higher-order terms only allow for interactions of a prescribed type,
and even for experienced researchers, it is typically very difficult to find the correct
functional form among the many possible interaction specifications, which explode
in number once the model involves more than a few variables.
Moreover, as we
show below, even when these efforts may appear to work based on model diagnostics,
under common conditions, they can instead make the problem worse, generating false
inferences about the effects of included variables.
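The combinatorics behind this explosion are easy to make concrete (a quick illustration; the helper function is ours, not from the paper): the number of distinct polynomial terms of total degree at most d in D variables, intercept included, is C(D + d, d).

```python
from math import comb

def n_poly_terms(D, d):
    """Count the distinct polynomial terms of total degree <= d in D
    variables (intercept included): C(D + d, d)."""
    return comb(D + d, d)

# With D = 2 and degree 2 there are just 6 terms (1, x1, x2, x1^2, x2^2,
# x1*x2), but with D = 10 a full third-order specification already has
# 286 candidate terms, and D = 20 gives 1771.
```

Searching over subsets of these terms for the "correct" specification quickly becomes infeasible, which is the motivation for learning the functional form from the data instead.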
Presumably, many researchers are aware of these problems and routinely resort to
GLMs not because they staunchly believe in the implied functional form assumptions,
but because they lack convenient alternatives that relax these modeling assumptions
while maintaining a high degree of interpretability. While some more flexible methods, such as neural networks (e.g. Beck et al., 2000) and Generalized Additive Models
(GAMs, e.g. Wood, 2003), have been proposed, they have not been widely adopted
by social scientists, perhaps because these models often do not generate the desired
quantities of interest or allow inference on them (e.g. confidence intervals or tests of
null hypotheses) without non-trivial modifications and often impracticable computational demands.
In this paper, we describe Kernel Regularized Least Squares (KRLS). This approach draws from Regularized Least Squares (RLS), a well-established method in
the machine learning literature (see e.g. Rifkin et al., 2003).1 We add the "K" to
(a) to emphasize that it employs kernels (whereas the term RLS can also apply to non-kernelized models); and (b) to designate the specific set of choices we have made in
this version of RLS, including procedures we developed to remove all parameter selection from the investigator's hands and, most importantly, methodological innovations
we have added relating to interpretability and inference.
The KRLS approach offers a versatile and convenient modeling tool that strikes
a compromise between the highly constrained GLMs that many investigators rely on
and more flexible but often less interpretable machine learning approaches. KRLS
is an easy to use approach that helps researchers to protect their inferences against
misspecification bias and does not require them to give up many of the interpretative
and statistical properties they value. This method belongs to a class of models for
which marginal effects are well-behaved and easily obtainable due to the existence of
a continuously differentiable solution surface, estimated in closed form. It also readily
admits to statistical inference using closed form expressions, and has desirable statistical properties under relatively weak assumptions. The resulting model is directly
interpretable in ways similar to linear regression while also making much richer interpretations possible. The estimator yields pointwise estimates of partial derivatives
that characterize the marginal effects of each independent variable at each data point
in the covariate space. The researcher can examine the distribution of these pointwise estimates to learn about the heterogeneity in marginal effects, or average them to obtain an average partial derivative similar to a β coefficient from linear regression.
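For the Gaussian kernel introduced below, these pointwise estimates follow from differentiating the fitted surface f(x) = Σ_i c_i k(x, x_i). A minimal numpy sketch (illustrative only, not the package's closed-form variance machinery; c denotes the fitted kernel weights):

```python
import numpy as np

def pointwise_effects(X, c, sigma2, d):
    """Partial derivative of the fitted surface f(x) = sum_i c_i k(x, x_i),
    with Gaussian kernel k(x, x_i) = exp(-||x - x_i||^2 / sigma2), taken
    with respect to covariate d and evaluated at every sample point:
        df/dx_d (x) = -(2 / sigma2) * sum_i c_i (x_d - x_{i,d}) k(x, x_i).
    """
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / sigma2)                 # pairwise similarities
    diffs = X[:, None, d] - X[None, :, d]    # x_d - x_{i,d} for each pair
    return -(2.0 / sigma2) * ((K * diffs) @ c)

# Averaging the pointwise estimates gives a single summary number analogous
# to a regression coefficient: pointwise_effects(X, c, sigma2, d).mean()
```

The distribution of the returned vector, rather than only its mean, is what reveals heterogeneity in the marginal effect.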
Because it marries flexibility with interpretability, the KRLS approach is suitable
for a wide range of regression and classification problems where the correct functional form is unknown. This includes exploratory analysis to learn about the data-generating process, model-based causal inference, or prediction problems that require
'Similar methods appear under various names, including Regularization Networks (e.g. Evgeniou
et al., 2000) and Kernel Ridge Regression (e.g. Saunders et al., 1998).
an accurate approximation of a conditional expectation function to impute missing
counterfactuals.
Similarly, it can be employed for propensity score estimation or
other regression and classification problems where it is critical to use all the available
information from covariates to estimate a quantity of interest. Instead of engaging
in a tedious specification search, researchers simply pass the X matrix of predictors
to the KRLS estimator (e.g. krls(y=y,X=X) in our R package), which then learns
the target function from the data. For those who work with matching approaches,
the KRLS estimator has the benefit of similarly weak functional form assumptions
while allowing continuous valued treatments, maintaining good properties in high-dimensional spaces where matching and other local methods suffer from the curse of
dimensionality, and producing principled variance estimates in closed form. Finally,
although necessarily somewhat less efficient than Ordinary Least Squares (OLS), the
KRLS estimator also has advantages even when the true data-generating process is
linear, as it protects against model dependency that results from bad leverage points
or extrapolation and is designed to bound over-fitting.
The main contributions of this paper are threefold. First, we explain and justify
the underlying methodology in an accessible way and introduce interpretations that
illustrate why KRLS is a good fit for social science data. Second, we develop various methodological innovations. We (a) derive closed-form estimators for pointwise
and average marginal effects; (b) derive closed-form variance estimators for these
quantities to enable hypothesis tests and the construction of confidence intervals; (c)
establish the unbiasedness, consistency, and asymptotic normality of the estimator
for fitted values under conditions more general than those required for GLMs; and
(d) derive justification for a simple rule for choosing the bandwidth of the kernel
at no computational cost, thereby taking all parameter-setting decisions out of the
investigator's hands to improve falsifiability. Third, we provide companion software
that allows researchers to implement the approach in R, Stata, and Matlab.
3.2 Explaining KRLS
Regularized least squares approaches with kernels, of which KRLS is a special case, can be motivated in a variety of ways. We begin with two explanations, the "similarity-based" view and the "superposition of Gaussians" view, which provide useful insight into how the method works and why it is a good fit for many social science problems. Further below we also provide a more rigorous, but perhaps less intuitive, justification.2
Similarity-Based View
Assume that we draw i.i.d. data of the form (yi, xi), where i = 1,
of observation, yi
ER
... ,
N indexes units
is the outcome of interest, and xi E RD is our D-dimensional
vector of covariate values for unit i (often called exemplars). Next, we need a so-called
kernel, which for our purposes is defined as a symmetric and positive semi-definite
function k(., -) that takes two arguments and produces a real valued output.3 It is
useful to think of the kernel function as providing a measure of similarity between
two input patterns. While many kernels are available, the kernel used in KRLS and
throughout this paper is the Gaussian kernel given by
k(xj, xi)
=e
where ex is the exponential function and | xi - xi
(3.1)
is the Euclidean distance between
the covariate vectors xj and xi. This function is the same function as the normal dis.
tribution, but with a2 in place of 2U 2 , and omitting the normalizing factor 1//2rou 2
The most important feature of this kernel is that it reaches its maximum of one only
when xi = xj and grows closer to zero as xi and xj become more distant. We will
² Another justification is based on the analysis of reproducing kernels, and the corresponding
spaces of functions (Reproducing Kernel Hilbert Spaces) they generate along with norms over those
spaces. For details on this approach, we direct readers to recent reviews included in Evgeniou et al.
(2000) and Schölkopf and Smola (2002).
³ By positive semi-definite, we mean that Σ_i Σ_j a_i a_j k(x_i, x_j) ≥ 0 for all a_i, a_j ∈ ℝ, x ∈ ℝ^D, D ∈ ℤ⁺.
Note that the use of kernels for regression in our context should not be confused with non-parametric
methods commonly called "kernel regression" that involve using a kernel to construct a weighted
local estimate.
thus think of k(x_i, x_j) as a measure of the similarity of x_i to x_j.
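To make the similarity interpretation concrete, here is a minimal sketch of the Gaussian kernel in equation (3.1); the function name and example points are ours, for illustration only:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma2):
    """k(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2): equals 1 for identical
    points and approaches 0 as the points grow distant."""
    d2 = np.sum((np.asarray(xi, float) - np.asarray(xj, float)) ** 2)
    return np.exp(-d2 / sigma2)

x1 = np.array([0.0, 0.0])
x2 = np.array([3.0, 4.0])                      # Euclidean distance 5 from x1

k_same = gaussian_kernel(x1, x1, sigma2=2.0)   # maximal similarity: exactly 1
k_far = gaussian_kernel(x1, x2, sigma2=2.0)    # distant points: near 0
```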
Under the "similarity-based view", we assert that the target function y = f(x)
can be approximated by some function in the space of functions represented by⁴

f(x) = Σ_{i=1}^{N} c_i k(x, x_i)     (3.2)

where k(x, x_i) measures the similarity between our point of interest (x) and one of
N input patterns x_i, and c_i is a weight for each input pattern. The key intuition
behind this approach is that it does not model y_i as a linear function of x_i. Rather, it
leverages information about the similarity between observations. To see this, consider
some test point x* at which we would like to evaluate the function value given fixed
input patterns x_i and weights c_i. For such a test point, the predicted value is given
by

f(x*) = c_1 k(x*, x_1) + c_2 k(x*, x_2) + ... + c_N k(x*, x_N)     (3.3)
      = c_1 (similarity of x* to x_1) + c_2 (sim. of x* to x_2) + ... + c_N (sim. of x* to x_N)
That is, the outcome is linear in the similarities of the target point to each observation,
and the closer x* comes to some xj, the greater the "influence" of xj on the predicted
f(x*).
This approach to understanding how equation (3.2) fits complex functions
is what we refer to as the "similarity view." It highlights a fundamental difference
between KRLS and the GLM approach. With GLMs, we assume that the outcome
is a weighted sum of the independent variables. In contrast, KRLS is based on the
premise that information is encoded in the similarity between observations, with more
similar observations expected to have more similar outcomes. We argue that this
latter approach is more natural and powerful in most social science circumstances:
in most reasonable cases, we expect that the nearness of a given observation, xi, to
other observations reveals information about the expected value of yi, which suggests
a large space of smooth functions in which observations close to each other in X are
⁴ Below we provide a formal justification for this space based on ridge regressions in high-dimensional feature spaces.
close to each other in y.
Superposition of Gaussians View
Another useful perspective is the "superposition of Gaussians" view. Recalling that
k(·, x_i) traces out a Gaussian curve centered over x_i, we slightly rewrite our function
approximation as

f(·) = c_1 k(·, x_1) + c_2 k(·, x_2) + ... + c_N k(·, x_N).     (3.5)

The resulting function can be thought of as the superposition of Gaussian curves,
centered over the exemplars (x_i) and scaled by their weights (c_i). Figure 3-1 illustrates
six random samples of functions in this space. We draw eight data points x_i ~
Uniform(0, 1) and weights c_i ~ N(0, 1) and compute the target function by centering
a Gaussian over each x_i, scaling each by its c_i, and then summing them (the dots
represent the data points, the dotted lines refer to the scaled Gaussian kernels, and
the solid lines represent the target function created from the superposition). This
figure shows that the function space is much more flexible than the function spaces
available to GLMs; it enables us to approximate highly non-linear and non-additive
functions that may characterize the data-generating process in social science data.
The same logic generalizes seamlessly to multiple dimensions.
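The construction behind Figure 3-1 can be sketched as follows; the bandwidth value here is our own illustrative choice, not one used in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.uniform(0.0, 1.0, size=N)   # exemplars x_i ~ Uniform(0, 1)
c = rng.normal(0.0, 1.0, size=N)    # weights c_i ~ N(0, 1)
sigma2 = 0.02                       # bandwidth chosen for visible wiggliness

def f(x_star):
    """Superpose one Gaussian bump per exemplar, each scaled by its weight."""
    return float(sum(ci * np.exp(-(x_star - xi) ** 2 / sigma2)
                     for ci, xi in zip(c, x)))

grid = np.linspace(0.0, 1.0, 200)
values = np.array([f(t) for t in grid])   # one random function from the space
```

Plotting `values` against `grid` reproduces the qualitative look of the figure: a highly flexible, non-linear curve built from nothing but weighted Gaussian bumps.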
In this view, for a given dataset, KRLS would fit the target function by placing
Gaussians over each of the observed exemplars x_i and scaling them such that the
summed surface approximates the target function. The process of fitting the function
requires solving for the N values of the weights c_i. We therefore refer to the
c_i weights as choice coefficients, similar to the role that β coefficients play in linear
regression. Notice that a great many choices of c_i can produce highly similar fits, a
problem resolved in the next section through regularization. (In the online appendix,
we present a toy example to build intuition for the mechanics of fitting the function;
see Figure A.1.)
Before describing how KRLS chooses the choice coefficients, we introduce a more
convenient matrix notation. Let K be the N × N symmetric kernel matrix whose
(i, j)th entry is k(x_i, x_j); it measures the pairwise similarities between each of the
N input patterns x_i. Let c = [c_1, ..., c_N]^T be the N × 1 vector of choice coefficients
and y = [y_1, ..., y_N]^T be the N × 1 vector of outcome values. Equation (3.2) can be
rewritten as

         [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_N) ] [ c_1 ]
y = Kc = [ k(x_2, x_1)      ...           ...         ] [ ... ]     (3.6)
         [ k(x_N, x_1)      ...       k(x_N, x_N)     ] [ c_N ]
In this form, we plainly see KRLS as fitting a simple linear model: we fit y for some
x_i as a linear combination of basis functions or regressors, each of which is a measure
of x_i's similarity to another observation in the dataset. Notice that the matrix K
is symmetric and, provided no input pattern is repeated, positive definite and thus
invertible.⁵ Therefore, there is a "perfect" solution to the linear system y = Kc or,
equivalently, there is a target surface created from the superposition of scaled
Gaussians that provides a perfect fit to each data point.
Regularization and the KRLS Solution
While extremely flexible, fitting functions by the method described above produces
a perfect fit of the data and invariably leads to over-fitting. This issue speaks to
the ill-posedness of the problem of simply fitting the observed data: there are many
solutions that are similarly good fits. We need to make two additional assumptions
that specify which type of solutions we prefer. Our first assumption is that we prefer
functions that minimize squared loss, which ensures that the resulting function has a
clear interpretation as a conditional expectation function (of y conditional on x).
The second assumption is that we prefer smoother, less complicated functions.
Rather than simply choosing c as c = K⁻¹y, we instead solve a different problem
⁵ This holds as long as no input pattern is repeated exactly. We relax this in the following section.
that explicitly takes into account our preference for smoothness and concerns for
over-fitting.
This is based on a common but perhaps under-utilized assumption:
in social science contexts, we often believe that the conditional expectation function
characterizing the data-generating process is relatively smooth, and that less "wiggly"
functions are more likely to be due to real underlying relationships rather than noise.
Less "wiggly" functions also provide more stable predictions at values in between
the observed data points. Put another way, for most social science inquiry, we think
that "low-frequency" relationships (in which y cycles up and down fewer times across
the range of x) are theoretically more plausible and useful than "high-frequency"
relationships. (Figure A.2 in the appendix provides an example of a low- and high-frequency explanation of the relationship between x and y.)⁶
To give preference to smoother, less complicated functions, we change the optimization
problem from one that considers only model fit to one that also considers
complexity. Tikhonov regularization (Tychonoff, 1963) proposes that we search over
some space of possible functions and choose the best function according to the rule

argmin_{f∈H}  Σ_i V(f(x_i), y_i) + λ R(f)     (3.7)

where V(f(x_i), y_i) is a loss function that computes how "wrong" the function is at
each observation, R is a "regularizer" measuring the "complexity" of function f,
and λ ∈ ℝ⁺ is a scalar parameter that governs the tradeoff between model fit and
complexity. Tikhonov regularization forces us to choose a function that minimizes a
weighted combination of empirical error and complexity. Larger values of λ impose
a heavier penalty on the complexity of the function relative to model fit; lower values
of λ have the opposite effect. Our hypothesis space, H, is the
flexible space of functions in the span of kernels built on N input patterns or, more
⁶ This smoothness prior may prove wrong if there are truly sharp thresholds or discontinuities in
the phenomenon of interest. Rarely, however, is a threshold so sharp that it cannot be fit well by
a smooth curve. Moreover, most political science data has a degree of measurement error. Given
measurement error (on x), then, even if the relationship between the "true" x and y were a step
function, the observed relationship with noise would be the convolution of a step function with the
distribution of the noise, producing a smoother curve (for example, a sigmoidal curve in the case of
normally distributed noise).
formally, the Reproducing Kernel Hilbert Space (RKHS) of functions associated with
a particular choice of kernel.

For our particular purposes, we choose the regularizer to be the square of the
L₂ norm, ⟨f, f⟩_H = ‖f‖²_H, in the RKHS associated with our kernel. It can be shown
that, for the Gaussian kernel, this choice of norm imposes an increasingly high penalty
on higher-frequency components of f. We also always use squared loss for V. The
resulting Tikhonov regularization problem is given by

argmin_{f∈H}  Σ_i (f(x_i) − y_i)² + λ‖f‖²_H.     (3.8)
Tikhonov regularization may seem a natural objective function given our preference
for low-complexity functions. As we show in the appendix, it also results more formally from encoding our prior beliefs that desirable functions tend to be less complicated and then solving for the most likely model given this preference and the
observed data.
To solve this problem, we first substitute f(x) = Kc to approximate f(x) in
our hypothesis space H.⁷ In addition, we use as the regularizer the norm
‖f‖²_K = Σ_i Σ_j c_i c_j k(x_i, x_j) = cᵀKc. The justification for this form is given below; however,
a suitable intuition is that it is akin to the sum of the squared c_i's, which itself is a
possible measure of complexity, but weighted to reflect the overlap that occurs for
points nearer to each other. The resulting problem is

c* = argmin_{c∈ℝᴺ}  (y − Kc)ᵀ(y − Kc) + λcᵀKc.     (3.9)

Accordingly, y* = Kc* provides the best-fitting approximation to the conditional
expectation of the outcome in the available space of functions given regularization.
Notice that this minimization is equivalent to a ridge regression in a new set of features,
one that measures the similarity of an exemplar to each of the other exemplars.
As we show in the appendix, we explicitly solve for the solution by differentiating the
objective function with respect to the choice coefficients c and solving the resulting
first-order conditions, finding the solution c* = (K + λI)⁻¹y.
⁷ As we explain below, we do not need an intercept since we work with demeaned data for fitting
the function.
We therefore have a closed-form solution for the estimator of the choice coefficients
that provides the solution to the Tikhonov regularization problem within our flexible
space of functions. This estimator is numerically rather benign. Given a fixed value
for λ, we compute the kernel matrix and add λ to its diagonal. The resulting matrix
is symmetric and positive definite, so inverting it is straightforward. Also, note that
the addition of λ along the diagonal ensures that the matrix is well-conditioned (for
large enough λ), which is another way of conceptualizing the stability gains achieved
by regularization.
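The entire fitting step can be sketched in a few lines, under our assumptions that the covariates are already standardized and that σ² = D (the function name and test data are ours, for illustration):

```python
import numpy as np

def krls_fit(X, y, lam, sigma2=None):
    """Solve c* = (K + lam*I)^{-1} y with a Gaussian kernel.

    Assumes X is standardized; sigma2 defaults to D, the number of
    covariates, as recommended later in the text."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    if sigma2 is None:
        sigma2 = D
    # Pairwise squared Euclidean distances, then the kernel matrix K.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / sigma2)
    # Adding lam along the diagonal keeps the system well-conditioned.
    c = np.linalg.solve(K + lam * np.eye(N), y)
    return K, c

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
K, c = krls_fit(X, y - y.mean(), lam=0.5)   # fit on demeaned y, per the text
y_hat = K @ c + y.mean()                    # fitted values on the original scale
```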
Derivation from an Infinite-Dimensional Linear Model
The above interpretations informally motivate the choices made in KRLS through our
expectation that "similarity matters" more than linearity and that, within a broad
space of smooth functions, less complex functions are preferable. Here we provide a
formal justification for the KRLS approach that offers perhaps less intuition, but has
the benefit of being generalizable to other choices of kernels and motivates both the
choice of f(x_i) = Σ_{j=1}^{N} c_j k(x_i, x_j) for the function space and cᵀKc for the regularizer.
For any positive semi-definite kernel function k(·,·), there exists a mapping φ(x) that
transforms x_i to a higher-dimensional vector φ(x_i) such that k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩.
In the case of the Gaussian kernel, the mapping φ(x_i) is infinite-dimensional. Suppose
we wish to fit a regularized linear model (i.e. a ridge regression) in the expanded
features, i.e. f(x_i) = φ(x_i)ᵀθ, where φ(x) has dimension D′ (which is ∞ in the
Gaussian case), and θ is a D′ × 1 vector of coefficients. Then, we solve

argmin_{θ∈ℝ^{D′}}  Σ_i (y_i − φ(x_i)ᵀθ)² + λ‖θ‖²     (3.10)
where θ ∈ ℝ^{D′} gives the coefficients for each dimension of the new feature space,
and ‖θ‖² = θᵀθ is simply the L₂ norm in that space. The first-order condition is
−2 Σ_i (y_i − φ(x_i)ᵀθ) φ(x_i) + 2λθ = 0. Solving partially for θ gives
θ = λ⁻¹ Σ_i (y_i − φ(x_i)ᵀθ) φ(x_i), or simply

θ = Σ_{i=1}^{N} c_i φ(x_i)     (3.11)

where c_i = λ⁻¹(y_i − φ(x_i)ᵀθ).
Equation 3.11 asserts that the solution for θ is in
the span of the features, φ(x_i). Moreover, it makes clear that the solution to our
potentially infinite-dimensional problem can be found in just N parameters, and
using only the features at the observations.⁸ Substituting θ back into f(x) = φ(x)ᵀθ, we get

f(x) = Σ_{i=1}^{N} c_i φ(x_i)ᵀφ(x) = Σ_{i=1}^{N} c_i k(x, x_i)     (3.12)
which is precisely the form of the function space we previously asserted. Note that the
use of kernels to compute inner products between each φ(x_i) and φ(x_j) in equation
3.12 prevents us from needing to ever explicitly perform the expansion implied by
φ(x_i); this is often referred to as the kernel "trick" or kernel substitution. Finally, the
norm in equation 3.10 is ‖θ‖² = ⟨θ, θ⟩ = ⟨Σ_i c_i φ(x_i), Σ_j c_j φ(x_j)⟩ = cᵀKc. Thus,
both the choice of our function space and our norm can be derived from a ridge
regression in a high- or infinite-dimensional feature space φ(x) associated with the
kernel.
3.3 KRLS in Practice: Parameters and Quantities of Interest
In this section, we address some remaining features of the KRLS approach and discuss
the quantities of interest that can be computed from the KRLS model.
⁸ This powerful result is more directly shown by the Representer theorem (Kimeldorf and Wahba,
1970).
Why Gaussian Kernels?
While users can build a kernel of their choosing to be used with KRLS, the logic is
most applicable to kernels that radially measure the distance between points. We
seek functions k(x_i, x_j) that approach 1 as x_i and x_j become identical and approach
0 as they move far away from each other, with some smooth transition in between.
Among kernels with this property, Gaussian kernels provide a suitable choice. One
intuition for this is that we can imagine some data-generating process that produces
x's with normally distributed errors. Some x's may be essentially "the same" point
but separated in observation by random fluctuations. Then, the value of k(x_i, x_j) is
proportional to the likelihood of the two observations x_i and x_j being the "same" in
this sense. Moreover, we can take derivatives of the Gaussian kernel and, thus, of the
response surface itself, which is central to interpretation.⁹
Data Pre-processing
We standardize all variables prior to analysis by subtracting off the sample means and
dividing by the sample standard deviations. Subtracting the mean of y is equivalent to
including an (unpenalized) intercept and simplifies the mathematics and exposition.
Subtracting the means of the x's has no effect, since the kernel is translation-invariant.
The re-scaling operation is commonly invoked in penalized regressions for norms L_q
with q > 0, including ridge, bridge, Least Absolute Shrinkage and Selection Operator
(LASSO), and elastic-net methods, because in these approaches the penalty
depends on the magnitudes of the coefficients and thus on the scale of the data.
Re-scaling by the standard deviation ensures that unit-of-measure decisions have no
effect on the estimates. As a second benefit, re-scaling enables us to use a simple and
fast approach for choosing σ² (see below). Note that this re-scaling does not interfere
with interpretation or generalizability; all estimates are returned to the original scale
⁹ In addition, by choosing the Gaussian kernel, KRLS is made similar to Gaussian process
regression, in which each point y_i is assumed to be a normally distributed random variable, and
part of a joint normal distribution together with all other y_j, with the covariance between any two
observations y_i, y_j (taken over the space of possible functions) being equal to k(x_i, x_j).
and location.¹⁰
Choosing the Regularization Parameter λ
As formulated, there is no single "correct" choice of λ, a property shared with other
penalized regression approaches such as ridge, bridge, LASSO, etc. Nevertheless,
cross-validation provides a now-standard approach (see, e.g., Hastie et al. (2009))
for choosing reasonable values that perform well in practice. We follow previous
work on RLS-related approaches and choose λ by minimizing the sum of the squared
leave-one-out errors (LOOE) by default (e.g. Schölkopf and Smola, 2002; Rifkin and
Lippert, 2007; Rifkin et al., 2003). For leave-one-out validation, the model is trained
on N − 1 observations and tested on the left-out observation. For a given test value
of λ, this can be done N times, producing a prediction for each observation that
does not depend on that observation itself. The N errors from these predictions can
then be squared and summed to measure the goodness of out-of-sample fit for that
choice of λ. Fortunately, with KRLS, the vector of N leave-one-out errors can be
computed efficiently for any valid choice of λ using the formula
LOOE = G⁻¹y / diag(G⁻¹), where G = K + λI and the division is element-wise (see
Rifkin and Lippert, 2007).¹¹
Choosing the Kernel Bandwidth σ²
To avoid confusion, we first emphasize that the role of σ² in KRLS differs from its
role in methods such as traditional kernel regression and kernel density estimation. In
those approaches, the kernel bandwidth is typically the only smoothing parameter; no
additional fitting procedure is conducted to minimize an objective function, and no
separate complexity penalty is available. In KRLS, by contrast, the kernel is used to
¹⁰ New test points for which estimates are required can be standardized using the means and standard
deviations from the original training data. Our companion software handles this automatically.
¹¹ A variant on this approach, generalized cross-validation (GCV), is equal to a weighted version
of LOOE (Golub et al., 1979). GCV can provide computational savings in
some contexts (since the trace of G⁻¹ can be computed without computing G⁻¹ itself) but less so
here, as we must compute G⁻¹ anyway to solve for c. In practice, LOOE and GCV provide nearly
identical measures of out-of-sample fit, and commonly, very similar results. Our companion software
also allows users to set their own value of λ, which can be used to implement other approaches if
needed.
form K, beyond which fitting is conducted through the choice of coefficients c, under a
penalty for complexity controlled by λ. Here, σ² enters principally as a measurement
decision incorporated into the kernel definition, determining how distant points need
to be in the (standardized) covariate space before they are considered dissimilar. The
resulting fit is thus expected to be less dependent on the exact choice of σ² than is true
of those kernel methods in which the bandwidth is the only parameter. Moreover,
since there is a tradeoff between σ² and λ (increasing either can increase smoothness),
a range of σ² values is typically acceptable and leads to similar fits after optimizing
over λ.

Accordingly, in KRLS, our goal is to choose σ² to ensure that the columns of K carry
useful information extracted from X, resulting in some units being considered similar,
some being dissimilar, and some in between. We propose that σ² = dim(X) = D is a
suitable default choice that adds no computational cost. The theoretical motivation
for this proposition is that, in the standardized data, the average (Euclidean) squared
distance between two observations that enters into the kernel calculation, E[‖x_j − x_i‖²],
is equal to 2D (see appendix). Choosing σ² to be proportional to D therefore ensures
a reasonable scaling of the average distance. Empirically, we have found that setting
σ² = D in particular has reliably resulted in good empirical performance (see simulations
below) and typically provides a suitable distribution of values in K such that
entries range from close to 1 (highly similar) to close to 0 (highly dissimilar), with a
distribution falling in between.¹²
¹² Note that our choice for σ² is consistent with advice from other work. For example, Schölkopf
and Smola (2002) suggest that an "educated guess" for σ² can be made by ensuring that the scaled
distance entering the kernel "roughly lies in the same range, even if the scaling and dimension of the
data are different," and they also choose σ² = dim(X) for the Gaussian kernel in several examples
(though without the justification given here). Our companion software also allows users to set their
own value for σ², and this feature can be used to implement more complicated approaches if needed.
In principle, one could also use a joint grid search over values of σ² and λ, for example using k-fold
cross-validation where k is typically between 5 and 10. However, this approach adds a significant
computational burden (since a new K needs to be formed for each choice of σ²), and the benefits
can be small since σ² and λ trade off with each other; so it is typically computationally more efficient
to fix σ² at a reasonable value and optimize over λ.
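The 2D claim, and the effect of the σ² = D default on the entries of K, can be checked directly; the data-generating choices below are ours, for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 200, 5
X = rng.exponential(size=(N, D))            # arbitrary raw covariates
Z = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize, as in pre-processing

d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
avg_d2 = d2.mean()          # averages to 2D over all pairs of standardized rows
K = np.exp(-d2 / D)         # default bandwidth sigma^2 = D
```

With σ² = D the diagonal of K is exactly 1, and off-diagonal entries spread between 0 and 1 rather than collapsing toward either extreme.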
3.4 Inference and Interpretation with KRLS
In this section, we provide the properties of the KRLS estimator. In particular,
we establish its unbiasedness, consistency, and asymptotic normality and derive a
closed-form estimator for its variance.¹³ We also develop new interpretational tools,
including estimators for the pointwise partial derivatives and their variances, and
discuss how the KRLS estimator protects against extrapolation when modeling extreme
counterfactuals.
Unbiasedness, Variance, Consistency, and Asymptotic Normality
Unbiasedness
We first show that KRLS unbiasedly estimates the best approximation of the true
conditional expectation function that falls in the available space of functions given
our preference for less complex functions.
ASSUMPTION 1 (FUNCTIONAL FORM) The target function we seek to estimate falls in
the space of functions representable as y* = Kc*, and we observe a noisy version of
this, y_obs = y* + ε.

These two conditions together constitute the "correct specification" requirement
for KRLS. Notice that these requirements are analogous to the familiar correct
specification assumption for the linear regression model, which states that the
data-generating process is given by y = Xβ + ε. However, as we saw above, the functional
form assumption in KRLS is much more flexible compared to linear regression or
GLMs more generally, and this guards against misspecification bias.
¹³ While statisticians and econometricians are often interested in these classical statistical properties,
machine learning theorists have largely focused attention on whether and how fast the empirical
error rate of the estimator converges to the true error rate. We are not aware of existing arguments
for unbiasedness, or the normality of KRLS point estimates, though proofs of consistency, distinct
from our own, have been given, including in frameworks with stochastic X (e.g. De Vito et al.,
2005).
ASSUMPTION 2 (ZERO CONDITIONAL MEAN) E[ε_i|X] = 0, which implies that E[ε_i K_i] =
0 (where K_i designates the ith column of K), since K is a deterministic function of
X.

This assumption is mathematically equivalent to the usual zero conditional mean
assumption used to establish unbiasedness for linear regression or GLMs more generally.
However, note that substantively, this assumption is typically weaker in KRLS
than in GLMs, which is the source of KRLS's improved robustness to misspecification
bias. In a standard OLS setup, with y = Xβ + ε_linear, unbiasedness requires that
E[ε_linear|X] = 0. Importantly, this ε_linear includes both omitted variables and
unmodeled effects of X on y that are not linear functions of X (e.g. an omitted squared
term or interaction). Thus, in addition to any omitted variable bias due to unobserved
confounders, misspecification bias also occurs whenever the unmodeled effects
of X in ε_linear are correlated with the Xs that are included in the model. In KRLS,
we instead have y = Kc + ε_krls. In this case, ε_krls is devoid of virtually any smooth
function of X because these functions are captured in the flexible model through Kc.
In other words, KRLS moves many otherwise unmodeled effects of X from the error
term into the model. This greatly reduces the chances of misspecification bias, leaving
the errors restricted principally to the unobserved confounders, which will always be
an issue in non-experimental data.
Under these assumptions, we can establish the unbiasedness of the KRLS estimator,
meaning that the expectation of the estimator ĉ* obtained from running KRLS on
y_obs, which minimizes the penalized least squares criterion, equals its true population
estimand, c*. Given this unbiasedness result, we can also establish
unbiasedness for the fitted values.

THEOREM 1 (UNBIASEDNESS OF CHOICE COEFFICIENTS) Under assumptions 1-2, E[ĉ*|X] = c*.
The proof is given in the appendix.

THEOREM 2 (UNBIASEDNESS OF FITTED VALUES) Under assumptions 1-2, E[ŷ] = y*.
The proof is given in the appendix.
We emphasize that this definition of unbiasedness says only that the estimator
is unbiased for the best approximation of the conditional expectation function given
penalization.¹⁴ In other words, unbiasedness here establishes that we get the correct
answer in expectation for y* (not y), regardless of noise added to the observations.
While this may seem like a somewhat dissatisfying notion of unbiasedness, it is precisely the sense in which many other approaches are unbiased, including OLS. If, for
example, the "true" data-generating process includes a sharp discontinuity that we
do not have a dummy variable for, then KRLS will always instead choose a function
that smooths this out somewhat, regardless of N, just as a linear model will not
correctly fit a non-linear function. The benefit of KRLS over GLMs is that the space
of allowable functions is much larger, making the "correct specification" assumption
much weaker.
Variance
Here, we derive a closed-form estimator for the variance of ĉ*, the KRLS estimator of
the choice coefficients that minimizes the penalized least squares criterion, conditional on
a given λ. This is important because it allows researchers to conduct hypothesis
tests and construct confidence intervals. We utilize a standard homoscedasticity
assumption, although the results could be extended to allow for heteroscedastic, serially
correlated, or grouped error structures. We note that, as in OLS, the values for the
point estimates of interest (e.g. ŷ and the pointwise derivatives discussed below) do not
depend on this homoscedasticity assumption. Rather, an assumption over the error
structure is needed for computing variances.

ASSUMPTION 3 (SPHERICAL ERRORS) The errors are homoscedastic and have zero
serial correlation, such that E[εεᵀ|X] = σ²_ε I.
¹⁴ Readers will recognize that classical ridge regression, usually in the span of X rather than φ(X),
is biased, in that the coefficients achieved are biased relative to the unpenalized coefficients. Imposing
this bias is, in some sense, the purpose of ridge regression. However, if one is seeking to estimate the
post-penalization function because regularization is desirable to identify the most reliable function
for making new predictions, the procedure is unbiased for estimating that post-penalization function.
LEMMA 1 (VARIANCE OF CHOICE COEFFICIENTS) Under assumptions 1-3, the variance
of the choice coefficients is given by Var[ĉ*|X, λ] = σ²_ε (K + λI)⁻². The proof is
given in the appendix.

LEMMA 2 (VARIANCE OF FITTED VALUES) Under assumptions 1-3, the variance of
the fitted values is given by Var[ŷ|X, λ] = Var[Kĉ*|X, λ] = Kᵀ[σ²_ε (K + λI)⁻²]K.

In many applications, we also need to estimate the variance of fitted values for
new counterfactual predictions at specific test points. We can compute these out-of-sample
predictions using ŷ_test = K_test ĉ*, where K_test is the N_test × N_train dimensional
kernel matrix that contains the similarity measures of each test observation to each
training observation.¹⁵

LEMMA 3 (VARIANCE FOR TEST POINTS) Under assumptions 1-3, the variance for
predicted outcomes at test points is given by Var[ŷ_test|X, λ] = K_test Var[ĉ*|X, λ] K_testᵀ
= K_test[σ²_ε (K + λI)⁻²]K_testᵀ.
Our companion software implements these variance estimators. We estimate σ²_ε by
σ̂²_ε = (1/N) Σ_i e_i² = (1/N)(y − Kĉ*)ᵀ(y − Kĉ*). Note that all variance estimates above are
conditional on the user's choice of λ. This is important, since the variance does indeed
depend on λ: higher choices of λ always imply the choice of a more stable (but less
well-fitting) solution, producing lower variance. Recall that λ is not a random variable
with a distribution but, rather, a choice regarding the tradeoff of fit and complexity
made by the investigator. LOOE provides a reasonable criterion for choosing this
parameter, and so variance estimates are given for λ = λ_LOOE.¹⁶
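The variance estimators of Lemmas 1-3 can be sketched as follows, under the homoscedasticity assumption; the function name and test data are ours:

```python
import numpy as np

def krls_variances(K, y, lam, K_test=None):
    """Closed-form variance estimates from Lemmas 1-3, conditional on lam."""
    N = len(y)
    G_inv = np.linalg.inv(K + lam * np.eye(N))
    c = G_inv @ y                              # choice coefficients c*
    resid = y - K @ c
    sigma2_e = float(resid @ resid) / N        # estimate of sigma^2_epsilon
    var_c = sigma2_e * (G_inv @ G_inv)         # Lemma 1: sigma^2 (K + lam*I)^{-2}
    var_yhat = K.T @ var_c @ K                 # Lemma 2: variance of fitted values
    out = {"c": c, "sigma2_e": sigma2_e, "var_c": var_c, "var_yhat": var_yhat}
    if K_test is not None:
        out["var_test"] = K_test @ var_c @ K_test.T   # Lemma 3: test points
    return out

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-d2 / X.shape[1])
y = X[:, 0] + 0.3 * rng.normal(size=30)
res = krls_variances(K, y - y.mean(), lam=1.0)
se_yhat = np.sqrt(np.diag(res["var_yhat"]))    # pointwise standard errors
```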
Consistency
In machine learning, attention is usually given to bounds on the error rate of a
given method, and to how this error rate changes with the sample size. When the
¹⁵ To reduce notation, here we condition simply on X, but we intend this X to include both the
original training data (used to form K) and the test data (needed to form K_test).
¹⁶ Though we suppress the notation, variance estimates are technically conditional on the choice
of σ² as well. Recall that, in our setup, σ² is not a random variable; it is set to the dimension of the
input data as a mechanical means of rescaling Euclidean distances appropriately.
probability limit of the sample error rate reaches the irreducible approximation
error (i.e. the best error rate possible for a given problem and a given learning
machine), the approach is said to be consistent (e.g. De Vito et al., 2005). Here, we
are instead interested in consistency in the classical sense, i.e. determining whether
plim_{N→∞} ŷ_{i,N} = y*_i for all i. Since we have already established that E[ŷ_i] = y*_i, all that
remains to prove consistency is that the variance of ŷ_i goes to zero as N grows large.
ASSUMPTION 4 (REGULARITY CONDITION I) Let (i) λ > 0 and (ii) as N → ∞, for
eigenvalues of K given by α_i, Σ_i α_i/(α_i + λ) grow slower than N once N > M for some
M < ∞.

THEOREM 3 (CONSISTENCY) Under assumptions 1-4, E[ŷ|X] = y* and
plim_{N→∞} Var[ŷ|X, λ] = 0, so the estimator is therefore consistent, with
plim_{N→∞} ŷ_{i,N} = y*_i for all i.

The proof is provided in the appendix.
Our proof provides several insights, which we briefly highlight here. The degrees of
freedom of the model can be related to the effective number of non-zero eigenvalues.
The number of effective eigenvalues, in turn, is given by Ei ,--
where as are the
eigenvalues of K. This generates two important insights. First, some regularization is
needed (A > 0) or this quantity grows exactly as N does. Without regularization (A =
0), new observations translate into added complexity rather than added certainty;
accordingly, the variances do not shrink.
because of the regularization.
Thus, consistency is achieved precisely
Second, regularization greatly reduces the number
of effective degrees of freedom, driving the eigenvalues that are small relative to A
essentially to zero. Empirically, a model with hundreds or thousands of observations,
which could theoretically support as many degrees of freedom, often turns out to
have on the order of 5-10 effective degrees of freedom. This ability to approximate
complex functions but with a preference for less complicated ones is central to the
wide applicability of KRLS. It makes models as complicated as needed but not more
so and it gains from the efficiency boost when simple models are sufficient. As we
show below, the regularization can rescue so much efficiency that the resulting KRLS
model is not much less efficient than an OLS regression even for linear data.
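The role of regularization in limiting model complexity can be sketched numerically. The following Python snippet (the companion software is in R; this is an illustrative reimplementation, and the values of σ² and λ here are arbitrary choices, not the package defaults) computes the effective degrees of freedom tr(K(K + λI)^{-1}) = Σ_i a_i/(a_i + λ) for a Gaussian kernel matrix:

```python
import numpy as np

def gaussian_kernel(X, sigma2):
    # K_ij = exp(-||x_i - x_j||^2 / sigma2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / sigma2)

rng = np.random.default_rng(0)
N, D = 500, 2
X = rng.normal(size=(N, D))

K = gaussian_kernel(X, sigma2=D)   # sigma^2 = D, following the text's convention
lam = 0.5                          # illustrative; KRLS chooses lambda by LOOE

a = np.linalg.eigvalsh(K)          # eigenvalues a_i of the symmetric kernel matrix
eff_df = np.sum(a / (a + lam))     # tr(K (K + lam I)^{-1})

print(f"N = {N}, effective degrees of freedom = {eff_df:.1f}")
```

With λ = 0 every term a_i/(a_i + λ) equals one, so the trace equals N; any positive λ collapses the many near-zero eigenvalues, leaving far fewer effective degrees of freedom than observations.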
Finite Sample and Asymptotic Distribution of ŷ
Here, we establish the asymptotic normality of the KRLS estimator. First, we establish
that the estimator is normally distributed in finite samples when the elements of
ε are i.i.d. normal.
ASSUMPTION 5 (NORMALITY) The errors are distributed normally, ε_i ~iid N(0, σ²_ε).

THEOREM 4 (NORMALITY IN FINITE SAMPLES) Under assumptions 1-5, ŷ ~ N(y*, (σ_ε K(K + λI)^{-1})²).
The proof is given in the appendix.
Second, we establish that the estimator is also normal asymptotically even when e
is non-normal but independently drawn from a distribution with a finite mean and
variance.
ASSUMPTION 6 (REGULARITY CONDITIONS II) Let (i) the errors be independently drawn
from a distribution with a finite mean and variance and (ii) the standard Lindeberg
conditions hold, such that the sum of the variances of the terms in the summation
Σ_j [K(K + λI)^{-1}]_{(i,j)} ε_j goes to infinity as N → ∞ and the summands are uniformly
bounded, i.e. there exists some constant a such that |[K(K + λI)^{-1}]_{(i,j)}| ≤ a
for all j.
THEOREM 5 (ASYMPTOTIC NORMALITY) Under assumptions 1-4 and 6, ŷ →_d N(y*, (σ_ε K(K + λI)^{-1})²)
as N → ∞. The proof is given in the appendix. The resulting asymptotic
distribution used for inference on any given ŷ_i is

(ŷ_i − y*_i) / sqrt([(σ_ε K(K + λI)^{-1})²]_{(i,i)}) →_d N(0, 1).    (3.13)
Theorem 4 is corroborated by simulations, which show that 95% confidence intervals
based on standard errors computed by this method (a) closely match confidence intervals constructed from a non-parametric bootstrap and (b) have accurate empirical
coverage rates under repeated sampling where new noise vectors are drawn for each
iteration.
Taken together, these new results establish the desirable theoretical properties of
the KRLS estimator for the conditional expectation: it is unbiased for the best-fitting
approximation to the true Conditional Expectation Function (CEF) in a large space
of (penalized) functions (Theorems 1 and 2), it is consistent (Theorem 3), and it is
asymptotically normally distributed given standard regularity conditions (Theorems
4 and 5). Moreover, variances can be estimated in closed form (Lemmas 1-3).
Interpretation and Quantities of Interest
One important benefit of KRLS over many other flexible modeling approaches is that
the fitted KRLS model lends itself to a range of interpretational tools, which we
develop in this section.
Estimating E[yIX] and First Differences
The most straightforward interpretive element of KRLS is that we can use it to
estimate the expectation of y conditional on X = x. From here, we can compute
many quantities of interest, such as first differences or marginal effects.
We can
also produce plots that show how the predicted outcomes change across a range of
values for a given predictor variable while holding the other predictors fixed. For
example, we can construct a dataset in which one predictor x(a) varies across a range
of test values and the other predictors remain fixed at some constant value (e.g. the
means) and then use this dataset to generate predicted outcomes, add a confidence
envelope, and plot them against x(a) to explore ceteris paribus changes. Similar plots
are typically used to interpret GAM models; however, the advantage of KRLS is
that the learned model that is used to generate predicted outcomes does not rely on
the additivity assumptions typically required for GAMs. Our companion software
includes an option to produce such plots.
Partial Derivatives
We derive an estimator for the pointwise partial derivatives of ŷ with respect to
any particular input variable, x^(d), which allows researchers to directly explore the
pointwise marginal effects of each input variable and summarize them, for example,
in the form of a regression table. Let x^(d) be a particular variable such that X =
[x^(1) ... x^(d) ... x^(D)]. Then, for a single observation j, the partial derivative of ŷ with
respect to variable d is estimated by

∂ŷ_j/∂x_j^(d) = (−2/σ²) Σ_i c_i e^(−‖x_i − x_j‖²/σ²) (x_j^(d) − x_i^(d)).    (3.14)
The KRLS pointwise partial derivatives may vary across every point in the covariate
space. One way to summarize the partial derivatives is to take their expectation. We
thus estimate the sample-average partial derivative of ŷ with respect to x^(d) as

(1/N) Σ_j (−2/σ²) Σ_i c_i e^(−‖x_i − x_j‖²/σ²) (x_j^(d) − x_i^(d)).    (3.15)
We also derive the variance of this quantity, and our companion software computes the pointwise and the sample-average partial derivatives for each input variable
together with their standard errors. The benefit of the sample-average partial derivative estimator is that it reports something akin to the usual β̂ produced by linear
regression: an estimate of the average marginal effect of each independent variable.
However, there is a key difference between taking a best linear approximation to the
data (as in OLS) versus fitting the CEF flexibly and then taking the average partial
derivative in each dimension (as in KRLS). OLS gives a linear summary, but is highly
susceptible to misspecification bias, in which the unmodeled effects of some observed
variables can be mistakenly attributed to other observed variables. KRLS is much
less susceptible to this bias because it first fits the CEF more flexibly and then can
report back an average derivative over this improved fit.
Since KRLS provides partial derivatives for every observation, it allows for interpretation beyond the sample-average partial derivative. Plotting histograms of the
pointwise derivatives and plotting the derivative of ŷ with respect to x^(d) as a function of x^(d) are useful interpretational tools. Plotting a histogram of ∂ŷ_i/∂x_i^(d) over all i
can quickly give the investigator a sense of whether the effect of a particular variable
is relatively constant or very heterogeneous. It may turn out that the distribution
of ∂ŷ/∂x^(d) is bimodal, having a marginal effect that is strongly positive for one group
of observations and strongly negative for another group. While the average partial
derivative (or a β coefficient) would return a result near zero, this would obscure the
fact that the variable in question is having a strong effect but in opposite directions
depending on the levels of other variables. KRLS is well-suited to detect such effect
heterogeneity. Our companion software includes an option to plot such histograms,
as well as a range of other quantities.
Binary Independent Variables
KRLS works well with binary independent variables; however, they must be interpreted by a different approach than continuous variables. Given a binary variable
x^(b), the pointwise partial derivative is only observed where x^(b) = 0 or where x^(b) = 1.
The partial derivatives at these two points do not characterize the expected effect of
going from x^(b) = 0 to x^(b) = 1.17 If the investigator wishes to know the expected
difference in ŷ between a case in which x^(b) = 0 and one in which x^(b) = 1, as is usually
the case, we must instead compute first-differences directly. Let all other covariates
(besides the binary covariate in question) be given by X. The first-difference sample
estimator is (1/N) Σ_i [ŷ(x_i^(b) = 1, X = x_i) − ŷ(x_i^(b) = 0, X = x_i)]. This is computed
by taking the mean ŷ in one version of the dataset in which all X's retain their original values and all x^(b) = 1 and then subtracting from this the mean ŷ in a dataset
where all the values of x^(b) = 0. In the appendix, we derive closed-form estimators for
the standard errors for this quantity. Our companion software detects binary variables and reports the first-difference estimate and its standard error, allowing users
to interpret these effects as they are accustomed to from regression tables.

17 The predicted function that KRLS fits for a binary input variable is a sigmoidal curve, less
steep at the two endpoints than at the (unobserved) values in between. Thus, the sample-average
partial derivative on such variables will underestimate the marginal effect of going from 0 to 1 on
this variable.
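This first-difference computation can be sketched as follows. The Python code below is illustrative (the companion R package automates the detection and the standard errors); the data-generating process, bandwidth, and λ are arbitrary choices for demonstration:

```python
import numpy as np

def krls(X, y, sigma2, lam):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / sigma2)
    c = np.linalg.solve(K + lam * np.eye(len(y)), y - y.mean())
    return c, y.mean()

def predict(Xtest, Xtrain, c, ymean, sigma2):
    sq = np.sum((Xtest[:, None, :] - Xtrain[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / sigma2) @ c + ymean

rng = np.random.default_rng(2)
N = 300
b = rng.integers(0, 2, size=N).astype(float)   # binary regressor x^(b)
x2 = rng.normal(size=N)
X = np.column_stack([b, x2])
y = 3 * b + x2 + rng.normal(0, 0.2, size=N)    # true first difference: 3

c, ymean = krls(X, y, sigma2=2.0, lam=0.01)

# First difference: predict with x^(b) forced to 1, then to 0, for every case
X1, X0 = X.copy(), X.copy()
X1[:, 0], X0[:, 0] = 1.0, 0.0
fd = (predict(X1, X, c, ymean, sigma2=2.0)
      - predict(X0, X, c, ymean, sigma2=2.0)).mean()
print(f"first-difference estimate: {fd:.2f}")  # near the true value 3
```

Averaging the prediction gap over all observations, rather than evaluating the derivative at the two observed endpoints, recovers the expected effect of switching the dummy.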
E[y|x] Returns to E[y] for Extreme Examples of x
One important result is that KRLS protects against extrapolation when modeling extreme counterfactuals. Suppose we attempt to model a value of y for a test point x_j.
If x_j lies far from all the observed data points, then k(x_i, x_j) will be close to zero for
all i. Thus, by equation (3.2), f(x_j) will be close to zero, which also equals the mean
of y due to pre-processing. Thus, if we attempt to predict y for a new counterfactual
example that is far from the observed data, our estimate approaches the sample mean
of the outcome variable. This property of the estimator is both useful and sensible. It
is useful because it protects against highly model-dependent counterfactual reasoning
based on extrapolation. In linear models, for example, counterfactuals are modeled
as though the linear trajectory of the CEF continues on indefinitely, creating a risk
of producing highly implausible estimates (King and Zeng, 2006). This property is
also sensible, we argue, because, in a Bayesian sense, it reflects the knowledge that
we have for extreme counterfactuals. Recall that, under the similarity-based view,
the only information we need about observations is how similar they are to other
observations; the matrix of similarities, K, is a sufficient statistic for the data. If
an observation is so unusual that it is not similar to any other observation, our best
estimate of E[y|X = x_j] would simply be E[y], as we have no basis for updating that
expectation.
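This behavior is easy to verify numerically. In the sketch below (Python, with illustrative values throughout), a test point far outside the training data receives a prediction equal to the training mean of y, because every kernel similarity k(x_i, x_j) underflows to zero:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
X = rng.normal(size=(N, 2))
y = X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.3, size=N)

sigma2, lam = 2.0, 0.1
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-sq / sigma2)
c = np.linalg.solve(K + lam * np.eye(N), y - y.mean())   # fit on demeaned y

def predict(x_new):
    k = np.exp(-np.sum((X - x_new) ** 2, axis=1) / sigma2)
    return k @ c + y.mean()        # add the mean back after prediction

far_point = np.array([100.0, 100.0])   # far from all training data
print(predict(far_point), y.mean())    # the two values coincide
```

Because k(x_i, x_j) ≈ 0 for all i, the kernel part of the prediction vanishes and only the sample mean survives, exactly as described above.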
3.5 Simulation Results
Here, we show simulation examples of KRLS that illustrate certain aspects of its
behavior. Further examples are presented in the online appendix.
Leverage Points
One weakness of OLS is that a single aberrant data point can have an overwhelming
effect on the coefficients and lead to unstable inferences. This concern is mitigated
in KRLS due to the complexity-penalized objective function: adjusting the model
to accommodate a single aberrant point typically adds more in complexity than it
makes up for by improving model fit. To test this, we consider a linear data-generating
process, y = 2x + ε. In each simulation, we draw x ~ Unif(0, 1) and ε ~ N(0, .3).
We then contaminate the data by setting a single data point to (x = 5, y = -5),
which is off the line described by the target function. As shown in the left panel of
Figure 3-2, this single bad leverage point strongly biases the OLS estimates of the
average marginal effect downwards (open circles), while the estimates of the average
marginal effect from KRLS are robust even at small sample sizes (closed circles).
Efficiency Comparison
We expect that the added flexibility of KRLS will reduce the bias due to misspecification error but at the cost of increased variance due to the usual bias-variance
tradeoff. However, regularization helps to prevent KRLS from suffering this problem
too severely. The regularizer imposes a high penalty on complex, high-frequency functions, effectively reducing the space of functions and ensuring that small variations
in the data do not lead to large variations in the fitted function. Thus, it reduces
the variance. We illustrate this using a linear data-generating process, y = 2x + ε,
x ~ N(0, 1), and ε ~ N(0, .25), such that OLS is guaranteed to be the most efficient
unbiased linear estimator according to the Gauss-Markov theorem. The right panel
in Figure 3-2 compares the standard error of the sample-average partial derivative
estimated by KRLS to that of β̂ obtained by OLS. As expected, KRLS is not as efficient as OLS. However, the efficiency cost is quite modest, with the KRLS standard
error, on average, being only 14% larger than the standard errors from OLS. The
efficiency cost is relatively low due to regularization, as discussed above. Both OLS
and KRLS standard errors decrease at the rate of roughly 1/√N, as suggested by
our consistency results.
Over-fitting
A possible concern with flexible estimators is that they may be prone to overfitting,
especially in large samples. With KRLS, regularization helps to prevent over-fitting
by explicitly penalizing complex functions. To demonstrate this point, we consider
a high-frequency function given by y = .2 sin(12πx) + sin(2πx) + ε and run simulations
with x ~ Unif(0, 1) and ε ~ N(0, 1) at two sample sizes, N = 40 and N = 400.
The results are displayed in the left panel of Figure 3-3. We find that, for the
small sample size, KRLS approximates the high-frequency target function (solid line)
well with a smooth low-frequency approximation (dashed line). This approximation
remains stable at the larger sample size (dotted line), indicating that KRLS is not
prone to over-fit the function even as N grows large. This admittedly depends on the
appropriate choice of A, which is automatically chosen in all examples by LOOE as
described above.
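The LOOE selection of λ exploits the standard closed-form leave-one-out identity for linear smoothers, so no refitting is needed. A Python sketch (the grid, bandwidth, and data here are illustrative; the companion package's actual search differs):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
x = rng.uniform(0, 1, size=N)
y = 0.2 * np.sin(12 * np.pi * x) + np.sin(2 * np.pi * x) + rng.normal(0, 1, size=N)

sq = (x[:, None] - x[None, :]) ** 2
K = np.exp(-sq / 1.0)          # Gaussian kernel with sigma^2 = D = 1 here
yc = y - y.mean()

def looe(lam):
    # Smoother matrix G = K (K + lam I)^{-1}; LOO residual_i = e_i / (1 - G_ii)
    G = K @ np.linalg.inv(K + lam * np.eye(N))
    resid = yc - G @ yc
    return np.sum((resid / (1 - np.diag(G))) ** 2)

grid = 10.0 ** np.arange(-4, 3)          # candidate lambda values
scores = np.array([looe(l) for l in grid])
best_lam = grid[np.argmin(scores)]
print(f"lambda chosen by LOOE: {best_lam}")
```

Because the hold-out residuals come from a closed-form identity rather than N refits, the full grid search costs only a handful of matrix inversions.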
Non-smooth Functions
One potential downside of regularization is that KRLS is not well-suited to estimate
discontinuous target functions. In the right panel of Figure 3-3, we use the same setup
from the over-fitting simulation above but replace the high-frequency function with
a discontinuous step function. KRLS does not approximate the step well at N = 40,
and the fit improves only modestly at N = 400, still failing to approximate the
sharp discontinuity. However, KRLS still performs much better than the comparable
OLS estimate, which uses x as a continuous regressor. The fact that KRLS tries to
approximate the step with a smooth function is expected and desirable. For most
social science problems, we would assume that the target function is continuous in
the sense that very small changes in the independent variable are not associated with
dramatic changes in the outcome variable, which is why KRLS uses such a smoothness
prior by construction. Of course, if the discontinuity is known to the researcher, it
should be directly incorporated into the KRLS or the OLS model by using a dummy
variable x' = 1[x > .5] instead of the continuous regressor x. Both methods would
then exactly fit the target function.
Interactions
We now turn to multivariate functions. First, we consider the standard interaction
model where the target function is y = .5 + x1 + x2 − 2(x1 · x2) + ε with x_j ~
Bernoulli(.5) for j = 1, 2 and ε ~ N(0, .5). We fit KRLS and OLS models that
include x1 and x2 as covariates and test the out-of-sample performance using the
R² for predictions of y at 1000 test points drawn from the same distribution as the
covariates.
The upper panel in Figure 3-4 shows the out-of-sample R² estimates.
KRLS (closed circles) accurately learns the interaction from the data and approaches
the true R² as the sample size increases. OLS (open circles) misses the interaction
and performs poorly even as the sample size increases.
Of course, in this simple case, we can get the correct answer with OLS if we specify
the saturated regression that includes the interaction term (x1 · x2). However, even if
the investigator suspects that such an interaction needs to be modeled, the strategy of
including interaction terms very quickly runs up against the combinatorial explosion
of potential interactions in more realistic cases with multiple predictors. Consider a
similar simulation for a more realistic case with ten binary predictors and a target
function that contains several interactions: y = (x1 · x2) − 2(x3 · x4) + 3(x5 · x6) −
(x7 · x8) + 2(x9 · x10) + x10. Here, it is difficult to search through the myriad different
OLS specifications to find the correct model: it would take 2^10 terms to account for
all the unique possible multiplicative interactions.
This is why, in practice, social
science researchers typically include no or very few interactions in their regressions.
It is well-known that this results in often severe misspecification bias if the effects of
some covariates depend on the levels of other covariates (e.g. Brambor et al., 2006).
KRLS allows researchers to avoid this problem since it learns the interactions from
the data.
The lower panel in Figure 3-4 shows that, in this more complex example, the
OLS regression that is linear in the predictors (open circles) performs very poorly,
and this performance does not improve as the sample size increases.
Even at the
largest sample size, it still misses close to half of the systematic variation in the outcome that results from the covariates. In stark contrast, the KRLS estimator (closed
circles) performs well even at small sample sizes when there are fewer observations
than the number of possible two-way interactions (not to mention higher-order interactions). Moreover, the out-of-sample performance approaches the true R² as the
sample size increases, indicating that the learning of the function continues as the
sample size grows larger. This clearly demonstrates how KRLS obviates the need for
tedious specification searches and guards against misspecification bias. The KRLS
estimator accurately learns the target function from the data and captures complex
non-linearities or interactions that are likely to bias OLS estimates.
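To illustrate, here is a compact Python version of the two-binary-predictor interaction simulation (the λ and kernel bandwidth are arbitrary illustrative choices; the figures in the text use the full KRLS procedure in R):

```python
import numpy as np

def krls_predict(Xtr, ytr, Xte, sigma2, lam):
    # Fit kernel ridge on demeaned y, then predict at test points
    sq = np.sum((Xtr[:, None, :] - Xtr[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / sigma2)
    c = np.linalg.solve(K + lam * np.eye(len(ytr)), ytr - ytr.mean())
    sq_te = np.sum((Xte[:, None, :] - Xtr[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_te / sigma2) @ c + ytr.mean()

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(5)

def dgp(n):
    x = rng.integers(0, 2, size=(n, 2)).astype(float)
    y = 0.5 + x[:, 0] + x[:, 1] - 2 * x[:, 0] * x[:, 1] + rng.normal(0, 0.5, size=n)
    return x, y

Xtr, ytr = dgp(200)
Xte, yte = dgp(1000)

# Additive OLS: intercept, x1, x2 -- no interaction term
B = np.column_stack([np.ones(len(ytr)), Xtr])
beta, *_ = np.linalg.lstsq(B, ytr, rcond=None)
ols_pred = np.column_stack([np.ones(len(yte)), Xte]) @ beta

krls_pred = krls_predict(Xtr, ytr, Xte, sigma2=2.0, lam=0.01)

print(f"out-of-sample R2 -- OLS: {r2(yte, ols_pred):.2f}, "
      f"KRLS: {r2(yte, krls_pred):.2f}")
```

The additive OLS misses essentially all of the systematic variation (the target is an XOR-like function of the two dummies), while the kernel fit recovers the four cell means without the interaction ever being specified.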
The Dangers of OLS with Multiplicative Interactions
Here, we show how the strategy of adding interaction terms can easily lead to incorrect
inferences even in simple cases. Consider two correlated predictors x1 ~ Unif(0, 2)
and x2 = x1 + ξ with ξ ~ N(0, 1). The true target function is y = 5x1² and, thus,
only depends on x1 with a mild non-linearity. This non-linearity is so mild that, in
reasonably noisy samples, even a careful researcher who follows the textbook recommendations and first inspects a scatterplot between the outcome and x1 might mistake
it for a linear relationship. The same is true for the relationship between the outcome
and the (conditionally irrelevant) predictor x2. Given this, a researcher who has no
additional knowledge about the true model is likely to fit a rather "flexible" regression
model with a multiplicative interaction term given by y = α + β1 x1 + β2 x2 + β3 (x1 · x2).
To examine the performance of this model, we run a simulation that adds random
noise and fits the model using outcomes generated by y = 5x1² + ε where ε ~ N(0, 2).
The second column in Table 3.1 displays the coefficient estimates from the OLS
regression (averaged across the simulations) together with their bootstrapped standard errors. In the eyes of the researcher, the OLS model performs rather well. Both
lower-order terms and the interaction term are highly significant, and the model fit is
good with R² = .89. In reality, however, using OLS with the added interaction term
leads us to entirely false conclusions. We conclude that x1 has a positive effect, and
the magnitude of this effect increases with higher levels of x 2. Similarly, x 2 appears
to have a negative effect at low levels of x1 and a positive effect at high levels of
x 1 . Both conclusions are false and an artefact of misspecification bias. In truth, no
interaction effect exists; the effect of x1 only depends on levels of x1 and x 2 has no
effect at all.
The third column in Table 3.1 displays the estimates of the average pointwise
derivatives from the KRLS estimator, which accurately recover the true average
derivatives. The magnitude of the average marginal effect of x 2 is zero and highly
insignificant. The average marginal effect of x 1 is highly significant and estimated at
9.2, which is fairly accurate given that x1 is uniform between 0 and 2 (so we expect
an average marginal effect of 10). Moreover, KRLS gives us more than just the average derivatives: it allows us to examine the effect of heterogeneity by examining the
marginal distribution of the pointwise derivatives. The next three columns display
the first, second, and third quartile of the distributions of the marginal effects of the
two predictors. The marginal effect of x 2 is close to zero throughout the support of
x 2 , which is accurate given that this predictor is indeed irrelevant for the outcome.
The marginal effect of x1 varies greatly in magnitude from about 5 at the first quartile
to more than 14 at the third quartile. This accurately captures the non-linearity in
the true effect of x1.
Common Interactions and Non-additivity
Here, we show how KRLS is well-suited to fit target functions that are non-additive
and/or involve more complex interactions as they arise in social science research. For
the sake of presentation, we focus on target functions that involve two independent
variables, but the principles generalize to higher-dimensional problems. We consider
three types of functions: those with one "hill" and one "valley," two hills and two
valleys, or three hills and three valleys (see appendix, Figures A.4, A.5, and A.5,
respectively). These functions, especially the first two, correspond to rather common
scenarios in the social sciences where the effect of one variable changes or dissipates
depending on the effect of another. We simulate each type of function using 200
observations, x1, x2 ~ Unif(0, 1), and noise given by ε ~ N(0, .25). We then fit these
data using KRLS, OLS, and GAMs. The results are averaged over 100 simulations.
In the online appendix, we provide further explanation and visualizations pertaining
to each simulation.
Table 3.2 displays both the in-sample and out-of-sample R² (based on 200 test
points drawn from the same distribution as the training sample) for all three target
functions and estimators. KRLS provides better in- and out-of-sample fits for all
three target functions, and the out-of-sample R² for each model is close to the true
R² that one would obtain knowing the functional form. These simulations increase
our confidence that KRLS can capture complex non-linearity, non-additivity, and
interactions that we may expect in social science data.
While such features may
be easy to detect in examples like these that only involve two predictors, they are
even more likely in higher-dimensional problems where complex interactions and nonlinearities are very hard to detect using plots or traditional diagnostics.
Comparison to Other Approaches
KRLS is not a panacea for all that ails empirical research, but our proposition is that
it provides a useful addition to the empirical toolkit of social scientists, especially
those currently using GLMs, because of (a) the appropriateness of its assumptions to
social science data, (b) its ease of use, and (c) the interpretability and ease with which
relevant quantities of interest and their variances are produced. It therefore fulfills
different needs than many other machine learning or flexible modeling approaches,
such as neural networks, regression trees, k-Nearest Neighbors, SVMs, and GAMs, to
name a few. In the appendix, we describe in greater detail how KRLS compares to
important classes of models on interpretability and inference, with special attention
to Generalized Additive Models (GAMs) and to approaches that involve explicit basis
expansions followed by fitting methods that force many of the coefficients to be exactly
zero (LASSO). At bottom, we do not claim that KRLS is generally superior to other
approaches but, rather, that it provides a particularly useful marriage of flexibility
and interpretability. It does so with far lower risk of misspecification bias than highly
constrained models, while minimizing arbitrary choices about basis expansions and
the selection of smoothing parameters.
These differences aside, in proposing a new method, it is useful to compare its
pure modeling performance to other candidates. In this area, KRLS does very
well.18 To further illustrate how KRLS compares against other methods that have
appeared in political science, we replicate a simulation from Wood (2003) that was
designed specifically to illustrate the use of GAMs. The data-generating process is
given by x1, x2 ~ Unif(0, 1), ε ~ N(0, .25), and y = e^(−10((x1−.25)² + (x2−.25)²)) +
.5 · e^(−4((x1−.7)² + (x2−.7)²)) + ε. We consider five models: (1) KRLS with default choices
(σ² = D = 2), implemented in our R package simply as krls(y=y, X=cbind(x1, x2));
(2) a "naive" GAM (GAM1) that smoothes x1 and x2 separately but then assumes
that they add; (3) a "smart" GAM (GAM2) that smoothes x1 and x2 together using the default thin-plate splines and the default method for choosing the number of
basis functions in the mgcv package in R; (4) a flexibly specified linear model (LM),
y = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x2² + β5 x1 · x2; and (5) a neural network (NN)
with 5 hidden units and all other parameters at their defaults using the NeuralNet
package in R. We train each model on samples of 50, 100, or 200 observations and then
test it on 100 out-of-sample observations. The results for the root mean square error
(RMSE) of each model averaged over 200 iterations at each sample size are shown in
Table 3.3. KRLS performs as well as or better than all other methods at all sample
sizes. In smaller samples, it clearly dominates. As the sample size increases, the fully
smoothed GAM performs very similarly.19
18 It has been shown that the RLS models on which KRLS is based are effective even when used
for classification rather than regression, with performance indistinguishable from state-of-the-art
Support Vector Machines (Rifkin et al., 2003).

19 KRLS and GAMs in which all variables are smoothed together are similar. The main differences
under current implementations (our package for KRLS and mgcv for GAMs) include the following:
(1) the fewer interpretable quantities produced by GAMs; (2) the inability of GAMs to fully smooth
together more than a few input variables; and (3) the kernel implied by GAMs that leads to straight-line extrapolation outside the support of X. These are discussed further in the online appendix.
3.6 Empirical Applications
In this section, we show an application of KRLS to a real data example. In the online
appendix, we also provide a second empirical example that shows how KRLS analysis corrects for misspecification bias in a linear interaction model used by Brambor
et al. (2006) to test the "short-coattails" hypothesis. This second example highlights
the common problem that multiplicative interaction terms in linear models only allow marginal effects to vary linearly, while KRLS allows marginal effects to vary in
virtually any smooth way, and this added flexibility can be critical to substantive
inferences.
Predicting Genocide
In a widely cited article, Harff (2003) examines data from 126 political instability
events (i.e. internal wars and regime changes away from democracy) to determine
which factors can be used to predict whether a state will commit genocide.20 Harff
proposes a "structural model of genocide" where a dummy for genocide onset (onset) is regressed on two continuous variables, prior upheaval (summed years of prior
instability events in the past 15 years) and trade openness (imports and exports as
a fraction of gross domestic product (GDP) in logs), and four dummy variables that
capture whether the state is an autocracy, had a prior genocide, and whether the
ruling elite has an ideological character and/or an ethnic character.21 The first column in Table 3.4 replicates the original specification, using a linear probability model
(LPM) in place of the original logit. We use the LPM here because this allows more
direct comparison to the KRLS results. However, the substantive results of the LPM
are virtually identical to those of the logit in terms of magnitude and statistical significance. The next four columns on the left present the replication results from the
20 The American Political Science Association lists this paper as the 15th most downloaded paper
in the American Political Science Review. According to Google Scholar, this article has been cited
310 times.

21 See Harff (2003) for details. Notice that Harff dichotomized a number of continuous variables
(such as the polity score), which discards valuable information. With KRLS, one could instead use
the original continuous variables unless there was a strong reason to code dummies. In fact, tests
confirm that using the original continuous variables with KRLS results in a more predictive model.
KRLS estimator. We report first differences for all the binary predictor variables as
described above.
The analysis yields several lessons. First, the in-sample R² from the original logit
model and KRLS are very similar (32% versus 34%), but KRLS dominates in terms
of its receiver operating characteristic (ROC) curve for predicting genocide, with statistically
significantly more area under the curve (p < 0.03). It is reassuring that KRLS
It is reassuring that KRLS
performs better (at least in-sample) than the original logit model even though, as
Harff reports, her final specification was selected after an extensive search through
a large number of models. Moreover, this added predictive power does not require
any human specification search: the researcher simply passes the predictor matrix to
KRLS, which learns the functional form from the data, and this improves empirical
performance and reduces arbitrariness in selecting a particular specification.
Second, the average marginal effects reported by KRLS (shown in the second
column) are all of reasonable size and tend to be in the same direction as but somewhat
smaller than the estimates from the linear probability model.
We also see some
important differences. The LPM model (and the original logit) shows a significant
effect of prior upheaval, with an increase of one standard deviation corresponding to
a 10 percentage point increase in the probability of genocide onset, a 37 percent
increase over the baseline probability. This sizable "effect" completely
vanishes in the KRLS model, which yields an average marginal effect of zero that is
also highly insignificant. This sharply different finding is confirmed when we look
beyond the average marginal effect. Recall that the average marginal effects, while a
useful summary tool especially to compare to GLMs, are only summaries and can hide
interesting heterogeneity in the actual marginal effects across the covariate space. To
examine the effect heterogeneity, the next three columns on the left in Table 3.4 show
the quartiles of the distribution of pointwise marginal effects for each input variable.
Figure 3-5 also plots histograms to visualize the distributions. We see that the effect
of prior upheaval is essentially zero at every point.
What explains this difference in marginal effect estimates?
It turns out that
the significant effect in the LPM model is an artefact of misspecification bias. The
variable prior upheaval is strongly right-skewed and, when logged to make it more
appropriate for linear or logistic regression, the "effect" disappears entirely.
This
change in results emphasizes the risk of mistaken inference due to misspecification
under GLMs and its potential impact on interpretation. Note that this difference
in results is by no means substantively trivial. In fact, Harff (2003) argues that
prior upheaval is "the necessary precondition for genocide and politicide" and "a
concept that captures the essence of the structural crises and societal pressures that
are preconditions for authorities' efforts to eliminate entire groups."
Harff (2003)
goes on to explain two mechanisms by which this variable matters and draws policy
conclusions from it. However, as the KRLS results show, this "important finding"
readily disappears when the model accounts for the skew. This showcases the general
problem that misspecification bias is often difficult to avoid in typical political science
data, even for experienced researchers who publish in top journals and engage in
various model diagnostics and specification searches. It also highlights the advantages
of a more flexible approach such as KRLS, which avoids misspecification bias while
yielding marginal effects estimates that are as easy to interpret as LPM and also make
richer interpretation possible.
Third, while using KRLS as a robustness test of more rigid models can thus be
valuable, working in a much richer model space also permits exploration of effect
heterogeneity, including interactions. In Figure 3-5 we see that for several variables,
such as autocracy and ideological character, the marginal effect lies to the same side
of zero at almost every point, indicating that these variables have marginal effects in
the same direction regardless of their level or the levels of other variables. We also see
that some variables show little variation in marginal effects, such as prior upheaval,
while others show more substantial variation, such as prior genocide.
For example, the marginal effects (measured as first-differences) of ethnic character and ideological character are mostly positive, but both show variation from approximately 0 to 20 percentage points. A suggestive summary of how these marginal
effects relate to each observed covariate can be provided by regressing the estimates of
the pointwise marginal effects, ∂(onset)/∂(ideological character) or ∂(onset)/∂(ethnic character),
on the covariates.22
Both regressions reveal a strong negative relationship between the level of trade openness
and these marginal effects. To give substantive interpretation to the results, we find that
having an ethnic character to the ruling elite is associated with a 3 percentage point
higher probability of genocide for countries in the highest quartile of trade openness,
but a 9 percentage point higher probability in the lowest quartile of trade openness.
Ideological character is associated with a 9 percentage point higher risk of genocide
for the countries in the top quartile of trade openness, but an 18 percentage point
higher risk among those in the first quartile of trade openness. These findings, while
associational only, are consistent with theoretical expectations, but would be easily
missed in models that do not allow sufficient flexibility.
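The diagnostic described here can be sketched in a few lines of Python. The data below are simulated stand-ins (the covariate matrix `X` and the derivative vector `deriv_a` are hypothetical, constructed so that the effect of x_a varies with x_b), not the Harff replication data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: X holds the covariates and deriv_a holds the
# pointwise marginal effects dy/dx_a from a KRLS-style fit, built here so
# that the effect of x_a varies with the level of x_b.
n = 200
X = rng.normal(size=(n, 3))                                   # x_a, x_b, x_c
deriv_a = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Regress the pointwise derivatives on all covariates (with an intercept).
Z = np.column_stack([np.ones(n), X])
coefs, *_ = np.linalg.lstsq(Z, deriv_a, rcond=None)

# A large coefficient on x_b flags an x_a-by-x_b interaction; a large
# coefficient on x_a itself would instead suggest a non-linearity in x_a.
print(coefs.round(2))
```

Here the regression recovers the dependence of the x_a derivative on x_b, which is exactly the pattern that signals an interaction.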
In addition, the marginal effects of prior genocide are very widely dispersed. We
find that the marginal effects of prior genocide and ideological character are strongly
related: when one is high, the marginal effect of the other is lessened on average. For
example, the marginal effect of ideological character is 18 percentage points higher
when prior genocide is equal to zero. Correspondingly, the marginal effect of prior
genocide is 21 percentage points higher when ideological character is equal to zero.
This is characteristic of a sub-additive relationship, in which either prior genocide or
ideological character signals a higher risk of genocide, but once one of them is known,
the marginal effect of the other is negligible.23 By contrast, the marginal effects of
ethnic character (and every other variable besides ideological character) change
little as a function of prior genocide.
22 This approach is helpful to identify non-linearities and interaction effects. For each variable,
take the pointwise partial derivatives (or first-differences) modeled by KRLS and regress them on all
original independent variables to see which of them help explain the marginal effects. For example,
if ∂y/∂x^(a) is found to be well-explained by x^(a) itself, then this suggests a non-linearity in x^(a) (because
the derivative changes with the level of the same variable). Likewise, if ∂y/∂x^(a) is well-explained by
another variable x^(b), this suggests an interaction effect (the marginal effect of one variable, x^(a),
depends on the level of another, x^(b)).
23 In addition to theoretically plausible reasons why these effects are sub-additive, this relationship
may be partly due to ex post facto coding of the variables: once a prior genocide has occurred, it
becomes easier to classify a government as having an ideological character, since it has demonstrated
a willingness to kill civilians, possibly even stating an ideological aim as justification. Thus, in the
absence of prior genocide, coding a country as having ideological character is informative of genocide
risk, while it adds less after prior genocide has been observed.
This brief example demonstrates that KRLS is appropriate and effective in dealing with real-world data even in relatively small datasets. KRLS offers much more
flexibility than GLMs and guards against misspecification bias that can result in incorrect substantive inferences. It is also straightforward to interpret the KRLS results
in ways that are familiar to researchers from GLMs.
3.7 Conclusion
To date, it has been difficult to find user-friendly approaches that avoid the dangers
of misspecification while also conveniently generating quantities of interest that are as
interpretable and appealing as the coefficients from GLMs. We argue that KRLS represents a particularly useful marriage of flexibility and interpretability, especially for
current GLM users looking for more powerful modeling approaches. It allows investigators to easily model non-linear and non-additive effects and reduce misspecification
bias and still produce quantities of interest that enable "simple" interpretations (similar to those allowed by GLMs) and, if desired, more nuanced interpretations that
examine non-constant marginal effects.
While interpretable quantities can be derived from almost any flexible modeling
approach with sufficient knowledge, computational power, and time, constructing
such estimates for many methods is inconvenient at best and computationally infeasible in some cases. Moreover, conducting inference over derived quantities of interest
multiplies the problem. KRLS belongs to a class of models, those producing continuously differentiable solution surfaces with closed-form expressions, that makes such
interpretation feasible and fast. All the interpretational and inferential quantities are
produced by a single run of the model, and the model does not require user input
regarding functional form or parameter settings, improving falsifiability.
We have illustrated how KRLS accomplishes this improved tradeoff between flexibility and interpretability by starting from a different set of assumptions altogether:
rather than assume that the target function is well-fitted by a linear combination of
the original regressors, it is instead modeled in an N-dimensional space using information
about similarity to each observation, but with a preference for less complicated
functions, improving stability and efficiency. Since KRLS is a global method - i.e. the
estimate at each point uses information from all other points - it is less susceptible
to the curse of dimensionality than purely local methods such as k-nearest neighbors
and matching.
We have established a number of desirable properties of this technique. First, it
allows computationally tractable, closed-form solutions for many quantities, including
E[ylX], the variance of this estimator, the pointwise partial derivatives with respect
to each variable, the sample average partial derivatives, and their variances. We have
also shown that it is unbiased, consistent, and asymptotically normal. Simulations
have demonstrated the performance of this method, even with small samples and high-dimensional spaces. They have also shown that even when the true data-generating
process is linear, the KRLS estimate of the average partial derivative is not much
less efficient than the analogous OLS coefficient and far more robust to bad leverage
points.
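A minimal sketch of these closed-form computations follows, using a Gaussian kernel with the default bandwidth σ² = dim(X). The regularization parameter is fixed by hand here purely for illustration; the full method standardizes the data and selects λ itself:

```python
import numpy as np

def krls_fit(X, y, sigma2=None, lam=0.5):
    """Bare-bones KRLS-style fit: c = (K + lam*I)^{-1} y. lam is fixed here
    for illustration; the full method selects it from the data and
    standardizes X and y first."""
    n, d = X.shape
    sigma2 = d if sigma2 is None else sigma2   # default bandwidth sigma^2 = dim(X)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / sigma2)
    c = np.linalg.solve(K + lam * np.eye(n), y)
    return c, sigma2

def krls_derivatives(X, c, sigma2):
    """Closed-form pointwise derivatives of yhat(x) = sum_j c_j k(x, x_j)."""
    diff = X[:, None, :] - X[None, :, :]       # (i, j, d) entries: x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / sigma2)
    # d yhat / d x_d at x_i equals (-2/sigma2) * sum_j c_j K_ij (x_id - x_jd)
    return (-2.0 / sigma2) * np.einsum('ij,ijd->id', K * c[None, :], diff)

rng = np.random.default_rng(1)
X = rng.uniform(0, 2, size=(300, 1))
y = 5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)
c, s2 = krls_fit(X, y)
avg_deriv = krls_derivatives(X, c, s2).mean(axis=0)
print(avg_deriv)   # compare with the true average derivative E[10*x] = 10
```

Because the fitted surface is a sum of Gaussians, its derivative at every point is available analytically, which is what makes the pointwise and sample-average marginal effects cheap to compute.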
We believe that KRLS is broadly useful whenever investigators are unsure of the
functional form in regression and classification problems. This may include modelfitting problems such as prediction tasks, propensity score estimation, or any case
where a conditional expectation function must be acquired and rigid functional forms
risk missing important variation. The method's interpretability also makes it suitable
for both exploratory analyses of marginal effects and causal inference problems in
which accurate conditioning on a set of covariates is required to achieve a reliable
causal estimate. Relatedly, using KRLS as specification check for more rigid methods
can also be very useful.
However, there remains considerable room for further research. Our hope is that
the approach provided here and in our companion software will allow more researchers
to begin using KRLS or methods like it; only when tested by a larger community of
scholars will we be able to determine the method's true usefulness. Specific research
tasks remain as well.
Due to the memory demands of working with an N x N
matrix, the practical limit on N for most users is currently in the tens of thousands.
Work on resolving this constraint would be useful. In addition, the most effective
methods for choosing λ and σ² are still relatively open questions, and it would be
useful to develop heteroscedasticity-, autocorrelation-, and cluster-robust estimators
for standard errors.
3.8 Tables
Table 3.1: Comparing KRLS to OLS with Multiplicative Interactions

                              OLS                    KRLS
∂y/∂x_ij                   Average     Average   1st Qu.   Median   3rd Qu.
const                       -1.50
                            (0.34)
x1                           7.51       9.22      5.22      9.38     14.03
                            (0.40)     (0.52)    (0.82)    (0.85)   (0.79)
x2                          -1.28       0.02     -0.08      0.00     0.10
                            (0.21)     (0.13)    (0.19)    (0.16)   (0.20)
(x1 × x2)                    1.24
                            (0.15)
N                             250
Note: Point estimates of marginal effects from OLS and KRLS regression with bootstrapped standard errors in parentheses. For KRLS, the table shows the average and
the quartiles of the distribution of the pointwise marginal effects. The true target
function is y = 5x1², simulated using y_i = 5x1_i² + ε_i with ε ~ N(0, 2), x1 ~ Unif(0, 2),
and x2 = x1 + η with η ~ N(0, 1). With OLS, we conclude that x1 has a positive effect
that grows with higher levels of x2 and that x2 has a negative (positive) effect at low
(high) levels of x1. The true marginal effects are ∂y/∂x1 = 10x1 and ∂y/∂x2 = 0; the effect
of x1 depends only on the level of x1, and x2 has no effect at all. The KRLS estimator
accurately recovers the true average derivatives. The marginal effects of x2 are close
to zero throughout the support of x2. The marginal effects of x1 vary from about 5
at the first quartile to about 14 at the third quartile.
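The simulation in this note can be replicated in outline as follows; N(0, 2) is read as variance 2, and the seed and point estimates will of course differ from those in the table:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 250
x1 = rng.uniform(0, 2, size=n)
x2 = x1 + rng.normal(size=n)                              # x2 = x1 + eta, eta ~ N(0, 1)
y = 5 * x1 ** 2 + rng.normal(scale=np.sqrt(2), size=n)    # N(0, 2) read as variance 2

# OLS with a multiplicative interaction, mirroring the table's specification.
Z = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta.round(2))        # the interaction term picks up a spurious "effect"

# The true marginal effects are dy/dx1 = 10*x1 and dy/dx2 = 0.
print((10 * x1).mean())     # true average derivative of x1, near 10
```

The interaction coefficient is positive only because x1·x2 proxies for the omitted x1² term, which is precisely the misspecification artifact the table illustrates.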
Table 3.2: KRLS Captures Complex Interactions and Non-additivity
Target Function: One Hill, One Valley; Two Hills, Two Valleys; Three Hills, Three Valleys
In-Sample R² (KRLS, OLS, GAM) / Out-of-Sample R² (KRLS, OLS, GAM) / True R²
0.75
0.61
0.63
0.41
0.01
0.21
0.52
0.01
0.05
0.70
0.60
0.60
0.73
0.35
-0.01
0.13
0.39
0.45
-0.01
-0.03
0.51
Note: In- and out-of-sample R² (based on 200 test points) for simulations using
the three target functions displayed in Figures A.4, A.5, and A.6 in the appendix
with the OLS, GAM, and KRLS estimators. KRLS attains the best in-sample and
out-of-sample fit for all three functions.
Table 3.3: Comparing KRLS to Other Methods

                 Mean RMSE
Model      N=50     N=100    N=200
KRLS       0.139    0.107    0.088
GAM2       0.143    0.109    0.088
NN         0.312    0.177    0.118
LM         0.193    0.177    0.169
GAM1       0.234    0.213    0.202
Note: Simulation comparing RMSE for out-of-sample fits generated by
five models, averaged over 200 iterations. The data-generating process is
based on Wood (2003): x1, x2 ~ Unif(0, 1), ε ~ N(0, .25), and
y = e^{14(-(x1-.25)² - (x2-.25)²)} + .5 · e^{14(-(x1-.7)² - (x2-.7)²)} + ε. The models are
(1) KRLS with default choices; (2) a "naive" GAM (GAM1) that smoothes x1 and x2
separately; (3) a "smart" GAM (GAM2) that smoothes x1 and x2 together;
(4) a generous linear model (LM), y = β0 + β1·x1 + β2·x2 + β3·x1² + β4·x2² + β5·x1·x2;
and (5) a neural network (NN) with 5 hidden units. The models
are trained on samples of 50, 100, or 200 observations and then tested on 100
out-of-sample observations. KRLS out-performs all other methods in small
samples. In larger samples, KRLS and GAM2 (with "full-smoothing")
perform similarly. The linear model, despite including terms for x1, x2, and
x1·x2, does not perform particularly well. GAM1 also performs poorly in all
circumstances.
Table 3.4: Predictors of Genocide Onset: OLS versus KRLS

                                  OLS               KRLS ∂y/∂x_ij
                                   β      Average   1st Qu.   Median   3rd Qu.
Prior upheaval                   0.009*    0.002    -0.001     0.002    0.004
                                (0.004)   (0.003)
Prior genocide                   0.263*    0.190*    0.137     0.232    0.266
                                (0.119)   (0.075)
Ideological char. of elite       0.152     0.129     0.086     0.136    0.186
                                (0.084)   (0.076)
Autocracy                        0.160*    0.122     0.092     0.114    0.136
                                (0.077)   (0.068)
Ethnic char. of elite            0.120     0.052     0.012     0.046    0.078
                                (0.083)   (0.077)
Trade openness (log)            -0.172*   -0.093*   -0.142    -0.073   -0.048
                                (0.057)   (0.035)
Intercept                        0.659
                                (0.217)
Note: Replication of the "structural model of genocide" by Harff (2003). Marginal effects of
predictors from OLS regression and KRLS regression with standard errors in parentheses. For
KRLS, the table shows the average of the pointwise derivatives as well as the quartiles of their
distribution to examine the effect heterogeneity. The dependent variable is a binary indicator
for genocide onsets. N=126. *p-value < .05. See text for details.
3.9 Figures
Figure 3-1: Random Samples of Functions of the Form f(x) = Σ_i c_i k(x, x_i)

[Two panels over x in [0, 1]: "Gaussians for x_i" and "Superposition".]

Note: The target function is created by centering a Gaussian over each x_i, scaling each by its c_i, and
then summing them. We use 8 observations with c_i ~ N(0, 1), x ~ Unif(0, 1), and a fixed value for
the bandwidth of the kernel σ². The dots represent the sampled data points, the dotted lines refer
to the scaled Gaussian kernels that are placed over each sample point, and the solid lines represent
the target functions created from the superpositions. Notice that the center of each Gaussian curve
depends on the point x_i, its upward or downward direction depends on the sign of the weight c_i,
and its amplitude depends on the magnitude of the weight c_i (as well as the fixed σ²).
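The construction in this note can be sketched directly; the bandwidth below is an illustrative value, not the one used in the figure:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 0.1                        # fixed kernel bandwidth (illustrative choice)
xs = rng.uniform(0, 1, size=8)      # 8 sample points, as in the figure
cs = rng.normal(size=8)             # weights c_i ~ N(0, 1)

def f(x):
    """Superposition: a Gaussian centered at each x_i, scaled by c_i, then summed."""
    x = np.asarray(x, dtype=float)[..., None]
    return (cs * np.exp(-(x - xs) ** 2 / sigma2)).sum(-1)

grid = np.linspace(0, 1, 101)
values = f(grid)                    # one random draw from this family of functions
```

Re-drawing `xs` and `cs` produces a new function from the same family, which is how the random samples in the figure are generated.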
Figure 3-2: KRLS Compares Well to OLS with Linear Data-Generating Processes
[Two panels over sample sizes N = 100 to 600. Left: mean estimated average derivative, OLS estimates vs. KRLS estimates. Right: KRLS SE(E[dy/dx]) vs. OLS SE(beta).]
Left: Simulation to recover the average derivative of y = .5x, i.e., E[∂y/∂x] = .5 (solid line). For each
sample size, we run 100 simulations with observed outcomes y = .5x + ε where x ~ Unif(0, 1) and
ε ~ N(0, .3). One contaminated data point is set to (y_i = -5, x_i = 5). Dots represent the mean
estimated average derivative for each sample size for OLS (open circles) and KRLS (full circles).
The simulation shows that KRLS is robust to the bad leverage point, while OLS is not. Right:
Comparison of the standard error of the estimated β from OLS (solid line) to the standard error of the sample
average partial derivative from KRLS (dashed line). Data are generated according to y = 2x + ε,
with x ~ N(0, 1) and ε ~ N(0, 1), with 100 simulations for each sample size. KRLS is nearly as
efficient as OLS at all but very small sample sizes, with standard errors, on average, approximately
14% larger than those of OLS.
Figure 3-3: KRLS with High-Frequency and Discontinuous Functions
[Left panel: target function with KRLS estimates at N=40 and N=400. Right panel: target function with KRLS and OLS estimates at N=40 and N=400. Both over x in [0, 1].]
Left: Simulation to recover a high-frequency target function given by y = .2·sin(12πx) + sin(2πx)
(solid line). For each sample size, we run 100 simulations where we draw x ~ Unif(0, 1) and simulate
observed outcomes as y = .2·sin(12πx) + sin(2πx) + ε where ε ~ N(0, .2). The dashed line shows
mean estimates across simulations for N=40 and the dotted line for N=400. The results show that
KRLS finds a low-frequency approximation even at the larger sample sizes. Right: Simulation to
recover the discontinuous target function given by y = .5·1(x > .5) (solid line). For each sample
size, we run 100 simulations where we draw x ~ Unif(0, 1) and simulate observed outcomes as
y = .5·1(x > .5) + ε where ε ~ N(0, .2). Dashed lines show mean estimates across simulations
for N=40 and dotted lines for N=400. The results show that KRLS fails to approximate the sharp
discontinuity even at the larger sample size, but still dominates the comparable OLS estimate, which
uses x as a continuous regressor.
Figure 3-4: KRLS Learns Interactions from the Data
[Two panels ("Simple Interaction" and "Complex Interaction") plotting out-of-sample R² against sample size N (0 to 300) for KRLS (closed circles) and OLS (open circles), with the true R² shown as a solid line.]
Simulations to recover target functions that include multiplicative interaction terms. Left: The
target function is y = .5 + x1 + x2 - 2(x1·x2) + ε with x_j ~ Bernoulli(.5) for j = 1, 2 and ε ~ N(0, .5).
Right: The target function is y = (x1·x2) - 2(x3·x4) + 3(x5·x6·x7) - (x1·x8) + 2(x8·x9·x10) + x10,
where all x are drawn i.i.d. Bernoulli(p) with p = .25 for x1 and x2, p = .75 for x3 and x4, and
p = .5 for all others. For each sample size, we run 100 simulations where we draw the x and simulate
outcomes using y = y_true + ε where ε ~ N(0, .5) for the training data. We use 1,000 test points
drawn from the same distribution to test the out-of-sample R² of the estimators. The closed circles
show the average R² estimates across simulations for the KRLS estimator; the open circles show the
estimates for the OLS regression that uses all x as predictors. The true R² is given by the solid line.
The results show that KRLS learns the interactions from the data and, as the sample size increases,
approaches the true R² that one would obtain knowing the functional form.
Figure 3-5: Effect Heterogeneity in Harff Data

[Histograms of the distributions of pointwise marginal effects (roughly -0.2 to 0.2) for PriorUpheaval, PriorGen, IdeologicalChar, Autoc, EthnicChar, and TradeOpen.]

Note: Histograms of pointwise marginal effects based on the KRLS fit to the Harff data (Model 2 in Table 3.4).
Chapter 4
Kernel Balancing
Kernel Balancing: A Balancing Method to Equalize
Multivariate Densities and Reduce Bias without a
Specification Search
Chad Hazlett - Massachusetts Institute of Technology
ABSTRACT
Investigators often use matching and weighting techniques to adjust for
differences between treated and control groups on observed characteristics. These methods, however, ensure that the treated and control have the
same means only on explicitly chosen functions of the covariates. Treatment effect estimates made after adjustment by these methods are thus
sensitive to specification choices. The resulting treatment effect estimates
are biased if any function of the covariates influencing the outcome is
imbalanced. Kernel balancing finds weights that ensure the treated and
control have equal means on a very large class of smooth functions of the
covariates. In addition, when multivariate density is measured in a particular way, the reweighted control group has the same multivariate density as
the treated. In two empirical applications, kernel balancing (1) accurately
recovers the experimentally estimated effect of a job training program, and
(2) finds that after controlling for observed differences, democracies are
less likely to win counterinsurgencies, consistent with theoretical expectation but in contrast to previous findings.
4.1 Introduction
Matching and weighting methods are widely used to estimate causal effects from
non-experimental data when unobserved confounders can be ruled out. However,
existing methods do not ensure that the multivariate densities of the resulting treated
and control units are sufficiently similar, nor do they typically allow for multivariate
imbalances to be detected. As a result, even apparently well-balanced samples may
differ on important functions of the covariates, leading to biased treatment effect
estimates.
Kernel balancing, proposed here, is a weighting technique that uses kernels to
construct a higher dimensional transformation of the original data. It then achieves
equal means for the treated and control groups on this transformed version of the data.
This method makes several contributions to existing methodology. First, it obtains
approximate balance on a large class of smooth functions of the covariates. Second,
it ensures that the entire multivariate densities of the covariates - as measured by
a particular smoothing estimator - is approximately equalized for the treated and
control samples. Third, the method does not require users to conduct an iterative
specification search, to check univariate balance measures, or to otherwise guess what
functions of the covariates must be included in matching/reweighting procedures.
Fourth and finally, I introduce a method of measuring multivariate imbalance using
an L1 metric, which can be applied before and after this or any other matching or
weighting method to assess the distance between the multivariate densities of treated
and controls.
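A deliberately simplified sketch of the core idea follows. The helper name `kernel_balance` is hypothetical; the actual procedure, described in section 4.5, further restricts the weights to be non-negative and sum to one, and works with a truncated version of the kernel matrix:

```python
import numpy as np

def kernel_balance(X_t, X_c, sigma2=None, ridge=1e-6):
    """Simplified sketch: choose control weights w so that the weighted control
    mean of each kernel column k(., x_j) matches the treated mean."""
    X = np.vstack([X_t, X_c])
    n_t, d = X_t.shape
    sigma2 = d if sigma2 is None else sigma2
    K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / sigma2)
    target = K[:n_t].mean(axis=0)          # treated mean of each kernel column
    Kc = K[n_t:]                           # kernel rows of the controls
    # Ridge-stabilized least-squares weights; the real method instead searches
    # over non-negative weights that sum to one.
    w = np.linalg.solve(Kc @ Kc.T + ridge * np.eye(len(X_c)), Kc @ target)
    return w / w.sum()                     # normalize to sum to one

rng = np.random.default_rng(4)
X_c = rng.normal(0.0, 1.0, size=(150, 2))   # controls
X_t = rng.normal(0.5, 1.0, size=(60, 2))    # treated, shifted distribution
w = kernel_balance(X_t, X_c)
```

Because smooth functions of the covariates are well-approximated by combinations of the kernel columns, matching means in this transformed space approximately balances a large class of smooth functions at once.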
In what follows, section 4.2 briefly describes existing methods and their shortcomings, after which section 4.3 illustrates these shortcomings with a simple hypothetical
example. Section 4.4 then establishes a formal framework for the problem, and describes the conditions under which unbiased estimation with matching and weighting
estimators is possible. The kernel balancing technique is then described in section
4.5, with implementation details in section 4.6. Section 4.7 offers two empirical applications. The first is a re-analysis of Dehejia and Wahba (1999), in which the kernel
balancing estimates made using observational data accurately recover the experimental estimate. The second application reanalyses data from Lyall (2010), examining whether democracies are less successful counterinsurgents than non-democracies.
Kernel balancing produces far better balance on the original variables and numerous
functions of them. The resulting effect estimates reveal that, in contrast to Lyall
(2010), democracies are over 25 percentage points less likely than non-democracies to
defeat insurgencies, a substantial and significant reduction.
4.2 Background
Traditional matching approaches (e.g. Rubin, 1973) match each treated unit to one or
several control units that are most similar, as measured using some distance metric.
Methods in this family vary principally in how they measure distance, with recent
advances allowing the relative weight of each variable to change in order to optimize balance across a panel of balance tests (Diamond and Sekhon, 2005). While
the non-parametric nature of these approaches is appealing, methods in this family
have three key shortcomings. First, they seek to ensure the multivariate density of
the controls matches that of the treated, but are not typically equipped to measure
discrepancies in these distributions, nor are they designed to optimize equality of the
multivariate distributions. Second, when exact matches cannot be found for each
treated unit - as is the case when matching on continuous variables - the resulting
matching discrepancies cause bias. This bias dissipates only very slowly, and in general the resulting estimates are not √N-consistent (Abadie and Imbens, 2006). The
bias can be removed by modeling the effect of the matching discrepancy, but this
re-introduces parametric assumptions. Finally, because of these difficulties, results
depend on what functions of the covariates - e.g. squared terms, logarithms, multiplicative interactions - the user includes in the matching procedure. Thus while
intended to be a specification-free approach, in practice users must undergo a tedious
specification search. Moreover, without balance metrics that accurately measure multivariate balance, there is no clear way of arbitrating among different specifications
in a way that leads to the least-biased estimate.
A second family of techniques involves weighting methods that choose continuous
(usually non-negative) weights for control units and possibly treated units. Propensity
score weighting (Rosenbaum and Rubin, 1983) is a widely used technique to remove
bias due to observed covariates. However, its major shortcoming is the requirement
that the propensity score model be correctly specified. The difficulty of achieving
correct specification and the resulting biases have been well studied (e.g. Smith
and Todd, 2001). More recently, "covariate-balancing" weighting techniques have
been proposed. Entropy balancing (Hainmueller, 2012) allows users to enter a covariate matrix X - which may include squares, interactions, or other higher-order terms
of the original covariates - and achieve essentially perfect mean-balance on each of
these covariates.' Since many possible combinations of weights achieve this, entropy
balancing chooses the set that, roughly speaking, ensures that weights are as close
to constant as possible. Also in this family, the covariate-balancing propensity score
(Imai and Ratkovic, 2014) seeks weights on the controls to balance the propensity
score, while also balancing desired moments of the covariate distributions. These
various covariate-balancing methods have the benefit of achieving essentially perfect
equality of means on the matrix of included covariates, X, thus side-stepping some
of the problem of bias due to matching discrepancies. However, as shown below, the
principal shortcoming of these approaches is that they do not guarantee balance on nonlinear functions of X. Accordingly, unbiasedness will only be guaranteed for these
methods when the (non-treatment potential) outcome is linear in the columns of X.
Finally, coarsened exact matching (CEM; Iacus et al., 2012) is a weighting technique distinct from those above. This approach "coarsens" the data, placing each
observation into a multivariate bin. Within each bin having at least one control unit,
a weight can be chosen so that the weighted number of controls equals the number of
treated falling in that bin (if the desired estimand is an average treatment effect on
1 Equivalently, entropy balancing allows
the user to equate any desired moments of the covariate
distribution for the controls to those of the treated. For example, obtaining mean balance on a
covariate X and its square ensures that the first and second sample moments of X are equal for the
treated and controls.
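The point of this footnote is easy to verify numerically. The least-squares construction below is only a stand-in for entropy balancing's actual optimization, which also keeps the weights non-negative and as close to uniform as possible:

```python
import numpy as np

rng = np.random.default_rng(6)
x_t = rng.gamma(2.0, 1.0, size=100)    # treated draws of a covariate X
x_c = rng.gamma(3.0, 1.0, size=300)    # controls, differently distributed

# Moment conditions: weights sum to one and equate the first two moments.
A = np.vstack([np.ones_like(x_c), x_c, x_c ** 2])        # 3 x 300
b = np.array([1.0, x_t.mean(), (x_t ** 2).mean()])

# Minimum-norm weights satisfying A @ w = b exactly.
w = A.T @ np.linalg.solve(A @ A.T, b)

print((w * x_c).sum(), x_t.mean())               # first moments now agree
print((w * x_c ** 2).sum(), (x_t ** 2).mean())   # and second moments
```

Balancing on X and X² equates the first and second weighted sample moments exactly, just as the footnote states.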
the treated). This has the benefit of bounding the multivariate imbalance, and providing a way of measuring the imbalance by examining the original imbalance within
each bin. However, it too has several major shortcomings. First, it requires choosing
the boundaries for the bins, which may influence the result. Second, it is not tolerant
of high-dimensional data, since the bin size required in order to obtain bins with
common support grows very quickly with the dimension of the data. Perhaps most
importantly, because the bins must be large, a treated and control unit within the
same bin may vary widely in their covariate values, and thus in their associated values of the non-treatment potential outcome, re-introducing the matching discrepancy
problem that causes bias in traditional matching estimators.
In summary, then, the existing approaches do not guarantee multivariate balance,
balance on functions of the covariates that may influence the outcome, or unbiasedness. The covariate-balancing weighting techniques side-step the problem of matching discrepancies, but instead require the investigator to know all the functions of
the covariates that may influence the outcome. While theoretical guidance may help
investigators to choose which covariates are important to match on, theory rarely
tells us exactly how these covariates matter. Thus, investigators cannot know exactly
what functions of the covariates to achieve balance on - e.g. the raw variables, squared
terms, interactions, ratios, or even non-standard functions of the data. Balance testing also provides little guidance, as it is typically confirmed only univariately, or on
higher order terms explicitly included by the user. As a result, current matching or
weighting techniques leave users unsure of "what to match on", and estimates made
even by the most careful researcher can be both biased and sensitive to specification.
4.3 Motivating Example
This section provides a motivating example using simulated data to illustrate the
risks of bias under existing methods.
Suppose we are interested in the question of whether peacekeeping missions deployed after civil wars are effective in lengthening the duration of peace (peaceyears)
after the war's conclusion (e.g. Fortna, 2004; Doyle and Sambanis, 2000). However,
the "treatment" - peacekeeping missions (peacekeeping) - is not randomly assigned.
Rather, missions are more likely to be deployed in certain situations, which may differ
systematically in their expected peace years even in the absence of a peacekeeping
mission. To deal with this, we collect four pre-treatment covariates that describe
each case: the duration of the preceding war (war duration), the number of fatalities
(fatalities), democracy level prior to the peacekeeping mission (democracy), and a
measure of the number of factions or sides in the civil war (factionalism).
We are interested in the average treatment effect on the treated (ATT), which is
the mean number of peace years experienced by countries that received peacekeeping,
minus the average number of peace years for this group had they not received peacekeeping
missions. Such causal estimands can be estimated from these data only if
there are no unobserved confounders. For this example, suppose that peacekeeping
missions are deployed on the basis of a conflict's intensity, measured as fatalities/war duration,
with missions more likely to be deployed where conflicts were higher in intensity.
Such an assignment process would result, for example, if "faster-burning" conflicts
are more likely to attract international attention, and thus peacekeeping missions. In
addition, suppose the outcome of interest, peace years, is also a function of intensity,
with more intense conflicts leading to longer average peace years. This is reasonable
if, for example, more intense wars indicate greater dominance by one side, leading to
a lower likelihood of resurgence in each subsequent year. In this example, peace years
is only a function of intensity, and not of peacekeeping, implying a true treatment
effect of zero. 2
How well do existing techniques achieve balance, both on the original covariates
and on intensity, an important function of the observables? In figure 4-1, the x-axis
for each plot shows the difference in means between treated and control on each of
the covariates, as well as on intensity. All results are averaged over 500 simulations
2 Details for the simulation are as follows. War duration in years is distributed max(1, N(7, 9));
intensity in fatalities per year is distributed Unif(100, 10000); fatalities is then computed as
intensity · war duration. The treatment, peacekeeping, is assigned by a Bernoulli draw with probability
of the form logit⁻¹(c · intensity − 2) for a constant c, increasing in intensity, and the outcome
peace years is a deterministic increasing function of intensity plus noise, ε ~ N(0, 0.004).
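The data-generating process of this footnote can be sketched as follows. The assignment and outcome constants below are illustrative stand-ins where the source is not fully legible, and N(7, 9) is read as variance 9 (sd 3):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Sketch of the simulation design with illustrative stand-in constants.
war_duration = np.maximum(1, rng.normal(7, 3, size=n))       # years
intensity = rng.uniform(100, 10000, size=n)                  # fatalities per year
fatalities = intensity * war_duration

p_treat = 1 / (1 + np.exp(-(intensity / 2500 - 2)))          # increasing in intensity
peacekeeping = rng.binomial(1, p_treat)
peace_years = intensity / 1000 + rng.normal(0, 0.2, size=n)  # depends only on intensity

# The true effect of peacekeeping is zero, yet a naive comparison is confounded:
naive = peace_years[peacekeeping == 1].mean() - peace_years[peacekeeping == 0].mean()
print(naive)   # positive despite a true effect of zero
```

Since intensity drives both the treatment and the outcome, any adjustment that fails to balance intensity, which is a ratio of two observed covariates, will reproduce this positive bias.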
with the same data generating process and N = 500 on each simulation. The first
plot (matching) shows results for simple Mahalanobis distance matching (with replacement). Imbalance remains somewhat large on war duration. More troubling,
imbalance remains considerable on intensity, which was not directly included in the
matching procedure. A careful researcher may realize the need to match on more functions of the covariates, and instead match on the original covariates, their squares,
and their pairwise multiplicative interactions. While few researchers go this far in
practice, the second plot in figure 4-1 (matching+) shows that even this approach
would not provide the needed flexibility to produce balance on intensity. In fact,
balance on both war duration and intensity is worsened. In the third plot (mean
balance), entropy balancing (Hainmueller, 2012) is used to achieve equal means in
the original covariates. As expected, this produces excellent balance on the original
covariates, but only a modest improvement in balance on intensity. Finally, in the
fourth plot (kernel balance), the kernel balancing approach introduced here is applied,
again using the original covariate data alone. Because this method achieves balance
on many smooth functions of the included covariates, it achieves vastly improved
balance on intensity.
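The data-generating process just described can be sketched as follows. This is a hedged illustration rather than the exact simulation code: the rescaling of intensity inside the treatment-assignment and outcome equations, the noise scale, and the use of a standard deviation of 3 for war duration are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Covariates (the rescalings below are illustrative assumptions)
duration = np.maximum(1, rng.normal(7, 3, N))   # war duration in years
intensity = rng.uniform(100, 10000, N)          # fatalities per year
fatalities = intensity * duration               # a deterministic function of both

# Treatment (peacekeeping) depends only on intensity; the outcome also
# depends only on intensity, so the true treatment effect is zero.
z = (intensity - intensity.mean()) / intensity.std()  # standardized intensity
peacekeeping = rng.binomial(1, 1 / (1 + np.exp(-(z - 2))))
peace_years = z + rng.normal(0, 0.1, N)

# A naive difference in means is biased away from zero because treated
# conflicts are systematically more intense.
naive = peace_years[peacekeeping == 1].mean() - peace_years[peacekeeping == 0].mean()
```

Because treatment probability rises with intensity, the treated group is more intense on average, and the naive comparison attributes that difference to peacekeeping.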
These imbalances are worrying because they lead to biased ATT estimates: since
intensity affects the outcome, the mean differences in intensity between the treated
and control group after adjustment lead to mean differences in the outcome between
treated and control that are not due to the treatment. When the ATT is estimated
by difference in means in the post weighting/matching sample, bias is thus found
for all methods but kernel balancing. This is evident in figure 4-2, which shows the
distribution of ATT estimates (over the simulations) for each method.
This example is notable because, while artificial, it is reasonable in many cases
that a function such as the ratio of two variables may impact the outcome variable in
question in the absence of the treatment, yet investigators rarely ensure balance on
such ratios. More generally, even with strong theoretical priors, it is unreasonable to
expect investigators to correctly guess what functions of the observables may impact
the outcome, and to ensure balance on each of these. Kernel balancing offers a solution
Figure 4-1: Imbalance on a function of the covariates
[Four panels: matching, matching+, mean balance, kernel balance. Each panel plots mean imbalance (x-axis, 0.00 to 0.12) on fatalities, war duration, factions, democracy, and intensity.]
Mean imbalances on the four included covariates, and on intensity = fatalities / war duration, which is the key factor in both assignment of the treatment (peacekeeping) and the eventual outcome (peace years). Matching: Mahalanobis distance matching on the original four covariates leaves a substantial imbalance on war duration. More problematically, it shows a large imbalance on intensity. Matching+: Mahalanobis distance matching with squared terms and all pairwise multiplicative interactions included in the matching procedure. This produces a slight worsening of imbalance, particularly on intensity. Mean balance: Entropy balancing on the original four covariates achieves essentially perfect mean balance on these covariates. However, this produces only a small improvement in balance on intensity. Kernel balance: the technique proposed here obtains mean balance on a wide range of smooth functions of the included covariates. As a result, it obtains very good balance on intensity, even though the user only enters the original four covariates.
to this problem.
4.4 Theoretical Framework
Let Y(1) ∈ ℝ, Y(0) ∈ ℝ, X ∈ 𝒳, and D ∈ {0, 1} be random variables with joint distribution f_{X,D,Y(0),Y(1)}, where Y(1) is the potential outcome under treatment, Y(0) is the potential outcome under control, D is the treatment assignment, and X is a vector of covariates. We sample N i.i.d. tuples (Xi, Di, Yi(0), Yi(1)) from this joint distribution.
The estimand of primary interest throughout will be the average treatment effect on the treated (ATT):

ATT = E[Yi(1) − Yi(0) | Di = 1]    (4.1)
Figure 4-2: Biased ATT estimation due to an imbalanced function of the covariates
[Boxplots of ATT estimates for mdmatch, mdmatch+, mean balance, and kernel balance; x-axis: ATT estimates from 0.00 to 0.30, with the true effect of zero marked.]
Boxplot illustrating the distribution of average treatment effect on the treated (ATT) estimates in the same example as figure 4-1 above. The actual effect is zero peace years. Matching on the raw covariates, matching on higher-order transforms, and obtaining mean balance all show large biases because the control samples chosen by these procedures include higher-intensity conflicts than the treated sample, even though intensity is entirely a function of observables. Since intensity influences the outcome, peace years, the treated and control samples thus differ regardless of any treatment effect. By contrast, kernel balancing is approximately unbiased, as it achieves balance on a large space of smooth functions of the covariates.
Weights W will be chosen for the control units. The weighted difference in means estimator for the average treatment effect on the treated (ATT) is then:

ÂTT = (1/Nt) Σ_{i:Di=1} Yi − Σ_{i:Di=0} Wi Yi    (4.2)

where Nt and Nc are the number of treated and control units respectively, and the Wi are non-negative weights on the controls such that Σ_{i:Di=0} Wi = 1. Here I describe more precisely the conditions under which unbiased estimation of the ATT is possible for such an estimator.
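Equation 4.2 can be computed directly; a small sketch follows, with made-up data and weights for illustration (any weighting or matching scheme would supply the W):

```python
import numpy as np

def att_estimate(Y, D, W):
    """Weighted difference in means (equation 4.2): the treated mean minus
    the W-weighted control mean, with nonnegative weights summing to one."""
    Y, D, W = map(np.asarray, (Y, D, W))
    assert np.all(W >= 0) and np.isclose(W.sum(), 1.0)
    return Y[D == 1].mean() - W @ Y[D == 0]

Y = np.array([3.0, 4.0, 1.0, 2.0, 3.0])  # outcomes
D = np.array([1, 1, 0, 0, 0])            # treatment indicators
W = np.array([0.5, 0.25, 0.25])          # weights on the three controls
att_estimate(Y, D, W)                    # 3.5 - (0.5*1 + 0.25*2 + 0.25*3) = 1.75
```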
To summarize the main results, conditional ignorability is not in general sufficient
for unbiased ATT estimation by these techniques. Instead, estimates are guaranteed
to be unbiased only if conditional ignorability holds and, after adjustment by W,
either:
1. all functions of Xi that influence Y(0) have the same means for the treated and
controls, or more strongly,
2. the multivariate density of the covariates for the treated is the same as that of
the controls.
Conditions for Unbiasedness
First, throughout this analysis, assume conditional ignorability with respect to the
non-treatment potential outcome:
ASSUMPTION 7 (CONDITIONAL IGNORABILITY FOR THE NON-TREATMENT OUTCOME)

Yi(0) ⊥ Di | Xi

where Yi(0) is the non-treatment potential outcome and is assumed to be bounded, Di is treatment status, and Xi a vector of observed, pre-treatment covariates. I also assume "common support": 0 < Pr(Di = 1 | Xi) < 1.³
Without loss of generality, Yi(0) can be constructed as:

Yi(0) = g(Xi) + ηi

where (i) E[ηi | Xi] = E[ηi] = 0 for all values of Xi ∈ 𝒳; (ii) g(Xi) may be any integrable function of Xi; and (iii) ηi is bounded, with E[ηi | Di] = 0. This construction makes explicit that Yi(0) may vary with Xi, while allowing for a stochastic element (ηi) and maintaining independence of Yi(0) and Di conditional on Xi due to (iii).
To see the difficulties in estimating the ATT even when conditional ignorability
holds, it is useful to more closely examine its dependence on the distribution of the
³ However, note that Pr(Di = 1 | Xi) must be estimated, and determining whether common support exists in a given case depends (as it always does) on some assumption used to make these estimates.
covariates. The ATT can be re-written:

ATT = E[Y(1) − Y(0) | D = 1]
    = E_{X|D=1} E[Y(1) | X, D = 1] − E_{X|D=1} E[Y(0) | X, D = 1]    (4.3)
    = E_{X|D=1} E[Y(1) | X, D = 1] − E_{X|D=1} E[Y(0) | X, D = 0]    (4.4)

where the substitution between equations 4.3 and 4.4 is possible due to conditional ignorability.⁴ While the aim of this substitution is to make the second term in equation 4.4 identifiable, it remains non-trivial to estimate. Specifically, we only observe Y(0) at locations in X at which control units are found, but we must integrate over these sampled points as though they have the distribution given by the treated, f_{X|D=1}.⁵ Weighting and matching methods are designed to effectively change the distribution of the controls such that an average over the control units is taken as though they had the same empirical distribution (of X) as the treated.
Conditions for Unbiased ATT Estimation
When estimating the ATT by equation 4.2, unbiasedness can be obtained by ensuring
that the mean non-treatment outcome of the treated equals that of the reweighted
controls. This is stated more formally by proposition 1.
PROPOSITION 1 (SUFFICIENCY FOR UNBIASED ATT ESTIMATION) The sample estimator
⁴ Note that equation 4.3 leaves implicit the distribution over which the inner expectations are taken. This is more fully written

ATT = E_{X|D=1} E_{Y(1)|X,D=1}[Y(1) | X, D = 1] − E_{X|D=1} E_{Y(0)|X,D=1}[Y(0) | X, D = 1]

The subscripts on the inner expectations have been suppressed in the text for simplicity and because the argument to the expectation operator correctly indicates the distribution of interest.
⁵ The integration implied by the subscripted expectation operators is more evident when rewritten in integral form:

E_{X|D=1} E[Yi(0) | Xi, Di = 0] = ∫_{x∈𝒳} E[Yi(0) | Xi = x] f_{X|D=1}(x) dx = ∫_{x∈𝒳} g(x) f_{X|D=1}(x) dx
for the ATT under a method of choosing weights, Wi:

ÂTT = (1/Nt) Σ_{i:Di=1} Yi − Σ_{i:Di=0} Wi Yi

is unbiased for the true ATT if and only if

E[Yi(0) | Di = 1] = E[ Σ_{i:Di=0} Wi Yi(0) ]
Proof of this derives directly from consideration of the bias. Specifically, the bias is given by:

bias = E[ÂTT] − ATT
     = E[ (1/Nt) Σ_{i:Di=1} Yi − Σ_{i:Di=0} Wi Yi ] − ( E[Y(1) | Di = 1] − E[Y(0) | Di = 1] )
     = E[Y(0) | Di = 1] − E[ Σ_{i:Di=0} Wi Yi ]    (4.5)

Unbiasedness is thus achieved if and only if E[Y(0) | Di = 1] = E[Σ_{i:Di=0} Wi Yi], proving proposition 1.
The sample analog of proposition 1 is that (1/Nt) Σ_{i:Di=1} Yi(0) = Σ_{i:Di=0} Wi Yi(0), i.e. that mean balance obtains for Yi(0) itself. This is an unusual statement in the sense that Yi(0) is not usually thought of as a direct target of balancing procedures, since Yi(0) is unobserved for the treated, making this impossible to directly verify. A more natural corollary of this is that "any function of X that influences Y(0) must have the same mean among the treated and controls."⁶ I use this latter interpretation throughout the paper.
⁶ These are equivalent statements because if some function of Xi influences Yi(0) and is imbalanced, Yi(0) too will be imbalanced (except in knife-edge cases). Proposition 1 simply takes this logic to its conclusion, by saying that Yi(0) itself must be mean balanced, implying mean balance of any function that influences it. Put differently, anything that might "matter" in determining the outcome (in the absence of the treatment) must be balanced; otherwise average differences between the treated and control on the outcome will emerge as a result of improper adjustment even in the absence of the treatment.
Second, while conditional ignorability and proposition 1 are sufficient for unbiasedness, a more standard approach is to consider a slightly broader condition that is more typically regarded as the target for matching and weighting methods: equalizing the multivariate densities of the treated and the (reweighted) controls. Multivariate balance is defined as follows:
ASSUMPTION 8 (MULTIVARIATE BALANCE) "Multivariate balance" is achieved when a method finds weights, W, on the control units such that the post-weighting density of the controls equals that of the treated. In finite samples this is required at each observation in the dataset:

f_{W,X|D=0}(Xi) = f_{X|D=1}(Xi), ∀i
We can now state proposition 2:

PROPOSITION 2 (MULTIVARIATE BALANCE PRODUCES UNBIASEDNESS) Under conditional ignorability (assumption 7), the choice of weights that produces multivariate balance as defined by assumption 8 allows for unbiased estimation of the ATT.
Proof is given in the appendix, but the intuition is straightforward. By assumption 7, at any given Xi = xi the observed outcomes from the controls are equal in expectation to the non-treatment outcomes that treated units would take if found at the same value xi. The problem of having to average these control outcomes over the distribution of the treated is solved because the distribution of the controls is the same as that of the treated when assumption 8 holds.
Note also the relationship between multivariate balance and proposition 1. When multivariate balance does not hold, it is possible that some function of X has different means among the treated and controls, which will induce bias if that function of X is correlated with Y(0). As a condition for unbiasedness, then, multivariate balance is stronger than necessary, in that it ensures mean balance on every function of X, whereas proposition 1 only requires balance on Y(0), and thus balance on functions
of X that influence Y(0).⁷
Unbiasedness, Matching, and Mean Balance
To complete the analysis, consider three situations and the potential for bias in each:
exact matching, distance-minimizing matching methods for approximate matching,
and approaches that ensure equality of means or higher moments of covariates that
are explicitly chosen by the investigator.
Exact Matching
First, exact matching ensures that multivariate balance (assumption 8) holds. It is
therefore sufficient, under conditional ignorability, for unbiased ATT estimation.
Approximate Matching
Second, consider approximate matching by methods such as nearest-neighbor, Mahalanobis distance, and genetic matching, using single matching for simplicity. These
methods are typically optimized based on tests of univariate balance measures, and while they may bring f_{W,X|D=0} closer to f_{X|D=1}, they do not ensure multivariate balance and thus fail assumption 8. The potential bias can be appreciated by examining each matched pair's contribution to it. Specifically, matching attempts to
pair observations i (a treated unit) and j (a control unit) such that some measure of the distance between xi and xj is as small as possible.
It then must be assumed that E[Y(0) | D = 1, X = xi] = E[Y(0) | D = 0, X = xj], so that the observed outcomes from the control unit at xj can substitute for the non-treatment outcomes of a treated unit placed at xi. Under conditional ignorability, this implies E[Y(0) | X = xi] = E[Y(0) | X = xj], or simply g(xi) = g(xj). The bias due to each matched pair is g(xi) − g(xj), and the total bias is the average of these over all pairs. Besides knife-edge cases, unbiasedness is achieved only if g(xi) = g(xj) for
⁷ Strictly speaking, even the common support assumption is too strong by this logic: for a variable Xa in X that does not influence Y(0) through g(X), neither equality of marginal distributions nor common support is required.
each matched pair {i, j}. This does not generally hold.⁸ As shown in Abadie and Imbens (2006), the resulting bias due to these "matching discrepancies", ||xi − xj||, does not shrink fast enough to achieve √N-consistency of ATT estimates for many problems.⁹
Mean Balancing on X
Third, consider "mean balancing" methods, those that achieve mean balance on covariates. Note that since X can include higher-order transforms of the original covariates (squared terms, multiplicative interactions, etc.), this allows balance to be sought on any desired sample moments of the original covariates. While matching estimators may nearly obtain mean balance on the included covariates, other methods target mean balance more directly through weighting, including entropy balancing (Hainmueller, 2012) and the covariate-balancing propensity score (Imai and Ratkovic, 2014).
A key fact for understanding when such estimators are unbiased is that mean balance on X implies mean balance on all linear functions of X:

PROPOSITION 3 (BALANCE ON LINEAR TRANSFORMS OF X) When weights W achieve mean balance on the (possibly augmented) covariates Xi according to:

Σ_{i:D=0} Wi Xi = (1/Nt) Σ_{i:D=1} Xi

all linear functions of Xi, which evaluate to Xiᵀθ at the observed points, also have the same mean for the treated and the weighted control samples:

Σ_{i:D=0} Wi (Xiᵀθ) = (1/Nt) Σ_{i:D=1} Xiᵀθ    (4.6)
⁸ Moreover, this fails particularly when E[g(X) | D = 1] ≠ E[g(X) | D = 0], which is generally suspected to be the case in problems that require conditioning.
⁹ When g(Xi) is not a known function, there is no guarantee of how severe the remaining bias is. When g(Xi) is known, it can be used to adjust for the bias in each pair, as proposed by Abadie and Imbens (2011). However, the presumption that g(Xi) is a known function may bring us back to the specification assumptions that matching was meant to avoid.
Proof follows simply from the fact that equation 4.6 can be rewritten as:

θᵀ Σ_{i:D=0} Wi Xi = θᵀ (1/Nt) Σ_{i:D=1} Xi

and Σ_{i:D=0} Wi Xi = (1/Nt) Σ_{i:D=1} Xi under mean balance.
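Proposition 3 is easy to verify numerically. The sketch below constructs control weights achieving mean balance on X by solving a small linear system (nonnegativity is ignored here, since the algebra of the proposition does not depend on it), then checks balance on an arbitrary linear function Xᵀθ; all data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Xt = rng.normal(1.0, 1.0, (50, 4))    # treated covariates
Xc = rng.normal(0.0, 1.0, (150, 4))   # control covariates

# Solve for control weights with sum(w) = 1 and Xc' w = mean(Xt), taking
# the least-norm solution of the stacked linear system.
A = np.vstack([Xc.T, np.ones((1, len(Xc)))])
b = np.concatenate([Xt.mean(axis=0), [1.0]])
w = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(Xc.T @ w, Xt.mean(axis=0))   # mean balance on X

# Proposition 3: any linear function X' theta is then also mean balanced.
theta = rng.normal(size=4)
assert np.isclose((Xc @ theta) @ w, (Xt @ theta).mean())
```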
Recall that due to proposition 1, unbiasedness requires that all functions of Xi influencing Y(0) are mean balanced after weighting. Since mean balance on Xi guarantees mean balance only on linear functions of Xi, mean balance on Xi only ensures unbiasedness if the structural component of Y(0), g(Xi), is linear in Xi.

Suppose, in contrast, that a nonlinear component of Y(0), h(Xi), exists, such that Yi(0) = Xiᵀβ + h(Xi) + ηi. If h(Xi) is non-zero, unbiasedness is not guaranteed. In this case, the bias is given by

bias = E[ Σ_{i:D=0} h(Xi) Wi − (1/Nt) Σ_{i:D=1} h(Xi) ]    (4.7)

This bias is derived in the appendix. Note that it is closely related to the correlation between h(Xi) and the treatment assignment on the weighted data.
4.5 The Proposed Method
Kernel Functions
Consider a kernel function, k(·, ·) : 𝒳 × 𝒳 → ℝ, taking in covariate vectors from any two observations and producing a single real-valued output interpretable as a measure of similarity between those two vectors. Here I use the Gaussian kernel:

k(Xi, Xj) = e^{−||Xi − Xj||² / σ²}    (4.8)

Note that k(Xi, Xj) has a clear interpretation as a measure of similarity between Xi and Xj. Furthermore, consider a feature map, φ(·), mapping any given observation, Xi, to a P′-dimensional vector, φ(Xi), where P′ may be very large or even infinite.
For any positive semi-definite kernel,¹⁰ there exists a choice of feature mapping φ(·) such that ⟨φ(Xi), φ(Xj)⟩ = k(Xi, Xj). That is, for a given kernel, there exists a choice of expansion φ(·) such that the inner product of φ(Xi) and φ(Xj) can be computed by taking k(Xi, Xj), even if φ(·) cannot be explicitly formed.¹¹
A critical piece of notation is the kernel matrix, K, constructed to store the results of each pairwise application of the kernel, i.e. K_{i,j} = k(Xi, Xj) = ⟨φ(Xi), φ(Xj)⟩. To reduce notation it is useful to order the observations so that the Nt treated units come first, followed by the Nc control units. Then K can be partitioned into two rectangular matrices, Kt and Kc. Kt is the "left half" of K and is N × Nt. The jth row of Kt indicates the similarity of the jth observation in the dataset to the first treated unit, to the second treated unit, and so on. Likewise, Kc is the "right half" of K, is N × Nc, and its jth row describes the similarity of the jth observation in the dataset to each of the control units.

By symmetry of K, the average row of K for the treated is identical to the average column of K for the treated, and can be written (1/Nt) Kt 1_{Nt}. Likewise, the (unweighted) mean column (or row) of K belonging to the controls would be (1/Nc) Kc 1_{Nc}. The weighted average column (or row) of K is (1/Nc) Kc W for the Nc × 1 vector of weights W such that Σi Wi = Nc.
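This bookkeeping can be sketched in a few lines. The data, sample sizes, and bandwidth below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma2):
    """Pairwise Gaussian kernel, K[i, j] = exp(-||X_i - X_j||^2 / sigma2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma2)

rng = np.random.default_rng(0)
Nt, Nc, P = 5, 8, 3
X = np.vstack([rng.normal(0.5, 1, (Nt, P)),    # treated units ordered first
               rng.normal(-0.5, 1, (Nc, P))])  # then control units

K = gaussian_kernel_matrix(X, sigma2=2.0 * P)  # bandwidth choice illustrative
Kt, Kc = K[:, :Nt], K[:, Nt:]                  # left/right partition of K

kbar_t = Kt @ np.ones(Nt) / Nt   # mean treated column of K, length N
W = np.full(Nc, 1.0)             # uniform weights summing to Nc
kbar_c_w = Kc @ W / Nc           # weighted mean control column of K
```

Note that K is symmetric with ones on the diagonal, since every observation is maximally similar to itself under the Gaussian kernel.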
Proposal: Mean balance on K
A reasonable weighting approach is to achieve mean balance on the matrix of original covariates, X, which is to say that the average vector Xi for the treated is equal to the weighted mean of Xi for the controls. The kernel balancing procedure is analogous to this, but seeks balance instead in K. That is, consider a single row or column of K:

ki = [k(Xi, X1), k(Xi, X2), …, k(Xi, XN)]
¹⁰ A kernel is positive semi-definite if Σi Σj ai aj k(Xi, Xj) ≥ 0, ∀ ai, aj ∈ ℝ, Xi ∈ ℝᴰ, D ∈ ℤ⁺.
¹¹ For example, suppose X = [X⁽¹⁾, X⁽²⁾] and we choose the kernel (1 + ⟨Xi, Xj⟩)². This choice of kernel happens to correspond to φ(X) = [1, √2·X⁽¹⁾, √2·X⁽²⁾, X⁽¹⁾X⁽¹⁾, √2·X⁽¹⁾X⁽²⁾, X⁽²⁾X⁽²⁾], and one can confirm that k(Xi, Xj) = ⟨φ(Xi), φ(Xj)⟩ for this choice of kernel and φ(·). For the Gaussian kernel, the corresponding choice of φ(Xi) happens to be infinite-dimensional, but can be understood roughly as listing the distance (in a Gaussian sense) of an observation at Xi to every other point in 𝒳.
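The feature-map identity in footnote 11 can be checked numerically for the polynomial kernel (1 + ⟨Xi, Xj⟩)² in two dimensions:

```python
import numpy as np

def phi(v):
    # explicit feature map corresponding to the kernel (1 + <x, y>)^2
    return np.array([1.0,
                     np.sqrt(2) * v[0], np.sqrt(2) * v[1],
                     v[0] ** 2,
                     np.sqrt(2) * v[0] * v[1],
                     v[1] ** 2])

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)

k = (1.0 + x @ y) ** 2
assert np.isclose(k, phi(x) @ phi(y))  # kernel equals inner product of features
```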
Each ki is analogous to Xi, but it describes the data in new terms, using an N-dimensional vector of similarity measures rather than the original coordinates of X. Similar to mean balancing with the data taken as Xi, kernel balancing then seeks the weights that ensure the average ki of the treated is equal to the weighted mean vector ki of the controls:

k̄t = (1/Nc) Σ_{i:D=0} Wi ki

where k̄t is the average row of K for the treated units, with non-negative weights W that sum to Nc.
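A generic-solver sketch of this balance condition follows. The data and bandwidth are illustrative, and nonnegativity of the weights is ignored here (a simple least-squares solve); the kbal software handles the full constrained problem:

```python
import numpy as np

rng = np.random.default_rng(0)
Nt, Nc, P = 20, 60, 3
X = np.vstack([rng.normal(0.5, 1, (Nt, P)),    # treated ordered first
               rng.normal(-0.5, 1, (Nc, P))])  # then controls

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2.0 * P))           # Gaussian kernel; bandwidth illustrative
Kt, Kc = K[:, :Nt], K[:, Nt:]
kbar_t = Kt.mean(axis=1)              # mean treated column of K, length N

# Stack the balance conditions (1/Nc) Kc w = kbar_t with the constraint
# sum(w) = Nc and take the least-squares solution.
A = np.vstack([Kc / Nc, np.ones((1, Nc))])
b = np.concatenate([kbar_t, [float(Nc)]])
w = np.linalg.lstsq(A, b, rcond=None)[0]

w0 = np.full(Nc, 1.0)                 # uniform weights, for comparison
imb_before = np.abs(Kc @ w0 / Nc - kbar_t).max()
imb_after = np.abs(Kc @ w / Nc - kbar_t).max()
```

Even this crude solver sharply reduces the worst-case imbalance across the columns of K relative to uniform weights.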
In what follows, I explain why obtaining balance in this way achieves both approximate mean balance on a large set of functions of Xi that could influence Y(0), and
approximate multivariate balance, thus achieving approximately unbiased estimation
by propositions 1 and 2. I present two equivalent but separate views that provide
alternative interpretations of why this occurs.
View 1: Mean balance in K implies mean balance on many
smooth functions of X
Recall that under conditional ignorability, unbiased estimation can be achieved if any function influencing Y(0) - or simply Y(0) itself - is mean balanced after weighting (proposition 1). We begin with the view that kernel balancing ensures that a large space of smooth functions is mean balanced, and that these functions are likely to include most plausible forms for Y(0).
Understanding mean balance in a large set of functions of the covariates begins with considering the space of functions that are linear in K. These are functions that, when evaluated at all N observations, produce values Kc for c ∈ ℝᴺ.
View 1A: Superposition of Gaussians
There are two important interpretations of what this function space looks like. The
first is the "superposition of Gaussians view". Suppose we place a Gaussian kernel
over each observation in the dataset, rescale each Gaussian by the value of ci for
that observation, then sum the resulting rescaled Gaussians to form a single surface. By varying the scaling factors in c, an enormous variety of smooth functions can be formed, approximating a wide variety of non-linear functions of the covariates. This view is described and illustrated at length in Hainmueller and Hazlett (2013), where this function space is used to successfully model highly non-linear functions even in high-dimensional problems. Critically, achieving mean balance on the vectors ki achieves mean balance on all functions Kc, and thus on all the smooth functions formed by superposition of Gaussians in this way. Thus, the many smooth functions of Xi that can be built by superposition of Gaussians can be mean balanced between the treated and controls, simply by achieving mean balance in K instead of X. Smooth functions of the covariates that influence the outcome (such as intensity in the motivating example above) can thus be balanced. More directly, if Y(0) itself is a smooth function of Xi in the function space representable as Kc, it too is directly mean balanced, ensuring unbiasedness of the ATT.
View 1B: Mean balance in φ(X)

A closely related view, more familiar in machine learning theory, relates to a feature-space expansion of the original data. Under this view, the benefit of achieving mean balance on the N columns of K is that it also achieves mean balance on a very high-dimensional set of features, φ(Xi), such that the outcome Y(0) is likely to be linear in these features, and thus mean balanced as well.

As introduced above, the feature map φ(Xi) is related to the choice of kernel such that the inner product ⟨φ(Xi), φ(Xj)⟩ is equal to simply k(Xi, Xj), and the φ(·) corresponding to the Gaussian kernel happens to be infinite-dimensional. Perhaps surprisingly, the weights W that achieve mean balance on φ(Xi) can be found without the need to explicitly form φ(Xi).
PROPOSITION 4 (BALANCE IN K IMPLIES BALANCE IN φ(X)) Let the mean row of K among the treated units be given by k̄t = (1/Nt) Kt 1_{Nt}, and the weighted mean row of K among the controls be given by (1/Nc) Kc W. If k̄t = (1/Nc) Kc W, then φ̄t = φ̄w, where φ̄t = (1/Nt) Σ_{Di=1} φ(Xi) and φ̄w = (1/Nc) Σ_{Di=0} φ(Xi) Wi.
Proof is given in the appendix. Proposition 4 states that mean balance in K implies mean balance in each feature of φ(Xi). The benefit of this can be understood in several ways. First and most simply, φ(Xi) is a much richer representation, akin to taking many higher-order and multivariate transforms of the covariates. Second, if the control and treated have the same means on every dimension of φ(Xi), then they have the same means on all linear combinations φ(Xi)ᵀθ. So long as Y(0) is well captured by the many smooth functions that are linear in φ(Xi), it is balanced, and the ATT can be estimated without bias.¹² This is the same space described previously as the superposition of Gaussians, and it captures a very wide variety of smooth functions.¹³
In practical terms, then, kernel balancing answers the question of "what to balance on" by offering a set of transformations, the rows of K, such that balance on these transformations ensures balance on a very large set of smooth functions likely to include, or nearly include, any smooth function of Xi that influences Y(0), or Y(0) itself.
Equalization of smoothed multivariate densities
The second view of what is achieved by kernel balancing relates to the quest for multivariate balance. Recall proposition 2, which states that under conditional ignorability (assumption 7), obtaining multivariate balance is (more than) sufficient for unbiased estimation of the ATT. Matching techniques attempt to achieve this equalization
¹² As noted, the remaining bias is given by

bias = Σ_{i:D=0} h(Xi) Wi − (1/Nt) Σ_{i:D=1} h(Xi)

As the function space in which balance is achieved grows richer, even if Y(0) is not fully captured in that space, var(h(Xi)) grows smaller, resulting in decreased bias. Formal results regarding the lower bias of kernel balancing and the rate at which the remaining bias dissipates are forthcoming.
¹³ One further interpretation is that since φ̄t = φ̄w, the treated and controls have the same class-means in the feature space. Thus any classifier that takes observation i and classifies it based on whether φ(Xi) is nearer to the class-mean of the treated (φ̄t) or the class-mean of the controls (φ̄w) would be unable to make classifications for any observation. The logic of finding balance by considering a subset in which treated and control can no longer be distinguished is also explored in Ratkovic (2012).
asymptotically by pairing together control and treated units with similar locations in X, but generally fail to achieve multivariate balance, and are typically optimized and tested with respect to univariate balance. Here I show that kernel balancing approximately equalizes the multivariate covariate distributions for the treated and weighted controls, as estimated by a particular smoother:

PROPOSITION 5 (BALANCE IN K IMPLIES EQUALITY OF SMOOTHED MULTIVARIATE DENSITIES) Consider a density estimator for the treated, f̂_{X|D=1}, and for the (weighted) controls, f̂_{X|D=0,W}, each constructed with kernel k(·, ·) of bandwidth σ² as described below. The choice of weights that ensures mean balance in the kernel matrix K constructed by the same choice of kernel ensures that f̂_{X|D=1} = f̂_{X|D=0,W} at every position in X at which an observation is located.
While the full proof of proposition 5 is given in the appendix, I describe here the key intuitions that are required. Density estimation seeks to estimate an underlying density function that can be evaluated at new locations where observations have not previously occurred. In this sense, it always requires some assumption about how an observation in one exact location of X should be "smeared" to understand the probability at nearby points. In a univariate context, the typical (Parzen-Rosenblatt window) approach estimates a density function according to:

f̂(x) = (1/N) Σ_{i=1}^{N} k_{σ²}(x, Xi)

for kernel function k with choice of bandwidth σ².¹⁴ The Gaussian kernel is among the most commonly used for this task. Generalizing to a Euclidean distance, a
¹⁴ This is typically written in the form f̂(x) = (1/N) Σi k_{σ²}(|x − Xi|). However, translation-invariant kernels - including the Gaussian - are those that operate only on the difference between the two input arguments. For such kernels, it is always possible to equivalently write k(|x − Xi|) as k(x, Xi). For example, the Gaussian could be written as k(z) = e^{−z²/σ²}, with z = |x − Xi|. I use the two-argument form here for consistency with the remainder of the paper.
multivariate density estimator can be given by:

f̂(x) = (1 / (N (πσ²)^{D/2})) Σ_{i=1}^{N} k(x, Xi)

where the bandwidth is defined by σ², and the normalizing constants are required since they are not included in the definition of the Gaussian kernel used throughout this paper.¹⁵ Such density estimators are intuitively understandable as a process of placing a multivariate Gaussian kernel over each observation's location in X, then summing them into a single surface and rescaling, providing a density estimate at each location.¹⁶
Notice that this estimator is not a strictly local one: the density at a given point is a function of the distance to every other point in the dataset. Local estimators are highly sensitive to the curse of dimensionality, because the size of a volumetric neighborhood required to include even the nearest observations grows very quickly with dim(X). By contrast, the density estimator applied here is less sensitive to dimensionality, because it depends on the Euclidean distance, ||Xi − X||, which grows only linearly in dim(X).
For a sample consisting of X1, …, XN, a density estimate can be made at position x* by multiplying the corresponding row of K by a column of normalizing weights, each equal to 1/N (up to the normalizing constant above). Specifically, constructing the kernel matrix K using the Gaussian kernel and right-multiplying it by such a column vector produces values numerically equal to (1) constructing such an estimator based on all the observations represented in the columns of K, then (2) evaluating the resulting density estimates
¹⁵ The denominator here involves πσ² rather than the conventional 2πσ² because the σ² used in our particular kernel is what plays the role normally played by 2σ² in the expression for the normal density.
¹⁶ Note that in some cases, the density estimator constructed in this way would not be a natural one, for example when X is a categorical variable or has sharp bounds. Nevertheless, this approach will apply the same smoothing estimator of density in the case of treated and control values. Obtaining equality of these smoothed density estimates for the treated and controls is thus still useful, and means obtaining equality on a function that is similar to the true underlying density function. For the same reason it is also not critical to get the kernel bandwidth σ² exactly "correct", if a correct value exists. Nevertheless the choice of σ² implies a bias-variance tradeoff, which I discuss briefly below and will discuss more fully in future drafts.
at all the positions represented by the rows of K.
The expression (1/Nt) Kt 1_{Nt} thus returns estimates for the density of the treated, measured at all points in X. Likewise, (1/Nc) Kc 1_{Nc} estimates the density of the control units and returns its evaluated height at every datapoint observed, and (1/Nc) Kc W does the same for the reweighted density of the controls. Proposition 5 states that the choice of W found by kernel balancing to achieve (1/Nc) Kc W = k̄t is also the choice that equalizes the smoothed density estimates for the treated and weighted controls at every point in the dataset. Proof is given in the appendix.
Note that the density estimate at a given point depends on the choice of kernel,
including its bandwidth. The choice of a Gaussian kernel for density estimation is
common. The choice of σ² is more difficult, and is discussed further below.
Figure 4-3 provides a graphical illustration of the density-equalizing property of the kernel balancing weights for a one-dimensional problem. The left panel shows the x values for 10 treated units, drawn from N(0.5, 1) (red dots), and for 30 control units (black dots) drawn from N(−0.5, 1). In each case, the appropriately rescaled Gaussian is placed over each observation and summed to form the density estimator for the treated (solid red line) and for the controls (solid black line). In the right panel of figure 4-3, the heights of the Gaussians over each control unit are adjusted according to the weights given by kernel balancing (dashed blue lines). When these reweighted Gaussians are summed to form the reweighted density estimator of the controls (solid blue line), it closely matches the density of the treated.
A Continuous Multivariate Imbalance Measure
The construction of estimated multivariate densities evaluated at each point in the
dataset immediately suggests a balance metric that simply compares the estimated
densities of the treated and controls at all points in the dataset for a given choice of
weights. One reasonable way to combine the pointwise estimated differences into a
Figure 4-3: Density Equalizing Property of the kbal Weights

[Two panels over a one-dimensional covariate x, roughly from -4 to 4, showing the treated, control, and weighted-control densities.]

Left: Density estimates for the treated and (unweighted) controls. Red dots show the location of 10 treated units.
The dashed black lines show the appropriately scaled Gaussian over each observation, which sum to form the density
estimator for the treated (solid red line). Similarly, the black dots indicate the location of 30 control units, and the
solid black line gives the resulting density estimate. The L1 imbalance (see below) is measured to be 0.32. Right: The
weights chosen by kernel balancing effectively rescale the height of the Gaussian over each control observation (dashed
blue lines). The summed density from the rescaled controls (solid blue line) now closely matches the density of the
treated across the covariate space. The L1 imbalance is now measured to be 0.002.
summary measure is an L1 metric, whose sample analogue is:

L_1 = \frac{1}{2} \sum_{i=1}^{N} \left| \hat{f}_{D=1}(X_i) - \hat{f}_{w,D=0}(X_i) \right|

For interpretability, the values of f_{D=1} and f_{w,D=0} are first normalized to sum to one.17
Note that the density estimates depend on the underlying choice of kernel
bandwidth, discussed below. This metric is similar to the L1 metric used in CEM
(Iacus et al., 2012), but without requiring coarsening in order to construct discrete
bins. As noted above, using a global rather than local approach to density estimation
makes kernel balancing tolerant of high-dimensional data.
This L1 imbalance measure can be applied as a measure of multivariate imbalance
with any matching or weighting method, not just kernel balancing. The kbal software
computes L1 on the original data and after balancing. It also provides the multivariate
density for the treated and the controls as computed at each point in the dataset,
which can be useful for visualizing overlap and diagnosing which treated units are
most difficult to accommodate.

17 While the underlying continuous functions for the densities each integrate to 1, in general a
series of estimated heights drawn from this surface does not sum to 1. A rescaling is thus useful for
interpretational purposes.
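As a rough numerical illustration (not the kbal implementation), the sample analogue of the L1 metric can be computed directly from the two vectors of estimated densities; the function name below is hypothetical:

```python
import numpy as np

def l1_imbalance(f_treated, f_control_w):
    """Sample analogue of the L1 metric: half the sum of absolute differences
    between the treated and weighted-control density estimates, each first
    normalized to sum to one for interpretability."""
    p = np.asarray(f_treated, dtype=float)
    q = np.asarray(f_control_w, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()

# Identical (normalized) densities give L1 = 0; disjoint ones give L1 = 1.
f1 = np.array([0.2, 0.5, 0.3, 0.0])
f2 = np.array([0.0, 0.0, 0.0, 1.0])
print(l1_imbalance(f1, f1))  # 0.0
print(l1_imbalance(f1, f2))  # 1.0
```

The factor of one half ensures the metric lies in [0, 1], with 0 indicating identical estimated densities and 1 indicating no overlap at all.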
4.6
Implementation
Achieving Balance on K
A method is needed to find the weight vector w such that (1/N_t) K_t^T 1_{N_t} = K_c^T w holds as nearly
as possible, while constraining the weights to be positive and to have minimal variation. To achieve this, I employ entropy balancing (Hainmueller, 2012), which satisfies these
conditions while maximizing the entropy of the distribution of weights.
However, establishing balance on all columns of K by entropy balancing is computationally infeasible, owing to the near co-linearity of many columns of K. This
co-linearity is perfect in cases where a single observation is repeated exactly, but even
if this does not occur, there may be a multitude of similarly suitable solutions with
different weights. Instead, I first project K onto its major principal components using principal component analysis.18 The number of factors retained for balancing,
starting with those corresponding to the largest eigenvalues, is determined by the
parameter numdims. The algorithm will converge when numdims is small enough
to avoid excessive co-linearity. The balance as measured by L1 improves as numdims
initially rises, and then typically deteriorates once numdims grows too large, as both
overfitting and numerical instability begin to creep in. Thus, when numdims is not
user-provided, an optimization is performed to find the value of numdims that produces the best L1 balance. Note that this does not involve the outcome in any way.19
18 The aim here is to get approximate balance on K by getting balance on the principal components
of K. In kernel methods, "kernel PCA" is sometimes used. That approach treats K as the covariance
matrix of φ(X), since K_ij = φ(x_i)^T φ(x_j). Thus directly computing the eigenvectors of K effectively
produces principal components for data originally in the coordinates of φ(x). Here, a traditional
PCA is computed instead, taking the eigen-decomposition of K̃^T K̃, where K̃ is a centered version
of K. This is more in keeping with the intention of getting balance on K through its principal
components, and also demonstrates slightly better performance in practice than the kernel-PCA
approach.
19 In addition, kbal computes the quantity pctvarK, which is the percentage of variation in the
An illustration of the relationship between numdims, L1, and the balance achieved on
unknown functions of X_i is given in the appendix.
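The two steps above (projecting K onto its leading principal components, then finding maximum-entropy weights that balance them) can be sketched as follows. This is a simplified illustration rather than the kbal implementation: it solves the convex dual of the entropy balancing problem (Hainmueller, 2012) by generic numerical optimization, and all function and variable names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def kbal_weights(K, treated, numdims=2):
    """Sketch of the balancing step: project K onto its leading principal
    components, then find entropy-balancing-style control weights that
    equate treated and weighted-control means on those components.
    `K` is the N x N kernel matrix; `treated` is a boolean vector."""
    Kc = K - K.mean(axis=0)                    # center the columns of K
    _, _, Vt = np.linalg.svd(Kc, full_matrices=False)
    Z = Kc @ Vt[:numdims].T                    # principal-component scores
    target = Z[treated].mean(axis=0)           # treated means to reproduce

    Zc = Z[~treated] - target
    # Convex dual: minimizing log-sum-exp(Zc @ lam) yields softmax weights
    # whose first-order condition is exactly sum_i w_i (Z_i - target) = 0.
    def dual(lam):
        v = Zc @ lam
        m = v.max()                            # numerical stabilization
        return m + np.log(np.exp(v - m).sum())

    lam = minimize(dual, np.zeros(numdims), method="BFGS").x
    v = Zc @ lam
    w = np.exp(v - v.max())
    return w / w.sum(), Z

# Toy data: 20 treated units mildly shifted relative to 40 controls.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
treated = np.arange(60) < 20
X[treated] += 0.3                              # induce imbalance
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * X.shape[1]))             # sigma^2 = 2 * dim(X)

w, Z = kbal_weights(K, treated, numdims=2)
bal = np.abs(w @ Z[~treated] - Z[treated].mean(axis=0)).max()
print(bal < 1e-3)
```

At the optimum, the weighted control means on the retained components match the treated means (up to the optimizer's tolerance), which is the condition the entropy balancing step is asked to deliver.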
Choosing σ², the bias-variance tradeoff, and common support
The kernel bandwidth, σ², plays an important role in determining the precision with
which balance is assessed and achieved, governing a bias-variance tradeoff. Under the
"mean balance in φ(X_i)" view, σ² is most naturally viewed as a measurement decision
that determines the construction of φ(X_i), and particularly how close two points X_i
and X_j need to be in order to have highly similar features φ(X_i) and φ(X_j). Under
the "equalization of smoothed multivariate densities" view, σ² can be understood
as how "blurry" or sharply resolved the density functions are taken to be prior to
weighting. One can therefore think of σ² as controlling the "precision" of the match:
while balance is typically obtainable under a range of values, σ² describes how high
a bar this actually is, with smaller values of σ² implying balance to a finer level of
detail.20
How should σ² be chosen? The question is difficult to answer, as it implies a
bias-variance tradeoff and there is no clear way of determining the ideal point along
this tradeoff. Occasionally σ² may be set too small for balance to be achievable at
all, in which case the algorithm will not converge. Such cases effectively represent an
absence of common support, when density is assessed at a given choice of σ². In these
cases, one option is raising σ², which further "spreads out" the density contributed
by each observation, thus increasing the scope for common support. Alternatively, it may
be necessary to drop treated units for which matches are most difficult to find (see
section 4.6).

Fortunately, however, this choice is often not strictly necessary. In many cases,
matrix K accounted for by the included factors, computed based on the sum of squared eigenvalues.
At the choices of numdims that minimize L1 for a given problem, pctvarK is consistently 0.99
or higher. This indicates that while balancing on a subset of dimensions of K is not ideal, it
does account for a large majority of the variation in the matrix.
20 This role is somewhat analogous to the role of bin size in CEM, where exact matching can be
obtained within-bin, and this implies more precise matches when bin sizes are small than when
they are large.
balance is achievable across a wide range of σ² values. While lower values of σ²
are generally preferable, smaller σ² may produce highly "concentrated" weights, i.e.
solutions that depend on placing very large weights on a very small proportion of the
controls. Numerous metrics could be used to assess the concentration of the weights,
including variance or entropy measures. For an easily interpretable metric, I use the
quantity min90, which is the minimum number of control units required to
account for 90% of the total weight among the controls. For example, if min90 = 20,
then 90% of the total weight of the controls comes from just the 20 most heavily-weighted
observations.
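The min90 quantity is straightforward to compute from a weight vector. A minimal sketch (the function name is hypothetical, and the kbal implementation may differ):

```python
import numpy as np

def min90(w, coverage=0.90):
    """Minimum number of control units whose weights, sorted in descending
    order, account for `coverage` of the total weight."""
    w = np.sort(np.asarray(w, dtype=float))[::-1]   # largest weights first
    cum = np.cumsum(w) / w.sum()                    # cumulative weight share
    return int(np.searchsorted(cum, coverage) + 1)

# Uniform weights over 100 controls: 90 units are needed for 90% of weight.
print(min90(np.ones(100)))                # 90
# Highly concentrated weights: one unit alone holds 95% of the total.
print(min90([0.95] + [0.05 / 99] * 99))   # 1
```

Small values of min90 thus flag solutions that lean heavily on a handful of controls, the concern raised above for overly small σ².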
I propose choosing σ² = 2 dim(X) as a rule of thumb. The average squared Euclidean
distance E[||X_i - X_j||²] that enters into the kernel calculation scales with dim(X).
Choosing σ² proportional to dim(X) thus ensures a relatively sound scaling of the
data, such that some observations appear to be closer together, some further apart,
and some in-between, regardless of dim(X). The constant of proportionality, however,
remains open to debate. Empirically, the choice of σ² = 2 dim(X) has offered very
good performance, and so this is the default value of σ², though clearly further work
is needed to justify this choice.21
This rule-of-thumb approach is a useful starting point and is used in all simulated
and empirical examples presented here. Results are not typically highly sensitive to
the choice of σ². Nevertheless, investigators may wish to present their results across a
range of σ² values to ensure that this holds in any particular example. Where results
do vary across σ² values, inspecting L1 and min90 can be helpful for determining an
appropriate value.
21 This is similar to the approach used in KRLS, where the default setting is σ² = dim(X).
However, KRLS is tolerant of a wide range of σ² values because the smoothing parameter, λ, is
free to vary, and the two terms largely compensate for each other. Accordingly, KRLS at σ² =
2 dim(X) shows nearly identical performance to the original default of dim(X), offering excellent
power to detect highly nonlinear, nonadditive relationships even in small samples. This provides
some assurance: recall that kernel balancing achieves mean balance on all elements of φ(X) for a
given choice of kernel, and thus KRLS will be unable to detect any differences between treated and
controls on data reweighted by kernel balancing. Since KRLS with the choice of σ² = 2 dim(X) is
powerful in detecting a wide range of nonlinear, nonadditive functional forms, the guarantee that
kernel balancing controls for all such confounding functions when the same σ² is used makes this an
attractive choice.
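A minimal sketch of this rule of thumb, assuming (as in KRLS) that the covariates are first standardized; the function name is hypothetical:

```python
import numpy as np

def rule_of_thumb_kernel(X):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / sigma2),
    with the rule-of-thumb bandwidth sigma2 = 2 * dim(X), applied to
    standardized data (an assumption of this sketch)."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    sigma2 = 2 * X.shape[1]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(0)
K = rule_of_thumb_kernel(rng.normal(size=(50, 5)))
# For standardized data, E||x_i - x_j||^2 = 2 * dim(X), so a typical
# off-diagonal entry is near exp(-1): neither ~0 nor ~1, which is the
# "sound scaling" the rule of thumb aims for.
off_diag = K[~np.eye(50, dtype=bool)]
print(0.05 < off_diag.mean() < 0.8)  # True
```

The point of the check is that with this bandwidth, kernel entries spread over the interior of (0, 1) rather than collapsing toward 0 or 1 as dim(X) changes.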
Optional Trimming of the Treated
In some cases, balance can be greatly improved, with less variable (and thus more
efficient) weights, if the most difficult-to-match treated units are trimmed. In estimating an ATT, control units in areas with very low density of treated units can
always be down-weighted (or dropped if the weight goes to zero), but treated units in
areas unpopulated by control units pose a greater problem. These areas may prevent
any suitable weighting solution, or may force extremely large (and thus inefficient)
weights onto a small set of controls.
While estimates drawn from samples in which the treated are trimmed no longer
represent the ATT with respect to the original population, they can be considered a
local or sample average treatment effect within the remaining population. King et al.
(2011) refer similarly to a "feasible sample average treatment effect on the treated"
(FSATT), based on only the treated units for which sufficiently close matches can
be found. In any case, the discarded units can be characterized to learn how the
inferential population has changed.
However, even when the investigator is willing to change the population of interest
by trimming the treated, it is not always clear on what basis trimming should be done.
In kernel balancing, trimming of the treated can optionally be employed by using the
multivariate density interpretation given above. Specifically, the density estimates
at all points are constructed using the kernel matrix. Then, treated units are trimmed
if f_{X|D=1}(X_i) / f_{X|D=0}(X_i) exceeds the parameter trimratio. The value of trimratio can be set by
the investigator based on qualitative considerations, inspection of the typical ratio of
densities, a willingness to trim up to a certain percent of the sample, or performance
on L1. Whatever approach is taken to determine a suitable level of trimratio, kbal
produces a list of the trimmed units, which the investigator can examine to determine
how the inferential population has changed.
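A sketch of this trimming rule, using the kernel matrix to form the two density estimates. The function and the toy data are hypothetical, and the kbal implementation may differ in its details:

```python
import numpy as np

def trim_treated(K, treated, trimratio=5.0):
    """Flag treated units whose estimated density ratio
    f(x|D=1) / f(x|D=0) exceeds `trimratio`, i.e. treated units
    sitting where control density is too low to support a match."""
    f_t = K[:, treated].mean(axis=1)    # smoothed density of the treated
    f_c = K[:, ~treated].mean(axis=1)   # smoothed density of the controls
    return np.where(treated & (f_t / f_c > trimratio))[0]

# Toy 1-D data: one treated unit (x = 6) far outside the control support.
x = np.concatenate([np.linspace(-1, 1, 10), [6.0], np.linspace(-1.5, 1.5, 30)])
treated = np.arange(len(x)) < 11
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)  # sigma2 = 2*dim(X), dim(X) = 1
print(trim_treated(K, treated))  # [10]
```

Only the isolated treated unit is flagged; the remaining treated units sit inside the control support and their density ratios stay near one.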
4.7
Empirical Examples
In this section, I apply kernel balancing to two empirical examples. The first is a
standard benchmark in the literature. Following the example pioneered by LaLonde
(1986) and Dehejia and Wahba (1999) and repeated in many other studies (e.g. Diamond and Sekhon, 2005; Iacus et al., 2012; Hainmueller, 2012), I reanalyze the impact
of a job training intervention, the National Supported Work Demonstration Program
(NSW). This is a difficult estimation problem, but one for which an experimental
estimate is also available for comparison. Using default settings and no specification search, the treatment effect estimated by kernel balancing is within 0.7% of the
experimental estimate, though the latter is itself estimated with uncertainty. The
second example applies kernel balancing to reexamine whether democracies are less
successful in fighting counterinsurgencies (Lyall, 2010). The results show that when
high-order balance is achieved by using kernel balancing, democracies are over 25
percentage points less likely to win counterinsurgencies, consistent with theoretical
expectations but in contrast to Lyall (2010).
Example 1: Job Training Benchmark
It is useful to know whether kernel balancing accurately recovers average treatment
effects in observational data under conditions in which an approximately "true" answer is known. This can be approximated using a method and dataset first used by
LaLonde (1986) and Dehejia and Wahba (1999), and which has become a routine
benchmark for new matching and weighting approaches (e.g. Diamond and Sekhon,
2005; Iacus et al., 2012; Hainmueller, 2012).
The aim of these studies is to recover an experimental estimate of the effect of
a job training program, the National Supported Work (NSW) program. Following
LaLonde (1986), the treated sample from the experimental study is compared to a
control sample drawn from a separate, observational sample. Methods of adjustment
are tested to see if they accurately recover the treatment effect despite large observable
differences between the control sample and the treated sample.22 Here I use 185
treated units from NSW, originally selected by Dehejia and Wahba (1999) for the
treated sample. The experimental benchmark for this group of treated units is $1794,
which is computed by difference-in-means in the original experimental data with
these 185 treated units. The control sample is drawn from the Panel Study of Income
Dynamics (PSID-1), containing N = 2490 controls.
The pre-treatment covariates available for matching are age, years of education,
real earnings in 1974, real earnings in 1975 and a series of indicator variables: Black,
Hispanic, and married. Three further variables that are actually transforms of these
are commonly used as well: indicators for being unemployed (having income of $0)
in 1974 and 1975, and an indicator for having no high school degree (fewer than 12
years of education).
As found by Dehejia and Wahba (1999), propensity score matching can be effective
in recovering reasonable estimates of the ATT, but these results are highly sensitive
to specification choices in constructing the propensity score model (Smith and Todd,
2001).
Diamond and Sekhon (2005) use genetic matching to estimate treatment
effects with the same treated sample. While the matching solutions with the highest
degree of balance produced estimates very close to the experimental benchmark, these
models included the addition of squared terms and two-way interactions. Similarly,
entropy balancing (Hainmueller, 2012) has also been shown to recover good estimates
using a similar setup,23 also employing all pairwise interactions and squared terms
for continuous variables, amounting to 52 covariates.
Figure 4-4 reports results from a variety of estimation procedures and specifications. Three procedures are used: linear regression (OLS), Mahalanobis distance
matching (match), and kernel balancing (kbal). For match and kbal, estimates are
produced by a simple difference in means on the matched/reweighted sample.24
22 See Diamond and Sekhon (2005) for an extensive description of this dataset, the debates around
it, and the various subsets that have been drawn from it.
23 Hainmueller (2012) uses the same treated group used here, but a different control dataset based
on the Current Population Survey (CPS-1).
24 Standard errors from matching are the Abadie-Imbens standard errors, though the correct standard errors for matching estimators remain a largely unsolved problem. Standard errors from kernel
balancing are from weighted least squares with fixed weights, which are also incorrect, as they do
For each method, three sets of covariates are attempted: the standard set of 10 covariates described in the text, a reduced set (simple) including only the seven of these
that are not transforms of other variables, and an expanded set (squares) including
the 10 standard covariates plus squares of the three continuous variables. Figure 4-4
shows that the OLS estimates vary widely by specification, and even the estimate
closest to the benchmark ($1794) is off by $1042. Mahalanobis distance matching performs better, though it remains somewhat specification dependent, with its best
estimate (match-squares) falling within $387 of the benchmark. Finally, kernel balancing performs well over all three specifications. While there is some variation by
specification, no estimate is more than $681 from the benchmark, and the standard
specification, kbal, produces an estimate of $1807, within $13 of the benchmark.
From the kernel balancing solution, we can also see that balance is difficult to
achieve in this example, in the sense that it requires focusing on a relatively small
portion of the original control sample. Specifically, at the solution achieved by kernel
balancing, min90 = 193, meaning that 90% of the total weight of the controls comes
from 193 observations. While this is still a reasonable number, and similar to the size
of the treatment group, it implies that approximately 90% of the control sample was
not useful for comparison to the treated. This is appropriate, however, given the large
differences between the treated and control samples. For example, while 72% of the
treated are unemployed in either 1974 or 1975, only 12% of controls are unemployed
in either year.
Example 2: Are Democracies Inferior Counterinsurgents?
Decades of research in international relations have argued that democracies are poor
counterinsurgents (see Lyall, 2010 for a review). Democracies, as the argument goes,
are (1) sensitive to public backlash against wars that become more costly in blood or
not incorporate uncertainty in the choice of weights. Bootstrap or jackknife procedures to obtain
standard error estimates for kernel balancing may be valid (in contrast to matching estimators).
Alternatively, it may be possible to do the entire estimation in an empirical likelihood framework
that would also allow for closed-form estimation of standard errors. Examining this remains an area
for future work.
Figure 4-4: Estimating the Effect of a Job Training Program from Partially Observational Data

[Dot plot of estimates for OLS, OLS-simple, OLS-squares, match, match-simple, match-squares, kbal, kbal-simple, and kbal-squares against the experimental benchmark. X-axis: Effect of Training Program on Income ($), from -2000 to 6000.]

Reanalysis of Dehejia and Wahba (1999), estimating the effect of a job training program on income
using a variety of estimation procedures. Three procedures are used: linear regression (OLS), Mahalanobis distance matching (match), and kernel balancing (kbal). For each, three sets of covariates
are attempted: the standard set of 10 covariates described in the text, a reduced set (simple) including only the seven of these that are not transforms of other variables, and an expanded set
(squares) including the 10 standard covariates plus squares of the three continuous variables. While
OLS and match perform reasonably well, both are sensitive to specification. The best OLS estimate
(OLS-simple) still under-estimates the $1794 benchmark by $1042, while the best matching estimate
(match-squares) is off by $387. Kernel balancing performs reasonably well on all three specifications.
While there is some variation by specification, no estimate is more than $681 from the benchmark,
and the standard specification, kbal, produces an estimate of $1807, within $13 of the benchmark.
treasure than originally expected, (2) are unable to control the media in order to
suppress this backlash, and (3) often respect international prohibitions on brutal tactics
that may be needed to obtain a quick victory. Each of these makes them more prone
to withdrawal from counterinsurgency operations, which often become long and bloody
wars of attrition. Empirical work on this question was significantly advanced by Lyall
(2010), who points out that previous work (1) often examined only democracies rather
than a universe of cases with variation in polity type, and (2) did little to overcome
the non-random assignment of democracy, and in particular the selection effects by
which democracies may choose to fight different types of counterinsurgencies than
non-democracies.
Lyall (2010) overcomes these shortcomings by constructing a dataset covering the
period 1800-2005, in which the polity type of the counterinsurgent regimes varies.
Matching is then used to adjust for observable differences between the conflicts selected by democracies and non-democracies, using one-to-one nearest-neighbor matching on a series of covariates.25 In a battery of analyses with varying modeling approaches, Lyall (2010) finds that democracy, measured as a polity score of at least 7 in
the specifications replicated here, has no relationship to success or failure in counterinsurgency, either in the raw data or in the matched sample.
While the credibility of this estimate as a causal quantity depends on the absence
of unobserved confounders, we can nevertheless assess whether the procedures used
to adjust for observed covariates were sufficient, or whether an inability to achieve
mean balance on some functions of the covariates may have led to bias even in the
absence of unobserved confounders.
Here I reexamine these findings using the post-1945 portion of the data, which
includes 35 counterinsurgencies by democracies and 100 by non-democracies, and is
used in many of the analyses in Lyall (2010).26 First, I assess balance. As shown in
figure 4-5, numerous covariates are badly imbalanced in the original dataset (circles),
where imbalance is measured on the x-axis by the standardized difference in means.
This balance improves somewhat under matching (diamonds), but improves far more
under kernel balancing (squares). Note that imbalance is shown both on the variables
used in the matching/weighting algorithms (the first ten covariates, up to and including year), as well as several others that were not explicitly included in the balancing
25 These covariates are: a dummy for whether the counterinsurgent is an occupier (occupier), a
measure of support and sanctuary for insurgents from neighboring countries (support), a measure
of state power (power), mechanization of the military (mechanized), elevation, distance from the
state capital to the war zone, a dummy for whether a state is in the first two years of independence
(new state), a cold war dummy, the number of languages spoken in the country, and the year in
which the conflict began.
26 The post-1945 period is the only one with complete data on the covariates used for balancing here,
but it is also the period in which the logic of democratic vulnerability is expected to be most relevant.
procedure: year², and two multiplicative interactions that were particularly predictive
of treatment status in the original data. Kernel balancing produces good balance on
both the included covariates and functions of them.
Figure 4-5: Balance: Democracies vs. Non-democracies and the Counterinsurgencies
they Fight

[Dot plot of the standardized difference in means (x-axis, from -0.5 to 1) for each covariate (mechanization, support, occupier, power, elevation, distance, new state, coldwar, num.languages, year, year², cincXelev, occupierXcinc) in the original sample (orig), the matched sample (matched), and the kernel balanced sample (kbal).]

Balance in the post-1945 sample of Lyall (2010). Imbalance, measured as the difference in means divided
by the standard deviation, is shown on the x-axis. Democracies (treated) and non-democracies
(controls) vary widely on numerous covariates. The matched sample (diamonds) shows somewhat
improved balance over the original sample, but imbalances remain on numerous characteristics.
Balance is considerably improved by kernel balancing (squares). The rows at or above year show
imbalance on characteristics explicitly included in the balancing procedures. Those below year show
imbalance on characteristics not explicitly included.
Next, I use the matched and weighted data to estimate the effect of democracy
on counterinsurgency success. For this, I simply use linear probability models (LPM)
to regress a dummy for victory (1) or defeat (0) on covariates according to five different
specifications.27 The first three specifications are: (1) raw regresses the
27 While Lyall (2010) used a number of other approaches, including logistic regression, some of
these models suffer from "separation" under the specifications attempted here. This causes observations
outcome directly on democracy without covariates (and is equivalent to a difference in
means); (2) orig uses the same covariates as Lyall (2010), which are all those variables
balanced on except for year; (3) time re-includes year as well as year² to flexibly
model the effects of time. The final two models, occupier1 (4) and occupier2 (5), add
flexibility by including interactions of occupier with other variables in the model.28
Figure 4-6 shows results for the matched and kernel balanced samples with 95%
confidence intervals. Under matching, the effect varies considerably depending on the
choice of model. No estimate is significantly different from zero, however. In stark
contrast, kernel balancing produces estimates that are essentially invariant to the
choice of model. Each kernel balancing estimate is between -0.26 and -0.27, indicating that democracy is associated with a 26 to 27 percentage point lower probability of
success in fighting counterinsurgencies. This is a very large effect, both statistically
and substantively, given that the overall success rate is only 33% in the post-1945
sample.
4.8
Conclusions
In the ongoing quest to reliably infer causal quantities from observational data, the
first-order challenge often remains ensuring that there are no unobserved confounders
in a given identification scenario.
However, the problem of actually adjusting for
differences in observed covariates to take advantage of conditional ignorability remains
non-trivial. As shown here, even when conditional ignorability holds, matching and
other weighting approaches only ensure unbiasedness under strict conditions. One
sufficient condition is full multivariate balance.
Absent this, unbiasedness of the
ATT requires that Y(0) (or all functions of Xi influencing Y(0)) has the same mean
for the treated and controls.
and variables to effectively drop out of the analysis, producing variability in effect estimates that
is due only to this artefact of logistic regression and not to any meaningful change in the
relationship among the variables. Linear models do not suffer this problem, and provide a well-defined
approximation to the conditional expectation function, allowing valid estimation of the
changing probability of victory associated with changes in the treatment variable, democracy.
28 These interactions were chosen because analysis with KRLS revealed that interactions with
occupier were particularly predictive of the outcome.
Figure 4-6: Effect of Democracy on Counterinsurgency Success

[Dot plot of effect estimates with 95% confidence intervals for the five specifications (raw, orig, time, occupier1, occupier2) under matching and under kernel balancing. X-axis: Effect of Democracy on Pr(victory), from -0.6 to 0.2.]

Effect of democracy on counterinsurgency success in the post-1945 sample of Lyall (2010), using matching or kernel balancing for pre-processing, followed by five different estimation procedures. Under
matching, effect estimates remain highly variable, but none are significantly different from zero.
Kernel balancing shows remarkably stable estimates over the five estimation procedures, even when
no covariates are included (raw). Results from kernel balancing are consistently in the -0.26 to
-0.27 range and significantly different from zero, indicating that democracy is associated with a
substantively large deficit in the ability to win counterinsurgencies.
Kernel balancing can be understood as a method of approximately achieving both
of these conditions. First, by obtaining balance on the columns of the kernel matrix
K, mean balance is also obtained on the much higher-dimensional set of features,
φ(X_i). Mean balance on these features implies mean balance on all the functions
that are linear in these features. Equivalently, and more intuitively, these are the
functions that can be formed by the superposition of Gaussians placed over each
observation in the covariate space. The assumption that the systematic component
of Y(0) is among these smooth functions is far more plausible than the assumption
that it is linear in the original X_i, even if the investigator is careful enough to include
higher-order terms among these X_i's. Moreover, as N grows large, Y(0) is increasingly
well modelled within this space, while the space of functions linear in X_i does not
grow with N.
Second, while existing methods are evaluated by univariate balance metrics, kernel balancing ensures that the entire multivariate densities of the treated and weighted
controls are approximately equalized, as measured by a corresponding kernel smoother.
This does not require coarsening the data into discrete bins, and because the method
is global rather than local, it is tolerant of higher-dimensional data than methods
that require discrete binning of observations, such as CEM.
Kernel balancing also generates pointwise estimates of the multivariate density
for the treated and controls at each location in the dataset, and uses this to report
an L1 measure of imbalance that is truly multivariate in nature but does not require
coarsening of the data.
Two empirical examples illustrate the use of kernel balancing. The first, a widely
used benchmark, uses data from Dehejia and Wahba (1999) to test whether kernel
balancing accurately recovers a known ATT estimate, using the experimental treatment
group but control observations drawn from a separate, observational dataset.
At its default values, with the covariates commonly used for this problem and no
further specification choices, kernel balancing estimated an effect of $1807, extremely
close to the experimental benchmark of $1794. In a second empirical example, kernel
balancing is used to obtain higher-order balance in the comparison of counterinsurgency success for democracies and non-democracies (Lyall, 2010). While theory and
prior research have argued that democracies are inferior counterinsurgents, Lyall
(2010) finds otherwise using a novel dataset and matching to ensure comparability
of the counterinsurgencies fought by democracies and non-democracies. Reexamining the post-1945 period and using the same covariates, kernel balancing proves far
more effective in obtaining balance, both on the covariates directly included in the
balancing procedures and on functions of these variables. Using five different models to estimate the effect of democracy on the adjusted datasets, the estimates from the
kernel balanced data all indicate that democracies were 26 to 27 percentage points
less likely than non-democracies to win counterinsurgencies over this period, on comparable cases. These effects are statistically significant, but also substantively large,
especially given the overall success rate of just 33%.
Nevertheless, additional questions and challenges remain for future work. First, it will be useful to better understand the asymptotic properties of this procedure, and in particular the rate of decline in bias as a function of N. Second, K has dimensionality N x N, which becomes unwieldy as N grows large, posing a practical limit of tens of thousands of observations. Third, obtaining correct confidence intervals for estimates based on weighted samples - either through resampling or a closed-form solution - will be an important and useful advance, particularly since standard errors remain poorly understood for matching techniques. Finally, improving the method for selecting the kernel bandwidth beyond the rule-of-thumb approach proposed here would be very useful as well.
Chapter 5
Appendices
5.1
Appendix for Kernel Regularized Least Squares
Figure A.1: Fitting a Simple Function with KRLS

[figure omitted: two panels, "Unscaled Gaussians" and "Scaled Gaussians / KRLS Fit (Superposition)"]

Note: Left panel: Unscaled Gaussians placed over each of the four data points. Right panel: Gaussians scaled by the choice coefficients obtained from KRLS. The choice coefficients for the data points (from left to right) are c = [-3.06, 2.68, -1.12, 0.97].
Figure A.2: Example of High and Low Frequency Functions

[figure omitted: a low-frequency solid line and a high-frequency dashed line fit to the same x-y data over [0, 1]]

Note: The solid line represents a "good" explanation of the relationship between x and y. The dashed line represents a "bad" one, which is both more likely to reflect noise and much less useful theoretically. For most social science inquiry, we are interested in recovering conditional expectation functions that look like the solid, low-frequency line, not the dashed, high-frequency line.
Figure A.3: KRLS Fits Non-Linear Functions and their Derivatives

[figure omitted: panels comparing OLS and KRLS fits of f(y) = 100 + 3x^4 and dy/dx = 12x^3 at N = 20 and N = 100]

Note: Simulation to recover the non-linear function y = 100 + 3x^4 (black solid line) and its derivative dy/dx = 12x^3 (gray dashed line). The sample sizes are 20, 50, and 100, x ~ Unif(-4, 4), and observed outcomes are simulated as y = 100 + 3x^4 + e where e ~ N(0, 1). In the right figures the black dots show the fitted values for y and the grey triangles show the fitted values for dy/dx from the KRLS estimator (averaged across 500 simulations). The left figures show the corresponding estimates from the OLS estimator.
Figure A.4: KRLS Approximates Complex Interactions: One Hill, One Valley

[figure omitted: four surface panels showing the true f(x1,x2) and the KRLS, OLS, and GAM fits]

Note: Simulation to recover the target function y = e^(-5[(x1)^2 + (1-x2)^2]) - e^(-5[(1-x1)^2 + (x2)^2]), using simulations with 200 observations drawn from x1, x2 ~ Unif(0, 1) and random noise e ~ N(0, 0.25). The top right figure shows the true target function. The top left, bottom right, and bottom left figures show the fitted functions from the KRLS, OLS, and GAM estimators, respectively.
Figure A.5: KRLS Approximates Complex Interactions: Two Hills, Two Valleys

[figure omitted: four surface panels showing the true f(x1,x2) and the KRLS, OLS, and GAM fits]

Note: Simulation to recover a target function with two hills and two valleys, constructed from Gaussian bumps analogous to those in Figure A.4 (the full formula is not legible in this copy), using simulations with 200 observations drawn from x1, x2 ~ Unif(0, 1) and random noise e ~ N(0, 0.25). The top right figure shows the true target function. The top left, bottom right, and bottom left figures show the fitted functions from the KRLS, OLS, and GAM estimators, respectively.
Figure A.6: KRLS Approximates Complex Interactions: Three Hills, Three Valleys

[figure omitted: four surface panels showing the true f(x1,x2) and the KRLS, OLS, and GAM fits]

Note: Simulation to recover the target function y = sin(x1) * cos(x2), using simulations with 200 observations drawn from x1, x2 ~ Unif(0, 2*pi) and random noise e ~ N(0, 0.25). The top right figure shows the true target function. The top left, bottom right, and bottom left figures show the fitted functions from the KRLS, OLS, and GAM estimators, respectively.
Figure A.7: The marginal effect of temporally proximate presidential elections on the effective number of electoral parties

[figure omitted: top panel reproduces Figure 3 from Brambor et al. (2006) with a 95% confidence interval; bottom panel plots pointwise marginal effects against the effective number of presidential candidates]

Note: Top panel: Figure 3 from Brambor et al. (2006). More temporally proximate presidential and legislative elections lead to fewer effective electoral parties. However, this is true only when there are relatively few presidential candidates, and the effect vanishes when there are large numbers of presidential candidates. Bottom panel: Scatterplot of pointwise marginal effects of temporal proximity on the number of parties (dParties/dProximity), with lowess estimates superimposed. The plot looks similar to the Brambor et al. (2006) model only when there are 3 or more presidential candidates. By contrast, at zero presidential candidates (which represents 62% of the observations included in the Brambor et al. regression), the marginal effect estimates come back towards zero.
Figure A.8: OLS Results for Brambor et al. Split at Two Presidential Candidates

[figure omitted: two panels of OLS marginal-effect estimates, one for observations with two or fewer presidential candidates and one for observations with more than two]

Note: Results from OLS models identical to those in the previous figure, but with the sample split between observations with two or fewer presidential candidates and those with more than two. KRLS estimates differed from the original Brambor et al. (2006) result (Figure A.7), suggesting that dParties/dProximity takes values near zero when PresidentialCandidates is zero (indicating no "coat-tail effect" there), if anything decreases as PresidentialCandidates rises to two, and then reverses direction and follows the pattern suggested by Brambor et al. (2006) thereafter. Here we split the sample and conduct OLS analyses separately when PresidentialCandidates <= 2 and when PresidentialCandidates > 2. As shown, the OLS results from the split samples reflect the KRLS result.
5.2
Appendix for Kernel Balancing
Proof of Proposition 2

Proposition 2 states that the estimator ATT-hat = (1/N_t) Σ_{i:D=1} y_i − Σ_{i:D=0} w_i y_i is unbiased for the ATT if both Assumption 7 (conditional ignorability) and Assumption 8 (multivariate balance) hold.

The proof is as follows. Under Assumption 8, f_{w,X|D=0} = f_{X|D=1}. Then,

bias = E[y(0)|D = 1] − E[Σ_{i:D=0} w_i y_i]
     = ∫ y(0) f_{X|D=1}(x) dx − ∫ y(0) f_{w,X|D=0}(x) dx   (A.1)
     = 0
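Proposition 2's estimator can be illustrated numerically. The following is a minimal sketch on simulated data (not from the dissertation's applications); inverse-odds weights computed from the true treatment model serve as a hypothetical stand-in for weights satisfying the multivariate balance assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated example: outcome depends linearly on a single confounder x.
n = 1000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-x)))   # treatment more likely at high x
y0 = 2 * x + rng.normal(size=n)             # potential outcome under control
y = y0 + 1.0 * d                            # true ATT = 1

# Hypothetical weights on the controls that equalize the distribution of x:
# inverse-odds weights from the (here, known) treatment model.
p = 1 / (1 + np.exp(-x))
w = np.where(d == 0, p / (1 - p), 0.0)
w = w / w.sum()                             # weights sum to one

att_hat = y[d == 1].mean() - np.sum(w * y)
print(round(att_hat, 2))                    # close to the true ATT of 1
```

Because the weighted control distribution of x matches the treated distribution, the weighted control mean stands in for the treated units' (unobserved) mean outcome under control.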
Proof of Proposition 4

Proposition 4 states that for the mean row of K among the treated, k_t = (1/N_t) Σ_{i:D=1} K_i (where K_i denotes the i-th row of K), and the weighted mean row of K among the controls, k_c(w) = Σ_{i:D=0} w_i K_i, if k_t = k_c(w), then ⟨φ_t, φ(x_j)⟩ = ⟨φ_{c,w}, φ(x_j)⟩ for every observation j, where φ_t = (1/N_t) Σ_{i:D=1} φ(x_i) and φ_{c,w} = Σ_{i:D=0} w_i φ(x_i).

This can be shown as follows. The weighted mean control row is

k_c(w) = Σ_{i:D=0} w_i K_i = [Σ_{i:D=0} w_i k(x_i, x_1), …, Σ_{i:D=0} w_i k(x_i, x_N)].   (A.2)

Setting this equal to k_t element by element, and writing each kernel evaluation as an inner product k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, gives, for every observation j,

(1/N_t) Σ_{i:D=1} ⟨φ(x_i), φ(x_j)⟩ = Σ_{i:D=0} w_i ⟨φ(x_i), φ(x_j)⟩.   (A.3)

By linearity of the inner product, the two sides collect as

(1/N_t) Σ_{i:D=1} ⟨φ(x_i), φ(x_j)⟩ = ⟨(1/N_t) Σ_{i:D=1} φ(x_i), φ(x_j)⟩ = ⟨φ_t, φ(x_j)⟩   (A.4)

and

Σ_{i:D=0} w_i ⟨φ(x_i), φ(x_j)⟩ = ⟨Σ_{i:D=0} w_i φ(x_i), φ(x_j)⟩ = ⟨φ_{c,w}, φ(x_j)⟩,   (A.5)

so that ⟨φ_t, φ(x_j)⟩ = ⟨φ_{c,w}, φ(x_j)⟩ for every j, as claimed.
Remarks

An intuitive interpretation of equation A.3 is that each unit j is as close to the average treated unit as it is to the (weighted) average control unit, where distance is measured in the feature space φ(X).¹ Relatedly, a method of classifying observations as treated or control based on whether they are closer to the centroid of the treated or the centroid of the controls in φ(X) would be unable to classify any point.
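The mean-row balance condition can be checked numerically. In this sketch on simulated data, unconstrained least squares stands in for the actual kernel balancing procedure (which additionally keeps the weights nonnegative and summing to one), and the bandwidth is an assumed rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 60 controls and 30 treated, drawn from shifted normals.
Xc = rng.normal(0.0, 1.0, size=(60, 2))
Xt = rng.normal(0.5, 1.0, size=(30, 2))
X = np.vstack([Xc, Xt])
d = np.r_[np.zeros(60), np.ones(30)]

# Gaussian kernel matrix; the bandwidth (number of covariates) is a
# rule-of-thumb stand-in, not necessarily the dissertation's own choice.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / X.shape[1])

kt = K[d == 1].mean(axis=0)   # mean row of K among the treated
Kc = K[d == 0]                # control rows of K

# Weights making the weighted mean control row approximate kt.
w, *_ = np.linalg.lstsq(Kc.T, kt, rcond=None)

gap = np.max(np.abs(kt - Kc.T @ w))
print(gap)                    # small: approximate mean-row balance in K
```

With such weights in hand, every observation is (approximately) equally similar, in the kernel's sense, to the treated centroid and to the weighted control centroid.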
Proof of Proposition 5

Proposition 5 states that for a density estimator for the treated, f_{X|D=1}, and for the (weighted) controls, f_{X|D=0,w}, both constructed with kernel k of bandwidth σ, the choice of weights that ensures mean balance in the kernel matrix K also ensures f_{X|D=1} = f_{X|D=0,w} at every location in X at which an observation is located.

As detailed in the main text, the expression (1/N_t) K_t 1_{N_t} places a multivariate standard normal density over each treated observation, sums these to construct a smooth density estimator at all points in X, and evaluates the height of that joint density estimate at each of the points found in the dataset. Likewise, (1/N_c) K_c 1_{N_c} estimates the density of the control units and returns its evaluated height at every datapoint in the dataset.

To reweight the controls is to say that some units originally observed should be made more or less likely. This is achieved by changing the numerator of each weight 1/N_c to some non-negative value other than 1. Letting the weights sum to 1 (rather than N_c), the reweighted density of the controls is evaluated at each point in the dataset according to K_c w, for a vector of weights w. If the weights are selected so that this equals the density of the treated, then

(1/N_t) K_t 1_{N_t} = K_c w  ⟺  k_t = k_c(w),   (A.6)
¹For the Gaussian kernel, ⟨φ(x_i), φ(x_j)⟩ is naturally interpretable as a similarity measure in the input space, since this quantity equals k(x_i, x_j) = e^(-||x_i - x_j||^2 / σ^2). However, ⟨φ(x_i), φ(x_j)⟩ or k(x_i, x_j) is more generally interpretable as similarity in the feature space as well. Note that the squared Euclidean distance between two points x_i and x_j after mapping into φ(·) is ||φ(x_i) − φ(x_j)||^2 = ⟨φ(x_i) − φ(x_j), φ(x_i) − φ(x_j)⟩ = ⟨φ(x_i), φ(x_i)⟩ + ⟨φ(x_j), φ(x_j)⟩ − 2⟨φ(x_i), φ(x_j)⟩. In the case of the Gaussian kernel, ⟨φ(x_i), φ(x_i)⟩ = 1, so this distance reduces to 2(1 − ⟨φ(x_i), φ(x_j)⟩). In this sense, ⟨φ(x_i), φ(x_j)⟩ is a reasonable measure of similarity of position in the feature space, as it runs opposite to distance in this space.
where the final line is the definition of mean balance in K. Thus, the weights that achieve mean
balance in K are precisely the right weights to achieve equivalence of the measured multivariate
densities for the treated and controls at all points in the dataset.
Bias in ATT when mean balance on X is achieved

As discussed in the text, weighting estimates of the ATT are unbiased under conditional ignorability when all functions of X influencing y(0) - or y(0) itself - are mean balanced. Because mean balance on X implies mean balance on all linear functions of X, this generalizes to the statement that conditional ignorability and mean balance in X are sufficient for unbiasedness when y(0) is linear in X. Here I provide a proof of this statement while also deriving the bias that obtains when linearity does not hold.

Let y(0)_i = g(x_i) + η_i = x_i^T β + h(x_i) + η_i, where x^T β is the linear component of y(0) and h(x) includes only the remaining nonlinear components. x is a vector-valued random variable containing the pre-treatment covariates to be balanced, and all desired transforms of them. The weighted mean of the control outcomes is given by Σ_{i:D=0} w_i y_i. For fixed choices of x and w, the expectation of this estimator is given by:

E[Σ_{i:D=0} w_i y(0)_i] = Σ_{i:D=0} w_i E[y(0)_i]
  = Σ_{i:D=0} w_i [x_i^T β + h(x_i)]
  = Σ_{i:D=0} w_i x_i^T β + Σ_{i:D=0} w_i h(x_i)   (A.7)
  = (1/N_t) Σ_{i:D=1} x_i^T β + Σ_{i:D=0} w_i h(x_i)   (A.8)
  = (1/N_t) Σ_{i:D=1} E[y(0)_i − h(x_i)] + Σ_{i:D=0} w_i h(x_i)
  = E[y(0)|D = 1] + Σ_{i:D=0} w_i h(x_i) − (1/N_t) Σ_{i:D=1} h(x_i)

where the substitution from A.7 to A.8 is due to mean balance. The resulting bias is thus:

bias = Σ_{i:D=0} w_i h(x_i) − (1/N_t) Σ_{i:D=1} h(x_i)

This equals zero only when (1) y(0) is linear in x and thus h(x) = 0 for all x ∈ X, or (2) the mean of h(x) among the treated and (weighted) controls happens to be equal. Accordingly, note that the degree of bias worsens as h(x) has larger variance and as its correlation with D increases. The results are thus similar to those obtained for omitted variables in linear models, with h(x) playing the role of the omitted variable in this case.
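This bias term can be seen at work in a small simulation. The sketch below uses exponential-tilting weights (a stand-in, not the dissertation's procedure) to mean-balance x exactly, then evaluates the residual imbalance on a hypothetical nonlinear component h(x) = x^2:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)

# x confounds treatment; suppose y(0) = x*beta + h(x) with h(x) = x^2.
n = 4000
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(1 - 2 * x)))
xc, xt_mean = x[d == 0], x[d == 1].mean()

# Exponential-tilting weights on the controls that achieve exact mean
# balance on x only (an illustrative choice of balancing weights).
f = lambda t: np.average(xc, weights=np.exp(t * xc)) - xt_mean
t = brentq(f, -5.0, 5.0)
w = np.exp(t * xc)
w /= w.sum()

# Mean balance on x holds almost exactly...
print(abs(np.sum(w * xc) - xt_mean))

# ...but the nonlinear component h(x) = x^2 need not be balanced; its
# residual imbalance is exactly the bias term derived above.
h = lambda v: v ** 2
bias = np.sum(w * h(xc)) - h(x[d == 1]).mean()
print(bias)
```

Balancing x alone leaves h(x) imbalanced in general, which is precisely the motivation for balancing richer function classes via the kernel.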
Density Equalization Illustration

This example visualizes the density estimates produced internally by kernel balancing using linear combinations of K, as described above. Suppose X contains 200 observations from a standard normal distribution. Units are assigned to treatment with probability 1/(1 + exp(2 − 2X)), which produces approximately 2 control units for each treated unit. Figure A.9 shows the resulting density plots, using density estimates provided by kbal, in which the density of the treated is given by (1/N_t) K_t 1_{N_t} and the density of the controls is given by (1/N_c) K_c 1_{N_c}. As shown, the density estimate for the treated at each observation's X position (black squares) is initially very different from the density estimate for the controls taken at each observation (black circles). After weighting, however, the new density of the controls as measured at each observation (red x) matches that of the treated almost exactly.

Note that in multidimensional examples, the density becomes more difficult to visualize across each dimension, but it is still straightforward to compute and to think about the pointwise density estimates for the treated or controls as measured at each observation's X value. In contrast to binning approaches such as CEM, equalizing density functions continuously in this way avoids difficult or arbitrary binning decisions, is tolerant of high-dimensional data, and smoothly matches the densities in a continuous fashion, resolving the within-bin discrepancies implied by CEM.
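The construction just described can be sketched directly. In this simulated version, least-squares weights stand in for kbal's constrained solution, and the densities are computed only up to a common normalizing constant:

```python
import numpy as np

rng = np.random.default_rng(3)

# Setup of the illustration: X standard normal, treatment probability
# 1 / (1 + exp(2 - 2X)).
n = 200
x = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(2 - 2 * x)))

# Gaussian kernel matrix (illustrative bandwidth of 1).
K = np.exp(-((x[:, None] - x[None, :]) ** 2))

# Pointwise density estimates, up to a common constant, at every observation:
f_treated = K[:, d == 1].mean(axis=1)
f_control = K[:, d == 0].mean(axis=1)

# Control weights equating the two density estimates, found by least
# squares (a sketch; kbal constrains its weights to be nonnegative).
w, *_ = np.linalg.lstsq(K[:, d == 0], f_treated, rcond=None)
f_control_w = K[:, d == 0] @ w

resid = np.max(np.abs(f_treated - f_control_w))
print(resid)   # near zero: reweighted control density matches the treated
```

The printed residual is the largest pointwise gap between the treated density and the reweighted control density across all 200 observation locations.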
L1, imbalance, and numdims

Recall that kernel balancing does not directly achieve mean balance on K, but rather on the first numdims factors of K as determined by principal components analysis. This example examines the efficacy of this approach in minimizing the L1 loss, and in minimizing imbalance on an unknown function of the data. Suppose we have 500 observations and 5 covariates, each with a standard normal distribution. Let z be a nonlinear function of the covariates. This function impacts treatment assignment, with the probability of treatment given by logit^(-1)(z − 2), which produces approximately two control units for each treated unit.

In figure A.10, the value of numdims - the number of factors of K retained for purposes of balancing
Figure A.9: Density-Equalizing Property of Kernel Balancing

[figure omitted: densities of the treated, unweighted controls, and weighted controls evaluated at each observation's value of X]

Note: Plot showing the density-equalization property of kernel balancing. For 200 observations of X ~ N(0, 1), treatment is assigned according to Pr(treatment) = 1/(1 + exp(2 − 2X)), producing approximately two control units for each treated unit. Black squares indicate the density of the treated, as evaluated at each observation's location in the dataset (given the choice of kernel and σ^2). Black circles indicate the density of the (unweighted) controls. The treated and controls are seen to be drawn from different distributions, owing to the treatment assignment process. Red x's show the new density of the controls after weighting by kbal. The reweighted density is nearly indistinguishable from the density of the treated, owing to the density-equalization property of kernel balancing.
- is increased from a minimum of 2 up to 100. As expected, both L1 and the mean imbalance on z taken after weighting first improve as numdims is increased, and then worsen beyond some choice of numdims. Most importantly, while the balance on z is unobservable in the case of unknown confounders, L1 is observable, and improvements in L1 track very closely with improvements in the balance of z. Accordingly, selecting numdims to minimize L1 appears to be a viable strategy for selecting the value that also minimizes imbalance on unseen functions of the data.
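The experiment can be sketched as follows. Since the exact confounder formula is not legible in this copy, z = x1^2 + x2 is a hypothetical stand-in, and unconstrained minimum-norm weights replace kbal's constrained solution:

```python
import numpy as np

rng = np.random.default_rng(4)

# 500 observations of 5 standard-normal covariates; a nonlinear
# confounder z (hypothetical stand-in) drives treatment assignment.
X = rng.normal(size=(500, 5))
z = X[:, 0] ** 2 + X[:, 1]
d = rng.binomial(1, 1 / (1 + np.exp(-(z - 2))))

# Gaussian kernel matrix with a rule-of-thumb bandwidth (number of covariates).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / X.shape[1])

# Factors of K via an eigendecomposition (one way to realize the PCA step).
vals, vecs = np.linalg.eigh(K)
order = np.argsort(vals)[::-1]
scores = vecs[:, order] * vals[order]   # each unit's score on each factor

def balance_on(numdims):
    """Minimum-norm control weights summing to one that mean-balance the
    first numdims factors of K (kbal also keeps weights nonnegative)."""
    S = scores[:, :numdims]
    A = np.vstack([S[d == 0].T, np.ones((1, int((d == 0).sum())))])
    b = np.r_[S[d == 1].mean(axis=0), 1.0]
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

for r in (2, 10, 50):
    w = balance_on(r)
    imbal_z = abs(np.sum(w * z[d == 0]) - z[d == 1].mean())
    print(r, round(imbal_z, 3))   # imbalance on the unseen z, by numdims
```

In the spirit of the example above, one would also track an observable imbalance measure such as L1 across numdims and select the value at its minimum.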
Figure A.10: L1 distance and imbalance on an unknown confounder, by numdims

[figure omitted: L1 imbalance and imbalance on z, plotted against the number of components of K included]

Note: This example shows the relationship between the number of components of K that get balanced upon (numdims), the multivariate imbalance (L1), and balance on the confounder z. L1 generally improves as numdims is increased at first, but beyond approximately 50 dimensions, numerical instability produces less desirable results and a higher L1 imbalance. While the confounder represented by z in this case would generally be unobservable, balance on z is optimized where L1, which is observable, finds its minimum.
Bibliography
Abadie, A. and G. W. Imbens (2006). Large sample properties of matching estimators for average
treatment effects. Econometrica 74 (1), 235-267.
Abadie, A. and G. W. Imbens (2011). Bias-corrected matching estimators for average treatment
effects. Journal of Business & Economic Statistics 29(1).
Akresh, R. and D. De Walque (2008). Armed conflict and schooling: Evidence from the 1994 Rwandan genocide. World Bank Policy Research Working Paper Series.
Bateson, R. (2012). Crime victimization and political participation. American Political Science
Review 106.
Beber, B., P. Roessler, and A. Scacco (2012). Who supports partition? Violence and political attitudes in a dividing Sudan.
Becchetti, L., P. Conzo, and A. Romeo (2011). Violence, social capital and economic development:
Evidence of a microeconomic vicious circle. ECINEQ Working Paper Series.
Beck, N., G. King, and L. Zeng (2000). Improving quantitative studies of international conflict: A
conjecture. American Political Science Review 94, 21-36.
Bellows, J. and E. Miguel (2009). War and local collective action in Sierra Leone. Journal of Public Economics 93(11), 1144-1157.
Blattman, C. (2009). From violence to voting: War and political participation in Uganda. American Political Science Review 103(02), 231-247.
Blattman, C. and J. Annan (2010). The consequences of child soldiering. The Review of Economics and Statistics 92(4), 882-898.
Brambor, T., W. Clark, and M. Golder (2006). Understanding interaction models: Improving
empirical analyses. Political Analysis 14 (1), 63-82.
Cassar, A., P. Grosjean, and S. Whitt (2012). Social cooperation and the problem of the conflict gap: Survey and experimental evidence from post-war Tajikistan.
Choi, J.-K. and S. Bowles (2007). The coevolution of parochial altruism and war. Science 318, 636-640.
Christia, F. (2012). Alliance Formation in Civil Wars. Cambridge University Press.
Colaresi, M. and S. Carey (2008). To kill or to protect. Journal of Conflict Resolution 52(1), 39-67.
De Vito, E., A. Caponnetto, and L. Rosasco (2005). Model selection for regularized least-squares
algorithm in learning theory. Foundations of Computational Mathematics 5(1), 59-85.
de Waal, A., C. Hazlett, C. Davenport, and J. Kennedy (2014). The epidemiology of lethal violence in Darfur: Using micro-data to explore complex patterns of ongoing armed conflict. Social Science & Medicine.
Degomme, O. and D. Guha-Sapir (2010). Patterns of mortality rates in Darfur conflict. The Lancet 375(9711), 294-300.
Dehejia, R. H. and S. Wahba (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94(448), 1053-1062.
Diamond, A. and J. S. Sekhon (2005). Genetic matching for estimating causal effects: A general
multivariate matching method for achieving balance in observational studies. Review of Economics
and Statistics (0).
Doyle, M. W. and N. Sambanis (2000). International peacebuilding: A theoretical and quantitative analysis. American Political Science Review, 779-801.
Evgeniou, T., M. Pontil, and T. Poggio (2000). Regularization networks and support vector machines. Advances in Computational Mathematics 13(1), 1-50.
Fearon, J. D. and D. D. Laitin (2000). Violence and the social construction of ethnic identity.
International Organization 54(4), 845-877.
Flint, J. and A. de Waal (2008). Darfur: a new history of a long war. Zed Books.
Fortna, V. P. (2004). Does peacekeeping keep peace? International intervention and the duration of peace after civil war. International Studies Quarterly 48(2), 269-292.
Friedrich, R. J. (1982). In defense of multiplicative terms in multiple regression equations. American
Journal of Political Science 26(4), 797-833.
Gilligan, M., B. Pasquale, and C. Samii (2011). Civil war and social capital: Behavioral-game evidence from Nepal.
Golub, G. H., M. Heath, and G. Wahba (1979). Generalized cross-validation as a method for choosing
a good ridge parameter. Technometrics 21(2), 215-223.
Guha-Sapir, D. and O. Degomme (2005). Darfur: Counting the deaths. Report, Center for Research on the Epidemiology of Disasters 26.
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis 20(1), 25-46.
Hainmueller, J. and C. Hazlett (2013). Kernel regularized least squares: Reducing misspecification
bias with a flexible and interpretable machine learning approach. Political Analysis, mpt019.
Harff, B. (2003). No lessons learned from the Holocaust? Assessing risks of genocide and political mass murder since 1955. American Political Science Review 97(1), 57-73.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second ed.). Springer.
Hastrup, A. (2013). The War in Darfur: Reclaiming Sudanese History. Routledge.
Iacus, S. M., G. King, and G. Porro (2012). Causal inference without balance checking: Coarsened exact matching. Political Analysis 20(1), 1-24.
Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(1), 243-263.
Imbens, G. (2003). Sensitivity to exogeneity assumptions in program evaluation. The American Economic Review 93(2), 126-132.
Jackson, J. E. (1991). Estimation of models with variable coefficients. Political Analysis 3(1), 27-49.
Kahneman, D. (2011). Thinking, fast and slow. Farrar Straus & Giroux.
Kalyvas, S. (2006). The logic of violence in civil war. Cambridge University Press.
Kimeldorf, G. and G. Wahba (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics 41(2), 495-502.
King, G., R. Nielsen, C. Coberley, J. E. Pope, and A. Wells (2011). Comparative effectiveness of
matching methods for causal inference. Unpublished manuscript 15.
King, G. and L. Zeng (2006). The Dangers of Extreme Counterfactuals. Political Analysis 14(2),
131-159.
Kocher, M. A., T. B. Pepinsky, and S. N. Kalyvas (2011). Aerial bombing and counterinsurgency in the Vietnam War. American Journal of Political Science 55(2), 201-218.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 604-620.
Lyall, J. (2009). Does indiscriminate violence incite insurgent attacks? Journal of Conflict Resolution 53(3), 331-362.
Lyall, J. (2010). Do democracies make inferior counterinsurgents? reassessing democracy's impact
on war outcomes and duration. International Organization 64 (01), 167-192.
Lyall, J., K. Imai, and G. Blair (2013). Explaining support for combatants during wartime: A survey experiment in Afghanistan. American Political Science Review.
Nisbett, R. and D. Cohen (1996). Culture of Honor: The Psychology of Violence in the South. Westview Press.
Nunn, N. and L. Wantchekon (2009). The slave trade and the origins of mistrust in Africa. American Economic Review.
Pham, P., P. Vinck, and E. Stover (2009). Returning home: Forced conscription, reintegration, and mental health status of former abductees of the Lord's Resistance Army in northern Uganda. BMC Psychiatry 9(1), 23.
Pham, P., H. Weinstein, and T. Longman (2004). Trauma and PTSD symptoms in Rwanda. JAMA: The Journal of the American Medical Association 292(5), 602-612.
Ratkovic, M. (2012). Identifying the largest balanced subset of the data under general treatment regimes. Technical report, Working Paper. Available at http://www.princeton.edu/ratkovic/SVMMatch.pdf.
Rifkin, R., G. Yeo, and T. Poggio (2003). Regularized least-squares classification. Nato Science Series Sub Series III: Computer and Systems Sciences 190, 131-154.
Rifkin, R. M. and R. A. Lippert (2007). Notes on regularized least squares. Technical report, MIT
Computer Science and Artificial Intelligence Laboratory Technical Report.
Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational
studies for causal effects. Biometrika 70(1), 41-55.
Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 159-183.
Saunders, C., A. Gammerman, and V. Vovk (1998). Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning, pp. 515-521. San Francisco, CA, USA: Morgan Kaufmann.
Schölkopf, B. and A. Smola (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Smith, J. A. and P. E. Todd (2001). Reconciling conflicting evidence on the performance of
propensity-score matching methods. The American Economic Review 91(2), 112-118.
Tedeschi, R. G. and L. G. Calhoun (2004). Posttraumatic growth: Conceptual foundations and empirical evidence. Psychological Inquiry 15(1).
Tedeschi, R. G., C. L. Park, and L. G. Calhoun (1998). Posttraumatic Growth: Positive Changes in the Aftermath of Crisis. Routledge.
Tychonoff, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Doklady Akademii Nauk SSSR 151, 501-504. Translated in Soviet Mathematics 4: 1035-1038.
Valentino, B., P. Huth, and D. Balch-Lindsay (2004). "Draining the sea": Mass killing and guerrilla warfare. International Organization 58(02), 375-407.
Vinck, P., P. Pham, E. Stover, and H. Weinstein (2007). Exposure to war crimes and implications for peace building in northern Uganda. JAMA 298(5), 543-554.
Voors, M., E. Nillesen, P. Verwimp, E. Bulte, R. Lensink, and D. van Soest (2011). Violent conflict and behavior: A field experiment in Burundi. American Economic Review.
Walter, B. F. (2004). Does conflict beget conflict? Explaining recurring civil war. Journal of Peace Research 41(3), 371-388.
Wilkinson, S. I. (2006). Votes and violence: Electoral competition and ethnic riots in India. Cambridge University Press.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(1), 95-114.