REndo: An R Package to Address Endogeneity Without External Instrumental
Variables
Article · January 2018
Raluca Gui (University of Zurich)
Markus Meierer (University of Zurich)
René Algesheimer (University of Zurich)
Abstract
Endogeneity arises when the independence assumption between an explanatory variable and the error in a statistical model is violated. Among its most common causes
are omitted variable bias (e.g. ability in the returns to education estimation), measurement error (e.g. survey response bias) and simultaneity (e.g. advertising and sales).
Instrumental variable estimation is a common treatment when endogeneity is of concern.
However, valid and strong external instruments are difficult to find. Consequently, statistical
methods that correct for endogeneity without external instruments have been advanced;
they are generically called internal instrumental variable (IIV) models. We propose the
first R (R Core Team 2013) package that implements the following five instrument-free
methods, all assuming a continuous dependent variable: the latent instrumental variables approach (Ebbes, Wedel, Boeckenholt, and Steerneman 2005), higher moments estimation
(Lewbel 1997), the heteroskedastic error approach (Lewbel 2012), joint estimation using copulas (Park and Gupta 2012) and multilevel GMM (Kim and Frees 2007). Package usage is
illustrated on several simulated datasets.
Keywords: endogeneity, multilevel, internal instrumental variables.
REndo - Gui, Meierer and Algesheimer
1. Introduction
In the absence of data based on a randomized experiment, endogeneity is always a concern.
It implies that the independence assumption between at least one regressor and the error
term is not satisfied, leading to biased and inconsistent results. The causes of endogeneity are
manifold and include response bias in surveys, omission of important explanatory variables,
or simultaneity between explanatory and response variables (Antonakis, Bendahan, Jacquart,
and Lalive 2014; Angrist and Pischke 2008).
A “devilishly clever” solution to cope with confounders when randomization is not possible
is to find additional variables, the so-called “instrumental variables” (Antonakis et al. 2014;
Theil 1958). These variables are correlated with the suspected endogenous regressor but not
correlated with the structural error. For example, when estimating the demand for bread, the
price variable is endogenous since price and demand are jointly determined in the market, so
we have a simultaneity problem. A strong (see Appendix A) instrumental variable would be the number
of rainy days in the year, which influences wheat production and consequently the price
of bread without having any effect on the demand for bread. The difficulty of finding a good
instrument is well known. Paradoxically, the stronger the correlation between the instrument
and the regressor, the more difficult it is to defend its lack of correlation with the error (Ebbes
et al. 2005; Lewbel 1997).
Sometimes a suitable instrumental variable is suggested by the data generating process
or by the cause of endogeneity. However, it is not uncommon to fail to find a strong instrumental variable. In this case, an “instrument-free” or “internal instrumental variable” (IIV)
model can be used. These methods address the endogeneity problem without the need for
external instruments (further details regarding these methods can be found in the following
sections).
REndo is the first package to encompass the most recent developments in IIV methods with
continuous dependent variables. It implements the latent instrumental variable approach
(Ebbes et al. 2005), the joint estimation using copula (Park and Gupta 2012), the higher
moments method (Lewbel 1997) and the heteroskedastic error approach (Lewbel 2012). To
model hierarchical data such as students nested within classrooms, nested within schools,
REndo includes the multilevel GMM estimation proposed by Kim and Frees (2007).
Section 2 presents the endogeneity problem and its consequences for the estimates in more detail. Section 3 presents the existing R (R Core Team 2013) packages that address endogeneity using external instruments, followed by Section 4, which describes the underlying idea
of internal instrumental variables alongside the five IIV methods correcting for endogeneity.
Section 5 describes the implementation of these IIV models in REndo and their usage on
various simulated datasets, comparing the results with the ones obtained using ordinary least
squares. The last section presents possible future improvements and concludes.
2. Endogeneity
Researchers from a wide range of disciplines, such as political science, finance, management,
marketing or education research, are interested in causal relationships. For instance, does
institutional quality explain the variation in the economic development of different countries?
Does a CEO’s compensation depend on firm size? Does an increase in advertising expenditure
lead to an increase in sales? Does repeating a class lead to better test scores?
To answer such questions, there are two options: run a randomized
controlled experiment or use observational data. Since in many instances randomized experiments are too expensive, are limited to small sample sizes or are seen as unethical, researchers rely
on observational data. The problem with such data is the threat of obtaining estimates that
are not consistent, meaning that they do not converge to the true population parameter as
the sample size increases. This threat occurs when:
1. important variables are omitted from the model;
2. one or more explanatory variables are measured with error;
3. there is simultaneity between the response variable and one of the covariates;
4. the sample is biased due to self-selection;
5. a lagged response variable is included as covariate.
Ruud (2000) showed that (2)-(5) can be viewed as special cases of (1). In
all these instances the error term of the model is correlated with one of the covariates. This
situation is known in the literature as endogeneity. Figure 1 depicts an endogeneity problem:
regressor P is endogenous due to its correlation with the error $\varepsilon$, while X is an exogenous
covariate, since its correlation with the error is zero.
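[Figure 1: path diagram. P and X both point to y, with coefficients $\beta_1$ and $\beta_2$; $\psi_1$ is the correlation between P and X, $\psi_2 \neq 0$ the correlation between P and the error $\varepsilon$, and $\psi_3 = 0$ the correlation between X and $\varepsilon$.]
Condition $\psi_1 = 0$: $\beta_1$ inconsistent, $\beta_2$ consistent.
Explanation: P correlates with $\varepsilon$ ($\psi_2 \neq 0$), thus $\beta_1$ is inconsistent. $\beta_2$ is consistent
since X is uncorrelated with both P ($\psi_1 = 0$) and $\varepsilon$ ($\psi_3 = 0$).
Condition $\psi_1 \neq 0$: $\beta_1$ inconsistent, $\beta_2$ inconsistent.
Explanation: P correlates with $\varepsilon$ ($\psi_2 \neq 0$), thus $\beta_1$ is inconsistent. Although X
is uncorrelated with $\varepsilon$ ($\psi_3 = 0$), $\beta_2$ is inconsistent since it is affected
by the bias in P through X’s correlation with P ($\psi_1 \neq 0$),
although $\psi_3$ still equals zero.
Figure 1: Endogeneity causes inconsistent estimates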
Figure 2 shows how the bias of the endogenous regressor’s estimate increases as the correlation between P and the error increases. At low correlation (0.1), the bias is 0.11 (as seen in
the chart). As the correlation between P and the error increases to 0.3 and then to 0.5,
so does the bias, which rises to 0.24 and then to 0.34.
[Figure 2: Instrumental variables solves the inconsistency of estimates problem caused by endogeneity. Plot of the bias of the b1 estimate (vertical axis, 0.0 to 0.5) against the correlation between P and the error (horizontal axis: 0.1, 0.3, 0.5).]
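The growth of the OLS bias with the correlation between P and the error can be reproduced with a small simulation. The sketch below is only illustrative: the data generating process (unit variances, true coefficient of −1, sample size, seed) is an assumption and not the design behind Figure 2, so the numbers need not match those reported above. Under these assumptions the bias equals cov(P, error) / var(P), which is simply the correlation itself.

```python
import random

def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_bias(rho, beta=-1.0, n=20000, seed=1):
    """Bias of the OLS slope when corr(P, error) = rho (assumed unit variances)."""
    rng = random.Random(seed)
    p, y = [], []
    for _ in range(n):
        e = rng.gauss(0, 1)  # structural error
        # P shares the component rho * e with the error, so corr(P, e) = rho.
        p_i = rho * e + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        p.append(p_i)
        y.append(beta * p_i + e)
    return ols_slope(p, y) - beta

biases = {rho: simulate_bias(rho) for rho in (0.1, 0.3, 0.5)}
for rho, b in biases.items():
    print(f"corr(P, error) = {rho:.1f}  ->  OLS bias = {b:.3f}")
```

Increasing the sample size does not shrink these biases, which is what inconsistency means in practice.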
Sometimes the cause of endogeneity or the data at hand can give clues on how to handle the
problem. For example, if endogeneity arises from time-invariant sources, applying fixed-effects
estimation on a panel dataset eliminates the omitted variable problem. For endogeneity caused
by measurement error, autoregressive models or simultaneous equation models could offer a
solution (Ebbes 2004). Structural models that estimate demand-supply models (Draganska
and Jain 2004; Berry, Levinsohn, and Pakes 1995; Berry 1994) are yet another alternative to
deal with endogeneity.
However, the most frequently used methods for addressing endogeneity are instrumental variable methods (Theil 1958; Wright 1928). The main idea behind these approaches is to focus on the
variation in the endogenous variable that is uncorrelated with the error term and to disregard
the variation that biases the ordinary least squares coefficients. This is possible by finding an
additional variable, called an external instrument, such that the endogenous covariate can be
separated into two parts: one part driven by the instrumental variable, which (a) should not be correlated with the
structural error and (b) should be correlated with the endogenous regressor; and the other
part, which is correlated with the structural error of the model. Section 3 describes these methods in more
detail and presents the R packages that currently implement IV methods.
3. External IV Methods
The concept of instrumental variables was first derived by Wright (1928). He observed that
“success with this method depends on success in discovering factors of the type A and B”,
where A and B refer to instrumental variables. Specifically, an “instrumental variable” (IV)
is defined as a variable Z (equation 2) that is correlated with the explanatory variable P and
uncorrelated with the structural error, $\varepsilon$, in equation 1:
$$Y = \beta P + \varepsilon, \qquad (1)$$
$$P = \gamma Z + \nu, \qquad (2)$$
where
• The error term $\varepsilon$ stands for all exogenous factors that affect Y when P is held constant.
• The instrument Z should be independent of $\varepsilon$.
• The instrument Z should not affect Y when P is held constant (exclusion restriction).
• The instrument Z should not be independent of P.
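To see why these conditions are enough, it can help to write out the simple IV estimator for the scalar case. The derivation below is a standard textbook step added for illustration; the label $\beta_{IV}$ is notation introduced here and does not appear in the original.

```latex
\begin{align*}
\mathrm{cov}(Z, Y) &= \beta\,\mathrm{cov}(Z, P) + \mathrm{cov}(Z, \varepsilon)
                    = \beta\,\mathrm{cov}(Z, P)
                    && \text{since } \mathrm{cov}(Z, \varepsilon) = 0, \\
\beta_{IV} &= \frac{\mathrm{cov}(Z, Y)}{\mathrm{cov}(Z, P)} = \beta
                    && \text{requires } \mathrm{cov}(Z, P) \neq 0, \\
\frac{\mathrm{cov}(P, Y)}{\mathrm{var}(P)} &= \beta + \frac{\mathrm{cov}(P, \varepsilon)}{\mathrm{var}(P)} \neq \beta
                    && \text{(OLS, when } \mathrm{cov}(P, \varepsilon) \neq 0\text{)}.
\end{align*}
```

The first line uses the independence of Z and $\varepsilon$, the second the requirement that Z is not independent of P, and the third shows the term that makes OLS inconsistent.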
Figure 3 gives a graphical representation of the notions above.
[Figure 3: The instrumental variable Z solves the inconsistency of estimates problem caused by endogeneity. Path diagram relating the instrument Z, the endogenous regressor P, the response y and the error $\varepsilon$.]
Today, researchers have a variety of instrumental variable models to choose from, depending
on the research question, data type (cross-sectional vs. panel, single-level vs. multilevel),
and on the number of available external instrumental variables. Examples are: two-stage
and three-stage least squares estimation as well as the generalized method of moments. A short
presentation of the most popular R packages for external instrumental variables estimation is
given below.
• The simplest and most commonly used method for tackling endogeneity in linear single-level
models is the two-stage least squares (2SLS) approach proposed by Theil (1958).
In the first stage, each endogenous variable is regressed on all the exogenous variables
in the model, both exogenous covariates and the excluded instruments. In the second
stage, the regression of interest is estimated as usual via ordinary least squares, except
that each endogenous covariate is replaced with the predicted values from the first stage.
In R, two-stage least squares can be applied using the ivreg() function in the AER (Kleiber
and Zeileis 2015) package as well as the tsls() function in the sem (Fox, Nie, and
Byrnes 2016) package.
• Another widely used method for correcting endogeneity is the generalized method of
moments (GMM) proposed by Hansen (1982). GMM is a class of estimators which
are constructed by exploiting the sample moment counterparts of population moment
conditions of the data generating model. When the system is over-identified and the
sample size is large, GMM is more efficient than the two-stage least squares method. It
can be implemented in R using the gmm() function in the package of the same name, gmm
(Chaussé 2010).
• A package that offers the possibility of addressing endogeneity in a system of linear
equations is systemfit (Henningsen and Hamann 2015). The available estimation
methods are 2SLS, weighted 2SLS and 3SLS. A useful feature of the package
is the possibility of specifying either the same instrumental variables for all equations
or different ones for each equation. Through its function nlsystemfit(), it also offers
the possibility to estimate systems of non-linear equations.
• Package plm offers the possibility of estimating random and fixed effects for static
linear panel data, variable coefficients models and generalized method of moments for
dynamic models. For all the models, the package also offers the possibility of addressing
endogeneity using external instrumental variables.
• SemiParBIVProbit (Marra and Radice 2011) allows fitting a sample selection binary
response model. Being able to handle the possible presence of nonlinear relationships
between the binary outcome and the continuous covariates, SemiParBIVProbit (Marra
and Radice 2011) offers more flexibility than previous methods, such as the one proposed
by Van de Ven and Van Praag (1981).
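The two-stage procedure described in the 2SLS item of the list above can be sketched by hand. This is a minimal illustration in Python with simulated data, not the implementation behind ivreg() or tsls(); the data generating process, the helper fit_line and all parameter values are assumptions made for the example.

```python
import random

def fit_line(x, y):
    """Return (intercept, slope) of the OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

rng = random.Random(42)
n, beta = 20000, -1.0
z, p, y = [], [], []
for _ in range(n):
    z_i = rng.gauss(0, 1)            # external instrument, independent of the error
    e = rng.gauss(0, 1)              # structural error
    nu = 0.5 * e + rng.gauss(0, 1)   # first-stage error, correlated with e
    p_i = 1.0 * z_i + nu             # equation (2) with gamma = 1 (assumed)
    z.append(z_i)
    p.append(p_i)
    y.append(beta * p_i + e)         # equation (1)

# Stage 1: regress the endogenous regressor on the instrument, keep fitted values.
a1, g1 = fit_line(z, p)
p_hat = [a1 + g1 * v for v in z]

# Stage 2: regress the response on the fitted values from stage 1.
_, beta_2sls = fit_line(p_hat, y)
_, beta_ols = fit_line(p, y)

print(f"true beta = {beta}, OLS = {beta_ols:.3f}, 2SLS = {beta_2sls:.3f}")
```

Running the second stage by hand gives the correct point estimate, but the standard errors reported by a plain second-stage OLS would not be valid, which is one reason to rely on a dedicated routine in practice.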
Table 1 provides a short overview of the methods tackling endogeneity using external IVs
already implemented in R:
Columns: Approach; Level DV (Cont., Binary); No. endog. regres. (single, multiple); Level endog. var. (Cont., Discrete); Function (Pkg)
2SLS: ivreg(AER), tsls(sem), systemfit(systemfit), plm(plm)
3SLS: systemfit(systemfit)
GMM: gmm(gmm)
Copula splines: SemiParBIVProbit(SemiParBIVProbit)
Table 1: External IV Methods
In real-world applications, however, we cannot determine how severe the endogeneity problem is, since
we cannot test how large the correlation between the endogenous regressor and the
error is. Thus, for an unbiased and consistent estimate it is important to find a very good,
“strong” instrumental variable (see Appendix A). Stock and Yogo (2002) were the first to point
to the importance of the quality of the instrumental variable used in IV estimation and
to differentiate instrumental variables, according to their correlation with the
endogenous regressor, into weak (low correlation) and strong (high correlation) instruments.
Considering equation 2, Stock and Yogo (2002) constructed the
“concentration” parameter, $\mu^2$:
$$\mu^2 = \gamma' Z' Z \gamma / \sigma_\nu^2 \qquad (3)$$
and, depending on the number of instruments, proposed thresholds for the concentration
parameter below which the instruments are considered weak (Stock and Yogo 2002).
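For a single instrument, the concentration parameter in equation 3 reduces to $\gamma^2 \sum_t Z_t^2 / \sigma_\nu^2$. The sketch below evaluates it for a weak and a strong first-stage coefficient. The function name, the simulated instrument and the two values of γ are assumptions chosen for illustration, and no weak-instrument threshold is implied.

```python
import random

def concentration_parameter(z, gamma, sigma_nu):
    """mu^2 = gamma' Z'Z gamma / sigma_nu^2 for one instrument (scalar gamma)."""
    return gamma ** 2 * sum(v * v for v in z) / sigma_nu ** 2

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(1000)]   # simulated instrument (assumed)

mu2_weak = concentration_parameter(z, gamma=0.05, sigma_nu=1.0)
mu2_strong = concentration_parameter(z, gamma=1.0, sigma_nu=1.0)

print(f"weak instrument:   mu^2 = {mu2_weak:.2f}")
print(f"strong instrument: mu^2 = {mu2_strong:.2f}")
```

Because $\mu^2$ scales with $\gamma^2$ and with the sample size, a weak first-stage relationship can only be compensated by a much larger sample.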
Depending on the strength of the instrumental variables and on the sample size, the performance
of external instrumental variable methods compared to OLS can vary significantly (see Appendix 1 - Table A.1 for a comparison of OLS and the two-stage least squares method with weak
and strong instruments). Finding a suitable instrumental variable is far from easy, which is the
reason why researchers have proposed alternative “instrument-free” models such as the ones in
REndo, detailed below.
4. Internal IV Methods
Internal instrumental variable methods have been proposed for cases when no observable
variable can be found that satisfies the properties of a strong instrumental variable. The
identification strategy of these methods is conditional on distributional assumptions of the
endogenous regressors and of the error term. For example, Ebbes et al. (2005) assume the
distribution of the endogenous variable to be discrete, Lewbel (1997), Lewbel (2012) and Park
and Gupta (2012) take it skewed, while Rigobon (2003) and Hogan and Rigobon (2002) work
with a heteroskedastic distribution.
Among the existing single-level instrument-free methods, the current paper implements the
latent instrumental variables approach proposed by Ebbes et al. (2005), the higher moments
estimation proposed by Lewbel (1997), the heteroskedastic errors approach (Lewbel 2012), which has also been implemented in the package ivlewbel,
and the joint estimation using Gaussian copula (Park and Gupta 2012). These methods have
been chosen for inclusion in the package since they are the internal instrumental variable
methods most frequently used by researchers in the social sciences.
There are many instances in which the data have a hierarchical structure, for example students
clustered in classes and in schools. Even longitudinal data can be seen as a series of repeated
measurements nested within individuals. For such data structures multilevel models have been
developed (Longford 1995; Raudenbush and Bryk 1986). These models are regression methods
that recognize the existence of data hierarchies by allowing for residual components at each
level in the hierarchy. In these models endogeneity can have two sources: either the regressors
are correlated with the random components, or they are correlated with the structural error
of the model at the lowest level (level-1 dependence). While methods such as
two-stage least squares, weighted two-stage least squares or the generalized method of moments
could in theory be used in a multilevel setting to deal with both level-1 and level-2 endogeneity, no such
implementations are available in R. REndo is the first R package that
tackles endogeneity in a multilevel setting, by implementing the multilevel generalized
method of moments proposed by Kim and Frees (2007).
Table 2 gives an overview of the instrument-free methods present in REndo, emphasizing
the assumptions of each of the approaches. Four of the methods apply to single level data,
while the approach of Kim and Frees (2007) addresses endogeneity in multilevel models. All
approaches allow for more than one endogenous regressor, the exception being the latent
instrumental variable approach, which also does not allow additional explanatory
variables besides the endogenous variable. The response variable is assumed to be continuous
in all methods.
Table 2: Internal Instrumental Variables Models

Latent Instrumental Variable Method, Ebbes et al. (2005)
Assumptions:
- $P_t \neq N(\cdot, \cdot)$;
- $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$; $\mathrm{corr}(Z_t, \varepsilon_t) = 0$;
- $Z_t$ discrete with at least 2 groups with different means.
Short description: one endogenous regressor; no additional exogenous regressors; ML estimation; single-level model.

Copula Correction Method, Park and Gupta (2012)
Assumptions:
- $P_t \neq N(\cdot, \cdot)$, $P_t \neq$ bimodal;
- $P_t$ can be discrete, but not Bernoulli;
- $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$.
Short description: multiple endogenous regressors; additional exogenous regressors allowed; ML estimation; single-level model.

Higher Moments Method, Lewbel (1997)
Assumptions:
- $Z_t$ skewed distribution;
- $E(\varepsilon_t) = 0$, $E(\nu_t) = 0$;
- Third moment of the data exists.
Short description: multiple endogenous regressors; additional exogenous regressors allowed; TSLS estimation; single-level model.

Heteroskedastic Errors Method, Lewbel (2012)
Assumptions:
- $\mathrm{cov}(Z, \nu^2) \neq 0$;
- $E(X\varepsilon_t) = 0$, $E(X\nu_t) = 0$;
- $E(XX')$ non-singular.
Short description: multiple endogenous regressors; additional exogenous regressors allowed; TSLS estimation; single-level model.

Multilevel GMM Method, Kim and Frees (2007)
Assumptions:
- Model: $Y_{ij} = \beta_{0j} + \beta_{ij} X_{ij} + \varepsilon_{ij}$, $\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$, $\beta_{ij} = \gamma_{10}$;
- $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$, $u_{0j} \sim N(0, \sigma_u^2)$;
- $\mathrm{Cov}(X_{ij}, \varepsilon_{ij}, W_j, u_{0j}) = 0$.
Short description: multiple endogenous regressors; additional exogenous regressors allowed; GMM estimation; multilevel model.
4.1. Internal IV methods for non-hierarchical data
The four internal instrumental variable methods presented in this section share the same
underlying model presented in equations 4 and 5, while the specific characteristics of each
method are discussed in the subsequent sections.
Consider the model:
$$Y_t = \beta_0 + \beta_1 P_t + \beta_2 X_t + \varepsilon_t \qquad (4)$$
where t = 1, .., T indexes either time or cross-sectional units, $Y_t$ is a 1 x 1 response variable,
$X_t$ is a k x n exogenous regressor, $P_t$ is a k x 1 continuous endogenous regressor, $\varepsilon_t$ is a
structural error term with mean zero and $E(\varepsilon^2) = \sigma_\varepsilon^2$, and $\beta_0$, $\beta_1$ and $\beta_2$ are model parameters. The
endogeneity problem arises from the correlation of $P_t$ and $\varepsilon_t$. As such:
$$P_t = \gamma Z_t + \nu_t \qquad (5)$$
where $Z_t$ is an l x 1 vector of internal instrumental variables, and $\nu_t$ is a random error with
mean zero, $E(\nu^2) = \sigma_\nu^2$ and $E(\varepsilon\nu) = \sigma_{\varepsilon\nu}$. Z is assumed to be stochastic with distribution
G and ν is assumed to have density h(·). The latent instrumental variables and the higher
moments models assume Z to be uncorrelated with the structural error, which is similar to the
“exclusion restriction” assumption for observed instrumental variables methods. Moreover, Z
is also assumed unobserved. Therefore, Z and ν cannot be identified without distributional
assumptions. The distributions of Z and ν should be specified such that two conditions are
met: (1) endogeneity of P is corrected, and (2) the distribution of P is empirically close to
the integral that expresses the amount of overlap of Z as it is shifted over ν (= the convolution
between Z and ν). When the density h(·) is chosen to be normal, G cannot also be
normal, because the parameters would not be identified (Ebbes et al. 2005).
The ensuing subsections follow a general structure, where we present the underlying idea of
each method, followed by its specific characteristics and in the end we present the particular
assumptions and weaknesses of the method.
Latent Instrumental Variables Method
Ebbes et al. (2005) propose the latent instrumental variables (LIV) model as presented in
equations 4 and 5, with both errors being normally distributed. LIV does not accept additional
covariates besides the endogenous regressor, P, whose distribution has to be non-normal.
The novelty of the approach is that the internal instrumental variables $Z_t$ are assumed
unobserved, discrete and exogenous, with an unknown number of groups m, while γ is a vector
of group means. The method accepts just one endogenous regressor.
Identification of the parameters relies on the distributional assumptions of the latent instruments as well as that of the endogenous regressor, Pt . Specifically, the endogenous regressor
should have a non-normal distribution while the unobserved instruments, Z, should be discrete and have at least two groups with different means (Ebbes, Wedel, and Boeckenholt
2009). A continuous distribution for the instruments leads to an unidentified model, while a
normal distribution of the endogenous regressor gives rise to inefficient estimates.
The LIV model in REndo assumes that the latent instrumental variable has two categories.
Departure from this assumption, with a true instrument with eight categories or more, has
been proven to greatly affect the efficiency of the estimates. Figure 4 presents the performance
of the latent instrumental variable method in comparison with ordinary least squares. While
the OLS estimate is biased but has a very small variance, the LIV estimate is much closer to the true value of −1 but has a larger standard
deviation. The mean and variance of the OLS estimate are −0.59 and 0.0002, while the mean
and variance of the LIV estimate are −0.88 and 0.20, respectively. For a more detailed overview
of the method see Ebbes et al. (2005).
Joint Estimation Using Copula Method
Park and Gupta (2012) propose a method that allows for the joint estimation of the continuous
endogenous regressor and the error term using Gaussian copulas. A copula is a function that
maps several marginal cumulative distribution functions (CDFs) into their joint CDF (see Appendix 2
- Figure C.2).
The underlying idea of the joint estimation method is that, using information contained in
the observed data, one selects marginal distributions for the endogenous regressor and the
structural error term, respectively. Then, the copula model enables the construction of a
flexible multivariate joint distribution allowing a wide range of correlations between the two
marginals.
[Figure 4: Latent Instrumental Variable vs OLS: coefficient estimates were obtained over 1000 simulated samples, each of size 2500; true parameter value is −1. Density of the estimates (vertical axis) against the coefficient of the endogenous regressor (horizontal axis, −2 to 2) for LIV and OLS; the true value −1 is marked.]
In equation 4, the error $\varepsilon_t$ is assumed to have a normal marginal distribution, while the
marginal distribution of the endogenous regressor $P_t$ is obtained using the Epanechnikov
kernel density estimator (Epanechnikov 1969), as below:
$$\hat{h}(p) = \frac{1}{T \cdot b} \sum_{t=1}^{T} K\left(\frac{p - P_t}{b}\right) \qquad (6)$$
where $P_t$ is the endogenous regressor, $K(x) = 0.75 \cdot (1 - x^2) \cdot I(\|x\| \le 1)$ and the bandwidth
b is the one proposed by Silverman (1986), equal to $b = 0.9 \cdot T^{-1/5} \cdot \min(s, IQR/1.34)$.
IQR is the interquartile range, s is the sample standard deviation and T is the
number of time periods observed in the data. After obtaining the joint distribution between
the error term and the continuous endogenous regressor, the model parameters are estimated
using maximum likelihood estimation.
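Equation 6 and the bandwidth rule can be written out directly. The following is a minimal sketch using only the Python standard library; the function names and the simulated exponential data are assumptions for illustration, and the quartiles come from statistics.quantiles with its default method, which may differ slightly from the quantile definition used in REndo.

```python
import random
import statistics

def silverman_bandwidth(data):
    """b = 0.9 * T^(-1/5) * min(s, IQR / 1.34), as stated in the text."""
    t = len(data)
    s = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 0.9 * t ** (-1 / 5) * min(s, (q3 - q1) / 1.34)

def epanechnikov(x):
    """K(x) = 0.75 * (1 - x^2) on |x| <= 1 and zero elsewhere."""
    return 0.75 * (1 - x * x) if abs(x) <= 1 else 0.0

def kde(p, data, b):
    """Equation (6): kernel density estimate of the regressor at point p."""
    return sum(epanechnikov((p - p_t) / b) for p_t in data) / (len(data) * b)

rng = random.Random(11)
data = [rng.expovariate(1.0) for _ in range(500)]   # skewed regressor (assumed)
b = silverman_bandwidth(data)

print(f"bandwidth b = {b:.3f}")
for point in (0.1, 0.5, 1.0, 2.0):
    print(f"h_hat({point}) = {kde(point, data, b):.3f}")
```

Integrating the estimated density from below gives the estimated marginal distribution function of the regressor that enters the copula likelihood.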
With more than one continuous endogenous regressor, or when the endogenous regressor is
discrete, an alternative to the estimation using the Gaussian copula should be applied,
which is similar to the control function approach (Petrin and Train 2010). The idea is to run
OLS after adding another regressor to the original explanatory variables in equation 4,
namely $P_t^* = \Phi^{-1}(H(P_t))$, where $H(P_t)$ is the marginal distribution function of the endogenous
regressor, P. Including this generated regressor removes the correlation between the endogenous
regressor and the structural error in equation 4, so that OLS provides consistent estimates.
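The generated-regressor idea can be illustrated on simulated data. In the sketch below the regressor has an exponential marginal and is linked to a normal error through a Gaussian copula; H is approximated by the empirical distribution function (rank / (n + 1)). All of these choices, together with the helper functions solve and ols, are assumptions made for the example and not the REndo implementation.

```python
import math
import random
from statistics import NormalDist

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """OLS coefficients; `columns` is a list of regressor columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    return solve(xtx, xty)

rng = random.Random(7)
std_normal = NormalDist()
n, beta, rho = 5000, -1.0, 0.5
p, y = [], []
for _ in range(n):
    p_star = rng.gauss(0, 1)
    # Error correlated with the latent normal score of P (Gaussian copula).
    eps = rho * p_star + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    p_i = -math.log(std_normal.cdf(-p_star))   # exponential marginal for P
    p.append(p_i)
    y.append(beta * p_i + eps)

# Generated regressor: P* = Phi^{-1}(H(P)), with H the empirical CDF.
order = sorted(range(n), key=lambda i: p[i])
h = [0.0] * n
for rank, i in enumerate(order, start=1):
    h[i] = rank / (n + 1)
p_gen = [std_normal.inv_cdf(u) for u in h]

ones = [1.0] * n
b_ols = ols([ones, p], y)
b_cop = ols([ones, p, p_gen], y)
print(f"true beta = {beta}, OLS = {b_ols[1]:.3f}, with generated regressor = {b_cop[1]:.3f}")
```

The correction works here because the regressor is clearly non-normal; if P were itself normal, P and the generated regressor would be collinear and the coefficient could not be separated.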
[Figure 5: Copula Correction vs. OLS: coefficient estimates were obtained over 1000 simulated samples, each of size 2500; true parameter value is −1. Density of the estimates (vertical axis) against the coefficient of the endogenous regressor (horizontal axis, −2 to 2) for the copula method and OLS; the true value −1 is marked.]
The method has been shown to perform well in recovering the true parameter values even with
departures from the normality assumption of the structural error, given that the distribution of
the endogenous regressor is different from that of the error. Figure 5 presents the performance
of the copula correction method compared to the performance of OLS. Once again, the copula
correction method is less biased than OLS when endogeneity is present, although its estimates
have a slightly larger variance. The estimate of the coefficient of the endogenous variable has a mean and variance
of −0.62 and 0.002 with OLS, while the mean and variance of the estimate obtained with
the copula correction method are −0.97 and 0.003, respectively. For further technical details,
see Park and Gupta (2012).
Higher Moments Method
The method proposed by Lewbel (1997) helps identify structural parameters in regression
models with endogeneity caused by measurement error, as presented in equations 4 and 5.
Identification is achieved by exploiting third moments of the data.
The internal instruments, $Z_t$, and the error terms in equations 4 and 5, $\varepsilon_t$ and $\nu_t$, respectively, are assumed unobserved. Unlike in the previous models, no restriction is imposed on the
distribution of the two error terms, while their means are set to zero. Lewbel (1997) proves
that the following instruments can be constructed and used with two-stage least squares
estimation to obtain consistent estimates:
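$$q_{1t} = (G_t - \bar{G}) \qquad (7a)$$
$$q_{2t} = (G_t - \bar{G})(P_t - \bar{P}) \qquad (7b)$$
$$q_{3t} = (G_t - \bar{G})(Y_t - \bar{Y}) \qquad (7c)$$
$$q_{4t} = (Y_t - \bar{Y})(P_t - \bar{P}) \qquad (7d)$$
$$q_{5t} = (P_t - \bar{P})^2 \qquad (7e)$$
$$q_{6t} = (Y_t - \bar{Y})^2 \qquad (7f)$$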
Here, $G_t = G(X_t)$ for any given function G that has finite third own and cross moments, and
X denotes all the exogenous regressors in the model. $\bar{G}$ is the sample mean of $G_t$. The same applies
to $P_t$ and $Y_t$.
The instruments in equations 7e and 7f can be used only when the measurement and the
structural errors are symmetrically distributed. Otherwise, the use of the instruments does
not require any distributional assumptions for the errors. Given that the regressors G(X) = X
are included as instruments, G(X) should not be linear in X in equation 7a.
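Constructing the instruments in equations 7a to 7f only requires centring the data and multiplying. The sketch below uses $G(X) = X^2$ as an example of a non-linear G; the function names, the symmetric_errors switch and the four toy observations are assumptions for illustration only.

```python
def centered(values):
    """Subtract the sample mean from every element."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def higher_moment_instruments(g, p, y, symmetric_errors=False):
    """Build q1..q4 (equations 7a-7d) and, optionally, q5 and q6 (7e, 7f)."""
    gc, pc, yc = centered(g), centered(p), centered(y)
    q = {
        "q1": gc,
        "q2": [a * b for a, b in zip(gc, pc)],
        "q3": [a * b for a, b in zip(gc, yc)],
        "q4": [a * b for a, b in zip(yc, pc)],
    }
    if symmetric_errors:
        # Only valid when both errors are symmetrically distributed.
        q["q5"] = [a * a for a in pc]
        q["q6"] = [a * a for a in yc]
    return q

x = [1.0, 2.0, 3.0, 4.0]
g = [v ** 2 for v in x]        # G(X) = X^2, an assumed non-linear choice
p = [0.5, 1.0, 2.5, 4.0]
y = [2.0, 1.0, -1.0, -2.0]

q = higher_moment_instruments(g, p, y)
print(q["q2"])
```

The resulting columns would then be passed as instruments to a two-stage least squares routine together with the exogenous regressors.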
An important part for identification is played by the assumption of skewness of the endogenous
regressor. To cite Lewbel (1997), ”the greater the skewness, the better the quality of the
proposed instruments”. If this assumption fails, the instruments may be weak and thus the
estimates will be biased. Since the instruments constructed come along with very strong
assumptions, one of their best uses is to provide over-identifying information. The overidentification provided by constructed moments can be used to test validity of a potential
outside instrument, to increase efficiency, and to check for robustness of estimates based on
alternative identifying assumptions. Figure 6 compares the estimates of
the higher moments approach and OLS. While ordinary least squares is more biased, with an
average estimate equal to −0.87 and variance equal to 0.0002, the higher moments estimator
has an average of −1.00 but is highly volatile, with a variance of 0.18. For more details and
the proofs of the above results, see Lewbel (1997).
[Figure 6: Higher Moments vs OLS: coefficient estimates were obtained over 1000 simulated samples, each of size 2500; true parameter value is −1. Density of the estimates (vertical axis) against the coefficient of the endogenous regressor (horizontal axis, −2 to 2) for the higher moments method and OLS; the true value −1 is marked.]
Heteroskedastic Errors Method
The method proposed in Lewbel (2012) identifies structural parameters in regression models
with endogenous regressors by having variables that are uncorrelated with the product of
heteroskedastic errors. This feature is encountered in many models where error correlations
are due to an unobserved common factor (Lewbel 2012). The instruments are constructed as
simple functions of the model’s data. The method can be applied when no external instruments are available or to supplement external instruments to improve the efficiency of the IV
estimator. REndo implements the identification through heteroskedastic errors in the case of
triangular systems.
Consider the model in equations (4) and (5), with the exception that in equation 5 $P_t$ is a
function of $X_t$, while $Z_t$ is a subset of $X_t$. The model assumes that $E(X\varepsilon) = 0$, $E(X\nu) = 0$
and $\mathrm{cov}(Z, \varepsilon\nu) = 0$. The errors, $\varepsilon$ and $\nu$, may be correlated with each other. Structural
parameters are identified by an ordinary two-stage least squares regression of Y on X and
P, using X and $[Z - E(Z)]\nu$ as instruments. A vital assumption for identification is that
$\mathrm{cov}(Z, \nu^2) \neq 0$. The strength of the instrument is proportional to the covariance of $(Z - \bar{Z})\nu$
with $\nu$, which corresponds to the degree of heteroskedasticity of $\nu$ with respect to Z (Lewbel
2012). The assumption that the covariance between Z and the squared error is different from
zero can be empirically tested. If it is zero or close to zero, the instrument is weak, producing
imprecise estimates, with large standard errors.
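The steps of this approach (first-stage residuals, constructed instrument, IV regression) can be sketched on simulated data. The design below, with a common factor in both errors and a first-stage error whose variance depends on X, is an assumption chosen so that the stated conditions hold with Z = X. The helper functions and parameter values are illustrative and do not reproduce REndo or ivlewbel.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def cross(cols_a, cols_b):
    """Matrix of inner products A'B for two lists of columns."""
    return [[sum(u * v for u, v in zip(ca, cb)) for cb in cols_b] for ca in cols_a]

def inner(cols, v):
    """Vector A'v for a list of columns A and a vector v."""
    return [sum(a * b for a, b in zip(c, v)) for c in cols]

rng = random.Random(3)
n, b0, b1, b2 = 10000, 0.5, -1.0, 1.0
x, p, y = [], [], []
for _ in range(n):
    x_i = rng.gauss(0, 1)
    u = rng.gauss(0, 1)                               # unobserved common factor
    eps = u + rng.gauss(0, 1)                         # structural error
    nu = u + math.exp(0.5 * x_i) * rng.gauss(0, 1)    # heteroskedastic in X
    p_i = 1.0 + 1.0 * x_i + nu
    x.append(x_i)
    p.append(p_i)
    y.append(b0 + b1 * p_i + b2 * x_i + eps)

ones = [1.0] * n
# Step 1: residuals from the first-stage regression of P on (1, X).
first = solve(cross([ones, x], [ones, x]), inner([ones, x], p))
nu_hat = [p_i - first[0] - first[1] * x_i for p_i, x_i in zip(p, x)]
# Step 2: constructed instrument (Z - mean(Z)) * nu_hat, with Z = X.
x_bar = sum(x) / n
q = [(x_i - x_bar) * v for x_i, v in zip(x, nu_hat)]
# Step 3: just-identified IV, solving (Z'W) beta = Z'y.
regressors, instruments = [ones, x, p], [ones, x, q]
beta_iv = solve(cross(instruments, regressors), inner(instruments, y))
beta_ols = solve(cross(regressors, regressors), inner(regressors, y))
print(f"true b1 = {b1}, OLS = {beta_ols[2]:.3f}, heteroskedastic-error IV = {beta_iv[2]:.3f}")
```

If the factor exp(0.5 * x_i) is replaced by a constant, the first-stage error becomes homoskedastic and the constructed instrument loses its strength, which mirrors the identification condition $\mathrm{cov}(Z, \nu^2) \neq 0$.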
Under homoskedasticity, the parameters of the model are unidentified. Identification
is achieved, however, in the presence of heteroskedasticity related to at least some elements of X.
This identification strategy is less reliable than identification based on coefficient zero
restrictions, since it relies upon higher moments. Sometimes, however, it might be the only available
strategy. The performance of the method, compared with OLS, on a sample of 1000 randomly
generated datasets, is depicted in Figure 7. As with the previous methods, the heteroskedastic
error method is less efficient than OLS, but its coefficient estimate is nearly unbiased, unlike
that of ordinary least squares. The mean coefficient estimate in the case of OLS is −0.87,
compared to −0.97 in the case of the IIV method. The variance of the OLS estimate is 0.0002
Gui, Meierer, Algesheimer
19
compared to 0.043 of the heteroskedastic error method.
[Figure 7 plot: density of the coefficient estimates (x-axis: Endogenous Regressor, y-axis: Density) for the two methods, Heteroskedastic errors and OLS; the true value, −1, is marked.]
Figure 7: Heteroskedastic Errors IV vs OLS: coefficient estimates were obtained over 1000 simulated
samples, each of size 2500; true parameter value is −1.
4.2. Internal IV methods for hierarchical data
Many kinds of data have a hierarchical structure. Multilevel modeling is a generalization
of regression methods that recognize the existence of such data hierarchies by allowing for
residual components at each level in the hierarchy. For example, a three-level multilevel
model which allows for grouping of students within classrooms, over time, would include
time, student and classroom residuals (see equations 8 to 10). Thus, the residual variance is
partitioned into four components: between-classroom (the variance of the classroom-level
residuals), within-classroom (the variance of the student-level residuals), between-student
(the variance of the student-level residuals) and within-student (the variance of the
time-level residuals). The classroom residuals represent the unobserved classroom
characteristics that affect students' outcomes. These unobserved variables lead to correlation
between outcomes for students from the same classroom. Similarly, the unobserved time
residuals lead to correlation between a student's outcomes over time. A three-level model can
be described as below:
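\begin{align}
y_{cst} &= Z^{1}_{cst}\,\beta^{1}_{cs} + X^{1}_{cst}\,\beta_{1} + \epsilon^{1}_{cst} \tag{8} \\
\beta^{1}_{cs} &= Z^{2}_{cs}\,\beta^{2}_{c} + X^{2}_{cs}\,\beta_{2} + \epsilon^{2}_{cs} \tag{9} \\
\beta^{2}_{c} &= X^{3}_{c}\,\beta_{3} + \epsilon^{3}_{c}. \tag{10}
\end{align}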
As in single-level regression, endogeneity is also a concern in multilevel models. The
additional problem is that in multilevel models there are multiple independence assumptions
involving various random components at different levels. Any moderate correlation between
some predictors and a random component or error term can result in a significant bias of
the coefficients and of the variance components (Ebbes et al. 2005). While panel data models
for dealing with endogeneity can be used to address the same problem in two-level multilevel
models, their implications for higher levels have not been closely examined (Kim and Frees
2007). Moreover, panel data models allow only the inclusion of one random intercept, while
multilevel models require methodology that can handle more general error structures.
Multilevel Instrumental Variables Method
Exploiting the hierarchical structure of multilevel data, Kim and Frees (2007) propose a
generalized method of moments technique for addressing endogeneity in multilevel models
without the need of external instrumental variables. This is realized by using both the between
and within variations of the exogenous variables, but only the within variation of the variables
assumed endogenous. The assumptions of the multilevel generalized method of moments
model are that the errors at each level are normally distributed and independent of each other.
Moreover, the slope variables are assumed exogenous. Since the model does not handle "level
1 dependencies", an additional assumption is that the level 1 structural error is uncorrelated
with any of the regressors. If this assumption is not met, additional, external instruments are
necessary.
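The idea of using only the within variation of an endogenous regressor can be illustrated with a minimal two-level sketch in base R. This is not the Kim and Frees (2007) GMM estimator itself, only the within (fixed effects) transformation that it builds on; the simulated design, the true coefficient of −1 and all object names are assumptions for illustration.

```r
set.seed(42)
n_groups <- 200
n_per    <- 10
g <- rep(seq_len(n_groups), each = n_per)      # group identifier
u <- rnorm(n_groups)                           # unobserved higher-level error

# x is correlated with the group-level error u, hence endogenous
x <- 0.8 * u[g] + rnorm(n_groups * n_per)
y <- 2 - 1 * x + u[g] + rnorm(n_groups * n_per)

# Pooled regression uses between and within variation of x: biased
b_pooled <- unname(coef(lm(y ~ x))["x"])

# Within transformation: subtract group means, which removes u
x_w <- x - ave(x, g)
y_w <- y - ave(y, g)
b_within <- unname(coef(lm(y_w ~ x_w - 1))["x_w"])

print(c(pooled = b_pooled, within = b_within))
```

Because the group-level error is swept out together with the group means, the within estimate is close to −1, whereas the pooled estimate is biased towards zero in this design.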
Kim and Frees (2007) employ a three-level multilevel model as in equations (8) to (10). The
coefficients of the explanatory variables appear in the vectors β_1, β_2 and β_3. The term
β^1_{cs} captures latent, unobserved characteristics that are classroom and student specific,
while β^2_{c} captures latent, unobserved characteristics that are classroom specific. For
identification, the disturbance term ε^1_{cst} is assumed independent of the other variables,
Z^1_{cst} and X^1_{cst}.
Given the set of disturbance terms at different levels, there exist a couple of possible correlation
patterns that could lead to biased results:
• errors at both levels (ε^2_{cs} and ε^3_{c}) are correlated with some of the regressors,
• only third level errors (ε^3_{c}) are correlated with some of the regressors,
• an intermediate case, where there is concern with errors at both levels, but there is not
enough information to estimate level 3 parameters.
When all model variables are assumed exogenous, the GMM estimator is the usual GLS
estimator, denoted as b_RE. When all variables are assumed endogenous, the fixed-effects
estimator, b_FE, is used. While b_RE assumes all explanatory variables are uncorrelated with
the random intercepts and slopes in the model, b_FE allows for endogeneity of all effects but
sweeps out the random components as well as the explanatory variables at the same levels.
The more general estimator b_GMM proposed by Kim and Frees (2007) allows for some of
the explanatory variables to be endogenous and uses this information to build instrumental
variables. The multilevel GMM estimator uses both the between and within variations of
the exogenous variables, but only the within variation of the variables assumed endogenous.
When all variables are assumed exogenous, the b_GMM estimator equals b_RE. When all covariates
are assumed endogenous, b_GMM equals b_FE. To facilitate the choice of the estimator to be
used for the given data, Kim and Frees (2007) also proposed an omitted variable test. This
test is based on the Hausman test (Hausman 1978) for panel data. The omitted variable
test allows the comparison of a robust estimator and an estimator that is efficient under the
null hypothesis of no omitted variables, and also the comparison of two robust estimators at
different levels.
Figure 8 shows the performance of the multilevel generalized method of moments estimator
compared to the usual random effects model employed with hierarchical data. Here the data
are three-level hierarchical data with one level-two endogenous regressor. The mean random
effects estimate is equal to −0.55 with a standard error of 0.022, while the multilevel GMM
estimate has a mean of −0.999 over 1000 simulations, with a standard error of 0.049.
[Figure 8 plot: density of the coefficient estimates (x-axis: Endogenous Regressor, y-axis: Density) for the two methods, Multilevel IV and Random Effects; the true value, −1, is marked.]
Figure 8: MultilevelIV vs Random effects: coefficient estimates were obtained over 1000 simulated
samples, each of size 2800; true parameter value is −1.
Citing Kim and Frees (2007), "the (multilevel) GMM estimators can help researchers in the
direction of exploiting the rich hierarchical data and, at the same time, impeding the improper
use of powerful multilevel models." REndo facilitates the use of this method, thus contributing
to a growing number of applications of multilevel models in the social sciences.
5. Using REndo package
REndo encompasses five functions that allow the estimation of linear models with one or more
endogenous regressors using internal instrumental variables. Depending on the assumptions
of the model and the structure of the data, single or multilevel, the researcher can use one of
the following functions:
• latentIV() - implements the latent instrumental variable estimation as in Ebbes et al.
(2005). The endogenous variable is assumed to have two components - a latent, discrete
and exogenous component with an unknown number of groups and the error term that
is assumed normally distributed and correlated with the structural error. The method
supports only one endogenous, continuous regressor and no additional explanatory variables. A Bayesian approach is needed when the model assumes additional exogenous
regressors.
• copulaCorrection() - models the correlation between the endogenous regressor and
the structural error with the use of Gaussian copula. The endogenous regressor can
be continuous or discrete. The method allows estimating a model with more than one
endogenous regressor.
• higherMomentsIV() - implements the higher moments approach described in Lewbel
(1997) where instruments are constructed by exploiting higher moments of the data,
under strong model assumptions. The method allows more than one endogenous regressor.
• hetErrorsIV() - uses the heteroskedasticity of the errors in a linear projection of the
endogenous regressor on the other covariates to solve the endogeneity problem induced
by measurement error, as proposed by Lewbel (2012). The presence of more than one
endogenous regressor is possible.
• multilevelIV() - implements the instrument-free multilevel GMM method proposed
by Kim and Frees (2007) where identification is possible due to the different levels of
the data. Endogenous regressors at different levels can be present. The function comes
with a built-in omitted variable test, which helps in deciding which model is robust to
omitted variables at different levels.
The package comes with seven simulated datasets. Using just one dataset for exemplifying the
functions is not possible due to different assumptions regarding the underlying data generating
process for each of the methods. Where possible, the names of the variables were kept
consistent across the datasets, with y for the response variable, P for the endogenous variables
and X for the exogenous regressors. The usage of each of the functions is presented in the
sections below.
5.1. latentIV() - Latent Instrumental Variables
To illustrate the LIV approach, the dataset dataLatentIV was artificially constructed. It
contains a response variable, y, an endogenous regressor, P, and a discrete latent variable Z
used in building P. An intercept is always considered. The latentIV() function can be called
providing only the model formula and the data. Alternatively, the function can be called
providing a set of initial parameter values. In any model there are eight parameters. The
first parameter is the intercept, then the coefficient of the endogenous variable followed by the
means of the two groups of the model. The means of the groups need to be different, otherwise
the model is not identified. The next three parameters are for the variance-covariance matrix.
The last parameter is the probability of being in group 1. When not provided, the initial
values are taken to be the least squares coefficients of regressing y on P. For the groups’
means, the mean and the mean plus one standard deviation of the endogenous regressor are
considered as initial parameter values. The variance-covariance parameters are taken to be
1, while the probability of being part of group one is 0.5.
Below, the function is called without providing initial values for the parameters:
R> data(dataLatentIV)
R> resultsLIV <- latentIV(y ~ P, data = dataLatentIV)
R> summary(resultsLIV)
Call:
latentIV(formula = y ~ P, data = dataLatentIV)
Coefficients:
           Estimate Std. Error t-value  Pr(>|t|)
Intercept   2.98902    0.02900  5153.5 < 2.2e-16 ***
P          -0.97716    0.03700 -1320.5 < 2.2e-16 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 .
Initial Parameter Values:
2.915 -0.852 0.589 2.001 1 1 1 0.5
Log likelihood function:
-7872.102
AIC: 15760.2 , BIC: 35744.2
Convergence Code:
0
The summary returns the coefficient estimates for the intercept (2.98), and for P (−0.97).
Here, the initial parameter values were not provided, therefore the ordinary least squares coefficient estimates were used, together with the default values for the correlation matrix and the
probability of belonging to group one: (2.915, −0.852, 0.589, 2.001, 1, 1, 1, 0.5). Being a maximum likelihood estimation, the function also returns the log likelihood value (−7872.102),
Akaike and Bayesian information criteria (AIC : 15760.2 and BIC : 35744.2), and a convergence code (here 0) that says whether the model converged (0) or not (1).
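The default starting values described in this subsection can be reproduced by hand. The following base R sketch is only an illustration of that description and not the package's internal code: it uses a small simulated stand-in for the data (an assumption, since dataLatentIV ships with REndo), and the names of the vector elements are hypothetical labels. The last lines check that the reported AIC equals 2k − 2 logLik with k = 8 parameters.

```r
set.seed(7)
n   <- 500
grp <- rbinom(n, 1, 0.5)                      # latent group membership (stand-in data)
P   <- ifelse(grp == 1, 2, 0) + rnorm(n)      # endogenous regressor built from two groups
y   <- 3 - P + rnorm(n)

ols <- coef(lm(y ~ P))                        # least squares coefficients of y on P

start <- c(
  intercept   = unname(ols[1]),   # 1: intercept
  coef_P      = unname(ols[2]),   # 2: coefficient of the endogenous regressor
  mean_group1 = mean(P),          # 3: first group mean
  mean_group2 = mean(P) + sd(P),  # 4: second group mean (must differ from the first)
  vcov_1      = 1,                # 5-7: variance-covariance parameters
  vcov_2      = 1,
  vcov_3      = 1,
  prob_group1 = 0.5               # 8: probability of belonging to group one
)
print(round(start, 3))

# AIC implied by the log likelihood reported in the summary output
k      <- 8
loglik <- -7872.102
aic    <- 2 * k - 2 * loglik
print(aic)
```

The same pattern is visible in the reported initial values: 0.589 and 2.001 play the role of the mean and the mean plus one standard deviation of P, followed by three ones and 0.5.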
For this dataset the true parameter value for the endogenous regressor is −1. The latent
internal instrument method coefficient for P is −0.97 with standard error 0.037, while OLS
returns a coefficient equal to −0.85 and standard error 0.013. As we have seen in the comparison of the OLS estimate with that of the LIV in the previous section, on average, LIV
produces unbiased estimates for the endogenous regressor but with very large standard errors.
Therefore, when using the latent instrumental variable approach caution has to be taken when
interpreting the results.
5.2. copulaCorrection() - Joint Estimation Using Gaussian Copulas
In handling endogeneity by copula correction three cases are discussed. The first case
considers one continuous endogenous regressor, the second assumes two continuous
endogenous regressors and the last assumes one discrete endogenous regressor. To exemplify
the use of the function, three datasets have been simulated (see Table 3).
Name       | Exogenous var. | Endogenous var.                  | True parameter values
dataCopC1  | X1, X2         | P continuous, t-distributed      | Exog. var.: β0 = 2, β1 = 1.5, β3 = −3; Endog. var.: α1 = −1; corr(P, ε) = 0.33
dataCopC2  | X1, X2         | P, P2 continuous, t-distributed  | Exog. var.: β0 = 2, β1 = 1.5, β3 = −3; Endog. var.: α1 = −1, α2 = 0.8; corr(P, ε) = 0.33, corr(P, P2) = −0.15, corr(P2, ε) = 0.1
dataCopD1  | X1, X2         | P discrete, Poisson distributed  | Exog. var.: β0 = 2, β1 = 1.5, β3 = −3; Endog. var.: α1 = −1; corr(P, ε) = 0.33
Table 3: Simulated datasets to exemplify copulaCorrection() function
Case 1: One continuous endogenous regressor The maximum likelihood estimation
returns, besides the estimates of the explanatory variables in the model, two additional estimates rho and sigma. Since these two variables are estimated from the data and not given
ex-ante, the standard errors of all the estimates need to be obtained using bootstrapping. By
default, the function returns the standard errors using 10 bootstrap replications. For a larger
number of bootstrap replications, function boots() from the same package can be used. The
syntax and output of the copulaCorrection() function for the first case is the following:
R> data(dataCopC1)
R> resultsCC1 <- copulaCorrection(formula = y ~ X1 + X2 + P, endoVar = "P",
type = "continuous", method = "1", intercept=TRUE, data = dataCopC1)
R> summary(resultsCC1)
Call:
copulaCorrection(formula = y ~ X1 + X2 + P, endoVar = "P", type = "continuous",
method = "1", intercept = TRUE, data = dataCopC1)
Coefficients:
           Estimate Std. Error  z-score Pr(>|z|)
Intercept   2.00177    0.02412   82.984        0
X1          1.50301    0.00727  206.731        0
X2         -3.01447    0.01179 -255.660        0
P          -0.99171    0.00376 -263.555        0
rho         0.39644    0.05798    6.838        0
sigma       1.41128    0.04678   30.168        0
---
Initial parameter values:
1.983 1.507 -3.014 -0.891 0 1
Log likelihood function:
-102
AIC: 216 , BIC: 15204
Convergence Code:
0
The summary returns the coefficient estimates of the model parameters, along with the
estimates of the correlation between the error and the endogenous regressor, rho (here 0.39),
and of the standard deviation of the structural error, sigma (here 1.41). If initial parameter
values are not supplied by the user, the coefficient estimates obtained by running OLS are
used and returned as output, while for rho and sigma the default initial values are 0 and
1. Next, the log likelihood value is returned (here −102) alongside the Akaike and Bayesian
information criteria (here AIC: 216 and BIC: 15204, respectively), which can be used for
model comparison. Last, the convergence code is returned, which is 0 if the estimation
converged and 1 otherwise (here 0).
As seen in Figure 5, copula correction produces unbiased estimates for the continuous
endogenous regressor. In the example above, the coefficient of the endogenous regressor P
is −0.99 (s.e. = 0.0037), a value very close to the true value, −1. In comparison, the OLS
estimate for P is −0.89 (s.e. = 0.008).
Case 2: Two continuous endogenous regressors The maximum likelihood approach using
copulas cannot be implemented with more than one continuous endogenous regressor because
the model is then not identified. In this scenario, the suitable method is OLS augmented with P*.
The syntax is the following:
R> data(dataCopC2)
R> resultsCC2 <- copulaCorrection(formula = y ~ X1 + X2 + P1 + P2,
endoVar = c("P1", "P2"), type = "continuous", method = "2", intercept = TRUE,
data = dataCopC2)
R> summary(resultsCC2)
Call:
copulaCorrection(formula = y ~ X1 + X2 + P1 + P2, endoVar = c("P1", "P2"),
type = "continuous", method = "2", intercept = TRUE, data = dataCopC2)
Coefficients:
           Estimate Std. Error  z-score Pr(>|z|)
Intercept   2.00843    0.04424   45.394   <2e-16 ***
X1          1.49359    0.00817  182.776   <2e-16 ***
X2         -2.99199    0.01197 -250.037   <2e-16 ***
P1         -1.00597    0.04542  -22.150   <2e-16 ***
P2          0.82241    0.05630   14.609   <2e-16 ***
PS.1        0.29188    0.06915    4.221   <2e-16 ***
PS.2        0.05393    0.07827    0.689    0.491
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 .
Residual standard error: 0.9387937 on 2493 degrees of freedom
Multiple R-squared: 0.976301    Adjusted R-squared: 0.976244
F-statistic: 17116.89 on 7 and 2493 degrees of freedom, p-value: 0
In the dataCopC2 dataset there are two endogenous regressors, P1 and P2. The additional
coefficients, PS.1 and PS.2, which are the estimates of P1* and P2*, tell us whether there exists
significant endogeneity or not. They also indicate the direction of the correlation between the
endogenous regressors and the error.
In the example above, the copula correction method produced a coefficient estimate for P1 very
close to the true value, −1.0059 (s.e. = 0.045), while the OLS coefficient for this data is
−0.82 (s.e. = 0.056). PS.1 is statistically significant and positive, implying that P1 is indeed
endogenous and, given the positive sign of PS.1, that P1 is positively correlated with the error.
For P2, the method returned a PS.2 which is not statistically significant, implying that P2
is not endogenous (which is to be expected since the correlation of P2 with the error is 0.1).
The OLS estimate for P2 is 0.86 while copula correction returned an estimate equal to 0.82.
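The augmented least squares idea used in this case can be sketched in base R for a single endogenous regressor. This is an illustration under stated assumptions and not the package's implementation: the data are simulated with a Gaussian copula between an exponentially distributed P and a normal error, the true coefficient is −1, and the empirical distribution function is rescaled by n/(n + 1) so that the normal quantile stays finite (one of several possible conventions).

```r
set.seed(123)
n   <- 5000
rho <- 0.5
z1  <- rnorm(n)
eps <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)  # normal error, correlated with z1
P   <- qexp(pnorm(z1))                        # non-normal regressor, Gaussian copula with eps
y   <- 2 - 1 * P + eps                        # true coefficient of P is -1

# OLS ignores the dependence between P and eps
b_ols <- unname(coef(lm(y ~ P))["P"])

# P* = inverse normal CDF of the empirical CDF of P
H      <- rank(P) / (n + 1)                   # rescaled empirical CDF, avoids qnorm(1) = Inf
P_star <- qnorm(H)

# OLS augmented with P*
fit    <- lm(y ~ P + P_star)
b_cop  <- unname(coef(fit)["P"])
b_star <- unname(coef(fit)["P_star"])         # sign gives the direction of the correlation

print(c(OLS = b_ols, copula = b_cop, P_star = b_star))
```

The coefficient of P_star plays the same role as PS.1 and PS.2 in the output above: a clearly positive value signals a positive correlation between the regressor and the error. The approach relies on P not being normally distributed, since otherwise P and P_star would be nearly collinear.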
Case 3: Discrete endogenous regressor In the case of discrete endogenous regressors, the
augmented least squares estimation should be used. The fact that the endogenous regressor
is discrete has to be specified by using type="discrete", as below:
R> data(dataCopD1)
R> resultsCC3 <- copulaCorrection(formula = y ~ X1+ X2 + P, endoVar = "P",
type="discrete", intercept = TRUE, data = dataCopDis)
R> summary(resultsCC3)
Call:
copulaCorrection(formula = y ~ X1 + X2 + P, endoVar = "P", type = "discrete",
intercept = TRUE, data = dataCopDis)
Coefficients:
           Estimate Low_95CI Up_95CI
Intercept   2.05432  1.44900   2.390
X1          1.50458  1.48900   1.521
X2         -2.98547 -3.00700  -2.962
P          -1.01649 -1.08300  -0.899
PStar.1     0.36598  0.10100   0.515
In the discrete case, the variable P* lies between two points of the inverse univariate normal
distribution (see Park and Gupta (2012), p. 573). Therefore, instead of standard errors,
the lower and upper bounds of the 95% confidence interval of the coefficient estimates are
provided. We see here again that the copula correction method produces an estimate for P very
close to the true value (−1.01), while the OLS coefficient for P is −0.85 (s.e. = 0.008).
The additional coefficient estimate, PStar.1, has the same interpretation as in the previous
example: it tells the direction of the correlation and whether significant endogeneity exists
or not. In this example, the coefficient of PStar.1 is statistically significant and positive,
implying a positive correlation between the endogenous regressor P and the error.
5.3. higherMomentsIV() - Higher Moments
Lewbel (1997)’s higher moments approach to endogeneity is illustrated on the synthetic
dataset dataHigherMoments that contains a response variable, y, two exogenous regressors,
X1 and X2, and an endogenous variable, P. An intercept is always considered. A set of
six instruments can be constructed from the data and they should be specified in the IIV
argument of the function. This argument can take the following values:
• g - for (Gt − Ḡ),
• gp - for (Gt − Ḡ)(Pt − P̄),
• gy - for (Gt − Ḡ)(Yt − Ȳ),
• yp - for (Yt − Ȳ)(Pt − P̄),
• p2 - for (Pt − P̄)²,
• y2 - for (Yt − Ȳ)²,
where G = G(Xt) can be either x², x³, ln(x) or 1/x and should be specified in the G argument
of the function. If left empty, the function will take by default G = x². In the example below
the internal instrument was chosen to be "yp". Additional external instruments can be added
to the model by specifying them in EIV:
R> data(dataHigherMoments)
R> resultsHM <- higherMomentsIV(formula = y ~ X1+ X2 + P, endoVar = "P",
IIV = "yp", data = dataHigherMoments)
R> summary(resultsHM)
Call:
higherMomentsIV(formula = y ~ X1+ X2 + P, endoVar = "P", IIV = "yp",
data = dataHigherMoments)
Residuals:
     Min       1Q   Median       3Q      Max
-5.54641 -1.17855  0.03467  1.14903  6.54074
Coefficients:
           Estimate Std. Error t-value Pr(>|t|)
Intercept   1.86321    0.71715   2.598  0.00943 **
X1          1.74095    0.36677   4.747 2.18e-06 ***
X2          2.87247    0.07582  37.885  < 2e-16 ***
P          -1.00747    0.03772 -26.706  < 2e-16 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 .
Residual standard error: 1.741 on 2496 degrees of freedom
Multiple R-Squared: 0.8974,  Adjusted R-squared: 0.8973
Wald test: 833.7 on 3 and 2496 DF,  p-value: < 2.2e-16
For this dataset, the OLS coefficient of the endogenous regressor is −0.866 with a standard
error of 0.005, while the higher moments estimate is very close to the true value, −1.007 with
a standard error of 0.037. However, interpreting the coefficient estimates has to be made with
caution since the higher moments approach comes with a set of very strict assumptions (see
Subsection 4.1.2). And, as seen in the comparison with the OLS in the previous section, the
higher moments estimator displays high variance.
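The six internal instruments listed in this subsection are simple functions of the centered data. The sketch below builds them in base R for one exogenous variable; the helper build_iiv() and the simulated vectors are hypothetical and only illustrate the definitions, they are not part of REndo and say nothing about whether the instruments are valid for a given dataset.

```r
# Hypothetical helper: constructs the six higher moments instruments
# g, gp, gy, yp, p2, y2 from y, p and one exogenous variable x.
build_iiv <- function(y, p, x, G = function(v) v^2) {
  g  <- G(x) - mean(G(x))   # (G_t - mean(G)), default G = x^2
  pc <- p - mean(p)         # (P_t - mean(P))
  yc <- y - mean(y)         # (Y_t - mean(Y))
  data.frame(
    g  = g,
    gp = g * pc,
    gy = g * yc,
    yp = yc * pc,
    p2 = pc^2,
    y2 = yc^2
  )
}

set.seed(3)
x <- rnorm(100)
p <- rnorm(100)
y <- 1 + x - p + rnorm(100)

iiv <- build_iiv(y, p, x)
head(iiv, 3)

# Alternative choice of G, here x^3
iiv_cubic <- build_iiv(y, p, x, G = function(v) v^3)
```

Passing a different function for G mirrors the role of the G argument described above; ln(x) and 1/x would additionally require x to be strictly positive or non-zero, respectively.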
5.4. hetErrorsIV() - Heteroskedastic Errors Method
The function hetErrorsIV() has a four-part formula specification, e.g.: y ~ P | X1 + X2 +
X3 | X1 + X2 | Z1, where: y is the response variable and P is the endogenous regressor, X1
+ X2 + X3 represent the exogenous regressors, the third part X1 + X2 specifies the exogenous
heteroskedastic variables from which the instruments are derived, while the final part Z1 is
optional, allowing the user to include additional external instrumental variables. Allowing the
user to specify additional external variables is a convenient feature of the function, since
adding external instruments increases the efficiency of the estimates. A synthetic dataset is
included in the package to show the heteroskedastic errors approach. The dataset has an
endogenous regressor, P, two normally distributed exogenous regressors, X1 and X2, and a
response variable, y. The use of the function is presented below:
R> data(dataHetIV)
R> resultsHetIV <- hetErrorsIV(y ~ P | X1 + X2
| X1 + X2, data = dataHetIV)
R> summary(resultsHetIV)
Call:
hetErrorsIV(formula = y ~ P | X1 + X2 | X1 + X2, data = dataHetIV)
Coefficients:
           Estimate Std. Error t value  Pr(>|t|)
Intercept   1.01176    0.39666  2.5507   0.01075 *
P          -0.96197    1.57911 -0.6092   0.54240
X1          1.04929    0.25403  4.1306 3.618e-05 ***
X2          0.99428    0.13951  7.1269 1.027e-12 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 .
Overidentification Test:
                J-test   P-value
Test E(g)=0: 0.5536734 0.4568206
Partial F-test Statistics for Weak IV Detection:
        P
0.4156665
Number of Observations: 2500
If the user wants to obtain robust standard errors, the function can be called with the
additional argument robust=TRUE. Similarly, clustered standard errors can be
obtained using clustervar=TRUE. The function also returns Sargan's J-test of over-identifying
restrictions and partial F-statistics for weak IV detection. If the null hypothesis of
Sargan's J-test cannot be rejected, there is no evidence against the validity of the
instruments, while a rejection of the null implies that at least one instrument is not valid.
For this example, the OLS coefficient is −0.36 with a standard error of 0.04, while the
heteroskedastic errors approach produces an estimate for P equal to −0.96, with a very large
standard error (1.58). This is to be expected, since the IIV estimator is not efficient in
small samples. However, in this dataset, the internal instrumental variable estimate is closer
to the true parameter value of the endogenous regressor. The over-identifying restrictions are
not rejected (p-value = 0.46). The partial F-statistic for P is 0.416, far below the common
rule-of-thumb value of 10, which points to weak instruments and is consistent with the large
standard error of the estimate. In cases where there is more than one endogenous
regressor, the Angrist and Pischke (2008) method for multivariate first-stage F-statistics is
employed.
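How a partial first-stage F-statistic for a single endogenous regressor can be obtained is sketched below in base R, by comparing the first-stage regression with and without the excluded instrument. The simulated variables, the helper partial_F() and the instrument strengths are assumptions for illustration; the sketch does not reproduce the package's computation or the Angrist and Pischke (2008) multivariate version.

```r
set.seed(11)
n         <- 1000
X         <- rnorm(n)               # exogenous regressor
iv_strong <- rnorm(n)               # instrument with a sizeable first-stage effect
iv_weak   <- rnorm(n)               # instrument with a negligible first-stage effect
P <- 1 + X + 0.5 * iv_strong + 0.01 * iv_weak + rnorm(n)

# Hypothetical helper: F-test of the excluded instrument in the first stage
partial_F <- function(instr) {
  restricted <- lm(P ~ X)           # first stage without the instrument
  full       <- lm(P ~ X + instr)   # first stage with the instrument
  anova(restricted, full)$F[2]      # F statistic for adding the instrument
}

F_strong <- partial_F(iv_strong)
F_weak   <- partial_F(iv_weak)
print(c(strong = F_strong, weak = F_weak))
```

A common rule of thumb treats values below 10 as a sign of weak instruments. In this design the first instrument yields a statistic far above that value, while the second typically yields one close to zero.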
5.5. multilevelIV() - Multilevel Internal Instrumental Variables
The multilevelIV() function allows the estimation of a multilevel model with up to three
levels, in the presence of endogeneity. Specifically, Kim and Frees (2007) designed an approach
that controls for endogeneity at higher levels in the data hierarchy, e.g. for a three-level model,
endogeneity can be present either at level two, level three or at both, level two and three.
The function takes as input the formula, the set of endogenous variables and the name of the
dataset. It returns the coefficient estimates obtained with fixed effects, random effects and the
GMM estimator proposed by Kim and Frees (2007), such that a comparison across models
can be done. Asymptotically, the multilevel GMM estimators share the same properties
of corresponding fixed effects estimators, but they allow the estimation of all the variables
in the model, unlike the fixed effects counterpart. To illustrate the use of the function,
we use dataMultilevelIV, which has a total of 2950 observations clustered into 1370 level-two
units, which are further clustered into 40 level-three units. Additional details are presented in
Table 4 below:
Name: dataMultilevelIV
Variables                        | Type                              | True param values
level-1: X11, X12, X13, X14      | exogenous                         | β11 = 3, β12 = 9, β13 = −2, β14 = 2
level-1: X15                     | endogenous, corr(X15, ε_cs) = 0.7 | β15 = −1
level-2: X21, X22, X23, X24      | exogenous                         | β21 = −1.5, β22 = −4, β23 = −3, β24 = 6
level-3: X31, X32, X33           | exogenous                         | β31 = 0.5, β32 = 0.1, β33 = −0.5
Table 4: Simulated dataset to exemplify multilevelIV() function. ε_cs is the level-two error.
The dataset has five level-1 regressors, X11, X12, X13, X14 and X15, where X15 is correlated
with the level two error, thus endogenous. There are four level-2 regressors, X21, X22, X23
and X24, and three level-3 regressors, X31, X32, X33, all exogenous. We estimate a three-level model with X15 assumed endogenous. Having a three-level hierarchy, multilevelIV()
returns five estimators, from the most robust to omitted variables (FE_L2), to the most efficient
(REF), i.e. lowest mean squared error:
• level-2 fixed effects (FE_L2);
• level-2 multilevel GMM (GMM_L2);
• level-3 fixed effects (FE_L3);
• level-3 multilevel GMM (GMM_L3);
• random effects estimator (REF).
The random effects estimator is efficient assuming no omitted variables, whereas the fixed
effects estimator is unbiased and asymptotically normal even in the presence of omitted
variables. Because of its efficiency, one would choose the random effects estimator if confident
that no important variables were omitted. On the contrary, the robust estimator would be
preferable if there was a concern that important variables were likely to be omitted. The
estimation result is presented below:
R> data(dataMultilevelIV)
R> formula <- y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 +
X31 + X32 + X33 +
(1 | DID) + (1 + X11 | HID)
R> endV <- dataMultilevelIV[,"X15"]
R> resultsMIV <- multilevelIV(formula = formula, endoVar = endV, data =
dataMultilevelIV)
R> summary(resultsMIV)
Call:
multilevelIV(formula = formula1, endoVar = endoVars, data = dataMultilevelIV)
           FE_L2 GMM_L2  FE_L3 GMM_L3 RandomEffects
Intercept  0.000 62.446  0.000 61.959        61.959
X11        0.000  2.978  2.953  2.953         2.953
X12        8.876  8.952  8.961  8.962         8.962
X13       -2.064 -2.019 -2.019 -2.019        -2.019
X14        2.077  2.036  2.042  2.042         2.042
X15       -1.061 -1.062 -0.537 -0.537        -0.537
X21        0.000 -1.528 -1.708 -1.685        -1.685
X22        0.000 -3.814 -3.808 -3.813        -3.813
X23        0.000 -2.798 -2.838 -2.838        -2.838
X24        0.000  5.772  5.509  5.520         5.520
X31        0.000  1.651  0.000  1.663         1.663
X32        0.000  0.507  0.000  0.528         0.528
X33        0.000  0.095  0.000  0.093         0.093
As we have simulated the data, we know that the true parameter value of the endogenous
regressor (X15) is −1. Looking at the coefficients of X15 returned by the five models, we see
that they form two clusters: one cluster is composed of the level-two fixed effects estimator and
the level-two GMM estimator (both return −1.06), while the other cluster is composed of
the other three estimators, FE_L3, GMM_L3 and REF, all three having a value of −0.537. The
bias of the last three estimators is to be expected since we have simulated the data such that
X15 is correlated with the level-two error, to which only FE_L2 and GMM_L2 are robust. After
the initial estimation of all applicable methods, the most appropriate estimator has to be
identified in the next steps.
To provide guidance for selecting the appropriate estimator, multilevelIV() performs an
omitted variable test. The results are returned by the summary() function. For example, in
a three-level setting, different estimator comparisons are possible:
1. Fixed effects versus random effects estimators: To test for omitted level-two and
level-three effects simultaneously, one compares FE_L2 to REF. The test does
not indicate the level at which omitted variables might exist.
2. Fixed effects versus GMM estimators: Once it has been established that there exist
omitted effects, but it is not certain at which level (see 1), we test for level-two omitted effects
by comparing FE_L2 versus GMM_L3. A rejection of the null hypothesis will imply omitted
variables at level two. The same is accomplished by testing FE_L2 versus GMM_L2, since
the latter is consistent only if there are no omitted effects at level two.
3. Fixed effects versus fixed effects estimators: We can test for omitted level-two
effects, while allowing for omitted level-three effects. This can be done by comparing
FE_L2 versus FE_L3 since FE_L2 is robust against both level-two and level-three omitted
effects while FE_L3 is only robust to level-three omitted variables.
In general, when testing for higher-level endogeneity in multilevel settings, one would start
by looking at the results of the omitted variable test comparing REF and FE_L2. If the null
hypothesis is rejected, this means the model suffers from omitted variables, either at level
two or level three. Next, test whether there are level-two omitted effects, since testing for
omitted level-three effects relies on the assumption that there are no level-two omitted effects.
To this end, rely on one of the following model comparisons: FE_L2 versus FE_L3 or FE_L2
versus GMM_L2. If no omitted variables at level two are found, proceed with testing for
omitted level-three effects by comparing FE_L3 versus GMM_L3 or GMM_L2 versus GMM_L3.
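The logic of comparing a robust with an efficient estimator can be sketched with the classical Hausman form of the statistic. The helper hausman_stat() and the toy coefficient vectors and covariance matrices below are hypothetical; the omitted variable test of Kim and Frees (2007) as implemented in the package may use a different variance estimator, so this is only meant to show the structure of such a test.

```r
# Hypothetical helper: classical Hausman-type statistic comparing a robust
# estimator (consistent under both hypotheses) with an estimator that is
# efficient under the null of no omitted variables.
hausman_stat <- function(b_robust, V_robust, b_efficient, V_efficient) {
  d    <- b_robust - b_efficient          # difference of the coefficient vectors
  Vd   <- V_robust - V_efficient          # variance of the difference under the null
  stat <- as.numeric(t(d) %*% solve(Vd) %*% d)
  df   <- length(d)
  list(stat = stat, df = df,
       pval = pchisq(stat, df = df, lower.tail = FALSE))
}

# Toy inputs (made up for illustration, not taken from any model output)
b_fe <- c(-1.06, 2.08)
V_fe <- diag(c(0.0025, 0.0016))
b_re <- c(-0.54, 2.04)
V_re <- diag(c(0.0004, 0.0007))

res <- hausman_stat(b_fe, V_fe, b_re, V_re)
print(res)
```

A large statistic with a small p-value, as in this toy example, means that the two estimators differ by more than sampling variation would explain, which is read as evidence that the efficient estimator is inconsistent.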
The summary() function that returns the results of the omitted variable test takes two
arguments: the name of the model object (here resultsMIV) and the estimation method (here
REF). Without a second argument, summary() displays simultaneously the coefficients
obtained with all the different estimation approaches (fixed effects, GMM, random effects). The
second argument can take the following values, depending on the model estimated (two or
three levels): REF, GMM_L2, GMM_L3, FE_L2, FE_L3. It returns the estimated coefficients
under the model specified in the second argument, together with their standard errors and
z-scores. Further, it returns the chi-squared statistic, degrees of freedom and p-value of the
omitted variable test between the focal model (here REF) and all the other possible options
(here FE_L2, GMM_L2, FE_L3 and GMM_L3).
R> summary(resultsMIV, "REF")
Call:
multilevelIV(formula = formula2, endoVar = endV2, data = dataMultilevelIV)
Coefficients:
          Coefficients ModelSE z-score Pr|>z|
Intercept       61.959   8.719   7.106  0.000
X11              2.953   0.032  92.281  0.000
X12              8.962   0.027 331.926  0.000
X13             -2.019   0.026 -77.654  0.000
X14              2.042   0.026  78.538  0.000
X15             -0.537   0.020 -26.850  0.000
X21             -1.685   0.142 -11.866  0.000
X22             -3.813   0.120 -31.775  0.000
X23             -2.838   0.068 -41.735  0.000
X24              5.520   0.256  21.562  0.000
X31              1.663   0.098  16.969  0.000
X32              0.528   0.249   2.120  0.034
X33              0.093   0.059   1.576  0.115
Omitted variable test:
     FEL2_vs_REF GMML2_vs_REF FEL3_vs_REF GMML3_vs_REF
stat     112.864      282.025       0.058            0
df             4           13           9           13
pval   1.781e-23    1.373e-52           1            1
alternative hypothesis: one model is inconsistent
In the example above, we compare the random effects (REF) with all the other estimators.
Testing REF, the most efficient estimator, against the level-two fixed effects estimator, FE_L2,
which is the most robust estimator, we are actually testing simultaneously for level-2 and
level-3 omitted effects. Since the null hypothesis is rejected with a p-value of 1.78e-23, the
test indicates severe bias in the random effects estimator. In order to test for level-two omitted
effects regardless of the presence of level-three omitted effects, we have to compare the two
fixed effects estimators, FE_L2 versus FE_L3:
R> summary(resultsMIV,"FE_L2")
Call:
multilevelIV(formula = formula2, endoVar = endV2, data = dataMultilevelIV)
Coefficients:
          Coefficients ModelSE z-score Pr|>z|
Intercept        0.000   0.000      NA     NA
X11              0.000   0.000      NA     NA
X12              8.876   0.054 164.370      0
X13             -2.064   0.047 -43.915      0
X14              2.077   0.050  41.540      0
X15             -1.061   0.053 -20.019      0
X21              0.000   0.000      NA     NA
X22              0.000   0.000      NA     NA
X23              0.000   0.000      NA     NA
X24              0.000   0.000      NA     NA
X31              0.000   0.000      NA     NA
X32              0.000   0.000      NA     NA
X33              0.000   0.000      NA     NA
Omitted variable test:
     FEL2_vs_GMML2 FEL2_vs_FEL3 FEL2_vs_GMML3
stat      32834.99     220.3656      222.1892
df               4            4            12
pval             0    1.564e-46     6.335e-47
alternative hypothesis: one model is inconsistent
The null hypothesis of no omitted level-two effects is rejected (the p-value is equal to 1.56e-46).
Therefore, it is possible to conclude that there are omitted effects at level two. This finding
is no surprise, as we simulated the dataset with the level-two error correlated with X15, which
leads to biased FE_L3 coefficients. Hence, the result of the omitted variable test between level-two
fixed effects and level-two GMM should not come as a surprise: the null hypothesis of
no omitted level-two effects is rejected (the p-value is 0). In case of wrongly assuming that an
endogenous variable is exogenous, the random effects as well as the GMM estimators will be
biased, since the former will be constructed using the wrong set of internal instrumental variables.
Consequently, it is important to compare the results of the omitted variable tests when
the variable is considered endogenous versus exogenous, as this comparison can indicate whether
the variable is indeed endogenous or not. To conclude this example, the test results provide
support that the FE_L2 estimator should be used.
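The p-values of the omitted variable test can be cross-checked by hand. The sketch below assumes that each reported statistic is referred to a chi-squared distribution with the printed degrees of freedom, and uses the statistics of the REF versus FE_L2 and FE_L2 versus FE_L3 comparisons shown in the two outputs above. The variable names are chosen here for illustration.

```r
# Upper-tail chi-squared probabilities for two of the reported test statistics.
# Statistics and degrees of freedom are copied from the summary() outputs.
p_ref_vs_fel2  <- pchisq(112.864,  df = 4, lower.tail = FALSE)
p_fel2_vs_fel3 <- pchisq(220.3656, df = 4, lower.tail = FALSE)

# Both values should land close to the printed p-values
# (1.781e-23 and 1.564e-46); small differences come from rounding.
signif(c(REF_vs_FEL2 = p_ref_vs_fel2, FEL2_vs_FEL3 = p_fel2_vs_fel3), 4)
```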
6. Conclusion
Endogeneity, one of the challenges of empirical research, can be addressed using different
approaches, from instrumental variables to structural models. However, there exist applications in which the economic theory or the intuition do not help in obtaining an additional
observable variable that can be used as external instrumental variable. Given the properties
this variable has to have in order to be suitable as instrument, finding one is far from easy.
Therefore, methods that treat endogeneity without the need of external regressors have been
proposed by researchers coming from different disciplines. These techniques of estimation are
called “instrument-free” or “internal instrumental variables” methods.
REndo integrates the main instrument-free models that allow correcting for endogeneity. The
motivation behind the development of REndo is to come to the aid of researchers from a
broad spectrum of disciplines, from sociology, political science to economics and marketing
in addressing the endogeneity problem when no strong, valid instrumental variables can be
found. Moreover, by offering the multilevelIV() function REndo is the first R package
that offers a solution to the endogeneity problem in a multilevel setting. The package also
offers seven datasets on which the capabilities and usefulness of the package are illustrated.
Future developments could include a Bayesian approach to the latent instrumental variable
method, which would allow incorporating additional explanatory variables and even additional
endogenous regressors, as described in Ebbes et al. (2009).
References
Angrist J, Pischke J (2008). Mostly Harmless Econometrics: An Empiricist’s Companion.
Princeton: Princeton University Press.
Antonakis J, Bendahan S, Jacquart P, Lalive R (2014). “Causality and Endogeneity: Problems
and Solutions.” In NYOU Press (ed.), The Oxford Handbook of Leadership and Organizations, pp. 93–117.
Berry S, Levinsohn J, Pakes A (1995). “Automobile Prices in Market Equilibrium.” Econometrica, 63(4), 841–890.
Berry ST (1994). “Estimating Discrete-Choice Models of Product Differentiation.” The RAND
Journal of Economics, 25(2), 242–262.
Chaussé P (2010). “Computing Generalized Method of Moments and Generalized Empirical
Likelihood with R.” Journal of Statistical Software, 34(11), 1–35. URL http://www.jstatsoft.org/v34/i11/.
Draganska M, Jain D (2004). “A Likelihood Approach to Estimating Market Equilibrium
Models.” Management Science, 50(5), 605–616.
Ebbes P (2004). “Latent Instrumental Variables - A New Approach to Solve for Endogeneity.”
Groningen: University of Groningen.
Ebbes P, Wedel M, Boeckenholt U (2009). “Frugal IV Alternatives to Identify the Parameter
for an Endogenous Regressor.” Journal of Applied Econometrics, 24(3), 446–468.
Ebbes P, Wedel M, Boeckenholt U, Steerneman A (2005). “Solving and Testing for Regressor-Error
(In)Dependence When no Instrumental Variables Are Available: With New Evidence
for the Effect of Education on Income.” Quantitative Marketing and Economics, 3(4), 365–392.
Epanechnikov V (1969). “Nonparametric Estimation of a Multidimensional Probability Density.” Teoriya veroyatnostei i ee primeneniya, 14(1), 156–161.
Fox J, Nie Z, Byrnes J (2016). “sem: Structural Equation Models.” URL https://cran.r-project.org/web/packages/sem.
Hansen LP (1982). “Large Sample Properties of Generalized Method of Moments Estimators.”
Econometrica, 50(4), 1029–1054.
Hausman J (1978). “Specification Tests in Econometrics.” Econometrica, 46(6), 1251–1271.
Henningsen A, Hamann Jeff D (2015). “systemfit: Estimating Systems of Simultaneous Equations.” URL https://cran.r-project.org/web/packages/systemfit.
Hogan V, Rigobon R (2002). “Using Unobserved Supply Shocks to Estimate the Returns to
Education.” Working Paper 9145, Cambridge: National Bureau of Economic Research.
Kim S, Frees F (2007). “Multilevel Modeling with Correlated Effects.” Psychometrika, 72(4),
505–533.
Kleiber C, Zeileis A (2015). “AER: Applied Econometrics with R.” URL https://cran.r-project.org/web/packages/AER/.
Lewbel A (1997). “Constructing Instruments for Regressions with Measurement Error When
No Additional Data are Available, With an Application to Patents and R and D.” Econometrica, 65(5), 1201–1213.
Lewbel A (2012). “Using Heteroscedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models.” Journal of Business and Economic Statistics, 30(1), 67–80.
Longford N (1995). “Random Coefficient Models.”
Marra G, Radice R (2011). “SemiParBIVProbit: Semi-parametric Bivariate Probit Modelling.
R package version 2.0-4.1.” URL https://cran.r-project.org/web/packages/SemiParBIVProbit.
Park S, Gupta S (2012). “Handling Endogenous Regressors by Joint Estimation Using Copulas.” Marketing Science, 31(4), 567–586.
Petrin A, Train K (2010). “A Control Function Approach to Endogeneity in Consumer Choice
Models.” Journal of Marketing Research, 47(1), 3–13.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.r-project.org/.
Raudenbush S, Bryk A (1986). “Hierarchical Model for Studying School Effects.” Sociology
of Education, 59(1), 1–17.
Rigobon R (2003). “Identification Through Heteroskedasticity.” Review of Economics and
Statistics, 85(4), 777–792.
Ruud P (2000). An Introduction to Classical Econometric Theory. 1st edition. New York:
Oxford University Press.
Silverman B (1986). Density Estimation for Statistics and Data Analysis. CRC Monographs
on Statistics and Applied Probability. London: Chapman & Hall.
Stock JH, Yogo M (2002). “Testing for Weak Instruments in Linear IV Regression.” NBER
Technical Working Paper No 284.
Theil H (1958). Economic Forecasts and Policy. Contributions to economic analysis no.15.
Amsterdam: North-Holland.
Van de Ven WPM, Van Praag BMS (1981). “The Demand for Deductibles in Private Health
Insurance: A Probit Model with Sample Selection.” Journal of Econometrics, 17(2), 229–252.
Wright S (1928). The Tariff on Animal and Vegetable Oils. New York: MacMillan.
A. Weak vs. Strong Instrumental Variables
Table A.1 presents the performance of OLS in comparison with two-stage least squares, one
of the most frequently used instrumental variable methods, differentiating between strong and
weak instruments and between small and large sample sizes. The sample size is either
500 or 2500 observations, and the correlation (ρ) between the instrument and
the endogenous regressor takes two values, 0.10 and 0.40. The results are obtained by running
a simulation over 1000 random samples, where the true parameter value is equal to −1 and
the correlation between the omitted variable and the endogenous regressor takes two values,
0.1 and 0.3.
θ     True value   OLS      IV (ρ = 0.10)   IV (ρ = 0.40)   Sample size
0.1   -1           -0.918   -0.884          -1.001          500
0.1   -1           -0.920   -0.998          -1.001          2500
0.3   -1           -0.774   -0.888          -1.000          500
0.3   -1           -0.776   -0.999          -1.002          2500
Table A.1: OLS vs IV: θ = correlation between endogenous variable and the error, ρ = correlation
between the IV and the endogenous variable.
The bias of the estimates, both of OLS and of IV, depends heavily on the sample size
but also on how strongly the regressor correlates with the error and with the IV. In the first row
of Table A.1 we can see that, with a relatively small sample size and a low correlation
of the regressor with the error, a weak instrument produces a more biased estimate than
OLS (-0.884 versus -0.918, against a true value of -1). Once the sample size increases, the
weak instrument no longer produces a noticeable bias.
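The comparison can be explored with a short simulation. The sketch below is not the design used for Table A.1: the data-generating process, the function simulate_once() and the number of replications are assumptions made here, so the numbers will differ from the table. Under this assumed process the OLS estimate converges to the true value plus θ. Because the just-identified IV estimator has heavy tails when the instrument is weak, the sketch summarizes replications by the median rather than the mean.

```r
set.seed(123)

# Assumed data-generating process (not the paper's design):
# x has unit variance, corr(x, z) = rho and corr(x, e) = theta.
simulate_once <- function(n, rho, theta, beta = -1) {
  z <- rnorm(n)                       # instrument
  e <- rnorm(n)                       # structural error
  v <- rnorm(n)                       # independent noise in x
  x <- rho * z + theta * e + sqrt(1 - rho^2 - theta^2) * v
  y <- beta * x + e
  c(ols = cov(x, y) / var(x),         # OLS slope
    iv  = cov(z, y) / cov(z, x))      # just-identified IV (2SLS) slope
}

# Median estimates over 200 replications for each design cell.
design <- expand.grid(n = c(500, 2500), rho = c(0.10, 0.40), theta = c(0.1, 0.3))
results <- t(sapply(seq_len(nrow(design)), function(i) {
  d <- design[i, ]
  draws <- replicate(200, simulate_once(d$n, d$rho, d$theta))
  apply(draws, 1, median)
}))
cbind(design, round(results, 3))
```

With a single instrument and a single endogenous regressor, two-stage least squares reduces to the ratio of covariances used above, which keeps the sketch free of package dependencies.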
B. Bias under Endogeneity
The bias induced by an endogenous regressor on the coefficient of an exogenous variable depends,
besides the correlation between the two variables, on the correlation between the model’s error and the
endogenous variable. To illustrate this, Figure B.1 as well as Table B.2 present the
coefficient estimates for the model presented in Figure 1. The correlation between the error
and P takes three values: 0.1, 0.3 and 0.5. For each of these three values, the correlation
between the two covariates, P and X, is also set to 0.1, 0.3 and 0.5. The
estimates presented are the mean coefficient estimates over 1000 simulated samples, each
of 2500 observations.
[Figure B.1, two panels. Panel A: coefficient estimates for beta1 (true b1 value: −1). Panel B: coefficient estimates for beta2 (true b2 value: 1). Horizontal axis in both panels: correlation between P and X (0.1 to 0.5); one line per value of Corr(P, error): 0.1, 0.3, 0.5.]
Figure B.1: Endogeneity and the Bias of Estimates for β1 and β2: bias at different levels of correlation
between the endogenous regressor (P) and the error
Panel A presents the coefficient estimates for the intercept, β1, while Panel B presents the
coefficient estimates for the slope, β2. In both cases the
bias grows as we move from lower to higher levels of the correlation between the error and
the endogenous regressor. In Panel B, for the same correlation between the error and the
endogenous regressor (e.g. blue line), the increase in bias is steeper than in Panel A. For
example, when the correlation between the error and P is 0.5, the bias of the slope estimate
increases from 0.03 to 0.10 to 0.19 as the correlation between P and X rises from 0.1 to 0.3 to 0.5. Similar
patterns occur for different levels of correlation between the error and P, either for β1 or β2,
as can be observed in Table B.2.
                          corr(P, X)
      corr(P, error)   0.1     0.3     0.5
β1    0.1              0.082   0.089   0.102
      0.3              0.226   0.239   0.274
      0.5              0.333   0.351   0.396
β2    0.1              0.008   0.026   0.049
      0.3              0.023   0.072   0.137
      0.5              0.034   0.105   0.198
Table B.2: Estimates’ Bias: β1 - intercept, β2 - slope
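Why the coefficient of the exogenous variable X is biased at all can be seen from the probability limit of OLS. The sketch below assumes that P, X and the error are standardized to unit variance and that X is uncorrelated with the error; the helper ols_asymptotic_bias() is a name introduced here for illustration. Since this setup need not match the simulation behind Table B.2, the numbers are not expected to reproduce the table, only the pattern that the magnitude of both biases grows with corr(P, error) and with corr(P, X).

```r
# Hypothetical helper: asymptotic OLS bias for y = b1 * P + b2 * X + error,
# assuming unit variances and corr(X, error) = 0.
ols_asymptotic_bias <- function(corr_p_error, corr_p_x) {
  sigma_xx <- matrix(c(1, corr_p_x, corr_p_x, 1), nrow = 2)  # Var of (P, X)
  sigma_xe <- c(corr_p_error, 0)                              # Cov of (P, X) with error
  bias <- solve(sigma_xx, sigma_xe)                           # plim(b_OLS) - b
  setNames(bias, c("b1", "b2"))
}

# Absolute asymptotic bias over the same grid of correlations as in the table.
grid <- expand.grid(corr_p_x = c(0.1, 0.3, 0.5), corr_p_error = c(0.1, 0.3, 0.5))
bias <- t(mapply(ols_asymptotic_bias, grid$corr_p_error, grid$corr_p_x))
cbind(grid, round(abs(bias), 3))
```

Under these assumptions the bias of the X coefficient is proportional to the product of the two correlations, which is why it vanishes when P and X are uncorrelated.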
C. Gaussian Copula
A copula is a function that maps several marginal cumulative distribution functions (CDFs) into their
joint CDF. Here, the CDF of x1 is a t-distribution and the CDF of x2 is a normal distribution.
The function H() is the joint cumulative distribution function.
[Figure C.2, three panels. Cumulative distribution function of x1, F(x1); cumulative distribution function of x2, G(x2); joint distribution H(x1,x2) = C(F(x1),G(x2)).]
Figure C.2: Copula example
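The construction in Figure C.2 can be imitated in a few lines of base R. The sketch below is an illustration under assumptions made here: the copula correlation (0.6), the degrees of freedom of the t margin (3) and the sample size are arbitrary choices and are not taken from the figure.

```r
set.seed(1)
n   <- 5000
rho <- 0.6                                   # assumed copula correlation

# Step 1: correlated standard normals via the Cholesky factor of the
# correlation matrix.
R <- matrix(c(1, rho, rho, 1), nrow = 2)
z <- matrix(rnorm(2 * n), ncol = 2) %*% chol(R)

# Step 2: the normal CDF turns them into uniforms that carry the
# dependence structure of the Gaussian copula.
u <- pnorm(z)

# Step 3: inverse marginal CDFs impose the desired margins.
x1 <- qt(u[, 1], df = 3)                     # t-distributed margin, F(x1)
x2 <- qnorm(u[, 2])                          # normal margin, G(x2)

# Empirical joint CDF H(x1, x2) evaluated at the two medians.
H_hat <- mean(x1 <= 0 & x2 <= 0)
c(spearman = cor(x1, x2, method = "spearman"), H_at_medians = H_hat)
```

The margins can be changed freely in step 3 without touching the dependence structure created in steps 1 and 2, which is the property the copula approach relies on.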
Affiliation:
Raluca Gui
Department of Business Administration
URPP Social Networks
University of Zurich
8050, Zurich, Switzerland
E-mail: raluca.gui@business.uzh.ch
Url: http://www.socialnetworks.uzh.ch/en/aboutus.html