See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/322698633 REndo: An R Package to Address Endogeneity Without External Instrumental Variables Article · January 2018 CITATIONS READS 0 582 3 authors, including: Raluca Gui Markus Meierer University of Zurich University of Zurich 2 PUBLICATIONS 0 CITATIONS 9 PUBLICATIONS 47 CITATIONS SEE PROFILE SEE PROFILE Some of the authors of this publication are also working on these related projects: REndo: An R Package to Address Endogeneity Without External Instrumental Variables View project All content following this page was uploaded by Raluca Gui on 25 January 2018. The user has requested enhancement of the downloaded file. REndo: An R Package to Address Endogeneity Without External Instrumental Variables Raluca Gui Markus Meierer René Algesheimer University of Zurich University of Zurich University of Zurich Abstract Endogeneity arises when the independence assumption between an explanatory variable and the error in a statistical model is violated. Among its most common causes are omitted variable bias (e.g. like ability in the returns to education estimation) measurement error (e.g. survey response bias) or simultaneity (e.g. advertising and sales). Instrumental variable estimation is a common treatment when endogeneity is of concern. However valid, strong external instruments are difficult to find. Consequently, statistical methods to correct for endogeneity without external instruments have been advanced and they are generically called internal instrumental variable models (IIV). We propose the first R (R Core Team 2013) package that implements the following five instrument-free methods, all assuming a continuous dependent variable: latent instrumental variables approach (Ebbes, Wedel, Boeckenholt, and Steerneman 2005), higher moments estimation (Lewbel 1997), heteroskedastic error approach (Lewbel 2012), joint estimation using copula (Park and Gupta 2012) and multilevel GMM (Kim and Frees 2007). Package usage is illustrated on several simulated datasets. Keywords: endogeneity, multilevel, internal instrumental variables. 2 REndo - Gui, Meierer and Algesheimer 1. Introduction In the absence of data based on a randomized experiment, endogeneity is always a concern. It implies that the independence assumption between at least one regressor and the error term is not satisfied, leading to biased and inconsistent results. The causes of endogeneity are manifold and include response bias in surveys, omission of important explanatory variables, or simultaneity between explanatory and response variables (Antonakis, Bendahan, Jacquart, and Lalive 2014; Angrist and Pischke 2008). A ”devilishly clever” solution to cope with confounders when randomization is not possible is to find additional variables, the so called ”instrumental variables” (Antonakis et al. 2014; Theil 1958). These variables are correlated with the suspected endogenous regressor but not correlated with the structural error. For example, when estimating the demand for bread, the price variable is endogenous since price and demand are jointly determined in the market, so we have a simultaneity problem. A strong (see A) instrumental variable would be the number of rainy days in the year which influences the wheat production and consequently the price of bread, without having any effect on the demand for bread. The difficulty of finding a good instrument is well known. Paradoxically, the stronger the correlation between the instrument and the regressor, the more difficult it is to defend its lack of correlation with the error (Ebbes et al. 2005; Lewbel 1997). Sometimes a suitable instrumental variable could be indicated by the data generating process or by the cause of endogeneity. However, it is not unfrequent to fail finding a strong instrumental variable. In this case, an ‘instrument free ’or ‘internal instrumental variable’(IIV) model could be used. These methods solve the endogeneity problem without the need of external instruments (further details regarding these methods can be found in the following sections). REndo is the first package to encompass the most recent developments in IIV methods with continuous dependent variables. It implements the latent instrumental variable approach (Ebbes et al. 2005), the joint estimation using copula (Park and Gupta 2012), the higher moments method (Lewbel 1997) and the heteroskedastic error approach (Lewbel 2012). To Gui, Meierer, Algesheimer 3 model hierarchical data such as students nested within classrooms, nested within schools, REndo includes the multilevel GMM estimation proposed by Kim and Frees (2007). Section 2 presents in more detail the endogeneity problem and its consequences on the estimates. Section 3 presents the existing R (R Core Team 2013) packages that address endogeneity making use of external instruments, followed by section 4 describing the underlying idea of internal instrumental variables alongside the five IIV methods correcting for endogeneity. Section 5 describes the implementation of these IIV models in REndo and their usage on various simulated datasets, comparing the results with the ones obtained using ordinary least squares. The last section presents future possible improvements and concludes. 2. Endogeneity Researchers from a wide range of disciplines, from political science, finance, management, marketing or education research, are interested in causal relationships. For instance, does institutional quality explain the variation in the economic development of different countries? Does CEO’s compensation depend on firm size? Does an increase in advertising expenditure lead to an increase in sales? Does repeating a class lead to better test scores? In order to be able to answer such questions, there are two possible options: run a randomized controlled experiment or use observational data. Since in many instances randomized experiments are too expensive, focus on small sample sizes or are seen as unethical, researchers rely on observational data. The problem with such data is the threat of obtaining estimates that are not consistent, meaning that they will not converge to the true population parameter as sample size increases. This threat occurs when: 1. important variables are omitted from the model; 2. one or more explanatory variables are measured with error; 3. there is simultaneity between the response variable and one of the covariates; 4. the sample is biased due to self-selection; 4 REndo - Gui, Meierer and Algesheimer 5. a lagged response variable is included as covariate. Ruud (2000) showed that (2)-(5) can be viewed as a special case of (1). Nonetheless, in all these instances the error term of the model is correlated with one of the covariates. This situation is known in the literature as endogeneity . Figure 1 depicts an endogeneity problem: regressor P is endogenous due to its correlation with the error , while X is an exogenous covariate, since its correlation with the error is zero. 𝜓2≠0 P 𝛽1 𝜓1 ! y 𝛽2 X 𝜓3=0 Condition β1 β2 Explanation ψ1 = 0 Inconsistent Consistent P correlates with (ψ2 6= 0) thus β1 is inconsistent. β2 is consistent since X is uncorrelated with both P (ψ1 = 0) and (ψ3 = 0). Inonsistent P correlates with (ψ2 6= 0) thus β1 is inconsistent. Although X is uncorrelated with (ψ3 = 0), β2 is inconsistent since it is affected by the bias in P through X ’s correlation with P (ψ1 6= 0), although ψ3 still equals zero. ψ1 6= 0 Inconsistent Figure 1: Endogeneity causes inconsistent estimates Figure 2 shows how the bias of the endogenous regressor’ s estimates increases as the correlation between P and the error increases. At low correlation (0.1), the bias is 0.11 (as seen in Gui, Meierer, Algesheimer 5 0.5 b1 Estimates Bias 0.4 0.3 0.2 0.1 0.0 0.1 0.3 0.5 Correlation between P and Error Figure 2: Instrumental variables solves the inconsistency of estimates problem cause by endogeneity the chart). But as the correlation between P and the error increases to 0.3 and then to 0.5, so does the bias: it increases from 0.24 to 0.34. Sometimes the cause of endogeneity or the data at hand can give clues on how to handle the problem. For example, if endogeneity arises from time-invariant sources, applying fixed-effects estimation on a panel dataset eliminates the omitted variable problem. For endogeneity caused by measurement error, autoregressive models or simultaneous equation models could offer a solution, (Ebbes 2004). Structural models that estimate demand-supply models (Draganska and Jain 2004; Berry, Levinsohn, and Pakes 1995; Berry 1994) are yet another alternative to deal with endogeneity. However, the most frequently used methods in addressing endogeneity are instrumental variables (Theil 1958; Wright 1928). The main idea behind these approaches is to focus on the variations in the endogenous variable that are uncorrelated with the error term and disregard the variations that bias the ordinary-least squares coefficients. This is possible by finding an additional variable, called external instrument, such that the endogenous covariate can be separated into two parts: the instrumental variable that a.) should not be correlated with the structural error and b.) should be correlated with the endogenous regressor; and the other part, correlated with the structural error of the model. Section 3 below exposes more in detail these methods and presents the R packages that are currently implementing IV methods. 6 REndo - Gui, Meierer and Algesheimer 3. External IV Methods The concept of instrumental variables was first derived by Wright (1928). He observed that ”success with this method depends on success in discovering factors of the type A and B” where A and B refer to instrumental variables. Specifically, an ”instrumental variable” (IV) is defined as a variable Z (equation 2) that is correlated with the explanatory variable P and uncorrelated with the structural error, , in equation 1: Y = βP + , (1) P = γZ + ν, (2) where • The error term stands for all exogenous factors that affect Y when P is held constant. • The instrument Z should be independent of . • The instrument Z should not affect Y when P is held constant (exclusion restriction). • The instrument Z should not be independent of P. Figure 3 gives a graphical representation of the notions above. Today, researchers have a variety of instrumental variable models to choose from, depending on the research question, data type (cross-sectional vs. panel, single-level vs. multilevel), and on the number of available external instrumental variables. Examples are: two-stage and three-stage least square estimations as well as generalized method of moments. A short presentation of the most popular R packages for external instrumental variables estimation is given below. • The simplest and most commonly used method in tackling endogeneity in linear single level models is the two-stage least squares (2SLS) approach proposed by Theil (1958). In the first stage, each endogenous variable is regressed on all the exogenous variables Gui, Meierer, Algesheimer 7 𝜓2≠0 P 𝛽1 𝜓1≠0 𝜀 y 𝜓3≠0 Z !" = 0 Figure 3: The instrumental variable z solves the inconsistency of estimates problem caused by endogeneity in the model, both exogenous covariates and the excluded instruments. In the second stage, the regression of interest is estimated as usual via ordinary least squares, except that each endogenous covariate is replaced with the predicted values from the first stage. In R, two-stage least squares can be applied using ivreg() function in the AER (Kleiber and Zeileis 2015) package as well as using tsls() function in the sem (Fox, Nie, and Byrnes 2016) package. • Another widely used method for correcting endogeneity is the generalized method of moments (GMM) proposed by Hansen (1982). GMM is a class of estimators which are constructed from exploiting the sample moment counterparts of population moment conditions of the data generating model. When the system is over-identified and the sample size is large, GMM is more efficient than two-stage least squares method. It can be implemented in R using gmm() function in the package with the same name, gmm (Chaussé 2010). • A package that offers the possibility of addressing endogeneity in a system of linear equations is systemfit (Henningsen and Hamann 2015). The available methods for 8 REndo - Gui, Meierer and Algesheimer estimation are 2SLS, weighted 2SLS and 3SLS methods. A useful feature of the package is the possibility of specifying either the same instrumental variables for all equations or different ones for each equation. Through its function nlsystemfit(), it also offers the possibility to estimate systems of non-linear equations. • Package plm offers the possibility of estimating random and fixed effects for static linear panel data, variable coefficients models and generalized method of moments for dynamic models. For all the models, the package also offers the possibility of addressing endogeneity using external instrumental variables. • SemiParBIVProbit (Marra and Radice 2011) allows fitting a sample selection binary response model. Being able to handle the possible presence of nonlinear relationships between the binary outcome and the continuous covariates, SemiParBIVProbit (Marra and Radice 2011) offers more flexibility than previous methods, such as the one proposed by Van de Ven and Van Praag (1981). Table 1 provides a short overview of the methods tackling endogeneity using external IVs already implemented in R: Approach Level DV Binary No. endog. regres multiple Level endog.var. 2SLS Cont. x 3SLS x single x GMM x multiple x Copula splines x multiple x x Cont. x Function (Pkg) Discrete ivreg(AER), tsls(sem), systemfit(systemfit), plm(plm) systemfit(systemfit) x gmm(gmm) SemiParBIVProbit (SemiParBIVProbit) Table 1: External IV Methods In real world applications we cannot however detect how big is the endogeneity problem since, in fact, we cannot test how large the correlation between the endogenous regressor and the error is. Thus, for an unbiased and consistent estimate it is important to find a very good, Gui, Meierer, Algesheimer 9 ”strong” instrumental variable (see Appendix A). Stock and Yogo (2002) were the first to point to the importance of the quality of the instrumental variable used in the IV estimation. They were the first to differentiate instrumental variables according to their correlation with the endogenous regressor, into weak (low correlation) and strong (high correlation) instruments (Stock and Yogo 2002). Considering equation 2, Stock and Yogo (2002) constructed the ”concentration” parameter, µ2 : µ2 = γ 0 Z 0 Zγ/σν (3) and, depending on the number of instruments, proposed thresholds for the concentration parameter for when the instruments are considered weak (Stock and Yogo 2002). Depending on strength of the instrumental variables and on the sample size, the performance of external instrumental variable methods compared to OLS can vary significantly (see Appendix 1 - Table A.1 for a comparison of OLS and two stage least squares method with weak and strong instrument). Finding a suitable instrumental variable is far from easy and is the reason why researchers have proposed alternative ”instrument - free” models as the ones in REndo, detailed below. 4. Internal IV Methods Internal instrumental variable methods have been proposed for cases when no observable variable can be found that satisfies the properties of a strong instrumental variable. The identification strategy of these methods is conditional on distributional assumptions of the endogenous regressors and of the error term. For example, Ebbes et al. (2005) assume the distribution of the endogenous variable to be discrete, Lewbel (1997), Lewbel (2012) and Park and Gupta (2012) take it skewed, while Rigobon (2003) and Hogan and Rigobon (2002) work with a heteroskedastic distribution. Among the existing single-level instrument-free methods, the current paper implements the latent instrumental variables approach proposed by Ebbes et al. (2005), the higher moments estimation proposed by Lewbel (1997), the heteroskedastic errors approach1 (Lewbel 2012) 1 This method has also been implemented in the package ivlewbel. 10 REndo - Gui, Meierer and Algesheimer and the joint estimation using Gaussian copula (Park and Gupta 2012). These methods have been chosen to be included in the package since they are the internal instrumental variable methods with the highest usage frequency among researchers in social sciences. There are many instances in which the data have a hierarchical structure, for example students clustered in classes and in schools. Even longitudinal data can be seen as a series of repeated measurements nested within individuals. For such data structures multilevel models have been developed (Longford 1995; Raudenbush and Bryk 1986). These models are regression methods that recognize the existence of data hierarchies by allowing for residual components at each level in the hierarchy. In these models endogeneity can have two sources: either the regressors are correlated with the random components, or they are correlated with the structural error of the model at the lowest level (level-1 dependence). While theoretically, methods such as two-stage-least squares, weighted two-stage-least squares or generalized methods of moments could be used in a multilevel setting to deal with both level-1 and level-2 endogeneity, no such implementations are available in R. REndo is the first R package that implements a method that tackles endogeneity in a multilevel setting by implementing the multilevel generalized method of moments method proposed by Kim and Frees (2007). Table 2 gives an overview of the instrument-free methods present in REndo, emphasizing the assumptions of each of the approaches. Four of the methods apply to single level data, while the approach of Kim and Frees (2007) addresses endogeneity in multilevel models. All approaches allow for more than one endogenous regressor, the exception being the latent instrumental variable approach, which also does not allow to include additional explanatory variables besides the endogenous variable. The response variable is assumed to be continuous in all methods. Multilevel GMM Method Kim and Frees (2007) Heteroskedastic Errors Method Lewbel (2012) Higher Moments Method Lewbel (1997) Copula Correction Method Park and Gupta (2012) Latent Instrumental Variable Method Ebbes et al. (2005) Method - Pt 6= N (·, ·); - t ∼ N (0, σ2 ); corr(Zt , t )=0; - Zt discrete with at least 2 groups with different means. one endogenous regressor; no additional exogenous regressors; ML estimation; single level model. multiple endogenous regressors; additional exogenous regressors allowed; ML estimation; single-level model. multiple endogenous regressors; additional exogenous regressors allowed; TSLS estimation; single-level model. multiple endogenous regressors; additional exogenous regressors allowed; TSLS estimation; single-level model. multiple endogenous regressors; additional exogenous regressors allowed; GMM estimation; multilevel model. - - cov(Z, ν 2 ) 6= 0; E(Xt ) = 0, E(Xνt ) = 0; 0 E(XX ) - non-singular. Table 2: Internal Instrumental Variables Models - Model: Yij = β0j + βij Xij + ij β0j = γ00 + γ01 Wj + u0j βij = γ10 - ij ∼ N (0, σ2 ), u0j ∼ N (0, σu2 ) - Cov(Xij , ij , Wj , u0j ) = 0 - - Zt skewed distribution; - E(t ) = 0, E(νt ) = 0; - Third moment of the data exists. - Pt 6= N (·, ·), Pt 6= bimodal ; - Pt can be discrete, but not Bernoulli; - t ∼ N (0, σ2 ); Assumptions Short description Gui, Meierer, Algesheimer 12 REndo - Gui, Meierer and Algesheimer 4.1. Internal IV methods for non-hierarchical data The four internal instrumental variable methods presented in this section share the same underlying model presented in equations 4 and 5, while the specific characteristics of each method are discussed in the subsequent sections. Let’s consider the model: Yt = β0 + β1 Pt + β2 Xt + t (4) where t = 1, .., T indexes either time or cross-sectional units, Yt is a 1 x 1 response variable, Xt is a k x n exogenous regressor, Pt is a k x 1 continuous endogenous regressor, t is a structural error term with mean zero and E(2 ) = σ2 , α and β are model parameters. The endogeneity problem arises from the correlation of Pt and t . As such: Pt = γZt + νt (5) where Zt is a l x 1 vector of internal instrumental variables, and νt is a random error with mean zero, E(ν 2 ) = σν2 and E(ν) = σν . Z is assumed to be stochastic with distribution G and ν is assumed to have density h(·). The latent instrumental variables and the higher moments models assume Z to be uncorrelated with the structural error, which is similar to the ”exclusion restriction” assumption for observed instrumental variables methods. Moreover, Z is also assumed unobserved. Therefore, Z and ν cannot be identified without distributional assumptions. The distributions of Z and ν should be specified such that two conditions are met: (1) endogeneity of P is corrected, and (2) the distribution of P is empirically close to the integral that expresses the amount of overlap of Z as it is shifted over ν (= the convolution between Z and ν). When the density h(·) is chosen to be normal, then G could not be also normal because the parameters cannot be identified (Ebbes et al. 2005). The ensuing subsections follow a general structure, where we present the underlying idea of each method, followed by its specific characteristics and in the end we present the particular assumptions and weaknesses of the method. Gui, Meierer, Algesheimer 13 Latent Instrumental Variables Method Ebbes et al. (2005) propose the latent instrumental variables (LIV) model as exposed in equations 4 and 5, with both errors being normally distributed. LIV does not accept additional covariates besides the endogenous regressor, P , whose distribution has to be different than normal. The novelty of the approach is that the internal instrumental variables Zt are assumed unobserved, discrete and exogenous, with an unknown number of groups m, while γ is a vector of group means. The method accepts just one endogenous regressor. Identification of the parameters relies on the distributional assumptions of the latent instruments as well as that of the endogenous regressor, Pt . Specifically, the endogenous regressor should have a non-normal distribution while the unobserved instruments, Z, should be discrete and have at least two groups with different means (Ebbes, Wedel, and Boeckenholt 2009). A continuous distribution for the instruments leads to an unidentified model, while a normal distribution of the endogenous regressor gives rise to inefficient estimates. The LIV model in REndo assumes that the latent instrumental variable has two categories. Departure from this assumption, with a true instrument with eight categories or more, has been proven to greatly affect the efficiency of the estimates. Figure 4 presents the performance of the latent instrumental variable method in comparison with ordinary least squares. While the OLS estimate is biased but consistent, the LIV estimate is unbiased but has larger standard deviation. The mean and variance of the OLS estimate are −0.59 and 0.0002, while the mean and variance of the LIV estimate are −0.88 and 0.20 respectively. For a more detailed overview of the method see Ebbes et al. (2005). Joint Estimation Using Copula Method Park and Gupta (2012) propose a method that allows for the joint estimation of the continuous endogenous regressor and the error term using Gaussian copulas. A copula is a function that maps several conditional distribution functions (CDF) into their joint CDF (see Appendix 2 - Figure C.2). The underlying idea of the joint estimation method is that using information contained in 14 REndo - Gui, Meierer and Algesheimer 20 Method 15 Density LIV OLS 10 True value −1 5 0 −2 −1 0 1 2 Endogenous Regressor Figure 4: Latent Instrumental Variable vs OLS: coefficient estimates were obtained over 1000 simulated samples, each of size 2500; true parameter value is −1. the observed data, one selects marginal distributions for the endogenous regressor and the structural error term, respectively. Then, the copula model enables the construction of a flexible multivariate joint distribution allowing a wide range of correlations between the two marginals. In equation 4, the error t is assumed to have a normal marginal distribution, while the marginal distribution of the endogenous regressor Pt is obtained using the Epanechnikov kernel density estimator (Epanechnikov 1969), as below: T ĥ(p) = 1 X K T ·b t=1 p − Pt b (6) where Pt is the endogenous regressor, K(x) = 0.75 · (1 − x2 ) · I(kxk <= 1) and the bandwidth b is the one proposed by Silverman (1986), and is equal to b = 0.9 · T −1/5 · min(s, IQR/1.34). Gui, Meierer, Algesheimer 15 IQR is the inter-quartile range while s is the data sample standard deviation and T is the number of time periods observed in the data. After obtaining the joint distribution between the error term and the continuous endogenous regressor, the model parameters are estimated using maximum likelihood estimation. With more than one continuous endogenous regressor or when the endogenous regressor is discrete, an alternative method to the estimation using Gaussian copula should be applied, which is similar to the control function approach (Petrin and Train 2010). The idea is to run OLS estimation where to the original explanatory variables in equation 4 another regressor is added, namely Pt∗ = Φ−1 (H(Pt )), where H(Pt ) is the marginal distribution of the endogenous regressor, P . Including this generated regressor solves the correlation between the endogenous regressor and the structural error in equation 4, OLS providing consistent estimates. 10.0 7.5 Method Density Copula method OLS 5.0 True value −1 2.5 0.0 −2 −1 0 1 2 Endogenous Regressor Figure 5: Copula Correction vs. OLS: coefficient estimates were obtained over 1000 simulated samples, each of size 2500; true parameter value is −1 The method has been shown to perform well in recovering the true parameter values even with departures from the normality assumption of the structural error, given that the distribution of the endogenous regressor is different from that of the error. Figure 5 presents the performance 16 REndo - Gui, Meierer and Algesheimer of the copula correction method compared to the performance of OLS. Once again, the copula correction method is less biased than OLS when endogeneity is present, even though it is less consistent. The estimate of the coefficient of the endogenous variable has mean and variance of −0.62 and 0.002 with the OLS, while the mean and variance of the estimate obtained with the copula correction method are −0.97 and 0.003 respectively. For further technical details, see Park and Gupta (2012). Higher Moments Method The method proposed by Lewbel (1997) helps identifying structural parameters in regression models with endogeneity caused by measurement error, as exposed in equations 4 and 5. Identification is achieved by exploiting third moments of the data. Internal instruments, Zt , alongside the error terms in equations 4 and 5, t , and νt , respectively, are assumed unobserved. Unlike previous models, no restriction is imposed on the distribution of the two error terms, while their means are set to zero. Lewbel (1997) proves that the following instruments can be constructed and used with two-stage least squares estimation to obtain consistent estimates: q1t = (Gt − Ḡ) (7a) q2t = (Gt − Ḡ)(Pt − P̄ ) (7b) q3t = (Gt − Ḡ)(Yt − Ȳ ) (7c) q4t = (Yt − Ȳ )(Pt − P̄ ) (7d) q5t = (Pt − P̄ )2 (7e) q6t = (Yt − Ȳ )2 (7f) Here, Gt = G(Xt ) for any given function G that has finite third own and cross moments and X are all the exogenous in the model. Ḡ is the sample mean of Gt . The same rule applies also for Pt and Yt . The instruments in equations 7e and 7f can be used only when the measurement and the Gui, Meierer, Algesheimer 17 structural errors are symmetrically distributed. Otherwise, the use of the instruments does not require any distributional assumptions for the errors. Given that the regressors G(X) = X are included as instruments, G(X) should not be linear in X in equation 7a. An important part for identification is played by the assumption of skewness of the endogenous regressor. To cite Lewbel (1997), ”the greater the skewness, the better the quality of the proposed instruments”. If this assumption fails, the instruments may be weak and thus the estimates will be biased. Since the instruments constructed come along with very strong assumptions, one of their best uses is to provide over-identifying information. The overidentification provided by constructed moments can be used to test validity of a potential outside instrument, to increase efficiency, and to check for robustness of estimates based on alternative identifying assumptions. Figure 6 renders the comparison between the estimates of 75 Method Density Higher Moments OLS 50 True value −1 25 0 −2 −1 0 1 2 Endogenous Regressor Figure 6: Higher Moments vs OLS: coefficient estimates were obtained over 1000 simulated samples, each of size 2500; true parameter value is −1. the higher moments approach and OLS. While ordinary least squares is more biased, with an average estimate equal to −0.87 and variance equal to 0.0002, the higher moments estimator has an average of −1.00 but it is highly volatile, with a variance of 0.18. For more details and the proof of the above assumptions, see Lewbel (1997). 18 REndo - Gui, Meierer and Algesheimer Heteroskedastic Errors Method The method proposed in Lewbel (2012) identifies structural parameters in regression models with endogenous regressors by having variables that are uncorrelated with the product of heteroskedastic errors. This feature is encountered in many models where error correlations are due to an unobserved common factor (Lewbel 2012). The instruments are constructed as simple functions of the model’s data. The method can be applied when no external instruments are available or to supplement external instruments to improve the efficiency of the IV estimator. REndo implements the identification through heteroskedastic errors in the case of triangular systems. Consider the model in equations (4) and (5), with the exception that in equation 5 Pt is a function of Xt , while Zt is a subset of Xt . The model assumes that E(X) = 0, E(Xν) = 0 and cov(Z, ν) = 0. The errors, and ν, may be correlated with each other. Structural parameters are identified by an ordinary two stage least squares regression of Y on X and P , using X and [Z − E(Z)]ν as instruments. A vital assumption for identification is that cov(Z, ν 2 ) 6= 0. The strength of the instrument is proportional to the covariance of (Z − Z̄)ν with ν, which corresponds to the degree of heteroskedasticity of ν with respect to Z (Lewbel 2012). The assumption that the covariance between Z and the squared error is different from zero can be empirically tested. If it is zero or close to zero, the instrument is weak, producing imprecise estimates, with large standard errors. Under homoskedasticity, the parameters of the model are unidentified. But, identification is achieved in the presence of heteroskedasticity related to at least some elements of X. This strategy of identification is less reliable than the identification based on coefficient zero restriction, since it relies upon higher moments. But sometimes, it might be the only available strategy. The performance of the method, compared with OLS, on a sample 1000 randomly generated datasets, is depicted in Figure 7. As with the previous methods, the heteroskedastic error method is less efficient than the OLS, but the coefficient estimate is unbiased, unlike that of the ordinary-least squares. The mean coefficient estimate in the case of OLS is −0.87, compared to −0.97 in the case of the IIV method. The variance of the OLS estimate is 0.0002 Gui, Meierer, Algesheimer 19 compared to 0.043 of the heteroskedastic error method. 40 30 Method Density Heteroskedastic errors OLS 20 True value −1 10 0 −2 −1 0 1 2 Endogenous Regressor Figure 7: Heteroskedastic Errors IV vs OLS: coefficient estimates were obtained over 1000 simulated samples, each of size 2500; true parameter value is −1. 4.2. Internal IV methods for hierarchical data Many kinds of data have a hierarchical structure. Multilevel modeling is a generalization of regression methods that recognize the existence of such data hierarchies by allowing for residual components at each level in the hierarchy. For example, a three-level multilevel model which allows for grouping of students within classrooms, over time, would include time, student and classroom residuals (Eqn.9). Thus, the residual variance is partitioned into four components: between-classroom (the variance of the classroom-level residuals), withinclassroom (the variance of the student-level residuals), between student (the variance of the student-level residuals) and within-student (the variance of the time-level residuals). The classroom residuals represent the unobserved classroom characteristics that affect student’s outcomes. These unobserved variables lead to correlation between outcomes for students from the same classroom. Similarly, the unobserved time residuals lead to correlation between a 20 REndo - Gui, Meierer and Algesheimer student’s outcomes over time. A three-level model can be described as below: 1 1 1 ycst = Zcst βcs + Xcst β1 + 1cst (8) 1 2 2 2 βcs = Zcs βc + Xcs β2 + 2cs (9) βc2 = Xc3 β3 + 3c . (10) Like in single-level regression, in multilevel models endogeneity is also a concern. The additional problem is that in multilevel models there are multiple independent assumptions involving various random components at different levels. Any moderate correlation between some predictors and a random component or error term, can result in a significant bias of the coefficients and of the variance components (Ebbes et al. 2005). While panel data models for dealing with endogeneity can be used to address the same problem in two-level multilevel models, their implications to higher-levels have not been closely examined (Kim and Frees 2007). Moreover, panel data models allow only the inclusion of one random intercept while multilevel models require methodology that can handle more general error structures. Multilevel Instrumental Variables Method Exploiting the hierarchical structure of multilevel data, Kim and Frees (2007) propose a generalized method of moments technique for addressing endogeneity in multilevel models without the need of external instrumental variables. This is realized by using both the between and within variations of the exogenous variables, but only the within variation of the variables assumed endogenous. The assumptions in the multilevel generalized moment of moments model is that the errors at each level are normally distributed and independent of each other. Moreover, the slope variables are assumed exogenous. Since the model does not handle ”level 1 dependencies”, an additional assumption is that the level 1 structural error is uncorrelated with any of the regressors. If this assumption is not met, additional, external instruments are necessary. Kim and Frees (2007) employ a three level multilevel model as in equation 9. The coefficients 1 captures latent, of the explanatory variables appear in the vectors β1 , β2 and β3 . The term βcs Gui, Meierer, Algesheimer 21 unobserved characteristics that are classroom and student specific while βc2 captures latent, unobserved characteristics that are classroom specific. For identification, the disturbance 1 and X 1 . term cst is assumed independent of the other variables, Zcst cst Given the set of disturbance terms at different levels, there exist a couple of possible correlation patterns that could lead to biased results: • errors at both levels (2cs and 3c ) are correlated with some of the regressors, • only third level errors (3c ) are correlated with some of the regressors, • an intermediate case, where there is concern with errors at both levels, but there is not enough information to estimate level 3 parameters. When all model variables are assumed exogenous, the GMM estimator is the usual GLS estimator, denoted as bRE . When all variables are assumed endogenous, the fixed-effects estimator is used, bF E . While bRE assumes all explanatory variables are uncorrelated with the random intercepts and slopes in the model, bF E allows for endogeneity of all effects but sweeps out the random components as well as the explanatory variables at the same levels. The more general estimator bGM M proposed by Kim and Frees (2007) allows for some of the explanatory variables to be endogenous and uses this information to build instrumental variables. The multilevel GMM estimator uses both the between and within variations of the exogenous variables, but only the within variation of the variables assumed endogenous. When all variables are assumed exogenous, bGM M estimator equals bRE . When all covariates are assume endogenous, bGM M equals bF E . In facilitating the choice of the estimator to be used for the given data, Kim and Frees (2007) also proposed an omitted variable test. This test is based on the Hausman-test (Hausman 1978) for panel data. The omitted variable test allows the comparison of a robust estimator and an estimator that is efficient under the null hypothesis of no omitted variables, and also the comparison of two robust estimators at different levels. Figure 8 shows the performance of the multilevel generalized method of moments method compared to the usual random effects model employed with hierarchical data. Here the data 22 REndo - Gui, Meierer and Algesheimer is a three-level hierarchical data with one level two endogenous regressor. The mean random effects estimate is equal to −0.55 and standard error equal to 0.022, while the multilevel GMM estimate has a mean of −0.999 over 1000 simulations, with standard error equal to 0.049. 15 Method Density Multilevel IV Random Effects 10 True value −1 5 0 −2 −1 0 1 2 Endogenous Regressor Figure 8: MultilevelIV vs Random effects: coefficient estimates were obtained over 1000 simulated samples, each of size 2800; true parameter value is −1. Citing Kim and Frees (2007), ”the (multilevel) GMM estimators can help researchers in the direction of exploiting the rich hierarchical data and, at the same time, impeding the improper use of powerful multilevel models.” REndo facilitates the use of this method, thus contributing in growing the number of applications of multilevel models in the social sciences. 5. Using REndo package REndo encompasses five functions that allow the estimation of linear models with one or more endogenous regressors using internal instrumental variables. Depending on the assumptions of the model and the structure of the data, single or multilevel, the researcher can use one of the following functions: Gui, Meierer, Algesheimer 23 • latentIV() - implements the latent instrumental variable estimation as in Ebbes et al. (2005). The endogenous variable is assumed to have two components - a latent, discrete and exogenous component with an unknown number of groups and the error term that is assumed normally distributed and correlated with the structural error. The method supports only one endogenous, continuous regressor and no additional explanatory variables. A Bayesian approach is needed when the model assumes additional exogenous regressors. • copulaCorrection() - models the correlation between the endogenous regressor and the structural error with the use of Gaussian copula. The endogenous regressor can be continuous or discrete. The method allows estimating a model with more than one endogenous regressor. • higherMomentsIV() - implements the higher moments approach described in Lewbel (1997) where instruments are constructed by exploiting higher moments of the data, under strong model assumptions. The method allows more than one endogenous regressor. • hetErrorsIV() - uses the heteroskedasticity of the errors in a linear projection of the endogenous regressor on the other covariates to solve the endogeneity problem induced by measurement error, as proposed by Lewbel (2012). The presence of more than one endogenous regressor is possible. • multilevelIV() - implements the instrument free multilevel GMM method proposed by Kim and Frees (2007) where identification is possible due to the different levels of the data. Endogenous regressors at different levels can be present. The function comes along a built in omitted variable test, which helps in deciding which model is robust to omitted variables at different levels. The package comes with seven simulated datasets. Using just one dataset for exemplifying the functions is not possible due to different assumptions regarding the underlying data generating process for each of the methods. Where possible, the names of the variables were kept 24 REndo - Gui, Meierer and Algesheimer consistent across the datasets, with y for the response variable, P for the endogenous variables and X for the exogenous regressors. The usage of each of the functions is presented in the sections below. 5.1. latentIV() - Latent Instrumental Variables To illustrate the LIV approach, the dataset dataLatentIV was artificially constructed. It contains a response variable, y, an endogenous regressor, P and a discreet latent variable Z used in building P. An intercept is always considered. The latentIV() function can be called providing only the model formula and the data. Alternatively, the function can be called providing a set of initial parameter values. In any model there are eight parameters. The first parameter is the intercept, then the coefficient of the endogenous variable followed by the means of the two groups of the model. The means of the groups need to be different, otherwise the model is not identified. The next three parameters are for the variance-covariance matrix. The last parameter is the probability of being in group 1. When not provided, the initial values are taken to be the least squares coefficients of regressing y on P. For the groups’ means, the mean and the mean plus one standard deviation of the endogenous regressor are considered as initial parameter values. The variance-covariance parameters are taken to be 1, while the probability of being part of group one is 0.5. Below, the function is called without providing initial values for the parameters: R> data(dataLatentIV) R> resultsLIV <- latentIV(y ~ P, data = dataLatentIV) R> summary(resultsLIV) Call: latentIV(formula = y ~ P, data = dataLatentIV) Coefficients: Estimate Std. Error t-value Intercept 2.98902 0.02900 Pr(>|t|) 5153.5 < 2.2e-16 *** Gui, Meierer, Algesheimer P -0.97716 25 0.03700 -1320.5 < 2.2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . Initial Parameter Values: 2.915 -0.852 0.589 2.001 1 1 1 0.5 Log likelihood function: -7872.102 AIC: 15760.2 , BIC: 35744.2 Convergence Code: 0 The summary returns the coefficient estimates for the intercept (2.98), and for P (−0.97). Here, the initial parameter values were not provided, therefore the ordinary least squares coefficient estimates were used, together with the default values for the correlation matrix and the probability of belonging to group one: (2.915, −0.852, 0.589, 2.001, 1, 1, 1, 0.5). Being a maximum likelihood estimation, the function also returns the log likelihood value (−7872.102), Akaike and Bayesian information criteria (AIC : 15760.2 and BIC : 35744.2), and a convergence code (here 0) that says whether the model converged (0) or not (1). For this dataset the true parameter value for the endogenous regressor is −1. The latent internal instrument method coefficient for P is −0.97 with standard error 0.037, while OLS returns a coefficient equal to −0.85 and standard error 0.013. As we have seen in the comparison of the OLS estimate with that of the LIV in the previous section, on average, LIV produces unbiased estimates for the endogenous regressor but with very large standard errors. Therefore, when using the latent instrumental variable approach caution has to be taken when interpreting the results. 5.2. copulaCorrection() - Joint Estimation Using Gaussian Copulas In handling endogeneity by copula correction three cases are discussed. The first case considers one continuous endogenous regressor, the second assumes two continuous 26 REndo - Gui, Meierer and Algesheimer endogenous regressors and the last assumes one discrete endogenous regressor. To exemplifying the use of the function, three datasets have been simulated (see Table 3). Name Exogenous var. Endogenous var. True parameter values dataCopC1 X1, X2 P continuous, t-distributed Exog. var: β0 = 2, β1 = 1.5, β3 = −3 , Endog. var.: α1 = −1, corr(P, ) = 0.33 dataCopC2 X1, X2 P , P2 continuous, t-distributed Exog. var: β0 = 2, β1 = 1.5, β3 = −3, Endog. var.: α1 = −1, α2 = 0.8, corr(P, ) = 0.33, corr(P, P2 ) = −0.15, corr(P2 , ) = 0.1 dataCopD1 X1, X2 P discrete, Poisson distributed Exog. var: β0 = 2, β1 = 1.5, β3 = −3, Endog. var.: α1 = −1, corr(P, ) = 0.33 Table 3: Simulated datasets to exemplify copulaCorrection() function Case 1: One continuous endogenous regressor The maximum likelihood estimation returns, besides the estimates of the explanatory variables in the model, two additional estimates rho and sigma. Since these two variables are estimated from the data and not given ex-ante, the standard errors of all the estimates need to be obtained using bootstrapping. By default, the function returns the standard errors using 10 bootstrap replications. For a larger number of bootstrap replications, function boots() from the same package can be used. The syntax and output of the copulaCorrection() function for the first case is the following: R> data(dataCopC1) R> resultsCC1 <- copulaCorrection(formula = y ~ X1 + X2 + P, endoVar = "P", type = "continuous", method = "1", intercept=TRUE, data = dataCopC1) R> summary(resultsCC1) Call: copulaCorrection(formula = y ~ X1 + X2 + P, endoVar = "P", type = "continuous", method = "1", intercept = TRUE, data = dataCopC1) Coefficients: Estimate Std. Error z-score Pr(>|z|) Gui, Meierer, Algesheimer Intercept 2.00177 0.02412 82.984 0 X1 1.50301 0.00727 206.731 0 X2 -3.01447 0.01179 -255.660 0 P -0.99171 0.00376 -263.555 0 rho 0.39644 0.05798 6.838 0 sigma 1.41128 0.04678 30.168 0 27 --Initial parameter values: 1.983 1.507 -3.014 -0.891 0 1 Log likelihood function: -102 AIC: 216 , BIC: 15204 Convergence Code: 0 The summary returns the coefficient estimates of the model parameters, along the coefficient estimates of the correlation between the error and the endogenous regressor, rho (here 0.39), and of the standard deviation of the structural error, sigma (here 1.41). If initial parameter values were not supplied by the user, the coefficient estimates obtained running OLS are used and returned as output, while for rho and sigma the default initial values are 0 and 1. Next, the log likelihood value is returned (here −56) alongside Akaike and Bayesian information criteria (here AIC : 124 and BIC : 15112 respectively) which can be used for model comparison. Last, the convergence code is returned, which can be (0) if the estimation converged or (1) otherwise (here is 0). As seen in Figure 5, copula correction produces unbiased estimates for the continuous endogenous regressor. In the example above, the coefficient of the endogenous regressor P is −0.99(s.e = 0.0037), a value very close to the true value, −1. In comparison, the OLS estimate for P is −0.89 (s.e. = 0.008). 28 REndo - Gui, Meierer and Algesheimer Case 2: Two continuous endogenous regressor The maximum likelihood approach using copulas cannot be implemented with more than one continuous endogenous regressor due to un-identification problem. In this scenario, the suitable method is OLS augmented with P ∗ . The syntax is the following: R> data(dataCopC2) R> resultsCC2 <- copulaCorrection(formula = y ~ X1 + X2 + P, endoVar = "P", type = "continuous", method = "2", intercept = TRUE, data = dataCopC2) R> summary(resultsCC2) Call: copulaCorrection(formula = y ~ X1 + X2 + P1 + P2, endoVar = c("P1", "P2"), type = "continuous", method = "2", intercept = TRUE, data = dataCopC2) Coefficients: Estimate Std. Error z-score Pr(>|z|) Intercept 2.00843 0.04424 45.394 <2e-16 *** X1 1.49359 0.00817 182.776 <2e-16 *** X2 -2.99199 0.01197 -250.037 <2e-16 *** P1 -1.00597 0.04542 -22.150 <2e-16 *** P2 0.82241 0.05630 14.609 <2e-16 *** PS.1 0.29188 0.06915 4.221 <2e-16 *** PS.2 0.05393 0.07827 0.689 0.491 --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . Residual standard error: Multiple R-squared: F-statistic: 0.9387937 on 2493 degrees of freedom 0.976301 Adjusted R-squared: 0.976244 17116.89 on 7 and 2493 degrees of freedom, p-value: 0 In dataCopC2 dataset there are two endogenous regressor, P 1 and P 2. The additional coefficients, P S.1 and P S.2, which are the estimates of P1∗ and P2∗ , tell us whether there exists Gui, Meierer, Algesheimer 29 significant endogeneity or not. Also they tell the direction of the correlation between the endogenous regressors and the error. In the example above, copula correction method produced a coefficient estimate for P 1 very close to the true value, −1.0059 (s.e. = 0.045), while the OLS coefficient for this data is −0.82 (s.e. = 0.056). P S.1 is statistically significant and positive, implying that indeed P 1 is endogenous, and given the positive sign of P S.1, P 1 is positively correlated with the error. For P 2, the method returned a P S.2 which is not statistically significant, implying that P 2 is not endogenous (which is to be expected since the correlation of P 2 with the error is 0.1). The OLS estimate for P 2 is 0.86 while copula correction returned an estimate equal to 0.82. Case 3: Discrete endogenous regressor In the case of discrete endogenous regressors, the augmented least squares estimation should be used. The fact that the endogenous regressor is discrete has to be specified by using type="discrete", as below: R> data(dataCopD1) R> resultsCC3 <- copulaCorrection(formula = y ~ X1+ X2 + P, endoVar = "P", type="discrete", intercept = TRUE, data = dataCopDis) R> summary(resultsCC3) Call: copulaCorrection(formula = y ~ X1 + X2 + P, endoVar = "P", type = "discrete", intercept = TRUE, data = dataCopDis) Coefficients: Estimate Low_95CI Up_95CI Intercept 2.05432 1.44900 2.390 X1 1.50458 1.48900 1.521 X2 -2.98547 -3.00700 -2.962 P -1.01649 -1.08300 -0.899 0.36598 0.10100 0.515 PStar.1 In the discrete case, the variable P ∗ lies between two points of the inverse univariate normal distribution (see Park and Gupta (2012), p. 573). Therefore, instead of standard errors, 30 REndo - Gui, Meierer and Algesheimer the lower and upper bounds of the 95% confidence interval of the coefficient estimates are provided. We see here again that copula correction method produces an estimate for P very close to the true value (−1.01), while the OLS coefficient for P is −0.85 (s.e. = 0.008). The additional coefficient estimate, P Star.1, has the same interpretation as in the previous example: It tells the direction of the correlation and whether exists significant endogeneity or not. In this example, the coefficient of P Star.1 is statistically significant and positive, implying a positive correlation between the endogenous regressor P and the error. 5.3. higherMomentsIV() - Higher Moments Lewbel (1997)’s higher moments approach to endogeneity is illustrated on the synthetic dataset dataHigherMoments that contains a response variable, y, two exogenous regressors, X1 and X2, and an endogenous variable, P . An intercept is always considered. A set of six instruments can be constructed from the data and they should be specified in the IIV argument of the function. The values for this argument can take the following values: • g - for (Gt − Ḡ), • gp - for (Gt − Ḡ)(Pt − P̄ ), • gy - for (Gt − Ḡ)(Yt − Ȳ ), • yp - for (Yt − Ȳ )(Pt − P̄ ), • p2 - for (Pt − P̄ )2 , • y2 - for (Yt − Ȳ )2 . where G = G(Xt ) can be either x2 , x3 , ln(x) or 1/x and should be specified in the G argument of the function. If left empty, the function will take by default G = x2 . In the example below the internal instrument was chosen to be ”yp”. Additional external instruments can be added to the model by specifying them in EIV: R> data(dataHigherMoments) Gui, Meierer, Algesheimer 31 R> resultsHM <- higherMomentsIV(formula = y ~ X1+ X2 + P, endoVar = "P", IIV = "yp", data = dataHigherMoments) R> summary(resultsHM) Call: higherMomentsIV(formula = y ~ X1+ X2 + P, endoVar = "P", IIV = "yp", data = dataHigherMoments) Residuals: Min 1Q Median 3Q Max -5.54641 -1.17855 0.03467 1.14903 6.54074 Coefficients: Estimate Std. Error t-value Pr(>|t|) Intercept 1.86321 0.71715 X1 1.74095 0.36677 X2 2.87247 0.07582 P -1.00747 2.598 0.00943 ** 4.747 2.18e-06 *** 37.885 < 2e-16 *** 0.03772 -26.706 < 2e-16 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . Residual standard error: 1.741 on 2496 degrees of freedom Multiple R-Squared: 0.8974, Wald test: 833.7 on 3 and 2496 DF, Adjusted R-squared: 0.8973 p-value: < 2.2e-16 For this dataset, the OLS coefficient of the endogenous regressor is −0.866 with a standard error of 0.005, while the higher moments estimate is very close to the true value, −1.007 with a standard error of 0.037. However, interpreting the coefficient estimates has to be made with caution since the higher moments approach comes with a set of very strict assumptions (see Subsection 4.1.2). And, as seen in the comparison with the OLS in the previous section, the higher moments estimator displays high variance. 5.4. hetErrorsIV() - Heteroskedastic Errors Method The function hetErrorsIV() has a four-part formula specification, e.g.: y ~ P | X1 + X2 + 32 REndo - Gui, Meierer and Algesheimer X3 | X1 + X2 | Z1, where: y is the response variable and P is the endogenous regressor, X1 + X2 + X3 represent the exogenous regressors, the third part X1 + X2 specifies the exogenous heteroskedastic variables from which the instruments are derived, while the final part Z1 is optional, allowing the user to include additional external instrumental variables. Allowing the user to specify any additional external variables is a convenient feature of the function, since adding external instruments increases the efficiency of the estimates. A synthetic dataset is included in the package to show the heteroskedastic errors approach. The dataset has an endogenous regressor, P , two normally distributed exogenous regressors, X1 and X2, and a response variable, y. The use of the function is presented below: R> data(dataHetIV) R> resultsHetIV <- hetErrorsIV(y ~ P | X1 + X2 | X1 + X2, data = dataHetIV) R> summary(resultsHetIV) Call: hetErrorsIV(formula = y ~ P | X1 + X2 | X1 + X2, data = dataHetIV) Coefficients: Estimate Std. Error t value Intercept 1.01176 0.39666 2.5507 1.57911 -0.6092 Pr(>|t|) 0.01075 * P -0.96197 0.54240 X1 1.04929 0.25403 4.1306 3.618e-05 *** X2 0.99428 0.13951 7.1269 1.027e-12 *** --Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . Overidentification Test: J-test Test E(g)=0: P-value 0.5536734 0.4568206 Partial F-test Statistics for Weak IV Detection: P 0.4156665 Number of Observations: 2500 Gui, Meierer, Algesheimer 33 If the user wants to obtain robust standard errors, this possibility is offered by calling the function with the additional argument robust=TRUE. Similarly, clustered standard errors can be obtained using clustervar=TRUE. The function also returns Sargan’s J-test of over-identifying restrictions and partial F-statistics for weak IV detection. If the null hypothesis of the Sargan’s J-test cannot be rejected, ti implies that the instruments are valid, while a rejection of the null implies that at least one instrument is not valid. For this example, the OLS coefficient is −0.36 with a 0.04 standard error while the heterogenous errors approach produces an estimate for P equal to −0.96, with a very large standard error. This is to be expected, since the IIV estimator is not efficient in small samples. However, here in this dataset, the internal instrumental method is closer to the true parameter value of the endogenous regressor. The overidentification restrictions are valid since the null hypothesis could not be rejected (p-value=0.45). We also fail to reject the null hypothesis of the F-statistic test (p = 0.415), meaning the instruments are valid. In cases where there is more than one endogenous regressor the Angrist and Pischke (2008) method for multivariate first-stage F-statistics is employed. 5.5. multilevelIV() - Multilevel Internal Instrumental Variables The multilevelIV() function allows the estimation of a multilevel model with up to three levels, in the presence of endogeneity. Specifically, Kim and Frees (2007) designed an approach that controls for endogeneity at higher levels in the data hierarchy, e.g. for a three-level model, endogeneity can be present either at level two, level three or at both, level two and three. The function takes as input the formula, the set of endogenous variables and the name of the dataset. It returns the coefficient estimates obtained with fixed effects, random effects and the GMM estimator proposed by Kim and Frees (2007), such that a comparison across models can be done. Asymptotically, the multilevel GMM estimators share the same properties of corresponding fixed effects estimators, but they allow the estimation of all the variables in the model, unlike the fixed effects counterpart. To illustrate the use of the function, dataMultilevelIV has a total of 2950 observations clustered into 1370 observations at level 2, which are further clustered into 40 level-three units. Additional details are presented in 34 REndo - Gui, Meierer and Algesheimer Table 4 below: Name dataMultilevelIV Variables level-1: X11, X12, X13, Type True param values exogenous β11 = 3, β12 = 9, β13 = −2, β14 = 2 endogenous, β15 = −1 X14 level-1: X15 corr(X15, cs) = 0.7; level-2: X21, X22, X23, exogenous β21 = −1.5, β22 = −4, β23 = −3, β24 = 6 exogenous β31 = 0.5, β32 = 0.1, β33 = −0.5 X24 level-3: X31, X32, X33 Table 4: Simulated dataset to exemplify multilevelIV() function. cs is the level-two error. The dataset has five level-1 regressors, X11, X12, X13, X14 and X15, where X15 is correlated with the level two error, thus endogenous. There are four level-2 regressors, X21, X22, X23 and X24, and three level-3 regressors, X31, X32, X33, all exogenous. We estimate a threelevel model with X15 assumed endogenous. Having a three-level hierarchy, multilevelIV() returns five estimators, from the most robust to omitted variables (FE_L2), to the most efficient (REF), i.e. lowest mean squared error: • level-2 fixed effects (FE_L2); • level-2 multilevel GMM (GMM_L2); • level-3 fixed effects (FE_L3); • level-3 multilevel GMM (GMM_L3); • random effects estimator (REF). The random effects estimator is efficient assuming no omitted variables, whereas the fixed effects estimator id unbiased and asymptotically normal even in the presence of omitted variables. Because of the efficiency, one would choose the random effects estimator if confident that no important variables were omitted. On the contrary, the robust estimator would be preferable if there was a concern that important variables were likely to be omitted. The estimation result is presented below: Gui, Meierer, Algesheimer 35 R> data(dataMultilevelIV) R> formula <- y ~ X11 + X12 + X13 + X14 + X15 + X21 + X22 + X23 + X24 + X31 + X32 + X33 + (1 | DID) + (1 + X11 | HID) R> endV <- dataMultilevelIV[,"X15"] R> resultsMIV <- multilevelIV(formula = formula, endoVar = endV, data = dataMultilevelIV) R> summary(resultsMIV) Call: multilevelIV(formula = formula1, endoVar = endoVars, data = dataMultilevelIV) FE_L2 GMM_L2 FE_L3 GMM_L3 Intercept 0.000 62.446 0.000 61.959 X11 0.000 2.978 2.953 2.953 2.953 X12 8.876 8.952 8.961 8.962 8.962 -2.064 -2.019 -2.019 -2.019 -2.019 X13 X14 2.077 2.036 2.042 RandomEffects 61.959 2.042 2.042 X15 -1.061 -1.062 -0.537 -0.537 -0.537 X21 0.000 -1.528 -1.708 -1.685 -1.685 X22 0.000 -3.814 -3.808 -3.813 -3.813 X23 0.000 -2.798 -2.838 -2.838 -2.838 X24 0.000 5.772 5.509 5.520 5.520 X31 0.000 1.651 0.000 1.663 1.663 X32 0.000 0.507 0.000 0.528 0.528 X33 0.000 0.095 0.000 0.093 0.093 As we have simulated the data, we know that the true parameter value of the endogenous regressor (X15) is −1. Looking at the coefficients of X15 returned by the five models, we see that they form two clusters: one cluster composed of the level-two fixed effects estimator and the level-two GMM estimator (both return −1.06), while the other cluster is composed of the other three estimators, FE_L3, GMM_L3, REF, all three having a value of −0.537. The bias of the last three estimators is to be expected since we have simulated the data such that 36 REndo - Gui, Meierer and Algesheimer X15 is correlated with the level-two error, to which only FE_L2 and GMM_L2 are robust. After the initial estimation of all applicable methods, the most appropriate estimator has to be identified in the next steps. To provide guidance for selecting the appropriate estimator, multilvelIV() performs an omitted variable test. The results are returned by the summary() function. For example, in a three-level setting, different estimator comparisons are possible: 1. Fixed effects versus random effects estimators: To test for omitted level-two and level-three omitted effects, simultaneously, one compares FE_L2 to REF. The test does not indicate the level at which omitted variables might exist. 2. Fixed effects versus GMM estimators: Once it was established that there exist omitted effects but not certain at which level (see 1), we test for level-two omitted effects by comparing FE_L2 versus GMM_L3. A rejection of the null hypothesis will imply omitted variables at level-two. The same is accomplished by testing FE_L2 versus GMM_L2, since the latter is consistent only if there are no omitted effects at level two. 3. Fixed effects versus fixed effects estimators: We can test for omitted level-two effects, while allowing for omitted level-three effects. This can be done by comparing FE_L2 versus FE_L3 since FE_L2 is robust against both level-two and level-three omitted effects while FE_L3 is only robust to level-three omitted variables. In general, testing for higher level endogeneity in multilevel settings one would start by looking at the results of the omitted variable test comparing REF and FE_L2. If the null hypothesis if rejected, this means the model suffers from omitted variables, either at level two or level three. Next, test whether there are level-two omitted effects, since testing for omitted level three effects relies on the assumption there are no level-two omitted effects. To this end, rely on one of the following model comparisons: FE_L2 versus FE_L3 or FE_L2 versus GMM_L2. If no omitted variables at level-two are found, proceed with testing for omitted level-three effects by comparing FE_L3 versus GMM_L3 or GMM_L2 versus GMM_L3. The summary() function that returns the results of the omitted variable test takes two argu- Gui, Meierer, Algesheimer 37 ments: the name of the model object (here resultsMIV) and the estimation method (here REF). Without a second argument, summary() displays simultaneously the coefficients obtained with all the different estimation approaches (fixed effects, GMM, random effects). The second parameter can take the following values, depending on the model estimated (two or three levels): REF, GMM_L2, GMM_L3, FE_L2, FE_L3. It returns the estimated coefficients under the model specified in the second argument, together with their standard errors and z-scores, Further, it returns the chi-squared statistic, degrees of freedom and p-value of the omitted variable test between the focal model (here REF) and all the other possible options (here FE_L3, GMM_L2 and GMM_L3). R> summary(resultsMIV, "REF") Call: multilevelIV(formula = formula2, endoVar = endV2, data = dataMultilevelIV) Coefficients: Coefficients ModelSE z-score Pr|>z| Intercept 61.959 8.719 7.106 0.000 X11 2.953 0.032 92.281 0.000 X12 8.962 0.027 331.926 0.000 X13 -2.019 0.026 -77.654 0.000 X14 2.042 0.026 78.538 0.000 X15 -0.537 0.020 -26.850 0.000 X21 -1.685 0.142 -11.866 0.000 X22 -3.813 0.120 -31.775 0.000 X23 -2.838 0.068 -41.735 0.000 X24 5.520 0.256 21.562 0.000 X31 1.663 0.098 16.969 0.000 X32 0.528 0.249 2.120 0.034 X33 0.093 0.059 1.576 0.115 38 REndo - Gui, Meierer and Algesheimer Omitted variable test: FEL2_vs_REF stat 112.864 df 4 pval 1.781e-23 GMML2_vs_REF 282.025 FEL3_vs_REF GMML3_vs_REF 0.058 0 13 9 13 1.373e-52 1 1 alternative hypothesis: one model is inconsistent In the example above, we compare the random effects (REF) with all the other estimators. Testing REF, the most efficient estimator, against the level-two fixed effects estimator, FE_L2, which is the most robust estimator, we are actually testing simultaneously for level-2 and level-3 omitted effects. Since the null hypothesis is rejected with a p-value of 1.78e-23, the test indicates severe bias in the random effects estimator. In order to test for level-two omitted effects regardless of the presence of level-three omitted effects, we have to compare the two fixed effects estimators, FE_L2 versus FE_L3: R> summary(resultsMIV,"FE_L2") Call: multilevelIV(formula = formula2, endoVar = endV2, data = dataMultilevelIV) Coefficients: Coefficients ModelSE z-score Pr|>z| Intercept 0.000 0.000 NA NA X11 0.000 0.000 NA NA X12 8.876 0.054 164.370 0 X13 -2.064 0.047 -43.915 0 X14 2.077 0.050 41.540 0 X15 -1.061 0.053 -20.019 0 X21 0.000 0.000 NA NA X22 0.000 0.000 NA NA X23 0.000 0.000 NA NA X24 0.000 0.000 NA NA Gui, Meierer, Algesheimer X31 0.000 0.000 NA NA X32 0.000 0.000 NA NA X33 0.000 0.000 NA NA 39 Omitted variable test: FEL2_vs_GMML2 FEL2_vs_FEL3 FEL2_vs_GMML3 stat 32834.99 220.3656 222.1892 df 4 4 12 pval 0 1.564e-46 6.335e-47 alternative hypothesis: one model is inconsistent The null hypothesis of no omitted level-two effects is rejected (p-value is equal to 1.56e-46). Therefore, it is possible to conclude that there are omitted effects at level-two. This finding is no surprise as we simulated the dataset with the level-two error correlated with X15, which leafs to biased FE_L3 coefficients. Hence, the result of the omitted variable test between leveltwo fixed effects and level-two GMM should not come to a surprise: The null hypothesis of no omitted level-two effects is rejected (p-value is 0). In case of wrongly assuming that an endogenous variable is exogenous, the random effects as well as the GMM estimators will be biased, since the former will be constructed using the wrong set of internal instrumental variables. Consequently, it is important to compare the results of the omitted variable tests when the variable is considered endogenous versus exogenous, can indicate whether the variable is indeed endogenous or not. To conclude this example, the test results provide support that the FE_L2 should be used. 6. Conclusion Endogeneity, one of the challenges of empirical research, can be addressed using different approaches, from instrumental variables to structural models. However, there exist applications in which the economic theory or the intuition do not help in obtaining an additional observable variable that can be used as external instrumental variable. Given the properties this variable has to have in order to be suitable as instrument, finding one is far from easy. 40 REndo - Gui, Meierer and Algesheimer Therefore, methods that treat endogeneity without the need of external regressors have been proposed by researchers coming from different disciplines. These techniques of estimation are called ”instrument-free” or ”internal instrumental variables” methods. REndo integrates the main instrument-free models that allow correcting for endogeneity. The motivation behind the development of REndo is to come to the aid of researchers from a broad spectrum of disciplines, from sociology, political science to economics and marketing in addressing the endogeneity problem when no strong, valid instrumental variables can be found. Moreover, by offering the multilevelIV() function REndo is the first R package that offers a solution to the endogeneity problem in a multilevel setting. The package also offers seven datasets on which the capabilities and usefulness of the package are illustrated. Future developments could include a Bayesian approach to the latent instrumental variable method which would allow incorporating additional explanatory variables and even additional endogenous regressors, as exposed in Ebbes et al. (2009). Gui, Meierer, Algesheimer 41 References Angrist J, Pischke J (2008). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton: Princeton University Press. Antonakis J, Bendahan S, Jacquart P, Lalive R (2014). “Causality and Endogeneity: Problems and Solutions.” In NYOU Press (ed.), The Oxford Handbook of Leadership and Organizations, pp. 93–117. Berry S, Levinsohn J, Pakes A (1995). “Automobile Prices in Market Equilibrium.” Econometrica, 63(4), 841–890. Berry ST (1994). “Estimating Discrete-Choice Models of Product Differentiation.” The RAND Journal of Economics, 25(2), 242–262. Chaussé P (2010). “Computing Generalized Method of Moments and Generalized Empirical Likelihood with R.” Journal of Statistical Software, 34(11), 1–35. URL http://www. jstatsoft.org/v34/i11/. Draganska M, Jain D (2004). “A Likelihood Approach to Estimating Market Equilibrium Models.” Management Science, 50(5), 605–616. Ebbes P (2004). “Latent Instrumental Variables - A New Approach to Solve for Endogeneity.” Groningen: University of Groningen. Ebbes P, Wedel M, Boeckenholt U (2009). “Frugal IV Alternatives to Identify the Parameter for an Endogeneous Regressor.” Journal of Applied Econometrics, 24(3), 446–468. Ebbes P, Wedel M, Boeckenholt U, Steerneman A (2005). “Solving and Testing for RegressorError (In)Dependence When no Instrumental Variables Are Available: With New Evidence for the Effect of Education on Income.” Quantitative Marketing and Economics, 3(4), 365– 392. Epanechnikov V (1969). “Nonparametric Estimation of a Multidimensional Probability Density.” Teoriya veroyatnostei i ee primeneniya, 14(1), 156–161. 42 REndo - Gui, Meierer and Algesheimer Fox J, Nie Z, Byrnes J (2016). “sem: Structural Equation Models.” URL https://cran. r-project.org/web/packages/sem. Hansen LP (1982). “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrics, 50(4), 1029–1054. Hausman J (1978). “Specification Tests in Econometrics.” Econometrica, 46(6), 1251–1271. Henningsen A, Hamann Jeff D (2015). “systemfit: Estimating Systems of Simultaneous Equations.” URL https://cran.r-project.org/web/packages/systemfit. Hogan V, Rigobon R (2002). “Using Unobserved Supply Shocks to Estimate the Returns to Education.” Working Paper 9145, Cambridge:National Bureau of Economc Research. Kim S, Frees F (2007). “Multilevel Modeling with Correlated Effects.” Pshychometrika, 72(4), 505–533. Kleiber C, Zeileis A (2015). “AER: Applied Econometrics with R.” URL https://cran. r-project.org/web/packages/AER/. Lewbel A (1997). “Constructing Instruments for Regressions with Measurement Error When No Additional Data are Available, With an Application to Patents and R and D.” Econometrica, 65(5), 1201–1213. Lewbel A (2012). “Using Heteroscedasticity to Identify and Estimate Mismeasured and Endogenous Regressor Models.” Journal of Business and Economic Statistics, 30(1), 67–80. Longford N (1995). “Random Coefficient Models.” Marra G, Radice R (2011). “SemiParBIVProbit: Semi-parametric Bivariate Probit Modelling. R package version 2.0-4.1.” URL https://cran.r-project.org/web/packages/ SemiParBIVProbit. Park S, Gupta S (2012). “Handling Endogeneous Regressors by Joint Estimation Using Copulas.” Marketing Science, 31(4), 567–586. Gui, Meierer, Algesheimer 43 Petrin A, Train K (2010). “A Control Function Approach to Endogeneity in Consumer Choice Models.” Journal of Marketing Research, 47(1), 3–13. R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.r-project.org/ edition. Raudenbush S, Bryk A (1986). “Hierarchical Model for Studying School Effects.” Sociology of Education, 59(1), 1–17. Rigobon R (2003). “Identification Through Heteroskedasticity.” Review of Economics and Statistics, 85(4), 777–792. Ruud P (2000). An Introduction to Classical Econometric Theory. 1 edition. New York: Oxford University Press. Silverman B (1986). Density Estimation for Satistics and Data Analysis. CRC Monographs on Statistics and Applied Probability. London: Chapman & Hall. Stock JH, Yogo M (2002). “Testing for Weak Instruments in Linear IV Regression.” NBER Technical Working Paper No 284. Theil H (1958). Economic Forecasts and Policy. Contributions to economic analysis no.15. Amsterdam: North-Holland. Van de Ven WPM, Van Praag BMS (1981). “The Demand for Deductibles in Private Health Insurance: A Probit Model with Sample Selection.” Journal of Econometrics, 17(2), 229– 252. Wright S (1928). The Tariff on Animal and Vegetable Oils. New York: MacMillan. 44 REndo - Gui, Meierer and Algesheimer A. Weak vs. Strong Instrumental Variables Table A.1 presents the performance of OLS in comparison with two-stage least squares, one of the most frequently used instrumental variable methods, differentiating among strong and weak instruments and among small and large sample size. The sample size varies between 500 and 2500 observations respectively, and the correlation (ρ) between the instrument and the endogenous regressor takes two values, 0.10 and 0.40. The results are obtained running a simulation over 1000 random samples, where the true parameter value is equal to −1 and the correlation between the omitted variable and the endogenous regressor takes two values, 0.1 and 0.3. θ 0.1 0.3 True value OLS -1 -1 -1 -1 -0.918 -0.920 -0.774 -0.776 IV ρ = 0.10 ρ = 0.4 -0.884 -1.001 -0.998 -1.001 -0.888 -1.000 -0.999 -1.002 Sample size 500 2500 500 2500 Table A.1: OLS vs IV: θ = correlation between endogenous variable and the error, ρ = correlation between the IV and the endogenous variable. The bias of the estimates, both of the OLS and of the IV, depends heavily on the sample size but also on how much the regressor correlates with the error and with the IV. In the first row of Table A.1 we can easily see that, with a relatively low sample size and a low correlation of the regressor with the error, a weak instrument produces a more biased estimate than the OLS. Once the sample size increases, the weak instrument does no longer produce biases. Gui, Meierer, Algesheimer 45 B. Bias under Endogeneity The bias induced by an endogenous regressor on an exogenous variable depends, besides the correlation between the two variables, on the correlation between the model’s error and the endogenous variable. To underline this situation, Figure B.1 as well as Table B.2 present the coefficient estimates for the model presented in Figure 1. The correlation between the error and P takes three values: 0.1 0.3 and 0.5. For each of these three values, the correlation between the two covariates, P and X, is also taken to be equal to 0.1, 0.3 and 0.5. The estimates presented are the mean coefficient estimates over 1000 simulated samples, each having the same size, 2500 observations. Panel A presents the coefficient estimates for the Panel B −0.6 1.00 −0.7 0.95 Corr(P, error) 0.1 0.3 0.5 −0.8 True b1 value −1 Coefficient estimates for beta2 Coefficient estimates for beta1 Panel A Corr(P, error) 0.1 0.3 True b2 value 1 −0.9 0.85 −1.0 0.80 0.1 0.2 0.3 0.4 0.5 Correlation between P and X 0.5 0.90 0.1 0.2 0.3 0.4 0.5 Correlation between P and X Figure B.1: Endogeneity and the Bias of Estimates for β1 and β2 : bias at different levels of correlation between the endogenous regressor (P ) and the error intercept, β1 , while panel B the coefficient estimates for the slope, β2 . In both cases the bias grows as we move from lower to higher levels of the correlation between the error and the endogenous regressor. In panel B, for the same correlation between the error and the 46 REndo - Gui, Meierer and Algesheimer endogenous regressor (e.g. blue line), the increase in bias is steeper than in panel A. For example, when the correlation between the error and P is 0.5, the bias of the slope estimate increases from 0.03 to 0.10, to 0.19 for different levels of correlation between P and X. Similar patterns occur for different levels of correlation between the error and P , either for β1 or β2 , as can be observed in Table B.2. corr(P, ) β1 β2 0.1 0.3 0.5 corr(P, X) 0.1 0.3 0.5 0.082 0.089 0.102 0.226 0.239 0.274 0.333 0.351 0.396 0.1 0.3 0.5 0.008 0.023 0.034 0.026 0.072 0.105 0.049 0.137 0.198 Table B.2: Estimates’ Bias: β1 - intercept, β2 - slope Gui, Meierer, Algesheimer 47 C. Gaussian Copula A copula is a function that maps several conditional distribution functions (CDF) into their joint CDF. Here, the CDF of x1 is a t-distribution and the CDF of x2 is a normal distribution. Function H() is the joint conditional distribution function. Cumulative Distribution Function − x1 0.8 0.6 G(x2) 0.0 0.0 0.2 0.4 0.6 0.4 0.2 F(x1) 0.8 1.0 1.0 Cumulative Distribution Function − x2 −200 0 200 400 600 −4 −2 x1 F(x1) 0 G(x2) 15002000 500 1000 3 2 1 H(x1,x2) = C(F(x1),G(x2)) 0 -1 -2 -3 600 400 200 0 x -200 joint distribution Figure C.2: Copula example Affiliation: Raluca Gui Department of Business Administration URPP Social Networks University of Zurich 8050, Zurich, Switzerland 0 x2 2 48 REndo - Gui, Meierer and Algesheimer E-mail: raluca.gui@business.uzh.ch Url: http://www.socialnetworks.uzh.ch/en/aboutus.html View publication stats