A SUMMARY OF INTRODUCTORY ECONOMETRICS BY WOOLDRIDGE

Marius van Oordt
African Tax Institute, University of Pretoria
Email: marius.vanoordt@up.ac.za

ABSTRACT

This is a summary of the well-known textbook by Wooldridge titled "Introductory Econometrics: A Modern Approach" (6th edition). It covers the basics of cross-section, time-series and panel econometrics. Please inform me where the summary can be improved.

Keywords: Econometrics
JEL Classification: C01

Contents

CROSS-SECTIONAL DATA
  Ordinary Least Squares (OLS) Assumptions
  Multiple regression under OLS
    Proxy variables
    Variance in the model and estimates
    Statistical inference and hypothesis testing
  OLS large sample properties
    Consistency
    Asymptotic normality
    Asymptotic efficiency
  Transformation of variables
  Models for limited dependent variables
    Linear probability model (LPM) for binary dependent variables
    Logit and Probit models for binary dependent variables
    Tobit model for continuous dependent variable with many zero observations
    Poisson regression model for count dependent variables
    Censored regression model for censored dependent variable
  Heteroscedasticity
    Heteroscedasticity under OLS
    Weighted Least Squares (WLS)
  Measurement error
  Non-random sampling
    Truncated regression model
    Incidental truncated models
  Outliers
    Least absolute deviations (LAD)
  Testing whether a variable is endogenous
  Independently pooled cross section
  Cluster samples
  Instrumental variable (IV) estimator
    Statistical inference of the IV estimator
  Two-stage least squares (2SLS) estimator
    Assumptions for 2SLS
    Indicator variables (Multiple indicator solution)
  Generated independent variables and instruments
  Control Function Estimator (CF)
  Correlated random coefficient model
  Systems of Equations
    Simultaneity bias and simultaneous equation models (SEM)
TIME SERIES DATA
  OLS Assumptions for finite samples
  Basic time series models using OLS as the estimator
    Static model
    Finite distributed lag model (FDL)
    Dynamically complete model
    Possible additions to the above models
  OLS asymptotic assumptions
    Stationary
    Weakly dependent
  Highly persistent time series
    Spurious regression
  Serial correlation
    Tests for serial correlation
    Correcting serial correlation
  Heteroscedasticity
  Serial correlation and heteroscedasticity
  2SLS estimator
    SEMs
    Assumptions for 2SLS
  Infinite distributed lag (IDL) models
    Geometric (Koyck) distributed lag models
    Rational distributed lag models
  Forecasting
    One-step forecasting
    Multiple-step forecasting
PANEL DATA
  Fixed effects model
    First-Differencing estimator (FD)
    Fixed effects estimator (Within estimator) (FE)
  Random effects model
    Random effects estimator (RE)
  FE/FD or RE or pooled OLS?
  The correlated random effects model (CRE)
  IV estimator
  Dynamic panel data models
  Spatial panels
CROSS-SECTIONAL DATA

Ordinary Least Squares (OLS) Assumptions

The assumptions (for finite samples) of OLS are:

1. The parameters are linear (note: not the independent variables). OLS cannot be performed when the equation is, for example, y = α + β²x + u.

2. The sample is obtained randomly from a population. This is not always the case.

3. There is variance in the independent variables. This is always the case and can be ignored as a requirement.

4. Unbiased parameters, the zero conditional mean error assumption, written as
E(u|x₁, x₂, …, xₖ) = E(u) = 0
This means that there are no unobserved factors (included in the error term) that are correlated with the independent variables. Alternatively stated, all other factors not included in the model that affect y are uncorrelated with x₁, x₂, …, xₖ. If this does not hold, the parameters are biased upwards or downwards and we say that we have endogenous explanatory variables. Note that this assumption will also not hold if the incorrect functional form for the independent variables is chosen, if there is measurement error in the independent variables, or in the presence of simultaneity bias (all discussed later). Functional form is asymptotically less important than the others mentioned. It is important to understand the omitted variable bias that results if this assumption does not hold. This can be written
Bias(β̂₁) = β̂₂δ̃
where β̂₂ indicates the correlation between the omitted variable xⱼ and y, and δ̃ indicates the correlation between x₁ and xⱼ, the endogenous variable and the omitted variable. It is not possible to determine the magnitude of the bias, but we can indicate whether the bias is upwards or downwards. If β̂₂ is positive and δ̃ is positive, we have upward bias (this is based on intuition). Similarly, if one is positive and the other negative, we have downward bias; if both are negative, we have upward bias. Remember that a biased parameter will influence all parameters that are correlated with the variable of that parameter. In discussing results from a multiple regression, however, we do not discuss whether the exogenous variables (variables not correlated with the error term) are biased upwards or downwards as a result of including an endogenous variable in the model.

5. Homoskedasticity:
Var(u|x₁, x₂, …, xₖ) = σ² = Var(y|x)
This means that the variance of the dependent variable, given the independent variables, is constant. It also means the variance of the error term is constant around the regression line for each observation and does not change as the level of the independent variables changes. If this does not hold, the standard errors of the parameters are incorrect and the parameters are therefore a poorer estimate of the population parameters. It is also very important to note that increased variability in an independent variable will decrease the standard error of its parameter.

6. There is no perfect collinearity between the independent variables. An independent variable may not be a constant, and there may not be an exact linear relationship between independent variables, e.g. x₁ = c·x₂ or x₁ = x₂ + x₃. Note that x₁ together with log(x₁) or x₁² are not linear relationships and are allowed.
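As a rough illustration of the direction of omitted variable bias, the following minimal sketch (simulated data; all variable names and magnitudes are hypothetical) compares a short regression that omits x₂ with the long regression that includes it:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    x2 = rng.normal(size=n)                 # omitted variable
    x1 = 0.8 * x2 + rng.normal(size=n)      # delta > 0: x1 correlated with x2
    y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)   # beta2 > 0

    # Short regression omits x2: both correlations positive, so upward bias
    X_short = np.column_stack([np.ones(n), x1])
    b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

    # Long regression includes x2: slope close to the true 2.0
    X_long = np.column_stack([np.ones(n), x1, x2])
    b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

    print(b_short[1], b_long[1])   # roughly 2.73 versus 2.00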
Multiple regression under OLS

The main purpose of including multiple independent variables is to take controls out of the error term and put them explicitly in the equation. This is done to adhere to assumption four above. For interpretation, take the following regression:
y = α + βx₁ + γx₂ + u
β measures the relationship between y and x₁ after the other variables (x₂) have been partialled out. The same is true for all other parameters, unless two parameters use different functional forms of the same variable, discussed next. In the case where x₁ is, for example, income and x₂ is income squared, the derivative of the equation has to be used to interpret β. For instance, for
y = a + β₁x + β₂x² + u
Δy/Δx = β₁ + 2β₂x
If there are other independent variables included, the partial derivative (treating all other variables as constant) would need to be taken to interpret β₁. The same logic applies to interaction terms: the interaction term forms part of the interpretation, just as would be the case for a partial derivative.

Proxy variables

Before estimating a model, we should always specify the population model. Often a population model will include unobservable variables (for instance, ability) that we cannot include in the model to be estimated (we cannot observe them). In such instances, it is generally preferable to include a proxy variable (which can be observed) to reduce or possibly remove the bias of not including the unobservable variable. The requirements for an ideal proxy are:

1. If we were able to include the unobserved variable, the proxy variable would be irrelevant. This is always met when the population model is correctly specified.
2. The independent variables are not partially correlated with the unobserved variable after including the proxy variable. If this is not the case, the independent variables will still be correlated with the error term, although most likely to a lesser extent than if the proxy was not included (less bias).

It should be noted that even if the second requirement is not met and we have an imperfect proxy, it is generally still a good idea to include it in the estimation model. It may also be required that the proxy interact with another independent variable in the population model. If q is taken as the unobserved variable in the model
y = β₀ + β₁x + γ₁q + γ₂x·q + u
then the interpretation of x is the partial effect β₁ + γ₂q. This presents a problem, since q is not observed. We can, however, obtain the average partial effect if we assume the average of q in the population is zero, so that the average partial effect is β₁. (If x is binary, we call this the average treatment effect; as previously mentioned, all estimated coefficients are average partial effects.) Once we take a proxy for q, it is therefore required that we demean the proxy in the sample before interacting, after which β₁ is the average partial effect. Further note that if the interaction term is significant, the error term will be heteroskedastic. A model with an interaction proxy is called a random coefficient model.

Variance in the model and estimates

Sum of squares total (SST) = sum of squares explained (SSE) + sum of squared residuals (SSR). R² is therefore SSE over SST: the explained variance over the total variance. A higher R² does not always indicate a better model; additional variables should only be included if they have a non-zero partial effect on the dependent variable in the population.
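A minimal sketch (simulated data, hypothetical names) of the SST = SSE + SSR decomposition and how R² follows from it:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 1 + 0.5 * x + rng.normal(size=200)

    res = sm.OLS(y, sm.add_constant(x)).fit()
    sst = ((y - y.mean()) ** 2).sum()
    sse = ((res.fittedvalues - y.mean()) ** 2).sum()
    ssr = (res.resid ** 2).sum()

    print(sst, sse + ssr)             # SST equals SSE + SSR
    print(res.rsquared, sse / sst)    # R-squared equals SSE/SST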
It is also common to calculate the adjusted R², 1 − [SSR/(n − k − 1)]/[SST/(n − 1)]. This is useful because the adjusted R² does not always increase when additional variables are added: if an additional variable has a t-statistic of less than one, the adjusted R² will decrease. This is also useful for non-nested model selection.

The sampling variance of the OLS slope estimates is calculated as follows:
Var(β̂ⱼ) = σ²/[SSTⱼ(1 − Rⱼ²)]
where σ² is the error variance of the regression. This means a larger variance in the error (more noise) leads to more variance in the estimate; adding more variables reduces this variance. Further, SSTⱼ is the total sample variation in xⱼ. This means that the more variance in the sample (or, alternatively, the larger the sample), the smaller the variance of the estimate becomes. Lastly, and very importantly, Rⱼ² indicates the extent of multicollinearity between xⱼ (e.g. the variable of interest) and the other independent variables. This can, for instance, be seen by looking at VIFs for xⱼ. In other words, this is the linear relationship between one independent variable and all the other independent variables. The more collinearity between this variable and the others, the larger Var(β̂ⱼ) becomes. This is where multicollinearity becomes a "problem", but it should be seen that multicollinearity has the same effect as a small sample, since a small sample likewise reduces SSTⱼ. If a variable is dropped due to multicollinearity, we may not meet assumption 4 (estimates will be biased) and σ² will increase, so this is not a good idea. Multicollinearity does not make any OLS assumption invalid and does not need to be addressed (as opposed to perfect multicollinearity). Further, if other variables besides the variable of interest are collinear with each other, and these variables are not correlated with the variable of interest, this will not influence Var(β̂ⱼ). In conclusion, focus on having σ² as small as possible and SSTⱼ as large as possible, and worry less about multicollinearity.

This, however, does not mean that we should add as many variables as possible to the model. The ceteris paribus interpretation should always be considered. It does not make sense to add, for instance, both beer consumption and the tax collected from beer consumption in a model where we are interested in the effect of the beer tax on motor vehicle accident fatalities; the ceteris paribus interpretation becomes nonsensical. However, if we have a variable that affects y and is uncorrelated with all other independent variables, such a variable should always be included; it does not increase multicollinearity and results in smaller standard errors.

To calculate σ̂² in a sample, we write
σ̂² = SSR/df
where df (degrees of freedom) is n (observations) − k (slope parameters) − 1 (intercept). Take the square root to obtain σ̂, the standard error of the regression. This standard error is used to compute the standard error of a parameter, se(β̂ⱼ) = √var̂(β̂ⱼ). Note that heteroscedasticity violates this, and we then cannot be certain that OLS has the smallest variance of all estimators (that OLS is Best).
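One way to sketch the link between Rⱼ², VIFs and the standard errors in Python (simulated data; VIFⱼ = 1/(1 − Rⱼ²) from an auxiliary regression of xⱼ on the other regressors):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 500
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # highly collinear with x1
    x3 = rng.normal(size=n)
    y = 1 + x1 + x2 + x3 + rng.normal(size=n)

    X = sm.add_constant(np.column_stack([x1, x2, x3]))
    res = sm.OLS(y, X).fit()

    # VIF for x_j: regress x_j on the other regressors, then 1/(1 - R_j^2)
    for j in range(1, X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2_j = sm.OLS(X[:, j], others).fit().rsquared
        print(f"VIF for regressor {j}: {1 / (1 - r2_j):.2f}")

    print(res.bse)   # standard errors grow with VIF, all else equal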
Statistical inference and hypothesis testing

Classic linear model (CLM) assumption
The classic linear model is not an estimator but an assumption, important for hypothesis testing and statistical inference from the sample to the population. It comprises assumptions 1 to 6 of OLS plus the normality assumption. Formally, the CLM assumption is
E(u|x₁, x₂, …, xₖ) = E(u) and u ~ Normal(0, σ²)
The assumption is therefore that the error term follows a normal distribution, which means that the estimates are normally distributed, any linear combination of β̂₁, β̂₂, …, β̂ₖ is normally distributed, and any subset of the β̂ⱼ has a joint normal distribution.

Single parameter test – t test
For the population hypothesis H₀: βⱼ = 0, the t statistic is β̂ⱼ/se(β̂ⱼ), or, stated more generally,
t = (estimate − hypothesised value)/se(estimate)
which is useful when the hypothesised value is not zero. Smaller standard errors lead to higher t statistics; this, in turn, decreases the probability of the obtained t statistic under the null, meaning a lower p-value. Standard errors are calculated from standard deviations (divided by √n), which are in turn calculated from Var(β̂ⱼ). This means that for statistical significance under the CLM assumption, we want small σ², large SSTⱼ and small Rⱼ². Large samples are therefore key to statistical inference. Also remember that statistical significance is not necessarily equal to economic significance.

Single linear combination of parameters – t test
For the population hypothesis H₀: β₁ = β₂, equivalently β₁ − β₂ = 0, the t statistic is
t = (β̂₁ − β̂₂ − 0)/se(β̂₁ − β̂₂)
This can be estimated by defining θ = β₁ − β₂, substituting β₁ = θ + β₂ into the original equation so that a new combined variable appears, and reading off the t statistic on θ.

Multiple linear restrictions – F test
For the population hypothesis H₀: β₃ = 0, β₄ = 0, β₅ = 0, one cannot look at individual t tests, as the other parameters are not restricted and we are interested in the joint significance of the three (or however many) variables. One way to see this is how SSR changes with the removal of these three variables. We therefore have an unrestricted (original) model and a restricted model, which is the original model after removing the variables we wish to restrict (indicated in H₀). The F test is then written
F = [(SSR_restricted − SSR_unrestricted)/(df_restricted − df_unrestricted)] / [SSR_unrestricted/df_unrestricted]
If the null is rejected, then β₃, β₄ and β₅ are jointly statistically significant. The F test is also useful for testing the exclusion of a group of variables when they are highly correlated. It may often be the case that many similar variables are not significant under the t test but are jointly significant under the F test. This is where the F test becomes very important, as we then do not need to drop variables due to multicollinearity. The F statistic shown for each regression by Stata tests the hypothesis that all parameters are jointly equal to zero.
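A minimal sketch of the exclusion F test computed directly from the two SSRs (simulated data; the last three regressors are the ones restricted to zero):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 300
    X = rng.normal(size=(n, 5))
    y = 1 + X[:, 0] + 0.2 * X[:, 3] + 0.2 * X[:, 4] + rng.normal(size=n)

    Xu = sm.add_constant(X)           # unrestricted: all five regressors
    Xr = sm.add_constant(X[:, :2])    # restricted: drop x3, x4, x5
    ssr_u = sm.OLS(y, Xu).fit().ssr
    ssr_r = sm.OLS(y, Xr).fit().ssr

    q, df_u = 3, n - 5 - 1            # 3 restrictions; unrestricted df
    F = ((ssr_r - ssr_u) / q) / (ssr_u / df_u)
    print(F, stats.f.sf(F, q, df_u))  # F statistic and its p-value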
Multiple linear restrictions – Lagrange multiplier statistic (n-R-squared statistic)
This test, an alternative to the F test, is performed as follows:
1) Regress y on the restricted model and save the residuals û.
2) Regress the saved û on the unrestricted set of regressors and obtain the R².
3) LM = nR²; compare this to a critical value (chi-squared) to test the hypothesis.

OLS large sample properties

As the sample size grows (for large samples), OLS has some additional properties besides having the smallest variance and being unbiased (applicable to finite samples). This allows us to relax some of the OLS assumptions previously discussed. These properties are:

Consistency
As n grows, β̂ⱼ collapses to βⱼ, meaning the estimate gets closer and closer to the actual population parameter. This essentially means that there is no bias and the parameter is consistently correctly estimated. The assumption required for this to hold is
E(u) = 0 and Cov(xⱼ, u) = 0
Note that this is a slightly less strict assumption than assumption 4 of OLS for a finite sample; it states that the covariance between each variable individually and the error term should be zero. If this assumption does not hold, the parameter on the variable that is correlated with the error term, as well as the parameters on all other variables that are correlated with this variable or with the error, will be biased and inconsistent. This inconsistency does not vanish as the sample size increases; β̂ⱼ then collapses to an incorrect population value.

Asymptotic normality
The t, F and LM tests rely on a normal distribution of u in the population. By the central limit theorem, OLS estimates (and the error term) are approximately normally distributed in large samples (roughly n > 30), so we can use these tests in large samples even if the errors appear non-normally distributed (there are certain cases where non-normality may still be an issue). This means that the CLM assumption is generally not required for OLS hypothesis testing. Note that the zero conditional mean and homoscedasticity assumptions are still required. Other consequences of asymptotic normality are that the estimated error variance is consistent and that standard errors are expected to shrink at a rate of 1/√n.

Asymptotic efficiency
If the OLS assumptions hold, OLS has the smallest asymptotic variance of all estimators. If heteroscedasticity is present, there may exist better estimators than OLS.

Transformation of variables

Scaling data does not change any measured effect or testing outcome; only the interpretation changes. It may be useful in certain scenarios to run a standardized model with only beta coefficients (also called standardized coefficients), as this gives an indication of the magnitude of the effect of the different independent variables on the dependent variable. This is done by taking the z-score of all variables; the interpretation is then the standard deviation change in y for a one standard deviation change in x.
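A minimal sketch of standardized (beta) coefficients (simulated data; the raw coefficients are on incomparable scales, the standardized ones are not):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 400
    x1 = rng.normal(scale=10.0, size=n)   # measured on a large scale
    x2 = rng.normal(scale=0.1, size=n)    # measured on a small scale
    y = 2 + 0.05 * x1 + 8.0 * x2 + rng.normal(size=n)

    def z(v):
        return (v - v.mean()) / v.std()

    res = sm.OLS(z(y), np.column_stack([z(x1), z(x2)])).fit()
    print(res.params)   # effect sizes in standard deviation units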
Logs are useful for obtaining elasticities or semi-elasticities. Further, taking the natural log of a variable may increase normality and reduce heteroscedasticity by drawing in large values (this also increases the likelihood of statistical significance, as there is less variance in the error term). This is particularly useful for significantly skewed variables, where the central limit theorem is unlikely to hold (the CLM assumption is therefore violated). Also, the impact of outliers is reduced. It should, however, be noted that the log of a variable is a new variable with a different interpretation than the original variable. Further, a log should not be taken of a variable with many values between 0 and 1, or of a variable with zero values. A constant can be added if there are few zero values, but this is generally not preferred. Generally, it is not preferred to transform a variable; outliers should rather be treated separately. It only makes sense if a variable is greatly positively skewed (or you are estimating elasticities). Further, taking the log of the variable of interest makes little sense: you cannot argue causality for a log-transformed variable, as the variable (particularly its variance) is not the same as the non-transformed variable. Of course, if a variable has a log-linear relationship with the dependent variable, the log must be taken, otherwise the model will be misspecified and there will be bias in the parameters.

Quadratic terms are also common; just remember that the interpretation of such a term requires the partial derivative of the equation. The adjusted R² is particularly useful for determining whether a quadratic term should be included in addition to the non-quadratic variable. Again, if a variable has a quadratic relationship with the dependent variable, the quadratic term must be included, otherwise the model is misspecified and the estimates are biased. Logs and quadratic terms are the most common functional forms for variables.

As noted, the zero conditional mean error assumption will not hold if a model has functional form misspecification, meaning there is an omitted variable that is a function of an included independent variable. One way to test for additional functional forms is with the F test after including additional transformed variables. Other tests are:

1. RESET (regression specification error test). Run the regression and save the fitted values ŷ; calculate ŷ², ŷ³, … Run a regression that is the same as the original but adds these calculated values as variables. Conduct an F test on the parameters of the newly added variables (H₀: all are nil). If rejected, there is misspecification that needs further consideration.

2. Davidson-MacKinnon test (non-nested model selection). This test is useful for testing whether some independent variables should be logged. Run the alternative model that includes the logged variable and save the fitted values ŷ. Run the original model with these fitted values as an additional independent variable and see whether this variable is significant. If it is, the logged variable should likely be preferred.

Qualitative independent variables should be transformed into dummy categories. If the dependent variable has a log functional form, the interpretation is a percentage change. Where there are multiple binary or ordinal variables, the intercept takes the interpretation of all the 0 categories (the base group). Binary variables can also be used in interaction terms to obtain an additional intercept (binary interacted with binary) or a different slope (binary interacted with continuous). Binary variables can also be used to determine whether, for example, females and males follow different models; this is done by interacting the binary variable with all variables, keeping the original variables, and using the F test where the non-interacted model is the restricted model.
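A minimal sketch of that group-difference (Chow-style) F test via interactions (simulated data; names like wage, educ and female are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 500
    df = pd.DataFrame({"female": rng.integers(0, 2, n),
                       "educ": rng.normal(12, 2, n)})
    df["wage"] = (1 + 0.5 * df.educ + 0.8 * df.female
                  + 0.1 * df.female * df.educ + rng.normal(size=n))

    restricted = smf.ols("wage ~ educ", data=df).fit()
    unrestricted = smf.ols("wage ~ educ + female + female:educ", data=df).fit()

    q = restricted.df_resid - unrestricted.df_resid   # number of restrictions
    F = ((restricted.ssr - unrestricted.ssr) / q) / \
        (unrestricted.ssr / unrestricted.df_resid)
    print(F, stats.f.sf(F, q, unrestricted.df_resid))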
It may also be useful to include a lagged dependent variable in the model. This new independent variable will control for unobservable historical factors that cause current differences in the dependent variable.

Models for limited dependent variables

A limited dependent variable is a variable with a substantially restricted range of values, such as binary variables and some discrete variables. Models with such dependent variables can be estimated by OLS, discussed first, although this presents some issues; more advanced estimators are therefore required in most cases. The predominant reason is that the dependent variable will not follow a normal distribution.

Linear probability model (LPM) for binary dependent variables

This model is run in exactly the same manner as a continuous dependent variable model, with OLS as the estimator, and hypothesis testing remains the same. The only difference is in interpreting the parameter estimates: they are interpreted as the change in the probability of success (y = 1) when x changes, ceteris paribus. Mathematically,
ΔP(y = 1|x) = βⱼΔxⱼ
This model is very easy to run and interpret but has some issues. Some predicted probabilities (for individual cases) will exceed 1 or be less than 0, which is nonsensical. Further, probability cannot in fact be linearly related to the independent variables as this model assumes; for example, the probability of being employed is not a linear function of the number of children one has. These prediction problems can be handled by taking ỹ = 1 if ŷ ≥ 0.5 and ỹ = 0 if ŷ < 0.5, and then seeing how often the prediction is correct. This goodness-of-fit measure is called the percentage correctly predicted.

The major issue with this model is that heteroscedasticity is always present, so the standard errors under the t or F test cannot be trusted. The preferred approach is to use robust tests, since weighted least squares can be complex to calculate here.

Logit and probit models for binary dependent variables

Logit and probit models address the issues mentioned for the LPM. They allow for non-linear parameters, and the predicted probability is always between 0 and 1. For both logit and probit, we are interested in the response probability, written
P(y = 1|x₁, x₂, …, xₖ) = P(y = 1|X)
if we take X as all independent variables. Written in functional form together with parameters, this is
P(y = 1|X) = G(β₀ + β₁x₁ + … + βₖxₖ) = G(β₀ + Xβ)
Note that the shorthand G(β₀ + Xβ) can also be written G(xβ) for simplicity. Since we are concerned with probability, it is required that, for all real numbers z,
0 < G(z) < 1
We therefore need a function G(z) that adheres to this requirement. The most common choices are the logistic function (used in the logit model) and the standard normal cumulative distribution function (used in the probit model). Both of these distributions are non-linear and look very similar (the logistic distribution has heavier tails). They are useful because they imply that the response probability changes fastest around the middle of the distribution and more slowly as it approaches zero or one. In the logit model,
G(z) = exp(z)/[1 + exp(z)]
and in the probit model G(z) = Φ(z), the standard normal cdf, whose density is
φ(z) = (2π)^(−1/2) exp(−z²/2)
The probit model is more popular than the logit model, since it is often assumed that the errors are normally distributed.
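A minimal sketch fitting the LPM, logit and probit to the same simulated binary outcome (all names hypothetical; robust standard errors for the LPM because it is always heteroskedastic):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    n = 1000
    x = rng.normal(size=n)
    X = sm.add_constant(x)
    y = (0.3 + 0.8 * x + rng.logistic(size=n) > 0).astype(float)

    lpm = sm.OLS(y, X).fit(cov_type="HC1")
    logit = sm.Logit(y, X).fit(disp=0)
    probit = sm.Probit(y, X).fit(disp=0)
    print(lpm.params, logit.params, probit.params)

    # percentage correctly predicted, using the 0.5 cut-off
    print(((logit.predict(X) >= 0.5) == y).mean())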
Since both the logit and probit models rely on non-linear parameters, we use maximum likelihood estimation (MLE) to estimate them.

Maximum likelihood estimation for logit and probit models
The MLE estimator is based on the distribution of y given x and is therefore very important for estimating probit or logit models. To see how MLE for LDVs is carried out, we first write the density of y given x as
f(y|xᵢ; β) = [G(xᵢβ)]^y [1 − G(xᵢβ)]^(1−y)
From this, we get the log-likelihood function by taking the log of the density:
ℓᵢ(β) = yᵢ log[G(xᵢβ)] + (1 − yᵢ) log[1 − G(xᵢβ)]
Summing ℓᵢ(β) over all n observations gives the log-likelihood for the sample, L(β). Under MLE, β̂ is obtained by maximizing L(β). If we use G(z) as in the logit model, we call this the logit estimator; if we use G(z) as in the probit model, we call this the probit estimator. MLE under general conditions is consistent, asymptotically normal and asymptotically efficient.

Hypothesis testing (likelihood ratio test)
Normal t tests are reported after using the logit or probit estimator; these can be used for single hypothesis tests. For multiple hypothesis testing, we use the likelihood ratio test, which considers the difference in the log-likelihood of the unrestricted and restricted models. The likelihood ratio statistic is
LR = 2(L_ur − L_r)
Note that the difference in log-likelihood is multiplied by two so that the statistic follows a chi-squared distribution; p-values are therefore also obtained from this distribution.

Interpreting logit and probit
Since the econometric package automatically estimates and calculates all of the above, the most challenging part of logit and probit models is interpreting them. The sign of the obtained coefficients can be interpreted as usual, but since the model is non-linear, the magnitude of the estimated coefficients does not give rise to useful interpretation. If the variable of interest is binary, its partial effect can be obtained as
G(β₀ + β₁ + β₂x₂ + …) − G(β₀ + β₂x₂ + …)
If the variable of interest is discrete, its partial effect can be obtained as
G(β₀ + β₁(x₁ + 1) + β₂x₂ + …) − G(β₀ + β₁x₁ + β₂x₂ + …)
If the variable of interest is continuous, we need to take the partial derivative, which gives
g(β₀ + xβ)βⱼ
where g is the derivative of G. To compare the estimated parameters with OLS, we make use of scale factors based on the partial effects. This is done by Stata, and the most useful is the average partial effect (APE). It is therefore standard to estimate a model by LPM, probit and logit and compare the estimated coefficients.
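A minimal sketch of computing APEs with statsmodels (simulated data; get_margeff averages the partial effects over the sample, making the probit coefficients comparable to the LPM):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 1000
    x = rng.normal(size=n)
    X = sm.add_constant(x)
    y = (rng.normal(size=n) < 0.2 + 0.6 * x).astype(float)

    probit = sm.Probit(y, X).fit(disp=0)
    ape = probit.get_margeff(at="overall")   # average partial effects
    print(ape.summary())
    print(sm.OLS(y, X).fit().params)          # LPM coefficients for comparison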
Tobit model for a continuous dependent variable with many zero observations

Using a linear estimator for models with a continuous dependent variable with many zero observations (for instance, the number of cigarettes smoked per month over the population) will give negative predictions of ŷ, and heteroscedasticity will be present. It is therefore preferred to use a non-linear estimator that does not allow negative values of ŷ (meaning the estimated parameters are more reliable). Similar to the probit and logit models, for the tobit model we use MLE as the estimator, maximizing the sum of the following log-likelihood function:
ℓᵢ(β, σ) = 1(yᵢ = 0) log[1 − Φ(xᵢβ/σ)] + 1(yᵢ > 0) log[(1/σ) φ((yᵢ − xᵢβ)/σ)]
where Φ indicates the standard normal cdf and φ the standard normal pdf. This can be called the tobit estimator. Hypothesis testing is conducted in the same manner as for the logit and probit models.

Interpretation of the tobit model
In interpreting the tobit model, we again rely on partial derivatives. These are used to calculate APEs that can be compared to an OLS estimation of the same model and interpreted as usual (not as probabilities, as for binary dependent variables). APEs are routinely calculated by Stata.

Poisson regression model for count dependent variables

A count variable takes on non-negative integer values (not continuous, as for the tobit model). Again, we are only really interested in this model if the count variable can also be considered an LDV, meaning the dependent variable does not take on many integer values (e.g. number of children in a household). In other words, the dependent variable will not follow a normal distribution, but rather a distribution suited to counts, the Poisson distribution. This distribution can be written
P(y = h|x) = exp[−exp(xβ)] [exp(xβ)]^h / h!
where h is a count value and h! means h factorial. Note that exponential functions are used because they are strictly positive. The log-likelihood function is therefore
ℓᵢ(β) = yᵢxᵢβ − exp(xᵢβ)
and the sum over n is again maximized by MLE; t statistics are given, and we can use APEs to compare the coefficients with OLS. It is, however, very important to note that the Poisson distribution assumes
Var(y|x) = E(y|x)
which is very restrictive and unlikely to hold. If this cannot be assumed, we should rather use quasi-MLE (QMLE) as the estimator, together with the quasi-likelihood ratio statistic for multiple hypotheses.

Censored regression model for a censored dependent variable

The dependent variable is censored if a threshold was imposed during data collection, meaning the dependent variable cannot take on a value greater than a certain value cᵢ (or less than, for a lower-bound threshold). An example is a questionnaire where you tick a box if your income is above a certain amount (with no higher possible selections). Although the uncensored observations have a normal distribution (and pose no difficulty for OLS), the censored observations (values above the threshold are not observed) do not. The density for the censored observations is
P(y ≥ cᵢ|x) = 1 − Φ[(cᵢ − xᵢβ)/σ]
This means that we can again use MLE after taking the log-likelihood, where MLE maximizes the sum. The interpretation of the estimates does not require any scaling; they are directly comparable with OLS. It should, however, be noted that in the presence of heteroscedasticity or non-normal errors, MLE will be biased and inconsistent.
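Going back to the tobit log-likelihood given above, here is a minimal sketch of maximizing it directly (simulated corner-solution data; scipy minimizes the negative log-likelihood, and σ is parameterized on the log scale to keep it positive):

    import numpy as np
    from scipy import optimize, stats

    rng = np.random.default_rng(8)
    n = 1000
    x = rng.normal(size=n)
    y_star = 0.5 + 1.0 * x + rng.normal(size=n)   # latent outcome
    y = np.maximum(y_star, 0.0)                   # observed, corner at zero

    def neg_loglik(theta):
        b0, b1, log_sigma = theta
        sigma = np.exp(log_sigma)
        xb = b0 + b1 * x
        ll_zero = stats.norm.logcdf(-xb / sigma)              # log[1 - Phi(xb/s)]
        ll_pos = stats.norm.logpdf((y - xb) / sigma) - np.log(sigma)
        return -np.where(y == 0, ll_zero, ll_pos).sum()

    res = optimize.minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
    b0, b1, log_sigma = res.x
    print(b0, b1, np.exp(log_sigma))   # close to 0.5, 1.0, 1.0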
Heteroscedasticity

Heteroscedasticity under OLS

Heteroscedasticity does not cause bias or inconsistency in the OLS estimates and does not influence R² or adjusted R². It does, however, bias the variance of the OLS estimates, resulting in incorrect standard errors and t, F and LM test results. OLS is then also no longer asymptotically most efficient amongst linear estimators. The first step is to test for heteroscedasticity, and the second to address it. Note that incorrect functional forms may indicate heteroscedasticity even when none is present; it is therefore important to first test whether the functional forms are correct.

Testing for heteroscedasticity
The two most common tests are the Breusch-Pagan test and the special case of the White test for heteroscedasticity. For the Breusch-Pagan test, OLS is run, û is saved and û² is calculated. û² is regressed on the original regressors, and an F or LM test is conducted for the null hypothesis that all parameters are equal to nil. If the null is rejected, heteroscedasticity is present. For the special case of the White test, OLS is run, û and ŷ are saved, and û² and ŷ² are computed. û² is regressed on ŷ and ŷ², and the null is that the parameters of these two are equal to nil. If the null is rejected, heteroscedasticity is present. This test specifically detects the type of heteroscedasticity that biases the variances under OLS.

It is important to note that both these tests require the errors of the second regression, vᵢ, to be homoscedastic, E(vᵢ²|x) = k (k a constant). This implies for the original error that E(uᵢ⁴|x) is also a constant. This is called the homokurtosis assumption. There are heterokurtosis-robust tests for heteroskedasticity as well, but these are seldom used (see page 141 in Wooldridge (2010) if interested).

Correcting heteroscedasticity under OLS
For large samples, correcting heteroscedasticity is straightforward. All methods use an alternative way of calculating standard errors that is correct in the presence of heteroscedasticity. Robust (Huber-White) standard errors can be calculated for the t test (the robust option in Stata). Note that the same factors that influence the size of normal standard errors influence these standard errors. For exclusion restrictions, the robust F statistic (also called the Wald statistic) can be calculated (the test command in Stata). It is also possible to calculate a robust LM statistic, although the Wald statistic is more popular and should suffice.
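A minimal sketch of the Breusch-Pagan LM version, following the steps above (simulated data with error variance that grows with x; statsmodels.stats.diagnostic.het_breuschpagan computes the same test):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(9)
    n = 500
    x = rng.uniform(1, 5, size=n)
    y = 1 + 2 * x + rng.normal(scale=x, size=n)   # heteroskedastic errors

    X = sm.add_constant(x)
    u2 = sm.OLS(y, X).fit().resid ** 2

    aux = sm.OLS(u2, X).fit()     # regress squared residuals on the regressors
    LM = n * aux.rsquared
    print(LM, stats.chi2.sf(LM, df=1))   # small p-value rejects homoscedasticity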
Weighted least squares (WLS)

The WLS estimator gives different estimates and standard errors than OLS. That said, large differences in estimates indicate that the other OLS assumptions do not hold or that there is functional form misspecification. WLS is more efficient than OLS with robust standard errors, assuming all OLS assumptions besides homoscedasticity hold and the heteroscedasticity function (the weight) for WLS is correctly identified (WLS is then BLUE). If we write
Var(uᵢ|xᵢ) = σ²h(xᵢ) = σ²hᵢ
where h(xᵢ) is some function of the explanatory variables that determines the heteroscedasticity, the standard deviation of uᵢ is σ√hᵢ. Dividing by √hᵢ gives σ, the standard deviation if heteroscedasticity were not present. To do this, we weight every variable in the original OLS model, including the dependent variable and the intercept, by 1/√hᵢ. After dividing, the estimators are written βⱼ*; this is an example of generalised least squares (GLS), estimated by OLS. The WLS model does exactly the same as OLS with GLS estimators; the only difference is that we do not calculate the GLS estimators but rather weight the entire least squares problem by 1/hᵢ (note: not the square root). WLS therefore minimises the weighted sum of squared residuals, where each squared residual is weighted by 1/hᵢ.

Specifying the weighting function hᵢ is therefore the key. In a simple model with one independent variable, the weighting function must be that independent variable; we then do not need a GLS estimator to estimate WLS. For more complex models, we need to estimate the weighting function, meaning we again need a GLS estimator to estimate by WLS. This is done by estimating feasible GLS (FGLS), which has the following steps (see the sketch below):

• Run y on x₁, x₂, …, xₖ and obtain the residuals û.
• Square and then log to obtain log(û²).
• Run log(û²) on x₁, x₂, …, xₖ and obtain the fitted values ĝ.
• Compute ĥ = exp(ĝ).
• Estimate the original equation by WLS using 1/ĥ as weights.

Note that using FGLS makes WLS biased but consistent and more efficient than OLS. It is therefore a good idea to run both WLS and OLS with robust standard errors. Robust standard errors should also be calculated for WLS, since the weighting function may be incorrect, meaning heteroscedasticity remains present; WLS should then still be more efficient than OLS (both with robust standard errors).
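A minimal sketch of those FGLS steps (simulated data; sm.WLS takes weights proportional to the inverse of the error variance):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(10)
    n = 500
    x = rng.uniform(1, 5, size=n)
    X = sm.add_constant(x)
    y = 1 + 2 * x + rng.normal(scale=np.sqrt(x), size=n)

    ols = sm.OLS(y, X).fit()
    log_u2 = np.log(ols.resid ** 2)                # step 2
    g_hat = sm.OLS(log_u2, X).fit().fittedvalues   # step 3
    h_hat = np.exp(g_hat)                          # step 4

    # step 5, with robust SEs in case h is misspecified
    fgls = sm.WLS(y, X, weights=1.0 / h_hat).fit(cov_type="HC1")
    print(ols.bse, fgls.bse)   # FGLS standard errors are typically smaller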
Measurement error

Measurement error is not the same as taking a proxy. A proxy is used where we have an unobserved factor and we take an observable variable that is likely correlated with that unobserved factor. This is always a good idea: even if it increases multicollinearity, it will lead to smaller standard errors and less biased estimates. An example is IQ for ability. Measurement error is where we have an observable variable, but the variable is measured with error, for instance actual income versus income declared for tax purposes.

If the measurement error is in the dependent variable, it is generally not a problem. It is then simply assumed that the measurement error is random and not correlated with the independent variables; OLS remains unbiased and consistent as long as this assumption holds.

Measurement error in the independent variables is a problem. If it can be assumed that the covariance between the measurement error and the mismeasured variable included in the model is nil, there is no bias and OLS is BLUE. This is, however, unlikely to be the case. The assumption that generally needs to be made is
Cov(x₁*, e₁) = 0
where x₁* is the true variable that should be in the model and e₁ is the measurement error, calculated as
e₁ = x₁ − x₁*
where x₁ is the variable included in the model that contains the measurement error. This assumption is called the classical errors-in-variables (CEV) assumption. It leads to bias and inconsistency in the OLS estimates; this bias is called attenuation bias. The bias is towards zero: e.g. if β₁ is positive, β̂₁ will underestimate β₁. If any other variable is correlated with the mismeasured variable, those estimates will also be biased and inconsistent. An alternative estimator to OLS is therefore required to obtain unbiased and consistent estimates when there is measurement error in the independent variables.

One way to resolve the measurement error bias is with instrumental variables (IV) (refer below for a discussion). Taking x₁ = x₁* + e₁, the model including the measurement error can be written
y = β₀ + β₁x₁ + β₂x₂ + (u − β₁e₁)
In the above model, it is assumed that all independent variables are exogenous. The requirement for a valid IV is that it is correlated with x₁ and not correlated with u or e₁. If we have two measures of x₁, the second measure can be taken as an IV. Otherwise, we can always take other excluded exogenous variables as IVs. By doing this, we correct the attenuation bias.
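A minimal sketch of attenuation bias under the CEV assumption (simulated data; with equal variances for x₁* and e₁, the OLS slope converges to half the true value):

    import numpy as np

    rng = np.random.default_rng(11)
    n = 20_000
    x_star = rng.normal(size=n)             # true regressor
    e1 = rng.normal(size=n)                 # classical measurement error
    x1 = x_star + e1                        # observed, mismeasured regressor
    y = 1 + 2.0 * x_star + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    # attenuation: plim of slope is 2 * Var(x*)/(Var(x*) + Var(e1)) = 1.0
    print(b[1])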
Non-random sampling

Non-random sample selection generally violates OLS assumption 2. There are certain instances where OLS remains BLUE even though this assumption is violated: 1) the missing data are random, and the reason for the missing data is therefore not correlated with any endogenous or unobservable variables (or the error) in the model; 2) the sample is selected based on the level of an exogenous independent variable (called exogenous sample selection), e.g. only adults older than 40 are included in the sample and age is an independent variable; 3) the sample is selected based on a variable exogenous to the model.

OLS will, however, be biased if: 1) the missing data are not random and the reason is endogenous to the model or correlated with the error; 2) the sample is selected based on the level of the dependent variable, e.g. where firm size is the dependent variable and only the biggest 20 firms are sampled; 3) the sample is selected based on an endogenous variable in the model. The key question is therefore whether sample selection is endogenous or exogenous. If endogenous, special methods are required to correct it.

Truncated regression model
Where we only sample observations based on the level of the dependent variable relative to a threshold cᵢ, we have non-random sampling and OLS will be biased. For example, we only sample households if their earnings are above R10 000 per month. Our sample will then no longer follow a normal distribution and, similar to limited dependent variables, we require an alternative distribution. For truncated regression models, this is written
f(y|xᵢ, cᵢ) = f(y|xᵢβ, σ²)/F(cᵢ|xᵢβ, σ²), y ≤ cᵢ
From this, we can take the log-likelihood function and use MLE to maximize the sum over all observations (Stata does this). The interpretation is the same as for OLS. In the presence of heteroscedasticity or non-normal errors, MLE will, however, be biased and inconsistent.

Incidental truncated models
For truncated models, the truncation is generally applied by choice of the data collector. It is also possible that truncation occurs incidentally: we take a random sample, but due to truncation, the sample is non-random for estimation purposes. Under incidental truncation, whether we observe y depends on external factors. If we, for instance, collect data on labour variables, some observations will have zero wage, meaning wage is only observed where there is labour force participation. We will still have observations on the other variables, but not on wage. If wage is then used as the dependent variable, OLS will be biased. To correct for this, we follow the Heckman method (the heckman command in Stata):
1) First, estimate a selection equation with the probit estimator using all observations. This equation can be written
s = zγ + v
where s = 1 where we observe yᵢ and zero otherwise (we make s binary), and z is a set of independent variables that includes all the population variables x and at least one additional variable that is correlated with s (the selection process); γ are parameters as usual.
2) Compute the inverse Mills ratio, λ̂ᵢ = λ(zᵢγ̂).
3) Run OLS of yᵢ on xᵢ and λ̂ᵢ.
The significance of the parameter on λ̂ᵢ indicates whether selection bias is present. If this parameter is not zero, the OLS test statistics are not computed correctly and an adjustment is required (Wooldridge 2010).

Outliers

Studentized residuals, leverage and Cook's distance are useful for detecting outliers in the sample. This is important because OLS squares the residuals and is therefore very sensitive to outliers. It is generally recommended to report results with and without outliers, unless an outlier is clearly the result of a data-capturing error. It may also be preferred to use an alternative estimator as a supplement to OLS, such as:

Least absolute deviations (LAD)
LAD minimizes the sum of the absolute values of the residuals and is therefore less sensitive to outliers. It should be noted that the estimated parameters give the conditional median, not the conditional mean as in the case of OLS. This means that unless the residuals are symmetrically distributed around a zero mean under LAD, the results will differ greatly from OLS and be biased. Further, the t, F and LM test statistics are only valid in large samples under LAD.

Testing whether a variable is endogenous

The tests used to check whether the assumptions of an estimator hold are called specification tests. A key assumption for unbiased and consistent estimates is that all variables are exogenous and not correlated with the error term. To perform this test, we need to understand the instrumental variable (IV) and two-stage least squares (2SLS) estimators (discussed below), and we need at least one instrument for each perceived endogenous variable. The test is conducted as follows:
1) Estimate each (perceived) endogenous variable in its reduced form (on all exogenous variables).
2) Save the residuals from each estimation.
3) Include the residuals as new variables in the structural equation and test their significance (t test if one endogenous variable, F test if more than one).
It is important to use robust test statistics for both types of tests. If the residuals are not significant, the perceived endogenous variable is exogenous (take robust standard errors). OLS can then be preferred if this is the case for all perceived endogenous variables, since OLS will be Best. This test is the same as the first steps of the control function estimator discussed later, so also refer to that section.

Independently pooled cross section

To increase sample size, or to estimate the impact of a natural or quasi-experiment, we may wish to pool two cross sections. This can only be done if the two or more samples of cross-sectional data are drawn randomly from the same population at two or more different points in time. All cross-sectional methods discussed can be applied to pooled cross sections. Since the two samples are not drawn at the same time, the variables will not be identically distributed between the two periods. To correct this, it is required to include a dummy variable for each year/time period (besides year 1, generally) in the regression, which will control for changes between years. It is often useful to interact this dummy with other variables to determine how they have changed over time.
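A minimal sketch of pooling two cross sections with a period dummy and testing whether the slope changed over time (simulated data; compare_f_test performs the exclusion F test):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(12)
    n = 600
    df = pd.DataFrame({"year2": rng.integers(0, 2, n),   # 1 = later period
                       "educ": rng.normal(12, 2, n)})
    df["wage"] = (1 + 0.5 * df.educ + 0.3 * df.year2
                  + 0.05 * df.year2 * df.educ + rng.normal(size=n))

    # year2 controls for the level shift; the interaction measures how
    # the return to educ changed between the two periods
    res_u = smf.ols("wage ~ educ + year2 + year2:educ", data=df).fit()
    res_r = smf.ols("wage ~ educ", data=df).fit()
    print(res_u.params)
    print(res_u.compare_f_test(res_r))   # (F statistic, p-value, df)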
It is further possible that the functional forms of the variables in the regression should not be the same for the different periods. This can be tested with an F test, in the same manner as was done for model selection, by conducting the test on each time period individually.

The greatest benefit of pooled cross sections arises when a difference-in-differences (DD) estimator is used to estimate the effect of a change in policy or an exogenous event. For this estimator, we have a treatment and a control group, and pre- and post-event (or policy change) observations for each group. The difference-in-differences estimate can be written as
δ̂₁ = (ȳ₂,treatment − ȳ₂,control) − (ȳ₁,treatment − ȳ₁,control)
To estimate δ₁ and obtain its standard error, we regress
y = β₀ + δ₀d2 + β₁dT + δ₁d2·dT + other factors + error
where d2 is a dummy for the post-event time period and dT is a dummy equal to 1 for the treatment group and 0 for the control group. The following table indicates the interpretation of the parameters:

                                  Pre         Post                   Post-Pre difference
  Control                         β₀          β₀ + δ₀                δ₀
  Treatment                       β₀ + β₁     β₀ + δ₀ + β₁ + δ₁      δ₀ + δ₁
  Treatment-Control difference    β₁          β₁ + δ₁                δ₁

Suppose, for instance, the model measures the change in student attendance (y) after giving free internet access on one campus (treatment) but not on another campus (control), the population being students. Then β₀ indicates the attendance of the control group before free internet; β₀ + β₁ the attendance of the treatment group before free internet; β₀ + δ₀ the attendance of the control group after free internet; and β₀ + δ₀ + β₁ + δ₁ the attendance of the treatment group after free internet. Taking the difference between treatment and control, pre and post (the difference in differences), gives us δ₁, the estimated effect of giving free internet. Of course, for this to be causal, we have to control for all other relevant factors; otherwise, the obtained estimate is most likely biased.

It is also possible to use a difference-in-difference-in-differences (DDD) estimate. If we have attendance data for another university that did not provide free internet on its campus for the time periods used, we can use this as an additional 'difference indicator'. If dU is this variable, the model is
y = β₀ + δ₀d2 + β₁dT + δ₁d2·dT + β₂dU + β₃dU·dT + δ₂d2·dU + δ₃d2·dU·dT + other factors + error
The coefficient of interest is therefore δ₃. It is, of course, also possible to use more time periods with either the DD or the DDD estimate.
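A minimal sketch of the DD regression above (simulated data; the coefficient on the d2·dT interaction is the DD estimate δ₁):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(13)
    n = 800
    df = pd.DataFrame({"d2": rng.integers(0, 2, n),   # post-event period
                       "dT": rng.integers(0, 2, n)})  # treatment group
    effect = 2.0                                      # true delta_1
    df["y"] = (5 + 1.0 * df.d2 + 0.5 * df.dT
               + effect * df.d2 * df.dT + rng.normal(size=n))

    res = smf.ols("y ~ d2 + dT + d2:dT", data=df).fit()
    print(res.params["d2:dT"])   # estimate of delta_1, close to 2.0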
Cluster samples
In cluster sampling, clusters are randomly sampled from a population of clusters and units of observation are sampled from the clusters. An example is siblings (units) sampled from families (clusters), where the population is all families (the population of clusters). It is very important that clustering not be done ex post (for instance, obtaining a random sample of individuals and clustering them into families), as this will result in incorrect standard errors. Matched pairs samples are also applicable to this section. The benefit of cluster sampling is that a fixed cluster effect that influences all of the units in the cluster can be controlled for in the model. Note that if the key independent variable only changes at the cluster level and not at the unit level, we would not want to include a fixed cluster effect. To include a fixed cluster effect, we use panel data methods (the first-difference estimator, fixed effects estimator, random effects estimator, correlated random effects model or pooled OLS) to control for the cluster effect. These methods are discussed in the section on panel data. Note that if pooled OLS is used after cluster sampling, the errors will have cluster correlation and cluster-robust standard errors need to be used.

Instrumental variable (IV) estimator
The main assumption for unbiased estimates is that the independent variables and the unobservable variables are not correlated (we assume that we have included all relevant observable variables as independent variables). If this does not hold we have a few options:
1. Ignore the problem and indicate the direction of the bias. This is not ideal, but we may still learn something.
2. Include proxy variables for the unobserved variables. It may be difficult to find applicable proxies.
3. Control for the time-constant unobservable variables by including fixed effects. Refer to the cluster sampling discussion and panel data methods.
Another popular method is the instrumental variable (IV) estimator. The IV estimator obtains consistent (although biased) estimates where the OLS estimates would be biased and inconsistent due to unobservable variable bias. The IV estimator is, therefore, most useful in large samples. To use the IV estimator, we first have to identify an IV or instrument. Take the simple regression model

$y = \beta_0 + \beta_1 x + u$

where $Cov(x, u) \neq 0$, so that the estimated parameter $\hat{\beta}_1$ is inconsistent and biased under OLS. If we take a new variable $z$ that adheres to the following assumptions

$Cov(z, u) = 0$ and $Cov(z, x) \neq 0$

then $z$ is a valid instrument for $x$. Note that the first assumption means that the IV may not have a partial effect on the dependent variable after controlling for the independent variables, meaning that the IV must be exogenous in the original equation. Because the error cannot be observed, we cannot test the first assumption and need to rely on logic and theory to argue it. The second assumption can easily be tested by regressing $x$ on $z$. It is important that the direction of the estimated correlation is aligned with logic and theory. Where an endogenous variable is interacted with another variable, the IV for the interaction term is the IV for the endogenous variable interacted with the other variable in the model. Further, see that a good proxy is a bad IV: a proxy requires correlation with the omitted factor in the error (before including the proxy), while a good IV requires no correlation with the error.

If we have found a good IV, we can use the IV assumptions to identify the parameter $\beta_1$ (identification means we can write the parameter in terms of population moments that can be estimated). Write the simple model above as

$Cov(z, y) = \beta_1 Cov(z, x) + Cov(z, u)$

Then, using the assumption that $Cov(z, u) = 0$,

$\beta_1 = \dfrac{Cov(z, y)}{Cov(z, x)}$

and the IV estimator replaces these population covariances with their sample counterparts:

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n}(z_i - \bar{z})(x_i - \bar{x})}$

with $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$. See that if $z = x$ then the IV estimator becomes the OLS estimator. As previously mentioned, $\hat{\beta}_1$ is consistent but biased, and the IV estimator is therefore only really useful in larger samples.
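The sketch below computes the IV estimator as the ratio of sample covariances just derived, on simulated data where OLS is inconsistent; all parameter values are hypothetical.

```python
# Minimal sketch of the IV estimator as a ratio of sample covariances
# (simulated data; all numbers are hypothetical).
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)             # instrument: correlated with x, not with u
u = rng.normal(size=n)             # structural error
x = 0.8 * z + 0.5 * u + rng.normal(size=n)   # endogenous: Cov(x, u) != 0
y = 1.0 + 2.0 * x + u              # true beta_1 = 2

beta1_ols = np.cov(x, y)[0, 1] / np.cov(x, y)[0, 0]   # inconsistent here
beta1_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]    # Cov(z,y)/Cov(z,x)
print(beta1_ols, beta1_iv)         # IV should be close to 2 in large samples
```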
The above can be extended to a multivariate model. To do this we make use of structural equations and reduced forms. Given the structural equation

$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1$

the $y$ variables are interpreted as endogenous (correlated with the error term) and the $z$ variable as exogenous (not correlated with the error term). It is evident that the independent variable $y_2$ is problematic: it is endogenous and, if the model is estimated by OLS, will result in bias in all the parameters. To resolve this we can use the IV estimator, but note that $z_1$ may not serve as an IV for $y_2$, since it is already included in the model. We therefore need a new exogenous variable, $z_2$, to serve as an IV for $y_2$. We need to assume that $Cov(z_2, u_1) = 0$ and, further, that the partial correlation between $z_2$ and $y_2$ is not zero. To test the second assumption we write $y_2$ in its reduced form, meaning we write the endogenous variable in terms of exogenous variables (including IVs). This can also be done for dependent variables, in which case the reduced form parameters have an intention-to-treat interpretation, as opposed to the treatment interpretation in the structural model. The reduced form of $y_2$ is

$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + v_1$

The assumption holds if $\pi_2 \neq 0$, and the reduced form is estimated by OLS (under the assumption of no perfect multicollinearity). Note that if the model contained further exogenous variables, those would also be included in the reduced form.

Statistical inference of the IV estimator
The IV estimator is asymptotically valid under the homoscedasticity assumption

$E(u^2|z) = Var(u) = \sigma^2$

The asymptotic variance of the estimated parameter is

$Var(\hat{\beta}_1) = \dfrac{\sigma^2}{n\,\sigma_x^2\,\rho_{x,z}^2}$

where $\rho_{x,z}^2$ is the square of the population correlation between $x$ and $z$ (estimated by $R_{x,z}^2$). The asymptotic standard error of the estimated parameter is

$se(\hat{\beta}_1) = \dfrac{\hat{\sigma}}{\sqrt{SST_x}\,\sqrt{R_{x,z}^2}}$

Note that the only difference between the standard errors of OLS and IV is the term $R_{x,z}^2$. Since this is always less than one, the standard errors under IV will always be larger than under OLS (a weakness of IV). Further, if we have a poor IV, meaning weak correlation between the endogenous variable and its instrument, IV will have large asymptotic bias in addition to large standard errors. Therefore, although consistent, IV can be worse than OLS if we have a poor IV. Generally, an IV is considered weak (and should not be used) if the absolute t-stat of the IV in the reduced form model is less than about 3.2 ($\sqrt{10}$) (Stock and Yogo, 2005). The R-squared obtained from an IV estimation is not useful and should not be reported.

Two-stage least squares (2SLS) estimator
The 2SLS estimator is an IV estimator that uses multiple excluded exogenous variables to construct the instrument. This means that either more than one excluded exogenous variable is used for one endogenous independent variable, or the structural model has more than one endogenous independent variable, in which case we require at least as many excluded exogenous variables as there are endogenous independent variables. Take the structural model

$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1$

If we have two exogenous variables that are correlated with $y_2$, called $z_2$ and $z_3$, any linear combination of the exogenous variables is a valid IV for $y_2$.
The reduced form of $y_2$ is therefore

$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + v_1$

and the best IV for $y_2$ is

$y_2^* = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3$

In other words, the independent variable $y_2$ is divided into two parts: $y_2^*$ (the part that is exogenous in the structural model) and $v_1$ (the part that is endogenous in the structural model). We only wish to use the exogenous part of the variable. To estimate the model we need two OLS estimations, called the first stage and the second stage.

First stage:

$\hat{y}_2 = \hat{\pi}_0 + \hat{\pi}_1 z_1 + \hat{\pi}_2 z_2 + \hat{\pi}_3 z_3$

after which we need to test the joint significance (F test) of $\hat{\pi}_2$ and $\hat{\pi}_3$. It is very important to test this: if the F stat is less than 10 we should not proceed with the 2SLS estimator, since it will result in large asymptotic bias and large variance (Stock and Yogo, 2005).

Second stage:

$y_1 = \beta_0 + \beta_1 \hat{y}_2 + \beta_2 z_1 + u_1$

It can therefore be seen that 2SLS first purges $y_2$ of its correlation with $u_1$, and is therefore consistent where OLS would not be. Note that the econometric package automatically estimates both stages; this should not be done manually (the manually computed second-stage standard errors are wrong). Further, when the package asks for the instrumental variables, all exogenous variables (included and excluded) are given, as all of these are used in the first stage and therefore in the estimation of the IV.

The asymptotic variance of the estimated parameter is

$Var(\hat{\beta}_1) = \dfrac{\sigma^2}{\widehat{SST}_2\,(1 - \hat{R}_2^2)}$

where $\widehat{SST}_2$ is the total variation in $\hat{y}_2$ and $\hat{R}_2^2$ is the R-squared from regressing $\hat{y}_2$ on the other exogenous variables in the structural equation. See from this that 2SLS will always have a larger variance than OLS, since:
1. $\hat{y}_2$ has less variation than $y_2$ (part of its variation is in the reduced form error term), and
2. $\hat{y}_2$ is more correlated with the exogenous variables, increasing the multicollinearity problem.

Take the structural model

$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 y_3 + \beta_3 z_1 + u_1$

We would require at least two excluded exogenous variables that are partially correlated with $y_2$ and $y_3$. This means that the two or more excluded exogenous variables should be jointly significant (with an F stat greater than 10) in both of the reduced form models, for $y_2$ and for $y_3$. To use 2SLS and obtain valid estimates we need to adhere to the order condition. The order condition requires that we have at least as many excluded exogenous variables as included endogenous variables.
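The following is a minimal sketch of 2SLS computed by explicit projection on simulated data; the parameter values are hypothetical, and in practice a packaged routine should be used so that the standard errors are computed correctly.

```python
# Minimal sketch of 2SLS by explicit projection (simulated, hypothetical data).
import numpy as np

rng = np.random.default_rng(3)
n = 5000
z1 = rng.normal(size=n)                        # included exogenous variable
z2 = rng.normal(size=n)                        # excluded exogenous variables
z3 = rng.normal(size=n)
u1 = rng.normal(size=n)
y2 = 0.5 * z1 + 0.7 * z2 + 0.4 * z3 + 0.6 * u1 + rng.normal(size=n)  # endogenous
y1 = 1.0 + 2.0 * y2 + 1.5 * z1 + u1            # true beta_1 = 2, beta_2 = 1.5

X = np.column_stack([np.ones(n), y2, z1])      # structural regressors
Z = np.column_stack([np.ones(n), z1, z2, z3])  # all exogenous variables

# First stage: project X on Z (y2 is replaced by its fitted values)
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
# Second stage: OLS of y1 on the projected regressors
beta = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y1)
print(beta)   # roughly [1.0, 2.0, 1.5]
```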
A requirement for a valid instrument is that it is uncorrelated with the error term in the structural model. If we have more instruments than we need to identify an equation (more instruments than endogenous variables), we can test whether the additional instruments are uncorrelated with the error term (this is called testing the overidentifying restrictions):
1) Estimate the structural equation by 2SLS and save the residuals, $\hat{u}_1$
2) Regress $\hat{u}_1$ on all exogenous variables (instruments and included) and get the R-squared
3) The null hypothesis that all instruments are uncorrelated with $u_1$ is tested by comparing the R-squared multiplied by the sample size against a chi-square distribution with degrees of freedom equal to the number of instruments less the number of endogenous variables. If $nR^2$ exceeds the critical value of the chi-square distribution, we reject $H_0$, meaning not all instruments are exogenous. Failing to reject means the additional instruments appear valid, but only to a certain extent: it may still be that one of the additional instruments is endogenous.
4) To obtain a heteroscedasticity-robust test, we regress each endogenous variable on all exogenous variables (included variables and additional instruments; note that an exogenous variable is its own instrument) and save the fitted values ($\hat{y}_2$). Next, we regress each of the overidentifying restrictions (the instruments not needed for the model to be just identified) on the exogenous variables included in the original model and on the $\hat{y}_2$'s, and we save the residuals $\hat{r}_2$. Then we regress the residuals saved in step 1, $\hat{u}_1$, on $\hat{r}_2$ and perform the heteroscedasticity-robust Wald test on this regression.

Assumptions for 2SLS
1) The model is linear in parameters. Instrumental variables are denoted $z_j$.
2) Random sampling on $y$, the $x_j$ and the $z_j$.
3) No perfect multicollinearity among the instrumental variables, and the order condition for identification holds. This means we need at least one excluded exogenous variable (whose parameter is not zero in the reduced form equation) for each included endogenous variable. For SEMs the rank condition needs to hold (discussed below).
4) $E(u) = 0$ and $Cov(z_j, u) = 0$. Note that each exogenous independent variable is seen as its own instrumental variable; therefore all exogenous variables are denoted $z_j$.
Under assumptions 1-4, 2SLS is consistent (although biased).
If $Z$ denotes all instrumental variables (all exogenous variables), then
5) $E(u^2|Z) = \sigma^2$
Under assumptions 1-5, 2SLS is consistent and its test statistics are asymptotically valid. The 2SLS estimator is the best IV estimator under these assumptions.

If assumption 5 does not hold, then 2SLS is not the most efficient IV estimator. Homoscedasticity can be tested by saving the residuals from 2SLS and regressing their square on all exogenous variables; the null of homoscedasticity is that the exogenous variables are jointly insignificant. This is analogous to the Breusch-Pagan test. To correct heteroscedasticity under 2SLS:
1) take robust standard errors, as for OLS, or
2) use weighted 2SLS, which is done the same as for OLS, except that 2SLS is applied after the weights.

Indicator variables (multiple indicator solution)
A solution to omitted variable bias and/or measurement error exists in the use of indicator variables. These variables serve a similar purpose to proxy variables under OLS, but we require 2SLS to use them. If we have an unobserved variable $q$, we look for at least two indicators, $q_1$ and $q_2$. Both $q_1$ and $q_2$ are correlated with $q$, but $q_1$ and $q_2$ are only correlated with each other as a result of being correlated with $q$. It is further logical that neither of the indicators is an ideal proxy, otherwise we would just use it as such. This means that after including one indicator in the structural model, that indicator is endogenous. We include $q_1$ in the model and then use $q_2$ as an instrument for $q_1$. Doing this provides consistency where OLS (using $q_1$) would have been inconsistent. It is important that $q_2$ meets the normal requirements for a good and valid instrument. This approach is called the multiple indicator solution. Similarly, measurement error can be resolved if we have two indicators that measure an independent variable with error (where we do not have the correctly measured independent variable).
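Returning to the overidentification test in steps 1)-3) above, the sketch below runs the $nR^2$ version on the simulated 2SLS example from earlier (one endogenous variable, two excluded instruments, so one overidentifying restriction); all numbers are hypothetical.

```python
# Minimal sketch of the nR^2 overidentification test (steps 1-3 above).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 5000
z1, z2, z3 = rng.normal(size=(3, n))
u1 = rng.normal(size=n)
y2 = 0.5 * z1 + 0.7 * z2 + 0.4 * z3 + 0.6 * u1 + rng.normal(size=n)
y1 = 1.0 + 2.0 * y2 + 1.5 * z1 + u1

X = np.column_stack([np.ones(n), y2, z1])
Z = np.column_stack([np.ones(n), z1, z2, z3])
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y1)
u_hat = y1 - X @ beta                           # step 1: 2SLS residuals

g = Z @ np.linalg.solve(Z.T @ Z, Z.T @ u_hat)   # step 2: regress on all exog
r2 = 1 - np.sum((u_hat - g) ** 2) / np.sum((u_hat - u_hat.mean()) ** 2)
stat = n * r2                                   # step 3: nR^2 ~ chi2(1) under H0
print(stat, chi2.ppf(0.95, df=1))               # do not reject if stat < critical
```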
For OLS we would only have been able to include one of the two indicators, but with 2SLS we can use the second indicator as an IV for the first, resulting in consistent estimates (this is also discussed under measurement error).

Generated independent variables and instruments
We may wish to include as an independent variable in a model an estimated variable from another regression, called a generated regressor (Pagan, 1984). This will in most cases be the residuals from a previously estimated model, but it can also, for instance, be predicted values. Using such a variable does not result in inconsistent estimates, but the obtained test statistics are invalid. This is because there is sampling variation in the generated regressor (it was obtained from data). If the parameter of the generated regressor is not zero, then all standard errors and test statistics need to be adjusted for valid inference. A generated instrument does not cause the same problems: 2SLS remains consistent with valid test statistics (assuming the other assumptions hold). Of course, if a generated regressor is included in 2SLS, we again need to adjust the asymptotic variance.

Control Function estimator (CF)
Similar to 2SLS, CF is aimed at removing endogeneity. This is done by using extra regressors (not in the structural model) to break the correlation between the endogenous variable and the error. Take

$y_1 = z_1\delta_1 + \gamma y_2 + u_1$

where $z_1$ contains all the exogenous variables in the structural model and $y_2$ is the endogenous variable. If we have at least one additional exogenous variable that is not included in the structural model, the reduced form of $y_2$ is

$y_2 = z\pi + v_2$

where $z$ includes at least one variable not in $z_1$. This is required to avoid perfect multicollinearity (see the final model below). Since $y_2$ is correlated with $u_1$, $v_2$ must be correlated with $u_1$ as well. Therefore we can write

$u_1 = \rho v_2 + e_1$

See that this gives a simple test for the endogeneity of $y_2$: if $\rho = 0$, then $y_2$ is actually exogenous. Further, see that $v_2$ and $e_1$ are uncorrelated, and consequently $z$ (which includes $z_1$) is also uncorrelated with both $v_2$ and $e_1$. We can therefore substitute $u_1$ in the original model to get

$y_1 = z_1\delta_1 + \gamma y_2 + \rho v_2 + e_1$

which is a model with no endogeneity and will be consistently estimated by OLS (with $v_2$ replaced by the reduced form residuals). Since $\hat{v}_2$ is a generated regressor, we need to correct the standard errors. CF provides results identical to 2SLS unless more than one function of $y_2$ is included in the model (for instance $y_2$ and $y_2^2$). In such instances only 2SLS remains consistent under the weaker assumptions; CF requires stronger assumptions but will then be more efficient. CF is very useful for non-linear models (discussed later).
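The following is a minimal sketch of the two-step control function estimator just described, on simulated data with hypothetical parameter values; note that the second-step standard errors would still need to be corrected for the generated regressor.

```python
# Minimal sketch of the two-step control function estimator (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
z1 = rng.normal(size=n)                 # included exogenous variable
z2 = rng.normal(size=n)                 # excluded exogenous variable
v2 = rng.normal(size=n)
u1 = 0.8 * v2 + rng.normal(size=n)      # rho = 0.8: y2 is endogenous
y2 = 0.5 * z1 + 0.9 * z2 + v2           # reduced form
y1 = 1.0 * z1 + 2.0 * y2 + u1           # true gamma = 2

# Step 1: reduced form of y2 on all exogenous variables; save residuals
Z = sm.add_constant(np.column_stack([z1, z2]))
v2_hat = sm.OLS(y2, Z).fit().resid

# Step 2: OLS of y1 on z1, y2 and v2_hat; the t test on v2_hat is the
# endogeneity test, and gamma is now consistently estimated
X = sm.add_constant(np.column_stack([z1, y2, v2_hat]))
print(sm.OLS(y1, X).fit().params)       # roughly [0, 1, 2, 0.8]
```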
Correlated random coefficient model
It may be that in the population model an endogenous variable interacts with an unobserved variable (unobserved heterogeneity) and we do not have a valid proxy for it. Take the model that we can estimate (not having data on the unobserved heterogeneity)

$y_1 = \beta_1 + \delta_1 z_1 + a_1 y_2 + u_1$

where $a_1$, the 'coefficient' of $y_2$, is an unobserved random variable, meaning it varies across observations. We can write

$a_1 = \alpha_1 + v_1$

where $\alpha_1$ is the constant coefficient which we wish to estimate. Substituting this into the original model gives the population model

$y_1 = \beta_1 + \delta_1 z_1 + \alpha_1 y_2 + v_1 y_2 + u_1$

This shows the interaction between the unobserved heterogeneity ($v_1$), for which we do not have a proxy, and the endogenous variable. To address the endogeneity of $y_2$ we would want to use 2SLS. The problem with 2SLS is that the error term in the model to be estimated, $v_1 y_2 + u_1$, is not necessarily uncorrelated with the instruments ($z$) that we would want to use. A further requirement is therefore necessary, namely

$Cov(v_1, y_2|z) = Cov(v_1, y_2)$

which means the conditional covariance is not a function of the instrumental variables. Finding an instrument that satisfies this condition is difficult. One option is to obtain the fitted values from a first-stage regression of $y_{i2}$ on $z_i$ and then use as IVs $1$, $z_i$ and $\hat{y}_{i2}(z_{i1} - \bar{z}_1)$. Alternatively, a control function approach can be used: first regress $y_2$ on $z$ and save the reduced form residuals, $\hat{v}_2$, and then run the OLS regression of $y_1$ on $1$, $z_1$, $y_2$, $\hat{v}_2$ and $\hat{v}_2 y_2$. This approach requires stronger assumptions, namely

$E(u_1|z, v_2) = \rho_1 v_2$ and $E(v_1|z, v_2) = \xi_1 v_2$

Systems of equations
It is possible that the population model is a set of equations, for instance in estimating a demand system:

$y_1 = x_1\beta_1 + u_1$
$y_2 = x_2\beta_2 + u_2$
$\vdots$
$y_m = x_m\beta_m + u_m$

Since each equation has its own vector of coefficients $\beta_m$, this model is known as a seemingly unrelated regressions (SUR) model. In estimating such a system we can use OLS equation by equation, system OLS (SOLS) or FGLS. Of these, FGLS will be more efficient if we can assume system homoscedasticity. SOLS is generally more likely to be consistent, as it relies on a weaker assumption; FGLS requires strict exogeneity. If we cannot assume system homoscedasticity, then either SOLS or FGLS may be more efficient. Systems of equations often contain endogenous variables, and IV methods are therefore commonly used (see SEM models). There are estimators that are more efficient than 2SLS for systems of equations with endogeneity, for instance the Generalized Method of Moments (GMM) estimator and GMM 3SLS.

Simultaneity bias and simultaneous equation models (SEM)
Not previously discussed, the parameters estimated by OLS will be biased in the presence of simultaneity. Simultaneity arises if one or more of the independent variables are jointly determined with the dependent variable. As long as the equation of interest has to be solved together with another, simultaneous, equation, the independent variables will be correlated with the error term. An example is the amount of crime and the number of police officers: a change in crime may result in a change in the number of police officers, but a change in the number of police officers may also result in a change in crime (the correlation goes both ways, and crime and police are jointly determined). Another example is supply and demand (or any phenomenon that requires a system of equations to solve, such as general equilibrium models of the economy). In these situations we require at least two (simultaneous) equations to estimate one of the equations.
The most important requirement for each of these equations is that it has a ceteris paribus interpretation (we cannot willingly leave out any relevant variables). Taking supply and demand as an example, we can write a supply equation as

$hours^s = \alpha_1 wage + \beta_1 z_1 + u_1$

and a demand equation as

$hours^d = \alpha_2 wage + \beta_2 z_2 + u_2$

where the $z$'s indicate exogenous variables. See that the observed hours are determined by the intersection of supply and demand; the true hours that workers are willing to supply cannot be observed, but this is what we wish to estimate. Because we only observe the equilibrium hours worked, where supply equals demand, we can write for each individual

$h_i = \alpha_1 w_i + \beta_1 z_{i1} + u_{i1}$

and

$h_i = \alpha_2 w_i + \beta_2 z_{i2} + u_{i2}$

See that the only difference between these two equations lies in the subscripts on the exogenous variables. If the exogenous variables are exactly the same, then the two equations are exactly the same, meaning we have an identification problem: the true hours that workers wish to supply cannot be estimated.

Taking crime and police as an example, the first equation will be

$crime = \alpha_1 police + \beta_1 z_1 + u_1$

and the second equation will be

$police = \alpha_2 crime + \beta_2 z_2 + u_2$

See that both equations have a ceteris paribus interpretation. Further note that these two equations describe different behaviours: in the first equation we are interested in factors that change the behaviour of criminals, and in the second we are interested in factors that change the behaviour of the country/state etc. in appointing police officers. It is therefore most plausible that the exogenous variables will be different, so that the first (or second) equation can be estimated. Note, however, that if we use OLS on the first or second equation, the estimated parameters will be biased because of simultaneity. We therefore use 2SLS.

Identification of SEMs with two equations
To use 2SLS to address simultaneity bias, we first need to specify a structural equation for each endogenous (simultaneous) variable. Secondly, to be able to consistently estimate an equation, that equation must be identified. Normally 2SLS only requires the order condition for identification, but for an SEM a stronger condition (together with the order condition) is required, namely the rank condition. For two equations this requirement states that the non-estimated equation contains at least one statistically significant exogenous variable that is not present in the estimated equation.

Identification of SEMs with more than two equations
The order condition is again necessary, but not sufficient. For the rank condition for SEMs with more than two equations, see Wooldridge (2010, ch. 9).

Estimation of SEMs (any number of equations) by 2SLS
In estimating SEMs we are most often interested in only one equation, with the remaining equations required to correctly describe the simultaneous effect on the dependent variable of the equation of interest. The non-estimated equations can therefore be viewed as identifying the instrumental variables applicable to the estimated equation. This can be seen by taking the reduced form of the first equation (writing it in terms of all the exogenous variables in the system of equations).
The instrumental variables used in estimating the equation of interest are therefore all the exogenous variables in the system of equations. By doing this we remove the simultaneity bias in the independent variable that is jointly determined with the dependent variable. In conclusion, the only difference between using 2SLS to address endogeneity bias and using it to address simultaneity bias lies in how we obtain the instrumental variables and in the condition necessary to estimate an equation.

TIME SERIES DATA

OLS Assumptions for finite samples

Assumptions 1-3
The OLS assumptions for time series data (TSD) that ensure that OLS is BLUE in finite samples are similar to those for cross-sectional data. For instance, the model needs to be linear in parameters (1) and there may not be any perfect collinearity (2). For OLS to be unbiased with TSD, a further assumption needs to hold. This assumption combines the random sampling and zero conditional mean assumptions for cross-sectional data and adds a stricter requirement. If $X$ is taken to represent all independent variables for all time periods $t$, then

$E(u_t|X) = 0, \quad t = 1, 2, 3, \ldots, n$

This means that for each time period, the expected value of the error term of that period, given the independent variables for all time periods, is zero (3). In other words, the error in any one time period may not be correlated with any independent variable in any time period. If this holds, we say the model is strictly exogenous, and OLS is unbiased and consistent. This assumption will not hold if the data do not come from a random sample. Note that this assumption includes the cross-sectional assumption, which can be written

$E(u_t|x_t) = 0$

and which means that the error term and the independent variables of the same time period are not correlated. If only this weaker assumption holds, the model is said to be contemporaneously exogenous; OLS will then be consistent, but biased. The weaker assumption is therefore not sufficient for OLS to be BLUE.

Assumption 3 may fail due to:
1. Omitted variable bias (the same as for cross-sectional data)
2. Measurement error
3. The present level of a variable being influenced by the past level of an independent variable, e.g. the size of the police force may be adjusted due to past crime rates. Note that a strictly exogenous variable such as rainfall does not pose a problem, e.g. rainfall in future years is not influenced by past years of agricultural output.

Meeting assumptions 1-3 results in OLS being unbiased and consistent. The assumptions required for OLS to have the smallest variance (to be Best) are:

Assumption 4
Homoscedasticity, meaning

$Var(u_t|X) = Var(u_t) = \sigma^2$

Note again that the requirement involves all independent variables at all time periods; this said, in most cases the heteroscedasticity in the error for a time period is a result of the independent variables of that time period.

Assumption 5
No serial correlation (autocorrelation), meaning the errors (given all independent variables for all time periods) may not be correlated over time. This can be written

$Corr(u_t, u_s|X) = 0 \text{ for all } t \neq s$

Note that this does not mean that an independent variable may not be correlated with itself or with other independent variables over time; only the errors (which contain unobserved factors and measurement error) are of concern. Under assumptions 1-5, OLS is BLUE for time series data.
Further, the OLS sampling variance is calculated exactly as for cross-sectional data (see above), and the estimated error variance is an unbiased estimate of the population error variance. OLS therefore has the same desirable properties for time series data.

Assumption 6
To be able to use the t and F tests in finite samples, the classical linear model assumption is required; without it, the test statistics will not follow t and F distributions. This assumption is that the $u_t$ are independent of $X$ and are independently and identically distributed as normal.

Basic time series models using OLS as the estimator

Static model
The most basic model for time series data is the static model; this model is essentially the same as a cross-sectional model, but the assumptions for OLS to be BLUE differ (discussed above). Such a model can be written

$y_t = \beta_0 + \beta_1 z_{1t} + \beta_2 z_{2t} + u_t, \quad t = 1, 2, 3, \ldots$

This model does not make use of data from other time periods in estimating the effects in the current time period (the same as a cross-sectional analysis). The parameters therefore indicate the immediate effect of the independent variables on the dependent variable or, alternatively stated, the trade-off between the independent variables and the dependent variable.

Finite distributed lag model (FDL)
In this model we allow variables to affect the dependent variable with a lag. The number of lags included indicates the order of the FDL, e.g. one lag gives an FDL of order one. This model is used to estimate the short-run (immediate) propensity/effect of the independent variable(s) on the dependent variable, as well as the long-run propensity/effect. A model with one independent variable included for different time periods can be written as

$y_t = \beta_0 + \beta_1 z_t + \beta_2 z_{t-1} + \beta_3 z_{t-2} + \beta_4 z_{t-3} + u_t$

where $\beta_1$ indicates the immediate propensity, meaning the change in $y_t$ due to a one-unit increase in $z$ at time $t$, and $\beta_1 + \beta_2 + \beta_3 + \beta_4$ indicates the long-run propensity, meaning the change in $y$ over four time periods (the number of lags included plus one) due to a one-unit increase in $z$ at time $t$. This means that $\beta_2$ indicates the change in $y$ one period after a change in $z$ at time $t$, and similarly for the remaining parameters individually considered. A small illustration follows below.

Dynamically complete model
A dynamically complete model is a model in which enough lags of the dependent and independent variables have been included as regressors, so that further lags do not matter in explaining the dependent variable. A possible model of this kind can be written (the coefficients on the lagged dependent variables, here $\alpha_1$ and $\alpha_2$, were dropped in the original text)

$y_t = \beta_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \beta_1 z_t + \beta_2 z_{t-1} + \beta_3 z_{t-2} + \beta_4 z_{t-3} + u_t$

In such a model there cannot be any serial correlation, meaning the no serial correlation assumption always holds. This does not mean all models should be dynamically complete. If the purpose of the regression is to forecast, the model must be dynamically complete. If we are, however, interested in the static impact (a static model) or the long-run effect (an FDL model), the model need not be dynamically complete. It should then be noted that the model will have serial correlation, which will have to be corrected (discussed later).
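As the illustration promised above, the sketch below estimates an FDL of order three on simulated data and recovers the impact and long-run propensities; the data and all coefficient values are hypothetical.

```python
# Minimal sketch of an FDL of order three: the long-run propensity is the
# sum of the lag coefficients (simulated data; numbers are hypothetical).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
T = 500
z = rng.normal(size=T)
u = rng.normal(scale=0.5, size=T)
y = np.zeros(T)
# True lag coefficients: impact propensity 1.0, LRP = 1.0+0.6+0.3+0.1 = 2.0
for t in range(3, T):
    y[t] = 0.2 + 1.0 * z[t] + 0.6 * z[t-1] + 0.3 * z[t-2] + 0.1 * z[t-3] + u[t]

X = sm.add_constant(np.column_stack([z[3:], z[2:-1], z[1:-2], z[:-3]]))
res = sm.OLS(y[3:], X).fit()
print("impact propensity:", res.params[1])
print("long-run propensity:", res.params[1:].sum())  # close to 2.0
```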
Possible additions to the above models
Similar to cross-sectional data, variables can be transformed for time series. A log-log FDL model has the benefit of estimating the short-run and long-run elasticities. Dummy and binary variables can also be used; binary variables are useful for event studies using time series data. It should further be noted that for time series data we always want to use real economic variables and not nominal economic variables. This means that if data are in nominal form, they need to be deflated by an index, such as the consumer price index, to obtain the real economic variable. Alternatively stated, not accounting for inflation gives rise to measurement error. Unique aspects of time series data are trends and seasonality.

1. Trends
Often we may think that variables are correlated over time, but this correlation can partly be ascribed to a similar time trend that the variables follow. If a dependent or independent variable follows a time trend, we need to control for this trend in the model. Not doing so means that the trend will be included in the error term, making the estimates biased; this is called a spurious regression. How the trend is included in the model depends on the type of trend. For a linear time trend we can write

$y_t = \beta_0 + \beta_1 t + u_t, \quad t = 1, 2, 3, \ldots$

Note that the independent variable $t$ indicates time, where 1 is for instance 2010, 2 is 2011, 3 is 2012, etc. Including this variable detrends the results of the equation. If a variable has an exponential trend we can include logs, and for a quadratic trend we can include polynomial functions. Note that when including trends, the R-squared or adjusted R-squared is biased, but this does not influence the t or F stat.

2. Seasonality
If our time periods are shorter than a year, data can also be influenced by seasonality, e.g. crop output is influenced by rainfall and rainfall is seasonal. Most often, series are already seasonally adjusted and we do not have to make any changes to our model. If the data you receive are not seasonally adjusted and are suspected of seasonality, such an adjustment is required. This is easily done by including dummy variables for the relevant seasons (for instance, for each month less one, or for each quarter less one). This will control for the seasonality in the data.

OLS asymptotic assumptions
In large samples, the assumptions of OLS can be made less strict, as long as the law of large numbers and the central limit theorem hold. Additional requirements, besides having a large sample, are needed for this to be the case. The two additional requirements, for OLS and other estimators, are that the time series included in a regression are stationary and weakly dependent. Note that we are concerned here with the individual variables, not with the regression model: we look at one variable over time (a time series) to see whether it is stationary and weakly dependent. Stationarity is not critical, but weak dependence is. Logically, to understand the relationship between variables over time, we need to be able to assume that this relationship does not change arbitrarily between time periods; each variable should follow a determinable path over time. For this reason a time series (one variable over time) can be seen as a process (and defined in terms of a process). A stochastic process in probability theory is a mathematical object defined as a sequence of random variables.
The opposite of a stochastic process is a deterministic process: by looking only at the process, we can determine the outcome correctly (with probability 1). An example of a stochastic process is tossing a coin; just by looking at the process we cannot determine the outcome (how many heads or tails) correctly, we can only obtain probabilities and a joint probability distribution. Any time series is a stochastic process, meaning that the level of the series in any one period is not deterministic; the data points are determined by probability. The important aspect of the process is whether it is stationary or non-stationary.

Stationary
A stationary stochastic process is a process where the joint probability distribution of the sequence of random variables in the process remains unchanged over time. Again, flipping a coin is a stationary stochastic process, since the joint probability of heads and tails remains unchanged over time. If a variable has, for instance, a time trend, then the stochastic process cannot be stationary: it is a non-stationary stochastic process. A process that is stationary in this sense is called strictly stationary. Sometimes a lesser form of stationarity suffices; to understand it we need to understand moments. If we write the sample moments as $\frac{1}{n}\sum_{i=1}^{n} x_i^s$, the first moment is where $s = 1$ (the mean) and the second moment is where $s = 2$ (related to the variance). This can be continued further to get skewness and kurtosis. The lesser form of stationarity is called covariance stationarity or weak stationarity, and it is more important than strict stationarity (since strict stationarity seldom holds). It holds where all the random variables have a finite second moment ($E(x_t^2) < \infty$ for all $t$), the mean and the variance of the process are constant, and the covariance between two terms depends only on the distance between their time periods and not on the starting time period. Mathematically this can be written

$E(x_t) = \mu$, $\quad Var(x_t) = \sigma^2$, $\quad Cov(x_t, x_{t+h}) = f(h)$ and not $f(t)$

This requirement means that there is one data generating process that determines $x_t$ in all time periods; the data generating process does not change between time periods. The data generating process is unknown and can be likened to a true model that explains changes in the time series. If the generating process changed between periods, it would not be possible to have a linear relationship in the regression model, since the parameters would change greatly between time periods. It can be seen that a strictly stationary process with a finite second moment is automatically covariance stationary, but the converse is not true.

Weakly dependent
The weak dependence requirement differs between a strictly stationary process and a covariance stationary process. For a strictly stationary process, it is required that $x_t$ and $x_{t+h}$ are "almost independent" as $h$ increases without bound. The covariance stationary requirement is less abstract and is generally how we think of weak dependence: the correlation between $x_t$ and $x_{t+h}$ must go to zero satisfactorily quickly as $h$ goes to infinity. In other words, we do not want persistent correlation of a variable with itself over time as the distance between periods grows.
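To make the decay concrete, the sketch below simulates an AR(1) process (defined in the text just below) and prints its sample autocorrelations, which fade roughly geometrically; the parameter values are hypothetical.

```python
# Minimal sketch of weak dependence: the sample autocorrelation of a
# simulated AR(1) process decays toward zero roughly like rho**h.
import numpy as np

rng = np.random.default_rng(7)
T, rho = 10_000, 0.7
e = rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = rho * x[t-1] + e[t]

def autocorr(x, h):
    # sample correlation between x_t and x_{t+h}
    return np.corrcoef(x[:-h], x[h:])[0, 1]

for h in (1, 2, 5, 10):
    print(h, autocorr(x, h))   # approximately 0.7, 0.49, 0.17, 0.03
```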
One example of a weakly dependent process is a moving average process of order one, MA(1). This can be written as

$x_t = e_t + \alpha e_{t-1}, \quad e_t \sim i.i.d.(0, \sigma_e^2)$

This process states that a once-off change in $e_t$ will influence $x_t$ in the period of the change and in the following period, but not thereafter. The covariance therefore goes to zero within two periods. This process is stationary (since $e_t$ is i.i.d.) and weakly dependent.

Another example is an autoregressive process of order one, AR(1). This can be written

$x_t = \rho x_{t-1} + e_t, \quad e_t \sim i.i.d.(0, \sigma_e^2)$

This process states that, as long as $|\rho|$ is less than one, a change in $x_t$ will have a persistent effect on future values of $x$, but the effect will decrease to zero over time. It should be noted that if $\rho$ gets close to one, the correlation still decreases to zero over time, but not satisfactorily quickly (a $\rho$ below roughly 0.95 seems satisfactory). This process is also weakly dependent and stationary.

It is possible to perform multiple regression if a series is non-stationary and not weakly dependent, but since the law of large numbers and the central limit theorem will not hold, the analysis becomes tricky and the finite sample OLS assumptions need to be relied upon. If the series are stationary and weakly dependent, the asymptotic properties of OLS can be used (for large samples). We now turn back to the regression model, as these assumptions need to hold in the model.

Assumption 1
The model must be linear in the parameters, and the processes must be stationary and weakly dependent so that the LLN and CLT can be applied to sample averages. For this purpose, weak dependence is the more important requirement.

Assumption 2
No perfect multicollinearity.

Assumption 3
The explanatory variables are contemporaneously exogenous, meaning $E(u_t|x_t) = E(u_t) = 0$. Note that this assumption is less strict than the finite sample assumption, as it is not concerned with how the error of one period is related to the explanatory variables in other time periods. Under assumptions 1-3, OLS is consistent, but not necessarily unbiased; strict exogeneity is required for unbiasedness. In large samples the bias is likely to be small.

Assumption 4
The errors are contemporaneously homoscedastic: $Var(u_t|x_t) = \sigma^2$. Note again that this is less strict than the finite sample assumption. Further note that $x_t$ here can also include lags of the dependent and/or independent variables.

Assumption 5
The errors of different time periods are uncorrelated: no serial correlation.

Under assumptions 1-5, the OLS estimators are asymptotically normal and the standard errors and the t, F and LM test statistics are valid. If a model has trending explanatory variables that are trend-stationary, assumptions 1-5 can be applied provided the trend is included in the model.

Highly persistent time series
In this section we are again concerned with individual variables over time (a time series), not with the regression model. For many variables the correlation over time does not go to zero satisfactorily quickly; in other words, they are highly persistent time series where the level in one period depends greatly on the level in the previous period(s). A process that describes such a time series is the random walk, which is a special case of a unit root process. The term unit root comes from the $\rho$ in the AR(1) model being equal to unity (one).
A random walk can be written

$y_t = y_{t-1} + e_t, \quad e_t \sim i.i.d.(0, \sigma_e^2)$

In this model the expected value does not depend on the time period, but the variance does: it increases as a linear function of time, and the correlation between $y_t$ and $y_{t-1}$ gets arbitrarily close to one as $t$ grows. This process is not weakly dependent and is non-stationary. It is also possible for this process to have a time trend, which is called a random walk with drift.

Luckily, processes that are not weakly dependent are easily transformed into weakly dependent processes (which are often stationary), and the transformed series can then be used in the regression. Before transformation we need to determine whether the process is weakly dependent, called a process integrated of order zero, I(0), or not, called a process integrated of order one, I(1). We can estimate $\rho$ by obtaining the correlation between $y_t$ and $y_{t-1}$, but it should be noted that this estimate is biased, and can be severely so (we therefore rather use the Dickey-Fuller test discussed below). Note that if the process has a trend, we first need to detrend it before taking the correlation. If $|\hat{\rho}| > 0.8$ to $0.9$ (preferences differ on this), it is better to conclude that the process is I(1). If the process is I(1), we need to take the first difference of the process and include that in the regression. For the random walk process the first difference is

$\Delta y_t = y_t - y_{t-1}$

where $\Delta y$ denotes the first difference in $y$. We lose the first observation as a result of taking the first difference, meaning the sample starts at period 2. Taking the first difference also has the advantage of detrending the time series; this is true since the first difference of a linear trend is constant.

A more formal test for a unit root is the Dickey-Fuller (DF) test. Taking the AR(1) model above and subtracting $y_{t-1}$ from both sides gives

$\Delta y_t = \alpha + \theta y_{t-1} + e_t$

where $\theta = \rho - 1$. This model can be estimated by OLS, but under the null of a unit root the t statistic of $\hat{\theta}$ does not follow a normal distribution; it follows what is known as the Dickey-Fuller distribution. We therefore need alternative critical values for the t test. Higher-order AR processes, which address serial correlation, are also allowed and can be written

$\Delta y_t = \alpha + \theta y_{t-1} + \gamma_1 \Delta y_{t-1} + \ldots + e_t$

If a series has a time trend, we need to include the trend in the Dickey-Fuller test; note, however, that alternative critical values need to be used after including the time trend.

Spurious regression
It is possible for two variables to be correlated only because both are correlated with a third variable not included in the model; including this third variable removes the correlation between the first two. If this is the case we have a spurious regression. This is of course also possible for time series, but time series have an additional issue: if we have an I(1) dependent variable and at least one I(1) independent variable, this will in most instances result in a spurious regression, meaning the t statistics cannot be trusted. One way to address this is by differencing the variables, but this limits our application. Another possibility is to determine whether the two I(1) variables are co-integrated.
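The following is a minimal sketch of the (augmented) Dickey-Fuller test on a simulated random walk, using the adfuller routine from statsmodels; the series is hypothetical.

```python
# Minimal sketch of the (augmented) Dickey-Fuller test on a simulated
# random walk; the series and parameters are hypothetical.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(8)
y = np.cumsum(rng.normal(size=500))    # random walk: a unit root process

stat, pvalue, usedlag, nobs, crit, _ = adfuller(y, regression="c")
print(stat, pvalue, crit)   # typically fails to reject the unit root null

dy = np.diff(y)                         # the first difference is I(0)
print(adfuller(dy, regression="c")[1])  # small p-value: rejects the unit root
```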
Co-integration
If two I(1) variables have a long-run relationship, it is possible that a linear combination of the two variables is an I(0) process. This can be written

$y_t - \beta x_t$ is $I(0)$ for some $\beta \neq 0$

To test whether two I(1) variables are co-integrated we perform the Engle-Granger test:
1) Estimate $y_t = \alpha + \beta x_t$ by OLS
2) Apply the DF test to the residuals by estimating $\Delta \hat{u}_t = \alpha + \theta \hat{u}_{t-1} + e_t$
3) Use the Engle-Granger critical values to determine whether $\theta$ is significant.
4) If the t-stat is below the critical value, then $y_t - \beta x_t$ is I(0), meaning we can calculate a new variable that often has an economic interpretation.
If we include this new variable, we call the model an error correction model, which can be written (note that the variables are differenced because $y$ and $x$ are I(1))

$\Delta y_t = \alpha_0 + \gamma \Delta x_t + \delta(y_{t-1} - \beta x_{t-1}) + u_t$

Serial correlation
Remember, in a dynamically complete model there is no serial correlation. Serial correlation can, however, exist in other types of models, or where a dynamically complete model is misspecified. When there is serial correlation, OLS remains consistent and unbiased (even if the model includes lagged dependent variables). OLS will, however, be less efficient (no longer BLUE) and the test statistics will be invalid. The goodness-of-fit measures (R-squared) remain valid.

Tests for serial correlation

Tests when the independent variables are strictly exogenous (e.g. no lagged dependent variables)
For time series data, the error terms can also be viewed as processes. This means the error terms can be related to past error terms in various ways. Commonly, the errors are written as an AR(1) process:

$u_t = \rho u_{t-1} + e_t, \quad e_t \sim i.i.d.(0, \sigma_e^2)$

If there is no serial correlation in adjacent errors, then $\rho = 0$; this is the null hypothesis of the test. Since we only have strictly exogenous variables, the estimate of $u_t$ is unbiased and can be used for testing the null. Therefore:
I. Run OLS of $y_t$ on $x_{t1}, x_{t2}, \ldots, x_{tk}$ and save $\hat{u}_t$ for all $t$
II. Run $\hat{u}_t$ on $\hat{u}_{t-1}$ for all $t$. The p-value of the parameter $\hat{\rho}$ indicates serial correlation; generally the null is tested at the 5 percent level.
The test can be made robust to heteroscedasticity by computing robust standard errors. It should be remembered that this test only tests for AR(1) serial correlation, meaning correlation in adjacent error terms; there may also be serial correlation in non-adjacent error terms. Another possible test is the Durbin-Watson test, but this requires that the classical assumptions all hold and provides the same answer as the test above; it is therefore suggested that this test rather not be used.

Tests when the independent variables are not strictly exogenous
Since strict exogeneity is unlikely to hold, but OLS will still be asymptotically consistent (although biased, where the bias can be small if the time series are not persistent), serial correlation tests that do not require strictly exogenous variables are often needed. The previously discussed tests are then not valid. Durbin's alternative test is valid whether or not the variables are strictly exogenous, so it can always be used; it must be used if there is a lagged dependent variable (such a model can never be strictly exogenous). For AR(1) errors:
I. Run OLS of $y_t$ on $x_{t1}, x_{t2}, \ldots, x_{tk}$ and save $\hat{u}_t$ for all $t$
II. Run $\hat{u}_t$ on $x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}$ for all $t$
III. The null is the same as in the previous test (the parameter of $\hat{u}_{t-1}$ equals zero), and the test can again be made robust to heteroscedasticity.
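The following is a minimal sketch of the regression-based AR(1) serial correlation test (steps I-II above) on simulated data with AR(1) errors; all numbers are hypothetical.

```python
# Minimal sketch of the regression-based AR(1) serial correlation test.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
T = 500
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):                  # AR(1) errors with rho = 0.5
    u[t] = 0.5 * u[t-1] + rng.normal()
y = 1.0 + 2.0 * x + u

# Step I: OLS and residuals
resid = sm.OLS(y, sm.add_constant(x)).fit().resid
# Step II: regress residuals on their own lag (for Durbin's alternative
# test, when regressors are not strictly exogenous, also include the x's)
res = sm.OLS(resid[1:], sm.add_constant(resid[:-1])).fit()
print(res.params[1], res.pvalues[1])   # rho_hat near 0.5, tiny p-value
```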
For higher-order serial correlation (e.g. AR(2) errors, meaning two lags) the same test can be performed, but with the higher-order lagged residuals included in step II. The F test is then used to test joint significance (all parameters of the lagged residuals should be zero), and the test can be made robust to heteroscedasticity as discussed for cross-sectional data.

Correcting serial correlation

Strictly exogenous variables
In the test for serial correlation we obtain the parameter $\hat{\rho}$ for AR(1) serial correlation. We can use this parameter to transform the data in the model and thereby correct the serial correlation. This is done with an FGLS estimator, and the estimation is called the Cochrane-Orcutt (CO) or Prais-Winsten (PW) estimation. The CO estimation only makes use of $t > 1$, while the PW estimation makes use of all time periods in the data. PW may therefore be preferred in small samples, although asymptotically the two estimations do not differ. Most regression packages include an iterated version of the estimates, meaning an iterated FGLS is used as the estimator. To understand the estimator, you need to understand how the data are transformed. AR(1) errors (strictly speaking residuals, since we use $\hat{\rho}$, but for ease written with $\rho$) are written

$u_t = \rho u_{t-1} + e_t$

where $Var(u_t) = \sigma_e^2/(1 - \rho^2)$. Note that $\rho$ indicates the extent of the serial correlation; if it is 0, then $Var(u_t) = \sigma_e^2$, meeting the no serial correlation and homoscedasticity assumptions. To achieve this we take the quasi-difference of each variable in the regression, except in time period 1: for each period $t > 1$, each variable in the previous period is multiplied by $\rho$ and deducted from the current period (e.g. for time period 2, time period 1 is multiplied by $\rho$ and deducted from time period 2). Note that if $\rho$ were equal to one (which we assume not to be the case), this would be exactly the same as taking the first difference to transform a variable to be weakly dependent. To include time period 1 in the estimation, each variable in this period is multiplied by $(1 - \rho^2)^{1/2}$. These transformations are performed automatically by the regression software. For higher-order serial correlation (AR(q)), a similar approach is followed by quasi-transforming all variables; this again is done automatically by the regression software.

From the above, there are two possible estimators when the errors are serially correlated and the variables strictly exogenous: OLS and FGLS. FGLS is generally preferred, since the transformation ensures all variables are I(0) and that there is no serial correlation. FGLS will, however, only be consistent if

$Cov(x_t, u_t) = 0$ and $Cov(x_{t-1} + x_{t+1}, u_t) = 0$

Note that this is a stronger requirement than for OLS, which only needs the first covariance to be zero. If the second covariance condition does not hold, then OLS may be preferred to FGLS, since OLS will still be consistent (although the test statistics will be invalid). Differencing the variables before OLS, especially when $\rho$ is large, eliminates most of the serial correlation. Both OLS and FGLS should be used and reported to show (hopefully) that there are no large differences between the estimated parameters.
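The following is a minimal sketch of one Cochrane-Orcutt style round: estimate $\rho$ from the OLS residuals, quasi-difference, and re-run OLS. The data are simulated and hypothetical; a packaged iterated FGLS routine would normally be used instead.

```python
# Minimal sketch of one Cochrane-Orcutt round (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
T = 500
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.6 * u[t-1] + rng.normal()    # AR(1) errors, rho = 0.6
y = 1.0 + 2.0 * x + u

resid = sm.OLS(y, sm.add_constant(x)).fit().resid
rho = sm.OLS(resid[1:], resid[:-1]).fit().params[0]   # estimate rho

# Quasi-difference (CO drops period 1; PW would rescale it instead)
y_q = y[1:] - rho * y[:-1]
x_q = x[1:] - rho * x[:-1]
const_q = np.full(T - 1, 1 - rho)         # the constant is transformed too
res = sm.OLS(y_q, np.column_stack([const_q, x_q])).fit()
print(rho, res.params)                    # beta estimates near [1, 2]
```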
Independent variables not strictly exogenous
When the independent variables are not strictly exogenous, the CO and PW estimations are not consistent or efficient. This means that we have to use OLS as the estimator. After OLS, serial correlation robust standard errors can be computed (see Wooldridge, 1989, for how this is done). These standard errors are also robust to heteroscedasticity, and are therefore also called heteroscedasticity and autocorrelation consistent (HAC) standard errors. It may further be a good idea to compute these standard errors even when the independent variables are strictly exogenous, after using OLS or FGLS. FGLS is included since the parameter $\hat{\rho}$ may not account for all serial correlation (the errors may not follow the selected AR model) and there may be heteroscedasticity in the errors.

Heteroscedasticity
If the errors are heteroscedastic but there is no serial correlation, the same procedures as discussed for cross-sectional data can be applied to time series. A specific type of heteroscedasticity in time series is autoregressive conditional heteroscedasticity (ARCH). This type of heteroscedasticity does not prevent OLS from being BLUE, and all the OLS assumptions continue to hold, but in the presence of ARCH there may be estimators that are asymptotically more efficient than OLS, for instance weighted least squares. An ARCH(1) model for the errors can be written

$u_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + v_t$

where $\alpha_1$ captures the serial correlation in the square of the errors, even though there is no serial correlation in the errors themselves. This type of heteroscedasticity is often found when the model contains lagged dependent variables (hence the name), although it may be present even when it does not.

Serial correlation and heteroscedasticity
It is possible for the errors to be both heteroscedastic and serially correlated. If this is the case, it is possible to use HAC standard errors after OLS. It is further possible to combine the WLS procedure for heteroscedasticity (discussed for cross-sectional data) with the AR(1) procedure (the CO or PW estimation) discussed above. To do this:
1. Regress $y_t$ on $x_{t1}, \ldots, x_{tk}$ and save $\hat{u}_t$ for all $t$
2. Regress $\log(\hat{u}_t^2)$ on $x_{t1}, \ldots, x_{tk}$ and obtain the fitted values, $\hat{g}_t$
3. Obtain $\hat{h}_t = \exp(\hat{g}_t)$
4. Multiply all variables by $\hat{h}_t^{-1/2}$ to remove the heteroscedasticity
5. Estimate the model with the transformed variables by CO or PW.
Note that this approach can only be used with strictly exogenous variables.

2SLS estimator
The mechanics of the 2SLS estimator are identical for time series and cross-sectional data. Just as variables are differenced for time series, so instrumental variables can be differenced. The tests for, and correction of, serial correlation change slightly when using the 2SLS estimator. To test for AR(1) serial correlation:
1) Estimate the model by 2SLS and save the residuals, $\hat{u}_t$
2) Estimate $y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \rho\hat{u}_{t-1} + error$
3) The null hypothesis is that the parameter of $\hat{u}_{t-1}$ is zero (no serial correlation).
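As a small illustration of the HAC (Newey-West type) standard errors mentioned above, statsmodels exposes them through the cov_type argument of the fit method; the series and the maxlags choice below are hypothetical.

```python
# Minimal sketch of OLS with HAC standard errors via statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
T = 500
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t-1] + rng.normal()    # serially correlated errors
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)
res_naive = sm.OLS(y, X).fit()                    # usual (invalid) SEs
res_hac = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(res_naive.bse, res_hac.bse)   # HAC SEs are typically larger here
```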
To correct serial correlation, serial correlation robust standard errors can be taken, or we can quasi-difference the data:
1) Estimate the model by 2SLS and save the residuals, $\hat{u}_t$
2) Run $\hat{u}_t$ on $\hat{u}_{t-1}$ and get $\hat{\rho}$
3) Construct quasi-differenced versions of all variables, including the instrumental variables
4) Estimate the model with the quasi-differenced variables by 2SLS
5) The first period can also be included by applying the usual quasi-difference treatment of the first period.

SEMs
For time series, using 2SLS for simultaneous equation models to address simultaneity bias is no different than for cross-sectional data. In SEMs, lagged variables are often called predetermined variables. It should further be noted that the series in SEMs are generally highly persistent, and the correct treatment of such series (for instance first differencing) is required.

Assumptions for 2SLS
1) The model is linear in parameters, and all series (including the instrumental variables) are stationary and weakly dependent. Instrumental variables are denoted $z_j$.
2) No perfect multicollinearity among the instrumental variables, and the order condition for identification holds. This means we need at least one excluded exogenous variable (whose parameter is not zero in the reduced form equation) for each included endogenous variable. For SEMs the rank condition is required.
3) $E(u) = 0$ and $Cov(z_j, u) = 0$. Note that each exogenous independent variable is seen as its own instrumental variable; therefore all exogenous variables are denoted $z_j$.
Under assumptions 1-3, 2SLS is consistent (although biased).
4) $E(u_t^2|z_{t1}, \ldots, z_{tk}) = \sigma^2$
5) $E(u_t u_s|z_t, z_s) = 0$ for $t \neq s$
Under assumptions 1-5, 2SLS is consistent and its test statistics are asymptotically valid. The 2SLS estimator is the best IV estimator under these assumptions.

Infinite distributed lag (IDL) models
IDL models are similar to the FDL models discussed previously, with the only difference being that the lags are included in the model indefinitely. Such a model can be written

$y_t = \alpha + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + \ldots + u_t$

where it is required that $\delta_j \to 0$ as $j \to \infty$, which makes logical sense since the distant past has less of an impact than the recent past for nearly all series. The interpretation of this model is also the same as for the FDL: $\delta_j$ is the change in the expected value of the dependent variable, $j$ periods on, for a one-unit temporary change in the independent variable at time zero. $\delta_0$ is again the impact propensity, and the sum of all coefficients that are sufficiently large can be used to approximate the long-run propensity (this is required since the model is infinite). Similar to the FDL, we need to assume strict exogeneity

$E(u_t| \ldots, z_{t-1}, z_t, z_{t+1}, \ldots) = 0$

although in certain situations this assumption can be weakened to include only present and past periods (not $z_{t+1}, \ldots$). There are multiple models that can be used to estimate an IDL.

Geometric (Koyck) distributed lag model
In estimating an IDL we need a finite model (we do not have infinite data). If we take

$\delta_j = \gamma \rho^j$

where $\rho$ is between zero and one in absolute value (to ensure $\delta_j \to 0$ as $j \to \infty$) and $j = 0, 1, 2, \ldots$, then the original IDL model at time $t$ is written

$y_t = \alpha + \gamma z_t + \gamma\rho z_{t-1} + \gamma\rho^2 z_{t-2} + \ldots + u_t$
There are multiple models that can be used to estimate IDLs.

Geometric (Koyck) distributed lag models

To estimate an IDL we need a finite model (we do not have infinite data). If we take

$\delta_j = \gamma\rho^j$, $j = 0, 1, 2, \dots$

where $\rho$ is between zero and one in absolute value (to ensure $\delta_j \to 0$ as $j \to \infty$), then the original IDL model at time $t$ is written

$y_t = \alpha + \gamma z_t + \gamma\rho z_{t-1} + \gamma\rho^2 z_{t-2} + \dots + u_t$

If we also write this equation for time $t-1$, multiply the $t-1$ equation by $\rho$ and subtract it from the time $t$ equation, we get the geometric lag model

$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + v_t$

where $v_t = u_t - \rho u_{t-1}$, an MA(1). The impact propensity is $\gamma$ and the long-run propensity can be shown to be $\gamma/(1 - \rho)$. This equation can be estimated by OLS, but there are a few problems: $y_{t-1}$ is endogenous, $v_t$ is serially correlated whenever $\rho \neq 0$, and the model is not dynamically complete. The endogeneity can be resolved by using 2SLS, and a good instrumental variable for $y_{t-1}$ is generally $z_{t-1}$ (so $z_t$ and $z_{t-1}$ are the IVs). Note that using $z_{t-1}$ requires the strict exogeneity assumption to hold (otherwise $z_{t-1}$ is correlated with the error $v_t$). Afterwards, we can adjust the standard errors as discussed previously.

Rational distributed lag models

This model is similar to the geometric lag model but is written

$y_t = \alpha_0 + \gamma_0 z_t + \rho y_{t-1} + \gamma_1 z_{t-1} + v_t$

The impact propensity is $\gamma_0$ and the long-run propensity is $(\gamma_0 + \gamma_1)/(1 - \rho)$.

Forecasting

Some terminology: $f_t$ denotes the forecast of $y_{t+1}$ made at time $t$ (one-step-ahead forecasting), and $f_{t,h}$ denotes the forecast of $y_{t+h}$ made at time $t$ (multiple-step-ahead forecasting). The forecast error is $e_{t+1} = y_{t+1} - f_t$. The most common loss function is $e_{t+1}^2$, which we want to minimize (the same as for OLS). Note, however, that we do not observe this, so we want to minimize the expected loss function. $I_t$ denotes the set of information known at time $t$.

Conditional forecasting is where we know the future values of the independent variables; it is then easy to forecast the future dependent variable. We can write

$E(y_{t+1} \mid I_t) = \alpha + \beta_1 z_{t+1}$

where we need to assume that $E(u_{t+1} \mid I_t) = 0$. The problem with conditional forecasting is that we rarely know $z_{t+1}$. If $z_t$ is for instance a time trend, then we can use conditional forecasting, as we do know $z_{t+1}$. Unconditional forecasting is where we do not know the future values of the independent variables, so $z_{t+1}$ is not included in $I_t$. This means that we first have to forecast $z_{t+1}$ before we can forecast $y_{t+1}$.

One-step forecasting

The problem of not knowing $z_{t+1}$ can be resolved by forecasting the dependent variable from lags of the dependent and independent variables, which are observed in the current time period. A model that makes use of this approach is called a vector autoregressive (VAR) model and can be written

$y_t = \delta_0 + \alpha_1 y_{t-1} + \beta_1 z_{t-1} + \alpha_2 y_{t-2} + \beta_2 z_{t-2} + \dots + u_t$

where we include as many lags as needed to make the model dynamically complete. To forecast, we then have

$y_{t+1} = \delta_0 + \alpha_1 y_t + \beta_1 z_t + \alpha_2 y_{t-1} + \beta_2 z_{t-1} + \dots + u_{t+1}$

and all the variables on the right-hand side (other than the error) are included in $I_t$. As we obtain additional data we can repeat the estimation. If, after controlling for past $y$, $z$ helps to forecast $y$, we say that $z$ Granger causes $y$. If we include additional variables, $w$, we say that $z$ Granger causes $y$ conditional on $w$. When different models are considered for forecasting the dependent variable, the model with the lowest root mean squared error or mean absolute error is generally preferred.
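A minimal sketch of one-step-ahead VAR forecasting on simulated data (the two-variable system, its coefficients, and the choice of two lags are all hypothetical):

```python
import numpy as np
from statsmodels.tsa.api import VAR

# Simulate a simple system in which lagged z helps to predict y
rng = np.random.default_rng(2)
T = 300
y, z = np.zeros(T), np.zeros(T)
for t in range(1, T):
    z[t] = 0.3 * z[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.2 * z[t - 1] + rng.normal()

data = np.column_stack([y, z])
res = VAR(data).fit(2)                   # two lags of both y and z
print(res.forecast(data[-2:], steps=1))  # one-step-ahead forecast of (y, z)
```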
Multiple-step forecasting

Multiple-step forecasting is less reliable than one-step forecasting, since the error variance increases as the forecast horizon increases. We can use the VAR model above to also forecast the independent variables, and then use the forecasted dependent and independent variables as lags to forecast $y_{t+2}$. This process can be repeated indefinitely, but obviously becomes less reliable as the forecast horizon increases.

PANEL DATA

Panel data is similar to pooled cross-sectional data, with the difference that the same individuals, countries, firms, etc. are sampled in different time periods. A panel dataset is therefore organized as:

City           Year          Variables
Pretoria       2015 (t=1)    421
Pretoria       2016 (t=2)    464
Johannesburg   2015 (t=1)    658
Johannesburg   2016 (t=2)    863

One estimator that can be used on this data is pooled OLS, but it is seldom used since it does not exploit the benefits of panel data. The fact that the same individual, firm, country, etc. is sampled over time gives panel datasets the advantage of being able to control for fixed factors of the units that are correlated with the dependent variable over time. To see this, we can write the error term for a panel as

$v_{it} = a_i + u_{it}$

where $v_{it}$ is known as the composite error and includes both constant ($a_i$) and time-varying ($u_{it}$) unobserved factors explaining the dependent variable. $a_i$ is called the fixed effect, unobserved heterogeneity, or individual/firm/country heterogeneity, and $u_{it}$ is called the idiosyncratic error. A fixed effects model is used to include the fixed effect. It is useful to control for these fixed effects, as this removes much of the persistence in the variables.

Fixed effects model

The fixed effects model for a two-period panel dataset (as above) can be written

$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it} + a_i + u_{it}$

where $d2_t$ is a dummy for time period two that controls for changes due to the different time periods (it is generally a good idea to include this) and $a_i$ is the fixed effect. If $a_i$ is not included in the model and is correlated with the independent variables, the estimates will be biased due to omitted variables; this bias is called heterogeneity bias. Of course, if $u_{it}$ is correlated with any independent variable, the estimates are also biased. Since $a_i$ is not observed, we need a method to control for it. One method is first-differencing.

First-Differencing estimator (FD)

The First-Differencing estimator is an OLS estimator applied to first-differenced data. For a two-period panel, we simply take the first difference between the models for t=2 and t=1 (note that $\delta_0 d2_t = 0$ in period 1), which gives a single cross section:

$\Delta y_i = \delta_0 + \beta_1\Delta x_i + \Delta u_i$

Using this model is the same as modeling only what has changed over time (the non-constant part), which means that $a_i$ is controlled for. This model is also similar to the difference-in-differences estimator for pooled cross sections, with the only difference being that it is the same individual, firm, country, etc. that is sampled. The model can be extended to more time periods, and the process of taking first differences (t2−t1, t3−t2, etc.) remains the same. To ensure that the R-squared is correctly calculated, it is advised to include an intercept and drop the time dummy for the second period, so the model is written

$\Delta y_{it} = \alpha_0 + \alpha_3 d3_t + \alpha_4 d4_t + \dots + \beta_1\Delta x_{it1} + \beta_2\Delta x_{it2} + \dots + \Delta u_{it}$
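A minimal sketch of the two-period FD estimator on simulated data (the names, the true slope of 2, and the period-2 effect of 0.5 are hypothetical). First-differencing removes $a_i$ even though $x$ is correlated with it:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
N = 500
a = rng.normal(size=N)                        # unobserved fixed effect a_i
x = rng.normal(size=(N, 2)) + a[:, None]      # x correlated with a_i
y = 2.0 * x + a[:, None] + rng.normal(size=(N, 2))
y[:, 1] += 0.5                                # period-2 effect (delta_0)

dy = y[:, 1] - y[:, 0]                        # differencing eliminates a_i
dx = x[:, 1] - x[:, 0]
fd = sm.OLS(dy, sm.add_constant(dx)).fit()
print(fd.params)                              # ~[0.5, 2.0]
```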
Assumptions for OLS using the First-Differencing estimator

1. Random sample.
2. Each independent variable varies over time for at least some $i$.
3. No perfect multicollinearity.
4. If $X_i$ denotes all independent variables over all time periods (as for time series): $E(\Delta u_{it} \mid X_i) = E(\Delta u_{it}) = 0$ for unbiased, consistent estimates (strict exogeneity assumption), where $\Delta u_{it}$ is the differenced idiosyncratic error. The weaker condition $E(\Delta u_{it}\,\Delta x_{itj}) = 0$ gives consistent but biased estimates.

Under 1-4, FD is unbiased and consistent.

5. $Var(\Delta u_{it} \mid X_i) = \sigma^2$ (homoscedasticity).
6. $Cov(\Delta u_{it}, \Delta u_{is} \mid X_i) = 0$ for $t \neq s$ (no serial correlation). Note that this only holds if the non-differenced errors ($u_{it}$) follow a random walk; if they are AR(q), it will not hold.

Under 5-6, OLS test statistics are asymptotically valid.

7. Conditional on $X_i$, the $\Delta u_{it}$ are independent and identically distributed normal random variables.

Under 5-7, OLS test statistics are valid; under 5-6 they are asymptotically valid.

Treatment if 5 or 6 does not hold

Testing for heteroscedasticity and serial correlation can be done in exactly the same manner as for cross sections and time series, respectively. If we only have heteroscedasticity (no serial correlation), the corrections for cross sections can be used. If we only have serial correlation, this can be corrected by way of the PW transformation. Note, however, that this needs to be done by hand, as regression software assumes the serial correlation runs over $i$ and $t$, whereas in panel data the $i$ are independent. HAC standard errors can also be used. If we have both heteroscedasticity and serial correlation, one option is to run OLS and use HAC standard errors. The general approach, however, is clustering. In this approach, each cross-sectional unit is defined as a cluster over time, and arbitrary correlation is allowed within each cluster. Clustered standard errors are valid in large panel datasets with any kind of serial correlation and heteroskedasticity.

Fixed effects estimator (Within estimator) (FE)

The fixed effects estimator is an OLS estimator on data that has been time-demeaned. The within transformation is another method of controlling for $a_i$ in a fixed effects model. Take the model

$y_{it} = \beta_1 x_{it} + a_i + u_{it}$

Taking the mean over time for each variable gives

$\bar{y}_i = \beta_1\bar{x}_i + a_i + \bar{u}_i$

and taking the difference between these two equations gives

$\ddot{y}_{it} = \beta_1\ddot{x}_{it} + \ddot{u}_{it}$

where, for instance, $\ddot{x}_{it} = x_{it} - \bar{x}_i$ denotes the time-demeaned data. Note that the intercept has been eliminated and that the degrees of freedom are calculated as $df = NT - N - k$ (done automatically by panel regression software). It is important to see that with the fixed effects estimator we cannot include time-constant variables (such as gender, race or, for instance, the distance of a house from a river). Further, if we include dummy variables for time, we cannot include variables with constant change over time, such as age or years of experience. To calculate the fixed effect $\hat{a}_i$ (if of interest) we write

$\hat{a}_i = \bar{y}_i - \hat{\beta}_1\bar{x}_{i1} - \dots - \hat{\beta}_k\bar{x}_{ik}$
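A minimal sketch of the within transformation, reusing the simulated-panel setup from the FD sketch above (hypothetical names; true slope 2). The clustered covariance shown is the statsmodels interface; note that the plain OLS degrees of freedom here ignore the $N$ demeaning restrictions, which dedicated panel software corrects.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
N, T = 200, 5
a = rng.normal(size=N)
x = rng.normal(size=(N, T)) + a[:, None]      # x correlated with a_i
y = 2.0 * x + a[:, None] + rng.normal(size=(N, T))

# Within transformation: subtract each unit's time mean
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()

groups = np.repeat(np.arange(N), T)           # cluster id = cross-sectional unit
fe = sm.OLS(yd, xd).fit(cov_type='cluster', cov_kwds={'groups': groups})
print(fe.params, fe.bse)                      # slope ~2.0 with clustered SE
```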
FD or FE

Although FD and FE estimate the same parameters and are both unbiased and consistent when the relevant assumptions hold, the extent of serial correlation determines which estimator is more efficient. If $u_{it}$ is not serially correlated, FE is more efficient. If $u_{it}$ follows a random walk, FD is more efficient. If there is substantial negative correlation in $\Delta u_{it}$, FE is more efficient. If T is large and N is not large, use FD, as inference based on FE can be very sensitive to violations of the assumptions. If the model includes a lagged dependent variable, the bias is much smaller under FE than under FD, so use FE.

Unbalanced panels for fixed effects models

If data is missing for some units in one or more years, the computation does not change. The only major issue with unbalanced panels is whether the random sampling assumption holds. If the reason a unit is not sampled in a given year is related to the idiosyncratic error, the estimates will be biased (a reason related to the fixed effect is not a problem, since FE removes it). This is called attrition bias.

Assumptions of the fixed effects estimator

1. Random sample.
2. Each independent variable varies over time for at least some $i$.
3. No perfect multicollinearity.
4. If $X_i$ denotes all independent variables over all time periods (as for time series): $E(u_{it} \mid X_i, a_i) = E(u_{it}) = 0$ (strict exogeneity assumption).

Under 1-4, FE is unbiased and consistent.

5. $Var(u_{it} \mid X_i, a_i) = \sigma_u^2$ (homoscedasticity).
6. $Cov(u_{it}, u_{is} \mid X_i, a_i) = 0$ for $t \neq s$ (no serial correlation).

Under 1-6, FE is BLUE (with smaller variances than FD, since the idiosyncratic errors are uncorrelated, which is not the case for the differenced errors used by FD). If 5 and 6 do not hold, use clustered standard errors (discussed under the FD assumptions).

7. Conditional on $X_i$ and $a_i$, the $u_{it}$ are independent and identically distributed normal random variables.

Under 5-7, the test statistics are valid; under 5-6 they are asymptotically valid (large N, small T).

Random effects model

It is generally preferred to use fixed effects with panel data (this is one of the strengths of panel data), but if $Cov(x_{itj}, a_i) = 0$, then the FE/FD estimator is not the most efficient. We could then use pooled OLS with a model written as

$y_{it} = \beta_0 + \beta_1 x_{it1} + \dots + \beta_k x_{itk} + v_{it}$

where the error term $v_{it}$ includes both the fixed effect and the idiosyncratic error. Because the entire fixed effect is left in the error, $v_{it}$ will necessarily be serially correlated across time, and pooled OLS will therefore have invalid standard errors (unless serial correlation and heteroscedasticity robust standard errors are calculated). Further, we lose the benefit of being able to control for fixed effects. To alleviate these issues we use GLS and the random effects estimator.

Random effects estimator (RE)

The random effects estimator is an FGLS estimator using quasi-demeaned data. To understand the quasi-demeaning process, define

$\theta = 1 - \left[\dfrac{\sigma_u^2}{\sigma_u^2 + T\sigma_a^2}\right]^{1/2}$

where $\sigma_u^2$ is the variance of the idiosyncratic error, $T$ is the number of time periods for which data is observed (note that in an unbalanced panel this varies over $i$), and $\sigma_a^2$ is the variance of the fixed effect.
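A minimal sketch of this weight (the variance values are hypothetical), illustrating how $\theta$ moves between the pooled OLS ($\theta = 0$) and FE ($\theta = 1$) extremes discussed below:

```python
import numpy as np

def theta(sigma_u2, sigma_a2, T):
    """Quasi-demeaning weight for the random effects estimator."""
    return 1.0 - np.sqrt(sigma_u2 / (sigma_u2 + T * sigma_a2))

print(theta(1.0, 1.0, 5))    # ~0.59
print(theta(1.0, 1.0, 50))   # ~0.86: approaches the FE case as T grows
```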
After quasi-demeaning the data (where the demeaning is the same as for the fixed effects estimator), the equation becomes

$y_{it} - \theta\bar{y}_i = \beta_0(1 - \theta) + \beta_1(x_{it1} - \theta\bar{x}_{i1}) + \dots + \beta_k(x_{itk} - \theta\bar{x}_{ik}) + (v_{it} - \theta\bar{v}_i)$

(the underlying equation is the same as the fixed effects model, but with a composite error term). It can therefore be seen that the random effects estimator subtracts only a fraction ($\theta$) of the time average from the data, and the resulting errors are serially uncorrelated. Also note that if $\theta = 0$ the random effects estimator becomes the pooled OLS estimator, and if $\theta = 1$ it becomes the fixed effects estimator. There is further a tendency for $\theta$ to approach one as the number of time periods increases, meaning that RE and FE will then give very similar results. $\theta$ is never known, but it can be estimated, which is why FGLS is used.

Assumptions of the random effects estimator

1. Random sample.
2. No perfect multicollinearity. Because time-constant independent variables are allowed, additional assumptions are required on how the unobserved fixed effect is related to the independent variables:
3. $E(u_{it} \mid X_i, a_i) = E(u_{it}) = 0$ (strict exogeneity assumption) and $E(a_i \mid X_i) = \beta_0$, which means that there is no correlation between the unobserved effect and the explanatory variables.

Under 1-3, RE is consistent, although biased due to the use of FGLS.

4. $Var(u_{it} \mid X_i, a_i) = \sigma_u^2$ and $Var(a_i \mid X_i) = \sigma_a^2$ (homoscedasticity).
5. $Cov(u_{it}, u_{is} \mid X_i, a_i) = 0$ for $t \neq s$ (no serial correlation).

Under 1-5, RE is consistent and test statistics are asymptotically valid (large N, small T). Asymptotically, RE is more efficient than pooled OLS, and more efficient than FE for the estimates on time-varying variables. FE is more robust (unbiased, and BLUE under its assumptions), whereas RE is more efficient but not BLUE, since it is biased. If 4 and 5 do not hold, use clustered standard errors (discussed under the FD assumptions).

FE/FD or RE or pooled OLS?

In practice, it is a good idea to compute all three estimators (the choice between FE and FD is discussed above) to gain an understanding of the bias that results from leaving the fixed effect in the error term: pooled OLS leaves the entire fixed effect in the error, random effects partially leaves it in the error, and FE/FD removes it completely (see the sketch below). A benefit of random effects over fixed effects is that the transformed errors are serially uncorrelated (although serial correlation is easily corrected for under FE/FD and pooled OLS) and that time-constant independent variables can be included in the model. Therefore, if the variable of interest is time-constant (e.g. gender), FE/FD cannot be used and another estimator is needed. Generally, it cannot easily be assumed that $Cov(x_{itj}, a_i) = 0$, which means that FE/FD should be used (otherwise the estimates are biased). The Hausman test can be used to test this assumption, but note that failure to reject does not mean we should use RE; it means we can use either estimator. If the Hausman test rejects the null, we should be careful about assuming $Cov(x_{itj}, a_i) = 0$, and FE/FD may be preferred. Note, however, that the Hausman test is not a model selection test and should not be used as such.
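As a rough illustration of the heterogeneity bias under discussion, the sketch below (same hypothetical data-generating process as the earlier panel sketches, true slope 2) compares pooled OLS, which leaves $a_i$ in the error, with FE; an RE estimate would typically lie between the two, pulled toward FE as $\theta$ rises.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
N, T = 500, 4
a = rng.normal(size=N)
x = rng.normal(size=(N, T)) + a[:, None]          # Cov(x, a_i) > 0
y = 2.0 * x + a[:, None] + rng.normal(size=(N, T))

pooled = sm.OLS(y.ravel(), sm.add_constant(x.ravel())).fit()
xd = (x - x.mean(axis=1, keepdims=True)).ravel()  # within transformation
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
fe = sm.OLS(yd, xd).fit()

print(pooled.params[1])   # ~2.5: heterogeneity bias from a_i in the error
print(fe.params[0])       # ~2.0: fixed effect removed
```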
Further, if we have reason to believe that we do not have a random sample from the population, FE/FD should be used, as this is equivalent to allowing a unique intercept for each unit. FE/FD is also more robust in unbalanced panels where the reason for selection may be correlated with the error term.

The correlated random effects model (CRE)

CRE uses a pooled OLS estimator after including the correlation between $a_i$ and $x_{it}$ in the model, and it provides the same estimates as FE/FD. The term "random effects" appears in the name because $a_i$ is not completely eliminated by the estimation. This approach does not require that $Cov(x_{itj}, a_i) = 0$, and its benefit over the FE estimator is that time-constant independent variables can be included. If we assume a linear relationship

$a_i = \alpha + \gamma\bar{x}_i + r_i$

then $\gamma$ captures the correlation between $a_i$ and $x_{it}$. Substituting this expression for $a_i$ into the fixed effects model gives

$y_{it} = \alpha + \beta x_{it} + \gamma\bar{x}_i + r_i + u_{it}$

where $r_i + u_{it}$ is a composite error and $r_i$ is a time-constant unobservable. Note that the only difference is the inclusion of the time-average variable $\bar{x}_i$. Including this variable (which can easily be calculated for each independent variable) has the same effect as demeaning the data, so the estimate of $\beta$ is exactly the same under CRE and FE. However, because we are not actually demeaning, time-constant variables can be included in the model. Further, $\gamma$ provides another test between FE and RE: if $\gamma = 0$ there is no correlation between $a_i$ and $x_{it}$, meaning either the FE or the RE estimator can be used; if $\gamma$ is statistically significant (economic significance should also be considered), the RE assumption does not hold and we may prefer FE (see the sketch below). When using the CRE model, it is important not to include time averages of variables that change only over time and not over units (for instance year dummies), except in unbalanced panels, where these should be included. Further, in unbalanced panels the time averages should be calculated from the number of periods for which data is available per unit, which differs across units. The assumptions for CRE follow those of the FE estimator.
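A minimal sketch of the CRE regression on the simulated panel used above (hypothetical names; true slope 2): adding the unit time averages $\bar{x}_i$ to a pooled OLS regression reproduces the FE slope, and the t-statistic on $\gamma$ serves as the regression-based test of $Cov(x_{itj}, a_i) = 0$.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
N, T = 300, 4
a = rng.normal(size=N)
x = rng.normal(size=(N, T)) + a[:, None]          # x correlated with a_i
y = 2.0 * x + a[:, None] + rng.normal(size=(N, T))

xbar = np.repeat(x.mean(axis=1), T)               # time average per unit
X = sm.add_constant(np.column_stack([x.ravel(), xbar]))
groups = np.repeat(np.arange(N), T)
cre = sm.OLS(y.ravel(), X).fit(cov_type='cluster',
                               cov_kwds={'groups': groups})
print(cre.params[1])    # ~2.0: matches the FE estimate
print(cre.tvalues[2])   # large t on gamma: reject Cov(x, a_i) = 0
```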
IV estimator

For panel data, the mechanics of the 2SLS estimator remain the same as for cross-sectional data. The unobserved constant effect is first removed by FE/FD, and then the 2SLS estimator is applied. Because the constant effect is removed, the instrumental variables will most likely have to be time-varying; otherwise they are unlikely to be correlated with the transformed endogenous variable. SEMs also pose no particular challenge. To ensure that all assumptions are met, refer to the assumptions for 2SLS for cross-sectional data, read together with the homoscedasticity and serial correlation 2SLS assumptions for time series data, and then the assumptions of the relevant effects estimator. There are multiple estimators that can be used; refer to the Stata manual for xtivreg.

Dynamic panel data models

For dynamic economic relationships, it is useful to include a lagged dependent variable as an independent variable. This removes the persistence and serial correlation in the error term. One problem with doing so is that the lagged dependent variable will be endogenous. To address this problem, a number of estimators are used, including the Arellano and Bond estimator, the Arellano and Bover estimator and the Blundell and Bond estimator; Stata can perform all of these estimations. A stylized sketch of the underlying idea follows at the end of this section.

Spatial panels

When observing firms, countries and other similar samples, cross-sectional correlation (also called spatial correlation) can cause problems. This correlation mainly arises as a result of spatial dependency and spatial structure, and it results in inefficient standard errors. For a correction, see the Stata paper on xtscc.
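To illustrate why the lagged dependent variable is endogenous and how instrumenting addresses it, here is a stylized Anderson-Hsiao-type sketch (a simpler precursor of the Arellano-Bond estimator; the data-generating process and all parameter values are hypothetical): first-difference to remove $a_i$, then instrument $\Delta y_{i,t-1}$ with $y_{i,t-2}$, which is valid when $u_{it}$ is serially uncorrelated.

```python
import numpy as np

# Simulate a dynamic panel: y_it = 0.5 * y_i,t-1 + a_i + u_it
rng = np.random.default_rng(7)
N, T, rho = 1000, 6, 0.5
a = rng.normal(size=N)
y = np.zeros((N, T))
y[:, 0] = a + rng.normal(size=N)
for t in range(1, T):
    y[:, t] = rho * y[:, t - 1] + a + rng.normal(size=N)

# First differences remove a_i, but dy_{t-1} is still correlated with du_t
dy    = (y[:, 3:] - y[:, 2:-1]).ravel()    # dy_t for t = 3, ..., T-1
dylag = (y[:, 2:-1] - y[:, 1:-2]).ravel()  # dy_{t-1}
z     = y[:, 1:-2].ravel()                 # instrument: y_{t-2}

rho_iv = (z @ dy) / (z @ dylag)            # simple just-identified IV estimate
print(rho_iv)                              # ~0.5; OLS on the differences is biased
```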