Violation of the Assumptions of the Linear Regression Model

• We study the underlying assumptions of the linear regression model further and look at:
  – How can we test for violations?
  – What causes the violations?
  – What are the consequences? e.g. any combination of the following problems:
    – the coefficient estimates are wrong
    – the associated standard errors are wrong
    – the distributions assumed for the test statistics are inappropriate
  – How can it be fixed?
    – Change the model so that the assumptions are no longer violated
    – Work around the problem by using alternative (econometric) techniques which are still valid

• More specifically, we are going to study:
  1. E(εt) = 0
  2. var(εt) = σ² < ∞
  3. cov(εi, εj) = 0
  4. No perfect multicollinearity
  5. Omitting/including variables
  6. Errors correlated with regressors, E(εt xk,t) ≠ 0
  7. Model selection and specification checking
     7.1 Model building
     7.2 Lasso, forward stage-wise regression and LARS
     7.3 Specification checking: residual plots and non-linearity; parameter stability; influential observations

1. Assumption: E(εt) = 0
• A1 states that the mean of the disturbances is zero.
• The disturbances can never be observed, so we use the residuals instead:
  – this can only be an approximate investigation of the properties of the errors;
  – since the residuals depend on the chosen estimation method, different methods will yield different residuals with potentially different properties.
• The mean of the residuals will always be zero provided that there is a constant term in the regression.
• Always include a constant term... but what if the economic/finance model does not support a constant term?
• This can be a way to test the validity of the economic/finance theory on the data: include a constant term and test whether it is equal to zero.
• Example: the CAPM.

2. Assumption: var(εt) = σ² < ∞
• The variance of the errors is constant, σ²: this is known as (unconditional) homoskedasticity. If the errors do not have a constant variance, we say that they are heteroskedastic.
• How can we detect heteroskedasticity?
  – Graphical methods.
  – Formal tests: we will discuss the Goldfeld-Quandt test and White's test; both test H0: homoskedasticity vs H1: heteroskedasticity.

Detection of Heteroskedasticity: graph
• Say we estimate a regression model, calculate the residuals, ε̂t, and plot them against one regressor.
[Figure: plot of the residuals ε̂t against the regressor x2t]

Detection of Heterosk.: Goldfeld-Quandt (GQ) test
It is carried out as follows (a worked sketch in Python follows below):
1. Split the total sample of length T into two sub-samples of length T1 and T2. The regression model is estimated on each sub-sample and the two residual variances are calculated.
2. The null hypothesis is that the variances of the disturbances are equal, H0: σ1² = σ2².
3. The test statistic, denoted GQ, is simply the ratio of the two residual variances, where the larger of the two variances must be placed in the numerator: GQ = s1² / s2².
4. The test statistic is distributed as F(T1 − k, T2 − k) under the null of homoskedasticity.
• Big practical issue: where do you split the sample? The choice is often arbitrary, and it may crucially affect the outcome of the test.
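A minimal sketch of the GQ steps in Python. The data-generating process, the midpoint split and all variable names are made up for illustration; they are not prescribed by the slides.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Illustrative data: the error variance grows with x2 (names are hypothetical).
rng = np.random.default_rng(0)
T = 200
x2 = rng.uniform(1, 10, T)
y = 1.0 + 0.5 * x2 + rng.normal(scale=0.3 * x2)
X = sm.add_constant(x2)

# 1. Split the sample in two (here: low vs high x2) and estimate on each part.
order = np.argsort(x2)
idx1, idx2 = order[: T // 2], order[T // 2:]
k = X.shape[1]                                   # number of estimated parameters
fit1 = sm.OLS(y[idx1], X[idx1]).fit()
fit2 = sm.OLS(y[idx2], X[idx2]).fit()
s1 = fit1.ssr / (len(idx1) - k)                  # residual variance, sub-sample 1
s2 = fit2.ssr / (len(idx2) - k)                  # residual variance, sub-sample 2

# 2.-3. H0: equal variances; GQ is the ratio with the larger variance on top.
num, den = (s1, s2) if s1 >= s2 else (s2, s1)
df_num = (len(idx1) if s1 >= s2 else len(idx2)) - k
df_den = (len(idx2) if s1 >= s2 else len(idx1)) - k
GQ = num / den

# 4. Under H0, GQ ~ F(T1 - k, T2 - k).
p_value = 1 - stats.f.cdf(GQ, df_num, df_den)
print(f"GQ = {GQ:.2f}, p-value = {p_value:.4f}")

# statsmodels also ships a version: statsmodels.stats.diagnostic.het_goldfeldquandt
```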
Detection of Heterosk.: White's test
• White's general test for heteroskedasticity is one of the best approaches because it makes few assumptions about the form of the heteroskedasticity.
• The test is carried out as follows (a sketch follows below):
1. Assume that the regression we carried out is
   yt = β1 + β2 x2t + β3 x3t + εt
   and we want to test var(εt) = σ². We estimate the model, obtaining the residuals, ε̂t.
2. Then run the auxiliary regression
   ε̂t² = α1 + α2 x2t + α3 x3t + α4 x2t² + α5 x3t² + α6 x2t x3t + vt
3. Obtain R² from the auxiliary regression and multiply it by the number of observations, T. It can be shown that T × R² ∼ χ²(m), where m is the number of regressors in the auxiliary regression excluding the constant term.
4. If the χ² test statistic from step 3 is greater than the corresponding critical value from the statistical tables, reject the null hypothesis that the disturbances are homoskedastic.
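A hedged sketch of White's test in Python, mirroring the steps above on simulated data; the regression, the variable names and the form of the heteroskedasticity are invented purely for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Illustrative data: y depends on x2 and x3, error variance depends on x2.
rng = np.random.default_rng(1)
T = 300
x2 = rng.normal(size=T)
x3 = rng.normal(size=T)
y = 1.0 + 0.8 * x2 - 0.5 * x3 + rng.normal(scale=np.exp(0.5 * x2), size=T)

# 1. Estimate the original regression and keep the residuals.
X = sm.add_constant(np.column_stack([x2, x3]))
resid = sm.OLS(y, X).fit().resid

# 2. Auxiliary regression of squared residuals on levels, squares and cross-product.
Z = sm.add_constant(np.column_stack([x2, x3, x2**2, x3**2, x2 * x3]))
aux = sm.OLS(resid**2, Z).fit()

# 3.-4. T * R^2 is chi-squared with m = 5 regressors (excluding the constant).
m = Z.shape[1] - 1
stat = T * aux.rsquared
p_value = 1 - stats.chi2.cdf(stat, m)
print(f"T*R^2 = {stat:.2f}, p-value = {p_value:.4f}")

# The same test is available directly:
# from statsmodels.stats.diagnostic import het_white
# lm, lm_pval, f, f_pval = het_white(resid, X)
```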
Consequences of using OLS in the presence of Heterosk.
• OLS estimation still gives unbiased coefficient estimates, but they are no longer BLUE.
• This implies that if we still use OLS in the presence of heteroskedasticity, our standard errors could be inappropriate and hence any inferences we make could be misleading.
• Whether the standard errors calculated using the usual formula are too big or too small will depend upon the form of the heteroskedasticity.

How do we Deal with Heterosk.?
• If the form (i.e. the cause) of the heteroskedasticity is known, then we can use an estimation method which takes this into account (called Generalised Least Squares, GLS).
• A simple illustration of GLS is as follows. Suppose the model is yt = β1 + β2 x2t + β3 x3t + εt and the error variance is related to another variable zt by var(εt) = σ² zt².
• To remove the heteroskedasticity, rescale the regression model for each observation by zt (standardisation):
   yt/zt = β1 (1/zt) + β2 (x2t/zt) + β3 (x3t/zt) + vt,
  where vt = εt/zt is an error term.
• Since var(εt) = σ² zt², we have var(vt) = var(εt/zt) = var(εt)/zt² = σ² zt²/zt² = σ² for known zt.
• Other solutions include:
  1. Transforming the variables into logs or deflating by some other measure of "size".
  2. Using White's heteroskedasticity-consistent standard error estimates. The effect of using White's correction is that, in general, the standard errors for the slope coefficients are increased relative to the usual OLS standard errors. This makes us more "conservative" in hypothesis testing, so that we would need more evidence against the null hypothesis before we would reject it.

3. Assumption: cov(εi, εj) = 0
• This assumption essentially states that there is no pattern in the errors.
• If there are patterns in the residuals from a model, we say that they are autocorrelated.
• Some stereotypical patterns we may find in the residuals are illustrated below.

Positive Autocorrelation
[Figure: two panels plotting ε̂t against ε̂t−1 and against time]
• Positive autocorrelation is indicated by a cyclical residual plot over time.

Negative Autocorrelation
[Figure: two panels plotting ε̂t against ε̂t−1 and against time]
• Negative autocorrelation is indicated by an alternating pattern, where the residuals cross the time axis more frequently than if they were distributed randomly.

No pattern in residuals – No autocorrelation
[Figure: two panels plotting ε̂t against ε̂t−1 and against time]
• No pattern in the residuals at all: this is what we would like to see.

Detecting Autocorrelation: Durbin-Watson Test
• The Durbin-Watson (DW) test is a test for first-order autocorrelation, i.e. it assumes that the relationship is between an error and its first lag:
   εt = ρ εt−1 + vt, where vt ∼ N(0, σv²).
• The test is H0: ρ = 0 vs H1: ρ ≠ 0.
• Full details are available in Chapter 5 of Brooks.
• Many limitations:
  1. It only considers first-order autocorrelation.
  2. There must be a constant term in the regression.
  3. The regressors must be non-stochastic.
  4. There must be no lags of the dependent variable in the regression.

Detecting Autocorrelation: Breusch-Godfrey Test
• Related to the modified Ljung-Box test (see e.g. Hayashi, Chapter 2.10).
• It is a more general test for r-th order autocorrelation:
   εt = ρ1 εt−1 + ρ2 εt−2 + ρ3 εt−3 + · · · + ρr εt−r + vt, where vt ∼ N(0, σv²).
• The null and alternative hypotheses are:
   H0: ρ1 = 0 and ρ2 = 0 and ... and ρr = 0
   H1: ρ1 ≠ 0 or ρ2 ≠ 0 or ... or ρr ≠ 0
• The test is carried out as follows (a sketch in Python follows below):
  1. Estimate the linear regression using OLS and obtain the residuals, ε̂t.
  2. Regress ε̂t on all of the regressors from stage 1 (the x's) plus ε̂t−1, ε̂t−2, ..., ε̂t−r; obtain R² from this regression.
  3. It can be shown that (T − r) R² ∼ χ²(r).
• If the test statistic exceeds the critical value from the statistical tables, reject the null hypothesis of no autocorrelation.
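A rough Python sketch of the Breusch-Godfrey steps, using simulated AR(1) errors; the sample size, the order r = 4 and all variable names are arbitrary choices for illustration only.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Hypothetical setup: a simple regression whose errors follow an AR(1) process.
rng = np.random.default_rng(2)
T, r = 250, 4                         # sample size and autocorrelation order tested
x = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 0.5 + 1.2 * x + e

# 1. Estimate the regression by OLS and keep the residuals.
X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

# 2. Regress resid_t (t = r+1, ..., T) on the original regressors plus r lagged residuals.
rows = np.arange(r, T)
lagged = np.column_stack([resid[rows - j] for j in range(1, r + 1)])
aux = sm.OLS(resid[rows], np.column_stack([X[rows], lagged])).fit()

# 3. (T - r) * R^2 is asymptotically chi-squared with r degrees of freedom under H0.
stat = (T - r) * aux.rsquared
p_value = 1 - stats.chi2.cdf(stat, r)
print(f"(T-r)*R^2 = {stat:.2f}, p-value = {p_value:.4f}")

# Built-in alternative: statsmodels.stats.diagnostic.acorr_breusch_godfrey(fit, nlags=r)
```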
Consequences of ignoring Autocorrelation if it is present
• The coefficient estimates derived using OLS are still unbiased, but they are inefficient, i.e. they are not BLUE, even in large samples.
• Thus, if the standard error estimates are inappropriate, there is a possibility that we could make the wrong inferences.
• R² is likely to be inflated relative to its "correct" value for positively autocorrelated residuals.

"Remedies" for Autocorrelation
• If the form of the autocorrelation is known, we could use a GLS procedure, i.e. an approach that allows for autocorrelated residuals, e.g. Cochrane-Orcutt.
• But such procedures that "correct" for autocorrelation require assumptions about the form of the autocorrelation.
• If these assumptions are invalid, the cure would be more dangerous than the disease! See Hendry and Mizon (1978).
• However, it is unlikely that the form of the autocorrelation is known, and a more "modern" view is that residual autocorrelation presents an opportunity to modify the regression.

Dynamic Models
• All of the models we have considered so far have been static:
   yt = β1 + β2 x2t + · · · + βk xkt + ut
• But we can easily extend this analysis to the case where the current value of yt depends on previous values of y or of the x's, e.g.
   yt = β1 + β2 x2t + · · · + βk xkt + γ1 yt−1 + γ2 x2t−1 + · · · + γk xkt−1 + ut
• We could extend the model even further by adding extra lags, e.g. x2t−2, yt−3.
• Additional motivation for including lags:
  – Inertia of the dependent variable: it may take some time for the dependent variable to react to a news announcement, a change in policy, etc.
  – Overreaction: e.g. if a firm announces that its profits are expected to be lower than anticipated, the market may adjust pre-emptively by lowering the share price; when the exact profits are released, the market may readjust and increase the share price (though to a level below the original one).
• However, other problems with the regression could also cause the null hypothesis of no autocorrelation to be rejected:
  – Omission of relevant (autocorrelated) variables.
  – Misspecification through an inappropriate functional form (e.g. linear when the true relationship is not).
  – Unparameterised seasonality.
• Models in first differences
  – Another way to sometimes deal with the problem of autocorrelation is to switch to a model in first differences.
  – Denote the first difference of yt, i.e. yt − yt−1, as Δyt; similarly for the x-variables, Δx2t = x2t − x2t−1, etc.
  – The model is now
    Δyt = β1 + β2 Δx2t + · · · + βk Δxkt + ut
  – The change in y may also depend on previous values of y or xt:
    Δyt = β1 + β2 Δx2t + β3 Δx2t−1 + β4 yt−1 + ut
• Other problems with the addition of lagged regressors to "cure" autocorrelation:
  – Inclusion of lagged values of the dependent variable violates the assumption that the RHS variables are non-stochastic.
  – This is not a big deal if the asymptotic framework is reliable: the estimators are still consistent.
  – What does an equation with a large number of lags actually mean? Adding lagged variables may be motivated by "statistical analysis" rather than "economic theory": how do you interpret the new model, and how does it address the validity of the economic theory being tested?
  ! If there is still autocorrelation in the residuals of a model including lags, then the OLS estimators will not even be consistent. For example, in
    yt = β1 + β2 xt + β3 yt−1 + ut, with ut = ρ ut−1 + vt,
    it is easy to show that yt−1 is correlated with ut (via ut−1), which violates one of the key assumptions (no correlation between errors and regressors!).

4. Assumption: No perfect multicollinearity
• Perfect multicollinearity:
  – Easy to detect because it is associated with identification issues, e.g. the estimators cannot be computed and the software returns an error message.
  – Example: suppose x3 = 2 x2 and the model is yt = β1 + β2 x2t + β3 x3t + β4 x4t + ut.
• Real issue: near multicollinearity
  – R² will be high, but the individual coefficients will have high standard errors.
  – Confidence intervals for the parameters will be very wide, and significance tests might therefore give inappropriate conclusions.
  – The regression becomes very sensitive to small changes in the specification, e.g. additional observations or additional variables.
• Can we get a sense for "near-multicollinearity"? (A numerical illustration follows below.)
  – The first step is simply to look at the matrix of correlations between the individual variables, e.g.

    corr   x2    x3    x4
    x2      –   0.2   0.8
    x3    0.2    –    0.3
    x4    0.8   0.3    –

  – But unfortunately this does not inform us when three or more variables are linearly related, e.g. x2t + x3t = x4t.
  – Another indicator: the condition number of the matrix X′X:
    – near-multicollinearity means that X′X is close to being singular;
    – the condition number is the ratio of the largest to the smallest singular value (in the SVD): when it is "large", the matrix is ill-conditioned, i.e. close to being singular;
    – how large? Roughly, when log(C) exceeds the precision of the matrix entries.
• What can be done in the presence of near-multicollinearity?
  – Regularization methods, such as ridge regression or principal components. But these may bring more problems than they solve.
  – Some econometricians argue that if the model is otherwise OK, just ignore it.
  – The easiest ways to "cure" the problem are:
    – drop one of the collinear variables;
    – transform the highly correlated variables into a ratio;
    – go out and collect more data, e.g. a longer run of data, or switch to a higher frequency.
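A small numerical illustration on invented data of the two diagnostics just discussed: pairwise correlations miss a three-variable linear relationship of the form x2 + x3 ≈ x4, while the condition number of X′X flags it.

```python
import numpy as np

# Illustrative design matrix where x4 is almost exactly x2 + x3, so no single
# pairwise correlation is close to 1 even though the three are nearly collinear.
rng = np.random.default_rng(3)
T = 500
x2 = rng.normal(size=T)
x3 = rng.normal(size=T)
x4 = x2 + x3 + rng.normal(scale=0.01, size=T)    # near-exact linear dependence
X = np.column_stack([np.ones(T), x2, x3, x4])

# Pairwise correlations are only moderate (roughly 0, 0.7, 0.7)...
print(np.round(np.corrcoef(np.column_stack([x2, x3, x4]), rowvar=False), 2))

# ...but the condition number of X'X (largest / smallest singular value) is
# very large, signalling that X'X is close to singular.
singular_values = np.linalg.svd(X.T @ X, compute_uv=False)
cond = singular_values[0] / singular_values[-1]
print(f"condition number = {cond:.2e}, log10 = {np.log10(cond):.1f}")
```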
5. Omitting/Including some variables

Omission of an important variable
• Consequence: the estimated coefficients on all the other variables will be biased and inconsistent unless the excluded variable is uncorrelated with all the included variables.
• Even if this condition is satisfied, the estimate of the coefficient on the constant term will be biased.
• The standard errors will also be biased.

Inclusion of an irrelevant variable
• Coefficient estimates will still be consistent and unbiased, but the estimators will be inefficient.

Overall
• There is a bias-variance trade-off in finite samples: fewer regressors lead to more precise estimates, while more regressors lead to less bias.
• Asymptotically, only the bias matters, as it is of higher order than the variance: when the difference between the estimator and the true parameter value is scaled by √n, the scaled bias diverges while the variance stays bounded.

6. Errors correlated with regressors (endogeneity)
• We already discussed two examples:
  – when lagged variables are introduced in the model and the error term is still autocorrelated;
  – when relevant variables are omitted.
• Solution: find at least one instrumental variable for each endogenous regressor (you can use more) and compute the 2SLS or IV estimator (a sketch follows below).
  – The instrument Z needs to be correlated with the endogenous variable X, but cannot be correlated with the error term (so Z is exogenous and highly correlated with X).
• Updated regularity conditions:
  1. {(zi, xi, εi)} is a strictly stationary and ergodic sequence.
  2. E(zi′ xi) is of full rank.
  3. {(zi′ εi, Fi)} is a martingale difference sequence (MDS).
  4. E(xi,j⁴) < ∞; E(zi,k⁴) < ∞; E(εi²) = σ² < ∞.
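A stylised sketch of the just-identified IV estimator, β̂_IV = (Z′X)⁻¹ Z′y, contrasted with OLS when the regressor is correlated with the error. The data-generating process and all names are invented for illustration; with more instruments than endogenous regressors one would use 2SLS instead.

```python
import numpy as np

# Simulated endogeneity: x is correlated with the structural error e,
# while the instrument z is correlated with x but not with e.
rng = np.random.default_rng(4)
T = 1000
z = rng.normal(size=T)                         # instrument
e = rng.normal(size=T)                         # structural error
x = 0.8 * z + 0.5 * e + rng.normal(size=T)     # endogenous regressor
y = 1.0 + 2.0 * x + e                          # true slope is 2.0

X = np.column_stack([np.ones(T), x])           # regressors including constant
Z = np.column_stack([np.ones(T), z])           # instruments (constant instruments itself)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # inconsistent here (endogeneity bias)
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)    # IV estimator, consistent
print("OLS:", np.round(beta_ols, 3), " IV:", np.round(beta_iv, 3))
```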