UNC-Wilmington Department of Economics and Finance ECN 377 Dr. Chris Dumas

Regression Analysis--Autocorrelation and WLS

Recall that one of the assumptions of the OLS method is that the errors for the individuals in the population (and therefore, in the sample) are independent of one another. That is, the size of one individual's error does not affect the size of another individual's error. The Autocorrelation Problem occurs when this assumption is violated and the errors are somehow dependent on one another; that is, the errors affect one another in some way. Autocorrelation occurs more often in time-series data than in cross-section data, because often a large error in one time period will have "lingering effects" on later time periods, causing the errors in the two time periods to be related to one another (instead of being independent of one another).

There are many forms of autocorrelation, but in this handout we will focus on one of the most commonly encountered types: "First Order Serial Autocorrelation." First Order Serial Autocorrelation typically occurs in Time Series datasets. In Time Series datasets, each observation/individual/row of data corresponds to a particular point in time. For example, each row of data may refer to a particular month, or quarter, or year. In First Order Serial Autocorrelation, the error in one time period (row of data) lingers to affect the error in the next time period (row of data). The term "First Order" means that the lingering effect lasts for just one time period (in Second Order Serial Autocorrelation, the lingering effect can last for two time periods, etc.). The term "Serial" describes how the lingering effect occurs over and over again, one error affecting the next, for all time periods in the data set. The term "Autocorrelation" refers to the fact that the errors are affecting themselves (the "Auto," or "self-affecting," part), and this causes the errors to be correlated with one another.
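The "lingering effect" described above can be seen in a small simulation. The sketch below is illustrative Python, not part of the handout's SAS material, and the function names are hypothetical; it builds a series of errors in which a fraction rho of each period's error carries over into the next period, then checks that neighboring errors really are correlated:

```python
import random

def simulate_ar1_errors(rho, n, seed=42):
    """Simulate errors where a fraction rho of each period's error
    lingers into the next period: e_t = rho * e_(t-1) + v_t."""
    rng = random.Random(seed)
    errors = [rng.gauss(0, 1)]                 # period 1: a pure random shock
    for _ in range(1, n):
        v_t = rng.gauss(0, 1)                  # fresh random shock, v_t
        errors.append(rho * errors[-1] + v_t)  # plus the lingering fraction
    return errors

def lag1_correlation(e):
    """Sample correlation between each error and the previous period's error."""
    x, y = e[:-1], e[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# With rho = 0.8 the errors are strongly positively autocorrelated;
# for a long simulated series the lag-1 correlation lands near 0.8.
e = simulate_ar1_errors(rho=0.8, n=5000)
print(round(lag1_correlation(e), 2))
```

With rho set to 0, by contrast, the simulated errors are independent and the lag-1 correlation hovers near zero, which is what the OLS assumption requires.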
Mathematically, we can represent First Order Serial Autocorrelation as follows:

e_t = ρ·e_(t-1) + v_t,   where v_t is a random error term.

The equation above says that the error in time period t, e_t, is equal to a random error, v_t, plus a fraction, ρ, of the error from the preceding time period, e_(t-1). The key thing is that a fraction, ρ, of the error from the preceding time period gets incorporated into the error of the current time period--THIS IS WHAT MAKES THE TWO ERRORS CORRELATED AND WHAT CAUSES THE AUTOCORRELATION PROBLEM. (Recall that one of the assumptions of OLS is that the errors are NOT correlated in this way.) In the equation above, rho ("ρ") is the COEFFICIENT OF AUTOCORRELATION. The value of rho lies between -1 and 1, that is: -1 < ρ < 1.

Positive Autocorrelation: If rho is positive, a large (positive) error in one period increases the error in the next period.

Negative Autocorrelation: If rho is negative, a large (positive) error in one period decreases the error in the next period.

Problems Caused by Autocorrelation

1. Although the estimates of the coefficients (the β̂'s) are still unbiased, the estimates of the s.e.'s of the β̂'s are biased downward; as a result, we may incorrectly conclude that X variables affect Y when in fact they do not.

2. The S.E.R. is biased downward; as a result, we may conclude that the model fits the data better than it actually does.

Detecting Autocorrelation

We can detect Autocorrelation using:

1. A Residual Plot of the regression residuals, the "ehats," against time. If Autocorrelation is present, the ehats will show a systematic pattern over time rather than varying randomly. Figure 1 shows an example of no Autocorrelation. Figure 2 shows an example of Autocorrelation.

[Figure 1. No Autocorrelation: plot of the ehats against time; the variation of the ehats remains the same over time.]

[Figure 2. Autocorrelation Present: plot of the ehats against time.]
In Figure 2, the variation of the ehats changes (in this example, it cycles up and down) over time.

2. The Durbin-Watson (DW) "d" statistic test can also be used to detect Autocorrelation, especially when it is difficult to determine whether Autocorrelation is present from looking at the residual (ehat, ê) plots against time. The Durbin-Watson "d" statistic tests the following null hypothesis:

H0: autocorrelation is not present
H1: autocorrelation is present

The Durbin-Watson "d" statistic is calculated from the regression residuals, the "ehats" (ê's), according to the following formula:

dtest = [ Σ(t=2 to n) (ê_t - ê_(t-1))² ] / [ Σ(t=1 to n) ê_t² ],   where 0 < dtest < 4, and:

dtest = 0 ==> perfect POSITIVE autocorrelation
dtest = 2 ==> no autocorrelation
dtest = 4 ==> perfect NEGATIVE autocorrelation

where "t" denotes the observation (the time period) and there are n total observations. IT IS ASSUMED THAT THE DATA ARE A TIME SERIES STARTING AT t = 1 AND ENDING AT t = n. Notice that in the numerator the difference between each residual at time period "t" and the residual one time period before it (at time period "t - 1") is squared. Because the residual at time t = 1 has no residual "before" it in time, it is not included in the summation in the numerator. In the denominator, each residual is simply squared before being summed, including the residual at t = 1.

The dtest statistic calculated using the formula above is compared to "dcritical" values from the Durbin-Watson d-statistic table (handed out in lecture). The Durbin-Watson test actually uses two "dcritical" values, a "dcrit-upper" and a "dcrit-lower." Both critical values depend on sample size, n, and the number of X variables in your model, (k-1). The two dcrit values from the table are used to calculate two more critical values: (4 - dcrit-lower) and (4 - dcrit-upper). So, a total of four critical values are used in the DW test. See the example below.
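The dtest formula can be computed directly from the residuals. The sketch below is illustrative Python, not the handout's SAS code, and the function name is hypothetical:

```python
def durbin_watson(ehats):
    """Durbin-Watson d statistic: the sum over t = 2..n of
    (ehat_t - ehat_(t-1))^2, divided by the sum over t = 1..n
    of ehat_t^2.  Always 0 < d < 4, with d near 2 meaning no
    autocorrelation."""
    num = sum((ehats[t] - ehats[t - 1]) ** 2 for t in range(1, len(ehats)))
    den = sum(e ** 2 for e in ehats)
    return num / den

# Residuals that stay on the same side period after period give d near 0
# (positive autocorrelation); residuals that flip sign every period give
# d near 4 (negative autocorrelation).
print(durbin_watson([1.0] * 10))        # identical residuals -> 0.0
print(durbin_watson([1.0, -1.0] * 50))  # alternating residuals -> 3.96
```

Note that the numerator sum starts at the second residual, exactly as the formula above requires, because the first residual has no predecessor.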
Example of the Durbin-Watson Test: Suppose that in a study of time series data we have a sample size n = 40, the number of X variables in the model is (k-1) = 4, and we choose an α = 5% significance level for the Durbin-Watson test. The Durbin-Watson d-table shows that dcrit-lower = 1.285 and dcrit-upper = 1.721. The dtest value is calculated using the formula above and is compared to dcrit-upper, dcrit-lower, (4 - dcrit-upper), and (4 - dcrit-lower) on the d-axis scale below to determine whether to accept or reject H0. The d-axis scale is divided into several regions. Several outcomes are possible, depending on the region into which dtest falls.

The d-axis scale (reading from 4 down to 0):

4.000 down to 2.715   ==> negative autocorrelation   [2.715 = (4 - dcrit-lower)]
2.715 down to 2.279   ==> test inconclusive          [2.279 = (4 - dcrit-upper)]
2.279 down to 1.721   ==> no autocorrelation         [1.721 = dcrit-upper; this region contains 2]
1.721 down to 1.285   ==> test inconclusive          [1.285 = dcrit-lower]
1.285 down to 0.000   ==> positive autocorrelation

For example, if dtest = 0.83, then we would conclude that we have positive autocorrelation. If dtest = 3.24, then we would conclude that we have negative autocorrelation. If dtest = 1.88, then we would conclude that we have no autocorrelation. If dtest = 2.53, then the test is inconclusive, and we remain unsure about whether or not there is autocorrelation.

The following additional assumptions must be met for the d statistic to be valid:
1. the regression model includes an intercept
2. the regression does not use lagged values of Y as an X variable

Correcting Autocorrelation Using Weighted Least Squares (WLS) Regression:

Okay, assuming that the autocorrelation is FIRST-ORDER SERIAL AUTOCORRELATION, we can estimate the Coefficient of Autocorrelation (rho) from the Durbin-Watson dtest statistic using the following formula:

ρ̂ = 1 - (dtest / 2),

where ρ̂, "rho hat," is our estimate of rho, based on our data. (That is, we calculate ρ̂ based on dtest, but recall that we calculated dtest based on our ehats, and the ehats are based on our X and Y data.)
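The four-critical-value decision rule and the rho-hat formula above can be sketched in a few lines. This is illustrative Python with hypothetical function names, not part of the handout's SAS program; it reproduces the example's conclusions for dtest = 0.83, 1.88, 2.53, and 3.24:

```python
def dw_decision(d_test, d_crit_lower, d_crit_upper):
    """Classify a Durbin-Watson d statistic using the four critical
    values d_crit_lower, d_crit_upper, (4 - d_crit_upper), and
    (4 - d_crit_lower); the table values depend on n, (k-1), and alpha."""
    if d_test < d_crit_lower:
        return "positive autocorrelation"
    if d_test < d_crit_upper:
        return "test inconclusive"
    if d_test <= 4 - d_crit_upper:
        return "no autocorrelation"
    if d_test <= 4 - d_crit_lower:
        return "test inconclusive"
    return "negative autocorrelation"

def rho_hat(d_test):
    """Estimate of the coefficient of autocorrelation: 1 - d_test/2."""
    return 1 - d_test / 2

# The handout's example: n = 40, (k-1) = 4, alpha = 5%,
# so d_crit_lower = 1.285 and d_crit_upper = 1.721.
for d in (0.83, 1.88, 2.53, 3.24):
    print(d, "->", dw_decision(d, 1.285, 1.721))
```

A dtest near 0 also gives a rho-hat near 1 (strong positive autocorrelation), and a dtest near 2 gives a rho-hat near 0, consistent with the d-axis scale above.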
We use ρ̂ to weight (adjust) our X and Y data and the intercept of the model as follows.

For time period t = 2 and all later time periods:

Y*_t = Y_t - ρ̂·Y_(t-1)
X*_t = X_t - ρ̂·X_(t-1)
β*_0 = (1 - ρ̂)·β_0

We need to use special formulas for the first time period, because there is no time period t - 1 before the first time period in our data set. For time period t = 1 only:

Y*_1 = sqrt(1 - ρ̂²)·Y_1
X*_1 = sqrt(1 - ρ̂²)·X_1

Finally, run the regression:

Y*_t = β*_0 + β_x·X*_t + e_t

This is another example of Weighted Least Squares (WLS) regression. In this example, we are using ρ̂, "rho hat," to weight the data in such a way as to remove the effects of autocorrelation.

Autocorrelation in SAS:

Suppose we have a time series data set with three variables: TIME (year), RWAGES (real wages) and PRODUCT (national product as measured by GDP). We want to run a regression to determine the relationship between real wages and national product. However, because we are working with a time series data set, we want to test for autocorrelation and correct for it, if it is present. If autocorrelation is present, we can use PROC AUTOREG in SAS to conduct a WLS regression to correct for the autocorrelation.

/* SOFTWARE: SAS Statistical Software program, version 9.2
   AUTHOR: Dr. Chris Dumas, UNC-Wilmington, Spring, 2013.
   TITLE: Program to perform weighted least squares (WLS) regression
          to correct for autocorrelation. */

proc import datafile="v:\ECN377\timeseriesdata.xls" dbms=xls
  out=dataset01 replace;
run;

/* proc reg below conducts a regression of variable RWAGES (real wages)
   against variable PRODUCT (national product as measured by GDP).
   The "dw" option on the model command requests that SAS calculate
   the durbin-watson statistic. */

proc reg data=dataset01;
  model rwages = product / dw;
  output out=dataset02 r=ehat;
run;

/* The proc plot below graphs the residuals (ehat's) against time
   to check for autocorrelation.
   The pattern in the residuals indicates that autocorrelation DOES appear
   to be present. Also, the durbin-watson dtest statistic calculated above
   is 0.214, which indicates positive autocorrelation. */

proc plot data=dataset02;
  plot ehat*time;
run;

/* The PROC AUTOREG command below corrects for autocorrelation. First,
   PROC AUTOREG gives the results for an uncorrected OLS regression (this
   repeats what we did above). Next, PROC AUTOREG estimates rho based on
   the residuals from the OLS regression. Under "Estimates of Autoregressive
   Parameters" in the output window, SAS shows rho to be -0.814743, BUT SAS
   ALWAYS GIVES THE NEGATIVE OF THE TRUE RHO, SO THE ACTUAL ESTIMATE OF RHO
   IS +0.814743. Next, SAS uses rho to weight the Y and X variables. The
   "nlag=1" option in the model command below tells SAS that we are
   correcting for FIRST ORDER serial autocorrelation--the error in period t
   depends on the error one time period earlier. (For other types of
   autocorrelation, nlag is greater than 1, but we won't get into that
   here.) Finally, the "output" command saves the ehats from the
   autocorrelation-corrected regression and names them "ehatnew". We want
   to save the ehatnew's so that we can plot them against time and check
   whether the autocorrelation correction worked. */

proc autoreg data=dataset01;
  model rwages = product / nlag=1;
  output out=dataset03 residual=ehatnew;
run;

/* The proc plot command below graphs the ehatnew's against time to see
   whether a pattern still exists after we adjusted for autocorrelation.
   Very little pattern remains, so it looks as if the autocorrelation
   correction worked pretty well. */

proc plot data=dataset03;
  plot ehatnew*time;
run;
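For intuition about what PROC AUTOREG is doing with the weights, the WLS transformation formulas can be sketched outside of SAS. The following is an illustrative Python version with a hypothetical function name and made-up numbers, not part of the handout's SAS program:

```python
def wls_transform(y, x, rho):
    """Weight Y and X as in the handout: for t >= 2,
    Y*_t = Y_t - rho*Y_(t-1) and X*_t = X_t - rho*X_(t-1);
    for t = 1, Y*_1 = sqrt(1 - rho^2)*Y_1 (and likewise for X),
    so the first observation is kept rather than dropped."""
    w1 = (1 - rho ** 2) ** 0.5
    y_star = [w1 * y[0]] + [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    x_star = [w1 * x[0]] + [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    return y_star, x_star

# Tiny made-up series with rho_hat = 0.8 (so sqrt(1 - rho^2) = 0.6):
y_star, x_star = wls_transform([10, 12, 15, 14], [1, 2, 3, 4], rho=0.8)
print([round(v, 2) for v in y_star])  # [6.0, 4.0, 5.4, 2.0]
print([round(v, 2) for v in x_star])  # [0.6, 1.2, 1.4, 1.6]
```

Running an ordinary OLS regression of Y* on X* then gives the autocorrelation-corrected estimates, with the fitted intercept interpreted as (1 - ρ̂)·β_0, as in the WLS section above.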