Center for Teaching, Research and Learning
Research Support Group
American University, Washington, D.C.
Hurst Hall 203
rsg@american.edu
(202) 885-3862

Advanced SAS: Time Series Analysis

Workshop Objective

This workshop introduces time series analysis in SAS. Time series data consist of observations recorded over time. Ordinary Least Squares (OLS) analysis can yield spurious results with time series data because the variables may be non-stationary, so time series analysis requires caution. This workshop covers tests designed to improve the interpretation of results.

Learning Outcomes

1. Descriptive Statistics
2. Time Series Operators
3. Unit Root Tests
4. Regression Analysis

1. Descriptive Statistics

Open the dataset (Stata data format):

LIBNAME AUSAS "C:\";
PROC IMPORT OUT=AUSAS.data
    DATAFILE="J:\CLASSES\Workshops\ts_wdi.dta"
    DBMS=dta REPLACE;
RUN;

We have annual time series data on Belgium from 1960 to 2012. In this workshop we concentrate on two variables: the under-5 mortality rate and GDP per capita. The MEANS procedure generates descriptive statistics for both variables.

PROC MEANS DATA=AUSAS.data N MEAN STD MEDIAN QRANGE MIN MAX CV;
    VAR mortrate_under5 gdppc2005c;
RUN;

2. Time Series Operators

Time series operators include lags, leads, differences, and seasonal operators. Most time series analyses examine the relationships among the levels, lags, and differences of variables.

The Lag/Forward/Difference Operators

To generate lagged, lead (forward), and differenced (current minus past) values, the EXPAND procedure is useful and simple. Here we will instead do the transformations in a DATA step. For the under-5 mortality rate, we generate the 1st difference; the difference of the 1st difference (the 2nd difference); the 1st lag and the lag of the 1st difference; the log transformation; and the 1st and 2nd differences of the log transformation.
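The statistics PROC MEANS reports can be reproduced by hand. As a cross-check outside SAS, the short Python sketch below computes the count, mean, standard deviation, and coefficient of variation (which PROC MEANS reports as 100 times the standard deviation over the mean) for a made-up series standing in for one of the workshop variables.

```python
import statistics

# Made-up sample values standing in for one of the workshop variables.
values = [24.0, 22.5, 21.0, 19.8, 18.6, 17.5]

n = len(values)
mean = statistics.mean(values)
std = statistics.stdev(values)   # sample standard deviation, as in PROC MEANS
cv = 100 * std / mean            # coefficient of variation, in percent
print(n, mean, round(std, 3), round(cv, 2))
```

The median, quartile range, minimum, and maximum that PROC MEANS also reports follow the same pattern with `statistics.median`, `min`, and `max`.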
We generate the same transformations for GDP per capita. The PRINT procedure prints the first observations to the output window to illustrate the resulting data.

TITLE1 'Plot';
TITLE2 'Mortality Rate: 1ST Differences, and 2ND Differences';
DATA AUSAS.data;
    SET AUSAS.data;
    mortrate_under5_D1 = DIF(mortrate_under5);
    mortrate_under5_D2 = DIF(mortrate_under5_D1);
    mortrate_under5_L1 = LAG(mortrate_under5);
    mortrate_under5_L2 = LAG(mortrate_under5_D1);
    mortrate_under5_ln = LOG(mortrate_under5);
    mortrate_under5_lnD1 = DIF(mortrate_under5_ln);
    mortrate_under5_lnD2 = DIF(mortrate_under5_lnD1);
    mortrate_under5_lnL2 = LAG(mortrate_under5_lnD1);
    gdppc2005c_D1 = DIF(gdppc2005c);
    gdppc2005c_D2 = DIF(gdppc2005c_D1);
    gdppc2005c_L1 = LAG(gdppc2005c);
    gdppc2005c_L2 = LAG(gdppc2005c_D1);
    gdppc2005c_ln = LOG(gdppc2005c);
    gdppc2005c_lnD1 = DIF(gdppc2005c_ln);
    gdppc2005c_lnD2 = DIF(gdppc2005c_lnD1);
    diff_md1_gdpd1 = mortrate_under5_D1 - gdppc2005c_D1;
    diff_md2_gdpd2 = mortrate_under5_D2 - gdppc2005c_D2;
    diffln_md1_gdpd1 = mortrate_under5_lnD1 - gdppc2005c_lnD1;
    diffln_md2_gdpd2 = mortrate_under5_lnD2 - gdppc2005c_lnD2;
RUN;

PROC PRINT DATA=AUSAS.data(OBS=10 KEEP=CountryCode year mortrate_under5
    mortrate_under5_D1 mortrate_under5_D2 mortrate_under5_L1 mortrate_under5_L2
    mortrate_under5_ln mortrate_under5_lnD1 mortrate_under5_lnD2);
RUN;

For time series variables, it is good practice to plot the data in order to observe trends. Below, we plot mortrate_under5, its 1st difference, and its 2nd difference. The GPLOT procedure produces the plots without displaying them, while GREPLAY displays them in the same graph.
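The DIF and LAG operators behave like simple list shifts: LAG returns the value from the previous observation (missing for the first), and DIF subtracts that lagged value from the current one. As a cross-check outside SAS, here is a minimal Python sketch of the same operators on a made-up series, with None standing in for SAS's missing value.

```python
# Illustrative re-implementation of SAS's DIF and LAG operators.
# The sample series is made up; None plays the role of SAS's missing value.

def lag(series, k=1):
    """k-th lag: value from k observations earlier (None where undefined)."""
    return [None] * k + series[:-k]

def dif(series):
    """First difference: current value minus previous value."""
    lagged = lag(series)
    return [None if (a is None or b is None) else a - b
            for a, b in zip(series, lagged)]

y = [10.0, 9.0, 8.5, 7.0, 6.0]
d1 = dif(y)    # 1st difference
d2 = dif(d1)   # 2nd difference (difference of the 1st difference)
l1 = lag(y)    # 1st lag

print(d1)  # [None, -1.0, -0.5, -1.5, -1.0]
print(d2)  # [None, None, 0.5, -1.0, 0.5]
print(l1)  # [None, 10.0, 9.0, 8.5, 7.0]
```

Note how each differencing pass costs one more missing value at the start of the series, exactly as in the PROC PRINT output.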
TITLE2 'Mortality Rate';
PROC GPLOT DATA=AUSAS.data GOUT=fig;
    PLOT mortrate_under5*year / VAXIS=AXIS1 HAXIS=AXIS2;
RUN; QUIT;

TITLE2 '1ST Differences of Mortality Rate';
PROC GPLOT DATA=AUSAS.data GOUT=fig;
    PLOT mortrate_under5_D1*year / VAXIS=AXIS1 HAXIS=AXIS2;
RUN; QUIT;

TITLE2 '2ND Differences of Mortality Rate';
PROC GPLOT DATA=AUSAS.data GOUT=fig;
    PLOT mortrate_under5_D2*year / VAXIS=AXIS1 HAXIS=AXIS2;
RUN; QUIT;

The top panel of the graph above shows a negative trend in the mortality rate over time, while the middle panel shows an upward trend in its 1st difference. Neither series appears stationary. The 2nd difference in the bottom panel, however, does appear stationary. The GDP per capita plots are provided below as well. Although the level of GDP per capita is trending, both the 1st and 2nd differences appear to meet the stationarity conditions: the mean and variance are independent of time.

TITLE2 'GDP';
PROC GPLOT DATA=AUSAS.data GOUT=fig1;
    PLOT gdppc2005c*year / VAXIS=AXIS01 HAXIS=AXIS2;
RUN; QUIT;

TITLE2 '1ST Differences of GDP';
PROC GPLOT DATA=AUSAS.data GOUT=fig1;
    PLOT gdppc2005c_D1*year / VAXIS=AXIS01 HAXIS=AXIS2;
RUN; QUIT;

TITLE2 '2ND Differences of GDP';
PROC GPLOT DATA=AUSAS.data GOUT=fig1;
    PLOT gdppc2005c_D2*year / VAXIS=AXIS01 HAXIS=AXIS2;
RUN; QUIT;

If the variance is time-dependent, a logarithmic transformation may be a good approach. For completeness, the program file includes the code to plot the log transformations of the mortality rate and GDP per capita. Using the ARIMA procedure, we compute the autocorrelation and partial autocorrelation functions of the 1st differences of both variables in order to detect any long-term memory in the data. PROC ARIMA is organized around the three stages of the Box-Jenkins approach: identification, estimation and diagnostic checking, and forecasting. In this workshop we focus only on the first two stages.
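The slow ACF decay that signals nonstationarity is easy to see numerically. The sketch below computes the sample autocorrelation function (the quantity that PROC ARIMA's IDENTIFY statement plots) in plain Python for a simulated trending series and a simulated white-noise series; the series and seed are made up for illustration.

```python
import random

def sample_acf(x, nlags):
    """Sample ACF: lag-k autocovariance over the variance of the series."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    denom = sum(v * v for v in xc)
    acf = [1.0]
    for k in range(1, nlags + 1):
        acf.append(sum(xc[t] * xc[t + k] for t in range(n - k)) / denom)
    return acf

random.seed(0)
trend = [t + random.gauss(0, 1) for t in range(100)]   # trending, nonstationary
noise = [random.gauss(0, 1) for _ in range(100)]       # stationary white noise

acf_trend = sample_acf(trend, 8)
acf_noise = sample_acf(noise, 8)
# The trending series' ACF stays near 1 for many lags; the white-noise ACF
# drops to roughly zero immediately after lag 0.
```

This is the pattern to look for in the IDENTIFY output: a nonstationary series keeps its autocorrelations high over many lags, while a stationary one decays quickly toward zero.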
PROC ARIMA DATA=AUSAS.data;
    IDENTIFY VAR=mortrate_under5 SCAN MINIC NLAG=8;
    RUN;
    ESTIMATE P=1;
    RUN;
    ESTIMATE P=1 Q=1;
    RUN;
QUIT;

The identification stage is specified with the IDENTIFY statement, and the ESTIMATE statement carries out the estimation stage. The IDENTIFY statement first prints descriptive statistics for the mortality rate series and plots of the series, the autocorrelation function (ACF), the partial autocorrelation function (PACF), and the inverse autocorrelation function (IACF). The autocorrelation plot shows how values of the series are correlated with its past values. Visual inspection again suggests that the series is not stationary, since its ACF decays very slowly. By default, the procedure also checks for white noise. The null hypothesis is that none of the autocorrelations of the series at the given lags differs significantly from 0. Here the white noise hypothesis is rejected (p < 0.0001), which is not surprising given that the series appears nonstationary. We performed two diagnostic checks to see whether an AR(1) or an ARMA(1,1) model is adequate. The table below lists the estimated parameters of the model: MU is the mean term and AR1,1 is the autoregressive parameter. Both are highly significant, and the autoregressive estimate close to 1 suggests a unit root. The ESTIMATE statement also checks the autocorrelations of the residuals; as shown above, we reject the hypothesis that the residuals are white noise (no autocorrelation). The second ESTIMATE statement fits a mixed autoregressive moving average (ARMA) model of order (1,1), whose moving average parameter estimate is labeled MA1,1. All parameters are highly significant, and we again reject that the residuals are white noise. Both models suggest we should try a different transformation of the series, most likely the difference. The following code repeats the same procedures for the first difference of the series.
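The ESTIMATE P=1 statement fits an AR(1) model, x_t = mu + phi*(x_{t-1} - mu) + e_t. A quick way to see what the autoregressive parameter captures is the Yule-Walker estimate, which for AR(1) is simply the lag-1 sample autocorrelation. The Python sketch below (simulated data, made-up seed) recovers a known phi = 0.6; it illustrates the estimation idea only, not PROC ARIMA's exact least-squares/maximum-likelihood algorithm.

```python
import random

def ar1_yule_walker(x):
    """Yule-Walker estimate of the AR(1) coefficient:
    the lag-1 sample autocorrelation of the mean-centered series."""
    n = len(x)
    m = sum(x) / n
    xc = [v - m for v in x]
    return sum(xc[t] * xc[t + 1] for t in range(n - 1)) / sum(v * v for v in xc)

random.seed(3)
x = [0.0]
for _ in range(2000):                       # simulate AR(1) with phi = 0.6
    x.append(0.6 * x[-1] + random.gauss(0, 1))
phi_hat = ar1_yule_walker(x)
print(phi_hat)                              # close to the true 0.6
```

When the series has a unit root, this estimate drifts toward 1, which is exactly the warning sign discussed above.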
Both models' parameters are significant, but only the AR(1) residuals are white noise, so we choose the AR(1) model to characterize the data generating process of the first difference of the mortality rate.

PROC ARIMA DATA=AUSAS.data;
    IDENTIFY VAR=mortrate_under5(1) SCAN MINIC NLAG=8;
    RUN;
    ESTIMATE P=1;
    RUN;
    ESTIMATE P=1 Q=1;
    RUN;
QUIT;

Now let us explore the GDP per capita series, following the same procedure as for the mortality rate. Since our visual inspection above suggests at least a 1st difference, we skip the analysis of the level and go directly to the 1st difference. Below, the ACF decays very quickly, which suggests that the 1st difference of GDP per capita is likely stationary; we cannot reject the no-autocorrelation hypothesis. We conduct the formal stationarity test in the next section.

PROC ARIMA DATA=AUSAS.data;
    IDENTIFY VAR=gdppc2005c(1) SCAN MINIC NLAG=8;
    RUN;
    ESTIMATE P=1;
    RUN;
    ESTIMATE P=1 Q=1;
    RUN;
QUIT;

The estimated mean MU is statistically significant, while the AR1,1 term adds no further information to the model, and the residual checks show no remaining information in the residuals. Based on these diagnostics, we conclude that the 1st difference of the GDP per capita series appears stationary.

3. Testing for Unit Roots

Many economic and financial time series exhibit trending behavior, or non-stationarity in the mean. Regression without due consideration of unit roots can yield spurious results. Before we explain the concept of unit roots, let us run an OLS regression of the mortality rate (under 5) on GDP per capita.

The coefficient is negative (-0.00106) and significant (t = -28.72), and the R-squared is very high (0.9418). The relationship appears much stronger than we would expect, and the R-squared is inflated. This common result is often called spurious regression.
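The spurious regression phenomenon is easy to reproduce by simulation: regress one random walk on another, independently generated one, and the R-squared in levels is typically large even though the true relationship is nil, while the regression in first differences correctly shows almost no fit. The Python sketch below is illustrative only; the seed and sample size are arbitrary.

```python
import random

def ols_slope_r2(y, x):
    """Simple OLS of y on x with intercept; returns (slope, R-squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b, 1 - ss_res / ss_tot

random.seed(42)
# Two INDEPENDENT random walks: any apparent relationship is spurious.
rw1, rw2 = [0.0], [0.0]
for _ in range(500):
    rw1.append(rw1[-1] + random.gauss(0, 1))
    rw2.append(rw2[-1] + random.gauss(0, 1))

_, r2_levels = ols_slope_r2(rw1, rw2)                    # regression in levels
d1 = [rw1[t + 1] - rw1[t] for t in range(500)]
d2 = [rw2[t + 1] - rw2[t] for t in range(500)]
_, r2_diffs = ols_slope_r2(d1, d2)                       # regression in differences
print(r2_levels, r2_diffs)   # levels R2 is often large; differenced R2 is tiny
```

This mirrors the mortality-rate/GDP result in the handout: differencing removes the shared stochastic trend and the inflated fit with it.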
The result is not wrong per se, but if you are trying to draw causal conclusions, you will over- or underestimate the importance of your variables when they are not stationary. The residual plot above makes it very clear that our regression result is spurious. For comparison, it is interesting to estimate the regression model in differences. Below, we run an OLS regression of the 1st difference of the mortality rate (under 5) on the 1st difference of GDP per capita.

The coefficient becomes minute and insignificant, and the R-squared falls below 0.04. The apparent relationship between the mortality rate and GDP per capita disappears. Looking at the residual plot above, the serial correlation is washed out. To find out more about time series data and the proper inference techniques, please consult a good textbook on time series econometrics. In this tutorial we simply state that it is generally a good idea to make sure that the variables of interest are stationary before doing any inference (correlations, regressions, etc.). To provide conclusive evidence about whether a variable is stationary or has a unit root, we undertake unit root tests. The section below covers some important methods used to test for unit roots.

Augmented Dickey-Fuller Test (ADF Test)

The null hypothesis of the ADF test is that the variable contains a unit root; the alternative is that the variable was generated by a stationary process. The test allows several options, such as including a constant, a trend, and lagged values of the variable. The UNITROOT macro below implements the test manually, in four steps. First, we run regressions of the mortality rate and GDP per capita on their own 1st lags, and export the estimated coefficients to the "est" data set.
Second, we calculate the test statistic for whether the coefficient on the 1st lag equals 1 (that is, a unit root). Third, we calculate the corresponding p-values for three different specifications: zero mean, single mean, and trend. Finally, we print the results to the output window.

/* Dickey-Fuller test */
TITLE1 'Testing for Unit Roots by Dickey-Fuller';
%MACRO UNITROOT(Dsn=,   /* Data set name                    */
                Dvar1=, /* Dependent variable for model 1   */
                Ivar1=, /* Independent variable for model 1 */
                Dvar2=, /* Dependent variable for model 2   */
                Ivar2=, /* Independent variable for model 2 */
                n=      /* Number of observations           */);

/* Estimate the lag coefficient for both series by regression */
PROC REG DATA=&Dsn OUTEST=est;
    MODEL &Dvar1 = &Ivar1 / NOINT NOPRINT;
    MODEL &Dvar2 = &Ivar2 / NOINT NOPRINT;
RUN;

/* Compute test statistics for both series */
DATA dickeyfuller1;
    SET est;
    x&Dvar1 = &n*(&Ivar1 - 1);
    x&Dvar2 = &n*(&Ivar2 - 1);
RUN;

/* Compute p-values for the three models */
DATA dickeyfuller2;
    SET dickeyfuller1;
    p&Dvar1  = PROBDF(x&Dvar1, %eval(&n-1), 1, "RZM");
    p&Dvar2  = PROBDF(x&Dvar2, %eval(&n-1), 1, "RZM");
    p1&Dvar1 = PROBDF(x&Dvar1, %eval(&n-1), 1, "RSM");
    p1&Dvar2 = PROBDF(x&Dvar2, %eval(&n-1), 1, "RSM");
    p2&Dvar1 = PROBDF(x&Dvar1, %eval(&n-1), 1, "RTR");
    p2&Dvar2 = PROBDF(x&Dvar2, %eval(&n-1), 1, "RTR");
RUN;

/* Print the results */
PROC PRINT DATA=dickeyfuller2(KEEP= x&Dvar1 x&Dvar2 p&Dvar1 p&Dvar2
                                    p1&Dvar1 p1&Dvar2 p2&Dvar1 p2&Dvar2);
RUN;
QUIT;
%MEND UNITROOT;

%UNITROOT(Dsn=AUSAS.data,
          Dvar1=mortrate_under5,
          Ivar1=mortrate_under5_L1,
          Dvar2=gdppc2005c,
          Ivar2=gdppc2005c_L1,
          n=53);

The null hypothesis is that a unit root exists. Columns 3, 5, and 7 (zero mean, single mean, and trend) of the 1st observation report the p-values for the mortality rate, while columns 4, 6, and 8 of the 2nd observation report those for GDP per capita. Based on the table above, we cannot reject the null hypothesis of a unit root for the levels of both variables.
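The statistic the macro computes, n*(rho_hat - 1) with rho_hat the no-intercept slope of the series on its own lag, can be reproduced directly. The Python sketch below (simulated data, arbitrary seed) applies it to a random walk, where the statistic stays near zero, and to white noise, where it is strongly negative; this corresponds to the zero-mean ("RZM") case of PROBDF. The p-value lookup itself requires Dickey-Fuller tables and is omitted here.

```python
import random

def df_stat(y):
    """Dickey-Fuller normalized-bias statistic n*(rho_hat - 1), where
    rho_hat is the no-intercept OLS slope of y_t on y_{t-1}."""
    num = sum(y[t - 1] * y[t] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    rho_hat = num / den
    return (len(y) - 1) * (rho_hat - 1)

random.seed(1)
walk = [0.0]                                        # random walk: has a unit root
for _ in range(200):
    walk.append(walk[-1] + random.gauss(0, 1))
noise = [random.gauss(0, 1) for _ in range(201)]    # white noise: stationary

stat_walk = df_stat(walk)
stat_noise = df_stat(noise)
# stat_walk is small in magnitude (unit root not rejected);
# stat_noise is far below zero (unit root clearly rejected).
```

This is why the levels of both workshop variables, whose rho_hat is close to 1, fail to reject the unit root null.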
However, the 1st difference of the log of the mortality rate and the 1st difference of GDP per capita appear stationary; see the table below.

%UNITROOT(Dsn=AUSAS.data,
          Dvar1=mortrate_under5_lnD1,
          Ivar1=mortrate_under5_lnL2,
          Dvar2=gdppc2005c_D1,
          Ivar2=gdppc2005c_L2,
          n=53);

4. Durbin-Watson Test for Serial Correlation

The Durbin-Watson test for autocorrelation tests whether the residuals from a linear or multiple regression are independent or serially correlated. The null hypothesis of the test is that there is no first-order autocorrelation. We specify several models below; all of the tests indicate that there is no serial correlation, since the null hypothesis is not rejected.

TITLE1 'Testing for serial correlation';
PROC AUTOREG DATA=AUSAS.data;
    MODEL mortrate_under5 = gdppc2005c / DW=4 DWPROB;
    MODEL mortrate_under5_D1 = gdppc2005c_D1 / DW=4 DWPROB;
    MODEL mortrate_under5_lnD1 = gdppc2005c_lnD1 / DW=4 DWPROB;
    MODEL mortrate_under5 = gdppc2005c mortrate_under5_L1 mortrate_under5_L2
          gdppc2005c_L1 gdppc2005c_L2 / DW=4 DWPROB;
RUN;
QUIT;

The Durbin-Watson test is a widely used method of testing for autocorrelation. The first-order Durbin-Watson statistic is printed by default and can be used to test for first-order autocorrelation. Use the DWPROB option to print the significance levels (p-values) for the Durbin-Watson tests. (Since the Durbin-Watson p-values are computationally expensive, they are not reported by default.) You can use the DW= option to request higher-order Durbin-Watson statistics. Since the ordinary Durbin-Watson statistic tests only for first-order autocorrelation, the Durbin-Watson statistics for higher-order autocorrelation are called generalized Durbin-Watson statistics.
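The first-order Durbin-Watson statistic that PROC AUTOREG prints is DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), which is near 2 for independent residuals, near 0 under strong positive autocorrelation, and near 4 under strong negative autocorrelation. A plain-Python sketch on simulated residuals (arbitrary seed), for illustration only:

```python
import random

def durbin_watson(resid):
    """First-order DW statistic: sum of squared successive differences of
    the residuals over their sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

random.seed(7)
white = [random.gauss(0, 1) for _ in range(500)]    # independent residuals
ar1 = [0.0]
for _ in range(499):                                # strongly autocorrelated
    ar1.append(0.9 * ar1[-1] + random.gauss(0, 1))

dw_white = durbin_watson(white)   # close to 2
dw_ar1 = durbin_watson(ar1)       # well below 2, roughly 2*(1 - 0.9)
```

Reading the PROC AUTOREG output the same way, values far below 2 (as in the spurious levels regression) indicate positive serial correlation in the residuals.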