Center for Teaching, Research and Learning
Research Support Group
American University, Washington, D.C.
Hurst Hall 203
rsg@american.edu
(202) 885-3862
Advanced SAS: Time Series Analysis
Workshop Objective
This workshop is designed to introduce SAS for time series analysis. Time series
data consist of observations recorded over time. Ordinary Least Squares (OLS)
analysis can yield spurious results with time series data because the variables may
be non-stationary; hence, time series analysis requires caution. This workshop
covers tests designed to improve the interpretation of results.
Learning Outcomes
1. Descriptive Statistics
2. Time Series Operators
3. Unit Root Tests
4. Regression Analysis
1. Descriptive Statistics
Open the dataset (STATA data format):
LIBNAME AUSAS "C:\";
PROC IMPORT OUT= AUSAS.data DATAFILE="J:\CLASSES\Workshops\ts_wdi.dta"
DBMS=dta REPLACE;
RUN;
We have annual time series data on Belgium from 1960 to 2012. In this workshop, we
will concentrate on two variables: the under-5 mortality rate and GDP per capita.
The MEANS procedure generates the descriptive statistics for both variables;
afterward, we turn to some time series operators.
PROC MEANS DATA=AUSAS.data N MEAN STD MEDIAN QRANGE MIN MAX CV;
VAR mortrate_under5 gdppc2005c;
RUN;
2. Time Series Operators
Time series operators include lags, leads, differences, and seasonal operators. Most
time series analyses examine relationships among the levels, lags, and differences of
variables.
 The Lag/Forward/Difference Operators
To generate lagged (past), lead (forward), or differenced (current minus past) values,
the SAS procedure EXPAND is very useful and simple. Here, however, we will perform the
transformations in the DATA step.
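For reference, the same kinds of transformations can be sketched with PROC EXPAND; the output data set and variable names below are illustrative, not part of the original program:

```sas
/* Lag, difference, and lead via PROC EXPAND (SAS/ETS); METHOD=NONE avoids interpolation */
PROC EXPAND DATA=AUSAS.data OUT=work.transformed METHOD=NONE;
CONVERT mortrate_under5 = mort_L1 / TRANSFORMOUT=(LAG 1);  /* 1st lag */
CONVERT mortrate_under5 = mort_D1 / TRANSFORMOUT=(DIF 1);  /* 1st difference */
CONVERT mortrate_under5 = mort_F1 / TRANSFORMOUT=(LEAD 1); /* 1st lead */
RUN;
```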
For the under-5 mortality rate, we generate the 1st difference; the difference of the 1st
difference (the 2nd difference); the 1st and 2nd lag variables; the log transformation; and
the 1st and 2nd differences of the log transformation. We generate the same
transformations for GDP per capita. The PRINT procedure writes the first few observations
to the output window to illustrate the resulting data.
TITLE1 'Plot';
TITLE2 'Mortality Rate: 1ST Differences, and 2ND Differences';
DATA AUSAS.data; SET AUSAS.data;
mortrate_under5_D1 = DIF(mortrate_under5);    /* 1st difference */
mortrate_under5_D2 = DIF(mortrate_under5_D1); /* 2nd difference */
mortrate_under5_L1 = LAG(mortrate_under5);    /* 1st lag */
mortrate_under5_L2 = LAG(mortrate_under5_D1); /* lag of the 1st difference */
mortrate_under5_ln = LOG(mortrate_under5);
mortrate_under5_lnD1 = DIF(mortrate_under5_ln);
mortrate_under5_lnD2 = DIF(mortrate_under5_lnD1);
mortrate_under5_lnL2 = LAG(mortrate_under5_lnD1); /* lag of the 1st log difference, used in the unit root test */
gdppc2005c_D1 = DIF(gdppc2005c);
gdppc2005c_D2 = DIF(gdppc2005c_D1);
gdppc2005c_L1 = LAG(gdppc2005c);
gdppc2005c_L2 = LAG(gdppc2005c_D1); /* lag of the 1st difference */
gdppc2005c_ln = LOG(gdppc2005c);
gdppc2005c_lnD1 = DIF(gdppc2005c_ln);
gdppc2005c_lnD2 = DIF(gdppc2005c_lnD1);
diff_md1_gdpd1 = mortrate_under5_D1-gdppc2005c_D1;
diff_md2_gdpd2 = mortrate_under5_D2-gdppc2005c_D2;
diffln_md1_gdpd1 = mortrate_under5_lnD1-gdppc2005c_lnD1;
diffln_md2_gdpd2 = mortrate_under5_lnD2-gdppc2005c_lnD2;
RUN;
PROC PRINT DATA=AUSAS.data(OBS=10 KEEP=CountryCode year mortrate_under5
mortrate_under5_D1 mortrate_under5_D2
mortrate_under5_L1 mortrate_under5_L2 mortrate_under5_ln
mortrate_under5_lnD1 mortrate_under5_lnD2);
RUN;
For time series variables, it is good practice to plot them in order to observe trends in
the data. Below, we plot mortrate_under5, its 1st difference, and its 2nd difference.
The GPLOT procedure creates each plot and stores it in a graphics catalog, while
GREPLAY displays them together in the same graph.
TITLE2 'Mortality Rate';
PROC GPLOT DATA=AUSAS.data GOUT=fig;
PLOT mortrate_under5*year / VAXIS=AXIS1 HAXIS=AXIS2;
RUN;QUIT;
TITLE2 '1ST Differences of Mortality Rate';
PROC GPLOT DATA=AUSAS.data GOUT=fig;
PLOT mortrate_under5_D1*year / VAXIS=AXIS1 HAXIS=AXIS2;
RUN;QUIT;
TITLE2 '2ND Differences of Mortality Rate ';
PROC GPLOT DATA=AUSAS.data GOUT=fig;
PLOT mortrate_under5_D2*year / VAXIS=AXIS1 HAXIS=AXIS2;
RUN;QUIT;
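The GREPLAY step itself is not shown in this handout; a minimal sketch, assuming the default catalog entry names (GPLOT, GPLOT1, GPLOT2) and the L2R2 template shipped in SASHELP.TEMPLT:

```sas
/* Replay the three stored plots in one graph; the L2R2 template holds up to four panels */
PROC GREPLAY IGOUT=fig TC=SASHELP.TEMPLT TEMPLATE=L2R2 NOFS;
TREPLAY 1:GPLOT 2:GPLOT1 3:GPLOT2;
RUN; QUIT;
```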
The top panel of the graph above shows a negative trend in the mortality rate over
time, while the middle panel shows its upward-trending 1st difference. Neither series
appears stationary. The 2nd difference in the bottom panel, however, does appear
stationary. GDP per capita plots are provided below as well. Although the level of GDP
per capita is trending, both the 1st and 2nd differences appear to satisfy the
stationarity conditions: the mean and variance are independent of time.
TITLE2 'GDP';
PROC GPLOT DATA=AUSAS.data GOUT=fig1;
PLOT gdppc2005c*year / VAXIS=AXIS01 HAXIS=AXIS2;
RUN;QUIT;
TITLE2 '1ST Differences of GDP';
PROC GPLOT DATA=AUSAS.data GOUT=fig1;
PLOT gdppc2005c_D1*year / VAXIS=AXIS01 HAXIS=AXIS2;
RUN;QUIT;
TITLE2 '2ND Differences of GDP';
PROC GPLOT DATA=AUSAS.data GOUT=fig1;
PLOT gdppc2005c_D2*year / VAXIS=AXIS01 HAXIS=AXIS2;
RUN;QUIT;
If the variance is time dependent, a logarithmic transformation may be a good approach.
For completeness, the code to plot the log transformations of the mortality rate and
GDP per capita is provided in the program file.
Using the ARIMA procedure, we compute the autocorrelation and partial autocorrelation
functions of the 1st differences of both variables, in order to detect any long-term
memory in the data. PROC ARIMA follows the three stages of the so-called Box-Jenkins
procedure: identification; estimation and diagnostic checking; and forecasting.
In this workshop, we focus only on the first two stages.
PROC ARIMA DATA=AUSAS.data;
IDENTIFY VAR=mortrate_under5 SCAN MINIC NLAG=8;
RUN;
ESTIMATE P=1;
RUN;
ESTIMATE P=1 Q=1;
RUN;
QUIT;
The identification stage is specified with the IDENTIFY statement, while the ESTIMATE
statement performs the estimation stage. The IDENTIFY statement first prints descriptive
statistics for the mortality rate series along with plots of the series, the
autocorrelation function (ACF), the partial autocorrelation function (PACF), and the
inverse autocorrelation function (IACF). The autocorrelation plot shows how values of
the series are correlated with its past values. Again, visual inspection suggests the
series is not stationary, since its ACF decays very slowly. By default, the procedure
also checks for white noise. The null hypothesis is that none of the autocorrelations
of the series at the given lags are significantly different from 0. In this case, the
white noise hypothesis is rejected (p < 0.0001), which is not surprising since the
series appears nonstationary from visual inspection. We performed two diagnostic tests
to see whether an AR(1) or ARMA(1,1) model is adequate. The table below lists the
estimated parameters of the model. MU is the mean term and AR1,1 is the autoregressive
parameter at lag 1; both are highly significant, and the closeness of the autoregressive
estimate to 1 suggests a unit root.
The ESTIMATE statement also checks the autocorrelations of the residuals. As shown above,
we reject the hypothesis that the residuals are white noise (i.e., have no autocorrelation).
The second ESTIMATE statement estimates a mixed autoregressive moving average (ARMA)
model of order (1,1). The moving average parameter estimate is labeled MA1,1. All
parameters are highly significant, and we again reject that the residuals are white
noise. Both models suggest we should try a different transformation of the series,
most likely the first difference.
The following code performs the same procedures as described above on the first
difference of the series. Both models' parameters are significant, but only the AR(1)
residuals are white noise. We therefore choose the AR(1) model to characterize the data
generating process for the first difference of the mortality rate.
PROC ARIMA DATA=AUSAS.data;
IDENTIFY VAR=mortrate_under5(1) SCAN MINIC NLAG=8;
RUN;
ESTIMATE P=1;
RUN;
ESTIMATE P=1 Q=1;
RUN;
QUIT;
Now, let's explore the GDP per capita series, following the same procedures as in the
mortality rate analysis. Since visual inspection above suggests at least a 1st
difference, we skip the level analysis and analyze the 1st difference directly. Below,
we can observe that the ACF decays very quickly, suggesting that the 1st difference of
GDP per capita is likely to be stationary; we cannot reject the no-autocorrelation
hypothesis. We conduct formal stationarity tests in the next section.
PROC ARIMA DATA=AUSAS.data;
IDENTIFY VAR=gdppc2005c(1) SCAN MINIC NLAG=8;
RUN;
ESTIMATE P=1;
RUN;
ESTIMATE P=1 Q=1;
RUN;
QUIT;
The estimated mean MU is statistically significant, while the AR1,1 term does not add
information to the model. We cannot reject that the residuals are white noise. Based on
this diagnostic check, we conclude that the 1st difference of the GDP per capita series
appears stationary.
3. Testing for Unit Roots
Many economic and financial time series exhibit trending behavior or non-stationarity in
the mean, and regression without due consideration of unit roots can yield spurious
results. Before we explain the concept of unit roots, let's run an OLS regression of the
mortality rate (under 5) on GDP per capita.
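The regression code itself is not reproduced in this handout; a minimal sketch with PROC REG (the output data set name resid_lvl and residual variable name are illustrative):

```sas
TITLE1 'OLS regression in levels';
PROC REG DATA=AUSAS.data;
MODEL mortrate_under5 = gdppc2005c;
OUTPUT OUT=resid_lvl R=resid; /* keep residuals for inspection */
RUN; QUIT;
/* Plot the residuals over time to see the serial correlation */
PROC GPLOT DATA=resid_lvl;
PLOT resid*year;
RUN; QUIT;
```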
The coefficient is negative (-0.00106) and significant (t = -28.72), and the R-squared
is very high (0.9418). The relationship appears much stronger than we would expect, and
the R-squared is inflated. This is a common result often called spurious regression. It
is not that the result is wrong per se, but if the variables are not stationary and you
try to draw causal conclusions, you will over- or underestimate the importance of your
variables. Looking at the residual plot above, it is obvious that our regression result
is spurious.
Estimating the regression model in differences is useful for comparison. Below, we run
an OLS regression of the 1st difference of the mortality rate (under 5) on the 1st
difference of GDP per capita.
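As before, the code is not shown in the handout; a minimal sketch of the differenced regression (output names illustrative):

```sas
TITLE1 'OLS regression in 1st differences';
PROC REG DATA=AUSAS.data;
MODEL mortrate_under5_D1 = gdppc2005c_D1;
OUTPUT OUT=resid_d1 R=resid; /* residuals for the plot */
RUN; QUIT;
```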
The coefficient becomes minute and insignificant, and the R-squared falls below 0.04.
The apparent relationship between the mortality rate and GDP per capita has disappeared.
Looking at the residual plot above, the serial correlation has been washed out. If you
want to learn more about time series data and proper inference techniques, please
consult a good textbook on time series econometrics. In this tutorial we simply state
that it is generally a good idea to make sure that the variables of interest are
"stationary" before doing any inference (correlations, regression, etc.).
To provide conclusive evidence about whether a variable is stationary or has a unit root,
we undertake unit root tests. In the section below, we will go over some important
methods used to test for unit roots.
Augmented Dickey-Fuller test (ADF test)
The null hypothesis of the ADF test is that the variable contains a unit root, and
the alternative is that the variable was generated by a stationary process. The test
allows several options, such as including a constant, a trend, or lagged values of the
variable.
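As an aside not covered in this handout, PROC ARIMA can also run ADF tests directly through the STATIONARITY option of the IDENTIFY statement; a minimal sketch, with illustrative lag choices:

```sas
/* Built-in ADF tests with 0, 1, and 2 augmenting lags */
PROC ARIMA DATA=AUSAS.data;
IDENTIFY VAR=mortrate_under5 STATIONARITY=(ADF=(0,1,2));
RUN; QUIT;
```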
The following UNITROOT macro performs the Dickey-Fuller test manually. Let's discuss
the four steps inside the macro:
First, we run a regression model of the mortality rate and GDP per capita on their own
1st lags, and export the estimated coefficients to the data set est. Second, we
calculate the test statistics for whether the coefficient on the 1st lag equals 1
(i.e., a unit root). Third, we calculate the corresponding p-values for three different
specifications: zero mean, single mean, and trend. Finally, we print the results to the
output window.
/* Dickey-Fuller Test */
TITLE1 'Testing for Unit Roots by Dickey-Fuller';
%MACRO UNITROOT(Dsn=,   /* Data set name                    */
                Dvar1=, /* Dependent variable for model 1   */
                Ivar1=, /* Independent variable for model 1 */
                Dvar2=, /* Dependent variable for model 2   */
                Ivar2=, /* Independent variable for model 2 */
                n=      /* Number of observations           */);
/* Estimate gamma for both series by regression */
PROC REG DATA=&Dsn OUTEST=est;
MODEL &Dvar1 = &Ivar1 / NOINT NOPRINT;
MODEL &Dvar2 = &Ivar2 / NOINT NOPRINT;
RUN;
/* Compute test statistics for both series */
DATA dickeyfuller1;
SET est;
x&Dvar1 = &n*(&Ivar1-1);
x&Dvar2 = &n*(&Ivar2-1);
RUN;
/* Compute p-values for the three models */
DATA dickeyfuller2;
SET dickeyfuller1;
p&Dvar1  = PROBDF(x&Dvar1,%eval(&n-1),1,"RZM");
p&Dvar2  = PROBDF(x&Dvar2,%eval(&n-1),1,"RZM");
p1&Dvar1 = PROBDF(x&Dvar1,%eval(&n-1),1,"RSM");
p1&Dvar2 = PROBDF(x&Dvar2,%eval(&n-1),1,"RSM");
p2&Dvar1 = PROBDF(x&Dvar1,%eval(&n-1),1,"RTR");
p2&Dvar2 = PROBDF(x&Dvar2,%eval(&n-1),1,"RTR");
RUN;
/* Print the results */
PROC PRINT DATA=dickeyfuller2(KEEP= x&Dvar1 x&Dvar2 p&Dvar1 p&Dvar2
                                    p1&Dvar1 p1&Dvar2 p2&Dvar1 p2&Dvar2);
RUN; QUIT;
%MEND UNITROOT;
%UNITROOT(Dsn=AUSAS.data,
          Dvar1=mortrate_under5,
          Ivar1=mortrate_under5_L1,
          Dvar2=gdppc2005c,
          Ivar2=gdppc2005c_L1,
          n=53);
The null hypothesis is that a unit root exists. Columns 3, 5, and 7 (zero mean, single
mean, and trend) of the 1st observation provide the p-values for the mortality rate,
while columns 4, 6, and 8 of the 2nd observation provide them for GDP per capita. Based
on the table above, we cannot reject the null hypothesis of a unit root in the levels
of either variable. However, the 1st difference of the log of the mortality rate and
the 1st difference of GDP per capita appear stationary. See the table below.
%UNITROOT(Dsn=AUSAS.data,
          Dvar1=mortrate_under5_lnD1,
          Ivar1=mortrate_under5_lnL2, /* lag of the 1st log difference */
          Dvar2=gdppc2005c_D1,
          Ivar2=gdppc2005c_L2,        /* lag of the 1st difference */
          n=53);
4. Durbin-Watson Test for Serial Correlation
The Durbin-Watson test for autocorrelation is used to test whether the residuals from a
simple or multiple linear regression are independent, i.e., not serially correlated. The
null hypothesis of the test is that there is no first-order autocorrelation. We specify
several models below; none of the tests rejects the null hypothesis, indicating no
serial correlation.
/* cointegration*/
TITLE1 'Testing for cointegration';
/* Compute Phillips-Ouliaris-test for cointegration */
PROC AUTOREG DATA=AUSAS.data;
MODEL mortrate_under5=gdppc2005c /dw=4 dwprob ;
MODEL mortrate_under5_D1=gdppc2005c_D1/dw=4 dwprob ;
MODEL mortrate_under5_lnD1=gdppc2005c_lnD1/dw=4 dwprob ;
MODEL mortrate_under5=gdppc2005c mortrate_under5_L1
mortrate_under5_L2 gdppc2005c_L1 gdppc2005c_L2/dw=4 dwprob;
RUN; QUIT;
The Durbin-Watson test is a widely used method of testing for autocorrelation. The
first-order Durbin-Watson statistic is printed by default and can be used to test for
first-order autocorrelation. Use the DWPROB option to print the significance levels
(p-values) for the Durbin-Watson tests. (Since the Durbin-Watson p-values are
computationally expensive, they are not reported by default.)
You can use the DW= option to request higher-order Durbin-Watson statistics. Since the
ordinary Durbin-Watson statistic tests only for first-order autocorrelation, the
Durbin-Watson statistics for higher-order autocorrelation are called generalized
Durbin-Watson statistics.