Final Project - Alexander Hardt

Total Vehicle Sales Forecast
Final Project
Alexander Hardt
Dr. Holmes
Economic Forecasting 309-01W
Summer II
8/6/2013
Executive Summary
For this project I created a twelve-month forecast for Total Vehicle Sales in the United
States using four different methods: exponential smoothing, decomposition, ARIMA, and
multiple regression. To do so I picked one dependent (Y) variable along with two independent
(X) variables and collected 80 monthly observations for each variable, of which 68 form the
fit period and the last 12 are held out to evaluate the forecasts. This historical data
allowed me to create four different forecasting models that predict future Vehicle Sales.
The best model was Winters' exponential smoothing method because it had the lowest MAPE
and the lowest RMSE for the fit period as well as the forecast period.
Introduction
I chose Total Vehicle Sales in the United States as the Y variable because I have a
strong interest in the auto industry and would like to work for a German car maker in the future.
The auto industry is very sensitive to the state of the economy because people tend to postpone
big-ticket purchases like a car when times are tough. Therefore, the variables that drive changes
in vehicle sales should be indicators of economic performance. In order to forecast the
dependent variable Y (Total Vehicle Sales), I chose two independent variables, X1 and X2, that
are closely related to Y: non-farm employment and the personal saving rate. My hypothesis for
the first X variable is that employment is positively related to vehicle sales: the more people
are in the workforce, the more people earn the income needed to make big-ticket purchases like
a personal car. My hypothesis for the second X variable is that the personal saving rate has an
inverse linear relationship with vehicle sales: the more of their disposable income people
save, the less they spend, which hurts vehicle sales.
Since I am using three very different variables in my forecast, their means, ranges,
and standard deviations differ from each other. To anticipate forecasting difficulty, it is
important to look at the variation about the mean for each variable. The Y variable Total
Vehicle Sales has a mean of 1130.5, a range of 919.9, and a standard deviation of 243.0. As a
rule of thumb, the standard deviation should be less than 50% of the mean to avoid forecasting
difficulty; here it is about 21% of the mean, so I should be able to get a fairly accurate
forecast. The X variable Employees non-farm has a mean of 133,784, a standard deviation of
3,463, and a range of 11,769, so it varies little relative to its mean. The X variable Personal
Saving Rate, with a mean of 4.157 and a standard deviation of 1.389, also indicates that I
should not run into difficulties producing a forecast. Below are descriptive statistics for all
the variables used:
Descriptive Statistics: Total Vehicle Sales, Employees non farm, Saving Rate

Variable              N   N*  Mean     SE Mean  StDev   Minimum  Q1       Median   Q3       Maximum
Total Vehicle Sales   68  0   1130.5   29.5     243.0   670.3    967.4    1090.3   1282.5   1590.2
Employees non farm    68  0   133784   420      3463    127374   130916   133209   137029   139143
Saving Rate           68  0   4.157    0.168    1.389   2.000    2.800    4.350    5.200    8.300
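The 50% rule of thumb mentioned above can be checked directly from the reported means and standard deviations. The following is only a small sketch of that arithmetic; the helper name `coefficient_of_variation` is hypothetical and the inputs are the values from the descriptive statistics.

```python
# Sketch: check that each standard deviation is below 50% of its mean.
# The helper name is hypothetical; the numbers are the reported statistics.

def coefficient_of_variation(mean, stdev):
    """Return the standard deviation as a fraction of the mean."""
    return stdev / mean


stats = {
    "Total Vehicle Sales": (1130.5, 243.0),
    "Employees non farm": (133784, 3463),
    "Saving Rate": (4.157, 1.389),
}

for name, (mean, stdev) in stats.items():
    cv = coefficient_of_variation(mean, stdev)
    verdict = "below 50%" if cv < 0.5 else "above 50%"
    print(f"{name}: StDev is {cv:.1%} of the mean ({verdict})")
```

All three ratios come out well under the 50% threshold, with the saving rate being the most variable relative to its mean.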
Looking at the time series plot for the Y variable Total Vehicle Sales, one can notice a
slight negative trend over the 68 observations studied. This is supported by the
autocorrelation function, whose coefficients remain fairly large for several lags before
slowly declining. Furthermore, there could be a seasonal pattern in the Y variable because
there are spikes at the 12th and 24th lags of the autocorrelation function. This can be
logically explained by the holiday sales events car dealers hold during the Christmas
season. The time series plot for the X variable Employees non-farm moves closely with the
Y variable and also shows a negative trend and seasonality. Furthermore, there is a noticeable
cyclical pattern as well. The second X variable, Personal Saving Rate, shows a slight positive
trend and a cycle only. There is no seasonality here because the data I found was seasonally
adjusted. Below are the time series plots for all three variables:
[Figure: Time Series Plot of Total Vehicle Sales (y-axis: Total Vehicle Sales; x-axis: Index)]
[Figure: Time Series Plot of Employees non farm (y-axis: Employees non farm; x-axis: Index)]
[Figure: Time Series Plot of Saving Rate (y-axis: Saving Rate; x-axis: Index)]
Scatter plots are a great tool for showing the relationship between Y and each X variable.
Both of the X variables have a moderate to strong linear relationship with the Y variable.
However, the relationships differ in direction. While the variables Employees and Vehicle
Sales are positively linearly related, the X variable Personal Saving Rate and Vehicle Sales
exhibit a fairly strong negative linear relationship. The direction of each relationship is
shown by the slope of the regression line in the scatter plot, and its strength by how closely
the points cluster around that line. Many values are very close to the regression line, which
is indicative of a strong linear relationship. However, there are also a few values that are
far from the regression line, which shows that there are extremes as well. Below are the
scatter plots for each XY relationship:
[Figure: Scatterplot of Total Vehicle Sales vs Employees non farm]
[Figure: Scatterplot of Total Vehicle Sales vs Saving Rate]
In researching X variables that help forecast the Y variable, the correlation matrix is one
of the most important tools for a forecaster. It shows two values for each pair of variables.
The Pearson correlation shows how strong the linear relationship between two variables is,
and the P-value shows whether that correlation is statistically significant, which is an
important factor in the decision to use a certain X variable. One wants a P-value below 0.05,
which corresponds to at least 95% confidence. Both X variables have fairly strong Pearson
correlations with the Y variable and P-values of 0.000, which makes these variables
significant and acceptable to use in the forecast. Furthermore, the correlation between the
two X variables is weaker than each X variable's correlation with the Y variable. Since these
correlations are logical and support the hypotheses made earlier, I will go on with the
forecast. Below is the correlation matrix for all variables:
Correlations: Total Vehicle Sales, Employees non-farm, Saving Rate

                    Total Vehicle Sa   Employees non-fa
Employees non fa    0.673
                    0.000

Saving Rate         -0.608             -0.438
                    0.000              0.000

Cell Contents: Pearson correlation
               P-Value
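To make the link between a Pearson correlation and its P-value more concrete, the sketch below computes a correlation coefficient from raw data and the t statistic that a significance test is usually based on, t = r * sqrt(n - 2) / sqrt(1 - r^2). The function names are hypothetical, and the code only illustrates the arithmetic; it is not the procedure Minitab runs internally.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing whether a correlation r from n pairs differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Reported correlations from the matrix above, each based on n = 68 observations.
reported = {
    "Vehicle Sales vs Employees": 0.673,
    "Vehicle Sales vs Saving Rate": -0.608,
    "Employees vs Saving Rate": -0.438,
}
for label, r in reported.items():
    print(f"{label}: r = {r:+.3f}, t = {t_statistic(r, 68):.2f}")
```

With 68 observations, every one of these t statistics is far beyond the roughly +/- 2 needed for significance at the 5% level, which is consistent with the reported P-values of 0.000.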
Body
Exponential Smoothing
The correct exponential smoothing method depends on the characteristics of the Y data.
As the time series plot below shows, the data series has a negative trend and seasonality,
the latter shown by the repeated annual spikes in the data.
[Figure: Time Series Plot of Total Vehicle Sales (y-axis: Total Vehicle Sales; x-axis: Index)]
In order to further analyze the characteristics of the Y data, it is helpful to look at the
autocorrelation function. The slowly decreasing autocorrelation coefficients confirm the
trend seen in the time series plot. There is seasonality, although not significant, as shown
by the spike at the 12th lag. Furthermore, there is also some sort of cycle since the
coefficients repeatedly rise and fall.
[Figure: Autocorrelation Function for Total Vehicle Sales (with 5% significance limits for the autocorrelations), lags 1 to 24]
Since the data series has a trend and seasonality, the best method to use is Winters'
exponential smoothing technique. Among the exponential smoothing methods, it is the only one
that can capture seasonality. The plot for Total Vehicle Sales using Winters' method of
exponential smoothing is seen below:
[Figure: Winters' Method Plot for Total Vehicle Sales, Multiplicative Method (Actual and Fits versus Index)]
Smoothing Constants: Alpha (level) 0.6, Gamma (trend) 0.1, Delta (seasonal) 0.8
Accuracy Measures: MAPE 5.89, MAD 63.70, MSD 7698.66
The smoothing constants that gave the lowest MAPE are alpha (level) = 0.6, gamma (trend) =
0.1, and delta (seasonal) = 0.8. A table showing the Y data (excluding the hold-out period),
the fitted values, and the corresponding residuals is included in the appendix.
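For readers who want to see what the three smoothing constants do, here is a minimal sketch of the multiplicative Winters' recursions for level, trend, and seasonal index. The function names are hypothetical, and the initialization (level = mean of the first season, trend = 0) is a simplifying assumption of this sketch, so it will not reproduce Minitab's fitted values exactly.

```python
def winters_multiplicative(y, period, alpha, gamma, delta):
    """Return one-step-ahead fits plus the final level, trend, and seasonal indices."""
    # Simple initialization (an assumption of this sketch, not Minitab's procedure):
    # level = mean of the first season, trend = 0, indices = first season / level.
    first_season = y[:period]
    level = sum(first_season) / period
    trend = 0.0
    seasonal = [value / level for value in first_season]

    fits = []
    for t, actual in enumerate(y):
        s_prev = seasonal[t % period]          # index estimated one season earlier
        fits.append((level + trend) * s_prev)  # one-step-ahead fitted value
        new_level = alpha * (actual / s_prev) + (1 - alpha) * (level + trend)
        trend = gamma * (new_level - level) + (1 - gamma) * trend
        seasonal[t % period] = delta * (actual / new_level) + (1 - delta) * s_prev
        level = new_level
    return fits, level, trend, seasonal


def winters_forecast(n_obs, level, trend, seasonal, horizon):
    """Forecast `horizon` periods beyond the n_obs observations already smoothed."""
    period = len(seasonal)
    return [(level + h * trend) * seasonal[(n_obs + h - 1) % period]
            for h in range(1, horizon + 1)]


# Synthetic, purely seasonal demo series (not the vehicle sales data).
demo = [100.0, 200.0] * 4
fits, level, trend, seasonal = winters_multiplicative(
    demo, period=2, alpha=0.6, gamma=0.1, delta=0.8)
print(winters_forecast(len(demo), level, trend, seasonal, horizon=2))
```

A high delta such as 0.8 lets the seasonal indices adapt quickly to recent seasons, while a low gamma such as 0.1 keeps the trend estimate smooth.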
The goodness-of-fit measures attained with this model are MAPE = 5.89%, MAD = 63.70,
MSD = 7698.66, and RMSE = 87.74 (the square root of the MSD). These accuracy measures are good
and indicate that an accurate forecast can be made. The fit graph is very close to the Y data
graph, which indicates that trend, cycle, and seasonality have been accounted for. Below is a
time series plot of the Y data compared with the fitted values and a time series plot of the
residuals:
[Figure: Time Series Plot of Total Vehicle Sales, FITS1 (y-axis: Data; x-axis: Index)]
[Figure: Time Series Plot of RESI1 (y-axis: RESI1; x-axis: Index)]
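The accuracy measures quoted for this model (MAPE, MAD, MSD, and RMSE) follow standard definitions, sketched below. The function name `error_measures` is hypothetical, and the two-point series in the usage example is made up purely to show the arithmetic.

```python
import math


def error_measures(actual, fitted):
    """Return MAPE (in percent), MAD, MSD, and RMSE for paired actual and fitted values."""
    residuals = [a - f for a, f in zip(actual, fitted)]
    n = len(residuals)
    mape = 100 * sum(abs(r / a) for r, a in zip(residuals, actual)) / n
    mad = sum(abs(r) for r in residuals) / n
    msd = sum(r ** 2 for r in residuals) / n
    return {"MAPE": mape, "MAD": mad, "MSD": msd, "RMSE": math.sqrt(msd)}


# Made-up example: two actual values and their fitted values.
print(error_measures([100.0, 200.0], [110.0, 190.0]))

# The reported RMSE is the square root of the reported MSD.
print(round(math.sqrt(7698.66), 2))  # 87.74
```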
There are no significant signs of trend, cycle, or seasonality in the time series plot of
the residuals. Most values are around 0, which suggests randomness. To check randomness more
formally, the autocorrelation function of the residuals can help. Below is the
autocorrelation function of the residuals:
[Figure: Autocorrelation Function for RESI1 (with 5% significance limits for the autocorrelations), lags 1 to 24]
Residual analysis with the autocorrelation function shows that there is no
autocorrelation in the residuals because no coefficients exceed the 5% significance limits.
Furthermore, the LBQ statistic at the 24th lag is 33.33, which is below the chi-square
critical value of 36.41.
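The LBQ (Ljung-Box Q) statistic used in this check can be computed from the residual autocorrelations as Q = n(n + 2) * sum of r_k^2 / (n - k) over lags k = 1..m. The sketch below shows that calculation with hypothetical function names; the critical value is simply the one quoted in the text, not computed here.

```python
def autocorrelations(series, max_lag):
    """Sample autocorrelation coefficients r_1 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    return [
        sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box_q(series, max_lag):
    """Ljung-Box Q statistic over lags 1 .. max_lag."""
    n = len(series)
    r = autocorrelations(series, max_lag)
    return n * (n + 2) * sum(rk ** 2 / (n - k) for k, rk in enumerate(r, start=1))


# Decision rule with the values quoted in the text (lag 24, 5% level).
reported_lbq = 33.33
chi_square_critical = 36.41
print("residuals look random:", reported_lbq < chi_square_critical)
```

A Q value below the critical value means the autocorrelations up to that lag are jointly indistinguishable from zero.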
The histogram of the residuals, however, does show a slight skew to the left, which
indicates an underestimation bias. This is supported by the mean, which is shifted to the
right of zero. The mean of the residuals is 5.351, which is still very close to zero relative
to the standard deviation of 88.23. Therefore, one can say that the residuals are random and
approximately normally distributed.
[Figure: Histogram of RESI1 with fitted normal curve (Mean 5.351, StDev 88.23, N 68)]
The residual analysis indicates that the residuals are random. The trend, cycle, and
seasonality that existed in the original data series are not seen in the residuals. This shows
that the model is successful at picking up the systematic variation of the Y data series.
Therefore, the model should be able to generate an accurate forecast.
Below are the one-year forecast and a time series plot for the Y data series including the
hold-out period (index 69-80):
[Figure: Winters' Method Plot for Y Total Vehicle Sales, Multiplicative Method (Actual, Fits, Forecasts, and 95.0% PI versus Index)]
Smoothing Constants: Alpha (level) 0.6, Gamma (trend) 0.1, Delta (seasonal) 0.8
Accuracy Measures: MAPE 5.89, MAD 63.70, MSD 7698.66
[Figure: Time Series Plot of Y Forecast (y-axis: Y Forecast; x-axis: Index)]
The accuracy of the forecast for the hold-out period is MAPE = 5.16809% and RMSE =
74.7947. When comparing the time series plot of the one-year forecast with the actual hold-out
data for this period, we can see that the two series are very close to each other. The
forecast and the hold-out data cross each other multiple times. Therefore, there is little
systematic over- or underestimation.
[Figure: Time Series Plot of Vehicle HO, FORE1, UPPE1, LOWE1 (y-axis: Data; x-axis: Index 1 to 12)]
The forecast period residuals seem to be fairly random, with the exception of index 7,
where the residual is strongly negative. However, the autocorrelation function indicates that
there are no significant systematic patterns in the residuals because all coefficients are
well within the 5% significance limits. Furthermore, the LBQ value of 7.13 at the 10th lag is
far below the chi-square critical value of 18.3070. Below are the time series plot of the
forecast residuals and the autocorrelation function:
[Figure: Time Series Plot of Frcst Resid (y-axis: Frcst Resid; x-axis: Index 1 to 12)]
[Figure: Autocorrelation Function for Frcst Resid (with 5% significance limits for the autocorrelations), lags 1 to 10]
The error measures improved from the fit period to the hold-out period. The MAPE improved
from 5.89% to 5.17% and the RMSE was lowered from 87.74 to 74.79. Furthermore, the forecast
residuals show that trend, cycle, and seasonality have been accounted for, so there is no
evident bias in the forecast. These results indicate that the forecast accuracy is acceptable
and the model successful.
Decomposition
The results table of the decomposition method of forecasting is included in the appendix.
In order to determine the seasonal component of the Y data, one needs to look at the seasonal
indices. The seasonal indices for 12 periods (monthly data) as well as a time series plot of the
seasonal indices are displayed below:
Seasonal Indices

Period  Index
1       1.14258
2       1.01495
3       1.05177
4       1.13192
5       0.94791
6       0.92322
7       0.84718
8       1.06925
9       0.79222
10      0.91807
11      1.12809
12      1.03285
[Figure: Time Series Plot of SeasInd (y-axis: SeasInd; x-axis: Index 1 to 12)]
When looking at the seasonal indices, one can notice that Vehicle Sales are periodic.
Because the series begins in May 2006, period 1 corresponds to May. There are relatively high
sales in late spring (May), late summer (August), the Christmas season, and early spring
(March), and relatively low sales in the fall (September through November) and in January
and February. The seasonal analysis will help to adjust the Y data for seasonality.
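In a multiplicative decomposition, adjusting for seasonality means dividing each observation by the seasonal index of its period. The sketch below applies the indices listed above to the first twelve Y values from Appendix A. It assumes that period 1 lines up with the first observation (5/1/2006); the function name `deseasonalize` is hypothetical.

```python
# Seasonal indices as reported above (periods 1 through 12).
SEASONAL_INDICES = [1.14258, 1.01495, 1.05177, 1.13192, 0.94791, 0.92322,
                    0.84718, 1.06925, 0.79222, 0.91807, 1.12809, 1.03285]


def deseasonalize(y, indices):
    """Divide each observation by the seasonal index of its period.

    Assumes the first observation falls in period 1.
    """
    period = len(indices)
    return [value / indices[t % period] for t, value in enumerate(y)]


# First twelve Total Vehicle Sales observations from Appendix A (5/2006 to 4/2007).
first_year = [1533.8, 1545.3, 1531.1, 1530.1, 1394.1, 1260.6,
              1236.5, 1476.4, 1124.2, 1285.1, 1574.9, 1366.0]

for raw, adjusted in zip(first_year, deseasonalize(first_year, SEASONAL_INDICES)):
    print(f"{raw:8.1f} -> {adjusted:8.1f}")

# Multiplicative indices should average to 1, so twelve of them sum to about 12.
print(round(sum(SEASONAL_INDICES), 3))
```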
Below is a time series plot comparing the Y data with the decomposition deseasonalized
data:
[Figure: Time Series Plot of Y Total Vehicle Sales, DESE2 (y-axis: Data; x-axis: Index)]
From this comparison it is noticeable that the deseasonalized series contains far fewer
spikes and extreme up and down movements than the original Y data. This shows us that the
strong seasonality that exists in the original Y data has been removed in the deseasonalized
series.
The goodness of fit as measured by the MAPE and RMSE indicates a large decrease in
accuracy compared to the previous model. The MAPE went up to 12.7% and the RMSE went up
to 157.146. These errors might be too high to accept; however, the forecast can be adjusted
through the cycle factors.
To determine the residual distribution, one needs to look at the time series plot, the
autocorrelation function, as well as a histogram of the residuals of the fit period:
[Figure: Time Series Plot of RESI2 (y-axis: RESI2; x-axis: Index)]
[Figure: Autocorrelation Function for RESI2 (with 5% significance limits for the autocorrelations), lags 1 to 24]
[Figure: Histogram of RESI2 with fitted normal curve (Mean -0.5784, StDev 158.3, N 68)]
Based on these graphs, it can be said that the residuals are definitely not random. I can
detect significant trend and some cycle by looking at the autocorrelation function. The very
high LBQ value of 245 at the 12th lag indicates that the residuals are autocorrelated, which
means they are not random. The mean, however, is very close to 0. Since it is negative, there
will be a slight tendency to over-forecast.
The one-year forecast for the hold out period using the decomposition model is displayed
below:
[Figure: Time Series Decomposition Plot for Y Total Vehicle Sales, Multiplicative Model (Actual, Fits, Trend, and Forecasts versus Index)]
Accuracy Measures: MAPE 12.7, MAD 127.3, MSD 24694.8
However, since the error measures we observed earlier were too high, I have adjusted
the forecast data with the last cycle factor. The time series plot of the Y data including the
adjusted one-year forecast is displayed below:
[Figure: Time Series Plot of Y w adj fcst (y-axis: Y w adj fcst; x-axis: Index)]
The adjustment was necessary because the decomposition model did not pick up the cycle.
The adjustment raised the forecast data by about 33% because the last cycle factor in the Y
data was 1.33. The adjusted forecast improved the error measures considerably. The MAPE was
lowered to 8.22103% and the RMSE is now 130.279.
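The cycle adjustment described here is a simple scaling: each decomposition forecast is multiplied by the last cycle factor. The sketch below shows that step with a hypothetical helper name and made-up forecast values, since the unadjusted forecasts themselves are not listed in the text.

```python
def adjust_for_cycle(forecasts, cycle_factor):
    """Scale trend-times-seasonal forecasts by the last observed cycle factor."""
    return [value * cycle_factor for value in forecasts]


LAST_CYCLE_FACTOR = 1.33  # value stated in the text

# Hypothetical unadjusted forecasts, used only to illustrate the scaling.
unadjusted = [900.0, 1000.0, 950.0]
adjusted = adjust_for_cycle(unadjusted, LAST_CYCLE_FACTOR)

for before, after in zip(unadjusted, adjusted):
    change = (after / before - 1) * 100
    print(f"{before:.1f} -> {after:.1f} ({change:.0f}% higher)")
```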
The closeness of the new forecast to the actual hold out period can be seen in the
following time series plot:
[Figure: Time Series Plot of Vehicle HO, New Fcst (y-axis: Data; x-axis: Index 1 to 12)]
Finally, we will look at the time series plot of the forecast residuals. The forecast
residuals are much closer to 0 than before the adjustment through the cycle factor. However,
there is still trend and cycle in the residuals.
[Figure: Time Series Plot of FcstResid (y-axis: FcstResid; x-axis: Index 1 to 12)]
ARIMA
[Figure: Time Series Plot of Y Total Vehicle Sales (y-axis: Y Total Vehicle Sales; x-axis: Index)]
[Figure: Autocorrelation Function for Y Total Vehicle Sales (with 5% significance limits for the autocorrelations), lags 1 to 24]
Based on the analysis of the Y time series plot and autocorrelation function, it will be
necessary to difference the data to make it stationary. Since the Y data has significant trend, the
first step is to difference the data for trend. The time series plot and autocorrelation function for
the first trend difference are:
[Figure: Time Series Plot of 1 Tren Dif (y-axis: 1 Tren Dif; x-axis: Index)]
[Figure: Autocorrelation Function for 1 Tren Dif (with 5% significance limits for the autocorrelations), lags 1 to 24]
We can see that there is no significant trend anymore. Therefore we will use only one
difference for the nonseasonal model. To determine which model to use we need to also show
the PACF for the first trend difference:
[Figure: Partial Autocorrelation Function for 1 Tren Dif (with 5% significance limits for the partial autocorrelations), lags 1 to 24]
Based on the ACF and PACF of the first trend difference, we can determine that this is an
MA(1) model because of the one significant negative spike in the ACF and the PACF
coefficients slowly approaching zero. Since the data also has seasonality, we will also take
the seasonal difference. The time series plot and ACF for the first seasonal difference are
shown below:
[Figure: Time Series Plot of 1 Seas Dif (y-axis: 1 Seas Dif; x-axis: Index)]
[Figure: Autocorrelation Function for 1 Seas Dif (with 5% significance limits for the autocorrelations), lags 1 to 24]
We can see that the first difference is sufficient to take out significant seasonality and
make the time series stationary. To determine which seasonal model coefficient to use, we need
to look at the PACF below:
[Figure: Partial Autocorrelation Function for 1 Seas Dif (with 5% significance limits for the partial autocorrelations), lags 1 to 24]
It can be determined that the ARIMA model for the seasonal difference is also MA(1).
Therefore the best ARIMA model should be specified as (0,1,1) for the seasonal as well as the
non-seasonal part. After running this model we get the following results:
Final Estimates of Parameters

Type     Coef    SE Coef  T     P
MA 1     0.3938  0.1263   3.12  0.003
SMA 12   0.8476  0.1199   7.07  0.000

Differencing: 1 regular, 1 seasonal of order 12
Number of observations: Original series 68, after differencing 55
Residuals: SS = 451577 (backforecasts excluded)
           MS = 8520 DF = 53

Modified Box-Pierce (Ljung-Box) Chi-Square statistic

Lag         12     24     36     48
Chi-Square  10.8   31.8   44.0   51.2
DF          10     22     34     46
P-Value     0.371  0.081  0.116  0.277
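The drop from 68 to 55 observations in the output above follows directly from the differencing: one regular difference costs 1 observation and one seasonal difference of order 12 costs another 12. The sketch below illustrates this with a hypothetical `difference` helper and a placeholder series of the same length as the fit period.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Placeholder values with the same length as the fit period (not the actual data).
series = [float(i * i) for i in range(68)]

regular = difference(series, lag=1)             # removes trend
regular_seasonal = difference(regular, lag=12)  # removes 12-month seasonality

print(len(series), len(regular), len(regular_seasonal))  # 68 67 55
```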
We can see that the coefficients for the MA 1 and SMA 12 terms have t-values over
1.96 and p-values very close to zero. Also, the p-values of the Box-Pierce (Ljung-Box)
statistic are above 0.05 at every lag shown. These results are very good because they indicate
that the model coefficients are significant, that the residuals are not significantly
autocorrelated, and that the ARIMA model should produce good forecast results.
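The t-values in the parameter table are each coefficient divided by its standard error, so the significance check can be reproduced by hand. The sketch below does this for the two reported estimates; the helper name `t_ratio` is hypothetical.

```python
def t_ratio(coef, se_coef):
    """t-value of an estimate: the coefficient divided by its standard error."""
    return coef / se_coef


# Coefficient and standard error as reported in the ARIMA output.
estimates = {
    "MA 1": (0.3938, 0.1263),
    "SMA 12": (0.8476, 0.1199),
}

for term, (coef, se_coef) in estimates.items():
    t = t_ratio(coef, se_coef)
    print(f"{term}: t = {t:.2f}, significant at 5%: {abs(t) > 1.96}")
```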
In order to determine the accuracy for the fit period, we will need to look at the MAPE
and RMSE. These are shown below:
MAPE= 6.51131%
RMSE= 90.6118
The fit MAPE of 6.5% is decent, along with the RMSE of 90.6118. Therefore we can
determine that this model is accurate based on error measures. Furthermore, as noted above,
the p-values associated with the LBQ statistics for lags 12, 24, 36, and 48 are all above
0.05, which allows us to declare the residuals random. After running the ARIMA model we
produce a 12-month forecast. A time series plot of the forecast residuals is displayed below:
[Figure: Time Series Plot of F Resid (y-axis: F Resid; x-axis: Index 1 to 12)]
Based on the LBQ values we can say that the forecast residuals are random because they
are below the chi-square critical value at lag 12. However, the time series plot does
indicate a slight trend.
The accuracy measures for the forecast period compared to the hold-out period are
displayed below:
MAPE= 6.49819%
RMSE= 96.7536
Compared with the fit period, the MAPE is essentially unchanged and the RMSE rose slightly.
This is normal, and we can still declare the ARIMA model accurate. A time series plot of the
Y variable including the 12-month forecast is shown below:
[Figure: Time Series Plot of Y Total Vehicle Sales_1 (y-axis: Y Total Vehicle Sales_1; x-axis: Index 1 to 80)]
The forecast looks reasonable because it follows the same pattern as the historical
observations.
Multiple Regression
Before running a multiple regression model, it is important to look at the XY
relationships using scatter plots and a correlation matrix. Both of the X variables have a
moderate to strong linear relationship with the Y variable. However, the relationships differ
in direction. While the variables Employees and Vehicle Sales are positively linearly
related, the X variable Personal Saving Rate and Vehicle Sales exhibit a fairly strong
negative linear relationship. The direction of each relationship is shown by the slope of the
regression line in the scatter plot, and its strength by how closely the points cluster
around that line. Many values are very close to the regression line, which is indicative of a
strong linear relationship. However, there are also a few values that are far from the
regression line, which shows that there are extremes as well. Below are the scatter plots for
each XY relationship:
[Figure: Scatterplot of Total Vehicle Sales vs Employees non farm]
[Figure: Scatterplot of Total Vehicle Sales vs Saving Rate]
In researching X variables that help forecast the Y variable, the correlation matrix is one
of the most important tools for a forecaster. It shows two values for each pair of variables.
The Pearson correlation shows how strong the linear relationship between two variables is,
and the P-value shows whether that correlation is statistically significant, which is an
important factor in the decision to use a certain X variable. One wants a P-value below 0.05,
which corresponds to at least 95% confidence. Both X variables have fairly strong Pearson
correlations with the Y variable and P-values of 0.000, which makes these variables
significant and acceptable to use in the forecast. Furthermore, the correlation between the
two X variables is weaker than each X variable's correlation with the Y variable. These
correlations are logical and support the hypotheses made earlier. Below is the correlation
matrix for all variables:
Correlations: Total Vehicle Sales, Employees non-farm, Saving Rate

                    Total Vehicle Sa   Employees non-fa
Employees non fa    0.673
                    0.000

Saving Rate         -0.608             -0.438
                    0.000              0.000

Cell Contents: Pearson correlation
               P-Value
In order to further analyze the characteristics of the Y data, it is helpful to look at the
autocorrelation function. The slowly decreasing autocorrelation coefficients confirm the
trend seen in the time series plot. There is seasonality, although not significant, as shown
by the spike at the 12th lag. Furthermore, there is also some cycle since the coefficients
repeatedly rise and fall. Below is the ACF graph:
[Figure: Autocorrelation Function for Total Vehicle Sales (with 5% significance limits for the autocorrelations), lags 1 to 24]
Based on the analysis of the scatter plots and correlation matrix, it can be determined that
no transformation is needed for either of the variables because the XX relationship is
weaker than both XY relationships and because there are no curvilinear relationships between
the independent variables and the dependent variable.
This model does require dummy variables because the Y data has seasonality. After
creating the dummy variables and including them in the model, it can be determined that only the
dummy variables m6 and m9 are significant and can be used in the multiple regression model.
The best possible regression model using both X variables and the dummy variables m6
and m9 is displayed below:
Regression Analysis: Y Total Vehi versus X1 Employees, X2 Saving Ra, ...

The regression equation is
Y Total Vehicle Sales = - 3076 + 0.0338 X1 Employees non farm
                        - 69.5 X2 Saving Rate - 174 m6 - 182 m9

Predictor               Coef       SE Coef    T      P      VIF
Constant                -3076.4    821.5      -3.74  0.000
X1 Employees non farm   0.033822   0.005926   5.71   0.000  1.274
X2 Saving Rate          -69.54     14.58      -4.77  0.000  1.240
m6                      -173.84    64.11      -2.71  0.009  1.015
m9                      -182.08    70.45      -2.58  0.012  1.037

S = 148.859   R-Sq = 64.7%   R-Sq(adj) = 62.5%

Analysis of Variance

Source          DF  SS       MS      F      P
Regression      4   2559676  639919  28.88  0.000
Residual Error  63  1396018  22159
Total           67  3955694

Source                  DF  Seq SS
X1 Employees non farm   1   1791670
X2 Saving Rate          1   478522
m6                      1   141458
m9                      1   148027

Unusual Observations

      X1 Employees  Y Total Vehicle
Obs   non farm      Sales            Fit     SE Fit  Residual  St Resid
9     134994        1124.2           1133.4  72.0    -9.2      -0.07 X
21    135896        1063.4           1080.4  70.3    -17.0     -0.13 X
25    138105        1420.6           1017.3  77.2    403.3     3.17RX
30    137038        859.0            988.2   70.3    -129.2    -0.98 X
31    136355        763.9            1083.3  46.9    -319.4    -2.26R
33    131627        670.3            769.1   70.5    -98.8     -0.75 X
34    131387        701.6            1005.7  25.3    -304.1    -2.07R
43    130787        761.7            1075.8  28.2    -314.1    -2.15R
45    127374        712.5            722.7   70.8    -10.2     -0.08 X

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large leverage.

Durbin-Watson statistic = 1.57157
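The fitted regression equation can be applied by hand to any observation. The sketch below reproduces the fitted value for observation 9 (1/1/2007 in Appendix A) from the reported coefficients. It assumes that the dummy m9 flags the ninth observation of each 12-month cycle, which is an interpretation rather than something stated in the text, although it is consistent with the reported fit of 1133.4. The function name `predict_sales` is hypothetical.

```python
# Coefficients as reported in the regression output.
COEFFICIENTS = {
    "constant": -3076.4,
    "x1_employees": 0.033822,
    "x2_saving_rate": -69.54,
    "m6": -173.84,
    "m9": -182.08,
}


def predict_sales(employees, saving_rate, m6=0, m9=0):
    """Evaluate the fitted regression equation for one observation."""
    c = COEFFICIENTS
    return (c["constant"]
            + c["x1_employees"] * employees
            + c["x2_saving_rate"] * saving_rate
            + c["m6"] * m6
            + c["m9"] * m9)


# Observation 9: Employees non farm = 134994, Saving Rate = 2.5 (Appendix A).
# Assumption: m9 = 1 for this observation.
fit = predict_sales(employees=134994, saving_rate=2.5, m9=1)
residual = 1124.2 - fit
print(round(fit, 1), round(residual, 1))
```

The result agrees with the Fit and Residual columns for observation 9 in the unusual observations table, up to rounding of the coefficients.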
To determine if the model is acceptable to use, we need to look at the R-square value and
the F statistic. The R-sq value in this model is 64.7%, which is good because it tells us that
the predictors explain more than half of the variation in the Y data. Furthermore, the F value
of 28.88 is more than three times the critical value from the F table. Also, the coefficients
produced by the model are all significant because the p-values are below 0.05 and the
absolute t-values are above 1.96. The signs of the coefficients also make logical sense and
support my hypotheses from the proposal. Therefore the model is acceptable to use.
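The R-square, adjusted R-square, and F statistic discussed above can all be recomputed from the sums of squares in the Analysis of Variance table. The sketch below does so as a consistency check; the variable names are illustrative only.

```python
import math

# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_regression = 2559676
ss_error = 1396018
ss_total = 3955694
df_regression = 4
df_error = 63
df_total = df_regression + df_error

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

r_squared = ss_regression / ss_total
adj_r_squared = 1 - ms_error / (ss_total / df_total)
f_statistic = ms_regression / ms_error
std_error = math.sqrt(ms_error)  # the "S" reported by Minitab

print(f"R-Sq = {r_squared:.1%}")
print(f"R-Sq(adj) = {adj_r_squared:.1%}")
print(f"F = {f_statistic:.2f}")
print(f"S = {std_error:.3f}")
```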
The error measures for the fit period are a little bit higher than for other models. The
MAPE is 10.56% and the RMSE is 143.282. However, they do not get lower using other
predictors.
An investigation of the best model gave the following results:

There is no serial correlation. The Durbin-Watson statistic for this model is 1.57157. This
is above 1.55 and below 2.45, which indicates no serial correlation.

There is no heteroscedasticity in this model. This can be determined by looking at the
Residuals vs. order graph displayed below. The residuals seem to bounce around a zero
mean and there is no noticeable megaphone effect.
[Figure: Residual Plots for Y Total Vehicle Sales (four panels: Normal Probability Plot, Versus Fits, Histogram, Versus Order)]

There is no multicollinearity. The VIF statistics for all the predictors are between 1.015
and 1.274. These are far below 2.5, which supports the statement that there is no
multicollinearity.
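Two of the diagnostics above have simple closed forms. The Durbin-Watson statistic is the sum of squared successive residual differences divided by the sum of squared residuals, and with only two predictors the VIF reduces to 1 / (1 - r^2), where r is the correlation between them. The sketch below uses hypothetical function names; the two-predictor VIF is shown only for intuition, since the reported VIFs also account for the dummy variables.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no serial correlation."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def vif_two_predictors(r):
    """VIF when a model contains exactly two predictors with correlation r."""
    return 1 / (1 - r ** 2)


# Decision rule with the bounds quoted in the text.
reported_dw = 1.57157
print("no serial correlation:", 1.55 < reported_dw < 2.45)

# Correlation between Employees non farm and Saving Rate from the matrix.
print(round(vif_two_predictors(-0.438), 3))  # 1.237
```

The value of about 1.24 is close to the VIFs reported for X1 and X2, which is what one would expect given that the dummy variables are nearly uncorrelated with both.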
Below is an autocorrelation function of the fit period residuals. It will be used along with
the 4in1 plot above to evaluate the randomness of the residuals.
[Figure: Autocorrelation Function for RESI1 (with 5% significance limits for the autocorrelations), lags 1 to 24]
Based on the information in the 4in1 plot, it can be said that the histogram of the
residuals is approximately normal and that the residuals seem to be random because the
residual versus order plot shows them fluctuating around zero. However, the autocorrelation
function reveals that there is still significant seasonality and cycle in the residuals. The
normal probability plot is good because most of the values fall on a straight line.
The forecast for the hold-out period, using the hold-out X values to forecast Y, produces
the following error measures:
MAPE= 11.6749%
RMSE= 188.759
The error measures are somewhat worse than the fit period error measures, which was expected.
They are still fairly high. The time series plot of the forecast residuals is displayed below:
[Figure: Time Series Plot of Fcst Resid (y-axis: Fcst Resid; x-axis: Index 1 to 12)]
[Figure: Time Series Plot of Y Total Vehicle Sales (y-axis: Y Total Vehicle Sales; x-axis: Index 1 to 80)]
According to the time series plot, the residuals seem to fluctuate around zero. There may be
a slight trend; however, it is not significant. When looking at the time series plot of the Y
data including the forecast for the hold-out period, we can determine that the forecast looks
reasonable when comparing it to the historical observations.
Conclusion
Although all the forecasting techniques delivered acceptable results, Winters' exponential
smoothing method produced the best error measures. The relatively low MAPE of 5.89% for the
fit period even improved to 5.17% for the forecast period. Another very successful model to
forecast the Y variable Total Vehicle Sales is ARIMA. Its error measures were slightly higher
with a MAPE of 6.51%; however, the MAPE remained nearly constant through the forecast period.
The results of the multiple regression model were somewhat unsatisfactory even though the
variables used as predictors were all significant, the R-square value was fairly high, and
there was no multicollinearity or heteroscedasticity. Winters' method of exponential
smoothing was the best model to forecast Total Vehicle Sales because it was very successful in
picking up the trend and seasonality that existed in the Y variable. Below is a table
including the error measures of all forecast methods for the fit period as well as the
forecast period:
Forecast Model Error Comparison
                                  Fit Period            Forecast Period
Model                             RMSE       MAPE       RMSE       MAPE
Winters' Exponential Smoothing    87.74      5.89%      74.79      5.17%
Decomposition                     157.146    12.70%     130.279    8.22%
ARIMA                             90.6118    6.51%      96.7536    6.50%
Multiple Regression               143.282    10.56%     188.759    11.67%
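The comparison can be reproduced with a short sketch that ranks the models by forecast-period MAPE and checks whether the top model also wins on every other measure. The numbers are copied from the table; the dictionary layout and variable names are illustrative assumptions.

```python
# (fit RMSE, fit MAPE %, forecast RMSE, forecast MAPE %), copied from the table
errors = {
    "Winters' Exponential Smoothing": (87.74, 5.89, 74.79, 5.17),
    "Decomposition": (157.146, 12.70, 130.279, 8.22),
    "ARIMA": (90.6118, 6.51, 96.7536, 6.50),
    "Multiple Regression": (143.282, 10.56, 188.759, 11.67),
}

# Rank models from lowest to highest forecast-period MAPE (index 3).
ranked = sorted(errors, key=lambda model: errors[model][3])
best = ranked[0]

# True only if the best model has the lowest value on all four measures.
best_on_all = all(
    errors[best][i] == min(values[i] for values in errors.values())
    for i in range(4)
)

for model in ranked:
    print(model, errors[model])
print("Best on every measure:", best_on_all)
```

Ranking by forecast-period MAPE is one reasonable choice; ranking by RMSE gives the same winner here.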
Appendix A
Total Vehicle Sales Forecast Data Observations
Date         Y Total Vehicle Sales   X2 Employment non-farm   X3 Personal Saving Rate
5/1/2006     1533.8                  136621                   2.6
6/1/2006     1545.3                  137121                   2.9
7/1/2006     1531.1                  135945                   2.3
8/1/2006     1530.1                  136149                   2.5
9/1/2006     1394.1                  136817                   2.6
10/1/2006    1260.6                  137516                   2.8
11/1/2006    1236.5                  137898                   2.9
12/1/2006    1476.4                  137786                   2.7
1/1/2007     1124.2                  134994                   2.5
2/1/2007     1285.1                  135683                   2.6
3/1/2007     1574.9                  136576                   2.8
4/1/2007     1366.0                  137381                   2.5
5/1/2007     1590.2                  138323                   2.2
6/1/2007     1481.3                  138825                   2.1
7/1/2007     1331.2                  137425                   2.1
8/1/2007     1500.5                  137534                   2.0
9/1/2007     1335.8                  138096                   2.3
10/1/2007    1256.5                  138835                   2.5
11/1/2007    1200.4                  139143                   2.3
12/1/2007    1414.1                  138929                   2.6
1/1/2008     1063.4                  135896                   3.7
2/1/2008     1196.5                  136414                   4.4
3/1/2008     1378.6                  137003                   4.5
4/1/2008     1273.1                  137535                   3.9
5/1/2008     1420.6                  138105                   8.3
6/1/2008     1212.6                  138296                   6.1
7/1/2008     1156.0                  136811                   5.1
8/1/2008     1269.1                  136697                   4.5
9/1/2008     984.6                   136748                   5.0
10/1/2008    859.0                   137038                   5.7
11/1/2008    763.9                   136355                   6.5
12/1/2008    916.1                   135321                   6.5
1/1/2009     670.3                   131627                   6.1
2/1/2009     701.6                   131387                   5.2
3/1/2009     872.8                   131249                   5.2
4/1/2009     832.6                   131429                   5.6
5/1/2009     938.3                   131697                   6.7
6/1/2009     874.8                   131510                   5.0
7/1/2009     1011.8                  129910                   4.3
8/1/2009     1274.6                  129786                   3.1
9/1/2009     759.5                   130144                   4.0
10/1/2009    853.9                   130741                   3.5
11/1/2009    761.7                   130787                   3.9
12/1/2009    1049.3                  130242                   4.0
1/1/2010     712.5                   127374                   4.7
2/1/2010     793.2                   127811                   4.6
3/1/2010     1083.9                  128646                   4.6
4/1/2010     997.4                   129770                   5.3
5/1/2010     1117.5                  130886                   5.7
6/1/2010     1000.5                  131004                   5.8
7/1/2010     1065.7                  129664                   5.6
8/1/2010     1011.5                  129728                   5.4
9/1/2010     973.9                   130221                   5.2
10/1/2010    965.2                   131195                   4.9
11/1/2010    888.0                   131502                   4.6
12/1/2010    1162.9                  131199                   4.9
1/1/2011     834.8                   128338                   5.5
2/1/2011     1009.3                  129154                   5.2
3/1/2011     1267.8                  130061                   4.6
4/1/2011     1177.2                  131279                   4.5
5/1/2011     1083.4                  131963                   4.4
6/1/2011     1076.5                  132453                   4.7
7/1/2011     1080.2                  131181                   4.2
8/1/2011     1096.6                  131457                   4.0
9/1/2011     1077.6                  132204                   3.5
10/1/2011    1046.9                  133125                   3.6
11/1/2011    1018.0                  133456                   3.2
12/1/2011    1272.3                  133292                   3.4
1/1/2012     934.7                   130657                   3.7
2/1/2012     1172.7                  131604                   3.5
3/1/2012     1431.9                  132505                   3.7
4/1/2012     1209.6                  133400                   3.5
5/1/2012     1361.9                  134213                   3.9
6/1/2012     1311.6                  134556                   4.1
7/1/2012     1178.4                  133368                   3.9
8/1/2012     1310.5                  133753                   3.7
9/1/2012     1210.3                  134374                   3.3
10/1/2012    1116.1                  135241                   3.7
11/1/2012    1165.1                  135636                   4.7
12/1/2012    1382.9                  135560                   7.4
Citations
Y : http://research.stlouisfed.org/fred2/data/TOTALNSA
X1: http://research.stlouisfed.org/fred2/data/PAYNSA
X2: http://research.stlouisfed.org/fred2/series/PSAVERT
Description of variables
Y: Total Vehicle Sales in the United States, from May 2006 - Dec 2012, thousands of units,
monthly data, not seasonally adjusted
X1: All Employees in United States, from May 2006 - Dec 2012, thousands of persons,
monthly data, not seasonally adjusted
X2: Personal Saving Rate, from May 2006 - Dec 2012, in percent,
monthly data, seasonally adjusted annual rate
Appendix B
Exponential Smoothing Data Table
Y Total Vehicle Sales   FITS1          RESI1
1533.8                  1698.496172    -164.6961722
1545.3                  1470.592482    74.70751775
1531.1                  1492.204222    38.89577837
1530.1                  1601.384932    -71.28493234
1394.1                  1303.006613    91.09338685
1260.6                  1284.940768    -24.3407677
1236.5                  1178.326783    58.17321719
1476.4                  1493.689081    -17.28908143
1124.2                  1052.980873    71.2191265
1285.1                  1232.190472    52.90952775
1574.9                  1560.984218    13.91578233
1366.0                  1429.945986    -63.94598628
1590.2                  1500.449202    89.75079784
1481.3                  1523.304025    -42.00402451
1331.2                  1478.653961    -147.4539608
1500.5                  1437.70953     62.79047005
1335.8                  1288.356752    47.44324806
1256.5                  1219.158701    37.34129944
1200.4                  1186.891921    13.5080795
1414.1                  1452.777751    -38.6777513
1063.4                  1045.564573    17.83542716
1196.5                  1181.90821     14.59178983
1378.6                  1455.696427    -77.09642735
1273.1                  1257.16394     15.93606023
1420.6                  1411.849564    8.750435704
1212.6                  1346.901357    -134.3013572
1156.0                  1208.035608    -52.03560781
1269.1                  1270.601503    -1.501502668
984.6                   1097.47906     -112.8790597
859.0                   933.9126215    -74.91262152
763.9                   821.4632138    -57.56321376
916.1                   911.5968883    4.503111662
670.3                   654.5066714    15.79332863
701.6                   715.402073     -13.80207298
872.8                   812.3412269    60.4587731
832.6                   748.5483182    84.05168179
938.3                   865.8688387    72.43116129
874.8                   816.0533407    58.74665929
1011.8                  821.0821421    190.7178579
1274.6                  1033.281023    241.3189767
759.5                   1005.392701    -245.8927006
853.9                   795.9111855    57.98881448
761.7                   790.9593759    -29.25937585
1049.3                  947.8603482    101.4396518
712.5                   755.3343873    -42.83438727
793.2                   802.9868702    -9.786870178
1083.9                  978.8137012    105.0862988
997.4                   966.6270408    30.77295918
1117.5                  1096.081647    21.4183528
1000.5                  1022.096204    -21.59620413
1065.7                  1039.155534    26.54446615
1011.5                  1178.292827    -166.7928274
973.9                   775.2174272    198.6825728
965.2                   952.0135485    13.18645152
888.0                   896.6985774    -8.698577382
1162.9                  1161.222363    1.677637326
834.8                   836.0672561    -1.267256059
1009.3                  946.4286047    62.87139532
1267.8                  1275.674795    -7.874794951
1177.2                  1167.021903    10.17809712
1083.4                  1313.351152    -229.951152
1076.5                  1067.389307    9.110692541
1080.2                  1121.312971    -41.11297129
1096.6                  1150.439933    -53.83993314
1077.6                  908.090978     169.509022
1046.9                  1006.57492     40.32507953
1018.0                  955.8103638    62.18963618
1272.3                  1303.625058    -31.32505816
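The columns of this table are related by RESI1 = Y - FITS1, and error measures such as MAPE and RMSE are built from these residuals. The sketch below shows the arithmetic on the first four rows only, so its results will not match the full fit-period figures reported in the conclusion. The function names are illustrative, and dividing by n inside the RMSE is an assumption about the definition used.

```python
import math

# First four rows of the exponential smoothing table: (Y, FITS1)
rows = [
    (1533.8, 1698.496172),
    (1545.3, 1470.592482),
    (1531.1, 1492.204222),
    (1530.1, 1601.384932),
]

def residuals(rows):
    """Residual = actual value minus fitted value."""
    return [y - fit for y, fit in rows]

def mape(rows):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(y - fit) / y for y, fit in rows) / len(rows)

def rmse(rows):
    """Root mean squared error (assumes division by n)."""
    return math.sqrt(sum((y - fit) ** 2 for y, fit in rows) / len(rows))

resid = residuals(rows)
print(resid)
print(mape(rows), rmse(rows))
```

Applying the same functions to all 68 fit-period rows would be the way to check the reported fit-period measures.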
Appendix C
Decomposition Data Table
Y Total Vehicle Sales   HO        TREN2          SEAS2          CF
1533.8                  934.7     1361.694124    1.142579926    0.985831249
1545.3                  1172.7    1354.736483    1.014950259    1.123862601
1531.1                  1431.9    1347.778842    1.051765392    1.080105172
1530.1                  1209.6    1340.821201    1.131919943    1.008168758
1394.1                  1361.9    1333.86356     0.94791084     1.102592488
1260.6                  1311.6    1326.905919    0.923216552    1.029043157
1236.5                  1178.4    1319.948278    0.847179762    1.105761931
1476.4                  1310.5    1312.990637    1.069245167    1.051635193
1124.2                  1210.3    1306.032997    0.792215329    1.086541179
1285.1                  1116.1    1299.075356    0.918074702    1.077518063
1574.9                  1165.1    1292.117715    1.128087943    1.080458138
1366.0                  1382.9    1285.160074    1.032854185    1.029092613
1590.2                            1278.202433    1.142579926    1.088843642
1481.3                            1271.244792    1.014950259    1.148071873
1331.2                            1264.287151    1.051765392    1.001102874
1500.5                            1257.32951     1.131919943    1.054316927
1335.8                            1250.371869    0.94791084     1.127028128
1256.5                            1243.414228    0.923216552    1.094568834
1200.4                            1236.456587    0.847179762    1.145965495
1414.1                            1229.498946    1.069245167    1.075659125
1063.4                            1222.541306    0.792215329    1.097968481
1196.5                            1215.583665    0.918074702    1.072135873
1378.6                            1208.626024    1.128087943    1.011121571
1273.1                            1201.668383    1.032854185    1.025743727
1420.6                            1194.710742    1.142579926    1.040692565
1212.6                            1187.753101    1.014950259    1.005881063
1156.0                            1180.79546     1.051765392    0.93081695
1269.1                            1173.837819    1.131919943    0.955151
984.6                             1166.880178    0.94791084     0.890155909
859.0                             1159.922537    0.923216552    0.802159248
763.9                             1152.964896    0.847179762    0.782068606
916.1                             1146.007255    1.069245167    0.747615368
670.3                             1139.049615    0.792215329    0.742819569
701.6                             1132.091974    0.918074702    0.67504053
872.8                             1125.134333    1.128087943    0.68764993
832.6                             1118.176692    1.032854185    0.720919807
938.3                             1111.219051    1.142579926    0.739018734
874.8                             1104.26141     1.014950259    0.780534529
1011.8                            1097.303769    1.051765392    0.876695804
1274.6                            1090.346128    1.131919943    1.032746702
759.5                             1083.388487    0.94791084     0.739564527
853.9                             1076.430846    0.923216552    0.85924556
761.7                             1069.473205    0.847179762    0.84069502
1049.3                            1062.515564    1.069245167    0.923606704
712.5                             1055.557924    0.792215329    0.852039163
793.2                             1048.600283    0.918074702    0.823938345
1083.9                            1041.642642    1.128087943    0.922417447
997.4                             1034.685001    1.032854185    0.933302
1117.5                            1027.72736     1.142579926    0.951662647
1000.5                            1020.769719    1.014950259    0.965705169
1065.7                            1013.812078    1.051765392    0.999444376
1011.5                            1006.854437    1.131919943    0.88753091
973.9                             999.8967962    0.94791084     1.027523349
965.2                             992.9391553    0.923216552    1.052909621
888.0                             985.9815144    0.847179762    1.063086542
1162.9                            979.0238735    1.069245167    1.110891881
834.8                             972.0662326    0.792215329    1.084035094
1009.3                            965.1085916    0.918074702    1.139111067
1267.8                            958.1509507    1.128087943    1.172934756
1177.2                            951.1933098    1.032854185    1.198236246
1083.4                            944.2356689    1.142579926    1.004203751
1076.5                            937.278028     1.014950259    1.131620585
1080.2                            930.3203871    1.051765392    1.103958527
1096.6                            923.3627461    1.131919943    1.049204588
1077.6                            916.4051052    0.94791084     1.240516637
1046.9                            909.4474643    0.923216552    1.246878112
1018.0                            902.4898234    0.847179762    1.331465421
1272.3                            895.5321825    1.069245167    1.32871254
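The values in this table appear to follow a multiplicative decomposition, Y = TREN2 x SEAS2 x CF, so the cycle factor is what remains after dividing Y by trend and seasonal index. The sketch below checks that relationship on the first three rows. The function names are illustrative, and reading CF as a cycle factor is an inference from the numbers rather than something the table states.

```python
# First three rows of the decomposition table: (Y, TREN2, SEAS2, CF)
rows = [
    (1533.8, 1361.694124, 1.142579926, 0.985831249),
    (1545.3, 1354.736483, 1.014950259, 1.123862601),
    (1531.1, 1347.778842, 1.051765392, 1.080105172),
]

def recompose(trend, seasonal, cycle):
    """Multiplicative model: trend times seasonal index times cycle factor."""
    return trend * seasonal * cycle

def cycle_factor(y, trend, seasonal):
    """Cycle factor left over once trend and seasonality are divided out."""
    return y / (trend * seasonal)

for y, trend, seasonal, cycle in rows:
    print(y, round(recompose(trend, seasonal, cycle), 2),
          round(cycle_factor(y, trend, seasonal), 6))

# The linear trend changes by the same amount each month.
trend_step = rows[1][1] - rows[0][1]
print(trend_step)
```

A cycle factor near 1 means the observation sits close to what trend and seasonality alone would predict; the values well below 1 in 2008 and 2009 line up with the recession period in the data.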
Comment
Dr. Holmes allowed the use of the seasonally adjusted Personal Saving Rate, according to an
email from 7/15/2013.