Hands-on Time Series Forecasting with Python
Box-Jenkins modeling strategy for building SARIMA model
Idil Ismiguzel Jun 2, 2020 · 8 min read
Photo by Brian Suman on Unsplash
Time series analysis is the endeavor of extracting meaningful summary and statistical information from data points that are in chronological order. It is widely used in applied science and engineering fields that involve temporal measurements, such as signal processing, pattern recognition, mathematical finance, weather forecasting, control engineering, healthcare digitization, smart city applications, and so on. As we continuously monitor and collect time series data, the opportunities for applying time series analysis and forecasting keep growing.
In this article, I will show how to develop an ARIMA model with a seasonal component for time series forecasting in Python. We will follow the Box-Jenkins three-stage modeling approach to arrive at the best model for forecasting.
I encourage anyone to check out the Jupyter Notebook on my GitHub for the full
analysis.
In time series analysis, the Box-Jenkins method, named after statisticians George Box and Gwilym Jenkins, applies ARIMA models to find the best fit of a time series model. The method consists of three steps: model identification, parameter estimation, and model validation.
Time Series
As data, we will use the monthly milk production dataset. It includes monthly production records, in pounds per cow, from 1962 to 1975.
import pandas as pd

# Load the dataset, parse the Date column, and use it as the index
df = pd.read_csv('./monthly_milk_production.csv', sep=',',
                 parse_dates=['Date'], index_col='Date')
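To reproduce the plot referenced below, here is a minimal sketch (it assumes matplotlib is installed; the Production column name matches the decomposition code later in the article):

import matplotlib.pyplot as plt

# Line plot of the raw series to inspect trend and seasonality
df['Production'].plot(figsize=(10, 4), title='Monthly milk production')
plt.ylabel('Pounds per cow')
plt.show()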
Time Series Data Inspection
As we can observe from the plot above, we have an increasing trend and very strong
seasonality in our data.
We will use the statsmodels library to perform a time series decomposition. Time series decomposition is a statistical method that deconstructs a time series into its trend, seasonal, and residual components.
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose into trend, seasonal, and residual components
# (recent statsmodels versions take `period` instead of the deprecated `freq`)
decomposition = seasonal_decompose(df['Production'], period=12)
decomposition.plot()
plt.show()
The decomposition plot indicates that the monthly milk production has an increasing trend and seasonal pattern.
If we want to observe the seasonal component more precisely, we can plot the data grouped by month, as sketched below.
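A hedged sketch of this month-wise view (grouping by calendar month is one of several reasonable ways to do it; it relies on the DatetimeIndex created when loading the data):

# Average production per calendar month exposes the seasonal shape
monthly = df['Production'].groupby(df.index.month).mean()
monthly.plot(kind='bar', title='Average production by month')
plt.xlabel('Month')
plt.ylabel('Pounds per cow')
plt.show()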
1. Model Identification
In this step, we need to detect whether the time series is stationary, and if not, understand what kind of transformation is required to make it stationary.
A time series is stationary when its statistical properties, such as mean, variance, and autocorrelation, are constant over time. In other words, a time series is stationary when it does not depend on time and has no trend or seasonal effects. Most statistical forecasting methods are based on the assumption that the time series is (approximately) stationary.
Imagine we have a time series that is consistently increasing over time: the sample mean and variance will grow with the size of the sample, and they will always underestimate the mean and variance in future periods. This is why we need to start with a stationary time series, one from which the time-dependent trend and seasonal components have been removed.
We can check stationarity using different approaches:
We can inspect plots, such as the decomposition plot we have seen previously, where we have already observed trend and seasonality.
We can plot the autocorrelation function (ACF) and partial autocorrelation function (PACF), which provide information about the dependency of time series values on their previous values. If the time series is stationary, the ACF/PACF plots will show a quick cut-off after a small number of lags.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot ACF and PACF on two stacked axes; slow decay signals non-stationarity
fig, (ax1, ax2) = plt.subplots(2, 1)
plot_acf(df['Production'], lags=50, ax=ax1)
plot_pacf(df['Production'], lags=50, ax=ax2)
plt.show()
Here we see that both the ACF and PACF plots do not show a quick cut-off into the 95% confidence interval area (in blue), meaning the time series is not stationary.
We can apply statistical tests; the Augmented Dickey-Fuller (ADF) test is the most widely used one. The null hypothesis of the test is that the time series has a unit root, meaning that it is non-stationary. We interpret the test result using its p-value. If the p-value is lower than the threshold (5% or 1%), we reject the null hypothesis and the time series is stationary. If the p-value is higher than the threshold, we fail to reject the null hypothesis and the time series is non-stationary.
from statsmodels.tsa.stattools import adfuller

# Run the ADF test and collect the headline statistics
dftest = adfuller(df['Production'])
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used',
                                         'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
Results of Dickey-Fuller Test:
Test Statistic -1.303812
p-value 0.627427
#Lags Used 13.000000
Number of Observations Used 154.000000
Critical Value (1%) -3.473543
Critical Value (5%) -2.880498
Critical Value (10%) -2.576878
The p-value is greater than the threshold, so we fail to reject the null hypothesis: the time series is non-stationary and has a time-dependent component.
All these approaches suggest we have non-stationary data. Now, we need to find a way
to make it stationary.
There are two major reasons behind non-stationary time series: trend and seasonality. We can apply differencing to make a time series stationary by subtracting the previous observation from the current observation. Doing so eliminates trend and seasonality and stabilizes the mean of the time series. Because we have both trend and seasonal components, we apply one non-seasonal differencing with diff() and one seasonal differencing with diff(12).
df_diff = df.diff().diff(12).dropna()
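The ADF output below comes from re-running the same test on the differenced series; a minimal sketch of that step, reusing adfuller from above:

# ADF test on the differenced series; a tiny p-value suggests stationarity
dftest_diff = adfuller(df_diff['Production'])
print('p-value after differencing: %f' % dftest_diff[1])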
Results of Dickey-Fuller Test:
Test Statistic -5.038002
p-value 0.000019
#Lags Used 11.000000
Number of Observations Used 143.000000
Critical Value (1%) -3.476927
Critical Value (5%) -2.881973
Critical Value (10%) -2.577665
Applying the previously listed stationarity checks, we notice that the plot of the differenced time series does not reveal any specific trend or seasonal behavior, the ACF/PACF plots show a quick cut-off, and the ADF test returns a p-value of almost 0.00, which is lower than the threshold. All these checks suggest that the differenced data is stationary.
We will apply a Seasonal Autoregressive Integrated Moving Average (SARIMA, or seasonal ARIMA) model, an extension of ARIMA that supports time series data with a seasonal component. ARIMA stands for Autoregressive Integrated Moving Average, one of the most common techniques for time series forecasting.
ARIMA models are denoted with the order of ARIMA(p,d,q) and SARIMA models are
denoted with the order of SARIMA(p, d, q)(P, D, Q)m.
AR(p) is a regression model that utilizes the dependent relationship between an
observation and some number of lagged observations.
I(d) is the differencing order to make time series stationary.
MA(q) is a model that uses the dependency between an observation and a residual error
from a moving average model applied to lagged observations.
(P, D, Q)m is the additional set of parameters that describes the seasonal component of the model. P, D, and Q are the seasonal autoregressive, differencing, and moving average orders, and m is the number of data points in each seasonal cycle.
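In statsmodels terms, this notation maps onto the order and seasonal_order arguments used when fitting the model. A sketch of the mapping (the concrete values here match the model selected later in the article):

# (p, d, q) -> order, (P, D, Q, m) -> seasonal_order
order = (1, 1, 0)               # p=1 AR term, d=1 differencing, q=0 MA terms
seasonal_order = (0, 1, 1, 12)  # P=0, D=1, Q=1, m=12 (monthly seasonality)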
2. Model Parameter Estimation
We will use Python's pmdarima library to automatically extract the best parameters for our seasonal ARIMA model. Inside the auto_arima function, we specify d=1 and D=1, as we differentiate once for the trend and once for seasonality, m=12 because we have monthly data, trend='c' to include a constant, and seasonal=True to fit a seasonal ARIMA. Besides, we specify trace=True to print the status of the fits, which helps us determine the best parameters by comparing the AIC scores.
import pmdarima as pm

# Stepwise search over (p, q)(P, Q) with d=1, D=1, and m=12 fixed
model = pm.auto_arima(df['Production'], d=1, D=1,
                      m=12, trend='c', seasonal=True,
                      start_p=0, start_q=0, max_order=6, test='adf',
                      stepwise=True, trace=True)
AIC (Akaike Information Criterion) is an estimator of out-of-sample prediction error and thus of the relative quality of our models. The desired result is to find the lowest possible AIC score.
The result of the auto_arima function with various (p, d, q)(P, D, Q)m parameters indicates that the lowest AIC score is obtained when the parameters equal (1, 1, 0)(0, 1, 1, 12).
We split the dataset into a train and a test set; here I've used 85% as the train split size (a sketch of the split follows). We then create a SARIMA model on the train set with the suggested parameters, using the SARIMAX function from the statsmodels library (the X refers to exogenous regressors, which we don't add here). After fitting the model, we can also print the summary statistics.
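A minimal sketch of the chronological split (the 85% cut point follows the text; the variable names train and test are the ones used in the fitting and evaluation code below):

# Chronological 85/15 split; time series data must not be shuffled
split = int(len(df) * 0.85)
train, test = df.iloc[:split], df.iloc[split:]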
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fit SARIMA(1, 1, 0)(0, 1, 1, 12) on the training set
model = SARIMAX(train['Production'],
                order=(1, 1, 0), seasonal_order=(0, 1, 1, 12))
results = model.fit()
results.summary()
3. Model Validation
The primary concern here is to ensure that the residuals of the model are uncorrelated and normally distributed with zero mean.
To check the residual statistics, we can plot the model diagnostics:
results.plot_diagnostics()
plt.show()
The top-left plot shows the residuals over time, and they appear to be white noise with no seasonal component.
The top-right plot shows that the KDE line (in red) closely follows the N(0,1) line, the standard notation for a normal distribution with zero mean and a standard deviation of 1, suggesting the residuals are normally distributed.
The bottom-left normal Q-Q plot shows that the ordered distribution of residuals (in blue) closely follows the linear trend of samples taken from a standard normal distribution, again suggesting the residuals are normally distributed.
The bottom-right plot is a correlogram, indicating that the residuals have low correlation with their lagged versions.
All these results suggest the residuals are normally distributed with low correlation.
To measure the accuracy of the forecasts, we compare the predicted values on the test set with the real values.
# Forecast over the test horizon and extract the mean and confidence interval
forecast_object = results.get_forecast(steps=len(test))
mean = forecast_object.predicted_mean
conf_int = forecast_object.conf_int()
dates = mean.index
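A sketch of the comparison plot referenced below (the figure styling is an assumption):

# Actuals vs. forecast with the 95% confidence band
plt.plot(df.index, df['Production'], label='actual')
plt.plot(dates, mean, label='forecast')
plt.fill_between(dates, conf_int.iloc[:, 0], conf_int.iloc[:, 1], alpha=0.2)
plt.legend()
plt.show()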
From the plot, we see that the model predictions nearly match the real values of the test set.
from sklearn.metrics import r2_score

# The predicted mean serves as our point predictions on the test set
predictions = mean
r2_score(test['Production'], predictions)
>>> 0.9240433686806808
The R-squared (coefficient of determination) of the model is 0.92, meaning the model explains about 92% of the variance in the test data.
import numpy as np

# MAPE: mean of |error| / |actual|, expressed as a percentage
mean_absolute_percentage_error = np.mean(np.abs(predictions - test['Production']) / np.abs(test['Production'])) * 100
>>> 1.649905
Mean absolute percentage error (MAPE) is one of the most used accuracy metrics, expressing the error as a percentage. The MAPE of the model is about 1.65, indicating the forecast is off by 1.65% on average, i.e., roughly 98.35% accurate.
Since both the diagnostics and the accuracy metrics indicate that our model performs very well, we can continue and produce future forecasts.
Here is the forecast for the next 60 months.
results.get_forecast(steps=60)
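A hedged sketch for visualizing this future forecast (the plotting details are assumptions):

# Forecast 60 months ahead and plot with the confidence band
future = results.get_forecast(steps=60)
future_mean = future.predicted_mean
future_ci = future.conf_int()
plt.plot(df.index, df['Production'], label='history')
plt.plot(future_mean.index, future_mean, label='forecast')
plt.fill_between(future_mean.index, future_ci.iloc[:, 0], future_ci.iloc[:, 1], alpha=0.2)
plt.legend()
plt.show()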
I hope you enjoyed following this tutorial and building time series forecasts in Python.
If you liked this article, you can read my other articles here and follow me on Medium.
Let me know if you have any questions or suggestions.✨