Hands-on Time Series Forecasting with Python

Box-Jenkins modeling strategy for building a SARIMA model

Idil Ismiguzel · Jun 2, 2020 · 8 min read

Photo by Brian Suman on Unsplash

Time series analysis is the endeavor of extracting meaningful summary and statistical information from data points that are in chronological order. It is widely used in applied science and engineering fields that involve temporal measurements, such as signal processing, pattern recognition, mathematical finance, weather forecasting, control engineering, healthcare digitization, and smart city applications. As we continuously monitor and collect more and more time series data, the opportunities for applying time series analysis and forecasting keep increasing.

In this article, I will show how to develop an ARIMA model with a seasonal component for time series forecasting in Python. We will follow the Box-Jenkins three-stage modeling approach to arrive at the best model for forecasting. I encourage anyone to check out the Jupyter Notebook on my GitHub for the full analysis.

In time series analysis, the Box-Jenkins method, named after statisticians George Box and Gwilym Jenkins, applies ARIMA models to find the best fit of a time series model. The method consists of three steps: model identification, parameter estimation, and model validation.

Time Series

As data, we will use the monthly milk production dataset. It includes monthly production records, in pounds per cow, between 1962 and 1975.

import pandas as pd

df = pd.read_csv('./monthly_milk_production.csv', sep=',',
                 parse_dates=['Date'], index_col='Date')

Time Series Data Inspection

Plotting the data shows an increasing trend and very strong seasonality. We will use the statsmodels library to perform a time series decomposition, a statistical method that deconstructs a time series into its trend, seasonal, and residual components.

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose

# Recent statsmodels versions use period=; older versions used freq=.
decomposition = seasonal_decompose(df['Production'], period=12)
decomposition.plot()
plt.show()

The decomposition plot confirms that the monthly milk production has an increasing trend and a seasonal pattern. If we want to observe the seasonal component more precisely, we can plot the data by month, as in the sketch below.
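A minimal sketch of one way to do this (my own variant; the original notebook may use a different plot): draw one boxplot of production per calendar month.

# One boxplot per calendar month: a strong seasonal pattern shows up as
# systematic shifts in the monthly medians.
df_month = df.copy()
df_month['Month'] = df_month.index.month
df_month.boxplot(column='Production', by='Month')
plt.show()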
1. Model Identification

In this step, we need to detect whether the time series is stationary and, if not, understand what kind of transformation is required to make it stationary. A time series is stationary when its statistical properties, such as mean, variance, and autocorrelation, are constant over time. In other words, a time series is stationary when it does not depend on time and has no trend or seasonal effects. Most statistical forecasting methods are based on the assumption that the time series is (approximately) stationary.

Imagine a time series that is consistently increasing over time: the sample mean and variance will grow with the size of the sample, and they will always underestimate the mean and variance in future periods. This is why we need to start from a stationary time series, one from which the time-dependent trend and seasonal components have been removed.

We can check stationarity using different approaches.

First, we can inspect plots, such as the decomposition plot we have seen previously, where we already observed trend and seasonality.

Second, we can plot the autocorrelation function (ACF) and partial autocorrelation function (PACF), which show how strongly time series values depend on their previous values. If the time series is stationary, the ACF/PACF plots will show a quick cut-off after a small number of lags.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, (ax1, ax2) = plt.subplots(2, 1)
plot_acf(df['Production'], lags=50, ax=ax1)
plot_pacf(df['Production'], lags=50, ax=ax2)
plt.show()

Here we see that neither the ACF nor the PACF plot cuts off quickly into the 95% confidence interval area (in blue), meaning the time series is not stationary.

Third, we can apply statistical tests, of which the Augmented Dickey-Fuller (ADF) test is the most widely used. Its null hypothesis is that the time series has a unit root, meaning that it is non-stationary. We interpret the result using the p-value: if the p-value is lower than the threshold (5% or 1%), we reject the null hypothesis and conclude the time series is stationary; if the p-value is higher than the threshold, we fail to reject the null hypothesis and conclude it is non-stationary.

from statsmodels.tsa.stattools import adfuller

dftest = adfuller(df['Production'])
dfoutput = pd.Series(dftest[0:4],
                     index=['Test Statistic', 'p-value', '#Lags Used',
                            'Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)

Results of Dickey-Fuller Test:
Test Statistic                  -1.303812
p-value                          0.627427
#Lags Used                      13.000000
Number of Observations Used    154.000000
Critical Value (1%)             -3.473543
Critical Value (5%)             -2.880498
Critical Value (10%)            -2.576878

The p-value is greater than the threshold, so we fail to reject the null hypothesis: the time series is non-stationary and has a time-dependent component. All of these approaches indicate that we have non-stationary data, so we need to find a way to make it stationary.

There are two major causes of non-stationarity in a time series: trend and seasonality. We can apply differencing, subtracting the previous observation from the current observation, to eliminate both and stabilize the mean of the time series. Because the data has both trend and seasonal components, we apply one non-seasonal difference, diff(), and one seasonal difference, diff(12).

df_diff = df.diff().diff(12).dropna()
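To verify that differencing removed the time-dependent structure, we can rerun the ADF test on the differenced series. Here is a minimal sketch mirroring the snippet above (the full rerun of all the checks is in the notebook):

# ADF test on the differenced series; a large negative test statistic and
# a p-value near zero indicate the differenced data is stationary.
dftest_diff = adfuller(df_diff['Production'])
dfoutput_diff = pd.Series(dftest_diff[0:4],
                          index=['Test Statistic', 'p-value', '#Lags Used',
                                 'Number of Observations Used'])
for key, value in dftest_diff[4].items():
    dfoutput_diff['Critical Value (%s)' % key] = value
print(dfoutput_diff)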
Results of Dickey-Fuller Test:
Test Statistic                  -5.038002
p-value                          0.000019
#Lags Used                      11.000000
Number of Observations Used    143.000000
Critical Value (1%)             -3.476927
Critical Value (5%)             -2.881973
Critical Value (10%)            -2.577665

Applying the previously listed stationarity checks, we notice that the plot of the differenced time series no longer reveals any specific trend or seasonal behavior, the ACF/PACF plots show a quick cut-off, and the ADF test returns a p-value of almost 0.00, well below the threshold. All of these checks suggest that the differenced data is stationary.

We will apply a Seasonal Autoregressive Integrated Moving Average (SARIMA, or Seasonal ARIMA) model, an extension of ARIMA that supports time series data with a seasonal component. ARIMA stands for Autoregressive Integrated Moving Average, one of the most common techniques in time series forecasting. ARIMA models are denoted with the order ARIMA(p, d, q), and SARIMA models with the order SARIMA(p, d, q)(P, D, Q)m.

AR(p) is a regression model that utilizes the dependent relationship between an observation and some number of lagged observations.

I(d) is the differencing order required to make the time series stationary.

MA(q) is a model that uses the dependency between an observation and the residual errors from a moving average model applied to lagged observations.

(P, D, Q)m is the additional set of parameters that describes the seasonal component of the model. P, D, and Q are the seasonal autoregressive, differencing, and moving average orders, and m is the number of data points in each seasonal cycle (12 for monthly data).

2. Model Parameter Estimation

We will use Python's pmdarima library to automatically extract the best parameters for our seasonal ARIMA model. Inside the auto_arima function, we specify d=1 and D=1 because we difference once for the trend and once for seasonality, m=12 because we have monthly data, trend='c' to include a constant, and seasonal=True to fit a seasonal ARIMA. We also specify trace=True to print the status of each fit, which helps us determine the best parameters by comparing AIC scores.

import pmdarima as pm

model = pm.auto_arima(df['Production'], d=1, D=1, m=12,
                      trend='c', seasonal=True,
                      start_p=0, start_q=0, max_order=6,
                      test='adf', stepwise=True, trace=True)

AIC (Akaike Information Criterion) is an estimator of out-of-sample prediction error and hence of the relative quality of our models; the goal is to find the lowest possible AIC score. The auto_arima search over various (p, d, q)(P, D, Q)m combinations indicates that the lowest AIC score is obtained with the parameters (1, 1, 0)(0, 1, 1, 12).

We split the dataset into a train and a test set. Here I've used 85% of the data as the train split, as sketched below.
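A minimal sketch of the chronological split described above (the variable names train and test are my assumptions; the exact code is in the notebook):

# Keep the split chronological: the first 85% of observations go to
# training, the remaining 15% to testing.
train_size = int(len(df) * 0.85)
train, test = df.iloc[:train_size], df.iloc[train_size:]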
We create a SARIMA model on the train set with the suggested parameters, using the SARIMAX function from the statsmodels library (the X refers to exogenous variables, which we don't add here). After fitting the model, we can also print the summary statistics.

from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(train['Production'],
                order=(1, 1, 0), seasonal_order=(0, 1, 1, 12))
results = model.fit()
results.summary()

3. Model Validation

The primary concern in validating the model is to ensure that its residuals are normally distributed with zero mean and are uncorrelated. To check the residual statistics, we can plot the model diagnostics:

results.plot_diagnostics()
plt.show()

The top-left plot shows the residuals over time; they appear to be white noise, with no seasonal component. The top-right plot shows that the KDE line (in red) closely follows the N(0, 1) line, the standard notation for a normal distribution with zero mean and a standard deviation of one, suggesting the residuals are normally distributed. The bottom-left normal Q-Q plot shows that the ordered residuals (in blue) closely follow the linear trend of samples taken from a standard normal distribution, again suggesting the residuals are normally distributed. The bottom-right correlogram indicates that the residuals have low correlation with their lagged versions. All of these results suggest that the residuals are normally distributed with low correlation.

To measure the accuracy of the forecasts, we compare the predicted values on the test set with its real values.

forecast_object = results.get_forecast(steps=len(test))
predictions = forecast_object.predicted_mean
conf_int = forecast_object.conf_int()
dates = predictions.index

From the plot, we see that the model predictions nearly match the real values of the test set.

from sklearn.metrics import r2_score

r2_score(test['Production'], predictions)
>>> 0.9240433686806808

The R squared of the model is 0.92, meaning the model explains 92% of the variance in the test data.

import numpy as np

mape = np.mean(np.abs(predictions - test['Production'])
               / np.abs(test['Production'])) * 100
>>> 1.649905

Mean absolute percentage error (MAPE) is one of the most widely used accuracy metrics, expressing the error as a percentage. The MAPE of the model is about 1.65, indicating that the forecast is off by 1.65% on average, i.e. roughly 98.35% accurate.

Since both the diagnostic checks and the accuracy metrics indicate that the model performs very well, we can continue to produce future forecasts. Here is the forecast for the next 60 months:

results.get_forecast(steps=60)

I hope you enjoyed following this tutorial and building time series forecasts in Python. If you liked this article, you can read my other articles here and follow me on Medium. Let me know if you have any questions or suggestions.✨