Johaan Tomy Kallany
1940817
PRACTICAL - 8: FITTING AN ARMA MODEL AND RESIDUAL ANALYSIS
08-04-2022

INTRODUCTION:

A time series is a sequence of observations Xt collected at equal intervals of time. Time series analysis is a specific way of analysing a sequence of data points collected over an interval of time. In time series analysis, analysts record data points at consistent intervals over a set period of time rather than just recording the data points intermittently or randomly. However, this type of analysis is not merely the act of collecting data over time. What sets time series data apart from other data is that the analysis can show how variables change over time. It provides an additional source of information and a set order of dependencies between the data. For example, Xt may denote the price of a commodity on day t, the maximum temperature on day t, the exchange rate of one currency against another, the number of accidents that occurred in month t, the number of defective items manufactured by a firm in week t, or biomedical measurements on a subject over time such as blood pressure, body temperature, pulse rate, etc.

In statistics, econometrics and signal processing, an autoregressive (AR) model is a representation of a type of random process; as such, it is used to describe certain time-varying processes in nature, economics, etc. The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term); thus the model is in the form of a stochastic difference equation (or recurrence relation, which should not be confused with a differential equation). Together with the moving-average (MA) model, it is a special case and key component of the more general autoregressive–moving-average (ARMA) and autoregressive integrated moving average (ARIMA) models of time series.

In time series analysis, the moving-average model, also known as the moving-average process, is a common approach for modelling univariate time series. The moving-average model specifies that the output variable depends linearly on the current and various past values of a stochastic term. Together with the autoregressive (AR) model, the moving-average model is a special case and key component of the more general ARMA model of time series, which has a more complicated stochastic structure. Contrary to the AR model, the finite MA model is always stationary.

A stationary time series is one whose properties do not depend on the time at which the series is observed. Thus, time series with trends, or with seasonality, are not stationary — the trend and seasonality will affect the value of the time series at different times. In general, a stationary time series will have no predictable patterns in the long term. Time plots will show the series to be roughly horizontal (although some cyclic behaviour is possible), with constant variance.

In the statistical analysis of time series, autoregressive–moving-average (ARMA) models provide a parsimonious description of a (weakly) stationary stochastic process in terms of two polynomials, one for the autoregression (AR) and the second for the moving average (MA). The general ARMA model was described in the 1951 thesis of Peter Whittle, Hypothesis Testing in Time Series Analysis, and it was popularized in the 1970 book by George E. P. Box and Gwilym Jenkins. Given a time series of data Xt, the ARMA model is a tool for understanding and, perhaps, predicting future values in this series. The AR part involves regressing the variable on its own lagged (i.e., past) values. The MA part involves modelling the error term as a linear combination of error terms occurring contemporaneously and at various times in the past. The model is usually referred to as the ARMA(p, q) model, where p is the order of the AR part and q is the order of the MA part (as defined below).
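To make the three model classes concrete, the following minimal R sketch simulates an AR(1), an MA(1) and an ARMA(1,1) series with the base-R function arima.sim(). The coefficient values (0.6 and -0.8) and the object names are illustrative assumptions, not taken from the data analysed below.

set.seed(1)
ar1    <- arima.sim(model = list(ar = 0.6), n = 200)            # AR(1):  Xt = 0.6 Xt-1 + et
ma1    <- arima.sim(model = list(ma = -0.8), n = 200)           # MA(1):  Xt = et - 0.8 et-1
arma11 <- arima.sim(model = list(ar = 0.6, ma = -0.8), n = 200) # ARMA(1,1): both components
plot.ts(cbind(ar1, ma1, arma11), main = "Simulated AR(1), MA(1) and ARMA(1,1) series")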
AIM:

Fit a suitable ARMA model for the given data set and perform residual analysis for the fitted model by testing the assumptions associated with the residuals: 1) errors are uncorrelated and 2) errors are normally distributed. Prepare a report by giving proper interpretations for your analysis.

Data Description

For illustration of fitting an ARMA model and for residual analysis, we will be using the 'Earthquakes' data set, consisting of the number of earthquake occurrences in each year for the period 1916–2015. First, we will load the data set into R and convert it into a time series object by properly defining the dates.

THEORY:

A statistical model is autoregressive if it predicts future values based on past values. For example, an autoregressive model might seek to predict a stock's future prices based on its past performance. Autoregressive models operate under the premise that past values have an effect on current values, which makes the statistical technique popular for analysing nature, economics, and other processes that vary over time. Multiple regression models forecast a variable using a linear combination of predictors, whereas autoregressive models use a combination of past values of the variable.

Auto-regressive model

In a multiple regression model, we forecast the variable of interest using a linear combination of predictors. In an autoregression model, we forecast the variable of interest using a linear combination of past values of the variable. The term autoregression indicates that it is a regression of the variable against itself. Thus, an autoregressive model of order p can be written as

yt = c + φ1 yt-1 + φ2 yt-2 + ... + φp yt-p + εt,

where εt is white noise. This is like a multiple regression but with lagged values of yt as predictors. We refer to this as an AR(p) model, an autoregressive model of order p. Autoregressive models are remarkably flexible at handling a wide range of different time series patterns; for example, series simulated from an AR(1) model and an AR(2) model show quite different behaviour. Changing the parameters φ1, ..., φp results in different time series patterns. The variance of the error term εt will only change the scale of the series, not the patterns.

Using the backshift operator B, the AR(p) process can be written as φ(B) yt = εt, where φ(B) = 1 - φ1 B - ... - φp B^p. The process is stationary if the roots of the characteristic polynomial φ(B) associated with the AR process lie outside the unit circle; if a root lies inside the unit circle, the process is non-stationary (explosive). For the AR(2) case, if we combine all the stationarity inequalities, we obtain a region bounded by the lines φ2 = 1 + φ1, φ2 = 1 - φ1 and φ2 = -1; this is the region where the AR(2) process is stationary. Therefore, for an AR(p) process to be stationary, the roots of the characteristic polynomial φ(B) associated with the process should lie outside the unit circle.
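The stationarity condition above can be checked numerically. The sketch below, with assumed AR(2) coefficients φ1 = 0.5 and φ2 = 0.3 (illustrative values only), uses base R's polyroot() to verify that all roots of the characteristic polynomial lie outside the unit circle.

phi   <- c(0.5, 0.3)          # assumed AR(2) coefficients (phi1, phi2)
roots <- polyroot(c(1, -phi)) # roots of phi(B) = 1 - 0.5 B - 0.3 B^2
Mod(roots)                    # moduli of the (possibly complex) roots
all(Mod(roots) > 1)           # TRUE => all roots outside the unit circle => stationary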
Auto-regressive moving average model

This is a model combined from the AR and MA models. In this model, the impact of previous lags along with the residuals is considered for forecasting the future values of the time series:

yt = c + β1 yt-1 + ... + βp yt-p + α1 εt-1 + ... + αq εt-q + εt

Here β represents the coefficients of the AR part and α represents the coefficients of the MA part. Therefore, an ARMA process is a linear combination of its own past values and also a linear combination of its own past error terms. The ARMA(p, q) process is stationary only if the AR(p) process associated with it is stationary.

Let's suppose that "Y" is some random time-series variable. Then, a simple autoregressive moving average model would look something like this:

yt = c + φ1 yt-1 + θ1 εt-1 + εt

yt and yt-1 represent the values in the current period and one period ago, respectively. As was the case with the AR model, we're using the past values as a benchmark for future estimates. Similarly, εt and εt-1 are the error terms for the same two periods. The error term from the last period is used to help us correct our predictions: by knowing how far off we were in our last estimate, we can make a more accurate estimation this time. As always, "c" is just a baseline constant factor; it simply means we can plug in any constant factor here. If the equation doesn't have such a baseline, we just assume c = 0. The two remaining parameters are φ1 and θ1. The former, φ1, expresses on average what part of the value last period (yt-1) is relevant in explaining the current one. Similarly, the latter, θ1, represents the same for the past error term (εt-1). Just like in previous models, these coefficients must range between -1 and 1 to prevent the series from blowing up.

Fitting models

Choosing p and q: Finding appropriate values of p and q in the ARMA(p, q) model can be facilitated by plotting the partial autocorrelation function for an estimate of p, and likewise using the autocorrelation function for an estimate of q. Extended autocorrelation functions (EACF) can be used to simultaneously determine p and q. Further information can be gleaned by considering the same functions for the residuals of a model fitted with an initial selection of p and q. Brockwell and Davis recommend using the Akaike information criterion (AIC) for finding p and q. Another possible choice for order determination is the BIC criterion.
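As a sketch of how AIC-based order selection works in practice, one can fit a small grid of candidate ARMA(p, q) models with base R's arima() and keep the fit with the smallest AIC; the series name x below is an assumed placeholder for any stationary series. The auto.arima() function from the forecast package, used later in this report, automates a similar search.

best <- NULL
for (p in 0:2) {
  for (q in 0:2) {
    # arima() can fail to converge for some orders, so wrap it in tryCatch()
    cand <- tryCatch(arima(x, order = c(p, 0, q)), error = function(e) NULL)
    if (!is.null(cand) && (is.null(best) || AIC(cand) < AIC(best))) best <- cand
  }
}
best  # candidate with the smallest AIC over the searched grid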
Residuals

The "residuals" in a time series model are what is left over after fitting a model. For many (but not all) time series models, the residuals are equal to the difference between the observations and the corresponding fitted values:

et = yt - ŷt

Residuals are useful in checking whether a model has adequately captured the information in the data. A good forecasting method will yield residuals with the following properties:

1. The residuals are uncorrelated. If there are correlations between residuals, then there is information left in the residuals which should be used in computing forecasts.
2. The residuals have zero mean. If the residuals have a mean other than zero, then the forecasts are biased.

Any forecasting method that does not satisfy these properties can be improved. However, that does not mean that forecasting methods that satisfy these properties cannot be improved. It is possible to have several different forecasting methods for the same data set, all of which satisfy these properties. Checking these properties is important in order to see whether a method is using all of the available information, but it is not a good way to select a forecasting method. If either of these properties is not satisfied, then the forecasting method can be modified to give better forecasts. Adjusting for bias is easy: if the residuals have mean m, then simply add m to all forecasts and the bias problem is solved. In addition to these essential properties, it is useful (but not necessary) for the residuals to also have the following two properties:

1. The residuals have constant variance.
2. The residuals are normally distributed.

These two properties make the calculation of prediction intervals easier. However, a forecasting method that does not satisfy these properties cannot necessarily be improved. Sometimes applying a Box-Cox transformation may assist with these properties, but otherwise there is usually little that you can do to ensure that your residuals have constant variance and a normal distribution. Instead, an alternative approach to obtaining prediction intervals is necessary.

Portmanteau tests for autocorrelation

In addition to looking at the ACF plot, we can also do a more formal test for autocorrelation by considering a whole set of rk values as a group, rather than treating each one separately. Recall that rk is the autocorrelation at lag k. When we look at the ACF plot to see whether each spike is within the required limits, we are implicitly carrying out multiple hypothesis tests, each one with a small probability of giving a false positive. When enough of these tests are done, it is likely that at least one will give a false positive, and so we may conclude that the residuals have some remaining autocorrelation when in fact they do not. In order to overcome this problem, we test whether the first h autocorrelations are significantly different from what would be expected from a white noise process. A test for a group of autocorrelations is called a portmanteau test, from a French word describing a suitcase or coat rack carrying several items of clothing.

ANALYSIS AND INTERPRETATION

Step 1: Import the required packages.
Step 2: Import the data set.
Step 3: Convert the given data set into a time series object by properly defining the dates.
Step 4: Check the stationarity of the time series data. If it is not stationary, transform the given data into a stationary series using appropriate statistical techniques such as the method of differencing, moving average smoothing, etc.
Step 5: Fit a suitable ARMA model to the stationary time series.
Step 6: Obtain the residuals of the model and test the assumptions associated with the residuals, i.e. 1) residuals are uncorrelated with each other and 2) they follow a normal distribution.

Importing the required packages:

library(UsingR)
library(astsa)
##
## Attaching package: 'astsa'
## The following objects are masked from 'package:UsingR':
##
##     blood, chicken
library(tseries)
## Warning: package 'tseries' was built under R version 4.1.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

#ASTSA - Applied Statistical Time Series Analysis. This package contains data sets and scripts to accompany Time Series Analysis and Its Applications.
#tseries - Time Series Analysis and Computational Finance.

#Importing the required dataset
library(readxl)
earthquakes <- read_excel("C:/Users/DELL G 5/Desktop/earthquakes.xlsx")
View(earthquakes)

#Let 'data' represent the dataset imported using the above code
data=data.frame(earthquakes)
head(data,10) #First 10 rows of the data
##    Year Quakes
## 1  1916      2
## 2  1917      5
## 3  1918     12
## 4  1919      8
## 5  1920      7
## 6  1921      9
## 7  1922      7
## 8  1923     12
## 9  1924      9
## 10 1925     12
attach(data)

We need to convert the given data set into a time series object using the ts() function in R. The function ts() is used to create time-series objects. These are vectors or matrices with class "ts" (and additional attributes) which represent data which has been sampled at equispaced points in time.
Syntax: ts(data = NA, start = 1, end = numeric(), frequency = 1, deltat = 1, ts.eps = getOption("ts.eps"), class = , names = )

earthquakes=ts(Quakes,start=1916,end=2015)
earthquakes
## Time Series:
## Start = 1916
## End = 2015
## Frequency = 1
##   [1]  2  5 12  8  7  9  7 12  9 12 14  7 21 ...
##  [26] 13 11 18 13  5 10 13 11  9 13 12 18 17 ...
##  [51]  9 11 22 14 17 20 16  9 11 13  9 18 11 ...
##  [76] 17 13 12 13 20 15 16 12 18 15 12 19 19 ...
class(earthquakes)
## [1] "ts"

#Plotting the time series data
ts.plot(earthquakes,col="blue",lwd=3,main="Annual earthquake occurrences",gpars=list(xlab="Time (in years)", ylab="Frequency"))
grid(nx = NULL, ny = NULL,
     lty = 3,      # Grid line type
     col = "gray", # Grid line color
     lwd = 2)      # Grid line width

From the above time series plot, we can observe a slightly increasing trend of annual earthquake occurrences with time. As a result, the trend component mt might exist. However, there is no evident repetitive pattern or periodic fluctuations; therefore, the seasonal component St does not exist. The time series plot thus indicates that the data is non-stationary. An ARMA model can be fitted to a non-stationary data set by first removing the non-stationary component from the data. For that, we can either use moving average smoothing or the method of differencing. Here, the stationary version is extracted by applying the differencing operator to the data set.

Transforming the non-stationary series into a stationary series

We can transform the above non-stationary series into a stationary series by using the method of differencing to eliminate any trend component.

diffdata=diff(earthquakes) #diff command can be used to perform the first order differencing operation
ts.plot(diffdata,col="red",lwd=3,main="Time series plot after elimination of trend",gpars=list(xlab="Time (in years)", ylab="Frequency"))
grid(nx = NULL, ny = NULL,
     lty = 3,      # Grid line type
     col = "gray", # Grid line color
     lwd = 2)      # Grid line width

From the above time series plot, we can see that there is no deterministic/predictable pattern in the plot and the time series data is free of trend and seasonality components. Hence, the plot indicates that the differenced time series is stationary.

Checking stationarity of the time series data

ADF Test

The Augmented Dickey-Fuller test (ADF test) is a common statistical test used to check whether a given time series is stationary or not. The null and alternative hypotheses under this test are:
H0, Null hypothesis: the given time series is non-stationary.
H1, Alternative hypothesis: the given time series is stationary.

adf.test(diffdata)
## Warning in adf.test(diffdata): p-value smaller than printed p-value
##
##  Augmented Dickey-Fuller Test
##
## data:  diffdata
## Dickey-Fuller = -6.7817, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

Here, the p-value is less than the significance level 0.05 (5%). Hence, there is not enough statistical evidence in favour of the null hypothesis and we may reject it. We conclude that the differenced time series is stationary.

acf(diffdata,main="ACF plot of stationary series",col="red",lwd=2)

From the above ACF plot, it is clear that the ACF at almost all non-zero lags lies below the threshold level of the plot. This implies that the autocorrelation is insignificant at most non-zero lags, indicating that there are no strong serial dependencies left in the differenced series. So, the ACF plot indicates that the differenced data is stationary.
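As noted in the theory section, the PACF suggests the AR order p while the ACF suggests the MA order q, so it is natural to inspect the partial autocorrelation function of the differenced series as well; a minimal sketch using base R's pacf():

pacf(diffdata,main="PACF plot of stationary series",col="red",lwd=2)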
Fitting an ARMA model for the given data

#Fitting an ARMA model
library(forecast)
fit=auto.arima(diffdata,seasonal=FALSE)
fit
## Series: diffdata
## ARIMA(0,0,1) with zero mean
##
## Coefficients:
##           ma1
##       -0.8092
## s.e.   0.0710
##
## sigma^2 = 15.53:  log likelihood = -276.26
## AIC=556.53   AICc=556.65   BIC=561.72

The auto.arima command in R gives us the best fitted model among the class of ARMA models. From the result we can see that the fitted model is of the form ARIMA(0,0,1), i.e. an MA(1) model. Therefore, the model can be written as:

zt = at - 0.8092 at-1

where zt represents the differenced series, i.e. the change in the number of earthquake occurrences from year t-1 to year t. Here, zt depends only on the error/white noise term at its previous lag, at-1. The value of the moving-average parameter is -0.8092 and the SE of the coefficient is 0.0710. The AIC score of the model is 556.53, the smallest among the candidate ARMA models considered by auto.arima. Hence, the best fitted model is the MA(1) model above.

RESIDUAL ANALYSIS

Step 1: Extract the residuals from the fitted model.

Residuals can be extracted by using the resid() command:

res=resid(fit)
res
## Time Series:
## Start = 1917
## End = 2015
## Frequency = 1
##  [1]  2.33211693  7.54571402  1.35324173  0.02693396  1.97488805 -0.43206620 ...

Checking assumption 1: Errors are uncorrelated

We can examine this assumption either by plotting the time series plot/ACF of the residuals or by using a statistical test called the portmanteau test.

ts.plot(res,col="red",lwd=3,main="Time series plot of residuals",gpars=list(xlab="Time (in years)"))
grid(nx = NULL, ny = NULL,
     lty = 3,      # Grid line type
     col = "gray", # Grid line color
     lwd = 2)      # Grid line width

From the above time series plot of the residuals, we can see that there is no deterministic/predictable pattern in the plot and the residual series is free of trend and seasonality components. Hence, the plot indicates that the residuals are uncorrelated with each other.

acf(res,col="red",lwd=2,main="ACF plot of residuals")

From the above ACF plot of the residuals, it is clear that the ACF at almost all non-zero lags lies below the threshold level of the plot. This implies that the autocorrelation is insignificant at most non-zero lags, indicating that there are no dependencies among the residuals. So, the ACF plot indicates that the residuals are uncorrelated with each other.
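As an alternative to producing these diagnostics one by one, the forecast package provides checkresiduals(), which combines a time plot, an ACF plot and a histogram of the residuals with a Ljung-Box test in a single call:

checkresiduals(fit)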
For a good model, the ACF values at all non-zero lags should lie inside the threshold lines. In a finite sample, however, one or two lags may occasionally appear significant by chance. Generally, such cases can be ignored, and we can conclude that most of the dependence in the data set is explained by the fitted model.

Portmanteau test

The null and alternative hypotheses of the portmanteau test are given below:
H0, Null hypothesis: there is no significant correlation among the residual series.
H1, Alternative hypothesis: there is significant correlation among the residual series.
Based on the p-value, we can comment on the presence of autocorrelation.

Box.test(res) ## portmanteau test
##
##  Box-Pierce test
##
## data:  res
## X-squared = 0.034748, df = 1, p-value = 0.8521

Here, the p-value is greater than the significance level 0.05 (5%). Hence, there is not enough statistical evidence to reject the null hypothesis. We conclude that there is no significant correlation among the residual series, i.e. the residuals are uncorrelated with each other.

Checking assumption 2: Residuals are normally distributed

For checking this assumption, we can either use Q-Q plots or a statistical test such as the Shapiro-Wilk test.

Quantile-Quantile plot

The quantile-quantile (Q-Q) plot is a graphical technique for determining if two data sets come from populations with a common distribution. A Q-Q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. A 45-degree reference line is also plotted. If the two sets come from populations with the same distribution, the points should fall approximately along this reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the two data sets have come from populations with different distributions.

In the Q-Q plot of the residuals (a sketch of the code used to produce such a plot is given at the end of this subsection), we can see that most of the observations lie on or near the 45° straight line. This implies that the sample quantiles almost coincide with the theoretical (normal) quantiles, indicating that the residuals are normally distributed. We can affirm the normality of the residuals with the help of the Shapiro-Wilk test, as given below.

Normality test - Shapiro-Wilk test

The null and alternative hypotheses of the Shapiro-Wilk test are given below:
H0, Null hypothesis: residuals are normally distributed.
H1, Alternative hypothesis: residuals are not normally distributed.

shapiro.test(res)
##
##  Shapiro-Wilk normality test
##
## data:  res
## W = 0.98327, p-value = 0.2428

Here, the p-value is 0.2428, which is greater than the significance level 0.05 (5%). Hence, there is not enough statistical evidence to reject the null hypothesis. We conclude that the residuals are normally distributed.
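The Q-Q plot discussed above can be produced with base R's qqnorm() and qqline(); a minimal sketch:

qqnorm(res,main="Normal Q-Q plot of residuals",col="red") # sample vs theoretical quantiles
qqline(res,lwd=2)                                         # reference line through the quartiles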
Conclusion

From the time series plot of the given data, we can observe an increasing trend of annual earthquake occurrences with time. As a result, the trend component mt exists. However, there is no evident repetitive pattern or periodic fluctuations; therefore, the seasonal component St does not exist. Since the trend component exists, the given time series data is non-stationary.

The non-stationary data has been transformed into a stationary series by the method of first order differencing, and the stationary nature was verified with the help of the ADF test. An ARMA model has been fitted to the differenced data and the best fitted model was found to be an MA(1) process given by:

zt = at - 0.8092 at-1

where zt represents the differenced series, i.e. the change in the number of earthquake occurrences from year t-1 to year t. Here, zt depends only on the error/white noise term at its previous lag, at-1. The value of the moving-average parameter is -0.8092 and the SE of the coefficient is 0.0710. The AIC score of the model is 556.53, the smallest among the candidate ARMA models considered; hence the best fitted model is the MA(1) model above.

The residuals from the fitted model have been extracted, and the assumptions associated with the residuals, i.e. 1) residuals are uncorrelated and 2) residuals are normally distributed, have been tested with the help of both graphical tools and statistical tests. Residuals are useful in checking whether a model has adequately captured the information in the data. From the time series plot of the residuals, it was observed that there was no deterministic/predictable pattern and the residual series is free of trend and seasonality components; hence the plot indicates that the residuals are uncorrelated with each other. From the ACF plot of the residuals, the ACF at almost all non-zero lags lies below the threshold level of the plot. This implies that the autocorrelation is insignificant at most non-zero lags, indicating that there are no dependencies among the residuals. So, the ACF plot indicates that the residuals are uncorrelated with each other. The p-value obtained from the portmanteau test is greater than 0.05, and hence we conclude that there is no significant correlation among the residual series, i.e. the residuals are uncorrelated with each other.

From the Q-Q plot, we can see that most of the observations lie on or near the 45° straight line. This implies that the sample quantiles almost coincide with the theoretical (normal) quantiles, indicating that the residuals are normally distributed. In addition, the result of the Shapiro-Wilk test shows that the residuals are normally distributed.

Violation of any one of the above assumptions could call the reliability of the fitted model into question, and any forecasting method that does not satisfy these properties can be improved. For example, if the residuals are autocorrelated, this indicates mis-specification of the model. Also, if the errors are not normally distributed, the non-normal data should first be transformed towards normality by a log transformation or a Box-Cox transformation, and the above procedure then repeated for the transformed data.
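Had the normality assumption failed here, one possible remedy (a sketch only, assuming the forecast package, which supplies BoxCox.lambda() and a lambda argument to auto.arima()) would be to re-fit the model on a Box-Cox transformed series:

lambda <- BoxCox.lambda(earthquakes)             # estimate the transformation parameter
fit2   <- auto.arima(earthquakes, lambda=lambda) # re-fit with the Box-Cox transformation applied
res2   <- resid(fit2)                            # the residual analysis above would then be repeated on res2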