Modeling Road Accident Data of Saudi Arabia

A THESIS SUBMITTED TO THE GRADUATE EDUCATIONAL POLICIES COUNCIL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE MASTER OF SCIENCE

By Abdulrahman Alshammari

Advisor: Dr. Rahmatullah Imon

Ball State University
Muncie, Indiana
May 2015

ACKNOWLEDGEMENTS

Foremost, I would like to express my sincere gratitude to my advisor, Professor Dr. Rahmatullah Imon, for his continuous support of my thesis study, and for his patience, motivation, enthusiasm, and immense knowledge. His guidance helped me throughout the analysis and the writing of this report. I could not have imagined a better advisor and mentor for my thesis. Besides my advisor, I would like to thank the rest of my thesis committee, Dr. Munni Begum and Dr. Yayuan Xiao, for their encouragement, insightful comments, and patience. I am thankful to all my classmates for their kind support. Last but not least, I would like to thank my family: my mother, my brothers, and my sisters, for supporting me throughout my life.

Abdulrahman Alshammari
March 25, 2015

Table of Contents

CHAPTER 1  INTRODUCTION  1
1.1 Saudi Arabia Road Accident Data  1
1.2 Outline of the Study  10

CHAPTER 2  ESTIMATION OF MISSING VALUES, CROSS VALIDATION AND ARIMA MODELS IN TIME SERIES  11
2.1 Estimation of Missing Values  11
2.2 Estimation of Missing Values in Time Series  13
2.3 Robust Estimation Methods  16
2.4 Bootstrap  21
2.5 Stochastic Time Series Modeling  25
2.6 Cross Validation in Time Series  39
2.7 Computation  43

CHAPTER 3  ESTIMATION OF MISSING VALUES BY ROBUST EM AND BOOTSTRAP ALGORITHM FOR SAUDI ARABIA ROAD ACCIDENT DATA  44
3.1 Estimation of Missing Values  44
3.2 Evaluation of Estimation of Missing Values  50
CHAPTER 4  SELECTION OF ARIMA MODELS FOR SAUDI ARABIA ROAD ACCIDENTS DATA  59
4.1 Total Number of Road Accidents in Saudi Arabia  59
4.2 Total Number of Road Accidents in Riyadh  62
4.3 Total Number of Road Accidents in Mecca  64
4.4 Total Number of Road Accidents in Eastern Province  66
4.5 Total Number of Road Accidents in Medina  68
4.6 Total Number of Road Accidents in Al Qassim  70
4.7 Total Number of Road Accidents in Tabuk  73
4.9 Total Number of Road Accidents in Asir  75
4.10 Total Number of Road Accidents in Albahih  77
4.11 Total Number of Road Accidents in Northern Borders  79
4.12 Total Number of Road Accidents in Al Jouf  81
4.13 Total Number of Road Accidents in Ha'il  84
4.14 Total Number of Road Accidents in Najran  86
4.15 Total Number of Road Accidents in Jizan  88
4.17 Result Summary  90

CHAPTER 5  CONCLUSIONS AND DIRECTION OF FUTURE RESEARCH  92
5.1 Conclusions  92
5.2 Direction of Future Research  93

REFERENCES  94

List of Tables

Chapter 1
Table 1.1: Total Number of Road Accidents in Saudi Arabia  2
Table 1.2: Total Number of Road Accidents in Riyadh  2

Chapter 2
Table 2.1: Specification of ARIMA Models  39

Chapter 3
Table 3.1: Estimates of Missing Values for the Year 2003 Using Conventional Methods  44
Table 3.2: Estimates of Missing Values for the Year 2004 Using Conventional Methods  44
Table 3.3: Estimates of Missing Values for 2003 Using EM Robust Time Series Methods  47
Table 3.4: Estimates of Missing Values for 2004 Using EM Robust Time Series Methods  48
Table 3.5: Estimates of Missing Values for 2003 Using EM Robust Time Series Methods  48
Table 3.6: Estimates of Missing Values for 2004 Using EM Robust Time Series Methods  49
Table 3.7: A Comparison Between Exact and Estimated Values of the Total Number of Road Accidents for 2004 in Saudi Arabia  50
Table 3.8: Total Number of Road Accidents in Saudi Arabia after the Missing Value Estimation  51
Table 3.9: Total Number of Road Accidents in Different Regions of Saudi Arabia after the Missing Value Estimation  52
Chapter 4
Table 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi Arabia  60
Table 4.2: Significance of ARIMA Models for the Number of Road Accidents in Saudi Arabia  61
Table 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Riyadh  62
Table 4.4: Significance of ARIMA Models for the Number of Road Accidents in Riyadh  63
Table 4.5: The ACF and PACF Values for the Total Number of Road Accidents in Mecca  64
Table 4.6: Significance of ARIMA Models for the Number of Road Accidents in Mecca  65
Table 4.7: The ACF and PACF Values for the Total Number of Road Accidents in E Province  66
Table 4.8: Significance of ARIMA Models for the Number of Road Accidents in E Province  67
Table 4.9: The ACF and PACF Values for the Total Number of Road Accidents in Medina  69
Table 4.10: Significance of ARIMA Models for the Number of Road Accidents in Medina  69
Table 4.11: The ACF and PACF Values for the Total Number of Road Accidents in Qassim  71
Table 4.12: Significance of ARIMA Models for the Number of Road Accidents in Qassim  72
Table 4.13: The ACF and PACF Values for the Total Number of Road Accidents in Tabuk  73
Table 4.14: Significance of ARIMA Models for the Number of Road Accidents in Tabuk  74
Table 4.15: The ACF and PACF Values for the Total Number of Road Accidents in Asir  75
Table 4.16: Significance of ARIMA Models for the Number of Road Accidents in Asir  76
Table 4.17: The ACF and PACF Values for the Total Number of Road Accidents in Albahih  77
Table 4.18: Significance of ARIMA Models for the Number of Road Accidents in Albahih  78
Table 4.19: The ACF and PACF Values for the Total Number of Road Accidents in N Borders  80
Table 4.20: Significance of ARIMA Models for the Number of Road Accidents in N Borders  80
Table 4.21: The ACF and PACF Values for the Total Number of Road Accidents in Al Jouf  82
Table 4.22: Significance of ARIMA Models for the Number of Road Accidents in Al Jouf  83
Table 4.23: The ACF and PACF Values for the Total Number of Road Accidents in Ha'il  84
Table 4.24: Significance of ARIMA Models for the Number of Road Accidents in Ha'il  85
Table 4.25: The ACF and PACF Values for the Total Number of Road Accidents in Najran  86
Table 4.26: Significance of ARIMA Models for the Number of Road Accidents in Najran  87
Table 4.27: The ACF and PACF Values for the Total Number of Road Accidents in Jizan  88
Table 4.28: Significance of ARIMA Models for the Number of Road Accidents in Jizan  89
Table 4.29: Selected Models for Saudi Arabia Road Accident Data  90

List of Figures

Chapter 1
Figure 1.1: The Political Map of Saudi Arabia with Different Regions  2
Figure 1.2: Time Series Plot of Total Road Accidents in Saudi Arabia  5
Figure 1.3: Time Series Plot of Total Road Accidents in Riyadh  5
Figure 1.4: Time Series Plot of Total Road Accidents in Mecca  5
Figure 1.5: Time Series Plot of Total Road Accidents in Eastern Province  6
Figure 1.6: Time Series Plot of Total Road Accidents in Medina  6
Figure 1.7: Time Series Plot of Total Road Accidents in Al Qassim  6
Figure 1.8: Time Series Plot of Total Road Accidents in Tabuk  7
Figure 1.9: Time Series Plot of Total Road Accidents in Asir  7
Figure 1.10: Time Series Plot of Total Road Accidents in Albahih  7
Figure 1.11: Time Series Plot of Total Road Accidents in Northern Borders  8
Figure 1.12: Time Series Plot of Total Road Accidents in Al Jouf  8
Figure 1.13: Time Series Plot of Total Road Accidents in Ha'il  8
Figure 1.14: Time Series Plot of Total Road Accidents in Najran  9
Figure 1.15: Time Series Plot of Total Road Accidents in Jizan  9

Chapter 2
Figure 2.1: A Schematic Diagram of Bootstrap and Simulation  23

Chapter 3
Figure 3.1: Time Series Plot of Original and Estimated Road Accidents in Saudi Arabia  54
Figure 3.2: Time Series Plot of Original and Estimated Road Accidents in Riyadh  54
Figure 3.3: Time Series Plot of Original and Estimated Road Accidents in Mecca  54
Figure 3.4: Time Series Plot of Original and Estimated Road Accidents in E Province  55
Figure 3.5: Time Series Plot of Original and Estimated Road Accidents in Medina  55
Figure 3.6: Time Series Plot of Original and Estimated Road Accidents in Al Qassim  55
Figure 3.7: Time Series Plot of Original and Estimated Road Accidents in Tabuk  56
Figure 3.8: Time Series Plot of Original and Estimated Road Accidents in Asir  56
Figure 3.9: Time Series Plot of Original and Estimated Road Accidents in Albahih  56
Figure 3.10: Time Series Plot of Original and Estimated Road Accidents in N Borders  57
Figure 3.11: Time Series Plot of Original and Estimated Road Accidents in Al Jouf  57
Figure 3.12: Time Series Plot of Original and Estimated Road Accidents in Ha'il  57
Figure 3.13: Time Series Plot of Original and Estimated Road Accidents in Najran  58
Figure 3.14: Time Series Plot of Original and Estimated Road Accidents in Jizan  58

Chapter 4
Figure 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi Arabia  59
Figure 4.2: The ACF and PACF Values for the Total Number of Road Accidents in Riyadh  62
Figure 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Mecca  64
Figure 4.4: The ACF and PACF Values for the Total Number of Road Accidents in E Province  66
Figure 4.5: The ACF and PACF Values for the Total Number of Road Accidents in Medina  68
Figure 4.6: The ACF and PACF Values for the Total Number of Road Accidents in Qassim  71
Figure 4.7: The ACF and PACF Values for the Total Number of Road Accidents in Tabuk  73
Figure 4.8: The ACF and PACF Values for the Total Number of Road Accidents in Asir  75
Figure 4.9: The ACF and PACF Values for the Total Number of Road Accidents in Albahih  77
Figure 4.10: The ACF and PACF Values for the Total Number of Road Accidents in N Borders  79
Figure 4.11: The ACF and PACF Values for the Total Number of Road Accidents in Al Jouf  82
Figure 4.12: The ACF and PACF Values for the Total Number of Road Accidents in Ha'il  84
Figure 4.13: The ACF and PACF Values for the Total Number of Road Accidents in Najran  86
Figure 4.14: The ACF and PACF Values for the Total Number of Road Accidents in Jizan  88

CHAPTER 1

INTRODUCTION

Road accidents have been considered one of the most serious problems in Saudi Arabia since 1994 [see Abdulrahman (2013)]. According to a recent survey, Saudi Arabia ranks 12th in the world in road fatality rate per 100,000 inhabitants. The fatality rate for Saudi Arabia is 24.8, more than twice the fatality rate of the USA (11.2). According to Abdulrahman (2013), 570 people died in road accidents in 1971. By 1994 this number had increased to 4077. Despite many desperate measures taken after 1994, there were 6205 deaths in road accidents in 2013. More than 68,000 people are injured annually, and the monetary losses exceed 13 billion SR (3.5 billion USD) every year. This problem is not new in Saudi Arabia, but its severity has been increasing over the years, since Saudi Arabia is developing and its road network is growing very fast.

1.1 Saudi Arabia Road Accident Data

In our study we consider road accident data for the entire Saudi Arabia and its different regions from 1999 to 2010. This data set is taken from the official website of the Saudi Arabian Ministry of Road Safety department.
Here is the link to the data:

http://www.moi.gov.sa/wps/portal/traffic/!ut/p/b1/jc_JDoIwEAbgZEJOmVaU46lLG1MVSSg9GJ6MEbDkhjj84uKR9S5TfL9sxBHmpAiChpRgWRPXO_v55O_nYfet8_eLQ5cbZQRFmmumALDaA2l4ohbNoJmBDFAGsVUgtAWwOiiDitbS2PhvzzMlPyZL_2V7Ih7sXxVl5hrpJAuRlZwvV7yCkyOE4B1aBNaKJlBoVnCI_1ZNIFvf7wnzB60kN3JJ1rM1GaC5NB8ACybrkI/dl4/d5/L2dJQSEvUUt3QS80SmtFL1o2X0dOVlMzR0gzMTBFNkMwSVE1SE9LNVUwSTgx/?WCM_GLOBAL_CONTEXT=/wps/wcm/connect/Traffic/Traffic+AR/Main/Statistics/

At first we present time series plots of the total number of road accidents for the entire Saudi Arabia, and then for the different regions: Riyadh, Mecca, Eastern Province, Medina, Al-Qassim, Tabuk, Asir, Albahih, Northern Borders, Al Jouf, Ha'il, Najran, and Jizan. These regions are located on the political map of Saudi Arabia presented in Figure 1.1. Data for the entire Saudi Arabia are given in Table 1.1 and the regional data are given in Table 1.2.

Figure 1.1: The Political Map of Saudi Arabia with Different Regions

Table 1.1: Total Number of Road Accidents in Saudi Arabia

Year   Number of accidents
1999   264326
2000   267772
2001   280401
2002   305649
2003   -
2004   261872
2005   293281
2006   296015
2007   283648
2008   435264
2009   485931
2010   484805

Table 1.2: Total Number of Road Accidents in Different Regions of Saudi Arabia

Year   Riyadh   Mecca    Eastern Province   Medina   Al-Qassim   Tabuk   Asir
2000   80067    94669    56182              747      7213        3591    10242
2001   84313    89520    64381              700      8266        3060    12469
2002   86387    97091    69143              973      8051        4033    18998
2003   -        -        -                  -        -           -       -
2004   -        -        -                  -        -           -       -
2005   47803    95649    76435              13422    12901       5043    18230
2006   49019    93987    71731              14471    11687       5974    22108
2007   48142    86854    62033              18988    13370       6061    19024
2008   134915   96204    107780             21334    14768       7503    20556
2009   143466   107932   127460             20956    17645       11634   22929
2010   141549   112844   112420             18448    18622       15500   25263

Year   Albahih   Northern Borders   Al Jouf   Ha'il   Najran   Jizan
2000   1169      3999               3447      3112    2431     903
2001   1716      4174               4248      3686    2412     1456
2002   -         -                  -         -       -        -
2003   -         -                  -         -       -        -
2004   2754      4930               5960      4878    2008     2620
2005   3293      4034               6889      4502    1823     3257
2006   3515      5668               7259      5835    1710     3051
2007   3208      6805               7170      5274    1963     4756
2008   2843      6160               9446      5884    2631     5240
2009   4245      6799               7284      5714    4215     5652
2010   5117      9449               7038      6069    4340     8146

Figure 1.2: Time Series Plot of Total Road Accidents in Saudi Arabia
Figure 1.3: Time Series Plot of Total Road Accidents in Riyadh
Figure 1.4: Time Series Plot of Total Road Accidents in Mecca
Figure 1.5: Time Series Plot of Total Road Accidents in Eastern Province
Figure 1.6: Time Series Plot of Total Road Accidents in Medina
Figure 1.7: Time Series Plot of Total Road Accidents in Al-Qassim
Figure 1.8: Time Series Plot of Total Road Accidents in Tabuk
Figure 1.9: Time Series Plot of Total Road Accidents in Asir
Figure 1.10: Time Series Plot of Total Road Accidents in Albahih
Figure 1.11: Time Series Plot of Total Road Accidents in Northern Borders
Figure 1.12: Time Series Plot of Total Road Accidents in Al Jouf
Figure 1.13: Time Series Plot of Total Road Accidents in Ha'il
Figure 1.14: Time Series Plot of Total Road Accidents in Najran
Figure 1.15: Time Series Plot of Total Road Accidents in Jizan

The time series plots of the variables reveal several interesting features. All of the plots show an increasing trend in the number of road accidents, both for the entire Saudi Arabia and for each of its regions. Another notable feature of this data set is that every variable has missing observations. For the entire Saudi Arabia, the total number of accidents for the year 2003 is missing, while for each region the totals for the years 2003 and 2004 are missing.

1.2 Outline of the Study

We organize this thesis in the following way.
In Chapter 2, we introduce the different methodologies we use in our research, which include the estimation of missing values by a variety of robust and bootstrap expectation maximization (EM) algorithms, a validation technique for assessing the goodness of the estimation of missing values, and ARIMA models for fitting the data. Since all variables that we consider in our study have missing observations, in Chapter 3 we employ a variety of algorithms for estimating the missing values. Since this is a time series problem, the conventional techniques, which assume independence of observations, are not appropriate here. Lack of normality can also be a big issue, so to avoid it we consider robust and nonparametric bootstrap techniques to estimate the missing values. Later we employ the mean absolute percentage error (MAPE) to validate which method estimates the missing values in the best possible way. In Chapter 4, in order to determine the most appropriate ARIMA model, we employ both graphical methods based on the autocorrelation function (ACF) and the partial autocorrelation function (PACF), and numerical tests such as the t-test and the Ljung-Box test based on the ACF and the PACF. In Chapter 5, we draw conclusions from our current research and suggest appropriate ARIMA models for all variables we considered in our study. In this chapter we also outline our directions for future research.

CHAPTER 2

ESTIMATION OF MISSING VALUES, CROSS VALIDATION AND ARIMA MODELS IN TIME SERIES

In this chapter we discuss different data analysis techniques useful in time series analysis. Although the prime topic of our discussion is ARIMA models, we start with missing value techniques. For fitting an ARIMA model it is essential to estimate the missing values first, because if there is any missing value in the series the estimates of the parameters of the ARIMA model do not converge.
We also talk about the cross validation technique as a measure of the accuracy of the estimates of the missing values.

2.1 Estimation of Missing Values

Missing data are a part of almost all research, and they have a negative influence on the analysis: loss of information and, as a result, loss of efficiency, loss of unbiasedness of estimated parameters, and loss of power. An excellent review of different aspects of missing values is available in Little and Rubin (2002). In this section we introduce a few commonly used missing value estimation techniques.

2.1.1 Mean Imputation Technique

Among the different methods for handling missing values, imputation [Little and Rubin (2002)] is one of the most widely used techniques for incomplete data problems. Therefore, this study focuses on several imputation methods to determine the best method for replacing missing data.

Let us consider $n$ observations $x_1, x_2, \ldots, x_n$ of which $m$ values are missing, denoted by $x_1^*, x_2^*, \ldots, x_m^*$. Thus the observed data with missing values are

$$x_1, x_2, \ldots, x_{n_1}, x_1^*, x_{n_1+1}, x_{n_1+2}, \ldots, x_{n_2}, x_2^*, x_{n_2+1}, x_{n_2+2}, \ldots, x_m^*, \ldots, x_n \qquad (2.1)$$

Therefore, the first missing value occurs after $n_1$ observations, the second missing value occurs after $n_2$ observations, and so on. Note that there might be more than one consecutive missing observation.

Mean-before Technique

The mean-before technique is one of the most popular imputation techniques for handling missing data. This technique consists of substituting each missing value with the mean of all available data before it. Thus for the data in (2.1), $x_1^*$ will be replaced by

$$\bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i \qquad (2.2)$$

and $x_2^*$ will be replaced by

$$\bar{x}_2 = \frac{1}{n_2 - n_1} \sum_{i=n_1+1}^{n_2} x_i \qquad (2.3)$$

and so on.

Mean-before-after Technique

The mean-before-after technique substitutes each missing value with the mean of one datum before the missing value and one datum after the missing value.
Thus for the data in (2.1), $x_1^*$ will be replaced by

$$\bar{x}_1 = \frac{x_{n_1} + x_{n_1+1}}{2} \qquad (2.4)$$

and $x_2^*$ will be replaced by

$$\bar{x}_2 = \frac{x_{n_2} + x_{n_2+1}}{2} \qquad (2.5)$$

and so on.

2.1.2 Median Imputation Technique

Since the mean is highly sensitive to extreme observations and/or outliers, median imputation is becoming more popular nowadays [Mamun (2014)]. For a data set given in (2.1), instead of the mean, the missing value is estimated by the median.

2.1.3 Expectation Maximization Algorithm

Let $y$ be an incomplete data vector whose density function is $f(y; \theta)$, where $\theta$ is a $p$-dimensional parameter. If $y$ were complete, the maximum likelihood estimate of $\theta$ would be based on the distribution of $y$, and the log-likelihood function $\log L(\theta; y)$ would be maximized. As $y$ is incomplete, we may write $y = (y_{obs}, y_{mis})$, where $y_{obs}$ is the observed incomplete data and $y_{mis}$ is the unobserved missing data. Assuming that the data are missing at random,

$$f(y; \theta) = f(y_{obs}; \theta)\, f(y_{mis} \mid y_{obs}; \theta) \qquad (2.6)$$

Considering the log-likelihood function:

$$\log L(\theta; y) = \log L(\theta; y_{obs}) + \log f(y_{mis} \mid y_{obs}; \theta) \qquad (2.7)$$

The EM algorithm maximizes $\log L(\theta; y)$ in each iteration by replacing it with its conditional expectation given the observed data $y_{obs}$. The EM algorithm has an E-step (expectation step) followed by an M-step (maximization step) as follows:

E-step: Compute

$$Q(\theta \mid \theta^{(t)}) = E\left[\log L(\theta; y) \mid y_{obs}, \theta^{(t)}\right] \qquad (2.8)$$

where $\theta^{(t)}$ is the current estimate of $\theta$.

M-step: Find $\theta^{(t+1)}$ such that

$$Q(\theta^{(t+1)} \mid \theta^{(t)}) \ge Q(\theta \mid \theta^{(t)}) \text{ for all } \theta \qquad (2.9)$$

The E-step and M-step are repeated alternately until the difference $|L(\theta^{(t+1)}) - L(\theta^{(t)})|$ is less than $\delta$, where $\delta$ is a small quantity. If convergence of the likelihood function of the complete data, $L(\theta; y)$, is attainable, then convergence of the EM algorithm is also attainable. The rate of convergence depends on the number of missing observations. Dempster, Laird, and Rubin (1977) show that convergence is linear, with rate proportional to the fraction of information about $\theta$ in $\log L(\theta; y)$ that is observed.

2.2 Estimation of Missing Values in Time Series

The missing value estimation techniques discussed above are designed for independent observations. But in regression or in time series we assume a model, and that must be taken into consideration when we try to estimate missing values.
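The mean-before, mean-before-after, and median imputation rules above can be sketched in a few lines of code. This is only an illustrative sketch: the function names and the toy series are assumptions, not part of the thesis, and missing values are represented by `None`.

```python
# Illustrative sketch of the simple imputation rules of Section 2.1.
# Missing values are encoded as None; all names here are hypothetical.

def mean_before(series, i):
    """Mean-before rule (2.2)-(2.3): replace the missing value at index i
    by the mean of the available observations that precede it."""
    prior = [x for x in series[:i] if x is not None]
    return sum(prior) / len(prior)

def mean_before_after(series, i):
    """Mean-before-after rule (2.4)-(2.5): average of the single datum
    immediately before and immediately after the missing value."""
    return (series[i - 1] + series[i + 1]) / 2

def median_impute(series):
    """Median imputation: replace every missing value by the median
    of all observed values."""
    obs = sorted(x for x in series if x is not None)
    n = len(obs)
    mid = n // 2
    med = obs[mid] if n % 2 else (obs[mid - 1] + obs[mid]) / 2
    return [med if x is None else x for x in series]

y = [10.0, 12.0, 14.0, None, 18.0, 20.0]
print(mean_before(y, 3))        # mean of 10, 12, 14 -> 12.0
print(mean_before_after(y, 3))  # (14 + 18) / 2 -> 16.0
print(median_impute(y)[3])      # median of observed values -> 14.0
```

Note how the three rules can give quite different estimates for the same gap; Chapter 3 compares such candidates via MAPE.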
In time series things are even more challenging, as the observations are dependent. Here we consider several commonly used missing value techniques useful for time series.

2.2.1 Estimation with Trend Models

We begin with simple models that can be used to forecast a time series on the basis of its past behavior. Most of the series we encounter are not continuous in time; instead, they consist of discrete observations made at regular intervals of time. We denote the values of a time series by $\{y_t\}$, $t = 1, 2, \ldots, T$. Our objective is to model the series $y_t$ and use that model to forecast $y_t$ beyond the last observation $y_T$. We denote the forecast $l$ periods ahead by $\hat{y}_{T+l}$. We sometimes can describe a time series $y_t$ by using a trend model defined as

$$y_t = TR_t + \varepsilon_t \qquad (2.10)$$

where $TR_t$ is the trend in time period $t$.

Linear Trend Model:

$$TR_t = \beta_0 + \beta_1 t \qquad (2.11)$$

We can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t \qquad (2.12)$$

Polynomial Trend Model of Order p:

$$TR_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_p t^p \qquad (2.13)$$

If the number of observations is not too large, we can predict $y_t$ by

$$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 t + \hat{\beta}_2 t^2 + \cdots + \hat{\beta}_p t^p \qquad (2.14)$$

Exponential Trend Model: The exponential trend model, often referred to as a growth curve model, is defined as

$$y_t = \beta_0 \beta_1^t \varepsilon_t \qquad (2.15)$$

If $\beta_0 > 0$ and $\beta_1 > 0$, applying a logarithmic transformation to the above model yields

$$y_t^* = \beta_0^* + \beta_1^* t + \varepsilon_t^* \qquad (2.16)$$

where $y_t^* = \ln y_t$, $\beta_0^* = \ln \beta_0$, $\beta_1^* = \ln \beta_1$ and $\varepsilon_t^* = \ln \varepsilon_t$.

2.2.2 Estimation with Smoothing Techniques

Several decomposition and smoothing techniques are available in the literature for estimating the missing values of a time series.

Moving Average

Moving average is a technique of updating averages by dropping the earliest observation and adding the latest observation of a series. Suppose we have $T = nL$ observations, where $L$ is the number of seasons; then the first moving average is obtained as

$$\bar{y}_1 = (y_1 + y_2 + \cdots + y_L)/L \qquad (2.17)$$

The second moving average is obtained as

$$\bar{y}_2 = (y_2 + \cdots + y_L + y_{L+1})/L = \bar{y}_1 + (y_{L+1} - y_1)/L \qquad (2.18)$$

In general, the $m$-th moving average is obtained as

$$\bar{y}_m = \bar{y}_{m-1} + (y_{m+L-1} - y_{m-1})/L \qquad (2.19)$$

Centered Moving Average

Instead of the moving average we often use the centered moving average (CMA), which is the average of two successive moving average values. Some other useful smoothing techniques that can be used for estimating the missing values in time series are the exponentially weighted moving average (EWMA) and the Holt-Winters model.

2.3 Robust Estimation Methods

Robustness now plays a key role in time series. According to Kadane (1984), 'Robustness is a fundamental issue for all statistical analyses; in fact it might be argued that robustness is the subject of statistics.' The term robustness signifies insensitivity to small deviations from the assumptions. That means a robust procedure is nearly as efficient as the classical procedure when the classical assumptions hold strictly, but is considerably more efficient overall when there is a small departure from them. The main application of robust techniques in a time series problem is to devise estimators that are not strongly affected by outliers or departures from the assumed model. In time series, robust techniques grew up in parallel to diagnostics (Hampel et al., 1986), and initially they were used to estimate parameters and to construct confidence intervals in such a way that outliers or departures from the assumptions do not affect them. A large body of literature is now available [Rousseeuw and Leroy (1987); Maronna, Martin, and Yohai (2006); Hadi, Imon and Werner (2009)] on robust techniques that are readily applicable in linear regression or in time series.

L – Estimator

A first step toward a more robust time series estimator was the consideration of the least absolute values estimator (often referred to as the L – estimator).
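The trend-fitting and moving-average formulas of Section 2.2 can be sketched as follows. This is a hedged illustration, not the thesis's own computation: it fits the linear trend model (2.11)-(2.12) by ordinary least squares in closed form and applies the drop-earliest/add-latest update of equation (2.19); the function names and the toy series are assumptions.

```python
# Illustrative sketch: linear trend fit (2.11)-(2.12) and the
# moving-average update (2.17)-(2.19). All names are hypothetical.

def fit_linear_trend(y):
    """Least-squares fit of y_t = b0 + b1 * t for t = 1..T."""
    T = len(y)
    t = list(range(1, T + 1))
    t_bar = sum(t) / T
    y_bar = sum(y) / T
    b1 = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / \
         sum((ti - t_bar) ** 2 for ti in t)
    b0 = y_bar - b1 * t_bar
    return b0, b1

def moving_averages(y, L):
    """Successive moving averages of window length L, computed by the
    drop-earliest / add-latest update of equation (2.19)."""
    ma = [sum(y[:L]) / L]                                  # first average (2.17)
    for m in range(1, len(y) - L + 1):
        ma.append(ma[-1] + (y[m + L - 1] - y[m - 1]) / L)  # update (2.19)
    return ma

y = [3, 5, 7, 9, 11, 13]           # an exact linear trend y_t = 1 + 2t
b0, b1 = fit_linear_trend(y)
print(b0, b1)                      # -> 1.0 2.0
print(moving_averages(y, 3))       # -> [5.0, 7.0, 9.0, 11.0]
```

On an exact linear trend the fit recovers the coefficients, and each moving average equals the plain window mean, confirming that the recursive update (2.19) matches the direct definition (2.17).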
In the OLS method, outliers may have a very large influence, since the parameters are estimated by minimizing the sum of squared residuals

$$\sum_{t=1}^{n} \hat{u}_t^2$$

L estimates are considered to be less sensitive, since they are determined by minimizing the sum of absolute residuals

$$\sum_{t=1}^{n} |\hat{u}_t|$$

The L estimator was first introduced by Edgeworth in 1887, who argued that the OLS method is overly influenced by outliers, but because of computational difficulties it was not popular and not much used until quite recently. Sometimes we consider the L – estimator as a special case of the $L_p$-norm estimator, where the estimates are obtained by minimizing

$$\sum_{t=1}^{n} |\hat{u}_t|^p$$

The $L_2$-norm estimator is the OLS, while the $L_1$-norm estimator is the L – estimator. But unfortunately a single erroneous observation (high leverage point) can still totally offset the L – estimator.

Least Median of Squares

Rousseeuw (1984) proposed the Least Median of Squares (LMS) method, a fitting technique less sensitive to outliers than the OLS. In OLS, we estimate parameters by minimizing the sum of squared residuals $\sum_{t=1}^{n} \hat{u}_t^2$, which is obviously the same as minimizing the mean of squared residuals $\frac{1}{n}\sum_{t=1}^{n} \hat{u}_t^2$. Sample means are sensitive to outliers, but medians are not. Hence, to make the criterion less sensitive, we can replace the mean by a median to obtain the median of squared residuals

$$MSR(\hat{\beta}) = \mathrm{Median}\{\hat{u}_t^2\} \qquad (2.20)$$

Then the LMS estimate of $\beta$ is the value that minimizes $MSR(\hat{\beta})$. Rousseeuw and Leroy (1987) have shown that LMS estimates are very robust with respect to outliers and have the highest possible 50% breakdown point.

Least Trimmed Squares

The least trimmed (sum of) squares (LTS) estimator was proposed by Rousseeuw (1984). In this method we estimate $\beta$ so as to

$$\text{minimize } LTS(\hat{\beta}) = \sum_{t=1}^{h} \hat{u}_{(t)}^2 \qquad (2.21)$$

where $\hat{u}_{(t)}$ is the $t$-th ordered residual. For a trimming percentage $\alpha$, Rousseeuw and Leroy (1987) suggested choosing the number of observations $h$ on which the model is fitted as $h = [n(1 - \alpha)] + 1$. The advantage of using LTS over LMS is that in the LMS we always fit the regression line based on roughly 50% of the data, but in the LTS we can control the level of trimming. When we suspect that the data contain nearly 10% outliers, the LTS with 10% trimming will certainly produce a better result than the LMS. We can increase the level of trimming if we suspect there are more outliers in the data.

M – Estimator

Huber (1973) generalized the estimation of parameters by considering a class of estimators which choose $\beta$ to

$$\text{minimize } \sum_{i=1}^{n} \rho(\varepsilon_i) = \text{minimize } \sum_{i=1}^{n} \rho\left(y_i - x_i^T \beta\right) \qquad (2.22)$$

where $\rho$ is a symmetric function less sensitive to outliers than squares. An estimator of this type is called an M – estimator, where M stands for maximum likelihood. It is easy to see from (2.22) that the function $\rho$ is related to the likelihood function for an appropriate choice of the error distribution. For example, if the error distribution is normal, then $\rho(z) = z^2/2$, $-\infty < z < \infty$, which yields the OLS estimator. The M – estimator obtained from (2.22) is not scale invariant. To obtain a scale invariant version of this estimator we solve

$$\text{minimize } \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i^T \beta}{\sigma}\right) \qquad (2.23)$$

In most practical applications the value of $\sigma$ is unknown, and it is usually estimated before solving (2.23). A popular choice is $\tilde{\sigma}$ = MAD (normalized). To minimize (2.23), we equate the first partial derivatives of $\rho$ with respect to $\beta_j$, $j = 0, 1, \ldots, p$, to zero, yielding a necessary condition for a minimum. This gives a system of $k = p + 1$ equations

$$\sum_{i=1}^{n} \psi\left(\frac{y_i - x_i^T \beta}{\tilde{\sigma}}\right) x_{ij} = 0, \qquad j = 0, 1, \ldots, p \qquad (2.24)$$

where $\psi = \rho'$. In general the function $\psi$ is nonlinear and (2.24) must be solved by iterative methods.
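A common iterative method for solving the M-estimating equations is iteratively reweighted least squares. The sketch below is a hedged illustration, not the thesis's own computation: it applies the idea to the simplest case of a location parameter using the Huber ρ function, with the tuning constant c = 1.345, a normalized-MAD scale, and toy data all chosen as assumptions.

```python
# Illustrative sketch: Huber M-estimation of a location parameter by
# iteratively reweighted least squares. c = 1.345 and the data are
# hypothetical choices, not values from the thesis.

def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Iterate weighted means until the Huber location estimate converges.
    The scale is fixed at the normalized MAD, as suggested in the text."""
    mu = sorted(x)[len(x) // 2]                               # start near the median
    s = 1.4826 * sorted(abs(v - mu) for v in x)[len(x) // 2]  # normalized MAD scale
    for _ in range(max_iter):
        # Huber weights: 1 for small standardized residuals, c/|r| for large ones
        w = [1.0 if abs((v - mu) / s) <= c else c / abs((v - mu) / s)
             for v in x]
        mu_new = sum(wi * vi for wi, vi in zip(w, x)) / sum(w)
        if abs(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu

data = [9.8, 10.1, 10.0, 9.9, 10.2, 50.0]   # one gross outlier
print(huber_location(data))   # close to 10, unlike the mean (about 16.7)
```

The outlier receives a weight near zero, so the estimate stays close to the bulk of the data; this is exactly the insensitivity that the ρ (and ψ) functions are designed to provide.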
S – Estimator

Rousseeuw and Yohai (1984) suggested another class of high breakdown estimators based on minimizing the dispersion of the residuals:

$$\text{minimize } s(\hat{\varepsilon}_1, \hat{\varepsilon}_2, \ldots, \hat{\varepsilon}_n) \qquad (2.25)$$

The above dispersion $s$ is defined as the solution of

$$\frac{1}{n} \sum_{i=1}^{n} \rho(\hat{\varepsilon}_i / s) = K$$

where $K$ is often put equal to $E[\rho(Z)]$, with $Z$ standard normal. The function $\rho$ must satisfy the following conditions: $\rho$ is symmetric and continuously differentiable, and $\rho(0) = 0$. The estimator thus obtained is called an S – estimator because it is derived from a scale statistic in an implicit way. In fact, $s$ in the above estimating equation is an M – estimator of scale.

MM – Estimator

The MM – estimator was originally proposed by Yohai (1987). The objective was to produce a high breakdown point estimator that maintains good efficiency. The MM – estimator has three stages. The initial estimate is an S – estimate, so it has a high breakdown point. The second stage computes an M – estimate of the error standard deviation using the residuals from the initial S – estimate. The last stage is an M – estimate of the parameters using a hard redescending weight function to give a very small (often zero) weight to sufficiently large residuals. In an extensive performance evaluation of several robust regression estimators, Simpson and Montgomery (1998) report that MM – estimators have high efficiency and work well in most outlier scenarios.

2.4 Bootstrap

According to Schucany (1988), in a sample reuse technique 'a sample is used to evaluate a statistic and then may be used again to assess or improve the performance of the basic statistical procedure.' This implies that this concept is a subjective, post-data concept. There is a variety of sample reuse or resampling methods in the literature which have been in use in a regression context. Among them, the bootstrap technique proposed by Efron (1979) has become extremely popular with statisticians.
In this procedure one can create a huge number of sub-samples from a pre-observed data set by simple random sampling with replacement. These sub-samples can later be used to investigate the nature of the population without any assumption about the population itself. There are several forms of the bootstrap and, additionally, several other resampling methods related to it, such as jackknifing, cross-validation, randomization tests, and permutation tests. Suppose that we draw a sample $S = \{x_1, x_2, \ldots, x_n\}$ from a population $P = \{X_1, X_2, \ldots, X_N\}$; imagine further, at least for the time being, that $N$ is very much larger than $n$, and that $S$ is either a simple random sample or an independent random sample from $P$. Now suppose that we are interested in some statistic $T = t(S)$ as an estimate of the corresponding population parameter $\theta = t(P)$. Again, $\theta$ could be a vector of parameters and $T$ the corresponding vector of estimates, but for simplicity assume that $\theta$ is scalar. A traditional approach to statistical inference is to make assumptions about the structure of the population (e.g., an assumption of normality) and, along with the stipulation of random sampling, to use these assumptions to derive the sampling distribution of $T$, on which classical inference is based. In certain instances, the exact distribution of $T$ may be intractable, and so we instead derive its asymptotic distribution. This familiar approach has two potentially important deficiencies:

1. If the assumptions about the population are wrong, then the corresponding sampling distribution of the statistic may be seriously inaccurate. On the other hand, if asymptotic results are relied upon, these may not hold to the required level of accuracy in a relatively small sample.

2. The approach requires sufficient mathematical prowess to derive the sampling distribution of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.
In contrast, the nonparametric bootstrap allows us to estimate the sampling distribution of a statistic empirically, without making assumptions about the form of the population and without deriving the sampling distribution explicitly. The essential idea of the nonparametric bootstrap is as follows. We proceed to draw a sample of size $n$ from among the elements of $S$, sampling with replacement. Call the resulting bootstrap sample $S_1^* = \{x_{11}^*, x_{12}^*, \ldots, x_{1n}^*\}$. It is necessary to sample with replacement, because we would otherwise simply reproduce the original sample $S$. In effect, we are treating $S$ as an estimate of the population $P$; that is, each element $x_i$ of $S$ is selected for the bootstrap sample with probability $1/n$, mimicking the original selection of the sample $S$ from the population $P$. We repeat this procedure a large number of times, $B$, selecting many bootstrap samples; the $b$-th such bootstrap sample is denoted $S_b^* = \{x_{b1}^*, x_{b2}^*, \ldots, x_{bn}^*\}$. The basic hypothesis is this: a representative and sufficient resample from the original sample contains as much information about the original population as the original sample does, since the original sample represents the population.

Figure 2.1: A schematic diagram of bootstrap and simulation [Efron and Tibshirani (1993)].

In the real world, an unknown probability distribution $F$ has given the observed data $S = \{x_1, x_2, \ldots, x_n\}$ by random sampling. We calculate a statistic of interest $T = t(S)$.
In the bootstrap world, the empirical distribution F̂ gives bootstrap samples S_b* = {x_b1*, x_b2*, ..., x_bn*} by random sampling, from which we calculate bootstrap replications of the statistic of interest T_b* = t(S_b*). The big advantage of the bootstrap world is that we can calculate as many replications of T_b* as we want, or at least as many as we can afford. Next, we compute the statistic T for each of the bootstrap samples; that is, T_b* = t(S_b*). Then the distribution of T_b* around the original estimate T is analogous to the sampling distribution of the estimator T around the population parameter θ. For example, the average of the bootstrapped statistics,

T̄* = Ê*(T*) = Σ_{b=1}^{B} T_b* / B,

is an estimate of the expectation of the bootstrap statistics; then Bias* = T̄* − T is an estimate of the bias of T. Similarly, the estimated bootstrap variance of T,

V̂*(T*) = Σ_{b=1}^{B} (T_b* − T̄*)² / (B − 1),

estimates the sampling variance of T. The random selection of bootstrap samples is not an essential aspect of the nonparametric bootstrap. At least in principle, we could enumerate all bootstrap samples of size n; then we could calculate E*(T*) and V*(T*) exactly, rather than having to estimate them. The number of bootstrap samples, however, is astronomically large unless n is tiny. There are, therefore, two sources of error in bootstrap inference: (a) the error induced by using a particular sample S to represent the population; (b) the sampling error produced by failing to enumerate all bootstrap samples. Making the number of bootstrap replications B sufficiently large can control the latter source of error.

How large should we take B, the number of bootstrap replications used to evaluate different estimates? The number of possible bootstrap samples is B = n^n. To estimate a standard error we usually use between 25 and 250 bootstrap replications, but for other estimates, such as confidence intervals or regression estimates, B must be much bigger.
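The bias and standard error estimates above can be sketched in a few lines. Here is a minimal Python illustration (the thesis computations themselves were carried out in R, S-Plus and Minitab); the data are observed national accident totals from Table 3.8, used purely as example values:

```python
# Minimal nonparametric bootstrap sketch: estimate the bias and standard
# error of the sample mean. The data values are observed national accident
# totals (Table 3.8, 1999-2002 and 2005-2007), used only for illustration.
import random
import statistics

def bootstrap_reps(sample, statistic, B=2000, seed=1):
    """Draw B bootstrap samples of size n with replacement; return T*_1..T*_B."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic([sample[rng.randrange(n)] for _ in range(n)])
            for _ in range(B)]

data = [264326, 267772, 280401, 305649, 293281, 296015, 283648]
t = statistics.mean(data)                  # T = t(S)
reps = bootstrap_reps(data, statistics.mean)
t_bar = statistics.mean(reps)              # average of the bootstrap statistics
bias_hat = t_bar - t                       # estimated bias of T
se_hat = statistics.stdev(reps)            # estimated standard error of T
```

Enumerating all n^n bootstrap samples would make these quantities exact, as noted above, but that is infeasible for all but tiny n.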
We may increase B if T = t(S) is a very complicated function of the data with high variability. The number of bootstrap replications needed depends on the data; if n ≤ 100 we generally take B ≤ 10000. Here are some rules of thumb gathered from Efron and Tibshirani's experience: even a small number of bootstrap replications, say B = 25, is usually informative; B = 50 is often enough to give a good estimate of a standard error; and very seldom are more than B = 200 replications needed for estimating a standard error. For estimating a standard error, the number B will ordinarily be in the range 25-200. Much bigger values of B are required for bootstrap confidence intervals, cross-validation, randomization tests and permutation tests. Some other suggestions about the number of replications needed in the bootstrap are also available in the literature [see Efron (1987), Hall (1992), Booth and Sarkar (1998)].

2.5 Stochastic Time Series Modeling

An excellent review of different aspects of stochastic time series modeling is available in Pindyck and Rubinfeld (1998), Bowerman et al. (2005) and Imon (2015). We assume that the time series have been generated by a stochastic process. In other words, we assume that each value y_1, y_2, ..., y_T in the series is randomly drawn from a probability distribution; that is, the observed series y_1, y_2, ..., y_T is drawn from a set of jointly distributed random variables. If we could specify the probability distribution function of our series, we could determine the probability of one or another future outcome. Unfortunately, the complete specification of the probability distribution function for a time series is usually impossible. However, it usually is possible to construct a simplified model of the time series which explains its randomness in a manner that is useful for forecasting purposes.

2.5.1 The Box-Jenkins Methodology

The Box-Jenkins methodology consists of a four-step iterative procedure.
Step 1: Tentative Identification. Historical data are used to tentatively identify an appropriate Box-Jenkins model.

Step 2: Estimation. Historical data are used to estimate the parameters of the tentatively identified model.

Step 3: Diagnostic Checking. Various diagnostics are used to check the adequacy of the tentatively identified model and, if need be, to suggest an improved model, which is then regarded as a new tentatively identified model.

Step 4: Forecasting. Once a final model is obtained, it is used to forecast future time series values.

2.5.2 Stationary and Nonstationary Time Series

It is important to know whether the stochastic process that generates the series can be assumed to be invariant with respect to time. If the characteristics of the stochastic process change over time, we call the process nonstationary. If the process is nonstationary, it will often be difficult to represent the time series over past and future intervals of time by a simple algebraic model. By contrast, if the process is stationary, one can model it via an equation with fixed coefficients that can be estimated from past data.

Properties of Stationary Processes

We have said that any stochastic time series y_1, y_2, ..., y_T can be thought of as having been generated by a set of jointly distributed random variables; i.e., the set of data points y_1, y_2, ..., y_T represents a particular outcome (also known as a realization) of the joint probability distribution function p(y_1, y_2, ..., y_T). Similarly, a future observation y_{T+1} can be thought of as being generated by a conditional probability distribution function

p(y_{T+1} | y_1, y_2, ..., y_T), (2.26)

that is, a probability distribution for y_{T+1} given the past observations y_1, y_2, ..., y_T. We define a stationary process, then, as one whose joint distribution and conditional distribution both are invariant with respect to displacement in time.
In other words, if the series is stationary, then

p(y_t, y_{t+1}, ..., y_{t+k}) = p(y_{t+m}, y_{t+m+1}, ..., y_{t+m+k}) and p(y_t) = p(y_{t+m}) (2.27)

for any t, k, and m. If the series y_t is stationary, the mean of the series, which is defined as

μ_y = E(y_t), (2.28)

must also be stationary, so that E(y_t) = E(y_{t+m}) for any t and m. Furthermore, the variance of the series,

σ_y² = E[(y_t − μ_y)²], (2.29)

must be stationary, so that

E[(y_t − μ_y)²] = E[(y_{t+m} − μ_y)²]. (2.30)

Finally, for any lag k, the covariance of the series,

γ_k = Cov(y_t, y_{t+k}) = E[(y_t − μ_y)(y_{t+k} − μ_y)], (2.31)

must be stationary, so that Cov(y_t, y_{t+k}) = Cov(y_{t+m}, y_{t+m+k}).

If a stochastic process is stationary, the probability distribution p(y_t) is the same for all times t and its shape can be inferred by looking at the histogram of the observations y_1, y_2, ..., y_T. An estimate of the mean μ_y can be obtained from the sample mean

ȳ = Σ_{t=1}^{T} y_t / T, (2.32)

and an estimate of the variance σ_y² can be obtained from the sample variance

σ̂_y² = Σ_{t=1}^{T} (y_t − ȳ)² / T. (2.33)

Usually it is very difficult to get a complete description of a stochastic process. The autocorrelation function can be extremely useful because it provides a partial description of the process for modeling purposes. The autocorrelation function tells us how much correlation there is between neighboring data points in the series y_t. We define the autocorrelation with lag k as

ρ_k = Cov(y_t, y_{t+k}) / √[V(y_t) V(y_{t+k})]. (2.34)

For a stationary time series the variance at time t is the same as the variance at time t + k; thus the autocorrelation becomes

ρ_k = γ_k / γ_0 = γ_k / σ_y², (2.35)

and thus ρ_0 = 1 for any stochastic process. Suppose the stochastic process is simply

y_t = ε_t,

where ε_t is an independently distributed random variable with zero mean. Then it is easy to show that for this process ρ_0 = 1 and ρ_k = 0 for k > 0.
This particular process is known as white noise, and there is no model that can provide a forecast any better than ŷ_{T+l} = μ for all l. Thus, if the autocorrelation function is zero (or close to zero) for all k > 0, there is little or no value in using a model to forecast the series.

2.5.3 Test for Significance of a White Noise Autocorrelation Function

In practice, we use an estimate of the autocorrelation function, called the sample autocorrelation (SAC) function,

r_k = Σ_{t=1}^{T−k} (y_t − ȳ)(y_{t+k} − ȳ) / Σ_{t=1}^{T} (y_t − ȳ)². (2.36)

It is easy to see from their definitions that both the theoretical and the estimated autocorrelation functions are symmetrical; i.e., ρ_k = ρ_{−k} and r_k = r_{−k}.

Bartlett's test. Here the null hypothesis is H_0: ρ_k = 0 for k > 0. Bartlett showed that when the time series is generated by a white noise process, the sample autocorrelation function is distributed approximately as a normal with mean 0 and variance 1/T. Hence the test statistic is

|z| = √T |r_k|, (2.37)

and we reject the null hypothesis at the 5% level of significance if |z| is greater than 1.96.

The t-test based on the SAC. The standard error of r_k is given by

SE(r_k) = √(1/T) if k = 1, and SE(r_k) = √[(1 + 2 Σ_{i=1}^{k−1} r_i²) / T] if k > 1. (2.38)

The t-statistic for testing the hypothesis H_0: ρ_k = 0 for k > 0 is defined as

t = r_k / SE(r_k), (2.39)

and this test is significant when |t| > 2.

Box-Pierce test and Ljung-Box test. To test the joint hypothesis that all the autocorrelation coefficients are zero we use a test statistic introduced by Box and Pierce. Here the null hypothesis is H_0: ρ_1 = ρ_2 = ... = ρ_k = 0. Box and Pierce showed that the appropriate statistic for testing this null hypothesis is

Q = T Σ_{i=1}^{k} r_i², (2.40)

which is distributed as chi-square with k degrees of freedom.
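The SAC and these white-noise checks can be sketched directly from the formulas above. A minimal Python illustration (the thesis computations used R, S-Plus and Minitab; the example series is made up):

```python
# Sketch of the sample autocorrelation r_k (2.36), Bartlett's |z| statistic
# (2.37), and the Box-Pierce Q statistic (2.40). Illustrative only.
import math

def sample_acf(y, k):
    """r_k = sum_{t=1}^{T-k} (y_t - ybar)(y_{t+k} - ybar) / sum_t (y_t - ybar)^2."""
    T = len(y)
    ybar = sum(y) / T
    num = sum((y[t] - ybar) * (y[t + k] - ybar) for t in range(T - k))
    den = sum((v - ybar) ** 2 for v in y)
    return num / den

def bartlett_z(y, k):
    """|z| = sqrt(T) |r_k|; reject H0: rho_k = 0 at the 5% level when |z| > 1.96."""
    return math.sqrt(len(y)) * abs(sample_acf(y, k))

def box_pierce_q(y, K):
    """Q = T sum_{k=1}^{K} r_k^2; compare with chi-square on K degrees of freedom."""
    return len(y) * sum(sample_acf(y, k) ** 2 for k in range(1, K + 1))

# A strongly alternating series has a large negative r_1, so |z| is large.
y = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
z1 = bartlett_z(y, 1)      # exceeds 1.96, so white noise is rejected
q3 = box_pierce_q(y, 3)
```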
A slight modification of the Box-Pierce test was suggested by Ljung and Box, known as the Ljung-Box Q (LBQ) test, defined as

Q = T(T + 2) Σ_{i=1}^{k} r_i² / (T − i). (2.41)

Thus, if the calculated value of Q is greater than the critical value at, say, the 5% level, we can be 95% sure that the true autocorrelation coefficients are not all zero.

Stationarity and the Autocorrelation Function

How can we decide whether a series is stationary, or determine the appropriate number of times a homogeneous nonstationary series should be differenced to arrive at a stationary series? The correlogram (a plot of the autocorrelation coefficients against the number of lag periods) can be a useful indicator. For a stationary series, the autocorrelation function drops off as k becomes large, but this usually is not the case for a nonstationary series. In order to employ the Box-Jenkins methodology, we must examine the behavior of the SAC. The SAC for a nonseasonal time series can display a variety of behaviors. First, the SAC for a nonseasonal time series can cut off at lag k. We say that a spike at lag k exists in the SAC if the SAC at lag k is statistically significant. Second, we say that the SAC dies down if this function does not cut off but rather decreases in a steady fashion. In general, it can be said that:

1. If the SAC of the time series values either cuts off fairly quickly or dies down fairly quickly, then the time series values should be considered stationary.

2. If the SAC of the time series values dies down extremely slowly, then the time series values should be considered nonstationary.

2.5.4 ARIMA Models

Moving Average Models

In a moving average process of order q, each observation y_t is generated by a weighted average of random disturbances going back q periods. We denote this process as MA(q) and write its equation as

y_t = μ + ε_t − θ_1 ε_{t−1} − θ_2 ε_{t−2} − ... − θ_q ε_{t−q}. (2.42)

In the moving average model the random disturbances are assumed to be independently distributed across time, i.e., generated by a white noise process. In particular, each ε_t is assumed to be a normal random variable with mean 0, variance σ_ε², and covariance γ_k = 0 for k ≠ 0. We have E(y_t) = μ, which shows that the mean of a moving average process is independent of time. The process MA(q) is described by exactly q + 2 parameters: the mean μ, the disturbance variance σ_ε², and the parameters θ_1, θ_2, ..., θ_q. Let us now look at the variance of the moving average process of order q:

γ_0 = E[(y_t − μ)²] = E[(ε_t − θ_1 ε_{t−1} − θ_2 ε_{t−2} − ... − θ_q ε_{t−q})²] = (1 + θ_1² + θ_2² + ... + θ_q²) σ_ε². (2.43)

The moving average process of order q has autocorrelation function

ρ_k = (−θ_k + θ_1 θ_{k+1} + θ_2 θ_{k+2} + ... + θ_{q−k} θ_q) / (1 + θ_1² + θ_2² + ... + θ_q²) for k = 1, 2, ..., q, and ρ_k = 0 for k > q. (2.44)

Autoregressive Models

In an autoregressive process of order p, the current observation y_t is generated by a weighted average of past observations going back p periods, together with a random disturbance in the current period. We denote this process as AR(p) and write its equation as

y_t = φ_1 y_{t−1} + φ_2 y_{t−2} + ... + φ_p y_{t−p} + δ + ε_t. (2.45)

Properties of Autoregressive Models

If the autoregressive process is stationary, then its mean, which we denote by μ, must be invariant with respect to time; i.e., E(y_t) = E(y_{t−1}) = ... = E(y_{t−p}) = μ. The mean is thus given by

μ = φ_1 μ + φ_2 μ + ... + φ_p μ + δ, or μ = δ / (1 − φ_1 − φ_2 − ... − φ_p).

If the process is stationary, then μ must be finite. For this, it is necessary that φ_1 + φ_2 + ... + φ_p < 1. For the first-order process AR(1), y_t = φ_1 y_{t−1} + δ + ε_t, the autocorrelation function is

ρ_k = φ_1^k for k ≥ 0. (2.46)

In general, for the second-order process AR(2) and k > 1,

γ_k = E[y_{t−k}(φ_1 y_{t−1} + φ_2 y_{t−2} + ε_t)] = φ_1 γ_{k−1} + φ_2 γ_{k−2}. (2.47)

Thus the autocorrelation function is given by

ρ_1 = φ_1 / (1 − φ_2), and ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} for k > 1. (2.48)

The Partial Autocorrelation Function

The partial autocorrelation function is used to determine the order of an autoregressive process. For an autoregressive process of order p, the covariance with displacement k is determined from

γ_k = E[y_{t−k}(φ_1 y_{t−1} + φ_2 y_{t−2} + ... + φ_p y_{t−p} + δ + ε_t)], (2.49)

which gives

γ_0 = φ_1 γ_1 + φ_2 γ_2 + ... + φ_p γ_p + σ_ε²
γ_1 = φ_1 γ_0 + φ_2 γ_1 + ... + φ_p γ_{p−1}
......
γ_p = φ_1 γ_{p−1} + φ_2 γ_{p−2} + ... + φ_p γ_0. (2.50)

The above equations also give a set of p equations, known as the Yule-Walker equations, to determine the first p values of the autocorrelation function:

ρ_1 = φ_1 + φ_2 ρ_1 + ... + φ_p ρ_{p−1}
......
ρ_p = φ_1 ρ_{p−1} + φ_2 ρ_{p−2} + ... + φ_p. (2.51)

The solution of the Yule-Walker equations requires knowledge of p. Therefore we solve these equations for successive values of p. We begin by hypothesizing that p = 1. We compute the sample autocorrelation r_1 as an estimate of ρ_1; if this value is significantly different from 0, we know that the autoregressive process is at least of order 1. Next we consider the hypothesis that p = 2: we solve the Yule-Walker equations for p = 2 and obtain a new set of estimates of φ_1 and φ_2. If φ_2 is significantly different from 0, we may conclude that the process is at least of order 2; otherwise we conclude that the process is of order 1. We repeat this process for successive values of p. We call the resulting series of leading coefficients φ_1, φ_2, ... the partial autocorrelation function. If the true order of the process is p, we should observe that φ̂_j ≈ 0 for j > p. To test whether a particular φ_j is zero, we can use the fact that it is approximately normally distributed with mean 0 and variance 1/T; hence we can check whether it is statistically significant at, say, the 5% level by determining whether it exceeds 2/√T in magnitude.

Mixed Autoregressive-Moving Average (ARMA) Models

Many stationary random processes cannot be modeled as purely moving average or as purely autoregressive, since they have qualities of both types of processes. In this situation we can use the mixed autoregressive-moving average (ARMA) process. An ARMA process of order (p, q) is defined as

y_t = φ_1 y_{t−1} + φ_2 y_{t−2} + ... + φ_p y_{t−p} + δ + ε_t − θ_1 ε_{t−1} − θ_2 ε_{t−2} − ... − θ_q ε_{t−q}. (2.52)

If the process is stationary, then its mean, which we denote by μ, must be invariant with respect to time and is given by

μ = φ_1 μ + φ_2 μ + ... + φ_p μ + δ, or μ = δ / (1 − φ_1 − φ_2 − ... − φ_p).

For the general ARMA(p, q) process it is not easy to obtain the variances, covariances and autocorrelations by solving equations. It can be shown easily, however, that

ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} + ... + φ_p ρ_{k−p} for k > q. (2.53)

It is interesting to note that q is the memory of the moving average part of the process, so that for k > q the autocorrelation function exhibits the properties of a purely autoregressive process. For the autoregressive part, the Yule-Walker equations again determine the first p values of the autocorrelation function:

ρ_1 = φ_1 + φ_2 ρ_1 + ... + φ_p ρ_{p−1}
......
ρ_p = φ_1 ρ_{p−1} + φ_2 ρ_{p−2} + ... + φ_p. (2.54)

These are solved for successive values of p exactly as described above for the pure autoregressive process, yielding the partial autocorrelation function.

Homogeneous Nonstationary Processes: ARIMA Models

Probably very few of the time series one meets in practice are stationary.
Fortunately, however, many of the nonstationary time series encountered in practice have the desirable property that if they are differenced one or more times, the resulting series will be stationary. Such a nonstationary series is termed homogeneous. The number of times the original series must be differenced before a stationary series results is called the order of homogeneity. Thus, if y_t is first-order homogeneous nonstationary, the series w_t = y_t − y_{t−1} = ∇y_t is stationary. Here we construct models for those nonstationary series which can be transformed into stationary series by differencing them one or more times. We say that y_t is homogeneous nonstationary of order d if

w_t = ∇^d y_t (2.55)

is a stationary series. If w_t = ∇^d y_t and w_t is an ARMA(p, q) process, then we say that y_t is an integrated autoregressive moving average process of order p, d and q, or simply ARIMA(p, d, q).

2.5.5 Estimation and Specification of ARIMA Models

It is often convenient to describe time lags by using the backward shift operator B. The operator B imposes a one-period time lag each time it is applied to a variable. Thus

B ε_t = ε_{t−1}, B² ε_t = ε_{t−2}, ..., B^n ε_t = ε_{t−n}. (2.56)

Using this operator, we can write an MA(q) process as

y_t = μ + (1 − θ_1 B − θ_2 B² − ... − θ_q B^q) ε_t = μ + θ(B) ε_t. (2.57)

In a similar way, the AR(p) process can be rewritten as

(1 − φ_1 B − φ_2 B² − ... − φ_p B^p) y_t = δ + ε_t, or φ(B) y_t = δ + ε_t. (2.58)

Finally, an ARMA(p, q) process can be re-expressed as

(1 − φ_1 B − φ_2 B² − ... − φ_p B^p) y_t = δ + (1 − θ_1 B − θ_2 B² − ... − θ_q B^q) ε_t. (2.59)

It is easy to show that any homogeneous nonstationary process can be modeled as an ARIMA process. We can write the equation for an ARIMA(p, d, q) process as

φ(B) ∇^d y_t = δ + θ(B) ε_t. (2.60)

If d = 0, we obtain

φ(B) y_t = δ + θ(B) ε_t, (2.61)

which is an ARMA(p, q) process. When q = 0, i.e., when w_t = ∇^d y_t is just AR(p), we call y_t an integrated autoregressive process of order (p, d) and denote it as ARIMA(p, d, 0) or ARI(p, d, 0).
When p = 0, i.e., when w_t is just MA(q), we call y_t an integrated moving average process of order (d, q) and denote it as ARIMA(0, d, q) or IMA(0, d, q). In practice, it is crucial to specify the ARIMA model correctly, i.e., to choose the most appropriate values of p, d and q.

Given a series y_t, the first problem is to determine the degree of homogeneity d. To do this, one first examines the autocorrelation function of the original series y_t and determines whether it is stationary. If it is not, one differences the series and examines the autocorrelation function of ∇y_t. This process is repeated until a value of d is reached such that ∇^d y_t is stationary, i.e., the autocorrelation function goes to 0 as k becomes large. After d is determined, one can work with the stationary series w_t = ∇^d y_t and examine both its autocorrelation and partial autocorrelation functions to determine possible specifications for p and q. Lower-order processes such as AR(1), AR(2), MA(1), MA(2) and ARMA(1, 1) are easy to recognize.

Table 2.1: Specification of ARIMA Models

Model | ACF | PACF
White noise | All zero | All zero
MA(1) | Zero after 1 lag | Declining from 1st lag
MA(2) | Zero after 2 lags | Declining from 2nd lag
MA(q) | Zero after q lags | Declining from qth lag
AR(1) | Geometric decline from 1st lag | Zero after 1 lag
AR(2) | Geometric decline from 2nd lag | Zero after 2 lags
AR(p) | Geometric decline from pth lag | Zero after p lags
ARMA(1,1) | Geometric decline from 1st lag | Declining from 1st lag
ARMA(p,q) | Geometric decline from pth lag | Declining from qth lag

The autoregressive and moving average orders p and q are determined by using the partial autocorrelation function (PACF) and the ACF, respectively. Table 2.1 summarizes the characteristics of the ARIMA(p, 0, q) or ARMA(p, q) model.

2.5.6 Diagnostic Checking

After a time series model has been estimated, one must test whether the specification was correct. We assume that the random errors ε_t in the actual process are normally distributed and independent.
Then, if the model is specified correctly, the residuals ε̂_t should resemble a white noise process. Consequently, the sample autocorrelation function of the residuals,

r̂_k = Σ_{t=k+1}^{T} ε̂_t ε̂_{t−k} / Σ_{t=1}^{T} ε̂_t², (2.62)

should be close to 0 for k > 0. We can use the Box-Pierce test for this purpose. Consider the statistic Q composed of the first K residual autocorrelations,

Q = T Σ_{k=1}^{K} r̂_k², (2.63)

which is distributed as chi-square with K − p − q degrees of freedom.

2.6 Cross Validation in Time Series Models

Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). An excellent review of different types of cross-validation techniques is available in Izenman (2008). Picard and Cook (1984) developed the basic fundamentals of applying cross-validation techniques in regression and time series. According to Montgomery et al. (2008b), three types of procedures are useful for validating a time series model: (i) analysis of the model coefficients and predicted values, including comparisons with prior experience, physical theory, and other analytical models or simulation results; (ii) collection of new data with which to investigate the model's predictive performance; (iii) data splitting, that is, setting aside some of the original data and using these observations to investigate the model's predictive performance. Since we have a large data set, we prefer the data splitting technique for cross-validation of the fitted model.
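The data-splitting scheme preferred here can be sketched in a few lines. A minimal Python illustration with made-up data (the holdback fraction and the series are assumptions for the example; the thesis computations used R, S-Plus and Minitab):

```python
# Sketch of data splitting for cross-validation: hold back the last l
# observations (10-20% of the series), fit the model on the rest, and keep
# the holdback period for checking predictive performance. Illustrative only.

def split_series(y, holdback_frac=0.2):
    """Return (training part, holdback part) of a time series."""
    l = max(1, int(round(len(y) * holdback_frac)))
    return y[:-l], y[-l:]

y = list(range(1, 21))             # 20 illustrative observations
train, holdback = split_series(y)  # fit on the first 16, hold back the last 4
```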
For randomly missing data, three performance indicators, namely the mean absolute error (MAE), the root mean squared error (RMSE) and the estimated bias (EB), are generally considered to examine the accuracy of these imputation methods. In order to select the best method for estimating missing values, the predicted and observed data were compared. The mean absolute error is the average absolute difference between predicted and actual data values, and is given by

MAE = (1/N) Σ_{i=1}^{N} |P_i − O_i|, (2.64)

where N is the number of imputations and P_i and O_i are the imputed and observed data points, respectively. MAE varies from 0 to infinity and a perfect fit is obtained when MAE = 0. The root mean squared error is one of the most commonly used measures and is computed as

RMSE = √[(1/N) Σ_{i=1}^{N} (P_i − O_i)²]. (2.65)

The smaller the RMSE value, the better the performance of the model. The estimated bias is the absolute difference between the observed and the estimated value of the respective parameter, defined as

EB = |O_i − E_i|, (2.66)

where E_i is the estimated value of the parameter obtained from the imputation method. For a regular time series we consider three measures of accuracy of the fitted model, MAPE, MAD, and MSD, for each of the simple forecasting and smoothing methods. For all three measures, the smaller the value, the better the fit of the model; we use these statistics to compare the fits of the different methods. MAPE, or mean absolute percentage error, measures the accuracy of fitted time series values and expresses accuracy as a percentage:

MAPE = [Σ_t |(y_t − ŷ_t)/y_t| / T] × 100, (2.67)

where y_t is the actual value, ŷ_t is the fitted value, and T is the number of observations. MAD, which stands for mean absolute deviation, measures the accuracy of fitted time series values. It expresses accuracy in the same units as the data, which helps conceptualize the amount of error.
MAD = Σ_t |y_t − ŷ_t| / T, (2.68)

where y_t is the actual value, ŷ_t is the fitted value, and T is the number of observations. MSD stands for mean squared deviation. MSD is always computed using the same denominator, T, regardless of the model, so MSD values can be compared across models. MSD is a more sensitive measure of an unusually large forecast error than MAD:

MSD = Σ_t (y_t − ŷ_t)² / T, (2.69)

where y_t is the actual value, ŷ_t is the fitted value, and T is the number of observations.

In order to find the best prediction model we usually leave out, say, l observations as a holdback period. The size of l is usually 10% to 20% of the original data. Suppose that we tentatively select two models, A and B. We fit both models using the first (T − l) observations. Then we compute

MSPE_A = (1/l) Σ_{t=1}^{l} e_{At}² (2.70)

for model A and

MSPE_B = (1/l) Σ_{t=1}^{l} e_{Bt}² (2.71)

for model B. Several methods have been devised to determine whether one MSPE is statistically different from the other. One popular method of testing is the F-test approach, where an F-statistic is constructed as the ratio of the two MSPEs, keeping the larger MSPE in the numerator. If the MSPE for model A is larger, this statistic takes the form

F = MSPE_A / MSPE_B. (2.72)

This statistic follows an F distribution with (l, l) degrees of freedom under the null hypothesis of equal forecasting performance. If the F-test is significant we choose model B for the data; otherwise, we conclude that there is little difference between the two models.

2.7 Computation

We have used a number of modern and sophisticated statistical software packages, such as R, S-Plus and Minitab, for computational purposes.
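The accuracy measures and the MSPE F-ratio of this chapter can be sketched as below; a minimal Python illustration with made-up values (not thesis results; the thesis computations used R, S-Plus and Minitab):

```python
# Sketch of MAE (2.64), RMSE (2.65), MAPE (2.67), MAD (2.68), MSD (2.69) and
# the MSPE F-ratio (2.72). All numbers below are illustrative.
import math

def mae(pred, obs):
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(obs)

def rmse(pred, obs):
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def mape(actual, fitted):
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, fitted)) / len(actual)

def mad(actual, fitted):
    return sum(abs(y - f) for y, f in zip(actual, fitted)) / len(actual)

def msd(actual, fitted):
    return sum((y - f) ** 2 for y, f in zip(actual, fitted)) / len(actual)

def mspe_f(errors_a, errors_b):
    """F = larger MSPE / smaller MSPE, on (l, l) degrees of freedom."""
    ma = sum(e * e for e in errors_a) / len(errors_a)
    mb = sum(e * e for e in errors_b) / len(errors_b)
    return max(ma, mb) / min(ma, mb)

actual = [100.0, 110.0, 120.0, 130.0]
fitted = [ 98.0, 112.0, 118.0, 133.0]
# Holdback forecast errors for two hypothetical candidate models:
F = mspe_f([3.0, -2.0, 4.0, -1.0], [1.0, -1.0, 2.0, -0.5])  # 7.5 / 1.5625 = 4.8
```

The computed F is then compared with the critical value of the F(l, l) distribution, as described above.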
CHAPTER 3

ESTIMATION OF MISSING VALUES BY ROBUST EM AND BOOTSTRAP ALGORITHM FOR SAUDI ARABIA ROAD ACCIDENT DATA

In this chapter our main objective is to estimate the missing values of the Saudi Arabia road accident data, because if any data are missing we cannot fit an ARIMA model to the data. It is now evident that outliers may have an adverse effect on the estimation of missing values [see Mamun (2014)] and also on the determination of the ARIMA model [see Imon et al. (2007)]. So we apply a variety of robust and nonparametric approaches to missing value estimation. We also investigate the accuracy of the estimated missing values by cross-validation.

3.1 Estimation of Missing Values

We observed in Section 1.2 that the total number of road accidents for the entire Saudi Arabia is missing for the year 2003, whereas all regions have missing values for 2003 and 2004. In Section 2.3 we discussed different methods of estimating missing values. Here is the list of methods we apply to estimate the missing values:

Mean Imputation
Median Imputation
Linear Trend Model
Quadratic Trend Model
Exponential Trend Model
Centered Moving Average
EM-OLS
EM-LTS
EM-LMS
EM-M
EM-MM
BOOT-OLS
BOOT-LTS
BOOT-LMS
BOOT-M
BOOT-MM

Results of this experiment are presented in Tables 3.1 to 3.6. Table 3.1 gives some traditional measures of missing value estimation for the year 2003; values of the same for the year 2004 are presented in Table 3.2. Tables 3.3 and 3.4 give EM algorithmic values for all robust time series methods, including OLS, for the years 2003 and 2004 respectively. The same values using the bootstrap technique are presented in Tables 3.5 and 3.6. It is worth mentioning that each bootstrap result is based on 10000 replications.
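The simplest entries in the method list above (mean imputation, median imputation, and the linear trend model) can be sketched as follows, using the observed national totals from Table 3.8 with 2003 and 2004 treated as missing. This is a minimal illustration, not the thesis implementation:

```python
# Sketch of three conventional imputation methods: mean, median, and a
# linear trend y = a + b*t fitted by ordinary least squares. The counts are
# observed national totals from Table 3.8; 2003-2004 are treated as missing.
import statistics

years = [1999, 2000, 2001, 2002, 2005, 2006]
counts = [264326, 267772, 280401, 305649, 293281, 296015]

mean_impute = statistics.mean(counts)      # single value reused for all gaps
median_impute = statistics.median(counts)

# Ordinary least squares fit of the linear trend, then predict 2003.
n = len(years)
tbar = sum(years) / n
ybar = sum(counts) / n
b = sum((t - tbar) * (y - ybar) for t, y in zip(years, counts)) / \
    sum((t - tbar) ** 2 for t in years)
a = ybar - b * tbar
trend_impute_2003 = a + b * 2003
```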
Table 3.1: Estimates of Missing Values for the Year 2003 Using Conventional Methods

Region      Mean    Median  LT        QT        ET        CMA
Riyadh      90629   84313   77997.1   54242.7   74376.6   85350
Mecca       97194   95649   94098.9   89945.5   94004.2   93305.5
E Province  83063   71731   69690.4   62433.5   68330.4   66762
Medina      12227   19972   6698.81   7915.52   2456.14   836.5
Al-Qassim   12503   12901   9908.26   9360.26   9527.53   8158.5
Tabuk       6933    5974    4654.92   2982.60   4373.41   3546.5
Asir        18869   19024   16060.5   16545.9   15411.4   15733.5
Albahih     3025    3208    2287.03   2239.00   2099.98   1917
N Borders   5697    5668    4668.42   4074.99   4573.35   4180.50
Al Jouf     6461    7038    5462.06   6059.46   5202.04   4809.50
Ha’il       5139    5714    4622.64   4837.18   4461.22   4929
Najran      2630    2412    2272.64   1625.21   2233.72   2280.50
Jizan       3715    3257    2207.34   1687.86   1752.07   1216
Total       348085  336861  300629    263949.7  288802.1  293025

Table 3.2: Estimates of Missing Values for the Year 2004 Using Conventional Methods

Region      Mean    Median  LT        QT        ET        CMA
Riyadh      90629   84313   83410.8   51738.2   77821.5   86387
Mecca       97194   95649   95425.6   89887.7   95231.9   97091
E Province  83063   71731   75421.4   65745.6   73032.6   69143
Medina      12227   19972   9067.85   10690.1   3675.92   973
Al-Qassim   12503   12901   11020.1   10289.4   10470.7   8051
Tabuk       6933    5974    5631.33   3401.57   5027.94   4033
Asir        18869   19024   17264.1   17911.3   16558.6   18998
Albahih     3025    3208    2603.26   2539.22   2366.25   2118
N Borders   5697    5668    5109.33   4318.10   4933.70   4187
Al Jouf     6461    7038    5890.32   6686.85   5610.72   5371
Ha’il       5139    5714    4843.79   5129.84   4690.08   6172
Najran      2630    2412    2425.99   1562.74   2341.11   2149
Jizan       3715    3257    2853.58   2160.93   2175.30   976
Total       348085  336861  320967.5  272061.6  303936.3  305649

Table 3.3: Estimates of Missing Values for 2003 Using EM Robust Time Series Methods

Region      OLS     LTS     LMS     M       MM
Riyadh      74393   97492   96314   11537   69292
Mecca       94509   26556   71565   94994   99090
E Province  70581   70443   6989    17501   17856
Medina      6649    7792    10361   6557    6589
Al-Qassim   9994    11090   4503    9924    9939
Tabuk       4557    4532    18551   4521    4438
Asir        16442   18597   2426    16509   16482
Albahih     2406    2470    4107    2380    2408
N Borders   4488    4103    5955    4524    4511
Al Jouf     55805   5860    4094    5544    5558
Ha’il       4764    4077    1941    4661    4657
Najran      2179    2052    2336    2188    2102
Jizan       14186   2385    2000    2108    2115
Total       360953  204337  231142  182948  243037

Table 3.4: Estimates of Missing Values for 2004 Using EM Robust Time Series Methods

Region      OLS     LTS     LMS     M       MM
Riyadh      80153   104387  96506   11542   63168
Mecca       95786   53139   75616   96125   100466
E Province  76280   74089   9772    35005   35714
Medina      4026    10740   11520   9011    9016
Al-Qassim   11100   12249   4856    11030   11045
Tabuk       5546    4869    18811   5404    4921
Asir        17587   18856   2817    17658   17634
Albahih     2714    2848    4114    2708    2754
N Borders   4948    4100    6315    4932    4930
Al Jouf     5994    6237    4372    5947    5960
Ha’il       4967    4355    1830    4885    4878
Najran      2345    1939    2886    2351    2008
Jizan       28355   2836    2600    2616    2620
Total       339801  194366  246015  209214  265114

Table 3.5: Estimates of Missing Values for 2003 Using Bootstrap Robust Time Series Methods

Region      OLS     LTS     LMS     M       MM
Riyadh      73290   97492   97188   18143   67396
Mecca       94470   94872   94911   94685   94687
E Province  70440   76843   67581   17344   17317
Medina      6328    6347    6008    6292    6299
Al-Qassim   9830    11066   10289   9775    9792
Tabuk       3924    4424    4113    3961    4008
Asir        15621   15986   15737   15081   15214
Albahih     2244    2430    2426    2297    22791
N Borders   4727    5213    4623    4821    4805
Al Jouf     5439    5398    5483    5411    5406
Ha’il       4960    4077    4094    4956    9801
Najran      2241    2141    2070    2237    2194
Jizan       14719   2470    2361    2201    2231
Total       308233  328759  316884  187204  261941

Table 3.6: Estimates of Missing Values for 2004 Using Bootstrap Robust Time Series Methods

Region      OLS     LTS     LMS     M       MM
Riyadh      79345   104387  104083  36288   61543
Mecca       95774   95057   94763   96073   96014
E Province  76036   82467   70012   34691   34638
Medina      8738    9395    8916    8715    8750
Al-Qassim   10964   12225   11448   10909   10925
Tabuk       4984    4830    4691    4965    4753
Asir        16882   17408   17147   16420   16549
Albahih     2573    2821    2817    2632    2610
N Borders   5158    5512    4951    5182    5125
Al Jouf     5871    6035    6144    5826    5824
Ha’il       5140    4355    4372    5136    5002
Najran      2399    2014    1930    2395    2159
Jizan       29440   2968    2865    2735    2744
Total       343304  349474  334139  231967  256636

3.2 Evaluation of the Estimation of Missing Values

One of the most interesting features of the Saudi Arabia road accident data is that, although the number of accidents in each individual region is missing for the year 2004, the total for the entire country is known. So we estimate the number of accidents for each region by the different alternative methods and add the regional estimates together to obtain an estimated national total. Once we get this estimate we can compare it with the actual value, which we know exactly. Here we use the mean absolute percentage error (MAPE) for the various alternative methods, and the results are presented in Table 3.7.

Table 3.7: A Comparison Between Exact and Estimated Values of the Total Number of Road Accidents for 2004 in Saudi Arabia

Measures  Estimated Total  MAPE (Exact value: 261874)
Mean      348085           32.92
Median    336861           28.64
LT        320967.5         22.57
QT        272061.6         3.90
ET        303936.3         16
CMA       305649           16.72
EM-OLS    339801           30
EM-LTS    194366           26
EM-LMS    239415           8.58
EM-M      209214           20
EM-MM     265114           1.24
BOOT-OLS  343304           31
BOOT-LTS  349474           33.45
BOOT-LMS  334139           27.60
BOOT-M    231967           11.42
BOOT-MM   256636           2

From the above results it is clear that the MM method estimates the missing values best: the MAPE values of MM are the lowest under both the EM and the bootstrap algorithms. So we take the EM-MM values for the years 2003 and 2004, and the completed data are presented in Tables 3.8 and 3.9. The estimated values are presented in boldface.
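As a concrete illustration, the MAPE comparison of Table 3.7 can be reproduced in a few lines. This is a minimal Python sketch (the thesis computations were done with a statistics package); the method names below simply echo the rows of the table, and only a subset of the estimates is shown.

```python
def mape(exact, estimate):
    """Mean absolute percentage error of a single estimate, in percent."""
    return abs(exact - estimate) / abs(exact) * 100

EXACT_2004 = 261874  # exact national total for 2004, known from the data

# A subset of the estimated national totals from Table 3.7
estimates = {
    "QT": 272061.6,
    "EM-MM": 265114,
    "BOOT-MM": 256636,
}

for name, est in estimates.items():
    # e.g. EM-MM gives a MAPE of about 1.24%, the smallest of all methods
    print(f"{name}: MAPE = {mape(EXACT_2004, est):.2f}%")
```

Running this for all sixteen methods reproduces the ranking in Table 3.7, with EM-MM and BOOT-MM clearly ahead of the conventional estimators.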
Table 3.8: Total Number of Road Accidents in Saudi Arabia after the Missing Value Estimation

Year  Number of accidents
1999  264326
2000  267772
2001  280401
2002  305649
2003  263037
2004  261872
2005  293281
2006  296015
2007  283648
2008  435264
2009  485931
2010  484805

Table 3.9: Total Number of Road Accidents in Different Regions of Saudi Arabia after the Missing Value Estimation

Year  Riyadh  Mecca   E Province  Medina  Qassim  Tabuk  Asir
2000  80067   94669   56182       747     7213    3591   10242
2001  84313   89520   64381       700     8266    3060   12469
2002  86387   97091   69143       973     8051    4033   18998
2003  69292   99090   37501       6589    9939    4438   16482
2004  63168   100466  35714       9016    11045   4921   17634
2005  47803   95649   76435       13422   12901   5043   18230
2006  49019   93987   71731       14471   11687   5974   22108
2007  48142   86854   62033       18988   13370   6061   19024
2008  134915  96204   107780      21334   14768   7503   20556
2009  143466  107932  127460      20956   17645   11634  22929
2010  141549  112844  112420      18448   18622   15500  25263

Year  Albahih  N Borders  Al Jouf  Ha’il  Najran  Jizan
2000  1169     3999       3447     3112   2431    903
2001  1716     4174       4248     3686   2412    1456
2002  2118     4187       5371     6172   2149    976
2003  2408     4511       5558     4657   2102    2115
2004  2754     4930       5960     4878   2008    2620
2005  3293     4034       6889     4502   1823    3257
2006  3515     5668       7259     5835   1710    3051
2007  3208     6805       7170     5274   1963    4756
2008  2843     6160       9446     5884   2631    5240
2009  4245     6799       7284     5714   4215    5652
2010  5117     9449       7038     6069   4340    8146

Next we construct time series plots for all variables considered in this study, with and without the missing value estimates, to get an impression of how closely the EM-MM method estimates the missing values.
Figure 3.1: Time Series Plot of Original and Estimated Road Accidents in Saudi Arabia
Figure 3.2: Time Series Plot of Original and Estimated Road Accidents in Riyadh
Figure 3.3: Time Series Plot of Original and Estimated Road Accidents in Mecca
Figure 3.4: Time Series Plot of Original and Estimated Road Accidents in Eastern Province
Figure 3.5: Time Series Plot of Original and Estimated Road Accidents in Medina
Figure 3.6: Time Series Plot of Original and Estimated Road Accidents in Al-Qassim
Figure 3.7: Time Series Plot of Original and Estimated Road Accidents in Tabuk
Figure 3.8: Time Series Plot of Original and Estimated Road Accidents in Asir
Figure 3.9: Time Series Plot of Original and Estimated Road Accidents in Albahih
Figure 3.10: Time Series Plot of Original and Estimated Road Accidents in Northern Borders
Figure 3.11: Time Series Plot of Original and Estimated Road Accidents in Al Jouf
Figure 3.12: Time Series Plot of Original and Estimated Road Accidents in Ha’il
Figure 3.13: Time Series Plot of Original and Estimated Road Accidents in Najran
Figure 3.14: Time Series Plot of Original and Estimated Road Accidents in Jizan

Figures 3.1-3.14 clearly show that the missing values estimated by EM-MM are very consistent and match the rest of the data closely.

CHAPTER 4

SELECTION OF ARIMA MODELS FOR SAUDI ARABIA ROAD ACCIDENT DATA

In this chapter our main objective is to find appropriate ARIMA models for the total number of car accidents in the whole of Saudi Arabia and in each of its regions. At first we look at the autocorrelation function (ACF) and the partial autocorrelation function (PACF), their corresponding t values, and the Ljung-Box test to suggest an appropriate ARIMA model. But we know that the ACF and PACF can only give an indication; we need to employ formal tests to answer this question.
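The identification tools used throughout this chapter, the sample ACF, the sample PACF, and the Ljung-Box Q statistic, can be sketched in plain Python as follows. This is a minimal sketch, not the statistics package used for the thesis tables, so conventions such as degrees of freedom in the denominators may differ slightly from the reported TSTAT and LBQ columns; the PACF here uses the standard Durbin-Levinson recursion.

```python
def acf(x, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)  # sum of squares about the mean
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
        for k in range(1, max_lag + 1)
    ]

def pacf(x, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(x, max_lag)  # r[k-1] holds rho_k
    phi = [[0.0] * (max_lag + 1) for _ in range(max_lag + 1)]
    pac = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi[1][1] = r[0]
        else:
            num = r[k - 1] - sum(phi[k - 1][j] * r[k - 1 - j] for j in range(1, k))
            den = 1.0 - sum(phi[k - 1][j] * r[j - 1] for j in range(1, k))
            phi[k][k] = num / den
            for j in range(1, k):
                phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
        pac.append(phi[k][k])
    return pac

def ljung_box_q(x, max_lag):
    """Ljung-Box portmanteau statistic Q(m); compare with chi-square(m)."""
    n = len(x)
    r = acf(x, max_lag)
    return n * (n + 2) * sum(r[k] ** 2 / (n - k - 1) for k in range(max_lag))
```

An ACF or PACF value is then judged against its approximate standard error (roughly 1/sqrt(n) for the first lags), which is where the rule of thumb about t values near 2 in the following sections comes from.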
Later we fit a variety of tentative ARIMA models to each of the variables and, by examining the significance of the coefficients of the fitted models, select the most appropriate model for each variable.

4.1 Total Number of Accidents in Saudi Arabia

At first we consider the total number of road accidents in the entire Saudi Arabia. The ACF and PACF values for this data, together with the Ljung-Box t and χ² tests, are given in Figure 4.1 and in Table 4.1.

Figure 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi Arabia (ACF and PACF plots with 5% significance limits)

Table 4.1: The ACF and PACF Values for the Total Number of Road Accidents in Saudi Arabia

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.650100   2.25201    6.4547   0.650100    2.25201
2      0.254603   0.64927    7.5437   -0.291023   -1.00813
3      -0.009077  -0.02238   7.5453   -0.063523   -0.22005
4      -0.030761  -0.07582   7.5651   0.149157    0.51669
5      -0.184256  -0.45395   8.3799   -0.408400   -1.41474
6      -0.221489  -0.53655   9.7536   0.170052    0.58908
7      -0.193680  -0.45832   11.0140  -0.041897   -0.14514
8      -0.164204  -0.38194   12.1464  -0.274840   -0.95207
9      -0.268235  -0.61647   16.1756  -0.070623   -0.24464
10     -0.220151  -0.49066   20.2468  0.121250    0.42002

The above table and figure clearly show an indication of an autoregressive or moving average pattern. Among the ACF values only the first is significant, so MA(1) could be a possible model. However, the first PACF value is also significant, which indicates that AR(1) could also be a possible model. We therefore fit eight different ARIMA models to this data: AR(1), AR(2), AR(3), MA(1), MA(2), MA(3), ARMA(1, 1) and ARIMA(1, 1, 1).
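To illustrate where one such coefficient and its t-statistic come from, the simplest candidate, AR(1), can be fitted by conditional least squares as sketched below. This is a hedged sketch under the conditional least squares assumption: the statistics package behind the tables in this chapter may estimate by maximum likelihood and include a constant differently, so its coefficients and t values can differ somewhat.

```python
import math

def fit_ar1(y):
    """Fit y_t = c + phi*y_{t-1} + e_t by conditional least squares.
    Returns (phi_hat, t_statistic_for_phi)."""
    x = y[:-1]  # lagged series y_{t-1}
    z = y[1:]   # current series y_t
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    phi = sxz / sxx                      # slope estimate
    c = mz - phi * mx                    # intercept estimate
    resid = [b - c - phi * a for a, b in zip(x, z)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)  # residual variance
    se_phi = math.sqrt(s2 / sxx)               # standard error of phi
    return phi, phi / se_phi

# Applied to the completed national series of Table 3.8 (years 1999-2010):
series = [264326, 267772, 280401, 305649, 263037, 261872,
          293281, 296015, 283648, 435264, 485931, 484805]
phi, t = fit_ar1(series)
print(f"phi = {phi:.4f}, t = {t:.2f}")
```

Comparing |t| with a critical value near 2 is the significance check reported in the p-value columns of the tables that follow.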
Results of the above models are presented in Table 4.2.

Table 4.2: Significance of ARIMA Models for the Number of Road Accidents in Saudi Arabia

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0006       21.27        0.000
AR(2)           AR(1)      1.3449       3.92         0.003
                AR(2)      -0.3431      -0.93        0.375
AR(3)           AR(1)      1.3756       3.85         0.004
                AR(2)      -0.5562      -1.05        0.321
                AR(3)      0.1827       0.48         0.644
MA(1)           MA(1)      -0.8547      -2.49        0.032
MA(2)           MA(1)      -1.5529      -4.45        0.001
                MA(2)      -0.6426      -1.36        0.203
MA(3)           MA(1)      -1.6856      -4.85        0.001
                MA(2)      -1.4250      -2.46        0.036
                MA(3)      -0.5650      -1.08        0.309
ARMA(1, 1)      AR(1)      1.0010       15.86        0.000
                MA(1)      -0.2530      -0.78        0.454
ARIMA(1, 1, 1)  AR(1)      -0.3182      -0.22        0.828
                MA(1)      -0.5555      -0.43        0.679

The above results clearly show that AR(1) is significant throughout. Although MA(1) is also significant in some cases, AR(1) is more significant than MA(1), and ARIMA(1, 1, 1) is not a possibility at all. Hence the best model for this data is AR(1).

4.2 Total Number of Accidents in Riyadh

Now we consider the total number of road accidents in Riyadh. The ACF and PACF values for this data, together with the Ljung-Box t and χ² tests, are given in Figure 4.2 and in Table 4.3.
Figure 4.2: The ACF and PACF Values for the Total Number of Road Accidents in Riyadh (ACF and PACF plots with 5% significance limits)

Table 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Riyadh

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.639726   2.12173    5.8523   0.639726    2.12173
2      0.310614   0.76394    7.3852   -0.166967   -0.55377
3      0.034943   0.08171    7.4071   -0.156943   -0.52052
4      -0.174029  -0.40672   8.0258   -0.148767   -0.49341
5      -0.354857  -0.81714   11.0269  -0.220778   -0.73224
6      -0.356595  -0.77542   14.6637  0.017352    0.05755
7      -0.257252  -0.53112   17.0296  0.013195    0.04376
8      -0.162444  -0.32710   18.2875  -0.071982   -0.23874
9      -0.117293  -0.23392   19.2711  -0.124647   -0.41341
10     -0.062812  -0.12465   19.8353  -0.063589   -0.21090

The above table and figure clearly indicate that this data may fit an MA(1) or AR(1) model. For confirmation we fit eight different ARIMA models to this data, and the results are presented in Table 4.4.

Table 4.4: Significance of ARIMA Models for the Number of Road Accidents in Riyadh

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0003       10.00        0.000
AR(2)           AR(1)      1.1851       3.53         0.006
                AR(2)      -0.1841      -0.50        0.630
AR(3)           AR(1)      1.1692       3.28         0.011
                AR(2)      -0.1717      -0.33        0.753
                AR(3)      0.0036       0.01         0.993
MA(1)           MA(1)      -0.854       -2.60        0.029
MA(2)           MA(1)      -1.0942      -4.45        0.001
                MA(2)      -0.7256      -1.72        0.119
MA(3)           MA(1)      -1.2305      -3.52        0.008
                MA(2)      -1.1393      -2.13        0.065
                MA(3)      -0.4821      -0.57        0.583
ARMA(1, 1)      AR(1)      1.0006       8.18         0.000
                MA(1)      -0.0915      -0.26        0.804
ARIMA(1, 1, 1)  AR(1)      0.1078       0.03         0.977
                MA(1)      0.0097       0.00         0.999

The above results clearly show that AR(1) is the most appropriate model for this data.
4.3 Total Number of Accidents in Mecca

Next we consider the total number of road accidents in Mecca; the summary results are given in Figure 4.3 and in Table 4.5.

Figure 4.3: The ACF and PACF Values for the Total Number of Road Accidents in Mecca (ACF and PACF plots with 5% significance limits)

Table 4.5: The ACF and PACF Values for the Total Number of Road Accidents in Mecca

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.518564   1.71988    3.8454   0.518564    1.71988
2      0.137058   0.36656    4.1439   -0.180348   -0.59815
3      -0.060622  -0.16019   4.2096   -0.072091   -0.23910
4      -0.000961  -0.00253   4.2096   0.129911    0.43087
5      -0.107564  -0.28357   4.4853   -0.232716   -0.77183
6      -0.138906  -0.36355   5.0372   0.003594    0.01192
7      -0.281103  -0.72703   7.8621   -0.247915   -0.82224
8      -0.248209  -0.61316   10.7987  -0.031766   -0.10536
9      -0.226961  -0.54244   14.4818  -0.101963   -0.33817
10     -0.091295  -0.21258   15.6737  0.000941    0.00312

Although the ACF and PACF values are not significant here, they are close to the critical t value of 2, and their graphs indicate that this data may fit an AR(1) model.
Table 4.6: Significance of ARIMA Models for the Number of Road Accidents in Mecca

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      0.9996       49.35        0.000
AR(2)           AR(1)      1.7810       4.95         0.001
                AR(2)      -0.7799      -2.15        0.060
AR(3)           AR(1)      1.6151       4.06         0.004
                AR(2)      -0.3786      -0.56        0.592
                AR(3)      -0.2361      -0.49        0.640
MA(1)           MA(1)      -0.8838      -2.88        0.016
MA(2)           MA(1)      -1.4742      -4.39        0.002
                MA(2)      -0.7438      -2.25        0.051
MA(3)           MA(1)      -1.9245      -4.68        0.002
                MA(2)      -1.5421      -2.23        0.056
                MA(3)      -0.4883      -1.01        0.310
ARMA(1, 1)      AR(1)      1.004        38.38        0.000
                MA(1)      -0.2464      -0.75        0.474
ARIMA(1, 1, 1)  AR(1)      0.1159       0.08         0.939
                MA(1)      -0.1570      0.11         0.915

The above results clearly show that although MA(1) and MA(2) can be considered for this data, based on the overall performance AR(1) is the most appropriate model for this data.

4.4 Total Number of Accidents in Eastern Province

Next we consider the total number of road accidents in Eastern Province; the summary results are given in Figure 4.4 and in Table 4.7.

Figure 4.4: The ACF and PACF Values for the Total Number of Road Accidents in E Province (ACF and PACF plots with 5% significance limits)

Table 4.7: The ACF and PACF Values for the Total Number of Road Accidents in E Province

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.705947   2.34136    7.1266   0.705947    2.34136
2      0.344712   0.80909    9.0146   -0.306295   -1.01586
3      0.060024   0.13318    9.0790   -0.091588   -0.30376
4      -0.226373  -0.50147   10.1259  -0.308181   -1.02212
5      -0.360562  -0.78107   13.2243  0.017203    0.05706
6      -0.427266  -0.87815   18.4454  -0.224781   -0.74551
7      -0.292031  -0.56209   21.4942  0.286353    0.94973
8      -0.133929  -0.25068   22.3492  -0.221803   -0.73564
9      -0.116442  -0.21672   23.3187  -0.154299   -0.51175
10     -0.054081  -0.10023   23.7369  -0.046454   -0.15407

Here both the first-order ACF and PACF values are significant; hence the data can be fitted by an AR(1) or MA(1) model.

Table 4.8: Significance of ARIMA Models for the Number of Road Accidents in E Province

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0006       10.43        0.000
AR(2)           AR(1)      1.0645       3.03         0.014
                AR(2)      -0.0620      -0.16        0.880
AR(3)           AR(1)      0.9432       3.17         0.013
                AR(2)      -0.5708      -1.34        0.218
                AR(3)      0.6306       1.91         0.093
MA(1)           MA(1)      -0.8666      -2.62        0.026
MA(2)           MA(1)      -1.5590      -3.54        0.006
                MA(2)      -0.6521      -1.28        0.232
MA(3)           MA(1)      -2.0450      -6.30        0.000
                MA(2)      -1.5474      -2.61        0.031
                MA(3)      -0.4313      -0.97        0.358
ARMA(1, 1)      AR(1)      0.9997       224.27       0.000
                MA(1)      -1.0575      -262.20      0.000
ARIMA(1, 1, 1)  AR(1)      0.5252       0.20         0.849
                MA(1)      0.6924       0.28         0.787

The above results clearly show that AR(1), MA(1) and ARMA(1, 1) can all be considered for this data; based on the overall performance ARMA(1, 1) is the most appropriate model for this data.

4.5 Total Number of Accidents in Medina

Next we consider the total number of road accidents in Medina; the summary results are given in Figure 4.5 and in Table 4.9.
Figure 4.5: The ACF and PACF Values for the Total Number of Road Accidents in Medina (ACF and PACF plots with 5% significance limits)

Table 4.9: The ACF and PACF Values for the Total Number of Road Accidents in Medina

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.788543   2.61530    8.8917   0.788543    2.61530
2      0.493955   1.09373    12.7685  -0.338039   -1.12115
3      0.149332   0.29967    13.1671  -0.314293   -1.04239
4      -0.126859  -0.25252   13.4959  -0.058117   -0.19275
5      -0.383965  -0.75991   17.0096  -0.302652   -1.00378
6      -0.433620  -0.81639   22.3871  0.268468    0.89041
7      -0.382425  -0.67998   27.6155  -0.036391   -0.12069
8      -0.301701  -0.51523   31.9543  -0.268906   -0.89186
9      -0.214677  -0.35807   35.2495  0.034405    0.11411
10     -0.088583  -0.14606   36.3716  -0.014209   -0.04713

Here both the first-order ACF and PACF values are significant; hence the data can be fitted by an AR(1) or MA(1) model.

Table 4.10: Significance of ARIMA Models for the Number of Road Accidents in Medina

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0174       14.55        0.000
AR(2)           AR(1)      1.5854       4.59         0.001
                AR(2)      -0.6039      -1.52        0.164
AR(3)           AR(1)      1.3756       3.85         0.004
                AR(2)      -0.5562      -1.05        0.321
                AR(3)      0.1827       0.48         0.644
MA(1)           MA(1)      -0.8578      -2.52        0.030
MA(2)           MA(1)      -1.4742      -4.39        0.002
                MA(2)      -0.7438      -2.25        0.051
MA(3)           MA(1)      -1.5348      -3.32        0.010
                MA(2)      -1.2944      -1.97        0.085
                MA(3)      -0.5512      -1.20        0.265
ARMA(1, 1)      AR(1)      1.0151       10.91        0.000
                MA(1)      -0.2932      -0.82        0.434
ARIMA(1, 1, 1)  AR(1)      0.7457       1.74         0.120
                MA(1)      0.2806       0.43         0.679

The above results clearly show that AR(1) and MA(1) can be considered for this data; based on the overall performance AR(1) is the most appropriate model for this data.
4.6 Total Number of Accidents in Al-Qassim

Next we consider the total number of road accidents in Al-Qassim; the summary results are given in Figure 4.6 and in Table 4.11.

Figure 4.6: The ACF and PACF Values for the Total Number of Road Accidents in Al-Qassim (ACF and PACF plots with 5% significance limits)

Table 4.11: The ACF and PACF Values for the Total Number of Road Accidents in Al-Qassim

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.778861   2.58319    8.6747   0.778861    2.58319
2      0.554764   1.23677    13.5648  -0.131836   -0.43725
3      0.229719   0.45300    14.5080  -0.405259   -1.34409
4      -0.063780  -0.12349   14.5911  -0.196334   -0.65117
5      -0.285927  -0.55284   16.5396  -0.040538   -0.13445
6      -0.421232  -0.79272   21.6143  -0.051637   -0.17126
7      -0.463226  -0.82585   29.2855  -0.062427   -0.20705
8      -0.405921  -0.68260   37.1396  0.019373    0.06425
9      -0.263172  -0.42492   42.0916  0.095264    0.31595
10     -0.160087  -0.25434   45.7564  -0.200343   -0.66446

Here both the first-order ACF and PACF values are significant; hence the data can be fitted by an AR(1) or MA(1) model.
Table 4.12: Significance of ARIMA Models for the Number of Road Accidents in Al-Qassim

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0021       24.62        0.000
AR(2)           AR(1)      1.3705       3.60         0.014
                AR(2)      -0.3679      -0.74        0.476
AR(3)           AR(1)      1.2183       2.75         0.025
                AR(2)      0.2143       0.40         0.702
                AR(3)      -0.4303      -0.81        0.442
MA(1)           MA(1)      -0.8585      -2.49        0.032
MA(2)           MA(1)      -1.4380      -3.83        0.004
                MA(2)      -0.7188      -1.65        0.133
MA(3)           MA(1)      -1.6287      -3.72        0.006
                MA(2)      -1.4056      -2.39        0.044
                MA(3)      -0.6016      -1.41        0.197
ARMA(1, 1)      AR(1)      1.0025       16.34        0.000
                MA(1)      -0.1784      -0.43        0.678
ARIMA(1, 1, 1)  AR(1)      1.0001       497.56       0.000
                MA(1)      1.0412       61.08        0.000

The above results clearly show that AR(1), MA(1) and ARIMA(1, 1, 1) can all be considered for this data; based on the overall performance ARIMA(1, 1, 1) is the most appropriate model for this data.

4.7 Total Number of Accidents in Tabuk

Next we consider the total number of road accidents in Tabuk; the summary results are given in Figure 4.7 and in Table 4.13.

Figure 4.7: The ACF and PACF Values for the Total Number of Road Accidents in Tabuk (ACF and PACF plots with 5% significance limits)

Table 4.13: The ACF and PACF Values for the Total Number of Road Accidents in Tabuk

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.682488   2.26356    6.6608   0.682488    2.26356
2      0.509195   1.21513    10.7805  0.081250    0.26948
3      0.178187   0.37755    11.3480  -0.370368   -1.22837
4      -0.096803  -0.20250   11.5394  -0.263705   -0.87461
5      -0.248731  -0.51840   13.0139  0.035595    0.11806
6      -0.362890  -0.73850   16.7802  -0.057857   -0.19189
7      -0.419031  -0.81337   23.0575  -0.203492   -0.67491
8      -0.359251  -0.65883   29.2094  0.012469    0.04135
9      -0.236137  -0.41691   33.1962  0.149891    0.49713
10     -0.147029  -0.25558   36.2875  -0.143894   -0.47724

Here both the first-order ACF and PACF values are significant; hence the data can be fitted by an AR(1) or MA(1) model.

Table 4.14: Significance of ARIMA Models for the Number of Road Accidents in Tabuk

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0046       10.43        0.000
AR(2)           AR(1)      1.5999       4.19         0.002
                AR(2)      -0.5976      -1.30        0.224
AR(3)           AR(1)      1.1692       3.28         0.011
                AR(2)      -0.1717      -0.33        0.753
                AR(3)      0.0036       0.01         0.993
MA(1)           MA(1)      -0.8468      -2.29        0.045
MA(2)           MA(1)      -1.4721      -4.18        0.002
                MA(2)      -0.7436      -2.12        0.063
MA(3)           MA(1)      -1.6287      -5.15        0.001
                MA(2)      -1.6140      -2.51        0.036
                MA(3)      -0.5256      -1.28        0.237
ARMA(1, 1)      AR(1)      1.0025       15.86        0.000
                MA(1)      -0.2530      -0.78        0.454
ARIMA(1, 1, 1)  AR(1)      1.0043       1.70         0.128
                MA(1)      0.0234       0.03         0.976

The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data; based on the overall performance AR(1) is the most appropriate model for this data.

4.8 Total Number of Accidents in Asir

Our variable of interest is now the total number of road accidents in Asir. The summary results for this data are given in Figure 4.8 and in Table 4.15.

Figure 4.8: The ACF and PACF Values for the Total Number of Road Accidents in Asir (ACF and PACF plots with 5% significance limits)

Table 4.15: The ACF and PACF Values for the Total Number of Road Accidents in Asir

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.583366   1.93481    4.8665   0.583366    1.93481
2      0.321259   0.82189    6.5064   -0.028888   -0.09581
3      0.217749   0.52573    7.3539   0.063379    0.21020
4      0.012084   0.02847    7.3569   -0.204853   -0.67942
5      -0.132835  -0.31292   7.7774   -0.100465   -0.33321
6      -0.243621  -0.56886   9.4749   -0.157780   -0.52330
7      -0.383432  -0.87010   14.7309  -0.212037   -0.70325
8      -0.337912  -0.71892   20.1737  0.025868    0.08580
9      -0.309691  -0.62994   27.0311  -0.108039   -0.35832
10     -0.226966  -0.44587   34.3976  0.049251    0.16335

Here the ACF and PACF values are not significant, but at the first order they are very close to the critical t value of 2. Hence the data may be fitted by an AR(1) or MA(1) model.

Table 4.16: Significance of ARIMA Models for the Number of Road Accidents in Asir

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0017       19.28        0.000
AR(2)           AR(1)      1.0252       2.84         0.019
                AR(2)      -0.0220      -0.06        0.956
AR(3)           AR(1)      1.2183       2.75         0.025
                AR(2)      0.2143       0.40         0.702
                AR(3)      -0.4303      -0.81        0.442
MA(1)           MA(1)      -0.8729      -2.87        0.017
MA(2)           MA(1)      -1.3698      -4.06        0.003
                MA(2)      -0.7454      -2.21        0.055
MA(3)           MA(1)      -1.2305      -3.52        0.008
                MA(2)      -1.1393      -2.13        0.065
                MA(3)      -0.4821      -0.57        0.583
ARMA(1, 1)      AR(1)      1.0022       17.50        0.000
                MA(1)      0.1057       0.26         0.801
ARIMA(1, 1, 1)  AR(1)      -0.9997      -15.99       0.000
                MA(1)      -0.9857      -1.79        0.111

The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data; based on the overall performance AR(1) is the most appropriate model for this data.

4.9 Total Number of Accidents in Albahih

Now we consider the total number of road accidents in Albahih. The summary results for this data are given in Figure 4.9 and in Table 4.17.
Figure 4.9: The ACF and PACF Values for the Total Number of Road Accidents in Albahih (ACF and PACF plots with 5% significance limits)

Table 4.17: The ACF and PACF Values for the Total Number of Road Accidents in Albahih

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.637186   2.11331    5.8059   0.637186    2.11331
2      0.450798   1.11070    9.0348   0.075409    0.25010
3      0.206319   0.45942    9.7957   -0.181694   -0.60261
4      -0.061150  -0.13363   9.8721   -0.266149   -0.88272
5      -0.143985  -0.31413   10.3662  0.040374    0.13390
6      -0.255075  -0.55156   12.2270  -0.092220   -0.30586
7      -0.422780  -0.88992   18.6171  -0.338786   -1.12363
8      -0.417706  -0.82205   26.9338  -0.059729   -0.19810
9      -0.276939  -0.51434   32.4176  0.285438    0.94669
10     -0.216669  -0.39306   39.1308  -0.109983   -0.36477

Here the first-order ACF and PACF values are significant; hence the data can be fitted by an AR(1) or MA(1) model.

Table 4.18: Significance of ARIMA Models for the Number of Road Accidents in Albahih

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0046       14.74        0.000
AR(2)           AR(1)      1.3183       3.48         0.007
                AR(2)      -0.3128      -0.74        0.481
AR(3)           AR(1)      1.3008       3.44         0.009
                AR(2)      -0.6292      -0.92        0.386
                AR(3)      0.3336       0.56         0.588
MA(1)           MA(1)      -0.8556      -2.42        0.036
MA(2)           MA(1)      -1.4904      -3.98        0.003
                MA(2)      -0.7411      -1.82        0.102
MA(3)           MA(1)      -2.1498      -12.08       0.000
                MA(2)      -1.6801      -2.96        0.018
                MA(3)      -0.4746      -1.11        0.299
ARMA(1, 1)      AR(1)      1.0070       8.19         0.000
                MA(1)      -0.6269      -1.54        0.159
ARIMA(1, 1, 1)  AR(1)      -0.0683      -0.04        0.969
                MA(1)      -0.6911      -0.40        0.697

The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data; based on the overall performance AR(1) is the most appropriate model for this data.
4.10 Total Number of Accidents in Northern Borders

Now we consider the total number of road accidents in Northern Borders. The summary results for this data are given in Figure 4.10 and in Table 4.19.

Figure 4.10: The ACF and PACF Values for the Total Number of Road Accidents in N Borders (ACF and PACF plots with 5% significance limits)

Table 4.19: The ACF and PACF Values for the Total Number of Road Accidents in N Borders

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.478168   1.58591    3.2696   0.478168    1.58591
2      0.518466   1.42444    7.5407   0.375730    1.24615
3      0.204436   0.48006    8.2877   -0.196225   -0.65080
4      -0.084495  -0.19438   8.4336   -0.458333   -1.52012
5      -0.169860  -0.38943   9.1212   -0.085555   -0.28376
6      -0.325861  -0.73699   12.1581  0.057935    0.19215
7      -0.368456  -0.79499   17.0116  -0.107831   -0.35764
8      -0.374139  -0.76452   23.6839  -0.203452   -0.67477
9      -0.200046  -0.38865   26.5453  0.222916    0.73933
10     -0.178213  -0.34157   31.0869  0.049331    0.16361

Here the ACF and PACF values are not significant, but at the first order they are very close to the critical t value of 2. Hence the data may be fitted by an AR(1) or MA(1) model.
Table 4.20: Significance of ARIMA Models for the Number of Road Accidents in N Borders

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0016       15.07        0.000
AR(2)           AR(1)      1.2141       2.36         0.043
                AR(2)      -0.2102      -0.39        0.708
AR(3)           AR(1)      1.1609       2.10         0.069
                AR(2)      -0.4018      -0.40        0.563
                AR(3)      0.2440       0.42         0.687
MA(1)           MA(1)      -0.8556      -2.42        0.036
MA(2)           MA(1)      -1.4944      -4.35        0.002
                MA(2)      -0.7734      -2.25        0.051
MA(3)           MA(1)      -1.8743      -5.40        0.001
                MA(2)      -1.4567      -2.24        0.055
                MA(3)      -0.4397      -0.85        0.419
ARMA(1, 1)      AR(1)      1.0021       9.39         0.000
                MA(1)      -0.6103      -1.94        0.084
ARIMA(1, 1, 1)  AR(1)      -0.2880      -0.38        0.712
                MA(1)      -0.7422      -1.26        0.237

The above results clearly show that AR(1), MA(1) and ARMA(1, 1) can be considered for this data; based on the overall performance we consider AR(1) as the most appropriate model for this data.

4.11 Total Number of Accidents in Al Jouf

Now we consider the total number of road accidents in Al Jouf. The summary results for this data are given in Figure 4.11 and in Table 4.21.

Figure 4.11: The ACF and PACF Values for the Total Number of Road Accidents in Al Jouf (ACF and PACF plots with 5% significance limits)

Table 4.21: The ACF and PACF Values for the Total Number of Road Accidents in Al Jouf

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.591436   1.96157    5.0021   0.591436    1.96157
2      0.264240   0.67224    6.1115   -0.131583   -0.43641
3      0.130765   0.31979    6.4171   0.049680    0.16477
4      -0.062931  -0.15249   6.4980   -0.215148   -0.71356
5      -0.368551  -0.89117   9.7353   -0.367613   -1.21923
6      -0.458838  -1.03713   15.7565  -0.097710   -0.32407
7      -0.204077  -0.42187   17.2454  0.300553    0.99682
8      -0.168385  -0.34259   18.5970  -0.182215   -0.60434
9      -0.151072  -0.30414   20.2288  -0.019734   -0.06545
10     -0.072586  -0.14492   20.9822  -0.192690   -0.63908

Here the ACF and PACF values are not significant, but at the first order they are very close to the critical t value of 2. Hence the data may be fitted by an AR(1) or MA(1) model.

Table 4.22: Significance of ARIMA Models for the Number of Road Accidents in Al Jouf

Model           Component  Coefficient  t-statistic  p-value
AR(1)           AR(1)      1.0006       18.21        0.000
AR(2)           AR(1)      0.9682       2.87         0.018
                AR(2)      0.0364       0.10         0.920
AR(3)           AR(1)      1.1143       3.39         0.010
                AR(2)      0.6334       1.23         0.255
                AR(3)      -0.8294      -1.63        0.141
MA(1)           MA(1)      -0.9124      -2.30        0.044
MA(2)           MA(1)      -1.1197      -2.57        0.030
                MA(2)      -0.7362      -1.67        0.130
MA(3)           MA(1)      -1.2337      -3.16        0.013
                MA(2)      -1.0878      -1.88        0.097
                MA(3)      -0.5747      -0.77        0.466
ARMA(1, 1)      AR(1)      1.0012       17.51        0.000
                MA(1)      0.1864       0.45         0.662
ARIMA(1, 1, 1)  AR(1)      -0.2724      -0.13        0.899
                MA(1)      -0.0548      -0.03        0.981

The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data; based on the overall performance we consider AR(1) as the most appropriate model for this data.

4.12 Total Number of Accidents in Ha’il

Now we consider the total number of road accidents in Ha’il. The summary results for this data are given in Figure 4.12 and in Table 4.23.

Figure 4.12: The ACF and PACF Values for the Total Number of Road Accidents in Ha’il (ACF and PACF plots with 5% significance limits)

Table 4.23: The ACF and PACF Values for the Total Number of Road Accidents in Ha’il

Index  ACF Value  ACF TSTAT  LBQ      PACF Value  PACF TSTAT
1      0.290442   0.963287   1.2063   0.290442    0.963287
2      -0.012848  -0.039417  1.2089   -0.106160   -0.352093
3      0.192035   0.589062   1.8681   0.250735    0.831592
4      0.123485   0.367375   2.1796   -0.021859   -0.072499
5      -0.120277  -0.353521  2.5244   -0.138686   -0.459968
6      -0.182778  -0.531222  3.4799   -0.150781   -0.500083
7      -0.224613  -0.636681  5.2835   -0.213506   -0.708118
8      -0.166265  -0.454827  6.6012   -0.028472   -0.094432
9      -0.226800  -0.609074  10.2790  -0.161838   -0.536755
10     -0.172381  -0.448069  14.5283  0.010759    0.035684

Here neither the ACF nor the PACF values are significant, so the appropriate model could be white noise.

Table 4.24: Significance of ARIMA Models for the Number of Road Accidents in Ha’il

Model           Component       Coefficient  t-statistic  p-value
White Noise     ARIMA(0, 0, 0)  -            0.00         1.00
AR(1)           AR(1)           1.0005       15.14        0.000
AR(2)           AR(1)           1.0645       3.03         0.014
                AR(2)           -0.0620      -0.16        0.880
AR(3)           AR(1)           0.8776       2.37         0.045
                AR(2)           0.3962       0.98         0.354
                AR(3)           -0.2683      -0.67        0.519
MA(1)           MA(1)           -0.8680      -3.27        0.008
MA(2)           MA(1)           -1.1627      -2.75        0.022
                MA(2)           -0.7736      -1.92        0.087
MA(3)           MA(1)           -1.3062      -3.70        0.006
                MA(2)           -1.1474      -2.42        0.042
                MA(3)           -0.3759      -0.97        0.359
ARMA(1, 1)      AR(1)           1.0006       25.29        0.000
                MA(1)           0.5475       1.38         0.200
ARIMA(1, 1, 1)  AR(1)           -0.0473      -0.07        0.942
                MA(1)           0.5186       0.93         0.379

Although the ACF and the PACF gave an indication that the right model should be white noise, this fit is really poor. Based on the above results we can select AR(1), MA(1) or MA(2) for this data; based on the overall performance we consider AR(1) as the most appropriate model for this data.

4.13 Total Number of Accidents in Najran

Now we consider the total number of road accidents in Najran.
The summary results for this data are given in Figure 4.13 and in Table 4.25.

Figure 4.13: The ACF and PACF Values for the Total Number of Road Accidents in Najran (plots with 5% significance limits)

Table 4.25: The ACF and PACF Values for the Total Number of Road Accidents in Najran

                    ACF                             PACF
Index     Values       TSTAT        LBQ       Values       TSTAT
  1      0.760792     2.52326     8.2769     0.760792     2.52326
  2      0.424511     0.95851    11.1402    -0.366322    -1.21495
  3      0.078217     0.16348    11.2496    -0.231333    -0.76724
  4     -0.249643    -0.52051    12.5227    -0.274939    -0.91187
  5     -0.369878    -0.75289    15.7834     0.191436     0.63492
  6     -0.370222    -0.71752    19.7034    -0.078195    -0.25934
  7     -0.336178    -0.62303    23.7437    -0.184142    -0.61073
  8     -0.236194    -0.42306    26.4029    -0.042759    -0.14181
  9     -0.132954    -0.23436    27.6668     0.022027     0.07306
 10     -0.068449    -0.12006    28.3368    -0.079983    -0.26528

Here the ACF and PACF values are significant at lag 1. Hence the model can be fitted by AR(1) or MA(1).
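The PACF column of tables like Table 4.25 can be obtained from the sample autocorrelations by the Durbin-Levinson recursion, and its TSTAT column uses the constant standard error 1/sqrt(n). A pure-Python sketch (ours, for illustration); note that feeding in Najran's lag-1 ACF of 0.760792 with n = 11 reproduces the lag-1 PACF and TSTAT printed in the table.

```python
import math

def pacf_stats(r, n):
    """Durbin-Levinson recursion: partial autocorrelations from the
    sample ACF (r[0] is the lag-1 value), with t-statistics based on
    the constant standard error 1/sqrt(n)."""
    pacf, phi = [], []            # phi holds phi_{k,1..k} for the current order k
    for k in range(1, len(r) + 1):
        if k == 1:
            phi_kk = r[0]
        else:
            num = r[k - 1] - sum(phi[j] * r[k - 2 - j] for j in range(k - 1))
            den = 1.0 - sum(phi[j] * r[j] for j in range(k - 1))
            phi_kk = num / den
        phi = [phi[j] - phi_kk * phi[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf, [p * math.sqrt(n) for p in pacf]

# demo: Najran's lag-1 ACF with n = 11 gives PACF 0.760792,
# TSTAT approximately 2.523, matching Table 4.25 (later lags
# would need the full ACF sequence)
pacf, tstat = pacf_stats([0.760792], 11)
print(pacf, tstat)
```

A |TSTAT| above about 2 at lag 1 only, as seen here, is the classical signature of a first-order process.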
Table 4.26: Significance of ARIMA Models for the Number of Road Accidents in Najran

Model           Component   Coefficient   t-statistic   p-value
AR(1)           AR(1)          1.0005        14.41       0.000
AR(2)           AR(1)          1.6108         5.06       0.001
                AR(2)         -0.6110        -1.74       0.116
AR(3)           AR(1)          1.7939         2.61       0.031
                AR(2)         -1.2905        -0.62       0.554
                AR(3)          0.4979         0.35       0.738
MA(1)           MA(1)         -0.8520        -2.62       0.026
MA(2)           MA(1)         -1.4558        -4.04       0.033
                MA(2)         -0.7145        -1.82       0.101
MA(3)           MA(1)         -2.1198        -9.33       0.000
                MA(2)         -1.6590        -2.67       0.028
                MA(3)         -0.4766        -0.96       0.364
ARMA(1, 1)      AR(1)          1.0005         8.48       0.000
                MA(1)         -0.5178        -1.00       0.345
ARIMA(1, 1, 1)  AR(1)          0.2411         0.10       0.925
                MA(1)         -0.2833        -0.09       0.927

The above results clearly show that AR(1), MA(1) and MA(2) can be considered for this data. Based on the overall performances we consider AR(1) as the most appropriate model for this data.

4.14 Total Number of Road Accidents in Jizan

Now we consider the total number of road accidents in Jizan. The summary results for this data are given in Figure 4.14 and in Table 4.27.

Figure 4.14: The ACF and PACF Values for the Total Number of Road Accidents in Jizan (plots with 5% significance limits)

Table 4.27: The ACF and PACF Values for the Total Number of Road Accidents in Jizan

                    ACF                             PACF
Index     Values       TSTAT        LBQ       Values       TSTAT
  1      0.639136     2.11977     5.8415     0.639136     2.11977
  2      0.569828     1.40205    11.0006     0.272750     0.90461
  3      0.198326     0.41884    11.7037    -0.439926    -1.45907
  4     -0.055681    -0.11576    11.7671    -0.367870    -1.22009
  5     -0.244339    -0.50736    13.1900     0.033945     0.11258
  6     -0.406835    -0.82567    17.9237    -0.048714    -0.16157
  7     -0.394432    -0.75507    23.4855    -0.013338    -0.04424
  8     -0.410516    -0.74804    31.5185    -0.107096    -0.35520
  9     -0.216164    -0.37527    34.8595     0.160052     0.53083
 10     -0.179322    -0.30740    39.4578    -0.127243    -0.42202

Here the ACF and PACF values are significant at lag 1. Hence the model can be fitted by AR(1) or MA(1).

Table 4.28: Significance of ARIMA Models for the Number of Road Accidents in Jizan

Model           Component   Coefficient   t-statistic   p-value
AR(1)           AR(1)          1.0094        10.19       0.000
AR(2)           AR(1)          0.7001         1.19       0.265
                AR(2)          0.3127         0.46       0.659
AR(3)           AR(1)          0.7347         1.17       0.314
                AR(2)          0.5415         0.76       0.467
                AR(3)         -0.2662        -0.32       0.754
MA(1)           MA(1)         -0.8509        -2.34       0.042
MA(2)           MA(1)         -1.4050        -2.84       0.019
                MA(2)         -0.7353        -1.39       0.197
MA(3)           MA(1)         -1.6450        -3.22       0.012
                MA(2)         -1.3749        -1.65       0.136
                MA(3)         -0.5492        -0.94       0.375
ARMA(1, 1)      AR(1)          1.0099         7.33       0.000
                MA(1)         -0.1260        -0.23       0.824
ARIMA(1, 1, 1)  AR(1)          0.9996         9.28       0.000
                MA(1)          0.8134         1.42       0.195

The above results clearly show that AR(1) and MA(1) can be considered for this data. Based on the overall performances we consider AR(1) as the most appropriate model for this data.

4.15 Result Summary

In this section we summarize the results obtained in Sections 4.1 to 4.14; they are presented in the following table.

Table 4.29: Selected Models for Saudi Arabia Road Accident Data

Region               Tentative Models                    Selected Model
Entire Saudi Arabia  AR(1), MA(1)                        AR(1)
Riyadh               AR(1), MA(1)                        AR(1)
Mecca                AR(1), MA(1), MA(2)                 AR(1)
Eastern Province     AR(1), MA(1), ARMA(1, 1)            ARMA(1, 1)
Medina               AR(1), MA(1), ARIMA(1, 1, 1)        ARIMA(1, 1, 1)
Al Qassim            AR(1), MA(1), MA(2)                 AR(1)
Tabuk                AR(1), MA(1), MA(2)                 AR(1)
Asir                 AR(1), MA(1), MA(2)                 AR(1)
Albahih              AR(1), MA(1), MA(2)                 AR(1)
Northern Borders     AR(1), MA(1), ARMA(1, 1)            AR(1)
Al Jouf              AR(1), MA(1), MA(2)                 AR(1)
Ha’il                White Noise, AR(1), MA(1), MA(2)    AR(1)
Najran               AR(1), MA(1), MA(2)                 AR(1)
Jizan                AR(1), MA(1)                        AR(1)

CHAPTER 5

CONCLUSIONS AND DIRECTION OF FUTURE RESEARCH

In this chapter we summarize the findings of our research, draw some conclusions, and outline ideas for our future research.
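For concreteness, the AR(1) model selected for most regions in Table 4.29, y_t = c + phi * y_{t-1} + e_t, can be fitted by conditional least squares and used for recursive forecasting. The sketch below is a toy illustration with synthetic data and names of our choosing, not the thesis's fitted models.

```python
import numpy as np

def fit_ar1(y):
    """Conditional least-squares estimates of c and phi in
    y_t = c + phi * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    c, phi = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return c, phi

def forecast_ar1(c, phi, last, h):
    """Recursive h-step-ahead point forecasts from the last observation."""
    out = []
    for _ in range(h):
        last = c + phi * last
        out.append(last)
    return out

# demo: a noiseless series generated by y_t = 1 + 0.5 * y_{t-1}
c, phi = fit_ar1([0.0, 1.0, 1.5, 1.75, 1.875])
print(c, phi)                          # recovers c = 1, phi = 0.5
print(forecast_ar1(c, phi, 1.875, 2))  # next two values, approx. 1.9375, 1.96875
```

On a real accident series the fitted phi near 1, as in the tables, implies forecasts that track the most recent level closely.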
5.1 Conclusions

My prime objective was to find the most appropriate models for analyzing the total number of road accidents in Saudi Arabia as a whole and in its different regions. Since we have time series data, it is conventional to fit ARIMA models, so the question was which type of ARIMA model best fits the data. Since all of the variables had missing values, it was not possible to fit ARIMA models readily; as a remedy, we estimated the missing observations first. We employed sixteen different methods to estimate the missing values, and based on cross-validation it turned out that the robust MM estimates based on the EM algorithm provided the best set of estimates of the missing values. Since our data are time series, the simple EM algorithm would not be appropriate for them, so time series versions of the EM methods were used. After the estimation of missing values we employed a series of tests, such as the ACF, the PACF, the Ljung-Box test, and t-tests of the ARIMA components, to select the most appropriate ARIMA model(s) for all fourteen variables under study. Twelve of them were best fitted by the AR(1) model, one by the ARMA(1, 1) model, and one by the ARIMA(1, 1, 1) model.

5.2 Direction of Future Research

Because this research had to be completed within a short period of time, it was not possible for us to look at every aspect of the Saudi Arabia road accident data. We intend to look at the major causes of road accidents in Saudi Arabia. Unlike in Western countries, drunk driving is not an issue there, as alcohol is strictly prohibited in Saudi Arabia; other factors such as speeding, road quality, driver qualification, and texting or talking while driving are relevant, but we could not obtain enough information on them. We would like to extend our research in this direction in the future.