UNIVERSITY OF THE AEGEAN
SCHOOL OF SCIENCES
DEPARTMENT OF STATISTICS AND ACTUARIAL-FINANCIAL MATHEMATICS
MSc in STATISTICS AND DATA ANALYSIS

MASTER THESIS
A TIME SERIES ANALYSIS APPROACH TO THE MIGRATION ISSUE – THE AEGEAN ROUTE

EVANGELIDIS KONSTANTINOS
2022, SAMOS

UNIVERSITY OF THE AEGEAN
DEPARTMENT OF STATISTICS AND ACTUARIAL-FINANCIAL MATHEMATICS
POSTGRADUATE PROGRAMME IN STATISTICS AND ACTUARIAL-FINANCIAL MATHEMATICS
TRACK: STATISTICS & DATA ANALYSIS

MASTER THESIS
MIGRATION AND REFUGEE FLOWS IN THE NORTHEASTERN AEGEAN – A STATISTICAL TIME SERIES APPROACH

EVANGELIDIS KONSTANTINOS
2022, SAMOS

Members of the three-member committee
Alexandros Karagrigoriou (Supervisor)
Christos Koutzakis (Χρήστος Κουτζάκης)
Athanasios Rakitzis (Αθανάσιος Ρακιτζής)

To Narges, Sheila, Mohammad, and all the dear students I have met during this journey. If it wasn't for you, I would never be where I stand today.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my deepest gratitude to my supervisor, Prof. Alexandros Karagrigoriou of the Department of Statistics and Actuarial-Financial Mathematics of the University of the Aegean, for his support, thorough guidance and patience, and of course for being an excellent teacher throughout my studies in the Department of Statistics. I am also grateful to Dr. Emmanouil-Nektarios Kalligeris, whose communication with me throughout the writing was particularly constructive and helpful. Furthermore, words cannot express my appreciation to everyone who stood by me, offering moral support and encouragement in the daily difficulties of my studies. Seemingly small help from dear friends is often what determines success or failure in anything we undertake that requires personal sacrifice and effort. Last but not least, it would be remiss not to mention my mother. She has always been by my side since my childhood, no matter how difficult I made it for her along the way.
ABSTRACT

Prediction problems are among the most exciting fields in the sciences. Modeling through stochastic processes is often used for forecasting purposes, particularly in finance. Because stochastic processes produce time series of data, the study of time series is particularly useful when we want to predict quantities that change over time and involve randomness. Whenever we are able to make quantitative measurements of a phenomenon as it evolves over time, we can apply time series analysis methods, so the range of scientific applications is practically unlimited. In this thesis, we follow the Box-Jenkins approach to time series analysis and forecasting in an attempt to forecast a social phenomenon: the flows of refugees and migrants through the islands of the East Aegean Sea. We use a series of time-indexed data recorded from 01/2014 to 01/2022, taken from the database of the United Nations High Commissioner for Refugees (UNHCR). During this period the refugee issue became a major one from a social and political point of view. First the theoretical framework is set, then the Box-Jenkins method is presented, and finally we proceed with the analysis of the data. The goal is to see whether the chosen method is suitable for making forecasts of this specific phenomenon.

SUMMARY (ΠΕΡΙΛΗΨΗ)

Forecasting problems are among the most exciting fields in the sciences. Modeling phenomena through stochastic processes is often used for forecasting purposes, especially in finance. The fact that stochastic processes produce time series of data makes the study of time series particularly useful when we want to predict quantities that change over time and the process also involves randomness. Since we are able to make quantitative measurements of a phenomenon as it evolves over time, we can apply time series analysis methods to a truly unlimited range of scientific applications. In this thesis, we follow the Box-Jenkins approach to time series analysis and forecasting in an attempt to forecast a social phenomenon, the flows of refugees and migrants through the islands of the Northeastern Aegean. We use a set of time-indexed data for the period 01/2014-01/2022 from the database of the United Nations High Commissioner for Refugees (UNHCR), a period during which the refugee issue became crucial from a social and political point of view. First the theoretical framework is introduced, then the Box-Jenkins method of analysis is presented, and then we apply this method to the data. The aim is to determine to what extent this method is suitable for making forecasts about this particular phenomenon.
CONTENTS

1 Introduction
2 The refugee issue and its socio-economic dimension
  2.1 Global overview
  2.2 The situation in the European Union (EU)
  2.3 The eastern passage through the Aegean Sea
3 The Box-Jenkins methodology
  3.1 Introduction to time series analysis
  3.2 Stationarity
  3.3 Components of a non-stationary time series
  3.4 Stochastic processes
  3.5 The Box-Jenkins approach
4 Application of the Box-Jenkins method
  4.1 Pre growth season (01/2014 – 02/2015)
    4.1.1 Data plots
    4.1.2 Fitting polynomials to the Pre growth season data
    4.1.3 Box-Jenkins method for fitting (S)ARIMA models
    4.1.4 The Auto ARIMA
  4.2 Growth season (03/2015 – 04/2016)
    4.2.1 Data plots
    4.2.2 Fitting polynomials to the Growth season data
    4.2.3 Box-Jenkins method for fitting (S)ARIMA models
    4.2.4 The Auto ARIMA
  4.3 Post growth season (05/2016 – 01/2022)
    4.3.1 Data plots
    4.3.2 Fitting polynomials to the Post growth season data
    4.3.3 Box-Jenkins method for fitting (S)ARIMA models
    4.3.4 The Auto ARIMA
    4.3.5 Conclusion
Bibliography
Web References
Appendix

LIST OF FIGURES

2.1 Number of refugees globally per year, 2016-2021
2.2 People forced to flee worldwide per year (2012-2022)
2.3 The number of international migrants, 1990-2020
2.4 Top five countries of origin, 2005-2020
2.5 Migrant workers by destination country income level
2.6 Percentage of refugee population as to the average income of country host
2.7 Territorial attractiveness
2.8 Unemployment rate
2.9 The sea routes
2.10 The Aegean Sea arrivals per island, Jan. 2015-Sept. 2015
2.11 The total time series for the recorded arrivals in the Greek islands of East Aegean Sea, 01/2014 – 01/2022
3.1 Strikes in the USA, 1951-1980
3.2 Population of USA at 10-year intervals, 1790-1990
3.3 The monthly accidental deaths data, 1973-1978
4.1 The time series of Pre growth season, 01/2014-02/2015
4.2 The sample ACF plot of the Pre growth time series
4.3 The sample ACF values
4.4 The sample PACF plot for the Pre growth time series
4.5 The sample PACF values
4.6 Fitted 3rd degree polynomial summary
4.7 3rd degree polynomial fit on the Pre growth data plot
4.8 Fitted 4th degree polynomial summary
4.9 4th degree polynomial fit on the Pre growth data plot
4.10 Fitted 5th degree polynomial summary
4.11 5th degree polynomial fit on the Pre growth data plot
4.12 The first difference of the Pre growth series
4.13 The second difference of the Pre growth series
4.14 The third difference of the Pre growth series
4.15 The ACF of the first difference of the Pre growth series
4.16 The PACF of the first difference of the Pre growth series
4.17 ARIMA(0,1,0) fit summary on Pre growth
4.18 ARIMA(0,1,1) fit summary on Pre growth
4.19 ARIMA(1,1,1) fit summary on Pre growth
4.20 Forecast summary of ARIMA(0,1,0) for the Pre growth season
4.21 Forecast graph of ARIMA(0,1,0) for the Pre growth season
4.22 Forecast summary of ARIMA(0,1,1) for the Pre growth season
4.23 Forecast graph of ARIMA(0,1,1) for the Pre growth season
4.24 Auto ARIMA fit summary on Pre growth
4.25 The common plot of the two optimal ARIMA models with the data of Pre growth season
4.26 The time series plot for the Growth season, 03/2015-04/2016
4.27 The sample ACF plot for the Growth season time series
4.28 The sample ACF values
4.29 The sample PACF plot for the Growth season
4.30 The sample PACF values
4.31 2nd degree polynomial fitted on Growth season
4.32 2nd degree polynomial fit on Growth season data plot
4.33 3rd degree polynomial fitted on Growth season
4.34 3rd degree polynomial fit on Growth season data plot
4.35 4th degree polynomial fitted on Growth season
4.36 4th degree polynomial fit on Growth season data plot
4.37 The first difference of the Growth season data
4.38 The ACF plot for the differenced Growth season data
4.39 The PACF plot for the differenced Growth season data
4.40 ARIMA(0,1,0) fit summary on Growth season
4.41 ARIMA(0,1,1) fit summary on Growth season
4.42 ARIMA(1,1,1) fit summary on Growth season
4.43 ARIMA(1,1,0) fit summary on Growth season
4.44 ARIMA(1,1,0) forecast summary on Growth season
4.45 ARIMA(1,1,0) forecast plot on Growth season
4.46 The Auto ARIMA fit summary on Growth season
4.47 The Auto ARIMA forecast summary on Growth season
4.48 The Auto ARIMA forecast plot on Growth season
4.49 The common plot of the two optimal ARIMA models with the data line on Growth season
4.50 The Post growth time series plot
4.51 The sample ACF plot of the Post growth data
4.52 The Post growth sample ACF values
4.53 The Post growth sample PACF plot
4.54 The sample PACF values
4.55 2nd degree polynomial fitted on the Post growth data
4.56 2nd degree polynomial fit on the Post growth data plot
4.57 3rd degree polynomial fitted on Post growth data
4.58 3rd degree polynomial fit on the Post growth plot
4.59 4th degree polynomial fitted on the Post growth data
4.60 4th degree polynomial fit on the Post growth plot
4.61 The first difference of the Post growth data
4.62 The ACF plot for the differenced data of Post growth season
4.63 The PACF plot for the differenced data of Post growth season
4.64 ARIMA(1,1,1) fit summary on Post growth
4.65 ARIMA(1,1,2) fit summary on Post growth
4.66 ARIMA(2,1,1) fit summary on Post growth
4.67 ARIMA(2,1,2) fit summary on Post growth
4.68 ARIMA(1,1,0) fit summary on Post growth
4.69 ARIMA(1,1,0) forecast summary for the Post growth
4.70 ARIMA(1,1,0) forecast plot on Post growth season
4.71 ARIMA(2,1,1) forecast summary for the Post growth
4.72 ARIMA(2,1,1) forecast plot for the Post growth season
4.73 Auto ARIMA fit summary on Post growth
4.74 Auto ARIMA forecast summary for the Post growth season
4.75 Auto ARIMA forecast plot for the Post growth season
4.76 Common plot of the three optimal ARIMA models with the data line

LIST OF TABLES

Table 1 Summary of polynomials fitted on Pre growth
Table 2 Summary of the ARIMA models fitted on Pre growth
Table 3 Summary of all models fitted on Pre growth
Table 4 Summary of forecasting errors for the ARIMA models on Pre growth
Table 5 Summary of polynomials fitted on Growth
Table 6 Summary of the ARIMA models fitted on Growth
Table 7 Summary of all models fitted on Growth
Table 8 Summary of forecasting errors for the ARIMA models on Growth
Table 9 Summary of polynomials fitted on Post growth
Table 10 Summary of the ARIMA models fitted on Post growth
Table 11 Summary of all models fitted on Post growth
Table 12 Summary of forecasting errors for the ARIMA models on Post growth

Chapter 1
Introduction

Time series analysis is a way of understanding the mechanism that generates a set of time-indexed data, finding an appropriate model to represent this mechanism, and using this model for future predictions. Several questions arise when it comes to forecasting. Can a given quantity be forecast adequately, or is it inherently unpredictable? Is the selected approach suitable for this forecast?
Do we have enough data to produce an accurate prediction of future values? Is every observed phenomenon suitable for the usual methods of analysis and forecasting? Time series analysis assumes that the way in which a system changes will remain the same in the future. In the context of this work, the question is whether we can forecast the social activity of large groups of human beings, where the complexity of the phenomenon is determined by countless factors: political, economic, personal, global or local.

This thesis focuses on a method of time series analysis and forecasting based on the Box-Jenkins (B-J) approach. This approach consists of three stages: identification, estimation and diagnostic checking. To apply the method, we divide the time series of arrivals on the islands into three seasons, based on the initial plot of the total series. The seasons are named Pre Growth, Growth and Post Growth respectively. The reason for this split is the rapid growth in the observed values that begins in 03/2015; the values appear to return to the previous level (the post growth level) in 05/2016. The purpose was to have a lower degree of variation within each analyzed series, on the assumption that this would lead to more appropriate models for each season.

The outline of this thesis is the following. In Chapter 2, a history of the refugee crisis over the period analyzed later is presented, based mainly on statistical figures. In Chapter 3, we present fundamental theoretical terms that are commonly used in time series analysis and give a brief description of the Box-Jenkins methodology. Finally, in Chapter 4, we proceed to the analysis of the time series of arrivals on the Greek islands.
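
The split into three seasons described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `month_range` and `split_seasons` are hypothetical, the monthly values are placeholder zeros rather than the UNHCR figures, and the boundaries are the ones quoted in the text (growth from 03/2015, post growth from 05/2016).

```python
from datetime import date

# Season boundaries as stated in the text: rapid growth starts in 03/2015
# and the series returns to its previous level in 05/2016.
GROWTH_START = date(2015, 3, 1)
POST_GROWTH_START = date(2016, 5, 1)


def month_range(start, end):
    """Yield the first day of every month from start to end (inclusive)."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month == 13:
            year, month = year + 1, 1


def split_seasons(series):
    """Split a {date: value} mapping into the three seasons of the thesis."""
    seasons = {"pre_growth": {}, "growth": {}, "post_growth": {}}
    for day, value in sorted(series.items()):
        if day < GROWTH_START:
            seasons["pre_growth"][day] = value
        elif day < POST_GROWTH_START:
            seasons["growth"][day] = value
        else:
            seasons["post_growth"][day] = value
    return seasons


# Placeholder data (NOT the UNHCR arrivals): one dummy value per month
# over the full observation window 01/2014 - 01/2022.
months = list(month_range(date(2014, 1, 1), date(2022, 1, 1)))
dummy_series = {m: 0 for m in months}

seasons = split_seasons(dummy_series)
season_lengths = {name: len(part) for name, part in seasons.items()}
print(season_lengths)
```

With monthly data over this window, the Pre Growth and Growth seasons contain 14 observations each and the Post Growth season contains 69, so the first two seasons are short series and any model fitted to them rests on few data points.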
Chapter 2
The refugee issue and its socio-economic dimension

2.1 Global overview

The United Nations High Commissioner for Refugees (UNHCR) announced on May 23, 2022 that the number of people forced to flee due to persecution, conflict, violence and human rights violations had exceeded 100 million for the first time in history, a negative record that would have been unthinkable a few decades ago [1]. This number represents 1 in every 78 people of the global population and includes refugees, asylum seekers and the 53.2 million people who have been forced to move within their country's borders because of conflicts. These numbers are the result of new or protracted conflicts in countries including, among others, Ethiopia, Burkina Faso, Myanmar, Nigeria, Afghanistan and the Democratic Republic of the Congo. More recently, the war in Ukraine has forced more than 6 million people to leave the country and 8 million people to be internally displaced, that is, within the country's borders, according to data recorded by UNHCR [1].

Figure 2.1 shows the numbers recorded by UNHCR for each year of the period 2016-2021. Every column shows an increasing trend, suggesting that from 2016, a year in which the refugee flows to Europe reached overwhelming figures, until 2021 the global situation in terms of peace, safety, human rights and living conditions worsened each year. Figure 2.2 visualizes the data for the decade 2012-2022, which show a noticeable upward trend.

Also, according to the International Organization for Migration (IOM), the United Nations migration agency, the total number of international migrants in the world is estimated at 281 million, or 3.6% of the global population [4].
This means that the number of people living in a country other than their country of birth is 128 million higher than in 1990 and over three times the estimated number in 1970, as stated in the organization's World Migration Report 2022 [4].

Figure 2.1: Number of refugees globally per year, 2016-2021 (UNHCR data finder, [2])

Figure 2.2: People forced to flee worldwide per year, 2012-2022 (UNHCR Global trends, [3])

Furthermore, the number of international migrants has increased at a greater rate in Asia and Europe than in other regions, as depicted in figure 2.3. IOM notes a wide variation between countries in the number of international migrants living in them. In the United Arab Emirates, for example, 88% of the population are migrants from other countries. An interesting fact is that although mobility between countries was reduced in 2020 due to Covid-19 pandemic restrictions, the number of internally displaced people increased during 2020, reaching 55 million globally, whereas the same figure was 51 million for 2019 [5]. During the presentation of IOM's World Migration Report 2022, the organization's Director General Antonio Vitorino said: «We are witnessing a paradox not seen before in human history. While billions of people have been effectively grounded by COVID-19, tens of millions of others have been displaced within their own countries.».

The causes of this impressive increase in the number of refugees, migrants, asylum seekers and internally displaced people during the last decades, and especially after 1990, can be found in a complex of economic, geopolitical and environmental facts that affect the lives of literally every single person in today's globalized world. The IOM World Migration Report 2022 states that: «Increased competition between States is resulting in heightened geopolitical tension and risking the erosion of multilateral cooperation.
Economic, political and military power has radically shifted in the last two decades, with power now more evenly distributed in the international system. As a result, there is rising geopolitical competition, especially among global powers, often played out via proxies. The environment of intensifying competition between key States– and involving a larger number of States– is undermining international cooperation through multilateral mechanisms, such as those of the United Nations» [4].

This is an interesting statement, considering that it comes from the United Nations (UN) agency for migration. At the same time, the five permanent members of the UN Security Council are countries that play a major role in international economic and geopolitical competition, and they have had a continuous presence in conflicts around the globe since the establishment of the UN in 1945, either through political influence and diplomacy or by military means, in some cases through the intergovernmental alliance of the North Atlantic Treaty Organization (NATO).

Figure 2.3: The number of international migrants, 1990-2020 (IOM World Migration Report 2020, [4])

The rapidly changing environmental conditions are another factor that can lead people to flee to other countries in search of better living conditions. These changes are related to human activity: the absence of planned development of the system of material production and consumption leads to industrial overproduction, toxic waste pollution and uncontrolled energy consumption. As stated in IOM's 2022 report: «The intensification of ecologically negative human activity is resulting in overconsumption and overproduction linked to unsustainable economic growth, resource depletion and biodiversity collapse, as well as ongoing climate change.
Broadly grouped under the heading of “human supremacy”, there is growing recognition of the extremely negative consequences of human activities that are not preserving the planet’s ecological systems. (…) The implications for migration and displacement are significant, as people increasingly turn to internal and international migration as a means of adaptation to environmental impacts, or face displacement from their homes and communities due to slow-onset impacts of climate change.» [4]

Figure 2.4: Top five countries of origin, 2005-2020 (IOM World Migration Report 2022, [6])

Figure 2.4 depicts the number of refugees from the top five countries of origin during the years 2005-2020. A rapid upward trend in the number of Syrian refugees, beginning in 2011, brings Syria to the top of the list, whereas the number of refugees from Afghanistan remains almost constant each year, placing this Central Asian country in second position. These flows were evidently generated by the war in Syria and by the conflicts in Afghanistan, which have been ongoing for decades.

IOM also estimates [7] that there were nearly 169 million migrant workers around the world in 2019, which is 62% of the total number of international migrants in 2019 (272 million). Of these workers, 67% were living in high-income countries, 29% in middle-income countries and 3.6% in low-income countries (figure 2.5). The numbers indicate the expected fact that, regardless of the reason for displacement, refugees and immigrants prefer countries with better income, hoping that this will provide them with a higher standard of living.

Figure 2.5: Migrant workers by destination country income level (IOM World Migration Report 2022, [7])

At the same time, according to UNHCR data, 83% of refugees are hosted in low- and middle-income countries (figure 2.6).
This contrast reflects the distinction between a person who is a refugee or an asylum seeker and a migrant. According to UNHCR: «Migrants choose to move not because of a direct threat of persecution or death, but mainly to improve their lives by finding work, or in some cases for education, family reunion, or other reasons. Unlike refugees who cannot safely return home, migrants face no such impediment to return. If they choose to return home, they will continue to receive the protection of their government.». This is a point that is often neglected or misunderstood when the issue is discussed.

Figure 2.6: Percentage of refugee population as to the average income of country host (UNHCR Figures at a glance, [8])

In conclusion, migration and the refugee issue is a phenomenon with multiple causes, including political, economic and environmental factors. It is an international issue with multiple social and economic dimensions that affects the local economies of the destination countries by increasing the available labour force, and it should be addressed with a clear distinction between a person who is a refugee or a seeker of international protection and a migrant.

2.2 The situation in the European Union (EU)

In recent years, Europe has seen the largest flow of migrants and refugees from countries outside the EU since the end of World War II. These flows peaked in 2015 and 2016 and fell significantly after 2017. The countries of the European Union enjoy relative economic and political stability compared to African, Asian or Middle Eastern countries, which makes them an attractive destination for people who want to find better living conditions or who seek international protection. This contrast is itself closely related to more than a century of historical interaction between these territories and European countries through colonialism and its consequences.
The most attractive regions, as we might expect, are in destination countries like Germany or Austria. Greece and Italy are in the middle of the scale, and the least attractive regions are found in Romania, Serbia and Montenegro [9], as seen in figure 2.7. This supports the general assumptions about the factors that attract refugees and migrants to specific areas. We have to note that, regardless of the expectations a person may have, unemployment rates are significantly higher for asylum seekers residing in EU countries than for the native population or for migrants with a university degree who have moved to the same country. The unemployment rates for refugees [9] differ between states, with Great Britain having a rate of 15% and Spain a rate of over 50% (figure 2.8).

The routes through which refugees attempt to reach their EU destination country are highly dangerous. Crossing into Europe can be done by land or by sea. The main border crossing routes towards EU territory are the Eastern Mediterranean Route, the Western Balkan Route and the Central Mediterranean Route. The central crossing of the Mediterranean Sea (figure 2.9) has proven to be extremely dangerous, and on many occasions boats have capsized and the passengers did not survive. In October 2013, a boat carrying hundreds of refugees from Libya to Italy sank near the island of Lampedusa, killing 368 refugees. Italy launched a large-scale sea rescue operation named Mare Nostrum. UNHCR reports (2015) that: «During the first four months of 2015, the numbers of those dying at sea reached horrifying new heights. Between January and March, 479 refugees and migrants drowned or went missing, as opposed to 15 during the first three months of the year before. In April the situation took an even more terrible turn. In a number of concurrent wrecks, an unprecedented 1,308 refugees and migrants drowned or went missing in a single month (compared to 42 in April 2014), sparking a global outcry.» [12].

Figure 2.7: Territorial attractiveness (ESPON 2018, [10])

Figure 2.8: Unemployment rate (U), Long Term (LTU), Very Long Term (VLTU) (ESPON 2018, [9])

European states held meetings shortly after that incident and decided to raise the funding of Frontex, the EU's border and coast guard agency, while member states offered to deploy naval vessels for patrols and better coverage of the sea routes. During May and June 2015, the number of people drowned or missing at sea fell to 68 and 12 respectively as a result of these operations [12].

Figure 2.9: The sea routes (UNHCR 2015, [11])

2.3 The Eastern passage through the Aegean Sea

Greece is one of the main gateways to Europe, along with Italy and Spain in the Mediterranean region. Refugees and immigrants arrive in Greece both through its land border with Turkey in the north (Evros) and through the Greek-Turkish maritime borders in the Aegean, which is the route mainly used. During 2015-2016, the peak years of arrivals for Greece as well as for Europe, more than 1 million refugees arrived by sea on the islands and there were more than 6000 arrivals through the land borders. For the period 01.01.2016-21.11.2016 there were 49792 sea rescues and 765 arrests of smugglers transporting refugees by boat, and 108 people lost their lives at sea, whereas in 2015 the number of people who drowned was 272 [13]. According to UNHCR reports, during the first six months of 2015, 68000 refugees arrived on the islands of Lesvos, Chios, Samos and Kos, and Greece overtook Italy, which had ranked first in arrivals during 2014 and received 67500 in the same period of 2015. A change in the profile of the people arriving as refugees was also noted.
The main countries of origin of people arriving in Italy were Eritrea (25%), Nigeria (10%) and Somalia (10%), followed by Syria (7%) and Gambia (6%). The main countries of origin of refugees and migrants arriving in Greece were Syria (57%), followed by Afghanistan (22%) and Iraq (5%) [14].

The 2016 agreement between the EU and Turkey had significant consequences for the management of this unprecedented refugee crisis by the Greek state, and consequently for the people who had arrived on the Greek islands. From March 2016, when it first took effect, the agreement kept the vast majority of the refugees on the Greek islands, where the Greek state did not have the infrastructure and services to address the basic needs of the population. As reported in UNHCR's factsheet of May 2017: «The Aegean islands have been at the forefront of the 2015/2016 European Refugee Emergency with over 1 million people arriving in total, the vast majority from refugee producing countries. Before 20 March 2016, the population was transient, with arrivals remaining on the islands for a limited time, sometimes hours or a few days, before continuing their journey. The situation changed after the closure of the so-called ‘Balkans route’ and the implementation of the Joint EU-Turkey Statement of 18 March 2016. Arrivals decreased significantly, the length of stay on the island increased, and the needs of the refugee and migrant populations on the islands changed, especially for people and families with specific needs.» [15]

These facts put tremendous strain on the local island communities. The reception conditions in the so-called hotspots were insufficient, and the situation kept getting worse as thousands of people accumulated in small towns, unable to move to the mainland or elsewhere. The same UNHCR factsheet of 2017 reports that: «…. challenges with overcrowding and insecurity remain, and substandard conditions must still be improved in some locations, notably on Chios due to recent overcrowding. Protection risks for people staying on the islands continue, particularly the risk of sexual and gender-based violence. Children, including unaccompanied children, remain in inadequate shelter with insufficient access to formal or non-formal education, which also severely impacts their psychosocial wellbeing.» [15]

The situation gradually de-escalated, partly because specific measures for relocating the population were taken by the Greek Government and other EU Governments, and partly because the armed conflicts in countries like Afghanistan and Syria stopped or decreased. Even so, the Aegean islands had 29718 arrivals in 2017 and 32494 in 2018, with a flare-up in 2019 of 59726 arrivals [16], followed by a rapid decline in 2020 and 2021, when the number of arrivals was 9714 and 4331 respectively [17].

Among the islands of the East Aegean, Lesvos received the highest number of arrivals. In December 2015, a UNHCR factsheet reported that up to that time 59% of total arrivals by sea in Greece had passed through Lesvos. The total arrivals from January to 24 December 2015 were 487964 people. The average daily arrivals during the last 7 days were estimated at 1968 per day. The total arrivals during December were 47243 people [18]. From January to November 2017, Lesvos had 42% of total arrivals in Greece by sea: 11570 asylum-seekers and migrants were recorded reaching the island, and the total number of sea arrivals in Greece during that period was 27354 [19]. Between January and November 2018, 47% of the total arrivals in Greece by sea were in Lesvos (13945 people), and the total number of sea arrivals in Greece was 29567 [20].
For 2019, until December, the arrivals in Lesvos were 40% of the total arrivals (23861 people) [21]. The majority of the asylum-seekers and migrants arriving on the Greek islands of the East Aegean Sea over the whole period 2015-2021 were from Syria, Afghanistan, Iraq and the Democratic Republic of the Congo. Typically these nationalities arrive in family groups, although a large number of unaccompanied minors, mainly from Afghanistan, was recorded in the Reception and Identification Centres of the Greek State [UNHCR, factsheets 2017, 2018].

Figure 2.10: The Aegean Sea arrivals per island, Jan. 2015-Sept. 2015 [22]

In this work we attempt a time series analysis to model the number of migrants arriving in Greece (through all the islands and the Evros border) for the period 01/2014 – 01/2022. The entire time series, which will be analyzed in Chapter 4 with the Box-Jenkins methodology presented in Chapter 3, is depicted in figure 2.11. (Data collected from the Operational Data Portal of UNHCR for the Mediterranean Situation.)

Figure 2.11: The time series for the recorded arrivals in the Greek islands of East Aegean Sea, 01/2014 – 01/2022.

Chapter 3
The Box-Jenkins methodology

3.1 Introduction to time series analysis

A time series is a set of observations $\{x_1, x_2, x_3, \dots, x_n\}$ recorded sequentially over time. We suppose that each observation is a realized value of a specific random variable $X_t$. Therefore, we may consider the time series to be the realized values of a sequence $\{X_t\}$ of random variables indexed by time $t$, where $x_1$ is the observed value at time point 1, $x_2$ is the observed value at time point 2, and so on. In general, we refer to a collection $\{X_t\}$ of random variables indexed by time $t$ as a stochastic process. Hence the observed time series can be considered as a realization of a specific stochastic process.
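
The distinction between a stochastic process and one of its realizations can be illustrated with a small simulation. The sketch below is not part of the thesis analysis: it assumes, purely for illustration, the process $X_t = X_{t-1} + e_t$ with $X_0 = 0$ and independent standard normal shocks $e_t$, and the function name `simulate_random_walk` is hypothetical. Two different seeds give two different realizations $\{x_1, \dots, x_n\}$ of the same process.

```python
import random


def simulate_random_walk(n, seed):
    """Return one realization x_1..x_n of X_t = X_{t-1} + e_t with X_0 = 0."""
    rng = random.Random(seed)  # a private generator makes the run reproducible
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)  # e_t: independent standard normal shock
        path.append(x)
    return path


# Two realizations of the same (assumed) stochastic process.
realization_a = simulate_random_walk(12, seed=1)
realization_b = simulate_random_walk(12, seed=2)

for t, (a, b) in enumerate(zip(realization_a, realization_b), start=1):
    print(f"t={t:2d}   x_t (seed 1) = {a:7.3f}   x_t (seed 2) = {b:7.3f}")
```

In practice only one realization is observed, for example the recorded monthly arrivals, and the aim of the analysis is to infer the properties of the underlying process from that single path.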
In this study, we use the term time series for both the stochastic process and a particular realization of it, and t takes discrete integer values 0, ±1, ±2, ±3, and so on. In other cases, however, the set T of times at which we record the observations can be a continuous interval, e.g. T = [0,1]. In that case we say that we have a continuous-time time series.

Time series occur in economics, where we can have monthly national unemployment figures, inflation rates registered over equal time periods, annual GDP figures, etc. In epidemiology, an example of a time series is the daily count of covid-19 deaths observed in a specific geographical area. In medicine, a patient's blood sugar measurements traced over time could be useful for evaluating the effect of a specific drug on treating diabetes. In environmental sciences, time series arise from recording the average monthly temperature or the yearly rainfall. In the stock market, daily stock prices produce a time series. Time series analysis applies to a diverse list of scientific fields: practically anything that we observe sequentially over time is a time series and can be analyzed as such.

Graphically, we display a sample time series by plotting the values of the random variables on the vertical axis, or ordinate, with the time scale as the abscissa. Typically, we connect the values at adjacent time points, which visually suggests a hypothetical continuous time series that could have produced these values as a discrete sample. Examples of time series plots can be seen in figures 3.1 - 3.3.

Figure 3.1 Strikes in the USA, 1951-1980. [Brockwell – Davis, 2016]

Figure 3.2 Population of the USA at 10-year intervals, 1790-1990. [Brockwell – Davis, 2016]

Figure 3.3 The monthly accidental deaths data, 1973-1978.
[Brockwell – Davis, 2016]

The purpose of time series analysis is primarily to find a satisfactory probability model to represent the data. This helps us understand the stochastic process that produces the observed time series. Once the model is developed, it can be used for prediction purposes.

3.2 Stationarity

Definition 1. A time series $\{X_t\}$, $t \in T$, is said to be strictly stationary if the joint distribution $F$ of $X_{t_1}, X_{t_2}, \ldots, X_{t_n}$ is invariant under shifts of the time origin:

$$F(X_{t_1}, X_{t_2}, \ldots, X_{t_n}) = F(X_{t_1+k}, X_{t_2+k}, \ldots, X_{t_n+k}), \quad \text{where } k, n \in \mathbb{N} \text{ and } t_1, t_2, \ldots, t_n \in T.$$

According to Definition 1 we have

$$F(X_t) = F(X_{t+k}) = F(X_0),$$

which means that the cumulative distribution function is independent of t, and so the mean $E X_t = \mu$ is independent of t. Also we have

$$F(X_t, X_s) = F(X_{t+k}, X_{s+k}) = F(X_0, X_{s-t}),$$

which means that the joint distribution of $X_t, X_{t+k}$ does not depend on t. In other words, pairs of observations that are the same distance apart have the same joint distribution.

Definition 2. A time series $\{X_t\}$, $t \in T$, is said to be weakly stationary (second-order stationary) if the mean and the covariance are independent of time t, which means:

$$E X_t = \mu, \quad t \in T, \qquad \text{and} \qquad Cov(X_t, X_s) = E[(X_t - \mu)(X_s - \mu)] = \gamma_{|t-s|}, \quad t, s \in T.$$

Because strict stationarity is a highly restrictive condition that is rarely met, whenever we use the term "stationary time series" we will mean weakly stationary. From all the above, we get the following assumptions every time we set out to define an appropriate model:

1) $E X_t = \mu$ and $Var(X_t) = \sigma_X^2$, $t \in T$, meaning that the mean and the variance are constant in time t.

2) $Cov(X_t, X_{t+k}) = \gamma_k$, $t \in T$, meaning that the covariance between two observations of the time series depends only on the lag k between their time points.

Definition 3. Let $\{X_t\}$ be a stationary time series.
The autocovariance function (ACVF) of $\{X_t\}$ at lag k is:

$$\gamma_k = Cov(X_t, X_{t+k}) = E[(X_t - \mu)(X_{t+k} - \mu)]$$

By this definition, we find:

$$\gamma_k = Cov(X_t, X_{t+k}) = Cov(X_{t-k}, X_t) = Cov(X_t, X_{t-k}) = \gamma_{-k} \quad \text{and} \quad \gamma_0 = \sigma_X^2$$

We also define the autocorrelation function (ACF):

$$\rho_k = \frac{\gamma_k}{\gamma_0}, \quad k = 0, \pm 1, \pm 2, \ldots$$

from which we get $\rho_0 = 1$ and $\rho_k = \rho_{-k}$.

In practice, we do not calculate the ACF starting from a model. We use a finite set of observed data $\{x_1, x_2, \ldots, x_n\}$ and calculate the sample autocorrelation function (sample ACF). The sample equivalents of the above quantities are used as estimators for inferential purposes. Thus, we obtain an estimate of the extent of the dependence in the data, which is one of the most important tools we have for modeling purposes.

3.3 Components of a non-stationary time series

The first step in the analysis of a time series should always be to plot the data. If we find any apparent outlying observations, we need to examine them carefully to understand whether they were caused by mistakes during the data recording, and decide whether or not to discard them. We also check the magnitude of the fluctuations and try to see whether the variance changes with the level of the time series. If so, we need to apply a transformation to the data: for example, we take the transformed time series $\{\ln x_1, \ln x_2, \ldots, \ln x_n\}$, whose magnitudes are more limited, or we can use the Box-Cox transformation. By observing the plot we can also see whether there is a trend and a seasonal component. A trend exists if we notice a long-term increase or decrease in the observed values. A repeated pattern in the plot implies the presence of a seasonal component.
If both components exist in the plot (which means that the time series is not stationary), we represent our data with the classical decomposition model

$$X_t = m_t + s_t + Y_t \qquad (1)$$

where

• $m_t$ is a slowly changing function of time t, which can be deterministic or stochastic in nature and is known as the trend component
• $s_t$ is a periodic function with period d, the seasonal component; we note that $s_t = s_{t-d}$
• $Y_t$ is the random noise component, which is stationary

The goal of the researcher is to estimate the trend and seasonal components and to eliminate them from the original series. If the noise component $Y_t$ that remains after this elimination is a stationary time series, we can find a satisfactory model to describe the process and its properties and, by following the reverse route, combine it with the estimated $m_t$ and $s_t$ components to compose a model that fits our original data. Then we can use this model to forecast future values of $X_t$. For methods of estimation the interested reader could refer to the book by Cryer & Chan (2008).

3.4 Stochastic processes

We have already defined a time series to be the realization of a specific stochastic process. Therefore, we need to define the basic stochastic models that are commonly used to describe the process which generated the data. Finding a model that fits our data satisfactorily allows us to move on to forecasting future values of the time series. We note again that by the term time series we refer both to the data set and to the stochastic process that we consider to have generated these data.

a) The time series $\{X_t\}$, $t \in \mathbb{Z}$, is called a time series of independent and identically distributed (iid) random variables if it consists of independent random variables that have the same distribution. An iid time series is completely random and does not contain any correlations (linear or not) between its observations.
The independence of the random variables means that past values carry no information about future ones, so we cannot get any information out of the analysis of the series.

b) A time series that consists of uncorrelated random variables, which are not necessarily independent, need not be an iid time series. We refer to such a time series as white noise with mean 0 and variance $\sigma_x^2$ and we use the notation

$$\{X_t\} \sim WN(0, \sigma_x^2)$$

Additionally, if the random variables of the white noise have a normal distribution, the time series is called Gaussian white noise.

c) The random walk is a non-stationary time series model $\{X_t\}$ in which every random variable $X_t$ is obtained from the previous one, $X_{t-1}$, by adding a random step $Y_t$, where $\{Y_t\}$ is iid noise. This process is denoted as

$$X_t = X_{t-1} + Y_t$$

If we start with $X_0 = 0$ and successively replace the random variables $X_{t-1}, X_{t-2}, \ldots$ using the definition of the random walk, we get

$$X_t = \sum_{k=1}^{t} Y_k, \quad Y_k: \text{iid noise}$$

For iid noise with zero mean and variance $\sigma_Y^2$, it is easy to see that the random walk has mean $E(X_t) = 0$ and variance $Var(X_t) = E(X_t^2) = t\sigma_Y^2$. The variance increases with time t, indicating that the random walk is not a stationary time series. We note that if we apply first differencing to a random walk, we get the stationary iid time series $\{Y_t\}$.

d) An autoregressive process of order p, denoted as AR(p), is of the form

$$X_t = \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \cdots + \varphi_p X_{t-p} + Z_t, \quad Z_t \sim WN(0, \sigma_z^2)$$

where $X_t$ is stationary, $\varphi_1, \varphi_2, \ldots, \varphi_p$ are constants ($\varphi_p \neq 0$) and we have taken the mean of $X_t$ to be zero. If the mean is not zero, we write $X_t - \mu$ instead of $X_t$ in the formula above. By using the backshift operator B, we can write the AR(p) model as

$$(1 - \varphi_1 B - \varphi_2 B^2 - \cdots - \varphi_p B^p) X_t = Z_t$$

or more concisely as $\Phi(B) X_t = Z_t$, where $\Phi(B) = 1 - \sum_{i=1}^{p} \varphi_i B^i$ is the AR(p) operator. We also define the AR(p) characteristic polynomial as the polynomial $\Phi(z) = 1 - \sum_{i=1}^{p} \varphi_i z^i$, where z is a complex number.
It can be proved that an AR(p) process is stationary when the roots of the AR(p) characteristic polynomial lie outside the unit circle. The idea behind autoregressive models is that the current value of the series, $x_t$, can be explained as a function of p past values $x_{t-1}, x_{t-2}, \ldots, x_{t-p}$ with the addition of white noise. The linear combination of $x_i$ for $i = t-1, \ldots, t-p$ can be considered as the deterministic part of this model and $Z_t$ as the stochastic part.

e) A moving average model of order q, denoted as MA(q), is defined to be

$$X_t = Z_t - \theta_1 Z_{t-1} - \theta_2 Z_{t-2} - \cdots - \theta_q Z_{t-q}, \quad Z_t \sim WN(0, \sigma_z^2)$$

where $\theta_1, \theta_2, \ldots, \theta_q$ are parameters. By using the backshift operator, we can write the MA(q) model as

$$X_t = (1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q) Z_t$$

or more concisely as $X_t = \Theta(B) Z_t$, where $\Theta(B) = 1 - \sum_{i=1}^{q} \theta_i B^i$ is the MA(q) operator. We also define the MA(q) characteristic polynomial as the polynomial $\Theta(z) = 1 - \sum_{i=1}^{q} \theta_i z^i$, where z is a complex number. The moving average process is stationary for any values of the parameters, since it is a finite sum of white noise terms.

f) A process $\{X_t\}$ is an autoregressive moving average (ARMA) series if it is stationary and

$$X_t = \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \cdots + \varphi_p X_{t-p} + Z_t - \theta_1 Z_{t-1} - \theta_2 Z_{t-2} - \cdots - \theta_q Z_{t-q}$$

where $\varphi_p \neq 0$, $\theta_q \neq 0$. The parameters p and q are the autoregressive and the moving average orders respectively. We have assumed that $X_t$ has zero mean. If $X_t$ has a nonzero mean, we write $X_t - \mu$ instead of $X_t$ in the formula above. Since the process consists of an AR(p) part and an MA(q) part, we refer to this model as an ARMA(p,q) model. The AR part determines whether the series is stationary, so if the roots of the AR polynomial are outside the unit circle, the ARMA(p,q) is stationary. We note that an ARMA(p,0) model is in fact an AR(p) model, while an ARMA(0,q) is an MA(q) model.

ARMA models are very important for representing time series data, but they can be applied only if we have a stationary time series.
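As a small illustration of the root condition for the AR polynomial, the following Python sketch checks stationarity in the special case p = 2 by computing the roots of $1 - \varphi_1 z - \varphi_2 z^2$ with the quadratic formula. The function name and the coefficient values are hypothetical and chosen only for illustration; this is a sketch under those assumptions, not part of the analysis carried out in R.

```python
import cmath


def ar2_is_stationary(phi1, phi2):
    """Root condition for X_t = phi1*X_{t-1} + phi2*X_{t-2} + Z_t.

    The characteristic polynomial is 1 - phi1*z - phi2*z**2; the process is
    stationary when all of its roots lie strictly outside the unit circle.
    """
    if phi2 == 0:
        # AR(1) case: the only root is z = 1/phi1 (no root at all if phi1 == 0).
        return abs(phi1) < 1
    # Roots of -phi2*z^2 - phi1*z + 1 = 0 via the quadratic formula.
    a, b, c = -phi2, -phi1, 1.0
    disc = cmath.sqrt(b * b - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    return all(abs(z) > 1 for z in roots)


# Illustrative (made-up) coefficient pairs.
print(ar2_is_stationary(0.5, 0.3))
print(ar2_is_stationary(0.5, 0.6))
```

For orders higher than 2 a numerical root finder would be needed; the same criterion on the moduli of the roots applies.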
If the time series becomes stationary after what is called differencing, we have the class of autoregressive integrated moving average (ARIMA) models described below.

g) A process $\{X_t\}$ is an ARIMA(p, d, q) if $\nabla^d X_t = (1 - B)^d X_t$ is ARMA(p, q), with d being the order of differencing. Observe that all the previous models are special cases of ARIMA(p,d,q). To obtain appropriate values for p, d, q from a data set, we will follow the Box-Jenkins approach presented in the next section.

h) If d and D are nonnegative integers, then $\{X_t\}$ is a seasonal ARIMA$(p, d, q) \times (P, D, Q)_s$ process with period s if the differenced series $Y_t = (1 - B)^d (1 - B^s)^D X_t$ is a causal ARMA process defined by

$$\varphi(B)\,\Phi(B^s)\,Y_t = \theta(B)\,\Theta(B^s)\,Z_t, \quad Z_t \sim WN(0, \sigma_z^2)$$

where

$$\varphi(z) = 1 - \varphi_1 z - \cdots - \varphi_p z^p, \qquad \Phi(z) = 1 - \Phi_1 z - \cdots - \Phi_P z^P,$$
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q, \qquad \Theta(z) = 1 + \Theta_1 z + \cdots + \Theta_Q z^Q.$$

Before we move to the final part of this chapter, we give the definition of a function that plays an important role in finding candidate ARMA models to fit our data. In an autoregressive process AR(p), the autocorrelation of $X_t$ and $X_{t-h}$ is in general nonzero even for $h > p$, since they are correlated through the random variables that lie between them, $X_{t-1}, \ldots, X_{t-h+1}$. We want to find the direct correlation between them, after neutralizing the correlations they both have with $X_{t-1}, \ldots, X_{t-h+1}$. This correlation is defined as

$$Corr(X_t, X_{t-h} \mid X_{t-1}, \ldots, X_{t-h+1})$$

and it is called the partial autocorrelation.

The partial autocorrelation function (PACF) of an ARMA process $\{X_t\}$ is the function $a(\cdot)$ defined by

$$a(0) = 1 \quad \text{and} \quad a(h) = \varphi_{hh}, \quad h \geq 1,$$

where $\varphi_{hh}$ is the last component of

$$\boldsymbol{\Phi}_h = \boldsymbol{\Gamma}_h^{-1} \boldsymbol{\gamma}_h,$$

with $\boldsymbol{\Gamma}_h = [\gamma(i - j)]_{i,j=1}^{h}$ and $\boldsymbol{\gamma}_h = [\gamma(1), \gamma(2), \ldots, \gamma(h)]'$, where $\gamma(k) = Cov(X_{t+k}, X_t)$ is the autocovariance function.

For a set of observations $\{x_1, \ldots, x_n\}$ with $x_i \neq x_j$ for some i and j, the sample PACF $\hat{a}(h)$ is given by

$$\hat{a}(0) = 1, \qquad \hat{a}(h) = \hat{\varphi}_{hh}, \quad h \geq 1,$$

where $\hat{\varphi}_{hh}$ is the last component of

$$\hat{\boldsymbol{\Phi}}_h = \hat{\boldsymbol{\Gamma}}_h^{-1} \hat{\boldsymbol{\gamma}}_h.$$

Statistical packages can do the computations for the sample PACF and provide us with a plot in a similar fashion as for the ACF.

3.5 The Box-Jenkins approach

The Box and Jenkins approach is a method of time series analysis and forecasting that aims to define a proper statistical model ARIMA(p, d, q) that represents sufficiently well the stochastic process that produced our data. There are three stages in the Box-Jenkins approach: identification, estimation and diagnostic checking.

a) Identification

In this stage we choose an initial set of values for the parameters p, d, q. The basic tools in this procedure are the sample ACF and PACF. If the sample ACF plot decays rapidly to within the limits of significance, the time series is most probably stationary, in which case we choose d = 0. On the other hand, if the ACF plot decays slowly with the lag, the series is not stationary and it is necessary to apply differencing in order to obtain stationarity. If first-order differencing is sufficient, d = 1, and so on. Next, we define the parameters p and q by using the plots of the sample ACF and PACF: typically, a PACF that cuts off after lag p suggests an AR(p) component, while an ACF that cuts off after lag q suggests an MA(q) component.

b) Estimation

By using a non-linear technique that minimizes the sum of squares of the errors, we get estimates of the coefficients $\varphi_1, \ldots, \varphi_p$ and $\theta_1, \ldots, \theta_q$ for our ARIMA(p,d,q) model. If the model does not contain an MA part, we can use the least squares method.

c) Diagnostic checking

In this stage, we conduct a number of checks of the goodness of fit of the selected model. If the fit is poor, we apply modifications. We check the statistical significance of the model's coefficients, the standard errors of the estimates, and the confidence intervals. Box and Jenkins also suggest checking the goodness of fit by applying tests on the residuals of the fitted model.
If the fitted model is satisfactory, the residuals should behave like white noise, and this is what we want to see by applying the diagnostic tests. The information criteria AIC, BIC and AICc are commonly used as we repeat the procedure for a variety of competing p and q values. The model with the smallest value for a given criterion is preferred.

After the selection of the optimal model to fit our data, we proceed with forecasts. Suppose we have the time series $\{x_1, x_2, \ldots, x_n\}$ generated from the stochastic process $\{X_t\}$ and we want to obtain the forecast $x_n(k)$ of the time series for the future time point $t = n + k$. The true value at that time, which is unknown to us, is $x_{n+k}$. The prediction error is

$$e_n(k) = x_{n+k} - x_n(k)$$

In fact, the forecast value $x_n(k)$ is an estimate of $X_{n+k}$ of the process $\{X_t\}$. Since this is a stochastic process, the optimal forecast is

$$X_n(k) = E[X_{n+k} \mid X_n, X_{n-1}, \ldots]$$

We want to have

• unbiasedness of the forecast, meaning a zero mean prediction error: $E[e_n(k)] = E[X_{n+k} - X_n(k)] = 0$
• efficiency, meaning a small variance of the prediction error: $Var[e_n(k)] = Var[X_{n+k} - X_n(k)]$

Our goal is to have a forecast that minimizes the mean squared prediction error $E[(X_{n+k} - X_n(k))^2]$ for any k.

If we believe that the time series $\{x_1, x_2, \ldots, x_n\}$ is the realization of an AR(p) process, then

$$x_{n+1} = \varphi_1 x_n + \cdots + \varphi_p x_{n-p+1} + z_{n+1}.$$

The optimal forecast for 1 time step will be

$$x_n(1) = \varphi_1 x_n + \cdots + \varphi_p x_{n-p+1}$$

and the corresponding prediction error is $e_n(1) = z_{n+1}$.

The optimal forecast for k time steps will be

$$x_n(k) = \varphi_1 x_n(k-1) + \cdots + \varphi_p x_n(k-p)$$

where each value $x_n(j)$ is known either from a previous forecast (for $j > 0$) or from the time series data (for $j \leq 0$, where $x_n(j) = x_{n+j}$).
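The recursive AR(p) forecast described above can be sketched in a few lines of Python. This is a minimal sketch, assuming a zero-mean series and known coefficients; the function name `ar_forecast` and the numbers in the usage lines are hypothetical and serve only to show the mechanics of replacing unknown future values by their forecasts.

```python
def ar_forecast(x, phi, k):
    """Forecasts x_n(1), ..., x_n(k) for a zero-mean AR(p) with coefficients phi.

    Implements x_n(j) = phi_1*x_n(j-1) + ... + phi_p*x_n(j-p), where x_n(j)
    is the observed value x_{n+j} whenever j <= 0.
    """
    p = len(phi)
    if len(x) < p:
        raise ValueError("need at least p observations")
    history = list(x)      # observed values, with forecasts appended as we go
    forecasts = []
    for _ in range(k):
        # Most recent p values, newest first, paired with phi_1, ..., phi_p.
        recent = history[-1:-p - 1:-1]
        nxt = sum(c * v for c, v in zip(phi, recent))
        forecasts.append(nxt)
        history.append(nxt)
    return forecasts


# Made-up AR(2) example: x_{n-1} = 1.0, x_n = 2.0, phi = (0.5, 0.25).
print(ar_forecast([1.0, 2.0], [0.5, 0.25], 2))
```

In the usage lines the one-step forecast uses only observed values, while the two-step forecast already reuses the one-step forecast in place of the unknown $x_{n+1}$.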
The prediction error will be

$$e_n(k) = \sum_{j=0}^{k-1} b_j z_{n+k-j} \quad \text{and also} \quad Var[e_n(k)] = \sigma_z^2 \sum_{j=0}^{k-1} b_j^2,$$

where $b_j$ are the coefficients of the MA(∞) representation of the process, with $b_0 = 1$.

If we believe that $\{x_1, x_2, \ldots, x_n\}$ is the realization of an MA(q) process, then the next observation will be

$$x_{n+1} = z_{n+1} + \theta_1 z_n + \cdots + \theta_q z_{n-q+1}$$

(from here on the MA coefficients are written with plus signs, as in the R output of Chapter 4; this differs from the definition in Section 3.4 only in the sign of each $\theta_i$). The optimal forecast for 1 time step will be

$$x_n(1) = \theta_1 z_n + \cdots + \theta_q z_{n-q+1}$$

and the corresponding prediction error is $e_n(1) = z_{n+1}$. For k time steps, the forecast will be

$$x_n(k) = \begin{cases} \theta_k z_n + \theta_{k+1} z_{n-1} + \cdots + \theta_q z_{n-q+k}, & \text{if } k \leq q \\ 0, & \text{if } k > q \end{cases}$$

If we consider $\{x_1, x_2, \ldots, x_n\}$ to be the realization of an ARMA(p, q) process, then

$$x_{n+1} = \varphi_1 x_n + \cdots + \varphi_p x_{n-p+1} + z_{n+1} + \theta_1 z_n + \cdots + \theta_q z_{n-q+1}$$

When $\{x_1, x_2, \ldots, x_n\}$ is given, the optimal prediction for 1 time step is

$$x_n(1) = \varphi_1 x_n + \cdots + \varphi_p x_{n-p+1} + \theta_1 z_n + \cdots + \theta_q z_{n-q+1}$$

and the prediction error is $e_n(1) = z_{n+1}$. For k time steps the optimal forecast will be

$$x_n(k) = \begin{cases} \varphi_1 x_n(k-1) + \cdots + \varphi_p x_n(k-p) + \theta_k z_n + \cdots + \theta_q z_{n-q+k}, & k \leq q \\ \varphi_1 x_n(k-1) + \cdots + \varphi_p x_n(k-p), & k > q \end{cases}$$

To measure the accuracy of the forecasts we use a number of statistical measures based on the prediction errors $e_t$, the original series $x_t$ and the number of observations n. These are:

• Mean Squared Error (MSE): $MSE = \frac{1}{n} \sum_{t=1}^{n} e_t^2$
• Root Mean Squared Error (RMSE): $RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^2}$
• Mean Absolute Error (MAE): $MAE = \frac{1}{n} \sum_{t=1}^{n} |e_t|$
• Mean Absolute Percentage Error (MAPE): $MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{e_t}{x_t} \right|$

Chapter 4 Application of the Box-Jenkins method

The time series in figure 2.11 has been divided into three seasons (periods), namely:

• The Pre Growth Season (01/2014 – 02/2015)
• The Growth Season (03/2015 – 04/2016)
• The Post Growth Season (05/2016 – 01/2022)

which are analyzed in this Chapter. The break points for the three periods have been chosen by combining visual inspection of the data with Change Point Analysis [12]. For the latter, using the (approximate) Binary Segmentation method [11], we identified the end of the Pre Growth Season as being at 02/2015.
The same method identified a short period of 10 observations, from 5/2019 to 2/2020, as another (short) outbreak period. Mainly because of the very limited number of observations, we have chosen to ignore this recommendation and keep this period within the Post Growth Season. The method also proposed 2/2016 as the end of the Growth Season. However, since the value at this time point was extremely high compared to all later values, we have chosen to delay the end of the Growth Season by two more months, until 4/2016, believing that by that time the influence of the outbreak would have fully died out. Hence, the Growth Season has been chosen as above, namely from 3/2015 to 4/2016, with the Post Growth Season lasting 69 months, starting at 5/2016. For this analysis we have used the changepoint package of R.

4.1 Pre growth season (01/2014 – 02/2015)

4.1.1 Data plots

The first step is to plot the data of the Pre growth time series (fig. 4.1), together with the sample autocorrelation function (ACF) and partial autocorrelation function (PACF) (fig. 4.2 - 4.5). We notice that the plot shows an upward trend followed by a downward trend, with a peak in September 2014, and does not contain any apparent periodic component. The sample ACF graph shows only one significant spike, at lag 1. It is important to use a more objective measure to determine whether there is a trend or not. We apply the augmented Dickey-Fuller (ADF) unit root test, for which we find a p-value of 0.1514, so at the usual significance levels we cannot reject the null hypothesis that the time series is non-stationary.

Figure 4.1: The time series of Pre growth season, 01/2014 - 02/2015

Figure 4.2: The sample ACF plot of the Pre growth time series.

Figure 4.3: The sample ACF values.

Figure 4.4: The sample PACF plot for the Pre growth time series.

Figure 4.5: The sample PACF values.
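The sample ACF and its significance limits, as used in this section, can be sketched as follows. This is an illustrative Python sketch only: the thesis obtained its values in R, the series below is made up (it is not the UNHCR data), and the limits $\pm 1.96/\sqrt{n}$ are the usual approximate 95% bounds under the white noise hypothesis.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n          # sample autocovariance at lag 0
    if c0 == 0:
        raise ValueError("constant series: ACF undefined")
    acf = []
    for k in range(max_lag + 1):
        ck = sum(dev[t] * dev[t + k] for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf


series = [3, 5, 4, 6, 8, 7, 9, 12, 10, 11]   # made-up values, not UNHCR data
r = sample_acf(series, 3)
bound = 1.96 / math.sqrt(len(series))        # approximate 95% limits
for lag, value in enumerate(r):
    flag = "significant" if lag > 0 and abs(value) > bound else ""
    print(lag, round(value, 3), flag)
```

With only 14 monthly observations, as in the Pre growth season, these limits are wide, so only large autocorrelations stand out.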
4.1.2 Fitting polynomials to the Pre growth season data

When a time series is not stationary, one of the first steps of the exploratory analysis is to remove the trend component. One way to do this is by differencing. Differencing removes the deterministic trend component but does not provide an estimate of it. An estimate of the trend can be obtained by fitting polynomials in time t to the data. The coefficients can be calculated with the ordinary least squares method, and the detrending is done by subtracting the estimated trend values from the original data for every t.

In the following examples, in every polynomial equation, t represents the index of the month since the start of the season and takes the values 1, 2, 3, …, and X(t) represents the monthly number of arrivals.

Using R, we fit a 3rd degree polynomial in time t to the data of the Pre growth season. This provides us with an estimate of the trend component of the time series. The fitted polynomial, as we see in figure 4.6, is:

$$X(t) = X_t = -15.462\,t^3 + 251.741\,t^2 - 498.774\,t + 923.062$$

The p-values for the coefficients lead us to accept the null hypothesis of non-significance. The $R^2$ and adjusted $R^2$ are 0.6428 and 0.5357 respectively. The residual standard error is 1602. The Mean Square Error (MSE) is 15400613. The information criteria values are found to be 251.6347 and 254.83 for AIC and BIC respectively. We also apply the Box-Ljung test to the residuals to check for correlation between them and determine whether they are white noise. The p-value of the test is 0.1685, hence at all the usual levels of significance we cannot reject the hypothesis that the residuals are white noise. Finally, for a visual presentation of the goodness of fit, we plot the fitted polynomial and the original data on the same graph (fig. 4.7).

Figure 4.6 Fitted 3rd degree polynomial summary.

Figure 4.7 3rd degree polynomial fit on the Pre growth data.
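To make the idea of estimating a polynomial trend by ordinary least squares and then detrending more concrete, here is a minimal Python sketch. It solves the normal equations directly, which is adequate for the low degrees and short series considered here but is not how the thesis computed its fits (those were obtained in R). The function names and the noise-free toy data are hypothetical.

```python
def polyfit_ols(t, x, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree] of a polynomial in t."""
    m = degree + 1
    # Normal equations (A'A) b = A'x for the design matrix A[i][j] = t_i**j.
    ata = [[sum(ti ** (r + c) for ti in t) for c in range(m)] for r in range(m)]
    atx = [sum((ti ** r) * xi for ti, xi in zip(t, x)) for r in range(m)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, atx)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    # Back substitution.
    coef = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(aug[r][c] * coef[c] for c in range(r + 1, m))
        coef[r] = (aug[r][m] - s) / aug[r][r]
    return coef


def detrend(t, x, coef):
    """Subtract the fitted polynomial trend from the observations."""
    return [xi - sum(b * ti ** j for j, b in enumerate(coef))
            for ti, xi in zip(t, x)]


t = list(range(1, 9))
x = [2 + 3 * ti - 0.5 * ti ** 2 for ti in t]   # noise-free toy data
coef = polyfit_ols(t, x, 2)
resid = detrend(t, x, coef)
print([round(b, 6) for b in coef])
```

On real data the residuals returned by `detrend` would be the detrended series whose autocorrelation is then examined, for instance with the Box-Ljung test.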
The next candidate polynomial for fitting our data is a 4th degree polynomial. The R output for the fitted model can be seen in figure 4.8. The fitted polynomial is:

$$X(t) = X_t = 4.75\,t^4 - 157.97\,t^3 + 1659.85\,t^2 - 5588.31\,t + 5906.73$$

The p-values for the coefficients' significance show an improved fit of the 4th degree polynomial compared to the previous model. The MSE is 13807342 and the residual standard error is 1360. The values of the information criteria, AIC (= 247.5674) and BIC (= 251.4017), are better in comparison with the previous model, and the Box-Ljung test has a p-value of 0.1938, so we cannot reject the null hypothesis that the time series of the residuals is white noise. A common plot of the Pre growth data and the fitted 4th degree polynomial is shown in figure 4.9.

Figure 4.8: Fitted 4th degree polynomial summary.

Figure 4.9: 4th degree polynomial fit on the Pre growth data.

The fitting of a 5th degree polynomial is our next and last attempt. The fitted polynomial is:

$$X(t) = X_t = 1.712\,t^5 - 59.4508\,t^4 + 715.1661\,t^3 - 3540.4416\,t^2 + 7232.8518\,t - 3573.014$$

The output of the model summary is presented in figure 4.10. The t statistics and p-values for the coefficients show that they are statistically significant at any of the usual levels of significance. The residual standard error is 705.6, which is a great improvement over the previous models, the MSE has decreased and equals 13577691, the multiple and the adjusted $R^2$ statistics have improved as well, and the F statistic and the p-value for the model's significance indicate a good fit. The AIC and BIC values have been found to be 229.5478 and 234.0212 respectively, much better than those of the other two polynomials. The p-value of the portmanteau test is 0.2708, so we cannot reject the null hypothesis: the residuals are not correlated and the model does not show a lack of fit.

Figure 4.10: Fitted 5th degree polynomial summary.
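A fitted polynomial of this kind can be evaluated at any month index t with Horner's rule. The short Python sketch below does this for the 5th degree polynomial at t = 15, the first month after the end of the Pre growth season (March 2015). The helper name `horner` is hypothetical, and the code assumes that the term 7232.8518 is the coefficient of t.

```python
def horner(coeffs_high_to_low, t):
    """Evaluate a polynomial at t using Horner's rule."""
    result = 0.0
    for c in coeffs_high_to_low:
        result = result * t + c
    return result


# Coefficients of the fitted 5th degree polynomial, highest power first.
# Assumption: the term 7232.8518 is the coefficient of t.
quintic = [1.712, -59.4508, 715.1661, -3540.4416, 7232.8518, -3573.014]

x15 = horner(quintic, 15)   # t = 15 corresponds to March 2015
print(round(x15, 2))
```

Because the coefficients are rounded as reported, the result is only as precise as those roundings allow, and extrapolating a high-degree polynomial beyond the fitted range should be treated with caution.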
Figure 4.11: 5th degree polynomial fit on the Pre growth data.

The summary of the AIC, BIC, $R^2$ and adjusted $R^2$ values for all three polynomials fitted to the Pre growth time series (Table 1) indicates that the 5th degree polynomial has a much better fit than the other two, since it has the minimum score in both AIC and BIC and significantly better $R^2$ and adjusted $R^2$ values. The 5th degree polynomial seems a better choice if we want to estimate the trend component of the time series. It can also provide a prediction of future values of arrivals $X_t$. For example, if we take the noise at t = 15 to be zero (its mean value), then we get an estimate of the arrivals for March 2015 equal to:

$$X_{15} = 1.712 \cdot 15^5 - 59.4508 \cdot 15^4 + 715.1661 \cdot 15^3 - 3540.4416 \cdot 15^2 + 7232.8518 \cdot 15 - 3573.014 = 12359.24$$

We have to state that this polynomial, having a higher degree than the other two models, loses in terms of simplicity, and this is something we should consider along with the rest of the criteria when choosing the estimator function of the trend. At the same time, the complexity is not severe and it has the best values of AIC and BIC, as shown in Table 1.

| Polynomial fitted | AIC | BIC | $R^2$ | Adjusted $R^2$ |
|---|---|---|---|---|
| 3rd degree | 251.6347 | 254.83 | 0.6428 | 0.5357 |
| 4th degree | 247.5674 | 251.4017 | 0.7685 | 0.6655 |
| 5th degree | 229.5478 | 234.0212 | 0.9446 | 0.91 |

Table 1. Summary of polynomials fitted on Pre growth.

4.1.3 Box-Jenkins method for fitting (S)ARIMA models to the data

We have already plotted the Pre growth time series and the sample ACF and PACF. The ADF test p-value for the original Pre growth data is 0.1514, so non-stationarity cannot be rejected, which is consistent with the presence of a trend in the time series. To eliminate that trend, we apply 1st order differences (fig. 4.12) and repeat the ADF test for the differenced series. With a p-value of 0.1279, at all the usual levels of significance we fail to reject the null hypothesis of non-stationarity.
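The differencing operation used here, $\nabla^d X_t = (1 - B)^d X_t$, is simple to state in code. The following Python sketch applies it repeatedly; the function name and the toy series (a made-up quadratic trend, not the UNHCR data) are assumptions for illustration only.

```python
def difference(x, order=1):
    """Apply the difference operator (1 - B) to x `order` times."""
    out = list(x)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


toy = [1, 4, 9, 16, 25]          # made-up quadratic trend, not UNHCR data
print(difference(toy, 1))        # first differences
print(difference(toy, 2))        # second differences
```

Each pass shortens the series by one observation, which matters for a season of only 14 monthly values and is one reason to keep the order of differencing low.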
Figure 4.12: The first difference of Pre growth series.

We apply 2nd order differences to the Pre growth time series (fig. 4.13). We find the p-value for the ADF test to be 0.4403, higher than the p-value for the 1st differences, so we proceed by applying 3rd order differences to the original time series. The plot of the differenced data (fig. 4.14) shows more fluctuations and even some upward trend. The ADF p-value is 0.7118, even higher than before.

Figure 4.13: The second difference of the Pre growth series.

Figure 4.14: The third difference of the Pre growth series.

Not wishing to proceed further with the differencing (due to the small number of observations), we accept a significance level of 12.79% and continue with the first-differenced data. We have achieved a partial elimination of the trend. In figure 4.12 of the 1st order differences, we do not recognize any obvious patterns or trends, and there is only one point lying at some distance from the central line around which all the others are gathered. We do not see any cycles in the series. We plot the sample ACF and PACF (figures 4.15 and 4.16) and we see no significant values at any lag on either graph. We do not see any pattern in the ACF and PACF plots that would indicate the presence of a seasonal component in the series.

Figure 4.15: The ACF of the first difference of the Pre growth series.

Figure 4.16: The PACF of the first difference of the Pre growth series.

Based on the sample ACF and PACF plots of the 1st order differences, we can decide the order of our ARIMA model. For our model, d = 1, since we took 1st differences to transform the original time series towards stationarity. We choose our parameters to be p = 0 and q = 0, so we fit an ARIMA(0,1,0) model. Additionally, several alternative models will be considered. We will try models that have p and q values close to our primary selection.
1) ARIMA(0,1,0)

Figure 4.17: ARIMA(0,1,0) fit summary on Pre growth.

Model: $Y_t = Z_t$, where $Y_t = X_t - X_{t-1}$ denotes the first difference of the series.

2) ARIMA(0,1,1)

Figure 4.18: ARIMA(0,1,1) fit summary on Pre growth.

Model: $Y_t = Z_t + 0.3198\,Z_{t-1}$

3) ARIMA(1,1,1)

Figure 4.19: ARIMA(1,1,1) fit summary on Pre growth.

Model: $Y_t = 0.3232\,Y_{t-1} + Z_t + 0.1076\,Z_{t-1}$

| Fitted model | AIC | BIC |
|---|---|---|
| ARIMA(0,1,0) | 229.11 | 229.6754 |
| ARIMA(0,1,1) | 229.11 | 230.2421 |
| ARIMA(1,1,1) | 230.61 | 232.3002 |

Table 2. Summary of the ARIMA models fitted on Pre growth.

Table 2 contains a summary of all the candidate models' scores for the AIC and BIC criteria. ARIMA(0,1,0) gets the best BIC score and has the same AIC score as ARIMA(0,1,1), better than ARIMA(1,1,1), which seems to be the worst. The difference between the scores, though, is minor, and both models with the best score have a simple structure. We plot each of the two models along with the data so that we can check visually the goodness of fit of both models (figure 4.25).

We proceed with goodness of fit diagnostic tests. The tests are applied to the residuals of the ARIMA(0,1,0) fit. First we plot the residual series and look for any trends, skewness, or other patterns that the model did not capture (see Appendix). The plot shows the residuals moving randomly around zero, the variance does not appear to increase over time, and their sequence looks like white noise. This is supported by the ACF plot of the residuals, which shows no significant spikes at any lag (see Appendix). The p-value for the Box-Ljung test is found to be 0.08455, so at the usual 1% and 5% levels of significance we cannot reject the hypothesis that the residuals are uncorrelated. The same conclusions apply to the ARIMA(0,1,1) fit, for which the Box-Ljung p-value is 0.611. The relevant graphs are given in the Appendix.

Both models fit our data fairly well.
Now we can use the two best models of Table 2, namely ARIMA(0,1,0) and ARIMA(0,1,1), to forecast future values of arrivals following the Pre growth season. Our forecasts cover the months March to May of 2015, a period of h = 3 months. The forecasting output for the model ARIMA(0,1,0) is shown in figure 4.20. Figure 4.21 shows the corresponding plot, in which we can see the forecast line and also the 80% and 95% confidence intervals for the forecasts, highlighted in light blue and dark blue respectively. It should be noted that the forecasts for the Pre Growth Season are presented only for illustrative purposes since, as we already know, this season is followed by an outbreak which, due to its unnatural behavior, cannot be predicted by the model of the Pre Growth Season.

Figure 4.20: Forecast summary of ARIMA(0,1,0) for the Pre growth season.

Figure 4.21: Forecast graph of ARIMA(0,1,0) for the Pre growth season.

The forecasting output for the model ARIMA(0,1,1) and the corresponding plot are shown in figures 4.22 and 4.23.

Figure 4.22: Forecast summary of ARIMA(0,1,1) for the Pre growth season.

Figure 4.23: Forecast graph of ARIMA(0,1,1) for the Pre growth season.

4.1.4 The Auto ARIMA

In this section we look at the output of the automated method, obtained with the "auto.arima(.)" command in R. R selects and fits the optimal ARIMA model according to the standard criteria, without our having to choose the parameters p and q. We can compare it with the previous models and use it for forecasting. The output is exhibited in figure 4.24 and the model is the ARIMA(0,1,0): $Y_t = Z_t$. This is a model that we have already applied to our data. For a direct comparison of the two models with the best fit, we plot them together on the Pre growth data plot (fig. 4.25).

Figure 4.24: Auto ARIMA fit summary on Pre growth.
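Since an ARIMA(0,1,0) model without drift is a random walk, its forecasts and prediction intervals have a particularly simple form: the point forecast at every horizon is the last observation, and the forecast variance grows linearly with the horizon k. The Python sketch below illustrates this under stated assumptions: Gaussian errors, a simple zero-mean estimate of the step variance, a hypothetical function name and made-up data. It is not intended to reproduce the R output of figures 4.20 and 4.21.

```python
import math
from statistics import NormalDist


def random_walk_forecast(x, h, level=0.95):
    """Point forecasts and prediction intervals for a driftless random walk.

    Returns a list of (forecast, lower, upper) tuples for horizons 1..h.
    """
    diffs = [b - a for a, b in zip(x, x[1:])]
    sigma2 = sum(d * d for d in diffs) / len(diffs)   # assumes zero-mean steps
    z = NormalDist().inv_cdf(0.5 + level / 2)         # normal quantile
    last = x[-1]
    out = []
    for k in range(1, h + 1):
        half_width = z * math.sqrt(k * sigma2)        # Var[e_n(k)] = k * sigma^2
        out.append((last, last - half_width, last + half_width))
    return out


toy = [10, 12, 11, 15, 14]       # made-up values, not UNHCR data
for level in (0.80, 0.95):
    print(level, random_walk_forecast(toy, 3, level))
```

The widening of the intervals with the horizon reflects the fact that the uncertainty of a random walk accumulates with every step.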
Figure 4.25: The common plot of the two optimal ARIMA models with the data of the Pre growth season.

Pre growth overview

In summary, we form Table 3, which contains all the models fitted on our Pre growth data. We can see right away that ARIMA(0,1,0) and ARIMA(0,1,1) have the same AIC value but ARIMA(0,1,0) has a slightly better BIC value. The 5th degree polynomial has much better $R^2$ and adjusted $R^2$ scores than the other two polynomials, as well as better AIC and BIC values. The 3rd and 4th degree polynomials score worse on AIC and BIC than any of the ARIMA models we have applied, but the 5th degree polynomial has an AIC value very close to that of the best fitted ARIMA model. As for the ARIMA models, since two of them share the same AIC value, we could say that either is as good as the other. In Table 4 we see that the root mean square error of ARIMA(0,1,0) is higher than the corresponding error of ARIMA(0,1,1).

We must note that, under ARIMA(0,1,0), the differenced series shows no significant sample ACF or PACF values at any lag, so it behaves as an ARMA(0,0) process, that is, random iid noise, and the original time series is therefore a random walk. ARIMA(0,1,1), on the other hand, has no AR component: its differences follow an MA(1) process, so it gives a constant forecast for each of the forecasting months.

Fitted model             AIC        BIC        $R^2$    Adjusted $R^2$
3rd degree polynomial    251.6347   254.83     0.6428   0.5357
4th degree polynomial    247.5674   251.4017   0.7685   0.6655
5th degree polynomial    229.5478   234.0212   0.9446   0.91
ARIMA(0,1,1)             229.11     230.2421
ARIMA(1,1,1)             230.61     232.3002
ARIMA(0,1,0) = AUTO      229.11     229.6754

Table 3. Summary of all models fitted on Pre growth.

Model          ME         RMSE       MAE        MPE         MAPE       MASE        ACF1
ARIMA(0,1,1)   125.0664   1336.781   832.2897   2.490061    26.49729   0.7769472   0.12253
ARIMA(1,1,1)   116.91     1306.821   820.2339   4.245509    25.58179   0.765693    0.05033
ARIMA(0,1,0)   137.0682   1449.574   994.7825   0.0286448   31.66417   0.9286351   0.41553

Table 4.
Summary of forecasting errors for the ARIMA models on Pre growth.

4.2 Growth season (03/2015 - 04/2016)

4.2.1 Data plots

We plot the data and observe the features of the graph (fig. 4.26). We distinguish two obvious trends: an upward trend with an increasing slope until the series reaches its peak, and a downward trend after the peak that leads the series to a continuous decrease until the end of the period. Both trends seem to follow an exponential or quadratic curve, so we assume that 1st or 2nd order differences may be needed. Since the season covers only 14 monthly observations, it is too short for any yearly seasonality to be identified, and we do not see any cycles either. There are no obvious outliers. The data do not exhibit increasing fluctuations as the level of the series increases. In the ACF plot we see a sinusoidal pattern declining to zero, with significant spikes at lags 1, 5 and 6. In the PACF plot we see one significant spike at lag 1, suggesting that the series may become stationary after taking the 1st order difference of the original series. We apply the ADF test and obtain a p-value of 0.7845, higher than all the usual significance levels, so we cannot reject the null hypothesis of non-stationarity.

Figure 4.26: The time series plot for the Growth season, 03/2015-04/2016.

Figure 4.27: The sample ACF plot for the Growth season time series.

Figure 4.28: The sample ACF values.

Figure 4.29: Sample PACF plot for the Growth season time series.

Figure 4.30: The sample PACF values.

4.2.2 Fitting polynomials to the Growth season data

We start by fitting a 2nd degree polynomial in time t to the data. The R output for the model is shown in figure 4.31. The fitted polynomial is:

$X(t) = X_t = -3513.9t^2 + 55261.6t - 87759.9$

The p-values of the coefficients are lower than the 5% significance level, hence the coefficients are significant. $R^2$ and adjusted $R^2$ are 0.7 and 0.646 respectively.
The residual standard error is 38110 and the MSE is $1.45 \cdot 10^9$. The values of the AIC and BIC information criteria are 339.7011 and 342.2573 respectively. The Box-Ljung test on the residuals gives a p-value of 0.039, which means that the null hypothesis of no autocorrelation is not rejected at the 1% significance level but is rejected at the 5% level. The fitted polynomial together with the data line is shown in figure 4.32.

Figure 4.31: 2nd degree polynomial fitted on Growth season.

Figure 4.32: 2nd degree polynomial fit on Growth season data plot.

Next, we fit a 3rd degree polynomial to the data. The R output appears in figure 4.33. The fitted polynomial is:

$X(t) = X_t = -276.8t^3 + 2715t^2 + 16587.2t - 31284.7$

The p-values of the coefficients are higher than any usual significance level, a sign that this model does not fit well. The residual standard error is 36450, the MSE is $1.33 \cdot 10^9$, $R^2$ is 0.75 and adjusted $R^2$ is 0.676, both slightly improved. The AIC and BIC values are 339.1266 and 342.3219 respectively. The Box-Ljung test p-value is 0.059, hence at the 5% level we do not reject the null hypothesis that the residuals are white noise. In summary, despite the slightly higher $R^2$, the non-significant coefficients make this model a worse choice than the previous one.

Figure 4.33: 3rd degree polynomial fitted on Growth season.

Figure 4.34: Fitted 3rd degree polynomial on Growth season data.
Fitting a 4th degree polynomial gives the results of figure 4.35. The fitted polynomial is:

$X(t) = X_t = 139.2t^4 - 4453.9t^3 + 43988.4t^2 - 132593.6t + 114793.1$

The coefficients are significant at the usual 5% level, the residual standard error has improved to 24800 and the MSE is $6.148 \cdot 10^8$. We note a significant improvement in $R^2$ and adjusted $R^2$, which was to be expected since we have added another predictor to the model. The AIC and BIC values are 328.86 and 332.69 respectively. The Box-Ljung test p-value is 0.6287, hence the residuals are consistent with white noise. Overall, the model has a much better goodness of fit than the previous models. The fitted polynomial is plotted against our data in figure 4.36.

Figure 4.35: 4th degree polynomial fitted on Growth season.

Figure 4.36: 4th degree polynomial fit on Growth season data plot.

The results are summarized in Table 5. The 4th degree polynomial has the best fit among the three models, as it scores much better on both AIC and BIC and has significantly better $R^2$ and adjusted $R^2$. It therefore seems the better choice if we want to estimate the trend component of the time series, although a higher degree means more predictors and a less simple model. We note that we also tried a 5th degree polynomial: compared with the 4th degree polynomial it had a lower adjusted $R^2$ (0.84) and slightly higher AIC and BIC values (329.6 and 334.12 respectively), and the high p-values of its coefficients indicated that they are not significant.

Polynomial    AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree    339.7011   342.2573   0.701    0.6466
3rd degree    339.1266   342.3219   0.7512   0.6765
4th degree    328.8602   332.6945   0.8964   0.8504

Table 5. Summary of polynomials fitted on Growth.
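The adjusted $R^2$ column of Table 5 can be recovered from the $R^2$ column with the usual formula, adjusted $R^2$ = 1 - (1 - $R^2$)(n - 1)/(n - k - 1), where k is the polynomial degree. The Python sketch below (an illustration, not the R code used in the thesis) assumes n = 14 monthly observations for 03/2015-04/2016; this sample size is an assumption, chosen because it reproduces the tabulated values up to rounding.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 14  # assumed number of monthly observations in the Growth season

# degree: (R^2, adjusted R^2) as reported in Table 5
table5 = {2: (0.701, 0.6466), 3: (0.7512, 0.6765), 4: (0.8964, 0.8504)}

recomputed = {deg: adjusted_r2(r2, N_OBS, deg) for deg, (r2, _) in table5.items()}

for deg, (r2, reported) in table5.items():
    print(f"degree {deg}: reported {reported:.4f}, recomputed {recomputed[deg]:.4f}")
```

The penalty factor (n - 1)/(n - k - 1) grows quickly when n is this small, which is why the adjusted value is the fairer yardstick when comparing polynomials of different degree on a short series.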
4.2.3 Box-Jenkins method for fitting ARIMA models

We have noted before that there is a trend in the Growth season data (ADF p-value = 0.7845), so we take 1st order differences to detrend the series (fig. 4.37). The ADF test on the differenced series gives a p-value of 0.3, so we fail to reject the null hypothesis of non-stationarity. The sample ACF and PACF plots show no significant spikes at any lag (figures 4.38 and 4.39), a sign that the 1st differences are white noise and the original time series is a random walk; hence the data are dependent and not identically distributed. Applying 2nd and 3rd order differences does not improve the situation, as the ADF p-values become 0.59 and 0.52 respectively, so we do not continue with further differencing. We therefore accept a significance level of 0.3 and continue the analysis with the first-order differenced data, achieving only a partial elimination of the trend.

Figure 4.37: The first difference of the Growth season data.

Figure 4.38: The ACF plot for the differenced Growth season data.

Figure 4.39: The PACF plot for the differenced Growth season data.

We choose to fit ARIMA models with initial values d = 1, p = 0, q = 0. We will also try models with parameters close to these values.

1) ARIMA(0,1,0)

Figure 4.40: ARIMA(0,1,0) fit summary on Growth season.

Model: $X_t = Z_t$

2) ARIMA(0,1,1)

Figure 4.41: ARIMA(0,1,1) fit summary on Growth season.

Model: $X_t = Z_t + 0.2968 Z_{t-1}$

3) ARIMA(1,1,1)

Figure 4.42: ARIMA(1,1,1) fit summary on Growth season.

Model: $X_t = 0.4883 X_{t-1} + Z_t - 0.0681 Z_{t-1}$

4) ARIMA(1,1,0)

Figure 4.43: ARIMA(1,1,0) fit summary for the Growth season.

Model: $X_t = 0.4326 X_{t-1} + Z_t$

Table 6 summarizes the information criteria for our fitted models. We see that ARIMA(1,1,0) has the best score of all, followed by ARIMA(0,1,0).
A common plot of the data with the fitted model ARIMA(0,1,0) and the Auto ARIMA model is shown in figure 4.49.

Fitted model     AIC      BIC
ARIMA(0,1,0)     312.54   313.1048
ARIMA(0,1,1)     312.62   313.7518
ARIMA(1,1,1)     313.67   315.3624
ARIMA(1,1,0)     311.7    312.8266

Table 6. Summary of the ARIMA models fitted on Growth.

We proceed with diagnostic tests for the goodness of fit. The tests are performed on the residuals of the ARIMA(1,1,0) fit. A plot of the residuals (App.) shows random movement around zero. The ACF plot (App.) shows no significant spikes at any lag and the Box-Ljung test p-value is 0.06. Hence we do not reject the null hypothesis that the residuals are white noise, meaning that our model shows no lack of fit.

Using ARIMA(1,1,0) we can proceed to forecasting. Our forecasts are for the months May to July of the year 2016, a period of h = 3 months. The forecasting output for ARIMA(1,1,0) and the corresponding plot are shown in figures 4.44 and 4.45.

Figure 4.44: ARIMA(1,1,0) forecast summary on Growth season.

Figure 4.45: ARIMA(1,1,0) forecast plot on Growth season.

4.2.4 The Auto ARIMA

Given the "auto.arima(.)" command, R fits the model ARIMA(2,0,0):

$X_t = 1.3253 X_{t-1} - 0.4566 X_{t-2} + Z_t$

The model's summary is exhibited in figure 4.46.

Figure 4.46: The Auto ARIMA fit summary on Growth season.

In the residual plot (App.) for the fitted ARIMA(2,0,0) model we see random movement around zero, the ACF plot (App.) does not show any significant spikes, and the p-value of the Box-Ljung test (0.7) indicates that the residuals are white noise and the model has a fair fit. We proceed with the auto ARIMA model's forecasts and the corresponding forecast plot (fig. 4.47 and fig. 4.48).

Figure 4.47: Auto ARIMA forecast summary on Growth season.

Figure 4.48: Auto ARIMA forecast plot on Growth season.
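Because the ARIMA(2,0,0) model is fitted to the undifferenced series, it is worth confirming that the estimated AR(2) part is stationary, which holds when both roots of $1 - \phi_1 z - \phi_2 z^2 = 0$ lie outside the unit circle. The Python sketch below (illustrative; the helper name `ar2_roots` is hypothetical) performs this check for the coefficients 1.3253 and -0.4566 quoted above.

```python
import cmath


def ar2_roots(phi1, phi2):
    """Roots of the AR(2) characteristic polynomial 1 - phi1*z - phi2*z^2."""
    # Written as a*z^2 + b*z + c = 0 with a = -phi2, b = -phi1, c = 1.
    a, b, c = -phi2, -phi1, 1.0
    disc = cmath.sqrt(b * b - 4 * a * c)
    return ((-b + disc) / (2 * a), (-b - disc) / (2 * a))


# Coefficients of the ARIMA(2,0,0) model reported in Section 4.2.4.
roots = ar2_roots(1.3253, -0.4566)
moduli = [abs(r) for r in roots]
is_stationary = all(m > 1 for m in moduli)

print("roots:", roots)
print("moduli:", moduli)
print("stationary:", is_stationary)
```

The two roots form a complex conjugate pair with modulus of about 1.48, so the fitted process is stationary. Complex roots are also consistent with a damped sinusoidal autocorrelation pattern of the kind observed in the sample ACF of the Growth season data.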
Figure 4.49: The common plot of the two optimal ARIMA models with the data line on Growth season.

Growth season overview

In summary, we gather all the criteria values for our fitted models and form Table 7. As we noted before, the 4th degree polynomial shows the best AIC and BIC scores among the polynomials and much higher $R^2$ and adjusted $R^2$. Since $R^2$ never decreases when explanatory variables are added to a model, and can be affected by other factors to the point where a significant model scores a low $R^2$ or a non-significant model scores a high one, the AIC and BIC values are the most reliable measures for comparing the goodness of fit of the polynomials.

The ARIMA models we have chosen to fit on our data are fitted under a high level of significance, since the p-value of the ADF test on the 1st differences is 0.3. The optimal model according to the AIC and BIC values is ARIMA(1,1,0), an autoregressive model of low order. The most appropriate model according to the auto.arima(.) command is ARIMA(2,0,0), a model that does not difference the original data and has higher AIC and BIC values than ARIMA(1,1,0). In Table 8, on the other hand, we can see that ARIMA(2,0,0) has lower RMSE, MAE, MAPE and MASE values than ARIMA(1,1,0). Both models deliver negative forecasts and forecast confidence intervals that contain zero. Since a negative forecast has no logical meaning when the variable represents arrivals of people, we can substitute these forecasts with zero.

Fitted model             AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree polynomial    339.7011   342.2573   0.701    0.6466
3rd degree polynomial    339.1266   342.3219   0.7512   0.6765
4th degree polynomial    328.8602   332.6945   0.8964   0.8504
ARIMA(2,0,0)             336.99     338.9
ARIMA(0,1,0)             312.54     313.1048
ARIMA(1,1,0)             311.7      312.8266

Table 7. Summary of all models fitted on Growth.
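The substitution of negative forecasts by zero described above amounts to truncating the point forecasts and the interval bounds at zero. A minimal Python sketch follows; the function name `truncate_at_zero` and all numbers are hypothetical placeholders and do not reproduce the values of figures 4.44 and 4.47.

```python
def truncate_at_zero(forecasts):
    """Replace negative point forecasts and interval bounds by zero."""
    return [{key: max(value, 0.0) for key, value in row.items()} for row in forecasts]


# Hypothetical 3-month-ahead forecasts with 95% bounds (not thesis output).
raw = [
    {"point": 2500.0, "lower95": -58000.0, "upper95": 63000.0},
    {"point": -1200.0, "lower95": -110000.0, "upper95": 107600.0},
    {"point": -3100.0, "lower95": -160000.0, "upper95": 153800.0},
]

truncated = truncate_at_zero(raw)
for row in truncated:
    print(row)
```

Truncation keeps the forecasts interpretable as counts of arrivals, but it is a post hoc correction: an alternative that avoids negative values by construction, such as modelling the logarithm of the series, is not explored in this work.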
Model          ME          RMSE       MAE        MPE         MAPE       MASE        ACF1
ARIMA(2,0,0)   8621.831    30408.53   22000.09   2.112955    42.69843   0.6945115   -0.09181435
ARIMA(0,1,0)   -301.1519   35876.93   29414.99   -46.47366   83.38134   0.9285892   0.4514709
ARIMA(1,1,0)   -931.1624   31905.25   22435.38   -20.04772   49.19935   0.7082527   0.000577384

Table 8. Summary of forecasting errors for the ARIMA models on Growth.

4.3 Post growth season (05/2016 - 01/2022)

4.3.1 Data plots

We plot the data for the Post growth season (fig. 4.50) along with the sample ACF and PACF. No obvious outlying points seem to exist in the graph. The variance does not seem to increase over time, so we do not need to apply a Box-Cox transformation to stabilize it. Some trend appears to be present in the time series, and we notice that the ACF decays slowly as the number of lags increases, suggesting that the original series is not stationary (figures 4.51 and 4.52). We apply the Augmented Dickey-Fuller (ADF) test to check for stationarity. The p-value is 0.5237, so we cannot reject the null hypothesis of non-stationarity; thus the Post growth time series is not stationary. In addition, the partial autocorrelation function (PACF) of the original series has one significant spike (exceeding the significance boundaries). This provides an indication of an autoregressive process of order 1 (figure 4.53).

Figure 4.50: The Post growth time series plot.

Figure 4.51: The sample ACF plot of the Post growth data.

Figure 4.52: The Post growth sample ACF values.

Figure 4.53: The Post growth sample PACF plot.

Figure 4.54: The sample PACF values.

4.3.2 Fitting polynomials to the Post growth season data

Using R, we fit a 2nd degree polynomial in time t to our Post growth season data.
The fitted polynomial is:

$X(t) = X_t = -3.83t^2 + 243.11t + 572.35$

Along with the model summary (fig. 4.55), we calculate the AIC and BIC criteria to check the goodness of fit. The p-values of the coefficients indicate a fair fit of the model to our data, and the values of the information criteria AIC and BIC are 1262.466 and 1271.403 respectively. We apply the Box-Ljung test to check for autocorrelation among the residuals. The p-value is $1.1 \cdot 10^{-12}$, which means that the residuals are not white noise and the model has a significant lack of fit. Finally, for a visual presentation of the goodness of fit, we plot the time series and the fitted polynomial on the same graph (fig. 4.56).

Figure 4.55: 2nd degree polynomial fitted on the Post growth data.

Figure 4.56: 2nd degree polynomial fit on Post growth data plot.

We repeat the procedure by fitting a 3rd degree polynomial to our Post growth season data. The fitted polynomial is:

$X(t) = X_t = -0.00773t^3 - 3.01797t^2 + 220.21741t + 710.66668$

The model summary is shown in figure 4.57. The p-values of the coefficients are greater than all the standard levels of significance, meaning that the model does not fit our data well. The values of the AIC and BIC criteria are 1264.432 and 1275.602 respectively and the Box-Ljung test p-value is $1.2 \cdot 10^{-12}$, which means that the model shows a lack of fit and the residuals are highly correlated.

Figure 4.57: 3rd degree polynomial fitted on Post growth data.

Figure 4.58: 3rd degree polynomial fit on the Post growth plot.

The last candidate polynomial that we fit to our data is a 4th degree polynomial. The RStudio output for the fitted model can be seen in figure 4.59.
The fitted polynomial is:

$X(t) = X_t = 0.0079t^4 - 1.1138t^3 + 46.9971t^2 - 571.0056t + 3658.8987$

The p-values of the coefficients show that the 4th degree polynomial fits better than the previous two models. The values of the information criteria are AIC: 1254.649 and BIC: 1268.054, slightly improved over the previous two models, and the Box-Ljung test gives a p-value of $7.5 \cdot 10^{-12}$, meaning that the residual series is not white noise and the model has a lack of fit.

Figure 4.59: 4th degree polynomial fitted on Post growth season.

Figure 4.60: 4th degree polynomial fit on the Post growth data.

The summary of the AIC, BIC, $R^2$ and adjusted $R^2$ values for the three polynomials fitted on the Post growth data (Table 9) indicates that the 4th and 2nd degree polynomials fit better than the 3rd degree polynomial, and that the comparison between the 2nd and the 4th degree polynomials favours the 4th degree, as it exhibits lower AIC and BIC values.

Polynomial    AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree    1262.466   1271.403   0.3124   0.2915
3rd degree    1264.432   1275.602   0.3127   0.281
4th degree    1254.649   1268.054   0.4206   0.3844

Table 9. Summary of polynomials fitted on Post growth.

4.3.3 Box-Jenkins method for fitting (S)ARIMA models

We have already plotted the Post growth time series along with the sample ACF and PACF. We have also found the ADF p-value to be 0.5237, and we know there is a trend in the time series. We take 1st order differences and repeat the ADF test on the differenced time series. With a p-value of 0.0195, we can conclude that the series of 1st order differences is stationary. In the plot of figure 4.61 we do not recognize any obvious patterns or cycles, so we can assume that there is no seasonality or cyclicity.
We plot the ACF and PACF of the differenced data (figures 4.62 and 4.63) and notice one significant spike at lag 1 in the ACF graph and one significant spike at lag 1 in the PACF graph. Both graphs seem to show another marginally significant spike at lag 4, but only after lags 2 and 3 give no significant values. We do not see any periodic pattern in the ACF and PACF plots that would indicate a seasonal component in the series.

Figure 4.61: The first difference of the Post growth data.

Figure 4.62: The ACF plot for the differenced data.

Figure 4.63: The PACF plot for the differenced data.

Based on the ACF and PACF of the 1st order differences, we can decide the order of our ARIMA model. We set d = 1, since we take 1st differences to transform the original time series into a stationary one, and we first choose p = 1 and q = 1, so we fit an ARIMA(1,1,1) model. Additionally, several alternative models are considered, with p and q values close to our primary selection.

1) ARIMA(1,1,1)

Figure 4.64: ARIMA(1,1,1) fit summary on Post growth.

Model: $X_t = -0.2734 X_{t-1} + Z_t + 0.6147 Z_{t-1}$

2) ARIMA(1,1,2)

Figure 4.65: ARIMA(1,1,2) fit summary on Post growth.

Model: $X_t = -0.8169 X_{t-1} + Z_t + 1.2316 Z_{t-1} + 0.2316 Z_{t-2}$

3) ARIMA(2,1,1)

Figure 4.66: ARIMA(2,1,1) fit summary on Post growth.

Model: $X_t = -0.5698 X_{t-1} + 0.1857 X_{t-2} + Z_t + Z_{t-1}$

4) ARIMA(2,1,2)

Figure 4.67: ARIMA(2,1,2) fit summary on Post growth.

Model: $X_t = -0.5078 X_{t-1} + 0.2286 X_{t-2} + Z_t + 0.9365 Z_{t-1} - 0.0635 Z_{t-2}$

5) ARIMA(1,1,0)

Figure 4.68: ARIMA(1,1,0) fit summary on Post growth.

Model: $X_t = 0.3049 X_{t-1} + Z_t$

At this point we have to comment that another possible model would be ARIMA(0,1,1), but its analysis is postponed until the next section of this study.
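To make explicit what these model equations say about the arrivals themselves, the two simplest fits can be rewritten in terms of the undifferenced series. The derivation below assumes that $X_t$ in the model equations denotes the first difference of the original series; the symbol $Y_t$ for the original series is introduced here only for illustration.

```latex
\begin{aligned}
X_t &= \nabla Y_t = Y_t - Y_{t-1}, \\[4pt]
\text{ARIMA}(1,1,0):\quad X_t &= 0.3049\,X_{t-1} + Z_t \\
\Longleftrightarrow\quad Y_t &= 1.3049\,Y_{t-1} - 0.3049\,Y_{t-2} + Z_t, \\[4pt]
\text{ARIMA}(1,1,1):\quad X_t &= -0.2734\,X_{t-1} + Z_t + 0.6147\,Z_{t-1} \\
\Longleftrightarrow\quad Y_t &= 0.7266\,Y_{t-1} + 0.2734\,Y_{t-2} + Z_t + 0.6147\,Z_{t-1}.
\end{aligned}
```

In both cases the autoregressive coefficients on $Y_{t-1}$ and $Y_{t-2}$ sum to one, which is the unit root introduced by the differencing step.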
Model            AIC       BIC
ARIMA(1,1,0)     1156.54   1160.979
ARIMA(1,1,1)     1157.2    1163.854
ARIMA(1,1,2)     1156.08   1164.959
ARIMA(2,1,1)     1155.79   1164.672
ARIMA(2,1,2)     1157.78   1168.875

Table 10. Summary of the ARIMA models on Post growth.

Table 10 provides the AIC and BIC scores of all candidate models. ARIMA(2,1,1) has the best AIC score and ARIMA(1,1,0) has the best BIC score. The differences between the scores are minor, though, and the complexity of the ARIMA(2,1,1) model is something we should consider when choosing the model for forecasting. In the last section of this chapter we plot each model together with the auto ARIMA model and the data to obtain a visual presentation of the goodness of fit of our selected models (figure 4.76).

Next, we proceed with diagnostic checks of the goodness of fit, applied to the residuals. First we plot the residual series of ARIMA(1,1,0) and look for any trends, skewness or other patterns that the model did not capture. The plot shows the residuals moving randomly around zero, looking like white noise (App.). The ACF plot of the residuals (App.) shows no significant correlations: the values lie within the non-significance bounds at all early lags, with some spikes at the limit of non-significance at later lags. We also apply the Box-Ljung test and obtain a p-value of 0.8224, which means that the residuals do not have any significant autocorrelations. They are white noise and the model fits fairly well.

The same route is followed for the residuals of the ARIMA(2,1,1) model. We plot the residuals and the ACF (App.) and conduct a Box-Ljung test, which gives a p-value of 0.9903; the conclusion is that the residuals are white noise. Now we can use the optimal models to forecast.
The forecasts are for the months February to April of the year 2022, a period of h = 3 months. The forecasting output for ARIMA(1,1,0) and the corresponding plot, which exhibits the forecast line and the confidence intervals, are shown in figures 4.69 and 4.70.

Figure 4.69: ARIMA(1,1,0) forecast summary for Post growth season.

Figure 4.70: ARIMA(1,1,0) forecast plot on Post growth season.

The forecasting output for ARIMA(2,1,1) and the corresponding plot are shown in figures 4.71 and 4.72.

Figure 4.71: ARIMA(2,1,1) forecast summary for the Post growth season.

Figure 4.72: ARIMA(2,1,1) forecast plot for the Post growth season.

4.3.4 The Auto ARIMA

In this last part we use the "auto.arima( )" command on the Post growth season data. The R output is shown in figure 4.73 and the model is ARIMA(0,1,1):

$X_t = Z_t + 0.3613 Z_{t-1}$

Figure 4.73: Auto ARIMA fit summary on Post growth.

Furthermore, we obtain the residual plot and the ACF plot of the residuals (App.) of the ARIMA(0,1,1) model, and finally we apply the Box-Ljung test to the residuals. The residual series shows no apparent trend and the residuals move randomly around zero, so they look like white noise; the ACF plot shows no significant correlations at early lags and some spikes on the significance bounds at later lags. The p-value of the Box-Ljung test is 0.8984, hence we do not reject the hypothesis that the residuals are white noise, meaning that this model has a sufficient fit.

We proceed with forecasting the arrivals in the Post growth season. The forecast period is the three months after the end of our time series data, that is, February, March and April of the year 2022. The forecasting output for ARIMA(0,1,1) and the corresponding plot are shown in figures 4.74 and 4.75.

Figure 4.74: Auto ARIMA forecast summary for the Post growth season.
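The Box-Ljung statistic used throughout this chapter is $Q = n(n+2)\sum_{k=1}^{m} r_k^2/(n-k)$, where $r_k$ is the lag-k sample autocorrelation of the residuals; under the white noise hypothesis Q is approximately chi-square distributed. The Python sketch below is a standard-library illustration, not the R routine used in the thesis. The helper names are hypothetical, the degrees of freedom are taken equal to the number of lags (no adjustment for fitted parameters, an assumption), and the closed-form p-value is valid only for an even number of degrees of freedom.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def ljung_box(x, max_lag):
    """Q = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(x)
    r = sample_acf(x, max_lag)
    return n * (n + 2) * sum(rk ** 2 / (n - k) for k, rk in enumerate(r, start=1))


def chi2_sf_even(q, df):
    """Upper tail of the chi-square distribution; exact for even df only."""
    half = q / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Artificial "residuals" with strong negative autocorrelation (not thesis data).
alternating = [1.0, -1.0] * 20
q_stat = ljung_box(alternating, 2)
p_value = chi2_sf_even(q_stat, 2)
print(q_stat, p_value)
```

For this deliberately autocorrelated series the p-value is far below any usual significance level, so the white noise hypothesis is rejected; p-values such as 0.8224 or 0.8984 reported above lead to the opposite conclusion.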
Figure 4.75: Auto ARIMA forecast plot for the Post growth season.

For visual comparison, we produce a summary plot containing the 3 selected models and the time series of the Post growth season data (fig. 4.76).

Figure 4.76: The common plot of the three optimal ARIMA models with the data line.

Post growth season overview

In summary, we form Table 11, which includes the criteria scores of every model fitted on the Post growth time series. We can see that the best fitted model according to the AIC and BIC criteria is ARIMA(0,1,1), the model that R returned for the "auto.arima" command. The 4th degree polynomial has better $R^2$ and adjusted $R^2$ scores than the other two polynomials, as well as better AIC and BIC values. On the other hand, the polynomials have higher AIC and BIC scores than any ARIMA model that we applied to these data. This was expected, as we have already seen that the residuals of the polynomial fits are highly correlated.

As for the ARIMA models, their AIC and BIC scores are very close to each other and we could say that any of these models is as good as the others. ARIMA(0,1,1) is the simplest of all, comparable only to ARIMA(1,1,0), which has a slightly higher AIC than ARIMA(2,1,1) but a lower BIC and should be our choice between those two because of its simplicity. ARIMA(2,1,1) also gives a negative prediction, and we could say it is not safe to use at this point. ARIMA(0,1,1) gives a steady forecast of 109.199 arrivals per month for the next three months, which has to do with the moving average part of the model and the absence of an autoregressive part. So if the selection is between ARIMA(0,1,1) and ARIMA(1,1,0), it may be a better idea to go for the second model, as with this model we avoid identical forecasts for each subsequent month, not to mention that the AR terms adjust better to the data.
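The contrast between the flat ARIMA(0,1,1) forecast and the month-by-month ARIMA(1,1,0) forecast follows directly from the forecast recursions. In the Python sketch below (illustrative only), the coefficients 0.3613 and 0.3049 are those reported above, while the last observation, last residual and last difference are hypothetical values chosen for the example.

```python
def forecast_arima_011(last_value, last_residual, theta, horizon):
    """ARIMA(0,1,1): the MA term affects only the first step, so every
    horizon receives the same forecast last_value + theta * last_residual."""
    level = last_value + theta * last_residual
    return [level] * horizon


def forecast_arima_110(last_value, last_diff, phi, horizon):
    """ARIMA(1,1,0): the forecast difference decays as phi**h, so the
    forecast changes each month and converges geometrically."""
    forecasts = []
    level, diff = last_value, last_diff
    for _ in range(horizon):
        diff = phi * diff
        level = level + diff
        forecasts.append(level)
    return forecasts


# Hypothetical end-of-sample quantities; coefficients as reported in the text.
ma_forecasts = forecast_arima_011(last_value=200.0, last_residual=-40.0,
                                  theta=0.3613, horizon=3)
ar_forecasts = forecast_arima_110(last_value=200.0, last_diff=-50.0,
                                  phi=0.3049, horizon=3)
print(ma_forecasts)
print(ar_forecasts)
```

With a negative last difference the ARIMA(1,1,0) forecasts decline by smaller and smaller steps, the same qualitative shape as the forecast sequence 148.7, 107.1, 94.5 mentioned in the Conclusion.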
Error measures for evaluating the forecasting of the ARIMA models are given in Table 12. The root mean square error of ARIMA(1,1,0) is slightly bigger than that of ARIMA(0,1,1). Other measures for this model (such as the mean absolute percentage error), though, have lower values than those of ARIMA(0,1,1).

Fitted model             AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree polynomial    1262.466   1271.403   0.3124   0.2915
3rd degree polynomial    1264.432   1275.602   0.3127   0.281
4th degree polynomial    1254.649   1268.054   0.4206   0.3844
ARIMA(0,1,1) = AUTO      1155.73    1160.165
ARIMA(1,1,0)             1156.54    1160.979
ARIMA(2,1,1)             1155.79    1164.672

Table 11. Summary of all models fitted on Post growth.

Model          ME          RMSE       MAE        MPE         MAPE       MASE        ACF1
ARIMA(2,1,1)   -17.57609   1093.096   764.4439   -40.29637   74.23449   0.920169    0.001437162
ARIMA(0,1,1)   -17.09396   1142.88    797.7409   -44.5703    80.25684   0.9602489   -0.01503607
ARIMA(1,1,0)   -16.30146   1150.103   788.6504   -44.24847   78.32442   0.9493066   0.0264405

Table 12. Summary of forecasting errors for the ARIMA models on Post growth.

Conclusion

The goal of this thesis was to apply the Box-Jenkins approach to a specific time series from the field of social sciences, to identify and estimate various (S)ARIMA models and to select the most suitable one for forecasting. We compared the goodness of fit of the models using the typical statistical tools: the AIC and BIC criteria and the Box-Ljung test on the residuals. Since the data were divided into three seasons, we applied the B-J method three times and chose the optimal models for each season. We used the selected models to produce monthly forecasts three months ahead for every season. To check the effectiveness of the models, we used the error measures RMSE, ME, MPE and MAPE. Moreover, we fitted polynomials to the data, a method that is often used for estimating the trend component.
The comparison between the fitted polynomials and the ARIMA models showed that the latter are superior in terms of fit, since their AIC and BIC values were significantly better. In some cases the forecast value was negative, which has no real meaning for this time series. Since the 80% and 95% confidence intervals for the forecasted arrivals contain zero, we can safely replace these forecasts with zero. The constant forecast values that appear in some cases have to do with the absence of an autoregressive part from the model.

Knowing the real values for the months that followed each period, we note that for the Pre growth season the forecast values are constant for both selected models and the maximum forecast level is 3267, when in fact the numbers grew rapidly over the next three months, being 2873, 7874 and 13556 respectively. The forecasts for the three months following the Growth season were negative for both models, while the real values recorded for these months were 1721, 1554 and 1920 respectively. For the Post growth season we had constant predictions of 109 per month from one model, negative predictions from a second model, and the values 148.7, 107.1 and 94.5 from the last selected model, when the real values for these months are 464, 008 and 1300 respectively.

Although the predictions based on the Growth and Post growth models could be considered relatively reasonable, the same is not true of the predictions of the Pre growth model (Sections 4.1.2-4.1.4), since the extreme and sudden increase of flows that followed in early 2015 could not have been predicted from the previously collected data, i.e. the Pre growth period. Hence, please note that these particular predictions are presented in this thesis for illustrative purposes only and should be viewed and evaluated with extreme caution.
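The comparison of forecasts with realized values can be quantified with the same error measures used in Chapter 4. The Python sketch below (an illustration with a hypothetical helper name) applies them to the Growth season case, taking the recorded values 1721, 1554 and 1920 from the text and assuming that the negative forecasts have been replaced by zero as described above.

```python
from math import sqrt


def forecast_errors(actual, forecast):
    """ME, RMSE, MAE, MPE and MAPE with errors defined as actual - forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "RMSE": sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": 100 * sum(e / a for e, a in zip(errors, actual)) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }


actual_growth = [1721, 1554, 1920]   # recorded arrivals quoted in the text
forecast_growth = [0.0, 0.0, 0.0]    # negative forecasts replaced by zero

measures = forecast_errors(actual_growth, forecast_growth)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With zero forecasts every percentage error equals 100%, which illustrates that truncation at zero keeps the forecasts meaningful but does not make them informative. Note that these are out-of-sample errors over three months, whereas the measures in Tables 4, 8 and 12 summarize the fitted models, so the two sets of numbers are not directly comparable.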
The variation in the data motivates the division of the whole recorded period into three seasons, as done in this work. The forecasts were made for each season independently. Forecasting a Pre growth season value when we in fact know that the following months bring severe changes in the level of the series, and that a different model is needed to represent the data, means that we stand on an ex-post, ad-hoc basis. In future work, the researcher could use statistical process control charts that detect so-called "special cause" variation in the data, that is, an unusual variation or rapid change in the level, similar to what was presented in this study.

Bibliography

[1] D. Kugiumtzis, Time Series Analysis, Aristotle University of Thessaloniki, 2020
[2] E. Bora-Senta, Ch. Moisiadis, Applied Statistics (in Greek), Ziti Publications, 1995
[3] A. Karagrigoriou, Lecture Notes on Time Series Theory, University of Aegean, 2020
[4] R.J. Hyndman, G. Athanasopoulos, Forecasting: Principles and Practice, 2nd edition, OTexts, 2018
[5] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, G.M. Ljung, Time Series Analysis, 5th edition, Wiley, 2015
[6] R. Harris, R. Solis, Applied Time Series Modeling and Forecasting, Wiley, 2003
[7] P.J. Brockwell, R.A. Davis, Introduction to Time Series and Forecasting, 3rd edition, Springer, 2016
[8] R.H. Shumway, D.S. Stoffer, Time Series Analysis and its Applications With R Examples, 3rd edition, Springer, 2015
[9] J.D. Cryer, K.-S. Chan, Time Series Analysis with Applications in R, 2nd edition, Springer, 2008
[10] W.L. Young, The Box-Jenkins approach to time series analysis and forecasting: principles and applications, RAIRO, Recherche Opérationnelle, Tome 11(2), 129-143, 1977
[11] A.J. Scott, M. Knott, A Cluster Analysis Method for Grouping Means in the Analysis of Variance, Biometrics, 30(3), 507-512, 1974
[12] E.S. Page, Continuous inspection schemes, Biometrika, 41, 100-115, 1954
Web References

[1] UNHCR portal
[2] UNHCR Data finder
[3] UNHCR Global trends
[4] IOM portal for World Migration report 2022
[5] IOM World Migration report 2022, Chapter 1
[6] IOM World Migration Report 2022, Chapter 2
[7] IOM World Migration Report 2022, Chapter 2, Migrant workers
[8] UNHCR portal, Figures at a glance
[9] ESPON 2018, Refugee and asylum seeker flows
[10] ESPON Topic paper: Migration and asylum seekers-ESPON evidences on%20migration.pdf
[11] UNHCR Portal, The sea route to Europe, 2015
[12] UNHCR Portal, The sea route to Europe, 2015, Ch. 2: Rescue at sea, tragedy and response
[13] Refugee crisis factsheet 2017, Hellenic Republic, General Secretariat for Media and Communication
[14] UNHCR The sea route to Europe, 2015, Ch. 3: The rise of the Eastern Mediterranean route: the shift to Greece
[15] UNHCR Aegean Island factsheet, April 2017
[16] UNHCR Data portal
[17] UNHCR Data portal
[18] UNHCR Data portal
[19] UNHCR Data portal
[20] UNHCR Data portal
[21] UNHCR Data portal
[22] UNHCR Data portal
[23] R 'changepoint' package manual:

Appendix

The graphical representations of the residuals and the ACF for the best models for each season are provided in this Appendix.

PRE GROWTH SEASON
1. The residuals series of ARIMA(0,1,0)
2. The ACF plot for the residuals of ARIMA(0,1,0)
3. The residuals series of ARIMA(0,1,1)
4. The ACF plot for the residuals of ARIMA(0,1,1)

GROWTH SEASON
5. The residuals series of ARIMA(1,1,0)
6. The ACF plot for the residuals of ARIMA(1,1,0)
7. The residuals series of ARIMA(2,0,0)
8. The ACF plot for the residuals of ARIMA(2,0,0)

POST GROWTH SEASON
9. The residuals series of ARIMA(1,1,0)
10. The ACF plot for the residuals of ARIMA(1,1,0)
11. The residuals series of ARIMA(2,1,1)
12. The ACF plot for the residuals of ARIMA(2,1,1)
13. The residuals series of ARIMA(0,1,1)
14. The ACF plot for the residuals of ARIMA(0,1,1)

EPIMYTHIUM

The drum of war thunders and thunders. It calls: thrust iron into the living. From every country slave after slave are thrown onto bayonet steel. For the sake of what? The earth shivers hungry and stripped. Mankind is vapourised in a blood bath only so someone somewhere can get hold of Albania. Human gangs bound in malice, blow after blow strikes the world only for someone’s vessels to pass without charge through the Bosporus. Soon the world won’t have a rib intact. And its soul will be pulled out. And trampled down only for someone, to lay their hands on Mesopotamia. Why does a boot crush the Earth — fissured and rough? What is above the battles’ sky Freedom? God? Money! When will you stand to your full height, you, giving them your life? When will you hurl a question to their faces: Why are we fighting?

Vladimir Mayakovsky (1917)