A TIME SERIES ANALYSIS APPROACH TO THE MIGRATION ISSUE

UNIVERSITY OF THE AEGEAN
SCHOOL OF SCIENCES
DEPARTMENT OF STATISTICS and
ACTUARIAL - FINANCIAL MATHEMATICS
MSc in STATISTICS AND DATA ANALYSIS
MASTER THESIS
A TIME SERIES ANALYSIS APPROACH TO THE MIGRATION
ISSUE – THE AEGEAN ROUTE
EVANGELIDIS KONSTANTINOS
2022
SAMOS
UNIVERSITY OF THE AEGEAN
DEPARTMENT OF STATISTICS AND ACTUARIAL-FINANCIAL MATHEMATICS
POSTGRADUATE PROGRAMME IN STATISTICS AND ACTUARIAL-FINANCIAL MATHEMATICS
TRACK: STATISTICS & DATA ANALYSIS
MASTER THESIS
MIGRATION AND REFUGEE FLOWS IN THE NORTH-EASTERN AEGEAN – A STATISTICAL
TIME SERIES APPROACH
EVANGELIDIS KONSTANTINOS
2022
SAMOS
Members of the Three-Member Committee
Alexandros Karagrigoriou (Supervisor)
Christos Koutzakis
Athanasios Rakitzis
To Narges, Sheila, Mohammad,
and all the dear students I have met during this journey.
If it wasn’t for you, I would never be where I stand today.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deepest gratitude to my supervisor, Prof.
Alexandros Karagrigoriou of the Department of Statistics and Actuarial-Financial
Mathematics of the University of the Aegean, for his support, thorough guidance and
patience, and of course for being an excellent teacher throughout my studies in the
Department of Statistics. I am also grateful to Emmanouil-Nektarios Kalligeris, Ph.D.,
for our communication throughout the writing, which was particularly constructive and
helpful. Furthermore, words cannot express my appreciation to everyone who stood by
me, offering moral support and encouragement in the daily difficulties during my
studies; seemingly small help from dear friends is often what determines success or
failure in anything we undertake that requires personal sacrifice and effort. Last but
not least, it would be remiss not to mention my mother. She has always been by my side
since my childhood, no matter how difficult I made it for her along the way.
ABSTRACT
Prediction problems are among the most exciting fields in the sciences. Modeling through
stochastic processes is often used for forecasting purposes, particularly in the field
of finance. The fact that stochastic processes produce time series of data makes the
study of time series particularly useful in our quest to predict quantities that
change with time when randomness is involved. Since we are able to make
quantitative measurements of a phenomenon during its evolution over time, we can
apply time series analysis methods to a truly unlimited range of scientific
applications. In this thesis, we follow the Box-Jenkins approach to time
series analysis and forecasting in an attempt to forecast a social phenomenon, the
refugee and migrant flows through the islands of the East Aegean Sea. We use a
series of time-indexed data recorded from 01/2014 to 01/2022 from the United Nations
High Commissioner for Refugees (UNHCR) database. This is a period in which the
refugee issue became major from a social and political point of view. First, the
theoretical framework is set, then the Box-Jenkins method is presented, and finally we
proceed with the analysis of the data. The goal is to see whether the method we have
chosen is suitable for making forecasts of this specific phenomenon.
SUMMARY
Forecasting problems are among the most exciting fields in the sciences. Modeling
phenomena through stochastic processes is often used for forecasting purposes,
especially in the field of finance. The fact that stochastic processes produce time
series of data makes the study of time series particularly useful when we want to
forecast quantities that change with time and the process involves randomness. Since we
are able to make quantitative measurements of a phenomenon as it evolves over time, we
can apply time series analysis methods to a truly unlimited range of scientific
applications. In this thesis, we follow the Box-Jenkins approach to time series analysis
and forecasting in an attempt to forecast a social phenomenon, the flows of refugees and
migrants through the islands of the North-Eastern Aegean. We use a set of time-indexed
data for the period 01/2014-01/2022 from the database of the United Nations High
Commissioner for Refugees (UNHCR), a period during which the refugee issue became
crucial from a social and political point of view. First the theoretical framework is
introduced, then the Box-Jenkins method of analysis is presented, and finally we apply
this method to the data. The aim is to determine to what extent this method is suitable
for making forecasts of this specific phenomenon.
CONTENTS
1 Introduction
2 The refugee issue and its socio-economic dimension
2.1 Global overview
2.2 The situation in the European Union (EU)
2.3 The eastern passage through the Aegean Sea
3 The Box-Jenkins methodology
3.1 Introduction to time series analysis
3.2 Stationarity
3.3 Components of a non-stationary time series
3.4 Stochastic processes
3.5 The Box-Jenkins approach
4 Application of the Box-Jenkins Method
4.1 Pre Growth season (01/2014 – 02/2015)
4.1.1 Data plots
4.1.2 Fitting polynomials to the Pre growth season data
4.1.3 Box-Jenkins method for fitting (S)ARIMA models
4.1.4 The Auto ARIMA
4.2 Growth season (03/2015 – 04/2016)
4.2.1 Data plots
4.2.2 Fitting polynomials to the Growth season data
4.2.3 Box-Jenkins method for fitting (S)ARIMA models
4.2.4 The Auto ARIMA
4.3 Post growth season (05/2016 – 01/2022)
4.3.1 Data plots
4.3.2 Fitting polynomials to the Post growth season data
4.3.3 Box-Jenkins method for fitting (S)ARIMA models
4.3.4 The Auto ARIMA
4.3.5 Conclusion
Bibliography
Web References
Appendix
LIST OF FIGURES
2.1 Number of refugees globally per year, 2016-2021
2.2 People forced to flee worldwide per year (2012-2022)
2.3 The number of international migrants, 1990-2020
2.4 Top five countries of origin, 2005-2020
2.5 Migrant workers by destination country income level
2.6 Percentage of refugee population as to the average income of country host
2.7 Territorial attractiveness
2.8 Unemployment rate
2.9 The sea routes
2.10 The Aegean Sea arrivals per island, Jan. 2015-Sept. 2015
2.11 The total time series for the recorded arrivals in the Greek islands of East Aegean Sea, 01/2014 – 01/2022
3.1 Strikes in the USA, 1951-1980
3.2 Population of USA at 10-year intervals, 1970-1990
3.3 The monthly accidental deaths data, 1973-1978
4.1 The time series of Pre growth season, 01/2014-02/2015
4.2 The sample ACF plot of the Pre growth time series
4.3 The sample ACF values
4.4 The sample PACF plot for the Pre growth time series
4.5 The sample PACF values
4.6 Fitted 3rd degree polynomial summary
4.7 3rd degree polynomial fit on the Pre growth data plot
4.8 Fitted 4th degree polynomial summary
4.9 4th degree polynomial fit on the Pre growth data plot
4.10 Fitted 5th degree polynomial summary
4.11 5th degree polynomial fit on the Pre growth data plot
4.12 The first difference of the Pre growth series
4.13 The second difference of the Pre growth series
4.14 The third difference of the Pre growth series
4.15 The ACF of the first difference of the Pre growth series
4.16 The PACF of the first difference of the Pre growth series
4.17 ARIMA(0,1,0) fit summary on Pre growth
4.18 ARIMA(0,1,1) fit summary on Pre growth
4.19 ARIMA(1,1,1) fit summary on Pre growth
4.20 Forecast summary of ARIMA(0,1,0) for the Pre growth season
4.21 Forecast graph of ARIMA(0,1,0) for the Pre growth season
4.22 Forecast summary of ARIMA(0,1,1) for the Pre growth season
4.23 Forecast graph of ARIMA(0,1,1) for the Pre growth season
4.24 Auto ARIMA fit summary on Pre growth
4.25 The common plot of the two optimal ARIMA models with the data of Pre growth season
4.26 The time series plot for the Growth season, 03/2015-04/2016
4.27 The sample ACF plot for the Growth season time series
4.28 The sample ACF values
4.29 The sample PACF plot for the Growth season
4.30 The sample PACF values
4.31 2nd degree polynomial fitted on Growth season
4.32 2nd degree polynomial fit on Growth season data plot
4.33 3rd degree polynomial fitted on Growth season
4.34 3rd degree polynomial fit on Growth season data plot
4.35 4th degree polynomial fitted on Growth season
4.36 4th degree polynomial fit on Growth season data plot
4.37 The first difference of the Growth season data
4.38 The ACF plot for the differenced Growth season data
4.39 The PACF plot for the differenced Growth season data
4.40 ARIMA(0,1,0) fit summary on Growth season
4.41 ARIMA(0,1,1) fit summary on Growth season
4.42 ARIMA(1,1,1) fit summary on Growth season
4.43 ARIMA(1,1,0) fit summary on Growth season
4.44 ARIMA(1,1,0) forecast summary on Growth season
4.45 ARIMA(1,1,0) forecast plot on Growth season
4.46 The Auto ARIMA fit summary on Growth season
4.47 The Auto ARIMA forecast summary on Growth season
4.48 The Auto ARIMA forecast plot on Growth season
4.49 The common plot of the two optimal ARIMA models with the data line on Growth season
4.50 The Post growth time series plot
4.51 The sample ACF plot of the Post growth data
4.52 The Post growth sample ACF values
4.53 The Post growth sample PACF plot
4.54 The sample PACF values
4.55 2nd degree polynomial fitted on the Post growth data
4.56 2nd degree polynomial fit on the Post growth data plot
4.57 3rd degree polynomial fitted on Post growth data
4.58 3rd degree polynomial fit on the Post growth plot
4.59 4th degree polynomial fitted on the Post growth data
4.60 4th degree polynomial fit on the Post growth plot
4.61 The first difference of the Post growth data
4.62 The ACF plot for the differenced data of Post growth season
4.63 The PACF plot for the differenced data of Post growth season
4.64 ARIMA(1,1,1) fit summary on Post growth
4.65 ARIMA(1,1,2) fit summary on Post growth
4.66 ARIMA(2,1,1) fit summary on Post growth
4.67 ARIMA(2,1,2) fit summary on Post growth
4.68 ARIMA(1,1,0) fit summary on Post growth
4.69 ARIMA(1,1,0) forecast summary for the Post growth
4.70 ARIMA(1,1,0) forecast plot on Post growth season
4.71 ARIMA(2,1,1) forecast summary for the Post growth
4.72 ARIMA(2,1,1) forecast plot for the Post growth season
4.73 Auto ARIMA fit summary on Post growth
4.74 Auto ARIMA forecast summary for the Post growth season
4.75 Auto ARIMA forecast plot for the Post growth season
4.76 Common plot of the three optimal ARIMA models with the data line
LIST OF TABLES
Table 1 Summary of polynomials fitted on Pre growth
Table 2 Summary of the ARIMA models fitted on Pre growth
Table 3 Summary of all models fitted on Pre growth
Table 4 Summary of forecasting errors for the ARIMA models on Pre growth
Table 5 Summary of polynomials fitted on Growth
Table 6 Summary of the ARIMA models fitted on Growth
Table 7 Summary of all models fitted on Growth
Table 8 Summary of forecasting errors for the ARIMA models on Growth
Table 9 Summary of polynomials fitted on Post growth
Table 10 Summary of the ARIMA models fitted on Post growth
Table 11 Summary of all models fitted on Post growth
Table 12 Summary of forecasting errors for the ARIMA models on Post growth
Appendix
Chapter 1
Introduction
Time series analysis is a way of understanding the mechanism that generates a set of
time-indexed data, finding an appropriate model to represent this mechanism, and using
this model for future predictions. Several questions arise when it comes to forecasting.
Can a given quantity be forecast adequately, or can it not be predicted at all? Is the
selected approach suitable for this forecasting? Do we have enough data to produce an
accurate prediction of future values? Is every observed phenomenon suitable for the
usual methods of analysis and forecasting? Time series analysis assumes that the way in
which a system has changed in the past will remain the same in the future. In the
context of this work, the question is whether we can forecast the social activity of
large groups of human beings, where the complexity of the phenomenon is determined by
countless factors: political, economic, personal, global or local.
This thesis focuses on a method of time series analysis and forecasting based on the
Box-Jenkins (B-J) approach. This approach consists of three stages: identification,
estimation and diagnostic checking. To apply the method, we divide the time series of
arrivals in the islands into three seasons, based on the initial plot of the total
series. The seasons are named Pre Growth, Growth and Post Growth respectively. The
reason for this split is the rapid growth in the observed values that begins in 03/2015;
the values appear to return to their previous level (the Post Growth level) in 05/2016.
The purpose is to have a lower degree of variation within each analyzed time series, on
the assumption that this results in more appropriate models for each season.
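The split described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the thesis: the names `GROWTH_START`, `POST_GROWTH_START`, `month_range` and `season_of` are hypothetical, and the only inputs taken from the text are the two boundary months (03/2015 and 05/2016) and the overall period 01/2014 – 01/2022.

```python
from collections import Counter

# Season boundaries as (year, month), taken from the text:
# rapid growth starts at 03/2015, the post-growth level is reached at 05/2016.
GROWTH_START = (2015, 3)
POST_GROWTH_START = (2016, 5)


def month_range(start, end):
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def season_of(period):
    """Assign a (year, month) pair to one of the three seasons."""
    if period < GROWTH_START:
        return "Pre Growth"
    if period < POST_GROWTH_START:
        return "Growth"
    return "Post Growth"


# Monthly index of the whole series, 01/2014 to 01/2022.
months = list(month_range((2014, 1), (2022, 1)))

# Number of monthly observations that fall in each season.
season_lengths = Counter(season_of(p) for p in months)
print(season_lengths)
```

Under these boundaries the three seasons contain 14, 14 and 69 monthly observations respectively, so the Pre Growth and Growth series are short compared with the Post Growth series.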
The outline of this thesis is as follows. Chapter 2 presents a history of the refugee
crisis for the period analyzed later, based mainly on statistical figures. Chapter 3
presents fundamental theoretical terms that are commonly used in time series analysis
and gives a brief description of the Box-Jenkins methodology. Finally, Chapter 4
proceeds to the analysis of the time series of arrivals to the Greek islands.
Chapter 2
The refugee issue and its socio-economic dimension
2.1 Global overview
The United Nations High Commissioner for Refugees (UNHCR) announced on May 23,
2022 that the number of people forced to flee due to persecution, conflict, violence
and human rights violations had exceeded 100 million for the first time in
history, a negative record that would have seemed unthinkable a few decades ago [1].
This number represents 1 in every 78 people of the global population and includes
refugees, asylum seekers and the 53.2 million people who have been forced to move
within their country’s borders because of conflict. These numbers are the result of
new or protracted conflicts in countries including, among others, Ethiopia, Burkina
Faso, Myanmar, Nigeria, Afghanistan and the Democratic Republic of the Congo. More
recently, the war in Ukraine has forced more than 6 million people to leave the country
and 8 million people to be internally displaced, that is, displaced within the
country’s borders, according to data recorded by UNHCR [1]. Figure 2.1 shows the numbers
recorded by UNHCR for each year of the period 2016-2021. Every column shows an
increasing trend, suggesting that from 2016, a year in which refugee flows to Europe
reached overwhelming figures, until 2021 the global situation in terms of peace,
safety, human rights and living conditions worsened each year. Figure 2.2 visualizes
the data for the decade 2012-2022, in which there is a noticeable upward trend. Also,
according to the International Organization for Migration (IOM), the United Nations
migration agency, the total number of international migrants in the world is estimated
at 281 million, or 3.6% of the global population [4]. This means that the number of
people living in countries other than their country of birth is 128 million more than
in 1990 and over three times the estimated number in 1970, as stated in the
organization’s World Migration Report 2022 [4].
Figure 2.1: Number of refugees globally per year, 2016-2021 (UNHCR data finder, [2])
Figure 2.2: People forced to flee worldwide per year, 2012-2022. (UNHCR Global trends, [3])
Furthermore, the number of international migrants has increased at a greater rate in
Asia and Europe than in other regions, as depicted in figure 2.3. IOM notes a wide
variation between countries in the number of international migrants living in them. In
the United Arab Emirates, for example, 88% of the population are migrants from other
countries. Interestingly, although mobility between countries was reduced in 2020 due
to Covid-19 pandemic restrictions, the number of internally displaced people increased
during 2020, reaching 55 million globally, whereas the same figure was 51 million for
2019 [5]. During the presentation of IOM’s World Migration Report 2022, the
organization’s Director General António Vitorino said: «We are witnessing a paradox not
seen before in human history. While billions of people have been effectively grounded
by COVID-19, tens of millions of others have been displaced within their own countries.»
The causes of this striking increase in the number of refugees, migrants, asylum
seekers and internally displaced people during the last decades, and especially after
1990, can be found in a complex of economic, geopolitical and environmental factors
that affect the lives of literally every single person in today’s globalized world. The
IOM World Migration Report 2022 states that: «Increased competition between States is
resulting in heightened geopolitical tension and risking the erosion of multilateral
cooperation. Economic, political and military power has radically shifted in the last two
decades, with power now more evenly distributed in the international system. As a
result, there is rising geopolitical competition, especially among global powers, often
played out via proxies. The environment of intensifying competition between key
States – and involving a larger number of States – is undermining international
cooperation through multilateral mechanisms, such as those of the United Nations»
[4].
This is an interesting statement, considering that it comes from the United Nations
(UN) agency for migration while, at the same time, the five permanent members of the UN
Security Council are countries that play a major role in international economic and
geopolitical competition. They have had a continuous presence in conflicts around the
globe since the establishment of the UN in 1945, either by means of political influence
and diplomacy, or by military means, in some cases through the intergovernmental
alliance of the North Atlantic Treaty Organization (NATO).
Figure 2.3: The number of international migrants, 1990-2020. (IOM World Migration
Report 2020, [4])
The rapidly changing environmental conditions are another factor that can lead people
to flee to other countries in search of better living conditions. These changes are
related to human activity, through the absence of planned development of the system of
material production and consumption, which leads to industrial overproduction, toxic
waste pollution and uncontrolled energy consumption. As stated in IOM’s 2022 report:
«The intensification of ecologically negative human activity is resulting in
overconsumption and overproduction linked to unsustainable economic growth, resource
depletion and biodiversity collapse, as well as ongoing climate change. Broadly grouped
under the heading of “human supremacy”, there is growing recognition of the extremely
negative consequences of human activities that are not preserving the planet’s
ecological systems. (…) The implications for migration and displacement are significant,
as people increasingly turn to internal and international migration as a means of
adaptation to environmental impacts, or face displacement from their homes and
communities due to slow-onset impacts of climate change.» [4]
Figure 2.4: Top five countries of origin, 2005-2020. (IOM World Migration Report 2022, [6])
Figure 2.4 depicts the number of refugees from the top five countries of origin during
the years 2005-2020. A rapid upward trend in the number of Syrian refugees, beginning in
2011, brings this country to the top of the list, whereas the number of refugees from
Afghanistan remains almost constant each year, placing this Central Asian country in
second position. The political context makes it clear that the war in Syria and the
decades-long conflicts in Afghanistan generated these flows. IOM also estimates [7]
that there were nearly 169 million migrant workers around the world in 2019, which is
62% of the total number of international migrants in 2019 (272 million). Of these
workers, 67% were living in high-income countries, 29% in middle-income countries and
3.6% in low-income countries (figure 2.5). The numbers indicate the expected fact that,
regardless of the reason for displacement, refugees and immigrants prefer countries
with higher incomes, hoping that this will provide them with higher standards of living.
Figure 2.5: Migrant workers by destination country income level. (IOM World Migration
Report 2022, [7])
At the same time, as UNHCR states in its data figures, 83% of refugees are hosted in
low- and middle-income countries (figure 2.6). This reflects the distinction between a
person who is a refugee or an asylum seeker and a migrant. According to UNHCR:
«Migrants choose to move not because of a direct threat of persecution or death, but
mainly to improve their lives by finding work, or in some cases for education, family
reunion, or other reasons. Unlike refugees who cannot safely return home, migrants
face no such impediment to return. If they choose to return home, they will continue
to receive the protection of their government.» This is a point that is often neglected
or misunderstood when the issue is discussed.
In conclusion, migration and the refugee issue is a phenomenon generated by a complex
of causes, including political, economic and environmental factors. It is an
international issue with multiple social and economic dimensions that affects the local
economies of the destination countries by increasing the available labor force, and it
should be addressed with a clear distinction between a person who is a refugee or a
seeker of international protection and a migrant.
Figure 2.6: Percentage of refugee population by average income of the host country.
(UNHCR Figures at a glance, [8])
2.2 The situation in the European Union (EU)
In recent years, Europe has seen the largest flow of migrants and refugees from
countries outside the EU since the end of World War II. These flows peaked in 2015 and
2016, with a significant reduction after 2017. The relative economic and political
stability of the countries of the European Union, compared with African, Asian or
Middle Eastern countries, makes them an attractive destination for people who want to
find better living conditions or who seek international protection, even though this
contrast is closely related to more than a century of historical interaction between
these territories and European countries through colonialism and its consequences.
The most attractive regions, as we might expect, are in destination countries like
Germany or Austria. Greece and Italy are in the middle of the scale, and the least
attractive regions are found in Romania, Serbia and Montenegro [9], as seen in
figure 2.7. This supports the general assumptions about the factors that attract
refugees and migrants to specific areas.
We have to note that, regardless of the expectations a person may have, unemployment
rates are significantly higher for asylum seekers residing in EU countries than for the
native population or for migrants with a university degree who have moved to the same
country. The unemployment rates for refugees [9] differ between states, with Great
Britain having a rate of 15% and Spain a rate of over 50% (figure 2.8).
The routes through which refugees attempt to reach their destination country in the EU
are highly dangerous. Crossing into Europe can be done by land or by sea. The main
border crossing routes towards EU territory are the Eastern Mediterranean Route, the
Western Balkan Route and the Central Mediterranean Route. The central crossing of the
Mediterranean Sea (figure 2.9) has proven to be extremely dangerous: on many occasions
boats have capsized and the passengers did not survive. In October 2013, a boat
carrying hundreds of refugees from Libya to Italy sank near the island of Lampedusa,
killing 368 refugees. Italy then launched a large-scale sea rescue operation named Mare
Nostrum. UNHCR reports (2015) that: «During the first four months of 2015, the numbers
of those dying at sea reached horrifying new heights. Between January and March, 479
refugees and migrants drowned or went missing, as opposed to 15 during the first three
months of the year before. In April the situation took an even more terrible turn. In a
number of concurrent wrecks, an unprecedented 1,308 refugees and migrants drowned or
went missing in a single month (compared to 42 in April 2014), sparking a global
outcry.» [12]. European states held meetings shortly after these incidents and decided
to increase the funding of Frontex, the EU’s border surveillance agency, while member
states offered to deploy naval vessels for patrols and better coverage of the sea
routes. During May and June of 2015, the number of people drowned or missing at sea
fell to 68 and 12 respectively, as a result of these operations [12].
Figure 2.7: Territorial attractiveness. (ESPON 2018, [10])
Figure 2.8: Unemployment rate (U), Long Term (LTU), Very Long Term (VLTU). (ESPON
2018, [9])
Figure 2.9: The sea routes. (UNHCR 2015, [11])
2.3 The Eastern passage through the Aegean Sea
Greece is one of the main gateways to Europe, along with Italy and Spain in the
Mediterranean region. Refugees and immigrants arrive in Greece both through its land
border with Turkey in the north (Evros) and through the Greek-Turkish maritime borders
in the Aegean, which is the route mainly used. During 2015-2016, the peak years of
arrivals for Greece as well as for Europe, more than 1 million refugees arrived by sea
on the islands and more than 6000 arrived through the land borders. For the period
01.01.2016 to 21.11.2016 there were 49792 sea rescues and 765 arrests of smugglers
transporting refugees by boat, and 108 people lost their lives at sea, whereas in 2015
the number of people who drowned was 272 [13]. According to UNHCR reports, during the
first six months of 2015, 68000 refugees arrived on the islands of Lesvos, Chios, Samos
and Kos, and Greece overtook Italy, which had held first place in arrivals during 2014
and received 67500 arrivals in the same period of 2015. A change in the profile of
people arriving as refugees was also noted. The main countries of origin of those
arriving in Italy were Eritrea (25%), Nigeria (10%) and Somalia (10%), followed by Syria
(7%) and Gambia (6%). The main countries of origin of refugees and migrants arriving in
Greece were Syria (57%), followed by Afghanistan (22%) and Iraq (5%) [14].
The 2016 agreement between the EU and Turkey had significant consequences for the
management of this unprecedented refugee crisis by the Greek state, and consequently
for the people who had arrived on the Greek islands. From March 2016, when it first
took effect, the agreement kept the vast majority of the refugees on the Greek islands,
where the Greek state did not have the infrastructure and services to address the basic
needs of the population. As reported in UNHCR’s factsheet of May 2017: «The Aegean
islands have been at the forefront of the 2015/2016 European Refugee Emergency with
over 1 million people arriving in total, the vast majority from refugee producing
countries. Before 20 March 2016, the population was transient, with arrivals remaining
on the islands for a limited time, sometimes hours or a few days, before continuing
their journey. The situation changed after the closure of the so-called ‘Balkans route’
and the implementation of the Joint EU-Turkey Statement of 18 March 2016. Arrivals
decreased significantly, the length of stay on the island increased, and the needs of
the refugee and migrant populations on the islands changed, especially for people and
families with specific needs.» [15]
These facts placed tremendous strain on the local island communities. The reception
conditions in the so-called hotspots were insufficient, and the situation worsened as
thousands of people accumulated in small towns, unable to move to the mainland or
elsewhere. The same UNHCR factsheet of 2017 reports that: «… challenges with
overcrowding and insecurity remain, and substandard conditions must still be improved
in some locations, notably on Chios due to recent overcrowding. Protection risks for
people staying on the islands continue, particularly the risk of sexual and
gender-based violence. Children, including unaccompanied children, remain in inadequate
shelter with insufficient access to formal or non-formal education, which also severely
impacts their psychosocial well-being.» [15]
The situation gradually de-escalated, partly because the Greek Government and other EU
governments took specific measures to relocate the population, and partly because the
armed conflicts in countries like Afghanistan and Syria stopped or decreased.
Nevertheless, the Aegean islands had 29718 arrivals in 2017 and 32494 in 2018, a
flare-up in 2019 with 59726 arrivals [16], followed by a rapid decline in 2020 and
2021, when the number of arrivals was 9714 and 4331 respectively [17].
Among the islands of the East Aegean, Lesvos received the highest number of arrivals.
In December 2015, a UNHCR factsheet reported that, up to that time, 59% of total
arrivals by sea in Greece had passed through Lesvos. The total arrivals from January to
24 December 2015 were 487964 people. The average daily arrivals during the preceding 7
days were estimated at 1968 per day. The total arrivals during December were 47243
people [18]. From January to November 2017, Lesvos accounted for 42% of total arrivals
in Greece by sea: 11570 asylum-seekers and migrants were recorded reaching the island,
and the total number of sea arrivals in Greece during that period was 27354 [19].
Between January and November 2018, 47% of the total arrivals in Greece by sea were in
Lesvos (13945 people), and the total number of sea arrivals in Greece was 29567 [20].
For 2019, until December, Lesvos received 40% of the total arrivals, with 23861
arrivals recorded [21]. The majority of the asylum-seekers and migrants arriving in the
Greek islands of the East Aegean Sea over the whole period 2015-2021 were from Syria,
Afghanistan, Iraq and the Democratic Republic of the Congo. Typically these
nationalities arrive in family groups, although a large number of unaccompanied minors,
mainly from Afghanistan, was recorded in the Reception and Identification Centres of
the Greek State [UNHCR factsheets 2017, 2018].
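The percentages quoted for Lesvos in 2017 and 2018 can be checked against the counts given in the same paragraph. The short Python sketch below is only an illustrative arithmetic check; the helper name `share_percent` is hypothetical, and the four counts are the ones quoted above from [19] and [20].

```python
def share_percent(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


# January-November 2017: arrivals in Lesvos vs. all sea arrivals in Greece [19].
share_2017 = share_percent(11570, 27354)

# January-November 2018: arrivals in Lesvos vs. all sea arrivals in Greece [20].
share_2018 = share_percent(13945, 29567)

print(round(share_2017, 1), round(share_2018, 1))
```

The two shares round to 42% and 47%, in agreement with the reported figures.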
Figure 2.10: The Aegean Sea arrivals per island, Jan. 2015-Sept. 2015 [22]
In this work we attempt a time series analysis to model the number of
migrants arriving in Greece (through all the islands and the Evros border) for the period
01/2014 – 01/2022. The entire time series is depicted in figure 2.11; it will be analyzed
in Chapter 4 with the Box-Jenkins methodology presented in Chapter 3.
(Data collected from the Operational Data Portal of UNHCR for the Mediterranean
Situation, https://data.unhcr.org/en/situations/mediterranean/location/5179).
Figure 2.11: The time series for the recorded arrivals in the Greek islands of East Aegean Sea,
01/2014 – 01/2022.
Chapter 3 The Box-Jenkins methodology
3.1 Introduction to time series analysis
A time series is a set of observations $\{x_1, x_2, x_3, \dots, x_n\}$ recorded sequentially over
time. We suppose that each observation is a realized value of a specific random
variable $X_t$. Therefore, we may consider the time series to be the realized values of a
sequence $\{X_t\}$ of random variables indexed by time $t$, where $x_1$ is the observed value
at time point 1, $x_2$ is the observed value at time point 2, and so on. In general, we
refer to a collection $\{X_t\}$ of random variables indexed by time $t$ as a stochastic
process. Hence the observed time series can be considered as a realization of a specific
stochastic process. In this study, we will use the term time series whether we are
referring to the stochastic process or to a particular realization of it, and $t$ will take
discrete integer values $0, \pm 1, \pm 2, \pm 3, \dots$. In other cases, however, the set $T$ of
times at which we record the observations can be a continuous interval, e.g. $T = [0,1]$.
In that case we say that we have a continuous-time time series.
Time series occur in the field of economics, where we can have monthly national
unemployment figures, inflation rates registered over equal time periods, annual
GDP figures etc. In epidemiology, an example of a time series is the daily
record of covid-19 deaths observed in a specific geographical area. In medicine, a
patient's blood sugar measurements traced over time could be useful for evaluating
the influence of a specific drug on treating diabetes. In environmental sciences, time
series arise from records of average monthly temperature or yearly rainfall. In
the stock market, daily stock prices produce a time series. Time series analysis applies
to a diverse list of scientific fields: practically anything that we observe sequentially
over time is a time series and can be analyzed as such.
Graphically, we display a sample time series by plotting the values of the random
variables on the vertical axis, or ordinate, with the time scale as the abscissa.
Typically, we connect the values at adjacent time points, producing visually a
hypothetical continuous time series that could have produced these values as a
discrete sample. Examples of time series plots can be seen in figures 3.1 - 3.3.
Figure 3.1 Strikes in the USA , 1951 – 1980 . [Brockwell – Davis , 2016]
Figure 3.2 Population of USA at 10-year intervals , 1790-1990 .[Brockwell – Davis , 2016]
Figure 3.3 The monthly accidental deaths data , 1973-1978 . [Brockwell – Davis , 2016]
The purpose of time series analysis is primarily to find a satisfactory probability model
to represent the data. This will help to understand the stochastic process that
produces the observed time series. Once the model is developed, it could be used for
prediction purposes.
3.2 Stationarity
Definition 1. A time series $\{X_t\}$, $t \in T$, is said to be strictly stationary if the joint
distribution $F$ of $X_{t_1}, X_{t_2}, \dots, X_{t_n}$ is independent of the time origin:
$$F(X_{t_1}, X_{t_2}, \dots, X_{t_n}) = F(X_{t_1+k}, X_{t_2+k}, \dots, X_{t_n+k}),$$
where $k, n \in \mathbb{N}$ and $t_1, t_2, \dots, t_n \in T$.
According to Definition 1 we have:
$$F(X_t) = F(X_{t+k}) = F(X_0),$$
which means that the cumulative distribution function is independent of $t$, and so
the mean $E X_t = \mu$ is independent of $t$. Also we have:
$$F(X_t, X_s) = F(X_{t+k}, X_{s+k}) = F(X_0, X_{s-t}),$$
which means that the joint distribution of $X_t, X_{t+k}$ does not depend on $t$; in
other words, observations that are the same distance apart have the
same joint distribution.
Definition 2. A time series $\{X_t\}$, $t \in T$, is said to be weakly stationary (stationary of
2nd order) if the mean and the covariance are independent of time $t$, which means:
$$E X_t = \mu, \quad t \in T$$
and
$$\mathrm{Cov}(X_t, X_s) = E[(X_t - \mu)(X_s - \mu)] = \gamma_{|t-s|}, \quad t, s \in T.$$
Because strict stationarity is a very restrictive condition that is difficult to satisfy,
whenever we use the term "stationary time series" we will mean weakly stationary.
From all the above, we get the following assumptions whenever we set out to
define an appropriate model:
1) $E X_t = \mu$ and $\mathrm{Var}(X_t) = \sigma_x^2$, $t \in T$,
meaning that the mean and the variance are constant in time $t$.
2) $\mathrm{Cov}(X_t, X_{t+k}) = \gamma_k$, $t \in T$,
meaning that the covariance between two observations of the time series depends only
on the lag $k$ between their time points.
Definition 3. Let $\{X_t\}$ be a stationary time series. The autocovariance function
(ACVF) of $\{X_t\}$ at lag $k$ is:
$$\gamma_k = \mathrm{Cov}(X_t, X_{t+k}) = E[(X_t - \mu)(X_{t+k} - \mu)]$$
By this definition, we find:
$$\gamma_k = \mathrm{Cov}(X_t, X_{t+k}) = \mathrm{Cov}(X_{t-k}, X_t) = \mathrm{Cov}(X_t, X_{t-k}) = \gamma_{-k}$$
and
$$\gamma_0 = \sigma_x^2$$
We also define the autocorrelation function (ACF):
$$\rho_k = \frac{\gamma_k}{\gamma_0}, \quad k = 0, \pm 1, \pm 2, \dots$$
from which we get $\rho_0 = 1$ and $\rho_k = \rho_{-k}$.
In practice, we do not calculate the ACF starting from a model; we use a finite set of
observed data $\{x_1, x_2, \dots, x_n\}$ and calculate the sample autocorrelation function
(sample ACF). The sample equivalents of the above quantities are used as estimators
for inferential purposes. Thus, we obtain an estimate of the extent of the
dependence in the data, which is one of the most important tools we have for
modeling purposes.
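As an illustration of the sample quantities just described, the following minimal Python sketch computes the sample ACVF and ACF of a short series. It assumes the common estimator with divisor n; the function names and the data are hypothetical and are not the data of this thesis.

```python
def sample_acvf(x, k):
    """Sample autocovariance at lag k (divisor n, an assumed convention)."""
    n = len(x)
    k = abs(k)                      # gamma_hat(-k) = gamma_hat(k)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n


def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat(0), ..., rho_hat(max_lag)."""
    g0 = sample_acvf(x, 0)
    return [sample_acvf(x, k) / g0 for k in range(max_lag + 1)]


# Hypothetical monthly counts, for illustration only.
series = [12, 15, 14, 18, 21, 19, 25, 24, 28, 30]
acf = sample_acf(series, 3)
print([round(r, 3) for r in acf])
```

The first value is always 1, matching $\rho_0 = 1$, and the symmetry $\rho_k = \rho_{-k}$ is handled by taking the absolute value of the lag.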
3.3 Components of a non-stationary time series
The first step in the analysis of a time series should always be to plot the data. If we
find any apparent outlying observations, we need to examine them carefully to
understand whether they were caused by mistakes during the data recording and decide
whether or not we need to discard them. We also check the magnitude of the
fluctuations and try to see whether the variance changes with the level of the time
series. If so, we need to apply a transformation to the data, e.g. take the transformed
time series $\{\ln x_1, \ln x_2, \dots, \ln x_n\}$, in which the magnitudes are more limited,
or use the Box-Cox transformation. By observing the plot we
can also see whether there is a trend and a seasonal component. A trend exists if we notice a
long term increase or decrease in the observed values. A repeated pattern in the plot
implies the presence of a seasonal component. If both components exist in the plot
(which means that the time series is not stationary), we will represent our data with
the classical decomposition model
$$X_t = m_t + s_t + Y_t \qquad (1)$$
where
• $m_t$ is a slowly changing function of time $t$, which can be of deterministic or
stochastic nature and is known as the trend component
• $s_t$ is a periodic function with period $d$, the seasonal component. We note
that $s_t = s_{t-d}$
• $Y_t$ is the random noise component, which is stationary
The goal of the researcher is to estimate the trend and seasonal components and to
eliminate them from the original series . If the noise component 𝑌𝑡 that remains after
this elimination is a stationary time series , we can find a satisfactory model to describe
the process and its properties and by following the reverse route , we can combine it
with the estimated 𝑚𝑡 and 𝑠𝑡 components to compose a model that fits our original
data . Then we can use this model to forecast future values of 𝑋𝑡 . For methods of
estimation the interested reader could refer to the book by Cryer & Chan (2008) .
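The estimate-and-eliminate idea can be sketched in a few lines of Python. The sketch below assumes one standard approach (a centered moving-average filter for the trend and per-position averages for the seasonal component); this is an illustrative choice and not necessarily the estimation method used elsewhere in this work. The function names and the synthetic series are hypothetical.

```python
def moving_average_trend(x, d):
    """Centered moving-average trend estimate; None where the window does not fit."""
    n, q = len(x), d // 2
    trend = [None] * n
    for t in range(q, n - q):
        window = x[t - q:t + q + 1]
        if d % 2 == 0:
            # Even period: half weights at the two ends of the window.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / d
    return trend


def seasonal_component(x, trend, d):
    """Average detrended value per position in the period, centered to sum to zero."""
    sums, counts = [0.0] * d, [0] * d
    for t, m in enumerate(trend):
        if m is not None:
            sums[t % d] += x[t] - m
            counts[t % d] += 1
    w = [s / c for s, c in zip(sums, counts)]
    w_bar = sum(w) / d
    return [wk - w_bar for wk in w]


# Synthetic series, for illustration only: linear trend plus a period-4 pattern.
d = 4
pattern = [3.0, -1.0, -4.0, 2.0]            # sums to zero over one period
x = [10.0 + 2.0 * t + pattern[t % d] for t in range(16)]

trend = moving_average_trend(x, d)
seasonal = seasonal_component(x, trend, d)
noise = [x[t] - trend[t] - seasonal[t % d]
         for t in range(len(x)) if trend[t] is not None]
```

For this noise-free example the filter recovers the linear trend and the seasonal pattern, so the remaining component is zero; with real data the remainder would be the series $Y_t$ to be modeled.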
3.4 Stochastic processes
We have already defined a time series to be the realization of a specific stochastic
process. Therefore, we need to define the basic stochastic models that are commonly
used to describe the process which generated the data. Finding a model that fits
our data satisfactorily gives us the ability to proceed with forecasting future values
of the time series. We note again that by the term time series we refer
both to the data set and to the stochastic process that we consider to have generated
these data.
a) The time series $\{X_t\}$, $t \in \mathbb{Z}$, is called a time series of independent and
identically distributed (iid) random variables if it consists of independent
random variables that have the same distribution. An iid time series is
completely random and does not contain any correlations (linear or not)
between its observations. The independence of the random variables indicates
that we cannot extract any further information from the analysis of the series.
b) A time series that consists of random variables that are uncorrelated but
not necessarily independent is not an iid time series. We will refer to a
time series of this kind as white noise with mean 0 and variance $\sigma_x^2$ and we
will use the notation
$$\{X_t\} \sim WN(0, \sigma_x^2)$$
Additionally, if the random variables of the white noise have a normal
distribution, the time series is called Gaussian white noise.
c) The random walk is a non-stationary time series model $\{X_t\}$ in which every
random variable $X_t$ is obtained from the previous one, $X_{t-1}$, by adding a random
step $Y_t$, where $\{Y_t\}$ is iid noise with zero mean and variance $\sigma_Y^2$. This process is
denoted as
$$X_t = X_{t-1} + Y_t$$
If we start from $X_0 = 0$ and repeatedly replace the random variables $X_{t-1}, X_{t-2}, \dots$ using the
definition of the random walk, we get
$$X_t = \sum_{k=1}^{t} Y_k, \qquad Y_k : \text{iid noise}$$
It is easy to see that the random walk has mean $E(X_t) = 0$ and variance
$\mathrm{Var}(X_t) = \sum_{k=1}^{t} E(Y_k^2) = t\sigma_Y^2$. The variance increases with time $t$, indicating that the random
walk is not a stationary time series. We note that if we apply first differencing
to a random walk, we get the stationary iid time series $\{Y_t\}$.
d) An autoregressive process of order p, denoted as AR(p), is of the form
$$X_t = \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \dots + \varphi_p X_{t-p} + Z_t, \qquad Z_t \sim WN(0, \sigma_z^2)$$
where $X_t$ is stationary, $\varphi_1, \varphi_2, \dots, \varphi_p$ are constants ($\varphi_p \neq 0$) and
we have taken the mean of $X_t$ to be zero. If the mean is not zero, we
write $X_t - \mu$ instead of $X_t$ in the formula above.
By using the backshift operator $B$ (with $B X_t = X_{t-1}$), we can write the AR(p) model as follows
$$(1 - \varphi_1 B - \varphi_2 B^2 - \dots - \varphi_p B^p) X_t = Z_t$$
or more concisely as
$$\Phi(B) X_t = Z_t,$$
where
$$\Phi(B) = 1 - \sum_{i=1}^{p} \varphi_i B^i$$
is the AR(p) operator.
We also define the AR(p) characteristic polynomial as $\Phi(z) = 1 - \sum_{i=1}^{p} \varphi_i z^i$,
where z is a complex number.
It can be proved that the AR(p) process is stationary when the roots of the AR(p) polynomial lie
outside the unit circle. The idea behind autoregressive models is that the current
value of the series, $x_t$, can be explained as a function of the p past values
$x_{t-1}, x_{t-2}, \dots, x_{t-p}$ with the addition of white noise. The linear combination of $x_i$ for
$i = t-1, \dots, t-p$ can be considered as the deterministic part of this model and $Z_t$ as the
stochastic part.
e) A moving average model of order q, denoted as MA(q), is defined to be
$$X_t = Z_t - \theta_1 Z_{t-1} - \theta_2 Z_{t-2} - \dots - \theta_q Z_{t-q}, \qquad Z_t \sim WN(0, \sigma_z^2)$$
where $\theta_1, \theta_2, \dots, \theta_q$ are parameters. By using the backshift operator, we can write the
MA(q) model as
$$X_t = (1 - \theta_1 B - \theta_2 B^2 - \dots - \theta_q B^q) Z_t$$
or more concisely as
$$X_t = \Theta(B) Z_t,$$
where
$$\Theta(B) = 1 - \sum_{i=1}^{q} \theta_i B^i$$
is the MA(q) operator.
We also define the MA(q) characteristic polynomial as
$\Theta(z) = 1 - \sum_{i=1}^{q} \theta_i z^i$, where z is a complex number.
The moving average process is stationary for any values of the parameters, since it is a
finite sum of white noise terms.
f) A process $\{X_t\}$ is an autoregressive moving average (ARMA) series if it is
stationary and
$$X_t = \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \dots + \varphi_p X_{t-p} + Z_t - \theta_1 Z_{t-1} - \theta_2 Z_{t-2} - \dots - \theta_q Z_{t-q}$$
where $\varphi_p \neq 0$, $\theta_q \neq 0$. The parameters p and q are the autoregressive and the
moving average orders respectively. We have assumed that $X_t$ has zero mean. If
$X_t$ has a nonzero mean, we write $X_t - \mu$ instead of $X_t$ in the formula above.
Since the process consists of an AR(p) part and an MA(q) part, we refer to this model
as the ARMA(p,q) model. The AR part determines whether the series is stationary, so if the roots of
the AR polynomial are outside the unit circle, the ARMA(p,q) is stationary. We
note that an ARMA(p,0) model is in fact an AR(p) model, while an ARMA(0,q) is an
MA(q) model.
ARMA models are very important for representing time series data, but they can be
applied only to a stationary time series. If the time series becomes stationary
after differencing, we have the class of autoregressive integrated
moving average (ARIMA) models described below.
g) A process $\{X_t\}$ is an ARIMA(p, d, q) if
$$\nabla^d X_t = (1 - B)^d X_t$$
is ARMA(p, q), with d being the order of differencing. Observe that all the previous
models are special cases of ARIMA(p, d, q). To obtain appropriate values of p, d, q
from a data set, we will use the Box-Jenkins approach presented in the next section.
h) If $d$ and $D$ are nonnegative integers, then $\{X_t\}$ is a seasonal
ARIMA$(p, d, q) \times (P, D, Q)_s$ process with period $s$ if the differenced series
$Y_t = (1 - B)^d (1 - B^s)^D X_t$ is a causal ARMA process defined by
$$\varphi(B)\Phi(B^s) Y_t = \theta(B)\Theta(B^s) Z_t, \qquad Z_t \sim WN(0, \sigma_z^2)$$
where $\varphi(z) = 1 - \varphi_1 z - \dots - \varphi_p z^p$, $\Phi(z) = 1 - \Phi_1 z - \dots - \Phi_P z^P$,
$\theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q$ and $\Theta(z) = 1 + \Theta_1 z + \dots + \Theta_Q z^Q$.
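The random walk of item c) and the differencing operators used in items g) and h) can be illustrated together. The following Python sketch is only an illustration under stated assumptions: the function `difference` is a hypothetical helper, the noise is simulated Gaussian noise with unit variance, and the quadratic series is synthetic.

```python
import random


def difference(x, d=1, lag=1):
    """Apply the operator (1 - B^lag) to the series d times."""
    for _ in range(d):
        x = [x[t] - x[t - lag] for t in range(lag, len(x))]
    return x


random.seed(0)
steps = [random.gauss(0.0, 1.0) for _ in range(200)]   # iid noise Y_1, ..., Y_n

# Random walk X_t = X_{t-1} + Y_t started from X_0 = 0.
walk, level = [], 0.0
for y in steps:
    level += y
    walk.append(level)

# First differencing of the walk returns the noise Y_2, ..., Y_n.
recovered = difference(walk)

# A quadratic trend is removed by differencing twice (d = 2).
quadratic = [3 + 2 * t + t * t for t in range(10)]
print(difference(quadratic, d=2))
```

Calling `difference(x, d=D, lag=s)` corresponds to the seasonal operator $(1 - B^s)^D$ of item h).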
Before we move to the final part of this chapter, we give the definition of a
function that plays an important role in finding candidate ARMA models to fit our
data. In an autoregressive process AR(p), the autocorrelation of $X_t$ and $X_{t-h}$
for $h > p$ is nonzero, since they are correlated through the random variables that lie
between them, $X_{t-1}, \dots, X_{t-h+1}$. We want to find the direct correlation between
them, after removing the linear effect of the intermediate variables
$X_{t-1}, \dots, X_{t-h+1}$. This correlation is defined as
$$\mathrm{Corr}(X_t, X_{t-h} \mid X_{t-1}, \dots, X_{t-h+1})$$
and is called the partial autocorrelation.
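The partial autocorrelation function (PACF) of an ARMA process $\{X_t\}$ is the function
$a(\cdot)$ defined by
$$a(0) = 1$$
and
$$a(h) = \varphi_{hh}, \quad h \geq 1$$
where $\varphi_{hh}$ is the last component of
$$\boldsymbol{\Phi}_h = \boldsymbol{\Gamma}_h^{-1} \boldsymbol{\gamma}_h,$$
with $\boldsymbol{\Gamma}_h = [\gamma(i - j)]$, $i, j = 1, \dots, h$, and $\boldsymbol{\gamma}_h = [\gamma(1), \gamma(2), \dots, \gamma(h)]'$,
where $\gamma(k) = \mathrm{Cov}(X_{t+k}, X_t)$ is the autocovariance function.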
For a set of observations $\{x_1, \dots, x_n\}$ with $x_i \neq x_j$ for some $i$ and $j$, the sample
PACF $\hat{a}(h)$ is given by
$$\hat{a}(0) = 1$$
$$\hat{a}(h) = \hat{\varphi}_{hh}, \quad h \geq 1$$
where $\hat{\varphi}_{hh}$ is the last component of
$$\hat{\boldsymbol{\Phi}}_h = \hat{\boldsymbol{\Gamma}}_h^{-1} \hat{\boldsymbol{\gamma}}_h$$
Statistical packages can do the computations for the sample PACF and provide us with
a plot in a similar fashion as for the ACF.
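To make the definition concrete, the sketch below computes the PACF from a given list of autocorrelations. Instead of inverting $\boldsymbol{\Gamma}_h$ directly, it uses the Durbin-Levinson recursion, which yields the same $\varphi_{hh}$; this is a swapped-in technique chosen for brevity, and the function name is hypothetical.

```python
def pacf_from_acf(rho):
    """PACF values a(0), ..., a(H) from autocorrelations rho[0..H], rho[0] == 1.

    Uses the Durbin-Levinson recursion, which gives the last component of the
    solution of Gamma_h * phi_h = gamma_h for each h.
    """
    H = len(rho) - 1
    pacf = [1.0]
    prev = []                       # phi_{h-1,1}, ..., phi_{h-1,h-1}
    for h in range(1, H + 1):
        num = rho[h] - sum(prev[j - 1] * rho[h - j] for j in range(1, h))
        den = 1.0 - sum(prev[j - 1] * rho[j] for j in range(1, h))
        phi_hh = num / den
        # phi_{h,j} = phi_{h-1,j} - phi_hh * phi_{h-1,h-j}
        cur = [prev[j - 1] - phi_hh * prev[h - j - 1] for j in range(1, h)]
        cur.append(phi_hh)
        pacf.append(phi_hh)
        prev = cur
    return pacf


# Theoretical ACF of an AR(1) process with coefficient 0.6: rho_k = 0.6 ** k.
rho_ar1 = [0.6 ** k for k in range(5)]
pacf_ar1 = pacf_from_acf(rho_ar1)
print([round(a, 6) for a in pacf_ar1])
```

For the AR(1) example the PACF equals the coefficient at lag 1 and vanishes at higher lags, which is the cut-off behaviour used later to choose the order p. Passing the sample ACF instead of the theoretical one gives the sample PACF.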
3.5 The Box-Jenkins approach
The Box-Jenkins approach is a method of time series analysis and forecasting that
aims to define a proper statistical model ARIMA(p, d, q) that adequately represents
the stochastic process that produced our data. There are three stages in the
Box-Jenkins approach: identification, estimation and diagnostic checking.
a) Identification
In this stage we choose an initial set of values for the parameters p, d, q. The basic
tools in this procedure are the sample ACF and PACF. If the sample ACF plot decays
rapidly to within the significance limits, the time series is most probably stationary,
in which case we choose d = 0. On the other hand, if the ACF plot decays slowly with
the lag, the series is not stationary and it is necessary to apply differencing in order
to obtain stationarity. If we apply first order differencing then d = 1, and so on. Next, we
choose the parameters p and q by using the plots of the sample ACF and PACF.
b) Estimation
Using a non-linear technique that minimizes the sum of squared
errors, we obtain estimates of the coefficients $\varphi_1, \dots, \varphi_p$ and $\theta_1, \dots, \theta_q$ for our
ARIMA(p,d,q) model. If the model does not contain an MA part, we can use the least
squares method.
c) Diagnostic checking
In this stage, we conduct a number of checks of the goodness of fit of the selected
model. If the fit is poor, we modify the model. We check the statistical significance
of the model's coefficients, the standard errors of the estimates and the confidence
intervals. Box and Jenkins also suggest checking the goodness of fit by applying
tests to the residuals of the fitted model. If the fitted model is satisfactory, the
residuals should behave like white noise, and this is what we want to see when applying
the diagnostic tests. The information criteria AIC, BIC and AICc are commonly used as
we repeat the procedure for a variety of competing p and q values. The model with
the smallest value of a given criterion is preferred.
After selecting the optimal model to fit our data, we proceed with forecasts.
Suppose we have the time series $\{x_1, x_2, \dots, x_n\}$ generated from the stochastic
process $\{X_t\}$ and we want to obtain the forecast $x_n(k)$ of the time series for the
future time point $t = n + k$. The true value at that time, which is unknown to us,
is $x_{n+k}$. The prediction error is $e_n(k) = x_{n+k} - x_n(k)$.
In fact, the forecast value $x_n(k)$ is an estimate of $X_{n+k}$ of the process $\{X_t\}$.
Since this is a stochastic process, the optimal forecast is
$$X_n(k) = E[X_{n+k} \mid X_n, X_{n-1}, \dots]$$
We want to have
• unbiasedness of the forecast,
$$E[X_n(k)] = E[X_{n+k}]$$
• efficiency, meaning a small variance for the prediction error,
$$\mathrm{Var}[e_n(k)] = \mathrm{Var}[X_{n+k} - X_n(k)]$$
Our goal is to have a forecast that minimizes the mean squared prediction error
$$E[(X_{n+k} - X_n(k))^2]$$
for any k.
If we believe that the time series $\{x_1, x_2, \dots, x_n\}$ is the realization of an AR(p) process,
then $x_{n+1} = \varphi_1 x_n + \dots + \varphi_p x_{n-p+1} + z_{n+1}$.
The optimal forecast for 1 time step will be
$$x_n(1) = \varphi_1 x_n + \dots + \varphi_p x_{n-p+1}$$
and the corresponding prediction error
$$e_n(1) = z_{n+1}$$
The optimal forecast for $k$ time steps will be
$$x_n(k) = \varphi_1 x_n(k-1) + \dots + \varphi_p x_n(k-p)$$
where each value $x_n(j)$ is known either from a previous forecast or from the time series
data.
The prediction error will be
$$e_n(k) = \sum_{j=0}^{k-1} b_j z_{n+k-j}$$
and also
$$\mathrm{Var}[e_n(k)] = \sigma_z^2 \sum_{j=0}^{k-1} b_j^2$$
where the $b_j$ are the coefficients of the infinite moving average representation of the
process, with $b_0 = 1$.
If we believe that $\{x_1, x_2, \dots, x_n\}$ is the realization of an MA(q) process, then the
next observation will be
$$x_{n+1} = z_{n+1} + \theta_1 z_n + \dots + \theta_q z_{n-q+1}$$
The optimal forecast for 1 time step will be
$$x_n(1) = \theta_1 z_n + \dots + \theta_q z_{n-q+1}$$
and the corresponding prediction error
$$e_n(1) = z_{n+1}$$
For k time steps, the forecast will be
$$x_n(k) = \begin{cases} \theta_k z_n + \theta_{k+1} z_{n-1} + \dots + \theta_q z_{n-q+k}, & \text{if } k \leq q \\ 0, & \text{if } k > q \end{cases}$$
If we consider $\{x_1, x_2, \dots, x_n\}$ to be the realization of an ARMA(p, q) process,
then
$$x_{n+1} = \varphi_1 x_n + \dots + \varphi_p x_{n-p+1} + z_{n+1} + \theta_1 z_n + \dots + \theta_q z_{n-q+1}$$
When $\{x_1, x_2, \dots, x_n\}$ is given, the optimal prediction for 1 time step is
$$x_n(1) = \varphi_1 x_n + \dots + \varphi_p x_{n-p+1} + \theta_1 z_n + \dots + \theta_q z_{n-q+1}$$
and the prediction error
$$e_n(1) = z_{n+1}$$
For k time steps the optimal forecast will be
$$x_n(k) = \begin{cases} \varphi_1 x_n(k-1) + \dots + \varphi_p x_n(k-p) + \theta_k z_n + \dots + \theta_q z_{n-q+k}, & k \leq q \\ \varphi_1 x_n(k-1) + \dots + \varphi_p x_n(k-p), & k > q \end{cases}$$
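The forecast recursions for AR, MA and ARMA processes can be collected in one short Python sketch. It assumes a zero-mean process, known coefficients and known past residuals $z_1, \dots, z_n$, and the sign convention of this section (moving average terms entering with a plus sign). The function name and the numbers are hypothetical.

```python
def arma_forecast(x, z, phi, theta, steps):
    """Forecasts x_n(1), ..., x_n(steps) for a zero-mean ARMA(p, q).

    x: observations x_1..x_n, z: residuals z_1..z_n (same length as x),
    phi, theta: lists of AR and MA coefficients with the convention
    X_t = sum(phi_i X_{t-i}) + Z_t + sum(theta_j Z_{t-j}).
    Requires len(x) >= len(phi) and len(z) >= len(theta).
    """
    n = len(x)
    values = list(x)                # observations, then forecasts as they are made
    for k in range(1, steps + 1):
        # AR part: past values come from the data or from earlier forecasts.
        ar = sum(phi[i] * values[n + k - 2 - i] for i in range(len(phi)))
        # MA part: only residuals up to time n contribute (future noise has mean 0).
        ma = sum(theta[j] * z[n + k - 2 - j]
                 for j in range(len(theta)) if j + 1 >= k)
        values.append(ar + ma)
    return values[n:]


# AR(1) with phi = 0.5: forecasts decay geometrically from the last value.
print(arma_forecast([1.0, 2.0], [0.0, 0.0], [0.5], [], 3))
# MA(1) with theta = 0.5: forecasts are zero beyond q = 1 steps.
print(arma_forecast([0.3, -0.2], [0.1, 0.4], [], [0.5], 2))
```

Setting `theta = []` reproduces the AR(p) recursion and setting `phi = []` reproduces the MA(q) case, so the single function covers the three cases above.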
To measure the accuracy of the forecasts we use a number of statistical measures
based on the prediction errors $e_t$, the original series $x_t$ and the number of
observations n. These are:
• Mean Squared Error (MSE)
$$MSE = \frac{1}{n} \sum_{t=1}^{n} e_t^2$$
• Root Mean Squared Error (RMSE)
$$RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} e_t^2}$$
• Mean Absolute Error (MAE)
$$MAE = \frac{1}{n} \sum_{t=1}^{n} |e_t|$$
• Mean Absolute Percentage Error (MAPE)
$$MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{e_t}{x_t} \right|$$
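The four measures can be computed directly from their definitions. The following Python sketch uses a hypothetical helper name and made-up numbers; MAPE assumes that no observed value $x_t$ is zero.

```python
import math


def forecast_accuracy(actual, forecast):
    """MSE, RMSE, MAE and MAPE from observed values and their forecasts."""
    errors = [a - f for a, f in zip(actual, forecast)]   # e_t = x_t - forecast
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the observed values, so they must be nonzero.
    mape = 100.0 / n * sum(abs(e / a) for e, a in zip(errors, actual))
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape}


# Made-up values, for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 380.0]
measures = forecast_accuracy(actual, forecast)
print({name: round(value, 3) for name, value in measures.items()})
```

MSE and RMSE penalize large errors more heavily than MAE, while MAPE is scale-free and therefore convenient when comparing periods with very different levels of arrivals.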
Chapter 4 Application of the Box-Jenkins method
The time series in figure 2.11 has been divided into three seasons (periods), namely:
• The Pre Growth Season (01/2014 – 02/2015)
• The Growth Season (03/2015 – 04/2016)
• The Post Growth Season (05/2016 – 01/2022)
which are analyzed in this Chapter.
The break points for the three periods have been chosen by combining visual
inspection of the data with Change Point Analysis [12]. For the latter, using the
(approximate) Binary Segmentation method [11], we identified the end of the Pre
Growth Season as being at 02/2015.
The same method identified a short period of 10 observations, from 5/2019 to
2/2020, as another (short) outbreak period. We have chosen, mainly due to the very
limited number of observations, to ignore this recommendation and retain this period
within the Post Growth Season. As for the end of the Growth Season, the method
proposed 2/2016. However, since the value associated with this time point
was extremely high (compared to all later values), we have chosen to delay the end
of the Growth Season by two more months, until 4/2016, believing that by that time the
influence of the outbreak would have fully subsided. Hence, the Growth
Season has been chosen as above, namely from 3/2015 to 4/2016, with the Post Growth
Season lasting for 69 months starting at 5/2016. For this analysis we have used the
changepoint package of R
(https://cran.r-project.org/web/packages/changepoint/index.html).
4.1 Pre growth season (01/2014 – 02/2015)
4.1.1 Data plots
The first step is to plot the data of the Pre growth time series (fig. 4.1), the sample
autocorrelation function (ACF) and the partial autocorrelation function (PACF) (figs. 4.2 - 4.5).
We notice that the plot shows an upward trend followed by a downward trend,
with a peak in September 2014, and does not contain any apparent periodic
component. The sample ACF graph shows only one significant spike, at lag 1.
It is important to use a more objective measure to determine whether there is a trend or
not. We apply the augmented Dickey-Fuller (ADF) unit root test, for which we find
a p-value of 0.1514, so we cannot reject the null hypothesis of non-stationarity at the usual
significance levels.
Figure 4.1 : The time series of Pre growth season , 01/2014 -02/2015
Figure 4.2 : The sample ACF plot of the Pre growth time series .
Figure 4.3 : The sample ACF values .
Figure 4 .4 : The sample PACF plot for the Pre growth time series .
Figure 4.5 : The sample PACF values .
4.1.2 Fitting polynomials to the Pre growth season data
When a time series is not stationary, one of the first steps of the exploratory
analysis is to remove the trend component. One way to do this is by differencing.
Differencing removes the deterministic trend component but does not provide an
estimate of it. An estimate of the trend can be obtained by fitting polynomials in
time t to the data. The coefficients can be calculated with the ordinary least
squares method, and the detrending is done by subtracting the estimated trend values
from the original data for every t. In the following examples, in every polynomial
equation, t represents the month index since the start of the season, taking values 1, 2, 3, …, and
X(t) represents the monthly number of arrivals.
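The least squares fit and the detrending step can be sketched in plain Python as follows. The fits reported in this chapter were obtained with R; the code below is only an illustration of the technique, solving the normal equations by Gaussian elimination, with hypothetical function names and synthetic data.

```python
def polyfit_ols(t, x, degree):
    """Least squares coefficients [b0, b1, ..., b_degree] of a polynomial in t."""
    m = degree + 1
    # Normal equations (A^T A) b = A^T x for the design matrix A[i][j] = t_i ** j.
    ata = [[sum(ti ** (r + c) for ti in t) for c in range(m)] for r in range(m)]
    atx = [sum((ti ** r) * xi for ti, xi in zip(t, x)) for r in range(m)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, atx)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coef = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(aug[r][c] * coef[c] for c in range(r + 1, m))
        coef[r] = (aug[r][m] - tail) / aug[r][r]
    return coef


def polyval(coef, t):
    """Evaluate b0 + b1 t + b2 t^2 + ... at t."""
    return sum(b * t ** j for j, b in enumerate(coef))


# Synthetic data lying exactly on a quadratic, for illustration only.
t = list(range(1, 9))
x = [2.0 + 3.0 * ti - 0.5 * ti * ti for ti in t]
coef = polyfit_ols(t, x, 2)
detrended = [xi - polyval(coef, ti) for ti, xi in zip(t, x)]
print([round(b, 6) for b in coef])
```

The normal equations become ill-conditioned for high degrees and long time ranges, so statistical software typically relies on more stable decompositions; for the low degrees and short series of this section the simple approach is adequate as an illustration.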
Using R, we fit a 3rd degree polynomial in time t to the data of the Pre
growth season. This provides us with an estimate of the trend component of the
time series. The fitted polynomial, as we see in figure 4.6, is:
$$X(t) = X_t = -15.462t^3 + 251.741t^2 - 498.774t + 923.062$$
The p-values for the coefficients lead us to accept the null hypothesis of non-significance.
The $R^2$ and adjusted $R^2$ are 0.6428 and 0.5357 respectively. The residual standard
error is 1602. The Mean Square Error (MSE) is 15400613. The information criteria values
are found to be 251.6347 and 254.83 for AIC and BIC respectively. We also apply the
Box-Ljung test to the residuals to check for correlation between them and
determine whether they are white noise. The p-value of the test is 0.1685, hence at all the
usual levels of significance we cannot reject the hypothesis that the residuals are white
noise. Finally, for a visual presentation of the goodness of fit, we plot the fitted
polynomial and the original data on the same graph (fig. 4.7).
Figure 4.6 Fitted 3rd degree polynomial summary .
Figure 4.7 3rd degree polynomial fit on the Pre growth data .
The next candidate polynomial for fitting our data is a 4th degree polynomial. The R
output for the fitted model can be seen in figure 4.8. The fitted polynomial is:
$$X(t) = X_t = 4.75t^4 - 157.97t^3 + 1659.85t^2 - 5588.31t + 5906.73$$
The p-values for the coefficients' significance show an improved fit of the 4th
degree polynomial compared to the previous model. The MSE is 13807342 and the
residual standard error is 1360. The values of the information criteria AIC
(= 247.5674) and BIC (= 251.4017) are better than those of the previous model,
and the Box-Ljung test has a p-value of 0.1938, so we cannot reject the null hypothesis that the
time series of the residuals is white noise. A joint plot of the Pre growth data
and the fitted polynomial of 4th degree is shown in figure 4.9.
Figure 4.8 : fitted 4th degree polynomial summary .
Figure 4.9 : 4th degree polynomial fit on the Pre growth data .
The fitting of a 5th degree polynomial is our next and last attempt. The fitted
polynomial is:
$$X(t) = X_t = 1.712t^5 - 59.4508t^4 + 715.1661t^3 - 3540.4416t^2 + 7232.8518t - 3573.014$$
The output of the model summary is presented in figure 4.10. The t statistics and
p-values for the coefficients show that they are statistically significant at all the
usual levels of significance. The residual standard error is 705.6, a great
improvement over the previous models, the MSE decreases to 13577691, and
the multiple and adjusted $R^2$ statistics improve as well. The F statistic
and the p-value for the model's significance indicate a good fit. The AIC and BIC
values are 229.5478 and 234.0212 respectively, much better than those of
the other two polynomials. The p-value of the portmanteau test is 0.2708, so we cannot
reject the null hypothesis: the residuals are not correlated and the model does not
show a lack of fit.
Figure 4.10 : fitted 5th degree polynomial summary .
Figure 4.11 : 5th degree polynomial fit on the Pre growth data .
The summary of the AIC, BIC, $R^2$ and adjusted $R^2$ values for all three
polynomials fitted to the Pre growth time series (Table 1) indicates that the 5th degree
polynomial fits much better than the other two, since it has the minimum score in
both AIC and BIC and significantly better $R^2$ and adjusted $R^2$ values. The 5th
degree polynomial seems the better choice if we want to estimate the trend component
of the time series. It can also provide a prediction of future values of arrivals $X_t$. For
example, if we take the noise at t = 15 to be zero (its mean value), then we get
an estimate of the arrivals for March 2015 equal to:
$$X_{15} = 1.712 \cdot 15^5 - 59.4508 \cdot 15^4 + 715.1661 \cdot 15^3 - 3540.4416 \cdot 15^2 + 7232.8518 \cdot 15 - 3573.014 = 12359.24$$
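The arithmetic of this extrapolation can be checked with a few lines of Python. The coefficients are those of the fitted 5th degree polynomial reported above; the evaluation by Horner's scheme and the function name are illustrative choices.

```python
# Coefficients of the fitted 5th degree polynomial, highest power first.
coefficients = [1.712, -59.4508, 715.1661, -3540.4416, 7232.8518, -3573.014]


def horner(coefs, t):
    """Evaluate a polynomial given by its coefficients (highest power first)."""
    result = 0.0
    for c in coefs:
        result = result * t + c
    return result


x15 = horner(coefficients, 15)
print(round(x15, 2))   # 12359.24
```

Because the value lies outside the range t = 1, …, 14 used for the fit, it is an extrapolation, and a high degree polynomial can change rapidly beyond the fitted range.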
We note that this polynomial, having a higher degree than the other two
models, loses in terms of simplicity, and this is something we should consider along
with the rest of the criteria when choosing the estimator of the trend.
At the same time, the added complexity is not severe and the polynomial has the best values
of AIC and BIC, as shown in Table 1.
Polynomial fitted    AIC         BIC         R^2       Adjusted R^2
3rd degree           251.6347    254.83      0.6428    0.5357
4th degree           247.5674    251.4017    0.7685    0.6655
5th degree           229.5478    234.0212    0.9446    0.91
Table 1. Summary of polynomials fitted on Pre growth.
4.1.3 Box-Jenkins method for fitting (S)ARIMA models to the data
We have already plotted the Pre growth time series and the sample ACF and PACF.
The ADF test p-value for the original Pre growth data is 0.1514, so we cannot reject
non-stationarity, which suggests the presence of a trend in the time series. To eliminate
that trend, we apply 1st order differences (fig. 4.12) and repeat the ADF test on the
differenced series. With a p-value of 0.1279, at all the usual levels of significance we
fail to reject the null hypothesis of non-stationarity.
Figure 4.12 : The first difference of Pre growth series .
We apply 2nd order differences to the Pre growth time series (fig. 4.13). We find the
p-value of the ADF test to be 0.4403, higher than the p-value for the 1st differences,
so we proceed by applying 3rd order differences to the original time series. The plot
of the differenced data (fig. 4.14) shows larger fluctuations and even some
upward trend. The ADF p-value is 0.7118, even higher than before.
Figure 4.13 : The second difference of the Pre growth series .
Figure 4.14 : The third difference of the Pre growth series .
Not wishing to proceed further with the differencing (due to the small number of
observations), we consider a significance level of 12.79% and continue with the first
differenced data. We have achieved partial trend elimination. In figure 4.12 of the
1st order differences, we do not recognize any obvious patterns or trends, and
there is only one point lying at some distance from the central line around which all the
others are gathered. We do not see any cycles in the series. We plot the sample ACF
and PACF (figures 4.15 and 4.16) and we see no significant values at any lag in either
graph. We do not see any pattern in the ACF and PACF plots that would indicate the presence
of a seasonal component in the series.
Figure 4.15 : The ACF of the first difference of the Pre growth series .
Figure 4.16 : The PACF of the first difference of the Pre growth series .
Based on the sample ACF and PACF plots of the 1st order differences, we can decide
the order of our ARIMA model. For our model, d = 1, since we took 1st
differences to transform the original time series towards stationarity. We choose the
parameters p = 0 and q = 0, so we fit an ARIMA(0,1,0) model. Additionally,
several alternative models will be considered: we will try models that have p and q
values close to our primary selection.
1) ARIMA (0,1,0)
Figure 4.17 : ARIMA(0,1,0) fit summary on Pre growth .
Model (with $Y_t = X_t - X_{t-1}$ the first-differenced series):
$$Y_t = Z_t$$
2) ARIMA (0,1,1)
Figure 4.18 : ARIMA(0,1,1) fit summary on Pre growth.
Model (with $Y_t = X_t - X_{t-1}$ the first-differenced series):
$$Y_t = Z_t + 0.3198 Z_{t-1}$$
3) ARIMA (1,1,1)
Figure 4.19: ARIMA(1,1,1) fit summary on Pre growth .
Model (with $Y_t = X_t - X_{t-1}$ the first-differenced series):
$$Y_t = 0.3232 Y_{t-1} + Z_t + 0.1076 Z_{t-1}$$
Fitted model      AIC       BIC
ARIMA (0,1,0)     229.11    229.6754
ARIMA (0,1,1)     229.11    230.2421
ARIMA (1,1,1)     230.61    232.3002
Table 2. Summary of the ARIMA models fitted on Pre growth.
Table 2 contains a summary of all the candidate models' scores for the AIC and BIC
criteria. ARIMA(0,1,0) gets the best BIC score and has the same AIC score as
ARIMA(0,1,1), better than ARIMA(1,1,1), which seems to be the worst. The
difference between the scores, though, is minor and both models with the best score
have a simple structure. We plot each of the two models along with the data so that
we can check visually the goodness of fit of both models (figure 4.25).
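The AIC and BIC values of Table 2 can be cross-checked against each other using the standard definitions AIC = -2 ln L + 2k and BIC = -2 ln L + k ln n, which give BIC = AIC + k (ln n - 2). The Python sketch below assumes that n = 13 (the 14 monthly observations minus one lost to differencing) and that k counts the ARMA coefficients plus the innovation variance; both are assumptions about how the software counted, not statements taken from the R output.

```python
import math


def bic_from_aic(aic, k, n):
    """BIC implied by an AIC value for k estimated parameters and n observations."""
    return aic + k * (math.log(n) - 2.0)


n = 13   # assumed: 14 monthly observations, one lost to first differencing

# (AIC, BIC) as reported in Table 2, and the assumed parameter count k.
table2 = {
    "ARIMA(0,1,0)": (229.11, 229.6754, 1),
    "ARIMA(0,1,1)": (229.11, 230.2421, 2),
    "ARIMA(1,1,1)": (230.61, 232.3002, 3),
}

gaps = {}
for name, (aic, bic, k) in table2.items():
    gaps[name] = abs(bic_from_aic(aic, k, n) - bic)
    print(name, round(bic_from_aic(aic, k, n), 4), bic)
```

Under these assumptions the implied BIC values agree with the reported ones up to the rounding of the AIC values to two decimals, and the check shows how BIC penalizes each additional coefficient more heavily than AIC once ln n exceeds 2.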
We proceed with goodness of fit diagnostic tests. The tests are applied to the
residuals of the ARIMA(0,1,0) fit. First we plot the residual series and look
for any trends, skewness, or other patterns that the model did not capture (App.).
The plot shows the residuals moving randomly around zero, the variance
does not appear to increase over time and their sequence looks like white noise. This
is supported by the ACF plot of the residuals, which shows no significant spikes at any
lag (App.). The p-value of the Box-Ljung test is found to be 0.08455, so the hypothesis
of uncorrelated residuals is not rejected at the usual 1% and 5% levels of
significance. The same conclusions apply to the ARIMA(0,1,1) fit, for which the Box-Ljung
p-value is 0.611. The relevant graphs are given in the Appendix.
Both models fit our data fairly well. We can now use the best two models
of Table 2, namely ARIMA(0,1,0) and ARIMA(0,1,1), for
forecasting future values of arrivals for the Pre growth season. Our forecasts cover
a period of h = 3 months after the end of the season. The forecasting
output for the model ARIMA(0,1,0) is shown in figure 4.20. In figure 4.21 we have the
corresponding plot, in which we can see the forecasting line and also the 80% and 95%
confidence intervals for the forecasts, highlighted in light blue and dark blue
respectively.
It should be noted that the forecasts for the Pre growth season are presented only
for illustrative purposes since, as we already know, this season is followed by an
outbreak which, due to its unnatural behavior, cannot be predicted by the model of
the Pre growth season.
Figure 4.20: Forecast summary of ARIMA(0,1,0) for the Pre growth season .
Figure 4.21: Forecast graph of ARIMA(0,1,0) for the Pre growth season.
The forecasting output for the model ARIMA(0,1,1) and the corresponding plot are
shown in figures 4.22 and 4.23.
Figure 4.22 : Forecast summary of ARIMA(0,1,1) for the Pre growth season.
Figure 4.23: Forecast graph of ARIMA(0,1,1) for the Pre growth season .
4.1.4 The Auto ARIMA
In this section we look at the output of the automated method, obtained with the
“auto.arima(.)” command in R. R fits the optimal ARIMA model according to the standard
criteria without us having to choose the parameters p and q. We can compare it with the
previous models and use it for forecasting. The output is shown in figure 4.24 and the
selected model is ARIMA(0,1,0):
$Y_t = Z_t$.
This is a model that we have already applied to our data. For a direct comparison of
the two best-fitting models, we plot them together with the Pre growth data (fig. 4.25).
Figure 4.24 : Auto ARIMA fit summary on Pre growth .
Figure 4.25: The common plot of the two optimal ARIMA models with the data of Pre
growth season .
Pre growth overview
In summary, Table 3 contains all the models fitted to our Pre growth data. We can see
right away that ARIMA(0,1,0) and ARIMA(0,1,1) have the same AIC value but ARIMA(0,1,0)
has a slightly better BIC value. The 5th degree polynomial has much better $R^2$ and
adjusted $R^2$ scores than the other two polynomials, as well as better AIC and BIC.
The 3rd and 4th degree polynomials have worse AIC and BIC scores than any of the ARIMA
models that we have applied, but the 5th degree polynomial has an AIC value very close
to that of the best fitted ARIMA model. As for the ARIMA models, since two of them have
the same AIC value, we could say that either one is as good as the other. In Table 4 we
see that the root mean square error of ARIMA (0,1,0) is higher than the corresponding
error of ARIMA (0,1,1). We must note that under ARIMA (0,1,0) the differenced series,
which shows no significant sample ACF and PACF values at any lag, is an ARMA (0,0)
process, that is, random iid noise, and so the original time series is a random walk.
ARIMA (0,1,1), on the other hand, has no AR component: its first differences form an
MA(1) process, and it gives the same constant forecast for each of the forecasting
months.
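The constant forecasts mentioned above follow directly from the structure of the two
models. The sketch below illustrates this under stated assumptions: no drift term, the
MA(1) sign convention $X_t = Z_t + \theta Z_{t-1}$ used in this thesis, and purely
hypothetical input values (last observation, residual, sigma and theta are not taken
from the R output).

```python
import math

def random_walk_forecast(last_value, sigma, horizon, z=1.96):
    """ARIMA(0,1,0) without drift: flat point forecast, interval widening like sqrt(h)."""
    out = []
    for h in range(1, horizon + 1):
        half_width = z * sigma * math.sqrt(h)  # forecast variance grows linearly in h
        out.append((last_value, last_value - half_width, last_value + half_width))
    return out

def ima11_forecast(last_value, theta, last_residual, horizon):
    """ARIMA(0,1,1) without drift: the same point forecast at every horizon."""
    level = last_value + theta * last_residual
    return [level] * horizon

# Hypothetical inputs, for illustration only.
rw = random_walk_forecast(last_value=1000.0, sigma=200.0, horizon=4)
ima = ima11_forecast(last_value=1000.0, theta=0.3, last_residual=-50.0, horizon=3)
print(rw)
print(ima)
```

In both cases the point forecast does not change with the horizon; only the uncertainty
around it grows, which is why the forecast lines of both models are horizontal.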
Fitted model            AIC        BIC        $R^2$    Adjusted $R^2$
3rd degree polynomial   251.6347   254.83     0.6428   0.5357
4th degree polynomial   247.5674   251.4017   0.7685   0.6655
5th degree polynomial   229.5478   234.0212   0.9446   0.91
ARIMA (0,1,1)           229.11     230.2421
ARIMA (1,1,1)           230.61     232.3002
ARIMA (0,1,0) = AUTO    229.11     229.6754
Table 3. Summary of all models fitted on Pre growth.
Model           ME         RMSE       MAE        MPE         MAPE       MASE        ACF1
ARIMA (0,1,1)   125.0664   1336.781   832.2897   2.490061    26.49729   0.7769472   0.12253
ARIMA (1,1,1)   116.91     1306.821   820.2339   4.245509    25.58179   0.765693    0.05033
ARIMA (0,1,0)   137.0682   1449.574   994.7825   0.0286448   31.66417   0.9286351   0.41553
Table 4. Summary of forecasting errors for the ARIMA models on Pre growth.
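For reference, the error measures of Table 4 can be written out explicitly. The sketch
below computes ME, RMSE, MAE, MPE and MAPE with errors defined as actual minus fitted
value, which is our assumption about the convention behind the table; MASE and ACF1 are
omitted, the function name is hypothetical, and the numbers are a toy example rather
than thesis data.

```python
import math

def forecast_errors(actual, predicted):
    """Basic accuracy measures; errors are defined as actual - predicted."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    pct = [100.0 * e / a for e, a in zip(errors, actual)]  # percentage errors
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": sum(pct) / n,
        "MAPE": sum(abs(q) for q in pct) / n,
    }

# Toy values, for illustration only.
measures = forecast_errors([100.0, 200.0, 400.0], [110.0, 190.0, 380.0])
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Note that ME and MPE let positive and negative errors cancel, so they measure bias,
while RMSE, MAE and MAPE measure the size of the errors.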
4.2 Growth season (03/2015 - 04/2016)
4.2.1 Data plots
We plot the data and observe the features of the graph (fig. 4.26). We distinguish two
obvious trends: an upward trend with an increasing slope until the series reaches its
peak, and a downward trend after the peak that leads the series to a continuous
decrease until the end of this period. Both trends seem to follow an exponential or
quadratic curve, so we assume that 1st or 2nd order differences may be needed.
Since our season lasts only 14 months, by definition there is no seasonality, and we do
not see any cycles either. There are no obvious outliers. The data do not exhibit
increasing fluctuations as the level of the series increases. In the ACF plot we see a
sinusoidal pattern declining to zero, with significant spikes at lags 1, 5 and 6. In
the PACF plot we see one significant spike at lag 1, suggesting that the series may
become stationary after taking the 1st order difference of the original series. We
apply the ADF test and, with a p-value of 0.7845, higher than all the usual
significance levels, we fail to reject the null hypothesis of non-stationarity.
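The remark that a quadratic-looking trend may call for 2nd order differences can be
illustrated with a small sketch. The helper name is hypothetical and the series is an
artificial quadratic, not the arrivals data.

```python
def difference(series, order=1):
    """Apply `order` rounds of first differencing: y_t = x_t - x_{t-1}."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

# Artificial quadratic trend t^2 for t = 1..7 (illustration only).
quadratic = [t * t for t in range(1, 8)]
first = difference(quadratic, order=1)
second = difference(quadratic, order=2)
print(first)
print(second)
```

The first differences of the quadratic still increase linearly, while the second
differences are constant: each round of differencing lowers the degree of a polynomial
trend by one, at the cost of one observation.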
Figure 4.26: The time series plot for the Growth season , 03/2015-04/2016 .
Figure 4.27 : The sample ACF plot for the Growth season time series .
Figure 4.28: The sample ACF values
Figure 4.29 : Sample PACF plot for the Growth season time series .
Figure 4.30 : The sample PACF values
4.2.2 Fitting polynomials to the Growth season data .
We start by fitting a 2nd degree polynomial with respect to time t to the data. The R
output for the model is shown in figure 4.31. The fitted polynomial is:
$X(t) = X_t = -3513.9t^2 + 55261.6t - 87759.9$
The p-values of the coefficients are lower than the 5% significance level, hence the
coefficients are significant. $R^2$ and adjusted $R^2$ are 0.7 and 0.646 respectively.
The residual standard error is 38110 and the MSE is $1.45 \cdot 10^9$.
The values of the AIC and BIC information criteria are 339.7011 and 342.2573
respectively. The Box-Ljung test for the residuals gives a p-value of 0.039, which
means that we fail to reject the null hypothesis of no autocorrelation at the 1%
significance level but reject it at the 5% level. The fitted polynomial together with
the data is shown in figure 4.32.
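As a small worked example, the fitted quadratic can be used to locate the peak that
the trend estimate implies. The sketch assumes that t = 1 denotes the first month of
the season, which the text does not state explicitly, and the function name is
hypothetical.

```python
# Coefficients of the fitted 2nd degree polynomial reported above.
a, b, c = -3513.9, 55261.6, -87759.9

def fitted_quadratic(t):
    """Value of the fitted trend a*t^2 + b*t + c at time index t."""
    return a * t * t + b * t + c

# A downward-opening parabola (a < 0) attains its maximum at t = -b / (2a).
t_peak = -b / (2.0 * a)
peak_value = fitted_quadratic(t_peak)
print(f"peak at t = {t_peak:.2f}, fitted value = {peak_value:.0f}")
```

The vertex falls close to t = 8, that is, near the middle of the season. A single
symmetric parabola can describe the rise and the fall only roughly, which is in line
with the moderate $R^2$ of this model.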
Figure 4.31 : 2nd degree polynomial fitted on Growth season .
Figure 4.32: 2nd degree polynomial fit on Growth season data plot .
Next, we fit a 3rd degree polynomial to the data. The results of the R output appear in
figure 4.33. The fitted polynomial is:
$X(t) = X_t = -276.8t^3 + 2715t^2 + 16587.2t - 31284.7$
The p-values of the coefficients are higher than any usual significance level, a sign
that this model does not fit well. The residual standard error is 36450, the MSE is
$1.33 \cdot 10^9$, $R^2$ is 0.75 and adjusted $R^2$ is 0.676, both slightly improved.
The AIC and BIC values are 339.1266 and 342.3219 respectively.
The Box-Ljung test p-value is 0.059, hence we fail to reject the null hypothesis that
the residuals are white noise. In summary, the model seems to have a worse fit than the
previous model.
Figure 4.33 : 3rd degree polynomial fitted on Growth season .
Figure 4.34: Fitted 3rd degree polynomial on Growth season data .
Fitting a 4th degree polynomial gives the results of figure 4.35. The fitted polynomial
is:
$X(t) = X_t = 139.2t^4 - 4453.9t^3 + 43988.4t^2 - 132593.6t + 114793.1$
The coefficients appear to be significant at the usual 5% level of significance. The
residual standard error improves to 24800 and the MSE is $6.148 \cdot 10^8$. We note a
significant improvement in $R^2$ and adjusted $R^2$, as expected since we have added
another predictor to our model. The AIC and BIC values are 328.86 and 332.69
respectively. The Box-Ljung test p-value is 0.6287, hence the residuals are consistent
with white noise. Overall, the model has a much better goodness of fit than the
previous models. The fitted polynomial is plotted against our data in figure 4.36.
Figure 4.35: 4th degree polynomial fitted on Growth season .
Figure 4.36: 4th degree polynomial fit on Growth season data plot.
The results are summarized in Table 5. The 4th degree polynomial has the best fit among
the three models, as it scores much better in both AIC and BIC and has significantly
better $R^2$ and adjusted $R^2$. It therefore seems the better choice if we want to
estimate the trend component of the time series, although a higher degree means more
predictors and a less simple model. We note that we also tried a 5th degree polynomial:
compared with the 4th degree polynomial, it had a lower adjusted $R^2$ (0.84) and
slightly higher AIC and BIC values (329.6 and 334.12 respectively), and the high
p-values of its coefficients indicated that they are not significant.
Polynomial    AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree    339.7011   342.2573   0.701    0.6466
3rd degree    339.1266   342.3219   0.7512   0.6765
4th degree    328.8602   332.6945   0.8964   0.8504
Table 5. Summary of polynomials fitted on Growth.
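The relation between the two $R^2$ columns of Table 5 can be checked with the usual
formula, adjusted $R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)$, where p is the polynomial
degree. The sketch below assumes n = 14 monthly observations for the Growth season
(03/2015 to 04/2016 inclusive); this value is our inference, and the function name is
hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a regression with p predictors fitted on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n_obs = 14  # ASSUMED number of monthly observations in the Growth season
r2_by_degree = {2: 0.701, 3: 0.7512, 4: 0.8964}  # R^2 values from Table 5

adjusted = {deg: adjusted_r2(r2, n_obs, deg) for deg, r2 in r2_by_degree.items()}
for deg, value in adjusted.items():
    print(f"degree {deg}: adjusted R^2 = {value:.4f}")
```

With this n the formula reproduces the adjusted values of Table 5 to rounding. The
penalty factor (n - 1)/(n - p - 1) grows quickly with p when n is this small, so a
higher degree improves the adjusted $R^2$ only if the gain in $R^2$ is substantial.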
4.2.3 Box - Jenkins method for fitting ARIMA models
We have noted before that there is a trend in the Growth season data (ADF p-value =
0.7845), so we take 1st order differences to detrend the series (fig. 4.37). The ADF
test on the differenced series gives a p-value of 0.3, so we fail to reject the null
hypothesis of non-stationarity. The sample ACF and PACF plots show no significant
spikes at any lag (figures 4.38 and 4.39), a sign that the 1st differences are white
noise and the original time series is a random walk; hence the data are dependent and
not identically distributed. Applying 2nd and 3rd order differences does not improve
the situation, as the ADF p-values become 0.59 and 0.52 respectively. Therefore, we do
not continue with differencing. We choose a significance level of 0.3 and continue the
analysis with the first order differenced data, achieving a partial elimination of the
trend.
Figure 4.37 : The first difference of the Growth season data .
Figure 4.38: The ACF plot for the differenced Growth season data .
Figure 4.39: The PACF plot for the differenced Growth season data .
We choose to fit ARIMA models having as initial values d = 1 , p = 0 , q = 0 . We will also
try to fit models with parameters close to these values .
1) ARIMA (0,1,0)
Figure 4.40: ARIMA(0,1,0) fit summary on Growth season.
Model:
$X_t = Z_t$
2) ARIMA (0,1,1)
Figure 4.41: ARIMA(0,1,1) fit summary on Growth season.
Model:
$X_t = Z_t + 0.2968Z_{t-1}$
3) ARIMA (1,1,1)
Figure 4.42: ARIMA(1,1,1) fit summary on Growth season.
Model:
$X_t = 0.4883X_{t-1} + Z_t - 0.0681Z_{t-1}$
4) ARIMA (1,1,0)
Figure 4.43: ARIMA(1,1,0) fit summary for the Growth season.
Model:
$X_t = 0.4326X_{t-1} + Z_t$
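To show how the ARIMA (1,1,0) model turns into forecasts, the sketch below iterates the
fitted recursion. It assumes that the model equation above describes the first
differences of the series and that there is no drift term; the function name and the
two input observations are hypothetical, and only the coefficient 0.4326 comes from the
fit.

```python
def arima110_forecast(series, phi, horizon):
    """Point forecasts for ARIMA(1,1,0) without drift.

    The predicted difference shrinks by a factor phi at each step and is
    added to the previous level.
    """
    level = series[-1]
    diff = series[-1] - series[-2]
    forecasts = []
    for _ in range(horizon):
        diff = phi * diff
        level = level + diff
        forecasts.append(level)
    return forecasts

# Two hypothetical last observations; phi is the AR coefficient reported above.
path = arima110_forecast([90.0, 100.0], phi=0.4326, horizon=3)
print(path)
```

Unlike the flat forecasts of models without an AR part, the forecasts here keep moving
in the direction of the last observed change, with steps that shrink geometrically.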
Table 6 summarizes the information criteria of our fitted models. ARIMA (1,1,0) has the
best scores of all, followed by ARIMA (0,1,0). A common plot of the data with the
fitted ARIMA (0,1,0) model and the Auto ARIMA model is shown in figure 4.49.
Fitted model     AIC       BIC
ARIMA (0,1,0)    312.54    313.1048
ARIMA (0,1,1)    312.62    313.7518
ARIMA (1,1,1)    313.67    315.3624
ARIMA (1,1,0)    311.7     312.8266
Table 6. Summary of the ARIMA models fitted on Growth.
We proceed with diagnostic tests for the goodness of fit, performed on the residuals of
the fitted ARIMA (1,1,0) model. A plot of the residuals (App.) shows a random movement
around zero. The ACF plot (App.) shows no significant spikes at any lag and the
Box-Ljung test p-value is 0.06. Hence we fail to reject the null hypothesis that the
residuals are white noise, meaning that our model shows no lack of fit.
Using ARIMA (1,1,0) we can proceed to forecasting. Our forecasts are for the months May
to July of the year 2016, a period of h=3 months. The forecasting output for the model
ARIMA (1,1,0) and the corresponding plot are shown in figures 4.44 and 4.45.
Figure 4.44: ARIMA(1,1,0) forecast summary on Growth season .
Figure 4.45 : ARIMA(1,1,0) forecast plot on Growth season .
4.2.4 The Auto ARIMA.
Given the “auto.arima(.)” command, R fits the model ARIMA (2,0,0):
$X_t = 1.3253X_{t-1} - 0.4566X_{t-2} + Z_t$
The model’s summary is shown in figure 4.46.
Figure 4.46 : The Auto ARIMA fit summary on Growth season .
In the residuals’ plot (App.) for the fitted model ARIMA (2,0,0) we see random
movement around zero, the ACF plot (App.) does not show any significant spikes, and
the p-value of the Box-Ljung test (0.7) indicates that the residuals are white noise
and the model has a fair fit. We proceed with the auto ARIMA model’s forecasts and the
corresponding forecasting plot (fig. 4.47 and fig. 4.48).
Figure 4.47 :Auto ARIMA forecast summary on Growth season .
Figure 4.48 : Auto ARIMA forecast plot on Growth season .
Figure 4.49 The common plot of the two optimal ARIMA models with the data line on Growth
season .
Growth season overview
In summary, we gather all the criteria values of our fitted models in Table 7. As we
noted before, the 4th degree polynomial shows the best AIC and BIC scores among the
polynomials and much higher R squared and adjusted R squared. Given that the R squared
measures tend to improve as explanatory variables are added to a model, and that they
can be affected by a number of factors, to the point where a significant model scores a
low R squared or a non-significant model scores a high one, the most reliable measures
for comparing the goodness of fit of the polynomials are the AIC and BIC values. The
ARIMA models we have chosen to fit to our data are fitted under a high level of
significance, since the p-value of the ADF test on the 1st differences is 0.3. The
optimal model according to the AIC and BIC values is ARIMA(1,1,0), an autoregressive
model of low order. The model selected by the auto.arima(.) command is ARIMA(2,0,0), a
model that applies no differencing to the original data and has higher AIC and BIC
values than ARIMA(1,1,0). In Table 8, on the other hand, ARIMA(2,0,0) has lower RMSE,
MAE, MAPE and MASE values than ARIMA(1,1,0). Both models deliver negative forecasts and
forecast confidence intervals that contain zero. Since a negative forecast has no
logical meaning when the variable represents arrivals of people, we can replace these
forecasts with zero.
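The substitution described above amounts to truncating the forecasts at zero. A minimal
sketch follows; the function name is hypothetical and the forecast values are made up
for illustration, since the R forecast output is only available as figures.

```python
def clip_forecast(point, lower, upper):
    """Replace negative values of a forecast and of its interval bounds with zero."""
    return tuple(max(value, 0.0) for value in (point, lower, upper))

# Hypothetical (point, lower bound, upper bound) forecasts, for illustration only.
raw = [(-1200.0, -60000.0, 57600.0), (350.0, -80000.0, 80700.0)]
clipped = [clip_forecast(*row) for row in raw]
print(clipped)
```

Truncation keeps the forecasts interpretable as counts of arrivals, but it does not
make the model more accurate: a negative point forecast still signals that the fitted
model extrapolates the downward trend too far.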
Model fitted            AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree polynomial   339.7011   342.2573   0.701    0.6466
3rd degree polynomial   339.1266   342.3219   0.7512   0.6765
4th degree polynomial   328.8602   332.6945   0.8964   0.8504
ARIMA (2,0,0)           336.99     338.9
ARIMA (0,1,0)           312.54     313.1048
ARIMA (1,1,0)           311.7      312.8266
Table 7. Summary of all models fitted on Growth.
Model           ME          RMSE       MAE        MPE         MAPE       MASE        ACF1
ARIMA (2,0,0)   8621.831    30408.53   22000.09   2.112955    42.69843   0.6945115   -0.09181435
ARIMA (0,1,0)   -301.1519   35876.93   29414.99   -46.47366   83.38134   0.9285892   0.4514709
ARIMA (1,1,0)   -931.1624   31905.25   22435.38   -20.04772   49.19935   0.7082527   0.000577384
Table 8. Summary of forecasting errors for the ARIMA models on Growth.
4.3 Post growth season (05/2016 – 01/2022)
4.3.1 Data plots
We plot the data for the Post growth season (fig. 4.50) along with the sample ACF and
PACF. No obvious outliers appear in the graph. The variance does not seem to increase
over time, so we do not need to apply a Box-Cox transformation to the data to stabilize
it. Some trend appears to be present in the time series, and the ACF exhibits a slow
decay as the number of lags increases, suggesting that the original series is not
stationary (figures 4.51 and 4.52). We apply the Augmented Dickey-Fuller (ADF) test to
check for stationarity. The p-value is 0.5237, so we fail to reject the null hypothesis
of non-stationarity; the Post growth time series is thus treated as non-stationary. In
addition, the partial autocorrelation function (PACF) of the original series has one
significant spike (exceeding the significance boundaries), which indicates an
autoregressive process of order 1 (figure 4.53).
Figure 4.50: The Post Growth time series plot .
Figure 4.51 The sample ACF plot of the Post growth data .
Figure 4.52 : The Post growth sample ACF values .
Figure 4.53: The Post growth sample PACF plot .
Figure 4.54 The sample PACF values .
4.3.2 Fitting polynomials to the Post growth season data
Using R, we fit a 2nd degree polynomial with respect to time t to our Post growth
season data. The fitted polynomial is:
$X(t) = X_t = -3.83t^2 + 243.11t + 572.35$
Along with the model summary (fig. 4.55), we calculate the AIC and BIC criteria to
check the goodness of fit. The p-values of the coefficients indicate a fair fit of the
model to our data, and the values of the information criteria AIC and BIC are 1262.466
and 1271.403 respectively. We apply the Box-Ljung test to check for autocorrelation
among the residuals. The p-value is $1.1 \cdot 10^{-12}$, which means that the
residuals are not white noise and the model has a significant lack of fit. Finally, for
a visual presentation of the goodness of fit we plot the time series and the fitted
polynomial on the same graph (fig. 4.56).
Figure 4.55: 2nd degree polynomial fitted on the Post growth data .
Figure 4.56: 2nd degree polynomial fit on Post growth data plot.
We repeat the procedure by fitting a 3rd degree polynomial to our Post growth season
data. The fitted polynomial is:
$X(t) = X_t = -0.00773t^3 - 3.01797t^2 + 220.21741t + 710.66668$
The model summary is shown in figure 4.57. The p-values of the coefficients are greater
than all the standard levels of significance, meaning that the model does not fit our
data well. The values of the AIC and BIC criteria are 1264.432 and 1275.602
respectively, and the Box-Ljung test p-value is $1.2 \cdot 10^{-12}$, which means that
the model shows a lack of fit and the residuals are highly correlated.
Figure 4.57: 3rd degree polynomial fitted on Post growth data .
Figure 4.58: 3rd degree polynomial fit on the Post growth plot .
The last candidate polynomial that we fit to our data is a 4th degree polynomial. The
R Studio output for the fitted model can be seen in figure 4.59. The fitted polynomial
is:
$X(t) = X_t = 0.0079t^4 - 1.1138t^3 + 46.9971t^2 - 571.0056t + 3658.8987$
The p-values of the coefficients show that the 4th degree polynomial fits better than
the previous two models. The values of the information criteria are AIC: 1254.649 and
BIC: 1268.054, slightly improved over the previous two models. The Box-Ljung test gives
a p-value of $7.5 \cdot 10^{-12}$, meaning that the residual series is not white noise
and the model has a lack of fit.
Figure 4.59: 4th degree polynomial fitted on Post growth season .
Figure 4.60: 4th degree polynomial fit on the Post growth data .
The summary of the AIC, BIC, $R^2$ and adjusted $R^2$ values for the three polynomials
fitted to the Post growth data (Table 9) indicates that the 4th and 2nd degree
polynomials fit better than the 3rd degree polynomial. Between the 2nd and the 4th
degree polynomials, the comparison favours the 4th degree, as it exhibits lower AIC and
BIC values.
Polynomial    AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree    1262.466   1271.403   0.3124   0.2915
3rd degree    1264.432   1275.602   0.3127   0.281
4th degree    1254.649   1268.054   0.4206   0.3844
Table 9. Summary of polynomials fitted on Post Growth.
4.3.3 Box - Jenkins method for fitting (S)ARIMA models
We have already plotted the Post growth time series along with the sample ACF and
PACF. We have also found the ADF p-value to be 0.5237, and we know there is a trend in
the time series. We apply 1st order differences and repeat the ADF test on the
differenced time series. With a p-value of 0.0195, we can conclude that the series of
1st order differences is stationary. In the plot of figure 4.61 we do not recognize any
obvious patterns or cycles, and we can assume that there is no seasonality or
cyclicity. We plot the ACF and PACF of the differenced data (figures 4.62 and 4.63) and
notice one significant spike at lag 1 in the ACF graph and one significant spike at
lag 1 in the PACF graph. Both graphs show another marginally significant spike at
lag 4, but only after lags 2 and 3 give no significant values. We see no periodic
pattern in the ACF and PACF plots that would indicate a seasonal component in the
series.
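The notion of a "significant spike" used above refers to the approximate 95% bounds
±1.96/√n drawn on ACF plots. The sketch below flags the lags whose sample
autocorrelation exceeds these bounds; it runs on a toy series, the function names are
hypothetical, and n = 68 for the differenced Post growth series is our assumption (69
monthly values from 05/2016 to 01/2022, minus one lost to differencing).

```python
import math

def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

def significant_lags(x, max_lag, z=1.96):
    """Lags whose sample autocorrelation lies outside +/- z / sqrt(n)."""
    bound = z / math.sqrt(len(x))
    r = sample_acf(x, max_lag)
    return [k for k, rk in enumerate(r, start=1) if abs(rk) > bound], bound

# Toy alternating series (illustration only, not thesis data).
toy = [1, -1] * 4
lags, bound = significant_lags(toy, max_lag=3)
bound_68 = 1.96 / math.sqrt(68)  # bound for an ASSUMED n = 68
print(lags, round(bound, 3), round(bound_68, 3))
```

For n = 68 the bound is roughly 0.24, so a spike that only just crosses the line, such
as the one at lag 4, is weak evidence on its own; with 95% bounds about one lag in
twenty is expected to cross by chance.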
Figure 4.61: The first difference of the Post growth data.
Figure 4.62: The ACF plot for the differenced data.
Figure 4.63: The PACF plot for the differenced data.
Based on the ACF and PACF of the 1st order differences, we can decide the order of our
ARIMA model. We set d=1, since first differencing transforms the original time series
into a stationary series, and we first choose p = 1 and q = 1, so we fit an ARIMA
(1,1,1) model. Additionally, several alternative models with p and q values close to
our primary selection will be considered.
1) ARIMA (1,1,1)
Figure 4.64: ARIMA(1,1,1) fit summary on Post growth.
Model:
$X_t = -0.2734X_{t-1} + Z_t + 0.6147Z_{t-1}$
2) ARIMA (1,1,2)
Figure 4.65: ARIMA(1,1,2) fit summary on Post growth.
Model:
$X_t = -0.8169X_{t-1} + Z_t + 1.2316Z_{t-1} + 0.2316Z_{t-2}$
3) ARIMA (2,1,1)
Figure 4.66: ARIMA(2,1,1) fit summary on Post growth.
Model:
$X_t = -0.5698X_{t-1} + 0.1857X_{t-2} + Z_t + Z_{t-1}$
4) ARIMA (2,1,2)
Figure 4.67: ARIMA(2,1,2) fit summary on Post growth.
Model:
$X_t = -0.5078X_{t-1} + 0.2286X_{t-2} + Z_t + 0.9365Z_{t-1} - 0.0635Z_{t-2}$
5) ARIMA (1,1,0)
Figure 4.68: ARIMA(1,1,0) fit summary on Post growth.
Model:
$X_t = 0.3049X_{t-1} + Z_t$
At this point we have to comment that another possible model would be ARIMA
(0,1,1) but its analysis is postponed until the next section of this study .
Model            AIC        BIC
ARIMA (1,1,0)    1156.54    1160.979
ARIMA (1,1,1)    1157.2     1163.854
ARIMA (1,1,2)    1156.08    1164.959
ARIMA (2,1,1)    1155.79    1164.672
ARIMA (2,1,2)    1157.78    1168.875
Table 10. Summary of the ARIMA Models on Post Growth.
Table 10 summarizes the AIC and BIC scores of all the candidate models. ARIMA (2,1,1)
gets the best AIC score and ARIMA (1,1,0) gets the best BIC score. The differences
between the scores are minor, though, and the complexity of the ARIMA (2,1,1) model is
something we should consider when choosing the model for forecasting. In the last
section of this chapter, we plot each model together with the auto ARIMA model and the
data to obtain a visual presentation of the goodness of fit of our selected models
(figure 4.76).
Next, we proceed with diagnostic plots for the goodness of fit, applied to the
residuals. First we plot the residual series of ARIMA (1,1,0) and look for any trends,
skewness, or other patterns that the model did not capture. The plot shows the
residuals moving randomly around zero, looking like white noise (App.).
The ACF plot of the residuals (App.) shows no significant correlations: the values lie
within the non-significance bounds at all early lags, with some spikes on the boundary
at later lags.
We also apply the Box-Ljung test and get a p-value of 0.8224, which means that the
residuals have no significant autocorrelations. They are consistent with white noise
and the model has a fairly good fit.
The same route is followed for the residuals of the model ARIMA (2,1,1). We plot the
residuals and their ACF (App.) and conduct a Box-Ljung test, which gives a p-value of
0.9903; the conclusion is that the residuals are white noise.
Now we can use the optimal models to forecast. The forecasts are for the months
February to April of the year 2022, a period of h=3 months. The forecasting output for
the model ARIMA(1,1,0) and the corresponding plot, exhibiting the forecasting line and
the confidence intervals, are shown in figures 4.69 and 4.70.
Figure 4.69 : ARIMA(1,1,0) forecast summary for Post growth season .
Figure 4.70: ARIMA(1,1,0) forecast plot on Post growth season .
The forecasting output for the model ARIMA(2,1,1) and the corresponding plot are
shown in figures 4.71 and 4.72.
Figure 4.71: ARIMA(2,1,1) fit summary for the Post growth season .
Figure 4.72: ARIMA(2,1,1) forecast plot for the Post growth season .
4.3.4 The Auto ARIMA .
In this last part we use the “auto.arima( )” command on the Post growth season data.
The R output is shown in figure 4.73 and the model is the ARIMA(0,1,1):
$X_t = Z_t + 0.3613Z_{t-1}$
Figure 4.73: Auto ARIMA fit summary on Post growth .
Furthermore, we obtain the plot and the ACF plot of the residuals (App.) of the model
ARIMA (0,1,1), and finally we perform the Box-Ljung test on the residuals. The residual
series shows no apparent trends and the residuals move randomly around zero, so they
look like white noise; the ACF plot shows no significant correlations at early lags and
some spikes on the significance boundary at later lags. The p-value of the Box-Ljung
test is 0.8984, hence we fail to reject the hypothesis that the residuals are white
noise, meaning that this model has a sufficient fit.
We proceed with forecasting the arrivals in the Post growth season. The forecast period
is the three months after the end of our time series data, that is, February, March
and April of the year 2022. The forecasting output for the model ARIMA (0,1,1) and the
corresponding plot are shown in figures 4.74 and 4.75.
Figure 4.74: Auto ARIMA forecast summary for the Post growth season .
Figure 4.75: Auto ARIMA forecast plot for the Post growth season .
For visual comparison, we produce a summary plot containing the 3 selected models and
the time series of the Post growth season data (fig. 4.76).
Figure 4.76: The common plot of the three optimal ARIMA models with the data line.
Post growth season overview
In summary, Table 11 includes the criteria scores of every model fitted to the Post
growth time series. The best fitted model according to the AIC and BIC criteria is
ARIMA (0,1,1), the model that R returned for the “auto.arima” command. The 4th degree
polynomial has better R squared and adjusted R squared scores than the other two
polynomials, as well as better AIC and BIC. On the other hand, the polynomials have
higher (worse) AIC and BIC scores than any ARIMA model that we applied to these data.
This was expected, as we have already seen that the residuals of the polynomial fits
are highly correlated. As for the ARIMA models, their AIC and BIC scores are very close
to each other and we could say that any of these models is as good as the others.
ARIMA(0,1,1) is the simplest of all, matched in simplicity only by ARIMA (1,1,0); the
latter has a higher AIC score than ARIMA (2,1,1) but should be preferred between those
two because of its simplicity. ARIMA (2,1,1) also gives negative predictions, and we
could say it is not safe to use at this point. ARIMA (0,1,1) gives a steady forecast of
109.199 arrivals per month for the next three months, which is due to the moving
average part of the model and the absence of an autoregressive part. So if the
selection is between ARIMA (0,1,1) and ARIMA (1,1,0), it may be a better idea to choose
the second model, as it avoids identical forecasts for each subsequent month, not to
mention that the AR terms adapt better to the data. Error measures evaluating the
forecasts of the ARIMA models are given in Table 12. The Root Mean Square Error of
ARIMA (1,1,0) is slightly larger than that of ARIMA (0,1,1). Other measures for this
model (such as the Mean Absolute Percentage Error), though, are lower than those of
ARIMA (0,1,1).
Fitted model            AIC        BIC        $R^2$    Adjusted $R^2$
2nd degree polynomial   1262.466   1271.403   0.3124   0.2915
3rd degree polynomial   1264.432   1275.602   0.3127   0.281
4th degree polynomial   1254.649   1268.054   0.4206   0.3844
ARIMA (0,1,1) = AUTO    1155.73    1160.165
ARIMA (1,1,0)           1156.54    1160.979
ARIMA (2,1,1)           1155.79    1164.672
Table 11. Summary of all models fitted on Post Growth.
Model           ME          RMSE       MAE        MPE         MAPE       MASE        ACF1
ARIMA (2,1,1)   -17.57609   1093.096   764.4439   -40.29637   74.23449   0.920169    0.001437162
ARIMA (0,1,1)   -17.09396   1142.88    797.7409   -44.5703    80.25684   0.9602489   -0.01503607
ARIMA (1,1,0)   -16.30146   1150.103   788.6504   -44.24847   78.32442   0.9493066   0.0264405
Table 12. Summary of forecasting errors for the ARIMA models on Post Growth.
Conclusion
The goal of this thesis was to apply the Box-Jenkins approach to a specific time series
coming from the field of social sciences, to identify and estimate various (S)ARIMA
models and to select the most suitable one for forecasting. We compared the models'
goodness of fit using the typical statistical tools: the AIC and BIC criteria and the
Box-Ljung test for the residuals. Since the data were divided into three seasons, we
applied the B-J method three times and chose the optimal models for each season.
We used the selected models to produce monthly forecasts three months ahead for every
season. To check the effectiveness of the models, we used the error measures RMSE, ME,
MPE and MAPE. Moreover, we fitted polynomials to the data, a method that is often used
to estimate the trend component. The comparison between the fitted polynomials and the
ARIMA models showed that the latter fit better, since their AIC and BIC values were
significantly better. In some cases the forecast value was negative, which has no real
meaning for this time series. Since the 80% and 95% confidence intervals of the
forecasted arrivals contain zero, we can safely replace these forecasts with zero.
The constant forecast values that appear in some cases are due to the absence of an
autoregressive part from the model. Knowing the real values for the upcoming months of
each period, we note that for the Pre growth season the forecasts are constant for both
selected models and the maximum forecast level is 3267, when in fact the numbers grew
rapidly over the next three months, to 2873, 7874 and 13556 respectively. The forecasts
for the three months following the Growth season were negative for both models, while
the real values recorded for these months were 1721, 1554 and 1920 respectively. For
the Post growth season, we had constant predictions of 109 per month from one model,
negative predictions from a second model, and the values 148.7, 107.1 and 94.5 from the
last selected model, when the real values for these months are 464, 008 and 1300
respectively.
Although the predictions made based on the growth and post growth models could
have been considered relatively reasonable, the same is not true for the predictions of
the pre growth model (in Sections 4.1.2-4.1.4) since the extreme and sudden increase
of flows that followed in early 2015, could not have been predicted based on the
previously collected data, i.e. the pre growth period. Hence, please note that these
particular predictions are presented in this Thesis for illustrative purposes only and
thus, they should be viewed and evaluated with extreme caution.
The variation in the data motivates the division of the whole recorded period into
three seasons, as done in this work. The forecasts were made for each season
independently. Forecasting a Pre growth season value, when we in fact know that the
following months bring severe changes in the level of the series and that a different
model is needed to represent the data, is done here on an ex-post, ad hoc basis.
For future work, one could use statistical process control charts, which detect
so-called “special cause” variation in the data, that is, an unusual variation or a
rapid change in the level, similar to what was presented in this study.
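One chart of this kind is the cumulative sum (CUSUM) scheme introduced by Page (entry [12] of the Bibliography). The following Python sketch of a one-sided upper CUSUM is only an illustration under stated assumptions: the target level, the slack, the threshold and the first four observations are hypothetical, while the last three observations are the arrivals quoted above for the months following the Pre growth season.

```python
# Illustrative one-sided (upper) CUSUM; all parameter values are hypothetical.

def upper_cusum(values, target, slack, threshold):
    """Return the CUSUM statistics and the index of the first alarm (or None).

    C_t = max(0, C_{t-1} + (x_t - target - slack)); an alarm is raised the
    first time C_t exceeds the threshold.
    """
    stats = []
    c = 0.0
    alarm = None
    for i, x in enumerate(values):
        c = max(0.0, c + (x - target - slack))
        stats.append(c)
        if alarm is None and c > threshold:
            alarm = i
    return stats, alarm


# First four values are made up to mimic a stable level; the last three are
# the recorded arrivals quoted in the text (2873, 7874, 13556).
series = [3000, 3200, 3100, 3300, 2873, 7874, 13556]

stats, alarm = upper_cusum(series, target=3267, slack=500, threshold=2000)
print(stats)
print("first alarm at index:", alarm)
```

With these hypothetical settings the statistic stays at zero while the series fluctuates around the target and raises an alarm at the first month of the rapid increase, which is the kind of signal that would indicate that a different model is needed.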
Bibliography
[1] D. Kugiumtzis, Time Series Analysis, Aristotle University of Thessaloniki,
https://users.auth.gr/dkugiu/Teach/TimeSeries/TimeSeries.pdf, 2020
[2] E. Bora-Senta, Ch. Moisiadis, Applied Statistics (in Greek), Ziti Publications, 1995
[3] A. Karagrigoriou, Lecture Notes on Time Series Theory, University of Aegean, 2020
[4] R.J. Hyndman, G. Athanasopoulos, Forecasting: Principles and Practice, 2nd edition,
OTexts, OTexts.com/fpp2, 2018
[5] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, G.M. Ljung, Time Series Analysis, 5th edition,
Wiley, 2015
[6] R. Harris, R. Solis, Applied Time Series Modeling and Forecasting, Wiley, 2003
[7] P.J. Brockwell, R.A. Davis, Introduction to Time Series and Forecasting, 3rd edition,
Springer, 2016
[8] R.H. Shumway, D.S. Stoffer, Time Series Analysis and its Applications With R
Examples, 3rd Edition, Springer, 2015
[9] J.D. Cryer, K.-S. Chan, Time Series Analysis with Applications in R, 2nd edition,
Springer, 2008
[10] W.L. Young, The Box-Jenkins approach to time series analysis and forecasting:
principles and applications, RAIRO, Recherche Opérationnelle, Tome 11(2), 129-143,
http://www.numdam.org/article/RO_1977__11_2_129_0.pdf, 1977
[11] A.J. Scott, M. Knott, A Cluster Analysis Method for Grouping Means in the Analysis
of Variance, Biometrics, 30(3), 507-512, 1974
[12] E.S. Page, Continuous inspection schemes, Biometrika, 41, 100-115, 1954
Web References
[1] UNHCR portal, https://www.unhcr.org/refugee-statistics/insights/explainers/100-million-forcibly-displaced.html
[2] UNHCR Data finder, https://www.unhcr.org/refugee-statistics/download/?url=2z1B08
[3] UNHCR Global trends, https://www.unhcr.org/globaltrends.html
[4] IOM portal for World Migration report 2022
https://worldmigrationreport.iom.int/wmr-2022-interactive/
[5] IOM World Migration report 2022, Chapter 1
https://worldmigrationreport.iom.int/wmr-2022-interactive/
[6] IOM World Migration Report 2022, Chapter 2
https://worldmigrationreport.iom.int/wmr-2022-interactive/
[7] IOM World Migration Report 2022, Chapter 2, Migrant workers
https://worldmigrationreport.iom.int/wmr-2022-interactive/
[8] UNHCR portal, Figures at a glance https://www.unhcr.org/figures-at-a-glance.html
[9] ESPON 2018, Refugee and asylum seeker flows
https://www.espon.eu/sites/default/files/attachments/espon_asylum-flowsresponse-policies-greece-online_0.pdf
[10] ESPON Topic paper: Migration and asylum seekers-ESPON evidences
https://www.espon.eu/sites/default/files/attachments/TNO%20Topic%20Paper%20on%20migration.pdf
[11] UNHCR Portal, The sea route to Europe, 2015
https://www.unhcr.org/news/stories/2015/7/56ec1e9e10/the-sea-route-to-europe.html
[12] UNHCR Portal, The sea route to Europe, 2015, Ch. 2: Rescue at sea, tragedy and
response https://www.unhcr.org/5592bd059.pdf
[13] Refugee crisis factsheet 2017, Hellenic Republic, General Secretariat for
Media and Communication https://government.gov.gr/wp-content/uploads/2017/04/gr_fact_sheet_refugee_print_19_01_2017-2.pdf
[14] UNHCR The sea route to Europe, 2015, Ch. 3 The rise of the Eastern
Mediterranean route: the shift to Greece https://www.unhcr.org/5592bd059.pdf
[15] UNHCR Aegean Island factsheet, April 2017
https://data.unhcr.org/en/documents/details/57261
[16] UNHCR Data portal https://data.unhcr.org/en/documents/details/74139
[17] UNHCR Data portal https://data.unhcr.org/en/documents/details/90834
[18] UNHCR Data portal https://data.unhcr.org/en/documents/details/46622
[19] UNHCR Data portal https://data.unhcr.org/en/documents/details/61395
[20] UNHCR Data portal https://data.unhcr.org/en/documents/details/67184
[21] UNHCR Data portal https://data.unhcr.org/en/documents/details/73442
[22] UNHCR Data portal https://data.unhcr.org/en/documents/details/46357
[23] R ‘changepoint’ package manual: https://cran.r-project.org/web/packages/changepoint/index.html
Appendix
The graphical representations of the residuals and ACF for the best models for each
season, are provided in this Appendix.
PRE GROWTH SEASON
1. The residuals series of ARIMA(0,1,0)
2. The ACF plot for the residuals of ARIMA(0,1,0)
3. The residuals series of ARIMA(0,1,1)
4. The ACF plot for the residuals of ARIMA(0,1,1)
GROWTH SEASON
5. The residuals series of ARIMA(1,1,0)
6. The ACF plot for the residuals of ARIMA(1,1,0)
7. The residuals series of ARIMA(2,0,0)
8. The ACF plot for the residuals of ARIMA(2,0,0)
POST GROWTH SEASON
9. The residuals series of ARIMA(1,1,0)
10. The ACF plot for the residuals of ARIMA(1,1,0)
11. The residuals series of ARIMA(2,1,1)
12. The ACF plot for the residuals of ARIMA(2,1,1)
13. The residuals series of ARIMA(0,1,1)
14. The ACF plot for the residuals of ARIMA(0,1,1)
EPIMYTHIUM
The drum of war thunders and thunders.
It calls: thrust iron into the living.
From every country
slave after slave
are thrown onto bayonet steel.
For the sake of what?
The earth shivers
hungry
and stripped.
Mankind is vapourised in a blood bath
only so
someone
somewhere
can get hold of Albania.
Human gangs bound in malice,
blow after blow strikes the world
only for
someone’s vessels
to pass without charge
through the Bosporus.
Soon
the world
won’t have a rib intact.
And its soul will be pulled out.
And trampled down
only for someone,
to lay
their hands on
Mesopotamia.
Why does
a boot
crush the Earth — fissured and rough?
What is above the battles’ sky Freedom?
God?
Money!
When will you stand to your full height,
you,
giving them your life?
When will you hurl a question to their faces:
Why are we fighting?
Vladimir Mayakovsky (1917)