Uploaded by Dylan Gaznabbi

The International Organization for Migration Missing Migrants Project is the only known
open access database with categorical and quantitative variables that keeps a running record of
migrant deaths and disappearances along all of the migratory routes across the globe. The
purpose of this global initiative is to raise awareness of the more than 60,000 migrants who have lost
their lives on failed migration journeys across the world since 2014 and to encourage policies that
help create safe and legal routes for migration and help end migrants’ deaths and
disappearances. Statistical analysis in R of this dataset, which contains over 13,000 records, shows
very strong associations between some of its variables.
This large dataset contains 19 categorical and quantitative variables that record key
details surrounding the disappearances of migrants, such as where they went missing, whether
they were found dead or alive, and metrics about the migrants themselves. The first variable is
Incident Type, which tells us what type of migration incident we are examining: an
incident, a split incident, or a cumulative incident. The second variable is Incident Year, which
ranges from 2014 to 2023, the years for which data has been collected for this project. The third
variable is Reported Month, which contains the month in which the incident was reported and
ranges from January to December. The fourth variable is Region of Origin, which shows the
region that the migrants originated from. These regions encompass different areas within the
Americas, Africa, Asia, the Caribbean, Europe, and Oceania. The fifth variable is Region of Incident,
which uses the same classifications as Region of Origin. The sixth variable is
Country of Origin, which identifies the country from which the migrant originates. The
seventh variable is Number of Dead, which is the count of confirmed deceased migrants
from the incident. The eighth variable is Minimum Estimated Number of Missing, which gives
the lowest estimated count of missing migrants in an incident. The ninth variable is Total
Number of Missing and Dead, which is the sum of all missing and deceased in an incident. The
tenth variable, Number of Survivors, shows how many migrants are confirmed to have survived an
incident. The eleventh, twelfth, and thirteenth variables are the Numbers of Females, Males, and
Children involved in an incident, respectively. The fourteenth variable is Cause of Death, which
shows how a migrant died and can be one of many causes, including accidental deaths, diseases,
drownings, vehicle accidents, violence, and environmental conditions. The fifteenth variable is
Migration Route, which includes many popular migratory routes around the world. The
sixteenth variable is Location of Death, which approximates where the migrant death
took place and includes various distances from landmarks. The seventeenth variable is
Information Source, which is the office or news agency that reported the migrant incident. The
eighteenth variable is Coordinates, which shows the latitude and longitude of the migrant
incident. Lastly, the nineteenth variable is the UNSD Geographical Grouping, which is the
continental subregion of the migrant incident.
The first statistical analysis that I performed was between two categorical variables. I
tested the Incident Year and Causes of Death variables to see if there was an association between
the two.
In creating the bar chart to show the relationship between Incident Year and
Causes of Death, I placed Incident Year on the X-axis and the death counts for each Cause of Death
on the Y-axis to see if the change in those counts can be explained by the change in Incident Year.
Visually, as Incident Year increased, the death counts also increased. Even though the Causes of
Death variable has many different causes, or levels, within it, the counts as a whole
gradually increase as Incident Year increases, supporting the existence of an association
between the two variables.
The next step in determining if there is an association between Incident Year and Causes
of Death was creating frequency tables showing the actual data values for each segment
of Causes of Death, which indicate numerically whether the death count for each cause increased
as Incident Year increased.
In the resulting frequency tables, most of the segments of Causes of Death increase when
Incident Year increases, further supporting the belief in a strong association between these two
variables. For example, for Vehicle accident/death linked to hazardous transport, as Incident Year
increases (2014, 2015, 2016, 2017), the Death Count consistently increases (25, 118, 191,
252). This general upward trend is observed across most of the frequency tables, where the
change in the death counts can be explained by the change in the Incident
Year. While it is important to acknowledge that a few segments show no substantial
linear growth, such as those that alternate between zeroes and ones and do not follow the
general data trend, the segments with a wider range of values show strong linear
growth.
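As a minimal sketch of how such frequency tables could be built in R, the example below uses a small made-up data frame. The column names (incident_year, cause_of_death, number_dead) and every value are assumptions for illustration only; they are not taken from the actual dataset.

```r
# Hypothetical toy data: names and values are illustrative only,
# not the real Missing Migrants Project export.
incidents <- data.frame(
  incident_year  = c(2014, 2014, 2015, 2015, 2015, 2016, 2016, 2016, 2016),
  cause_of_death = c("Drowning", "Violence", "Drowning", "Drowning",
                     "Violence", "Drowning", "Drowning", "Violence",
                     "Violence"),
  number_dead    = c(3, 1, 5, 2, 4, 6, 1, 2, 7)
)

# Sum the deaths for every cause (rows) and year (columns).
death_counts <- xtabs(number_dead ~ cause_of_death + incident_year,
                      data = incidents)
print(death_counts)

# One segment of the table: the counts for a single cause across years.
drowning_by_year <- death_counts["Drowning", ]
print(drowning_by_year)
```

Reading one row at a time, as in the last step, mirrors the per-segment comparison described above.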
The final statistical test that was performed to analyze the association between Incident
Year and Causes of Death was a Chi-square test. A Chi-square test is a statistical method used to
determine if there is a significant difference between the observed frequencies and the
frequencies that would be expected if the two variables were independent. The p-value generated
in a Chi-square test determines whether the association is statistically significant;
for the purposes of this testing, 0.05 is used as the
benchmark for determining statistical significance. If the generated p-value is greater than 0.05,
then the association is not statistically significant, but if it is smaller, then it is.
In this Chi-square test, the generated p-value is extremely near 0, reported by R as below 2.2e-16. This is well
below the 0.05 benchmark for statistical significance, showing that the association between
Incident Year and Causes of Death is statistically significant.
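The test described above can be sketched with chisq.test from base R. The contingency table here is invented purely to show the mechanics; its counts and labels are assumptions and do not reproduce the result reported in this analysis.

```r
# Hypothetical contingency table of death counts (made-up numbers).
death_counts <- matrix(
  c(100, 100, 100,   # Drowning
     10,  50, 200),  # Violence
  nrow = 2, byrow = TRUE,
  dimnames = list(
    cause_of_death = c("Drowning", "Violence"),
    incident_year  = c("2014", "2015", "2016")
  )
)

# Pearson's Chi-square test of independence between rows and columns.
result <- chisq.test(death_counts)

print(result$expected)  # frequencies expected under independence
print(result$p.value)

# Compare the p-value against the 0.05 benchmark used in the text.
significant <- result$p.value < 0.05
print(significant)
```

When a p-value falls below the smallest difference R can represent, the printed test summary shows it as "< 2.2e-16" rather than as an exact number.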
For the second set of statistical association tests, I tested a categorical variable and a
quantitative variable. I used Region of Incident as my categorical variable on the X-axis and
Number of Survivors as my quantitative variable on the Y-axis for the Side-By-Side Boxplot and
for the histogram.
The Side-By-Side Boxplot shows that the bulk of the data points for most regions do
not exceed 500 survivors. The obvious exception here is the Mediterranean, which has
many more data points above the 500-survivor mark and has its upper limit close to 2000. The
Mediterranean's status as a hotspot for survivors, and by extension migrant activity, is further
visualized by the histogram.
This histogram uses a different method to show a similar result: the greatest number of
survivors is found in the Mediterranean region. Within the broader context of
the Missing Migrants Project, this may be misleading because it
may lead you to believe that, since most survivors are found in the
Mediterranean, it must be the safest region. In reality, the Mediterranean is the busiest hotspot for
migrant activity, and most migrants attempt to cross the Mediterranean. While it has
the highest number of survivors, it also has the highest number of deaths; this is only possible
because most migrants are found moving along the Mediterranean and it is busier than any other
region. More migrants will be found in the Mediterranean than in any other region, dead or alive.
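A side-by-side boxplot and a per-region total like the ones discussed above could be produced roughly as follows. This is only a sketch: the data frame, its column names, the placeholder regions "Region A" and "Region B", and all survivor counts are assumptions made up for illustration.

```r
# Hypothetical toy data; only the "Mediterranean" label comes from the text.
survivors <- data.frame(
  region_of_incident  = rep(c("Mediterranean", "Region A", "Region B"),
                            each = 5),
  number_of_survivors = c(12, 80, 250, 640, 1900,
                           0,  3,   8,  15,   40,
                           2,  5,   9,  20,   60)
)

# Side-by-side boxplot: one box per region.
bp <- boxplot(number_of_survivors ~ region_of_incident, data = survivors,
              xlab = "Region of Incident", ylab = "Number of Survivors")
print(bp$stats)  # five-number summary per region, one column each

# Total survivors per region, drawn as bars.
total_by_region <- tapply(survivors$number_of_survivors,
                          survivors$region_of_incident, sum)
barplot(total_by_region, ylab = "Total survivors")
```

Dividing survivors by the total number of people involved in each region, instead of plotting raw totals, would be one way to avoid the misleading "safest region" reading.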
Whether Number of Survivors is normally distributed across the Regions of Incident is
assessed with a QQ plot; this shows if the data has a normal distribution or if the data is
skewed. If the data points closely follow the reference line, then the data has an approximately
normal distribution; however, if the data points deviate markedly from the line, particularly at
the ends, then the dataset is skewed.
As can be seen from the QQ plot, the first value and the last value do not follow the
trend of the rest of the data, which shows that the dataset is skewed. Instead of relying on an F test
statistic, a median test will have to be performed on these two variables because the distribution
is not normal but skewed. The data point on the far right of the QQ plot, which appears to
have a sample quantile much higher than any of the other data points, is most likely from the
Mediterranean, as it had the highest number of survivors in both the Side-By-Side Boxplot and
the histogram. Even with most of the data values following the general trend, this outlier would
skew the entire dataset.
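For reference, a QQ plot of this kind can be drawn with qqnorm and qqline from base R. The sample below is a made-up, right-skewed vector standing in for Number of Survivors; it is an assumption for illustration, not the real data.

```r
# Hypothetical right-skewed sample with one very large value.
number_of_survivors <- c(0, 1, 2, 3, 5, 8, 12, 20, 35, 60, 150, 1900)

# Plot sample quantiles against theoretical normal quantiles.
qq <- qqnorm(number_of_survivors, main = "Normal Q-Q Plot")

# Add the reference line, drawn through the first and third quartiles.
qqline(number_of_survivors)

# The largest sample value is paired with the largest theoretical quantile,
# so it appears at the far right of the plot.
print(which.max(qq$y) == which.max(qq$x))
```

With a sample like this, the last point sits far above the reference line, which is the pattern described for the outlier above.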
To further examine the relationship between Region of Incident and Number of
Survivors, an ANOVA test produces a p-value that shows whether an association between them
exists and is statistically significant. The p-value
indicates the probability of observing a result as extreme as, or more extreme than,
what was actually observed, assuming the null hypothesis of there being no association between
Region of Incident and Number of Survivors is true. If the p-value is below the 0.05 benchmark
for statistical significance, then the null hypothesis would be rejected in favor of the alternative
hypothesis, which postulates that the association between these two variables is statistically
significant.
The results of this ANOVA show that the generated p-value is extremely near 0, reported by R as below 2.2e-16. This is
well below the 0.05 benchmark for statistical significance, showing that the association between
Region of Incident and Number of Survivors is statistically significant. Since the data is skewed
and not normally distributed, I chose to run a median test rather than rely on an F test statistic.
Using the Kruskal-Wallis rank sum test, we can examine whether the medians of
Number of Survivors differ across the Regions of Incident. Since the p-value in this
test is likewise reported as below 2.2e-16, far below the 0.05
benchmark for statistical significance, we can reject the null hypothesis of the group medians
being equal in favor of the alternative hypothesis that Number of Survivors differs
significantly across Regions of Incident.
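Both tests discussed above are available in base R as aov and kruskal.test. The sketch below runs them on a small invented data frame; the column names, the placeholder regions, and the values are all assumptions chosen only to make the groups clearly different.

```r
# Hypothetical toy data; only the "Mediterranean" label comes from the text.
survivors <- data.frame(
  region_of_incident  = factor(rep(c("Mediterranean", "Region A", "Region B"),
                                   each = 5)),
  number_of_survivors = c(100, 200, 300, 400, 500,
                            1,   2,   3,   4,   5,
                            6,   7,   8,   9,  10)
)

# One-way ANOVA: compares group means and assumes roughly normal residuals.
anova_fit <- aov(number_of_survivors ~ region_of_incident, data = survivors)
anova_p   <- summary(anova_fit)[[1]][["Pr(>F)"]][1]
print(anova_p)

# Kruskal-Wallis rank sum test: rank-based, so it does not assume normality.
kw <- kruskal.test(number_of_survivors ~ region_of_incident, data = survivors)
print(kw$statistic)
print(kw$p.value)
```

Because the Kruskal-Wallis test works on ranks, a single extreme value has far less influence on it than on the F statistic, which is why it suits skewed data.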
For the third association I performed statistical analysis on two quantitative variables. In
the Global Missing Migrants dataset, these variables were Minimum Estimated Number of
Missing and Total Number of Missing and Dead. The first step in performing this is creating a
scatterplot with the Minimum Estimated Number of Missing on the X-axis and the Total Number
of Missing and Dead on the Y-axis. The purpose of creating a scatterplot is to visually assess
whether or not the association is of any practical significance and noticeable enough to be
important.
From visually examining the scatterplot, we can see that as Minimum Estimated Number of
Missing increases, the Total Number of Missing and Dead also increases; this observation
warrants further testing to see if the association between these two variables is statistically
significant and unlikely to have been produced by chance. To numerically see if the association
is statistically significant, I performed Pearson’s Correlation Test to obtain the generated
p-value and see if it is above or below the 0.05 threshold for statistical significance. This will
determine if there is a statistically significant association between Minimum Estimated Number of Missing and
Total Number of Missing and Dead.
Pearson’s Correlation Test generates a lot of values but the important value for us to
examine is the p-value.
The p-value given in this correlation test is reported as below 2.2e-16, which is below the 0.05 threshold for
statistical significance, allowing us to reject the null hypothesis that Minimum Estimated
Number of Missing and Total Number of Missing and Dead are uncorrelated and
further investigate the alternative hypothesis, where the change in the Y-axis could be accounted
for by the change in the X-axis. The correlation coefficient of approximately 0.884 also
supports a strong positive association between these two variables because this value is very near
1. The p-value and correlation coefficient here both support a strong association between
Minimum Estimated Number of Missing and Total Number of Missing and Dead.
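Pearson's Correlation Test can be run in R with cor.test. The vectors below are made-up values used only to sketch the call; their names and numbers are assumptions and will not reproduce the 0.884 coefficient reported here.

```r
# Hypothetical toy values, not the real dataset.
minimum_missing <- c(0, 0, 1, 2, 5, 10, 20, 40)
number_dead     <- c(1, 3, 0, 2, 1,  4,  2,  5)

# As described earlier, the total is the sum of missing and dead.
total_missing_dead <- minimum_missing + number_dead

# Pearson's product-moment correlation test.
ct <- cor.test(minimum_missing, total_missing_dead, method = "pearson")

print(ct$estimate)  # correlation coefficient r
print(ct$p.value)   # p-value for the null hypothesis of zero correlation
```

Since the total includes the missing count by construction, some positive correlation between the two variables is to be expected.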