Migrant Deaths Analysis: Statistical Study of IOM Data

The International Organization for Migration (IOM) Missing Migrants Project is the only known open-access database, with both categorical and quantitative variables, that keeps a running record of migrant deaths and disappearances along migratory routes across the globe. The purpose of this global initiative is to raise awareness of the more than 60,000 migrants who have lost their lives on failed migration journeys since 2014, and to encourage policies that create safe and legal routes for migration and help end migrants’ deaths and disappearances. Statistical analysis in R shows strong, statistically significant associations between several pairs of variables in this dataset of over 13,000 records.

The dataset contains 19 categorical and quantitative variables that record key details surrounding each incident, such as where the migrants went missing, whether they were found dead or alive, and metrics about the migrants themselves. The first variable is Incident Type, which tells us what type of migration incident we are examining: incident, split incident, or cumulative incident. The second variable is Incident Year, which ranges from 2014 to 2023, the years for which data has been collected for this project. The third variable is Reported Month, which contains the month in which the incident was reported, from January to December. The fourth variable is Region of Origin, which shows the region that the migrants originated from. These regions encompass different areas within the Americas, Africa, Asia, the Caribbean, Europe, and Oceania. The fifth variable is Region of Incident, which uses the same classifications as Region of Origin. The sixth variable is Country of Origin, which identifies the country the migrants come from. The seventh variable is Number of Dead, which is the number of migrants confirmed deceased in the incident.
The eighth variable is Minimum Estimated Number of Missing, which gives the lowest estimated count of missing migrants in an incident. The ninth variable is Total Number of Dead and Missing, which is the sum of all missing and deceased migrants in an incident. The tenth variable, Number of Survivors, shows how many migrants are confirmed to have survived an incident. The eleventh, twelfth, and thirteenth variables are the Numbers of Females, Males, and Children involved in an incident, respectively. The fourteenth variable is Cause of Death, which shows how a migrant died and can be one of many causes, including accidental deaths, diseases, drownings, vehicle accidents, violence, and environmental conditions. The fifteenth variable is Migration Route, which covers many popular migratory routes around the world. The sixteenth variable is Location of Death, which gives the approximate place or jurisdiction where the death occurred, often described by its distance from a landmark. The seventeenth variable is Information Source, which is the office or news agency that reported the incident. The eighteenth variable is Coordinates, which gives the latitude and longitude of the incident. Lastly, the nineteenth variable is UNSD Geographical Grouping, which is the continental subregion of the incident.

The first statistical analysis that I performed was between two categorical variables. I tested the Incident Year and Cause of Death variables to see if there was an association between the two. In the bar chart created to show the relationship between them, I placed Incident Year on the X-axis and the count for each Cause of Death on the Y-axis, to see whether the change in the Cause of Death counts can be explained by the change in Incident Year. The chart shows that the counts recorded under Cause of Death rose as Incident Year increased.
Even though the Cause of Death variable has many different levels, the counts as a whole gradually increase as Incident Year increases, supporting the existence of an association between the two variables. The next step in determining whether there is an association between Incident Year and Cause of Death was creating frequency tables that show the actual values for each level of Cause of Death, which indicates numerically whether the counts increased as Incident Year increased. In the resulting frequency tables, the counts for most levels of Cause of Death increase with Incident Year, further supporting the belief in a strong association between these two variables. For example, for Vehicle accident/death linked to hazardous transport, as Incident Year increases through 2014, 2015, 2016, and 2017, the Death Count increases consistently: 25, 118, 191, and 252. This general upward trend is observed across most of the frequency tables, where the change in the Cause of Death counts can be explained by the change in Incident Year. It is important to acknowledge that a few levels show no substantial linear growth, such as those that alternate between zeroes and ones and do not follow the general trend, but the levels with larger and more varied counts show strong linear growth.

The final statistical test performed to analyze the association between Incident Year and Cause of Death was a Chi-square test. A Chi-square test is a statistical method used to determine whether the observed frequencies differ significantly from the frequencies that would be expected if the two variables were independent.
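
As a rough illustration of how the Chi-square statistic is built from observed and expected frequencies, the Python sketch below computes it for a small contingency table. It is not the R code used for this analysis: the function name chi_square_statistic and the toy counts are invented for illustration, and converting the statistic into a p-value is left to a statistics package.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom, expected) for a contingency table.

    `table` is a list of rows of observed counts. Every row and column
    total is assumed to be non-zero, otherwise an expected count is 0
    and the division below fails.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over every cell.
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Toy counts invented for illustration (rows: two causes, columns: two years).
# These are NOT values from the Missing Migrants dataset.
toy_table = [[10, 20], [20, 10]]
statistic, dof, expected = chi_square_statistic(toy_table)
print(statistic, dof)
```

For this toy table every expected count is 15, so the statistic is 4 × 25/15, or about 6.67, with 1 degree of freedom. The sketch applies no continuity correction, so a statistics package may report a slightly different value for a 2 × 2 table.
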
The p-value generated by a Chi-square test determines whether the association is statistically significant; for the purposes of this testing, 0.05 is used as the benchmark for statistical significance. If the generated p-value is greater than 0.05, the association is not statistically significant; if it is smaller, then it is. In this Chi-square test, the generated p-value is extremely close to 0 (2.2e-16). This is well below the 0.05 benchmark, showing that the association between Incident Year and Cause of Death is statistically significant.

For the second set of statistical association tests, I tested a categorical variable and a quantitative variable. I used Region of Incident as my categorical variable on the X-axis and Number of Survivors as my quantitative variable on the Y-axis, for both the Side-By-Side Boxplot and the histogram. The Side-By-Side Boxplot shows that, for most regions, the bulk of the data points fall below 500 survivors. The obvious exception is the Mediterranean, which has many more data points above the 500-survivor mark and an upper limit close to 2000. The Mediterranean’s status as a hotspot for survivors, and by extension for migrant activity, is further visualized by the histogram. The histogram uses a different method to show a similar result: the greatest number of survivors is found in the Mediterranean region. Examining this within the broader context of the Missing Migrants Project shows how this result may be misleading, because it may lead you to believe that, since most survivors are found in the Mediterranean, it must be the safest region.
In reality, the Mediterranean is the busiest hotspot for migrant activity, and most migrants attempt to cross it. While it has the highest number of survivors, it also has the highest number of deaths; this is only possible because more migrants move along the Mediterranean than through any other region. More migrants will be found in the Mediterranean than in any other region, dead or alive.

Whether Number of Survivors is normally distributed is assessed with a QQ plot, which shows whether the data follow a normal distribution or are skewed. If the data points closely follow the reference line, the data are approximately normal; if points deviate clearly from the line, particularly at the ends, the data are not normally distributed. As can be seen from the QQ plot, the first value and the last value do not follow the trend of the rest of the data, which shows that the dataset is skewed. Because the distribution is skewed rather than normal, a median test is more appropriate for these two variables than an F test. The data point on the far right of the QQ plot, which has a sample quantile much higher than any of the other data points, is most likely from the Mediterranean, as it had the highest number of survivors in both the Side-By-Side Boxplot and the histogram. Even with most of the data values following the general trend, this outlier would skew the entire dataset. To further examine the relationship between Region of Incident and Number of Survivors, an ANOVA test provides a p-value that shows whether an association between them, if one truly exists, is statistically significant.
The p-value indicates the probability of observing a result as extreme as, or more extreme than, what was actually observed, assuming that the null hypothesis of no association between Region of Incident and Number of Survivors is true. If the p-value is below the 0.05 benchmark for statistical significance, the null hypothesis is rejected in favor of the alternative hypothesis, which postulates that there is an association between these two variables. The results of this ANOVA show that the generated p-value is extremely close to 0 (2.2e-16). This is well below the 0.05 benchmark, showing that the association between Region of Incident and Number of Survivors is statistically significant.

Since the data is skewed and not normally distributed, I chose to run a median test rather than rely on an F test statistic. The Kruskal-Wallis rank sum test examines whether the medians of Number of Survivors differ across the different Regions of Incident. Since the p-value in this test is the same as in the ANOVA test, and is so small that it is undoubtedly below the 0.05 benchmark, we can reject the null hypothesis that the group medians are equal and accept the alternative hypothesis that Number of Survivors differs significantly across Regions of Incident.

For the third association, I performed statistical analysis on two quantitative variables. In the Global Missing Migrants dataset, these variables were Minimum Estimated Number of Missing and Total Number of Dead and Missing. The first step was creating a scatterplot with Minimum Estimated Number of Missing on the X-axis and Total Number of Dead and Missing on the Y-axis.
The purpose of creating a scatterplot is to visually assess whether the association is of any practical significance and noticeable enough to be important. From visually examining the scatterplot, we can see that as Minimum Estimated Number of Missing increases, Total Number of Dead and Missing also increases; this observation warrants further testing to see if the association between these two variables is statistically significant and unlikely to have been produced by chance. To test this numerically, I performed Pearson’s Correlation Test and compared the generated p-value with the 0.05 threshold for statistical significance. Together with the correlation coefficient, this indicates whether there is a strong association between Minimum Estimated Number of Missing and Total Number of Dead and Missing. Pearson’s Correlation Test generates several values, but the first value for us to examine is the p-value. The p-value given in this correlation test is 2.2e-16, which is below the 0.05 threshold for statistical significance, allowing us to reject the null hypothesis that Minimum Estimated Number of Missing and Total Number of Dead and Missing are uncorrelated and to further investigate the alternative hypothesis, under which the change in the Y-axis variable could be accounted for by the change in the X-axis variable. The correlation coefficient of approximately 0.884 also supports a strong positive association between these two variables, because this value is very near 1. The p-value and correlation coefficient together support a strong association between Minimum Estimated Number of Missing and Total Number of Dead and Missing.
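
To make the two numbers from Pearson’s Correlation Test more concrete, the Python sketch below computes a correlation coefficient and the t statistic that the test derives from it. It is only an illustration, not the R code used for this analysis: the function name pearson_r and the paired values are invented, and converting the t statistic into a p-value is left to a statistics package.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of numbers.

    Assumes at least two pairs and that neither list is constant,
    otherwise the denominator below is zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy paired values invented for illustration; NOT taken from the dataset.
missing = [1, 2, 3, 4, 5]   # stand-in for Minimum Estimated Number of Missing
total = [2, 4, 5, 4, 5]     # stand-in for Total Number of Dead and Missing

r = pearson_r(missing, total)
n = len(missing)

# Test statistic for the null hypothesis of zero correlation;
# it is compared against a t distribution with n - 2 degrees of freedom.
t = r * math.sqrt((n - 2) / (1 - r * r))
print(round(r, 3), round(t, 3))
```

For these toy values r is about 0.775 and t is about 2.12 on 3 degrees of freedom. The coefficient measures the strength of the linear association, while the p-value derived from t measures how unlikely such a coefficient would be if the variables were uncorrelated.
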
