Introduction

This report, set against the backdrop of the year 2017, analyses the impact of socioeconomic factors on birth rate: a global analysis. The central hypothesis is that social and economic factors play a role in determining a family's ability to procreate and sustain itself. We aim to compile the information we have gathered on various factors, such as GDP, happiness score, corruption level, and education level, and analyse their effects on birth rates globally in order to test this hypothesis.

Since their roles and interests can have a significant impact on how this report's findings are applied, it is essential to identify the pertinent stakeholders. Governments and policymakers are relevant stakeholders because they can use the findings to shape policies for healthcare, education, and economic development. International organisations such as the WHO and UNESCO may also use this research to inform their programmes on reproductive health and development.

The happiness rank dataset is crucial for this research since it provides a distinctive viewpoint on a country's general well-being, taking into account economic success, access to healthcare, quality of education, and social support networks. This enables us to investigate relationships between birth rates and happiness, illuminating how societal contentment affects family planning choices. Corruption can hamper access to vital services such as healthcare and education and can widen economic inequalities; we use government corruption data to investigate whether higher levels of corruption are associated with restricted access to family planning services. The GDP per capita dataset is central to the study of the link between socioeconomic variables and birth rates: it is a key gauge of a country's economic health and standard of living, and higher GDP per capita is frequently correlated with better access to healthcare, education, and employment opportunities. This dataset helps us examine how families may prioritise financial security and career advancement, which can affect birth rates. Lastly, the Gross Enrolment Ratio (GER), the percentage of eligible students enrolled in educational institutions, sheds light on a country's educational landscape. Since family planning decisions are influenced by education, this dataset is pivotal: higher GER is frequently associated with more informed family planning decisions, postponed marriages, and greater awareness of contraceptives. In summary, this report undertakes a global analysis of the interplay of these factors to paint a holistic picture of how socioeconomic factors shape familial decisions.

Datasets

The raw data comes from research conducted by the Sustainable Development Solutions Network, which scores the happiness of most countries around the globe using a number of factors, including economic production and social support. The dataset covers the year 2017, which matches the focus of our study exactly, so no further screening by year was required. It was uploaded by the original authors, who have explicitly stated that they contribute their work to the public domain (CC0: Public Domain), where it is available to the general public. From the original data I extracted the columns "Country", "Happiness Ranking", "Happiness Score" and "Trust in Government Corruption" into a new dataset and cleaned them to ensure that the dataset was accurate (a brief sketch of this extraction follows).
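As a minimal sketch of this extraction step (the file name and the original Kaggle column labels are assumptions, not part of the report):

code (illustrative sketch):
import pandas as pd

# Load the 2017 World Happiness data (file name assumed from the Kaggle download)
raw = pd.read_csv('2017.csv')

# Keep only the columns used in this study; the original column labels are assumed
happiness = raw[['Country', 'Happiness.Rank', 'Happiness.Score', 'Trust..Government.Corruption.']].copy()
happiness.columns = ['Country', 'Happiness Ranking', 'Happiness Score', 'Trust in Government Corruption']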
Data dictionary:
Country (Nominal): Name of the country.
Happiness Ranking (Ordered): The country ranked 1 has the highest level of happiness; the larger the rank number, the lower the happiness.
Happiness Score (Ratio): Score out of a maximum of 10; the higher the score, the happier the country.
Trust in Government Corruption (Ratio): A numerical value; a higher value indicates that people trust the government's handling of corruption issues more, while a lower value indicates the opposite.

During the data cleaning process, I first verified in Python that there were no missing values in the dataset and removed any rows that contained missing data. I then calculated the extremes, medians, and means, statistics that describe the distribution and central tendency of the "Happiness Score" and "Trust in Government Corruption" columns; comparing the maximum and minimum values also gives the range of the data (a sketch of these steps appears at the end of this subsection). According to the results, the happiness score ranges between 2.693 and 7.537, with a median of 5.286 and a mean of 5.355, which indicates that happiness is relatively high in most countries. Meanwhile, trust in government corruption ranges between 0.004 and 0.464, with a mean of 0.124, slightly higher than the median of 0.09. This shows that citizens of most countries exhibit relatively high levels of trust in their governments, and the fact that the mean is slightly higher than the median may imply that some extreme values or outliers influence the mean. Finally, I rounded all data to two decimal places in Excel to ensure consistency. This series of data cleaning steps ensured that the data provides a reliable basis for the study, allowing for strong analyses and arguments to be made afterwards. The dataset covers different types of data from different domains and is therefore diverse and comprehensive, which facilitates a broader exploration and analysis. However, it also has some potential limitations, mainly missing data; these missing values may affect usability, so I performed systematic cleaning and preprocessing to minimise their impact on the study. (Data source: https://www.kaggle.com/datasets/unsdsn/world-happiness, accessed 8 September 2023)
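A minimal sketch of the missing-value check and summary statistics described above (the file name is illustrative; the column names follow the extracted dataset):

code (illustrative sketch):
import pandas as pd

# Load the extracted happiness columns (file name illustrative)
happiness = pd.read_csv('happiness_2017_extracted.csv')

# Remove any rows containing missing values
happiness = happiness.dropna()

# Distribution and central tendency of the two numeric columns of interest
for col in ['Happiness Score', 'Trust in Government Corruption']:
    print(col,
          'min:', happiness[col].min(),
          'max:', happiness[col].max(),
          'median:', happiness[col].median(),
          'mean:', happiness[col].mean())

# Two-decimal rounding (done in Excel in the report; shown here for completeness)
happiness = happiness.round(2)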
The second dataset pertains to the global total fertility rate (births per woman) from 1960 to 2021. According to the website description, the World Bank Group makes this data publicly available under open data standards and licenses its datasets under the Creative Commons Attribution 4.0 International licence (CC-BY 4.0). (Data source: https://datacatalog.worldbank.org/public-licenses#cc-by) At the time of download, the last-updated date was 25 July 2023. The dataset includes the following components: Country Name (Nominal; countries from all over the world), Country Code (Nominal; abbreviations for country names), Indicator Name (Description; "Fertility rate, total (births per woman)"), Indicator Code (Nominal; SP.DYN.TFRT.IN), and fertility-rate data for each country from 1960 to 2021 (Ratio). This data gives a clear indication of the fertility situation in each country between 1960 and 2021 and provides important baseline information for the question our group is studying, helping us carry out in-depth research in a clearer and more intuitive way. (Data source: https://data.worldbank.org/indicator/SP.DYN.TFRT.IN)

There are some limitations to this data. Firstly, the data collected may not be complete. Secondly, the fertility rate is a dynamic indicator that is affected by time and region and may therefore change; in extreme cases it may even fall to a level close to zero. These factors need to be considered in our observations and analyses.

Our group chose to analyse fertility-rate factors for the year 2017, so I isolated the data for 2017 in Microsoft Excel, generating a completely new dataset, and then performed data cleaning in Python to remove invalid rows. I also used Excel's cell formatting (Ctrl+1, Number, two decimal places) so that each value is displayed with two decimal places for subsequent observation and analysis. Finally, to support the group's later work, I used Python to identify the maximum and minimum values and to calculate the average; a sketch of these steps is shown at the end of this subsection. The maximum value obtained was 7.08, the minimum was 0.87, and the average was 2.6942635658914718 (approximately 2.69). Searching the data in Excel, the highest value was in row 173, corresponding to Niger, while the lowest was in row 253, corresponding to the British Virgin Islands. The average is closer to the minimum than to the maximum, which suggests that the global total fertility rate was relatively low in 2017. However, to explore the specific factors contributing to this, we will need to integrate our group's data in subsequent work to arrive at a conclusion.
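As referenced above, a minimal sketch of the cleaning and summary steps (the file name and column label are illustrative, and the 2017 column is assumed to have been isolated in Excel beforehand):

code (illustrative sketch):
import pandas as pd

# Load the 2017 fertility data exported from Excel (file name illustrative)
fertility = pd.read_csv('fertility_2017.csv')

# Remove rows with no valid 2017 value
fertility = fertility.dropna(subset=['2017'])

# Summary statistics used in the discussion above
print('max:', fertility['2017'].max())
print('min:', fertility['2017'].min())
print('mean:', fertility['2017'].mean())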
About my Data: Metadata, Data dictionary & Provenance

The third dataset, provided by UNESCO, contains statistics on the Gross Enrolment Ratio (GER), expressed as a percentage for both sexes, organised by country. As a global organisation dedicated to education, research, and culture, UNESCO is a credible and trustworthy source. GER is calculated as the number of pupils enrolled in a given level of education, regardless of age, expressed as a percentage of the population of the corresponding age group. The data is disaggregated by sex and by level of education, in this case primary level for the age group 24-65 years. Our dataset focuses on GER as a measure of the general level of participation in education in various nations, and we have narrowed our coverage for this assignment to the last ten years to ensure relevance. The UNESCO UIS database, through its bulk data downloading service (BDDS), permits data downloads for non-commercial purposes under the Creative Commons Attribution-ShareAlike 3.0 IGO licence. The legitimacy of the source, global coverage, and relevance to our research topic are among the dataset's strengths. Additionally, the GER is calculated from total enrolment in all types of schools and educational institutions (private, public and all organised educational programmes), which supports quality standards. However, the data may limit the accuracy of the analysis because it contains several missing values, and there may be under- or overestimation due to inaccurate reporting of enrolment or population numbers. Furthermore, in some situations the GER can exceed 100% because over-aged and under-aged pupils are included as a result of early or late school entry and grade repetition.

The data was extracted on 2 September 2023, 1:37 UTC (GMT), from the UNESCO Institute for Statistics website. Two variables are used by UIS to calculate GER by country: total enrolment for a given level of education, and the population of the age group corresponding to that level. The enrolment data is originally sourced from school registers in each country, and the school-age population data from United Nations Population Division (UNPD) population estimates. The data follows a tabular structure: rows are identified by country names, and columns give GER values from 2013 up to 2022, as shown in the data dictionary (see Appendix Section 1.1). The dataset is formatted as a CSV (comma-separated values) file, making it easy to analyse due to its high compatibility with data analysis tools such as Python. (Data source: http://data.uis.unesco.org/index.aspx?queryid=3812)

Data Quality:

To ensure data quality, the CSV file was converted to Excel for easier interpretation. pandas was imported to make data handling easier using data frames, and numpy was imported to carry out the numerical computations needed to deal with missing values.

code:
import pandas as pd
import numpy as np

I then imported the Excel file by reading it into pandas to create a data frame. Since the first three rows were not part of the data, I skipped them using the skiprows parameter. To achieve a cleaner layout, the legend at the end of the dataset and the blank columns in the middle of the dataset were removed using the drop function.

code:
df = pd.read_excel('Book2.xlsx', skiprows=3)
df = df.drop(df.columns[[1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]], axis=1)
df = df.drop(range(280, 285))

To make numerical calculations easier, I substituted the missing-value markers ('..') with 0 using the replace function, so that they would not break the calculations. I then renamed the columns (Country and GER_<year>) in alignment with the original dataset (see Appendix Section 2.1). To limit the impact of missing values, I calculated the row-mean GER for each country, added it as a column, and used numpy's where function to replace 0 values with the corresponding row mean, so that missing entries are imputed with that country's average GER, allowing for more accuracy (see Appendix Section 2.2). To ensure data quality for integration with the other datasets, I checked the GER_2017 column for 0 values and removed the corresponding rows, and I rounded all values to two decimal places to make calculations easier (see Appendix Section 2.3). Lastly, as a final quality check, I confirmed that no 0 values remained.
To confirm that the dataset is of medium volume, I checked the number of rows and columns, which gave 235 rows by 11 columns. To create an output file for the cleaned data, I used the .to_csv function and saved the data frame as a new cleaned dataset (see Appendix Section 2.4).

Data Analysis

For the initial data analysis I chose to find the maximum and minimum GER for 2017.

code:
max_value = df_new['GER_2017'].max()
min_value = df_new['GER_2017'].min()

The maximum was 149.32% and the minimum 0.91%, belonging to Madagascar and Somalia respectively. According to additional research, the adult literacy rate of Madagascar was reported to be 77.25%, indicating relatively high educational attainment, while Somalia was recorded as having a low literacy rate of only 37.8%, in line with the GER findings. It is important to note, however, that the GER for Madagascar exceeds 100% for the reasons mentioned previously, which would require further information to interpret fully. Furthermore, I found the mean GER for 2017 to be 80.39%, indicating that the global level of enrolment is quite high. As a grouped aggregate summary, I added a column giving the average GER for each country over the last ten years (see Appendix Section 3.1).

The fourth dataset was obtained from Kaggle. No individual author is credited, but the dataset was created and published by a user going by the name SRK (the owner of the datasheet), who gathered the data from various websites in 2017, combined it, and published it publicly. After reviewing the datasheets on the site, I chose the one covering Gross Domestic Product (GDP). From this data I extracted three columns: Country, "GDP: Gross domestic product (million current US$)", and "GDP per capita (current US$)". This gives an overview of how every country's GDP compares.

Data dictionary:
Country (String): Name of the country.
GDP: Gross domestic product (million current US$) (Float): GDP in millions of current US dollars (currency: USD).
GDP per capita (current US$) (Float): GDP per capita in current US dollars (currency: USD).

The raw data contains too many variables to examine individually, so I went through everything and broke it down to roughly 600 variables. While selecting these I also encountered some problems: some rows in the chosen columns do not contain a proper value (in this dataset missing entries are recorded as -99). This is where the data cleaning started. Using Python, I searched for rows containing the value -99 and removed each such row entirely, since there is not enough evidence to support that data; a sketch of this step appears at the end of this subsection. Additionally, I calculated the average GDP per capita across the world, which came to $15,700.76 USD.
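As referenced above, a minimal sketch of the -99 clean-up and the average GDP per capita calculation (the file name is illustrative; the column label follows the data dictionary above):

code (illustrative sketch):
import pandas as pd

# Load the extracted GDP columns (file name illustrative)
gdp = pd.read_csv('gdp_2017.csv')

# Drop any row where -99 is used as a missing-value placeholder
gdp = gdp[~(gdp == -99).any(axis=1)]

# Average GDP per capita across the remaining countries
print('Average GDP per capita (US$):', round(gdp['GDP per capita (current US$)'].mean(), 2))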
Integration

The column headings of our integrated research dataset contain the key factors for the year 2017: birth rate, gross domestic product (GDP), happiness score, gross enrolment ratio, and level of trust in government. We aim to analyse this data to understand the influencing factors behind the global birth rate. To integrate the datasets, we collected the cleaned data from all group members and combined it using Python: we extracted the necessary columns from each member's dataset, aligned and merged them on the basis of country, and generated a new data document. We then applied consistent numeric precision and renamed the columns to ensure consistency and readability, rounding all numeric values to two decimal places; a sketch of this merge is shown below. As for the schema, we organised the attributes by Country, placing Country as the first column. Since our report revolves around birth-rate analysis, birth rate was chosen as the second column, followed by the remaining attributes. Because all the datasets follow a similar structure with Country as a key attribute, we did not face any structural challenges during integration. However, the size and scope of the final integrated dataset shrank because the datasets were retrieved from different sources and do not all cover the same countries; the integrated dataset provides data for all attributes for only 122 countries. Another challenge was that the column names, which made sense in their original datasets, did not make complete sense in the integrated dataset; to provide context, we renamed the columns and simplified the attribute names.
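A minimal sketch of this integration step (the file names, column labels, and renaming below are assumptions for illustration, not the group's actual code):

code (illustrative sketch):
import pandas as pd
from functools import reduce

# Each member's cleaned dataset; file names and column comments are illustrative
frames = [
    pd.read_csv('fertility_2017_clean.csv'),   # Country, Birth Rate
    pd.read_csv('gdp_2017_clean.csv'),         # Country, GDP, GDP per capita
    pd.read_csv('happiness_2017_clean.csv'),   # Country, Happiness Score, Trust in Government Corruption
    pd.read_csv('ger_2017_clean.csv'),         # Country, GER_2017
]

# Keep only countries present in every dataset (inner join on Country)
merged = reduce(lambda left, right: pd.merge(left, right, on='Country', how='inner'), frames)

# Round numeric values and place Country and Birth Rate first, followed by the remaining attributes
merged = merged.round(2)
ordered = ['Country', 'Birth Rate'] + [c for c in merged.columns if c not in ('Country', 'Birth Rate')]
merged = merged[ordered]
merged.to_csv('integrated_2017.csv', index=False)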
Appendix (pvor6694 : 520592227)

Section 1.1 Data dictionary:
Country: Name of country. Data type: String. Acceptable range: N/A.
GER_2013: Gross enrolment ratio for the year 2013. Data type: Float (decimal). Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2014: Gross enrolment ratio for the year 2014. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2015: Gross enrolment ratio for the year 2015. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2016: Gross enrolment ratio for the year 2016. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2017: Gross enrolment ratio for the year 2017. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2018: Gross enrolment ratio for the year 2018. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2019: Gross enrolment ratio for the year 2019. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2020: Gross enrolment ratio for the year 2020. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2021: Gross enrolment ratio for the year 2021. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.
GER_2022: Gross enrolment ratio for the year 2022. Data type: Float. Acceptable range: 0-100%, can sometimes exceed 100%.

Section 2.1 code:
df = df.replace('..', 0)
df.columns = ['Country', 'GER_2013', 'GER_2014', 'GER_2015', 'GER_2016', 'GER_2017',
              'GER_2018', 'GER_2019', 'GER_2020', 'GER_2021', 'GER_2022']

Section 2.2 code:
# Convert the GER columns to numeric, then impute zeros with the country's row-mean GER
ger_cols = df.columns[1:]
df[ger_cols] = df[ger_cols].apply(pd.to_numeric)
df['Row_Means'] = df[ger_cols].mean(axis=1)
for col in ger_cols:
    df[col] = np.where(df[col] == 0, df['Row_Means'], df[col])

Section 2.3 code:
df = df.drop('Row_Means', axis=1)
df_new = df[df['GER_2017'] != 0]
df_new = df_new.round(2)

Section 2.4 code:
zero_rows = (df_new == 0).any(axis=1).sum()
print(f"Number of rows with 0 in any columns: {zero_rows}")
num_rows, num_columns = df_new.shape
df_new.to_csv('cleaned_GER.csv', index=False)  # output file name illustrative

Section 3.1 Code and output: not reproduced in this copy (the ten-year average GER column is computed in the same way as the row means in Section 2.2).

References

Country Statistics - UNData. (n.d.). Kaggle. https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles?resource=download
Gross enrolment ratio. (2020, June 22). UNESCO Institute for Statistics. https://uis.unesco.org/en/glossary-term/gross-enrolment-ratio
Madagascar literacy rate 2000-2022. (n.d.). Macrotrends. https://www.macrotrends.net/countries/MDG/madagascar/literacy-rate
Managing voters' illiteracy in Puntland local government elections. (n.d.). Rift Valley Institute. https://riftvalley.net/publication/managing-voters-illiteracy-puntland-local-government-elections
Other policy relevant indicators: Gross enrolment ratio by level of education. (n.d.). UNESCO Institute for Statistics. http://data.uis.unesco.org/index.aspx?queryid=3812
UIS Developer Portal. (n.d.). UNESCO Institute for Statistics. https://apiportal.uis.unesco.org/bdds
World Bank. (2020). Fertility rate, total (births per woman). Worldbank.org. https://data.worldbank.org/indicator/SP.DYN.TFRT.IN
World Happiness Report. (n.d.). Kaggle. https://www.kaggle.com/datasets/unsdsn/world-happiness