
DATA1002 - Stage 1

Introduction
In this report, with a time background set in the year 2017, we will be analysing the
impact of socioeconomic factors on birth rate: a global analysis. The central hypothesis
is that social and economic factors play a role in determining a family's ability to procreate
and sustain itself. We aim to compile the information we have gathered on various factors, such as
GDP, happiness score, corruption level, and education level, and analyse their effects on
birth rates globally in order to test this hypothesis.
Since their roles and interests can have a significant impact on how this report's
findings are applied, it is essential to identify the pertinent stakeholders. Governments and
policy makers are some relevant stakeholders because they can use the findings to implement
policies for healthcare, education, and economic development. Additionally, international
organisations such as WHO and UNESCO may utilise this research to inform their programs
related to reproductive health and development.
The happiness rank dataset is crucial for this research since it provides a distinctive
viewpoint on a country's general well-being, taking into account economic success, access to
healthcare, quality of education, and social support networks. This enables us to investigate
relationships between birth rates and happiness, illuminating how societal contentment affects
family planning choices.
Access to vital services like healthcare and education can be hampered by corruption,
which can also widen economic inequalities. We can use government corruption data to
investigate whether access to family planning services may be restricted as a result of higher
levels of corruption.
The study of the link between socioeconomic variables and birth rates relies heavily
on the GDP per capita dataset. It is a crucial gauge of a country's economic health and
standard of living. Access to healthcare, education, and employment opportunities is
frequently correlated with a higher GDP per capita. This dataset helps examine how families
may prioritise financial security and career advancement, impacting birth rates.
Lastly, by showing the percentage of eligible students enrolled in educational
institutions, Gross Enrolment Ratio sheds light on a country's educational landscape. Since
family planning decisions are influenced by education, this dataset is pivotal. A higher GER is
frequently associated with more educated family planning decisions, postponed marriages,
and increased awareness of contraceptives.
In summary, this report takes on a global analysis to explore the interplay of these
factors to paint a holistic picture of how socioeconomic factors impact familial decisions.
Datasets:
The raw data comes from research conducted by the Sustainable Development
Solutions Network, which scores the happiness of most countries around the globe
using a number of factors, including economic production and social support. The dataset
covers the year 2017, which fits the focus of our study exactly and therefore removes the
need to screen the data by year. It has been uploaded by the original authors, who have
explicitly stated that they have contributed their work to the public domain
(CC0: Public Domain), where it is available to the general public.
In my dataset, I chose to extract the columns "Country", "Happiness Ranking",
"Happiness Score" and "Trust in Government Corruption" from the original data into the new
dataset and cleaned them to ensure that the dataset was accurate.
Attribute: Country
Type: Nominal
Meaning: Name of the country.

Attribute: Happiness Ranking
Type: Ordinal
Meaning: The country ranked 1 has the highest level of happiness; the larger the rank
number, the lower the happiness score.

Attribute: Happiness Score
Type: Ratio
Meaning: Scored out of a maximum of 10 points; the higher the score, the happier the
country.

Attribute: Trust in Government Corruption
Type: Ratio
Meaning: A numerical value in the dataset; a higher value indicates that people place more
trust in the government's handling of corruption issues, while a lower value indicates the
opposite.
During the data cleaning process, firstly, in order to verify that there were no missing
values in the dataset, I used Python to perform the checks and removed the rows that
contained missing data.
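The exact code is not shown in this extract; a minimal pandas sketch of such a check,
assuming the extracted columns were saved under the hypothetical filename
happiness_2017.csv, could be:
code:
import pandas as pd

# Load the extracted happiness columns (filename assumed for illustration)
df = pd.read_csv('happiness_2017.csv')

# Count missing values per column, then drop any rows that contain them
print(df.isnull().sum())
df = df.dropna()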
Additionally, I used the appropriate code to calculate the extremes, medians, and
means, statistics that can be used to analyse the distribution and central tendency of the
"Happiness Score" and "Trust in Government Corruption" columns. The same statistics can
also be used to understand the range of the data by comparing the maximum and minimum values.
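Again, the original code is not reproduced; a pandas sketch of these summary statistics,
using the column names from the data dictionary above, might be:
code:
# Distribution and central tendency of the two numeric columns
for col in ['Happiness Score', 'Trust in Government Corruption']:
    print(col)
    print('  min:   ', df[col].min())
    print('  max:   ', df[col].max())
    print('  median:', df[col].median())
    print('  mean:  ', df[col].mean())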
According to the results of the data analysis, the happiness score ranges between
2.693 and 7.537, with a median of 5.286 and a mean of 5.355. This indicates that happiness is
moderately high in most countries. Meanwhile, trust in government corruption ranges between
0.004 and 0.464, with a mean of 0.124, which is slightly higher than the median of 0.09. This
shows that in most countries this trust measure sits toward the lower end of its range. The
fact that the mean is slightly higher than the median may imply that there are some extreme
values or outliers that pull the mean upwards.
Finally, I rounded all data to two decimal places using Excel to ensure accuracy and
consistency. This series of data cleaning steps ensured that our data provided a reliable basis
for the study, allowing for strong analyses and arguments to be made afterwards.
This dataset contains different types of data from different domains and is therefore
characterised by diversity and comprehensiveness. This diversity facilitates a broader
exploration and analysis of the data. However, the dataset also has some potential
limitations, mainly in the form of missing data. These missing values may have some impact
on usability, and therefore I performed systematic data cleaning and preprocessing to
minimise their impact on the study.
(Data source: https://www.kaggle.com/datasets/unsdsn/world-happiness, accessed 8 September 2023)
The data I have obtained pertains to the global total fertility rate (number of births per
woman) from 1960 to 2021. According to the website description, the World Bank Group
makes this data publicly available according to open data standards and licenses its datasets
under the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
(Data source: https://datacatalog.worldbank.org/public-licenses#cc-by)
Based on the data obtained from the website, the last updated date is July 25, 2023. This
dataset includes the following components: Country Name (Nominal; data collected from all over
the world), Country Code (Nominal; abbreviations of country names), Indicator Name (Description;
Fertility rate, total (births per woman)), Indicator Code (Nominal; SP.DYN.TFRT.IN), and birth rate
data for each country from 1960 to 2021 (Ratio). I think this data gives a clear indication of the
fertility situation in each country between 1960 and 2021, and provides important basic information
for the questions that our group is studying, which will help us to carry out in-depth research in a
clearer and more intuitive way.
(Data source: https://data.worldbank.org/indicator/SP.DYN.TFRT.IN)
There are also some limitations to this data. Firstly, the data collected may not be complete.
Secondly, the birth rate is a dynamic indicator that is affected by time and region and may therefore be
subject to change. In extreme cases, the birth rate may even fall to a level close to zero. These factors
need to be considered and explored in our observations and analyses.
Our group has chosen to analyse fertility rate factors for the year 2017. Therefore, I isolated the
data specifically for the year 2017 in Microsoft Excel, generating a completely new dataset.
Subsequently, I performed data cleaning using Python to remove invalid data, as demonstrated below:
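(The original code screenshot is not reproduced in this extract; the following is a minimal
sketch of the step, assuming the 2017 extract was saved under the hypothetical filename
fertility_2017.csv with the fertility values in a column named '2017'.)
code:
import pandas as pd

# Load the 2017-only extract produced in Excel (filename assumed)
df = pd.read_csv('fertility_2017.csv')

# Remove rows with no 2017 fertility value, then save the cleaned data
df = df.dropna(subset=['2017'])
df.to_csv('fertility_2017_clean.csv', index=False)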
Subsequently, I adjusted the formatting in Microsoft Excel: using Ctrl+1 to open the Format
Cells dialog, I selected the Number option and set the number of decimal places to 2, ensuring
that each numerical value retained two decimal places for the purpose of subsequent data
observation and analysis.
Finally, to facilitate our group's more intuitive research in the future, I utilised Python to
analyse the data, identifying the maximum value and the minimum value, and calculating the
average. Below is the code I used:
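(The screenshot is likewise not reproduced; a sketch of the calculation on the cleaned 2017
data could be:)
code:
# Maximum, minimum and mean of the 2017 total fertility rate
print('Maximum value:', df['2017'].max())
print('Minimum value:', df['2017'].min())
print('Average value:', df['2017'].mean())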
Finally, the maximum value obtained was 7.08, the minimum value was 0.87, and the average
value was 2.6942635658914718.
After obtaining the data, I used Microsoft Excel to search and analyse it. I found that the highest
value was in row 173, corresponding to Niger, while the lowest value was in row 253, corresponding
to the British Virgin Islands. Additionally, I observed that the average value appeared to be closer
to the lowest value. Therefore, I believe that the global total fertility rate was relatively low in 2017.
However, to explore the specific factors contributing to this, we will need to integrate our group's data
in subsequent work to arrive at a conclusion.
About my Data: Metadata, Data Dictionary & Provenance
The third dataset, provided by UNESCO, contains critical statistics on Gross Enrolment
Ratios (GER) as a percentage for both sexes in the sphere of education, organised by country.
As a global organisation dedicated to education, research, and culture, UNESCO is a credible
and trustworthy source. GER values are calculated as the number of pupils enrolled in a
level of education, regardless of age, expressed as a percentage of the population of the
corresponding age group. The data is disaggregated by sex and by the general level of
education, in this case primary level, for the age group 24-65 years. Our dataset focuses on
the concept of GER, which depicts the general level of participation in education in various
nations, and we have narrowed our coverage for this assignment to the last 10 years to
ensure relevancy.
It is vital to note that the UNESCO UIS database, through its bulk data downloading service
(BDDS), enables data download for non-commercial purposes under the Creative Commons
Attribution-ShareAlike 3.0 IGO License.
The legitimacy of the source, global coverage, and relevance to our research topic are among
the dataset's strengths. Additionally, the GER has been calculated based on total enrolment in
all types of schools and educational institutions (private, public, and all organised educational
programmes), assuring quality standards. However, the data might be limited in producing
accurate analysis since it contains several missing values, and there may be cases of
under- or overestimation due to inaccurate reporting of enrolment or population numbers.
Furthermore, in some situations the GER can surpass 100% due to the inclusion of
over-aged and under-aged pupils, as a result of early or late school entry and grade
repetition.
The data was extracted on 2 September 2023 at 1:37 UTC (GMT) from the UNESCO Institute
for Statistics website. Two variables were sourced by UIS for the calculation of GER by
country: total enrolment for a given level of education, and the population of the age group
corresponding to that level. Originally, the data on enrolment was sourced from school
registers in each country, and the school-age population data was sourced from United
Nations Population Division (UNPD) population estimates.
The data follows a tabulated structure consisting of columns and rows. It is organised in rows
identified by country names and columns giving GER values dating from 2013 up until 2022,
as seen in the data dictionary (see Appendix Section 1.1). The dataset is formatted as a
CSV (comma-separated values) file, making it easy to analyse due to its high compatibility
with data analysis tools like Python.
(Data source: http://data.uis.unesco.org/index.aspx?queryid=3812)
Data Quality:
To ensure data quality, the CSV file was converted to Excel for easier interpretation. Pandas
was imported to make data handling easier using data frames, and numpy was imported to
carry out more complex numerical computations to deal with missing values.
Code:
import pandas as pd
import numpy as np
I then imported the Excel file by reading it in pandas to create a data frame. Additionally,
since the first 3 rows were not part of the data, I skipped them using the skiprows
parameter. To achieve a cleaner look, the legend present at the end of the dataset and the
blank columns in the middle of the dataset were removed using the drop function.
code:
# Read the Excel file, skipping the three non-data header rows
df = pd.read_excel('Book2.xlsx', skiprows=3)
# Drop the blank columns interleaved between the year columns
df = df.drop(df.columns[[1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]], axis=1)
# Drop the legend rows at the end of the dataset
df = df.drop(range(280, 285))
To make numerical calculations easier, I substituted missing values ('..') with 0 values using
the replace function, which would stop them from affecting calculations. I then renamed the
columns (Country and GER_Year), in alignment with the original dataset, using the columns
attribute. (See Appendix Section 2.1)
To limit the impact of missing values in the dataset, I calculated the row mean GER for each
country and added it as a column. I then used the numpy 'where' function to replace 0 values
with their corresponding row mean GER. This substitutes each country's average GER for its
missing values, allowing for more accuracy. (See Appendix Section 2.2)
To ensure data quality for integration with the rest of the datasets, I checked the '2017'
column for 0 values and removed all rows where the value was still 0. I also rounded all the
values to 2 decimal places to make calculations easier. (See Appendix Section 2.3)
Lastly, as a final quality check, I checked for any remaining 0 values, which confirmed that
there were no missing values. To verify the medium volume of the dataset, I determined the
number of rows and columns, which produced the output of 235 rows x 11 columns. To
create an output file for the cleaned data, I used the '.to_csv' function and saved the data
frame as a new cleaned dataset. (See Appendix Section 2.4)
Data Analysis
For the initial data analysis, I chose to find the maximum and minimum GER for 2017.
code:
# Highest and lowest GER recorded for 2017
max_value = df_new['GER_2017'].max()
min_value = df_new['GER_2017'].min()
The maximum was 149.32% and the minimum 0.91%, belonging to Madagascar and Somalia
respectively. According to additional research, the adult literacy rate of Madagascar was
reported to be 77.25%, indicating that it has a relatively high level of education. On the other
hand, Somalia was recorded to have a low literacy rate of only 37.8%, in line with the GER
findings. It is important to note, though, that the GER for Madagascar exceeds 100% for the
previously mentioned reasons, which would require further information to justify.
Furthermore, I found the mean GER for 2017 to be 80.39%, indicating that the global level of
participation in education is quite high. As a grouped aggregate summary, I added a column
which provides the average GER for every country over the last 10 years. (See Appendix
Section 3.1)
The data I obtained was from Kaggle. There is no named author credited for this work,
but the creator of the dataset goes by the name of SRK (the owner of the datasheet). He
gathered the data from various websites (in 2017), then combined it and published it
publicly. I looked through every datasheet on the website I chose and settled on Gross
Domestic Product (GDP).
From the data, I extracted three of the columns: Country, "GDP: Gross domestic product
(million current US$)", and "GDP per capita (current US$)". This gives me an idea of how
every country's GDP compares.
Data Dictionary:

Column: Country
Data type: String
Description: Name of the country.

Column: GDP: Gross domestic product (million current US$)
Data type: Float
Description: Gross domestic product, in millions of current US dollars (metadata: currency
is USD).

Column: GDP per capita (current US$)
Data type: Float
Description: GDP per capita, in current US dollars.
From the data I used, there were too many variables to look at. I looked through everything
and broke it down until about 600 variables remained. I also encountered some problems
while selecting those variables: in the columns I chose, some entries do not give a proper
value (that is, there is no data in them). This is where I started the data cleaning. Using
Python, I looked for rows containing the placeholder value '-99'; wherever a row contained
that value, the entire row was removed, since there is not enough evidence to support the
data. Additionally, I calculated the average GDP per capita around the world, and the result
was $15,700.76 USD.
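(The extraction and cleaning code is not reproduced in this extract; the following is a
minimal sketch, assuming the extracted columns were saved under the hypothetical filename
gdp_2017.csv.)
code:
import pandas as pd

# Load the three extracted GDP columns (filename assumed)
df = pd.read_csv('gdp_2017.csv')

# The source marks missing values as -99; drop any row containing it
df = df[(df != -99).all(axis=1)]

# Average GDP per capita across the remaining countries
print(df['GDP per capita (current US$)'].mean())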
Integration
The column headings of our research document contain important factors such as
birth rate, gross domestic product (GDP), happiness index, gross enrolment ratio, and level
of trust in government for the year 2017. We aim to analyse this data to understand the
influencing factors behind the global birth rate.
To integrate the datasets, we collected cleaned data from all group members and
subsequently integrated the data using Python. In this process, we extracted the necessary
data from each member's dataset, rearranged and combined it on the basis of country, and
finally generated a new data document. We then standardised the numeric precision,
rounding all numbers to two decimal places, and renamed the columns to ensure
consistency and readability of the data.
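The integration script itself is not reproduced in this extract; the following is a minimal
sketch of the approach described above, assuming hypothetical filenames for each member's
cleaned dataset and a shared 'Country' column.
code:
import pandas as pd
from functools import reduce

# Cleaned datasets from each group member (filenames assumed)
files = ['fertility_clean.csv', 'gdp_clean.csv',
         'happiness_clean.csv', 'ger_clean.csv']
frames = [pd.read_csv(f) for f in files]

# An inner join on Country keeps only countries present in every dataset
merged = reduce(lambda left, right: pd.merge(left, right, on='Country'),
                frames)

# Standardise numeric precision and save the integrated dataset
merged = merged.round(2)
merged.to_csv('integrated_2017.csv', index=False)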
As for the schema, we chose to organise the attributes by 'Country', hence placing 'Country'
as the first column. Additionally, since our report revolves around birth rate analysis, 'birth
rate' was chosen as the second column, followed by columns for the remaining attributes.
Since all the datasets followed a similar structure with Country as a key attribute, we did
not face any structural challenges during the integration process. However, the size and
scope of the final integrated dataset dwindled, since each dataset was retrieved from a
different source and the sources did not have all countries in common. This reduced our
dataset to one providing data for all attributes, but for only 122 countries. Another challenge
we faced with the integrated dataset was the column names, which originated from their
complete source datasets but did not make complete sense in the integrated dataset. To
accommodate context, we renamed the columns and simplified the attribute names.
Appendix (pvor6694 : 520592227)
Section 1.1
Data Dictionary:
Column name  Description                              Data type        Acceptable range
Country      Name of country                          String           N/A
GER_2013     Gross enrolment ratio for the year 2013  Float (decimal)  0-100%; can sometimes exceed 100%
GER_2014     Gross enrolment ratio for the year 2014  Float            0-100%; can sometimes exceed 100%
GER_2015     Gross enrolment ratio for the year 2015  Float            0-100%; can sometimes exceed 100%
GER_2016     Gross enrolment ratio for the year 2016  Float            0-100%; can sometimes exceed 100%
GER_2017     Gross enrolment ratio for the year 2017  Float            0-100%; can sometimes exceed 100%
GER_2018     Gross enrolment ratio for the year 2018  Float            0-100%; can sometimes exceed 100%
GER_2019     Gross enrolment ratio for the year 2019  Float            0-100%; can sometimes exceed 100%
GER_2020     Gross enrolment ratio for the year 2020  Float            0-100%; can sometimes exceed 100%
GER_2021     Gross enrolment ratio for the year 2021  Float            0-100%; can sometimes exceed 100%
GER_2022     Gross enrolment ratio for the year 2022  Float            0-100%; can sometimes exceed 100%
Section 2.1
code :
# Replace the '..' placeholders for missing values with 0
df = df.replace('..', 0)
# Rename the columns in alignment with the original dataset
df.columns = ['Country', 'GER_2013', 'GER_2014', 'GER_2015', 'GER_2016',
              'GER_2017', 'GER_2018', 'GER_2019', 'GER_2020', 'GER_2021', 'GER_2022']
Section 2.2
code :
ger_cols = [c for c in df.columns if c.startswith('GER_')]
df[ger_cols] = df[ger_cols].apply(pd.to_numeric)  # ensure numeric dtype
df['Row_Means'] = df[ger_cols].mean(axis=1)
# Replace each 0 (formerly missing) value with that country's row mean
for col in ger_cols:
    df[col] = np.where(df[col] == 0, df['Row_Means'], df[col])
Section 2.3
code:
# Drop the helper column used for imputation
df = df.drop('Row_Means', axis=1)
# Keep only rows with a non-zero 2017 value, then round to 2 decimal places
df_new = df[df['GER_2017'] != 0]
df_new = df_new.round(2)
Section 2.4
code:
# Final check: count rows that still contain a 0 in any column (expected: 0)
zero_rows = (df_new == 0).any(axis=1).sum()
print(f"Number of rows with 0 in any columns: {zero_rows}")
# Confirm the dataset's volume: 235 rows x 11 columns
num_rows, num_columns = df_new.shape
Section 3.1
Code and output:
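(The code and output screenshot is not reproduced in this extract; the following is a minimal
sketch consistent with the analysis described in the Data Analysis section, with the reported
values shown as comments.)
code:
# Maximum, minimum and mean GER for 2017 (reported: 149.32, 0.91, 80.39)
print(df_new['GER_2017'].max())
print(df_new['GER_2017'].min())
print(df_new['GER_2017'].mean())

# Grouped aggregate: each country's average GER over the ten years
ger_cols = [c for c in df_new.columns if c.startswith('GER_')]
df_new['Average_GER'] = df_new[ger_cols].mean(axis=1).round(2)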
References
Country Statistics - UNData. (n.d.). Kaggle.
https://www.kaggle.com/datasets/sudalairajkumar/undata-country-profiles?resource=download
Gross enrolment ratio. (2020, June 22). UNESCO Institute for Statistics.
https://uis.unesco.org/en/glossary-term/gross-enrolment-ratio
Madagascar Literacy Rate 2000-2022. (n.d.). Macrotrends.
https://www.macrotrends.net/countries/MDG/madagascar/literacy-rate
Managing voters' illiteracy in Puntland local government elections. (n.d.). Rift Valley
Institute. https://riftvalley.net/publication/managing-voters-illiteracy-puntland-local-government-elections
Other policy relevant indicators: Gross enrolment ratio by level of education. (n.d.).
UNESCO Institute for Statistics. http://data.uis.unesco.org/index.aspx?queryid=3812
UIS Developer Portal. (n.d.). UNESCO. https://apiportal.uis.unesco.org/bdds
World Bank. (2020). Fertility rate, total (births per woman).
https://data.worldbank.org/indicator/SP.DYN.TFRT.IN
World Happiness Report. (n.d.). Kaggle.
https://www.kaggle.com/datasets/unsdsn/world-happiness