Uploaded by Edwin Sidik

World Population Analysis & Predictive Modeling Report

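National Cheng Kung University
Academic Year 114, Semester 2
Final Report
Course: Introduction to Data Science
Instructor: 李政德
Report Topic: World Population Analysis and Predictive Modeling
Group: 龍生科技
Members: Statistics 115, 徐茲聰, H24115336,
Statistics 115, 曾煥志, H24115069,
Statistics 115, 林琳峰, H24115051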
0. Brief Introduction of the Problem
Understanding and forecasting global population trends is essential for informed
decision-making across sectors such as government policy, economic planning,
healthcare, education, and environmental sustainability. Population data helps policymakers
allocate resources efficiently, anticipate future demand, and respond to challenges such as
population aging, urbanization, changing fertility rates, shifting age structures, and
migration. However, acquiring up-to-date, reliable, and comprehensive population data can be
challenging because demographic conditions change quickly and data sources vary.
The WorldPopulationAnalyzer addresses the need for an automated, robust, and scalable
solution to collect, process, analyze, and visualize world population data by country. This
project aims to analyze global population data, uncover key insights, and build a predictive
model to estimate population trends based on socio-economic indicators. The analyzer
retrieves current data by using web scraping techniques on a widely used public source,
Worldometers. The program cleans and converts the raw data, identifies the factors most
strongly associated with population dynamics, and fits a machine learning model after careful
data preparation and feature engineering. Furthermore, visualization makes the relationships
in the data easier to understand, allowing stakeholders to draw useful insights and base
decisions on evidence.
1. Data Description and Preprocessing
The dataset provides insights into global population statistics for various countries or
dependencies. Key features include:
● Country (or dependency): The name of the country or dependency.
● Population (2024): Projected population for 2024.
● Yearly Change (%): Annual percentage change in population.
● Net Change: Net population change in absolute numbers.
● Density (P/Km²): Population density per square kilometer.
● Land Area (Km²): Total land area of the country.
● Migrants (Net): Net migration numbers.
● Fertility Rate: Average number of children per woman.
● Median Age: Median age of the population.
● Urban Population (%): Percentage of the population living in urban areas.
● World Share (%): Percentage of the world's population.
These features offer a comprehensive view of demographic and geographic attributes, helping
to analyze population growth trends and related factors.
Preprocessing Steps:
The preprocessing steps began with data extraction using BeautifulSoup, scraping an
HTML table containing global population statistics. The raw data was then cleaned by
removing invalid characters, converting strings into numeric types, and handling missing
values. To improve usability, column names were renamed by replacing spaces and special
characters with underscores. Missing values in numerical features were addressed by
imputing them with column means to ensure completeness.
In addition, feature engineering was employed to enhance the dataset’s analytical
potential. New features, such as the logarithm of the population and the urban-to-rural
population ratio, were introduced to capture deeper insights and improve predictive accuracy.
Finally, numerical features were standardized using StandardScaler, which rescales each
variable to zero mean and unit variance so that all features are on a comparable scale for
machine learning models. This preprocessing left the dataset clean, complete, and ready for
analysis.
    import numpy as np
    import pandas as pd


    # Method of the WorldPopulationAnalyzer class. self.data holds the scraped
    # DataFrame; self.scaler is presumably the StandardScaler described above.
    def preprocess_data(self):
        """Preprocess the scraped data."""
        if self.data is None:
            print("No data to process. Please scrape data first.")
            return None

        print("Processing data...")
        self.processed_data = self.data.copy()
        df = self.processed_data  # short alias; all edits below apply to processed_data

        # Simplify column names
        df.rename(columns={
            'Population_(2024)': 'Population_2024',
            'Med._Age': 'Med_Age',
            'Density_(P/Km²)': 'Density',
            'Land_Area_(Km²)': 'Land_Area',
            'Migrants_(net)': 'Migrants',
            'Fert._Rate': 'Fert_Rate',
            'Urban_Pop_%': 'Urban_Pop',
        }, inplace=True)

        # Convert numerical columns
        print("Converting numerical columns...")
        try:
            # Plain numeric columns: entries that cannot be parsed become NaN
            for col in ['Population_2024', 'Net_Change', 'Density', 'Land_Area',
                        'Migrants', 'Fert_Rate', 'Med_Age']:
                df[col] = pd.to_numeric(df[col], errors='coerce')

            # Percentage columns: strip the trailing '%' before converting
            df['Yearly_Change'] = df['Yearly_Change'].str.rstrip('%').astype(float)
            df['World_Share'] = df['World_Share'].str.rstrip('%').astype(float)

            # Urban population: replace 'N.A.' with NaN, strip '%', then convert
            df['Urban_Pop'] = df['Urban_Pop'].replace('N.A.', pd.NA)
            df['Urban_Pop'] = df['Urban_Pop'].str.rstrip('%')
            df['Urban_Pop'] = pd.to_numeric(df['Urban_Pop'], errors='coerce')
        except Exception as e:
            print(f"Error during numerical conversion: {e}")
            return None

        # Handle missing values: fill NaN in numeric columns with the column mean
        numeric_columns = df.select_dtypes(include=['number']).columns
        df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
        print(df.head())

        # Feature engineering
        df['Population_Log'] = np.log1p(df['Population_2024'])
        df['Density_Log'] = np.log1p(df['Density'])
        # The original listing lost the minus sign here: rural share is 100 - urban share
        df['Urban_Rural_Ratio'] = df['Urban_Pop'] / (100 - df['Urban_Pop'])

        # Scale features
        numeric_columns = df.select_dtypes(include=[np.number]).columns
        # Replace infinite values (e.g. from a 100% urban share) with NaN
        df.replace([np.inf, -np.inf], np.nan, inplace=True)
        # Fill any remaining NaN values with the column mean
        df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
        # Apply scaling
        df[numeric_columns] = self.scaler.fit_transform(df[numeric_columns])
        return self.processed_data
2. Insights Discovered from the Data
The analysis of the global population dataset reveals fascinating regional trends and
relationships that highlight the diverse demographic factors influencing population dynamics
worldwide.
Nations with young populations, such as India and Pakistan, are experiencing continued
population growth, driven by comparatively high fertility rates and youthful demographics.
These countries exhibit demographic momentum, where a large proportion of young people leads
to continued population increases even if fertility rates decline over time. In contrast,
aging, low-fertility countries such as China are witnessing population declines. This
demographic shift poses challenges for labor markets, economic growth, and social support
systems, and requires strategic planning to address the consequences of an aging population.
Land area shows almost no linear relationship with population density, with a correlation
of -0.058, so a country's size says little about how densely it is populated. For example,
Bangladesh is densely populated despite its small land area, while larger nations like
Russia are sparsely populated. Urbanization, on the other hand, emerges as a key correlate
of population trends. Highly urbanized nations tend to exhibit higher population densities
and contribute significantly to the global population share, underscoring the importance of
urban development in shaping demographic patterns.
Migration plays a moderate yet vital role in population growth, particularly in
countries like the United States, where positive net migration consistently supports steady
increases in population. This highlights the significance of migration policies in shaping
demographic outcomes, especially for nations relying on immigration to offset slowing
natural growth rates.
The correlation heatmap provides deeper insights into demographic trends. Fertility
rate shows a strong negative correlation (-0.85) with median age, indicating that
nations with younger populations, such as Pakistan, tend to have higher fertility rates.
Urbanization correlates positively with both population density (0.48) and the
urban-to-rural ratio (0.43), indicating that more urbanized countries tend to be more
densely populated. Furthermore, yearly population change is positively correlated with the
fertility rate (0.63) and negatively correlated with median age (-0.67), emphasizing the
role of younger populations in driving growth. Aging populations, such as those in China and
parts of Europe, are seeing slower or even negative growth, which underscores the need for
strategic policies to address demographic challenges.
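As a minimal sketch of how a coefficient in such a heatmap is obtained, the Pearson correlation between two columns can be computed as follows. The function name `pearson` and the sample values are illustrative assumptions; the numbers are made up and are not taken from the scraped dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, not rows from the dataset.
fert_rate = [1.2, 1.6, 2.1, 3.4, 4.8]
median_age = [45.0, 40.0, 31.0, 24.0, 18.0]

r = pearson(fert_rate, median_age)
print(f"correlation: {r:.2f}")  # strongly negative for these sample values
```

A value near -1 means that one variable falls almost linearly as the other rises; it does not by itself show that one causes the other.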
These insights reveal how fertility, urbanization, and migration interplay to shape
global population trends. For high-growth nations, youthful populations and higher fertility
rates offer economic opportunities, such as workforce expansion, but require investments in
education, healthcare, and infrastructure for sustainable growth. Meanwhile, aging nations
must face the challenges of an aging workforce, declining productivity, and increased
dependency ratios through policies encouraging migration, family support programs, and
retirement reforms. Urbanization’s influence calls for smart urban planning, including
housing, transportation, and resource management, to accommodate growing city
populations. Lastly, the importance of migration in sustaining population growth, particularly
in developed nations, highlights the need for inclusive and balanced migration strategies.
These findings provide a comprehensive perspective on global population trends and
serve as a guide for policymakers, researchers, and stakeholders to anticipate challenges,
leverage opportunities, and plan strategically for the future.
3. Methodology Details
The WorldPopulationAnalyzer employs a comprehensive, multi-step pipeline to
acquire, preprocess, analyze, model, and visualize global population data. This methodology
ensures data integrity and provides actionable insights into the demographic dynamics
shaping the world.
The process begins with data acquisition, where the dataset is scraped from the
Worldometers "Population by Country" page using the requests library to fetch webpage
content and BeautifulSoup for HTML parsing. The relevant table containing population
statistics is extracted, and the data is converted into a structured format using Pandas for
further analysis. Automating this step keeps the data current and makes the workflow
reusable.
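The table-extraction step might look roughly like the sketch below. To stay dependency-free and runnable offline, it swaps in the standard library's `html.parser` for BeautifulSoup and parses a small made-up HTML snippet instead of fetching the Worldometers page. The class `TableParser`, the function `parse_population_table`, and the sample rows are all hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of the first <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = None      # row currently being built
        self._cell = None     # text fragments of the current cell
        self._tables_seen = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
        elif self._tables_seen == 1 and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_population_table(html):
    """Return one dict per data row, keyed by the header cells."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample markup standing in for the fetched page content.
SAMPLE_HTML = """
<table>
  <tr><th>Country (or dependency)</th><th>Population (2024)</th><th>Yearly Change</th></tr>
  <tr><td>Country A</td><td>1,000,000</td><td>1.00 %</td></tr>
  <tr><td>Country B</td><td>250,000</td><td>-0.50 %</td></tr>
</table>
"""

records = parse_population_table(SAMPLE_HTML)
print(records[0])
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(records)` to obtain the structured table used in the later steps.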
Preprocessing focuses on ensuring data consistency and quality. Column names are
standardized by removing special characters and spaces, while numerical columns are
converted into appropriate formats by stripping commas, percentage signs, and non-numeric
characters. Missing values are imputed with column means to preserve dataset completeness.
Additional features are used to enhance analysis, such as logarithmic transformations of
population and density to stabilize variance and the calculation of the urban-to-rural
population ratio. Numeric features are then scaled using StandardScaler to ensure uniform
contribution during machine learning.
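The cleaning and feature steps described here can be illustrated on a few values with the standard library alone. This is only a sketch under stated assumptions: the helper names `clean_number`, `impute_mean`, and `standardize` are hypothetical, the input strings are made up, and the project itself uses pandas and StandardScaler rather than these functions.

```python
import math
import statistics


def clean_number(text):
    """Convert a scraped string such as '1,234' or '37 %' to float; None if not numeric."""
    cleaned = text.replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. 'N.A.'


def impute_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = statistics.fmean(known)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Z-score using the population standard deviation, as StandardScaler does."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up sample strings, not rows from the scraped table.
urban_raw = ["37 %", "N.A.", "83 %"]
urban = impute_mean([clean_number(s) for s in urban_raw])

# Engineered features: urban-to-rural ratio and log-transformed population
urban_rural_ratio = [u / (100 - u) for u in urban]
population_log = math.log1p(clean_number("1,234,567"))

# Standardized column has zero mean and unit variance
urban_scaled = standardize(urban)
print(urban, urban_rural_ratio, urban_scaled)
```

Note that mean imputation pulls missing entries toward the center of the distribution, which is simple but can understate the spread of a column with many gaps.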
For predictive modeling, the Random Forest Regressor is then used to predict the
logarithmic transformation of the population (log-population), leveraging its ability to handle
non-linear relationships and rank feature importance effectively. Key features such as yearly
change, population density, and urban population are selected as predictors, given their strong
association with demographic trends. The dataset is split into training and testing subsets,
with 80% allocated for training and 20% reserved for testing. The model is trained using 100
decision trees, utilizing ensemble learning to improve accuracy and generalization.
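The modeling setup described above (80/20 split, 100 trees) could be sketched with scikit-learn as follows. The data here is synthetic and generated only for illustration; the three columns merely stand in for scaled predictors, and the `random_state` values are assumptions, since the report does not state which seeds were used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 3 scaled predictors (not the real dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Random forest with 100 decision trees
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
importances = model.feature_importances_  # one value per feature, summing to 1

print(f"MAE={mae:.3f}  R2={r2:.3f}  importances={importances}")
```

Because the third synthetic column does not enter the target, its importance should come out close to zero, which is the kind of ranking the feature importance chart visualizes.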
Model performance is evaluated using metrics such as Mean Absolute Error (MAE),
which quantifies the average magnitude of prediction errors, and R-squared (R²), which
measures the proportion of variance in the log population that the model explains.
Cross-validation is used to check robustness: the data is divided into multiple folds, and
the model is trained and tested iteratively across them. This approach helps reveal
overfitting and gives a better picture of how well the model generalizes.
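For reference, the standard definitions of these metrics are given below, together with the MSE and RMSE reported in the results section. Here $y_i$ is the actual value, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of evaluated samples; this notation is introduced here for illustration and does not appear in the original report.

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, &
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}, &
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.
\end{aligned}
```

R² equals 1 for perfect predictions, 0 for a model that always predicts $\bar{y}$, and becomes negative when the model does worse than that constant baseline.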
Exploratory Data Analysis (EDA) uncovers patterns and relationships within the data.
A correlation heatmap reveals significant links between variables, such as a strong negative
correlation between fertility rate and median age, and positive correlations between
urbanization and population density. Visualizations such as residual plots, actual vs. predicted
value plots, and feature importance charts provide deeper insights into model performance
and demographic drivers. Interactive plots using Plotly allow for dynamic exploration of
population trends, enhancing interpretability.
This combination of meticulous preprocessing, machine learning, and visualization
delivers a robust analysis of global population trends. The Random Forest Regressor,
combined with logarithmic transformation and carefully selected features, not only predicts
log population effectively but also highlights the underlying factors driving demographic
change. The methodology provides valuable insights for policymakers, researchers, and
organizations addressing global demographic challenges and opportunities.
4. Evaluation and Results
Description:​
The Random Forest model demonstrated strong predictive performance with the following
results:
●​ Mean Absolute Error (MAE): Indicates the average absolute deviation of
predictions from actual values.
●​ R-squared (R²): Measures how well the model explains the variance in the target
variable.
Visualizations include:
1.​ Residual Plot: Analyzing errors between predicted and actual values.
2.​ Actual vs. Predicted Plot: Validating the alignment of model predictions with actual
values.
3.​ Feature Importance Plot: Highlighting which features contribute most to the
predictions.
The regression model performed well on the evaluation metrics. The Mean Absolute Error
(MAE) of 0.088 shows that, on average, the model's predictions deviate from the actual
values by only a small margin. The Mean Squared Error (MSE) of 0.024 and the Root Mean
Squared Error (RMSE) of 0.154 are consistent with this, as lower values indicate smaller
prediction errors. Furthermore, the R² score of 0.979 means that the model accounts for
97.9% of the variance in the target variable.
The scatter plot comparing actual vs. predicted values supports these findings. Most data
points align closely with the diagonal red line, indicating that the model captures the
relationship between predictors and the target variable effectively. This visual representation
highlights consistent prediction accuracy across the dataset, with only minimal deviations for
a few outliers. The strong correlation between actual and predicted values affirms the model’s
suitability for the given data.
Feature importance analysis reveals that "World_Share" is by far the most influential
variable, accounting for the majority of the model's predictive power. This is expected: a
country's world share is its population divided by the world total, so this feature is
almost a direct restatement of the target. "Land_Area" and "Density" also contribute to the
predictions but play a secondary role. "Urban_Pop," "Migrants," and "Yearly_Change" have
limited predictive contributions, although they may still provide context. This ranking
shows which variables drive the model's performance and offers a starting point for
domain-specific interpretation.
However, the cross-validation scores are inconsistent with these results, with fold
scores ranging from -0.40 to -10.17. While the high R² score and low error metrics show
strong performance on the primary dataset, the negative cross-validation scores suggest
potential overfitting or data imbalance. This disparity needs further investigation, such
as examining the impact of outliers, trying resampling methods, or refining how the dataset
is split to improve model generalization. Addressing these inconsistencies would help the
model remain robust and reliable across different subsets of the data.
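One possible contributor to negative fold scores, offered here as an assumption rather than a diagnosis of the reported run, is the way folds are formed. If the scores are R² values and the rows are ordered by population, as a ranked table would be, then contiguous folds ask the model to predict a range of the target it never saw in training. The sketch below shows the effect with a mean-only baseline on made-up values; the function names `r_squared` and `fold_score` are hypothetical.

```python
def r_squared(actual, predicted):
    """R² = 1 - SS_res / SS_tot, computed on one evaluation fold."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def fold_score(values, test_idx):
    """Score a baseline that predicts the training mean for every test row."""
    test = [values[i] for i in test_idx]
    train = [v for i, v in enumerate(values) if i not in test_idx]
    train_mean = sum(train) / len(train)
    return r_squared(test, [train_mean] * len(test))


# Made-up target values in ascending order (hypothetical, not the real data)
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

contiguous = fold_score(target, [8, 9])  # last block of a sorted table
mixed = fold_score(target, [2, 7])       # fold spread across the range

print(f"contiguous fold R2: {contiguous}")  # -100.0
print(f"mixed fold R2:      {mixed}")       # 0.0
```

Tree-based models cannot extrapolate beyond the target range seen in training, so a similar effect could appear with the Random Forest. Shuffling the rows before splitting, for example with scikit-learn's `KFold(shuffle=True)`, is one way to test whether this explains the scores.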
5. Conclusions and Novelty
In conclusion, the WorldPopulationAnalyzer successfully addresses the challenges of
analyzing and forecasting global population trends by leveraging automated data collection,
preprocessing, and machine learning techniques. Through the use of web scraping and data
engineering, the project ensures access to up-to-date and accurate demographic information,
making it a reliable tool for policymakers, researchers, and stakeholders. The insights
derived, such as the relationships between fertility rates, urbanization, and population growth,
provide actionable intelligence to support decision-making in areas like urban planning,
resource allocation, and migration policies.
The predictive modeling component, using the Random Forest Regressor, achieved high
accuracy as indicated by metrics like R² and MAE. Cross-validation revealed inconsistencies,
however, so the model's reliability across different subsets of the data still needs to be
confirmed. Feature importance analysis highlights the critical role of variables like
"World_Share," offering a clearer understanding of which inputs drive the predictions.
The novelty of this project lies in its integration of advanced data preprocessing,
feature engineering, and visualization to streamline the analysis of complex global
demographic trends. Unlike static datasets or periodic reports, the WorldPopulationAnalyzer's
dynamic approach ensures scalability and adaptability, enabling continuous updates and
refinements as new data emerges. Additionally, its emphasis on interpretability, through
visualization and feature importance analysis, sets it apart as a practical and user-friendly
tool.
Future work could address the identified limitations, such as cross-validation
inconsistencies, through techniques like enhanced resampling, addressing outliers, or
incorporating additional socio-economic features to refine predictions further. By expanding
its scope to include real-time data pipelines or broader regional contexts, the
WorldPopulationAnalyzer could evolve into a more comprehensive solution for tackling
global demographic challenges.
6. The contribution of each team member
徐茲聰 H24115336:
1.​ Scraped world population data using web scraping techniques and structured the
dataset.
2.​ Preprocessed and cleaned the data, including handling missing values, scaling
features, and feature engineering.
3.​ Designed visualizations for the dataset, including a correlation heatmap and
exploratory plots.
林琳峰 H24115051:
1.​ Researched and implemented the Random Forest Regressor model, including
parameter tuning and evaluation.
2.​ Developed custom functions to test and evaluate the model, ensuring modularity and
reusability.
3.​ Created visualizations for model performance, such as residual plots and feature
importance charts.
曾煥志 H24115069:
1.​ Assisted in the design and refinement of the project framework and overall workflow.
2.​ Optimized and packaged the final code for reusability and ease of use.
3.​ Prepared the final project presentation, including summarizing key insights and
results.
Reference:
https://www.worldometers.info/world-population/population-by-country/
Video Link:
https://youtu.be/LzN8gBQe1MM