National Cheng Kung University, Academic Year 114, Semester 2
Final Report
Course: Introduction to Data Science
Instructor: 李政德
Report topic: World Population Analysis and Predictive Modeling
Group: 龍生科技
Members: Statistics 115 徐茲聰 H24115336, Statistics 115 曾煥志 H24115069, Statistics 115 林琳峰 H24115051

0. Brief Introduction of the Problem

Understanding and forecasting global population trends is essential for informed decision-making across sectors, including government policy, economic planning, healthcare, education, and environmental sustainability. Population data helps policymakers allocate resources efficiently, anticipate future demands, and address challenges related to aging populations, urbanization, fertility rates, age structure, and migration patterns. However, acquiring up-to-date, reliable, and comprehensive population data can be challenging, because demographic conditions change constantly and data sources vary.

The problem addressed by the WorldPopulationAnalyzer is the need for an automated, robust, and scalable solution to collect, process, analyze, and visualize world population data by country. This project aims to analyze global population data, uncover key insights, and build a predictive model to estimate population trends based on socio-economic indicators. The analyzer aims to provide up-to-date and accurate information by using web scraping to retrieve data from Worldometers. The program cleans and converts the raw data, identifies the factors that most influence population dynamics, and models population trends through data preparation, feature engineering, and machine learning. Visualization then makes the relationships in the data easier to understand, so that stakeholders can draw insights and base decisions on evidence.

1. Data Description and Preprocessing

The dataset provides global population statistics for countries and dependencies.
Key features include:

● Country (or dependency): The name of the country or dependency.
● Population (2024): Projected population for 2024.
● Yearly Change (%): Annual percentage change in population.
● Net Change: Net population change in absolute numbers.
● Density (P/Km²): Population density per square kilometer.
● Land Area (Km²): Total land area of the country.
● Migrants (Net): Net migration numbers.
● Fertility Rate: Average number of children per woman.
● Median Age: Median age of the population.
● Urban Population (%): Percentage of the population living in urban areas.
● World Share (%): Percentage of the world's population.

These features offer a comprehensive view of demographic and geographic attributes, helping to analyze population growth trends and related factors.

Preprocessing Steps:

Preprocessing began with data extraction: BeautifulSoup was used to scrape an HTML table containing global population statistics. The raw data was then cleaned by removing invalid characters, converting strings into numeric types, and handling missing values. To improve usability, columns were renamed by replacing spaces and special characters with underscores. Missing values in numerical features were imputed with column means to keep the dataset complete.

Feature engineering was then applied to enhance the dataset's analytical potential. New features, such as the logarithm of the population and the urban-to-rural population ratio, were introduced to capture deeper insights and improve predictive accuracy. Finally, numerical features were standardized using StandardScaler, so that every variable has zero mean and unit variance before being used in machine learning models. The preprocess_data method implements these steps:

```python
import numpy as np
import pandas as pd


def preprocess_data(self):
    """Preprocess the scraped data (a method of the analyzer class)."""
    if self.data is None:
        print("No data to process. Please scrape data first.")
        return None

    print("Processing data...")
    self.processed_data = self.data.copy()

    # Convert numerical columns
    print("Converting numerical columns...")

    # Shorten the scraped column names first
    self.processed_data.rename(columns={
        'Population_(2024)': 'Population_2024',
        'Med._Age': 'Med_Age',
        'Density_(P/Km²)': 'Density',
        'Land_Area_(Km²)': 'Land_Area',
        'Migrants_(net)': 'Migrants',
        'Fert._Rate': 'Fert_Rate',
        'Urban_Pop_%': 'Urban_Pop'
    }, inplace=True)

    try:
        # Plain numeric columns. This assumes thousands separators were already
        # stripped during scraping; otherwise values such as '1,234' become NaN.
        for col in ['Population_2024', 'Net_Change', 'Density', 'Land_Area',
                    'Migrants', 'Fert_Rate', 'Med_Age']:
            self.processed_data[col] = pd.to_numeric(
                self.processed_data[col], errors='coerce')

        # Yearly Change: drop the trailing '%' and convert to float
        self.processed_data['Yearly_Change'] = (
            self.processed_data['Yearly_Change'].str.rstrip('%').astype(float))

        # Urban Population: replace 'N.A.' with NaN, remove the '%' sign,
        # then convert to float, coercing any remaining errors to NaN
        self.processed_data['Urban_Pop'] = (
            self.processed_data['Urban_Pop'].replace('N.A.', pd.NA))
        self.processed_data['Urban_Pop'] = (
            self.processed_data['Urban_Pop'].str.rstrip('%'))
        self.processed_data['Urban_Pop'] = pd.to_numeric(
            self.processed_data['Urban_Pop'], errors='coerce')

        # World Share: drop the trailing '%' and convert to float
        self.processed_data['World_Share'] = (
            self.processed_data['World_Share'].str.rstrip('%').astype(float))
    except Exception as e:
        print(f"Error during numerical conversion: {e}")
        return None

    # Handle missing values: fill NaN in numeric columns with the column mean
    numeric_columns = self.processed_data.select_dtypes(include=['number']).columns
    self.processed_data[numeric_columns] = (
        self.processed_data[numeric_columns].fillna(
            self.processed_data[numeric_columns].mean()))
    print(self.processed_data.head())

    # Feature engineering
    self.processed_data['Population_Log'] = np.log1p(
        self.processed_data['Population_2024'])
    self.processed_data['Density_Log'] = np.log1p(self.processed_data['Density'])
    self.processed_data['Urban_Rural_Ratio'] = (
        self.processed_data['Urban_Pop'] / (100 - self.processed_data['Urban_Pop']))

    # Scale features
    numeric_columns = self.processed_data.select_dtypes(include=[np.number]).columns
    # Replace infinite values (e.g. a ratio with 100% urban population) with NaN
    self.processed_data.replace([np.inf, -np.inf], np.nan, inplace=True)
    # Fill any NaN values introduced above with the column mean
    self.processed_data[numeric_columns] = (
        self.processed_data[numeric_columns].fillna(
            self.processed_data[numeric_columns].mean()))
    # Apply scaling (self.scaler is the StandardScaler described above)
    self.processed_data[numeric_columns] = self.scaler.fit_transform(
        self.processed_data[numeric_columns])
    return self.processed_data
```

2. Insights Discovered from the Data

The analysis of the global population dataset reveals regional trends and relationships that highlight the diverse demographic factors influencing population dynamics worldwide. High-fertility nations like India and Pakistan are experiencing rapid population growth, driven by elevated fertility rates and youthful demographics.
These countries benefit from demographic momentum, where a large proportion of young people leads to continued population increases even if fertility rates decline over time. In contrast, aging, low-fertility countries such as China are witnessing population declines. This demographic shift signals challenges for labor markets, economic growth, and social support systems, requiring strategic planning to address the consequences of an aging population.

Land area, perhaps surprisingly, shows almost no relationship with population density, with a correlation of -0.058. A country's geographic size therefore says little about how densely populated it is. For example, Bangladesh, despite its small land area, is highly dense, while larger nations like Russia have relatively sparse populations. Urbanization, on the other hand, emerges as a key driver of population trends. Highly urbanized nations tend to exhibit higher population densities and contribute significantly to the global population share, underscoring the importance of urban development in shaping demographic patterns.

Migration plays a moderate yet vital role in population growth, particularly in countries like the United States, where positive net migration consistently supports steady increases in population. This highlights the significance of migration policies in shaping demographic outcomes, especially for nations relying on immigration to offset slowing natural growth rates.

The correlation heatmap provides deeper insights into demographic trends. Fertility rate has a strong negative correlation (-0.85) with median age, indicating that nations with younger populations, such as Pakistan, tend to have higher fertility rates. Urbanization is positively correlated with both population density (0.48) and the urban-to-rural ratio (0.43), suggesting that increased urbanization goes together with denser populations and a shift in the balance between urban and rural areas.
Furthermore, yearly population change is positively correlated with the fertility rate (0.63) and negatively correlated with median age (-0.67), emphasizing the role of younger populations in driving growth. Aging populations, such as those in China and parts of Europe, are seeing slower or even negative growth, which underscores the need for strategic policies to address demographic challenges.

These insights reveal how fertility, urbanization, and migration interact to shape global population trends. For high-growth nations, youthful populations and higher fertility rates offer economic opportunities, such as workforce expansion, but require investments in education, healthcare, and infrastructure for sustainable growth. Aging nations, meanwhile, must address an aging workforce, declining productivity, and rising dependency ratios through policies that encourage migration, support families, and reform retirement. Urbanization's influence calls for smart urban planning, including housing, transportation, and resource management, to accommodate growing city populations. Lastly, the importance of migration in sustaining population growth, particularly in developed nations, highlights the need for inclusive and balanced migration strategies.

These findings provide a comprehensive perspective on global population trends and serve as a guide for policymakers, researchers, and stakeholders to anticipate challenges, leverage opportunities, and plan strategically for the future.

3. Methodology Details

The WorldPopulationAnalyzer employs a multi-step pipeline to acquire, preprocess, analyze, model, and visualize global population data. This methodology supports data integrity and provides actionable insights into the demographic dynamics shaping the world.
The process begins with data acquisition: the dataset is scraped from the Worldometers "Population by Country" page, using the requests library to fetch the page content and BeautifulSoup to parse the HTML. The table containing population statistics is extracted and converted into a Pandas DataFrame for further analysis. Automating this step gives a reusable framework that picks up the most recently published figures on each run.

Preprocessing focuses on data consistency and quality. Column names are standardized by removing special characters and spaces, and numerical columns are converted to numeric types by stripping commas, percentage signs, and other non-numeric characters. Missing values are imputed with column means to preserve dataset completeness. Additional features are engineered to support the analysis: logarithmic transformations of population and density to stabilize variance, and the urban-to-rural population ratio. Numeric features are then scaled with StandardScaler so that they are on a comparable scale during machine learning.

For predictive modeling, a Random Forest Regressor is used to predict the logarithm of the population (log-population), chosen for its ability to handle non-linear relationships and to rank feature importance. Features such as yearly change, population density, and urban population are selected as predictors, given their association with demographic trends. The dataset is split into training and testing subsets, with 80% allocated for training and 20% reserved for testing. The model is an ensemble of 100 decision trees, which improves accuracy and generalization over a single tree. Model performance is evaluated using Mean Absolute Error (MAE), which quantifies the average magnitude of prediction errors, and R-squared (R²), the proportion of variance in the log-population explained by the model.
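As a rough illustration of the modeling step just described, the sketch below fits a 100-tree Random Forest on an 80/20 split and reports MAE and R². The data are synthetic, and the column names only mirror the report's naming. The helper `fit_and_score` and the seed value are assumptions made for reproducibility, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(df, features, target, seed=42):
    """Fit a 100-tree Random Forest on an 80/20 split (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    metrics = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
        "n_test": len(y_test),
    }
    return model, metrics


# Synthetic stand-in for the processed table (NOT the Worldometers data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Yearly_Change": rng.normal(size=n),
    "Density": rng.normal(size=n),
    "Urban_Pop": rng.normal(size=n),
})
# Assumed toy relationship: the target depends mostly on Density.
demo["Population_Log"] = (
    1.5 * demo["Density"] - 0.5 * demo["Urban_Pop"]
    + rng.normal(scale=0.1, size=n)
)

features = ["Yearly_Change", "Density", "Urban_Pop"]
model, metrics = fit_and_score(demo, features, "Population_Log")

# Impurity-based importances sum to 1 and can be ranked per feature.
importances = pd.Series(model.feature_importances_, index=features)
print(metrics)
print(importances.sort_values(ascending=False))
```

Fixing `random_state` for both the split and the forest makes the reported metrics repeatable between runs, which helps when comparing feature sets.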
Cross-validation is used to check robustness: the data are divided into multiple folds, and the model is iteratively trained and tested across them. This helps detect overfitting and gives a more reliable picture of how the model generalizes.

Exploratory Data Analysis (EDA) uncovers patterns and relationships within the data. A correlation heatmap reveals significant links between variables, such as a strong negative correlation between fertility rate and median age, and positive correlations between urbanization and population density. Visualizations such as residual plots, actual vs. predicted value plots, and feature importance charts provide deeper insights into model performance and demographic drivers. Interactive plots using Plotly allow for dynamic exploration of population trends, enhancing interpretability.

This combination of preprocessing, machine learning, and visualization delivers a robust analysis of global population trends. The Random Forest Regressor, combined with logarithmic transformation and carefully selected features, not only predicts log-population effectively but also highlights the underlying factors driving demographic change. The methodology provides valuable insights for policymakers, researchers, and organizations addressing global demographic challenges and opportunities.

4. Evaluation and Results Description:

The Random Forest model demonstrated strong predictive performance, assessed with the following metrics:

● Mean Absolute Error (MAE): Indicates the average absolute deviation of predictions from actual values.
● R-squared (R²): Measures how well the model explains the variance in the target variable.

Visualizations include:

1. Residual Plot: Analyzing errors between predicted and actual values.
2. Actual vs. Predicted Plot: Validating the alignment of model predictions with actual values.
3. Feature Importance Plot: Highlighting which features contribute most to the predictions.
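The k-fold procedure described in the methodology can be sketched with scikit-learn's `cross_val_score`. The example uses synthetic rows deliberately sorted by the feature to show that fold construction matters. An unshuffled `KFold` (which is also what scikit-learn uses for a regressor when `cv` is given as an integer) takes contiguous blocks of rows, so each test fold can cover a value range the model never saw in training. Whether this applies to the project's data is an assumption, since the report does not state the row order or the scoring it used. The data, seed, and `cv_r2` helper are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score


def cv_r2(X, y, shuffle, seed=0):
    """Return five per-fold R² scores of a 100-tree forest (hypothetical helper)."""
    folds = KFold(
        n_splits=5,
        shuffle=shuffle,
        random_state=seed if shuffle else None,  # a seed is only valid when shuffling
    )
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    return cross_val_score(model, X, y, cv=folds, scoring="r2")


# Synthetic data sorted by the feature, so the target is nearly sorted too.
rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=200))
y = 2.0 * x + rng.normal(scale=0.1, size=200)
X = x.reshape(-1, 1)

ordered = cv_r2(X, y, shuffle=False)  # contiguous folds: the model must extrapolate
shuffled = cv_r2(X, y, shuffle=True)  # random folds: train and test ranges overlap

print("unshuffled folds:", np.round(ordered, 2))
print("shuffled folds:  ", np.round(shuffled, 2))
```

On data ordered like this, the unshuffled scores are typically much lower than the shuffled ones and can be negative, because tree-based models cannot predict outside the range of targets seen in training. Comparing the two settings is a cheap diagnostic when per-fold scores vary widely.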
The regression model demonstrated excellent predictive performance on the test set, as indicated by the evaluation metrics. The Mean Absolute Error (MAE) of 0.088 shows that, on average, the model's predictions deviate from the actual values by only a small margin. The Mean Squared Error (MSE) of 0.024 and the Root Mean Squared Error (RMSE) of 0.154 confirm this, as lower values indicate smaller prediction errors. Furthermore, the R² score of 0.979 means that the model accounts for 97.9% of the variance in the target variable.

The scatter plot comparing actual vs. predicted values supports these findings. Most data points align closely with the diagonal red line, indicating that the model captures the relationship between predictors and the target variable effectively. The plot shows consistent prediction accuracy across the dataset, with only small deviations for a few outliers.

Feature importance analysis reveals that "World_Share" is by far the most influential variable in determining the target outcome, accounting for the majority of the model's predictive power. This is to be expected, because a country's share of the world population is by definition its population divided by the world total, so this feature encodes the target almost directly. Other factors like "Land_Area" and "Density" contribute to the predictions but play a secondary role. The low importance of features such as "Urban_Pop," "Migrants," and "Yearly_Change" suggests that, while they may provide context, their predictive contributions are limited. This analysis clarifies which variables drive the model's performance and offers a starting point for domain-specific interpretation.

However, cross-validation scores reveal inconsistencies, with per-fold scores ranging from -0.40 down to -10.17. While the high R² score and low error metrics show strong performance on the primary train/test split, the negative cross-validation scores suggest potential overfitting or data imbalance.
This disparity highlights a need for further investigation, such as examining the impact of outliers, trying resampling methods, or refining how the dataset is split, to improve model generalization. Addressing these inconsistencies would help ensure the model remains robust and reliable across different subsets of the data.

5. Conclusions and Novelty

In conclusion, the WorldPopulationAnalyzer addresses the challenges of analyzing and forecasting global population trends by combining automated data collection, preprocessing, and machine learning. Through web scraping and data engineering, the project provides access to current demographic information, making it a useful tool for policymakers, researchers, and stakeholders. The insights derived, such as the relationships between fertility rates, urbanization, and population growth, provide actionable intelligence to support decision-making in areas like urban planning, resource allocation, and migration policies.

The predictive modeling component, built on the Random Forest Regressor, achieved high accuracy on the test set as indicated by R² and MAE. Despite the inconsistencies revealed in cross-validation, the model estimates log-population well on the held-out data. Feature importance analysis highlights the critical role of variables like "World_Share," offering a more detailed understanding of the factors driving the model's predictions.

The novelty of this project lies in its integration of data preprocessing, feature engineering, and visualization to streamline the analysis of complex global demographic trends. Unlike static datasets or periodic reports, the WorldPopulationAnalyzer can be rerun as new data is published, which makes the analysis scalable and adaptable to continuous updates and refinements.
Additionally, its emphasis on interpretability, through visualization and feature importance analysis, sets it apart as a practical and user-friendly tool.

Future work could address the identified limitations, such as cross-validation inconsistencies, through techniques like enhanced resampling, addressing outliers, or incorporating additional socio-economic features to refine predictions further. By expanding its scope to include real-time data pipelines or broader regional contexts, the WorldPopulationAnalyzer could evolve into a more comprehensive solution for tackling global demographic challenges.

6. The contribution of each team member

徐茲聰 H24115336:
1. Scraped world population data using web scraping techniques and structured the dataset.
2. Preprocessed and cleaned the data, including handling missing values, scaling features, and feature engineering.
3. Designed visualizations for the dataset, including a correlation heatmap and exploratory plots.

林琳峰 H24115051:
1. Researched and implemented the Random Forest Regressor model, including parameter tuning and evaluation.
2. Developed custom functions to test and evaluate the model, ensuring modularity and reusability.
3. Created visualizations for model performance, such as residual plots and feature importance charts.

曾煥志 H24115069:
1. Assisted in the design and refinement of the project framework and overall workflow.
2. Optimized and packaged the final code for reusability and ease of use.
3. Prepared the final project presentation, including summarizing key insights and results.

Reference: https://www.worldometers.info/world-population/population-by-country/

Video Link: https://youtu.be/LzN8gBQe1MM