AN ANALYSIS OF FACTORS RELATING TO ENERGY AND ENVIRONMENT IN PREDICTING LIFE EXPECTANCY

Valerie Belding
CS 105 – Professor David G. Sullivan, Ph.D.
May 9, 2011

Introduction

As modern medicine advances and infant and child mortality continue to decline, life expectancy is increasing in many parts of the world. However, demographic predictions and increasing urbanization indicate greater stress on cities to maintain sanitation, water, and air quality. In addition, worldwide growth in population and urbanization results in the burning of more fossil fuels, contributing further to anthropogenic climate change. Although reasonable predictions exist regarding the location and extent of the impacts of climate change, their forecasted effects on the health of individuals and populations remain uncertain. Thus, understanding how various attributes relating to energy and environment may affect life expectancy is a worthwhile endeavor.

Dataset

The dataset was compiled from a series of tables (year 2008) from the UN database (data.un.org).
The dataset contains the following nine attributes:

Attribute: Country
Type: Nominal
Description: Country or area name

Attribute: Life_expectancy
Type: Numeric
Description: Average number of years an individual is expected to live

Attribute: CO2_emission
Type: Numeric
Description: Metric tonnes of CO2 emitted per capita
Hypothesis: Carbon dioxide emissions are a by-product of the burning of fossil fuels; more affluent people burn more fossil fuels and tend to have higher standards of living

Attribute: Improved_H20
Type: Numeric
Description: Proportion of population using improved drinking water resources
Hypothesis: Adequate access to potable water is essential to preventing potentially deadly waterborne diseases

Attribute: Energy_use
Type: Numeric
Description: Energy use in kilograms oil equivalent per capita
Hypothesis: Consumption of energy increases with economic development and subsequent improvements in public health

Attribute: Forest_area
Type: Numeric
Description: Proportion of land area covered by forest
Hypothesis: Relative standards of living and life expectancy increase as deforestation begins but ultimately decline as the frontier evolves (exhaustion of resources)

Attribute: Protected_area
Type: Numeric
Description: Terrestrial and marine areas protected to total territorial area, percentage
Hypothesis: Networks of protected land decrease the likelihood of deforestation, thereby increasing living standards

Attribute: CFC
Type: Numeric
Description: Consumption of ozone-depleting CFCs in ODP metric tonnes
Hypothesis: Although industrialized nations were forced to ban CFCs in 1995, less affluent developing nations were not required to begin their phase-out until 2010

Attribute: Improved_sanitation
Type: Numeric
Description: Proportion of population using improved sanitation facilities
Hypothesis: Failure to treat raw sewage directly affects the health of water systems, increasing a population’s risk of contracting waterborne illnesses

Data preparation

First, each of the individual eight tables had to be combined into one single table, a .csv file. Upon doing so, this master dataset had to be further edited, as not all countries contained a value for all eight attributes.
Any country with a missing value for any attribute was deleted, resulting in a final dataset of 115 countries. Additional format edits included deleting apostrophes and other symbols (such as percent signs). Unique identifiers (i.e., country names) also had to be eliminated. Before the dataset was loaded into Weka, the data was randomized and split into 80% training data (92 tuples) and 20% test data (the remaining 23 tuples).

A Python program was written to produce a single relational table for all of the data, which could be queried using SQL (see Appendix A for the program). SQL queries were written to determine the differences in average life expectancy between countries with above-average and below-average access to two of the attributes, Improved_sanitation and Improved_H2O (see Appendix B for the queries).

Data Analysis

The goal of this project is to understand the weight and relationship of several environmental and energy-related variables and their effect on life expectancy. Weka, an open-source data mining and machine learning package, was used to perform numeric estimation and build regression equations based on four models, as follows:

LinearRegression (no attribute selection) – Linear regression function utilizing all attributes given in the dataset.
LinearRegression (M5 Method) – Linear regression function utilizing attributes pruned by Weka.
SimpleLinearRegression – Linear regression function that utilizes the single ‘best’ (least error) attribute in the dataset.
MultilayerPerceptron – A neural network trained with backpropagation, used here to estimate a numeric output. This model feeds the attributes through a layer of sigmoid nodes into a linear output node (see Appendix C).

The complete model built by each algorithm can be found in Appendix C.
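The data preparation steps described above (dropping incomplete rows, removing the country identifier, shuffling, splitting 80/20, and querying a single relational table) can be sketched in Python as follows. This is a minimal illustration and not the program from Appendix A: the rows are toy values rather than the UN data, the helper names `drop_incomplete`, `shuffle_and_split`, and `average_by_access` are hypothetical, and only three of the attributes are included. The table and column names follow the queries in Appendix B, with the subquery written in parentheses as SQLite requires.

```python
import random
import sqlite3

# Toy rows for illustration only (NOT the UN data):
# (country, Life_expectancy, Improved_sanitation, Improved_H2O).
# None marks a missing value.
RAW_ROWS = [
    ("A", 80.0, 100.0, 100.0),
    ("B", 75.0, 90.0, 95.0),
    ("C", 60.0, 40.0, 70.0),
    ("D", 55.0, 30.0, None),  # incomplete, so it will be dropped
    ("E", 70.0, 80.0, 90.0),
    ("F", 50.0, 20.0, 50.0),
]

# Columns that may be compared against their own average.
QUERY_COLUMNS = ("Improved_sanitation", "Improved_H2O")


def drop_incomplete(rows):
    """Keep only rows that have a value for every attribute."""
    return [row for row in rows if all(value is not None for value in row)]


def shuffle_and_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def average_by_access(rows, column):
    """Return (average life expectancy at or above the column mean, below it)."""
    if column not in QUERY_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE FinalProjectTable ("
        "Life_expectancy REAL, Improved_sanitation REAL, Improved_H2O REAL)"
    )
    # row[1:] drops the country name, which is the unique identifier.
    conn.executemany(
        "INSERT INTO FinalProjectTable VALUES (?, ?, ?)",
        [row[1:] for row in rows],
    )
    template = (
        "SELECT AVG(Life_expectancy) FROM FinalProjectTable "
        "WHERE {col} {op} (SELECT AVG({col}) FROM FinalProjectTable)"
    )
    above = conn.execute(template.format(col=column, op=">=")).fetchone()[0]
    below = conn.execute(template.format(col=column, op="<")).fetchone()[0]
    conn.close()
    return above, below


if __name__ == "__main__":
    complete = drop_incomplete(RAW_ROWS)
    train, test = shuffle_and_split(complete)
    print(len(complete), len(train), len(test))
    print(average_by_access(complete, "Improved_sanitation"))
```

Applied to 115 complete rows, `round(115 * 0.8)` yields the 92/23 split reported above. The fixed seed only makes the shuffle repeatable; the seed used for the actual randomization is not stated in this paper.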
Results

Model                                       Test data correlation coefficient   Training data correlation coefficient
LinearRegression (no attribute selection)   0.7318                              0.8666
LinearRegression (M5 method)                0.7233                              0.8638
SimpleLinearRegression                      0.7083                              0.8261
MultilayerPerceptron                        0.3751                              0.9102

The linear regression function utilizing all seven attributes had the highest correlation on the test data (0.7318) and the highest correlation on the training data among the three linear models (0.8666); only MultilayerPerceptron scored higher on the training data (0.9102). This indicates that all of the attributes may have some correlation (even if it is minor) with life expectancy. This is intuitive, suggesting that no single factor can be simply correlated to predict life expectancy; rather, the factors affecting this attribute are numerous and varied.

Another notable result is that, with the exception of MultilayerPerceptron, all of the models had relatively similar correlation coefficients (within 5% of one another on the test data and on the training data, respectively). MultilayerPerceptron performed very poorly on the test data (0.3751) despite having the highest training correlation, a gap that suggests overfitting to the 92 training tuples. The similarity among the linear models suggests that perhaps one or two attributes have higher coefficients, and thus a greater effect on the outcome. An examination of the complete LinearRegression equation (Appendix C) supports this observation. The attributes Improved_H20 and Improved_sanitation have coefficients that are considerably larger than the coefficients of the other five attributes, although the size of a coefficient also depends on the units of its attribute. In fact, the SimpleLinearRegression model selected the attribute Improved_sanitation as having the single best correlation, and this model exhibited a similarly respectable correlation coefficient to the other models.

With respect to the hypotheses outlined above (see Dataset), not all of my predictions were realized in the model. In fact, CO2_Emission, Forest_area and Protected_area all had negative coefficients in the LinearRegression model, indicating a negative relationship with predicted life expectancy.

The strong correlation between life expectancy and proportion of population with access to improved sanitation is visualized in the following graphic.
This visualization was created courtesy of geocommons.com

The following graphics display results of the queries as outlined in Appendix B.

[Figure: Average life expectancy (years), axis from 50 to 75, for countries with above average access to improved sanitation versus countries with below average access to improved sanitation]

[Figure: Average life expectancy (years), axis from 50 to 75, for countries with above average access to improved drinking water resources versus countries with below average access to improved drinking water resources]

Conclusion

The reasonably high correlation coefficient of the LinearRegression (no attribute selection) model suggests that all of the attributes examined in this project may, to some extent, be useful indicators of the overall health of a given population. In particular, it appears that satisfactory infrastructure to provide clean water and sanitation has a considerably high correlation to life expectancy. Undoubtedly, the complexity of factors that may influence a population’s life expectancy is impossible to quantify precisely, but the models examined in this paper suggest an acceptable starting point for predicting this attribute.
Appendix A

Appendix B

SELECT AVG(Life_expectancy) FROM FinalProjectTable
WHERE Improved_sanitation >= (SELECT AVG(Improved_sanitation) FROM FinalProjectTable);

SELECT AVG(Life_expectancy) FROM FinalProjectTable
WHERE Improved_sanitation < (SELECT AVG(Improved_sanitation) FROM FinalProjectTable);

SELECT AVG(Life_expectancy) FROM FinalProjectTable
WHERE Improved_H2O >= (SELECT AVG(Improved_H2O) FROM FinalProjectTable);

SELECT AVG(Life_expectancy) FROM FinalProjectTable
WHERE Improved_H2O < (SELECT AVG(Improved_H2O) FROM FinalProjectTable);

Appendix C

LinearRegression (no attribute selection):

Life_expectancy =
      0.2902 * Improved_H20 +
     -0.0307 * CO2_Emission +
      0.0003 * Energy_use +
     -0.0038 * Forest_area +
     -0.0101 * Protected_area +
      0.0011 * CFC +
      0.1502 * Improved_sanitation +
     32.8838

LinearRegression (M5 Method):

Life_expectancy =
      0.2927 * Improved_H20 +
      0.1621 * Improved_sanitation +
     32.0597

SimpleLinearRegression:

Life_expectancy =
      0.27 * Improved_sanitation +
     49.8

MultilayerPerceptron:

Linear Node 0
    Inputs                        Weights
    Threshold                     -0.06365923447434069
    Node 1                        -1.3044840595621277
    Node 2                        0.6436857124058317
    Node 3                        1.036796201581872
    Node 4                        2.6752783397229476
Sigmoid Node 1
    Inputs                        Weights
    Threshold                     0.4950880605031987
    Attrib Improved_H20           -4.414981332791403
    Attrib CO2_Emission           -2.207873803607682
    Attrib Energy_use             3.0120445732248595
    Attrib Forest_area            3.458045028459332
    Attrib Protected_area         -0.09115114212144113
    Attrib CFC                    0.1619057324156068
    Attrib Improved_sanitation    -0.24846758864819155
Sigmoid Node 2
    Inputs                        Weights
    Threshold                     -1.5602335922282642
    Attrib Improved_H20           -0.24151832851994415
    Attrib CO2_Emission           0.05532850592851299
    Attrib Energy_use             0.40163903002722456
    Attrib Forest_area            1.0116682371359682
    Attrib Protected_area         0.5947206208536553
    Attrib CFC                    0.27811918882095177
    Attrib Improved_sanitation    1.2207248134394877
Sigmoid Node 3
    Inputs                        Weights
    Threshold                     -1.7527667679831078
    Attrib Improved_H20           2.408271404556648
    Attrib CO2_Emission           -1.238570152951369
    Attrib Energy_use             1.7449810104693952
    Attrib Forest_area            0.9862758753316596
    Attrib Protected_area         -1.9185447724177973
    Attrib CFC                    -0.21952811855816465
    Attrib Improved_sanitation    1.3443842038611158
Sigmoid Node 4
    Inputs                        Weights
    Threshold                     -1.6509809202156398
    Attrib Improved_H20           -3.2545534283626383
    Attrib CO2_Emission           0.20223486797597762
    Attrib Energy_use             1.1115153730699168
    Attrib Forest_area            2.3708684468721155
    Attrib Protected_area         3.9959068129510884
    Attrib CFC                    -0.6554578841790959
    Attrib Improved_sanitation    0.1489077699139931
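
To illustrate how the three regression equations in this appendix turn attribute values into a predicted life expectancy, the following Python sketch evaluates each of them for one hypothetical country. The coefficients and intercepts are copied from the equations in this appendix; the attribute values in `EXAMPLE_COUNTRY` are invented for illustration and do not correspond to any country in the dataset, and the function name `predict` is likewise an assumption of this sketch.

```python
# Coefficients and intercepts copied from the regression equations in Appendix C.
FULL_MODEL = {
    "Improved_H20": 0.2902,
    "CO2_Emission": -0.0307,
    "Energy_use": 0.0003,
    "Forest_area": -0.0038,
    "Protected_area": -0.0101,
    "CFC": 0.0011,
    "Improved_sanitation": 0.1502,
}
FULL_INTERCEPT = 32.8838

M5_MODEL = {"Improved_H20": 0.2927, "Improved_sanitation": 0.1621}
M5_INTERCEPT = 32.0597

SIMPLE_MODEL = {"Improved_sanitation": 0.27}
SIMPLE_INTERCEPT = 49.8

# Hypothetical attribute values, invented for this example only.
EXAMPLE_COUNTRY = {
    "Improved_H20": 90.0,         # percent of population
    "CO2_Emission": 5.0,          # metric tonnes per capita
    "Energy_use": 2000.0,         # kg oil equivalent per capita
    "Forest_area": 30.0,          # percent of land area
    "Protected_area": 10.0,       # percent of territorial area
    "CFC": 100.0,                 # ODP metric tonnes
    "Improved_sanitation": 80.0,  # percent of population
}


def predict(coefficients, intercept, country):
    """Evaluate a linear model: intercept plus the sum of coefficient * value."""
    return intercept + sum(
        weight * country[name] for name, weight in coefficients.items()
    )


if __name__ == "__main__":
    print(round(predict(FULL_MODEL, FULL_INTERCEPT, EXAMPLE_COUNTRY), 4))
    print(round(predict(M5_MODEL, M5_INTERCEPT, EXAMPLE_COUNTRY), 4))
    print(round(predict(SIMPLE_MODEL, SIMPLE_INTERCEPT, EXAMPLE_COUNTRY), 4))
```

For these invented inputs the three equations give roughly 71.36, 71.37, and 71.40 years respectively. In the full model the two water and sanitation terms contribute about 38.13 of the 38.48 years added to the intercept.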