
AN ANALYSIS OF FACTORS RELATING TO ENERGY AND ENVIRONMENT IN PREDICTING LIFE EXPECTANCY
Valerie Belding
CS 105 – Professor David G. Sullivan, Ph.D
May 9, 2011
Introduction
As modern medicine advances and infant and child mortality continue to
decline, life expectancy is increasing in many parts of the world. However,
demographic predictions and an increase in urbanization indicate greater stress
on cities to maintain sanitation, water, and air quality. In addition, a worldwide
increase in population and urbanization results in the burning of more fossil fuels,
contributing further to anthropogenic climate change.
Although there exist reasonable predictions with regard to the location
and extent of the impacts of climate change, it remains uncertain how those
impacts will affect the health of individuals and populations. Thus,
understanding how various attributes relating to energy and environment may
affect life expectancy is a worthwhile endeavor.
Dataset
The dataset was compiled from a series of tables (year 2008) from the UN
database (data.un.org). The dataset contains the following nine attributes:
| Attribute | Type | Description | Hypothesis |
| --- | --- | --- | --- |
| Country | Nominal | Country or area name | |
| Life_expectancy | Numeric | Average number of years an individual is expected to live | |
| CO2_emission | Numeric | Metric tonnes of CO2 emitted per capita | Carbon dioxide emissions are a by-product of the burning of fossil fuels; more affluent people burn more fossil fuels and tend to have higher standards of living |
| Improved_H20 | Numeric | Proportion of population using improved drinking water resources | Adequate access to potable water is essential to preventing potentially deadly waterborne diseases |
| Energy_use | Numeric | Energy use in kilograms oil equivalent per capita | Consumption of energy increases with economic development and subsequent improvements in public health |
| Forest_area | Numeric | Proportion of land area covered by forest | Relative standards of living and life expectancy increases as deforestation begins but ultimately declines as the frontier evolves (exhaustion of resources) |
| Protected_area | Numeric | Terrestrial and marine areas protected to total territorial area, percentage | Networks of protected land decrease the likelihood of deforestation, thereby increasing living standards |
| CFC | Numeric | Consumption of ozone-depleting CFCs in ODP metric tonnes | Although industrialized nations were forced to ban CFCs in 1995, less affluent developing nations were not required to begin their phase out until 2010 |
| Improved_sanitation | Numeric | Proportion of population using improved sanitation facilities | Failure to treat raw sewage directly affects the health of water systems, increasing a population’s risk of contracting waterborne illnesses |
Data preparation
First, the eight individual tables were combined into a single table and saved as
a .csv file. This master dataset required further editing, because not every
country had a value for all eight attributes. Any country with a missing
value for any attribute was deleted, leaving a final dataset of 115 countries.
Additional formatting edits included deleting apostrophes and other symbols (such
as percent signs). Unique identifiers, i.e. country names, were also removed.
Before the dataset was loaded into Weka, the data were randomized and split into
80% training data (92 tuples) and 20% test data (the remaining 23 tuples).
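The merge, filter, and split steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the project: the helper names `merge_tables` and `train_test_split`, the dict-based table layout, and the sample values are all hypothetical, and a missing value is assumed to appear as an absent country key.

```python
import random

def merge_tables(tables):
    """Combine per-attribute tables into one row per country.

    `tables` maps an attribute name to a {country: value} dict (an assumed
    layout). Countries missing from any table are dropped.
    """
    attributes = list(tables)
    # Only countries present in every table have a complete row.
    complete = set.intersection(*(set(t) for t in tables.values()))
    return [
        {"Country": c, **{a: tables[a][c] for a in attributes}}
        for c in sorted(complete)
    ]

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Tiny made-up example (values are illustrative, not UN data).
tables = {
    "Life_expectancy": {"A": 70.0, "B": 60.0, "C": 80.0},
    "Improved_sanitation": {"A": 85.0, "C": 99.0},  # country "B" is missing
}
rows = merge_tables(tables)          # keeps only countries "A" and "C"
train, test = train_test_split(rows)
```

With 115 complete rows and the default `train_fraction` of 0.8, the split yields 92 training and 23 test tuples, matching the counts above.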
A Python program was written to produce a single relational table containing all
of the data, which could then be queried using SQL (see Appendix A for the program).
SQL queries were written to compare average life expectancy between countries
with above-average and below-average values for two of the attributes,
Improved_sanitation and Improved_H2O (see Appendix B for the queries).
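As a rough illustration of this step, the sketch below loads a few rows into a relational table and compares average life expectancy above and below the mean of one attribute. It assumes SQLite through Python's built-in sqlite3 module, since the database engine used in the project is not stated. The table and column names follow Appendix B; the rows and the helper `avg_life_expectancy` are hypothetical.

```python
import sqlite3

# Illustrative rows only: (Life_expectancy, Improved_sanitation). Not UN data.
rows = [(55.0, 30.0), (60.0, 40.0), (75.0, 90.0), (80.0, 100.0)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FinalProjectTable ("
    "Life_expectancy REAL, Improved_sanitation REAL)"
)
conn.executemany("INSERT INTO FinalProjectTable VALUES (?, ?)", rows)

ALLOWED = {"Improved_sanitation"}  # columns that may be interpolated into SQL

def avg_life_expectancy(conn, attribute, above):
    """Average life expectancy for rows at/above (or below) the attribute's mean."""
    if attribute not in ALLOWED:
        raise ValueError(f"unknown attribute: {attribute}")
    op = ">=" if above else "<"
    query = (
        f"SELECT AVG(Life_expectancy) FROM FinalProjectTable "
        f"WHERE {attribute} {op} "
        f"(SELECT AVG({attribute}) FROM FinalProjectTable)"
    )
    return conn.execute(query).fetchone()[0]

above = avg_life_expectancy(conn, "Improved_sanitation", above=True)
below = avg_life_expectancy(conn, "Improved_sanitation", above=False)
```

The scalar subquery has to be wrapped in parentheses. The column name is checked against a whitelist because SQL parameters cannot stand in for identifiers.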
Data Analysis
The goal of this project is to understand the relative weight of several
environmental and energy-related variables and their relationship to life
expectancy. Weka, an open-source data mining and machine learning package, was
used to perform numeric estimation and build predictive models with four
algorithms, as follows:
LinearRegression (no attribute selection) – Linear regression function utilizing all
seven predictor attributes in the dataset.
LinearRegression (M5 Method) – Linear regression function utilizing only the
attributes retained by Weka's M5 attribute selection.
SimpleLinearRegression – Linear regression function that utilizes the single ‘best’
(lowest squared error) attribute in the dataset.
MultilayerPerceptron – A neural network trained with backpropagation, used here
for numeric estimation. The model passes the inputs through a hidden layer of
sigmoid nodes whose outputs feed a single linear output node.
The complete model built by each algorithm can be found in Appendix C.
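To make the idea behind SimpleLinearRegression concrete, the following sketch fits a one-variable least-squares line for each attribute and keeps the attribute with the lowest squared error. It is a simplified illustration rather than Weka's implementation; the function names and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept.

    Assumes xs is not constant (otherwise the slope is undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def squared_error(xs, ys, slope, intercept):
    """Sum of squared residuals of the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def best_single_attribute(columns, target):
    """Return (name, slope, intercept) for the attribute with the least squared error."""
    best = None
    for name, xs in columns.items():
        slope, intercept = fit_line(xs, target)
        err = squared_error(xs, target, slope, intercept)
        if best is None or err < best[0]:
            best = (err, name, slope, intercept)
    return best[1:]

# Made-up values for illustration only.
target = [50.0, 60.0, 70.0, 80.0]                    # life expectancy
columns = {
    "Improved_sanitation": [20.0, 40.0, 60.0, 80.0],  # perfectly linear here
    "Forest_area": [10.0, 50.0, 20.0, 40.0],
}
name, slope, intercept = best_single_attribute(columns, target)
```

On this toy data the selected attribute is Improved_sanitation, with a slope of 0.5 and an intercept of 40.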
Results
| Model | Test data correlation coefficient | Training data correlation coefficient |
| --- | --- | --- |
| LinearRegression (no attribute selection) | 0.7318 | 0.8666 |
| LinearRegression (M5 method) | 0.7233 | 0.8638 |
| SimpleLinearRegression | 0.7083 | 0.8261 |
| MultilayerPerceptron | 0.3751 | 0.9102 |
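For readers unfamiliar with the metric, the sketch below computes a correlation coefficient between actual and predicted values. It assumes the reported figures are Pearson correlations between actual and predicted life expectancy; the function name is hypothetical and no project data is used.

```python
import math

def correlation(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Undefined when either sequence is constant (division by zero).
    return cov / math.sqrt(var_a * var_p)

# Predictions that rise and fall exactly with the actual values give 1.0;
# predictions that move in the opposite direction give -1.0.
perfect = correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
inverse = correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A value near 1 means the predictions track the actual values closely; a value near 0 means they carry little linear information about them.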
The linear regression function utilizing all seven attributes had the highest
correlation on the test data (0.7318) and the highest correlation on the training
data among the three linear models (0.8666). This suggests that all of the
attributes may have some correlation (even if minor) with life expectancy. This
is intuitive: no single factor fully predicts life expectancy; rather, the factors
affecting it are numerous and varied. MultilayerPerceptron achieved the highest
correlation on the training data (0.9102) but by far the lowest on the test data
(0.3751), a gap that is consistent with the model overfitting the 92 training tuples.
Another notable result is that, with the exception of MultilayerPerceptron, all of
the models had similar correlation coefficients (within 5% of one another on the
test data and on the training data, respectively). This suggests that one or two
attributes have larger coefficients, and thus a greater effect on the
outcome. An examination of the complete LinearRegression equation (Appendix
C) supports this observation. The attributes Improved_H20 and
Improved_sanitation have coefficients that are considerably larger than the
coefficients of the other five attributes. In fact, the SimpleLinearRegression model
selected Improved_sanitation as the single best attribute, and this model's
correlation coefficient on the test data (0.7083) was close to those of the other
linear models.
With respect to the hypotheses outlined above (see Dataset), not all of my
predictions were borne out by the model. In fact, CO2_Emission, Forest_area and
Protected_area all had negative coefficients in the full LinearRegression equation
(Appendix C), indicating a negative relationship with life expectancy.
The strong correlation between life expectancy and the proportion of the population
with access to improved sanitation is visualized in the following graphic. The
visualization was created using geocommons.com.
The following graphics display results of the queries as outlined in Appendix B.
[Figure: Average life expectancy (years), y-axis from 50 to 75, for countries with above-average and below-average access to improved sanitation]
[Figure: Average life expectancy (years), y-axis from 50 to 75, for countries with above-average and below-average access to improved drinking water resources]
Conclusion
The reasonably high correlation coefficient of the LinearRegression (no attribute
selection) model on the test data (0.7318) suggests that the attributes examined in
this project may, to some extent, be useful indicators of the overall health of a
given population. In particular, infrastructure that provides clean water and
sanitation appears to be strongly correlated with life expectancy. The full set of
factors that influence a population’s life expectancy is undoubtedly too complex
to quantify precisely, but the models examined in this paper offer
an acceptable starting point for predicting this attribute.
Appendix A
Appendix B
-- Countries at or above average access to improved sanitation
SELECT AVG(Life_expectancy)
FROM FinalProjectTable
WHERE Improved_sanitation >= (SELECT AVG(Improved_sanitation)
                              FROM FinalProjectTable);
-- Countries below average access to improved sanitation
SELECT AVG(Life_expectancy)
FROM FinalProjectTable
WHERE Improved_sanitation < (SELECT AVG(Improved_sanitation)
                             FROM FinalProjectTable);
-- Countries at or above average access to improved drinking water
SELECT AVG(Life_expectancy)
FROM FinalProjectTable
WHERE Improved_H2O >= (SELECT AVG(Improved_H2O)
                       FROM FinalProjectTable);
-- Countries below average access to improved drinking water
SELECT AVG(Life_expectancy)
FROM FinalProjectTable
WHERE Improved_H2O < (SELECT AVG(Improved_H2O)
                      FROM FinalProjectTable);
Appendix C
LinearRegression (no attribute selection):
0.2902 * Improved_H20 +
-0.0307 * CO2_Emission +
0.0003 * Energy_use +
-0.0038 * Forest_area +
-0.0101 * Protected_area +
0.0011 * CFC +
0.1502 * Improved_sanitation +
32.8838
LinearRegression (M5 Method):
0.2927 * Improved_H20 +
0.1621 * Improved_sanitation +
32.0597
SimpleLinearRegression:
0.27 * Improved_sanitation + 49.8
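As a worked example, the three linear equations above can be evaluated directly. The coefficients in the sketch below are copied from this appendix; the function names and the sample country are hypothetical, with input values invented purely for illustration.

```python
def predict_full(row):
    """LinearRegression (no attribute selection), coefficients from Appendix C."""
    return (0.2902 * row["Improved_H20"]
            - 0.0307 * row["CO2_Emission"]
            + 0.0003 * row["Energy_use"]
            - 0.0038 * row["Forest_area"]
            - 0.0101 * row["Protected_area"]
            + 0.0011 * row["CFC"]
            + 0.1502 * row["Improved_sanitation"]
            + 32.8838)

def predict_m5(row):
    """LinearRegression (M5 Method), coefficients from Appendix C."""
    return (0.2927 * row["Improved_H20"]
            + 0.1621 * row["Improved_sanitation"]
            + 32.0597)

def predict_simple(row):
    """SimpleLinearRegression, coefficients from Appendix C."""
    return 0.27 * row["Improved_sanitation"] + 49.8

# Hypothetical country; these values are invented, not taken from the dataset.
country = {
    "Improved_H20": 90.0, "CO2_Emission": 5.0, "Energy_use": 2000.0,
    "Forest_area": 30.0, "Protected_area": 10.0, "CFC": 100.0,
    "Improved_sanitation": 80.0,
}
predictions = {
    "full": predict_full(country),
    "m5": predict_m5(country),
    "simple": predict_simple(country),
}
```

For this invented input, all three equations predict a life expectancy of roughly 71.4 years. That agreement is consistent with the observation that Improved_H20 and Improved_sanitation dominate the full model.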
MultilayerPerceptron:
Linear Node 0
Inputs Weights
Threshold -0.06365923447434069
Node 1 -1.3044840595621277
Node 2 0.6436857124058317
Node 3 1.036796201581872
Node 4 2.6752783397229476
Sigmoid Node 1
Inputs Weights
Threshold 0.4950880605031987
Attrib Improved_H20 -4.414981332791403
Attrib CO2_Emission -2.207873803607682
Attrib Energy_use 3.0120445732248595
Attrib Forest_area 3.458045028459332
Attrib Protected_area -0.09115114212144113
Attrib CFC 0.1619057324156068
Attrib Improved_sanitation -0.24846758864819155
Sigmoid Node 2
Inputs Weights
Threshold -1.5602335922282642
Attrib Improved_H20 -0.24151832851994415
Attrib CO2_Emission 0.05532850592851299
Attrib Energy_use 0.40163903002722456
Attrib Forest_area 1.0116682371359682
Attrib Protected_area 0.5947206208536553
Attrib CFC 0.27811918882095177
Attrib Improved_sanitation 1.2207248134394877
Sigmoid Node 3
Inputs Weights
Threshold -1.7527667679831078
Attrib Improved_H20 2.408271404556648
Attrib CO2_Emission -1.238570152951369
Attrib Energy_use 1.7449810104693952
Attrib Forest_area 0.9862758753316596
Attrib Protected_area -1.9185447724177973
Attrib CFC -0.21952811855816465
Attrib Improved_sanitation 1.3443842038611158
Sigmoid Node 4
Inputs Weights
Threshold -1.6509809202156398
Attrib Improved_H20 -3.2545534283626383
Attrib CO2_Emission 0.20223486797597762
Attrib Energy_use 1.1115153730699168
Attrib Forest_area 2.3708684468721155
Attrib Protected_area 3.9959068129510884
Attrib CFC -0.6554578841790959
Attrib Improved_sanitation 0.1489077699139931