Term Project

OPIM 5103 – Managerial Statistics
Sample Student
7/5/2010
Jantzen Note: The most serious omission in this project is the failure to discuss the 95%
confidence interval results for the regression coefficients and the estimated coefficient values.
OPIM 5103 Term Project
Sample Student
Table of Contents
Introduction
Literature Review
Data Analysis
Regression Method
Conclusion
Introduction
The primary objective of this term project is to present appropriate statistical
evidence to support the proposed hypothesis that there is a significant relationship
between the dependent variable of “Average Fertility” (FERTILTY) and the following
explanatory factors:
1. Average female life expectancy (LIFEEXPF)
2. Average male life expectancy (LIFEEXPM)
3. Mortality rate per 1000 people (DEATH_RT)
The study will describe the statistical distribution of the variables, test the degree
to which the explanatory variables explain the dependent variable, and test the
coefficients to summarize the overall behavior of the multiple regression model.
Literature Review
I utilized the EconLit research database in the UConn Virtual Library and came
across the following working paper that theorizes the relationship between fertility, adult
longevity, and the mortality environment.
Cervellati, Matteo, and Uwe Sunde. "Human Capital, Mortality and Fertility: A Unified Theory of
the Economic and Demographic Transition." (2007): EconLit. EBSCO. Web. 7 July 2010.
The research findings conclude that fertility may decrease as a response to
increased life expectancy. Consequently, declines in mortality rates could lead to a
quantity vs. quality trade-off, where parents have fewer children but invest more in each
child. The regression model proposes that fertility is positively related to mortality and
can reduce the initial increase in population size due to higher life expectancy.
Data Analysis
The data used for this paper is from the “WORLD95.XLS” database from the
course website. It is a random sample of human demographics from 1995 for 106
different world countries.
FERTILTY: This data element reflects the average number of children born per
population and is the dependent variable of this study. The FERTILTY variable is known
as the ‘Fertility Rate,’ which as mentioned above, is a measure of average offspring
production. The tables below present Summary Statistics for this data element. The
average fertility rate is 3.57 (mean) and the middle number of the 106 observations is
3.06 (median). All of the numbers are within 6.89 (range) of each other, and vary around
the mean by 1.91 (standard deviation). The middle 50% of the numbers are within 3.22
of each other (IQR). The standard deviation is 53% of the mean (CV), which shows high
variability. The Pearson Measure of Skewness (PMS) absolute score is 26%, which
shows that the data is not approximately symmetrical because it is higher than 10%.
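The derived measures quoted above can be reproduced from the reported mean, median, standard deviation, and quartiles. The following is a minimal Python sketch for illustration only (the original analysis was done in a spreadsheet; the variable names are my own). It assumes, based on the tabled value, that the PMS used here is (mean - median) / standard deviation, without the factor of 3 that some textbooks include.

```python
# Reported FERTILTY summary statistics (taken from the summary tables).
mean = 3.570283
median = 3.065000
std_dev = 1.910044
q1, q3 = 1.88, 5.10

# Coefficient of variation: standard deviation as a share of the mean.
cv = std_dev / mean

# Pearson Measure of Skewness as used in this paper (assumed form).
pms = (mean - median) / std_dev

# Interquartile range: spread of the middle 50% of the observations.
iqr = q3 - q1

print(f"CV = {cv:.1%}, PMS = {pms:.1%}, IQR = {iqr:.2f}")
```

A positive PMS means the mean lies above the median, which is the usual sign of a longer right-hand tail.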
FERTILTY Summary Statistics
Statistic             Value
Mean                  3.570283
Standard Error        0.185520
Median                3.065000
Mode                  1.800000
Standard Deviation    1.910044
Sample Variance       3.648269
PMS                   0.264540
CV                    0.534984
Kurtosis              -0.975758
Skewness              0.651101
Range                 6.890000
Minimum               1.300000
Maximum               8.190000
Sum                   378.450000
Count                 106.000000
Five-number Summary
Minimum           1.30
First Quartile    1.88
Median            3.07
Third Quartile    5.10
Maximum           8.19
IQR               3.22
The Frequency Table and Histogram reflect that approximately 70% of the data
is within 1 – less than 4.5, and the highest frequency interval is 1.5 – less than 2.5.
Frequencies (FERTILTY)
Interval             Bin     Frequency   Percentage   Cumulative %   Midpts
1 Less Than 1.5      1.49    4           3.77%        3.77%          1
1.5 Less Than 2.5    2.49    41          38.68%       42.45%         2
2.5 Less Than 3.5    3.49    17          16.04%       58.49%         3
3.5 Less Than 4.5    4.49    13          12.26%       70.75%         4
4.5 Less Than 5.5    5.49    6           5.66%        76.42%         5
5.5 Less Than 6.5    6.49    12          11.32%       87.74%         6
6.5 Less Than 7.5    7.49    12          11.32%       99.06%         7
7.5 Less Than 8.5    8.49    1           0.94%        100.00%        8
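The percentage and cumulative-percentage columns follow directly from the frequency counts. The short Python sketch below is an illustration, not the spreadsheet procedure actually used; the list and variable names are my own. It recomputes both columns for FERTILTY.

```python
from itertools import accumulate

# FERTILTY frequency counts per interval, in table order.
frequencies = [4, 41, 17, 13, 6, 12, 12, 1]
total = sum(frequencies)  # number of observations

# Share of observations in each interval, in percent.
percentages = [100 * f / total for f in frequencies]

# Running share of observations up to and including each interval.
cumulative = [100 * c / total for c in accumulate(frequencies)]

for f, p, c in zip(frequencies, percentages, cumulative):
    print(f"{f:>3}  {p:6.2f}%  {c:7.2f}%")
```

The fourth cumulative value is the "approximately 70%" figure quoted for the range 1 – less than 4.5.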
[Figure: Fertility Histogram. Frequency and Cumulative % by interval Midpoints.]
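The box below reflects the middle 50% of the data, between the first quartile (1.88) and third
quartile (5.10). The whiskers show the full range of the data, from 1.30 to 8.19. The box-and-whisker plot graphically confirms that the data is skewed to the right: the upper whisker (5.10 to 8.19) is much longer than the lower whisker (1.30 to 1.88), and the mean (3.57) lies above the median (3.07).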
[Figure: Fertility Box-and-Whisker Plot (FERTILTY).]
LIFEEXPF: This data element reflects the average number of years expected in
a female human life, and is one of the explanatory variables in this study. The
LIFEEXPF variable is known as the ‘Female Life Expectancy Rate,’ which as mentioned
above, is a measure of the average years of a female life. The tables below present
Summary Statistics for this data element. The average female life expectancy is 69.96
(mean) and the middle number of the 106 observations is 74.00 (median). All of the
numbers are within 39.00 (range) of each other, and vary around the mean by 10.65
(standard deviation). The middle 50% of the numbers are within 12 of each other (IQR).
The standard deviation is 15% of the mean (CV), which shows lower variability. The
Pearson Measure of Skewness (PMS) absolute score is 38%, which shows that the data
is not approximately symmetrical because it is higher than 10%.
LIFEEXPF Summary Statistics
Statistic             Value
Mean                  69.962264
Standard Error        1.034875
Median                74.000000
Mode                  75.000000
Standard Deviation    10.654688
Sample Variance       113.522372
PMS                   -0.378963
CV                    0.152292
Kurtosis              0.116924
Skewness              -1.068043
Range                 39.000000
Minimum               43.000000
Maximum               82.000000
Sum                   7,416.000000
Count                 106.000000
Five-number Summary
Minimum           43.00
First Quartile    66.00
Median            74.00
Third Quartile    78.00
Maximum           82.00
IQR               12.00
The Frequency Table and Histogram reflect that approximately 84% of the data
is within 40 – less than 80, and the highest frequency interval is 70 – less than 80.
Frequencies (LIFEEXPF)
Interval           Bin      Frequency   Percentage   Cumulative %   Midpts
40 Less Than 50    49.99    7           6.60%        6.60%          45
50 Less Than 60    59.99    15          14.15%       20.75%         55
60 Less Than 70    69.99    17          16.04%       36.79%         65
70 Less Than 80    79.99    50          47.17%       83.96%         75
80 Less Than 90    89.99    17          16.04%       100.00%        85
[Figure: Female Life Expectancy Histogram. Frequency and Cumulative % by interval Midpoints.]
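The box below reflects the middle 50% of the data, between the first quartile (66) and third
quartile (78). The whiskers show the full range of the data, from 43 to 82. The box-and-whisker plot graphically confirms that the data is skewed to the left: the lower whisker (43 to 66) is much longer than the upper whisker (78 to 82), and the mean (69.96) lies below the median (74.00).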
[Figure: Female Life Expectancy Box-and-Whisker Plot (LIFEEXPF).]
LIFEEXPM: This data element reflects the average number of years expected in
a male human life, and is one of the explanatory variables in this study. The LIFEEXPM
variable is known as the ‘Male Life Expectancy Rate,’ which as mentioned above, is a
measure of the average years of a male life. The tables below present Summary
Statistics for this data element. The average male life expectancy is 64.76 (mean) and
the middle number of the 106 observations is 67.00 (median). All of the numbers are
within 35.00 (range) of each other, and vary around the mean by 9.34 (standard
deviation). The middle 50% of the numbers are within 12 of each other (IQR). The
standard deviation is 14% of the mean (CV), which shows lower variability. The Pearson
Measure of Skewness (PMS) absolute score is 24%, which shows that the data is not
approximately symmetrical because it is higher than 10%.
LIFEEXPM Summary Statistics
Statistic             Value
Mean                  64.764151
Standard Error        0.908139
Median                67.000000
Mode                  73.000000
Standard Deviation    9.349868
Sample Variance       87.420036
PMS                   -0.239132
CV                    0.144368
Kurtosis              0.242961
Skewness              -1.042360
Range                 35.000000
Minimum               41.000000
Maximum               76.000000
Sum                   6,865.000000
Count                 106.000000
Five-number Summary
Minimum           41.00
First Quartile    61.00
Median            67.00
Third Quartile    73.00
Maximum           76.00
IQR               12.00
The Frequency Table and Histogram reflect that approximately 64% of the data
is within 40 – less than 70, and the highest frequency interval is 60 – less than 70.
Frequencies (LIFEEXPM)
Interval           Bin      Frequency   Percentage   Cumulative %   Midpts
40 Less Than 50    49.99    10          9.43%        9.43%          45
50 Less Than 60    59.99    14          13.21%       22.64%         55
60 Less Than 70    69.99    44          41.51%       64.15%         65
70 Less Than 80    79.99    38          35.85%       100.00%        75
[Figure: Male Life Expectancy Histogram. Frequency and Cumulative % by interval Midpoints.]
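The box below reflects the middle 50% of the data, between the first quartile (61) and third
quartile (73). The whiskers show the full range of the data, from 41 to 76. The box-and-whisker plot graphically confirms that the data is skewed to the left: the lower whisker (41 to 61) is much longer than the upper whisker (73 to 76), and the mean (64.76) lies below the median (67.00).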
[Figure: Male Life Expectancy Box-and-Whisker Plot (LIFEEXPM).]
DEATH_RT: This data element reflects the number of deaths per 1,000 people,
and is one of the explanatory variables in this study. The DEATH_RT variable is known
as the ‘Mortality Rate,’ which as mentioned above, is a measure of the deaths per 1,000
people in a population. The tables below present Summary Statistics for this data
element. The average mortality rate is 9.61 (mean) and the middle number of the 106
observations is 9.00 (median). All of the numbers are within 22.00 (range) of each other,
and vary around the mean by 4.27 (standard deviation). The middle 50% of the numbers
are within 4 of each other (IQR). The standard deviation is 44% of the mean (CV), which
shows high variability. The Pearson Measure of Skewness (PMS) absolute score is
14%, which shows that the data is slightly skewed because it is higher than 10%.
DEATH_RT Summary Statistics
Statistic             Value
Mean                  9.611321
Standard Error        0.415204
Median                9.000000
Mode                  6.000000
Standard Deviation    4.274784
Sample Variance       18.273775
PMS                   0.143006
CV                    0.444765
Kurtosis              1.775282
Skewness              1.280521
Range                 22.000000
Minimum               2.000000
Maximum               24.000000
Sum                   1,018.800000
Count                 106.000000
Five-number Summary
Minimum           2.00
First Quartile    7.00
Median            9.00
Third Quartile    11.00
Maximum           24.00
IQR               4.00
The Frequency Table and Histogram reflect that approximately 67% of the data
is within 2 – less than 11, and the highest frequency intervals are shared between 5 –
less than 8 and 8 – less than 11.
Frequencies (DEATH_RT)
Interval           Bin      Frequency   Percentage   Cumulative %   Midpts
2 Less Than 5      4.99     5           4.72%        4.72%          2
5 Less Than 8      7.99     33          31.13%       35.85%         5
8 Less Than 11     10.99    33          31.13%       66.98%         8
11 Less Than 14    13.99    22          20.75%       87.74%         11
14 Less Than 17    16.99    4           3.77%        91.51%         14
17 Less Than 20    19.99    4           3.77%        95.28%         17
20 Less Than 23    22.99    4           3.77%        99.06%         20
23 Less Than 26    25.99    1           0.94%        100.00%        23
[Figure: Mortality Rate Histogram. Frequency and Cumulative % by interval Midpoints.]
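The box below reflects the middle 50% of the data, between the first quartile (7) and third
quartile (11). The whiskers show the full range of the data, from 2 to 24. The box-and-whisker plot graphically confirms that the data is skewed to the right: the upper whisker (11 to 24) is much longer than the lower whisker (2 to 7), and the mean (9.61) lies above the median (9.00).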
[Figure: Mortality Rate Box-and-Whisker Plot (DEATH_RT).]
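The direction of skew for each variable can be checked from the means and medians reported in the summary tables: a mean above the median points to a longer right tail, and a mean below the median points to a longer left tail. The Python sketch below is only an illustration of that rule of thumb; the function and dictionary names are my own.

```python
# (mean, median) pairs taken from the summary statistics tables.
summary = {
    "FERTILTY": (3.570283, 3.065000),
    "LIFEEXPF": (69.962264, 74.000000),
    "LIFEEXPM": (64.764151, 67.000000),
    "DEATH_RT": (9.611321, 9.000000),
}


def skew_direction(mean: float, median: float) -> str:
    """Classify skew by comparing the mean with the median."""
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "symmetric"


directions = {name: skew_direction(m, med) for name, (m, med) in summary.items()}

for name, direction in directions.items():
    print(f"{name}: skewed {direction}")
```

The result agrees with the sign of the Skewness statistic in each summary table (positive for FERTILTY and DEATH_RT, negative for the two life expectancy variables).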
Regression Method
In this term project, the multiple regression method is used to predict how large
or small the dependent variable (FERTILTY) will be, given differing values for the
explanatory variables (LIFEEXPF, LIFEEXPM, and DEATH_RT). The standard equation
is below, along with the equation for the estimated model.
FERTILTYi = B0 + B1(LIFEEXPFi) + B2(LIFEEXPMi) + B3(DEATH_RTi) + Ei
Estimated FERTILTYi = 17.156 - 0.307(LIFEEXPFi) + 0.141(LIFEEXPMi) - 0.128(DEATH_RTi)
The model reflects that FERTILTY (Yi) can be expressed in terms of a
constant intercept (B0) plus a coefficient (B1) times LIFEEXPF (X1i), plus a coefficient
(B2) times LIFEEXPM (X2i), plus a coefficient (B3) times DEATH_RT (X3i), plus an
unexplained error term (Ei). The estimated coefficients come from the output of the
regression model through statistical software (PHSTAT), which is summarized below.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8842011
R Square             0.781811585
Adjusted R Square    0.775394278
Standard Error       0.905219416
Observations         106

ANOVA
              df     SS             MS             F              Significance F
Regression    3      299.4872281    99.82907602    121.8286216    1.37259E-33
Residual      102    83.58106345    0.819422191
Total         105    383.0682915

             Coefficients    Standard Error   t Stat          P-value        Lower 95%       Upper 95%
Intercept    17.15599441     1.197780326      14.32315596     3.67868E-26    14.78020288     19.53178595
LIFEEXPF     -0.307052703    0.045750494      -6.711462047    1.10489E-09    -0.397798588    -0.216306818
LIFEEXPM     0.14085653      0.055575732      2.53449705      0.012780756    0.030622331     0.251090728
DEATH_RT     -0.127564431    0.031531798      -4.045580622    0.00010165     -0.190107601    -0.065021261
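To show how the estimated coefficients translate into a prediction, the Python sketch below plugs a set of explanatory values into the estimated equation. It is an illustration only: the function name is my own, and the input values are simply the sample medians reported in the Data Analysis section (74 for LIFEEXPF, 67 for LIFEEXPM, 9 for DEATH_RT), not a particular country.

```python
# Estimated coefficients from the regression output.
COEFFICIENTS = {
    "Intercept": 17.15599441,
    "LIFEEXPF": -0.307052703,
    "LIFEEXPM": 0.14085653,
    "DEATH_RT": -0.127564431,
}


def predict_fertility(lifeexpf: float, lifeexpm: float, death_rt: float) -> float:
    """Return predicted FERTILTY from the estimated regression equation."""
    return (
        COEFFICIENTS["Intercept"]
        + COEFFICIENTS["LIFEEXPF"] * lifeexpf
        + COEFFICIENTS["LIFEEXPM"] * lifeexpm
        + COEFFICIENTS["DEATH_RT"] * death_rt
    )


# Prediction at the sample medians of the three explanatory variables.
predicted_at_medians = predict_fertility(74.0, 67.0, 9.0)
print(f"Predicted FERTILTY at the medians: {predicted_at_medians:.2f}")
```

Each coefficient is a marginal effect: holding the other two variables fixed, one more year of female life expectancy lowers predicted fertility by about 0.31, one more year of male life expectancy raises it by about 0.14, and one more death per 1,000 people lowers it by about 0.13.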
a) Goodness of Fit
The R squared (R2) measures the proportion of the variation in the dependent
variable that is explained by the explanatory variables in the model. The regression program
calculated an R2 of 0.78, which indicates that 78% of the variation in the dependent
variable is explained by differences in the explanatory variables, and 22% is
unaccounted for. The adjusted R2 of 0.775 is nearly identical because the sample is large
relative to the number of explanatory variables, so the adjustment makes little difference here.
This indicates that the estimated model has reasonable predictive ability.
The Line Fit Plot below confirms the reasonable accuracy when comparing actual
vs. predicted Fertility.
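The goodness-of-fit figures can be recomputed from the sums of squares in the ANOVA table. The Python sketch below is an illustration (the variable names are my own) and uses the standard formulas R2 = SSR / SST and adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1).

```python
import math

n, k = 106, 3  # observations and number of explanatory variables

# Sums of squares from the ANOVA table.
ss_regression = 299.4872281
ss_residual = 83.58106345
ss_total = ss_regression + ss_residual

# Share of the variation in FERTILTY explained by the model.
r_squared = ss_regression / ss_total

# R2 penalized for the number of explanatory variables.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Standard error of the regression: square root of the residual mean square.
standard_error = math.sqrt(ss_residual / (n - k - 1))

print(f"R2 = {r_squared:.4f}, adjusted R2 = {adjusted_r_squared:.4f}, "
      f"standard error = {standard_error:.4f}")
```

With 106 observations and only three explanatory variables, the penalty factor (n - 1) / (n - k - 1) is 105 / 102, which is why the adjusted value is so close to the unadjusted one.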
[Figure: Fertility Line Fit Plot. Actual FERTILTY and Predicted FERTILTY by Observations.]
The mean absolute deviation (MAD) of the residuals is 0.65, which is the average of the
absolute values of the error terms. When dividing the MAD score by the mean of the
dependent variable, the % standard error of the regression (%SER) calculates to
18.11%. Because the error terms are approximately normally distributed (the normal
probability plot is shown under Error Terms below), a %SER of 0.18 indicates that about 68%
of the errors are ≤ 18% of the size of the mean of the dependent variable (and about 95% of
the errors are ≤ 36% of the size of the mean). The R2 of 0.78 and %SER of 18% indicate that
this multiple regression model has good predictive ability.
b) Error Terms
The following charts test whether the error terms (residuals) in the multiple
regression model meet data requirements. The first requirement states that error terms
should be normally distributed, which appears to be approximately met in the plot below.
[Figure: Residual Normal Probability Plot. Residuals vs. Z Value.]
The second requirement states that the error terms should be independent of each
other. This also appears to be met in the chart below, as it reflects a random plot of
observations.
[Figure: Residual Plot. Residuals vs. Observations.]
c) Overall Regression Test
The ‘F’ statistical inference test assesses whether all of the regression coefficients
(except the constant) in the "true" model describing the underlying population are equal
to zero. A four step process is used below to conduct the F test, including stating the
hypotheses, calculating a sample F value, finding a critical F value and then comparing
the sample F to the critical F to make a conclusion on the Hypothesis.
Null Hypothesis (Ho): B1 = B2 = B3 = 0 (no linear relationship between FERTILTY and
the explanatory variables)
Alternative Hypothesis (Ha): Ho is false (at least one independent variable affects
FERTILTY)
The formula for calculating the sample F-statistic:
F = (R2 / k) / ((1 - R2) / (n - k - 1))
where k = # of explainers & n = sample size.
The regression program calculates the sample F-statistic value to be 121.83. The
critical F-statistic, using a significance level of 0.05 with k = 3 degrees of
freedom in the numerator and n-k-1 = 102 degrees of freedom in the
denominator, is 2.6937. The sample F-statistic is greater than the critical F-statistic, so the Null Hypothesis (Ho) is rejected, which shows that at least one of the
explanatory variables influences the dependent variable. The Significance F (1.37E-33) is
extremely low, which shows that the chance of drawing a sample like this one when the
Null Hypothesis is true is extremely low.
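The sample F-statistic can be reproduced in two equivalent ways: from R2, and from the mean squares in the ANOVA table. The Python sketch below is an illustration with variable names of my own; the critical value of 2.6937 is taken from the text rather than computed.

```python
n, k = 106, 3  # sample size and number of explanatory variables

r_squared = 0.781811585
ms_regression = 99.82907602  # regression mean square (SSR / k)
ms_residual = 0.819422191    # residual mean square (SSE / (n - k - 1))

F_CRITICAL = 2.6937  # 0.05 level with 3 and 102 degrees of freedom, as quoted

# F from the share of explained vs. unexplained variation.
f_from_r2 = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# F from the ANOVA mean squares; should agree with the value above.
f_from_anova = ms_regression / ms_residual

reject_null = f_from_r2 > F_CRITICAL

print(f"F (from R2) = {f_from_r2:.4f}, F (from ANOVA) = {f_from_anova:.4f}, "
      f"reject Ho: {reject_null}")
```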
d) Single Coefficient Tests
The ‘T’ statistical inference test assesses whether each of the estimated
regression coefficients is equal to, less than, or greater than a particular number. I do not
have a prior expectation about what the value of each population coefficient should be, so
the Null Hypothesis for each coefficient is that it is equal to zero. A four step process is used
below to conduct the T test on each regression coefficient, including stating the hypotheses,
calculating a sample T value, finding a two-tail critical T value and then comparing the
sample T to the critical T to make a conclusion on the Hypothesis.
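LIFEEXPF:
Null Hypothesis (Ho): B1 (LIFEEXPF) = 0
Alternative Hypothesis (Ha): B1 (LIFEEXPF) ≠ 0
The formula for calculating the sample T-statistic:
t = (bi - Bi) / Sbi
where bi is the estimated coefficient, Bi is the hypothesized value (zero here), and Sbi is the
standard error of the coefficient. The degrees of freedom are n-k-1 (sample size minus #
explanatory variables minus 1).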
The regression program calculates the sample T-statistic value to be -6.711. The
two-tail critical t-value to be considered for (n-k-1) or 102 degrees of freedom and with
0.05 significance level is 1.9834. The absolute sample T-statistic is greater than the
critical two-tail T-statistic, so the Null Hypothesis (Ho) is rejected, which shows that we
have sufficient evidence to conclude that higher female life expectancy decreases
fertility.
LIFEEXPM:
Null Hypothesis (Ho): B2 (LIFEEXPM) = 0
Alternative Hypothesis (Ha): B2 (LIFEEXPM) ≠ 0
The regression program calculates the sample T-statistic value to be 2.534. The
two-tail critical t-value to be considered for (n-k-1) or 102 degrees of freedom and with
0.05 significance level is 1.9834. The absolute sample T-statistic is greater than the
critical two-tail T-statistic, so the Null Hypothesis (Ho) is rejected, which shows that we
have sufficient evidence to conclude that higher male life expectancy increases fertility.
DEATH_RT:
Null Hypothesis (Ho): B3 (DEATH_RT) = 0
Alternative Hypothesis (Ha): B3 (DEATH_RT) ≠ 0
The regression program calculates the sample T-statistic value to be -4.046. The two-tail critical t-value to be considered for (n-k-1) or 102 degrees of freedom
and with 0.05 significance level is 1.9834. The absolute sample T-statistic is greater than
the critical two-tail T-statistic, so the Null Hypothesis (Ho) is rejected, which shows that
we have sufficient evidence to conclude that a higher mortality rate decreases fertility.
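The three single-coefficient tests follow the same arithmetic: divide each estimated coefficient by its standard error and compare the absolute value with the two-tail critical t. The Python sketch below is an illustration with names of my own; the critical value of 1.9834 is the one quoted above, not computed here.

```python
T_CRITICAL = 1.9834  # two-tail, 0.05 level, 102 degrees of freedom (as quoted)

# (estimated coefficient, standard error) from the regression output.
slopes = {
    "LIFEEXPF": (-0.307052703, 0.045750494),
    "LIFEEXPM": (0.14085653, 0.055575732),
    "DEATH_RT": (-0.127564431, 0.031531798),
}

# Sample t-statistic under Ho: population coefficient = 0.
t_stats = {name: coef / se for name, (coef, se) in slopes.items()}

# Reject Ho when the absolute t-statistic exceeds the critical value.
reject_null = {name: abs(t) > T_CRITICAL for name, t in t_stats.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}, reject Ho: {reject_null[name]}")
```

The sign of each t-statistic matches the sign of its coefficient, so it also indicates the direction of the estimated effect.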
e) Standardized Coefficients
To determine which explanatory variables (LIFEEXPF, LIFEEXPM, and DEATH_RT)
have the greatest influence on the dependent variable (FERTILTY), standardized
coefficients need to be calculated. Standardized coefficients show how many standard
deviations the dependent variable will change if the explanatory variable changes by one
standard deviation. Larger standardized coefficients indicate more influence, smaller
ones less. Standardized coefficients (bi*) for each explanatory variable can be
calculated as follows:
bi* = bi × (Sxi / Sy)
where bi is the estimated regression coefficient, Sxi is the standard deviation of the
explanatory variable, and Sy is the standard deviation of the dependent variable.
Variable    Std. Coeff.    Est. Coeff.    SD of X      SD of Y
LIFEEXPF    -1.172599      -0.307053      10.654688    1.910044
LIFEEXPM    0.472039       0.140857       9.349868     1.910044
DEATH_RT    -0.195452      -0.127564      4.274784     1.910044
The above standardized coefficient values imply the following:
• A one SD increase in LIFEEXPF leads to a 1.172599 SD decrease in FERTILTY.
• A one SD increase in LIFEEXPM leads to a 0.472039 SD increase in FERTILTY.
• A one SD increase in DEATH_RT leads to a 0.195452 SD decrease in FERTILTY.
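As a cross-check, the standardized coefficient formula (estimated coefficient times SD of X, divided by SD of Y) can be applied directly to the estimated coefficients and standard deviations listed in the table. The Python sketch below is an illustration with names of my own. Note that with the listed standard deviations this formula gives roughly -1.71, 0.69, and -0.29, which are larger in magnitude than the tabled standardized coefficients by a common factor, so the tabled values may have been computed with a different scaling. The ranking of the three variables is the same either way.

```python
SD_Y = 1.910044  # standard deviation of FERTILTY

# (estimated coefficient, standard deviation of the explanatory variable).
inputs = {
    "LIFEEXPF": (-0.307052703, 10.654688),
    "LIFEEXPM": (0.14085653, 9.349868),
    "DEATH_RT": (-0.127564431, 4.274784),
}

# Standardized coefficient: SD change in Y per one-SD change in X.
standardized = {name: b * sd_x / SD_Y for name, (b, sd_x) in inputs.items()}

# Rank the variables by the size of their standardized effect.
ranking = sorted(standardized, key=lambda name: abs(standardized[name]), reverse=True)

for name in ranking:
    print(f"{name}: {standardized[name]:+.4f}")
```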
Conclusion
The summary statistics and charts revealed that the variables all share non-symmetrical shapes. The error terms passed the data requirements for being normally
distributed and independent of each other. The Fertility Line Fit Plot graphically showed
the predictive performance of the model, which is supported by the good R2 and %SER scores.
After conducting a test for overall fit it was determined that there was sufficient evidence that
at least one of the explanatory variables had an impact on average fertility.
After testing the influence that each individual explainer (LIFEEXPF, LIFEEXPM, and
DEATH_RT) had on the dependent variable (FERTILTY), it was concluded that there
was sufficient evidence that they each had an impact on the dependent variable. Also, the
confidence intervals for the regression coefficients show how large the population
coefficients are likely to be. Specifically, we're 95% confident that the "true" marginal
effects on FERTILTY of changes in LIFEEXPF, LIFEEXPM, and DEATH_RT lie in the
ranges depicted below. For example, each additional year of female life expectancy is
associated with a decrease in fertility of between 0.22 and 0.40, holding the other two
variables constant. Note that zero does not lie within any of the ranges, which
is consistent with rejecting the Null Hypothesis that each population regression coefficient
is zero at the 0.05 significance level.
             Lower 95%       Upper 95%
Intercept    14.78020288     19.53178595
LIFEEXPF     -0.397798588    -0.216306818
LIFEEXPM     0.030622331     0.251090728
DEATH_RT     -0.190107601    -0.065021261
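Each 95% confidence interval is the estimated coefficient plus or minus the two-tail critical t times the coefficient's standard error. The Python sketch below is an illustration with names of my own; it uses the rounded critical value of 1.9834 quoted in the coefficient tests, so the bounds match the regression output only to about three decimal places.

```python
T_CRITICAL = 1.9834  # two-tail, 0.05 level, 102 degrees of freedom (as quoted)

# (estimated coefficient, standard error) from the regression output.
estimates = {
    "Intercept": (17.15599441, 1.197780326),
    "LIFEEXPF": (-0.307052703, 0.045750494),
    "LIFEEXPM": (0.14085653, 0.055575732),
    "DEATH_RT": (-0.127564431, 0.031531798),
}


def confidence_interval(coef: float, se: float, t_crit: float = T_CRITICAL):
    """Return (lower, upper) bounds: coefficient +/- t_crit * standard error."""
    margin = t_crit * se
    return coef - margin, coef + margin


intervals = {name: confidence_interval(b, se) for name, (b, se) in estimates.items()}

# An interval that excludes zero agrees with rejecting Ho: coefficient = 0.
excludes_zero = {name: lo > 0 or hi < 0 for name, (lo, hi) in intervals.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: [{lo:.4f}, {hi:.4f}]  excludes zero: {excludes_zero[name]}")
```

The LIFEEXPM interval is the widest relative to its coefficient and comes closest to zero, which matches its comparatively large p-value of 0.0128.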
The focus of this study was to analyze world demographic variables for average
female life expectancy, average male life expectancy, and mortality rates and their
influence on average fertility. The proposed relationship theory proved to be consistent
with the working paper that was summarized in the Literature Review. The female life
expectancy variable has the largest influence on fertility, reflecting a negative linear
relationship. The mortality rate explainer has the smallest influence on fertility. The findings
revealed that there is a significant relationship between the dependent and explanatory
variables, and the model showed good predictive performance.