Term Project
OPIM 5103 – Managerial Statistics
Sample Student
7/5/2010
Jantzen

Note: The most serious omission in this project is the failure to discuss the 95% confidence interval results for the regression coefficients and the estimated coefficient values.

OPIM 5103 Term Project Sample Student

Table of Contents
Introduction
Literature Review
Data Analysis
Regression Method
Conclusion

Introduction

The primary objective of this term project is to present statistical evidence for the hypothesis that there is a significant relationship between the dependent variable “Average Fertility” (FERTILTY) and the following explanatory variables:

1. Average female life expectancy (LIFEEXPF)
2. Average male life expectancy (LIFEEXPM)
3. Mortality rate per 1000 people (DEATH_RT)

The study describes the statistical distribution of each variable, tests the degree to which the explanatory variables explain the dependent variable, and tests the individual coefficients to summarize the overall behavior of the multiple regression model.

Literature Review

I used the EconLit research database in the UConn Virtual Library and found the following working paper, which theorizes the relationship between fertility, adult longevity, and the mortality environment.

Cervellati, Matteo, and Uwe Sunde. "Human Capital, Mortality and Fertility: A Unified Theory of the Economic and Demographic Transition." (2007): EconLit. EBSCO. Web. 7 July 2010.

The paper concludes that fertility may decrease in response to increased life expectancy. Consequently, declines in mortality rates could lead to a quantity vs. quality trade-off, in which parents have fewer children but invest more in each child.
The regression model proposes that fertility is positively related to mortality and can reduce the initial increase in population size due to higher life expectancy.

Data Analysis

The data used for this paper come from the “WORLD95.XLS” database on the course website. It is a random sample of human demographics from 1995 for 106 different countries.

FERTILTY: This data element reflects the average number of children born per woman and is the dependent variable of this study. The FERTILTY variable is known as the ‘Fertility Rate,’ a measure of average offspring production. The tables below present Summary Statistics for this data element. The average fertility rate is 3.57 (mean) and the middle value of the 106 observations is 3.07 (median). All of the values lie within 6.89 (range) of each other, and they vary around the mean by 1.91 (standard deviation). The middle 50% of the values lie within 3.22 of each other (IQR). The standard deviation is 53% of the mean (CV), which shows high variability. The absolute value of the Pearson Measure of Skewness (PMS) is 26%, which shows that the data are not approximately symmetrical because it is higher than 10%.

FERTILTY Summary Statistics
Mean                  3.570283
Standard Error        0.185520
Median                3.065000
Mode                  1.800000
Standard Deviation    1.910044
Sample Variance       3.648269
PMS                   0.264540
CV                    0.534984
Kurtosis             (0.975758)
Skewness              0.651101
Range                 6.890000
Minimum               1.300000
Maximum               8.190000
Sum                 378.450000
Count               106.000000

Five-number Summary
Minimum           1.30
First Quartile    1.88
Median            3.07
Third Quartile    5.10
Maximum           8.19
IQR Range         3.22

The Frequency Table and Histogram show that approximately 70% of the data fall between 1 and less than 4.5, and the highest-frequency interval is 1.5 to less than 2.5.
Frequencies (FERTILTY)
Interval              Bin     Frequency   Percentage   Cumulative %   Midpts
1 Less Than 1.5       1.49        4          3.77%         3.77%        1
1.5 Less Than 2.5     2.49       41         38.68%        42.45%        2
2.5 Less Than 3.5     3.49       17         16.04%        58.49%        3
3.5 Less Than 4.5     4.49       13         12.26%        70.75%        4
4.5 Less Than 5.5     5.49        6          5.66%        76.42%        5
5.5 Less Than 6.5     6.49       12         11.32%        87.74%        6
6.5 Less Than 7.5     7.49       12         11.32%        99.06%        7
7.5 Less Than 8.5     8.49        1          0.94%       100.00%        8

[Figure: Fertility Histogram. Frequency and Cumulative % by interval midpoint.]

The box below spans the middle 50% of the data, from the first quartile (1.88) to the third quartile (5.10). The whiskers extend from the minimum (1.30) to the maximum (8.19). The box-and-whisker plot graphically confirms that the data are skewed to the right, which agrees with the positive PMS and skewness values.

[Figure: Fertility Box-and-Whisker Plot (FERTILTY).]

LIFEEXPF: This data element reflects the average number of years a female is expected to live, and is one of the explanatory variables in this study. The LIFEEXPF variable is known as the ‘Female Life Expectancy Rate,’ a measure of the average years of a female life. The tables below present Summary Statistics for this data element. The average female life expectancy is 69.96 (mean) and the middle value of the 106 observations is 74.00 (median). All of the values lie within 39.00 (range) of each other, and they vary around the mean by 10.65 (standard deviation). The middle 50% of the values lie within 12 of each other (IQR). The standard deviation is 15% of the mean (CV), which shows lower variability. The absolute value of the Pearson Measure of Skewness (PMS) is 38%, which shows that the data are not approximately symmetrical because it is higher than 10%.
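The CV and PMS figures quoted in these descriptive paragraphs can be reproduced from the mean, median, and standard deviation alone. The following is a minimal sketch, not the PHStat calculation itself: the PMS formula (mean minus median, divided by the standard deviation) is an assumption inferred from the fact that it reproduces the tabulated PMS values, and the `Summary` class and function names are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Mean, median, and sample standard deviation of one variable."""
    mean: float
    median: float
    sd: float


def coefficient_of_variation(s: Summary) -> float:
    """CV = standard deviation / mean."""
    return s.sd / s.mean


def pearson_skewness(s: Summary) -> float:
    """PMS = (mean - median) / standard deviation (form inferred from the tables)."""
    return (s.mean - s.median) / s.sd


def is_approximately_symmetric(s: Summary, threshold: float = 0.10) -> bool:
    """Rule of thumb used in the text: |PMS| above 10% is not symmetric."""
    return abs(pearson_skewness(s)) <= threshold


# Values copied from the FERTILTY and LIFEEXPF summary statistics tables.
FERTILTY = Summary(mean=3.570283, median=3.065000, sd=1.910044)
LIFEEXPF = Summary(mean=69.962264, median=74.000000, sd=10.654688)

if __name__ == "__main__":
    for name, s in (("FERTILTY", FERTILTY), ("LIFEEXPF", LIFEEXPF)):
        print(f"{name}: CV={coefficient_of_variation(s):.4f} "
              f"PMS={pearson_skewness(s):+.4f} "
              f"symmetric={is_approximately_symmetric(s)}")
```

The sign of the PMS also gives the direction of the skew: a positive value (mean above median) points to a right-skewed distribution, and a negative value (mean below median) to a left-skewed one.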
LIFEEXPF Summary Statistics
Mean                   69.962264
Standard Error          1.034875
Median                 74.000000
Mode                   75.000000
Standard Deviation     10.654688
Sample Variance       113.522372
PMS                    (0.378963)
CV                      0.152292
Kurtosis                0.116924
Skewness               (1.068043)
Range                  39.000000
Minimum                43.000000
Maximum                82.000000
Sum                 7,416.000000
Count                 106.000000

Five-number Summary
Minimum           43.00
First Quartile    66.00
Median            74.00
Third Quartile    78.00
Maximum           82.00
IQR Range         12.00

The Frequency Table and Histogram show that approximately 84% of the data fall between 40 and less than 80, and the highest-frequency interval is 70 to less than 80.

Frequencies (LIFEEXPF)
Interval            Bin      Frequency   Percentage   Cumulative %   Midpts
40 Less Than 50     49.99        7          6.60%         6.60%        45
50 Less Than 60     59.99       15         14.15%        20.75%        55
60 Less Than 70     69.99       17         16.04%        36.79%        65
70 Less Than 80     79.99       50         47.17%        83.96%        75
80 Less Than 90     89.99       17         16.04%       100.00%        85

[Figure: Female Life Expectancy Histogram. Frequency and Cumulative % by interval midpoint.]

The box below spans the middle 50% of the data, from the first quartile (66) to the third quartile (78). The whiskers extend from the minimum (43) to the maximum (82). The box-and-whisker plot graphically confirms that the data are skewed to the left, which agrees with the negative PMS and skewness values.

[Figure: Female Life Expectancy Box-and-Whisker Plot (LIFEEXPF).]

LIFEEXPM: This data element reflects the average number of years a male is expected to live, and is one of the explanatory variables in this study. The LIFEEXPM variable is known as the ‘Male Life Expectancy Rate,’ a measure of the average years of a male life. The tables below present Summary Statistics for this data element. The average male life expectancy is 64.76 (mean) and the middle value of the 106 observations is 67.00 (median).
All of the values lie within 35.00 (range) of each other, and they vary around the mean by 9.35 (standard deviation). The middle 50% of the values lie within 12 of each other (IQR). The standard deviation is 14% of the mean (CV), which shows lower variability. The absolute value of the Pearson Measure of Skewness (PMS) is 24%, which shows that the data are not approximately symmetrical because it is higher than 10%.

LIFEEXPM Summary Statistics
Mean                   64.764151
Standard Error          0.908139
Median                 67.000000
Mode                   73.000000
Standard Deviation      9.349868
Sample Variance        87.420036
PMS                    (0.239132)
CV                      0.144368
Kurtosis                0.242961
Skewness               (1.042360)
Range                  35.000000
Minimum                41.000000
Maximum                76.000000
Sum                 6,865.000000
Count                 106.000000

Five-number Summary
Minimum           41.00
First Quartile    61.00
Median            67.00
Third Quartile    73.00
Maximum           76.00
IQR Range         12.00

The Frequency Table and Histogram show that approximately 64% of the data fall between 40 and less than 70, and the highest-frequency interval is 60 to less than 70.

Frequencies (LIFEEXPM)
Interval            Bin      Frequency   Percentage   Cumulative %   Midpts
40 Less Than 50     49.99       10          9.43%         9.43%        45
50 Less Than 60     59.99       14         13.21%        22.64%        55
60 Less Than 70     69.99       44         41.51%        64.15%        65
70 Less Than 80     79.99       38         35.85%       100.00%        75

[Figure: Male Life Expectancy Histogram. Frequency and Cumulative % by interval midpoint.]

The box below spans the middle 50% of the data, from the first quartile (61) to the third quartile (73). The whiskers extend from the minimum (41) to the maximum (76). The box-and-whisker plot graphically confirms that the data are skewed to the left, which agrees with the negative PMS and skewness values.

[Figure: Male Life Expectancy Box-and-Whisker Plot (LIFEEXPM).]

DEATH_RT: This data element reflects the number of deaths per 1,000 people, and is one of the explanatory variables in this study.
The DEATH_RT variable is known as the ‘Mortality Rate,’ a measure of the deaths per 1,000 people in a population. The tables below present Summary Statistics for this data element. The average mortality rate is 9.61 (mean) and the middle value of the 106 observations is 9.00 (median). All of the values lie within 22.00 (range) of each other, and they vary around the mean by 4.27 (standard deviation). The middle 50% of the values lie within 4 of each other (IQR). The standard deviation is 44% of the mean (CV), which shows high variability. The absolute value of the Pearson Measure of Skewness (PMS) is 14%, which shows that the data are slightly skewed because it is higher than 10%.

DEATH_RT Summary Statistics
Mean                    9.611321
Standard Error          0.415204
Median                  9.000000
Mode                    6.000000
Standard Deviation      4.274784
Sample Variance        18.273775
PMS                     0.143006
CV                      0.444765
Kurtosis                1.775282
Skewness                1.280521
Range                  22.000000
Minimum                 2.000000
Maximum                24.000000
Sum                 1,018.800000
Count                 106.000000

Five-number Summary
Minimum            2.00
First Quartile     7.00
Median             9.00
Third Quartile    11.00
Maximum           24.00
IQR Range          4.00

The Frequency Table and Histogram show that approximately 67% of the data fall between 2 and less than 11, and the highest frequency is shared by two intervals: 5 to less than 8 and 8 to less than 11.
Frequencies (DEATH_RT)
Interval            Bin      Frequency   Percentage   Cumulative %   Midpts
2 Less Than 5        4.99        5          4.72%         4.72%         2
5 Less Than 8        7.99       33         31.13%        35.85%         5
8 Less Than 11      10.99       33         31.13%        66.98%         8
11 Less Than 14     13.99       22         20.75%        87.74%        11
14 Less Than 17     16.99        4          3.77%        91.51%        14
17 Less Than 20     19.99        4          3.77%        95.28%        17
20 Less Than 23     22.99        4          3.77%        99.06%        20
23 Less Than 26     25.99        1          0.94%       100.00%        23

[Figure: Mortality Rate Histogram. Frequency and Cumulative % by interval midpoint.]

The box below spans the middle 50% of the data, from the first quartile (7) to the third quartile (11). The whiskers extend from the minimum (2) to the maximum (24). The box-and-whisker plot graphically confirms that the data are skewed to the right, which agrees with the positive PMS and skewness values.

[Figure: Mortality Rate Box-and-Whisker Plot (DEATH_RT).]

Regression Method

In this term project, the multiple regression method is used to predict how large or small the dependent variable (FERTILTY) will be, given differing values for the explanatory variables (LIFEEXPF, LIFEEXPM, and DEATH_RT). The standard equation is below, along with the equation for the estimated model.

Yi = B0 + B1(X1i) + B2(X2i) + B3(X3i) + E

FERTILTY = b0 + b1(LIFEEXPF) + b2(LIFEEXPM) + b3(DEATH_RT) + E

The model expresses FERTILTY (Yi) as a constant intercept (B0) plus a coefficient (B1) times LIFEEXPF (X1i), plus a coefficient (B2) times LIFEEXPM (X2i), plus a coefficient (B3) times DEATH_RT (X3i), plus an unexplained error term (E). The values b0 through b3 are the sample estimates of B0 through B3. The output of the regression model from the statistical software (PHSTAT) is summarized below.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8842011
R Square             0.781811585
Adjusted R Square    0.775394278
Standard Error       0.905219416
Observations         106

ANOVA
              df    SS             MS             F             Significance F
Regression      3   299.4872281    99.82907602    121.8286216   1.37259E-33
Residual      102    83.58106345    0.819422191
Total         105   383.0682915

             Coefficients    Standard Error   t Stat          P-value        Lower 95%       Upper 95%
Intercept    17.15599441     1.197780326      14.32315596     3.67868E-26    14.78020288     19.53178595
LIFEEXPF     -0.307052703    0.045750494      -6.711462047    1.10489E-09    -0.397798588    -0.216306818
LIFEEXPM      0.14085653     0.055575732       2.53449705     0.012780756     0.030622331     0.251090728
DEATH_RT     -0.127564431    0.031531798      -4.045580622    0.00010165     -0.190107601    -0.065021261

a) Goodness of Fit

The R squared (R2) measures the proportion of the variation in the dependent variable that is explained by the explanatory variables in the model. The regression program calculated an R2 of 0.78, which indicates that 78% of the variation in the dependent variable is explained by differences in the explanatory variables, and 22% is unaccounted for. The adjusted R2 of 0.775 is nearly identical to R2 because the sample is large relative to the number of explanatory variables, so it is not needed here. This indicates that the estimated model has reasonable predictive ability. The Line Fit Plot below confirms the reasonable accuracy when comparing actual vs. predicted fertility.

[Figure: Fertility Line Fit Plot. FERTILTY and Predicted FERTILTY by observation (1 to 106).]

The mean absolute deviation of the residuals (MAD) is 0.65, which is the average of the absolute values of the error terms. Dividing the MAD by the mean of the dependent variable gives a % standard error of the regression (%SER) of 18.11%.
Because the error terms are approximately normally distributed (the normal probability plot is shown in the Error Terms section below), a %SER of 0.18 indicates that 68% of the errors are ≤ 18% of the size of the mean of the dependent variable (and 95% of the errors are ≤ 36% of the size of the mean, that is, twice the %SER). The R2 of 0.78 and the %SER of 18% indicate that this multiple regression model has good predictive ability.

b) Error Terms

The following charts test whether the error terms (residuals) in the multiple regression model meet the data requirements. The first requirement states that the error terms should be normally distributed, which appears to be approximately met in the plot below.

[Figure: Residual Normal Probability Plot. Residuals against Z Value.]

The second requirement states that the error terms should be independent of each other. This also appears to be met in the chart below, which shows a random scatter of residuals across the observations.

[Figure: Residual Plot. Residuals against Observations.]

c) Overall Regression Test

The ‘F’ statistical inference test assesses whether all of the regression coefficients (except the constant) in the "true" model describing the underlying population are equal to zero. A four-step process is used below to conduct the F test: stating the hypotheses, calculating a sample F value, finding a critical F value, and then comparing the sample F to the critical F to reach a conclusion on the hypothesis.

Null Hypothesis (Ho): B1 = B2 = B3 = 0 (no linear relationship between FERTILTY and the explanatory variables)
Alternative Hypothesis (Ha): Ho is false (at least one independent variable affects FERTILTY)

The formula for calculating the sample F-statistic:

F = (SSR / k) / (SSE / (n - k - 1))

where SSR is the regression sum of squares, SSE is the residual sum of squares, k = # of explainers, and n = sample size. The regression program calculates the sample F-statistic value to be 121.83.
The critical F statistic, using a significance level of 0.05 with k = 3 degrees of freedom in the numerator and n - k - 1 = 102 degrees of freedom in the denominator, is 2.6937. The sample F-statistic is greater than the critical F-statistic, so the Null Hypothesis (Ho) is rejected, which shows that at least one of the explanatory variables influences the dependent variable. The Significance F value (1.37E-33) is extremely low, which shows that the chance of drawing a sample like this one when the Null Hypothesis is true is extremely small.

d) Single Coefficient Tests

The ‘T’ statistical inference test assesses whether each of the population regression coefficients is equal to, less than, or greater than a particular number. I do not have a prior expectation about what the value of each population coefficient should be, so I will use a Null Hypothesis that the coefficient equals zero. A four-step process is used below to conduct the T test on each regression coefficient: stating the hypotheses, calculating a sample T value, finding a two-tail critical T value, and then comparing the sample T to the critical T to reach a conclusion on the hypothesis.

LIFEEXPF:
Null Hypothesis (Ho): B1 (LIFEEXPF) = 0
Alternative Hypothesis (Ha): B1 (LIFEEXPF) ≠ 0

The formula for calculating the sample T-statistic:

t = (bi - Bi) / Sbi, with n - k - 1 degrees of freedom (sample size minus # explanatory variables minus 1)

where bi is the estimated coefficient, Bi is the hypothesized population value (zero here), and Sbi is the standard error of the coefficient. The regression program calculates the sample T-statistic value to be -6.711. The two-tail critical t-value for n - k - 1 = 102 degrees of freedom at the 0.05 significance level is 1.9834. The absolute value of the sample T-statistic is greater than the critical two-tail T-statistic, so the Null Hypothesis (Ho) is rejected. We therefore have sufficient evidence to conclude that higher female life expectancy is associated with lower fertility, holding the other explanatory variables constant.
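The four-step F and t tests in this section can be checked directly from the regression output. The sketch below is only an illustration under stated assumptions: it recomputes the sample statistics from the ANOVA sums of squares and from the coefficient and standard error columns, and it takes the critical values (2.6937 and 1.9834) from the text rather than computing them. The `Coefficient` class and the function names are hypothetical helpers, not part of PHStat.

```python
from dataclasses import dataclass

N_OBS = 106        # sample size n
K = 3              # number of explanatory variables k
SSR = 299.4872281  # regression sum of squares (ANOVA table)
SSE = 83.58106345  # residual sum of squares (ANOVA table)

# Critical values quoted in the text; they are not computed here.
F_CRITICAL = 2.6937  # alpha = 0.05, df = (3, 102)
T_CRITICAL = 1.9834  # two-tail, alpha = 0.05, df = 102


@dataclass(frozen=True)
class Coefficient:
    """One slope estimate and its standard error from the regression output."""
    name: str
    estimate: float
    std_error: float


COEFFICIENTS = (
    Coefficient("LIFEEXPF", -0.307052703, 0.045750494),
    Coefficient("LIFEEXPM", 0.14085653, 0.055575732),
    Coefficient("DEATH_RT", -0.127564431, 0.031531798),
)


def f_statistic(ssr: float, sse: float, n: int, k: int) -> float:
    """F = (SSR / k) / (SSE / (n - k - 1))."""
    return (ssr / k) / (sse / (n - k - 1))


def t_statistic(c: Coefficient, hypothesized: float = 0.0) -> float:
    """t = (estimate - hypothesized value) / standard error."""
    return (c.estimate - hypothesized) / c.std_error


def rejects_null(sample_stat: float, critical: float) -> bool:
    """Reject Ho when the absolute sample statistic exceeds the critical value."""
    return abs(sample_stat) > critical


if __name__ == "__main__":
    f_value = f_statistic(SSR, SSE, N_OBS, K)
    print(f"F = {f_value:.4f}, reject Ho: {rejects_null(f_value, F_CRITICAL)}")
    for c in COEFFICIENTS:
        t_value = t_statistic(c)
        print(f"{c.name}: t = {t_value:.4f}, "
              f"reject Ho: {rejects_null(t_value, T_CRITICAL)}")
```

Comparing the absolute value of each t-statistic with the critical value is what makes the test two-tailed; the sign of the statistic only indicates the direction of the estimated relationship.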
LIFEEXPM:
Null Hypothesis (Ho): B2 (LIFEEXPM) = 0
Alternative Hypothesis (Ha): B2 (LIFEEXPM) ≠ 0

The regression program calculates the sample T-statistic value to be 2.534. The two-tail critical t-value for n - k - 1 = 102 degrees of freedom at the 0.05 significance level is 1.9834. The absolute value of the sample T-statistic is greater than the critical two-tail T-statistic, so the Null Hypothesis (Ho) is rejected. We therefore have sufficient evidence to conclude that higher male life expectancy is associated with higher fertility, holding the other explanatory variables constant.

DEATH_RT:
Null Hypothesis (Ho): B3 (DEATH_RT) = 0
Alternative Hypothesis (Ha): B3 (DEATH_RT) ≠ 0

The regression program calculates the sample T-statistic value to be -4.046. The two-tail critical t-value for n - k - 1 = 102 degrees of freedom at the 0.05 significance level is 1.9834. The absolute value of the sample T-statistic is greater than the critical two-tail T-statistic, so the Null Hypothesis (Ho) is rejected. We therefore have sufficient evidence to conclude that a higher mortality rate is associated with lower fertility, holding the other explanatory variables constant.

e) Standardized Coefficients

To determine which explanatory variables (LIFEEXPF, LIFEEXPM, and DEATH_RT) have the greatest influence on the dependent variable (FERTILTY), standardized coefficients need to be calculated. Standardized coefficients show how many standard deviations the dependent variable will change if the explanatory variable changes by one standard deviation. Larger standardized coefficients (in absolute value) indicate more influence, smaller ones less. Standardized coefficients (bi*) for each explanatory variable can be calculated as follows:

bi* = bi × (Sxi / Sy)

where bi is the estimated regression coefficient, Sxi is the standard deviation of the explanatory variable, and Sy is the standard deviation of the dependent variable.

             Std. Coeff.   Est. Coeff.   SD of X      SD of Y
LIFEEXPF     -1.172599     -0.307053     10.654688    1.910044
LIFEEXPM      0.472039      0.140857      9.349868    1.910044
DEATH_RT     -0.195452     -0.127564      4.274784    1.910044

The above standardized coefficient values imply the following:

A one SD increase in LIFEEXPF leads to a 1.172599 SD decrease in FERTILTY.
A one SD increase in LIFEEXPM leads to a 0.472039 SD increase in FERTILTY.
A one SD increase in DEATH_RT leads to a 0.195452 SD decrease in FERTILTY.

Conclusion

The summary statistics and charts revealed that all of the variables have nonsymmetrical distributions. The error terms met the data requirements of being approximately normally distributed and independent of each other. The Fertility Line Fit Plot graphically showed the predictive performance of the model, which is supported by the good R2 and %SER scores. The test for overall fit provided sufficient evidence that at least one of the explanatory variables has an impact on average fertility. After testing the influence that each individual explainer (LIFEEXPF, LIFEEXPM, and DEATH_RT) had on the dependent variable (FERTILTY), it was concluded that there was sufficient evidence that each of them has an impact on the dependent variable. Also, the confidence intervals for the regression coefficients show how large the population coefficients are likely to be. Specifically, we're 95% confident that the "true" marginal effects on FERTILTY of changes in LIFEEXPF, LIFEEXPM, and DEATH_RT lie in the ranges depicted below. Note that zero does not lie within any of the ranges, which is consistent with rejecting, at the 5% significance level, the hypothesis that any of the population regression coefficients is zero.
             Lower 95%       Upper 95%
Intercept    14.78020288     19.53178595
LIFEEXPF     -0.397798588    -0.216306818
LIFEEXPM      0.030622331     0.251090728
DEATH_RT     -0.190107601    -0.065021261

The focus of this study was to analyze world demographic variables for average female life expectancy, average male life expectancy, and mortality rates and their influence on average fertility. The estimated negative relationship between female life expectancy and fertility is consistent with the working paper summarized in the Literature Review. The female life expectancy variable has the largest influence on fertility, reflecting a negative linear relationship. The mortality rate explainer has the smallest impact on fertility. The findings revealed that there is a good relationship between the dependent and explanatory variables, and the model showed good predictive performance (an R2 of 0.78 and a %SER of 18%).
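The standardized coefficients and the 95% confidence intervals discussed above can both be recomputed from the regression output. This is a minimal sketch under stated assumptions: the two-tail critical t-value (1.9834) is taken from the text, the standardized coefficient is computed as the estimate times Sxi divided by Sy, and the function names are hypothetical. With the tabulated inputs this formula gives roughly -1.71, 0.69, and -0.29, which are larger in magnitude than the Std. Coeff. column reported earlier, so that column may have been computed with a different scaling; the ranking of the three variables is the same either way.

```python
SD_Y = 1.910044      # standard deviation of FERTILTY
T_CRITICAL = 1.9834  # two-tail critical t from the text (alpha = 0.05, df = 102)

# name: (estimated coefficient, standard error, SD of the explanatory variable)
INPUTS = {
    "LIFEEXPF": (-0.307052703, 0.045750494, 10.654688),
    "LIFEEXPM": (0.14085653, 0.055575732, 9.349868),
    "DEATH_RT": (-0.127564431, 0.031531798, 4.274784),
}


def standardized_coefficient(b: float, sd_x: float, sd_y: float) -> float:
    """bi* = bi * (Sxi / Sy)."""
    return b * sd_x / sd_y


def confidence_interval(b: float, se: float, t_crit: float) -> tuple[float, float]:
    """Interval estimate: b minus and plus t_crit times the standard error."""
    margin = t_crit * se
    return (b - margin, b + margin)


def excludes_zero(interval: tuple[float, float]) -> bool:
    """True when both limits lie on the same side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


if __name__ == "__main__":
    for name, (b, se, sd_x) in INPUTS.items():
        std = standardized_coefficient(b, sd_x, SD_Y)
        lower, upper = confidence_interval(b, se, T_CRITICAL)
        print(f"{name}: std coeff = {std:+.4f}, "
              f"95% CI = ({lower:+.4f}, {upper:+.4f}), "
              f"excludes zero: {excludes_zero((lower, upper))}")
```

An interval that excludes zero leads to the same decision as the two-tail t test at the 5% level, because both compare the estimate with the same multiple of its standard error.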