Original Automatic Linear Modeling

Predictive Analysis of FTEs for 2013-14 School Year
The predicted value of FTEs for the 2013-14 School Year (End of Year Total) is 2,057. This is based on a multiple regression analysis of the
Unemployment Rate in Manitowoc and Population of 15 to 19 Year Olds in Manitowoc. The equation for the multiple regression is:
FTEs = -326.959 + (131.104 X UnemploymentRateManitowoc) + (0.291 X Populations-Manitowoc15to19YearOlds)
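As a minimal sketch, the equation above can be evaluated directly. The coefficients are taken from the equation; the function name and the example inputs (the 4,896 population figure and the 7.33% recent-average unemployment rate quoted later in this document) are illustrative assumptions, and are not stated to be the exact inputs behind the 2,057 prediction.

```python
# Coefficients copied from the regression equation in this report.
INTERCEPT = -326.959
B_UNEMPLOYMENT = 131.104  # FTEs per percentage point of unemployment
B_POPULATION = 0.291      # FTEs per Manitowoc resident aged 15 to 19


def predict_ftes(unemployment_rate, population_15_to_19):
    """Evaluate the fitted regression equation for one year's inputs."""
    return (INTERCEPT
            + B_UNEMPLOYMENT * unemployment_rate
            + B_POPULATION * population_15_to_19)


# ASSUMPTION: illustrative inputs only. 4,896 is the population value held
# constant in the report's simulation; 7.33 is the recent-average
# unemployment rate it mentions.
print(round(predict_ftes(7.33, 4896), 1))
```

With these illustrative inputs the result lands within a few FTEs of the reported prediction of 2,057.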
Data was collected from a variety of sources from 2001 through 2013. The variables initially assessed for their importance to the linearity of FTEs
were:
Of all the variables listed above, for the years 2001-13, the best predictors were found to be the Unemployment Rate in Manitowoc and the
Population of 15 to 19 Year Olds in Manitowoc. The procedure outlined below explains why in more detail.
Draft 9/23/2013
What is Multiple Linear Regression?
In statistics, linear regression is an approach to model the relationship between a scalar dependent variable y and one or more explanatory
variables denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, it is called
multiple linear regression. (source: Wikipedia, 9/23/2013)
Concern has been expressed as to why the Manitowoc Unemployment Rate and Manitowoc 15 to 19 Year Olds don’t “independently” correlate
to FTEs. While each individual correlation is dubious on its own, the multiple regression procedure shows that the two variables together model
FTEs quite well. For those concerned that the individual correlations of the independent variables do not map well to the dependent variable, the
statistical explanation is as follows (source: http://www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0205863728.pdf, page 171):
In a multiple linear regression, more than one independent variable is included in the regression model…multiple regression examines
how two or more variables act together to affect the dependent variable…While the interpretation of the statistics in multiple regression
is, on the whole, the same as in bivariate regression, there is one important difference. In bivariate regression, the regression coefficient
is interpreted as the predicted change in the value of the dependent variable for a one-unit change in the independent variable. In
multiple regression, the effects of multiple independent variables often overlap in their association with the dependent variable. The
coefficients printed by SPSS don’t include the overlapping part of the association. Multiple regression coefficients only describe the
unique association between the dependent and that independent variable. The ANOVA F-test and the R-square statistic include this
overlapping portion. This means a variable’s coefficient shows the “net strength” of the relationship of that particular independent
variable to the dependent variable, above and beyond the relationships of the other independent variables. Each coefficient is then
interpreted as the predicted change in the value of the dependent variable for a one-unit change in the independent variable, after
accounting for the effects of the other variables in the model.
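The distinction drawn in the passage above can be illustrated with a small sketch on made-up data (the numbers are hypothetical and unrelated to the FTE data). Two correlated predictors generate y exactly as 2*x1 + 3*x2. The bivariate slope of y on x1 alone absorbs part of x2's effect, while the multiple regression coefficient for x1 recovers only its unique contribution. The helper names are assumptions for illustration, and the two-predictor formulas are the standard closed-form least-squares solution, not the SPSS procedure.

```python
def centered_cross_product(a, b):
    """Sum of (a_i - mean(a)) * (b_i - mean(b))."""
    a_mean = sum(a) / len(a)
    b_mean = sum(b) / len(b)
    return sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))


def bivariate_slope(x, y):
    """Slope of a simple (one-predictor) least-squares regression of y on x."""
    return centered_cross_product(x, y) / centered_cross_product(x, x)


def two_predictor_coefficients(x1, x2, y):
    """Slopes (b1, b2) of a least-squares regression of y on x1 and x2."""
    s11 = centered_cross_product(x1, x1)
    s22 = centered_cross_product(x2, x2)
    s12 = centered_cross_product(x1, x2)
    s1y = centered_cross_product(x1, y)
    s2y = centered_cross_product(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Hypothetical data: x1 and x2 are correlated, and y = 2*x1 + 3*x2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

print("bivariate slope of y on x1:", round(bivariate_slope(x1, y), 3))
print("multiple regression (b1, b2):", two_predictor_coefficients(x1, x2, y))
```

The bivariate slope comes out well above 2 because x1 is standing in for the correlated x2, whereas the multiple regression separates the two contributions.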
What is Monte Carlo Simulation?
Essentially, it is a simulated flip of a coin. While we know the probability of getting heads or tails with a fair coin is 50%, a Monte Carlo method
uses random number generation to simulate flipping a coin 1,000, 100,000, or even 1 million times and then lets us look at the resulting
probabilities. In this analysis, the Manitowoc unemployment rate is the simulated input, and FTEs are computed from it using the regression equation.
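The coin-flip analogy can be sketched in a few lines. This is only an illustration of the idea, using Python's standard random number generator; the function name and seed are assumptions and have nothing to do with the software used for the analysis.

```python
import random


def simulate_coin_flips(n_flips, seed=0):
    """Return the fraction of heads in n_flips simulated fair coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


# The observed fraction of heads settles toward 0.5 as the number of flips grows.
for n in (1_000, 100_000):
    print(n, simulate_coin_flips(n))
```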
Steps Used in this Analysis:
1) Collect data for variables that may be thought to have “predictive capabilities” in determining final FTE number each year
2) Run an automatic linear regression with all variables (38 total)
3) Of those variables detected to be of “highest importance” to the regression, re-run the analysis including only those variables
4) Remove outliers, re-run the analysis
5) Determine final variables to use in the regression
6) Run a manual linear regression to check the validity of the regression assumptions (the tests are listed in this document)
7) Build Monte Carlo simulation (2,173,554 simulated iterations of the year)
8) View probabilities and analyze sensitivity to unemployment rate changes
The data used in the analysis:
Probability densities based on a Monte Carlo simulation with the Population of Manitowoc 15 to 19 Year Olds held constant at 4,896 and the Manitowoc
Unemployment Rate simulated between 6.25% and 8.25%, expanded to 2,173,554 iterations, resulting in a median of 7.17% (close to the recent
average of 7.33%). A sensitivity analysis was conducted for various levels of the Manitowoc Unemployment Rate.
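The simulation described above can be sketched as follows. The report does not state which distribution the unemployment rate was drawn from, and its median of 7.17% suggests it was not symmetric, so the uniform draw below is an ASSUMPTION used only as a stand-in. The iteration count is also reduced for speed, and all names are illustrative.

```python
import random
import statistics

# Coefficients and fixed inputs taken from this report.
INTERCEPT, B_UNEMPLOYMENT, B_POPULATION = -326.959, 131.104, 0.291
POPULATION_15_TO_19 = 4896        # held constant in the simulation
RATE_LOW, RATE_HIGH = 6.25, 8.25  # simulated unemployment range, in percent


def simulate_ftes(n_iterations, seed=0):
    """Draw unemployment rates at random and compute the predicted FTEs."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_iterations):
        # ASSUMPTION: uniform sampling; the report's actual distribution is unknown.
        rate = rng.uniform(RATE_LOW, RATE_HIGH)
        results.append(INTERCEPT
                       + B_UNEMPLOYMENT * rate
                       + B_POPULATION * POPULATION_15_TO_19)
    return results


ftes = simulate_ftes(100_000)
print("median:", round(statistics.median(ftes)))
print("range:", round(min(ftes)), "to", round(max(ftes)))
```

Because the equation is linear in the unemployment rate, the simulated FTE distribution has the same shape as whatever distribution is chosen for the rate; only the scale changes.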
FTE Change Based on Standard Deviation of Input: A change of +/-1 standard deviation in the Manitowoc Unemployment Rate results in a change of
217 FTEs. (1 standard deviation of the Manitowoc Unemployment Rate from 2001 until present, excluding outliers, equals 1.607 percentage points.)
This can be checked against the change from 2011-12 to 2012-13, when unemployment was 8.2 and 7.6 respectively, a difference of 0.6 percentage
points or roughly 0.37 standard deviations. The total FTE counts for those years were 2252 and 2139 respectively (2139 was used because the
analysis began before the Year-End FTE count was finalized), a change of about 113 FTEs. 37% of the 217 FTEs is about 81, which is below but in
the same direction as the observed change from 2011-12 to 2012-13. Essentially, for every 1.6-percentage-point rise or fall in Manitowoc’s
Unemployment Rate (when accounting for change in 15 to 19 year old demographics), we can expect to see a rise or fall of about 217 FTEs. Of
course, as the years continue and more data is collected, the model should be rebuilt to account for the most recent data. But this can be used
as a general rule of thumb for the next one or two years.
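The rule of thumb can be written as a one-line conversion. This sketch simply scales the two figures reported above (217 FTEs per standard deviation, and a standard deviation of 1.607 percentage points); the function name is an assumption, and the sketch inherits whatever uncertainty is in those two figures.

```python
FTES_PER_SD = 217.0      # reported FTE change per standard deviation of unemployment
SD_UNEMPLOYMENT = 1.607  # reported standard deviation, in percentage points


def fte_change(rate_change_points):
    """Approximate FTE change for a change in unemployment (percentage points)."""
    return FTES_PER_SD * rate_change_points / SD_UNEMPLOYMENT


print(round(fte_change(1.607)))  # one full standard deviation
print(round(fte_change(1.0)))    # one percentage point
print(round(fte_change(-0.5)))   # half a point drop in unemployment
```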
The final prediction model was created using the variables identified in the Manual Linear Regression procedures. The final manual model steps
are included in the later part of this document. For purposes of prediction, the Manual Linear Regression procedure will be used as a “check” on
the Automatic Linear Modeling. That is, the Automatic Linear Model will predict final values because it is more robust than the manual
regression procedure. (Per email communications with David Nichols, Statistician, IBM SPSS: “[The Automatic Linear Modeling procedure] does
avoid some numerical issues by using scaling and centering of variables to form a correlation matrix instead of a raw crossproducts matrix, and
uses a sweep algorithm that allows it to avoid trying to invert a singular matrix.”)
To achieve the FTE Prediction Model, the procedure for Automatic Linear Modeling was used in SPSS.
Original Automatic Linear Modeling – PRIOR to REMOVAL OF OUTLIERS:
“Accuracy” is equivalent to the Adjusted R-squared value. Note: The Adjusted R-squared value is used in place of R-squared to account for
additional variables in the model. Because R-squared never decreases as variables are added to a model, Adjusted R-squared applies a penalty for each additional variable. The actual R-squared value of this model is .914.
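The adjustment uses the standard formula shown in the sketch below. The case count of 12 and predictor count of 3 are ASSUMPTIONS read off the rest of this document (the twelve plotted points and the three predictors listed under Predictor Importance), so the printed value is illustrative and may not equal the accuracy figure SPSS displayed.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjusted R-squared: penalizes R-squared for each predictor used."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


# ASSUMPTION: 12 cases and 3 predictors for the original model.
print(round(adjusted_r_squared(0.914, 12, 3), 3))

# Same R-squared, more predictors: the adjusted value drops further.
print(round(adjusted_r_squared(0.914, 12, 6), 3))
```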
The model transformed the predictors; the transformation applied was “Trim Outliers”.
Predictor Importance:
Unemployment Rate Manitowoc = 0.80
High School Sheboygan Falls Graduates = 0.11
Population – Manitowoc 15 to 19 = 0.09
From this model, we see twelve plot-points of data showing the Actual FTEs and the Predicted Value using the linear regression model.
Another way to represent the distribution of the residuals is the P-P Plot below. Ideally, the residuals all fall close to the line.
The following records contain outliers for the years 2004-05 and 2008-09. Notice that the Cook’s distance is over 1.
After removing cases #8 (2008-09) and #4 (2004-05), the regression was run again. This time, case #2 (2002-03) was detected as an outlier. Case
#2 was removed, in addition to #4 and #8. When the model was run after removing these three cases, the variable “High School Sheboygan Falls
Graduates” was automatically removed from the regression, and the importance of Unemployment Rate Manitowoc and Population – Manitowoc 15 to 19 increased to 83% and 17% respectively. No outliers were detected at this point.
Because of this, it is safe to assume that “High School Sheboygan Falls Graduates” was being indicated as an important variable in the multiple
regression because of the high Cook’s distances of case #2.
Once this was determined, the regression was re-run using just the two variables “Manitowoc unemployment” and “Manitowoc Population 15
to 19”. All cases except #2 (2002-03) were included. This time, over a few iterations of the regression procedure, cases #8 (2008-09) and #9
(2009-10) were identified as outliers with high Cook’s distances. Once all 3 outlier cases were removed from the analysis, the regression was run
again.
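The outlier screening above relies on Cook's distance, which SPSS reports for each case. As a hedged illustration of what that number measures, the sketch below computes it for a regression with a single predictor on hypothetical data (the report's model has two predictors, and none of these values come from the FTE data). The function name and data are assumptions.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage of each case: how far its x value sits from the mean of x.
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    return [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverages)]


# Hypothetical data: the first seven points follow y = 2x; the last one does not.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 40]

for case, d in enumerate(cooks_distances(x, y), start=1):
    flag = "  <-- above 1" if d > 1 else ""
    print(f"case {case}: {d:.3f}{flag}")
```

Only the final case exceeds the threshold of 1 used in this analysis, because it combines a large residual with high leverage.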
Here are the final results of the Automatic Linear Regression with Cases #2, 8, and 9 Removed (indicated to be outliers in previous iterations):
Automatic Linear Modeling (Multiple Linear Regression)
In the table below, we see the model and independent variables are significant at the p<.05 level.
The table below was also derived in the manual linear regression – with the same results.
From the Coefficients table below, the equation used to predict FTEs can be derived from the Unstandardized Coefficients (B) as:
FTEs = -326.959 + (131.104 X UnemploymentRateManitowocTransformed) + (0.291 X Populations-Manitowoc15to19YearOldsTransformed)
The unstandardized coefficients of the independent variables are statistically significantly different from zero at the p<.05 level (see the Sig. column
of the Coefficients table). For this analysis, it does not matter whether the intercept coefficient is statistically significantly different from zero.
The lower the information criterion, the better the model compares with models that have a higher information criterion. Many of the other
models run (not discussed in this document) had information criteria in the 100s and above. This is the lowest information criterion
observed.
Manual Linear Regression (Multiple Linear Regression)
Running the model as a manual linear regression allows the following assumptions to be checked.
Assumptions:
The residuals were independent (no autocorrelation), as assessed by a Durbin-Watson statistic of 2.390.
A plot of Studentized Residuals against Unstandardized Predicted Values showed a horizontal band, indicating that the relationship between the
dependent and independent variables is likely to be linear.
Partial regression plots for each predictor variable against FTEs showed a linear relationship.
Correlations among the independent variables should be below .7. However, the correlation between Unemployment and 15 to 19 year olds
is .730. Because this is a strong correlation, further tests were examined for problems with collinearity.
Tolerance is greater than .1 and VIF is less than 10, indicating there is no collinearity problem. Based on these criteria, there do not appear
to be multicollinearity problems between the variables. While the high correlation is a concern, its impact on the final regression equation
should be minimal.
All cases have Standardized Residuals within +/-3, indicating no outliers exist in the data.
All cases have a Studentized Deleted Residual within +/-3, indicating no cases are potential outliers.
A few cases have high leverage values (above .2), in the “risky range” (.2 to .5). This indicates that the independent variables in these cases
have values farther from the average predictor values. However, most of those cases are in more recent years, which is
acceptable since recent years most likely resemble the year being predicted more closely, particularly because of the shift in unemployment.
This risk will be accepted in the analysis.
No Cook’s distances were greater than 1, indicating there are no influential points.
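Two of the checks listed above can be sketched by hand. The Durbin-Watson function uses the standard definition and is demonstrated on hypothetical residuals, since the report's residuals are not reproduced here. The tolerance and VIF calculation uses the .730 correlation reported above and relies on the fact that, with exactly two predictors, tolerance equals one minus their squared correlation. Whether .730 was computed on the same cases as the final model is an ASSUMPTION, so the SPSS values may differ; function names are illustrative.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def tolerance_and_vif(correlation):
    """Tolerance and VIF for a model with exactly two predictors."""
    tolerance = 1 - correlation ** 2
    return tolerance, 1 / tolerance


# Hypothetical residuals, for illustration only.
example_residuals = [1.0, -1.0, 1.0, -1.0]
print(durbin_watson(example_residuals))  # alternating signs push the statistic above 2

# .730 is the predictor correlation reported above.
tolerance, vif = tolerance_and_vif(0.730)
print(round(tolerance, 3), round(vif, 2))
```

The statistic ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation, which is why the reported 2.390 is read as acceptable. The tolerance and VIF implied by a .730 correlation sit comfortably inside the thresholds of .1 and 10.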
A histogram of the Regression Standardized Residual shows an approximate normal distribution. The mean and standard deviation have values
of approximately 0 (zero) and 1, respectively. Mean= -8.32E-15 and Std. Dev= 0.866, N=9.
The P-P Plot of Regression Standardized Residual shows an approximately normal distribution.
Reporting the Output:
In the table below, Population of Manitowoc 15 to 19 year olds and Unemployment Rate Manitowoc statistically significantly predict FTEs, F(2,
6) = 296.114, p < .0005 (i.e., the regression model is a good fit of the data). The null hypothesis of this test is that the multiple correlation
coefficient, R, is equal to 0. Rejecting it also means that at least one regression coefficient (other than the intercept) is statistically
significantly different from zero.
From the Coefficients table below, the equation used to predict FTEs can be derived from the Unstandardized Coefficients (B) as:
FTEs = -326.959 + (131.104 X UnemploymentRateManitowoc) + (0.291 X Populations-Manitowoc15to19YearOlds)
The unstandardized coefficients of the independent variables are statistically significantly different from zero at the p<.05 level (see the Sig. column
of the Coefficients table). For this analysis, it does not matter whether the intercept coefficient is statistically significantly different from zero.
A multiple regression was run to predict FTEs from Unemployment Rate in Manitowoc and Population of Manitowoc 15 to 19 Year Olds. The
assumptions of linearity (with some concern), independence of errors, unusual points, homoscedasticity, and normality of residuals were met.
Some outlier years were removed; with these points removed, the regression achieved a higher value of R-squared.
Automatic Linear Modeling was run and its results matched those achieved here. These variables statistically significantly
predicted FTEs, F(2, 6) = 296.114, p < .0005, adj. R² = .987. Both variables added statistically significantly to the prediction, p < .05. Regression
coefficients and standard errors can be found in the Coefficients table above.
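As a consistency check, R-squared and adjusted R-squared can be recovered from the reported F statistic and its degrees of freedom using standard identities. The function names are illustrative assumptions; the inputs are the F(2, 6) = 296.114 reported above.

```python
def r_squared_from_f(f_stat, df_model, df_residual):
    """Recover R-squared from an overall F statistic and its degrees of freedom."""
    return f_stat * df_model / (f_stat * df_model + df_residual)


def adjusted_from_r_squared(r_squared, df_model, df_residual):
    """Adjusted R-squared, with n - 1 = df_model + df_residual."""
    return 1 - (1 - r_squared) * (df_model + df_residual) / df_residual


r2 = r_squared_from_f(296.114, 2, 6)
adj_r2 = adjusted_from_r_squared(r2, 2, 6)
print(round(r2, 3), round(adj_r2, 3))
```

The adjusted value rounds to .987, matching the figure reported above.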