Regression Forecasting and Model Building
Forecasting company revenue with multiple linear regression

Forecasting Revenue: An Example of Regression Model Building
• Setting: A possibly large set of predictor variables (i.e. revenue drivers) is used to predict future quarterly revenues from data collected over the previous 7 years.
• Goal: Find an equation (model) that explains variation in Y with a smaller set of predictors that are all related to Y but not too strongly related to each other (multicollinearity). Predict revenues for the next four quarters. Your dependent variable will be revenues or seasonally adjusted revenues, depending on whether your data has pronounced seasonality.

Forecasting Revenue: An Example of Regression Model Building
• When you speculate on predictors, it is not unusual for many of them to be strongly related to each other. This is especially the case when one variable is mostly derived from another. Predictors that are too highly correlated create multicollinearity: they add essentially no additional information while interfering with each other in fitting the dependent variable.
• Starting Point: Examine multicollinearity by checking a correlation matrix of the predictors and by generating VIF values. This gives you some choice in which variables to keep: prefer those that have better forecasts available or that you believe, in theory, should be most related to revenues.

Variance Inflation Factors
• Variance Inflation Factor (VIF) – Measure of how highly correlated each independent variable is with the other predictors in the model. Used to identify multicollinearity.
• Values larger than 10 for a predictor imply large inflation of the standard errors of the regression coefficients due to this variable being in the model.
• Inflated standard errors lead to insignificant t-statistics for regression coefficients and wider confidence intervals.

Forecasting Revenue: An Example of Regression Model Building
• Run a multiple regression to look at VIF values (and D-W values).
– Delete one of the variables with VIF > 10. Use the correlation matrix to see which pairs of high-VIF variables are highly correlated. For each pair, delete the one that has the highest VIF, or the high-VIF variable that may not have forecasts available or has other problems (such as non-linearity). There is some flexibility in this step and it may require some investigation.
• Repeat until all VIF values are smaller than 10. This will result in a reduced set of variables to use in finding an equation using All Possible Regressions.

Forecasting Revenue: An Example of Regression Model Building
• Best Model Process using the data: Use MegaStat All Possible Regressions to find an equation that has the fewest variables, all of them significant (p-value < .05), together with a small standard error and a large adjusted R-squared.
• MegaStat will order the models from highest adjusted R-squared and lowest standard error. Look near the top of the list for best-model candidates that have all p-values significant and the fewest predictors. You can use the formula =IF(COUNTIF(predictor range,">.05")>0,"","OK") to help identify the models whose predictors are all significant, then compare the OK models by looking at adjusted R-squared and standard error.
• If MegaStat provides a Cp statistic, it summarizes each possible model, and the “best” model can be selected based on this statistic. Ideally you select the model with the fewest predictors p that has Cp ≈ p and has p-values < .05 for all variables.

Forecasting Revenue: An Example of Regression Model Building
• Again you have some flexibility here to choose a set of variables with desirable qualities (e.g. good forecasts).
• Minor differences in adjusted R-squared and standard error are not likely to have a significant impact on your forecast results.
• Keep in mind that p-values only indicate confidence that the slope is not zero. You need only be confident enough; smaller p-values do not translate into better forecasts. Predictors with p-values that are small but larger than .05 may still be good for your model.
• If you have to go far down the list, sacrificing adjusted R-squared and standard error, consider using a model with predictors that are less significant, or swap out one or more variables for one of the highly correlated variables you left out previously.

Validating Your Model
• When you forecast with speculative predictors, it is possible that the data coincidentally has a relationship to the dependent variable (“spurious correlation”), especially with small amounts of time series data. To help address this, we will use a “hold out sample” to validate that the relationships actually exist.
• Validation with holdout sample: Run the regression with the best model selected, leaving out the last two quarters of data. Forecast the quarters you held out with 95% prediction intervals.
– Check the assumptions for the validation model. If they are not valid, can you fix the problem? Transform the data?
– Do the actual values fall within the lower and upper prediction limits, implying that the predictions seem reasonable?
– If not, try using an alternative model from the All Possible Regressions options, or see if there is a reason that the quarters held out are different in some way. Look at the quarterly reports and see if they might suggest use of a dummy variable. Redo the validation process.
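The holdout check described above can be sketched in a few lines of plain Python. This is only an illustration, not the MegaStat procedure: the function names (fit_ols, prediction_interval, validate_holdout) are hypothetical, the model is fit by ordinary least squares through the normal equations, and the t critical value is assumed to be supplied by the caller from a t-table for n_train - k degrees of freedom, where k counts the coefficients including the intercept.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def invert(a: Matrix) -> Matrix:
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment the matrix with the identity on the right.
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors may be perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_ols(X: Sequence[Sequence[float]], y: Sequence[float]) -> Tuple[List[float], float, Matrix]:
    """Fit y on X (an intercept is added here). Returns (coefficients, standard error, (X'X)^-1)."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    s = (sse / (len(rows) - k)) ** 0.5  # standard error of the regression
    return beta, s, xtx_inv


def prediction_interval(x_new: Sequence[float], beta: Sequence[float], s: float,
                        xtx_inv: Matrix, t_crit: float) -> Tuple[float, float, float]:
    """Return (lower, forecast, upper) for one new observation."""
    x0 = [1.0] + list(x_new)
    k = len(x0)
    y_hat = sum(b * v for b, v in zip(beta, x0))
    # Distance term x0' (X'X)^-1 x0 widens the interval away from the training data.
    h = sum(x0[i] * xtx_inv[i][j] * x0[j] for i in range(k) for j in range(k))
    half_width = t_crit * s * (1.0 + h) ** 0.5
    return y_hat - half_width, y_hat, y_hat + half_width


def validate_holdout(X: Sequence[Sequence[float]], y: Sequence[float],
                     t_crit: float, n_holdout: int = 2):
    """Fit without the last n_holdout quarters, then check each held-out actual value."""
    beta, s, xtx_inv = fit_ols(X[:-n_holdout], y[:-n_holdout])
    results = []
    for x_new, actual in zip(X[-n_holdout:], y[-n_holdout:]):
        lower, y_hat, upper = prediction_interval(x_new, beta, s, xtx_inv, t_crit)
        results.append((lower, y_hat, upper, actual, lower <= actual <= upper))
    return results
```

Each result row is (lower limit, forecast, upper limit, actual, inside?). As an example of choosing t_crit: with 26 training quarters and 3 predictors there are 22 degrees of freedom, and the tabled 95% two-sided value is about 2.074.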
Regression Diagnostics
Model Assumptions:
• Residual plots or other diagnostics can be used to check the assumptions.
-- Plot of residuals versus each variable should be a random cloud. A U-shaped (or rainbow) pattern indicates a nonlinear relationship.
-- Plot of residuals versus predicted values should be a random cloud. A wedge shape indicates non-constant (increasing) variability.
-- Residuals should be mound-shaped (normal). Use skewness/kurtosis or a normal probability plot to check.
-- Plot of residuals versus time order (time series data) should be a random cloud. If D-W < 1.3, residuals are not independent.
Cook’s D is a check for influential observations that may have large impacts on the equation. Check data for accuracy or errors (e.g. typos, wrong units, etc.).

Detecting Influential Observations
Studentized Residuals – Residuals divided by their estimated standard errors. Observations shaded dark blue in the output are considered outliers from the equation.
Leverage Values – Measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations shaded dark blue in the output are considered to be outliers in the X values.
Cook’s D – Measure of aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than 1 are considered highly influential.
Influential observations may point to quarters worth researching to see whether something special happened that would justify a dummy variable.

The Final Forecasts
• Add the last two quarters back into your data set, re-estimate the equation using the same variables, and forecast the next four quarters. Recheck the assumptions now that you have 2 additional data points.
• Do the forecasts make sense? Superimpose your forecasts on a time series plot of revenues and ensure that the forecasts seem reasonable. If not, try to explain why or find the error.
• Document all your data and forecast sources.
• Write a report that documents all aspects of the forecasting process.
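When rechecking the assumptions on the full data set, two of the numeric checks from the diagnostics slides (D-W < 1.3 and Cook’s D > 1) can be scripted. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the residuals and leverage values are assumed to be copied from your regression output in time order, and n_coefficients counts the intercept plus the predictors. It uses the standard textbook formulas for the Durbin-Watson statistic and Cook’s distance; it does not reproduce MegaStat’s own shading rules.

```python
from typing import Dict, List, Sequence


def durbin_watson(residuals: Sequence[float]) -> float:
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def cooks_distance(residuals: Sequence[float], leverages: Sequence[float],
                   n_coefficients: int) -> List[float]:
    """Cook's D for each observation from its residual and leverage value."""
    n = len(residuals)
    mse = sum(e * e for e in residuals) / (n - n_coefficients)
    return [(e * e / (n_coefficients * mse)) * h / (1.0 - h) ** 2
            for e, h in zip(residuals, leverages)]


def flag_problems(residuals: Sequence[float], leverages: Sequence[float],
                  n_coefficients: int, dw_cutoff: float = 1.3,
                  cooks_cutoff: float = 1.0) -> Dict[str, object]:
    """Apply the slide cutoffs: D-W below 1.3 and Cook's D above 1."""
    dw = durbin_watson(residuals)
    d = cooks_distance(residuals, leverages, n_coefficients)
    return {
        "durbin_watson": dw,
        "autocorrelated": dw < dw_cutoff,
        # Zero-based positions of quarters worth researching.
        "influential": [i for i, value in enumerate(d) if value > cooks_cutoff],
    }
```

The positions returned under "influential" are zero-based row indexes into the residual list, so they map directly back to quarters in the data set.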