Chapter 11
Regression Analysis: Statistical Inference
BUSINESS ANALYTICS: DATA ANALYSIS AND DECISION MAKING

© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Introduction

Two basic problems are discussed in this chapter:
- The population regression model
  - Inferring its characteristics—that is, its intercept and slope term(s)—from the corresponding terms estimated by least squares
  - Determining which explanatory variables belong in the equation
  - Inferring whether there is any population regression equation worth pursuing
- Prediction
  - Predicting values of the dependent variable for new observations
  - Calculating prediction intervals to measure the accuracy of the predictions

The Statistical Model (slide 1 of 7)

To perform statistical inference in a regression context, a statistical model is required—that is, we must first make several assumptions about the population.
- These assumptions represent an idealization of reality and are never likely to be entirely satisfied for the population in any real study.
- From a practical point of view, all we can ask is that they represent a close approximation to reality.
- If the assumptions are grossly violated, statistical inferences based on these assumptions should be viewed with suspicion.

The Statistical Model (slide 2 of 7)

Regression assumptions:
1. There is a population regression line. It joins the means of the dependent variable for all values of the explanatory variables. For any fixed values of the explanatory variables, the mean of the errors is zero.
2. For any values of the explanatory variables, the variance (or standard deviation) of the dependent variable is a constant, the same for all such values.
3. For any values of the explanatory variables, the dependent variable is normally distributed.
4. The errors are probabilistically independent.

The Statistical Model (slide 3 of 7)

The first assumption is probably the most important. It implies that for some set of explanatory variables, there is an exact linear relationship in the population between the means of the dependent variable and the values of the explanatory variables.

Equation for the population regression line joining means:
  Mean of Y = α + β1X1 + β2X2 + ... + βkXk
- α is the intercept term, and the βs are the slope terms. (Greek letters are used to denote that they are unobservable population parameters.)
- Most individual Ys do not lie on the population regression line. The vertical distance from any point to the line is an error.

Equation for the population regression line with error:
  Y = α + β1X1 + β2X2 + ... + βkXk + ε

The Statistical Model (slide 4 of 7)

Assumption 2 concerns variation around the population regression line. It states that the variation of the Ys about the regression line is the same, regardless of the values of the Xs.
- The technical term for this property is homoscedasticity. A simpler term is constant error variance.
- This assumption is often questionable—the variation in Y often increases as X increases.
- Heteroscedasticity means that the variability of Y values is larger for some X values than for others. A simpler term for this is nonconstant error variance.
- The easiest way to detect nonconstant error variance is through a visual inspection of a scatterplot.
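The fan-shape form of nonconstant error variance described above can be seen in a quick simulation. This is an illustrative sketch, not one of the book's examples: the data are made up, and the error spread is deliberately made to grow with X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, n)
# Simulated fan shape: the error standard deviation grows with x
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

# Least-squares fit and residuals
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)

# Compare residual spread in the lower and upper halves of the x range
lo = resid[x < 5].std()
hi = resid[x >= 5].std()
print(lo, hi)  # the spread is noticeably larger for large x
```

In a scatterplot of these residuals versus x, the points would fan out from left to right, which is exactly the visual symptom the slide describes.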
The Statistical Model (slide 5 of 7)

Assumption 3 is equivalent to stating that the errors are normally distributed.
- You can check this by forming a histogram (or a Q-Q plot) of the residuals.
- If assumption 3 holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45-degree line.
- Obvious skewness or some other nonnormal property indicates a violation of assumption 3.

The Statistical Model (slide 6 of 7)

Assumption 4 requires probabilistic independence of the errors. This assumption means that information on some of the errors provides no information on the values of the other errors.
- For cross-sectional data, this assumption is usually taken for granted.
- For time series data, this assumption is often violated because of a property called autocorrelation.
- The Durbin-Watson statistic is one measure of autocorrelation, and it thus measures the extent to which assumption 4 is violated.

The Statistical Model (slide 7 of 7)

One other assumption is important for numerical calculations: no explanatory variable can be an exact linear combination of any other explanatory variables.
- This assumption is violated if one of the explanatory variables can be written as a weighted sum of several of the others. This is called exact multicollinearity, and if it exists, there is redundancy in the data.
- A more common and serious problem is multicollinearity, where explanatory variables are highly, but not exactly, correlated.
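Exact multicollinearity can be demonstrated numerically: if one explanatory variable is a weighted sum of others, the design matrix loses rank, so the usual least-squares coefficient calculation has no unique solution. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 + 3 * x2          # an exact linear combination: redundant

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(n), x1, x2, x3])

# The rank is 3, not 4: X has a redundant column, X'X is singular,
# and the regression coefficients are not uniquely determined
print(np.linalg.matrix_rank(X))
```

Regression software typically either reports an error in this situation or silently drops one of the redundant variables.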
Inferences about the Regression Coefficients

In the equation for the population regression line, α and the βs are called the regression coefficients. There is one other unknown constant in the model: the variance of the errors, labeled σ2.

The choice of relevant explanatory variables is almost never obvious. Two guiding principles are relevance and data availability.

One overriding principle is parsimony—explaining the most with the least. It favors a model with fewer explanatory variables, assuming that this model explains the dependent variable almost as well as a model with additional explanatory variables.

Sampling Distribution of the Regression Coefficients

The sampling distribution of any estimate derived from sample data is the distribution of this estimate over all possible samples.

Sampling distribution of a regression coefficient: if the regression assumptions are valid, the standardized value
  t = (b − β) / sb
has a t distribution with n − k − 1 degrees of freedom.

This result has three important implications:
- The estimate b is unbiased in the sense that its mean is β, the true but unknown value of the slope.
- The estimated standard deviation of b is labeled sb. It is usually called the standard error of a regression coefficient, or the standard error of b. It measures how much the bs would vary from sample to sample.
- The shape of the distribution of b is symmetric and bell-shaped.
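The unbiasedness of b and its sample-to-sample variability can be illustrated by simulation: draw many samples from a known population regression line and look at the distribution of the estimated slopes. The population parameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 5.0, 2.0      # true (population) intercept and slope
slopes = []
for _ in range(2000):       # 2000 samples of size 30 each
    x = rng.uniform(0, 10, 30)
    y = alpha + beta * x + rng.normal(0, 3, 30)  # errors satisfy the assumptions
    b = np.polyfit(x, y, 1)[0]                   # least-squares slope estimate
    slopes.append(b)

slopes = np.array(slopes)
# The average of the b's is close to beta (unbiasedness), and their
# standard deviation is what the standard error sb tries to estimate
print(slopes.mean(), slopes.std())
```

A histogram of `slopes` would be approximately symmetric and bell-shaped, matching the third implication above.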
Example 11.1: Overhead Costs.xlsx (slide 1 of 2)

Objective: To use standard regression output to make inferences about the regression coefficients of machine hours and production runs in the equation for overhead costs.

Solution: The dependent variable is Overhead, and the explanatory variables are Machine Hours and Production Runs. The output from StatTools's Regression procedure is shown below.
- The estimates of the regression coefficients appear under the label Coefficient. The column labeled Standard Error shows the sb values.
- Each b represents a point estimate of the corresponding β, and the corresponding sb indicates the accuracy of this point estimate.

Example 11.1: Overhead Costs.xlsx (slide 2 of 2)

The sample data can be used to obtain a confidence interval for a regression coefficient. A confidence interval for any β is of the form
  b ± t-multiple × sb
where the t-multiple depends on the confidence level and the degrees of freedom.

StatTools always provides these 95% confidence intervals for the regression coefficients automatically, as shown at the bottom right of the figure on the previous slide.

Hypothesis Tests for the Regression Coefficients and p-Values

There is another important piece of information in regression outputs: the t-values for the individual regression coefficients.
- Each t-value is the ratio of the estimated coefficient to its standard error, b / sb. It indicates how many standard errors the regression coefficient is from zero.
- A t-value can be used in a hypothesis test for a regression coefficient, where the null hypothesis is that the coefficient is zero. If a variable's coefficient is zero, there is no point in including this variable in the equation.
To run this test, compare the t-value in the regression output with a tabulated t-value, and reject the null hypothesis only if the t-value from the computer output is greater in magnitude than the tabulated t-value.

A Test for the Overall Fit: The ANOVA Table (slide 1 of 3)

It is conceivable that none of the explanatory variables in the regression equation explains the dependent variable. An indication of this problem is a very small R2 value.
- An equation has no explanatory power if the same value of Y is predicted regardless of the values of the Xs.
- The null hypothesis is that all coefficients of the explanatory variables are zero. The alternative is that at least one of these coefficients is not zero.

A Test for the Overall Fit: The ANOVA Table (slide 2 of 3)

To test the null hypothesis, use an F test, a formal procedure for testing whether the explained variation is large compared to the unexplained variation. This is also called the ANOVA (analysis of variance) test because the elements for calculating the required F-value are shown in an ANOVA table for regression.

The ANOVA table splits the total variation of the Y variable,
  SST = Σ(Yi − Ȳ)²,
into the part unexplained by the regression equation,
  SSE = Σ(Yi − Ŷi)²,
and the part that is explained,
  SSR = Σ(Ŷi − Ȳ)².
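The split of the total variation into explained and unexplained parts can be verified numerically. A sketch with simulated data (the sample size, number of explanatory variables, and coefficients are all made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 2                       # hypothetical sample size and number of Xs
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 1, n)

# Least-squares fit via the design matrix with an intercept column
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef

sst = np.sum((y - y.mean()) ** 2)       # total variation
sse = np.sum((y - fitted) ** 2)         # unexplained variation
ssr = np.sum((fitted - y.mean()) ** 2)  # explained variation

# The ANOVA identity: SST = SSR + SSE (up to rounding)
print(sst, ssr + sse)

# The F-ratio: MSR / MSE, with k and n - k - 1 degrees of freedom
f_ratio = (ssr / k) / (sse / (n - k - 1))
print(f_ratio)
```

Here the explanatory variables genuinely drive Y, so the F-ratio is large and the null hypothesis of no explanatory power would be rejected.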
A Test for the Overall Fit: The ANOVA Table (slide 3 of 3)

The required F-ratio for the test is
  F = MSR / MSE
where
  MSR = SSR / k  and  MSE = SSE / (n − k − 1).
- If the F-ratio is small, the explained variation is small relative to the unexplained variation, and there is evidence that the regression equation provides little explanatory value.
- The F-ratio has an associated p-value that allows you to run the test easily; it is reported in most regression outputs. Reject the null hypothesis—and conclude that the X variables have at least some explanatory value—if the F-value in the ANOVA table is large and the corresponding p-value is small.

Multicollinearity

Multicollinearity occurs when there is a fairly strong linear relationship among a set of explanatory variables.
- In this case, the relationship between an explanatory variable X and the dependent variable Y is not always accurately reflected in the coefficient of X; it depends on which other Xs are included or not included in the equation.
- There are various degrees of multicollinearity, but in each of them, there is a linear relationship between two or more explanatory variables.
- The symptoms of multicollinearity can be "wrong" signs of the coefficients, smaller-than-expected t-values, and larger-than-expected (insignificant) p-values.

Example 11.2: Heights Simulation.xlsx (slide 1 of 2)

Objective: To illustrate the problem of multicollinearity when both foot length variables are used in a regression for height.

Solution: The dependent variable is Height, and the explanatory variables are Right and Left, the lengths of the right foot and the left foot, respectively. Simulation is used to generate a hypothetical data set of heights and left and right foot lengths.
Height is approximately 31.8 plus 3.2 times foot length (all expressed in inches).

Example 11.2: Heights Simulation.xlsx (slide 2 of 2)

The regression output when both Right and Left are entered in the equation for Height appears at the bottom right of the figure below.

Include/Exclude Decisions

The t-values of regression coefficients can be used to make include/exclude decisions for explanatory variables in a regression equation.
- Finding the best Xs to include in a regression equation is the most difficult part of any real regression analysis.
- You are always trying to get the best fit possible, but the principle of parsimony suggests using the fewest variables. This presents a trade-off, for which there are not always easy answers.
- To help with this decision, several guidelines are presented on the next slide.

Guidelines for Including/Excluding Variables in a Regression Equation

1. Look at a variable's t-value and its associated p-value. If the p-value is above some accepted significance level, such as 0.05, this variable is a candidate for exclusion.
2. Check whether a variable's t-value is less than 1 or greater than 1 in magnitude. If it is less than 1, then it is a mathematical fact that se will decrease (and adjusted R2 will increase) if this variable is excluded from the equation.
3. Look at t-values and p-values, rather than correlations, when making include/exclude decisions.
   An explanatory variable can have a fairly high correlation with the dependent variable, but because of other variables included in the equation, it might not be needed.
4. When there is a group of variables that are in some sense logically related, it is sometimes a good idea to include all of them or exclude all of them.
5. Use economic and/or physical theory to decide whether to include or exclude variables, and put less reliance on t-values and/or p-values.

Example 11.3: Catalog Marketing.xlsx (slide 1 of 2)

Objective: To use multiple regression to see which potential explanatory variables are useful for explaining current-year spending amounts at HyTex.

Solution: The data file contains data on 1000 customers who purchased mail-order products from the HyTex Company. For each customer, data on several variables are included.
- Base the regression on the first 750 observations and use the other 250 for validation.
- Enter all of the potential explanatory variables. Then exclude unnecessary variables based on their t-values and p-values.
- Four variables, Age, Gender, Own Home, and Married, have p-values well above 0.05 and are obvious candidates for exclusion.
- Exclude variables one at a time, starting with the variable that has the highest p-value, and rerun the regression after each exclusion.

Example 11.3: Catalog Marketing.xlsx (slide 2 of 2)

The resulting output appears below.

Stepwise Regression

Many statistical packages assist with include/exclude decisions by providing automatic equation-building options.
These options estimate a series of regression equations by successively adding (or deleting) variables according to prescribed rules. Generically, these methods are referred to as stepwise regression. There are three types of equation-building procedures:
- Forward: begins with no explanatory variables in the equation and successively adds one at a time until no remaining variables make a significant contribution.
- Backward: begins with all potential explanatory variables in the equation and deletes them one at a time until further deletion would do more harm than good.
- Stepwise: much like the forward procedure, except that it also considers possible deletions along the way.

All of these procedures have the same basic objective—to find an equation with a small se and a large R2 (or adjusted R2).

Example 11.3 (Continued): Catalog Marketing.xlsx

Objective: To use StatTools's Stepwise Regression procedure to analyze the HyTex data.

Solution: Choose the forward, backward, or stepwise procedure from the Regression Type dropdown list in the Regression dialog box.
- Specify Amount Spent as the dependent variable and select all of the other variables (besides Customer) as potential explanatory variables.
- A sample of the stepwise output appears to the right. The variables that enter or exit the equation are listed at the bottom of the output.

Outliers (slide 1 of 2)

An observation can be considered an outlier for one or more of the following reasons:
- It has an extreme value for at least one variable.
- Its value of the dependent variable is much larger or smaller than predicted by the regression line, and its residual is abnormally large in magnitude.
  An example of this type of outlier is shown below.

Outliers (slide 2 of 2)

- Its residual is not only large in magnitude, but the point "tilts" the regression line toward it. This type of outlier is called an influential point. An example is shown below, on the left.
- Its values of individual explanatory variables are not extreme, but they fall outside the general pattern of the other observations. An example is shown below, on the right.

In most cases, the regression output will look "nicer" if you delete the outliers, but this is not necessarily appropriate.

Example 11.4: Bank Salaries.xlsx (slide 1 of 2)

Objective: To locate possible outliers in the bank salary data, and to see to what extent they affect the regression model.

Solution: Examine each variable for outliers, using box plots of the variables or scatterplots of the residuals versus the fitted values.

Example 11.4: Bank Salaries.xlsx (slide 2 of 2)

Then run the regression with and without the outlier. The output with the outlier included is shown on the top right; the output with the outlier excluded is shown on the bottom right.
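The with/without comparison used in Example 11.4 can be sketched with simulated data. This is not the bank salary data: the points and the influential outlier's coordinates are made up to show how a single point can tilt the fitted line.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 30)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 30)   # data following a clear line

# One influential point: extreme X, far below the pattern of the rest
x_out = np.append(x, 25.0)
y_out = np.append(y, 10.0)

slope_without = np.polyfit(x, y, 1)[0]
slope_with = np.polyfit(x_out, y_out, 1)[0]
# The single added point pulls the slope well below its value
# from the other 30 observations
print(slope_without, slope_with)
```

Comparing the two fitted slopes, as Example 11.4 does with its regression outputs, shows how strongly one influential point can distort the estimated equation.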
Violations of Regression Assumptions

There are three major issues related to violations of regression assumptions:
- How to detect violations of the assumptions. This is usually relatively easy, using scatterplots, histograms, time series graphs, and numerical measures.
- What goes wrong if the violations are ignored. This depends on the type of violation and its severity.
- What to do about violations if they are detected. This issue is the most difficult to resolve.

Nonconstant Error Variance

The second regression assumption—that the variance of the errors should be constant for all values of the explanatory variables—is almost always violated to some extent. Mild violations do not have much effect on the validity of the regression output.

One common form of nonconstant error variance that should be dealt with is the fan-shape phenomenon. It occurs when increases in a variable result in increases in variability. It can cause an incorrect value for the standard error of estimate, so that confidence intervals and hypothesis tests for the regression coefficients are not valid.

There are two ways to deal with it:
- Use a different estimation method than least squares, called weighted least squares.
- Use a logarithmic transformation of the dependent variable.

Nonnormality of Residuals

The third regression assumption states that the error terms are normally distributed. Check this assumption by forming a histogram of the residuals.
- Unless the distribution of the residuals is severely nonnormal, the inferences made from the regression output are still approximately valid.
- One form of nonnormality often encountered is skewness to the right.
  This can often be remedied by the same logarithmic transformation of the dependent variable that remedies nonconstant error variance.

Autocorrelated Residuals

The fourth regression assumption states that the error terms are probabilistically independent, but this assumption is often violated for time series data.
- The problem with time series data is that the residuals are often correlated with nearby residuals, a property called autocorrelation of residuals.
- The most frequent type of autocorrelation is positive autocorrelation. If residuals separated by one time period are correlated, this is called lag 1 autocorrelation.
- The Durbin-Watson (DW) statistic is a numerical measure used to check for lag 1 autocorrelation. A DW statistic below 2 signals that nearby residuals are positively correlated with one another.
- When the number of observations is about 30 and the number of explanatory variables is fairly small, any DW statistic less than 1.2 warrants attention.

Example 11.1 (Continued): Overhead Costs.xlsx

Objective: To use the Durbin-Watson statistic to check whether there is any lag 1 autocorrelation in the residuals from the Bendrix regression model for overhead costs.

Solution: Run the usual multiple regression and check the graph of residuals versus fitted values. Then check for lag 1 autocorrelation in two ways: with the DW statistic and by examining the time series graph of the residuals.
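The DW statistic itself is easy to compute from a residual series: it is the sum of squared successive differences of the residuals divided by the sum of the squared residuals. The sketch below uses simulated series rather than the Bendrix data to contrast independent residuals with positively autocorrelated ones.

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(5)

# Independent residuals: DW should be near 2
independent = rng.normal(size=500)

# Positively autocorrelated residuals: each value carries over
# part of the previous one, so DW falls well below 2
auto = np.empty(500)
auto[0] = rng.normal()
for t in range(1, 500):
    auto[t] = 0.8 * auto[t - 1] + rng.normal()

print(durbin_watson(independent))  # near 2
print(durbin_watson(auto))         # well below 2: positive autocorrelation
```

Plotting the `auto` series over time would also show the telltale pattern: long runs of residuals with the same sign.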
Prediction (slide 1 of 4)

Once you have estimated a regression equation from a set of data, you might want to use it to predict the value of the dependent variable for new observations. There are two types of prediction problems in regression:
1. Predicting the value of the dependent variable for one or more individual members of the population
2. Predicting the mean of the dependent variable for all members of the population with certain values of the explanatory variables

The second problem is inherently easier in the sense that the resulting prediction is bound to be more accurate.
- When you predict a mean, there is a single source of error: the possibly inaccurate estimates of the regression coefficients.
- When you predict an individual value, there are two sources of error: the inaccurate estimates of the regression coefficients and the inherent variation of individual points around the regression line.

Prediction (slide 2 of 4)

Predictions for values of the Xs close to their means are likely to be more accurate than predictions for Xs far from their means. Trying to predict for Xs beyond the range of the data set is called extrapolation, and it is quite risky.

Prediction (slide 3 of 4)

The point prediction, or best guess, is found by substituting the given values of the Xs into the estimated regression equation. To measure the accuracy of the point predictions, calculate standard errors of prediction.
- Standard error of prediction for a single Y: approximately equal to the standard error of estimate.
- Standard error of prediction for the mean Y: approximately equal to the standard error of estimate divided by the square root of the sample size.

Prediction (slide 4 of 4)

These standard errors can be used to calculate a 95% prediction interval for an individual value and a 95% confidence interval for a mean value. Go out a t-multiple of the relevant standard error on either side of the point prediction.

The term prediction interval (rather than confidence interval) is used for an individual value because an individual value of Y is not a population parameter. However, the interpretation is basically the same.

Example 11.1 (Continued): Overhead Costs.xlsx

Objective: To predict Overhead at Bendrix for the next three months, given anticipated values of Machine Hours and Production Runs.

Solution: Suppose Bendrix expects the values of Machine Hours for the next three months to be 1430, 1560, and 1520, and the values of Production Runs to be 35, 45, and 40, respectively.
- StatTools can provide predictions and 95% prediction intervals, but you must set up a second data set to capture the results. It should have the same variable name headings, and it should include the values of the explanatory variables to be used for prediction.
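The interval construction described above can be sketched for a simple regression using the approximations on the previous slides. Everything here is hypothetical: the data are simulated, the new X value is made up, and 2.02 is an approximate 95% t-multiple for n − 2 = 38 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.uniform(0, 10, n)
y = 3.0 + 1.5 * x + rng.normal(0, 2, n)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
se = np.sqrt(np.sum(resid ** 2) / (n - 2))  # standard error of estimate (k = 1)

x_new = 5.0
point = a + b * x_new    # point prediction for the new observation

t_mult = 2.02            # approximate 95% t-multiple for 38 df
# Individual Y: standard error of prediction is roughly se
pred_int = (point - t_mult * se, point + t_mult * se)
# Mean Y: standard error is roughly se / sqrt(n), so this interval is
# much narrower than the prediction interval
conf_int = (point - t_mult * se / np.sqrt(n), point + t_mult * se / np.sqrt(n))
print(pred_int, conf_int)
```

The prediction interval is wider than the confidence interval for the mean, reflecting the extra source of error when predicting an individual value.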