Quantitative Business Analysis for Decision Making

Multiple Linear Regression Analysis

Outline
– Multiple Regression Model
– Estimation
– Testing Significance of Predictors
– Multicollinearity
– Selection of Predictors
– Diagnostic Plots

403.8

Multiple Regression Model

Multiple linear regression model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

$\beta_1, \beta_2, \dots, \beta_k$ are the slope coefficients of X1, X2, …, Xk.

$\beta_i$ quantifies the amount of change in the response Y for a unit change in Xi when all other predictors are held fixed.

Multiple Regression Model (cont'd)

In the model, $\mu_y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is the mean of Y.

The error term $\varepsilon$
– contributes to the variation in Y values from their mean $\mu_y$, and
– is assumed normally distributed with mean 0 and standard deviation $\sigma$.

Sampling

A random sample of n units is taken. Then for each unit, k+1 measurements are made: Y, X1, X2, …, Xk.

A multivariate sample of size n:

Unit   Response Y   Predictor X1   Predictor X2   ...   Predictor Xk
1      Y1           X11            X12            ...   X1k
2      Y2           X21            X22            ...   X2k
...
n      Yn           Xn1            Xn2            ...   Xnk

Estimated Model

The estimated multiple regression model is:

$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

Expressions for the $b_i$ are cumbersome to write. $\hat{Y}$ is an estimate of $\mu_y$.

Standard Error

The sample standard deviation around the mean (the estimated regression model) is:

$s = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - k - 1}}$

It is an estimate of $\sigma$.

The standard error of $\hat{Y}$ (for specified values of the predictors) is denoted by $s_{\hat{y}}$.

Testing Significance of a Predictor

For comparing $\beta_i$ with a reference value $\beta_{i0}$, the test statistic is:

$t = \dfrac{b_i - \beta_{i0}}{s_{b_i}}$

and for estimating $\beta_i$ by a confidence interval, compute

$b_i \pm t \, s_{b_i}$

where $s_{b_i}$ is the standard error of $b_i$ and $t$ is the critical value of the t-distribution with $n - k - 1$ degrees of freedom.

Coefficient of Determination

The coefficient of determination $R^2$ quantifies the percentage of variation in the Y-distribution that is accounted for by the predictors in the model.
– If $R^2$ = 80%, then 20% of the variation in the Y-distribution is due to factors other than those in the model.
– $R^2$ never decreases as predictors are added to the model, but each added predictor makes the model more complicated.
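The estimation, standard error, t statistic, and $R^2$ described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the course: the function names (`solve`, `fit_multiple_regression`, `t_statistic`) and the small data set are hypothetical, and the coefficients are obtained from the standard least-squares normal equations $(X^T X)\,b = X^T Y$, which the slides do not write out.

```python
import math

def solve(a, rhs):
    """Solve the square system a x = rhs by Gauss-Jordan elimination."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple_regression(xs, ys):
    """Return (b, se_b, s, r2) for the least-squares fit of ys on the rows of xs."""
    n, k = len(ys), len(xs[0])
    if n <= k + 1:
        raise ValueError("need more than k + 1 observations")
    design = [[1.0] + [float(v) for v in row] for row in xs]  # leading 1 for b0
    p = k + 1
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    b = solve(xtx, xty)                                   # b0, b1, ..., bk
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in design]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # sum of (Y - Yhat)^2
    y_bar = sum(ys) / n
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation in Y
    s = math.sqrt(sse / (n - k - 1))                      # estimate of sigma
    r2 = 1.0 - sse / sst
    # s_bi = s * sqrt(i-th diagonal entry of the inverse of X'X)
    se_b = []
    for i in range(p):
        unit = [1.0 if j == i else 0.0 for j in range(p)]
        se_b.append(s * math.sqrt(solve(xtx, unit)[i]))
    return b, se_b, s, r2

def t_statistic(b_i, se_b_i, beta_i0=0.0):
    """t = (b_i - beta_i0) / s_bi, compared with t on n - k - 1 degrees of freedom."""
    return (b_i - beta_i0) / se_b_i

# Hypothetical data: n = 6 units, k = 2 predictors (made up for illustration).
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [9.2, 7.9, 19.1, 17.8, 29.3, 27.8]

b, se_b, s, r2 = fit_multiple_regression(xs, ys)
print("coefficients b0..bk:", b)
print("s:", s, "R^2:", r2)
print("t for b1 against 0:", t_statistic(b[1], se_b[1]))
```

Solving the normal equations directly keeps the sketch dependency-free; for real data a statistics package would normally be used, since it is numerically more robust and also reports p-values.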
Testing the Model for Significance

Null hypothesis: the predictors in the relationship have no predictive power to explain the variation in the Y-distribution.

$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ vs. $H_1$: at least one $\beta_i \neq 0$

Test statistic:

$F = \dfrac{(n - k - 1)\, R^2}{k\,(1 - R^2)}$

It has an F-distribution with k and (n - k - 1) degrees of freedom for the numerator and denominator, respectively.

Multicollinearity and Selection of Predictors

Multicollinearity occurs when predictors are highly correlated among themselves. In its presence $R^2$ may be high, but individual coefficients are less reliable.

A screening process (e.g. stepwise regression) can reduce multicollinearity by selecting only those predictors that are not strongly correlated among themselves.

Diagnostic Plots

Residuals $e_i = Y_i - \hat{Y}_i$ are used to diagnose the validity of the model assumptions.

A scatter plot of the residuals $e_i$ against the predicted values $\hat{Y}_i$ can serve as a diagnostic tool.

A diagnostic plot can identify outliers, unequal variability, the need for a transformation to achieve homogeneity of variance, etc.

Indicator Variables

Indicator variables (also called dummy variables) are numerical codes that are used to represent qualitative variables. For example, 0 for men and 1 for women.

For a qualitative variable with c categories, (c - 1) indicator variables need to be defined.
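Two of the rules above lend themselves to a short worked sketch in Python: the overall F statistic computed from $R^2$, and the (c - 1) coding of a qualitative variable. The function names `f_statistic` and `indicator_columns`, the choice of a baseline category, and the example numbers are assumptions made for illustration; they do not come from the slides.

```python
def f_statistic(r2, n, k):
    """F = (n - k - 1) * R^2 / (k * (1 - R^2)), with k and n - k - 1 degrees of freedom."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    if k < 1 or n - k - 1 < 1:
        raise ValueError("need k >= 1 and n > k + 1")
    return (n - k - 1) * r2 / (k * (1.0 - r2))

def indicator_columns(values, baseline=None):
    """Code a qualitative variable with c categories as c - 1 columns of 0/1.

    The baseline category is coded as all zeros. If no baseline is given,
    the first category in sorted order is used (an arbitrary choice).
    """
    categories = sorted(set(values))
    if baseline is None:
        baseline = categories[0]
    if baseline not in categories:
        raise ValueError("baseline must be one of the observed categories")
    levels = [c for c in categories if c != baseline]
    rows = [[1 if v == c else 0 for c in levels] for v in values]
    return levels, rows

# Hypothetical numbers: R^2 = 0.80 from n = 23 units and k = 2 predictors.
# F = (20 * 0.80) / (2 * 0.20), which is approximately 40.
print(f_statistic(0.80, n=23, k=2))

# Two categories need one indicator: 0 for men, 1 for women.
print(indicator_columns(["men", "women", "women", "men"], baseline="men"))

# Three categories need two indicators.
print(indicator_columns(["north", "south", "west", "south"]))
```

The computed F value would then be compared with a critical value of the F-distribution with k and n - k - 1 degrees of freedom. Dropping one category as the baseline is what keeps the indicator columns from being perfectly collinear with the intercept.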