Part I – MULTIVARIATE ANALYSIS
C2 Multiple Linear Regression I
© Angel A. Juan & Carles Serrat - UPC 2007/2008

1.2.1: The t-Distribution (quick review)
The t-distribution is basic to statistical inference. It is characterized by the degrees of freedom parameter, df; for a sample of size n, df = n - 1. Like the normal, the t-distribution is symmetric, but it is more likely to produce extreme observations. As the degrees of freedom increase, the t-distribution better approximates the standard normal.

1.2.2: The F-Distribution (quick review)
The F-distribution is basic to regression and analysis of variance. It has two parameters, the numerator and denominator degrees of freedom, m and n: F(m, n). Like the chi-square distribution, the F-distribution is skewed.

1.2.3: The Multiple Linear Regression Model
Multiple linear regression analysis studies how a dependent variable or response y is related to two or more independent variables or predictors; i.e., it enables us to consider more factors (and thus obtain better estimates) than simple linear regression.
Multiple Linear Regression Model (MLRM):
y = β0 + β1 x1 + β2 x2 + … + βp xp + ε
where x1, x2, …, xp are the independent variables, β0, β1, …, βp are the (unknown) parameters, and ε is the error term.
In general, the regression model parameters will not be known and must be estimated from sample data.
Estimated Multiple Regression Equation:
ŷ = b0 + b1 x1 + b2 x2 + … + bp xp
where ŷ is the estimate for y and b0, b1, …, bp are the parameters' estimates.
The least squares method uses sample data to provide the values of b0, b1, b2, …, bp that minimize the sum of squared residuals, i.e., the deviations between the observed values yi and the estimated ones ŷi:
min Σ (yi - ŷi)²
We will focus on how to interpret the computer outputs rather than on how to make the multiple regression calculations.

1.2.4: Single & Multiple Regress. (Minitab)
File: LOGISTICS.MTW
Stat > Regression > Regression…
Managers of a logistics company want to estimate the total daily travel time (Hours) for their drivers. In the simple regression of Hours on Km, at the 0.05 level of significance, the t value of 3.98 and its corresponding p-value of 0.004 indicate that the relationship is significant, i.e., we can reject H0: β1 = 0. The same conclusion is obtained from the F value of 15.81 and its associated p-value of 0.004. Thus, we can reject H0: ρ = 0 and conclude that the relationship between Hours and Km is significant. 66.4% of the variability in Hours can be explained by the linear effect of Km.
In MR analysis, each regression coefficient represents an estimate of the change in y corresponding to a one-unit change in xi when all other independent variables are held constant; e.g., 0.0611 hours is an estimate of the expected increase in Hours corresponding to an increase of one Km when Deliveries is held constant.

1.2.5: The Coefficient of Determination (R²)
The total sum of squares (SST) can be partitioned into two components, the sum of squares due to regression (SSR) and the sum of squares due to error (SSE):
SST = Σ (yi - ȳ)²,  SSR = Σ (ŷi - ȳ)²,  SSE = Σ (yi - ŷi)²
With Km as the only predictor: SST = 23.900, SSR = 15.871, SSE = 8.029. With Km and Deliveries: SST = 23.900, SSR = 21.601, SSE = 2.299. The value of SST is the same in both cases, but SSR increases and SSE decreases when the second independent variable is added: the estimated MR equation provides a better fit for the observed data.
The multiple coefficient of determination, R² = SSR / SST, measures the goodness of fit for the estimated MR equation. R² can be interpreted as the proportion of the variability in y that can be explained by the estimated MR equation. R² always increases as new independent variables are added to the model. For this reason, many analysts prefer to adjust R² for the number of independent variables, to avoid overestimating the impact of adding new predictors: adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1).
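The quantities above (coefficient estimates, t and F statistics, the sums of squares and R²) can also be reproduced outside Minitab. The sketch below uses Python with pandas and statsmodels; the file name logistics.csv and the column names Hours, Km and Deliveries are assumptions standing in for an export of LOGISTICS.MTW, not part of the original material.

# Minimal sketch, assuming LOGISTICS.MTW has been exported to logistics.csv
# with columns Hours, Km and Deliveries (file and column names are assumed).
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("logistics.csv")              # hypothetical export of LOGISTICS.MTW
X = sm.add_constant(data[["Km", "Deliveries"]])  # predictors plus intercept column
y = data["Hours"]                                # response

model = sm.OLS(y, X).fit()                       # least squares estimates b0, b1, b2

print(model.params)                   # b0, b1, b2
print(model.tvalues, model.pvalues)   # individual t tests (H0: beta_i = 0)
print(model.fvalue, model.f_pvalue)   # overall F test
print("SST =", model.centered_tss)    # total sum of squares
print("SSR =", model.ess)             # sum of squares due to regression ("explained")
print("SSE =", model.ssr)             # sum of squares due to error (statsmodels calls it ssr)
print("R2 =", model.rsquared, "adj R2 =", model.rsquared_adj)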
1.2.6: Model Assumptions
General assumption: the form of the model is correct, i.e., it is appropriate to use a linear combination of the predictor variables to predict the value of the response variable.
Assumptions about the error term ε:
1. The error ε is a random variable with E[ε] = 0. This implies that, for given values of x1, x2, …, xp, the expected value of y is E[y] = β0 + β1x1 + β2x2 + … + βpxp.
2. Var[ε] = σ² and is constant for all values of the independent variables x1, x2, …, xp. This implies that Var[y] = σ² is also constant.
3. The values of ε are independent, i.e., the size of the error for a particular set of values of the predictors is not related to the size of the error for any other set of values.
4. ε is normally distributed and reflects the deviation between the observed values, yi, and the expected values, E[yi]. This implies that, for given values of x1, x2, …, xp, the response y is also normally distributed.
Any statement about hypothesis tests or confidence intervals requires that these assumptions be met. Since the assumptions involve the errors, and the residuals, yobserved - yestimated, are the estimated errors, it is important to look carefully at the residuals. The errors do not have to be perfectly normal, but you should worry if there are extreme values: outliers can have a major impact on the regression.

1.2.7: Testing for Significance
In SLR, the t test and the F test provide the same conclusion (same p-value). In MLR, the two tests have different purposes:
1. The F test is used to determine the overall significance, i.e., whether there is a significant relationship between the response and the set of all the predictors.
F test: H0: β1 = β2 = … = βp = 0; H1: one or more of the parameters is not zero.
Test statistic: F = MSReg / MSErr, where MSReg = SSReg / p, MSErr = SSErr / (n - p - 1), n = number of observations and p = number of predictors. MSErr provides an unbiased estimate of σ², the variance of the error term.
In the logistics example, since the p-value < α, a significant relationship is present between the response and the set of predictors.
2. If the F test shows overall significance, then separate t tests are conducted to determine the individual significance, i.e., whether each of the predictors is significant.
t test for the i-th parameter: H0: βi = 0; H1: βi ≠ 0. Test statistic: t = Coefi / SE Coefi.
In the example, since both p-values < α, both predictors are statistically significant.
If an individual parameter is not significant, the corresponding variable can be dropped from the model. However, if the t tests show that two or more parameters are not significant, only one variable should be dropped at a time: once one variable is dropped, a second variable that was not significant initially might become significant.

1.2.8: Multicollinearity
Multicollinearity means that some predictors are correlated with other predictors. Rule of thumb: multicollinearity can cause problems if the absolute value of the sample correlation coefficient exceeds 0.7 for any two independent variables.
Potential problems:
1. In t tests for individual significance, it is possible to erroneously conclude that none of the individual parameters is significantly different from zero even when an F test on the overall significance indicates a significant relationship.
2. It becomes difficult to separate the effects of the individual predictors on the response.
3. Severe cases could even result in parameter estimates (regression coefficients) with the wrong sign.
Possible solution: eliminate predictors from the model, especially if deleting them has little effect on R². A quick check of the pairwise correlations between the predictors is sketched below.
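As a rough multicollinearity check, the pairwise correlations between the predictors can be computed as in the sketch below (same assumed logistics.csv as before); the variance inflation factors printed at the end are a complementary diagnostic not covered in the slides.

# Sketch of a multicollinearity check (same hypothetical logistics.csv as above).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("logistics.csv")
predictors = data[["Km", "Deliveries"]]

# Rule of thumb from the slides: |r| > 0.7 between two predictors signals trouble.
print(predictors.corr())

# Variance inflation factors (complementary check): values much larger than 1
# indicate that a predictor is largely explained by the other predictors.
exog = sm.add_constant(predictors)
for i, name in enumerate(exog.columns):
    if name != "const":
        print(name, variance_inflation_factor(exog.values, i))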
Predictors are also called independent variables. This does not mean, however, that predictors themselves are statistically independent; on the contrary, predictors can be correlated with one another. If the individual t tests suggest that none of the predictors is significant while the F test is significant, does this mean that no predictor makes a significant contribution to determining the response? No: what it probably means is that, with some predictors already in the model, the contribution made by the others has already been included in the model (because of the correlation).

1.2.9: Estimation & Prediction
Given a set of values x1, x2, …, xp, we can use the estimated regression equation to:
a) Estimate the mean value of y associated with the given set of xi values
b) Predict the individual value of y associated with the given set of xi values
Both point estimates provide the same result: given x1, x2, …, xp, the point estimate of an individual value of y is the same as the point estimate of the expected value of y,
ŷ = Ê[y] = b0 + b1 x1 + b2 x2 + … + bp xp
E.g., given Km = 50 and Deliveries = 2, the point estimate for y = the point estimate for E[y] = 4.035 hours.
Point estimates do not provide any idea of the precision associated with an estimate. For that we must develop interval estimates:
Confidence interval: an interval estimate of the mean value of y for a given set of xi values.
Prediction interval: an interval estimate of an individual value of y corresponding to a given set of xi values.
Confidence intervals for E[y] are always more precise (narrower) than prediction intervals for y.

1.2.10: Qualitative Predictors
A regression equation can include qualitative predictors, such as gender (male, female), method of payment (cash, credit card, check), and so on. If a qualitative predictor has k levels, k - 1 dummy variables are required, each of them coded as 0 or 1.
File: REPAIRS.MTW
Repair time (Hours) is believed to be related to two factors: the number of months since the last maintenance service (Elapsed) and the type of repair problem (Type). Since Type has k = 3 levels (software, hardware and firmware), we had to add 2 dummy variables. We could try to eliminate non-significant predictors from the model and see what happens…

1.2.11: Residual Analysis: Plots (1/2)
The model assumptions about the error term, ε, provide the theoretical basis for the t test and the F test; if these assumptions appear questionable, the hypothesis tests about the significance of the regression relationship and the interval estimation results may not be valid. The residuals provide the best information about ε; hence an analysis of the residuals is an important step in determining whether the assumptions for ε are appropriate.
Residual for observation i: ri = yi - ŷi, the estimate of the error εi = yi - E[yi].
Standardized residual for observation i: sri = (ri - E[ri]) / s(ri) = ri / s(ri), where s(ri) is the standard deviation of residual i.
The standardized residuals are frequently used in residual plots and in the identification of outliers; a sketch of how they can be computed follows.
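A minimal sketch of how the fitted values, residuals and standardized residuals could be obtained (same assumed logistics.csv as above); the internally studentized residuals reported by statsmodels play the role of the standardized residuals discussed here.

# Sketch: raw and standardized residuals (same hypothetical logistics.csv).
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("logistics.csv")
X = sm.add_constant(data[["Km", "Deliveries"]])
model = sm.OLS(data["Hours"], X).fit()

resid = model.resid                                            # r_i = y_i - yhat_i
std_resid = model.get_influence().resid_studentized_internal   # r_i / s(r_i)

print(pd.DataFrame({"fitted": model.fittedvalues,
                    "residual": resid,
                    "std_residual": std_resid}))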
Residual plots:
1. Standardized residuals against predicted values: this plot should approximate a horizontal band of points centered around 0. A different pattern could suggest (i) that the model is not adequate, or (ii) a non-constant variance for ε.
2. Normal probability plot: this plot can be used to determine whether the distribution of ε appears to be normal.
(The slide shows two example patterns of the first plot: one where a linear model is not adequate and one with non-constant variance.)

1.2.11: Residual Analysis: Plots (2/2)
File: LOGISTICS.MTW
Stat > Regression > Regression…
(Plot: Residuals Versus the Fitted Values, response is Hours.) The plot of standardized residuals against predicted values should approximate a horizontal band of points centered around 0; in addition, the 68-95-99.7 rule should apply. In this case the plot does not indicate any unusual abnormalities, so it seems reasonable to assume constant variance. Also, all of the standardized residuals are between -2 and +2 (i.e., more than the expected 95% of the standardized residuals lie within ±2), so we have no reason to question the normality assumption.
(Plot: Normal Probability Plot of the Residuals, response is Hours.) The normal probability plot of the standardized residuals represents another approach for determining the validity of the assumption that the error term has a normal distribution. The plotted points should cluster closely around the 45-degree line; in general, the more closely the points are clustered about this line, the stronger the evidence supporting the normality assumption. Any substantial curvature in the plot is evidence that the residuals have not come from a normal distribution. In this case, we see that the points are grouped closely about the line, so we conclude that the assumption of a normally distributed error term is reasonable.

1.2.12: Residual Analysis: Influential Obs.
Outliers are observations with unusually large or small response or predictor values. It is important to identify outliers because they can be influential observations, i.e., they can significantly influence your model, providing potentially misleading or incorrect results. Apart from residual plots, there are several alternative ways to identify outliers (a sketch of how these measures can be computed appears below):
1. Leverages (HI) are a measure of the distance between the x-values for an observation and the mean of the x-values for all observations. Observations with HI > 3(p+1)/n may exert considerable influence on the fitted values, and thus on the regression model.
2. Cook's distance (D) is calculated using leverage values and standardized residuals. It considers whether an observation is unusual with respect to both x- and y-values. Geometrically, Cook's distance is a measure of the distance between the fitted values calculated with and without the observation. Observations with D > 1 are considered influential.
3. DFITS represents roughly the number of estimated standard deviations that the fitted value changes when the corresponding observation is removed from the data. Observations with |DFITS| > Sqrt[2(p+1)/n] are considered influential.
Minitab identifies observations with a large leverage value with an X in the table of unusual observations.
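The leverage, Cook's distance and DFITS diagnostics, together with the cut-offs quoted above, could be computed as in the following sketch (same assumed logistics.csv as above).

# Sketch: influence diagnostics (same hypothetical logistics.csv).
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("logistics.csv")
X = sm.add_constant(data[["Km", "Deliveries"]])
model = sm.OLS(data["Hours"], X).fit()

n, p = len(data), 2                      # n observations, p predictors
infl = model.get_influence()

leverage = infl.hat_matrix_diag          # HI values
cooks_d = infl.cooks_distance[0]         # Cook's distance D
dffits = infl.dffits[0]                  # DFITS values

print(pd.DataFrame({
    "leverage": leverage,
    "high_leverage": leverage > 3 * (p + 1) / n,
    "cooks_D": cooks_d,
    "influential_D": cooks_d > 1,
    "DFITS": dffits,
    "influential_DFITS": np.abs(dffits) > np.sqrt(2 * (p + 1) / n),
}))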
1.2.13: Autocorrelation (Serial Correlation)
Often, the data used for regression studies are collected over time. In such cases, autocorrelation or serial correlation can be present in the data. Autocorrelation of order k means that the value of y in time period t is related to its value in time period t-k. When autocorrelation is present, the assumption of independence is violated, and serious errors can be made in performing tests of statistical significance based upon the assumed regression model.
The Durbin-Watson test for autocorrelation uses the residuals to determine whether first-order autocorrelation is present in the data. The Durbin-Watson statistic, 0 <= DW <= 4, can be used to detect first-order autocorrelation:
If DW is close to 0: positive autocorrelation (successive values of the residuals are close together)
If DW is close to 4: negative autocorrelation (successive values of the residuals are far apart)
If DW is close to 2: no first-order autocorrelation is present
The test is useful when the sample size is at least 15. Including a predictor that measures the time of the observation or transforming the variables can help to reduce autocorrelation.
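If the rows of the data set are in time order, the Durbin-Watson statistic of the residuals could be obtained as in this sketch (same assumed logistics.csv; the time ordering of its rows is an extra assumption).

# Sketch: Durbin-Watson statistic for first-order autocorrelation
# (same hypothetical logistics.csv; rows assumed to be in time order).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

data = pd.read_csv("logistics.csv")
X = sm.add_constant(data[["Km", "Deliveries"]])
model = sm.OLS(data["Hours"], X).fit()

print("Durbin-Watson =", durbin_watson(model.resid))  # close to 2: no first-order autocorrelation,
                                                      # close to 0: positive, close to 4: negative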