Regression Analysis
© 2007 Prentice Hall

Chapter Outline
1) Correlations
2) Bivariate Regression
3) Statistics Associated with Bivariate Regression
4) Conducting Bivariate Regression Analysis
   i. Scatter Diagram
   ii. Bivariate Regression Model
   iii. Estimation of Parameters
   iv. Standardized Regression Coefficient
   v. Significance Testing
   vi. Strength and Significance of Association
   vii. Assumptions
5) Multiple Regression
6) Statistics Associated with Multiple Regression
7) Conducting Multiple Regression
   i. Partial Regression Coefficients
   ii. Strength of Association
   iii. Significance Testing
8) Multicollinearity
9) Relative Importance of Predictors

Product Moment Correlation
• The product moment correlation, r, summarizes the strength of association between two metric (interval or ratio scaled) variables, say X and Y.
• It is an index used to determine whether a linear or straight-line relationship exists between X and Y.
• r varies between -1.0 and +1.0.
• The correlation coefficient between two variables will be the same regardless of their underlying units of measurement.

Explaining Attitude Toward the City of Residence
Table 17.1

Respondent No.   Attitude Toward the City   Duration of Residence   Importance Attached to Weather
      1                     6                         10                          3
      2                     9                         12                         11
      3                     8                         12                          4
      4                     3                          4                          1
      5                    10                         12                         11
      6                     4                          6                          1
      7                     5                          8                          7
      8                     2                          2                          4
      9                    11                         18                          8
     10                     9                          9                         10
     11                    10                         17                          8
     12                     2                          2                          5

Product Moment Correlation
• When it is computed for a population rather than a sample, the product moment correlation is denoted by ρ, the Greek letter rho. The coefficient r is an estimator of ρ.
• The statistical significance of the relationship between two variables measured by using r can be conveniently tested. The hypotheses are:
  H0: ρ = 0
  H1: ρ ≠ 0

Significance of Correlation
• The test statistic is t = r √((n − 2)/(1 − r²)), which has a t distribution with n − 2 degrees of freedom.
• The correlation between 'Attitude toward the city' and 'Duration of residence' is 0.9361, so the value of the t statistic is 8.414.
• From the t table (Table 4 in the Statistical Appendix), the critical value of t for a two-tailed test, α = 0.05, and n − 2 = 10 degrees of freedom is 2.228.
• Hence, the null hypothesis of no relationship between X and Y is rejected.
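The correlation and its significance test can be reproduced directly from the Table 17.1 data. The snippet below is a minimal sketch in Python, assuming NumPy and SciPy are available; the array names are illustrative and not part of the original output.

```python
import numpy as np
from scipy import stats

# Attitude toward the city and duration of residence, Table 17.1
attitude = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])
duration = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])

# Product moment (Pearson) correlation and its two-tailed p-value
r, p_value = stats.pearsonr(duration, attitude)

# Equivalent t statistic with n - 2 degrees of freedom
n = len(attitude)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))
t_crit = stats.t.ppf(0.975, df=n - 2)  # two-tailed critical value, alpha = 0.05

print(f"r = {r:.4f}, t = {t_stat:.3f}, critical t = {t_crit:.3f}, p = {p_value:.4f}")
# Expected (approximately): r = 0.9361, t = 8.414, critical t = 2.228
```

Because the computed t of about 8.41 exceeds the critical value of 2.228, H0: ρ = 0 is rejected, matching the conclusion above.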
Regression Analysis
Regression analysis examines associative relationships between a metric dependent variable and one or more independent variables in the following ways:
• Determine whether the independent variables explain a significant variation in the dependent variable: whether a relationship exists.
• Determine how much of the variation in the dependent variable can be explained by the independent variables: strength of the relationship.
• Predict the values of the dependent variable.

Statistics Associated with Bivariate Regression Analysis
• Regression model. Yi = β0 + β1 Xi + ei, where Y = dependent variable, X = independent variable, β0 = intercept of the line, β1 = slope of the line, and ei is the error term for the i-th observation.
• Coefficient of determination, r². Measures the strength of association. Varies between 0 and 1 and signifies the proportion of the variation in Y accounted for by the variation in X.
• Estimated or predicted value. The predicted value of Yi is Ŷi = a + bXi, where Ŷi is the predicted value of Yi and a and b are estimators of β0 and β1.
• Regression coefficient. The estimated parameter b is usually referred to as the nonstandardized regression coefficient.
• Standard error of estimate. This statistic is the standard deviation of the actual Y values from the predicted Ŷ values.
• Standard error. The standard deviation of b, SEb, is called the standard error.
• Sum of squared errors. The distances of all the points from the regression line are squared and added together to arrive at the sum of squared errors, Σej², which is a measure of total error.
• t statistic. A t statistic can be used to test the null hypothesis that no linear relationship exists between X and Y.

Idea Behind Estimating the Regression Equation
• A scatter diagram, or scattergram, is a plot of the values of two variables.
• The most commonly used technique for fitting a straight line to a scattergram is the least-squares procedure.
• In fitting the line, the least-squares procedure minimizes the sum of squared errors, Σej².

Conducting Bivariate Regression Analysis
Plot the Scatter Diagram → Formulate the General Model → Estimate the Parameters → Estimate Regression Coefficients → Test for Significance → Determine the Strength and Significance of Association

Plot of Attitude with Duration (Fig. 17.3: scatter diagram of attitude toward the city against duration of residence for the Table 17.1 data.)

Which Straight Line Is Best? (Fig. 17.4: several candidate lines, Lines 1–4, drawn through the same scatter of points.)

Decomposing the Total Variation (Fig. 17.6: the total variation in Y is split into explained variation, SSreg, and residual variation, SSres.)

Decomposing the Total Variation
The total variation, SSy, may be decomposed into the variation accounted for by the regression line, SSreg, and the error or residual variation, SSerror or SSres, as follows:
  SSy = SSreg + SSres
where, summing over i = 1 to n,
  SSy   = Σ (Yi − Ȳ)²
  SSreg = Σ (Ŷi − Ȳ)²
  SSres = Σ (Yi − Ŷi)²

Strength and Significance of Association
The strength of association is:
  R² = SSreg / SSy
It answers the question: "What percentage of the total variation in Y is explained by X?"

Test for Significance
The statistical significance of the linear relationship between X and Y may be tested by examining the hypotheses:
  H0: β1 = 0
  H1: β1 ≠ 0
A t statistic can be used, where t = b/SEb. SEb denotes the standard deviation of b and is called the standard error.

Illustration of Bivariate Regression
• The regression of attitude on duration of residence, using the data shown in Table 17.1, yielded the results shown in Table 17.2: a = 1.0793, b = 0.5897.
• The estimated equation is: Attitude (Ŷ) = 1.0793 + 0.5897 (Duration of residence).
• The standard error, or standard deviation, of b is 0.07008, and t = 0.5897/0.07008 = 8.414.
• The p-value corresponding to the calculated t is 0.000. Since this is smaller than α = 0.05, the null hypothesis is rejected.
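These coefficients can be cross-checked by fitting the least-squares line to the same data. A minimal sketch, assuming SciPy is available; the arrays are the same as in the earlier correlation example.

```python
import numpy as np
from scipy import stats

attitude = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])
duration = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])

# Ordinary least-squares fit of attitude on duration of residence
fit = stats.linregress(duration, attitude)

a = fit.intercept      # estimate of beta_0
b = fit.slope          # estimate of beta_1 (nonstandardized regression coefficient)
se_b = fit.stderr      # standard error of b
t_stat = b / se_b      # t statistic for H0: beta_1 = 0

print(f"a = {a:.4f}, b = {b:.4f}, SEb = {se_b:.5f}, t = {t_stat:.3f}")
# Expected (approximately): a = 1.0793, b = 0.5897, SEb = 0.07008, t = 8.414
```

linregress also reports r and the two-tailed p-value for the slope, so the significance test described above comes out of the same call.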
Bivariate Regression
Table 17.2

Multiple R       0.93608
R²               0.87624
Adjusted R²      0.86387
Standard Error   1.22329

Analysis of Variance
              df    Sum of Squares    Mean Square
Regression     1       105.95222       105.95222
Residual      10        14.96444         1.49644
F = 70.80266, Significance of F = 0.0000

Variables in the Equation
Variable       b         SEb      Beta (ß)      T      Significance of T
Duration     0.58972   0.07008    0.93608     8.414         0.0000
(Constant)   1.07932   0.74335                1.452         0.1772

Strength and Significance of Association
• The predicted values (Ŷ) can be calculated using Attitude (Ŷ) = 1.0793 + 0.5897 (Duration of residence).
• For the first observation in Table 17.1, this value is Ŷ = 1.0793 + 0.5897 × 10 = 6.9763, and the same calculation can be made for every observation.
• Using these predicted values, SSreg = 105.9524 and SSres = 14.9644, so R² = 105.95/(105.95 + 14.96) = 0.8762.

Strength and Significance of Association
Another, equivalent test for examining the significance of the linear relationship between X and Y (the significance of b) is the test for the significance of the coefficient of determination. The hypotheses in this case are:
  H0: R²pop = 0
  H1: R²pop > 0
• The appropriate test statistic is the F statistic, which has an F distribution with 1 and n − 2 degrees of freedom.
• The p-value corresponding to the F statistic is 0.0000. Therefore, the relationship is significant at the α = 0.05 level, corroborating the results of the t test (in the bivariate case F = t²: 8.414² ≈ 70.80).

Assumptions
• The error term is normally distributed.
• The mean of the error term is 0.
• The variance of the error term is constant. This variance does not depend on the values assumed by X.
• The error terms are uncorrelated. In other words, the observations have been drawn independently.
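The variance decomposition and a rough check of these assumptions can both be carried out on the residuals of the fitted line. A minimal sketch, assuming NumPy and SciPy; the Shapiro-Wilk test is used here only as one illustrative way to probe the normality assumption and is not prescribed by the chapter.

```python
import numpy as np
from scipy import stats

attitude = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])
duration = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])
n = len(attitude)

# Refit the least-squares line (equivalent to a = 1.0793, b = 0.5897 in Table 17.2)
b, a = np.polyfit(duration, attitude, 1)
y_hat = a + b * duration
residuals = attitude - y_hat

# Decompose the total variation: SSy = SSreg + SSres
ss_y = np.sum((attitude - attitude.mean()) ** 2)
ss_reg = np.sum((y_hat - attitude.mean()) ** 2)
ss_res = np.sum(residuals ** 2)
r_squared = ss_reg / ss_y                   # approx. 0.8762
f_stat = (ss_reg / 1) / (ss_res / (n - 2))  # approx. 70.80, with 1 and 10 df

# Rough checks on the error-term assumptions
mean_resid = residuals.mean()                       # close to 0 by construction
shapiro_stat, shapiro_p = stats.shapiro(residuals)  # normality check

print(f"SSreg = {ss_reg:.4f}, SSres = {ss_res:.4f}, R^2 = {r_squared:.4f}, F = {f_stat:.2f}")
print(f"mean residual = {mean_resid:.2e}, Shapiro-Wilk p = {shapiro_p:.3f}")
```

A plot of the residuals against duration (or against Ŷ) would be the usual visual check for constant variance and uncorrelated errors.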
Multiple Regression
The general form of the multiple regression model is as follows:
  Y = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk + e
which is estimated by the following equation:
  Ŷ = a + b1X1 + b2X2 + b3X3 + ... + bkXk
As before, the coefficient a represents the intercept, but the b's are now the partial regression coefficients.

Statistics Associated with Multiple Regression
• Coefficient of multiple determination. The strength of association is measured by R².
• Adjusted R². R², the coefficient of multiple determination, is adjusted for the number of independent variables and the sample size.
• F test. The F test is used to test the null hypothesis that the coefficient of multiple determination in the population, R²pop, is zero. The test statistic has an F distribution.

The Multiple Regression Equation
• For the data in Table 17.1, suppose we want to explain 'Attitude Toward the City' by 'Duration of Residence' and 'Importance Attached to Weather'.
• From Table 17.3, the estimated regression equation is:
  Ŷ = 0.33732 + 0.48108 X1 + 0.28865 X2
or
  Attitude = 0.33732 + 0.48108 (Duration) + 0.28865 (Importance)

Multiple Regression
Table 17.3

Multiple R       0.97210
R²               0.94498
Adjusted R²      0.93276
Standard Error   0.85974

Analysis of Variance
              df    Sum of Squares    Mean Square
Regression     2       114.26425        57.13213
Residual       9         6.65241         0.73916
F = 77.29364, Significance of F = 0.0000

Variables in the Equation
Variable        b         SEb      Beta (ß)      T      Significance of T
Importance    0.28865   0.08608    0.31382     3.353         0.0085
Duration      0.48108   0.05895    0.76363     8.160         0.0000
(Constant)    0.33732   0.56736                0.595         0.5668

Strength of Association
• The strength of association is measured by R², which is interpreted as in the bivariate case.
• R² adjusted for the number of independent variables and the sample size is called adjusted R².

Conducting Multiple Regression Analysis: Significance Testing
• H0: R²pop = 0. This is equivalent to the following null hypothesis:
  H0: β1 = β2 = β3 = ... = βk = 0
• The overall test (for all βi's collectively) can be conducted by using an F statistic, which has an F distribution.
• Testing for the significance of the individual βi's can be done in a manner similar to that in the bivariate case by using t tests.

Multicollinearity
• Multicollinearity arises when intercorrelations among the predictors are very high.
• Multicollinearity can result in several problems, including:
  - The partial regression coefficients may not be estimated precisely; the standard errors are likely to be high.
  - It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Relative Importance of Predictors
• Statistical significance. If the partial regression coefficient of a variable is not significant, that variable is judged to be unimportant.
• Measures based on standardized coefficients or beta weights. The most commonly used measures are the absolute values of the beta weights, |Bi|, or the squared values, Bi².
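To tie the multiple regression estimates, the beta weights, and a simple multicollinearity check together, the two-predictor model from Table 17.3 can be estimated as sketched below. This assumes the statsmodels package is available; the zscore helper and variable names are illustrative, the standardized (beta) coefficients are obtained by z-scoring all variables, and the predictor intercorrelation serves as a basic multicollinearity screen.

```python
import numpy as np
import statsmodels.api as sm

attitude   = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])
duration   = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])
importance = np.array([3, 11, 4, 1, 11, 1, 7, 4, 8, 10, 8, 5])

# Multiple regression of attitude on duration and importance of weather
X = sm.add_constant(np.column_stack([duration, importance]))
model = sm.OLS(attitude, X).fit()
print(model.params)                        # approx. [0.3373, 0.4811, 0.2887] = (a, b_duration, b_importance)
print(model.rsquared, model.rsquared_adj)  # approx. 0.9450, 0.9328 (cf. Table 17.3)
print(model.fvalue)                        # approx. 77.29

# Standardized (beta) coefficients for judging relative importance
def zscore(x):
    return (x - x.mean()) / x.std(ddof=1)

Z = sm.add_constant(np.column_stack([zscore(duration), zscore(importance)]))
betas = sm.OLS(zscore(attitude), Z).fit().params[1:]
print(betas)                               # approx. [0.7636, 0.3138], the Beta column of Table 17.3

# Simple multicollinearity screen: correlation between the two predictors
print(np.corrcoef(duration, importance)[0, 1])
```

When the predictor intercorrelation printed in the last line is high, the usual follow-ups are variance inflation factors or dropping/combining the offending variables; with only a moderate correlation, the partial regression coefficients and beta weights can be interpreted with more confidence.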