Chapter 05: Regression Models
IIMT3636, Faculty of Business and Economics, The University of Hong Kong
Instructor: Dr. Wei ZHANG

2 Introduction
• When data is available, how can we understand the underlying relationship between
  ◦ Education and income?
  ◦ Advertising expense and sales volume?
  ◦ Number of policemen and the crime rate in a region?
• If we know the education level of a man, how can we predict his future income level? (This is about correlation.)
• If the number of policemen is reduced by half, how will the crime rate change? (This is about causation.)

3 Introduction
• Regression analysis helps us (i) understand the relationship between variables and (ii) predict the value of one variable based on the others.
• Linear regression: Y = β0 + β1·X + ε

  Y                      X
  Dependent variable     Independent variable
  Explained variable     Explanatory variable
  Response variable      Control variable
  Predicted variable     Predictor variable
  Regressand             Regressor

• The error term ε: the part of Y that cannot be predicted by X.

4 Scatter Diagrams
• A graphical presentation of the data
  ◦ The independent variable is plotted on the horizontal axis
  ◦ The dependent variable is plotted on the vertical axis
• Triple A Construction data:

  TRIPLE A'S SALES ($100,000s)   LOCAL PAYROLL ($100,000,000s)
  6                              3
  8                              4
  9                              6
  5                              4
  4.5                            2
  9.5                            5

• Hidden relationship: higher payroll predicts higher sales, Y = β0 + β1·X + ε
[Figure: scatter plot of Sales ($100,000) against Payroll ($100 million). Which line best represents the true relationship?]

5 Simple Linear Regression
• What are the best estimates of β0 and β1?
  ◦ b0 = estimate of β0
  ◦ b1 = estimate of β1
• Once we have b0 and b1, then given X (payroll) we can predict Y (sales): Ŷ = b0 + b1·X
• The chosen line will in some way minimize the "errors":
  Error = Actual value − Predicted value, i.e., e = Y − Ŷ
• Objective: to minimize the sum of e².
6 Simple Linear Regression
• The following formulas can be used to compute the "best" intercept and slope:
  X̄ = ΣX/n = average (mean) of the X values
  Ȳ = ΣY/n = average (mean) of the Y values
  b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
  b0 = Ȳ − b1·X̄

7 Simple Linear Regression
• Triple A Construction:

  Y     X    (X − X̄)²        (X − X̄)(Y − Ȳ)
  6     3    (3 − 4)² = 1     (3 − 4)(6 − 7) = 1
  8     4    (4 − 4)² = 0     (4 − 4)(8 − 7) = 0
  9     6    (6 − 4)² = 4     (6 − 4)(9 − 7) = 4
  5     4    (4 − 4)² = 0     (4 − 4)(5 − 7) = 0
  4.5   2    (2 − 4)² = 4     (2 − 4)(4.5 − 7) = 5
  9.5   5    (5 − 4)² = 1     (5 − 4)(9.5 − 7) = 2.5

  Ȳ = ΣY/6 = 7, X̄ = ΣX/6 = 4, Σ(X − X̄)² = 10, Σ(X − X̄)(Y − Ȳ) = 12.5

• b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 12.5/10 = 1.25
• b0 = Ȳ − b1·X̄ = 7 − 5 = 2
• Ŷ = 2 + 1.25X, i.e., predicted Sales = 2 + 1.25 × Payroll

8 Simple Linear Regression
• Discussion: what is the real logic behind the relationship predicted Sales = 2 + 1.25 × Payroll?
• Payroll is associated with sales via two channels:
  ◦ First, payroll means income. People may want to renovate their homes when they are richer.
  ◦ Second, payroll is correlated with the economic condition. There are more realty transactions and more demand for renovation when the economy is better.
• 1.25 is the aggregate effect. To rule out the impact of the economy, we need to include it as a control variable.

9 The Fit of the Regression Model
• How good or effective is the estimated model? How well does the model "fit" the data?
• One way to evaluate the effectiveness is to compare the predictions with a simple benchmark model: the average of Y.
• Define:
  ◦ The sum of squares total: SST = Σ(Y − Ȳ)²
  ◦ The sum of squares error: SSE = Σ(Y − Ŷ)²
• SSE/SST measures the relative effectiveness of the regression model as compared to the benchmark model.
• An identity: SSR = SST − SSE = Σ(Ŷ − Ȳ)².
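The slide-6 formulas and the slide-7 hand computation can be checked with a short script. A minimal sketch in Python using the Triple A data (variable names are my own):

```python
# Least-squares slope and intercept via the slide-6 formulas,
# using the Triple A Construction data (X = payroll, Y = sales).
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]

n = len(X)
x_bar = sum(X) / n          # mean of X = 4
y_bar = sum(Y) / n          # mean of Y = 7

# b1 = sum (X - x_bar)(Y - y_bar) / sum (X - x_bar)^2
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # 12.5
den = sum((x - x_bar) ** 2 for x in X)                      # 10
b1 = num / den              # 12.5 / 10 = 1.25
b0 = y_bar - b1 * x_bar     # 7 - 1.25 * 4 = 2

print(b0, b1)               # 2.0 1.25

def predict(x):
    """Predicted sales for a given payroll."""
    return b0 + b1 * x

print(predict(6))           # 9.5
```

The script reproduces the slide's line, Ŷ = 2 + 1.25X, directly from the data.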
10 The Fit of the Regression Model
[Figure: the fitted line Ŷ = 2 + 1.25X on the sales–payroll scatter plot, with the deviations Y − Ŷ, Ŷ − Ȳ, and Y − Ȳ marked for one point.]

11 Coefficient of Determination
• Coefficient of determination (the so-called R squared):
  r² = 1 − SSE/SST = SSR/SST
• It is the proportion of the variability in Y explained by the regression model.
• For Triple A Construction, r² = 0.6944, which means about 69% of the variation in sales is captured by the regression model based on payroll.
• r² can range from 0 to 1. An r² greater than 0.5 is considered very good in practice.

12 Correlation Coefficient
• This measure, r, expresses the degree of linear relationship in the data: r = ±√r²
• It is positive if b1 > 0 and negative if b1 < 0.
• r can range between and including −1 and +1.
• For Triple A Construction, r = √0.6944 = 0.8333 — a strong, positive correlation.

13 Correlation Coefficient
[Figure: four scatter plots of Y against X — (a) perfect positive correlation, r = +1; (b) positive correlation, 0 < r < 1; (c) no correlation, r = 0; (d) perfect negative correlation, r = −1.]

14 Assumptions of the Regression Model
• When performing regression analysis, we often make the following assumptions about the random error ε:
  1. Errors are independent (random sampling)
  2. Errors are normally distributed
  3. Errors have a mean of zero
  4. Errors have a constant variance (homoscedasticity)
• A plot of the residuals (prediction errors) often highlights obvious violations of these assumptions.

15 Residual Plot
• When the assumptions are met, the errors are random and no discernible pattern is present.
[Figure: residuals (prediction errors) plotted against X, with no pattern.]

16 Residual Plot
• Non-constant variance.
[Figure: residuals plotted against X, with spread that changes with X.]

17 Residual Plot
• Nonlinear relationship.
[Figure: residuals plotted against X, following a curved pattern.]

18 Residual Plot
• Normality is violated.
[Figure: residuals plotted against X, with a non-normal distribution.]

19 Testing the Model for Significance
• The r² provides a measure of accuracy or "fit" of a regression model. However, when the sample size is small, it is possible to get a good fit purely by chance.
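The fit measures on slides 9–12 can be verified numerically. A minimal sketch (variable names are my own) for the Triple A data and the fitted line Ŷ = 2 + 1.25X:

```python
# SST, SSE, SSR, R-squared and r for the Triple A regression.
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25                      # fitted line from slide 7

y_bar = sum(Y) / len(Y)                 # benchmark model: the mean of Y
Y_hat = [b0 + b1 * x for x in X]        # regression predictions

SST = sum((y - y_bar) ** 2 for y in Y)                 # total: 22.5
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))    # unexplained: 6.875
SSR = SST - SSE                                        # explained: 15.625

r_squared = SSR / SST        # proportion of variation explained
r = r_squared ** 0.5         # positive root, because b1 > 0

print(round(r_squared, 4), round(r, 4))   # 0.6944 0.8333
```

This reproduces the slide-11 and slide-12 numbers: r² = 0.6944 and r = 0.8333.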
• E.g.:
[Figure: a small sample of Y against X that happens to line up well by chance.]
• To see whether a linear relationship exists (i.e., β1 ≠ 0), a statistical hypothesis test is performed.

20 Testing the Model for Significance
• Define the F-statistic as F = (SSR/k) / (SSE/(n − k − 1)), where
  ◦ n = number of observations
  ◦ k = number of independent variables
• F is large if the model is accurate and small otherwise.
  ◦ F is boosted for large n
  ◦ F is discounted for large k

21 Testing the Model for Significance
• Testing the model: Y = β0 + β1·X + ε
• Null hypothesis H0: β1 = 0
• Alternative hypothesis H1: β1 ≠ 0
• If H0 is true, then SST = Σ(Y − Ȳ)² and SSE = Σ(Y − Ŷ)² should be close. In other words, SSR = SST − SSE and the F-stat should be close to zero.

22 Testing the Model for Significance
• Given H0, the F-stat follows an F distribution with degrees of freedom (df1, df2):
  ◦ df1 = degrees of freedom for the numerator = k
  ◦ df2 = degrees of freedom for the denominator = n − k − 1
• F distribution: https://en.wikipedia.org/wiki/F-distribution
• Select the level of significance α and the threshold value F(α, df1, df2) such that P(F > F(α, df1, df2)) = α.
• Reject H0 if the F-stat > F(α, df1, df2).

23 Testing the Model for Significance
• Triple A Construction:
  H0: no linear relationship between sales and payroll
  H1: a linear relationship exists
• df1 = 1, df2 = 4, SSE = 6.875, SSR = 15.625
• F-stat = 9.09 > F(0.05, 1, 4) = 7.71, so the p-value = P(F > F-stat) < 0.05.
• The observed data is very unlikely if the null hypothesis is true!

24 Analysis of Variance (ANOVA) Table
• When software is used to develop a regression model, an ANOVA table is typically created that shows the observed significance level (p-value) for the F-stat, which can be compared to the level of significance (α) to make a decision.
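The slide-23 F test can be reproduced from SSR and SSE alone. A small sketch (the 7.71 critical value is read from the slide, not computed here):

```python
# F test for the Triple A regression: H0: beta1 = 0.
SSR, SSE = 15.625, 6.875   # sums of squares from slide 23
n, k = 6, 1                # observations, independent variables

MSR = SSR / k              # mean square regression: 15.625
MSE = SSE / (n - k - 1)    # mean square error: 6.875 / 4 = 1.71875
F_stat = MSR / MSE         # about 9.09

F_crit = 7.71              # F(0.05, 1, 4), taken from the slide's F table
print(round(F_stat, 2), F_stat > F_crit)   # 9.09 True -> reject H0
```

Since 9.09 exceeds the 5% critical value 7.71, H0 is rejected, matching the slide's conclusion.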
• The ANOVA table layout:

  SOURCE      DF         SS    MS                      F         SIGNIFICANCE
  Regression  k          SSR   MSR = SSR/k             MSR/MSE   P(F > MSR/MSE)
  Residual    n − k − 1  SSE   MSE = SSE/(n − k − 1)
  Total       n − 1      SST

25 Using Excel for Regression
• Open Chapter05_Regression.xlsx

26 Multiple Regression Analysis
• The model: Y = β0 + β1·X1 + β2·X2 + … + βk·Xk + ε, where
  Y = dependent variable
  Xi = ith independent variable
  β0 = intercept
  βi = coefficient of the ith independent variable
  k = number of independent variables
  ε = random error

27 Multiple Regression Analysis
• The estimated equation: Ŷ = b0 + b1·X1 + b2·X2 + … + bk·Xk, where
  Ŷ = predicted value of Y
  b0 = the estimate of the intercept β0
  bi = estimated coefficient of the ith variable
• The estimation procedure is more complex, but Excel is usually enough.

28 Jenny Wilson Realty
• JWR is a real estate firm in Alabama. Jenny wants to develop a model to determine a suggested listing price based on the size and age of a house.
• A sample of historical data includes the selling price (Y), the square footage (X1), the age (X2), and the condition (good, excellent, or mint).
• The model: Ŷ = b0 + b1·X1 + b2·X2
• Open Chapter05_Regression.xlsx

29 Jenny Wilson Realty
[Figure: Excel regression output for the Jenny Wilson Realty model.]

30 Evaluating Multiple Regression Models
• R squared: same as with simple linear regression
  ◦ R squared increases with the number of variables
  ◦ Use the adjusted R squared to correct for the number of variables:
    Adj. r² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
• F test: for the overall effectiveness of the model
  ◦ Null hypothesis: β1 = β2 = … = βk = 0

31 Evaluating Multiple Regression Models
• The t test is for the significance of a single variable:
  t-stat = b̂i / (standard error of b̂i)
  Given βi = 0, the t-stat follows Student's t distribution with n − k − 1 degrees of freedom.
• In Excel, the test is performed two-sided
  ◦ i.e., H0: βi = 0 and H1: βi ≠ 0
• Sometimes we need a one-sided test
  ◦ e.g., H0: βi ≤ 0 and H1: βi > 0
  ◦ The p-value computed by Excel should then be halved

32 The t Distribution
• It is symmetric.
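As slide 31 notes, Excel reports two-sided p-values, and a one-sided test halves them. A minimal sketch of that conversion, using the SF coefficient's numbers from the slide-46 Excel output as a worked example:

```python
# One-sided p-value from Excel's two-sided output (slide 31).
# Numbers from the Jenny Wilson slide-46 output for SF (X1).
t_stat = 2.147484
p_two_sided = 0.05288

# For H0: beta <= 0 vs H1: beta > 0, with a positive t-stat,
# the one-sided p-value is half the two-sided one.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(round(p_one_sided, 4))   # 0.0264
```

So a coefficient that just misses the 5% level two-sided (p = 0.0529) is significant at 5% in a one-sided test of a positive effect.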
• Reject H0: βi = 0 if the t-stat falls into either tail region.
• The p-value indicates whether the t-stat is more extreme than the α-threshold values ±t(α/2).
[Figure: symmetric t density with rejection regions beyond −t(α/2) and t(α/2).]

33 Binary or Dummy Variables
• Binary (or dummy, or indicator) variables are special variables created for qualitative data:
  ◦ Whether a person has a college degree
  ◦ Whether a purchase is made by a female customer
  ◦ Whether a call is from New York (or LA or SF)
• A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
• The number of dummy variables must equal one less than the number of categories of the qualitative variable.

34 Jenny Wilson Realty
• A better model can be developed if information about the condition of the property is included:
  X3 = 1 if the house is in good condition, 0 otherwise
  X4 = 1 if the house is in excellent condition, 0 otherwise
• Two dummy variables are used to describe the three categories of condition.
• No variable is needed for "mint" condition: if both X3 and X4 = 0, the house must be in mint condition.

35 Jenny Wilson Realty
[Figure: Excel output for the model with condition dummies.]
• The adjusted R squared is greatly improved!
• The high p-value for X4 does not mean there is no relationship. It means customers do not treat mint and excellent conditions very differently.

36 Multicollinearity
• When an independent variable is highly correlated with a combination of other independent variables, multicollinearity exists.
• The variables then contain duplicate information:
  ◦ Square footage, number of bedrooms, and number of bathrooms
  ◦ Dummies for good, excellent, and mint conditions
  ◦ US GDP per capita and the S&P 500 index
• When multicollinearity exists, the overall F test is still valid and the model is still useful for prediction, but the tests for individual coefficients are not reliable.
• Typically, a variable may appear to be insignificant when it is in fact significant.

37 Nonlinear Regression
• Sometimes the relationship is significantly nonlinear.
• The usual solution is to create a linear model that can describe the nonlinear relationship:
  ◦ Polynomial function: Ŷ = b0 + b1·X + b2·X²
  ◦ Exponential function, estimated via logarithms: Ŷ = b0·e^(b1·X), i.e., log Ŷ = log b0 + b1·X
[Figure: two scatter plots — a quadratic relationship and an exponential relationship.]

38 Colonel Motors
• The engineers want to use regression analysis to improve fuel efficiency.
• They have been asked to study the impact of weight on miles per gallon (MPG):

  MPG   WEIGHT (1,000 LBS.)      MPG   WEIGHT (1,000 LBS.)
  12    4.58                     20    3.18
  13    4.66                     23    2.68
  15    4.02                     24    2.65
  18    2.53                     33    1.70
  19    3.09                     36    1.95
  19    3.11                     42    1.92

39 Colonel Motors
• A linear model for the MPG data.
[Figure: Excel output for the linear model.]

40 Colonel Motors
• A good model with a high R squared and F-stat.

41 Colonel Motors
• A quadratic model for the MPG data.
[Figure: Excel output for the quadratic model.]

42 Colonel Motors
• A better model. But do not try to interpret the coefficients.

43 Nonlinear Regression
• When multiple variables are involved, the plot of a marginal relationship may not show the true pattern.
• The residual plot may reveal a nonlinear pattern.
  ◦ E.g., Chapter05_Regression.xlsx
[Figure: MPG plotted against weight, and the corresponding residuals plotted against weight, revealing a curved pattern.]

44 Cautions and Pitfalls
• Check whether the assumptions are met.
• Correlation does not mean causation.
• Multicollinearity makes the interpretation of coefficients problematic, but the model may still be good.
• Using a regression model beyond the range of X is questionable.
• The significance of the intercept is usually not important.
• A linear relationship may not be the best relationship.
• A nonlinear relationship can exist even if a linear one does not.
• A model with a significant relationship but a low R squared is of little practical value: the first-order effects are not captured.

45 In-class Exercises
• What can go wrong if X is correlated with the error?
• What is the null hypothesis for the F test?
• What is the meaning of R squared?
• Why do we need the adjusted R squared?
• What is an appropriate regression model for the relationship between firm output size, labor size, and capital size?
• How can we capture the impact of seasonality in a model?

Answers:
(1) When X is correlated with factors included in the error term, the estimated coefficient of X should NOT be interpreted as the marginal impact of X on Y; instead, it can only be interpreted as the expected difference in Y that is correlated with a unit difference in X when we compare two data points.
(2) The null hypothesis for the F test is that Y is not correlated with any independent variable.
(3) R squared measures the percentage of variation in Y that is explained by the model.
(4) The adjusted R squared corrects for the impact of k and n. As k increases, we need to discount R squared because a model with more parameters is more flexible and thus can better fit the data by default; as n increases, we need to boost R squared because it becomes less likely to get a high R squared just by chance.
(5) Take the log on both sides of the Cobb-Douglas model.
(6) Use 3 dummies.

46 In-class Exercises
Jenny Wilson Realty — SUMMARY OUTPUT

  Regression Statistics
  Multiple R          0.526894
  R Square            0.277617
  Adjusted R Square   0.217419
  Standard Error      34568.4
  Observations        14

  ANOVA
              df    SS         MS         F          Significance F
  Regression  1     5.51E+09   5.51E+09   4.611689   0.05288022
  Residual    12    1.43E+10   1.19E+09
  Total       13    1.99E+10

              Coeff.     St. Er.   t Stat     P-value    Lower 95%    Upper 95%
  Intercept   99704.31   31294.9   3.18596    0.007834   31518.5747   167890.04
  SF (X1)     28.68438   13.3572   2.147484   0.05288    −0.4184621   57.7872167
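Answer (6) above — seasonality is captured with 3 dummies for the 4 seasons, one fewer than the number of categories, as slide 33 prescribes. A minimal encoding sketch (the quarter labels and column names are my own, for illustration):

```python
# Encoding quarterly seasonality with 3 dummies (4 categories - 1),
# leaving Q4 as the baseline category.
quarters = ["Q1", "Q2", "Q3", "Q4", "Q1", "Q3"]  # hypothetical sample

rows = [
    {"Q1": int(q == "Q1"), "Q2": int(q == "Q2"), "Q3": int(q == "Q3")}
    for q in quarters
]

# A Q4 observation has all three dummies equal to 0,
# just as a "mint" house has X3 = X4 = 0 on slide 34.
print(rows[3])   # {'Q1': 0, 'Q2': 0, 'Q3': 0}
print(rows[0])   # {'Q1': 1, 'Q2': 0, 'Q3': 0}
```

Including a fourth dummy would duplicate the information in the other three (their sum plus the intercept), producing exactly the multicollinearity problem described on slide 36.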