Chapter 05: Regression Models
IIMT3636, Faculty of Business and Economics, The University of Hong Kong
Instructor: Dr. Wei ZHANG

2 Introduction
• When data is available, how can we understand the underlying relationship between
  ◦ Education and income?
  ◦ Advertising expense and sales volume?
  ◦ Number of policemen and the crime rate in a region?
• If we know the education level of a man, how can we predict his future income level? (This is about correlation.)
• If the number of policemen is reduced by half, how will the crime rate change? (This is about causation.)

3 Introduction
• Regression analysis helps us (i) understand the relationship between variables and (ii) predict the value of one variable based on the others.
• Linear regression: Y = β0 + β1·X + ε

  Y                      X
  Dependent variable     Independent variable
  Explained variable     Explanatory variable
  Response variable      Control variable
  Predicted variable     Predictor variable
  Regressand             Regressor

• The error term ε: the part of Y that cannot be predicted by X.

4 Scatter Diagrams
• A graphical presentation of the data
  ◦ The independent variable is plotted on the horizontal axis
  ◦ The dependent variable is plotted on the vertical axis
• Triple A Construction data:

  TRIPLE A'S SALES ($100,000s)   LOCAL PAYROLL ($100,000,000s)
  6                              3
  8                              4
  9                              6
  5                              4
  4.5                            2
  9.5                            5

• Hidden relationship: higher payroll predicts higher sales, Y = β0 + β1·X + ε
[Figure: scatter plot of Sales ($100,000) against Payroll ($100 million). Which line best represents the true relationship?]

5 Simple Linear Regression
• What are the best estimates of β0 and β1?
  ◦ b0 = estimate of β0
  ◦ b1 = estimate of β1
• Once we have b0 and b1, then given X (payroll) we can predict Y (sales): Ŷ = b0 + b1·X
• The chosen line will in some way minimize the "errors":
  Error = Actual value − Predicted value, i.e., e = Y − Ŷ
• Objective: to minimize the sum of e².
6 Simple Linear Regression
• The following formulas can be used to compute the "best" intercept and slope:
  X̄ = ΣX/n = average (mean) of the X values
  Ȳ = ΣY/n = average (mean) of the Y values
  b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²
  b0 = Ȳ − b1·X̄

7 Simple Linear Regression
• Triple A Construction:

  Y     X    (X − X̄)²        (X − X̄)(Y − Ȳ)
  6     3    (3 − 4)² = 1     (3 − 4)(6 − 7) = 1
  8     4    (4 − 4)² = 0     (4 − 4)(8 − 7) = 0
  9     6    (6 − 4)² = 4     (6 − 4)(9 − 7) = 4
  5     4    (4 − 4)² = 0     (4 − 4)(5 − 7) = 0
  4.5   2    (2 − 4)² = 4     (2 − 4)(4.5 − 7) = 5
  9.5   5    (5 − 4)² = 1     (5 − 4)(9.5 − 7) = 2.5

  Ȳ = ΣY/6 = 7, X̄ = ΣX/6 = 4, Σ(X − X̄)² = 10, Σ(X − X̄)(Y − Ȳ) = 12.5

• b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 12.5/10 = 1.25
• b0 = Ȳ − b1·X̄ = 7 − 5 = 2
• Ŷ = 2 + 1.25X, i.e., predicted Sales = 2 + 1.25 × Payroll

8 Simple Linear Regression
• Discussion: what is the real logic behind the relationship predicted Sales = 2 + 1.25 × Payroll?
• Payroll is associated with sales via two channels:
  ◦ First, payroll means income. People may want to renovate their homes when they are richer.
  ◦ Second, payroll is correlated with the economic condition. There are more realty transactions and more demand for renovation when the economy is better.
• 1.25 is the aggregate effect. To rule out the impact of the economy, we need to include it as a control variable.

9 The Fit of the Regression Model
• How good or effective is the estimated model? How well does the model "fit" the data?
• One way to evaluate the effectiveness is to compare the predictions with a simple benchmark model: the average of Y.
• Define:
  ◦ The sum of squares total: SST = Σ(Y − Ȳ)²
  ◦ The sum of squares error: SSE = Σ(Y − Ŷ)²
• SSE/SST measures the relative effectiveness of the regression model as compared to the benchmark model.
• An identity: SSR = SST − SSE = Σ(Ŷ − Ȳ)².
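The slide-6 formulas and the slide-7 hand computation can be checked with a short script. A minimal sketch in Python using the Triple A data (variable names are my own):

```python
# Least-squares slope and intercept via the slide-6 formulas,
# using the Triple A Construction data (X = payroll, Y = sales).
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]

n = len(X)
x_bar = sum(X) / n          # mean of X = 4
y_bar = sum(Y) / n          # mean of Y = 7

# b1 = sum (X - x_bar)(Y - y_bar) / sum (X - x_bar)^2
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # 12.5
den = sum((x - x_bar) ** 2 for x in X)                      # 10
b1 = num / den              # 12.5 / 10 = 1.25
b0 = y_bar - b1 * x_bar     # 7 - 1.25 * 4 = 2

print(b0, b1)               # 2.0 1.25

def predict(x):
    """Predicted sales for a given payroll."""
    return b0 + b1 * x

print(predict(6))           # 9.5
```

The script reproduces the slide's line, Ŷ = 2 + 1.25X, directly from the data.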
10 The Fit of the Regression Model
[Figure: the fitted line Ŷ = 2 + 1.25X on the sales–payroll scatter plot, with the deviations Y − Ŷ, Ŷ − Ȳ, and Y − Ȳ marked for one point.]

11 Coefficient of Determination
• Coefficient of determination (the so-called R squared):
  r² = 1 − SSE/SST = SSR/SST
• It is the proportion of the variability in Y explained by the regression model.
• For Triple A Construction, r² = 0.6944, which means about 69% of the variation in sales is captured by the regression model based on payroll.
• r² can range from 0 to 1. An r² greater than 0.5 is considered very good in practice.

12 Correlation Coefficient
• This measure, r, expresses the degree of linear relationship in the data: r = ±√r²
• It is positive if b1 > 0 and negative if b1 < 0.
• r can range between and including −1 and +1.
• For Triple A Construction, r = √0.6944 = 0.8333 — a strong, positive correlation.

13 Correlation Coefficient
[Figure: four scatter plots of Y against X — (a) perfect positive correlation, r = +1; (b) positive correlation, 0 < r < 1; (c) no correlation, r = 0; (d) perfect negative correlation, r = −1.]

14 Assumptions of the Regression Model
• When performing regression analysis, we often make the following assumptions about the random error ε:
  1. Errors are independent (random sampling)
  2. Errors are normally distributed
  3. Errors have a mean of zero
  4. Errors have a constant variance (homoscedasticity)
• A plot of the residuals (prediction errors) often highlights obvious violations of these assumptions.

15 Residual Plot
• When the assumptions are met, the errors are random and no discernible pattern is present.
[Figure: residuals (prediction errors) plotted against X, with no pattern.]

16 Residual Plot
• Non-constant variance.
[Figure: residuals plotted against X, with spread that changes with X.]

17 Residual Plot
• Nonlinear relationship.
[Figure: residuals plotted against X, following a curved pattern.]

18 Residual Plot
• Normality is violated.
[Figure: residuals plotted against X, with a non-normal distribution.]

19 Testing the Model for Significance
• The r² provides a measure of accuracy or "fit" of a regression model. However, when the sample size is small, it is possible to get a good fit purely by chance.
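The fit measures on slides 9–12 can be verified numerically. A minimal sketch (variable names are my own) for the Triple A data and the fitted line Ŷ = 2 + 1.25X:

```python
# SST, SSE, SSR, R-squared and r for the Triple A regression.
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25                      # fitted line from slide 7

y_bar = sum(Y) / len(Y)                 # benchmark model: the mean of Y
Y_hat = [b0 + b1 * x for x in X]        # regression predictions

SST = sum((y - y_bar) ** 2 for y in Y)                 # total: 22.5
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))    # unexplained: 6.875
SSR = SST - SSE                                        # explained: 15.625

r_squared = SSR / SST        # proportion of variation explained
r = r_squared ** 0.5         # positive root, because b1 > 0

print(round(r_squared, 4), round(r, 4))   # 0.6944 0.8333
```

This reproduces the slide-11 and slide-12 numbers: r² = 0.6944 and r = 0.8333.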
• E.g.:
[Figure: a small sample of Y against X that happens to line up well by chance.]
• To see whether a linear relationship exists (i.e., β1 ≠ 0), a statistical hypothesis test is performed.

20 Testing the Model for Significance
• Define the F-statistic as F = (SSR/k) / (SSE/(n − k − 1)), where
  ◦ n = number of observations
  ◦ k = number of independent variables
• F is large if the model is accurate and small otherwise.
  ◦ F is boosted for large n
  ◦ F is discounted for large k

21 Testing the Model for Significance
• Testing the model: Y = β0 + β1·X + ε
• Null hypothesis H0: β1 = 0
• Alternative hypothesis H1: β1 ≠ 0
• If H0 is true, then SST = Σ(Y − Ȳ)² and SSE = Σ(Y − Ŷ)² should be close. In other words, SSR = SST − SSE and the F-stat should be close to zero.

22 Testing the Model for Significance
• Given H0, the F-stat follows an F distribution with degrees of freedom (df1, df2):
  ◦ df1 = degrees of freedom for the numerator = k
  ◦ df2 = degrees of freedom for the denominator = n − k − 1
• F distribution: https://en.wikipedia.org/wiki/F-distribution
• Select the level of significance α and the threshold value F(α, df1, df2) such that P(F > F(α, df1, df2)) = α.
• Reject H0 if the F-stat > F(α, df1, df2).

23 Testing the Model for Significance
• Triple A Construction:
  H0: no linear relationship between sales and payroll
  H1: a linear relationship exists
• df1 = 1, df2 = 4, SSE = 6.875, SSR = 15.625
• F-stat = 9.09 > F(0.05, 1, 4) = 7.71, so the p-value = P(F > F-stat) < 0.05.
• The observed data is very unlikely if the null hypothesis is true!

24 Analysis of Variance (ANOVA) Table
• When software is used to develop a regression model, an ANOVA table is typically created that shows the observed significance level (p-value) for the F-stat, which can be compared to the level of significance (α) to make a decision.
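The slide-23 F test can be reproduced from SSR and SSE alone. A small sketch (the 7.71 critical value is read from the slide, not computed here):

```python
# F test for the Triple A regression: H0: beta1 = 0.
SSR, SSE = 15.625, 6.875   # sums of squares from slide 23
n, k = 6, 1                # observations, independent variables

MSR = SSR / k              # mean square regression: 15.625
MSE = SSE / (n - k - 1)    # mean square error: 6.875 / 4 = 1.71875
F_stat = MSR / MSE         # about 9.09

F_crit = 7.71              # F(0.05, 1, 4), taken from the slide's F table
print(round(F_stat, 2), F_stat > F_crit)   # 9.09 True -> reject H0
```

Since 9.09 exceeds the 5% critical value 7.71, H0 is rejected, matching the slide's conclusion.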
• The ANOVA table layout:

  SOURCE      DF         SS    MS                      F         SIGNIFICANCE
  Regression  k          SSR   MSR = SSR/k             MSR/MSE   P(F > MSR/MSE)
  Residual    n − k − 1  SSE   MSE = SSE/(n − k − 1)
  Total       n − 1      SST

25 Using Excel for Regression
• Open Chapter05_Regression.xlsx

26 Multiple Regression Analysis
• The model: Y = β0 + β1·X1 + β2·X2 + … + βk·Xk + ε, where
  Y = dependent variable
  Xi = ith independent variable
  β0 = intercept
  βi = coefficient of the ith independent variable
  k = number of independent variables
  ε = random error

27 Multiple Regression Analysis
• The estimated equation: Ŷ = b0 + b1·X1 + b2·X2 + … + bk·Xk, where
  Ŷ = predicted value of Y
  b0 = the estimate of the intercept β0
  bi = estimated coefficient of the ith variable
• The estimation procedure is more complex, but Excel is usually enough.

28 Jenny Wilson Realty
• JWR is a real estate firm in Alabama. Jenny wants to develop a model to determine a suggested listing price based on the size and age of a house.
• A sample of historical data includes the selling price (Y), the square footage (X1), the age (X2), and the condition (good, excellent, or mint).
• The model: Ŷ = b0 + b1·X1 + b2·X2
• Open Chapter05_Regression.xlsx

29 Jenny Wilson Realty
[Figure: Excel regression output for the Jenny Wilson Realty model.]

30 Evaluating Multiple Regression Models
• R squared: same as with simple linear regression
  ◦ R squared increases with the number of variables
  ◦ Use the adjusted R squared to correct for the number of variables:
    Adj. r² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
• F test: for the overall effectiveness of the model
  ◦ Null hypothesis: β1 = β2 = … = βk = 0

31 Evaluating Multiple Regression Models
• The t test is for the significance of a single variable:
  t-stat = b̂i / (standard error of b̂i)
  Given βi = 0, the t-stat follows Student's t distribution with n − k − 1 degrees of freedom.
• In Excel, the test is performed two-sided
  ◦ i.e., H0: βi = 0 and H1: βi ≠ 0
• Sometimes we need a one-sided test
  ◦ e.g., H0: βi ≤ 0 and H1: βi > 0
  ◦ The p-value computed by Excel should then be halved

32 The t Distribution
• It is symmetric.
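As slide 31 notes, Excel reports two-sided p-values, and a one-sided test halves them. A minimal sketch of that conversion, using the SF coefficient's numbers from the slide-46 Excel output as a worked example:

```python
# One-sided p-value from Excel's two-sided output (slide 31).
# Numbers from the Jenny Wilson slide-46 output for SF (X1).
t_stat = 2.147484
p_two_sided = 0.05288

# For H0: beta <= 0 vs H1: beta > 0, with a positive t-stat,
# the one-sided p-value is half the two-sided one.
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(round(p_one_sided, 4))   # 0.0264
```

So a coefficient that just misses the 5% level two-sided (p = 0.0529) is significant at 5% in a one-sided test of a positive effect.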
• Reject H0: βi = 0 if the t-stat falls into either tail region.
• The p-value indicates whether the t-stat is more extreme than the α-threshold values ±t(α/2).
[Figure: symmetric t density with rejection regions beyond −t(α/2) and t(α/2).]

33 Binary or Dummy Variables
• Binary (or dummy, or indicator) variables are special variables created for qualitative data:
  ◦ Whether a person has a college degree
  ◦ Whether a purchase is made by a female customer
  ◦ Whether a call is from New York (or LA or SF)
• A dummy variable is assigned a value of 1 if a particular condition is met and a value of 0 otherwise.
• The number of dummy variables must equal one less than the number of categories of the qualitative variable.

34 Jenny Wilson Realty
• A better model can be developed if information about the condition of the property is included:
  X3 = 1 if the house is in good condition, 0 otherwise
  X4 = 1 if the house is in excellent condition, 0 otherwise
• Two dummy variables are used to describe the three categories of condition.
• No variable is needed for "mint" condition: if both X3 and X4 = 0, the house must be in mint condition.

35 Jenny Wilson Realty
[Figure: Excel output for the model with condition dummies.]
• The adjusted R squared is greatly improved!
• The high p-value for X4 does not mean there is no relationship. It means customers do not treat mint and excellent conditions very differently.

36 Multicollinearity
• When an independent variable is highly correlated with a combination of other independent variables, multicollinearity exists.
• The variables then contain duplicate information:
  ◦ Square footage, number of bedrooms, and number of bathrooms
  ◦ Dummies for good, excellent, and mint conditions
  ◦ US GDP per capita and the S&P 500 index
• When multicollinearity exists, the overall F test is still valid and the model is still useful for prediction, but the tests for individual coefficients are not reliable.
• Typically, a variable may appear to be insignificant when it is in fact significant.

37 Nonlinear Regression
• Sometimes the relationship is significantly nonlinear.
• The usual solution is to create a linear model that can describe the nonlinear relationship:
  ◦ Polynomial function: Ŷ = b0 + b1·X + b2·X²
  ◦ Exponential function, estimated via logarithms: Ŷ = b0·e^(b1·X), i.e., log Ŷ = log b0 + b1·X
[Figure: two scatter plots — a quadratic relationship and an exponential relationship.]

38 Colonel Motors
• The engineers want to use regression analysis to improve fuel efficiency.
• They have been asked to study the impact of weight on miles per gallon (MPG):

  MPG   WEIGHT (1,000 LBS.)      MPG   WEIGHT (1,000 LBS.)
  12    4.58                     20    3.18
  13    4.66                     23    2.68
  15    4.02                     24    2.65
  18    2.53                     33    1.70
  19    3.09                     36    1.95
  19    3.11                     42    1.92

39 Colonel Motors
• A linear model for the MPG data.
[Figure: Excel output for the linear model.]

40 Colonel Motors
• A good model with a high R squared and F-stat.

41 Colonel Motors
• A quadratic model for the MPG data.
[Figure: Excel output for the quadratic model.]

42 Colonel Motors
• A better model. But do not try to interpret the coefficients.

43 Nonlinear Regression
• When multiple variables are involved, the plot of a marginal relationship may not show the true pattern.
• The residual plot may reveal a nonlinear pattern.
  ◦ E.g., Chapter05_Regression.xlsx
[Figure: MPG plotted against weight, and the corresponding residuals plotted against weight, revealing a curved pattern.]

44 Cautions and Pitfalls
• Check whether the assumptions are met.
• Correlation does not mean causation.
• Multicollinearity makes the interpretation of coefficients problematic, but the model may still be good.
• Using a regression model beyond the range of X is questionable.
• The significance of the intercept is usually not important.
• A linear relationship may not be the best relationship.
• A nonlinear relationship can exist even if a linear one does not.
• A model with a significant relationship but a low R squared is of little practical value: the first-order effects are not captured.

45 In-class Exercises
• What can go wrong if X is correlated with the error?
• What is the null hypothesis for the F test?
• What is the meaning of R squared?
• Why do we need the adjusted R squared?
• What is an appropriate regression model for the relationship between firm output size, labor size, and capital size?
• How can we capture the impact of seasonality in a model?

Answers:
(1) When X is correlated with factors included in the error term, the estimated coefficient of X should NOT be interpreted as the marginal impact of X on Y; instead, it can only be interpreted as the expected difference in Y that is correlated with a unit difference in X when we compare two data points.
(2) The null hypothesis for the F test is that Y is not correlated with any independent variable.
(3) R squared measures the percentage of variation in Y that is explained by the model.
(4) The adjusted R squared corrects for the impact of k and n. As k increases, we need to discount R squared because a model with more parameters is more flexible and thus can better fit the data by default; as n increases, we need to boost R squared because it becomes less likely to get a high R squared just by chance.
(5) Take the log on both sides of the Cobb-Douglas model.
(6) Use 3 dummies.

46 In-class Exercises
Jenny Wilson Realty — SUMMARY OUTPUT

  Regression Statistics
  Multiple R          0.526894
  R Square            0.277617
  Adjusted R Square   0.217419
  Standard Error      34568.4
  Observations        14

  ANOVA
              df    SS         MS         F          Significance F
  Regression  1     5.51E+09   5.51E+09   4.611689   0.05288022
  Residual    12    1.43E+10   1.19E+09
  Total       13    1.99E+10

              Coeff.     St. Er.   t Stat     P-value    Lower 95%    Upper 95%
  Intercept   99704.31   31294.9   3.18596    0.007834   31518.5747   167890.04
  SF (X1)     28.68438   13.3572   2.147484   0.05288    −0.4184621   57.7872167
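Answer (6) above — seasonality is captured with 3 dummies for the 4 seasons, one fewer than the number of categories, as slide 33 prescribes. A minimal encoding sketch (the quarter labels and column names are my own, for illustration):

```python
# Encoding quarterly seasonality with 3 dummies (4 categories - 1),
# leaving Q4 as the baseline category.
quarters = ["Q1", "Q2", "Q3", "Q4", "Q1", "Q3"]  # hypothetical sample

rows = [
    {"Q1": int(q == "Q1"), "Q2": int(q == "Q2"), "Q3": int(q == "Q3")}
    for q in quarters
]

# A Q4 observation has all three dummies equal to 0,
# just as a "mint" house has X3 = X4 = 0 on slide 34.
print(rows[3])   # {'Q1': 0, 'Q2': 0, 'Q3': 0}
print(rows[0])   # {'Q1': 1, 'Q2': 0, 'Q3': 0}
```

Including a fourth dummy would duplicate the information in the other three (their sum plus the intercept), producing exactly the multicollinearity problem described on slide 36.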