Probability and Statistics for Business
Chapter 6: Using Simple Regression (OLS) to Summarize Two-Variable Relationships
MGTSC 312
Note: Equation numbers in slides match equation numbers in the Course Pack.
Jan 26a, 2021 version, Winter 2021, MGTSC 312, Ch6

Simple Linear Regression Overview
• Basics of the simple linear regression line
• Important sums of squares and regression properties
  – SST, SSR, SSE
• Goodness of fit
  – the big R square, $R^2$
• The apartment operating income example
• Output from the Excel data analysis tools
• Example from Course Pack pages 58-61 using Excel, and additional Excel example(s)

The OLS Simple Regression Line
• Ordinary least squares (OLS) regression algorithm
  – Given observations on two variables y and x
  – OLS estimates (computes) the slope and intercept coefficients for a linear equation
• The roles of the variables x and y are interchangeable
  – Here we treat y as the dependent variable and x as the explanatory variable
• The observations on y and x can be viewed as coming from a finite-sized population or from a sample. Note the NOTATION differences between the population and sample cases.

Simple Regression versus Multiple Regression
• The term simple regression means that there are just two variables involved:
  – a dependent variable, denoted as y for now, and one explanatory variable, often called an independent variable, denoted as x for now.
• The term multiple regression means that there can be more than two variables:
  – a dependent variable, denoted as y, and potentially several explanatory variables, denoted as $x_1, x_2, \dots, x_p$, where p is the total number of included explanatory variables.

Why Spend Time on Simple Regression?
• In business, we often want to predict or explain the values of one variable (say, sales) based on the values of multiple other variables (say, product price, variables describing the economic conditions, and demographic factors).
  – A key tool is multiple regression (in Chapter 7).
• Simple regressions include just one explanatory variable.
• However, the principles are the same as for multiple regression.
  – The formulas for the coefficients and other equation statistics are easier to understand for simple regression. Thus, studying simple regression is helpful for understanding multiple regression.
• Also, trend lines created with simple regression are everywhere in business!

Because the tool is used so much, you can find lots of information about simple regression on the Internet:
[REMEMBER – materials from outside links are optional; NOT for your exams]
• Simple regression in the PreMBA courses for Columbia Graduate School of Business in the heart of New York’s financial district: http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_6.html
• A trend-line model: http://people.duke.edu/~rnau/411trend.htm
• Creating a Market Pay Line Using Regression Analysis: https://peoplecentre.wordpress.com/2016/02/19/creating-a-market-pay-line-using-regression-analysis/
• INVESTOPEDIA has a full page on simple regression: http://www.investopedia.com/terms/l/line-of-best-fit.asp
• On Seeking Alpha: https://seekingalpha.com/article/4104725-regression-trend-another-look-long-term-market-performance
• A “Customer Analytics” example: http://www.sganalytics.com/blog/choosing-right-price-elasticity-model/

For a finite-sized population of size N, the linear regression model has two parameters, $\beta_0$ and $\beta_1$:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, \dots, N$.
• The slope coefficient, $\beta_1 = \sigma_{y,x} / \sigma_x^2$
  – The ratio of the population covariance between y and x to the population variance of the explanatory variable x.
• The intercept of the regression line (the constant term), $\beta_0 = \mu_y - \beta_1 \mu_x$
  – The population mean of the dependent variable, y, minus the product of the slope coefficient and the population mean of the explanatory variable, x.
• The predicted regression line: $\hat{y}_i = \beta_0 + \beta_1 x_i$ for $i = 1, \dots, N$

A regression for a finite-sized population of size N, where $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, \dots, N$:
• For a given population, $\beta_0$ and $\beta_1$ are a single pair of population parameter values, just as the population mean, $\mu_y$, is a single value for a variable y.
• The equation error term is $\varepsilon_i = y_i - \hat{y}_i = y_i - \beta_0 - \beta_1 x_i$
• For the regression error term, denoted by $\varepsilon_i$ in the population case, there are N values, just as there are N pairs of observations on the variables y and x.
  – However, the values of that error term are never directly observed (except in statistical experiments called Monte Carlo experiments, where the data are created).

For a sample of n observations: $y_i = b_0 + b_1 x_i + e_i$, $i = 1, \dots, n$
• The slope coefficient, $b_1$, is defined as the ratio of the sample covariance between y and x to the sample variance of the explanatory variable x:
  (6-2) $b_1 = s_{y,x} / s_x^2$
• The intercept of the regression line, also called the constant term, is defined as the mean of the dependent variable, y, minus the product of the slope coefficient and the mean of the explanatory variable, x:
  (6-3) $b_0 = \bar{y} - b_1 \bar{x}$

Explanation of the slope coefficient: $\beta_1$
$\hat{y}_i = \beta_0 + \beta_1 x_i$ for $i = 1, \dots, N$
• The regression coefficients, $\beta_0$ and $\beta_1$, are a single pair of values, computed from the population data.
• The slope coefficient, $\beta_1$, is the amount by which y is predicted to change given a 1-unit positive change in the variable x.
  – So, when $\beta_1$ is greater than 0, the line slopes upward.
  – When $\beta_1$ is 0, the line is horizontal.
  – When $\beta_1$ is less than 0, the line slopes downward.
NOTE: We read the variable on the left above, $\hat{y}_i$, as “y hat”; these are the predicted values of y. The predicted values will not equal the actual values (even in the population case) unless all the error term values are zero.

Explanation of the slope coefficient: $b_1$
$\hat{y}_i = b_0 + b_1 x_i$ for $i = 1, \dots, n$
• The regression coefficients, $b_0$ and $b_1$, are a single pair of values, computed from the sample data.
• The slope coefficient, $b_1$, is the amount by which y is predicted to change given a 1-unit positive change in the value of the variable x.
  – So, when $b_1$ is greater than 0, the line slopes upward.
  – When $b_1$ is 0, the line is horizontal.
  – When $b_1$ is less than 0, the line slopes downward.

Regression residuals ($e_i$) in the sample data case:
$\hat{y}_i = b_0 + b_1 x_i$ for $i = 1, \dots, n$
• The equation residual (not to be confused with the true error term) is now given by
  (6-4) $e_i = y_i - b_0 - b_1 x_i$; so, $e_i = y_i - \hat{y}_i$.
  (6-5) $y_i = \hat{y}_i + e_i$
• There are n values of the regression residual.
• We have data on two variables: y and x. The regression slope and intercept values are computed using the data on y and x, and then the values of the equation residual are computed as shown in (6-4).

Important Sums of Squares: for sample data scenario
Square both sides of $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$ and sum over all observations.
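As a worked step, the squared and summed decomposition can be written out in full. This is a sketch in the slide notation, with $e_i = y_i - \hat{y}_i$; it uses only (6-2) and (6-3).

```latex
\begin{align*}
\sum_{i=1}^{n} (y_i - \bar{y})^2
  &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
   + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
   + 2 \sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) \\
\hat{y}_i - \bar{y}
  &= b_0 + b_1 x_i - \bar{y} = b_1 (x_i - \bar{x})
  \quad \text{by (6-3)} \\
\sum_{i=1}^{n} (x_i - \bar{x})\, e_i
  &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
   - b_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0
  \quad \text{by (6-2)}
\end{align*}
```

The last sum in the first line is the cross-product term; the second and third lines show that it equals $2 b_1 \sum (x_i - \bar{x}) e_i$, which is zero.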
All cross-product terms sum to 0, leaving only the following sums of squares:
(6-14) $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$   Sum of Squares Total
(6-15) $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   Sum of Squares Regression
(6-12) $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$   Sum of Squares Error
(6-16) $SST = SSR + SSE$
SST is the numerator of the sample variance for the dependent variable: sample variance = SST/(n-1).

Important properties of simple regressions
SSE = Sum of Squares Error = sum of squared residuals
• The regression residuals will always sum to zero (except for rounding errors), so a zero sum of residuals is no indication of “good fit.”
• OLS minimizes the sum of the SQUARED residuals
  – In other words, there is no linear relationship that can provide a smaller sum of squared residuals for the data used in estimating the regression line.
• For a sample:
  (6-12) $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = SSE$
  (6-11) $\sum_{i=1}^{n} (e_i - \bar{e})^2$
  Why do the expressions in (6-11) and (6-12) in your Course Pack both equal SSE, always?

Regression Goodness of Fit Measure: $R^2$
Very Important
• The equivalent definitions of the big $R^2$ given in (6-17), including the fact that $R^2_{y.x} = r^2_{y,\hat{y}} = r^2_{y,x}$, must be UNDERSTOOD.
(6-17) $R^2 = R^2_{y.x} = \frac{SSR}{SST} = \frac{\text{Explained variation}}{\text{Total variation}} = 1 - \frac{SSE}{SST} = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = r^2_{y,\hat{y}} = r^2_{y,x}$

Example: Apartment Net Operating Income
• An investor has come for advice about constructing a new apartment building
• We have data on Net Operating Income and Number of Suites for a sample of 47 apartment buildings in Edmonton
• Net Operating Income: Dependent variable
• Number of Suites: Explanatory variable
To Excel: Apartments.xlsx data

The Apartments Data Set

Obs.  # Suites  NetOpInc      Obs.  # Suites  NetOpInc
1     58        119202        24    140       350633
2     30        50092         25    44        226375
3     22        33263         26    78        247203
4     21        18413         27    69        28519
5     12        26641         28    150       154278
6     20        32628         29    62        157332
7     15        19877         30    86        171305
8     29        106500        31    44        109461
9     28        63200         32    104       159245
10    23        43484         33    21        34057
11    14        26424         34    18        15392
12    27        81413         35    24        60791
13    52        153284        36    15        48008
14    48        187993        37    21        42299
15    20        33869         38    65        145998
16    205       562942        39    24        54357
17    17        10217         40    12        17288
18    26        26712         41    12        24058
19    22        48721         42    12        12397
20    24        51282         43    12        9882
21    20        31572         44    12        13713
22    33        107169        45    15        12782
23    104       345608        46    12        24020
                              47    20        36187

Descriptive Statistics for the Dependent and Explanatory Variables
Open Excel, and select the Data tab. Then select Data Analysis. Then select Descriptive Statistics. This will bring up a screen where you can enter the Input Range, indicate that you have labels in row 1, enter the Output Range, and indicate that you want to see Summary Statistics.

Excel Descriptive Statistics Output
The output you’ll get looks like:

Statistic            Number of Suites   Net Operating Income
Mean                 41.31914894        92257.15
Standard Error       5.995998232        15915.84
Median               24                 48008
Mode                 12                 #N/A
Standard Deviation   41.10649287        109113.5
Sample Variance      1689.743756        1.19E+10
Kurtosis             5.468609534        7.157446
Skewness             2.261741725        2.423045
Range                193                553060
Minimum              12                 9882
Maximum              205                562942
Sum                  1942               4336086
Count                47                 47

And you can turn that into something that looks like:

Statistic            Number of Suites   Net Operating Income
Mean                 41.32              92257.15
Standard Error       6.00               15915.84
Median               24                 48008
Mode                 12                 #N/A
Standard Deviation   41.11              109113.49
Sample Variance      1689.74            11905754472.04
Kurtosis             5.47               7.16
Skewness             2.26               2.42
Range                193                553060
Minimum              12                 9882
Maximum              205                562942
Sum                  1942               4336086
Count                47                 47

The Variance-Covariance Matrix
Do the same first steps as for Descriptive Statistics, but now select the Covariance option. This will bring up a screen where you can enter the Input Range, select again that you have labels in the first row, and enter the Output Range.

Excel Variance-Covariance Matrix Output
The output you’ll get looks like:

                       Number of Suites   Net Operating Income
Number of Suites       1653.791761
Net Operating Income   3887577.931        11652440547

• This Excel tool gives variances and covariances treating the data as population data (and hence dividing by N). If you are treating the data as sample data, you need to multiply these values by n and divide by (n-1) to get the correct values for a sample.
• However, the correlation value will be the same whether the data are treated as a population or as a sample.

The Correlation Matrix
Do the same first steps as for Descriptive Statistics, but now select the Correlation option. This will bring up a screen where you can enter the Input Range, select again that you have labels in the first row, and enter the Output Range.

The Excel Correlation Matrix Output
The output you’ll get looks like:

                       Number of Suites   Net Operating Income
Number of Suites       1
Net Operating Income   0.885584993        1

• So, is the Number of Suites highly correlated with the Net Operating Income?
• What would be the value of the big $R^2$ if you regressed either of these variables on the other one?

Simple Regression Using Excel
Now select Regression from the Data Analysis Tool options.

Regression using Excel (cont.)
When running regressions using Excel, you must enter the range for your dependent (“Y”) variable followed by the range for your “input X” variable.
And again you need to specify that you have Labels in row 1 and specify the Output Range.

The Portions of the Excel Regression Output Covered So Far

Regression Statistics
Multiple R          0.886
R Square            0.784
Observations        47

ANOVA
                    df    SS
Regression          1     4.29512E+11
Residual            45    1.18153E+11
Total               46    5.47665E+11

                    Coefficients
Intercept           -4872.015
Number of Suites    2350.706

(We’ll be taking up the F and t statistics and other parts of the full output that Excel gives for a regression in Chapter 11.)

From the Excel regression output:
• For the apartment data the estimated coefficients are $b_0 = -4{,}872.0$ and $b_1 = 2{,}350.7$
• Expected Net Operating Income (in dollars) is $\hat{y} = -4{,}872.0 + (2{,}350.7 \times \text{Number of Suites})$
• The $R^2$ for the regression is 0.784. The equation explains 78.4% of the total variation in the dependent variable, which is the Net Operating Income.
Also, the correlation between the dependent and explanatory variables for this regression is 0.886 (since $r_{x,y} = \sqrt{R^2} = r_{y,\hat{y}}$ when the slope is positive).

Explanation of the Regression Coefficients
$\hat{y} = -4872 + 2350.7x$
y is the dependent variable, which is predicted; the right-side variable x is the explanatory variable.
• -4872 is the intercept mathematically, but what does it mean?
• Two possible answers: 1) it’s the income from a building with no suites; i.e., it is the fixed cost of having a building regardless of the number of suites; or, 2) it is meaningless because zero is outside the range of number of suites in the data set.
• 2350.7, the slope of the regression line, is the increase in predicted income for one additional suite.
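The coefficients in the Excel regression output can be cross-checked against the summary statistics reported earlier, using (6-2), (6-3), and (6-17). The following Python sketch plugs in the covariance, variance, means, and sums of squares from the Excel outputs; the variable names and the 50-suite building are illustrative assumptions, not part of the course materials.

```python
import math

# Values copied from the Excel outputs shown earlier in the slides.
cov_xy = 3887577.931   # population covariance (Suites, Net Operating Income)
var_x = 1653.791761    # population variance of Number of Suites
mean_x = 41.31914894   # mean Number of Suites
mean_y = 92257.15      # mean Net Operating Income

# (6-2): slope = covariance / variance. The N versus (n - 1) divisor
# cancels in the ratio, so the population values give the same slope.
b1 = cov_xy / var_x
# (6-3): intercept = mean of y minus slope times mean of x.
b0 = mean_y - b1 * mean_x

# (6-17): R-squared from the ANOVA sums of squares, SSR / SST.
ssr = 4.29512e11
sst = 5.47665e11
r_squared = ssr / sst
r = math.sqrt(r_squared)  # positive root because the slope is positive

# Hypothetical building with 50 suites (an illustrative input only).
suites = 50
predicted_income = b0 + b1 * suites

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")
print(f"R^2 = {r_squared:.4f}, r = {r:.4f}")
print(f"Predicted income for {suites} suites: {predicted_income:.0f}")
```

Because the slope is a ratio of a covariance to a variance, it does not matter here that Excel's Covariance tool divides by N rather than (n-1).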

Scatter Diagram
[Figure: scatter diagram of Net Operating Income (vertical axis) against Number of Suites (horizontal axis), with fitted trend line y = 2350.7x - 4872 and R² = 0.7843]
Use Excel’s graph wizard with the Scatter option and add a trend line, showing the equation and $R^2$. Compare the slope and intercept here with what you got from the Excel Regression tool.

Example on Page 60: Excel Calculation
To Excel: Ch6_Movies.xlsx data
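Returning to the apartment example, the whole chain from (6-2) through (6-17) can also be checked outside Excel. The sketch below is one possible way to do it in plain Python, using the 47 observations from the Apartments data set; the variable names are assumptions made for illustration, and the course itself uses the Excel tools.

```python
# Apartment data from the slides, in observation order 1..47.
suites = [58, 30, 22, 21, 12, 20, 15, 29, 28, 23, 14, 27, 52, 48, 20, 205,
          17, 26, 22, 24, 20, 33, 104, 140, 44, 78, 69, 150, 62, 86, 44,
          104, 21, 18, 24, 15, 21, 65, 24, 12, 12, 12, 12, 12, 15, 12, 20]
income = [119202, 50092, 33263, 18413, 26641, 32628, 19877, 106500, 63200,
          43484, 26424, 81413, 153284, 187993, 33869, 562942, 10217, 26712,
          48721, 51282, 31572, 107169, 345608, 350633, 226375, 247203,
          28519, 154278, 157332, 171305, 109461, 159245, 34057, 15392,
          60791, 48008, 42299, 145998, 54357, 17288, 24058, 12397, 9882,
          13713, 12782, 24020, 36187]

n = len(suites)
mean_x = sum(suites) / n
mean_y = sum(income) / n

# Sample covariance and sample variance (divide by n - 1).
s_xy = sum((x - mean_x) * (y - mean_y)
           for x, y in zip(suites, income)) / (n - 1)
s_xx = sum((x - mean_x) ** 2 for x in suites) / (n - 1)

b1 = s_xy / s_xx            # (6-2) slope
b0 = mean_y - b1 * mean_x   # (6-3) intercept

fitted = [b0 + b1 * x for x in suites]                # predicted values
residuals = [y - f for y, f in zip(income, fitted)]   # (6-4)

sst = sum((y - mean_y) ** 2 for y in income)   # (6-14)
ssr = sum((f - mean_y) ** 2 for f in fitted)   # (6-15)
sse = sum(e ** 2 for e in residuals)           # (6-12)
r_squared = ssr / sst                          # (6-17)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"SST = {sst:.5e}, SSR = {ssr:.5e}, SSE = {sse:.5e}")
print(f"R^2 = {r_squared:.4f}, sum of residuals = {sum(residuals):.6f}")
```

The printed sum of residuals should be zero up to floating-point rounding, and SSR plus SSE should reproduce SST, which mirrors (6-16).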