Uploaded by daniel.mv.cabral

Ch 6 MGTSC SCIENCE

Probability and Statistics for
Business
Chapter 6:
Using Simple Regression (OLS) to
Summarize Two-Variable Relationships
MGTSC 312
Note: Equation numbers in slides match equation
numbers in the Course Pack.
Jan 26a, 2021 version
Winter 2021 MGTSC 312, Ch6
Simple Linear Regression Overview
• Basics for simple linear regression line
• Important sums of squares and regression properties
– SST, SSR, SSE
• Goodness of Fit – the big R square
– R²
• The apartment operating income example
• Output from the Excel data analysis tools
• Example from Course Pack page 58-61 using Excel
and additional Excel example(s)
The OLS Simple Regression Line
• Ordinary least squares (OLS) regression algorithm
– Given observations on two variables y and x
– OLS estimates (computes) the slope and intercept coefficients for a
linear equation
• The roles of the variables x and y are interchangeable
– Consider y as the dependent and x as the explanatory variable
• The observations on y and x can be viewed as being for a
finite sized population or a sample.
Note the NOTATION differences for the
population and sample cases
Simple Regression versus Multiple Regression
• The term simple regression means that there are just
two variables involved:
– a dependent variable, denoted as y for now, and one
explanatory variable, often called an independent variable,
denoted as x for now.
• The term multiple regression means that there can be
more than two variables:
– a dependent variable, denoted as y, and potentially several
explanatory variables that can be denoted as x1 , x2 , … , xp
p is the total number of the included explanatory variables
Why Spend Time on Simple Regression?
• In business, we often want to predict or explain the values of
one variable (say, sales) based on the values of multiple
other variables (say, product price, variables describing the
economic conditions, and demographic factors).
– A key tool is multiple regression (in Chapter 7).
• Simple regressions include just one explanatory variable.
• However, the principles are the same as for multiple
regression.
– The formulas for the coefficients and other equation statistics are
easier to understand for simple regression. Thus, studying simple
regression is helpful for understanding multiple regression.
• Also, trend lines created with simple regression are
everywhere in business!
Because the tool is used so much, you can find lots of
information about simple regression on the Internet:
[REMEMBER – materials from outside links are optional; NOT for your exams]
• Simple regression in the PreMBA courses for Columbia
Graduate School of Business in the heart of New York’s
financial district: http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_6.html
• A trend-line model:
http://people.duke.edu/~rnau/411trend.htm
• Creating a Market Pay Line Using Regression Analysis
https://peoplecentre.wordpress.com/2016/02/19/creating-a-market-pay-line-using-regression-analysis/
• INVESTOPEDIA has a full page on simple regression:
http://www.investopedia.com/terms/l/line-of-best-fit.asp
• On Seeking Alpha: https://seekingalpha.com/article/4104725-regression-trend-another-look-long-term-market-performance
• A “Customer Analytics” example: http://www.sganalytics.com/blog/choosing-right-price-elasticity-model/
For a finite sized population of size N, the linear
regression model has two parameters: β0 and β1
yi = β0 + β1xi + εi for i = 1, ... , N.
• The slope coefficient, $\beta_1 = \sigma_{y,x} / \sigma_x^2$
– The ratio of the population covariance between y and x
divided by the population variance for the explanatory
variable x.
• The intercept of the regression line (the constant
term), β0 = μy − β1μx
– the population mean of the dependent variable, y, minus the
product of the slope coefficient times population mean of the
explanatory variable, x.
• The predicted regression line: ŷi = β0 + β1 xi for i = 1, ... , N
A regression for a finite sized population of size N,
where yi = β0 + β1 xi + εi for i = 1, ... , N:
• For a given population, β0 and β1 are a single pair of
population parameter values, just as there is a single population
mean value, μy, for a variable y.
• The equation error term is εi = yi − ŷi = yi − β0 − β1 xi
• For the regression error term, denoted by εi in the
population case, there are N values, just as there are N
pairs of observations on the variables y and x .
– However, the values of that error term are never directly observed
(except in statistical experiments called Monte Carlo experiments
when the data are created).
For a sample of n observations:
yi = b0 + b1 xi + ei , i = 1, ... , n
• The slope coefficient, b1, is defined as the ratio of the
covariance between y and x divided by the variance for
the explanatory variable x:
(6-2) $b_1 = s_{y,x} / s_x^2$
• The intercept of the regression line, also called the
constant term, is defined as the mean of the dependent
variable, y, minus the product of the slope coefficient and
the mean of the explanatory variable, x:
(6-3) $b_0 = \bar{y} - b_1 \bar{x}$
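As a rough illustration of (6-2) and (6-3), the sketch below computes the slope and intercept from the sample covariance and sample variance. The function name `ols_simple` and the five data points are made up for this example; they do not come from the Course Pack.

```python
def ols_simple(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance s_{y,x} and sample variance s_x^2 (divisor n - 1).
    s_yx = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b1 = s_yx / s_xx          # equation (6-2)
    b0 = y_bar - b1 * x_bar   # equation (6-3)
    return b0, b1


# Hypothetical illustration data (an assumption, not course data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = ols_simple(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

Because the same divisor (n − 1) appears in both the covariance and the variance, it cancels in the ratio, so dividing by N instead would give the same slope.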
Explanation of the slope coefficient: β1
ŷi = β0 + β1 xi for i = 1, ... , N
• The regression coefficients, β0 and β1 are a single pair of
values, computed from the population data.
• The slope coefficient, β1, is the amount by which y is predicted
to change given a 1-unit positive change in the variable x.
– So, when β1 is greater than 0, the line slopes upward.
– When β1 is 0, the line is horizontal.
– When β1 is less than 0, the line slopes downward.
NOTE: We read the variable on the left above as “y hat” and these are the
predicted values of y. The predicted values will not equal the actual values (even
in the population case) unless all the error term values are zero.
Explanation of the slope coefficient: b1
ŷi = b0 + b1 xi for i = 1, ... , n
• The regression coefficients, b0 and b1 are a single
pair of values, computed from the sample data.
• The slope coefficient, b1, is the amount by which y is
predicted to change given a 1-unit positive change in
the value of the variable x.
– So, when b1 is greater than 0, the line slopes upward.
– When b1 is 0, the line is horizontal.
– When b1 is less than 0, the line slopes downward.
Regression residuals (ei) in the sample data case:
ŷi = b0 + b1 xi for i = 1, ... , n
• The equation residual (not to be confused with the true
error term) is now given by
(6-4) ei = yi − b0 − b1 xi; so, ei = yi − ŷi.
(6-5) yi = ŷi + ei
• There are n values of the regression residual.
• We have data on two variables: y and x. The regression
slope and intercept values are computed using the data on y
and x, and then the values of the equation residual are
computed as shown in (6-4).
Important Sums of Squares:
for sample data scenario
Square both sides of $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$
and sum over all observations. All cross-product terms sum to 0,
leaving only the following sums of squares:
(6-14) $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$   Sum of Squares Total
(6-15) $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   Sum of Squares Regression
(6-12) $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$   Sum of Squares Error
(6-16) SST = SSR + SSE
SST is the numerator of the sample variance
for the dependent variable.
Sample variance = SST/(n-1)
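The decomposition SST = SSR + SSE can be checked numerically. The following is a minimal sketch on made-up data (the five points are an assumption, not from the Course Pack); it fits the line, then computes each sum of squares directly from its definition.

```python
# Hypothetical illustration data (an assumption, not course data).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope and intercept; the (n - 1) divisors cancel in the ratio.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]  # predicted values

sst = sum((yi - y_bar) ** 2 for yi in y)                # Sum of Squares Total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # Sum of Squares Regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # Sum of Squares Error

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
print(f"Sample variance of y = SST/(n-1) = {sst / (n - 1):.4f}")
```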
Important properties of simple regressions
SSE = Sum of Squares Error = Sum of squared residuals
• The regression residuals will always sum to zero (except
for rounding errors). So this is no indication of “good fit.”
• OLS minimizes the sum of the SQUARED residuals
– In other words, there is no linear relationship that can provide a
smaller sum of the squared residuals for the data used in
estimating the regression line.
• For a sample
(6-12) $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = SSE$
??? Why do the expressions in (6-11) and (6-12) in your Course Pack
both equal SSE, always???
(6-11) $\sum_{i=1}^{n} (e_i - \bar{e})^2$
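One way to see why (6-11) and (6-12) agree, sketched here under the assumption that the regression includes an intercept (as every regression in this chapter does): the residuals sum to zero, so their mean is zero and subtracting it changes nothing.

```latex
\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} (e_i - \bar{e})^2 = \sum_{i=1}^{n} e_i^2 = \mathrm{SSE}
```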
Regression Goodness of Fit Measure: R²
Very Important
• The equivalent definitions of the big R² given in (6-17), including
the fact that $R^2_{y.x} = r^2_{y,\hat{y}} = r^2_{y,x}$, must be UNDERSTOOD.
(6-17) $R^2 = R^2_{y.x} = \frac{SSR}{SST} = \frac{\text{Explained variation}}{\text{Total variation}} = 1 - \frac{SSE}{SST} = 1 - \frac{\text{Unexplained variation}}{\text{Total variation}} = r^2_{y,\hat{y}} = r^2_{y,x}$
Example: Apartment Net Operating Income
• An investor has come for advice about constructing
a new apartment building
• We have data on Net Operating Income and
Number of Suites for a sample of 47 apartment
buildings in Edmonton
• Net Operating Income: Dependent variable
• Number of Suites: Explanatory variable
To excel: Apartments.xlsx data
The Apartments Data Set
Obs. #   Suites   NetOpInc
1        58       119202
2        30       50092
3        22       33263
4        21       18413
5        12       26641
6        20       32628
7        15       19877
8        29       106500
9        28       63200
10       23       43484
11       14       26424
12       27       81413
13       52       153284
14       48       187993
15       20       33869
16       205      562942
17       17       10217
18       26       26712
19       22       48721
20       24       51282
21       20       31572
22       33       107169
23       104      345608
24       140      350633
25       44       226375
26       78       247203
27       69       28519
28       150      154278
29       62       157332
30       86       171305
31       44       109461
32       104      159245
33       21       34057
34       18       15392
35       24       60791
36       15       48008
37       21       42299
38       65       145998
39       24       54357
40       12       17288
41       12       24058
42       12       12397
43       12       9882
44       12       13713
45       15       12782
46       12       24020
47       20       36187
Descriptive Statistics for the Dependent and
Explanatory Variables
Open Excel, and select the Data tab.
Then select Data Analysis. Then select Descriptive Statistics:
This will bring up a screen where you can enter the Input Range, indicate
you have labels in row 1, enter the Output Range, and indicate that you
want to see Summary Statistics
Excel Descriptive Statistics Output
The output you’ll get looks like:
Statistic             Number of Suites   Net Operating Income
Mean                  41.31914894        92257.15
Standard Error        5.995998232        15915.84
Median                24                 48008
Mode                  12                 #N/A
Standard Deviation    41.10649287        109113.5
Sample Variance       1689.743756        1.19E+10
Kurtosis              5.468609534        7.157446
Skewness              2.261741725        2.423045
Range                 193                553060
Minimum               12                 9882
Maximum               205                562942
Sum                   1942               4336086
Count                 47                 47
And you can turn that into something that looks like:
Statistic             Number of Suites   Net Operating Income
Mean                  41.32              92257.15
Standard Error        6.00               15915.84
Median                24                 48008
Mode                  12                 #N/A
Standard Deviation    41.11              109113.49
Sample Variance       1689.74            11905754472.04
Kurtosis              5.47               7.16
Skewness              2.26               2.42
Range                 193                553060
Minimum               12                 9882
Maximum               205                562942
Sum                   1942               4336086
Count                 47                 47
The Variance-Covariance Matrix
Do the same first steps as to create Descriptive
Statistics, but now select the Covariance option:
This will bring up a screen where you can enter the Input Range, select
again that you have labels in the first row, and enter the Output Range.
Excel Variance-Covariance Matrix Output
The output you’ll get looks like:
                       Number of Suites   Net Operating Income
Number of Suites       1653.791761
Net Operating Income   3887577.931        11652440547
• This Excel tool gives variances and covariances treating the
data as population data (and hence dividing by N). If you are
treating the data as sample data, you need to multiply these
values by n and divide by (n-1) to get the correct values for a
sample.
• However, the correlation value will be the same irrespective of
population or sample data scenarios
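The population-to-sample adjustment described above can be sketched as follows, using the values from the Excel covariance output for the apartment data. The variable names are illustrative choices. The sketch also shows that ratios such as the slope and the correlation are unaffected, because the n/(n−1) factor cancels.

```python
n = 47  # number of apartment buildings

# Population-style values (divisor N) from the Excel Covariance tool.
pop_var_x = 1653.791761     # variance of Number of Suites
pop_cov_xy = 3887577.931    # covariance of Suites and Net Operating Income
pop_var_y = 11652440547     # variance of Net Operating Income

# Convert to sample values: multiply by n and divide by (n - 1).
scale = n / (n - 1)
sample_var_x = pop_var_x * scale
sample_cov_xy = pop_cov_xy * scale
sample_var_y = pop_var_y * scale

# The slope (6-2) and the correlation are ratios, so the scale factor cancels.
b1 = sample_cov_xy / sample_var_x
r = pop_cov_xy / (pop_var_x * pop_var_y) ** 0.5

print(f"sample variance of Suites = {sample_var_x:.4f}")
print(f"slope b1 = {b1:.3f}")
print(f"correlation r = {r:.4f}")
```

The converted variance for Number of Suites should agree with the Sample Variance reported by the Descriptive Statistics tool.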
The Correlation Matrix
Do the same first steps as to create Descriptive Statistics,
but now select the Correlation option:
This will bring up a screen where you can enter the Input Range, select
again that you have labels in the first row, and enter the Output Range.
The Excel Correlation Matrix Output
The output you’ll get looks like:
                       Number of Suites   Net Operating Income
Number of Suites       1
Net Operating Income   0.885584993        1
• So, is the Number of Suites highly correlated with the
Net Operating Income?
• What would be the value of the big R² if you regressed
either of these variables on the other one?
Simple Regression Using Excel
Now select Regression from the Data Analysis
Tool options:
Regression using Excel (cont.)
When running regressions using Excel, you must enter the range for
your dependent (“Y”) variable
followed by the range for your “input X” variable. And again you need
to specify that you have Labels in row 1 and specify the Output
Range.
The Portions of the Excel Regression Output Covered So Far
Regression Statistics
Multiple R      0.886
R Square        0.784
Observations    47

ANOVA
              df    SS
Regression    1     4.29512E+11
Residual      45    1.18153E+11
Total         46    5.47665E+11

                    Coefficients
Intercept           -4872.015
Number of Suites    2350.706
(We’ll be taking up the F and t statistics and other parts of the
full output that Excel gives for a regression in Chapter 11.)
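To connect this output to the formulas from earlier in the chapter, the sketch below recomputes R² from the ANOVA sums of squares in two equivalent ways and recovers the sample variance of the dependent variable from SST. The numbers are copied from the Excel output as displayed, so small rounding differences are expected; the variable names are illustrative.

```python
# Sums of squares from the ANOVA block of the Excel regression output.
ssr = 4.29512e11   # Regression
sse = 1.18153e11   # Residual
sst = 5.47665e11   # Total
n = 47             # Observations

r2_explained = ssr / sst          # explained variation / total variation
r2_unexplained = 1 - sse / sst    # 1 - unexplained variation / total variation
multiple_r = r2_explained ** 0.5  # square root of R Square

sample_var_y = sst / (n - 1)      # SST is the numerator of the sample variance

print(f"R^2 = SSR/SST     = {r2_explained:.4f}")
print(f"R^2 = 1 - SSE/SST = {r2_unexplained:.4f}")
print(f"Multiple R        = {multiple_r:.4f}")
print(f"Sample variance of Net Operating Income = {sample_var_y:.4e}")
```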
From the Excel regression output:
• For the apartment data the estimated coefficients are
b0 = -4,872.0 and b1 = 2,350.7
• Expected Net Operating Income is
ŷ = –$4,872.0 + ($2,350.7 × Number of Suites)
• The R² for the regression is 0.784.
The equation explains 78.4% of the total variation in the dependent
variable, which is the Net Operating Income.
Also, the correlation between the dependent and explanatory
variables for this regression is 0.886 (since r²x,y = R² = r²y,ŷ and
the slope is positive, rx,y = +√R² ≈ 0.886)
Explanation of the Regression Coefficients
ŷ = −4872 + 2350.7x
ŷ is the predicted value of the dependent variable, and the
right-side variable x is the explanatory variable
• -4872 is the intercept in math, but what does it mean?
• Two possible answers: 1) it’s the income from a building
with no suites; i.e., it is the fixed cost of having a building
regardless of the number of suites; or, 2) it is
meaningless because zero is outside the range of
number of suites in the data set.
• 2350.7, the slope of the regression line, is the increase
in predicted income for one additional suite.
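A small sketch of how the fitted line turns a number of suites into a predicted income, using the coefficients from the Excel regression output. The function name `predict_noi` and the choice of 50 suites are hypothetical; the suite counts in the data set run from 12 to 205, so predictions outside that range should be treated with caution.

```python
B0 = -4872.015   # intercept from the Excel regression output
B1 = 2350.706    # slope: predicted income change per additional suite


def predict_noi(suites):
    """Predicted Net Operating Income for a building with `suites` suites."""
    return B0 + B1 * suites


pred_50 = predict_noi(50)
pred_51 = predict_noi(51)
print(f"Predicted income, 50 suites: {pred_50:.2f}")
print(f"Predicted income, 51 suites: {pred_51:.2f}")
print(f"Difference for one more suite: {pred_51 - pred_50:.3f}")
```

The difference between the two predictions is the slope itself, which is the sense in which the slope is the increase in predicted income for one additional suite.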
Scatter Diagram
[Figure: scatter plot of Net Operating Income (vertical axis, 0 to 600,000) against Number of Suites (horizontal axis, 0 to 250), with fitted trend line y = 2350.7x − 4872 and R² = 0.7843]
Use Excel’s graph wizard with Scatter and add Trend line, showing
the equation and R². Compare the slope and intercept here with what
you got from the Excel Regression tool.
Example on Page 60: Excel Calculation
To excel: Ch6_Movies.xlsx data