Spreadsheet Modeling &
Decision Analysis
A Practical Introduction to
Management Science
Regression Analysis
Introduction to Regression Analysis (RA)
• Regression Analysis is used to estimate a function f( )
that describes the relationship between a continuous
dependent variable and one or more independent
variables.
Y = f(X1, X2, X3,…, Xn) + e
Note:
• f( ) describes systematic variation in the relationship.
• e represents the unsystematic variation (or random error) in the relationship.
• In other words, the observations we are interested in can be separated into two parts:
Y = f(X1, X2, X3,…, Xn) + e
Observations = Model + Error
Observations = Signal + Noise
Ideally, the noise should be very small compared to the model.
Signal to Noise
What we observe can be divided into signal and noise:
what we see = signal + noise
Model specification
If the true function is:
yi = B0 + B1Xi + B2Zi
And we fit:
yi = b0 + b1Xi + b2Zi + ei
Our model is exactly specified and we
obtain an unbiased and efficient estimate.
Model specification
And finally, if the true function is:
yi = B0 + B1Xi + B2Zi + B3XiZi + B4Zi²
And we fit:
yi = b0 + b1Xi + b2Zi + ei
Our model is underspecified: we excluded some necessary terms, and we obtain a biased estimate.
Model specification
On the other hand, if the true function is:
yi = B0 + B1Xi + B2Zi
And we fit:
yi = b0 + b1Xi + b2Zi + b3XiZi + ei
Our model is overspecified: we included some unnecessary terms, and we obtain an inefficient estimate.
Model specification
• If you specify the model exactly, there is no bias.
• If you overspecify the model (add more terms than needed), the result is unbiased, but inefficient.
• If you underspecify the model (omit one or more necessary terms), the result is biased.
• Overall Strategy
– The best option is to exactly specify the true function.
– We would prefer to err by overspecifying our model, because that only leads to inefficiency.
– Therefore, start with a likely overspecified model and reduce it.
An Example
• Consider the relationship between
advertising (X1) and sales (Y) for a
company.
• There probably is a relationship...
...as advertising increases, sales should
increase.
• But how would we measure and quantify
this relationship?
A Scatter Plot of the Data
[Scatter plot: Sales (in $1,000s), 0–600, versus Advertising (in $1,000s), 20–100]
The Nature of a Statistical Relationship
[Figure: a regression curve through the (X, Y) plane, with probability distributions for Y at different levels of X]
A Simple Linear Regression Model
• The scatter plot shows a linear relation between
advertising and sales.
• So the following regression model is suggested by the
data,
Yi = b0 + b1X1i + ei
This refers to the true relationship between the entire
population of advertising and sales values.
• The estimated regression function (based on our
sample) will be represented as,
Ŷi = b0 + b1X1i
where Ŷi is the estimated (or fitted) value of Y at a given level of X.
Determining the Best Fit
• Numerical values must be assigned to b0 and b1
• The method of “least squares” selects the values
that minimize:
ESS = Σi=1..n (Yi − Ŷi)² = Σi=1..n (Yi − (b0 + b1X1i))²
• If ESS=0 our estimated function fits the data
perfectly.
• We could solve this problem using Solver...
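Before turning to Solver, it may help to see the same minimization in code. The sketch below uses hypothetical advertising/sales data (all names and values are assumptions, not from the text) and numerically searches for the b0 and b1 that minimize ESS, much as Solver would:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical advertising (X) and sales (Y) data, both in $1,000s
X = np.array([30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0])
Y = np.array([184.4, 279.1, 244.0, 314.2, 382.1, 450.2, 423.6])

def ess(b):
    """Error sum of squares for candidate intercept b[0] and slope b[1]."""
    return np.sum((Y - (b[0] + b[1] * X)) ** 2)

# Mimic Solver: numerically search for the (b0, b1) minimizing ESS
result = minimize(ess, x0=[0.0, 1.0])
b0, b1 = result.x
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, minimized ESS = {result.fun:.3f}")
```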
Estimation – Linear Regression
Formula for a straight line:
y = b0 + b1x + e
where b1 = slope = Δy/Δx and b0 = intercept; we want to solve for b0 and b1.
Using Solver...
The Estimated Regression Function
• The estimated regression function is:
Ŷi = 36.342 + 5.550X1i
Using the Regression Tool
• Excel also has a built-in tool for performing
regression that:
– is easier to use
– provides a lot more information about the
problem
The TREND() Function
TREND(Y-range, X-range, X-value for prediction)
where:
Y-range is the spreadsheet range containing the
dependent Y variable,
X-range is the spreadsheet range containing the
independent X variable(s),
X-value for prediction is a cell (or cells) containing the
values for the independent X variable(s) for which we want
an estimated value of Y.
Note: The TREND( ) function is dynamically updated whenever any inputs to the function change. However, it does not provide the statistical information produced by the regression tool. It is best to use these two approaches in conjunction with one another.
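For readers working outside Excel, a rough Python analogue of TREND( ) can be sketched with numpy; the data here are hypothetical:

```python
import numpy as np

def trend(y_range, x_range, x_new):
    """Rough analogue of Excel's TREND(): fit Y = b0 + b1*X by least
    squares, then return the predicted Y at the new X value(s)."""
    b1, b0 = np.polyfit(x_range, y_range, deg=1)  # returns slope, intercept
    return b0 + b1 * np.asarray(x_new)

# Hypothetical data: predict sales when advertising is 65 ($1,000s)
x = [30, 40, 50, 60, 70, 80, 90]
y = [184.4, 279.1, 244.0, 314.2, 382.1, 450.2, 423.6]
print(trend(y, x, 65))
```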
Evaluating the “Fit”
[Scatter plot with fitted regression line: Sales (in $000s), 0–600, versus Advertising (in $000s), 20–100; R2 = 0.9691]
The R2 Statistic
• The R2 statistic indicates how well an
estimated regression function fits the data.
• 0 ≤ R2 ≤ 1
• It measures the proportion of the total
variation in Y around its mean that is
accounted for by the estimated regression
equation.
• To understand this better, consider the
following graph...
Error Decomposition
[Figure: for an actual value Yi and fitted value Ŷi on the line Ŷ = b0 + b1X, the total deviation Yi − Ȳ splits into the error Yi − Ŷi plus the explained part Ŷi − Ȳ]
Partition of the Total Sum of Squares
Σi=1..n (Yi − Ȳ)² = Σi=1..n (Yi − Ŷi)² + Σi=1..n (Ŷi − Ȳ)²
or,
TSS = ESS + RSS
R2 = RSS/TSS = 1 − ESS/TSS
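A minimal sketch of this decomposition in Python, assuming actual values y and OLS-fitted values y_hat are already available:

```python
import numpy as np

def r_squared(y, y_hat):
    """R2 from the TSS = ESS + RSS partition. The two expressions agree
    when y_hat comes from least squares with an intercept."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ess = np.sum((y - y_hat) ** 2)         # error sum of squares
    rss = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    return rss / tss                       # equivalently, 1 - ess / tss
```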
Degree of Linear Correlation
• R2 = 1 means perfect linear correlation; R2 = 0 means no correlation
• High R2 = good fit only if linear model is appropriate;
always check with a scatterplot
• Correlation does not prove causation; x and y may
both be correlated to a third (possibly unidentified)
variable
• A more popular (but less meaningful) measure is the “correlation coefficient”:
r = cov(x,y) / (sx·sy) = b1·(sx/sy)
R2 = RSQ([y-range],[x-range])
r = CORREL([y-range],[x-range])
[Four scatter plots with very different patterns, each with R2 = 0.67]
Testing for Significance: F Test
• Hypotheses
H0: b1 = 0
Ha: b1 ≠ 0
• Test Statistic
F = (RSS / 1) / (ESS / (n − 2))
• Rejection Rule
Reject H0 if F > Fα
where Fα is based on an F distribution with 1 d.f. in the numerator and n − 2 d.f. in the denominator.
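A minimal sketch of this F test in Python, assuming fitted values from a simple linear regression:

```python
import numpy as np
from scipy import stats

def f_test(y, y_hat, alpha=0.05):
    """F test of H0: b1 = 0 for a simple linear regression."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    n = len(y)
    rss = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    ess = np.sum((y - y_hat) ** 2)         # error sum of squares
    f = (rss / 1) / (ess / (n - 2))
    f_crit = stats.f.ppf(1 - alpha, 1, n - 2)
    return f, f_crit, f > f_crit  # reject H0 when f > f_crit
```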
Some Cautions about the
Interpretation of Significance Tests
• Rejecting H0: b1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.
• Just because we are able to reject H0: b1 = 0 and demonstrate statistical significance does not enable us to conclude that there is a linear relationship between x and y.
An Example of Inappropriate Interpretation
• A study shows that, in elementary schools, spelling ability is stronger among students with larger feet.
• Could we conclude that foot size influences spelling ability?
• Or is there another factor that influences both foot size and spelling ability?
Making Predictions
• Suppose we want to estimate the average
levels of sales expected if $65,000 is spent on
advertising.
Ŷi = 36.342 + 5.550X1i
• Estimated Sales = 36.342 + 5.550 * 65
= 397.092
• So when $65,000 is spent on advertising, we
expect the average sales level to be
$397,092.
The Standard Error
• The standard error measures the scatter in the actual data around the estimated regression line.
Se = √( Σi=1..n (Yi − Ŷi)² / (n − k − 1) )
where k = the number of independent variables
• For our example, Se = 20.421
• This is helpful in making predictions...
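In code, the standard error is a one-liner (assuming actual and fitted values are at hand):

```python
import numpy as np

def std_error(y, y_hat, k=1):
    """Standard error Se; k is the number of independent variables."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.sum((y - y_hat) ** 2) / (len(y) - k - 1))
```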
Excel Functions
Scatterplot
Chart/Add Trendline/Linear/Display equation, R2
Menu-driven tools:
Tools/Data Analysis/Regression
Tools/Data Analysis Plus/Stepwise Regression
Tools/OLS Regression
Excel functions:
a = INTERCEPT([y-range],[x-range])
b = SLOPE([y-range],[x-range])
se = STEYX([y-range],[x-range])
ŷ = FORECAST(x,[y-range],[x-range])
An Approximate Prediction Interval
• An approximate 95% prediction interval for a
new value of Y when X1=X1h is given by
Ŷh ± 2Se
where:
Ŷh = b0 + b1X1h
• Example: If $65,000 is spent on advertising:
95% lower prediction interval = 397.092 - 2*20.421 = 356.250
95% upper prediction interval = 397.092 + 2*20.421 = 437.934
• If we spend $65,000 on advertising we are
approximately 95% confident actual sales will
be between $356,250 and $437,934.
An Exact Prediction Interval
• A (1−α)% prediction interval for a new value of Y when X1 = X1h is given by
Ŷh ± t(1−α/2, n−2) Sp
where:
Ŷh = b0 + b1X1h
Sp = Se √( 1 + 1/n + (X1h − X̄1)² / Σi=1..n (X1i − X̄1)² )
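A sketch of the exact interval in Python, assuming the fitted coefficients and Se from above are supplied:

```python
import numpy as np
from scipy import stats

def exact_prediction_interval(x, b0, b1, x_h, se, alpha=0.05):
    """Exact (1 - alpha) prediction interval for a new Y at X1 = x_h."""
    x = np.asarray(x)
    n = len(x)
    y_hat_h = b0 + b1 * x_h
    s_p = se * np.sqrt(1 + 1 / n
                       + (x_h - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    return y_hat_h - t * s_p, y_hat_h + t * s_p
```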
Example
• If $65,000 is spent on advertising:
95% lower prediction interval = 397.092 - 2.306*21.489 = 347.556
95% upper prediction interval = 397.092 + 2.306*21.489 = 446.666
• If we spend $65,000 on advertising we are 95%
confident actual sales will be between $347,556 and
$446,666.
• This interval is only about $20,000 wider than the
approximate one calculated earlier but was much
more difficult to create.
• The greater accuracy is not always worth the trouble.
Comparison of
Prediction Interval Techniques
[Figure: Sales (125–575) versus Advertising Expenditures (25–95), showing the regression line, prediction intervals created using the standard error Se (constant width), and wider, curved prediction intervals created using the standard prediction error Sp]
Confidence Intervals for the Mean
• A (1−α)% confidence interval for the true mean value of Y when X1 = X1h is given by
Ŷh ± t(1−α/2, n−2) Sa
where:
Ŷh = b0 + b1X1h
Sa = Se √( 1/n + (X1h − X̄1)² / Σi=1..n (X1i − X̄1)² )
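In code, the only change from the prediction-interval sketch above is dropping the leading 1 under the square root:

```python
import numpy as np
from scipy import stats

def mean_confidence_interval(x, b0, b1, x_h, se, alpha=0.05):
    """(1 - alpha) confidence interval for the mean of Y at X1 = x_h.
    Sa omits the leading 1 that Sp carries, so the interval is narrower:
    we are estimating the average Y, not a single noisy observation."""
    x = np.asarray(x)
    n = len(x)
    s_a = se * np.sqrt(1 / n
                       + (x_h - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    t = stats.t.ppf(1 - alpha / 2, n - 2)
    y_hat_h = b0 + b1 * x_h
    return y_hat_h - t * s_a, y_hat_h + t * s_a
```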
A Note About Extrapolation
• Predictions made using an estimated
regression function may have little or no
validity for values of the independent
variables that are substantially different
from those represented in the sample.
What Does “Regression” Mean?
[Scatter plot: Height of Daughter versus Height of Mother, both ranging from 55 to 75 inches]
What Does “Regression” Mean?
1. Draw a “best-fit” line freehand
2. Find mother’s height = 60”, find average
daughter’s height
3. Repeat for mother’s height = 62”, 64”… 70”; draw
“best-fit” line for these points
4. Draw line daughter’s height = mother’s height
5. For a given mother’s height, daughter’s height
tends to be between mother’s height and mean
height: “regression toward the mean”
What Does “Regression” Mean?
[Same scatter plot with fitted line: y = 0.45x + 36.29, R2 = 0.21]
Residual Analysis
• Residual for Observation i:
yi − ŷi
• Standardized Residual for Observation i:
(yi − ŷi) / s(yi − ŷi)
where:
s(yi − ŷi) = s √(1 − hi), and hi is the leverage of observation i
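A sketch computing standardized residuals in Python; the leverage values hi are taken from the diagonal of the hat matrix:

```python
import numpy as np

def standardized_residuals(X, y, y_hat):
    """(y_i - yhat_i) / (s * sqrt(1 - h_i)), with leverages h_i taken
    from the diagonal of the hat matrix."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X)])  # add intercept
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)          # leverages
    resid = np.asarray(y) - np.asarray(y_hat)
    k = X.shape[1] - 1
    s = np.sqrt(np.sum(resid ** 2) / (len(y) - k - 1))
    return resid / (s * np.sqrt(1 - h))
```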
Residual Analysis
• Detecting Outliers
– An outlier is an observation that is unusual in comparison with the other data.
– Minitab classifies an observation as an outlier if its standardized residual value is < −2 or > +2.
– This standardized residual rule sometimes fails to identify an unusually large observation as being an outlier.
– This rule’s shortcoming can be circumvented by using studentized deleted residuals.
– The |i-th studentized deleted residual| will be larger than the |i-th standardized residual|.
Multiple Regression Analysis
• Most regression problems involve more than one
independent variable.
• If each independent variable varies in a linear manner with Y, the estimated regression function in this case is:
Ŷi = b0 + b1X1i + b2X2i + … + bkXki
• The optimal values for the bi can again be found by
minimizing the ESS.
• The resulting function fits a hyperplane to our sample
data.
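A minimal sketch of fitting such a function with numpy's least-squares solver, on hypothetical data (the variable names and values are assumptions for illustration):

```python
import numpy as np

# Hypothetical sample: Y with two independent variables X1 and X2
X1 = np.array([1.5, 2.0, 2.1, 1.8, 2.4, 1.2])
X2 = np.array([3.0, 2.0, 3.0, 2.0, 3.0, 1.0])
Y  = np.array([110.0, 125.0, 139.0, 118.0, 154.0, 84.0])

# Design matrix with an intercept column; lstsq minimizes ESS directly
A = np.column_stack([np.ones(len(Y)), X1, X2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f"b0 = {b[0]:.3f}, b1 = {b[1]:.3f}, b2 = {b[2]:.3f}")
```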
Example Regression Surface for
Two Independent Variables
[Figure: sample points scattered about a fitted regression plane over the (X1, X2) plane]
Multiple Regression Example:
Real Estate Appraisal
• A real estate appraiser wants to develop
a model to help predict the fair market
values of residential properties.
• Three independent variables will be used
to estimate the selling price of a house:
– total square footage
– number of bedrooms
– size of the garage
Selecting the Model
• We want to identify the simplest model that
adequately accounts for the systematic
variation in the Y variable.
• Arbitrarily using all the independent variables
may result in overfitting.
• A sample reflects characteristics:
– representative of the population
– specific to the sample
• We want to avoid fitting sample specific
characteristics -- or overfitting the data.
Models with One Independent Variable
• With simplicity in mind, suppose we fit three
simple linear regression functions:
Ŷi = b0 + b1X1i
Ŷi = b0 + b2X2i
Ŷi = b0 + b3X3i
• Key regression results are:
Variables in the Model   R2      Adjusted R2   Se       Parameter Estimates
X1                       0.870   0.855         10.299   b0=9.503, b1=56.394
X2                       0.759   0.731         14.030   b0=78.290, b2=28.382
X3                       0.793   0.770         12.982   b0=16.250, b3=27.607
• The model using X1 accounts for 87% of the
variation in Y, leaving 13% unaccounted for.
Important Software Note
When using more than one independent
variable, all variables for the X-range
must be in one contiguous block of cells
(that is, in adjacent columns).
Models with Two Independent Variables
• Now suppose we fit the following models with
two independent variables:
Ŷi = b0 + b1X1i + b2X2i
Ŷi = b0 + b1X1i + b3X3i
• Key regression results are:
Variables in the Model   R2      Adjusted R2   Se       Parameter Estimates
X1                       0.870   0.855         10.299   b0=9.503, b1=56.394
X1 & X2                  0.939   0.924         7.471    b0=27.684, b1=38.576, b2=12.875
X1 & X3                  0.877   0.847         10.609   b0=8.311, b1=44.313, b3=6.743
• The model using X1 and X2 accounts for 93.9%
of the variation in Y, leaving 6.1% unaccounted
for.
The Adjusted R2 Statistic
• As additional independent variables are added
to a model:
– The R2 statistic can only increase.
– The Adjusted-R2 statistic can increase or decrease.
R2a = 1 − (ESS/TSS) × ((n − 1)/(n − k − 1))
• The R2 statistic can be artificially inflated by
adding any independent variable to the model.
• We can compare adjusted-R2 values as a
heuristic to tell if adding an additional
independent variable really helps.
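The formula translates directly into code:

```python
def adjusted_r_squared(ess, tss, n, k):
    """Adjusted R2 for n observations and k independent variables."""
    return 1 - (ess / tss) * ((n - 1) / (n - k - 1))
```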
A Comment On Multicollinearity
• It should not be surprising that adding X3 (# of
bedrooms) to the model with X1 (total square footage)
did not significantly improve the model.
• Both variables represent the same (or very similar)
things -- the size of the house.
• These variables are highly correlated (or collinear).
• Multicollinearity should be avoided.
Testing for Significance: Multicollinearity
• The term multicollinearity refers to the correlation
among the independent variables.
• When the independent variables are highly
correlated (say, |r | > .7), it is not possible to
determine the separate effect of any particular
independent variable on the dependent variable.
• If the estimated regression equation is to be used
only for predictive purposes, multicollinearity is
usually not a serious problem.
• Every attempt should be made to avoid including
independent variables that are highly correlated.
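A simple sketch for screening candidate variables against the |r| > .7 rule of thumb (the function name and threshold default are assumptions for illustration):

```python
import numpy as np

def collinear_pairs(X, names, threshold=0.7):
    """Flag pairs of independent variables with |r| above the threshold."""
    r = np.corrcoef(np.asarray(X), rowvar=False)  # columns are variables
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                flagged.append((names[i], names[j], round(r[i, j], 3)))
    return flagged
```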
Model with Three Independent Variables
• Now suppose we fit the following model with
three independent variables:
Ŷi = b0 + b1X1i + b2X2i + b3X3i
• Key regression results are:
Variables in the Model   R2      Adjusted R2   Se       Parameter Estimates
X1                       0.870   0.855         10.299   b0=9.503, b1=56.394
X1 & X2                  0.939   0.924         7.471    b0=27.684, b1=38.576, b2=12.875
X1, X2 & X3              0.943   0.918         7.762    b0=26.440, b1=30.803, b2=12.567, b3=4.576
• The model using X1 and X2 appears to be
best:
– Highest adjusted-R2
– Lowest Se (most precise prediction intervals)
Making Predictions
• Let’s estimate the average selling price of a house with 2,100 square feet and a 2-car garage:
Ŷi = b0 + b1X1i + b2X2i
Ŷi = 27.684 + 38.576(2.1) + 12.875(2) = 134.444
• The estimated average selling price is $134,444
• A 95% prediction interval for the actual selling
price is approximately:
Ŷh ± 2Se
95% lower prediction interval = 134.444 − 2(7.471) = $119,502
95% upper prediction interval = 134.444 + 2(7.471) = $149,386
Binary Independent Variables
• Other non-quantitative factors can be included in the analysis as independent variables using binary variables.
• Example: The presence (or absence) of a swimming pool,
Xpi = 1 if house i has a pool, 0 otherwise
• Example: Whether the roof is in good, average, or poor condition,
Xr1i = 1 if the roof of house i is in good condition, 0 otherwise
Xr2i = 1 if the roof of house i is in average condition, 0 otherwise
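A sketch of this encoding in Python, using hypothetical roof-condition labels; note that two dummies encode three categories, with the omitted category ("poor") as the baseline:

```python
import numpy as np

# Hypothetical roof-condition labels for five houses
roof = ["good", "average", "poor", "good", "average"]

# Two dummies encode three categories; "poor" is the baseline (both 0)
Xr1 = np.array([1 if r == "good" else 0 for r in roof])
Xr2 = np.array([1 if r == "average" else 0 for r in roof])
```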
Polynomial Regression
• Sometimes the relationship between a dependent and
independent variable is not linear.
[Scatter plot: Selling Price ($50–$175, in $1,000s) versus Square Footage (0.900–2.400, in 1,000s of square feet)]
• This graph suggests a quadratic relationship between
square footage (X) and selling price (Y).
The Regression Model
• An appropriate regression function in this case
might be,
Ŷi = b0 + b1X1i + b2X1i²
or equivalently,
Ŷi = b0 + b1X1i + b2X2i
where,
X2i = X1i²
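Because X2 = X1² is just another column in the design matrix, ordinary least squares handles the quadratic model unchanged, as this sketch with hypothetical data shows:

```python
import numpy as np

# Hypothetical data: square footage (1,000s) and selling price ($1,000s)
X1 = np.array([0.9, 1.2, 1.5, 1.8, 2.1, 2.4])
Y  = np.array([65.0, 80.0, 92.0, 110.0, 135.0, 168.0])

# X2 = X1**2 is just another column, so OLS fits the quadratic unchanged
A = np.column_stack([np.ones(len(Y)), X1, X1 ** 2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(f"Y-hat = {b[0]:.2f} + {b[1]:.2f}*X1 + {b[2]:.2f}*X1^2")
```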
Implementing the Model
Graph of Estimated Quadratic Regression Function
[Figure: Selling Price ($50–$175) versus Square Footage (0.900–2.400) with the fitted quadratic curve]
Fitting a Third Order Polynomial Model
• We could also fit a third order polynomial model,
Ŷi = b0 + b1X1i + b2X1i² + b3X1i³
or equivalently,
Ŷi = b0 + b1X1i + b2X2i + b3X3i
where,
X2i = X1i²
X3i = X1i³
Graph of Estimated Third Order Polynomial
Regression Function
[Figure: Selling Price ($50–$175) versus Square Footage (0.900–2.400) with the fitted third order polynomial curve]
Overfitting
• When fitting polynomial models, care
must be taken to avoid overfitting.
• The adjusted-R2 statistic can be used
for this purpose here also.
Example: Programmer Salary Survey
A software firm collected data for a sample of 20
computer programmers. A suggestion was made
that regression analysis could be used to determine
if salary was related to the years of experience and
the score on the firm’s programmer aptitude test.
The years of experience, score on the aptitude test,
and corresponding annual salary ($1000s) for a
sample of 20 programmers is shown on the next
slide.
Example: Programmer Salary Survey
Exper.   Score   Salary        Exper.   Score   Salary
4        78      24            9        88      38
7        100     43            2        73      26.6
1        86      23.7          10       75      36.2
5        82      34.3          5        81      31.6
8        86      35.8          6        74      29
10       84      38            8        87      34
0        75      22.2          4        79      30.1
1        80      23.1          6        94      33.9
6        83      30            3        70      28.2
6        91      33            3        89      30
Example: Programmer Salary Survey
• Multiple Regression Model
Suppose we believe that salary (y) is related to the years of
experience (x1) and the score on the programmer aptitude
test (x2) by the following regression model:
y = b0 + b1 x1 + b2 x2 + e
where
y = annual salary ($000)
x1 = years of experience
x2 = score on programmer aptitude test
Example: Programmer Salary Survey
• Multiple Regression Equation
Using the assumption E (e) = 0, we obtain
E(y ) = b0 + b1 x1 + b2 x2
• Estimated Regression Equation
b0, b1, b2 are the least squares estimates of the coefficients in the model above. Thus, the estimated regression equation is
ŷ = b0 + b1x1 + b2x2
Example: Programmer Salary Survey
• Solving for the Estimates of b0, b1, b2
Input Data (x1, x2, y for the 20 programmers) → Computer Package for Solving Multiple Regression Problems → Least Squares Output (b0, b1, b2, R2, etc.)
Example: Programmer Salary Survey
• Data Analysis Output
The regression equation is
Salary = 3.17 + 1.40 Exper + 0.251 Score

Predictor   Coef      Stdev     t-ratio   p
Constant    3.174     6.156     .52       .613
Exper       1.4039    .1986     7.07      .000
Score       .25089    .07735    3.24      .005

s = 2.419   R-sq = 83.4%   R-sq(adj) = 81.5%
Example: Programmer Salary Survey
• Computer Output (continued)
Analysis of Variance

SOURCE       DF   SS       MS       F       P
Regression   2    500.33   250.16   42.76   0.000
Error        17   99.46    5.85
Total        19   599.79