LECTURE 3
Introduction to Linear Regression
and Correlation Analysis
1 Simple Linear Regression
2 Regression Analysis
3 Regression Model Validity
Goals
After this, you should be able to:
Interpret the simple linear regression equation for a set of data
Use descriptive statistics to describe the relationship between X and Y
Determine whether a regression model is significant
Goals
(continued)
After this, you should be able to:




Interpret confidence intervals for the
regression coefficients
Interpret confidence intervals for a predicted
value of Y
Check whether regression assumptions are
satisfied
Check to see if the data contains unusual
values
Introduction to Regression Analysis

Regression analysis is used to:

Predict the value of a dependent variable based on the value of at
least one independent variable

Explain the impact of changes in an independent variable on the
dependent variable
Dependent variable: the variable we wish to
explain
Independent variable: the variable used to
explain the dependent variable
Simple Linear Regression Model

Only one independent variable, x

Relationship between x and y is described by a linear
function

Changes in y are assumed to be caused by changes in
x
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Population Linear Regression
The population regression model:
y = β0 + β1x + ε
where:
y = dependent variable
β0 = population y intercept
β1 = population slope coefficient
x = independent variable
ε = random error term, or residual
β0 + β1x is the linear component; ε is the random error component
Linear Regression Assumptions

The underlying relationship between the x variable and the y
variable is linear

The distribution of the errors has constant variability

Error values are normally distributed

Error values are independent (over time)
Population Linear Regression
[Figure: the line y = β0 + β1x + ε plotted against x, with intercept β0 and slope β1. At a given xi, the random error εi is the vertical distance between the observed value of y and the predicted value of y on the line.]
Estimated Regression Model
The sample regression line provides an estimate of
the population regression line
ŷi = b0 + b1xi
where:
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
xi = value of the independent variable
Interpretation of the
Slope and the Intercept
b0 is the estimated average value of y when the value of x is zero
b1 is the estimated change in the average value of y as a result of a one-unit change in x
Finding the Least Squares Equation
The coefficients b0 and b1 will be found using computer software, such as Excel’s data analysis add-in or MegaStat
Other regression measures will also be computed as part of computer-based regression analysis
Simple Linear Regression Example

A real estate agent wishes to examine the
relationship between the selling price of a home
and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent variable (y) = house price in $1000s

Independent variable (x) = square feet
Sample Data for House Price Model
House Price in $1000s (y)    Square Feet (x)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
Regression output is available from Excel (Data – Data Analysis) or from MegaStat (Correlation/Regression)
MegaStat Output
The regression equation is:
Predicted house price = 98.24833 + 0.10977 (square feet)

Regression Analysis
r²           0.581      n           10
r            0.762      k           1
Std. Error   41.330     Dep. Var.   Price($000)

ANOVA table
Source       SS            df   MS            F       p-value
Regression   18,934.9348   1    18,934.9348   11.08   .0104
Residual     13,665.5652   8    1,708.1957
Total        32,600.5000   9

Regression output
variables     coefficients   std. error   t (df=8)   p-value   95% lower   95% upper
Intercept     98.2483
Square feet   0.1098         0.0330       3.329      .0104     0.0337      0.1858
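The slides rely on Excel or MegaStat for these numbers. As a rough cross-check, the sketch below recomputes the coefficients and the ANOVA quantities from the ten sample houses using the textbook least squares formulas (slope = Sxy / Sxx, intercept = ȳ - b1·x̄). The variable names are illustrative assumptions, not anything MegaStat exposes.

```python
# House price sample from the slides: price in $1000s (y) and square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))

# Least squares estimates of the slope and intercept.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# ANOVA decomposition: SST = SSR + SSE.
fitted = [b0 + b1 * x for x in sqft]
sst = sum((y - y_bar) ** 2 for y in price)
sse = sum((y - f) ** 2 for y, f in zip(price, fitted))
ssr = sst - sse
mse = sse / (n - 2)       # residual mean square, df = n - 2
f_stat = (ssr / 1) / mse  # regression mean square has df = 1

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"SSR = {ssr:.4f}, SSE = {sse:.4f}, SST = {sst:.4f}, F = {f_stat:.2f}")
```

The printed values should agree with the MegaStat output above up to rounding.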
Graphical Presentation
House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted regression line; intercept = 98.248, slope = 0.10977]
house price = 98.24833 + 0.10977 (square feet)
Interpretation of the
Intercept, b0
house price = 98.24833 + 0.10977 (square feet)

b0 is the estimated average value of Y when the value of X is zero
(if x = 0 is in the range of observed x values)

Here, houses with 0 square feet do not occur, so b0 = 98.24833 just
indicates the height of the line.
Interpretation of the
Slope Coefficient, b1
house price = 98.24833 + 0.10977 (square feet)
b1 measures the estimated change in the average value of Y as a result of a one-unit increase in X
Here, b1 = 0.10977 tells us that the value of a house increases by 0.10977($1000) = $109.77, on average, for each additional square foot of size
Least Squares Regression Properties


The simple regression line always passes through the mean of
the y variable and the mean of the x variable
The least squares coefficients are unbiased estimates of β0 and
β1
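The first property can be checked numerically on the house price sample. The sketch below is only an illustration with assumed variable names: it fits the line with the standard least squares formulas, evaluates it at the mean of x, and also sums the residuals, which is zero (up to floating-point error) for any least squares line with an intercept.

```python
# House price sample from the slides: price in $1000s (y) and square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

# Least squares slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Evaluate the fitted line at the mean of x: it should return the mean of y.
predicted_at_mean = b0 + b1 * x_bar

# Residuals (observed minus predicted) should sum to zero.
residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, price)]
residual_sum = sum(residuals)

print(f"mean of x = {x_bar}, mean of y = {y_bar}")
print(f"fitted value at mean of x = {predicted_at_mean:.4f}")
print(f"sum of residuals = {residual_sum:.2e}")
```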
Coefficient of Determination, R²
The percentage of variability in Y that can be explained by variability in X.
Note: In the single independent variable case, the coefficient of determination is
R² = r²
where:
R² = Coefficient of determination
r = Simple correlation coefficient
Examples of
R² Values
[Figure: two plots of y against x with every point on a straight line, one labeled R² = 1, correlation = -1 and the other labeled R² = 1, correlation = +1]
Perfect linear relationship between x and y:
100% of the variation in y is explained by variation in x
Examples of Approximate
R² Values
[Figure: two plots of y against x with points scattered around a line, one labeled 0 < R² < 1, correlation is negative and the other labeled 0 < R² < 1, correlation is positive]
Weaker linear relationship between x and y:
Some but not all of the variation in y is explained by variation in x
Examples of Approximate
R² Values
[Figure: plot of y against x with no trend, labeled R² = 0]
No linear relationship between x and y:
The value of y does not depend on x. (None of the variation in y is explained by variation in x)
Excel Output
Regression Analysis
r²           0.581
r            0.762
Std. Error   41.330
58.08% of the variation in house prices is explained by variation in square feet
The correlation of .762 shows a fairly strong direct relationship.
The typical error in predicting Price is 41.33($000) = $41,330
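The three measures above can be recomputed from the raw sample. The following is a minimal sketch using the usual definitions (r = Sxy / √(Sxx·Syy), R² = r² in simple regression, standard error = √(SSE / (n - 2))); the variable names are assumptions made for illustration.

```python
import math

# House price sample from the slides: price in $1000s (y) and square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_yy = sum((y - y_bar) ** 2 for y in price)  # total sum of squares (SST)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))

# Simple correlation coefficient and coefficient of determination.
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

# Standard error of the estimate: sqrt(SSE / (n - 2)).
sse = s_yy - s_xy ** 2 / s_xx
std_error = math.sqrt(sse / (n - 2))

print(f"r = {r:.3f}, R^2 = {r_squared:.4f}, std. error = {std_error:.3f}")
```

The standard error is in the units of y, so it reads as thousands of dollars here.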
Inference about the Slope:
t Test
t test for a population slope: is there a linear relationship between x and y?
Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
Ha: β1 ≠ 0 (linear relationship does exist)
Obtain the p-value from the ANOVA table or from the slope coefficient row (they are the same in simple regression)
Inference about the Slope:
t Test
(continued)
Estimated Regression Equation:
house price = 98.25 + 0.1098 (sq. ft.)
The slope of this model is 0.1098
Does square footage of the house
affect its sales price?
Inferences about the Slope:
t Test Example
H0: β1 = 0
Ha: β1 ≠ 0
From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039
P-value for the slope: 0.01039
Decision:
Reject H0
Conclusion:
Because the p-value (0.01039) is below 0.05, there is sufficient evidence that square feet is related to house price.
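The t statistic in the printout is the slope divided by its standard error, where the standard error of the slope is the standard error of the estimate divided by √Sxx. The sketch below recomputes it from the sample. It does not compute the p-value (the Python standard library has no t distribution); instead it compares |t| with 2.306, the two-tailed 5% critical value for 8 degrees of freedom taken from a standard t table, which is an assumption brought in for this illustration. It also shows that t² equals the ANOVA F statistic in simple regression.

```python
import math

# House price sample from the slides: price in $1000s (y) and square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate, then standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
std_error = math.sqrt(sse / (n - 2))
se_b1 = std_error / math.sqrt(s_xx)

# Test statistic for H0: beta1 = 0.
t_stat = b1 / se_b1

# Assumed table value: two-tailed 5% critical t for df = n - 2 = 8.
T_CRIT_95_DF8 = 2.306
reject_h0 = abs(t_stat) > T_CRIT_95_DF8

# In simple regression the ANOVA F statistic equals t squared.
f_from_t = t_stat ** 2

print(f"se(b1) = {se_b1:.5f}, t = {t_stat:.5f}, reject H0: {reject_h0}")
print(f"t^2 = {f_from_t:.2f}")
```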
Regression Analysis for Description
Confidence Interval Estimate of the Slope:
Excel Printout for House Prices:
              Coefficient   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833      58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977       0.03297          3.32938   0.01039   0.03374     0.18580
We can be 95% confident that house prices increase, on average, by between $33.74 and $185.80 for a 1 square foot increase.
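The interval in the printout follows the usual form b1 ± t × (standard error of b1). As a small worked check, the sketch below plugs in the slope and standard error from the printout together with 2.306, the two-tailed 5% critical t value for 8 degrees of freedom from a standard t table (an assumption, since the slides do not show it).

```python
# Values copied from the Excel printout for the Square Feet row.
b1 = 0.10977      # estimated slope, in $1000s per square foot
se_b1 = 0.03297   # standard error of the slope

# Assumed table value: two-tailed 5% critical t for df = 8.
T_CRIT_95_DF8 = 2.306

margin = T_CRIT_95_DF8 * se_b1
lower = b1 - margin
upper = b1 + margin

# Convert from $1000s to dollars per additional square foot.
lower_dollars = lower * 1000
upper_dollars = upper * 1000

print(f"95% CI for the slope: {lower:.5f} to {upper:.5f}")
print(f"In dollars per square foot: {lower_dollars:.2f} to {upper_dollars:.2f}")
```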
Estimates of Expected y
for Different Values of x
[Figure: regression line of y against x, showing the estimate ŷp at a given value xp]
The relationship describes how x impacts your estimate of y
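Interval Estimates
for Different Values of x
[Figure: regression line of y against x with the interval around the line marked at xp]
Prediction interval for an individual y, given xp: the farther xp is from x̄ (the mean of x), the less accurate the prediction.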
Example: House Prices
Estimated Regression Equation:
house price = 98.25 + 0.1098 (sq. ft.)
Predict the price for a house
with 2000 square feet
Example: House Prices
(continued)
Predict the price for a house
with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
= 98.25 + 0.1098(2000)
= 317.85
The predicted price for a house with 2000
square feet is 317.85($1,000s) = $317,850
Estimation of Individual Values: Example
Prediction Interval Estimate for y|xp
Find the 95% prediction interval for an individual house with 2,000 square feet
Predicted Price ŷ = 317.85 ($1,000s) = $317,850
MegaStat will give both the predicted value as well as the lower and upper limits (its predicted value, 317.784, uses the unrounded coefficients)
Predicted values for: Price($000)
                            95% Confidence Interval   95% Prediction Interval
Square feet   Predicted     lower       upper         lower       upper
2,000         317.784       280.664     354.903       215.503     420.065
The prediction interval endpoints are $215,503 and $420,065. We can be 95% confident that the price of a 2000 ft² home will fall within those limits.
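Both intervals in the MegaStat output can be reproduced with the standard simple regression formulas: the confidence interval for the mean uses se·√(1/n + (xp - x̄)²/Sxx), and the prediction interval for an individual house uses se·√(1 + 1/n + (xp - x̄)²/Sxx). The sketch below is one way to do that; the variable names and the table value 2.306 (two-tailed 5% critical t, 8 degrees of freedom) are assumptions for illustration.

```python
import math

# House price sample from the slides: price in $1000s (y) and square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
std_error = math.sqrt(sse / (n - 2))

x_p = 2000                # house size to predict for
y_hat = b0 + b1 * x_p     # point prediction in $1000s

# Assumed table value: two-tailed 5% critical t for df = 8.
T_CRIT_95_DF8 = 2.306

# The distance of x_p from the mean of x widens both intervals.
distance_term = (x_p - x_bar) ** 2 / s_xx
ci_margin = T_CRIT_95_DF8 * std_error * math.sqrt(1 / n + distance_term)
pi_margin = T_CRIT_95_DF8 * std_error * math.sqrt(1 + 1 / n + distance_term)

ci_lower, ci_upper = y_hat - ci_margin, y_hat + ci_margin
pi_lower, pi_upper = y_hat - pi_margin, y_hat + pi_margin

print(f"predicted = {y_hat:.3f}")
print(f"95% confidence interval: {ci_lower:.3f} to {ci_upper:.3f}")
print(f"95% prediction interval: {pi_lower:.3f} to {pi_upper:.3f}")
```

The prediction interval is wider because it covers the scatter of an individual house around the line, not only the uncertainty in the line itself.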
Residual Analysis

Purposes





Check for linearity assumption
Check for the constant variability assumption for all levels of predicted Y
Check normal residuals assumption
Check for independence over time
Graphical Analysis of Residuals



Can plot residuals vs. x and predicted Y
Can create a normal probability plot (NPP) of residuals to check for normality (or use Skewness/Kurtosis)
Can check the Durbin-Watson (D-W) statistic to confirm independence
Residual Analysis for Linearity
[Figure: plots of y vs. x with the matching residuals vs. x plots below, for two cases labeled Not Linear and Linear]
Residual Analysis for
Constant Variance
[Figure: plots of y vs. x with the matching residuals vs. x (or vs. Ŷ) plots below, for two cases labeled Non-constant variance and Constant variance]
Residual Analysis for Normality

Can create a normal probability plot (NPP) of residuals to check for normality. If the points fall approximately on a straight line, the residuals are acceptably normal. You can also use Skewness/Kurtosis: if both are within ±1, the residuals are acceptably normal.
Residual Analysis for Independence
Can check the Durbin-Watson (D-W) statistic to confirm independence. If the D-W statistic is greater than 1.3, the residuals are acceptably independent. Needed only if the data is collected over time.
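The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals; it always lies between 0 and 4. A minimal sketch follows, with `durbin_watson` as a hypothetical helper name. It is applied to the house price residuals in the order the houses are listed purely to show the mechanics: that sample is not described as time-ordered, so by the rule above the check would not actually be needed there.

```python
def durbin_watson(residuals):
    """Return the Durbin-Watson statistic for residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# House price sample from the slides (listing order is NOT known to be time order).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, price)]
dw_house = durbin_watson(residuals)

# Rule of thumb from the slides: D-W greater than 1.3 is acceptable.
acceptably_independent = dw_house > 1.3

print(f"D-W = {dw_house:.2f}, acceptably independent: {acceptably_independent}")
```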
Checking Unusual Data Points
Check for outliers from the predicted values (studentized and studentized deleted residuals do this; MegaStat highlights in blue).
Check for outliers on the X-axis; they are indicated by large leverage values, more than twice as large as the average leverage. MegaStat highlights in blue.
Check Cook’s Distance, which measures the harmful influence of a data point on the equation by looking at residuals and leverage together. Cook’s D > 1 suggests potentially harmful data points, and those points should be checked for data entry error. MegaStat highlights in blue based on F distribution values.
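For simple regression, leverage and Cook's Distance have closed forms: leverage is h = 1/n + (x - x̄)²/Sxx, the average leverage is 2/n, and Cook's D is (e² / (2·MSE)) · h / (1 - h)². The sketch below applies the two rules of thumb from this slide to the house price sample. The variable names are illustrative, and MegaStat's own highlighting (which the slide says is based on F distribution values) may flag points differently.

```python
# House price sample from the slides: price in $1000s (y) and square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
p = 2  # number of estimated coefficients: intercept and slope
x_bar = sum(sqft) / n
y_bar = sum(price) / n
s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, price)]
mse = sum(e ** 2 for e in residuals) / (n - p)

# Leverage of each point: how far its x value sits from the mean of x.
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in sqft]
average_leverage = p / n

# Rule from the slides: flag leverage more than twice the average.
high_leverage = [i for i, h in enumerate(leverage) if h > 2 * average_leverage]

# Cook's Distance combines residual size and leverage.
cooks_d = [
    (e ** 2 / (p * mse)) * h / (1 - h) ** 2 for e, h in zip(residuals, leverage)
]

# Rule from the slides: Cook's D > 1 suggests a potentially harmful point.
harmful = [i for i, d in enumerate(cooks_d) if d > 1]

for i in high_leverage:
    print(f"high leverage: house {i + 1}, {sqft[i]} sq. ft., h = {leverage[i]:.3f}")
print(f"largest Cook's D = {max(cooks_d):.3f}, points with D > 1: {harmful}")
```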
Patterns of Outliers
a) Outlier is extreme in both X and Y but not in pattern. The point is unlikely to alter the regression line.
b) Outlier is extreme in both X and Y as well as in the overall pattern. This point will strongly influence the regression line.
c) Outlier is extreme for X and nearly average for Y. The further it is away from the pattern, the more it will change the regression.
d) Outlier is extreme in Y but not in X. The further it is away from the pattern, the more it will change the regression.
e) Outlier is extreme in pattern, but not in X or Y. Slope may not be changed much, but the intercept will be higher with this point included.
Summary



Introduced simple linear regression analysis
Calculated the coefficients for the simple linear regression equation and the measures of strength (r, R² and se)
Summary
(continued)




Described inference about the slope
Addressed prediction of individual values
Discussed residual analysis to address assumptions of
regression and correlation
Discussed checks for unusual data points