11_Simple Linear Regression

Linear regression models
Simple Linear Regression
History
• Developed by Sir Francis Galton (1822-1911) in his article “Regression towards
mediocrity in hereditary stature”
Purposes:
• To describe the linear relationship between two
continuous variables, the response variable (y-axis) and a single predictor variable (x-axis)
• To determine how much of the variation in Y can
be explained by the linear relationship with X
and how much of this relationship remains
unexplained
• To predict new values of Y from new values of X
The linear regression model is:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
• Xi and Yi are paired observations (i = 1 to n)
• β0 = population intercept (the expected value of Yi when Xi = 0)
• β1 = population slope (measures the change in Yi
per unit change in Xi)
• εi = the random or unexplained error associated
with the ith observation. The εi are assumed to be
independent and distributed as N(0, σ²).
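The model above can be made concrete by simulating data from it. The following is a minimal sketch using only the Python standard library; the function name `simulate` and the parameter values are illustrative assumptions, not part of the original slides.

```python
import random

def simulate(xs, beta0, beta1, sigma, seed=0):
    """Draw Y_i = beta0 + beta1 * X_i + eps_i, with eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

# Hypothetical parameter values chosen only for illustration.
xs = [float(i) for i in range(10)]
ys = simulate(xs, beta0=2.0, beta1=0.5, sigma=1.0)

# With sigma = 0 there is no error term, so every point falls on the line.
ys_no_noise = simulate(xs, beta0=2.0, beta1=0.5, sigma=0.0)
```

Each simulated Yi is the sum of a deterministic part (β0 + β1 Xi) and an independent normal error, which is exactly the structure the later estimation steps assume.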
Linear relationship
[Figure: straight line of Y against X, with intercept β0 and slope β1 (the change in Y per 1.0 unit of X)]
Linear models approximate non-linear functions
over a limited domain
[Figure: a linear fit over part of a non-linear function, labelled "interpolation" in the middle region and "extrapolation" on either side]
• For a given value of X, the sampled Y
values are independent with normally
distributed errors:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$\varepsilon_i \sim N(0, \sigma^2) \Rightarrow E(\varepsilon_i) = 0$
$E(Y_i) = \beta_0 + \beta_1 X_i$
[Figure: Y against X, showing the expected values E(Y1) and E(Y2) at X1 and X2]
Fitting data to a linear model:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
$Y_i - \hat{Y}_i = \varepsilon_i$ (residual)
[Figure: observed value Yi and fitted value Ŷi at Xi]
The squared residual
$d_i^2 = (Y_i - \hat{Y}_i)^2$
The residual sum of squares
$RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
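As a rough illustration of the residual sum of squares, the sketch below evaluates RSS for two candidate lines on a small made-up data set. The data values and the function names `residuals` and `rss` are assumptions introduced here for illustration only.

```python
def residuals(xs, ys, b0, b1):
    """Residuals Y_i - Yhat_i for the candidate line Yhat = b0 + b1 * X."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def rss(xs, ys, b0, b1):
    """Residual sum of squares for the candidate line."""
    return sum(d * d for d in residuals(xs, ys, b0, b1))

# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

rss_a = rss(xs, ys, b0=0.05, b1=1.99)  # one candidate line
rss_b = rss(xs, ys, b0=0.0, b1=2.0)    # a nearby candidate line
```

Comparing `rss_a` and `rss_b` shows how RSS ranks candidate lines: the line with the smaller value fits these points more closely.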
Estimating Regression Parameters
• The “best fit” estimates for the regression
population parameters (β0 and β1) are the
values that minimize the residual sum of
squares (SSresidual) between each
observed value and the predicted value of
the model:
Choose $\hat{\beta}_0, \hat{\beta}_1$ to minimize $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i))^2$
Sum of squares
$SS_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})$
Sum of cross products
$SS_{XY} = \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})$
Least-squares parameter estimates
$\hat{\beta}_1 = \dfrac{s_{XY}}{s_X^2} = \dfrac{SS_{XY}}{SS_X}$
where
$SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2$
Sample variance of X:
$s_X^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})$
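Sample covariance:
$s_{XY} = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$
The factor $1/(n-1)$ cancels in the ratio, so:
$\hat{\beta}_1 = \dfrac{s_{XY}}{s_X^2} = \dfrac{SS_{XY}}{SS_X} = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}$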
Solving for the intercept:
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
Thus, our estimated regression equation is:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
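The least-squares estimates above translate almost line for line into code. This is a minimal sketch in plain Python; the function name `fit_line` and the data are hypothetical, introduced only to show the arithmetic.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope: b1 = SS_XY / SS_X, b0 = Ybar - b1 * Xbar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)                        # SS_X
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # SS_XY
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]  # the estimated regression equation applied to each X_i
```

For these made-up points, SS_X = 10 and SS_XY = 19.9, so the slope is 1.99 and the intercept is 6.02 - 1.99 × 3 = 0.05.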
Hypothesis Tests with Regression
• Null hypothesis is that there is no linear
relationship between X and Y:
H0: β1 = 0 ⇒ Yi = β0 + εi
HA: β1 ≠ 0 ⇒ Yi = β0 + β1 Xi + εi
• We can use an F-ratio (i.e., the ratio of
variances) to test these hypotheses
Variance of the error of regression:
$\hat{\sigma}^2 = \dfrac{SS_{residual}}{n-2} = \dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}$
NOTE: this is also referred to as residual
variance, mean squared error (MSE) or
residual mean square (MSresidual)
Mean square of regression:
$MS_{regression} = \dfrac{SS_{regression}}{1} = \dfrac{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}{1}$
The F-ratio is: (MSregression)/(MSresidual)
Under H0, this ratio follows the F-distribution with (1,
n-2) degrees of freedom
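The F-ratio can be computed directly from the two mean squares. The sketch below does so for a small made-up data set; `f_ratio` is a hypothetical helper name, and the p-value lookup against the F-distribution is deliberately left out because it needs a statistics library or a table.

```python
def f_ratio(xs, ys):
    """Return (F, (df_regression, df_residual)) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    ss_reg = sum((f - y_bar) ** 2 for f in fitted)       # regression sum of squares
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual sum of squares
    ms_reg = ss_reg / 1        # one degree of freedom for the slope
    ms_res = rss / (n - 2)     # n - 2 degrees of freedom for the residual
    return ms_reg / ms_res, (1, n - 2)

# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

f_value, dfs = f_ratio(xs, ys)
```

A large F means the variation explained by the line is large relative to the residual variance, which is evidence against H0: β1 = 0.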
Variance components and
Coefficient of determination
$SS_{reg} = SS_Y - RSS$
$SS_Y = SS_{reg} + RSS$
Coefficient of determination
$r^2 = \dfrac{SS_{reg}}{SS_Y} = \dfrac{SS_{reg}}{SS_{reg} + RSS}$
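The partition of the total sum of squares used above is stated without proof. One way to see why it holds for the least-squares line is sketched below, using only the definitions of SS_X, SS_XY and the slope estimate given earlier.

```latex
\begin{aligned}
SS_Y &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2
      = \sum_{i=1}^{n} \bigl[(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\bigr]^2 \\
     &= RSS + SS_{reg} + 2 \sum_{i=1}^{n} (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}).
\end{aligned}

% Since \hat{Y}_i - \bar{Y} = \hat{\beta}_1 (X_i - \bar{X}) and
% Y_i - \hat{Y}_i = (Y_i - \bar{Y}) - \hat{\beta}_1 (X_i - \bar{X}),
% the cross term is proportional to
\sum_{i=1}^{n} (Y_i - \hat{Y}_i)(X_i - \bar{X})
  = SS_{XY} - \hat{\beta}_1 SS_X
  = 0
\quad \text{because } \hat{\beta}_1 = SS_{XY} / SS_X .
```

The cross term vanishes only for the least-squares estimates, so the identity SS_Y = SS_reg + RSS, and with it the interpretation of r² as a proportion of variation explained, depends on fitting by least squares.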
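ANOVA table for regression
| Source | Degrees of freedom | Sum of squares | Mean square | Expected mean square | F ratio |
|---|---|---|---|---|---|
| Regression | 1 | $SS_{reg} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ | $SS_{reg}/1$ | $\sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$ | $\dfrac{SS_{reg}/1}{RSS/(n-2)}$ |
| Residual | n-2 | $RSS = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ | $RSS/(n-2)$ | $\sigma^2$ | |
| Total | n-1 | $SS_Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ | $SS_Y/(n-1)$ | $\sigma_Y^2$ | |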
Product-moment correlation
coefficient
$r = \dfrac{SS_{XY}}{\sqrt{SS_X \, SS_Y}} = \dfrac{s_{XY}}{s_X s_Y}$
Parametric Confidence Intervals
• If we assume our parameter of interest has a particular sampling
distribution and we have estimated its expected value and variance,
we can construct a confidence interval for a given percentile.
• Example: if we assume Y is a normal random variable with unknown
mean μ and variance σ², then $(\bar{Y} - \mu)/\sigma_{\bar{Y}}$, with $\sigma_{\bar{Y}} = \sigma/\sqrt{n}$, is distributed as a
standard normal variable. But, since we don’t know σ, we must
divide by the estimated standard error instead: $(\bar{Y} - \mu)/s_{\bar{Y}}$, with $s_{\bar{Y}} = s_Y/\sqrt{n}$, giving us a t-distribution with (n-1) degrees of freedom.
• The 100(1-α)% confidence interval for μ is then given by:
$\bar{Y} - t_{(1-\alpha/2;\, n-1)} \, s_{\bar{Y}} \le \mu \le \bar{Y} + t_{(1-\alpha/2;\, n-1)} \, s_{\bar{Y}}$
• IMPORTANT: this does not mean “There is a 100(1-α)% chance
that the true population mean μ occurs inside this interval.” It
means that if we were to repeatedly sample the population in
the same way, 100(1-α)% of the confidence intervals would
contain the true population mean μ.
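The repeated-sampling interpretation can be checked by simulation. The sketch below draws many samples from a normal population with a known mean and counts how often the 95% interval contains it. The population values, the function names, and the hard-coded critical value 2.262 (the tabulated t value for 1 - α/2 = 0.975 with 9 degrees of freedom) are assumptions made for this illustration.

```python
import random
import statistics

T_CRIT = 2.262  # tabulated t(0.975; 9), i.e. a 95% interval for samples of size n = 10

def ci_mean(sample, t_crit):
    """Confidence interval for the mean: Ybar +/- t * s_Ybar, with s_Ybar = s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return mean - t_crit * se, mean + t_crit * se

def coverage(mu, sigma, n, reps, t_crit, seed=1):
    """Fraction of repeated samples whose interval contains the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = ci_mean(sample, t_crit)
        hits += lo <= mu <= hi
    return hits / reps

# Hypothetical population: mean 5, standard deviation 2.
cov = coverage(mu=5.0, sigma=2.0, n=10, reps=2000, t_crit=T_CRIT)
```

The returned fraction should be close to 0.95. Any single interval either contains μ or it does not; the 95% describes the procedure over many samples.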
Publication form of ANOVA table
for regression
| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 11.479 | 1 | 11.479 | 21.044 | 0.00035 |
| Residual | 8.182 | 15 | 0.545 | | |
| Total | 19.661 | 16 | | | |
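The entries in this table are linked by simple arithmetic, which the following sketch reproduces from the two sums of squares and their degrees of freedom. The variable names are illustrative; the input numbers are the ones in the table above.

```python
# Values read from the publication-form ANOVA table.
ss_reg, df_reg = 11.479, 1
rss, df_res = 8.182, 15

ms_reg = ss_reg / df_reg   # mean square of regression
ms_res = rss / df_res      # residual mean square (MSE)
f_value = ms_reg / ms_res  # F-ratio with (1, 15) degrees of freedom

ss_total = ss_reg + rss    # total sum of squares
df_total = df_reg + df_res # total degrees of freedom, n - 1
r_squared = ss_reg / ss_total  # coefficient of determination

print(round(ms_res, 3), round(f_value, 3), round(ss_total, 3), round(r_squared, 3))
```

The total of 16 degrees of freedom implies n = 17 observations, and the ratio of sums of squares gives an r² of about 0.58.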
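Variance of estimated intercept
$\hat{\sigma}^2_{\hat{\beta}_0} = \hat{\sigma}^2 \left( \dfrac{1}{n} + \dfrac{\bar{X}^2}{SS_X} \right)$
$\hat{\beta}_0 - t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_0} \le \beta_0 \le \hat{\beta}_0 + t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_0}$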
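Variance of the slope estimator
$\hat{\sigma}^2_{\hat{\beta}_1} = \dfrac{\hat{\sigma}^2}{SS_X}$
$\hat{\beta}_1 - t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_1} \le \beta_1 \le \hat{\beta}_1 + t_{\alpha, n-2} \, \hat{\sigma}_{\hat{\beta}_1}$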
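The variances and confidence intervals for the intercept and slope can be sketched as follows. The data and the function name `parameter_intervals` are hypothetical, and the critical value 3.182 is the tabulated two-tailed 5% t value for n - 2 = 3 degrees of freedom, supplied by hand here as an assumption rather than computed.

```python
def parameter_intervals(xs, ys, t_crit):
    """Estimates, variances and confidence intervals for the intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n - 2)                          # residual variance
    var_b0 = sigma2_hat * (1 / n + x_bar ** 2 / ss_x)   # variance of the intercept
    var_b1 = sigma2_hat / ss_x                          # variance of the slope
    half_b0 = t_crit * var_b0 ** 0.5
    half_b1 = t_crit * var_b1 ** 0.5
    return {
        "b0": b0, "b1": b1, "var_b0": var_b0, "var_b1": var_b1,
        "ci_b0": (b0 - half_b0, b0 + half_b0),
        "ci_b1": (b1 - half_b1, b1 + half_b1),
    }

# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

T_CRIT_3DF = 3.182  # tabulated two-tailed 5% critical value of t with 3 df
result = parameter_intervals(xs, ys, T_CRIT_3DF)
```

Both variances shrink as SS_X grows, so spreading the X values over a wider range tightens the intervals for a given residual variance.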
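Variance of the fitted value
$\hat{\sigma}^2_{(\hat{Y}|X)} = \hat{\sigma}^2 \left( \dfrac{1}{n} + \dfrac{(X_i - \bar{X})^2}{SS_X} \right)$
The interval is centred on the fitted value and bounds the expected value of Y at that X:
$\hat{Y} - t_{\alpha, n-2} \, \hat{\sigma}_{(\hat{Y}|X)} \le E(Y \mid X) \le \hat{Y} + t_{\alpha, n-2} \, \hat{\sigma}_{(\hat{Y}|X)}$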
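Variance of the predicted value
(Ỹ):
$\hat{\sigma}^2_{(\tilde{Y}|\tilde{X})} = \hat{\sigma}^2 \left( 1 + \dfrac{1}{n} + \dfrac{(\tilde{X} - \bar{X})^2}{SS_X} \right)$
The interval is centred on the fitted value at the new $\tilde{X}$, $\hat{Y}_{\tilde{X}} = \hat{\beta}_0 + \hat{\beta}_1 \tilde{X}$, and bounds a new observation Ỹ:
$\hat{Y}_{\tilde{X}} - t_{\alpha, n-2} \, \hat{\sigma}_{(\tilde{Y}|\tilde{X})} \le \tilde{Y} \le \hat{Y}_{\tilde{X}} + t_{\alpha, n-2} \, \hat{\sigma}_{(\tilde{Y}|\tilde{X})}$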
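The difference between the variance of a fitted value and the variance of a predicted value is easiest to see numerically. The following sketch computes both at a chosen X for a small made-up data set; `interval_variances` is a hypothetical helper name.

```python
def interval_variances(xs, ys, x0):
    """Return (variance of the fitted value, variance of a predicted value) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = rss / (n - 2)
    leverage = 1 / n + (x0 - x_bar) ** 2 / ss_x
    var_fitted = sigma2_hat * leverage           # uncertainty in the line itself
    var_predicted = sigma2_hat * (1 + leverage)  # adds the scatter of a new observation
    return var_fitted, var_predicted

# Hypothetical data, not taken from the slides.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

vf_mean, vp_mean = interval_variances(xs, ys, x0=3.0)  # at the mean of X
vf_far, vp_far = interval_variances(xs, ys, x0=6.0)    # beyond the sampled range
```

Both variances are smallest at the mean of X and grow with the squared distance from it, and the predicted-value variance always exceeds the fitted-value variance by the residual variance.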
Regression
[Figure: regression plot with Ln(Island Area) on the x-axis]
Assumptions of regression
• The linear model correctly describes the
functional relationship between X and Y
• The X variable is measured without error
• For a given value of X, the sampled Y
values are independent with normally
distributed errors
• Variances are constant along the
regression line
Residual plot for species-area
relationship
[Figure: residuals plotted against Unstandardized Predicted Value]