Assumptions for Linear Regression

Summary: Basic Concepts of Linear Regression Analysis (one independent variable)
Regression analysis is a statistical technique for modeling and investigating the relationship
between 2 or more variables. For an established relationship, it is used for prediction of the
dependent variable for a given independent variable.
Model for two variables:
$Y = \beta_0 + \beta_1 X + \varepsilon$

In this notation, $\varepsilon$ is a random variable called the random error. (It is the
vertical deviation of a response from the regression line.)
• For any value of $X_i$, $\varepsilon$ is normally distributed.
• For any value of $X_i$, the mean and variance are $\mu_\varepsilon = E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.
• Random errors are mutually independent.
$\mu_{Y|X} = E(Y \mid X) = E(\beta_0 + \beta_1 X + \varepsilon) = \beta_0 + \beta_1 X$
$\beta_0$, $\beta_1$, and $\sigma$ are unknown model parameters.
$\hat{Y} = b_0 + b_1 X$
where $b_0$ and $b_1$ estimate $\beta_0$ and $\beta_1$ in the model.
1. Construct a scatter plot of the sample data and consider the relationship. Is it linear?
Quadratic? Other?
2. Check the assumptions of the model.
1) There is no error in the X-values. They are set prior to the experiment, so the X
variable is not random. This is established by the experimental design.
2) At each X, there is a normal distribution of Y-values that are independent of each
other. The variance $\sigma^2$ of the normal distribution is the same at each X-value.
($s_{LF}$ estimates the standard deviation $\sigma$.) This assumption is referred to as
homoscedasticity. To evaluate it, plot the residuals vs. X (or the residuals vs. the
fits). The plot should show a random scatter of points about 0 on the vertical axis.
3) The random error terms $\varepsilon$ are independent and, for any value of X, have a normal
distribution with mean 0 and variance $\sigma^2$. To check, examine a normal
probability plot of the residuals.
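The residual checks above can be sketched in a few lines of Python. This is only an illustration: the data are made up, the variable names are hypothetical, and the plotting positions (i - 0.5)/n are one common convention assumed here, not something the summary specifies. The sketch produces the coordinates you would plot; it does not draw the plots.

```python
from statistics import NormalDist, mean

# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least squares fit, needed to obtain the residuals.
x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

fits = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fits)]

# Residuals vs. fits: plot these pairs and look for random scatter about 0.
resid_vs_fit = list(zip(fits, residuals))

# Normal probability plot coordinates: sorted residuals against standard
# normal quantiles at the assumed plotting positions (i - 0.5) / n.
n = len(residuals)
normal_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
prob_plot = list(zip(normal_scores, sorted(residuals)))

for score, resid in prob_plot:
    print(f"{score:7.3f}  {resid:7.3f}")
```

If the points of the normal probability plot fall roughly on a straight line, the normality assumption is reasonable; a funnel or curve in the residuals vs. fits plot suggests the constant variance or linearity assumption is in doubt.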
3. Write the equation for $\hat{Y}$ using the least squares method.
4. Examine $R^2$ and $s_{LF}$. What do they tell you about the relationship?
$R^2$ is the coefficient of determination. It is the fraction (often reported as a percent)
of the raw variation in Y accounted for by the fitted equation.
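As a rough illustration of the least squares fit and the coefficient of determination, the following Python sketch computes b0, b1, and R-squared from first principles. The data and variable names are hypothetical; R-squared is computed as SSR/SST, the explained share of the total variation.

```python
from statistics import mean

# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(xs), mean(ys)

# Sums of squares and cross products about the means.
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares estimates of the slope and intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Coefficient of determination: explained variation over total variation.
fits = [b0 + b1 * x for x in xs]
ssr = sum((f - y_bar) ** 2 for f in fits)
sst = sum((y - y_bar) ** 2 for y in ys)
r2 = ssr / sst

print(f"Y-hat = {b0:.3f} + {b1:.3f} X, R^2 = {r2:.3f}")
```

For this made-up data the fitted line is Y-hat = 2.2 + 0.6 X with R-squared of 0.6, so 60% of the raw variation in Y is accounted for by the line.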
6360reg
1
$s_{LF}$ estimates the common standard deviation in Y for a fixed X.
$s_{LF}^2 = \frac{1}{n-2} \sum (Y - \hat{Y})^2 = \frac{1}{n-2} \sum e^2$
5. Test the slope of the line to see if there is a significant relationship between the two
variables.
Test the following hypothesis.
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
$b_1$ has a normal distribution with $\mu_{b_1} = \beta_1$ and
$\mathrm{Var}(b_1) = \frac{\sigma^2}{\sum (X - \bar{X})^2}$
Use $t = \frac{b_1 - \beta_1}{s_{LF} / \sqrt{\sum (X - \bar{X})^2}}$ (df = n-2)
Failure to reject $H_0$ means the data show no significant linear relationship between X and Y.
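The slope test can be sketched as follows. This is a minimal illustration on made-up data with hypothetical variable names; it computes the standard deviation about the line and the t statistic under the null hypothesis of zero slope, and leaves the comparison with the t table (df = n-2) to the reader.

```python
import math
from statistics import mean

# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)

# Least squares fit.
x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

# Standard deviation about the line: sqrt(SSE / (n - 2)).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_lf = math.sqrt(sse / (n - 2))

# Estimated standard error of the slope and the t statistic under H0.
se_b1 = s_lf / math.sqrt(sxx)
beta1_null = 0.0  # hypothesized slope under H0
t_stat = (b1 - beta1_null) / se_b1

print(f"s_LF = {s_lf:.4f}, t = {t_stat:.4f}, df = {n - 2}")
```

Reject the null hypothesis when the absolute value of t exceeds the tabled critical value for n-2 degrees of freedom at the chosen significance level.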
6. Establish a confidence interval estimate of $\beta_1$.
$b_1 \pm t \cdot \frac{s_{LF}}{\sqrt{\sum (X - \bar{X})^2}}$
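A possible sketch of the slope interval in Python is shown below. The function name, the data, and the way the critical value is supplied are assumptions made for illustration: the t value is passed in as read from a t table (3.182 is the two-sided 95% value for 3 degrees of freedom) rather than computed.

```python
import math
from statistics import mean

def slope_confidence_interval(xs, ys, t_crit):
    """Return (lower, upper) for the slope; t_crit comes from a t table, df = n-2."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_lf = math.sqrt(sse / (n - 2))
    half_width = t_crit * s_lf / math.sqrt(sxx)
    return b1 - half_width, b1 + half_width

# Hypothetical sample data, for illustration only (n = 5, so df = 3).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

lower, upper = slope_confidence_interval(xs, ys, t_crit=3.182)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

An interval that contains 0, as it does for this small made-up sample, agrees with failing to reject a zero slope at the same significance level.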
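7. Establish a confidence interval estimate of a predicted Y value. A confidence interval
estimate is an estimate of the mean Y for a fixed value of X. (df = n-2)
$\hat{y} \pm t \, s_{LF} \sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$
Here the X in the numerator is the fixed value at which the estimate is made; the sum in
the denominator runs over the sample X-values.
8. Construct a prediction interval for Y. A prediction interval predicts an additional
single observation of Y for a particular (fixed) value of X. (df = n-2)
$\hat{y} \pm t \, s_{LF} \sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$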
Predict values of Y for the given X. Be careful not to extrapolate too much from the
given data.
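The two intervals for a fixed X can be compared side by side with a short sketch. The function name, the point x0, and the data are hypothetical, and the critical value is again assumed to be read from a t table with n-2 degrees of freedom.

```python
import math
from statistics import mean

def intervals_at(xs, ys, x0, t_crit):
    """Return (y_hat, mean_ci, pred_int) at x0; t_crit is from a t table, df = n-2."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_lf = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x0
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    # Confidence interval for the mean of Y at x0.
    half_mean = t_crit * s_lf * math.sqrt(leverage)
    # Prediction interval for one additional observation at x0.
    half_pred = t_crit * s_lf * math.sqrt(1.0 + leverage)
    return (y_hat,
            (y_hat - half_mean, y_hat + half_mean),
            (y_hat - half_pred, y_hat + half_pred))

# Hypothetical sample data, for illustration only (n = 5, so df = 3).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

y_hat, mean_ci, pred_int = intervals_at(xs, ys, x0=4.0, t_crit=3.182)
print(f"y-hat = {y_hat:.3f}")
print(f"CI for mean Y:       ({mean_ci[0]:.3f}, {mean_ci[1]:.3f})")
print(f"Prediction interval: ({pred_int[0]:.3f}, {pred_int[1]:.3f})")
```

The prediction interval is always wider than the confidence interval at the same X, and both widen as x0 moves away from the mean of the sample X-values, which is one reason to avoid extrapolating far beyond the data.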
ANOVA Approach
The ANOVA approach is an additional method that is used to examine the model.
SST = $\sum (Y - \bar{Y})^2$
Sum of Squares Total: variation of the values around their mean.
SSE = $\sum (Y - \hat{Y})^2$
Sum of Squares Error (residuals, unexplained variation): variation of the values from the
predicted value (for a fixed X). This is random variation, that is, variation that can be
attributed to factors other than the relationship.
SSR = $\sum (\hat{Y} - \bar{Y})^2$
Sum of Squares Regression (explained variation): variation of the predicted values from the
mean. This is variation that can be attributed to the relationship between X and Y.
SST = SSR + SSE
$R^2 = \frac{SSR}{SST}$
$F = \frac{\text{Mean Square Regression}}{\text{Mean Square Error}} = \frac{\text{Explained Variation}}{\text{Unexplained Variation}} = \frac{MSR}{MSE}$
F is the ratio of explained variation to unexplained variation, each divided by its degrees
of freedom. If more variation is explained than unexplained on this basis, F > 1. Use the F
table to check significance.
The F test is used to test the following hypothesis.
$H_0: \beta_1 = 0$
$H_a: \beta_1 \neq 0$
$F = \frac{SSR/1}{s_{LF}^2} = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}$
The information is generally organized in a table as follows.
Source        SS     df     MS           F
Regression    SSR    1      SSR/1        MSR/MSE
Error         SSE    n-2    SSE/(n-2)
Total         SST    n-1
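The entries of the ANOVA table can be computed directly from the definitions. The sketch below is an illustration on made-up data; the function name and the dictionary layout are hypothetical choices, not part of the original summary.

```python
from statistics import mean

def anova_table(xs, ys):
    """Return the ANOVA quantities for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    fits = [b0 + b1 * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)             # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fits))   # unexplained variation
    ssr = sum((f - y_bar) ** 2 for f in fits)           # explained variation

    msr = ssr / 1
    mse = sse / (n - 2)
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "F": msr / mse, "df_error": n - 2}

# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

table = anova_table(xs, ys)
print(f"Regression  SS={table['SSR']:.3f}  df=1  MS={table['MSR']:.3f}  F={table['F']:.3f}")
print(f"Error       SS={table['SSE']:.3f}  df={table['df_error']}  MS={table['MSE']:.3f}")
print(f"Total       SS={table['SST']:.3f}  df={len(xs) - 1}")
```

With one independent variable, the F statistic equals the square of the t statistic for the slope, so the two tests of a zero slope agree.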
See an Excel Example: http://www.uh.edu/~tech132/6303lst.xls
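[Figure: for an observed point (X, Y) at $X_i$, the fitted point (X, $\hat{Y}$), and the mean line $\bar{Y}$, the total deviation $Y - \bar{Y}$ (SST) splits into the residual $Y - \hat{Y}$ (SSE) and the explained part $\hat{Y} - \bar{Y}$ (SSR).]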