Formulas and Relationships from Simple & Multiple Linear Regression
I. Basics for Simple Linear Regression
Let \((X, Y)\) be sample data from a bivariate normal population \((X', Y')\) (technically we have \((x_i, y_i),\ i = 1, \dots, n\), where \(n\) is the sample size, and we will use the notation \(\sum x\) for \(\sum_{i=1}^{n} x_i\)).
Then we have the following sample statistics:
\[
\bar{X} = \frac{\sum x}{n} \quad \text{(sample mean for } X\text{)}
\qquad
\bar{Y} = \frac{\sum y}{n} \quad \text{(sample mean for } Y\text{)}
\]
\[
s_X^2 = \frac{\sum (x - \bar{x})^2}{n-1} \quad \text{(sample variance for } X\text{)}
\qquad
s_Y^2 = \frac{\sum (y - \bar{y})^2}{n-1} \quad \text{(sample variance for } Y\text{)}
\]
We will also use the following “sums of squares”:
\(SSX = \sum (x - \bar{x})^2\) and \(SSY = \sum (y - \bar{y})^2\)
Note: Sometimes we refer to SSY as SST .
We have the following relationships between the sample statistics and the sums of
squares:
\[
SSX = (n-1)\,s_X^2 \quad \text{or} \quad \frac{SSX}{n-1} = s_X^2
\]
and
\[
SSY = (n-1)\,s_Y^2 \quad \text{or} \quad \frac{SSY}{n-1} = s_Y^2
\]
The sample covariance between \(X\) and \(Y\) is defined by
\[
COV(X, Y) = s_{XY} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n-1}
\]
The sample correlation coefficient is defined as
\[
r = COV(z_X, z_Y) = \frac{\sum z_X z_Y}{n-1}
\]
where \(z_X = \dfrac{x - \bar{x}}{s_X}\) and \(z_Y = \dfrac{y - \bar{y}}{s_Y}\). By plugging the formulas for \(z_X\) and \(z_Y\) into the formula for \(r\) we can easily derive that
\[
r = \frac{s_{XY}}{s_X s_Y}.
\]
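For concreteness, here is a minimal Python sketch (NumPy assumed, with made-up illustrative data rather than data from the text) that computes \(r\) both ways and confirms the two formulas agree:

```python
import numpy as np

# Made-up illustrative sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x = np.sqrt(((x - x_bar) ** 2).sum() / (n - 1))   # sample standard deviation of X
s_y = np.sqrt(((y - y_bar) ** 2).sum() / (n - 1))   # sample standard deviation of Y

# Sample covariance s_XY
s_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)

# r from standardized scores: r = sum(z_X * z_Y) / (n - 1)
z_x = (x - x_bar) / s_x
z_y = (y - y_bar) / s_y
r_from_z = (z_x * z_y).sum() / (n - 1)

# r as s_XY / (s_X * s_Y)
r_from_cov = s_xy / (s_x * s_y)

print(r_from_z, r_from_cov)   # the two values agree
```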
The least squares regression line for the data has the form
\[
\hat{y} = bx + a
\]
where
\[
b = r\,\frac{s_Y}{s_X} \quad \text{and} \quad a = \bar{y} - b\,\bar{x}
\]
Associated with the regression we have some additional “sums of squares”:
\[
SSresid = \sum (y - \hat{y})^2 \quad \text{and} \quad SSregress = \sum (\hat{y} - \bar{y})^2
\]
Where, for each value of \(x\) in the sample data, \(y\) is the corresponding \(y\)-coordinate and \(\hat{y}\) is the predicted value from the regression line when \(x\) is used as the predictor (input) to the line.
It is easy to see, both symbolically and graphically, that \(y - \bar{y} = (y - \hat{y}) + (\hat{y} - \bar{y})\).
However, the more interesting result is that:
\[
SST = SSY = SSresid + SSregress
\]
i.e.
\[
SSY = \sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2 = SSresid + SSregress
\]
This expresses the total variation in \(Y\) as the sum of the “explained variation” (\(SSregress\)) and the “unexplained variation” (\(SSresid\)). Remember, the least squares regression line is the line that fits the data in a way that minimizes the unexplained variation. Also note that if the data fits a line perfectly then \(SSresid = 0\).
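A short Python sketch (same made-up data as above) that computes \(b\), \(a\), and the sums of squares, and checks the decomposition numerically:

```python
import numpy as np

# Made-up illustrative sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()
s_x, s_y = x.std(ddof=1), y.std(ddof=1)            # sample standard deviations
s_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)
r = s_xy / (s_x * s_y)

# Least squares line: y_hat = b*x + a
b = r * s_y / s_x
a = y_bar - b * x_bar
y_hat = b * x + a

SST = ((y - y_bar) ** 2).sum()            # total sum of squares (SSY)
SSresid = ((y - y_hat) ** 2).sum()        # unexplained variation
SSregress = ((y_hat - y_bar) ** 2).sum()  # explained variation

# The decomposition SST = SSresid + SSregress holds (up to rounding)
print(SST, SSresid + SSregress)
```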
One can derive that the square of the correlation coefficient can be written in terms of
these sums of squares:
\[
r^2 = \frac{SST - SSresid}{SST} = \frac{SSregress}{SST}
\]
So here again we see that \(r\) measures how strongly the data fits a line, since if the data is perfectly linear then \(r^2 = 1\). \(r^2\) is called the coefficient of determination. Another relationship that can be easily derived from the formula for \(r^2\) is that
\[
1 - r^2 = \frac{SSresid}{SST}
\]
We will define the standard error of the linear regression to be the following:
\[
s_e = \sqrt{\frac{SSresid}{n-2}}
\]
We will use this standard error to define the standard error for random variables
associated with the linear regression and thus to test hypotheses about these random
variables. We will also use the standard error of the linear regression to make inferences
about the predicted values of y for a fixed value of x .
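As a quick numerical check (again a sketch with the made-up data from above), \(r^2\), the relation \(1 - r^2 = SSresid/SST\), and \(s_e\) can be computed directly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b = np.cov(x, y)[0, 1] / x.var(ddof=1)   # slope, equivalent to r * s_Y / s_X
a = y.mean() - b * x.mean()
y_hat = b * x + a

SST = ((y - y.mean()) ** 2).sum()
SSresid = ((y - y_hat) ** 2).sum()

r_squared = (SST - SSresid) / SST
s_e = np.sqrt(SSresid / (n - 2))         # standard error of the regression

print(r_squared, 1 - SSresid / SST)      # these agree
print(s_e)
```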
II. Testing the correlation coefficient.
The population correlation coefficient is represented by the symbol \(\rho\). We will use the sample correlation coefficient, \(r\), obtained from the sample \((X, Y)\) data, to test if there is a significant linear relationship between the population variables \(X'\) and \(Y'\). If one assumes that \(\rho = 0\) then it can be shown that the standard error for the random variable \(r\) is given by
\[
s_r = \sqrt{\frac{1 - r^2}{n-2}}
\]
and that the test statistic
\[
t(r) = \frac{r - \rho}{s_r} = \frac{r}{s_r} = \frac{r\,\sqrt{n-2}}{\sqrt{1 - r^2}}
\]
has a Student-\(t\) distribution with \(n - 2\) degrees of freedom. Thus we can test the null hypothesis, \(H_0: \rho = 0\), against appropriate alternative hypotheses using this test statistic.
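A sketch of this test in Python (SciPy assumed for the Student-\(t\) tail probability; the data are again made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

r = np.corrcoef(x, y)[0, 1]             # sample correlation coefficient

s_r = np.sqrt((1 - r**2) / (n - 2))     # standard error of r under H0: rho = 0
t_r = r / s_r                           # test statistic with n - 2 df

# Two-tailed p-value from the Student-t distribution with n - 2 df
p_value = 2 * stats.t.sf(abs(t_r), df=n - 2)
print(t_r, p_value)
```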
The standard error for r , sr , can also be written in terms of the standard error of linear
regression and the sum of squares total:
\[
s_r = \sqrt{\frac{1 - r^2}{n-2}} = \sqrt{\frac{SSresid}{SST\,(n-2)}} = \frac{s_e}{\sqrt{SST}} = \frac{s_e}{\sqrt{SSY}}
\]
III. Testing the slope of the regression line.
The population regression line slope is represented by the symbol \(\beta\). We will use the slope of the regression line that fits the sample \((X, Y)\) data, \(b\), to make inferences about the slope of the regression line that would fit the population variables \(X'\) and \(Y'\), i.e. \(\beta\). If one assumes that \(\beta = 0\) then it can be shown that the standard error for the random variable \(b\) is given by
\[
s_b = \frac{s_e}{\sqrt{SSX}}
\]
and that the test statistic
\[
t(b) = \frac{b - \beta}{s_b} = \frac{b}{s_b}
\]
has a Student-\(t\) distribution with \(n - 2\) degrees of freedom. Thus we can test the null hypothesis, \(H_0: \beta = 0\), against appropriate alternative hypotheses using this test statistic. When conducting this test we are essentially determining whether the “independent effect” of the variable \(X'\) on the predicted (or criterion) variable \(Y'\) is significant. The independent effect is measured by the slope. That is, a change in \(X'\) produces a proportional change in \(Y'\) as measured by \(\beta\), or \(\Delta Y' = \beta\,\Delta X'\).
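A sketch of the slope test with the same made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b = np.cov(x, y)[0, 1] / x.var(ddof=1)   # least squares slope
a = y.mean() - b * x.mean()
y_hat = b * x + a

SSX = ((x - x.mean()) ** 2).sum()
SSresid = ((y - y_hat) ** 2).sum()
s_e = np.sqrt(SSresid / (n - 2))         # standard error of the regression

s_b = s_e / np.sqrt(SSX)                 # standard error of the slope
t_b = b / s_b                            # t statistic with n - 2 df
print(t_b)
```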
Now here is a very interesting result. You should be able to follow this derivation by
seeing how we are using the formulas above:
\[
t(b) = \frac{b}{s_b} = \frac{b\,\sqrt{SSX}}{s_e}
= r\,\frac{s_y}{s_x}\cdot\frac{\sqrt{SSX}}{s_e}
= r\,\frac{\sqrt{SSY/(n-1)}}{\sqrt{SSX/(n-1)}}\cdot\frac{\sqrt{SSX}}{s_e}
= \frac{r\,\sqrt{SSY}}{s_e}
= \frac{r}{s_e/\sqrt{SSY}}
= \frac{r}{s_r} = t(r)
\]
Thus when testing \(\rho = 0\) or \(\beta = 0\) the value of the \(t\)-statistic will be the same!
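A quick numerical check of this identity (same made-up data as before):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)
a = y.mean() - b * x.mean()
y_hat = b * x + a

SSX = ((x - x.mean()) ** 2).sum()
SSresid = ((y - y_hat) ** 2).sum()
s_e = np.sqrt(SSresid / (n - 2))

t_b = b / (s_e / np.sqrt(SSX))            # t statistic for the slope
t_r = r / np.sqrt((1 - r**2) / (n - 2))   # t statistic for the correlation
print(t_b, t_r)                           # identical up to rounding
```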
IV. Testing the coefficient of determination.
Another test that we will use to test the significance of the linear regression is the \(F\)-test. The \(F\)-test is a test about the population coefficient of determination (\(\rho^2\)). The null hypothesis for the \(F\)-test is \(H_0: \rho^2 = 0\) with alternative hypothesis \(H_a: \rho^2 > 0\). The \(F\) distribution is used to test if two variances from independent populations are equal.
This is done by testing the ratio of the two variances. The F distribution has two degrees
of freedom parameters: the degrees of freedom of the numerator variance and the
degrees of freedom of the denominator variance. In the case of linear regression we will
use the following ratio for our F -test statistic:
\[
\frac{SSregress/k}{SSresid/(n-k-1)}
\]
Where \(k\) is the number of independent predictors (so far our \(k = 1\) but this will change once we get to multiple regression) and \(n\) is the number of data points in the sample. The numerator of the \(F\)-test statistic is the variance in the regression (\(SSregress/k\)) and is also called the mean square regression. The denominator of the \(F\)-test statistic is the variance in the residual (\(SSresid/(n-k-1)\)) and is also called the mean square residual.
We think of the F -test statistic as being the ratio of the explained variance to the
unexplained variance. Since in regression our goal is to minimize the unexplained
variance, then to have a significant result, we would expect the F -test statistic to be
greater than 1.
Consider the case when \(k = 1\) (i.e. simple (or single variable) linear regression). Then
\[
F = \frac{SSregress/k}{SSresid/(n-k-1)} = \frac{SSregress}{SSresid/(n-2)} = \frac{SSregress}{s_e^2}.
\]
Another way to think of what the \(F\)-test statistic measures is to consider a comparison of how well the regression line (\(\hat{y} = bx + a\)) fits the data versus how well the line \(y = \bar{y}\) fits the data. The more the regression line explains the variance in \(Y\) (i.e. the higher the ratio of the variance in the regression to the variance in the residual), the more significant the result. Note that the line \(y = \bar{y}\) has slope equal to zero, so the \(F\)-test is essentially a test about the slope of the regression line (i.e. if the slope of the regression line is zero then the \(F\)-test statistic will be zero since \(SSregress = 0\)). Recall the \(t\)-test statistic for testing \(H_0: \beta = 0\) against appropriate alternative hypotheses:
\[
t(b) = \frac{b}{s_b} = t(r) = \frac{r\,\sqrt{SSY}}{s_e}
\]
Thus we have
\[
t(b)^2 = r^2\,\frac{SST}{s_e^2} = \frac{SSregress}{SST}\cdot\frac{SST}{s_e^2} = \frac{SSregress}{s_e^2} = F
\]
So the F -test statistic is the square of the t -test statistic for the slope of the regression
line. Thus it turns out that in single variable regression, using the F -test statistic is
equivalent to testing if the slope of the regression line is zero! When doing simple linear
regression, if you check your computer output for the p -value for the F -test statistic
and for the t -test statistic for the slope, they will be the same value! There will be an
analogy to this test in multivariable regression!
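A sketch that checks both claims numerically, \(F = t(b)^2\) and the matching \(p\)-values (SciPy assumed, made-up data as before):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k = len(x), 1

b = np.cov(x, y)[0, 1] / x.var(ddof=1)
a = y.mean() - b * x.mean()
y_hat = b * x + a

SSX = ((x - x.mean()) ** 2).sum()
SSresid = ((y - y_hat) ** 2).sum()
SSregress = ((y_hat - y.mean()) ** 2).sum()
s_e2 = SSresid / (n - 2)

F = (SSregress / k) / (SSresid / (n - k - 1))   # F statistic
t_b = b / np.sqrt(s_e2 / SSX)                   # t statistic for the slope

print(F, t_b**2)                                # F equals t(b)^2

# The p-values agree as well
p_F = stats.f.sf(F, k, n - k - 1)
p_t = 2 * stats.t.sf(abs(t_b), n - 2)
print(p_F, p_t)
```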
V. Basics for Multiple Regression
Let \((X_1, X_2, \dots, X_k, Y)\) be sample data from a multivariate normal population \((X_1', X_2', \dots, X_k', Y')\) (technically we have \((x_{i1}, x_{i2}, \dots, x_{ik}, y_i),\ i = 1, \dots, n\), where \(n\) is the sample size, and we will use the notation \(\sum x_k\) for \(\sum_{i=1}^{n} x_{ik}\)).
We will again perform linear regression on the data: i.e. we will use the data to find \(b_i,\ i = 1, \dots, k\), and \(a\) such that
\[
\hat{y} = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
\]
As before we will consider the sum of the squares of the residuals: \(SSresid = \sum (y - \hat{y})^2\), where \(y\) is from the data (i.e., \((x_1, x_2, \dots, x_k, y)\)) and \(\hat{y}\) is the predicted value of \(y\) using the inputs \(x_1, x_2, \dots, x_k\).
The values that we get for \(b_i,\ i = 1, \dots, k\), and \(a\) are such that \(SSresid\) is minimized (note this is the same criterion we had for single variable (i.e. simple) regression). The formulas for \(b_i,\ i = 1, \dots, k\), and \(a\) are somewhat complicated and for now we will compute these via computer technology.
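As one illustration of such “computer technology” (not the text’s own method), the coefficients can be obtained from NumPy’s least squares routine; the data here are made up, with \(k = 2\) predictors:

```python
import numpy as np

# Made-up sample data with k = 2 predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.9, 11.1, 11.8])
n, k = len(y), 2

# Design matrix with a leading column of ones for the intercept a
X = np.column_stack([np.ones(n), x1, x2])

# Least squares solution minimizes SSresid = sum((y - y_hat)^2)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

y_hat = X @ coef
SSresid = ((y - y_hat) ** 2).sum()
print(a, b1, b2, SSresid)
```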
Now, just as in simple linear regression, we can determine how well the regression
equation explains the variance in Y by considering the coefficient of determination:
\[
r^2 = \frac{SST - SSresid}{SST} = \frac{SSregress}{SST}
\]
where \(SSregress\) is defined as in simple linear regression, and the relationship \(SST = SSY = SSresid + SSregress\) is still true!
Testing the coefficient of determination in multiple regression. Again we will use the \(F\)-test. The null hypothesis for the \(F\)-test is \(H_0: \rho^2 = 0\) with alternative hypothesis \(H_a: \rho^2 > 0\). Our \(F\)-test statistic will again be:
\[
\frac{SSregress/k}{SSresid/(n-k-1)}
\]
Where k is the number of independent predictors and n is the number of data points in
the sample. The numerator of the F -test statistic is the variance in the regression
( SSregress / k ) and is also called the mean square regression. The denominator of the
\(F\)-test statistic is the variance in the residual (\(SSresid/(n-k-1)\)) and is also called the
mean square residual. We think of the F -test statistic as being the ratio of the explained
variance to the unexplained variance. Since in regression our goal is to minimize the
unexplained variance and have most of the variance in Y explained by the regression
equation, then to have a significant result, we would expect the F -test statistic to be
greater than 1. Note that an alternative formula for the F -test statistic is
\[
F = \frac{r^2/k}{(1 - r^2)/(n-k-1)}
\]
Can you show this is true??
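The identity can at least be spot-checked numerically (same made-up two-predictor data as above):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.9, 11.1, 11.8])
n, k = len(y), 2

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

SST = ((y - y.mean()) ** 2).sum()
SSresid = ((y - y_hat) ** 2).sum()
SSregress = SST - SSresid
r2 = SSregress / SST

F_from_SS = (SSregress / k) / (SSresid / (n - k - 1))
F_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
print(F_from_SS, F_from_r2)   # the two expressions agree
```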
In the case of simple linear regression we saw that the above hypothesis test was equivalent to the hypothesis test \(H_0: \beta = 0\) with \(H_a: \beta \neq 0\). In the case of multiple regression, the above hypothesis test is equivalent to the following hypothesis test:
\(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\) with alternative hypothesis
\(H_a:\) at least one coefficient is not equal to zero.
Where 1 ,  2 ,...,  k are the slopes of the regression equation that fits the population. So
acceptance of the null hypothesis implies that the explanatory (i.e. predictor, i.e.
independent) variables do not have any significant impact on explaining the variance in
the dependent variable, Y . So to estimate the average value of Y one would not take into
account the values of the independent variables and thus use the sample mean Y as a
point estimate.
Rejection of the null hypothesis does not necessarily mean that all of the coefficients are significantly different from zero, but that at least one of the coefficients is significantly different from zero. Rejection does imply that the regression equation is useful for predicting values of \(Y\) given values of the independent variables. In particular, given values of \(X_i,\ i = 1, \dots, k\), a point estimate for the average value of \(Y\) is \(\hat{Y}\) (i.e. we would take into account the deviations of the \(X_i,\ i = 1, \dots, k\), from their averages to compute the average value of \(Y\)).
Once we have determined that the regression equation significantly explains the variance
in Y then we can run a t -test on each of the independent variable coefficients to test if
they are significantly different from zero. We will then be interested in seeing if we can
get a better model by reducing the full model, dropping independent variables whose coefficients are not significantly different from zero.
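The text leaves these coefficient tests to software; as a sketch only (this formula is not given in the text), the individual \(t\) statistics can be computed from the diagonal of \(s_e^2 (X^T X)^{-1}\), using the same made-up data:

```python
import numpy as np
from scipy import stats

# Made-up data with k = 2 predictors (same as the earlier sketch)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.9, 11.1, 11.8])
n, k = len(y), 2

X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef
SSresid = ((y - y_hat) ** 2).sum()
s_e2 = SSresid / (n - k - 1)                 # mean square residual

# Coefficient standard errors from the diagonal of s_e^2 * (X'X)^(-1)
cov_coef = s_e2 * np.linalg.inv(X.T @ X)
se_coef = np.sqrt(np.diag(cov_coef))

t_stats = coef / se_coef                     # one t statistic per coefficient
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)
for name, t, p in zip(["a", "b1", "b2"], t_stats, p_values):
    print(name, t, p)
```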
More statistics associated with multiple regression:
Then we have the following sample statistics:
\[
\bar{X}_m = \frac{\sum x_m}{n} \quad \text{(sample mean for each predictor variable } X_m,\ m = 1, \dots, k\text{)}
\]
\[
\bar{Y} = \frac{\sum y}{n} \quad \text{(sample mean for } Y\text{)}
\]
\[
s_{X_m}^2 = \frac{\sum (x_m - \bar{x}_m)^2}{n-1} \quad \text{(sample variance for each predictor variable } X_m,\ m = 1, \dots, k\text{)}
\]
\[
s_Y^2 = \frac{\sum (y - \bar{y})^2}{n-1} \quad \text{(sample variance for } Y\text{)}
\]
We will also use the following “sums of squares”:
\(SSX_m = \sum (x_m - \bar{x}_m)^2\) and \(SSY = \sum (y - \bar{y})^2\)
Note: Sometimes we refer to SSY as SST .
We have the following relationships between the sample statistics and the sums of
squares:
\[
SSX_m = (n-1)\,s_{X_m}^2 \quad \text{or} \quad \frac{SSX_m}{n-1} = s_{X_m}^2 \quad \text{for each predictor variable } X_m,\ m = 1, \dots, k
\]
and
\[
SSY = (n-1)\,s_Y^2 \quad \text{or} \quad \frac{SSY}{n-1} = s_Y^2
\]
The sample covariance between \(X_m\) and \(Y\) is defined by
\[
COV(X_m, Y) = s_{X_m Y} = \frac{\sum (x_m - \bar{x}_m)(y - \bar{y})}{n-1}
\]
We will also define the covariance between \(X_i\) and \(X_j\) for \(i \neq j\) as follows:
\[
COV(X_i, X_j) = s_{X_i X_j} = \frac{\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)}{n-1}
\]
Note that when \(i = j\), \(s_{X_i X_j} = COV(X_i, X_j) = COV(X_i, X_i) = s_{X_i}^2\).
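These quantities can be read off a sample covariance matrix; a minimal NumPy sketch with the made-up data from above:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.9, 11.1, 11.8])

# np.cov with rows as variables and the default n - 1 denominator
C = np.cov(np.vstack([x1, x2, y]))

# Diagonal entries are the sample variances s_{X1}^2, s_{X2}^2, s_Y^2;
# off-diagonal entries are the sample covariances, e.g. C[0, 1] = s_{X1 X2}
print(C)
```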