SIMPLE REGRESSION CALCULATIONS

Here are the calculations needed to do a simple regression.
Aside: The word simple here refers to the use of just one x to predict y. Problems
in which two or more variables are used to predict y are called multiple
regression.
The input data are (x1, y1), (x2, y2), …, (xn, yn). The outputs in which we are interested
(so far) are the values of b1 (estimated regression slope) and b0 (estimated regression
intercept). These will allow us to write the fitted regression line Y = b0 + b1 x.
(1)	Find the five sums
$$\sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} y_i, \qquad \sum_{i=1}^{n} x_i^2, \qquad \sum_{i=1}^{n} y_i^2, \qquad \sum_{i=1}^{n} x_i y_i.$$

(2)	Find the five expressions $\bar{x}$, $\bar{y}$,
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}, \qquad S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n}, \qquad S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left( \sum_{i=1}^{n} x_i \right)\left( \sum_{i=1}^{n} y_i \right)}{n}.$$

(3)	Give the slope estimate as $b_1 = \frac{S_{xy}}{S_{xx}}$ and the intercept estimate as $b_0 = \bar{y} - b_1 \bar{x}$.

(4)	For later use, record $S_{yy|x} = S_{yy} - \frac{\left( S_{xy} \right)^2}{S_{xx}}$.
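Steps (1) through (4) translate directly into code. Here is a minimal Python sketch, assuming the data arrive as plain lists of numbers; the function name simple_regression is illustrative, not a standard library routine.

```python
def simple_regression(x, y):
    """Steps (1)-(4): sums, S-quantities, slope, intercept, S_yy|x."""
    n = len(x)
    # Step (1): the five sums.
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Step (2): the five expressions.
    x_bar = sum_x / n
    y_bar = sum_y / n
    S_xx = sum_x2 - sum_x ** 2 / n
    S_yy = sum_y2 - sum_y ** 2 / n
    S_xy = sum_xy - sum_x * sum_y / n
    # Step (3): the slope and intercept estimates.
    b1 = S_xy / S_xx
    b0 = y_bar - b1 * x_bar
    # Step (4): record S_yy|x for later use.
    S_yy_x = S_yy - S_xy ** 2 / S_xx
    return b0, b1, S_xx, S_yy, S_xy, S_yy_x
```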
Virtually all the calculations for simple regression are based on the five quantities found
in step (2). The regression fitting procedure is known as least squares. It gets this name
because the resulting values of b0 and b1 minimize the expression
$$\sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2.$$
This is a good criterion to optimize for many reasons, but understanding these reasons
will force us to go into the regression model.
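One way to see the least squares property concretely is to check the closed-form b0 and b1 against a numerical fit. A quick sketch, assuming NumPy is available and using made-up data; numpy.polyfit with degree 1 minimizes the same sum of squared residuals:

```python
import numpy as np

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data, for illustration only
y = [2.1, 3.9, 6.2, 8.1, 9.8]

b0, b1, *_ = simple_regression(x, y)     # the sketch from step (4) above
slope, intercept = np.polyfit(x, y, 1)   # coefficients, highest power first
print(abs(b1 - slope) < 1e-8, abs(b0 - intercept) < 1e-8)   # True True
```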


As an example, consider a data set with n = 10 and with
$$\sum x_i = 200, \qquad \sum x_i^2 = 4{,}250, \qquad \sum y_i = 1{,}000, \qquad \sum y_i^2 = 106{,}250, \qquad \sum x_i y_i = 20{,}750.$$
It follows that
$$\bar{x} = \frac{200}{10} = 20, \qquad \bar{y} = \frac{1{,}000}{10} = 100,$$
$$S_{xx} = 4{,}250 - \frac{200^2}{10} = 250, \qquad S_{xy} = 20{,}750 - \frac{200 \times 1{,}000}{10} = 750.$$
It follows next that $b_1 = \frac{S_{xy}}{S_{xx}} = \frac{750}{250} = 3$ and $b_0 = \bar{y} - b_1 \bar{x} = 100 - 3(20) = 40$.
The fitted regression line would be given as Y = 40 + 3 x.
We could note also $S_{yy} = 106{,}250 - \frac{1{,}000^2}{10} = 6{,}250$. Then $S_{yy|x} = 6{,}250 - \frac{750^2}{250} = 4{,}000$.
We use $S_{yy|x}$ to get $s$, the estimate of the noise standard deviation. The relationship is
$$s = \sqrt{\frac{S_{yy|x}}{n-2}},$$
and here that value is $\sqrt{\frac{4{,}000}{10-2}} = \sqrt{500} \approx 22.36$.

In fact, we can use these simple quantities to compute the regression analysis of variance
table. The table is built on the identity
SStotal = SSregression + SSresidual
The quantity SSresidual is often named SSerror.
The subscripts are often abbreviated. Thus, you will see reference to SStot , SSregr ,
SSresid , and SSerr .
For the simple regression case, these are computed as
$$SS_{tot} = S_{yy}, \qquad SS_{regr} = \frac{\left( S_{xy} \right)^2}{S_{xx}}, \qquad SS_{resid} = S_{yy} - \frac{\left( S_{xy} \right)^2}{S_{xx}}.$$
The analysis of variance table for simple regression is set up as follows:
Source of     Degrees of   Sum of Squares            Mean Squares                      F
Variation     freedom
----------------------------------------------------------------------------------------------------------
Regression    1            (S_xy)^2 / S_xx           (S_xy)^2 / S_xx                   MS_Regression / MS_Resid
Residual      n - 2        S_yy - (S_xy)^2 / S_xx    [S_yy - (S_xy)^2 / S_xx] / (n - 2)
Total         n - 1        S_yy
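In code, the table is a few lines on top of the S-quantities. A sketch, continuing with the variables from the worked-example snippet above:

```python
SS_regr = S_xy ** 2 / S_xx      # sum of squares for regression
SS_resid = S_yy - SS_regr       # sum of squares for residual
SS_tot = S_yy                   # total sum of squares

MS_regr = SS_regr / 1           # regression has 1 degree of freedom
MS_resid = SS_resid / (n - 2)   # residual has n - 2 degrees of freedom
F = MS_regr / MS_resid

print("Regression  df = 1  SS =", SS_regr, " MS =", MS_regr, " F =", F)
print("Residual    df =", n - 2, " SS =", SS_resid, " MS =", MS_resid)
print("Total       df =", n - 1, " SS =", SS_tot)
```

For the running example this reproduces the 2,250 / 4,000 / 6,250 decomposition with F = 4.50, matching the table shown next.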

For the data set used here, the analysis of variance table would be
Source of     Degrees of   Sum of     Mean
Variation     freedom      Squares    Squares    F
-----------------------------------------------------
Regression    1            2,250      2,250      4.50
Residual      8            4,000        500
Total         9            6,250
Just for the record, let’s note some other computations commonly done for regression.
The information given next applies to regressions with K predictors. To see the forms for
simple regression, just use K = 1 as needed.
The estimate for the noise standard deviation is the square root of the mean square
in the residual line. This is $\sqrt{500} \approx 22.36$, as noted previously. The symbol $s$ is
frequently used for this, as are $s_{Y|X}$ and $s_\varepsilon$.
The $R^2$ statistic is the ratio $\frac{SS_{regr}}{SS_{tot}}$, which is here $\frac{2{,}250}{6{,}250} = 0.36$.
The standard deviation of Y can be given as $\sqrt{\frac{SS_{tot}}{n-1}}$, which is here $\sqrt{\frac{6{,}250}{9}} = \sqrt{694.4444} \approx 26.35$.
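In code, continuing with the variables from the sketches above:

```python
R2 = SS_regr / SS_tot                # 2,250 / 6,250 = 0.36
s_Y = math.sqrt(SS_tot / (n - 1))    # sqrt(6,250 / 9), about 26.35
```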
It is sometimes interesting to compare $s$ (the estimate for the noise standard
deviation) to $s_Y$ (the standard deviation of Y). It can be shown that the ratio of
these is
$$\frac{s}{s_Y} = \sqrt{\frac{n-1}{n-1-K}\left( 1 - R^2 \right)}.$$
The quantity $1 - \left( \frac{s}{s_Y} \right)^2 = 1 - \frac{n-1}{n-1-K}\left( 1 - R^2 \right)$ is called the adjusted $R^2$ statistic, $R^2_{adj}$.
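A last sketch checks these identities numerically for the running example, with K = 1 predictor; the variables continue from the snippets above.

```python
K = 1
ratio = math.sqrt((n - 1) / (n - 1 - K) * (1 - R2))
R2_adj = 1 - (n - 1) / (n - 1 - K) * (1 - R2)

print(abs(s / s_Y - ratio) < 1e-9)   # True: the ratio identity holds
print(R2_adj)                        # 1 - (9/8)(0.64) = 0.28
```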