Chapter 14: Simple Linear Regression
Independent Variable (X)
The variable that is doing the predicting or explaining.
Dependent Variable (Y)
The variable that is being predicted or explained by the regression equation.
Simple Linear Regression
Regression involving only two variables, X and Y. The relationship between the two variables is approximated by a straight line.
Regression Model: Y = β₀ + β₁X + ε
The model describing how the variable (y) is related to the variable (x) in simple linear regression.
Regression Equation: E(y) = β₀ + β₁x
Estimated Regression Equation: Ŷ = b₀ + b₁X (estimated from sample data). Also called the "fitted" line.
Least Squares Method
Define: eᵢ = Yᵢ – Ŷᵢ. "e" is also called the "error" or "residual". The method of least squares estimates the values of b₀ and b₁ such that Σeᵢ² = Σ(Yᵢ – Ŷᵢ)² is minimized.
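The least-squares estimates can be computed directly from the textbook formulas b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ)/Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄. A minimal sketch with made-up numbers (not from the chapter):

```python
# Least-squares fit on a tiny illustrative dataset (values are made up).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar      # the fitted line passes through (x_bar, y_bar)

y_hat = [b0 + b1 * xi for xi in x]                   # fitted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]    # e_i = Y_i - Yhat_i
sse = sum(e ** 2 for e in residuals)                 # quantity being minimized
```

Any other line through the data would give a larger sum of squared residuals than this one.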
Sums of Squares
Total Sum of Squares (SST) = Σ(Yᵢ – Ȳ)²
Sum of Squares due to Regression (SSR) = Σ(Ŷᵢ – Ȳ)²
Sum of Squares due to Error (SSE) = Σ(Yᵢ – Ŷᵢ)²
Note: SST = SSR + SSE
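The identity SST = SSR + SSE can be verified numerically. A sketch on the same kind of made-up data as above:

```python
# Decompose total variation for a small made-up dataset fitted by least squares.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained (residual)

r_squared = ssr / sst   # proportion of variation explained
```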
Coefficient of Determination (r²)
A measure of the proportion of the variation in the dependent variable that is explained by the estimated regression equation. It is a measure of how well the estimated regression equation fits the data.
Correlation Coefficient (r)
A statistical measure of the strength of the linear relationship between two variables.
Mean Square Error (MSE or S²)
The unbiased estimate of the variance, σ², of the error term ε.
Standard Error of the Estimate (S)
The square root of the mean square error, denoted S. It is the estimate of σ, the standard deviation of the error term ε.
Outlier
A data point or observation that is unusual compared to the remaining data.
Influential Observation
An observation that has a strong influence or effect on the regression results.
Leverage
A measure of the influence an observation has on the regression results. Influential observations have high leverage.
Excel Functions
Estimate for β₀ = b₀: INTERCEPT(y-range,x-range)
Estimate for β₁ = b₁: SLOPE(y-range,x-range)
Coefficient of determination (r²): RSQ(y-range,x-range)
Estimate for σ = S = Sy.x (standard error of the estimate): STEYX(y-range,x-range)
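For checking worksheet results by hand, each of these Excel functions can be mirrored with the standard simple-regression formulas. The function names below are chosen to match Excel's and are not a real library API:

```python
import math

def slope(ys, xs):                       # mirrors Excel SLOPE(y-range, x-range)
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    return (sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
            / sum((x - xb) ** 2 for x in xs))

def intercept(ys, xs):                   # mirrors Excel INTERCEPT(y-range, x-range)
    return sum(ys) / len(ys) - slope(ys, xs) * sum(xs) / len(xs)

def rsq(ys, xs):                         # mirrors Excel RSQ: r^2 = 1 - SSE/SST
    b0, b1 = intercept(ys, xs), slope(ys, xs)
    yb = sum(ys) / len(ys)
    sst = sum((y - yb) ** 2 for y in ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / sst

def steyx(ys, xs):                       # mirrors Excel STEYX: S = sqrt(SSE/(n-2))
    b0, b1 = intercept(ys, xs), slope(ys, xs)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(ys) - 2))
```

Note that, like Excel, these take the y-range first and the x-range second.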
LINEST Array Function:
1. Highlight 5 rows and one column per coefficient (the number of Xs plus one for the intercept).
2. Enter the function LINEST(y-range,x-range,,1)
3. Press Control-Shift-Enter
Output:
Row 1 (coefficients): bₖ, bₖ₋₁, ….. b₁, b₀
Row 2 (standard errors of the b's): S bₖ, S bₖ₋₁, ….. S b₁, S b₀
Row 3 (R² and Sy.x): R², Sy.x
Row 4 (F and df): F, df of MSE
Row 5 (SS's): SSR, SSE
Forecasting Y
Point prediction: Ŷ = b₀ + b₁X
Estimated Simple Linear Regression Equation
Ŷ = b₀ + b₁x, the line that minimizes SSE = Σe², where b₀ = the y-intercept and b₁ = the slope of the line.
Review of Assumptions
1. The mean of Y (μy) = β₀ + β₁X
2. For a given X, the Y values follow a normal distribution.
3. The dispersion (variance) of the Y values remains constant everywhere along the line.
4. The error terms (ε) are independent.
Decomposition of Variance
Σ(Y – Ȳ)² = Σ(Ŷ – Ȳ)² + Σ(Y – Ŷ)², i.e. SST = SSR + SSE
SST = Sum of Squares Total = Σ(Y – Ȳ)²
SSR = Sum of Squares Regression = Σ(Ŷ – Ȳ)²
SSE = Sum of Squares due to Error = Σ(Y – Ŷ)² = SST – SSR
ANOVA Table
Source       Sum of Squares   df     Mean Square
Regression   SSR              1      MSR = SSR/1
Error        SSE              n-2    MSE = SSE/(n-2)
Total        SST              n-1
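As a quick check of the table's arithmetic, the entries can be filled in for a small made-up dataset (n = 5, so df are 1, 3, and 4):

```python
# ANOVA-table quantities for a tiny made-up dataset (n = 5).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
b1 = (sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
      / sum((xi - xb) ** 2 for xi in x))
b0 = yb - b1 * xb

sst = sum((yi - yb) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sst - sse

msr = ssr / 1          # df for regression = 1 (one X variable)
mse = sse / (n - 2)    # df for error = n - 2
f = msr / mse          # F statistic for the significance test
```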
F-test
F = MSR/MSE
(Note: MSE = S²y.x, and √MSE = Sy.x = standard error of the estimate)
Coefficient of Determination: r² = SSR/SST
Sample Correlation Coefficient: rxy = (the sign of b₁)√r², where b₁ = the slope of the regression equation
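Attaching the sign of b₁ to √r² recovers exactly the value given by Pearson's direct formula. A sketch on made-up data:

```python
import math

# r recovered from r^2 and the sign of b1, checked against Pearson's formula.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]   # made-up data
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
sxx = sum((xi - xb) ** 2 for xi in x)
syy = sum((yi - yb) ** 2 for yi in y)

b1 = sxy / sxx
r_squared = sxy ** 2 / (sxx * syy)            # same value as SSR/SST here
r = math.copysign(math.sqrt(r_squared), b1)   # attach the sign of the slope

r_direct = sxy / math.sqrt(sxx * syy)         # Pearson's r, computed directly
```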
Hypothesis Testing for β₁:
Given the four assumptions stated earlier, the model for Y is Y = β₀ + β₁X + ε. This regression model is statistically significant only if β₁ ≠ 0. Therefore, the hypotheses for testing whether or not the regression is significant are as follows.
H₀: β₁ = 0;  Hₐ: β₁ ≠ 0
To test the above hypotheses, either a t-test or an F-test may be used.
t Test: t = b₁ / S b₁ (values of b₁ and S b₁ are from the LINEST output)
F Test: F = MSR/MSE; also note that F = t²
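The F = t² relationship can be confirmed numerically; here S b₁ is computed from the standard formula √(MSE/Σ(xᵢ − x̄)²) rather than read off LINEST output, and the data are made up:

```python
import math

# Numeric check that F = t^2 for the slope test in simple regression.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
sxx = sum((xi - xb) ** 2 for xi in x)
b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
b0 = yb - b1 * xb

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
s_b1 = math.sqrt(mse / sxx)        # standard error of b1
t = b1 / s_b1                      # t statistic for H0: beta1 = 0

sst = sum((yi - yb) ** 2 for yi in y)
f = (sst - sse) / 1 / mse          # F = MSR / MSE
```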
Confidence interval for β₁: b₁ ± t α/2 · S b₁, where df for t = df of MSE
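A sketch of the interval on made-up data; the critical value t.025,3 ≈ 3.182 is taken from a t table (df of MSE = n − 2 = 3 here):

```python
import math

# 95% confidence interval for beta_1 on a small made-up dataset.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
sxx = sum((xi - xb) ** 2 for xi in x)
b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
b0 = yb - b1 * xb
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
s_b1 = math.sqrt(mse / sxx)

t_crit = 3.182                     # t_{alpha/2} from a t table, df = 3
lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
```

For this data the interval (about −0.30 to 1.50) contains zero, so at α = .05 the slope is not significant, which agrees with the equivalent t- and F-tests.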
Interval Estimation for Y
Confidence Interval Estimate of an individual Y (Y IND): Ŷ ± t α/2 · S ind, where df for t = df of MSE, and S ind is from special regression output
Confidence Interval Estimate of yp: Ŷ ± t α/2 · S Ŷp, where df for t = df of MSE, and S Ŷp = √(S²ind – MSE)
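The two standard errors can also be computed directly from the textbook formulas S Ŷp = S·√(1/n + (xp − x̄)²/Σ(xᵢ − x̄)²) for the mean of Y and S²ind = MSE + S²Ŷp for an individual Y. A sketch at a hypothetical xp = 4 on made-up data:

```python
import math

# Standard errors for interval estimates at x_p = 4 (data are made up).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xb, yb = sum(x) / n, sum(y) / n
sxx = sum((xi - xb) ** 2 for xi in x)
b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
b0 = yb - b1 * xb
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x_p = 4
y_hat_p = b0 + b1 * x_p                                    # point prediction
s_yhat = math.sqrt(mse * (1 / n + (x_p - xb) ** 2 / sxx))  # for the mean of Y
s_ind = math.sqrt(mse + s_yhat ** 2)                       # for an individual Y
```

S ind is always larger than S Ŷp, which is why a prediction interval for an individual Y is wider than a confidence interval for the mean of Y at the same xp.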
Leverage of observation i: hᵢ = 1/n + (xᵢ – x̄)² / Σ(xᵢ – x̄)²
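The formula depends only on the x values. A sketch on a made-up x column; in simple linear regression the leverages sum to 2 (one for the intercept, one for the slope), and x values farthest from x̄ get the most leverage:

```python
# Leverage of each observation, from h_i = 1/n + (x_i - x_bar)^2 / Sxx.
x = [1, 2, 3, 4, 5]   # made-up x column
n = len(x)
xb = sum(x) / n
sxx = sum((xi - xb) ** 2 for xi in x)
h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]   # one leverage per observation
```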