Linear Regression Handout

EDF 802
Dr. Jeffrey Oescher
Regression Analysis – Linear Regression
I. Introduction
   A. Description
      1. The process of predicting or estimating scores on a Y variable based on knowledge of scores on an X variable (i.e., the regression of Y on X).
      2. The term linear is based on an assumption that the relationship between Y and X can be represented by a straight line. This straight line is the regression line and represents how, on average, a change in the X variable is associated with a change in the Y variable.
   B. Applications
      1. Make predictions from X to Y
      2. Identify the proportion of variance accounted for in the dependent variable (i.e., Y) based on our knowledge of the independent variable (i.e., X)
      3. Estimate residual values on Y having removed the effect of X
   C. Variables
      1. The variable one is predicting from is known as the independent or predictor variable
      2. The variable one is predicting to is known as the dependent or criterion variable
   D. Data
      1. Interval or ratio scales for both variables
   E. Limitations
      1. Addresses only linear data
      2. Uses only one predictor
II. Two major issues to be discussed
   A. Describing the relationship
   B. Inferentially testing the magnitude of the relationship
III. Descriptive linear regression
   A. Purpose – estimating Y from X
   B. The general linear regression equation
      1. Yi′ = bXi + c
         a. Yi′ is the predicted Y score
         b. b is the regression coefficient
         c. Xi is the score of X
         d. c is the Y intercept
      2. An example of a regression equation
         a. Y′ = 0.5X + 2.0
         b. For each unit of change in X we see one-half a unit change in Y
      3. An example of a regression line
         a. Graphing a regression line
         b. The intercept is the value of Y when X is zero (0)
         c. The slope is the amount of change in Y that corresponds to a change of one (1) unit in X
   C. Calculating the regression equation
      1. Scatterplot of sample data – see the attached page
         a. Data set D5 containing logical reasoning scores (X) and creativity scores (Y) for 20 subjects
      2. Least squares criterion – see the attached page
         a. Residuals
            (1) ei = (Yi − Yi′)
            (2) The values of b and c are derived so that Σei² is minimized
      3. Formula for c: c = Ȳ − bX̄
      4. Formula for b: b = r(sy / sx)
      5. SPSS – Windows programming and output
      6. Regression equation for D5 sample data
         a. Y′ = (0.65)X + 5.23
         b. Predicting Y from X using the formula
            (1) If X is 7, Y′ is 9.78
            (2) If X is 12, Y′ is 13.03
            (3) If X is 17, Y′ is 16.28
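The slope and intercept formulas above can be sketched directly in Python. Because the D5 scores appear only on the attached page, the X and Y lists below are illustrative stand-ins, not the handout's data.

```python
import math

# Illustrative stand-in scores (the handout's D5 data are on the attached
# page): X = logical reasoning, Y = creativity.
X = [3, 5, 7, 8, 10, 12, 14, 17]
Y = [7, 9, 10, 10, 12, 13, 14, 16]
n = len(X)

mean_x = sum(X) / n
mean_y = sum(Y) / n
sx = math.sqrt(sum((x - mean_x) ** 2 for x in X) / (n - 1))  # sample sd of X
sy = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / (n - 1))  # sample sd of Y
r = sum((x - mean_x) * (y - mean_y)
        for x, y in zip(X, Y)) / ((n - 1) * sx * sy)         # Pearson r

b = r * (sy / sx)            # formula for b:  b = r(sy/sx)
c = mean_y - b * mean_x      # formula for c:  c = Ybar - b*Xbar

def predict(x):
    """Y' = bX + c"""
    return b * x + c
```

Substituting the handout's fitted coefficients (b = 0.65, c = 5.23) into `predict` reproduces the Y′ values 9.78, 13.03, and 16.28 for X = 7, 12, and 17.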
IV. Inferential testing and linear regression
   A. Purpose – to determine whether the observed relationship between X and Y is of sufficient magnitude to suggest a relationship truly exists
      1. If the relationship is zero, our knowledge of X will not help predict Y
      2. If the relationship is not zero, our knowledge of X will help predict Y
   B. Errors in prediction
      1. An error in prediction can be notationally represented as ei and is equal to (Yi − Yi′)
         a. Errors can be found above (positive) and below (negative) the regression line
         b. The regression line is developed to reduce the sum of all squared errors to a minimum
      2. There is a need to estimate the characteristics of the error terms
         a. Average error
         b. Variation of errors
         c. Shape of the distributions of errors
      3. An estimate of the average error
         a. By definition, the sum of all errors Σei = 0
         b. Thus the mean error is also zero (0)
         c. See the accompanying sample data
      4. Estimates of the variance and standard deviation of the distribution of errors
         a. Conceptually the variance of the distribution of error terms is the sum of the squared deviation scores (i.e., Σ(ei − ē)²) divided by the appropriate degrees of freedom
         b. Since ē (i.e., the mean error) equals 0, this equation can be simplified
         c. The following formula represents the variance of the distribution of the errors: s²y.x = Σe² / (n − 2)
         d. Taking the square root of the variance produces a statistic called the standard error of estimate: sy.x = √(Σe² / (n − 2))
            (1) This represents the “standard deviation” of the distribution of error terms
            (2) It is critically important to the inferential analyses related to linear regression
                (a) Calculating confidence intervals for estimating the range of predicted values of Y
                (b) Calculating the error term for the test statistic used to examine the significance of the regression coefficient
            (3) If the relationship between the two variables is high, the standard error of estimate is small; if the relationship is low, the standard error of estimate is large
            (4) Conditional distributions
                (a) Distributions of actual scores around the predicted scores
                (b) Homoscedasticity
                    i) The variation associated with all conditional distributions is the same
                    ii) This is a major assumption of linear regression
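The two properties of the error terms described above (Σei = 0, and the standard error of estimate sy.x) can be checked numerically. The data below are again illustrative stand-ins, not the handout's sample.

```python
import math

# Illustrative data, not the handout's D5 set.
X = [3, 5, 7, 8, 10, 12, 14, 17]
Y = [7, 9, 10, 10, 12, 13, 14, 16]
n = len(X)

mean_x = sum(X) / n
mean_y = sum(Y) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
ssx = sum((x - mean_x) ** 2 for x in X)
b = sxy / ssx                 # least-squares slope (equivalent to r*sy/sx)
c = mean_y - b * mean_x

# Residuals e_i = Y_i - Y'_i; for a least-squares line they sum to zero,
# so the mean error is zero as well.
e = [y - (b * x + c) for x, y in zip(X, Y)]

# Standard error of estimate: s_yx = sqrt(sum(e^2) / (n - 2))
s_yx = math.sqrt(sum(err ** 2 for err in e) / (n - 2))
```

The stronger the X–Y relationship, the smaller `s_yx` comes out, mirroring point (3) above.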
   C. Testing the significance of the regression coefficient
      1. The concern related to the need for inferential analysis
         a. If the relationship is zero (0), the regression coefficient b is zero (0) and our knowledge of X will NOT help predict Y
         b. Thus the issue becomes how different from zero (0) must the regression coefficient be in order to statistically enhance the prediction of Y?
      2. The relationship between linear regression and correlation
         a. If r = ±1.00 there are no residual errors
         b. If r = 0 then there is no regression line
            (1) If r = 0 then the formula for b is equal to zero (0) (see Section III C 4 above)
            (2) This suggests the slope of the regression line is 0 (i.e., it is parallel to the X axis), which in turn implies that regardless of the value of X, Y′ will be equal to the intercept c: c = Ȳ − bX̄ = Ȳ − 0·X̄ = Ȳ
            (3) If r = 0 the best prediction of Y′ is Ȳ
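The r = 0 case can be demonstrated with a toy data set deliberately constructed so that the X and Y deviations are uncorrelated; these numbers are illustrative, not from the handout.

```python
# A toy set constructed so that r = 0; illustrative values only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 2, 4, 2]
n = len(X)

mean_x = sum(X) / n
mean_y = sum(Y) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # 0, so r = 0
b = sxy / sum((x - mean_x) ** 2 for x in X)   # slope is 0 when r = 0
c = mean_y - b * mean_x                       # c = Ybar - 0*Xbar = Ybar

# With r = 0 the line is flat: every prediction is simply Ybar.
predictions = [b * x + c for x in X]
```

Regardless of X, each entry of `predictions` equals the mean of Y, i.e. the best prediction of Y′ is Ȳ.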
   D. Inferential logic
      1. Hypotheses
         a. H0: β = 0
         b. H1: β ≠ 0
      2. Test statistic and sampling distribution
         a. t = b / sb
         b. sb is the standard error of the regression coefficient and is computed as follows: sb = sy.x / √SSx
         c. The statistic is distributed as a t-distribution with n − 2 df
         d. Assumptions
            (1) Random selection
            (2) Normal distributions of Y's for each X
            (3) Homoscedasticity
   E. An alternative method for evaluating this inferential question is to test the null hypothesis ρ = 0
V. Determining residual values
   A. Residuals represent the value of Y having removed the effect of X
   B. Useful in multiple regression contexts
VI. Numerical example
   A. SPSS – Windows programming
   B. Output
   C. Interpreting the output
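As a stand-in for the SPSS run in the numerical example, the full t test of H0: β = 0 can also be computed by hand. The data are illustrative, not the handout's output.

```python
import math

# Illustrative data, not the handout's SPSS example.
X = [3, 5, 7, 8, 10, 12, 14, 17]
Y = [7, 9, 10, 10, 12, 13, 14, 16]
n = len(X)

mean_x = sum(X) / n
mean_y = sum(Y) / n
ssx = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / ssx
c = mean_y - b * mean_x

# Standard error of estimate, then the standard error of b:
# s_b = s_yx / sqrt(SSx)
e = [y - (b * x + c) for x, y in zip(X, Y)]
s_yx = math.sqrt(sum(err ** 2 for err in e) / (n - 2))
s_b = s_yx / math.sqrt(ssx)

# Test statistic for H0: beta = 0, distributed as t with n - 2 df
t = b / s_b

# The alternative route (testing H0: rho = 0) uses the correlation r;
# with one predictor both tests yield the same t value.
sx = math.sqrt(ssx / (n - 1))
sy = math.sqrt(sum((y - mean_y) ** 2 for y in Y) / (n - 1))
r = b * sx / sy              # since b = r(sy/sx)
```

Comparing `t` against a t-distribution with n − 2 degrees of freedom gives the p-value that SPSS reports for the coefficient.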