Simple Linear Regression Using Ordinary Least Squares
Purpose: To approximate a linear relationship with a line.
Reason: We want to be able to predict Y using X.
Definition: The Least Squares Regression (LSR) line is the line with the smallest sum of
squared residuals of any possible line.
That is, we want to measure closeness of the line to the points.
The LSR line uses vertical distance from points to a line.
A residual is the vertical distance from a point to a line.
Equations:
The true population model (this is the equation we are trying to estimate):
y = α + βx + ε
From a sample we estimate the fitted line:
ŷ = a + bx
a = estimate of α, the y-intercept.
b = estimate of β, the slope.
x = the independent variable.
ε = epsilon, the error term; these are random errors of measurement. (The fitted line ŷ
carries no error term: the residuals y − ŷ estimate the errors.)
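To make the estimation concrete, here is a minimal Python sketch of the least squares
formulas for a and b; the function name ols_fit and the use of NumPy are our choices,
not part of the handout.

```python
import numpy as np

def ols_fit(x, y):
    """Return (a, b) for the fitted line y-hat = a + b*x."""
    x_bar, y_bar = x.mean(), y.mean()
    # b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # The LSR line always passes through (x_bar, y_bar).
    a = y_bar - b * x_bar
    return a, b
```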
Check the Assumptions:
1. No outliers.
2. Residuals follow a normal distribution with a mean = 0.
3. Residuals should be randomly scattered.
Note: For #2, produce a histogram of the standardized residuals to see if they are
normal.
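A sketch of that check, assuming matplotlib is available. Standardizing here simply
divides the residuals by their sample standard deviation, a simplification of the
leverage-adjusted standardized residuals statistical software reports.

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_histogram(y, y_hat):
    """Histogram of standardized residuals; look for a bell shape centered at 0."""
    residuals = y - y_hat
    standardized = residuals / residuals.std(ddof=1)
    plt.hist(standardized, bins=10)
    plt.xlabel("Standardized residual")
    plt.ylabel("Frequency")
    plt.show()
```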
Hypotheses for the Correlation Coefficient (r): A measure of the strength of the linear
relationship, i.e., how close the points are to the regression line.
H0: ρ = 0
H1: ρ ≠ or < or > 0
Coefficient of Determination = R²
Range: 0 ≤ R² ≤ 1
R² is an effect size measure: it gives the percentage of the variation in the Y values
explained by X.
Adjusted R²: Adjusts for R²'s upward bias and is a variance-accounted-for effect size
measure.
Example: Adj. R² = .40 means that 40% of the variation in Y is explained by its linear
dependence on X.
R² = SSR / SST
where SSR = Sum of Squares Regression (what is explained by the regression line) and
SST = Sum of Squares Total (the variability of the y values around their mean,
Σ(y − ȳ)²).
SEE: The square root of the Mean Square Residual is the same as the Standard Error
of the Estimate, which is the amount of error in the model measured in DV units. The
higher the Adj. R² value, the smaller the amount of error in the model (i.e., the smaller
the value of the SEE) and the more stable the model will be upon replication.
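Putting R², adjusted R², and the SEE together, a minimal sketch for simple regression
(one predictor, so k = 1; the function name fit_summary is ours):

```python
import numpy as np

def fit_summary(y, y_hat, k=1):
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)                  # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)               # total variation
    r2 = 1 - sse / sst                              # = SSR / SST
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # corrects R2's upward bias
    see = np.sqrt(sse / (n - k - 1))                # sqrt of Mean Square Residual
    return r2, adj_r2, see
```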
F Test Hypotheses:
H0: The regression does not explain a significant proportion of the variance in Y.
H1: The regression does explain a significant proportion of the variance in Y.
ANOVA Results:
How close the points are to the line:
1. SSE = Sum of Squared Errors (also called the Sum of Squared Residuals): Variation
attributed to factors other than the relationship between X and Y.
2. SST = Sum of Squares Total: A measure of the total variability of the Y values
around their mean (i.e., how much the Y values vary).
3. SSR = Sum of Squares Regression: The explained variation attributed to the
relationship between X and Y.
Note: We want a large SSR and a small SSE. That is, the points are close to the
line and the line does a good job of predicting.
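The ANOVA F test can be computed directly from these sums of squares. A sketch,
assuming SciPy is available for the p-value (in simple regression the degrees of
freedom are 1 and n − 2):

```python
import numpy as np
from scipy import stats

def regression_f_test(y, y_hat):
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    ssr = sst - sse                    # SST = SSR + SSE
    f = (ssr / 1) / (sse / (n - 2))    # MS Regression / MS Residual
    p = stats.f.sf(f, 1, n - 2)        # right-tail p-value
    return f, p
```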
Hypotheses for the Slope:
H0: β = 0 (no linear relationship)
H1: β ≠ or < or > 0 (linear relationship)
Test statistic: based on the sampling distribution of b (the estimated slope):
1. The distribution of b is normal.
2. Mean = β.
3. S.D. (the standard error of b):
s_b = s_e / √Σ(x − x̄)²
where s_e = √(SSE / (n − 2)) is a measure of the variation of the points around the
line (the denominator is n − 2 because two parameters, a and b, are being estimated).
The test statistic is:
t = b / s_b
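A sketch of the slope test using these formulas (two-tailed version of H1; ols_fit is
the earlier sketch and SciPy is again assumed for the p-value):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    n = len(x)
    a, b = ols_fit(x, y)                               # from the earlier sketch
    y_hat = a + b * x
    s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # variation around the line
    s_b = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of b
    t = b / s_b
    p = 2 * stats.t.sf(abs(t), n - 2)                  # two-tailed p-value
    return t, p
```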
Hypotheses for the Intercept:
H0: α (y-intercept) = 0
H1: α ≠ or < or > 0
t = a / s_a
where s_a is the standard error of the intercept.
Confidence Intervals for B (Unstandardized Coefficient)
B ± (1.645, 1.96, or 2.58) × s_B
where 1.645, 1.96, and 2.58 are the z critical values for 90%, 95%, and 99%
confidence, and s_B is the standard error of B.
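A one-line sketch of that interval (the function name slope_ci is ours). The handout
uses z critical values; with small samples, software typically substitutes a t critical
value on n − 2 degrees of freedom.

```python
def slope_ci(b, s_b, z=1.96):
    """Confidence interval for the slope: b +/- z * s_b."""
    return b - z * s_b, b + z * s_b
```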
Simple Linear Regression: Example 1
The linear model assumes that the relations between two variables can be
summarized by a straight line.
The X variable is often called the predictor and Y is often called the criterion. We
often talk about the regression of Y on X, so that if we were predicting GPA from
SAT we would talk about the regression of GPA on SAT.
The regression problems that we deal with will use a line to transform values of X
to predict values of Y. In general, not all of the points will fall on the line, but we
will choose our regression line so as to best summarize the relations between X
and Y.
Suppose we measured the height and weight of a random sample of 10 adults in
DeKalb. We want to predict weight from height in the population.
Ht     Wt
61     105
62     120
63     120
65     160
65     120
68     145
69     175
70     160
72     185
75     210

N = 10
Mean:          Ht = 67,     Wt = 150
Variance (s²): Ht = 20.89,  Wt = 1155.5
SD (s):        Ht = 4.57,   Wt = 33.99
Correlation (r) = .94
For the regression of weight on height, we found:
Ŷ = -316.86 + 6.97(x), where -316.86 is the intercept (a) and 6.97 is the slope
(b). We could also write that predicted weight is -316.86 + 6.97(height). The slope
means that for each one-inch increase in height, we expect an increase of
approximately 7 pounds in weight. The intercept is the value of Y that we expect
when X is zero. So a person 0 inches tall would be predicted to weigh -316.86
pounds (i.e., 6.97 × 0 = 0; -316.86 + 0 = -316.86). Of course we do not find people
who are zero inches tall, and we do not find people with negative weight.
Sometimes, in educational research, the value of the intercept will have no
meaningful interpretation.
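Running the earlier ols_fit sketch on the handout's height/weight data reproduces
these coefficients:

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
a, b = ols_fit(ht, wt)
print(round(a, 2), round(b, 2))   # -316.86 6.97
```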
Simple Linear Regression: Example 2
A. Predicted:
Self-destructiveness = -108.92 + 22.33 Alcohol
Unstandardized Regression Coefficients
22.33 Alcohol: We predict a 22.33-point increase in self-destructiveness for a
one-point increase in Alcohol when all other variables are held constant.
B. Predicted:
Self-destructiveness = 0.49 Alcohol
Standardized Regression Coefficients
0.49 Alcohol: We predict a 0.49 standard deviation increase in self-destructiveness
for a one standard deviation increase in Alcohol when all other variables are held
constant.
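In simple regression the standardized coefficient is just the unstandardized one
rescaled by the ratio of the variables' standard deviations (and it equals r). A
sketch of the conversion (the function name is ours):

```python
import numpy as np

def standardize_coefficient(b, x, y):
    """beta = b * (s_x / s_y); in simple regression this equals r."""
    return b * x.std(ddof=1) / y.std(ddof=1)
```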
Simple Linear Regression: Example 3
The linear model tells us that each observed Y is composed of two parts, (1) a
linear function of X, and (2) an error. We can use the regression line to predict
values of Y given values of X. For any given value of X, we go straight up to the
line, and then move horizontally to the y-axis to find the corresponding value of Y.
This value on the line is called the predicted value of Y and is denoted Y'. The
difference between the observed Y and the predicted Y (Y − Y') is called a residual.
The predicted Y is the linear part; the residual is the error.
N     Ht      Wt       Y'       Residual
1     61      105      108.19   -3.19
2     62      120      115.16   4.84
3     63      120      122.13   -2.13
4     65      160      136.06   23.94
5     65      120      136.06   -16.06
6     68      145      156.97   -11.97
7     69      175      163.94   11.06
8     70      160      170.91   -10.91
9     72      185      184.84   0.16
10    75      210      205.75   4.25

Mean      67      150      150.00    0.00
SD        4.57    33.99    31.85     11.89
Variance  20.89   1155.56  1014.37   141.32
Compare the numbers in the table for person 5 (height = 65, weight = 120) to the
same person on the graph. The value of the regression line at X = 65 is 136.06. The
difference between the mean of Y (150) and 136.06 is the part of Y due to the linear
function of X. The difference between the line and the observed Y is -16.06. This is
the error part of Y, the residual.
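The Y' and Residual columns, and the person-5 decomposition, can be reproduced with
the earlier ols_fit sketch:

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
a, b = ols_fit(ht, wt)
y_hat = a + b * ht
residual = wt - y_hat
# Person 5 (index 4): predicted 136.06, residual -16.06
print(round(y_hat[4], 2), round(residual[4], 2))
# Explained part for person 5: Y' - mean(Y) = 136.06 - 150 = -13.94
print(round(y_hat[4] - wt.mean(), 2))
```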