Lesson 9—Least Squares Regression

Chapter 10: Regression
We are interested in predicting how many publications a faculty
member has based on the number of years that have passed since
completing his/her PhD
We can do this by using regression!
Regression: “the prediction of one variable from knowledge of one or
more other variables”
In this class, we will limit ourselves to linear regression—“regression in
which the relationship is linear”
We’ve already seen that scatterplots visually convey the relationship
between 2 quantitative variables
On a scatterplot, we could draw a straight line through the data
points that approximates the relationship between the Y & X variables
It is only useful to draw such a line when the X variable is thought
to explain, cause, or predict the Y variable
In this case, the X variable is called an explanatory variable, & the Y
variable is called a response variable
We sampled 20 Miami faculty & recorded the years since receiving their
PhD (“years”) & the number of publications they have (“pubs”). The
data and a scatterplot are shown below:
Years   Pubs        Years   Pubs
  3       7           8      19
  6       3           6      11
  3       4           6       8
  8      17           2       3
  9      11           1       4
  6       6           4      15
 16      24           5       9
 10      29          12      30
  2       9          11      31
  5      18
  5      19

[Scatterplot: PUBS (y-axis, 0 to 40) plotted against YEARS (x-axis, 0 to 18), with the fitted regression line]
We asked SPSS to place a line on the scatterplot that represents the
relationship between years since PhD (X variable) and publications
(Y variable)
Much of the remainder of this lecture will be a discussion of how we find
that line and why it is useful
Finding the ‘Best’ Regression Line
When you observe a scatterplot, you can ‘guess’ which line best
summarizes the relationship between Y and X
However, this method is highly subjective from person to person,
and also might be affected by the way the scatterplot is
constructed
Thus, we have mathematical ways to determine the best line
The Least-Squares Regression Line
A ‘good’ regression line comes as close as possible to all the data
points in the scatterplot
The points along the regression line represent our best predictions for
the value of the Y variable at each level of the X variable
In this case, the points along the line represent our predictions for the
# of pubs a faculty member will have, given a specific # of yrs
since completing the PhD
You’ll notice that very few of the actual data points fall on the line,
but most are fairly “close” to the line
Because we would like to predict the Y variable from the X
variable, we would like the (vertical) distance between the
points on the graph and the line to be as small as possible
The vertical distance between the predicted value (the point on the line)
and the observed (actual) value is called an error or a residual
The error or residual is the difference between the observed value
and the predicted value. Residuals are found by the following
equation:
residual = y − ŷ
where ŷ is the “predicted value” and y is the observed value
The best regression line is the one that has the smallest residuals
One common way to obtain the smallest residuals is through the
least-squares approach
The least-squares regression line is the one that makes the sum of the
squared vertical distances between the data points & the line [residuals] as
small as possible
The equation for the least-squares regression line is:
ŷ = bX + a
ŷ: predicted value of the response variable (Y)
X: explanatory variable
a: intercept; the predicted value of Y when X = 0
b: the slope; the change in the predicted value with a 1-unit increase in X
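The lecture relies on SPSS for the calculations, but for concreteness here is a minimal Python sketch (ours, not part of the lecture) that applies the standard least-squares formulas, b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − bX̄, to the 20 faculty sampled earlier:

```python
# Least-squares slope and intercept for the years/pubs data,
# computed from the standard formulas.
years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(pubs) / n

# b = sum of cross-products / sum of squares of X; a = y-bar minus b * x-bar
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, pubs)) \
    / sum((x - mean_x) ** 2 for x in years)
a = mean_y - b * mean_x

print(round(b, 3), round(a, 3))  # 1.863 1.927
```

Both values match the SPSS output shown below.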
By knowing the regression line, we can predict the values of the
response variable for a given level of the explanatory variable
The regression output from SPSS:
Coefficients(a)

                  Unstandardized          Standardized
                  Coefficients            Coefficients
Model             B         Std. Error    Beta          t        Sig.
1   (Constant)    1.927     2.705                       .713     .485
    YEARS         1.863     .366          .768          5.090    .000

a. Dependent Variable: PUBS
The regression equation is:
ŷ = 1.863X + 1.927
We can predict ŷ at a given value of X simply
by plugging that value into the equation
X      ŷ = 1.863X + 1.927
3            7.516
6           13.105
3            7.516
8           16.831
9           18.694
6           13.105
16          31.735
10          20.557
2            5.653
5           11.242
5           11.242
8           16.831
6           13.105
6           13.105
2            5.653
1            3.790
4            9.379
5           11.242
12          24.283
11          22.420
Notice that our actual values of Y are fairly
close on average to the predicted values of Y
X      ŷ = 1.863X + 1.927      Y       Y − ŷ
3            7.516              7       -.516
6           13.105              3     -10.105
3            7.516              4      -3.516
8           16.831             17        .169
9           18.694             11      -7.694
6           13.105              6      -7.105
16          31.735             24      -7.735
10          20.557             29       8.443
2            5.653              9       3.347
5           11.242             18       6.758
5           11.242             19       7.758
8           16.831             19       2.169
6           13.105             11      -2.105
6           13.105              8      -5.105
2            5.653              3      -2.653
1            3.790              4        .210
4            9.379             15       5.621
5           11.242              9      -2.242
12          24.283             30       5.717
11          22.420             31       8.580
                               Sum:      .000
You’ll notice that the sum of the residuals is zero; positive and negative
errors cancel, so the raw sum cannot tell good lines from bad ones. This is
why we find the regression line that minimizes the sum of the squared residuals
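To see this numerically, here is a small Python sketch (ours, not the lecture's) that recomputes the residuals from the fitted line:

```python
# Residual = observed Y minus predicted Y, using the fitted line above.
years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

residuals = [y - (1.863 * x + 1.927) for x, y in zip(years, pubs)]

print(round(sum(residuals), 2))                  # ~0.0: the residuals cancel out
print(round(sum(r * r for r in residuals), 1))   # ~657.8: the quantity least squares minimizes
```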
We can use the regression equation to make predictions:
So, if I wanted to predict how many publications a faculty member
would have who completed his/her PhD 15 years ago:
ŷ = 1.863X + 1.927
ŷ = 1.863(15) + 1.927 = 29.872
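As a sketch, the same prediction in Python (the helper name is ours):

```python
def predict_pubs(years_since_phd):
    # Evaluates the fitted line y-hat = 1.863X + 1.927
    return 1.863 * years_since_phd + 1.927

print(round(predict_pubs(15), 3))  # 29.872
```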
Accuracy in Prediction
We can always construct a regression line. The critical issue is—how
well does that line actually predict the Y values from the X values?
The “error” in our predictions is captured by the following:
S Y Yˆ 
2
ˆ
(
Y

Y
)

N 2
This is the standard error of the estimate: essentially the square root of the
average of the squared deviations about the regression line (averaging over
N − 2 rather than N)
It is the standard deviation of the errors we make in prediction
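A minimal Python sketch (ours) of this formula, using the data and fitted line from earlier; it reproduces the "Std. Error of the Estimate" (6.0453) that SPSS reports below:

```python
from math import sqrt

years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

n = len(years)
residuals = [y - (1.863 * x + 1.927) for x, y in zip(years, pubs)]

# Standard error of the estimate: sqrt of summed squared residuals over N - 2
see = sqrt(sum(r * r for r in residuals) / (n - 2))
print(round(see, 3))  # ~6.045
```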
Regression and Correlation
There is a conceptual relationship between correlation and regression
Specifically, if we square the correlation coefficient (r) we find the
“fraction of the variation in the values of y that is explained by the
least-squares regression of y on x”
r2 = proportion of variance in Y explained by relationship with X
Model Summary

Model    R         R Square    Adjusted R Square    Std. Error of the Estimate
1        .768(a)   .590        .567                 6.0453

a. Predictors: (Constant), YEARS
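For reference, a Python sketch (ours) that computes r from the standard sums-of-squares formula and squares it; it matches the R (.768) and R Square (.590) entries above:

```python
from math import sqrt

years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

n = len(years)
mean_x, mean_y = sum(years) / n, sum(pubs) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, pubs))
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in pubs)

r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
print(round(r, 3), round(r ** 2, 3))  # 0.768 0.590
```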
Hypothesis Testing and Regression
If X can reliably predict Y, then there will be a non-zero slope
Thus, we can test the following hypotheses:
H0: β = 0
H1: β ≠ 0
β (beta) is the population counterpart of b
These hypotheses are tested with a t test
Conceptually, we take b and divide it by the standard error of b
We will allow SPSS to do these calculations for us
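As a sketch of what SPSS is doing (our Python, using the standard formula SE(b) = S_(Y−Ŷ) / √Σ(X − X̄)²):

```python
from math import sqrt

years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31]

n = len(years)
mean_x = sum(years) / n
b, a = 1.863, 1.927  # coefficients from the SPSS output

# Standard error of the estimate, then the standard error of the slope
see = sqrt(sum((y - (b * x + a)) ** 2 for x, y in zip(years, pubs)) / (n - 2))
se_b = see / sqrt(sum((x - mean_x) ** 2 for x in years))

print(round(se_b, 3))      # ~0.366, matching SPSS's Std. Error for YEARS
print(round(b / se_b, 2))  # ~5.09, the t statistic
```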
Coefficients(a)

                  Unstandardized          Standardized
                  Coefficients            Coefficients
Model             B         Std. Error    Beta          t        Sig.
1   (Constant)    1.927     2.705                       .713     .485
    YEARS         1.863     .366          .768          5.090    .000

a. Dependent Variable: PUBS

In the YEARS row: B is the slope (b) coefficient, Std. Error is the standard
error of the slope coefficient, t is the t test of whether the slope
coefficient differs from zero, and Sig. is the p-value of the test

The t test of the slope coefficient has n − 2 df
As usual, if the obtained t equals or surpasses a critical value of t, then
we’d reject the null hypothesis
If the obtained t did not equal or surpass a critical value of t, then we’d
fail to reject the null hypothesis
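For completeness, a sketch of this decision rule in Python; it assumes SciPy is available, which the lecture itself does not use:

```python
from scipy import stats

t_obtained = 5.090
df = 18  # n - 2, with n = 20

t_critical = stats.t.ppf(0.975, df)       # two-tailed critical value at alpha = .05
p_value = 2 * stats.t.sf(t_obtained, df)  # two-tailed p-value

print(round(t_critical, 3))        # 2.101
print(t_obtained >= t_critical)    # True, so reject H0
print(p_value)                     # well below .05 (SPSS displays .000)
```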
In the above case, we rejected the null hypothesis. Our conclusion would
be:
“The number of years since Miami faculty earned their PhDs predicts the
number of publications they have, b = 1.863, t(18) = 5.09, p ≤ .05.”
Regression and Outliers
Like means, variances, and standard deviations, the regression line
is sensitive to outliers. Be sure to always plot your data first to see if
there are points that are far away from the regression line
Suppose I added one outlier to the previous dataset—a faculty
member who earned his/her PhD 25 years ago but only has 2
publications
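Here is a Python sketch (ours) that refits the line with this hypothetical outlier appended; it reproduces the “with outlier” SPSS coefficients shown below:

```python
# Same least-squares formulas as before, with the outlier (25 years, 2 pubs) added.
years = [3, 6, 3, 8, 9, 6, 16, 10, 2, 5, 5, 8, 6, 6, 2, 1, 4, 5, 12, 11] + [25]
pubs  = [7, 3, 4, 17, 11, 6, 24, 29, 9, 18, 19, 19, 11, 8, 3, 4, 15, 9, 30, 31] + [2]

n = len(years)
mean_x, mean_y = sum(years) / n, sum(pubs) / n

b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, pubs)) \
    / sum((x - mean_x) ** 2 for x in years)
a = mean_y - b * mean_x

print(round(b, 3))  # ~0.495: one point drags the slope down from 1.863
print(round(a, 3))  # ~9.677
```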
[Two scatterplots of PUBS (y-axis, 0 to 40) against YEARS: the original data
(x-axis, 0 to 18) and the data with the outlier added (x-axis, 0 to 30), each
with its fitted regression line]

Coefficients(a)  (with the outlier included)

                  Unstandardized          Standardized
                  Coefficients            Coefficients
Model             B         Std. Error    Beta          t        Sig.
1   (Constant)    9.677     3.371                       2.871    .010
    YEARS         .495      .373          .292          1.328    .200

a. Dependent Variable: PUBS
Notice how much the slope of the regression line has shifted downward
to accommodate the new point.
The slope coefficient is no longer significant!
Always plot your data!