Psychology 2010 Lecture 16 Notes: Simple regression Ch 14

Lecture Notes
Regression – Corty Chapter 14
Regression Analysis:
Use of a relationship between X-Y pairs to explain or predict variation in Y in terms of differences
in the X’s.
Prediction:
Use of a relationship between X-Y pairs to predict values of Y based on knowledge of X.
For example, since I know that your high school GPA was 3.7, I predict that your college GPA will
be about 3.1.
Regression Sample
A sample for which you have X-Y pairs with no missing members of either pair.
It is used to develop a prediction equation, a simple equation relating predicted Ys to Xs.
The prediction equation
Predicted Y = Additive constant + multiplicative constant * X.
Predicted Y = a + bX
We’ll use this: Predicted Y = a + bX, or equivalently, bX + a.
The second version, bX + a, is best when you’re doing hand computations.
Regression line
The prediction equation forms a straight line on the scatterplot of Y vs. X.
That line is called the regression line or line of best fit.
b and a and the regression line
The constant, b, is the slope of the regression line on a scatterplot.
The constant, a, is the y-intercept of the line.
Prediction from the equation
For persons for whom you have X but not Y, simply plug their X value into the equation (assuming
you’ve obtained values of a and b) to generate the predicted Y for each one.
Why do regression analysis?
1. Economy in prediction. If you have thousands of X-Y pairs, it would be very difficult to examine
all of them to obtain a predicted value for someone. But with the equation, it’s easy.
2. Theory. It may be of theoretical interest to know that there is a relationship between Ys and Xs
that is expressed by the simple equation: Predicted Y = a + bX.
3. Objectivity in prediction. Without the equation, we might argue about what the predicted Y
should be for a person. With it, we all get the same number.
Prediction Example
The data
Pair No.   X    Y
   1       1    4
   2       4   14
   3       2   12
   4       6   22
   5       3    6
   6       4   20
1. The Eyeball Method
Identify a dataset for which you have sufficient X-Y pairs.
A. Create a scatterplot of the X,Y pairs in the regression sample.
B. Draw the best fitting straight line through the scatterplot.
C. For each X value for which a predicted Y is desired, that predicted Y is the
height of the best fitting line above the X value.
[Hand-drawn scatterplot of the six Y values (axis 0 to 24) against X (axis 0 to 7), with the best
fitting straight line drawn through the points, omitted.]
For example, for X = 3, the predicted value of Y (the height of the line above X = 3) is 12 or 13.
Note that the best fitting straight line does not necessarily pass through the origin.
(Mike – Show how this was obtained.)
Problems with the eyeball method:
Eyeballs differ, so different people will get different prediction equations.
It is not easily automated.
2. The Formula Method, Predicted Y = a + b*X or, equivalently, b*X + a.
A. Compute the slope, b, of the best fitting straight line through the scatterplot.

   Slope = (N*ΣXY - (ΣX)(ΣY)) / (N*ΣX² - (ΣX)²) = r * (SY / SX)

B. Compute the Y-intercept, a, of the best fitting straight line.

   Y-intercept = Y-bar - Slope * X-bar
For the example data . . .

Pair No.   X    Y    X²    XY
   1       1    4     1     4
   2       4   14    16    56
   3       2   12     4    24
   4       6   22    36   132
   5       3    6     9    18
   6       4   20    16    80
Sum       20   78    82   314
Slope = (N*ΣXY - (ΣX)(ΣY)) / (N*ΣX² - (ΣX)²)
      = (6*314 - (20)(78)) / (6*82 - 20²)
      = (1884 - 1560) / (492 - 400)
      = 324 / 92
      = 3.52

Y-intercept = Y-bar - Slope * X-bar = 13 - 3.52 * 3.33 = 1.27
(Y-bar = 78/6 = 13; X-bar = 20/6 = 3.33.)

C. For each X value for which a predicted Y is desired, that predicted Y is obtained using the following
prediction formula:

   Predicted Y = Y’ = 3.52 * X + 1.27

For example,
If X = 3, Predicted Y = 3.52 * 3 + 1.27 = 10.56 + 1.27 = 11.83
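
The same arithmetic as a short Python sketch (my own illustration, not from Corty). It reproduces
b ≈ 3.52 and a ≈ 1.26; the handout’s 1.27 reflects rounding X-bar to 3.33.

X = [1, 4, 2, 6, 3, 4]
Y = [4, 14, 12, 22, 6, 20]

N = len(X)
sum_x = sum(X)                              # ΣX  = 20
sum_y = sum(Y)                              # ΣY  = 78
sum_x2 = sum(x * x for x in X)              # ΣX² = 82
sum_xy = sum(x * y for x, y in zip(X, Y))   # ΣXY = 314

b = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)  # slope ≈ 3.52
a = sum_y / N - b * (sum_x / N)             # intercept ≈ 1.26 (handout rounds to 1.27)

print(f"Predicted Y = {b:.2f}*X + {a:.2f}")
print("For X = 3, Predicted Y =", round(b * 3 + a, 2))        # ≈ 11.83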
Putting the best fitting straight line on a scatterplot
1. Compute Predicted Y for the smallest X.
2. Plot the point (Smallest X, Predicted Y) on the scatterplot.
3. Compute Predicted Y for the largest X.
4. Plot the point (Largest X, Predicted Y) on the scatterplot.
5. Connect the two points with a straight line.
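
A minimal Python sketch of these five steps using matplotlib (my addition, not part of the handout;
it assumes the a and b computed above):

import matplotlib.pyplot as plt

X = [1, 4, 2, 6, 3, 4]
Y = [4, 14, 12, 22, 6, 20]
b, a = 3.52, 1.27                      # slope and intercept from the formula method

# Steps 1-4: the predicted Ys at the smallest and largest X give two points.
x_lo, x_hi = min(X), max(X)
y_lo, y_hi = a + b * x_lo, a + b * x_hi

plt.scatter(X, Y)                      # the scatterplot
plt.plot([x_lo, x_hi], [y_lo, y_hi])   # step 5: connect the two points
plt.xlabel("X")
plt.ylabel("Y")
plt.show()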
In Class example problem on Regression Analysis
Suppose a manufacturing company is interested in being able to predict how well prospective employees
will perform running a machine which bends metal parts into a predetermined shape. A test of eye-hand
coordination is given to six persons applying for employment. Scores on the test can range from 0,
representing little eye-hand coordination, to 10, representing very good coordination.
All 14 are hired and after six months on the job, the performance of each person is measured. The
performance measure is the number of parts produced to specification for a one hour period. Scores on the
performance measure could range from 0, representing no parts produced to specification to 26 or 27, the
maximum number the company's best machine operators can produce.
The data are as follows:
ID   Test Score   Mach Score
 1        1            4
 2        4           14
 3        2           12
 4        6           22
 5        3            6
 6        4           20
 7        5           15
 8        7           25
 9        3           14
10        0            3
11        3            9
12        5           18
13        2            7
14        1            4
[Hand-drawn scatterplot of Mach Score (axis 0 to 24) against Test Score (axis 0 to 10) omitted.]
Use the scatterplot to look for nonlinearity and outliers.
SPSS-generated scatterplot
[SPSS scatterplot of JOBPERF (axis 0 to 30) against TESTSCOR (axis 0 to 10) omitted.]

Descriptive Statistics
            Mean      Std. Deviation   N
TESTSCOR    3.2857    2.0164           14
JOBPERF    12.3571    7.1426           14
Correlations
                                TESTSCOR   JOBPERF
TESTSCOR   Pearson Correlation   1.000       .922
           Sig. (2-tailed)       .           .000
           N                    14          14
JOBPERF    Pearson Correlation    .922      1.000
           Sig. (2-tailed)        .000      .
           N                    14          14

Pearson Correlation is the Pearson r; N is the number of pairs; Sig. is the p-value for a test of the
null hypothesis that the population r = 0.

b = r * SY/SX = .922 * 7.1426 / 2.0164 = .922 * 3.5423 = 3.27
a = Y-bar - b * X-bar = 12.3571 - 3.27 * 3.2857 = 12.3571 - 10.7310 = 1.63
Predicted Y = a + b*X = 1.63 + 3.27*X, or 3.27*X + 1.63 for ease of computation
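
The same computation from the summary statistics, as a small Python sketch (the variable names are
my own):

# Slope and intercept from r, the standard deviations, and the means.
r = 0.922
sd_x, sd_y = 2.0164, 7.1426        # SDs of TESTSCOR and JOBPERF
mean_x, mean_y = 3.2857, 12.3571   # means of TESTSCOR and JOBPERF

b = r * sd_y / sd_x                # ≈ 3.27
a = mean_y - b * mean_x            # ≈ 1.63
print(f"Predicted JOBPERF = {a:.2f} + {b:.2f} * TESTSCOR")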
Biderman’s P2010 Handouts
Regression - 5
2/9/2016
Using the SPSS REGRESSION procedure
1. Enter the data into SPSS
2. Analyze -> Regression -> Linear
3. Put the Y variable into the Dependent: field and the X variable into the Independent(s): field.
4. The results . . .

Regression

Variables Entered/Removed(a)
Model   Variables Entered   Variables Removed   Method
1       test(b)             .                   Enter
a. Dependent Variable: machine
b. All requested variables entered.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .922(a)   .850       .837                2.884
a. Predictors: (Constant), test
(R here is the Pearson r.)

ANOVA(a)   (Ignore this table for this semester.)
             Sum of Squares   df   Mean Square   F        Sig.
Regression   563.422           1   563.422       67.752   .000(b)
Residual      99.792          12     8.316
Total        663.214          13
a. Dependent Variable: machine
b. Predictors: (Constant), test

Coefficients(a)
             Unstandardized B   Std. Error   Standardized Beta   t       Sig.
(Constant)   1.630              1.514                            1.076   .303
test         3.265               .397        .922                8.231   .000
a. Dependent Variable: machine
(The B for (Constant) is the intercept, a; the B for test is the slope, b.)
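
If you want to check the SPSS output outside SPSS, here is a sketch using Python’s scipy (my
addition; it assumes scipy is installed and uses the 14 pairs above):

from scipy import stats

test = [1, 4, 2, 6, 3, 4, 5, 7, 3, 0, 3, 5, 2, 1]
machine = [4, 14, 12, 22, 6, 20, 15, 25, 14, 3, 9, 18, 7, 4]

# linregress returns the slope (b), intercept (a), r, the two-tailed
# p-value for the slope, and the standard error of the slope.
result = stats.linregress(test, machine)
print(f"b = {result.slope:.3f}")       # ≈ 3.265
print(f"a = {result.intercept:.3f}")   # ≈ 1.630
print(f"r = {result.rvalue:.3f}")      # ≈ .922
print(f"p = {result.pvalue:.3f}")      # ≈ .000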
Another Example: Predicting College GPA from High School GPA
This example is based on about 4000 students.
Analyze -> Regression -> Linear
Regression
[DataSet1] G:\MDBR\FFROSH\Ffroshnm.sav
Variables Entered/Removed(a)
Model   Variables Entered   Variables Removed   Method
1       hsgpa(b)            .                   Enter
a. Dependent Variable: ogpa1 1ST SEM GPA EXCL FSEM
b. All requested variables entered.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .493(a)   .243       .243                .79268
a. Predictors: (Constant), hsgpa

ANOVA(a)   (Ignore the ANOVA table. It’s useful only when you have two or more predictors.)
             Sum of Squares   df     Mean Square   F          Sig.
Regression    960.505            1   960.505       1528.624   .000(b)
Residual     2985.273         4751      .628
Total        3945.778         4752
a. Dependent Variable: ogpa1 1ST SEM GPA EXCL FSEM
b. Predictors: (Constant), hsgpa

Coefficients(a)
             Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)    .154               .064                             2.424    .015
hsgpa         .816               .021        .493                39.098    .000
a. Dependent Variable: ogpa1 1ST SEM GPA EXCL FSEM
So Predicted College GPA = 0.154 + 0.816*HSGPA.
The p-value in the lower right corner of the Coefficients table indicates that the population
correlation is different from 0. The relationship is positive in the population.
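(Check: for the student with high school GPA 3.7 mentioned at the start of these notes,
Predicted College GPA = 0.154 + 0.816 * 3.7 = 3.17, close to the “about 3.1” quoted there.)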
Interpretation of the regression coefficients
Intercept, “a”: Expected (predicted) value of Y when X = 0.
Slope, “b”: Expected difference in Y between two people who differ by 1 on X.
Example test question: The prediction equation is Predicted Y = 3 + 4*X.
Fred scored X=10. John scored X=12.
What is the predicted difference between their Y values?
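(Answer: they differ by 2 on X, so the predicted difference is b * 2 = 4 * 2 = 8.)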
Measuring prediction accuracy
Most people use r², the square of Pearson r.
r² = 1: Prediction in the regression sample is perfect.
r² = .5: Prediction is about halfway between random guessing and perfection.
r² = 0: Prediction is no better than random guessing.
Residuals: Errors of prediction:
Residual: Observed Y – Predicted Y
Residual = Y – Y’ using Corty’s designation for predicted Y
Positive residual: Observed Y is bigger than predicted (Y lies above Y’ on the scatterplot).
The person overachieved – did better than expected.
Negative residual: Observed Y is smaller than predicted (Y lies below Y’).
The person did worse than expected.
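
A short Python sketch computing the residuals for the in-class example (my own illustration, using
the rounded a and b from above):

# Residuals (Y - Y') for the machine-operator example.
test = [1, 4, 2, 6, 3, 4, 5, 7, 3, 0, 3, 5, 2, 1]
machine = [4, 14, 12, 22, 6, 20, 15, 25, 14, 3, 9, 18, 7, 4]
a, b = 1.63, 3.27   # intercept and slope from the regression

for x, y in zip(test, machine):
    y_prime = a + b * x       # predicted Y (Y')
    residual = y - y_prime    # positive: overachieved; negative: underachieved
    print(f"X={x}  Y={y:2d}  Y'={y_prime:5.2f}  residual={residual:+.2f}")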