Regression Basics

Predicting a DV with a Single IV
Questions
• What are predictors and criteria?
• Write an equation for the linear regression. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• How do we find the slope and intercept for the regression line with a single independent variable? (Either formula for the slope is acceptable.)
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?
Basic Ideas
• Jargon
– IV = X = Predictor (pl. predictors)
– DV = Y = Criterion (pl. criteria)
– Regression of Y on X, e.g., GPA on SAT
• Linear Model = relations between IV and DV represented by a straight line.
$Y_i = \alpha + \beta X_i + \varepsilon_i$ (population values)
• A score on Y has 2 parts: (1) a linear function of X and (2) error.
Basic Ideas (2)
• Sample value: $Y_i = a + bX_i + e_i$
• Intercept (a) – the value of Y where X = 0
• Slope (b) – the change in Y if X changes 1 unit. Rise over run.
• If error is removed, we have a predicted value for each person at X (the line):
$Y' = a + bX$
Suppose on average houses are worth about $75.00 a square foot. Then the equation relating price to size would be Y' = 0 + 75X. The predicted price for a 2,000-square-foot house would be $150,000.
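A minimal Python sketch of this prediction equation (the function name and default values are mine, chosen to match the $75-per-square-foot example):

```python
def predict(x, a=0.0, b=75.0):
    """Predicted value Y' = a + bX (here: price from square feet)."""
    return a + b * x

print(predict(2000))  # 150000.0 -- a 2,000 sq ft house at $75/sq ft
```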
Linear Transformation
$Y' = a + bX$
• 1-to-1 mapping of variables via a line
• Permissible operations are addition and multiplication (interval data)
[Figure: two panels plotting lines $Y' = a + bX$ against X. Left panel, "Add a constant": adding a constant shifts the line up or down (the intercept changes). Right panel, "Multiply by a constant": multiplying by a constant changes the tilt of the line (the slope changes).]
Linear Transformation (2)
Centigrade to Fahrenheit
[Figure: Degrees F plotted against Degrees C with the line $Y' = a + bX$. Note the 1-to-1 mapping; marked points at (0 °C, 32 °F) and (100 °C, 212 °F). Intercept? Slope?]
The intercept is 32: when X (Cent) is 0, Y (Fahr) is 32.
The slope is 1.8: when Cent goes from 0 to 100 (run), Fahr goes from 32 to 212 (rise), and 212 - 32 = 180. Then 180/100 = 1.8, rise over run, is the slope. Y = 32 + 1.8X; F = 32 + 1.8C.
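The same rise-over-run arithmetic as a quick Python check (variable names are mine):

```python
# Rise over run from two known points: (0 C, 32 F) and (100 C, 212 F).
x1, y1 = 0.0, 32.0
x2, y2 = 100.0, 212.0

b = (y2 - y1) / (x2 - x1)  # slope: rise / run = 180 / 100 = 1.8
a = y1 - b * x1            # intercept: 32 when X (Centigrade) = 0

print(f"F = {a} + {b}C")   # F = 32.0 + 1.8C
```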
Review
• What are predictors and criteria?
• Write an equation for the linear regression with 1 IV. Describe each term.
• How do changes in the slope and intercept affect (move) the regression line?
Regression of Weight on Height

Ht (X)     Wt (Y)
61         105
62         120
63         120
65         160
65         120
68         145
69         175
70         160
72         185
75         210
N = 10     N = 10
M = 67     M = 150
SD = 4.57  SD = 33.99

[Figure: scatterplot of Wt (Y) against Ht (X) with the regression line $Y' = a + bX$.]
Correlation (r) = .94.
Regression equation: $Y' = -316.86 + 6.97X$
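These numbers can be checked with a short NumPy sketch (illustrative code, assuming NumPy; `np.polyfit` with degree 1 does a least-squares line fit):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])           # X
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])  # Y

b, a = np.polyfit(ht, wt, deg=1)   # least-squares slope and intercept
r = np.corrcoef(ht, wt)[0, 1]      # Pearson correlation

print(f"Y' = {a:.2f} + {b:.2f}X")  # Y' = -316.86 + 6.97X
print(f"r = {r:.2f}")              # r = 0.94
```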
Illustration of the Linear Model. This concept is vital!

$Y_i = \alpha + \beta X_i + \varepsilon_i$
$Y_i = a + bX_i + e_i$
$Y' = a + bX$
$e_i = Y_i - Y_i'$

Consider Y as a deviation from the mean.

[Figure: scatterplot of Wt on Ht with the regression line and the mean of Y; each point's deviation from the mean is split into a linear part (from the line) and an error part.]
Part of that deviation can be associated with X (the linear
part) and part cannot (the error).
Predicted Values & Residuals

$Y' = a + bX$

Numbers for the linear part and the error:

N    Ht     Wt       Y'       Resid
1    61     105      108.19    -3.19
2    62     120      115.16     4.84
3    63     120      122.13    -2.13
4    65     160      136.06    23.94
5    65     120      136.06   -16.06
6    68     145      156.97   -11.97
7    69     175      163.94    11.06
8    70     160      170.91   -10.91
9    72     185      184.84     0.16
10   75     210      205.75     4.25
M    67     150      150.00     0.00
SD   4.57   33.99     31.85    11.89
V    20.89  1155.56  1014.37   141.32

Note the M of Y' and of the residuals. Note that the variance of Y is V(Y') + V(res).

[Figure: scatterplot of Wt on Ht with the regression line; the vertical distances from the points to the line are the residuals.]
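A quick NumPy check of these properties (illustrative code; ddof=1 matches the N-1 variances shown in the table):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])

b, a = np.polyfit(ht, wt, deg=1)
pred = a + b * ht    # Y' for each person
resid = wt - pred    # e = Y - Y'

print(pred.mean())   # 150.0 -- mean of Y' equals the mean of Y
print(resid.mean())  # ~0.0  -- residuals average to zero
# V(Y) = V(Y') + V(res):
print(wt.var(ddof=1), pred.var(ddof=1) + resid.var(ddof=1))  # both ~1155.56
```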
Finding the Regression Line
Need to know the correlation, SDs, and means of X and Y.
The correlation is the slope when both X and Y are expressed as z scores. To translate to raw scores, just bring back the original SDs for both.

$r_{XY} = \frac{\sum z_X z_Y}{N}$

$b = r_{XY} \frac{SD_Y}{SD_X}$ (rise over run)

To find the intercept, use:

$a = \bar{Y} - b\bar{X}$

Suppose r = .50, SDX = .5, MX = 10, SDY = 2, MY = 5.
Slope: $b = .5 \times \frac{2}{.5} = 2$
Intercept: $a = 5 - 2(10) = -15$
Equation: $Y' = -15 + 2X$
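The same recipe as a Python sketch (the function and argument names are mine):

```python
def regression_line(r, sd_x, m_x, sd_y, m_y):
    """Slope and intercept from the correlation, SDs, and means."""
    b = r * sd_y / sd_x   # b = r * (SDy / SDx)
    a = m_y - b * m_x     # a = My - b * Mx
    return a, b

a, b = regression_line(r=0.50, sd_x=0.5, m_x=10, sd_y=2, m_y=5)
print(f"Y' = {a} + {b}X")  # Y' = -15.0 + 2.0X
```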
Line of Least Squares

We have some points. Assume linear relations are reasonable, so the two variables can be represented by a line. Where should the line go?

[Figure: scatterplot of Wt on Ht showing the points, the mean of Y, and a candidate line $Y' = a + bX$; the vertical deviations from the line are the errors.]

Place the line so errors (residuals) are small. The line we calculate has a sum of errors = 0. It has a sum of squared errors that is as small as possible; the line provides the smallest sum of squared errors, or least squares.
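A small NumPy illustration of the least-squares property (the perturbations are arbitrary; the point is that any line other than the fitted one gives a larger sum of squared errors):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
b, a = np.polyfit(ht, wt, deg=1)

def sse(a, b):
    """Sum of squared errors around the line Y' = a + bX."""
    return np.sum((wt - (a + b * ht)) ** 2)

print(sse(a, b))                 # ~1271.8 -- the least-squares minimum
print(sse(a + 5, b))             # larger: shifting the line does worse
print(sse(a, b + 0.5))           # larger: tilting the line does worse
print(np.sum(wt - (a + b * ht))) # ~0 -- the errors sum to zero
```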
Least Squares (2)
Review
• What does it mean to choose a regression line to satisfy the loss function of least squares?
• What are predicted values and residuals?
• Suppose r = .25, SDX = 1, MX = 10, SDY = 2, MY = 5. What is the regression equation (line)?
Partitioning the Sum of Squares

$Y = a + bX + e$
$Y = Y' + e$
$Y' = a + bX$
$e = Y - Y'$

Definitions:
$Y - \bar{Y} = (Y' - \bar{Y}) + (Y - Y')$
deviation from the mean (y) = regression part + error part

$\sum (Y - \bar{Y})^2 = \sum [(Y' - \bar{Y}) + (Y - Y')]^2$

$\sum y^2 = \sum (Y' - \bar{Y})^2 + \sum (Y - Y')^2$ (cross products drop out)

Sum of squared deviations from the mean = sum of squares due to regression + sum of squared residuals.
Analog: SStot = SSB + SSW
Partitioning SS (2)

$SS_Y = SS_{reg} + SS_{res}$

$\frac{SS_Y}{SS_Y} = \frac{SS_{reg}}{SS_Y} + \frac{SS_{res}}{SS_Y}$

$1 = R^2 + (1 - R^2)$

Total SS is regression SS plus residual SS. Can also get proportions of each. Can get variance by dividing SS by N if you want. Proportion of total SS due to regression = proportion of total variance due to regression = R² (R-square).
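A quick NumPy check of the partition (illustrative code):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
b, a = np.polyfit(ht, wt, deg=1)
pred = a + b * ht

ss_y = np.sum((wt - wt.mean()) ** 2)      # total SS, ~10400
ss_reg = np.sum((pred - wt.mean()) ** 2)  # SS due to regression
ss_res = np.sum((wt - pred) ** 2)         # SS residual

print(ss_y, ss_reg + ss_res)  # equal: SSY = SSreg + SSres
print(ss_reg / ss_y)          # ~0.88 = R-square
```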
Partitioning SS (3)

Wt (Y), M = 150:

Wt (Y)   (Y-M)²   Y'       Y'-M     (Y'-M)²    Resid (Y-Y')  Resid²
105      2025     108.19   -41.81   1748.076    -3.19          10.1761
120       900     115.16   -34.84   1213.826     4.84          23.4256
120       900     122.13   -27.87    776.7369   -2.13           4.5369
160       100     136.06   -13.94    194.3236   23.94         573.1236
120       900     136.06   -13.94    194.3236  -16.06         257.9236
145        25     156.97     6.97     48.5809  -11.97         143.2809
175       625     163.94    13.94    194.3236   11.06         122.3236
160       100     170.91    20.91    437.2281  -10.91         119.0281
185      1225     184.84    34.84   1213.826     0.16           0.0256
210      3600     205.75    55.75   3108.063     4.25          18.0625
Sum =    1500    10400    1500.01     0.01     9129.307        -0.01     1271.907
Variance         1155.56                       1014.37                    141.32
Partitioning SS (4)

           Total     Regress   Residual
SS         10400     9129.31   1271.91
Variance   1155.56   1014.37    141.32

Proportion of SS:
$\frac{10400}{10400} = \frac{9129.31}{10400} + \frac{1271.91}{10400}$, i.e., $1 = .88 + .12$

Proportion of Variance:
$\frac{1155.56}{1155.56} = \frac{1014.37}{1155.56} + \frac{141.32}{1155.56}$, i.e., $1 = .88 + .12$

R² = .88

$r_{YY'} = .94 = r_{XY}$ and $r_{Y'X} = 1$. Note Y' is a linear function of X, so $r_{YY'}^2 = .88 = R^2$. Also $r_{YE} = .35$, $r_{YE}^2 = .12$, and $r_{Y'E} = 0$.
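These correlations can be verified with NumPy (illustrative sketch):

```python
import numpy as np

ht = np.array([61, 62, 63, 65, 65, 68, 69, 70, 72, 75])
wt = np.array([105, 120, 120, 160, 120, 145, 175, 160, 185, 210])
b, a = np.polyfit(ht, wt, deg=1)
pred = a + b * ht
resid = wt - pred

print(np.corrcoef(wt, pred)[0, 1])    # ~0.94, same as r(X, Y)
print(np.corrcoef(pred, ht)[0, 1])    # 1.0 -- Y' is a linear function of X
print(np.corrcoef(wt, resid)[0, 1])   # ~0.35; squared, ~0.12 = 1 - R^2
print(np.corrcoef(pred, resid)[0, 1]) # ~0 -- residuals uncorrelated with Y'
```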
Significance Testing

Testing for the SS due to regression = testing for the variance due to regression = testing the significance of R². All are the same. $H_0\colon R^2_{population} = 0$.

$F = \frac{SS_{reg}/df_1}{SS_{res}/df_2} = \frac{SS_{reg}/k}{SS_{res}/(N-k-1)}$

$F = \frac{9129.31/1}{1271.91/(10-1-1)} = 57.42$

Equivalent test using R-square instead of SS:

$F = \frac{R^2/k}{(1-R^2)/(N-k-1)}$

$F = \frac{.88/1}{(1-.88)/(10-1-1)} = 58.67$

k = number of IVs (here it's 1) and N is the sample size (# of people). F has k and (N - k - 1) df. Results will be the same within rounding error.
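A sketch of the same F test in Python, assuming SciPy is available for the p-value:

```python
from scipy import stats

R2, k, N = 0.88, 1, 10
F = (R2 / k) / ((1 - R2) / (N - k - 1))  # F with k and (N-k-1) df
p = stats.f.sf(F, k, N - k - 1)          # upper-tail p for F(k, N-k-1)

print(f"F({k}, {N - k - 1}) = {F:.2f}, p = {p:.5f}")
# F(1, 8) = 58.67; p is far below .05, so reject H0: R^2 = 0
```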
Review
• What does it mean to test the significance of the regression sum of squares? R-square?
• What is R-square?
• Why does testing for the regression sum of squares turn out to have the same result as testing for R-square?