Regression
• What is regression to the mean?
• Suppose the mean temperature in November is 5 degrees
• What's your best guess for tomorrow's temperature?
  1. exactly 5?
  2. warmer than 5?
  3. colder than 5?
Regression
• What is regression to the mean?
• Suppose the mean temperature in November is 5 degrees and today the temperature is 15
• What's your best guess for tomorrow's temperature?
  1. exactly 15 again?
  2. exactly 5?
  3. warmer than 15?
  4. something between 5 and 15?
Regression
• What is regression to the mean?
• Regression to the mean is the fact that scores tend to be closer to the mean than the values they are paired with
  – e.g. daughters tend to be shorter than their mothers if the mothers are taller than the mean, and taller than their mothers if the mothers are shorter than the mean
  – e.g. parents with high IQs tend to have kids with lower IQs; parents with low IQs tend to have kids with higher IQs
Regression
• What is regression to the mean?
• The strength of the correlation between two variables tells you the degree to which regression to the mean affects scores
  – strong correlation means little regression to the mean
  – weak correlation means strong regression to the mean
  – no correlation means that one variable has no influence on values of the other - the mean is always your best guess
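A quick way to see this is to simulate it. Below is a minimal sketch (not from the slides; the data are randomly generated) showing that for standardized variables the best guess for y given x is r times x, so the weaker the correlation, the harder predictions are pulled back toward the mean:

# Simulation of regression to the mean at different correlation strengths.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)

for r in (0.9, 0.5, 0.0):
    # construct y so that corr(x, y) = r
    y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)
    # among cases where x is about +2, see where y tends to fall
    high_x = (x > 1.9) & (x < 2.1)
    print(f"r = {r}: mean y when x is near 2 is about {y[high_x].mean():.2f}")
    # r = 0.9 -> y near 1.8 (little regression); r = 0 -> y near 0 (full regression)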
Regression
• Suppose you measured workload and credit hours for 8 students
• Could you predict the number of homework hours from credit hours?
Regression
• Suppose you measured workload and credit hours for 8 students
• Your first guess might be to pick the mean number of homework hours, which is 12.9
Regression
• Sum of Squares
• Adding up the squared deviation scores gives you a measure of the total error of your estimate
Regression
• Sum of Squares
• ideally you would pick an equation that minimized the sum of the squared deviations
• You would need a line that is as close as possible to each point
Regression
• The regression line
• That line is called the regression line
• The sum of squared deviations from it is called the sum of squared error, or SSE
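The slides' 8-student data set is not reproduced here, so the sketch below uses made-up numbers (chosen so the mean homework hours is 12.9, matching the slide). It illustrates how summing squared deviations scores a predictor, and that a fitted line never does worse than the mean:

# Sum-of-squares comparison: predicting with the mean vs. with a fitted line.
import numpy as np

credit_hours = np.array([9, 12, 12, 15, 15, 18, 18, 21])    # hypothetical x
homework_hrs = np.array([6, 10, 11, 12, 13, 15, 16, 20])    # hypothetical y, mean 12.875

# Guess 1: always predict the mean of y
sse_mean = np.sum((homework_hrs - homework_hrs.mean()) ** 2)

# Guess 2: a least-squares line
slope, intercept = np.polyfit(credit_hours, homework_hrs, 1)
predicted = slope * credit_hours + intercept
sse_line = np.sum((homework_hrs - predicted) ** 2)

print(f"SSE using the mean: {sse_mean:.1f}")
print(f"SSE using the regression line: {sse_line:.1f}  (never larger than SSE of the mean)")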
Regression
• The regression line
• That line is called the regression line
• its equation is:
$$y'_i = r_{xy}\frac{S_y}{S_x}x_i + \bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}$$
Regression
remember: y = ax + b
$$\underbrace{y'_i}_{\text{predicted } y} = \underbrace{r_{xy}\frac{S_y}{S_x}}_{a}\,x_i + \underbrace{\bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}}_{b}$$
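As a sanity check, the sketch below (made-up data) confirms that the slide's slope a = r·Sy/Sx and intercept b = ȳ − a·x̄ reproduce the ordinary least-squares line:

# Verify the slope/intercept formulas against a direct least-squares fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

r = np.corrcoef(x, y)[0, 1]
a = r * y.std() / x.std()      # slope (same ddof for both stds, so it cancels)
b = y.mean() - a * x.mean()    # intercept

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit for comparison
print(a, b)                # from the formula
print(slope, intercept)    # from least squares -- should match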
Regression
• What happens if you had transformed all the scores to z scores and were trying to predict a z score?
$$y'_i = r_{xy}\frac{S_y}{S_x}x_i + \bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}$$
but…
$$S_y = S_x = 1 \qquad \bar{y} = \bar{x} = 0$$
So…
$$z'_{y_i} = r_{xy}z_{x_i}$$
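A small numerical illustration (made-up data) that the regression line for z scores has slope r and intercept 0:

# After z-scoring, the fitted line reduces to z_y' = r * z_x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(zx, zy, 1)
print(slope, r)        # slope of the z-score regression equals r
print(intercept)       # intercept is zero (up to float error)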
The Regression Line
• The regression line is a linear function that generates a y for a given x
• What should its slope and y-intercept be to be the best predictor?
• What does best predictor mean? It means the least distance between the predicted y and an actual y for a given x
• in other words, how much variability is residual after using the correlation to explain the y scores
Mean Square Residual
• Recall that
$$S^2_y = \frac{\sum (y_i - \bar{y})^2}{n}$$
Mean Square Residual
• The variance of Zy is the average squared distance of each point from the x axis (note that the mean of Zy = 0)
[Scatterplot "Regression" of the actual scores in z units; both axes run from -3.0 to 3.0]
Mean Square Residual
• Some of the variance in the Zy scores is due to the correlation with x
• Some of the variance in the Zy scores is due to other (probably random) factors
Mean Square Residual
• the variance due to other factors is called "residual" because it is "leftover" after fitting a regression line
• The best predictor should minimize this residual variance
Mean Square Residual
$$MS_{res} = \frac{\sum (y_i - y'_i)^2}{n}$$
MSres is the average squared deviation of the actual scores from the regression line
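A minimal sketch (made-up data) computing MSres exactly as defined above:

# MSres: average squared deviation of the actual scores from the regression line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

ms_res = np.mean((y - y_pred) ** 2)
print(f"MSres = {ms_res:.3f}")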
Minimizing MSres
• the regression line (the best predictor of y) is the line with a slope and y intercept such that MSres is minimized
Minimizing MSres
• What will be its y intercept?
  – if there were no correlation at all, your best guess for y at any x would be the mean of y
  – if there were a strong correlation between x and y, your best guess for the y that matches the mean x would be the mean y
  – the mean of Zx is zero, so the best guess for the Zy that goes with it will be zero (the mean of the Zy's)
Minimizing MSres
• In other words, the regression line will predict zero when Zx is zero, so the y intercept of the regression line will be zero (only so for Z scores!)
Minimizing MSres
• y intercept is zero
[Same scatterplot of the z scores as above]
Minimizing MSres
• what is the slope?
[Same scatterplot of the z scores as above]
Minimizing MSres
• what is the slope? consider the extremes:
[Three z-score scatterplots, axes from -3.0 to 3.0:]
  – Zy = Zx: the prediction is Zy' = Zx, slope = 1
  – Zy = -Zx: the prediction is Zy' = -Zx, slope = -1
  – Zy is random with respect to Zx: the prediction is Zy' = mean Zy = 0, slope = 0
• Do the slopes look familiar?
Minimizing MSres
• a line (the regression of Zy on Zx) that has a slope of rxy and a y intercept of zero minimizes MSres
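The sketch below (made-up data) checks this claim by brute force: over a grid of candidate slopes for a line through the origin in z scores, MSres bottoms out at the slope equal to r:

# Grid search over slopes for z_y' = a * z_x; the minimizer of MSres is a = r.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r = np.corrcoef(x, y)[0, 1]

slopes = np.linspace(-1.5, 1.5, 3001)
ms_res = [np.mean((zy - a * zx) ** 2) for a in slopes]
best = slopes[np.argmin(ms_res)]
print(f"slope minimizing MSres: {best:.3f}, r = {r:.3f}")  # agree to grid resolution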
Predicting raw scores
• we have a regression line in z scores:
$$z'_y = r_{xy}z_x$$
• can we predict a raw-score y from a raw-score x?
Predicting raw scores
• recall that:
$$z_{y_i} = \frac{y_i - \bar{y}}{S_y} \qquad\text{and}\qquad z_{x_i} = \frac{x_i - \bar{x}}{S_x}$$
Predicting raw scores
• by substituting we get:
$$y'_i = \underbrace{r_{xy}\frac{S_y}{S_x}}_{a}\,x_i + \underbrace{\bar{y} - r_{xy}\frac{S_y}{S_x}\bar{x}}_{b}$$
• note that this is still of the form: y = ax + b
• note that the slope still depends on r and the intercept still depends on the mean of y
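To illustrate the substitution, the sketch below (made-up data) predicts a raw y for a new raw x two ways, via the z-score line and via the raw-score form; both routes give the same answer:

# Predicting a raw y from a raw x: z-score route vs. raw-score formula.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
r = np.corrcoef(x, y)[0, 1]

new_x = 3.5

# Route 1: standardize, predict a z score, unstandardize
zx = (new_x - x.mean()) / x.std()
y_pred_1 = (r * zx) * y.std() + y.mean()

# Route 2: raw-score form y' = a*x + b with a = r*Sy/Sx, b = ybar - a*xbar
a = r * y.std() / x.std()
b = y.mean() - a * x.mean()
y_pred_2 = a * new_x + b

print(y_pred_1, y_pred_2)   # identical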
Interpreting rxy in terms of variance
• Recall that rxy is the slope of the regression line that minimizes MSres
$$MS_{res} = \frac{\sum (y_i - y'_i)^2}{n} = S^2_{y \cdot y'}$$
Interpreting rxy in terms of variance
• MSres can be simplified to:
$$S^2_{y \cdot y'} = S^2_y\,(1 - r^2_{xy})$$
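A quick numerical check of this identity (made-up data):

# Residual variance equals S²_y (1 - r²) for the least-squares line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(np.mean(residuals ** 2))    # S² of the residuals (ddof = 0)
print(y.var() * (1 - r ** 2))     # S²_y (1 - r²) -- should match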
Interpreting rxy in terms of variance
• Thus:
$$r^2_{xy} = \frac{S^2_y - S^2_{y \cdot y'}}{S^2_y}$$
• So r² can be thought of as the proportion of original variance accounted for by the regression line
Interpreting rxy in terms of variance
[Diagram: a scatterplot with the regression line and the mean of y marked. For an observed y, its distance from the mean of y is compared with the distance from the mean of y to the predicted y on the regression line; r² asks what % of the first distance is the second]
Interpreting rxy in terms of variance
• it follows that $1 - r^2_{xy}$ is the proportion of variance not accounted for by the regression line - this is the residual variance
Interpreting rxy in terms of variance
• this can be thought of as a partitioning of variance into the variance accounted for by the regression and the variance unaccounted for
$$S^2_y = S^2_{y'} + S^2_{y \cdot y'}$$
$$\frac{\sum (y_i - \bar{y})^2}{n} = \frac{\sum (y' - \bar{y})^2}{n} + \frac{\sum (y_i - y')^2}{n}$$
Interpreting rxy in terms of variance
• often written in terms of sums of squares:
$$\sum (y_i - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y_i - y')^2$$
• or simply:
SStotal = SSregression + SSresidual
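A final sketch (made-up data) verifying the partition, and that r² = SSregression / SStotal:

# SStotal = SSregression + SSresidual, and SSregression / SStotal = r².
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)
ss_reg   = np.sum((y_pred - y.mean()) ** 2)
ss_res   = np.sum((y - y_pred) ** 2)

print(ss_total, ss_reg + ss_res)                          # equal
print(ss_reg / ss_total, np.corrcoef(x, y)[0, 1] ** 2)    # both are r²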