08 Linear Regression

Linear Regression
Correlation vs Regression: What’s the Difference?
• Correlation measures how strongly two variables are
related.
• Regression provides a means for predicting the value of
one variable based on the value of a related variable.
• The underlying mathematics are the same.
• Here we are dealing only with linear correlation and
linear regression.
PSYC 6130, PROF. J. ELDER
Optimal Prediction using z Scores
• Consider 2 variables X and Y that may be related in
some way.
– e.g.,
• X = midterm score, Y = final exam score
• X = reaction time, Y = error rate
• Suppose you know X for a particular case (e.g.,
individual, trial). What is your best guess at Y?
• The answer turns out to be pretty simple:
$$z_{Y'} = r\,z_X$$
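A minimal sketch of this rule in Python, using the summary statistics quoted for the assignment-marks example in these slides (r = 0.7998, means and sample standard deviations of the two assignments). The function name `predict_z` and the input mark of 95% are hypothetical, chosen only for illustration.

```python
def predict_z(z_x: float, r: float) -> float:
    """Best linear guess at the z score of Y given the z score of X."""
    return r * z_x

# Summary statistics quoted in the assignment-marks example of these slides.
r = 0.7998
mean_x, sd_x = 90.9, 4.66   # Assignment 1 (X)
mean_y, sd_y = 89.3, 5.04   # Assignment 2 (Y)

x = 95.0                            # hypothetical Assignment 1 mark (illustration only)
z_x = (x - mean_x) / sd_x           # convert X to a z score
z_y_pred = predict_z(z_x, r)        # predicted z score of Y
y_pred = mean_y + z_y_pred * sd_y   # convert the prediction back to raw units

print(f"z_X = {z_x:.3f}, predicted z_Y = {z_y_pred:.3f}, predicted Y = {y_pred:.1f}%")
```

Because |r| < 1, the predicted z score is always closer to zero than the observed one, which is the familiar regression toward the mean.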
Example: 6130A 2005-06 Assignment marks
                   Assignment 1 (X)   Assignment 2 (Y)
                   86.7%              81.8%
                   81.5%              82.4%
                   85.0%              84.3%
                   85.5%              86.8%
                   90.2%              83.6%
                   95.4%              87.4%
                   91.9%              93.1%
                   93.1%              93.1%
                   94.8%              91.8%
                   93.6%              93.7%
                   94.8%              93.1%
                   94.2%              94.3%
                   94.8%              95.6%
Mean               90.9%              89.3%
Sample Std. Dev.   4.66%              5.04%
Graphical Representation
[Figure: scatter plot of Assignment 2 z-score against Assignment 1 z-score (PSYC 6130A 2005-06), with the regression line $z_{Y'} = 0.7998\,z_X$.]
The Raw-Score Regression Formula
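In terms of population parameters:
$$Y' - \mu_Y = \frac{\sigma_Y}{\sigma_X}\, r\,(X - \mu_X)$$
or
$$Y' = a_{YX} + b_{YX} X$$
where
$$b_{YX} = \frac{\sigma_Y}{\sigma_X}\, r, \qquad a_{YX} = \mu_Y - b_{YX}\,\mu_X$$
In terms of sample statistics:
$$Y' - \bar{Y} = \frac{s_Y}{s_X}\, r\,(X - \bar{X})$$
or
$$Y' = a_{YX} + b_{YX} X$$
where
$$b_{YX} = \frac{s_Y}{s_X}\, r, \qquad a_{YX} = \bar{Y} - b_{YX}\,\bar{X}$$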
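As a rough check of the sample formulas, the sketch below computes the slope and intercept from the summary statistics of the assignment-marks example. The helper name `regression_coefficients` is hypothetical. Because the inputs are the rounded values shown on the slides, the result can differ slightly from the coefficients quoted for the fitted line.

```python
def regression_coefficients(mean_x, sd_x, mean_y, sd_y, r):
    """Return (a_yx, b_yx) for the raw-score line Y' = a_yx + b_yx * X."""
    b_yx = r * sd_y / sd_x          # slope: b_YX = (s_Y / s_X) * r
    a_yx = mean_y - b_yx * mean_x   # intercept: a_YX = Ybar - b_YX * Xbar
    return a_yx, b_yx

# Rounded summary statistics from the assignment-marks example (in percent).
a_yx, b_yx = regression_coefficients(
    mean_x=90.9, sd_x=4.66, mean_y=89.3, sd_y=5.04, r=0.7998
)

print(f"slope b_YX = {b_yx:.3f}, intercept a_YX = {a_yx:.1f}%")

# The fitted line always passes through the point of means (Xbar, Ybar).
assert abs((a_yx + b_yx * 90.9) - 89.3) < 1e-9
```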
Example: 6130A 2005-06 Assignment marks
Assignment 1 (X): mean 90.9%, sample std. dev. 4.66%
Assignment 2 (Y): mean 89.3%, sample std. dev. 5.04%
Graphical Representation
[Figure: scatter plot of Assignment 2 grade against Assignment 1 grade (PSYC 6130 Section A, 2005-2006), with the regression line $Y' = 0.867X + 10.5\%$, i.e. slope $b_{YX} = 0.867$ and intercept $a_{YX} = 10.5\%$.]
Residuals
• The deviations of the actual Y values from the Y values predicted by
the regression line are called residuals:
$$\text{residual} = Y - Y'$$
• The regression line minimizes the sum of squared residuals (and
hence is called a least-squares fit).
[Figure: scatter plot of Assignment 2 grade against Assignment 1 grade (PSYC 6130 Section A, 2005-2006), showing the residual as the vertical distance between an observed Y and the predicted Y' on the regression line.]
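The least-squares property can be illustrated numerically. The sketch below fits the line to the assignment marks (assuming the two columns of the example table are paired row by row) and checks that nudging the slope or intercept away from the fitted values increases the sum of squared residuals. The helper names `fit_line` and `sse` are hypothetical.

```python
# Assignment marks in percent; row-by-row pairing of the two columns is assumed.
x = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9, 93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
y = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1, 93.1, 91.8, 93.7, 93.1, 94.3, 95.6]

def fit_line(x, y):
    """Least-squares intercept and slope for Y' = a + b * X."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def sse(a, b, x, y):
    """Sum of squared residuals for the line Y' = a + b * X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
best = sse(a, b, x, y)

print(f"a = {a:.2f}, b = {b:.3f}, SSE = {best:.2f}")
print("SSE with slope + 0.05:    ", round(sse(a, b + 0.05, x, y), 2))
print("SSE with intercept + 0.5: ", round(sse(a + 0.5, b, x, y), 2))
```

The residuals of a least-squares line with an intercept also sum to zero, up to floating-point error.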
Variance of the Estimate
• Total prediction error is expressed as the variance of the
estimate (or mean-squared error) $\sigma^2_{\mathrm{est}\,Y}$:
In terms of population parameters:
$$\sigma^2_{\mathrm{est}\,Y} = \frac{\sum (Y - Y')^2}{N}$$
In terms of sample statistics:
$$s^2_{\mathrm{est}\,Y} = \frac{\sum (Y - Y')^2}{N - 2}$$
Note that $\sigma^2_{\mathrm{est}\,Y} \le \sigma^2_Y$.
Equality applies only when $r = 0$.
$\sigma_{\mathrm{est}\,Y}$ ($s_{\mathrm{est}\,Y}$) is called the standard error of the estimate.
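A short sketch of the sample version, again assuming the assignment marks are paired row by row. It computes $s^2_{\mathrm{est}\,Y}$ directly from the residuals and compares it with the equivalent expression $\frac{N-1}{N-2}\, s_Y^2 (1 - r^2)$, which follows from the sample form of the coefficient of nondetermination given later in these slides. All variable names are illustrative.

```python
import math

# Assignment marks in percent; row-by-row pairing of the two columns is assumed.
x = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9, 93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
y = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1, 93.1, 91.8, 93.7, 93.1, 94.3, 95.6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b_yx = sxy / sxx                   # least-squares slope
a_yx = mean_y - b_yx * mean_x      # least-squares intercept
r = sxy / math.sqrt(sxx * syy)     # Pearson correlation

ss_resid = sum((yi - (a_yx + b_yx * xi)) ** 2 for xi, yi in zip(x, y))
s2_est = ss_resid / (n - 2)        # sample variance of the estimate
s_est = math.sqrt(s2_est)          # standard error of the estimate
s2_y = syy / (n - 1)               # sample variance of Y

# Equivalent expression in terms of r.
s2_est_from_r = (n - 1) / (n - 2) * s2_y * (1 - r ** 2)

print(f"s_est = {s_est:.2f}%, s_Y = {math.sqrt(s2_y):.2f}%")
```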
Explained and Unexplained Variance
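[Figure: scatter plot of Assignment 2 grade against Assignment 1 grade (PSYC 6130 Section A, 2005-2006) with the regression line. For one data point, the explained deviation runs from the mean of Y to the predicted value Y', and the unexplained deviation runs from Y' to the observed Y.]
Explained Variance:
$$\sigma^2_{\mathrm{exp}} = \frac{1}{N} \sum (Y' - \mu_Y)^2$$
Unexplained Variance:
$$\sigma^2_{\mathrm{est}\,Y} = \frac{1}{N} \sum (Y - Y')^2$$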
Summary of Variances
Population:
Total Variance:
$$\sigma^2_Y = \frac{\sum (Y - \mu_Y)^2}{N}$$
Unexplained Variance:
$$\sigma^2_{\mathrm{est}\,Y} = \frac{\sum (Y - Y')^2}{N}$$
Explained Variance:
$$\sigma^2_{\mathrm{exp}} = \frac{\sum (Y' - \mu_Y)^2}{N}$$
Summary of Variances
Population:
• It can be shown that:
$$\sigma^2_Y = \sigma^2_{\mathrm{est}\,Y} + \sigma^2_{\mathrm{exp}}$$
• i.e., the total variance is equal to the sum of the explained
and unexplained variances.
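A sketch of why the decomposition holds, using only the population definitions from these slides ($Y' - \mu_Y = b_{YX}(X - \mu_X)$ with $b_{YX} = r\,\sigma_Y/\sigma_X$, and $\sum (X - \mu_X)(Y - \mu_Y) = N r \sigma_X \sigma_Y$). The key step is that the cross term vanishes:

```latex
\begin{align*}
\sum (Y - \mu_Y)^2
  &= \sum \bigl[(Y - Y') + (Y' - \mu_Y)\bigr]^2 \\
  &= \sum (Y - Y')^2 + \sum (Y' - \mu_Y)^2 + 2 \sum (Y - Y')(Y' - \mu_Y), \\[4pt]
\sum (Y - Y')(Y' - \mu_Y)
  &= b_{YX} \Bigl[ \sum (Y - \mu_Y)(X - \mu_X) - b_{YX} \sum (X - \mu_X)^2 \Bigr] \\
  &= b_{YX} \bigl[ N r \sigma_X \sigma_Y - b_{YX} N \sigma_X^2 \bigr] = 0 .
\end{align*}
```

Dividing the first identity by $N$ then gives $\sigma^2_Y = \sigma^2_{\mathrm{est}\,Y} + \sigma^2_{\mathrm{exp}}$.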
Summary of Variances
Sample:
Total Variance:
$$s^2_Y = \frac{\sum (Y - \bar{Y})^2}{N - 1}$$
Unexplained Variance:
$$s^2_{\mathrm{est}\,Y} = \frac{\sum (Y - Y')^2}{N - 2}$$
Explained Variance:
$$s^2_{\mathrm{exp}} = s^2_Y - s^2_{\mathrm{est}\,Y}$$
Coefficient of Determination
• The fraction of the total variance explained by the regression line is
called the coefficient of determination.
• It can be shown that this is just the square of the Pearson coefficient r:
• Population:
$$r^2 = \frac{\sum (Y' - \mu_Y)^2}{\sum (Y - \mu_Y)^2} = 1 - \frac{\sigma^2_{\mathrm{est}\,Y}}{\sigma^2_Y}$$
• Sample:
$$r^2 = \frac{\sum (Y' - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{N - 2}{N - 1} \cdot \frac{s^2_{\mathrm{est}\,Y}}{s^2_Y}$$
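The sample identity can be checked numerically. The sketch below computes $r^2$ three ways for the assignment marks (row-by-row pairing of the two columns is assumed): as explained SS over total SS, as one minus unexplained SS over total SS, and as the square of Pearson's r. Variable names are illustrative.

```python
import math

# Assignment marks in percent; row-by-row pairing of the two columns is assumed.
x = [86.7, 81.5, 85.0, 85.5, 90.2, 95.4, 91.9, 93.1, 94.8, 93.6, 94.8, 94.2, 94.8]
y = [81.8, 82.4, 84.3, 86.8, 83.6, 87.4, 93.1, 93.1, 91.8, 93.7, 93.1, 94.3, 95.6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b_yx = sxy / sxx
a_yx = mean_y - b_yx * mean_x
y_pred = [a_yx + b_yx * xi for xi in x]

ss_total = sum((yi - mean_y) ** 2 for yi in y)                  # sum of (Y - Ybar)^2
ss_explained = sum((yp - mean_y) ** 2 for yp in y_pred)         # sum of (Y' - Ybar)^2
ss_unexplained = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))  # sum of (Y - Y')^2

r = sxy / math.sqrt(sxx * ss_total)
r2_from_explained = ss_explained / ss_total
r2_from_unexplained = 1 - ss_unexplained / ss_total

print(f"r^2 = {r ** 2:.4f}, {r2_from_explained:.4f}, {r2_from_unexplained:.4f}")
```

With the rounded marks shown on the slides, the value comes out close to, but not exactly equal to, the square of the quoted r = 0.7998.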
Coefficient of Nondetermination
• The fraction of the total variance that remains unexplained by the
regression line is called the coefficient of nondetermination.
• It can be shown that this is just $1 - r^2$:
• Population:
$$1 - r^2 = \frac{\sum (Y - Y')^2}{\sum (Y - \mu_Y)^2} = \frac{\sigma^2_{\mathrm{est}\,Y}}{\sigma^2_Y}$$
• Sample:
$$1 - r^2 = \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2} = \frac{N - 2}{N - 1} \cdot \frac{s^2_{\mathrm{est}\,Y}}{s^2_Y}$$
Summary of Coefficients
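Population:
Coefficient of Determination:
$$r^2 = \frac{\sum (Y' - \mu_Y)^2}{\sum (Y - \mu_Y)^2} = 1 - \frac{\sigma^2_{\mathrm{est}\,Y}}{\sigma^2_Y}$$
Coefficient of Nondetermination:
$$1 - r^2 = \frac{\sum (Y - Y')^2}{\sum (Y - \mu_Y)^2} = \frac{\sigma^2_{\mathrm{est}\,Y}}{\sigma^2_Y}$$
Sample:
Coefficient of Determination:
$$r^2 = \frac{\sum (Y' - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{N - 2}{N - 1} \cdot \frac{s^2_{\mathrm{est}\,Y}}{s^2_Y}$$
Coefficient of Nondetermination:
$$1 - r^2 = \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2} = \frac{N - 2}{N - 1} \cdot \frac{s^2_{\mathrm{est}\,Y}}{s^2_Y}$$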
Components of Variance: SPSS Output
Explained SS: $\sum (Y' - \bar{Y})^2$
Unexplained SS: $\sum (Y - Y')^2$
Total SS: $\sum (Y - \bar{Y})^2$
Unexplained Variance:
$$s^2_{\mathrm{est}\,Y} = \frac{\sum (Y - Y')^2}{N - 2}$$

ANOVA (b)
Model 1       Sum of Squares   df      Mean Square   F          Sig.
Regression    861347.2         1       861347.186    7465.139   .000 (a)
Residual      1325861          11491   115.383
Total         2187209          11492
a. Predictors: (Constant), How tall are you without your shoes on (in cm.)
b. Dependent Variable: How much do you weigh (in kilograms)
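To connect the SPSS table to the formulas, the sketch below recomputes the residual mean square (the sample variance of the estimate), the F ratio, and the coefficient of determination from the sums of squares and degrees of freedom in the ANOVA output. The variable names are illustrative, and since SPSS prints rounded sums of squares, the recomputed values match the table only to within rounding.

```python
import math

# Values copied from the SPSS ANOVA output (weight regressed on height).
ss_regression = 861347.186   # explained SS
ss_residual = 1325861.0      # unexplained SS (rounded in the output)
ss_total = 2187209.0         # total SS (rounded in the output)
df_regression = 1
df_residual = 11491          # N - 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual    # s^2_est: sample variance of the estimate
f_ratio = ms_regression / ms_residual
r_squared = ss_regression / ss_total       # coefficient of determination
s_est = math.sqrt(ms_residual)             # standard error of the estimate, in kg

print(f"MS residual = {ms_residual:.3f}")
print(f"F = {f_ratio:.1f}")
print(f"r^2 = {r_squared:.3f}, s_est = {s_est:.2f} kg")
```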
Estimating the Variance of the Estimate
• Uncertainty in predictions can be estimated using the
assumption of homoscedasticity.
– That is, the variance in Y is assumed to be homogeneous over the
range of X.
– (Etymology: hom- + Greek skedastikos able to disperse, from
skedannynai to disperse)
– Thought question: does this also explain the origin of the verb
skedaddle?
Confidence Intervals for Predictions
$$Y = Y' \pm t_{\mathrm{crit}}\, s_{\mathrm{est}\,Y} \sqrt{1 + \frac{1}{N} + \frac{(X - \bar{X})^2}{(N - 1)\, s_X^2}}$$
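A minimal sketch of this interval for the assignment-marks example, built from the rounded summary statistics on the slides (N = 13 rows, r = 0.7998). The function name `prediction_interval` and the query mark of 80% are hypothetical. The critical value 2.201 is the standard two-tailed 95% t value for N - 2 = 11 degrees of freedom, taken from a t table rather than from the slides.

```python
import math

def prediction_interval(x, n, mean_x, sd_x, a_yx, b_yx, s_est, t_crit):
    """Return (lower, upper) bounds for Y at a given X."""
    y_pred = a_yx + b_yx * x
    half_width = t_crit * s_est * math.sqrt(
        1 + 1 / n + (x - mean_x) ** 2 / ((n - 1) * sd_x ** 2)
    )
    return y_pred - half_width, y_pred + half_width

# Rounded summary statistics from the assignment-marks example (in percent).
n, r = 13, 0.7998
mean_x, sd_x = 90.9, 4.66
mean_y, sd_y = 89.3, 5.04

b_yx = r * sd_y / sd_x
a_yx = mean_y - b_yx * mean_x
# Standard error of the estimate from r: s_est^2 = (N-1)/(N-2) * s_Y^2 * (1 - r^2).
s_est = sd_y * math.sqrt((n - 1) / (n - 2) * (1 - r ** 2))
t_crit = 2.201   # assumption: two-tailed 95% critical t for df = 11

at_mean = prediction_interval(mean_x, n, mean_x, sd_x, a_yx, b_yx, s_est, t_crit)
at_80 = prediction_interval(80.0, n, mean_x, sd_x, a_yx, b_yx, s_est, t_crit)

print("Interval at X = mean:", [round(v, 1) for v in at_mean])
print("Interval at X = 80%: ", [round(v, 1) for v in at_80])
```

The interval is narrowest at the mean of X and widens as X moves away from it, because of the $(X - \bar{X})^2$ term under the square root.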
Example: 6130A 2005-06 Assignment marks
Assignment 1 (X): mean 90.9%, sample std. dev. 4.66%
Assignment 2 (Y): mean 89.3%, sample std. dev. 5.04%
$r = 0.7998$
Underlying Assumptions
• Independent random sampling
• Linearity
• Normal Distribution
• Homoscedasticity
Regressing X on Y
• Simply reverse the formulae, e.g.,
In terms of sample statistics:
$$X' - \bar{X} = \frac{s_X}{s_Y}\, r\,(Y - \bar{Y})$$
or
$$X' = a_{XY} + b_{XY} Y$$
where
$$b_{XY} = \frac{s_X}{s_Y}\, r, \qquad a_{XY} = \bar{X} - b_{XY}\,\bar{Y}$$
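One consequence worth seeing numerically: the line for predicting X from Y is not the algebraic inverse of the line for predicting Y from X. The sketch below uses the rounded summary statistics of the assignment-marks example. Variable names are illustrative.

```python
# Rounded summary statistics from the assignment-marks example (in percent).
r = 0.7998
mean_x, sd_x = 90.9, 4.66
mean_y, sd_y = 89.3, 5.04

# Regressing Y on X: Y' = a_yx + b_yx * X
b_yx = r * sd_y / sd_x
a_yx = mean_y - b_yx * mean_x

# Regressing X on Y: X' = a_xy + b_xy * Y
b_xy = r * sd_x / sd_y
a_xy = mean_x - b_xy * mean_y

print(f"b_YX = {b_yx:.3f}, b_XY = {b_xy:.3f}, 1 / b_YX = {1 / b_yx:.3f}")
print(f"b_YX * b_XY = {b_yx * b_xy:.4f}, r^2 = {r ** 2:.4f}")
```

The two slopes multiply to $r^2$, so they are reciprocals of each other only when $|r| = 1$.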
When to Use Linear Regression
• Prediction
• Statistical Control
– Adjust for effects of confounding variable.
– Also known as partialing out the effect of the confounding
variable.
• Experimental Psychology: modeling effect of continuous
independent variable on continuous dependent variable.
– e.g., reaction time vs set size in visual search.
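Statistical control can be sketched as follows: regress the outcome on the confounding variable and keep the residuals, which are by construction uncorrelated with the confound. The data below are synthetic and purely illustrative (the slides do not give the underlying data or the exact procedure used), and the helper names `residualize`, `pearson_r`, and `group_diff` are hypothetical.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def residualize(y, c):
    """Residuals of y after a simple linear regression of y on covariate c."""
    mc, my = mean(c), mean(y)
    b = sum((ci - mc) * (yi - my) for ci, yi in zip(c, y)) / sum((ci - mc) ** 2 for ci in c)
    a = my - b * mc
    return [yi - (a + b * ci) for ci, yi in zip(c, y)]

def pearson_r(u, v):
    mu, mv = mean(u), mean(v)
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    return suv / math.sqrt(sum((ui - mu) ** 2 for ui in u) * sum((vi - mv) ** 2 for vi in v))

def group_diff(values, group):
    """Mean of group 1 minus mean of group 0."""
    g1 = [v for v, g in zip(values, group) if g == 1]
    g0 = [v for v, g in zip(values, group) if g == 0]
    return mean(g1) - mean(g0)

# Synthetic data: the covariate differs between groups and also drives the outcome.
random.seed(0)
n = 500
group = [i % 2 for i in range(n)]
covariate = [0.5 * g + random.gauss(0, 1) for g in group]
outcome = [0.5 * g + 0.8 * c + random.gauss(0, 1) for g, c in zip(group, covariate)]

adjusted = residualize(outcome, covariate)   # outcome with the covariate partialed out

raw_diff = group_diff(outcome, group)
adjusted_diff = group_diff(adjusted, group)
print(f"raw group difference:      {raw_diff:.2f}")
print(f"adjusted group difference: {adjusted_diff:.2f}")
print(f"r(adjusted, covariate):    {pearson_r(adjusted, covariate):.2e}")
```

In this toy setup part of the raw group difference is carried by the covariate, so the difference shrinks after adjustment without necessarily disappearing.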
Statistical Control Example: Mental Health
Women report more bad mental health days than men, t(8176) = -7.1, p < .001, two-tailed.
Statistical Control Example: Physical Health
Correlation
Pearson’s r = 0.31
After Partialing Out Physical Health
Result of Partialing Out Physical Health
Controlling for physical health, women report more bad mental health days than men,
t(8176) = -5.7, p < .001, two-tailed.