Math 3680
Lecture #19
Correlation and Regression
The Correlation Coefficient: Limitations
   X     Y    X (su)   Y (su)    Product
  -5    25    -1.25    1.1254   -1.40676
  -4    16    -1.00    0.2572   -0.25724
  -1     1    -0.25   -1.1897    0.297429
   2     4     0.50   -0.9003   -0.45016
   3     9     0.75   -0.4180   -0.31351
   5    25     1.25    1.1254    1.40676

AVG of X = 0,        SD of X = 4
AVG of Y = 13.33333, SD of Y = 10.36661

r = (sum of products)/(n − 1) = -0.1447
Moral: Correlation coefficients only measure linear association.
[Scatter diagram: the six points lie exactly on the parabola Y = X², yet r = -0.1447.]
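As a check on the table, the computation can be reproduced in Python (a sketch; it uses sample SDs, dividing by n − 1, exactly as the standard-units columns do):

```python
# Sketch: verify that Y = X^2 has a small correlation with X even though
# the association is perfect -- just not linear.
import statistics

x = [-5, -4, -1, 2, 3, 5]
y = [xi**2 for xi in x]          # 25, 16, 1, 4, 9, 25

mx, my = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.stdev(x), statistics.stdev(y)   # sample SDs: 4 and ~10.37

# r = average of the products of standard units (dividing by n - 1)
r = sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)
print(round(r, 4))   # -0.1447
```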
Films starring Matthew McConaughey: x = minutes Matthew is shirtless, y = opening weekend gross (millions of dollars).

    Film              x      y
    We Are Marshall    0.0    6.1
    ED tv              0.8    8.3
    Reign of Fire      1.6   15.6
    Sahara             1.8   18.1
    Fool's Gold       14.6   21.6

    Correlation: 0.733965
The same films with the outlier (Fool's Gold) removed:

    Film              x      y
    We Are Marshall    0.0    6.1
    ED tv              0.8    8.3
    Reign of Fire      1.6   15.6
    Sahara             1.8   18.1

    Correlation: 0.966256
Moral: Correlation coefficients are most appropriate for football-shaped scatter diagrams and can be very sensitive to outliers.
Regression
The heights and weights from a survey of 988 men are shown in the scatter diagram.

Avg height = 70 in, SD height = 3 in
Avg weight = 162 lb, SD weight = 30 lb
r = 0.47
Example. Suppose a man is one SD above average in height (70 + 3 = 73 inches). Should you guess his weight to be one SD above average (162 + 30 = 192 pounds)?

Solution: No. Notice that maybe 10 or 11 of the men 73 inches tall have weights above 192 pounds, while dozens have weights below 192 pounds. (The point (73, 192) is marked on the diagram.)
A better prediction is obtained by increasing the average not by a full SD but by r SDs:

    Prediction = Average + (r)(# of SDs)(SD)
               = 162 + (0.47)(1)(30)
               = 176.1 lb

This is our second interpretation of the correlation coefficient.
Prediction = 162 + (1)(0.47)(30) = 176.1 lb

[Diagram: the regression line runs from the point of averages (70, 162) to the prediction (73, 176.1), rising (0.47)(30) lb over a run of 3 inches.]

    Slope = (0.47)(30)/3 = r (s_y/s_x)
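The prediction rule can be packaged as a short Python helper (a sketch; `regression_predict` is an illustrative name, not from the lecture):

```python
# Sketch of the regression prediction rule from the slide:
# predicted y = avg_y + r * (# of SDs above average in x) * SD_y.
def regression_predict(x, avg_x, sd_x, avg_y, sd_y, r):
    """Predict y for a given x using the regression method."""
    sds_above = (x - avg_x) / sd_x      # how many SDs x is above its average
    return avg_y + r * sds_above * sd_y

# Height-weight data: avg 70 in (SD 3), avg 162 lb (SD 30), r = 0.47.
print(round(regression_predict(73, 70, 3, 162, 30, 0.47), 1))   # 176.1
```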
Example: Predict the height of a man who weighs 176.1 pounds. Does this contradict our previous example?
Example: Predict the weight of a man who is 5’6”. Where does this prediction appear in the diagram?
Notice that these points are displayed on the solid line in the diagram. This line is called the regression line. To obtain this line, you start at the point of averages and draw a line with slope

    r (s_y/s_x)

In other words, the equation of the regression line is

    ŷ = ȳ + r (s_y/s_x)(x − x̄)

Reverse the roles of x and y when predicting in the other direction.
Example: Find the equation of the regression line for the height-weight diagram.
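A sketch of the answer in Python, using the stated summary statistics:

```python
# Sketch: the regression line for the height-weight data, in the form
# yhat = avg_y + r * (sd_y / sd_x) * (x - avg_x).
avg_x, sd_x = 70, 3       # height (in)
avg_y, sd_y = 162, 30     # weight (lb)
r = 0.47

slope = r * sd_y / sd_x               # 4.7 lb per inch
intercept = avg_y - slope * avg_x     # -167
print(f"yhat = {slope:.1f} x + ({intercept:.0f})")   # yhat = 4.7 x + (-167)
```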
Example: A university has made a statistical analysis of the relationship between SAT-M scores and first-year GPA. The results are:

Average SAT-M score = 550, SD = 80
Average first-year GPA = 2.6, SD = 0.6
r = 0.4
The scatter diagram is football shaped. Find the equation of the regression line. Then predict the first-year GPA of a randomly chosen student with an SAT-M score of 650.
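A sketch of the computation in Python (the variable names are illustrative):

```python
# Sketch of the SAT-M / GPA example: plug into
# yhat = avg_y + r * (sd_y / sd_x) * (x - avg_x).
avg_sat, sd_sat = 550, 80
avg_gpa, sd_gpa = 2.6, 0.6
r = 0.4

slope = r * sd_gpa / sd_sat                    # 0.003 GPA points per SAT point
gpa_hat = avg_gpa + slope * (650 - avg_sat)    # 2.6 + 0.3
print(round(gpa_hat, 2))   # 2.9
```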
Both Excel and TI calculators can compute and display regression lines (see book p. 426). In Excel 2007, highlight the x- and y-values and use Insert, Scatter to draw a scatter plot. Then right-click a data point and choose Add Trendline to see the regression line.
    Age (mo)   Height (cm)
       24          87
       48         101
       60         120
       96         159
       63         135
       39         104
       63         126
       39          96

[Age (mo) Line Fit Plot: Height (cm) and Predicted Height (cm) plotted against Age (mo).]
The Regression Effect
For a study of 1,078 fathers and sons:

Average fathers’ height = 68 in, SD = 2.7 in
Average sons’ height = 69 in, SD = 2.7 in
r ≈ 0.5
Suppose a father is 72 inches tall. How tall would you predict his son to be? Repeat for a father who is 64 inches tall.
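A sketch of both predictions in Python (the helper `predict_son` is illustrative, not from the lecture):

```python
# Sketch: regression-method predictions for the fathers-sons data
# (avg father 68 in, avg son 69 in, both SDs 2.7 in, r = 0.5).
def predict_son(father):
    return 69 + 0.5 * (2.7 / 2.7) * (father - 68)

print(predict_son(72))   # 71.0 -- tall father, son predicted tall but less extreme
print(predict_son(64))   # 67.0 -- short father, son predicted short but less extreme
```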
Notice that tall fathers tend to have tall sons, though the sons are not quite as tall on average. Likewise, short fathers on average have short sons, just not as short. Hence the term "regression." A pioneering but aristocratic statistician (Galton) called this effect "regression toward mediocrity," and the term has stuck.
There is no biological cause of this effect; it is strictly statistical. Thinking that the regression effect is due to something important is called the regression fallacy.
Example: A preschool program attempts to boost students’ IQs. The children are tested when they enter the program (pretest) and again when they leave the program (post-test). On both occasions, the average IQ score was 100, with an SD of 15. Also, students with below-average IQs on the pretest had scores that went up by an average of 5 points, while students with above-average scores on the pretest had their scores drop by an average of 5 points.
What is going on? Does the program equalize
intelligence?
Example. Suppose someone gets a score of 140 on the pretest. Does this mean that the student has an IQ of exactly 140?

Solution: No. There will always be chance error associated with the measurement. For the sake of argument, let’s assume that the chance error is equal to 5 points. Then there are two likely explanations:

    An actual IQ of 135, with a chance error of +5
    An actual IQ of 145, with a chance error of -5

Which of these is the more likely explanation?
This explains the regression effect. If someone scores above average on the first test, we would estimate that the true score is probably a bit lower than the observed score.
Example: An instructor gives a midterm. She asks the students who score 20 points below average to see her regularly during her office hours for special tutoring. They all scored at least average on the final. Can this improvement be attributed to the regression effect?
Regression and Error Estimation

n = 988
Avg height = 70 in, SD height = 3 in
Avg weight = 162 lb, SD weight = 30 lb
r = 0.47

For a man 73 in tall, we predict a weight of 162 + (1)(0.47)(30) = 176.1 lb.

Next question: what is the error for this estimate? Based on the picture, is it 30 lb? Or less?
THEOREM. Assuming the data is normally distributed, we have

    se = s_y √(1 − r²) √((n − 1)/(n − 2))

For the current example, that means

    se = 30 √(1 − (0.47)²) √(987/986) ≈ 26.5

The weight is therefore estimated as 176.1 ± 26.5 lb.
Note. If n is large, the last factor in

    se = s_y √(1 − r²) √((n − 1)/(n − 2))

may be safely ignored. In other words, if n is large, then

    se ≈ s_y √(1 − r²)
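A sketch in Python comparing the exact and large-n versions for the height-weight data:

```python
# Sketch: the prediction error se for the height-weight data, with and
# without the (n - 1)/(n - 2) correction factor.
import math

n, sd_y, r = 988, 30, 0.47

se_exact  = sd_y * math.sqrt(1 - r**2) * math.sqrt((n - 1) / (n - 2))
se_approx = sd_y * math.sqrt(1 - r**2)   # large-n shortcut

print(round(se_exact, 1), round(se_approx, 1))   # 26.5 26.5
```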
Example: For a study of 1,078 fathers and sons:

Fathers: average height = 68 in, SD = 2.7 in
Sons: average height = 69 in, SD = 2.7 in
r = 0.5
Suppose a father is 63 inches tall. What percentage have sons who are at least 66 inches tall?
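One way to sketch the calculation in Python, using only the standard library (a normal CDF built from `math.erf`); the (n − 1)/(n − 2) factor is ignored since n = 1078 is large:

```python
# Sketch of the fathers-sons percentage question: predict the son's height,
# attach the regression prediction error se, then use the normal curve.
import math

r, sd = 0.5, 2.7
pred = 69 + r * (63 - 68)          # 66.5 in (sd_y / sd_x = 1 here)
se = sd * math.sqrt(1 - r**2)      # about 2.34 in

z = (66 - pred) / se                               # about -0.21
pct = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))   # P(son >= 66 in)
print(round(100 * pct))            # about 58 (percent)
```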
Testing Correlation

Recall the equation of the regression line:

    ŷ = ȳ + r (s_y/s_x)(x − x̄)

so that the slope of the regression line is b = r (s_y/s_x).

The standard error for the slope is given by

    SE_b = se / (s_x √(n − 1)) = (s_y/s_x) √((1 − r²)/(n − 2)) = (b/r) √((1 − r²)/(n − 2))
Under the assumptions of:
1. normality, and
2. homoscedasticity (see below),
the t distribution with df = n - 2 may be used to find
confidence intervals and perform hypothesis tests.
Homoscedasticity means that the variability of the data
about the regression line does not depend on the value of x.
n = 988
Avg height = 70 in, SD height = 3 in
Avg weight = 162 lb, SD weight = 30 lb
r = 0.47

    b = (0.47)(30/3) = 4.7

    SE_b = (4.7/0.47) √((1 − (0.47)²)/986) ≈ 0.281

For df = 986, the Student t distribution is almost normal. So a 95% confidence interval for the slope is

    4.7 ± (1.96)(0.281) = 4.7 ± 0.55

The corresponding confidence interval for the correlation coefficient for all men is

    0.47 ± 0.055
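The arithmetic can be sketched in Python:

```python
# Sketch: the slope, its standard error, and the 95% confidence interval
# for the height-weight data (n = 988, so t is essentially normal).
import math

n, sd_x, sd_y, r = 988, 3, 30, 0.47

b = r * sd_y / sd_x                                  # 4.7
se_b = (b / r) * math.sqrt((1 - r**2) / (n - 2))     # about 0.281

lo, hi = b - 1.96 * se_b, b + 1.96 * se_b
print(round(b, 2), round(se_b, 3))    # 4.7 0.281
print(round(lo, 2), round(hi, 2))     # 4.15 5.25
```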
Example: For a study of 1,078 fathers and sons:

Fathers: average height = 68 in, SD = 2.7 in
Sons: average height = 69 in, SD = 2.7 in
r = 0.5
Test the hypothesis that the correlation coefficient for all fathers and sons is positive.
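A sketch of the test in Python, using the t statistic b / SE_b with df = n − 2 (here s_y/s_x = 1, so b = r):

```python
# Sketch of the hypothesis test H0: correlation = 0 vs. H1: correlation > 0
# for the fathers-sons data.
import math

n, r = 1078, 0.5
b = r * 2.7 / 2.7                                   # slope (sd_y / sd_x = 1)
se_b = (b / r) * math.sqrt((1 - r**2) / (n - 2))    # about 0.026

t = b / se_b
print(round(t, 1))   # about 18.9: far beyond any t cutoff at df = 1076, reject H0
```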