Rsquare

Descriptive measures of the
strength of a linear association
r-squared and the (Pearson)
correlation coefficient r
Translating a research question into
a statistical procedure
• How strong is the linear relationship
between skin cancer mortality and
latitude?
– (Pearson) correlation coefficient r
– Coefficient of determination r²
Where does this topic fit in?
• Model formulation
• Model estimation
• Model evaluation
• Model use
Situation #1
A very weak linear relationship
[Regression plot of y versus x]
y = 54.4758 - 0.764016 x
S = 7.81137   R-Sq = 6.5 %   R-Sq(adj) = 3.2 %
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 119.1$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5$
$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 1827.6$
Situation #2
A fairly strong linear relationship
[Regression plot of y versus x]
y = 75.5458 - 5.76402 x
S = 7.81137   R-Sq = 79.9 %   R-Sq(adj) = 79.2 %
$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = 6679.3$
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 1708.5$
$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 8487.8$
Coefficient of determination r²
$r^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$
• r² is a number (a proportion!) between 0 and 1.
• If r² = 1:
– all data points fall perfectly on the regression line
– the predictor x accounts for all of the variation in y
• If r² = 0:
– the fitted regression line is perfectly horizontal
– the predictor x accounts for none of the variation in y
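As a rough illustration, the sketch below recomputes the coefficient of determination both ways (SSR/SSTO and 1 - SSE/SSTO) from the Situation #1 sums of squares, SSR = 119.1 and SSE = 1708.5. The function name `r_squared` is our own and does not come from the slides.

```python
def r_squared(ssr: float, sse: float) -> tuple[float, float]:
    """Return r^2 computed as SSR/SSTO and as 1 - SSE/SSTO."""
    ssto = ssr + sse  # total variation = regression part + error part
    return ssr / ssto, 1.0 - sse / ssto

# Situation #1 sums of squares (the very weak linear relationship)
via_ssr, via_sse = r_squared(ssr=119.1, sse=1708.5)
print(f"SSR/SSTO = {via_ssr:.3f}, 1 - SSE/SSTO = {via_sse:.3f}")
```

Both routes give the same proportion, about 6.5 %, which agrees with the R-Sq value reported for Situation #1.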
Interpretation of r²
• r² × 100 percent of the variation in y is
reduced by taking into account predictor x.
• r² × 100 percent of the variation in y is
“explained by” the variation in predictor x.
R-sq in Minitab fitted line plot
[Fitted line plot: Mortality versus Latitude (at center of state)]
Mort = 389.189 - 5.97764 Lat
S = 19.1150   R-Sq = 68.0 %   R-Sq(adj) = 67.3 %
R-sq in Minitab regression output
The regression equation is
Mort = 389.189 - 5.97764 Lat
S = 19.1150
R-Sq = 68.0 %
R-Sq(adj) = 67.3 %
Analysis of Variance
Source       DF       SS       MS        F      P
Regression    1  36464.2  36464.2  99.7968  0.000
Error        47  17173.1    365.4
Total        48  53637.3
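To connect the ANOVA table to the summary line, here is a small sketch that recomputes the mean squares, F, S, R-Sq and R-Sq(adj) from the sums of squares and degrees of freedom in the skin cancer output. The adjusted R-Sq formula used, 1 - (SSE/df_error)/(SSTO/df_total), is the standard definition and is not stated on the slides. Variable names are our own.

```python
import math

# Sums of squares and degrees of freedom read off the ANOVA table
ss_reg, ss_err, ss_tot = 36464.2, 17173.1, 53637.3
df_reg, df_err, df_tot = 1, 47, 48

ms_reg = ss_reg / df_reg            # mean square for regression
ms_err = ss_err / df_err            # mean square error
f_stat = ms_reg / ms_err            # ANOVA F statistic
s = math.sqrt(ms_err)               # residual standard error S
r_sq = ss_reg / ss_tot              # R-Sq as a proportion
# Standard adjusted R-Sq definition (an assumption, not shown on the slides)
r_sq_adj = 1.0 - (ss_err / df_err) / (ss_tot / df_tot)

print(f"MSE = {ms_err:.1f}, F = {f_stat:.2f}, S = {s:.3f}")
print(f"R-Sq = {100 * r_sq:.1f} %, R-Sq(adj) = {100 * r_sq_adj:.1f} %")
```

Up to rounding of the tabulated sums of squares, these should agree with the values Minitab prints.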
Pearson correlation coefficient r
If r² is written in decimal form (e.g. 0.39 or 0.87), then:
$r = \pm\sqrt{r^2}$
• r is a (unitless) number between -1 and 1, inclusive.
• Sign of coefficient of correlation
– plus sign if slope of fitted regression line is positive
– negative sign if slope of fitted regression line is
negative
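A minimal sketch of this rule, using the skin cancer sums of squares (SSR = 36464.2, SSTO = 53637.3) and the fitted slope -5.97764; the helper name `r_from_r_squared` is hypothetical.

```python
import math

def r_from_r_squared(r_sq: float, slope: float) -> float:
    """Pearson r: magnitude sqrt(r^2), sign copied from the fitted slope."""
    return math.copysign(math.sqrt(r_sq), slope)

# r^2 = SSR / SSTO for mortality versus latitude; the slope is negative
r = r_from_r_squared(r_sq=36464.2 / 53637.3, slope=-5.97764)
print(round(r, 3))
```

Because the slope is negative, the result is the negative square root, matching the Pearson correlation of -0.825 that Minitab reports for Lat and Mort.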
Formulas for the
Pearson correlation coefficient r
$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
$r = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \times b_1$
What do we learn from the
formulas for r?
• The correlation coefficient r gets its sign
from the slope b1.
• The correlation coefficient r is a unitless
measure.
• The correlation coefficient r = 0 when the
estimated slope b1 = 0 and vice versa.
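The three points above can be checked numerically. The sketch below implements both formulas for r on a small made-up data set (the x and y values are hypothetical, chosen only for illustration), rescales x to mimic a change of units, and flips the sign of y to flip the slope.

```python
import math

def sums(x, y):
    """Return (Sxy, Sxx, Syy): centred cross-product and sums of squares."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy, sxx, syy

def r_direct(x, y):
    """First formula: r = Sxy / sqrt(Sxx * Syy)."""
    sxy, sxx, syy = sums(x, y)
    return sxy / math.sqrt(sxx * syy)

def r_via_slope(x, y):
    """Second formula: r = sqrt(Sxx) / sqrt(Syy) * b1, with b1 = Sxy / Sxx."""
    sxy, sxx, syy = sums(x, y)
    b1 = sxy / sxx  # least-squares slope estimate
    return math.sqrt(sxx) / math.sqrt(syy) * b1

# Hypothetical data, made up for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_rescaled = [2.54 * v for v in x]  # the same x expressed in different units

print(r_direct(x, y), r_via_slope(x, y))   # the two formulas agree
print(r_direct(x_rescaled, y))             # unchanged by the change of units
print(r_direct(x, [-v for v in y]))        # negative slope gives negative r
```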
Interpretation of
Pearson correlation coefficient r
• There is no nice practical interpretation for
r as there is for r².
• r = -1 is perfect negative linear relationship.
• r = 1 is perfect positive linear relationship.
• r = 0 is no linear relationship.
• For other r, how strong the relationship
between x and y is deemed depends on the
research area.
Pearson correlation coefficient r
in Minitab
Correlations: Lat, Mort
Pearson correlation of Lat and Mort = -0.825
Correlations: Mort, Lat
Pearson correlation of Mort and Lat = -0.825
How strong is the linear relationship
between Celsius and Fahrenheit?
[Regression plot: Fahrenheit versus Celsius]
Fahrenheit = 32 + 1.8 Celsius
S = 0   R-Sq = 100.0 %   R-Sq(adj) = 100.0 %
Pearson correlation of Celsius and Fahrenheit = 1.000
How strong is the linear relationship
between # of stories and height?
[Regression plot: HEIGHT versus STORIES]
HEIGHT = 90.3096 + 11.2924 STORIES
S = 58.3259   R-Sq = 90.4 %   R-Sq(adj) = 90.2 %
Pearson correlation of HEIGHT and STORIES = 0.951
How strong is the linear relationship
between driver age and see distance?
[Regression plot: Distance versus DrivAge]
Distance = 576.682 - 3.00684 DrivAge
S = 49.7616   R-Sq = 64.2 %   R-Sq(adj) = 62.9 %
Pearson correlation of Distance and DrivAge = -0.801
How strong is the linear relationship
between height and g.p.a.?
[Regression plot: gpa versus height]
gpa = 3.41021 - 0.0065630 height
S = 0.542316   R-Sq = 0.3 %   R-Sq(adj) = 0.0 %
Pearson correlation of height and gpa = -0.053
Caution #1
• The correlation coefficient r quantifies the
strength of a linear relationship.
• It is possible to get r = 0 with a perfect
curvilinear relationship.
Example of Caution #1
[Regression plot of y versus x, with fitted straight line]
y = 14 - 0.0000000 x
S = 13.4907   R-Sq = 0.0 %   R-Sq(adj) = 0.0 %
Pearson correlation of x and y = 0.000
Clarification of Caution #1
[Regression plot of y versus x, with fitted quadratic curve]
y = 0.0000000 - 0.0000000 x + 1 x**2
S = 0   R-Sq = 100.0 %   R-Sq(adj) = 100.0 %
Pearson correlation of x and y = 0.000
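Caution #1 can be reproduced with a few lines of arithmetic. The sketch below assumes the data are y = x² at the integers x = -6, ..., 6. That choice is our assumption (the slides do not list the data), though it does give a fitted intercept of 14 and S = 13.4907, in line with the output above.

```python
import math

x = list(range(-6, 7))        # assumed x values: -6, -5, ..., 6
y = [xi ** 2 for xi in x]     # perfect curvilinear relationship y = x^2

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

b1 = sxy / sxx                    # least-squares slope of the straight line
b0 = ybar - b1 * xbar             # least-squares intercept
r = sxy / math.sqrt(sxx * syy)    # Pearson correlation
# Residual standard error of the straight-line fit: SSE = SSTO * (1 - r^2)
s = math.sqrt(syy * (1 - r ** 2) / (n - 2))

print(f"intercept = {b0:.1f}, slope = {b1:.1f}, r = {r:.3f}, S = {s:.4f}")
```

The straight line is flat and r is zero even though y is an exact function of x, which is the point of the caution: r measures only the linear part of a relationship.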
Caution #2
• A large r² value should not be interpreted as
meaning that the estimated regression line
fits the data well.
• Another function might better describe the
trend in the data.
Example of Caution #2
[Regression plot: US Population (millions) versus Year]
USPopn = -2217.46 + 1.21862 Year
S = 22.8349   R-Sq = 92.0 %   R-Sq(adj) = 91.6 %
Pearson correlation of Year and USPopn = 0.959
Caution #3
• The coefficient of determination r² and the
correlation coefficient r can both be greatly
affected by just one data point (or a few
data points).
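To see how much a single point can matter, here is a small sketch on made-up numbers (these are hypothetical values, not the earthquake data used in the slides): five points with no linear trend, then the same five plus one extreme point.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from centred sums of squares and cross-products."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five points with no linear trend
x = [1, 2, 3, 4, 5]
y = [2, 1, 2, 1, 2]
r_without = pearson_r(x, y)

# The same points plus one extreme observation far from the rest
r_with = pearson_r(x + [15], y + [10])

print(f"r without the extreme point: {r_without:.3f}, r^2 = {r_without ** 2:.3f}")
print(f"r with the extreme point:    {r_with:.3f}, r^2 = {r_with ** 2:.3f}")
```

In this toy example one added observation moves r from essentially 0 to above 0.9, so both r and r² should always be read alongside a plot of the data.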
Example of Caution #3
[Regression plot: Deaths versus Magnitude]
Deaths = -1121.94 + 179.468 Magnitude
S = 140.359   R-Sq = 53.5 %   R-Sq(adj) = 41.9 %
Pearson correlation of Deaths and Magnitude = 0.732
Example of Caution #3
[Regression plot: Deaths versus Magnitude]
Deaths = 647.967 - 87.1465 Magnitude
S = 13.1447   R-Sq = 92.1 %   R-Sq(adj) = 89.4 %
Pearson correlation of Deaths and Magnitude = -0.960
Caution #4
• Correlation (association) does not imply
causation.
Example of Caution #4
[Regression plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year)]
Heart = 260.563 - 22.9688 Wine
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %
Pearson correlation of Wine and Heart = -0.843
Caution #5
• Ecological correlations are correlations
that are based on rates or averages.
• Ecological correlations tend to overstate the
strength of an association.
Example of Caution #5
• Data from 1988 Current Population Survey
• Treating individuals as the units
– Correlation between income and education for men age
25-64 in U.S. is r ≈ 0.4.
• Treating nine regions as the units
– Compute average income and average education for
men age 25-64 in each of the nine regions.
– Correlation between the average incomes and the
average education in U.S. is r ≈ 0.7.
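The effect of averaging can be illustrated with a deliberately simple, made-up data set (hypothetical values, not the Current Population Survey data): three regions whose centres lie on a line, with individuals scattered around each centre so that there is no association within a region.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from centred sums of squares and cross-products."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical regions centred at (0, 0), (1, 1) and (2, 2); within each
# region, four individuals are offset by (+-1, +-1).
centres = [0, 1, 2]
offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
education, income = [], []
for c in centres:
    for dx, dy in offsets:
        education.append(c + dx)
        income.append(c + dy)

# Treating individuals as the units
r_individuals = pearson_r(education, income)

# Treating regions as the units: one (mean education, mean income) per region
k = len(offsets)
mean_edu = [sum(education[i * k:(i + 1) * k]) / k for i in range(len(centres))]
mean_inc = [sum(income[i * k:(i + 1) * k]) / k for i in range(len(centres))]
r_regions = pearson_r(mean_edu, mean_inc)

print(f"individuals: r = {r_individuals:.2f}, regions: r = {r_regions:.2f}")
```

Averaging removes the within-region scatter, so the correlation between regional averages comes out much stronger than the correlation between individuals, the same direction of effect as in the survey example.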
Example of Caution #5
[Regression plot: Heart disease deaths (per 100,000 people) versus Wine consumption (liters of wine per person per year)]
Heart = 260.563 - 22.9688 Wine
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %
Example of Caution #5
[Regression plot: Mortality versus Latitude (at center of state)]
Mort = 389.189 - 5.97764 Lat
S = 19.1150   R-Sq = 68.0 %   R-Sq(adj) = 67.3 %
Caution #6
• A “statistically significant” r² does not
imply that the slope β1 is meaningfully
different from 0.
Caution #7
• A large r² does not necessarily mean that a
useful prediction of the response y_new (or
estimation of the mean response μ_Y) can be
made.
• It is still possible to get prediction (or
confidence) intervals that are too wide to be
useful.
Using the sample correlation r
to learn about
the population correlation ρ
Translating a research question into
a statistical procedure
• Is there a linear relationship between
skin cancer mortality and latitude?
– t-test for testing H0: β1= 0
– ANOVA F-test for testing H0: β1= 0
• Is there a linear correlation between
husband’s age and wife’s age?
– t-test for testing population correlation
coefficient H0: ρ = 0
Where does this topic fit in?
• Model formulation
• Model estimation
• Model evaluation
• Model use
Is there a linear correlation between
husband’s age and wife’s age?
[Scatter plot: Husband's Age (years) versus Wife's Age (years)]
Pearson correlation of HAge and WAge = 0.939
Is there a linear correlation between
husband’s age and wife’s age?
[Scatter plot: Wife's Age (years) versus Husband's Age (years)]
Pearson correlation of WAge and HAge = 0.939
The formal t-test
for correlation coefficient ρ
Null hypothesis: H0: ρ = 0
Alternative hypothesis: HA: ρ ≠ 0 or ρ < 0 or ρ > 0
Test statistic: $t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
P-value = What is the probability that we’d get a t* statistic as
extreme as we did, if the null hypothesis is true?
The P-value is determined by comparing t* to a t distribution
with n-2 degrees of freedom.
Is there a linear correlation between
husband’s age and wife’s age?
Test statistic:
$t^* = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.939\sqrt{170-2}}{\sqrt{1-0.939^2}} = 35.39$
Help in determining the P-value:
Student's t distribution with 168 DF
      x   P( X <= x )
35.3900        1.0000
Just let Minitab do the work:
Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
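For readers who want to check the arithmetic, here is a minimal sketch of the test statistic using r = 0.939 and n = 170 from the husband and wife data. The function name `t_star` is our own, and the P-value lookup is left to statistical software as on the slide.

```python
import math

def t_star(r: float, n: int) -> float:
    """t* = r * sqrt(n - 2) / sqrt(1 - r^2), compared to a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)

n = 170
t = t_star(r=0.939, n=n)
print(f"t* = {t:.2f} on {n - 2} degrees of freedom")
```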
When is it okay to use the t-test for
testing H0: ρ = 0?
• When it is not obvious which variable is the
response.
• When the (x, y) pairs are a random sample from a
bivariate normal population.
– For each x, the y’s are normal with equal variances.
– For each y, the x’s are normal with equal variances.
– Either, y can be considered a linear function of x.
– Or, x can be considered a linear function of y.
• The (x, y) pairs are independent.
The three tests will always yield
similar results.
The regression equation is HAge = 3.59 + 0.967 WAge
170 cases used 48 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     3.590    1.159   3.10  0.002
WAge       0.96670  0.02742  35.25  0.000

S = 4.069   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  20577  20577  1242.51  0.000
Error       168   2782     17
Total       169  23359

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
The three tests will always
yield similar results.
The regression equation is WAge = 1.57 + 0.911 HAge
170 cases used 48 cases contain missing values

Predictor     Coef  SE Coef      T      P
Constant     1.574    1.150   1.37  0.173
HAge       0.91124  0.02585  35.25  0.000

S = 3.951   R-Sq = 88.1%   R-Sq(adj) = 88.0%

Analysis of Variance
Source       DF     SS     MS        F      P
Regression    1  19396  19396  1242.51  0.000
Error       168   2623     16
Total       169  22019

Pearson correlation of WAge and HAge = 0.939
P-Value = 0.000
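The agreement between the three tests can be checked from the sums of squares in the two ANOVA tables. The sketch below (variable names are our own) computes, for each regression, the ANOVA F, the slope t as the square root of F, and the correlation t* from r² = SSR/SSTO.

```python
import math

n = 170
# (SS regression, SS error, SS total) from the two ANOVA tables
fits = {
    "HAge on WAge": (20577, 2782, 23359),
    "WAge on HAge": (19396, 2623, 22019),
}

results = {}
for label, (ssr, sse, ssto) in fits.items():
    r_sq = ssr / ssto
    f_stat = (ssr / 1) / (sse / (n - 2))             # ANOVA F = MSR / MSE
    t_slope = math.sqrt(f_stat)                      # |t| for H0: beta1 = 0, since t^2 = F
    t_corr = math.sqrt(r_sq * (n - 2) / (1 - r_sq))  # |t*| for H0: rho = 0
    results[label] = (r_sq, f_stat, t_slope, t_corr)
    print(f"{label}: r^2 = {r_sq:.3f}, F = {f_stat:.1f}, "
          f"t(slope) = {t_slope:.2f}, t*(corr) = {t_corr:.2f}")
```

Whichever variable is treated as the response, the statistics come out at about 35.25, up to rounding in the tabulated sums of squares. The slightly larger 35.39 obtained in the hand calculation arises only because r was rounded to 0.939 before being plugged into the formula.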
Which results should I report?
• If one of the variables can be clearly identified as
the response, report the t-test or F-test results for
testing H0: β1 = 0.
– Does it make sense to use x to predict y?
• If it is not obvious which variable is the response,
report the t-test results for testing H0: ρ = 0.
– Does it only make sense to look for an association
between x and y?