Correlation

Looking at data: relationships
- Correlation
Chapter 7
Objectives
Correlation

The correlation coefficient “r”

r does not distinguish x and y

r has no units

r ranges from -1 to +1

Influential points
The correlation coefficient "r"
The correlation coefficient is a measure of the direction and strength of
the linear relationship between 2 quantitative variables. It is calculated
using the mean and the standard deviation of both the x and y variables.
Time to swim: x̄ = 35, s_x = 0.7
Pulse rate: ȳ = 140, s_y = 9.5
Correlation can only be used to
describe quantitative variables.
Categorical variables don’t have
means and standard deviations.
PROPERTIES

scaleless [like demand elasticity in economics (-infinity, 0)]

-1 ≤ r ≤ 1

r = -1 only if y = a + bx with slope b<0

r = +1 only if y = a + bx with slope b>0
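The last two properties can be checked with a short Python sketch. The helper name `pearson_r`, the x values, and the coefficients a = 5, b = ±2 are illustrative assumptions, not taken from the slides; the helper assumes the usual sample formula for r (sum of z-score products divided by n - 1).

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5]                 # hypothetical x values
falling = [5 - 2 * x for x in xs]    # exact line y = a + bx with slope b = -2 < 0
rising = [5 + 2 * x for x in xs]     # exact line y = a + bx with slope b = +2 > 0

r_falling = pearson_r(xs, falling)
r_rising = pearson_r(xs, rising)
print(r_falling, r_rising)  # expected to be -1 and +1 up to floating-point rounding
```

Any other nonzero slope should give the same two values, since only the sign of b matters here.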
Correlation: Fuel Consumption vs Car Weight
[Scatterplot: FUEL CONSUMPTION vs CAR WEIGHT; fuel consumption (gal/100 miles, 2 to 7) against weight (1000 lbs, 1.5 to 4.5); r = .9766]
SAT Score vs Proportion of Seniors Taking SAT
[Scatterplot: 88-89 SAT state average (825 to 1075) against % of seniors that took the SAT (0 to 80); r = -.868; labeled points: IW, ND, SC, DC, NC]
"r" ranges
from -1 to +1
"r" quantifies the strength
and direction of a linear
relationship between 2
quantitative variables.
Strength: how closely the points follow a straight line.
Direction: is positive when individuals with higher X values tend to have higher values of Y.
Part of the calculation involves finding z, the standardized score we used when working with the normal distribution.
You DON'T want to do this by hand. Make sure you learn how to use your calculator!
Example: calculating correlation

Data: (x_1, y_1), (x_2, y_2), (x_3, y_3) = (1, 3), (1.5, 6), (2.5, 8)
$\bar{x} = 1.67,\ \bar{y} = 5.67,\ s_x = .76,\ s_y = 2.52$
$$r = \frac{(1 - 1.67)(3 - 5.67) + (1.5 - 1.67)(6 - 5.67) + (2.5 - 1.67)(8 - 5.67)}{(3 - 1)(.76)(2.52)} = .9538$$
Standardization:
Allows us to compare correlations between data sets where variables are measured in different units or when variables are different.
For instance, we might want to compare the correlation between [swim time and pulse] with the correlation between [swim time and breathing rate].
“r” does not distinguish x & y
The correlation coefficient, r, treats
x and y symmetrically.
[Two scatterplots of the same data with the axes swapped; r = -0.75 in both]
"Time to swim" is the explanatory variable here, and belongs on the x axis.
However, in either plot r is the same (r=-0.75).
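The symmetry can be checked numerically. This is a minimal sketch: the swim times and pulse rates below are made-up values for illustration (they are not the data behind the plots), and `pearson_r` is an assumed helper name.

```python
import math
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical measurements, invented for this sketch
swim_time = [34.1, 35.7, 34.7, 35.4, 36.2, 34.5]   # minutes
pulse = [152, 124, 140, 136, 128, 148]             # beats per minute

r_xy = pearson_r(swim_time, pulse)   # time on the x axis
r_yx = pearson_r(pulse, swim_time)   # axes swapped

print(math.isclose(r_xy, r_yx))      # True: swapping x and y leaves r unchanged
```

The formula multiplies the two z-scores for each individual, and multiplication does not care which variable is called x.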
"r" has no unit
Changing the units of variables does
not change the correlation coefficient
"r", because we get rid of all our units
when we standardize (get z-scores).
[Two scatterplots of the same data in different units; r = -0.75 in both. The z-score plot is the same for both plots.]
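A short Python sketch of the unit argument: converting one variable to different units leaves its z-scores, and therefore r, unchanged. The data are hypothetical values invented for the sketch, and the helper names `z_scores` and `pearson_r` are illustrative.

```python
import math
from statistics import mean, stdev

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pearson_r(xs, ys):
    """Sample correlation as the sum of z-score products divided by n - 1."""
    return sum(a * b for a, b in zip(z_scores(xs), z_scores(ys))) / (len(xs) - 1)

# Hypothetical measurements, invented for this sketch
time_min = [34.1, 35.7, 34.7, 35.4, 36.2, 34.5]   # swim time in minutes
time_sec = [t * 60 for t in time_min]             # the same times in seconds
pulse = [152, 124, 140, 136, 128, 148]            # beats per minute

same_z = all(math.isclose(a, b, abs_tol=1e-9)
             for a, b in zip(z_scores(time_min), z_scores(time_sec)))
r_min = pearson_r(time_min, pulse)
r_sec = pearson_r(time_sec, pulse)

print(same_z, math.isclose(r_min, r_sec))   # True True
```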
When the scatter of the points about the line decreases (less variability in one or both variables around the trend), the correlation coefficient gets stronger (closer to +1 or -1).
Correlation only describes linear relationships
No matter how strong the association,
r does not describe curved relationships.
Note: You can sometimes transform a non-linear association to a linear form,
for instance by taking the logarithm. You can then calculate a correlation using
the transformed data.
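As a hedged illustration of the note on transformations, the sketch below uses an invented exponential relationship, y = 2^x, which is strongly associated but curved. Taking the logarithm of y makes the relationship exactly linear. The data and the helper name `pearson_r` are assumptions made for the example.

```python
import math
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = list(range(1, 11))              # hypothetical x values 1..10
ys = [2 ** x for x in xs]            # curved (exponential) relationship
log_ys = [math.log(y) for y in ys]   # log(y) = x * log(2), a straight line in x

r_raw = pearson_r(xs, ys)            # understates the perfect but curved association
r_log = pearson_r(xs, log_ys)        # essentially 1 after the transformation

print(round(r_raw, 2), round(r_log, 2))
```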
High correlation does not imply cause and effect
CARROTS: Hidden terror in the produce department at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still alive, has severely wrinkled skin!!!
Everyone who ate carrots in 1865 is now dead!!!
45 of 50 17-year-olds arrested in Raleigh for juvenile delinquency had eaten carrots in the 2 weeks prior to their arrest!!!
More Correlations

There is a strong positive correlation between the monetary damage caused by structural fires and the number of firemen present at the fire. (More firemen, more damage.)

Improper training? Will no firemen present result in the least amount
of damage?
(1,2) (24,75) (1,0) (18,59) (9,9) (3,7)(5,35)
(20,46) (1,0) (3,2) (22,57)

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.
Example
x = fouls committed by player; y = points scored by same player
[Scatterplot: Points (0 to 80) against Fouls (0 to 30); (x, y) = (fouls, points); r = .935]
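The reported value can be reproduced in Python under one assumption: that the eleven (x, y) pairs printed just before the statement that r does not indicate cause and effect are the (fouls, points) data for this example. The slides do not label them, so treat the pairing as an inference; `pearson_r` is an illustrative helper name.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (len(xs) - 1)

# Assumed to be (fouls, points) for eleven players, copied from the listed pairs
pairs = [(1, 2), (24, 75), (1, 0), (18, 59), (9, 9), (3, 7), (5, 35),
         (20, 46), (1, 0), (3, 2), (22, 57)]
fouls = [x for x, _ in pairs]
points = [y for _, y in pairs]

r = pearson_r(fouls, points)
print(round(r, 3))   # 0.935
```

A high r here says only that players with more fouls tend to have more points. A plausible lurking variable is playing time, which the data do not record.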
Influential points
Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers.
Just moving one point away from the general trend here weakens the correlation from -0.91 to -0.75.
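The lack of resistance can be seen in a small Python sketch. The eight points below are invented for illustration and are not the data in the slide's plot, so the values of r will differ from -0.91 and -0.75. The helper name `pearson_r` is also an assumption.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5, 6, 7, 8]            # hypothetical x values
ys = [16, 14, 13, 11, 9, 8, 6, 4]        # close to a falling straight line

ys_moved = ys[:-1] + [14]                # move only the last point off the trend

r_original = pearson_r(xs, ys)
r_moved = pearson_r(xs, ys_moved)

print(round(r_original, 2), round(r_moved, 2))   # the second value is much closer to 0
```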
Review examples
1) What is the explanatory variable? Describe the form, direction and strength of the relationship. Estimate r.
[Scatterplot, axis in 1000’s] r = 0.94
2) If women always marry men 2 years older than themselves, what is the correlation of the ages between husband and wife?
r = 1
age_man = age_woman + 2 (equation for a straight line)
Thought quiz on correlation
1. Why is there no distinction between explanatory and response
variable in correlation?
2. Why do both variables have to be quantitative?
3. How does changing the units of one variable affect a correlation?
4. What is the effect of outliers on correlations?
5. Why doesn’t a tight fit to a horizontal line imply a strong correlation?