Correlation

Basic Statistics
Correlation
[Diagram: relationships and associations among variables (Var).]
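The Need for a Measure of Relationship
[Diagram: individual differences (variance) sit at the center of four goals: describe, predict, explain, and control.]
In Research
[Diagram: independent variables X1, X2, and X3 carry information about the dependent variable Y. Do they covary with Y?]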
The Concept of Correlation
Correlation is the association or relationship between two variables.
Co-relate: variables X and Y are co-related when they covary, that is, when they go together.
Patterns of Covariation
[Figure: three patterns of covariation between X and Y: positive correlation, zero or no correlation, and negative correlation. To correlate is to covary, to go together.]
Scatter plots allow us to visualize relationships.
The chief purpose of the scatter diagram is to study the nature of the relationship between two variables:
• Linear or curvilinear relationship
• Direction of relationship
• Magnitude (size) of relationship
Scatter Plots
Each point represents both the X and the Y score of one case. In every plot, the Variable X and Variable Y axes run from low to high.
[Scatter Plot A: an illustration of a perfect positive correlation; each X gives an exact Y value.]
[Scatter Plot B: an illustration of a positive correlation; each X gives only an estimated Y value.]
[Scatter Plot C: an illustration of a perfect negative correlation; each X gives an exact Y value.]
[Scatter Plot D: an illustration of a negative correlation; each X gives only an estimated Y value.]
[Scatter Plot E: an illustration of a zero correlation.]
[Scatter Plot F: an illustration of a curvilinear relationship.]
The Measurement of Correlation
The Correlation Coefficient
The degree of correlation between two variables can be described by such terms as “strong,” “low,” “positive,” or “moderate,” but these terms are not very precise.
If a correlation coefficient is computed between two sets of scores, the relationship can be described more accurately.
A statistical summary of the degree and direction of relationship or association between two variables can be computed.
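Pearson’s Product-Moment Correlation Coefficient r
$$ r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}} $$
[Number line: r runs from -1.00 through -.50, 0, and +.50 to +1.00. Values below 0 indicate negative correlation, 0 indicates no relationship, and values above 0 indicate positive correlation.]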
Direction of relationship: Sign (+ or –)
Magnitude: 0 through +1 or 0 through -1
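The two pieces of information carried by r can be read off separately. The following is a minimal sketch; the function name `describe_r` is an illustrative assumption, not something from the slides.

```python
def describe_r(r):
    """Split a correlation coefficient into direction and magnitude."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    # Direction comes from the sign of r.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "zero"
    # Magnitude is the absolute value, from 0 through 1.
    return direction, abs(r)


print(describe_r(-0.86))
print(describe_r(0.35))
```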
The Pearson Product-Moment Correlation Coefficient
Recall that the formula for a variance is:
$$ S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{\sum (X - \bar{X})(X - \bar{X})}{n - 1} $$
If we replaced the second deviation, (X - X̄), with the deviation of a second variable, Y, it would be:
$$ S_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n - 1} $$
This is called a covariance and is an index of the relationship between X and Y.
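To make the step from variance to covariance concrete, here is a small sketch in Python. The function names and the five-point data set are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def mean(values):
    return sum(values) / len(values)


def variance(x):
    """Sample variance: sum of (X - mean)(X - mean) divided by n - 1."""
    x_bar = mean(x)
    return sum((xi - x_bar) * (xi - x_bar) for xi in x) / (len(x) - 1)


def covariance(x, y):
    """Sample covariance: the second deviation is taken from Y instead of X."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)


# Hypothetical scores, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(variance(x))       # variance of X
print(covariance(x, y))  # covariance of X and Y
print(covariance(x, x))  # covariance of X with itself equals its variance
```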
Conceptual Formula for Pearson r
$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
This formula may be rewritten to reflect the actual method of calculation.
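The conceptual formula translates almost directly into code. The sketch below assumes two equal-length lists of scores; the name `pearson_r_conceptual` and the sample data are illustrative only.

```python
import math


def pearson_r_conceptual(x, y):
    """Pearson r from deviations about the means (conceptual formula)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of deviations.
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the two sums of squared deviations.
    denominator = math.sqrt(
        sum((xi - x_bar) ** 2 for xi in x) * sum((yi - y_bar) ** 2 for yi in y)
    )
    return numerator / denominator


# Hypothetical data: Y rises (or falls) in exact step with X.
x = [1, 2, 3, 4, 5]
print(pearson_r_conceptual(x, [2, 4, 6, 8, 10]))   # perfect positive relationship
print(pearson_r_conceptual(x, [10, 8, 6, 4, 2]))   # perfect negative relationship
```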
Calculation of Pearson r
$$ r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left[\sum X^2 - \frac{(\sum X)^2}{n}\right]\left[\sum Y^2 - \frac{(\sum Y)^2}{n}\right]}} $$
You should notice that this formula is merely the sum of squares for covariance divided by the square root of the product of the sums of squares for X and Y.
Formulae for Sums of Squares
$$ SS_x = \sum X^2 - \frac{(\sum X)^2}{n} $$
$$ SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n} $$
$$ SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} $$
Therefore, the formula for calculating r may be rewritten as:
Calculation of r Using Sums of Squares
$$ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} $$
An Example
Suppose that a college statistics professor is interested in how the number of hours that a student spends studying is related to how many errors the student makes on the mid-term examination. To determine the relationship, the professor collects the following data:
The Stats Professor’s Data
Student   Hours Studied (X)   Errors (Y)   X²          Y²          XY
1         4                   15           16          225         60
2         4                   12           16          144         48
3         5                   9            25          81          45
4         6                   10           36          100         60
5         7                   8            49          64          56
6         7                   4            49          16          28
7         7                   6            49          36          42
8         9                   2            81          4           18
9         9                   4            81          16          36
10        12                  3            144         9           36
Total     ΣX = 70             ΣY = 73      ΣX² = 546   ΣY² = 695   ΣXY = 429
The Data Needed to Calculate the Sum of Squares
Totals (n = 10): ΣX = 70, ΣY = 73, ΣX² = 546, ΣY² = 695, ΣXY = 429
$$ SS_x = \sum X^2 - \frac{(\sum X)^2}{n} = 546 - \frac{70^2}{10} = 546 - 490 = 56 $$
$$ SS_y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 695 - \frac{73^2}{10} = 695 - 532.9 = 162.1 $$
$$ SS_{xy} = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 429 - \frac{(70)(73)}{10} = 429 - 511 = -82 $$
Calculating the Correlation Coefficient
$$ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} = \frac{-82}{\sqrt{(56)(162.1)}} = -0.86 $$
Thus, the correlation between hours studied and errors made on the mid-term examination is -0.86, indicating that more time spent studying is related to fewer errors on the mid-term examination. Hopefully an obvious conclusion, but now a statistical one!
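The professor's calculation can be retraced in a few lines of Python. The data are the hours and errors for the ten students in the example; the variable names are illustrative choices.

```python
import math

# Hours studied (X) and errors (Y) for the ten students in the example.
hours = [4, 4, 5, 6, 7, 7, 7, 9, 9, 12]
errors = [15, 12, 9, 10, 8, 4, 6, 2, 4, 3]
n = len(hours)

# Column totals needed for the sums of squares.
sum_x = sum(hours)
sum_y = sum(errors)
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in errors)
sum_xy = sum(x * y for x, y in zip(hours, errors))

# Sums of squares, then r = SSxy / sqrt(SSx * SSy).
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
r = ss_xy / math.sqrt(ss_x * ss_y)

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print(round(ss_x, 1), round(ss_y, 1), round(ss_xy, 1))
print(round(r, 2))
```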
Pearson Product-Moment Correlation Coefficient r
[Number line: -1 is a perfect negative correlation, values between -1 and 0 are negative correlations, 0 is zero correlation, values between 0 and +1 are positive correlations, and +1 is a perfect positive correlation.]
[Figure: numerical values of r (-.73, .35, 0) shown alongside the labels negative correlation, zero correlation, positive correlation, perfect, strong, and moderate.]
The Pearson r and Marginal Distribution
The marginal distribution of X is simply the
distribution of the X’s; the marginal distribution
of Y is the frequency distribution of the Y’s.
[Figure: bivariate normal distribution showing the bivariate relationship between the X variable and the Y variable. The marginal distributions of X and Y are precisely the same shape.]
Interpreting r, the Correlation Coefficient
Recall that r includes two types of information:
• The direction of the relationship (+ or -)
• The magnitude of the relationship (0 to 1)
However, there is a more precise way to use the correlation coefficient, r, to interpret the magnitude of a relationship: the square of the correlation coefficient, or r².
The square of r tells us what proportion of the variance of Y can be explained by X, or vice versa.
Suppose you wish to estimate Y for a given value of X.
How does correlation explain variance?
[Figure: scatter plot of Variable Y against Variable X, both running from low to high. Part of the variation in Y is marked as explained and the rest as free to vary; 49% of variance is explained.]
An illustration of how the squared correlation accounts for variance in X, r = .7, r² = .49
Now, let's look at some correlation coefficients and their corresponding scatter plots.
[Scatter plot: Current Salary (0 to 120000) against Beginning Salary (0 to 70000).]
What is your estimate of r?
r = .87
r² = .76 = 76%
[Scatter plot: Current Salary (Y, 0 to 120000) against Beginning Salary (X, 0 to 70000).]
What is your estimate of r?
r = -1.00
r² = 1.00 = 100%
[Scatter plot: Current Salary (Y, 0 to 120000) against Beginning Salary (X, 0 to 70000).]
What is your estimate of r?
r = +1.00
r² = 1.00 = 100%
[Scatter plot: Beginning Salary (0 to 70000) against Months since Hire (60 to 100).]
What is your estimate of r?
r = .04
r² = .002 = .2%
[Scatter plot: an unlabeled vertical axis (0 to 6000) against Time to Accelerate from 0 to 60 mph (sec), 10 to 30.]
What is your estimate of r?
r = -.44
r² = .19 = 19%
Pearson r assumes that we are using interval or ratio data. What do we do if one or both of the variables were measured at the ordinal level?
If we replace the scores with ranks, we can use the same formula. However, the formula can be simplified when we are using ordinal data. The result is called the Spearman Rank-Order Correlation Coefficient.
Spearman’s Rank Order Correlation
As noted, the Spearman rS is a special case of the Pearson r (when the data are ordinal). The formula, derived from the Pearson, is as follows:
$$ r_S = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
where
$$ d_i = X_i - Y_i $$
The characteristics and interpretation of a Spearman rS are exactly the same as a Pearson r. That is, rS ranges from -1 to +1, and the square provides an estimate of the shared variance.
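The Spearman formula is short enough to sketch directly. The function below assumes its inputs are already ranks with no ties (the simplified formula is exact only in that case); the name `spearman_rs` and the example rankings are illustrative.

```python
def spearman_rs(rank_x, rank_y):
    """Spearman rS = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), assuming untied ranks."""
    n = len(rank_x)
    # d is the difference between the two ranks of each case.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical rankings illustrating the two ends of the range.
ranks = [1, 2, 3, 4, 5]
print(spearman_rs(ranks, [1, 2, 3, 4, 5]))  # identical rankings
print(spearman_rs(ranks, [5, 4, 3, 2, 1]))  # completely reversed rankings
```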
Spearman Rank Order Correlation Coefficient
One or both of the variables are in the form of ranks.
Raw data may be converted to ranks, or ranks may be
gathered as the original data.
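Converting raw data to ranks can be sketched as below. This version assumes that no two scores are tied and that rank 1 goes to the smallest score; both are simplifying assumptions, and the scores are hypothetical.

```python
def to_ranks(scores):
    """Replace each score with its rank (1 = smallest), assuming no ties."""
    # Indices of the scores, ordered from smallest score to largest.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical raw scores, not taken from the slides.
raw_scores = [72, 85, 60, 91]
print(to_ranks(raw_scores))
```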
Example
Illustrated Calculation
N = 4
X    Y    d = X – Y    d²
1    2    -1           1
2    1     1           1
4    4     0           0
3    3     0           0
               Σd² = 2
$$ r_S = 1 - \frac{6(\sum d^2)}{n(n^2 - 1)} = 1 - \frac{6(2)}{4(4^2 - 1)} = 1 - \frac{12}{60} = 1 - .20 = .80 $$
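The illustrated calculation can be reproduced in Python, together with a check of the claim that Spearman's coefficient is Pearson r applied to ranks. The ranks are those of the N = 4 example; the variable names are illustrative.

```python
import math

# Ranks from the N = 4 example.
x_ranks = [1, 2, 4, 3]
y_ranks = [2, 1, 4, 3]
n = len(x_ranks)

# Spearman's simplified formula.
d = [x - y for x, y in zip(x_ranks, y_ranks)]
sum_d2 = sum(di ** 2 for di in d)
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Pearson r computed on the same ranks, via sums of squares.
ss_x = sum(x * x for x in x_ranks) - sum(x_ranks) ** 2 / n
ss_y = sum(y * y for y in y_ranks) - sum(y_ranks) ** 2 / n
ss_xy = sum(x * y for x, y in zip(x_ranks, y_ranks)) - sum(x_ranks) * sum(y_ranks) / n
r_pearson = ss_xy / math.sqrt(ss_x * ss_y)

print(d, sum_d2)
print(round(r_s, 2), round(r_pearson, 2))
```

With untied ranks, as here, the two routes give the same value.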
Choosing Between
Pearson and Spearman
• If the data are ordinal, we have no choice: we have to use Spearman.
• If the data are interval or ratio, we do have a choice.
– Pearson is more sensitive
– Spearman is easier to compute by hand
Summary of Measures of Relationship
There are other correlation coefficients for other levels of measurement. However, we will study only three: the two we have already reviewed and, later, one more for nominal data.
Spearman Rank Correlation Coefficient: rS
The Biserial Correlation Coefficient: rb
The Point-Biserial Correlation Coefficient: rpb
The Phi Correlation Coefficient: φ
The Tetrachoric Correlation Coefficient: rt
The Rank-Biserial Correlation Coefficient: rrb
Summarizing Correlations
• Pearson and Spearman Correlation
Coefficients range from -1.0 to + 1.0
• Pearson and Spearman Correlation
Coefficients indicate both direction and
magnitude of the relationship
• Correlation does NOT imply Causation