Statistics for Everyone Workshop
Fall 2010
Part 6A
Assessing the Relationship Between 2
Numerical Variables Using Correlation
Workshop presented by Linda Henkel and Laura McSweeney of Fairfield
University
Funded by the Core Integration Initiative and the Center for Academic
Excellence at Fairfield University
Assessing the Relationship Between
2 Numerical Variables With Correlation
What if the research question we want to ask is whether or
how strongly two variables are related to each other, but we
do not have experimental control over those variables and
we rely on already existing conditions?
• TV viewing and obesity
• Age and severity of flu symptoms
• Tire pressure and gas mileage
• Amount of pressure and compression of insulation
Assessing the Relationship Between
2 Variables With Correlation
The nature of the research question is about the
association between two variables, not about
whether one causes the other
Correlation describes and quantifies the systematic,
linear relation between variables
Correlation ≠ Causation
Statistics as a Tool in Scientific Research
Types of Research Questions
• Descriptive (What does X look like?)
• Correlational (Is there an association between
X and Y? As X increases, what does Y do?)
• Experimental (Do changes in X cause changes in
Y?)
Different statistical procedures allow us to answer the
different kinds of research questions
Correlation: type and strength of linear
relationship between X and Y
Linear Regression: making predictions of
Variable Y based on knowing the value of
Variable X
Linear Correlation Test
Statistical test for Pearson correlation coefficient (r)
Used for: Analyzing the strength of the linear
relationship between two numerical variables
Use when: Both variables are numerical (interval or
ratio) and the research question is about the
type and strength of relation (not about
causality)
Other Correlation Coefficients
There are many different types of correlation coefficients, used for different types of variables, but we focus only on Pearson r
Point biserial (rPB): Use when one variable is nominal and has two levels (e.g., gender [male/female], type of car [gas-powered/hybrid]) and one variable is numerical (e.g., reaction time; miles per gallon)
Spearman rank order (rS): Use for ordinal data, or for numerical data that are not normally distributed or whose relationship is not linear
Calculators for these are available at:
http://faculty.vassar.edu/lowry/VassarStats.html
The Essence of the Correlation Coefficient
The correlation coefficient indicates whether or not
a relationship exists (association, co-occurrence,
covariation)
The value of the correlation coefficient tells you
about the type and strength of the relationship
• Positive, negative
• Strong, moderately strong, weak, no relation
Different Types of Correlations
Positive: As X increased, Y increased and as X decreased,
Y decreased
Negative: As X increased, Y decreased and as X decreased,
Y increased
Different Types of Correlations
• No systematic relation between X and Y
• High values of X are associated with both high & low
values of Y;
• Low values of X are associated with both high & low
values of Y
Different Types of Correlations
Curvilinear: Not straight
U: As X increased, Y decreased up to a point then
increased
Inverted U: As X increased, Y increased up to a point,
then decreased
Describing Linear Correlation
Pearson correlation coefficient (r)
Direction/type: Positive, Negative, or No correlation
Magnitude/strength:
r = ±1 → perfect correlation
r closer to ±1 → strong correlation
r closer to 0 → weak or no correlation
• r is unitless; r is NOT a proportion or percentage
• It doesn’t matter which variable you call X or Y when
calculating r
Issues in Interpreting Correlation Coefficient
Magnitude of the Effect:
|r|           Conclusion About Relationship
0 to .20      negligible to weak
.20 to .40    weak to moderate
.40 to .60    moderate to strong
.60 to 1.0    strong
Example 1: What can we say?
Deposit Time (s)    Oxide Thickness (Angstroms)
18                  1059
35                  1049
52                  1039
52                  1026
18                  1001
23                  1263
etc.                etc.
[Scatterplot: Oxide Thickness (Angstroms) vs. Deposit Time (s)]
What can we say?
There is little or no correlation between oxide
thickness and deposit time.
In fact, r = .002.
Example 2: What can we say?
Surface Area to Volume (mm^2/mm^3)    Drug Release Rate (% Released)
1.5                                   60
1.05                                  48
0.9                                   39
0.75                                  33
0.6                                   30
0.65                                  29
[Scatterplot: Drug Release Rate (% Released) vs. Surface Area To Volume (mm^2/mm^3)]
What can we say?
There was a strong, positive correlation
between surface area to volume and the
drug release rate.
As the surface area to volume increased, the
drug release rate increased
The smaller the surface area to volume, the
lower the drug release rate
What More Does the Correlation Coefficient Tell Us?
Surface Area to Volume Ratio (mm^2/mm^3)    Drug Release Rate (% released)
1.5                                         60
1.05                                        48
0.9                                         39
0.75                                        33
0.6                                         30
0.65                                        29
[Scatterplot: Drug Release Rate (% released) vs. Surface Area To Volume Ratio (mm^2/mm^3)]
r = .99
Interpreting Correlation Coefficient
A strong linear association was found between the
surface area to volume and the drug release rate,
r = .99
There was a significant positive correlation between
the surface area to volume and the drug release
rate, r = .99
As the surface area to volume increased the drug
release rate increased, and this was a strong linear
relationship, r = .99
Testing for Linear Correlation
Correlations are descriptive
We can describe the type and strength of the
linear relationship between two variables by
examining the r value
r values are associated with p values that reflect
the probability that the obtained relationship is
real or is more likely just due to chance
Running and Interpreting a Correlation
The key research question in a correlational design
is: Is there a real linear relationship between the two
variables, or is the obtained pattern no different from what
we would expect just by chance?
i.e. H0: The population correlation coefficient is zero
HA: The population correlation coefficient is not zero
Teaching tip: In order to understand what we mean
by “a real relationship,” students must understand
probability
What Do We Mean by “Just Due to Chance”?
p value = probability of results being due to chance
When the p value is high (p > .05), the obtained
relationship is probably due to chance
.99 .75 .55 .25 .15 .10 .07
When the p value is low (p < .05), the obtained
relationship is probably NOT due to chance and
more likely reflects a real relationship
.04 .03 .02 .01 .001
What Do We Mean by “Just Due to Chance”?
p value = probability of results being due to chance
[Probability of observing your data (or more extreme) if H0 were true]
When the p value is high (p > .05), the obtained relationship is
probably due to chance
[Data likely if H0 were true]
.99 .75 .55 .25 .15 .10 .07
When the p value is low (p < .05), the obtained relationship is
probably NOT due to chance and more likely reflects a real
relationship
[Data unlikely if H0 were true, so data support HA]
.04 .03 .02 .01 .001
What Do We Mean by “Just Due to Chance”?
In science, a p value of .05 is a conventionally accepted cutoff
point for saying when a result is more likely due to chance
or more likely due to a real effect
Not significant = the obtained relationship is probably due to
chance; the relationship observed does not appear to really
differ from what would be expected based on chance; p >
.05
Statistically significant = the obtained relationship is probably
NOT due to chance and is likely a real linear relationship
between the two variables; p < .05
Finding the Linear Correlation Using Excel
Step 1: Make a scatterplot of the data
If there is a linear trend, go to Step 2
Step 2: Get the Pearson Correlation Coefficient (r)
Using the =CORREL(X, Y) function
[Refer to handouts for more detail]
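For instructors who prefer a scripting tool, the same computation can be sketched in Python. This is an addition to the workshop materials (it assumes SciPy is installed) and uses the surface-area-to-volume and drug-release values that appear later in these slides:

```python
# Sketch (not from the workshop): Pearson r and its two-tailed p value in Python.
from scipy import stats

surface_to_volume = [1.5, 1.05, 0.9, 0.75, 0.6, 0.65]   # X
release_rate = [60, 48, 39, 33, 30, 29]                  # Y

r, p = stats.pearsonr(surface_to_volume, release_rate)   # like =CORREL, plus the p value
print(f"r = {r:.2f}, p = {p:.4f}")                        # r should come out near .99
```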
Running a Test for Linear Correlation
Using Excel
Pearson’s r: Both variables are numerical (interval or ratio) and
the research question is about the type and strength of relation
(not about causality)
Need:
• Pearson correlation coefficient (r)
• Number of pairs of observations (N)
To run: Open Excel file “SFE Statistical Tests” and go to page
called Linear Correlation Test
Enter the correlation coefficient (r) and number of pairs of
observations (N)
Output: Computer calculates the p value
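If you only have r and N (the two inputs the worksheet asks for), a p value can be recovered from the t distribution with N - 2 degrees of freedom. The sketch below shows that standard formula in Python; treating it as what the "SFE Statistical Tests" workbook does internally is an assumption, so use it only as an illustration:

```python
# Sketch: two-tailed p value from r and N alone, via the t distribution
# with N - 2 degrees of freedom (t = r * sqrt((N - 2) / (1 - r^2))).
from math import sqrt
from scipy import stats

def p_from_r(r, n):
    t = r * sqrt((n - 2) / (1 - r ** 2))   # test statistic
    return 2 * stats.t.sf(abs(t), n - 2)   # two-tailed p value

print(p_from_r(0.99, 6))   # very small p, as in the drug-release example
```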
Running a Test for Linear Correlation
Using SPSS
When to Calculate Pearson’s r: Both variables are numerical
(interval or ratio) and the research question is about the type
and strength of relation (not about causality)
To run:
Analyze  Correlate  Bivariate
Move the X and Y variables over.
Check “Pearson Correlation” and “Two-tailed test of
significance”. Then click Ok.
Output: Computer calculates r and the p value
Reporting Results of Correlation
If the relationship was significant (p < .05)
(a) Say: A [say something about size: weak, moderate, strong]
[say direction: positive/negative] correlation was found between
[X] and [Y] that was statistically significant, r = .xx, p = .xx
(b) Describe the relation: Thus as X increased, Y
[increased/decreased]
Note: If the r value is positive, then as X increased, Y increased; if r is
negative, then as X increased, Y decreased
e.g., A strong positive correlation was found between surface area to volume
ratio and the drug release rate that was statistically significant, r = .99,
p = .0001. As the surface area to volume ratio increased, the drug release
rate increased.
Reporting Results of Correlation
If the relationship was not significant (p > .05)
Say: No statistically significant correlation between [X]
and [Y] was found, r = .xx, p = .xx. Thus as X
increased, Y neither increased nor decreased
systematically.
e.g., No statistically significant correlation between oxide
thickness and deposit time was found, r = .002, p = .99.
Thus as oxide thickness increased, deposit time neither
increased nor decreased systematically.
Teaching Tip
Impress upon your students that an association
does not imply causation
e.g., Average life expectancy and the average
number of TVs per household are highly correlated.
But you can’t increase life expectancy by increasing
the number of TVs. They are related, but it isn’t a
cause-and-effect relationship.
More Teaching Tips
You can ask your students to report either:
• the exact p value (p = .03, p = .45)
• the cutoff: say either p < .05 (significant) or p > .05 (not significant)
You should specify which style you expect. Ambiguity confuses them!
Tell students to use the word “significant” only when they
mean it (i.e., the probability the results are due to chance is less
than 5%) and to not use it with adjectives (i.e., they often mistakenly
think one test can be “more significant” or “less significant” than
another). Emphasize that “significant” is a cutoff that is either met or
not met -- Just like you are either found guilty or not guilty, pregnant
or not pregnant. There are no gradients. Lower p values = less
likelihood results are due to chance, not “more significant”
Correlations are Predictive
You can predict one variable from the other if there
is a strong correlation.
If so, linear regression can be used to find a linear
model which can be used to make predictions.
But, remember to make a scatterplot of the data to
see if a linear model is appropriate and/or the
best model!
Correlation: type and strength of linear
relationship between X and Y
Linear Regression: making predictions of
Variable Y based on knowing the value of
Variable X
Example 1
[Scatterplot: Hours of protection vs. Cost per ounce ($)]
A linear regression seems appropriate since the data have an overall linear trend.
Example 2
[Scatterplot: Mass of chemical spill (lbs) vs. Time (minutes), r = -.92]
Even though r = -.92, a linear regression model is not appropriate here. A different model should be used.
Terminology
X variable: predictor (independent variable,
explanatory variable, what we know)
Y variable: criterion (dependent variable,
response variable, what we want to predict)
Stronger correlation = better prediction
Making Predictions
Time (s)    Oxide Thickness (Angstroms)
18          1059
35          1049
52          1039
52          1026
18          1001
23          1263
etc.        etc.
[Scatterplot: Oxide Thickness (Angstroms) vs. Time (s)]
Making Predictions
Surface Area to Volume Ratio (mm^2/mm^3)    Drug Release Rate (% released)
1.5                                         60
1.05                                        48
0.9                                         39
0.75                                        33
0.6                                         30
0.65                                        29
[Scatterplot: Drug Release Rate (% released) vs. Surface Area To Volume Ratio (mm^2/mm^3)]
Using Regression to Make Predictions
Low or no correlation (e.g., Oxide thickness and
Deposit time)
• Best prediction of Y is the mean of Y (knowing X
doesn’t add anything)
High correlation (e.g., Surface area to volume
ratio vs. Drug release rate )
• Best prediction of Y is based on knowing X
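A small Python sketch of these two prediction rules (an addition to the slides, reusing the drug-release data shown earlier):

```python
# Sketch of the two prediction rules above, using the drug-release data from the slides.
import numpy as np

x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])   # surface area to volume ratio
y = np.array([60, 48, 39, 33, 30, 29])            # drug release rate (%)

# Low or no correlation: the best guess for any new case is simply the mean of Y
prediction_from_mean = y.mean()

# High correlation: predict from X using the least squares line (introduced next)
b, a = np.polyfit(x, y, deg=1)                    # slope and intercept
prediction_from_x = b * 1.0 + a                   # prediction when X = 1.0

print(prediction_from_mean, prediction_from_x)
```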
Sample Scatterplots & Regression Lines
If r = ±1, Y’ = Y
If r = 0, Y’ = MY (the mean of Y)
Terminology
Regression line:
• Best fitting straight line that summarizes a
linear relation
• Comprised of the predicted values of Y
(denoted by Y’ )
What Do We Mean by the “Best Fit” Line?
It is the line that minimizes the vertical distances
from the observed points to the line
[Scatterplot: Comparing Amounts of Snowfall, Gauge Measured (inches of snow) vs. Radar Prediction (inches of snow), with the best fit line and vertical distances shown]
Least Squares Line
Error or residual = observed Y – predicted Y
= Y – Y’
The least-squares line is the “best fit” line that
minimizes the sum of the square errors/residuals
i.e., it minimizes Σ(Y – Y’)2
Best Fit Line
Also known as linear regression model or least squares line
Used to predict Y using a single predictor X and a linear model
Y’ = bX + a
Y’ = predicted y-value
X = known x-value
b = slope
a = y-intercept or the point where line crosses
y-axis
[We can use more advanced techniques for multiple predictors
or nonlinear models]
Interpreting the Model
Y’ = bX + a
What is b?
• Slope of the regression line
• The ratio of how much Y changes relative to a
change in one unit of X
• Same sign as the correlation
What is a?
• Y-intercept or where the line crosses the Y axis
(where X = 0)
Getting the Least Squares Line
While Excel will get the least squares line, we give
the following formulas for completeness.
b = r (SDY / SDX)
a = MY – b(MX)
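As a check on these formulas, the short Python sketch below (an addition to the slides, assuming NumPy) computes b and a from r and the standard deviations and compares them with a direct least squares fit on the drug-release data:

```python
# Sketch: compute b and a from the formulas above and compare with a direct fit.
import numpy as np

x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])   # drug-release data from the slides
y = np.array([60, 48, 39, 33, 30, 29])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)   # b = r(SDY / SDX)
a = y.mean() - b * x.mean()             # a = MY - b(MX)

b_fit, a_fit = np.polyfit(x, y, deg=1)  # least squares fit done directly
print(b, a)                             # matches the direct fit below
print(b_fit, a_fit)
```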
Getting the Least Squares Line in Excel
Linear Regression: When you want to predict a numerical
variable Y from another numerical variable X
Need:
• Enter X and Y data in Excel
• Be sure that X column is directly to the left of the Y column
To get linear model: Make a scatterplot of Y vs. X.
Click on the Chart and choose Chart/Add Trendline
Choose the Option tab and select Display equation
Output: Least squares line
Getting the Least Squares Line in SPSS
Linear Regression: When you want to predict a numerical
variable Y from another numerical variable X
Need: Enter X and Y data in SPSS
To get linear model:
Analyze  Regression  Linear;
Move the Y variable to Dependent and the X variable to
Independent; Click Ok.
Output: Least squares line
Example
X = Surface Area to Volume Ratio (mm2/mm3)
Y = Drug Release Rate (%released)
The least squares line is
Y’ = 36.92X + 7.21
If surface area to volume ratio is 1, what is predicted
drug release rate? ______
If the actual drug release rate was 40%, what is the
residual? ________
Example
X = Surface Area to Volume Ratio (mm2/mm3)
Y = Drug Release Rate (%released)
The least squares line is
Y’ = 36.92X + 7.21
If surface area to volume ratio is 1, what is predicted
drug release rate?
Y’ = 36.92(1)+7.21 = 44.13% released
If the actual drug release rate was 40%, what is the
residual?
Residual = Y – Y’ = 40 – 44.13 = -4.13%
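The same arithmetic as a few lines of Python (added for illustration, using the line given on the slide):

```python
# The worked example above, using the fitted line from the slide.
def predict(x):
    return 36.92 * x + 7.21     # Y' = bX + a as given on the slide

y_pred = predict(1.0)           # predicted release rate at a ratio of 1
residual = 40 - y_pred          # observed Y minus predicted Y'
print(round(y_pred, 2), round(residual, 2))   # 44.13 and -4.13, matching the slide
```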
Example
X = Surface Area to Volume Ratio (mm2/mm3)
Y = Drug Release Rate (%released)
The least squares line is
Y’ = 36.92X + 7.21
By how much can you expect the drug release rate
to increase or decrease if you increase the surface
area to volume ratio by 1 mm2/mm3?_________
Example
X = Surface Area to Volume Ratio (mm2/mm3)
Y = Drug Release Rate (%released)
The least squares line is
Y’ = 36.92X + 7.21
By how much can you expect the drug release rate
to increase or decrease if you increase the surface
area to volume ratio by 1 mm2/mm3?
This is just the slope, so you would expect the drug
release rate to increase by 36.92%
What about R2?
Often in correlational research the value R2 is
reported.
R2 = Proportion of variability of Y accounted for by
X and the linear model
For example, if R2 = .64 then we can say:
• 64% of the variance in the Y scores can be predicted
from Y’s (linear) relation with X
• Predictions are 64% more accurate using the linear
regression equation to make predictions (Y’) than when
we use MY to make predictions
How do you get R2?
R2 is literally the correlation coefficient, r, times
itself.
R2 = (r)2
0 ≤ R2 ≤ 1
Low r → low R2 → little improvement over MY
High r → high R2 → more accurate predictions using Y’
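The sketch below (an addition to the slides) verifies both readings of R2 on the drug-release data: squaring r, and comparing the error of the regression line with the error of predicting the mean.

```python
# Sketch: verify R^2 = r^2 and its "improvement over predicting the mean" reading.
import numpy as np

x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])   # drug-release data from the slides
y = np.array([60, 48, 39, 33, 30, 29])

r = np.corrcoef(x, y)[0, 1]
b, a = np.polyfit(x, y, deg=1)
y_hat = b * x + a

ss_total = np.sum((y - y.mean()) ** 2)   # error when predicting with MY
ss_resid = np.sum((y - y_hat) ** 2)      # error when predicting with Y'
r_squared = 1 - ss_resid / ss_total

print(r ** 2, r_squared)                 # the two values agree
```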
How do you get r from R2?
r = Sqrt(R2) or r = – Sqrt(R2)
Remember, the correlation (r) has the same sign
as the slope of the least squares line.
Issues to Consider
1. You can’t extrapolate outside the given domain
of the X’s
2. Be careful of the effect of potential outliers
3. Remember to impress upon your students that
an association does not imply causation, no
matter how seductive it may be to think that the 2
variables are causally related
You can’t extrapolate outside the given domain
of the X’s
You can only make predictions in the range of the given data.
We don’t know if the pattern we see continues for surface
area to volume ratios less than .6 and greater than 1.5
mm2/mm3.
Example
[Scatterplot: Drug Release Rate (% released) vs. Surface Area To Volume Ratio (mm^2/mm^3), with trendline y = 35.916x + 7.2094]
Be careful of the effect of potential outliers
Outliers can be influential or not.
Exploring Linear Regression Excel Worksheet
Hypothesis Testing For Simple Linear Regression
Y = α + βX + ε, where ε is normally distributed with
mean 0 and variance σ2
The F test allows a scientist to determine whether
their research hypothesis is supported
Null hypothesis H0:
• There is not a linear relationship between X and Y
• There is no correlation between X and Y
• β = 0
Hypothesis Testing For Simple Linear Regression
Research hypothesis HA:
• X is linearly related to Y
• The correlation between X and Y is different
from 0
• β ≠ 0
Teaching tip: Very important for students to be
able to understand and state the research
question so that they see that statistics is a tool
to answer that question
Hypothesis Testing For Simple Linear Regression
Null hypothesis: Pain level is not linearly related to weight
Research hypothesis: Pain level is linearly related to weight
Null hypothesis: There is no linear relationship between age and
heart rate
Research hypothesis: There is a linear relationship between age
and heart rate
Hypothesis Testing For Simple Linear Regression
p value = probability of results being due to chance
When the p value is high (p > .05), the obtained relationship is
probably due to chance
.99 .75 .55 .25 .15 .10 .07
When the p value is low (p < .05), the obtained relationship is
probably NOT due to chance and more likely reflects a real
linear relationship between X and Y
.04 .03 .02 .01 .001
Hypothesis Testing For Simple Linear Regression
In science, a p value of .05 is a conventionally accepted cutoff
point for saying when a result is more likely due to chance
or more likely due to a real effect
Not significant = the obtained relationship is probably due to
chance; X does not appear to be linearly related to Y;
p > .05
Statistically significant = the obtained relationship is probably
NOT due to chance and X and Y appear to be linearly
related; p < .05
Sources of Variance in Simple Linear Regression
SSTotal = SSRegression + SSResidual
Σ(Yi – MY)2 = Σ(Yi’ – MY)2 + Σ(Yi – Yi’)2
Total Sums of Squares: how much all the individual scores differ from the grand mean
Regression Sums of Squares: how much all the predicted values differ from the grand mean
Residual Sums of Squares: how much the individual scores differ from the predicted values
Each F test has certain values for degrees of freedom (df), which are based on the
sample size (N) and the number of conditions, and the F value will be associated with
a particular p value
SPSS calculates these numbers.
Summary Table for Simple Linear Regression
Source              Sum of Squares (SS)    df     Mean Square (MS)             F
Regression          Σ(Yi’ – MY)2           1      SSRegression/dfRegression    MSRegression/MSResidual
Residual (Error)    Σ(Yi – Yi’)2           N-2    SSResidual/dfResidual
Total               Σ(Yi – MY)2            N-1
N = # samples
To report, use the format: F(dfRegression, dfResidual) = x.xx, p _____.
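For completeness, here is a Python sketch (not part of the workshop materials; it assumes NumPy and SciPy) that builds this table by hand for a simple linear regression, using the drug-release data from the earlier slides:

```python
# Sketch: build the summary table quantities by hand for a simple linear regression.
import numpy as np
from scipy import stats

x = np.array([1.5, 1.05, 0.9, 0.75, 0.6, 0.65])   # drug-release data from the slides
y = np.array([60, 48, 39, 33, 30, 29])
n = len(y)

b, a = np.polyfit(x, y, deg=1)
y_hat = b * x + a

ss_reg = np.sum((y_hat - y.mean()) ** 2)   # Regression SS, df = 1
ss_res = np.sum((y - y_hat) ** 2)          # Residual SS,  df = n - 2
ms_reg = ss_reg / 1
ms_res = ss_res / (n - 2)
F = ms_reg / ms_res
p = stats.f.sf(F, 1, n - 2)                # p value from the F distribution

print(f"F(1, {n - 2}) = {F:.2f}, p = {p:.4f}")
```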
The F Ratio
A test for simple linear regression gives you an F ratio
The bigger the F value, the less likely the relationship
between X and Y is just due to chance
The bigger the F value, the more likely the relationship
between X and Y is not just due to chance and is due to
a real relationship
So big values of F will be associated with small p values that
indicate the linear relationship is significant (p < .05)
Little values of F (i.e., close to 1) will be associated with
larger p values that indicate the linear relationship is not
significant (p > .05)
Interpreting the F test for Simple Linear Regression
Cardinal rule: Scientists do not say “prove”! Conclusions are based on
probability (likely due to chance, likely a real effect…). Be explicit about
this to your students.
Based on p value, determine whether you have evidence to conclude the
relationship was probably real or was probably due to chance: Is the
research hypothesis supported?
p < .05: Significant
• Reject null hypothesis and support research hypothesis (the relationship was probably real; X is linearly related to Y)
p > .05: Not significant
• Retain null hypothesis and reject research hypothesis (any relationship was probably due to chance; X is not linearly related to Y)
Teaching Tips
Students have trouble understanding what is less than .05 and
what is greater, so a little redundancy will go a long way!
Whenever you say “p is less than point oh-five” also say, “so the
probability that relationship is due to chance is less than 5%,
so the 2 variables are likely to be linearly related.”
Whenever you say “p is greater than point oh-five” also say, “so
the probability that this is due to chance is greater than 5%, so
there’s just not enough evidence to conclude that there is a
linear relationship – X and Y are not really linearly related”
In other words, read the p value as a percentage, as odds, “the
odds that this relationship is due to chance are 1%, so X and
Y probably are linearly related…”
Running the F Test for Simple Linear
Regression in SPSS
Linear Regression: When you want to test whether there is
a linear relationship between two numerical variables (X
and Y)
Need: Enter X and Y data in SPSS
To get linear model:
Analyze  Regression  Linear;
Move the Y variable to Dependent and the X variable to
Independent; Click Ok.
Output: F and p-value
Reporting the Results of the F Test for Simple Linear
Regression
Step 1: Write a sentence that clearly indicates what statistical
analysis you used
An F test was used to determine if there was a linear relationship
between [X] and [Y].
-Or - An F test was used to determine if [X] is linearly related to
[Y].
An F test was used to determine if pain level is linearly related
to weight.
An F test was used to determine if there is a linear relationship
between age and heart rate.
Reporting the Results of the F Test for Simple Linear
Regression
Step 2: Report whether the linear relationship was significant or
not
The linear relationship between [X] and [Y] was significant [or not
significant], F(dfRegression, dfResidual) = X.XX [fill in F], p = xxxx.
The linear relationship between pain level and weight was
significant, F(1, 40) = 7.31, p = .01.
The linear relationship between age and heart rate was not
significant, F(1, 120) = 2.35, p = .10.
Assumptions Involved in Simple Linear
Regression
• Linearity
• Randomness and Normality
• Homoscedasticity
Assessing Linearity
Is a linear model appropriate?
Is a linear model the best model?
To check this make a scatterplot of the data
and see if there is an overall linear trend.
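The next slides show this check on plots. As a scripting stand-in, here is a minimal matplotlib sketch; the cost and hours values below are made up for illustration, since the slide shows only the plot, not the data:

```python
# Sketch of the linearity check: a scatterplot of Y against X.
import matplotlib.pyplot as plt

cost_per_ounce = [0.5, 0.8, 1.0, 1.4, 1.8, 2.2]    # X (hypothetical values)
hours_of_protection = [5, 8, 10, 13, 17, 21]       # Y (hypothetical values)

plt.scatter(cost_per_ounce, hours_of_protection)
plt.xlabel("Cost per ounce ($)")
plt.ylabel("Hours of protection")
plt.title("Look for an overall linear trend before fitting")
plt.show()
```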
Assessing Linearity
[Scatterplot: Hours of protection vs. Cost per ounce ($)]
Linear model seems appropriate.
Assessing Linearity
[Scatterplot: Mass of chemical spill (lbs) vs. Time (minutes)]
Linear regression is not appropriate here. Should use a different model.
Assessing Randomness and Normality of
Residuals
Are the residuals random and normally
distributed?
The data should be randomly scattered about
the least squares line. There should be no
patterns about the least squares line.
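A minimal Python sketch of this residual check (an addition to the slides; it reuses the made-up cost/hours values from the linearity sketch above):

```python
# Sketch of the residual check: plot residuals against X and look for patterns.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.5, 0.8, 1.0, 1.4, 1.8, 2.2])   # hypothetical cost per ounce
y = np.array([5, 8, 10, 13, 17, 21])           # hypothetical hours of protection

b, a = np.polyfit(x, y, deg=1)
residuals = y - (b * x + a)       # observed Y minus predicted Y'

plt.scatter(x, residuals)
plt.axhline(0)                    # residuals should scatter randomly around this line
plt.xlabel("Cost per ounce ($)")
plt.ylabel("Residual")
plt.show()
```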
Assessing Normality of Residuals
[Scatterplot: Hours of protection vs. Cost per ounce ($), with least squares line]
Points are randomly scattered about the least squares line.
Assessing Normality of Residuals
[Scatterplot: Mass of chemical spill (lbs) vs. Time (minutes), with least squares line]
Points are not randomly scattered about the least squares line, so the residuals are not random.
Assessing Homoscedasticity
Homoscedasticity = the variability or scatter of points
about the best fit line is fairly constant
Check if line fits consistently well for all X’s.
Assessing Homoscedasticity
[Scatterplot: Hours of protection vs. Cost per ounce ($), with least squares line]
Line fits consistently well throughout.
Since all three assumptions are met for this data set, a linear model is appropriate.
Assessing Homoscedasticity
[Scatterplot: Height (in) vs. Weight (lbs), with least squares line]
The linear model fits better for weights more than 140 lbs. So, there may be a problem with homoscedasticity.
Time to Practice
Correlation and Linear Regression