Wk09_1

A full analysis example
Multiple correlations
Partial correlations
New Dataset: Confidence
This is a dataset of the confidence scales of 41 employees,
taken some years ago, using 4 facets of confidence
(Physical, Appearance, Emotional, and Problem Solving), as well
as their gender and their citizenship status.
Example problem 1: Analyze the correlation between physical
confidence and appearance confidence.
First question we should ask: "Is Pearson correlation
appropriate?"
Four requirements for correlation:
1. A straight-line relationship.
2. Interval data.
3. Random sampling (we will need to assume this).
4. Normally distributed characteristics.
Check for normality in each of the histograms.
(Graphs → Legacy Dialogs → Histogram)
The appearance variable is close enough to normal, although it
has more weight at the upper and lower ends than it should.
The physical variable has a negative skew, so that could be a
problem.
There are at least two values far below the mean for physical
confidence. We should investigate them further.
Graphs → Legacy Dialogs → Boxplot
Use "Summaries of separate variables", and
Options → Exclude cases variable-by-variable
Boxplots identify outliers; from the boxplot we find that cases
31 and 37 are the outliers in physical confidence.
Looking at the data directly, we find that neither of these cases
even has a value for appearance.
The two outliers in ‘physical’ have no measured value for
‘appearance’
That means they will have no effect on a correlation between
"physical" and "appearance": correlation can only consider
cases where there are values for both variables (a point needs
both an X and a Y to exist).
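This pairwise deletion of incomplete cases can be sketched in Python. The scores below are made up for illustration (the real confidence data isn't reproduced here):

```python
import math

def pearson_pairwise(x, y):
    """Pearson r using only cases that have values for BOTH variables."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy), n

# hypothetical scores; the last two cases have no 'appearance' value,
# so they are dropped, just like SPSS does with pairwise deletion
physical   = [3, 4, 5, 2, 4, 1]
appearance = [2, 4, 5, 3, None, None]
r, n = pearson_pairwise(physical, appearance)
print(n, round(r, 3))  # only the 4 complete pairs are used
```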
Next, we look at the scatterplot.
Graphs → Legacy Dialogs → Scatter/Dot
No obvious signs of non-linear trends, but there doesn’t seem
to be any strong trend at all.
Correlation is an appropriate measure, but it won't be strong.
We run the correlation to find it and see if it’s significant at
alpha = 0.05.
Analyze → Correlate → Bivariate
Sig. (2-tailed) is .039, so the correlation is significant at alpha
= .05. (Had we chosen the .01 level, this would not be the case.)
We could also run a t-test by hand to verify the significance
level we found. (r= .373, n=31)
t* = 2.045 at 0.05 level, 29 df
t* = 2.756 at 0.01 level, 29 df
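The hand t-test uses the usual formula t = r·√(n−2) / √(1−r²); a quick sketch of the arithmetic:

```python
import math

r, n = 0.373, 31
df = n - 2  # 29 degrees of freedom
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
print(round(t, 3))  # ≈ 2.165

# critical values from the t-table, as quoted above:
# t falls between 2.045 (0.05 level) and 2.756 (0.01 level),
# so the correlation is significant at 0.05 but not at 0.01
```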
Let’s not sully this moment with a bad pun or something.
The correlation matrix is a table that shows the
correlation between two variables.
             Physical   Appearance
Physical      1.000       .373
Appearance     .373      1.000
In this case, Physical is correlated with Appearance with r = .373.
Likewise, Appearance is correlated with Physical with r = .373.
Also, everything correlates with itself with r = 1.000.
SPSS takes it a little further by making a matrix of correlation
coefficients, significances, and sample sizes.
The confidences are significantly correlated; there are 31 entries
for each pair (not 41, because real data has blanks).
However, if we go to the correlations menu and select more
than two variables of interest:
We get a 4x4 correlation matrix instead!
What’s better than two variables? FOUR VARIABLES!
Cutting away all the sample size and significance stuff, I find:
             Physical   Appearance   Emotional   Problem Solving
Phys.         1
Appear.       .373*      1
Emot.         .430**     .483**       1
Pr.Solve.     .730**     .527**      .540**       1
There is a positive correlation between every pair of facets. That
means that as any one facet of confidence increases, so do all the
others.
* significant at the 0.05 level
** significant at the 0.01 level
Multiple correlation is useful as a first-look search for
connections between variables, and for seeing broad trends
in the data.
If there were only a few variables connected to each other, it
would help us identify which ones without having to look at all
6 pairs individually.
Pitfalls of multiple correlations:
1. Multiple testing. With 4 variables, there are 6
correlations being tested for significance. At alpha = 0.05,
there's a 26.5% chance that at least one correlation will
show as significant even if there are no real
correlations at all.
At 5 variables, there are 10 tests and a 40.1% chance of falsely
rejecting at least one null. (Assuming no correlations)
At 6 variables, there are 15 tests and a 53.7% chance of falsely
rejecting the null.
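The chances quoted above come from 1 − (1 − α)^k, where k = C(m, 2) is the number of pairwise tests among m variables:

```python
import math

alpha = 0.05
for m in (4, 5, 6):
    k = math.comb(m, 2)           # number of pairwise correlations tested
    p_any = 1 - (1 - alpha) ** k  # P(at least one false rejection)
    print(m, k, round(p_any, 3))  # 6, 10, 15 tests -> .265, .401, .537
```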
You don’t need to know how to handle multiple testing
problems in this class. However, be cautious when dealing
with many variables.
Be suspicious of correlations that are significant, but just
barely.
Example: The weakest correlation here is physical with
appearance, a correlation of .373. That correlation being
significant could be a fluke.
2. Diagnostics don't get easier.
Doing correlations as a matrix allows you to do the math of a
correlation much faster than checking them one at a time.
However, the diagnostic tests like histograms, scatterplots, and
residual plots don’t get any faster.
Any correlation we’re interested in (even if it’s not showing as
significant) still needs checks for normality and linearity before
use in research.
One big advantage of correlating with multiple variables is that
we can isolate the connections between different variables
where they might not be obvious otherwise.
Example: Is there really a correlation between appearance
confidence and problem solving confidence SPECIFICALLY, or
are they both attached to the same general confidence?
Ponder that over a Mandarin Duck.
To isolate a correlation between two variables from a third
variable, we want to only look at the part of that correlation
that’s really between those two and not the third.
We want the partial correlation.
Example: Ice cream sales increase when murder rates increase.
These two variables have nothing logical to do with each other,
however, they both increase when it’s hot out.
This is the confound between these two variables.
We want the relationship between murder and ice cream
WITHOUT the confounding effect of heat.
In the dataset "murderice.csv", we can run a partial
correlation and find out.
First, a simple correlation reveals very significant correlations
between everything.
But how much of that connection is truly between murder and
ice cream?
Analyze → Correlate → Partial
From here, put the two variables of interest in the Variables box
(you can put more than two if you wish).
Put the confounding variable in the "Controlling for" slot.
The partial correlation between ice cream and murder is much
weaker than the simple correlation.
It appears that heat
(or something common to all three) was a major factor in both.
In fact, the correlation is no longer significant (we fail to reject
the null that there is no correlation).
Also note: SPSS tells us in the output table that heat is a
control variable, so we know from the output that this is a
partial correlation (hint, hint).
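A first-order partial correlation can also be computed directly from the three simple correlations: r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)). The r values below are made up for illustration, since the murderice.csv output isn't reproduced here:

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with the confound z partialled out."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# hypothetical simple correlations: both variables of interest are
# positively correlated with the confound, so the partial r shrinks
r_ice_murder  = 0.60  # ice cream vs. murder
r_ice_heat    = 0.75  # ice cream vs. heat (confound)
r_murder_heat = 0.70  # murder vs. heat (confound)

r_partial = partial_corr(r_ice_murder, r_ice_heat, r_murder_heat)
print(round(r_partial, 3))  # much weaker than the simple r of .60

# df for testing a partial correlation is n - 3, one per variable
n = 60
df = n - 3  # 57, matching the df described in the notes
```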
We’re using three degrees of freedom, one for each variable
involved, so the df is 57 even when n is 60 (for interest)
Key observation:
The partial correlation will be less than the simple correlation if
both variables of interest are correlated to the confounding
variable in the same direction.
Here, both murder and ice cream are correlated to heat
positively, so the partial correlation removes that common
positive relationship between murder and ice cream.
Removing a positive relationship makes the correlation less
positive.
Likewise, if the correlations to the confounding variable are in
opposite directions, then the partial correlation will be higher
than the simple correlation.
If we’re only considering positive correlations, this means a
confounding variable could be hiding or ______________
a correlation hiding a correlation between two variables rather
than creating a false correlation.
Example: Confidence.
Consider the correlation between types of confidence. Do the
correlations between the other three still show after we
control for problem solving confidence?
Simple Correlations

             Physical   Appearance   Emotional   Problem Solving
Phys.         1
Appear.       .373*      1
Emot.         .430**     .483**       1
Pr.Solve.     .730**     .527**      .540**       1
The correlation between physical and everything else is removed
entirely (that means that knowing an employee's problem solving
confidence tells you as much about their physical confidence as
knowing all three other facets).
With heat, murder, and ice cream, we had some outside,
non-mathematical information to support the claim that heat was
behind the other two variables.
It could have easily been something we didn’t measure, like
the proportion of elderly in an area (retirees often migrate
south for winter).
In the case of facets of confidence, we don’t have any reason
why problem solving confidence would be the common
thread. The partial correlations shrink to nothing because, after
accounting for problem solving, the other variables aren't giving much info.
If we control for emotional confidence, we see there’s a
connection between problem solving and physical when
emotional is taken out of the picture.
Interestingly, controlling for appearance produces the same
result. They all have a common thread and so increase
together, but the real connection is between problem solving
and physical confidence.
Without partial correlation we would have never caught this.