Statistics – Spring 2008
Lab #3 – Correlation
Defined: The measure of the strength and direction of the linear relationship between two variables.
Variables: IV is continuous, DV is continuous.
Relationship: Relationship amongst variables.
Example: Relationship between height and weight.
Assumptions: Normality. Linearity.
1. Graphing - Scatterplot
 The first step of any statistical analysis is to plot the data graphically. For correlation, graphical plots are called scatterplots. Scatterplots can show you visually the strength of the relationship between the variables, the direction of the relationship between the variables, and whether outliers exist.
 Correlation is the measure of the strength and direction of the relationship between the variables.
a. Correlations can vary between -1 and 1.
b. Direction of the relationship can be either positive or negative. A positive relationship is indicated by a
positive value (e.g., ranging from 0 to 1). A negative relationship is indicated by a negative value (e.g.,
ranging from 0 to -1). An example of a positive relationship is the relationship between height and
weight. The higher the outcome on one variable, the higher the outcome on the other variable. An
example of a negative relationship is the relationship between exercise and weight. The higher the
outcome on one variable, the lower the outcome on the other variable.
c. Strength of the relationship is measured from 0 to 1/-1. The farther the value is from 0, the
stronger the relationship. The approximate criteria for strength are 0 for no effect, .1 for a small effect,
.3 for a medium effect, and .5 for a large effect. Notice those values can be either positive or negative,
depending upon the direction of the relationship, so a .2 and a -.2 relationship indicate the same strength,
but different directions.
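FYI – If you want to check these numbers outside of SPSS, a Pearson correlation is easy to compute in Python with SciPy. This is just a minimal sketch; the height and weight numbers below are made up for illustration.

    # Minimal sketch: Pearson correlation in Python (illustrative data)
    from scipy.stats import pearsonr

    height = [61, 64, 66, 68, 70, 72, 75]           # hypothetical heights (inches)
    weight = [120, 135, 140, 155, 160, 175, 190]    # hypothetical weights (pounds)

    r, p = pearsonr(height, weight)   # r = strength/direction, p = significance
    print(f"r = {r:.2f}, p = {p:.3f}")   # r near +1 indicates a strong positive relationship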
 What does a scatterplot look like? Below are 9 scatterplots that show three examples of a positive relationship
in the top row (perfect, strong, weak), three examples of a negative relationship in the middle row (perfect,
strong, weak), and three examples of no relationship in the bottom row.
(graph taken from wikipedia)



How do I graph a scatterplot?
1. Select Graphs --> Legacy Dialogs --> Scatter
2. Click “Simple”, and “Define”
3. Move appropriate variables into the “Y axis” and “X axis” boxes
4. Click OK.
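FYI – The same scatterplot can be drawn outside of SPSS. The sketch below uses Python with pandas and matplotlib, and assumes the data have been exported from SPSS to a CSV file (the file name “lab3.csv” is hypothetical).

    # Minimal sketch: scatterplot of commit1 vs. commit3 (assumed CSV export)
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("lab3.csv")           # hypothetical export of the data file
    plt.scatter(df["commit1"], df["commit3"])
    plt.xlabel("commit1")
    plt.ylabel("commit3")
    plt.show()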
Output below is for two questions: “commit1” and “commit3”. Notice that there is a positive relationship.
From this scatterplot, I would anticipate that the correlation between the two variables is positive and medium
in size.
What is the purpose of graphing the scatterplot? The purpose of graphing the scatterplot is to look at the
relationship between the variables and determine if there are any problems/issues with the data or if the
scatterplot indicates anything unique or interesting about the data, such as:
a. How is the data dispersed? For example, in the scatterplot above, it appears all the scores are grouped
in the top right quadrant. What does this imply about the questions and/or data in your study? It
appears that subjects answered both commit1 and commit3 on the higher part of the scale. In other
words, most subjects feel that most people brought to trial did in fact commit the crime (commit1) and
that most people convicted by juries did in fact commit the crime (commit3). Thus, when discussing
these variables in your paper, just talking about the size and direction of the correlation does not tell
the whole story. If you want to also discuss descriptive analysis of the data, you could talk about how
the data are distributed at the high end of the scale. In other words, just presenting the correlational
analysis (e.g., r = .35, p < .001) may mislead the reader about an interesting distribution of the data.
b. Are there outliers? A scatterplot is useful for “eyeballing” the presence of outliers. Just as a histogram
is useful for “eyeballing” univariate outliers, the scatterplot is useful for “eyeballing” bivariate outliers.
In a later section I describe how to statistically analyze whether or not bivariate outliers exist.
c. Is the relationship linear?
What is linearity? Linearity is a straight-line relationship between variables.
Why is linearity important? Correlation and regression tests rest upon the assumption of linearity because
they only capture linear relationships. Not all relationships are linear. Just as not all variables are normally
distributed in the real world, not all relationships are linear. For example, there could be a non-linear
relationship for USA immigrants between length of residence in the USA and depression. It is a U-shaped
relationship: depression levels start high during the first few years of initial resettlement, then decrease for a
while as immigrants adjust to the new environment, and then increase again later in life. Another example of a
non-linear relationship is mortality and water consumption: absence of water increases mortality, middle levels
of water decrease mortality, but too much water increases mortality.
Correlation and regression only capture linear relationships. For example, all correlations below have the same
size and direction, r = .81. BUT only the top-left graph is appropriate for correlational analysis because the
other three graphs depict data that can NOT be captured by the formulas for correlation and regression.
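FYI – You can demonstrate this point with a quick Python sketch: a perfect U-shaped relationship produces a Pearson r of approximately zero, even though the two variables are perfectly related.

    # Minimal sketch: Pearson r misses a perfect U-shaped relationship
    import numpy as np
    from scipy.stats import pearsonr

    x = np.linspace(0, 10, 101)
    y = (x - 5) ** 2             # deterministic U-shaped relationship
    r, p = pearsonr(x, y)
    print(f"r = {r:.3f}")        # approximately 0 despite a perfect relationship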
(graph taken from wikipedia)
2. Bivariate (and Multivariate) Outliers
 How do I identify bivariate and multivariate outliers? The procedure for identifying bivariate outliers is the
same as for identifying multivariate outliers. The procedure is called Mahalanobis Distances, and it calculates
the distance of particular scores from the center cluster of remaining cases. The procedure creates a new
column at the end of the data file containing a calculated score for each subject. The newly calculated score is
based upon the specific variables entered into the analysis. Thus, you could calculate many different
Mahalanobis Distances where you enter different sets of variables into the analysis. For each analysis, a
separate score for each subject is created in a new column at the end of the data file. The Mahalanobis
Distances score for each subject is considered an outlier if it exceeds a “critical value”.
a. The critical value is determined by a table at the back of most textbooks. You can also find the table at
this webpage - http://www.ento.vt.edu/~sharov/PopEcol/tables/chisq.html
b. The table involves the “Rejection Regions” for a Chi-Square test. Remember back to the first day of
class when we talked about probability distributions and “Rejection Regions”. The Rejection Regions
for the chi-square test are based upon two factors: the probability level you set and the degrees of
freedom. We will talk later in-depth about these concepts, but for right now what you need to know is
that:
c. the degrees of freedom for this test equal the number of variables under investigation. Thus, if you are
analyzing a bivariate relationship, then degrees of freedom = 2. If you are analyzing 3 variables, then
degrees of freedom = 3, and so forth.
d. the probability level we set for this test is p < .001.
e. so, if you look at the table, you find the degrees of freedom, then scan to the right until you get to the
column associated with 0.001. That is your critical value. For example, the critical value for a bivariate
relationship is 13.82.
f. Any Mahalanobis Distances score above that critical value is an outlier.
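FYI – If you do not have the table handy, the same critical values can be reproduced in Python with SciPy. This is just a sketch; ppf is the inverse of the chi-square cumulative distribution, so we ask for the 1 − .001 quantile.

    # Minimal sketch: chi-square critical values at p = .001
    from scipy.stats import chi2

    print(chi2.ppf(1 - 0.001, df=2))   # 13.816 -> critical value for 2 variables
    print(chi2.ppf(1 - 0.001, df=3))   # 16.266 -> critical value for 3 variables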
Here is how to calculate Mahalanobis Distances scores:
1. Select Analyze --> Regression --> Linear
2. Move all the variables under investigation into the “Independent(s)” box, and move the subject number
variable (numb) into the “Dependent” box. For example, if you are interested in the bivariate outlier analysis
for “commit1” and “commit3”, you move both those variables into the “Independent(s)” box, and move
“numb” into the “Dependent” box.
3. Click “Save”, and click “Mahalanobis”
4. Click OK.
The newly created variable is saved as “MAH_1”.
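FYI – The sketch below shows the same Mahalanobis Distances computation in Python, for readers who want to verify the SPSS output. It assumes the data have been exported to a CSV file (“lab3.csv” is a hypothetical name) with columns “commit1” and “commit3”. Note that SPSS saves the squared distance, which is what gets compared to the chi-square critical value.

    # Minimal sketch: squared Mahalanobis distances for two variables
    import numpy as np
    import pandas as pd
    from scipy.stats import chi2

    df = pd.read_csv("lab3.csv")                        # hypothetical export
    X = df[["commit1", "commit3"]].dropna().to_numpy(dtype=float)

    center = X.mean(axis=0)                             # centroid of the cluster
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))    # inverse covariance matrix
    diff = X - center
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distance per case

    critical = chi2.ppf(1 - 0.001, df=2)                # 13.816 for two variables
    print("number of outliers:", (d2 > critical).sum())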
Output below is for “commit1” and “commit3”. In “Residuals Statistics” box, look for “Mahal. Distance”.
Look at the “Maximum” score. If that number exceeds your critical value, then an outlier exists. In this case,
with 2 variables, the critical value is 13.82. The “Maximum” is listed as 35.083. Thus, we have at least one
outlier.

You identify the outlier(s) by sorting the data by this new variable “MAH_1” and then scrolling to the bottom of
the list to find the highest scores. You can sort by: Data --> Sort Cases. In this case, we find 11
cases that have scores above 13.82.
Notice, however, that multivariate outlier analysis is just as arbitrary as univariate outlier analysis. The
threshold level is arbitrarily determined, just as the 1.5*IQR and 3*IQR threshold levels for univariate outliers
are arbitrarily determined. Plus, the “eyeball” method of the scatterplot does show some
differences when compared to the statistical method of using Mahalanobis Distances scores. For example, if
you look at the scatterplot for our two variables (see above), can you identify which 11 subjects are the ones
deemed outliers by the Mahalanobis Distances analysis?
3. Correlation
 A correlation is easy to conduct:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the “Variable(s)” window.
3. Click OK.
 Output below is for two questions, “commit1” and “commit3”. The “Correlations” box tells you three pieces
of information: N = sample size, Pearson Correlation = size and direction of the relationship, and Sig. =
significance level. In essence, the “Pearson Correlation” tells you the size and direction of the hypothetical
line that can be drawn through the data, and the “Significance” tells you the probability that the line is due to
chance. More specifically, the “Significance” represents a test of whether the line is different from a flat line
(a flat line would be represented by a Pearson correlation of 0). For the data below, there is a positive and
medium relationship between the variables, and there is a p < .001 probability that the line is due to chance.



Another useful piece of information is R² – the coefficient of determination. This is the amount of variance in
one variable explained by another variable. R² is not provided in the output, but you can calculate it by
squaring the Pearson Correlation. In our example, therefore, .352 × .352 = .124. If you multiply this by 100,
you convert the value into a percentage. Thus, in our example, commit1 explains 12.4% of the variance in
commit3, and vice versa. This also means that 87.6% of the variance is unaccounted for, because
100 − 12.4 = 87.6.
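The arithmetic is simple enough to check by hand or with a few lines of Python:

    # Minimal sketch: converting r to R-squared and percentages
    r = 0.352
    r_squared = r ** 2                                   # 0.1239..., about .124
    print(f"{r_squared * 100:.1f}% explained")           # 12.4%
    print(f"{(1 - r_squared) * 100:.1f}% unexplained")   # 87.6%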
WRITE-UP: The report of a correlational study should include the strength of the relationship and the
significance level. Some researchers also include the descriptive statistics of each variable. Some researchers
also include the R².
a. “There was a positive correlation between the two variables, r = .35, p < .001.”
b. “There was a positive correlation between the belief about what percent of people brought to trial did
in fact commit the crime (M = 78.39%, SD = 16.33) and the belief about what percent of people
convicted by juries did in fact commit the crime (M = 83.22%, SD = 15.54), r = .35, p < .001.”
c. “There was a positive correlation between the two variables, r = .35, p < .001, with R² = .124.”
EVALUATION:
a. You evaluate correlational analysis by looking at the direction of the relationship between the
variables. Is it in the same direction as the research hypothesis?
b. You then look at the significance level. Is the relationship significant? Remember that significance is
related to sample size. In a small sample (n = 30) you may have correlations that do not reach
significance, but the same correlations might reach significance if the sample size were larger (n = 100).
Also, remember that sample size does not typically affect the strength of the relationship, only the
probability that the result was due to chance.
c. You then look at the size of the relationship. Is it strong or weak? Just because the hypothesis is
confirmed in the predicted direction does not indicate whether the relationship between the variables is
strong or important. Strength of the relationship is measured from 0 to 1/-1. The farther the value is
from 0, the stronger the relationship. The approximate criteria for strength are 0 for no effect, .1 for a
small effect, .3 for a medium effect, and .5 for a large effect. Notice those values can be either positive
or negative, depending upon the direction of the relationship, so a .2 and a -.2 relationship indicate the
same strength, but different directions.
d. You can also look at R². In terms of percentage of variance explained, small is 1%, medium is 9%, and
large is 25%.
4. Correlation - Multiple
 When you conduct correlations, you typically enter MANY variables simultaneously into the analysis, and the
output provides all possible bivariate relationships. For example:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the “Variable(s)” window.
3. Click OK.
 Output below is for the “forensic” items and “innocence” items. Notice the diagonal is always “1” because
each variable correlates perfectly with itself. Also notice that the sample size is different for each
bivariate relationship because the default in correlation is “pairwise” deletion. Also, notice that the matrix
mirrors itself along the diagonal, so the information is depicted twice for each bivariate combination.
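FYI – A full correlation matrix can also be produced outside of SPSS. The Python sketch below uses pandas, whose corr() method likewise drops missing values pairwise; the CSV file name and column names are hypothetical.

    # Minimal sketch: correlation matrix with pairwise deletion
    import pandas as pd

    df = pd.read_csv("lab3.csv")                      # hypothetical export
    items = ["forensic1", "forensic2", "forensic3"]   # hypothetical column names
    print(df[items].corr())   # diagonal is 1; matrix mirrors itself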


Just as you can have correlational output of multiple variables simultaneously, you can have scatterplots of
multiple variables simultaneously. The only limitation is that with more than 3 variables the scatterplots get so
small as to be relatively useless. You conduct multiple scatterplots simultaneously by:
1. Select Graphs --> Legacy Dialogs --> Scatter
2. Click “Matrix”, and “Define”
3. Move appropriate variables into the “Matrix variables” box
4. Click “Options” and “exclude cases variable by variable”
5. Click OK.
Output below is for the first three “forensic items”.
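FYI – The equivalent scatterplot matrix in Python (a sketch; the file and column names are hypothetical):

    # Minimal sketch: scatterplot matrix with pandas
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix

    df = pd.read_csv("lab3.csv")   # hypothetical export
    scatter_matrix(df[["forensic1", "forensic2", "forensic3"]])
    plt.show()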
5. Correlation - Partial
 Partial correlation is the relationship between two variables while controlling for a third variable. The purpose
is to find the unique variance between two variables while eliminating the variance from a third variable. The
diagram below from your textbook, page 135, graphically represents the purpose of partial correlation.



You typically only conduct partial correlation when the third variable has shown a relationship to one or both
of the primary variables. In other words, you typically first conduct correlational analysis on all variables so
that you can see whether there are significant relationships amongst the variables, including any “third
variables” that may have a significant relationship to the variables under investigation. In addition to this
statistical pre-requisite, you also want some theoretical reason why the third variable would be impacting the
results.
How to conduct partial correlation:
1. Select Analyze --> Correlate --> Partial
2. Move variables into the “Variable(s)” window.
3. Move the variable you want to control for into the “Controlling” box
4. Click “Options” and click “Zero Order correlations” and click “Exclude cases pairwise”
(by clicking “zero order correlations”, the output will show both the relationships amongst the variables
while controlling for the third variable, and ALSO the relationships amongst the variables without
controlling for the third variable. This is useful so that you can easily see the difference between
controlling for the variable and not controlling for the variable.)
5. Click OK.
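FYI – Partial correlation can also be computed by hand: regress each primary variable on the control variable, then correlate the residuals. The Python sketch below does exactly that (the file name is hypothetical; the p-value from this shortcut is approximate because it does not adjust the degrees of freedom for the control variable).

    # Minimal sketch: partial correlation via residuals
    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    df = pd.read_csv("lab3.csv").dropna(subset=["commit1", "commit3", "prosecutor1"])

    def residuals(y, control):
        slope, intercept = np.polyfit(control, y, deg=1)   # simple regression
        return y - (slope * control + intercept)

    z = df["prosecutor1"].to_numpy(dtype=float)
    res1 = residuals(df["commit1"].to_numpy(dtype=float), z)
    res3 = residuals(df["commit3"].to_numpy(dtype=float), z)

    r, p = pearsonr(res1, res3)   # the partial correlation
    print(f"partial r = {r:.3f}")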
Output below is for the relationship between “commit1” and “commit3” while controlling for “prosecutor1”.



I included “prosecutor1” as the controlling variable because: (1) statistically, it shows a significant relationship
to both commit1 and commit3. You can see that significant relationship in the top part of the “Correlations”
box below, which presents the correlations without controlling for a third variable. (2) Theoretically, it is
possible that the reason why there is a positive correlation between commit1 and commit3 is because
prosecutor1 asks “whom do you trust more, defense attorneys or prosecutors”, so it is possible that the reason
subjects believe defendants brought to trial and convicted at trial are guilty (commit1 and commit3) is because
they trust the prosecutor over the defense attorney.
Thus, given this plausible (statistical and theoretical) third-variable relationship, it is interesting to note that
controlling for “prosecutor1” did not lower the strength of the relationship between commit1 and commit3 by
much: the outcome while controlling for prosecutor1 was r = .341, p < .001. In other words, the
relationship between commit1 and commit3 is NOT due to subjects trusting the prosecutor.
You can conduct Partial Correlation with more than just one third variable. You can include as many third
variables as you wish.
6. Correlation – Point-biserial Correlation
 Point-biserial Correlations are conducted when one of the variables is dichotomous, which means it is a
categorical variable with only two categories, such as gender: male, female.
 FYI – The Point-biserial Correlation is analogous to a “t-test”, which we will cover in later weeks. A “t-test”
is conducted when you are interested in the relationship between a categorical independent variable (such as
gender: male, female) and a continuous dependent variable (such as belief in the death penalty on 1-7 scale).
 You conduct a Point-biserial Correlation the same way that you conduct a regular correlation:
1. Select Analyze --> Correlate --> Bivariate
2. Move all variables into the “Variable(s)” window.
3. Click OK.
 The output, Write-up, and interpretation are the same as for a regular correlation.
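FYI – Outside of SPSS, SciPy provides a direct point-biserial function, which is mathematically equivalent to running a Pearson correlation with a dichotomous variable. The data below are made up for illustration.

    # Minimal sketch: point-biserial correlation (illustrative data)
    from scipy.stats import pointbiserialr

    gender = [0, 0, 0, 1, 1, 1, 1]                  # dichotomous variable
    belief = [2.0, 3.5, 3.0, 5.0, 4.5, 6.0, 5.5]    # continuous 1-7 scale
    r, p = pointbiserialr(gender, belief)
    print(f"r = {r:.2f}, p = {p:.3f}")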
 FYI - If you want to analyze the tests from your classes using the Point-biserial Correlation, you would need
to first create a new dichotomous variable (e.g., 1=answered correctly, 2=answered incorrectly).
See “Lab2 – Descriptives” for “6. Transforming categorical variables into other categorical variables”.