PPT 08

Chapter 8
Relationships Among Variables
Research Methods in Physical Activity
Correlation — A statistical technique used to determine the relationship
between two or more variables.
Correlations may be simple, when they involve only two variables of
comparison, or may be multiple correlations when they involve more
than two variables.
Multiple correlations have a dependent variable (criterion variable)
and two or more independent variables (predictor variables).
A canonical correlation establishes the relationships between two or
more dependent variables and two or more independent variables.
Positive correlation — A relationship between two variables in which a
small value for one variable is associated with a small value for another
variable, and a large value for one variable is associated with a large value
for the other.
Negative correlation — A relationship between two variables in which a
small value for the first variable is associated with a large value for the
second variable, and a large value for the first variable is associated with a
small value for the second variable.
Correlation and Causation
A correlation between two variables does not mean that one variable
causes the other. While two variables must be correlated for a cause and
effect relationship to exist, correlation alone does not guarantee such a
relationship.
Correlation is a necessary but not sufficient condition for causation. The
only way that causation can be shown is with an experimental study in
which an independent variable can be manipulated to bring about an effect.
coefficient of correlation [ r ] — A quantitative value of the
relationship between two or more variables that can range from .00 to
1.00 in either a positive or negative direction.
Pearson product moment
coefficient of correlation — The
most commonly used method of
computing correlation between two
variables; also called interclass
correlation, simple correlation, or
Pearson r.
The Pearson r has one criterion (or dependent) variable and one predictor
(or independent) variable. An important assumption for the use of r is that the
relationship between the variables is expected to be linear, that is, that a
straight line is the best model of the relationship. When that is not true (e.g.,
figure 8.4d, p. 129), r is an inappropriate way to analyze the data.
Computation of the correlation coefficient
The computation of the correlation coefficient involves the relative
distances of the scores from the two means of the distributions.
The formula consists of only three operations:
1) Sum each set of scores.
2) Square and sum each set of scores.
3) Multiply each pair of scores and obtain the cumulative sum of these
products.
See Example 8.1, p. 130, for an example of the computation.
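The three operations above can be sketched in Python. This is a minimal illustration that assumes the standard raw-score (computational) formula for r; Example 8.1 itself is not reproduced here, and the function name and the sample scores are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson r from the raw-score formula (assumed form, see lead-in)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 scores")
    n = len(x)  # N = number of pairs of scores

    # 1) Sum each set of scores.
    sum_x, sum_y = sum(x), sum(y)
    # 2) Square and sum each set of scores.
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    # 3) Multiply each pair of scores and sum the products.
    sum_xy = sum(a * b for a, b in zip(x, y))

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when a variable has no variance")
    return numerator / denominator

# Hypothetical paired scores, for illustration only.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]
r = pearson_r(x_scores, y_scores)
print(round(r, 3))
```

Swapping the two lists gives the same r, which matches the point that it does not matter which variable is called X and which is called Y when only the relationship is of interest.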
In a correlation problem that simply determines the relationship between two
variables, it does not matter which one is X and which is Y.
If the investigator wants to predict one score from the other, then Y
designates the criterion (dependent) variable (that which is being predicted)
and X the predictor (independent) variable.
In the example of the positive correlation to the left, the criterion
variable is “Years of education” and the predictor variable is annual
income. Thus, we would “predict” the years of education from the annual
income.
Interpreting the reliability of r
What does a coefficient of correlation mean in terms of being high or low,
satisfactory or unsatisfactory?
One criterion is its reliability, or significance. Does it represent a real
relationship? That is, if the study were repeated, what is the probability of
finding a similar relationship?
For this statistical criterion of significance, simply consult a table. Table 3 in
the appendix (p. 428) contains the necessary correlation coefficients for
significance at the .05 and .01 levels.
In using Table 3, select the desired level of significance, such as the .05
level, and then find the appropriate degrees of freedom (df, which is based
on the number of participants corrected for sample bias). For r, df is
equal to N - 2 (remember, N in correlation refers to the number of pairs
of scores).
Some Observations about “significant r” (refer to Table 3)
1) The correlation needed for significance decreases with increased numbers
of participants (df). Very low correlation coefficients can be significant if you
have a large sample of participants.
At the .05 level, r = .38 is significant with 25 df, r = .27 is significant
with 50 df, and r = .195 is significant with 100 df.
2) The second observation to note from the table is that a higher correlation
is required for significance at the .01 level than at the .05 level.
The .05 level means that if there were truly no relationship and 100
experiments were conducted, the null hypothesis (that there is no
relationship) would be rejected incorrectly, just by chance, on about 5 of
the 100 occasions.
At the .01 level, we would expect a relationship of this magnitude because
of chance less than once in 100 experiments.
Therefore, the test of significance at the .01 level is more stringent than at
the .05 level, so a higher correlation is required for significance at the .01
level.
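Both observations can be checked numerically. The sketch below assumes the standard relation between the critical t value and the critical r, r = t / sqrt(t² + df). The two-tailed critical t values are taken from a standard t table, not from Table 3 of the text, and the function name is hypothetical.

```python
import math

def critical_r(t_crit, df):
    """Smallest |r| that reaches significance, given the critical t for that df."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Two-tailed critical t values from a standard t table (assumed, not from the text).
T_CRIT = {
    0.05: {25: 2.060, 50: 2.009, 100: 1.984},
    0.01: {25: 2.787, 50: 2.678, 100: 2.626},
}

for alpha, by_df in T_CRIT.items():
    for df, t_crit in by_df.items():
        print(f"alpha={alpha}  df={df:>3}  critical r = {critical_r(t_crit, df):.3f}")
```

Within each level the critical r shrinks as df grows (observation 1), and for any given df the .01 value is larger than the .05 value (observation 2).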
Interpreting the Meaningfulness of r
The interpretation of a correlation for statistical significance is important,
but because of the vast influence of sample size, this criterion is not always
meaningful.
The most commonly used criterion for interpreting the meaningfulness of
the correlation coefficient is the coefficient of determination (r²).
r² estimates the portion of common association among the factors that
influence the two variables.
Thus, the coefficient of determination indicates the portion of the total
variance in one measure that can be explained, or accounted for, by the
variance in the other measure.
The Venn diagram visually depicts this idea. Circle A
represents the variance in one variable, and Circle B
represents the variance in a second variable. The overlap, C, is the
shared variance: with r = .60, r² = .36. Thus, 36% of the variance
in A can be explained by the variance in B. (Unexplained variance
is equal to 1 - r², here .64.)
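The arithmetic behind the coefficient of determination can be sketched as follows. The function name is hypothetical; the r values are the ones quoted in these slides.

```python
def variance_split(r):
    """Return (explained, unexplained) proportions of variance for a given r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1.00 and 1.00")
    explained = r ** 2           # coefficient of determination
    unexplained = 1 - explained  # variance not accounted for
    return explained, unexplained

for r in (0.60, -0.60, 0.195):
    explained, unexplained = variance_split(r)
    print(f"r = {r:+.3f}: explained = {explained:.1%}, unexplained = {unexplained:.1%}")
```

The sign of r does not affect r², and a correlation of .195, which is significant with 100 df at the .05 level, accounts for less than 4% of the variance. That is why significance alone is not a good guide to meaningfulness.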
Using Correlation for Prediction (Regression)
Prediction is based on correlation. The higher the
relationship is between two variables, the more accurately
you can predict one from the other. If the correlation
were perfect, you could predict with complete accuracy.
Thus, r = .00 means no predictive ability, and r = 1.00 (or r = -1.00) means
absolute predictive ability.
Prediction equation — A formula to predict some
criterion (e.g., some measure of performance) based
on the relationship between the predictor variable(s)
and the criterion; also called regression equation.
We predict “Y” (criterion or dependent variable) on the ordinate/vertical
axis from “X” (predictor or independent variable) on the abscissa/horizontal
axis, where
Y = a + bX (equation for a straight line)
Y = the predicted score (dependent score)
X = the predictor score (independent score)
a = the intercept point on Y
b = the slope of the line
Keep it simple.
1) “a” is the place on the “Y” axis where the line will
intersect (the value of “Y” when “X” is zero), and
2) the “slope of the line” is really about how much “Y” changes
for each one-unit change in “X” (degree or magnitude of slope) and in
which direction (positive or negative).
So, if we want to predict “Y” from “X” then we need to
calculate “a” and “b”.
Calculating “a” and “b”
First you will need to calculate “b”, which is determined by the
correlation coefficient and the standard deviations (the square roots of
the variances) of variables “X” and “Y”, with the following formula:
b= r(sY/sX)
sY = the standard deviation of “Y”
sX = the standard deviation of “X”
Note that the slope of the line reflects not only the association of “X”
and “Y” (direction: positive or negative, from the sign of r), but also the
spread of “Y” relative to the spread of “X” (rise over run =
degree or magnitude of slope).
Next you can calculate “a” which is the intercept on the “Y” axis:
a = MY - bMX
MY = the Mean of the “Y” scores
MX = the Mean of the “X” scores
b = the slope of the regression line (see last slide)
Note that this formula produces a single value that depends on the
measures of central tendency (means of “X” and “Y”) and on “b”, which in
turn depends on r and the standard deviations of “X” and “Y”.
So, the intercept and the slope of the line depend on the means and
standard deviations of “X” and “Y” and on the correlation between them.
(see your text for examples of using the regression formula)
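The two formulas, b = r(sY/sX) and a = MY - bMX, can be sketched in Python as follows. The data and the function names are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the text.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson r computed from deviations around the two means."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def regression_line(x, y):
    """Return (a, b) for Y = a + bX using b = r(sY/sX) and a = MY - b*MX."""
    r = pearson_r(x, y)
    b = r * statistics.stdev(y) / statistics.stdev(x)  # slope
    a = statistics.mean(y) - b * statistics.mean(x)    # intercept on the Y axis
    return a, b

# Hypothetical predictor (X) and criterion (Y) scores.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

a, b = regression_line(x_scores, y_scores)
predicted_y = a + b * 6  # predicted Y for a new X score of 6
print(f"a = {a:.2f}, b = {b:.2f}, predicted Y at X = 6: {predicted_y:.2f}")
```

With these scores the means are MX = 3 and MY = 4, so the fitted line passes through the point (3, 4).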
Line of Best Fit (regression line)
The line of best fit always passes through the intersection of the X
and Y means.
The slope of the line depends on r and the standard deviations of X and Y,
and the intercept depends on their means (see previous formulas). The line
of best fit is the line that minimizes the sum of the squared vertical
distances between the data points and the line; in that sense it is the
“best fit” for all the X and Y data coordinates.
The line is a regression line because it is used to predict Y from X (see
previous slides).
For data points that do not fall on the line, the vertical distances from
the line are called residuals or residual scores.
residual scores — The difference between the predicted and actual
scores that represents the error of prediction. Note that if you have
perfect correlation all scores in the scatter plot would be in a straight line
(line of best fit) and there would be no residual scores. Also, residual scores
are really unexplained variance (error of prediction).
If we were to compute all the residual scores, their mean would be zero
(the positive and negative errors of prediction around the line of best
fit cancel out), and their standard deviation, which reflects the
unexplained variance, is called the standard error of prediction, or
standard error of the estimate.
The larger the standard error of the estimate, the less predictive ability;
the larger the r² is, the smaller the error of prediction.
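A short sketch can make the residual scores and the standard error of the estimate concrete. It assumes the common textbook form, standard error = sY * sqrt(1 - r²), with standard deviations computed on N - 1; some texts divide the squared residuals by N - 2 instead, which gives a slightly larger value. The data and function names are hypothetical.

```python
import math
import statistics

def fit_line(x, y):
    """Least-squares intercept a, slope b, and correlation r for Y = a + bX."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    b = sxy / sxx                   # algebraically the same as r * (sY / sX)
    a = my - b * mx
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r

def residual_scores(x, y, a, b):
    """Actual Y minus predicted Y for every pair of scores."""
    return [q - (a + b * p) for p, q in zip(x, y)]

# Hypothetical scores, for illustration only.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

a, b, r = fit_line(x_scores, y_scores)
residuals = residual_scores(x_scores, y_scores, a, b)

mean_residual = statistics.mean(residuals)                      # approximately zero
see_from_residuals = statistics.stdev(residuals)                # SD of the residuals
see_from_r = statistics.stdev(y_scores) * math.sqrt(1 - r ** 2)

print(f"mean residual = {mean_residual:.4f}")
print(f"standard error of the estimate = {see_from_residuals:.4f} ({see_from_r:.4f} from r)")
```

The two estimates agree, and the formula shows directly why a larger r² leaves a smaller error of prediction.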
Note: Chapter 8 also contains information on Partial, Semi-partial, and Multiple
regression principles, and the Fisher Z transformation of r. This information is
beyond the scope of our class in the introduction of statistical principles. I
welcome you to read the information, but I will not review nor test you on the
material.
End of Lecture