Correlation and Regression

SPSS Session 4:
Association and Prediction Using
Correlation and Regression
Learning Objectives
• Review information from Lecture 10
• Understand the relationship between two
interval/ratio variables
• Test for association between two variables using
correlation and interpret the correlation
coefficients
• Using regression, describe how one variable can
be used to predict the score in another
• Conduct correlation and regression analyses in
SPSS and interpret the statistical findings
Review of Lecture 10
Completion of Lecture 10 enabled you to:
- Understand how multiple variables may
interact with one another
- Appreciate the role of intervening variables
- Be aware of how interpretation of statistics
may be affected by outliers and
misinterpretations
Association Between Variables
• Correlation is a statistical test that allows us to
gauge the association between two
interval/ratio variables.
• For example, we would expect age and height
to be correlated. As age increases, we expect a
similar increase in height.
• The “Pearson’s r” statistic is the most common
correlation coefficient.
• Correlation is best understood through the
use of a chart called a scatterplot.
Correlation and Pearson’s r
• Pearson’s r is the most common correlation
coefficient.
• It is used to statistically show the magnitude and
direction of a relationship between two variables.
• It is on a scale of -1 to 1.
• Distance either direction from 0 is crucial and
shows magnitude.
• The sign of the r (+/-) shows the direction.
– Either negative or positive direction
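The computation behind Pearson's r can be sketched in a few lines of Python. This is only an illustration: the ages and heights below are made up, and the function name `pearson_r` is an assumption, not something SPSS exposes.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: co-variation of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the two means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: ages (years) and heights (cm) of six children
ages = [2, 4, 6, 8, 10, 12]
heights = [86, 102, 115, 128, 138, 149]

r = pearson_r(ages, heights)
# The sign gives the direction; the distance from 0 gives the magnitude
print(f"r = {r:.3f}")
```

With these invented values r comes out close to 1, which matches the expectation that height rises with age.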
Scatterplots
• Scatterplots produce a useful visualization of the
association between two variables.
• The independent variable is shown on the
horizontal axis (X axis).
• The dependent variable is shown on the vertical
axis (Y axis).
• In the next example, we wanted to describe the
relationship between the age of the person
responding to the questionnaire in our child
protection study and the age of the child in their
care.
• In this example of a scatterplot, age of the
respondent is on the X axis.
• Age of the child is on the Y axis.
Each dot is a single family, plotted at the
point where the age of the respondent and
the age of the child intersect.
Example of a case:
Parent age = 45 years
Child age = 5 years
Correlation Lines
• Based on the scatterplot, think of a line that
could be drawn to represent the relationship
between the age of the person responding to
the questionnaire and the age of the child in
their care.
• This line should attempt to minimize the
vertical distance between any given point and
the line.
• It’s often called “the line of best fit”.
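A minimal sketch of how such a line can be found, assuming the usual least-squares criterion (the line that minimizes the sum of squared vertical distances). The ages are hypothetical and the function names are illustrative only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of best fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_vertical_distances(xs, ys, slope, intercept):
    """Total squared vertical distance between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: respondent ages (X) and child ages (Y), not from the study
respondent_age = [25, 30, 35, 40, 45, 60]
child_age = [3, 6, 8, 12, 5, 9]

slope, intercept = fit_line(respondent_age, child_age)
best = sum_squared_vertical_distances(respondent_age, child_age, slope, intercept)
# Any other line, e.g. a slightly steeper one, leaves a larger total squared distance
worse = sum_squared_vertical_distances(respondent_age, child_age, slope + 0.05, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, best={best:.2f}, worse={worse:.2f}")
```

Comparing `best` with `worse` shows why the fitted line is called the line of best fit: nudging the slope away from the fitted value increases the total squared distance.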
Correlation Line?
Correlation Line Shown
The line predicts some of
the cases and their
association between the
ages of the respondent
and child very well! These
cases sit right on the line!
The line does not predict other cases
and their ages quite as well.
These cases are vertically very
far from the line.
Perhaps these were cases
where the children were
placed in the care of their
grandparents after the
children were removed from
their parents.
Correlation and Pearson’s r
There are three critical characteristics of
correlation needed to properly describe the
association between two variables.
1. MAGNITUDE
2. DIRECTION
3. STATISTICAL SIGNIFICANCE
Magnitude of the Correlation
• Distance either direction from 0 is crucial and
shows magnitude.
• Correlation scores farther away from 0, closer
to either -1 or 1, are deemed stronger.
• Two variables with a correlation of -1 or 1 are
perfectly correlated.
Direction of Correlation
• Correlation scores that are above 0 are called
positive correlations.
– As values for one variable increase, we would
expect an associated increase in the other.
• Correlation scores that are below 0 are called
negative correlations.
– As values for one variable increase, we would
expect an associated decrease in the other.
The correlation between the ages of the children and
the respondents to the questionnaire in the child
protection study was r=.514. The magnitude was
moderate, as the correlation coefficient was roughly halfway
between 0 and 1. Because the correlation score was
above 0, we would say that it was a positive
correlation.
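The reading of magnitude and direction can be sketched as a small helper. The cut-offs of .3 and .7 are assumptions chosen to agree with how correlations are labelled in this session; other textbooks use different thresholds, and the function name is hypothetical.

```python
def describe_correlation(r, weak_below=0.3, strong_from=0.7):
    """Put magnitude and direction into words. The cut-offs are assumed conventions."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # distance from 0 shows the magnitude
    if size == 0:
        return "no linear association"
    direction = "positive" if r > 0 else "negative"  # the sign shows the direction
    if size == 1:
        return f"perfect {direction}"
    if size < weak_below:
        magnitude = "weak"
    elif size < strong_from:
        magnitude = "moderate"
    else:
        magnitude = "strong"
    return f"{magnitude} {direction}"

print(describe_correlation(0.514))  # the example correlation from the child protection study
```

Statistical significance, the third characteristic, is not covered by this helper because it also depends on the sample size.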
Correlation Example 1: GHQ and WAI
We wanted to test for the association between two
variables in our child protection study.
• The General Health Questionnaire (GHQ) total
score which was a measure of psychological
distress reported by the respondent answering
the questionnaire.
• The Working Alliance Inventory (WAI) total score
which is a measure of the quality of the
relationship that respondents reported having
with their child protection worker.
We hypothesized that respondents reporting
greater distress (GHQ scores) would report having a
worse relationship (WAI scores) with their child
protection worker.
Correlation Example 1: GHQ and WAI
• Our research hypothesis is that GHQ scores
and WAI scores are negatively and significantly
correlated.
• We expected that the r correlation coefficient
would be less than 0, closer to -1, and
statistically significant.
• Our null hypothesis would be that the two
variables would not be significantly associated
and thus would have an r correlation coefficient
not significantly different from 0.
Correlation Example 1: GHQ and WAI
• In SPSS, we select the “Analyze” menu, then
“Correlate”, and select “Bivariate”.
Correlation Example 1: GHQ and WAI
• The “Bivariate Correlations” window will
appear
• Find the “WAI_Total” and “GHQ_TotalScore”
variables and add them to the “Variables” list
• The options already selected below are
the usual defaults.
Correlation Example 1: GHQ and WAI
• Click “OK” to conduct the analysis.
Correlation Results 1: GHQ and WAI
• The results from the analysis indicate that the GHQ and
WAI scores have a weak, negative correlation (r= -.184).
• However, the p-value for this correlation is above the
significance level standard of α=.05.
• The obtained p-value is .075, which means a correlation of
this size could plausibly have arisen by chance, so it is not a
statistically significant relationship (p>.05).
• We would then fail to reject the null hypothesis: we do not have
sufficient evidence that these two variables are related.
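As a rough sketch of where such a significance decision comes from, a correlation can be converted to a t statistic with n - 2 degrees of freedom. The sample size n = 95 and the critical value of about 1.99 (two-tailed, α=.05, df = 93, read from a t table) are assumptions for illustration; the slides do not state n for this analysis.

```python
import math

def t_from_r(r, n):
    """t statistic for testing whether a Pearson correlation differs from 0 (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# r is taken from the results above; n = 95 is an ASSUMPTION, not a reported value
r, n = -0.184, 95
t = t_from_r(r, n)

# Approximate two-tailed critical value of t for alpha = .05 and df = 93 (assumed, from a t table)
T_CRITICAL = 1.99

print(f"t = {t:.2f}, significant at .05: {abs(t) > T_CRITICAL}")
```

Under these assumptions the t statistic falls short of the critical value, in line with a p-value above .05.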
Correlation Example 2: Family Environment
In the child protection study, we have three
measures of the characteristics of the family
environment using the Family Environment Scale:
• FES – Cohesion:
– Measure of the perceived level of commitment and
support expressed by family members
• FES – Expressiveness:
– Measure of the degree of emotional openness and
encouragement in the family
• FES – Conflict:
– Measure of familial conflict and expressed anger
Correlation Example 2: Family Environment
• Based on the cohesion, expressiveness, and conflict
within a family environment, we can begin to hypothesize
about the relationships between the three measures.
• We would expect the Cohesion and Expressiveness scores
to be positively, strongly, and significantly correlated
(correlation coefficient closer to 1).
• We would expect the Conflict scores to be
negatively, strongly, and significantly correlated with the
Cohesion and Expressiveness scores (correlation
coefficient closer to -1).
• Our null hypothesis for each of these analyses would be
that no score is correlated with any other score and
would produce a correlation coefficient not significantly
different from 0.
Correlation Example 2: Family Environment
• In SPSS, we select the “Analyze” menu, then
“Correlate”, and select “Bivariate”.
Correlation Example 2: Family Environment
• The “Bivariate Correlations” window will
appear
• Find the “FES_Cohesive”, “FES_Express”, and
“FES_Conflict” variables and add them to the
“Variables” list
• The options already selected below are
the usual defaults.
Correlation Example 2: Family Environment
• Click “OK” to conduct the analysis.
Correlation Results 2: Family Environment
• Here are the results
Correlation Results 2: Family Environment
• Cohesion and Expressiveness are moderately,
positively, and significantly correlated (r= .556,
p<.05).
• We can reject our null hypothesis that these
variables were not associated.
• In our study, it appears that there is a
moderate and significant association between parent or
carer reports of the level of commitment and
support expressed by family members and
their degree of emotional openness and
encouragement of each other.
Correlation Results 2: Family Environment
• The degree of family Conflict is moderately,
negatively, and significantly correlated with both
Cohesion (r= -.486, p<.05) and Expressiveness (r=
-.403, p<.05).
• We can reject our null hypothesis that these
variables were not associated.
• In our study, it appears that increased reports of
family conflict are associated with decreased
reports of both the level of commitment and
support expressed by family members and their
degree of emotional openness and
encouragement of each other.
Moving from Association to Prediction
Moving from Correlation to Regression
Regression
• Regression is an extension of correlation
where we take the value of an independent
variable and attempt to predict the value in
another variable.
• Both variables must be interval/ratio level of
measurement
Regression Equation
• The equation for regression using one independent
variable and one dependent variable is the following:
$Y = \beta_1 X_1 + \beta_0 + e$
• $Y$ is the dependent variable, or the outcome we are trying
to predict
• $X_1$ is the independent variable, or the variable we are using to
predict the value of the dependent variable (outcome)
• $\beta_1$ is the slope, or the size and direction of the relationship
between $X_1$ and $Y$
• $\beta_0$ is the intercept, or the value of $Y$ when $X_1$ is equal to 0.
• $e$ is the error term, or how far our prediction is off because
we can never perfectly predict a variable using another
variable
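For reference, the least-squares estimates of the slope and intercept can be written as below. These are standard textbook formulas rather than something shown on the slides; the hats mark sample estimates, the bars mark sample means, $s_Y$ and $s_{X_1}$ are the sample standard deviations, and $r$ is Pearson's correlation.

```latex
\hat{\beta}_1
  = \frac{\sum_i (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})}{\sum_i (X_{1i} - \bar{X}_1)^2}
  = r \, \frac{s_Y}{s_{X_1}},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}_1
```

The second form of the slope makes the link to correlation explicit: the slope always has the same sign as r.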
Regression Equation and Lines
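[Figure: regression line, with $Y$ (outcome, DV) on the vertical axis and $X_1$ (predictor, IV) on the horizontal axis. The slope $\beta_1$ is the change in $Y$ for every one-unit change in $X_1$.]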
Regression Example 1: Age and FES
• We will conduct three separate regression
analyses in this example.
• In each case, we will use age of the child (IV) to
predict one of the three FES scores (DV).
– FES – Cohesion:
• Measure of the perceived level of commitment and
support expressed by family members
Regression Example 1: Age and FES
• Within our child protection study, we wanted to
determine if age of the child could predict
characteristics of the family environment as
reported by the parent or carer responding to the
questionnaire.
• We would expect that older children are associated
with more challenges in the family environment
(research hypothesis).
• Like correlation, regression uses two interval/ratio
variables.
• For this analysis, our interval/ratio variables are age
of the child and one of the three FES scores.
Regression Example 1: Age and FES
• Our null hypothesis for each analysis is that age
of the child does not significantly predict the FES
score.
• In other words, the null hypothesis is that there
will be no associated change in the FES score
based on a change in age of the child.
• Statistically, the null hypothesis would indicate
that $\beta_1 = 0$.
– Recall the regression formula: $Y = \beta_1 X_1 + \beta_0 + e$
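The null hypothesis about the slope is usually tested with a t statistic. A sketch in standard notation (the hat and SE symbols are not from the slides; SE is the standard error of the estimated slope and n is the number of cases):

```latex
H_0 : \beta_1 = 0,
\qquad
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)},
\qquad
df = n - 2
```

With a single predictor, the square of this t equals the F statistic reported in the ANOVA table of the regression output.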
Regression Example 1: Age and FES
• Our formula for these analyses is the following:
$Y = \beta_1 X_1 + \beta_0 + e$
• $Y$ is the FES score, the outcome we are trying to predict
• $X_1$ is age of the child, our independent variable, or the variable we are
using to predict the FES score
• $\beta_1$ is the change in the FES score associated with each one-year change in the age of the
child
• $\beta_0$ is the intercept, or the value of a FES score ($Y$) when the age of the
child ($X_1$) is equal to 0.
• $e$ is the error term, or how far our prediction is off because we can
never perfectly predict a variable using another variable
Regression Example 1: Age and FES
• To conduct each analysis, we first need to select
the FES score using the Linear Regression menu.
• Select “Analyze”, then “Regression”, then
“Linear”.
• The Linear Regression window will appear.
Regression Example 1: Age and FES
The Linear Regression window:
• Find our first dependent variable which will be
“FES_Cohesive”
• Add it to the “Dependent” list.
• Our independent variable is the age of the child.
• Find the “Child_Age_Yrs” variable and add it to
the “Independent(s)” variables list.
Regression Example 1: Age and FES
• Under the “Statistics” menu on the right side of the
“Linear Regression” window, select the following:
– Regression Coefficients – Estimates: this provides the
regression coefficients (intercept and slope) and their
significance tests
– Model Fit: this provides R and R Square, which estimate the
percentage of the variance in the DV that is explained by the IV,
along with the ANOVA table
– Descriptives: this provides the descriptive statistics for
the variables in the analysis and the correlations between them
• Click “Continue”
• Click “OK” to conduct the analysis
Regression Results 1: Age and FES
• The first table provides the descriptive statistics
for the IV and DV.
Regression Results 1: Age and FES
• The second table offers the correlation
coefficients between the age of the child (IV) and
the FES – Cohesion scores (DV).
• From this table, we see that these two variables
are significantly associated and have a weak,
negative correlation (r = -.244, p<.05).
Regression Results 1: Age and FES
• Produced by the “Model Fit” option, this table
provides a summary of the value of our
regression equation in predicting FES – Cohesion
by using age of the child as our predictor.
Regression Results 1: Age and FES
• “R” is the absolute value of our correlation coefficient,
that is, its distance from r = 0.
• “R Square” is $r^2$, the square of the correlation
coefficient.
• $r^2$ can be interpreted as the percentage of the variance
in the DV that is explained by the IV.
• In this case, age of the child can statistically explain
5.9% of the variation in the FES – Cohesion scores.
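As a worked check using standard notation (the sums-of-squares symbols are not from the slides), R Square in a one-predictor model is both the share of variance explained and the squared correlation:

```latex
R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}} = r^2,
\qquad
(-0.244)^2 \approx 0.0595
```

Squaring the rounded r of -.244 gives about .0595, which agrees with the reported 5.9% to within rounding, since SPSS works from the unrounded coefficient.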
Regression Results 1: Age and FES
• Regression uses the same class of test as
ANOVA (an F test); the table for this analysis is below:
Regression Results 1: Age and FES
• The table indicates that from our regression
model, we have significantly predicted the FES –
Cohesion scores (F=5.880, df= 1,93, p<.05).
• We can reject our null hypothesis that the age
of the child does not predict FES – Cohesion
scores.
Regression Results 1: Age and FES
• From the last table, we can construct our
regression equation.
$Y = \beta_1 X_1 + \beta_0 + e$
Regression Results 1: Age and FES
• $Y = \beta_1 X_1 + \beta_0 + e$
• FES-Cohesion = $\beta_1 \, \text{Age} + (\text{Constant})$
$Y_{\text{FES-Cohesion}} = -1.34 \, \text{Age} + 6.9$
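Once the slope and constant are filled in, the equation can be used to generate predicted scores. A minimal sketch using the coefficients reported here; the function name and the example ages are illustrative assumptions, and the error term is left out because it is unknown for any new case.

```python
# Coefficients as reported in the regression equation for this analysis
SLOPE = -1.34     # change in FES-Cohesion per additional year of child age
INTERCEPT = 6.9   # predicted FES-Cohesion when age is 0

def predict_cohesion(age_years):
    """Predicted FES-Cohesion score from the fitted line (error term omitted)."""
    return SLOPE * age_years + INTERCEPT

for age in (1, 2, 3):  # hypothetical ages chosen for illustration
    print(age, round(predict_cohesion(age), 2))
```

Each additional year changes the prediction by exactly the slope. Because FES-Cohesion runs from 0 to 9, predictions are only interpretable for ages where the line stays inside that range.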
Regression Results 1: Age and FES
$Y_{\text{FES-Cohesion}} = -1.34 \, \text{Age} + 6.9$
• From our equation, we can see that for every
year that a child is older, there is an average
decrease in the FES-Cohesion score of 1.34.
• The range of FES-Cohesion scores was from 0-9.
• A decrease of 1.34 on a scale from 0-9 for every
year that the child is older is a significant and
meaningful decrease in cohesion of a family
environment as reported by the parent or carer!
• The regression model significantly predicts FES-Cohesion scores (F=5.880, df= 1,93, p<.05).
• Age of the child is a significant predictor (t=-2.425,
p<.05) of FES-Cohesion.
• Age of the child explains 5.9% of the variance in
the FES-Cohesion scores.
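The three reported statistics are tied together arithmetically, which gives a quick way to check a write-up. A sketch assuming the standard identities for a one-predictor model (R Square can be recovered from F and its degrees of freedom, and t squared equals F); the function name is hypothetical.

```python
def r_squared_from_f(f_value, df_model, df_residual):
    """R^2 implied by an F statistic: F*df1 / (F*df1 + df2)."""
    return (f_value * df_model) / (f_value * df_model + df_residual)

# Values reported above for the FES-Cohesion model
F, DF_MODEL, DF_RESIDUAL, T = 5.880, 1, 93, -2.425

r2 = r_squared_from_f(F, DF_MODEL, DF_RESIDUAL)
print(f"R^2 from F: {r2:.3f}")               # compare with the reported 5.9%
print(f"t^2 = {T ** 2:.3f} vs F = {F:.3f}")  # equal when there is one predictor
```

If the recovered values disagree with the reported ones by more than rounding, a figure has probably been copied incorrectly from the output.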
Regression Example 2: Age and SDQ
• To conduct each analysis, we first need to
select the SDQ score using the Linear
Regression menu.
• Select “Analyze”, then “Regression”, then
“Linear”.
• The Linear Regression window will appear.
Regression Example 2: Age and SDQ scores
• We found from the previous analysis that older
children in the home are associated with greater
difficulties with the cohesion of the family
environment.
• We wanted to explore this aspect of the family
further.
• The Strengths and Difficulties Questionnaire (SDQ)
provides a view of the psychosocial problems of a
child as reported by the parent or carer.
• We would hypothesize that age of the child would
predict increased psychosocial difficulties reported
by the parent or carer on the SDQ measure
(research hypothesis).
Regression Example 2: Age and SDQ
• We chose to use regression to test the ability of
the age of the child to predict SDQ difficulty
scores.
• Both are interval/ratio variables.
• Our null hypothesis for each analysis is that age
of the child does not significantly predict the SDQ
scores.
• In other words, the null hypothesis is that there
will be no associated change in the SDQ scores
based on a change in age of the child.
• Statistically, the null hypothesis would indicate
that $\beta_1 = 0$.
– Recall the regression formula: $Y = \beta_1 X_1 + \beta_0 + e$
• Replace “FES_Cohesive” with “SDQ_TotalDif”
from the list on the right.
• “SDQ_TotalDif” is the new dependent variable in
this new analysis.
Regression Example 2: Age and SDQ
• Under the “Statistics” menu on the right side of
the “Linear Regression” window, select the
following:
– Regression Coefficients – Estimates: this provides the
regression coefficients (intercept and slope) and their
significance tests
– Model Fit: this provides R and R Square, which estimate the
percentage of the variance in the DV that is explained by the IV,
along with the ANOVA table
– Descriptives: this provides the descriptive statistics for
the variables in the analysis and the correlations between them
• Click “Continue”
• Click “OK” to conduct the analysis
Regression Results 2: Age and SDQ
• The first table gives the descriptive statistics for
the variables in the analysis.
Regression Results 2: Age and SDQ
• The second table provides the correlation
coefficients between the two variables.
• Age of the child is not significantly correlated with the
SDQ total difficulties score (r= .127, p>.05).
• The weak, positive correlation may have occurred by
chance and cannot be taken as evidence of an actual
relationship between the two variables.
• We see “R” is our correlation coefficient, r = .127
• “R Square” is $r^2$, the square of the correlation
coefficient.
• $r^2$ can be interpreted as the percentage of the
variance in the DV that is explained by the IV.
• In this case, age of the child can statistically explain
1.6% of the variation in the SDQ total difficulties
score.
• This is a very small $r^2$, showing how poorly age of the
child predicts SDQ scores.
Regression Results 2: Age and SDQ
• Regression uses the same class of test as
ANOVA (an F test); the table for this analysis is below:
Regression Results 2: Age and SDQ
• The table indicates that from our regression
model, we have NOT significantly predicted the
SDQ total scores (F=1.524, df= 1,93, p>.05).
• We have failed to reject our null hypothesis that
the age of the child does not predict SDQ total
scores.
• The regression model does not significantly predict
SDQ scores (F=1.524, df= 1,93, p>.05).
• Age of the child is not a significant predictor (t=1.234, p>.05) of SDQ scores.
Regression Results 2: Age and SDQ
• It is interesting that the family environment was
predicted by the age of the child, but the age of
the child did not predict the parent/carer
reports of the psychosocial difficulties of the
child.
• These two analyses show how the results of
regression models can differ. We completed
the regression equation only for the one
model in which the independent variable (age of
child) did significantly predict the dependent
variable (reports of cohesion in the family
environment).