Statistics for Everyone Workshop
Fall 2010
Part 6B
Assessing the Relationship Between More than
2 Numerical Variables Using
Correlation and Multiple Regression
Workshop presented by Linda Henkel and Laura McSweeney of Fairfield
University
Funded by the Core Integration Initiative and the Center for Academic
Excellence at Fairfield University
Assessing the Relationship Between
3+ Numerical Variables With Correlation
What if the research question we want to ask is whether or
how strongly 3 or more variables are related to each other,
but we do not have experimental control over those
variables and we rely on already existing conditions?
• TV viewing, exercise, and obesity
• Age, body temperature, and severity of flu symptoms
• Tire pressure, number of days since last oil change, and
gas mileage
• Volume, R-rating, and compression of insulation
Assessing the Relationship Between
3+ Variables With Correlation
The nature of the research question is about the
association between the variables, not about
whether one or some causes the other
Correlation describes and quantifies the systematic,
linear relation between variables
Correlation ≠ Causation
Statistics as a Tool in Scientific Research
Types of Research Questions
• Is there an association between X1, X2, …, Xk and Y?
• As the X’s change, what does Y do?
• How well can we predict Y from X1, X2, …, Xk?
Terminology
X variables: predictors (independent variables,
explanatory variables, what we know)
Y variable: criterion (dependent variable,
response variable, what we want to predict)
Example: Is there a relationship between diastolic
blood pressure (DBP) and the birth weights (BW)
and age of newborns?
Y = Diastolic Blood Pressure (mm Hg)
X1= Birth Weight (oz), X2 = Age (days)
Types of Correlation
Pairwise (or Bivariate) Correlation: Correlation
between 2 variables measures the strength of the
linear relationship between the 2 variables
Multiple Correlation: Correlation between Y and
the set of variables X1, X2, …, Xk; equal to sqrt(R2)
[Though people usually just report R2 instead of the
multiple correlation]
Partial Correlation: Measure of association
between Y and Xj controlling for effects of the
other X’s
What about R2?
Often in correlational research the value R2 is
reported.
R2 = Proportion of variability of Y accounted for by
all the X’s and the linear model
But R2 increases as the number of X variables
increases
Adjusted R2 = 1 – (1 – R2)(n – 1)/(n – k – 1)
Takes the number of variables into account
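For instructors who want students to verify the adjustment by hand, here is a minimal Python sketch of the formula above. The inputs use the workshop's newborn example (R2 = .881, k = 2); N = 16 is inferred from the df of the example F test reported later, F(2, 13).

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Workshop's newborn example: R^2 = .881 with N = 16 and k = 2 predictors
print(round(adjusted_r_squared(0.881, 16, 2), 3))  # 0.863, matching the slide
```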
Pairwise (or Bivariate) Correlation
Correlation between:
BW and Age, BW and DBP, Age and DBP
Pairwise Correlation
        BW      Age     DBP
BW      1
Age     .107    1
DBP     .441    .871*   1

*significant correlation (p < .001)
Multiple Correlation
Correlation between DBP and both BW and Age
R        R Square    Adjusted R Square
.939*    .881        .863
*significant correlation (p <.001)
Partial Correlation
Correlation between:
DBP and BW after correcting for Age
DBP and Age after correcting for BW
Partial Correlation
        BW      Age
DBP     .713*   .923*

*significant correlation (p < .001)
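For courses that want a scriptable supplement to SPSS, here is a minimal Python sketch (numpy and statsmodels) that produces all three kinds of correlation. The data values are hypothetical placeholders, not the workshop's newborn dataset.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: birth weight (oz), age (days), diastolic BP (mm Hg)
bw  = np.array([135, 120, 100, 105, 130, 125, 105, 150], dtype=float)
age = np.array([3, 4, 3, 2, 4, 5, 3, 4], dtype=float)
dbp = np.array([89, 90, 83, 77, 92, 98, 85, 97], dtype=float)

# Pairwise (bivariate) correlations: a 3x3 matrix like the table above
print(np.corrcoef([bw, age, dbp]))

# Multiple correlation: sqrt(R^2) from regressing DBP on BW and Age
X = sm.add_constant(np.column_stack([bw, age]))
print(np.sqrt(sm.OLS(dbp, X).fit().rsquared))

# Partial correlation of DBP and BW controlling for Age:
# correlate the residuals of DBP ~ Age with the residuals of BW ~ Age
Za = sm.add_constant(age)
r_dbp = sm.OLS(dbp, Za).fit().resid
r_bw  = sm.OLS(bw, Za).fit().resid
print(np.corrcoef(r_dbp, r_bw)[0, 1])
```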
Multiple Linear Regression
Assumed underlying model
Y = α + β1X1 + β2X2 + … + βkXk + ε, where ε is
normally distributed with mean 0 and variance σ2
X1, X2, …, Xk: Independent Variables
Y: Dependent Variable
Each i, i = 1, 2, …, k, represents the partialregression coefficient or the average increase in Y
per unit increase in Xi adjusted for all the other
variables in the model (or holding all the other
variables constant). Estimated by the value bi.
Estimating the Regression Model
Y’ = a + b1X1 + b2X2 + …+ bkXk
Y’ = Predicted Y-value
Example: Is there a relationship between
diastolic blood pressure (DBP) and the birth
weights (BW) and age of newborns?
Y’ = Predicted DBP(mm Hg)
X1= BW (oz), X2 = Age (days)
Least Squares Line
Error or residual = observed Y – predicted Y
= Y – Y’
The least-squares line is the “best fit” linear model
that minimizes the sum of the squared
errors/residuals
i.e., the estimates a, b1, b2, …, bk minimize Σ(Y – Y’)2
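A small numpy sketch of this minimization; the X and Y values here are hypothetical.

```python
import numpy as np

# Hypothetical X columns (BW, Age) and response Y
X = np.array([[135, 3], [120, 4], [100, 3], [105, 2], [130, 4], [125, 5]], dtype=float)
y = np.array([89, 90, 83, 77, 92, 98], dtype=float)

# Prepend a column of ones so the intercept a is estimated too
A = np.column_stack([np.ones(len(y)), X])

# lstsq returns the coefficients that minimize sum((y - A @ coef)**2)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)
```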
The Multiple Regression Model
Y’ = a + b1X1 + b2X2 + …+ bkXk
Ex: Is there a relationship between diastolic
blood pressure and the birth weights and age
of newborns?
Y’ = Predicted DBP(mm Hg)
X1= BW (oz), X2 = Age (days)
Y’ = 23.45 + 0.126X1 + 5.89X2
So, on average, newborns’ DBP increases by
5.89 mm Hg for each additional day in age
(holding BW fixed).
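Plugging values into the slide's fitted equation shows how prediction works; the newborn's values below are hypothetical.

```python
def predict_dbp(bw_oz, age_days):
    """Predicted DBP from the slide's fitted model Y' = 23.45 + 0.126*BW + 5.89*Age."""
    return 23.45 + 0.126 * bw_oz + 5.89 * age_days

# Hypothetical newborn: 120 oz, 3 days old
print(predict_dbp(120, 3))  # 23.45 + 15.12 + 17.67 = 56.24 mm Hg
```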
Hypothesis Testing For Multiple Linear Regression
The F test allows a scientist to determine whether
their research hypothesis is supported
Step 1: Overall F-Test for Model Significance
Test whether the linear model is significant.
Step 2: Stepwise tests (or t-tests for Partial Regression
Coefficients)
Determine which independent variables are significant in the
model. Remove any independent variable that is not
significant.
Step 1: Overall F-Test For Multiple Linear Regression
Y = α + β1X1 + β2X2 + … + βkXk + ε, where ε is
normally distributed with mean 0 and variance σ2
Null hypothesis H0:
• There is not a linear relationship between
X1, X2, …, Xk and Y
• β1 = β2 = … = βk = 0
Overall F-test for Multiple Regression
Research hypothesis HA:
• At least one Xi is linearly related to Y
• At least one βi ≠ 0
Teaching tip:
1.) Very important for students to be able to
understand and state the research question so
that they see that statistics is a tool to answer
that question
2.) If the multiple linear regression model is
significant, then you must determine whether
each independent variable is significant.
Sources of Variance in Multiple Linear Regression
SSTotal = SSRegression + SSResidual
Σ(Yi – MY)2 = Σ(Yi’ – MY)2 + Σ(Yi – Yi’)2
• Total Sum of Squares: how much all the individual
scores differ from the grand mean
• Regression Sum of Squares: how much all the
predicted values differ from the grand mean
• Residual Sum of Squares: how much the individual
scores differ from the predicted values
Each F test has certain values for degrees of freedom (df), which are based on the
sample size (N) and the number of predictors (k), and the F value will be associated
with a particular p value
SPSS calculates these numbers
Summary Table for Multiple Linear Regression: X1, X2, .., Xk and Y
Source              SS               df           MS
Regression          Σ(Yi’ – MY)2     k            SSRegression/dfRegression
Residual (Error)    Σ(Yi – Yi’)2     N – k – 1    SSResidual/dfResidual
Total               Σ(Yi – MY)2      N – 1

F = MSRegression/MSResidual

N = # samples
To report, use the format: F(dfRegression, dfResidual) = x.xx, p _____.
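A short Python sketch showing how the table's entries combine into F and its p value; the sums of squares here are hypothetical, chosen only for illustration.

```python
from scipy import stats

# Hypothetical sums of squares; N samples, k predictors
ss_regression, ss_residual = 840.0, 114.0
N, k = 16, 2

df_regression, df_residual = k, N - k - 1
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

F = ms_regression / ms_residual
p = stats.f.sf(F, df_regression, df_residual)  # upper-tail probability
print(f"F({df_regression}, {df_residual}) = {F:.2f}, p = {p:.4f}")
```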
The F-Test
• A test for multiple regression gives you an F ratio
• The bigger the F value, the less likely the
relationship between the Xs and Y is just due to
chance
• The bigger the F value, the more likely the
relationship between the Xs and Y is not just due
to chance and is due to a real relationship
The F-Test
• So big values of F will be associated with small p
values that indicate the linear relationship is
significant (p < .05)
• Little values of F (i.e., close to 1) will be associated
with larger p values that indicate the linear
relationship is not significant (p > .05)
• Based on p value, determine whether you have
evidence to conclude the relationship was probably
real or was probably due to chance: Is the
research hypothesis supported?
What Do We Mean by “Just Due to Chance”?
p value = probability of results being due to chance
When the p value is high (p > .05), the obtained
relationship is probably due to chance
e.g., p = .99, .75, .55, .25, .15, .10, .07
When the p value is low (p < .05), the obtained
relationship is probably NOT due to chance and
more likely reflects a real linear relationship
e.g., p = .04, .03, .02, .01, .001
What Do We Mean by “Just Due to Chance”?
In science, a p value of .05 is a conventionally accepted cutoff
point for saying when a result is more likely due to chance
or more likely due to a real effect
Not significant = the obtained relationship is probably due to
chance; there doesn’t appear to be a linear relationship
between Y and the X’s; p > .05
Statistically significant = the obtained relationship is probably
NOT due to chance and is likely a real linear relationship
between Y and at least one of the Xi; p < .05
Interpreting the F test for Multiple Regression
Cardinal rule: Scientists do not say “prove”! Conclusions are based on
probability (likely due to chance, likely a real effect…). Be explicit about
this to your students.
Based on p value, determine whether you have evidence to conclude the
relationship was probably real or was probably due to chance: Is the
research hypothesis supported?
p < .05: Significant
• Reject null hypothesis and support research hypothesis (the
relationship was probably real; X’s are linearly related to Y)
p > .05: Not significant
• Retain null hypothesis and reject research hypothesis (any
relationship was probably due to chance; X’s are not linearly
related to Y)
More Teaching Tips
You can ask your students to report either:
• the exact p value (p = .03, p = .45)
• the cutoff: say either p < .05 (significant) or p > .05 (not significant)
You should specify which style you expect. Ambiguity confuses them!
Tell students to use the word “significant” only when they
mean it (i.e., the probability the results are due to chance is less
than 5%) and not to use it with adjectives (i.e., they often mistakenly
think one test can be “more significant” or “less significant” than
another). Emphasize that “significant” is a cutoff that is either met or
not met, just like you are either found guilty or not guilty, pregnant
or not pregnant. There are no gradients. Lower p values = lower
likelihood that results are due to chance, not “more significant”
Next Steps
If the overall model is not significant, you
stop the regression analysis with this set of
variables. There is nothing more to do.
If the overall model is significant, the
significance of the independent
contributions of each explanatory variable is
tested. Just because the model is
significant does not mean that every
variable is.
Step 2: Stepwise Test For Multiple Linear Regression
You test only one variable at a time to see if it is
significant or not, assuming the other variables are
in the model
There are various ways to do this (Standard, Forward
Selection, Backward Selection, Stepwise,
Hierarchical). The order in which you enter in the
data may make a difference.
Null hypothesis H0: βi = 0 while all other βj ≠ 0
Research hypothesis HA: βi ≠ 0 while all other βj ≠ 0
Step 2: Stepwise Test For Multiple Linear Regression
The test statistic is:
t = bi/SE(bi)
Where:
the partial-regression coefficient bi is the estimate
for i from the regression model
and SE(bi) is the standard error of the partialregression coefficient
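A minimal scipy sketch of this t test; the coefficient and its standard error below are hypothetical numbers.

```python
from scipy import stats

b_i, se_b_i = 5.89, 0.68   # hypothetical coefficient and standard error
N, k = 16, 2               # sample size and number of predictors

t = b_i / se_b_i
df = N - k - 1             # residual degrees of freedom
p = 2 * stats.t.sf(abs(t), df)  # two-sided p value
print(f"t({df}) = {t:.2f}, p = {p:.4f}")
```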
The Essence of a t Test
Each t test gives you a t score, which can be positive or
negative; It’s the absolute value that matters
The bigger the |t| score, the less likely that the slope of Xi
being different from 0 is just due to chance
The bigger the |t| score, the more likely that Xi is linearly
related to Y
So big values of |t| will be associated with small p values that
indicate that there is a significant relationship between Xi
and Y (p < .05)
Little values of |t| (i.e., close to 0) will be associated with
larger p values that indicate that the relationship
between Xi and Y is not significant (p > .05)
Next Steps
If the null hypothesis is not rejected (p >.05), then
you can remove the variable Xi from the model.
You would rerun the stepwise test without Xi and
see if another variable could be removed. You
would continue this process until no more
variables can be removed.
If the null hypothesis is rejected (p < .05) then you
do not remove the variable Xi from the model.
Running Multiple Linear Regression in SPSS
Multiple Linear Regression: When you want to predict a
numerical variable Y from numerical variables X1, X2, …, Xk
Need: Enter X1, X2, …, Xk and Y data in SPSS
To get linear model:
Analyze → Regression → Linear
Move the Y variable to Dependent and the X1, X2, …, Xk
variables to Independent; click OK.
Output: Linear model and output needed to run overall
F-test and stepwise tests
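The workshop uses SPSS throughout; for instructors who also teach Python, here is a rough statsmodels equivalent of the same steps. The data and column names are hypothetical placeholders.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame with the response and predictors
df = pd.DataFrame({
    "dbp": [89, 90, 83, 77, 92, 98, 85, 97],
    "bw":  [135, 120, 100, 105, 130, 125, 105, 150],
    "age": [3, 4, 3, 2, 4, 5, 3, 4],
})

fit = smf.ols("dbp ~ bw + age", data=df).fit()
print(fit.summary())  # overall F test, R^2, and per-coefficient t tests
```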
Reporting the Results of the F Test
for Multiple Linear Regression
Step 1: Write a sentence that clearly indicates what statistical
analysis you used
An F test was used to determine if there was a linear relationship
between [X1, X2, …, Xk] and [Y].
Or, an F test was used to determine if [X1, X2, …, Xk] is linearly
related to [Y].
An F test was used to determine if age and birth weight are
linearly related to diastolic blood pressure.
Reporting the Results of the F Test
for Multiple Regression
Step 2: Report whether the overall linear model was significant or
not
The linear relationship between [X1, X2, …, Xk] and [Y] was
significant [or not significant], F(dfRegression, dfResidual) = X.XX [fill
in F], p = xxxx.
The linear relationship between diastolic blood pressure and birth
weight and age was significant, F(2, 13) = 48.08, p < .001.
Reporting the Results of the F Test
for Multiple Regression
Step 3: If the overall model is significant, report whether any of the
variables were removed during the stepwise tests or report
that no variable could be removed.
Variable [Xi] was not significantly related to Y after adjusting for
the other variables [X1, X2, …, Xk] and was removed from the
model, t(df) = X.XX, p = X.XX.
Or state that all the X variables were significant.
The linear relationship between diastolic blood pressure and birth
weight and age was significant, F(2, 13) = 48.081, p < .001.
Both variables were significant in the model.
Cautions Involved in Multiple Linear Regression
• Multicollinearity (the independent variables
are highly correlated); you may get a model
that is significant even though none of the
individual variables are, or you may get
unstable coefficients
• Outliers
• Too many independent variables
• Small sample sizes (need at least 10
samples per variable in the model)
Assumptions Involved in Multiple Linear Regression
• Independence of Independent Variables
• Linearity
• Randomness and Normality
• Homoscedasticity
Check the Independence of Independent Variables
Check the pairwise correlations or make
scatterplots of each pair of variables to see
if there is multicollinearity
If two variables are highly correlated you may
only need to use one of them in your model
Variance Inflation Factor (VIF) Test: If VIF > 5
or 10, this indicates multicollinearity
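A sketch of the VIF check using statsmodels' variance_inflation_factor; the predictor values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor matrix: columns are BW and Age
X = np.column_stack([
    [135, 120, 100, 105, 130, 125, 105, 150],
    [3, 4, 3, 2, 4, 5, 3, 4],
]).astype(float)
exog = sm.add_constant(X)

# VIF for each predictor (skip column 0, the constant)
for i, name in enumerate(["BW", "Age"], start=1):
    print(name, variance_inflation_factor(exog, i))
```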
Assessing Randomness and Normality of Residuals
Are the residuals random and normally
distributed?
Make a histogram of the residuals (or the
standardized residuals, which are centered
about 0).
Assessing Linearity
Method 1: Plot Y vs. Xi to see if there is a linear
relationship between Y and Xi. [Partial Plot in SPSS]
Repeat for all Xs.
Method 2: Plot the standardized residuals vs. each
Xi and check that the residuals are randomly
scattered about the line y = 0; Any pattern would
indicate a non-linear relationship between Y and that
particular Xi.
Assessing Homoscedasticity
Homoscedasticity = the variability or scatter of points
about the best fit line is fairly constant
Method 1: Plot Y vs. Xi [the Partial Plots in SPSS]
and see if the fit of the line is fairly constant for all
values of Xi. Repeat for all Xs.
Method 2: Plot standardized residuals vs. Xi and
check that the spread of the residuals is fairly
constant for all values of Xi
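One matplotlib sketch covering the three residual checks above: the normality histogram plus residuals-versus-predictor plots for linearity and homoscedasticity. The data are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data
bw  = np.array([135, 120, 100, 105, 130, 125, 105, 150], dtype=float)
age = np.array([3, 4, 3, 2, 4, 5, 3, 4], dtype=float)
dbp = np.array([89, 90, 83, 77, 92, 98, 85, 97], dtype=float)

fit = sm.OLS(dbp, sm.add_constant(np.column_stack([bw, age]))).fit()
std_resid = fit.resid / np.std(fit.resid, ddof=1)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(std_resid)            # roughly bell-shaped and centered at 0?
axes[0].set_title("Residual histogram")
for ax, x, name in [(axes[1], bw, "BW"), (axes[2], age, "Age")]:
    ax.scatter(x, std_resid)       # random scatter about y = 0, constant spread?
    ax.axhline(0)
    ax.set_title(f"Residuals vs {name}")
plt.show()
```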
Using Dummy Variables In the Model
Dummy variables are used to code categorical
variables into the model
If the categorical variable has n subcategories, then
you will need n – 1 “dummy” variables to describe
the categories
Ex: Gender (Male = 0, Female = 1)
Dummy Variables
Ex: Marital Status
Responses: Single = 1, Married = 2, Divorced = 3
So, 3 subcategories require 2 dummy variables, X1
and X2, which would be used in the regression
model
Response              X1 (Vector 1)   X2 (Vector 2)
Single (or Control)   0               0
Married               1               0
Divorced              0               1
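A pandas sketch of this coding; pd.Categorical fixes the category order so that “Single” is the dropped control level, matching the table above. The responses are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"marital": ["Single", "Married", "Divorced", "Married"]})

# Order the categories so "Single" is the control (dropped) level
df["marital"] = pd.Categorical(df["marital"],
                               categories=["Single", "Married", "Divorced"])

# drop_first=True keeps n - 1 = 2 dummy columns, matching the table above
print(pd.get_dummies(df["marital"], drop_first=True))
```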
Time to Practice
Multiple Regression