Solving Regression Problems as a General Linear Model

Homework problems are multiple answer rather than multiple choice. The format for multiple
answer questions is shown in the example below.
The directions for the problems instruct you to mark the check boxes for all of the statements
that are true. One or more answers must be marked for each problem. Full or partial credit is
computed for each question. To receive full credit, you must mark all of the correct answers
and not mark any of the incorrect answers. Partial credit is computed by summing the points
for each correct response and subtracting points for each incorrect answer. If the computation
for partial credit results in a negative number, zero credit is assigned.
Level of Measurement Requirement and Sample Size Requirement
Multiple regression requires that the dependent variable be interval and the independent
variables be interval or dichotomous. If one of the variables is ordinal level, we will follow the
common convention of treating ordinal variables as interval level, but we should consider
noting the use of an ordinal variable as a limitation to our findings.
These problems use the rule of thumb from Tabachnick and Fidell that the required number of
cases should be the larger of the number of independent variables x 8 + 50 or the number of
independent variables + 105.
If the sample size requirement (along with the level of measurement requirement) is satisfied,
the check box “The level of measurement requirement and the sample size requirement are
satisfied” should be marked. In many of the problems we have worked, failing to meet the sample
size requirement implied an inappropriate application of the statistic, and we halted all further
work on the problem. We will not apply that policy to these problems. If our sample size is less
than the minimum requirement, we leave the check box unmarked and continue with the problem,
mentioning the sample size issue as a limitation of the analysis.
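The rule of thumb above is simple arithmetic, so it can be sketched in a few lines of Python. The function names are illustrative and not part of SPSS; the constants are the ones quoted in this handout.

```python
def required_cases(num_predictors: int) -> int:
    """Minimum sample size: the larger of (8 x predictors + 50) and (predictors + 105)."""
    return max(8 * num_predictors + 50, num_predictors + 105)


def sample_size_ok(valid_cases: int, num_predictors: int) -> bool:
    """True when the number of valid cases meets the rule of thumb."""
    return valid_cases >= required_cases(num_predictors)


print(required_cases(2))       # 107
print(sample_size_ok(254, 2))  # True
```

With two predictors the second term dominates (107 versus 66); the first term only takes over once there are eight or more predictors.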
The Assumption of Normality
Regression assumes that the residuals are normally distributed. We will meet this assumption if
each of the interval variables is normally distributed, but there is general consensus that
violations of this assumption do not seriously affect the probabilities needed for statistical
decision making, especially when the sample size is large.
The problems evaluate normality based on the criterion that the skewness and kurtosis of each
variable fall within the range from -1.0 to +1.0. If the variables satisfy these criteria for
skewness and kurtosis, the check box “The skewness and kurtosis of the variables satisfy the
assumption of normality” should be marked. If the criteria for normality are not satisfied, the
check box should remain unmarked and we should consider including a statement about the
violation of this assumption in the discussion of our results.
In these problems we will not test transformations or consider removing outliers to improve the
normality of the variables.
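As a rough illustration of the criteria above, the sketch below computes skewness and excess kurtosis with the usual bias-adjusted sample formulas and checks them against the -1.0 to +1.0 range. It assumes these formulas match what SPSS reports, and it treats the range as inclusive; the function names are illustrative.

```python
from math import sqrt


def sample_skewness(values):
    """Bias-adjusted sample skewness (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (needs at least 4 values)."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    fourth = sum(((v - mean) / s) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


def meets_normality_criteria(values, limit=1.0):
    """True when both statistics fall within [-limit, +limit]."""
    return (-limit <= sample_skewness(values) <= limit
            and -limit <= sample_excess_kurtosis(values) <= limit)


# A flat distribution is symmetric but too platykurtic for the criteria.
print(meets_normality_criteria([1, 2, 3, 4, 5]))  # False
```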
The Assumption of Homoscedasticity
Regression assumes that the variance of the residuals is homogeneous across predicted values
of the dependent variable. SPSS does not compute Levene’s test for equality of variance when
all of the variables are interval (or ordinal treated as interval). The check box “The regression
analysis satisfies the assumption of homoscedasticity” will remain unchecked for these problems.
The Assumption of Linearity
The assumption of linearity is tested with the lack of fit test in the Univariate General Linear
Model procedure. If the test is significant, it implies that there is a non-linear component that
should be added to the model. If the test is not significant, we assume that a linear model is
present and is an adequate representation of the relationship between the dependent and
independent variables. If the lack of fit test is not significant at the alpha level for diagnostic
statistics, the check box “The regression analysis satisfies the assumption of linearity” is
marked.
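To show what a lack of fit test compares, here is a minimal sketch for the simplified case of a single predictor with repeated x values. It splits the residual sum of squares into pure error (variation within each distinct x value) and lack of fit (what remains), then forms the F ratio. This is not SPSS's implementation, and it stops short of the p-value, which would need an F distribution function from a statistics library.

```python
from collections import defaultdict


def lack_of_fit_f(x, y):
    """Return (F, df_lack_of_fit, df_pure_error) for a straight-line fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares around the fitted line.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

    # Pure error: variation of y around its group mean at each distinct x value.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss_pure_error = sum(
        sum((yi - sum(ys) / len(ys)) ** 2 for yi in ys) for ys in groups.values()
    )

    df_lack_of_fit = len(groups) - 2   # distinct x values minus two fitted parameters
    df_pure_error = n - len(groups)    # requires at least one repeated x value
    ss_lack_of_fit = ss_residual - ss_pure_error
    f_ratio = (ss_lack_of_fit / df_lack_of_fit) / (ss_pure_error / df_pure_error)
    return f_ratio, df_lack_of_fit, df_pure_error


# Group means (2, 4, 6) fall exactly on a line, so F is essentially zero.
f_value, df1, df2 = lack_of_fit_f([1, 1, 2, 2, 3, 3], [1, 3, 3, 5, 5, 7])
print(df1, df2)  # 1 3
```

A small F (and hence a large p-value) means the group means do not stray from the straight line by more than the within-group noise would explain, which is the situation in which the linearity check box is marked.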
The Assumption of Independence of Errors
SPSS does not compute the Durbin-Watson statistic in the Univariate General Linear Model
procedure. In these problems, we will acknowledge that fact and not mark the check box “The
regression analysis satisfies the assumption of independence of errors”.
The Assumption of Independence of Variables
SPSS does not compute tolerance or VIF in the Univariate General Linear Model procedure. In
these problems, we will acknowledge that fact and not mark the check box “The regression
analysis satisfies the assumption of independence of variables”.
I have included the complete list of assumptions in the list of possible answers even though
some will not ever be marked in this assignment because of limitations in the univariate
general linear procedure. In the future, we will develop a strategy for testing all of the
assumptions.
Interpretation of the Overall Relationship
The presence of an overall relationship between the dependent variable and the independent
variables is represented by the statement that both predictors together have a relationship to
the dependent variable. If the ANOVA test of the overall relationship (“Corrected Model” in the
table of “Tests of Between-Subjects Effects”) is not statistically significant, this statement is
not marked as a correct finding. If the overall relationship is not statistically significant, we
will not interpret the individual relationships.
If the overall relationship is statistically significant, we should examine the adjective
describing the strength of the relationship. SPSS computes partial eta squared as a measure of
effect. We characterize it as trivial, small, moderate, or large, applying Cohen's criteria for
effect size (less than .01 = trivial; .01 up to 0.06 = small; .06 up to .14 = moderate; .14 or
greater = large). If the adjective describing the strength of the relationship is not correct, the
check box for the overall relationship is not marked.
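The cut-offs above translate directly into a small lookup. This is only a sketch of the labelling rule as stated in this handout; the function name is illustrative.

```python
def effect_size_label(partial_eta_squared: float) -> str:
    """Map partial eta squared to the adjective used in these problems."""
    if partial_eta_squared < 0.01:
        return "trivial"
    if partial_eta_squared < 0.06:
        return "small"
    if partial_eta_squared < 0.14:
        return "moderate"
    return "large"


print(effect_size_label(0.27))  # large
print(effect_size_label(0.03))  # small
```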
Interpretation of the Individual Relationships
Determination of the correctness of statements about individual relationships is a two stage
process. First, it is required that the relationship be statistically significant (the test of the
slope in the table of “Parameter Estimates”). Second, it is required that the statement be
a correct interpretation of the direction of the relationship with the dependent
variable.
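The two-stage check can be sketched as follows. The labels "direct" and "inverse" and the function name are illustrative choices; the significance rule (p at or below alpha) follows the logic diagram at the end of this handout.

```python
def statement_supported(slope: float, p_value: float,
                        claimed_direction: str, alpha: float = 0.05) -> bool:
    """Stage 1: the slope must be significant. Stage 2: its sign must match the claim."""
    if p_value > alpha:
        return False
    actual_direction = "direct" if slope > 0 else "inverse"
    return actual_direction == claimed_direction


# A positive, significant slope supports a "direct" statement only.
print(statement_supported(0.13, 0.004, "direct"))   # True
print(statement_supported(0.13, 0.004, "inverse"))  # False
```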
The problems also contain statements about which predictor was more important or had the
greater impact. This is based on the magnitude of the partial eta squared statistic for each
independent variable, provided the variable has a statistically significant individual
relationship.
Inappropriate application of the statistic
The only limitation to the use of regression imposed on these problems is that we should not
use regression if we violate the level of measurement requirement.
Solving Problems in SPSS
We will demonstrate the use of
SPSS to compute a regression
analysis with the general linear
model procedure using this
problem.
The introductory statement identifies
the variables for the analysis and the
significance levels to use. Note that
we use a more conservative alpha
(.01) for diagnostic statistics than we
do for the statistics that answer our
research questions.
Level of Measurement – 1
The first statement in the problem asks about
level of measurement and sample size. Multiple
regression requires that the dependent variable
be interval and the independent variables be
interval or dichotomous. In these problems, we
will limit our analysis to the inclusion of interval
independent variables.
Level of Measurement - 2
To determine the level of measurement,
we examine the information about
variables in the SPSS data editor,
specifically the values and value labels.
"Occupational prestige score" [prestg80] is interval level,
satisfying the requirement for the dependent variable.
"Age" [age] is interval level, satisfying the requirement for
the independent variable. "Highest year of school
completed" [educ] is interval level, satisfying the
requirement for the independent variable.
Using Univariate General Linear Model for Linear Regression - 1
Select General Linear
Model > Univariate from
the Analyze menu.
To check for compliance with sample
size requirements, we run the univariate
general linear model procedure. This
procedure will give us the correct
number of cases used in the analysis,
taking into account missing data for all
of the variables in the analysis.
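The count reported by the procedure reflects listwise deletion: a case is used only if it has a value on every variable in the analysis. A minimal sketch of that count, using made-up rows and `None` as the missing-value marker (both assumptions for illustration):

```python
def count_valid_cases(rows, variables):
    """Count rows that have a non-missing value on every analysis variable."""
    return sum(all(row.get(v) is not None for v in variables) for row in rows)


# Hypothetical data, not taken from the course data set.
rows = [
    {"prestg80": 50, "educ": 12, "age": 30},
    {"prestg80": None, "educ": 16, "age": 41},   # missing dependent variable
    {"prestg80": 44, "educ": 14, "age": None},   # missing predictor
    {"prestg80": 61, "educ": 18, "age": 52},
]
print(count_valid_cases(rows, ["prestg80", "educ", "age"]))  # 2
```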
Using Univariate General Linear Model for Linear Regression - 2
First, move prestg80
to the Dependent
Variable text box.
Second, move educ and
age to the Covariate(s)
list box. Interval level
variables are treated as
covariates rather than
factors in the general
linear model.
Third, click on the
Options button.
Using Univariate General Linear Model for Linear Regression - 3
In the Options dialog box, we
mark the statistics we want
to include in the output.
First, mark the check boxes for
Descriptive statistics, Estimates of effect size,
Parameter estimates, and Lack of fit.
Second, since this is the
only output we need for now,
click on the Continue button.
Using Univariate General Linear Model for Linear Regression – 4
We have finished entering
the specifications we need
for our analysis.
Click on the OK
button to obtain
the output.
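A general linear model that contains only covariates is an ordinary multiple regression: each covariate becomes one column of the design matrix next to a column of ones for the intercept. The sketch below fits such a model by solving the normal equations in plain Python. It is not how SPSS computes its estimates, and the data are invented so that the true coefficients are known.

```python
def fit_linear_model(covariates, outcome):
    """Least-squares intercept and slopes from the normal equations (X'X)b = X'y."""
    rows = [[1.0] + [float(v) for v in row] for row in covariates]  # intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, outcome))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


# Hypothetical (educ, age) pairs with an outcome built from known coefficients.
covariates = [(12, 30), (16, 45), (10, 25), (14, 60), (18, 38)]
outcome = [5 + 2 * educ + 0.1 * age for educ, age in covariates]
intercept, b_educ, b_age = fit_linear_model(covariates, outcome)
print(round(b_educ, 3), round(b_age, 3))
```

The returned slopes play the role of the b coefficients in the Parameter Estimates table: each one is the change in the dependent variable per unit change in that covariate, holding the other constant.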
Descriptive Statistics from the Univariate General Linear Model - 1
The table of Descriptive
Statistics contains the number
of cases used in the analysis.
Using the rule of thumb from Tabachnick and
Fidell that the required number of cases should
be the larger of the number of independent
variables x 8 + 50 or the number of
independent variables + 105, multiple
regression requires 107 cases. With 254 valid
cases, the sample size requirement is satisfied.
Marking the Statement for the Level of Measurement and Sample Size Requirement
Since we satisfied both the level
of measurement and the sample
size requirements for analysis,
we mark the first checkbox for
the problem.
The Assumption of Normality
The next statement in the problem focuses on
the assumption of normality, using the skewness
and kurtosis criteria that both statistical values
should be between -1.0 and +1.0.
Computing Skewness and Kurtosis to Test for Normality – 1
Skewness and kurtosis are
calculated in several procedures.
We will use Descriptive Statistics.
Select Descriptive Statistics >
Descriptives from the Analyze
menu.
Computing Skewness and Kurtosis to Test for Normality – 2
We add the variables whose
normality we are concerned about.
First, move the variables,
prestg80, age, and educ, to
the Variable(s) list box.
Second, click on the
Options button to
specify the statistics
we want computed.
Computing Skewness and Kurtosis to Test for Normality – 3
Kurtosis and Skewness are
not selected by default, so
we mark their check boxes.
First, mark the check
boxes for Kurtosis and
Skewness.
Second, click
on the Continue
button to close
the dialog box.
Computing Skewness and Kurtosis to Test for Normality – 4
We have finished entering the
specifications we need for the
evaluation of normality.
Click on the OK
button to obtain
the output.
Evaluating the Assumption of Normality - 1
"Occupational prestige score" [prestg80] satisfied the
criteria for a normal distribution. The skewness of the
distribution (.401) was between -1.0 and +1.0 and the
kurtosis of the distribution (-.630) was between -1.0 and
+1.0.
"Age" [age] satisfied the criteria for a normal distribution.
The skewness of the distribution (.595) was between -1.0
and +1.0 and the kurtosis of the distribution (-.351) was
between -1.0 and +1.0.
Evaluating the Assumption of Normality - 2
"Highest year of school completed" [educ] did not
satisfy the criteria for a normal distribution. The
skewness of the distribution (-.137) was between -1.0
and +1.0, but the kurtosis of the distribution (1.246)
fell outside the range from -1.0 to +1.0. The variable
highest year of school completed violates the
assumption of normality. We should either test
transformations and consider removing outliers, or we should
include the violation in the limitations for the analysis.
Marking the Statement for the Assumption of Normality
Since the assumption of normality
is not satisfied for all variables, the
check box is not marked.
The Assumption of Homoscedasticity
The next statement in the problem
focuses on the assumption of
homoscedasticity.
The Univariate General Linear Model only computes
the test of homogeneity of variance for categorical
variables, so this check box will not be marked.
The Assumption of Linearity
The next statement in the problem focuses on the
assumption of linearity. The Univariate General
Linear Model computes a lack of fit test that we can
use for this assumption.
The Test of Linearity
In the lack of fit test, the probability
of the test statistic (F(187, 64) =
1.12, p = .301) was greater than
the diagnostic alpha of .01.
The null hypothesis that "a linear
regression model is appropriate" is
not rejected. A linear model is an
adequate representation of the
relationship among these variables.
Marking the Statement for the Assumption of Linearity
Since the lack of fit test supported
the appropriateness of a linear
relationship, we mark the check box.
The Assumption of Independence of Errors
The next statement in the problem focuses on the
independence of errors. Since the Univariate General
Linear Model does not compute the Durbin-Watson
Statistic, this check box is not marked.
The Assumption of Independence of Variables
The next statement in the problem focuses on the
independence of variables. Since the Univariate General
Linear Model does not compute tolerance or VIF
values, this check box is not marked.
The Overall Relationship between the Dependent and Independent Variables
The next statement in the problem
focuses on the overall relationship
– its existence and strength.
Significance and Strength of the Overall Relationship – 1
The overall relationship between the
independent variables "age" [age] and "highest
year of school completed" [educ] and the
dependent variable "occupational prestige score"
[prestg80] was statistically significant, F(2, 251)
= 46.167, p < .001, partial eta squared = 0.27.
We reject the null hypothesis that all of
the partial slopes (b coefficients) = 0 and
conclude that at least one of the partial
slopes (b coefficients) is not equal to 0.
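Partial eta squared can be recovered from an F statistic and its degrees of freedom with the standard identity F x df_effect / (F x df_effect + df_error). The sketch below applies it to the values reported above as a consistency check; the function name is illustrative.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared implied by an F test: SS_effect / (SS_effect + SS_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Overall test reported above: F(2, 251) = 46.167.
print(round(partial_eta_squared(46.167, 2, 251), 2))  # 0.27
```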
Significance and Strength of the Overall Relationship – 2
If the F-test for the overall relationship
had not been statistically significant, we
would not interpret the effect size.
Applying the criteria for interpreting eta squared
(less than .01 = trivial; .01 up to 0.06 = small;
.06 up to .14 = moderate; .14 or greater =
large), the partial eta squared value of 0.27 was
correctly interpreted as a large (strong) effect.
Marking the Statement for the Overall Relationship
Since the overall relationship was
statistically significant and the effect
size was correctly interpreted, the
check box is marked.
Relationships between the Dependent Variable and Each Individual Independent Variable
The next two statements are possible
interpretations of the individual relationships,
both in terms of significance and direction of
the relationship with the dependent variable.
Interpreting the Individual Relationships – 1
The t-test of the slope coefficient for age (b = .13,
t(251) = 2.88, p = .004) is statistically significant at
alpha = .05. The null hypothesis that the slope = 0
is rejected. The positive sign of the slope indicates a
direct relationship. The statement that "survey
respondents who were older had more prestigious
occupations" is supported.
Interpreting the Individual Relationships – 2
The t-test of the slope coefficient for highest year of
school completed (b = 2.55, t(251) = 9.60, p <
.001) is statistically significant at alpha = .05. The
null hypothesis that the slope = 0 is rejected. The
positive sign of the slope indicates a direct
relationship. The statement that "survey
respondents who completed more years of school
had more prestigious occupations" is supported.
Interpreting the Individual Relationships – 3
Since both individual relationships
were statistically significant and
correctly interpreted, both check
boxes are marked.
Relative Importance of Predictors
The next two statements identify each
independent variable as being the most
important in predicting values of the
dependent variable.
Interpreting the Relative Importance of Predictors
Highest year of school completed was the
most influential predictor because its
partial eta squared (.27) was larger than
the partial eta squared for age (.03).
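For a single slope, partial eta squared follows from its t statistic as t² / (t² + df_error). The sketch below uses that identity with the t values reported for this problem and the error degrees of freedom from the overall F-test (251), then picks the largest value among the significant predictors. The function names are illustrative, and .001 stands in for the reported "p < .001".

```python
def partial_eta_squared_from_t(t_value: float, df_error: int) -> float:
    """Partial eta squared implied by the t test of one slope."""
    return t_value ** 2 / (t_value ** 2 + df_error)


def most_important(predictors, df_error, alpha=0.05):
    """Name of the significant predictor with the largest partial eta squared, or None."""
    significant = {
        name: partial_eta_squared_from_t(t_value, df_error)
        for name, (t_value, p_value) in predictors.items()
        if p_value <= alpha
    }
    if not significant:
        return None
    return max(significant, key=significant.get)


# (t, p) pairs from the Parameter Estimates discussion; .001 is an upper bound for educ.
predictors = {"age": (2.88, 0.004), "educ": (9.60, 0.001)}
print(most_important(predictors, df_error=251))            # educ
print(round(partial_eta_squared_from_t(9.60, 251), 2))     # 0.27
print(round(partial_eta_squared_from_t(2.88, 251), 2))     # 0.03
```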
Marking the Relative Importance of Predictors
Since highest year of school
completed was the most influential
predictor, its check box is marked.
We have now
finished all of
the statements
for this problem.
The Problem Graded in BlackBoard
When this assignment was submitted,
BlackBoard indicated that all marked
answers for this problem were correct,
and we received full credit for the
question.
Logic Diagram for Linear Regression Problems – 1
Level of measurement ok?
  No: Do not mark check box. Mark "Inappropriate application of the statistic". Stop.
  Yes: Ordinal dv? If yes, consider limitation in discussion of findings.
Sample size ok?
  No: Consider limitation in discussion of findings.
  Yes: Mark check box for correct answer.
Normality ok? (skewness and kurtosis +/-1)
  No: Consider limitation in discussion of findings.
  Yes: Mark check box for normality assumption.
Assumption of Homoscedasticity (not tested)
Assumption of Linearity (lack of fit test Sig > α)? Use α for diagnostic statistics.
  No: Consider limitation in discussion of findings.
  Yes: Mark check box for linearity assumption.
Assumption of Independence of Errors (not tested)
Logic Diagram for Linear Regression Problems – 2
Assumption of Independence of Variables (not tested)
Overall relationship (F-test Sig ≤ α)? Use α for statistical tests.
  No: Do not mark check box for overall relationship. Stop.
  Yes: Correct adjective used to describe effect size?
    No: Do not mark check box for overall relationship.
    Yes: Mark check box for overall relationship.
Individual relationship (t-test Sig ≤ α)? Use α for statistical tests.
  No: Do not mark check box for individual relationship.
  Yes: Correct interpretation of direction of relationship?
    No: Do not mark check box for individual relationship.
    Yes: Mark check box for individual relationship.
Relative importance correctly identified?
  No: Do not mark check box for relative importance.
  Yes: Mark check box for relative importance.