Spring2007_MultipleRegressionExamAnswer_brthrate_dv

Name
Spring 2007 Multiple Regression Exam
General Directions
Please read all questions carefully and answer all parts of each question. In answering the
questions on this exam, be specific about the statistical evidence that you use to answer each
question. The point value for each question is listed in square brackets after the question
number.
To facilitate grading the exam, you are asked to copy and paste sections of output from SPSS
that support your answer to the question. The specific questions where you need to include
SPSS output end with the statement “Include the statistical output that substantiates your
answer.”
When you have completed this exam, you will need to print a copy of your answer to hand in.
You will be graded based on the contents of your printed answers. As a backup, please email a
copy to me at jimSchwab@mail.utexas.edu.
At the end of this answer sheet, insert copies of the syntax notes for the regression analyses you
conducted for this exam.
The Research Question
We are exploring the question whether the quality of life discrepancy between rich and poor
nations is likely to increase or not. We think that it will increase if poorer nations have higher
birth rates, leading to increasing populations attempting to sustain themselves with the same
meager resources. Obviously we can think of many other things that could affect the
relationship between wealth and birth rate, but rather than complicate the analysis, we will
assume that these other factors can be represented by taking geographic region (country group)
into account.
The data set for this exam, poverty.sav, contains information on selected demographic
characteristics for 97 different nations in the world.
Using this data set, conduct a standard multiple regression to analyze the relationship between
"live birth rate per 1,000 of population" [brthrate], "gross national product per capita in U.S.
dollars" [gnp], and "country group" [group].

Treat "live birth rate per 1,000 of population" [brthrate], "gross national product per
capita in U.S. dollars" [gnp] as metric variables.

Treat "country group" [group] as a non-metric variable and dummy-code it using
"3=Western Europe, North America, Japan, Australia, New Zealand" (highly
industrialized nations) as the reference category.

Use .05 for alpha for interpreting the statistical relationships and .01 as alpha for
diagnostic tests.
Question 1 [10 points]
a) Based on the research question, which variable should be the dependent variable: "live birth
rate per 1,000 of population" [brthrate] or "gross national product per capita in U.S. dollars"
[gnp]? Explain your choice.
The dependent variable is: "live birth rate per 1,000 of population" [brthrate], because
the problem states that we want to explain differences in birth rates by wealth and other
demographic factors.
b) State the research hypothesis that you will test with the regression analysis of the three
variables. [5 points]
Region of the world and gross national product account for differences in birth rates.
Question 2 [5 points]
State the level of measurement requirements for the analysis. Evaluate the level of measurement
requirements for each of the variables to be included in your analysis.
Standard multiple regression requires the dependent variable and the metric independent
variables be interval level, and the non-metric independent variables be dummy-coded if
they are not dichotomous.
o The metric dependent variable "live birth rate per 1,000 of population"
[brthrate] was interval level, satisfying the requirement for dependent
variables.
o The metric independent variable "Gross National Product per capita in U.S.
dollars" [gnp] was interval level, satisfying the requirement for independent
variables.
o The non-metric independent variable "country group" [group] was nominal
level, but will satisfy the requirement for independent variables when dummy
coded.
Question 3 [20 points]
a) State and define the assumptions of multiple regression. What are the consequences of
violating each assumption?
b) For the variables in this analysis, evaluate each of the regression assumptions. Do we need to
transform variables or remove extreme outliers to satisfy the assumptions? Include the statistical
output that substantiates your answer.
The linear regression of "live birth rate per 1,000 of population" [brthrate] by "Gross
National Product per capita in U.S. dollars" [gnp], "countries which were in Eastern
Europe" [group_1], "countries which were in South America and Mexico"
[group_2], "countries which were in Middle East" [group_4], "countries which were in
Asia" [group_5] and "countries which were in Africa" [group_6] satisfied all of the
regression assumptions (independence of variables, linearity, homogeneity of error
variance, normality of the residuals, and independence of errors):
The tolerance values for all of the independent variables are larger than 0.10: "Gross
National Product per capita in U.S. dollars" [gnp] (0.261), "countries which were in
Eastern Europe" [group_1] (0.392), "countries which were in South America and
Mexico" [group_2] (0.423), "countries which were in Middle East" [group_4] (0.435),
"countries which were in Asia" [group_5] (0.458) and "countries which were in Africa"
[group_6] (0.431). Multicollinearity is not a problem in this regression analysis.
Coefficients(a)
Model 1
(B, Std. Error: unstandardized coefficients; Beta: standardized coefficient;
Zero-order, Partial, Part: correlations; Tolerance, VIF: collinearity statistics)
Variable                                                  B  Std. Error   Beta        t  Sig.  Zero-order  Partial   Part  Tolerance    VIF
(Constant)                                           30.327       1.080          28.075  .000
Country Group=Eastern Europe                        -14.169       1.855  -.565   -7.639  .000        .277    -.640  -.354       .392  2.554
Country Group=South America and Mexico                -.284       1.678  -.012    -.169  .866        .435    -.018  -.008       .423  2.365
Country Group=Middle East                             7.631       1.721   .311    4.435  .000        .526     .436   .205       .435  2.296
Country Group=Asia                                    -.059       1.555  -.003    -.038  .970        .417    -.004  -.002       .458  2.184
Country Group=Africa                                 14.642       1.363   .758   10.740  .000        .826     .761   .498       .431  2.323
Gross National Product per capita in U.S. dollars     -.001        .000  -.307   -3.382  .001       -.629    -.346  -.157       .261  3.834
a. Dependent Variable: Live birth rate per 1,000 of population
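The tolerance screen described above can be sketched in a few lines of Python. This is only an illustrative check, not part of the SPSS output: the dictionary keys reuse the variable names from the exam, the 0.10 cutoff is the one stated in the answer, and the VIF values are recomputed as 1/tolerance from the rounded tolerances, so they may differ from the SPSS VIF column in the third decimal place.

```python
# Tolerance values copied from the Coefficients table (rounded to 3 decimals).
tolerance = {
    "gnp": 0.261,
    "group_1": 0.392,
    "group_2": 0.423,
    "group_4": 0.435,
    "group_5": 0.458,
    "group_6": 0.431,
}

TOLERANCE_CUTOFF = 0.10  # cutoff used in the answer above


def vif(tol):
    """Variance inflation factor is the reciprocal of tolerance."""
    return 1.0 / tol


# Variables at or below the cutoff would signal a multicollinearity problem.
flagged = [name for name, tol in tolerance.items() if tol <= TOLERANCE_CUTOFF]

for name, tol in tolerance.items():
    print(f"{name}: tolerance={tol:.3f}, VIF={vif(tol):.3f}")
print("Flagged for multicollinearity:", flagged if flagged else "none")
```

With these values no variable is flagged, which matches the conclusion that multicollinearity is not a problem.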
In the lack of fit test, the probability of the F test statistic (F=1.52) was p = .345, greater
than the alpha level of significance of 0.01. The null hypothesis that "a linear regression
model is appropriate" is not rejected. The research hypothesis that "a linear regression
model is not appropriate" is not supported by this test. The assumption of linearity is
satisfied.
Lack of Fit Tests
Dependent Variable: Live birth rate per 1,000 of population
Source        Sum of Squares   df   Mean Square   F       Sig.
Lack of Fit   2922.706         79   36.996        1.517   .345
Pure Error    121.935          5    24.387
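As a rough arithmetic check on the lack of fit table, the F statistic is the lack-of-fit mean square divided by the pure-error mean square. The sketch below recomputes it from the sums of squares and degrees of freedom in the table; the p-value is taken as reported by SPSS rather than recomputed, and the function name is only illustrative.

```python
def lack_of_fit_f(ss_lack_of_fit, df_lack_of_fit, ss_pure_error, df_pure_error):
    """Return (MS lack of fit, MS pure error, F) from sums of squares and df."""
    ms_lof = ss_lack_of_fit / df_lack_of_fit
    ms_pe = ss_pure_error / df_pure_error
    return ms_lof, ms_pe, ms_lof / ms_pe


# Values copied from the Lack of Fit Tests table.
ms_lof, ms_pe, f_stat = lack_of_fit_f(2922.706, 79, 121.935, 5)

ALPHA_DIAGNOSTIC = 0.01  # alpha for diagnostic tests, as stated in the exam
reported_p = 0.345       # Sig. value reported by SPSS, not recomputed here

# Linearity is retained when the lack of fit test is NOT significant.
linearity_satisfied = reported_p > ALPHA_DIAGNOSTIC

print(f"MS lack of fit = {ms_lof:.3f}, MS pure error = {ms_pe:.3f}, F = {f_stat:.3f}")
print("Linearity assumption satisfied:", linearity_satisfied)
```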
The homogeneity of error variance is tested with the Breusch-Pagan test. For this
analysis, the Breusch-Pagan statistic was 8.672. The probability of the statistic was p =
.193, which was greater than the alpha level for diagnostic tests (alpha = .010). The null
hypothesis that "the variance of the residuals is the same for all values of the
independent variable" is not rejected. The research hypothesis that "the variance of the
residuals is different for some values of the independent variable" is not supported. The
assumption of homogeneity of error variance is satisfied.
Homoscedasticity: Test Statistics
Test            Statistic   df   Sig.
Breusch-Pagan   8.6715      6    .1929
Koenker         8.9282      6    .1777
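The Sig. values in this table are upper-tail chi-square probabilities with 6 degrees of freedom. A minimal sketch of that calculation follows, using the closed form that holds when the degrees of freedom are even, so only the standard library is needed. The function name is hypothetical, and this is a check on the reported output rather than the procedure the SPSS script itself uses.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability for even df.

    Uses P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0   # the i = 0 term of the sum
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


ALPHA_DIAGNOSTIC = 0.01  # alpha for diagnostic tests, as stated in the exam

p_breusch_pagan = chi2_sf_even_df(8.6715, 6)
p_koenker = chi2_sf_even_df(8.9282, 6)

print(f"Breusch-Pagan p = {p_breusch_pagan:.4f}")
print(f"Koenker p = {p_koenker:.4f}")
print("Homogeneity of variance retained:", p_breusch_pagan > ALPHA_DIAGNOSTIC)
```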
Regression analysis assumes that the errors or residuals are normally distributed. The
Shapiro-Wilk test of studentized residuals yielded a test statistic of 0.987, which had
a probability of p = .481, greater than the alpha level for diagnostic tests (alpha =
.010). The null hypothesis that "the distribution of the residuals is normally distributed"
is not rejected. The research hypothesis that "the distribution of the residuals is not
normally distributed" is not supported. The assumption of normality of errors is
satisfied.
Tests of Normality
                       Kolmogorov-Smirnov(a)       Shapiro-Wilk
                       Statistic   df   Sig.       Statistic   df   Sig.
Studentized Residual   .065        91   .200*      .987        91   .481
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
Regression analysis assumes that the errors (residuals) are independent and there is no
serial correlation. No serial correlation implies that the size of the residual for one case
has no impact on the size of the residual for the next case. The Durbin-Watson statistic
tests for the presence of serial correlation among the residuals. The value of the
Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not
correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is
1.50 - 2.50. The Durbin-Watson statistic for this problem is 2.11 which falls within the
acceptable range from 1.50 to 2.50. The analysis satisfies the assumption of
independence of errors.
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .905(a)   .820       .807                6.020                        2.110
a. Predictors: (Constant), Gross National Product per capita in U.S.
dollars, Country Group=Middle East, Country Group=Asia, Country
Group=Africa, Country Group=South America and Mexico, Country
Group=Eastern Europe
b. Dependent Variable: Live birth rate per 1,000 of population
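For readers who want to see how the Durbin-Watson statistic is formed, the sketch below computes it as the sum of squared differences between successive residuals divided by the sum of squared residuals, then applies the 1.50 to 2.50 rule of thumb quoted above. The toy residuals are made up purely for illustration; they are not the residuals from the exam data, and the function names are hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def independence_ok(dw, low=1.50, high=2.50):
    """Rule of thumb from the answer: acceptable range is 1.50 to 2.50."""
    return low <= dw <= high


# Hypothetical residuals that alternate in sign (strong negative serial correlation).
toy_residuals = [1.0, -1.0, 1.0, -1.0]
toy_dw = durbin_watson(toy_residuals)
print("Toy Durbin-Watson:", toy_dw, "acceptable:", independence_ok(toy_dw))

# The value reported in the Model Summary table for this analysis.
reported_dw = 2.110
print("Reported Durbin-Watson acceptable:", independence_ok(reported_dw))
```

The alternating toy residuals push the statistic toward 4, outside the acceptable range, while the reported value of 2.110 falls inside it.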
The model satisfied the assumptions of multiple regression without transforming any
variables or excluding any countries.
Question 4 [5 points]
Do we have sufficient data to meet the sample size requirements for the proposed analysis?
Include the statistical output that substantiates your answer.
The analysis included 6 independent variables: 1 for the covariate ("Gross National
Product per capita in U.S. dollars" [gnp]) plus 5 dummy-coded variables for the factor
"country group" [group]. The number of cases available for the analysis was 91, not
satisfying the requirement for 111 cases based on the rule of thumb that the required
number of cases should be the larger of the number of independent variables x 8 + 50
(6 x 8 + 50 = 98) or the number of independent variables + 105 (6 + 105 = 111). We
should consider mentioning the sample size issue as a limitation of the analysis.
Descriptive Statistics
Variable                                            Mean       Std. Deviation   N
Live birth rate per 1,000 of population             29.46      13.699           91
Country Group=Eastern Europe                        -.109890   ********         91
Country Group=South America and Mexico              -.076923   ********         91
Country Group=Middle East                           -.098901   ********         91
Country Group=Asia                                  -.054945   ********         91
Country Group=Africa                                .0879121   ********         91
Gross National Product per capita in U.S. dollars   5741.25    8093.680         91
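The sample size rule of thumb used in this answer can be written as a one-line function. The sketch below uses the two formulas exactly as stated in the answer (8 cases per independent variable plus 50, and the number of independent variables plus 105); the function name is hypothetical.

```python
def required_cases(n_predictors):
    """Larger of (8 x predictors + 50) and (predictors + 105), as stated in the answer."""
    return max(8 * n_predictors + 50, n_predictors + 105)


n_predictors = 6      # gnp plus 5 dummy-coded variables for country group
available_cases = 91  # N from the Descriptive Statistics table

needed = required_cases(n_predictors)
print(f"Required cases: {needed}, available: {available_cases}")
print("Sample size requirement met:", available_cases >= needed)
```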
Question 5 [10 points]
a. Interpret the overall relationship between the dependent variable and the independent
variables. Include the statistical output that substantiates your answer.
The relationship between "live birth rate per 1,000 of population" and the combination
of "Gross National Product per capita in U.S. dollars" and "country group" was
statistically significant (F(6, 84) = 63.66, p < .001). The null hypothesis that "all of the
partial slopes (b coefficients) = 0" is rejected, supporting the research hypothesis that "at
least one of the partial slopes (b coefficients) is not equal to 0".
Applying Cohen's criteria for effect size (less than .01 = trivial; .01 up to .30 = weak;
.30 up to .50 = moderately strong; .50 or greater = strong), the relationship was
characterized as strong (Multiple R = .905).
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .905(a)   .820       .807                6.020                        2.110
a. Predictors: (Constant), Gross National Product per capita in U.S.
dollars, Country Group=Middle East, Country Group=Asia, Country
Group=Africa, Country Group=South America and Mexico, Country
Group=Eastern Europe
b. Dependent Variable: Live birth rate per 1,000 of population
ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   13845.30         6    2307.549      63.664   .000(a)
Residual     3044.641         84   36.246
Total        16889.94         90
a. Predictors: (Constant), Gross National Product per capita in U.S. dollars, Country
Group=Middle East, Country Group=Asia, Country Group=Africa, Country
Group=South America and Mexico, Country Group=Eastern Europe
b. Dependent Variable: Live birth rate per 1,000 of population
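The figures in the Model Summary and ANOVA tables are tied together by simple arithmetic, which the following sketch reproduces from the sums of squares and degrees of freedom. It is only a consistency check on the reported output, assuming the standard definitions of F, R Square, Adjusted R Square, and the standard error of the estimate; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table.
ss_regression, df_regression = 13845.30, 6
ss_residual, df_residual = 3044.641, 84

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

# R Square is the share of total variation explained by the model.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total
multiple_r = math.sqrt(r_squared)

# Adjusted R Square penalizes for the number of predictors.
n_cases = df_regression + df_residual + 1
adjusted_r_squared = 1 - (1 - r_squared) * (n_cases - 1) / df_residual

# Standard error of the estimate is the square root of the residual mean square.
std_error_estimate = math.sqrt(ms_residual)

print(f"F({df_regression}, {df_residual}) = {f_stat:.3f}")
print(f"R = {multiple_r:.3f}, R Square = {r_squared:.3f}")
print(f"Adjusted R Square = {adjusted_r_squared:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.3f}")
```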
Question 6 [25 points]
a. List the independent variables that had a statistically significant relationship to the dependent
variable and interpret each relationship. Include the statistical output that substantiates your
answer.
Countries that had a higher gross national product per capita had a lower live birth rate.
The individual relationship between the independent variable "Gross National Product
per capita in U.S. dollars" [gnp] and the dependent variable "live birth rate per 1,000 of
population" [brthrate] was statistically significant, β = -.307, t(84) = -3.38, p = .001. We
reject the null hypothesis that the partial slope (b coefficient) for the variable "Gross
National Product per capita in U.S. dollars" = 0 and conclude that the partial slope (b
coefficient) for the variable "Gross National Product per capita in U.S. dollars" is not
equal to 0. The negative sign of the b coefficient (-0.001) means that higher values of
"Gross National Product per capita in U.S. dollars" were associated with lower values of
"live birth rate per 1,000 of population".
The statement that "countries which were in Eastern Europe had a lower live birth rate
compared to the average for all countries" is correct. The individual relationship
between the independent variable "countries which were in Eastern Europe" [group_1]
and the dependent variable "live birth rate per 1,000 of population" [brthrate] was
statistically significant, β = -.565, t(84) = -7.64, p < .001. We reject the null hypothesis
that the partial slope (b coefficient) for the variable "countries which were in Eastern
Europe" = 0 and conclude that the partial slope (b coefficient) for the variable "countries
which were in Eastern Europe" is not equal to 0. The negative sign of the b coefficient
(-14.169) means that "countries which were in Eastern Europe" had a lower live birth rate
compared to the average for all countries.
The statement that "countries which were in Middle East had a higher live birth rate
compared to the average for all countries" is correct. The individual relationship
between the independent variable "countries which were in Middle East" [group_4] and
the dependent variable "live birth rate per 1,000 of population" [brthrate] was
statistically significant, β = .311, t(84) = 4.43, p < .001. We reject the null hypothesis
that the partial slope (b coefficient) for the variable "countries which were in Middle
East" = 0 and conclude that the partial slope (b coefficient) for the variable "countries
which were in Middle East" is not equal to 0. The positive sign of the b coefficient
(7.631) means that "countries which were in Middle East" had a higher live birth rate
compared to the average for all countries.
The statement that "countries which were in Africa had a higher live birth rate
compared to the average for all countries" is correct. The individual relationship
between the independent variable "countries which were in Africa" [group_6] and the
dependent variable "live birth rate per 1,000 of population" [brthrate] was statistically
significant, β = .758, t(84) = 10.74, p < .001. We reject the null hypothesis that the
partial slope (b coefficient) for the variable "countries which were in Africa" = 0 and
conclude that the partial slope (b coefficient) for the variable "countries which were in
Africa" is not equal to 0. The positive sign of the b coefficient (14.642) means that
"countries which were in Africa" had a higher live birth rate compared to the average for
all countries.
Coefficients(a)
Model 1
(B, Std. Error: unstandardized coefficients; Beta: standardized coefficient;
Zero-order, Partial, Part: correlations; Tolerance, VIF: collinearity statistics)
Variable                                                  B  Std. Error   Beta        t  Sig.  Zero-order  Partial   Part  Tolerance    VIF
(Constant)                                           30.327       1.080          28.075  .000
Country Group=Eastern Europe                        -14.169       1.855  -.565   -7.639  .000        .277    -.640  -.354       .392  2.554
Country Group=South America and Mexico                -.284       1.678  -.012    -.169  .866        .435    -.018  -.008       .423  2.365
Country Group=Middle East                             7.631       1.721   .311    4.435  .000        .526     .436   .205       .435  2.296
Country Group=Asia                                    -.059       1.555  -.003    -.038  .970        .417    -.004  -.002       .458  2.184
Country Group=Africa                                 14.642       1.363   .758   10.740  .000        .826     .761   .498       .431  2.323
Gross National Product per capita in U.S. dollars     -.001        .000  -.307   -3.382  .001       -.629    -.346  -.157       .261  3.834
a. Dependent Variable: Live birth rate per 1,000 of population
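The t statistics in the Coefficients table are each coefficient divided by its standard error, and the partial correlations follow from t and the residual degrees of freedom. The sketch below recomputes both for the five dummy-coded variables. Two points are assumptions of this illustration: the critical value of about 1.989 for a two-tailed test at alpha = .05 with 84 degrees of freedom is hard-coded as an approximation, and gnp is left out because its B and standard error are rounded to -.001 and .000 in the output, which is too coarse to recompute t.

```python
import math

# (B, Std. Error) pairs copied from the Coefficients table.
coefficients = {
    "group_1": (-14.169, 1.855),  # Eastern Europe
    "group_2": (-0.284, 1.678),   # South America and Mexico
    "group_4": (7.631, 1.721),    # Middle East
    "group_5": (-0.059, 1.555),   # Asia
    "group_6": (14.642, 1.363),   # Africa
}

DF_RESIDUAL = 84
T_CRITICAL = 1.989  # approximate two-tailed critical t for alpha = .05, df = 84


def t_statistic(b, std_error):
    return b / std_error


def partial_correlation(t, df):
    """Partial correlation recovered from a coefficient's t statistic."""
    return t / math.sqrt(t * t + df)


results = {}
for name, (b, se) in coefficients.items():
    t = t_statistic(b, se)
    results[name] = {
        "t": t,
        "partial": partial_correlation(t, DF_RESIDUAL),
        "significant": abs(t) > T_CRITICAL,
    }
    print(f"{name}: t = {t:.3f}, partial r = {results[name]['partial']:.3f}, "
          f"significant = {results[name]['significant']}")

significant = sorted(name for name, r in results.items() if r["significant"])
```

Small differences from the table in the third decimal place are expected, because B and the standard error are themselves rounded.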
b. List the independent variables that did not have a statistically significant relationship to the
dependent variable and state the statistical criteria upon which you based your decision. Include
the statistical output that substantiates your answer.
The statement that "countries which were in South America and Mexico had a lower
live birth rate compared to the average for all countries" is not correct. The individual
relationship between the independent variable "countries which were in South America
and Mexico" [group_2] and the dependent variable "live birth rate per 1,000 of
population" [brthrate] was not statistically significant, β = -.012, t(84) = -.17, p = .866.
We are not able to reject the null hypothesis that the partial slope (b coefficient) for the
variable "countries which were in South America and Mexico" = 0.
The statement that "countries which were in Asia had a lower live birth rate compared
to the average for all countries" is not correct. The individual relationship between the
independent variable "countries which were in Asia" [group_5] and the dependent
variable "live birth rate per 1,000 of population" [brthrate] was not statistically
significant, β = -.003, t(84) = -.04, p = .970. We are not able to reject the null hypothesis
that the partial slope (b coefficient) for the variable "countries which were in Asia" = 0.
Coefficients(a)
Model 1
(B, Std. Error: unstandardized coefficients; Beta: standardized coefficient;
Zero-order, Partial, Part: correlations; Tolerance, VIF: collinearity statistics)
Variable                                                  B  Std. Error   Beta        t  Sig.  Zero-order  Partial   Part  Tolerance    VIF
(Constant)                                           30.327       1.080          28.075  .000
Country Group=Eastern Europe                        -14.169       1.855  -.565   -7.639  .000        .277    -.640  -.354       .392  2.554
Country Group=South America and Mexico                -.284       1.678  -.012    -.169  .866        .435    -.018  -.008       .423  2.365
Country Group=Middle East                             7.631       1.721   .311    4.435  .000        .526     .436   .205       .435  2.296
Country Group=Asia                                    -.059       1.555  -.003    -.038  .970        .417    -.004  -.002       .458  2.184
Country Group=Africa                                 14.642       1.363   .758   10.740  .000        .826     .761   .498       .431  2.323
Gross National Product per capita in U.S. dollars     -.001        .000  -.307   -3.382  .001       -.629    -.346  -.157       .261  3.834
a. Dependent Variable: Live birth rate per 1,000 of population
Question 7 [15 points]
Note: it is not necessary to do an analysis of covariance to answer this question, but you
may do so if it is helpful in formulating your answer.
a) The relationship between two metric variables and a non-metric variable could also be tested
with an analysis of covariance. Would an analysis of covariance have answered the same
research question that we answered with standard multiple regression?
In analysis of covariance the metric covariate is treated as a control variable, so we
would have answered the question what were the differences in birth rates among
geographic regions, controlling for differences in gross national product. This is a
different question than the one we answered with standard multiple regression.
b) How do the results of an analysis of covariance differ from the results of a standard multiple
regression? [Hint: what does an analysis of covariance always test for that is not usually tested
in standard multiple regression?]
An analysis of covariance tests for an interaction between the covariate and the factor.
c) How might this difference between analysis of covariance and standard multiple regression
affect our interpretation?
If the interaction effect is significant, we do not interpret the main effects on their
own; the relationships must be interpreted in light of the interaction.
Question 8 [10 points]
a) The script uses deviation (or effects) coding to create dummy-coded variables. If we used
indicator coding to create dummy-coded variables instead, how would the interpretation of
individual relationships differ?
The comparison for deviation coding is a comparison of the group mean to the mean
across all groups. The comparison for indicator coding would compare the difference in
means between the dummy group and the reference category.
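The difference between the two coding schemes is easiest to see in the codes assigned to the reference category. The sketch below is a hypothetical illustration, not the script used for the exam: it builds both sets of codes for the six country groups, with group 3 as the reference category and variable names following the group_1 through group_6 convention used in the regression.

```python
GROUPS = [1, 2, 3, 4, 5, 6]
REFERENCE = 3  # Western Europe, North America, Japan, Australia, New Zealand


def indicator_code(group):
    """Indicator coding: 1 for the named group, 0 otherwise (reference is all 0)."""
    return {f"group_{g}": int(group == g) for g in GROUPS if g != REFERENCE}


def deviation_code(group):
    """Deviation (effects) coding: same as indicator coding, except the
    reference category is coded -1 on every variable."""
    if group == REFERENCE:
        return {f"group_{g}": -1 for g in GROUPS if g != REFERENCE}
    return indicator_code(group)


for group in GROUPS:
    print(group, "indicator:", indicator_code(group))
    print(group, "deviation:", deviation_code(group))
```

Because the reference category is coded 0 under indicator coding, each coefficient contrasts a group with the reference category; coding it -1 under deviation coding shifts the contrast to the mean across all groups.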
b) Would we be likely to find that the same relationships were statistically significant? Why or
why not?
It is not likely that the same relationships would be significant, because we are not
comparing the same combinations of means.
Syntax notes:
Notes
Output Created: 16-MAR-2007 13:01:38
Comments:
Input
  Data: C:\2007_Spring_SW388R7\RegressionExam\poverty.sav
  Active Dataset: DataSet1
  Filter: <none>
  Weight: <none>
  Split File: <none>
  N of Rows in Working Data File: 97
Missing Value Handling
  Definition of Missing: User-defined missing values are treated as missing.
  Cases Used: Statistics are based on cases with no missing values for any variable used.
Syntax
  REGRESSION
    /DESCRIPTIVES MEAN STDDEV CORR SIG N
    /MISSING LISTWISE
    /STATISTICS COEFF OUTS R ANOVA COLLIN TOL ZPP
    /CRITERIA=PIN(.05) POUT(.10)
    /NOORIGIN
    /DEPENDENT brthrate
    /METHOD=ENTER gnp group_1 group_2 group_4 group_5 group_6
    /RESIDUALS DURBIN .
Resources
  Elapsed Time: 0:00:00.02
  Memory Required: 3252 bytes
  Additional Memory Required for Residual Plots: 0 bytes