Lab 11

Stat401E
Fall 2010
1. Use the 1984 NORC data to demonstrate some aspects of
multiple regression.
a. Show that R-squared from the regression of "rincome" on "educ"
and "prestige" equals the coefficient of determination between
"rincome" and the predicted values from this regression equation.
b. Show that the coefficient of determination between "rincome" and
the residuals from the regression of "rincome" on "educ" and
"prestige" equals 1 - R-squared from this same regression.
c. Show that unlike the correlation between "educ" and "prestige",
there is a correlation of zero between "educ" and "presadj" (i.e.,
between "educ" and "prestige after it has been adjusted for its
covariance with educ").
d. Show that the partial slope between "rincome" and "prestige"
(from the regression of "rincome" on "prestige" and "educ") equals
the slope between "rincome" and "presadj" (as defined in part c).
e. Write and run your own SPSS (or R or SAS) program in which you
show that the partial slope between "rincome" and "educ" (from the
regression of "rincome" on "prestige" and "educ") equals the slope
between "rincome" and "educadj" (i.e., between "rincome" and "educ
after it has been adjusted for its covariance with prestige"). Be
sure to include a printed copy of the program (i.e., the SPSS, R, or
SAS code, not just its output) with your homework.
Below you are given a program that provides you with everything
necessary to answer parts a to d. Doing part e will help ensure that
you understand how the program works.
recode rincome (1=500)(2=2000)(3=3500)(4=4500)(5=5500)(6=6500)
(7=7500)(8=9000)(9=12500)(10=17500)(11=22500)(12=35000)(13=99).
select if (rincome ne 99).
regression variables=rincome,educ,prestige/dep=rincome/enter.
compute yhat=-2126.829488+(236.303466*prestige)+(634.841688*educ).
compute e=rincome-yhat.
pearson corr rincome with yhat,e.
regression variables=prestige,educ/dep=prestige/enter.
compute presadj=prestige-(2.741733*educ).
pearson corr prestige,presadj with educ.
regression variables=rincome,presadj/dep=rincome/enter.
NOTE: Check that your output from the above program is correct.
Every number typed into the program (e.g., -2126.829488, 2.741733,
etc.) should also appear on your output. If one does not, there
is an error in your program. In this case, delete the output,
then correct and rerun the program. Be sure that you
understand what each line in the program does. For example, the
"select if" statement is required in the above program to
ensure that the same subjects (i.e., those without the missing-data
value of 99 on "rincome") are used in every analysis, including
analyses from which the "rincome" variable is excluded (e.g., the
regression of "prestige" on "educ").
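The identities in parts a to d can also be checked on simulated data
without copying coefficients by hand. The R sketch below is only an
illustration: y, x1, and x2 are made-up stand-ins for "rincome",
"prestige", and "educ" (they are not the NORC data), and the sketch uses
fitted() and resid() in place of the hard-coded coefficients in the
program above.

```r
# Illustrative check of parts a-d on simulated data (NOT the NORC data).
# y, x1, x2 are hypothetical stand-ins for rincome, prestige, and educ.
set.seed(401)
n  <- 500
x2 <- rnorm(n, mean = 12, sd = 3)            # stand-in for educ
x1 <- 10 + 2.5 * x2 + rnorm(n, sd = 10)      # stand-in for prestige
y  <- 1000 + 200 * x1 + 600 * x2 + rnorm(n, sd = 5000)

full <- lm(y ~ x1 + x2)
r2   <- summary(full)$r.squared
yhat <- fitted(full)
e    <- resid(full)

# a. R-squared equals the squared correlation between y and yhat
part_a <- c(r2, cor(y, yhat)^2)

# b. squared correlation between y and the residuals equals 1 - R-squared
part_b <- c(1 - r2, cor(y, e)^2)

# c. adjust x1 for its covariance with x2 (as the program does for presadj);
#    x1 is correlated with x2, but the adjusted variable is not
b12    <- unname(coef(lm(x1 ~ x2))["x2"])
x1adj  <- x1 - b12 * x2
part_c <- c(cor(x1, x2), cor(x1adj, x2))

# d. partial slope of x1 equals the slope from regressing y on x1adj alone
part_d <- c(unname(coef(full)["x1"]), unname(coef(lm(y ~ x1adj))["x1adj"]))

# Each row of the first table should hold two equal numbers;
# the second element of part_c should be zero up to rounding error.
print(rbind(part_a, part_b, part_d))
print(part_c)
```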
2. Use a stepwise regression procedure (or, if using R, obtain 2
regressions as per the R-code provided below) and the 1991 General
Social Survey to examine the relative effects of occupational prestige
(prestg80) and years of education (educ) on income (rincome). To do
this you may wish to run the following one-line program:
regression descriptives=corr,sig,var/vars=rincome,prestg80,educ
/dep=rincome/stepwise.
Note that although both independent variables are highly correlated
with income, only one enters the regression equation at the .05 level
of significance (the default significance level in SPSS). The other
does not increment R-squared by a significant amount. Also note that
“rincome” should NOT be recoded in the above program!
a. At the end of the output is a section labeled, “Excluded
Variables.” In this section under “Beta in” is provided the
standardized partial slope associated with “educ” that one would
have if “rincome” were regressed on both “prestg80” and “educ.” Why
is this slope so much smaller than the zero-order correlation
between “rincome” and “educ” that is given at the beginning of the
output? (Hint: You may wish to calculate the standardized partial
slope yourself using the correlation coefficients and standard
deviations/variances listed at the beginning of your output.)
b. Using the variances (or standard deviations) and correlation
coefficients provided at the beginning of the output, find standard
errors for two unstandardized slopes: (1) the unstandardized slope
associated with “educ” from the regression of “rincome” on “educ”
and (2) the unstandardized slope associated with “educ” from the
regression of “rincome” on “educ” and “prestg80.” Why is the second
standard error larger than the first?
c. What do your answers to the questions raised in parts a and b
indicate about the possible consequences of collinearity in
regression models?
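The hand calculations suggested in parts a and b can be sketched as
small R functions. These use the standard formulas for a regression
with two independent variables. The function names are hypothetical,
and the data are simulated for illustration (they are not the GSS
values from your output). The sketch compares each formula with the
corresponding lm() result.

```r
# Two-predictor regression quantities computed from summary statistics.
# Function names and the simulated data are hypothetical illustrations.

# Standardized partial slope of x1 (y regressed on x1 and x2)
beta_partial <- function(r_y1, r_y2, r_12) {
  (r_y1 - r_y2 * r_12) / (1 - r_12^2)
}

# Standard error of the unstandardized slope of x1 (y regressed on x1 only)
se_simple <- function(r_y1, s_y, s_1, n) {
  (s_y / s_1) * sqrt((1 - r_y1^2) / (n - 2))
}

# Standard error of the unstandardized slope of x1 (y regressed on x1 and x2)
se_partial <- function(r_y1, r_y2, r_12, s_y, s_1, n) {
  r2 <- (r_y1^2 + r_y2^2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12^2)
  (s_y / s_1) * sqrt((1 - r2) / ((n - 3) * (1 - r_12^2)))
}

# Check the formulas against lm() on simulated data
set.seed(11)
n  <- 300
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n, sd = 0.7)
y  <- 1 + 0.5 * x1 + 0.2 * x2 + rnorm(n)

r_y1 <- cor(y, x1); r_y2 <- cor(y, x2); r_12 <- cor(x1, x2)
s_y  <- sd(y);      s_1  <- sd(x1)

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + x2)

by_formula <- c(beta = beta_partial(r_y1, r_y2, r_12),
                se1  = se_simple(r_y1, s_y, s_1, n),
                se2  = se_partial(r_y1, r_y2, r_12, s_y, s_1, n))
by_lm <- c(beta = unname(coef(fit2)["x1"]) * s_1 / s_y,
           se1  = summary(fit1)$coefficients["x1", "Std. Error"],
           se2  = summary(fit2)$coefficients["x1", "Std. Error"])

# The two rows should agree column by column
print(rbind(by_formula, by_lm))
```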
3. This question deals with two theories from the field of
"gender identity development" which explain why children
develop feminine or masculine traits:
Modeling theory: According to modeling theory, children
identify with their parents (among other people). The more
feminine a child's mother, the more feminine will be the
child. The more masculine its father, the more masculine it
becomes. In a sentence, the theory argues that children become feminine
or masculine by "modeling" (i.e., imitating) their parents.
Developmental theory: In 1966 Kohlberg argued that modeling only works
until a child gains a concept of itself as male or female. Once it
gains a self-concept as male or female, the child will only imitate the
parent of its same sex. It will "disassociate with" (i.e., "act
unlike") the other-sex parent. (Note: Kohlberg's theory is based on the
assumption that masculine and feminine behaviors are opposites. Thus a
daughter's masculine behaviors may result from her attempts to act
unlike an effeminate father.)
Research since Kohlberg's article has indicated that a child develops
its self-concept as male or female at around the age of 5.
Your research investigates gender identity development in girls (NOT
boys). You randomly sample 65 Des Moines two-parent one-child families
with a daughter between the ages of 2 and 8 years old. Both parents
from each of these families are administered a questionnaire that, in
addition to an "age" variable (measuring the daughter's age), yields
measures of mother's femininity (mfem), father's masculinity (fmasc),
and daughter's femininity (dfem). High scores on "mfem" and "dfem"
indicate high femininity. High scores on "fmasc" indicate high
masculinity.
(NOTE: In understanding these theories, be sure to keep in mind that a
child can be very feminine and not conceive of itself as female. In
addition, a child can have a female self-concept and not be very
feminine at all.)
a. Indicate the regression model that would allow you to evaluate
modeling theory. (Do NOT attempt to find parameter estimates [i.e.,
no numbers, please] in this part or in part b.) Explain in words
what modeling theory suggests you will find when you estimate this
model.
b. Indicate the regression model that would allow you to evaluate
developmental theory and that would take into account insights
gleaned since Kohlberg's article. (Evaluating this model requires
creating a new variable, "myvar". Be sure to explain
how "myvar" is derived from other variables mentioned
above. Also explain in words how "myvar" would allow you to
evaluate developmental theory.)
After a "compute" statement in which you create "myvar", you
use the following SPSS statements in a computer run:
regression vars=mfem,fmasc,dfem/dep=dfem/enter.
regression vars=mfem,fmasc,dfem,age/dep=dfem/enter.
regression vars=mfem,fmasc,dfem,age,myvar/dep=dfem/enter.
Parts of the resulting regression output are as follows:
On the first regression:
Model Summary
Model    R Square
1        .23846a
a. Predictors: (Constant), MFEM, FMASC

Coefficients a
                  Unstandardized Coefficients
Model             B             Std. Error
1  (Constant)     -1.98787
   MFEM           17.761        5.639
   FMASC          -10.048       11.768
a. Dependent Variable: DFEM
On the second regression:
Model Summary
Model    R Square
1        .25723a
a. Predictors: (Constant), MFEM, FMASC, AGE

Coefficients a
                  Unstandardized Coefficients
Model             B             Std. Error
1  (Constant)     -.38658
   MFEM           18.265        5.325
   FMASC          -11.159       8.767
   AGE            -3.002        2.418
a. Dependent Variable: DFEM
On the third regression:
Model Summary
Model    R Square
1        .33846a
a. Predictors: (Constant), MFEM, FMASC, MYVAR, AGE

Coefficients a
                  Unstandardized Coefficients
Model             B             Std. Error
1  (Constant)     50.27547
   MFEM           17.856        5.014
   FMASC          -12.268       3.879
   MYVAR          21.572        7.948
   AGE            -2.993        1.967
a. Dependent Variable: DFEM
c. An F-test is used to compare hierarchically related models
according to their relative parsimony. One model is more
parsimonious than another only if in comparison to the other it
explains either virtually as much variance but with fewer
independent variables, or significantly more variance with
additional independent variables. Identify the one model that is
the most parsimonious of the three regression models given above.
Use the .05 level of significance throughout and show your work.
(Hint: Remember that parsimony is a “two-tailed concept.”)
d. Which theory is supported by the data? (Explain your answer by
discussing the theoretical interpretation of each significant
[again, at the .05 significance level] partial slope in the
best-fitting regression model.)
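The F-test for comparing two hierarchically related models, as
described in part c, can be sketched in R as follows. The function
name "nested_f" and the input values in the example call are
hypothetical. They are not taken from the output above, so substitute
the R-squared values, numbers of independent variables, and sample
size for the pair of models you are comparing.

```r
# F-test for the increment in R-squared between two nested regression models.
# k_reduced and k_full are the numbers of independent variables in each model
# (not counting the constant); n is the sample size.
nested_f <- function(r2_reduced, r2_full, k_reduced, k_full, n, alpha = 0.05) {
  df1 <- k_full - k_reduced        # number of added independent variables
  df2 <- n - k_full - 1            # error degrees of freedom in the full model
  f   <- ((r2_full - r2_reduced) / df1) / ((1 - r2_full) / df2)
  c(F        = f,
    df1      = df1,
    df2      = df2,
    critical = qf(1 - alpha, df1, df2),
    p        = pf(f, df1, df2, lower.tail = FALSE))
}

# Example call with made-up values (hypothetical, for illustration only)
res <- nested_f(r2_reduced = 0.30, r2_full = 0.36,
                k_reduced = 2, k_full = 3, n = 100)
print(res)
```

The increment in R-squared is significant at the chosen level when F
exceeds the critical value (equivalently, when p is below alpha).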
Below please find R and SAS code for problems 1 and 2:
# R
# Directions for problem 1:
# Copy the below R code into the "R Editor" window (accessed
# by selecting "New script" under the "File" pull-down menu),
# swipe the code, and press F5.
# Code:
# read lab5data.txt into "gss"
gss<-read.table('http://www.public.iastate.edu/~carlos/401/labs/lab5data.txt')
# read gss into gssnew without missing data codes for rincome
# (var2=13 [refused] or 99), prestige (var6=0), and educ (var7=99)
gssnew<-gss[gss[,2]!=13 & gss[,2]!=99 & gss[,6]!=0 & gss[,7]!=99,]
# assign new values to rincome so that the data are in dollar units
gssnew[gssnew[,2]==1,2]=500
gssnew[gssnew[,2]==2,2]=2000
gssnew[gssnew[,2]==3,2]=3500
gssnew[gssnew[,2]==4,2]=4500
gssnew[gssnew[,2]==5,2]=5500
gssnew[gssnew[,2]==6,2]=6500
gssnew[gssnew[,2]==7,2]=7500
gssnew[gssnew[,2]==8,2]=9000
gssnew[gssnew[,2]==9,2]=12500
gssnew[gssnew[,2]==10,2]=17500
gssnew[gssnew[,2]==11,2]=22500
gssnew[gssnew[,2]==12,2]=35000
# Results
# Regression 1: rincome (var2) on prestige (var6) and education (var7)
reg1<-lm(gssnew[,2]~gssnew[,6]+gssnew[,7])
summary(reg1)
# compute Y-hat and e, then find correlations between rincome and each
# of these
yhat<- -2126.829488 + (236.303466*gssnew[,6]) + (634.841688*gssnew[,7])
e<- gssnew[,2] - yhat
cor(gssnew[,2],yhat)
cor(gssnew[,2],e)
# Regression 2: prestige (var6) on education (var7)
reg2<-lm(gssnew[,6]~gssnew[,7])
summary(reg2)
# compute presadj=var6-(b*var7), then correlate educ with
# both it and prestige
presadj<- gssnew[,6] - (2.741733*gssnew[,7])
cor(presadj,gssnew[,7])
cor(gssnew[,6],gssnew[,7])
# Regression 3: rincome (var2) on presadj
reg3<-lm(gssnew[,2]~presadj)
summary(reg3)
#
# Directions for problem 2:
# Copy the below R code into the "R Editor" window (accessed by
# selecting "New script" under the "File" pull-down menu), swipe the
# code, and press F5.
# Code:
# read lab11data.txt into "gss"
gss<-read.table('http://www.public.iastate.edu/~carlos/401/labs/lab11data.txt')
# remove missing values from gss and obtain sample size
gss<-gss[gss[,1]!=0 & gss[,1]!=98 & gss[,1]!=99 & gss[,2]!=0 &
gss[,3]!=99,]
n<- length(gss[,1])
# Results
# obtain correlation matrix, associated 2-tailed P-values, and standard
# deviations for all 3 variables in gss
c<- cor(gss)
c
z<- (c/(sqrt((1-(c^2))/(n-2))))
p<- 2*(1 - pnorm(abs(z)))
p
s1<- sd(gss[,1])
s1
s2<- sd(gss[,2])
s2
s3<- sd(gss[,3])
s3
# Regression 4: rincome (var1) on prestige (var2)
reg4<-lm(gss[,1]~gss[,2])
anova(reg4)
summary(reg4)
# Regression 5: rincome (var1) on prestige (var2) and educ (var3)
reg5<-lm(gss[,1]~gss[,2]+gss[,3])
anova(reg5)
summary(reg5)
* SAS;
* Directions for problem 1;
* Copy lab5data.txt into the root of the C drive (i.e., into "C:/");
* Copy the below SAS code into the "Editor" window;
* and press the button with the figure of a little guy running;
* Code;
* read lab5data.txt into "gss";
data gss;
infile 'C:\lab5data.txt';
input age rincome sex fear papres16 prestige educ agewed xnorcsiz;
run;
* copy "gss" into "gssnew" without missing data and with new values
  assigned to a new variable called income;
data gssnew;
set gss;
if (rincome=13 or rincome=99 or prestige=0 or educ=99) then delete;
if (rincome=1) then income=500;
if (rincome=2) then income=2000;
if (rincome=3) then income=3500;
if (rincome=4) then income=4500;
if (rincome=5) then income=5500;
if (rincome=6) then income=6500;
if (rincome=7) then income=7500;
if (rincome=8) then income=9000;
if (rincome=9) then income=12500;
if (rincome=10) then income=17500;
if (rincome=11) then income=22500;
if (rincome=12) then income=35000;
* compute presadj=prestige-(b*educ) for use below;
presadj=prestige-(2.741733*educ);
run;
* Results
* Regression 1: income on prestige and educ;
proc reg data=gssnew;
model income=prestige educ;
output out=resgss predicted=yhat residual=e;
run;
* find correlations between income and both yhat and e;
* (yhat and e are stored in "resgss", the output data set from proc reg);
proc corr data=resgss;
var income yhat e;
run;
* Regression 2: prestige on educ;
proc reg data=gssnew;
model prestige=educ;
run;
* correlate educ with both presadj and prestige;
proc corr data=gssnew;
var educ presadj prestige;
run;
* Regression 3: income on presadj;
proc reg data=gssnew;
model income=presadj;
run;
* Directions for problem 2;
* Copy lab11data.txt into the root of the C drive (i.e., into "C:/");
* Copy the below SAS code into the "Editor" window;
* and press the button with the figure of a little guy running;
* Code;
* read lab11data.txt into "gss";
data gss;
infile 'C:\lab11data.txt';
input rincome prestige educ;
run;
* remove missing data while copying "gss" into "gssnew";
data gssnew;
set gss;
if (rincome=0 or rincome=98 or rincome=99 or prestige=0 or educ=99)
then delete;
run;
* Results
* obtain correlation matrix for rincome prestige educ;
proc corr data=gssnew;
var rincome prestige educ;
run;
* Regression 5: run stepwise regression of rincome on
prestige and educ;
proc reg data=gssnew;
model rincome=prestige educ/selection=stepwise slentry=0.05;
run;
* Regression 6: run regression of rincome on prestige and educ;
proc reg data=gssnew;
model rincome=prestige educ;
run;