Stat401E Fall 2010 Lab 11

1. Use the 1984 NORC data to demonstrate some aspects of multiple regression.

   a. Show that R-squared from the regression of "rincome" on "educ" and
      "prestige" equals the coefficient of determination between "rincome"
      and the predicted values from this regression equation.

   b. Show that the coefficient of determination between "rincome" and the
      residuals from the regression of "rincome" on "educ" and "prestige"
      equals 1 - R-squared from this same regression.

   c. Show that unlike the correlation between "educ" and "prestige", there
      is a correlation of zero between "educ" and "presadj" (i.e., between
      "educ" and "prestige" after it has been adjusted for its covariance
      with "educ").

   d. Show that the partial slope between "rincome" and "prestige" (from the
      regression of "rincome" on "prestige" and "educ") equals the slope
      between "rincome" and "presadj" (as defined in part c).

   e. Write and run your own SPSS (or R or SAS) program in which you show
      that the partial slope between "rincome" and "educ" (from the
      regression of "rincome" on "prestige" and "educ") equals the slope
      between "rincome" and "educadj" (i.e., between "rincome" and "educ"
      after it has been adjusted for its covariance with "prestige"). Be
      sure to include a printed copy of the program (i.e., the SPSS, R, or
      SAS code, not just its output) with your homework.

Below you are given a program that provides you with everything necessary to
answer parts a to d. Doing part e will help ensure that you understand how
the program works.

  recode rincome (1=500)(2=2000)(3=3500)(4=4500)(5=5500)(6=6500)
    (7=7500)(8=9000)(9=12500)(10=17500)(11=22500)(12=35000)(13=99).
  select if (rincome ne 99).
  regression variables=rincome,educ,prestige/dep=rincome/enter.
  compute yhat=-2126.829488+(236.303466*prestige)+(634.841688*educ).
  compute e=rincome-yhat.
  pearson corr rincome with yhat,e.
  regression variables=prestige,educ/dep=prestige/enter.
  compute presadj=prestige-(2.741733*educ).
  pearson corr prestige,presadj with educ.
  regression variables=rincome,presadj/dep=rincome/enter.

NOTE: Be sure to check whether your output for the above program is correct.
If you cannot find the numbers within the program (e.g., -2126.829488,
2.741733, etc.) on the output, there is an error in your program. In this
case, delete the output, then correct and rerun the program. Be sure that
you understand what each line in the program does. For example, the "select
if" statement is required in the above program to ensure that the same
subjects (i.e., those without missing-data values of 99 on "rincome") are
considered in analyses in which the "rincome" variable is excluded.

2. Use a stepwise regression procedure (or, if using R, obtain 2 regressions
as per the R code provided below) and the 1991 General Social Survey to
examine the relative effects of occupational prestige (prestg80) and years
of education (educ) on income (rincome). To do this you may wish to run the
following one-line program:

  regression descriptives=corr,sig,var/vars=rincome,prestg80,educ
    /dep=rincome/stepwise.

Note that although both independent variables are highly correlated with
income, only one enters the regression equation at the .05 level of
significance (the default significance level in SPSS). The other does not
increment R-squared by a significant amount. Also note that "rincome"
should NOT be recoded in the above program!

   a. At the end of the output is a section labeled "Excluded Variables."
      In this section, under "Beta In," is the standardized partial slope
      associated with "educ" that one would have if "rincome" were
      regressed on both "prestg80" and "educ." Why is this slope so much
      smaller than the zero-order correlation between "rincome" and "educ"
      that is given at the beginning of the output? (Hint: You may wish to
      calculate the standardized partial slope yourself using the
      correlation coefficients and standard deviations/variances listed at
      the beginning of your output.)

   b.
      Using the variances (or standard deviations) and correlation
      coefficients provided at the beginning of the output, find standard
      errors for two unstandardized slopes: (1) the unstandardized slope
      associated with "educ" from the regression of "rincome" on "educ",
      and (2) the unstandardized slope associated with "educ" from the
      regression of "rincome" on "educ" and "prestg80". Why is the second
      standard error larger than the first?

   c. What do your answers to the questions raised in parts a and b
      indicate about the possible consequences of collinearity in
      regression models?

3. This question deals with two theories from the field of "gender identity
development," which explain why children develop feminine or masculine
traits:

Modeling theory: According to modeling theory, children identify with their
parents (among other people). The more feminine a child's mother, the more
feminine the child will be. The more masculine its father, the more
masculine it becomes. In a sentence, the theory argues that children become
feminine or masculine by "modeling" (i.e., imitating) their parents.

Developmental theory: In 1966 Kohlberg argued that modeling only works
until a child gains a concept of itself as male or female. Once it gains a
self-concept as male or female, the child will only imitate the parent of
its same sex. It will "disassociate with" (i.e., "act unlike") the
other-sex parent. (Note: Kohlberg's theory is based on the assumption that
masculine and feminine behaviors are opposites. Thus a daughter's masculine
behaviors may result from her attempts to act unlike an effeminate father.)

Research since Kohlberg's article has indicated that a child develops its
self-concept as male or female at around the age of 5.

Your research investigates gender identity development in girls (NOT boys).
You randomly sample 65 Des Moines two-parent, one-child families with a
daughter between the ages of 2 and 8 years old.
Both parents from each of these families are administered a questionnaire
that, in addition to an "age" variable (measuring the daughter's age),
yields measures of mother's femininity (mfem), father's masculinity
(fmasc), and daughter's femininity (dfem). High scores on "mfem" and
"dfem" indicate high femininity. High scores on "fmasc" indicate high
masculinity. (NOTE: In understanding these theories, be sure to keep in
mind that a child can be very feminine and not conceive of itself as
female. In addition, a child can have a female self-concept and not be
very feminine at all.)

   a. Indicate the regression model that would allow you to evaluate
      modeling theory. (Do NOT attempt to find parameter estimates [i.e.,
      no numbers, please] in this part or in part b.) Explain in words
      what modeling theory suggests you will find when you estimate this
      model.

   b. Indicate the regression model that would allow you to evaluate
      developmental theory and that would take into account insights
      gleaned since Kohlberg's article. (Evaluating this model requires
      creating a new variable, "myvar". Be sure to explain how "myvar" is
      derived from other variables mentioned above. Also explain in words
      how "myvar" would allow you to evaluate developmental theory.)

After a "compute" statement in which you create "myvar", you use the
following SPSS statements in a computer run:

  regression vars=mfem,fmasc,dfem/dep=dfem/enter.
  regression vars=mfem,fmasc,dfem,age/dep=dfem/enter.
  regression vars=mfem,fmasc,dfem,age,myvar/dep=dfem/enter.

Parts of the resulting regression output are as follows:

On the first regression:

  Model Summary
    Model 1: R Square = .23846
    Predictors: (Constant), MFEM, FMASC

  Coefficients (Dependent Variable: DFEM)
                    B          Std. Error
    (Constant)    -1.98787
    MFEM          17.761       5.639
    FMASC        -10.048      11.768

  (The standard error for the constant is not shown in these excerpts.)

On the second regression:

  Model Summary
    Model 1: R Square = .25723
    Predictors: (Constant), MFEM, FMASC, AGE

  Coefficients (Dependent Variable: DFEM)
                    B          Std. Error
    (Constant)     -.38658
    MFEM          18.265       5.325
    FMASC        -11.159       8.767
    AGE           -3.002       2.418

On the third regression:

  Model Summary
    Model 1: R Square = .33846
    Predictors: (Constant), MFEM, FMASC, MYVAR, AGE

  Coefficients (Dependent Variable: DFEM)
                    B          Std. Error
    (Constant)    50.27547
    MFEM          17.856       5.014
    FMASC        -12.268       3.879
    MYVAR         21.572       7.948
    AGE           -2.993       1.967

   c. An F-test is used to compare hierarchically related models according
      to their relative parsimony. One model is more parsimonious than
      another only if, in comparison to the other, it explains either
      virtually as much variance with fewer independent variables, or
      significantly more variance with additional independent variables.
      Identify the one model that is the most parsimonious of the three
      regression models given above. Use the .05 level of significance
      throughout and show your work. (Hint: Remember that parsimony is a
      "two-tailed concept.")

   d. Which theory is supported by the data? (Explain your answer by
      discussing the theoretical interpretation of each significant
      [again, at the .05 significance level] partial slope in the
      best-fitting regression model.)

Below please find R and SAS code for problems 1 and 2:

# R

# Directions for problem 1:
# Copy the below R code into the "R Editor" window (accessed
# by selecting "New script" under the "File" pull-down menu),
# swipe (i.e., highlight) the code, and press F5.
# Code:

# read lab5data.txt into "gss"
gss<-read.table('http://www.public.iastate.edu/~carlos/401/labs/lab5data.txt')

# read gss into gssnew without missing data codes for rincome
# (var2=13 [refused] or 99), prestige (var6=0), and educ (var7=99)
gssnew<-gss[gss[,2]!=13 & gss[,2]!=99 & gss[,6]!=0 & gss[,7]!=99,]

# assign new values to rincome so that the data are in dollar units
gssnew[gssnew[,2]==1,2]=500
gssnew[gssnew[,2]==2,2]=2000
gssnew[gssnew[,2]==3,2]=3500
gssnew[gssnew[,2]==4,2]=4500
gssnew[gssnew[,2]==5,2]=5500
gssnew[gssnew[,2]==6,2]=6500
gssnew[gssnew[,2]==7,2]=7500
gssnew[gssnew[,2]==8,2]=9000
gssnew[gssnew[,2]==9,2]=12500
gssnew[gssnew[,2]==10,2]=17500
gssnew[gssnew[,2]==11,2]=22500
gssnew[gssnew[,2]==12,2]=35000

# Results

# Regression 1: rincome (var2) on prestige (var6) and education (var7)
reg1<-lm(gssnew[,2]~gssnew[,6]+gssnew[,7])
summary(reg1)

# compute Y-hat and e, then find correlations between rincome and
# each of these
yhat<- -2126.829488 + (236.303466*gssnew[,6]) + (634.841688*gssnew[,7])
e<- gssnew[,2] - yhat
cor(gssnew[,2],yhat)
cor(gssnew[,2],e)

# Regression 2: prestige (var6) on education (var7)
reg2<-lm(gssnew[,6]~gssnew[,7])
summary(reg2)

# compute presadj=var6-(b*var7), then correlate educ with both it
# and prestige
presadj<- gssnew[,6] - (2.741733*gssnew[,7])
cor(presadj,gssnew[,7])
cor(gssnew[,6],gssnew[,7])

# Regression 3: rincome (var2) on presadj
reg3<-lm(gssnew[,2]~presadj)
summary(reg3)

# Directions for problem 2:
# Copy the below R code into the "R Editor" window (accessed
# by selecting "New script" under the "File" pull-down menu),
# swipe (i.e., highlight) the code, and press F5.
# Code:

# read lab11data.txt into "gss"
gss<-read.table('http://www.public.iastate.edu/~carlos/401/labs/lab11data.txt')

# remove missing values from gss and obtain sample size
gss<-gss[gss[,1]!=0 & gss[,1]!=98 & gss[,1]!=99 & gss[,2]!=0 & gss[,3]!=99,]
n<- length(gss[,1])

# Results

# obtain correlation matrix, associated 2-tailed P-values (using a
# normal approximation to the t distribution, which is fine for large
# n), and standard deviations for all 3 variables in gss
c<- cor(gss)
c
z<- (c/(sqrt((1-(c^2))/(n-2))))
p<- 2*(1 - pnorm(abs(z)))
p
s1<- sd(gss[,1])
s1
s2<- sd(gss[,2])
s2
s3<- sd(gss[,3])
s3

# Regression 4: rincome (var1) on prestige (var2)
reg4<-lm(gss[,1]~gss[,2])
anova(reg4)
summary(reg4)

# Regression 5: rincome (var1) on prestige (var2) and educ (var3)
reg5<-lm(gss[,1]~gss[,2]+gss[,3])
anova(reg5)
summary(reg5)

* SAS

* Directions for problem 1:
* Copy lab5data.txt into the C-drive's root (i.e., into "C:\").
* Copy the below SAS code into the "Editor" window,
* and press the button with the figure of a little guy running.

* Code:

* read lab5data.txt into "gss";
data gss;
  infile 'C:\lab5data.txt';
  input age rincome sex fear papres16 prestige educ agewed xnorcsiz;
run;

* copy "gss" into "gssnew" without missing data and with new values
* assigned to a new variable called income;
data gssnew;
  set gss;
  if (rincome=13 or rincome=99 or prestige=0 or educ=99) then delete;
  if (rincome=1) then income=500;
  if (rincome=2) then income=2000;
  if (rincome=3) then income=3500;
  if (rincome=4) then income=4500;
  if (rincome=5) then income=5500;
  if (rincome=6) then income=6500;
  if (rincome=7) then income=7500;
  if (rincome=8) then income=9000;
  if (rincome=9) then income=12500;
  if (rincome=10) then income=17500;
  if (rincome=11) then income=22500;
  if (rincome=12) then income=35000;
  * compute presadj=prestige-(b*educ) for use below;
  presadj=prestige-(2.741733*educ);
run;

* Results
* Regression 1: income on prestige and educ;
proc reg data=gssnew;
  model income=prestige educ;
  output out=resgss predicted=yhat residual=e;
run;

* find correlations between
* income and both yhat and e (note that yhat and e are in the output
* data set resgss, not in gssnew);
proc corr data=resgss;
  var income yhat e;
run;

* Regression 2: prestige on educ;
proc reg data=gssnew;
  model prestige=educ;
run;

* correlate educ with both presadj and prestige;
proc corr data=gssnew;
  var educ presadj prestige;
run;

* Regression 3: income on presadj;
proc reg data=gssnew;
  model income=presadj;
run;

* Directions for problem 2:
* Copy lab11data.txt into the C-drive's root (i.e., into "C:\").
* Copy the below SAS code into the "Editor" window,
* and press the button with the figure of a little guy running.

* Code:

* read lab11data.txt into "gss";
data gss;
  infile 'C:\lab11data.txt';
  input rincome prestige educ;
run;

* remove missing data while copying "gss" into "gssnew";
data gssnew;
  set gss;
  if (rincome=0 or rincome=98 or rincome=99 or prestige=0 or educ=99)
    then delete;
run;

* Results
* obtain correlation matrix for rincome prestige educ;
proc corr data=gssnew;
  var rincome prestige educ;
run;

* Regression 5: run stepwise regression of rincome on prestige and educ;
proc reg data=gssnew;
  model rincome=prestige educ/selection=stepwise slentry=0.05;
run;

* Regression 6: run regression of rincome on prestige and educ;
proc reg data=gssnew;
  model rincome=prestige educ;
run;
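For readers working outside SPSS, R, or SAS, the algebraic identities behind problem 1 (parts a-d) and the standardized-partial-slope formula hinted at in problem 2a can be verified numerically in any language. The sketch below is a hedged illustration in Python with numpy, using synthetic data; the variable names echo the lab's but the numbers are made up, and this is not part of the lab's required code.

```python
# Numeric check of the problem 1 identities on synthetic data (Python/numpy).
import numpy as np

rng = np.random.default_rng(0)
n = 200
educ = rng.normal(12, 3, n)
prestige = 2.5 * educ + rng.normal(0, 5, n)      # predictors correlated, as in the lab
rincome = 600 * educ + 230 * prestige + rng.normal(0, 4000, n)

# Multiple regression of rincome on educ and prestige (OLS via least squares)
X = np.column_stack([np.ones(n), educ, prestige])
b = np.linalg.lstsq(X, rincome, rcond=None)[0]   # [intercept, b_educ, b_prestige]
yhat = X @ b
e = rincome - yhat
r2 = 1 - e.var() / rincome.var()

# (a) R-squared equals the coefficient of determination between y and yhat
assert np.isclose(r2, np.corrcoef(rincome, yhat)[0, 1] ** 2)

# (b) the coefficient of determination between y and the residuals is 1 - R-squared
assert np.isclose(np.corrcoef(rincome, e)[0, 1] ** 2, 1 - r2)

# (c) presadj (prestige adjusted for its covariance with educ) is
#     uncorrelated with educ
slope_pe = np.polyfit(educ, prestige, 1)[0]
presadj = prestige - slope_pe * educ
assert abs(np.corrcoef(educ, presadj)[0, 1]) < 1e-8

# (d) the partial slope for prestige equals the simple slope of y on presadj
slope_adj = np.polyfit(presadj, rincome, 1)[0]
assert np.isclose(b[2], slope_adj)

# (problem 2a hint) the standardized partial slope from zero-order
# correlations equals the unstandardized slope rescaled by sd ratios
r_ye = np.corrcoef(rincome, educ)[0, 1]
r_yp = np.corrcoef(rincome, prestige)[0, 1]
r_ep = np.corrcoef(educ, prestige)[0, 1]
beta_educ = (r_ye - r_yp * r_ep) / (1 - r_ep ** 2)
assert np.isclose(beta_educ, b[1] * educ.std() / rincome.std())

print("all identities hold")
```

The same identities hold exactly for the lab's actual data, which is what parts a-d ask you to demonstrate with the SPSS, R, or SAS programs above.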