Stat404 Fall 2009 Lab 6

1. The following program will be needed to do this problem:

data list list file='c:animals.txt' / id body brain.
compute ibody=1/body.
compute lbody=lg10(body).
compute sbody=sqrt(body).
compute ibrain=1/brain.
compute lbrain=lg10(brain).
compute sbrain=sqrt(brain).

examine vars=body/percentiles(20,40,60,80)/plot none/stat none.
compute rbody=body.
recode rbody(lo thru .248=1)(.2481 thru 1.452=2)(1.4521 thru 4.226=3)
  (4.2261 thru 71.2=4)(71.2001 thru hi=5).
examine vars=brain by rbody / plot boxplot / stat none / nototal.

regression vars=body,brain / dep=brain / enter /save=resid(error5) pred(pbrain).
examine vars=pbrain/percentiles(20,40,60,80)/plot none/stat none.
compute rpbrain=pbrain.
recode rpbrain(lo thru 91.244=1)(91.2441 thru 92.407=2)
  (92.4071 thru 95.088=3)(95.0881 thru 159.818=4)(159.8181 thru hi=5).
examine vars=error5 by rpbrain / plot boxplot / stat none / nototal.

regression vars=ibody,ibrain/ dep=ibrain / enter /save=resid(error6) pred(pibrain).
examine vars=pibrain/percentiles(20,40,60,80)/plot none/stat none.
compute rpibrain=pibrain.
recode rpibrain(lo thru .1362=1)(.13621 thru .1448=2)
  (.14481 thru .1623=3)(.16231 thru .2958=4)(.29581 thru hi=5).
examine vars=error6 by rpibrain / plot boxplot / stat none / nototal.

regression vars=lbody,lbrain/ dep=lbrain / enter /save=resid(error7) pred(plbrain).
examine vars=plbrain/percentiles(20,40,60,80)/plot none/stat none.
compute rplbrain=plbrain.
recode rplbrain(lo thru .4676=1)(.46761 thru 1.0483=2)
  (1.04831 thru 1.3976=3)(1.39761 thru 2.3156=4)(2.31561 thru hi=5).
examine vars=error7 by rplbrain / plot boxplot / stat none / nototal.

regression vars=sbody,sbrain/ dep=sbrain / enter /save=resid(error8) pred(psbrain).
examine vars=psbrain/percentiles(20,40,60,80)/plot none/stat none.
compute rpsbrain=psbrain.
recode rpsbrain(lo thru 3.8488=1)(3.84881 thru 4.5702=2)
  (4.57021 thru 5.4373=3)(5.43731 thru 11.9124=4)(11.91241 thru hi=5).
examine vars=error8 by rpsbrain / plot boxplot / stat none / nototal.

Weisberg (1985, Table 6.6 on pp. 144-5) presents data on brain and body weights for 62 species of mammals. These data are provided via our class web site's Assignments page. The file contains one line of data for each species. (E.g., "Man" is on line 32.) Each line of data contains three numbers: a sequence number (that allows you to identify individual species), body weight in kilograms, and brain weight in grams. Notice that it makes no sense to speak of brain or body weight as causal. (At most, one might expect a positive association between the two.) As it turns out, the variance of each variable increases with the magnitude of the other.

a. Obtain a boxplot of brain weight by body weight using SPSS, R, or SAS.

b. Regress brain weight on body weight and obtain a boxplot of the residual brain weight values by the estimated brain weight values. Using this diagnostic plot, show that the conditional variance of the dependent variable increases as the dependent variable takes larger and larger values.

c. Perform square root, logarithmic, and inverse transformations on the dependent and independent variables and rerun the regression three times (once with each pair of transformed variables). From each of these regressions obtain a diagnostic plot as in part b. Do the variables' variances appear to increase as a proportion of (i) their means, (ii) their means squared, or (iii) their means to the fourth power? (Differently put, which transformation yields the greatest reduction of heteroscedasticity in the data?) State in words the meanings of the regression coefficient and constant from the regression with the most homoscedastic variances. Be sure to take into account the new meaning your transformation gives to the dependent variable. (Hint: When interpreting the constant, remember that log(1)=0.)

d. Weisberg's Table 6.6 lists sequence numbers for identifying the various species.
In reference to the “regression with the most homoscedastic variances” found in part c, the residual values from this regression indicate how much transformed brain weight an animal has above or below what one would estimate given its body weight. Which animal’s brain weight is farthest above what one would estimate given its body weight (i.e., which is smartest)? Which animal’s brain weight is farthest below what one would estimate given its body weight (i.e., which is dumbest)? (Hint: If you use SPSS, go to the Data Editor and select “Data”, then “Sort Cases...”. Then sort on the residuals from the appropriate regression model estimated in part c. Which ids end up at the top and bottom rows?)

2. Using the 1996 General Social Survey of U.S. adults, determine whether people with high SES (socio-economic status) are less likely than people with low SES to watch a lot of television. To do this you must first construct an SES measure. One means of doing this begins by combining distinct SES measures (e.g., income, occupational prestige, and subjective class identification) using principal components analysis. A measure of hours spent watching TV each week is then regressed on this composite measure. Do this using the following program:

import file='c:gss96.por'.
recode rincome(1=500)(2=2000)(3=3500)(4=4500)(5=5500)(6=6500)
  (7=7500)(8=9000)(9=12500)(10=17500)(11=22500)(12=35000).
factor vars=class,prestg80,rincome/rotate=norotate/save (1 ses).
regression vars=ses1,tvhours/dep=tvhours/enter.
condescriptive ses1,tvhours.

a. The FACTOR routine in SPSS generates principal components when the NOROTATE option is specified. These principal components are standardized (i.e., in this case, SES1 has a mean of zero and a variance equal to one). Yet despite the fact that the independent variable is standardized in the regression of TVHOURS on SES1, the unstandardized regression coefficient in the output from this program does not equal the standardized coefficient. Why is this the case? (Hint: Use algebra to derive one coefficient from the other.)

b. What benefit is there in combining CLASS, PRESTG80, and RINCOME into a single measure?

3. In the U.S. people tend to become more satisfied with their lives as they reach ages in their late 80s and older. Some gerontologists argue that this increase in life satisfaction results as old people learn to accept "what they cannot do" (i.e., their physical limitations resulting from declining health, etc.). Other gerontologists argue that life satisfaction results only when old people are actively involved with other people. You wish to test which theory (i.e., "acceptance theory" or "involvement theory") is correct. You have data generated during face-to-face interviews with 63 U.S. centenarian (i.e., over 100 years old) nursing home residents. Three of your variables are as follows:

Y = "life satisfaction" measured on a 100-point scale from 0 = no satisfaction with life to 100 = total satisfaction with life
X = "acceptance" measured as the number of hours each week that a respondent spends sitting in a rocking chair and sighing
W = "involvement" measured as the number of hours each week that a respondent spends interacting with other nursing home residents, personnel, or visitors

Means, standard deviations, and correlations on these variables are...

Variable              Y      X      W
Mean                 57     22     30
Standard Deviation   19     40     24

Correlation Coefficients
          Y      X      W
Y       1.0
X       0.6    1.0
W      -0.1   -0.7    1.0

If one assumes the above correlations to reflect the true relations among life satisfaction, acceptance, and involvement for the population of all U.S. centenarian nursing home residents, one's regression model would be misspecified if it only included the involvement measure as an independent variable. What would be the bias in an unstandardized slope estimated using this misspecified model? (Hints: You are being asked to calculate a number here.
Please give the units associated with this number, and show how the number was calculated. Finally, you should also assume that no additional important variables are excluded from the model.)

Below please find R and SAS code for these problems:

# R Code:

########## QUESTION 1 ##########

#-------Read in the data-------#
# Forward slashes are used because "\a" is an escape sequence inside an R string.
animals <- read.table("C:/animals.txt")
colnames(animals) <- c("id", "body", "brain")
attach(animals)

#-------Create some new variables-------#
animals$ibody <- 1/body
animals$lbody <- log10(body)
animals$sbody <- sqrt(body)
animals$ibrain <- 1/brain
animals$lbrain <- log10(brain)
animals$sbrain <- sqrt(brain)

#-------Find quantiles of "body"-------#
qbody <- quantile(animals$body,c(.2,.4,.6,.8))
attach(animals)   # re-attach so the new columns are visible by name

#---Create "rbody", by collapsing "body" into 5 equal-sized groups.---#
animals[body <= qbody[1],10] <- 1
animals[body > qbody[1] & body <= qbody[2], 10] <- 2
animals[body > qbody[2] & body <= qbody[3], 10] <- 3
animals[body > qbody[3] & body <= qbody[4], 10] <- 4
animals[body > qbody[4], 10] <- 5
names(animals)[10] <- "rbody"

#-------Boxplots of "brain", by groups created in previous step-------#
boxplot(brain~rbody, data=animals,xlab="rbody",ylab="brain")
abline(h=0,col="grey")   #Plots a horizontal line at 0#
X11()                    #Opens a new graphics device#

#-------Regression of "brain" on "body"-------#
reg1 <- lm(brain~body)
summary(reg1)

#-------Save residuals from the regression as "error5".
# Save y-hats as "pbrain".-------#
animals$error5 <- reg1$residuals
animals$pbrain <- reg1$fitted.values

#-------Collapse "pbrain" into 5 categories-------#
qpbrain <- quantile(animals$pbrain,c(.2,.4,.6,.8))
attach(animals)
animals[pbrain <= qpbrain[1],13] <- 1
animals[pbrain > qpbrain[1] & pbrain <= qpbrain[2], 13] <- 2
animals[pbrain > qpbrain[2] & pbrain <= qpbrain[3], 13] <- 3
animals[pbrain > qpbrain[3] & pbrain <= qpbrain[4], 13] <- 4
animals[pbrain > qpbrain[4], 13] <- 5
names(animals)[13] <- "rpbrain"

#-------Boxplots of residuals, by collapsed category of y-hat-------#
boxplot(error5~rpbrain, data=animals,xlab="rpbrain",ylab="error5")
abline(h=0,col="grey")
X11()

#-------Now do the same thing with transformed variables-------#
#-------First the inverse-transformed variables-------#
reg2 <- lm(ibrain~ibody)
summary(reg2)
animals$error6 <- reg2$residuals
animals$pibrain <- reg2$fitted.values
qpibrain <- quantile(animals$pibrain,c(.2,.4,.6,.8))
attach(animals)
animals[pibrain <= qpibrain[1],16] <- 1
animals[pibrain > qpibrain[1] & pibrain <= qpibrain[2], 16] <- 2
animals[pibrain > qpibrain[2] & pibrain <= qpibrain[3], 16] <- 3
animals[pibrain > qpibrain[3] & pibrain <= qpibrain[4], 16] <- 4
animals[pibrain > qpibrain[4], 16] <- 5
names(animals)[16] <- "rpibrain"
boxplot(error6~rpibrain, data=animals,xlab="rpibrain",ylab="error6")
abline(h=0,col="grey")
X11()

#-------Now the log-transformed variables-------#
reg3 <- lm(lbrain~lbody)
summary(reg3)
animals$error7 <- reg3$residuals
animals$plbrain <- reg3$fitted.values
qplbrain <- quantile(animals$plbrain,c(.2,.4,.6,.8))
attach(animals)
animals[plbrain <= qplbrain[1],19] <- 1
animals[plbrain > qplbrain[1] & plbrain <= qplbrain[2], 19] <- 2
animals[plbrain > qplbrain[2] & plbrain <= qplbrain[3], 19] <- 3
animals[plbrain > qplbrain[3] & plbrain <= qplbrain[4], 19] <- 4
animals[plbrain > qplbrain[4], 19] <- 5
names(animals)[19] <- "rplbrain"
boxplot(error7~rplbrain, data=animals,xlab="rplbrain",ylab="error7")
abline(h=0,col="grey")
X11()

#-------Finally the square-root-transformed variables-------#
reg4 <- lm(sbrain~sbody)
summary(reg4)
animals$error8 <- reg4$residuals
animals$psbrain <- reg4$fitted.values
qpsbrain <- quantile(animals$psbrain,c(.2,.4,.6,.8))
attach(animals)
animals[psbrain <= qpsbrain[1],22] <- 1
animals[psbrain > qpsbrain[1] & psbrain <= qpsbrain[2], 22] <- 2
animals[psbrain > qpsbrain[2] & psbrain <= qpsbrain[3], 22] <- 3
animals[psbrain > qpsbrain[3] & psbrain <= qpsbrain[4], 22] <- 4
animals[psbrain > qpsbrain[4], 22] <- 5
names(animals)[22] <- "rpsbrain"
boxplot(error8~rpsbrain, data=animals,xlab="rpsbrain",ylab="error8")
abline(h=0,col="grey")
X11()

#-------Sorting by residuals-------#
# order() gives the row (species) numbers from smallest to largest residual.
species <- array(order(error6))
error.6 <- array(sort(error6))
cbind(species,error.6)
species <- array(order(error7))
error.7 <- array(sort(error7))
cbind(species,error.7)
species <- array(order(error8))
error.8 <- array(sort(error8))
cbind(species,error.8)

########## QUESTION 2 ##########

### Be sure to put gss96.csv (not .por) in your root directory
# Forward slashes are used because "\g" is not a valid escape inside an R string.
gss96 <- read.csv("C:/gss96.csv",header=TRUE)

#-------Recode rincome into rincome2-------#
attach(gss96)
gss96[rincome == 1,5] <- 500
gss96[rincome == 2,5] <- 2000
gss96[rincome == 3,5] <- 3500
gss96[rincome == 4,5] <- 4500
gss96[rincome == 5,5] <- 5500
gss96[rincome == 6,5] <- 6500
gss96[rincome == 7,5] <- 7500
gss96[rincome == 8,5] <- 9000
gss96[rincome == 9,5] <- 12500
gss96[rincome == 10,5] <- 17500
gss96[rincome == 11,5] <- 22500
gss96[rincome == 12,5] <- 35000
names(gss96)[5] <- "rincome2"

#-------Standardize the appropriate variables-------#
gss96$class.z <- (gss96$class-mean(gss96$class))/sd(gss96$class)
gss96$prestg80.z <- (gss96$prestg80-mean(gss96$prestg80))/sd(gss96$prestg80)
gss96$rincome2.z <- (gss96$rincome2-mean(gss96$rincome2))/sd(gss96$rincome2)

#-------Principal Components-------#
pc <- prcomp(gss96[,6:8], scale=TRUE, center=TRUE)
pc
ses <- 0.5445274*gss96$class.z + 0.6080401*gss96$prestg80.z +
       0.5777346*gss96$rincome2.z
ses1 <- (ses-mean(ses))/sd(ses)

#-------Regression-------#
reg5 <- lm(tvhours~ses1)
summary(reg5)

#-------Descriptive statistics-------#
summary(ses1)
sd(ses1)
summary(gss96$tvhours)
sd(gss96$tvhours)

* SAS Code:

/**** PROBLEM 1 ****/
/***Put animals.txt datafile in your C: root directory,***/
/*** or change line 2 below to point to where the file is.***/

/* Read in the data, perform transformations */
DATA animals;
  INFILE 'C:\animals.txt';
  INPUT id body brain;
  ibody=1/body;
  lbody=log10(body);
  sbody=sqrt(body);
  ibrain=1/brain;
  lbrain=log10(brain);
  sbrain=sqrt(brain);
RUN;

/* Obtain various percentiles of "body" */
PROC UNIVARIATE data=animals noprint;
  VAR body;
  OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=body_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;

/* Collapse "body" into "rbody" with 5 groups of equal size. */
DATA animals;
  SET animals;
  rbody=body;
  IF body <= .248 THEN rbody=1;
  IF body > .248 AND body <= 1.452 THEN rbody=2;
  IF body > 1.4521 AND body <= 4.226 THEN rbody=3;
  IF body > 4.2261 AND body <= 71.2 THEN rbody=4;
  IF body >= 71.20001 THEN rbody=5;
RUN;

/* Obtain boxplot of "brain" at each level of "rbody" */
PROC SORT; by rbody;
PROC BOXPLOT;
  PLOT brain*rbody /BOXSTYLE= SCHEMATIC;
RUN;

/**********************************************************/
/* Regression of "brain" on "body". */
/*Y-hats saved as "pbrain" (smile). Residuals as "error5"*/
PROC REG;
  MODEL brain = body;
  OUTPUT out=animals r=error5 p=pbrain;
RUN;

/* Collapsing "pbrain" into 5 groups */
PROC UNIVARIATE data=animals noprint;
  VAR pbrain;
  OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=pbrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
  SET animals;
  rpbrain=pbrain;
  IF pbrain <= 91.244 THEN rpbrain=1;
  /* Lower bound corrected from 91.2241 to match the 91.244 cutoff above */
  IF pbrain > 91.2441 AND pbrain <= 92.407 THEN rpbrain=2;
  IF pbrain > 92.4071 AND pbrain <= 95.088 THEN rpbrain=3;
  IF pbrain > 95.0881 AND pbrain <= 159.818 THEN rpbrain=4;
  IF pbrain >= 159.8181 THEN rpbrain=5;
RUN;
PROC SORT; by rpbrain;
PROC BOXPLOT;
  PLOT error5*rpbrain /BOXSTYLE= SCHEMATIC;
RUN;

/*********************************************************/
/* Now the same thing with our transformed variables */

/* First the inverse-transformed data */
PROC REG;
  MODEL ibrain = ibody;
  OUTPUT out=animals r=error6 p=pibrain;
RUN;
PROC UNIVARIATE data=animals noprint;
  VAR pibrain;
  OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=pibrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
  SET animals;
  rpibrain=pibrain;
  IF pibrain <= .1362 THEN rpibrain=1;
  IF pibrain > .13621 AND pibrain <= .1448 THEN rpibrain=2;
  IF pibrain > .14481 AND pibrain <= .1623 THEN rpibrain=3;
  IF pibrain > .16231 AND pibrain <= .2958 THEN rpibrain=4;
  IF pibrain >= .29581 THEN rpibrain=5;
RUN;
PROC SORT; by rpibrain;
PROC BOXPLOT;
  PLOT error6*rpibrain /BOXSTYLE= SCHEMATIC;
RUN;

/* Now the log-transformed data */
PROC REG;
  MODEL lbrain = lbody;
  OUTPUT out=animals r=error7 p=plbrain;
RUN;
PROC UNIVARIATE data=animals noprint;
  VAR plbrain;
  OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=plbrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
  SET animals;
  rplbrain=plbrain;
  IF plbrain <= .4676 THEN rplbrain=1;
  IF plbrain > .46761 AND plbrain <= 1.0483 THEN rplbrain=2;
  IF plbrain > 1.04831 AND plbrain <= 1.3976 THEN rplbrain=3;
  IF plbrain > 1.39761 AND plbrain <= 2.3156 THEN rplbrain=4;
  IF plbrain >= 2.31561 THEN rplbrain=5;
RUN;
PROC SORT; by rplbrain;
PROC BOXPLOT;
  PLOT error7*rplbrain /BOXSTYLE= SCHEMATIC;
RUN;

/* Finally the square-root transformed data */
PROC REG;
  MODEL sbrain = sbody;
  OUTPUT out=animals r=error8 p=psbrain;
RUN;
PROC UNIVARIATE data=animals noprint;
  VAR psbrain;
  OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=psbrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
  SET animals;
  rpsbrain=psbrain;
  IF psbrain <= 3.8488 THEN rpsbrain=1;
  IF psbrain > 3.84881 AND psbrain <= 4.5702 THEN rpsbrain=2;
  IF psbrain > 4.57021 AND psbrain <= 5.4373 THEN rpsbrain=3;
  IF psbrain > 5.43731 AND psbrain <= 11.9124 THEN rpsbrain=4;
  IF psbrain >= 11.91241 THEN rpsbrain=5;
RUN;
PROC SORT; by rpsbrain;
PROC BOXPLOT;
  PLOT error8*rpsbrain /BOXSTYLE= SCHEMATIC;
RUN;

/*To aid in the final question in problem 1d */
PROC SORT; by error6;
PROC PRINT data=animals noobs;
  VAR id error6;
  TITLE 'Data sorted by error6';
RUN;
PROC SORT; by error7;
PROC PRINT data=animals noobs;
  VAR id error7;
  TITLE 'Data sorted by error7';
RUN;
PROC SORT; by error8;
PROC PRINT data=animals noobs;
  VAR id error8;
  TITLE 'Data sorted by error8';
RUN;

/**** PROBLEM 2 ****/
PROC IMPORT OUT=gss96
  FILE="C:\gss96.csv"
  DBMS=CSV REPLACE;
  GETNAMES=YES;
  DATAROW=2;
  GUESSINGROWS=10;
RUN;

/* Recode rincome categories into dollar values */
DATA gss96;
  SET gss96;
  IF rincome = 1 THEN rincome2 = 500;
  IF rincome = 2 THEN rincome2 = 2000;
  IF rincome = 3 THEN rincome2 = 3500;
  IF rincome = 4 THEN rincome2 = 4500;
  IF rincome = 5 THEN rincome2 = 5500;
  IF rincome = 6 THEN rincome2 = 6500;
  IF rincome = 7 THEN rincome2 = 7500;
  IF rincome = 8 THEN rincome2 = 9000;
  IF rincome = 9 THEN rincome2 = 12500;
  IF rincome = 10 THEN rincome2 = 17500;
  IF rincome = 11 THEN rincome2 = 22500;
  IF rincome = 12 THEN rincome2 = 35000;
RUN;

PROC UNIVARIATE;
  VAR class prestg80 rincome2;
RUN;

/* Standardize using the means and standard deviations printed above */
DATA gss96;
  SET gss96;
  classz = (class-2.46927803)/0.6238;
  prestg80z = (prestg80-43.674347)/13.70968;
  rincome2z = (rincome2-22849.846)/12186;
RUN;

PROC FACTOR data=gss96 ROTATE=none;
  VAR classz prestg80z rincome2z;
RUN;

DATA gss96;
  SET gss96;
  SES = 0.47555260*classz + 0.59295741*prestg80z + 0.53532292*rincome2z;
RUN;

PROC UNIVARIATE;
  VAR SES;
RUN;

DATA gss96;
  SET gss96;
  SES1 = (SES-2.68754E-8)/1.17673;
RUN;

PROC REG data=gss96;
  MODEL tvhours=SES1;
RUN;

PROC UNIVARIATE;
  VAR SES tvhours;
RUN;
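
Both the R and SAS listings collapse a variable into five equal-sized groups by comparing it against its 20th, 40th, 60th, and 80th percentiles. As a rough sketch only (it is not part of the original lab code), the same grouping might be written more compactly in R with quantile() and cut(). The data below are simulated rather than taken from Weisberg's table, and the names x, breaks, and group are made up for illustration.

```r
# Simulated values standing in for a variable such as "body" (an assumption;
# these are NOT the Weisberg data).
set.seed(1)
x <- rexp(50, rate = 0.1)

# Cut points at the 20th, 40th, 60th, and 80th percentiles, padded with
# -Inf and Inf so that every observation falls into some interval.
breaks <- c(-Inf, quantile(x, c(.2, .4, .6, .8)), Inf)

# cut() uses right-closed intervals (a, b] by default, which mirrors the
# "x > lower & x <= upper" rule in the listing. labels = FALSE returns the
# group numbers 1 through 5 instead of a factor.
group <- cut(x, breaks = breaks, labels = FALSE)

# Number of observations in each group.
table(group)
```

With a real data frame one could assign the result to a named column (for example animals$rbody), which avoids relying on column positions such as 10, 13, or 16.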