Stat404 Fall 2009
Lab 5

This lab will make use of output from the following program:

get file='c:recall.sav'.
compute nrecall = (1/sqrt(n))*arsin(sqrt(recall)).
temporary.
recode birthyr(28,32=30)(34,35=34.5)(36,37=36.5)(38,39=38.5)
   (40,41=40.5)(42,43=42.5)(44,45=44.5)(46,47=46.5)(48,49=48.5).
recode eventyr(45 thru 48=46.5)(52,54=53)(56,61=58.5)(63,64=63.5)
   (65,67=66)(71,73=72).
examine vars=nrecall by eventyr,birthyr/plot boxplot
   /stat none/nototal.
regression variables=nrecall,birthyr,eventyr/dep=nrecall/enter.
pearson corr nrecall birthyr eventyr.

data list list file='c:commute.txt' / density expense.
examine vars=density/percentiles(20,40,60,80)/plot none/stat none.
compute rdensity=density.
recode rdensity(lo thru 22=1)(22.01 thru 118=2)(118.01 thru 438=3)
   (438.01 thru 827=4)(827.01 thru hi=5).
examine vars=expense by rdensity/plot boxplot/stat none/nototal.
compute iexpense=1/expense.
compute lexpense=lg10(expense).
compute sexpense=sqrt(expense).

regression vars=density,expense/dep=expense/enter
   /save=resid(error1) pred(pexpen).
examine vars=pexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rpexpen=pexpen.
recode rpexpen(lo thru 101=1)(101.01 thru 164=2)(164.01 thru 216=3)
   (216.01 thru 231=4)(231.01 thru hi=5).
examine vars=error1 by rpexpen/plot boxplot/stat none/nototal.

regression vars=density,iexpense/dep=iexpense/enter
   /save=resid(error2) pred(piexpen).
examine vars=piexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rpiexpen=piexpen.
recode rpiexpen(lo thru .00866=1)(.008661 thru .0093=2)
   (.009301 thru .01142=3)(.011421 thru .014=4)(.01401 thru hi=5).
examine vars=error2 by rpiexpen/plot boxplot/stat none/nototal.

regression vars=density,lexpense/dep=lexpense/enter
   /save=resid(error3) pred(plexpen).
examine vars=plexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rplexpen=plexpen.
recode rplexpen(lo thru 1.9=1)(1.901 thru 2.03=2)(2.031 thru 2.138=3)
   (2.139 thru 2.17=4)(2.171 thru hi=5).
examine vars=error3 by rplexpen/plot boxplot/stat none/nototal.

regression vars=density,sexpense/dep=sexpense/enter
   /save=resid(error4) pred(psexpen).
examine vars=psexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rpsexpen=psexpen.
recode rpsexpen(lo thru 9.31=1)(9.311 thru 11.28=2)(11.281 thru 12.9=3)
   (12.901 thru 13.39=4)(13.391 thru hi=5).
examine vars=error4 by rpsexpen/plot boxplot/stat none/nototal.

1. When data on one's dependent variable are proportions (as with the RECALL variable), one often transforms these proportions using the arcsin-square root transformation. The above program will provide you with a plot of NRECALL by EVENTYR, where NRECALL is RECALL after it has been transformed by taking its arcsin-square root and by weighting it by the inverse-square root of the number of observations from which it was computed. Examine the box plot of NRECALL by EVENTYR. Are any assumptions now met that were not met in the corresponding boxplot (i.e., of RECALL by EVENTYR) obtained in Lab 1? If yes, how can you tell?

2. Regress NRECALL on BIRTHYR and EVENTYR. Express the unstandardized regression equation in words. What is the proportion of the variance in NRECALL that is explained by both independent variables together? Finally, what proportion of "the variance in NRECALL that is not explained by EVENTYR" is explained by BIRTHYR?

(Hints: NRECALL does not have units that are easy to describe. If you wish, you may simply refer to "units on the transformed recall measure." Also, note that the materials you received with Lab 1 provide a table in which proportions are listed for every BIRTHYR within each EVENTYR. In an analysis of a data set that contained each of these proportions you would discover that BIRTHYR and EVENTYR would have a zero correlation. [Can you see why?] When the class's data set was created, zero cells from the table were omitted, and as a consequence you will find in your analyses that BIRTHYR and EVENTYR are collinear.)

3. The cost of transporting children to school is higher in sparsely populated Long Island school districts than in densely populated ones. A data set called "commute.txt" is provided (via a link in our class web site) that allows you to estimate the annual transportation expenses of students (in dollars per pupil) as a function of the population density (in students per square mile) of the school district within which they live. The data set contains one line of data for each school district. Each line of data contains two numbers: the population density of a school district and the per pupil annual transportation costs to school.

a. Use SPSS to obtain a boxplot of the data.

b. Regress transportation expenses on population density and obtain a boxplot of the residual expense values by the estimated expense values. Note that the conditional variance of the dependent variable increases as the dependent variable takes larger and larger values.

c. Perform square root, logarithmic, and inverse transformations on the dependent variable and rerun the regression three times (once with each of the transformed dependent variables). From each of these regressions obtain a diagnostic plot as in part b. Does the dependent variable's variance appear to increase as a proportion of (i) its mean, (ii) its mean squared, or (iii) its mean to the fourth power? (Differently put, which transformation yields the greatest reduction of heteroscedasticity in the data?) State in words the meanings of the regression coefficient and constant from the regression with the most homoscedastic variances. Be sure to take into account the new meaning your transformation gives to the dependent variable.

(Hints: When deciding which transformation is best, it sometimes happens that two transformations do nearly as well as each other. Under such conditions, it makes sense to take other criteria into consideration.
For example, which transformation yields units on the dependent variable that are easiest to interpret?)

Below please find R and SAS code for these problems:

# R Code:

############# QUESTION 1 ###############
#Read in dataset and name the variables
#You must have recall.txt in your root directory
recall <- read.table("C:/recall.txt",header=F)
colnames(recall) <- c("birthyr","event","eventyr","recall")
attach(recall)

#Assign new variable "n" based on value of birthyr
recall[birthyr==28,5]=44
recall[birthyr==32,5]=47
recall[birthyr==34,5]=67
recall[birthyr==35,5]=51
recall[birthyr==36,5]=52
recall[birthyr==37,5]=63
recall[birthyr==38,5]=63
recall[birthyr==39,5]=78
recall[birthyr==40,5]=73
recall[birthyr==41,5]=80
recall[birthyr==42,5]=87
recall[birthyr==43,5]=94
recall[birthyr==44,5]=87
recall[birthyr==45,5]=66
recall[birthyr==46,5]=97
recall[birthyr==47,5]=86
recall[birthyr==48,5]=65
recall[birthyr==49,5]=48

#Assign variable name for new variable
names(recall)[5] <- "n"

#Compute new variable "nrecall"
recall$nrecall <- (1/sqrt(recall$n))*asin(sqrt(recall$recall))

#Collapse birthyr into a new variable called "cbirthyr"
recall$cbirthyr <- recall$birthyr
recall[birthyr==28 | birthyr==32,7]=30
recall[birthyr==34 | birthyr==35,7]=34.5
recall[birthyr==36 | birthyr==37,7]=36.5
recall[birthyr==38 | birthyr==39,7]=38.5
recall[birthyr==40 | birthyr==41,7]=40.5
recall[birthyr==42 | birthyr==43,7]=42.5
recall[birthyr==44 | birthyr==45,7]=44.5
recall[birthyr==46 | birthyr==47,7]=46.5
recall[birthyr==48 | birthyr==49,7]=48.5

#Collapse eventyr into a new variable named "ceventyr"
recall$ceventyr <- recall$eventyr
recall[eventyr >=45 & eventyr <= 48,8]=46.5
recall[eventyr==52 | eventyr==54,8]=53
recall[eventyr==56 | eventyr==61,8]=58.5
recall[eventyr==63 | eventyr==64,8]=63.5
recall[eventyr==65 | eventyr==67,8]=66
recall[eventyr==71 | eventyr==73,8]=72
colnames(recall)[8] <- "ceventyr"

#Boxplots
boxplot(nrecall~cbirthyr,data=recall,xlab="Year of birth",ylab="Proportion recalling event (transformed)")
X11()
boxplot(nrecall~ceventyr,data=recall,xlab="Year of event",ylab="Proportion recalling event (transformed)")

############# QUESTION 2 ###############
reg <- lm(nrecall~birthyr+eventyr,data=recall)
summary(reg)
cor(recall[,c(1,3,6)])
X11()

############# QUESTION 3 ###############
commute <- read.table("commute.txt")
colnames(commute) <- c("density", "expense")
attach(commute)

#Quantiles
qdens <- quantile(density,c(.2,.4,.6,.8))
qdens
#Note that R computes quantiles a bit differently from SPSS, but that won't matter for
#any analysis here.

#Create new variable "rdensity"
commute[density <= qdens[1],3] = 1
commute[density > qdens[1] & density <= qdens[2], 3] = 2
commute[density > qdens[2] & density <= qdens[3], 3] = 3
commute[density > qdens[3] & density <= qdens[4], 3] = 4
commute[density > qdens[4], 3] = 5
names(commute)[3] <- "rdensity"
boxplot(expense~rdensity, data=commute,xlab="rdensity",ylab="expense")
X11()

#Transform variables
commute$iexpense <- 1/commute$expense
commute$lexpense <- log10(commute$expense)
commute$sexpense <- sqrt(commute$expense)

#Regression
reg1 <- lm(expense~density)
summary(reg1)

#Save residuals and predicteds
commute$error1 <- reg1$residuals
commute$pexpen <- reg1$fitted.values
attach(commute)
qpex <- quantile(pexpen,c(.2,.4,.6,.8))
qpex

#Create new variable "rpexpen"
commute[pexpen <= qpex[1],9] = 1
commute[pexpen > qpex[1] & pexpen <= qpex[2], 9] = 2
commute[pexpen > qpex[2] & pexpen <= qpex[3], 9] = 3
commute[pexpen > qpex[3] & pexpen <= qpex[4], 9] = 4
commute[pexpen > qpex[4], 9] = 5
names(commute)[9] <- "rpexpen"
boxplot(error1~rpexpen, data=commute,xlab="rpexpen",ylab="error1")
X11()

attach(commute)
reg2 <- lm(iexpense~density)
summary(reg2)
commute$error2 <- reg2$residuals
commute$piexpen <- reg2$fitted.values
qpiex <- quantile(commute$piexpen,c(.2,.4,.6,.8))
qpiex
attach(commute)
commute[piexpen <= qpiex[1],12] = 1
commute[piexpen > qpiex[1] & piexpen <= qpiex[2], 12] = 2
commute[piexpen > qpiex[2] & piexpen <= qpiex[3], 12] = 3
commute[piexpen > qpiex[3] & piexpen <= qpiex[4], 12] = 4
commute[piexpen > qpiex[4], 12] = 5
names(commute)[12] <- "rpiexpen"
boxplot(error2~rpiexpen, data=commute,xlab="rpiexpen",ylab="error2")
X11()

attach(commute)
reg3 <- lm(lexpense~density)
summary(reg3)
commute$error3 <- reg3$residuals
commute$plexpen <- reg3$fitted.values
attach(commute)
qplex <- quantile(plexpen,c(.2,.4,.6,.8))
qplex
commute[plexpen <= qplex[1],15] = 1
commute[plexpen > qplex[1] & plexpen <= qplex[2], 15] = 2
commute[plexpen > qplex[2] & plexpen <= qplex[3], 15] = 3
commute[plexpen > qplex[3] & plexpen <= qplex[4], 15] = 4
commute[plexpen > qplex[4], 15] = 5
names(commute)[15] <- "rplexpen"
boxplot(error3~rplexpen, data=commute,xlab="rplexpen",ylab="error3")
X11()

attach(commute)
reg4 <- lm(sexpense~density)
summary(reg4)
commute$error4 <- reg4$residuals
commute$psexpen <- reg4$fitted.values
attach(commute)
qpsex <- quantile(commute$psexpen,c(.2,.4,.6,.8))
qpsex
commute[psexpen <= qpsex[1],18] = 1
commute[psexpen > qpsex[1] & psexpen <= qpsex[2], 18] = 2
commute[psexpen > qpsex[2] & psexpen <= qpsex[3], 18] = 3
commute[psexpen > qpsex[3] & psexpen <= qpsex[4], 18] = 4
commute[psexpen > qpsex[4], 18] = 5
names(commute)[18] <- "rpsexpen"
boxplot(error4~rpsexpen, data=commute,xlab="rpsexpen",ylab="error4")

* SAS Code:

*********** QUESTIONS 1 & 2 ************;
DATA recall;
INFILE 'C:\recall.txt';
INPUT birthyr event eventyr recall;
IF (birthyr=28) THEN n=44;
IF (birthyr=32) THEN n=47;
IF (birthyr=34) THEN n=67;
IF (birthyr=35) THEN n=51;
IF (birthyr=36) THEN n=52;
IF (birthyr=37) THEN n=63;
IF (birthyr=38) THEN n=63;
IF (birthyr=39) THEN n=78;
IF (birthyr=40) THEN n=73;
IF (birthyr=41) THEN n=80;
IF (birthyr=42) THEN n=87;
IF (birthyr=43) THEN n=94;
IF (birthyr=44) THEN n=87;
IF (birthyr=45) THEN n=66;
IF (birthyr=46) THEN n=97;
IF (birthyr=47) THEN n=86;
IF (birthyr=48) THEN n=65;
IF (birthyr=49) THEN n=48;
nrecall = (1/sqrt(n))*arsin(sqrt(recall));
cbirthyr = birthyr;
IF (birthyr=28) OR (birthyr=32) THEN cbirthyr=30;
IF (birthyr=34) OR (birthyr=35) THEN cbirthyr=34.5;
IF (birthyr=36) OR (birthyr=37) THEN cbirthyr=36.5;
IF (birthyr=38) OR (birthyr=39) THEN cbirthyr=38.5;
IF (birthyr=40) OR (birthyr=41) THEN cbirthyr=40.5;
IF (birthyr=42) OR (birthyr=43) THEN cbirthyr=42.5;
IF (birthyr=44) OR (birthyr=45) THEN cbirthyr=44.5;
IF (birthyr=46) OR (birthyr=47) THEN cbirthyr=46.5;
IF (birthyr=48) OR (birthyr=49) THEN cbirthyr=48.5;
ceventyr=eventyr;
/* Upper bound is 48, matching the SPSS recode (45 thru 48=46.5) */
IF (eventyr ge 45) AND (eventyr le 48) THEN ceventyr=46.5;
IF (eventyr=52) OR (eventyr=54) THEN ceventyr=53;
IF (eventyr=56) OR (eventyr=61) THEN ceventyr=58.5;
IF (eventyr=63) OR (eventyr=64) THEN ceventyr=63.5;
IF (eventyr=65) OR (eventyr=67) THEN ceventyr=66;
IF (eventyr=71) OR (eventyr=73) THEN ceventyr=72;
RUN;

PROC BOXPLOT data=recall;
PLOT nrecall*cbirthyr;
RUN;

PROC SORT data=recall;
BY ceventyr;
PROC BOXPLOT data=recall;
PLOT nrecall*ceventyr;
RUN;

PROC REG data=recall;
MODEL nrecall = birthyr eventyr;
RUN;

PROC CORR data=recall;
VAR nrecall birthyr eventyr;
RUN;

************* QUESTION 3 ***************;
/**** Copy commute.txt into your C: root directory. ****/
DATA commute;
INFILE 'C:/commute.txt';
INPUT density expense;
RUN;

PROC UNIVARIATE data=commute noprint;
VAR density;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;

DATA commute;
SET commute;
rdensity=density;
IF density <= 22 THEN rdensity=1;
IF density >22 AND density <= 118 THEN rdensity=2;
IF density >118 AND density <= 438 THEN rdensity=3;
IF density >438 AND density <= 827 THEN rdensity=4;
/* Strict inequality so that a density of exactly 827 stays in group 4 */
IF density > 827 THEN rdensity=5;
RUN;

PROC BOXPLOT;
PLOT expense*rdensity /BOXSTYLE= SCHEMATIC;
RUN;

DATA commute;
SET commute;
iexpense = 1/expense;
lexpense = log10(expense);
sexpense = sqrt(expense);
RUN;

PROC REG;
MODEL expense = density;
OUTPUT out=commute r=error1 p=pexpen;
RUN;

PROC UNIVARIATE data=commute noprint;
VAR pexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;

DATA commute;
SET commute;
rpexpen=pexpen;
IF pexpen <= 101 THEN rpexpen=1;
IF pexpen >101 AND pexpen <= 164 THEN rpexpen=2;
IF pexpen >164 AND pexpen <= 216 THEN rpexpen=3;
IF pexpen >216 AND pexpen <= 231 THEN rpexpen=4;
/* Strict inequality so that a value of exactly 231 stays in group 4 */
IF pexpen > 231 THEN rpexpen=5;
RUN;

PROC SORT;
BY rpexpen;
PROC BOXPLOT;
PLOT error1*rpexpen /BOXSTYLE= SCHEMATIC;
RUN;

PROC REG;
MODEL iexpense = density;
OUTPUT out=commute r=error2 p=piexpen;
RUN;

PROC UNIVARIATE data=commute noprint;
VAR piexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;

DATA commute;
SET commute;
rpiexpen=piexpen;
IF piexpen <= .00866 THEN rpiexpen=1;
IF piexpen > .00866 AND piexpen <= .0093 THEN rpiexpen=2;
IF piexpen > .0093 AND piexpen <= .01142 THEN rpiexpen=3;
IF piexpen > .01142 AND piexpen <= .014 THEN rpiexpen=4;
IF piexpen > .014 THEN rpiexpen=5;
RUN;

PROC SORT;
BY rpiexpen;
PROC BOXPLOT;
PLOT error2*rpiexpen /BOXSTYLE= SCHEMATIC;
RUN;

PROC REG;
MODEL lexpense = density;
OUTPUT out=commute r=error3 p=plexpen;
RUN;

PROC UNIVARIATE data=commute noprint;
VAR plexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;

DATA commute;
SET commute;
rplexpen=plexpen;
IF plexpen <= 1.9 THEN rplexpen=1;
IF plexpen > 1.9 AND plexpen <= 2.03 THEN rplexpen=2;
IF plexpen > 2.03 AND plexpen <= 2.138 THEN rplexpen=3;
IF plexpen > 2.138 AND plexpen <= 2.17 THEN rplexpen=4;
IF plexpen > 2.17 THEN rplexpen=5;
RUN;

PROC SORT;
BY rplexpen;
PROC BOXPLOT;
PLOT error3*rplexpen /BOXSTYLE= SCHEMATIC;
RUN;

PROC REG;
MODEL sexpense = density;
OUTPUT out=commute r=error4 p=psexpen;
RUN;

PROC UNIVARIATE data=commute noprint;
VAR psexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;

DATA commute;
SET commute;
rpsexpen=psexpen;
IF psexpen <= 9.31 THEN rpsexpen=1;
IF psexpen > 9.31 AND psexpen <= 11.28 THEN rpsexpen=2;
IF psexpen > 11.28 AND psexpen <= 12.9 THEN rpsexpen=3;
IF psexpen > 12.9 AND psexpen <= 13.39 THEN rpsexpen=4;
IF psexpen > 13.39 THEN rpsexpen=5;
RUN;

PROC SORT;
BY rpsexpen;
PROC BOXPLOT;
PLOT error4*rpsexpen /BOXSTYLE= SCHEMATIC;
RUN;
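
Background note for question 3c: the pairing of each transformation with a power of the mean can be sketched with a first-order (delta method) approximation. The symbols below are illustrative assumptions and do not appear in the lab itself: Y is the untransformed dependent variable, \mu its conditional mean, g the transformation, and c and k constants describing how the variance grows with the mean. This is a rough sketch of the reasoning, not a substitute for inspecting the diagnostic boxplots.

```latex
% Assumed notation (not from the lab): Y = dependent variable,
% \mu = E(Y), g = transformation, Var(Y) = c \mu^k for constants c, k.
\begin{align*}
\operatorname{Var}\{g(Y)\} &\approx \{g'(\mu)\}^{2}\,\operatorname{Var}(Y)
  = \{g'(\mu)\}^{2}\, c\,\mu^{k} \\[4pt]
g(y)=\sqrt{y}: \quad g'(\mu)=\tfrac{1}{2}\mu^{-1/2}
  &\;\Rightarrow\; \operatorname{Var}\{g(Y)\}\approx \tfrac{c}{4}\,\mu^{k-1}
  \quad\text{(constant when } k=1\text{)} \\[4pt]
g(y)=\log y: \quad g'(\mu)=\mu^{-1}
  &\;\Rightarrow\; \operatorname{Var}\{g(Y)\}\approx c\,\mu^{k-2}
  \quad\text{(constant when } k=2\text{)} \\[4pt]
g(y)=1/y: \quad g'(\mu)=-\mu^{-2}
  &\;\Rightarrow\; \operatorname{Var}\{g(Y)\}\approx c\,\mu^{k-4}
  \quad\text{(constant when } k=4\text{)}
\end{align*}
```

Under these assumptions, the square root corresponds to case (i), the logarithm to case (ii), and the inverse to case (iii). Using a base-10 logarithm, as the programs above do, only rescales the approximate variance by the constant 1/(ln 10)^2 and does not change which case it matches.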