Lab 5

Stat404
Fall 2009
Lab 5
This lab will make use of output from the following program:
get file='c:recall.sav'.
compute nrecall = (1/sqrt(n))*arsin(sqrt(recall)).
temporary.
recode birthyr(28,32=30)(34,35=34.5)(36,37=36.5)(38,39=38.5)
(40,41=40.5)(42,43=42.5)(44,45=44.5)(46,47=46.5)(48,49=48.5).
recode eventyr(45 thru 48=46.5)(52,54=53)(56,61=58.5)(63,64=63.5)
(65,67=66)(71,73=72).
examine vars=nrecall by eventyr,birthyr/plot boxplot
/stat none/nototal.
regression variables=nrecall,birthyr,eventyr/dep=nrecall/enter.
pearson corr nrecall birthyr eventyr.
data list list file='c:commute.txt' / density expense.
examine vars=density/percentiles(20,40,60,80)/plot none/stat none.
compute rdensity=density.
recode rdensity(lo thru 22=1)(22.01 thru 118=2)(118.01 thru 438=3)
(438.01 thru 827=4)(827.01 thru hi=5).
examine vars=expense by rdensity/plot boxplot/stat none/nototal.
compute iexpense=1/expense.
compute lexpense=lg10(expense).
compute sexpense=sqrt(expense).
regression vars=density,expense/dep=expense/enter
/save=resid(error1) pred(pexpen).
examine vars=pexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rpexpen=pexpen.
recode rpexpen(lo thru 101=1)(101.01 thru 164=2)(164.01 thru 216=3)
(216.01 thru 231=4)(231.01 thru hi=5).
examine vars=error1 by rpexpen/plot boxplot/stat none/nototal.
regression vars=density,iexpense/dep=iexpense/enter
/save=resid(error2) pred(piexpen).
examine vars=piexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rpiexpen=piexpen.
recode rpiexpen(lo thru .00866=1)(.008661 thru .0093=2)
(.009301 thru .01142=3)(.011421 thru .014=4)(.01401 thru hi=5).
examine vars=error2 by rpiexpen/plot boxplot/stat none/nototal.
regression vars=density,lexpense/dep=lexpense/enter
/save=resid(error3) pred(plexpen).
examine vars=plexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rplexpen=plexpen.
recode rplexpen(lo thru 1.9=1)(1.901 thru 2.03=2)(2.031 thru 2.138=3)
(2.139 thru 2.17=4)(2.171 thru hi=5).
examine vars=error3 by rplexpen/plot boxplot/stat none/nototal.
regression vars=density,sexpense/dep=sexpense/enter
/save=resid(error4) pred(psexpen).
examine vars=psexpen/percentiles(20,40,60,80)/plot none/stat none.
compute rpsexpen=psexpen.
recode rpsexpen(lo thru 9.31=1)(9.311 thru 11.28=2)(11.281 thru 12.9=3)
(12.901 thru 13.39=4)(13.391 thru hi=5).
examine vars=error4 by rpsexpen/plot boxplot/stat none/nototal.
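The compute step for nrecall near the top of the program applies an arcsine-square-root transformation to a proportion and then multiplies the result by 1/sqrt(n). A minimal R sketch of that arithmetic follows. The proportions and counts are made-up illustrative values (not data from recall.sav), and transform_recall is a hypothetical helper name.

```r
# Hypothetical helper mirroring the SPSS line
#   compute nrecall = (1/sqrt(n))*arsin(sqrt(recall)).
transform_recall <- function(recall, n) {
  # Proportions must lie in [0, 1] and counts must be positive
  stopifnot(all(recall >= 0 & recall <= 1), all(n > 0))
  (1 / sqrt(n)) * asin(sqrt(recall))
}

# Made-up proportions and group sizes, for illustration only
p <- c(0.10, 0.50, 0.90)
n <- c(44, 44, 100)

angles  <- asin(sqrt(p))           # unweighted arcsine-square-root values (radians)
nrecall <- transform_recall(p, n)  # weighted by the inverse square root of n

print(round(angles, 4))
print(round(nrecall, 4))
```

Printing both vectors side by side shows how the weighting changes the scale of the transformed values relative to the plain arcsine-square-root values.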
1. When data on one's dependent variable are proportions (as with the
RECALL variable), one often transforms these proportions using the
arcsin-square root transformation. The above program will provide you
with a plot of NRECALL by EVENTYR, where NRECALL is RECALL after it has
been transformed by taking its arcsin-square root and by weighting it
by the inverse-square root of the number of observations from which it
was computed. Examine the box plot of NRECALL by EVENTYR. Are any
assumptions now met that were not met in the corresponding boxplot
(i.e., of RECALL by EVENTYR) obtained in Lab 1? If yes, how can you
tell?
2. Regress NRECALL on BIRTHYR and EVENTYR. Express the unstandardized
regression equation in words. What is the proportion of the variance
in NRECALL that is explained by both independent variables together?
Finally, what proportion of "the variance in NRECALL that is not
explained by EVENTYR" is explained by BIRTHYR? (Hints: NRECALL does
not have units that are easy to describe. If you wish, you may simply
refer to "units on the transformed recall measure." Also, note that
the materials you received with Lab 1 provide a table in which
proportions are listed for every BIRTHYR within each EVENTYR. In an
analysis of a data set that contained each of these proportions you
would discover that BIRTHYR and EVENTYR would have a zero correlation.
[Can you see why?] When the class's data set was created, zero cells
from the table were omitted, and as a consequence you will find in your
analyses that BIRTHYR and EVENTYR are collinear.)
3. The cost of transporting children to school is higher in sparsely
populated Long Island school districts than in densely populated ones.
A data set called “commute.txt” is provided (via a link in our class
web site) that allows you to estimate the annual transportation
expenses of students (in dollars per pupil) as a function of the
population density (in students per square mile) of the school district
within which they live. The data set contains one line of data for
each school district. Each line of data contains two numbers: The
population density of a school district and the per pupil
annual transportation costs to school.
a. Use SPSS to obtain a boxplot of transportation expenses by
population density (grouped into quintiles, as RDENSITY is in the
above program).
b. Regress transportation expenses on population density and
obtain a boxplot of the residual expense values by the
estimated expense values. Note that the conditional variance
of the dependent variable increases as the dependent variable takes
larger and larger values.
c. Perform square root, logarithmic, and inverse transformations on
the dependent variable and rerun the regression three times (once
with each of the transformed dependent variables). From each of
these regressions obtain a diagnostic plot as in part b. Does the
dependent variable's variance appear to increase in proportion to
(i) its mean, (ii) its mean squared, or (iii) its mean to the fourth
power? (Differently put, which transformation yields the greatest
reduction of heteroscedasticity in the data?) State in words the
meanings of the regression coefficient and constant from the
regression with the most homoscedastic variances. Be sure to take
into account the new meaning your transformation gives to the
dependent variable. (Hints: When deciding which transformation is
best, it sometimes happens that two transformations do nearly as
well as each other. Under such conditions, it makes sense to take
other criteria into consideration. For example, which
transformation yields units on the dependent variable that are
easiest to interpret?)
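The link between options (i) to (iii) in part c and the three transformations can be sketched with a first-order (delta-method) approximation. The symbols Y, \mu, c, and g are notation introduced here for illustration; they do not appear in the handout itself.

```latex
% Notation (assumed for illustration): Y is the dependent variable,
% \mu = E(Y), c > 0 is a constant, and g is the transformation applied to Y.
\[
  \operatorname{Var}\bigl[g(Y)\bigr] \;\approx\; \bigl[g'(\mu)\bigr]^{2}\,\operatorname{Var}(Y)
\]
\begin{align*}
  \text{(i)}\;\;   \operatorname{Var}(Y) &= c\,\mu     & g(Y) &= \sqrt{Y}     & g'(\mu) &= \frac{1}{2\sqrt{\mu}}   & \operatorname{Var}\bigl[g(Y)\bigr] &\approx \frac{c}{4} \\
  \text{(ii)}\;\;  \operatorname{Var}(Y) &= c\,\mu^{2} & g(Y) &= \log_{10} Y  & g'(\mu) &= \frac{1}{\mu \ln 10}     & \operatorname{Var}\bigl[g(Y)\bigr] &\approx \frac{c}{(\ln 10)^{2}} \\
  \text{(iii)}\;\; \operatorname{Var}(Y) &= c\,\mu^{4} & g(Y) &= \frac{1}{Y}  & g'(\mu) &= -\frac{1}{\mu^{2}}       & \operatorname{Var}\bigl[g(Y)\bigr] &\approx c
\end{align*}
```

In each row the approximate variance of the transformed variable no longer depends on the mean, so the transformation whose residual boxplots show the most even spread points to the matching variance pattern.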
Below please find R and SAS code for these problems:
# R
# Code:
############# QUESTION 1 ###############
#Read in dataset and name the variables
#You must have recall.txt in your root directory
recall <- read.table("C:/recall.txt",header=F)
colnames(recall) <- c("birthyr","event","eventyr","recall")
attach(recall)
#Assign new variable "n" based on value of birthyr
recall[birthyr==28,5]=44
recall[birthyr==32,5]=47
recall[birthyr==34,5]=67
recall[birthyr==35,5]=51
recall[birthyr==36,5]=52
recall[birthyr==37,5]=63
recall[birthyr==38,5]=63
recall[birthyr==39,5]=78
recall[birthyr==40,5]=73
recall[birthyr==41,5]=80
recall[birthyr==42,5]=87
recall[birthyr==43,5]=94
recall[birthyr==44,5]=87
recall[birthyr==45,5]=66
recall[birthyr==46,5]=97
recall[birthyr==47,5]=86
recall[birthyr==48,5]=65
recall[birthyr==49,5]=48
#Assign variable name for new variable
names(recall)[5] <- "n"
#Compute new variable "nrecall"
recall$nrecall <- (1/sqrt(recall$n))*asin(sqrt(recall$recall))
#Collapse birthyr into a new variable called "cbirthyr"
recall$cbirthyr <- recall$birthyr
recall[birthyr==28 | birthyr==32,7]=30
recall[birthyr==34 | birthyr==35,7]=34.5
recall[birthyr==36 | birthyr==37,7]=36.5
recall[birthyr==38 | birthyr==39,7]=38.5
recall[birthyr==40 | birthyr==41,7]=40.5
recall[birthyr==42 | birthyr==43,7]=42.5
recall[birthyr==44 | birthyr==45,7]=44.5
recall[birthyr==46 | birthyr==47,7]=46.5
recall[birthyr==48 | birthyr==49,7]=48.5
#Collapse eventyr into a new variable named "ceventyr"
recall$ceventyr <- recall$eventyr
recall[eventyr >=45 & eventyr <= 48,8]=46.5
recall[eventyr==52 | eventyr==54,8]=53
recall[eventyr==56 | eventyr==61,8]=58.5
recall[eventyr==63 | eventyr==64,8]=63.5
recall[eventyr==65 | eventyr==67,8]=66
recall[eventyr==71 | eventyr==73,8]=72
colnames(recall)[8] <- "ceventyr"
#Boxplots
boxplot(nrecall~cbirthyr,data=recall,xlab="Year of birth",
        ylab="Proportion recalling event (transformed)")
X11()
boxplot(nrecall~ceventyr,data=recall,xlab="Year of event",
        ylab="Proportion recalling event (transformed)")
############# QUESTION 2 ###############
reg <- lm(nrecall~birthyr+eventyr,data=recall)
summary(reg)
cor(recall[,c(1,3,6)])
X11()
############# QUESTION 3 ###############
commute <- read.table("commute.txt")
colnames(commute) <- c("density", "expense")
attach(commute)
#Quantiles
qdens <- quantile(density,c(.2,.4,.6,.8))
qdens
#Note that R computes quantiles a bit differently from SPSS, but that
#won't matter for any analysis here.
#Create new variable "rdensity"
commute[density <= qdens[1],3] = 1
commute[density > qdens[1] & density <= qdens[2], 3] = 2
commute[density > qdens[2] & density <= qdens[3], 3] = 3
commute[density > qdens[3] & density <= qdens[4], 3] = 4
commute[density > qdens[4], 3] = 5
names(commute)[3] <- "rdensity"
boxplot(expense~rdensity, data=commute,xlab="rdensity",ylab="expense")
X11()
#Transform variables
commute$iexpense <- 1/commute$expense
commute$lexpense <- log10(commute$expense)
commute$sexpense <- sqrt(commute$expense)
#Regression
reg1 <- lm(expense~density)
summary(reg1)
#Save residuals and predicteds
commute$error1 <- reg1$residuals
commute$pexpen <- reg1$fitted.values
attach(commute)
qpex <- quantile(pexpen,c(.2,.4,.6,.8))
qpex
#Create new variable "rpexpen"
commute[pexpen <= qpex[1],9] = 1
commute[pexpen > qpex[1] & pexpen <= qpex[2], 9] = 2
commute[pexpen > qpex[2] & pexpen <= qpex[3], 9] = 3
commute[pexpen > qpex[3] & pexpen <= qpex[4], 9] = 4
commute[pexpen > qpex[4], 9] = 5
names(commute)[9] <- "rpexpen"
boxplot(error1~rpexpen,
data=commute,xlab="rpexpen",ylab="error1")
X11()
attach(commute)
reg2 <- lm(iexpense~density)
summary(reg2)
commute$error2 <- reg2$residuals
commute$piexpen <- reg2$fitted.values
qpiex <- quantile(commute$piexpen,c(.2,.4,.6,.8))
qpiex
attach(commute)
commute[piexpen <= qpiex[1],12] = 1
commute[piexpen > qpiex[1] & piexpen <= qpiex[2], 12] = 2
commute[piexpen > qpiex[2] & piexpen <= qpiex[3], 12] = 3
commute[piexpen > qpiex[3] & piexpen <= qpiex[4], 12] = 4
commute[piexpen > qpiex[4], 12] = 5
names(commute)[12] <- "rpiexpen"
boxplot(error2~rpiexpen, data=commute,xlab="rpiexpen",ylab="error2")
X11()
attach(commute)
reg3 <- lm(lexpense~density)
summary(reg3)
commute$error3 <- reg3$residuals
commute$plexpen <- reg3$fitted.values
attach(commute)
qplex <- quantile(plexpen,c(.2,.4,.6,.8))
qplex
commute[plexpen <= qplex[1],15] = 1
commute[plexpen > qplex[1] & plexpen <= qplex[2], 15] = 2
commute[plexpen > qplex[2] & plexpen <= qplex[3], 15] = 3
commute[plexpen > qplex[3] & plexpen <= qplex[4], 15] = 4
commute[plexpen > qplex[4], 15] = 5
names(commute)[15] <- "rplexpen"
boxplot(error3~rplexpen, data=commute,xlab="rplexpen",ylab="error3")
X11()
attach(commute)
reg4 <- lm(sexpense~density)
summary(reg4)
commute$error4 <- reg4$residuals
commute$psexpen <- reg4$fitted.values
attach(commute)
qpsex <- quantile(commute$psexpen,c(.2,.4,.6,.8))
qpsex
commute[psexpen <= qpsex[1],18] = 1
commute[psexpen > qpsex[1] & psexpen <= qpsex[2], 18] = 2
commute[psexpen > qpsex[2] & psexpen <= qpsex[3], 18] = 3
commute[psexpen > qpsex[3] & psexpen <= qpsex[4], 18] = 4
commute[psexpen > qpsex[4], 18] = 5
names(commute)[18] <- "rpsexpen"
boxplot(error4~rpsexpen, data=commute,xlab="rpsexpen",ylab="error4")
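Each binning step in the Question 3 code repeats the same pattern: take the 20th, 40th, 60th, and 80th percentiles, then assign group codes 1 to 5. As one possible shortcut, that pattern can be wrapped in a small function built on base R's quantile() and cut(). The name quintile_group is hypothetical, and the data below are simulated stand-ins, not the values in commute.txt.

```r
# Hypothetical helper: assign each value to one of five groups (1..5)
# using the 20th/40th/60th/80th percentiles as cut points.
quintile_group <- function(x) {
  breaks <- quantile(x, c(.2, .4, .6, .8))
  # right = TRUE builds intervals (a, b], so a value equal to a cut point
  # falls in the lower group, matching the "<=" comparisons in the script.
  # -Inf and Inf catch both tails. cut() stops with an error if two
  # percentiles coincide (heavily tied data).
  as.integer(cut(x, breaks = c(-Inf, breaks, Inf), right = TRUE))
}

# Simulated stand-in data (NOT the commute.txt values)
set.seed(404)
demo <- data.frame(density = runif(50, 1, 1500))
demo$expense <- 250 - 0.1 * demo$density + rnorm(50, sd = 20)

# Same workflow as the script: regress, save residuals, bin the fitted values
fit <- lm(expense ~ density, data = demo)
demo$error1  <- residuals(fit)
demo$rpexpen <- quintile_group(fitted(fit))

print(table(demo$rpexpen))
boxplot(error1 ~ rpexpen, data = demo, xlab = "rpexpen", ylab = "error1")
```

Because the helper uses column names rather than column positions, it avoids having to track which numbered column each new grouping variable occupies.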
* SAS
* Code:
*********** QUESTIONS 1 & 2 ************;
DATA recall;
INFILE 'C:\recall.txt';
INPUT birthyr event eventyr recall;
IF (birthyr=28) THEN n=44;
IF (birthyr=32) THEN n=47;
IF (birthyr=34) THEN n=67;
IF (birthyr=35) THEN n=51;
IF (birthyr=36) THEN n=52;
IF (birthyr=37) THEN n=63;
IF (birthyr=38) THEN n=63;
IF (birthyr=39) THEN n=78;
IF (birthyr=40) THEN n=73;
IF (birthyr=41) THEN n=80;
IF (birthyr=42) THEN n=87;
IF (birthyr=43) THEN n=94;
IF (birthyr=44) THEN n=87;
IF (birthyr=45) THEN n=66;
IF (birthyr=46) THEN n=97;
IF (birthyr=47) THEN n=86;
IF (birthyr=48) THEN n=65;
IF (birthyr=49) THEN n=48;
nrecall = (1/sqrt(n))*arsin(sqrt(recall));
cbirthyr = birthyr;
IF (birthyr=28) OR (birthyr=32) THEN cbirthyr=30;
IF (birthyr=34) OR (birthyr=35) THEN cbirthyr=34.5;
IF (birthyr=36) OR (birthyr=37) THEN cbirthyr=36.5;
IF (birthyr=38) OR (birthyr=39) THEN cbirthyr=38.5;
IF (birthyr=40) OR (birthyr=41) THEN cbirthyr=40.5;
IF (birthyr=42) OR (birthyr=43) THEN cbirthyr=42.5;
IF (birthyr=44) OR (birthyr=45) THEN cbirthyr=44.5;
IF (birthyr=46) OR (birthyr=47) THEN cbirthyr=46.5;
IF (birthyr=48) OR (birthyr=49) THEN cbirthyr=48.5;
ceventyr=eventyr;
IF (eventyr ge 45) AND (eventyr le 48) THEN ceventyr=46.5;
IF (eventyr=52) OR (eventyr=54) THEN ceventyr=53;
IF (eventyr=56) OR (eventyr=61) THEN ceventyr=58.5;
IF (eventyr=63) OR (eventyr=64) THEN ceventyr=63.5;
IF (eventyr=65) OR (eventyr=67) THEN ceventyr=66;
IF (eventyr=71) OR (eventyr=73) THEN ceventyr=72;
RUN;
PROC BOXPLOT data=recall;
PLOT nrecall*cbirthyr;
RUN;
PROC SORT data=recall;
BY ceventyr;
PROC BOXPLOT data=recall;
PLOT nrecall*ceventyr;
RUN;
PROC REG data=recall;
MODEL nrecall = birthyr eventyr;
RUN;
PROC CORR data=recall;
VAR nrecall birthyr eventyr;
RUN;
************* QUESTION 3 ***************;
/**** Copy commute.txt into your C: root directory. ****/
DATA commute;
INFILE 'C:/commute.txt';
INPUT density expense;
RUN;
PROC UNIVARIATE data=commute noprint;
VAR density;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA commute;
SET commute;
rdensity=density;
IF density <= 22 THEN rdensity=1;
IF density >22 AND density <= 118 THEN rdensity=2;
IF density >118 AND density <= 438 THEN rdensity=3;
IF density >438 AND density <= 827 THEN rdensity=4;
IF density > 827 THEN rdensity=5;
RUN;
PROC BOXPLOT;
PLOT expense*rdensity /BOXSTYLE= SCHEMATIC;
RUN;
DATA commute;
SET commute;
iexpense = 1/expense;
lexpense = log10(expense);
sexpense = sqrt(expense);
RUN;
PROC REG;
MODEL expense = density;
OUTPUT out=commute r=error1 p=pexpen;
RUN;
PROC UNIVARIATE data=commute noprint;
VAR pexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA commute;
SET commute;
rpexpen=pexpen;
IF pexpen <= 101 THEN rpexpen=1;
IF pexpen >101 AND pexpen <= 164 THEN rpexpen=2;
IF pexpen >164 AND pexpen <= 216 THEN rpexpen=3;
IF pexpen >216 AND pexpen <= 231 THEN rpexpen=4;
IF pexpen > 231 THEN rpexpen=5;
RUN;
PROC SORT;
BY rpexpen;
PROC BOXPLOT;
PLOT error1*rpexpen /BOXSTYLE= SCHEMATIC;
RUN;
PROC REG;
MODEL iexpense = density;
OUTPUT out=commute r=error2 p=piexpen;
RUN;
PROC UNIVARIATE data=commute noprint;
VAR piexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA commute;
SET commute;
rpiexpen=piexpen;
IF piexpen <= .00866 THEN rpiexpen=1;
IF piexpen > .00866 AND piexpen <= .0093 THEN rpiexpen=2;
IF piexpen > .0093 AND piexpen <= .01142 THEN rpiexpen=3;
IF piexpen > .01142 AND piexpen <= .014 THEN rpiexpen=4;
IF piexpen > .014 THEN rpiexpen=5;
RUN;
PROC SORT;
BY rpiexpen;
PROC BOXPLOT;
PLOT error2*rpiexpen /BOXSTYLE= SCHEMATIC;
RUN;
PROC REG;
MODEL lexpense = density;
OUTPUT out=commute r=error3 p=plexpen;
RUN;
PROC UNIVARIATE data=commute noprint;
VAR plexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA commute;
SET commute;
rplexpen=plexpen;
IF plexpen <= 1.9 THEN rplexpen=1;
IF plexpen > 1.9 AND plexpen <= 2.03 THEN rplexpen=2;
IF plexpen > 2.03 AND plexpen <= 2.138 THEN rplexpen=3;
IF plexpen > 2.138 AND plexpen <= 2.17 THEN rplexpen=4;
IF plexpen > 2.17 THEN rplexpen=5;
RUN;
PROC SORT;
BY rplexpen;
PROC BOXPLOT;
PLOT error3*rplexpen /BOXSTYLE= SCHEMATIC;
RUN;
PROC REG;
MODEL sexpense = density;
OUTPUT out=commute r=error4 p=psexpen;
RUN;
PROC UNIVARIATE data=commute noprint;
VAR psexpen;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=p;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA commute;
SET commute;
rpsexpen=psexpen;
IF psexpen <= 9.31 THEN rpsexpen=1;
IF psexpen > 9.31 AND psexpen <= 11.28 THEN rpsexpen=2;
IF psexpen > 11.28 AND psexpen <= 12.9 THEN rpsexpen=3;
IF psexpen > 12.9 AND psexpen <= 13.39 THEN rpsexpen=4;
IF psexpen > 13.39 THEN rpsexpen=5;
RUN;
PROC SORT;
BY rpsexpen;
PROC BOXPLOT;
PLOT error4*rpsexpen /BOXSTYLE= SCHEMATIC;
RUN;
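For the last part of problem 2, one standard way to express "the proportion of the variance not explained by one predictor that is explained by another" is the squared partial correlation, which can be computed from the R-squared values of two nested regressions. The R sketch below illustrates that arithmetic on simulated data. The variable names mirror the lab's, but the values are invented and partial_r2 is a hypothetical helper, so treat it as an illustration rather than the intended solution.

```r
# Hypothetical helper: share of the variance left unexplained by the
# reduced model that the full model additionally explains.
partial_r2 <- function(full, reduced) {
  r2_full    <- summary(full)$r.squared
  r2_reduced <- summary(reduced)$r.squared
  (r2_full - r2_reduced) / (1 - r2_reduced)
}

# Simulated stand-in data (NOT the recall data); the predictors are
# deliberately constructed to be mildly collinear.
set.seed(5)
sim <- data.frame(eventyr = runif(80, 45, 73))
sim$birthyr <- 28 + 0.2 * (sim$eventyr - 45) + runif(80, 0, 15)
sim$nrecall <- 0.02 + 0.001 * sim$eventyr - 0.0005 * sim$birthyr +
  rnorm(80, sd = 0.01)

full    <- lm(nrecall ~ birthyr + eventyr, data = sim)
reduced <- lm(nrecall ~ eventyr, data = sim)
pr2 <- partial_r2(full, reduced)

# Cross-check: squared correlation between the parts of nrecall and
# birthyr that are left over after removing eventyr from each.
e_y <- residuals(lm(nrecall ~ eventyr, data = sim))
e_x <- residuals(lm(birthyr ~ eventyr, data = sim))
check <- cor(e_y, e_x)^2

print(c(partial_r2 = pr2, residual_check = check))
```

The two printed numbers agree, which shows that the nested-model formula and the residual-on-residual correlation describe the same quantity.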