Lab 6

Stat404
Fall 2009
1. The following program will be needed to do this problem:
data list list file='c:animals.txt' / id body brain.
compute ibody=1/body.
compute lbody=lg10(body).
compute sbody=sqrt(body).
compute ibrain=1/brain.
compute lbrain=lg10(brain).
compute sbrain=sqrt(brain).
examine vars=body/percentiles(20,40,60,80)/plot none/stat none.
compute rbody=body.
recode rbody(lo thru .248=1)(.2481 thru 1.452=2)(1.4521 thru 4.226=3)
(4.2261 thru 71.2=4)(71.2001 thru hi=5).
examine vars=brain by rbody / plot boxplot / stat none / nototal.
regression vars=body,brain / dep=brain / enter
/save=resid(error5) pred(pbrain).
examine vars=pbrain/percentiles(20,40,60,80)/plot none/stat none.
compute rpbrain=pbrain.
recode rpbrain(lo thru 91.244=1)(91.2441 thru 92.407=2)
(92.4071 thru 95.088=3)(95.0881 thru 159.818=4)(159.8181 thru hi=5).
examine vars=error5 by rpbrain / plot boxplot / stat none / nototal.
regression vars=ibody,ibrain/ dep=ibrain / enter
/save=resid(error6) pred(pibrain).
examine vars=pibrain/percentiles(20,40,60,80)/plot none/stat none.
compute rpibrain=pibrain.
recode rpibrain(lo thru .1362=1)(.13621 thru .1448=2)
(.14481 thru .1623=3)(.16231 thru .2958=4)(.29581 thru hi=5).
examine vars=error6 by rpibrain / plot boxplot / stat none / nototal.
regression vars=lbody,lbrain/ dep=lbrain / enter
/save=resid(error7) pred(plbrain).
examine vars=plbrain/percentiles(20,40,60,80)/plot none/stat none.
compute rplbrain=plbrain.
recode rplbrain(lo thru .4676=1)(.46761 thru 1.0483=2)
(1.04831 thru 1.3976=3)(1.39761 thru 2.3156=4)(2.31561 thru hi=5).
examine vars=error7 by rplbrain / plot boxplot / stat none / nototal.
regression vars=sbody,sbrain/ dep=sbrain / enter
/save=resid(error8) pred(psbrain).
examine vars=psbrain/percentiles(20,40,60,80)/plot none/stat none.
compute rpsbrain=psbrain.
recode rpsbrain(lo thru 3.8488=1)(3.84881 thru 4.5702=2)
(4.57021 thru 5.4373=3)(5.43731 thru 11.9124=4)(11.91241 thru hi=5).
examine vars=error8 by rpsbrain / plot boxplot / stat none / nototal.
Weisberg (1985, Table 6.6 on pp. 144-5) presents data on
brain and body weights for 62 species of mammals. These data
are provided (via our class web site’s Assignments page). The
file contains one line of data for each species. (E.g.,
"Man" is on line 32.) Each line of data contains three
numbers: A sequence number (that allows you to identify
individual species), body weight in kilograms, and brain
weight in grams.
Notice that it makes no sense to speak of brain or body weight as
causal. (At most, one might expect a positive association between the
two.) As it turns out, the variance of each variable increases with
the magnitude of the other.
a. Obtain a boxplot of brain weight by body weight using SPSS, R, or
SAS.
b. Regress brain weight on body weight and obtain a box plot of the
residual brain weight values by the estimated brain weight values.
Using this diagnostic plot show that the conditional variance of the
dependent variable increases as the dependent variable takes larger
and larger values.
c. Perform square root, logarithmic, and inverse transformations on
the dependent and independent variables and rerun the regression
three times (once with each pair of transformed variables). From
each of these regressions obtain a diagnostic plot as in part b. Do
the variables' variances appear to increase as a proportion of
(i) their means, (ii) their means squared, or (iii) their means to
the fourth power? (Differently put, which transformation yields the
greatest reduction of heteroscedasticity in the data?) State in
words the meanings of the regression coefficient and constant from
the regression with the most homoscedastic variances. Be sure to
take into account the new meaning your transformation gives to the
dependent variable. (Hint: When interpreting the constant, remember
that log(1)=0.)
d. Weisberg's Table 6.6 lists sequence numbers for identifying the
various species. In reference to the “regression with the most
homoscedastic variances” found in part c, the residual values from
this regression indicate how much transformed brain weight an animal
has above or below what one would estimate given its body weight.
Which animal’s brain weight is farthest above what one would
estimate given its body weight (i.e., which is smartest)? Which
animal’s brain weight is farthest below what one would estimate
given its body weight (i.e., which is dumbest)? (Hint: If you use
SPSS, go to the Data Editor and select “Data”, then “Sort Cases...”
Then select sort on the residuals from the appropriate regression
model estimated in part c. Which ids end up at the top
and bottom rows?)
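
For part c, a first-order (delta-method) approximation links each candidate transformation to one of the three variance patterns. This is a standard textbook sketch, not part of the original handout; the symbols \mu (the conditional mean of the untransformed variable) and g (the transformation) are labels introduced here for illustration.

```latex
% Delta-method approximation: variance of a transformed variable g(Y)
\[
\operatorname{Var}\bigl[g(Y)\bigr] \;\approx\; \bigl[g'(\mu)\bigr]^{2}\,\operatorname{Var}(Y),
\qquad \mu = E(Y).
\]
% The transformed variance is roughly constant when [g'(\mu)]^2 cancels
% the way Var(Y) grows with \mu:
\[
\begin{aligned}
\operatorname{Var}(Y) \propto \mu     &:\quad g(Y)=\sqrt{Y},  &\bigl[g'(\mu)\bigr]^{2} &\propto \mu^{-1},\\
\operatorname{Var}(Y) \propto \mu^{2} &:\quad g(Y)=\log Y,    &\bigl[g'(\mu)\bigr]^{2} &\propto \mu^{-2},\\
\operatorname{Var}(Y) \propto \mu^{4} &:\quad g(Y)=1/Y,       &\bigl[g'(\mu)\bigr]^{2} &\propto \mu^{-4}.
\end{aligned}
\]
```

The approximation is only a guide. The diagnostic boxplots from each regression remain the evidence for which pattern the data actually follow.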
2. Using the 1996 General Social Survey of U.S. adults,
determine whether people with high SES (socio-economic status)
are less likely than people with low SES to watch a lot of
television. To do this you must first construct an SES
measure. One means of doing this begins by combining distinct SES
measures (e.g., income, occupational prestige, and subjective class
identification) using principal components analysis. A measure of
hours spent watching TV each week is then regressed on this composite
measure. Do this using the following program:
import file='c:gss96.por'.
recode rincome(1=500)(2=2000)(3=3500)(4=4500)(5=5500)(6=6500)
(7=7500)(8=9000)(9=12500)(10=17500)(11=22500)(12=35000).
factor vars=class,prestg80,rincome/rotate=norotate/save (1 ses).
regression vars=ses1,tvhours/dep=tvhours/enter.
condescriptive ses1,tvhours.
a. The FACTOR routine in SPSS generates principal components when
the NOROTATE option is specified. These principal components are
standardized (i.e., in this case, SES1 has a mean of zero and a
variance equal to one). Yet despite the fact that the independent
variable is standardized in the regression of TVHOURS on SES1, the
unstandardized regression coefficient in the output from this
program does not equal the standardized coefficient. Why is this the
case? (Hint: Use algebra to derive one coefficient from the other.)
b. What benefit is there in combining CLASS, PRESTG80, and RINCOME
into a single measure?
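
The algebra mentioned in the hint to part a starts from the general relation between the two kinds of slope in a bivariate regression. The sketch below states that relation only. The symbols b, b^{*}, s_X, and s_Y are labels introduced here (they do not appear in the SPSS output under these names).

```latex
% Fitted equation in raw units:  \hat{Y} = a + bX
% Standardize both sides with z_X = (X - \bar{X})/s_X and z_Y = (Y - \bar{Y})/s_Y:
\[
\hat{z}_Y \;=\; \Bigl(b\,\frac{s_X}{s_Y}\Bigr)\, z_X
\qquad\Longrightarrow\qquad
b^{*} \;=\; b\,\frac{s_X}{s_Y},
\]
% where b is the unstandardized slope, b^{*} the standardized slope,
% and s_X, s_Y the sample standard deviations of X and Y.
```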
3. In the U.S. people tend to become more satisfied with their lives as
they reach ages in their late 80s and older. Some gerontologists argue
that this increase in life satisfaction results as old people learn to
accept "what they cannot do" (i.e., their physical limitations
resulting from declining health, etc.). Other gerontologists argue
that life satisfaction results only when old people are actively
involved with other people. You wish to test which theory (i.e.,
"acceptance theory" or "involvement theory") is correct.
You have data generated during face-to-face interviews with 63 U.S.
centenarian (i.e., over 100 year-old) nursing home residents. Three of
your variables are as follows:
Y = "life satisfaction" measured on a 100-point scale from
0 = no satisfaction with life to 100 = total satisfaction
with life
X = "acceptance" measured as the number of hours each week
that a respondent spends sitting in a rocking chair
and sighing
W = "involvement" measured as the number of hours each week that a
respondent spends interacting with other nursing home residents,
personnel, or visitors
Means, standard deviations, and correlations on these variables are...
                     Standard     Correlation Coefficients
Variable    Mean     Deviation      Y       X       W
Y            57         19         1.0
X            22         40         0.6     1.0
W            30         24        -0.1    -0.7     1.0
If one assumes the above correlations to reflect the true relations
among life satisfaction, acceptance, and involvement for the population
of all U.S. centenarian nursing home residents, one's regression model
would be misspecified if it only included the involvement measure as an
independent variable. What would be the bias in an unstandardized
slope estimated using this misspecified model? (Hints: You are being
asked to calculate a number here. Please give the units associated
with this number, and show how the number was calculated. Finally, you
should also assume that no additional important variables are excluded
from the model.)
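
The calculation requested here rests on the usual omitted-variable result for a model with two regressors. The sketch below gives the general formulas only, under the problem's own assumption that the reported moments are the population values. The Greek symbols and the subscripted names are labels introduced here for illustration.

```latex
% Correctly specified model and the misspecified model that omits X:
\[
Y = \alpha + \beta_X X + \beta_W W + \varepsilon,
\qquad
Y = a + b_W W + e .
\]
% Expected value of the slope from the misspecified model:
\[
E(b_W) \;=\; \beta_W + \beta_X\,\delta_{XW},
\qquad
\delta_{XW} \;=\; r_{XW}\,\frac{s_X}{s_W}
\quad\text{(slope from regressing $X$ on $W$)},
\]
% so the bias is \beta_X \delta_{XW}, with \beta_X recoverable from the table:
\[
\beta_X \;=\; \frac{s_Y}{s_X}\cdot\frac{r_{YX} - r_{YW}\,r_{XW}}{1 - r_{XW}^{2}} .
\]
```

The bias carries the units of the slope on W, that is, life-satisfaction points per weekly hour of interaction.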
Below please find R and SAS code for these problems:
# R
# Code:
########## QUESTION 1 ##########
#-------Read in the data-------#
animals <- read.table("C:/animals.txt")
colnames(animals) <- c("id", "body", "brain")
attach(animals)
#-------Create some new variables-------#
animals$ibody <- 1/body
animals$lbody <- log10(body)
animals$sbody <- sqrt(body)
animals$ibrain <- 1/brain
animals$lbrain <- log10(brain)
animals$sbrain<- sqrt(brain)
#-------Find quantiles of "body"-------#
qbody <- quantile(animals$body,c(.2,.4,.6,.8))
attach(animals)
#---Create "rbody", by collapsing "body" into 5 equal-sized groups.---#
animals[body <= qbody[1],10] = 1
animals[body > qbody[1] & body <= qbody[2], 10] = 2
animals[body > qbody[2] & body <= qbody[3], 10] = 3
animals[body > qbody[3] & body <= qbody[4], 10] = 4
animals[body > qbody[4], 10] = 5
names(animals)[10] <- "rbody"
#-------Boxplots of "brain", by groups created in previous step-------#
boxplot(brain~rbody, data=animals,xlab="rbody",ylab="brain")
abline(h=0,col="grey")
#Plots a horizontal line at 0#
X11()
#Opens a new graphics device#
#-------Regression of "brain" on "body"-------#
reg1 <- lm(brain~body)
summary(reg1)
#-------Save residuals from the regression as "error5". Save y-hats as
#       "pbrain".-------#
animals$error5 <- reg1$residuals
animals$pbrain <- reg1$fitted.values
#-------Collapse "pbrain" into 5 categories-------#
qpbrain <- quantile(animals$pbrain,c(.2,.4,.6,.8))
attach(animals)
animals[pbrain <= qpbrain[1],13] = 1
animals[pbrain > qpbrain[1] & pbrain <= qpbrain[2], 13] = 2
animals[pbrain > qpbrain[2] & pbrain <= qpbrain[3], 13] = 3
animals[pbrain > qpbrain[3] & pbrain <= qpbrain[4], 13] = 4
animals[pbrain > qpbrain[4], 13] = 5
names(animals)[13] <- "rpbrain"
#-------Boxplots of residuals, by collapsed category of y-hat-------#
boxplot(error5~rpbrain, data=animals,xlab="rpbrain",ylab="error5")
abline(h=0,col="grey")
X11()
#-------Now do the same thing with transformed variables-------#
#-------First the inverse-transformed variables-------#
reg2 <- lm(ibrain~ibody)
summary(reg2)
animals$error6 <- reg2$residuals
animals$pibrain <- reg2$fitted.values
qpibrain <- quantile(animals$pibrain,c(.2,.4,.6,.8))
attach(animals)
animals[pibrain <= qpibrain[1],16] = 1
animals[pibrain > qpibrain[1] & pibrain <= qpibrain[2], 16] = 2
animals[pibrain > qpibrain[2] & pibrain <= qpibrain[3], 16] = 3
animals[pibrain > qpibrain[3] & pibrain <= qpibrain[4], 16] = 4
animals[pibrain > qpibrain[4], 16] = 5
names(animals)[16] <- "rpibrain"
boxplot(error6~rpibrain, data=animals,xlab="rpibrain",ylab="error6")
abline(h=0,col="grey")
X11()
#-------Now the log-transformed variables-------#
reg3 <- lm(lbrain~lbody)
summary(reg3)
animals$error7 <- reg3$residuals
animals$plbrain <- reg3$fitted.values
qplbrain <- quantile(animals$plbrain,c(.2,.4,.6,.8))
attach(animals)
animals[plbrain <= qplbrain[1],19] = 1
animals[plbrain > qplbrain[1] & plbrain <= qplbrain[2], 19] = 2
animals[plbrain > qplbrain[2] & plbrain <= qplbrain[3], 19] = 3
animals[plbrain > qplbrain[3] & plbrain <= qplbrain[4], 19] = 4
animals[plbrain > qplbrain[4], 19] = 5
names(animals)[19] <- "rplbrain"
boxplot(error7~rplbrain, data=animals,xlab="rplbrain",ylab="error7")
abline(h=0,col="grey")
X11()
#-------Finally the square-root-transformed variables-------#
reg4 <- lm(sbrain~sbody)
summary(reg4)
animals$error8 <- reg4$residuals
animals$psbrain <- reg4$fitted.values
qpsbrain <- quantile(animals$psbrain,c(.2,.4,.6,.8))
attach(animals)
animals[psbrain <= qpsbrain[1],22] = 1
animals[psbrain > qpsbrain[1] & psbrain <= qpsbrain[2], 22] = 2
animals[psbrain > qpsbrain[2] & psbrain <= qpsbrain[3], 22] = 3
animals[psbrain > qpsbrain[3] & psbrain <= qpsbrain[4], 22] = 4
animals[psbrain > qpsbrain[4], 22] = 5
names(animals)[22] <- "rpsbrain"
boxplot(error8~rpsbrain,
data=animals,xlab="rpsbrain",ylab="error8")
abline(h=0,col="grey")
X11()
#-------Sorting by residuals-------#
species <- array(order(error6))
error.6 <- array(sort(error6))
cbind(species,error.6)
species <- array(order(error7))
error.7 <- array(sort(error7))
cbind(species,error.7)
species <- array(order(error8))
error.8 <- array(sort(error8))
cbind(species,error.8)
########## QUESTION 2 ##########
### Be sure to put gss96.csv (not .por) in your root directory
gss96 <- read.csv("C:/gss96.csv",header=T)
#-------Recode rincome into rincome2-------#
attach(gss96)
gss96[rincome == 1,5] <- 500
gss96[rincome == 2,5] = 2000
gss96[rincome == 3,5] = 3500
gss96[rincome == 4,5] = 4500
gss96[rincome == 5,5] = 5500
gss96[rincome == 6,5] = 6500
gss96[rincome == 7,5] = 7500
gss96[rincome == 8,5] = 9000
gss96[rincome == 9,5] = 12500
gss96[rincome == 10,5] = 17500
gss96[rincome == 11,5] = 22500
gss96[rincome == 12,5] = 35000
names(gss96)[5] <- "rincome2"
#-------Standardize the appropriate variables-------#
gss96$class.z <- (gss96$class-mean(gss96$class))/sd(gss96$class)
gss96$prestg80.z <- (gss96$prestg80-mean(gss96$prestg80))/sd(gss96$prestg80)
gss96$rincome2.z <- (gss96$rincome2-mean(gss96$rincome2))/sd(gss96$rincome2)
#-------Principal Components-------#
pc <- prcomp(gss96[,6:8], scale=T, center=T )
pc
ses <- 0.5445274*gss96$class.z +
0.6080401*gss96$prestg80.z + 0.5777346*gss96$rincome2.z
ses1 <- (ses-mean(ses))/sd(ses)
#-------Regression-------#
reg5 <- lm(tvhours~ses1)
summary(reg5)
#-------Descriptive statistics-------#
summary(ses1)
sd(ses1)
summary(gss96$tvhours)
sd(gss96$tvhours)
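
The R listing for Question 1 builds each five-group variable by assigning into a numbered column. A more compact way to form the same quintile groups is sketched below with cut(). The data vector is a hypothetical toy example, not the Weisberg values, and this is offered as one possible alternative rather than the handout's method.

```r
# Hypothetical toy data standing in for animals$body (not the real file).
body <- c(0.1, 0.5, 1.2, 3.0, 4.5, 10, 60, 85, 250, 2500)

# Same 20/40/60/80 percent cut points that the listing computes with quantile().
qbody <- quantile(body, c(.2, .4, .6, .8))

# cut() with right = TRUE keeps the "<=" upper bounds used in the listing;
# -Inf and Inf close the lowest and highest groups.
# labels = FALSE returns the integer group codes 1 to 5.
rbody <- cut(body, breaks = c(-Inf, qbody, Inf), labels = FALSE, right = TRUE)

# Count of cases per group.
table(rbody)
```

Because the groups come from one call, the column position of the new variable no longer matters, so the result can be stored with animals$rbody <- rbody instead of a numeric column index.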
* SAS
* Code:
/**** PROBLEM 1 ****/
/***Put animals.txt datafile in your C: root directory,***/
/*** or change line 2 below to point to where the file is.***/
/* Read in the data, perform transformations */
DATA animals;
INFILE 'C:\animals.txt';
INPUT id body brain;
ibody=1/body;
lbody=log10(body);
sbody=sqrt(body);
ibrain=1/brain;
lbrain=log10(brain);
sbrain=sqrt(brain);
RUN;
/* Obtain various percentiles of "body" */
PROC UNIVARIATE data=animals noprint;
VAR body;
OUTPUT out=percentiles PCTLPTS=20,40,60,80 PCTLPRE=body_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
/* Collapse "body" into "rbody" with 5 groups of equal size. */
DATA animals;
SET animals;
rbody=body;
IF body <= .248 THEN rbody=1;
IF body > .248 AND body <= 1.452 THEN rbody=2;
IF body > 1.4521 AND body <= 4.226 THEN rbody=3;
IF body > 4.2261 AND body <= 71.2 THEN rbody=4;
IF body >= 71.20001 THEN rbody=5;
RUN;
/* Obtain boxplot of "brain" at each level of "rbody" */
PROC SORT;
by rbody;
PROC BOXPLOT;
PLOT brain*rbody /BOXSTYLE= SCHEMATIC;
RUN;
/**********************************************************/
/* Regression of "brain" on "body". */
/*Y-hats saved as "pbrain" (smile). Residuals as "error5"*/
PROC REG;
MODEL brain = body;
OUTPUT out=animals r=error5 p=pbrain;
RUN;
/* Collapsing "pbrain" into 5 groups */
PROC UNIVARIATE data=animals noprint;
VAR pbrain;
OUTPUT out=percentiles PCTLPTS=20,40,60,80
PCTLPRE=pbrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
SET animals;
rpbrain=pbrain;
IF pbrain <= 91.244 THEN rpbrain=1;
IF pbrain > 91.2441 AND pbrain <= 92.407 THEN rpbrain=2;
IF pbrain > 92.4071 AND pbrain <= 95.088 THEN rpbrain=3;
IF pbrain > 95.0881 AND pbrain <= 159.818 THEN rpbrain=4;
IF pbrain >= 159.8181 THEN rpbrain=5;
RUN;
PROC SORT;
by rpbrain;
PROC BOXPLOT;
PLOT error5*rpbrain /BOXSTYLE= SCHEMATIC;
RUN;
/*********************************************************/
/* Now the same thing with our transformed variables */
/* First the inverse-transformed data */
PROC REG;
MODEL ibrain = ibody;
OUTPUT out=animals r=error6 p=pibrain;
RUN;
PROC UNIVARIATE data=animals noprint;
VAR pibrain;
OUTPUT out=percentiles PCTLPTS=20,40,60,80
PCTLPRE=pibrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
SET animals;
rpibrain=pibrain;
IF pibrain <= .1362 THEN rpibrain=1;
IF pibrain > .13621 AND pibrain <= .1448 THEN rpibrain=2;
IF pibrain > .14481 AND pibrain <= .1623 THEN rpibrain=3;
IF pibrain > .16231 AND pibrain <= .2958 THEN rpibrain=4;
IF pibrain >= .29581 THEN rpibrain=5;
RUN;
PROC SORT;
by rpibrain;
PROC BOXPLOT;
PLOT error6*rpibrain /BOXSTYLE= SCHEMATIC;
RUN;
/* Now the log-transformed data */
PROC REG;
MODEL lbrain = lbody;
OUTPUT out=animals r=error7 p=plbrain;
RUN;
PROC UNIVARIATE data=animals noprint;
VAR plbrain;
OUTPUT out=percentiles PCTLPTS=20,40,60,80
PCTLPRE=plbrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
SET animals;
rplbrain=plbrain;
IF plbrain <= .4676 THEN rplbrain=1;
IF plbrain > .46761 AND plbrain <= 1.0483 THEN rplbrain=2;
IF plbrain > 1.04831 AND plbrain <= 1.3976 THEN rplbrain=3;
IF plbrain > 1.39761 AND plbrain <= 2.3156 THEN rplbrain=4;
IF plbrain >= 2.31561 THEN rplbrain=5;
RUN;
PROC SORT;
by rplbrain;
PROC BOXPLOT;
PLOT error7*rplbrain /BOXSTYLE= SCHEMATIC;
RUN;
/* Finally the square-root transformed data */
PROC REG;
MODEL sbrain = sbody;
OUTPUT out=animals r=error8 p=psbrain;
RUN;
PROC UNIVARIATE data=animals noprint;
VAR psbrain;
OUTPUT out=percentiles PCTLPTS=20,40,60,80
PCTLPRE=psbrain_percentile;
RUN;
PROC PRINT data=percentiles;
RUN;
DATA animals;
SET animals;
rpsbrain=psbrain;
IF psbrain <= 3.8488 THEN rpsbrain=1;
IF psbrain > 3.84881 AND psbrain <= 4.5702 THEN rpsbrain=2;
IF psbrain > 4.57021 AND psbrain <= 5.4373 THEN rpsbrain=3;
IF psbrain > 5.43731 AND psbrain <= 11.9124 THEN rpsbrain=4;
IF psbrain >= 11.91241 THEN rpsbrain=5;
RUN;
PROC SORT;
by rpsbrain;
PROC BOXPLOT;
PLOT error8*rpsbrain /BOXSTYLE= SCHEMATIC;
RUN;
/*To aid in the final question in problem 1d */
PROC SORT;
by error6;
PROC PRINT data=animals NoObs;
VAR id error6;
TITLE 'Data sorted by error6';
RUN;
PROC SORT;
by error7;
PROC PRINT data=animals noobs;
VAR id error7;
TITLE 'Data sorted by error7';
RUN;
PROC SORT;
by error8;
PROC PRINT data=animals noobs;
VAR id error8;
TITLE 'Data sorted by error8';
RUN;
/**** PROBLEM 2 ****/
PROC IMPORT OUT=gss96
FILE="C:\gss96.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
GUESSINGROWS=10;
RUN;
DATA gss96;
SET gss96;
IF rincome = 1 THEN rincome2 = 500;
IF rincome = 2 THEN rincome2 = 2000;
IF rincome = 3 THEN rincome2 = 3500;
IF rincome = 4 THEN rincome2 = 4500;
IF rincome = 5 THEN rincome2 = 5500;
IF rincome = 6 THEN rincome2 = 6500;
IF rincome = 7 THEN rincome2 = 7500;
IF rincome = 8 THEN rincome2 = 9000;
IF rincome = 9 THEN rincome2 = 12500;
IF rincome = 10 THEN rincome2 = 17500;
IF rincome = 11 THEN rincome2 = 22500;
IF rincome = 12 THEN rincome2 = 35000;
RUN;
PROC UNIVARIATE;
VAR class prestg80 rincome2;
RUN;
DATA gss96;
SET gss96;
classz = (class-2.46927803)/0.6238;
prestg80z = (prestg80-43.674347)/13.70968;
rincome2z =( rincome2-22849.846)/12186;
RUN;
PROC FACTOR data=gss96 ROTATE=none;
VAR classz prestg80z rincome2z;
RUN;
DATA gss96;
SET gss96;
SES = 0.47555260*classz + 0.59295741*prestg80z +
0.53532292*rincome2z;
RUN;
PROC UNIVARIATE;
VAR SES;
RUN;
DATA gss96;
SET gss96;
SES1 = (SES-2.68754E-8)/1.17673;
RUN;
PROC REG data=gss96;
MODEL tvhours=SES1;
RUN;
PROC UNIVARIATE;
VAR SES1 tvhours;
RUN;