Lab 2

Stat404
Fall 2009
1. The Mexican Tourism Bureau (MTB) has hired you as part of
its effort to increase the amount of money that tourists spend
while visiting Mexico. In your first research project you are
asked to focus primarily on the clothing purchased by Korean tourists.
In particular, your boss at the MTB wants to know whether a Korean is
more likely to buy items of clothing in Mexico when they are of a
genuinely Mexican style unavailable in Korea, or when they are less
expensive than comparable (in quality and style) Korean clothes.
During the spring of this year you obtained permission from the Mexican
Border Police to interview Koreans who were to return to Korea from
Mexico between this June and August. While these interviews were
taking place, the luggage of the Koreans being interviewed was
carefully examined to determine the Mexican versus Korean "cultural
styles" of clothing items purchased in Mexico. (For example, Aztec
imagery would reflect a Mexican cultural style, whereas a pagoda-motif
would reflect a Korean cultural style.) Your data were collected
during interviews with a random sample of 27 of these Korean tourists.
The following are data on three of your variables.
C = the number of pieces of clothing purchased by the Korean tourist
    during his or her just-completed visit to Mexico
S = the percent of the Korean tourist's purchased-in-Mexico clothing
    items with Mexican cultural style (You may assume that 100 - S is
    the percent of these items with Korean cultural style.)
M = the amount of money (in Mexican pesos) that the Korean tourist
    spent on clothing during his or her just-completed visit to Mexico

Variable                  C        S        M
Mean                    3.9     53.2    371.7
Standard Deviation      1.7     11.7     83.0

Correlation Coefficients
          C        S        M
C       1.0
S       0.3      1.0
M       0.7     -0.4      1.0

(Be sure that you correctly read the correlation matrix. For example,
the correlation between C and S is rCS = .3 .)

a. Find rCS.M (i.e., the partial correlation between C and S
controlling for M). Is this partial correlation significantly
larger than zero? (Use the .05 significance level.)
b. Find the unstandardized regression equation (i.e.,
the constant and two slopes) for the regression of C on S
and M.
c. From the equation calculated in part b, express the
partial slope associated with the variable, S, in words
that a lay person could understand.
d. Your boss at the MTB prefers to have the variable, M, in units of
dollars rather than of pesos. Given that the exchange rate is 6.3
pesos per dollar, redo part b (i.e., find the unstandardized
regression equation) using dollar--rather than peso--units for the
variable, M. (Show your calculations!)
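
The calculations in parts a, b, and d can be cross-checked from the summary table alone. The following is a minimal sketch, assuming the standard textbook formulas for a first-order partial correlation and for two-predictor slopes computed from correlations and standard deviations; the function and variable names are illustrative and are not part of the lab, and the course may expect a different route to the same numbers.

```python
import math

# Summary statistics copied from the table in problem 1.
n = 27
mean_c, mean_s, mean_m = 3.9, 53.2, 371.7
sd_c, sd_s, sd_m = 1.7, 11.7, 83.0
r_cs, r_cm, r_sm = 0.3, 0.7, -0.4


def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def t_for_partial(r, n_obs, n_controls=1):
    """t statistic and degrees of freedom for testing a partial correlation."""
    df = n_obs - 2 - n_controls
    return r * math.sqrt(df / (1 - r ** 2)), df


def two_predictor_fit(mean_y, sd_y, mean_1, sd_1, mean_2, sd_2, r_y1, r_y2, r_12):
    """Constant and unstandardized slopes for y regressed on x1 and x2."""
    beta_1 = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2)  # standardized slopes
    beta_2 = (r_y2 - r_y1 * r_12) / (1 - r_12 ** 2)
    b_1 = beta_1 * sd_y / sd_1                       # back to original units
    b_2 = beta_2 * sd_y / sd_2
    a = mean_y - b_1 * mean_1 - b_2 * mean_2
    return a, b_1, b_2


# Part a: partial correlation and its t statistic (compare with a t table;
# the one-tailed .05 critical value for 24 degrees of freedom is about 1.711).
r_cs_m = partial_corr(r_cs, r_cm, r_sm)
t_stat, df = t_for_partial(r_cs_m, n)

# Part b: regression of C on S and M, with M in pesos.
a, b_s, b_m = two_predictor_fit(mean_c, sd_c, mean_s, sd_s, mean_m, sd_m,
                                r_cs, r_cm, r_sm)

# Part d: the same regression with M in dollars (6.3 pesos per dollar).
# Dividing M by a constant rescales its mean and standard deviation but
# leaves every correlation unchanged.
rate = 6.3
a_usd, b_s_usd, b_m_usd = two_predictor_fit(mean_c, sd_c, mean_s, sd_s,
                                            mean_m / rate, sd_m / rate,
                                            r_cs, r_cm, r_sm)

print("rCS.M =", round(r_cs_m, 3), " t =", round(t_stat, 2), " df =", df)
print("pesos:   a =", round(a, 3), " bS =", round(b_s, 4), " bM =", round(b_m, 4))
print("dollars: a =", round(a_usd, 3), " bS =", round(b_s_usd, 4), " bM =", round(b_m_usd, 4))
```

Comparing the two printed equations shows which coefficients a change of units affects: only the slope for M changes, by the factor 6.3.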
2. Show ALGEBRAICALLY that if

$$V = \frac{a + X}{b} \quad \text{and} \quad W = \frac{c + Y}{d}$$

(where X and Y are random variables, a and c are constants, and b and d are
positive constants), then $r_{XY} = r_{VW}$. (Note: As a direct
consequence of this problem, it follows that $r_{XY} = r_{z_X z_Y}$,
where $z_X$ and $z_Y$ are standardized forms of X and Y.)
Some hints:
a. Use the following formula for the correlation coefficient:
$$r_{VW} = \frac{\sum_{i=1}^{n} (V_i - \bar{V})(W_i - \bar{W})}{\sqrt{\sum_{i=1}^{n} (V_i - \bar{V})^2 \, \sum_{i=1}^{n} (W_i - \bar{W})^2}}$$
b. Replace each $V_i$ and $W_i$ in the formula with its expression in
terms of $X_i$ and $Y_i$ respectively. (Be sure to solve for $\bar{V}$
and $\bar{W}$, and to substitute these solutions into the formula as well.)
c. Simplify until you have the formula for $r_{XY}$.
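
Before attempting the algebra, the claim can be checked numerically. This is a sketch only, assuming made-up data and arbitrary constants (the seed, sample size, and the values of a, b, c, and d are all illustrative); it applies the correlation formula from hint a and does not replace the algebraic proof the problem asks for.

```python
import math
import random


def pearson_r(v, w):
    """Correlation coefficient computed with the formula from hint a."""
    n = len(v)
    v_bar = sum(v) / n
    w_bar = sum(w) / n
    numerator = sum((vi - v_bar) * (wi - w_bar) for vi, wi in zip(v, w))
    ss_v = sum((vi - v_bar) ** 2 for vi in v)
    ss_w = sum((wi - w_bar) ** 2 for wi in w)
    return numerator / math.sqrt(ss_v * ss_w)


# Made-up data: any two correlated variables would do.
random.seed(404)
x = [random.gauss(50, 10) for _ in range(200)]
y = [0.5 * xi + random.gauss(0, 8) for xi in x]

# Arbitrary constants; b and d must be positive, as the problem requires.
a, b, c, d = 3.0, 2.5, -7.0, 0.4
v = [(a + xi) / b for xi in x]
w = [(c + yi) / d for yi in y]

r_xy = pearson_r(x, y)
r_vw = pearson_r(v, w)
print("r_XY =", r_xy)
print("r_VW =", r_vw)
```

The two printed values agree up to floating-point rounding. Trying a negative b shows why the problem restricts b and d to positive constants.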
3. The following proofs should ensure that you understand
how unstandardized and standardized slopes are related:
a. Using summation notation and algebra give a proof that
whenever a variable is standardized, its mean equals 0
and its variance equals 1. Hints: Start with the formula
used in standardizing a variable (i.e., $z_X = \frac{X - \bar{X}}{\hat{\sigma}_X}$).
Then find the mean and variance of this formula.
b. Prove algebraically that

$$\hat{b} = r_{XY} \frac{\hat{\sigma}_Y}{\hat{\sigma}_X},$$

even when X and Y are not standardized. Hint: As in the previous
problem, use the following formula:

$$r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{SS_X \, SS_Y}}$$

c. Armed with the findings from parts a and b, plus the finding from
the previous problem that $r_{XY} = r_{z_X z_Y}$, prove algebraically
that the following two equalities hold if $z_X$ and $z_Y$ are
standardized:

$$\hat{z}_Y = r_{XY} z_X \quad \text{and} \quad \hat{z}_X = r_{XY} z_Y.$$

Hints: Start with a regression equation of the form
$\hat{Y} = \hat{a} + \hat{b} X$. Using the formulas for $\hat{a}$ and
$\hat{b}$, substitute $z_Y$ for $Y$ and $z_X$ for $X$. Then show that
$\hat{a} = 0$ and $\hat{b} = r_{XY}$. Do this a second time for the
regression of $z_X$ on $z_Y$.
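
The three results in problem 3 can be checked numerically before they are proved. The sketch below assumes made-up data and a sample standard deviation with n - 1 in the denominator (the handout does not say which denominator it intends; the identities hold for either choice as long as it is used consistently). All names are illustrative.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def sd(v):
    """Sample standard deviation (n - 1 in the denominator; an assumption)."""
    m = mean(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))


def standardize(v):
    m, s = mean(v), sd(v)
    return [(vi - m) / s for vi in v]


def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    numerator = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    ss_x = sum((xi - mx) ** 2 for xi in x)
    ss_y = sum((yi - my) ** 2 for yi in y)
    return numerator / math.sqrt(ss_x * ss_y)


def simple_ols(x, y):
    """Return (a_hat, b_hat) for the least-squares line y_hat = a_hat + b_hat * x."""
    mx, my = mean(x), mean(y)
    b_hat = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - b_hat * mx, b_hat


# Made-up data.
random.seed(3)
x = [random.gauss(20, 4) for _ in range(150)]
y = [5 + 1.5 * xi + random.gauss(0, 6) for xi in x]

# Part a: a standardized variable has mean 0 and variance 1.
z_x, z_y = standardize(x), standardize(y)
print("mean of z_X:", mean(z_x), " variance of z_X:", sd(z_x) ** 2)

# Part b: the unstandardized slope equals r_XY * sd_Y / sd_X.
r_xy = pearson_r(x, y)
a_hat, b_hat = simple_ols(x, y)
print("b_hat:", b_hat, " r * sd_Y / sd_X:", r_xy * sd(y) / sd(x))

# Part c: with standardized variables the constant is 0 and the slope is
# r_XY, whichever variable is treated as the outcome.
a_zy, b_zy = simple_ols(z_x, z_y)   # regression of z_Y on z_X
a_zx, b_zx = simple_ols(z_y, z_x)   # regression of z_X on z_Y
print("z_Y on z_X:", a_zy, b_zy, "  z_X on z_Y:", a_zx, b_zx, "  r_XY:", r_xy)
```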
4. Use the recall data to demonstrate some aspects of
multiple regression.
a. Show that R-squared from the regression of RECALL on
BIRTHYR and EVENTYR equals the coefficient of
determination between RECALL and the predicted values of the
regression equation.
b. Show that the coefficient of determination between RECALL and the
residuals from the regression of RECALL on BIRTHYR and EVENTYR
equals 1 - R-squared from the same regression.
c. Show that unlike the correlation between BIRTHYR and EVENTYR,
there is a correlation of zero between BIRTHYR and EVENTADJ (i.e.,
between BIRTHYR and "EVENTYR after it has been adjusted for its
covariance with BIRTHYR").
d. Show that the partial slope between RECALL and EVENTYR (from the
regression of RECALL on BIRTHYR and EVENTYR) equals the slope
between RECALL and EVENTADJ (as defined in part c).
e. Write and run your own SPSS program in which you show that the
partial slope between RECALL and BIRTHYR (from the regression of
RECALL on BIRTHYR and EVENTYR) equals the slope between RECALL and
BIRTHADJ (i.e., between RECALL and "BIRTHYR after it has been
adjusted for its covariance with EVENTYR").
Below you are given a program that provides you with everything
necessary to answer parts a to d. Doing part e will help ensure that
you understand how the program works.
get file='c:recall.sav'.
regression variables=recall,birthyr,eventyr/dep=recall/enter.
compute yhat=-.616003+(.028273*eventyr)-(.012158*birthyr).
compute e=recall-yhat.
pearson corr recall with yhat,e.
regression variables=birthyr,eventyr/dep=eventyr/enter.
compute eventadj=eventyr-(.210351*birthyr).
pearson corr eventyr,eventadj with birthyr.
regression variables=recall,eventadj/dep=recall/enter.
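
The logic of the program above (parts a to d) can also be sketched without the recall data. The example below is a rough illustration only: it uses made-up numbers in place of recall.sav, so its coefficients will not match the ones hard-coded in the SPSS program, and the helper names are hypothetical. The variable names mirror the lab's.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))


def pearson_r(u, v):
    return cross(u, v) / math.sqrt(cross(u, u) * cross(v, v))


def ols1(x, y):
    """Constant and slope for y regressed on x."""
    b = cross(x, y) / cross(x, x)
    return mean(y) - b * mean(x), b


def ols2(x1, x2, y):
    """Constant and two partial slopes for y regressed on x1 and x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return mean(y) - b1 * mean(x1) - b2 * mean(x2), b1, b2


# Made-up stand-ins for the recall data (NOT the real dataset).
random.seed(2009)
birthyr = [random.uniform(1900, 1960) for _ in range(300)]
eventyr = [1940 + 0.2 * (b - 1930) + random.gauss(0, 15) for b in birthyr]
recall = [0.03 * (ev - 1900) - 0.01 * (b - 1900) + random.gauss(0, 0.5)
          for ev, b in zip(eventyr, birthyr)]

# Regression of recall on birthyr and eventyr, then predictions and residuals.
a, b_birth, b_event = ols2(birthyr, eventyr, recall)
yhat = [a + b_birth * b + b_event * ev for b, ev in zip(birthyr, eventyr)]
e = [r - yh for r, yh in zip(recall, yhat)]
r_squared = 1 - sum(ei ** 2 for ei in e) / cross(recall, recall)

# Parts a and b: squared correlations of recall with yhat and with e.
print("R-squared:", r_squared)
print("r(recall, yhat)^2:", pearson_r(recall, yhat) ** 2)
print("r(recall, e)^2:", pearson_r(recall, e) ** 2, " 1 - R-squared:", 1 - r_squared)

# Part c: adjust eventyr for its covariance with birthyr.
_, slope_event_on_birth = ols1(birthyr, eventyr)
eventadj = [ev - slope_event_on_birth * b for ev, b in zip(eventyr, birthyr)]
print("r(birthyr, eventyr):", pearson_r(birthyr, eventyr))
print("r(birthyr, eventadj):", pearson_r(birthyr, eventadj))

# Part d: the slope of recall on eventadj equals the partial slope for eventyr.
_, slope_on_eventadj = ols1(eventadj, recall)
print("partial slope:", b_event, " slope on eventadj:", slope_on_eventadj)
```

Part e follows the same pattern with the roles of BIRTHYR and EVENTYR exchanged.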
Below please find R and SAS code for problem 4:
# R
# Directions:
# Copy recall.txt into the C-drive's root (i.e., into "C:/").
# Copy the below R code into the "R Console" window, and press Enter.
# Code:
# This command writes the results to a file instead of showing them
# onscreen. The file is named lab2q4.txt, and will be written to the
# root of your C: drive.
sink('C:/lab2q4.txt')
# Read in the dataset
recall <- read.table('C:/recall.txt')
# Name the variables, if they are not already named
names(recall) <- c("birthyr","event","eventyr","recall")
# Regression of recall on birthyr and eventyr
# We will name this regression "q4a"
q4a <- lm(recall ~ birthyr+eventyr,data=recall)
# Show the regression results
q4a
# A summary of the regression gives more detail than the regression
# itself.
summary(q4a)
# Create new variable "yhat", which is the predicted recall based on
# the regression in the previous step.
# "q4a$coef[1]" refers to the intercept. "q4a$coef[2]" refers to the
# regression coefficient for birthyr.
# "q4a$coef[3]" refers to the regression coefficient for eventyr.
recall$yhat <- q4a$coef[1] + (q4a$coef[2]*recall$birthyr) +
(q4a$coef[3]*recall$eventyr)
# Create new variable "e", which stands for "error" and equals observed
# recall minus predicted recall.
# Note: When you write "recall$recall", the first part refers to the
# dataset named recall; the second part stands for the variable named
# recall.
recall$e <- recall$recall - recall$yhat
# Correlations
cor(recall$recall,recall$yhat)
cor(recall$recall,recall$e)
# Regression and summary of eventyr on birthyr
q4c <- lm(eventyr ~ birthyr,data=recall)
summary(q4c)
# Calculate new variable, "eventadj". "q4c$coef[2]" refers to the
# slope from the last regression.
recall$eventadj <- recall$eventyr-(q4c$coef[2]*recall$birthyr)
# Correlations
cor(recall$eventyr,recall$birthyr)
cor(recall$eventadj,recall$birthyr)
# Regression of recall on eventadj
q4d <- lm(recall ~ eventadj,data=recall)
summary(q4d)
# Stop writing to the file so that later output appears onscreen again.
sink()
/* SAS
   Directions:
   Copy recall.txt into the root of the C drive (i.e., into "C:/").
   Copy the below SAS code into the "Editor" window,
   and press the button with the figure of a little guy running.
   Code: */
* Read in the dataset and name it “recall”;
DATA recall;
INFILE 'C:\recall.txt';
INPUT birthyr event eventyr recall;
RUN;
* Regression of recall on birthyr and eventyr;
PROC REG data=recall;
MODEL recall = birthyr eventyr;
RUN;
* Create a data set, “recall2,” in which 2 variables are added to the
data set, “recall”;
DATA recall2;
SET recall;
* Create new variable "yhat", which is the predicted recall based on
the regression in the previous step;
yhat=-.616003+(.028273*eventyr)-(.012158*birthyr);
* Create new variable "e", which stands for "error" and equals actual
recall minus predicted recall;
e=recall-yhat;
RUN;
* Correlations;
PROC CORR data=recall2;
VAR recall yhat e;
RUN;
* Regression of eventyr on birthyr;
PROC REG data=recall2;
MODEL eventyr = birthyr;
RUN;
* Create a data set, “recall3,” in which 1 variable is added to the
data set, “recall2”;
DATA recall3;
SET recall2;
* Calculate new variable, "eventadj";
eventadj=eventyr-(.210351*birthyr);
RUN;
*Correlations;
PROC CORR data=recall3;
VAR eventyr eventadj birthyr;
RUN;
* Regression of recall on eventadj;
PROC REG data=recall3;
MODEL recall = eventadj;
RUN;