Stat404 Fall 2009 Lab 2

1. The Mexican Tourism Bureau (MTB) has hired you as part of its effort to increase the amount of money that tourists spend while visiting Mexico. In your first research project you are asked to focus primarily on the clothing purchased by Korean tourists. In particular, your boss at the MTB wants to know whether a Korean is more likely to buy items of clothing in Mexico when they are of a genuinely Mexican style unavailable in Korea, or when they are less expensive than comparable (in quality and style) Korean clothes.

During the spring of this year you obtained permission from the Mexican Border Police to interview Koreans who were to return to Korea from Mexico between this June and August. While these interviews were taking place, the luggage of the Koreans being interviewed was carefully examined to determine the Mexican versus Korean "cultural styles" of clothing items purchased in Mexico. (For example, Aztec imagery would reflect a Mexican cultural style, whereas a pagoda motif would reflect a Korean cultural style.)

Your data were collected during interviews with a random sample of 27 of these Korean tourists. The following are data on three of your variables.

C = the number of pieces of clothing purchased by the Korean tourist during his or her just-completed visit to Mexico
S = the percent of the Korean tourist's purchased-in-Mexico clothing items with Mexican cultural style (You may assume that 100 - S is the percent of these items with Korean cultural style.)
M = the amount of money (in Mexican pesos) that the Korean tourist spent on clothing during his or her just-completed visit to Mexico

Variable                C       S       M
Mean                    3.9     53.2    371.7
Standard Deviation      1.7     11.7    83.0

Correlation Coefficients
        C       S       M
C       1.0
S       0.3     1.0
M       0.7    -0.4     1.0

(Be sure that you correctly read the correlation matrix. For example, the correlation between C and S is $r_{CS} = .3$.)

a. Find $r_{CS \cdot M}$ (i.e., the partial correlation between C and S controlling for M).
Is this partial correlation significantly larger than zero? (Use the .05 significance level.)

b. Find the unstandardized regression equation (i.e., the constant and two slopes) for the regression of C on S and M.

c. From the equation calculated in part b, express the partial slope associated with the variable, S, in words that a lay person could understand.

d. Your boss at the MTB prefers to have the variable, M, in units of dollars rather than of pesos. Given that the exchange rate is 6.3 pesos per dollar, redo part b (i.e., find the unstandardized regression equation) using dollar (rather than peso) units for the variable, M. (Show your calculations!)

2. Show ALGEBRAICALLY that if

\[ V = \frac{a + X}{b} \quad \text{and if} \quad W = \frac{c + Y}{d} \]

(where X and Y are random variables, a and c are constants, and b and d are positive constants), then $r_{XY} = r_{VW}$. (Note: As a direct consequence of this problem, it follows that $r_{XY} = r_{z_X z_Y}$, where $z_X$ and $z_Y$ are standardized forms of X and Y.)

Some hints:

a. Use the following formula for the correlation coefficient:

\[ r_{VW} = \frac{\sum_{i=1}^{n} (V_i - \bar{V})(W_i - \bar{W})}{\sqrt{\sum_{i=1}^{n} (V_i - \bar{V})^2 \, \sum_{i=1}^{n} (W_i - \bar{W})^2}} \]

b. Replace the terms in V and W with their expressions in terms of X and Y, respectively. (Be sure to solve for $\bar{V}$ and $\bar{W}$, and to substitute these solutions into the formula as well.)

c. Simplify until you have the formula for $r_{XY}$.

3. The following proofs should ensure that you understand how unstandardized and standardized slopes are related:

a. Using summation notation and algebra, give a proof that whenever a variable is standardized, its mean equals 0 and its variance equals 1.

Hints: Start with the formula used in standardizing a variable (i.e., $z_X = \frac{X - \bar{X}}{\hat{\sigma}_X}$). Then find the mean and variance of this formula.

b. Prove algebraically that

\[ \hat{b} = r_{XY} \frac{\hat{\sigma}_Y}{\hat{\sigma}_X}, \]

even when X and Y are not standardized.

Hint: As in the previous problem, use the following formula:

\[ r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{SS_X \, SS_Y}} \]

c. Armed with the findings from parts a and b, plus the finding from the previous problem that $r_{XY} = r_{z_X z_Y}$, prove algebraically that the following two equalities hold if $z_X$ and $z_Y$ are standardized:

\[ \hat{z}_Y = r_{XY} z_X \quad \text{and} \quad \hat{z}_X = r_{XY} z_Y . \]

Hints: Start with a regression equation of the form $\hat{Y} = \hat{a} + \hat{b} X$. Using the formulas for $\hat{a}$ and $\hat{b}$, substitute $z_Y$ for Y and $z_X$ for X. Then show that $\hat{a} = 0$ and $\hat{b} = r_{XY}$. Do this a second time for the regression of $z_X$ on $z_Y$.

4. Use the recall data to demonstrate some aspects of multiple regression.

a. Show that R-squared from the regression of RECALL on BIRTHYR and EVENTYR equals the coefficient of determination between RECALL and the predicted values of the regression equation.

b. Show that the coefficient of determination between RECALL and the residuals from the regression of RECALL on BIRTHYR and EVENTYR equals $1 - R^2$ from the same regression.

c. Show that unlike the correlation between BIRTHYR and EVENTYR, there is a correlation of zero between BIRTHYR and EVENTADJ (i.e., between BIRTHYR and "EVENTYR after it has been adjusted for its covariance with BIRTHYR").

d. Show that the partial slope between RECALL and EVENTYR (from the regression of RECALL on BIRTHYR and EVENTYR) equals the slope between RECALL and EVENTADJ (as defined in part c).

e. Write and run your own SPSS program in which you show that the partial slope between RECALL and BIRTHYR (from the regression of RECALL on BIRTHYR and EVENTYR) equals the slope between RECALL and BIRTHADJ (i.e., between RECALL and "BIRTHYR after it has been adjusted for its covariance with EVENTYR").

Below you are given a program that provides you with everything necessary to answer parts a to d. Doing part e will help ensure that you understand how the program works.
get file='c:recall.sav'.
regression variables=recall,birthyr,eventyr/dep=recall/enter.
compute yhat=-.616003+(.028273*eventyr)-(.012158*birthyr).
compute e=recall-yhat.
pearson corr recall with yhat,e.
regression variables=birthyr,eventyr/dep=eventyr/enter.
compute eventadj=eventyr-(.210351*birthyr).
pearson corr eventyr,eventadj with birthyr.
regression variables=recall,eventadj/dep=recall/enter.

Below please find R and SAS code for problem 4:

# R
# Directions:
# Copy recall.txt into the C-drive's root (i.e., into "C:/").
# Copy the R code below into the "R Console" window, and press Enter.
# Code:

# This command writes the results to a file instead of showing them onscreen.
# The file is named lab2q4.txt, and will be written to the root of your C: drive.
sink('C:/lab2q4.txt')

# Read in the dataset
recall <- read.table('C:/recall.txt')

# Name the variables, if they are not already named
names(recall) <- c("birthyr","event","eventyr","recall")

# Regression of recall on birthyr and eventyr
# We will name this regression "q4a"
q4a <- lm(recall ~ birthyr+eventyr,data=recall)

# Show the regression results
q4a

# A summary of the regression gives more detail than the regression itself.
summary(q4a)

# Create new variable "yhat", which is the predicted recall based on the regression in the previous step.
# "q4a$coef[1]" refers to the intercept. "q4a$coef[2]" refers to the regression coefficient for birthyr.
# "q4a$coef[3]" refers to the regression coefficient for eventyr.
recall$yhat <- q4a$coef[1] + (q4a$coef[2]*recall$birthyr) + (q4a$coef[3]*recall$eventyr)

# Create new variable "e", which stands for "error" and equals observed recall minus predicted recall.
# Note: When you write "recall$recall", the first part refers to the dataset named recall; the second part
# stands for the variable named recall.
recall$e <- recall$recall - recall$yhat

# Correlations
cor(recall$recall,recall$yhat)
cor(recall$recall,recall$e)

# Regression and summary of eventyr on birthyr
q4c <- lm(eventyr ~ birthyr,data=recall)
summary(q4c)

# Calculate new variable, "eventadj". "q4c$coef[2]" refers to the slope from the last regression.
recall$eventadj <- recall$eventyr-(q4c$coef[2]*recall$birthyr)

# Correlations
cor(recall$eventyr,recall$birthyr)
cor(recall$eventadj,recall$birthyr)

# Regression of recall on eventadj
q4d <- lm(recall ~ eventadj,data=recall)
summary(q4d)

# Stop writing to the file, so that later output appears onscreen again.
sink()

* SAS;
* Directions:;
* Copy recall.txt into the root of the C drive (i.e., into "C:/").;
* Copy the SAS code below into the "Editor" window,;
* and press the button with the figure of a little guy running.;
* Code:;

* Read in the dataset and name it "recall";
DATA recall;
INFILE 'C:\recall.txt';
INPUT birthyr event eventyr recall;
RUN;

* Regression of recall on birthyr and eventyr;
PROC REG data=recall;
MODEL recall = birthyr eventyr;
RUN;

* Create a data set, "recall2," in which 2 variables are added to the data set, "recall";
DATA recall2;
SET recall;
* Create new variable "yhat", which is the predicted recall based on the regression in the previous step;
yhat=-.616003+(.028273*eventyr)-(.012158*birthyr);
* Create new variable "e", which stands for "error" and equals actual recall minus predicted recall;
e=recall-yhat;
RUN;

* Correlations;
PROC CORR data=recall2;
VAR recall yhat e;
RUN;

* Regression of eventyr on birthyr;
PROC REG data=recall2;
MODEL eventyr = birthyr;
RUN;

* Create a data set, "recall3," in which 1 variable is added to the data set, "recall2";
DATA recall3;
SET recall2;
* Calculate new variable, "eventadj";
eventadj=eventyr-(.210351*birthyr);
RUN;

* Correlations;
PROC CORR data=recall3;
VAR eventyr eventadj birthyr;
RUN;

* Regression of recall on eventadj;
PROC REG data=recall3;
MODEL recall = eventadj;
RUN;
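The programs above hard-code or index the estimated coefficients. As a cross-check on the logic of problem 4, parts a to d, the same four identities can be illustrated on simulated data using R's fitted() and resid() helpers. This is only a minimal sketch under stated assumptions: the data frame "sim" and the variables y, x1, and x2 are made up for illustration (they are NOT the recall data), with y playing the role of RECALL, x1 of BIRTHYR, and x2 of EVENTYR. Note also that resid() removes the intercept as well as the slope, so "x2adj" differs from the handout's EVENTADJ by a constant; correlations and slopes are unaffected by that constant.

```r
# Simulated stand-in data (hypothetical; not the recall dataset).
set.seed(404)
n  <- 200
x1 <- rnorm(n)                               # plays the role of birthyr
x2 <- 0.5 * x1 + rnorm(n)                    # plays the role of eventyr (correlated with x1)
y  <- 1 + 0.3 * x1 - 0.6 * x2 + rnorm(n)     # plays the role of recall
sim <- data.frame(y, x1, x2)

# Full regression of y on x1 and x2, and its R-squared.
full <- lm(y ~ x1 + x2, data = sim)
r2   <- summary(full)$r.squared

# Part a analogue: squared correlation between y and the predicted values.
r2_fitted <- cor(sim$y, fitted(full))^2

# Part b analogue: squared correlation between y and the residuals.
r2_resid <- cor(sim$y, resid(full))^2

# Part c analogue: x2 adjusted for its covariance with x1.
sim$x2adj <- resid(lm(x2 ~ x1, data = sim))
cor_raw   <- cor(sim$x1, sim$x2)
cor_adj   <- cor(sim$x1, sim$x2adj)

# Part d analogue: partial slope of x2 versus simple slope on x2adj.
slope_partial <- unname(coef(full)["x2"])
slope_adj     <- unname(coef(lm(y ~ x2adj, data = sim))["x2adj"])

# Each comparison should hold up to floating-point rounding.
cat("R2 equals cor(y, yhat)^2:        ", isTRUE(all.equal(r2, r2_fitted)), "\n")
cat("1 - R2 equals cor(y, e)^2:       ", isTRUE(all.equal(1 - r2, r2_resid)), "\n")
cat("cor(x1, x2) is nonzero:          ", abs(cor_raw) > 1e-8, "\n")
cat("cor(x1, x2adj) is (nearly) zero: ", abs(cor_adj) < 1e-8, "\n")
cat("partial slope equals adj. slope: ", isTRUE(all.equal(slope_partial, slope_adj)), "\n")
```

Because nothing is typed in by hand, the comparisons are not affected by rounding the coefficients to six decimals, as happens in the SPSS and SAS versions. To try the idea on the real data, one would replace "sim" with the recall data frame read in earlier and substitute the actual variable names.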