R Tutorial for STAT 350 for Computer Assignment 9a Author: Leonore Findsen, Chunyan Sun, Sarah H. Sellke All of the tutorials for Computer Assignment 9 use the same data set. Example: (Data Set: loc.txt) Job Stress and Locus of Control. Many factors, such as the type of job, education level, and job experience, can affect the stress felt by workers on the job. Locus of control (LOC) is a term in psychology that describes the extent to which a person believes he or she is in control of the events that influence his or her life. Is feeling “more in control” associated with less job stress? A recent study examined the relationship between LOC and several work-related behavioral measures among certified public accountants in Taiwan. LOC was assessed using a questionnaire that asked respondents to select one of two options for each of 23 items. Scores ranged from 0 to 23. Individuals with low LOC believe that their own behavior and attributes determine their rewards in life. Those with high LOC believe that these rewards are beyond their control. We will consider a random sample of 100 accountants. a) b) c) d) e) f) Make a scatterplot of the data (including the least-squares regression line) with LOC on the x-axis and Stress on the y-axis. Briefly describe the relationship between Stress and LOC. Compute the correlation coefficient between Stress and LOC. Find the equation of the least-squares regression line for predicting Stress from LOC. What are MSE and R2 for these data? Using the ANOVA table for linear regression, confirm the values of MSE and R2. Calculate the correlation coefficient from R2 and compare the value with the value calculated in part b). Solution: job <- read.table(file = "loc.txt", header = TRUE) # # a) Scatterplot of the data # library(ggplot2) ggplot(job, aes(x=LOC, y=STRESS))+ geom_point() + geom_smooth(method = lm, se = FALSE) + ggtitle("Relationship between Stress and LOC") + xlab("Locus") + ylab("Stress") # # b) Correlation # cor(job$LOC, job$STRESS) # # c), d) Calculate linear regression and get results # job.lm <- lm(STRESS ~ LOC, data = job) summary(job.lm) # # e) ANOVA table # anova(job.lm) 1 STAT 350: Introduction to Statistics Department of Statistics, Purdue University, West Lafayette, IN 47907 R Tutorial for STAT 350 for Computer Assignment 9a Author: Leonore Findsen, Chunyan Sun, Sarah H. Sellke a) Make a scatterplot of the data (including the least-squares regression line) with LOC on the x-axis and Stress on the y-axis. Briefly describe the relationship between Stress and LOC. Solution: The plot looks linear with a positive direction. I am not sure about the strength because the scale on the y-axis is so small. I do not see any x- or y-outliers. b) Compute the correlation coefficient between Stress and LOC. Solution: > cor(job$LOC, job$STRESS) [1] 0.3122765 The correlation coefficient between Stress and LOC is 0.3122765. This looks like there is a weak but nonnegligible association between Stress and LOC. 2 STAT 350: Introduction to Statistics Department of Statistics, Purdue University, West Lafayette, IN 47907 R Tutorial for STAT 350 for Computer Assignment 9a Author: Leonore Findsen, Chunyan Sun, Sarah H. Sellke c) Find the equation of the least-squares regression line for predicting Stress from LOC. Solution: Call: lm(formula = STRESS ~ LOC, data = job) Residuals: Min 1Q -1.04704 -0.33806 Median 0.02169 3Q 0.30798 Max 1.06715 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.25550 0.14691 15.353 < 2e-16 *** LOC 0.03991 0.01226 3.254 0.00156 ** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4513 on 98 degrees of freedom Multiple R-squared: 0.09752, Adjusted R-squared: 0.08831 F-statistic: 10.59 on 1 and 98 DF, p-value: 0.001562 The answer to this part is highlighted in yellow in the data above. ̂ = 2.25550 + 0.03991 LOC 𝑆𝑡𝑟𝑒𝑠𝑠 Be sure to always report the equation, not just the values of b0 and b1. d) What are MSE and R2 for these data? Solution: The answer to this part is highlighted in green in the data above. R2 = 0.09752 For simple linear regression, do not use the adjusted R-squared value. This does not look very good. MSE = 0.45132 = 0.2037 This is squared because the reported value is the standard deviation. 3 STAT 350: Introduction to Statistics Department of Statistics, Purdue University, West Lafayette, IN 47907 R Tutorial for STAT 350 for Computer Assignment 9a Author: Leonore Findsen, Chunyan Sun, Sarah H. Sellke e) Using the ANOVA table for linear regression, confirm the values of MSE and R2. Solution: > anova(job.lm) Analysis of Variance Table Response: STRESS Df Sum Sq Mean Sq F value Pr(>F) LOC 1 2.1565 2.15651 10.589 0.001562 ** Residuals 98 19.9578 0.20365 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 MSE = 0.20365 𝑆𝑆𝑅 𝑆𝑆𝑅 2.1565 2.1565 = = = = 0.09752 𝑆𝑆𝑇 (𝑆𝑆𝑅 + 𝑆𝑆𝐸) 2.1565 + 19.9578 22.1143 These numbers match the values from the previous output. 𝑅2 = f) Calculate the correlation coefficient from R2 and compare the value with the value calculated in part b). Solution: For simple linear regression 𝑟 = ±√𝑅 2. The sign of r is the same as the sign of the slope. In this case, the slope is positive; therefore, 𝑟 = +√𝑅 2 = √0.09752 = 0.312228 In part b), we obtained a value of 0.3122765. the two values are the same to 6 decimal places. The reason why they are not exact is because the value of R2 is rounded. 4 STAT 350: Introduction to Statistics Department of Statistics, Purdue University, West Lafayette, IN 47907