# Linear Regression

Ajay Macharla, Shashi K, Rakesh P, Raviteja Movva, Brinda N

2023-10-22

# Question 1.

#### a. Start with a basic exploratory data analysis. Show summary statistics of the response variable and predictor variable.

```r
alumni <- read.csv(file = "/Users/ajay/Documents/Course/Linear Regression/Alumni.csv")
attach(alumni)
library(kableExtra)
kable(summary(alumni))  # Summary of alumni table
```

|school           |percent_of_classes_under_20 |student_faculty_ratio |alumni_giving_rate |private        |
|:----------------|:---------------------------|:---------------------|:------------------|:--------------|
|Length:48        |Min.   :29.00               |Min.   : 3.00         |Min.   : 7.00      |Min.   :0.0000 |
|Class :character |1st Qu.:44.75               |1st Qu.: 8.00         |1st Qu.:18.75      |1st Qu.:0.0000 |
|Mode  :character |Median :59.50               |Median :10.50         |Median :29.00      |Median :1.0000 |
|NA               |Mean   :55.73               |Mean   :11.54         |Mean   :29.27      |Mean   :0.6875 |
|NA               |3rd Qu.:66.25               |3rd Qu.:13.50         |3rd Qu.:38.50      |3rd Qu.:1.0000 |
|NA               |Max.   :77.00               |Max.   :23.00         |Max.   :67.00      |Max.   :1.0000 |

From the above summary, we observe that:

* We have data for 48 universities: 15 public and 33 private (the mean of the 0/1 `private` indicator is 0.6875, and 0.6875 × 48 = 33).
* Median > mean for `percent_of_classes_under_20`, so it has a left-skewed distribution.
* Mean > median for `student_faculty_ratio`, so it has a right-skewed distribution.
* Mean ≈ median for `alumni_giving_rate`, so it has a roughly symmetric distribution.

#### b. What is the nature of the variables X and Y? Are there outliers? What is the correlation coefficient? Draw a scatter plot. Any major comments about the data?

X (`percent_of_classes_under_20`) has a left-skewed distribution and Y (`alumni_giving_rate`) has a roughly symmetric distribution.

```r
library(ggplot2)
library(gridExtra)
p1 <- ggplot(alumni, aes(x = "", y = percent_of_classes_under_20)) + geom_boxplot()
p2 <- ggplot(alumni, aes(x = "", y = alumni_giving_rate)) + geom_boxplot()
grid.arrange(p1, p2, ncol = 2)
```

Outliers: the boxplots above show no outliers in either X or Y.
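The mean-vs-median rule used above can be sanity-checked on synthetic data. A minimal sketch (the exponential sample here is purely illustrative, not the alumni data):

```r
# For a right-skewed distribution, the long right tail pulls the mean
# above the median; an exponential sample makes this visible.
set.seed(1)
x_right <- rexp(1000, rate = 1)
c(mean = mean(x_right), median = median(x_right))
mean(x_right) > median(x_right)  # TRUE: mean > median signals right skew
```

The same logic, mirrored, is why median > mean flags the left skew in `percent_of_classes_under_20`.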
Calculating correlation coefficients to check which continuous variable is most closely connected (either positively or negatively) to alumni giving rates:

```r
cor(alumni[, 2:4])
```

```
##                             percent_of_classes_under_20 student_faculty_ratio alumni_giving_rate
## percent_of_classes_under_20                   1.0000000            -0.7855593          0.6456504
## student_faculty_ratio                        -0.7855593             1.0000000         -0.7423975
## alumni_giving_rate                            0.6456504            -0.7423975          1.0000000
```

From the correlation matrix, we can interpret the correlations in the context of estimating `alumni_giving_rate`:

1. `percent_of_classes_under_20` and `alumni_giving_rate` (positive correlation: 0.646): There is a moderately strong positive correlation between these variables. Universities with a higher percentage of classes with fewer than 20 students tend to have higher alumni giving rates.
2. `student_faculty_ratio` and `alumni_giving_rate` (negative correlation: -0.742): There is a strong negative correlation between these variables. Universities with a lower student-to-faculty ratio (meaning fewer students per faculty member) tend to have higher alumni giving rates. A lower student-to-faculty ratio often goes with smaller class sizes and more personalized attention, which may lead to increased alumni giving.

Scatterplot:

```r
ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
  geom_point(size = 3, alpha = 0.3)
```

There appears to be a linear relationship: as `percent_of_classes_under_20` goes up, `alumni_giving_rate` goes up.

#### c. Fit a simple linear regression to the data. What is your estimated regression equation?

Fitting a simple linear regression model to the alumni data frame:
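Since the model is fit with `lm()`, it may help to recall that the simple-regression OLS estimates also have closed forms. A minimal sketch on made-up illustrative numbers (not the alumni data):

```r
# Closed-form OLS estimates: b1 = Sxy / Sxx, b0 = mean(y) - b1 * mean(x)
# (illustrative vectors, not the alumni data)
x <- c(29, 44, 60, 66, 77)
y <- c(7, 19, 29, 39, 67)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
all.equal(unname(coef(lm(y ~ x))), c(b0, b1))  # TRUE: matches lm()
```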
```r
model <- lm(alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
summary(model)
```

```
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20, data = alumni)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.053  -7.158  -1.660   6.734  29.658
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  -7.3861     6.5655  -1.125    0.266
## percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07
```

The estimated regression equation based on the above output is:

Y = -7.3861 + 0.6578 X

i.e., alumni_giving_rate = -7.3861 + 0.6578 × percent_of_classes_under_20

#### d. Interpret your results

Interpreting the coefficients:

- Intercept (B0 = -7.3861): When `percent_of_classes_under_20` is 0, the estimated `alumni_giving_rate` is -7.3861. However, since a university is very unlikely to have no classes with fewer than 20 students, the intercept is an extrapolation with no practical interpretation here.
- Coefficient for `percent_of_classes_under_20` (B1 = 0.6578): For every one-unit (one percentage point) increase in `percent_of_classes_under_20`, the `alumni_giving_rate` is expected to increase by 0.6578 units, assuming all other factors remain constant.

In simpler terms, the regression equation suggests a positive relationship between the percentage of classes with fewer than 20 students (`percent_of_classes_under_20`) and the alumni giving rate (`alumni_giving_rate`). As the percentage of small classes increases, the alumni giving rate is expected to increase by approximately 0.6578 units, on average. However, this interpretation assumes a linear relationship, and the actual relationship might be more complex in real-world scenarios.

# Question 2. (10 points) A Simulation Study (Simple Linear Regression).
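Because this simulation fixes β0 = 10, β1 = 5, σ(X) = 0.1, and σ(ε) = 0.5, the population correlation between X and Y is determined in advance by the design. A short sketch computing it (any single simulated sample's estimate will vary around this value):

```r
# Implied population correlation for Y = 10 + 5*X + eps,
# with sd(X) = 0.1 and sd(eps) = 0.5 (the design used in part a below)
beta1 <- 5; sigma_x <- 0.1; sigma_eps <- 0.5
rho <- beta1 * sigma_x / sqrt(beta1^2 * sigma_x^2 + sigma_eps^2)
rho  # 1/sqrt(2), roughly 0.707
```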
Assuming the mean response is E(Y|X) = 10 + 5X:

#### a. Generate data with X ~ N(μ = 2, σ = 0.1), sample size n = 100, and error term ε ~ N(μ = 0, σ = 0.5)

```r
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
lmline <- cbind(x, y)
```

#### b. Show summary statistics of the response variable and predictor variable. Are there outliers? What is the correlation coefficient? Draw a scatter plot.

```r
summary(lmline)
```

```
##        x               y
##  Min.   :1.725   Min.   :18.09
##  1st Qu.:1.923   1st Qu.:19.67
##  Median :2.001   Median :20.11
##  Mean   :2.004   Mean   :20.17
##  3rd Qu.:2.070   3rd Qu.:20.70
##  Max.   :2.243   Max.   :21.80
```

Boxplots:

```r
# Create a boxplot for the first column (x)
boxplot(lmline[, 1], main = "Boxplot of Variable x", ylab = "x")
# Create a boxplot for the second column (y)
boxplot(lmline[, 2], main = "Boxplot of Variable y", ylab = "y")
```

No, there are no outliers, as seen in the boxplots above.

Correlation:

```r
cor.test(x, y)
```

```
##  Pearson's product-moment correlation
##
## data:  x and y
## t = 13.395, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7218233 0.8641361
## sample estimates:
##       cor
## 0.8042198
```

The correlation coefficient of 0.8042 suggests a strong positive correlation between x and y: as the values of x increase, the values of y tend to increase as well, and vice versa.

Scatterplot:

```r
plot(x, y)
```

#### c. Fit a simple linear regression. What is the estimated model? Report the estimated coefficients. What is the model mean squared error (MSE)?
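Before reading the `lm()` output, note that with a single predictor the OLS slope equals r · s(y)/s(x). A quick sketch regenerating the simulated data from part (a) (same seed) to check this identity:

```r
# OLS slope identity for simple regression: b1 = cor(x, y) * sd(y) / sd(x)
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
slope_from_cor <- cor(x, y) * sd(y) / sd(x)
all.equal(slope_from_cor, unname(coef(lm(y ~ x))["x"]))  # TRUE
```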
Fitting a simple linear regression; the estimated coefficients are reported below.

```r
fit <- lm(y ~ x)
df <- data.frame(cbind(x, y))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

```
## `geom_smooth()` using formula = 'y ~ x'
```

```r
summary(fit)
```

```
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.2073 -0.3029  0.0093  0.3033  1.3545
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## x             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16
```

The estimated model is ŷ = 9.0218 + 5.5652 x.

```r
sigma(fit)
```

```
## [1] 0.4508807
```

```r
sigma(fit)^2
```

```
## [1] 0.2032934
```

The MSE (mean squared error) is approximately 0.2033.

#### d. What is the sample mean of both X and Y? Plot the fitted regression line and the point (X̄, Ȳ). What do you find?

```r
averagex <- mean(x)
averagey <- mean(y)
df2 <- data.frame(cbind(averagex, averagey))
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(aes(x = averagex, y = averagey, color = "pink"))
```

```
## `geom_smooth()` using formula = 'y ~ x'
```

The average of x is 2.0037 and the average of y is 20.1726. The point (x̄, ȳ) sits in the middle of the scatter and lies exactly on the fitted regression line.
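The finding in part (d) is a general property: whenever the model includes an intercept, the OLS line passes exactly through the point of means (x̄, ȳ). A minimal sketch on small illustrative data (any data set works for this identity):

```r
# With an intercept, the fitted OLS line always passes through (mean(x), mean(y))
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)
fitted_at_mean <- unname(predict(fit, newdata = data.frame(x = mean(x))))
all.equal(fitted_at_mean, mean(y))  # TRUE: the line goes through the point of means
```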