
A2 - Linear Regression

22/10/2023, 21:29
Linear Regression
Ajay Macharla, Shashi K, Rakesh P, Raviteja Movva, Brinda N
2023-10-22
# Question 1

#### a. Start with a basic exploratory data analysis. Show summary statistics of the response variable and predictor variable.
alumni <- read.csv(file = "/Users/ajay/Documents/Course/Linear Regression/Alumni.csv")
attach(alumni)
library(kableExtra)
kable(summary(alumni)) #Summary of alumni table
| school           | percent_of_classes_under_20 | student_faculty_ratio | alumni_giving_rate | private        |
|------------------|-----------------------------|-----------------------|--------------------|----------------|
| Length:48        | Min. :29.00                 | Min. : 3.00           | Min. : 7.00        | Min. :0.0000   |
| Class :character | 1st Qu.:44.75               | 1st Qu.: 8.00         | 1st Qu.:18.75      | 1st Qu.:0.0000 |
| Mode :character  | Median :59.50               | Median :10.50         | Median :29.00      | Median :1.0000 |
| NA               | Mean :55.73                 | Mean :11.54           | Mean :29.27        | Mean :0.6875   |
| NA               | 3rd Qu.:66.25               | 3rd Qu.:13.50         | 3rd Qu.:38.50      | 3rd Qu.:1.0000 |
| NA               | Max. :77.00                 | Max. :23.00           | Max. :67.00        | Max. :1.0000   |
From the above summary, we observe that:

* We have data for 48 universities: 15 public and 33 private.
* Median > Mean for percent_of_classes_under_20, meaning percent_of_classes_under_20 has a left-skewed distribution.
* Mean > Median for student_faculty_ratio, meaning student_faculty_ratio has a right-skewed distribution.
* Mean ≈ Median for alumni_giving_rate, meaning alumni_giving_rate has a roughly symmetric distribution.
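The mean-versus-median heuristic used above can be checked directly. A minimal sketch with simulated samples (the exponential and normal draws here are illustrative, not taken from the Alumni data):

```r
set.seed(42)

# Right-skewed sample: the long right tail pulls the mean above the median
right_skewed <- rexp(1000, rate = 1)
mean(right_skewed) > median(right_skewed)  # TRUE

# Roughly symmetric sample: mean and median nearly coincide
symmetric <- rnorm(1000)
abs(mean(symmetric) - median(symmetric))   # close to 0
```

The same comparison, with the inequality reversed, flags a left-skewed sample such as percent_of_classes_under_20.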
#### b. What is the nature of the variables X and Y? Are there outliers? What is the correlation coefficient? Draw a scatter plot. Any major comments about the data?
X (percent_of_classes_under_20) has a left-skewed distribution and Y (alumni_giving_rate) has a roughly symmetric distribution.
library(ggplot2)
library(gridExtra)
p1 <- ggplot(alumni, aes(x="", y=percent_of_classes_under_20)) + geom_boxplot()
p2 <- ggplot(alumni, aes(x="", y=alumni_giving_rate)) + geom_boxplot()
grid.arrange(p1, p2, ncol=2)
Outliers: From the above boxplots, we can see that neither X nor Y has outliers.

Calculating correlation coefficients to check which continuous variable is most closely connected (either positively or negatively) to alumni giving rates:
cor(alumni[,2:4])
##                             percent_of_classes_under_20 student_faculty_ratio
## percent_of_classes_under_20                   1.0000000            -0.7855593
## student_faculty_ratio                        -0.7855593             1.0000000
## alumni_giving_rate                            0.6456504            -0.7423975
##                             alumni_giving_rate
## percent_of_classes_under_20          0.6456504
## student_faculty_ratio               -0.7423975
## alumni_giving_rate                   1.0000000
From the correlation matrix, we can interpret the correlations in the context of estimating alumni_giving_rate:
1. percent_of_classes_under_20 and alumni_giving_rate (Positive Correlation: 0.646): There is a moderately strong positive
correlation (0.646) between percent_of_classes_under_20 and alumni_giving_rate . This means that universities with a higher
percentage of classes with fewer than 20 students tend to have higher alumni giving rates.
2. student_faculty_ratio and alumni_giving_rate (Negative Correlation: -0.742): There is a strong negative correlation (-0.742)
between student_faculty_ratio and alumni_giving_rate . This indicates that universities with a lower student-to-faculty ratio
(meaning fewer students per faculty member) tend to have higher alumni giving rates. A lower student-to-faculty ratio often suggests
smaller class sizes and more personalized attention, which may lead to increased alumni giving.
Scatterplot:
ggplot(alumni, aes(x = percent_of_classes_under_20, y = alumni_giving_rate)) +
geom_point(size = 3, alpha = 0.3)
There appears to be a positive linear relationship: as percent_of_classes_under_20 goes up, alumni_giving_rate goes up.
#### c. Fit a simple linear regression to the data. What is your estimated regression equation?
# Fitting a simple linear regression model to the alumni df.
model <- lm(alumni_giving_rate ~ percent_of_classes_under_20 , data = alumni)
summary(model)
## Call:
## lm(formula = alumni_giving_rate ~ percent_of_classes_under_20,
##     data = alumni)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.053  -7.158  -1.660   6.734  29.658
##
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  -7.3861     6.5655  -1.125    0.266
## percent_of_classes_under_20   0.6578     0.1147   5.734 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared:  0.4169, Adjusted R-squared:  0.4042
## F-statistic: 32.88 on 1 and 46 DF,  p-value: 7.228e-07
The estimated regression equation based on the above output is Y = -7.3861 + 0.6578 X, i.e., alumni_giving_rate = -7.3861 + 0.6578 × percent_of_classes_under_20.
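The estimated equation can also be pulled straight from the model object with coef() and used for prediction by hand. A small sketch, using simulated data standing in for the Alumni data (variable names pct_small and giving are illustrative):

```r
set.seed(1)
pct_small <- runif(48, 29, 77)
giving <- -7.4 + 0.66 * pct_small + rnorm(48, sd = 10)
m <- lm(giving ~ pct_small)

b0 <- coef(m)[1]  # estimated intercept
b1 <- coef(m)[2]  # estimated slope

# Manual prediction from the estimated equation Y-hat = b0 + b1 * X
yhat_manual <- b0 + b1 * 50
yhat_pred   <- predict(m, newdata = data.frame(pct_small = 50))
all.equal(unname(yhat_manual), unname(yhat_pred))  # TRUE
```

The same coef()/predict() calls applied to the fitted model above reproduce the equation reported here.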
#### d. Interpret your results
Interpreting the coefficients:

* Intercept (B0 = -7.3861): When percent_of_classes_under_20 is 0, the estimated alumni_giving_rate is -7.3861. However, since percent_of_classes_under_20 is a percentage, a university with no classes of fewer than 20 students is very unlikely in practice, so the intercept has little direct interpretation.
* Coefficient for percent_of_classes_under_20 (B1 = 0.6578): For every one-unit increase in percent_of_classes_under_20, the alumni_giving_rate is expected to increase by 0.6578 units, assuming all other factors remain constant.
In simpler terms, the regression equation suggests a positive relationship between the percentage of classes with fewer than 20 students (percent_of_classes_under_20) and the alumni giving rate (alumni_giving_rate). As the percentage of small classes increases, the alumni giving rate is expected to increase by approximately 0.6578 units on average. However, this interpretation assumes a linear relationship, and the actual relationship might be more complex in real-world scenarios.
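A point estimate like 0.6578 comes with sampling uncertainty, which confint() quantifies. A sketch under the same simulated stand-in data as above (x_sim and y_sim are hypothetical, not the Alumni columns):

```r
set.seed(2)
x_sim <- runif(48, 29, 77)
y_sim <- -7.4 + 0.66 * x_sim + rnorm(48, sd = 10)
m <- lm(y_sim ~ x_sim)

# 95% confidence interval: the range of plausible values for B0 and B1
confint(m, level = 0.95)
```

By construction, the point estimates from coef(m) fall inside their respective intervals; applied to the real model, this gives an interval interpretation to go with the one-unit-increase statement above.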
# Question 2

(10 points) A Simulation Study (Simple Linear Regression). Assuming the mean response is E(Y|X) = 10 + 5X:

#### a. Generate data with X ~ N(μ = 2, σ = 0.1), sample size n = 100, and error term ϵ ~ N(μ = 0, σ = 0.5)
set.seed(7052)
x <- rnorm(100, mean = 2, sd = .1)
y <- rnorm(100, mean = 10 + 5*x, sd = 0.5)
lmline <- cbind(x,y)
#### b. Show summary statistics of the response variable and predictor variable. Are there outliers? What is the correlation coefficient? Draw a scatter plot.
summary(lmline)
##        x               y
##  Min.   :1.725   Min.   :18.09
##  1st Qu.:1.923   1st Qu.:19.67
##  Median :2.001   Median :20.11
##  Mean   :2.004   Mean   :20.17
##  3rd Qu.:2.070   3rd Qu.:20.70
##  Max.   :2.243   Max.   :21.80
Boxplot:
# Create a boxplot for the first column (x)
boxplot(lmline[, 1], main="Boxplot of Variable x", ylab="x")
# Create a boxplot for the second column (y)
boxplot(lmline[, 2], main="Boxplot of Variable y", ylab="y")
No, there are no outliers, as seen in the above boxplots.
Correlation:
cor.test(x,y)
##
##  Pearson's product-moment correlation
##
## data:  x and y
## t = 13.395, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7218233 0.8641361
## sample estimates:
##       cor
## 0.8042198
The correlation coefficient of 0.8042 suggests a strong positive correlation between x and y. This means that as the values of variable x increase, the values of variable y tend to increase as well, and vice versa.
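The Pearson coefficient reported by cor.test is just the covariance of x and y scaled by their standard deviations. A quick check, re-running the simulation from part (a) with the same seed:

```r
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)

# r = cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
r_manual  <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_builtin, r_manual)  # TRUE
```

This makes explicit why r is unit-free: rescaling x or y rescales the covariance and the standard deviations by the same factor.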
Scatterplot:
plot(x,y)
#### c. Fit a simple linear regression. What is the estimated model? Report the estimated coefficients. What is the model mean squared error (MSE)?

Fitting a simple linear regression; the estimated coefficients are reported below.
fit <- lm(y ~ x)
df <- data.frame(cbind(x, y))
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
summary(fit)
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.2073  -0.3029   0.0093   0.3033   1.3545
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.0218     0.8336   10.82   <2e-16 ***
## x             5.5652     0.4155   13.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared:  0.6468, Adjusted R-squared:  0.6432
## F-statistic: 179.4 on 1 and 98 DF,  p-value: < 2.2e-16
sigma(fit)
## [1] 0.4508807
sigma(fit)^2
## [1] 0.2032934
The MSE (mean squared error) is 0.2033, the square of the residual standard error, i.e., SSE/(n − 2).
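The relationship between sigma(fit) and the MSE can be verified directly: sigma(fit) is sqrt(SSE/(n − 2)), where SSE is the sum of squared residuals. A sketch re-using the simulation seed from part (a):

```r
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

# SSE = sum of squared residuals; df.residual(fit) = n - 2 = 98
sse <- sum(resid(fit)^2)
all.equal(sigma(fit)^2, sse / df.residual(fit))  # TRUE

# Note: mean(resid(fit)^2) divides by n instead of n - 2, giving a
# slightly smaller (biased) estimate of the error variance.
```

Dividing by n − 2 rather than n accounts for the two estimated coefficients, making sigma(fit)^2 an unbiased estimate of the error variance σ² = 0.25 used in the simulation.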
#### d. What is the sample mean of both X and Y? Plot the fitted regression line and the point (X̄, Ȳ). What do you find?
averagex <- mean(x)
averagey <- mean(y)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # mark the single point (x-bar, y-bar); annotate avoids mapping a
  # constant inside aes(), which would create a spurious legend
  annotate("point", x = averagex, y = averagey, color = "pink", size = 3)
## `geom_smooth()` using formula = 'y ~ x'
It can be seen that the average of x is 2.0037 and the average of y is 20.1726.

The point (X̄, Ȳ) lies in the middle of the data and sits exactly on the fitted regression line: a least-squares regression line always passes through the point of means.
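That the fitted line passes exactly through (X̄, Ȳ) is an algebraic property of least squares, not a coincidence of this sample; it follows from the normal equation Ȳ = b0 + b1 X̄. It can be confirmed numerically with the same seed as part (a):

```r
set.seed(7052)
x <- rnorm(100, mean = 2, sd = 0.1)
y <- rnorm(100, mean = 10 + 5 * x, sd = 0.5)
fit <- lm(y ~ x)

# The fitted value at x = mean(x) equals mean(y) exactly
yhat_at_xbar <- predict(fit, newdata = data.frame(x = mean(x)))
all.equal(unname(yhat_at_xbar), mean(y))  # TRUE
```

Equivalently, coef(fit)[1] + coef(fit)[2] * mean(x) reproduces mean(y) to machine precision.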