
ECO6126 Assignment 1
ZHANG Wangjia 119020545
07 2023
Contents
Packages and Functions
Practical Problems
    Part 1
    Part 2
    Part 3
    Part 4
    Part 5
Packages and Functions
rm(list=ls())
library(ISLR)
data(Wage)
data(Auto)
Practical Problems
Part 1
#Part 1 Task 1
salesdata<-c(25000,30000,20000,22000,15000,17000,18000,22000,16000,12000,8000,14000)
profitdata<-c(0.12, 0.10, 0.08, 0.11, 0.09, 0.07, 0.10, 0.09, 0.07, 0.15, 0.12, 0.10)
Sales<-matrix(data=salesdata,nrow=4,ncol=3,byrow=TRUE, dimnames=list(c('Toyota','Ford','Honda','BMW'),c('US','Asia','Europe')))
Sales
          US  Asia Europe
Toyota 25000 30000  20000
Ford   22000 15000  17000
Honda  18000 22000  16000
BMW    12000  8000  14000
Profit_Rates<-matrix(data=profitdata,nrow=4,ncol=3,byrow=TRUE, dimnames=list(c('Toyota','Ford','Honda','BMW'),c('US','Asia','Europe')))
Profit_Rates
         US Asia Europe
Toyota 0.12 0.10   0.08
Ford   0.11 0.09   0.07
Honda  0.10 0.09   0.07
BMW    0.15 0.12   0.10
#Part 1 Task 2
Net_profit<-Sales*Profit_Rates
Net_profit
         US Asia Europe
Toyota 3000 3000   1600
Ford   2420 1350   1190
Honda  1800 1980   1120
BMW    1800  960   1400
Net_profit[,1]
Toyota   Ford  Honda    BMW 
  3000   2420   1800   1800 
barplot(Net_profit[,1], main='Net profit data between brands in the US', names.arg=c('Toyota','Ford','Honda','BMW'), xlab='Brands', ylab='Net Profit', col='blue')
[Bar plot: "Net profit data between brands in the US"; x-axis: Brands (Toyota, Ford, Honda, BMW); y-axis: Net Profit]
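The Net_profit matrix can also be aggregated directly with base R matrix helpers. A minimal sketch, assuming the Net_profit object computed above:

rowSums(Net_profit)   # total net profit per brand across all regions
colSums(Net_profit)   # total net profit per region across all brands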
Part 2
#Part 2
#a
data001<-data.frame(Wage)
str(data001)
'data.frame': 3000 obs. of 11 variables:
 $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
 $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
 $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
 $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
 $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
 $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
 $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
 $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
 $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
 $ wage      : num  75 70.5 131 154.7 75 ...
#b
# Classify each person as Young (<18), Middle (18-60) or Old (>60)
age_level<-'null'
age_level[data001$age<18]<-'Young'
age_level[data001$age>=18&data001$age<=60]<-'Middle'
age_level[data001$age>60]<-'Old'
agelevelfac<-factor(x=age_level)
head(agelevelfac)
[1] Middle Middle Middle Middle Middle Middle
Levels: Middle Old
summary(agelevelfac)
Middle    Old 
  2827    173 
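The same grouping can also be built in one step with base R's cut(); a minimal sketch assuming the thresholds above (ages in Wage are integers, so a break at 17 captures everyone under 18):

agelevelfac2<-cut(data001$age, breaks=c(-Inf,17,60,Inf), labels=c('Young','Middle','Old'))
summary(agelevelfac2)   # Young shows 0 because no one in Wage is under 18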
#c
data002<-data.frame(Wage)
summary(data002$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   33.75   42.00   42.41   51.00   80.00 
residual<-data002$age%%10
summary(data002$age)#this is to find the range
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   33.75   42.00   42.41   51.00   80.00 
for (i in 1:3000) {
  if(residual[i]==5){
    data002$age[i]=data002$age[i]+5 # round(x,-1) rounds halves to even (25 -> 20 but 35 -> 40), so ages ending in 5 are bumped up to the next ten instead
  }else{
    data002$age[i]=round(data002$age[i],-1)
  }
}
summary(data002$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.00   30.00   40.00   42.85   50.00   80.00 
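A vectorized alternative that always rounds halves up (a minimal sketch, assuming the original ages in Wage; floor() sidesteps round()'s half-to-even behaviour):

rounded_age<-floor(Wage$age/10+0.5)*10   # nearest ten, ties rounded up
summary(rounded_age)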
1. The variable race is a factor.
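This can be verified directly from the data frame; a minimal check:

class(Wage$race)    # "factor"
nlevels(Wage$race)  # 4 levels: White, Black, Asian, Other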
Part 3
#Part3
#a
head(Wage)
       year age           maritl     race       education             region       jobclass
231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic  1. Industrial
86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic 2. Information
161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic  1. Industrial
155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic 2. Information
11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic 2. Information
376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic 2. Information
               health health_ins  logwage      wage
231655      1. <=Good      2. No 4.318063  75.04315
86582  2. >=Very Good      2. No 4.255273  70.47602
161300      1. <=Good     1. Yes 4.875061 130.98218
155159 2. >=Very Good     1. Yes 5.041393 154.68529
11443       1. <=Good     1. Yes 4.318063  75.04315
376662 2. >=Very Good     1. Yes 4.845098 127.11574
summary(Wage$education)
      1. < HS Grad         2. HS Grad    3. Some College    4. College Grad 
               268                971                650                685 
5. Advanced Degree 
               426 
summary(Wage$race)
1. White 2. Black 3. Asian 4. Other 
    2480      293      190       37 
educationvector<-as.vector(Wage$education)
racevector<-as.vector(Wage$race)
bardata<-table(racevector,educationvector)
bardata
            educationvector
racevector   1. < HS Grad 2. HS Grad 3. Some College 4. College Grad 5. Advanced Degree
  1. White            211        822             532             576                339
  2. Black             31        105              92              40                 25
  3. Asian             15         31              18              66                 60
  4. Other             11         13               8               3                  2
barplot(bardata, main='Frequency of Education', names.arg=c('<HS Grad','HS Grad','Some College','College Grad','Advanced Degree'), xlab='Education Level', legend.text=rownames(bardata))
[Stacked bar plot: "Frequency of Education"; x-axis: Education Level; each bar stacked by race (White, Black, Asian, Other)]
#b
data003<-data.frame(Wage)
boxplot(wage~race, data003,col=c('grey','brown','yellow','steelblue'))
[Box plots of wage by race: 1. White, 2. Black, 3. Asian, 4. Other]
1. From the Frequency of Education table, the majority of every education level is White, which reflects the sample composition. Asians have a relatively higher proportion of College Grad and Advanced Degree, while Black and Other respondents have relatively higher proportions of HS Grad and Some College (see the proportion check below).
2. From the box plot, Asians show a relatively higher average wage with a larger variance, while the Other group has a relatively lower average wage.
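The proportion claim in point 1 can be checked numerically; a minimal sketch, assuming the bardata table built above:

round(prop.table(bardata, margin=1), 2)   # share of each education level within each race (rows sum to 1)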
Part 4
#Part 4
#a
data004<-data.frame(Wage)
summary(data004$health_ins)#it's a binary variable
1. Yes  2. No 
  2083    917 
t.test(data004$wage~data004$health_ins)
Welch Two Sample t-test
data: data004$wage by data004$health_ins
t = 18.708, df = 1989.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1. Yes and group 2. No is not equal to 0
95 percent confidence interval:
24.99464 30.84858
sample estimates:
mean in group 1. Yes  mean in group 2. No 
            120.2383              92.3167 
#b
#The correlation coefficient between wage and age:
cor(data004$wage,data004$age)
[1] 0.1956372
1. The p-value is far below 0.05, so we reject H0: the mean wages of the two health-insurance groups are not equal.
2. The correlation coefficient is 0.1956372.
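Whether this correlation differs significantly from zero can be checked with base R's cor.test(); a minimal sketch using the same variables:

cor.test(data004$wage, data004$age)   # tests H0: correlation = 0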
Part 5
#Part 5
#a
data(Auto)
head(Auto)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500
lmmodel<-lm(Auto$mpg~Auto$weight)
summary(lmmodel)
Call:
lm(formula = Auto$mpg ~ Auto$weight)
Residuals:
     Min       1Q   Median       3Q      Max 
-11.9736  -2.7556  -0.3358   2.1379  16.5194 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 46.216524   0.798673   57.87   <2e-16 ***
Auto$weight -0.007647   0.000258  -29.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.333 on 390 degrees of freedom
Multiple R-squared: 0.6926,  Adjusted R-squared: 0.6918
F-statistic: 878.8 on 1 and 390 DF, p-value: < 2.2e-16
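A quick way to read the fitted slope (a minimal sketch, assuming the lmmodel object above): the coefficient on weight implies that each additional 1000 lb of weight is associated with roughly 7.6 fewer mpg.

coef(lmmodel)['Auto$weight']*1000                        # about -7.6 mpg per extra 1000 lb
coef(lmmodel)[1] + coef(lmmodel)['Auto$weight']*3000     # fitted mpg at a weight of 3000 lb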
#b
plot(Auto$weight,Auto$mpg,xlab='weight',ylab='mpg')
abline(lmmodel)
[Scatter plot of mpg against weight with the fitted regression line from lmmodel]
#c
par(mfrow=c(2,2))
plot(lmmodel)
[Diagnostic plots for lmmodel: Residuals vs Fitted, Q-Q Residuals, Scale-Location, and Residuals vs Leverage (with Cook's distance); observations 321, 325 and 382 are flagged]
#d
set.seed(1007)
data005<-data.frame(Auto)
head(data005)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500
train_index<-sample(nrow(data005), nrow(data005)*0.75)
trainingset<-data005[train_index,]
validationset<-data005[-train_index,]
#e
lmmodel2<-lm(mpg~weight+displacement+horsepower+acceleration,trainingset)
summary(lmmodel2)
Call:
lm(formula = mpg ~ weight + displacement + horsepower + acceleration,
data = trainingset)
Residuals:
     Min       1Q   Median       3Q      Max 
-11.8914  -2.6862  -0.5102   2.2401  15.9480 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.2106392  3.0778592  14.689  < 2e-16 ***
weight       -0.0055336  0.0010466  -5.287 2.45e-07 ***
displacement -0.0085752  0.0084548  -1.014    0.311    
horsepower   -0.0328889  0.0205787  -1.598    0.111    
acceleration -0.0004225  0.1579407  -0.003    0.998    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.351 on 289 degrees of freedom
Multiple R-squared: 0.7049,  Adjusted R-squared: 0.7008
F-statistic: 172.6 on 4 and 289 DF, p-value: < 2.2e-16
pred<-predict(lmmodel2, validationset)
MSE<-mean((validationset[,'mpg']-pred)^2)
MSE
[1] 15.69761
#f
lmmodel3<-lm(mpg~weight+displacement+horsepower+acceleration+weight*displacement,trainingset)
summary(lmmodel3)
Call:
lm(formula = mpg ~ weight + displacement + horsepower + acceleration +
weight * displacement, data = trainingset)
Residuals:
     Min       1Q   Median       3Q      Max 
-13.4262  -2.2002  -0.3478   2.0570  16.4205 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          6.110e+01  3.622e+00  16.870  < 2e-16 ***
weight              -8.804e-03  1.072e-03  -8.215 7.28e-15 ***
displacement        -8.886e-02  1.376e-02  -6.458 4.50e-10 ***
horsepower          -7.388e-02  1.988e-02  -3.716 0.000243 ***
acceleration        -1.191e-01  1.469e-01  -0.811 0.418267    
weight:displacement  2.356e-05  3.323e-06   7.089 1.04e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.021 on 288 degrees of freedom
Multiple R-squared: 0.7487,  Adjusted R-squared: 0.7444
F-statistic: 171.6 on 5 and 288 DF, p-value: < 2.2e-16
pred2<-predict(lmmodel3,validationset)
MSE2<-mean((validationset[,'mpg']-pred2)^2)
MSE2
[1] 14.66676
c. The residuals show no obvious non-linear pattern, appear roughly normally distributed, and are spread randomly around zero, although a few observations deviate noticeably from the fitted values.
d. Judging by the R-squared and the validation MSE, the model performs well on the validation sample.
e. MSE2 < MSE: by validation MSE, the model with the interaction term performs better (see the comparison below).
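Points d and e can be summarised in two lines; a minimal sketch, assuming lmmodel2, lmmodel3, MSE and MSE2 from above. anova() compares the nested models on the training data, and the square roots express validation error in mpg units (about 3.96 vs 3.83 here):

anova(lmmodel2, lmmodel3)                               # F-test for adding the interaction term
sqrt(c(no_interaction=MSE, with_interaction=MSE2))      # validation RMSE in mpg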
- END -