
ECO6126 Assignment 1
ZHANG Wangjia 119020545
07 2023
Contents
Packages and Functions
Practical Problems
    Part 1
    Part 2
    Part 3
    Part 4
    Part 5
Packages and Functions
rm(list=ls())
library(ISLR)
data(Wage)
data(Auto)
Practical Problems
Part 1
#Part 1 Task 1
salesdata<-c(25000,30000,20000,22000,15000,17000,18000,22000,16000,12000,8000,14000)
profitdata<-c(0.12, 0.10, 0.08, 0.11, 0.09, 0.07, 0.10, 0.09, 0.07, 0.15, 0.12, 0.10)
Sales<-matrix(data=salesdata,nrow=4,ncol=3,byrow=TRUE, dimnames=list(c('Toyota','Ford','Honda','BMW'),c('US','Asia','Europe')))
Sales
          US  Asia Europe
Toyota 25000 30000  20000
Ford   22000 15000  17000
Honda  18000 22000  16000
BMW    12000  8000  14000
Profit_Rates<-matrix(data=profitdata,nrow=4,ncol=3,byrow=TRUE, dimnames=list(c('Toyota','Ford','Honda','BMW'),c('US','Asia','Europe')))
Profit_Rates
         US Asia Europe
Toyota 0.12 0.10   0.08
Ford   0.11 0.09   0.07
Honda  0.10 0.09   0.07
BMW    0.15 0.12   0.10
#Part 1 Task 2
Net_profit<-Sales*Profit_Rates
Net_profit
         US Asia Europe
Toyota 3000 3000   1600
Ford   2420 1350   1190
Honda  1800 1980   1120
BMW    1800  960   1400
Net_profit[,1]
Toyota   Ford  Honda    BMW 
  3000   2420   1800   1800 
barplot(Net_profit[,1], main='Net profit data between brands in the US', names.arg=c('Toyota','Ford','Honda','BMW'), xlab='Brands', ylab='Net Profit', col='blue')
[Bar plot: "Net profit data between brands in the US"; x-axis: Brands (Toyota, Ford, Honda, BMW); y-axis: Net Profit]
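The Net_profit matrix can also be aggregated directly with base R matrix helpers. A minimal sketch, assuming the Net_profit object computed above:

rowSums(Net_profit)   # total net profit per brand across all regions
colSums(Net_profit)   # total net profit per region across all brands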
Part 2
#Part 2
#a
data001<-data.frame(Wage)
str(data001)
'data.frame': 3000 obs. of 11 variables:
 $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
 $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
 $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
 $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
 $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
 $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
 $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
 $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
 $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
 $ wage      : num  75 70.5 131 154.7 75 ...
#b
# Classify each person as Young (<18), Middle (18-60) or Old (>60)
age_level<-'null'
age_level[data001$age<18]<-'Young'
age_level[data001$age>=18&data001$age<=60]<-'Middle'
age_level[data001$age>60]<-'Old'
agelevelfac<-factor(x=age_level)
head(agelevelfac)
[1] Middle Middle Middle Middle Middle Middle
Levels: Middle Old
summary(agelevelfac)
Middle    Old 
  2827    173 
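The same grouping can also be built in one step with base R's cut(); a minimal sketch assuming the thresholds above (ages in Wage are integers, so a break at 17 captures everyone under 18):

agelevelfac2<-cut(data001$age, breaks=c(-Inf,17,60,Inf), labels=c('Young','Middle','Old'))
summary(agelevelfac2)   # Young shows 0 because no one in Wage is under 18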
#c
data002<-data.frame(Wage)
summary(data002$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   33.75   42.00   42.41   51.00   80.00 
residual<-data002$age%%10
summary(data002$age)#this is to find the range
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   33.75   42.00   42.41   51.00   80.00 
for (i in 1:3000) {
  if(residual[i]==5){
    data002$age[i]=data002$age[i]+5 # round(x,-1) rounds halves to even (25 -> 20 but 35 -> 40), so ages ending in 5 are bumped up to the next ten instead
  }else{
    data002$age[i]=round(data002$age[i],-1)
  }
}
summary(data002$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.00   30.00   40.00   42.85   50.00   80.00 
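A vectorized alternative that always rounds halves up (a minimal sketch, assuming the original ages in Wage; floor() sidesteps round()'s half-to-even behaviour):

rounded_age<-floor(Wage$age/10+0.5)*10   # nearest ten, ties rounded up
summary(rounded_age)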
1. The variable race is a factor.
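This can be verified directly from the data frame; a minimal check:

class(Wage$race)    # "factor"
nlevels(Wage$race)  # 4 levels: White, Black, Asian, Other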
Part 3
#Part3
#a
head(Wage)
       year age           maritl     race       education             region       jobclass
231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic  1. Industrial
86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic 2. Information
161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic  1. Industrial
155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic 2. Information
11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic 2. Information
376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic 2. Information
               health health_ins  logwage      wage
231655      1. <=Good      2. No 4.318063  75.04315
86582  2. >=Very Good      2. No 4.255273  70.47602
161300      1. <=Good     1. Yes 4.875061 130.98218
155159 2. >=Very Good     1. Yes 5.041393 154.68529
11443       1. <=Good     1. Yes 4.318063  75.04315
376662 2. >=Very Good     1. Yes 4.845098 127.11574
summary(Wage$education)
      1. < HS Grad         2. HS Grad    3. Some College    4. College Grad 
               268                971                650                685 
5. Advanced Degree 
               426 
summary(Wage$race)
1. White 2. Black 3. Asian 4. Other 
    2480      293      190       37 
educationvector<-as.vector(Wage$education)
racevector<-as.vector(Wage$race)
bardata<-table(racevector,educationvector)
bardata
            educationvector
racevector   1. < HS Grad 2. HS Grad 3. Some College 4. College Grad 5. Advanced Degree
  1. White            211        822             532             576                339
  2. Black             31        105              92              40                 25
  3. Asian             15         31              18              66                 60
  4. Other             11         13               8               3                  2
barplot(bardata, main='Frequency of Education', names.arg=c('<HS Grad','HS Grad','Some College','College Grad','Advanced Degree'), xlab='Education Level', legend.text=rownames(bardata))
[Stacked bar plot: "Frequency of Education"; x-axis: Education Level; each bar stacked by race (White, Black, Asian, Other)]
#b
data003<-data.frame(Wage)
boxplot(wage~race, data003,col=c('grey','brown','yellow','steelblue'))
[Box plots of wage by race: 1. White, 2. Black, 3. Asian, 4. Other]
1. From the Frequency of Education table, the majority of every education level is White, which reflects the sample composition. Asians have a relatively higher proportion of College Grad and Advanced Degree, while Black and Other respondents have relatively higher proportions of HS Grad and Some College (see the proportion check below).
2. From the box plot, Asians show a relatively higher average wage with a larger variance, while the Other group has a relatively lower average wage.
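The proportion claim in point 1 can be checked numerically; a minimal sketch, assuming the bardata table built above:

round(prop.table(bardata, margin=1), 2)   # share of each education level within each race (rows sum to 1)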
Part 4
#Part 4
#a
data004<-data.frame(Wage)
summary(data004$health_ins)#it's a binary variable
1. Yes  2. No 
  2083    917 
t.test(data004$wage~data004$health_ins)
Welch Two Sample t-test
data: data004$wage by data004$health_ins
t = 18.708, df = 1989.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1. Yes and group 2. No is not equal to 0
95 percent confidence interval:
24.99464 30.84858
sample estimates:
mean in group 1. Yes  mean in group 2. No 
            120.2383              92.3167 
#b
#The correlation coefficient between wage and age:
cor(data004$wage,data004$age)
[1] 0.1956372
1. The p-value is far below 0.05, so we reject H0: the mean wages of the two health-insurance groups are not equal.
2. The correlation coefficient is 0.1956372.
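Whether this correlation differs significantly from zero can be checked with base R's cor.test(); a minimal sketch using the same variables:

cor.test(data004$wage, data004$age)   # tests H0: correlation = 0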
Part 5
#Part 5
#a
data(Auto)
head(Auto)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500
lmmodel<-lm(Auto$mpg~Auto$weight)
summary(lmmodel)
Call:
lm(formula = Auto$mpg ~ Auto$weight)
Residuals:
     Min       1Q   Median       3Q      Max 
-11.9736  -2.7556  -0.3358   2.1379  16.5194 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 46.216524   0.798673   57.87   <2e-16 ***
Auto$weight -0.007647   0.000258  -29.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.333 on 390 degrees of freedom
Multiple R-squared: 0.6926,  Adjusted R-squared: 0.6918
F-statistic: 878.8 on 1 and 390 DF, p-value: < 2.2e-16
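A quick way to read the fitted slope (a minimal sketch, assuming the lmmodel object above): the coefficient on weight implies that each additional 1000 lb of weight is associated with roughly 7.6 fewer mpg.

coef(lmmodel)['Auto$weight']*1000                        # about -7.6 mpg per extra 1000 lb
coef(lmmodel)[1] + coef(lmmodel)['Auto$weight']*3000     # fitted mpg at a weight of 3000 lb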
#b
plot(Auto$weight,Auto$mpg,xlab='weight',ylab='mpg')
abline(lmmodel)
[Scatter plot of mpg against weight with the fitted regression line from lmmodel]
#c
par(mfrow=c(2,2))
plot(lmmodel)
[Diagnostic plots for lmmodel: Residuals vs Fitted, Q-Q Residuals, Scale-Location, and Residuals vs Leverage (with Cook's distance); observations 321, 325 and 382 are flagged]
#d
set.seed(1007)
data005<-data.frame(Auto)
head(data005)
  mpg cylinders displacement horsepower weight acceleration year origin
1  18         8          307        130   3504         12.0   70      1
2  15         8          350        165   3693         11.5   70      1
3  18         8          318        150   3436         11.0   70      1
4  16         8          304        150   3433         12.0   70      1
5  17         8          302        140   3449         10.5   70      1
6  15         8          429        198   4341         10.0   70      1
                       name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500
train_index<-sample(nrow(data005), nrow(data005)*0.75)
trainingset<-data005[train_index,]
validationset<-data005[-train_index,]
#e
lmmodel2<-lm(mpg~weight+displacement+horsepower+acceleration,trainingset)
summary(lmmodel2)
Call:
lm(formula = mpg ~ weight + displacement + horsepower + acceleration,
data = trainingset)
Residuals:
     Min       1Q   Median       3Q      Max 
-11.8914  -2.6862  -0.5102   2.2401  15.9480 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.2106392  3.0778592  14.689  < 2e-16 ***
weight       -0.0055336  0.0010466  -5.287 2.45e-07 ***
displacement -0.0085752  0.0084548  -1.014    0.311    
horsepower   -0.0328889  0.0205787  -1.598    0.111    
acceleration -0.0004225  0.1579407  -0.003    0.998    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.351 on 289 degrees of freedom
Multiple R-squared: 0.7049,  Adjusted R-squared: 0.7008
F-statistic: 172.6 on 4 and 289 DF, p-value: < 2.2e-16
pred<-predict(lmmodel2, validationset)
MSE<-mean((validationset[,'mpg']-pred)^2)
MSE
[1] 15.69761
#f
lmmodel3<-lm(mpg~weight+displacement+horsepower+acceleration+weight*displacement,trainingset)
summary(lmmodel3)
Call:
lm(formula = mpg ~ weight + displacement + horsepower + acceleration +
weight * displacement, data = trainingset)
Residuals:
     Min       1Q   Median       3Q      Max 
-13.4262  -2.2002  -0.3478   2.0570  16.4205 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          6.110e+01  3.622e+00  16.870  < 2e-16 ***
weight              -8.804e-03  1.072e-03  -8.215 7.28e-15 ***
displacement        -8.886e-02  1.376e-02  -6.458 4.50e-10 ***
horsepower          -7.388e-02  1.988e-02  -3.716 0.000243 ***
acceleration        -1.191e-01  1.469e-01  -0.811 0.418267    
weight:displacement  2.356e-05  3.323e-06   7.089 1.04e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.021 on 288 degrees of freedom
Multiple R-squared: 0.7487,  Adjusted R-squared: 0.7444
F-statistic: 171.6 on 5 and 288 DF, p-value: < 2.2e-16
pred2<-predict(lmmodel3,validationset)
MSE2<-mean((validationset[,'mpg']-pred2)^2)
MSE2
[1] 14.66676
c. The residuals show no obvious non-linear pattern, appear roughly normally distributed, and are spread randomly around zero, although a few observations deviate noticeably from the fitted values.
d. Judging by the R-squared and the validation MSE, the model performs well on the validation sample.
e. MSE2 < MSE: by validation MSE, the model with the interaction term performs better (see the comparison below).
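Points d and e can be summarised in two lines; a minimal sketch, assuming lmmodel2, lmmodel3, MSE and MSE2 from above. anova() compares the nested models on the training data, and the square roots express validation error in mpg units (about 3.96 vs 3.83 here):

anova(lmmodel2, lmmodel3)                               # F-test for adding the interaction term
sqrt(c(no_interaction=MSE, with_interaction=MSE2))      # validation RMSE in mpg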
- END -