
Statistical Learning: A Comparison of MARS and GAM

STATS 790
Assignment 4
Student number: 400415239
Kyuson Lim
29 March, 2022
Question 1
I am interested in the impact of education in higher-education systems, both as a topic to survey and as a source of data to analyze. I found suitable education data from the UN (United Nations), as follows. The data set was constructed using average science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (purchasing power parity, 2005 dollars), an Educational Index, a Health Index, and an Income Index from UN data. In this analysis, I investigate a model using the health, education, and income indices, with a view to extending it to multivariate modeling with GAMMs (mixed GAMs) in the final project for the course.
• Overall Science Score (response variable): the average science score for 15-year-olds, stored as the Overall variable
• Health Index: based on life expectancy at birth, stored as the Health variable
• Education Index: based on mean years of schooling for adults aged 25 and over, stored as the Edu variable
• Income Index: based on gross national income per capita, stored as the Income variable
For our purposes in GAM and MARS, we selected these 3 variables because each is positively correlated with the response. Mixing negatively and positively correlated predictors invites Simpson's paradox, namely that what holds for individuals may not hold for the group; when this happens the regression loses predictive power and the fitted model becomes unreliable. Hence, the 3 predictors with all-positive correlations and the single response variable, the overall science score, are used in question 1 and in all subsequent questions. These correlations can be checked directly, as in the sketch below.
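A minimal sketch of that check, assuming the scaled data frame pis (constructed in Question 4) with columns Health, Edu, Income, and Overall:

# correlation of each predictor with the response
# (pis is the scaled PISA data frame built in Question 4)
cor(pis[, c("Health", "Edu", "Income")], pis$Overall)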
[Figure: density-scatter plot of the income index (Income, x-axis 0.4 to 0.9) against the overall science score (Overall, y-axis 400 to 500), with a smooth curve; a case of overfitting (a reason for MARS)]
Non-linear effect
Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables while maintaining additivity. (The beauty of GAMs is that we can use many non-linear methods as building blocks for fitting an additive model.)
We can also see from the plot that, while there is a positive correlation between the income index and the overall science score, all 3 variables show some non-linearity under the fitted geom_smooth curve (with span above 0.15). A linear fit would be difficult here, and the non-linear effect is expected to be captured by a smoothing spline or loess in the GAM (and by piecewise linear segments in MARS). In GAM terms, the income variable carries a non-linearity that the model complexity must adjust to. From the plot we can identify a tapering off of Income's effect at its highest levels and, in addition, a positive effect of Education in the mid-range values (roughly 0.7 to 0.8), with a slight positive effect overall. A sketch of how such a plot can be drawn follows.
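A minimal sketch of one of these density-scatter plots, assuming ggplot2 and the unscaled pisa data with these column names; the span value is an assumption chosen to reproduce the wiggliness discussed above:

library(ggplot2)
# scatter of Income vs. Overall with a deliberately wiggly loess smooth
# (span > 0.15, as in the plots above) to illustrate potential overfitting
ggplot(pisa, aes(x = Income, y = Overall)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", span = 0.2) +
  labs(title = "A density-scatter plot of income index")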
As the smoothing spline in the GAM is expected to penalize the health index, its non-linear effect should be reduced, perhaps even to a negative effect within the high-income group. For the education index, we could have a model with more effective degrees of freedom (EDF), making the variable more flexible through non-linearity in the GAM, depending on the response. (Since the GAM has more EDF than a linear regression when there are many predictors, the non-linearity of the education index should be better fitted by the GAM than by linear regression.)
[Figure: density-scatter plot of the educational index (Edu, x-axis 0.6 to 1.0) against the overall science score (Overall, y-axis 400 to 500); a case of overfitting (a reason for MARS)]
[Figure: density-scatter plot of the health index (Health, x-axis 0.8 to 0.9) against the overall science score (Overall, y-axis 400 to 500); a case of overfitting (a reason for MARS)]
MARS to outperform GAM
The effect of the health index is particularly important to identify: it shows a steadily increasing linear trend. Because of that trend, the model is expected to reduce this predictor to a simple linear effect on the response. Updating the GAM to model this effect explicitly as linear, as in ordinary regression, while keeping the other variables as smooths risks leading the model to ignore important non-linear effects of the other variables.
The second problem is overfitting, as seen in the plots. Because the data points for all variables are somewhat sparse, spread across the whole range, some intervals are over-fit by the non-linear smooths (the span above 0.15 in all plots is an example). From the plot, the income index smooth overfits in the interval 0.7 to 0.8, where most points are scattered and the fit is very sensitive, while it rests on only a few points in the intervals 0.4 to 0.5 and around 0.9. The health index smooth likewise overfits: the loess curve is wiggly wherever points are scattered, even though the linear fit in MARS is expected to be stable there. While the GAM overfits for most variables other than health, MARS should not overfit these variables, thanks to its lower model complexity. As the plots show, all 3 variables are over-fit by smooths with span above 0.15 under the GAM.
Variable interactions

[Figure: 3-D perspective plot of a tensor product smooth, with the response on the vertical axis and Edu and Income on the horizontal axes]
(In the figure above we are using a type of smooth called a tensor product smooth, built by smoothing over the marginal smooths of Income and Education.)
The main limitation of GAMs is that the model is restricted to be additive; with many variables, important interactions can be missed. For the income and education variables, wealthy countries with more apparent educational infrastructure will tend to score higher on the overall science score. However, wealth alone does not necessarily guarantee higher science scores, and the interaction terms in MARS allow such joint effects to be modeled, akin to an interaction in typical model settings. Hence, I would expect an interaction of the income and education indices to let MARS outperform a GAM fit; a sketch of the tensor product fit behind the figure follows.
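A minimal sketch of the tensor product smooth, assuming mgcv and the scaled data frame pis; drawing the perspective plot via vis.gam() is an assumption about how the figure was produced:

library(mgcv)
# joint (non-additive) effect of Income and Edu via a tensor product smooth
m_te <- gam(Overall ~ te(Income, Edu), data = pis)
# 3-D perspective plot of the fitted interaction surface
vis.gam(m_te, view = c("Income", "Edu"), plot.type = "persp", theta = 35)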
Question 2
Table 1: MSE result of possible fits (rows: nk, the maximum number of forward-pass terms; columns: maximum degree of interaction)

nk     deg 1    deg 2    deg 3    deg 4
10     0.5415   0.5212   0.5224   0.5224
15     0.5144   0.5984   0.5818   0.5818
17     0.5113   0.5998   0.5911   0.5911
19     0.4998   0.5847   0.6289   0.6289
20     0.4998   0.5847   0.6289   0.6289
22     0.4863   0.5729   0.6269   0.6269
25     0.4908   0.5603   0.5982   0.5982
Based on the repeated 10-fold CV results in Table 1, the out-of-sample MSE is lowest at 0.4863 with nk = 22, the maximum number of terms created by the forward pass, and a maximum degree of interaction of 1, meaning an additive model with no interaction terms is best. Note that the number of cross-validation repetitions is set to 10 (ncross = 10) for simulation purposes. Hence, the model is fitted without any interaction terms and with nk = 22. A sketch of how the grid in Table 1 could be produced is shown below, followed by the chosen fit.
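One plausible way to produce Table 1 is a manual grid search over nk and degree; this is a sketch under the assumption that the out-of-sample MSE was computed with folds built as in Question 4 (the resulting res matrix is the one reused in the comparison table of Question 4):

set.seed(790)
nks  <- c(10, 15, 17, 19, 20, 22, 25)   # candidate forward-pass sizes
degs <- 1:4                             # candidate interaction degrees
res  <- matrix(NA, length(nks), length(degs),
               dimnames = list(paste0("nk=", nks), paste0("deg ", degs)))
fold <- sample(rep(1:10, length.out = nrow(pis)))  # 10-fold assignment
for (a in seq_along(nks)) {
  for (b in seq_along(degs)) {
    sse <- 0
    for (j in 1:10) {
      fit <- earth(Overall ~ Health + Edu + Income,
                   data = pis[fold != j, ], nk = nks[a], degree = degs[b])
      sse <- sse + sum((predict(fit, pis[fold == j, ]) -
                          pis$Overall[fold == j])^2)
    }
    res[a, b] <- sse / nrow(pis)  # out-of-sample MSE
  }
}
kable(res, caption = "MSE Result of possible fits")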
set.seed(790)
# MARS fit: nk = 22, additive (degree = 1), repeated 10-fold CV
m1 <- earth(Overall ~ Health + Edu + Income, data = pis, nfold = 10,
            nk = 22, degree = 1, ncross = 10)
summary(m1)
## Call: earth(formula=Overall~Health+Edu+Income, data=pis, degree=1, nfold=10,
##             ncross=10, nk=22)
##
##                       coefficients
## (Intercept)              -1.548140
## h(Edu- -0.938708)         2.137014
## h(Edu- -0.414698)        -4.413048
## h(Edu- -0.103845)         2.623940
## h(Income- -0.646418)      4.214232
## h(Income- -0.34928)      -3.831756
## h(Income-0.820701)       -2.670915
##
## Selected 7 of 14 terms, and 2 of 3 predictors
## Termination condition: Reached nk 22
## Importance: Income, Edu, Health-unused
## Number of terms at each degree of interaction: 1 6 (additive model)
## GCV 0.3133325  RSS 9.164976  GRSq 0.6926931  RSq 0.8202946  CVRSq 0.2731854
##
## Note: the cross-validation sd's below are standard deviations across folds
##
## Cross validation:   nterms 5.47 sd 1.75   nvars 2.04 sd 0.60
##
##      CVRSq   sd   MaxErr   sd
##      0.273 0.86    -2.94 1.26
Table 2: Summary information of the fitted MARS (degree 1)

threshold   R squared   GCV score   predictive power (GRSq)
0.001       0.8203      0.3133      0.6927
From Table 2, when the threshold (the stopping criterion for the forward pass) is set to 0.001, the R² of the model computed over all responses is 0.82, indicating a good fit. The generalized cross-validation (GCV) score of the model is 0.3133, which is low; the GCV penalizes model complexity and so relates to overfitting in MARS, with higher values indicating a less precise fit. Lastly, GRSq, defined as 1 − GCV/GCV_null (where GCV_null is the GCV of the null model), is computed as an estimate of the predictive power of the model; its value of 0.6927 is acceptable for a MARS fit. These statistics can also be read off the fitted object, as sketched below.
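A minimal sketch, using components that earth stores on the fitted object:

# fit statistics straight from the earth object (matches Table 2)
c(RSq = m1$rsq, GCV = m1$gcv, GRSq = m1$grsq)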
Table 3: Coefficients of the MARS fit

Term                     Overall
(Intercept)              -1.548
h(Income-0.820701)       -2.671
h(Edu- -0.938708)         2.137
h(Edu- -0.414698)        -4.413
h(Edu- -0.103845)         2.624
h(Income- -0.646418)      4.214
h(Income- -0.34928)      -3.832
The fitted MARS model has the equation

ŷ = −1.548 + 2.137(Education + 0.939)+ − 4.413(Education + 0.415)+ + 2.624(Education + 0.104)+
    + 4.214(Income + 0.646)+ − 3.832(Income + 0.349)+ − 2.671(Income − 0.821)+

where (x)+ = max(0, x) is the hinge function and all variables are on the standardized scale, so the knots from the summary (e.g. −0.938708) are mostly negative.
(For example, restricting attention to the Education terms, the hinge functions make the fit piecewise linear:

Education component =
  0,                                                                                   Education ≤ −0.939
  2.137(Education + 0.939),                                                            −0.939 ≤ Education ≤ −0.415
  2.137(Education + 0.939) − 4.413(Education + 0.415),                                 −0.415 ≤ Education ≤ −0.104
  2.137(Education + 0.939) − 4.413(Education + 0.415) + 2.624(Education + 0.104),      Education ≥ −0.104

so the slope changes at each knot as successive hinge terms switch on.)
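A minimal sketch of this piecewise evaluation in R, with the hinge written out explicitly (coefficients and knots taken from the summary above):

h <- function(x) pmax(0, x)  # MARS hinge function
# Education component of the fitted model (standardized scale)
edu_part <- function(Edu) {
  2.137 * h(Edu + 0.938708) - 4.413 * h(Edu + 0.414698) +
    2.624 * h(Edu + 0.103845)
}
edu_part(c(-1.2, -0.5, 0, 1))  # slope changes at each knot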
Question 3

Using GAM

# model fit: smoothing splines on all three predictors (mgcv)
mod_lm2 <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data = pis)
summary(mod_lm2)
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income) + s(Edu) + s(Health)
##
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.479e-14  5.134e-02       0        1
##
## Approximate significance of smooth terms:
##             edf Ref.df     F  p-value
## s(Income) 7.593  8.415 8.826 1.29e-06 ***
## s(Edu)    6.204  7.178 3.308  0.00771 **
## s(Health) 1.000  1.000 2.736  0.10679
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.863   Deviance explained = 90.3%
## GCV = 0.19688  Scale est. = 0.13707  n = 52
We should note that the effect of the health index is fitted with an effective degrees of freedom (EDF) of 1 by its smoothing spline (s(Health)), suggesting it is essentially best reduced to a simple linear effect, which avoids any overfitting in the model. The income index is modeled as s(Income), a smoothing spline with 7.593 effective degrees of freedom, and the education index as s(Edu), a smoothing spline with 6.204 effective degrees of freedom. Note that the GAM can be written as Overall = f1(Income) + f2(Edu) + f3(Health), since the intercept is essentially 0. Given the EDF of 1, Health could equivalently enter as a parametric linear term, as sketched below.
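A minimal sketch of that refit, with Health as a plain linear term; this is a natural follow-up under the EDF-of-1 observation, not a fit reported in the assignment:

# Health enters linearly; Income and Edu keep their smooths
mod_lm3 <- gam(Overall ~ s(Income) + s(Edu) + Health, data = pis)
summary(mod_lm3)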
[Figure: GAM fit, fitted y-value against each predictor (scaled): three panels showing the fitted Overall score against Income, Edu, and Health, each with its smoothing-spline curve]
The visreg package can be used to visualize regression models, plotting the fitted GAM against each predictor. With the visreg function, the response values appear on the y-axis and each predictor on the x-axis, as in the sketch below.
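A minimal sketch of how those panels can be produced with visreg (the gg = TRUE option, which returns ggplot objects, is an assumption about how the figures were styled):

library(visreg)
# one panel per predictor: fitted smooth with partial residuals
visreg(mod_lm2, "Income", gg = TRUE)
visreg(mod_lm2, "Edu",    gg = TRUE)
visreg(mod_lm2, "Health", gg = TRUE)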
Using mgcv
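The same smooths can also be drawn with mgcv's built-in plot method; a minimal sketch:

# one panel per smooth term, with partial residuals overlaid
plot(mod_lm2, pages = 1, residuals = TRUE, shade = TRUE)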
Question 4
GAM fit
set.seed(790)
# GAM
# setting for CV: scale the PISA data and keep the 3 predictors + response
pis <- scale(pisa[-c(1, 2), c(8, 9, 10, 2)]) %>% data.frame()
rownames(pis) <- c(1:length(pis[, 1]))
n <- length(pis[, 4])
d <- length(as.matrix(pis[, c(1, 2, 3)])) / n
####################################
## 10-fold CV
####################################
r <- floor((n - 1) / 10)
# Creates a repeating sequence of 1,2,...,10,1,2,... of length n
# (may cut off mid-repetition)
myAssign <- c(rep(1:10, r), 1:(n - (10 * r)))
# number of times to re-run the 10-fold cross-validation
numSplits <- 10
MSE <- c(0)
names(MSE) <- c("GAM")
for (i in 1:numSplits) {
  # generate a new 10-fold split by permuting the assignment vector
  myAssign <- sample(myAssign)
  for (j in 1:10) {
    tr_data <- data.frame(pis[myAssign != j, ])  # training folds
    GAMFit1 <- gam(Overall ~ s(Income) + s(Health) + s(Edu), data = tr_data,
                   control = gam.control(nthreads = 4, maxit = 60))
    n_data <- data.frame(pis[myAssign == j, ])   # held-out fold
    # accumulate the squared prediction error on the held-out fold
    MSE[1] <- MSE[1] +
      sum((as.numeric(na.omit(predict(GAMFit1, n_data))) - n_data$Overall)^2)
  }
}
# MSE from RSS, averaged over the repeated 10-fold CV
MSE <- MSE / (n * numSplits)
# export into outcome table
mse_gam <- data.frame(matrix(nrow = 1, ncol = 1))
rownames(mse_gam) <- c("GAM")
colnames(mse_gam) <- c("MSE")
mse_gam[1, 1] <- MSE
# ----------------------- final output --------------------------- #
kable(mse_gam, caption = "MSE of GAM for repeated 10-fold CV")
Table 4: MSE of GAM for repeated 10-fold CV

       MSE
GAM    0.3114505
Ridge regression
# ------------------ data ------------------ #
set.seed(790)
# setting for CV
x <- as.matrix(pis[, c(1, 2, 3)])
y2 <- as.numeric(pis[, 4])
n <- length(y2)
# ------------------ fitting ------------------ #
# storage for coefficients and degrees of freedom (one row per lambda)
co1 <- matrix(nrow = 20, ncol = 3)
co2 <- matrix(nrow = 20, ncol = 3)
d1 <- matrix(nrow = 20, ncol = 3)
d2 <- matrix(nrow = 20, ncol = 3)
# setting
numFolds <- 10
# We represent each (numFolds)-fold split as a sequence of length n
# (number of points in data) split ~ evenly between 1s, 2s, ..., (numFolds)s.
# We start from a deterministic sequence myAssign; random splits are
# obtained by permuting this sequence.
myAssign <- c(rep(1:numFolds, ceiling(n / numFolds) - 1),
              1:(n + numFolds - numFolds * ceiling(n / numFolds)))
# grid of ridge penalties to cross-validate over
Lambdas <- seq(0, 10, length.out = 20)
numLam <- length(Lambdas)
MSE <- rep(0, numLam)
MSE.corrForm <- rep(0, numLam)
names(MSE) <- Lambdas
names(MSE.corrForm) <- Lambdas
# cross-validation over the lambda grid
m <- 1
while (m <= numLam) {
  for (j in 1:numFolds) {
    # fit on the training folds, then add the sum of squared residuals
    # on the test fold x[myAssign == j, ], y2[myAssign == j]
    BIFit <- ridge::linearRidge(y2[myAssign != j] ~ x[myAssign != j, ],
                                lambda = Lambdas[m], scaling = "corrForm")
    co1[m, ] <- as.numeric(BIFit$coef); d1[m, ] <- as.numeric(BIFit$df[, 1:3])
    BIFit2 <- ridge::linearRidge(y2[myAssign != j] ~ x[myAssign != j, ],
                                 lambda = Lambdas[m], scaling = "scale")
    co2[m, ] <- as.numeric(BIFit2$coef); d2[m, ] <- as.numeric(BIFit2$df[, 1:3])
    # MSE
    MSE.corrForm[m] <- MSE.corrForm[m] +
      sum((c(x[myAssign == j, ] %*% BIFit$coef) - y2[myAssign == j])^2)
    MSE[m] <- MSE[m] +
      sum((c(x[myAssign == j, ] %*% BIFit2$coef) - y2[myAssign == j])^2)
  }
  m <- m + 1
}
MSE.corrForm <- MSE.corrForm / n
MSE <- MSE / n
# export into outcome table
ex2_nonzero <- data.frame(matrix(nrow = 2, ncol = 2))
rownames(ex2_nonzero) <- c("Corr.", "Scale")
colnames(ex2_nonzero) <- c("MSE", "Lambda")
ex2_nonzero[1, 1] <- min(MSE.corrForm)
ex2_nonzero[2, 1] <- min(MSE)
ex2_nonzero[1, 2] <- Lambdas[which(MSE.corrForm == min(MSE.corrForm), arr.ind = TRUE)]
ex2_nonzero[2, 2] <- Lambdas[which(MSE == min(MSE), arr.ind = TRUE)]
# ----------------------- final output --------------------------- #
kable(ex2_nonzero, caption = "Optimal Lambda and MSE for ridge")
15
Table 5: Optimal Lambda and MSE for ridge

         MSE        Lambda
Corr.    0.4513312  10
Scale    0.4427496  10
# export into outcome table
ex2 <- data.frame(matrix(nrow = 1, ncol = 4))
colnames(ex2) <- c("Corr. ridge", "Scaled ridge", "GAM", "MARS")
rownames(ex2) <- c("MSE")
ex2[1, 1] <- min(MSE.corrForm)
ex2[1, 2] <- min(MSE)
ex2[1, 3] <- mse_gam[1, 1]
ex2[1, 4] <- min(res)  # best MARS MSE from the Table 1 grid
# ----------------------- final output --------------------------- #
kable(ex2, caption = "MSE comparison")
Table 6: MSE comparison

      Corr. ridge   Scaled ridge   GAM         MARS
MSE   0.4513312     0.4427496      0.3114505   0.4863
First, ridge regression (in both the correlation-form and scaled variants) has the second-lowest MSE: the positive correlation between predictors and response is effectively penalized by ridge regression, even though ridge struggles to capture the non-linearity in the data. Generally, if the regression coefficients of highly correlated variables are nearly equal, the model tends to exhibit the grouping effect; strict convexity ensures the grouping effect in the extreme case where predictors are equal. Although MARS comes very close to ridge (a difference of roughly 0.04 in MSE), ridge's coefficient penalization better captures this grouped behaviour of the predictors. The coefficient paths stored during cross-validation make the shrinkage visible, as sketched below.
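A minimal sketch plotting the stored coefficient paths against the penalty, using the co2 matrix filled in above (the legend assumes the first three columns of pis are the predictors, as in the data setup):

# coefficient paths for the scaled ridge fits across the lambda grid
matplot(Lambdas, co2, type = "l", lty = 1,
        xlab = "lambda", ylab = "coefficient")
legend("topright", legend = colnames(pis)[1:3], col = 1:3, lty = 1)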
However, the GAM fit is the best of the 3 models, indicating that the non-linear splines fit well over the education and income variables, whose scattered, non-linear patterns are captured by the curves in question 3. Generalized additive models provide a general framework for extending a standard linear model by allowing non-linear functions of each variable while maintaining additivity, so many non-linear methods can serve as building blocks for the additive fit. The spline also captures the linear effect of the health index well: the effective degrees of freedom of 1 keeps that term simple and lets the model fit without any overfitting. Compared with the deliberately overfitted smooths in question 1, the GAM model fitted here shows no overfitting issues in the plots of question 3.
Although MARS is, in principle, well placed to avoid overfitting, since its piecewise linear segments can also track complex curves, it turns out to be the worst model here, due to underfitting on this sparse data. Moreover, the tensor interaction plot suggested an important interaction effect between income and education, but that interaction was not statistically significant in the fitted model, which has led to lower predictive power for MARS (the GRSq in question 2). (Instead, the shortfall may stem from MARS constructing too little of the multi-dimensional structure: only 2 predictors with 3 basis functions each were used, and the linear effect of the health variable was dropped entirely.)