STATS 790 Assignment 4
Student number: 400415239
Kyuson Lim
29 March, 2022

Question 1

I am interested in the impact of education in the higher-education system, and I searched several topics and sources for data to analyze. I found a suitable data set in the UN (United Nations) education data. The data set was constructed using average science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars) and the Educational Index, Health Index, and Income Index from UN data. In this analysis, I investigate a model using the health, education, and income indices, which I plan to expand into a multivariate GAMM (mixed) analysis for the final project in this course.

• (Response variable) Overall science score (average score for 15-year-olds): the Overall variable
• Health Index, based on life expectancy at birth: the Health variable
• Education Index, measured by mean years of schooling for adults aged 25 years: the Edu variable
• Income Index, based on gross national income per capita: the Income variable

For our purposes in GAM and MARS, these 3 predictors were selected because each is positively correlated with the response. Predictors that correlate negatively with the overall score can produce Simpson's paradox: what holds for the individual may not hold for the group. When some predictors are negatively and others positively correlated with the response, the regression loses predictive power and the paradox makes the model unsuitable. Hence, 3 predictors, all positively correlated with the single response variable (overall science score), are used in Question 1 and in the remaining questions.

[Figure: density-scatter plot of income index vs. Overall science score, a case of overfitting (motivation for MARS); x-axis Income (0.4 to 0.9), y-axis Overall (400 to 500).]

Non-linear effect

Generalized additive models (GAMs) provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables while maintaining additivity. (The beauty of GAMs is that we can use many non-linear methods as building blocks for fitting an additive model.) The plot shows that while income index and overall science score are positively correlated, all 3 variables exhibit some non-linearity, visible in the fitted geom_smooth curve (with span above 0.15). A linear fit would be difficult to justify, so the non-linear effect is expected to be captured by a smoothing spline or loess in a GAM (or by piecewise linear segments in MARS). In GAM terms, the income variable is non-linear enough to warrant extra model complexity. From the plot we can identify a tapering off of Income's effect at its highest levels and, in addition, a positive effect of Education in the mid-range values (roughly 0.7 to 0.8), with a slight positive effect overall. Since the smoothing spline in a GAM penalizes the health index, its non-linear effect is expected to be reduced, to the point of a negative effect, for the high income index group.
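A minimal sketch of the plot above (the text names geom_smooth, so ggplot2 is assumed; the raw data frame pisa with columns Income and Overall is the one subset in the Question 4 code, and the span value here is illustrative):

library(ggplot2)

# Scatter of income index against overall science score, with a loess
# smooth; a span above 0.15 exposes the non-linear (and possibly
# overfitted) trend discussed in the text
ggplot(pisa, aes(x = Income, y = Overall)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", span = 0.3) +
  labs(x = "Income", y = "Overall")

The same call with x = Edu or x = Health produces the other two panels.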
For the education index variable, the model could use more EDF, letting the GAM give this variable more flexibility through non-linearity, as determined by the response. (Since a GAM has more EDF than a linear regression on the same predictors, the non-linearity of the education index is fitted better by the GAM than by linear regression.)

[Figure: density-scatter plot of educational index vs. Overall science score, a case of overfitting (motivation for MARS); x-axis Edu (0.6 to 1.0), y-axis Overall (400 to 500).]

[Figure: density-scatter plot of health index vs. Overall science score, a case of overfitting (motivation for MARS); x-axis Health (0.8 to 0.9), y-axis Overall (400 to 500).]

MARS to outperform GAM

The effect of the health index is particularly important to identify: it shows a steadily increasing linear trend. Because of this steady linear trend, the model is expected to reduce this predictor to a simple linear effect on the response. Updating the GAM to model this effect explicitly as linear, while keeping the other variables, risks leading the model to ignore important non-linear effects of the other variables.

The second problem is overfitting, as seen in the plots. Because the data points for all variables are somewhat sparse, spread across their full ranges, the non-linear fits overestimate some intervals and run into overfitting (visible with span above 0.15 in all the plots). In the income index plot, the fit overfits in the interval 0.7 to 0.8, where most points are concentrated and the curve becomes very sensitive, while it fits only a few points in the interval 0.4 to 0.5 and around 0.9. Likewise, the health index loess fit overfits, becoming wiggly across all points, even though the linear fit in MARS is expected to be stable there. While the GAM overfits most variables other than health, MARS, with its lower model complexity, will not. As the plots show, all 3 variables are overfitted by splines with span above 0.15 in the GAM.

[Figure: perspective plot of a tensor product smooth of Income and Edu against the response, showing their joint interaction.]

(The figure uses a type of smooth called a tensor product smooth, te(Income, Edu) in mgcv notation, which combines the marginal smooths of Income and Education.) The main limitation of GAMs is that the model is restricted to be additive, so with many variables, important interactions can be missed. For the income and education variables, wealthy countries with a more developed educational infrastructure will tend to score higher on the overall science score. However, wealth alone does not guarantee higher science scores, and MARS allows this interaction to be modeled through products of its basis terms, akin to an interaction in standard model settings. Hence, I expect an income-by-education interaction to let MARS outperform a GAM fit.

Question 2

Table 1: MSE result of possible fits (rows: nk; columns: maximum degree of interaction)

nk     deg 1     deg 2     deg 3     deg 4
10     0.5415    0.5212    0.5224    0.5224
15     0.5144    0.5984    0.5818    0.5818
17     0.5113    0.5998    0.5911    0.5911
19     0.4998    0.5847    0.6289    0.6289
20     0.4998    0.5847    0.6289    0.6289
22     0.4863    0.5729    0.6269    0.6269
25     0.4908    0.5603    0.5982    0.5982

Based on the 10-fold CV results in Table 1, the out-of-sample MSE is lowest at 0.4863 with nk = 22, the maximum number of terms created by the forward pass. The best maximum degree of interaction is 1, meaning an additive model with no interaction terms is best. Note that the number of cross-validation repeats (ncross) is set to 10 for the simulation.
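The grid search behind Table 1 is not shown in the source; a minimal sketch of a manual version (assuming the scaled data frame pis constructed in Question 4; a single 10-fold repetition is shown for brevity, where the source repeats the split, and the object name res matches the min(res) lookup in the Question 4 comparison table):

library(earth)
set.seed(790)

nks  <- c(10, 15, 17, 19, 20, 22, 25)
degs <- 1:4
fold <- sample(rep(1:10, length.out = nrow(pis)))  # random 10-fold split
res  <- matrix(NA, length(nks), length(degs), dimnames = list(nks, degs))

for (i in seq_along(nks)) {
  for (j in seq_along(degs)) {
    sse <- 0
    for (k in 1:10) {
      fit  <- earth(Overall ~ Health + Edu + Income, data = pis[fold != k, ],
                    nk = nks[i], degree = degs[j])
      pred <- predict(fit, newdata = pis[fold == k, ])
      sse  <- sse + sum((pred - pis$Overall[fold == k])^2)
    }
    res[i, j] <- sse / nrow(pis)  # out-of-sample MSE for this (nk, degree)
  }
}
res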
Hence, the model is fitted with no interactions (degree = 1), nk = 22, and ncross = 10 repeats of the 10-fold cross-validation.

set.seed(790)
# MARS fit
m1 <- earth(Overall ~ Health + Edu + Income, data = pis,
            nfold = 10, nk = 22, degree = 1, ncross = 10)
summary(m1)

## Call: earth(formula=Overall~Health+Edu+Income, data=pis, degree=1,
##             nfold=10, ncross=10, nk=22)
##
##                       coefficients
## (Intercept)              -1.548140
## h(Edu- -0.938708)         2.137014
## h(Edu- -0.414698)        -4.413048
## h(Edu- -0.103845)         2.623940
## h(Income- -0.646418)      4.214232
## h(Income- -0.34928)      -3.831756
## h(Income-0.820701)       -2.670915
##
## Selected 7 of 14 terms, and 2 of 3 predictors
## Termination condition: Reached nk 22
## Importance: Income, Edu, Health-unused
## Number of terms at each degree of interaction: 1 6 (additive model)
## GCV 0.3133325  RSS 9.164976  GRSq 0.6926931  RSq 0.8202946  CVRSq 0.2731854
##
## Note: the cross-validation sd's below are standard deviations across folds
##
## Cross validation:  nterms 5.47 sd 1.75   nvars 2.04 sd 0.60
##
##      CVRSq   sd  MaxErr   sd
##      0.273 0.86   -2.94 1.26

Table 2: Summary information of fitted MARS

deg   threshold   R squared   GCV score   predictive power (GRSq)
1     0.001       0.8203      0.3133      0.6927

From Table 2, with the threshold (the stopping tolerance of the forward pass) set to 0.001, the R² of the model computed over all responses is 0.82, which indicates a good fit. The Generalized Cross Validation (GCV) score of the model is 0.3133, which is low; GCV accounts for the model complexity associated with overfitting in MARS, and a higher value indicates a less precise model. Lastly, GRSq, defined as 1 − GCV / (null model GCV), estimates the predictive power of the model; its value of 0.6927 is acceptable for a MARS fit.

Table 3: Coefficients of MARS fit

                       Overall
(Intercept)             -1.548
h(Income-0.820701)      -2.671
h(Edu- -0.938708)        2.137
h(Edu- -0.414698)       -4.413
h(Edu- -0.103845)        2.624
h(Income- -0.646418)     4.214
h(Income- -0.34928)     -3.832

The fitted MARS model, on the scaled variables, is

ŷ = −1.548 − 2.671(Income − 0.821)+ + 4.214(Income + 0.646)+ − 3.832(Income + 0.349)+
      + 2.137(Edu + 0.939)+ − 4.413(Edu + 0.415)+ + 2.624(Edu + 0.104)+,

where (x)+ = max(0, x) is the hinge function. (Most knots are negative because the variables are scaled; Table 3 writes, e.g., h(Edu- -0.938708) for the term (Edu + 0.939)+.)

(For example, setting the Income terms to zero, the Edu part of the model is piecewise linear:

Overall = −1.548,                                                                  for Edu ≤ −0.939
Overall = −1.548 + 2.137(Edu + 0.939),                                             for −0.939 ≤ Edu ≤ −0.415
Overall = −1.548 + 2.137(Edu + 0.939) − 4.413(Edu + 0.415),                        for −0.415 ≤ Edu ≤ −0.104
Overall = −1.548 + 2.137(Edu + 0.939) − 4.413(Edu + 0.415) + 2.624(Edu + 0.104),   for Edu ≥ −0.104.)

Question 3

# model fit using GAM (mgcv)
library(mgcv)
mod_lm2 <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data = pis)
summary(mod_lm2)

## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income) + s(Edu) + s(Health)
##
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.479e-14  5.134e-02       0        1
##
## Approximate significance of smooth terms:
##             edf Ref.df     F  p-value
## s(Income) 7.593  8.415 8.826 1.29e-06 ***
## s(Edu)    6.204  7.178 3.308  0.00771 **
## s(Health) 1.000  1.000 2.736  0.10679
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.863   Deviance explained = 90.3%
## GCV = 0.19688  Scale est. = 0.13707  n = 52

We should note that the health index effect, s(Health), has an effective degrees of freedom of 1, which suggests it is best reduced to a simple linear effect to avoid any overfitting in the model. The income index, s(Income), is modeled by a smoothing spline with 7.593 effective degrees of freedom, and the education index, s(Edu), by a smoothing spline with 6.204 effective degrees of freedom. Since the intercept is essentially zero, the GAM can be written as Overall = f1(Income) + f2(Edu) + f3(Health).
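A minimal sketch of the plotting calls for the three panels below (the source names the visreg package; the exact options used are assumptions):

library(visreg)

# Partial fits of the GAM: fitted Overall score (y-axis) against each
# predictor (x-axis), holding the other predictors at reference values
visreg(mod_lm2, "Income")
visreg(mod_lm2, "Edu")
visreg(mod_lm2, "Health")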
[Figure: GAM fit, fitted y-value against each predictor (scaled). Panels: "A GAM fit with spline on income index", "A GAM fit with spline on educational index", "A GAM fit with spline on health index"; y-axis: Overall score (fitted value).]

The plots were produced with the 'visreg' package, which visualizes regression models by plotting the fitted GAM against each predictor: with the visreg function, the response values are on the y-axis and each predictor is on the x-axis.

Question 4

GAM fit

set.seed(790)
library(dplyr)  # for the %>% pipe
library(knitr)  # for kable()

# GAM: setting up the scaled data for CV
pis <- scale(pisa[-c(1, 2), c(8, 9, 10, 2)]) %>% data.frame()
rownames(pis) <- c(1:length(pis[, 1]))
n <- length(pis[, 4])                          # number of observations
d <- length(as.matrix(pis[, c(1, 2, 3)])) / n  # number of predictors

####################################
## 10-fold CV
####################################
r <- floor((n - 1) / 10)
myAssign <- c(rep(1:10, r), 1:(n - (10 * r)))
# Creates a repeating sequence of 1,2,...,10,1,2,... of length n
# May cut off mid-repetition

numSplits <- 10  # number of times to re-run the cross-validation
MSE <- c(0)
names(MSE) <- c("GAM")

for (i in 1:numSplits) {
  myAssign <- sample(myAssign)  # generate a new 10-fold split
  # we use the same split for each of our regressions
  for (j in 1:10) {
    tr_data <- data.frame(pis[myAssign != j, ])  # needs to be a data frame
    GAMFit1 <- gam(Overall ~ s(Income) + s(Health) + s(Edu), data = tr_data,
                   control = gam.control(nthreads = 4, maxit = 60))
    n_data <- data.frame(pis[myAssign == j, ])   # needs to be a data frame
    # accumulate the squared prediction error on the held-out fold
    MSE[1] <- MSE[1] + sum((as.numeric(na.omit(predict(GAMFit1, n_data)))
                            - n_data$Overall)^2)
  }
}

# MSE from RSS, averaged over all observations and repeats
MSE <- MSE / (n * numSplits)  # results from repeated 10-fold CV

# export into outcome table
mse_gam <- data.frame(matrix(nrow = 1, ncol = 1))
rownames(mse_gam) <- c('GAM')
colnames(mse_gam) <- c('MSE')
mse_gam[1, 1] <- MSE

# ----------------------- final output --------------------------- #
# combined of first: mse
kable(mse_gam, caption = 'MSE of GAM for 10-fold CV')

Table 4: MSE of GAM for 10-fold CV

        MSE
GAM     0.3114505

Ridge regression

# ------------------ data ------------------ #
set.seed(790)
# setting for CV
x  <- as.matrix(pis[, c(1, 2, 3)])
y2 <- as.numeric(pis[, 4])
n  <- length(y2)

# ------------------ fitting ------------------ #
# coefficients and effective df for each lambda, under the two scalings
co1 <- matrix(nrow = 20, ncol = 3)
co2 <- matrix(nrow = 20, ncol = 3)
d1  <- matrix(nrow = 20, ncol = 3)
d2  <- matrix(nrow = 20, ncol = 3)

# setting
numFolds <- 10
myAssign <- c(rep(1:numFolds, ceiling(n / numFolds) - 1),
              1:(n + numFolds - numFolds * ceiling(n / numFolds)))
# We represent each (numFolds)-fold split as a sequence of length n
# (number of points in data) split ~ evenly between 1s, 2s, ...
# (numFolds)s.
# We start out with a deterministic sequence myAssign and generate
# random splits by permuting this sequence

# lambda grid for the cross-validation of each regression method
Lambdas <- seq(0, 10, length.out = 20)
numLam  <- length(Lambdas)
MSE <- rep(0, numLam)
MSE.corrForm <- rep(0, numLam)
names(MSE) <- Lambdas
names(MSE.corrForm) <- Lambdas

# Monte-Carlo simulation & CV
m <- 1
while (m <= numLam) {
  for (j in 1:numFolds) {
    # Fit on the training folds; add the sum of squared residuals on the
    # test data x[myAssign==j,], y2[myAssign==j]
    BIFit <- ridge::linearRidge(y2[myAssign != j] ~ x[myAssign != j, ],
                                lambda = Lambdas[m], scaling = "corrForm",
                                print = F)
    co1[m, ] <- as.numeric(BIFit$coef)
    d1[m, ]  <- as.numeric(BIFit$df[, 1:3])
    BIFit2 <- ridge::linearRidge(y2[myAssign != j] ~ x[myAssign != j, ],
                                 lambda = Lambdas[m], scaling = "scale",
                                 print = F)
    co2[m, ] <- as.numeric(BIFit2$coef)
    d2[m, ]  <- as.numeric(BIFit2$df[, 1:3])
    # MSE
    MSE.corrForm[m] <- MSE.corrForm[m] +
      sum((c(x[myAssign == j, ] %*% BIFit$coef) - y2[myAssign == j])^2)
    MSE[m] <- MSE[m] +
      sum((c(x[myAssign == j, ] %*% BIFit2$coef) - y2[myAssign == j])^2)
  }
  m <- m + 1
}

MSE.corrForm <- MSE.corrForm / n
MSE <- MSE / n

# export into outcome table
ex2_nonzero <- data.frame(matrix(nrow = 2, ncol = 2))
rownames(ex2_nonzero) <- c('Corr.', 'Scale')
colnames(ex2_nonzero) <- c('MSE', 'Lambda')
ex2_nonzero[1, 1] <- min(MSE.corrForm)
ex2_nonzero[2, 1] <- min(MSE)
ex2_nonzero[1, 2] <- Lambdas[which(MSE.corrForm == min(MSE.corrForm), arr.ind = TRUE)]
ex2_nonzero[2, 2] <- Lambdas[which(MSE == min(MSE), arr.ind = TRUE)]

# ----------------------- final output --------------------------- #
# combined of first: mse, second: coefficient
kable(ex2_nonzero, caption = 'Optimal Lambda and MSE for ridge')

Table 5: Optimal Lambda and MSE for ridge

         MSE         Lambda
Corr.    0.4513312   10
Scale    0.4427496   10

# export into outcome table
ex2 <- data.frame(matrix(nrow = 1, ncol = 4))
colnames(ex2) <- c('Corr. ridge', 'Scaled ridge', 'GAM', 'MARS')
rownames(ex2) <- c('MSE')
ex2[1, 1] <- min(MSE.corrForm)
ex2[1, 2] <- min(MSE)
ex2[1, 3] <- mse_gam[1, 1]
ex2[1, 4] <- min(res)  # res: the Question 2 grid of CV MSEs (Table 1)

# ----------------------- final output --------------------------- #
# combined of first: mse
kable(ex2, caption = 'MSE comparison')

Table 6: MSE comparison

      Corr. ridge   Scaled ridge   GAM         MARS
MSE   0.4513312     0.4427496      0.3114505   0.4863

First, ridge regression has the second lowest MSE (for both the correlation-form and the scaled version), because the positive correlation between predictors and response is penalized effectively by ridge regression, even though ridge regression struggles to capture the non-linearity in the data. Generally, if the regression coefficients of highly correlated variables are nearly equal, the model tends to exhibit the grouping effect, and strict convexity guarantees the grouping effect in the extreme situation where two predictors are equal.
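For reference (a standard definition, not stated explicitly in the source), the ridge criterion is

β̂_ridge = argmin_β { ‖y − Xβ‖² + λ‖β‖² },

and for λ > 0 the penalty makes the criterion strictly convex, so two identical predictors must receive identical coefficients; this is the grouping effect referred to above.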
Although the MARS MSE is close to that of ridge regression (a difference of roughly 0.04), ridge regression penalizes the coefficients and so correctly captures the grouped variable behaviour. However, the GAM fit is the best among the three model families: its non-linear splines fit the education and income variables well, where, as the Question 3 plots show, the scattered, non-linear points are captured by the curves. The spline also recovers the linear effect of the health index variable, where the effective degrees of freedom of 1 gives exactly the simplicity the model needs, without any overfitting. Compared with the deliberately overfitted fits of Question 1, the GAM model is well suited, with no overfitting visible in the Question 3 plots. Although MARS should be the model least prone to overfitting, since its piecewise linear segments can also track complex curves, it turns out to be the worst model here, because it underfits these sparse data. Moreover, the tensor interaction plot suggested an important interaction effect between income and education, but the interaction is not statistically significant in the fitted model, which lowers the predictive power (GRSq in Question 2) of the MARS fit. (Alternatively, the weakness may come from the reduced dimensionality of the model: only 2 of the 3 predictors, with 3 basis functions each, are used, and the linear effect of the health variable is ignored entirely.)
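This reading of the MARS fit can be checked directly from the m1 object of Question 2. A small sketch (evimp and plotmo are standard companions of the earth package, though these calls are not shown in the source):

library(earth)
library(plotmo)

evimp(m1)   # variable importance table: Income and Edu enter, Health is unused
plotmo(m1)  # plots the fitted piecewise-linear curve for each used predictor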