STAT 425 – Modern Methods of Data Analysis (23 + 54 pts.)
Assignment 9 – Gradient Boosting, Treed Regression, and a Mini-Midterm

PROBLEM 1 – PREDICTING THE AGE OF AN ABALONE

The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are often used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

Attribute Information:
Given below are the attribute name, data type, measurement unit, and a brief description. The number of rings is the value to predict. These data are contained in the data frame Abalone.

Name / Data Type / Measurement Unit / Description
Length / continuous / mm / longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / +1.5 gives the age in years

Variable names in Abalone: length, diam, height, whole.weight, shucked.weight, visc.weight, shell.weight, Rings

a) Develop a gradient boosting model (gbm) to predict the number of rings. Show or explain how you chose your tuning parameters (the shrinkage parameter, Jm, and M). (5 pts.)

b) Find the R² for your final model and estimate the RMSEP using the residuals from the fit. Plot ŷ vs. y and ê vs. ŷ. Also look at partial plots for the predictors using the gbm.plot() command. Discuss all. (6 pts.)

c) Use treed regression (Cubist) to build a model to predict the number of rings. Use cross-validation to determine whether boosting (i.e. committees > 1) helps prediction, and choose an optimal value for the number of committees, M. Also plot ŷ vs. y and ê vs. ŷ for your final treed regression model.
Discuss all. (6 pts.)

d) How do the RMSEP values for gradient boosting and treed regression compare to those for MARS, RPART, bagged RPART, and Random Forests? (6 pts.)

PROBLEM 2 – PREDICTING THE STRENGTH OF CONCRETE (MINI-MIDTERM)

Given below are the variable name, variable type, measurement unit, and a brief description. The regression problem is to predict the concrete compressive strength.

Cement -- quantitative -- kg in a m³ mixture -- Input Variable
Blast Furnace Slag -- quantitative -- kg in a m³ mixture -- Input Variable
Fly Ash -- quantitative -- kg in a m³ mixture -- Input Variable
Water -- quantitative -- kg in a m³ mixture -- Input Variable
Superplasticizer -- quantitative -- kg in a m³ mixture -- Input Variable
Coarse Aggregate -- quantitative -- kg in a m³ mixture -- Input Variable
Fine Aggregate -- quantitative -- kg in a m³ mixture -- Input Variable
Age -- quantitative -- Day (1~365) -- Input Variable
Concrete compressive strength -- quantitative -- MPa -- Output Variable (Y)

These data can be obtained from the UCI Machine Learning Repository under Concrete.
http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/
Read the data into Excel first and, after shortening the variable names, save it in comma-delimited (.CSV) format. Use the command below to read the dataset into R.

> Concrete = read.table(file.choose(), header=TRUE, sep=",")

a) Develop models to predict concrete compressive strength. Use the following modeling approaches:

   OLS – possibly using ACE/AVAS to help find appropriate transformations
   Projection Pursuit
   MARS
   Neural networks
   RPART
   Bagged RPART
   Random Forests
   Gradient Boosted Trees
   Treed Regression

Be sure to include some discussion for each method of how you “tuned” the fit using that modeling approach. Be sure to use the same response for each!!!! (36 points – 4 pts. each)

b) Identify which of the predictors are most important on the basis of the models you fit.
Also give at least one visualization of the predictor “effects” from the models fit in part (a). Discuss all of this in practical terms. (6 pts.)

c) Using Monte Carlo (MC) cross-validation, decide which modeling approach would be best to use to predict the compressive strength of concrete. Be sure that all MCCV functions have been fixed to perform similarly and correctly. Also make sure that you use the same response for each method, so that the RMSEP values can be fairly compared. Put your results in a table and discuss. (12 pts.)
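
One way to keep the comparison in part (c) fair is to evaluate every method on the same random train/test splits and on the same response scale. The sketch below is only an illustration under stated assumptions, not the course's MCCV functions: the names mccv_rmsep, fit_fun, pred_fun, B, and p are hypothetical, the data are simulated rather than the Concrete data, and OLS via lm() stands in for whichever method is being assessed.

```r
# Hypothetical helper (not from the course materials): Monte Carlo cross-validation.
# fit_fun(train) returns a fitted model; pred_fun(model, test) returns predictions
# on the common response scale, so RMSEP values are comparable across methods.
mccv_rmsep <- function(data, response, fit_fun, pred_fun,
                       B = 100, p = 0.667, seed = 1) {
  n <- nrow(data)
  n_train <- floor(p * n)
  # Draw all B training index sets up front, so every method sees identical
  # splits even if its fitting routine uses random numbers itself.
  set.seed(seed)
  splits <- replicate(B, sample.int(n, n_train), simplify = FALSE)
  rmsep <- numeric(B)
  for (b in seq_len(B)) {
    idx <- splits[[b]]
    train <- data[idx, , drop = FALSE]
    test <- data[-idx, , drop = FALSE]
    model <- fit_fun(train)
    yhat <- pred_fun(model, test)
    # Root mean squared error of prediction on the held-out cases
    rmsep[b] <- sqrt(mean((test[[response]] - yhat)^2))
  }
  rmsep
}

# Illustration on simulated data (an assumption, not the Concrete data).
set.seed(425)
sim <- data.frame(x1 = runif(200), x2 = runif(200))
sim$y <- 3 + 2 * sim$x1 - sim$x2 + rnorm(200, sd = 0.5)

ols_rmsep <- mccv_rmsep(
  sim, "y",
  fit_fun = function(train) lm(y ~ x1 + x2, data = train),
  pred_fun = function(model, test) predict(model, newdata = test)
)
mean(ols_rmsep)  # average RMSEP over the B random splits
```

To compare methods, the same call can be repeated with a different fit_fun and pred_fun and the same seed; if a method is fit to a transformed response, pred_fun should back-transform its predictions before the RMSEP is computed.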