STAT 425 – Modern Methods of Data Analysis (23 + 54 pts.)

Assignment 9 – Gradient Boosting, Treed Regression,
and a Mini-Midterm
PROBLEM 1 – PREDICTING THE AGE OF AN ABALONE
The age of an abalone is determined by cutting the shell through the cone, staining it,
and counting the number of rings through a microscope -- a boring and time-consuming
task. Other measurements, which are easier to obtain, are often used to predict the
age. Further information, such as weather patterns and location (hence food
availability), may be required to solve the problem.
Attribute Information:
Given is the attribute name, attribute type, the measurement unit and a brief
description. The number of rings is the value to predict. These data are contained in the
data frame Abalone.
Name / Variable in Abalone / Data Type / Measurement Unit / Description
Length / length / continuous / mm / Longest shell measurement
Diameter / diam / continuous / mm / perpendicular to length
Height / height / continuous / mm / with meat in shell
Whole weight / whole.weight / continuous / grams / whole abalone
Shucked weight / shucked.weight / continuous / grams / weight of meat
Viscera weight / visc.weight / continuous / grams / gut weight (after bleeding)
Shell weight / shell.weight / continuous / grams / after being dried
Rings / Rings / integer / -- / +1.5 gives the age in years
a) Develop a gradient boosting model (gbm) to predict the number of rings. Show or
explain how you choose your tuning parameters (the shrinkage parameter, Jm, and M). (5 pts.)
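One way to organize the tuning is sketched below. It is a minimal sketch, not a required approach: tune_gbm() is a hypothetical helper, and the grid values, the 5 folds, and the upper limit on trees are assumptions you should adjust. For each combination of shrinkage and interaction depth it fits a gbm with cross-validation and records the number of trees M that minimizes the cross-validated error reported by gbm.

```r
library(gbm)

# Hypothetical helper (not part of gbm): for each (shrinkage, depth) pair,
# fit a boosted regression and let k-fold CV pick the number of trees M.
tune_gbm <- function(formula, data, shrinkages = c(0.01, 0.05),
                     depths = c(2, 4), max_trees = 1000, folds = 5) {
  grid <- expand.grid(shrinkage = shrinkages, depth = depths)
  grid$M <- NA_integer_
  grid$cv_error <- NA_real_
  for (i in seq_len(nrow(grid))) {
    fit <- gbm(formula, data = data, distribution = "gaussian",
               n.trees = max_trees, shrinkage = grid$shrinkage[i],
               interaction.depth = grid$depth[i], cv.folds = folds,
               n.cores = 1)
    best <- which.min(fit$cv.error)        # iteration with smallest CV loss
    grid$M[i] <- best
    grid$cv_error[i] <- fit$cv.error[best] # CV loss as reported by gbm
  }
  grid[order(grid$cv_error), ]             # best combination first
}
```

Assuming the Abalone data frame is loaded and contains only the response and the predictors you want, a call such as tune_gbm(Rings ~ ., Abalone) returns the grid sorted from best to worst. If the chosen M equals max_trees, the limit is probably too small and should be raised.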
b) Find the R² for your final model and estimate RMSEP using the residuals from the fit.
Plot ŷ vs. y and ê vs. ŷ. Also look at partial plots for the predictors using the
gbm.plot() command. Discuss all. (6 pts.)
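The summary quantities and the two plots in part (b) can be computed from the observed and fitted values alone. The sketch below uses base R only; fit_summary() and diagnostic_plots() are hypothetical helper names, and y and yhat are assumed to be numeric vectors of equal length on the same scale.

```r
# Hypothetical helper: R-squared and RMSEP from observed y and fitted yhat.
fit_summary <- function(y, yhat) {
  e <- y - yhat                                   # residuals
  c(R2 = 1 - sum(e^2) / sum((y - mean(y))^2),
    RMSEP = sqrt(mean(e^2)))
}

# Hypothetical helper: yhat vs. y and residuals vs. yhat, side by side.
diagnostic_plots <- function(y, yhat) {
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))                                # restore plot settings
  plot(y, yhat, xlab = "Observed y", ylab = "Fitted yhat")
  abline(0, 1, lty = 2)                           # perfect-prediction line
  plot(yhat, y - yhat, xlab = "Fitted yhat", ylab = "Residual")
  abline(h = 0, lty = 2)
}
```

Keep in mind that an RMSEP computed from the residuals of the training fit tends to be optimistic compared with one estimated on held-out data.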
c) Use treed regression (Cubist) to build a model to predict the number of rings. Use
cross-validation to determine if boosting (i.e. committees > 1) helps prediction and
choose an optimal value for the number of committees, M. Also plot ŷ vs. y and ê vs. ŷ
for your final treed regression model. Discuss all. (6 pts.)
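A k-fold cross-validation over the number of committees could be sketched as follows. This is one possible setup, not the course's own function: cv_committees() is a hypothetical helper, the candidate values and k = 5 are assumptions, and x is assumed to be a data frame of predictors with y the numeric response.

```r
library(Cubist)

# Hypothetical helper: k-fold CV estimate of RMSEP for each committee count.
cv_committees <- function(x, y, committees = c(1, 5, 10, 20), k = 5) {
  # Assign every observation to one of k folds at random.
  fold <- sample(rep(seq_len(k), length.out = length(y)))
  rmsep <- sapply(committees, function(m) {
    sq_err <- numeric(length(y))
    for (j in seq_len(k)) {
      test <- fold == j
      fit <- cubist(x = x[!test, , drop = FALSE], y = y[!test],
                    committees = m)
      pred <- predict(fit, newdata = x[test, , drop = FALSE])
      sq_err[test] <- (y[test] - pred)^2
    }
    sqrt(mean(sq_err))
  })
  setNames(rmsep, committees)   # names are the committee counts
}
```

Because the same fold assignment is reused for every committee count, differences in the returned RMSEP values reflect the number of committees rather than the random split. Comparing the entry for 1 with the others indicates whether boosting helps.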
d) How do the RMSEP values for gradient boosting and treed regression compare to those
for MARS, RPART, bagged RPART, and Random Forests? (6 pts.)
PROBLEM 2 – PREDICTING THE STRENGTH OF CONCRETE (MINI-MIDTERM)
Given below are the variable name, variable type, the measurement unit and a brief
description. The regression problem is to predict the concrete compressive strength.
Name -- Data Type -- Measurement Unit -- Description
Cement -- quantitative -- kg in a m³ mixture -- Input Variable
Blast Furnace Slag -- quantitative -- kg in a m³ mixture -- Input Variable
Fly Ash -- quantitative -- kg in a m³ mixture -- Input Variable
Water -- quantitative -- kg in a m³ mixture -- Input Variable
Superplasticizer -- quantitative -- kg in a m³ mixture -- Input Variable
Coarse Aggregate -- quantitative -- kg in a m³ mixture -- Input Variable
Fine Aggregate -- quantitative -- kg in a m³ mixture -- Input Variable
Age -- quantitative -- Day (1~365) -- Input Variable
Concrete compressive strength -- quantitative -- MPa -- Output Variable (Y)
These data can be obtained from the UCI Machine Learning Repository under Concrete.
http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/
Read it into Excel first and save it as comma-delimited (.CSV) format after shortening
the variable names. Use the command below to read the dataset into R.
> Concrete = read.table(file.choose(), header=TRUE, sep=",")
a) Develop models to predict concrete compressive strength. Use the following modeling
approaches:
• OLS – possibly using ACE/AVAS to help find appropriate transformations
• Projection Pursuit
• MARS
• Neural networks
• RPART
• Bagged RPART
• Random Forests
• Gradient Boosted Trees
• Treed Regression
Be sure to include some discussion for each method on how you “tuned” the fit using
that modeling approach. Be sure to use the same response for each!!!!
(36 points – 4 pts. each)
b) Identify which of the predictors are most important on the basis of the models you fit.
Also give at least one visualization of the predictor “effects” from the models fit in part
(a). Discuss all of this in practical terms. (6 pts.)
c) Using MC cross-validation, decide which modeling approach would be best to use to
predict the compressive strength of concrete. Be sure that all MCCV functions have
been fixed to perform similarly and correctly. Also make sure that you use the same
response for each method, so the RMSEP values can be fairly compared. Put your
results in a table and discuss. (12 pts.)
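A generic Monte Carlo cross-validation that treats every method the same way might look like the sketch below. It is not the MCCV code from the course: make_splits() and mccv() are hypothetical helpers, and B = 100 splits with two thirds of the rows used for training are assumptions. The training indices are generated once and reused for every method, so all methods are scored on identical splits.

```r
# Hypothetical helper: B random training-index sets, each a fraction p of n.
make_splits <- function(n, B = 100, p = 2 / 3) {
  lapply(seq_len(B), function(b) sample(n, floor(p * n)))
}

# Hypothetical helper: RMSEP on each held-out set for one modeling method.
# fit_fun(train_data) returns a model; pred_fun(model, new_data) returns
# predictions on the ORIGINAL response scale.
mccv <- function(data, response, splits, fit_fun, pred_fun) {
  sapply(splits, function(train) {
    fit <- fit_fun(data[train, , drop = FALSE])
    test <- data[-train, , drop = FALSE]
    pred <- pred_fun(fit, test)
    sqrt(mean((test[[response]] - pred)^2))
  })
}

# Example wrappers for OLS; "Strength" is an assumed column name.
ols_fit <- function(d) lm(Strength ~ ., data = d)
ols_pred <- function(fit, newd) predict(fit, newdata = newd)
```

For example, with splits <- make_splits(nrow(Concrete)), the call mean(mccv(Concrete, "Strength", splits, ols_fit, ols_pred)) gives one entry for the results table. If a method is fit to a transformed response, its pred_fun should back-transform the predictions so that the RMSEP values remain comparable.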