R Code for Creating Decision Rules - Phase 1

Using R for Creating Predictive Models
Getting Started With R
R is an open-source statistical package developed and maintained by its user community. It is free to
use and offers many basic and advanced methods, including Classification and Regression Trees
(CART); these were the primary factors in the decision to use R for this project. For more
information about R, visit the R community website: http://www.inside-r.org/
R holds the data it works on in memory and can consume more memory than other statistical software,
so computers with at least 8 GB of RAM are recommended.
Max Kuhn, the creator of the R caret package, provides the following modeling advice when working in
the R environment:
1. Start with the most powerful black-box type models
2. Get a sense of the best possible performance
3. Then fit more simplistic/understandable models
4. Evaluate the performance cost of using a simpler model
General Notes about R
R uses a command line interface. While not necessary, some users like to use an overlay GUI
interface to work with R. Common interfaces include:
o http://www.rstudio.com/ (used by some of the MMAP team)
o http://www.rcommander.com/
o http://mran.revolutionanalytics.com/download/
The command space is called a workspace.
R is a UNIX-based package; therefore, commands are case-sensitive.
Comments can be put anywhere in the command line as long as the comment starts with
a hash mark (#).
New command lines are preceded by a greater-than sign (>). This sign does not need to be
entered as part of a command; it is the workspace prompt.
Objects in the R workspace can be named using the assignment operator: <-
R provides a mechanism for recalling previous commands: use the arrow keys to scroll up and
down the command history.
Many R manuals and tutorials are available on the web including YouTube videos. Some
examples are:
o http://cran.r-project.org/doc/manuals/R-intro.pdf
o http://www.ats.ucla.edu/stat/r/seminars/intro.htm
The full capacity of R is realized by adding free packages to the base R installation:
o http://www.statmethods.net/interface/packages.html
In addition, SPSS has a plug in for R (search web for “spss r plug in”).
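The notes above on comments, case sensitivity, and naming objects can be tried out with a short sketch. The object names and values below are made up for illustration only.

```r
# Anything after a hash mark is a comment and is ignored by R.
gpa <- 3.2          # name an object with the assignment operator <-
GPA <- 2.5          # a different object: names are case-sensitive
gpa                 # typing a name prints its value
gpa <- 3.6          # assigning again overwrites the previous value
exists("Gpa")       # FALSE: no object with this exact spelling was created
```

Note that the > prompt is not typed; only the text after it is entered.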
Step-by-Step Instructions
1. Download and install R:
http://www.r-project.org/
a. A key package for this analysis is the caret package.
For more information about caret, see the following resources:
http://topepo.github.io/caret/training.html
https://www.youtube.com/watch?v=7Jbb2ItbTC4
2. Upload the data into the workspace. The command to read in the data depends on the default
file directory path (the working directory).
a. Set the file directory path so R will point to the correct file location in the workspace.
Use the following command to set the directory path:
> setwd("insert file directory path here")
EX: > setwd("C:/Users/mmap/Assessment")
b. Select the appropriate file. It is recommended that csv files be used.

Use the following command to upload the data file into the workspace:
> read.csv("File Name", header = T, row.names = NULL)
EX: > read.csv("College High School Transcript.csv", header = T, row.names = NULL)

To name the data set in the R workspace, use the following command:
> MMAPEngl <- read.csv("File Name", header = T, row.names = NULL,
stringsAsFactors = FALSE)
 Command breakdown
o MMAPEngl <- assigns the name on the left to the data set on the right
o row.names = NULL avoids issues in svm with duplicated row names
o header = T identifies the first row of data as the header/variable names
o stringsAsFactors = FALSE keeps character variables as they are instead
of converting them to factors
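As a minimal sketch of the read-in step, the following writes a tiny CSV to a temporary file and reads it back with the same arguments. The file contents and column names are invented so that the example runs without the project data.

```r
# Write a tiny CSV to a temporary file so the example runs anywhere
# (the column names and values are made up for illustration).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("student_id,hs_12_gpa_cum,school",
             "1,3.2,North",
             "2,2.7,South"), csv_path)

# Read it back with the same arguments as in the step above.
demo <- read.csv(csv_path, header = TRUE, row.names = NULL,
                 stringsAsFactors = FALSE)

str(demo)            # lists each variable with its type
class(demo$school)   # "character", because stringsAsFactors = FALSE
```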
3. Basic Commands
This section provides some basic commands for gaining familiarity with R and your data set. Most
commands have optional values that can be used such as adding labels to plots. To obtain help on a
command, type ? before the command. For example, ?plot will invoke the help page for the plot
command. Web searches can also be helpful for examples and additional command options.
To view the variable names and types, use the following command:
str(MMAPEngl)
View the data set:
edit(MMAPEngl)
Provide basic summary statistics for a numeric variable:
summary(MMAPEngl$hs_12_gpa_cum)
Plot a simple histogram for a numeric variable:
hist(MMAPEngl$hs_12_gpa_cum)
Scatterplot of two numeric variables:
plot(MMAPEngl$hs_11_gpa_cum,MMAPEngl$CCFirstGradePoint)
Table of frequencies (MMAPEnglna is the subset created in Step 5; variable names are case-sensitive):
table(MMAPEnglna$AP_ANY_Cplus,MMAPEnglna$cc_first_level_rank)
Bar plot based on table:
First the table will be loaded into the R workspace as an object called x that can be called in subsequent
commands:
x <- table(MMAPEnglna$cc_first_level_rank)
Entering x into the command line will call the table. Any number, table, plot, data set, etc. can be loaded
into an object for later use. Loading a new value into an existing object will overwrite the previous item.
barplot(x)
Add labels to barplot. Note “smart quotes” will generate an error.
barplot(x, main="Level of First College English Course", xlab="Level")
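The commands in this section can be tried on a small invented data frame before the real file is loaded. This is only a sketch; the values below are not project data.

```r
# Toy data standing in for the real file (values are invented).
toy <- data.frame(
  hs_12_gpa_cum       = c(3.2, 2.7, 3.9, 2.1, 3.4, 2.9),
  cc_first_level_rank = c(5, 4, 5, 3, 5, 4)
)

summary(toy$hs_12_gpa_cum)            # min, quartiles, mean, max
x <- table(toy$cc_first_level_rank)   # frequency of each course level
x                                     # entering the name prints the table

# Straight quotes only; "smart quotes" pasted from a word processor fail.
barplot(x, main = "Level of First College English Course", xlab = "Level")
```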
4. Install the specific modules or packages that were used for the Multiple Measures Assessment
Project models:
a. rpart: Classification and Regression Tree (CART) modeling
b. rpart.plot: Tool to create visual graphics of the CART
c. e1071: miscellaneous functions that avoid error messages
d. caret: primary modeling package
e. dplyr: filtering tool
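Use the following commands in the workspace:
> install.packages("dplyr", dependencies = TRUE)
> install.packages("caret", dependencies = TRUE)
> install.packages("e1071", dependencies = TRUE)
> install.packages("rpart", dependencies = TRUE)
> install.packages("rpart.plot", dependencies = TRUE)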
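When the commands are re-run on a machine that already has some of these packages, it can help to check first which ones are missing. The sketch below is one possible way to do that; the install line is left commented out because it needs an internet connection.

```r
# Packages named in this step.
needed <- c("dplyr", "caret", "e1071", "rpart", "rpart.plot")

# Keep only those that are not already installed.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(needed, installed)
missing_pkgs

# Uncomment to install whatever is missing (requires an internet connection).
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs, dependencies = TRUE)
```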
5. Create a master subset of the data containing only variables used in model.
a. Create a vector of variable names that will be used in any potential model.
Use the following command:
> Name of vector <- c("variable 1", "variable 2", "v…")
Example:
MMAPEnglModelVars <- c("derkey1","hs_gpa_present_by_grade","cc_first_level_rank","CCFirstGradePoint",
"hs_12_course_grade_points","hs_12_gpa_cum","hs_ela_eap","hs_exit_subj_to_cc_entry_subj",
"AP_ANY_Cplus","EXPOSITORY_ANY_Cplus","concurrent","REMEDIAL_ANY","GEN_ESL_ANY")
b. Create the data subset with only the columns whose variable names were listed in the
previous step
Use the following command:
> Name of subset <- Original Data File Name [,Name of Vector]
Example:
MMAPEnglModelVarsSubset <- MMAPEngl[,MMAPEnglModelVars]
c. Omit rows with any NA's
Use the following command:
MMAPEnglna <- na.omit(MMAPEnglModelVarsSubset)
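The three sub-steps above (vector of names, column subset, dropping rows with NA) can be seen on a toy data frame. The column names echo the guide, but the values and the unused_column variable are invented.

```r
# Toy data with some missing values and one column the model will not use.
toy <- data.frame(
  id            = 1:4,
  hs_12_gpa_cum = c(3.2, NA, 3.9, 2.1),
  hs_ela_eap    = c(1, 0, NA, 1),
  unused_column = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

model_vars <- c("id", "hs_12_gpa_cum", "hs_ela_eap")  # a. names to keep
toy_subset <- toy[, model_vars]                       # b. keep only those columns
toy_na     <- na.omit(toy_subset)                     # c. drop rows with any NA

nrow(toy)      # 4
nrow(toy_na)   # 2 (rows 2 and 3 each contain an NA)
```

Because na.omit() drops a row if any kept column is missing, it is worth comparing the row counts before and after to see how much data is lost.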
6. Create subsets of the master subset file to build models based on the CB21 coding of the
subject area community college courses.
a. Load the dplyr package for filtering
Use the following command:
> require(dplyr)
b. Select students with no missing data including having 4 years of high school GPAs
Use the following command:
Name of subset <- filter(DataSet Name, Variable 1 == Specific Value, Variable …
== Specific Value)
Examples:
e5na<-filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==5)
e4na<-filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==4)
e3na<-filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==3)
e2na<-filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==2)
e1na<-filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==1)
 Command Breakdown
o == a double equal sign must be used to indicate that the value must be
exact
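For readers who want to see the same idea without dplyr, the sketch below uses base R instead of filter(): it keeps the rows with all four years of GPA and then splits them by course level in one call. This is an alternative illustration on invented data, not the approach used in the guide.

```r
# Toy data (values invented); 1111 marks GPA present in all four grades.
toy <- data.frame(
  hs_gpa_present_by_grade = c(1111, 1111, 1011, 1111, 1111),
  cc_first_level_rank     = c(5, 4, 5, 5, 4),
  CCFirstGradePoint       = c(3, 2, 4, 1, 3)
)

# Keep rows with all four years of GPA, then split by course level.
complete_gpa <- toy[toy$hs_gpa_present_by_grade == 1111, ]
by_level <- split(complete_gpa, complete_gpa$cc_first_level_rank)

names(by_level)          # "4" "5"
e5_toy <- by_level[["5"]]
nrow(e5_toy)             # 2
```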
7. Load the rpart packages to run the decision tree modeling
Use the following code:
>require(rpart)
>require(rpart.plot)
8. Run the model (Simple Method)
a. Set the control parameters for the modeling.
Use the following command. The parameters in the parentheses can be adjusted accordingly
based on the levels desired.
>ctrl<-rpart.control(minsplit=2, minbucket=100, cp=0.002)
b. Regression Tree Modeling
Use the following command:
> Name of model <- rpart(Dependent Variable Name ~ Predictor Variable Name 1 + Predictor
Variable Name 2 + …, data = Data File Name, method = "anova", control = ctrl)
Example:
> fit.e5na<-rpart(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr
+ hs_ela_cst_ss +hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap +
hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent +
REMEDIAL_ANY + GEN_ESL_ANY, data = e5na, method="anova", control=ctrl)
 Command Breakdown
o rpart() is the function (from the rpart package) that runs the analysis
o ~ separates the dependent variable from the predictor/independent
variables in the model
o + adds further predictor/independent variables to
the model
o data = e5na identifies the specific data set the model should be
fitted on. Replace this value for each data set that will be modeled
using the regression tree analysis.
o method = "anova" identifies the tree type and depends on
the data structure of the dependent variable. For a regression tree use
"anova"; for a classification tree (e.g., dichotomous variables) use "class"
o control = ctrl identifies the control parameters for the model. See Step 8a for
the specific parameters.
For more information about the control parameters in rpart see: https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/rpart.control.html
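The regression tree call can be tried on simulated data before running it on the project files. In the sketch below, the data, the variable names gpa, ap, and grade, and the control values are all assumptions chosen so that a small example produces a tree; they are not the project's settings.

```r
library(rpart)

# Simulated data: grade depends on GPA and an AP indicator plus noise.
set.seed(42)
n <- 500
toy <- data.frame(gpa = runif(n, 1, 4), ap = rbinom(n, 1, 0.3))
toy$grade <- pmin(4, pmax(0, 0.9 * toy$gpa + 0.4 * toy$ap + rnorm(n, sd = 0.5)))

# Control parameters sized for this small data set (illustrative values).
ctrl_toy <- rpart.control(minsplit = 20, minbucket = 25, cp = 0.002)

# Regression tree: continuous outcome, so method = "anova".
fit_toy <- rpart(grade ~ gpa + ap, data = toy, method = "anova", control = ctrl_toy)

printcp(fit_toy)   # complexity parameter table
print(fit_toy)     # splits and node summaries
```

Larger minbucket values force each terminal node to contain more students, which yields fewer and coarser rules.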
c. View the output
To view the output, enter the following command:
>printcp(Name of rpart analysis)
>print(Name of rpart analysis)
Example:
>printcp(fit.e5na)
>print(fit.e5na)
 Command Breakdown
o printcp() prints the complexity parameter (cp) table for the fitted
model
o print() prints the output of the analysis and provides a summary of the
branches
d. Plot the tree
To obtain a visual of the model (decision tree), use the following command:
>prp(Name of rpart analysis, modifications to plot)
Example:
>prp(fit.e5na, extra =1)
 prp() is a command to plot the decision tree based on a specific analysis
 extra = 1 is one of many options you can include to customize the tree.
For more information about customizing the plots see: http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf
9. Run the model (Advanced). Complete steps 1 through 7. Step 8 can be completed after this step
in the advanced method.
a. Load the caret package
Use the following commands:
> library("caret")
> library("e1071")
For a tutorial document on CARET: http://www.edii.uclm.es/~useR2013/Tutorials/kuhn/user_caret_2up.pdf
b. Setting the seeds to the same value before partitioning allows recreation of partitions
and models. For more information about seeds see:
http://topepo.github.io/caret/training.html
Use the following command:
set.seed(42)
c. Partition the data
Use the following commands:
> Name of Training and Testing Set <- createDataPartition(Data File Name$Dependent
Variable, p = 0.5, list = FALSE)
Example:
inTraining_e5na <- createDataPartition(e5na$CCFirstGradePoint, p = 0.5, list = FALSE)

Command Breakdown
 createDataPartition() is the command to partition the data
 $ separates the data file name from the variable name in that
data set
 p= selects the percentage of the data to be in training set (must
indicate outcome variable), the remainder of data will be the
test data.
 For larger data sets use p = 0.75
 For smaller data sets use p = 0.50
 list = FALSE prevents the results from being populated as a list
d. Assign data set names to training and test sets that have been partitioned
Use the following commands:
> Name of Training Data Set <- Data Set Name[inTraining_Name of Partitioned Data Set, ]
> Name of Test Data Set <- Data Set Name[-inTraining_Name of Partitioned Data Set, ]
Examples:
training_e5na <- e5na[inTraining_e5na, ]
testing_e5na <- e5na[-inTraining_e5na, ]

Command Breakdowns
 [inTraining_e5na,] retrieves the rows that have been flagged as
the training set
 [-inTraining_e5na,] retrieves all the other rows that have not
been flagged as the training data set – this is the testing data set
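The roles of the seed and of positive versus negative row indexing can be illustrated in base R. Note that this sketch uses a plain random split with sample() as a simplified stand-in; caret's createDataPartition additionally balances the outcome distribution between the two parts. The data are invented.

```r
set.seed(42)
toy <- data.frame(id = 1:10, grade = c(2, 3, 4, 1, 3, 2, 4, 0, 3, 2))

# Plain random 50/50 split of the row numbers (simplified stand-in).
in_training <- sample(seq_len(nrow(toy)), size = floor(0.5 * nrow(toy)))

training_toy <- toy[in_training, ]    # rows flagged for training
testing_toy  <- toy[-in_training, ]   # every other row

# Re-setting the seed reproduces exactly the same split.
set.seed(42)
again <- sample(seq_len(nrow(toy)), size = floor(0.5 * nrow(toy)))
identical(in_training, again)   # TRUE
```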
e. Set the resampling method (for continuous outcomes, using a repeated 10-fold cross-validation method)
Use the following command:
Name of control parameter <- trainControl(method = "insert type here", repeats = insert
number here)
Example:
ctrl <- trainControl(method = "repeatedcv", repeats = 10)

Command Breakdowns
 method = "repeatedcv" is a repeated k-fold cross-validation
method
 repeats = 10 indicates that the repetition occurs 10 times
f. CART Modeling. Run the models for each level of interest.
Use the following command:
>Name of cart model <- train(Dependent Variable~Predictor Variable 1 + Predictor
Variable 2 + …, data = Training data set name, method = insert type here, trControl =
insert control parameters here, preProc=c(insert types here), tuneLength = Insert
number here)
Example for Transfer Level
cartfit_e5na <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr +
hs_assmt_readcomp_bestscr + hs_ela_cst_ss +hs_12_course_grade_points +
hs_12_gpa_cum + hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus +
EXPOSITORY_ANY_Cplus + concurrent + REMEDIAL_ANY + GEN_ESL_ANY, data =
training_e5na, method = "rpart", trControl = ctrl, preProc = c("center", "scale"),
tuneLength = 10)

Command Breakdowns
 train() fits the model over a grid of tuning parameter values and uses
resampling to estimate performance for each value. For
more information see: http://www.inside-r.org/packages/cran/caret/docs/train
 CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + … are the
variables of interest in the model
 method = "rpart" identifies the type of model to fit and resample
 trControl = ctrl identifies the resampling method (see Step 9e)
 preProc = c("center", "scale") identifies the transformations applied to the
predictor variables
 tuneLength = 10 represents the number of levels for each tuning
parameter in the train function
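To make the idea of resampling over tuning values more concrete, the sketch below does by hand, in a simplified way, roughly what repeated k-fold cross-validation of an rpart model involves: for each candidate cp value, fit on k-1 folds, predict the held-out fold, and average the error. It is not caret's implementation; the data, the cp grid, and the small k and repeats values are assumptions chosen to keep the example quick.

```r
library(rpart)

set.seed(42)
n <- 300
toy <- data.frame(gpa = runif(n, 1, 4), ap = rbinom(n, 1, 0.3))
toy$grade <- 0.9 * toy$gpa + 0.4 * toy$ap + rnorm(n, sd = 0.5)

cp_grid <- c(0.001, 0.01, 0.1)   # candidate tuning values (hypothetical)
k <- 5                           # number of folds
repeats <- 2                     # number of repetitions

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

cv_rmse <- sapply(cp_grid, function(cp) {
  scores <- c()
  for (r in seq_len(repeats)) {
    fold <- sample(rep(seq_len(k), length.out = n))   # random fold labels
    for (f in seq_len(k)) {
      fit  <- rpart(grade ~ gpa + ap, data = toy[fold != f, ],
                    method = "anova", control = rpart.control(cp = cp))
      pred <- predict(fit, newdata = toy[fold == f, ])
      scores <- c(scores, rmse(toy$grade[fold == f], pred))
    }
  }
  mean(scores)   # average held-out RMSE for this cp
})
names(cv_rmse) <- cp_grid
cv_rmse

best_cp <- cp_grid[which.min(cv_rmse)]   # value with the lowest average RMSE
best_cp
```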
g. View the CART fit statistics

Use the following command to print the fit statistics:
>Name of model
Example:
> cartfit_e5na
 Use the following command to show the importance metrics for each
variable in the fit models
varImp(Name of Model)
Example:
varImp(cartfit_e5na)
h. Plot the model

Plot the variable importances
Use the following command:
>plot(varImp(Model Name),main="Title of Plot")
Example:
>plot(varImp(cartfit_e4na),main="One Level Below Transfer English,
12th Grade data, no CST ")

Plot the decision trees
Use the following command:
> prp(Model Name$Type, main = "Title of Plot", extra = Insert
Number Here)
Example
prp(cartfit_e5na$finalModel,main="Transfer Level English, 12th
Grade data, no CST",extra=100)
Note finalModel is an object created by R and does not need to be
renamed.
i. Test the models developed from the training set by applying the rules/making predictions
on the test set

Create the predicted values for the testing data set, using the following command:
Name of predicted values <- predict(CART model training set name,
newdata = Test Data Set Name)
Example:
cartpred_e5na <- predict(cartfit_e5na, newdata = testing_e5na)

Create a column in the data set and move the predicted values to
the new column using the following code:
> Testing Data File Name[,"new column name"] <- Object Containing the Predicted Values
Example:
> testing_e5na[,"CARTpredvalues12thGrdDataNoCST"] <- cartpred_e5na

Export the data file with the predicted values using the following
code:
> write.csv(Testing Data Set Name, "File Directory Path and File Name")
Example
write.csv(testing_e5na, "Y:/MMAP/testing_e5na12thGrdGPA.csv")
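The predict, attach, and export sequence can be rehearsed end to end on simulated data. The sketch below fits rpart directly rather than through caret, adds a test-set RMSE as one possible accuracy check, and writes to a temporary folder; all names and values are illustrative assumptions.

```r
library(rpart)

set.seed(42)
n <- 400
toy <- data.frame(gpa = runif(n, 1, 4))
toy$grade <- 0.9 * toy$gpa + rnorm(n, sd = 0.5)

idx <- sample(seq_len(n), n / 2)
training_toy <- toy[idx, ]
testing_toy  <- toy[-idx, ]

# Fit on the training rows, then predict the held-out rows.
fit_toy <- rpart(grade ~ gpa, data = training_toy, method = "anova")
testing_toy[, "pred"] <- predict(fit_toy, newdata = testing_toy)

# Root mean squared error on the held-out rows.
test_rmse <- sqrt(mean((testing_toy$grade - testing_toy$pred)^2))
test_rmse

# Export the test set with its predictions.
out_path <- file.path(tempdir(), "testing_toy.csv")
write.csv(testing_toy, out_path, row.names = FALSE)
file.exists(out_path)   # TRUE
```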
Advanced Tips and Tricks
CART models were used for the decision rule sets, but predictions were also compared using linear
regression and support vector machines.
The following are the R commands used to run the comparative models; they use the same commands in
caret to produce the analyses. What distinguishes the code below from the code above is the
"method" (analysis type) argument, along with additional parameters that are specific to the
method type. The following link provides a list of the methods in R's caret package:
http://topepo.github.io/caret/modelList.html.
Linear Regression Models
1. Set the seed for the analysis.
Use the following command:
set.seed(42)
2. Run the linear regression model.
Use the following command:
> regfit_e5na <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr
+ hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap +
hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent +
REMEDIAL_ANY + GEN_ESL_ANY,
data = training_e5na,
method = "lm",
trControl = ctrl,
preProc = c("center", "scale"),
tuneLength = 10)
3. Print the output and relevant statistics.
a) Analysis
> regfit_e5na
b) Additional regression coefficients
> regfit_e5na$finalModel
> coef(regfit_e5na$finalModel)
> lm.beta(regfit_e5na$finalModel) # lm.beta() is not in base R; it comes from an add-on package such as QuantPsyc
> varImp(regfit_e5na)
For additional information about the different types of linear regression models in CARET:
http://topepo.github.io/caret/Linear_Regression.html
4. Plot the data and export the predicted values
See Step 9i for descriptions
Command Example:
> plot(varImp(regfit_e5na),main="Transfer Level English - Regression")

Creating predicted values in data set
> regpred_e5na <- predict(regfit_e5na, newdata = testing_e5na)

Add the predicted values to the data set
> testing_e5na[,"regpredvalues"]<-regpred_e5na
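Standardized regression coefficients, which lm.beta() reports, can also be computed by hand in base R. The sketch below shows two equivalent routes on simulated data: rescaling the raw coefficients by sd(x)/sd(y), and refitting on z-scored variables. The data and variable names are invented.

```r
set.seed(42)
n <- 200
toy <- data.frame(gpa = runif(n, 1, 4), test_score = rnorm(n, 300, 40))
toy$grade <- 0.8 * toy$gpa + 0.005 * toy$test_score + rnorm(n, sd = 0.5)

# Route 1: rescale the raw slopes by sd(predictor) / sd(outcome).
fit <- lm(grade ~ gpa + test_score, data = toy)
b <- coef(fit)[-1]                                   # drop the intercept
std_beta <- b * sapply(toy[names(b)], sd) / sd(toy$grade)
std_beta

# Route 2: z-score every variable first, then refit.
toy_z <- as.data.frame(scale(toy))
fit_z <- lm(grade ~ gpa + test_score, data = toy_z)
coef(fit_z)[-1]                                      # matches std_beta
```

Standardized coefficients put predictors measured on different scales (GPA points versus test score points) on a common footing for comparison.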
Support Vector Machine Models with Radial Basis Function Kernel
1. Set the seed for the analysis.
Use the following command:
set.seed(42)
2. Run the support vector model with radial basis function kernel.
Use the following command:
> svmfit_e5na <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr +
hs_assmt_readcomp_bestscr + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_cst_ss +
hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus +
concurrent + REMEDIAL_ANY + GEN_ESL_ANY,
data = training_e5na,
method = "svmRadial",
trControl = ctrl,
preProc = c("center", "scale"),
tuneLength = 10)
For additional information about support vector machines:
http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
3. Apply relevant commands to view, plot, and create/export the data.
Neural Network Models
1. Set the seed for the analysis.
Use the following command:
set.seed(42)
2. Run the neural network analysis.
Use the following command:
> netfit <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr +
hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap +
hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent +
REMEDIAL_ANY + GEN_ESL_ANY,
data = training_e5na,
method = "nnet",
maxit=100,
trace=FALSE,
trControl = ctrl,
preProc = c("center", "scale"),
tuneLength = 10)
3. Apply relevant commands to view, plot, create/export the data.
For more information about neural networks, please see the following:
http://www.di.fc.ul.pt/~jpn/r/neuralnets/neuralnets.html
Gradient Boosted Machine (GBM)
1. Set the seed for the analysis.
Use the following command:
set.seed(42)
2. Run the GBM model.
Use the following command:
> gbmfit <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr +
hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap +
hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent +
REMEDIAL_ANY + GEN_ESL_ANY,
data = training_e5na,
method = "gbm",
trControl = ctrl,
verbose=FALSE,
preProc = c("center", "scale"),
tuneLength = 10)
3. Apply relevant commands to view, plot, create/export the data.
For more information about GBMs, please see the following:
http://topepo.github.io/caret/training.html
http://www.inside-r.org/packages/cran/gbm/docs/gbm