Using R for Creating Predictive Models

Getting Started With R

R is an open source statistical package developed and maintained by its user community. It is free to use and offers many basic and advanced methods, including Classification and Regression Trees (CART); these were the primary factors in the decision to use R for this project. For more information about R, visit the R community website: http://www.inside-r.org/

R can consume more memory than other statistical software, so a computer with at least 8 GB of RAM is recommended.

Max Kuhn, the creator of the R caret package, provides the following modeling advice when working in the R environment:
1. Start with the most powerful black-box type models
2. Get a sense of the best possible performance
3. Then fit more simplistic/understandable models
4. Evaluate the performance cost of using a simpler model

General Notes about R
• R uses a command line interface. While not necessary, some users like to use an overlay GUI to work with R. Common interfaces include:
   o http://www.rstudio.com/ (used by some of the MMAP team)
   o http://www.rcommander.com/
   o http://mran.revolutionanalytics.com/download/
• The command space is called a workspace.
• R is a UNIX-based package; therefore, commands are case-sensitive.
• Comments can be placed anywhere on the command line as long as the comment starts with a hash mark (#).
• New command lines are preceded by a greater-than sign (>). This sign is part of the workspace prompt and does not need to be typed as part of a command.
• Objects in the R workspace are named using the assignment operator: <-
• R provides a mechanism for recalling previous commands: use the arrow keys to scroll up and down through the command history.
• Many R manuals and tutorials are available on the web, including YouTube videos.
   Some examples are:
   o http://cran.r-project.org/doc/manuals/R-intro.pdf
   o http://www.ats.ucla.edu/stat/r/seminars/intro.htm
• The full capacity of R is realized by adding free packages to the base R installation:
   o http://www.statmethods.net/interface/packages.html
• In addition, SPSS has a plug-in for R (search the web for "spss r plug in").

Step-by-Step Instructions

1. Download and install R: http://www.r-project.org/
   a. A key package for this analysis is the caret package. For more information about caret, see the following resources:
      http://topepo.github.io/caret/training.html
      https://www.youtube.com/watch?v=7Jbb2ItbTC4

2. Load the data into the workspace. The command to read in the data depends on the default working directory.
   a. Set the working directory so that R points to the correct file location. Use the following command:
      > setwd("insert file directory path here")
      EX: > setwd("C:/Users/mmap/Assessment")
   b. Select the appropriate file. CSV files are recommended. Use the following command to read the data file into the workspace:
      > read.csv("File Name", header = T, row.names = NULL)
      EX: > read.csv("College High School Transcript.csv", header = T, row.names = NULL)
      To name the data set in the R workspace, use the following command:
      > MMAPEngl <- read.csv("File Name", header = T, row.names = NULL, stringsAsFactors = FALSE)
      Command Breakdown
      o MMAPEngl <- assigns the data set on the right to the name MMAPEngl
      o row.names = NULL avoids issues in svm with duplicated row names
      o header = T identifies the first row of data as the header/variable names
      o stringsAsFactors = FALSE keeps character variables as they are instead of converting them to factors

3. Basic Commands
   This section provides some basic commands for gaining familiarity with R and your data set. Most commands have optional arguments, such as labels for plots. To obtain help on a command, type ? before the command.
   For example, ?plot will invoke the help page for the plot command. Web searches can also be helpful for examples and additional command options.
   To view the variable names and types, use the following command:
      str(MMAPEngl)
   View the data set:
      edit(MMAPEngl)
   Provide basic summary statistics for a numeric variable:
      summary(MMAPEngl$hs_12_gpa_cum)
   Plot a simple histogram for a numeric variable:
      hist(MMAPEngl$hs_12_gpa_cum)
   Scatterplot of two numeric variables:
      plot(MMAPEngl$hs_11_gpa_cum, MMAPEngl$CCFirstGradePoint)
   Table of frequencies (MMAPEnglna is the data set created in Step 5c):
      table(MMAPEnglna$AP_ANY_Cplus, MMAPEnglna$cc_first_level_rank)
   Bar plot based on a table:
      First, the table is loaded into the R workspace as an object called x that can be used in subsequent commands:
      x <- table(MMAPEnglna$cc_first_level_rank)
      Entering x at the command line will display the table. Any number, table, plot, data set, etc. can be loaded into an object for later use. Loading a new value into an existing object will overwrite the previous item.
      barplot(x)
   Add labels to the bar plot. Note that "smart quotes" will generate an error; use straight quotes.
      barplot(x, main="Level of First College English Course", xlab="Level")

4. Install the specific modules or packages that were used for the Multiple Measures Assessment Project models:
   a. rpart: Classification and Regression Tree (CART) modeling
   b. rpart.plot: Tool to create visual graphics of the CART
   c. e1071: miscellaneous functions that avoid error messages
   d. caret: primary modeling package
   e. dplyr: filtering tool
   Use the following commands in the workspace:
      > install.packages("dplyr", dependencies=TRUE)
      > install.packages("caret", dependencies=TRUE)
      > install.packages("e1071", dependencies=TRUE)
      > install.packages("rpart", dependencies=TRUE)
      > install.packages("rpart.plot", dependencies=TRUE)

5. Create a master subset of the data containing only the variables used in the model.
   a. Create a vector of variable names that will be used in any potential model.
      Use the following command:
      > Name of vector <- c("variable 1", "variable 2", "v…")
      Example:
      MMAPEnglModelVars <- c("derkey1", "hs_gpa_present_by_grade", "cc_first_level_rank", "CCFirstGradePoint", "hs_12_course_grade_points", "hs_12_gpa_cum", "hs_ela_eap", "hs_exit_subj_to_cc_entry_subj", "AP_ANY_Cplus", "EXPOSITORY_ANY_Cplus", "concurrent", "REMEDIAL_ANY", "GEN_ESL_ANY")
   b. Create the data subset containing only the columns whose variable names were listed in the previous step.
      Use the following command:
      > Name of subset <- Original Data File Name[, Name of Vector]
      Example:
      MMAPEnglModelVarsSubset <- MMAPEngl[, MMAPEnglModelVars]
   c. Omit rows with any NAs.
      Use the following command:
      MMAPEnglna <- na.omit(MMAPEnglModelVarsSubset)

6. Create subsets of the master subset file to build models based on the CB21 coding of the subject area community college courses.
   a. Load the dplyr package for filtering.
      Use the following command:
      > require(dplyr)
   b. Select students with no missing data, including all four years of high school GPAs.
      Use the following command:
      Name of subset <- filter(DataSet Name, Variable 1 == Specific Value, Variable … == Specific Value)
      Examples:
      e5na <- filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==5)
      e4na <- filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==4)
      e3na <- filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==3)
      e2na <- filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==2)
      e1na <- filter(MMAPEnglna, hs_gpa_present_by_grade==1111, cc_first_level_rank==1)
      Command Breakdown
      o == a double equal sign tests for exact equality; a single equal sign would not perform the comparison

7. Load the rpart packages to run the decision tree modeling.
   Use the following code:
      > require(rpart)
      > require(rpart.plot)

8. Run the model (Simple Method)
   a. Set the control parameters for the modeling. Use the following command. The parameters in the parentheses can be adjusted as needed.
      > ctrl <- rpart.control(minsplit=2, minbucket=100, cp=0.002)
   b. Regression Tree Modeling
      Use the following command:
      > Name of model <- rpart(Dependent Variable Name ~ Predictor Variable Name 1 + Predictor Variable Name 2 + …, data = Data File Name, method="anova", control = ctrl)
      Example:
      > fit.e5na <- rpart(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr + hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent + REMEDIAL_ANY + GEN_ESL_ANY, data = e5na, method="anova", control=ctrl)
      Command Breakdown
      o rpart() calls the function that runs the analysis
      o ~ separates the dependent variable from the predictor/independent variables in the model
      o + adds further predictor/independent variables to the model
      o data = e5na identifies the data set the model is fitted on. Replace this value for each data set that will be modeled using the regression tree analysis.
      o method = "anova" identifies the type of tree and depends on the data structure of the dependent variable. For a regression tree use "anova"; for a classification tree (dichotomous variables) use "class".
      o control = ctrl identifies the control parameters for the model. See Step 8a for the specific parameters. For more information about the control parameters in rpart see: https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/rpart.control.html
   c. View the output
      To view the output, enter the following commands:
      > printcp(Name of rpart analysis)
      > print(Name of rpart analysis)
      Example:
      > printcp(fit.e5na)
      > print(fit.e5na)
      Command Breakdown
      o printcp() prints the complexity parameter (cp) table for the fitted model
      o print() prints the output of the analysis and provides a summary of the branches
   d.
      Plot the tree
      To obtain a visual of the model (decision tree), use the following command:
      > prp(Name of rpart analysis, modifications to plot)
      Example:
      > prp(fit.e5na, extra = 1)
      Command Breakdown
      o prp() plots the decision tree for a specific analysis
      o extra = 1 is one of many options you can include to customize the tree. For more information about customizing the plots see: http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf

9. Run the model (Advanced). Complete Steps 1 through 7. Step 8 can be completed after this step in the advanced method.
   a. Load the caret package
      Use the following commands:
      > library("caret")
      > library("e1071")
      For a tutorial document on caret: http://www.edii.uclm.es/~useR2013/Tutorials/kuhn/user_caret_2up.pdf
   b. Set the seed. Setting the seed to the same value before partitioning allows the partitions and models to be recreated. For more information about seeds see: http://topepo.github.io/caret/training.html
      Use the following command:
      set.seed(42)
   c. Partition the data
      Use the following command:
      > Name of Training and Testing Set <- createDataPartition(Data File Name$Dependent Variable, p = 0.5, list = FALSE)
      Example:
      inTraining_e5na <- createDataPartition(e5na$CCFirstGradePoint, p = 0.5, list = FALSE)
      Command Breakdown
      o createDataPartition() is the command to partition the data; the outcome variable must be supplied as its first argument
      o $ separates the data file name from the variable name in that data set
      o p = sets the proportion of the data placed in the training set; the remainder of the data will be the test data. For larger data sets use p = 0.75. For smaller data sets use p = 0.50.
      o list = FALSE prevents the results from being returned as a list
   d.
      Assign data set names to the training and test sets that have been partitioned
      Use the following commands:
      > Name of Training Data Set <- Data Set Name[inTraining_Name of Partitioned Data Set, ]
      > Name of Test Data Set <- Data Set Name[-inTraining_Name of Partitioned Data Set, ]
      Examples:
      training_e5na <- e5na[inTraining_e5na, ]
      testing_e5na <- e5na[-inTraining_e5na, ]
      Command Breakdown
      o [inTraining_e5na, ] retrieves the rows that have been flagged as the training set
      o [-inTraining_e5na, ] retrieves all the other rows, which have not been flagged as training data; this is the testing data set
   e. Set the resampling method (for continuous outcomes, using a 10-fold repeated cross-validation method)
      Use the following command:
      Name of control parameter <- trainControl(method="insert type here", repeats = insert number here)
      Example:
      ctrl <- trainControl(method = "repeatedcv", repeats = 10)
      Command Breakdown
      o method="repeatedcv" is a repeated k-fold cross-validation method
      o repeats = 10 indicates that the cross-validation is repeated 10 times
   f. CART Modeling. Run the models for each level of interest.
      Use the following command:
      > Name of cart model <- train(Dependent Variable ~ Predictor Variable 1 + Predictor Variable 2 + …, data = Training data set name, method = insert type here, trControl = insert control parameters here, preProc = c(insert types here), tuneLength = Insert number here)
      Example for Transfer Level:
      cartfit_e5na <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr + hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent + REMEDIAL_ANY + GEN_ESL_ANY, data = training_e5na, method = "rpart", trControl = ctrl, preProc = c("center", "scale"), tuneLength = 10)
      Command Breakdown
      o train() fits the model over a set of candidate tuning parameter values and uses resampling to estimate a performance measure for each.
        For more information see: http://www.inside-r.org/packages/cran/caret/docs/train
      o CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + … are the variables of interest in the model
      o method="rpart" identifies the type of model to fit and resample
      o trControl=ctrl identifies the resampling method (see Step 9e)
      o preProc=c("center", "scale") identifies the transformations applied to the predictor variables (centering and scaling)
      o tuneLength=10 represents the number of levels for each tuning parameter in the train function
   g. View the CART fit statistics
      Use the following command to print the fit statistics:
      > Name of model
      Example:
      > cartfit_e5na
      Use the following command to show the importance metrics for each variable in the fitted model:
      varImp(Name of Model)
      Example:
      varImp(cartfit_e5na)
   h. Plot the model
      Plot the variable importances
      Use the following command:
      > plot(varImp(Model Name), main="Title of Plot")
      Example:
      > plot(varImp(cartfit_e4na), main="One Level Below Transfer English, 12th Grade data, no CST")
      Plot the decision trees
      Use the following command:
      > prp(Model Name$finalModel, main="Title of Plot", extra = Insert Number Here)
      Example:
      prp(cartfit_e5na$finalModel, main="Transfer Level English, 12th Grade data, no CST", extra=100)
      Note: finalModel is an object created by R and does not need to be renamed.
   i.
      Test the model by applying the rules developed from the training set to make predictions on the test set
      Create the predicted values using the following command:
      Name of predicted values <- predict(CART model training set name, newdata = Test Data Set Name)
      Example:
      cartpred_e5na <- predict(cartfit_e5na, newdata = testing_e5na)
      Create a column in the test data set and move the predicted values to the new column using the following code:
      > Testing Data File Name[, "new column name"] <- Name of predicted values
      Example:
      > testing_e5na[, "CARTpredvalues12thGrdDataNoCST"] <- cartpred_e5na
      Export the data file with the predicted values using the following code:
      > write.csv(Testing Data Set Name, "File Directory Path and File Name")
      Example:
      write.csv(testing_e5na, "Y:/MMAP/testing_e5na12thGrdGPA.csv")

Advanced Tips and Tricks

CART models were used for the decision rule sets, but predictions were also compared using linear regression and support vector machines. The following R commands run the comparative models; they use the same caret commands as above to produce the analyses. What distinguishes the code below from the code above is the "method" (analysis type), along with additional parameters that are specific to the method type. The following link provides a list of the methods in R's caret package: http://topepo.github.io/caret/modelList.html

Linear Regression Models
1. Set the seed for the analysis. Use the following command:
      set.seed(42)
2. Run the linear regression model. Use the following command:
      > regfit_e5na <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr + hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent + REMEDIAL_ANY + GEN_ESL_ANY, data = training_e5na, method = "lm", trControl = ctrl, preProc = c("center", "scale"), tuneLength = 10)
3. Print the output and relevant statistics.
   a) Analysis
      > regfit_e5na
   b) Additional regression coefficients
      > regfit_e5na$finalModel
      > coef(regfit_e5na$finalModel)
      > lm.beta(regfit_e5na$finalModel)
      > varImp(regfit_e5na)
      Note: lm.beta() is not part of base R or caret; it comes from an add-on package (for example, QuantPsyc or lm.beta) that must be installed and loaded first.
   For additional information about the different types of linear regression models in caret: http://topepo.github.io/caret/Linear_Regression.html
4. Plot the data and export the predicted values. See Steps 9h and 9i for descriptions.
   Command Example:
      > plot(varImp(regfit_e5na), main="Transfer Level English - Regression")
   Create the predicted values:
      > regpred_e5na <- predict(regfit_e5na, newdata = testing_e5na)
   Add the predicted values to the data set:
      > testing_e5na[, "regpredvalues"] <- regpred_e5na

Support Vector Machine Models with Radial Basis Function Kernel
1. Set the seed for the analysis. Use the following command:
      set.seed(42)
2. Run the support vector model with radial basis function kernel. Use the following command:
      > svmfit_e5na <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_cst_ss + hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent + REMEDIAL_ANY + GEN_ESL_ANY, data = training_e5na, method = "svmRadial", trControl = ctrl, preProc = c("center", "scale"), tuneLength = 10)
   For additional information about support vector machines: http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
3. Apply relevant commands to view, plot, and create/export the data.

Neural Network Models
1. Set the seed for the analysis. Use the following command:
      set.seed(42)
2. Run the neural network analysis.
   Use the following command (here training stands for the training data set, for example training_e5na):
      > netfit <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr + hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent + REMEDIAL_ANY + GEN_ESL_ANY, data = training, method = "nnet", maxit=100, trace=FALSE, trControl = ctrl, preProc = c("center", "scale"), tuneLength = 10)
3. Apply relevant commands to view, plot, and create/export the data.
   For more information about neural networks, please see the following: http://www.di.fc.ul.pt/~jpn/r/neuralnets/neuralnets.html

Gradient Boosted Machine (GBM)
1. Set the seed for the analysis. Use the following command:
      set.seed(42)
2. Run the GBM model. Use the following command (again, training stands for the training data set):
      > gbmfit <- train(CCFirstGradePoint ~ hs_assmt_sentskl_bestscr + hs_assmt_readcomp_bestscr + hs_ela_cst_ss + hs_12_course_grade_points + hs_12_gpa_cum + hs_ela_eap + hs_exit_subj_to_cc_entry_subj + AP_ANY_Cplus + EXPOSITORY_ANY_Cplus + concurrent + REMEDIAL_ANY + GEN_ESL_ANY, data = training, method = "gbm", trControl = ctrl, verbose=FALSE, preProc = c("center", "scale"), tuneLength = 10)
3. Apply relevant commands to view, plot, and create/export the data.
   For more information about GBMs, please see the following:
      http://topepo.github.io/caret/training.html
      http://www.inside-r.org/packages/cran/gbm/docs/gbm
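
Worked Sketch on Synthetic Data

As a recap of the simple method in Step 8 and the train/test idea in Step 9, the following is a minimal sketch on simulated data, not the MMAP project's actual code or results. The data frame demo, its simulated values, and every object name ending in _demo are hypothetical. The split uses base R's sample() in place of createDataPartition(), an assumption made so that only the rpart package is needed to run it.

```r
# Minimal sketch of the CART workflow on synthetic data.
# All data and object names here are hypothetical; only the column
# names echo the variables used in this guide.
library(rpart)  # included with standard R distributions

set.seed(42)  # make the simulated data and the split reproducible
n <- 2000
demo <- data.frame(
  hs_12_gpa_cum = round(runif(n, 1, 4), 2),  # simulated cumulative GPA
  AP_ANY_Cplus  = rbinom(n, 1, 0.3)          # simulated 0/1 flag
)
# Simulated outcome: grade points loosely driven by GPA, clipped to 0-4
demo$CCFirstGradePoint <- pmin(4, pmax(0,
  0.8 * demo$hs_12_gpa_cum + 0.3 * demo$AP_ANY_Cplus + rnorm(n, 0, 0.7)))

# 50/50 split with base R (Step 9c uses caret::createDataPartition instead)
train_idx     <- sample(seq_len(n), size = n / 2)
training_demo <- demo[train_idx, ]
testing_demo  <- demo[-train_idx, ]

# Same control settings as Step 8a
ctrl_demo <- rpart.control(minsplit = 2, minbucket = 100, cp = 0.002)

# Regression tree ("anova") fitted on the training rows only
fit_demo <- rpart(CCFirstGradePoint ~ hs_12_gpa_cum + AP_ANY_Cplus,
                  data = training_demo, method = "anova",
                  control = ctrl_demo)

printcp(fit_demo)  # complexity parameter (cp) table

# Predict on the held-out rows and store the predictions in a new column
testing_demo[, "CARTpred"] <- predict(fit_demo, newdata = testing_demo)

# Root mean squared error on the test set
rmse_demo <- sqrt(mean(
  (testing_demo$CCFirstGradePoint - testing_demo$CARTpred)^2))
print(rmse_demo)
```

With caret installed, the same training_demo and testing_demo objects could be passed to train() and predict() in the way Step 9 describes, which would make it possible to compare the test-set error of the CART model against the other methods listed above.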