Using Decision Trees to Predict Student Placement and Course Success
CAIR Conference, November 21, 2014

Today's Presentation
• Multiple Measures Project overview
• Overview of recursive partitioning
• Pros and cons of decision trees
• Use for placement
• Code for creating decision trees in R
• Comparison to logistic regression
• Fit statistics
• Pruning, bagging, and random forests
• Decision trees and disproportionate impact

Multiple Measures Assessment Project
• Data warehouse
• Research base and predictive analytics
• K-12 messaging and data population
• Decision models and tools
• Professional development
• Pilot colleges and faculty engagement
• Integration with Common Assessment
Key idea: less about testing, more about placement and support for student success.

Data Warehouse for MMAP
Data in place:
• K-12 transcript data
• CST, EAP, and CAHSEE
• Accuplacer
• CCCApply
• MIS
• Other College Board assessments
• Local assessments
Coming soon:
• Compass
• Common Assessment and SBAC

Pilot Colleges Participating
• Allan Hancock
• Bakersfield College
• Cañada College
• Contra Costa Community College District
• Cypress College
• Foothill-De Anza Community College District
• Fresno City College
• Irvine Valley College
• Peralta Community College District
• Rio Hondo College
• San Diego City College
• Santa Barbara City College
• Santa Monica College
• Sierra College

Placement at the CCCs
• Test-heavy process
• Placement cannot, by law, rely only on a test
  – Multiple measures have traditionally involved a few survey-type questions
  – More of a nudge than a true determinant
• Role of counselors
• Informed self-placement
• AP coursework & equivalencies

Validating Placement Tests
• Content validity
• Criterion validity
• Argument-based validity
  – Validating the outcome of the decision made on the basis of the placement system/process
• Critiques of the current placement system as prone to a high degree of "severe error" (Belfield & Crosta, 2013; Scott-Clayton, 2012;
Scott-Clayton, Crosta, & Belfield, 2012; Willett, 2013)

Decision Trees
• Howard Raiffa explains decision trees in Decision Analysis (1968).
• Ross Quinlan invented ID3 in the late 1970s and presented it formally in "Induction of Decision Trees" (Machine Learning, 1986).
  – Inspired by the work of Hunt and others in the 1950s and 60s
• CART popularized by Breiman et al. in the mid-1980s
  – Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.
  – Based on information theory rather than classical statistics; developed in part for signal recognition
• Today we will discuss recursive partitioning and regression trees (i.e., 'rpart' for R).

Dog Decision Tree
[Figure: a decision tree sorting dogs, illustrating increasing homogeneity with each split. Mixed groups at the root node (e.g., ABZ, ZAB, BZA) are divided along branches and internal nodes into purer leaf nodes (e.g., AAA, ZZZ, BBB).]

How is homogeneity measured?
Gini-Simpson Index: D = 1 − Σ_{i=1}^{n} p_i^2
Shannon Information Index: H' = −Σ_{i=1}^{n} p_i · ln(p_i)
The Gini-Simpson index answers: if two individual items are selected at random from a collection, what is the probability that they belong to different categories? The Shannon index measures the diversity of a collection of items; higher values indicate greater diversity. (By contrast, the Gini coefficient measures the inequality of a distribution, with a value of 0 expressing total equality and a value of 1 maximal inequality.)

Pros and Cons of Decision Trees
Strengths
• Visualization
• Easy-to-understand output
• Easy to code rules
• Models complex relationships easily
• Linearity and normality not assumed
• Handles large data sets
• Can use categorical and numeric inputs
Weaknesses
• Results depend on the training data set – can be unstable, esp.
with small N
• Can easily overfit the data
• Out-of-sample predictions can be problematic
• Greedy method selects only the 'best' predictor at each split
• Trees must be re-grown when new observations are added
http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf

Libraries and Code for R: Your Basic Classification Decision Tree
Data <- read.csv("C:/Folder/Document.csv", header = TRUE)
Data.df <- data.frame(Data)
DataTL <- subset(Data.df, EnglishLevel == "Transfer Level")
library(rpart)
library(rpart.plot)
ctrl <- rpart.control(minsplit = 100, minbucket = 1, cp = 0.001)  # minimum node sizes and complexity parameter
DataTransferLevel <- subset(Data.df, CourseLevel == 1)
fitTL <- rpart(formula = success ~ Delay + CBEDS_rank + course_gp + A2G + cst_ss + grade_level + GPA_sans,
               data = DataTL, method = "class", control = ctrl)
printcp(fitTL)   # complexity-parameter table for pruning
prp(fitTL)       # plot the tree
rsq.rpart(fitTL)
print(fitTL)
print(fitTL, minlength = 0, spaces = 2, digits = getOption("digits"))
summary(fitTL)

Decision tree predicting success in transfer-level English
Misclassification = 29%; 1 = predicted success, 0 = predicted non-success
GPA_sans is a student's cumulative high school GPA, excluding grades in English. course_g is a student's grade in their most recent high school English course. Delay is the number of primary terms between the last high school English course and the first college English course.
Note that GPA_sans was the most important predictor variable, as determined by the random forest aggregated bootstrap method.

Decision tree predicting success in transfer-level Math
Misclassification = 36%; 1 = predicted success, 0 = predicted non-success
GPA_sans is a student's cumulative high school GPA, excluding grades in math. hs_course is a student's grade in their most recent high school math course. Delay is the number of primary terms between the last high school math course and the first college math course. cst_ss is the scaled score from a student's California Standards Test (CST). CBEDS_ra is the rank or level of a student's last high school math course.
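An aside on the homogeneity indices defined earlier: the following minimal sketch (in Python rather than R, with hypothetical class labels) shows how the Gini-Simpson and Shannon indices are computed from the class labels in a node.

```python
import math

def proportions(labels):
    """Class proportions p_i for a collection of labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return [c / n for c in counts.values()]

def gini_simpson(labels):
    """Gini-Simpson index D = 1 - sum(p_i^2): the probability that two
    randomly selected items fall in different categories."""
    return 1 - sum(p ** 2 for p in proportions(labels))

def shannon(labels):
    """Shannon information index H' = -sum(p_i * ln p_i)."""
    return -sum(p * math.log(p) for p in proportions(labels))

# A pure node has zero impurity; a 50/50 node is maximally impure for 2 classes.
print(gini_simpson(["pass", "pass", "pass"]))          # 0.0
print(gini_simpson(["pass", "fail", "pass", "fail"]))  # 0.5
print(shannon(["pass", "fail", "pass", "fail"]))       # 0.693... (= ln 2)
```

Splits that most reduce such an impurity measure are the ones a recursive-partitioning algorithm chooses at each node.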
Note that GPA_sans was the most important predictor variable, as determined by the random forest aggregated bootstrap method.

Libraries and Code for R: Bagging and Random Forests
library(ipred)
btTL <- bagging(cc_success ~ Delay + CBEDS_rank + course_gp + A2G + cst_ss + grade_level + GPA_sans,
                data = DataTL, nbagg = 30, method = "class", coob = TRUE)  # coob: compute out-of-bag error
print(btTL)
library(randomForest)
DataTL$cc_success <- factor(DataTL$cc_success)  # classification requires a factor response
rfTL <- randomForest(cc_success ~ Delay + CBEDS_rank + course_gp + A2G + cst_ss + grade_level + GPA_sans,
                     data = DataTL, importance = TRUE, na.action = na.exclude, ntree = 100)
print(rfTL)
importance(rfTL, type = 1)  # type 1: mean decrease in accuracy
varImpPlot(rfTL)

Key Considerations
• Splitting criterion: how small should the leaves be? What is the minimum number of observations required to attempt a split?
• Stopping criterion: when should one stop growing a branch of the tree?
• Pruning: avoiding overfitting of the tree and improving generalization
• Understanding classification performance

Two Approaches to Avoid Overfitting
Forward pruning (pre-pruning): stop growing the tree earlier.
• Stop splitting a node if the number of samples is too small to make reliable decisions.
• Stop if the proportion of samples from a single class (node purity) exceeds a given threshold.
Post-pruning: allow the tree to overfit, then prune it back.
• Use estimates of error and tree size to decide which subtrees should be pruned.

Fit Statistics: Evaluating Your Tree
• Misclassification rate – the number of incorrect predictions divided by the total number of classifications.
• Sensitivity – the percentage of cases that actually experienced the outcome (e.g., "success") that were correctly predicted by the model (i.e., true positives).
• Specificity – the percentage of cases that did not experience the outcome (e.g., "unsuccessful") that were correctly predicted by the model (i.e., true negatives).
• Positive predictive value – the percentage of correctly predicted successful cases relative to the total number of cases predicted as successful.
• Negative predictive value – the percentage of correctly predicted unsuccessful cases relative to the total number of cases predicted as unsuccessful.

Libraries and Code for R: Confusion Matrix
pred <- predict(fitTL, type = "class")
table(pred, e5$success)             # pred = row, actual = col
table(pred, e5$success, e5$gender)  # broken out by gender

Disproportionate Impact
• Renewed interest in equity across gender, ethnicity, age, disability, foster youth, and veteran status
• Do a student's demographics predict placement level?
• If so, what is the degree of impact, and what can be done to mitigate it?

Combining Models
• How should multiple measures be combined with data from placement tests?
• Decision theory / models for combining data:
  – Disjunctive (either/or) – multiple measures as a possible alternative to the test
  – Conjunctive (both/and) – multiple measures as an additional limit
  – Compensatory (blended) – multiple measures as an additional factor in an algorithm

Does the Model Matter?
Model           | Placement rule
Accuplacer Only | Accuplacer
Disjunctive A   | High Pass in English 12 or higher, or Accuplacer
Disjunctive B   | Pass in English 12 or higher, or Accuplacer
Conjunctive A   | High Pass in Fall and Spring English 12 or higher
Conjunctive B   | Pass in Fall and Accuplacer placement
From Willett, Gribbons & Hayward (2014)

[Bar chart: number of students placed into English 101 under each model (Accuplacer Only, Disjunctive A, Disjunctive B, Conjunctive A, Conjunctive B); counts ranged from 228 to 899 students. From Willett, Gribbons & Hayward (2014).]

Thank You
Craig Hayward, Director of Planning, Research, and Accreditation, Irvine Valley College – chayward@ivc.edu
John Hetts, Senior Director of Data Science, Educational Results Partnership – jhetts@edresults.org
Ken Sorey, Director, CalPASS Plus – ken@edresults.org
Terrence Willett, Director of Planning, Research, and Knowledge Systems, Cabrillo College – terrence@cabrillo.edu
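Supplement: the fit statistics defined earlier all fall out of the four cells of a 2x2 confusion matrix like the one produced by the R table() calls above. A minimal Python sketch (the cell counts here are hypothetical, not from the study):

```python
def fit_stats(tp, fp, fn, tn):
    """Fit statistics from confusion-matrix cell counts:
    tp/fp = true/false positives, fn/tn = false/true negatives."""
    total = tp + fp + fn + tn
    return {
        "misclassification": (fp + fn) / total,  # wrong predictions / all cases
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "ppv": tp / (tp + fp),                   # positive predictive value
        "npv": tn / (tn + fn),                   # negative predictive value
    }

# Hypothetical counts for illustration:
stats = fit_stats(tp=60, fp=15, fn=10, tn=15)
print(stats["misclassification"])  # 0.25
print(stats["ppv"])                # 0.8
```

Reporting these separately matters for placement work: a model can have a low overall misclassification rate while still having poor specificity or negative predictive value for particular student groups.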