Open Academic Analytics Initiative (OAAI)
"Creating an Open Ecosystem for Learning Analytics"
Josh Baron, Marist College
Sandeep Jayaprakash, Marist College
Nate Angell, rSmart
2012 Jasig Sakai Conference, June 10-15, 2012: Growing Community; Growing Possibilities

Agenda
Building the Predictive Model
◦ Overview of the process
◦ Data sets used and data extraction process
◦ Overview of Pentaho and training process
Deploying the Predictive Model
◦ Using Pentaho to score data
◦ Performance of the predictive model
◦ Producing Academic Alert Reports (AARs)
Overview of Intervention Strategies
Current Outcomes and Next Steps

OAAI is using two primary data sources:
◦ Student Information System (SIS – Banner): demographics, aptitude (SATs, GPA)
◦ Learning Management System (LMS): event logs, gradebook
Goal: create an open-source "early alert" system
◦ Predict "at risk" students in the first 3 weeks of a course
◦ Deploy an intervention to ensure the student succeeds

[Diagram: Static data (student aptitude data such as SATs and current GPA; student demographic data such as age and gender) and dynamic data (Sakai event log data; Sakai gradebook data) feed a predictive model developed with historical data. Model scoring identifies students "at risk" of not completing the course, and an intervention is deployed.]

Purdue University's Course Signals Project
◦ Built on dissertation research by Dr. John Campbell
◦ Ellucian product that integrates with Blackboard
◦ Students in courses using Course Signals scored up to 26% more A or B grades, up to 12% fewer C's, and up to 17% fewer D's and F's
◦ Positive effect on four-year retention rates: 69% with no Course Signals courses vs. 93% with two or more Course Signals courses
Interventions that utilize "support groups"
◦ Improved 1st and 2nd semester GPAs
◦ Increased semester persistence rates (79% vs. 39%)

Building an "open ecosystem" for learning analytics
◦ Sakai Collaboration and Learning Environment: Sakai API to automate secure data capture; will also facilitate use of Course Signals & IBM SPSS
◦ Pentaho Business Intelligence Suite: open-source data mining, integration, analysis and reporting tools
◦ OAAI predictive model released under an open-source license in Predictive Model Markup Language (PMML)

Researching critical analytics scaling factors
◦ How "portable" are predictive models?
◦ What intervention strategies are most effective?
Release of the Sakai Academic Alert System (beta)
◦ Will be included as part of a Sakai CLE release
Conducted real-world pilots
◦ 36 courses at community colleges
◦ 36 courses at HBCUs
Research findings related to:
◦ Strategies for effectively "porting" predictive models
◦ The use of online communities and OER to impact course completion, persistence and content mastery
Wave I EDUCAUSE Next Generation Learning Challenges (NGLC) grant
◦ Funded by the Bill and Melinda Gates and Hewlett Foundations
◦ $250,000 over a 15-month period
◦ Began May 1, 2011; ends January 2013 (extended)

Building the Predictive Model
◦ Overview of the process
◦ Data sets and data extraction
◦ Overview of Pentaho and training process

Development and initial deployment of an "open source" predictive model of academic risk
◦ Methodological framework for model development
◦ Empirical analysis of predictive performance (preliminary results)

Data integration using Pentaho Kettle
◦ Data extraction phase
◦ Transformation phase
◦ Load phase
Predictive modeling using Pentaho WEKA
◦ Training phase
◦ Testing phase

[Architecture diagram: source data (CMS (Sakai) event data; partial grades (gradebook) data; personal (bio), course, and performance data) feeds an ETL layer (SQL Server 2008 R2, Pentaho Kettle) that anonymizes records by institution, semester, program (grad/ugrad), course, and student. In the predictive modeling layer, data is preprocessed (missing values, outliers, derived features) and partitioned; the training partition is balanced and used to train classifiers that are stored in a library of predictive models, tested against the test partition, and used to predict (score) new data on the target feature. Modeling software: IBM SPSS Modeler, Pentaho Weka, Pentaho Kettle. Hardware: IBM x3400, Xeon E5410 2.33 GHz quad-core, 64-bit, 10 GB RAM, Windows Server 2008 Standard Edition.]

Data extraction
◦ SQL queries extract grade and user event data from the Sakai CLE (see the Sakai wiki for details)
◦ Ensure access to historical data: data warehouse, backups, etc.
◦ Extract from backups to ensure no impact on production performance
◦ Encrypt user IDs for anonymization
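The deck does not show how the user IDs are protected during extraction; as a minimal sketch, one common approach is a one-way keyed hash, so the same student always maps to the same opaque token without being reversible. The salt value and function name below are assumptions for illustration, not the OAAI implementation:

```python
import hashlib
import hmac

# Hypothetical secret salt; in practice this would be managed securely and
# kept stable across extracts so each student maps to the same token.
SALT = b"institution-secret-salt"

def anonymize_user_id(user_id: str) -> str:
    """Replace a Sakai/Banner user ID with a one-way keyed hash."""
    return hmac.new(SALT, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

print(anonymize_user_id("jsmith42"))  # same input -> same token, not reversible
```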
Data mining and predictive modeling are affected by input data of diverse quality; a predictive model is usually only as good as its training data.
◦ Good: lots of data
◦ Not so good: data quality issues
◦ Not so good: unbalanced classes (at Marist, 6% of students are at risk; good for the student body, bad for training predictive models)

Sources of variability:
◦ Variability in instructors' assessment criteria
◦ Variability in workload criteria
◦ Variability in the period used for prediction (early detection)
◦ Variability in multiple-instance data (partial grades with variable contribution and heterogeneous composition)
Solution: use ratios
◦ Percent of usage over average percent of usage per course
◦ Effective weighted score / average effective weighted score

Variability in Sakai tool usage
◦ No uniform criterion in the use of CMS tools (faculty members are a wild bunch)
◦ Tools not used, data not entered, too much missing data
[Chart: extent of available vs. missing data by tool: Forums, Content, Lessons, Assignments, Assessments]

Fall 2010 data sample of undergraduate students. Datasets were joined, and the data was cleaned, recoded, and aggregated to produce an input data file of 3,877 records, each corresponding to a course taken by a student.

Feature type  Feature name
Predictors    GENDER, SAT_VERBAL, SAT_MATH, APTITUDE_SCORE, FTPT, CLASS, CUM_GPA, ENROLLMENT, ACADEMIC_STANDING, RMN_SCORE, R_SESSIONS, R_CONTENT_READ
Target        ACADEMIC_RISK (1 = at risk; 0 = student in good standing)

Methodology:
◦ Use Weka 3.7 and IBM SPSS Modeler 14.2
◦ Generate 5 different random partitions (70% training, 30% testing)
◦ Balance each training dataset
◦ Train a predictive model (logistic regression, C4.5 decision tree, SVM) for each balanced training dataset: 5 x 3 = 15 models
◦ Measure predictive performance of the classifiers: recall, precision, specificity
◦ Produce summary measures (mean and standard error)
(A sketch of this evaluation loop appears below, after the RMN_SCORE example.)

Computing RMN_SCORE (the same steps are applied at training time and at testing/scoring time):
◦ For every student in each course, compute the effective weighted score as sumproduct(partial scores, partial weights) / sum(partial weights)
◦ Compute the average effective weighted score for the course
◦ Calculate the ratio: RMN_SCORE = effective weighted score / average effective weighted score
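As a concrete illustration of the ratio just defined, here is a minimal Python sketch. The actual pipeline computes this inside Pentaho Kettle; the row layout and sample values below are assumptions for the example:

```python
from collections import defaultdict

# Hypothetical gradebook rows: (course_id, student_id, partial_score, partial_weight)
gradebook = [
    ("CS101", "s1", 85.0, 0.4), ("CS101", "s1", 70.0, 0.6),
    ("CS101", "s2", 55.0, 0.4), ("CS101", "s2", 60.0, 0.6),
]

def effective_weighted_scores(rows):
    """sumproduct(partial scores, partial weights) / sum(partial weights), per student per course."""
    num, den = defaultdict(float), defaultdict(float)
    for course, student, score, weight in rows:
        num[(course, student)] += score * weight
        den[(course, student)] += weight
    return {key: num[key] / den[key] for key in num}

def rmn_scores(rows):
    """RMN_SCORE = effective weighted score / course-average effective weighted score."""
    ews = effective_weighted_scores(rows)
    totals, counts = defaultdict(float), defaultdict(int)
    for (course, _student), value in ews.items():
        totals[course] += value
        counts[course] += 1
    return {
        (course, student): value / (totals[course] / counts[course])
        for (course, student), value in ews.items()
    }

print(rmn_scores(gradebook))  # {('CS101', 's1'): ~1.13, ('CS101', 's2'): ~0.87}
```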
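The modeling itself was done in Weka and SPSS Modeler, whose streams are not reproducible here; the sketch below re-creates the same protocol (5 random 70/30 partitions, balance the training set, train the three classifier families, summarize recall/specificity/precision) using scikit-learn and imbalanced-learn as stand-ins. The libraries, SMOTE as the balancing step, and the plain decision tree in place of C4.5/C5.0 are all assumptions, not the project's code. The results table that follows shows the deck's actual numbers:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5/C5.0

def evaluate(X, y, n_partitions=5):
    """5 random 70/30 partitions; balance the training set; train 3 classifiers; summarize metrics."""
    results = {name: [] for name in ("logistic", "tree", "svm")}
    for seed in range(n_partitions):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, random_state=seed, stratify=y)
        # Balance the minority "at risk" class in the training data only.
        X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
        models = {
            "logistic": LogisticRegression(max_iter=1000),
            "tree": DecisionTreeClassifier(),
            "svm": SVC(),
        }
        for name, model in models.items():
            model.fit(X_bal, y_bal)
            tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
            results[name].append({
                "recall": tp / (tp + fn),       # ability to detect at-risk students
                "specificity": tn / (tn + fp),  # ability to rule out students doing fine
                "precision": tp / (tp + fp),    # share of alerts that are not false alarms
            })
    for name, runs in results.items():
        for metric in ("recall", "specificity", "precision"):
            vals = np.array([run[metric] for run in runs])
            se = vals.std(ddof=1) / np.sqrt(len(vals))
            print(f"{name} {metric}: mean={vals.mean():.3f} SE={se:.3f}")
```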
Classification performance by algorithm and partition (each cell: Train% / Test%).
Partition sizes (Train/Test): P1 5089/1182, P2 5041/1146, P3 4952/1207, P4 5014/1159, P5 5095/1134.

Algorithm      Metric       P1            P2            P3            P4            P5            Mean          SE
SVM            Accuracy     93.40/90.52   92.66/90.49   91.54/88.73   92.98/89.91   93.27/90.48   92.77/90.03   0.74/0.77
SVM            Recall       95.91/86.15   93.94/77.46   92.64/86.30   94.51/77.78   95.83/85.29   94.57/82.60   1.37/4.56
SVM            Specificity  90.85/90.78   91.43/91.35   90.47/88.89   91.50/90.71   90.76/90.81   91.00/90.51   0.45/0.94
SVM            Precision    91.42/35.22   91.36/37.16   90.46/33.33   91.46/35.67   91.03/37.18   91.14/35.71   0.42/1.59
C5.0           Accuracy     99.96/94.59   99.64/95.46   99.94/94.61   99.96/94.39   99.65/94.80   99.83/94.77   0.17/0.41
C5.0           Recall      100.00/66.15   99.39/61.97  100.00/58.90  100.00/55.56   99.40/54.41   99.76/59.40   0.33/4.80
C5.0           Specificity  99.92/96.24   99.88/97.67   99.88/96.91   99.92/96.96   99.88/97.37   99.90/97.03   0.02/0.54
C5.0           Precision    99.92/50.59   99.88/63.77   99.88/55.13   99.92/54.79   99.88/56.92   99.90/56.24   0.02/4.81
Logistic Reg.  Accuracy     91.26/89.00   89.59/90.23   90.15/88.48   91.12/88.96   90.30/91.01   90.48/89.54   0.70/1.05
Logistic Reg.  Recall       92.98/86.15   89.09/88.73   90.80/90.41   92.07/83.33   91.07/89.71   91.20/87.67   1.46/2.91
Logistic Reg.  Specificity  89.50/89.17   90.06/90.33   89.51/88.36   90.21/89.33   89.55/91.09   89.77/89.65   0.34/1.06
Logistic Reg.  Precision    90.00/31.64   89.63/37.72   89.41/33.33   90.06/34.09   89.51/39.10   89.72/35.18   0.29/3.12

Logistic regression and SVM did much better than C5.0/J4.8
◦ They detect 82% to 87% of the student population at risk
◦ In comparison, the recall of C5.0/J4.8 is 59% (why so low?)
False positives:
◦ 10% false positives over OK students (C5.0/J4.8 does better: 3%)
◦ 65% of predictions are false alarms (C5.0/J4.8 does better: 44%)

Most important predictors, for logistic regression:
◦ RMN_SCORE
◦ ACADEMIC_STANDING, CUM_GPA
◦ Then R_SESSIONS and SAT_VERBAL
For the SVM classifier:
◦ RMN_SCORE
◦ CUM_GPA, ACADEMIC_STANDING, R_SESSIONS and SAT_VERBAL
For C5.0/J4.8:
◦ Minimal difference among predictors

Results are encouraging, although the number of false alarms raises some concern.
Differences among classifiers, in particular decision trees (typically very robust classifiers), require further investigation.
Data quality (missing values) remains an open issue with only partial remediation.
The partial-grades-derived score (RMN_SCORE) remains the best predictor; CMS-generated events appear to be second-tier predictors.

Deploying the Predictive Model
◦ Using Pentaho to score data
◦ Performance of the predictive model
◦ Producing Academic Alert Reports

ETL phase: remains similar to the ETL process used for training the model, except that records with missing data are also retained.
Scoring phase: uses the WEKA Scoring plugin to embed the WEKA predictive model into Pentaho Kettle.
Reporting phase: uses the Pentaho Report Designer tool to create a template for reporting.
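The Kettle transformation itself is not shown in the deck; as a rough stand-in for what the scoring step does, here is a minimal Python sketch that loads a previously trained, serialized model and writes an alert file. The file names, feature subset, pickle format, and column names are all assumptions for illustration, not the OAAI pipeline:

```python
import csv
import pickle

# Hypothetical: load a previously trained and serialized classifier
# (the OAAI pipeline instead embeds a Weka model via the WEKA Scoring plugin).
with open("risk_model.pkl", "rb") as fh:
    model = pickle.load(fh)

# Score the current semester's students and flag those predicted at risk.
with open("current_semester_features.csv") as src, \
        open("academic_alerts.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["student_token", "at_risk"])
    for row in reader:
        # Illustrative feature subset; the real model uses the full predictor list.
        features = [[float(row[c]) for c in ("CUM_GPA", "RMN_SCORE", "R_SESSIONS")]]
        writer.writerow([row["student_token"], int(model.predict(features)[0])])
```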
Intervention strategies:
◦ Awareness Messaging
◦ Online Academic Support Environment (OASE)

Researching the effectiveness of two strategies
◦ Awareness Messaging
◦ Online Academic Support Environment (OASE): OER content (Flat World Knowledge), self-assessments, learning skills, and learning support facilitation and mentoring

Current Outcomes and Next Steps
◦ Initial research findings
◦ Future efforts

Applied analytical techniques similar to those used by Campbell at Purdue, using Fall 2010 data.
Marist College and Purdue University:
◦ Differences: institutional type and size
◦ Similarities: % students receiving federal Pell Grants, % ethnicity, ACT composite 25th/75th percentiles
We found similarities in correlation values. As at Purdue, all these metrics are significantly correlated with course grade, though with rather low correlation values.

Correlation of undergraduate CMS event frequencies with course grade:

Event type             Marist, Fall 2010 (N=18968)      Campbell, 2007 (N=27276)
Sessions opened        r=0.147, p=0.000(**), N=11195    (no values reported)
Content viewed         r=0.098, p=0.000(**), N=7651     r=0.112, p=0.000(**), N=19205
Discussions read       r=0.133, p=0.000(**), N=1552     r=0.068, p=0.000(**), N=7667
Discussions posted     r=0.233, p=0.000(**), N=1507     r=0.061, p=0.000(**), N=7292
Assignments submitted  r=0.146, p=0.000(**), N=3245     r=0.163, p=0.000(**), N=4309
Assessments submitted  r=0.161, p=0.000(**), N=1423     r=0.238, p=0.000(**), N=4085
(**) Significant at the 0.01 level (2-tailed). Marist data uses ratios over the course mean instead of frequencies.

Initial review of instructor "research logs" showed general agreement with the predictions. Faculty and student feedback has been positive:
"Not only did this project directly assist my students by guiding students to resources to help them succeed, but as an instructor, it changed my pedagogy; I became more vigilant about reaching out to individual students and providing them with outlets to master necessary skills. P.S. I have to say that this semester, I received the highest volume of unsolicited positive feedback from students, who reported that they felt I provided them exceptional individual attention!"
Next steps:
◦ Develop and release a "student effort data" API
◦ Develop a Sakai Academic Alert dashboard
◦ Create customized predictive models for different academic contexts
◦ Work to facilitate the use of SNAPP with Sakai

Sakai Confluence Wiki – Open Academic Analytics Initiative (OAAI):
https://confluence.sakaiproject.org/pages/viewpage.action?pageId=75671025
Contact:
Josh Baron – Senior Academic Technology Officer, Marist College, josh.baron@marist.edu
Sandeep M. Jayaprakash – Technical Consultant OAAI, Marist College, sandeep.jayaprakash1@marist.edu

[Diagram: Student data (demographics and course enrollment) from Banner (ERP) and course event data from Sakai are combined during data extraction (course event data aggregated, student data added, student identity removed) and then preprocessed (missing values, outliers, incomplete records, derived features) to produce the data sets. Identifying student information is removed during the data extraction process.]

[Weka Knowledge Flow diagram: partition, filter, and balance steps feeding Naive Bayes, J4.8 decision tree, logistic regression, and SVM (SMO) classifiers.]

[IBM SPSS Modeler stream: partition, filter, and balance steps feeding logistic regression, SVM, and C5.0 decision tree classifiers.]

Balance the training dataset:
◦ Subsample: reduce the dataset (drop majority-class records)
◦ Oversample: duplicate records of the minority class
◦ SMOTE: Nitesh V. Chawla et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.

In the case of unbalanced classes, accuracy is a poor measure:
◦ Accuracy = (TP+TN) / (TP+TN+FP+FN): the large class overwhelms the metric
Better metrics, all computed from the confusion matrix:
◦ Recall = TP / (TP+FN): ability to detect the class of interest
◦ Specificity = TN / (TN+FP): ability to rule out the unimportant class
◦ Precision = TP / (TP+FP): ability to rule out false alarms

Classifiers considered: C4.5/C5.0 boosted decision trees, logistic regression, support vector machines.
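To make the accuracy pitfall concrete, here is a tiny worked example. The 6% at-risk base rate matches the Marist figure quoted earlier; the 1000-student sample and the do-nothing classifier are hypothetical:

```python
# A model that labels everyone "not at risk" on a hypothetical 1000-student
# sample with a 6% at-risk rate is 94% accurate yet detects nobody.
tp, fn = 0, 60    # all 60 at-risk students missed
tn, fp = 940, 0   # all 940 students in good standing correctly left alone

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)                      # ability to detect the class of interest
specificity = tn / (tn + fp)                 # ability to rule out the unimportant class
precision = tp / (tp + fp) if (tp + fp) else float("nan")  # undefined: no alerts raised

print(f"accuracy={accuracy:.2%} recall={recall:.2%} specificity={specificity:.2%}")
# accuracy=94.00% recall=0.00% specificity=100.00%
```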