OAAI Presentation

Josh Baron, Marist College
Sandeep Jayaprakash, Marist College
Nate Angell, rSmart
June 10-15, 2012
Growing Community;
Growing Possibilities


Open Academic Analytics Initiative (OAAI)
Building the Predictive Model
◦ Overview of the process
◦ Data sets used and data extraction process
◦ Overview of Pentaho and training process

Deploying the Predictive Model
◦ Using Pentaho to score data
◦ Performance of the predictive model
◦ Producing Academic Alert Reports (AARs)


Overview of Intervention Strategies
Current Outcomes and Next Steps
“Creating an Open Ecosystem for Learning Analytics”

OAAI is using two primary data sources:
◦ Student Information System (SIS – Banner)
 Demographics, Aptitude (SATs, GPA)
◦ Learning Management System (LMS)
 Event logs, Gradebook

Goal: create an open-source “early alert” system
◦ Predict “at risk” students in the first 3 weeks of a course
◦ Deploy an intervention to help the student succeed
[Diagram: Static data (student aptitude data: SATs, current GPA, etc.; student demographic data: age, gender, etc.) and dynamic data (Sakai event log data; Sakai gradebook data) feed a predictive model developed with historical data. Scoring identifies students “at risk” of not completing the course, and an intervention is deployed.]

Purdue University’s Course Signals Project
◦ Built on dissertation research by Dr. John Campbell
◦ Ellucian product that integrates with Blackboard
◦ Students in courses using Course Signals
 scored up to 26% more A or B grades
 up to 12% fewer C's; up to 17% fewer D's and F's
◦ Positive effect on four-year retention rates
 No Course Signals courses: 69%
 Two or more Course Signals courses: 93%

Interventions that utilize “support groups”
◦ Improved 1st and 2nd semester GPAs
◦ Increased semester persistence rates (79% vs. 39%)

Building an “open ecosystem” for learning analytics
◦ Sakai Collaboration and Learning Environment
 Sakai API to automate secure data capture
 Will also facilitate use of Course Signals & IBM SPSS
◦ Pentaho Business Intelligence Suite
 Open-source data mining, integration, analysis, and reporting tools
◦ OAAI Predictive Model released under an open-source license
 Predictive Modeling Markup Language (PMML)

Researching critical analytics scaling factors
◦ How “portable” are predictive models?
◦ What intervention strategies are most effective?

Release Sakai Academic Alert System (beta)
◦ Will be included as part of Sakai CLE release

Conducted real-world pilots
◦ 36 courses at community colleges
◦ 36 courses at HBCUs

Research findings related to…
◦ Strategies for effectively “porting” predictive models
◦ The use of online communities and OER to impact course completion, persistence, and content mastery




Wave I EDUCAUSE Next Generation Learning Challenges (NGLC) grant
Funded by the Bill and Melinda Gates and Hewlett Foundations
$250,000 over a 15-month period
Began May 1, 2011; ends January 2013 (extended)
Overview of the process
Data sets and data extraction
Overview of Pentaho and
training process
Development and initial deployment of an “open source” predictive model of academic risk

Methodological framework for model development
Empirical analysis of predictive performance (preliminary results)

Data Integration using Pentaho Kettle
 Data Extraction phase
 Transformation phase
 Load phase

Predictive Modeling using Pentaho WEKA
 Training phase
 Testing phase
[Diagram: OAAI analytics architecture]
Source data: CMS (Sakai) event data; partial grades (gradebook) data; personal (bio) data; course data; performance data.
ETL layer (SQL Server 2008 R2, Pentaho Kettle): data anonymized by institution, semester, program (grad/ugrad), course, and student.
Predictive modeling layer (training, testing, and scoring): data preprocessing (missing values, outliers, derived features); partition into training and test data; balance the training data; train and store a library of predictive models (classifiers); test; predict (score) new data against the target feature.
Software platform: IBM SPSS Modeler, Pentaho Weka, Pentaho Kettle
Hardware platform: IBM x3400, Xeon E5410 2.33 GHz, quad-core, 64-bit, 10 GB RAM
OS: Windows Server 2008 Standard Edition




SQL queries to extract grade and user event data from the Sakai CLE (see the Sakai wiki for details)
Ensure access to historical data: data warehouse, backups, etc.
Extract from a backup to avoid impacting production performance
Encrypt user IDs to anonymize users (a sketch follows)
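A minimal sketch of the extraction-plus-anonymization step, in Python. The table and column names here are hypothetical stand-ins (the actual queries are documented on the Sakai wiki), and a keyed one-way hash is one reasonable way to encrypt user IDs:

import hashlib
import hmac

SECRET_KEY = b"institution-held-secret"  # kept by the institution, never shipped with the extract

def anonymize(user_id: str) -> str:
    """Keyed one-way hash: extracts stay joinable across datasets
    without exposing student identity."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

def extract_events(conn):
    """conn: any DB-API connection to a backup copy of the Sakai database."""
    cur = conn.cursor()
    cur.execute("SELECT session_user, event, event_date FROM sakai_event")  # hypothetical names
    for user_id, event, event_date in cur:
        yield anonymize(user_id), event, event_date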

Data mining and predictive modeling are affected by input data of diverse quality

A predictive model is usually only as good as its training data

Good: lots of data

Not so good: data quality issues

Not so good: unbalanced classes (at Marist, 6% of students are at risk: good for the student body, bad for training predictive models)





Variability in instructors’ assessment criteria
Variability in workload criteria
Variability in the period used for prediction (early detection)
Variability in multiple-instance data (partial grades with variable contribution and heterogeneous composition)
Solution: use ratios (a minimal sketch follows this list)
◦ Percent of usage over the average percent of usage per course
◦ Effective weighted score / average effective weighted score
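A minimal sketch of the first ratio, in Python; the input layout and function names are illustrative, not OAAI code:

from statistics import mean

def usage_ratios(event_counts):
    """event_counts: {student_id: number of CMS events in one course}."""
    course_avg = mean(event_counts.values())
    # Dividing by the course average factors out course-level differences
    # in workload and tool usage, so the feature is comparable across courses.
    return {sid: n / course_avg for sid, n in event_counts.items()}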



Variability in Sakai tool usage
No uniform criterion in the use of CMS tools (faculty members are a wild bunch)
Tools not used, data not entered, too much missing data
[Chart: missing data by tool: Forums, Content, Lessons, Assignments, Assessments]

Fall 2010 data sample of undergraduate students
Datasets were joined and the data was cleaned, recoded, and aggregated to produce an input data file of 3,877 records, each corresponding to a course taken by a student.
Feature type   Feature name
Predictors     GENDER, SAT_VERBAL, SAT_MATH, APTITUDE_SCORE, FTPT, CLASS, CUM_GPA, ENROLLMENT, ACADEMIC_STANDING, RMN_SCORE, R_SESSIONS, R_CONTENT_READ
Target         ACADEMIC_RISK (1 = at risk; 0 = student in good standing)






Use Weka 3.7 and IBM SPSS Modeler 14.2
Generate 5 different random partitions (70% training, 30% testing)
Balance each training dataset
Train a predictive model (logistic regression, C4.5 decision tree, SVM) for each balanced training dataset
◦ 5 x 3 = 15 models
Measure the predictive performance of the classifiers
◦ recall, precision, specificity
Produce summary measures (mean and standard error)
A minimal sketch of this protocol follows.
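The sketch below uses scikit-learn as a stand-in for Weka 3.7 and IBM SPSS Modeler (the tools actually used); balancing of each training split is omitted here and shown on the SMOTE slide:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # CART, standing in for C4.5

def evaluate(X, y, n_partitions=5):
    models = {
        "logistic": LogisticRegression(max_iter=1000),
        "tree": DecisionTreeClassifier(random_state=0),
        "svm": SVC(),
    }
    scores = {name: [] for name in models}
    for seed in range(n_partitions):  # 5 random 70/30 partitions
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, random_state=seed, stratify=y)
        for name, model in models.items():  # 5 x 3 = 15 models
            y_hat = model.fit(X_tr, y_tr).predict(X_te)
            scores[name].append([recall_score(y_te, y_hat),
                                 precision_score(y_te, y_hat)])
    # Summary measures: mean and standard error over the partitions.
    return {name: {"mean": np.mean(v, axis=0),
                   "se": np.std(v, axis=0, ddof=1) / np.sqrt(n_partitions)}
            for name, v in scores.items()}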
 At model training time:
◦ For all students in all courses, compute the effective weighted score as sumproduct(partial scores, partial weights) / sum(partial weights)
◦ Compute the average effective weighted score for the course
◦ Calculate the ratio: RMN_SCORE = effective weighted score / average effective weighted score
 At testing time:
◦ For all students in the course being tested, compute the effective weighted score the same way
◦ Compute the average effective weighted score for the course
◦ Calculate the ratio: RMN_SCORE = effective weighted score / average effective weighted score
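A minimal sketch of this computation in Python; the data layout and names are illustrative, not OAAI code:

from statistics import mean

def effective_weighted_score(partials):
    """partials: list of (score, weight) pairs for the graded items so far."""
    total_weight = sum(w for _, w in partials)
    return sum(s * w for s, w in partials) / total_weight

def rmn_scores(course_partials):
    """course_partials: {student_id: [(score, weight), ...]} for one course."""
    ews = {sid: effective_weighted_score(p) for sid, p in course_partials.items()}
    course_avg = mean(ews.values())
    # The ratio to the course average normalizes away instructor-specific
    # grading and workload criteria.
    return {sid: score / course_avg for sid, score in ews.items()}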
Performance of the three classifiers over 5 random partitions (all values in %; each cell is Train / Test). Partition sizes (train/test): 1: 5089/1182, 2: 5041/1146, 3: 4952/1207, 4: 5014/1159, 5: 5095/1134.

SVM
  Accuracy     93.40/90.52  92.66/90.49  91.54/88.73  92.98/89.91  93.27/90.48 | mean 92.77/90.03 | SE 0.74/0.77
  Recall       95.91/86.15  93.94/77.46  92.64/86.30  94.51/77.78  95.83/85.29 | mean 94.57/82.60 | SE 1.37/4.56
  Specificity  90.85/90.78  91.43/91.35  90.47/88.89  91.50/90.71  90.76/90.81 | mean 91.00/90.51 | SE 0.45/0.94
  Precision    91.42/35.22  91.36/37.16  90.46/33.33  91.46/35.67  91.03/37.18 | mean 91.14/35.71 | SE 0.42/1.59

C5.0
  Accuracy     99.96/94.59  99.64/95.46  99.94/94.61  99.96/94.39  99.65/94.80 | mean 99.83/94.77 | SE 0.17/0.41
  Recall      100.00/66.15  99.39/61.97 100.00/58.90 100.00/55.56  99.40/54.41 | mean 99.76/59.40 | SE 0.33/4.80
  Specificity  99.92/96.24  99.88/97.67  99.88/96.91  99.92/96.96  99.88/97.37 | mean 99.90/97.03 | SE 0.02/0.54
  Precision    99.92/50.59  99.88/63.77  99.88/55.13  99.92/54.79  99.88/56.92 | mean 99.90/56.24 | SE 0.02/4.81

Logistic Reg.
  Accuracy     91.26/89.00  89.59/90.23  90.15/88.48  91.12/88.96  90.30/91.01 | mean 90.48/89.54 | SE 0.70/1.05
  Recall       92.98/86.15  89.09/88.73  90.80/90.41  92.07/83.33  91.07/89.71 | mean 91.20/87.67 | SE 1.46/2.91
  Specificity  89.50/89.17  90.06/90.33  89.51/88.36  90.21/89.33  89.55/91.09 | mean 89.77/89.65 | SE 0.34/1.06
  Precision    90.00/31.64  89.63/37.72  89.41/33.33  90.06/34.09  89.51/39.10 | mean 89.72/35.18 | SE 0.29/3.12

Logistic regression and SVM did much better than C5.0 / J4.8
◦ They detect 82% to 87% of the student population at risk
◦ In comparison, the recall of C5.0 / J4.8 is 59% (why so low?)

False positives:
◦ 10% false-positive rate over students who are OK (C5.0 / J4.8 does better: 3%)
◦ 65% of predictions are false alarms (C5.0 / J4.8 does better: 44%)

For logistic regression
◦ RMN_SCORE
◦ ACADEMIC_STANDING, CUM_GPA
◦ Then R_SESSIONS and SAT_VERBAL

For the SVM classifier
◦ RMN_SCORE
◦ CUM_GPA, ACADEMIC_STANDING, R_SESSIONS, and SAT_VERBAL

C5.0 / J4.8
◦ Minimal difference among predictors


Results are encouraging, although the number of false alarms raises some concern
Differences among classifiers, in particular the decision trees (typically very robust classifiers), require further investigation

Data quality (missing values) remains an open issue with only partial remediation

The partial-grades-derived score (RMN_SCORE) remains the best predictor

CMS-generated events appear to be second-tier predictors
Using Pentaho to score data
Performance of the predictive model
Producing Academic Alert Reports

ETL phase
◦ Remains similar to the ETL process used for training the model, except that records with missing data are also retained

Scoring phase
◦ Uses the WEKA Scoring plugin to embed the WEKA predictive model into Pentaho Kettle (a sketch follows)

Reporting phase
◦ Pentaho Report Designer is used to create a template for reporting
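A minimal sketch of the scoring phase in Python. In the pilot the trained Weka model is embedded in a Kettle transformation via the WEKA Scoring plugin; here a serialized scikit-learn-style model stands in, and the file name and record layout are hypothetical:

import pickle

def score_students(model_path, records):
    """records: iterable of (student_id, feature_vector) for the current term."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    for student_id, features in records:
        p_at_risk = model.predict_proba([features])[0][1]
        yield student_id, p_at_risk  # feeds the Academic Alert Report template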
Awareness Messaging
Online Academic Support
Environment (OASE)

Researching the effectiveness of two strategies
◦ Awareness Messaging
◦ Online Academic Support Environment (OASE)
• OER content (Flat World Knowledge)
• Self-assessments
• Learning skills
• Learning support, facilitation & mentoring
Initial research findings
Future efforts




Applied analytical techniques similar to those used by Campbell at Purdue, using Fall 2010 data
Marist College and Purdue University
◦ Differences: institutional type and size
◦ Similarities: % students receiving federal Pell Grants, % ethnicity, ACT composite 25th/75th percentile
We found similar correlation values: as at Purdue, all of these metrics are significantly correlated with course grade, though the correlations are rather low (see the table below).
Correlation of undergraduate CMS event frequencies with course grade
(Marist Fall 2010, N=18968; Campbell 2007, N=27276)

                        Marist Fall 2010               Campbell (2007)
Sessions Opened         r=0.147, p=0.000(**), N=11195  (no values reported)
Content Viewed          r=0.098, p=0.000(**), N=7651   r=0.112, p=0.000(**), N=19205
Discussions Read        r=0.133, p=0.000(**), N=1552   r=0.068, p=0.000(**), N=7667
Discussions Posted      r=0.233, p=0.000(**), N=1507   r=0.061, p=0.000(**), N=7292
Assign. Submitted       r=0.146, p=0.000(**), N=3245   r=0.163, p=0.000(**), N=4309
Assessments Submitted   r=0.161, p=0.000(**), N=1423   r=0.238, p=0.000(**), N=4085

(**) Significant at the 0.01 level (2-tailed)
Marist data uses ratios over the course mean instead of frequencies
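A minimal sketch of the underlying computation, assuming two aligned per-student sequences (tool-usage ratio and final course grade); correlations around 0.1-0.2 with p < 0.01, as in the table, are significant but weak:

from scipy.stats import pearsonr

def correlate(usage_ratio, course_grade):
    r, p = pearsonr(usage_ratio, course_grade)  # two-tailed p-value
    return r, p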


Initial review of instructor “research logs” showed general agreement with predictions
Faculty/student feedback has been positive:
"Not only did this project directly assist my students by guiding students to resources to help them succeed, but as an instructor, it changed my pedagogy; I became more vigilant about reaching out to individual students and providing them with outlets to master necessary skills. P.S. I have to say that this semester, I received the highest volume of unsolicited positive feedback from students, who reported that they felt I provided them exceptional individual attention!"




Develop and release a “student effort data” API
Develop a Sakai Academic Alert dashboard
Create customized predictive models for different academic contexts
Work to facilitate use of SNAPP with Sakai

Sakai Confluence Wiki – Open Academic Analytics Initiative (OAAI)
https://confluence.sakaiproject.org/pages/viewpage.action?pageId=75671025

Contact
 Josh Baron – Senior Academic Technology Officer, Marist College
josh.baron@marist.edu
 Sandeep M. Jayaprakash – Technical Consultant, OAAI, Marist College
sandeep.jayaprakash1@marist.edu
[Diagram: data extraction and preprocessing]
Student data (demographics & course enrollment) is extracted from Banner (ERP); course event data is extracted from Sakai.
Data sets: course event data aggregated, student data added, student identity removed.
Data preprocessing: missing values, outliers, incomplete records, derived features.
Identifying student information is removed during the data extraction process.
[Screenshot: Weka Knowledge Flow: Partition, Filter, and Balance steps feeding NBayes, DT J4.8, Logistic, and SVM/SMO classifiers]
[Screenshot: IBM SPSS Modeler stream: Partition, Filter, and Balance steps feeding Logistic, SVM, and DT C5.0 classifiers]

Balance the training dataset (a sketch with SMOTE follows)
 Subsample (reduce the majority class)
 Oversample (duplicate minority-class records)
 SMOTE
Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.
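A minimal sketch of the balancing step using imbalanced-learn's SMOTE (the pilot did this inside Weka and SPSS Modeler, not with this library):

from collections import Counter
from imblearn.over_sampling import SMOTE

def balance(X_train, y_train):
    # SMOTE synthesizes new minority-class ("at risk") examples by
    # interpolating between nearest neighbors, rather than duplicating rows.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)
    print("before:", Counter(y_train), "after:", Counter(y_bal))
    return X_bal, y_bal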



In the case of unbalanced classes, accuracy is a poor measure
◦ Accuracy = (TP+TN) / (TP+TN+FP+FN)
◦ The large class overwhelms the metric
Better metrics (computed in the sketch below):
◦ Recall = TP / (TP+FN): ability to detect the class of interest
◦ Specificity = TN / (TN+FP): ability to rule out the unimportant class
◦ Precision = TP / (TP+FP): ability to rule out false alarms
[Figure: confusion matrix]
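A minimal sketch of these metrics computed from raw confusion-matrix counts:

def metrics(tp, tn, fp, fn):
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "recall":      tp / (tp + fn),   # detect the at-risk class
        "specificity": tn / (tn + fp),   # rule out students in good standing
        "precision":   tp / (tp + fp),   # fraction of alerts that are real
    }

With 6% of students at risk, a model that flags no one still scores 94% accuracy but 0% recall, which is why accuracy alone is misleading here.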
C4.5/C5.0 Boosted Decision Tree
Logistic Regression
Support Vector Machines