General Features in Knowledge Tracing: Applications to Multiple Subskills, Temporal IRT & Expert Knowledge

Yun Huang, University of Pittsburgh*
José P. González-Brenes, Pearson*
Peter Brusilovsky, University of Pittsburgh
* First authors

This talk…
• What? Determine student mastery of a skill
• How? A novel algorithm called FAST
  – Enables features in Knowledge Tracing
• Why? Better and faster student modeling
  – 25% better AUC, a classification metric
  – 300 times faster than popular general-purpose student modeling techniques (BNT-SM)

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  1. Multiple subskills
  2. Temporal Item Response Theory
  3. Paper exclusive: Expert knowledge
• Execution time
• Conclusion

Motivation
• Personalize learning for students
  – For example, teach students new material as they learn, so we don't teach them material they already know
• How? Typically with Knowledge Tracing

[Figure: per-student sequences of incorrect (✗) and correct (✓) attempts, explained by a latent "masters a skill or not" state]

• Knowledge Tracing fits a two-state HMM per skill
• Binary latent variables indicate the student's knowledge of the skill
• Four parameters:
  1. Initial Knowledge
  2. Learning (transition)
  3. Guess (emission)
  4. Slip (emission)

What's wrong?
• Only uses performance data (correct or incorrect)
• We are now able to capture feature-rich data
  – MOOCs & intelligent tutoring systems can log fine-grained data
  – Used a hint, watched a video, after-hours practice…
• … and these features can carry information about, or intervene on, learning

What's a researcher gotta do?
• Modify the Knowledge Tracing algorithm
• For example, in just a small-scale literature survey, we found at least nine different flavors of Knowledge Tracing

So you want to publish in EDM?
1. Think of a feature (e.g., from a MOOC)
2. Modify Knowledge Tracing
3. Write paper
4. Publish
5. Loop!

Are all of those models sooooo different?
• No! We identify three main variants
• We call them the "Knowledge Tracing Family"

Knowledge Tracing Family
[Diagram: four graphical models over skill nodes k, feature nodes f, and observation nodes y — no features; features on the emission (guess/slip); features on the transition (learning); features on both]
• Emission (guess/slip): item difficulty (Gowda et al. '11; Pardos et al. '11), student ability (Pardos et al. '10), subskills (Xu et al. '12), help (Sao Pedro et al. '13)
• Transition (learning): student ability (Lee et al. '12; Yudelson et al. '13), item difficulty (Schultz et al. '13)
• Both (guess/slip and learning): help (Becker et al. '08)

• Each model is successful for an ad hoc purpose only
  – Hard to compare models
  – Doesn't help to build a theory of cognition
• Learning scientists have to worry about both features and modeling
• These models are not scalable:
  – They rely on Bayes nets' conditional probability tables
  – Memory grows exponentially with the number of features
  – Runtime grows exponentially with the number of features (with exact inference)

Example: emission probabilities with no features:

  Mastery    p(Correct)
  False      0.10 (guess)
  True       0.85 (1 - slip)

2^(0+1) = 2 parameters!

Example: emission probabilities with 1 binary feature:

  Mastery  Hint   p(Correct)
  False    False  0.06
  True     False  0.75
  False    True   0.25
  True     True   0.99

2^(1+1) = 4 parameters!

Example: emission probabilities with 10 binary features:

  Mastery  F1     …  F10    p(Correct)
  False    False  …  False  0.06
  …        …      …  …      …
  True     True   …  True   0.90

2^(10+1) = 2,048 parameters!
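To make the blow-up concrete, here is a minimal Python sketch (illustrative, not from the paper) of how the emission conditional probability table grows with k binary features: one p(Correct) parameter per combination of mastery and feature values.

```python
from itertools import product

def emission_cpt(n_features):
    """Emission CPT for Knowledge Tracing with n binary features: one
    p(Correct) parameter per (mastery, f1, ..., fn) combination, i.e.
    2 ** (n_features + 1) parameters to estimate from data."""
    rows = [(mastery,) + feats
            for mastery in (False, True)
            for feats in product((False, True), repeat=n_features)]
    return {row: None for row in rows}  # values to be estimated

assert len(emission_cpt(0)) == 2   # guess and slip only
assert len(emission_cpt(1)) == 4   # the Hint example above

for k in (0, 1, 10, 25):
    print(f"{k:>2} features -> {2 ** (k + 1):>12,} CPT parameters")
# 0 ->  2;  1 ->  4;  10 ->  2,048;  25 ->  67,108,864
```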
Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
• Execution time
• Conclusion

Something old…
• FAST uses the most general model in the Knowledge Tracing Family
• It parameterizes both the learning (transition) and the emission (guess + slip) probabilities

Something new…
• Instead of using inefficient conditional probability tables, we use logistic regression [Berg-Kirkpatrick et al. '10]
• Exponential complexity → linear complexity

Example:

  # of features  # of parameters in KTF  # of parameters in FAST
  0              2                       2
  1              4                       3
  10             2,048                   12
  25             67,108,864              27

• 25 features are not that many, and yet they can become intractable for the Knowledge Tracing Family

Something blue?
• Prediction requires few changes to implement
• Training requires quite a few changes
  – We use a recent modification of the Expectation-Maximization algorithm proposed for computational linguistics problems [Berg-Kirkpatrick et al. '10]

(A parenthesis)
"Each equation I include in the book would halve the sales"
• Jose's corollary: each equation in a presentation would send half the audience to sleep
• The equations are in the paper!

KT uses Expectation-Maximization
• E-step: forward-backward algorithm over the latent mastery states, with emissions read from a conditional probability table lookup
• M-step: maximum likelihood re-estimation of the table entries

FAST uses a recent E-M algorithm [Berg-Kirkpatrick et al. '10]
• E-step: the "conditional probability table" lookup over latent mastery is kept, but its entries come from logistic regression weights

Slip/guess lookup:

  Mastery    p(Correct)
  False      (filled from logistic regression)
  True       (filled from logistic regression)

• Use the multiple parameters of the logistic regression to fill in the values of a "no-features" conditional probability table! [Berg-Kirkpatrick et al. '10]

[Diagram: the slip/guess logistic regression. Each observation 1…n contributes two weighted training instances, weighted by its probability of mastering and of not mastering (from the E-step). Features 1…k come in three groups: always active, active when mastered, and active when not mastered.]
• When FAST uses only intercept terms as features for the two levels of mastery, it is equivalent to Knowledge Tracing!

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Examples
  – Multiple subskills
  – Temporal IRT
  – Expert knowledge
• Conclusion

Tutoring System
• Collected from QuizJET, a tutor for learning Java programming
• Students give the value of a variable or the output of Java code
• Each question is generated from a template, and students can make multiple attempts

Data
• Smaller dataset:
  – ~21,000 observations
  – First attempts only: ~7,000 observations
  – 110 students
• Unbalanced: 70% correct
• 95 question templates
• "Hierarchical" cognitive model: 19 skills, 99 subskills

Evaluation
• Predict future performance given history
  – Will the student answer correctly at t = 0?
  – At t = 1, given performance at t = 0?
  – At t = 2, given performance at t = 0 and 1? …
• Area Under the Curve (AUC) metric
  – 1: perfect classifier
  – 0.5: random classifier
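Since the equations live in the paper, here is a minimal sketch of the prediction side of a FAST-style model, under simplifying assumptions of our own: features enter only the emission, each mastery state gets its own weight vector, and training (the modified EM of Berg-Kirkpatrick et al. '10) is omitted. All names are illustrative; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fast_forward_predict(X, y, p0, learn, W, b):
    """Forward filter for a FAST-style HMM whose emission is a logistic
    regression gated by the latent mastery state.

    X: (T, d) feature rows for one student/skill; y: length-T 0/1 outcomes.
    W: (2, d) weights, b: (2,) intercepts; row 0 = not mastered, row 1 =
    mastered. Returns predicted P(correct) before each observation."""
    pL, preds = p0, []
    for x, obs in zip(X, y):
        p_correct = sigmoid(W @ x + b)                  # per-state emissions
        preds.append(pL * p_correct[1] + (1 - pL) * p_correct[0])
        like = np.where(obs, p_correct, 1 - p_correct)  # obs. likelihoods
        post = pL * like[1] / (pL * like[1] + (1 - pL) * like[0])
        pL = post + (1 - post) * learn                  # learning transition
    return np.array(preds)

# With W = 0, only intercepts remain: sigmoid(b[0]) ≈ 0.10 is the guess and
# 1 - sigmoid(b[1]) ≈ 0.15 is the slip, i.e. plain Knowledge Tracing.
X = np.zeros((4, 1)); y = [0, 1, 1, 1]
W = np.zeros((2, 1)); b = np.array([-2.2, 1.7])
print(fast_forward_predict(X, y, p0=0.4, learn=0.2, W=W, b=b))
```

These predicted probabilities are exactly what the AUC protocol above scores against the observed outcomes.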
Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
  – Expert knowledge
• Execution time
• Conclusion

Multiple subskills
• Experts annotated items (questions) with a single skill and multiple subskills

Multiple subskills & Knowledge Tracing
• Original Knowledge Tracing cannot model multiple subskills
• Most Knowledge Tracing variants assume equal importance of subskills during training (and then adjust it during testing)
• The state-of-the-art method, LR-DBN [Xu and Mostow '11], assigns importance in both training and testing

FAST can handle multiple subskills
• Parameterize learning
• Parameterize slip and guess
• Features: binary variables that indicate the presence of subskills

FAST vs Knowledge Tracing: slip parameters of subskills within a skill
[Figure: estimated slip parameters per subskill. Conventional Knowledge Tracing assumes all subskills have the same difficulty (the flat red line); FAST identifies different difficulties across subskills.]
• Does it matter?

State of the art (Xu & Mostow '11)

  Model          AUC
  LR-DBN         .71
  KT - Weakest   .69
  KT - Multiply  .62

• The 95% confidence intervals are within ±.01 points

Benchmark

  Model            AUC
  LR-DBN           .71
  Single-skill KT  .71
  KT - Weakest     .69
  KT - Multiply    .62

• The 95% confidence intervals are within ±.01 points
• We test on non-overlapping students; LR-DBN was designed and tested on overlapping students and was not compared to single-skill KT

Benchmark

  Model            AUC
  FAST             .74
  LR-DBN           .71
  Single-skill KT  .71
  KT - Weakest     .69
  KT - Multiply    .62

• The 95% confidence intervals are within ±.01 points

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
• Execution time
• Conclusion

Two paradigms (50 years of research in 1 slide)
• Knowledge Tracing
  – Allows learning
  – Every item = same difficulty
  – Every student = same ability
• Item Response Theory
  – NO learning
  – Models item difficulties
  – Models student abilities
• Can FAST help merge the paradigms?

Item Response Theory
• In its simplest form, it's the Rasch model
• The Rasch model can be formulated in many ways:
  – Typically with latent variables
  – As a logistic regression with
    • a feature per student
    • a feature per item
• We end up with a lot of features!
  – Good thing we are using FAST ;-)

Results

  Model              AUC
  Knowledge Tracing  .65
  FAST + student     .64
  FAST + item        .73
  FAST + IRT         .76  (25% improvement)

• The 95% confidence intervals are within ±.03 points

Disclaimer
• In our dataset, most students answer items in the same order
• Item estimates are biased
• Future work: define continuous IRT difficulty features
  – It's easy in FAST ;-)
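Both applications above reduce to constructing binary feature vectors for FAST's logistic regressions. Here is a minimal sketch with hypothetical encodings; the slides specify only "binary variables that indicate presence of subskills" and "a feature per student, a feature per item", so the function names and IDs below are our own.

```python
def subskill_features(item_subskills, all_subskills):
    """Multiple subskills: one binary indicator per subskill the item exercises."""
    return [1.0 if s in item_subskills else 0.0 for s in all_subskills]

def rasch_features(student, item, all_students, all_items):
    """Temporal IRT: Rasch-style one-hot student indicators (ability)
    concatenated with one-hot item indicators (difficulty)."""
    return ([1.0 if s == student else 0.0 for s in all_students] +
            [1.0 if i == item else 0.0 for i in all_items])

# E.g., 110 students + 95 item templates -> 205 binary features per
# observation: hopeless for a CPT (2**206 rows), but trivial for FAST.
x = rasch_features("s042", "q17",
                   [f"s{n:03d}" for n in range(110)],
                   [f"q{n}" for n in range(95)])
print(len(x), sum(x))  # 205 features, 2 of them active
```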
Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
• Execution time
• Conclusion

FAST is 300x faster than BNT-SM!
[Chart: execution time (min.) vs. # of observations (7,100; 11,300; 15,500; 19,800), comparing BNT-SM (no features) against FAST (no features). BNT-SM grows from 23 to 54 minutes; FAST stays between 0.08 and 0.15 minutes.]

LR-DBN vs FAST
• We use the authors' implementation of LR-DBN
• LR-DBN takes about 250 minutes; FAST takes about 44 seconds
• 15,500 data points
• This is on an old laptop: no parallelization, nothing fancy
• (Details in the paper)

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Examples
  – Multiple subskills
  – Temporal IRT
• Conclusion

Comparison of existing techniques

  Technique          allows features  slip/guess  recency/ordering  learning
  FAST               ✓                ✓           ✓                 ✓
  PFA                ✓                ✗           ✗                 ✓
  Knowledge Tracing  ✗                ✓           ✓                 ✓
  Rasch Model        ✓                ✗           ✗                 ✗

  (PFA: Pavlik et al. '09; Knowledge Tracing: Corbett & Anderson '95; Rasch Model: Rasch '60)

Conclusion
• FAST lives up to its name
• FAST provides high flexibility in using features and, as our studies show, improves significantly over Knowledge Tracing even with simple features
• The effect of features depends on how smartly they are designed and on the dataset
• I am looking forward to more clever uses of feature engineering for FAST in the community