General Features in Knowledge Tracing: Applications to Multiple Subskills, Temporal IRT & Expert Knowledge

Yun Huang, University of Pittsburgh*
José P. González-Brenes, Pearson*
Peter Brusilovsky, University of Pittsburgh
* First authors

This talk…
• What? Determine student mastery of a skill
• How? A novel algorithm called FAST
  – Enables features in Knowledge Tracing
• Why? Better and faster student modeling
  – 25% better AUC, a classification metric
  – 300 times faster than popular general-purpose student modeling techniques (BNT-SM)

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  1. Multiple subskills
  2. Temporal Item Response Theory
  3. Paper exclusive: Expert knowledge
• Execution time
• Conclusion

Motivation
• Personalize learning for students
  – For example, teach students new material as they learn, so we don't teach them material they already know
• How? Typically with Knowledge Tracing

[Figure: per-student sequences of incorrect (✗) and correct (✓) attempts, explained by a latent "masters a skill or not" state]

• Knowledge Tracing fits a two-state HMM per skill
• Binary latent variables indicate the student's knowledge of the skill
• Four parameters:
  1. Initial Knowledge
  2. Learning (transition)
  3. Guess (emission)
  4. Slip (emission)

What's wrong?
• Only uses performance data (correct or incorrect)
• We are now able to capture feature-rich data
  – MOOCs & intelligent tutoring systems can log fine-grained data
  – Used a hint, watched a video, after-hours practice…
• … and these features can carry information about, or intervene on, learning

What's a researcher gotta do?
• Modify the Knowledge Tracing algorithm
• For example, in just a small-scale literature survey, we found at least nine different flavors of Knowledge Tracing

So you want to publish in EDM?
1. Think of a feature (e.g., from a MOOC)
2. Modify Knowledge Tracing
3. Write paper
4. Publish
5. Loop!

Are all of those models sooooo different?
• No! We identify three main variants
• We call them the "Knowledge Tracing Family"

Knowledge Tracing Family
[Diagram: four graphical models over skill nodes k, feature nodes f, and observation nodes y — no features; features on the emission (guess/slip); features on the transition (learning); features on both]
• Emission (guess/slip): item difficulty (Gowda et al. '11; Pardos et al. '11), student ability (Pardos et al. '10), subskills (Xu et al. '12), help (Sao Pedro et al. '13)
• Transition (learning): student ability (Lee et al. '12; Yudelson et al. '13), item difficulty (Schultz et al. '13)
• Both (guess/slip and learning): help (Becker et al. '08)

• Each model is successful for an ad hoc purpose only
  – Hard to compare models
  – Doesn't help to build a theory of cognition
• Learning scientists have to worry about both features and modeling
• These models are not scalable:
  – They rely on Bayes nets' conditional probability tables
  – Memory grows exponentially with the number of features
  – Runtime grows exponentially with the number of features (with exact inference)

Example: emission probabilities with no features:

  Mastery    p(Correct)
  False      0.10 (guess)
  True       0.85 (1 - slip)

2^(0+1) = 2 parameters!

Example: emission probabilities with 1 binary feature:

  Mastery  Hint   p(Correct)
  False    False  0.06
  True     False  0.75
  False    True   0.25
  True     True   0.99

2^(1+1) = 4 parameters!

Example: emission probabilities with 10 binary features:

  Mastery  F1     …  F10    p(Correct)
  False    False  …  False  0.06
  …        …      …  …      …
  True     True   …  True   0.90

2^(10+1) = 2,048 parameters!
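To make the blow-up concrete, here is a minimal Python sketch (illustrative, not from the paper) of how the emission conditional probability table grows with k binary features: one p(Correct) parameter per combination of mastery and feature values.

```python
from itertools import product

def emission_cpt(n_features):
    """Emission CPT for Knowledge Tracing with n binary features: one
    p(Correct) parameter per (mastery, f1, ..., fn) combination, i.e.
    2 ** (n_features + 1) parameters to estimate from data."""
    rows = [(mastery,) + feats
            for mastery in (False, True)
            for feats in product((False, True), repeat=n_features)]
    return {row: None for row in rows}  # values to be estimated

assert len(emission_cpt(0)) == 2   # guess and slip only
assert len(emission_cpt(1)) == 4   # the Hint example above

for k in (0, 1, 10, 25):
    print(f"{k:>2} features -> {2 ** (k + 1):>12,} CPT parameters")
# 0 ->  2;  1 ->  4;  10 ->  2,048;  25 ->  67,108,864
```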
Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
• Execution time
• Conclusion

Something old…
• FAST uses the most general model in the Knowledge Tracing Family
• It parameterizes both the learning (transition) and the emission (guess + slip) probabilities

Something new…
• Instead of using inefficient conditional probability tables, we use logistic regression [Berg-Kirkpatrick et al. '10]
• Exponential complexity → linear complexity

Example:

  # of features  # of parameters in KTF  # of parameters in FAST
  0              2                       2
  1              4                       3
  10             2,048                   12
  25             67,108,864              27

• 25 features are not that many, and yet they can become intractable for the Knowledge Tracing Family

Something blue?
• Prediction requires few changes to implement
• Training requires quite a few changes
  – We use a recent modification of the Expectation-Maximization algorithm proposed for computational linguistics problems [Berg-Kirkpatrick et al. '10]

(A parenthesis)
"Each equation I include in the book would halve the sales"
• Jose's corollary: each equation in a presentation would send half the audience to sleep
• The equations are in the paper!

KT uses Expectation-Maximization
• E-step: forward-backward algorithm over the latent mastery states, with emissions read from a conditional probability table lookup
• M-step: maximum likelihood re-estimation of the table entries

FAST uses a recent E-M algorithm [Berg-Kirkpatrick et al. '10]
• E-step: the "conditional probability table" lookup over latent mastery is kept, but its entries come from logistic regression weights

Slip/guess lookup:

  Mastery    p(Correct)
  False      (filled from logistic regression)
  True       (filled from logistic regression)

• Use the multiple parameters of the logistic regression to fill in the values of a "no-features" conditional probability table! [Berg-Kirkpatrick et al. '10]

[Diagram: the slip/guess logistic regression. Each observation 1…n contributes two weighted training instances, weighted by its probability of mastering and of not mastering (from the E-step). Features 1…k come in three groups: always active, active when mastered, and active when not mastered.]
• When FAST uses only intercept terms as features for the two levels of mastery, it is equivalent to Knowledge Tracing!

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Examples
  – Multiple subskills
  – Temporal IRT
  – Expert knowledge
• Conclusion

Tutoring System
• Collected from QuizJET, a tutor for learning Java programming
• Students give the value of a variable or the output of Java code
• Each question is generated from a template, and students can make multiple attempts

Data
• Smaller dataset:
  – ~21,000 observations
  – First attempts only: ~7,000 observations
  – 110 students
• Unbalanced: 70% correct
• 95 question templates
• "Hierarchical" cognitive model: 19 skills, 99 subskills

Evaluation
• Predict future performance given history
  – Will the student answer correctly at t = 0?
  – At t = 1, given performance at t = 0?
  – At t = 2, given performance at t = 0 and 1? …
• Area Under the Curve (AUC) metric
  – 1: perfect classifier
  – 0.5: random classifier
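Since the equations live in the paper, here is a minimal sketch of the prediction side of a FAST-style model, under simplifying assumptions of our own: features enter only the emission, each mastery state gets its own weight vector, and training (the modified EM of Berg-Kirkpatrick et al. '10) is omitted. All names are illustrative; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fast_forward_predict(X, y, p0, learn, W, b):
    """Forward filter for a FAST-style HMM whose emission is a logistic
    regression gated by the latent mastery state.

    X: (T, d) feature rows for one student/skill; y: length-T 0/1 outcomes.
    W: (2, d) weights, b: (2,) intercepts; row 0 = not mastered, row 1 =
    mastered. Returns predicted P(correct) before each observation."""
    pL, preds = p0, []
    for x, obs in zip(X, y):
        p_correct = sigmoid(W @ x + b)                  # per-state emissions
        preds.append(pL * p_correct[1] + (1 - pL) * p_correct[0])
        like = np.where(obs, p_correct, 1 - p_correct)  # obs. likelihoods
        post = pL * like[1] / (pL * like[1] + (1 - pL) * like[0])
        pL = post + (1 - post) * learn                  # learning transition
    return np.array(preds)

# With W = 0, only intercepts remain: sigmoid(b[0]) ≈ 0.10 is the guess and
# 1 - sigmoid(b[1]) ≈ 0.15 is the slip, i.e. plain Knowledge Tracing.
X = np.zeros((4, 1)); y = [0, 1, 1, 1]
W = np.zeros((2, 1)); b = np.array([-2.2, 1.7])
print(fast_forward_predict(X, y, p0=0.4, learn=0.2, W=W, b=b))
```

These predicted probabilities are exactly what the AUC protocol above scores against the observed outcomes.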
Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
  – Expert knowledge
• Execution time
• Conclusion

Multiple subskills
• Experts annotated items (questions) with a single skill and multiple subskills

Multiple subskills & Knowledge Tracing
• Original Knowledge Tracing cannot model multiple subskills
• Most Knowledge Tracing variants assume equal importance of subskills during training (and then adjust it during testing)
• The state-of-the-art method, LR-DBN [Xu and Mostow '11], assigns importance in both training and testing

FAST can handle multiple subskills
• Parameterize learning
• Parameterize slip and guess
• Features: binary variables that indicate the presence of subskills

FAST vs Knowledge Tracing: slip parameters of subskills within a skill
[Figure: estimated slip parameters per subskill. Conventional Knowledge Tracing assumes all subskills have the same difficulty (the flat red line); FAST identifies different difficulties across subskills.]
• Does it matter?

State of the art (Xu & Mostow '11)

  Model          AUC
  LR-DBN         .71
  KT - Weakest   .69
  KT - Multiply  .62

• The 95% confidence intervals are within ±.01 points

Benchmark

  Model            AUC
  LR-DBN           .71
  Single-skill KT  .71
  KT - Weakest     .69
  KT - Multiply    .62

• The 95% confidence intervals are within ±.01 points
• We test on non-overlapping students; LR-DBN was designed and tested on overlapping students and was not compared to single-skill KT

Benchmark

  Model            AUC
  FAST             .74
  LR-DBN           .71
  Single-skill KT  .71
  KT - Weakest     .69
  KT - Multiply    .62

• The 95% confidence intervals are within ±.01 points

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
• Execution time
• Conclusion

Two paradigms (50 years of research in 1 slide)
• Knowledge Tracing
  – Allows learning
  – Every item = same difficulty
  – Every student = same ability
• Item Response Theory
  – NO learning
  – Models item difficulties
  – Models student abilities
• Can FAST help merge the paradigms?

Item Response Theory
• In its simplest form, it's the Rasch model
• The Rasch model can be formulated in many ways:
  – Typically with latent variables
  – As a logistic regression with
    • a feature per student
    • a feature per item
• We end up with a lot of features!
  – Good thing we are using FAST ;-)

Results

  Model              AUC
  Knowledge Tracing  .65
  FAST + student     .64
  FAST + item        .73
  FAST + IRT         .76  (25% improvement)

• The 95% confidence intervals are within ±.03 points

Disclaimer
• In our dataset, most students answer items in the same order
• Item estimates are biased
• Future work: define continuous IRT difficulty features
  – It's easy in FAST ;-)
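Both applications above reduce to constructing binary feature vectors for FAST's logistic regressions. Here is a minimal sketch with hypothetical encodings; the slides specify only "binary variables that indicate presence of subskills" and "a feature per student, a feature per item", so the function names and IDs below are our own.

```python
def subskill_features(item_subskills, all_subskills):
    """Multiple subskills: one binary indicator per subskill the item exercises."""
    return [1.0 if s in item_subskills else 0.0 for s in all_subskills]

def rasch_features(student, item, all_students, all_items):
    """Temporal IRT: Rasch-style one-hot student indicators (ability)
    concatenated with one-hot item indicators (difficulty)."""
    return ([1.0 if s == student else 0.0 for s in all_students] +
            [1.0 if i == item else 0.0 for i in all_items])

# E.g., 110 students + 95 item templates -> 205 binary features per
# observation: hopeless for a CPT (2**206 rows), but trivial for FAST.
x = rasch_features("s042", "q17",
                   [f"s{n:03d}" for n in range(110)],
                   [f"q{n}" for n in range(95)])
print(len(x), sum(x))  # 205 features, 2 of them active
```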
Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
  – Multiple subskills
  – Temporal IRT
• Execution time
• Conclusion

FAST is 300x faster than BNT-SM!
[Chart: execution time (min.) vs. # of observations (7,100; 11,300; 15,500; 19,800), comparing BNT-SM (no features) against FAST (no features). BNT-SM grows from 23 to 54 minutes; FAST stays between 0.08 and 0.15 minutes.]

LR-DBN vs FAST
• We use the authors' implementation of LR-DBN
• LR-DBN takes about 250 minutes; FAST takes about 44 seconds
• 15,500 data points
• This is on an old laptop: no parallelization, nothing fancy
• (Details in the paper)

Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Examples
  – Multiple subskills
  – Temporal IRT
• Conclusion

Comparison of existing techniques

  Technique          allows features  slip/guess  recency/ordering  learning
  FAST               ✓                ✓           ✓                 ✓
  PFA                ✓                ✗           ✗                 ✓
  Knowledge Tracing  ✗                ✓           ✓                 ✓
  Rasch Model        ✓                ✗           ✗                 ✗

  (PFA: Pavlik et al. '09; Knowledge Tracing: Corbett & Anderson '95; Rasch Model: Rasch '60)

Conclusion
• FAST lives up to its name
• FAST provides high flexibility in using features and, as our studies show, improves significantly over Knowledge Tracing even with simple features
• The effect of features depends on how smartly they are designed and on the dataset
• I am looking forward to more clever uses of feature engineering for FAST in the community