Tuned Models of Peer Assessment in MOOCs
Chris Piech, Jonathan Huang (Stanford); Zhenghao Chen, Chuong Do, Andrew Ng, Daphne Koller (Coursera)

A variety of assessments
How can we efficiently grade 10,000 students?

The Assessment Spectrum
Short response (multiple choice, coding assignments): easy to automate, but limited ability to ask expressive questions or require creativity.
Long response (proofs, essay questions): hard to grade automatically, but we can assign complex assignments and provide complex feedback.

Stanford/Coursera's HCI course
Video lectures + embedded questions, weekly quizzes, open-ended assignments (student work).
Slide credit: Chinmay Kulkarni

Calibrated peer assessment [Russell '05; Kulkarni et al. '13]
1) Calibration (against staff-graded submissions)
2) Assess 5 peers
3) Self-assess
A similar process is also used in Mathematical Thinking, Programming Python, Listening to World Music, Fantasy and Science Fiction, Sociology, Social Network Analysis, ...
Slide credit: Chinmay Kulkarni (http://hci.stanford.edu/research/assess/)
Image credit: Debbie Morrison (http://onlinelearninginsights.wordpress.com/)

Largest peer grading network to date
                 HCI #1    HCI #2
# Students        3,607     3,633
# Assignments         5         5
# Submissions     6,702     7,270
# Peer grades    31,067    32,132
Plus 77 "ground truth" submissions (HCI 1, Homework #5) graded by everyone, staff included.

How well did peer grading do?
[Figure: histogram of the error between the median peer grade and the ground-truth grade, with bands marking errors within 5 and within 10 percentage points]
Up to 20% of students (~1,400 students) get a grade more than 10% from ground truth. Much room for improvement!

Peer Grading Desiderata
Our work:
• Highly reliable/accurate assessment
  – A statistical model for estimating and correcting for grader reliability and bias
• Reduced workload for both students and course staff
  – A simple method for reducing grader workload
• Scalability (to, say, tens of thousands of students)
  – A scalable estimation algorithm that easily handles MOOC-sized courses

How to decide if a grader is good
Who should we trust when graders disagree (e.g., one grader gives 100% and another gives 30%)?
Idea: look at the peer grades jointly! We need to reason over all submissions, and over the other submissions graded by these graders.
[Figure: bipartite graph of graders and submissions with conflicting grades such as 100%, 30%, 50%, 55%, 56%, 54%]

Statistical Formalization
Every student is both the author of a submission and a grader of other submissions.
• Bias: a grader's average grade inflation or deflation
• Reliability: how consistent a grader is (related to the inverse of their grading variance)
Each observed score is a noisy combination of the submission's true score and the grader's bias and reliability.

Model PG1: modeling grader bias and reliability
• True score of student u
• Grader bias of student v
• Grader reliability of student v
• Student v's assessment of student u (observed)
Related models: crowdsourcing [Whitehill et al. '09; Bachrach et al. '12; Kamar et al. '12], anthropology [Batchelder & Romney '88], peer assessment [Goldin & Ashley '11; Goldin '12].

Correlating bias variables across assignments
[Figure: scatter plot of grader biases estimated on Assignment 4 vs. Assignment 5; biases estimated at assignment T correlate with biases at assignment T+1]

Model PG2: temporal coherence
Same variables as PG1, except that a grader's bias at homework T depends on their bias at homework T-1.

Model PG3: coupled grader score and reliability
Your reliability as a grader depends on your ability: the model couples student v's grading reliability to student v's own true score.

Approximate Inference
Gibbs sampling (we also implemented EM and variational methods for a subset of the models). PG3 cannot be Gibbs sampled in closed form.
Running time: ~5 minutes for HCI 1.
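To make the PG1 setup above concrete, here is a minimal sketch (not the authors' code) of a Gibbs sampler for a PG1-style model: the observed grade is Gaussian around the submission's true score plus the grader's bias, with precision given by the grader's reliability. The priors, hyperparameter values, and toy data below are illustrative assumptions, not the settings used in the talk.

```python
# Minimal Gibbs sampler for a PG1-style peer grading model (sketch, assumed priors).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical peer-grade observations: (grader v, submission u, observed score z).
grades = [(0, 1, 82.0), (0, 2, 55.0), (1, 0, 90.0), (1, 2, 60.0),
          (2, 0, 95.0), (2, 3, 45.0), (3, 0, 88.0), (3, 1, 78.0)]
num_students = 4  # each student is both a grader and a submission author here

# Assumed priors: s_u ~ N(mu0, 1/gamma0), b_v ~ N(0, 1/eta0), tau_v ~ Gamma(a0, rate=b0)
mu0, gamma0 = 75.0, 1.0 / 15.0**2
eta0 = 1.0 / 5.0**2
a0, b0 = 2.0, 2.0 * 5.0**2       # prior mean precision ~ 1/25, i.e. grading std ~ 5 points

s = np.full(num_students, mu0)        # true scores
b = np.zeros(num_students)            # grader biases
tau = np.full(num_students, a0 / b0)  # grader reliabilities (precisions)

graded_by = [[(v, z) for v, uu, z in grades if uu == u] for u in range(num_students)]
graded = [[(u, z) for v, u, z in grades if v == vv] for vv in range(num_students)]

samples = []
for it in range(2000):
    # 1) Resample each true score from its Gaussian full conditional.
    for u in range(num_students):
        prec = gamma0 + sum(tau[v] for v, _ in graded_by[u])
        mean = (gamma0 * mu0 + sum(tau[v] * (z - b[v]) for v, z in graded_by[u])) / prec
        s[u] = rng.normal(mean, 1.0 / np.sqrt(prec))
    # 2) Resample each grader bias from its Gaussian full conditional.
    for v in range(num_students):
        prec = eta0 + tau[v] * len(graded[v])
        mean = tau[v] * sum(z - s[u] for u, z in graded[v]) / prec
        b[v] = rng.normal(mean, 1.0 / np.sqrt(prec))
    # 3) Resample each grader reliability from its Gamma full conditional.
    for v in range(num_students):
        resid_sq = sum((z - s[u] - b[v]) ** 2 for u, z in graded[v])
        tau[v] = rng.gamma(a0 + 0.5 * len(graded[v]), 1.0 / (b0 + 0.5 * resid_sq))
    if it >= 500:  # discard burn-in, keep posterior samples of the true scores
        samples.append(s.copy())

print("posterior mean scores:", np.mean(samples, axis=0))
```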
Incentives
Scoring rules can impact student behavior.
Model PG3 gives high-scoring graders more "sway" in computing a submission's final score, and gives higher homework scores to students who are accurate graders.
• Improves prediction accuracy
• Encourages students to grade better
See [Dasgupta & Ghosh '13] for a theoretical look at this problem.

Prediction Accuracy
• 33% reduction in RMSE
• Only 3% of submissions land farther than 10% from ground truth
[Figure: error histograms for baseline (median) prediction vs. Model PG3 prediction]

Prediction Accuracy, all models
• An improved rubric made baseline grading in HCI 2 more accurate than in HCI 1.
• Despite the improved rubric in HCI 2, the simplest model (PG1 with just bias) outperforms baseline grading on all metrics.
• Just modeling bias (with constant reliability) captures ~95% of the improvement in RMSE.
• PG3 is typically more accurate than the other models.

Meaningful Confidence Estimates
When our model is 90% confident that its prediction is within K% of the true grade, then over 90% of the time in experiment we are indeed within K% (i.e., our model is conservative).
[Figure: actual pass rate vs. expected pass rate for experiments where model confidence fell between 0.90 and 0.95]
We can use confidence estimates to tell when a submission needs to be seen by more graders.

How many graders do you need?
Share of submissions for which the model is confident after each round of peer grading:
• After 2 rounds: 15%
• After 3 rounds: 10%
• After 4 rounds: 13%
• After 5 rounds: 16%
• Need more than 5 rounds: 46%
Some submissions need more graders, and some grader assignments can be reallocated. (Note: this is quite an overconservative estimate, as in the last slide.)
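One way to act on these confidence estimates is sketched below: keep requesting peer grades for a submission until the posterior over its true score is tight enough. This is only an illustration of the idea on the slide; the 5-point tolerance and 90% target confidence are assumed values, not the deployed settings.

```python
# Sketch: decide whether a submission needs another round of peer grading,
# based on posterior samples of its true score (assumed thresholds).
import numpy as np

def needs_another_grader(score_samples, tolerance=5.0, target_confidence=0.90):
    """score_samples: posterior samples of one submission's true score,
    e.g. from a Gibbs sampler like the one sketched earlier."""
    point_estimate = np.mean(score_samples)
    # Estimated probability that the reported grade lies within `tolerance`
    # points of the (unknown) true score.
    confidence = np.mean(np.abs(score_samples - point_estimate) <= tolerance)
    return confidence < target_confidence

# Usage: a tight posterior stops early, a wide one asks for more graders.
rng = np.random.default_rng(1)
tight = rng.normal(78.0, 2.0, size=2000)  # e.g. after several consistent peer grades
wide = rng.normal(78.0, 9.0, size=2000)   # e.g. after two conflicting peer grades
print(needs_another_grader(tight))  # False -> grade can be released
print(needs_another_grader(wide))   # True  -> route to another peer grader
```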
Understanding graders in the context of the MOOC
Question: what factors influence how well a student will grade?
[Figure: mean and standard deviation of the grading residual (z-score) vs. the gradee's grade (z-score), with callouts marking the "harder" and "easiest" submissions to grade]
Better-scoring graders grade better.
[Figure: mean and standard deviation of the grading residual (z-score) vs. the grader's own grade (z-score)]
Residual given grader and gradee scores: the worst students tend to inflate the best submissions (grade inflation), while the best students tend to downgrade the worst submissions (grade deflation).

How much time should you spend on grading?
There is a "sweet spot" of grading at roughly 20 minutes.
[Figure: standard deviation of the grading residual (z-score) vs. time spent grading (z-score)]

What your peers say about you: commenting styles in HCI
• Students have more to say about weaknesses than about strong points.
• On average, comments vary from neutral to positive, with few highly negative comments.
[Figure: mean sentiment polarity and feedback length (words) vs. grading residual (z-score), spanning the best to the worst submissions]

Student engagement and peer grading
Task: predict whether a student will complete the last homework.
[Figure: ROC curves (true positive rate vs. false positive rate) for "all features" vs. "just grade"; AUC = 0.976]

Takeaways
• Peer grading is an easy and practical way to grade open-ended assignments at scale.
• Reasoning jointly over all submissions and accounting for bias and reliability can significantly improve current peer grading in MOOCs.
• Real-world deployment: our system was used in HCI 3!
• Grading performance can tell us about other learning factors such as student engagement or performance.

Gradient descent for linear regression: ~40,000 submissions
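Returning to the engagement-prediction slide above, here is a hypothetical sketch of how an "all features" vs. "just grade" completion classifier could be compared with an ROC/AUC metric. The feature set and synthetic data are illustrative assumptions; only the reported AUC (~0.976) comes from the talk.

```python
# Sketch of the engagement-prediction comparison (synthetic data, assumed features).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 5000
# Hypothetical per-student features: homework grade, minutes spent grading,
# number of peer reviews written, mean feedback length in words.
X = np.column_stack([
    rng.normal(75, 12, n),
    rng.gamma(2.0, 10.0, n),
    rng.poisson(5, n),
    rng.gamma(2.0, 30.0, n),
])
# Simulated "completed the last homework" labels for illustration only.
logit = -6.0 + 0.05 * X[:, 0] + 0.03 * X[:, 1] + 0.2 * X[:, 2]
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, cols in [("all features", [0, 1, 2, 3]), ("just grade", [0])]:
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```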