Tuned Models of Peer Assessment in MOOCs
Chris Piech, Jonathan Huang (Stanford)
Zhenghao Chen, Chuong Do, Andrew Ng, Daphne Koller (Coursera)
A variety of assessments
How can we efficiently grade 10,000 students?
The Assessment Spectrum
Short response (multiple choice, coding assignments): easy to automate, but limited ability to ask expressive questions or require creativity.
Long response (proofs, essay questions): hard to grade automatically, but can assign complex assignments and provide complex feedback.
Stanford/Coursera’s HCI course
Video lectures + embedded questions, weekly quizzes, open-ended assignments
[Images of student work]
Slide credit: Chinmay Kulkarni
Calibrated peer assessment [Russell, ’05; Kulkarni et al., ’13]
1) Calibration (staff-graded ✓)
2) Assess 5 peers
3) Self-assess
Similar process also used in Mathematical Thinking, Programming Python, Listening to World Music, Fantasy and Science Fiction, Sociology, Social Network Analysis, ...
Slide credit: Chinmay Kulkarni (http://hci.stanford.edu/research/assess/)
Image credit: Debbie Morrison (http://onlinelearninginsights.wordpress.com/)
Largest peer grading network to date

                  HCI #1    HCI #2
  # Students       3,607     3,633
  # Assignments        5         5
  # Submissions     6,702     7,270
  # Peer grades    31,067    32,132

77 "ground truth" submissions (HCI 1, Homework #5) graded by everyone (staff included)
How well did peer grading do?
[Histogram of grading error on the ground-truth set: error in percentage points (x-axis) vs. percent of experiments (y-axis), with bands marking errors within 5pp and within 10pp]
Up to 20% of students (~1,400 students) get a grade more than 10% from ground truth: the black region outside 10pp shows much room for improvement!
Peer Grading Desiderata (and our work)
• Highly reliable/accurate assessment
  – Our work: a statistical model for estimating and correcting for grader reliability/bias
• Reduced workload for both students and course staff
  – Our work: a simple method for reducing grader workload
• Scalability (to, say, tens of thousands of students)
  – Our work: a scalable estimation algorithm that easily handles MOOC-sized courses
How to decide if a grader is good
[Diagram: graders connected to the submissions they graded, with example peer grades of 100%, 30%, 100%, 50%, 55%, 56%, 54%. Who should we trust?]
Idea: look at the peer grades jointly! We need to reason with all submissions and with the other submissions graded by these graders.
Statistical Formalization
[Graphical model relating students/graders, submissions, and observed scores]
• Bias: a grader's average grade inflation/deflation
• Reliability: a grader's grading variance
• True score: the underlying quality of a submission
• Observed scores: the peer grades actually given
Model PG1: Modeling grader bias and reliability
[Graphical model]
• True score of student u
• Grader reliability of student v
• Grader bias of student v
• Student v's assessment of student u (observed)
Related models in the literature:
• Crowdsourcing [Whitehill et al. ('09), Bachrach et al. ('12), Kamar et al. ('12)]
• Anthropology [Batchelder & Romney ('88)]
• Peer assessment [Goldin & Ashley ('11), Goldin ('12)]
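One way to write PG1 down that is consistent with these labels (a sketch: the Gaussian/Gamma priors and the hyperparameters \mu_0, \gamma_0, \eta_0, \alpha_0, \beta_0 are illustrative assumptions, not read off the slide):

  s_u \sim \mathcal{N}(\mu_0, 1/\gamma_0)                               (true score of student u's submission)
  b_v \sim \mathcal{N}(0, 1/\eta_0)                                     (bias of grader v)
  \tau_v \sim \mathrm{Gamma}(\alpha_0, \beta_0)                         (reliability of grader v)
  z_u^v \mid s_u, b_v, \tau_v \sim \mathcal{N}(s_u + b_v, 1/\tau_v)     (grader v's observed assessment of u)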
Correlating bias variables across assignments
[Scatter plot: bias on Assignment 4 (x-axis) vs. bias on Assignment 5 (y-axis), i.e., biases estimated from assignment T plotted against biases at assignment T+1]
Model PG2: Temporal coherence
[Graphical model]
• True score of student u
• Grader reliability of student v
• Grader bias of student v: the bias at homework T depends on the bias at T-1
• Student v's assessment of student u (observed)
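One natural way to encode this temporal coupling (again a sketch; the exact dependence PG2 uses is not spelled out on the slide) is a random-walk prior on the per-assignment bias: b_v^{(T)} \sim \mathcal{N}(b_v^{(T-1)}, 1/\eta_0).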
Model PG3: Coupled grader score and reliability
[Graphical model]
• True score of student u
• Grader reliability of student v: your reliability as a grader depends on your ability!
• Grader bias of student v
• Student v's assessment of student u (observed)
Approximate inference: Gibbs sampling (EM and variational methods also implemented for a subset of the models). Note that PG3 cannot be Gibbs sampled in "closed form".
Running time: ~5 minutes for HCI 1
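The slide names Gibbs sampling as the inference method but does not give the update equations. Below is a minimal sketch of what a Gibbs sampler could look like for the simpler PG1-style model above, where all conditionals are conjugate; the function name gibbs_pg1 and all hyperparameter defaults are illustrative assumptions, not the authors' code.

import numpy as np
from collections import defaultdict

def gibbs_pg1(grades, n_iter=2000, burn_in=500,
              mu0=70.0, g0=1/25.0, e0=1/16.0, a0=2.0, b0=20.0, seed=0):
    """grades: dict mapping (submission u, grader v) -> observed score."""
    rng = np.random.default_rng(seed)
    subs = sorted({u for u, _ in grades})
    graders = sorted({v for _, v in grades})
    by_sub = defaultdict(list)      # u -> [(v, z)]
    by_grader = defaultdict(list)   # v -> [(u, z)]
    for (u, v), z in grades.items():
        by_sub[u].append((v, z))
        by_grader[v].append((u, z))

    s = {u: mu0 for u in subs}            # true scores
    b = {v: 0.0 for v in graders}         # grader biases
    tau = {v: a0 / b0 for v in graders}   # grader reliabilities (precisions)
    s_sum = {u: 0.0 for u in subs}

    for it in range(n_iter):
        for u in subs:       # resample true scores given biases/reliabilities
            prec = g0 + sum(tau[v] for v, _ in by_sub[u])
            mean = (g0 * mu0 + sum(tau[v] * (z - b[v]) for v, z in by_sub[u])) / prec
            s[u] = rng.normal(mean, 1.0 / np.sqrt(prec))
        for v in graders:    # resample grader biases given scores
            prec = e0 + tau[v] * len(by_grader[v])
            mean = tau[v] * sum(z - s[u] for u, z in by_grader[v]) / prec
            b[v] = rng.normal(mean, 1.0 / np.sqrt(prec))
        for v in graders:    # resample grader reliabilities given scores/biases
            resid = sum((z - s[u] - b[v]) ** 2 for u, z in by_grader[v])
            tau[v] = rng.gamma(a0 + len(by_grader[v]) / 2.0,
                               1.0 / (b0 + resid / 2.0))
        if it >= burn_in:
            for u in subs:
                s_sum[u] += s[u]
    return {u: s_sum[u] / (n_iter - burn_in) for u in subs}

Given a grade dictionary such as {("sub1", "alice"): 85, ("sub1", "bob"): 70, ...}, gibbs_pg1 returns a posterior-mean score estimate per submission.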
Incentives
Scoring rules can impact student behavior.
• Model PG3 gives high-scoring graders more "sway" in computing a submission's final score: this improves prediction accuracy.
• Model PG3 gives higher homework scores to students who are accurate graders: this encourages students to grade better.
See [Dasgupta & Ghosh, '13] for a theoretical look at this problem.
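To make the "sway" idea concrete, here is a toy illustration (numbers and variable names are not from the talk): after bias correction, a submission's estimated score behaves like a precision-weighted average of its peer grades, so grades from more reliable graders pull the estimate harder.

import numpy as np
grades      = np.array([70.0, 92.0, 88.0])   # bias-corrected peer grades (toy numbers)
reliability = np.array([0.2, 1.5, 1.2])      # estimated grader precisions tau_v (toy numbers)
estimate = np.sum(reliability * grades) / np.sum(reliability)
print(round(estimate, 1))                    # ~88.8: the two reliable graders dominate the unreliable 70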
Prediction Accuracy
• 33% reduction in RMSE
• Only 3% of submissions land farther than
10% from ground truth
[Error histograms: baseline (median) prediction accuracy vs. Model PG3 prediction accuracy]
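As a sketch of how this comparison could be computed (illustrative only; variable names and toy numbers are not the authors'): predictions are scored against the staff "ground truth" grades with RMSE, and the baseline simply takes the median of each submission's raw peer grades.

import numpy as np

def rmse(pred, truth):
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

peer_grades = {"sub1": [78, 82, 90], "sub2": [55, 70, 60]}   # toy data
ground_truth = {"sub1": 85, "sub2": 62}
baseline = {u: np.median(g) for u, g in peer_grades.items()}
print(rmse([baseline[u] for u in ground_truth], list(ground_truth.values())))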
Prediction Accuracy, All models
[Charts comparing all models on HCI 1 and HCI 2]
• An improved rubric made baseline grading in HCI2 more accurate than in HCI1.
• Despite the improved rubric in HCI2, the simplest model (PG1 with just bias) outperforms baseline grading on all metrics.
• Just modeling bias (constant reliability) captures ~95% of the improvement in RMSE.
• PG3 typically outperforms the other models.
Meaningful Confidence Estimates
When our model is 90% confident that its prediction is within K% of the true grade, then over 90% of the time in experiments we are indeed within K% (i.e., our model is conservative).
[Calibration plot: model confidence (x-axis) vs. actual pass rate (y-axis), with the expected pass rate shown for reference; one highlighted bucket covers experiments where confidence fell between .90 and .95]
We can use confidence estimates to tell when a submission needs to be seen by more graders!
How many graders do you need?
[Chart: fraction of submissions by the number of grading rounds needed before the model is confident]
• After 2 rounds: 15%
• After 3 rounds: 10%
• After 4 rounds: 13%
• After 5 rounds: 16%
• Need more than 5 rounds: 46%
Some submissions need more graders, while some grader assignments can be reallocated (a sketch of such an allocation rule follows below).
Note: this is quite an overconservative estimate (as in the last slide).
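A minimal sketch of how confidence could drive reallocation, assuming we have posterior samples of each submission's true score (e.g., from the Gibbs sampler sketched earlier); the function names and the 10-point threshold are illustrative assumptions.

import numpy as np

def interval_width(score_samples, confidence=0.90):
    # Width of the central credible interval on a submission's predicted score,
    # computed from posterior samples of its true score.
    lo, hi = np.quantile(score_samples, [(1 - confidence) / 2,
                                         (1 + confidence) / 2])
    return float(hi - lo)

def next_round_assignments(posterior_samples, budget, target_width=10.0):
    # posterior_samples: dict submission -> array of sampled true scores.
    # Submissions whose credible interval is still wider than target_width
    # (in percentage points) need more graders; give the `budget` available
    # reviewing slots to the least certain ones first.
    widths = {u: interval_width(s) for u, s in posterior_samples.items()}
    uncertain = [u for u, w in sorted(widths.items(), key=lambda kv: -kv[1])
                 if w > target_width]
    return uncertain[:budget]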
Understanding graders in the
context of the MOOC
Question: What factors influence how well a
student will grade?
[Plot: mean and standard deviation of the grading residual (z-score) vs. gradee grade (z-score), marking the "harder" and "easiest" submissions to grade]
[Plot: mean and standard deviation of the grading residual (z-score) vs. grader grade (z-score)]
Better-scoring graders grade better.
Residual given grader and gradee scores
[Plot: residual (in standard deviations from the mean) as a function of grader grade and gradee grade (z-scores)]
• Grade inflation: the worst students tend to inflate the best submissions.
• Grade deflation: the best students tend to downgrade the worst submissions.
How much time should you spend on grading?
[Plot: standard deviation of the residual (z-scores) vs. time spent grading (z-score)]
"Sweet spot" of grading: ~20 minutes.
What your peers say about you!
[Examples of peer feedback on the best submissions and on the worst submissions]
Commenting styles in HCI
[Plot: sentiment polarity and feedback length (words) vs. residual (z-score)]
• Students have more to say about weaknesses than about strong points.
• On average, comments vary from neutral to positive, with few highly negative comments.
Student engagement and peer grading
Task: predict whether a student will complete the last homework.
[ROC curves (true positive rate vs. false positive rate) for predictors using all features vs. just the grade; AUC = 0.97605]
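The slide does not say which classifier or exact features were used. The sketch below shows one plausible setup for this prediction task, with synthetic data and assumed feature names, using logistic regression and ROC AUC.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(70, 15, n),   # current grade (assumed feature)
    rng.normal(0, 1, n),     # estimated grader bias (assumed feature)
    rng.gamma(2, 1, n),      # estimated grader reliability (assumed feature)
    rng.normal(20, 8, n),    # minutes spent grading (assumed feature)
])
# Synthetic labels just to make the sketch runnable.
y = (X[:, 0] + 5 * X[:, 2] + rng.normal(0, 10, n) > 75).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))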
Takeaways
• Peer grading is an easy and practical way to grade open-ended assignments at scale.
• Reasoning jointly over all submissions and accounting for bias/reliability can significantly improve current peer grading in MOOCs.
• Real-world deployment: our system was used in HCI 3!
• Grading performance can tell us about other learning factors such as student engagement or performance.
Gradient descent for linear regression
~40,000 submissions