Boredom Across Activities,
and Across the Year,
within Reasoning Mind
William L. Miller, Ryan Baker,
Mathew Labrum, Karen Petsche,
Angela Z. Wagner
In recent years
• Increasing interest in modeling more about
students than just what they know
• Can we assess a broad range of constructs
• In a broad range of contexts
Boredom
• A particularly important construct to measure
Boredom is
• Common in real-world learning (D’Mello, 2013)
• Associated with worse learning outcomes in the
short-term (Craig et al., 2004; Rodrigo et al.,
2007)
• Associated with worse course grades and
standardized exam performance (Pekrun et al.,
2010; Pardos et al., 2013)
• Associated with lower probability of going to
college, years later (San Pedro et al., 2013)
Online learning environments
• Offer great opportunities to study boredom in
context
– Very fine-grained interaction logs that indicate
everything the student did in the system
Automated boredom detection
• Can we detect boredom in real time, while a
student is learning?
• Can we detect boredom retrospectively, from
log files?
• Would allow us to study affect at a large scale
– Figure out which content is most boring, in order
to improve it
Affect Detection: Physical Sensors?
• Lots of work shows that affect can be detected
using physical sensors
– Tone of voice (Litman & Forbes-Riley, 2005)
– EEG (Conati & McLaren, 2009)
– Posture sensor and video (D’Mello et al., 2007)
• It’s hypothesized – but not yet conclusively
demonstrated – that using physical sensors
may lead to better performance than
interaction logs alone
Sensor-free affect detection
• Easier to scale to the millions of students who
use online learning environments
• Works in settings that do not have cameras,
microphones, or other physical sensors
– Home settings
• have parents bought equipment?
• can they set it up and maintain it?
– Classroom settings
• can school maintain equipment?
• do students intentionally destroy equipment?
• parent concerns and political climate
Sensor-free boredom detection
• Has been developed for multiple learning
environments
– Problem solving tutors (Baker et al. 2012; Pardos et al.
2013)
– Dialogue tutors (D’Mello et al. 2008)
– Narrative virtual learning environments (Sabourin et al.
2011; Baker et al. 2014)
– Science simulations (Paquette et al., 2014)
• The principles of affect detection are largely the same
across environments
• But the behaviors associated with boredom differ
considerably between environments
This talk
• We discuss our work to develop sensor-free
boredom detection for Reasoning Mind Genie 2
(Khachatryan et al., 2014)
• Self-paced blended learning mathematics
curriculum for elementary school students
– Youngest population for sensor-free affect
detection so far
• Used by approximately 100,000 students a
year
Reasoning Mind Genie 2
• Combines
– Guided Study with a pedagogical agent “Genie”
– Speed Games that support development of
fluency
• Used in schools 3-5 days a week for 45-90
minutes per day
[Screenshot of the Reasoning Mind Genie 2 interface]
Reasoning Mind Genie 2
• Better affect and more on-task behavior than
most pedagogies, online or offline
(Ocumpaugh et al., 2013)
• Still a substantial amount of boredom
• Reducing boredom is a key goal
Role for affect detection
• If we can detect boredom in log files, we can
determine which content is more boring, and
improve that content
Related Work
• Evidence that specific design features
associated with boredom in Cognitive Tutors
for high school algebra (Doddannara et al.,
2013)
• Evidence that some disengaged behaviors
increase during the year (Beck, 2005)
– Important to verify that differences in affect due
to actual content/design, not time of year
Approach to Boredom Detection
• Collect “ground truth” data on student
boredom, using field observations
• Synchronize log data to field observations
• Distill meaningful data features of log data,
hypothesized to relate to boredom
• Develop automated detector using
classification algorithm
• Validate detector for new students/new
lessons/new populations
BROMP 2.0 Field Observations
(Ocumpaugh et al., 2012)
• Conducted through Android app HART (Baker et
al., 2012)
• Protocol designed to reduce disruption to student
– Some features of protocol: observe with peripheral
vision or side glances, hover over student not being
observed, 20-second “round-robin” observations of
several students, bored-looking people are boring
• Inter-rater reliability around 0.8 for behavior, 0.65
for affect
• 64 coders now certified in USA, Philippines, India
Data collection
• 408 elementary school students
• Diverse sample important for model
generalizability (Ocumpaugh et al., 2014)
• 11 different 8th grade classes
• 6 schools
– 2 urban in Texas, predominantly African-American
– 1 urban in Texas, predominantly Latino
– 1 suburban in Texas, predominantly White
– 1 suburban in Texas, mixed ethnicity/race
– 1 rural in West Virginia, predominantly White
Affect coding
• 3 expert coders observed each student using BROMP
• Coded 5 categories of affect
– Engaged Concentration
– Boredom
– Confusion
– Frustration
– ?
• 4891 observations collected in RM classrooms
Building detectors
• Observations were synchronized with the logs
of the students' interactions with RM, using the
HART app and an internet time server
• For each observation, a set of 93 meaningful
features describing the student’s behavior was
engineered
• Computed on actions occurring during or
preceding an observation (up to 20 seconds
before)
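To make the synchronization step concrete, here is a minimal sketch of gathering the actions that fall during or up to 20 seconds before each observation. The file names and column names (student_id, time) are assumptions, not the actual Reasoning Mind log schema.

```python
import pandas as pd

# Hypothetical inputs: one row per BROMP observation, one row per logged action.
obs = pd.read_csv("observations.csv", parse_dates=["time"])
logs = pd.read_csv("actions.csv", parse_dates=["time"])

def clip_for(observation, logs, window_s=20):
    """Actions by the observed student during, or up to 20 s before, an observation."""
    mask = (
        (logs["student_id"] == observation["student_id"])
        & (logs["time"] >= observation["time"] - pd.Timedelta(seconds=window_s))
        & (logs["time"] <= observation["time"])
    )
    return logs[mask]

clips = [clip_for(o, logs) for _, o in obs.iterrows()]
```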
Features: Examples
• Individual action features
– Whether an action was correct or not
– How long the action took
• Features across all past activity
– Fraction of previous attempts on the current skill the
student has gotten correct
• Other known models applied to logs
– Probability student knows skill (Bayesian Knowledge
Tracing)
– Carelessness
– Moment-by-Moment Learning Graph
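A hedged sketch of how a few of these example features might be computed from a single clip; the column names (correct, duration_s, skill) are invented for illustration.

```python
import pandas as pd

def clip_features(clip: pd.DataFrame, history: pd.DataFrame) -> dict:
    """Compute a few example features for one clip.
    `clip` holds the actions in the 20-second window; `history` holds
    all of the student's earlier actions. Column names are assumptions."""
    skill = clip["skill"].iloc[-1]  # skill of the most recent action
    prior = history[history["skill"] == skill]
    return {
        # Individual-action features
        "last_action_correct": int(clip["correct"].iloc[-1]),
        "last_action_duration_s": clip["duration_s"].iloc[-1],
        # Feature across all past activity
        "frac_prior_correct_on_skill": prior["correct"].mean() if len(prior) else 0.0,
    }
```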
Automated detector of boredom
• Detectors were built using RapidMiner 5.3
• For each algorithm the best features were
selected using forward selection/backward
elimination
• Data was re-sampled to have more equal class
frequencies; models were evaluated on
original class distribution
• Detectors were validated using 10-fold
student-level cross-validation
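The slide above describes student-level cross-validation with resampling applied only to the training data. Below is a sketch of that procedure using scikit-learn; a decision tree stands in for the RapidMiner learners, and X, y, student_ids are assumed to be NumPy arrays built from the engineered features.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.tree import DecisionTreeClassifier  # stand-in for the RapidMiner learners
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def crossvalidate(X, y, student_ids, n_folds=10):
    aucs = []
    # Grouping by student keeps each student entirely in train or test
    for train, test in GroupKFold(n_splits=n_folds).split(X, y, groups=student_ids):
        # Oversample the minority class in the training fold only
        pos, neg = train[y[train] == 1], train[y[train] == 0]
        minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
        balanced = np.concatenate(
            [majority, resample(minority, n_samples=len(majority), random_state=0)])
        clf = DecisionTreeClassifier().fit(X[balanced], y[balanced])
        # Evaluate on the untouched, original class distribution
        aucs.append(roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1]))
    return np.mean(aucs)
```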
Automated detector of boredom
• Detectors were built using 4 machine learning
algorithms that have been successful for
building affect detectors in the past:
– J48
– JRip
– Step Regression
– Naïve Bayes
Best one
• Of these, Step Regression performed best; the
final detector is a linear model, whose
coefficients are shown below
Machine learning
• Performance of the detectors was evaluated
using A'
– Given two observations, the probability of
correctly identifying which one is an example of a
specific affective state and which one is not
– A' of 0.5 is chance level and 1 is perfect
– Identical to the Wilcoxon statistic
– Very similar to AUC ROC (Area Under the Receiver
Operating Characteristic Curve)
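A' can be computed directly from its pairwise definition, and matches scikit-learn's ROC AUC when ties are counted as half-wins. A small self-contained check:

```python
from itertools import product
import numpy as np
from sklearn.metrics import roc_auc_score

def a_prime(y_true, y_score):
    """A': probability that a randomly chosen positive example is scored
    above a randomly chosen negative example (ties count 0.5)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

y = np.array([1, 0, 1, 0, 0])
scores = np.array([0.8, 0.3, 0.5, 0.5, 0.2])
print(a_prime(y, scores), roc_auc_score(y, scores))  # both print 0.9166...
```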
Results
• A' = 0.64
• Compared to similar detectors in other systems,
validated in a similarly stringent fashion:

System                                        A'
Cognitive Tutor Algebra (Baker et al. 2012)   0.69
ASSISTments (Pardos et al. 2013)              0.63
EcoMUVE (Baker et al. 2014)                   0.65
Inq-ITS (Paquette et al. 2014)                0.72
Final model
Coefficient   Feature
+0.212        The standard deviation, across the clip, of student
              correctness (1 or 0) on each action.
-0.013        The number of actions in the clip that occurred on
              Speed Game items.
-0.070        The fraction of the total clip duration spent on
              Speed Game items.
-0.073        The number of actions in the clip on items where the
              answer input was made by selecting an item from a
              drop-down list.
+0.290        The minimum slip parameter (P(S) in Bayesian
              Knowledge Tracing) on skills in the clip.
-0.260        The standard deviation, across the clip, of the action
              duration, normalized across all students, times the
              presence (1) or absence (0) of a hint request on the
              previous action.
+0.123        Y-intercept.
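Because the final detector is a linear model, applying it to a clip is just the weighted sum of the features above plus the intercept. The coefficients below are the ones from the table; the Python variable names are shorthand invented for this sketch.

```python
# Pseudo-confidence of boredom for one clip, using the coefficients above.
# The feature variable names are shorthand invented for this sketch.
def boredom_confidence(sd_correctness, n_speed_game_actions, frac_speed_game_time,
                       n_dropdown_actions, min_bkt_slip, sd_duration_x_prev_hint):
    return (0.212 * sd_correctness
            - 0.013 * n_speed_game_actions
            - 0.070 * frac_speed_game_time
            - 0.073 * n_dropdown_actions
            + 0.290 * min_bkt_slip
            - 0.260 * sd_duration_x_prev_hint
            + 0.123)  # Y-intercept
```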
Using detectors
• Model applied to entire year of data from these
classrooms
• 2,974,944 actions by 462 students
• Includes 54 additional students not present
during observations
• Aggregation over pseudo-confidences rather than
binary predictions
– Retains more information
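A minimal sketch of this aggregation, assuming a hypothetical predictions file with one row per action holding the detector's pseudo-confidence, the date, and the objective:

```python
import pandas as pd

preds = pd.read_csv("year_predictions.csv", parse_dates=["date"])  # hypothetical file

# Averaging the raw pseudo-confidences retains more information than
# thresholding each prediction to 0/1 first and averaging the labels.
boredom_by_date = preds.groupby(preds["date"].dt.date)["boredom_conf"].mean()
boredom_by_objective = preds.groupby("objective")["boredom_conf"].mean()
```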
[Figure: Boredom vs. Time of Year — average detected boredom
(y-axis, 0.130 to 0.150) plotted across the school year,
1-Sep through 31-May]
Apparent downward trend
• Is it statistically significant?
• Yes: students are less bored later in the year
• F-test controlling for student
• p < 0.001
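One way such an F-test might be run, sketched with statsmodels; the data frame and column names are assumptions, not the analysis code actually used.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical file: one row per action, with columns
# boredom (pseudo-confidence), day_of_year, student_id
df = pd.read_csv("boredom_predictions.csv")

# OLS with student as a fixed effect, then an F-test on each term
model = smf.ols("boredom ~ day_of_year + C(student_id)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p for day_of_year, controlling for student
```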
Is it practically significant?
• No.
• r = -0.06
• With large enough samples, anything is
statistically significant
Kind of a positive thing
• At minimum, students aren’t getting more
bored as the year goes on
• In other systems, students get more
disengaged as the year goes on (Beck, 2005)
• And the overall level of boredom (~14%) is not
very high
Beyond this
• Curriculum is self-paced
• Which means that predicting boredom by
date may obscure real variation
• Instead, look at boredom by learning objective
[Figure: Boredom vs. Objective — average detected boredom
(0.130 to 0.150) by Reasoning Mind objective, 5.01 through 5.45]
Predicting boredom by objective
• p < 0.001
• r = 0.343
If we cluster objectives into two groups
• “High boredom”
• “Low boredom”
• Ignoring the one point in between the two
groups
• Cohen’s d = 0.67
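Cohen's d here is the standardized difference between the two clusters' mean boredom levels. A generic sketch with a pooled standard deviation (the exact computation used may differ):

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Standardized mean difference with a pooled standard deviation."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# e.g. cohens_d(high_boredom_objectives, low_boredom_objectives)
```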
Future work
• So… what is it that differentiates the higher
boredom lessons from the lower boredom
lessons?
– Nothing obvious, unfortunately…
– May be necessary to develop a taxonomy of
potential differences, and see which are predictive
– May be possible to build on prior work by
Doddannara et al. (2013), who did exactly this for
Cognitive Tutor
Future Work
• Can we fix the more boring lessons?
– Either by determining why they are boring
– Or just by adding a little more “fun content”
Eventual Goal
• Use precise assessments of boredom to help
us enhance Reasoning Mind
– Improving engagement
– Improving learning outcomes
Thank you
twitter.com/BakerEDMLab
Baker EDM Lab
See our free online MOOT “Big Data and Education”
All lab publications available online – Google “Ryan Baker”
“Data, Analytics, and Learning” – EdX, Fall 2014