Carnegie
Mellon
The Sounds of Silence:
Towards Automated Evaluation of
Student Learning in
a Reading Tutor that Listens
Jack Mostow and Gregory Aist
Project LISTEN, Carnegie Mellon University
http://www.cs.cmu.edu/~listen
Mostow 7/17/2016, p. 1
Pilot study in urban elementary school
Goals:
•Analyze extended use of Reading Tutor
•Identify opportunities for improvement
Protocol:
•Principal chose 8 lowest third-grade readers
•Aide took each kid daily to use Reading Tutor in small room
•Kid chose text to read (Weekly Reader, poems, …)
Milestones:
•Oct. 96: deployed Pentium, trained users, refined design
•Nov. 96: school pre-tested individually
•June 97: school post-tested individually
User-Tutor interaction
(11/7/96 version used in pilot study)
User may:
•click Back
•click Help
•click Go
•click word
•read
Tutor may:
•go on
•read word
•recue word
•read phrase
Data recorded by Reading Tutor
Sessions from Nov. 96 to May 97 (excluding outliers)
•29 to 57 sessions per kid, averaging 14 minutes each
•Not used during vacations, downtime, absences
6 gigabytes of data
•.WAV files of kids’ spoken utterances
•.SEG files of time-aligned speech recognizer output
•.LOG files of Reading Tutor events
What to evaluate?
Usability (can kids use it?)
•1993 Wizard of Oz experiments
•Lab and in-school user tests of successive versions
Assistiveness (do kids perform better with than without?)
•1994 Reading Coach boosted comprehension by ~20%
•But: evaluation obtrusive, costly, sparse, subjective, noisy
Learning (do kids improve over time?)
•Within tutor: this talk
•On unassisted reading: pre-/post-test by school
•More than with alternatives: future studies
How should the Reading Tutor
evaluate learning?
Evaluation should be
•Ecologically valid -- based on normal system use
•Authentic -- student chooses material
•Unobtrusive -- invisible to student
•Automatic -- objective, cheap
•Fast -- computable in real-time on PC
•Robust -- to student, recognizer, and tutor behavior
•Data-rich -- based on many observations
•Sensitive -- detect subtle effects
So estimate improvement in assisted performance
How to estimate performance?
Accuracy = % of text words matched by recognizer output
•Coarse-grained
•Sensitive to missed words
•Doesn’t penalize requests for help
Inter-word latency = time interval between aligned text words
•Finer-grained
•Sensitive to hesitations, insertions
•Robust to many speech recognizer errors
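As a rough illustration of the two measures above, the following Python sketch computes accuracy as the fraction of text words matched by recognizer output, and inter-word latency as the gap between the end of one aligned word and the start of the next. The alignment via difflib.SequenceMatcher, the function names, and the (word, start_cs, end_cs) tuple layout are assumptions made for illustration; the talk does not describe the Reading Tutor's actual alignment procedure or its .SEG file format.

```python
from difflib import SequenceMatcher


def normalize(sentence):
    """Lowercase a sentence and strip punctuation from each word."""
    return [w.strip(".,!?;:").lower() for w in sentence.split()]


def accuracy(text_words, heard_words):
    """Fraction of text words matched by the recognizer output.

    Assumption: words are aligned with difflib's longest-matching-block
    heuristic, which may differ from the Tutor's own alignment.
    """
    matcher = SequenceMatcher(a=text_words, b=heard_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(text_words)


def inter_word_latencies(aligned):
    """Latency before each aligned word, in centiseconds.

    `aligned` is a list of (word, start_cs, end_cs) tuples in text order
    (a hypothetical layout). The first word has no predecessor, so its
    latency is None.
    """
    latencies = [None]
    for (_, _, prev_end), (_, cur_start, _) in zip(aligned, aligned[1:]):
        latencies.append(cur_start - prev_end)
    return latencies


# Sentences taken from the talk's Nov. 96 example.
text = normalize("If the computer thinks you need help, it talks to you.")
heard = normalize("IF THE COMPUTER THINKS YOU IF THE HELP IT TO TO YOU")
print(f"accuracy: {accuracy(text, heard):.2f}")

# Timestamps below are invented purely to show the calculation.
print(inter_word_latencies([("if", 0, 20), ("the", 63, 80), ("computer", 119, 170)]))
```

Under this alignment the example matches 9 of the 11 text words (about 0.82), close to the 81% the Tutor reported for the same sentence, though the Tutor's own method may differ in detail.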
Estimation of accuracy and latency
(Nov. 96 example from video)
Text:
If the computer thinks you need help, it talks to you.
Student said:
if the computer...takes your name...help it...take...s to you
Recognizer heard:
IF THE COMPUTER THINKS YOU IF THE HELP IT TO TO YOU
Tutor estimated 81% accuracy; inter-word latencies:
If the computer thinks you need…help, it talks...to you.
Word:         If  the  computer  thinks  you  need  help  it  talks  to   you
Latency (cs): ?   43   39        1       60   41    226   7   1      242  1
Improvement in accuracy and latency
(same kid reads “help” in May 97)
Text:
When some kids jump rope, they help other people too.
Student said:
when some kids jump rope they help other people too
Recognizer heard:
WHEN SOME KIDS JUMP ROPE THEY HELP OTHER PEOPLE TOO
Tutor estimated 100% accuracy; inter-word latencies:
When some kids jump rope, they help other people too.
Word:         When  some  kids  jump  rope  they  help  other  people  too
Latency (cs): ?     1     10    34    19    77    9     1      34      1
Which performance improvements count?
Echoing the sentence doesn’t count.
•So look only at the first try.
Picking stories with easier words doesn’t count.
•So look at changes on the same word.
Memorizing the story doesn’t count.
•So look only at encounters of words in new contexts.
Remembering recent words doesn’t count.
•So look only at the first time a word is seen that day.
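The four filters above can be sketched as a single pass over a chronologically ordered log of word encounters. The Encounter record, its field names, and the sample data are hypothetical; the talk does not specify how the Reading Tutor's .LOG files represent these events.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Encounter:
    """One reading of one word (hypothetical record layout)."""
    student: str
    day: str                   # ISO date, so string order is time order
    sentence: str              # the context the word appeared in
    attempt: int               # 1 = first try at this sentence
    word: str
    latency_cs: Optional[int]


def eligible(encounters):
    """Keep encounters that pass all four filters.

    Input must be in chronological order. Every encounter updates the
    'seen' sets, even if it is dropped, so a later encounter cannot
    count as new when an earlier one already exposed the word.
    """
    seen_contexts = set()  # (student, word, sentence) read before
    seen_today = set()     # (student, word, day) read earlier that day
    kept = []
    for e in encounters:
        context = (e.student, e.word, e.sentence)
        today = (e.student, e.word, e.day)
        is_new_context = context not in seen_contexts
        is_first_today = today not in seen_today
        seen_contexts.add(context)
        seen_today.add(today)
        if e.attempt == 1 and is_new_context and is_first_today:
            kept.append(e)
    return kept


# Invented sample data for one hypothetical student "s1".
log = [
    Encounter("s1", "1996-11-07", "sentence A", 1, "help", 60),  # kept
    Encounter("s1", "1996-11-07", "sentence A", 2, "help", 20),  # echo: second try
    Encounter("s1", "1996-11-07", "sentence B", 1, "help", 25),  # already seen today
    Encounter("s1", "1997-05-01", "sentence A", 1, "help", 15),  # old context
    Encounter("s1", "1997-05-02", "sentence C", 1, "help", 10),  # kept
]
for e in eligible(log):
    print(e.day, e.word, e.latency_cs)
```

Grouping the surviving encounters by word then gives same-word comparisons across days, which addresses the "easier stories" concern.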
Accuracy increased 16% on same word
from first to last day seen in new context
[Chart: accuracy by student (mjt, mtw, mmd, mrt, mdc, mgt, mcr, fbw); vertical axis from 50% to 90%]
Latency decreased 35% on same word
from first to last day read in new context
[Chart: latency by student (mjt, mtw, mmd, mrt, mdc, mgt, mcr, fbw); vertical axis from 0 cs to 100 cs]
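A first-to-last-day comparison on the same word might be computed along the following lines for one student. This is only a sketch: the function name, the (word, day, value) record layout, and the sample numbers are assumptions, and the talk does not say whether its reported changes are absolute or relative, or how they were averaged.

```python
from collections import defaultdict
from statistics import mean


def first_to_last_change(records):
    """Mean change in a measure from the first to the last day each word was seen.

    `records` is an iterable of (word, day, value) tuples, with days as
    ISO date strings so that string order is time order. Only the first
    record per word per day is used, and words seen on a single day are
    skipped because they have no first-to-last difference.
    """
    by_word = defaultdict(dict)
    for word, day, value in records:
        by_word[word].setdefault(day, value)
    changes = []
    for days in by_word.values():
        if len(days) < 2:
            continue
        changes.append(days[max(days)] - days[min(days)])
    return mean(changes) if changes else None


# Invented latencies in centiseconds, purely to show the calculation.
records = [
    ("help", "1996-11-07", 60),
    ("help", "1997-05-02", 30),
    ("rope", "1996-12-01", 40),
    ("rope", "1997-04-01", 30),
    ("kids", "1997-01-10", 15),  # seen on one day only, so skipped
]
print(first_to_last_change(records))
```

The same function applies to accuracy values; a negative result means the measure fell between the first and last day, which is the desired direction for latency.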
Is accuracy and latency estimation...
Ecologically valid? Reading Tutor used in school
Authentic? kids choose stories
Unobtrusive? evaluate assisted reading invisibly
Automatic? align recognizer output against text
Fast? real-time on Pentium
Robust? to much student, recognizer, and tutor behavior
Data-rich? 10498 utterances, 139133 aligned words
Sensitive? detects significant but subtle effects (< 0.1 sec)
Conclusion
Does the Reading Tutor help?
•Yes, with assisted reading
•Transfers to unassisted reading!
Research questions:
•Who benefits how much, when, and why?
•How should we improve the Tutor?
For more information:
•http://www.cs.cmu.edu/~listen