Carnegie Mellon

The Sounds of Silence: Towards Automated Evaluation of Student Learning in a Reading Tutor that Listens
Jack Mostow and Gregory Aist
Project LISTEN, Carnegie Mellon University
http://www.cs.cmu.edu/~listen

Mostow 7/17/2016, p. 1

Pilot study in urban elementary school

Goals:
• Analyze extended use of Reading Tutor
• Identify opportunities for improvement

Protocol:
• Principal chose 8 lowest third-grade readers
• Aide took each kid daily to use Reading Tutor in small room
• Kid chose text to read (Weekly Reader, poems, …)

Milestones:
• Oct. 96: deployed Pentium, trained users, refined design
• Nov. 96: school pre-tested individually
• June 97: school post-tested individually

User-Tutor interaction (11/7/96 version used in pilot study)

User may:
• click Back
• click Help
• click Go
• click word
• read

Tutor may:
• go on
• read word
• recue word
• read phrase

Data recorded by Reading Tutor

Sessions from Nov. 96 to May 97 (excluding outliers)
• 29 to 57 sessions per kid, averaging 14 minutes
• Not used during vacations, downtime, absences

6 gigabytes of data
• .WAV files of kids’ spoken utterances
• .SEG files of time-aligned speech recognizer output
• .LOG files of Reading Tutor events

What to evaluate?

Usability (can kids use it?)
• 1993 Wizard of Oz experiments
• Lab and in-school user tests of successive versions

Assistiveness (do kids perform better with than without?)
• 1994 Reading Coach boosted comprehension by ~20%
• But: evaluation obtrusive, costly, sparse, subjective, noisy

Learning (do kids improve over time?)
• Within tutor: this talk
• On unassisted reading: pre-/post-test by school
• More than with alternatives: future studies

How should the Reading Tutor evaluate learning?
Evaluation should be
• Ecologically valid -- based on normal system use
• Authentic -- student chooses material
• Unobtrusive -- invisible to student
• Automatic -- objective, cheap
• Fast -- computable in real-time on PC
• Robust -- to student, recognizer, and tutor behavior
• Data-rich -- based on many observations
• Sensitive -- detect subtle effects

So estimate improvement in assisted performance

How to estimate performance?

Accuracy = % of text words matched by recognizer output
• Coarse-grained
• Sensitive to missed words
• Doesn’t penalize requests for help

Inter-word latency = time interval between aligned text words
• Finer-grained
• Sensitive to hesitations, insertions
• Robust to many speech recognizer errors

Estimation of accuracy and latency (Nov. 96 example from video)

Text: If the computer thinks you need help, it talks to you.
Student said: if the computer...takes your name...help it...take...s to you
Recognizer heard: IF THE COMPUTER THINKS YOU IF THE HELP IT TO TO YOU
Tutor estimated 81% accuracy; inter-word latencies (in cs):

If  the  computer  thinks  you  need…  help,  it  talks...  to   you.
?   43   39        1       60   41     226    7   1         242  1

Improvement in accuracy and latency (same kid reads “help” in May 97)

Text: When some kids jump rope, they help other people too.
Student said: when some kids jump rope they help other people too
Recognizer heard: WHEN SOME KIDS JUMP ROPE THEY HELP OTHER PEOPLE TOO
Tutor estimated 100% accuracy; inter-word latencies (in cs):

When  some  kids  jump  rope,  they  help  other  people  too.
?     1     10    34    19     77    9     1      34      1

Which performance improvements count?

Echoing the sentence doesn’t count.
• So look only at the first try.
Picking stories with easier words doesn’t count.
• So look at changes on the same word.
Memorizing the story doesn’t count.
• So look only at encounters of words in new contexts.
Remembering recent words doesn’t count.
• So look only at the first time a word is seen that day.

Accuracy increased 16% on same word from first to last day seen in new context

[Bar chart: vertical axis from 50% to 90%; bars labeled mjt, mtw, mmd, mrt, mdc, mgt, mcr, fbw]

Latency decreased 35% on same word from first to last day read in new context

[Bar chart: vertical axis from 0 cs to 100 cs; bars labeled mjt, mtw, mmd, mrt, mdc, mgt, mcr, fbw]

Is accuracy and latency estimation...

• Ecologically valid? Reading Tutor used in school
• Authentic? kids choose stories
• Unobtrusive? evaluate assisted reading invisibly
• Automatic? align recognizer output against text
• Fast? real-time on Pentium
• Robust? to much student, recognizer, and tutor behavior
• Data-rich? 10498 utterances, 139133 aligned words
• Sensitive? detects significant but subtle effects (< 0.1 sec)

Conclusion

Does the Reading Tutor help?
• Yes, with assisted reading
• Transfers to unassisted reading!

Research questions:
• Who benefits how much, when, and why?
• How should we improve the Tutor?

For more information:
• http://www.cs.cmu.edu/~listen
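
Appendix sketch: estimating accuracy and inter-word latency

The talk defines accuracy as the percentage of text words matched by recognizer output, and inter-word latency as the time interval between aligned text words. The Python sketch below is one possible reading of those definitions, not the Reading Tutor's actual implementation. Several things are assumptions made for illustration: the alignment method (Python's difflib), the function names, the (word, start_cs, end_cs) segment layout, and the choice to measure latency from the end of the previous matched word to the start of the current one. The timings in the latency example are invented and do not come from the study.

```python
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase a word and strip surrounding punctuation before matching."""
    return word.strip(".,!?;:").lower()


def align(text_words, heard_words):
    """Return (text_index, heard_index) pairs for words matched in order.

    Assumption: difflib's matching blocks stand in for the tutor's aligner.
    """
    a = [normalize(w) for w in text_words]
    b = [normalize(w) for w in heard_words]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs


def accuracy(text_words, heard_words):
    """Fraction of text words matched by the recognizer output."""
    return len(align(text_words, heard_words)) / len(text_words)


def latencies(text_words, segments):
    """Latency (in cs) before each matched text word; None where undefined.

    segments: hypothetical list of (word, start_cs, end_cs) tuples.
    Assumption: latency = start of this matched word minus end of the
    previous matched word, so pauses and inserted words both lengthen it.
    """
    heard_words = [word for word, _, _ in segments]
    result = [None] * len(text_words)
    prev_end = None
    for text_i, heard_i in align(text_words, heard_words):
        _, start, end = segments[heard_i]
        if prev_end is not None:
            result[text_i] = start - prev_end
        prev_end = end
    return result


# Accuracy on the Nov. 96 example from the slides.
text = "If the computer thinks you need help, it talks to you.".split()
heard = "IF THE COMPUTER THINKS YOU IF THE HELP IT TO TO YOU".split()
matched = len(align(text, heard))   # 9 of 11 text words
percent = accuracy(text, heard) * 100  # about 81.8; the slide reports 81%

# Latency on a tiny example with invented timings (not study data).
toy_text = "the cat sat".split()
toy_segments = [("THE", 0, 20), ("UM", 30, 50), ("CAT", 60, 90), ("SAT", 95, 120)]
toy_latencies = latencies(toy_text, toy_segments)  # [None, 40, 5]
```

In the toy example the inserted "UM" falls between two matched words, so it raises the latency before "cat" to 40 cs. This fits the slide's note that latency is sensitive to hesitations and insertions. The first word has no preceding aligned word, so its latency is left undefined, as the "?" on the slides suggests.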