Spoken Dialogue in Human and Computer Tutoring

Diane Litman
Learning Research and Development Center and Computer Science Department
University of Pittsburgh

Outline
Introduction and Background
The ITSPOKE System and Corpora
A Study of Spoken versus Typed Dialogue Tutoring
  – Human tutoring condition
  – Computer tutoring condition
Current Directions and Summary

Adding Spoken Language to a Text-Based Dialogue Tutor (11/03-9/06)
Primary Research Question
  – How does speech-based dialogue interaction impact the effectiveness of tutoring systems for student learning?
Hypotheses
  – Compared to typed dialogues, spoken interactions will yield better learning gains, and will be more efficient and natural
  – Different student behaviors will correlate with learning in spoken versus typed dialogues, and will be elicited by different tutor actions
  – Findings in human-human and human-computer dialogues will vary as a function of system performance

Motivation
Working hypothesis regarding learning gains
  – Human Dialogue > Computer Dialogue > Text
Most human tutoring involves face-to-face spoken interaction, while most computer dialogue tutors are text-based
  – Evens et al., 2001; Zinn et al., 2002; VanLehn et al., 2002; Aleven et al., 2001
Can the effectiveness of dialogue tutorial systems be further increased by using spoken interactions?

Spoken Tutorial Dialogue Systems
Recent tutoring systems have begun to add spoken language capabilities
  – Rickel and Johnson, 2000; Graesser et al., 2001; Mostow and Aist, 2001; Aist et al., 2003; Fry et al., 2001; Schultz et al., 2003
However, there has been little empirical analysis of the learning ramifications of using speech

Potential Benefits of Speech
Self-explanation correlates with learning and occurs more in speech
  – Hausmann and Chi, 2002
Speech contains prosodic information, providing new sources of information for dialogue adaptation
  – Forbes-Riley and Litman, 2004
Spoken computational environments may prime a more social interpretation that enhances learning
  – Moreno et al., 2001; Graesser et al., 2003
Potential for hands-free interaction

Spoken Computer Tutoring Excerpt
Tutor: What will the velocity of the object be a second after that (where the initial velocity is 9.8m/s the acceleration is 9.8m/s^2)?
Student: thirty nine point two
Tutor: Well... 19.6 m/s. So at every point in time during the fall of the man and his keys, how do their velocities compare with each other?
Student: same
Tutor: Yeah. Now we will compare the displacements of the man and his keys. Do you recall what displacement means?
Student: distance in a straight line

ITSPOKE: Intelligent Tutoring SPOKEn Dialogue System
Back-end is the text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)
Student speech digitized from microphone input; Sphinx2 speech recognizer
Tutor speech played via headphones/speakers; Cepstral text-to-speech synthesizer
Other additions: XML access to Why2-Atlas “internals”, speech recognition repairs, etc.
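The component list above implies a simple per-turn loop: recognize the student's speech, hand the text to the text-based tutor, and synthesize the reply, with a repair prompt when recognition fails. The Python sketch below illustrates that flow under stated assumptions. The names run_turn, recognize, tutor_reply, and synthesize are hypothetical stand-ins and are not the Sphinx2, Why2-Atlas, or Cepstral interfaces; only the reprompt wording is taken from the ITSPOKE excerpts in this deck.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REPROMPT = "Could you please repeat that?"  # wording seen in the ITSPOKE excerpts


@dataclass
class Turn:
    heard: Optional[str]  # recognizer hypothesis; None means the input was rejected
    tutor_text: str       # text sent to the synthesizer


def run_turn(
    audio: bytes,
    recognize: Callable[[bytes], Optional[str]],  # hypothetical recognizer stand-in
    tutor_reply: Callable[[str], str],            # hypothetical text back-end stand-in
    synthesize: Callable[[str], None],            # hypothetical text-to-speech stand-in
) -> Turn:
    """Run one student turn through recognition, tutoring, and synthesis."""
    heard = recognize(audio)
    # A rejected recognition triggers a repair prompt instead of a tutoring move.
    text = REPROMPT if heard is None else tutor_reply(heard)
    synthesize(text)
    return Turn(heard, text)


if __name__ == "__main__":
    spoken = []  # collects what the stub synthesizer was asked to say
    stub_asr = lambda audio: audio.decode() or None  # empty audio counts as a rejection
    stub_tutor = lambda text: f"You said: {text}"
    print(run_turn(b"force", stub_asr, stub_tutor, spoken.append))
    print(run_turn(b"", stub_asr, stub_tutor, spoken.append))
```

Passing the three components in as callables keeps the loop testable with stubs; a real system would also need timeouts and logging, which this sketch leaves out.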
Architecture
[Figure: ITSPOKE architecture diagram. Recoverable labels: www browser; www server; html; essay; ITSpoke (java); Why2; xml; Text Manager; student text (xml); Essay Analysis (Carmel, Tacituslite+); essay text; Speech Analysis (Sphinx); dialogue; tutorial goals; repair goals; text; Cepstral; Spoken Dialogue Manager; tutor turn (xml); Content Dialogue Manager (Ape, Carmel)]

Speech Recognition: Sphinx2 (CMU)
56 dialogue-based, probabilistic language models
Initial training data
  – typed student utterances from Why2-Atlas corpora
  – human-human: 968 unique words
  – human-computer: 599 unique words
Later training data
  – spoken utterances obtained during development and pilot testing of ITSPOKE
  – human-computer: 523 unique words
Total vocabulary
  – 1240 unique words

Language Models (LMs): Design
Dialogue-dependent language models manually constructed by aggregating prompts, e.g., an LM for prompts taking “yes/no” type answers

prompt: Just as the car starts moving, the string is vertical, so it can't exert any horizontal force on the dice. No other objects are touching the dice. So are there any horizontal forces on the dice as the car starts moving?

User response   Count   Frequency (%)
“no”            20      83.33
“none”           1       4.17
“yeah”           1       4.17
“yes”            2       8.33

prompt: When analyzing the motion of the two cars, one towing the other, can we treat them as a single compound body?

User response   Count   Frequency (%)
“no”             2       8.70
“yes”           21      91.30

Language Models: Evaluation
Test Data: ITSPOKE 2003-2004 evaluation
  – 20 students, 100 physics problems (dialogues), 2445 turns, 398 unique words
  – 39 of 56 language models
      • 17 models were specific either to 5 unused physics problems or to goals that were never accessed
Word Error Rate = 31.2%
“Concept Error” Rate = 7.6%

Speech Synthesis: Cepstral
Commercial outgrowth of Festival text-to-speech synthesizer (Edinburgh, CMU)
Required additional processing of Why2-Atlas prompts (e.g., f=m*a)

Spoken Tutoring Corpora
Human-Human Corpus
  – 14 students, 128 physics problems (dialogues)
  – Student and tutor turns were manually transcribed and segmented for content and boundaries
Human-Computer Corpus
  – 20 students, 100 physics problems (dialogues)
  – Noisy student turn boundaries and transcriptions extractable from ITSPOKE logs
  – Content of student turns also manually transcribed

Corpus Transcription & Annotation

ITSPOKE Corpora Comparison

Human-Human (…1.3 minutes into session…)
Student: Can I ask you questions?
Tutor: Yes
Student: Is that ok?
Tutor: Mm-hm
Student: Um from what we learned from the last problem they're gonna have the same- the keys and you have the same acceleration right?
Tutor: Yes
Student: So that means they'll both hit the bottom at the same time. But I don't understand where- I don't understand if you're in the elevator-
Tutor: You see
Student: Where are you going to-?
Tutor: The uh let me uh the key uh- the person holds the key in front of-
Student: Their face yeah-
Tutor: Uh his or her face and then lets it uh uh it is let go so the the the question relates to the relative position of the key and the face uh-
Tutor: Uh how will they- as they go- as they both fall what will be their relative position? That is the question.
Student: (sigh)
Student: So the key and the face-

Human-Computer (…3.5 minutes into session…)
Tutor: I didn't understand. What force acts on all objects within close proximity to earth?
Student: uh the force of gravity
Tutor: Fine. Besides the Earth's gravitational force, are there any other forces acting on the person, keys, and elevator after the release?
Student: no
Tutor: Yep. Thus, keep in mind that after the release, the only force on the person, keys, and elevator is the force of gravity. If gravity is the only force acting on an object, will it be moving or staying still?
Student: it will be moving
Tutor: Fine. Not only are the person, keys, and elevator moving, they have only gravitational forces on them. When an object is falling and has only gravitational force on it, it is said to be in what?
Student: freefall
Tutor: Good. Thus, after the release the only force on the keys and man is the downward force of earth's gravity, so they are in freefall. Before the release, what is the relationship between the man's velocity and the keys' velocity?
Student: they're equal
Tutor: Correct. We know that after the release, the keys, the person, and the elevator are in freefall. Is freefall acceleration near earth a constant? Or does it vary depending upon which object is in freefall?
Student: it's a constant

Empirical Results
Spoken Versus Typed Human and Computer Dialogue Tutoring
Diane Litman, Carolyn Penstein Rosé, Kate Forbes-Riley, Kurt VanLehn, Dumisizwe Bhembe, and Scott Silliman
Proceedings of the Seventh International Conference on Intelligent Tutoring Systems (to appear)

Overview
Given the effort to develop a spoken tutorial system, it is important to investigate whether and how the use of speech can improve student learning
  – by determining the aspects of dialogue that correlate with learning, we can construct our systems to better elicit such behaviors
Two experiments: spoken versus typed dialogues
  – Human tutoring: an upper bound on speech and natural language processing
  – Computer tutoring: current state of the art

Common Aspects of Both Experiments
Students take a physics pretest
Students read background material
Students use web interface to work through up to 10 problems with either a computer or a human tutor
Students take a posttest
  – 40 multiple choice questions, isomorphic to pretest

Human Tutoring: Experiment 1
Same human tutor, subject pool, physics problems, web interface, and experimental procedure across two conditions
Typed dialogue condition (20 students, 171 dialogues)
  – Student and tutor in separate rooms
  – Strict turn-taking enforced
  – Student and tutor type via chat interface
Spoken dialogue condition (14 students, 128 dialogues)
  – Student and tutor in same room, separated by a partition
  – Interruptions and overlapping speech permitted
  – Student and tutor speak through head-mounted microphones
  – Dialogue history box remains empty

Typed Excerpt / Spoken Excerpt (Human Tutoring Corpora)

Problem: Suppose
that you released 3 identical balls of clay in a vacuum at exactly the same instant. They would all hit the ground at the same instant. Now you stick two of the balls together, forming one ball that is twice as heavy as the remaining, untouched clay ball. Both balls are released in a vacuum at exactly the same instant. Which ball hits the ground first?
Original Essay: Both balls will hit the ground at the same time. The balls are in free fall (only gravitational forces). The ratio of the masses and weight are equal.
---------- 2 min into dialogue ----------
Tutor: You have correctly said the balls are in free fall. What do you conclude from this fact?
Student: There is no air resistance and the balls’ accelerations are constant in regards to one another
Tutor: Right, so the conclusion is that they both have the same acceleration, the free fall acceleration. Now you have to show that time taken by them to reach the ground will be the same. How do you do that?
Student: F (net forces) / mass = f/m because F = m*a therefore ratio should equal one another
Tutor: But that does not tell you anything about the time taken. What physical quantities will determine the time taken. Will the initial velocity of the balls matter?
Student: yes, assuming starting at rest? sorry, kind of at a loss at the moment

Spoken excerpt
Problem: If a car is able to accelerate at 2 m/s^2, what acceleration can it attain if it is towing another car of equal mass?
Original Essay: If the car is towing another car of equal mass, the maximum acceleration would be the same because the car would be towed behind and the friction caused would only be by the front of the first car.
---------- 6.5 min into dialogue ----------
Tutor: Now this law that force is equal to mass times acceleration, what's this law called? This is uh since this it is a very important basic uh fact uh it is it is a law of physics. Um you have you have read it in the background material. Can you recall it?
Student: Um no it was one of Newton's laws but I don't- remember which one. (laugh)
Tutor: Right, right- That- is Newton's second law of motion.
Student: he I- Ok, because I remember one, two, and three, but I didn't know if there was a different name
Student: (laugh)
Tutor: Yeah that's right you know Newton was a genius-
Tutor: and uh he looked at a large number of experiments and experimental data that was available and from that he could come to this general law and it is known as Newton's second law of motion. Um many many other scientists before him had seen all this data which was collected by scientists but had not concluded this now it looks very simple but to come to the conclusion from a mass of data was something which required the genius of Newton.
Student: mm hm

Typed versus Spoken Tutoring: Overview of Analyses
Tutoring and Dialogue Evaluation Measures
  – learning gains
  – efficiency
Correlation of Dialogue Characteristics and Learning
  – do dialogue aspects differ across conditions?
  – which dialogue aspects correlate with learning in each condition?

Learning and Training Time

Dependent Measure               Human Spoken (14)   Human Typed (20)
Pretest Mean (std dev.)         .42 (.10)           .46 (.09)
Posttest Mean (std dev.)        .72 (.11)           .67 (.13)
Adj. Posttest Mean (std dev.)   .74 (.11)           .66 (.11)
Dialogue Time (std dev.)        166.58 (45.06)      430.05 (159.65)

Discussion
There was a robust main effect for test phase (p=0.000), indicating that students in both conditions learned during tutoring
The difference in adjusted posttest scores approaches significance (p=0.053), suggesting that students learned more in the spoken condition
Students in the spoken condition completed their tutoring in less than half the time taken in the typed condition (p=0.000)

Dialogue Characteristics Examined
Motivated by previous work suggesting that learning correlates with increased student language production and interactivity (Core et al., 2003; Rosé et al., pilot studies of typed corpora; Katz et al., 2003)
  – Average length of turns (in words)
  – Total number of words and turns
  – Initial values and rate of change
  – Ratios of student and tutor words and turns
  – Interruption behavior (in speech)

Human Tutoring Dialogue Characteristics (means)

Dependent Measure             Spoken (14)   Typed (20)   p
Tot. Stud. Words              2322.43       1569.30      .03
Tot. Stud. Turns               424.86        109.30      .00
Ave. Stud. Words/Turn            5.21         14.45      .00
Slope: Stud. Words/Turn          -.01          -.05      .04
Intercept: Stud. Words/Turn      6.51         16.39      .00
Tot. Tut. Words               8648.29       3366.30      .00
Tot. Tut. Turns                393.21        122.90      .00
Ave. Tut. Words/Turn            23.04         28.23      .01
Stud-Tut Tot. Words Ratio         .27           .45      .00
Stud-Tut Words/Turn Ratio         .25           .51      .00

Discussion
For every measure examined, the means across conditions are significantly different
  – Students and the tutor take more turns in speech, and use more total words
  – Spoken turns are on average shorter
  – The ratio of student to tutor language production is higher in text

Learning Correlations after Controlling for Pretest

                              Human Spoken (14)   Human Typed (20)
Dependent Measure             R       p           R       p
Ave. Stud. Words/Turn         -.209   .49         .515    .03
Intercept: Stud. Words/Turn   -.441   .13         .593    .01
Ave. Tut. Words/Turn          -.086   .78         .536    .02

Additional Analyses: Spoken Human Tutoring

Dependent Measure               Mean    Controlled R   p
Tot. Stud. Questions            35.29   -.500          .08
Ave. Stud. Questions/Dial        3.86   -.477          .10
Std. Tut. Questions/Dial        13.55   -.489          .09
Std. Stud-Tut Word Ratio/Dial    0.14   -.584          .04
Std. Stud-Tut Words/Turn/Dial    0.22   -.640          .02

Removing Student “Groundings”: Spoken Human Tutoring

Dependent Measure          Mean      Controlled R   p
Tot. Student Words         2133.57   -.298          .32
Tot. Student Turns          251.36   -.281          .35
Ave. Student Words/Turn       8.33   -.243          .42
Std. Student Turns/Dial.     15.79   -.478          .10

Discussion
Measures correlating with learning in the typed condition do not correlate in the spoken condition
  – Typed results suggest that students who give longer answers, or who are inherently verbose, learn more
  – Initial analyses removing “groundings” do not change spoken results
Deeper analyses needed (requires manual coding)
  – do longer student turns reveal more reasoning and explanation?
  – do longer tutor turns mean better explanations or feedback?
  – results need to be further examined for student question types, substantive contributions versus groundings, tutor moves, etc.

Computer Tutoring: Experiment 2
Same physics problems, web interface, and experimental procedure as Experiment 1; however
  – only 5 problems (dialogues) per student
  – pretest taken after background reading
  – strict turn taking enforced in both conditions
Typed dialogue condition (23 students, 115 dialogues)
  – Why2-Atlas computer tutor
  – student and tutor interact via text-based chat interface
Spoken dialogue condition (20 students, 100 dialogues)
  – ITSPOKE computer tutor
  – student and tutor interact through microphone and headset
  – utterances displayed when student or ITSPOKE finished speaking
  – speech recognition output displayed rather than actual student utterances

Spoken Computer Tutoring Excerpt
ITSPOKE: What else do you need to know to find the box's acceleration?
Student: the direction
ASR: add directions
ITSPOKE: If you see a body accelerate, what caused that acceleration?
Student: force
ITSPOKE: Good job. Say there is only one force acting on the box.
How is this force, the box's mass, and its acceleration related?
Student: velocity
ASR: REJECT
ITSPOKE: Could you please repeat that?
Student: velocity

Learning and Training Time

Dependent Measure               Computer Spoken (20)   Computer Typed (23)
                                (ITSPOKE)              (Why2-Atlas)
Pretest Mean (std dev.)         .48 (.17)              .49 (.20)
Posttest Mean (std dev.)        .69 (.18)              .70 (.16)
Adj. Posttest Mean (std dev.)   .69 (.13)              .69 (.13)
Dialogue Time (std dev.)        97.85 (32.8)           68.93 (29.0)

Discussion
There was a robust main effect for test phase (p=0.000), indicating that students in both conditions learned during tutoring
The adjusted posttest scores were not reliably different (p=0.950), suggesting that students learned the same amount in both conditions
Students in the typed condition completed their tutoring in less time than in the spoken condition (p=0.004)

New Computer Tutoring Dialogue Characteristics
Why2-Atlas and ITSPOKE conditions
  – Total Subdialogues per Knowledge Construction Dialogue (KCD)
Only ITSPOKE (speech recognition) condition
  – Word Error Rate
  – Concept Accuracy
  – Timeouts
  – Rejections

Computer Tutoring Dialogue Characteristics (means)

Dependent Measure             Spoken    Typed     p
Tot. Stud. Words               296.85    238.17   .12
Tot. Stud. Turns               116.75     87.96   .02
Ave. Stud. Words/Turn            2.42      2.77   .29
Slope: Stud. Words/Turn          -.02      -.00   .02
Intercept: Stud. Words/Turn      3.21      2.88   .40
Tot. Tut. Words               6314.90   4972.61   .03
Tot. Tut. Turns                148.20    110.22   .01
Ave. Tut. Words/Turn            42.11     44.33   .06
Stud-Tut Tot. Words Ratio         .05       .05   .57
Stud-Tut Words/Turn Ratio         .06       .06   .64
Tot. Subdialogues/KCD            3.29      1.98   .01

Learning Correlations after Controlling for Pretest

                         Spoken (ITSPOKE)   Typed (Why2-Atlas)
Dependent Measure        R       p          R       p
Tot. Stud. Words         .394    .10        .050    .82
Tot. Subdialogues/KCD    -.018   .94        -.457   .03

Additional Analyses: Spoken Computer Tutoring

Dependent Measure       Mean    Controlled R   p
Tot. Dial. Time (min)   97.85   .580           .01
Ave. Dial. Time (min)   17.07   .580           .01
Std. Dial. Time (min)    9.99   .541           .02
Std. Tot. Stud. Words   42.39   .457           .05
Word Error Rate         32.45   -.201          .41
Concept Accuracy         0.92   .113           .65
Tot. Timeouts            5.50   .296           .22
Tot. Rejects             8.15   -.244          .31

Learning Correlations for 7 ITSPOKE Students with Pretest < .4

Dependent Measure                Mean   Controlled R   p
Slope: Student Words/Turn        -.03   -.877          .02
Intercept: Student Words/Turn    3.06   .900           .02

Discussion
Means across conditions are no longer significantly different for many measures
  – total words produced by students
  – average length of student turns and initial verbosity
  – ratios of student to tutor language production
Different measures again correlate with learning
  – Speech: student language production and time
  – Text: fewer subdialogues/KCD
  – Degradation due to speech (recognition errors, timeouts, rejections) does not correlate with learning

Current and Future Directions
Data Analysis
  – Deeper coding for question types and other dialogue phenomena
ITSPOKE version 2
  – Pre-recorded prompts and domain-specific TTS
  – Shorter tutor prompts and/or changed display procedure
  – Barge-in, Always Available Vocabulary
  – Monitoring and adaptation capabilities
Data Collection
  – Additional human tutors and computer voices
  – Other dialogue evaluation metrics

Summary
Goal: generate an empirically-based understanding of the implications of adding speech to text-based dialogue tutors
Accomplishments
  – Completion of ITSPOKE (version 1)
  – Transcription, “annotation”, and preliminary analysis of two spoken tutoring corpora (human tutoring, computer tutoring)
  – Initial empirical comparisons of typed and spoken tutorial dialogues (performance evaluation, correlation of dialogue characteristics with learning)
Results will impact the design of future systems incorporating speech, by highlighting the performance gains that can be expected, and the requirements for their achievement

References

Diane J.
Litman, Carolyn P. Rosé, Kate Forbes-Riley, Kurt VanLehn, Dumisizwe Bhembe, and Scott Silliman. Spoken Versus Typed Human and Computer Dialogue Tutoring. To appear, Proceedings of the Seventh International Conference on Intelligent Tutoring Systems (ITS), Maceio, Brazil, August-September 2004.

Diane J. Litman and Scott Silliman. ITSPOKE: An Intelligent Tutoring Spoken Dialogue System. In Proceedings of the Human Language Technology Conference: 4th Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL) (Companion Proceedings), Boston, MA, May 2004.

Related papers available at http://www.cs.pitt.edu/~litman/itspoke.html

Acknowledgments
Kurt VanLehn and the Why2 Team
The ITSPOKE Group
  – Kate Forbes-Riley, LRDC, Research Associate
  – Scott Silliman, LRDC, Programmer
  – Art Ward, Intelligent Systems, PhD Student
  – Alison Huettner, LRDC, Research Associate

Thank You! Questions?

Zero-Order Learning Correlations

                              Human Spoken (14)   Human Typed (20)
Dependent Measure             R       p           R       p
Tot. Stud. Words              -.473   .09         .065    .78
Ave. Stud. Words/Turn         -.167   .57         .491    .03
Slope: Stud. Words/Turn       -.275   .34         -.375   .10
Intercept: Stud. Words/Turn   -.176   .55         .625    .00
Tot. Tut. Words               -.482   .08         .027    .91
Ave. Tut. Words/Turn          -.139   .64         .496    .03

Human-Human Corpus Transcription and Annotation

Human-Computer Excerpt
Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student27: dammit (ASR: it is)
Tutor28: Could you please repeat that?
Student29: same (ASR: i same)
Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student31: zero (ASR: the zero)
Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario
[…omitted…]
Student33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor34: Fine. Are there any other forces acting on the apple as it falls?
Student35: no why are you doing this again (ASR: no y and to it yes)
Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student37: downward you computer (ASR: downward you computer)

Why2 Conceptual Physics Tutoring
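The (ASR: …) annotations in the human-computer excerpt pair each transcribed student turn with the recognizer output, which is the input a word error rate calculation needs. The Python sketch below assumes the standard definition (word-level substitutions, deletions, and insertions divided by the number of reference words, pooled over turns); the deck does not state which definition was used for its reported figures. The function names word_errors and pooled_wer are illustrative only.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # cost of deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def pooled_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words over all turns."""
    totals = [word_errors(ref, hyp) for ref, hyp in pairs]
    return sum(e for e, _ in totals) / sum(n for _, n in totals)


# (student transcript, ASR output) pairs copied from the excerpt
PAIRS = [
    ("same", "i same"),
    ("zero", "the zero"),
    ("downward you computer", "downward you computer"),
]

if __name__ == "__main__":
    print(pooled_wer(PAIRS))
```

Three turns are far too few to compare against the corpus-level rate reported earlier; the point is only to show how a single inserted word in a one-word answer dominates the per-turn error.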