>> Sumit Basu: So hi, everyone. I'm happy to have Lee Becker here to give us a really interesting talk on what he's been working on the last few years. Lee has a really interesting history. He's been passionate about education for, I think, his whole life. He actually spent a year in Indonesia teaching high school students over there, which I think led to some of his inspiration to work on this stuff on the technology side. Before that, he worked as a software developer at Intel and HP, making various kinds of tools. But since then, at the University of Colorado, he's been doing a dual Ph.D. in CS and CS -- computer science and cognitive science.

>> Lee Becker: It's a two-for-one.

>> Sumit Basu: Two for one, not bad. And he's been focusing mostly on how dialogue can be used in educational systems, and how improving the nature of dialogue and different things within dialogue can actually improve the educational experience and the tutoring experience. So he's going to talk about some of that work here today. Looking forward to it.

>> Lee Becker: Cool. Thanks for the awesome introduction. So I won't talk too much about this. I'm mainly going to be talking about asking questions within the context of tutorial or educational dialogues. Before I get too far into it, I want to flesh out a little bit about my research focus. I work in a field that is sometimes called learning science and sometimes called AI in Ed, artificial intelligence in education. It's where we try to take all the cool stuff coming out of AI, NLP, and machine learning, and use it to automate the process of education, and also to help us understand more about education. Like Sumit said, over the past few years I've been working on something more specific in the intelligent tutoring systems domain. This is a screenshot from our tutoring system, where a student interacts through dialogue -- Siri-like dialogue -- and sees a floating head who is their tutor, along with various multimedia visuals. What I really find exciting about having these kinds of systems, and actually getting to deploy them on real students, is that with enough of it going to a broad enough audience, we can start to investigate the phenomena underlying learning, and not only improve how well these systems actually perform at teaching, but actually tear apart the process. With learning gains, we can start to understand which concepts are important, what might lead a student to that yes, aha moment versus a more frustrated moment, and also start to learn a bit about the tutoring strategies underneath.

As a higher-level overview: like I said, over the past few years I've been working on intelligent tutoring systems in the dialogue space, trying to understand how we can improve the dialogue and what we need to do to really understand what's going on in it. Tangential to this area, it's brought me into question generation, and I've been involved in that community and the workshops related to it. From last summer we have an upcoming paper -- Lucy and Sumit mentored me -- on automatically generating fill-in-the-blank questions.
Another area, because I do work in NLP: this past year I've been working on relation discovery and information extraction from clinical notes, trying to drive toward improving the whole process of understanding what's going on with patients.

So, to give you a task right away: imagine you're a tutor trying to teach a student some material. In this case we're going to be talking about basic circuits. Pretend the student is in grades 3 through 5, so about 8 to 11 years old. We're going to use this visual to drive the conversation. You see there's a battery, a light bulb, some wires, and a circuit board, and this is actually something they had played with previously in class. As a tutor, your job is to lead them through a conversation about this, amongst other topics. And so you have this dialogue history:

Tutor: Roll over the D-cell in this picture. What can you tell me about this?
Student: The D-cell is the source of power.
Tutor: Let's talk about wires. What's up with those?
Student: Wires are able to take energy from the D-cell and attach it to the light bulb.

Now, imagine we were to pause and come in right here in this conversation, and you, as the tutor, have to pick what's next. Given a list of candidate questions, what is the next best question, or what is a good, appropriate question to ask at this point in time? I just want to take a quick quiz or poll amongst the audience and see what the rest of you think:

1. What about the light bulb? Tell me about that component too.
2. You mention that the wires attach the D-cell to the light bulb. Talk to me a bit more about that.
3. You mention that the wires attach to the D-cell. Which parts of the D-cell do they attach to? (So this is kind of a location question.)
4. You said that the wires take electricity to the light bulb. How do the wires do that?
5. So the wires connect the battery to the light bulb. What happens when all the components are connected together?

So, quick poll: how many for question one? None. Two? Got a couple there. Three? About the same. Four? About that. And five. So it's split all over the place, and I don't know if there's a right answer, but the human tutors I had, who were experts in this domain and experts in teaching it, picked number five.

>>: Yes!

>> Lee Becker: Good job, those of you who picked number five. You too could be a tutor. And so you might wonder, why is that? What factors are going into that, and what do we actually need to know to make this decision? There's a lot going on under the hood. You might be partial to certain keywords or certain types of vocabulary. Maybe, if you're really simple, you just think, oh, this one has a lot of words, this is good. There might be other factors if you take the dialogue history into account and look at what the student's doing. If we look through here, we see that the student is pretty good at giving a response: the tutor asks a question and the student gives this function right away, then the tutor asks another question and there's good uptake. There's good engagement and response from this student.
And so I'm thinking the rationale behind why the tutors picked question number five is that they see the student is on a roll, and they don't want to grind on a single point, but maybe use the momentum to carry them forward. To model this, I think you really need to understand not just the words and what's going on at a low level, but the actual action taken by the dialogue and by the questions being asked.

So, the outline for the rest of the talk: I'm going to give you some brief background about tutoring, intelligent tutoring systems, and dialogue, and talk a little about the tutoring system we've built over the past few years as the context for these explorations into asking and ranking questions. I'll talk more about the dialogue modeling -- the dialogue act, or dialogue move, representation I've created that helps us understand what is going on under the hood with the action of the dialogue. Then we'll apply it to the task of actually ranking questions in context, and we'll close with some closing thoughts. Of course, we'd close with closing thoughts anyway.

On to some background. You might think, why do we care about tutoring? What's important about that? Part of it is there are probably a lot of frustrated students out there, and it's not just "education is great, rah-rah." There's an actual problem: in recent studies, only about 34 percent of fourth graders and a fifth of 12th graders show proficiency in science, and if you think about all the jobs we do, we need more than proficiency. If you look at the advanced level, that's really small.

>>: What are the different levels?

>> Lee Becker: I think there's like poor, proficient -- it's binned into left of the median, median, and right of the median.

>>: Is there anything between proficient and advanced, or after advanced?

>> Lee Becker: I don't believe there is. This is the top tier. This is like passing. Maybe things have changed in the past three years and we've radically fixed education in that time, but I'm speculating that things are probably about the same.

>>: So this is implying that about 65% of fourth graders are not even proficient?

>> Lee Becker: Yes, exactly.

>>: It's the opposite of Lake Wobegon.

>> Lee Becker: Not often you get Prairie Home Companion in these talks. But there is some hope, and there have been some studies in the past. This is an often-cited study showing that if we focus more on tutoring -- this more focused kind of remediation, more focused interaction -- we can get as much as a two sigma gain. A two sigma gain means maybe going up two letter grades, from a C to an A, or something along those lines.

So you think, okay, tutoring's good, but why do we need machines? Why do we need to strap a kid into a room with a computer and solve education that way? I don't think this alone is the key to solving education, but it's definitely a useful tool. And I think the big argument, amongst anything else, for intelligent tutoring systems and all this educational software and online learning is that we want to get towards what the CRA is calling "a teacher for every student." There's a scalability here that you can't get with human tutors.
There aren't enough experts out there to sit and have one-on-one conversations with every student in every classroom. So if you could imagine bringing this out to web scale, anyone could learn, and they'd have an opportunity to work on not just what they do in the classroom, but maybe to address whatever their particular problems are. And past studies show that intelligent tutoring systems are an effective means of educating students. We get one sigma, so there's still room for improvement, maybe getting up to the mythical two sigma we hope to achieve.

>>: So how big is a sigma in this case?

>> Lee Becker: It's usually a letter grade. So if a student's getting a C, one sigma would bring them to a B level on their final test or whatnot.

And so, okay, intelligent tutoring systems seem scalable. But why dialogue? Why should we care about doing this? Why can't we just give them a bunch of problems online? I think the interesting thing you get with dialogue that you maybe don't get with other modes of interaction, especially with young children who aren't able to type yet, is the opportunity for self-expression. There's what we call the interaction hypothesis: getting to interact and think about what you're saying reinforces your understanding. And from a more computer science point of view, intelligent tutoring systems are a really fertile test bed for a lot of the AI we do. Yeah, Matt?

>>: I'm sorry, can you go back a slide? So the ITS systems that exist give a one sigma gain. Is that what that's saying?

>> Lee Becker: Yeah, that's what it's saying, at least for these incarnations: when they tested them on their students and then took a post-test, they saw a one sigma gain.

>>: Is there some reason we don't see those in, like, wide deployment? I mean, one sigma is already really good, right?

>> Lee Becker: You actually do see a lot of these systems in wide deployment, and these aren't necessarily all dialogue-based systems. But they're often tied to a particular curriculum, or some of these studies were done on, say, physics students at one particular university, so the tutors are kind of customized. There is one company, Carnegie Learning, that has what they call a cognitive tutor. It's not dialogue-based, but it does math tutoring, and that's widely deployed in all of Pittsburgh and Pennsylvania and throughout the U.S.

>>: And does that more general system give as good of gains?

>> Lee Becker: They claim to. I need to double-check the citation on that, but I think they say it gives some sort of positive learning effect.

So like I was saying, this area isn't only feel-good, we're-helping-education; it also lets us investigate the things we think are cool and fun to work on. I'm focused more on dialogue and planning, but imagine there are a lot of issues in natural language understanding -- not just semantic similarity, but is a student right or wrong, and what might they have backwards. There are more subtle problems there. In terms of equipping these systems, there's a lot to explore in concept and misconception discovery: if I have a body of text, what are the important concepts, and what are the concepts people might get confused or wrong? Because these things are often customized, I think there are also incredible opportunities for domain adaptation.
Can we take a tutor whose behaviors we've learned in chemistry, and then make a biology tutor or a physics tutor? And after the fact, once we have all of this, there are really great opportunities for educational data mining: understanding what students are doing right and wrong, what questions they might be missing, what behaviors are useful, and really teasing apart the learning process.

So like I said before, this is a screenshot from our tutor. It's called My Science Tutor, MyST. We've been working on it for the past few years. Over in the right-hand corner is Marnie; she's not too bad in the uncanny valley, though she still scares me a little bit. Students interact with her: they wear headphones with a microphone piece and interact through automatic speech recognition when it's the full system. The tutor presents a series of multimedia visuals and, using prerecorded or synthesized speech depending on the version, interacts with the students, gives them prompts, and tries to encourage self-explanation. The purpose of this is not to bring about some singularity in education. It's really to supplement in-class instruction. It's not that we want to replace teachers; we want to give students who are maybe struggling in class other venues where they can continue to refine their understanding. And the idea is not to test or assess. It's not just giving them a bunch of questions and, oh, they only got 50 percent, they need to do more. It's really about providing a comfortable environment to discuss and reflect on what they've learned in class.

The educational approach we use is called Questioning the Author. It's a pedagogy created by Beck and McKeown, originally used for reading comprehension, trying to get students to ask, well, what is the author trying to say here? In this venue we've turned it into more questioning the science, or questioning the data: what is it they're observing that tells them what they understand? And the curriculum driving all this is the Full Option Science System, FOSS, which is widely deployed throughout California, Colorado, and the rest of the U.S.

You might think, okay, people have been working on dialogue for a long time; what's special about this? If you compare tutorial dialogue to a more standard flavor that's actually deployed, like IVR systems for airline or hotel reservations, you have inherent differences in the audience and in how you'd go about approaching it. With task-oriented dialogue like a reservation system, the point is to get the user to complete a task. In tutorial dialogue, it's not that the user is trying to get something done. Some students may just want to survive the 15 minutes they're subjected to; others may actually want to learn. So the point is maybe not so much what the user cares about, but this process of bringing about understanding. And like I said, there are different motivations. The person trying to complete a task has an intrinsic motivation: I want my hotel and airline reservations; I'm not going to give up until that's done. Whereas the student could easily get bored, or they might really want to go with it, and you have to balance for both. And I would argue there's a more concrete measure of success with the more straightforward systems. If you got it done in 30 seconds, that's pretty good.
If you've got the task done at all, that's probably also a good measure of whether the system's working. You could also poll user satisfaction, or decide, well, we needed to get a human in the loop, so the system's obviously not working. Compare that to tutorial dialogue, where the long-term goal is probably learning gains, if we even trust those tests to some extent. Do we just test what they've done before and after the session? Is it what they retain over the long term? Is user satisfaction meaningful -- even if the student said "I had a great time," did they really learn anything? Or maybe it's coverage of material: are we covering the right material? So there's an interesting evaluation question going on there. And similarly, I would argue the penalty for poor behavior is high in a tutoring system, whereas in a task-oriented system, because the user's motivated, they might be just motivated enough to keep staying with it.

I think the big challenge for intelligent tutoring systems -- and you alluded to this earlier, why aren't these more widely deployed -- has a lot to do with scalability. That's this middle bullet here. It takes a lot of effort to curate the knowledge you want, to author dialogues, and to create all the behaviors for the system. Right now it may take a few weeks for one lesson; how are we going to scale up to an entire textbook's worth of lessons? This top bullet: you also need to be robust to different users. You'll have people of varying skill levels and different motivations, so you're going to have to be personalized and adaptive. And I also think that after the fact, there's not really a clear way to compare strategies. We might see the end goal, but can we really look at what's going on and tell why one session is better than another?

To give you a flavor of what might go on, here's some dialogue. Here's the tutor, who asks: what do you think the paper clip is made of to be attracted to the magnet? The student responds: magnetic force attracts the paper clip. Tutor: think about the kinds of things that are attracted to magnets. How do they connect to paper clips? Student: it's attracted to the magnet because it's iron or steel. Throughout this snippet, the tutor is really trying to get at this notion of iron or steel. And this is maybe an easier, less adversarial exchange. There's good uptake at this point in the conversation: the tutor asks a question and the student is pretty responsive, and the student is kind of giving them what they want. Whether that's good or bad still remains to be told, but you can see there's a good level of engagement here.

A challenge with other students: here's a tutor that asks a question, and the student just says, I don't know. I don't understand, and I don't know. And so the question is, do you just say all right and move on, or is there some strategy? Maybe what this tutor has to do is keep pushing: here's another question, and the student says, I don't know. So it's a matter of backing off from more open-ended questions to a more specific question -- here, look at this, what happened? At that point the student gives uptake, which then allows us, as a tutor, to move the conversation forward. Now we have something to talk about; here's what we can do.
So even if the student's not giving you the answers you want, there's still something you can do to give them a good learning experience. Of course, there are others that are just out there. In this case, I think this may be a problem between the student and the tutor -- and these actually came from Wizard of Oz experiments, where there's a human in the loop. It's not just the computer not understanding; this is a human struggling with this problem. So they ask a question, and the student answers, and then they ask a question and the student says: you already asked me that. The tutor wants him to talk about this point for some reason, so they ask it again. The student: yeah, I already said that. And the tutor is like, all right, I'm not backing off. Just answer. So you might think there's something going on under the hood that differentiates these, and if we're to make systems that are robust to this, it takes a lot of work.

Back to the authoring side of this: most of the systems out there have what we call a finite state machine under the hood. It's just a graph: here's what we're trying to ask them about, and if they say this, we go down this path; if they say that, we go down that path. It's very manual. Maybe if we have a good natural language understanding, we can say, well, with this much confidence we go down this path. But as you can see, as the lesson expands and the range of possible things grows, it gets to be a lot of work to author, curate, tune, and change these behaviors. This kind of approach is pretty common in a lot of these different tutors.

Our approach in MyST is similar when you break it apart, but instead of a rigid finite state machine, we use a frames-and-slots approach: we have the information we're trying to fill, so in a hotel domain it might be time, location, et cetera. Here, we've broken down a learning concept into sub-parts that we're trying to entail, and these parts have prompts associated with them that you would ask the student -- if they didn't say anything about flow, you might continue down that. And we have different rules for when they get things backward. In this case, we're trying to get the student to really understand that electricity flows from the negative terminal to the positive terminal. If they have that switched, we might have to say, well, check it out. Look again. Do you think it's really doing that?

>>: What happens if they say something nutty, like gravity pulls the electricity from the negative to the positive?

>> Lee Becker: Our system is pretty dominant; it keeps going down. Usually, the way it's organized, we'll ask them an overview. We don't tend to acknowledge that they said anything totally wrong. So it's very easy for a student to do what they call gaming the system: they could just say yes, no, and kind of exhaust the system. We've been fortunate that the students seem to think the system is smart enough, and they don't try to game it. I think if we had middle schoolers instead of elementary school students, we'd have to put in more robust mechanisms to catch things like "oh, you know, the speed train is this fast," or other factors going on there.
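[Editor's note: to make the frames-and-slots bookkeeping concrete, here is a minimal illustrative sketch in Python. It is not the actual MyST code; the slot names, prompts, and remediation rule are invented.]

    # One frame per learning concept; each slot is a sub-part of the
    # concept, with a prompt to use while the slot is still unfilled.
    FRAME = {
        "concept": "electricity flows from the negative terminal to the positive terminal",
        "slots": {
            "flow":     {"filled": False, "prompt": "What is the electricity doing in this circuit?"},
            "negative": {"filled": False, "prompt": "Where does the electricity start from?"},
            "positive": {"filled": False, "prompt": "Where does it end up?"},
        },
    }

    def next_prompt(frame, reversed_direction=False):
        """Pick the next tutor move: remediate a reversed direction,
        otherwise prompt for the first unfilled slot, otherwise recap."""
        if reversed_direction:
            return "Check it out. Look again. Do you think it's really doing that?"
        for slot in frame["slots"].values():
            if not slot["filled"]:
                return slot["prompt"]
        return "Good. So electricity flows from the negative to the positive terminal."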
>>: [indiscernible] cause of reinforcement learning where there is something that is right.

>> Lee Becker: We don't do anything with reinforcement learning. There are other tutoring systems, which I'll talk about, that try to use reinforcement learning to maximize some final goal. But this is kind of the simplest thing: because we have these very open-ended dialogues, we just go down and exhaust. And if we exhaust a frame, we recap and say, okay, it's good enough. The student understands or doesn't understand, but we're going to give them enough to come away with something.

So the broad research questions we get when we look at all this authoring and these tutoring systems: what mechanisms are needed to support more intelligent behavior, if we want more responsive and robust dialogue management? If we want to induce behaviors from corpora, what are we going to need, and how can we drive towards a more personalized, more human-like interaction? And after the fact, how could we possibly do analysis of these tutoring sessions? What I'm going to argue is that the representation -- the underlying linguistic representation at the dialogue level -- is what's really going to enable us to do more interesting things.

So now I'll talk about modeling dialogue with DISCUSS. Like much of what we do in NLP, you need a linguistic representation that you can possibly learn from or use for certain actions. So I want something that abstracts the action, the function, and the content: the high-level dialogue action, what is going on; the function, how it is being spoken about; and the content -- not specifically the words, but how they're talking about the concepts in a specific domain. In searching for a taxonomy, I had these requirements in mind. I wanted something interpretable without words: if I took away the words of the dialogue, could I still look at the annotation or the representation and get a gist that, okay, they seem to be responding to one another, or this is going down some different path? I wanted something that would allow post-tutorial analysis, and I also wanted it to be useful enough that it wasn't just going to be a corpus linguistics study, but would let me use the labels as features for learning some sort of behavior. And I also had in mind that maybe, going to the next step, this representation could serve as an intermediate representation to allow fully automatic question generation.

So when looking for these dialogue acts, I started with a literature review of the different work related to dialogue acts and tutorial moves, and I found that most everything seems to cover this coarse space, and I'll explain a little more about that. They get at the action at some high level. Some deal with the rhetorical forms, so some taxonomies have a bit about the function, and I drew a lot of inspiration from DISCOUNT, because that's also a learning-oriented one, as well as from the question taxonomies. The drawback of the question taxonomies is that they're less focused on dialogue and more on just classifying different types of questions. As I went through this, I started thinking, well, I certainly can't just use the words in and of themselves: if I look at those, I don't see the action. But if I use the high-level dialogue acts, the problem is I can't tell what's going on. I have two tutoring sessions, and it's like question, answer, question, answer.
It's like, oh, that's a good session? I can't do that. So my approach was to use what I call DISCUSS, the Dialogue Schema Unifying Speech and Semantics. It's kind of a long name, but the original name was DISTRESS, and my advisor said that's too negative; come back with something else. And DISTRESS was supposed to be a response to DAMSL, which is a famous dialogue act taxonomy.

Anyway, to drive at what I was saying -- the action, the function, and the content -- I have three dimensions. The dialogue act dimension: you do still see ask and answer, but also more tutorial-specific moves, things like revoice. Revoice is something used in Questioning the Author, but it could be used in any variety of tutoring modes; you're summarizing what the student says to move the conversation forward. A mark is a similar act, but you're highlighting keywords. So a revoice would be: "Oh, it sounds like you're saying electricity is flowing from the negative side," and then you ask a question. Whereas a mark would be more direct: "Oh, you said electricity. Let's talk about that." These are grounding acts that help the tutor show they're receptive to what the student is saying. The middle layer is the rhetorical form, and this refines the action, when appropriate. It might be a question asking them to describe, or a question asking them to define. What is the question really trying to achieve? What is the function of this question? And then, getting at the content, there's the predicate type: it might be that they're talking about some type of cause-and-effect relationship, or a function, or a process, or an observation. These acts were inspired partially by going through the dialogues and seeing what we actually had, and also by what I saw come up in the literature.

To give you an idea of how this actually looks, here's a dialogue. If you just look at the words, it gets cumbersome, but imagine we jump straight to the dialogue act representation: we can get the gist of what's going on. At point number one, the tutor says, okay, list an entity -- so, what do you see here? -- and the student lists some entities they see in the visual. Then the tutor gives some sort of positive feedback, and they describe the entities a little more, and now they're really trying to get them to talk about the function of this thing they're looking at -- it might be a circuit or a switch or a battery -- and the student just responds with an attribute, like, batteries are red, or something like that. Wait, I jumped ahead. But in the same case, the student just lists what they see, and lists some attributes. So the tutor backs off a little and talks about the entities, and the student still talks about attributes. So they think, okay, I'm going to try function. And you can see, going back and forth, as the student is stuck on attributes, the tutor is moving forward with functions.

What I think motivates this is that if we look across different lesson domains, we can get the same moves with different content. Here's a bunch of different questions and answers, and you can see we don't just have a single tuple per question; a question can exhibit different properties. In this case, it has a mark as well as an ask/elaborate/process tuple.
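[Editor's note: a minimal sketch of what a three-dimensional DISCUSS tuple might look like as a data structure. The label inventories shown are only the examples mentioned in the talk, not the full taxonomy.]

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class DiscussTuple:
        dialogue_act: str                # e.g. "Ask", "Answer", "Revoice", "Mark"
        rhetorical_form: Optional[str]   # e.g. "Describe", "Define", "Elaborate"
        predicate_type: Optional[str]    # e.g. "Function", "Process", "CausalRelation"

    # One utterance can carry more than one tuple, as in the
    # mark-plus-ask/elaborate/process example above:
    turn = [
        DiscussTuple("Mark", None, None),
        DiscussTuple("Ask", "Elaborate", "Process"),
    ]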
But this is for circuits. Now if we change it to magnets -- I didn't change the labels, but now we have different utterances that actually manifest themselves in the lesson. So we can start to see that maybe there are strategies that are generalizable across both domains.

Of course, a representation alone without data isn't any use, so we endeavored to get some linguists to help me out. I hired two linguists, and we annotated 122 transcripts. These were from Wizard of Oz sessions, so it wasn't the actual system; this was a human tutor controlling our system, a student talking into the microphone, and manually transcribed speech in this case. We coded it up for ten different units in magnetism, electricity, tuning, measurements -- close to 6,000 turns annotated in total, and 15 percent of it was double annotated, just to give us an idea of how difficult the task is. The kappas here show modest to fair agreement, and you can see that as we go down the hierarchy, it gets harder. For the dialogue act, it's pretty easy to see that this is asking a question or answering a question; the difficulty comes in with, are they doing a revoice or are they doing a mark? What else are they doing besides asking a question? As we go down farther, it gets more ambiguous: is this really asking for a description or a definition, where those might seem close? Going down to the predicate type, it gets even more ambiguous, because in some cases it might not be clear whether they're talking about a process, an observation, or a causal relation. That's part of the difficulty behind some of these kappas going down.

>>: So that was two judges?

>> Lee Becker: These were two linguists, and 15 percent of the dialogues were selected, and --

>>: Oh, I see, so it's overlapping.

>>: So this approach, how big can it go in terms of defining the schema? Like, for example, tutoring algebra, is it too big, or --

>> Lee Becker: I think algebra might be a stretch, because you don't have the same kind of conceptual knowledge. But I imagine for a lot of types of science, where you do have processes and observations and things you see in class related to these labs, it generalizes in that sense. With math, you might need a different vocabulary -- axioms and theorems and the actions you would take to solve a step. This is probably more useful for conceptual knowledge, and for trying to supplement reading in some way.
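[Editor's note: a sketch of the agreement statistic behind the kappas mentioned above. Cohen's kappa corrects observed agreement for agreement expected by chance; the labels below are illustrative, not from the corpus.]

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators' label lists."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Two annotators labeling the same four turns at the dialogue act level:
    print(cohens_kappa(["Ask", "Answer", "Mark", "Ask"],
                       ["Ask", "Answer", "Revoice", "Ask"]))  # ~0.64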
I kind of talked about these before, but the real motivations for DISCUSS are: I want to be able to discriminate one utterance from another; I wanted to explore what granularity I can get at the speaker intent and the content; and I want to use it for something useful. I'm mainly talking about learning decision making today, but I'm also currently looking at characterizing these interactions as a whole.

So now, on to the task of ranking questions in context. Back to what we did at the beginning of the lecture, or talk, or whatnot: we have this picture again, and you already saw these questions. I want to go into more detail about how we use DISCUSS and other features to actually go about ranking them. Given a set of candidate questions, where do we go next with a follow-up question?

And like I said at the beginning, there might be a lot of factors influencing the tutor's decision. You might have biases or opinions about certain words and phrasings: oh, I really don't like that kind of vocabulary, or that's really odd syntax to use with a third grader. There might also be preferences about subject matter: this is less important, let's focus on this. There are factors that come in from the dialogue context; given this history, certain things might be more appropriate. And of course, even tutors trained in the same area might have different pedagogical philosophies. Some might be really keen on using the visuals as much as possible; some might like more directed questions versus open-ended questions. And of course, the student's understanding always plays a role: is the student not answering anything, or are they getting it? You might ask different questions depending on what you get there.

So the driving questions, as far as ranking and choosing these follow-up questions: how might we go about learning this -- can we use preference data, some sort of ratings collected after the fact? What features are actually needed -- we talked about those potential factors, so how can we extract them from the words as well as from the representation? And what can this tell us about tutoring and about our tutors as we go through this process?

Before that, I want to talk a little about related work in tutorial move and dialogue move selection. Someone asked earlier about reinforcement learning, and there has been work looking at optimizing tutoring behavior for learning gains using reinforcement learning. But the decisions they were looking at were very minuscule and didn't really get at the full range of questions you can ask. Chi was looking at: should I elicit, or should I tell? Basically, should I stay on this point, or should I tell them and move on? You can see where reinforcement learning would be useful in that sense. But if you start to ask, should I ask about a definition or a description, should I talk about this aspect of the learning goal or that aspect, the state space explodes pretty quickly. You'd see similar behavior with HMMs. And Kristy Boyer did work with a dialogue act taxonomy, trying to predict dialogue acts. But again, she was using pretty coarse dialogue acts -- giving feedback, asking, answering -- not really driving at what kinds of questions we're asking at a specific point in time.
We want a set of questions, and we're going to then extract features from these questions and then we're going to pair them with some sort of scores, whether it's ranking or raw source and then we're going to learn different models. I have what we call a general model, where we average the scores across all raters, or average rating -- rankings across all raters or we have the individual model, where we learn what are the preferences for an individual. And so the data again, we have the 122 transcripts that we used. So we used that corpus for building dialogue models. But specifically, we look at 205 contexts extracted from these transcripts and so a set of 30 transcripts in particular. And we had -- does it say? Well, what we did was we had manually authored questions for these contexts as well as when appropriate we used a question extracted if it wasn't like some meta statement or if it seemed like it was a follow-up question. And then we took these questions and we annotated them with the DISCUSS representation. And then we put these questions and the full dialogue history, which I'll show, in front of raters so these raters are trained expert tutors that had previously worked on our project and then we do that. >>: [indiscernible]. 19 >> Lee Becker: It means that I have the original dialogue, and so I pause it, and so I take that next turn out and just use that, if it seems like it's a question. And then you might, so you might extract one and then you would author, like, five other candidate questions. Because I didn't have a good way of, like, making -- like without writing my own awesome question generation system from scratch, I didn't have a way of per muting the DISCUSS space and dialogue space. A little more about the authoring and the approach. I hired a linguist. We hire a lot of linguists in Colorado, and I trained him in questioning the author and in MyST and in the FOSS and the kind of questions we see and the kind of lessons. And I said okay, you're free to write any question, but take my guidelines into consideration and really thick think about how might you change the tactics. Would you want to add a revoice at this point? Would you want to mark? Would you ask him to elaborate? And like think about the learning goals. Do I want to focus on this aspect of the learning goal or a different aspect or maybe a different learning goal entirely. And then also because it's a linguist, go ahead and do some variation on lexical and syntactic structure, but mainly take into account DISCUSS. So maybe you might switch from asking a definition question about a causal relation to a definition question about a process or a description of a process and kind of permuting that space. And so we went with one author just to be more consistent as opposed to just having tutors write authors and ending up with ten very similar questions. And like I said before, when appropriate, we extracted questions from the original dialogue context. So this is what the author saw. And they had their learning -- so the author. This is the question author, saw the learning goals so they knew what it was that we were trying to elicit from the students in this lesson. We also see the dialogue history up until this question asking point, and then following the guidelines, he was free to write what questions he thought was appropriate. 
As far as rating, we hired four tutors that had served as Wizards when we were doing a Wizard of Oz studies early on in the development of MyST and we asked them to make these decisions as far as rating questions in this context. And we gave them a similar setup so again, they saw the learning goals, they 20 saw the dialogue history and we asked them to rate them simultaneously. Part of this is we wanted to not have them have like some sort of drift, just rating things in isolation. And the other thing is it allowed them to see this is obviously better than another and I gave them a wide range of one through ten and allowed them to pick ties if they thought, like, I can't really decide between these two. These are equally good. I could ask either one. I didn't feel like it was necessary to force a strict ranking in this case. And so to assess agreement, we use a measure often used in information retrieval called Kendall's Tau. It's a statistic that ranges from negative one, perfect disagreement, to one, perfect agreement. And probabilistic interpretation, if you take TAU, you get the probability of concordances, like times they agree on individual pairs versus how much they disagree on individual pairs. And so you might think, well, why ranking? Why not just learn the scores directly. I think part of it is that different raters have different scales. Someone might be a 7 through 10 person. Someone might be a 1 through 10 person. And so I'm really more interested in which question would you pick over another question. And so the mechanism we're going to use is we're going to convert their scores into a ranked list and then assess agreement in that sense. So here's a table of like how the different raters agree. And this bottom table is a couple months after they did their rating, I had them go back and redo ratings on a small set and see how well they agreed even with themselves. So obviously, they agree with themselves more than anything else. And what you see here is, like, it really is kind of dependent on who the rater is paired with, and it tells me that different people are keying in on different things. No less. I took an average across all of it, and so this is our kind of -this is what taking all people and all rankings, this is where we get is like 0.148, which is positive agreement. It's not huge, but it kind of shows the limits of how well people -- did you have a question, John? -- agree with one another. So the actual -- to actually learn and to actually do question ranking, the approach is, we use a pretty standard approach of learning a preference 21 function and we're going to take a feature vector that we extract from a question and possibly its context and another question, take the delta, and then train a binary classifier to tell me this question is better than another question. And we're going to run it both ways. And we're going to build just a win table and the results of this wins gets us the canonical ranking as far as that goes. And so what actually goes into this? What representation, if we're actually going to we actually need? So at the lowest level, Things that people might cue in on or what might think are important. features actually, or what plug something into a classifier, do we need these surface form features. we speculate that different tutors And so question length. If something is overly verbose or overly terse, maybe that has an influence. We also look at like WH questions. Maybe some tutors really like what questions versus which questions. 
Of course, it takes on a little different meaning in questioning the you though because you'll see in a second, the wording is different. And we also wanted to take into account maybe some syntactic variation with the part of speech tags. And so here's like a question, and the feature vector we would get out of that single question. So you would see, like, what's up with that. We naively just take the WH and put that as a binary, but we also take the bag of part of speech tags and just low-level, commonly used features in NLP. Going maybe a little more complex, we think that there's a process in conversation and in dialogue called entrainment, where as two people talk more with one another, we tend to use more similar words and constructs and if they're really engaged with each other, they might have more of these words overlapping. So we wanted to capture that in a feature, just basic feature called lexical similarity. And so we look at both the bag of words and the part of speech tags and what kind of overlap and so you could take a similarity between the previous student's turn and the question that you're trying to evaluate, and see what kind of overlap you get. Or you could also look at how does this relate to the learning goals. If we want to see, okay, are they talking about the learning goal we're currently talking about, or is this question about something else, which maybe indicates if it has a strong similarity that it might mean that the preference for that would mean you want to move on. 22 And so how this might look is here's a question. Here's a previous student turn, and what I mean by current learning goal is this is the description or, in an ideal case, if a student said this, we might think that the student has an understanding of that concept. And so if I just look at the words like, oh, brighter and bright or dimmer and dimmer, we can start to calculate just simple overlaps and throw that into our vector for features. Getting more into the more complex behavior, we have the DISCUSS features. we have these turns -- So >>: Can you just go back to the previous one? So if you can learn weights on features, especially if you use bi-gram features, do you have training data to do that and different kinds of conditions, dialogue states? >> Lee Becker: So I don't actually extract word bi-grams, because that's going to be too sparse. And so I only did, like, bi-gram overlap. So what percentage of the bi-grams in this overlap the percentage of the bi-grams in that. Because I think, like you said, it's too sparse to actually, it's too sparse to actually do that. But given our domain, the vocabulary is regular enough that we expect that the questions that the students' responses, they're going to be talking about batteries and wires. So if the question's talking about batteries and wires and the student's talking about batteries and wires, you'll see a high overlap there. >>: So if the student just repeats a question -- >> Lee Becker: >>: What will happen? >> Lee Becker: >>: Yeah, I mean. In our system? Anyway. >> Lee Becker: So you ask a question, and they say the exact same thing? 23 >>: Yeah. >> Lee Becker: It might actually fill in a lot of the keywords, depending, because we use a Phoenix semantic grammar that parse what's they say, and then it fills in the slots. So I think if the student was savvy enough, they might be able to. But if you look at some of the questions, like they might only get the simple things. 
>>: So if the student just repeats a question back --

>> Lee Becker: What will happen?

>>: Yeah, I mean. In our system? Anyway.

>> Lee Becker: So you ask a question, and they say the exact same thing?

>>: Yeah.

>> Lee Becker: It might actually fill in a lot of the keywords, depending, because we use a Phoenix semantic grammar that parses what they say and then fills in the slots. So I think if the student was savvy enough, they might be able to. But if you look at some of the questions, they might only get the simple things. They might get "light bulb" and "dimmer," but they might not get the relationship we're trying to get at -- that this gets dimmer when this happens. Or --

>>: And which feature measures that? Which feature identifies these relationships?

>> Lee Becker: I'll get into it -- I don't think there's a feature that identifies that, but we do have a feature that takes into account what slots they've filled in the dialogue and can contextualize that into a probability. But I don't use the natural language understanding for that, so I don't say, have they triggered a misconception or anything. If I were to do a follow-up study, I would probably add in more of those features -- how right or how wrong are they, and how that would influence the behavior. But because our system is less about assessment, we have a very loose definition of what's right and wrong.

So like I was saying, for DISCUSS, we extract a bag of DISCUSS features. These questions have associated DISCUSS labels and tuples. We can also look at how closely the rhetorical form and the predicate type match between this turn and the student's turn. To illustrate: here's a question and its associated DISCUSS act; we can see it has a revoice and an ask/elaborate. And here's a student turn. So if we look, we can see: okay, we got a revoice, binary one. Ask, one. Predicate type configuration, one. Those are straightforward to extract. The other thing is we want to see how in step they are, and this is a very coarse feature: oh, the student's previous response was about an observation and this question is about a configuration, so that's not a match there.

To get a little more sophisticated behavior, we look at DISCUSS transition probabilities over our corpus: what is the probability of a question having this kind of DISCUSS tuple, given that the previous student turn had that DISCUSS tuple? And we can back off from the full tuple to maybe just the dialogue act and rhetorical form, or look at just the predicate type. You can imagine this gets at the sequence in which you might talk about a certain concept: you might start with a visual, move to an observation, ask about some attribute, and finally talk about the process. And then, getting at the natural language understanding component, we have a measure of what slots are filled at a given point in the dialogue. So if we were asking about electricity flowing from negative to positive, and the student said "electricity" and "flows," and the two parts left out are "negative" and "positive," this probability would be: given a 50 percent fill of the current frame, what is the probability of asking this kind of DISCUSS tuple at this point in time?

So, evaluation. We're going to use a lot of measures -- probably the main one is Kendall's Tau, the same one we used to measure agreement between the tutors -- and we train the system using cross-fold validation, where we hold out three transcripts per fold, and each fold is a different lesson. So it might be: this is Magnetism and Electricity, unit one; this is going to be the evaluation set, and we're going to train on the rest.
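[Editor's note: a sketch of the back-off transition-probability feature described above -- how likely a question's DISCUSS tuple is given the previous student tuple, estimated from corpus counts. Tuples here are (act, form, predicate) triples, and the back-off to (act, form) is one of the back-offs mentioned.]

    from collections import Counter

    full_bi, full_uni = Counter(), Counter()      # conditioned on full tuples
    coarse_bi, coarse_uni = Counter(), Counter()  # (dialogue act, rhetorical form) only

    def observe(prev_t, quest_t):
        """Count one student-tuple -> question-tuple transition from the corpus."""
        full_bi[(prev_t, quest_t)] += 1
        full_uni[prev_t] += 1
        coarse_bi[(prev_t[:2], quest_t[:2])] += 1
        coarse_uni[prev_t[:2]] += 1

    def transition_prob(prev_t, quest_t):
        """P(question tuple | previous student tuple), backing off to the
        coarser (act, form) pair when the full tuple was never seen."""
        if full_uni[prev_t]:
            return full_bi[(prev_t, quest_t)] / full_uni[prev_t]
        if coarse_uni[prev_t[:2]]:
            return coarse_bi[(prev_t[:2], quest_t[:2])] / coarse_uni[prev_t[:2]]
        return 0.0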
And so this is the general model, where we took all the rankings from all the tutors and tried to create one model. The big take-away here is that if we go from the baseline features, which are the surface form and lexical similarity features, and start to add more of the DISCUSS features, we get a bump in improvement, and it's significant from here to here. I don't have all the significances in between. You see that for most of these measures, and you see it at a level where -- if you recall, all of the tutors agreeing with each other was 0.148, and the mean Kendall's Tau for the best system was 0.191. So it's roughly how a tutor goes. And we can look at the distribution. This was from some work with other classifiers and different tuning features, but we see very similar curves: the mean of the distribution moves right, and we have fewer of the things we're getting absolutely wrong with these Kendall's Taus. And if we look at mean reciprocal rank -- how often are we getting the number one item -- we see we're getting more of the number ones right when we throw in our bag of features versus this baseline system. Yeah, was there a question? Oh, no. And for the ones we're totally wrong on, we're decreasing that number and pushing it over.

But it doesn't make a lot of sense to just train a general model, and I think what we might be dealing with is bimodal distributions: some tutors think question one is great, and others think question five is great, and when you average that out, you might just get a mediocre rating overall. Like I said before, different tutors, even within the same educational setting, might have very different pedagogical beliefs. So it might be more interesting to train individual models; then we can start to see what these tutors actually key in on, and you can imagine the next step would be a more personalized environment, where a student who needs this kind of tutor would get that kind of behavior. So I trained the individual models, and we also took the best general model and added the features. What we start to see is that the best performing model, in bold, tends to be when we add more of these DISCUSS features. So it shows me that DISCUSS, for the most part -- except maybe for rater C -- is useful. Even she got a bump when you added the coarse-level dialogue acts. But different tutors key in on different levels of complexity when they're doing this ranking task and evaluating the quality of questions. As we add more, we see: oh, this tutor really keys in on all these things, whereas this tutor maybe doesn't need all of those features for the model.

>>: Those are all Kendall's Tau numbers?

>> Lee Becker: These are Kendall's Tau numbers again, yes.
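[Editor's note: for reference, a sketch of how a Kendall's Tau number like these is computed from two sets of scores over the same candidate questions. SciPy's implementation handles the ties the raters were allowed to give.]

    from scipy.stats import kendalltau

    rater = [9, 7, 7, 4, 10]    # 1-10 ratings for five candidate questions
    system = [8, 5, 6, 6, 9]    # system scores for the same questions

    tau, p_value = kendalltau(rater, system)
    # tau is -1 for perfect disagreement, +1 for perfect agreement;
    # (tau + 1) / 2 approximates the probability a random pair is concordant.
    print(tau)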
And so taking these results and maybe getting a little more qualitative about what do they mean and what does it mean to train a model, I asked our project -- so our system, we have a lead tutor who manages the other tutors, and I said based on your experience working with this tutor, observing them out in the classrooms, actually conducting tutors and looking at their transcripts, what do you think their style is, or how would you give me a one-line summary? So she said, well, rater A focuses more on the student than the lesson. Rater B focuses more on the lesson objectives. C tries to get the student to relate to what they see or do. So visuals. And rater D likes to add more to the lesson than was done in class. So she does something very different. And if I just take say the top 20 weighted features and I just look at it as the first level of, like, what's going on with the features, we see that this kind of corresponds. Like rater A is, like, focuses more on the student than the lesson. And so to really focus on the student, you need to have an understanding of what the student's actually saying and so you really need these dialogue acts. Rater B, who her description said she focuses more on the learning goals, and so you see these baseline features where the lexical similarity played a big role. She really keyed in on that and maybe to a lesser degree, but it doesn't say anything about the magnitude here. But to a lesser degree, the actual acts and the types of questions were less important versus how closely they aligned with the learning goals. Rater C, who tries to get the student to relate what they do, if you really want to see how a student relates, you're going to have to know they're talking about a visual or how they're talking about a visual. So you see that distribution. Rater D, she's kind of out there. She -- I mean, looking at her dialogue, she really, like sometimes they'll talk about things that are just not on task. And so you can see that maybe we have to account for or maybe the DISCUSS 27 doesn't model as much and we get more with the baseline features in her case, because -- yeah? >>: You're using logistic regression on differences between the pairs of possible questions, right? >> Lee Becker: Um-hmm. >>: So some of those numbers are very small, where you might have three [indiscernible] other ones are very big like the difference in, like, the questions. >> Lee Becker: >>: So I didn't normalize them or anything. So your top 20? >> Lee Becker: They're going to be all over the place. Most of them are binary to a percentage, but then, yeah, so this is kind of like maybe just a first-level pass at what we're getting at. I think what's more interesting is if we start to look at the weights individually and the story they tell. So rater A focuses more on the student than the lesson. So what you said about the actual weights is still valid, but you see that she gets a negative weight to the assertion dialogue act. So if she's focused more on the student, and the question is giving too much information, she tends to have this negative reaction. It's like don't give it away. Let the student do it. Rater B focuses on the lesson objectives, so larger weight to sem semantic overlap weight, like I said. Rater C tries to get the student to relate what they do. So we saw this predicate type. We saw more weight towards like observation and function or process, versus different dialogue acts. 
And so you can see she really wanted to get them to talk about what they saw, versus what are the concepts that are driving this. Rater D likes to add more to the lesson than was done in class. Unlike any of the other raters, she had a really high weight for meta statements in the questions, the ones that are like, oh, yeah, this is interesting, what's going on here. And then also, maybe because she was trying to do more, the contextual probability, where we had the DISCUSS tuple conditioned on another tuple, carried more weight in her model than in anyone else's. So while I didn't normalize the features, we can still compute a cosine similarity between the model for one tutor and the model for another tutor, and we start to see, okay, this is how they agree. While the numbers aren't going to be the same as when we actually compute Kendall's Tau, we can see that rater A agreed with rater B the most, and that happens both with the weight features from their models and with the actual rating agreement. And so it gives me some confidence that the model we're learning correlates with what the tutors are actually keying in on when they make these decisions. So just to get into a little more error analysis, I looked at the cases where my system wasn't doing well and wasn't agreeing with the tutors, and I was lucky in that in some cases, when I was collecting their ratings, I had a dialogue box that said please give me any feedback you might want to give. In some cases, they put N/A, because they just wanted to click like a Mechanical Turker. But in other cases, they were kind enough to give me things like, oh, I would never use these words in this situation. So it led me to identify three categories of errors. You have question authoring errors, where the rating comes from the question itself: like I said, the tutor might not like the syntax, or maybe the construction was grammatically weird, or the vocabulary was inappropriate for a third grader. So the question got a negative rating, and it was something that my model couldn't account for on its own. There are also instances where the DISCUSS representations the linguistic annotators gave to the questions were wrong. And so if some questions look very similar in the DISCUSS space and one is very different, it might get rated much lower or much higher, whereas they might actually be much closer. And then there are just modeling errors, where I didn't account for certain features. So back to the closing thoughts. We had these driving questions behind how do we go about doing this question ranking task, and can we use the preference data to model it. And I think, yeah, we can use it after the fact. And what features are needed to model tutors' behaviors? I'd say, well, maybe DISCUSS isn't the only thing, but it certainly drives at what is the action and the function and what's going on. And so I think to do this, you need something like DISCUSS, or some deeper-level dialogue act annotation, to actually do this task. And what can we learn about tutoring through these models? I think we learned that the preferences aren't uniform, that tutors really do key in on different things. And going the next step, if we started to look at learning gains paired with this, we could start to see what tutoring styles may be more effective than others.
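[Editor's note: the cosine comparison between per-tutor models mentioned above can be sketched in a few lines. The weight vectors here are hypothetical stand-ins for the learned per-tutor rankers, not real model weights.]

```python
# Compare two tutors' learned weight vectors by the angle between them.
# Cosine similarity is scale-invariant, which matters here because the
# features were not normalized, so raw weight magnitudes are incomparable.
import numpy as np

def cosine_similarity(w_a, w_b):
    return np.dot(w_a, w_b) / (np.linalg.norm(w_a) * np.linalg.norm(w_b))

# Hypothetical weights for two raters' models over the same feature set.
w_rater_a = np.array([0.8, -0.3, 0.1, 0.5])
w_rater_b = np.array([0.7, -0.1, 0.2, 0.4])
print(cosine_similarity(w_rater_a, w_rater_b))  # near 1 => similar models
```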
So the contributions are, well, first, I developed this DISCUSS taxonomy and representation, and I think it's a useful representation for tutorial dialogue analysis and for this question ranking task, and I showed that it was useful for modeling, like I said, actual human decision making. And while the machine learning methodology itself isn't new, in terms of intelligent tutoring and question selection, I've introduced this methodology for ranking questions. And I think I've defined a set of features that really drive at what is going on in this questioning process. Yes? >>: You said at the beginning of the talk that in addition to being an educational tool, these kinds of systems also present the opportunity to run larger scale experiments. So I'm just wondering, what's an example of an experiment that you've run that gets at what's important in learning, especially with regard to the kinds of models about how people work in dialogue-based systems? >> Lee Becker: I think given enough data and enough pre- and post-test learning gains, I can start to extract such features and look at, well, what is the sequencing? Is there some kind of scaffolding, or -- and even looking at similarity between sequences, maybe. >>: The current [indiscernible] you described, is that based on, do some of these tutoring systems have simple cognitive models of, say, memory, the control space and so on? Is there something like that behind -- >> Lee Becker: No, we don't have any cognitive modeling. We're basically just trying to follow the Questioning the Author pedagogy. And a question that's still open for us is, even within this Questioning the Author pedagogy, can we maybe try to find a more direct strategy versus a more open-ended strategy? And I think given enough data and given these labels, we can see, do you ever move from asking these open-ended questions about observing, or do you really need to, with struggling students, ask a direct question to get at that. So I think that's kind of where I'm driving with those. I'm starting to look at it. We have a collection; it's not a huge set of data, but we have learning gains for the standalone system, and I can look at the DISCUSS labels and see what I get there. Oh, and just the final thing I wanted to add as far as a contribution: I think, short of having to create and run this tutor on millions of students and permute little things to learn an optimal function, we can start to create behaviors from third-party rankings. So we can collect dialogues, get people to mark them up, and learn a behavior. And then you could imagine bringing that back into the tutor and saying, oh, we're going to run the rater A tutor and see what learning gains we get with them, versus the rater D tutor, and see, like, oh, those are negative and that's not so good. So just one final closing thought. Where do I take this next, or what am I really excited about working on? To me, I think with NLP and machine learning, we have a great opportunity to really make sense of all of this information out in the world, whether it's chapters in a textbook or Wikipedia, and I'm really interested in whether we can induce these taxonomies and get at this conceptual learning and concept maps, and there's already work on that. Can we then take it and create interactive processes?
So use that and do automatic question generation so we can ask students questions and hold them accountable. I'm calling it something like a mode of more active reading, where instead of just reading the text, you might be able to ask them questions about it. But I think there are also opportunities to create more generalizable models of dialogue. If you want to talk about these concepts, you can start to discover what kinds of dialogue interactions correspond with what kinds of ontologies, or what kind of underlying knowledge is driving it. And then I think if you set the system up correctly, you can start to get automated assessment. You can see who gets what right, what questions are actually useful. And I think a big open area is, we have all these tools and we can use them to extract things, but can we expose these models that we spend so much time creating to the user in some sort of nicer HCI, where they can maybe explore the concepts and explore things on their own in a different way. And so with that, I want to thank my advisors, my colleagues, and Sumit and Lucy for hosting me, and close it out and open it up for any questions you may have. Thank you. >>: So what about an important factor for the tutoring system, just fun for the students? I haven't heard anything about it. >> Lee Becker: So we don't do anything wildly fun, but I think they like the animations. I didn't mention that these animations are often interactive. And so it gives them an opportunity. I know when I was a kid and I'd go into class and you had to do a lab, you didn't know why you were hooking things up and what they were doing, and often the equipment was broken and the light bulb was burned out. So this gives them another chance to go back and try it and actually see the experiments work how they're supposed to. And so I think, for a student, just that added interaction is fun. Also, just looking at some of the logs, they're sometimes impressed by the text to speech. Wow, that's so cool. I know it's kind of old nowadays, but when we were doing Wizard of Oz studies, the kids would be like, can you make her say this. So there might be more things like that. >>: Do you think [indiscernible] easily extractable or extractable at all? >> Lee Becker: So I have done some initial experiments, and I'm continuing to refine that. I built some classifiers that do that. I don't think you can extract the tuple as a whole, but for the kind of binary decisions, I think it's going to get closer to the Kappa values. About a year ago, when I had different categories, it was maybe five to ten percent worse, depending on what it was. And I think part of the issue is both the amount of training data and that the lexical features don't carry as much weight with how much data we have. But I think aspects of it can be automated. I think the other useful point is that if you're thinking of a dialogue system, you can start with the representation, and then you would need to get it for the student, but you don't necessarily need to automatically label the tutor's turn in that case. >>: Because you'd be drawing from -- >> Lee Becker: Yeah, because you might be drawing from a pool of behaviors or a preselected pool of prompts. And then I think really, the utility, at least for this application, is more tied to the question representation for the students.
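[Editor's note: as a rough illustration of the decomposition Lee describes, predicting pieces of the DISCUSS tuple with separate classifiers rather than the whole tuple at once, here is a sketch with placeholder dimension names and a simple bag-of-words pipeline. None of this is the actual system or annotation scheme.]

```python
# Sketch: one classifier per DISCUSS dimension instead of a joint tuple
# classifier. Dimension names, features, and data are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

DIMENSIONS = ["dialogue_act", "predicate_type"]   # illustrative subset

def train_per_dimension(utterances, labels_by_dim):
    """labels_by_dim maps each dimension name to one label per utterance."""
    models = {}
    for dim in DIMENSIONS:
        clf = make_pipeline(CountVectorizer(), LogisticRegression())
        clf.fit(utterances, labels_by_dim[dim])
        models[dim] = clf
    return models

def predict_tuple(models, utterance):
    # Assemble the tuple from independent per-dimension predictions.
    return {dim: models[dim].predict([utterance])[0] for dim in DIMENSIONS}
```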
I think getting towards more sophisticated natural language understanding, you might want to then say, okay, they seem to be saying something similar, but are they talking about it in the right way? Are they getting at, like, the battery is the source of electricity, versus just the battery, electricity, and whatnot. >>: So how did you [indiscernible]? >> Lee Becker: Yes. >>: So the DISCUSS ontology. You said that some of the parts, especially in the predicate section, there are some things that are easily confusable on the part of the human annotators. Did you find that those differentiations actually provided value when you used those features? Or would collapsing them into a single concept be just as useful? >> Lee Becker: I think collapsing some of those ones that were confused, if I look at a confusion matrix, might actually help in that case. Previously, when I had really wild disagreements, I used those disagreements to collapse the categories into the set that I have now. I found my annotators for some reason couldn't seem to differentiate cause, effect, and relation. So it's like, all right, causal relation. And so I think if it was annotated correctly, we would get more bang out of the discrimination, like saying, oh, at this point, we really want to ask about cause, and at this point, we really want to ask about effect. But the annotation is what it is. >>: I guess, sorry. >> Lee Becker: Go ahead. >>: If the tutors themselves can't even really tell, I'm wondering how important that distinction is. >> Lee Becker: Oh, you mean, so the tutors aren't exposed to the DISCUSS representation, but the linguistic annotators are, and is that what you're getting at? >>: Well, yeah. >> Lee Becker: Yeah. >>: I don't know what this says about that, but [indiscernible]. >> Lee Becker: It means I've left some very obvious questions open. >>: You started to say something in your penultimate slide about kind of generalizing dialogue moves, maybe beyond [indiscernible] more about that. >> Lee Becker: So to me, dialogue isn't necessarily just what's spoken, but the action you take. And so I think it might be that we can generalize not just to what questions we present, but what material we present at a given point in time. So it might be, you know, we want to go from this concept and traverse over to here and here in this order. So it's a combination of how do I package up the information, but also what information do I give and how do I give it. Maybe it's more important to give a visual at this point in time than to give the speech. That's one way I'm thinking of it. The other way is, I think there are probably certain strategies with certain classes of concepts where you might want to traverse down in a specific way. Like maybe with this concept, you really need to start bottom-up and go with very detailed questions and generalize, whereas with others you might start with open-ended questions and drive down. Any other questions? All right. Thanks.