>> Sumit Basu: So hi, everyone. I'm happy to have Lee Becker here for giving
us, you know, really interesting talk on what he's been working on the last few
years. Lee has a really interesting history. So he's been very passionate
about education for, I think, his whole life. He actually spent a year in
Indonesia teaching high school students over there, which I think led to some
of his inspiration to work on this stuff in the technology sphere.
Before that, he actually worked as a software developer at Intel and HP, making
various kinds of tools. But since then, you know, being at the University of
Colorado, he's doing a dual Ph.D. in CS and CogSci, computer science and
cognitive science.
>> Lee Becker:
It's a two for one.
>> Sumit Basu: Two for one, not bad. And he's been focusing mostly on looking
at how dialogue can be used in educational systems and how improving the nature
of dialogue and different kinds of things in dialogue can actually improve the
educational experience and the tutoring experience. So he's going to talk
about some of that work here today. And looking forward to it.
>> Lee Becker: Cool. Thanks for the awesome introduction. So I won't talk
too much about this. I'm mainly going to be talking about asking questions
within the context of tutorial or educational dialogues. Before I get too far
into it, I want to just flesh out a little bit about my research focus.
And I work in a field that maybe sometimes is called learning science or
sometimes is called AI in Ed, artificial intelligence in education. And it's
where we try to take all the cool stuff that's coming out of AI and NLP, machine
learning, and use it to automate the process of education or also use it to
help us understand more about education.
And like Sumit said, over the past few years, I've been working in something
more specific in the intelligent tutoring systems domain where this is a screen
shot from our tutoring system, where a student interacts through dialogue,
Siri-like dialogue and they see a floating head who is their tutor, and various
multimedia visuals.
And what I really find exciting about having these kinds of systems and
actually getting to deploy these systems on real students is that with enough
of it going to a broad enough audience, we can start to investigate the
phenomena underlying learning and start to not only improve how well these
systems actually perform in doing the teaching, but actually tear apart the
process.
And so with learning gain, we can start to understand maybe what concepts are
important, what might lead a student to that yes, aha moment, versus a more
frustrated moment. And also, start to learn a bit about the tutoring
strategies that underlie things.
Higher level overview, like I said, over the past few years, I've been working
in intelligent tutoring systems and in the dialogue space and trying to
understand, like, how can we improve the dialogue or what do we need to do to
really understand what's going on in it. And tangential to this area, it's
brought me into an interesting question generation, and I've been involved in
the community and the workshops related to that.
Last summer, Lucy and Sumit mentored me, and we have an upcoming paper on
automatically generating fill-in-the-blank questions.
Another area, because I do work in NLP: I've been working this past year on
relation discovery and other information extraction from clinical notes,
trying to drive towards improving the whole process of understanding what's
going on with patients.
So to give you a task right away, imagine you're a tutor, and you're trying to
teach a student some material. In this case, we're going to be talking about
basic circuits. Pretend the student is maybe in grades 3 through 5. So about
8 to 11 years old. And we're going to use this visual to drive the
conversation. So you see there's a battery, a light bulb, some wires, and a
circuit board. And this is actually something that they had played with
previously in class.
And so as a tutor, your job is to kind of lead them through a conversation
about this, amongst other topics. And so you have this history, this dialogue
history. Roll over the D-cell in this picture. What can you tell me about
this? So there's the tutor, the student. The student responds, the D-cell is
the source of power. The tutor says, let's talk about wires. What's up with
those?
Student, wires are able to take energy from the D-cell and attach it to the
light bulb. Now, imagine we were to pause and to come in right in this
conversation and you, as the tutor, have to pick what's next.
So if you're given a list of candidate questions, what is the next best
question, or what is a good, appropriate question to ask at this point in time.
So I just want to take kind of a quiz or poll amongst the audience and see what
the rest of you think.
So we have: one, what about the light bulb? Tell me about that component too.
Two, you mention that the wires attach the D-cell to the light bulb. Talk to me
a bit more about that. Three, you mention that the wires attach to the D-cell.
Which parts of the D-cell do they attach to? So this is kind of a location.
Four, you said that the wires take electricity to the light bulb. How do the
wires do that? And five, so the wires connect the battery to the light bulb.
What happens when all the components are connected together?
So quick poll, how many for question one? None. Two? Got a couple there.
Three, four. Three? About the same. Four? About that. And five. So it's
kind of split all over the place. And I don't know if there's a right answer,
but the human tutors that I had that were experts in this domain and experts in
teaching this picked number five.
>>:
Yes!
>> Lee Becker: Good job, those of you who picked number five. You too could
be a tutor. And so you might wonder, well, why is that? What factors are
going into that. And what do we actually need to know to make this decision?
And there's a lot of things going on under the hood. It might be that if you
pick certain things, you might be partial to certain keywords or certain types
of vocabulary. Maybe if you're like really simple, you just think oh, this one
has a lot of words, this is good.
There might be other factors that come in if you take into account the dialogue
history and look at what the student's doing. If we look through here, we see
that the student is pretty good at giving a response. They ask a question and
the student gives the function right away, and then the tutor asks another
question and there's good uptake. There's good engagement and response from
this student.
And so I'm thinking that the rationale behind why the tutors picked question
number five has to do more with they see that the student is on this, and they
don't want to maybe grind on a single point, but maybe use the momentum to
carry them forward.
And so to model this, I think you really need to understand not just what are
the words and what's going on here at a low level, but what is the actual
action taken by the dialogue moves and by the questions that are being asked.
So the outline for the rest of the talk: I'm going to give you a bit of
background, just briefly, about tutoring and intelligent tutoring systems and
dialogue, and talk a little bit about the tutoring system that we've built over
the past few years as the context for doing these explorations into this
process of asking questions and ranking questions.
I'll talk to you more about this dialogue modeling with DISCUSS, the dialogue
act or dialogue move representation that I've created that helps us to
understand what is going on under the hood with the action of the dialogue.
And then we'll apply it to this task of actually ranking questions in context
and then we'll close with some closing thoughts. Of course, we'd close with
closing thoughts anyway.
On to some background. So you might think, oh, why do we care about tutoring,
what's important about that. And I think part of it is there are probably a
lot of frustrated students out there, and it's not just education is great,
rah-rah. There's actually a problem: when we look at recent studies, like, 34
percent of fourth graders and a fifth of 12th graders only show proficiency in
science, which, if you think about all the jobs we do, we need more than
proficiency. If you look at the advanced level, that's really small.
>>:
What are the different levels?
>> Lee Becker: I think there's like poor, proficient, it's like binned into
left of the median, median, and right of the median.
>>:
Is there anything between proficient and advanced, or after advanced?
>> Lee Becker: I don't believe there is. This is the top tier. This is like
passing and so, like, maybe things have changed in the past three years and
we've radically fixed education in that time. But I'm speculating that things
are probably about the same.
>>: So this is implying that about 65% of fourth graders are not even
proficient?
>> Lee Becker: Yes, exactly.
>>: It's the opposite of Lake Wobegon.
>> Lee Becker: It's not often you get a Prairie Home Companion reference in
these talks. But there is some hope, and there have been some studies in the
past. This is an often-cited study where, if we maybe focus more on being able
to tutor and have this more focused kind of remediation or more focused
interaction, we can get as much as a two sigma gain. And a two sigma gain means
that they may be going up two letter grades, from a C to an A, or something
along those lines.
So you think, okay, tutoring's good, but why do we need machines, and why do we
need to strap a kid into a room with a computer and solve education that way?
I don't think maybe alone -- this isn't the key alone to solving education, but
it's definitely a useful tool. And I think the big argument amongst anything
else really for intelligent tutoring systems and all of these educational
software and different types of online learning is that we want to get towards
what the CRA's calling a teacher for every student. There's a scalability
that we get here that you can't get with human tutors.
There aren't enough experts out there to sit and have one-on-one conversations
with every student in every classroom. So if you could imagine bringing it out
to, like, web scale, anyone could learn, and they'd have an opportunity to work
on not just what they do in the classroom, but maybe to address what their
problems are.
And past studies show that intelligent tutoring systems are an effective means
of educating students. We get the one sigma, so there's still room for
improvement, maybe getting up to the mythical two sigma that we hope to
achieve.
>>:
So how big is sigma in this case?
>> Lee Becker: It's usually a letter grade. So if a student's getting a C,
one sigma would bring him to a B-level of like on their final test or whatnot.
And so, okay, intelligent tutoring systems, that seems scalable. But why
dialogue? Why should we care about doing this? Why can't we just, like, give
them a bunch of problems online?
And I think the interesting thing that we get with dialogue that you don't get
maybe with other modes of interaction, especially with young children that
aren't able to type and do things like that yet is you have an opportunity for
self-expression. And there's what we call the interaction hypothesis in that
getting to interact and think about what you're saying reinforces your
understanding.
From a more CS, computer science, point of view, intelligent tutoring systems
are really a fertile test bed for a lot of the AI we do. Yeah, Matt?
>>: I'm sorry, can you go back a slide? So the ITS systems that exist give a
one sigma gain. Is that what that --
>> Lee Becker: Yeah, that's what it's saying, at least for these incarnations:
when they tested them on their students and then they took the post-test, they
saw a one sigma gain.
>>: Is there some reason we don't see those in, like, wide deployment? I mean,
one sigma is already really good, right?
>> Lee Becker: You actually do see a lot of these systems in wide deployment.
And these aren't necessarily all dialogue-based systems. But they're often
tied to a particular curriculum, or, like, some of these studies were done on
maybe physics students at this particular university, and so the tutors are
kind of customized.
There is one company, Carnegie Learning, that has what they call a cognitive
tutor. It's not dialogue-based, but does math tutoring and that's widely
deployed in, like, all of Pittsburgh and Pennsylvania and all throughout the
U.S.
>>:
And does that more general system give as good of gains?
>> Lee Becker: They claim to. I need to double check the citation on that,
but I think they say that it gives some sort of positive learning effect.
So like I was saying, this area not only is, like, feel good, we're helping
education, it also lets us investigate the things we think are cool and fun to
work on. I'm focused more on dialogue and planning, but you can imagine there
are a lot of issues with natural language understanding. So not just semantic
similarity, but is a student right or wrong, and what might they have backwards.
So there are more subtle problems there.
I think in terms of like maybe working towards equipping these systems, you
might have a lot to explore in concept and misconception discovery. So
knowing, like, well if I have a body of text, what are the important concepts,
or what are also the concepts that people might get confused or wrong.
Because these things are often customized. I think there's also incredible
opportunities for domain adaptation. Can we take a tutor that we've learned
behaviors for in chemistry and then make a biology tutor or physics tutor.
And then after the fact, once we have all of this, there's really great
opportunities for educational data mining to really understand what are
students doing right and wrong, what questions might they be missing, what
behaviors are useful and really tease apart the learning process.
So like I said before, this is our tutor, or a screen shot from our tutor, and
it's called My Science Tutor, MyST. We've been working on it for the past few
years, and over in the right-hand corner is Marnie, and she's not too bad in
the uncanny valley. She still scares me a little bit. But students interact
with her. They wear headphones and have a microphone piece and interact
through automatic speech recognition when it's the full system. And the tutor
presents a series of multimedia visuals and also, using prerecorded or
synthesized speech depending on the version, interacts with the students,
gives them prompts, and tries to encourage self-explanation.
And so the purpose of this, it's not to bring about some singularity in
education. It's really to supplement in-class instruction and it's not that we
want to replace teachers, but we want to give students who are maybe struggling
in class, other venues that they can continue to refine their understanding.
And the idea is not to test or assess. It's not just giving them a bunch of
questions and oh, they only got a 50 percent. They need to do more. It's
really just providing them a comfortable environment to discuss and reflect on
what they've learned in class. And the educational approach we use is called
Questioning the Author. It's a pedagogy created by Beck and McKeown, originally
used for reading comprehension, trying to get the students to ask, well, what
is the author trying to say here?
In this venue, we've turned it into more questioning the science or questioning
the data. What is it that they're observing that tells them what they
understand? And the curriculum driving all this is the Full Option Science
System, FOSS, which is widely deployed throughout California, Colorado, and
the rest of the U.S.
You think, okay, well people have been working on dialogue for a long time.
What's special about this? And if you compare tutorial dialogue to maybe a
more standard flavor that's actually deployed like IVR systems or airline or
hotel reservations, you have kind of inherent differences in the audience and
how you would go about approaching this.
With a task-oriented dialogue like the reservation system, the point is to get
the user to complete a task. Whereas in the tutorial dialogue, it's not that
the user's trying to get something done. Well, some students may just want to
survive the 15 minutes they're subjected to. Others may actually want to learn.
So the point is, like, maybe not so much what the user cares about, but this
process of bringing about understanding.
And so like I said, there's different motivations. The person trying to
complete a task has an intrinsic motivation that says, I want to get my hotel
and airline reservations. I'm not going to give up until that's done. Whereas
the student could easily get bored or they might really want to go with it and
you kind of have to balance for both.
And I would argue that there's a more concrete measure of success with these
more straightforward systems. If you got it done in 30 seconds, that's pretty
good. If you've got the task done at all, that's probably also a good measure
of whether the system's working. You could also do polls of user satisfaction
or decide, well, we needed to get human in the loop. The system's obviously
not working.
Compared to a tutorial dialogue, really the long-term goal is probably learning
gains, if we even trust those tests to some extent. And then, do we just test
what they've done before and after the session? Is it what they retain over
the long term?
Is user satisfaction meaningful? Even if the student said I had a great time,
did they really learn anything, or maybe it's coverage of material. Are we
covering the right material. So it's kind of interesting evaluation going on
there.
And similarly, I would argue that the penalty for poor behavior is high in the
tutoring system, whereas with poor behavior in the task-oriented system,
because the user's motivated, they might be just motivated enough to continue
staying with it.
I think the big challenge for intelligent tutoring systems, and you alluded to
this earlier, is why aren't these more widely deployed. It has a lot to do
with scalability. And so that's this middle bullet here. It takes a lot of
effort to curate what knowledge you want to cover and to author dialogues
and to create all the behaviors for this system.
And right now, like, it may take a few weeks for one lesson. How are we going
to scale up for an entire textbook's worth of lessons?
This top bullet, you also need to be robust to different users. You'll have
people of varying skill levels, of different motivations and so you're going to
have to be able to be personalized and adaptive. And I also think that, after
the fact, there's not really a clear way to look at and compare strategies. We
might see this end goal, but can we really look at what's going on and be able
to tell why one session is better than another?
And so to give you a flavor of what might go on and how, here's some dialogue.
So here's the tutor, and they ask, what do you think the paper clip is made of
to be attracted to the magnet? The student gives a response: magnetic force
attracts the paper clip. Tutor: think about the kinds of things that attract
to magnets. How do they connect to paper clips? Student: it's attracted to
the magnet because it's iron or steel.
So throughout this thing, throughout this dialogue or this snippet here, the
tutor really is trying to get at this notion of iron or steel. And so this is
maybe an easier, less adversarial example. There's good uptake at this point in
the conversation. The tutor gives a question and the student is pretty
responsive. And the student is kind of giving them what they want. Whether
that's good or bad still kind of remains to be seen. But you can see that
there's kind of a good level of engagement here. A challenge for other students
is, here's a
tutor that asks a question and the student just says, I don't know. I don't
understand and I don't know.
And so the question is, well, do you just like say all right and move on, or
maybe there's some strategy. Maybe what this tutor has to do, they have to
keep pushing and say, here's another question and the student says, I don't
know.
So it's a matter of maybe backing off from more open-ended questions to maybe a
more specific question, where it's like, here, look at this, what happened.
And at this point, the student gets uptake, which then allows us to maybe, as a
tutor, to move the conversation forward. It's like now we have something to
talk about, here's what we can do.
So maybe, even if the student's not giving you the answers you want, there's
still something that you can do to give them a good learning experience.
Of course, there are others that are just out there, and in this case I think
this may be a problem between the student and tutor. And these actually came
from Wizard of Oz experiments, where there's a human in the loop. It's not just
the computer not understanding. This is a human struggling with this problem.
So they ask a question and the student answers, and then they ask another
question. You already asked me that. And then the tutor, well, the tutor wants
them to talk about this point for some reason, and so they ask it again. The
student is, yeah, I already said that. Yeah.
And so the tutor is like, all right, I'm not backing off. Just answer. And so
you might think there's something going on under the hood that differentiates
these. And if we're to make systems that are robust like this, it takes a lot
of work.
So back to the authoring side of this, most of the systems out there have what
we call a finite state machine under the hood. It's just a graph of -- here's
what we're trying to ask them about. And if they say this, we go down
this path. If they say this, we go down this path. It's very manual.
Maybe they have a good natural language understanding, and they can say, well,
if we have this much confidence, we go down this path. But as you can see, as
the lesson expands and the range of possible things grows, it gets to be a lot
of work to author, curate, tune and change these behaviors.
So this kind of approach is pretty common in a lot of these different tutors.
Our approach in MyST is similar when you break it apart. But instead of having
a rigid, finite state machine, we use the frames and slots approach, where we
have the information we're trying to fill. So in a hotel domain, it might be
time, location, et cetera.
Here we've broken down a learning concept into sub-parts that we're trying
to entail, and these parts have prompts associated with them that you would ask
the student. So if they didn't say anything about flow, you might continue down
that path. And we have different rules for if they get things backward. Like in
this case, we're trying to get the student to really understand that
electricity flows from the negative terminal to the positive terminal. If they
have that switched, we might have to say, well, check it out. Look again. Do
you think it's really doing that?
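To make the frame-and-slot idea concrete, here's a minimal sketch; the concept,
slot names, and prompts are invented for illustration and are not the actual
MyST implementation.

```python
# Hypothetical frame-and-slot bookkeeping for a tutoring dialogue manager.
from dataclasses import dataclass, field

@dataclass
class Slot:
    name: str
    prompt: str          # asked while the slot is still unfilled
    filled: bool = False

@dataclass
class Frame:
    concept: str
    slots: list = field(default_factory=list)

    def next_prompt(self):
        # Keep prompting for the first sub-part the student hasn't covered.
        for slot in self.slots:
            if not slot.filled:
                return slot.prompt
        return None  # frame exhausted: recap and move on

flow = Frame("electricity flow", slots=[
    Slot("source", "What do you think makes the electricity move?"),
    Slot("direction", "Which terminal does the electricity flow from?"),
])
flow.slots[0].filled = True   # the student already mentioned the D-cell
print(flow.next_prompt())     # -> asks about direction next
```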
But like with the other thing --
>>: What happens if they say something nutty, like gravity pulls the
electricity from the negative to the positive?
>> Lee Becker: Our system is pretty persistent. It keeps going down. Usually,
the way it's organized, we'll ask them an overview question. We don't tend to
acknowledge that they said anything totally wrong. So it's very easy for a
student to do what they call gaming the system, so they could just
say yes, no, and then just kind of exhaust the system. We've been fortunate
enough that the students seem to think that the system is smart enough, and
they don't try to game it.
I think if we had middle schoolers instead of elementary school students, we'd
have to put in more robust mechanisms to say, like, oh, you know, the speed
train is this fast or there are other factors going on there.
>>: [indiscernible] cause of reinforcement learning where there is something
that is right.
>> Lee Becker: We don't do anything with reinforcement learning. There are
other tutoring systems I'll talk about that try to use the reinforcement
learning to maximize some final goal. But ours is kind of the simplest thing:
because we have these very open-ended dialogues, we just kind of go down
and exhaust. And if we exhaust a frame, we recap and say okay, it's good
enough. The student understands or doesn't understand, but we're going to give
them enough to come away with something.
And so the broad research questions that I think we get when we look at all
this authoring and these tutoring systems are: what are the mechanisms needed
to support more intelligent behavior, if we want to have more responsive and
robust dialogue management; if we want to actually induce behaviors from
corpora, what are we going to need; how can we drive towards a more
personalized or more human-like interaction; and then, after the fact, how
could we possibly do analysis of these tutoring sessions?
And what I'm going to argue is that the representation, the underlying
linguistic representation at the dialogue level is what's going to really be
able to enable us to do more interesting things.
So now I'll talk about modeling dialogue with DISCUSS. Like much of what we do
in NLP, you need a linguistic representation that you can possibly learn or use
to abstract certain actions. So I want something that abstracts kind of the
function or the action, the function and the content.
So the high level dialogue action, what is going on. The function, how is it
being spoken about. And the content, not like specifically the word, but how
is it that they're talking about the concepts in a specific domain.
And kind of in searching for a taxonomy, I had kind of these requirements in
mind. I wanted something that was interpretable without words. So if I took
away the words of the dialogue, could I still look at the annotation or the
representation and get a gist for, okay, this is, they seem to be responding to
one another or this is going down some different path.
And so I wanted something that would allow post-tutorial analysis and I also
wanted it to be useful enough that it wasn't just like going to be a corpus
linguistic study but allow me to use them as features for some sort of behavior
or learning some sort of behavior.
And I also had in mind that maybe if we could go to the next step, this
representation could serve as an intermediate representation to allow fully
automatic question generation.
So when I was looking for these dialogue acts, I started on a literature review
of all of the different works related to dialogue acts and tutorial moves. And
I found that most everything seems to cover this coarse space, and I'll explain
a little more about that. And so they kind of get at the action at some high
level. There are some with the rhetorical forms, so some taxonomies have a bit
about the function, and from DISCOUNT I drew a lot of inspiration, because
that's also a learning-oriented one, as well as from the question taxonomies.
The question taxonomies' drawback is they're less focused on dialogue and more
on classifying different types of questions.
And so as I'm going through this, I started thinking, well, I certainly just
can't use the words in and of themselves. If I look at this, I don't see the
action. But if I use the high level dialogue acts, the problem is I can't tell
what's going on. I have two tutoring sessions, it's like question, answer,
question, answer. It's like oh, that's a good session. I can't do that.
And so my approach to this was to use what I call DISCUSS, the Dialogue Schema
Unifying Speech and Semantics. And it's kind of a long name, but the original
name was DISTRESS, and my advisor said that's too negative. Come back and do
something else. And DISTRESS was supposed to be a response to DAMSL, which is a
famous dialogue act taxonomy.
But anyway, to drive at what I was saying, the action, the function and the
content, I have three dimensions. There's the dialogue act dimension, so you do
still see ask and answer, but also maybe more tutorial-specific moves.
Things like revoice. Revoice is something used in Questioning the Author, but
could be used in any variety of tutoring modes where you're summarizing what
the student says to kind of move the conversation forward.
Whereas a mark is a similar act, but you're highlighting keywords. So a revoice
would be, oh, it sounds like you're saying electricity is flowing from the
negative side, and then asking a question, whereas a mark would be more direct.
It would be, oh, you said electricity. Let's talk about that.
So these are kind of grounding acts that help the tutor show that they're
receptive to what the student is saying.
In the middle layer is the rhetorical form. And this kind of refines the action
here, when it's appropriate. So it might be a question that's asking them to
describe, or a question asking them to define. It's like, what is the question
really trying to achieve? What is the function of this question? And then,
getting at the content, the predicate type.
So it might be they're talking about some type of cause and effect relationship
or a function or process or an observation. And so these acts were inspired
partially by going through the dialogues and seeing what we actually had come
up with, and also by what I saw in the literature.
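As a rough sketch of what a DISCUSS label looks like as a data structure (the
field values here are examples of the kinds of labels described, not the
official tag set):

```python
# One DISCUSS label: (dialogue act, rhetorical form, predicate type).
from typing import NamedTuple, Optional

class DiscussTuple(NamedTuple):
    dialogue_act: str               # e.g. "Ask", "Answer", "Revoice", "Mark"
    rhetorical_form: Optional[str]  # e.g. "Describe", "Define", "Elaborate"
    predicate_type: Optional[str]   # e.g. "Function", "Process", "CausalRelation"

turn = DiscussTuple("Ask", "Elaborate", "Process")
# Strip away the words and a session becomes a sequence of such tuples,
# which is what makes the later analysis and feature extraction possible.
```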
And so to give you an idea of how this actually looks, here's a dialogue, and,
well, if you just look at the words, it's going to get cumbersome, but let's
imagine we just jump straight to the dialogue act representation. We can kind
of gist what's going on here. And so at this point, number one, tutor is
saying, okay, list an entity. So what do you see here? And the student lists
some entities that they see in the visual.
And then the tutor gives some sort of positive feedback, and then they describe
the entities a little more and now they're really trying to get them to talk
about what is the function of this thing you're looking at.
So it might be a circuit or a switch or a battery or something, and then the
student just responds with an attribute, like batteries are red or something
like that. And so then the tutor -- wait, I jumped ahead. But in the same
case, the student just like lists what they see and they list some attributes.
So then the tutor maybe backs off a little bit and talks about the entities.
And then student still talks about attributes. So they think okay, I'm going
to try function. And so you can kind of see going back and forth as the
student's stuck on attributes, the tutor's moving forward with functions here.
What I think the motivation behind this is, is if we look at this across
different lesson domains, we can kind of get the same moves with different
content. So
here's a bunch of different questions and answers. And you can see that we
don't just have a single tuple per question. It's like it exhibits different
properties. In this case, it has a mark as well as ask and elaborate process.
But this is for circuits. But now if we change it to magnets, I didn't change
the labels. But now we have different actions or different utterances that
actually manifest themselves in the lesson.
And so we can kind of start to see, well, maybe there are strategies that are
generalizable across both domains.
Of course, a representation alone without data is not any use, so we
endeavored to have some linguists help me out. I hired two linguists, and we
annotated 122 transcripts. These were from Wizard of Oz sessions, so they
weren't the actual system.
This was a human tutor controlling our system and a student talking into the
microphone, and we had manually transcribed speech in this case. And we
coded it up for ten different units in magnetism, electricity, tuning, and
measurements. We got close to 6,000 turns annotated in total, and 15
percent of it was double annotated, just to give us an idea of how difficult
the task is.
And so the Kappa here, it shows modest to fair agreement. You can see that as
we go down the hierarchy, it gets harder. So dialogue act, it's pretty easy to
see this is asking a question or it's answering a question. The difficulty
might come in as, are they doing a revoice or are they doing a mark. What
other things are they doing beside asking a question.
As we go down farther, it gets more ambiguous. Is this really asking for a
description or definition, where those things might seem a little close. Going
down to the predicate type, it gets even more ambiguous, because in some cases,
it might not be clear whether they're talking about a process, an observation,
or a causal relation. And so that's part of the difficulty for some of these
kappas going down.
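For readers who want to reproduce this kind of per-dimension agreement check,
here is a minimal sketch with invented labels; the kappas in the talk come from
the real doubly-annotated transcripts.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' dialogue-act labels for the same five turns (invented).
annotator_1 = ["Ask", "Answer", "Revoice", "Ask", "Mark"]
annotator_2 = ["Ask", "Answer", "Mark", "Ask", "Mark"]

# Cohen's kappa corrects raw agreement for chance; the deeper DISCUSS
# dimensions (rhetorical form, predicate type) tend to score lower
# because their label distinctions are more ambiguous.
print(cohen_kappa_score(annotator_1, annotator_2))
```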
>>:
So that was two judges?
>> Lee Becker: These were two linguists, and, well, it was like 15 percent of
the dialogues that were selected, and --
>>:
Oh, I see, so it's overlapping.
>>: So this approach, how big can it go in terms of defining the schema?
Like, for example, tutoring algebra, is it too big or --
>> Lee Becker: I think algebra might be a stretch, because you don't have the
same kind of conceptual knowledge, but I imagine for a lot of types of science
where you do have processes and observations and things that you see in class
related to these labs, I think it generalizes in that sense.
And whereas, like, I think with math, you might need a different vocabulary,
like axioms and theorems and the actions that you would take to solve a step.
I think this is probably more useful for conceptual knowledge and trying to
maybe supplement reading in some way.
And so I kind of talked about these before, but these are really the
motivations for DISCUSS: I want to be able to discriminate one utterance from
another, and I wanted to explore what granularity I can get at, like the
speaker intent and what the content is. And how can I use it for something
useful? So I'm mainly talking about learning the decision making today, but
I'm also currently looking at characterizing these interactions as a whole.
And so now on to the task of ranking questions in context. So back to what we
did at the beginning of the lecture or talk or whatnot, we have this picture
again, and you already saw these things. And so I want to go into more detail
about how we use DISCUSS and other features to actually go about ranking these
questions.
So given a set of candidate questions, where do we go next, what is a good
follow-up question? And like I said at the beginning, there might be a lot of
factors influencing the tutor's decision. You might have biases or opinions
about certain words and phrasings, like, oh, I really don't like that kind of
vocabulary, or that was really odd syntax to talk to a third grader in.
There might also be preferences about what you talk about as far as subject
matter. It's like oh, this is less important. Let's focus on this. There are
factors that come in from the dialogue context. So we have this history,
certain things might be more appropriate.
And, of course, when we have tutors who are trained in the same area, they
might have different pedagogical philosophies. Some might be really keen on
using the visuals as much as possible, some might like more directed questions
versus open-ended questions. And so that could influence it. And, of course, the
student understanding always plays a role.
Is the student not answering anything, or are they getting it? And you might
ask different questions depending on what you get there.
So the driving questions, as far as ranking and choosing these follow-up
questions: well, I'll get into this in a second, but how might we go about
learning this? Can we use preference data, so, after the fact, can we use some
sort of ratings?
And what features are actually needed. So we talked about those potential
factors. So how can we extract that from the words as well as from the
representation.
And what can this tell us about tutoring and about our tutors as we go through
this process.
Before that, I want to talk a little bit about related work in tutorial move
selection and some in dialogue move selection.
And so someone asked earlier about reinforcement learning, and there has been
work where they've looked at trying to optimize tutoring behavior for learning
gains using reinforcement learning. But the decisions they were looking at
were very minuscule and didn't really get at the full range of questions you
can ask. Chi was looking at, should I elicit, or should I tell? So it's
basically saying, should I stay on this point, or should I tell them and
move on?
And so you can see where reinforcement learning would be useful in that sense.
But if you start to, like, say should I ask him about a definition or a
description, should I talk about this aspect of the learning goal or that
aspect of the learning goal, the state space explodes pretty large.
Similarly, you would see similar behavior with the HMMs. And Kristy Boyer did
work in having a dialogue act taxonomy and trying to predict dialogue acts.
But again, she was using pretty coarse dialogue acts, so it might be giving
feedback, asking and answering, and not really driving at what kinds of
questions we are asking at a specific point in time.
This work kind of draws from the tradition of what's called sentence planning
in the natural language generation community. Where the point is, in a
dialogue, you're trying to generate a representation which you then use to
create the surface form or the actual words.
And instead of doing that in my approach, I think more of how do the
representations inform the features that we can then use to rank. But you
could imagine, I could also rank representations at some point.
So the approach is not surprising; it's pretty straightforward. We're going to
treat question ranking as a supervised machine learning task, which means we
need training data and labels. And in this case, the training data are going
to be, given a context, a dialogue history, we want a set of questions, and
we're going to then extract features from these questions and pair them with
some sort of scores, whether it's rankings or raw scores, and then we're going
to learn different models.
I have what we call a general model, where we average the scores across all
raters, or average rating -- rankings across all raters or we have the
individual model, where we learn what are the preferences for an individual.
And so the data, again: we have the 122 transcripts that we used, so we used
that corpus for building dialogue models. But specifically, we look at 205
contexts extracted from these transcripts, from a set of 30 transcripts in
particular. And we had -- does it say? Well, what we did was we had manually
authored questions for these contexts, as well as, when appropriate, a
question extracted from the dialogue, if it wasn't like some meta statement
and if it seemed like it was a follow-up question.
And then we took these questions and we annotated them with the DISCUSS
representation. And then we put these questions and the full dialogue history,
which I'll show, in front of raters. These raters are trained expert tutors
that had previously worked on our project, and then we had them do the rating.
>>:
[indiscernible].
>> Lee Becker: It means that I have the original dialogue, and so I pause it,
and I take that next turn out and just use that, if it seems like it's a
question. And then you might extract one and then you would author, like, five
other candidate questions. Because, without writing my own awesome question
generation system from scratch, I didn't have a good way of permuting the
DISCUSS space and dialogue space.
A little more about the authoring and the approach. I hired a linguist. We
hire a lot of linguists in Colorado, and I trained him in questioning the
author and in MyST and in the FOSS and the kind of questions we see and the
kind of lessons.
And I said, okay, you're free to write any question, but take my guidelines
into consideration and really think about how you might change the tactics.
Would you want to add a revoice at this point? Would you want to mark? Would
you ask him to elaborate? And like think about the learning goals. Do I want
to focus on this aspect of the learning goal or a different aspect or maybe a
different learning goal entirely. And then also because it's a linguist, go
ahead and do some variation on lexical and syntactic structure, but mainly take
into account DISCUSS. So maybe you might switch from asking a definition
question about a causal relation to a definition question about a process or a
description of a process and kind of permuting that space.
And so we went with one author just to be more consistent, as opposed to having
multiple tutors author questions and ending up with ten very similar questions.
And like I said before, when appropriate, we extracted questions from the
original dialogue context.
So this is what the author saw. The question author saw the learning goals, so
they knew what it was that we were trying to elicit from the students in this
lesson. We also see the dialogue history up until this question-asking point,
and then, following the guidelines, he was free to write whatever questions
he thought were appropriate.
As far as rating, we hired four tutors that had served as Wizards when we were
doing the Wizard of Oz studies early on in the development of MyST, and we
asked them to make these decisions as far as rating questions in this context.
And we gave them a similar setup so again, they saw the learning goals, they
saw the dialogue history and we asked them to rate them simultaneously. Part
of this is we wanted to not have them have like some sort of drift, just rating
things in isolation.
And the other thing is it allowed them to see this is obviously better than
another and I gave them a wide range of one through ten and allowed them to
pick ties if they thought, like, I can't really decide between these two.
These are equally good. I could ask either one. I didn't feel like it was
necessary to force a strict ranking in this case.
And so to assess agreement, we use a measure often used in information
retrieval called Kendall's Tau. It's a statistic that ranges from negative
one, perfect disagreement, to one, perfect agreement. And the probabilistic
interpretation is, if you take tau, you get the probability of concordant
pairs minus the probability of discordant pairs, like how often they agree on
individual pairs versus how often they disagree on individual pairs.
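A quick sketch of computing Kendall's Tau on two hypothetical raters' scores
for the same five candidate questions (the scores are invented):

```python
from scipy.stats import kendalltau

rater_a = [7, 4, 8, 5, 9]   # 1-10 ratings; tau only cares about the ordering
rater_b = [6, 6, 7, 4, 10]  # ties are allowed; scipy uses the tau-b variant

tau, p_value = kendalltau(rater_a, rater_b)
print(tau)  # +1 = perfect agreement, -1 = perfect disagreement
```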
And so you might think, well, why ranking? Why not just learn the scores
directly. I think part of it is that different raters have different scales.
Someone might be a 7 through 10 person. Someone might be a 1 through 10
person. And so I'm really more interested in which question would you pick
over another question.
And so the mechanism we're going to use is we're going to convert their scores
into a ranked list and then assess agreement in that sense. So here's a table
of like how the different raters agree. And this bottom table is a couple
months after they did their rating, I had them go back and redo ratings on a
small set and see how well they agreed even with themselves.
So obviously, they agree with themselves more than anything else. And what you
see here is, like, it really is kind of dependent on who the rater is paired
with, and it tells me that different people are keying in on different things.
Nonetheless, I took an average across all of it, and this is what taking all
people and all rankings gets us: 0.148, which is positive agreement. It's not
huge, but it kind of shows the limits of how well people -- did you have a
question, John? -- agree with one another.
So to actually learn and to actually do question ranking, we use a pretty
standard approach of learning a preference function. We're going to take a
feature vector that we extract from a question and possibly its context, and
another question, take the delta, and then train a binary classifier to tell
me this question is better than that question. And we're going to run it both
ways. And we're going to build a win table, and the results of these wins get
us the canonical ranking as far as that goes.
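A minimal sketch of that pairwise setup, with invented features and scores;
the actual system's features and classifier details may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_data(features, scores):
    # Build (delta vector, preferred?) examples from every ordered pair.
    X, y = [], []
    for i in range(len(features)):
        for j in range(len(features)):
            if i != j and scores[i] != scores[j]:
                X.append(features[i] - features[j])
                y.append(int(scores[i] > scores[j]))
    return np.array(X), np.array(y)

feats = np.random.rand(5, 10)             # 5 candidate questions, 10 features
rater_scores = np.array([7, 4, 8, 5, 9])  # one rater's scores for them

clf = LogisticRegression().fit(*pairwise_data(feats, rater_scores))

# Win table: count predicted wins of each question over every other one,
# then rank by wins to get the canonical ordering.
wins = [sum(int(clf.predict((feats[i] - feats[j]).reshape(1, -1))[0])
            for j in range(5) if j != i) for i in range(5)]
ranking = list(np.argsort(wins)[::-1])    # best question first
```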
And so what actually goes into this? What features actually, or what
representation, if we're actually going to plug something into a classifier,
do we actually need? So at the lowest level, we need these surface form
features. Things that people might cue in on, or what we speculate that
different tutors might think are important.
And so question length: if something is overly verbose or overly terse, maybe
that has an influence. We also look at, like, WH questions. Maybe some tutors
really like what questions versus which questions. Of course, it takes on a
little different meaning in Questioning the Author because, you'll see in a
second, the wording is different. And we also wanted to take into account
maybe some syntactic variation with the part of speech tags.
And so here's like a question, and the feature vector we would get out of that
single question. So you would see, like, what's up with that. We naively just
take the WH and put that as a binary, but we also take the bag of part of
speech tags and just low-level, commonly used features in NLP.
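Sketching those surface-form features in code; the feature names are mine, and
NLTK may need its tokenizer and tagger data downloaded first:

```python
from collections import Counter
import nltk  # may require nltk.download("punkt") and the POS tagger data

WH_WORDS = {"what", "which", "who", "when", "where", "why", "how"}

def surface_features(question):
    tokens = nltk.word_tokenize(question.lower())
    feats = {"length": len(tokens)}
    for wh in WH_WORDS & set(tokens):
        feats[f"wh={wh}"] = 1                     # binary WH-word indicator
    for tag, n in Counter(t for _, t in nltk.pos_tag(tokens)).items():
        feats[f"pos={tag}"] = n                   # bag of part-of-speech tags
    return feats

print(surface_features("What's up with the wires?"))
```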
Going maybe a little more complex, we think that there's a process in
conversation and in dialogue called entrainment, where, as two people talk more
with one another, they tend to use more similar words and constructs, and if
they're really engaged with each other, they might have more of these words
overlapping.
So we wanted to capture that in a feature, just basic feature called lexical
similarity. And so we look at both the bag of words and the part of speech
tags and what kind of overlap and so you could take a similarity between the
previous student's turn and the question that you're trying to evaluate, and
see what kind of overlap you get. Or you could also look at how does this
relate to the learning goals. If we want to see, okay, are they talking about
the learning goal we're currently talking about, or is this question about
something else, which maybe indicates if it has a strong similarity that it
might mean that the preference for that would mean you want to move on.
And so how this might look is here's a question. Here's a previous student
turn, and what I mean by current learning goal is this is the description or,
in an ideal case, if a student said this, we might think that the student has
an understanding of that concept.
And so if I just look at the words like, oh, brighter and bright or dimmer and
dimmer, we can start to calculate just simple overlaps and throw that into our
vector for features.
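A sketch of that overlap computation; the example strings are paraphrased
stand-ins, not actual corpus text:

```python
def ngrams(text, n):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap(a, b, n=1):
    # Jaccard-style n-gram overlap between two utterances.
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 0.0

question = "Why does the bulb get dimmer when you add another bulb?"
learning_goal = "Adding bulbs in series makes each bulb dimmer."
print(overlap(question, learning_goal, n=1))  # unigram-overlap feature
print(overlap(question, learning_goal, n=2))  # bigram-overlap feature
```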
Getting more into the more complex behavior, we have the DISCUSS features. So
we have these turns --
>>: Can you just go back to the previous one? So if you can learn weights on
features, especially if you use bi-gram features, do you have training data to
do that and different kinds of conditions, dialogue states?
>> Lee Becker: So I don't actually extract word bi-grams, because that's going
to be too sparse. And so I only did, like, bi-gram overlap: what percentage of
the bi-grams in this overlap the bi-grams in that. Because I think, like you
said, it's too sparse to actually do that.
But given our domain, the vocabulary is regular enough that we expect that the
questions that the students' responses, they're going to be talking about
batteries and wires. So if the question's talking about batteries and wires
and the student's talking about batteries and wires, you'll see a high overlap
there.
>>: So if the student just repeats a question --
>> Lee Becker: What will happen?
>>: Yeah, I mean.
>> Lee Becker: In our system?
>>: Anyway.
>> Lee Becker: So you ask a question, and they say the exact same thing?
>>: Yeah.
>> Lee Becker: It might actually fill in a lot of the keywords, depending,
because we use a Phoenix semantic grammar that parses what they say, and then
it fills in the slots. So I think if the student was savvy enough, they might
be able to. But if you look at some of the questions, like they might only get
the simple things. They might get light bulb and dimmer, but they might not
get the relationship that we're trying to say, like that this gets dimmer when
this happens. Or --
>>: And which feature measures that? Which feature identifies these
relationships?
>> Lee Becker: Well, I'll get into -- I don't think there's a feature that
identifies that, but we do have a feature that takes into account what slots
they filled in the dialogue and then can contextualize that into a probability.
But we don't do any -- I don't use any of the natural language
understanding for that, and so I don't say, like, is this a -- have they
triggered, like, a misconception or anything. And I think that would be, like
if I had -- if I was to do a follow-up study, I would probably add in more of
these features, saying how right or how wrong are they, and how would that
influence the behavior.
But because our system is less about assessment, we have a very loose
definition of what's right and wrong.
So like I was saying, for DISCUSS, we extract kind of the bag of DISCUSS
features. So these questions have associated DISCUSS labels and tuples. And
we also have, I'll give an illustration of this, we can look at how closely,
like, the rhetorical form and the predicate type match between this turn and
the student's turn. And to kind of illustrate, here's a question. And here's
its associated DISCUSS act. We can see it has a revoice and ask and elaborate.
And we see a student turn. And so if we look, we can see, okay, we got a
revoice, binary one. Ask, one. Predicate type configuration, one. So those
are kind of straightforward to extract. The other one is we want to see
maybe how in step they are. And this is just a very coarse feature: oh,
the student's previous response was about observation and this question is
about configuration, so that's not a match there.
To maybe get a little more sophisticated behavior, we actually look at the
probabilities, DISCUSS transition probabilities over our corpus, and we can
see what is the probability of a question having this kind of DISCUSS tuple
given the previous student's turn having this DISCUSS tuple. And we can back
off from having the full tuple to maybe just the dialogue act and rhetorical
form, or maybe we want to look at the predicate type. And you can imagine this
could get at the sequence in which you might talk about a certain concept.
You might start with a visual, move to an observation, grind a bit asking
about some attribute, and finally talk about the process. And then, getting at
this natural language understanding component, we have a measure of what slots
are filled at a given point in the dialogue. So if we were asking about
electricity flowing from negative to positive, and the student said
electricity and flows, and the other two parts that are left out are negative
and positive, this half here, this probability would be, like, given a 50
percent fill of the current frame, what is the probability of asking this kind
of DISCUSS question at this point in time.
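A rough sketch of those transition-probability features with backoff; the
corpus counts and tuple values here are invented, not from the real corpus:

```python
from collections import Counter

transitions = Counter()  # (prev tuple, next tuple) -> count
prev_counts = Counter()  # prev tuple -> count

corpus = [  # toy (student turn, following tutor question) tuple pairs
    (("Answer", None, "Attribute"), ("Ask", "Elaborate", "Function")),
    (("Answer", None, "Attribute"), ("Ask", "Describe", "Function")),
    (("Answer", None, "Function"), ("Ask", "Explain", "CausalRelation")),
]
for prev, nxt in corpus:
    transitions[(prev, nxt)] += 1
    prev_counts[prev] += 1

def p_transition(prev, nxt):
    if prev_counts[prev]:  # full-tuple estimate when we have counts
        return transitions[(prev, nxt)] / prev_counts[prev]
    coarse = lambda t: t[:2]  # back off to dialogue act + rhetorical form
    num = sum(c for (p, n), c in transitions.items()
              if coarse(p) == coarse(prev) and coarse(n) == coarse(nxt))
    den = sum(c for p, c in prev_counts.items() if coarse(p) == coarse(prev))
    return num / den if den else 0.0

print(p_transition(("Answer", None, "Attribute"), ("Ask", "Elaborate", "Function")))
```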
So, evaluation: we're going to use a lot of measures. Probably the main one is
Kendall's Tau, which is the same one we used to measure agreement between the
tutors, and we train the system using cross-fold validation, where we hold out
three transcripts per fold, and each fold is a different lesson. So it might
be, this is magnetism and electricity, unit one; this is going to be the
evaluation, and we're going to train on the rest.
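That held-out-lesson scheme looks roughly like grouped cross-validation; a
sketch with invented data:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.random.rand(12, 10)        # pairwise delta features (invented)
y = np.random.randint(0, 2, 12)   # preference labels (invented)
lessons = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # lesson id per row

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=lessons):
    # Train on the other lessons; score Kendall's Tau on the held-out one.
    print("held-out lesson:", lessons[test_idx][0])
```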
And so this is the general model. This is when we took all the rankings from
all the tutors and just tried to create one model, and the big take-away here
is that if we go from the baseline features, which are the surface form and the
lexical similarity features and start to add more of the DISCUSS features, we
get a bump in improvement and it's significant from here to here. I don't have
all the significances in between there.
And you see that for most of these measures, and you see it at a level that's,
like, if you recall, all of the tutors agreeing with each other was like 0.148.
The mean Kendall's Tau for the best system was 0.191. So it's roughly kind of
like how a tutor does. And we can look at the distribution.
This was for something where I actually did some work with some other
classifiers and different tuning features, but we see very similar curves. We
see that the mean of the distribution moves right, and that we have fewer of
the things that we're getting absolutely wrong with these Kendall Taus,
moving the distribution over.
And if we look at mean reciprocal rank, how often are we getting like the
number one item, we can see that we're getting more of the number ones right
when we throw in our bag of features versus this baseline system. Yeah, was
there a question? Oh, no. And that we're getting more of the ones that we're
totally wrong on, we're decreasing that number and pushing it over.
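Mean reciprocal rank here is just the average inverse position of the tutors'
top-rated question in the system's ranking; a small sketch:

```python
def mean_reciprocal_rank(ranked_lists, gold_best):
    # ranked_lists[i]: the system's ordering of question ids for context i;
    # gold_best[i]: the question the tutors actually rated highest.
    return sum(1.0 / (ranking.index(best) + 1)
               for ranking, best in zip(ranked_lists, gold_best)) / len(gold_best)

print(mean_reciprocal_rank([[2, 0, 1], [1, 2, 0]], [0, 1]))  # (1/2 + 1/1) / 2
```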
But you're like, it doesn't make a lot of sense to just train a general
model, and I think what we might be dealing with is bimodal distributions.
You might have some tutors that think question one is great, and others think
question five is great. And so when you average it out, you might just get a
mediocre rating overall. And like I said before, different tutors, even within
the same educational setting, might have very different pedagogical beliefs.
And so it might be more interesting to train individual models, and then we
can start to see what these tutors actually key in on. And maybe you can
imagine the next step would be to create a more personalized environment,
where a student needs this kind of tutor and they would get that kind of
behavior.
So I trained the individual models, and we also took the best general model and
added the features. And what we start to see here is the best performing model,
in bold, tends to be when we add more of these DISCUSS features. So it shows
me that DISCUSS, for the most part, except for maybe rater C, is useful. But
even she got a bump when you added the coarse-level dialogue acts. Different
tutors key in on different levels of complexity when they're doing this ranking
task and when they're trying to evaluate the quality of questions.
So as we add more, we see, oh, yeah, this tutor really keys in on all these
things, whereas this tutor maybe doesn't need all of those features for the
model.
>>: Those are all Kendall's Tau numbers?
>> Lee Becker: These are Kendall's Tau numbers again, yes.
>>: That compares, if I remember correctly, seems like that compares favorably
with the best --
>> Lee Becker: So yeah, if you look at the agreement, these are like when we
had that table of agreements; it looks pretty similar in that sense.
>>:
[inaudible].
>> Lee Becker: Yeah. And so, taking these results and getting a little more
qualitative about what they mean and what it means to train a model: in our
system, we have a lead tutor who manages the other tutors, and I asked her,
based on your experience working with this tutor, observing them out in the
classrooms, actually conducting tutoring sessions and looking at their
transcripts, what do you think their style is? How would you give me a
one-line summary?
So she said, well, rater A focuses more on the student than the lesson. Rater
B focuses more on the lesson objectives. C tries to get the student to relate
to what they see or do. So visuals. And rater D likes to add more to the
lesson than was done in class. So she does something very different.
And if I just take say the top 20 weighted features and I just look at it as
the first level of, like, what's going on with the features, we see that this
kind of corresponds. Like rater A is, like, focuses more on the student than
the lesson. And so to really focus on the student, you need to have an
understanding of what the student's actually saying and so you really need
these dialogue acts.
Rater B, whose description said she focuses more on the learning goals: you
see these baseline features where the lexical similarity played a big role.
She really keyed in on that, and maybe to a lesser degree, but it doesn't say
anything about the magnitude here.
But to a lesser degree, the actual acts and the types of questions were less
important versus how closely they aligned with the learning goals. Rater C,
who tries to get the student to relate what they do, if you really want to see
how a student relates, you're going to have to know they're talking about a
visual or how they're talking about a visual. So you see that distribution.
Rater D, she's kind of out there. I mean, looking at her dialogue, sometimes
they'll talk about things that are just not on task. And so you can see that
maybe DISCUSS doesn't model as much there, and we get more with the baseline
features in her case, because -- yeah?
>>: You're using logistic regression on differences between the pairs of
possible questions, right?
>> Lee Becker: Um-hmm.
>>: So some of those numbers are very small, where you might have three
[indiscernible], and other ones are very big, like the difference in, like, the
questions.
>> Lee Becker: So I didn't normalize them or anything.
>>: So your top 20?
>> Lee Becker: They're going to be all over the place. Most of them range from
binary to a percentage, but then, yeah, this is kind of like maybe just a
first-level pass at what we're getting at. I think what's more interesting is
if we start to look at the weights individually and the story they tell.
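To spell out that pairwise setup, here is a minimal sketch under invented
feature vectors and preference pairs: each tutor preference becomes a
feature-difference example for logistic regression, and the learned weights
then score new candidates.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical features per candidate question in one context:
    # [lexical similarity, has-ask act, mentions-visual].
    questions = np.array([
        [0.8, 1.0, 0.0],  # q0
        [0.3, 0.0, 1.0],  # q1
        [0.5, 1.0, 1.0],  # q2
    ])
    # Tutor preferences as (preferred, dispreferred) index pairs.
    prefs = [(0, 1), (2, 1)]

    # Each preference yields two rows: the feature difference labeled 1
    # and its negation labeled 0, keeping the classes balanced.
    X = np.vstack([questions[a] - questions[b] for a, b in prefs] +
                  [questions[b] - questions[a] for a, b in prefs])
    y = np.array([1] * len(prefs) + [0] * len(prefs))

    model = LogisticRegression().fit(X, y)

    # Rank candidates by the linear score w . x (higher = preferred).
    scores = questions @ model.coef_.ravel()
    print(np.argsort(-scores))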
So rater A focuses more on the student than the lesson. So what you said about
the actual weights is still valid, but you see that she gets a negative weight
on the assertion dialogue act. So if she's focused more on the student, and the
question is giving too much information, she tends to have this negative
reaction. It's like, don't give it away. Let the student do it.
Rater B focuses on the lesson objectives, so a larger weight on the semantic
overlap, like I said. Rater C tries to get the student to relate to what they
do. So we saw this predicate type; we saw more weight towards, like,
observation and function or process, versus different dialogue acts. And so you
can see she really wanted to get them to talk about what they saw, versus what
are the concepts that are driving this.
Rater D likes to add more to the lesson than was done in class. Unlike any of
the other raters, she really had a high weight for meta statements in the
questions, the ones that are like, oh, yeah, this is interesting, what's going
on here. And then also, because maybe she was trying to do more, the actual
contextual probability, where we had the DISCUSS tuple conditioned on another
tuple, carried more weight in her model than anyone else's.
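As a rough sketch of that contextual probability feature, using hypothetical
DISCUSS tuples: estimate how likely each tuple is given the previous one from
bigram counts over a dialogue.

    from collections import Counter

    # Invented sequence of DISCUSS tuples (dialogue act, predicate type).
    turns = [("ask", "observation"), ("assert", "function"),
             ("ask", "cause"), ("assert", "function")]

    # Estimate P(tuple_t | tuple_{t-1}) from bigram / unigram counts.
    bigrams = Counter(zip(turns, turns[1:]))
    prev_counts = Counter(turns[:-1])
    for (prev, cur), n in bigrams.items():
        print(prev, "->", cur, n / prev_counts[prev])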
So while I didn't normalize the features, we can still do a cosine similarity
between the model for one tutor and the model for another tutor, and we start
to see, okay, this is how they agree. While the numbers aren't going to be the
same as when we actually do Kendall's Tau, we can see that, oh, rater A agreed
with rater B the most, and that happens both with the weight features from
their models as well as with the actual agreement there. And so it gives me
some confidence that the model we're learning correlates with what the tutors
are actually keying in on when they make these decisions.
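A minimal sketch of that model-to-model comparison, with invented weight
vectors:

    import numpy as np

    def cosine_similarity(w1, w2):
        # Scale-invariant, so unnormalized feature weights still compare.
        return np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

    # Hypothetical learned weights for two raters' models.
    rater_a = np.array([0.9, -0.4, 0.1])
    rater_b = np.array([0.7, -0.2, 0.3])
    print(f"{cosine_similarity(rater_a, rater_b):.2f}")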
So just to get into a little more error analysis, I looked at the cases where
my system wasn't doing well and wasn't agreeing with the tutors. And I was
lucky in that, when I was collecting their ratings, I had a dialog box that
said, please give me any feedback you might want to give. In some cases, they
put N/A, because they just wanted to click like a Mechanical Turker. But in
other cases, they were kind enough to give me things like, oh, I would never
use these words in this situation. So it led me to identify three categories of
errors.
You have question authoring errors, where the ratings come from, like I said,
the tutor not liking the syntax, or maybe the construction was grammatically
weird, or the vocabulary was inappropriate for a third grader. So the question
got a negative rating, and it was something that my model couldn't account for
on its own. There are also instances where the DISCUSS representations the
linguistic annotators gave to the questions were wrong. And so if some
questions look very similar in the DISCUSS space and one is very different, it
might get rated much lower or much higher, whereas they might actually be much
closer. And then there are just other cases where my model didn't account for
different features.
So back to the closing thoughts. We had these driving questions behind how we
go about doing this question ranking task. Can we use the preference data? And
I think, yeah, we can use it after the fact. What features are needed to model
tutors' behaviors? I'd say, well, maybe DISCUSS isn't the only thing, but it
certainly drives at what is the action and the function, what's going on. And
so I think to do this, you need something like DISCUSS, or some deeper level of
dialogue act annotation, to actually do this task. And what can we learn about
tutoring through these models? I think we learned that the preferences aren't
uniform, that tutors really do key in on different things.
And going the next step, if we started to look at learning gains paired with
this, we could start to see what tutoring styles may be more effective than
others.
So the contributions are, well, first, I developed this DISCUSS taxonomy and
representation, and I think it's a useful representation for tutorial dialogue
analysis and for this question ranking task, and I showed that it was useful
for, like I said, actual human decision making. And while I'm maybe not
introducing new methodology in machine learning terms, for intelligent tutoring
and question selection, I've introduced this methodology for ranking questions.
And I think I've defined a set of features that really drive at what is going
on in this question process. Yes?
>>: You said at the beginning of the talk, you were getting at the fact that in
addition to being an educational tool, these kinds of systems also present the
opportunity to run larger scale experiments. So I'm just wondering, what's an
example of an experiment that you've run that gets at what's important in
learning, especially with regard to the kinds of models of how people work in
dialogue-based systems?
>> Lee Becker: I think given enough data and enough pre- and post-test learning
gains, I can start to extract such features and look at, well, what is the
sequencing? Is there some kind of scaffolding? And even looking at similarity
between sequences, maybe.
>>: Does the current [indiscernible] you described, is that based on -- do some
of these tutoring systems have simple cognitive models of, say, memory, control
of the space and so on -- is there something like that behind --
>> Lee Becker: No, we don't have any cognitive modeling. We're basically just
trying to follow the Questioning the Author pedagogy. And a question that's
still open for us is, even within this Questioning the Author pedagogy, can we
maybe try to find, like, a more direct strategy versus a more open-ended
strategy? And I think given enough data and given these labels, we can see,
like, do you ever move away from asking these open-ended questions about
observing, or do you really need to, with struggling students, ask a direct
question to get at that. So I think that's kind of where I'm driving with
those. I'm starting to look at it; we have a collection -- it's not a huge set
of data, but we have learning gains for the standalone system, and I can look
at the DISCUSS labels and see what I get there.
Oh, and just the final thing I wanted to add as far as a contribution: I think,
short of having to create and run this tutor on millions of students and
permute little things to learn an optimal function, we can start to create
behaviors from third-party rankings. So we can collect dialogues, get people to
mark them up, and learn a behavior. And then you could imagine bringing that
back into the tutor and saying, oh, we're going to have the rater A tutor and
see what learning gains we get with them, versus the rater D tutor and see,
like, oh, those are negative and that's not so good.
So just one final closing thought. Where do I take this to the logical next
step, or what am I really excited about working on? To me, I think with NLP and
machine learning, we have a great opportunity to really make sense of all of
this information out in the world, whether it's chapters in a textbook or
Wikipedia. And I'm really interested in, maybe, can we induce these taxonomies
and get at this conceptual learning and concept maps -- and there's already
work on that. Can we then take it and create interactive processes? So use that
and do automatic question generation, so we can ask students questions and hold
them accountable. I'm calling it something like a mode of more active reading,
where instead of just reading the text, you might be able to ask them
questions.
But I think there's also an opportunity to maybe create more generalizable
models of dialogue, if you want to talk about these concepts. You can start to
discover what kinds of actions correspond with -- or what kinds of dialogue
interactions correspond with what kinds of ontologies or underlying knowledge
driving it.
And then I think if you construct the system correctly, you can start to get
automated assessment. You can see who gets what right, what questions are
actually useful. And I think a big open area is that we have all these tools
and we can use them to extract things, but can we expose these models that we
spend so much time creating to the user, in some sort of nicer HCI, where they
can maybe explore the concepts and explore things on their own in a different
way. And so with that, I want to thank my advisors, my colleagues, and Sumit
and Lucy for hosting me, and close it out and open it up for any questions you
may have. Thank you.
>>: So what's an important factor for the tutoring system -- is it just fun for
the students? I haven't heard anything about that.
>> Lee Becker: So we don't do anything wildly fun. But I think they like the
animations. They have these -- so I didn't say that these animations are often
interactive. And so it gives them the opportunity -- I know when I was a kid
and I'd go to class and have to do a lab, and you didn't know why you were
hooking things up and what they were doing. And often the equipment was broken
and the light bulb was burned out. So this gives them another opportunity to go
back and try it and actually see the experiments work how they're supposed to.
And so I think, for a student, just that added interaction is fun.
Also, just looking at some of the logs, I think they're sometimes just
impressed by the text-to-speech. Wow, that's so cool. I know it's kind of old
nowadays, but when we were doing Wizard of Oz, the kids would be like, can you
make her say this. So there might be more things like that.
>>: Do you think [indiscernible] easily extractable or extractable at all?
>> Lee Becker: So I have done some initial experiments, and I'm continuing to
refine that. I built some classifiers that do that. I don't think you can
extract the tuple as a whole, but for, kind of, a binary decision, I think it's
going to get closer to the Kappa values. About a year ago, when I had different
categories, it was maybe like five to ten percent worse, depending on what it
was. And I think part of the issue is both the training data and the lexical
features not carrying as much weight with how much data we have. But I think
aspects of it can be automated.
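For reference, a sketch of how classifier output could be scored against gold
labels with Cohen's Kappa via scikit-learn; the labels here are invented, not
the actual DISCUSS tag set.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical gold dialogue-act labels vs. classifier predictions.
    gold = ["ask", "assert", "ask", "direct", "assert", "ask"]
    pred = ["ask", "assert", "direct", "direct", "assert", "ask"]
    print(f"kappa = {cohen_kappa_score(gold, pred):.2f}")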
I think the other useful aspect is that if you're thinking of a dialogue
system, you can start with the representation, and then you would need to get
it for the student. But you don't necessarily need to automatically label the
tutor's turn in that case.
>>: Because you'd be drawing from --
>> Lee Becker: Yeah, because you might be drawing from a pool of behavior or
preselected pool of prompts. And then I think really, the utility at least for
this application is more tied to the question representation of the students.
I think getting towards more sophisticated natural language understanding, you
might want to then say, okay, they seem to say something similar. But are they
talking about it in the right way? Are they getting at like the battery is the
source of electricity, versus the battery, electricity, and whatnot.
>>: So how did you [indiscernible]?
>> Lee Becker: Yes.
>>: So, the DISCUSS ontology. You said that in some of the parts, especially in
the predicate section, there are some things that are easily confusable on the
part of the human annotators. Did you find that those differentiations actually
provided value when you included those features? Or would collapsing them into
a single concept be just as useful?
>> Lee Becker: I think collapsing some of those ones that were confused, if I
look at a confusion matrix, might actually help in that case. Previously, when
I had really wild disagreements, I used those disagreements to collapse the
labels into the set that I have now. Like, I found my annotators for some
reason couldn't seem to differentiate cause, effect, and relation. So it's
like, all right, causal relation. And so I think if it was annotated correctly,
we would get more bang out of the discrimination -- like saying, oh, at this
point we really want to ask about cause, and at this point we really want to
ask about effect. But the annotation is what it is.
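A rough sketch of that collapsing step, with invented annotator labels: tally
how often pairs of labels get confused across annotators, then merge the
most-confused pair.

    from collections import Counter

    # Hypothetical paired predicate-type labels from two annotators.
    annotator_1 = ["cause", "effect", "cause", "relation", "effect"]
    annotator_2 = ["effect", "cause", "cause", "relation", "cause"]

    # Count each unordered pair of disagreeing labels.
    confusions = Counter()
    for a, b in zip(annotator_1, annotator_2):
        if a != b:
            confusions[frozenset((a, b))] += 1

    # Merge the most-confused pair into one collapsed label.
    (pair, count), = confusions.most_common(1)
    print("collapse", sorted(pair), "->", "/".join(sorted(pair)))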
>>: I guess, sorry.
>> Lee Becker: Go ahead.
>>: If the tutors themselves can't even really tell, I'm wondering how
important that distinction is.
>> Lee Becker: Oh, you mean if -- so the tutors aren't exposed to the DISCUSS
representation, but the linguistic annotators are. Is that what you're getting
at?
>>: Well, yeah.
>> Lee Becker: Yeah.
>>: I don't know what this says about that, but [indiscernible].
>> Lee Becker: It means I've left some very obvious questions.
>>: You started to say something in your penultimate slide about kind of
generalizing dialogue moves, maybe beyond [indiscernible] -- can you say more
about that?
>> Lee Becker: So to me, dialogue isn't necessarily just what's spoken; it's
the action you take. And so I think we can generalize not just to what
questions we present, but to what material we present at a given point in time.
So it might be, you know, we want to go from this concept and traverse over to
here and here in this order. So it's a combination of how do I package up
information, but also what information do I give and how do I give it. Maybe
it's more important to give a visual at this point in time than to give the
speech.
That's one way I'm thinking of it. The other way is I think there are probably
certain strategies with certain classes of concepts that you might want to
traverse down in a specific way, like maybe you need to, with this concept,
really start bottom-up and go with very detailed questions and generalize,
whereas others you might start with open-ended questions and drive down.
Any other questions?
All right.
Thanks.